Quick Setup

This directory contains scripts to quickly set up and test the data pipeline resilience examples from Chapter 2.

🚀 Quick Start

The fastest way to try the example with zero local dependencies:

# Download and run the setup script
curl -fsSL https://raw.githubusercontent.com/tasker-systems/tasker/main/spec/blog/post_02_data_pipeline_resilience/setup-scripts/blog-setup.sh | bash

# Or with custom app name
curl -fsSL https://raw.githubusercontent.com/tasker-systems/tasker/main/spec/blog/post_02_data_pipeline_resilience/setup-scripts/blog-setup.sh | bash -s -- --app-name my-pipeline-demo

Requirements: Docker and Docker Compose only

Local Setup

If you prefer to run the setup script locally:

# Download the script
curl -fsSL https://raw.githubusercontent.com/tasker-systems/tasker/main/spec/blog/post_02_data_pipeline_resilience/setup-scripts/blog-setup.sh -o blog-setup.sh
chmod +x blog-setup.sh

# Run with options
./blog-setup.sh --app-name pipeline-demo --output-dir ./demos

🛠️ How It Works

Docker-Based Architecture

The setup script creates a complete Docker environment with:

  • Rails application with live code reloading

  • PostgreSQL 15 database with sample data

  • Redis 7 for background job processing

  • Sidekiq for workflow execution

  • All tested code examples from the GitHub repository

Integration with Tasker Repository

All code examples are downloaded directly from the tested tasker-systems/tasker repository (spec/blog/post_02_data_pipeline_resilience).

This ensures the examples are always up-to-date and have passed integration tests.

📋 What Gets Created

Application Structure

API Endpoints

  • POST /analytics/start - Start the analytics pipeline

  • GET /analytics/status/:task_id - Monitor pipeline progress

  • GET /analytics/results/:task_id - Get generated insights

🧪 Testing the Pipeline Resilience

Start the Application
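A minimal sketch of bringing the stack up, assuming the setup script generated a docker-compose.yml inside the app directory (the directory name depends on --app-name):

# From the generated application directory
cd pipeline-demo

# Start all services in the background
docker-compose up -d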

Wait for all services to be ready (for example, PostgreSQL logs "database system is ready to accept connections").

Start Analytics Pipeline
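A hedged example of kicking off the pipeline via the POST /analytics/start endpoint listed above; the host and port are assumptions about the generated app:

# Start the analytics pipeline (port 3000 assumed)
curl -X POST http://localhost:3000/analytics/start \
  -H "Content-Type: application/json"

# The response should include a task_id to use with the status and results endpoints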

Monitor Pipeline Progress
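To check on a running pipeline, poll the status endpoint (replace TASK_ID with the id returned by the start call; host and port are assumptions):

curl http://localhost:3000/analytics/status/TASK_ID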

Get Pipeline Results
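Once the pipeline completes, the results endpoint returns the generated insights (again, host and port are assumptions):

curl http://localhost:3000/analytics/results/TASK_ID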

Test with Different Date Ranges
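A sketch of starting the pipeline for a custom date range; the start_date and end_date parameter names are illustrative and may differ in the generated handlers:

curl -X POST http://localhost:3000/analytics/start \
  -H "Content-Type: application/json" \
  -d '{"start_date": "2024-01-01", "end_date": "2024-03-31"}'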

🔧 Key Features Demonstrated

Parallel Processing

The pipeline demonstrates parallel data extraction:

  • Orders, users, and products are extracted simultaneously

  • Transformations wait for their dependencies to complete

  • Better resource utilization than extracting each source sequentially, with no single-source bottleneck

Progress Tracking

Real-time visibility into long-running operations:

  • Batch processing with progress updates

  • Estimated completion times

  • Current operation status

Intelligent Retry Logic

Different retry strategies for different failure types:

  • Database timeouts: 3 retries with exponential backoff

  • CRM API failures: 5 retries (external services can be flaky)

  • Dashboard updates: 3 retries (eventual consistency)

Data Quality Assurance

Built-in data validation and quality checks:

  • Schema validation for extracted data

  • Completeness checks for critical fields

  • Anomaly detection for unusual patterns

Business Intelligence

The pipeline generates actionable insights:

  • Customer segmentation and churn risk analysis

  • Product performance and inventory optimization

  • Revenue analysis and profit margin tracking

  • Automated business recommendations

🔍 Monitoring and Observability

Docker Logs
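Standard docker-compose commands for tailing logs; the web and sidekiq service names match the components described above but depend on the generated docker-compose.yml:

# Tail all services
docker-compose logs -f

# Tail just the Rails app and the Sidekiq workers
docker-compose logs -f web sidekiq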

Pipeline Monitoring
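One way to watch a pipeline in flight is to poll the status endpoint on an interval (TASK_ID and the port are placeholders):

# Refresh the pipeline status every 5 seconds
watch -n 5 'curl -s http://localhost:3000/analytics/status/TASK_ID'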

Progress Tracking

Each step provides detailed progress information:

  • Records processed vs. total records

  • Current batch being processed

  • Estimated time remaining

  • Data quality metrics

🛠️ Customization

Adding New Data Sources

  1. Create a new extraction step handler

  2. Add it to the YAML configuration

  3. Update transformation steps to use the new data

Example:

Modifying Business Logic

Update the insight generation in generate_insights_handler.rb:

Adjusting Retry Policies

Update the YAML configuration:

🔧 Troubleshooting

Common Issues

Docker services won't start:

  • Ensure Docker is running: docker --version

  • Check for port conflicts: docker-compose ps

  • Free up resources: docker system prune

Pipeline doesn't start:

  • Ensure all services are healthy: docker-compose ps

  • Check Sidekiq is running: docker-compose logs sidekiq

  • Verify database is ready: docker-compose exec web rails db:migrate:status

Steps fail with data errors:

  • Check sample data exists: docker-compose exec web rails console

  • Verify data quality: Check for null values or invalid formats

  • Review step logs: docker-compose logs -f sidekiq

No progress updates:

  • Ensure Redis is running: docker-compose exec redis redis-cli ping

  • Check that the step handler implementations include progress tracking

  • Verify event subscribers are loaded

Getting Help

  1. Check service status: docker-compose ps

  2. View logs: docker-compose logs -f

  3. Restart services: docker-compose restart

  4. Clean restart: docker-compose down && docker-compose up

📖 Learn More

🛑 Cleanup

When you're done experimenting:
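A typical teardown, assuming docker-compose.yml lives in the generated app directory; the -v flag also removes the PostgreSQL and Redis volumes:

# Stop and remove containers, networks, and volumes
docker-compose down -v

# Optionally remove the generated app directory (name depends on --app-name)
rm -rf pipeline-demo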

💡 Next Steps

Once you have the pipeline running:

  1. Experiment with failure scenarios - Stop a dependent service (Redis or PostgreSQL) mid-processing and watch the retries recover

  2. Customize the business logic - Modify customer segmentation rules

  3. Add new data sources - Extend with additional extractions

  4. Implement real integrations - Replace mock APIs with real services

  5. Scale the processing - Test with larger datasets

The patterns demonstrated here scale from simple ETL jobs to enterprise data platforms handling millions of records.
