Backpressure Architecture

Last Updated: 2026-02-05
Audience: Architects, Developers, Operations
Status: Active
Related Docs: Worker Event Systems | MPSC Channel Guidelines

← Back to Documentation Hub


This document provides the unified backpressure strategy for tasker-core, covering all system components from API ingestion through worker execution.

Core Principle

Step idempotency is the primary constraint. Any backpressure mechanism must ensure that step claiming, business logic execution, and result persistence remain stable and consistent. The system must gracefully degrade under load without compromising workflow correctness.

System Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         BACKPRESSURE FLOW OVERVIEW                           │
└─────────────────────────────────────────────────────────────────────────────┘

                            ┌─────────────────┐
                            │  External Client │
                            └────────┬────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [1] API LAYER BACKPRESSURE      │
                    │  • Circuit breaker (503)         │
                    │  • System overload (503)         │
                    │  • Request validation            │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [2] ORCHESTRATION BACKPRESSURE  │
                    │  • Command channel (bounded)     │
                    │  • Connection pool limits        │
                    │  • PGMQ depth checks             │
                    └────────────────┬────────────────┘
                                     │
                         ┌───────────┴───────────┐
                         │     PGMQ Queues       │
                         │  • Namespace queues   │
                         │  • Result queues      │
                         └───────────┬───────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [3] WORKER BACKPRESSURE         │
                    │  • Claim capacity check          │
                    │  • Semaphore-bounded handlers    │
                    │  • Completion channel bounds     │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [4] RESULT FLOW BACKPRESSURE    │
                    │  • Completion channel bounds     │
                    │  • Domain event drop semantics   │
                    └─────────────────────────────────┘

Backpressure Points by Component

1. API Layer

The API layer provides backpressure through 503 responses with calculated Retry-After headers.

Rate Limiting (429): Rate limiting is intentionally out of scope for tasker-core. This responsibility belongs to upstream infrastructure (API Gateways, NLB/ALB, service mesh). Tasker focuses on system health-based backpressure via 503 responses.

| Mechanism | Status | Behavior |
|---|---|---|
| Circuit Breaker | Implemented | Return 503 when database breaker open |
| System Overload | Planned | Return 503 when queue/channel saturation detected |
| Request Validation | Implemented | Return 400 for invalid requests |

Response Codes:

  • 200 OK - Request accepted
  • 400 Bad Request - Invalid request format
  • 503 Service Unavailable - System overloaded (includes Retry-After header)

503 Response Triggers:

  1. Circuit Breaker Open: Database operations failing repeatedly
  2. Queue Depth High (Planned): PGMQ namespace queues approaching capacity
  3. Channel Saturation (Planned): Command channel buffer > 80% full

Retry-After Header Strategy:

503 Service Unavailable
Retry-After: {calculated_delay_seconds}

Calculation:
- Circuit breaker open: Use breaker timeout (default 30s)
- Queue depth high: Estimate based on processing rate
- Channel saturation: Short delay (5-10s) for buffer drain
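
As a sketch of how these calculations could translate into code (the OverloadReason enum and retry_after_seconds helper are illustrative names, not part of the current API):

use std::time::Duration;

// Hypothetical classification of what triggered the 503.
enum OverloadReason {
    CircuitBreakerOpen { breaker_timeout: Duration },
    QueueDepthHigh { backlog: u64, drain_rate_per_sec: u64 },
    ChannelSaturation,
}

// Map the trigger to a Retry-After value in seconds.
fn retry_after_seconds(reason: OverloadReason) -> u64 {
    match reason {
        // Circuit breaker open: mirror the breaker's open-state timeout (default 30s).
        OverloadReason::CircuitBreakerOpen { breaker_timeout } => breaker_timeout.as_secs(),
        // Queue depth high: rough estimate of time to drain the backlog.
        OverloadReason::QueueDepthHigh { backlog, drain_rate_per_sec } => {
            (backlog / drain_rate_per_sec.max(1)).clamp(5, 300)
        }
        // Channel saturation: short delay while the buffer drains.
        OverloadReason::ChannelSaturation => 10,
    }
}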

Configuration:

# config/tasker/base/common.toml
[common.circuit_breakers.component_configs.web]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)

2. Orchestration Layer

The orchestration layer protects internal processing from command flooding.

| Mechanism | Status | Behavior |
|---|---|---|
| Command Channel | Implemented | Bounded MPSC with monitoring |
| Connection Pool | Implemented | Database connection limits |
| PGMQ Depth Check | Planned | Reject enqueue when queue too deep |

Command Channel Backpressure:

Command Sender → [Bounded Channel] → Command Processor
                      │
                      └── If full: Block with timeout → Reject
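
A minimal sketch of this block-with-timeout-then-reject behavior on a bounded tokio MPSC channel (the Command placeholder, error names, and 250ms timeout are illustrative, not the actual tasker-core types or values):

use std::time::Duration;
use tokio::sync::mpsc::{self, error::SendTimeoutError};

struct Command; // placeholder for the real command type

#[derive(Debug)]
enum EnqueueError {
    Rejected,      // buffer stayed full past the timeout; caller should back off (e.g. 503)
    ProcessorGone, // command processor dropped its receiver
}

async fn enqueue_command(tx: &mpsc::Sender<Command>, cmd: Command) -> Result<(), EnqueueError> {
    // Block briefly while the processor drains the bounded buffer, then reject
    // so backpressure propagates upward instead of growing an unbounded queue.
    match tx.send_timeout(cmd, Duration::from_millis(250)).await {
        Ok(()) => Ok(()),
        Err(SendTimeoutError::Timeout(_)) => Err(EnqueueError::Rejected),
        Err(SendTimeoutError::Closed(_)) => Err(EnqueueError::ProcessorGone),
    }
}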

Configuration:

# config/tasker/base/orchestration.toml
[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000

[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000

3. Messaging Layer

The messaging layer provides the backbone between orchestration and workers. It is provider-agnostic via MessageClient, supporting PGMQ (default) and RabbitMQ backends.

| Mechanism | Status | Behavior |
|---|---|---|
| Visibility Timeout | Implemented | Messages return to queue after timeout |
| Batch Size Limits | Implemented | Bounded message reads |
| Queue Depth Check | Planned | Reject enqueue when depth exceeded |
| Messaging Circuit Breaker | Implemented | Fast-fail send/receive when provider unhealthy |

Messaging Circuit Breaker: MessageClient wraps send/receive operations with circuit breaker protection. When the messaging provider (PGMQ or RabbitMQ) fails repeatedly, the breaker opens and returns MessagingError::CircuitBreakerOpen immediately, preventing slow timeouts from cascading into orchestration and worker processing loops. Ack/nack and health check operations bypass the breaker — ack/nack failures are safe (visibility timeout handles redelivery), and health check must work when the breaker is open to detect recovery. See Circuit Breakers for details.
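
A compact sketch of this fast-fail shape, assuming a simple state-checking breaker (the CircuitBreaker stub and protected_send helper are illustrative; the real MessageClient internals may differ):

use std::future::Future;
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug)]
enum MessagingError { CircuitBreakerOpen, Provider(String) }

// Minimal stand-in; the real breaker tracks failure/success thresholds and an
// open-state timeout (see the circuit breaker configuration below).
struct CircuitBreaker { open: AtomicBool }

impl CircuitBreaker {
    fn allow_request(&self) -> bool { !self.open.load(Ordering::Relaxed) }
    fn record_success(&self) { self.open.store(false, Ordering::Relaxed) }
    fn record_failure(&self) { /* threshold accounting elided; may flip `open` */ }
}

// Fast-fail wrapper around a provider send: when the breaker is open, return
// immediately instead of waiting on a slow or failing provider call.
async fn protected_send<F, Fut>(breaker: &CircuitBreaker, send: F) -> Result<(), MessagingError>
where
    F: FnOnce() -> Fut,
    Fut: Future<Output = Result<(), MessagingError>>,
{
    if !breaker.allow_request() {
        return Err(MessagingError::CircuitBreakerOpen);
    }
    match send().await {
        Ok(()) => { breaker.record_success(); Ok(()) }
        Err(e) => { breaker.record_failure(); Err(e) }
    }
}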

Queue Depth Monitoring (Planned):

The system will work with PGMQ’s native capabilities rather than enforcing arbitrary limits. Queue depth monitoring provides visibility without hard rejection:

┌──────────────────────────────────────────────────────────────────────┐
│ QUEUE DEPTH STRATEGY                                                  │
├──────────────────────────────────────────────────────────────────────┤
│ Level    │ Depth Ratio │ Action                                       │
├──────────────────────────────────────────────────────────────────────┤
│ Normal   │ < 70%       │ Normal operation                             │
│ Warning  │ 70-85%      │ Log warning, emit metric                     │
│ Critical │ 85-95%      │ API returns 503 for new tasks                │
│ Overflow │ > 95%       │ API rejects all writes, alert operators      │
└──────────────────────────────────────────────────────────────────────┘

Note: Depth ratio = current_depth / configured_soft_limit
Soft limit is advisory, not a hard PGMQ constraint.
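
A small sketch of the ratio-to-level mapping in the table above (the DepthLevel enum and classify_depth function are illustrative names):

#[derive(Debug, PartialEq)]
enum DepthLevel { Normal, Warning, Critical, Overflow }

// soft_limit is the advisory threshold from configuration, not a hard PGMQ cap.
fn classify_depth(current_depth: u64, soft_limit: u64) -> DepthLevel {
    let ratio = current_depth as f64 / soft_limit.max(1) as f64;
    if ratio < 0.70 {
        DepthLevel::Normal
    } else if ratio < 0.85 {
        DepthLevel::Warning   // log warning, emit metric
    } else if ratio < 0.95 {
        DepthLevel::Critical  // API returns 503 for new tasks
    } else {
        DepthLevel::Overflow  // reject writes, alert operators
    }
}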

Portability Considerations:

  • Queue depth semantics vary by backend (PGMQ vs RabbitMQ vs SQS)
  • Configuration is backend-agnostic where possible
  • Backend-specific tuning goes in backend-specific config sections

Configuration:

# config/tasker/base/common.toml
[common.queues]
default_visibility_timeout_seconds = 30

[common.queues.pgmq]
poll_interval_ms = 250

[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000

# Messaging circuit breaker
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)

4. Worker Layer

The worker layer protects handler execution from saturation.

| Mechanism | Status | Behavior |
|---|---|---|
| Semaphore-Bounded Dispatch | Implemented | Max concurrent handlers |
| Claim Capacity Check | Planned | Refuse claims when at capacity |
| Handler Timeout | Implemented | Kill stuck handlers |
| Completion Channel | Implemented | Bounded result buffer |

Handler Dispatch Flow:

Step Message
     │
     ▼
┌─────────────────┐
│ Capacity Check  │──── At capacity? ──── Leave in queue
│ (Planned)       │                       (visibility timeout)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Acquire Permit  │
│ (Semaphore)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Execute Handler │
│ (with timeout)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Release Permit  │──── BEFORE sending to completion channel
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Send Completion │
└─────────────────┘

Configuration:

# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000

5. Domain Events

Domain events use fire-and-forget semantics to avoid blocking the critical path.

| Mechanism | Status | Behavior |
|---|---|---|
| Try-Send | Implemented | Non-blocking send |
| Drop on Full | Implemented | Events dropped if channel full |
| Metrics | Planned | Track dropped events |

Domain Event Flow:

Handler Complete
     │
     ├── Result → Completion Channel (blocking, must succeed)
     │
     └── Domain Events → try_send() → If full: DROP with metric
                                       │
                                       └── Step execution NOT affected
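
A sketch of the try_send drop semantics (the DomainEvent placeholder and log calls are illustrative; the planned dropped-events metric would be incremented where the warnings are logged):

use tokio::sync::mpsc::{error::TrySendError, Sender};

struct DomainEvent; // placeholder for the real event type

// Fire-and-forget publication: never block or fail the step's critical path.
fn publish_domain_event(tx: &Sender<DomainEvent>, event: DomainEvent) {
    match tx.try_send(event) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => {
            // Channel saturated: drop the event; step execution is unaffected.
            tracing::warn!("domain event dropped: channel full");
        }
        Err(TrySendError::Closed(_)) => {
            tracing::warn!("domain event dropped: channel closed");
        }
    }
}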

Segmentation of Responsibility

Orchestration System

The orchestration system must protect itself from:

  1. Client overload: Too many /v1/tasks requests
  2. Internal saturation: Command channel overflow
  3. Database exhaustion: Connection pool depletion
  4. Queue explosion: Unbounded PGMQ growth

Backpressure Response Hierarchy:

  1. Return 503 to client with Retry-After (fastest, cheapest)
  2. Block at command channel (internal buffering)
  3. Soft-reject at queue depth threshold (503 to new tasks)
  4. Circuit breaker opens (stop accepting work)

Worker System

The worker system must protect itself from:

  1. Handler saturation: Too many concurrent handlers
  2. FFI backlog: Ruby/Python handlers falling behind
  3. Completion overflow: Results backing up
  4. Step starvation: Claims outpacing processing

Backpressure Response Hierarchy:

  1. Refuse step claim (leave in queue, visibility timeout)
  2. Block at dispatch channel (internal buffering)
  3. Block at completion channel (handler waits)
  4. Circuit breaker opens (stop claiming work)

Step Idempotency Guarantees

Safe Backpressure Points

These backpressure points preserve step idempotency:

| Point | Why Safe |
|---|---|
| API 503 rejection | Task not yet created |
| Queue depth soft-limit | Step not yet enqueued |
| Step claim refusal | Message stays in queue, visibility timeout protects |
| Handler dispatch channel full | Step claimed but execution queued |
| Completion channel backpressure | Handler completed, result buffered |

Unsafe Patterns (NEVER DO)

| Pattern | Risk | Mitigation |
|---|---|---|
| Drop step after claiming | Lost work | Always send result (success or failure) |
| Timeout during handler execution | Duplicate execution on retry | Handlers MUST be idempotent |
| Drop completion result | Orchestration unaware of completion | Completion channel blocks, never drops |
| Reset step state without visibility timeout | Race with other workers | Use PGMQ visibility timeout |

Idempotency Contract

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STEP EXECUTION IDEMPOTENCY CONTRACT                       │
└─────────────────────────────────────────────────────────────────────────────┘

1. CLAIM: Atomic via pgmq_read_specific_message()
   ├── Only one worker can claim a message
   ├── Visibility timeout protects against worker crash
   └── If claim fails: Message stays in queue → another worker claims

2. EXECUTE: Handler invocation (FFI boundary critical - see below)
   ├── Handlers SHOULD be idempotent (business logic recommendation)
   ├── Timeout generates FAILURE result (not drop)
   ├── Panic generates FAILURE result (not drop)
   └── Error generates FAILURE result (not drop)

3. PERSIST: Result submission
   ├── Completion channel is bounded but BLOCKING
   ├── Result MUST reach orchestration (never dropped)
   └── If send fails: Step remains "in_progress" → recovered by orchestration

4. FINALIZE: Orchestration processes result
   ├── State transition is atomic
   ├── Duplicate results handled by state guards
   └── Idempotent: Same result processed twice = same outcome
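
The "duplicate results handled by state guards" idea in step 4 can be illustrated with a small sketch (StepState and apply_result are illustrative, not the actual state machine types):

#[derive(Debug, Clone, Copy, PartialEq)]
enum StepState { InProgress, Complete, Failed }

// Applying the same result twice yields the same terminal state, which is what
// makes finalization idempotent: results arriving after the step is already
// terminal are no-ops.
fn apply_result(current: StepState, success: bool) -> StepState {
    match current {
        StepState::InProgress if success => StepState::Complete,
        StepState::InProgress => StepState::Failed,
        terminal => terminal,
    }
}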

FFI Boundary Idempotency Semantics

The FFI boundary (Rust → Ruby/Python handler) creates a critical demarcation for error classification:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    FFI BOUNDARY ERROR CLASSIFICATION                         │
└─────────────────────────────────────────────────────────────────────────────┘

                           FFI BOUNDARY
                                │
    BEFORE FFI CROSSING         │         AFTER FFI CROSSING
    (System Layer)              │         (Business Logic Layer)
                                │
    ┌─────────────────────┐     │     ┌─────────────────────┐
    │ System errors are   │     │     │ System failures     │
    │ RETRYABLE:          │     │     │ are PERMANENT:      │
    │                     │     │     │                     │
    │ • Channel timeout   │     │     │ • Worker crash      │
    │ • Queue unavailable │     │     │ • FFI panic         │
    │ • Claim race lost   │     │     │ • Process killed    │
    │ • Network partition │     │     │                     │
    │ • Message malformed │     │     │ We cannot know if   │
    │                     │     │     │ business logic      │
    │ Step has NOT been   │     │     │ executed or not.    │
    │ handed to handler.  │     │     │                     │
    └─────────────────────┘     │     └─────────────────────┘
                                │
                                │     ┌─────────────────────┐
                                │     │ Developer errors    │
                                │     │ are TRUSTED:        │
                                │     │                     │
                                │     │ • RetryableError →  │
                                │     │   System retries    │
                                │     │                     │
                                │     │ • PermanentError →  │
                                │     │   Step fails        │
                                │     │                     │
                                │     │ Developer knows     │
                                │     │ their domain logic. │
                                │     └─────────────────────┘

Key Principles:

  1. Before FFI: Any system error is safe to retry because no business logic has executed.

  2. After FFI, system failure: If the worker crashes or FFI call fails after dispatch, we MUST treat it as permanent failure. We cannot know if the handler:

    • Never started (safe to retry)
    • Started but didn’t complete (unknown side effects)
    • Completed but didn’t return (work is done)
  3. After FFI, developer error: Trust the developer’s classification:

    • RetryableError: Developer explicitly signals safe to retry (e.g., temporary API unavailable)
    • PermanentError: Developer explicitly signals not retriable (e.g., invalid data, business rule violation)

Implementation Guidance:

// BEFORE FFI - system error, retryable
match dispatch_to_handler(step).await {
    Err(DispatchError::ChannelFull) => StepResult::retryable("dispatch_channel_full"),
    Err(DispatchError::Timeout) => StepResult::retryable("dispatch_timeout"),
    Ok(ffi_handle) => {
        // AFTER FFI - different rules apply
        match ffi_handle.await {
            // System crash after FFI = permanent (unknown state)
            Err(FfiError::ProcessCrash) => StepResult::permanent("handler_crash"),
            Err(FfiError::Panic) => StepResult::permanent("handler_panic"),

            // Developer-returned errors = trust their classification
            Ok(HandlerResult::RetryableError(msg)) => StepResult::retryable(msg),
            Ok(HandlerResult::PermanentError(msg)) => StepResult::permanent(msg),
            Ok(HandlerResult::Success(data)) => StepResult::success(data),
        }
    }
}

Note: We RECOMMEND handlers be idempotent but cannot REQUIRE it—business logic is developer-controlled. The system provides visibility timeout protection and duplicate result handling, but ultimate idempotency responsibility lies with handler implementations.


Backpressure Decision Tree

Use this decision tree when designing new backpressure mechanisms:

                    ┌─────────────────────────┐
                    │ New Backpressure Point  │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │ Does this affect step   │
                    │ execution correctness?  │
                    └────────────┬────────────┘
                                 │
                   ┌─────────────┴─────────────┐
                   │                           │
                  Yes                          No
                   │                           │
                   ▼                           ▼
         ┌─────────────────┐         ┌─────────────────┐
         │ Can the work be │         │ Safe to drop    │
         │ retried safely? │         │ or timeout      │
         └────────┬────────┘         └─────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
       Yes                  No
        │                   │
        ▼                   ▼
  ┌───────────┐      ┌───────────────┐
  │ Use block │      │ MUST NOT DROP │
  │ or reject │      │ Block until   │
  │ (retriable│      │ success       │
  │ error)    │      └───────────────┘
  └───────────┘

Configuration Reference

TOML Structure: Configuration files are organized as config/tasker/base/{common,worker,orchestration}.toml with environment overrides in config/tasker/environments/{test,development,production}/.

Complete Backpressure Configuration

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/common.toml - Shared settings
# ════════════════════════════════════════════════════════════════════════════

# Circuit breaker defaults (inherited by all component breakers)
[common.circuit_breakers.default_config]
failure_threshold = 5      # Failures before opening
timeout_seconds = 30       # Time in open state before half-open
success_threshold = 2      # Successes in half-open to close

# Web/API database circuit breaker
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2

# Messaging circuit breaker - PGMQ/RabbitMQ operations
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2

# Queue configuration
[common.queues]
default_visibility_timeout_seconds = 30

[common.queues.pgmq]
poll_interval_ms = 250

[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/orchestration.toml - Orchestration layer
# ════════════════════════════════════════════════════════════════════════════

[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000

[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000

[orchestration.mpsc_channels.event_channel]
event_channel_buffer_size = 10000

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/worker.toml - Worker layer
# ════════════════════════════════════════════════════════════════════════════

[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000        # Steps waiting for handler
completion_buffer_size = 1000      # Results waiting for orchestration
max_concurrent_handlers = 10       # Semaphore permits
handler_timeout_ms = 30000         # Max handler execution time

[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000        # FFI events waiting for Ruby/Python
completion_timeout_ms = 30000      # Time to wait for FFI completion
starvation_warning_threshold_ms = 10000  # Warn if event waits this long

# Planned:
# claim_capacity_threshold = 0.8   # Refuse claims at 80% capacity

Monitoring and Alerting

See Backpressure Monitoring Runbook for:

  • Key metrics to monitor
  • Alerting thresholds
  • Incident response procedures

Key Metrics Summary

| Metric | Type | Alert Threshold |
|---|---|---|
| api_requests_rejected_total | Counter | > 10/min |
| circuit_breaker_state | Gauge | state = open |
| mpsc_channel_saturation | Gauge | > 80% |
| pgmq_queue_depth | Gauge | > 80% of max |
| worker_claim_refusals_total | Counter | > 10/min |
| handler_semaphore_wait_time_ms | Histogram | p99 > 1000ms |
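
For instance, mpsc_channel_saturation could be derived from a bounded tokio sender roughly like this (the helper name is illustrative):

use tokio::sync::mpsc::Sender;

// Fraction of the bounded buffer currently in use, suitable for a gauge.
fn channel_saturation<T>(tx: &Sender<T>) -> f64 {
    let max = tx.max_capacity() as f64;   // configured buffer size
    let free = tx.capacity() as f64;      // free slots right now
    if max == 0.0 { 0.0 } else { (max - free) / max }
}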


← Back to Documentation Hub