Circuit Breakers

Last Updated: 2026-02-04
Audience: Architects, Operators, Developers
Status: Active
Related Docs: Backpressure Architecture | Observability | Operations: Backpressure Monitoring


Circuit breakers provide fault isolation and cascade prevention across tasker-core. This document covers the circuit breaker architecture, implementations, configuration, and operational monitoring.

Core Concept

Circuit breakers prevent cascading failures by failing fast when a component is unhealthy. Instead of waiting for slow or failing operations to time out, a circuit breaker detects failure patterns and rejects calls immediately, giving the downstream system time to recover.

State Machine

┌─────────────────────────────────────────────────────────────────────────────┐
│                     CIRCUIT BREAKER STATE MACHINE                            │
└─────────────────────────────────────────────────────────────────────────────┘

                    Success
                  ┌─────────┐
                  │         │
                  ▼         │
              ┌───────┐     │
      ───────>│CLOSED │─────┘
              └───┬───┘
                  │
                  │ failure_threshold
                  │ consecutive failures
                  │
                  ▼
              ┌───────┐
              │ OPEN  │◄─────────────────────┐
              └───┬───┘                      │
                  │                          │
                  │ timeout_seconds          │ Any failure
                  │ elapsed                  │ in half-open
                  │                          │
                  ▼                          │
            ┌──────────┐                     │
            │HALF-OPEN │─────────────────────┘
            └────┬─────┘
                 │
                 │ success_threshold
                 │ consecutive successes
                 │
                 ▼
            ┌───────┐
            │CLOSED │
            └───────┘

States:

  • Closed: Normal operation. All calls allowed. Tracks consecutive failures.
  • Open: Failing fast. All calls rejected immediately. Waiting for timeout.
  • Half-Open: Testing recovery. Limited calls allowed. Single failure reopens.
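
The transitions above can be sketched as a small state machine. This is a minimal, single-threaded illustration, not the tasker-core implementation: the names `State`, `SketchBreaker`, and `on_timeout_elapsed` are hypothetical, and the real breaker is described as using lock-free atomics and tracking elapsed time itself.

```rust
/// Hypothetical three-state enum mirroring the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical single-threaded breaker; thresholds mirror the config keys.
#[derive(Debug)]
pub struct SketchBreaker {
    state: State,
    failure_threshold: u32,
    success_threshold: u32,
    consecutive_failures: u32,
    consecutive_successes: u32,
}

impl SketchBreaker {
    pub fn new(failure_threshold: u32, success_threshold: u32) -> Self {
        Self {
            state: State::Closed,
            failure_threshold,
            success_threshold,
            consecutive_failures: 0,
            consecutive_successes: 0,
        }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// Assumed hook: the owner calls this once `timeout_seconds` has elapsed.
    pub fn on_timeout_elapsed(&mut self) {
        if self.state == State::Open {
            self.state = State::HalfOpen;
            self.consecutive_successes = 0;
        }
    }

    pub fn on_success(&mut self) {
        match self.state {
            // A success in closed state resets the failure streak.
            State::Closed => self.consecutive_failures = 0,
            State::HalfOpen => {
                self.consecutive_successes += 1;
                if self.consecutive_successes >= self.success_threshold {
                    self.state = State::Closed;
                    self.consecutive_failures = 0;
                }
            }
            State::Open => {}
        }
    }

    pub fn on_failure(&mut self) {
        match self.state {
            State::Closed => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = State::Open;
                }
            }
            // Any failure while testing recovery reopens the circuit.
            State::HalfOpen => self.state = State::Open,
            State::Open => {}
        }
    }
}
```

With `failure_threshold = 2`, two consecutive failures open the breaker; after the timeout hook fires, a single failure in half-open reopens it, while `success_threshold` consecutive successes close it.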

Unified Trait: CircuitBreakerBehavior

All circuit breaker implementations share a common trait defined in tasker-shared/src/resilience/behavior.rs:

use std::fmt::Debug;
use std::time::Duration;

// `CircuitState` and `CircuitBreakerMetrics` are tasker types; their imports are omitted here.
pub trait CircuitBreakerBehavior: Send + Sync + Debug {
    fn name(&self) -> &str;
    fn state(&self) -> CircuitState;
    fn should_allow(&self) -> bool;
    fn record_success(&self, duration: Duration);
    fn record_failure(&self, duration: Duration);
    fn is_healthy(&self) -> bool;
    fn force_open(&self);
    fn force_closed(&self);
    fn metrics(&self) -> CircuitBreakerMetrics;
}

Each specialized breaker wraps the generic CircuitBreaker (composition pattern) and implements this trait. This means:

  • Consistent state machine behavior across all breakers
  • Proper half-open → closed recovery via success_threshold
  • Lock-free atomic state management
  • Domain-specific methods remain as additional methods on each type
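
As a rough illustration of the composition pattern, the sketch below wraps a generic breaker in a specialized type that delegates the shared behavior and adds one domain-specific method. Everything here is a hypothetical stand-in: `BreakerLike` is a trimmed version of the trait, and `GenericBreaker`, `PollerBreaker`, and `should_skip_cycle` are illustrative names, not tasker-core types. Half-open handling is left out for brevity.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Trimmed, hypothetical stand-in for the shared trait.
pub trait BreakerLike {
    fn should_allow(&self) -> bool;
    fn force_open(&self);
    fn force_closed(&self);
}

const CLOSED: u8 = 0;
const OPEN: u8 = 2;

/// Hypothetical generic breaker holding its state in a lock-free atomic.
pub struct GenericBreaker {
    state: AtomicU8,
}

impl GenericBreaker {
    pub fn new() -> Self {
        Self { state: AtomicU8::new(CLOSED) }
    }
}

impl BreakerLike for GenericBreaker {
    fn should_allow(&self) -> bool {
        self.state.load(Ordering::Acquire) != OPEN
    }
    fn force_open(&self) {
        self.state.store(OPEN, Ordering::Release);
    }
    fn force_closed(&self) {
        self.state.store(CLOSED, Ordering::Release);
    }
}

/// Hypothetical specialized breaker: wraps the generic one (composition).
pub struct PollerBreaker {
    inner: GenericBreaker,
}

impl PollerBreaker {
    pub fn new() -> Self {
        Self { inner: GenericBreaker::new() }
    }

    /// Domain-specific method that lives outside the shared trait.
    pub fn should_skip_cycle(&self) -> bool {
        !self.inner.should_allow()
    }
}

impl BreakerLike for PollerBreaker {
    // Shared behavior is delegated, so every breaker follows the same state rules.
    fn should_allow(&self) -> bool {
        self.inner.should_allow()
    }
    fn force_open(&self) {
        self.inner.force_open();
    }
    fn force_closed(&self) {
        self.inner.force_closed();
    }
}
```

Delegating to one inner type keeps the state rules in a single place, while each wrapper stays free to expose vocabulary that fits its own component.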

Circuit Breaker Implementations

Tasker-core has four circuit breaker implementations, each protecting specific components. All wrap the generic CircuitBreaker from tasker_shared::resilience:

| Circuit Breaker | Location | Purpose | Trigger Type |
|---|---|---|---|
| Web Database | tasker-orchestration | API database operations | Error-based |
| Task Readiness | tasker-orchestration | Fallback poller database checks | Error-based |
| FFI Completion | tasker-worker | Ruby/Python handler completion channel | Latency-based |
| Messaging | tasker-shared | Message queue operations (PGMQ/RabbitMQ) | Error-based |

1. Web Database Circuit Breaker

Purpose: Protects API endpoints from cascading database failures.

Scope: Independent from orchestration system’s internal operations.

Behavior:

  • Opens when database queries fail repeatedly
  • Returns 503 with Retry-After header when open
  • Fast-fail rejection with atomic state management
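
The fast-fail path can be pictured with a framework-free sketch. The names `Rejection` and `guard_request` are hypothetical, and using `timeout_seconds` as the Retry-After value is an assumption; the source does not state which value the header carries.

```rust
/// Hypothetical, framework-free description of a fast-fail response.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Rejection {
    pub status: u16,
    pub retry_after_seconds: u64,
}

impl Rejection {
    /// Header name/value pair a web layer could attach to the response.
    pub fn retry_after_header(&self) -> (String, String) {
        ("Retry-After".to_string(), self.retry_after_seconds.to_string())
    }
}

/// Rejects immediately when the breaker does not allow the call.
/// Assumption: the Retry-After hint reuses the breaker's `timeout_seconds`.
pub fn guard_request(breaker_allows: bool, timeout_seconds: u64) -> Result<(), Rejection> {
    if breaker_allows {
        Ok(())
    } else {
        Err(Rejection {
            status: 503,
            retry_after_seconds: timeout_seconds,
        })
    }
}
```

A handler would call `guard_request` before touching the database and convert the `Err` branch into the HTTP response, so no query is attempted while the breaker is open.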

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.web]
failure_threshold = 5      # Consecutive failures before opening
success_threshold = 2      # Successes in half-open to fully close
# timeout_seconds inherited from default_config (30s)

Health Check Integration:

  • Included in /health/ready endpoint
  • State reported in /health/detailed response
  • Metric: api_circuit_breaker_state (0=closed, 1=half-open, 2=open)

2. Task Readiness Circuit Breaker

Purpose: Protects fallback poller from database overload during polling cycles.

Scope: Independent from web circuit breaker, specific to task readiness queries.

Behavior:

  • Opens when task readiness queries fail repeatedly
  • Skips polling cycles when open (doesn’t fail-fast, just skips)
  • Allows orchestration to continue processing existing work
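
The skip semantics differ from request rejection: no error is surfaced, the cycle is simply not run. A minimal sketch, assuming a hypothetical `run_poll_cycle` helper and `CycleOutcome` enum that are not part of tasker-core:

```rust
/// Hypothetical outcome of one fallback-poller cycle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CycleOutcome {
    Skipped,
    Succeeded,
    Failed,
}

/// Runs the readiness query only when the breaker allows it.
/// A skipped cycle never touches the database; the caller would report
/// `Succeeded` or `Failed` back to the breaker.
pub fn run_poll_cycle<E>(
    breaker_allows: bool,
    poll: impl FnOnce() -> Result<(), E>,
) -> CycleOutcome {
    if !breaker_allows {
        return CycleOutcome::Skipped;
    }
    match poll() {
        Ok(()) => CycleOutcome::Succeeded,
        Err(_) => CycleOutcome::Failed,
    }
}
```

Because a skipped cycle returns normally, the surrounding loop keeps running and other orchestration work is unaffected.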

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10     # Higher threshold for polling
timeout_seconds = 60       # Longer recovery window
success_threshold = 3      # More successes needed for confidence

Why Separate from Web?

  • Different failure patterns (polling vs request-driven)
  • Different recovery semantics (skip vs reject)
  • Isolation prevents web failures from stopping polling (and vice versa)

3. FFI Completion Circuit Breaker

Purpose: Protects Ruby/Python worker completion channels from backpressure.

Scope: Worker-specific, protects FFI boundary.

Behavior:

  • Latency-based: Treats slow sends (>100ms) as failures
  • Opens when completion channel is consistently slow
  • Prevents FFI threads from blocking on saturated channels
  • Drops completions when open (with metrics), allowing handler threads to continue
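
Latency-based detection can be sketched as classifying each send by its elapsed time. `SendOutcome`, `classify_send`, and `would_open` are hypothetical names, and counting consecutive slow sends is an assumption carried over from the state machine's "consecutive failures" rule.

```rust
use std::time::Duration;

/// Hypothetical classification of one completion-channel send.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SendOutcome {
    Fast,
    Slow,
}

/// A send counts as a failure only when it exceeds the threshold (strictly greater).
pub fn classify_send(elapsed: Duration, slow_send_threshold: Duration) -> SendOutcome {
    if elapsed > slow_send_threshold {
        SendOutcome::Slow
    } else {
        SendOutcome::Fast
    }
}

/// Returns true if the samples contain `failure_threshold` consecutive slow sends.
/// Assumption: slow sends must be consecutive to open the breaker.
pub fn would_open(
    samples: &[Duration],
    slow_send_threshold: Duration,
    failure_threshold: u32,
) -> bool {
    let mut consecutive_slow = 0u32;
    for &elapsed in samples {
        match classify_send(elapsed, slow_send_threshold) {
            SendOutcome::Slow => {
                consecutive_slow += 1;
                if consecutive_slow >= failure_threshold {
                    return true;
                }
            }
            // A fast send resets the streak.
            SendOutcome::Fast => consecutive_slow = 0,
        }
    }
    false
}
```

With a 100ms threshold, a send that takes exactly 100ms is still fast; five consecutive 250ms sends would open a breaker configured with `failure_threshold = 5`.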

Configuration (config/tasker/base/worker.toml):

[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5            # Slow sends before opening
recovery_timeout_seconds = 5     # Short recovery window
success_threshold = 2            # Successes to close
slow_send_threshold_ms = 100     # Latency threshold (100ms)

Why Latency-Based?

  • Slow channel sends indicate backpressure buildup
  • Blocking FFI threads can cascade to Ruby/Python handler starvation
  • Error-only detection misses slow-but-completing operations
  • Latency detection catches degradation before total failure

Metrics:

  • ffi_completion_slow_sends_total - Sends exceeding latency threshold
  • ffi_completion_circuit_open_rejections_total - Rejections due to open circuit

4. Messaging Circuit Breaker

Purpose: Protects message queue operations from provider failures (PGMQ or RabbitMQ).

Scope: Integrated into MessageClient, shared across orchestration and worker messaging.

Behavior:

  • Opens when send/receive operations fail repeatedly
  • Protected operations: send_step_message, receive_step_messages, send_step_result, receive_step_results, send_task_request, receive_task_requests, send_task_finalization, receive_task_finalizations, send_message, receive_messages
  • Unprotected operations (safe to fail or needed for recovery): ack_message, nack_message, extend_visibility, health_check, ensure_queue, queue stats
  • Coordinates with visibility timeout for message safety
  • Provider-agnostic: works with both PGMQ and RabbitMQ backends

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes to close
# timeout_seconds inherited from default_config (30s)

Why do ack/nack bypass the breaker?

  • Ack/nack failure causes message redelivery via visibility timeout, which is safe
  • Health check must work when breaker is open to detect recovery
  • Queue management is startup-only and should not be gated
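
The split between protected and unprotected operations can be sketched as a guard that consults the breaker only for send and receive. `QueueOp`, `is_protected`, `guarded`, and `CircuitOpen` are hypothetical names used for illustration; the real MessageClient method list is given above.

```rust
/// Hypothetical grouping of MessageClient operations.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueueOp {
    Send,
    Receive,
    Ack,
    Nack,
    ExtendVisibility,
    HealthCheck,
    EnsureQueue,
    QueueStats,
}

/// Only send/receive style operations are gated by the breaker.
pub fn is_protected(op: QueueOp) -> bool {
    matches!(op, QueueOp::Send | QueueOp::Receive)
}

/// Hypothetical error returned when a protected call is rejected.
#[derive(Debug, PartialEq, Eq)]
pub struct CircuitOpen;

/// Runs `run` unless the operation is protected and the breaker is open.
pub fn guarded<T>(
    op: QueueOp,
    breaker_allows: bool,
    run: impl FnOnce() -> T,
) -> Result<T, CircuitOpen> {
    if is_protected(op) && !breaker_allows {
        return Err(CircuitOpen);
    }
    Ok(run())
}
```

With the breaker open, `guarded(QueueOp::Send, false, ...)` is rejected while `guarded(QueueOp::Ack, false, ...)` and `guarded(QueueOp::HealthCheck, false, ...)` still run, which is what lets in-flight messages settle and recovery be detected.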

Configuration Reference

Global Settings

[common.circuit_breakers.global_settings]
metrics_collection_interval_seconds = 30    # Metrics aggregation interval
min_state_transition_interval_seconds = 5.0 # Debounce for rapid transitions

Default Configuration

Applied to any circuit breaker without explicit configuration:

[common.circuit_breakers.default_config]
failure_threshold = 5      # 1-100 range
timeout_seconds = 30       # 1-300 range
success_threshold = 2      # 1-50 range
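
The ranges in the comments above could be enforced with a check like the following. This is a sketch only: `BreakerConfig` and `validate` are hypothetical names, and how tasker-core validates its configuration is not described here.

```rust
/// Hypothetical mirror of the three default_config keys.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BreakerConfig {
    pub failure_threshold: u32,
    pub timeout_seconds: u32,
    pub success_threshold: u32,
}

/// Checks each value against the documented ranges and reports the first violation.
pub fn validate(cfg: &BreakerConfig) -> Result<(), String> {
    check("failure_threshold", cfg.failure_threshold, 1, 100)?;
    check("timeout_seconds", cfg.timeout_seconds, 1, 300)?;
    check("success_threshold", cfg.success_threshold, 1, 50)
}

fn check(name: &str, value: u32, min: u32, max: u32) -> Result<(), String> {
    if (min..=max).contains(&value) {
        Ok(())
    } else {
        Err(format!("{name} = {value} is outside {min}-{max}"))
    }
}
```

Rejecting out-of-range values at load time surfaces a mistake such as `timeout_seconds = 301` before any breaker is built.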

Component-Specific Overrides

# Task readiness (polling-specific)
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10
success_threshold = 3

# Messaging operations (PGMQ/RabbitMQ)
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2

# Web/API database operations
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2

Note: timeout_seconds is inherited from default_config for all component circuit breakers. The pgmq key is accepted as an alias for messaging for backward compatibility.
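
Inheritance from default_config can be pictured as a merge where each unset component key falls back to the default. `ResolvedConfig`, `ComponentOverride`, and `resolve` are hypothetical names; the actual loader may resolve values differently.

```rust
/// Hypothetical fully resolved breaker settings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResolvedConfig {
    pub failure_threshold: u32,
    pub timeout_seconds: u32,
    pub success_threshold: u32,
}

/// Hypothetical component section: every key is optional.
#[derive(Debug, Default, Clone, Copy)]
pub struct ComponentOverride {
    pub failure_threshold: Option<u32>,
    pub timeout_seconds: Option<u32>,
    pub success_threshold: Option<u32>,
}

/// Any key the component leaves unset is inherited from the default.
pub fn resolve(default: ResolvedConfig, component: ComponentOverride) -> ResolvedConfig {
    ResolvedConfig {
        failure_threshold: component.failure_threshold.unwrap_or(default.failure_threshold),
        timeout_seconds: component.timeout_seconds.unwrap_or(default.timeout_seconds),
        success_threshold: component.success_threshold.unwrap_or(default.success_threshold),
    }
}
```

Applied to the task_readiness override above (`failure_threshold = 10`, `success_threshold = 3`) and the base defaults, this yields a 30-second timeout inherited from default_config.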

Worker-Specific Configuration

# FFI completion (latency-based)
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5
recovery_timeout_seconds = 5
success_threshold = 2
slow_send_threshold_ms = 100

Environment Overrides

Different environments may need different thresholds:

Test (config/tasker/environments/test/common.toml):

[common.circuit_breakers.default_config]
failure_threshold = 2      # Faster failure detection
timeout_seconds = 5        # Quick recovery for tests
success_threshold = 1

Production (config/tasker/environments/production/common.toml):

[common.circuit_breakers.default_config]
failure_threshold = 10     # More tolerance for transient failures
timeout_seconds = 60       # Longer recovery window
success_threshold = 5      # More confidence before closing

Health Endpoint Integration

Circuit breaker states are exposed through health endpoints for monitoring and Kubernetes probes.

Orchestration Health (/health/detailed)

{
  "status": "healthy",
  "checks": {
    "circuit_breakers": {
      "status": "healthy",
      "message": "Circuit breaker state: Closed",
      "duration_ms": 1,
      "last_checked": "2025-12-10T10:00:00Z"
    }
  }
}

Worker Health (/health/detailed)

{
  "status": "healthy",
  "checks": {
    "circuit_breakers": {
      "status": "healthy",
      "message": "2 circuit breakers: 2 closed, 0 open, 0 half-open. Details: ffi_completion: closed (100 calls, 2 failures); task_readiness: closed (50 calls, 0 failures)",
      "duration_ms": 0,
      "last_checked": "2025-12-10T10:00:00Z"
    }
  }
}

Health Status Mapping

| Circuit Breaker State | Health Status | Impact |
|---|---|---|
| All Closed | healthy | Normal operation |
| Any Half-Open | degraded | Testing recovery |
| Any Open | unhealthy | Failing fast |
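
The mapping can be sketched as a reduction over all breaker states. `BreakerState`, `HealthStatus`, and `overall_health` are hypothetical names, and letting an open breaker take precedence over a half-open one is an assumption, since the mapping does not say which wins when both occur.

```rust
/// Hypothetical breaker state, as reported per component.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BreakerState {
    Closed,
    HalfOpen,
    Open,
}

/// Hypothetical aggregate health status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Aggregates breaker states into one status.
/// Assumption: an open breaker outranks a half-open one.
pub fn overall_health(states: &[BreakerState]) -> HealthStatus {
    if states.iter().any(|s| *s == BreakerState::Open) {
        HealthStatus::Unhealthy
    } else if states.iter().any(|s| *s == BreakerState::HalfOpen) {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}
```

In this sketch a worker with one closed and one half-open breaker reports degraded, and any open breaker makes the whole check unhealthy.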

Monitoring and Alerting

Key Metrics

| Metric | Type | Description |
|---|---|---|
| api_circuit_breaker_state | Gauge | Web breaker state (0/1/2) |
| tasker_circuit_breaker_state | Gauge | Per-component state |
| api_requests_rejected_total | Counter | Rejections due to open breaker |
| ffi_completion_slow_sends_total | Counter | Slow send detections |
| ffi_completion_circuit_open_rejections_total | Counter | FFI breaker rejections |

Prometheus Alerts

groups:
  - name: circuit_breakers
    rules:
      - alert: TaskerCircuitBreakerOpen
        expr: api_circuit_breaker_state == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker is OPEN"
          description: "Circuit breaker {{ $labels.component }} has been open for >1 minute"

      - alert: TaskerCircuitBreakerHalfOpen
        expr: api_circuit_breaker_state == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker stuck in half-open"
          description: "Circuit breaker {{ $labels.component }} in half-open state >5 minutes"

      - alert: TaskerFFISlowSendsHigh
        expr: rate(ffi_completion_slow_sends_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "FFI completion channel experiencing backpressure"
          description: "Slow sends averaging >10/second, circuit breaker may open"

Grafana Dashboard Panels

Circuit Breaker State Timeline:

Panel: Time series
Query: api_circuit_breaker_state
Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)

FFI Latency Percentiles:

Panel: Time series
Queries:
  - histogram_quantile(0.50, sum by (le) (rate(ffi_completion_send_duration_seconds_bucket[5m])))
  - histogram_quantile(0.95, sum by (le) (rate(ffi_completion_send_duration_seconds_bucket[5m])))
  - histogram_quantile(0.99, sum by (le) (rate(ffi_completion_send_duration_seconds_bucket[5m])))
Thresholds: 100ms warning, 500ms critical

Operational Procedures

When Circuit Breaker Opens

Immediate Actions:

  1. Check database connectivity: pg_isready -h <host> -p 5432
  2. Check connection pool status: /health/detailed endpoint
  3. Review recent error logs for root cause
  4. Monitor queue depth for message backlog

Recovery:

  • Circuit automatically tests recovery after timeout_seconds
  • No manual intervention needed for transient failures
  • For persistent failures, fix underlying issue first

Escalation:

  • If breaker stays open >5 minutes, escalate to database team
  • If breaker oscillates (open/half-open/open), increase failure_threshold and consider a longer timeout_seconds, since any single failure in half-open reopens the circuit

Tuning Guidelines

Symptom: Breaker opens too frequently

  • Increase failure_threshold
  • Investigate root cause of failures
  • Consider if failures are transient vs systemic

Symptom: Breaker stays open too long

  • Decrease timeout_seconds
  • Verify downstream system has recovered
  • Check if success_threshold is too high

Symptom: FFI breaker opens unnecessarily

  • Increase slow_send_threshold_ms
  • Verify channel buffer sizes are adequate
  • Check Ruby/Python handler throughput

Architecture Integration

Relationship to Backpressure

Circuit breakers are one layer of the broader backpressure strategy:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        RESILIENCE LAYER STACK                                │
└─────────────────────────────────────────────────────────────────────────────┘

Layer 1: Circuit Breakers     → Fast-fail on component failure
Layer 2: Bounded Channels     → Backpressure on internal queues
Layer 3: Visibility Timeouts  → Message-level retry safety
Layer 4: Semaphore Limits     → Handler execution rate limiting
Layer 5: Connection Pools     → Database resource management

See Backpressure Architecture for the complete strategy.

Independence Principle

Each circuit breaker operates independently:

  • Web breaker can be open while task readiness breaker is closed
  • FFI breaker state doesn’t affect the messaging breaker
  • Prevents single failure mode from cascading across components
  • Allows targeted recovery per component

Integration Points

| Component | Circuit Breaker | Integration Point |
|---|---|---|
| tasker-orchestration/src/web | Web Database | API request handlers |
| tasker-orchestration/src/orchestration/task_readiness | Task Readiness | Fallback poller loop |
| tasker-worker/src/worker/handlers | FFI Completion | Completion channel sends |
| tasker-shared/src/messaging/client.rs | Messaging | MessageClient send/receive methods |

Troubleshooting

Common Issues

Issue: Web circuit breaker flapping (open → half-open → open rapidly)

Diagnosis:

  1. Check database query latency (slow queries can cause timeout failures)
  2. Review connection pool saturation
  3. Check if PostgreSQL is under memory pressure

Resolution:

  • Increase failure_threshold if failures are transient
  • Increase timeout_seconds to give more recovery time
  • Fix underlying database performance issues

Issue: FFI completion circuit breaker opens during normal load

Diagnosis:

  1. Check Ruby/Python handler execution time
  2. Review completion channel buffer utilization
  3. Verify worker concurrency settings

Resolution:

  • Increase slow_send_threshold_ms if handlers are legitimately slow
  • Increase channel buffer size in worker config
  • Reduce handler concurrency if system is overloaded

Issue: Task readiness breaker open but web API working fine

Diagnosis:

  • Task readiness queries may be slower/different than API queries
  • Polling may hit database at different times (e.g., during maintenance)

Resolution:

  • Independent breakers are working as designed
  • Check specific task readiness query performance
  • Consider database index optimization for readiness queries

Source Code Reference

| Component | File |
|---|---|
| CircuitBreakerBehavior Trait | tasker-shared/src/resilience/behavior.rs |
| Generic CircuitBreaker | tasker-shared/src/resilience/circuit_breaker.rs |
| Circuit Breaker Config | tasker-shared/src/config/circuit_breaker.rs |
| MessageClient (messaging breaker) | tasker-shared/src/messaging/client.rs |
| WebDatabaseCircuitBreaker | tasker-orchestration/src/api_common/circuit_breaker.rs |
| Web CB Helpers | tasker-orchestration/src/web/circuit_breaker.rs |
| TaskReadinessCircuitBreaker | tasker-orchestration/src/orchestration/task_readiness/circuit_breaker.rs |
| FfiCompletionCircuitBreaker | tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs |
| Worker Health Integration | tasker-worker/src/web/handlers/health.rs |
| Circuit Breaker Types | tasker-shared/src/types/api/worker.rs |
