Circuit Breakers
Last Updated: 2026-02-04
Audience: Architects, Operators, Developers
Status: Active
Related Docs: Backpressure Architecture | Observability | Operations: Backpressure Monitoring
Circuit breakers provide fault isolation and cascade prevention across tasker-core. This document covers the circuit breaker architecture, implementations, configuration, and operational monitoring.
Core Concept
Circuit breakers prevent cascading failures by failing fast when a component is unhealthy. Instead of waiting for slow or failing operations to time out, circuit breakers detect failure patterns and immediately reject calls, giving the downstream system time to recover.
State Machine
┌─────────────────────────────────────────────────────────────────────────────┐
│                        CIRCUIT BREAKER STATE MACHINE                        │
└─────────────────────────────────────────────────────────────────────────────┘
                  Success
                ┌─────────┐
                │         │
                ▼         │
            ┌───────┐     │
    ───────>│CLOSED │─────┘
            └───┬───┘
                │
                │ failure_threshold
                │ consecutive failures
                │
                ▼
            ┌───────┐
            │ OPEN  │◄─────────────────────┐
            └───┬───┘                      │
                │                          │
                │ timeout_seconds          │ Any failure
                │ elapsed                  │ in half-open
                │                          │
                ▼                          │
           ┌──────────┐                    │
           │HALF-OPEN │────────────────────┘
           └────┬─────┘
                │
                │ success_threshold
                │ consecutive successes
                │
                ▼
            ┌───────┐
            │CLOSED │
            └───────┘
States:
- Closed: Normal operation. All calls allowed. Tracks consecutive failures.
- Open: Failing fast. All calls rejected immediately. Waiting for timeout.
- Half-Open: Testing recovery. Limited calls allowed. Single failure reopens.
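The transitions above can be sketched as a small single-threaded state machine. This is an illustration only, not the tasker-core implementation: `SketchBreaker` and its fields are hypothetical names, the clock is passed in explicitly to keep the logic easy to test, and half-open admits every call here rather than a limited number.

```rust
use std::time::{Duration, Instant};

/// The three states from the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical breaker; field names mirror the config keys.
pub struct SketchBreaker {
    state: State,
    consecutive_failures: u32,
    consecutive_successes: u32,
    opened_at: Option<Instant>,
    failure_threshold: u32,
    success_threshold: u32,
    timeout: Duration,
}

impl SketchBreaker {
    pub fn new(failure_threshold: u32, success_threshold: u32, timeout: Duration) -> Self {
        Self {
            state: State::Closed,
            consecutive_failures: 0,
            consecutive_successes: 0,
            opened_at: None,
            failure_threshold,
            success_threshold,
            timeout,
        }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// Open moves to HalfOpen once the timeout has elapsed; until then calls are rejected.
    pub fn should_allow(&mut self, now: Instant) -> bool {
        if self.state == State::Open {
            let elapsed = self
                .opened_at
                .map_or(Duration::ZERO, |t| now.saturating_duration_since(t));
            if elapsed >= self.timeout {
                self.state = State::HalfOpen;
                self.consecutive_successes = 0;
            }
        }
        self.state != State::Open
    }

    pub fn record_success(&mut self) {
        match self.state {
            // A success in Closed breaks the run of consecutive failures.
            State::Closed => self.consecutive_failures = 0,
            State::HalfOpen => {
                self.consecutive_successes += 1;
                if self.consecutive_successes >= self.success_threshold {
                    self.state = State::Closed;
                    self.consecutive_failures = 0;
                }
            }
            State::Open => {}
        }
    }

    pub fn record_failure(&mut self, now: Instant) {
        match self.state {
            State::Closed => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.trip(now);
                }
            }
            // Any failure in half-open reopens the circuit.
            State::HalfOpen => self.trip(now),
            State::Open => {}
        }
    }

    fn trip(&mut self, now: Instant) {
        self.state = State::Open;
        self.opened_at = Some(now);
    }
}
```

A caller would check `should_allow` before each operation and report the outcome afterwards with `record_success` or `record_failure`.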
Unified Trait: CircuitBreakerBehavior
All circuit breaker implementations share a common trait defined in tasker-shared/src/resilience/behavior.rs:
use std::fmt::Debug;
use std::time::Duration;

// Imports for CircuitState and CircuitBreakerMetrics are omitted here.
pub trait CircuitBreakerBehavior: Send + Sync + Debug {
    fn name(&self) -> &str;
    fn state(&self) -> CircuitState;

    // Gate a call, then report its outcome along with how long it took.
    fn should_allow(&self) -> bool;
    fn record_success(&self, duration: Duration);
    fn record_failure(&self, duration: Duration);

    fn is_healthy(&self) -> bool;

    // Manual overrides.
    fn force_open(&self);
    fn force_closed(&self);

    fn metrics(&self) -> CircuitBreakerMetrics;
}
Each specialized breaker wraps the generic CircuitBreaker (composition pattern) and implements this trait. This means:
- Consistent state machine behavior across all breakers
- Proper half-open → closed recovery via success_threshold
- Lock-free atomic state management
- Domain-specific methods remain as additional methods on each type
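To make the composition pattern concrete, here is a minimal sketch under stated assumptions. `BreakerLike` is a trimmed-down stand-in for `CircuitBreakerBehavior` (three methods only), and `GenericBreaker`, `WrappedBreaker`, `rejections`, and `call_guarded` are hypothetical names invented for the example. Recovery through half-open is left out to keep it short.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::{Duration, Instant};

/// Trimmed-down stand-in for `CircuitBreakerBehavior`.
pub trait BreakerLike {
    fn should_allow(&self) -> bool;
    fn record_success(&self, duration: Duration);
    fn record_failure(&self, duration: Duration);
}

/// Hypothetical generic breaker: rejects after `threshold` consecutive failures.
pub struct GenericBreaker {
    consecutive_failures: AtomicU32,
    threshold: u32,
}

impl GenericBreaker {
    pub fn new(threshold: u32) -> Self {
        Self { consecutive_failures: AtomicU32::new(0), threshold }
    }
}

impl BreakerLike for GenericBreaker {
    fn should_allow(&self) -> bool {
        self.consecutive_failures.load(Ordering::Acquire) < self.threshold
    }
    fn record_success(&self, _duration: Duration) {
        self.consecutive_failures.store(0, Ordering::Release);
    }
    fn record_failure(&self, _duration: Duration) {
        self.consecutive_failures.fetch_add(1, Ordering::AcqRel);
    }
}

/// Specialized breaker: delegates to the generic one and adds its own counter.
pub struct WrappedBreaker {
    inner: GenericBreaker,
    rejections: AtomicU32,
}

impl WrappedBreaker {
    pub fn new(threshold: u32) -> Self {
        Self { inner: GenericBreaker::new(threshold), rejections: AtomicU32::new(0) }
    }

    /// Domain-specific extra that lives outside the shared trait.
    pub fn rejections(&self) -> u32 {
        self.rejections.load(Ordering::Relaxed)
    }
}

impl BreakerLike for WrappedBreaker {
    fn should_allow(&self) -> bool {
        let allowed = self.inner.should_allow();
        if !allowed {
            self.rejections.fetch_add(1, Ordering::Relaxed);
        }
        allowed
    }
    fn record_success(&self, duration: Duration) {
        self.inner.record_success(duration)
    }
    fn record_failure(&self, duration: Duration) {
        self.inner.record_failure(duration)
    }
}

/// Runs `op` behind any breaker; returns `None` when the call is rejected.
pub fn call_guarded<T, E>(
    breaker: &dyn BreakerLike,
    op: impl FnOnce() -> Result<T, E>,
) -> Option<Result<T, E>> {
    if !breaker.should_allow() {
        return None;
    }
    let started = Instant::now();
    let result = op();
    match &result {
        Ok(_) => breaker.record_success(started.elapsed()),
        Err(_) => breaker.record_failure(started.elapsed()),
    }
    Some(result)
}
```

Because `call_guarded` only depends on the trait, the same helper works for every breaker type, while each wrapper keeps its own extras such as `rejections`.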
Circuit Breaker Implementations
Tasker-core has four circuit breaker implementations, each protecting specific components.
All wrap the generic CircuitBreaker from tasker_shared::resilience:
| Circuit Breaker | Location | Purpose | Trigger Type |
|---|---|---|---|
| Web Database | tasker-orchestration | API database operations | Error-based |
| Task Readiness | tasker-orchestration | Fallback poller database checks | Error-based |
| FFI Completion | tasker-worker | Ruby/Python handler completion channel | Latency-based |
| Messaging | tasker-shared | Message queue operations (PGMQ/RabbitMQ) | Error-based |
1. Web Database Circuit Breaker
Purpose: Protects API endpoints from cascading database failures.
Scope: Independent from orchestration system’s internal operations.
Behavior:
- Opens when database queries fail repeatedly
- Returns 503 with Retry-After header when open
- Fast-fail rejection with atomic state management
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.web]
failure_threshold = 5 # Consecutive failures before opening
success_threshold = 2 # Successes in half-open to fully close
# timeout_seconds inherited from default_config (30s)
Health Check Integration:
- Included in /health/ready endpoint
- State reported in /health/detailed response
- Metric: api_circuit_breaker_state (0=closed, 1=half-open, 2=open)
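The gauge encoding and the 503 rejection can be sketched as two small functions. This is a sketch, not the handler code: the `CircuitState` enum is redefined locally, `Rejection` and `reject_if_open` are hypothetical, and deriving `Retry-After` from the remaining timeout is an assumption, since the source does not say how the header value is chosen.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CircuitState {
    Closed,
    HalfOpen,
    Open,
}

/// Gauge encoding stated above: 0=closed, 1=half-open, 2=open.
pub fn gauge_value(state: CircuitState) -> u8 {
    match state {
        CircuitState::Closed => 0,
        CircuitState::HalfOpen => 1,
        CircuitState::Open => 2,
    }
}

/// Hypothetical response parts for a rejected API request.
#[derive(Debug, PartialEq, Eq)]
pub struct Rejection {
    pub status: u16,
    pub retry_after_secs: u64,
}

/// Returns a rejection only while the breaker is open.
/// Assumption: Retry-After is the time left until the half-open probe, at least 1s.
pub fn reject_if_open(
    state: CircuitState,
    timeout_secs: u64,
    open_for_secs: u64,
) -> Option<Rejection> {
    match state {
        CircuitState::Open => Some(Rejection {
            status: 503,
            retry_after_secs: timeout_secs.saturating_sub(open_for_secs).max(1),
        }),
        CircuitState::Closed | CircuitState::HalfOpen => None,
    }
}
```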
2. Task Readiness Circuit Breaker
Purpose: Protects fallback poller from database overload during polling cycles.
Scope: Independent from web circuit breaker, specific to task readiness queries.
Behavior:
- Opens when task readiness queries fail repeatedly
- Skips polling cycles when open (doesn’t fail-fast, just skips)
- Allows orchestration to continue processing existing work
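The skip-instead-of-reject behavior can be illustrated with one polling cycle. `CycleOutcome` and `poll_once` are hypothetical names, and the readiness query is passed in as a closure so the sketch stays self-contained; the real poller's query and types are not shown in this document.

```rust
/// Outcome of one polling cycle in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum CycleOutcome {
    /// Breaker was open: no query was issued and no error was raised.
    Skipped,
    /// Query succeeded and returned this many ready tasks.
    Polled(usize),
    /// Query failed; the caller would record a failure on the breaker.
    Failed,
}

pub fn poll_once<E>(
    breaker_allows: bool,
    query_ready_tasks: impl FnOnce() -> Result<Vec<u64>, E>,
) -> CycleOutcome {
    if !breaker_allows {
        // Skip quietly so the database gets room to recover.
        return CycleOutcome::Skipped;
    }
    match query_ready_tasks() {
        Ok(task_ids) => CycleOutcome::Polled(task_ids.len()),
        Err(_) => CycleOutcome::Failed,
    }
}
```

A skipped cycle is not an error: the loop simply waits for its next tick, and work that is already in flight is unaffected.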
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10 # Higher threshold for polling
timeout_seconds = 60 # Longer recovery window
success_threshold = 3 # More successes needed for confidence
Why Separate from Web?
- Different failure patterns (polling vs request-driven)
- Different recovery semantics (skip vs reject)
- Isolation prevents web failures from stopping polling (and vice versa)
3. FFI Completion Circuit Breaker
Purpose: Protects Ruby/Python worker completion channels from backpressure.
Scope: Worker-specific, protects FFI boundary.
Behavior:
- Latency-based: Treats slow sends (>100ms) as failures
- Opens when completion channel is consistently slow
- Prevents FFI threads from blocking on saturated channels
- Drops completions when open (with metrics), allowing handler threads to continue
Configuration (config/tasker/base/worker.toml):
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5 # Slow sends before opening
recovery_timeout_seconds = 5 # Short recovery window
success_threshold = 2 # Successes to close
slow_send_threshold_ms = 100 # Latency threshold (100ms)
Why Latency-Based?
- Slow channel sends indicate backpressure buildup
- Blocking FFI threads can cascade to Ruby/Python handler starvation
- Error-only detection misses slow-but-completing operations
- Latency detection catches degradation before total failure
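As a rough illustration of latency-based detection, the function below classifies a completion send by both its result and its duration. `SendOutcome`, `classify_send`, and `counts_as_failure` are hypothetical names; only the rule that sends slower than the threshold count as failures comes from the text above.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum SendOutcome {
    Success,
    /// The send completed, but took longer than the threshold.
    SlowSend,
    /// The send itself failed.
    Error,
}

/// A send is slow only when it strictly exceeds the threshold (>100ms by default).
pub fn classify_send(
    elapsed: Duration,
    send_ok: bool,
    slow_send_threshold: Duration,
) -> SendOutcome {
    if !send_ok {
        SendOutcome::Error
    } else if elapsed > slow_send_threshold {
        SendOutcome::SlowSend
    } else {
        SendOutcome::Success
    }
}

/// Both errors and slow sends count toward `failure_threshold`.
pub fn counts_as_failure(outcome: &SendOutcome) -> bool {
    !matches!(outcome, SendOutcome::Success)
}
```

An error-only breaker would treat `SlowSend` as a success, which is exactly the degradation this breaker is meant to catch.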
Metrics:
- ffi_completion_slow_sends_total - Sends exceeding latency threshold
- ffi_completion_circuit_open_rejections_total - Rejections due to open circuit
4. Messaging Circuit Breaker
Purpose: Protects message queue operations from provider failures (PGMQ or RabbitMQ).
Scope: Integrated into MessageClient, shared across orchestration and worker messaging.
Behavior:
- Opens when send/receive operations fail repeatedly
- Protected operations: send_step_message, receive_step_messages, send_step_result, receive_step_results, send_task_request, receive_task_requests, send_task_finalization, receive_task_finalizations, send_message, receive_messages
- Unprotected operations (safe to fail or needed for recovery): ack_message, nack_message, extend_visibility, health_check, ensure_queue, queue stats
- Coordinates with visibility timeout for message safety
- Provider-agnostic: works with both PGMQ and RabbitMQ backends
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5 # Failures before opening
success_threshold = 2 # Successes to close
# timeout_seconds inherited from default_config (30s)
Why do ack/nack bypass the breaker?
- Ack/nack failure causes message redelivery via visibility timeout, which is safe
- Health check must work when breaker is open to detect recovery
- Queue management is startup-only and should not be gated
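The split between protected and unprotected operations can be pictured as a gate in front of each call. The sketch below covers only a subset of the operations listed above, and `MessagingOp`, `is_protected`, and `dispatch` are hypothetical names rather than the `MessageClient` API.

```rust
/// A subset of the operations named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MessagingOp {
    SendStepMessage,
    ReceiveStepMessages,
    SendMessage,
    ReceiveMessages,
    AckMessage,
    NackMessage,
    ExtendVisibility,
    HealthCheck,
    EnsureQueue,
}

/// Send/receive paths are gated; ack/nack, health, and queue setup are not.
pub fn is_protected(op: MessagingOp) -> bool {
    use MessagingOp::*;
    match op {
        SendStepMessage | ReceiveStepMessages | SendMessage | ReceiveMessages => true,
        AckMessage | NackMessage | ExtendVisibility | HealthCheck | EnsureQueue => false,
    }
}

/// Runs `run` unless the operation is protected and the breaker is open.
pub fn dispatch<T>(
    op: MessagingOp,
    breaker_open: bool,
    run: impl FnOnce() -> T,
) -> Result<T, &'static str> {
    if is_protected(op) && breaker_open {
        return Err("circuit open");
    }
    Ok(run())
}
```

With the breaker open, a `SendMessage` is rejected without touching the provider, while an `AckMessage` or `HealthCheck` still goes through.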
Configuration Reference
Global Settings
[common.circuit_breakers.global_settings]
metrics_collection_interval_seconds = 30 # Metrics aggregation interval
min_state_transition_interval_seconds = 5.0 # Debounce for rapid transitions
Default Configuration
Applied to any circuit breaker without explicit configuration:
[common.circuit_breakers.default_config]
failure_threshold = 5 # 1-100 range
timeout_seconds = 30 # 1-300 range
success_threshold = 2 # 1-50 range
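The ranges in the comments above suggest a validation step when configuration is loaded. The following is a sketch of such a check, assuming the ranges are inclusive; `BreakerConfig` and `validate` are hypothetical and do not describe the actual validation code.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BreakerConfig {
    pub failure_threshold: u32,
    pub timeout_seconds: u32,
    pub success_threshold: u32,
}

/// Checks each field against the documented range (assumed inclusive).
pub fn validate(cfg: &BreakerConfig) -> Result<(), String> {
    let checks = [
        ("failure_threshold", cfg.failure_threshold, 1..=100u32),
        ("timeout_seconds", cfg.timeout_seconds, 1..=300u32),
        ("success_threshold", cfg.success_threshold, 1..=50u32),
    ];
    for (name, value, range) in checks {
        if !range.contains(&value) {
            return Err(format!(
                "{} = {} is outside {}-{}",
                name,
                value,
                range.start(),
                range.end()
            ));
        }
    }
    Ok(())
}
```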
Component-Specific Overrides
# Task readiness (polling-specific)
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10
success_threshold = 3
# Messaging operations (PGMQ/RabbitMQ)
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2
# Web/API database operations
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2
Note: None of the overrides above set timeout_seconds, so these component circuit breakers inherit it from default_config. The pgmq key is accepted as an alias for messaging for backward compatibility.
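Inheritance from default_config and the pgmq alias can be sketched as a lookup with per-field fallback. `Resolved`, `Override`, and `resolve` are hypothetical names, and preferring a messaging entry over a pgmq entry when both exist is an assumption; the source only says the alias is accepted.

```rust
use std::collections::HashMap;

/// Fully resolved settings for one breaker.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Resolved {
    pub failure_threshold: u32,
    pub timeout_seconds: u32,
    pub success_threshold: u32,
}

/// A component override; unset fields fall back to the defaults.
#[derive(Debug, Clone, Copy, Default)]
pub struct Override {
    pub failure_threshold: Option<u32>,
    pub timeout_seconds: Option<u32>,
    pub success_threshold: Option<u32>,
}

pub fn resolve(
    default: Resolved,
    overrides: &HashMap<String, Override>,
    component: &str,
) -> Resolved {
    // Legacy alias: a `pgmq` entry is used when no `messaging` entry exists.
    let found = overrides.get(component).or_else(|| {
        if component == "messaging" {
            overrides.get("pgmq")
        } else {
            None
        }
    });
    match found {
        Some(o) => Resolved {
            failure_threshold: o.failure_threshold.unwrap_or(default.failure_threshold),
            timeout_seconds: o.timeout_seconds.unwrap_or(default.timeout_seconds),
            success_threshold: o.success_threshold.unwrap_or(default.success_threshold),
        },
        None => default,
    }
}
```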
Worker-Specific Configuration
# FFI completion (latency-based)
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5
recovery_timeout_seconds = 5
success_threshold = 2
slow_send_threshold_ms = 100
Environment Overrides
Different environments may need different thresholds:
Test (config/tasker/environments/test/common.toml):
[common.circuit_breakers.default_config]
failure_threshold = 2 # Faster failure detection
timeout_seconds = 5 # Quick recovery for tests
success_threshold = 1
Production (config/tasker/environments/production/common.toml):
[common.circuit_breakers.default_config]
failure_threshold = 10 # More tolerance for transient failures
timeout_seconds = 60 # Longer recovery window
success_threshold = 5 # More confidence before closing
Health Endpoint Integration
Circuit breaker states are exposed through health endpoints for monitoring and Kubernetes probes.
Orchestration Health (/health/detailed)
{
"status": "healthy",
"checks": {
"circuit_breakers": {
"status": "healthy",
"message": "Circuit breaker state: Closed",
"duration_ms": 1,
"last_checked": "2025-12-10T10:00:00Z"
}
}
}
Worker Health (/health/detailed)
{
"status": "healthy",
"checks": {
"circuit_breakers": {
"status": "healthy",
"message": "2 circuit breakers: 2 closed, 0 open, 0 half-open. Details: ffi_completion: closed (100 calls, 2 failures); task_readiness: closed (50 calls, 0 failures)",
"duration_ms": 0,
"last_checked": "2025-12-10T10:00:00Z"
}
}
}
Health Status Mapping
| Circuit Breaker State | Health Status | Impact |
|---|---|---|
| All Closed | healthy | Normal operation |
| Any Half-Open | degraded | Testing recovery |
| Any Open | unhealthy | Failing fast |
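The mapping in the table amounts to a worst-state-wins aggregation. A minimal sketch follows, with locally defined enums and a hypothetical `aggregate` function; letting open take precedence over half-open when both occur is an assumption the table implies but does not state.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CircuitState {
    Closed,
    HalfOpen,
    Open,
}

#[derive(Debug, PartialEq, Eq)]
pub enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Any open breaker makes the check unhealthy; otherwise any half-open
/// breaker makes it degraded; otherwise (all closed) it is healthy.
pub fn aggregate(states: &[CircuitState]) -> HealthStatus {
    if states.contains(&CircuitState::Open) {
        HealthStatus::Unhealthy
    } else if states.contains(&CircuitState::HalfOpen) {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}
```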
Monitoring and Alerting
Key Metrics
| Metric | Type | Description |
|---|---|---|
| api_circuit_breaker_state | Gauge | Web breaker state (0/1/2) |
| tasker_circuit_breaker_state | Gauge | Per-component state |
| api_requests_rejected_total | Counter | Rejections due to open breaker |
| ffi_completion_slow_sends_total | Counter | Slow send detections |
| ffi_completion_circuit_open_rejections_total | Counter | FFI breaker rejections |
Prometheus Alerts
groups:
- name: circuit_breakers
rules:
- alert: TaskerCircuitBreakerOpen
expr: api_circuit_breaker_state == 2
for: 1m
labels:
severity: critical
annotations:
summary: "Circuit breaker is OPEN"
description: "Circuit breaker {{ $labels.component }} has been open for >1 minute"
- alert: TaskerCircuitBreakerHalfOpen
expr: api_circuit_breaker_state == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Circuit breaker stuck in half-open"
description: "Circuit breaker {{ $labels.component }} in half-open state >5 minutes"
- alert: TaskerFFISlowSendsHigh
expr: rate(ffi_completion_slow_sends_total[5m]) > 10
for: 2m
labels:
severity: warning
annotations:
summary: "FFI completion channel experiencing backpressure"
description: "Slow sends averaging >10/second, circuit breaker may open"
Grafana Dashboard Panels
Circuit Breaker State Timeline:
Panel: Time series
Query: api_circuit_breaker_state
Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)
FFI Latency Percentiles:
Panel: Time series
Queries:
- histogram_quantile(0.50, sum by (le) (rate(ffi_completion_send_duration_seconds_bucket[5m])))
- histogram_quantile(0.95, sum by (le) (rate(ffi_completion_send_duration_seconds_bucket[5m])))
- histogram_quantile(0.99, sum by (le) (rate(ffi_completion_send_duration_seconds_bucket[5m])))
Thresholds: 100ms warning, 500ms critical
Operational Procedures
When Circuit Breaker Opens
Immediate Actions:
- Check database connectivity: pg_isready -h <host> -p 5432
- Check connection pool status: /health/detailed endpoint
- Review recent error logs for root cause
- Monitor queue depth for message backlog
Recovery:
- Circuit automatically tests recovery after timeout_seconds
- No manual intervention needed for transient failures
- For persistent failures, fix underlying issue first
Escalation:
- If breaker stays open >5 minutes, escalate to database team
- If breaker oscillates (open/half-open/open), increase failure_threshold
Tuning Guidelines
Symptom: Breaker opens too frequently
- Increase failure_threshold
- Investigate root cause of failures
- Consider if failures are transient vs systemic
Symptom: Breaker stays open too long
- Decrease timeout_seconds
- Verify downstream system has recovered
- Check if success_threshold is too high
Symptom: FFI breaker opens unnecessarily
- Increase slow_send_threshold_ms
- Verify channel buffer sizes are adequate
- Check Ruby/Python handler throughput
Architecture Integration
Relationship to Backpressure
Circuit breakers are one layer of the broader backpressure strategy:
┌─────────────────────────────────────────────────────────────────────────────┐
│                           RESILIENCE LAYER STACK                            │
└─────────────────────────────────────────────────────────────────────────────┘
Layer 1: Circuit Breakers     → Fast-fail on component failure
Layer 2: Bounded Channels     → Backpressure on internal queues
Layer 3: Visibility Timeouts  → Message-level retry safety
Layer 4: Semaphore Limits     → Handler execution rate limiting
Layer 5: Connection Pools     → Database resource management
See Backpressure Architecture for the complete strategy.
Independence Principle
Each circuit breaker operates independently:
- Web breaker can be open while task readiness breaker is closed
- FFI breaker state doesn’t affect the messaging breaker
- Prevents single failure mode from cascading across components
- Allows targeted recovery per component
Integration Points
| Component | Circuit Breaker | Integration Point |
|---|---|---|
| tasker-orchestration/src/web | Web Database | API request handlers |
| tasker-orchestration/src/orchestration/task_readiness | Task Readiness | Fallback poller loop |
| tasker-worker/src/worker/handlers | FFI Completion | Completion channel sends |
| tasker-shared/src/messaging/client.rs | Messaging | MessageClient send/receive methods |
Troubleshooting
Common Issues
Issue: Web circuit breaker flapping (open → half-open → open rapidly)
Diagnosis:
- Check database query latency (slow queries can cause timeout failures)
- Review connection pool saturation
- Check if PostgreSQL is under memory pressure
Resolution:
- Increase failure_threshold if failures are transient
- Increase timeout_seconds to give more recovery time
- Fix underlying database performance issues
Issue: FFI completion circuit breaker opens during normal load
Diagnosis:
- Check Ruby/Python handler execution time
- Review completion channel buffer utilization
- Verify worker concurrency settings
Resolution:
- Increase slow_send_threshold_ms if handlers are legitimately slow
- Increase channel buffer size in worker config
- Reduce handler concurrency if system is overloaded
Issue: Task readiness breaker open but web API working fine
Diagnosis:
- Task readiness queries may be slower/different than API queries
- Polling may hit database at different times (e.g., during maintenance)
Resolution:
- Independent breakers are working as designed
- Check specific task readiness query performance
- Consider database index optimization for readiness queries
Source Code Reference
| Component | File |
|---|---|
| CircuitBreakerBehavior Trait | tasker-shared/src/resilience/behavior.rs |
| Generic CircuitBreaker | tasker-shared/src/resilience/circuit_breaker.rs |
| Circuit Breaker Config | tasker-shared/src/config/circuit_breaker.rs |
| MessageClient (messaging breaker) | tasker-shared/src/messaging/client.rs |
| WebDatabaseCircuitBreaker | tasker-orchestration/src/api_common/circuit_breaker.rs |
| Web CB Helpers | tasker-orchestration/src/web/circuit_breaker.rs |
| TaskReadinessCircuitBreaker | tasker-orchestration/src/orchestration/task_readiness/circuit_breaker.rs |
| FfiCompletionCircuitBreaker | tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs |
| Worker Health Integration | tasker-worker/src/web/handlers/health.rs |
| Circuit Breaker Types | tasker-shared/src/types/api/worker.rs |
Related Documentation
- Backpressure Architecture - Complete resilience strategy
- Operations: Backpressure Monitoring - Operational runbooks
- Operations: MPSC Channel Tuning - Channel capacity management
- Observability - Metrics and logging standards
- Configuration Management - TOML configuration reference