Backpressure Monitoring Runbook
Last Updated: 2026-02-05
Audience: Operations, SRE, On-Call Engineers
Status: Active
Related Docs: Backpressure Architecture | MPSC Channel Tuning
This runbook provides guidance for monitoring, alerting, and responding to backpressure events in tasker-core.
Quick Reference
Critical Metrics Dashboard
| Metric | Normal | Warning | Critical | Action |
|---|---|---|---|---|
| `api_circuit_breaker_state` | closed | - | open | See Circuit Breaker Open |
| `messaging_circuit_breaker_state` | closed | half-open | open | See Messaging Circuit Breaker Open |
| `api_requests_rejected_total` | < 1/min | > 5/min | > 20/min | See API Rejections |
| `mpsc_channel_saturation` | < 50% | > 70% | > 90% | See Channel Saturation |
| `pgmq_queue_depth` | < 50% max | > 70% max | > 90% max | See Queue Depth High |
| `worker_claim_refusals_total` | < 5/min | > 20/min | > 50/min | See Claim Refusals |
| `handler_semaphore_wait_ms_p99` | < 100ms | > 500ms | > 2000ms | See Handler Wait |
| `domain_events_dropped_total` | < 10/min | > 50/min | > 200/min | See Domain Events |
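When paged, a one-shot snapshot of the headline metrics can be pulled from Prometheus before opening any dashboard. A minimal sketch, assuming Prometheus is reachable at `PROM_URL` (hypothetical address, adjust for your environment):

```bash
#!/usr/bin/env bash
# Snapshot the headline backpressure metrics from Prometheus.
# PROM_URL is an assumption; point it at your Prometheus server.
PROM_URL="${PROM_URL:-http://prometheus:9090}"
for q in \
  'api_circuit_breaker_state' \
  'messaging_circuit_breaker_state' \
  'rate(api_requests_rejected_total[5m])' \
  'orchestration_command_channel_saturation' \
  'pgmq_queue_depth'; do
  echo "== $q"
  curl -s -G "$PROM_URL/api/v1/query" --data-urlencode "query=$q" \
    | jq -r '.data.result[] | "\(.metric | tostring): \(.value[1])"'
done
```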
Key Metrics
API Layer Metrics
api_requests_total
- Type: Counter
- Labels: `endpoint`, `status_code`, `method`
- Description: Total API requests received
- Usage: Calculate request rate, error rate
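As an example of the rate calculations this counter supports, both overall request rate and 5xx error ratio can be queried ad hoc (same `PROM_URL` assumption as elsewhere in this runbook):

```bash
# Overall request rate over the last 5m.
PROM_URL="${PROM_URL:-http://prometheus:9090}"   # assumption: your Prometheus address
curl -s -G "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(api_requests_total[5m]))'

# 5xx error ratio over the last 5m.
curl -s -G "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(api_requests_total{status_code=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))'
```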
api_requests_rejected_total
- Type: Counter
- Labels: `endpoint`, `reason` (`rate_limit`, `circuit_breaker`, `validation`)
- Description: Requests rejected due to backpressure
- Alert: > 10/min sustained
api_circuit_breaker_state
- Type: Gauge
- Values: 0 = closed, 1 = half-open, 2 = open
- Description: Current circuit breaker state
- Alert: state = 2 (open)
api_request_latency_ms
- Type: Histogram
- Labels: `endpoint`
- Description: Request processing time
- Alert: p99 > 5000ms
Messaging Metrics
messaging_circuit_breaker_state
- Type: Gauge
- Values: 0 = closed, 1 = half-open, 2 = open
- Description: Current messaging circuit breaker state
- Alert: state = 2 (open) — both orchestration and workers lose queue access
messaging_circuit_breaker_rejections_total
- Type: Counter
- Labels: `operation` (`send`, `receive`)
- Description: Messaging operations rejected due to open circuit breaker
- Alert: > 0 (any rejection indicates messaging outage)
Orchestration Metrics
orchestration_command_channel_size
- Type: Gauge
- Description: Current command channel buffer usage
- Alert: > 80% of `command_buffer_size`
orchestration_command_channel_saturation
- Type: Gauge (0.0 - 1.0)
- Description: Channel saturation ratio
- Alert: > 0.8 sustained for > 1min
pgmq_queue_depth
- Type: Gauge
- Labels: `queue_name`
- Description: Messages in queue
- Alert: > configured max_queue_depth * 0.8
pgmq_enqueue_failures_total
- Type: Counter
- Labels: `queue_name`, `reason`
- Description: Failed enqueue operations
- Alert: > 0 (any failure)
Worker Metrics
worker_claim_refusals_total
- Type: Counter
- Labels: `worker_id`, `namespace`
- Description: Step claims refused due to capacity
- Alert: > 10/min sustained
worker_handler_semaphore_permits_available
- Type: Gauge
- Labels: `worker_id`
- Description: Available handler permits
- Alert: = 0 sustained for > 30s
worker_handler_semaphore_wait_ms
- Type: Histogram
- Labels: `worker_id`
- Description: Time waiting for a semaphore permit
- Alert: p99 > 1000ms
worker_dispatch_channel_saturation
- Type: Gauge
- Labels: `worker_id`
- Description: Dispatch channel saturation
- Alert: > 0.8 sustained
worker_completion_channel_saturation
- Type: Gauge
- Labels: `worker_id`
- Description: Completion channel saturation
- Alert: > 0.8 sustained
domain_events_dropped_total
- Type: Counter
- Labels: `worker_id`, `event_type`
- Description: Domain events dropped due to channel full
- Alert: > 50/min (informational)
Alert Configurations
Prometheus Alert Rules
```yaml
groups:
  - name: tasker_backpressure
    rules:
      # API Layer
      - alert: TaskerCircuitBreakerOpen
        expr: api_circuit_breaker_state == 2
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Tasker API circuit breaker is open"
          description: "Circuit breaker {{ $labels.instance }} has been open for > 30s"
          runbook: "https://docs/operations/backpressure-monitoring.md#circuit-breaker-open"

      - alert: TaskerMessagingCircuitBreakerOpen
        expr: messaging_circuit_breaker_state == 2
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Tasker messaging circuit breaker is open"
          description: "Messaging circuit breaker has been open for > 30s; queue operations are failing"
          runbook: "https://docs/operations/backpressure-monitoring.md#messaging-circuit-breaker-open"

      - alert: TaskerAPIRejectionsHigh
        expr: rate(api_requests_rejected_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of API request rejections"
          description: "{{ $value | printf \"%.2f\" }} requests/sec being rejected"
          runbook: "https://docs/operations/backpressure-monitoring.md#api-rejections-high"

      # Orchestration Layer
      - alert: TaskerCommandChannelSaturated
        expr: orchestration_command_channel_saturation > 0.8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Orchestration command channel is saturated"
          description: "Channel saturation at {{ $value | humanizePercentage }}"
          runbook: "https://docs/operations/backpressure-monitoring.md#channel-saturation"

      - alert: TaskerPGMQQueueDepthHigh
        expr: pgmq_queue_depth / pgmq_queue_max_depth > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PGMQ queue depth is high"
          description: "Queue {{ $labels.queue_name }} at {{ $value | humanizePercentage }} of capacity"
          runbook: "https://docs/operations/backpressure-monitoring.md#pgmq-queue-depth-high"

      # Worker Layer
      - alert: TaskerWorkerClaimRefusalsHigh
        expr: rate(worker_claim_refusals_total[5m]) > 0.2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of worker claim refusals"
          description: "Worker {{ $labels.worker_id }} refusing {{ $value | printf \"%.1f\" }} claims/sec"
          runbook: "https://docs/operations/backpressure-monitoring.md#worker-claim-refusals-high"

      - alert: TaskerHandlerWaitTimeHigh
        expr: histogram_quantile(0.99, sum by (le) (rate(worker_handler_semaphore_wait_ms_bucket[5m]))) > 2000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Handler wait time is high"
          description: "p99 handler wait time is {{ $value | printf \"%.0f\" }}ms"
          runbook: "https://docs/operations/backpressure-monitoring.md#handler-wait-time-high"

      - alert: TaskerDomainEventsDropped
        expr: rate(domain_events_dropped_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Domain events being dropped"
          description: "{{ $value | printf \"%.1f\" }} events/sec dropped"
          runbook: "https://docs/operations/backpressure-monitoring.md#domain-events-dropped"
```
Incident Response Procedures
Circuit Breaker Open
Severity: Critical
Symptoms:
- API returning 503 responses
- `api_circuit_breaker_state` = 2
- Downstream operations failing
Immediate Actions:
1. Check database connectivity
   ```bash
   psql $DATABASE_URL -c "SELECT 1"
   ```
2. Check PGMQ extension health
   ```bash
   psql $DATABASE_URL -c "SELECT * FROM pgmq.meta LIMIT 5"
   ```
3. Check recent error logs
   ```bash
   kubectl logs -l app=tasker-orchestration --tail=100 | grep ERROR
   ```
Recovery:
- The circuit breaker automatically attempts recovery after `timeout_seconds` (default: 30s)
- If the database is healthy, the breaker should close after `success_threshold` (default: 2) successful requests
- If the database is unhealthy, fix the database first
Escalation:
- If breaker remains open > 5 min after database recovery: Escalate to engineering
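While waiting for recovery, the breaker state can be polled directly. A minimal sketch, assuming a Prometheus server at `PROM_URL` (hypothetical address, adjust for your environment):

```bash
# Poll the API circuit breaker state every 5s until it closes (0).
PROM_URL="${PROM_URL:-http://prometheus:9090}"   # assumption: your Prometheus address
while true; do
  state=$(curl -s -G "$PROM_URL/api/v1/query" \
    --data-urlencode 'query=api_circuit_breaker_state' \
    | jq -r '.data.result[0].value[1]')
  echo "$(date -u +%H:%M:%S) api_circuit_breaker_state=$state"
  [ "$state" = "0" ] && break
  sleep 5
done
```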
Messaging Circuit Breaker Open
Severity: Critical
Symptoms:
- Orchestration cannot enqueue steps or send task finalizations
- Workers cannot receive step messages or send results
- `messaging_circuit_breaker_state` = 2
- `MessagingError::CircuitBreakerOpen` in logs
Immediate Actions:
1. Check messaging backend health
   ```bash
   # For PGMQ (default)
   psql $PGMQ_DATABASE_URL -c "SELECT * FROM pgmq.meta LIMIT 5"

   # For RabbitMQ
   rabbitmqctl status
   ```
2. Check PGMQ database connectivity (may differ from the main database)
   ```bash
   psql $PGMQ_DATABASE_URL -c "SELECT 1"
   ```
3. Check recent messaging errors
   ```bash
   kubectl logs -l app=tasker-orchestration --tail=100 | grep -E "messaging|circuit_breaker|CircuitBreakerOpen"
   ```
Impact:
- Orchestration: Task initialization stalls, step results cannot be received, task finalizations blocked
- Workers: Step messages not received, results cannot be sent back to orchestration
- Safety: Messages remain in queues protected by visibility timeouts; no data loss occurs (see the backlog check below)
- Health checks: Unaffected (bypass circuit breaker to detect recovery)
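To confirm that messages are parked rather than lost, inspect the queue backlog directly. pgmq ships metrics helpers; assuming your installed pgmq version provides `pgmq.metrics_all()` (check your version), a quick check looks like:

```bash
# Backlog and oldest-message age for every pgmq queue.
# pgmq.metrics_all() is assumed to exist in the installed pgmq version.
psql "$PGMQ_DATABASE_URL" -c \
  "SELECT queue_name, queue_length, oldest_msg_age_sec FROM pgmq.metrics_all();"
```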
Recovery:
- The circuit breaker automatically tests recovery after `timeout_seconds` (default: 30s)
- On recovery, queued messages are processed normally (visibility timeouts protect against loss)
- If the messaging backend is unhealthy, fix it first; the breaker protects against cascading timeouts
Escalation:
- If breaker remains open > 5 min after backend recovery: Escalate to engineering
- If both the API and messaging breakers are open simultaneously: likely a database-wide issue; escalate to DBA
API Rejections High
Severity: Warning
Symptoms:
- Clients receiving 429 or 503 responses
- `api_requests_rejected_total` increasing
Diagnosis:
1. Check the rejection reason distribution
   ```promql
   sum by (reason) (rate(api_requests_rejected_total[5m]))
   ```
2. If `reason=rate_limit`: legitimate load spike or client misbehavior
3. If `reason=circuit_breaker`: see Circuit Breaker Open
Actions:
- Rate limit rejections:
  - Identify the high-volume client
  - Consider raising the rate limit or contacting the client
- Circuit breaker rejections:
  - Follow the circuit breaker procedure
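To identify the high-volume client quickly, a `topk` over the rejection counter narrows it down (a sketch, using the same ad-hoc query pattern as above):

```bash
# Top five endpoints by rate-limit rejection rate over the last 5m.
curl -s -G "${PROM_URL:-http://prometheus:9090}/api/v1/query" \
  --data-urlencode 'query=topk(5, sum by (endpoint) (rate(api_requests_rejected_total{reason="rate_limit"}[5m])))' \
  | jq -r '.data.result[] | "\(.metric.endpoint): \(.value[1])/sec"'
```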
Channel Saturation
Severity: Warning → Critical if sustained
Symptoms:
- `mpsc_channel_saturation` > 0.8
- Increased latency
- Potential backpressure cascade
Diagnosis:
1. Identify the saturated channel
   ```promql
   orchestration_command_channel_saturation > 0.8
   ```
2. Check the upstream arrival rate
   ```promql
   rate(orchestration_commands_received_total[5m])
   ```
3. Check the downstream processing rate
   ```promql
   rate(orchestration_commands_processed_total[5m])
   ```
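Saturation means arrivals are outpacing processing; the gap can be computed directly, and a persistently positive value confirms the channel is filling (a sketch, same `PROM_URL` assumption as above):

```bash
# Net accumulation rate: positive values mean the channel is filling.
curl -s -G "${PROM_URL:-http://prometheus:9090}/api/v1/query" \
  --data-urlencode 'query=rate(orchestration_commands_received_total[5m]) - rate(orchestration_commands_processed_total[5m])' \
  | jq -r '.data.result[] | .value[1]'
```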
Actions:
1. Temporary: scale up orchestration replicas
2. Short-term: increase the channel buffer size
   ```toml
   [orchestration.mpsc_channels.command_processor]
   command_buffer_size = 10000  # Increase from 5000
   ```
3. Long-term: investigate why processing is slow
PGMQ Queue Depth High
Severity: Warning → Critical if approaching max
Symptoms:
- `pgmq_queue_depth` growing
- Step execution delays
- Potential OOM if queue grows unbounded
Diagnosis:
1. Identify the growing queue
   ```promql
   pgmq_queue_depth{queue_name=~".*"}
   ```
2. Check worker health
   ```promql
   sum(worker_handler_semaphore_permits_available)
   ```
3. Check for stuck workers
   ```promql
   count(worker_claim_refusals_total) by (worker_id)
   ```
Actions:
1. Scale workers: add more worker replicas (see the kubectl sketch below)
2. Increase handler concurrency (short-term):
   ```toml
   [worker.mpsc_channels.handler_dispatch]
   max_concurrent_handlers = 20  # Increase from 10
   ```
3. Investigate slow handlers: check handler execution latency
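For the scale-out path, the kubectl invocation is standard; the Deployment name below is hypothetical, substitute your own:

```bash
# Scale the worker pool; "tasker-worker" is a hypothetical Deployment name.
kubectl scale deployment tasker-worker --replicas=8
kubectl rollout status deployment tasker-worker
```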
Worker Claim Refusals High
Severity: Warning
Symptoms:
- `worker_claim_refusals_total` increasing
- Workers at capacity
- Step execution delayed
Diagnosis:
1. Check handler permit usage
   ```promql
   worker_handler_semaphore_permits_available
   ```
2. Check handler execution time
   ```promql
   histogram_quantile(0.99, sum by (le) (rate(worker_handler_execution_ms_bucket[5m])))
   ```
Actions:
1. Scale workers: add replicas
2. Optimize handlers: if execution time is high
3. Adjust threshold: if refusals are premature
   ```toml
   [worker]
   claim_capacity_threshold = 0.9  # More aggressive claiming
   ```
Handler Wait Time High
Severity: Warning
Symptoms:
- `handler_semaphore_wait_ms` p99 > 1000ms
- Steps waiting for execution
- Increased end-to-end latency
Diagnosis:
1. Check permit utilization
   ```promql
   1 - (worker_handler_semaphore_permits_available / worker_handler_semaphore_permits_total)
   ```
2. Check completion channel saturation
   ```promql
   worker_completion_channel_saturation
   ```
Actions:
1. Increase permits (if CPU/memory allow):
   ```toml
   [worker.mpsc_channels.handler_dispatch]
   max_concurrent_handlers = 15
   ```
2. Optimize handlers: reduce execution time
3. Scale workers: if resources are constrained
Domain Events Dropped
Severity: Informational
Symptoms:
- `domain_events_dropped_total` increasing
- Downstream event consumers missing events
Diagnosis:
1. This is expected behavior under load
2. Check whether the drop rate is excessive
   ```promql
   rate(domain_events_dropped_total[5m]) / rate(domain_events_dispatched_total[5m])
   ```
Actions:
1. If < 1% dropped: normal, no action needed
2. If > 5% dropped: consider increasing the event channel buffer
   ```toml
   [shared.domain_events]
   buffer_size = 20000  # Increase from 10000
   ```

Note: Domain events are non-critical; dropping them does not affect step execution.
Capacity Planning
Determining Appropriate Limits
Command Channel Size
```text
Required buffer = peak_requests_per_second * (avg_processing_time_ms / 1000) * safety_factor
```
Example:
```text
peak_requests_per_second = 100
avg_processing_time_ms   = 50
safety_factor            = 2

Required buffer = 100 * 0.05 * 2 = 10 messages
```
Recommended: 5000 (roughly 500x headroom for bursts)
Handler Concurrency
```text
Optimal concurrency = worker_cpu_cores * (1 + io_wait_ratio)
```
Example:
```text
worker_cpu_cores = 4
io_wait_ratio    = 0.8  (handlers are mostly I/O bound)

Optimal concurrency = 4 * 1.8 = 7.2
```
Recommended: 8-10 permits
PGMQ Queue Depth
```text
Max depth = expected_processing_rate * max_acceptable_delay_seconds
```
Example:
```text
expected_processing_rate = 100 steps/sec
max_acceptable_delay     = 60 seconds

Max depth = 100 * 60 = 6000 messages
```
Recommended: 10000 (headroom for bursts)
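All three formulas are easy to script so current measurements can be plugged in during an incident. A minimal sketch (the input values are the worked examples above, not recommendations):

```bash
# Quick sizing calculator for the three capacity formulas above.
# All inputs are illustrative; substitute your own measurements.
peak_rps=100; avg_ms=50; safety=2
cpu_cores=4; io_wait=0.8
proc_rate=100; max_delay_s=60

awk -v r="$peak_rps" -v m="$avg_ms" -v s="$safety" \
    -v c="$cpu_cores" -v w="$io_wait" \
    -v p="$proc_rate" -v d="$max_delay_s" 'BEGIN {
  printf "command buffer  >= %.0f messages\n", r * (m / 1000) * s
  printf "handler permits ~= %.1f\n",          c * (1 + w)
  printf "max queue depth  = %.0f messages\n", p * d
}'
```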
Grafana Dashboard
Import this dashboard for backpressure monitoring:
```json
{
"dashboard": {
"title": "Tasker Backpressure",
"panels": [
{
"title": "Circuit Breaker State",
"type": "stat",
"targets": [{"expr": "api_circuit_breaker_state"}]
},
{
"title": "API Rejections Rate",
"type": "graph",
"targets": [{"expr": "rate(api_requests_rejected_total[5m])"}]
},
{
"title": "Channel Saturation",
"type": "graph",
"targets": [
{"expr": "orchestration_command_channel_saturation", "legendFormat": "orchestration"},
{"expr": "worker_dispatch_channel_saturation", "legendFormat": "worker-dispatch"},
{"expr": "worker_completion_channel_saturation", "legendFormat": "worker-completion"}
]
},
{
"title": "PGMQ Queue Depth",
"type": "graph",
"targets": [{"expr": "pgmq_queue_depth", "legendFormat": "{{queue_name}}"}]
},
{
"title": "Handler Wait Time (p99)",
"type": "graph",
"targets": [{"expr": "histogram_quantile(0.99, worker_handler_semaphore_wait_ms_bucket)"}]
},
{
"title": "Worker Claim Refusals",
"type": "graph",
"targets": [{"expr": "rate(worker_claim_refusals_total[5m])"}]
}
]
}
}
```
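The JSON above is already shaped as a Grafana import payload, so it can be loaded without the UI via Grafana's dashboard API (`GRAFANA_URL` and `GRAFANA_TOKEN` are assumptions; save the JSON to a file first):

```bash
# Import the dashboard via Grafana's HTTP API.
curl -s -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @tasker-backpressure-dashboard.json   # hypothetical file name
```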
Related Documentation
- Backpressure Architecture - Strategy overview
- MPSC Channel Tuning - Channel configuration
- Worker Event Systems - Worker architecture