MPSC Channel Tuning - Operational Runbook

Last Updated: 2025-12-10
Owner: Platform Engineering
Related: ADR: Bounded MPSC Channels | Circuit Breakers | Backpressure Architecture

Overview

This runbook provides operational guidance for monitoring, diagnosing, and tuning MPSC channel buffer sizes in the tasker-core system. All channels are bounded with configurable capacities to prevent unbounded memory growth.
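
All of these channels follow the same basic pattern. As a minimal sketch of that pattern, assuming the channels are tokio mpsc (the Command type and capacity here are illustrative, not actual tasker-core code):

use tokio::sync::mpsc;

// Hypothetical message type; the real tasker-core types differ.
#[derive(Debug)]
struct Command(u64);

#[tokio::main]
async fn main() {
    // Bounded channel: the capacity comes from configuration.
    let (tx, mut rx) = mpsc::channel::<Command>(1000);

    // Producer side: try_send never blocks. A full channel surfaces
    // immediately as backpressure instead of growing memory.
    match tx.try_send(Command(42)) {
        Ok(()) => {}
        Err(mpsc::error::TrySendError::Full(_cmd)) => {
            // This is the condition mpsc_channel_full_events_total counts.
            eprintln!("channel full - backpressure applied");
        }
        Err(mpsc::error::TrySendError::Closed(_cmd)) => {
            eprintln!("channel closed");
        }
    }

    drop(tx);
    while let Some(cmd) = rx.recv().await {
        println!("processing {:?}", cmd);
    }
}

Because try_send fails fast instead of buffering without bound, a full channel shows up immediately as backpressure that the metrics below can observe.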

Quick Reference

Configuration Files

| File | Purpose | When to Edit |
| --- | --- | --- |
| config/tasker/base/mpsc_channels.toml | Base configuration | Default values |
| config/tasker/environments/test/mpsc_channels.toml | Test overrides | Test environment tuning |
| config/tasker/environments/development/mpsc_channels.toml | Dev overrides | Local development tuning |
| config/tasker/environments/production/mpsc_channels.toml | Prod overrides | Production capacity planning |

Key Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| mpsc_channel_usage_percent | Current fill percentage | > 80% |
| mpsc_channel_capacity | Configured buffer size | N/A (informational) |
| mpsc_channel_full_events_total | Overflow events counter | > 0 (indicates backpressure) |
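
As a rough sketch of where the usage gauge can come from, assuming the channels are tokio mpsc senders (the actual instrumentation in tasker-core may differ):

use tokio::sync::mpsc::Sender;

// Derive a gauge like mpsc_channel_usage_percent from a tokio sender.
// Assumes tokio >= 1.19 for Sender::max_capacity.
fn usage_percent<T>(tx: &Sender<T>) -> f64 {
    let max = tx.max_capacity() as f64; // configured buffer size
    let free = tx.capacity() as f64;    // slots currently available
    ((max - free) / max) * 100.0        // fill percentage
}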

Default Buffer Sizes

| Channel | Base | Test | Development | Production |
| --- | --- | --- | --- | --- |
| Orchestration command | 1000 | 100 | 1000 | 5000 |
| PGMQ notifications | 10000 | 10000 | 10000 | 50000 |
| Task readiness | 1000 | 100 | 500 | 5000 |
| Worker command | 1000 | 1000 | 1000 | 2000 |
| Event publisher | 5000 | 5000 | 5000 | 10000 |
| Ruby FFI | 1000 | 1000 | 500 | 2000 |
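
For orientation, a minimal sketch of how a configured size typically flows from the TOML into channel construction; the struct and key names are hypothetical, not the real tasker-core schema:

use serde::Deserialize;
use tokio::sync::mpsc;

// Hypothetical mapping of a [mpsc_channels.*] TOML entry onto a capacity.
#[derive(Deserialize)]
struct CommandProcessorConfig {
    command_buffer_size: usize,
}

fn build_channel(raw: &str) -> (mpsc::Sender<String>, mpsc::Receiver<String>) {
    let cfg: CommandProcessorConfig =
        toml::from_str(raw).expect("invalid channel config");
    // The parsed size bounds the channel at construction time.
    mpsc::channel(cfg.command_buffer_size)
}

fn main() {
    let (_tx, _rx) = build_channel("command_buffer_size = 1000");
}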

Monitoring and Alerting

Critical: Channel Saturation

# Alert when any channel is >90% full for 5 minutes
mpsc_channel_usage_percent > 90

Action: Increase capacity immediately or identify the bottleneck

Warning: Channel High Usage

# Alert when any channel is >80% full for 15 minutes
mpsc_channel_usage_percent > 80

Action: Plan capacity increase, investigate throughput

Info: Channel Overflow Events

# Alert on any overflow events
rate(mpsc_channel_full_events_total[5m]) > 0

Action: Review backpressure handling, consider capacity increase

Grafana Queries

Channel Usage by Component

max by (channel, component) (mpsc_channel_usage_percent)

Channel Capacity Configuration

max by (channel, component) (mpsc_channel_capacity)

Overflow Event Rate

rate(mpsc_channel_full_events_total[5m])

Log Patterns

Saturation Warning (80% full)

WARN mpsc_channel_saturation channel=orchestration_command usage_percent=82.5

Overflow Event (channel full)

ERROR mpsc_channel_full channel=event_publisher action=dropped

Backpressure Applied

ERROR Ruby FFI event channel full - backpressure applied
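
Structured lines like these are the kind of output the tracing crate produces; a minimal sketch (field names mirror the patterns above, but the actual call sites in tasker-core may differ):

use tracing::{error, warn};

// Emit a saturation warning once usage crosses the 80% threshold.
fn report_saturation(channel: &str, usage_percent: f64) {
    if usage_percent >= 80.0 {
        warn!(channel, usage_percent, "mpsc_channel_saturation");
    }
}

// Emit an overflow event when a message had to be dropped.
fn report_overflow(channel: &str) {
    error!(channel, action = "dropped", "mpsc_channel_full");
}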

Common Issues and Solutions

Issue 1: High Channel Saturation

Symptoms:

  • mpsc_channel_usage_percent consistently > 80%
  • Slow message processing
  • Increased latency

Diagnosis:

  1. Check which channel is saturated:

    # Grep logs for saturation warnings
    grep "mpsc_channel_saturation" logs/tasker.log | tail -20
    
  2. Check metrics for specific channel:

    mpsc_channel_usage_percent{channel="orchestration_command"}
    

Solutions:

Short-term (Immediate Relief):

# Edit appropriate environment file
# Example: config/tasker/environments/production/mpsc_channels.toml

[mpsc_channels.orchestration.command_processor]
command_buffer_size = 10000  # Increase from 5000

Long-term:

  • Investigate message producer rate
  • Optimize message consumer processing
  • Consider horizontal scaling

Issue 2: PGMQ Notification Bursts

Symptoms:

  • Spike in mpsc_channel_usage_percent{channel="pgmq_notifications"} during bulk task creation (1000+ tasks)
  • Temporary saturation followed by recovery

Diagnosis:

  1. Correlate with bulk task operations:

    # Check for bulk task creation in logs
    grep "Bulk task creation" logs/tasker.log
    
  2. Verify buffer size configuration:

    # Check current production configuration
    cat config/tasker/environments/production/mpsc_channels.toml | \
      grep -A 2 "event_listeners"
    

Solutions:

If production buffer < 50000:

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.orchestration.event_listeners]
pgmq_event_buffer_size = 50000  # Recommended for production

If already at 50000 and still saturating:

  • Consider notification coalescing (future feature)
  • Implement batch notification processing (see the sketch after this list)
  • Scale orchestration services horizontally
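
A minimal sketch of the batch-processing option, assuming tokio >= 1.37 for Receiver::recv_many (the channel and message types are illustrative):

use tokio::sync::mpsc;

// Drain up to `limit` queued notifications per wakeup. This keeps the
// buffer low during bursts without changing the producer side.
async fn drain_notifications(mut rx: mpsc::Receiver<String>) {
    let mut batch = Vec::with_capacity(256);
    // recv_many waits for at least one message, then takes what is queued;
    // it returns 0 only when the channel is closed and empty.
    while rx.recv_many(&mut batch, 256).await > 0 {
        handle_batch(&batch);
        batch.clear();
    }
}

fn handle_batch(batch: &[String]) {
    println!("processed {} notifications", batch.len());
}

Draining in batches cuts per-message wakeup overhead, which is usually what lets the consumer keep up during notification bursts.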

Issue 3: Ruby FFI Backpressure

Symptoms:

  • Errors: “Ruby FFI event channel full - backpressure applied”
  • Ruby handler slowness
  • Increased Rust-side latency

Diagnosis:

  1. Check Ruby handler processing time:

    # Add timing to Ruby handlers
    time_start = Time.now
    result = handler.execute(step)
    duration = Time.now - time_start
    logger.warn("Slow handler: #{duration}s") if duration > 1.0
    
  2. Check FFI channel saturation:

    mpsc_channel_usage_percent{channel="ruby_ffi"}
    

Solutions:

If Ruby handlers are slow:

  • Optimize Ruby handler code
  • Consider async Ruby processing
  • Profile Ruby handler performance

If FFI buffer too small:

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.shared.ffi]
ruby_event_buffer_size = 2000  # Increase from 1000

If Rust-side producing too fast:

  • Add rate limiting to Rust event production (sketch below)
  • Batch events before FFI crossing
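
A minimal sketch of the rate-limiting option, using tokio::time::interval to cap producer throughput (the interval and event type are illustrative):

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::interval;

// Cap event production ahead of the FFI channel: one send per tick,
// i.e. at most ~100 events/sec with a 10ms interval.
async fn produce_events(tx: mpsc::Sender<String>) {
    let mut tick = interval(Duration::from_millis(10));
    for i in 0..1000u32 {
        tick.tick().await;
        if tx.send(format!("event-{i}")).await.is_err() {
            break; // receiver gone
        }
    }
}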

Issue 4: Event Publisher Drops

Symptoms:

  • Counter increasing: mpsc_channel_full_events_total{channel="event_publisher"}
  • Log warnings: “Event channel full, dropping event”

Diagnosis:

  1. Check drop rate:

    rate(mpsc_channel_full_events_total{channel="event_publisher"}[5m])
    
  2. Identify event types being dropped:

    grep "dropping event" logs/tasker.log | awk '{print $NF}' | sort | uniq -c
    

Solutions:

If drops are rare (< 1/min):

  • Acceptable for non-critical events
  • Monitor but no action needed

If drops are frequent (> 10/min):

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.shared.event_publisher]
event_queue_buffer_size = 20000  # Increase from 10000

If drops are continuous:

  • Investigate event storm cause
  • Consider event sampling/filtering (sketch below)
  • Review event subscriber performance
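
A minimal sketch of the sampling option: a counter-based 1-in-N filter in front of the publisher channel (the wrapper type and rate are illustrative):

use tokio::sync::mpsc;

// Forward only every Nth event; `every` must be >= 1.
// Deterministic sampling keeps some signal while shedding load.
struct SampledSender<T> {
    tx: mpsc::Sender<T>,
    every: u64,
    seen: u64,
}

impl<T> SampledSender<T> {
    fn try_publish(&mut self, event: T) {
        self.seen += 1;
        if self.seen % self.every != 0 {
            return; // sampled out
        }
        // Still non-blocking: a full channel drops rather than stalls.
        let _ = self.tx.try_send(event);
    }
}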

Capacity Planning

Sizing Formula

General guideline:

buffer_size = (peak_message_rate_per_sec * avg_processing_time_sec) * safety_factor

Where:

  • peak_message_rate_per_sec: Expected peak throughput
  • avg_processing_time_sec: Average consumer processing time
  • safety_factor: 2-5x for bursts

Example calculation:

# Orchestration command channel
peak_rate = 500 messages/sec
processing_time = 0.01 sec (10ms)
safety_factor = 2x

buffer_size = (500 * 0.01) * 2 = 10 messages minimum
# Use 1000 for burst handling
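
The same guideline as a small helper, if you prefer to script the calculation (the function name and rounding policy are illustrative):

// Sizing guideline from above: floor before rounding up for bursts.
fn buffer_size(peak_rate_per_sec: f64, avg_processing_sec: f64, safety_factor: f64) -> usize {
    ((peak_rate_per_sec * avg_processing_sec) * safety_factor).ceil() as usize
}

fn main() {
    // Orchestration command example: 500 msg/s, 10ms, 2x safety => 10.
    let floor = buffer_size(500.0, 0.01, 2.0);
    println!("minimum buffer: {floor}; round up (e.g., 1000) for bursts");
}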

Environment-Specific Guidelines

Test Environment:

  • Use small buffers (100-500)
  • Exposes backpressure issues early
  • Forces proper error handling

Development Environment:

  • Use moderate buffers (500-1000)
  • Balances local resource usage
  • Mimics test environment behavior

Production Environment:

  • Use large buffers (2000-50000)
  • Handles real-world burst traffic
  • Prioritizes availability over memory

When to Increase Buffer Sizes

Increase if:

  • ✅ Saturation > 80% for extended periods
  • ✅ Overflow events occur regularly
  • ✅ Latency increases during peak load
  • ✅ Known traffic increase incoming

Don’t increase if:

  • ❌ Consumer is bottleneck (fix consumer instead)
  • ❌ Saturation is brief and recovers quickly
  • ❌ Would mask underlying performance issue

Configuration Change Procedure

1. Identify Need

Review metrics and logs to determine which channel needs adjustment.

2. Calculate New Size

Use sizing formula or apply percentage increase:

new_size = current_size * (current_usage_percent / target_usage_percent)

# Example: Currently 90% full, target 70%
new_size = 5000 * (90 / 70) ≈ 6,430
# Round up: 10,000 (a 2x increase preserves burst headroom)

3. Update Configuration

Important: Environment overrides MUST use the full [mpsc_channels.*] prefix!

# ✅ CORRECT
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 20000

# ❌ WRONG - creates conflicting top-level key
[orchestration.command_processor]
command_buffer_size = 20000

4. Deploy

Local/Development:

# Restart service - picks up new config automatically
cargo run --package tasker-orchestration --bin tasker-server --features web-api

Production:

# Standard deployment process
# Configuration is loaded at service startup
kubectl rollout restart deployment/tasker-orchestration

5. Monitor

Watch metrics for 1-2 hours post-change:

  • Channel usage percentage should decrease
  • Overflow events should stop
  • Latency should improve

6. Document

Update this runbook with:

  • Why the change was made
  • The new values
  • Observed impact

Troubleshooting Checklist

□ Check metric: mpsc_channel_usage_percent for affected channel
□ Review logs for saturation warnings in last 24 hours
□ Verify configuration file has correct [mpsc_channels] prefix
□ Confirm environment variable TASKER_ENV matches intended environment
□ Check if issue correlates with specific operations (bulk tasks, etc.)
□ Verify consumer processing time hasn't increased
□ Check for resource constraints (CPU, memory)
□ Review recent code changes that might affect throughput
□ Consider if horizontal scaling is needed vs buffer increase

Emergency Response

Critical Saturation (>95%)

Immediate Actions:

  1. Increase buffer size by 2-5x in production config
  2. Deploy immediately via rolling restart
  3. Page on-call if service degradation visible

Example:

# Edit config
vim config/tasker/environments/production/mpsc_channels.toml

# Deploy
kubectl rollout restart deployment/tasker-orchestration

# Monitor
watch -n 5 'curl -s localhost:9090/api/v1/query?query=mpsc_channel_usage_percent | jq'

Service Unresponsive Due to Backpressure

Symptoms:

  • All channels showing 100% usage
  • No message processing
  • Health checks failing

Actions:

  1. Check for downstream bottleneck (database, queue service)
  2. Scale out consumer services
  3. Temporarily increase all buffer sizes
  4. Check circuit breaker states (/health/detailed endpoint) - if circuit breakers are open, address underlying database/service issues first

Note: MPSC channels and circuit breakers are complementary resilience mechanisms. Channel saturation indicates internal backpressure, while circuit breaker state indicates external service health. See Circuit Breakers for operational guidance.

Best Practices

  1. Monitor Proactively: Don’t wait for alerts - review metrics weekly
  2. Test Changes in Dev: Validate buffer changes in development first
  3. Document Rationale: Note why each production override exists
  4. Gradual Increases: Prefer 2x increases over 10x jumps
  5. Review Quarterly: Adjust defaults based on production patterns
  6. Alert on Changes: Get notified of configuration file commits

Support

  • Questions? Ask in the #platform-engineering Slack channel
  • Issues? File a ticket with the label infrastructure/channels
  • Escalation? Page on-call via PagerDuty