Deployment Patterns and Configuration
Last Updated: 2026-01-15 · Audience: Architects, Operators · Status: Active · Related Docs: Documentation Hub | Quick Start | Observability | Messaging Abstraction
Overview
Tasker Core supports three deployment modes, each optimized for different operational requirements and infrastructure constraints. This guide covers deployment patterns, configuration management, and production considerations.
Key Deployment Modes:
- Hybrid Mode (Recommended) - Event-driven with polling fallback
- EventDrivenOnly Mode - Pure event-driven for lowest latency
- PollingOnly Mode - Traditional polling for restricted environments
Messaging Backend Options:
- PGMQ (Default) - PostgreSQL-based, single infrastructure dependency
- RabbitMQ - AMQP broker, higher throughput for high-volume scenarios
Messaging Backend Selection
Tasker Core supports multiple messaging backends through a provider-agnostic abstraction layer. The choice of backend affects deployment architecture and operational requirements.
Backend Comparison
| Feature | PGMQ | RabbitMQ |
|---|---|---|
| Infrastructure | PostgreSQL only | PostgreSQL + RabbitMQ |
| Delivery Model | Poll + pg_notify signals | Native push (basic_consume) |
| Fallback Polling | Required for reliability | Not needed |
| Throughput | Good | Higher |
| Latency | Low (~10-50ms) | Lowest (~5-20ms) |
| Operational Complexity | Lower | Higher |
| Message Persistence | PostgreSQL transactions | RabbitMQ durability |
PGMQ (Default)
PostgreSQL Message Queue is the default backend, ideal for:
- Simpler deployments: Single database dependency
- Transactional workflows: Messages participate in PostgreSQL transactions
- Small to medium scale: Excellent for most workloads
Configuration:
# Default - no additional configuration needed
TASKER_MESSAGING_BACKEND=pgmq
Deployment Mode Interaction:
- Uses `pg_notify` for real-time notifications
- Fallback polling recommended for reliability
- Hybrid mode provides best balance
RabbitMQ
AMQP-based messaging for high-throughput scenarios:
- High-volume workloads: Better throughput characteristics
- Existing RabbitMQ infrastructure: Leverage existing investments
- Pure push delivery: No fallback polling required
Configuration:
TASKER_MESSAGING_BACKEND=rabbitmq
RABBITMQ_URL=amqp://user:password@rabbitmq:5672/%2F
Deployment Mode Interaction:
- EventDrivenOnly mode is a natural fit (no fallback needed)
- Native push delivery via `basic_consume()` (sketched below)
- Protocol-guaranteed message delivery
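To make the push model concrete, here is a minimal consumer sketch using the lapin crate. The queue name and handler wiring are hypothetical illustrations, not Tasker Core's actual consumer:

```rust
use futures_lite::stream::StreamExt;
use lapin::{options::*, types::FieldTable, Connection, ConnectionProperties};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect using the same RABBITMQ_URL form shown above.
    let url = std::env::var("RABBITMQ_URL")?;
    let conn = Connection::connect(&url, ConnectionProperties::default()).await?;
    let channel = conn.create_channel().await?;

    // basic_consume registers a push-based consumer: the broker delivers
    // messages as they arrive, so no fallback polling loop is required.
    let mut consumer = channel
        .basic_consume(
            "orchestration_step_results", // hypothetical queue name
            "tasker-consumer",
            BasicConsumeOptions::default(),
            FieldTable::default(),
        )
        .await?;

    while let Some(delivery) = consumer.next().await {
        let delivery = delivery?;
        // ... dispatch the message to a handler here ...
        delivery.ack(BasicAckOptions::default()).await?;
    }
    Ok(())
}
```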
Choosing a Backend
Decision Tree:
┌─────────────────┐
│ Do you need the │
│ highest possible │
│ throughput? │
└────────┬────────┘
│
┌──────────┴──────────┐
│ │
Yes No
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Do you have │ │ Use PGMQ │
│ existing │ │ (simpler ops) │
│ RabbitMQ? │ └────────────────┘
└───────┬────────┘
│
┌──────────┴──────────┐
│ │
Yes No
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Use RabbitMQ │ │ Evaluate │
└────────────────┘ │ operational │
│ tradeoffs │
└────────────────┘
Recommendation: Start with PGMQ. Migrate to RabbitMQ only when throughput requirements demand it.
Production Deployment Strategy: Mixed Mode Architecture
Important: In production-grade Kubernetes environments, you typically run multiple orchestration containers simultaneously with different deployment modes. This is not just about horizontal scaling with identical configurations—it’s about deploying containers with different coordination strategies to optimize for both throughput and reliability.
Recommended Production Pattern
High-Throughput + Safety Net Architecture:
# Most orchestration containers in EventDrivenOnly mode for maximum throughput
- EventDrivenOnly containers: 8-12 replicas (handles 80-90% of workload)
- PollingOnly containers: 2-3 replicas (safety net for missed events)
Why this works:
- EventDrivenOnly containers handle the bulk of work with ~10ms latency
- PollingOnly containers catch any events that might be missed during network issues or LISTEN/NOTIFY failures
- Both sets of containers coordinate through atomic SQL operations (no conflicts; see the sketch after this list)
- Scale each mode independently based on throughput needs
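The "atomic SQL operations" that make this safe are conceptually a claim-one-row pattern. Here is a minimal sketch with sqlx and PostgreSQL's `FOR UPDATE SKIP LOCKED`; the table and column names are illustrative, not Tasker Core's actual schema:

```rust
use sqlx::PgPool;

/// Claims one ready task so that concurrent orchestration containers
/// (event-driven or polling) never process the same row twice.
async fn claim_next_task(pool: &PgPool) -> Result<Option<i64>, sqlx::Error> {
    let row: Option<(i64,)> = sqlx::query_as(
        r#"
        UPDATE tasks SET state = 'claimed', claimed_at = now()
        WHERE id = (
            SELECT id FROM tasks
            WHERE state = 'ready'
            ORDER BY created_at
            LIMIT 1
            FOR UPDATE SKIP LOCKED  -- contenders skip the locked row instead of blocking
        )
        RETURNING id
        "#,
    )
    .fetch_optional(pool)
    .await?;
    Ok(row.map(|(id,)| id))
}
```

Because each row is claimed exactly once regardless of which container reaches it first, EventDrivenOnly and PollingOnly replicas can race on the same queues without duplicate processing.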
Alternative: All-Hybrid Deployment
You can also deploy all containers in Hybrid mode and scale horizontally:
# All containers use Hybrid mode
- Hybrid containers: 10-15 replicas
This is simpler but less flexible. The mixed-mode approach lets you:
- Tune for specific workload patterns (event-heavy vs. polling-heavy)
- Adapt to infrastructure constraints (some networks better for events, others for polling)
- Optimize resource usage (EventDrivenOnly uses less CPU than Hybrid)
- Scale dimensions independently (scale up event listeners without scaling pollers)
Key Insight
The different deployment modes exist not just for config tuning, but to enable sophisticated deployment strategies where you mix coordination approaches across containers to meet production throughput and reliability requirements.
Deployment Mode Comparison
| Feature | Hybrid | EventDrivenOnly | PollingOnly |
|---|---|---|---|
| Latency | Low (event-driven primary) | Lowest (~10ms) | Higher (~100-500ms) |
| Reliability | Highest (automatic fallback) | Good (requires stable connections) | Good (no dependencies) |
| Resource Usage | Medium (listeners + pollers) | Low (listeners only) | Medium (pollers only) |
| Network Requirements | Standard PostgreSQL | Persistent connections required | Standard PostgreSQL |
| Production Recommended | ✅ Yes | ⚠️ With stable network | ⚠️ For restricted environments |
| Complexity | Medium | Low | Low |
Hybrid Mode (Recommended)
Overview
Hybrid mode combines the best of both worlds: event-driven coordination for real-time performance with polling fallback for reliability.
How it works:
- PostgreSQL LISTEN/NOTIFY provides real-time event notifications
- If event listeners fail or lag, polling automatically takes over
- System continuously monitors and switches between modes (switching rule sketched below)
- No manual intervention required
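A minimal sketch of the switching rule implied by `fallback_activation_threshold_ms` (hypothetical types, not the actual implementation):

```rust
use std::time::{Duration, Instant};

/// Tracks event freshness; if nothing arrives within the threshold,
/// polling takes over until events resume.
struct HybridMonitor {
    last_event: Instant,
    fallback_threshold: Duration, // fallback_activation_threshold_ms
    polling_active: bool,
}

impl HybridMonitor {
    fn on_event(&mut self) {
        self.last_event = Instant::now();
        self.polling_active = false; // events flowing again: event-driven resumes
    }

    fn tick(&mut self) {
        if self.last_event.elapsed() > self.fallback_threshold {
            self.polling_active = true; // listener silent or lagging: poll instead
        }
    }
}
```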
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
[orchestration.hybrid]
# Event listener settings
enable_event_listeners = true
listener_reconnect_interval_ms = 5000
listener_health_check_interval_ms = 30000
# Polling fallback settings
enable_polling_fallback = true
polling_interval_ms = 1000
fallback_activation_threshold_ms = 5000
# Worker event settings
[orchestration.worker_events]
enable_worker_listeners = true
worker_listener_reconnect_ms = 5000
When to Use Hybrid Mode
Ideal for:
- Production deployments requiring high reliability
- Environments with occasional network instability
- Systems requiring both low latency and guaranteed delivery
- Multi-region deployments with variable network quality
Example: Production E-commerce Platform
# docker-compose.production.yml
version: '3.8'
services:
orchestration:
image: tasker-orchestration:latest
environment:
- TASKER_ENV=production
- TASKER_DEPLOYMENT_MODE=Hybrid
- DATABASE_URL=postgresql://tasker:${DB_PASSWORD}@postgres:5432/tasker_production
- RUST_LOG=info
deploy:
replicas: 3
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 5s
retries: 3
postgres:
image: postgres:16
environment:
- POSTGRES_DB=tasker_production
- POSTGRES_PASSWORD=${DB_PASSWORD}
volumes:
- postgres-data:/var/lib/postgresql/data
deploy:
resources:
limits:
cpus: '4'
memory: 8G
volumes:
postgres-data:
Monitoring Hybrid Mode
Key Metrics:
// Hybrid mode health indicators
tasker_event_listener_active{mode="hybrid"} = 1      // Listener is active
tasker_event_listener_lag_ms{mode="hybrid"} < 100    // Event lag is acceptable
tasker_polling_fallback_active{mode="hybrid"} = 0    // Not in fallback mode
tasker_mode_switches_total{mode="hybrid"} < 10/hour  // Infrequent mode switching
Alert conditions:
- Event listener down for > 60 seconds
- Polling fallback active for > 5 minutes
- Mode switches > 20 per hour (indicates instability)
EventDrivenOnly Mode
Overview
EventDrivenOnly mode provides the lowest possible latency by relying entirely on PostgreSQL LISTEN/NOTIFY for coordination.
How it works:
- Orchestration and workers establish persistent PostgreSQL connections
- LISTEN on specific channels for events (see the sketch below)
- Immediate notification on queue changes
- No polling overhead or delay
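A minimal sketch of this loop using sqlx's `PgListener`; the channel name matches the `channels` list in the configuration below, while the dispatch step is illustrative:

```rust
use sqlx::postgres::PgListener;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut listener = PgListener::connect(&std::env::var("DATABASE_URL")?).await?;
    listener.listen("pgmq_message_ready.orchestration").await?;

    loop {
        // recv() blocks until pg_notify fires -- there is no polling interval,
        // which is where the ~10ms latency comes from.
        let notification = listener.recv().await?;
        println!("queue ready: {}", notification.payload());
        // ... dispatch to the orchestration executor here ...
    }
}
```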
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "EventDrivenOnly"
[orchestration.event_driven]
# Listener configuration
listener_reconnect_interval_ms = 2000
listener_health_check_interval_ms = 15000
max_reconnect_attempts = 10
# Event channels
channels = [
"pgmq_message_ready.orchestration",
"pgmq_message_ready.*",
"pgmq_queue_created"
]
# Connection pool for listeners
listener_pool_size = 5
connection_timeout_ms = 5000
When to Use EventDrivenOnly Mode
Ideal for:
- High-throughput, low-latency requirements
- Stable network environments
- Development and testing environments
- Systems with reliable PostgreSQL infrastructure
Not recommended for:
- Unstable network connections
- Environments with frequent PostgreSQL failovers
- Systems requiring guaranteed operation during network issues
Example: High-Performance Payment Processing
use tasker_worker::WorkerConfig;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Worker configuration for event-driven mode
    let config = WorkerConfig {
        deployment_mode: DeploymentMode::EventDrivenOnly,
        namespaces: vec!["payments".to_string()],
        event_driven_settings: EventDrivenSettings {
            listener_reconnect_interval_ms: 2000,
            health_check_interval_ms: 15000,
            max_reconnect_attempts: 10,
        },
        ..Default::default()
    };

    // Start worker with event-driven mode
    let worker = WorkerCore::from_config(config).await?;
    worker.start().await?;
    Ok(())
}
Monitoring EventDrivenOnly Mode
Critical Metrics:
// Event-driven health indicators
tasker_event_listener_active{mode="event_driven"} = 1  // Must be 1
tasker_event_notifications_received_total              // Should be > 0
tasker_event_processing_duration_seconds               // Should be < 0.01
tasker_listener_reconnections_total                    // Should be low
Alert conditions:
- Event listener inactive
- No events received for > 60 seconds (when activity expected)
- Reconnections > 5 per hour
PollingOnly Mode
Overview
PollingOnly mode provides the most reliable operation in restricted or unstable network environments by using traditional polling.
How it works:
- Orchestration and workers poll message queues at regular intervals
- No dependency on persistent connections or LISTEN/NOTIFY
- Configurable polling intervals for performance/resource trade-offs
- Automatic retry and backoff on failures (backoff schedule sketched below)
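The backoff schedule implied by the `error_backoff_*` settings in the configuration below can be computed as follows (a sketch of the arithmetic, not the actual implementation):

```rust
use std::time::Duration;

/// error_backoff_base_ms = 1000, error_backoff_max_ms = 30000,
/// error_backoff_multiplier = 2.0
fn backoff_delay(consecutive_errors: u32) -> Duration {
    let base_ms = 1_000.0;
    let max_ms = 30_000_u64;
    let delay_ms = base_ms * 2.0_f64.powi(consecutive_errors as i32);
    Duration::from_millis((delay_ms as u64).min(max_ms))
}

// backoff_delay(0) = 1s, backoff_delay(1) = 2s, backoff_delay(2) = 4s,
// ... capped at 30s from the fifth consecutive error onward.
```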
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
# Polling intervals
task_request_poll_interval_ms = 1000
step_result_poll_interval_ms = 500
finalization_poll_interval_ms = 2000
# Batch processing
batch_size = 10
max_messages_per_poll = 100
# Backoff on errors
error_backoff_base_ms = 1000
error_backoff_max_ms = 30000
error_backoff_multiplier = 2.0
When to Use PollingOnly Mode
Ideal for:
- Restricted network environments (firewalls blocking persistent connections)
- Environments with frequent PostgreSQL connection issues
- Systems prioritizing reliability over latency
- Legacy infrastructure with limited LISTEN/NOTIFY support
Not recommended for:
- High-frequency, low-latency requirements
- Systems with strict resource constraints
- Environments where polling overhead is problematic
Example: Batch Processing System
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
# Longer intervals for batch processing
task_request_poll_interval_ms = 5000
step_result_poll_interval_ms = 2000
finalization_poll_interval_ms = 10000
# Large batches for efficiency
batch_size = 50
max_messages_per_poll = 500
Monitoring PollingOnly Mode
Key Metrics:
// Polling health indicators
tasker_polling_cycles_total              // Should be increasing
tasker_polling_messages_processed_total  // Should be > 0
tasker_polling_duration_seconds          // Should be stable
tasker_polling_errors_total              // Should be low
Alert conditions:
- Polling stopped (no cycles in last 60 seconds)
- Polling duration > 10x interval (indicates overload)
- Error rate > 5% of polling cycles
Configuration Management
Component-Based Configuration
Tasker Core uses a component-based TOML configuration system with environment-specific overrides.
Configuration Structure:
config/tasker/
├── base/ # Base configuration (all environments)
│ ├── database.toml # Database connection pool settings
│ ├── orchestration.toml # Orchestration and deployment mode
│ ├── circuit_breakers.toml # Circuit breaker thresholds
│ ├── executor_pools.toml # Executor pool sizing
│ ├── pgmq.toml # Message queue configuration
│ ├── query_cache.toml # Query caching settings
│ └── telemetry.toml # Metrics and logging
│
└── environments/ # Environment-specific overrides
├── development/
│ └── *.toml # Development overrides
├── test/
│ └── *.toml # Test overrides
└── production/
└── *.toml # Production overrides
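Conceptually, each environment file is deep-merged over its base counterpart, with environment keys winning. A minimal sketch of that merge using the `toml` crate (illustrative only, not the actual loader):

```rust
use toml::Value;

/// Recursively overlays environment values onto the base table;
/// scalars and arrays in the override replace base values wholesale.
fn merge(base: &mut Value, overlay: Value) {
    match (base, overlay) {
        (Value::Table(base_map), Value::Table(overlay_map)) => {
            for (key, overlay_val) in overlay_map {
                match base_map.get_mut(&key) {
                    Some(base_val) => merge(base_val, overlay_val),
                    None => {
                        base_map.insert(key, overlay_val);
                    }
                }
            }
        }
        (base_slot, overlay_val) => *base_slot = overlay_val,
    }
}
```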
Environment Detection
# Set environment via TASKER_ENV
export TASKER_ENV=production
# Validate configuration
cargo run --bin config-validator
# Expected output:
# ✓ Configuration loaded successfully
# ✓ Environment: production
# ✓ Deployment mode: Hybrid
# ✓ Database pool: 50 connections
# ✓ Circuit breakers: 10 configurations
Example: Production Configuration
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
max_concurrent_tasks = 1000
task_timeout_seconds = 3600
[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true
polling_interval_ms = 2000
fallback_activation_threshold_ms = 10000
[orchestration.health]
health_check_interval_ms = 30000
unhealthy_threshold = 3
recovery_threshold = 2
# config/tasker/environments/production/database.toml
[database]
max_connections = 50
min_connections = 10
connection_timeout_ms = 5000
idle_timeout_seconds = 600
max_lifetime_seconds = 1800
[database.query_cache]
enabled = true
max_size = 1000
ttl_seconds = 300
# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5
timeout_seconds = 60
half_open_timeout_seconds = 30
[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60
Docker Compose Deployment
Development Setup
# docker-compose.yml
version: '3.8'
services:
postgres:
image: postgres:16
environment:
POSTGRES_USER: tasker
POSTGRES_PASSWORD: tasker
POSTGRES_DB: tasker_rust_test
ports:
- "5432:5432"
volumes:
- postgres-data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U tasker"]
interval: 5s
timeout: 5s
retries: 5
orchestration:
build:
context: .
target: orchestration
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8080:8080"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
worker:
build:
context: .
target: worker
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8081:8081"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
ruby-worker:
build:
context: ./workers/ruby
dockerfile: Dockerfile
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8082:8082"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
volumes:
postgres-data:
Production Deployment
# docker-compose.production.yml
version: '3.8'
services:
postgres:
image: postgres:16
environment:
POSTGRES_USER: tasker
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
POSTGRES_DB: tasker_production
volumes:
- postgres-data:/var/lib/postgresql/data
deploy:
placement:
constraints:
- node.labels.database == true
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '2'
memory: 4G
secrets:
- db_password
orchestration:
image: tasker-orchestration:${VERSION}
environment:
- TASKER_ENV=production
- DATABASE_URL_FILE=/run/secrets/database_url
deploy:
replicas: 3
update_config:
parallelism: 1
delay: 10s
order: start-first
rollback_config:
parallelism: 0
order: stop-first
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
secrets:
- database_url
worker:
image: tasker-worker:${VERSION}
environment:
- TASKER_ENV=production
- DATABASE_URL_FILE=/run/secrets/database_url
deploy:
replicas: 5
resources:
limits:
cpus: '1'
memory: 1G
reservations:
cpus: '0.5'
memory: 512M
secrets:
- database_url
secrets:
db_password:
external: true
database_url:
external: true
volumes:
postgres-data:
driver: local
Kubernetes Deployment
Mixed-Mode Production Deployment (Recommended)
This example demonstrates the recommended production pattern: multiple orchestration deployments with different modes.
# k8s/orchestration-event-driven.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration-event-driven
namespace: tasker
labels:
app: tasker-orchestration
mode: event-driven
spec:
replicas: 10 # Majority of orchestration capacity
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
mode: event-driven
template:
metadata:
labels:
app: tasker-orchestration
mode: event-driven
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DEPLOYMENT_MODE
value: "EventDrivenOnly" # High-throughput mode
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
resources:
requests:
cpu: 500m # Lower CPU for event-driven
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# k8s/orchestration-polling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration-polling
namespace: tasker
labels:
app: tasker-orchestration
mode: polling
spec:
replicas: 3 # Safety net for missed events
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
mode: polling
template:
metadata:
labels:
app: tasker-orchestration
mode: polling
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DEPLOYMENT_MODE
value: "PollingOnly" # Reliability safety net
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
resources:
requests:
cpu: 750m # Higher CPU for polling
memory: 512Mi
limits:
cpu: 1500m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# k8s/orchestration-service.yaml
apiVersion: v1
kind: Service
metadata:
name: tasker-orchestration
namespace: tasker
spec:
selector:
app: tasker-orchestration # Matches BOTH deployments
ports:
- port: 8080
targetPort: 8080
protocol: TCP
name: http
type: ClusterIP
Key points about this mixed-mode deployment:
- 10 EventDrivenOnly pods handle 80-90% of work with ~10ms latency
- 3 PollingOnly pods catch anything missed by event listeners
- Single service load balances across all 13 pods
- No conflicts - atomic SQL operations prevent duplicate processing
- Independent scaling - scale event-driven pods for throughput, polling pods for reliability
Single-Mode Orchestration Deployment
# k8s/orchestration-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration
namespace: tasker
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
template:
metadata:
labels:
app: tasker-orchestration
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
protocol: TCP
resources:
requests:
cpu: 1000m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
name: tasker-orchestration
namespace: tasker
spec:
selector:
app: tasker-orchestration
ports:
- port: 8080
targetPort: 8080
protocol: TCP
name: http
type: ClusterIP
Worker Deployment
# k8s/worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-worker-payments
namespace: tasker
spec:
replicas: 5
selector:
matchLabels:
app: tasker-worker
namespace: payments
template:
metadata:
labels:
app: tasker-worker
namespace: payments
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8081"
spec:
containers:
- name: worker
image: tasker-worker:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
- name: WORKER_NAMESPACES
value: "payments"
ports:
- containerPort: 8081
name: http
protocol: TCP
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8081
initialDelaySeconds: 20
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8081
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: tasker-worker-payments
namespace: tasker
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: tasker-worker-payments
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Health Monitoring
Health Check Endpoints
Orchestration Health:
# Basic health check
curl http://localhost:8080/health
# Response:
{
"status": "healthy",
"database": "connected",
"message_queue": "operational"
}
# Detailed health check
curl http://localhost:8080/health/detailed
# Response:
{
"status": "healthy",
"deployment_mode": "Hybrid",
"event_listeners": {
"active": true,
"channels": 3,
"lag_ms": 12
},
"polling": {
"active": false,
"fallback_triggered": false
},
"database": {
"status": "connected",
"pool_size": 50,
"active_connections": 23
},
"circuit_breakers": {
"database": "closed",
"message_queue": "closed"
},
"executors": {
"task_initializer": {
"active": 3,
"max": 10,
"queue_depth": 5
},
"result_processor": {
"active": 5,
"max": 10,
"queue_depth": 12
}
}
}
Worker Health:
# Worker health check
curl http://localhost:8081/health
# Response:
{
"status": "healthy",
"namespaces": ["payments", "inventory"],
"active_executions": 8,
"claimed_steps": 3
}
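For illustration, a health endpoint returning this shape can be served with a few lines of axum (an assumption for this sketch; the real services assemble these fields from live state):

```rust
use axum::{routing::get, Json, Router};
use serde_json::{json, Value};

// Hard-codes the example payload shown above; a real handler would
// read namespaces and execution counts from the worker's runtime state.
async fn health() -> Json<Value> {
    Json(json!({
        "status": "healthy",
        "namespaces": ["payments", "inventory"],
        "active_executions": 8,
        "claimed_steps": 3,
    }))
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/health", get(health));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8081").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```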
Kubernetes Probes
# Liveness probe - restart if unhealthy
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe - remove from load balancer if not ready
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
gRPC Health Checks
Tasker Core exposes gRPC health endpoints alongside REST for Kubernetes gRPC health probes.
Port Allocation:
| Service | REST Port | gRPC Port |
|---|---|---|
| Orchestration | 8080 | 9190 |
| Rust Worker | 8081 | 9191 |
| Ruby Worker | 8082 | 9200 |
| Python Worker | 8083 | 9300 |
| TypeScript Worker | 8085 | 9400 |
gRPC Health Endpoints:
# Using grpcurl
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckLiveness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckReadiness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckDetailedHealth
Kubernetes gRPC Probes (Kubernetes 1.24+):
# gRPC liveness probe
livenessProbe:
grpc:
port: 9190
service: tasker.v1.HealthService
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# gRPC readiness probe
readinessProbe:
grpc:
port: 9190
service: tasker.v1.HealthService
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
Configuration (config/tasker/base/orchestration.toml):
[orchestration.grpc]
enabled = true
bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
enable_reflection = true # Service discovery via grpcurl
enable_health_service = true # gRPC health checks
Scaling Patterns
Horizontal Scaling
Mixed-Mode Orchestration Scaling (Recommended)
Scale different deployment modes independently to optimize for throughput and reliability:
# Scale event-driven pods for throughput
kubectl scale deployment tasker-orchestration-event-driven --replicas=15 -n tasker
# Scale polling pods for reliability
kubectl scale deployment tasker-orchestration-polling --replicas=5 -n tasker
Scaling strategy by workload:
| Scenario | Event-Driven Pods | Polling Pods | Rationale |
|---|---|---|---|
| High throughput | 15-20 | 3-5 | Maximize event-driven capacity |
| Network unstable | 5-8 | 5-8 | Balance between modes |
| Cost optimization | 10-12 | 2-3 | Minimize polling overhead |
| Maximum reliability | 8-10 | 8-10 | Ensure complete coverage |
Single-Mode Orchestration Scaling
If using single deployment mode (Hybrid or EventDrivenOnly):
# Scale orchestration to 10 replicas (all same mode)
kubectl scale deployment tasker-orchestration --replicas=10 -n tasker
Key principles:
- Multiple orchestration instances process tasks independently
- Atomic finalization claiming prevents duplicate processing
- Load balancer distributes API requests across instances
Worker Scaling
Workers scale independently per namespace:
# Scale payment workers to 10 replicas
kubectl scale deployment tasker-worker-payments --replicas=10 -n tasker
- Each worker claims steps from namespace-specific queues
- No coordination required between workers
- Scale per namespace based on queue depth
Vertical Scaling
Resource Allocation:
# High-throughput orchestration
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 4000m
memory: 8Gi
# Standard worker
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
Auto-Scaling
HPA Configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: tasker-orchestration
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: tasker-orchestration
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: tasker_tasks_active
target:
type: AverageValue
averageValue: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 30
Production Considerations
Database Configuration
Connection Pooling:
# config/tasker/environments/production/database.toml
[database]
max_connections = 50 # Total pool size
min_connections = 10 # Minimum maintained connections
connection_timeout_ms = 5000 # Connection acquisition timeout
idle_timeout_seconds = 600 # Close idle connections after 10 minutes
max_lifetime_seconds = 1800 # Recycle connections after 30 minutes
Calculation:
Total DB Connections = (Orchestration Replicas × Pool Size) + (Worker Replicas × Pool Size)
Example: (3 × 50) + (10 × 20) = 350 connections
Ensure PostgreSQL max_connections > Total DB Connections + Buffer
Recommended: max_connections = 500 for above example
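The same arithmetic as a small helper, for use in capacity-planning scripts (illustrative):

```rust
/// Total connections the fleet can open against PostgreSQL.
fn total_db_connections(
    orchestration_replicas: u32,
    orchestration_pool_size: u32,
    worker_replicas: u32,
    worker_pool_size: u32,
) -> u32 {
    orchestration_replicas * orchestration_pool_size + worker_replicas * worker_pool_size
}

// total_db_connections(3, 50, 10, 20) == 350; size PostgreSQL
// max_connections above this plus a buffer (e.g. 500).
```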
Circuit Breaker Tuning
# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5 # Open after 5 consecutive errors
timeout_seconds = 60 # Stay open for 60 seconds
half_open_timeout_seconds = 30 # Test recovery for 30 seconds
[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60
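These settings describe the standard closed → open → half-open state machine. A minimal sketch of that logic (illustrative, not Tasker Core's actual breaker):

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum State {
    Closed { consecutive_errors: u32 },
    Open { since: Instant },
    HalfOpen,
}

struct CircuitBreaker {
    state: State,
    error_threshold: u32,    // error_threshold
    open_timeout: Duration,  // timeout_seconds
}

impl CircuitBreaker {
    fn allow_request(&mut self) -> bool {
        match self.state {
            State::Closed { .. } | State::HalfOpen => true,
            State::Open { since } if since.elapsed() >= self.open_timeout => {
                self.state = State::HalfOpen; // probe whether the backend recovered
                true
            }
            State::Open { .. } => false,
        }
    }

    fn record_failure(&mut self) {
        self.state = match self.state {
            State::Closed { consecutive_errors }
                if consecutive_errors + 1 < self.error_threshold =>
            {
                State::Closed { consecutive_errors: consecutive_errors + 1 }
            }
            // Threshold reached, or a failure while half-open: (re)open.
            _ => State::Open { since: Instant::now() },
        };
    }

    fn record_success(&mut self) {
        self.state = State::Closed { consecutive_errors: 0 };
    }
}
```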
Executor Pool Sizing
# config/tasker/environments/production/executor_pools.toml
[executor_pools.task_initializer]
min_executors = 2
max_executors = 10
queue_high_watermark = 100
queue_low_watermark = 10
[executor_pools.result_processor]
min_executors = 5
max_executors = 20
queue_high_watermark = 200
queue_low_watermark = 20
[executor_pools.step_enqueuer]
min_executors = 3
max_executors = 15
queue_high_watermark = 150
queue_low_watermark = 15
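The watermarks imply a simple scaling rule per pool: add executors while the queue is above the high watermark, remove them once it falls below the low watermark. A sketch of that decision (illustrative only):

```rust
/// Returns the executor count a pool should converge toward,
/// bounded by min_executors and max_executors.
fn desired_executors(
    current: u32,
    queue_depth: u32,
    min_executors: u32,
    max_executors: u32,
    high_watermark: u32,
    low_watermark: u32,
) -> u32 {
    if queue_depth >= high_watermark {
        (current + 1).min(max_executors) // backlog building: add capacity
    } else if queue_depth <= low_watermark {
        current.saturating_sub(1).max(min_executors) // drained: release capacity
    } else {
        current // within band: hold steady
    }
}
```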
Observability Integration
Prometheus Metrics:
# Prometheus scrape config
scrape_configs:
- job_name: 'tasker-orchestration'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- tasker
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
Key Alerts:
# alerts.yaml
groups:
- name: tasker
interval: 30s
rules:
- alert: TaskerOrchestrationDown
expr: up{job="tasker-orchestration"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Tasker orchestration instance down"
- alert: TaskerHighErrorRate
expr: rate(tasker_step_errors_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate in step execution"
- alert: TaskerCircuitBreakerOpen
expr: tasker_circuit_breaker_state{state="open"} == 1
for: 2m
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.name }} is open"
- alert: TaskerDatabasePoolExhausted
expr: tasker_database_pool_active >= tasker_database_pool_max
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection pool exhausted"
Migration Strategies
Migrating to Hybrid Mode
Step 1: Enable event listeners
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true # Keep polling enabled during migration
Step 2: Monitor event listener health
# Check metrics for event listener stability
curl http://localhost:8080/health/detailed | jq '.event_listeners'
Step 3: Gradually reduce polling frequency
# Once event listeners are stable
[orchestration.hybrid]
polling_interval_ms = 5000 # Increase from 1000ms to 5000ms
Step 4: Validate performance
- Monitor latency metrics: `tasker_step_discovery_duration_seconds`
- Verify no missed events: `tasker_polling_messages_found_total` should be near zero
Rollback Plan
If event-driven mode fails:
# Immediate rollback to PollingOnly
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
task_request_poll_interval_ms = 500 # Aggressive polling
Gradual rollback:
- Increase polling frequency in Hybrid mode
- Monitor for stability
- Disable event listeners once polling is stable
- Switch to PollingOnly mode
Troubleshooting
Event Listener Issues
Problem: Event listeners not receiving notifications
Diagnosis:
-- Check PostgreSQL LISTEN/NOTIFY is working
NOTIFY pgmq_message_ready, 'test';
# Check listener status
curl http://localhost:8080/health/detailed | jq '.event_listeners'
Solutions:
- Verify PostgreSQL version supports LISTEN/NOTIFY (9.0+)
- Check firewall rules allow persistent connections
- Increase `listener_reconnect_interval_ms` if connections drop frequently
- Switch to Hybrid or PollingOnly mode if issues persist
Polling Performance Issues
Problem: High CPU usage from polling
Diagnosis:
# Check polling frequency and batch sizes
curl http://localhost:8080/health/detailed | jq '.polling'
Solutions:
- Increase polling intervals
- Increase batch sizes to process more messages per poll
- Switch to Hybrid or EventDrivenOnly mode for better performance
- Scale horizontally to distribute polling load
Database Connection Exhaustion
Problem: “connection pool exhausted” errors
Diagnosis:
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'tasker_production';
-- Check max connections
SHOW max_connections;
Solutions:
- Increase `max_connections` in database.toml
- Increase PostgreSQL `max_connections` setting
- Reduce number of replicas
- Implement connection pooling at infrastructure level (PgBouncer)
Best Practices
Configuration Management
- Use environment-specific overrides instead of modifying base configuration
- Validate configuration with `config-validator` before deployment
- Version control all configuration including environment overrides
- Use secrets management for sensitive values (passwords, keys)
Deployment Strategy
- Use mixed-mode architecture in production (EventDrivenOnly + PollingOnly)
- Deploy 80-90% of orchestration pods in EventDrivenOnly mode for throughput
- Deploy 10-20% of orchestration pods in PollingOnly mode as safety net
- Single service load balances across all pods
- Alternative: Deploy all pods in Hybrid mode for simpler configuration
- Trade-off: Less tuning flexibility, slightly higher resource usage
- Scale each mode independently based on workload characteristics
- Monitor deployment mode metrics to adjust ratios over time
- Test mixed-mode deployments in staging before production
Deployment Operations
- Always test configuration changes in staging first
- Use rolling updates with health checks to prevent downtime
- Monitor deployment mode health during and after deployments
- Keep polling capacity available even when event-driven is primary
Scaling Guidelines
- Mixed-mode orchestration: Scale EventDrivenOnly and PollingOnly deployments independently
- Scale event-driven pods based on throughput requirements
- Scale polling pods based on reliability requirements
- Single-mode orchestration: Scale based on API request rate and task initialization throughput
- Workers: Scale based on namespace-specific queue depth
- Database connections: Monitor and adjust pool sizes as replicas scale
- Use HPA for automatic scaling based on CPU/memory and custom metrics
Observability
- Enable comprehensive metrics in production
- Set up alerts for circuit breaker states, connection pool exhaustion
- Monitor deployment mode distribution in mixed-mode deployments
- Track event listener lag in EventDrivenOnly and Hybrid modes
- Monitor polling overhead to optimize resource usage
- Track step execution latency per namespace and handler
Summary
Tasker Core’s flexible deployment modes enable sophisticated production architectures:
Deployment Modes
- Hybrid Mode: Event-driven with polling fallback in a single container
- EventDrivenOnly Mode: Maximum throughput with ~10ms latency
- PollingOnly Mode: Reliable safety net with traditional polling
Recommended Production Pattern
Mixed-Mode Architecture (recommended for production at scale):
- Deploy majority of orchestration pods in EventDrivenOnly mode for high throughput
- Deploy minority of orchestration pods in PollingOnly mode as reliability safety net
- Both deployments coordinate through atomic SQL operations with no conflicts
- Scale each mode independently based on workload characteristics
Alternative: Deploy all pods in Hybrid mode for simpler configuration with automatic fallback.
The key insight: deployment modes exist not just for configuration tuning, but to enable mixing coordination strategies across containers to meet production requirements for both throughput and reliability.
← Back to Documentation Hub
Next: Observability | Benchmarks | Quick Start