Tasker
Workflow orchestration that meets your code where it lives.
Tasker is an open-source workflow orchestration engine built on PostgreSQL and PGMQ. You define workflows as task templates with ordered steps, implement handlers in Rust, Ruby, Python, or TypeScript, and the engine handles execution, retries, circuit breaking, and observability.
Your existing business logic — API calls, database operations, service integrations — becomes a distributed, event-driven, retryable workflow with minimal ceremony. No DSLs to learn, no framework rewrites. Just thin handler wrappers around code you already have.
Get Started
From zero to your first workflow. Install, write a handler, define a template, submit a task, and watch it run.
Why Tasker
An honest look at where Tasker fits in the workflow orchestration landscape — and where established tools might be a better choice.
Architecture
How Tasker works under the hood: actors, state machines, event systems, circuit breakers, and the PostgreSQL-native execution model.
Configuration Reference
Complete reference for all 246 configuration parameters across orchestration, workers, and shared settings.
Choose Your Language
Tasker is polyglot from the ground up. The orchestration engine is Rust; workers can be any of four languages, all sharing the same core abstractions expressed idiomatically.
| Language | Package | Install | Registry |
|---|---|---|---|
| Rust | tasker-client / tasker-worker | cargo add tasker-client tasker-worker | crates.io |
| Ruby | tasker-rb | gem install tasker-rb | rubygems.org |
| Python | tasker-py | pip install tasker-py | pypi.org |
| TypeScript | @tasker-systems/tasker | npm install @tasker-systems/tasker | npmjs.com |
Each language guide covers installation, handler patterns, testing, and production considerations:
Rust · Ruby · Python · TypeScript
Explore the Documentation
For New Users
- Core Concepts — Tasks, steps, handlers, templates, and namespaces
- Choosing Your Package — Which package do you need?
- Quick Start — Running in 5 minutes
Architecture & Design
- Architecture Overview — System design and component interaction
- Design Principles — The tenets behind Tasker’s design decisions
- Architectural Decisions — ADRs documenting key technical choices
Operational Guides
- Handler Resolution — How Tasker finds and runs your handlers
- Retry Semantics — Retry strategies, backoff, and circuit breaking
- Batch Processing — Processing work in batches
- DLQ System — Dead letter queue for failed tasks
- Observability — Metrics, tracing, and logging
Reference
- Configuration Reference — All 246 parameters documented
- Worker API Convergence — Cross-language API alignment
- FFI Safety — How polyglot workers communicate safely
Framework Integrations
- Contrib Packages — Rails, FastAPI, Axum, and Bun integrations
- Example Workflows — E-commerce, ETL, and approval system patterns
Engineering Stories
A progressive-disclosure blog series teaching Tasker concepts through real-world scenarios. Each story follows an engineering team as they adopt workflow orchestration, with working code examples across all four languages.
| Story | What You’ll Learn |
|---|---|
| 01: E-commerce Checkout | Basic workflows, error handling, retry patterns |
| 02: Data Pipeline Resilience | ETL orchestration, resilience under failure |
| 03: Microservices Coordination | Cross-service workflows, distributed tracing |
| 04: Team Scaling | Namespace isolation, multi-team patterns |
| 05: Observability | OpenTelemetry integration, domain events |
| 06: Batch Processing | Batch step patterns, throughput optimization |
| 07: Conditional Workflows | Decision handlers, approval flows |
| 08: Production Debugging | DLQ investigation, diagnostics tooling |
Stories are being rewritten for the current Tasker architecture. View archive →
The Project
Tasker is open-source software (MIT license) built by an engineer who has spent years designing workflow systems at multiple organizations — and finally had the opportunity to build the one that was always in his head.
It’s not venture-backed. It’s not chasing a market. It’s a labor of love built for the engineering community.
Source Repositories
| Repository | Description |
|---|---|
| tasker-core | Rust orchestration engine, polyglot workers, and CLI |
| tasker-contrib | Framework integrations and community packages |
| tasker-book | This documentation site |
Why Tasker
Last Updated: 2025-01-09 Audience: Engineers evaluating workflow orchestration tools Status: Pre-Alpha
The Story
Tasker is a labor of love.
Over the years, I’ve built workflow systems at multiple organizations—each time encountering the same fundamental challenges: orchestrating complex, multi-step processes with proper dependency management, ensuring idempotency, handling retries gracefully, and doing all of this in a way that doesn’t require teams to rewrite their existing business logic.
Each time, I’d design parts of the solution I wished we could build—but the investment was never justifiable. General-purpose workflow infrastructure rarely makes sense for a single company to build from scratch when there are urgent product features to ship. So I’d compromise, cobble together something workable, and move on.
Tasker represents the opportunity to finally build that system properly—the one that’s been evolving in my head for years. Not as a venture-backed startup chasing a market, but as open-source software built by someone who genuinely cares about the problem space and wants to give back to the engineering community.
The Landscape
Honesty matters, so in full candor: Tasker is not solving an unsolved problem. The workflow orchestration space has mature, battle-tested options.
Apache Airflow
What it does well: Airflow is the industry standard for data pipeline orchestration. Born at Airbnb and now an Apache project with thousands of contributors, it excels at scheduled, batch-oriented workflows defined as Python DAGs. Its ecosystem of operators and integrations is unmatched—if you need to connect to a cloud service, there’s probably an Airflow provider for it.
When to choose it: You have scheduled ETL/ELT workloads, your team is Python-native, you need managed cloud options (AWS MWAA, Google Cloud Composer, Astronomer), and you value ecosystem breadth over ergonomic simplicity.
Honest comparison: Airflow’s 10+ years of production use across thousands of companies represents a level of battle-testing Tasker simply cannot match. If your primary use case is data pipeline orchestration with scheduled intervals, Airflow is likely the safer choice.
Temporal
What it does well: Temporal pioneered “durable execution”—workflows that automatically survive crashes, network failures, and infrastructure outages. It reconstructs application state transparently, letting developers write code as if failures don’t exist. The event history and replay capabilities are genuinely impressive.
When to choose it: You need long-running workflows (hours, days, or longer), your operations require true durability guarantees, you’re building microservice orchestration with complex saga patterns, or you need human-in-the-loop workflows with unbounded wait times.
Honest comparison: Temporal’s durable execution model is architecturally different from Tasker. If your workflows genuinely need to survive arbitrary failures mid-execution and resume from exact state, Temporal was purpose-built for this. Tasker provides resilience through retries and idempotent step execution, but doesn’t offer Temporal’s deterministic replay.
Prefect
What it does well: Prefect feels like “what if workflow orchestration were just Python decorators?” It emphasizes minimal boilerplate—add @flow and @task decorators to existing functions, and you have an orchestrated workflow. Prefect 3.0 embraces dynamic workflows with native Python control flow.
When to choose it: Your team is Python-native, you want the fastest path from script to production pipeline, you value simplicity and developer experience, or you’re doing ML/data science workflows where Jupyter-to-production is important.
Honest comparison: Prefect’s decorator-based ergonomics are genuinely excellent for Python-only teams. If your organization is homogeneously Python and you don’t need polyglot support, Prefect delivers a very clean experience.
Dagster
What it does well: Dagster introduced “software-defined assets” as first-class primitives—you define what data assets should exist and their dependencies, and the orchestrator figures out how to materialize them. This asset-centric model provides excellent lineage tracking and observability.
When to choose it: You’re building a data platform where understanding asset lineage is critical, you want a declarative approach focused on data products rather than task sequences, or you need strong dbt integration and data quality built into your orchestration layer.
Honest comparison: Dagster’s asset-centric philosophy is a genuinely different way of thinking about orchestration. If your mental model centers on “what data assets need to exist” rather than “what steps need to execute,” Dagster may be a better conceptual fit.
So Why Tasker?
Given this landscape, why build another workflow orchestrator?
Philosophy: Meet Teams Where They Are
Most workflow tools require you to think in their terms. Define your work as DAGs using their DSL. Adopt their scheduling model. Often, rewrite your business logic to fit their execution model.
Tasker takes a different approach: bring workflow orchestration to your existing code, rather than bringing your code to a workflow framework.
If your codebase already has reasonable SOLID characteristics—services with clear responsibilities, well-defined interfaces, operations that can be made idempotent—Tasker aims to turn that code into distributed, event-driven, retryable workflows with minimal ceremony.
This philosophy manifests in several ways:
Polyglot from the ground up. Tasker’s orchestration engine is written in Rust, but workers can be written in Ruby, Python, TypeScript, or native Rust. Each language implementation shares the same core abstractions—same handler signatures, same result factories, same patterns—expressed idiomatically for each language. This isn’t an afterthought; cross-language consistency is a core design principle.
Minimal migration burden. Your existing business logic—API calls, database operations, external service integrations—can become step handlers with thin wrappers. You don’t need to restructure your application around the orchestrator.
Framework-agnostic core. Tasker Core provides the fundamentals without framework opinions. Tasker Contrib then provides framework-specific integrations (Rails, FastAPI, Bun) for teams who want batteries-included ergonomics. Progressive disclosure: learn the core concepts first, add framework sugar when needed.
Architecture: Event-Driven with Resilience Built In
Tasker’s architecture reflects lessons learned from building distributed systems:
PostgreSQL-native by default. Everything flows through Postgres—task state, step queues (via PGMQ), event coordination (via LISTEN/NOTIFY). This isn’t because Postgres is trendy; it’s because many teams already have Postgres expertise and operational knowledge. Tasker works as a single-dependency system on PostgreSQL alone. For environments requiring higher throughput or existing RabbitMQ infrastructure, Tasker also supports RabbitMQ as an alternative messaging backend—switch with a configuration change.
Event-driven with polling fallback. Real-time responsiveness through Postgres notifications, but with polling as a reliability backstop. Events can be missed; polling ensures eventual consistency.
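For illustration only (Tasker's actual wiring lives in its orchestration crates), the sketch below shows how this pattern is commonly composed with sqlx and tokio: react to LISTEN/NOTIFY payloads as they arrive, and poll on an interval as the backstop. The channel name and helper functions are hypothetical.

```rust
use sqlx::postgres::PgListener;
use std::time::Duration;

/// Illustrative event loop: react to NOTIFY payloads immediately,
/// but also poll on an interval so missed notifications are eventually processed.
async fn run_event_loop(database_url: &str) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect(database_url).await?;
    listener.listen("task_events").await?; // hypothetical channel name

    let mut poll = tokio::time::interval(Duration::from_millis(250));

    loop {
        tokio::select! {
            // Real-time path: a NOTIFY arrived on the channel
            notification = listener.recv() => {
                let notification = notification?;
                process_event(notification.payload()).await;
            }
            // Fallback path: periodically scan for anything a missed event left behind
            _ = poll.tick() => {
                process_pending_work().await;
            }
        }
    }
}

async fn process_event(_payload: &str) { /* hypothetical: dispatch one event */ }
async fn process_pending_work() { /* hypothetical: look for ready work */ }
```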
Defense in depth. Multiple overlapping protection layers provide robust idempotency without single-point dependency. Database-level atomicity, state machine guards, transaction boundaries, and application-level filtering each catch what others might miss.
Composition over inheritance. Handler capabilities are composed via mixins/traits, not class hierarchies. This enables selective capability inclusion, clear separation of concerns, and easier testing.
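As a rough Rust sketch of what trait-based composition looks like in general (the trait and type names below are hypothetical, not Tasker's API):

```rust
// Hypothetical capability traits; a handler opts into only what it needs.
trait ApiCallable {
    fn base_url(&self) -> &str;
}

trait Retryable {
    fn max_attempts(&self) -> u32 {
        3 // sensible default, override when needed
    }
}

// A concrete handler composes capabilities instead of inheriting from a deep hierarchy.
struct PaymentStepHandler {
    gateway_url: String,
}

impl ApiCallable for PaymentStepHandler {
    fn base_url(&self) -> &str {
        &self.gateway_url
    }
}

impl Retryable for PaymentStepHandler {} // takes the default retry policy

// Generic code can require exactly the capabilities it uses.
fn describe<H: ApiCallable + Retryable>(handler: &H) -> String {
    format!("{} (up to {} attempts)", handler.base_url(), handler.max_attempts())
}
```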
Performance: Fast by Default
Tasker is built in Rust not for marketing purposes, but because workflow orchestration has real performance implications. When you’re coordinating thousands of steps across distributed workers, overhead matters.
- Complex 7-step DAG workflows complete in under 133ms with push-based notifications
- Concurrent execution via work-stealing thread pools
- Lock-free channel-based internal coordination
- Zero-copy data handling where possible at the FFI boundaries
The Honest Assessment
Tasker excels when:
- You need polyglot worker support across Ruby, Python, TypeScript, and Rust
- Your team already has Postgres expertise and wants to avoid additional infrastructure
- You want to bring orchestration to existing business logic rather than rewriting
- You value clean, consistent APIs across languages
- Performance matters and you’re willing to trade ecosystem breadth for it
Tasker may not be the right choice when:
- You need the battle-tested maturity and ecosystem of Airflow
- Your workflows require Temporal-style durable execution with deterministic replay
- You’re an all-Python team and Prefect’s ergonomics fit perfectly
- You’re building a data platform where asset-centric thinking (Dagster) is the right model
- You need managed cloud offerings with SLAs and enterprise support
What Tasker Is (and Isn’t)
Tasker Is:
- A workflow orchestration engine for step-based DAG execution with complex dependencies
- PostgreSQL-native with flexible messaging using PGMQ (default) or RabbitMQ
- Polyglot by design with first-class support for multiple languages
- Focused on developer experience for teams who want minimal intrusion
- Open source (MIT license) and built as a labor of love
Tasker Is Not:
- A data orchestration platform like Dagster with asset lineage and data quality primitives
- A durable execution engine like Temporal with deterministic replay and unlimited durability
- A scheduled job runner for simple cron-style workloads (use actual cron)
- A message bus for pure pub/sub fan-out (use Kafka or dedicated brokers)
- Enterprise software with commercial support, SLAs, or managed offerings
Current State
Tasker is pre-alpha software. This is important context:
What this means:
- The architecture is solidifying but breaking changes are expected
- Documentation is comprehensive but evolving
- There are no production deployments (that I know of) outside development
- You should not bet critical business processes on Tasker today
What this enables:
- Rapid iteration based on real feedback
- Willingness to break APIs to get the design right
- Focus on architectural correctness over backward compatibility
- Honest experimentation without legacy constraints
If you’re evaluating Tasker, I’d encourage you to explore it for non-critical workloads, provide feedback, and help shape what it becomes. If you need production-ready workflow orchestration today, please consider the established tools above—I genuinely recommend them for their respective strengths.
The Path Forward
Tasker is being built with care, not speed. The goal isn’t to capture market share or compete with well-funded companies. The goal is to create something genuinely useful—a workflow orchestration system that respects developers’ time and meets them where they are.
The codebase is open, the design decisions are documented, and contributions are welcome. This is software built by an engineer for engineers, not a product chasing metrics.
If that resonates with you, welcome. Let’s build something good together.
Related Documentation
- Tasker Core Tenets - The 10 foundational design principles
- Use Cases & Patterns - When and how to use Tasker
- Quick Start Guide - Get running in 5 minutes
- CHRONOLOGY - Development timeline and lessons learned
← Back to Documentation Hub
Getting Started
This section walks you from “what is Tasker?” to running your first workflow.
Path
- Overview - What Tasker is and why it exists
- Core Concepts - Tasks, steps, handlers, templates, and namespaces
- Installation - Installing packages and running infrastructure
- Choosing Your Package - Which package do you need?
- Your First Handler - Write a step handler in your language
- Your First Workflow - Define a template, submit a task, watch it run
- Next Steps - Where to go from here
Overview
This page will be written as part of the consumer documentation effort. See TAS-215 for details.
Core Concepts
This page will be written as part of the consumer documentation effort. See TAS-215 for details.
Installation
This page will be written as part of the consumer documentation effort. See TAS-215 for details.
Choosing Your Package
This page will be written as part of the consumer documentation effort. See TAS-215 for details.
Your First Handler
This page will be written as part of the consumer documentation effort. See TAS-215 for details.
Your First Workflow
This page will be written as part of the consumer documentation effort. See TAS-215 for details.
Next Steps
This page will be written as part of the consumer documentation effort. See TAS-215 for details.
Tasker Core Architecture
This directory contains architectural reference documentation describing how Tasker Core’s components work together.
Documents
| Document | Description |
|---|---|
| Crate Architecture | Workspace structure and crate responsibilities |
| Messaging Abstraction | Provider-agnostic messaging (PGMQ, RabbitMQ) |
| Actors | Actor-based orchestration lifecycle components |
| Worker Actors | Actor pattern for worker step execution |
| Worker Event Systems | Dual-channel event architecture for workers |
| States and Lifecycles | Dual state machine architecture (Task + Step) |
| Events and Commands | Event-driven coordination patterns |
| Domain Events | Business event publishing (durable/fast/broadcast) |
| Idempotency and Atomicity | Defense-in-depth guarantees |
| Backpressure Architecture | Unified resilience and flow control |
| Circuit Breakers | Fault isolation and cascade prevention |
| Deployment Patterns | Hybrid, EventDriven, PollingOnly modes; PGMQ/RabbitMQ backends |
When to Read These
- Designing new features: Understand how components interact
- Debugging flow issues: Trace message paths through actors
- Understanding trade-offs: See why patterns were chosen
- Onboarding: Build mental model of the system
Related Documentation
- Principles - The “why” behind architectural decisions
- Guides - Practical “how-to” documentation
- CHRONOLOGY - Historical context for decisions
Actor-Based Architecture
Last Updated: 2025-12-04 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Worker Actor Architecture | Events and Commands | States and Lifecycles
← Back to Documentation Hub
This document describes the actor-based architecture in tasker-core: a lightweight Actor pattern that formalizes the relationship between Commands and Lifecycle Components. This architecture replaces imperative delegation with message-based actor coordination.
Overview
The tasker-core system implements a lightweight Actor pattern inspired by frameworks like Actix, but designed specifically for our orchestration needs without external dependencies. The architecture provides:
- Actor Abstraction: Lifecycle components encapsulated as actors with clear lifecycle hooks
- Message-Based Communication: Type-safe message handling via the `Handler<M>` trait
- Central Registry: ActorRegistry for managing all orchestration actors
- Service Decomposition: Focused components following single responsibility principle
- Direct Integration: Command processor calls actors directly without wrapper layers
This architecture eliminates inconsistencies in lifecycle component initialization, provides type-safe message handling, and creates a clear separation between command processing and business logic execution.
Implementation Status
All phases implemented and production-ready: core abstractions, all 4 primary actors, message hydration, module reorganization, service decomposition, and direct actor integration.
Core Concepts
What is an Actor?
In the tasker-core context, an Actor is an encapsulated lifecycle component that:
- Manages its own state: Each actor owns its dependencies and configuration
- Processes messages: Responds to typed command messages via the `Handler<M>` trait
- Has lifecycle hooks: Initialization (started) and cleanup (stopped) methods
- Is isolated: Actors communicate through message passing
- Is thread-safe: All actors are Send + Sync + 'static
Why Actors?
The previous architecture had several inconsistencies:
// OLD: Inconsistent initialization patterns
pub struct TaskInitializer {
// Constructor pattern
}
pub struct TaskFinalizer {
// Builder pattern with new()
}
pub struct StepEnqueuer {
// Factory pattern with create()
}
The actor pattern provides consistency:
// NEW: Consistent actor pattern
impl OrchestrationActor for TaskRequestActor {
fn name(&self) -> &'static str { "TaskRequestActor" }
fn context(&self) -> &Arc<SystemContext> { &self.context }
fn started(&mut self) -> TaskerResult<()> { /* initialization */ }
fn stopped(&mut self) -> TaskerResult<()> { /* cleanup */ }
}
Actor vs Service
Services (underlying business logic):
- Encapsulate business logic
- Stateless operations on domain models
- Direct method invocation
- Examples: TaskFinalizer, StepEnqueuerService, OrchestrationResultProcessor
Actors (message-based coordination):
- Wrap services with message-based interface
- Manage service lifecycle
- Asynchronous message handling
- Examples: TaskRequestActor, ResultProcessorActor, StepEnqueuerActor, TaskFinalizerActor
The relationship:
pub struct TaskFinalizerActor {
context: Arc<SystemContext>,
service: TaskFinalizer, // Wraps underlying service
}
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
type Response = FinalizationResult;
async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
// Delegates to service
self.service.finalize_task(msg.task_uuid).await
}
}
Actor Traits
OrchestrationActor Trait
The base trait for all orchestration actors, defined in tasker-orchestration/src/actors/traits.rs:
/// Base trait for all orchestration actors
///
/// Provides lifecycle management and context access for all actors in the
/// orchestration system. All actors must implement this trait to participate
/// in the actor registry and lifecycle management.
///
/// # Lifecycle
///
/// 1. **Construction**: Actor is created by ActorRegistry
/// 2. **Initialization**: `started()` is called during registry build
/// 3. **Operation**: Actor processes messages via Handler<M> implementations
/// 4. **Shutdown**: `stopped()` is called during registry shutdown
pub trait OrchestrationActor: Send + Sync + 'static {
/// Returns the unique name of this actor
///
/// Used for logging, metrics, and debugging. Should be a static string
/// that clearly identifies the actor's purpose.
fn name(&self) -> &'static str;
/// Returns a reference to the system context
///
/// Provides access to database pool, configuration, and other
/// framework-level resources.
fn context(&self) -> &Arc<SystemContext>;
/// Called when the actor is started
///
/// Perform any initialization work here, such as:
/// - Setting up database connections
/// - Loading configuration
/// - Initializing caches
///
/// # Errors
///
/// Return an error if initialization fails. The actor will not be
/// registered and the system will fail to start.
fn started(&mut self) -> TaskerResult<()> {
tracing::info!(actor = %self.name(), "Actor started");
Ok(())
}
/// Called when the actor is stopped
///
/// Perform any cleanup work here, such as:
/// - Closing database connections
/// - Flushing caches
/// - Releasing resources
///
/// # Errors
///
/// Return an error if cleanup fails. Errors are logged but do not
/// prevent other actors from shutting down.
fn stopped(&mut self) -> TaskerResult<()> {
tracing::info!(actor = %self.name(), "Actor stopped");
Ok(())
}
}
Key Design Decisions:
- Send + Sync + 'static: Enables actors to be shared across threads
- Default lifecycle hooks: Actors only override when needed
- Context injection: All actors have access to SystemContext
- Error handling: Lifecycle hooks return TaskerResult so failures propagate properly
Handler Trait
The message handling trait, enabling type-safe message processing:
/// Message handler trait for specific message types
///
/// Actors implement Handler<M> for each message type they can process.
/// This provides type-safe, asynchronous message handling with clear
/// input/output contracts.
#[async_trait]
pub trait Handler<M: Message>: OrchestrationActor {
/// The response type returned by this handler
type Response: Send;
/// Handle a message asynchronously
///
/// Process the message and return a response. This method should be
/// idempotent where possible and handle errors gracefully.
async fn handle(&self, msg: M) -> TaskerResult<Self::Response>;
}
Key Design Decisions:
- async_trait: All message handling is asynchronous
- Type safety: Message and Response types are checked at compile time
- Multiple implementations: An actor can implement `Handler<M>` for multiple message types (see the sketch after the Message trait below)
- Error propagation: TaskerResult ensures proper error handling
Message Trait
The marker trait for command messages:
/// Marker trait for command messages
///
/// All messages sent to actors must implement this trait. The associated
/// `Response` type defines what the handler will return.
pub trait Message: Send + 'static {
/// The response type for this message
type Response: Send;
}
Key Design Decisions:
- Marker trait: No methods, just type constraints
- Associated type: Response type is part of the message definition
- Send + 'static: Enables messages to cross thread boundaries
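Putting the three traits together: the hypothetical actor below implements `Handler<M>` for two different message types. The actor, message, and service names are illustrative, not part of Tasker's actual actor set; the OrchestrationActor implementation is omitted (it has the same shape as the TaskFinalizerActor example later in this document).

```rust
// Hypothetical actor and messages, illustrating one actor handling two message types.
pub struct TaskControlActor {
    context: Arc<SystemContext>,
    service: TaskControlService, // hypothetical underlying service
}

pub struct PauseTaskMessage {
    pub task_uuid: Uuid,
}
impl Message for PauseTaskMessage {
    type Response = ();
}

pub struct ResumeTaskMessage {
    pub task_uuid: Uuid,
}
impl Message for ResumeTaskMessage {
    type Response = ();
}

// The same actor implements Handler<M> once per message type it accepts.
#[async_trait]
impl Handler<PauseTaskMessage> for TaskControlActor {
    type Response = ();
    async fn handle(&self, msg: PauseTaskMessage) -> TaskerResult<Self::Response> {
        self.service.pause(msg.task_uuid).await
    }
}

#[async_trait]
impl Handler<ResumeTaskMessage> for TaskControlActor {
    type Response = ();
    async fn handle(&self, msg: ResumeTaskMessage) -> TaskerResult<Self::Response> {
        self.service.resume(msg.task_uuid).await
    }
}
```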
ActorRegistry
The central registry managing all orchestration actors, defined in tasker-orchestration/src/actors/registry.rs:
Purpose
The ActorRegistry serves as:
- Single Source of Truth: All actors are registered here
- Lifecycle Manager: Handles initialization and shutdown
- Dependency Injection: Provides SystemContext to all actors
- Type-Safe Access: Strongly-typed access to each actor
Structure
/// Registry managing all orchestration actors
///
/// The ActorRegistry holds Arc references to all actors in the system,
/// providing centralized access and lifecycle management.
#[derive(Clone)]
pub struct ActorRegistry {
/// System context shared by all actors
context: Arc<SystemContext>,
/// Task request actor for processing task initialization requests
pub task_request_actor: Arc<TaskRequestActor>,
/// Result processor actor for processing step execution results
pub result_processor_actor: Arc<ResultProcessorActor>,
/// Step enqueuer actor for batch processing ready tasks
pub step_enqueuer_actor: Arc<StepEnqueuerActor>,
/// Task finalizer actor for task finalization with atomic claiming
pub task_finalizer_actor: Arc<TaskFinalizerActor>,
}
Initialization
The build() method creates and initializes all actors:
impl ActorRegistry {
pub async fn build(context: Arc<SystemContext>) -> TaskerResult<Self> {
tracing::info!("Building ActorRegistry with actors");
// Create shared StepEnqueuerService (used by multiple actors)
let task_claim_step_enqueuer = StepEnqueuerService::new(context.clone()).await?;
let task_claim_step_enqueuer = Arc::new(task_claim_step_enqueuer);
// Create TaskRequestActor and its dependencies
let task_initializer = Arc::new(TaskInitializer::new(
context.clone(),
task_claim_step_enqueuer.clone(),
));
let task_request_processor = Arc::new(TaskRequestProcessor::new(
context.message_client.clone(),
context.task_handler_registry.clone(),
task_initializer,
TaskRequestProcessorConfig::default(),
));
let mut task_request_actor = TaskRequestActor::new(context.clone(), task_request_processor);
task_request_actor.started()?;
let task_request_actor = Arc::new(task_request_actor);
// Create ResultProcessorActor and its dependencies
let task_finalizer = TaskFinalizer::new(context.clone(), task_claim_step_enqueuer.clone());
let result_processor = Arc::new(OrchestrationResultProcessor::new(
task_finalizer,
context.clone(),
));
let mut result_processor_actor =
ResultProcessorActor::new(context.clone(), result_processor);
result_processor_actor.started()?;
let result_processor_actor = Arc::new(result_processor_actor);
// Create StepEnqueuerActor using shared StepEnqueuerService
let mut step_enqueuer_actor =
StepEnqueuerActor::new(context.clone(), task_claim_step_enqueuer.clone());
step_enqueuer_actor.started()?;
let step_enqueuer_actor = Arc::new(step_enqueuer_actor);
// Create TaskFinalizerActor using shared StepEnqueuerService
let task_finalizer = TaskFinalizer::new(context.clone(), task_claim_step_enqueuer.clone());
let mut task_finalizer_actor = TaskFinalizerActor::new(context.clone(), task_finalizer);
task_finalizer_actor.started()?;
let task_finalizer_actor = Arc::new(task_finalizer_actor);
tracing::info!("✅ ActorRegistry built successfully with 4 actors");
Ok(Self {
context,
task_request_actor,
result_processor_actor,
step_enqueuer_actor,
task_finalizer_actor,
})
}
}
Shutdown
The shutdown() method gracefully stops all actors:
impl ActorRegistry {
pub async fn shutdown(&mut self) {
tracing::info!("Shutting down ActorRegistry");
// Call stopped() on all actors in reverse initialization order
if let Some(actor) = Arc::get_mut(&mut self.task_finalizer_actor) {
if let Err(e) = actor.stopped() {
tracing::error!(error = %e, "Failed to stop TaskFinalizerActor");
}
}
if let Some(actor) = Arc::get_mut(&mut self.step_enqueuer_actor) {
if let Err(e) = actor.stopped() {
tracing::error!(error = %e, "Failed to stop StepEnqueuerActor");
}
}
if let Some(actor) = Arc::get_mut(&mut self.result_processor_actor) {
if let Err(e) = actor.stopped() {
tracing::error!(error = %e, "Failed to stop ResultProcessorActor");
}
}
if let Some(actor) = Arc::get_mut(&mut self.task_request_actor) {
if let Err(e) = actor.stopped() {
tracing::error!(error = %e, "Failed to stop TaskRequestActor");
}
}
tracing::info!("✅ ActorRegistry shutdown complete");
}
}
Implemented Actors
TaskRequestActor
Handles task initialization requests from external clients.
Location: tasker-orchestration/src/actors/task_request_actor.rs
Message: ProcessTaskRequestMessage
- Input: TaskRequestMessage with task details
- Response: Uuid of the created task
Delegation: Wraps TaskRequestProcessor service
Purpose: Entry point for new workflow instances, coordinates task creation and initial step discovery.
ResultProcessorActor
Processes step execution results from workers.
Location: tasker-orchestration/src/actors/result_processor_actor.rs
Message: ProcessStepResultMessage
- Input: StepExecutionResult with execution outcome
- Response: () (unit type)
Delegation: Wraps OrchestrationResultProcessor service
Purpose: Handles step completion, coordinates task finalization when appropriate.
StepEnqueuerActor
Manages batch processing of ready tasks.
Location: tasker-orchestration/src/actors/step_enqueuer_actor.rs
Message: ProcessBatchMessage
- Input: Empty (uses system state)
- Response: StepEnqueuerServiceResult with batch stats
Delegation: Wraps StepEnqueuerService
Purpose: Discovers ready tasks and enqueues their executable steps.
TaskFinalizerActor
Handles task finalization with atomic claiming.
Location: tasker-orchestration/src/actors/task_finalizer_actor.rs
Message: FinalizeTaskMessage
- Input: task_uuid to finalize
- Response: FinalizationResult with the action taken
Delegation: Wraps TaskFinalizer service (decomposed into focused components)
Purpose: Completes or fails tasks based on step execution results, prevents race conditions through atomic claiming.
Integration with Commands
Command Processor Integration
The command processor calls actors directly without intermediate wrapper layers:
// From: tasker-orchestration/src/orchestration/command_processor.rs
/// Handle task initialization using TaskRequestActor directly
async fn handle_initialize_task(
&self,
request: TaskRequestMessage,
) -> TaskerResult<TaskInitializeResult> {
// Direct actor-based task initialization
let msg = ProcessTaskRequestMessage { request };
let task_uuid = self.actors.task_request_actor.handle(msg).await?;
Ok(TaskInitializeResult::Success {
task_uuid,
message: "Task initialized successfully".to_string(),
})
}
/// Handle step result processing using ResultProcessorActor directly
async fn handle_process_step_result(
&self,
step_result: StepExecutionResult,
) -> TaskerResult<StepProcessResult> {
// Direct actor-based step result processing
let msg = ProcessStepResultMessage {
result: step_result.clone(),
};
match self.actors.result_processor_actor.handle(msg).await {
Ok(()) => Ok(StepProcessResult::Success {
message: format!(
"Step {} result processed successfully",
step_result.step_uuid
),
}),
Err(e) => Ok(StepProcessResult::Error {
message: format!("Failed to process step result: {e}"),
}),
}
}
/// Handle task finalization using TaskFinalizerActor directly
async fn handle_finalize_task(&self, task_uuid: Uuid) -> TaskerResult<TaskFinalizationResult> {
// Direct actor-based task finalization
let msg = FinalizeTaskMessage { task_uuid };
let result = self.actors.task_finalizer_actor.handle(msg).await?;
Ok(TaskFinalizationResult::Success {
task_uuid: result.task_uuid,
final_status: format!("{:?}", result.action),
completion_time: Some(chrono::Utc::now()),
})
}
Design Evolution: Initially planned to use lifecycle_services/ as a wrapper layer between command processor and actors. After implementing Phase 7 service decomposition, we found direct actor calls were simpler and cleaner, so we removed the intermediate layer.
Service Decomposition (Phase 7)
Large services (800-900 lines) were decomposed into focused components following single responsibility principle:
TaskFinalizer Decomposition
task_finalization/ (848 lines → 6 files)
├── mod.rs # Public API and types
├── service.rs # Main TaskFinalizer service (~200 lines)
├── completion_handler.rs # Task completion logic
├── event_publisher.rs # Lifecycle event publishing
├── execution_context_provider.rs # Context fetching
└── state_handlers.rs # State-specific handling
StepEnqueuerService Decomposition
step_enqueuer_services/ (781 lines → 4 files)
├── mod.rs # Public API
├── service.rs # Main service (~250 lines)
├── batch_processor.rs # Batch processing logic
└── state_handlers.rs # State-specific processing
ResultProcessor Decomposition
result_processing/ (889 lines → 5 files)
├── mod.rs # Public API
├── service.rs # Main processor
├── metadata_processor.rs # Metadata handling
├── error_handler.rs # Error processing
└── result_validator.rs # Result validation
Actor Lifecycle
Lifecycle Phases
┌─────────────────┐
│ Construction │ ActorRegistry::build() creates actor instances
└────────┬────────┘
│
▼
┌─────────────────┐
│ Initialization │ started() hook called on each actor
└────────┬────────┘
│
▼
┌─────────────────┐
│ Operation │ Actors process messages via Handler<M>::handle()
└────────┬────────┘
│
▼
┌─────────────────┐
│ Shutdown │ stopped() hook called on each actor (reverse order)
└─────────────────┘
Example Actor Implementation
use tasker_orchestration::actors::{OrchestrationActor, Handler, Message};
// Define the actor
pub struct TaskFinalizerActor {
context: Arc<SystemContext>,
service: TaskFinalizer,
}
// Implement base actor trait
impl OrchestrationActor for TaskFinalizerActor {
fn name(&self) -> &'static str {
"TaskFinalizerActor"
}
fn context(&self) -> &Arc<SystemContext> {
&self.context
}
fn started(&mut self) -> TaskerResult<()> {
tracing::info!("TaskFinalizerActor starting");
Ok(())
}
fn stopped(&mut self) -> TaskerResult<()> {
tracing::info!("TaskFinalizerActor stopping");
Ok(())
}
}
// Define message type
pub struct FinalizeTaskMessage {
pub task_uuid: Uuid,
}
impl Message for FinalizeTaskMessage {
type Response = FinalizationResult;
}
// Implement message handler
#[async_trait]
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
type Response = FinalizationResult;
async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
tracing::debug!(
actor = %self.name(),
task_uuid = %msg.task_uuid,
"Processing FinalizeTaskMessage"
);
// Delegate to service
self.service.finalize_task(msg.task_uuid).await
.map_err(|e| e.into())
}
}
Benefits
1. Consistency
All lifecycle components follow the same pattern:
- Uniform initialization via started()
- Uniform cleanup via stopped()
- Uniform message handling via `Handler<M>`
2. Type Safety
Messages and responses are type-checked at compile time:
// Compile error if message/response types don't match
impl Handler<WrongMessage> for TaskFinalizerActor {
type Response = WrongResponse; // ❌ Won't compile
// ...
}
3. Testability
- Clear message boundaries for mocking
- Isolated actor lifecycle for unit tests
- Type-safe message construction
4. Maintainability
- Clear separation of concerns
- Explicit message contracts
- Centralized lifecycle management
- Decomposed services (<300 lines per file)
5. Simplicity
- Direct actor calls (no wrapper layers)
- Pure routing in command processor
- Easy to trace message flow
Summary
The actor-based architecture provides a consistent, type-safe foundation for lifecycle component management in tasker-core. Key takeaways:
- Lightweight Pattern: Actors wrap decomposed services, providing message-based interface
- Lifecycle Management: Consistent initialization and shutdown via traits
- Type Safety: Compile-time verification of message contracts
- Service Decomposition: Focused components following single responsibility principle
- Direct Integration: Command processor calls actors directly without intermediate wrappers
- Production Ready: All phases complete, zero breaking changes, full test coverage
The architecture provides a solid foundation for future scalability and maintainability while maintaining the proven reliability of existing orchestration logic.
← Back to Documentation Hub
Backpressure Architecture
Last Updated: 2026-02-05 Audience: Architects, Developers, Operations Status: Active Related Docs: Worker Event Systems | MPSC Channel Guidelines
← Back to Documentation Hub
This document provides the unified backpressure strategy for tasker-core, covering all system components from API ingestion through worker execution.
Core Principle
Step idempotency is the primary constraint. Any backpressure mechanism must ensure that step claiming, business logic execution, and result persistence remain stable and consistent. The system must gracefully degrade under load without compromising workflow correctness.
System Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACKPRESSURE FLOW OVERVIEW │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ External Client │
└────────┬────────┘
│
┌────────────────▼────────────────┐
│ [1] API LAYER BACKPRESSURE │
│ • Circuit breaker (503) │
│ • System overload (503) │
│ • Request validation │
└────────────────┬────────────────┘
│
┌────────────────▼────────────────┐
│ [2] ORCHESTRATION BACKPRESSURE │
│ • Command channel (bounded) │
│ • Connection pool limits │
│ • PGMQ depth checks │
└────────────────┬────────────────┘
│
┌───────────┴───────────┐
│ PGMQ Queues │
│ • Namespace queues │
│ • Result queues │
└───────────┬───────────┘
│
┌────────────────▼────────────────┐
│ [3] WORKER BACKPRESSURE │
│ • Claim capacity check │
│ • Semaphore-bounded handlers │
│ • Completion channel bounds │
└────────────────┬────────────────┘
│
┌────────────────▼────────────────┐
│ [4] RESULT FLOW BACKPRESSURE │
│ • Completion channel bounds │
│ • Domain event drop semantics │
└─────────────────────────────────┘
Backpressure Points by Component
1. API Layer
The API layer provides backpressure through 503 responses with intelligent Retry-After headers.
Rate Limiting (429): Rate limiting is intentionally out of scope for tasker-core. This responsibility belongs to upstream infrastructure (API Gateways, NLB/ALB, service mesh). Tasker focuses on system health-based backpressure via 503 responses.
| Mechanism | Status | Behavior |
|---|---|---|
| Circuit Breaker | Implemented | Return 503 when database breaker open |
| System Overload | Planned | Return 503 when queue/channel saturation detected |
| Request Validation | Implemented | Return 400 for invalid requests |
Response Codes:
- 200 OK - Request accepted
- 400 Bad Request - Invalid request format
- 503 Service Unavailable - System overloaded (includes Retry-After header)
503 Response Triggers:
- Circuit Breaker Open: Database operations failing repeatedly
- Queue Depth High (Planned): PGMQ namespace queues approaching capacity
- Channel Saturation (Planned): Command channel buffer > 80% full
Retry-After Header Strategy:
503 Service Unavailable
Retry-After: {calculated_delay_seconds}
Calculation:
- Circuit breaker open: Use breaker timeout (default 30s)
- Queue depth high: Estimate based on processing rate
- Channel saturation: Short delay (5-10s) for buffer drain
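Following the calculation rules above, the header computation might look roughly like the sketch below. The enum and the queue-drain estimate are illustrative, not Tasker's implemented logic; only the 30s breaker timeout and the short channel-saturation delay come from the strategy described here.

```rust
use std::time::Duration;

/// Hypothetical reasons for returning 503 with a Retry-After header.
enum OverloadCause {
    CircuitBreakerOpen { breaker_timeout: Duration },
    QueueDepthHigh { backlog: u64, drain_rate_per_sec: u64 },
    ChannelSaturation,
}

/// Compute the Retry-After value (in seconds) for a 503 response.
fn retry_after_seconds(cause: OverloadCause) -> u64 {
    match cause {
        // Breaker open: tell the client to wait out the breaker timeout (default 30s).
        OverloadCause::CircuitBreakerOpen { breaker_timeout } => breaker_timeout.as_secs(),
        // Queue depth high: estimate drain time from backlog and observed processing rate.
        OverloadCause::QueueDepthHigh { backlog, drain_rate_per_sec } => {
            (backlog / drain_rate_per_sec.max(1)).max(5)
        }
        // Channel saturation: short delay while the buffer drains.
        OverloadCause::ChannelSaturation => 10,
    }
}
```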
Configuration:
# config/tasker/base/common.toml
[common.circuit_breakers.component_configs.web]
failure_threshold = 5 # Failures before opening
success_threshold = 2 # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)
2. Orchestration Layer
The orchestration layer protects internal processing from command flooding.
| Mechanism | Status | Behavior |
|---|---|---|
| Command Channel | Implemented | Bounded MPSC with monitoring |
| Connection Pool | Implemented | Database connection limits |
| PGMQ Depth Check | Planned | Reject enqueue when queue too deep |
Command Channel Backpressure:
Command Sender → [Bounded Channel] → Command Processor
│
└── If full: Block with timeout → Reject
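A minimal sketch of that "block with timeout, then reject" behavior using a bounded tokio channel; the command and error types are placeholders. A rejected submission is what ultimately surfaces to the client as a 503.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

// Hypothetical command and error types for the sketch.
struct Command;
enum SubmitError {
    Rejected, // channel stayed full past the timeout; caller should surface backpressure
}

async fn submit_command(tx: &mpsc::Sender<Command>, command: Command) -> Result<(), SubmitError> {
    // Block for a bounded amount of time waiting for channel capacity...
    match tokio::time::timeout(Duration::from_millis(500), tx.send(command)).await {
        Ok(Ok(())) => Ok(()),
        // ...then reject instead of blocking the caller indefinitely.
        _ => Err(SubmitError::Rejected),
    }
}
```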
Configuration:
# config/tasker/base/orchestration.toml
[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000
[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000
3. Messaging Layer
The messaging layer provides the backbone between orchestration and workers. Provider-agnostic via MessageClient, supporting PGMQ (default) and RabbitMQ backends.
| Mechanism | Status | Behavior |
|---|---|---|
| Visibility Timeout | Implemented | Messages return to queue after timeout |
| Batch Size Limits | Implemented | Bounded message reads |
| Queue Depth Check | Planned | Reject enqueue when depth exceeded |
| Messaging Circuit Breaker | Implemented | Fast-fail send/receive when provider unhealthy |
Messaging Circuit Breaker:
MessageClient wraps send/receive operations with circuit breaker protection. When the messaging provider (PGMQ or RabbitMQ) fails repeatedly, the breaker opens and returns MessagingError::CircuitBreakerOpen immediately, preventing slow timeouts from cascading into orchestration and worker processing loops. Ack/nack and health check operations bypass the breaker — ack/nack failures are safe (visibility timeout handles redelivery), and health checks must work when the breaker is open to detect recovery. See Circuit Breakers for details.
Queue Depth Monitoring (Planned):
The system will work with PGMQ’s native capabilities rather than enforcing arbitrary limits. Queue depth monitoring provides visibility without hard rejection:
┌──────────────────────────────────────────────────────────────────────┐
│ QUEUE DEPTH STRATEGY │
├──────────────────────────────────────────────────────────────────────┤
│ Level │ Depth Ratio │ Action │
├──────────────────────────────────────────────────────────────────────┤
│ Normal │ < 70% │ Normal operation │
│ Warning │ 70-85% │ Log warning, emit metric │
│ Critical │ 85-95% │ API returns 503 for new tasks │
│ Overflow │ > 95% │ API rejects all writes, alert operators │
└──────────────────────────────────────────────────────────────────────┘
Note: Depth ratio = current_depth / configured_soft_limit
Soft limit is advisory, not a hard PGMQ constraint.
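As an illustrative sketch (not the planned implementation), the thresholds above reduce to a simple classification over the depth ratio:

```rust
/// Advisory pressure levels derived from queue depth relative to a configured soft limit.
#[derive(Debug, PartialEq)]
enum QueuePressure {
    Normal,   // < 70%: normal operation
    Warning,  // 70-85%: log warning, emit metric
    Critical, // 85-95%: API returns 503 for new tasks
    Overflow, // > 95%: reject all writes, alert operators
}

fn classify_queue_pressure(current_depth: u64, soft_limit: u64) -> QueuePressure {
    let ratio = current_depth as f64 / soft_limit.max(1) as f64;
    match ratio {
        r if r < 0.70 => QueuePressure::Normal,
        r if r < 0.85 => QueuePressure::Warning,
        r if r < 0.95 => QueuePressure::Critical,
        _ => QueuePressure::Overflow,
    }
}
```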
Portability Considerations:
- Queue depth semantics vary by backend (PGMQ vs RabbitMQ vs SQS)
- Configuration is backend-agnostic where possible
- Backend-specific tuning goes in backend-specific config sections
Configuration:
# config/tasker/base/common.toml
[common.queues]
default_visibility_timeout_seconds = 30
[common.queues.pgmq]
poll_interval_ms = 250
[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000
# Messaging circuit breaker
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5 # Failures before opening
success_threshold = 2 # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)
4. Worker Layer
The worker layer protects handler execution from saturation.
| Mechanism | Status | Behavior |
|---|---|---|
| Semaphore-Bounded Dispatch | Implemented | Max concurrent handlers |
| Claim Capacity Check | Planned | Refuse claims when at capacity |
| Handler Timeout | Implemented | Kill stuck handlers |
| Completion Channel | Implemented | Bounded result buffer |
Handler Dispatch Flow:
Step Message
│
▼
┌─────────────────┐
│ Capacity Check │──── At capacity? ──── Leave in queue
│ (Planned) │ (visibility timeout)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Acquire Permit │
│ (Semaphore) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Execute Handler │
│ (with timeout) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Release Permit │──── BEFORE sending to completion channel
└────────┬────────┘
│
▼
┌─────────────────┐
│ Send Completion │
└─────────────────┘
Configuration:
# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000
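A sketch of the dispatch flow using tokio primitives, under the assumption that the configuration above maps onto a semaphore, a handler timeout, and a bounded completion channel; the step and outcome types are placeholders. Note how the permit is released before the completion send, matching the diagram.

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::{mpsc, Semaphore};

// Hypothetical types for the sketch.
struct StepMessage;
enum StepOutcome { Success, Failure(String) }
impl StepOutcome {
    fn failed(_step: &StepMessage, reason: &str) -> Self { StepOutcome::Failure(reason.to_string()) }
}
async fn run_handler(_step: &StepMessage) -> StepOutcome { StepOutcome::Success }

async fn dispatch_step(
    permits: Arc<Semaphore>,                   // sized by max_concurrent_handlers
    completion_tx: mpsc::Sender<StepOutcome>,  // bounded by completion_buffer_size
    step: StepMessage,
) {
    // Acquire a permit; this is the concurrency bound on handler execution.
    let permit = permits.acquire_owned().await.expect("semaphore closed");

    // Execute the handler under handler_timeout_ms; a timeout becomes a failure result, not a drop.
    let outcome = match tokio::time::timeout(Duration::from_millis(30_000), run_handler(&step)).await {
        Ok(result) => result,
        Err(_) => StepOutcome::failed(&step, "handler_timeout"),
    };

    // Release the permit BEFORE sending the completion, so a full completion
    // channel never holds handler capacity hostage.
    drop(permit);

    // The completion channel is bounded but blocking: results are never dropped.
    let _ = completion_tx.send(outcome).await;
}
```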
5. Domain Events
Domain events use fire-and-forget semantics to avoid blocking the critical path.
| Mechanism | Status | Behavior |
|---|---|---|
| Try-Send | Implemented | Non-blocking send |
| Drop on Full | Implemented | Events dropped if channel full |
| Metrics | Planned | Track dropped events |
Domain Event Flow:
Handler Complete
│
├── Result → Completion Channel (blocking, must succeed)
│
└── Domain Events → try_send() → If full: DROP with metric
│
└── Step execution NOT affected
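In code, the fire-and-forget branch can be as small as a try_send whose Full error becomes a metric increment instead of a blocked handler. The event type and metrics shim below are hypothetical.

```rust
use tokio::sync::mpsc::{self, error::TrySendError};

// Hypothetical event type and metrics shim for the sketch.
struct DomainEvent;
mod metrics {
    pub fn increment_dropped_events() { /* e.g. bump a dropped-events counter */ }
}

fn publish_domain_event(events_tx: &mpsc::Sender<DomainEvent>, event: DomainEvent) {
    match events_tx.try_send(event) {
        Ok(()) => {}
        // Channel full: drop the event and record it; step execution is never blocked.
        Err(TrySendError::Full(_)) => metrics::increment_dropped_events(),
        // Receiver gone (shutdown): also safe to drop.
        Err(TrySendError::Closed(_)) => {}
    }
}
```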
Segmentation of Responsibility
Orchestration System
The orchestration system must protect itself from:
- Client overload: Too many /v1/tasks requests
- Internal saturation: Command channel overflow
- Database exhaustion: Connection pool depletion
- Queue explosion: Unbounded PGMQ growth
Backpressure Response Hierarchy:
- Return 503 to client with Retry-After (fastest, cheapest)
- Block at command channel (internal buffering)
- Soft-reject at queue depth threshold (503 to new tasks)
- Circuit breaker opens (stop accepting work)
Worker System
The worker system must protect itself from:
- Handler saturation: Too many concurrent handlers
- FFI backlog: Ruby/Python handlers falling behind
- Completion overflow: Results backing up
- Step starvation: Claims outpacing processing
Backpressure Response Hierarchy:
- Refuse step claim (leave in queue, visibility timeout)
- Block at dispatch channel (internal buffering)
- Block at completion channel (handler waits)
- Circuit breaker opens (stop claiming work)
Step Idempotency Guarantees
Safe Backpressure Points
These backpressure points preserve step idempotency:
| Point | Why Safe |
|---|---|
| API 503 rejection | Task not yet created |
| Queue depth soft-limit | Step not yet enqueued |
| Step claim refusal | Message stays in queue, visibility timeout protects |
| Handler dispatch channel full | Step claimed but execution queued |
| Completion channel backpressure | Handler completed, result buffered |
Unsafe Patterns (NEVER DO)
| Pattern | Risk | Mitigation |
|---|---|---|
| Drop step after claiming | Lost work | Always send result (success or failure) |
| Timeout during handler execution | Duplicate execution on retry | Handlers MUST be idempotent |
| Drop completion result | Orchestration unaware of completion | Completion channel blocks, never drops |
| Reset step state without visibility timeout | Race with other workers | Use PGMQ visibility timeout |
Idempotency Contract
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP EXECUTION IDEMPOTENCY CONTRACT │
└─────────────────────────────────────────────────────────────────────────────┘
1. CLAIM: Atomic via pgmq_read_specific_message()
├── Only one worker can claim a message
├── Visibility timeout protects against worker crash
└── If claim fails: Message stays in queue → another worker claims
2. EXECUTE: Handler invocation (FFI boundary critical - see below)
├── Handlers SHOULD be idempotent (business logic recommendation)
├── Timeout generates FAILURE result (not drop)
├── Panic generates FAILURE result (not drop)
└── Error generates FAILURE result (not drop)
3. PERSIST: Result submission
├── Completion channel is bounded but BLOCKING
├── Result MUST reach orchestration (never dropped)
└── If send fails: Step remains "in_progress" → recovered by orchestration
4. FINALIZE: Orchestration processes result
├── State transition is atomic
├── Duplicate results handled by state guards
└── Idempotent: Same result processed twice = same outcome
FFI Boundary Idempotency Semantics
The FFI boundary (Rust → Ruby/Python handler) creates a critical demarcation for error classification:
┌─────────────────────────────────────────────────────────────────────────────┐
│ FFI BOUNDARY ERROR CLASSIFICATION │
└─────────────────────────────────────────────────────────────────────────────┘
FFI BOUNDARY
│
BEFORE FFI CROSSING │ AFTER FFI CROSSING
(System Layer) │ (Business Logic Layer)
│
┌─────────────────────┐ │ ┌─────────────────────┐
│ System errors are │ │ │ System failures │
│ RETRYABLE: │ │ │ are PERMANENT: │
│ │ │ │ │
│ • Channel timeout │ │ │ • Worker crash │
│ • Queue unavailable │ │ │ • FFI panic │
│ • Claim race lost │ │ │ • Process killed │
│ • Network partition │ │ │ │
│ • Message malformed │ │ │ We cannot know if │
│ │ │ │ business logic │
│ Step has NOT been │ │ │ executed or not. │
│ handed to handler. │ │ │ │
└─────────────────────┘ │ └─────────────────────┘
│
│ ┌─────────────────────┐
│ │ Developer errors │
│ │ are TRUSTED: │
│ │ │
│ │ • RetryableError → │
│ │ System retries │
│ │ │
│ │ • PermanentError → │
│ │ Step fails │
│ │ │
│ │ Developer knows │
│ │ their domain logic. │
│ └─────────────────────┘
Key Principles:
- Before FFI: Any system error is safe to retry because no business logic has executed.
- After FFI, system failure: If the worker crashes or the FFI call fails after dispatch, we MUST treat it as a permanent failure. We cannot know whether the handler:
  - Never started (safe to retry)
  - Started but didn't complete (unknown side effects)
  - Completed but didn't return (work is done)
- After FFI, developer error: Trust the developer's classification:
  - RetryableError: Developer explicitly signals it is safe to retry (e.g., a temporarily unavailable API)
  - PermanentError: Developer explicitly signals it is not retriable (e.g., invalid data, business rule violation)
Implementation Guidance:
// BEFORE FFI - system error, retryable
match dispatch_to_handler(step).await {
Err(DispatchError::ChannelFull) => StepResult::retryable("dispatch_channel_full"),
Err(DispatchError::Timeout) => StepResult::retryable("dispatch_timeout"),
Ok(ffi_handle) => {
// AFTER FFI - different rules apply
match ffi_handle.await {
// System crash after FFI = permanent (unknown state)
Err(FfiError::ProcessCrash) => StepResult::permanent("handler_crash"),
Err(FfiError::Panic) => StepResult::permanent("handler_panic"),
// Developer-returned errors = trust their classification
Ok(HandlerResult::RetryableError(msg)) => StepResult::retryable(msg),
Ok(HandlerResult::PermanentError(msg)) => StepResult::permanent(msg),
Ok(HandlerResult::Success(data)) => StepResult::success(data),
}
}
}
Note: We RECOMMEND handlers be idempotent but cannot REQUIRE it—business logic is developer-controlled. The system provides visibility timeout protection and duplicate result handling, but ultimate idempotency responsibility lies with handler implementations.
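One common way for a handler to honor that recommendation is an idempotency-key guard: derive a stable key from the step's identity and short-circuit if the work was already recorded. The sketch below is generic and hypothetical, not a Tasker API.

```rust
use uuid::Uuid;

// Hypothetical types for the sketch: a store for completed work, a handler outcome,
// and a payment side effect.
struct IdempotencyStore;
impl IdempotencyStore {
    async fn get(&self, _key: &str) -> Option<String> { None }
    async fn put(&self, _key: &str, _value: &str) {}
}
enum HandlerOutcome { Success(String) }
async fn charge_customer(_order_id: u64) -> String { "receipt-123".to_string() }

// An idempotency-key guard: retries of the same step resolve to the same recorded result.
async fn process_payment_step(step_uuid: Uuid, order_id: u64, store: &IdempotencyStore) -> HandlerOutcome {
    let key = format!("payment:{order_id}:{step_uuid}");

    // A previous attempt already did the work: return its result instead of repeating the side effect.
    if let Some(previous) = store.get(&key).await {
        return HandlerOutcome::Success(previous);
    }

    // Perform the side effect once, record it under the key, then return.
    let receipt = charge_customer(order_id).await;
    store.put(&key, &receipt).await;
    HandlerOutcome::Success(receipt)
}
```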
Backpressure Decision Tree
Use this decision tree when designing new backpressure mechanisms:
┌─────────────────────────┐
│ New Backpressure Point │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ Does this affect step │
│ execution correctness? │
└────────────┬────────────┘
│
┌─────────────┴─────────────┐
│ │
Yes No
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Can the work be │ │ Safe to drop │
│ retried safely? │ │ or timeout │
└────────┬────────┘ └─────────────────┘
│
┌─────────┴─────────┐
│ │
Yes No
│ │
▼ ▼
┌───────────┐ ┌───────────────┐
│ Use block │ │ MUST NOT DROP │
│ or reject │ │ Block until │
│ (retriable│ │ success │
│ error) │ └───────────────┘
└───────────┘
Configuration Reference
TOML Structure: Configuration files are organized as config/tasker/base/{common,worker,orchestration}.toml with environment overrides in config/tasker/environments/{test,development,production}/.
Complete Backpressure Configuration
# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/common.toml - Shared settings
# ════════════════════════════════════════════════════════════════════════════
# Circuit breaker defaults (inherited by all component breakers)
[common.circuit_breakers.default_config]
failure_threshold = 5 # Failures before opening
timeout_seconds = 30 # Time in open state before half-open
success_threshold = 2 # Successes in half-open to close
# Web/API database circuit breaker
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2
# Messaging circuit breaker - PGMQ/RabbitMQ operations
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2
# Queue configuration
[common.queues]
default_visibility_timeout_seconds = 30
[common.queues.pgmq]
poll_interval_ms = 250
[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000
# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/orchestration.toml - Orchestration layer
# ════════════════════════════════════════════════════════════════════════════
[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000
[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000
[orchestration.mpsc_channels.event_channel]
event_channel_buffer_size = 10000
# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/worker.toml - Worker layer
# ════════════════════════════════════════════════════════════════════════════
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000 # Steps waiting for handler
completion_buffer_size = 1000 # Results waiting for orchestration
max_concurrent_handlers = 10 # Semaphore permits
handler_timeout_ms = 30000 # Max handler execution time
[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000 # FFI events waiting for Ruby/Python
completion_timeout_ms = 30000 # Time to wait for FFI completion
starvation_warning_threshold_ms = 10000 # Warn if event waits this long
# Planned:
# claim_capacity_threshold = 0.8 # Refuse claims at 80% capacity
Monitoring and Alerting
See Backpressure Monitoring Runbook for:
- Key metrics to monitor
- Alerting thresholds
- Incident response procedures
Key Metrics Summary
| Metric | Type | Alert Threshold |
|---|---|---|
| api_requests_rejected_total | Counter | > 10/min |
| circuit_breaker_state | Gauge | state = open |
| mpsc_channel_saturation | Gauge | > 80% |
| pgmq_queue_depth | Gauge | > 80% of max |
| worker_claim_refusals_total | Counter | > 10/min |
| handler_semaphore_wait_time_ms | Histogram | p99 > 1000ms |
Related Documentation
- Worker Event Systems - Dual-channel architecture
- MPSC Channel Guidelines - Channel creation guide
- MPSC Channel Tuning - Operational tuning
- Bounded MPSC Channels ADR
← Back to Documentation Hub
Circuit Breakers
Last Updated: 2026-02-04 Audience: Architects, Operators, Developers Status: Active Related Docs: Backpressure Architecture | Observability | Operations: Backpressure Monitoring
← Back to Documentation Hub
Circuit breakers provide fault isolation and cascade prevention across tasker-core. This document covers the circuit breaker architecture, implementations, configuration, and operational monitoring.
Core Concept
Circuit breakers prevent cascading failures by failing fast when a component is unhealthy. Instead of waiting for slow or failing operations to timeout, circuit breakers detect failure patterns and immediately reject calls, giving the downstream system time to recover.
State Machine
┌─────────────────────────────────────────────────────────────────────────────┐
│ CIRCUIT BREAKER STATE MACHINE │
└─────────────────────────────────────────────────────────────────────────────┘
Success
┌─────────┐
│ │
▼ │
┌───────┐ │
───────>│CLOSED │─────┘
└───┬───┘
│
│ failure_threshold
│ consecutive failures
│
▼
┌───────┐
│ OPEN │◄─────────────────────┐
└───┬───┘ │
│ │
│ timeout_seconds │ Any failure
│ elapsed │ in half-open
│ │
▼ │
┌──────────┐ │
│HALF-OPEN │─────────────────────┘
└────┬─────┘
│
│ success_threshold
│ consecutive successes
│
▼
┌───────┐
│CLOSED │
└───────┘
States:
- Closed: Normal operation. All calls allowed. Tracks consecutive failures.
- Open: Failing fast. All calls rejected immediately. Waiting for timeout.
- Half-Open: Testing recovery. Limited calls allowed. Single failure reopens.
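To make the transitions concrete, here is a minimal, single-threaded sketch of the same rules. It is illustrative only: the real breaker in tasker-shared uses lock-free atomics, takes &self, and records call durations, and the names here (SketchBreaker, trip_open) are invented for this example.
#[derive(Clone, Copy, PartialEq, Debug)]
enum CircuitState { Closed, Open, HalfOpen }
struct SketchBreaker {
    state: CircuitState,
    consecutive_failures: u32,
    consecutive_successes: u32,
    failure_threshold: u32,        // e.g. 5
    success_threshold: u32,        // e.g. 2
    timeout: std::time::Duration,  // e.g. 30s
    opened_at: Option<std::time::Instant>,
}
impl SketchBreaker {
    fn should_allow(&mut self) -> bool {
        match self.state {
            CircuitState::Closed => true,
            // Half-open allows probe traffic; the real breaker limits how much.
            CircuitState::HalfOpen => true,
            CircuitState::Open => {
                // After timeout_seconds in open, move to half-open and allow a probe.
                if self.opened_at.map_or(false, |t| t.elapsed() >= self.timeout) {
                    self.state = CircuitState::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }
    fn record_success(&mut self) {
        match self.state {
            CircuitState::Closed => self.consecutive_failures = 0,
            CircuitState::HalfOpen => {
                self.consecutive_successes += 1;
                // success_threshold consecutive successes close the circuit.
                if self.consecutive_successes >= self.success_threshold {
                    self.state = CircuitState::Closed;
                    self.consecutive_failures = 0;
                }
            }
            CircuitState::Open => {}
        }
    }
    fn record_failure(&mut self) {
        match self.state {
            // failure_threshold consecutive failures open the circuit.
            CircuitState::Closed => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.trip_open();
                }
            }
            // Any failure in half-open reopens immediately.
            CircuitState::HalfOpen => self.trip_open(),
            CircuitState::Open => {}
        }
    }
    fn trip_open(&mut self) {
        self.state = CircuitState::Open;
        self.consecutive_successes = 0;
        self.opened_at = Some(std::time::Instant::now());
    }
}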
Unified Trait: CircuitBreakerBehavior
All circuit breaker implementations share a common trait defined in tasker-shared/src/resilience/behavior.rs:
#![allow(unused)]
fn main() {
pub trait CircuitBreakerBehavior: Send + Sync + Debug {
fn name(&self) -> &str;
fn state(&self) -> CircuitState;
fn should_allow(&self) -> bool;
fn record_success(&self, duration: Duration);
fn record_failure(&self, duration: Duration);
fn is_healthy(&self) -> bool;
fn force_open(&self);
fn force_closed(&self);
fn metrics(&self) -> CircuitBreakerMetrics;
}
}
Each specialized breaker wraps the generic CircuitBreaker (composition pattern) and implements this trait. This means:
- Consistent state machine behavior across all breakers
- Proper half-open → closed recovery via success_threshold
- Lock-free atomic state management
- Domain-specific methods remain as additional methods on each type
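As an illustration of the composition pattern, a specialized breaker can hold the generic breaker and delegate every trait method to it, keeping domain-specific helpers on the concrete type. This is a sketch only: it assumes the generic CircuitBreaker exposes methods mirroring the trait above, and the struct name, module paths, and helper method are invented for illustration.
use std::time::Duration;
use tasker_shared::resilience::{CircuitBreaker, CircuitBreakerBehavior, CircuitBreakerMetrics, CircuitState};
// Hypothetical specialized breaker; the real implementations are listed in the table below.
#[derive(Debug)]
pub struct ExampleComponentBreaker {
    inner: CircuitBreaker, // composition: wrap the generic breaker
}
impl CircuitBreakerBehavior for ExampleComponentBreaker {
    fn name(&self) -> &str { "example_component" }
    fn state(&self) -> CircuitState { self.inner.state() }
    fn should_allow(&self) -> bool { self.inner.should_allow() }
    fn record_success(&self, duration: Duration) { self.inner.record_success(duration) }
    fn record_failure(&self, duration: Duration) { self.inner.record_failure(duration) }
    fn is_healthy(&self) -> bool { self.inner.is_healthy() }
    fn force_open(&self) { self.inner.force_open() }
    fn force_closed(&self) { self.inner.force_closed() }
    fn metrics(&self) -> CircuitBreakerMetrics { self.inner.metrics() }
}
// Domain-specific behavior stays on the concrete type:
impl ExampleComponentBreaker {
    pub fn record_call(&self, ok: bool, duration: Duration) {
        if ok { self.record_success(duration) } else { self.record_failure(duration) }
    }
}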
Circuit Breaker Implementations
Tasker-core has four circuit breaker implementations, each protecting specific components.
All wrap the generic CircuitBreaker from tasker_shared::resilience:
| Circuit Breaker | Location | Purpose | Trigger Type |
|---|---|---|---|
| Web Database | tasker-orchestration | API database operations | Error-based |
| Task Readiness | tasker-orchestration | Fallback poller database checks | Error-based |
| FFI Completion | tasker-worker | Ruby/Python handler completion channel | Latency-based |
| Messaging | tasker-shared | Message queue operations (PGMQ/RabbitMQ) | Error-based |
1. Web Database Circuit Breaker
Purpose: Protects API endpoints from cascading database failures.
Scope: Independent from orchestration system’s internal operations.
Behavior:
- Opens when database queries fail repeatedly
- Returns 503 with a Retry-After header when open (sketched below)
- Fast-fail rejection with atomic state management
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.web]
failure_threshold = 5 # Consecutive failures before opening
success_threshold = 2 # Successes in half-open to fully close
# timeout_seconds inherited from default_config (30s)
Health Check Integration:
- Included in the /health/ready endpoint
- State reported in the /health/detailed response
- Metric: api_circuit_breaker_state (0=closed, 1=half-open, 2=open)
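A rough sketch of the fast-fail path: check the breaker before touching the database and, when it is open, answer immediately with 503 and a Retry-After hint. The handler shape, the fixed Retry-After value, and the error type here are assumptions, not the actual tasker-orchestration code.
use axum::http::{HeaderMap, StatusCode};
// Hypothetical guard around a database-backed endpoint.
async fn guarded_db_endpoint(
    breaker: &dyn CircuitBreakerBehavior,
) -> Result<String, (StatusCode, HeaderMap)> {
    if !breaker.should_allow() {
        // Fail fast instead of letting requests pile up behind a sick database.
        let mut headers = HeaderMap::new();
        headers.insert("Retry-After", "30".parse().unwrap());
        return Err((StatusCode::SERVICE_UNAVAILABLE, headers));
    }
    let started = std::time::Instant::now();
    // ... run the query here; on Ok call breaker.record_success(started.elapsed()),
    //     on Err call breaker.record_failure(started.elapsed()) and map the error ...
    let _ = started;
    Ok("query result".to_string())
}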
2. Task Readiness Circuit Breaker
Purpose: Protects fallback poller from database overload during polling cycles.
Scope: Independent from web circuit breaker, specific to task readiness queries.
Behavior:
- Opens when task readiness queries fail repeatedly
- Skips polling cycles when open (doesn’t fail-fast, just skips)
- Allows orchestration to continue processing existing work
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10 # Higher threshold for polling
timeout_seconds = 60 # Longer recovery window
success_threshold = 3 # More successes needed for confidence
Why Separate from Web?:
- Different failure patterns (polling vs request-driven)
- Different recovery semantics (skip vs reject)
- Isolation prevents web failures from stopping polling (and vice versa)
3. FFI Completion Circuit Breaker
Purpose: Protects Ruby/Python worker completion channels from backpressure.
Scope: Worker-specific, protects FFI boundary.
Behavior:
- Latency-based: Treats slow sends (>100ms) as failures
- Opens when completion channel is consistently slow
- Prevents FFI threads from blocking on saturated channels
- Drops completions when open (with metrics), allowing handler threads to continue
Configuration (config/tasker/base/worker.toml):
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5 # Slow sends before opening
recovery_timeout_seconds = 5 # Short recovery window
success_threshold = 2 # Successes to close
slow_send_threshold_ms = 100 # Latency threshold (100ms)
Why Latency-Based?:
- Slow channel sends indicate backpressure buildup
- Blocking FFI threads can cascade to Ruby/Python handler starvation
- Error-only detection misses slow-but-completing operations
- Latency detection catches degradation before total failure
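A sketch of how a latency-based breaker wraps a completion send: the send is timed, and a successful-but-slow send is recorded as a failure just like an error. The function and channel types are illustrative; the real logic lives in tasker-worker's FfiCompletionCircuitBreaker.
use std::time::{Duration, Instant};
use tokio::sync::mpsc;
async fn send_completion_guarded<T>(
    breaker: &dyn CircuitBreakerBehavior,
    tx: &mpsc::Sender<T>,
    completion: T,
    slow_send_threshold: Duration, // slow_send_threshold_ms, e.g. 100ms
) {
    if !breaker.should_allow() {
        // Circuit open: drop the completion (counted in the metrics below) so the
        // FFI thread never blocks on a saturated channel.
        return;
    }
    let started = Instant::now();
    let result = tx.send(completion).await;
    let elapsed = started.elapsed();
    match result {
        Ok(()) if elapsed <= slow_send_threshold => breaker.record_success(elapsed),
        // Slow-but-successful sends and outright send errors both count as failures.
        _ => breaker.record_failure(elapsed),
    }
}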
Metrics:
- ffi_completion_slow_sends_total - Sends exceeding latency threshold
- ffi_completion_circuit_open_rejections_total - Rejections due to open circuit
4. Messaging Circuit Breaker
Purpose: Protects message queue operations from provider failures (PGMQ or RabbitMQ).
Scope: Integrated into MessageClient, shared across orchestration and worker messaging.
Behavior:
- Opens when send/receive operations fail repeatedly
- Protected operations: send_step_message, receive_step_messages, send_step_result, receive_step_results, send_task_request, receive_task_requests, send_task_finalization, receive_task_finalizations, send_message, receive_messages
- Unprotected operations (safe to fail or needed for recovery): ack_message, nack_message, extend_visibility, health_check, ensure_queue, queue stats
- Coordinates with visibility timeout for message safety
- Provider-agnostic: works with both PGMQ and RabbitMQ backends
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5 # Failures before opening
success_threshold = 2 # Successes to close
# timeout_seconds inherited from default_config (30s)
Why ack/nack bypass the breaker?:
- Ack/nack failure causes message redelivery via visibility timeout, which is safe
- Health check must work when breaker is open to detect recovery
- Queue management is startup-only and should not be gated
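The split can be pictured as a guard that only the protected operations go through, while ack/nack and health checks call the provider directly. The wrapper below, its error enum, and the call sites in the trailing comments are illustrative, not the MessageClient API.
use std::future::Future;
use std::time::Instant;
// Hypothetical error wrapper for guarded operations.
pub enum GuardedError<E> {
    CircuitOpen,
    Inner(E),
}
// Run a protected messaging operation through the breaker, recording its outcome.
pub async fn with_breaker<T, E, F, Fut>(
    breaker: &dyn CircuitBreakerBehavior,
    op: F,
) -> Result<T, GuardedError<E>>
where
    F: FnOnce() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    if !breaker.should_allow() {
        return Err(GuardedError::CircuitOpen);
    }
    let started = Instant::now();
    match op().await {
        Ok(value) => {
            breaker.record_success(started.elapsed());
            Ok(value)
        }
        Err(e) => {
            breaker.record_failure(started.elapsed());
            Err(GuardedError::Inner(e))
        }
    }
}
// Protected:   with_breaker(&breaker, || client.send_step_message(...)).await
// Unprotected: client.ack_message(...).await  // bypasses the breaker entirely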
Configuration Reference
Global Settings
[common.circuit_breakers.global_settings]
metrics_collection_interval_seconds = 30 # Metrics aggregation interval
min_state_transition_interval_seconds = 5.0 # Debounce for rapid transitions
Default Configuration
Applied to any circuit breaker without explicit configuration:
[common.circuit_breakers.default_config]
failure_threshold = 5 # 1-100 range
timeout_seconds = 30 # 1-300 range
success_threshold = 2 # 1-50 range
Component-Specific Overrides
# Task readiness (polling-specific)
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10
success_threshold = 3
# Messaging operations (PGMQ/RabbitMQ)
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2
# Web/API database operations
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2
Note: timeout_seconds is inherited from default_config for all component circuit breakers. The pgmq key is accepted as an alias for messaging for backward compatibility.
Worker-Specific Configuration
# FFI completion (latency-based)
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5
recovery_timeout_seconds = 5
success_threshold = 2
slow_send_threshold_ms = 100
Environment Overrides
Different environments may need different thresholds:
Test (config/tasker/environments/test/common.toml):
[common.circuit_breakers.default_config]
failure_threshold = 2 # Faster failure detection
timeout_seconds = 5 # Quick recovery for tests
success_threshold = 1
Production (config/tasker/environments/production/common.toml):
[common.circuit_breakers.default_config]
failure_threshold = 10 # More tolerance for transient failures
timeout_seconds = 60 # Longer recovery window
success_threshold = 5 # More confidence before closing
Health Endpoint Integration
Circuit breaker states are exposed through health endpoints for monitoring and Kubernetes probes.
Orchestration Health (/health/detailed)
{
"status": "healthy",
"checks": {
"circuit_breakers": {
"status": "healthy",
"message": "Circuit breaker state: Closed",
"duration_ms": 1,
"last_checked": "2025-12-10T10:00:00Z"
}
}
}
Worker Health (/health/detailed)
{
"status": "healthy",
"checks": {
"circuit_breakers": {
"status": "healthy",
"message": "2 circuit breakers: 2 closed, 0 open, 0 half-open. Details: ffi_completion: closed (100 calls, 2 failures); task_readiness: closed (50 calls, 0 failures)",
"duration_ms": 0,
"last_checked": "2025-12-10T10:00:00Z"
}
}
}
Health Status Mapping
| Circuit Breaker State | Health Status | Impact |
|---|---|---|
| All Closed | healthy | Normal operation |
| Any Half-Open | degraded | Testing recovery |
| Any Open | unhealthy | Failing fast |
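The mapping is effectively worst-state-wins across all registered breakers. A sketch of that aggregation (HealthStatus is an illustrative enum, and the CircuitState variant names are assumed):
#[derive(Debug, PartialEq)]
enum HealthStatus { Healthy, Degraded, Unhealthy }
fn aggregate_health(states: &[CircuitState]) -> HealthStatus {
    if states.iter().any(|s| matches!(s, CircuitState::Open)) {
        HealthStatus::Unhealthy      // any open breaker => failing fast
    } else if states.iter().any(|s| matches!(s, CircuitState::HalfOpen)) {
        HealthStatus::Degraded       // testing recovery
    } else {
        HealthStatus::Healthy        // all closed
    }
}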
Monitoring and Alerting
Key Metrics
| Metric | Type | Description |
|---|---|---|
| api_circuit_breaker_state | Gauge | Web breaker state (0/1/2) |
| tasker_circuit_breaker_state | Gauge | Per-component state |
| api_requests_rejected_total | Counter | Rejections due to open breaker |
| ffi_completion_slow_sends_total | Counter | Slow send detections |
| ffi_completion_circuit_open_rejections_total | Counter | FFI breaker rejections |
Prometheus Alerts
groups:
- name: circuit_breakers
rules:
- alert: TaskerCircuitBreakerOpen
expr: api_circuit_breaker_state == 2
for: 1m
labels:
severity: critical
annotations:
summary: "Circuit breaker is OPEN"
description: "Circuit breaker {{ $labels.component }} has been open for >1 minute"
- alert: TaskerCircuitBreakerHalfOpen
expr: api_circuit_breaker_state == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Circuit breaker stuck in half-open"
description: "Circuit breaker {{ $labels.component }} in half-open state >5 minutes"
- alert: TaskerFFISlowSendsHigh
expr: rate(ffi_completion_slow_sends_total[5m]) > 10
for: 2m
labels:
severity: warning
annotations:
summary: "FFI completion channel experiencing backpressure"
description: "Slow sends averaging >10/second, circuit breaker may open"
Grafana Dashboard Panels
Circuit Breaker State Timeline:
Panel: Time series
Query: api_circuit_breaker_state
Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)
FFI Latency Percentiles:
Panel: Time series
Queries:
- histogram_quantile(0.50, ffi_completion_send_duration_seconds_bucket)
- histogram_quantile(0.95, ffi_completion_send_duration_seconds_bucket)
- histogram_quantile(0.99, ffi_completion_send_duration_seconds_bucket)
Thresholds: 100ms warning, 500ms critical
Operational Procedures
When Circuit Breaker Opens
Immediate Actions:
- Check database connectivity: pg_isready -h <host> -p 5432
- Check connection pool status: /health/detailed endpoint
- Review recent error logs for root cause
- Monitor queue depth for message backlog
Recovery:
- Circuit automatically tests recovery after timeout_seconds
- No manual intervention needed for transient failures
- For persistent failures, fix underlying issue first
Escalation:
- If breaker stays open >5 minutes, escalate to database team
- If breaker oscillates (open/half-open/open), increase failure_threshold
Tuning Guidelines
Symptom: Breaker opens too frequently
- Increase failure_threshold
- Investigate root cause of failures
- Consider if failures are transient vs systemic
Symptom: Breaker stays open too long
- Decrease timeout_seconds
- Verify downstream system has recovered
- Check if success_threshold is too high
Symptom: FFI breaker opens unnecessarily
- Increase slow_send_threshold_ms
- Verify channel buffer sizes are adequate
- Check Ruby/Python handler throughput
Architecture Integration
Relationship to Backpressure
Circuit breakers are one layer of the broader backpressure strategy:
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESILIENCE LAYER STACK │
└─────────────────────────────────────────────────────────────────────────────┘
Layer 1: Circuit Breakers → Fast-fail on component failure
Layer 2: Bounded Channels → Backpressure on internal queues
Layer 3: Visibility Timeouts → Message-level retry safety
Layer 4: Semaphore Limits → Handler execution rate limiting
Layer 5: Connection Pools → Database resource management
See Backpressure Architecture for the complete strategy.
Independence Principle
Each circuit breaker operates independently:
- Web breaker can be open while task readiness breaker is closed
- FFI breaker state doesn’t affect PGMQ breaker
- Prevents single failure mode from cascading across components
- Allows targeted recovery per component
Integration Points
| Component | Circuit Breaker | Integration Point |
|---|---|---|
| tasker-orchestration/src/web | Web Database | API request handlers |
| tasker-orchestration/src/orchestration/task_readiness | Task Readiness | Fallback poller loop |
| tasker-worker/src/worker/handlers | FFI Completion | Completion channel sends |
| tasker-shared/src/messaging/client.rs | Messaging | MessageClient send/receive methods |
Troubleshooting
Common Issues
Issue: Web circuit breaker flapping (open → half-open → open rapidly)
Diagnosis:
- Check database query latency (slow queries can cause timeout failures)
- Review connection pool saturation
- Check if PostgreSQL is under memory pressure
Resolution:
- Increase failure_threshold if failures are transient
- Increase timeout_seconds to give more recovery time
- Fix underlying database performance issues
Issue: FFI completion circuit breaker opens during normal load
Diagnosis:
- Check Ruby/Python handler execution time
- Review completion channel buffer utilization
- Verify worker concurrency settings
Resolution:
- Increase slow_send_threshold_ms if handlers are legitimately slow
- Increase channel buffer size in worker config
- Reduce handler concurrency if system is overloaded
- Reduce handler concurrency if system is overloaded
Issue: Task readiness breaker open but web API working fine
Diagnosis:
- Task readiness queries may be slower/different than API queries
- Polling may hit database at different times (e.g., during maintenance)
Resolution:
- Independent breakers are working as designed
- Check specific task readiness query performance
- Consider database index optimization for readiness queries
Source Code Reference
| Component | File |
|---|---|
| CircuitBreakerBehavior Trait | tasker-shared/src/resilience/behavior.rs |
| Generic CircuitBreaker | tasker-shared/src/resilience/circuit_breaker.rs |
| Circuit Breaker Config | tasker-shared/src/config/circuit_breaker.rs |
| MessageClient (messaging breaker) | tasker-shared/src/messaging/client.rs |
| WebDatabaseCircuitBreaker | tasker-orchestration/src/api_common/circuit_breaker.rs |
| Web CB Helpers | tasker-orchestration/src/web/circuit_breaker.rs |
| TaskReadinessCircuitBreaker | tasker-orchestration/src/orchestration/task_readiness/circuit_breaker.rs |
| FfiCompletionCircuitBreaker | tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs |
| Worker Health Integration | tasker-worker/src/web/handlers/health.rs |
| Circuit Breaker Types | tasker-shared/src/types/api/worker.rs |
Related Documentation
- Backpressure Architecture - Complete resilience strategy
- Operations: Backpressure Monitoring - Operational runbooks
- Operations: MPSC Channel Tuning - Channel capacity management
- Observability - Metrics and logging standards
- Configuration Management - TOML configuration reference
← Back to Documentation Hub
Crate Architecture
Last Updated: 2026-01-15 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Events and Commands | Quick Start
← Back to Documentation Hub
Overview
Tasker Core is organized as a Cargo workspace with 7 member crates, each with a specific responsibility in the workflow orchestration system. This document explains the role of each crate, their inter-dependencies, and how they work together to provide a complete orchestration solution.
Design Philosophy
The crate structure follows these principles:
- Separation of Concerns: Each crate has a well-defined responsibility
- Minimal Dependencies: Crates depend on the minimum necessary dependencies
- Shared Foundation: Common types and utilities in tasker-shared
- Language Flexibility: Support for multiple worker implementations (Rust, Ruby, Python planned)
- Production Ready: Workers and the orchestration system can be deployed and scaled independently
Workspace Structure
tasker-core/
├── tasker-pgmq/ # PGMQ wrapper with notification support
├── tasker-shared/ # Shared types, SQL functions, utilities
├── tasker-orchestration/ # Task coordination and lifecycle management
├── tasker-worker/ # Step execution and handler integration
├── tasker-client/ # API client library (REST + gRPC transport)
├── tasker-ctl/ # CLI binary (depends on tasker-client)
└── workers/
├── ruby/ext/tasker_core/ # Ruby FFI bindings
└── rust/ # Rust native worker
Crate Dependency Graph
┌─────────────────────────────────────────────────────────┐
│ External Dependencies │
│ (sqlx, tokio, serde, pgmq, magnus, axum, etc.) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ tasker-pgmq │
│ PGMQ wrapper with PostgreSQL LISTEN/NOTIFY │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ tasker-shared │
│ Core types, SQL functions, state machines │
└─────────────────────────────────────────────────────────┘
│
┌────────────┴────────────┐
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ tasker-orchestration │ │ tasker-worker │
│ Task coordination │ │ Step execution │
│ Lifecycle management │ │ Handler integration │
│ REST API │ │ FFI support │
└──────────────────────────┘ └──────────────────────────┘
│ │
▼ │
┌──────────────────────────┐ │
│ tasker-client │ │
│ API client library │ │
│ REST + gRPC transport │ │
└──────────────────────────┘ │
│ │
▼ │
┌──────────────────────────┐ │
│ tasker-ctl │ │
│ CLI binary │ │
└──────────────────────────┘ │
│
┌────────────────────────┘
│
┌────────┴────────┐
▼ ▼
┌────────────┐ ┌────────────┐
│ workers/ │ │ workers/ │
│ ruby/ │ │ rust/ │
│ ext/ │ │ │
└────────────┘ └────────────┘
Core Crates
tasker-pgmq
Purpose: Wrapper around PostgreSQL Message Queue (PGMQ) with native PostgreSQL LISTEN/NOTIFY support
Location: tasker-pgmq/
Key Responsibilities:
- Wrap the pgmq crate with notification capabilities
- Provide atomic pgmq_send_with_notify() operations
- Handle notification channel management
- Support namespace-aware queue naming
Public API:
#![allow(unused)]
fn main() {
pub struct PgmqClient {
// Send message with atomic notification
pub async fn send_with_notify<T>(&self, queue: &str, msg: T) -> Result<i64>;
// Read message with visibility timeout
pub async fn read<T>(&self, queue: &str, vt: i32) -> Result<Option<Message<T>>>;
// Delete processed message
pub async fn delete(&self, queue: &str, msg_id: i64) -> Result<bool>;
}
}
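A usage sketch for this API, assuming an already-constructed client, a serde-serializable payload, and a queue name invented for the example; the Message field names follow the pgmq crate and error handling is simplified.
use serde::{Deserialize, Serialize};
#[derive(Serialize, Deserialize)]
struct ExamplePayload {
    step_uuid: String,
}
async fn roundtrip(client: &PgmqClient) -> Result<(), Box<dyn std::error::Error>> {
    // Send with an atomic pg_notify so listeners wake immediately.
    client
        .send_with_notify("example_namespace_queue", ExamplePayload { step_uuid: "…".into() })
        .await?;
    // Read with a 30-second visibility timeout; None means the queue is empty.
    if let Some(msg) = client.read::<ExamplePayload>("example_namespace_queue", 30).await? {
        // ... process msg.message, then delete to acknowledge ...
        client.delete("example_namespace_queue", msg.msg_id).await?;
    }
    Ok(())
}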
When to Use:
- When you need reliable message queuing with PostgreSQL
- When you need atomic send + notify operations
- When building event-driven systems on PostgreSQL
Dependencies:
- pgmq - Core PostgreSQL message queue functionality
- sqlx - Database connectivity
- tokio - Async runtime
tasker-shared
Purpose: Foundation crate containing all shared types, utilities, and SQL function interfaces
Location: tasker-shared/
Key Responsibilities:
- Core domain models (Task, WorkflowStep, TaskTransition, etc.)
- State machine implementations (Task + Step)
- SQL function executor and registry
- Database utilities and migrations
- Event system traits and types
- Messaging abstraction layer: Provider-agnostic messaging with PGMQ, RabbitMQ, and InMemory backends
- Factory system for testing
- Metrics and observability primitives
Public API:
#![allow(unused)]
fn main() {
// Core Models
pub mod models {
pub struct Task { /* ... */ }
pub struct WorkflowStep { /* ... */ }
pub struct TaskTransition { /* ... */ }
pub struct WorkflowStepTransition { /* ... */ }
}
// State Machines
pub mod state_machine {
pub struct TaskStateMachine { /* ... */ }
pub struct StepStateMachine { /* ... */ }
pub enum TaskState { /* 12 states */ }
pub enum WorkflowStepState { /* 9 states */ }
}
// SQL Functions
pub mod database {
pub struct SqlFunctionExecutor { /* ... */ }
pub async fn get_step_readiness_status(...) -> Result<Vec<StepReadinessStatus>>;
pub async fn get_next_ready_tasks(...) -> Result<Vec<ReadyTaskInfo>>;
}
// Event System
pub mod event_system {
pub trait EventDrivenSystem { /* ... */ }
pub enum DeploymentMode { Hybrid, EventDrivenOnly, PollingOnly }
}
// Messaging
pub mod messaging {
// Provider abstraction
pub enum MessagingProvider { Pgmq, RabbitMq, InMemory }
pub trait MessagingService { /* send_message, receive_messages, ack_message, ... */ }
pub trait SupportsPushNotifications { /* subscribe, subscribe_many, requires_fallback_polling */ }
pub enum MessageNotification { Available { ... }, Message(...) }
// Domain client
pub struct MessageClient { /* High-level queue operations */ }
// Message types
pub struct SimpleStepMessage { /* ... */ }
pub struct TaskRequestMessage { /* ... */ }
pub struct StepExecutionResult { /* ... */ }
}
}
When to Use:
- Always - This is the foundation for all other crates
- When you need core domain models
- When you need state machine logic
- When you need SQL function access
- When you need testing factories
Dependencies:
- tasker-pgmq - Message queue operations
- sqlx - Database operations
- serde - Serialization
- Many workspace-shared dependencies
Why It’s Separate:
- Eliminates circular dependencies between orchestration and worker
- Provides single source of truth for domain models
- Enables independent testing of core logic
- Allows multiple implementations (orchestration vs worker) to share code
tasker-orchestration
Purpose: Task coordination, lifecycle management, and orchestration REST API
Location: tasker-orchestration/
Key Responsibilities:
- Actor-based lifecycle coordination
- Task initialization and finalization
- Step discovery and enqueueing
- Result processing from workers
- Dynamic executor pool management
- Event-driven coordination
- REST API endpoints
- Health monitoring
- Metrics collection
Public API:
#![allow(unused)]
fn main() {
// Core orchestration
pub struct OrchestrationCore {
pub async fn new() -> Result<Self>;
pub async fn from_config(config: ConfigManager) -> Result<Self>;
}
// Actor-based coordination
pub mod actors {
pub struct ActorRegistry { /* ... */ }
pub struct TaskRequestActor { /* ... */ }
pub struct ResultProcessorActor { /* ... */ }
pub struct StepEnqueuerActor { /* ... */ }
pub struct TaskFinalizerActor { /* ... */ }
pub trait OrchestrationActor { /* ... */ }
pub trait Handler<M: Message> { /* ... */ }
pub trait Message { /* ... */ }
}
// Lifecycle services (wrapped by actors)
pub mod lifecycle {
pub struct TaskInitializer { /* ... */ }
pub struct StepEnqueuerService { /* ... */ }
pub struct OrchestrationResultProcessor { /* ... */ }
pub struct TaskFinalizer { /* ... */ }
}
// Message hydration (Phase 4)
pub mod hydration {
pub struct StepResultHydrator { /* ... */ }
pub struct TaskRequestHydrator { /* ... */ }
pub struct FinalizationHydrator { /* ... */ }
}
// REST API (Axum)
pub mod web {
// POST /v1/tasks
pub async fn create_task(request: TaskRequest) -> Result<TaskResponse>;
// GET /v1/tasks/{uuid}
pub async fn get_task(uuid: Uuid) -> Result<TaskResponse>;
// GET /health
pub async fn health_check() -> Result<HealthResponse>;
}
// gRPC API (Tonic)
// Feature-gated behind `grpc-api`
pub mod grpc {
pub struct GrpcServer { /* ... */ }
pub struct GrpcState { /* wraps Arc<SharedApiServices> */ }
pub mod services {
pub struct TaskServiceImpl { /* 6 RPCs */ }
pub struct StepServiceImpl { /* 4 RPCs */ }
pub struct TemplateServiceImpl { /* 2 RPCs */ }
pub struct HealthServiceImpl { /* 4 RPCs */ }
pub struct AnalyticsServiceImpl { /* 2 RPCs */ }
pub struct DlqServiceImpl { /* 6 RPCs */ }
pub struct ConfigServiceImpl { /* 1 RPC */ }
}
pub mod interceptors {
pub struct AuthInterceptor { /* Bearer token, API key */ }
}
}
// Event systems
pub mod event_systems {
pub struct OrchestrationEventSystem { /* ... */ }
pub struct TaskReadinessEventSystem { /* ... */ }
}
}
Actor Architecture:
The orchestration crate implements a lightweight actor pattern for lifecycle component coordination:
- ActorRegistry: Manages all 4 orchestration actors with lifecycle hooks
- Message-Based Communication: Type-safe message handling via the Handler<M> trait
- Service Decomposition: Large services decomposed into focused components (<300 lines per file)
- Direct Integration: Command processor calls actors directly without wrapper layers
See Actor-Based Architecture for comprehensive documentation.
When to Use:
- When you need to run the orchestration server
- When you need task coordination logic
- When building custom orchestration components
- When integrating with the REST API
Dependencies:
- tasker-shared - Core types and SQL functions
- tasker-pgmq - Message queuing
- axum - REST API framework
- tower-http - HTTP middleware
Deployment: Typically deployed as a server process (tasker-server binary)
Dual-Server Architecture:
Orchestration supports both REST and gRPC APIs running simultaneously via SharedApiServices:
#![allow(unused)]
fn main() {
pub struct SharedApiServices {
pub security_service: Option<Arc<SecurityService>>,
pub task_service: TaskService,
pub step_service: StepService,
pub health_service: HealthService,
// ... other services
}
// Both APIs share the same service instances
AppState { services: Arc<SharedApiServices>, ... } // REST
GrpcState { services: Arc<SharedApiServices>, ... } // gRPC
}
Port Allocation:
- REST: 8080 (configurable)
- gRPC: 9190 (configurable)
tasker-worker
Purpose: Step execution, handler integration, and worker coordination
Location: tasker-worker/
Key Responsibilities:
- Claim steps from namespace queues
- Execute step handlers (Rust or FFI)
- Submit results to orchestration
- Template management and caching
- Event-driven step claiming
- Worker health monitoring
- FFI integration layer
Public API:
#![allow(unused)]
fn main() {
// Worker core
pub struct WorkerCore {
pub async fn new(config: WorkerConfig) -> Result<Self>;
pub async fn start(&mut self) -> Result<()>;
}
// Handler execution
pub mod handlers {
pub trait StepHandler {
async fn execute(&self, context: StepContext) -> Result<StepResult>;
}
}
// Template management
pub mod task_template_manager {
pub struct TaskTemplateManager {
pub async fn load_templates(&mut self) -> Result<()>;
pub fn get_template(&self, name: &str) -> Option<&TaskTemplate>;
}
}
// Event systems
pub mod event_systems {
pub struct WorkerEventSystem { /* ... */ }
}
}
When to Use:
- When you need to run a worker process
- When implementing custom step handlers
- When integrating with Ruby/Python handlers via FFI
- When building worker-specific tools
Dependencies:
- tasker-shared - Core types and messaging
- tasker-pgmq - Message queuing
- magnus (optional) - Ruby FFI bindings
Deployment: Deployed as worker processes, typically one per namespace or scaled horizontally
tasker-client
Purpose: Transport-agnostic API client library for REST and gRPC
Location: tasker-client/
Key Responsibilities:
- HTTP client for orchestration REST API
- gRPC client for orchestration gRPC API (feature-gated)
- Transport abstraction via unified client traits
- Configuration management and auth resolution
- Client-side request building
Public API:
#![allow(unused)]
fn main() {
// REST client
pub struct RestOrchestrationClient {
pub async fn new(base_url: &str) -> Result<Self>;
// Task, step, template, health operations
}
// gRPC client (feature-gated)
#[cfg(feature = "grpc")]
pub struct GrpcOrchestrationClient {
pub async fn connect(endpoint: &str) -> Result<Self>;
pub async fn connect_with_auth(endpoint: &str, auth: GrpcAuthConfig) -> Result<Self>;
// Same operations as REST client
}
// Transport-agnostic client
pub enum UnifiedOrchestrationClient {
Rest(Box<RestOrchestrationClient>),
Grpc(Box<GrpcOrchestrationClient>),
}
// Client trait for transport abstraction
pub trait OrchestrationClient: Send + Sync {
async fn create_task(&self, request: TaskRequest) -> Result<TaskResponse>;
async fn get_task(&self, uuid: Uuid) -> Result<TaskResponse>;
async fn list_tasks(&self, filters: TaskFilters) -> Result<Vec<TaskResponse>>;
async fn health_check(&self) -> Result<HealthResponse>;
// ... more operations
}
}
When to Use:
- When you need to interact with orchestration API from Rust
- When building integration tests
- When implementing client applications or FFI bindings
- When building UI frontends (TUI, web) that need API access
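A minimal usage sketch (the base URL is illustrative, and the TaskRequest fields are left as comments because their exact shape is not shown here):
async fn check_orchestration() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to a locally running orchestration server (URL is illustrative).
    let client = RestOrchestrationClient::new("http://localhost:8080").await?;
    // Verify the server is reachable before submitting work.
    let _health = client.health_check().await?;
    // Build a TaskRequest (namespace, template name/version, input payload)
    // and submit it; response handling is elided here.
    // let response = client.create_task(request).await?;
    Ok(())
}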
tasker-ctl
Purpose: Command-line interface for Tasker (split from tasker-client)
Location: tasker-ctl/
Key Responsibilities:
- CLI argument parsing and command dispatch (via clap)
- Task, worker, system, config, auth, and DLQ commands
- Configuration documentation generation (via askama, feature-gated)
- API key generation and management
CLI Tools:
# Task management
tasker-ctl task create --template linear_workflow
tasker-ctl task get <uuid>
tasker-ctl task list --namespace payments
# Health checks
tasker-ctl health
# Configuration docs generation
tasker-ctl docs generate
When to Use:
- When managing tasks from the command line
- When generating configuration documentation
- When performing administrative operations (auth, DLQ management)
Dependencies:
- reqwest - HTTP client
- clap - CLI argument parsing
- serde_json - JSON serialization
Worker Implementations
workers/ruby/ext/tasker_core
Purpose: Ruby FFI bindings enabling Ruby workers to execute Rust-orchestrated workflows
Location: workers/ruby/ext/tasker_core/
Key Responsibilities:
- Expose Rust worker functionality to Ruby via Magnus (FFI)
- Handle Ruby handler execution
- Manage Ruby <-> Rust type conversions
- Provide Ruby API for template registration
- FFI performance optimization
Ruby API:
# Worker bootstrap
result = TaskerCore::Worker::Bootstrap.start!
# Template registration (automatic)
# Ruby templates in workers/ruby/app/tasker/tasks/templates/
# Handler execution (automatic via FFI)
class MyHandler < TaskerCore::StepHandler
def execute(context)
# Step implementation
{ success: true, result: "done" }
end
end
When to Use:
- When you have existing Ruby handlers
- When you need Ruby-specific libraries or gems
- When migrating from Ruby-based orchestration
- When team expertise is primarily Ruby
Dependencies:
- magnus - Ruby FFI bindings
- tasker-worker - Core worker logic
- Ruby runtime
Performance Considerations:
- FFI overhead: ~5-10ms per step (measured)
- Ruby GC can impact latency
- Thread-safe FFI calls via Ruby global lock
- Best for I/O-bound operations, not CPU-intensive
workers/rust
Purpose: Native Rust worker implementation for maximum performance
Location: workers/rust/
Key Responsibilities:
- Native Rust step handler execution
- Template definitions in Rust
- Direct integration with tasker-worker
- Maximum performance for CPU-intensive operations
Handler API:
#![allow(unused)]
fn main() {
// Define handler in Rust
pub struct MyHandler;
#[async_trait]
impl StepHandler for MyHandler {
async fn execute(&self, context: StepContext) -> Result<StepResult> {
// Step implementation
Ok(StepResult::success(json!({"result": "done"})))
}
}
// Register in template
pub fn register_template() -> TaskTemplate {
TaskTemplate {
name: "my_workflow",
steps: vec![
StepTemplate {
name: "my_step",
handler: Box::new(MyHandler),
// ...
}
]
}
}
}
When to Use:
- When you need maximum performance
- For CPU-intensive operations
- When building new workflows in Rust
- When minimizing latency is critical
Dependencies:
- tasker-worker - Core worker logic
- tokio - Async runtime
Performance: Native Rust handlers have zero FFI overhead
Crate Relationships
How Crates Work Together
Task Creation Flow
Client Application
↓ [HTTP POST]
tasker-client
↓ [REST API]
tasker-orchestration::web
↓ [Task lifecycle]
tasker-orchestration::lifecycle::TaskInitializer
↓ [Uses]
tasker-shared::models::Task
tasker-shared::database::sql_functions
↓ [PostgreSQL]
Database + PGMQ
Step Execution Flow
tasker-orchestration::lifecycle::StepEnqueuer
↓ [pgmq_send_with_notify]
PGMQ namespace queue
↓ [pg_notify event]
tasker-worker::event_systems::WorkerEventSystem
↓ [Claims step]
tasker-worker::handlers::execute_handler
↓ [FFI or native]
workers/ruby or workers/rust
↓ [Returns result]
tasker-worker::orchestration_result_sender
↓ [pgmq_send_with_notify]
PGMQ orchestration_step_results queue
↓ [pg_notify event]
tasker-orchestration::lifecycle::ResultProcessor
↓ [Updates state]
tasker-shared::models::WorkflowStepTransition
Dependency Rationale
Why tasker-shared exists:
- Prevents circular dependencies (orchestration ↔ worker)
- Single source of truth for domain models
- Enables independent testing
- Allows SQL function reuse
Why workers are separate from tasker-worker:
- Language-specific implementations
- Independent deployment
- FFI boundary separation
- Multiple worker types supported
Why tasker-pgmq is separate:
- Reusable in other projects
- Focused responsibility
- Easy to test independently
- Can be published as separate crate
Building and Testing
Build All Crates
# Build everything with all features
cargo build --all-features
# Build specific crate
cargo build --package tasker-orchestration --all-features
# Build workspace root (minimal, mostly for integration)
cargo build
Test All Crates
# Test everything
cargo test --all-features
# Test specific crate
cargo test --package tasker-shared --all-features
# Test with database
DATABASE_URL="postgresql://..." cargo test --all-features
Feature Flags
# Root workspace features
[features]
benchmarks = [
"tasker-shared/benchmarks",
# ...
]
test-utils = [
"tasker-orchestration/test-utils",
"tasker-shared/test-utils",
"tasker-worker/test-utils",
]
Migration Notes
Root Crate Being Phased Out
The root tasker-core crate (defined in the workspace root Cargo.toml) is being phased out:
- Current: Contains minimal code, mostly workspace configuration
- Future: Will be removed entirely, replaced by individual crates
- Impact: No functional impact, internal restructuring only
- Timeline: Complete when all functionality moved to member crates
Why: Cleaner workspace structure, better separation of concerns, easier to understand
Adding New Crates
When adding a new crate to the workspace:
- Add to [workspace.members] in the root Cargo.toml
- Create the crate: cargo new --lib tasker-new-crate
- Add workspace dependencies to the crate's Cargo.toml
- Add to dependency graph above
- Document public API
Best Practices
When to Create a New Crate
Create a new crate when:
- ✅ You have a distinct, reusable component
- ✅ You need independent versioning
- ✅ You want to reduce compile times
- ✅ You need isolation for testing
- ✅ You have language-specific implementations
Don’t create a new crate when:
- ❌ It’s tightly coupled to existing crates
- ❌ It’s only used in one place
- ❌ It would create circular dependencies
- ❌ It’s a small utility module
Dependency Management
- Use workspace dependencies: Define versions in the root Cargo.toml
- Minimize dependencies: Only depend on what you need
- Version consistently: Use workspace = true in member crates
- Document dependencies: Explain why each dependency is needed
API Design
- Stable public API: Changes should be backward compatible
- Clear documentation: Every public item needs docs
- Examples in docs: Show how to use the API
- Error handling: Use Result with meaningful error types
Related Documentation
- Actor-Based Architecture - Actor pattern implementation in tasker-orchestration
- Messaging Abstraction - Provider-agnostic messaging
- Quick Start - Get running with the crates
- Events and Commands - How crates coordinate
- States and Lifecycles - State machines in tasker-shared
- Task Readiness & Execution - SQL functions in tasker-shared
- Archive: Ruby Integration Lessons - FFI patterns
← Back to Documentation Hub
Deployment Patterns and Configuration
Last Updated: 2026-01-15 Audience: Architects, Operators Status: Active Related Docs: Documentation Hub | Quick Start | Observability | Messaging Abstraction
← Back to Documentation Hub
Overview
Tasker Core supports three deployment modes, each optimized for different operational requirements and infrastructure constraints. This guide covers deployment patterns, configuration management, and production considerations.
Key Deployment Modes:
- Hybrid Mode (Recommended) - Event-driven with polling fallback
- EventDrivenOnly Mode - Pure event-driven for lowest latency
- PollingOnly Mode - Traditional polling for restricted environments
Messaging Backend Options:
- PGMQ (Default) - PostgreSQL-based, single infrastructure dependency
- RabbitMQ - AMQP broker, higher throughput for high-volume scenarios
Messaging Backend Selection
Tasker Core supports multiple messaging backends through a provider-agnostic abstraction layer. The choice of backend affects deployment architecture and operational requirements.
Backend Comparison
| Feature | PGMQ | RabbitMQ |
|---|---|---|
| Infrastructure | PostgreSQL only | PostgreSQL + RabbitMQ |
| Delivery Model | Poll + pg_notify signals | Native push (basic_consume) |
| Fallback Polling | Required for reliability | Not needed |
| Throughput | Good | Higher |
| Latency | Low (~10-50ms) | Lowest (~5-20ms) |
| Operational Complexity | Lower | Higher |
| Message Persistence | PostgreSQL transactions | RabbitMQ durability |
PGMQ (Default)
PostgreSQL Message Queue is the default backend, ideal for:
- Simpler deployments: Single database dependency
- Transactional workflows: Messages participate in PostgreSQL transactions
- Smaller to medium scale: Excellent for most workloads
Configuration:
# Default - no additional configuration needed
TASKER_MESSAGING_BACKEND=pgmq
Deployment Mode Interaction:
- Uses pg_notify for real-time notifications
- Fallback polling recommended for reliability
- Hybrid mode provides best balance
RabbitMQ
AMQP-based messaging for high-throughput scenarios:
- High-volume workloads: Better throughput characteristics
- Existing RabbitMQ infrastructure: Leverage existing investments
- Pure push delivery: No fallback polling required
Configuration:
TASKER_MESSAGING_BACKEND=rabbitmq
RABBITMQ_URL=amqp://user:password@rabbitmq:5672/%2F
Deployment Mode Interaction:
- EventDrivenOnly mode is natural fit (no fallback needed)
- Native push delivery via basic_consume()
- Protocol-guaranteed message delivery
Choosing a Backend
Decision Tree:
┌─────────────────┐
│ Do you need the │
│ highest possible │
│ throughput? │
└────────┬────────┘
│
┌──────────┴──────────┐
│ │
Yes No
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Do you have │ │ Use PGMQ │
│ existing │ │ (simpler ops) │
│ RabbitMQ? │ └────────────────┘
└───────┬────────┘
│
┌──────────┴──────────┐
│ │
Yes No
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Use RabbitMQ │ │ Evaluate │
└────────────────┘ │ operational │
│ tradeoffs │
└────────────────┘
Recommendation: Start with PGMQ. Migrate to RabbitMQ only when throughput requirements demand it.
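Under the hood, the backend choice resolves to the MessagingProvider enum from tasker-shared. A sketch of mapping the TASKER_MESSAGING_BACKEND variable onto it (the parsing and defaulting logic here is an assumption, not the actual configuration loader):
// Hypothetical mapping from TASKER_MESSAGING_BACKEND to the provider enum.
fn messaging_provider_from_env() -> MessagingProvider {
    match std::env::var("TASKER_MESSAGING_BACKEND").as_deref() {
        Ok("rabbitmq") => MessagingProvider::RabbitMq,
        _ => MessagingProvider::Pgmq, // PGMQ is the default backend
    }
}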
Production Deployment Strategy: Mixed Mode Architecture
Important: In production-grade Kubernetes environments, you typically run multiple orchestration containers simultaneously with different deployment modes. This is not just about horizontal scaling with identical configurations—it’s about deploying containers with different coordination strategies to optimize for both throughput and reliability.
Recommended Production Pattern
High-Throughput + Safety Net Architecture:
# Most orchestration containers in EventDrivenOnly mode for maximum throughput
- EventDrivenOnly containers: 8-12 replicas (handles 80-90% of workload)
- PollingOnly containers: 2-3 replicas (safety net for missed events)
Why this works:
- EventDrivenOnly containers handle the bulk of work with ~10ms latency
- PollingOnly containers catch any events that might be missed during network issues or LISTEN/NOTIFY failures
- Both sets of containers coordinate through atomic SQL operations (no conflicts)
- Scale each mode independently based on throughput needs
Alternative: All-Hybrid Deployment
You can also deploy all containers in Hybrid mode and scale horizontally:
# All containers use Hybrid mode
- Hybrid containers: 10-15 replicas
This is simpler but less flexible. The mixed-mode approach lets you:
- Tune for specific workload patterns (event-heavy vs. polling-heavy)
- Adapt to infrastructure constraints (some networks better for events, others for polling)
- Optimize resource usage (EventDrivenOnly uses less CPU than Hybrid)
- Scale dimensions independently (scale up event listeners without scaling pollers)
Key Insight
The different deployment modes exist not just for config tuning, but to enable sophisticated deployment strategies where you mix coordination approaches across containers to meet production throughput and reliability requirements.
Deployment Mode Comparison
| Feature | Hybrid | EventDrivenOnly | PollingOnly |
|---|---|---|---|
| Latency | Low (event-driven primary) | Lowest (~10ms) | Higher (~100-500ms) |
| Reliability | Highest (automatic fallback) | Good (requires stable connections) | Good (no dependencies) |
| Resource Usage | Medium (listeners + pollers) | Low (listeners only) | Medium (pollers only) |
| Network Requirements | Standard PostgreSQL | Persistent connections required | Standard PostgreSQL |
| Production Recommended | ✅ Yes | ⚠️ With stable network | ⚠️ For restricted environments |
| Complexity | Medium | Low | Low |
Hybrid Mode (Recommended)
Overview
Hybrid mode combines the best of both worlds: event-driven coordination for real-time performance with polling fallback for reliability.
How it works:
- PostgreSQL LISTEN/NOTIFY provides real-time event notifications
- If event listeners fail or lag, polling automatically takes over
- System continuously monitors and switches between modes
- No manual intervention required
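Conceptually, the fallback decision is bookkeeping around the last received event: if nothing has arrived within fallback_activation_threshold_ms, polling takes over until events flow again. The sketch below is illustrative only; the real coordination lives in the orchestration event systems and uses the settings shown in the configuration that follows.
use std::time::{Duration, Instant};
struct HybridCoordinator {
    last_event_at: Instant,
    fallback_activation_threshold: Duration, // fallback_activation_threshold_ms
    polling_fallback_active: bool,
}
impl HybridCoordinator {
    // Called whenever a LISTEN/NOTIFY event arrives.
    fn on_event(&mut self) {
        self.last_event_at = Instant::now();
        self.polling_fallback_active = false; // events are flowing again
    }
    // Consulted by the polling loop each cycle.
    fn should_poll(&mut self) -> bool {
        if self.last_event_at.elapsed() >= self.fallback_activation_threshold {
            self.polling_fallback_active = true; // listener is silent or lagging
        }
        self.polling_fallback_active
    }
}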
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
[orchestration.hybrid]
# Event listener settings
enable_event_listeners = true
listener_reconnect_interval_ms = 5000
listener_health_check_interval_ms = 30000
# Polling fallback settings
enable_polling_fallback = true
polling_interval_ms = 1000
fallback_activation_threshold_ms = 5000
# Worker event settings
[orchestration.worker_events]
enable_worker_listeners = true
worker_listener_reconnect_ms = 5000
When to Use Hybrid Mode
Ideal for:
- Production deployments requiring high reliability
- Environments with occasional network instability
- Systems requiring both low latency and guaranteed delivery
- Multi-region deployments with variable network quality
Example: Production E-commerce Platform
# docker-compose.production.yml
version: '3.8'
services:
orchestration:
image: tasker-orchestration:latest
environment:
- TASKER_ENV=production
- TASKER_DEPLOYMENT_MODE=Hybrid
- DATABASE_URL=postgresql://tasker:${DB_PASSWORD}@postgres:5432/tasker_production
- RUST_LOG=info
deploy:
replicas: 3
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 5s
retries: 3
postgres:
image: postgres:16
environment:
- POSTGRES_DB=tasker_production
- POSTGRES_PASSWORD=${DB_PASSWORD}
volumes:
- postgres-data:/var/lib/postgresql/data
deploy:
resources:
limits:
cpus: '4'
memory: 8G
volumes:
postgres-data:
Monitoring Hybrid Mode
Key Metrics:
// Hybrid mode health indicators
tasker_event_listener_active{mode="hybrid"} = 1       // Listener is active
tasker_event_listener_lag_ms{mode="hybrid"} < 100     // Event lag is acceptable
tasker_polling_fallback_active{mode="hybrid"} = 0     // Not in fallback mode
tasker_mode_switches_total{mode="hybrid"} < 10/hour   // Infrequent mode switching
Alert conditions:
- Event listener down for > 60 seconds
- Polling fallback active for > 5 minutes
- Mode switches > 20 per hour (indicates instability)
EventDrivenOnly Mode
Overview
EventDrivenOnly mode provides the lowest possible latency by relying entirely on PostgreSQL LISTEN/NOTIFY for coordination.
How it works:
- Orchestration and workers establish persistent PostgreSQL connections
- LISTEN on specific channels for events
- Immediate notification on queue changes
- No polling overhead or delay
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "EventDrivenOnly"
[orchestration.event_driven]
# Listener configuration
listener_reconnect_interval_ms = 2000
listener_health_check_interval_ms = 15000
max_reconnect_attempts = 10
# Event channels
channels = [
"pgmq_message_ready.orchestration",
"pgmq_message_ready.*",
"pgmq_queue_created"
]
# Connection pool for listeners
listener_pool_size = 5
connection_timeout_ms = 5000
When to Use EventDrivenOnly Mode
Ideal for:
- High-throughput, low-latency requirements
- Stable network environments
- Development and testing environments
- Systems with reliable PostgreSQL infrastructure
Not recommended for:
- Unstable network connections
- Environments with frequent PostgreSQL failovers
- Systems requiring guaranteed operation during network issues
Example: High-Performance Payment Processing
#![allow(unused)]
fn main() {
// Worker configuration for event-driven mode
use tasker_worker::WorkerConfig;
let config = WorkerConfig {
deployment_mode: DeploymentMode::EventDrivenOnly,
namespaces: vec!["payments".to_string()],
event_driven_settings: EventDrivenSettings {
listener_reconnect_interval_ms: 2000,
health_check_interval_ms: 15000,
max_reconnect_attempts: 10,
},
..Default::default()
};
// Start worker with event-driven mode
let worker = WorkerCore::from_config(config).await?;
worker.start().await?;
}
Monitoring EventDrivenOnly Mode
Critical Metrics:
// Event-driven health indicators
tasker_event_listener_active{mode="event_driven"} = 1   // Must be 1
tasker_event_notifications_received_total               // Should be > 0
tasker_event_processing_duration_seconds                // Should be < 0.01
tasker_listener_reconnections_total                     // Should be low
Alert conditions:
- Event listener inactive
- No events received for > 60 seconds (when activity expected)
- Reconnections > 5 per hour
PollingOnly Mode
Overview
PollingOnly mode provides the most reliable operation in restricted or unstable network environments by using traditional polling.
How it works:
- Orchestration and workers poll message queues at regular intervals
- No dependency on persistent connections or LISTEN/NOTIFY
- Configurable polling intervals for performance/resource trade-offs
- Automatic retry and backoff on failures
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
# Polling intervals
task_request_poll_interval_ms = 1000
step_result_poll_interval_ms = 500
finalization_poll_interval_ms = 2000
# Batch processing
batch_size = 10
max_messages_per_poll = 100
# Backoff on errors
error_backoff_base_ms = 1000
error_backoff_max_ms = 30000
error_backoff_multiplier = 2.0
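How those backoff parameters typically combine, sketched below: the delay grows geometrically from error_backoff_base_ms and is capped at error_backoff_max_ms. This illustrates the configured policy, not the orchestration source.
use std::time::Duration;
// Delay before the next poll after `consecutive_errors` failed polling cycles.
fn error_backoff(consecutive_errors: u32, base_ms: u64, max_ms: u64, multiplier: f64) -> Duration {
    let exponent = consecutive_errors.saturating_sub(1) as i32;
    let delay_ms = (base_ms as f64) * multiplier.powi(exponent);
    Duration::from_millis(delay_ms.min(max_ms as f64) as u64)
}
// With the base settings above (1000ms base, 30000ms max, 2.0 multiplier):
// error 1 -> 1s, error 2 -> 2s, error 3 -> 4s, ..., error 6 onward -> capped at 30s.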
When to Use PollingOnly Mode
Ideal for:
- Restricted network environments (firewalls blocking persistent connections)
- Environments with frequent PostgreSQL connection issues
- Systems prioritizing reliability over latency
- Legacy infrastructure with limited LISTEN/NOTIFY support
Not recommended for:
- High-frequency, low-latency requirements
- Systems with strict resource constraints
- Environments where polling overhead is problematic
Example: Batch Processing System
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
# Longer intervals for batch processing
task_request_poll_interval_ms = 5000
step_result_poll_interval_ms = 2000
finalization_poll_interval_ms = 10000
# Large batches for efficiency
batch_size = 50
max_messages_per_poll = 500
Monitoring PollingOnly Mode
Key Metrics:
// Polling health indicators
tasker_polling_cycles_total               // Should be increasing
tasker_polling_messages_processed_total   // Should be > 0
tasker_polling_duration_seconds           // Should be stable
tasker_polling_errors_total               // Should be low
Alert conditions:
- Polling stopped (no cycles in last 60 seconds)
- Polling duration > 10x interval (indicates overload)
- Error rate > 5% of polling cycles
Configuration Management
Component-Based Configuration
Tasker Core uses a component-based TOML configuration system with environment-specific overrides.
Configuration Structure:
config/tasker/
├── base/ # Base configuration (all environments)
│ ├── database.toml # Database connection pool settings
│ ├── orchestration.toml # Orchestration and deployment mode
│ ├── circuit_breakers.toml # Circuit breaker thresholds
│ ├── executor_pools.toml # Executor pool sizing
│ ├── pgmq.toml # Message queue configuration
│ ├── query_cache.toml # Query caching settings
│ └── telemetry.toml # Metrics and logging
│
└── environments/ # Environment-specific overrides
├── development/
│ └── *.toml # Development overrides
├── test/
│ └── *.toml # Test overrides
└── production/
└── *.toml # Production overrides
Environment Detection
# Set environment via TASKER_ENV
export TASKER_ENV=production
# Validate configuration
cargo run --bin config-validator
# Expected output:
# ✓ Configuration loaded successfully
# ✓ Environment: production
# ✓ Deployment mode: Hybrid
# ✓ Database pool: 50 connections
# ✓ Circuit breakers: 10 configurations
Example: Production Configuration
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
max_concurrent_tasks = 1000
task_timeout_seconds = 3600
[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true
polling_interval_ms = 2000
fallback_activation_threshold_ms = 10000
[orchestration.health]
health_check_interval_ms = 30000
unhealthy_threshold = 3
recovery_threshold = 2
# config/tasker/environments/production/database.toml
[database]
max_connections = 50
min_connections = 10
connection_timeout_ms = 5000
idle_timeout_seconds = 600
max_lifetime_seconds = 1800
[database.query_cache]
enabled = true
max_size = 1000
ttl_seconds = 300
# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5
timeout_seconds = 60
half_open_timeout_seconds = 30
[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60
Docker Compose Deployment
Development Setup
# docker-compose.yml
version: '3.8'
services:
postgres:
image: postgres:16
environment:
POSTGRES_USER: tasker
POSTGRES_PASSWORD: tasker
POSTGRES_DB: tasker_rust_test
ports:
- "5432:5432"
volumes:
- postgres-data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U tasker"]
interval: 5s
timeout: 5s
retries: 5
orchestration:
build:
context: .
target: orchestration
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8080:8080"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
worker:
build:
context: .
target: worker
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8081:8081"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
ruby-worker:
build:
context: ./workers/ruby
dockerfile: Dockerfile
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8082:8082"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
volumes:
postgres-data:
Production Deployment
# docker-compose.production.yml
version: '3.8'
services:
postgres:
image: postgres:16
environment:
POSTGRES_USER: tasker
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
POSTGRES_DB: tasker_production
volumes:
- postgres-data:/var/lib/postgresql/data
deploy:
placement:
constraints:
- node.labels.database == true
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '2'
memory: 4G
secrets:
- db_password
orchestration:
image: tasker-orchestration:${VERSION}
environment:
- TASKER_ENV=production
- DATABASE_URL_FILE=/run/secrets/database_url
deploy:
replicas: 3
update_config:
parallelism: 1
delay: 10s
order: start-first
rollback_config:
parallelism: 0
order: stop-first
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
secrets:
- database_url
worker:
image: tasker-worker:${VERSION}
environment:
- TASKER_ENV=production
- DATABASE_URL_FILE=/run/secrets/database_url
deploy:
replicas: 5
resources:
limits:
cpus: '1'
memory: 1G
reservations:
cpus: '0.5'
memory: 512M
secrets:
- database_url
secrets:
db_password:
external: true
database_url:
external: true
volumes:
postgres-data:
driver: local
Kubernetes Deployment
Mixed-Mode Production Deployment (Recommended)
This example demonstrates the recommended production pattern: multiple orchestration deployments with different modes.
# k8s/orchestration-event-driven.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration-event-driven
namespace: tasker
labels:
app: tasker-orchestration
mode: event-driven
spec:
replicas: 10 # Majority of orchestration capacity
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
mode: event-driven
template:
metadata:
labels:
app: tasker-orchestration
mode: event-driven
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DEPLOYMENT_MODE
value: "EventDrivenOnly" # High-throughput mode
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
resources:
requests:
cpu: 500m # Lower CPU for event-driven
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# k8s/orchestration-polling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration-polling
namespace: tasker
labels:
app: tasker-orchestration
mode: polling
spec:
replicas: 3 # Safety net for missed events
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
mode: polling
template:
metadata:
labels:
app: tasker-orchestration
mode: polling
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DEPLOYMENT_MODE
value: "PollingOnly" # Reliability safety net
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
resources:
requests:
cpu: 750m # Higher CPU for polling
memory: 512Mi
limits:
cpu: 1500m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# k8s/orchestration-service.yaml
apiVersion: v1
kind: Service
metadata:
name: tasker-orchestration
namespace: tasker
spec:
selector:
app: tasker-orchestration # Matches BOTH deployments
ports:
- port: 8080
targetPort: 8080
protocol: TCP
name: http
type: ClusterIP
Key points about this mixed-mode deployment:
- 10 EventDrivenOnly pods handle 80-90% of work with ~10ms latency
- 3 PollingOnly pods catch anything missed by event listeners
- Single service load balances across all 13 pods
- No conflicts - atomic SQL operations prevent duplicate processing
- Independent scaling - scale event-driven pods for throughput, polling pods for reliability
Single-Mode Orchestration Deployment
# k8s/orchestration-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration
namespace: tasker
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
template:
metadata:
labels:
app: tasker-orchestration
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
protocol: TCP
resources:
requests:
cpu: 1000m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
name: tasker-orchestration
namespace: tasker
spec:
selector:
app: tasker-orchestration
ports:
- port: 8080
targetPort: 8080
protocol: TCP
name: http
type: ClusterIP
Worker Deployment
# k8s/worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-worker-payments
namespace: tasker
spec:
replicas: 5
selector:
matchLabels:
app: tasker-worker
namespace: payments
template:
metadata:
labels:
app: tasker-worker
namespace: payments
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8081"
spec:
containers:
- name: worker
image: tasker-worker:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
- name: WORKER_NAMESPACES
value: "payments"
ports:
- containerPort: 8081
name: http
protocol: TCP
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8081
initialDelaySeconds: 20
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8081
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: tasker-worker-payments
namespace: tasker
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: tasker-worker-payments
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Health Monitoring
Health Check Endpoints
Orchestration Health:
# Basic health check
curl http://localhost:8080/health
# Response:
{
"status": "healthy",
"database": "connected",
"message_queue": "operational"
}
# Detailed health check
curl http://localhost:8080/health/detailed
# Response:
{
"status": "healthy",
"deployment_mode": "Hybrid",
"event_listeners": {
"active": true,
"channels": 3,
"lag_ms": 12
},
"polling": {
"active": false,
"fallback_triggered": false
},
"database": {
"status": "connected",
"pool_size": 50,
"active_connections": 23
},
"circuit_breakers": {
"database": "closed",
"message_queue": "closed"
},
"executors": {
"task_initializer": {
"active": 3,
"max": 10,
"queue_depth": 5
},
"result_processor": {
"active": 5,
"max": 10,
"queue_depth": 12
}
}
}
Worker Health:
# Worker health check
curl http://localhost:8081/health
# Response:
{
"status": "healthy",
"namespaces": ["payments", "inventory"],
"active_executions": 8,
"claimed_steps": 3
}
Kubernetes Probes
# Liveness probe - restart if unhealthy
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe - remove from load balancer if not ready
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
gRPC Health Checks
Tasker Core exposes gRPC health endpoints alongside REST for Kubernetes gRPC health probes.
Port Allocation:
| Service | REST Port | gRPC Port |
|---|---|---|
| Orchestration | 8080 | 9190 |
| Rust Worker | 8081 | 9191 |
| Ruby Worker | 8082 | 9200 |
| Python Worker | 8083 | 9300 |
| TypeScript Worker | 8085 | 9400 |
gRPC Health Endpoints:
# Using grpcurl
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckLiveness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckReadiness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckDetailedHealth
Kubernetes gRPC Probes (Kubernetes 1.24+):
# gRPC liveness probe
livenessProbe:
grpc:
port: 9190
service: tasker.v1.HealthService
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# gRPC readiness probe
readinessProbe:
grpc:
port: 9190
service: tasker.v1.HealthService
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
Configuration (config/tasker/base/orchestration.toml):
[orchestration.grpc]
enabled = true
bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
enable_reflection = true # Service discovery via grpcurl
enable_health_service = true # gRPC health checks
Scaling Patterns
Horizontal Scaling
Mixed-Mode Orchestration Scaling (Recommended)
Scale different deployment modes independently to optimize for throughput and reliability:
# Scale event-driven pods for throughput
kubectl scale deployment tasker-orchestration-event-driven --replicas=15 -n tasker
# Scale polling pods for reliability
kubectl scale deployment tasker-orchestration-polling --replicas=5 -n tasker
Scaling strategy by workload:
| Scenario | Event-Driven Pods | Polling Pods | Rationale |
|---|---|---|---|
| High throughput | 15-20 | 3-5 | Maximize event-driven capacity |
| Network unstable | 5-8 | 5-8 | Balance between modes |
| Cost optimization | 10-12 | 2-3 | Minimize polling overhead |
| Maximum reliability | 8-10 | 8-10 | Ensure complete coverage |
Single-Mode Orchestration Scaling
If using single deployment mode (Hybrid or EventDrivenOnly):
# Scale orchestration to 10 replicas (all same mode)
kubectl scale deployment tasker-orchestration --replicas=10 -n tasker
Key principles:
- Multiple orchestration instances process tasks independently
- Atomic finalization claiming prevents duplicate processing
- Load balancer distributes API requests across instances
Worker Scaling
Workers scale independently per namespace:
# Scale payment workers to 10 replicas
kubectl scale deployment tasker-worker-payments --replicas=10 -n tasker
- Each worker claims steps from namespace-specific queues
- No coordination required between workers
- Scale per namespace based on queue depth (see the sizing sketch below)
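One way to translate queue depth into a replica target: the sketch below reads queue length via the PGMQ extension's pgmq.metrics() function and derives a clamped worker count. The queue naming convention and scaling formula here are illustrative assumptions, not Tasker's built-in autoscaling:
use sqlx::PgPool;

/// Derive a desired worker replica count for a namespace from its queue depth.
async fn desired_worker_replicas(
    pool: &PgPool,
    namespace: &str,
    steps_per_worker: i64,
    min_replicas: i64,
    max_replicas: i64,
) -> sqlx::Result<i64> {
    // Assumed queue naming convention: {namespace}_queue
    let queue = format!("{namespace}_queue");
    // pgmq.metrics() returns queue_length among other columns
    let (depth,): (i64,) = sqlx::query_as("SELECT queue_length FROM pgmq.metrics($1)")
        .bind(&queue)
        .fetch_one(pool)
        .await?;
    // Ceiling-divide backlog by per-worker capacity, then clamp to HPA-style bounds
    let target = (depth + steps_per_worker - 1) / steps_per_worker;
    Ok(target.clamp(min_replicas, max_replicas))
}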
Vertical Scaling
Resource Allocation:
# High-throughput orchestration
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 4000m
memory: 8Gi
# Standard worker
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
Auto-Scaling
HPA Configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: tasker-orchestration
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: tasker-orchestration
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: tasker_tasks_active
target:
type: AverageValue
averageValue: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 30
Production Considerations
Database Configuration
Connection Pooling:
# config/tasker/environments/production/database.toml
[database]
max_connections = 50 # Total pool size
min_connections = 10 # Minimum maintained connections
connection_timeout_ms = 5000 # Connection acquisition timeout
idle_timeout_seconds = 600 # Close idle connections after 10 minutes
max_lifetime_seconds = 1800 # Recycle connections after 30 minutes
Calculation:
Total DB Connections = (Orchestration Replicas × Pool Size) + (Worker Replicas × Pool Size)
Example: (3 × 50) + (10 × 20) = 350 connections
Ensure PostgreSQL max_connections > Total DB Connections + Buffer
Recommended: max_connections = 500 for above example
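The same sizing rule as a tiny helper, using the numbers from the example above (a buffer of 150 brings the total to the recommended 500); the values are illustrative:
fn required_pg_max_connections(
    orchestration_replicas: u32,
    orchestration_pool: u32,
    worker_replicas: u32,
    worker_pool: u32,
    buffer: u32,
) -> u32 {
    // Total DB connections = (orchestration replicas × pool) + (worker replicas × pool) + buffer
    orchestration_replicas * orchestration_pool + worker_replicas * worker_pool + buffer
}

fn main() {
    // (3 × 50) + (10 × 20) + 150 = 500
    println!("{}", required_pg_max_connections(3, 50, 10, 20, 150));
}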
Circuit Breaker Tuning
# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5 # Open after 5 consecutive errors
timeout_seconds = 60 # Stay open for 60 seconds
half_open_timeout_seconds = 30 # Test recovery for 30 seconds
[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60
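These settings tune a standard circuit breaker lifecycle: the breaker opens after error_threshold consecutive failures, stays open for timeout_seconds, then admits a half-open probe. The sketch below is a generic illustration of that lifecycle, not Tasker's implementation; half_open_timeout_seconds (the probe window) is elided for brevity:
use std::time::{Duration, Instant};

enum BreakerState {
    Closed { errors: u32 },
    Open { since: Instant },
    HalfOpen,
}

struct CircuitBreaker {
    state: BreakerState,
    error_threshold: u32, // open after N consecutive errors
    timeout: Duration,    // how long to stay open before probing
}

impl CircuitBreaker {
    fn on_result(&mut self, ok: bool) {
        use BreakerState::*;
        self.state = match std::mem::replace(&mut self.state, Closed { errors: 0 }) {
            // Success resets the error count; a successful probe closes the breaker
            Closed { .. } | HalfOpen if ok => Closed { errors: 0 },
            // Consecutive failures trip the breaker open
            Closed { errors } if errors + 1 >= self.error_threshold => Open { since: Instant::now() },
            Closed { errors } => Closed { errors: errors + 1 },
            // A failed probe reopens the breaker
            HalfOpen => Open { since: Instant::now() },
            // After timeout_seconds, allow a half-open probe
            Open { since } if since.elapsed() >= self.timeout => HalfOpen,
            Open { since } => Open { since },
        };
    }

    fn allows_request(&self) -> bool {
        !matches!(self.state, BreakerState::Open { .. })
    }
}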
Executor Pool Sizing
# config/tasker/environments/production/executor_pools.toml
[executor_pools.task_initializer]
min_executors = 2
max_executors = 10
queue_high_watermark = 100
queue_low_watermark = 10
[executor_pools.result_processor]
min_executors = 5
max_executors = 20
queue_high_watermark = 200
queue_low_watermark = 20
[executor_pools.step_enqueuer]
min_executors = 3
max_executors = 15
queue_high_watermark = 150
queue_low_watermark = 15
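The watermarks bound a simple feedback loop: grow the pool while queue depth sits above the high watermark, shrink it below the low watermark, and hold steady in between. A hypothetical sketch of that logic (not Tasker's actual pool manager):
struct PoolConfig {
    min_executors: usize,
    max_executors: usize,
    queue_high_watermark: usize,
    queue_low_watermark: usize,
}

fn adjust_pool(current: usize, queue_depth: usize, cfg: &PoolConfig) -> usize {
    if queue_depth > cfg.queue_high_watermark {
        // Backlog is building: add an executor, up to the configured maximum
        (current + 1).min(cfg.max_executors)
    } else if queue_depth < cfg.queue_low_watermark {
        // Mostly idle: shed an executor, down to the configured minimum
        current.saturating_sub(1).max(cfg.min_executors)
    } else {
        // Within the watermark band: hold steady
        current
    }
}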
Observability Integration
Prometheus Metrics:
# Prometheus scrape config
scrape_configs:
- job_name: 'tasker-orchestration'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- tasker
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
Key Alerts:
# alerts.yaml
groups:
- name: tasker
interval: 30s
rules:
- alert: TaskerOrchestrationDown
expr: up{job="tasker-orchestration"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Tasker orchestration instance down"
- alert: TaskerHighErrorRate
expr: rate(tasker_step_errors_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate in step execution"
- alert: TaskerCircuitBreakerOpen
expr: tasker_circuit_breaker_state{state="open"} == 1
for: 2m
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.name }} is open"
- alert: TaskerDatabasePoolExhausted
expr: tasker_database_pool_active >= tasker_database_pool_max
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection pool exhausted"
Migration Strategies
Migrating to Hybrid Mode
Step 1: Enable event listeners
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true # Keep polling enabled during migration
Step 2: Monitor event listener health
# Check metrics for event listener stability
curl http://localhost:8080/health/detailed | jq '.event_listeners'
Step 3: Gradually reduce polling frequency
# Once event listeners are stable
[orchestration.hybrid]
polling_interval_ms = 5000 # Increase from 1000ms to 5000ms
Step 4: Validate performance
- Monitor latency metrics: tasker_step_discovery_duration_seconds
- Verify no missed events: tasker_polling_messages_found_total should be near zero
Rollback Plan
If event-driven mode fails:
# Immediate rollback to PollingOnly
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
task_request_poll_interval_ms = 500 # Aggressive polling
Gradual rollback:
- Increase polling frequency in Hybrid mode
- Monitor for stability
- Disable event listeners once polling is stable
- Switch to PollingOnly mode
Troubleshooting
Event Listener Issues
Problem: Event listeners not receiving notifications
Diagnosis:
-- Check PostgreSQL LISTEN/NOTIFY is working
NOTIFY pgmq_message_ready, 'test';
# Check listener status
curl http://localhost:8080/health/detailed | jq '.event_listeners'
Solutions:
- Verify PostgreSQL version supports LISTEN/NOTIFY (9.0+)
- Check firewall rules allow persistent connections
- Increase listener_reconnect_interval_ms if connections drop frequently
- Switch to Hybrid or PollingOnly mode if issues persist (a minimal Rust LISTEN check is sketched below)
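To confirm notifications flow end-to-end from Rust, a minimal sqlx PgListener check (assumes DATABASE_URL is set; the channel name matches the pgmq_message_ready channel used above):
use sqlx::postgres::PgListener;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL not set");
    let mut listener = PgListener::connect(&url).await?;
    listener.listen("pgmq_message_ready").await?;
    println!("listening; run NOTIFY pgmq_message_ready, 'test'; in psql");

    // Blocks until a notification arrives on the channel
    let notification = listener.recv().await?;
    println!("received payload: {}", notification.payload());
    Ok(())
}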
Polling Performance Issues
Problem: High CPU usage from polling
Diagnosis:
# Check polling frequency and batch sizes
curl http://localhost:8080/health/detailed | jq '.polling'
Solutions:
- Increase polling intervals
- Increase batch sizes to process more messages per poll
- Switch to Hybrid or EventDrivenOnly mode for better performance
- Scale horizontally to distribute polling load
Database Connection Exhaustion
Problem: “connection pool exhausted” errors
Diagnosis:
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'tasker_production';
-- Check max connections
SHOW max_connections;
Solutions:
- Increase max_connections in database.toml
- Increase the PostgreSQL max_connections setting
- Reduce the number of replicas
- Implement connection pooling at infrastructure level (PgBouncer)
Best Practices
Configuration Management
- Use environment-specific overrides instead of modifying base configuration
- Validate configuration with config-validator before deployment
- Version control all configuration, including environment overrides
- Use secrets management for sensitive values (passwords, keys)
Deployment Strategy
- Use mixed-mode architecture in production (EventDrivenOnly + PollingOnly)
- Deploy 80-90% of orchestration pods in EventDrivenOnly mode for throughput
- Deploy 10-20% of orchestration pods in PollingOnly mode as safety net
- Single service load balances across all pods
- Alternative: Deploy all pods in Hybrid mode for simpler configuration
- Trade-off: Less tuning flexibility, slightly higher resource usage
- Scale each mode independently based on workload characteristics
- Monitor deployment mode metrics to adjust ratios over time
- Test mixed-mode deployments in staging before production
Deployment Operations
- Always test configuration changes in staging first
- Use rolling updates with health checks to prevent downtime
- Monitor deployment mode health during and after deployments
- Keep polling capacity available even when event-driven is primary
Scaling Guidelines
- Mixed-mode orchestration: Scale EventDrivenOnly and PollingOnly deployments independently
- Scale event-driven pods based on throughput requirements
- Scale polling pods based on reliability requirements
- Single-mode orchestration: Scale based on API request rate and task initialization throughput
- Workers: Scale based on namespace-specific queue depth
- Database connections: Monitor and adjust pool sizes as replicas scale
- Use HPA for automatic scaling based on CPU/memory and custom metrics
Observability
- Enable comprehensive metrics in production
- Set up alerts for circuit breaker states, connection pool exhaustion
- Monitor deployment mode distribution in mixed-mode deployments
- Track event listener lag in EventDrivenOnly and Hybrid modes
- Monitor polling overhead to optimize resource usage
- Track step execution latency per namespace and handler
Summary
Tasker Core’s flexible deployment modes enable sophisticated production architectures:
Deployment Modes
- Hybrid Mode: Event-driven with polling fallback in a single container
- EventDrivenOnly Mode: Maximum throughput with ~10ms latency
- PollingOnly Mode: Reliable safety net with traditional polling
Recommended Production Pattern
Mixed-Mode Architecture (recommended for production at scale):
- Deploy majority of orchestration pods in EventDrivenOnly mode for high throughput
- Deploy minority of orchestration pods in PollingOnly mode as reliability safety net
- Both deployments coordinate through atomic SQL operations with no conflicts
- Scale each mode independently based on workload characteristics
Alternative: Deploy all pods in Hybrid mode for simpler configuration with automatic fallback.
The key insight: deployment modes exist not just for configuration tuning, but to enable mixing coordination strategies across containers to meet production requirements for both throughput and reliability.
← Back to Documentation Hub
Next: Observability | Benchmarks | Quick Start
Domain Events Architecture
Last Updated: 2025-12-01 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Events and Commands | Observability | States and Lifecycles
← Back to Documentation Hub
This document provides comprehensive documentation of the domain event system in tasker-core, covering event delivery modes, publisher patterns, subscriber implementations, and integration with the workflow orchestration system.
Overview
Domain Events vs System Events
The tasker-core system distinguishes between two types of events:
| Aspect | System Events | Domain Events |
|---|---|---|
| Purpose | Internal coordination | Business observability |
| Producers | Orchestration components | Step handlers during execution |
| Consumers | Event systems, command processors | External systems, analytics, audit |
| Delivery | PGMQ + LISTEN/NOTIFY | Configurable (Durable/Fast/Broadcast) |
| Semantics | At-least-once | Fire-and-forget (best effort) |
System events handle internal workflow coordination: task initialization, step enqueueing, result processing, and finalization. These are documented in Events and Commands.
Domain events enable business observability: payment processed, order fulfilled, inventory updated. Step handlers publish these events to enable external systems to react to business outcomes.
Key Design Principle: Non-Blocking Publication
Domain event publishing never fails the step. This is a fundamental design decision:
- Event publish errors are logged at warn! level
- Step execution continues regardless of publish outcome
- Business logic success is independent of event delivery
- A handler that successfully processes a payment should not fail if event publishing fails
#![allow(unused)]
fn main() {
// Event publishing is fire-and-forget
if let Err(e) = publisher.publish_event(event_name, payload, metadata).await {
warn!(
handler = self.handler_name(),
event_name = event_name,
error = %e,
"Failed to publish domain event - step will continue"
);
}
// Step continues regardless of publish result
}
Architecture
Data Flow
flowchart TB
subgraph handlers["Step Handlers"]
SH["Step Handler<br/>(Rust/Ruby)"]
end
SH -->|"publish_domain_event(name, payload)"| ER
subgraph routing["Event Routing"]
ER["EventRouter<br/>(Delivery Mode)"]
end
ER --> Durable
ER --> Fast
ER --> Broadcast
subgraph modes["Delivery Modes"]
Durable["Durable<br/>(PGMQ)"]
Fast["Fast<br/>(In-Process)"]
Broadcast["Broadcast<br/>(Both Paths)"]
end
Durable --> NEQ["Namespace<br/>Event Queue"]
Fast --> IPB["InProcessEventBus"]
Broadcast --> NEQ
Broadcast --> IPB
subgraph external["External Integration"]
NEQ --> EC["External Consumer<br/>(Polling)"]
end
subgraph internal["Internal Subscribers"]
IPB --> RS["Rust<br/>Subscribers"]
IPB --> RFF["Ruby FFI<br/>Channel"]
end
style handlers fill:#e1f5fe
style routing fill:#fff3e0
style modes fill:#f3e5f5
style external fill:#ffebee
style internal fill:#e8f5e9
Component Summary
| Component | Purpose | Location |
|---|---|---|
EventRouter | Routes events based on delivery mode | tasker-shared/src/events/domain_events/router.rs |
DomainEventPublisher | Durable PGMQ-based publishing | tasker-shared/src/events/domain_events/publisher.rs |
InProcessEventBus | Fast in-memory event dispatch | tasker-shared/src/events/domain_events/in_process_bus.rs |
EventRegistry | Pattern-based subscriber registration | tasker-shared/src/events/domain_events/registry.rs |
StepEventPublisher | Handler callback trait | tasker-shared/src/events/domain_events/step_event_publisher.rs |
GenericStepEventPublisher | Default publisher implementation | tasker-shared/src/events/domain_events/generic_publisher.rs |
Delivery Modes
Overview
The domain event system supports three delivery modes, configured per event in YAML templates:
| Mode | Durability | Latency | Use Case |
|---|---|---|---|
| Durable | High (PGMQ) | Higher (~5-10ms) | External system integration, audit trails |
| Fast | Low (memory) | Lowest (<1ms) | Internal subscribers, metrics, real-time processing |
| Broadcast | High + Low | Both paths | Events needing both internal and external delivery |
Durable Mode (PGMQ) - External Integration Boundary
Durable events define the integration boundary between Tasker and external systems. Events are published to namespace-specific PGMQ queues where external consumers can poll and process them.
Key Design Decision: Tasker does NOT consume durable events internally. PGMQ serves as a lightweight, PostgreSQL-native alternative to external message brokers (Kafka, AWS SNS/SQS, RabbitMQ). External systems or middleware proxies can:
- Poll PGMQ queues directly
- Forward events to Kafka, SNS/SQS, or other messaging systems
- Implement custom event processing pipelines
payment.processed → payments_domain_events (PGMQ queue) → External Systems
order.fulfilled → fulfillment_domain_events (PGMQ queue) → External Systems
Characteristics:
- Persisted in PostgreSQL (survives restarts)
- For external consumer integration only
- No internal Tasker polling or subscription
- Consumer acknowledgment and retry handled by external consumers
- Ordered within namespace
Implementation:
#![allow(unused)]
fn main() {
// DomainEventPublisher routes durable events to PGMQ
pub async fn publish_event(
&self,
event_name: &str,
payload: Value,
metadata: EventMetadata,
) -> TaskerResult<()> {
let queue_name = format!("{}_domain_events", metadata.namespace);
let message = DomainEventMessage {
event_name: event_name.to_string(),
payload,
metadata,
};
self.message_client
.send_message(&queue_name, &message)
.await
}
}
Fast Mode (In-Process) - Internal Subscriber Pattern
Fast events are the only delivery mode with internal subscriber support. Events are dispatched immediately to in-memory subscribers within the Tasker worker process.
#![allow(unused)]
fn main() {
// InProcessEventBus provides dual-path delivery
pub struct InProcessEventBus {
event_sender: tokio::sync::broadcast::Sender<DomainEvent>,
ffi_event_sender: Option<mpsc::Sender<DomainEvent>>,
}
}
Characteristics:
- Zero persistence overhead
- Sub-millisecond latency
- Lost on service restart
- Internal to Tasker process only
- Dual-path: Rust subscribers + Ruby FFI channel
- Non-blocking broadcast semantics
Dual-Path Architecture:
InProcessEventBus
│
├──► tokio::broadcast::Sender ──► Rust Subscribers (EventRegistry)
│
└──► mpsc::Sender ──► Ruby FFI Channel ──► Ruby Event Handlers
Use Cases:
- Real-time metrics collection
- Internal logging and telemetry
- Secondary actions that are not business-critical parts of the Task -> WorkflowStep DAG hierarchy
- Example: DataDog, Sentry, NewRelic, PagerDuty, Salesforce, Slack, Zapier
Broadcast Mode - Internal + External Delivery
Broadcast mode delivers events to both paths simultaneously: the fast in-process bus for internal subscribers AND the durable PGMQ queue for external systems. This ensures internal subscribers receive the same event shape as external consumers.
#![allow(unused)]
fn main() {
// EventRouter handles broadcast semantics
async fn route_event(&self, event: DomainEvent, mode: EventDeliveryMode) {
match mode {
EventDeliveryMode::Durable => {
self.durable_publisher.publish(event).await;
}
EventDeliveryMode::Fast => {
self.in_process_bus.publish(event).await;
}
EventDeliveryMode::Broadcast => {
// Send to both paths concurrently
let (durable, fast) = tokio::join!(
self.durable_publisher.publish(event.clone()),
self.in_process_bus.publish(event)
);
// Log errors but don't fail
}
}
}
}
When to Use Broadcast:
- Internal subscribers need the same event that external systems receive
- Real-time internal metrics tracking for events also exported externally
- Audit logging both internally and to external compliance systems
Important: Data published via broadcast goes to BOTH the internal process AND the public PGMQ boundary. Do not use broadcast for sensitive internal-only data (use fast for those).
Publisher Patterns
Default Publisher (GenericStepEventPublisher)
The default publisher automatically handles event publication for step handlers:
#![allow(unused)]
fn main() {
pub struct GenericStepEventPublisher {
router: Arc<EventRouter>,
default_delivery_mode: EventDeliveryMode,
}
impl GenericStepEventPublisher {
/// Publish event with metadata extracted from step context
pub async fn publish(
&self,
step_data: &TaskSequenceStep,
event_name: &str,
payload: Value,
) -> TaskerResult<()> {
let metadata = EventMetadata {
task_uuid: step_data.task.task.task_uuid,
step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
step_name: Some(step_data.workflow_step.name.clone()),
namespace: step_data.task.namespace_name.clone(),
correlation_id: step_data.task.task.correlation_id,
fired_at: Utc::now(),
fired_by: "generic_publisher".to_string(),
};
self.router.route_event(event_name, payload, metadata).await
}
}
}
Custom Publishers
Custom publishers extend TaskerCore::DomainEvents::BasePublisher (Ruby) to provide specialized event handling with payload transformation, conditional publishing, and lifecycle hooks.
Real Example: PaymentEventPublisher (workers/ruby/spec/handlers/examples/domain_events/publishers/payment_event_publisher.rb):
# Custom publisher for payment-related domain events
# Demonstrates durable delivery mode with custom payload enrichment
module DomainEvents
module Publishers
class PaymentEventPublisher < TaskerCore::DomainEvents::BasePublisher
# Must match the `publisher:` field in YAML
def name
'DomainEvents::Publishers::PaymentEventPublisher'
end
# Transform step result into payment event payload
def transform_payload(step_result, event_declaration, step_context = nil)
result = step_result[:result] || {}
event_name = event_declaration[:name]
if step_result[:success] && event_name&.include?('processed')
build_success_payload(result, step_result, step_context)
elsif !step_result[:success] && event_name&.include?('failed')
build_failure_payload(result, step_result, step_context)
else
result
end
end
# Determine if this event should be published
def should_publish?(step_result, event_declaration, step_context = nil)
result = step_result[:result] || {}
event_name = event_declaration[:name]
# For success events, verify we have transaction data
if event_name&.include?('processed')
return step_result[:success] && result[:transaction_id].present?
end
# For failure events, verify we have error info
if event_name&.include?('failed')
metadata = step_result[:metadata] || {}
return !step_result[:success] && metadata[:error_code].present?
end
true # Default: always publish
end
# Add execution metrics to event metadata
def additional_metadata(step_result, event_declaration, step_context = nil)
metadata = step_result[:metadata] || {}
{
execution_time_ms: metadata[:execution_time_ms],
publisher_type: 'custom',
publisher_name: name,
payment_provider: metadata[:payment_provider]
}
end
private
def build_success_payload(result, step_result, step_context)
{
transaction_id: result[:transaction_id],
amount: result[:amount],
currency: result[:currency] || 'USD',
payment_method: result[:payment_method] || 'credit_card',
processed_at: result[:processed_at] || Time.now.iso8601,
delivery_mode: 'durable',
publisher: name
}
end
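      # build_failure_payload (not shown here) mirrors this structure, drawing error_code and reason from the failed step result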
end
end
end
YAML Configuration for Custom Publisher:
steps:
- name: process_payment
publishes_events:
- name: payment.processed
condition: success
delivery_mode: durable
publisher: DomainEvents::Publishers::PaymentEventPublisher
- name: payment.failed
condition: failure
delivery_mode: durable
publisher: DomainEvents::Publishers::PaymentEventPublisher
YAML Event Declaration
Events are declared in task template YAML files using the publishes_events field at the step level:
# config/tasks/payments/credit_card_payment/1.0.0.yaml
name: credit_card_payment
namespace_name: payments
version: 1.0.0
description: Process credit card payments with validation and fraud detection
# Task-level domain events (optional)
domain_events: []
steps:
- name: process_payment
description: Process the payment transaction
handler:
callable: PaymentProcessing::StepHandler::ProcessPaymentHandler
initialization:
gateway_url: "${PAYMENT_GATEWAY_URL}"
dependencies:
- validate_payment
retry:
retryable: true
limit: 3
backoff: exponential
timeout_seconds: 120
# Step-level event declarations
publishes_events:
- name: payment.processed
description: "Payment successfully processed"
condition: success # success, failure, retryable_failure, permanent_failure, always
schema:
type: object
required: [transaction_id, amount]
properties:
transaction_id: { type: string }
amount: { type: number }
delivery_mode: broadcast # durable, fast, or broadcast
publisher: PaymentEventPublisher # optional custom publisher
- name: payment.failed
description: "Payment processing failed permanently"
condition: permanent_failure
schema:
type: object
required: [error_code, reason]
properties:
error_code: { type: string }
reason: { type: string }
delivery_mode: durable
Publication Conditions:
- success: Publish only when the step completes successfully
- failure: Publish on any step failure (backward compatible)
- retryable_failure: Publish only on retryable failures (the step can be retried)
- permanent_failure: Publish only on permanent failures (exhausted retries or non-retryable)
- always: Publish regardless of step outcome
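A minimal sketch of how these conditions map onto step outcomes (enum and function names are illustrative, not Tasker's exact types):
enum PublishCondition {
    Success,
    Failure,
    RetryableFailure,
    PermanentFailure,
    Always,
}

fn should_publish(cond: &PublishCondition, success: bool, retryable: bool) -> bool {
    match cond {
        PublishCondition::Success => success,
        PublishCondition::Failure => !success,
        PublishCondition::RetryableFailure => !success && retryable,
        PublishCondition::PermanentFailure => !success && !retryable,
        PublishCondition::Always => true,
    }
}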
Event Declaration Fields:
- name: Event name in dotted notation (e.g., payment.processed)
- description: Human-readable description of when this event is published
- condition: When to publish (defaults to success)
- schema: JSON Schema for validating event payloads
- delivery_mode: Delivery mode (defaults to durable)
- publisher: Optional custom publisher class name
Subscriber Patterns
Subscriber patterns apply only to fast (in-process) events. Durable events are consumed by external systems, not by internal Tasker subscribers.
Rust Subscribers (InProcessEventBus)
Rust subscribers are registered with the InProcessEventBus using the EventHandler type. Subscribers are async closures that receive DomainEvent instances.
Real Example: Logging Subscriber (workers/rust/src/event_subscribers/logging_subscriber.rs):
#![allow(unused)]
fn main() {
use std::sync::Arc;
use tasker_shared::events::registry::EventHandler;
use tracing::info;
/// Create a logging subscriber that logs all events matching a pattern
pub fn create_logging_subscriber(prefix: &str) -> EventHandler {
let prefix = prefix.to_string();
Arc::new(move |event| {
let prefix = prefix.clone();
Box::pin(async move {
let step_name = event.metadata.step_name.as_deref().unwrap_or("unknown");
info!(
prefix = %prefix,
event_name = %event.event_name,
event_id = %event.event_id,
task_uuid = %event.metadata.task_uuid,
step_name = %step_name,
namespace = %event.metadata.namespace,
correlation_id = %event.metadata.correlation_id,
fired_at = %event.metadata.fired_at,
"Domain event received"
);
Ok(())
})
})
}
}
Real Example: Metrics Collector (workers/rust/src/event_subscribers/metrics_subscriber.rs):
#![allow(unused)]
fn main() {
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
/// Collects metrics from domain events (thread-safe)
pub struct EventMetricsCollector {
events_received: AtomicU64,
success_events: AtomicU64,
failure_events: AtomicU64,
// ... additional fields
}
impl EventMetricsCollector {
pub fn new() -> Arc<Self> {
Arc::new(Self {
events_received: AtomicU64::new(0),
success_events: AtomicU64::new(0),
failure_events: AtomicU64::new(0),
})
}
/// Create an event handler for this collector
pub fn create_handler(self: &Arc<Self>) -> EventHandler {
let metrics = Arc::clone(self);
Arc::new(move |event| {
let metrics = Arc::clone(&metrics);
Box::pin(async move {
metrics.events_received.fetch_add(1, Ordering::Relaxed);
if event.payload.execution_result.success {
metrics.success_events.fetch_add(1, Ordering::Relaxed);
} else {
metrics.failure_events.fetch_add(1, Ordering::Relaxed);
}
Ok(())
})
})
}
pub fn events_received(&self) -> u64 {
self.events_received.load(Ordering::Relaxed)
}
}
}
Registration with InProcessEventBus:
#![allow(unused)]
fn main() {
use tasker_worker::worker::in_process_event_bus::InProcessEventBus;
let mut bus = InProcessEventBus::new(config);
// Subscribe to all events
bus.subscribe("*", create_logging_subscriber("[ALL]")).unwrap();
// Subscribe to specific patterns
bus.subscribe("payment.*", create_logging_subscriber("[PAYMENT]")).unwrap();
// Use metrics collector
let metrics = EventMetricsCollector::new();
bus.subscribe("*", metrics.create_handler()).unwrap();
}
Ruby Subscribers (BaseSubscriber)
Ruby subscribers extend TaskerCore::DomainEvents::BaseSubscriber and use the class-level subscribes_to pattern declaration.
Real Example: LoggingSubscriber (workers/ruby/spec/handlers/examples/domain_events/subscribers/logging_subscriber.rb):
# Example logging subscriber for fast/in-process domain events
module DomainEvents
module Subscribers
class LoggingSubscriber < TaskerCore::DomainEvents::BaseSubscriber
# Subscribe to all events using pattern matching
subscribes_to '*'
# Handle any domain event by logging its details
def handle(event)
event_name = event[:event_name]
metadata = event[:metadata] || {}
logger.info "[LoggingSubscriber] Event: #{event_name}"
logger.info " Task: #{metadata[:task_uuid]}"
logger.info " Step: #{metadata[:step_name]}"
logger.info " Namespace: #{metadata[:namespace]}"
logger.info " Correlation: #{metadata[:correlation_id]}"
end
end
end
end
Real Example: MetricsSubscriber (workers/ruby/spec/handlers/examples/domain_events/subscribers/metrics_subscriber.rb):
# Example metrics subscriber for fast/in-process domain events
module DomainEvents
module Subscribers
class MetricsSubscriber < TaskerCore::DomainEvents::BaseSubscriber
subscribes_to '*'
class << self
attr_accessor :events_received, :success_events, :failure_events,
:events_by_namespace, :last_event_at
def reset_counters!
@mutex = Mutex.new
@events_received = 0
@success_events = 0
@failure_events = 0
@events_by_namespace = Hash.new(0)
@last_event_at = nil
end
end
reset_counters!
def handle(event)
event_name = event[:event_name]
metadata = event[:metadata] || {}
execution_result = event[:execution_result] || {}
self.class.increment(:events_received)
if execution_result[:success]
self.class.increment(:success_events)
else
self.class.increment(:failure_events)
end
namespace = metadata[:namespace] || 'unknown'
self.class.increment_hash(:events_by_namespace, namespace)
self.class.set(:last_event_at, Time.now)
end
end
end
end
Registration in Bootstrap:
# Register subscribers with the registry
registry = TaskerCore::DomainEvents::SubscriberRegistry.instance
registry.register(DomainEvents::Subscribers::LoggingSubscriber)
registry.register(DomainEvents::Subscribers::MetricsSubscriber)
registry.start_all!
# Later, query metrics
puts "Total events: #{DomainEvents::Subscribers::MetricsSubscriber.events_received}"
puts "By namespace: #{DomainEvents::Subscribers::MetricsSubscriber.events_by_namespace}"
External PGMQ Consumers (Durable Events)
Durable events are published to PGMQ queues for external consumption. Tasker does not provide internal consumers for these queues. External systems can consume events using:
- Direct PGMQ Polling: Query the pgmq.q_{namespace}_domain_events tables directly
- Middleware Proxies: Build adapters that forward events to Kafka, SNS/SQS, etc.
Example: External Python Consumer:
import pgmq
# Connect to PGMQ
queue = pgmq.Queue("payments_domain_events", dsn="postgresql://...")
# Poll for events
while True:
messages = queue.read(batch_size=50, vt=30)
for msg in messages:
process_event(msg.message)
queue.delete(msg.msg_id)
Configuration
Domain event system configuration is part of the worker configuration in worker.toml files.
TOML Configuration
# config/tasker/base/worker.toml
# In-process event bus configuration for fast domain event delivery
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 2000 # Channel capacity for broadcast events
log_subscriber_errors = true # Log errors from event subscribers
dispatch_timeout_ms = 5000 # Timeout for event dispatch
# Domain Event System MPSC Configuration
[worker.mpsc_channels.domain_events]
command_buffer_size = 1000 # Channel capacity for domain event commands
shutdown_drain_timeout_ms = 5000 # Time to drain events on shutdown
log_dropped_events = true # Log when events are dropped due to backpressure
# In-process event settings (part of worker event systems)
[worker.event_systems.worker.metadata.in_process_events]
ffi_integration_enabled = true # Enable Ruby/Python FFI event channel
deduplication_cache_size = 10000 # Event deduplication cache size
Environment Overrides
Test Environment (config/tasker/environments/test/worker.toml):
# In-process event bus - smaller buffers for testing
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 1000
log_subscriber_errors = true
dispatch_timeout_ms = 2000
# Domain Event System - smaller buffers for testing
[worker.mpsc_channels.domain_events]
command_buffer_size = 100
shutdown_drain_timeout_ms = 1000
log_dropped_events = true
[worker.event_systems.worker.metadata.in_process_events]
deduplication_cache_size = 1000
Production Environment (config/tasker/environments/production/worker.toml):
# In-process event bus - large buffers for production throughput
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 5000
log_subscriber_errors = false # Reduce log noise in production
dispatch_timeout_ms = 10000
# Domain Event System - large buffers for production throughput
[worker.mpsc_channels.domain_events]
command_buffer_size = 5000
shutdown_drain_timeout_ms = 10000
log_dropped_events = false # Reduce log noise in production
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
broadcast_buffer_size | Capacity of the broadcast channel for fast events | 2000 |
log_subscriber_errors | Whether to log subscriber errors | true |
dispatch_timeout_ms | Timeout for event dispatch to subscribers | 5000 |
command_buffer_size | Capacity of domain event command channel | 1000 |
shutdown_drain_timeout_ms | Time to drain pending events on shutdown | 5000 |
log_dropped_events | Whether to log events dropped due to backpressure | true |
ffi_integration_enabled | Enable FFI event channel for Ruby/Python | true |
deduplication_cache_size | Size of event deduplication cache | 10000 |
Integration with Step Execution
Event-Driven Domain Event Publishing
The worker uses an event-driven command pattern for step execution and domain event publishing. Nothing blocks - domain events are dispatched after successful orchestration notification using fire-and-forget semantics.
Flow (tasker-worker/src/worker/command_processor.rs):
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────────┐
│ FFI Handler │────►│ Completion │────►│ WorkerProcessor │
│ (Ruby/Rust) │ │ Channel │ │ Command Loop │
└─────────────────┘ └──────────────────┘ └──────────┬──────────┘
│
┌───────────────────────────────────┴───────────────┐
│ │
▼ ▼
┌─────────────────────┐ ┌────────────────────┐
│ 1. Send result to │ │ 2. Dispatch domain │
│ orchestration │──── on success ─────────►│ events │
│ (PGMQ) │ │ (fire-and-forget)│
└─────────────────────┘ └────────────────────┘
Implementation:
#![allow(unused)]
fn main() {
// tasker-worker/src/worker/command_processor.rs (lines 512-525)
// Worker command processor receives step completions via FFI channel
match self.handle_send_step_result(step_result.clone()).await {
Ok(()) => {
// Dispatch domain events AFTER successful orchestration notification.
// Domain events are declarative of what HAS happened - the step is only
// truly complete once orchestration has been notified successfully.
// Fire-and-forget semantics (try_send) - never blocks the worker.
self.dispatch_domain_events(&step_result, None);
info!(
worker_id = %self.worker_id,
step_uuid = %step_result.step_uuid,
"Step completion forwarded to orchestration successfully"
);
}
Err(e) => {
// Don't dispatch domain events - orchestration wasn't notified,
// so the step isn't truly complete from the system's perspective
error!(
worker_id = %self.worker_id,
step_uuid = %step_result.step_uuid,
error = %e,
"Failed to forward step completion to orchestration"
);
}
}
}
Domain Event Dispatch (fire-and-forget):
#![allow(unused)]
fn main() {
// tasker-worker/src/worker/command_processor.rs (lines 362-432)
fn dispatch_domain_events(&mut self, step_result: &StepExecutionResult, correlation_id: Option<Uuid>) {
// Retrieve cached step context (stored when step was claimed)
let task_sequence_step = match self.step_execution_contexts.remove(&step_result.step_uuid) {
Some(ctx) => ctx,
None => return, // No context = can't build events
};
// Build events from step definition's publishes_events declarations
for event_def in &task_sequence_step.step_definition.publishes_events {
// Check publication condition before building event
if !event_def.should_publish(step_result.success) {
continue; // Skip events whose condition doesn't match
}
let event = DomainEventToPublish {
event_name: event_def.name.clone(),
delivery_mode: event_def.delivery_mode,
business_payload: step_result.result.clone(),
metadata: EventMetadata { /* ... */ },
task_sequence_step: task_sequence_step.clone(),
execution_result: step_result.clone(),
};
domain_events.push(event);
}
// Fire-and-forget dispatch - try_send never blocks
let dispatched = handle.dispatch_events(domain_events, publisher_name, correlation);
if !dispatched {
warn!(
step_uuid = %step_result.step_uuid,
"Domain event dispatch failed - channel full (events dropped)"
);
}
}
}
Key Design Decisions:
- Events only after orchestration success: Domain events are declarative of what HAS happened. If orchestration notification fails, the step isn’t truly complete from the system’s perspective.
- Fire-and-forget via try_send: Never blocks the worker command loop. If the channel is full, events are dropped and logged.
- Context caching: Step execution context is cached when the step is claimed, then retrieved for event building after completion.
Correlation ID Propagation
Domain events maintain correlation IDs for end-to-end distributed tracing. The correlation ID originates from the task and flows through all step executions and domain events.
EventMetadata Structure (tasker-shared/src/events/domain_events.rs):
#![allow(unused)]
fn main() {
pub struct EventMetadata {
pub task_uuid: Uuid,
pub step_uuid: Option<Uuid>,
pub step_name: Option<String>,
pub namespace: String,
pub correlation_id: Uuid, // From task for end-to-end tracing
pub fired_at: DateTime<Utc>,
pub fired_by: String, // Publisher identifier (worker_id)
}
}
Getting Correlation ID via API:
Use the orchestration API to get the correlation ID for a task:
# Get task details including correlation_id
curl http://localhost:8080/v1/tasks/{task_uuid} | jq '.correlation_id'
# Response includes correlation_id
{
"task_uuid": "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5",
"correlation_id": "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5",
"status": "complete",
...
}
Tracing Events in PGMQ:
# Find all durable events for a correlation ID
psql $DATABASE_URL -c "
SELECT
message->>'event_name' as event,
message->'metadata'->>'step_name' as step,
message->'metadata'->>'fired_at' as fired_at
FROM pgmq.q_payments_domain_events
WHERE message->'metadata'->>'correlation_id' = '0199c3e0-ccdb-7581-87ab-3f67daeaa4a5'
ORDER BY message->'metadata'->>'fired_at';
"
Metrics and Observability
OpenTelemetry Metrics
Domain event publication emits OpenTelemetry counter metrics (tasker-shared/src/events/domain_events.rs:207-219):
#![allow(unused)]
fn main() {
// Metric emitted on every domain event publication
let counter = opentelemetry::global::meter("tasker")
.u64_counter("tasker.domain_events.published.total")
.with_description("Total number of domain events published")
.build();
counter.add(1, &[
opentelemetry::KeyValue::new("event_name", event_name.to_string()),
opentelemetry::KeyValue::new("namespace", metadata.namespace.clone()),
]);
}
Prometheus Metrics Endpoint
The orchestration service exposes Prometheus-format metrics:
# Get Prometheus metrics from orchestration service
curl http://localhost:8080/metrics
# Get Prometheus metrics from worker service
curl http://localhost:8081/metrics
OpenTelemetry Tracing
Domain event publication is instrumented with tracing spans (tasker-shared/src/events/domain_events.rs:157-161):
#![allow(unused)]
fn main() {
#[instrument(skip(self, payload, metadata), fields(
event_name = %event_name,
namespace = %metadata.namespace,
correlation_id = %metadata.correlation_id
))]
pub async fn publish_event(
&self,
event_name: &str,
payload: DomainEventPayload,
metadata: EventMetadata,
) -> Result<Uuid, DomainEventError> {
// ... implementation with debug! and info! logs including correlation_id
}
}
Grafana Query Examples
Loki Query - Domain Events by Correlation ID:
{service_name="tasker-worker"} |= "Domain event published" | json | correlation_id = "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"
Loki Query - All Domain Event Publications:
{service_name=~"tasker.*"} |= "Domain event" | json | line_format "{{.event_name}} - {{.namespace}} - {{.correlation_id}}"
Tempo Query - Trace by Correlation ID:
{resource.service.name="tasker-worker"} && {span.correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Prometheus Query - Event Publication Rate by Namespace:
sum by (namespace) (rate(tasker_domain_events_published_total[5m]))
Prometheus Query - Event Publication Rate by Event Name:
topk(10, sum by (event_name) (rate(tasker_domain_events_published_total[5m])))
Structured Log Fields
Domain event logs include structured fields for querying:
| Field | Description | Example |
|---|---|---|
event_id | Unique event UUID (v7, time-ordered) | 0199c3e0-d123-... |
event_name | Event name in dot notation | payment.processed |
queue_name | Target PGMQ queue | payments_domain_events |
task_uuid | Parent task UUID | 0199c3e0-ccdb-... |
correlation_id | End-to-end trace correlation | 0199c3e0-ccdb-... |
namespace | Event namespace | payments |
message_id | PGMQ message ID | 12345 |
Best Practices
1. Choose the Right Delivery Mode
| Scenario | Recommended Mode | Rationale |
|---|---|---|
| External system integration | Durable | Reliable delivery to external consumers |
| Internal metrics/telemetry | Fast | Internal subscribers only, low latency |
| Internal + external needs | Broadcast | Same event shape to both internal and external |
| Audit trails for compliance | Durable | Persisted for external audit systems |
| Real-time internal dashboards | Fast | In-process subscribers handle immediately |
Key Decision Criteria:
- Need internal Tasker subscribers? → Use fast or broadcast
- Need external system integration? → Use durable or broadcast
- Internal-only, sensitive data? → Use fast (never reaches the PGMQ boundary)
2. Design Event Payloads
Do:
#![allow(unused)]
fn main() {
json!({
"transaction_id": "TXN-123",
"amount": 99.99,
"currency": "USD",
"timestamp": "2025-12-01T10:00:00Z",
"idempotency_key": step_uuid
})
}
Don’t:
#![allow(unused)]
fn main() {
json!({
"data": "payment processed", // No structure
"info": full_database_record // Too much data
})
}
3. Handle Subscriber Failures Gracefully
#![allow(unused)]
fn main() {
#[async_trait]
impl EventSubscriber for MySubscriber {
async fn handle(&self, event: &DomainEvent) -> TaskerResult<()> {
// Wrap in timeout
match timeout(Duration::from_secs(5), self.process(event)).await {
Ok(result) => result,
Err(_) => {
warn!(event = %event.name, "Subscriber timeout");
Ok(()) // Don't fail the dispatch
}
}
}
}
}
4. Use Correlation IDs for Debugging
#![allow(unused)]
fn main() {
// Always include correlation ID in logs
info!(
correlation_id = %event.metadata.correlation_id,
event_name = %event.name,
namespace = %event.metadata.namespace,
"Processing domain event"
);
}
Related Documentation
- Events and Commands: events-and-commands.md - System event architecture
- Observability: observability/README.md - Metrics and monitoring
- States and Lifecycles: states-and-lifecycles.md - Task/step state machines
This domain event architecture provides a flexible, reliable foundation for business observability in the tasker-core workflow orchestration system.
Events and Commands Architecture
Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Messaging Abstraction | States and Lifecycles | Deployment Patterns
← Back to Documentation Hub
This document provides comprehensive documentation of the event-driven and command pattern architecture in tasker-core, covering the unified event system foundation, orchestration and worker implementations, and the flow of tasks and steps through the system.
Overview
The tasker-core system implements a sophisticated hybrid architecture that combines:
- Event-Driven Systems: Real-time coordination using PostgreSQL LISTEN/NOTIFY and PGMQ notifications
- Command Pattern: Async command processors using tokio mpsc channels for orchestration and worker operations
- Hybrid Deployment Modes: PollingOnly, EventDrivenOnly, and Hybrid modes with fallback polling
- Queue-Based Communication: Provider-agnostic message queues (PGMQ or RabbitMQ) for reliable step execution and result processing
This architecture eliminates polling complexity while maintaining resilience through fallback mechanisms and provides horizontal scaling capabilities with atomic operation guarantees.
Event System Foundation
EventDrivenSystem Trait
The foundation of the event architecture is defined in tasker-shared/src/event_system/event_driven.rs with the EventDrivenSystem trait:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait EventDrivenSystem: Send + Sync {
type SystemId: Send + Sync + Clone + fmt::Display + fmt::Debug;
type Event: Send + Sync + Clone + fmt::Debug;
type Config: Send + Sync + Clone;
type Statistics: EventSystemStatistics + Send + Sync + Clone;
// Core lifecycle methods
async fn start(&mut self) -> Result<(), DeploymentModeError>;
async fn stop(&mut self) -> Result<(), DeploymentModeError>;
fn is_running(&self) -> bool;
// Event processing
async fn process_event(&self, event: Self::Event) -> Result<(), DeploymentModeError>;
// Monitoring and health
async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError>;
fn statistics(&self) -> Self::Statistics;
// Configuration
fn deployment_mode(&self) -> DeploymentMode;
fn config(&self) -> &Self::Config;
}
}
Deployment Modes
The system supports three deployment modes for different operational requirements:
PollingOnly Mode
- Traditional polling-based coordination
- No event listeners or real-time notifications
- Reliable fallback for environments with networking restrictions
- Higher latency but guaranteed operation
EventDrivenOnly Mode
- Pure event-driven coordination using PostgreSQL LISTEN/NOTIFY
- Real-time response to database changes
- Lowest latency for step discovery and task coordination
- Requires reliable PostgreSQL connections
Hybrid Mode
- Primary event-driven coordination with polling fallback
- Best of both worlds: real-time when possible, reliable when needed
- Automatic fallback during connection issues
- Production-ready with resilience guarantees
Selecting a Deployment Mode
The Tasker system is built for distributed deployment, with multiple orchestration core servers and worker servers operating simultaneously. Separating deployment modes lets you scale event-driven-only nodes up to meet demand while keeping polling-only nodes running at a reasonable fallback polling interval and batch size, or deploy in Hybrid mode and tune these behaviors instance by instance.
Event Types and Sources
Queue-Level Events (Provider-Agnostic)
The system supports multiple messaging backends through MessageNotification:
#![allow(unused)]
fn main() {
pub enum MessageNotification {
/// Signal-only notification (PGMQ style)
/// Indicates a message is available but requires separate fetch
Available {
queue_name: String,
msg_id: Option<i64>,
},
/// Full message notification (RabbitMQ style)
/// Contains the complete message payload
Message(QueuedMessage<Vec<u8>>),
}
}
Event Sources by Provider:
| Provider | Notification Type | Fetch Required | Fallback Polling |
|---|---|---|---|
| PGMQ | Available | Yes (read by msg_id) | Required |
| RabbitMQ | Message | No (full payload) | Not needed |
| InMemory | Message | No | Not needed |
Common Event Types:
- Step Results: Worker completion notifications
- Task Requests: New task initialization requests
- Message Ready Events: Queue message availability notifications
- Transport: Provider-agnostic via MessagingProvider.subscribe_many()
Command Pattern Architecture
Command Processor Pattern
Both orchestration and worker systems implement the command pattern to replace complex polling-based coordinators:
Benefits:
- No Polling Loops (except intentional fallback polling): Pure tokio mpsc command processing
- Simplified Architecture: ~100 lines vs 1000+ lines of complex systems
- Race Condition Prevention: Atomic operations through proper delegation
- Observability Preservation: Maintains metrics through delegated components
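As a rough illustration of the pattern (simplified stand-in types, not Tasker's actual command enums), a command is an enum variant carrying a oneshot responder, and the processor is a single loop over an mpsc channel:
use tokio::sync::{mpsc, oneshot};

enum Command {
    HealthCheck { resp: oneshot::Sender<bool> },
}

async fn command_loop(mut rx: mpsc::Receiver<Command>) {
    while let Some(cmd) = rx.recv().await {
        match cmd {
            Command::HealthCheck { resp } => {
                // Delegate to business logic, then answer on the response channel
                let _ = resp.send(true);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(command_loop(rx));

    // Caller sends a command plus a oneshot responder, then awaits the reply
    let (resp_tx, resp_rx) = oneshot::channel();
    tx.send(Command::HealthCheck { resp: resp_tx }).await.unwrap();
    println!("healthy: {}", resp_rx.await.unwrap());
}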
Command Flow Patterns
Both systems follow consistent command processing patterns:
sequenceDiagram
participant Client
participant CommandChannel
participant Processor
participant Delegate
participant Response
Client->>CommandChannel: Send Command + ResponseChannel
CommandChannel->>Processor: Receive Command
Processor->>Delegate: Delegate to Business Logic Component
Delegate-->>Processor: Return Result
Processor->>Response: Send Result via ResponseChannel
Response-->>Client: Receive Result
Orchestration Event Systems
OrchestrationEventSystem
Implemented in tasker-orchestration/src/orchestration/event_systems/orchestration_event_system.rs:
#![allow(unused)]
fn main() {
pub struct OrchestrationEventSystem {
system_id: String,
deployment_mode: DeploymentMode,
queue_listener: Option<OrchestrationQueueListener>,
fallback_poller: Option<OrchestrationFallbackPoller>,
context: Arc<SystemContext>,
orchestration_core: Arc<OrchestrationCore>,
command_sender: mpsc::Sender<OrchestrationCommand>,
// ... statistics and state
}
}
Orchestration Command Types
The command processor handles both full-message and signal-only notification types:
#![allow(unused)]
fn main() {
pub enum OrchestrationCommand {
// Task lifecycle
InitializeTask { request: TaskRequestMessage, resp: CommandResponder<TaskInitializeResult> },
ProcessStepResult { result: StepExecutionResult, resp: CommandResponder<StepProcessResult> },
FinalizeTask { task_uuid: Uuid, resp: CommandResponder<TaskFinalizationResult> },
// Full message processing (RabbitMQ style - MessageNotification::Message)
// Used when provider delivers complete message payload
ProcessStepResultFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<StepProcessResult> },
InitializeTaskFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<TaskInitializeResult> },
FinalizeTaskFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<TaskFinalizationResult> },
// Signal-only processing (PGMQ style - MessageNotification::Available)
// Used when provider sends notification that requires separate fetch
ProcessStepResultFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<StepProcessResult> },
InitializeTaskFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<TaskInitializeResult> },
FinalizeTaskFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<TaskFinalizationResult> },
// Task readiness (database events)
ProcessTaskReadiness { task_uuid: Uuid, namespace: String, priority: i32, ready_steps: i32, triggered_by: String, resp: CommandResponder<TaskReadinessResult> },
// System operations
GetProcessingStats { resp: CommandResponder<OrchestrationProcessingStats> },
HealthCheck { resp: CommandResponder<SystemHealth> },
Shutdown { resp: CommandResponder<()> },
}
}
Command Routing by Notification Type:
- MessageNotification::Message -> *FromMessage commands (immediate processing)
- MessageNotification::Available -> *FromMessageEvent commands (requires fetch)
Orchestration Queue Architecture
The orchestration system coordinates multiple queue types:
- orchestration_step_results: Step completion results from workers
- orchestration_task_requests: New task initialization requests
- orchestration_task_finalization: Task finalization notifications
- Namespace Queues: Per-namespace step queues (e.g., fulfillment_queue, inventory_queue)
TaskReadinessEventSystem
Handles database-level events for task readiness using PostgreSQL LISTEN/NOTIFY:
#![allow(unused)]
fn main() {
pub struct TaskReadinessEventSystem {
system_id: String,
deployment_mode: DeploymentMode,
listener: Option<TaskReadinessListener>,
fallback_poller: Option<TaskReadinessFallbackPoller>,
context: Arc<SystemContext>,
command_sender: mpsc::Sender<OrchestrationCommand>,
// ... configuration and statistics
}
}
PGMQ Notification Channels:
- pgmq_message_ready.orchestration: Orchestration queue messages ready (task requests, step results, finalizations)
- pgmq_message_ready.{namespace}: Worker namespace queue messages ready (e.g., payments, fulfillment, linear_workflow)
- pgmq_message_ready: Global channel for all queue messages (fallback)
- pgmq_queue_created: Queue creation notifications
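As a rough illustration of how a listener subscribes to one of these channels, the sketch below uses sqlx's PgListener directly; the connection string is a placeholder, and the production listeners layer fallback polling and statistics on top of this basic loop.
// Minimal sketch: subscribing to a PGMQ notification channel with sqlx's PgListener.
use sqlx::postgres::PgListener;

async fn listen_orchestration_channel(database_url: &str) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect(database_url).await?;
    listener.listen("pgmq_message_ready.orchestration").await?;
    loop {
        let notification = listener.recv().await?;
        // Print whatever payload arrives; real listeners translate notifications
        // into MessageReadyEvent commands for the processor.
        println!("channel={} payload={}", notification.channel(), notification.payload());
    }
}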
Unified Event Coordination
The UnifiedEventCoordinator demonstrates coordinated management of multiple event systems:
#![allow(unused)]
fn main() {
pub struct UnifiedEventCoordinator {
orchestration_system: OrchestrationEventSystem,
task_readiness_fallback: FallbackPoller,
deployment_mode: DeploymentMode,
health_monitor: EventSystemHealthMonitor,
// ... coordination logic
}
}
Coordination Features:
- Shared Command Channel: Both systems send commands to same orchestration processor
- Health Monitoring: Unified health checking across all event systems
- Deployment Mode Management: Synchronized mode changes
- Statistics Aggregation: Combined metrics from all systems
Worker Event Systems
WorkerEventSystem
Implemented in tasker-worker/src/worker/event_systems/worker_event_system.rs:
#![allow(unused)]
fn main() {
pub struct WorkerEventSystem {
system_id: String,
deployment_mode: DeploymentMode,
queue_listeners: HashMap<String, WorkerQueueListener>,
fallback_pollers: HashMap<String, WorkerFallbackPoller>,
context: Arc<SystemContext>,
command_sender: mpsc::Sender<WorkerCommand>,
// ... statistics and configuration
}
}
Worker Command Types
#![allow(unused)]
fn main() {
pub enum WorkerCommand {
// Step execution
ExecuteStep { message: PgmqMessage<SimpleStepMessage>, queue_name: String, resp: CommandResponder<()> },
ExecuteStepWithCorrelation { message: PgmqMessage<SimpleStepMessage>, queue_name: String, correlation_id: Uuid, resp: CommandResponder<()> },
// Result processing
SendStepResult { result: StepExecutionResult, resp: CommandResponder<()> },
ProcessStepCompletion { step_result: StepExecutionResult, correlation_id: Option<Uuid>, resp: CommandResponder<()> },
// Event integration
ExecuteStepFromMessage { queue_name: String, message: PgmqMessage, resp: CommandResponder<()> },
ExecuteStepFromEvent { message_event: MessageReadyEvent, resp: CommandResponder<()> },
// System operations
GetWorkerStatus { resp: CommandResponder<WorkerStatus> },
SetEventIntegration { enabled: bool, resp: CommandResponder<()> },
GetEventStatus { resp: CommandResponder<EventIntegrationStatus> },
RefreshTemplateCache { namespace: Option<String>, resp: CommandResponder<()> },
HealthCheck { resp: CommandResponder<WorkerHealthStatus> },
Shutdown { resp: CommandResponder<()> },
}
}
Worker Queue Architecture
Workers monitor namespace-specific queues for step execution. These custom namespace queues are configured dynamically per deployment.
Example queues:
- fulfillment_queue: All fulfillment namespace steps
- inventory_queue: All inventory namespace steps
- notifications_queue: All notification namespace steps
- payment_queue: All payment processing steps
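The queue-per-namespace convention is simply {namespace}_queue (see the queue table later in this document), so deriving a worker's queue name is a one-liner:
// Namespace-to-queue naming convention used throughout this document.
fn namespace_queue_name(namespace: &str) -> String {
    format!("{namespace}_queue")
}
// namespace_queue_name("fulfillment") == "fulfillment_queue"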
Event Flow and System Interactions
Complete Task Execution Flow
sequenceDiagram
participant Client
participant Orchestration
participant TaskDB
participant StepQueue
participant Worker
participant ResultQueue
%% Task Initialization
Client->>Orchestration: TaskRequestMessage (via pgmq_send_with_notify)
Orchestration->>TaskDB: Create Task + Steps
%% Step Discovery and Enqueueing (Event-Driven or Fallback Polling)
Orchestration->>StepQueue: pgmq_send_with_notify(ready steps)
StepQueue-->>Worker: pg_notify('pgmq_message_ready.{namespace}')
%% Step Execution
Worker->>StepQueue: pgmq.read() to claim step
Worker->>Worker: Execute Step Handler
Worker->>ResultQueue: pgmq_send_with_notify(StepExecutionResult)
ResultQueue-->>Orchestration: pg_notify('pgmq_message_ready.orchestration')
%% Result Processing
Orchestration->>Orchestration: ProcessStepResult Command
Orchestration->>TaskDB: Update Step State
Note over Orchestration: Fallback poller discovers ready steps if events missed
%% Task Completion
Note over Orchestration: All Steps Complete
Orchestration->>Orchestration: FinalizeTask Command
Orchestration->>TaskDB: Mark Task Complete
Orchestration-->>Client: Task Completed
Event-Driven Step Discovery
sequenceDiagram
participant Worker
participant PostgreSQL
participant PgmqNotify
participant OrchestrationListener
participant StepEnqueuer
Worker->>PostgreSQL: pgmq_send_with_notify('orchestration_step_results', result)
PostgreSQL->>PostgreSQL: Atomic: pgmq.send() + pg_notify()
PostgreSQL->>PgmqNotify: NOTIFY 'pgmq_message_ready.orchestration'
PgmqNotify->>OrchestrationListener: MessageReadyEvent
OrchestrationListener->>StepEnqueuer: ProcessStepResult Command
StepEnqueuer->>PostgreSQL: Query ready steps, enqueue via pgmq_send_with_notify()
Hybrid Mode Operation
stateDiagram-v2
[*] --> EventDriven
EventDriven --> Processing : Event Received
Processing --> EventDriven : Success
Processing --> PollingFallback : Event Failed
PollingFallback --> FallbackPolling : Start Polling
FallbackPolling --> EventDriven : Connection Restored
FallbackPolling --> Processing : Poll Found Work
EventDriven --> HealthCheck : Periodic Check
HealthCheck --> EventDriven : Healthy
HealthCheck --> PollingFallback : Event Issues Detected
Queue Architecture and Message Flow
PGMQ Integration
The system uses PostgreSQL Message Queue (PGMQ) for reliable message delivery:
Queue Types and Purposes
| Queue Name | Purpose | Message Type | Processing System |
|---|---|---|---|
| orchestration_step_results | Step completion results | StepExecutionResult | Orchestration |
| orchestration_task_requests | New task requests | TaskRequestMessage | Orchestration |
| orchestration_task_finalization | Task finalization | TaskFinalizationMessage | Orchestration |
| {namespace}_queue | Namespace-specific steps | SimpleStepMessage | Workers |
Message Processing Patterns
Event-Driven Processing:
- Message arrives in PGMQ queue
- PostgreSQL triggers pg_notify with MessageReadyEvent
- Event system receives notification
- System processes message via command pattern
- Message deleted after successful processing
Polling-Based Processing (Fallback):
- Periodic queue polling (configurable interval)
- Fetch available messages in batches
- Process messages via command pattern
- Delete processed messages
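A sketch of the polling fallback is shown below, written against the MessagingService trait described later in this document; the handle callback and the msg_id field access are illustrative, and real pollers also track statistics and honor shutdown signals.
// Sketch of a fallback-polling loop: fetch a batch, process, then acknowledge.
// `handle` and the `msg_id` field are illustrative stand-ins.
use std::time::Duration;

async fn fallback_poll<S: MessagingService>(
    service: &S,
    queue: &str,
    batch_size: i32,
    poll_interval: Duration,
) -> Result<(), MessagingError> {
    let mut ticker = tokio::time::interval(poll_interval);
    loop {
        ticker.tick().await;
        // Claim a batch with a visibility timeout so concurrent pollers skip these messages
        let messages = service.receive_messages(queue, batch_size, 30).await?;
        for message in messages {
            handle(&message).await?;                           // process via the command pattern
            service.ack_message(queue, message.msg_id).await?; // delete after successful processing
        }
    }
}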
Circuit Breaker Integration
All PGMQ operations are protected by circuit breakers:
#![allow(unused)]
fn main() {
pub struct UnifiedPgmqClient {
standard_client: Box<dyn PgmqClientTrait + Send + Sync>,
protected_client: Option<ProtectedPgmqClient>,
circuit_breaker_enabled: bool,
}
}
Circuit Breaker Features:
- Automatic Protection: Failure detection and circuit opening
- Configurable Thresholds: Error rate and timeout configuration
- Seamless Fallback: Automatic switching between standard and protected clients
- Recovery Detection: Automatic circuit closing when service recovers
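The essential behavior can be pictured with a greatly simplified breaker; the sketch below is illustrative only and is not the production ProtectedPgmqClient, which uses configurable error-rate and timeout thresholds rather than a bare failure counter.
// Greatly simplified circuit-breaker sketch (not the production implementation):
// open after `threshold` consecutive failures, fail fast until a cooldown elapses.
use std::time::{Duration, Instant};

struct SimpleBreaker {
    consecutive_failures: u32,
    threshold: u32,
    opened_at: Option<Instant>,
    cooldown: Duration,
}

impl SimpleBreaker {
    fn allow(&self) -> bool {
        match self.opened_at {
            // Circuit open: block calls until the cooldown has elapsed
            Some(opened) => opened.elapsed() >= self.cooldown,
            None => true,
        }
    }

    fn record_success(&mut self) {
        // Recovery detected: close the circuit
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    fn record_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.threshold {
            // Failure threshold reached: open the circuit
            self.opened_at = Some(Instant::now());
        }
    }
}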
Statistics and Monitoring
Event System Statistics
Both orchestration and worker event systems implement comprehensive statistics:
#![allow(unused)]
fn main() {
pub trait EventSystemStatistics {
fn events_processed(&self) -> u64;
fn events_failed(&self) -> u64;
fn processing_rate(&self) -> f64; // events/second
fn average_latency_ms(&self) -> f64;
fn deployment_mode_score(&self) -> f64; // 0.0-1.0 effectiveness
fn success_rate(&self) -> f64; // derived: processed/(processed+failed)
}
}
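The derived success_rate follows directly from the two counters; a small helper written against the trait above might look like this (treating the no-traffic case as fully healthy is an assumption, not part of the trait contract).
// Sketch: deriving success_rate from the counters exposed by EventSystemStatistics.
fn derived_success_rate(stats: &dyn EventSystemStatistics) -> f64 {
    let processed = stats.events_processed() as f64;
    let failed = stats.events_failed() as f64;
    if processed + failed == 0.0 {
        1.0 // no events yet: report fully healthy (assumption)
    } else {
        processed / (processed + failed)
    }
}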
Health Monitoring
Deployment Mode Health Status
#![allow(unused)]
fn main() {
pub enum DeploymentModeHealthStatus {
Healthy, // All systems operational
Degraded { reason: String },  // Some issues but functional
Unhealthy { reason: String }, // Significant issues
Critical { reason: String }, // System failure imminent
}
}
Health Check Integration
- Event System Health: Connection status, processing latency, error rates
- Command Processor Health: Queue backlog, processing timeout detection
- Database Health: Connection pool status, query performance
- Circuit Breaker Status: Circuit state, failure rates, recovery status
Metrics Collection
Key metrics collected across the system:
Orchestration Metrics
- Task Initialization Rate: Tasks/minute initialized
- Step Enqueueing Rate: Steps/minute enqueued to worker queues
- Result Processing Rate: Results/minute processed from workers
- Task Completion Rate: Tasks/minute completed successfully
- Error Rates: Failures by operation type and cause
Worker Metrics
- Step Execution Rate: Steps/minute executed
- Handler Performance: Execution time by handler type
- Queue Processing: Messages claimed/processed by queue
- Result Submission Rate: Results/minute sent to orchestration
- FFI Integration: Event correlation and handler communication stats
Error Handling and Resilience
Error Categories
The system handles multiple error categories with appropriate strategies:
Transient Errors
- Database Connection Issues: Circuit breaker protection + retry with exponential backoff
- Queue Processing Failures: Message retry with backoff, poison message detection
- Network Interruptions: Automatic fallback to polling mode
Permanent Errors
- Invalid Message Format: Dead letter queue for manual analysis
- Handler Execution Failures: Step failure state with retry limits
- Configuration Errors: System startup prevention with clear error messages
System Errors
- Resource Exhaustion: Graceful degradation and load shedding
- Component Crashes: Automatic restart with state recovery
- Data Corruption: Transaction rollback and consistency validation
Fallback Mechanisms
Event System Fallbacks
- Event-Driven -> Polling: Automatic fallback when event connection fails
- Real-time -> Batch: Switch to batch processing during high load
- Primary -> Secondary: Database failover support for high availability
Command Processing Fallbacks
- Async -> Sync: Degraded operation for critical operations
- Distributed -> Local: Local processing when coordination fails
- Optimistic -> Pessimistic: Conservative processing during uncertainty
Configuration Management
Event System Configuration
Event systems are configured via TOML with environment overrides:
# config/tasker/base/event_systems.toml
[orchestration_event_system]
system_id = "orchestration-events"
deployment_mode = "Hybrid"
health_monitoring_enabled = true
health_check_interval = "30s"
max_concurrent_processors = 10
processing_timeout = "100ms"
# PGMQ channels are handled by listeners, not subscribed to as raw Postgres channels
supported_namespaces = ["orchestration"]

[orchestration_event_system.queue_listener]
enabled = true
batch_size = 50
poll_interval = "1s"
connection_timeout = "5s"

[orchestration_event_system.fallback_poller]
enabled = true
poll_interval = "5s"
batch_size = 20
max_retry_attempts = 3

[task_readiness]
enabled = true
polling_interval_seconds = 30
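These settings are typically deserialized into typed structs; the sketch below shows one plausible shape using serde and the toml crate, with duration values kept as plain strings (the real configuration layer may parse them into typed durations), so the struct names and field types here are assumptions.
// Plausible sketch of loading the [orchestration_event_system] settings with serde + toml.
// Field names mirror the snippet above; duration values stay as strings in this sketch.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct OrchestrationEventSystemConfig {
    system_id: String,
    deployment_mode: String,
    health_monitoring_enabled: bool,
    health_check_interval: String,
    max_concurrent_processors: u32,
    processing_timeout: String,
    supported_namespaces: Vec<String>,
    queue_listener: ListenerConfig,
    fallback_poller: PollerConfig,
}

#[derive(Debug, Deserialize)]
struct ListenerConfig {
    enabled: bool,
    batch_size: u32,
    poll_interval: String,
    connection_timeout: String,
}

#[derive(Debug, Deserialize)]
struct PollerConfig {
    enabled: bool,
    poll_interval: String,
    batch_size: u32,
    max_retry_attempts: u32,
}

fn load_event_system_config(toml_text: &str) -> Result<OrchestrationEventSystemConfig, toml::de::Error> {
    #[derive(Deserialize)]
    struct Root {
        orchestration_event_system: OrchestrationEventSystemConfig,
    }
    Ok(toml::from_str::<Root>(toml_text)?.orchestration_event_system)
}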
Runtime Configuration Changes
Certain configuration changes can be applied at runtime:
- Deployment Mode Switching: EventDrivenOnly <-> Hybrid <-> PollingOnly
- Event Integration Toggle: Enable/disable event processing
- Health Check Intervals: Adjust monitoring frequency
- Circuit Breaker Thresholds: Modify failure detection sensitivity
Integration Points
State Machine Integration
Event systems integrate tightly with the state machines documented in states-and-lifecycles.md:
- Task State Changes: Event systems react to task transitions
- Step State Changes: Step completion triggers task readiness checks
- Event Generation: State transitions generate events for system coordination
- Atomic Operations: Event processing maintains state machine consistency
Database Integration
Event systems coordinate with PostgreSQL at multiple levels:
- LISTEN/NOTIFY: Real-time notifications for database changes
- PGMQ Integration: Reliable message queues built on PostgreSQL
- Transaction Coordination: Event processing within database transactions
- SQL Functions: Database functions generate events and notifications
External System Integration
The event architecture supports integration with external systems:
- Webhook Events: HTTP callbacks for external system notifications
- Message Bus Integration: Apache Kafka, RabbitMQ, etc. for enterprise messaging
- Monitoring Integration: Prometheus, DataDog, etc. for metrics export
- API Integration: REST and GraphQL APIs for external coordination
Actor Integration
Overview
The tasker-core system implements a lightweight Actor pattern that formalizes the relationship between Commands and Lifecycle Components. This architecture provides a consistent, type-safe foundation for orchestration component management with all lifecycle operations coordinated through actors.
Status: Complete (Phases 1-7) - Production ready
For comprehensive actor documentation, see Actor-Based Architecture.
Actor Pattern Basics
The actor pattern introduces three core traits:
- OrchestrationActor: Base trait for all actors with lifecycle hooks
- Handler<M>: Message handling trait for type-safe command processing
- Message: Marker trait for command messages
#![allow(unused)]
fn main() {
// Actor definition
pub struct TaskFinalizerActor {
context: Arc<SystemContext>,
service: TaskFinalizer,
}
// Message definition
pub struct FinalizeTaskMessage {
pub task_uuid: Uuid,
}
impl Message for FinalizeTaskMessage {
type Response = FinalizationResult;
}
// Message handler
#[async_trait]
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
type Response = FinalizationResult;
async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
self.service.finalize_task(msg.task_uuid).await
.map_err(|e| e.into())
}
}
}
Integration with Command Processor
The actor pattern integrates seamlessly with the command processor through direct actor calls:
#![allow(unused)]
fn main() {
// From: tasker-orchestration/src/orchestration/command_processor.rs
async fn handle_finalize_task(&self, task_uuid: Uuid) -> TaskerResult<TaskFinalizationResult> {
// Direct actor-based task finalization
let msg = FinalizeTaskMessage { task_uuid };
let result = self.actors.task_finalizer_actor.handle(msg).await?;
Ok(TaskFinalizationResult::Success {
task_uuid: result.task_uuid,
final_status: format!("{:?}", result.action),
completion_time: Some(chrono::Utc::now()),
})
}
async fn handle_process_step_result(
&self,
step_result: StepExecutionResult,
) -> TaskerResult<StepProcessResult> {
// Direct actor-based step result processing
let msg = ProcessStepResultMessage {
result: step_result.clone(),
};
match self.actors.result_processor_actor.handle(msg).await {
Ok(()) => Ok(StepProcessResult::Success {
message: format!(
"Step {} result processed successfully",
step_result.step_uuid
),
}),
Err(e) => Ok(StepProcessResult::Error {
message: format!("Failed to process step result: {e}"),
}),
}
}
}
Event → Command → Actor Flow
The complete event-to-actor flow:
┌──────────────┐
│ PGMQ Message │ Message arrives in queue
└──────┬───────┘
│
▼
┌──────────────────┐
│ Event Listener │ EventDrivenSystem processes notification
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Command Channel │ Send command to processor via tokio::mpsc
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Command Processor│ Convert command to actor message
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Actor Registry │ Route message to appropriate actor
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Handler<M>:: │ Actor processes message
│ handle() │ Delegates to underlying service
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Response │ Return result to command processor
└──────────────────┘
ActorRegistry and Lifecycle
The ActorRegistry manages all 4 orchestration actors and integrates with the system lifecycle:
#![allow(unused)]
fn main() {
// During system startup
let context = Arc::new(SystemContext::with_pool(pool).await?);
let actors = ActorRegistry::build(context).await?; // Calls started() on all actors
// During operation
let msg = FinalizeTaskMessage { task_uuid };
let result = actors.task_finalizer_actor.handle(msg).await?;
// During shutdown
actors.shutdown().await; // Calls stopped() on all actors in reverse order
}
Current Actors:
- TaskRequestActor: Handles task initialization requests
- ResultProcessorActor: Processes step execution results
- StepEnqueuerActor: Manages batch processing of ready tasks
- TaskFinalizerActor: Handles task finalization with atomic claiming
Benefits for Event-Driven Architecture
The actor pattern enhances the event-driven architecture by providing:
- Type Safety: Compile-time verification of message contracts
- Consistency: Uniform lifecycle management across all components
- Testability: Clear message boundaries for isolated testing
- Observability: Actor-level metrics and tracing
- Evolvability: Easy to add new message handlers and actors
Implementation Status
The actor integration is complete:
- Phase 1 ✅: Actor infrastructure and test harness
  - OrchestrationActor, Handler<M>, Message traits
  - ActorRegistry structure
- Phase 2-3 ✅: All 4 primary actors implemented
  - TaskRequestActor, ResultProcessorActor
  - StepEnqueuerActor, TaskFinalizerActor
- Phase 4-6 ✅: Message hydration and module reorganization
  - Hydration layer for PGMQ messages
  - Clean module organization
- Phase 7 ✅: Service decomposition
  - Large services decomposed into focused components
  - All files <300 lines, following the single responsibility principle
- Cleanup ✅: Direct actor integration
  - Command processor calls actors directly
  - Removed intermediate wrapper layers
  - Production-ready implementation
Service Decomposition
Large services (800-900 lines) were decomposed into focused components:
TaskFinalizer (848 lines → 6 files):
- service.rs: Main TaskFinalizer (~200 lines)
- completion_handler.rs: Task completion logic
- event_publisher.rs: Lifecycle event publishing
- execution_context_provider.rs: Context fetching
- state_handlers.rs: State-specific handling
StepEnqueuerService (781 lines → 3 files):
- service.rs: Main service (~250 lines)
- batch_processor.rs: Batch processing logic
- state_handlers.rs: State-specific processing
ResultProcessor (889 lines → 4 files):
- service.rs: Main processor
- metadata_processor.rs: Metadata handling
- error_handler.rs: Error processing
- result_validator.rs: Result validation
This comprehensive event and command architecture, now enhanced with the actor pattern, provides the foundation for scalable, reliable, and maintainable workflow orchestration in the tasker-core system while maintaining the flexibility to operate in diverse deployment environments.
Idempotency and Atomicity Guarantees
Last Updated: 2025-01-19 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | States and Lifecycles | Events and Commands | Task Readiness & Execution
← Back to Documentation Hub
Overview
Tasker Core is designed for distributed orchestration with multiple orchestrator instances processing tasks concurrently. This document explains the defense-in-depth approach that ensures safe concurrent operation without race conditions, data corruption, or lost work.
The system provides idempotency and atomicity through four overlapping protection layers:
- Database Atomicity: PostgreSQL constraints, row locking, and compare-and-swap operations
- State Machine Guards: Current-state validation before all transitions
- Transaction Boundaries: All-or-nothing semantics for complex operations
- Application Logic: State-based filtering and idempotent patterns
These layers work together to ensure that operations can be safely retried, multiple orchestrators can process work concurrently, and crashes don’t leave the system in an inconsistent state.
Core Protection Mechanisms
Layer 1: Database Atomicity
PostgreSQL provides fundamental atomic guarantees through several mechanisms:
Unique Constraints
Purpose: Prevent duplicate creation of entities
Key Constraints:
- tasker.tasks.identity_hash (UNIQUE): Prevents duplicate task creation from identical requests
- tasker.task_namespaces.name (UNIQUE): Namespace name uniqueness
- tasker.named_tasks (namespace_id, name, version) (UNIQUE): Task template uniqueness
- tasker.named_steps.system_name (UNIQUE): Step handler uniqueness
Example Protection:
#![allow(unused)]
fn main() {
// Two orchestrators receive identical TaskRequestMessage
// Orchestrator A creates task first -> commits successfully
// Orchestrator B attempts to create -> unique constraint violation
// Result: Exactly one task created, error cleanly handled
}
See Task Initialization for details on how this protects task creation.
Row-Level Locking
Purpose: Prevent concurrent modifications to the same database row
Locking Patterns:
- FOR UPDATE: Exclusive lock, blocks concurrent transactions
  -- Used in: transition_task_state_atomic()
  SELECT * FROM tasker.tasks WHERE task_uuid = $1 FOR UPDATE;
  -- Blocks until transaction commits or rolls back
- FOR UPDATE SKIP LOCKED: Lock-free work distribution
  -- Used in: get_next_ready_tasks()
  SELECT * FROM tasker.tasks WHERE state = ANY($1) FOR UPDATE SKIP LOCKED LIMIT $2;
  -- Each orchestrator gets different tasks, no blocking
Example Protection:
#![allow(unused)]
fn main() {
// Scenario: Two orchestrators attempt state transition on same task
// Orchestrator A: BEGIN; SELECT FOR UPDATE; UPDATE state; COMMIT;
// Orchestrator B: BEGIN; SELECT FOR UPDATE (BLOCKS until A commits)
// UPDATE fails due to state validation
// Result: Only one transition succeeds, no race condition
}
Compare-and-Swap Semantics
Purpose: Validate expected state before making changes
Pattern: All state transitions validate current state in the same transaction as the update
-- From transition_task_state_atomic()
UPDATE tasker.tasks
SET state = $new_state, updated_at = NOW()
WHERE task_uuid = $uuid
AND state = $expected_current_state -- Critical: CAS validation
RETURNING *;
Example Protection:
#![allow(unused)]
fn main() {
// Orchestrator A and B both think task is in "Pending" state
// A transitions: WHERE state = 'Pending' -> succeeds, now "Initializing"
// B transitions: WHERE state = 'Pending' -> returns 0 rows (fails gracefully)
// Result: Atomic transition, no invalid state
}
See SQL Function Architecture for more on database-level guarantees.
Layer 2: State Machine Guards
Purpose: Enforce valid state transitions through application-level validation
Both task and step state machines validate current state before allowing transitions. This provides protection even when database constraints alone wouldn’t catch invalid operations.
Task State Machine
Defined in tasker-shared/src/state_machine/task_state_machine.rs, the TaskStateMachine validates:
- Current state retrieval: Always fetch latest state from database
- Event applicability: Check if event is valid for current state
- Terminal state protection: Cannot transition from Complete/Error/Cancelled
- Ownership tracking: Processor UUID tracked for audit (not enforced after ownership removal)
Example Protection:
#![allow(unused)]
fn main() {
// TaskStateMachine prevents invalid transitions
let mut state_machine = TaskStateMachine::new(task, context);
// Attempt to mark complete when still processing
let result = state_machine.transition(TaskEvent::MarkComplete).await;
// Result: Error - cannot mark complete while steps are in progress
// Current state validation prevents:
// - Completing tasks with pending steps
// - Re-initializing completed tasks
// - Transitioning from terminal states
}
See States and Lifecycles for complete state machine documentation.
Workflow Step State Machine
Defined in tasker-shared/src/state_machine/step_state_machine.rs, the StepStateMachine ensures:
- Execution claiming: Only Pending/Enqueued steps can transition to InProgress
- Completion validation: Only InProgress steps can be marked complete
- Retry eligibility: Validates max_attempts and backoff timing
Example Protection:
#![allow(unused)]
fn main() {
// Worker attempts to claim already-processing step
let mut step_machine = StepStateMachine::new(step.into(), context);
match step_machine.current_state().await {
WorkflowStepState::InProgress => {
// Already being processed by another worker
return Ok(false); // Cannot claim
}
WorkflowStepState::Pending | WorkflowStepState::Enqueued => {
// Attempt atomic transition
step_machine.transition(StepEvent::Start).await?;
}
}
}
This prevents:
- Multiple workers executing the same step concurrently
- Marking steps complete that weren’t started
- Retrying steps that exceeded max_attempts
Layer 3: Transaction Boundaries
Purpose: Ensure all-or-nothing semantics for multi-step operations
Critical operations wrap multiple database changes in a single transaction, ensuring atomic completion or full rollback on failure.
Task Initialization Transaction
Task creation involves multiple dependent entities that must all succeed or all fail:
#![allow(unused)]
fn main() {
// From TaskInitializer.initialize_task()
let mut tx = pool.begin().await?;
// 1. Create or find namespace (find-or-create is idempotent)
let namespace = NamespaceResolver::resolve_namespace(&mut tx, namespace_name).await?;
// 2. Create or find named task
let named_task = NamespaceResolver::resolve_named_task(&mut tx, namespace, task_name).await?;
// 3. Create task record
let task = create_task(&mut tx, named_task.uuid, context).await?;
// 4. Create all workflow steps and edges
let (step_count, step_mapping) = WorkflowStepBuilder::create_workflow_steps(
&mut tx, task.uuid, template
).await?;
// 5. Initialize state machine
StateInitializer::initialize_task_state(&mut tx, task.uuid).await?;
// ALL OR NOTHING: Commit entire transaction
tx.commit().await?;
}
Example Protection:
#![allow(unused)]
fn main() {
// Scenario: Task creation partially fails
// - Namespace created ✓
// - Named task created ✓
// - Task record created ✓
// - Workflow steps: Cycle detected ✗ (error thrown)
// Result: tx.rollback() -> ALL changes reverted, clean failure
}
Cycle Detection Enforcement
Workflow dependencies are validated during task initialization to prevent circular references:
#![allow(unused)]
fn main() {
// From WorkflowStepBuilder::create_step_dependencies()
for dependency in &step_definition.dependencies {
let from_uuid = step_mapping[dependency];
let to_uuid = step_mapping[&step_definition.name];
// Check for self-reference
if from_uuid == to_uuid {
return Err(CycleDetected { from, to });
}
// Check for path that would create cycle
if WorkflowStepEdge::would_create_cycle(pool, from_uuid, to_uuid).await? {
return Err(CycleDetected { from, to });
}
// Safe to create edge
WorkflowStepEdge::create_with_transaction(&mut tx, edge).await?;
}
}
This prevents invalid DAG structures from ever being persisted to the database.
Layer 4: Application Logic Patterns
Purpose: Implement idempotent patterns at the application level
Beyond database and state machine protections, application code uses several patterns to ensure safe retry and concurrent operation.
Find-or-Create Pattern
Used for entities that should be unique but may be created by multiple orchestrators:
#![allow(unused)]
fn main() {
// From NamespaceResolver
pub async fn resolve_namespace(
tx: &mut Transaction<'_, Postgres>,
name: &str,
) -> Result<TaskNamespace> {
// Try to find existing
if let Some(namespace) = TaskNamespace::find_by_name(pool, name).await? {
return Ok(namespace);
}
// Create if not found
match TaskNamespace::create_with_transaction(tx, NewTaskNamespace { name }).await {
Ok(namespace) => Ok(namespace),
Err(sqlx::Error::Database(e)) if is_unique_violation(&e) => {
// Another orchestrator created it between our find and create
// Re-query to get the one that won the race
TaskNamespace::find_by_name(pool, name).await?
.ok_or(Error::NotFound)
}
Err(e) => Err(e),
}
}
}
Why This Works:
- First attempt: Finds existing → idempotent
- Create attempt: Unique constraint prevents duplicates
- Retry after unique violation: Gets the winner → idempotent
- Result: Exactly one namespace, regardless of concurrent attempts
State-Based Filtering
Operations filter by state to naturally deduplicate work:
#![allow(unused)]
fn main() {
// From StepEnqueuerService
// Only enqueue steps in specific states
let ready_steps = steps.iter()
.filter(|step| matches!(
step.state,
WorkflowStepState::Pending | WorkflowStepState::WaitingForRetry
))
.collect();
// Skip steps already:
// - Enqueued (another orchestrator handled it)
// - InProgress (worker is executing)
// - Complete (already done)
// - Error (terminal state)
}
Example Protection:
#![allow(unused)]
fn main() {
// Scenario: Orchestrator crash mid-batch
// Before crash: Enqueued steps 1-5 of 10
// After restart: Process task again
// State filtering:
// - Steps 1-5: state = Enqueued → skip
// - Steps 6-10: state = Pending → enqueue
// Result: Each step enqueued exactly once
}
State-Before-Queue Pattern
Ensures workers only see steps in correct state:
#![allow(unused)]
fn main() {
// 1. Commit state transition to database FIRST
step_state_machine.transition(StepEvent::Enqueue).await?;
// Step now in Enqueued state in database
// 2. THEN send PGMQ notification
pgmq_client.send_with_notify(queue_name, step_message).await?;
// Worker receives notification and:
// - Queries database for step
// - Sees state = Enqueued (committed)
// - Can safely claim and execute
}
Why Order Matters:
#![allow(unused)]
fn main() {
// Wrong order (queue-before-state):
// 1. Send PGMQ message
// 2. Worker receives immediately
// 3. Worker queries database → state still Pending
// 4. Worker might skip or fail to claim
// 5. State transition commits
// Correct order (state-before-queue):
// 1. State transition commits
// 2. Send PGMQ message
// 3. Worker receives
// 4. Worker queries → state correctly Enqueued
// 5. Worker can claim
}
See Events and Commands for event system details.
Component-by-Component Guarantees
Task Initialization Idempotency
Component: TaskRequestActor and TaskInitializer service
Operation: Creating a new task from a template
File: tasker-orchestration/src/orchestration/lifecycle/task_initialization/
Protection Mechanisms
- Identity Hash Unique Constraint
  // Tasks are identified by hash of (namespace, task_name, context)
  let identity_hash = calculate_identity_hash(namespace, name, context);
  NewTask {
      identity_hash, // Unique constraint prevents duplicates
      named_task_uuid,
      context,
      // ...
  }
- Transaction Atomicity
  - All entities created in a single transaction
  - Namespace, named task, task, workflow steps, edges
  - Cycle detection validates the DAG before committing
  - Any failure rolls back everything
- Find-or-Create for Shared Entities
  - Namespaces can be created by any orchestrator
  - Named tasks shared across workflow instances
  - Named steps reused across tasks
Concurrent Scenario
Two orchestrators receive identical TaskRequestMessage:
T0: Orchestrator A begins transaction
T1: Orchestrator B begins transaction
T2: A creates namespace "payments"
T3: B attempts to create namespace "payments"
T4: A creates task with identity_hash "abc123"
T5: B attempts to create task with identity_hash "abc123"
T6: A commits successfully ✓
T7: B attempts commit → unique constraint violation on identity_hash
T8: B transaction rolled back
Result:
- Exactly one task created
- No partial state in database
- Orchestrator B receives clear error
- Retry-safe: B can check if task exists and return it
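The retry-safe path in the last bullet follows the same shape as the find-or-create pattern shown earlier under Layer 4; the helper names and error variant below (Task::create, Task::find_by_identity_hash, TaskInitializationError::NotFound) are hypothetical and used only for illustration.
// Sketch of orchestrator B's retry-safe path: on an identity_hash unique-constraint
// violation, fetch the task the winning orchestrator created.
// Helper names and the NotFound variant are hypothetical.
async fn create_or_fetch_task(
    pool: &sqlx::PgPool,
    new_task: NewTask,
) -> Result<Task, TaskInitializationError> {
    match Task::create(pool, &new_task).await {
        Ok(task) => Ok(task),
        Err(e) if is_unique_violation(&e) => {
            // Another orchestrator won the race: return the existing task
            Task::find_by_identity_hash(pool, &new_task.identity_hash)
                .await?
                .ok_or(TaskInitializationError::NotFound)
        }
        Err(e) => Err(e.into()),
    }
}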
Cycle Detection
Prevents invalid workflow definitions:
#![allow(unused)]
fn main() {
// Template defines: A depends on B, B depends on C, C depends on A
// During initialization:
// - Create steps A, B, C
// - Create edge A -> B (valid)
// - Create edge B -> C (valid)
// - Attempt edge C -> A
// - would_create_cycle() returns true
// - Error: CycleDetected
// - Transaction rolled back
// Result: Invalid workflow rejected, no partial data
}
See tasker-shared/src/models/core/workflow_step_edge.rs:236-270 for cycle detection implementation.
Step Enqueueing Idempotency
Component: StepEnqueuerActor and StepEnqueuerService
Operation: Enqueueing ready workflow steps to worker queues
File: tasker-orchestration/src/orchestration/lifecycle/step_enqueuer_services/
Multi-Layer Protection
- SQL-Level Row Locking
  -- get_next_ready_tasks() uses SKIP LOCKED
  SELECT task_uuid FROM tasker.tasks
  WHERE state = ANY($states)
  FOR UPDATE SKIP LOCKED -- Prevents concurrent claiming
  LIMIT $batch_size;
  Each orchestrator gets different tasks, no overlap.
- State Machine Compare-and-Swap
  // Only transition if task in expected state
  state_machine.transition(TaskEvent::EnqueueSteps(uuids)).await?;
  // Fails if another orchestrator already transitioned
- Step State Filtering
  // Only enqueue steps in specific states
  let enqueueable = steps.filter(|s| matches!(
      s.state,
      WorkflowStepState::Pending | WorkflowStepState::WaitingForRetry
  ));
- State-Before-Queue Ordering
  // 1. Commit step state to Enqueued
  step.transition(StepEvent::Enqueue).await?;
  // 2. Send PGMQ message
  pgmq.send_with_notify(queue, message).await?;
Concurrent Scenario
Two orchestrators discover the same ready steps:
T0: Orchestrator A queries get_next_ready_tasks(batch=100)
T1: Orchestrator B queries get_next_ready_tasks(batch=100)
T2: A gets tasks [1,2,3] (locked by A's transaction)
T3: B gets tasks [4,5,6] (different rows, SKIP LOCKED)
T4: A enqueues steps for tasks 1,2,3
T5: B enqueues steps for tasks 4,5,6
T6: Both commit successfully
Result: No overlap, each task processed once
Orchestrator Crash Mid-Batch:
T0: Orchestrator A gets task 1 with steps [A, B, C, D]
T1: A enqueues steps A, B to "payments_queue"
T2: A crashes before processing steps C, D
T3: Task 1 state still EnqueuingSteps
T4: Orchestrator B picks up task 1 (A's transaction rolled back)
T5: B queries steps for task 1
T6: Steps A, B have state = Enqueued → skip
T7: Steps C, D have state = Pending → enqueue
Result: Steps A, B enqueued once, C, D recovered and enqueued
Result Processing Idempotency
Component: ResultProcessorActor and OrchestrationResultProcessor
Operation: Processing step execution results from workers
File: tasker-orchestration/src/orchestration/lifecycle/result_processing/
Protection Mechanisms
- State Guard Validation
  // TaskCoordinator validates step state before processing result
  let current_state = step_state_machine.current_state().await?;
  match current_state {
      WorkflowStepState::InProgress => {
          // Valid: step is being processed
          step_state_machine.transition(StepEvent::Complete).await?;
      }
      WorkflowStepState::Complete => {
          // Idempotent: already processed this result
          return Ok(AlreadyComplete);
      }
      _ => {
          // Invalid state for result processing
          return Err(InvalidState);
      }
  }
- Atomic State Transitions
  - Step result processing uses compare-and-swap
  - Task state transitions validate current state
  - All updates in same transaction as state check
- Ownership Removed
  - Processor UUID tracked for audit only
  - Not enforced for transitions
  - Any orchestrator can process results
  - Enables recovery after crashes
Concurrent Scenario
Worker submits result, orchestrator crashes, retry arrives:
T0: Worker completes step A, submits result to orchestration_step_results queue
T1: Orchestrator A pulls message, begins processing
T2: A transitions step A to Complete
T3: A begins task state evaluation
T4: A crashes before deleting PGMQ message
T5: PGMQ visibility timeout expires → message reappears
T6: Orchestrator B pulls same message
T7: B queries step A state → Complete
T8: B returns early (idempotent, already processed)
T9: B deletes PGMQ message
Result: Step processed exactly once, retry is harmless
Before Ownership Removal (Ownership Enforced):
// Orchestrator A owned task in EvaluatingResults state
// A crashes
// B receives retry
// B checks: task.processor_uuid != B.uuid
// Error: Ownership violation → TASK STUCK
After Ownership Removal (Ownership Audit-Only):
// Orchestrator A owned task in EvaluatingResults state
// A crashes
// B receives retry
// B checks: current task state (no ownership check)
// B processes successfully → TASK RECOVERS
See the Ownership Removal ADR for full analysis.
Task Finalization Idempotency
Component: TaskFinalizerActor and TaskFinalizer service
Operation: Finalizing task to terminal state
File: tasker-orchestration/src/orchestration/lifecycle/task_finalization/
Current Protection (Sufficient for Recovery)
- State Guard Protection
  // TaskFinalizer checks current task state
  let context = ExecutionContextProvider::fetch(task_uuid).await?;
  match context.should_finalize() {
      true => {
          // Transition to Complete
          task_state_machine.transition(TaskEvent::MarkComplete).await?;
      }
      false => {
          // Not ready to finalize (steps still pending)
          return Ok(NotReady);
      }
  }
- Idempotent for Recovery
  // Scenario: Orchestrator crashes during finalization
  // - Task state already Complete → state guard returns early
  // - Task state still StepsInProcess → retry succeeds
  // Result: Recovery works, final state reached
Concurrent Scenario (Not Graceful)
Two orchestrators attempt finalization simultaneously:
T0: Orchestrators A and B both receive finalization trigger
T1: A checks: all steps complete → proceed
T2: B checks: all steps complete → proceed
T3: A transitions task to Complete (succeeds)
T4: B attempts transition to Complete
T5: State guard: task already Complete
T6: B receives StateMachineError (invalid transition)
Result:
- ✓ Task finalized exactly once (correct)
- ✓ No data corruption
- ⚠️ Orchestrator B gets error (not graceful)
Future Enhancement: Atomic Finalization Claiming
Atomic claiming would make concurrent finalization graceful:
-- Proposed claim_task_for_finalization() function
UPDATE tasker.tasks
SET finalization_claimed_at = NOW(),
finalization_claimed_by = $processor_uuid
WHERE task_uuid = $uuid
AND state = 'StepsInProcess'
AND finalization_claimed_at IS NULL
RETURNING *;
With atomic finalization claiming:
T0: Orchestrators A and B both receive finalization trigger
T1: A calls claim_task_for_finalization() → succeeds
T2: B calls claim_task_for_finalization() → returns 0 rows
T3: A proceeds with finalization
T4: B returns early (silent success, already claimed)
This enhancement is deferred (implementation not yet scheduled).
SQL Function Atomicity
File: tasker-shared/src/database/sql/
Documented: Task Readiness & Execution
Atomic State Transitions
Function: transition_task_state_atomic()
Protection: Compare-and-swap with row locking
-- Atomic state transition with validation
UPDATE tasker.tasks
SET state = $new_state,
updated_at = NOW()
WHERE task_uuid = $uuid
AND state = $expected_current_state -- CAS: only if state matches
FOR UPDATE; -- Lock prevents concurrent modifications
Key Guarantees:
- Returns 0 rows if state doesn’t match → safe retry
- Row lock prevents concurrent transitions
- Processor UUID tracked for audit, not enforced
Work Distribution Without Contention
Function: get_next_ready_tasks()
Protection: Lock-free claiming via SKIP LOCKED
SELECT task_uuid, correlation_id, state
FROM tasker.tasks
WHERE state = ANY($processable_states)
AND (
state NOT IN ('WaitingForRetry') OR
last_retry_at + retry_interval < NOW()
)
ORDER BY
CASE state
WHEN 'Pending' THEN 1
WHEN 'WaitingForRetry' THEN 2
ELSE 3
END,
created_at ASC
FOR UPDATE SKIP LOCKED -- Skip locked rows, no blocking
LIMIT $batch_size;
Key Guarantees:
- Each orchestrator gets different tasks
- No blocking or contention
- Dynamic priority (Pending before WaitingForRetry)
- Prevents task starvation
Step Readiness with Dependency Validation
Function: get_step_readiness_status()
Protection: Validates dependencies in single query
WITH step_dependencies AS (
SELECT COUNT(*) as total_deps,
SUM(CASE WHEN dep_step.state = 'Complete' THEN 1 ELSE 0 END) as completed_deps
FROM tasker.workflow_step_edges e
JOIN tasker.workflow_steps dep_step ON e.from_step_uuid = dep_step.uuid
WHERE e.to_step_uuid = $step_uuid
)
SELECT
CASE
WHEN total_deps = completed_deps THEN 'Ready'
WHEN step.state = 'Error' AND step.attempts < step.max_attempts THEN 'WaitingForRetry'
ELSE 'Blocked'
END as readiness
FROM step_dependencies, tasker.workflow_steps step
WHERE step.uuid = $step_uuid;
Key Guarantees:
- Atomic dependency check
- Handles retry logic with backoff
- Prevents premature execution
Cycle Detection
Function: WorkflowStepEdge::would_create_cycle() (Rust, uses SQL)
Protection: Recursive CTE path traversal
WITH RECURSIVE step_path AS (
-- Base: Start from proposed destination
SELECT from_step_uuid, to_step_uuid, 1 as depth
FROM tasker.workflow_step_edges
WHERE from_step_uuid = $proposed_to
UNION ALL
-- Recursive: Follow edges
SELECT sp.from_step_uuid, wse.to_step_uuid, sp.depth + 1
FROM step_path sp
JOIN tasker.workflow_step_edges wse ON sp.to_step_uuid = wse.from_step_uuid
WHERE sp.depth < 100 -- Prevent infinite recursion
)
SELECT COUNT(*) as has_path
FROM step_path
WHERE to_step_uuid = $proposed_from;
Returns: True if adding edge would create cycle
Enforcement: Called by WorkflowStepBuilder during task initialization
- Self-reference check:
from_uuid == to_uuid - Path check: Would adding edge create cycle?
- Error before commit: Transaction rolled back on cycle
See tasker-orchestration/src/orchestration/lifecycle/task_initialization/workflow_step_builder.rs for enforcement.
Cross-Cutting Scenarios
Multiple Orchestrators Processing Same Task
Scenario: Load balancer distributes work to multiple orchestrators
Protection:
- Work Distribution:
  -- Each orchestrator gets different tasks via SKIP LOCKED
  Orchestrator A: Tasks [1, 2, 3]
  Orchestrator B: Tasks [4, 5, 6]
- State Transitions:
  // Both attempt to transition the same task (shouldn't happen, but...)
  A: transition(Pending -> Initializing) → succeeds
  B: transition(Pending -> Initializing) → fails (state already changed)
- Step Enqueueing:
  // Task in EnqueuingSteps state
  A: Processes task, enqueues steps A, B
  B: Cannot claim task (state not in processable states)
  // OR if B claims during transition:
  B: Filters steps by state → A, B already Enqueued, skips them
Result: No duplicate work, clean coordination
Orchestrator Crashes and Recovers
Scenario: Orchestrator crashes mid-operation, another takes over
During Task Initialization
Before ownership removal:
T0: Orchestrator A initializes task 1
T1: Task transitions to Initializing (processor_uuid = A)
T2: A crashes
T3: Task stuck in Initializing forever (ownership blocks recovery)
After ownership removal:
T0: Orchestrator A initializes task 1
T1: Task transitions to Initializing (processor_uuid = A for audit)
T2: A crashes
T3: Orchestrator B picks up task 1
T4: B transitions Initializing -> EnqueuingSteps (succeeds, no ownership check)
T5: Task recovers automatically
During Step Enqueueing
T0: Orchestrator A enqueues steps [A, B] of task 1
T1: A crashes before committing
T2: Transaction rolls back
T3: Steps A, B remain in Pending state
T4: Orchestrator B picks up task 1
T5: B enqueues steps A, B (state still Pending)
T6: No duplicate work
During Result Processing
T0: Worker completes step A
T1: Orchestrator A receives result, transitions step to Complete
T2: A crashes before updating task state
T3: PGMQ message visibility timeout expires
T4: Orchestrator B receives same result message
T5: B queries step A → already Complete
T6: B skips processing (idempotent)
T7: B evaluates task state, continues workflow
Result: Complete recovery, no manual intervention
Retry After Transient Failure
Scenario: Database connection lost during operation
#![allow(unused)]
fn main() {
// Orchestrator attempts task initialization
let result = task_initializer.initialize(request).await;
match result {
Err(TaskInitializationError::Database(_)) => {
// Transient failure (connection lost)
// Retry same request
let retry_result = task_initializer.initialize(request).await;
// Possibilities:
// 1. Succeeds: Transaction completed before connection lost
// → identity_hash unique constraint prevents duplicate
// → Get existing task
// 2. Succeeds: Transaction rolled back
// → Create task successfully
// 3. Fails: Different error
// → Handle appropriately
}
Ok(task) => { /* Success */ }
}
}
Key Pattern: Operations are designed to be retry-safe
- Database constraints prevent duplicates
- State guards prevent invalid transitions
- Find-or-create handles concurrent creation
PGMQ Message Duplicate Delivery
Scenario: PGMQ message processed twice due to visibility timeout
#![allow(unused)]
fn main() {
// Worker completes step, sends result
pgmq.send("orchestration_step_results", result).await?;
// Orchestrator A receives message
let message = pgmq.read("orchestration_step_results").await?;
// A processes result
result_processor.process(message.payload).await?;
// A about to delete message, crashes
// Message visibility timeout expires → message reappears
// Orchestrator B receives same message
let duplicate = pgmq.read("orchestration_step_results").await?;
// B processes result
// State machine checks: step already Complete
// Returns early (idempotent)
result_processor.process(duplicate.payload).await?; // Harmless
// B deletes message
pgmq.delete(duplicate.msg_id).await?;
}
Protection:
- State guards: Check current state before processing
- Idempotent handlers: Safe to process same message multiple times
- Message deletion: Only after confirmed processing
See Events and Commands for PGMQ architecture.
Multi-Instance Validation
The defense-in-depth architecture was validated through comprehensive multi-instance cluster testing. This section documents the validation results and confirms the effectiveness of the protection mechanisms.
Test Configuration
- Orchestration Instances: 2 (ports 8080, 8081)
- Worker Instances: 2 per type (Rust: 8100-8101, Ruby: 8200-8201, Python: 8300-8301, TypeScript: 8400-8401)
- Total Services: 10 concurrent instances
- Database: Shared PostgreSQL with PGMQ messaging
Validation Results
| Metric | Result |
|---|---|
| Tests Passed | 1,645 |
| Intermittent Failures | 3 (resource contention, not race conditions) |
| Tests Skipped | 21 (domain event tests, require single-instance) |
| Race Conditions Detected | 0 |
| Data Corruption Detected | 0 |
What Was Validated
- Concurrent Task Creation
  - Tasks created through different orchestration instances
  - No duplicate tasks or UUIDs
  - All tasks complete successfully
  - State consistent across all instances
- Work Distribution
  - SKIP LOCKED distributes tasks without overlap
  - Multiple workers claim different steps
  - No duplicate step processing
- State Machine Guards
  - Invalid transitions rejected at the state machine layer
  - Compare-and-swap prevents concurrent modifications
  - Terminal states protected from re-entry
- Transaction Boundaries
  - All-or-nothing semantics maintained under load
  - No partial task initialization observed
  - Crash recovery works correctly
- Cross-Instance Consistency
  - Task state queries return the same result from any instance
  - Step state transitions visible immediately to all instances
  - No stale reads observed
Protection Layer Effectiveness
| Layer | Validation Method | Result |
|---|---|---|
| Database Atomicity | Concurrent unique constraint tests | Duplicates correctly rejected |
| State Machine Guards | Parallel transition attempts | Invalid transitions blocked |
| Transaction Boundaries | Crash injection tests | Clean rollback, no corruption |
| Application Logic | State filtering under load | Idempotent processing confirmed |
Intermittent Failures Analysis
Three tests showed intermittent failures under heavy parallelization:
- Root Cause: Database connection pool exhaustion when running 1600+ tests in parallel
- Evidence: Failures occurred only at high parallelism (>4 threads), not with serialized execution
- Classification: Resource contention, NOT race conditions
- Mitigation: Nextest configured with test-threads = 1 for multi_instance tests
Key Finding: No race conditions were detected. All intermittent failures traced to resource limits.
Domain Event Tests
21 tests were excluded from cluster mode using #[cfg(not(feature = "test-cluster"))]:
- Reason: Domain event tests verify in-process event delivery (publish/subscribe within single process)
- Behavior in Cluster: Events published in one instance aren’t delivered to subscribers in another instance
- Status: Working as designed - these tests run correctly in single-instance CI
Stress Test Results
Rapid Task Burst Test:
- 25 tasks created in <1 second
- All tasks completed successfully
- No duplicate UUIDs
- Creation rate: ~50 tasks/second sustained
Round-Robin Distribution Test:
- Tasks distributed evenly across orchestration instances
- Load balancing working correctly
- No single-instance bottleneck
Recommendations Validated
The following architectural decisions were validated by cluster testing:
- Ownership Removal: Processor UUID as audit-only (not enforced) enables automatic recovery
- SKIP LOCKED Pattern: Effective for contention-free work distribution
- State-Before-Queue Pattern: Prevents workers from seeing uncommitted state
- Find-or-Create Pattern: Handles concurrent entity creation correctly
Future Enhancements Identified
Testing identified one P2 improvement opportunity:
Atomic Finalization Claiming
- Current: Second orchestrator gets StateMachineError during concurrent finalization
- Proposed: Transaction-based locking for graceful handling
- Priority: P2 (operational improvement, correctness already ensured)
Running Cluster Validation
To reproduce the validation:
# Setup cluster environment
cargo make setup-env-cluster
# Start full cluster
cargo make cluster-start-all
# Run all tests including cluster tests
cargo make test-rust-all
# Stop cluster
cargo make cluster-stop
See Cluster Testing Guide for detailed instructions.
Design Principles
Defense in Depth
The system intentionally provides multiple overlapping protection layers rather than relying on a single mechanism. This ensures:
- Resilience: If one layer fails (e.g., application bug), others prevent corruption
- Clear Semantics: Each layer has a specific purpose and failure mode
- Ease of Reasoning: Developers can understand guarantees at each level
- Graceful Degradation: System remains safe even under partial failures
Fail-Safe Defaults
When in doubt, the system errs on the side of caution:
- State transitions fail if current state doesn’t match → prevents invalid states
- Unique constraints fail creation → prevents duplicates
- Row locks block concurrent access → prevents race conditions
- Cycle detection fails initialization → prevents invalid workflows
Better to fail cleanly than to corrupt data.
Retry Safety
All critical operations are designed to be safely retryable:
- Idempotent: Same operation, repeated → same outcome
- State-Based: Check current state before acting
- Atomic: All-or-nothing commits
- No Side Effects: Operations don’t accumulate partial state
This enables:
- Automatic retry after transient failures
- Duplicate message handling
- Recovery after crashes
- Horizontal scaling without coordination overhead
Audit Trail Without Enforcement
Ownership Decision: Track ownership for observability, don’t enforce for correctness
#![allow(unused)]
fn main() {
// Processor UUID recorded in all transitions
pub struct TaskTransition {
pub task_uuid: Uuid,
pub from_state: TaskState,
pub to_state: TaskState,
pub processor_uuid: Uuid, // For audit and debugging
pub event: String,
pub timestamp: DateTime<Utc>,
}
// But NOT enforced in transition logic
impl TaskStateMachine {
pub async fn transition(&mut self, event: TaskEvent) -> Result<TaskState> {
// ✅ Tracks processor UUID
// ❌ Does NOT require ownership match
// Reason: Enables recovery after crashes
}
}
}
Why This Works:
- State guards provide correctness (current state validation)
- Processor UUID provides observability (who did what when)
- No ownership blocking means automatic recovery
- Full audit trail for debugging and monitoring
Implementation Checklist
When implementing new orchestration operations, ensure:
Database Layer
- Unique constraints for entities that must be singular
- FOR UPDATE locking for state transitions
- FOR UPDATE SKIP LOCKED for work distribution
- Compare-and-swap (CAS) in UPDATE WHERE clauses
- Transaction wrapping for multi-step operations
State Machine Layer
- Current state retrieval before transitions
- Event applicability validation
- Terminal state protection
- Error handling for invalid transitions
Application Layer
- Find-or-create pattern for shared entities
- State-based filtering before processing
- State-before-queue ordering for events
- Idempotent message handlers
Testing
- Concurrent operation tests (multiple orchestrators)
- Crash recovery tests (mid-operation failures)
- Retry safety tests (duplicate message handling)
- Race condition tests (timing-dependent scenarios)
Related Documentation
Core Architecture
- States and Lifecycles - Dual state machine architecture
- Events and Commands - Event-driven coordination patterns
- Actor-Based Architecture - Orchestration actor pattern
- Task Readiness & Execution - SQL functions and execution logic
Implementation Details
- Ownership Removal ADR - Processor UUID ownership removal decision
Multi-Instance Validation
- Cluster Testing Guide - Running multi-instance cluster tests
Testing
- Comprehensive Lifecycle Testing - Testing patterns including concurrent scenarios
← Back to Documentation Hub
Messaging Abstraction Architecture
Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Events and Commands | Deployment Patterns | Crate Architecture
← Back to Documentation Hub
Overview
The provider-agnostic messaging abstraction enables Tasker Core to support multiple messaging backends through a unified interface. This architecture allows switching between PGMQ (PostgreSQL Message Queue) and RabbitMQ without changes to business logic.
Key Benefits:
- Zero handler changes required: Switching providers requires only configuration changes
- Provider-specific optimizations: Each backend can leverage its native strengths
- Testability: In-memory provider for fast unit testing
- Gradual migration: Systems can transition between providers incrementally
Core Concepts
Message Delivery Models
Different messaging providers have fundamentally different delivery models:
| Provider | Native Model | Push Support | Notification Type | Fallback Needed |
|---|---|---|---|---|
| PGMQ | Poll | Yes (pg_notify) | Signal only | Yes (catch-up) |
| RabbitMQ | Push | Yes (native) | Full message | No |
| InMemory | Push | Yes | Full message | No |
PGMQ (Signal-Only):
- pg_notify sends a signal that a message exists
- Worker must fetch the message after receiving the signal
- Fallback polling catches missed signals
RabbitMQ (Full Message Push):
- basic_consume() delivers complete messages
- No separate fetch required
- Protocol guarantees delivery
Architecture Layers
┌─────────────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ (Orchestration, Workers, Event Systems) │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ Uses MessageClient
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ MessageClient │
│ Domain-level facade with queue classification │
│ Location: tasker-shared/src/messaging/client.rs │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ Delegates to
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ MessagingProvider Enum │
│ Runtime dispatch without trait objects (zero-cost abstraction) │
│ Location: tasker-shared/src/messaging/service/provider.rs │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ PGMQ │ │ RabbitMQ │ │ InMemory │
│ Provider │ │ Provider │ │ Provider │
└───────────┘ └───────────┘ └───────────┘
Core Traits and Types
MessagingService Trait
Location: tasker-shared/src/messaging/service/traits.rs
The foundational trait defining queue operations:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait MessagingService: Send + Sync {
// Queue lifecycle
async fn create_queue(&self, name: &str) -> Result<(), MessagingError>;
async fn delete_queue(&self, name: &str) -> Result<(), MessagingError>;
async fn queue_exists(&self, name: &str) -> Result<bool, MessagingError>;
async fn list_queues(&self) -> Result<Vec<String>, MessagingError>;
// Message operations
async fn send_message(&self, queue: &str, payload: &[u8]) -> Result<i64, MessagingError>;
async fn send_message_with_delay(&self, queue: &str, payload: &[u8], delay_seconds: i64) -> Result<i64, MessagingError>;
async fn receive_messages(&self, queue: &str, limit: i32, visibility_timeout: i32) -> Result<Vec<QueuedMessage<Vec<u8>>>, MessagingError>;
async fn ack_message(&self, queue: &str, msg_id: i64) -> Result<(), MessagingError>;
async fn nack_message(&self, queue: &str, msg_id: i64) -> Result<(), MessagingError>;
// Provider information
fn provider_name(&self) -> &'static str;
}
}
SupportsPushNotifications Trait
Location: tasker-shared/src/messaging/service/traits.rs
Extends MessagingService with push notification capabilities:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait SupportsPushNotifications: MessagingService {
/// Subscribe to messages on a single queue
fn subscribe(&self, queue_name: &str)
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>;
/// Subscribe to messages on multiple queues
fn subscribe_many(&self, queue_names: &[String])
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>;
/// Whether this provider requires fallback polling for reliability
fn requires_fallback_polling(&self) -> bool;
/// Suggested polling interval if fallback is needed
fn fallback_polling_interval(&self) -> Option<Duration>;
/// Whether this provider supports fetching by message ID
fn supports_fetch_by_message_id(&self) -> bool;
}
}
MessageNotification Enum
Location: tasker-shared/src/messaging/service/traits.rs
Abstracts the two notification models:
#![allow(unused)]
fn main() {
pub enum MessageNotification {
/// Signal-only notification (PGMQ style)
/// Indicates a message is available but requires separate fetch
Available {
queue_name: String,
msg_id: Option<i64>,
},
/// Full message notification (RabbitMQ style)
/// Contains the complete message payload
Message(QueuedMessage<Vec<u8>>),
}
}
Provider Implementations
PGMQ Provider
Location: tasker-shared/src/messaging/service/providers/pgmq.rs
PostgreSQL-based message queue with LISTEN/NOTIFY integration:
#![allow(unused)]
fn main() {
impl SupportsPushNotifications for PgmqMessagingService {
fn subscribe_many(&self, queue_names: &[String])
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
{
// Uses PgmqNotifyListener for pg_notify subscription
// Returns MessageNotification::Available (signal-only) for large messages
// Returns MessageNotification::Message for small messages (<7KB)
}
fn requires_fallback_polling(&self) -> bool {
true // pg_notify can miss signals during connection issues
}
fn supports_fetch_by_message_id(&self) -> bool {
true // PGMQ supports read_specific_message()
}
}
}
Characteristics:
- Uses PostgreSQL for storage and delivery
- `pg_notify` for real-time notifications
- Fallback polling required for reliability
- Supports visibility timeout for message claiming
RabbitMQ Provider
Location: tasker-shared/src/messaging/service/providers/rabbitmq.rs
AMQP-based message broker with native push delivery:
#![allow(unused)]
fn main() {
impl SupportsPushNotifications for RabbitMqMessagingService {
fn subscribe_many(&self, queue_names: &[String])
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
{
// Uses lapin basic_consume() for native push delivery
// Always returns MessageNotification::Message (full payload)
}
fn requires_fallback_polling(&self) -> bool {
false // AMQP protocol guarantees delivery
}
fn supports_fetch_by_message_id(&self) -> bool {
false // RabbitMQ doesn't support fetch-by-ID
}
}
}
Characteristics:
- Native push delivery via AMQP protocol
- No fallback polling needed
- Higher throughput for high-volume scenarios
- Requires separate infrastructure (RabbitMQ server)
InMemory Provider
Location: tasker-shared/src/messaging/service/providers/in_memory.rs
In-process message queue for testing:
#![allow(unused)]
fn main() {
impl SupportsPushNotifications for InMemoryMessagingService {
fn requires_fallback_polling(&self) -> bool {
false // In-memory is reliable within process
}
}
}
Use Cases:
- Unit testing without external dependencies
- Integration testing with controlled timing
- Development environments
MessagingProvider Enum
Location: tasker-shared/src/messaging/service/provider.rs
Enum dispatch pattern for runtime provider selection without trait objects:
#![allow(unused)]
fn main() {
pub enum MessagingProvider {
Pgmq(PgmqMessagingService),
RabbitMq(RabbitMqMessagingService),
InMemory(InMemoryMessagingService),
}
impl MessagingProvider {
/// Delegate all MessagingService methods to the underlying provider
pub async fn send_message(&self, queue: &str, payload: &[u8]) -> Result<i64, MessagingError> {
match self {
Self::Pgmq(p) => p.send_message(queue, payload).await,
Self::RabbitMq(p) => p.send_message(queue, payload).await,
Self::InMemory(p) => p.send_message(queue, payload).await,
}
}
/// Subscribe to push notifications
pub fn subscribe_many(&self, queue_names: &[String])
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
{
match self {
Self::Pgmq(p) => p.subscribe_many(queue_names),
Self::RabbitMq(p) => p.subscribe_many(queue_names),
Self::InMemory(p) => p.subscribe_many(queue_names),
}
}
/// Check if fallback polling is required
pub fn requires_fallback_polling(&self) -> bool {
match self {
Self::Pgmq(p) => p.requires_fallback_polling(),
Self::RabbitMq(p) => p.requires_fallback_polling(),
Self::InMemory(p) => p.requires_fallback_polling(),
}
}
}
}
Benefits:
- Zero-cost abstraction (no vtable indirection)
- Exhaustive match ensures all providers handled
- Easy to add new providers
MessageClient Facade
Location: tasker-shared/src/messaging/client.rs
Domain-level facade providing high-level queue operations:
#![allow(unused)]
fn main() {
pub struct MessageClient {
provider: Arc<MessagingProvider>,
classifier: QueueClassifier,
}
impl MessageClient {
/// Send a step message to the appropriate namespace queue
pub async fn send_step_message(
&self,
namespace: &str,
step: &SimpleStepMessage,
) -> Result<i64, MessagingError> {
let queue_name = self.classifier.step_queue_for_namespace(namespace);
let payload = serde_json::to_vec(step)?;
self.provider.send_message(&queue_name, &payload).await
}
/// Send a step result to the orchestration queue
pub async fn send_step_result(
&self,
result: &StepExecutionResult,
) -> Result<i64, MessagingError> {
let queue_name = self.classifier.orchestration_results_queue();
let payload = serde_json::to_vec(result)?;
self.provider.send_message(&queue_name, &payload).await
}
/// Access the underlying provider for advanced operations
pub fn provider(&self) -> &MessagingProvider {
&self.provider
}
}
}
Event System Integration
Provider-Agnostic Queue Listeners
Both orchestration and worker queue listeners use provider.subscribe_many():
#![allow(unused)]
fn main() {
// tasker-orchestration/src/orchestration/orchestration_queues/listener.rs
impl OrchestrationQueueListener {
pub async fn start(&mut self) -> Result<(), MessagingError> {
let queues = vec![
self.classifier.orchestration_results_queue(),
self.classifier.orchestration_requests_queue(),
self.classifier.orchestration_finalization_queue(),
];
// Provider-agnostic subscription
let stream = self.provider.subscribe_many(&queues)?;
// Process notifications
while let Some(notification) = stream.next().await {
match notification {
MessageNotification::Available { queue_name, msg_id } => {
// PGMQ style: send event command to fetch message
self.send_event_command(queue_name, msg_id).await;
}
MessageNotification::Message(msg) => {
// RabbitMQ style: send message command with full payload
self.send_message_command(msg).await;
}
}
}
}
}
}
Deployment Mode Selection
Event systems select the appropriate mode based on provider capabilities:
#![allow(unused)]
fn main() {
// Determine effective deployment mode for this provider
let effective_mode = deployment_mode.effective_for_provider(provider.provider_name());
match effective_mode {
DeploymentMode::EventDrivenOnly => {
// Start queue listener only (no fallback poller)
// RabbitMQ typically uses this mode
}
DeploymentMode::Hybrid => {
// Start both listener and fallback poller
// PGMQ uses this mode for reliability
}
DeploymentMode::PollingOnly => {
// Start fallback poller only
// For restricted environments
}
}
}
Command Routing
Dual Command Variants
Command processors handle both notification types:
#![allow(unused)]
fn main() {
pub enum OrchestrationCommand {
// For full message notifications (RabbitMQ)
ProcessStepResultFromMessage {
queue_name: String,
message: QueuedMessage<Vec<u8>>,
resp: CommandResponder<StepProcessResult>,
},
// For signal-only notifications (PGMQ)
ProcessStepResultFromMessageEvent {
message_event: MessageReadyEvent,
resp: CommandResponder<StepProcessResult>,
},
}
}
Routing Logic:
- `MessageNotification::Message` -> `ProcessStepResultFromMessage`
- `MessageNotification::Available` -> `ProcessStepResultFromMessageEvent`
Type-Safe Channel Wrappers
NewType wrappers for MPSC channels prevent accidental misuse:
Orchestration Channels
Location: tasker-orchestration/src/orchestration/channels.rs
#![allow(unused)]
fn main() {
/// Strongly-typed sender for orchestration commands
#[derive(Debug, Clone)]
pub struct OrchestrationCommandSender(pub(crate) mpsc::Sender<OrchestrationCommand>);
/// Strongly-typed receiver for orchestration commands
#[derive(Debug)]
pub struct OrchestrationCommandReceiver(pub(crate) mpsc::Receiver<OrchestrationCommand>);
/// Strongly-typed sender for orchestration notifications
#[derive(Debug, Clone)]
pub struct OrchestrationNotificationSender(pub(crate) mpsc::Sender<OrchestrationNotification>);
/// Strongly-typed receiver for orchestration notifications
#[derive(Debug)]
pub struct OrchestrationNotificationReceiver(pub(crate) mpsc::Receiver<OrchestrationNotification>);
}
Worker Channels
Location: tasker-worker/src/worker/channels.rs
#![allow(unused)]
fn main() {
/// Strongly-typed sender for worker commands
#[derive(Debug, Clone)]
pub struct WorkerCommandSender(pub(crate) mpsc::Sender<WorkerCommand>);
/// Strongly-typed receiver for worker commands
#[derive(Debug)]
pub struct WorkerCommandReceiver(pub(crate) mpsc::Receiver<WorkerCommand>);
}
Channel Factory
#![allow(unused)]
fn main() {
pub struct ChannelFactory;
impl ChannelFactory {
/// Create type-safe orchestration command channel pair
pub fn orchestration_command_channel(buffer_size: usize)
-> (OrchestrationCommandSender, OrchestrationCommandReceiver)
{
let (tx, rx) = mpsc::channel(buffer_size);
(OrchestrationCommandSender(tx), OrchestrationCommandReceiver(rx))
}
}
}
Benefits:
- Compile-time prevention of channel misuse
- Self-documenting function signatures
- Zero runtime overhead (NewTypes compile away)
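The pattern is easy to see in isolation. The sketch below is self-contained and uses a stand-in `DemoCommand` type rather than the real `OrchestrationCommand`/`WorkerCommand` enums; it shows how the wrappers make cross-channel misuse a compile error while adding no runtime cost.

```rust
// Self-contained sketch of the NewType channel pattern described above.
// `DemoCommand` is a stand-in; the real wrappers carry OrchestrationCommand and
// WorkerCommand. Sending a value of the wrong command type simply does not
// type-check, and the wrappers compile away to plain mpsc halves.
use tokio::sync::mpsc;

#[derive(Debug)]
struct DemoCommand(&'static str);

// NewType wrappers around the raw mpsc halves.
struct DemoCommandSender(mpsc::Sender<DemoCommand>);
struct DemoCommandReceiver(mpsc::Receiver<DemoCommand>);

struct ChannelFactory;

impl ChannelFactory {
    /// Create a type-safe command channel pair.
    fn demo_command_channel(buffer: usize) -> (DemoCommandSender, DemoCommandReceiver) {
        let (tx, rx) = mpsc::channel(buffer);
        (DemoCommandSender(tx), DemoCommandReceiver(rx))
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = ChannelFactory::demo_command_channel(16);
    tx.0.send(DemoCommand("health_check")).await.unwrap();
    if let Some(cmd) = rx.0.recv().await {
        println!("received: {cmd:?}");
    }
}
```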
Configuration
Provider Selection
# config/dotenv/test.env
# Valid values: pgmq (default), rabbitmq
TASKER_MESSAGING_BACKEND=pgmq
# RabbitMQ connection (only used when backend=rabbitmq)
RABBITMQ_URL=amqp://tasker:tasker@localhost:5672/%2F
Provider-Specific Settings
# config/tasker/base/common.toml
[pgmq]
visibility_timeout_seconds = 60
max_message_size_bytes = 1048576
batch_size = 100
[rabbitmq]
prefetch_count = 100
connection_timeout_seconds = 30
heartbeat_seconds = 60
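Backend selection at startup reduces to reading `TASKER_MESSAGING_BACKEND` and constructing the matching provider variant. The sketch below is illustrative only: `Backend` is a stand-in enum, and the real wiring builds a `MessagingProvider` from the validated configuration.

```rust
// Illustrative sketch of backend selection from TASKER_MESSAGING_BACKEND.
// `Backend` is a stand-in enum; in Tasker the chosen value would drive which
// MessagingProvider variant gets constructed.
#[derive(Debug)]
enum Backend {
    Pgmq,
    RabbitMq,
}

fn backend_from_env() -> Result<Backend, String> {
    match std::env::var("TASKER_MESSAGING_BACKEND").as_deref() {
        Ok("rabbitmq") => Ok(Backend::RabbitMq),
        // pgmq is the documented default when the variable is unset
        Ok("pgmq") | Err(_) => Ok(Backend::Pgmq),
        Ok(other) => Err(format!("unsupported messaging backend: {other}")),
    }
}

fn main() {
    println!("selected backend: {:?}", backend_from_env());
}
```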
Migration Guide
Switching from PGMQ to RabbitMQ
- Deploy RabbitMQ infrastructure
- Update configuration:
export TASKER_MESSAGING_BACKEND=rabbitmq
export RABBITMQ_URL=amqp://user:pass@rabbitmq:5672/%2F
- Restart services
- No code changes required
Gradual Migration
For zero-downtime migration:
- Deploy new services with RabbitMQ configuration
- Gradually shift traffic to new services
- Monitor for any issues
- Decommission PGMQ-based services
Testing
Provider-Agnostic Tests
Most tests should use InMemoryMessagingService for speed:
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_step_execution() {
let provider = MessagingProvider::InMemory(InMemoryMessagingService::new());
let client = MessageClient::new(Arc::new(provider));
// Test with in-memory provider
client.send_step_message("payments", &step_msg).await.unwrap();
}
}
Provider-Specific Tests
For integration tests that need specific provider behavior:
#![allow(unused)]
fn main() {
#[tokio::test]
#[cfg(feature = "integration-tests")]
async fn test_pgmq_notifications() {
let provider = MessagingProvider::Pgmq(PgmqMessagingService::new(pool).await?);
// Test PGMQ-specific behavior
}
}
Best Practices
1. Use MessageClient for Application Code
#![allow(unused)]
fn main() {
// Good: Use domain-level facade
let client = context.message_client();
client.send_step_result(&result).await?;
// Avoid: Direct provider access unless necessary
let provider = context.messaging_provider();
provider.send_message("queue", &payload).await?;
}
2. Handle Both Notification Types
#![allow(unused)]
fn main() {
match notification {
MessageNotification::Available { queue_name, msg_id } => {
// Signal-only: need to fetch message
}
MessageNotification::Message(msg) => {
// Full message: can process immediately
}
}
}
3. Respect Provider Capabilities
#![allow(unused)]
fn main() {
if provider.supports_fetch_by_message_id() {
// Can use read_specific_message()
} else {
// Must use alternative approach
}
}
4. Configure Fallback Appropriately
#![allow(unused)]
fn main() {
if provider.requires_fallback_polling() {
// Start fallback poller for reliability
}
}
Related Documentation
- Events and Commands - Command pattern details
- Deployment Patterns - Deployment modes and configuration
- Worker Event Systems - Worker event architecture
- Crate Architecture - Workspace structure
← Back to Documentation Hub
Next: Events and Commands | Deployment Patterns
States and Lifecycles
Last Updated: 2025-10-10 Audience: All Status: Active Related Docs: Documentation Hub | Events and Commands | Task Readiness & Execution
← Back to Documentation Hub
This document provides comprehensive documentation of the state machine architecture in tasker-core, covering both task and workflow step lifecycles, their state transitions, and the underlying persistence mechanisms.
Overview
The tasker-core system implements a sophisticated dual-state-machine architecture:
- Task State Machine: Manages overall workflow orchestration with 12 comprehensive states
- Workflow Step State Machine: Manages individual step execution with 10 states including orchestration queuing
Both state machines work in coordination to provide atomic, auditable, and resilient workflow execution with proper event-driven communication between orchestration and worker systems.
Task State Machine Architecture
Task State Definitions
The task state machine implements 12 comprehensive states as defined in tasker-shared/src/state_machine/states.rs:
Initial States
- `Pending`: Created but not started (default initial state)
- `Initializing`: Discovering initial ready steps and setting up task context
Active Processing States
- `EnqueuingSteps`: Actively enqueuing ready steps to worker queues
- `StepsInProcess`: Steps are being processed by workers (orchestration monitoring)
- `EvaluatingResults`: Processing results from completed steps and determining next actions
Waiting States
- `WaitingForDependencies`: No ready steps, waiting for dependencies to be satisfied
- `WaitingForRetry`: Waiting for retry timeout before attempting failed steps again
- `BlockedByFailures`: Has failures that prevent progress (manual intervention may be needed)
Terminal States
- `Complete`: All steps completed successfully (terminal)
- `Error`: Task failed permanently (terminal)
- `Cancelled`: Task was cancelled (terminal)
- `ResolvedManually`: Manually resolved by operator (terminal)
Task State Properties
Each state has key properties that drive system behavior:
#![allow(unused)]
fn main() {
impl TaskState {
pub fn is_terminal(&self) -> bool // Cannot transition further
pub fn requires_ownership(&self) -> bool // Processor ownership required
pub fn is_active(&self) -> bool // Currently being processed
pub fn is_waiting(&self) -> bool // Waiting for external conditions
pub fn can_be_processed(&self) -> bool // Available for orchestration pickup
}
}
Ownership-Required States: Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults
Processable States: Pending, WaitingForDependencies, WaitingForRetry
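The mapping from these state groupings to the property methods can be illustrated with a stand-in enum. This is not the real `TaskState` implementation (which lives in tasker-shared/src/state_machine/states.rs); it simply encodes the groupings listed above.

```rust
// Illustrative stand-in: property methods derived from the documented state
// groupings. The real enum and methods live in tasker-shared.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum TaskState {
    Pending, Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults,
    WaitingForDependencies, WaitingForRetry, BlockedByFailures,
    Complete, Error, Cancelled, ResolvedManually,
}

impl TaskState {
    fn is_terminal(self) -> bool {
        matches!(self, Self::Complete | Self::Error | Self::Cancelled | Self::ResolvedManually)
    }
    fn requires_ownership(self) -> bool {
        matches!(self, Self::Initializing | Self::EnqueuingSteps
            | Self::StepsInProcess | Self::EvaluatingResults)
    }
    fn can_be_processed(self) -> bool {
        matches!(self, Self::Pending | Self::WaitingForDependencies | Self::WaitingForRetry)
    }
}

fn main() {
    assert!(TaskState::Complete.is_terminal());
    assert!(TaskState::EnqueuingSteps.requires_ownership());
    assert!(TaskState::WaitingForRetry.can_be_processed());
}
```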
Task Lifecycle Flow
stateDiagram-v2
[*] --> Pending
%% Initial Flow
Pending --> Initializing : Start
%% From Initializing
Initializing --> EnqueuingSteps : ReadyStepsFound(count)
Initializing --> Complete : NoStepsFound
Initializing --> WaitingForDependencies : NoDependenciesReady
%% Processing Flow
EnqueuingSteps --> StepsInProcess : StepsEnqueued(uuids)
EnqueuingSteps --> Error : EnqueueFailed(error)
StepsInProcess --> EvaluatingResults : AllStepsCompleted
StepsInProcess --> EvaluatingResults : StepCompleted(uuid)
StepsInProcess --> WaitingForRetry : StepFailed(uuid)
%% Result Evaluation
EvaluatingResults --> Complete : AllStepsSuccessful
EvaluatingResults --> EnqueuingSteps : ReadyStepsFound(count)
EvaluatingResults --> WaitingForDependencies : NoDependenciesReady
EvaluatingResults --> BlockedByFailures : PermanentFailure(error)
%% Waiting States
WaitingForDependencies --> EvaluatingResults : DependenciesReady
WaitingForRetry --> EnqueuingSteps : RetryReady
%% Problem Resolution
BlockedByFailures --> Error : GiveUp
BlockedByFailures --> ResolvedManually : ManualResolution
%% Cancellation (from any non-terminal state)
Pending --> Cancelled : Cancel
Initializing --> Cancelled : Cancel
EnqueuingSteps --> Cancelled : Cancel
StepsInProcess --> Cancelled : Cancel
EvaluatingResults --> Cancelled : Cancel
WaitingForDependencies --> Cancelled : Cancel
WaitingForRetry --> Cancelled : Cancel
BlockedByFailures --> Cancelled : Cancel
%% Legacy Support
Error --> Pending : Reset
%% Terminal States
Complete --> [*]
Error --> [*]
Cancelled --> [*]
ResolvedManually --> [*]
Task Event System
Task state transitions are driven by events defined in tasker-shared/src/state_machine/events.rs:
Lifecycle Events
- `Start`: Begin task processing
- `Cancel`: Cancel task execution
- `GiveUp`: Abandon task (BlockedByFailures -> Error)
- `ManualResolution`: Manually resolve task
Discovery Events
- `ReadyStepsFound(count)`: Ready steps discovered during initialization/evaluation
- `NoStepsFound`: No steps defined - task can complete immediately
- `NoDependenciesReady`: Dependencies not satisfied - wait required
- `DependenciesReady`: Dependencies now ready - can proceed
Processing Events
- `StepsEnqueued(Vec<Uuid>)`: Steps successfully queued for workers
- `EnqueueFailed(error)`: Failed to enqueue steps
- `StepCompleted(uuid)`: Individual step completed
- `StepFailed(uuid)`: Individual step failed
- `AllStepsCompleted`: All current batch steps finished
- `AllStepsSuccessful`: All steps completed successfully
System Events
- `PermanentFailure(error)`: Unrecoverable failure
- `RetryReady`: Retry timeout expired
- `Timeout`: Operation timeout occurred
- `ProcessorCrashed`: Processor became unavailable
Processor Ownership
The task state machine implements processor ownership for active states to prevent race conditions:
#![allow(unused)]
fn main() {
// Ownership validation in task_state_machine.rs
if target_state.requires_ownership() {
let current_processor = self.get_current_processor().await?;
TransitionGuard::check_ownership(target_state, current_processor, self.processor_uuid)?;
}
}
Ownership Rules:
- States requiring ownership: `Initializing`, `EnqueuingSteps`, `StepsInProcess`, `EvaluatingResults`
- Processor UUID stored in the `tasker.task_transitions.processor_uuid` column
- Atomic ownership claiming prevents concurrent processing
- Ownership validated on each transition attempt
Workflow Step State Machine Architecture
Step State Definitions
The workflow step state machine implements 10 states for individual step execution:
Processing Pipeline States
- `Pending`: Initial state when step is created
- `Enqueued`: Queued for processing but not yet claimed by worker
- `InProgress`: Currently being executed by a worker
- `EnqueuedForOrchestration`: Worker completed, queued for orchestration processing
- `EnqueuedAsErrorForOrchestration`: Worker failed, queued for orchestration error processing
Waiting States
- `WaitingForRetry`: Step failed with retryable error, waiting for backoff period before retry
Terminal States
- `Complete`: Step completed successfully (after orchestration processing)
- `Error`: Step failed permanently (non-retryable or max retries exceeded)
- `Cancelled`: Step was cancelled
- `ResolvedManually`: Step was manually resolved by operator
State Machine Evolution
Previously, the Error state was used for both retryable and permanent failures. The introduction of WaitingForRetry created a semantic change:
- Before: `Error` = any failure (retryable or permanent)
- After: `Error` = permanent failure only, `WaitingForRetry` = retryable failure awaiting backoff
This change required updates to:
- `get_step_readiness_status()` to recognize `WaitingForRetry` as a ready-eligible state
- `get_task_execution_context()` to properly detect blocked vs recovering tasks
- Error classification logic to distinguish permanent from retryable errors
Step State Properties
#![allow(unused)]
fn main() {
impl WorkflowStepState {
pub fn is_terminal(&self) -> bool // No further transitions
pub fn is_error(&self) -> bool // In error state (may allow retry)
pub fn is_active(&self) -> bool // Being processed by worker
pub fn is_in_processing_pipeline(&self) -> bool // In execution pipeline
pub fn is_ready_for_claiming(&self) -> bool // Available for worker claim
pub fn satisfies_dependencies(&self) -> bool // Can satisfy other step dependencies
}
}
Step Lifecycle Flow
stateDiagram-v2
[*] --> Pending
%% Main Execution Path
Pending --> Enqueued : Enqueue
Enqueued --> InProgress : Start (worker claims)
InProgress --> EnqueuedForOrchestration : EnqueueForOrchestration(success)
EnqueuedForOrchestration --> Complete : Complete(results) [orchestration]
%% Error Handling Path
InProgress --> EnqueuedAsErrorForOrchestration : EnqueueForOrchestration(error)
EnqueuedAsErrorForOrchestration --> WaitingForRetry : WaitForRetry(error) [retryable]
EnqueuedAsErrorForOrchestration --> Error : Fail(error) [permanent/max retries]
%% Retry Path
WaitingForRetry --> Pending : Retry (after backoff)
%% Legacy Direct Path (deprecated)
InProgress --> Complete : Complete(results) [direct - legacy]
InProgress --> Error : Fail(error) [direct - legacy]
%% Legacy Backward Compatibility
Pending --> InProgress : Start [legacy]
%% Direct Failure Paths (error before worker processing)
Pending --> Error : Fail(error)
Enqueued --> Error : Fail(error)
%% Cancellation Paths
Pending --> Cancelled : Cancel
Enqueued --> Cancelled : Cancel
InProgress --> Cancelled : Cancel
EnqueuedForOrchestration --> Cancelled : Cancel
EnqueuedAsErrorForOrchestration --> Cancelled : Cancel
WaitingForRetry --> Cancelled : Cancel
Error --> Cancelled : Cancel
%% Manual Resolution (from any state)
Pending --> ResolvedManually : ResolveManually
Enqueued --> ResolvedManually : ResolveManually
InProgress --> ResolvedManually : ResolveManually
EnqueuedForOrchestration --> ResolvedManually : ResolveManually
EnqueuedAsErrorForOrchestration --> ResolvedManually : ResolveManually
WaitingForRetry --> ResolvedManually : ResolveManually
Error --> ResolvedManually : ResolveManually
%% Terminal States
Complete --> [*]
Error --> [*]
Cancelled --> [*]
ResolvedManually --> [*]
Step Event System
Step transitions are driven by StepEvent types:
Processing Events
- `Enqueue`: Queue step for worker processing
- `Start`: Begin step execution (worker claims step)
- `EnqueueForOrchestration(results)`: Worker completes, queues for orchestration
- `Complete(results)`: Mark step complete (from orchestration or legacy direct)
- `Fail(error)`: Mark step as permanently failed
- `WaitForRetry(error)`: Mark step for retry after backoff
Control Events
- `Cancel`: Cancel step execution
- `ResolveManually`: Manual operator resolution
- `Retry`: Retry step from WaitingForRetry or Error state
Step Execution Flow Integration
The step state machine integrates tightly with the task state machine:
- Task Discovers Ready Steps: `TaskEvent::ReadyStepsFound(count)` -> Task moves to `EnqueuingSteps`
- Steps Get Enqueued: `StepEvent::Enqueue` -> Steps move to `Enqueued` state
- Workers Claim Steps: `StepEvent::Start` -> Steps move to `InProgress`
- Workers Complete Steps: `StepEvent::EnqueueForOrchestration(results)` -> Steps move to `EnqueuedForOrchestration`
- Orchestration Processes Results: `StepEvent::Complete(results)` -> Steps move to `Complete`
- Task Evaluates Progress: `TaskEvent::StepCompleted(uuid)` -> Task moves to `EvaluatingResults`
- Task Completes or Continues: Based on remaining steps -> Task moves to `Complete` or back to `EnqueuingSteps`
Guard Conditions and Validation
Both state machines implement comprehensive guard conditions in tasker-shared/src/state_machine/guards.rs:
Task Guards
TransitionGuard
- Validates all task state transitions
- Prevents invalid state combinations
- Enforces terminal state immutability
- Supports legacy transition compatibility
Ownership Validation
- Checks processor ownership for ownership-required states
- Prevents concurrent task processing
- Allows ownership claiming for unowned tasks
Step Guards
StepDependenciesMetGuard
- Validates all step dependencies are satisfied
- Delegates to `WorkflowStep::dependencies_met()`
- Prevents premature step execution
StepNotInProgressGuard
- Ensures step is not already being processed
- Prevents duplicate worker claims
- Validates step availability
Retry Guards
- `StepCanBeRetriedGuard`: Validates step is in Error state
- Checks retry limits and conditions
- Prevents infinite retry loops
Orchestration Guards
- `StepCanBeEnqueuedForOrchestrationGuard`: Step must be InProgress
- `StepCanBeCompletedFromOrchestrationGuard`: Step must be EnqueuedForOrchestration
- `StepCanBeFailedFromOrchestrationGuard`: Step must be EnqueuedForOrchestration
Persistence Layer Architecture
Delegation Pattern
The persistence layer in tasker-shared/src/state_machine/persistence.rs implements a delegation pattern to the model layer:
#![allow(unused)]
fn main() {
// TaskTransitionPersistence -> TaskTransition::create() & TaskTransition::get_current()
// StepTransitionPersistence -> WorkflowStepTransition::create() & WorkflowStepTransition::get_current()
}
Benefits:
- No SQL duplication between state machine and models
- Atomic transaction handling in models
- Single source of truth for database operations
- Independent testability of model methods
Transition Storage
Task Transitions (tasker.task_transitions)
CREATE TABLE tasker.task_transitions (
task_transition_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
task_uuid UUID NOT NULL,
to_state VARCHAR NOT NULL,
from_state VARCHAR,
processor_uuid UUID, -- Ownership tracking
metadata JSONB,
sort_key INTEGER NOT NULL,
most_recent BOOLEAN DEFAULT false,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Step Transitions (tasker.workflow_step_transitions)
CREATE TABLE tasker.workflow_step_transitions (
workflow_step_transition_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
workflow_step_uuid UUID NOT NULL,
to_state VARCHAR NOT NULL,
from_state VARCHAR,
metadata JSONB,
sort_key INTEGER NOT NULL,
most_recent BOOLEAN DEFAULT false,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Current State Resolution
Both transition models implement efficient current state resolution:
#![allow(unused)]
fn main() {
// O(1) current state lookup using most_recent flag
TaskTransition::get_current(pool, task_uuid) -> Option<TaskTransition>
WorkflowStepTransition::get_current(pool, step_uuid) -> Option<WorkflowStepTransition>
}
Performance Optimization:
- `most_recent = true` flag on latest transition only
- Indexed queries: `(task_uuid, most_recent) WHERE most_recent = true`
- Atomic flag updates during transition creation
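As a sketch of what the `most_recent` lookup amounts to (the production path goes through `TaskTransition::get_current`, not raw SQL in application code; sqlx with the postgres and uuid features is assumed here):

```rust
// Sketch of the O(1) current-state lookup pattern described above.
use sqlx::PgPool;
use uuid::Uuid;

async fn current_task_state(pool: &PgPool, task_uuid: Uuid) -> Result<Option<String>, sqlx::Error> {
    // The partial index on (task_uuid) WHERE most_recent = true keeps this a
    // single-row indexed lookup regardless of how long the transition history is.
    sqlx::query_scalar(
        "SELECT to_state FROM tasker.task_transitions
         WHERE task_uuid = $1 AND most_recent = true",
    )
    .bind(task_uuid)
    .fetch_optional(pool)
    .await
}
```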
Atomic Transitions with Ownership
Atomic transitions with processor ownership:
#![allow(unused)]
fn main() {
impl TaskTransitionPersistence {
pub async fn transition_with_ownership(
&self,
task_uuid: Uuid,
from_state: TaskState,
to_state: TaskState,
processor_uuid: Uuid,
metadata: Option<Value>,
pool: &PgPool,
) -> PersistenceResult<bool>
}
}
Atomicity Guarantees:
- Single database transaction for state change
- Processor UUID stored in dedicated column
- `most_recent` flag updated atomically
- Race condition prevention through database constraints
Action System
Both state machines execute actions after successful transitions:
Task Actions
- PublishTransitionEventAction: Publishes task state change events
- UpdateTaskCompletionAction: Updates task completion status
- ErrorStateCleanupAction: Performs error state cleanup
Step Actions
- PublishTransitionEventAction: Publishes step state change events
- UpdateStepResultsAction: Updates step results and execution data
- TriggerStepDiscoveryAction: Triggers task-level step discovery
- ErrorStateCleanupAction: Performs step error cleanup
Actions execute sequentially after transition persistence, ensuring consistency.
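A minimal sketch of this post-transition action pattern, with stand-in types and illustrative error handling, looks like this:

```rust
// Self-contained sketch: actions run in order only after the transition has
// been persisted. Types and error handling here are stand-ins, not the real
// action system in tasker-core.
trait PostTransitionAction {
    fn name(&self) -> &'static str;
    fn execute(&self, to_state: &str) -> Result<(), String>;
}

struct PublishTransitionEvent;

impl PostTransitionAction for PublishTransitionEvent {
    fn name(&self) -> &'static str { "publish_transition_event" }
    fn execute(&self, to_state: &str) -> Result<(), String> {
        println!("publishing event for transition to {to_state}");
        Ok(())
    }
}

fn run_actions_after_persistence(actions: &[Box<dyn PostTransitionAction>], to_state: &str) {
    // Sequential execution preserves ordering between actions.
    for action in actions {
        if let Err(e) = action.execute(to_state) {
            eprintln!("action {} failed: {e}", action.name());
        }
    }
}

fn main() {
    let actions: Vec<Box<dyn PostTransitionAction>> = vec![Box::new(PublishTransitionEvent)];
    // In the real system this runs after the transition row has been written.
    run_actions_after_persistence(&actions, "complete");
}
```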
State Machine Integration Points
Task <-> Step Coordination
- Step Discovery: Task initialization discovers ready steps
- Step Enqueueing: Task enqueues discovered steps to worker queues
- Progress Monitoring: Task monitors step completion via events
- Result Processing: Task processes step results and discovers next steps
- Completion Detection: Task completes when all steps are complete
Event-Driven Communication
- pg_notify: PostgreSQL notifications for real-time coordination
- Event Publishers: Publish state transition events to event system
- Event Subscribers: React to state changes across system boundaries
- Queue Integration: Provider-agnostic message queues (PGMQ or RabbitMQ) for worker communication
Worker Integration
- Step Claiming: Workers claim
Enqueuedsteps from queues - Progress Updates: Workers transition steps to
InProgress - Result Submission: Workers submit results via
EnqueueForOrchestration - Orchestration Processing: Orchestration processes results and completes steps
This sophisticated state machine architecture provides the foundation for reliable, auditable, and scalable workflow orchestration in the tasker-core system.
Step Result Audit System
The step result audit system provides SOC2-compliant audit trails for workflow step execution results, enabling complete attribution tracking for compliance and debugging.
Audit Table Design
The tasker.workflow_step_result_audit table stores lightweight references with attribution data:
CREATE TABLE tasker.workflow_step_result_audit (
workflow_step_result_audit_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
workflow_step_uuid UUID NOT NULL REFERENCES tasker.workflow_steps,
workflow_step_transition_uuid UUID NOT NULL REFERENCES tasker.workflow_step_transitions,
task_uuid UUID NOT NULL REFERENCES tasker.tasks,
recorded_at TIMESTAMP NOT NULL DEFAULT NOW(),
-- Attribution (NEW data not in transitions)
worker_uuid UUID,
correlation_id UUID,
-- Extracted scalars for indexing/filtering
success BOOLEAN NOT NULL,
execution_time_ms BIGINT,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE (workflow_step_uuid, workflow_step_transition_uuid)
);
Design Principles
- No Data Duplication: Full execution results already exist in `tasker.workflow_step_transitions.metadata`. The audit table stores references only.
- Attribution Capture: The audit system captures NEW attribution data:
  - `worker_uuid`: Which worker instance processed the step
  - `correlation_id`: Distributed tracing identifier for request correlation
- Indexed Scalars: Success and execution time are extracted for efficient filtering without JSON parsing.
- SQL Trigger: A database trigger (`trg_step_result_audit`) guarantees audit record creation when workers persist results, ensuring SOC2 compliance.
Attribution Flow
Attribution data flows through the system via TransitionContext:
#![allow(unused)]
fn main() {
// Worker creates attribution context
let context = TransitionContext::with_worker(
worker_uuid,
Some(correlation_id),
);
// Context is merged into transition metadata
state_machine.transition_with_context(event, Some(context)).await?;
// SQL trigger extracts attribution from metadata
// In the trigger (SQL):
// v_worker_uuid := (NEW.metadata->>'worker_uuid')::UUID;
// v_correlation_id := (NEW.metadata->>'correlation_id')::UUID;
}
Trigger Behavior
The create_step_result_audit trigger fires on transitions to:
- `enqueued_for_orchestration`: Successful step completion
- `enqueued_as_error_for_orchestration`: Failed step completion
These states represent when workers persist execution results, creating the audit trail.
Querying Audit History
Via API
GET /v1/tasks/{task_uuid}/workflow_steps/{step_uuid}/audit
Returns audit records with full transition details via JOIN, ordered by recorded_at DESC.
Via Client
#![allow(unused)]
fn main() {
let audit_history = client.get_step_audit_history(task_uuid, step_uuid).await?;
for record in audit_history {
println!("Worker: {:?}, Success: {}, Time: {:?}ms",
record.worker_uuid,
record.success,
record.execution_time_ms
);
}
}
Via Model
#![allow(unused)]
fn main() {
// Get audit history for a step with full transition details
let history = WorkflowStepResultAudit::get_audit_history(&pool, step_uuid).await?;
// Get all audit records for a task
let task_history = WorkflowStepResultAudit::get_task_audit_history(&pool, task_uuid).await?;
// Query by worker for attribution investigation
let worker_records = WorkflowStepResultAudit::get_by_worker(&pool, worker_uuid, Some(100)).await?;
// Query by correlation ID for distributed tracing
let correlated = WorkflowStepResultAudit::get_by_correlation_id(&pool, correlation_id).await?;
}
Indexes for Common Query Patterns
The audit table includes optimized indexes:
- `idx_audit_step_uuid`: Primary query - get audit history for a step
- `idx_audit_task_uuid`: Get all audit records for a task
- `idx_audit_recorded_at`: Time-range queries for SOC2 audit reports
- `idx_audit_worker_uuid`: Attribution investigation (partial index)
- `idx_audit_correlation_id`: Distributed tracing queries (partial index)
- `idx_audit_success`: Success/failure filtering
Historical Data
The migration includes a backfill for existing transitions. Historical records will have NULL attribution (worker_uuid, correlation_id) since that data wasn’t captured before the audit system was introduced.
Worker Actor-Based Architecture
Last Updated: 2025-12-04 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Events and Commands
← Back to Documentation Hub
This document provides comprehensive documentation of the worker actor-based architecture in tasker-worker, covering the lightweight Actor pattern that mirrors the orchestration architecture for step execution and worker coordination.
Overview
The tasker-worker system implements a lightweight Actor pattern that mirrors the orchestration architecture, providing:
- Actor Abstraction: Worker components encapsulated as actors with clear lifecycle hooks
- Message-Based Communication: Type-safe message handling via `Handler<M>` trait
- Central Registry: `WorkerActorRegistry` for managing all worker actors
- Service Decomposition: Focused services following single responsibility principle
- Lock-Free Statistics: AtomicU64 counters for hot-path performance
- Direct Integration: Command processor routes to actors without wrapper layers
This architecture provides consistency between orchestration and worker systems, enabling clearer code organization and improved maintainability.
Implementation Status
Complete: All phases implemented and production-ready
- Phase 1: Core abstractions (traits, registry, lifecycle management)
- Phase 2: Service decomposition from 1575 LOC command_processor.rs
- Phase 3: All 5 primary actors implemented
- Phase 4: Command processor refactored to pure routing (~200 LOC)
- Phase 5: Stateless service design eliminating lock contention
- Cleanup: Lock-free AtomicU64 statistics, shared event system
Current State: Production-ready actor-based worker with 5 actors managing all step execution operations.
Core Concepts
What is a Worker Actor?
In the tasker-worker context, a Worker Actor is an encapsulated step execution component that:
- Manages its own state: Each actor owns its dependencies and configuration
- Processes messages: Responds to typed command messages via the `Handler<M>` trait
- Has lifecycle hooks: Initialization (`started`) and cleanup (`stopped`) methods
- Is isolated: Actors communicate through message passing
- Is thread-safe: All actors are `Send + Sync + 'static`
Why Actors for Workers?
The previous architecture had a monolithic command processor:
#![allow(unused)]
fn main() {
// OLD: 1575 LOC monolithic command processor
pub struct WorkerProcessor {
// All logic mixed together
// RwLock contention on hot path
// Two-phase initialization complexity
}
}
The actor pattern provides:
#![allow(unused)]
fn main() {
// NEW: Pure routing command processor (~200 LOC)
impl ActorCommandProcessor {
async fn handle_command(&self, command: WorkerCommand) -> bool {
match command {
WorkerCommand::ExecuteStep { message, queue_name, resp } => {
let msg = ExecuteStepMessage { message, queue_name };
let result = self.actors.step_executor_actor.handle(msg).await;
let _ = resp.send(result);
true
}
// ... pure routing, no business logic
}
}
}
}
Actor vs Service
Services (underlying business logic):
- Encapsulate step execution logic
- Stateless operations on step data
- Direct method invocation
- Examples: `StepExecutorService`, `FFICompletionService`, `WorkerStatusService`
Actors (message-based coordination):
- Wrap services with message-based interface
- Manage service lifecycle
- Asynchronous message handling
- Examples: `StepExecutorActor`, `FFICompletionActor`, `WorkerStatusActor`
The relationship:
#![allow(unused)]
fn main() {
pub struct StepExecutorActor {
context: Arc<SystemContext>,
service: Arc<StepExecutorService>, // Wraps underlying service
}
#[async_trait]
impl Handler<ExecuteStepMessage> for StepExecutorActor {
async fn handle(&self, msg: ExecuteStepMessage) -> TaskerResult<bool> {
// Delegates to stateless service
self.service.execute_step(msg.message, &msg.queue_name).await
}
}
}
Worker Actor Traits
WorkerActor Trait
The base trait for all worker actors, defined in tasker-worker/src/worker/actors/traits.rs:
#![allow(unused)]
fn main() {
/// Base trait for all worker actors
///
/// Provides lifecycle management and context access for all actors in the
/// worker system. All actors must implement this trait to participate
/// in the actor registry and lifecycle management.
pub trait WorkerActor: Send + Sync + 'static {
/// Returns the unique name of this actor
fn name(&self) -> &'static str;
/// Returns a reference to the system context
fn context(&self) -> &Arc<SystemContext>;
/// Called when the actor is started
fn started(&mut self) -> TaskerResult<()> {
tracing::info!(actor = %self.name(), "Actor started");
Ok(())
}
/// Called when the actor is stopped
fn stopped(&mut self) -> TaskerResult<()> {
tracing::info!(actor = %self.name(), "Actor stopped");
Ok(())
}
}
}
Handler Trait
The message handling trait, enabling type-safe message processing:
#![allow(unused)]
fn main() {
/// Message handler trait for specific message types
#[async_trait]
pub trait Handler<M: Message>: WorkerActor {
/// Handle a message asynchronously
async fn handle(&self, msg: M) -> TaskerResult<M::Response>;
}
}
Message Trait
The marker trait for command messages:
#![allow(unused)]
fn main() {
/// Marker trait for command messages
pub trait Message: Send + 'static {
/// The response type for this message
type Response: Send;
}
}
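The following self-contained sketch shows how the three traits compose; the trait definitions are simplified (plain `Result<_, String>` instead of `TaskerResult`), and `Ping`/`DemoActor` are stand-ins rather than real Tasker types.

```rust
// Minimal sketch of the actor/message/handler composition, assuming the
// async-trait and tokio crates. The compiler ties Message::Response to the
// return type of Handler::handle, which is the type-safety claim above.
use async_trait::async_trait;
use std::sync::Arc;

struct SystemContext; // stand-in for Tasker's SystemContext

trait WorkerActor: Send + Sync + 'static {
    fn name(&self) -> &'static str;
}

trait Message: Send + 'static {
    type Response: Send;
}

#[async_trait]
trait Handler<M: Message>: WorkerActor {
    async fn handle(&self, msg: M) -> Result<M::Response, String>;
}

struct Ping;
impl Message for Ping {
    type Response = &'static str; // handle() must return exactly this type
}

struct DemoActor {
    _context: Arc<SystemContext>,
}

impl WorkerActor for DemoActor {
    fn name(&self) -> &'static str { "demo_actor" }
}

#[async_trait]
impl Handler<Ping> for DemoActor {
    async fn handle(&self, _msg: Ping) -> Result<&'static str, String> {
        Ok("pong")
    }
}

#[tokio::main]
async fn main() {
    let actor = DemoActor { _context: Arc::new(SystemContext) };
    assert_eq!(actor.handle(Ping).await.unwrap(), "pong");
    println!("{} responded", actor.name());
}
```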
WorkerActorRegistry
The central registry managing all worker actors, defined in tasker-worker/src/worker/actors/registry.rs:
Structure
#![allow(unused)]
fn main() {
/// Registry managing all worker actors
#[derive(Clone)]
pub struct WorkerActorRegistry {
/// System context shared by all actors
context: Arc<SystemContext>,
/// Worker ID for this registry
worker_id: String,
/// Step executor actor for step execution
pub step_executor_actor: Arc<StepExecutorActor>,
/// FFI completion actor for handling step completions
pub ffi_completion_actor: Arc<FFICompletionActor>,
/// Template cache actor for template management
pub template_cache_actor: Arc<TemplateCacheActor>,
/// Domain event actor for event dispatching
pub domain_event_actor: Arc<DomainEventActor>,
/// Worker status actor for health and status
pub worker_status_actor: Arc<WorkerStatusActor>,
}
}
Initialization
All dependencies required at construction time (no two-phase initialization):
#![allow(unused)]
fn main() {
impl WorkerActorRegistry {
pub async fn build(
context: Arc<SystemContext>,
worker_id: String,
task_template_manager: Arc<TaskTemplateManager>,
event_publisher: WorkerEventPublisher,
domain_event_handle: DomainEventSystemHandle,
) -> TaskerResult<Self> {
// Create actors with all dependencies upfront
let mut step_executor_actor = StepExecutorActor::new(
context.clone(),
worker_id.clone(),
task_template_manager.clone(),
event_publisher,
domain_event_handle,
);
// Call started() lifecycle hook
step_executor_actor.started()?;
// ... create other actors ...
Ok(Self {
context,
worker_id,
step_executor_actor: Arc::new(step_executor_actor),
// ...
})
}
}
}
Implemented Actors
StepExecutorActor
Handles step execution from PGMQ messages and events.
Location: tasker-worker/src/worker/actors/step_executor_actor.rs
Messages:
- `ExecuteStepMessage` - Execute step from raw data
- `ExecuteStepWithCorrelationMessage` - Execute with FFI correlation
- `ExecuteStepFromPgmqMessage` - Execute from PGMQ message
- `ExecuteStepFromEventMessage` - Execute from event notification
Delegation: Wraps StepExecutorService (stateless, no locks)
Purpose: Central coordinator for all step execution, handles claiming, handler invocation, and result construction.
FFICompletionActor
Handles step completion results from FFI handlers.
Location: tasker-worker/src/worker/actors/ffi_completion_actor.rs
Messages:
- `SendStepResultMessage` - Send result to orchestration
- `ProcessStepCompletionMessage` - Process completion with correlation
Delegation: Wraps FFICompletionService
Purpose: Forwards step execution results to orchestration queue, manages correlation for async FFI handlers.
TemplateCacheActor
Manages task template caching and refresh.
Location: tasker-worker/src/worker/actors/template_cache_actor.rs
Messages:
- `RefreshTemplateCacheMessage` - Refresh cache for namespace
Delegation: Wraps TaskTemplateManager
Purpose: Maintains handler template cache for efficient step execution.
DomainEventActor
Dispatches domain events after step completion.
Location: tasker-worker/src/worker/actors/domain_event_actor.rs
Messages:
- `DispatchDomainEventsMessage` - Dispatch events for completed step
Delegation: Wraps DomainEventSystemHandle
Purpose: Fire-and-forget domain event dispatch (never blocks step completion).
WorkerStatusActor
Provides worker health and status reporting.
Location: tasker-worker/src/worker/actors/worker_status_actor.rs
Messages:
- `GetWorkerStatusMessage` - Get current worker status
- `HealthCheckMessage` - Perform health check
- `GetEventStatusMessage` - Get event integration status
- `SetEventIntegrationMessage` - Enable/disable event integration
Features:
- Lock-free statistics via `AtomicStepExecutionStats`
- `AtomicU64` counters for `total_executed`, `total_succeeded`, `total_failed`
- Average execution time computed on read from `sum / count`
Purpose: Real-time health monitoring and statistics without lock contention.
Lock-Free Statistics
The WorkerStatusActor uses atomic counters for lock-free statistics on the hot path:
#![allow(unused)]
fn main() {
/// Lock-free step execution statistics using atomic counters
#[derive(Debug)]
pub struct AtomicStepExecutionStats {
total_executed: AtomicU64,
total_succeeded: AtomicU64,
total_failed: AtomicU64,
total_execution_time_ms: AtomicU64,
}
impl AtomicStepExecutionStats {
/// Record a successful step execution (lock-free)
#[inline]
pub fn record_success(&self, execution_time_ms: u64) {
self.total_executed.fetch_add(1, Ordering::Relaxed);
self.total_succeeded.fetch_add(1, Ordering::Relaxed);
self.total_execution_time_ms.fetch_add(execution_time_ms, Ordering::Relaxed);
}
/// Record a failed step execution (lock-free)
#[inline]
pub fn record_failure(&self) {
self.total_executed.fetch_add(1, Ordering::Relaxed);
self.total_failed.fetch_add(1, Ordering::Relaxed);
}
/// Get a snapshot of current statistics
pub fn snapshot(&self) -> StepExecutionStats {
let total_executed = self.total_executed.load(Ordering::Relaxed);
let total_time = self.total_execution_time_ms.load(Ordering::Relaxed);
let average_execution_time_ms = if total_executed > 0 {
total_time as f64 / total_executed as f64
} else {
0.0
};
StepExecutionStats {
total_executed,
total_succeeded: self.total_succeeded.load(Ordering::Relaxed),
total_failed: self.total_failed.load(Ordering::Relaxed),
average_execution_time_ms,
}
}
}
}
Benefits:
- Zero lock contention on step completion (every step calls `record_success` or `record_failure`)
- Sub-microsecond overhead per operation
- Consistent averages computed from totals
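The same technique can be demonstrated with plain std atomics: concurrent writers do `Relaxed` `fetch_add`s, and the average is derived at read time rather than stored.

```rust
// Self-contained demonstration of the lock-free counting technique used by
// AtomicStepExecutionStats, with stand-in counters and threads.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let executed = Arc::new(AtomicU64::new(0));
    let total_ms = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let executed = Arc::clone(&executed);
            let total_ms = Arc::clone(&total_ms);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    // Independent counters only need Relaxed ordering.
                    executed.fetch_add(1, Ordering::Relaxed);
                    total_ms.fetch_add(10 + i, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // Average is computed on read from the running totals.
    let count = executed.load(Ordering::Relaxed);
    let avg = total_ms.load(Ordering::Relaxed) as f64 / count as f64;
    println!("executed={count}, avg_ms={avg:.2}");
}
```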
Integration with Commands
ActorCommandProcessor
The command processor provides pure routing to actors:
#![allow(unused)]
fn main() {
impl ActorCommandProcessor {
async fn handle_command(&self, command: WorkerCommand) -> bool {
match command {
// Step Execution Commands -> StepExecutorActor
WorkerCommand::ExecuteStep { message, queue_name, resp } => {
let msg = ExecuteStepMessage { message, queue_name };
let result = self.actors.step_executor_actor.handle(msg).await;
let _ = resp.send(result);
true
}
// Completion Commands -> FFICompletionActor
WorkerCommand::SendStepResult { result, resp } => {
let msg = SendStepResultMessage { result };
let send_result = self.actors.ffi_completion_actor.handle(msg).await;
let _ = resp.send(send_result);
true
}
// Status Commands -> WorkerStatusActor
WorkerCommand::HealthCheck { resp } => {
let result = self.actors.worker_status_actor.handle(HealthCheckMessage).await;
let _ = resp.send(result);
true
}
WorkerCommand::Shutdown { resp } => {
let _ = resp.send(Ok(()));
false // Exit command loop
}
}
}
}
}
FFI Completion Flow
Domain events are dispatched after successful orchestration notification:
#![allow(unused)]
fn main() {
async fn handle_ffi_completion(&self, step_result: StepExecutionResult) {
// Record stats (lock-free)
if step_result.success {
self.actors.worker_status_actor
.record_success(step_result.metadata.execution_time_ms as f64).await;
} else {
self.actors.worker_status_actor.record_failure().await;
}
// Send to orchestration FIRST
let msg = SendStepResultMessage { result: step_result.clone() };
match self.actors.ffi_completion_actor.handle(msg).await {
Ok(()) => {
// Domain events dispatched AFTER successful orchestration notification
// Fire-and-forget - never blocks the worker
self.actors.step_executor_actor
.dispatch_domain_events(step_result.step_uuid, &step_result, None).await;
}
Err(e) => {
// Don't dispatch domain events - orchestration wasn't notified
tracing::error!("Failed to forward step completion to orchestration");
}
}
}
}
Service Decomposition
Large services were decomposed from the monolithic command processor:
StepExecutorService
services/step_execution/
├── mod.rs # Public API
├── service.rs # StepExecutorService (~250 lines)
├── step_claimer.rs # Step claiming logic
├── handler_invoker.rs # Handler invocation
└── result_builder.rs # Result construction
Key Design: Completely stateless service using &self methods. Wrapped in Arc<StepExecutorService> without any locks.
FFICompletionService
services/ffi_completion/
├── mod.rs # Public API
├── service.rs # FFICompletionService
└── result_sender.rs # Orchestration result sender
WorkerStatusService
services/worker_status/
├── mod.rs # Public API
└── service.rs # WorkerStatusService
Key Architectural Decisions
1. Stateless Services
Services use &self methods with no mutable state:
#![allow(unused)]
fn main() {
impl StepExecutorService {
pub async fn execute_step(
&self, // Immutable reference
message: PgmqMessage<SimpleStepMessage>,
queue_name: &str,
) -> TaskerResult<bool> {
// Stateless execution - no mutable state
}
}
}
Benefits:
- Zero lock contention
- Maximum concurrency per worker
- Simplified reasoning about state
2. Constructor-Based Dependency Injection
All dependencies required at construction time:
#![allow(unused)]
fn main() {
pub async fn new(
context: Arc<SystemContext>,
worker_id: String,
task_template_manager: Arc<TaskTemplateManager>,
event_publisher: WorkerEventPublisher, // Required
domain_event_handle: DomainEventSystemHandle, // Required
) -> TaskerResult<Self>
}
Benefits:
- Compiler enforces complete initialization
- No “partially initialized” states
- Clear dependency graph
3. Shared Event System
Event publisher and subscriber share the same WorkerEventSystem:
#![allow(unused)]
fn main() {
let shared_event_system = event_system
.unwrap_or_else(|| Arc::new(WorkerEventSystem::new()));
let event_publisher =
WorkerEventPublisher::with_event_system(worker_id.clone(), shared_event_system.clone());
// Enable subscriber with same shared system
processor.enable_event_subscriber(Some(shared_event_system)).await;
}
Benefits:
- FFI handlers reliably receive step execution events
- No isolated event systems causing silent failures
4. Graceful Degradation
Domain events never fail step completion:
#![allow(unused)]
fn main() {
// dispatch_domain_events returns () not TaskerResult<()>
// Errors logged but never propagated
pub async fn dispatch_domain_events(
&self,
step_uuid: Uuid,
result: &StepExecutionResult,
metadata: Option<HashMap<String, serde_json::Value>>,
) {
// Fire-and-forget with error logging
// Channel full? Log and continue
// Dispatch error? Log and continue
}
}
Comparison with Orchestration Actors
| Aspect | Orchestration | Worker |
|---|---|---|
| Actor Count | 4 actors | 5 actors |
| Registry | ActorRegistry | WorkerActorRegistry |
| Base Trait | OrchestrationActor | WorkerActor |
| Message Trait | Handler<M> | Handler<M> (same) |
| Service Design | Decomposed | Stateless |
| Statistics | N/A | Lock-free AtomicU64 |
| LOC Reduction | ~800 -> ~200 | 1575 -> ~200 |
Benefits
1. Consistency with Orchestration
Same patterns and traits as orchestration actors:
- Identical `Handler<M>` trait interface
- Similar registry lifecycle management
- Consistent message-based communication
2. Zero Lock Contention
- Stateless services eliminate RwLock on hot path
- AtomicU64 counters for statistics
- Maximum concurrent step execution
3. Type Safety
Messages and responses checked at compile time:
#![allow(unused)]
fn main() {
// Compile error if types don't match
impl Handler<ExecuteStepMessage> for StepExecutorActor {
async fn handle(&self, msg: ExecuteStepMessage) -> TaskerResult<bool> {
// Must return bool, not something else
}
}
}
4. Testability
- Clear message boundaries for mocking
- Isolated actor lifecycle for unit tests
- 119 unit tests, 73 E2E tests passing
5. Maintainability
- 1575 LOC -> ~200 LOC command processor
- Focused services (<300 lines per file)
- Clear separation of concerns
Detailed Analysis
For design rationale, see the Worker Decomposition ADR.
Summary
The worker actor-based architecture provides a consistent, type-safe foundation for step execution in tasker-worker. Key takeaways:
- Mirrors Orchestration: Same patterns as orchestration actors for consistency
- Lock-Free Performance: Stateless services and AtomicU64 counters
- Type Safety: Compile-time verification of message contracts
- Pure Routing: Command processor delegates without business logic
- Graceful Degradation: Domain events never fail step completion
- Production Ready: 119 unit tests, 73 E2E tests, full regression coverage
The architecture provides a solid foundation for high-throughput step execution while maintaining the proven reliability of the orchestration system.
← Back to Documentation Hub
Worker Event Systems Architecture
Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Worker Actors | Events and Commands | Messaging Abstraction
← Back to Documentation Hub
This document provides comprehensive documentation of the worker event system architecture in tasker-worker, covering the dual-channel event pattern, domain event publishing, and FFI integration.
Overview
The worker event system implements a dual-channel architecture for non-blocking step execution:
- WorkerEventSystem: Receives step execution events via provider-agnostic subscriptions
- HandlerDispatchService: Fire-and-forget handler invocation with bounded concurrency
- CompletionProcessorService: Routes results back to orchestration
- DomainEventSystem: Fire-and-forget domain event publishing
Messaging Backend Support: The worker event system supports multiple messaging backends (PGMQ, RabbitMQ) through a provider-agnostic abstraction. See Messaging Abstraction for details.
This architecture enables true parallel handler execution while maintaining strict ordering guarantees for domain events.
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER EVENT FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
MessagingProvider (PGMQ or RabbitMQ)
│
│ provider.subscribe_many()
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ WorkerEventSystem │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ WorkerQueueListener │ │ WorkerFallbackPoller │ │
│ │ (provider-agnostic) │ │ (PGMQ only) │ │
│ └──────────┬───────────┘ └──────────┬───────────┘ │
│ │ │ │
│ └───────────┬───────────────┘ │
│ │ │
│ ▼ │
│ MessageNotification::Message → ExecuteStepFromMessage (RabbitMQ) │
│ MessageNotification::Available → ExecuteStepFromEvent (PGMQ) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ActorCommandProcessor │
│ │ │
│ ▼ │
│ StepExecutorActor │
│ │ │
│ │ claim step, send to dispatch channel │
│ ▼ │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────┴─────────────┐
│ │
Rust Workers FFI Workers (Ruby/Python)
│ │
▼ ▼
┌───────────────────────────────┐ ┌───────────────────────────────┐
│ HandlerDispatchService │ │ FfiDispatchChannel │
│ │ │ │
│ dispatch_receiver │ │ pending_events HashMap │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ [Semaphore] N permits │ │ poll_step_events() │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ handler.call() │ │ Ruby/Python handler │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ PostHandlerCallback │ │ complete_step_event() │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ completion_sender │ │ PostHandlerCallback │
│ │ │ │ │
└───────────────┬───────────────┘ │ ▼ │
│ │ completion_sender │
│ │ │
│ └───────────────┬───────────────┘
│ │
└───────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CompletionProcessorService │
│ │ │
│ ▼ │
│ FFICompletionService │
│ │ │
│ ▼ │
│ orchestration_step_results queue │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
Orchestration
Core Components
1. WorkerEventSystem
Location: tasker-worker/src/worker/event_systems/worker_event_system.rs
Implements the EventDrivenSystem trait for worker namespace queue processing. Supports three deployment modes with provider-agnostic message handling:
| Mode | Description | PGMQ Behavior | RabbitMQ Behavior |
|---|---|---|---|
| `PollingOnly` | Traditional polling | Poll PGMQ tables | Poll via `basic_get` |
| `EventDrivenOnly` | Pure push delivery | `pg_notify` signals | `basic_consume` push |
| `Hybrid` | Event-driven + polling | `pg_notify` + fallback | Push only (no fallback) |
Provider-Specific Behavior:
- PGMQ: Uses `MessageNotification::Available` (signal-only), requires fallback polling
- RabbitMQ: Uses `MessageNotification::Message` (full payload), no fallback needed
Key Features:
- Unified configuration via `WorkerEventSystemConfig`
- Atomic statistics with `AtomicU64` counters
- Converts `WorkerNotification` to `WorkerCommand` for processing
#![allow(unused)]
fn main() {
// Worker notification to command conversion (provider-agnostic)
match notification {
// RabbitMQ style - full message delivered
WorkerNotification::Message(msg) => {
command_sender.send(WorkerCommand::ExecuteStepFromMessage {
queue_name: msg.queue_name.clone(),
message: msg,
resp: resp_tx,
}).await;
}
// PGMQ style - signal-only, requires fetch
WorkerNotification::Event(WorkerQueueEvent::StepMessage(msg_event)) => {
command_sender.send(WorkerCommand::ExecuteStepFromEvent {
message_event: msg_event,
resp: resp_tx,
}).await;
}
// ...
}
}
2. HandlerDispatchService
Location: tasker-worker/src/worker/handlers/dispatch_service.rs
Non-blocking handler dispatch with bounded parallelism.
Architecture:
dispatch_receiver → [Semaphore] → handler.call() → [callback] → completion_sender
│ │
└─→ Bounded to N concurrent └─→ Domain events
tasks
Key Design Decisions:
- Semaphore-Bounded Concurrency: Limits concurrent handlers to prevent resource exhaustion
- Permit Release Before Send: Prevents backpressure cascade
- Post-Handler Callback: Domain events fire only after result is committed
#![allow(unused)]
fn main() {
tokio::spawn(async move {
let permit = semaphore.acquire().await?;
let result = execute_with_timeout(&registry, &msg, timeout).await;
// Release permit BEFORE sending to completion channel
drop(permit);
// Send result FIRST
sender.send(result.clone()).await?;
// Callback fires AFTER result is committed
if let Some(cb) = callback {
cb.on_handler_complete(&step, &result, &worker_id).await;
}
});
}
Error Handling:
| Scenario | Behavior |
|---|---|
| Handler timeout | StepExecutionResult::failure() with error_type=handler_timeout |
| Handler panic | Caught via catch_unwind(), failure result generated |
| Handler error | Failure result with error_type=handler_error |
| Semaphore closed | Failure result with error_type=semaphore_acquisition_failed |
Handler Resolution
Before handler execution, the dispatch service resolves the handler using a resolver chain pattern:
HandlerDefinition ResolverChain Handler
│ │ │
│ callable: "process_payment" │ │
│ method: "refund" │ │
│ resolver: null │ │
│ │ │
├───────────────────────────────────►│ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ ExplicitMappingResolver (10) │ │
│ │ can_resolve? ─► YES │ │
│ │ resolve() ─────────────────────────────────►│
│ └───────────────────────────────┘ │
│ │
│ ┌───────────────────────────────┐ │
│ │ MethodDispatchWrapper │ │
│ │ (if method != "call") │◄─────────────┤
│ └───────────────────────────────┘ │
Built-in Resolvers:
| Resolver | Priority | Function |
|---|---|---|
| ExplicitMappingResolver | 10 | Hash lookup of registered handlers |
| ClassConstantResolver | 100 | Runtime class lookup (Ruby only) |
| ClassLookupResolver | 100 | Runtime class lookup (Python/TypeScript only) |
Method Dispatch: When handler.method is specified and not "call", a MethodDispatchWrapper is applied to invoke the specified method instead of the default call() method.
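As a rough illustration of the wrapper pattern, the sketch below routes the default call() through a configured method name. The Handler trait, invoke_method, and wrap_if_needed here are simplified stand-ins, not the actual dispatch-service types.

// Minimal sketch of method dispatch. `Handler`, `invoke_method`, and
// `wrap_if_needed` are illustrative only.
use std::sync::Arc;

trait Handler: Send + Sync {
    // Default entry point used when `method` is "call" or unspecified.
    fn call(&self, input: &str) -> Result<String, String>;
    // Dynamic dispatch to a named method (e.g. "refund").
    fn invoke_method(&self, method: &str, input: &str) -> Result<String, String>;
}

struct MethodDispatchWrapper {
    inner: Arc<dyn Handler>,
    method: String,
}

impl Handler for MethodDispatchWrapper {
    fn call(&self, input: &str) -> Result<String, String> {
        // Route the default call() through the configured method instead.
        self.inner.invoke_method(&self.method, input)
    }
    fn invoke_method(&self, method: &str, input: &str) -> Result<String, String> {
        self.inner.invoke_method(method, input)
    }
}

fn wrap_if_needed(handler: Arc<dyn Handler>, method: &str) -> Arc<dyn Handler> {
    if method == "call" {
        handler
    } else {
        Arc::new(MethodDispatchWrapper { inner: handler, method: method.to_string() })
    }
}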
See Handler Resolution Guide for complete documentation.
3. FfiDispatchChannel
Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs
Pull-based polling interface for FFI workers (Ruby, Python). Enables language-specific handlers without complex FFI memory management.
Flow:
Rust Ruby/Python
│ │
│ dispatch(step) │
│ ──────────────────────────────► │
│ │ pending_events.insert()
│ │
│ poll_step_events() │
│ ◄────────────────────────────── │
│ │
│ │ handler.call()
│ │
│ complete_step_event(result) │
│ ◄────────────────────────────── │
│ │
│ PostHandlerCallback │
│ completion_sender.send() │
│ │
Key Features:
- Thread-safe pending events map with lock poisoning recovery
- Configurable completion timeout (default 30s)
- Starvation detection and warnings
- Fire-and-forget callbacks via runtime_handle.spawn()
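The dispatch/poll/complete cycle above can be sketched with simplified types. The PendingEvent struct and map layout below are illustrative, not the actual FfiDispatchChannel internals, which add lock-poisoning recovery, timeouts, and starvation tracking.

// Minimal sketch of the FFI dispatch/poll/complete cycle (illustrative types).
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Instant;

#[derive(Clone)]
struct PendingEvent {
    step_uuid: String,
    payload: String,
    dispatched_at: Instant,
}

#[derive(Default, Clone)]
struct FfiDispatchSketch {
    pending: Arc<Mutex<HashMap<String, PendingEvent>>>,
}

impl FfiDispatchSketch {
    // Rust side: register a step for FFI pickup.
    fn dispatch(&self, step_uuid: &str, payload: &str) {
        let event = PendingEvent {
            step_uuid: step_uuid.to_string(),
            payload: payload.to_string(),
            dispatched_at: Instant::now(),
        };
        self.pending.lock().unwrap().insert(step_uuid.to_string(), event);
    }

    // Ruby/Python side (via FFI): pull everything currently pending.
    fn poll_step_events(&self) -> Vec<PendingEvent> {
        self.pending.lock().unwrap().values().cloned().collect()
    }

    // Ruby/Python side: report a result; the real channel then fires
    // PostHandlerCallback and forwards to the completion sender.
    fn complete_step_event(&self, step_uuid: &str, _result: &str) -> bool {
        self.pending.lock().unwrap().remove(step_uuid).is_some()
    }
}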
4. CompletionProcessorService
Location: tasker-worker/src/worker/handlers/completion_processor.rs
Receives completed step results and routes to orchestration queue via FFICompletionService.
completion_receiver → CompletionProcessorService → FFICompletionService → orchestration_step_results
Note: Currently processes completions sequentially. Parallel processing is planned as a future enhancement.
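A minimal sketch of that sequential loop is shown below, with simplified types; the forward_to_orchestration helper is illustrative and stands in for the FFICompletionService routing described above.

// Minimal sketch of a sequential completion loop (illustrative types).
use tokio::sync::mpsc;

struct StepResultSketch {
    step_uuid: String,
    success: bool,
}

async fn completion_loop(mut completion_receiver: mpsc::Receiver<StepResultSketch>) {
    // Results are processed one at a time today; parallelism is a planned enhancement.
    while let Some(result) = completion_receiver.recv().await {
        forward_to_orchestration(result).await;
    }
}

async fn forward_to_orchestration(result: StepResultSketch) {
    // In the real service this goes through FFICompletionService and lands
    // on the orchestration_step_results queue.
    println!("forwarding result for step {} (success = {})", result.step_uuid, result.success);
}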
5. DomainEventSystem
Location: tasker-worker/src/worker/event_systems/domain_event_system.rs
Async system for fire-and-forget domain event publishing.
Architecture:
command_processor.rs DomainEventSystem
│ │
│ try_send(command) │ spawn process_loop()
▼ ▼
mpsc::Sender<DomainEventCommand> → mpsc::Receiver
│
▼
EventRouter → PGMQ / InProcess
Key Design:
- try_send() never blocks: if the channel is full, events are dropped with metrics
- Background task processes commands asynchronously
- Graceful shutdown drains fast events up to configurable timeout
- Three delivery modes: Durable (PGMQ), Fast (in-process), Broadcast
Shared Event Abstractions
EventDrivenSystem Trait
Location: tasker-shared/src/event_system/event_driven.rs
Unified trait for all event-driven systems:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait EventDrivenSystem: Send + Sync {
type SystemId: Send + Sync + Clone;
type Event: Send + Sync + Clone;
type Config: Send + Sync + Clone;
type Statistics: EventSystemStatistics;
fn system_id(&self) -> Self::SystemId;
fn deployment_mode(&self) -> DeploymentMode;
fn is_running(&self) -> bool;
async fn start(&mut self) -> Result<(), DeploymentModeError>;
async fn stop(&mut self) -> Result<(), DeploymentModeError>;
async fn process_event(&self, event: Self::Event) -> Result<(), DeploymentModeError>;
async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError>;
fn statistics(&self) -> Self::Statistics;
fn config(&self) -> &Self::Config;
}
}
Deployment Modes
Location: tasker-shared/src/event_system/deployment.rs
#![allow(unused)]
fn main() {
pub enum DeploymentMode {
PollingOnly, // Traditional polling, no events
EventDrivenOnly, // Pure event-driven, no polling
Hybrid, // Event-driven with polling fallback
}
}
PostHandlerCallback Trait
Location: tasker-worker/src/worker/handlers/dispatch_service.rs
Extensibility point for post-handler actions:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait PostHandlerCallback: Send + Sync + 'static {
/// Called after a handler completes
async fn on_handler_complete(
&self,
step: &TaskSequenceStep,
result: &StepExecutionResult,
worker_id: &str,
);
/// Name of this callback for logging purposes
fn name(&self) -> &str;
}
}
Implementations:
- NoOpCallback: Default no-operation callback
- DomainEventCallback: Publishes domain events to the DomainEventSystem
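For example, a custom callback might look like the sketch below. The trait is the one shown above; the imports of the trait and of TaskSequenceStep/StepExecutionResult from the worker crates are omitted, and their exact paths are assumptions.

// Sketch of a custom PostHandlerCallback that only logs completions.
// Assumed imports: PostHandlerCallback, TaskSequenceStep, StepExecutionResult.
use async_trait::async_trait;

struct LoggingCallback;

#[async_trait]
impl PostHandlerCallback for LoggingCallback {
    async fn on_handler_complete(
        &self,
        step: &TaskSequenceStep,
        result: &StepExecutionResult,
        worker_id: &str,
    ) {
        // Side effects here run after the result is already committed and
        // must never fail step completion (fire-and-forget).
        tracing::info!(worker_id = worker_id, callback = self.name(), "handler completed");
        let _ = (step, result);
    }

    fn name(&self) -> &str {
        "logging_callback"
    }
}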
Configuration
Worker Event System
# config/tasker/base/event_systems.toml
[event_systems.worker]
system_id = "worker-event-system"
deployment_mode = "Hybrid"
[event_systems.worker.metadata.listener]
retry_interval_seconds = 5
max_retry_attempts = 3
event_timeout_seconds = 60
batch_processing = true
connection_timeout_seconds = 30
[event_systems.worker.metadata.fallback_poller]
enabled = true
polling_interval_ms = 100
batch_size = 10
age_threshold_seconds = 30
max_age_hours = 24
visibility_timeout_seconds = 60
Handler Dispatch
# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000
[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000
completion_timeout_ms = 30000
starvation_warning_threshold_ms = 10000
callback_timeout_ms = 5000
completion_send_timeout_ms = 10000
Integration with Worker Actors
The event systems integrate with the worker actor architecture:
WorkerEventSystem
│
▼
ActorCommandProcessor
│
├──► StepExecutorActor ──► dispatch_sender
│
├──► FFICompletionActor ◄── completion_receiver
│
└──► DomainEventActor ◄── PostHandlerCallback
See Worker Actors Documentation for actor details.
Event Flow Guarantees
Ordering Guarantee
Domain events fire AFTER result is committed to completion channel:
handler.call()
→ result committed to completion_sender
→ PostHandlerCallback.on_handler_complete()
→ domain events dispatched
This eliminates race conditions where downstream systems see events before orchestration processes results.
Idempotency Guarantee
State machine guards prevent duplicate execution:
- Step claimed atomically via transition_step_state_atomic()
- State guards reject duplicate claims
- Results are deduplicated by completion channel
Fire-and-Forget Guarantee
Domain event failures never fail step completion:
#![allow(unused)]
fn main() {
// DomainEventCallback
pub async fn on_handler_complete(&self, step: &TaskSequenceStep, result: &StepExecutionResult, worker_id: &str) {
// dispatch_events uses try_send() - never blocks
// If channel full, events dropped with metrics
// Step completion is NOT affected
self.handle.dispatch_events(events, publisher_name, correlation_id);
}
}
Monitoring
Key Metrics
| Metric | Description |
|---|---|
| tasker.worker.events_processed | Total events processed |
| tasker.worker.events_failed | Events that failed processing |
| tasker.ffi.pending_events | Pending FFI events (starvation indicator) |
| tasker.ffi.oldest_event_age_ms | Age of oldest pending event |
| tasker.channel.completion.saturation | Completion channel utilization |
| tasker.domain_events.dispatched | Domain events dispatched |
| tasker.domain_events.dropped | Domain events dropped (backpressure) |
Health Checks
#![allow(unused)]
fn main() {
async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError> {
if self.is_running.load(Ordering::Acquire) {
Ok(DeploymentModeHealthStatus::Healthy)
} else {
Ok(DeploymentModeHealthStatus::Critical)
}
}
}
Backpressure Handling
The worker event system implements multiple backpressure mechanisms to ensure graceful degradation under load while preserving step idempotency.
Backpressure Points
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER BACKPRESSURE FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
[1] Step Claiming
│
├── Planned: Capacity check before claiming
│ └── If at capacity: Leave message in queue (visibility timeout)
│
▼
[2] Handler Dispatch Channel (Bounded)
│
├── dispatch_buffer_size = 1000
│ └── If full: Sender blocks until space available
│
▼
[3] Semaphore-Bounded Execution
│
├── max_concurrent_handlers = 10
│ └── If permits exhausted: Task waits for permit
│
├── CRITICAL: Permit released BEFORE sending to completion channel
│ └── Prevents backpressure cascade
│
▼
[4] Completion Channel (Bounded)
│
├── completion_buffer_size = 1000
│ └── If full: Handler task blocks until space available
│
▼
[5] Domain Events (Fire-and-Forget)
│
└── try_send() semantics
└── If channel full: Events DROPPED (step execution unaffected)
Handler Dispatch Backpressure
The HandlerDispatchService uses semaphore-bounded parallelism:
#![allow(unused)]
fn main() {
// Permit acquisition blocks if all permits in use
let permit = semaphore.acquire().await?;
let result = execute_with_timeout(&registry, &msg, timeout).await;
// CRITICAL: Release permit BEFORE sending to completion channel
// This prevents backpressure cascade where full completion channel
// holds permits, starving new handler execution
drop(permit);
// Now send to completion channel (may block if full)
sender.send(result).await?;
}
Why permit release before send matters:
- If completion channel is full, handler task blocks on send
- If permit is held during block, no new handlers can start
- By releasing permit first, new handlers can start even if completions are backing up
FFI Dispatch Backpressure
The FfiDispatchChannel handles backpressure for Ruby/Python workers:
| Scenario | Behavior |
|---|---|
| Dispatch channel full | Sender blocks |
| FFI polling too slow | Starvation warning logged |
| Completion timeout | Failure result generated |
| Callback timeout | Callback fire-and-forget, logged |
Starvation Detection:
[worker.mpsc_channels.ffi_dispatch]
starvation_warning_threshold_ms = 10000 # Warn if event waits > 10s
Domain Event Drop Semantics
Domain events use try_send() and are explicitly designed to be droppable:
#![allow(unused)]
fn main() {
// Domain events fire AFTER result is committed
// They are non-critical and use fire-and-forget semantics
match event_sender.try_send(event) {
Ok(()) => { /* Event dispatched */ }
Err(TrySendError::Full(_)) => {
// Event dropped - step execution NOT affected
warn!("Domain event dropped: channel full");
metrics.increment("domain_events_dropped");
}
}
}
Why this is safe: Domain events are informational. Dropping them does not affect step execution correctness. The step result is already committed to the completion channel before domain events fire.
Step Claiming Backpressure (Planned)
Future enhancement: Workers will check capacity before claiming steps:
#![allow(unused)]
fn main() {
// Planned implementation
fn should_claim_step(&self) -> bool {
let available = self.semaphore.available_permits();
let threshold = self.config.claim_capacity_threshold; // e.g., 0.8
let max = self.config.max_concurrent_handlers;
available as f64 / max as f64 > (1.0 - threshold)
}
}
If at capacity:
- Worker does NOT acknowledge the PGMQ message
- Message returns to queue after visibility timeout
- Another worker (or same worker later) claims it
Idempotency Under Backpressure
All backpressure mechanisms preserve step idempotency:
| Backpressure Point | Idempotency Guarantee |
|---|---|
| Claim refusal | Message stays in queue, visibility timeout protects |
| Dispatch channel full | Step claimed but queued for execution |
| Semaphore wait | Step claimed, waiting for permit |
| Completion channel full | Handler completed, result buffered |
| Domain event drop | Non-critical, step result already persisted |
Critical Rule: A claimed step MUST produce a result (success or failure). Backpressure may delay but never drop step execution.
For comprehensive backpressure strategy, see Backpressure Architecture.
Best Practices
1. Choose Deployment Mode
- Production: Use Hybrid for reliability with event-driven performance
- Development: Use EventDrivenOnly for fastest iteration
- Restricted environments: Use PollingOnly when pg_notify is unavailable
2. Tune Concurrency
[worker.mpsc_channels.handler_dispatch]
max_concurrent_handlers = 10 # Start here, increase based on monitoring
Monitor:
- Semaphore wait times
- Handler execution latency
- Completion channel saturation
3. Configure Timeouts
handler_timeout_ms = 30000 # Match your slowest handler
completion_timeout_ms = 30000 # FFI completion timeout
callback_timeout_ms = 5000 # Domain event callback timeout
4. Monitor Starvation
For FFI workers, monitor pending event age:
# Ruby
metrics = Tasker.ffi_dispatch_metrics
if metrics[:oldest_pending_age_ms] > 10000
warn "FFI polling falling behind"
end
Related Documentation
- Messaging Abstraction - Provider-agnostic messaging
- Backpressure Architecture - Unified backpressure strategy
- Worker Actor-Based Architecture - Actor pattern implementation
- Events and Commands - Command pattern details
- Dual-Channel Event System ADR - Dual-channel event system decision
- FFI Callback Safety - FFI guidelines
- RCA: Parallel Execution Timing Bugs - Lessons learned
- Backpressure Monitoring Runbook - Metrics and alerting
<- Back to Documentation Hub
Tasker Core Guides
This directory contains practical how-to guides for working with Tasker Core.
Documents
| Document | Description |
|---|---|
| Quick Start | Get running in 5 minutes |
| Use Cases and Patterns | Practical workflow examples |
| Conditional Workflows | Runtime decision-making and dynamic steps |
| Batch Processing | Parallel processing with cursor-based workers |
| DLQ System | Dead letter queue investigation and resolution |
| Retry Semantics | Understanding max_attempts and retryable flags |
| Identity Strategy | Task deduplication with STRICT, CALLER_PROVIDED, ALWAYS_UNIQUE |
| Configuration Management | TOML architecture, CLI tools, runtime observability |
When to Read These
- Getting started: Begin with Quick Start
- Implementing features: Check Use Cases and Patterns
- Handling errors: See Retry Semantics and DLQ System
- Processing data: Review Batch Processing
- Deploying: Consult Configuration Management
Related Documentation
- Architecture - The “what” - system structure
- Principles - The “why” - design philosophy
- Workers - Language-specific handler development
API Security Guide
API-level security for orchestration (8080) and worker (8081) endpoints using JWT bearer tokens and API key authentication with permission-based access control.
Security is disabled by default for backward compatibility. Enable it explicitly in configuration.
See also: Auth Documentation Hub for architecture overview, Permissions for route mapping, Configuration for full reference, Testing for E2E test patterns.
Quick Start
1. Generate Keys
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys
2. Generate a Token
cargo run --bin tasker-ctl -- auth generate-token \
--private-key ./keys/jwt-private-key.pem \
--permissions "tasks:create,tasks:read,tasks:list,steps:read" \
--subject my-service \
--expiry-hours 24
3. Enable Auth in Configuration
In config/tasker/base/orchestration.toml:
[auth]
enabled = true
jwt_public_key_path = "./keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
4. Use the Token
export TASKER_AUTH_TOKEN=<generated-token>
cargo run --bin tasker-ctl -- task list
Or with curl:
curl -H "Authorization: Bearer $TASKER_AUTH_TOKEN" http://localhost:8080/v1/tasks
Permission Vocabulary
| Permission | Resource | Description |
|---|---|---|
| tasks:create | tasks | Create new tasks |
| tasks:read | tasks | Read task details |
| tasks:list | tasks | List tasks |
| tasks:cancel | tasks | Cancel running tasks |
| tasks:context_read | tasks | Read task context data |
| steps:read | steps | Read workflow step details |
| steps:resolve | steps | Manually resolve steps |
| dlq:read | dlq | Read DLQ entries |
| dlq:update | dlq | Update DLQ investigations |
| dlq:stats | dlq | View DLQ statistics |
| templates:read | templates | Read task templates |
| templates:validate | templates | Validate templates |
| system:config_read | system | Read system configuration |
| system:handlers_read | system | Read handler registry |
| system:analytics_read | system | Read analytics data |
| worker:config_read | worker | Read worker configuration |
| worker:templates_read | worker | Read worker templates |
Wildcards
- tasks:* - All task permissions
- steps:* - All step permissions
- dlq:* - All DLQ permissions
- * - All permissions (superuser)
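The wildcard rule above can be sketched as follows. This is illustrative only, not the actual authorization middleware.

// Minimal sketch of the wildcard matching rule (illustrative).
fn permission_grants(granted: &str, required: &str) -> bool {
    if granted == "*" {
        return true; // superuser
    }
    if let Some(resource) = granted.strip_suffix(":*") {
        // e.g. "tasks:*" grants any "tasks:<action>"
        return required.split(':').next() == Some(resource);
    }
    granted == required
}

fn is_authorized(token_permissions: &[&str], required: &str) -> bool {
    token_permissions.iter().any(|p| permission_grants(p, required))
}

fn main() {
    assert!(is_authorized(&["tasks:*"], "tasks:create"));
    assert!(is_authorized(&["*"], "dlq:update"));
    assert!(!is_authorized(&["tasks:read", "steps:read"], "tasks:create"));
    println!("wildcard matching behaves as documented");
}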
Show All Permissions
cargo run --bin tasker-ctl -- auth show-permissions
Configuration Reference
Server-Side (orchestration.toml / worker.toml)
[auth]
enabled = true
# JWT Configuration
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
jwt_token_expiry_hours = 24
# Key Configuration (one of these):
jwt_public_key_path = "./keys/jwt-public-key.pem" # File path (preferred)
jwt_public_key = "-----BEGIN RSA PUBLIC KEY-----..." # Inline PEM
# Or set env: TASKER_JWT_PUBLIC_KEY_PATH
# JWKS (for dynamic key rotation)
jwt_verification_method = "jwks" # "public_key" (default) or "jwks"
jwks_url = "https://auth.example.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
# Permission validation
permissions_claim = "permissions" # JWT claim containing permissions
strict_validation = true # Reject tokens with unknown permissions
log_unknown_permissions = true
# API Key Authentication
api_key_header = "X-API-Key"
api_keys_enabled = true
[[auth.api_keys]]
key = "sk-prod-key-1"
permissions = ["tasks:read", "tasks:list", "steps:read"]
description = "Read-only monitoring service"
[[auth.api_keys]]
key = "sk-admin-key"
permissions = ["*"]
description = "Admin key"
Client-Side (Environment Variables)
| Variable | Description |
|---|---|
| TASKER_AUTH_TOKEN | Bearer token for both APIs |
| TASKER_ORCHESTRATION_AUTH_TOKEN | Override token for orchestration only |
| TASKER_WORKER_AUTH_TOKEN | Override token for worker only |
| TASKER_API_KEY | API key (fallback if no token) |
| TASKER_API_KEY_HEADER | Custom header name (default: X-API-Key) |
Priority: endpoint-specific token > global token > API key > config file.
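A minimal sketch of that resolution order for the orchestration API, using the environment variables from the table above; the Credential enum and function are illustrative, not the actual client code.

// Minimal sketch of the documented credential resolution order (illustrative).
use std::env;

enum Credential {
    Bearer(String),
    ApiKey(String),
    None,
}

fn resolve_orchestration_credential() -> Credential {
    if let Ok(token) = env::var("TASKER_ORCHESTRATION_AUTH_TOKEN") {
        return Credential::Bearer(token); // endpoint-specific token wins
    }
    if let Ok(token) = env::var("TASKER_AUTH_TOKEN") {
        return Credential::Bearer(token); // then the global token
    }
    if let Ok(key) = env::var("TASKER_API_KEY") {
        return Credential::ApiKey(key); // then the API key fallback
    }
    Credential::None // finally, whatever the config file provides
}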
JWT Token Structure
{
"sub": "my-service",
"iss": "tasker-core",
"aud": "tasker-api",
"iat": 1706000000,
"exp": 1706086400,
"permissions": [
"tasks:create",
"tasks:read",
"tasks:list",
"steps:read"
],
"worker_namespaces": []
}
Common Role Patterns
Read-only operator:
permissions: ["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]
Task submitter:
permissions: ["tasks:create", "tasks:read", "tasks:list"]
Ops admin:
permissions: ["tasks:*", "steps:*", "dlq:*", "system:*"]
Worker service:
permissions: ["worker:config_read", "worker:templates_read"]
Superuser:
permissions: ["*"]
Public Endpoints
These endpoints never require authentication:
- GET /health - Basic health check
- GET /health/detailed - Detailed health
- GET /metrics - Prometheus metrics
API Key Authentication
API keys are validated against a configured registry. Each key has its own set of permissions.
# Using API key
curl -H "X-API-Key: sk-prod-key-1" http://localhost:8080/v1/tasks
API keys are simpler than JWTs but have limitations:
- No expiration (rotate by removing from config)
- No claims beyond permissions
- Best for service-to-service communication with static permissions
Error Responses
401 Unauthorized (Missing/Invalid Credentials)
{
"error": "unauthorized",
"message": "Missing authentication credentials"
}
403 Forbidden (Insufficient Permissions)
{
"error": "forbidden",
"message": "Missing required permission: tasks:create"
}
Migration Guide: Disabled to Enabled
- Generate keys and distribute the public key to server config
- Generate tokens for each service/user with appropriate permissions
- Set enabled = true in the auth config
- Deploy; services without valid tokens will get 401 responses
- Monitor the tasker.auth.failures.total metric for issues
All endpoints remain accessible without auth when enabled = false.
Observability
Structured Logs
- info on successful authentication (subject, method)
- warn on authentication failure (error details)
- warn on permission denial (subject, required permission)
Prometheus Metrics
| Metric | Type | Labels |
|---|---|---|
| tasker.auth.requests.total | Counter | method, result |
| tasker.auth.failures.total | Counter | reason |
| tasker.permission.denials.total | Counter | permission |
| tasker.auth.jwt.verification.duration | Histogram | result |
CLI Auth Commands
# Generate RSA key pair
tasker-ctl auth generate-keys [--output-dir ./keys] [--key-size 2048]
# Generate JWT token
tasker-ctl auth generate-token \
--permissions tasks:create,tasks:read \
--subject my-service \
--private-key ./keys/jwt-private-key.pem \
--expiry-hours 24
# List all permissions
tasker-ctl auth show-permissions
# Validate a token
tasker-ctl auth validate-token \
--token <JWT> \
--public-key ./keys/jwt-public-key.pem
gRPC Authentication
gRPC endpoints support the same authentication methods as REST, using gRPC metadata instead of HTTP headers.
gRPC Ports
| Service | REST Port | gRPC Port |
|---|---|---|
| Orchestration | 8080 | 9190 |
| Rust Worker | 8081 | 9191 |
Bearer Token (gRPC)
# Using grpcurl with Bearer token
grpcurl -plaintext \
-H "Authorization: Bearer $TASKER_AUTH_TOKEN" \
localhost:9190 tasker.v1.TaskService/ListTasks
API Key (gRPC)
# Using grpcurl with API key
grpcurl -plaintext \
-H "X-API-Key: sk-prod-key-1" \
localhost:9190 tasker.v1.TaskService/ListTasks
gRPC Client Configuration
#![allow(unused)]
fn main() {
use tasker_client::grpc_clients::{OrchestrationGrpcClient, GrpcAuthConfig};
// With API key
let client = OrchestrationGrpcClient::connect_with_auth(
"http://localhost:9190",
GrpcAuthConfig::ApiKey("sk-prod-key-1".to_string()),
).await?;
// With Bearer token
let client = OrchestrationGrpcClient::connect_with_auth(
"http://localhost:9190",
GrpcAuthConfig::Bearer("eyJ...".to_string()),
).await?;
}
gRPC Error Codes
| gRPC Status | HTTP Equivalent | Meaning |
|---|---|---|
| UNAUTHENTICATED | 401 | Missing or invalid credentials |
| PERMISSION_DENIED | 403 | Valid credentials but insufficient permissions |
| NOT_FOUND | 404 | Resource not found |
| UNAVAILABLE | 503 | Service unavailable |
Public gRPC Endpoints
These endpoints never require authentication:
- HealthService/CheckHealth - Basic health check
- HealthService/CheckLiveness - Kubernetes liveness probe
- HealthService/CheckReadiness - Kubernetes readiness probe
- HealthService/CheckDetailedHealth - Detailed health metrics
Security Considerations
- Key storage: Private keys should never be committed to git. Use file paths or environment variables.
- Token expiry: Set appropriate expiry times. Short-lived tokens (1-24h) are preferred.
- Least privilege: Grant only the permissions each service needs.
- Key rotation: Use JWKS for automatic key rotation in production.
- API key rotation: Remove old keys from config and redeploy.
- Audit: Monitor tasker.auth.failures.total and tasker.permission.denials.total for anomalies.
External Auth Provider Integration
Integrating Tasker’s API security with external identity providers via JWKS endpoints.
See also: Auth Documentation Hub for architecture overview, Configuration for full TOML reference.
JWKS Integration
Tasker supports JWKS (JSON Web Key Set) for dynamic public key discovery. This enables key rotation without redeploying Tasker.
Configuration
[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://your-provider.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://your-provider.com/"
jwt_audience = "tasker-api"
permissions_claim = "permissions" # or custom claim name
How It Works
- On first request, Tasker fetches the JWKS from the configured URL
- Keys are cached for the configured refresh interval
- When a token has an unknown kid (Key ID), a refresh is triggered
- RSA keys are parsed from the JWK n and e components
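The cache-and-refresh-on-unknown-kid behaviour can be sketched as below. The cache struct and the fetch_jwks helper are illustrative, not Tasker's actual verifier.

// Minimal sketch of JWKS caching with refresh on unknown kid (illustrative).
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct JwksCache {
    keys: HashMap<String, Vec<u8>>, // kid -> encoded public key (simplified)
    fetched_at: Instant,
    refresh_interval: Duration,
}

impl JwksCache {
    fn key_for(&mut self, kid: &str, jwks_url: &str) -> Option<&Vec<u8>> {
        let stale = self.fetched_at.elapsed() > self.refresh_interval;
        if stale || !self.keys.contains_key(kid) {
            // Unknown kid or expired cache: re-fetch the key set.
            self.keys = fetch_jwks(jwks_url);
            self.fetched_at = Instant::now();
        }
        self.keys.get(kid)
    }
}

fn fetch_jwks(_jwks_url: &str) -> HashMap<String, Vec<u8>> {
    // Placeholder: a real implementation performs an HTTPS GET and parses
    // each JWK's `n`/`e` components into an RSA public key.
    HashMap::new()
}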
Auth0
Auth0 Configuration
- Create an API in the Auth0 Dashboard:
  - Name: Tasker API
  - Identifier: tasker-api (this becomes the audience)
  - Signing Algorithm: RS256
- Create permissions in the API settings matching Tasker's vocabulary: tasks:create, tasks:read, tasks:list, etc.
- Assign permissions to users/applications via Auth0 roles
[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://YOUR_DOMAIN.auth0.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://YOUR_DOMAIN.auth0.com/"
jwt_audience = "tasker-api"
permissions_claim = "permissions"
Token Request
curl --request POST \
--url https://YOUR_DOMAIN.auth0.com/oauth/token \
--header 'content-type: application/json' \
--data '{
"client_id": "YOUR_CLIENT_ID",
"client_secret": "YOUR_CLIENT_SECRET",
"audience": "tasker-api",
"grant_type": "client_credentials"
}'
Keycloak
Keycloak Configuration
- Create a realm and client for Tasker
- Define client roles matching Tasker permissions
- Configure the client to include roles in the permissions token claim via a protocol mapper
Tasker Configuration for Keycloak
[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://keycloak.example.com/realms/YOUR_REALM/protocol/openid-connect/certs"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://keycloak.example.com/realms/YOUR_REALM"
jwt_audience = "tasker-api"
permissions_claim = "permissions" # Configure via protocol mapper
Okta
Okta Configuration
- Create an API authorization server
- Add custom claims for permissions
- Define scopes matching Tasker permissions
Tasker Configuration for Okta
[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://YOUR_DOMAIN.okta.com/oauth2/YOUR_AUTH_SERVER_ID/v1/keys"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://YOUR_DOMAIN.okta.com/oauth2/YOUR_AUTH_SERVER_ID"
jwt_audience = "tasker-api"
permissions_claim = "scp" # Okta uses "scp" for scopes by default
Custom JWKS Endpoint
Any provider that serves a standard JWKS endpoint works. The endpoint must return:
{
"keys": [
{
"kty": "RSA",
"kid": "key-id-1",
"use": "sig",
"alg": "RS256",
"n": "<base64url-encoded modulus>",
"e": "<base64url-encoded exponent>"
}
]
}
Static Public Key (Development)
For development or simple deployments without a JWKS endpoint:
[auth]
enabled = true
jwt_verification_method = "public_key"
jwt_public_key_path = "/etc/tasker/keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
Generate keys with:
tasker-ctl auth generate-keys --output-dir /etc/tasker/keys
Permission Claim Mapping
If your identity provider uses a different claim name for permissions:
permissions_claim = "custom_permissions" # Default: "permissions"
The claim must be a JSON array of strings:
{
"custom_permissions": ["tasks:create", "tasks:read"]
}
Strict Validation
When strict_validation = true (default), tokens containing unknown permission strings are rejected. Set to false if your provider includes additional scopes/permissions not in Tasker’s vocabulary:
strict_validation = false
log_unknown_permissions = true # Still log unknown permissions for monitoring
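A minimal sketch of the strict versus lenient behaviour follows; the function and the vocabulary set are illustrative, not the actual token validator.

// Minimal sketch of strict vs. lenient permission validation (illustrative).
use std::collections::HashSet;

fn validate_permissions(
    token_permissions: &[String],
    known: &HashSet<&str>,
    strict: bool,
    log_unknown: bool,
) -> Result<(), String> {
    for perm in token_permissions {
        let is_known = known.contains(perm.as_str())
            || perm == "*"
            || perm.ends_with(":*");
        if !is_known {
            if log_unknown {
                eprintln!("unknown permission in token: {perm}");
            }
            if strict {
                return Err(format!("unknown permission: {perm}")); // token rejected
            }
        }
    }
    Ok(()) // lenient mode: unknown permissions are ignored (but logged)
}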
Batch Processing in Tasker
Last Updated: 2026-01-06 Status: Production Ready Related: Conditional Workflows, DLQ System
Table of Contents
- Overview
- Architecture Foundations
- Core Concepts
- Checkpoint Yielding
- Workflow Pattern
- Data Structures
- Implementation Patterns
- Use Cases
- Operator Workflows
- Code Examples
- Best Practices
Overview
Batch processing in Tasker enables parallel processing of large datasets by dynamically creating worker steps at runtime. A single “batchable” step analyzes a workload and instructs orchestration to create N worker instances, each processing a subset of data using cursor-based boundaries.
Key Characteristics
Dynamic Worker Creation: Workers are created at runtime based on dataset analysis, using a predefined step template for structure but scaling the number of instances to the workload.
Cursor-Based Resumability: Each worker processes a specific range (cursor) and can resume from checkpoints on failure.
Deferred Convergence: Aggregation steps use intersection semantics to wait for all created workers, regardless of count.
Standard Lifecycle: Workers use existing retry, timeout, and DLQ mechanics - no special recovery system needed.
Example Flow
Task: Process 1000-row CSV file
1. analyze_csv (batchable step)
→ Counts rows: 1000
→ Calculates workers: 5 (200 rows each)
→ Returns BatchProcessingOutcome::CreateBatches
2. Orchestration creates workers dynamically:
├─ process_csv_batch_001 (rows 1-200)
├─ process_csv_batch_002 (rows 201-400)
├─ process_csv_batch_003 (rows 401-600)
├─ process_csv_batch_004 (rows 601-800)
└─ process_csv_batch_005 (rows 801-1000)
3. Workers process in parallel
4. aggregate_csv_results (deferred convergence)
→ Waits for all 5 workers (intersection semantics)
→ Aggregates results from completed workers
→ Returns combined metrics
Architecture Foundations
Batch processing builds on and extends three foundational Tasker patterns:
1. DAG (Directed Acyclic Graph) Workflow Orchestration
What Batch Processing Inherits:
- Worker steps are full DAG nodes with standard state machines
- Parent-child dependencies enforced via tasker_workflow_step_edges
- Cycle detection prevents circular dependencies
- Topological ordering ensures correct execution sequence
What Batch Processing Extends:
- Dynamic node creation: Template steps instantiated N times at runtime
- Edge generation: Batchable step → worker instances → convergence step
- Transactional atomicity: All workers created in single database transaction
Code Reference: tasker-orchestration/src/orchestration/lifecycle/batch_processing/service.rs:357-400
#![allow(unused)]
fn main() {
// Transaction ensures all-or-nothing worker creation
let mut tx = pool.begin().await?;
for (i, cursor_config) in cursor_configs.iter().enumerate() {
// Create worker instance from template
let worker_step = WorkflowStepCreator::create_from_template(
&mut tx,
task_uuid,
&worker_template,
&format!("{}_{:03}", worker_template_name, i + 1),
Some(batch_worker_inputs.clone()),
).await?;
// Create edge: batchable → worker
WorkflowStepEdge::create_with_transaction(
&mut tx,
batchable_step.workflow_step_uuid,
worker_step.workflow_step_uuid,
"batch_dependency",
).await?;
}
tx.commit().await?; // Atomic - all workers or none
}
2. Retryability and Lifecycle Management
What Batch Processing Inherits:
- Standard lifecycle.max_retries configuration per template
- Exponential backoff via lifecycle.backoff_multiplier
- Staleness detection using lifecycle.max_steps_in_process_minutes
- Standard state transitions (Pending → Enqueued → InProgress → Complete/Error)
What Batch Processing Extends:
- Checkpoint-based resumability: Workers checkpoint progress and resume from last cursor position
- Cursor preservation during retry: workflow_steps.results field preserved by the ResetForRetry action
- Additional staleness detection: Checkpoint timestamp tracking alongside duration-based detection
Key Simplification:
- ❌ No BatchRecoveryService - Uses standard retry + DLQ
- ❌ No duplicate timeout settings - Uses lifecycle config only
- ✅ Cursor data preserved during ResetForRetry
Configuration Example: tests/fixtures/task_templates/ruby/batch_processing_products_csv.yaml:749-752
- name: process_csv_batch
type: batch_worker
lifecycle:
max_steps_in_process_minutes: 120 # DLQ timeout
max_retries: 3 # Standard retry limit
backoff_multiplier: 2.0 # Exponential backoff
3. Deferred Convergence
What Batch Processing Inherits:
- Intersection semantics: Wait for declared dependencies ∩ actually created steps
- Template-level dependencies: Convergence step depends on worker template, not instances
- Runtime resolution: System computes effective dependencies when workers are created
What Batch Processing Extends:
- Batch aggregation pattern: Convergence steps aggregate results from N workers
- NoBatches scenario handling: Placeholder worker created when dataset too small
- Scenario detection helpers: BatchAggregationScenario::detect() for both cases
Flow Comparison:
Conditional Workflows (Decision Points):
decision_step → creates → option_a, option_b (conditional)
↓
convergence_step (depends on option_a AND option_b templates)
→ waits for whichever were created (intersection)
Batch Processing (Batchable Steps):
batchable_step → creates → worker_001, worker_002, ..., worker_N
↓
convergence_step (depends on worker template)
→ waits for ALL workers created (intersection)
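The intersection rule can be sketched as below, using illustrative names and a prefix-based matching convention; the real system resolves effective dependencies through the database edges created at runtime, not string matching.

// Minimal sketch of intersection semantics: effective dependencies are the
// instances actually created from the declared template dependencies.
fn effective_dependencies<'a>(
    declared_templates: &[&str],   // e.g. ["process_csv_batch"]
    created_steps: &'a [String],   // e.g. ["process_csv_batch_001", ...]
) -> Vec<&'a String> {
    created_steps
        .iter()
        .filter(|step| {
            declared_templates
                .iter()
                .any(|tmpl| step.as_str() == *tmpl || step.starts_with(&format!("{tmpl}_")))
        })
        .collect()
}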
Code Reference: tasker-orchestration/src/orchestration/lifecycle/batch_processing/service.rs:600-650
#![allow(unused)]
fn main() {
// Determine and create convergence step with intersection semantics
pub async fn determine_and_create_convergence_step(
&self,
tx: &mut PgTransaction,
task_uuid: Uuid,
convergence_template: &StepDefinition,
created_workers: &[WorkflowStep],
) -> Result<Option<WorkflowStep>> {
// Create convergence step as deferred type
let convergence_step = WorkflowStepCreator::create_from_template(
tx,
task_uuid,
convergence_template,
&convergence_template.name,
None,
).await?;
// Create edges from ALL worker instances to convergence step
for worker in created_workers {
WorkflowStepEdge::create_with_transaction(
tx,
worker.workflow_step_uuid,
convergence_step.workflow_step_uuid,
"batch_convergence_dependency",
).await?;
}
Ok(Some(convergence_step))
}
}
Core Concepts
Batchable Steps
Purpose: Analyze a workload and decide whether to create batch workers.
Responsibilities:
- Examine dataset (size, complexity, business logic)
- Calculate optimal worker count based on batch size
- Generate cursor configurations defining batch boundaries
- Return a BatchProcessingOutcome instructing orchestration
Returns: BatchProcessingOutcome enum with two variants:
- NoBatches: Dataset too small or empty; create a placeholder worker
- CreateBatches: Create N workers with cursor configurations
Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:60-120
#![allow(unused)]
fn main() {
// Batchable handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let csv_file_path = step_data.task.context.get("csv_file_path").unwrap();
// Count rows in CSV (excluding header)
let total_rows = count_csv_rows(csv_file_path)?;
// Get batch configuration from handler initialization
let batch_size = step_data.handler_initialization
.get("batch_size").and_then(|v| v.as_u64()).unwrap_or(200);
if total_rows == 0 {
// No batches needed
let outcome = BatchProcessingOutcome::no_batches();
return Ok(success_result(
step_uuid,
json!({ "batch_processing_outcome": outcome.to_value() }),
elapsed_ms,
None,
));
}
// Calculate workers
let worker_count = (total_rows as f64 / batch_size as f64).ceil() as u32;
// Generate cursor configs
let cursor_configs = create_cursor_configs(total_rows, worker_count);
// Return CreateBatches outcome
let outcome = BatchProcessingOutcome::create_batches(
"process_csv_batch".to_string(),
worker_count,
cursor_configs,
total_rows,
);
Ok(success_result(
step_uuid,
json!({
"batch_processing_outcome": outcome.to_value(),
"worker_count": worker_count,
"total_rows": total_rows
}),
elapsed_ms,
None,
))
}
}
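The create_cursor_configs helper referenced above is not shown; a minimal sketch of one possible implementation for row-number cursors is below. It emits JSON values shaped like the CursorConfig documented under Data Structures; the exact inclusive/exclusive boundary convention must agree with the worker's own range check.

// Minimal sketch of a cursor-config generator for row-number cursors.
// One possible implementation, not the handler's actual helper.
use serde_json::json;

fn create_cursor_configs(total_rows: u64, worker_count: u32) -> Vec<serde_json::Value> {
    let batch_size = (total_rows as f64 / worker_count as f64).ceil() as u64;
    (0..u64::from(worker_count))
        .map(|i| {
            // Row ranges match the example flow above (1-200, 201-400, ...).
            let start = i * batch_size + 1;
            let end = ((i + 1) * batch_size).min(total_rows);
            json!({
                "batch_id": format!("{:03}", i + 1),
                "start_cursor": start,
                "end_cursor": end,
                "batch_size": end - start + 1,
            })
        })
        .collect()
}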
Batch Workers
Purpose: Process a specific subset of data defined by cursor configuration.
Responsibilities:
- Extract cursor config from workflow_step.inputs
- Check for the is_no_op flag (NoBatches placeholder scenario)
- Process items within the cursor range (start_cursor to end_cursor)
- Checkpoint progress periodically for resumability
- Return processed results for aggregation
Cursor Configuration: Each worker receives BatchWorkerInputs in workflow_step.inputs:
{
"cursor": {
"batch_id": "001",
"start_cursor": 1,
"end_cursor": 200,
"batch_size": 200
},
"batch_metadata": {
"checkpoint_interval": 100,
"cursor_field": "row_number",
"failure_strategy": "fail_fast"
},
"is_no_op": false
}
Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:200-280
#![allow(unused)]
fn main() {
// Batch worker handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// Extract context using helper
let context = BatchWorkerContext::from_step_data(step_data)?;
// Check for no-op placeholder worker
if context.is_no_op() {
return Ok(success_result(
step_uuid,
json!({
"no_op": true,
"reason": "NoBatches scenario - no items to process"
}),
elapsed_ms,
None,
));
}
// Get cursor range
let start_row = context.start_position();
let end_row = context.end_position();
// Get CSV file path from dependency results
let csv_file_path = step_data
.dependency_results
.get("analyze_csv")
.and_then(|r| r.result.get("csv_file_path"))
.unwrap();
// Process CSV rows in cursor range
let mut processed_count = 0;
let mut metrics = initialize_metrics();
let file = File::open(csv_file_path)?;
let mut csv_reader = csv::ReaderBuilder::new()
.has_headers(true)
.from_reader(BufReader::new(file));
for (row_idx, result) in csv_reader.deserialize::<Product>().enumerate() {
let data_row_num = row_idx + 1; // 1-indexed after header
if data_row_num < start_row {
continue; // Skip rows before our range
}
if data_row_num >= end_row {
break; // Processed all our rows
}
let product: Product = result?;
// Update metrics
metrics.total_inventory_value += product.price * (product.stock as f64);
*metrics.category_counts.entry(product.category.clone()).or_insert(0) += 1;
processed_count += 1;
// Checkpoint progress periodically
if processed_count % context.checkpoint_interval() == 0 {
checkpoint_progress(step_uuid, data_row_num).await?;
}
}
// Return results for aggregation
Ok(success_result(
step_uuid,
json!({
"processed_count": processed_count,
"total_inventory_value": metrics.total_inventory_value,
"category_counts": metrics.category_counts,
"batch_id": context.batch_id(),
"start_row": start_row,
"end_row": end_row
}),
elapsed_ms,
None,
))
}
}
Convergence Steps
Purpose: Aggregate results from all batch workers using deferred intersection semantics.
Responsibilities:
- Detect scenario using BatchAggregationScenario::detect()
- Handle both NoBatches and WithBatches scenarios
- Aggregate metrics from all worker results
- Return combined results for task completion
Scenario Detection:
#![allow(unused)]
fn main() {
pub enum BatchAggregationScenario {
/// No batches created - placeholder worker used
NoBatches {
batchable_result: StepDependencyResult,
},
/// Batches created - multiple workers processed data
WithBatches {
batch_results: Vec<(String, StepDependencyResult)>,
worker_count: u32,
},
}
}
Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:400-480
#![allow(unused)]
fn main() {
// Convergence handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// Detect scenario using helper
let scenario = BatchAggregationScenario::detect(
&step_data.dependency_results,
"analyze_csv", // batchable step name
"process_csv_batch_", // batch worker prefix
)?;
match scenario {
BatchAggregationScenario::NoBatches { batchable_result } => {
// No workers created - get dataset size from batchable step
let total_rows = batchable_result
.result.get("total_rows")
.and_then(|v| v.as_u64())
.unwrap_or(0);
// Return zero metrics
Ok(success_result(
step_uuid,
json!({
"total_processed": total_rows,
"total_inventory_value": 0.0,
"category_counts": {},
"worker_count": 0
}),
elapsed_ms,
None,
))
}
BatchAggregationScenario::WithBatches { batch_results, worker_count } => {
// Aggregate results from all workers
let mut total_processed = 0u64;
let mut total_inventory_value = 0.0;
let mut global_category_counts = HashMap::new();
let mut max_price = 0.0;
let mut max_price_product = None;
for (step_name, result) in batch_results {
// Sum processed counts
total_processed += result.result
.get("processed_count")
.and_then(|v| v.as_u64())
.unwrap_or(0);
// Sum inventory values
total_inventory_value += result.result
.get("total_inventory_value")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
// Merge category counts
if let Some(categories) = result.result
.get("category_counts")
.and_then(|v| v.as_object()) {
for (category, count) in categories {
*global_category_counts.entry(category.clone()).or_insert(0)
+= count.as_u64().unwrap_or(0);
}
}
// Find global max price
let batch_max_price = result.result
.get("max_price")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
if batch_max_price > max_price {
max_price = batch_max_price;
max_price_product = result.result
.get("max_price_product")
.and_then(|v| v.as_str())
.map(String::from);
}
}
// Return aggregated metrics
Ok(success_result(
step_uuid,
json!({
"total_processed": total_processed,
"total_inventory_value": total_inventory_value,
"category_counts": global_category_counts,
"max_price": max_price,
"max_price_product": max_price_product,
"worker_count": worker_count
}),
elapsed_ms,
None,
))
}
}
}
}
Checkpoint Yielding
Checkpoint yielding enables handler-driven progress persistence during long-running batch operations. Handlers explicitly checkpoint their progress, persist state to the database, and yield control back to the orchestrator for re-dispatch.
Key Characteristics
Handler-Driven: Handlers decide when to checkpoint based on business logic, not configuration timers. This gives handlers full control over checkpoint frequency and timing.
Checkpoint-Persist-Then-Redispatch: Progress is atomically saved to the database before the step is re-dispatched. This ensures no progress is ever lost, even during infrastructure failures.
Step Remains In-Progress: During checkpoint yield cycles, the step stays in InProgress state. It is not released or re-enqueued through normal channels—the re-dispatch happens internally.
State Machine Integrity: Only Success or Failure results trigger state transitions. Checkpoint yields are internal handler mechanics that don’t affect the step state machine.
When to Use Checkpoint Yielding
Use checkpoint yielding when:
- Processing takes longer than your visibility timeout (prevents DLQ escalation)
- You want resumable processing after transient failures
- You need to periodically release resources (memory, connections)
- Long-running operations need progress visibility
Don’t use checkpoint yielding when:
- Batch processing completes quickly (<30 seconds)
- The overhead of checkpointing exceeds the benefit
- Operations are inherently non-resumable
API Reference
All languages provide a checkpoint_yield() method (or checkpointYield() in TypeScript) on the Batchable mixin:
Ruby
class MyBatchWorkerHandler
include Tasker::StepHandler::Batchable
def call(step_data)
context = BatchWorkerContext.from_step_data(step_data)
# Resume from checkpoint if present
start_item = context.has_checkpoint? ? context.checkpoint_cursor : 0
accumulated = context.accumulated_results || []
items = fetch_items_to_process(start_item)
items.each_with_index do |item, idx|
result = process_item(item)
accumulated << result
# Checkpoint every 1000 items
if (idx + 1) % 1000 == 0
checkpoint_yield(
cursor: start_item + idx + 1,
items_processed: accumulated.size,
accumulated_results: { processed: accumulated }
)
# Handler execution stops here and resumes on re-dispatch
end
end
# Return final success result
success_result(results: { all_processed: accumulated })
end
end
BatchWorkerContext Accessors (Ruby):
- checkpoint_cursor - Current cursor position (or nil if no checkpoint)
- accumulated_results - Previously accumulated results (or nil)
- has_checkpoint? - Returns true if checkpoint data exists
- checkpoint_items_processed - Number of items processed at checkpoint
Python
class MyBatchWorkerHandler(BatchableHandler):
def call(self, step_data: TaskSequenceStep) -> StepExecutionResult:
context = BatchWorkerContext.from_step_data(step_data)
# Resume from checkpoint if present
start_item = context.checkpoint_cursor if context.has_checkpoint() else 0
accumulated = context.accumulated_results or []
items = self.fetch_items_to_process(start_item)
for idx, item in enumerate(items):
result = self.process_item(item)
accumulated.append(result)
# Checkpoint every 1000 items
if (idx + 1) % 1000 == 0:
self.checkpoint_yield(
cursor=start_item + idx + 1,
items_processed=len(accumulated),
accumulated_results={"processed": accumulated}
)
# Handler execution stops here and resumes on re-dispatch
# Return final success result
return self.success_result(results={"all_processed": accumulated})
BatchWorkerContext Accessors (Python):
- checkpoint_cursor: int | str | dict | None - Current cursor position
- accumulated_results: dict | None - Previously accumulated results
- has_checkpoint() -> bool - Returns true if checkpoint data exists
- checkpoint_items_processed: int - Number of items processed at checkpoint
TypeScript
class MyBatchWorkerHandler extends BatchableHandler {
async call(stepData: TaskSequenceStep): Promise<StepExecutionResult> {
const context = BatchWorkerContext.fromStepData(stepData);
// Resume from checkpoint if present
const startItem = context.hasCheckpoint() ? context.checkpointCursor : 0;
const accumulated = context.accumulatedResults ?? [];
const items = await this.fetchItemsToProcess(startItem);
for (let idx = 0; idx < items.length; idx++) {
const result = await this.processItem(items[idx]);
accumulated.push(result);
// Checkpoint every 1000 items
if ((idx + 1) % 1000 === 0) {
await this.checkpointYield({
cursor: startItem + idx + 1,
itemsProcessed: accumulated.length,
accumulatedResults: { processed: accumulated }
});
// Handler execution stops here and resumes on re-dispatch
}
}
// Return final success result
return this.successResult({ results: { allProcessed: accumulated } });
}
}
BatchWorkerContext Properties (TypeScript):
- checkpointCursor: number | string | Record<string, unknown> | undefined
- accumulatedResults: Record<string, unknown> | undefined
- hasCheckpoint(): boolean
- checkpointItemsProcessed: number
Checkpoint Data Structure
Checkpoints are persisted in the checkpoint JSONB column on workflow_steps:
{
"cursor": 1000,
"items_processed": 1000,
"timestamp": "2026-01-06T12:00:00Z",
"accumulated_results": {
"processed": ["item1", "item2", "..."]
},
"history": [
{"cursor": 500, "timestamp": "2026-01-06T11:59:30Z"},
{"cursor": 1000, "timestamp": "2026-01-06T12:00:00Z"}
]
}
Fields:
- cursor - Flexible JSON value representing position (integer, string, or object)
- items_processed - Total items processed at this checkpoint
- timestamp - ISO 8601 timestamp when checkpoint was created
- accumulated_results - Optional intermediate results to preserve
- history - Array of previous checkpoint positions (appended automatically)
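As a hedged sketch, that shape could be modelled as a serde struct like the one below; the struct name and field types are inferred from the JSON above, not taken from the crate.

// Minimal serde sketch of the checkpoint shape (inferred, not authoritative).
use serde::{Deserialize, Serialize};
use serde_json::Value;

#[derive(Debug, Serialize, Deserialize)]
struct CheckpointSketch {
    /// Flexible position: integer, string, or object.
    cursor: Value,
    items_processed: u64,
    /// ISO 8601 timestamp of this checkpoint.
    timestamp: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    accumulated_results: Option<Value>,
    /// Previous checkpoint positions, appended automatically on each yield.
    #[serde(default)]
    history: Vec<Value>,
}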
Checkpoint Flow
┌──────────────────────────────────────────────────────────────────┐
│ Handler calls checkpoint_yield(cursor, items_processed, ...) │
└───────────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ FFI Bridge: checkpoint_yield_step_event() │
│ Converts language-specific types to CheckpointYieldData │
└───────────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ CheckpointService::save_checkpoint() │
│ - Atomic SQL update with history append │
│ - Uses JSONB jsonb_set for history array │
└───────────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Worker re-dispatches step via internal MPSC channel │
│ - Step stays InProgress (not released) │
│ - Re-queued for immediate processing │
└───────────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Handler resumes with checkpoint data in workflow_step │
│ - BatchWorkerContext provides checkpoint accessors │
│ - Handler continues from saved cursor position │
└──────────────────────────────────────────────────────────────────┘
Failure and Recovery
Transient Failure After Checkpoint:
- Handler checkpoints at position 500
- Handler fails at position 750 (transient error)
- Step is retried (standard retry semantics)
- Handler resumes from checkpoint (position 500)
- Items 500-750 are reprocessed (idempotency required)
- Processing continues to completion
Permanent Failure:
- Handler checkpoints at position 500
- Handler encounters non-retryable error
- Step transitions to Error state
- Checkpoint data preserved for operator inspection
- Manual intervention may use checkpoint to resume later
Best Practices
Checkpoint Frequency:
- Too frequent: Overhead dominates (database writes, re-dispatch latency)
- Too infrequent: Lost progress on failure, long recovery time
- Rule of thumb: Checkpoint every 1-5 minutes of work, or every 1000-10000 items
Accumulated Results:
- Keep accumulated results small (summaries, counts, IDs)
- For large result sets, write to external storage and store reference
- Unbounded accumulated results can cause performance degradation
Cursor Design:
- Use monotonic cursors (integers, timestamps) when possible
- Complex cursors (objects) are supported but harder to debug
- Cursor must uniquely identify resume position
Idempotency:
- Items between last checkpoint and failure will be reprocessed
- Ensure item processing is idempotent or use deduplication
- Consider storing processed item IDs in accumulated_results
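Per the last bullet above, deduplication on resume can be sketched as follows; the types are illustrative, and a real handler would read the ID set back out of accumulated_results in the checkpoint.

// Minimal sketch of checkpoint-based deduplication (illustrative types).
use std::collections::HashSet;

fn already_processed(accumulated_ids: &HashSet<String>, item_id: &str) -> bool {
    accumulated_ids.contains(item_id)
}

fn process_resumed_batch(items: &[(String, String)], accumulated_ids: &mut HashSet<String>) {
    for (item_id, payload) in items {
        if already_processed(accumulated_ids, item_id) {
            continue; // item was reprocessed after a failure past the last checkpoint
        }
        // ... do the real (idempotent) work with `payload` here ...
        let _ = payload;
        accumulated_ids.insert(item_id.clone());
    }
}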
Monitoring
Checkpoint Events (logged automatically):
INFO checkpoint_yield_step_event step_uuid=abc cursor=1000 items_processed=1000
INFO checkpoint_saved step_uuid=abc history_length=2
Metrics to Monitor:
- Checkpoint frequency per step
- Average items processed between checkpoints
- Checkpoint history size (detect unbounded growth)
- Re-dispatch latency after checkpoint
Known Limitations
History Array Growth: The history array grows with each checkpoint. For very long-running processes with frequent checkpoints, this can lead to large JSONB values. Consider:
- Setting a maximum history length (future enhancement)
- Clearing history on step completion
- Using external storage for detailed history
Accumulated Results Size: No built-in size limit on accumulated_results. Handlers must self-regulate to prevent database bloat. Consider:
- Storing summaries instead of raw data
- Using external storage for large intermediate results
- Implementing size checks before checkpoint
Workflow Pattern
Template Definition
Batch processing workflows use three step types in YAML templates:
name: csv_product_inventory_analyzer
namespace_name: csv_processing
version: "1.0.0"
steps:
# BATCHABLE STEP: Analyzes dataset and decides batching strategy
- name: analyze_csv
type: batchable
dependencies: []
handler:
callable: BatchProcessing::CsvAnalyzerHandler
initialization:
batch_size: 200
max_workers: 5
# BATCH WORKER TEMPLATE: Single batch processing unit
# Orchestration creates N instances from this template at runtime
- name: process_csv_batch
type: batch_worker
dependencies:
- analyze_csv
lifecycle:
max_steps_in_process_minutes: 120
max_retries: 3
backoff_multiplier: 2.0
handler:
callable: BatchProcessing::CsvBatchProcessorHandler
initialization:
operation: "inventory_analysis"
# DEFERRED CONVERGENCE STEP: Aggregates results from all workers
- name: aggregate_csv_results
type: deferred_convergence
dependencies:
- process_csv_batch # Template dependency - resolves to all instances
handler:
callable: BatchProcessing::CsvResultsAggregatorHandler
initialization:
aggregation_type: "inventory_metrics"
Runtime Execution Flow
1. Task Initialization
User creates task with context: { "csv_file_path": "/path/to/data.csv" }
↓
Task enters Initializing state
↓
Orchestration discovers ready steps: [analyze_csv]
2. Batchable Step Execution
analyze_csv step enqueued to worker queue
↓
Worker claims step, executes CsvAnalyzerHandler
↓
Handler counts rows: 1000
Handler calculates workers: 5 (200 rows each)
Handler generates cursor configs
Handler returns BatchProcessingOutcome::CreateBatches
↓
Step completes with batch_processing_outcome in results
3. Batch Worker Creation (Orchestration)
ResultProcessorActor processes analyze_csv completion
↓
Detects batch_processing_outcome in step results
↓
Sends ProcessBatchableStepMessage to BatchProcessingActor
↓
BatchProcessingService.process_batchable_step():
- Begins database transaction
- Creates 5 worker instances from process_csv_batch template:
* process_csv_batch_001 (cursor: rows 1-200)
* process_csv_batch_002 (cursor: rows 201-400)
* process_csv_batch_003 (cursor: rows 401-600)
* process_csv_batch_004 (cursor: rows 601-800)
* process_csv_batch_005 (cursor: rows 801-1000)
- Creates edges: analyze_csv → each worker
- Creates convergence step: aggregate_csv_results
- Creates edges: each worker → aggregate_csv_results
- Commits transaction (all-or-nothing)
↓
Workers enqueued to worker queue with PGMQ notifications
4. Parallel Worker Execution
5 workers execute in parallel:
Worker 001:
- Extracts cursor: start=1, end=200
- Processes CSV rows 1-200
- Returns: processed_count=200, metrics={...}
Worker 002:
- Extracts cursor: start=201, end=400
- Processes CSV rows 201-400
- Returns: processed_count=200, metrics={...}
... (workers 003-005 similar)
All workers complete
5. Convergence Step Execution
Orchestration discovers aggregate_csv_results is ready
(all parent workers completed - intersection semantics)
↓
aggregate_csv_results enqueued to worker queue
↓
Worker claims step, executes CsvResultsAggregatorHandler
↓
Handler detects scenario: WithBatches (5 workers)
Handler aggregates results from all 5 workers:
- total_processed: 1000
- total_inventory_value: $XXX,XXX.XX
- category_counts: {electronics: 300, clothing: 250, ...}
Handler returns aggregated metrics
↓
Step completes
6. Task Completion
Orchestration detects all steps complete
↓
TaskFinalizerActor finalizes task
↓
Task state: Complete
NoBatches Scenario Flow
When dataset is too small or empty:
analyze_csv determines dataset too small (e.g., 50 rows < 200 batch_size)
↓
Returns BatchProcessingOutcome::NoBatches
↓
Orchestration creates single placeholder worker:
- process_csv_batch_001 (is_no_op: true)
- No cursor configuration needed
- Still maintains DAG structure
↓
Placeholder worker executes:
- Detects is_no_op flag
- Returns immediately with no_op: true
- No actual data processing
↓
Convergence step detects NoBatches scenario:
- Uses batchable step result directly
- Returns zero metrics or empty aggregation
Why placeholder workers?
- Maintains consistent DAG structure
- Convergence step logic handles both scenarios uniformly
- No special-case orchestration logic needed
- Standard retry/DLQ mechanics still apply
Data Structures
BatchProcessingOutcome
Location: tasker-shared/src/messaging/execution_types.rs
Purpose: Returned by batchable handlers to instruct orchestration.
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BatchProcessingOutcome {
/// No batching needed - create placeholder worker
NoBatches,
/// Create N batch workers with cursor configurations
CreateBatches {
/// Template step name (e.g., "process_csv_batch")
worker_template_name: String,
/// Number of workers to create
worker_count: u32,
/// Cursor configurations for each worker
cursor_configs: Vec<CursorConfig>,
/// Total items across all batches
total_items: u64,
},
}
impl BatchProcessingOutcome {
pub fn no_batches() -> Self {
BatchProcessingOutcome::NoBatches
}
pub fn create_batches(
worker_template_name: String,
worker_count: u32,
cursor_configs: Vec<CursorConfig>,
total_items: u64,
) -> Self {
BatchProcessingOutcome::CreateBatches {
worker_template_name,
worker_count,
cursor_configs,
total_items,
}
}
pub fn to_value(&self) -> serde_json::Value {
serde_json::to_value(self).unwrap_or(json!({}))
}
}
}
Ruby Mirror: workers/ruby/lib/tasker_core/types/batch_processing_outcome.rb
module TaskerCore
module Types
module BatchProcessingOutcome
class NoBatches < Dry::Struct
attribute :type, Types::String.default('no_batches')
def to_h
{ 'type' => 'no_batches' }
end
def requires_batch_creation?
false
end
end
class CreateBatches < Dry::Struct
attribute :type, Types::String.default('create_batches')
attribute :worker_template_name, Types::Strict::String
attribute :worker_count, Types::Coercible::Integer.constrained(gteq: 1)
attribute :cursor_configs, Types::Array.of(Types::Hash).constrained(min_size: 1)
attribute :total_items, Types::Coercible::Integer.constrained(gteq: 0)
def to_h
{
'type' => 'create_batches',
'worker_template_name' => worker_template_name,
'worker_count' => worker_count,
'cursor_configs' => cursor_configs,
'total_items' => total_items
}
end
def requires_batch_creation?
true
end
end
class << self
def no_batches
NoBatches.new
end
def create_batches(worker_template_name:, worker_count:, cursor_configs:, total_items:)
CreateBatches.new(
worker_template_name: worker_template_name,
worker_count: worker_count,
cursor_configs: cursor_configs,
total_items: total_items
)
end
end
end
end
end
CursorConfig
Location: tasker-shared/src/messaging/execution_types.rs
Purpose: Defines batch boundaries for each worker.
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct CursorConfig {
/// Batch identifier (e.g., "001", "002", "003")
pub batch_id: String,
/// Starting position (inclusive) - flexible JSON value
pub start_cursor: serde_json::Value,
/// Ending position (exclusive) - flexible JSON value
pub end_cursor: serde_json::Value,
/// Number of items in this batch
pub batch_size: u32,
}
}
Design Notes:
- Cursor values use serde_json::Value for flexibility
- Supports integers, strings, timestamps, UUIDs, etc.
- Batch IDs are zero-padded strings for consistent ordering
- start_cursor is inclusive, end_cursor is exclusive
Example Cursor Configs:
// Numeric cursors (CSV row numbers)
{
"batch_id": "001",
"start_cursor": 1,
"end_cursor": 200,
"batch_size": 200
}
// Timestamp cursors (event processing)
{
"batch_id": "002",
"start_cursor": "2025-01-01T00:00:00Z",
"end_cursor": "2025-01-01T01:00:00Z",
"batch_size": 3600
}
// UUID cursors (database pagination)
{
"batch_id": "003",
"start_cursor": "00000000-0000-0000-0000-000000000000",
"end_cursor": "3fffffff-ffff-ffff-ffff-ffffffffffff",
"batch_size": 1000000
}
BatchWorkerInputs
Location: tasker-shared/src/models/core/batch_worker.rs
Purpose: Stored in workflow_steps.inputs for each worker instance.
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct BatchWorkerInputs {
/// Cursor configuration defining this worker's batch range
pub cursor: CursorConfig,
/// Batch processing metadata
pub batch_metadata: BatchMetadata,
/// Flag indicating if this is a placeholder worker (NoBatches scenario)
#[serde(default)]
pub is_no_op: bool,
}
impl BatchWorkerInputs {
pub fn new(
cursor_config: CursorConfig,
batch_config: &BatchConfiguration,
is_no_op: bool,
) -> Self {
Self {
cursor: cursor_config,
batch_metadata: BatchMetadata {
checkpoint_interval: batch_config.checkpoint_interval,
cursor_field: batch_config.cursor_field.clone(),
failure_strategy: batch_config.failure_strategy.clone(),
},
is_no_op,
}
}
}
}
Storage Location:
- ✅ workflow_steps.inputs (instance-specific runtime data)
- ❌ NOT in step_definition.handler.initialization (that’s the template)
BatchMetadata
Location: tasker-shared/src/models/core/batch_worker.rs
Purpose: Runtime configuration for batch processing behavior.
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct BatchMetadata {
/// Checkpoint frequency (every N items)
pub checkpoint_interval: u32,
/// Field name used for cursor tracking (e.g., "id", "row_number")
pub cursor_field: String,
/// How to handle failures during batch processing
pub failure_strategy: FailureStrategy,
}
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum FailureStrategy {
/// Fail immediately if any item fails
FailFast,
/// Continue processing remaining items, report failures at end
ContinueOnFailure,
/// Isolate failed items to separate queue
IsolateFailed,
}
}
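Putting the pieces together, the per-worker payload stored in workflow_steps.inputs is just the serialized BatchWorkerInputs. A sketch for worker 002 of the CSV example — the values are illustrative, and the module paths are assumed from the file locations listed above:
#![allow(unused)]
fn main() {
use serde_json::json;
use tasker_shared::messaging::execution_types::CursorConfig;
use tasker_shared::models::core::batch_worker::{BatchMetadata, BatchWorkerInputs, FailureStrategy};
// Worker 002 covers rows 201-400 (end_cursor is exclusive)
let inputs = BatchWorkerInputs {
    cursor: CursorConfig {
        batch_id: "002".to_string(),
        start_cursor: json!(201),
        end_cursor: json!(401),
        batch_size: 200,
    },
    batch_metadata: BatchMetadata {
        checkpoint_interval: 100,
        cursor_field: "row_number".to_string(),
        failure_strategy: FailureStrategy::FailFast,
    },
    is_no_op: false,
};
// This JSON is what lands in workflow_steps.inputs for the worker instance
let stored = serde_json::to_value(&inputs).unwrap();
assert_eq!(stored["cursor"]["batch_id"], "002");
assert_eq!(stored["batch_metadata"]["failure_strategy"], "fail_fast");
}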
Implementation Patterns
Rust Implementation
1. Batchable Handler Pattern:
#![allow(unused)]
fn main() {
use tasker_shared::messaging::execution_types::{BatchProcessingOutcome, CursorConfig};
use serde_json::json;
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// 1. Analyze dataset
let dataset_size = analyze_dataset(step_data)?;
let batch_size = get_batch_size_from_config(step_data)?;
// 2. Check if batching needed
if dataset_size == 0 || dataset_size < batch_size {
let outcome = BatchProcessingOutcome::no_batches();
return Ok(success_result(
step_uuid,
json!({ "batch_processing_outcome": outcome.to_value() }),
elapsed_ms,
None,
));
}
// 3. Calculate worker count
let worker_count = (dataset_size as f64 / batch_size as f64).ceil() as u32;
// 4. Generate cursor configs
let cursor_configs = create_cursor_configs(dataset_size, worker_count, batch_size);
// 5. Return CreateBatches outcome
let outcome = BatchProcessingOutcome::create_batches(
"worker_template_name".to_string(),
worker_count,
cursor_configs,
dataset_size,
);
Ok(success_result(
step_uuid,
json!({
"batch_processing_outcome": outcome.to_value(),
"worker_count": worker_count,
"total_items": dataset_size
}),
elapsed_ms,
None,
))
}
fn create_cursor_configs(
total_items: u64,
worker_count: u32,
batch_size: u64,
) -> Vec<CursorConfig> {
let mut cursor_configs = Vec::new();
let items_per_worker = (total_items as f64 / worker_count as f64).ceil() as u64;
for i in 0..worker_count {
let start_position = i as u64 * items_per_worker;
let end_position = ((i + 1) as u64 * items_per_worker).min(total_items);
cursor_configs.push(CursorConfig {
batch_id: format!("{:03}", i + 1),
start_cursor: json!(start_position),
end_cursor: json!(end_position),
batch_size: (end_position - start_position) as u32,
});
}
cursor_configs
}
}
2. Batch Worker Handler Pattern:
#![allow(unused)]
fn main() {
use tasker_worker::batch_processing::BatchWorkerContext;
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// 1. Extract batch worker context using helper
let context = BatchWorkerContext::from_step_data(step_data)?;
// 2. Check for no-op placeholder worker
if context.is_no_op() {
return Ok(success_result(
step_uuid,
json!({
"no_op": true,
"reason": "NoBatches scenario",
"batch_id": context.batch_id()
}),
elapsed_ms,
None,
));
}
// 3. Extract cursor range
let start_idx = context.start_position();
let end_idx = context.end_position();
let checkpoint_interval = context.checkpoint_interval();
// 4. Process items in range
let mut processed_count = 0;
let mut results = initialize_results();
for idx in start_idx..end_idx {
// Process item
let item = get_item(idx)?;
update_results(&mut results, item);
processed_count += 1;
// 5. Checkpoint progress periodically
if processed_count % checkpoint_interval == 0 {
checkpoint_progress(step_uuid, idx).await?;
}
}
// 6. Return results for aggregation
Ok(success_result(
step_uuid,
json!({
"processed_count": processed_count,
"results": results,
"batch_id": context.batch_id(),
"start_position": start_idx,
"end_position": end_idx
}),
elapsed_ms,
None,
))
}
}
3. Convergence Handler Pattern:
#![allow(unused)]
fn main() {
use tasker_worker::batch_processing::BatchAggregationScenario;
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// 1. Detect scenario using helper
let scenario = BatchAggregationScenario::detect(
&step_data.dependency_results,
"batchable_step_name",
"batch_worker_prefix_",
)?;
// 2. Handle both scenarios
let aggregated_results = match scenario {
BatchAggregationScenario::NoBatches { batchable_result } => {
// Get dataset info from batchable step
let total_items = batchable_result
.result.get("total_items")
.and_then(|v| v.as_u64())
.unwrap_or(0);
// Return zero metrics
json!({
"total_processed": total_items,
"worker_count": 0
})
}
BatchAggregationScenario::WithBatches { batch_results, worker_count } => {
// Aggregate results from all workers
let mut total_processed = 0u64;
for (step_name, result) in batch_results {
total_processed += result.result
.get("processed_count")
.and_then(|v| v.as_u64())
.unwrap_or(0);
// Additional aggregation logic...
}
json!({
"total_processed": total_processed,
"worker_count": worker_count
})
}
};
// 3. Return aggregated results
Ok(success_result(
step_uuid,
aggregated_results,
elapsed_ms,
None,
))
}
}
Ruby Implementation
1. Batchable Handler Pattern (using Batchable base class):
module BatchProcessing
class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
def call(task, _sequence, step)
csv_file_path = task.context['csv_file_path']
total_rows = count_csv_rows(csv_file_path)
# Get batch configuration
batch_size = step_definition_initialization['batch_size'] || 200
max_workers = step_definition_initialization['max_workers'] || 5
# Calculate worker count
worker_count = [(total_rows.to_f / batch_size).ceil, max_workers].min
if worker_count == 0 || total_rows == 0
# Use helper for NoBatches outcome
return no_batches_success(
reason: 'dataset_too_small',
total_rows: total_rows
)
end
# Generate cursor configs using helper
cursor_configs = generate_cursor_configs(
total_items: total_rows,
worker_count: worker_count
)
# Use helper for CreateBatches outcome
create_batches_success(
worker_template_name: 'process_csv_batch',
worker_count: worker_count,
cursor_configs: cursor_configs,
total_items: total_rows,
additional_data: {
'csv_file_path' => csv_file_path
}
)
end
private
def count_csv_rows(csv_file_path)
CSV.read(csv_file_path, headers: true).length
end
end
end
2. Batch Worker Handler Pattern (using Batchable base class):
module BatchProcessing
class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
def call(context)
# Extract batch context using helper
batch_ctx = get_batch_context(context)
# Use helper to check for no-op worker
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Get CSV file path from dependency results
csv_file_path = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
# Process CSV rows in cursor range
metrics = process_csv_batch(
csv_file_path,
batch_ctx.start_cursor,
batch_ctx.end_cursor
)
# Return results for aggregation
success(
result_data: {
'processed_count' => metrics[:processed_count],
'total_inventory_value' => metrics[:total_inventory_value],
'category_counts' => metrics[:category_counts],
'batch_id' => batch_ctx.batch_id
}
)
end
private
def process_csv_batch(csv_file_path, start_row, end_row)
metrics = {
processed_count: 0,
total_inventory_value: 0.0,
category_counts: Hash.new(0)
}
CSV.foreach(csv_file_path, headers: true).with_index(1) do |row, data_row_num|
next if data_row_num < start_row
break if data_row_num >= end_row
product = parse_product(row)
metrics[:total_inventory_value] += product.price * product.stock
metrics[:category_counts][product.category] += 1
metrics[:processed_count] += 1
end
metrics
end
end
end
3. Convergence Handler Pattern (using Batchable base class):
module BatchProcessing
class CsvResultsAggregatorHandler < TaskerCore::StepHandler::Batchable
def call(_task, sequence, _step)
# Detect scenario using helper
scenario = detect_aggregation_scenario(
sequence,
batchable_step_name: 'analyze_csv',
batch_worker_prefix: 'process_csv_batch_'
)
# Use helper for aggregation with custom block
aggregate_batch_worker_results(scenario) do |batch_results|
# Custom aggregation logic
total_processed = 0
total_inventory_value = 0.0
global_category_counts = Hash.new(0)
batch_results.each do |step_name, result|
total_processed += result.dig('result', 'processed_count') || 0
total_inventory_value += result.dig('result', 'total_inventory_value') || 0.0
(result.dig('result', 'category_counts') || {}).each do |category, count|
global_category_counts[category] += count
end
end
{
'total_processed' => total_processed,
'total_inventory_value' => total_inventory_value,
'category_counts' => global_category_counts,
'worker_count' => batch_results.size
}
end
end
end
end
Use Cases
1. Large Dataset Processing
Scenario: Process millions of records from a database, file, or API.
Why Batch Processing?
- Single worker would timeout
- Memory constraints prevent loading entire dataset
- Want parallelism for speed
Example: Product catalog synchronization
Batchable: Analyze product table (5 million products)
Workers: 100 workers × 50,000 products each
Convergence: Aggregate sync statistics
Result: 5M products synced in 10 minutes vs 2 hours sequential
2. Time-Based Event Processing
Scenario: Process events from a time-series database or log aggregation system.
Why Batch Processing?
- Events span long time ranges
- Want to process hourly/daily chunks in parallel
- Need resumability for long-running processing
Example: Analytics event processing
Batchable: Analyze events (30 days × 24 hours)
Workers: 720 workers (1 per hour)
Cursors: Timestamp ranges (2025-01-01T00:00 to 2025-01-01T01:00)
Convergence: Aggregate daily/monthly metrics
3. Multi-Source Data Integration
Scenario: Fetch data from multiple external APIs or services.
Why Batch Processing?
- Each source is independent
- Want parallel fetching for speed
- Some sources may fail (need retry per source)
Example: Third-party data enrichment
Batchable: Analyze customer list (partition by data provider)
Workers: 5 workers (1 per provider: Stripe, Salesforce, HubSpot, etc.)
Cursors: Provider-specific identifiers
Convergence: Merge enriched customer profiles
4. Bulk File Processing
Scenario: Process multiple files (CSVs, images, documents).
Why Batch Processing?
- Each file is independent processing unit
- Want parallelism across files
- File sizes vary (dynamic batch sizing)
Example: Image transformation pipeline
Batchable: List S3 bucket objects (1000 images)
Workers: 20 workers × 50 images each
Cursors: S3 object key ranges
Convergence: Verify all images transformed
5. Geographical Data Partitioning
Scenario: Process data partitioned by geography (regions, countries, cities).
Why Batch Processing?
- Geographic boundaries provide natural partitions
- Want parallel processing per region
- Different regions may have different data volumes
Example: Regional sales report generation
Batchable: Analyze sales data (50 US states)
Workers: 50 workers (1 per state)
Cursors: State codes (AL, AK, AZ, ...)
Convergence: National sales dashboard
Operator Workflows
Batch processing integrates seamlessly with the DLQ (Dead Letter Queue) system for operator visibility and manual intervention. This section shows how operators manage failed batch workers.
DLQ Integration Principles
From DLQ System Documentation:
- Investigation Tracking Only: DLQ tracks “why task is stuck” and “who investigated” - it doesn’t manipulate tasks
- Step-Level Resolution: Operators fix problem steps using step APIs, not task-level operations
- Three Resolution Types:
- ResetForRetry: Reset attempts, return step to pending (cursor preserved)
- ResolveManually: Skip step, mark resolved without results
- CompleteManually: Provide manual results for dependent steps
Key for Batch Processing: Cursor data in workflow_steps.results is preserved during ResetForRetry, enabling resumability without data loss.
Staleness Detection for Batch Workers
Batch workers have two staleness detection mechanisms:
1. Duration-Based (Standard):
lifecycle:
max_steps_in_process_minutes: 120 # DLQ threshold
If worker stays in InProgress state for > 120 minutes, flagged as stale.
2. Checkpoint-Based (Batch-Specific):
#![allow(unused)]
fn main() {
// Workers checkpoint progress periodically
if processed_count % checkpoint_interval == 0 {
checkpoint_progress(step_uuid, current_cursor).await?;
}
}
If last checkpoint timestamp is too old, flagged as stale even if within duration threshold.
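The checkpoint-based predicate is simple; a sketch of it in Rust (the helper name is illustrative, and the 15-minute threshold mirrors the DLQ query patterns later in this guide):
#![allow(unused)]
fn main() {
use std::time::{Duration, SystemTime};
// Stale if the last checkpoint is older than the threshold, even when the worker
// is still within max_steps_in_process_minutes.
fn checkpoint_is_stale(last_checkpoint_at: SystemTime, threshold: Duration) -> bool {
    last_checkpoint_at
        .elapsed()
        .map(|age| age > threshold)
        .unwrap_or(false) // timestamp in the future (clock skew): treat as fresh
}
// e.g. checkpoint_is_stale(checkpoint_time, Duration::from_secs(15 * 60))
}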
Common Operator Scenarios
Scenario 1: Transient Database Failure
Problem: 3 out of 5 batch workers failed due to database connection timeout.
Step 1: Find the stuck task in DLQ:
# Get investigation queue (prioritized by age and reason)
curl http://localhost:8080/v1/dlq/investigation-queue | jq
Step 2: Get task details and identify failed workers:
-- Get DLQ entry for the task
SELECT
dlq.dlq_entry_uuid,
dlq.task_uuid,
dlq.dlq_reason,
dlq.resolution_status,
dlq.task_snapshot->'workflow_steps' as steps
FROM tasker.tasks_dlq dlq
WHERE dlq.task_uuid = 'task-uuid-here'
AND dlq.resolution_status = 'pending';
-- Query task's workflow steps to find failed batch workers
SELECT
ws.workflow_step_uuid,
ws.name,
ws.current_state,
ws.attempts,
ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = 'task-uuid-here'
AND ws.name LIKE 'process_csv_batch_%'
AND ws.current_state = 'Error';
Result:
workflow_step_uuid | name | current_state | attempts | last_error
-------------------|------------------------|---------------|----------|------------------
uuid-worker-2 | process_csv_batch_002 | Error | 3 | DB timeout
uuid-worker-4 | process_csv_batch_004 | Error | 3 | DB timeout
uuid-worker-5 | process_csv_batch_005 | Error | 3 | DB timeout
Operator Action: Database is now healthy - reset workers for retry
# Get task UUID from DLQ entry
TASK_UUID="abc-123-task-uuid"
# Reset worker 2 (preserves cursor: rows 201-400)
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-2 \
-H "Content-Type: application/json" \
-d '{
"action_type": "reset_for_retry",
"reset_by": "operator@example.com",
"reason": "Database connection restored, resetting attempts"
}'
# Reset workers 4 and 5
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-4 \
-H "Content-Type: application/json" \
-d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Database connection restored"}'
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-5 \
-H "Content-Type: application/json" \
-d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Database connection restored"}'
# Update DLQ entry to track resolution
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
-H "Content-Type: application/json" \
-d '{
"resolution_status": "manually_resolved",
"resolution_notes": "Reset 3 failed batch workers after database connection restored",
"resolved_by": "operator@example.com"
}'
Result:
- Workers 2, 4, 5 return to Pending state
- Cursor configs preserved in workflow_steps.inputs
- Retry attempt counter reset to 0
- Workers re-enqueued for execution
- DLQ entry updated with resolution metadata
Scenario 2: Bad Data in Specific Batch
Problem: Worker 3 repeatedly fails due to malformed CSV row in its range (rows 401-600).
Investigation:
-- Get worker details
SELECT
ws.name,
ws.current_state,
ws.attempts,
ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.workflow_step_uuid = 'uuid-worker-3';
Result:
name: process_csv_batch_003
current_state: Error
attempts: 3
last_error: "CSV parsing failed at row 523: invalid price format"
Operator Decision: Row 523 has a known data quality issue, already fixed in the source system.
Option 1: Complete Manually (provide results for this batch):
TASK_UUID="abc-123-task-uuid"
STEP_UUID="uuid-worker-3"
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/${STEP_UUID} \
-H "Content-Type: application/json" \
-d '{
"action_type": "complete_manually",
"completion_data": {
"result": {
"processed_count": 199,
"total_inventory_value": 45232.50,
"category_counts": {"electronics": 150, "clothing": 49},
"batch_id": "003",
"note": "Row 523 skipped due to data quality issue, manually verified totals"
},
"metadata": {
"manually_verified": true,
"verification_method": "manual_inspection",
"skipped_rows": [523]
}
},
"reason": "Manual completion after verifying corrected data in source system",
"completed_by": "operator@example.com"
}'
Option 2: Resolve Manually (skip this batch):
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/${STEP_UUID} \
-H "Content-Type: application/json" \
-d '{
"action_type": "resolve_manually",
"resolved_by": "operator@example.com",
"reason": "Non-critical batch containing known bad data, skipping 200 rows out of 1000 total"
}'
Result (Option 1):
- Worker 3 marked Complete with manual results
- Convergence step receives manual results in aggregation
- Task completes successfully with note about manual intervention
Result (Option 2):
- Worker 3 marked ResolvedManually (no results provided)
- Convergence step detects missing results, adjusts aggregation
- Task completes with reduced total (800 rows instead of 1000)
Scenario 3: Long-Running Worker Needs Checkpoint
Problem: Worker 1 processing 10,000 rows, operator notices it’s been running 90 minutes (threshold: 120 minutes).
Investigation:
-- Check last checkpoint
SELECT
ws.name,
ws.current_state,
ws.results->>'last_checkpoint_cursor' as last_checkpoint,
ws.results->>'checkpoint_timestamp' as checkpoint_time,
NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint
FROM tasker.workflow_steps ws
WHERE ws.workflow_step_uuid = 'uuid-worker-1';
Result:
name: process_large_batch_001
current_state: InProgress
last_checkpoint: 7850
checkpoint_time: 2025-01-15 11:30:00
time_since_checkpoint: 00:05:00
Operator Action: Worker is healthy and making progress (checkpointed 5 minutes ago at row 7850).
No action needed - worker will complete normally. Operator adds investigation note to DLQ entry:
DLQ_ENTRY_UUID="dlq-entry-uuid-here"
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
-H "Content-Type: application/json" \
-d '{
"metadata": {
"investigation_notes": "Worker healthy, last checkpoint at row 7850 (5 min ago), estimated 15 min remaining",
"investigator": "operator@example.com",
"timestamp": "2025-01-15T11:35:00Z",
"action_taken": "none - monitoring"
}
}'
Scenario 4: All Workers Failed - Batch Strategy Issue
Problem: All 10 workers fail with “memory exhausted” error - batch size too large.
Investigation via API:
TASK_UUID="task-uuid-here"
# Get task details including all workflow steps
curl http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps | jq '.[] | select(.name | startswith("process_large_batch_")) | {name, current_state, attempts, last_error}'
Result: All workers show current_state: "Error" with the same OOM error in last_error.
Operator Action: Cancel entire task, will re-run with smaller batch size.
DLQ_ENTRY_UUID="dlq-entry-uuid-here"
# Cancel task (cancels all workers)
curl -X DELETE http://localhost:8080/v1/tasks/${TASK_UUID}
# Update DLQ entry to track resolution
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
-H "Content-Type: application/json" \
-d '{
"resolution_status": "permanently_failed",
"resolution_notes": "Batch size too large causing OOM. Cancelled task and created new task with batch_size: 5000 instead of 10000",
"resolved_by": "operator@example.com",
"metadata": {
"root_cause": "configuration_error",
"corrective_action": "reduced_batch_size",
"new_task_uuid": "new-task-uuid-here"
}
}'
# Create new task with corrected configuration
curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"namespace": "data_processing",
"template_name": "large_dataset_processor",
"context": {
"dataset_id": "dataset-123",
"batch_size": 5000,
"max_workers": 20
}
}'
DLQ Query Patterns for Batch Processing
1. Find DLQ entry for a batch processing task:
-- Get DLQ entry with task snapshot
SELECT
dlq.dlq_entry_uuid,
dlq.task_uuid,
dlq.dlq_reason,
dlq.resolution_status,
dlq.dlq_timestamp,
dlq.resolution_notes,
dlq.resolved_by,
dlq.task_snapshot->'namespace_name' as namespace,
dlq.task_snapshot->'template_name' as template,
dlq.task_snapshot->'current_state' as task_state
FROM tasker.tasks_dlq dlq
WHERE dlq.task_uuid = :task_uuid
AND dlq.resolution_status = 'pending'
ORDER BY dlq.dlq_timestamp DESC
LIMIT 1;
2. Check batch completion progress:
SELECT
COUNT(*) FILTER (WHERE ws.current_state = 'Complete') as completed_workers,
COUNT(*) FILTER (WHERE ws.current_state = 'InProgress') as in_progress_workers,
COUNT(*) FILTER (WHERE ws.current_state = 'Error') as failed_workers,
COUNT(*) FILTER (WHERE ws.current_state = 'Pending') as pending_workers,
COUNT(*) FILTER (WHERE ws.current_state = 'WaitingForRetry') as waiting_retry,
COUNT(*) as total_workers
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
AND ws.name LIKE 'process_%_batch_%';
3. Find workers with stale checkpoints:
SELECT
ws.workflow_step_uuid,
ws.name,
ws.current_state,
ws.results->>'last_checkpoint_cursor' as checkpoint_cursor,
ws.results->>'checkpoint_timestamp' as checkpoint_time,
NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint,
ws.attempts,
ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
AND ws.name LIKE 'process_%_batch_%'
AND ws.current_state = 'InProgress'
AND ws.results->>'checkpoint_timestamp' IS NOT NULL
AND NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz > interval '15 minutes'
ORDER BY time_since_checkpoint DESC;
4. Get aggregated batch task health:
SELECT
t.task_uuid,
t.namespace_name,
t.template_name,
t.current_state as task_state,
t.execution_status,
COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as worker_count,
-- per-state worker counts as a JSON object (scalar subquery avoids nesting aggregates)
(
  SELECT jsonb_object_agg(s.current_state, s.state_count)
  FROM (
    SELECT ws2.current_state, COUNT(*) AS state_count
    FROM tasker.workflow_steps ws2
    WHERE ws2.task_uuid = t.task_uuid
      AND ws2.name LIKE 'process_%_batch_%'
    GROUP BY ws2.current_state
  ) s
) as worker_states,
dlq.dlq_reason,
dlq.resolution_status
FROM tasker.tasks t
JOIN tasker.workflow_steps ws ON ws.task_uuid = t.task_uuid
LEFT JOIN tasker.tasks_dlq dlq ON dlq.task_uuid = t.task_uuid
AND dlq.resolution_status = 'pending'
WHERE t.task_uuid = :task_uuid
GROUP BY t.task_uuid, t.namespace_name, t.template_name, t.current_state, t.execution_status,
dlq.dlq_reason, dlq.resolution_status;
5. Find all batch tasks in DLQ:
-- Find tasks with batch workers that are stuck
SELECT
dlq.dlq_entry_uuid,
dlq.task_uuid,
dlq.dlq_reason,
dlq.dlq_timestamp,
t.namespace_name,
t.template_name,
t.current_state,
COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as batch_worker_count,
COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.current_state = 'Error' AND ws.name LIKE 'process_%_batch_%') as failed_workers
FROM tasker.tasks_dlq dlq
JOIN tasker.tasks t ON t.task_uuid = dlq.task_uuid
JOIN tasker.workflow_steps ws ON ws.task_uuid = dlq.task_uuid
WHERE dlq.resolution_status = 'pending'
GROUP BY dlq.dlq_entry_uuid, dlq.task_uuid, dlq.dlq_reason, dlq.dlq_timestamp,
t.namespace_name, t.template_name, t.current_state
HAVING COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') > 0
ORDER BY dlq.dlq_timestamp DESC;
Operator Dashboard Recommendations
For monitoring batch processing tasks, operators should have dashboards showing:
-
Batch Progress:
- Total workers vs completed workers
- Estimated completion time based on worker velocity
- Current throughput (items/second across all workers)
-
Stale Worker Alerts:
- Workers exceeding duration threshold
- Workers with stale checkpoints
- Workers with repeated failures
-
Batch Health Metrics:
- Success rate per batch
- Average processing time per worker
- Resource utilization (CPU, memory)
-
Resolution Actions:
- Recent operator interventions
- Resolution action distribution (ResetForRetry vs ResolveManually)
- Time to resolution for stale workers
Code Examples
Complete Working Example: CSV Product Inventory
This section shows a complete end-to-end implementation processing a 1000-row CSV file in 5 parallel batches.
Rust Implementation
1. Batchable Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:60-150
#![allow(unused)]
fn main() {
pub struct CsvAnalyzerHandler;
#[async_trait]
impl StepHandler for CsvAnalyzerHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Get CSV file path from task context
let csv_file_path = step_data
.task
.context
.get("csv_file_path")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow!("Missing csv_file_path in task context"))?;
// Count total data rows (excluding header)
let file = File::open(csv_file_path)?;
let reader = BufReader::new(file);
let total_rows = reader.lines().count().saturating_sub(1) as u64;
info!("CSV Analysis: {} rows in {}", total_rows, csv_file_path);
// Get batch configuration
let handler_init = step_data.handler_initialization.as_object().unwrap();
let batch_size = handler_init
.get("batch_size")
.and_then(|v| v.as_u64())
.unwrap_or(200);
let max_workers = handler_init
.get("max_workers")
.and_then(|v| v.as_u64())
.unwrap_or(5);
// Determine if batching needed
if total_rows == 0 {
let outcome = BatchProcessingOutcome::no_batches();
let elapsed_ms = start_time.elapsed().as_millis() as u64;
return Ok(success_result(
step_uuid,
json!({
"batch_processing_outcome": outcome.to_value(),
"reason": "empty_dataset",
"total_rows": 0
}),
elapsed_ms,
None,
));
}
// Calculate worker count
let worker_count = ((total_rows as f64 / batch_size as f64).ceil() as u64)
.min(max_workers);
// Generate cursor configurations
let actual_batch_size = (total_rows as f64 / worker_count as f64).ceil() as u64;
let mut cursor_configs = Vec::new();
for i in 0..worker_count {
let start_row = (i * actual_batch_size) + 1; // 1-indexed after header
let end_row = ((i + 1) * actual_batch_size).min(total_rows) + 1;
cursor_configs.push(CursorConfig {
batch_id: format!("{:03}", i + 1),
start_cursor: json!(start_row),
end_cursor: json!(end_row),
batch_size: (end_row - start_row) as u32,
});
}
info!(
"Creating {} batch workers for {} rows (batch_size: {})",
worker_count, total_rows, actual_batch_size
);
// Return CreateBatches outcome
let outcome = BatchProcessingOutcome::create_batches(
"process_csv_batch".to_string(),
worker_count as u32,
cursor_configs,
total_rows,
);
let elapsed_ms = start_time.elapsed().as_millis() as u64;
Ok(success_result(
step_uuid,
json!({
"batch_processing_outcome": outcome.to_value(),
"worker_count": worker_count,
"total_rows": total_rows,
"csv_file_path": csv_file_path
}),
elapsed_ms,
Some(json!({
"batch_size": actual_batch_size,
"file_path": csv_file_path
})),
))
}
}
}
2. Batch Worker Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:200-350
#![allow(unused)]
fn main() {
pub struct CsvBatchProcessorHandler;
#[derive(Debug, Deserialize)]
struct Product {
id: u32,
title: String,
category: String,
price: f64,
stock: u32,
rating: f64,
}
#[async_trait]
impl StepHandler for CsvBatchProcessorHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Extract batch worker context using helper
let context = BatchWorkerContext::from_step_data(step_data)?;
// Check for no-op placeholder worker
if context.is_no_op() {
let elapsed_ms = start_time.elapsed().as_millis() as u64;
return Ok(success_result(
step_uuid,
json!({
"no_op": true,
"reason": "NoBatches scenario - no items to process",
"batch_id": context.batch_id()
}),
elapsed_ms,
None,
));
}
// Get CSV file path from dependency results
let csv_file_path = step_data
.dependency_results
.get("analyze_csv")
.and_then(|r| r.result.get("csv_file_path"))
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow!("Missing csv_file_path from analyze_csv"))?;
// Extract cursor range
let start_row = context.start_position();
let end_row = context.end_position();
info!(
"Processing batch {} (rows {}-{})",
context.batch_id(),
start_row,
end_row
);
// Initialize metrics
let mut processed_count = 0u64;
let mut total_inventory_value = 0.0;
let mut category_counts: HashMap<String, u32> = HashMap::new();
let mut max_price = 0.0;
let mut max_price_product = None;
let mut total_rating = 0.0;
// Open CSV and process rows in cursor range
let file = File::open(Path::new(csv_file_path))?;
let mut csv_reader = csv::ReaderBuilder::new()
.has_headers(true)
.from_reader(BufReader::new(file));
for (row_idx, result) in csv_reader.deserialize::<Product>().enumerate() {
let data_row_num = row_idx + 1; // 1-indexed after header
if data_row_num < start_row {
continue; // Skip rows before our range
}
if data_row_num >= end_row {
break; // Processed all our rows
}
let product: Product = result?;
// Calculate inventory metrics
let inventory_value = product.price * (product.stock as f64);
total_inventory_value += inventory_value;
*category_counts.entry(product.category.clone()).or_insert(0) += 1;
if product.price > max_price {
max_price = product.price;
max_price_product = Some(product.title.clone());
}
total_rating += product.rating;
processed_count += 1;
// Checkpoint progress periodically
if processed_count % context.checkpoint_interval() == 0 {
debug!(
"Checkpoint: batch {} processed {} items",
context.batch_id(),
processed_count
);
}
}
let average_rating = if processed_count > 0 {
total_rating / (processed_count as f64)
} else {
0.0
};
let elapsed_ms = start_time.elapsed().as_millis() as u64;
info!(
"Batch {} complete: {} items processed",
context.batch_id(),
processed_count
);
Ok(success_result(
step_uuid,
json!({
"processed_count": processed_count,
"total_inventory_value": total_inventory_value,
"category_counts": category_counts,
"max_price": max_price,
"max_price_product": max_price_product,
"average_rating": average_rating,
"batch_id": context.batch_id(),
"start_row": start_row,
"end_row": end_row
}),
elapsed_ms,
None,
))
}
}
}
3. Convergence Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:400-520
#![allow(unused)]
fn main() {
pub struct CsvResultsAggregatorHandler;
#[async_trait]
impl StepHandler for CsvResultsAggregatorHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Detect scenario using helper
let scenario = BatchAggregationScenario::detect(
&step_data.dependency_results,
"analyze_csv",
"process_csv_batch_",
)?;
let (total_processed, total_inventory_value, category_counts, max_price, max_price_product, overall_avg_rating, worker_count) = match scenario {
BatchAggregationScenario::NoBatches { batchable_result } => {
// No batch workers - get dataset size from batchable step
let total_rows = batchable_result
.result
.get("total_rows")
.and_then(|v| v.as_u64())
.unwrap_or(0);
info!("NoBatches scenario: {} rows (no processing needed)", total_rows);
(total_rows, 0.0, HashMap::new(), 0.0, None, 0.0, 0)
}
BatchAggregationScenario::WithBatches {
batch_results,
worker_count,
} => {
info!("Aggregating results from {} batch workers", worker_count);
let mut total_processed = 0u64;
let mut total_inventory_value = 0.0;
let mut global_category_counts: HashMap<String, u64> = HashMap::new();
let mut max_price = 0.0;
let mut max_price_product = None;
let mut weighted_ratings = Vec::new();
for (step_name, result) in batch_results {
// Sum processed counts
let count = result
.result
.get("processed_count")
.and_then(|v| v.as_u64())
.unwrap_or(0);
total_processed += count;
// Sum inventory values
let value = result
.result
.get("total_inventory_value")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
total_inventory_value += value;
// Merge category counts
if let Some(categories) = result
.result
.get("category_counts")
.and_then(|v| v.as_object())
{
for (category, cat_count) in categories {
*global_category_counts
.entry(category.clone())
.or_insert(0) += cat_count.as_u64().unwrap_or(0);
}
}
// Find global max price
let batch_max_price = result
.result
.get("max_price")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
if batch_max_price > max_price {
max_price = batch_max_price;
max_price_product = result
.result
.get("max_price_product")
.and_then(|v| v.as_str())
.map(String::from);
}
// Collect ratings for weighted average
let avg_rating = result
.result
.get("average_rating")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
weighted_ratings.push((count, avg_rating));
}
// Calculate overall weighted average rating
let total_items = weighted_ratings.iter().map(|(c, _)| c).sum::<u64>();
let overall_avg_rating = if total_items > 0 {
weighted_ratings
.iter()
.map(|(count, avg)| (*count as f64) * avg)
.sum::<f64>()
/ (total_items as f64)
} else {
0.0
};
(
total_processed,
total_inventory_value,
global_category_counts,
max_price,
max_price_product,
overall_avg_rating,
worker_count,
)
}
};
let elapsed_ms = start_time.elapsed().as_millis() as u64;
info!(
"Aggregation complete: {} total items processed by {} workers",
total_processed, worker_count
);
Ok(success_result(
step_uuid,
json!({
"total_processed": total_processed,
"total_inventory_value": total_inventory_value,
"category_counts": category_counts,
"max_price": max_price,
"max_price_product": max_price_product,
"overall_average_rating": overall_avg_rating,
"worker_count": worker_count
}),
elapsed_ms,
None,
))
}
}
}
Ruby Implementation
1. Batchable Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_analyzer_handler.rb
module BatchProcessing
module StepHandlers
# CSV Analyzer - Batchable Step
class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
def call(task, _sequence, step)
csv_file_path = task.context['csv_file_path']
raise ArgumentError, 'Missing csv_file_path in task context' unless csv_file_path
# Count CSV rows (excluding header)
total_rows = count_csv_rows(csv_file_path)
Rails.logger.info("CSV Analysis: #{total_rows} rows in #{csv_file_path}")
# Get batch configuration from handler initialization
batch_size = step_definition_initialization['batch_size'] || 200
max_workers = step_definition_initialization['max_workers'] || 5
# Calculate worker count
worker_count = [(total_rows.to_f / batch_size).ceil, max_workers].min
if worker_count.zero? || total_rows.zero?
# Use helper for NoBatches outcome
return no_batches_success(
reason: 'empty_dataset',
total_rows: total_rows
)
end
# Generate cursor configs using helper
cursor_configs = generate_cursor_configs(
total_items: total_rows,
worker_count: worker_count
) do |batch_idx, start_pos, end_pos, items_in_batch|
# Adjust to 1-indexed row numbers (after header)
{
'batch_id' => format('%03d', batch_idx + 1),
'start_cursor' => start_pos + 1,
'end_cursor' => end_pos + 1,
'batch_size' => items_in_batch
}
end
Rails.logger.info("Creating #{worker_count} batch workers for #{total_rows} rows")
# Use helper for CreateBatches outcome
create_batches_success(
worker_template_name: 'process_csv_batch',
worker_count: worker_count,
cursor_configs: cursor_configs,
total_items: total_rows,
additional_data: {
'csv_file_path' => csv_file_path
}
)
end
private
def count_csv_rows(csv_file_path)
CSV.read(csv_file_path, headers: true).length
end
def step_definition_initialization
@step_definition_initialization ||= {} # simplified stub for this example; real handlers read this from the step definition's handler initialization
end
end
end
end
2. Batch Worker Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_batch_processor_handler.rb
module BatchProcessing
module StepHandlers
# CSV Batch Processor - Batch Worker
class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
Product = Struct.new(
:id, :title, :description, :category, :price,
:discount_percentage, :rating, :stock, :brand, :sku, :weight,
keyword_init: true
)
def call(context)
# Extract batch context using helper
batch_ctx = get_batch_context(context)
# Use helper to check for no-op worker
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Get CSV file path from dependency results
csv_file_path = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
raise ArgumentError, 'Missing csv_file_path from analyze_csv' unless csv_file_path
Rails.logger.info("Processing batch #{batch_ctx.batch_id} (rows #{batch_ctx.start_cursor}-#{batch_ctx.end_cursor})")
# Process CSV rows in cursor range
metrics = process_csv_batch(
csv_file_path,
batch_ctx.start_cursor,
batch_ctx.end_cursor
)
Rails.logger.info("Batch #{batch_ctx.batch_id} complete: #{metrics[:processed_count]} items processed")
# Return results for aggregation
success(
result_data: {
'processed_count' => metrics[:processed_count],
'total_inventory_value' => metrics[:total_inventory_value],
'category_counts' => metrics[:category_counts],
'max_price' => metrics[:max_price],
'max_price_product' => metrics[:max_price_product],
'average_rating' => metrics[:average_rating],
'batch_id' => batch_ctx.batch_id,
'start_row' => batch_ctx.start_cursor,
'end_row' => batch_ctx.end_cursor
}
)
end
private
def process_csv_batch(csv_file_path, start_row, end_row)
metrics = {
processed_count: 0,
total_inventory_value: 0.0,
category_counts: Hash.new(0),
max_price: 0.0,
max_price_product: nil,
ratings: []
}
CSV.foreach(csv_file_path, headers: true).with_index(1) do |row, data_row_num|
# Skip rows before our range
next if data_row_num < start_row
# Break when we've processed all our rows
break if data_row_num >= end_row
product = parse_product(row)
# Calculate inventory metrics
inventory_value = product.price * product.stock
metrics[:total_inventory_value] += inventory_value
metrics[:category_counts][product.category] += 1
if product.price > metrics[:max_price]
metrics[:max_price] = product.price
metrics[:max_price_product] = product.title
end
metrics[:ratings] << product.rating
metrics[:processed_count] += 1
end
# Calculate average rating
metrics[:average_rating] = if metrics[:ratings].any?
metrics[:ratings].sum / metrics[:ratings].size.to_f
else
0.0
end
metrics.except(:ratings)
end
def parse_product(row)
Product.new(
id: row['id'].to_i,
title: row['title'],
description: row['description'],
category: row['category'],
price: row['price'].to_f,
discount_percentage: row['discountPercentage'].to_f,
rating: row['rating'].to_f,
stock: row['stock'].to_i,
brand: row['brand'],
sku: row['sku'],
weight: row['weight'].to_i
)
end
end
end
end
3. Convergence Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_results_aggregator_handler.rb
module BatchProcessing
module StepHandlers
# CSV Results Aggregator - Deferred Convergence Step
class CsvResultsAggregatorHandler < TaskerCore::StepHandler::Batchable
def call(_task, sequence, _step)
# Detect scenario using helper
scenario = detect_aggregation_scenario(
sequence,
batchable_step_name: 'analyze_csv',
batch_worker_prefix: 'process_csv_batch_'
)
# Use helper for aggregation with custom block
aggregate_batch_worker_results(scenario) do |batch_results|
aggregate_csv_metrics(batch_results)
end
end
private
def aggregate_csv_metrics(batch_results)
total_processed = 0
total_inventory_value = 0.0
global_category_counts = Hash.new(0)
max_price = 0.0
max_price_product = nil
weighted_ratings = []
batch_results.each do |step_name, batch_result|
result = batch_result['result'] || {}
# Sum processed counts
count = result['processed_count'] || 0
total_processed += count
# Sum inventory values
total_inventory_value += result['total_inventory_value'] || 0.0
# Merge category counts
(result['category_counts'] || {}).each do |category, cat_count|
global_category_counts[category] += cat_count
end
# Find global max price
batch_max_price = result['max_price'] || 0.0
if batch_max_price > max_price
max_price = batch_max_price
max_price_product = result['max_price_product']
end
# Collect ratings for weighted average
avg_rating = result['average_rating'] || 0.0
weighted_ratings << { count: count, avg: avg_rating }
end
# Calculate overall weighted average rating
total_items = weighted_ratings.sum { |r| r[:count] }
overall_avg_rating = if total_items.positive?
weighted_ratings.sum { |r| r[:avg] * r[:count] } / total_items.to_f
else
0.0
end
Rails.logger.info("Aggregation complete: #{total_processed} total items processed by #{batch_results.size} workers")
{
'total_processed' => total_processed,
'total_inventory_value' => total_inventory_value,
'category_counts' => global_category_counts,
'max_price' => max_price,
'max_price_product' => max_price_product,
'overall_average_rating' => overall_avg_rating,
'worker_count' => batch_results.size
}
end
end
end
end
YAML Template
File: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml
---
name: csv_product_inventory_analyzer
namespace_name: csv_processing
version: "1.0.0"
description: "Process CSV product data in parallel batches"
task_handler:
callable: rust_handler
initialization: {}
steps:
# BATCHABLE STEP: CSV Analysis and Batch Planning
- name: analyze_csv
type: batchable
dependencies: []
handler:
callable: CsvAnalyzerHandler
initialization:
batch_size: 200
max_workers: 5
# BATCH WORKER TEMPLATE: Single CSV Batch Processing
# Orchestration creates N instances from this template
- name: process_csv_batch
type: batch_worker
dependencies:
- analyze_csv
lifecycle:
max_steps_in_process_minutes: 120
max_retries: 3
backoff_multiplier: 2.0
handler:
callable: CsvBatchProcessorHandler
initialization:
operation: "inventory_analysis"
# DEFERRED CONVERGENCE STEP: CSV Results Aggregation
- name: aggregate_csv_results
type: deferred_convergence
dependencies:
- process_csv_batch # Template dependency - resolves to all worker instances
handler:
callable: CsvResultsAggregatorHandler
initialization:
aggregation_type: "inventory_metrics"
Best Practices
1. Batch Size Calculation
Guideline: Balance parallelism with overhead.
Too Small:
- Excessive orchestration overhead
- Too many database transactions
- Diminishing returns on parallelism
Too Large:
- Workers timeout or OOM
- Long retry times on failure
- Reduced parallelism
Recommended Approach:
def calculate_optimal_batch_size(total_items, item_processing_time_ms)
# Target: Each batch takes 5-10 minutes
target_duration_ms = 7.5 * 60 * 1000
# Calculate items per batch
items_per_batch = (target_duration_ms / item_processing_time_ms).ceil
# Enforce min/max bounds
[[items_per_batch, 100].max, 10000].min
end
2. Worker Count Limits
Guideline: Limit concurrency based on system resources.
handler:
initialization:
batch_size: 200
max_workers: 10 # Prevents creating 100 workers for 20,000 items
Considerations:
- Database connection pool size
- Memory per worker
- External API rate limits
- CPU cores available
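A sketch of the capping arithmetic (the function name is illustrative; the batchable handlers above compute the same thing inline):
#![allow(unused)]
fn main() {
// Cap fan-out so a large dataset cannot spawn an unbounded number of workers
fn capped_worker_count(total_items: u64, batch_size: u64, max_workers: u64) -> u64 {
    if total_items == 0 {
        return 0; // NoBatches scenario
    }
    ((total_items as f64 / batch_size as f64).ceil() as u64).min(max_workers)
}
// 20,000 items at batch_size 200 would need 100 workers; max_workers caps it at 10
assert_eq!(capped_worker_count(20_000, 200, 10), 10);
}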
3. Cursor Design
Guideline: Use cursors that support resumability.
Good Cursor Types:
- ✅ Integer offsets: start_cursor: 1000, end_cursor: 2000
- ✅ Timestamps: start_cursor: "2025-01-01T00:00:00Z"
- ✅ Database IDs: start_cursor: uuid_a, end_cursor: uuid_b
- ✅ Composite keys: { date: "2025-01-01", partition: "US-WEST" } (see the sketch below)
Bad Cursor Types:
- ❌ Page numbers (data can shift between pages)
- ❌ Non-deterministic ordering (random, relevance scores)
- ❌ Mutable values (last_modified_at can change)
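Because start_cursor and end_cursor are arbitrary serde_json::Value fields, composite keys are just nested objects. A sketch with illustrative field names:
#![allow(unused)]
fn main() {
use serde_json::json;
use tasker_shared::messaging::execution_types::CursorConfig;
// Composite cursor: partition by date and region in a single JSON value
let cursor = CursorConfig {
    batch_id: "001".to_string(),
    start_cursor: json!({ "date": "2025-01-01", "partition": "US-WEST" }),
    end_cursor: json!({ "date": "2025-01-02", "partition": "US-WEST" }),
    batch_size: 86_400,
};
}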
4. Checkpoint Frequency
Guideline: Balance resumability with performance.
#![allow(unused)]
fn main() {
// Checkpoint every 100 items
if processed_count % 100 == 0 {
checkpoint_progress(step_uuid, current_cursor).await?;
}
}
Factors:
- Item processing time (faster = higher frequency)
- Worker failure rate (higher = more frequent checkpoints)
- Database write overhead (less frequent = better performance)
Recommended:
- Fast items (< 10ms each): Checkpoint every 1000 items
- Medium items (10-100ms each): Checkpoint every 100 items
- Slow items (> 100ms each): Checkpoint every 10 items
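Those recommendations reduce to a simple lookup; a sketch using the thresholds from the list above (the helper name is illustrative):
#![allow(unused)]
fn main() {
// Derive a checkpoint interval from the average per-item processing time
fn checkpoint_interval_for(avg_item_ms: f64) -> u32 {
    match avg_item_ms {
        t if t < 10.0 => 1000,  // fast items
        t if t <= 100.0 => 100, // medium items
        _ => 10,                // slow items
    }
}
assert_eq!(checkpoint_interval_for(2.0), 1000);
assert_eq!(checkpoint_interval_for(50.0), 100);
assert_eq!(checkpoint_interval_for(250.0), 10);
}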
5. Error Handling Strategies
FailFast (default):
#![allow(unused)]
fn main() {
FailureStrategy::FailFast
}
- Worker fails immediately on first error
- Suitable for: Data validation, schema violations
- Cursor is preserved, so a retry restarts the batch from its configured range
ContinueOnFailure:
#![allow(unused)]
fn main() {
FailureStrategy::ContinueOnFailure
}
- Worker processes all items, collects errors
- Suitable for: Best-effort processing, partial results acceptable
- Returns both results and error list
IsolateFailed:
#![allow(unused)]
fn main() {
FailureStrategy::IsolateFailed
}
- Failed items moved to separate queue
- Suitable for: Large batches with few expected failures
- Allows manual review of failed items
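A sketch of how a worker loop might branch on the configured strategy. Item, process_item, and isolate_item are stand-ins for this example, not Tasker APIs:
#![allow(unused)]
fn main() {
use tasker_shared::models::core::batch_worker::FailureStrategy;
// Minimal stand-in types for illustration
struct Item { id: u64 }
struct ItemError(String);
fn process_item(_item: &Item) -> Result<(), ItemError> { Ok(()) }
fn isolate_item(_item: Item, _err: ItemError) { /* push to a side queue for review */ }
fn run_batch(items: Vec<Item>, strategy: &FailureStrategy) -> Result<Vec<(u64, String)>, ItemError> {
    let mut failures = Vec::new();
    for item in items {
        if let Err(err) = process_item(&item) {
            match strategy {
                // Abort the batch; the preserved cursor lets a retry restart the range
                FailureStrategy::FailFast => return Err(err),
                // Keep going; report the error list alongside the results
                FailureStrategy::ContinueOnFailure => failures.push((item.id, err.0)),
                // Route the failed item aside for manual review
                FailureStrategy::IsolateFailed => isolate_item(item, err),
            }
        }
    }
    Ok(failures)
}
}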
6. Aggregation Patterns
Sum/Count:
#![allow(unused)]
fn main() {
let total = batch_results.iter()
.map(|(_, r)| r.result.get("count").unwrap().as_u64().unwrap())
.sum::<u64>();
}
Max/Min:
#![allow(unused)]
fn main() {
let max_value = batch_results.iter()
.filter_map(|(_, r)| r.result.get("max").and_then(|v| v.as_f64()))
.max_by(|a, b| a.partial_cmp(b).unwrap())
.unwrap_or(0.0);
}
Weighted Average:
#![allow(unused)]
fn main() {
let total_weight: u64 = weighted_values.iter().map(|(w, _)| w).sum();
let weighted_avg = weighted_values.iter()
.map(|(weight, value)| (*weight as f64) * value)
.sum::<f64>() / (total_weight as f64);
}
Merge HashMaps:
#![allow(unused)]
fn main() {
let mut merged = HashMap::new();
for (_, result) in batch_results {
if let Some(counts) = result.get("counts").and_then(|v| v.as_object()) {
for (key, count) in counts {
*merged.entry(key.clone()).or_insert(0) += count.as_u64().unwrap();
}
}
}
}
7. Testing Strategies
Unit Tests: Test handler logic independently
#![allow(unused)]
fn main() {
#[test]
fn test_cursor_generation() {
let configs = create_cursor_configs(1000, 5, 200);
assert_eq!(configs.len(), 5);
assert_eq!(configs[0].start_cursor, json!(0));
assert_eq!(configs[0].end_cursor, json!(200));
}
}
Integration Tests: Test with small datasets
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_batch_processing_integration() {
let task = create_task_with_csv("test_data_10_rows.csv").await;
assert_eq!(task.current_state, TaskState::Complete);
let steps = get_workflow_steps(task.task_uuid).await;
let workers = steps.iter().filter(|s| s.step_type == "batch_worker").count();
assert_eq!(workers, 1); // 10 rows = 1 worker with batch_size 200
}
}
E2E Tests: Test complete workflow with realistic data
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_csv_batch_processing_e2e() {
let task = create_task_with_csv("products_1000_rows.csv").await;
wait_for_completion(task.task_uuid, Duration::from_secs(60)).await;
let results = get_aggregation_results(task.task_uuid).await;
assert_eq!(results["total_processed"], 1000);
assert_eq!(results["worker_count"], 5);
}
}
8. Monitoring and Observability
Metrics to Track:
- Worker creation time
- Individual worker duration
- Batch size distribution
- Retry rate per batch
- Aggregation duration
Recommended Dashboards:
-- Batch processing health
SELECT
COUNT(*) FILTER (WHERE step_type = 'batch_worker') as total_workers,
AVG(EXTRACT(EPOCH FROM (updated_at - created_at))) as avg_worker_duration_sec,
MAX(EXTRACT(EPOCH FROM (updated_at - created_at))) as max_worker_duration_sec,
COUNT(*) FILTER (WHERE current_state = 'Error') as failed_workers
FROM tasker.workflow_steps
WHERE task_uuid = :task_uuid
AND step_type = 'batch_worker';
Summary
Batch processing in Tasker provides a robust, production-ready pattern for parallel dataset processing:
Key Strengths:
- ✅ Builds on proven DAG, retry, and deferred convergence foundations
- ✅ No special recovery system needed (uses standard DLQ + retry)
- ✅ Transaction-based worker creation prevents corruption
- ✅ Cursor-based resumability enables long-running processing
- ✅ Language-agnostic design works across Rust and Ruby workers
Integration Points:
- DAG: Workers are full nodes with standard lifecycle
- Retryability: Uses lifecycle.max_retries and exponential backoff
- Deferred Convergence: Intersection semantics aggregate dynamic worker counts
- DLQ: Standard operator workflows with cursor preservation
Production Readiness:
- 908 tests passing (Ruby workers)
- Real-world CSV processing (1000 rows)
- Docker integration working
- Code review complete with recommended fixes
For More Information:
- Conditional Workflows: See docs/conditional-workflows.md
- DLQ System: See docs/dlq-system.md
- Code Examples: See workers/rust/src/step_handlers/batch_processing_*.rs
Caching Guide
This guide covers Tasker’s distributed caching system, including configuration, backend selection, circuit breaker protection, and operational considerations.
Overview
Tasker provides optional caching for:
- Task Templates: Reduces database queries when loading workflow definitions
- Analytics: Caches performance metrics and bottleneck analysis results
Caching is disabled by default and must be explicitly enabled in configuration.
Configuration
Basic Setup
[common.cache]
enabled = true
backend = "redis" # or "dragonfly" / "moka" / "memory" / "in-memory"
default_ttl_seconds = 3600 # 1 hour default
template_ttl_seconds = 3600 # 1 hour for templates
analytics_ttl_seconds = 60 # 1 minute for analytics
key_prefix = "tasker" # Namespace for cache keys
[common.cache.redis]
url = "${REDIS_URL:-redis://localhost:6379}"
max_connections = 10
connection_timeout_seconds = 5
database = 0
[common.cache.moka]
max_capacity = 10000 # Maximum entries in cache
Backend Selection
| Backend | Config Value | Use Case |
|---|---|---|
| Redis | "redis" | Multi-instance deployments (production) |
| Dragonfly | "dragonfly" | Redis-compatible with better multi-threaded performance |
| Memcached | "memcached" | Simple distributed cache (requires cache-memcached feature) |
| Moka | "moka", "memory", "in-memory" | Single-instance, development, DoS protection |
| NoOp | (enabled = false) | Disabled, always-miss |
Cache Backends
Redis (Distributed)
Redis is the recommended backend for production deployments:
- Shared state: All instances see the same cache entries
- Invalidation works: Worker bootstrap invalidations propagate to all instances
- Persistence: Survives process restarts (if Redis is configured for persistence)
[common.cache]
enabled = true
backend = "redis"
[common.cache.redis]
url = "redis://redis.internal:6379"
Dragonfly (Distributed)
Dragonfly is a Redis-compatible in-memory data store with better multi-threaded performance. It uses the same port (6379) and protocol as Redis, so no code changes are required.
- Redis compatible: Drop-in replacement for Redis
- Better performance: Multi-threaded architecture for higher throughput
- Shared state: Same distributed semantics as Redis
[common.cache]
enabled = true
backend = "dragonfly" # Uses Redis provider internally
[common.cache.redis]
url = "redis://dragonfly.internal:6379"
Note: Dragonfly is used in Tasker’s test and CI environments for improved performance. For production, either Redis or Dragonfly works.
Memcached (Distributed)
Memcached is a simple, high-performance distributed cache. It requires the
cache-memcached feature flag (not enabled by default).
- Simple protocol: Lightweight key-value store
- Distributed: State is shared across instances
- No pattern deletion: Relies on TTL expiry (like Moka)
[common.cache]
enabled = true
backend = "memcached"
[common.cache.memcached]
url = "tcp://memcached.internal:11211"
connection_timeout_seconds = 5
Note: Enable with cargo build --features cache-memcached. Not enabled
by default to reduce dependency footprint.
Moka (In-Memory)
Moka provides a high-performance in-memory cache:
- Zero network latency: All operations are in-process
- DoS protection: Rate-limits expensive operations without Redis dependency
- Single-instance only: Cache is not shared across processes
[common.cache]
enabled = true
backend = "moka"
[common.cache.moka]
max_capacity = 10000
Important: Moka is only suitable for:
- Single-instance deployments
- Development environments
- Analytics caching (where brief staleness is acceptable)
NoOp (Disabled)
When caching is disabled or a backend fails to initialize:
[common.cache]
enabled = false
The NoOp provider always returns cache misses and succeeds on writes (no-op). This is also used as a graceful fallback when Redis connection fails.
Circuit Breaker Protection
The cache circuit breaker prevents repeated timeout penalties when Redis/Dragonfly is unavailable. Instead of waiting for connection timeouts on every request, the circuit breaker fails fast after detecting failures.
Configuration
[common.circuit_breakers.component_configs.cache]
failure_threshold = 5 # Open after 5 consecutive failures
timeout_seconds = 15 # Test recovery after 15 seconds
success_threshold = 2 # Close after 2 successful calls
Behavior When Circuit is Open
When the circuit breaker is open (cache unavailable):
| Operation | Behavior |
|---|---|
get() | Returns None (cache miss) |
set() | Returns Ok(()) (no-op) |
delete() | Returns Ok(()) (no-op) |
health_check() | Returns false (unhealthy) |
This fail-fast behavior ensures:
- Requests don’t wait for connection timeouts
- Database queries still work (cache miss → DB fallback)
- Recovery is automatic when Redis/Dragonfly becomes available
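A toy sketch of the read path this enables — the types and functions below are illustrative, not Tasker's internal cache API. An open circuit turns into an immediate cache miss, the database fallback still serves the request, and the write-back is a silent no-op:
#![allow(unused)]
fn main() {
struct Cache { circuit_open: bool }
impl Cache {
    fn get(&self, _key: &str) -> Option<String> {
        if self.circuit_open { return None; } // fail fast: report a miss
        None // a real provider would query Redis/Dragonfly here
    }
    fn set(&self, _key: &str, _value: &str) {
        if self.circuit_open { return; } // no-op while open
        // a real provider would write to Redis/Dragonfly here
    }
}
fn load_template(cache: &Cache, key: &str) -> String {
    if let Some(hit) = cache.get(key) {
        return hit;
    }
    let from_db = format!("template for {key}"); // stand-in for the database query
    cache.set(key, &from_db); // best-effort write-back
    from_db
}
// Even with the circuit open, the request succeeds via the database path
let degraded = Cache { circuit_open: true };
assert!(load_template(&degraded, "some:cache:key").starts_with("template for"));
}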
Circuit States
| State | Description |
|---|---|
| Closed | Normal operation, all calls go through |
| Open | Failing fast, calls return fallback values |
| Half-Open | Testing recovery, limited calls allowed |
Monitoring
Circuit state is logged at state transitions:
INFO Circuit breaker half-open (testing recovery)
INFO Circuit breaker closed (recovered)
ERROR Circuit breaker opened (failing fast)
Usage Context Constraints
Different caching use cases have different consistency requirements. Tasker enforces these constraints at runtime:
Template Caching
Constraint: Requires distributed cache (Redis) or no cache (NoOp)
Templates are cached to avoid repeated database queries when loading workflow definitions. However, workers invalidate the template cache on bootstrap when they register new handler versions.
If an in-memory cache (Moka) is used:
- Orchestration server caches templates in its local memory
- Worker boots and invalidates templates in Redis (or nowhere, if Moka)
- Orchestration server never sees the invalidation
- Stale templates are served → operational errors
Behavior with Moka: Template caching is automatically disabled with a warning:
WARN Cache provider 'moka' is not safe for template caching (in-memory cache
would drift from worker invalidations). Template caching disabled.
Analytics Caching
Constraint: Any backend allowed
Analytics data is informational and TTL-bounded. Brief staleness is acceptable, and in-memory caching provides DoS protection for expensive aggregation queries.
Behavior with Moka: Analytics caching works normally.
Cache Keys
Cache keys are prefixed with the configured key_prefix to allow multiple
Tasker deployments to share a Redis instance:
| Resource | Key Pattern |
|---|---|
| Templates | {prefix}:template:{namespace}:{name}:{version} |
| Performance Metrics | {prefix}:analytics:performance:{hours} |
| Bottleneck Analysis | {prefix}:analytics:bottlenecks:{limit}:{min_executions} |
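Key construction is plain prefix formatting. For example (hypothetical helper names, shown only to illustrate the patterns above):

// Illustrative helpers only; the real key-building functions may be named differently.
fn template_key(prefix: &str, namespace: &str, name: &str, version: &str) -> String {
    format!("{prefix}:template:{namespace}:{name}:{version}")
}

fn performance_key(prefix: &str, hours: u32) -> String {
    format!("{prefix}:analytics:performance:{hours}")
}

// template_key("tasker", "payments", "process_order", "1.0.0")
//   => "tasker:template:payments:process_order:1.0.0"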
Operational Patterns
Multi-Instance Production
[common.cache]
enabled = true
backend = "redis"
template_ttl_seconds = 3600 # Long TTL, rely on invalidation
analytics_ttl_seconds = 60 # Short TTL for fresh data
- Templates cached for 1 hour but invalidated on worker registration
- Analytics cached briefly to reduce database load
Single-Instance / Development
[common.cache]
enabled = true
backend = "moka"
template_ttl_seconds = 300 # Shorter TTL since no invalidation
analytics_ttl_seconds = 30
- Template caching automatically disabled (Moka constraint)
- Analytics caching works, provides DoS protection
Caching Disabled
[common.cache]
enabled = false
- All cache operations are no-ops
- Every request hits the database
- Useful for debugging or when cache adds complexity without benefit
Graceful Degradation
Tasker never fails to start due to cache issues:
- Redis connection failure: Falls back to NoOp with warning
- Backend misconfiguration: Falls back to NoOp with warning
- Cache operation errors: Logged as warnings, never propagated
WARN Failed to connect to Redis, falling back to NoOp cache (graceful degradation)
The cache layer uses “best-effort” writes—failures are logged but never block request processing.
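The fallback amounts to matching on the backend's initialization result. A minimal sketch with stand-in provider types (not the actual internals):

// Illustrative sketch only: provider types and connect() are stand-ins, not the real implementation.
struct RedisProvider;
struct NoOpProvider;

impl RedisProvider {
    fn connect(url: &str) -> Result<Self, String> {
        // Stand-in for a real connection attempt.
        if url.starts_with("redis://") { Ok(RedisProvider) } else { Err(format!("bad url: {url}")) }
    }
}

enum Cache {
    Redis(RedisProvider),
    NoOp(NoOpProvider),
}

fn init_cache(url: &str) -> Cache {
    match RedisProvider::connect(url) {
        Ok(provider) => Cache::Redis(provider),
        Err(err) => {
            // Startup never fails on cache issues; degrade to the always-miss NoOp provider.
            eprintln!("WARN Failed to connect to Redis, falling back to NoOp cache ({err})");
            Cache::NoOp(NoOpProvider)
        }
    }
}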
Monitoring
Cache Hit/Miss Rates
Cache operations are logged at DEBUG level:
DEBUG hours=24 "Performance metrics cache HIT"
DEBUG hours=24 "Performance metrics cache MISS, querying DB"
Provider Status
On startup, the active cache provider is logged:
INFO backend="redis" "Distributed cache provider initialized successfully"
INFO backend="moka" max_capacity=10000 "In-memory cache provider initialized"
INFO "Distributed cache disabled by configuration"
Troubleshooting
Templates Not Caching
- Check if backend is Moka—template caching is disabled with Moka
- Check for Redis connection warnings in logs
- Verify enabled = true in the configuration
Stale Templates Being Served
- Verify all instances point to the same Redis
- Check that workers are properly invalidating on bootstrap
- Consider reducing template_ttl_seconds
High Cache Miss Rate
- Check Redis connectivity and latency
- Verify TTL settings aren’t too aggressive
- Check for cache key collisions (multiple deployments, same prefix)
Memory Growth with Moka
- Reduce the max_capacity setting
- Check TTL settings—items evict on TTL or capacity limit
- Monitor entry count if metrics are available
Conditional Workflows and Decision Points
Last Updated: 2025-10-27 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | Use Cases & Patterns | States and Lifecycles
← Back to Documentation Hub
Overview
Conditional workflows enable runtime decision-making that dynamically determines which workflow steps to execute based on business logic. Unlike static DAG workflows where all steps are predefined, conditional workflows use decision point steps to create steps on-demand based on runtime conditions.
Dynamic Workflow Decision Points provide this capability through:
- Decision Point Steps: Special step type that evaluates business logic and returns step names to create
- Deferred Steps: Step type with dynamic dependency resolution using intersection semantics
- Type-Safe Integration: Ruby and Rust helpers ensuring clean serialization between languages
Table of Contents
- When to Use Conditional Workflows
- Logical Pattern
- Architecture and Implementation
- YAML Configuration
- Simple Example: Approval Routing
- Complex Example: Multi-Tier Approval
- Ruby Implementation Guide
- Rust Implementation Guide
- Best Practices
- Limitations and Constraints
When to Use Conditional Workflows
✅ Use Conditional Workflows When:
1. Business Logic Determines Execution Path
- Approval workflows with amount-based routing (small/medium/large)
- Risk-based processing (low/medium/high risk paths)
- Tiered customer service (bronze/silver/gold/platinum)
- Regulatory compliance with jurisdictional variations
2. Step Requirements Are Unknown Until Runtime
- Dynamic validation checks based on request type
- Multi-stage approvals where approval count depends on amount
- Conditional enrichment steps based on data completeness
- Parallel processing with variable worker count
3. Workflow Complexity Varies By Input
- Simple cases skip expensive steps
- Complex cases trigger additional validation
- Emergency processing bypasses normal checks
- VIP customers get expedited handling
❌ Don’t Use Conditional Workflows When:
1. Static DAG is Sufficient
- All possible execution paths known at design time
- Complexity overhead not justified
- Simple if/else can be handled in handler code
2. Purely Sequential Logic
- No parallelism or branching needed
- Handler code can make decisions directly
3. Real-Time Sub-Second Decisions
- Decision overhead (~10-20ms) not acceptable
- In-memory processing required
Logical Pattern
Core Concepts
Task Initialization
↓
Regular Step(s)
↓
Decision Point Step ← Evaluates business logic
↓
[Decision Made]
↓
┌───┴───┐
↓ ↓
Path A Path B ← Steps created dynamically
↓ ↓
└───┬───┘
↓
Convergence Step ← Deferred dependencies resolve via intersection
↓
Task Complete
Decision Point Pattern
- Evaluation Phase: Decision point step executes handler
- Decision Output: Handler returns list of step names to create
- Dynamic Creation: Orchestration creates specified steps with proper dependencies
- Execution: Created steps execute like normal steps
- Convergence: Deferred steps wait for intersection of declared dependencies + created steps
Intersection Semantics for Deferred Steps
Declared Dependencies (in template):
- step_a
- step_b
- step_c
Actually Created Steps (by decision point):
Only step_a and step_c were created
Effective Dependencies (intersection):
step_a AND step_c (step_b ignored since not created)
This enables convergence steps that work regardless of which path was taken.
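The effective dependency set is a plain set intersection. A small illustrative sketch using std HashSet (the real resolution happens inside the orchestration engine):

use std::collections::HashSet;

/// Effective dependencies = declared dependencies ∩ steps that actually exist.
fn effective_dependencies<'a>(
    declared: &[&'a str],
    created: &HashSet<&'a str>,
) -> Vec<&'a str> {
    declared
        .iter()
        .copied()
        .filter(|step| created.contains(step))
        .collect()
}

fn main() {
    let declared = ["step_a", "step_b", "step_c"];
    let created: HashSet<&str> = ["step_a", "step_c"].into_iter().collect();
    // step_b was never created, so it is ignored.
    assert_eq!(effective_dependencies(&declared, &created), vec!["step_a", "step_c"]);
}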
Architecture and Implementation
Step Type: Decision Point
Decision point steps are regular steps with a special handler that returns a DecisionPointOutcome:
pub enum DecisionPointOutcome {
NoBranches, // No additional steps needed
CreateSteps { // Dynamically create these steps
step_names: Vec<String>,
},
}
Key Characteristics:
- Executes like a normal step
- Result includes a decision_point_outcome field
- Orchestration detects the outcome and creates steps
- Created steps depend on the decision point step
- Fully atomic - either all steps created or none
Step Type: Deferred
Deferred steps use intersection semantics for dependency resolution:
type: deferred # Special step type
dependencies:
- routing_decision # Must wait for decision point
- step_a # Might be created
- step_b # Might be created
- step_c # Might be created
Resolution Logic:
- Wait for decision point to complete
- Check which declared dependencies actually exist
- Wait only for intersection of declared + created
- Execute when all existing dependencies complete
Orchestration Flow
┌─────────────────────────────────────────┐
│ Step Result Processor │
│ │
│ 1. Check if result has │
│ decision_point_outcome field │
│ │
│ 2. If CreateSteps: │
│ - Validate step names exist │
│ - Create WorkflowStep records │
│ - Set dependencies │
│ - Enqueue for execution │
│ │
│ 3. If NoBranches: │
│ - Continue normally │
│ │
│ 4. Metrics and telemetry: │
│ - Track steps_created count │
│ - Log decision outcome │
│ - Warn if depth limit approached │
└─────────────────────────────────────────┘
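A rough sketch of that detection step, re-declaring the documented DecisionPointOutcome enum locally so the snippet is self-contained (the surrounding processor code is illustrative, not the engine's API):

use serde::Deserialize;
use serde_json::json;

// Local re-declaration of the documented enum, for illustration only.
#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum DecisionPointOutcome {
    NoBranches,
    CreateSteps { step_names: Vec<String> },
}

fn process_step_result(result: &serde_json::Value) {
    let outcome = result
        .get("decision_point_outcome")
        .and_then(|v| serde_json::from_value::<DecisionPointOutcome>(v.clone()).ok());

    match outcome {
        Some(DecisionPointOutcome::CreateSteps { step_names }) => {
            // Orchestration would validate these names against the template,
            // create WorkflowStep records, wire dependencies, and enqueue them.
            println!("create steps: {step_names:?}");
        }
        Some(DecisionPointOutcome::NoBranches) | None => {
            // No dynamic steps; continue normal DAG processing.
        }
    }
}

fn main() {
    let result = json!({
        "route_type": "dual_approval",
        "decision_point_outcome": { "type": "create_steps", "step_names": ["manager_approval", "finance_review"] }
    });
    process_step_result(&result);
}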
Configuration
Decision point behavior is configured per environment:
# config/tasker/base/orchestration.toml
[orchestration.decision_points]
enabled = true
max_depth = 3 # Prevent infinite recursion
warn_threshold = 2 # Warn when nearing limit
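A depth guard based on these settings could look roughly like this (an illustrative sketch; the function and its exact semantics are assumptions, not the engine's internal API):

// Illustrative guard only; the real depth check lives in the orchestration crate.
fn check_decision_depth(current_depth: u32, max_depth: u32, warn_threshold: u32) -> Result<(), String> {
    if current_depth >= max_depth {
        return Err(format!(
            "decision point depth {current_depth} reached max_depth {max_depth}; refusing to create more steps"
        ));
    }
    if current_depth >= warn_threshold {
        eprintln!("WARN decision point depth {current_depth} approaching max_depth {max_depth}");
    }
    Ok(())
}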
YAML Configuration
Task Template Structure
Actual Implementation (from tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml):
---
name: approval_routing
namespace_name: conditional_approval
version: 1.0.0
description: >
Ruby implementation of conditional approval workflow demonstrating dynamic decision points.
Routes approval requests through different paths based on amount thresholds.
task_handler:
callable: tasker_worker_ruby::TaskHandler
initialization: {}
steps:
- name: validate_request
type: standard
dependencies: []
handler:
callable: ConditionalApproval::StepHandlers::ValidateRequestHandler
initialization: {}
- name: routing_decision
type: decision # DECISION POINT
dependencies:
- validate_request
handler:
callable: ConditionalApproval::StepHandlers::RoutingDecisionHandler
initialization: {}
- name: finalize_approval
type: deferred # DEFERRED - uses intersection semantics
dependencies:
- auto_approve # ALL possible dependencies listed
- manager_approval # System computes intersection at runtime
- finance_review
handler:
callable: ConditionalApproval::StepHandlers::FinalizeApprovalHandler
initialization: {}
# Possible dynamic branches (created by decision point)
- name: auto_approve
type: standard
dependencies:
- routing_decision
handler:
callable: ConditionalApproval::StepHandlers::AutoApproveHandler
initialization: {}
- name: manager_approval
type: standard
dependencies:
- routing_decision
handler:
callable: ConditionalApproval::StepHandlers::ManagerApprovalHandler
initialization: {}
- name: finance_review
type: standard
dependencies:
- routing_decision
handler:
callable: ConditionalApproval::StepHandlers::FinanceReviewHandler
initialization: {}
Key Points:
- type: decision marks the decision point step
- type: deferred enables intersection semantics for convergence
- ALL possible dependencies listed in deferred step
- Orchestration computes: declared deps ∩ actually created steps
Simple Example: Approval Routing
Business Requirement
Route approval requests based on amount:
- < $1,000: Auto-approve (no human intervention)
- $1,000 - $4,999: Manager approval required
- ≥ $5,000: Manager + Finance approval required
Template Configuration
namespace: approval_workflows
name: simple_routing
version: "1.0"
steps:
- name: validate_request
handler: validate_request
- name: routing_decision
handler: routing_decision
type: decision
dependencies:
- validate_request
- name: auto_approve
handler: auto_approve
dependencies:
- routing_decision
- name: manager_approval
handler: manager_approval
dependencies:
- routing_decision
- name: finance_review
handler: finance_review
dependencies:
- routing_decision
- name: finalize_approval
handler: finalize_approval
type: deferred
dependencies:
- routing_decision
- auto_approve
- manager_approval
- finance_review
Ruby Handler Implementation
Actual Implementation (from workers/ruby/spec/handlers/examples/conditional_approval/step_handlers/routing_decision_handler.rb):
# frozen_string_literal: true
module ConditionalApproval
module StepHandlers
# Routing Decision: DECISION POINT that routes approval based on amount
#
# Uses TaskerCore::StepHandler::Decision base class for clean, type-safe
# decision outcome serialization consistent with Rust expectations.
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
SMALL_AMOUNT_THRESHOLD = 1_000
LARGE_AMOUNT_THRESHOLD = 5_000
def call(task, _sequence, _step)
# Get amount from validated request
amount = task.context['amount']
raise 'Amount is required for routing decision' unless amount
# Make routing decision based on amount
route = determine_route(amount)
# Use Decision base class helper for clean outcome serialization
decision_success(
steps: route[:steps],
result_data: {
route_type: route[:type],
reasoning: route[:reasoning],
amount: amount
},
metadata: {
operation: 'routing_decision',
route_thresholds: {
small: SMALL_AMOUNT_THRESHOLD,
large: LARGE_AMOUNT_THRESHOLD
}
}
)
end
private
def determine_route(amount)
if amount < SMALL_AMOUNT_THRESHOLD
{
type: 'auto_approval',
steps: ['auto_approve'],
reasoning: "Amount $#{amount} below threshold - auto-approval"
}
elsif amount < LARGE_AMOUNT_THRESHOLD
{
type: 'manager_only',
steps: ['manager_approval'],
reasoning: "Amount $#{amount} requires manager approval"
}
else
{
type: 'dual_approval',
steps: %w[manager_approval finance_review],
reasoning: "Amount $#{amount} >= $#{LARGE_AMOUNT_THRESHOLD} - dual approval required"
}
end
end
end
end
end
Key Ruby Patterns:
- Inherit from TaskerCore::StepHandler::Decision, the specialized base class for decision points
- Use the helper method decision_success(steps:, result_data:, metadata:) for a clean decision-outcome API
- The helper automatically creates a DecisionPointOutcome and embeds it correctly
- No manual serialization needed; the base class handles Rust compatibility
- For no-branch scenarios, use decision_no_branches(result_data:, metadata:)
Execution Flow Examples
Example 1: Small Amount ($500)
1. validate_request → Complete
2. routing_decision → Complete (creates: auto_approve)
3. auto_approve → Complete
4. finalize_approval → Complete
(waits for: routing_decision ∩ {auto_approve} = auto_approve)
Total Steps Created: 4
Execution Time: ~500ms
Example 2: Medium Amount ($2,500)
1. validate_request → Complete
2. routing_decision → Complete (creates: manager_approval)
3. manager_approval → Complete
4. finalize_approval → Complete
(waits for: routing_decision ∩ {manager_approval} = manager_approval)
Total Steps Created: 4
Execution Time: ~2s (human approval delay)
Example 3: Large Amount ($10,000)
1. validate_request → Complete
2. routing_decision → Complete (creates: manager_approval, finance_review)
3. manager_approval → Complete (parallel)
3. finance_review → Complete (parallel)
4. finalize_approval → Complete
(waits for: routing_decision ∩ {manager_approval, finance_review})
Total Steps Created: 5
Execution Time: ~3s (parallel approvals)
Complex Example: Multi-Tier Approval
Business Requirement
Implement sophisticated approval routing with:
- Risk assessment step
- Tiered approval requirements
- Emergency override path
- Compliance checks based on jurisdiction
Template Configuration
namespace: approval_workflows
name: multi_tier_approval
version: "1.0"
steps:
# Phase 1: Initial validation and risk assessment
- name: validate_request
handler: validate_request
- name: assess_risk
handler: assess_risk
dependencies:
- validate_request
# Phase 2: Primary routing decision
- name: primary_routing
handler: primary_routing
type: decision
dependencies:
- assess_risk
# Phase 3: Conditional approval paths
- name: emergency_approval
handler: emergency_approval
dependencies:
- primary_routing
- name: standard_manager_approval
handler: standard_manager_approval
dependencies:
- primary_routing
- name: senior_manager_approval
handler: senior_manager_approval
dependencies:
- primary_routing
# Phase 4: Secondary routing for high-risk cases
- name: compliance_routing
handler: compliance_routing
type: decision
dependencies:
- primary_routing
- senior_manager_approval # Only if created
# Phase 5: Compliance paths
- name: legal_review
handler: legal_review
dependencies:
- compliance_routing
- name: fraud_investigation
handler: fraud_investigation
dependencies:
- compliance_routing
- name: jurisdictional_check
handler: jurisdictional_check
dependencies:
- compliance_routing
# Phase 6: Convergence
- name: finalize_approval
handler: finalize_approval
type: deferred
dependencies:
- primary_routing
- emergency_approval
- standard_manager_approval
- senior_manager_approval
- compliance_routing
- legal_review
- fraud_investigation
- jurisdictional_check
Ruby Handler: Primary Routing
class PrimaryRoutingHandler < TaskerCore::StepHandler::Decision
def call(task, sequence, _step)
amount = task.context['amount']
risk_score = sequence.get_results('assess_risk')['risk_score']
is_emergency = task.context['emergency'] == true
steps_to_create = if is_emergency && amount < 10_000
# Emergency override path
['emergency_approval']
elsif risk_score < 30 && amount < 5_000
# Low risk, standard approval
['standard_manager_approval']
else
# High risk or large amount - senior approval + compliance routing
['senior_manager_approval', 'compliance_routing']
end
decision_success(
steps: steps_to_create,
result_data: {
route_type: determine_route_type(is_emergency, risk_score, amount),
risk_score: risk_score,
amount: amount,
emergency: is_emergency
}
)
end
end
Ruby Handler: Compliance Routing (Nested Decision)
class ComplianceRoutingHandler < TaskerCore::StepHandler::Decision
def call(task, sequence, _step)
amount = task.context['amount']
risk_score = sequence.get_results('assess_risk')['risk_score']
jurisdiction = task.context['jurisdiction']
steps_to_create = []
# Large amounts always need legal review
steps_to_create << 'legal_review' if amount >= 50_000
# High risk triggers fraud investigation
steps_to_create << 'fraud_investigation' if risk_score >= 70
# Certain jurisdictions need special checks
steps_to_create << 'jurisdictional_check' if high_regulation_jurisdiction?(jurisdiction)
if steps_to_create.empty?
# No additional compliance steps needed
decision_no_branches(
result_data: { reason: 'no_compliance_requirements' }
)
else
decision_success(
steps: steps_to_create,
result_data: {
compliance_level: 'enhanced',
checks_required: steps_to_create
}
)
end
end
private
def high_regulation_jurisdiction?(jurisdiction)
%w[EU UK APAC].include?(jurisdiction)
end
end
Execution Scenarios
Scenario 1: Emergency Low-Risk Request ($5,000)
Path: validate → assess_risk → primary_routing → emergency_approval → finalize
Steps Created: 5
Decision Points: 1 (primary_routing creates emergency_approval)
Complexity: Low
Scenario 2: Standard Medium-Risk Request ($3,000, Risk 25)
Path: validate → assess_risk → primary_routing → standard_manager_approval → finalize
Steps Created: 5
Decision Points: 1 (primary_routing creates standard_manager_approval)
Complexity: Low
Scenario 3: High-Risk Large Amount ($75,000, Risk 80, EU)
Path: validate → assess_risk → primary_routing → senior_manager_approval + compliance_routing
→ legal_review + fraud_investigation + jurisdictional_check → finalize
Steps Created: 9
Decision Points: 2 (primary_routing → compliance_routing)
Complexity: High (nested decisions)
Ruby Implementation Guide
Using the Decision Base Class
The TaskerCore::StepHandler::Decision base class provides type-safe helpers:
class MyDecisionHandler < TaskerCore::StepHandler::Decision
def call(context)
# Your business logic here
amount = context.get_task_field('amount')
if amount < 1000
# Create single step
decision_success(
steps: 'auto_approve', # Can pass string or array
result_data: { route: 'auto' }
)
elsif amount < 5000
# Create multiple steps
decision_success(
steps: ['manager_approval', 'risk_check'],
result_data: { route: 'standard' }
)
else
# No additional steps needed
decision_no_branches(
result_data: { route: 'none', reason: 'manual_review_required' }
)
end
end
end
Helper Methods
decision_success(steps:, result_data: {}, metadata: {})
- Creates steps dynamically
- steps: String or Array of step names
- result_data: Additional data to store in step results
- metadata: Observability metadata
decision_no_branches(result_data: {}, metadata: {})
- No additional steps created
- Workflow proceeds to next static step
decision_with_custom_outcome(outcome:, result_data: {}, metadata: {})
- Advanced: Full control over outcome structure
- Most handlers should use decision_success or decision_no_branches
validate_decision_outcome!(outcome)
- Validates custom outcome structure
- Raises error if invalid
Type Definitions
# workers/ruby/lib/tasker_core/types/decision_point_outcome.rb
module TaskerCore
module Types
module DecisionPointOutcome
# Factory methods
def self.no_branches
NoBranches.new
end
def self.create_steps(step_names)
CreateSteps.new(step_names: step_names)
end
# Serialization format (matches Rust)
class NoBranches
def to_h
{ type: 'no_branches' }
end
end
class CreateSteps
def to_h
{ type: 'create_steps', step_names: step_names }
end
end
end
end
end
Rust Implementation Guide
Decision Handler Implementation
Actual Implementation (from workers/rust/src/step_handlers/conditional_approval_rust.rs):
use super::{error_result, success_result, RustStepHandler, StepHandlerConfig};
use anyhow::Result;
use async_trait::async_trait;
use chrono::Utc;
use serde_json::json;
use std::collections::HashMap;
use tasker_shared::messaging::{DecisionPointOutcome, StepExecutionResult};
use tasker_shared::types::TaskSequenceStep;
const SMALL_AMOUNT_THRESHOLD: i64 = 1000;
const LARGE_AMOUNT_THRESHOLD: i64 = 5000;
pub struct RoutingDecisionHandler {
#[allow(dead_code)]
config: StepHandlerConfig,
}
#[async_trait]
impl RustStepHandler for RoutingDecisionHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Extract amount from task context
let amount: i64 = step_data.get_context_field("amount")?;
// Business logic: determine routing
let (route_type, steps, reasoning) = if amount < SMALL_AMOUNT_THRESHOLD {
(
"auto_approval",
vec!["auto_approve"],
format!("Amount ${} under threshold", amount)
)
} else if amount < LARGE_AMOUNT_THRESHOLD {
(
"manager_only",
vec!["manager_approval"],
format!("Amount ${} requires manager approval", amount)
)
} else {
(
"dual_approval",
vec!["manager_approval", "finance_review"],
format!("Amount ${} requires dual approval", amount)
)
};
// Create decision point outcome
let outcome = DecisionPointOutcome::create_steps(
steps.iter().map(|s| s.to_string()).collect()
);
// Build result with embedded outcome
let result_data = json!({
"route_type": route_type,
"reasoning": reasoning,
"amount": amount,
"decision_point_outcome": outcome.to_value() // Embedded outcome
});
let metadata = HashMap::from([
("route_type".to_string(), json!(route_type)),
("steps_to_create".to_string(), json!(steps)),
]);
Ok(success_result(
step_uuid,
result_data,
start_time.elapsed().as_millis() as i64,
Some(metadata),
))
}
fn name(&self) -> &str {
"routing_decision"
}
fn new(config: StepHandlerConfig) -> Self {
Self { config }
}
}
DecisionPointOutcome Type
Type Definition (from tasker-shared/src/messaging/execution_types.rs):
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum DecisionPointOutcome {
NoBranches,
CreateSteps {
step_names: Vec<String>,
},
}
impl DecisionPointOutcome {
/// Create outcome that creates specific steps
pub fn create_steps(step_names: Vec<String>) -> Self {
Self::CreateSteps { step_names }
}
/// Create outcome with no additional steps
pub fn no_branches() -> Self {
Self::NoBranches
}
/// Convert to JSON value for embedding in StepExecutionResult
pub fn to_value(&self) -> serde_json::Value {
serde_json::to_value(self).expect("DecisionPointOutcome serialization should not fail")
}
/// Extract decision outcome from step execution result
pub fn from_step_result(result: &serde_json::Value) -> Option<Self> {
result
.as_object()?
.get("decision_point_outcome")
.and_then(|v| serde_json::from_value(v.clone()).ok())
}
}
Key Rust Patterns:
- DecisionPointOutcome::create_steps(vec![...]) is a type-safe factory
- outcome.to_value() serializes to JSON matching the Ruby format
- The outcome is embedded in the result JSON as the decision_point_outcome field
- Serde handles serialization: { "type": "create_steps", "step_names": [...] }
Best Practices
1. Keep Decision Logic Deterministic
# ✅ Good: Deterministic decision based on input
def call(context)
amount = context.get_task_field('amount')
steps = if amount < 1000
['auto_approve']
else
['manager_approval']
end
decision_success(steps: steps)
end
# ❌ Bad: Non-deterministic (time-based, random)
def call(context)
# Decision changes based on when it runs
steps = if Time.now.hour < 9
['emergency_approval']
else
['standard_approval']
end
decision_success(steps: steps)
end
2. Validate Step Names
Ensure all step names in decision outcomes exist in template:
VALID_STEPS = %w[auto_approve manager_approval finance_review].freeze
def call(context)
steps_to_create = determine_steps(context)
# Validate step names
invalid = steps_to_create - VALID_STEPS
unless invalid.empty?
raise "Invalid step names: #{invalid.join(', ')}"
end
decision_success(steps: steps_to_create)
end
3. Use Deferred Type for Convergence
Any step that might depend on dynamically created steps should be type: deferred:
# ✅ Correct
- name: finalize
type: deferred # Uses intersection semantics
dependencies:
- routing_decision
- auto_approve
- manager_approval
# ❌ Wrong - will fail if dependencies don't all exist
- name: finalize
dependencies:
- routing_decision
- auto_approve
- manager_approval
4. Limit Decision Depth
Prevent infinite recursion:
[orchestration.decision_points]
max_depth = 3 # Maximum nesting level
warn_threshold = 2 # Warn when approaching limit
# ✅ Good: Linear decision chain (depth 1-2)
validate → routing_decision → compliance_check → finalize
# ⚠️ Be Careful: Deep nesting (depth 3)
validate → routing_1 → routing_2 → routing_3 → finalize
# ❌ Bad: Circular or unbounded nesting
routing_decision creates steps that create more routing decisions...
5. Handle No-Branch Cases
Explicitly return no_branches when no steps needed:
def call(context)
amount = context.get_task_field('amount')
if context.get_task_field('skip_approval')
# No additional steps needed
decision_no_branches(
result_data: { reason: 'approval_skipped' }
)
else
decision_success(steps: determine_steps(amount))
end
end
6. Meaningful Result Data
Include context for debugging and audit trails:
decision_success(
steps: ['manager_approval', 'finance_review'],
result_data: {
route_type: 'dual_approval',
reasoning: "Amount $#{amount} >= $5,000 threshold",
amount: amount,
thresholds_applied: {
small: 1_000,
large: 5_000
}
},
metadata: {
decision_time_ms: elapsed_ms,
steps_created_count: 2
}
)
Limitations and Constraints
Technical Limits
1. Maximum Decision Depth
- Default: 3 levels of nested decision points
- Configurable via orchestration.decision_points.max_depth
- Prevents infinite recursion
2. Step Names Must Exist in Template
- All step names in CreateSteps must be defined in the template
- Orchestration validates before creating steps
- Invalid names cause permanent failure
3. Decision Logic is Non-Retryable by Default
- Decision steps should be deterministic
- Retry disabled by default (max_attempts: 1)
- External API calls should be in separate steps
4. Created Steps Cannot Modify Template
- Decision points create instances of template steps
- Cannot dynamically define new step types
- All possible steps must be in template
Performance Considerations
1. Decision Overhead
- Each decision point adds ~10-20ms overhead
- Includes: handler execution + step creation + dependency resolution
- Factor into SLA planning
2. Database Impact
- Each created step = 1 WorkflowStep record + edges
- Large branch counts increase database operations
- Monitor workflow_steps table growth
3. Observability
- Decision outcomes logged with telemetry
- Metrics track decision_points.steps_created and decision_points.depth
- Use structured logging for audit trails
Semantic Constraints
1. Deferred Dependencies Must Include Decision Point
# ✅ Correct
- name: finalize
type: deferred
dependencies:
- routing_decision # Must list the decision point
- auto_approve
- manager_approval
# ❌ Wrong - missing decision point
- name: finalize
type: deferred
dependencies:
- auto_approve
- manager_approval
2. Decision Points Cannot Be Circular
# ❌ Not allowed - circular dependency
routing_a creates routing_b
routing_b creates routing_a
3. No Dynamic Template Modification
- Cannot add new handler types at runtime
- Cannot modify step configurations
- All possibilities must be predefined
Testing Decision Point Workflows
E2E Test Structure
Both Ruby and Rust implementations include comprehensive E2E tests covering all routing scenarios:
Test Locations:
- Ruby: tests/e2e/ruby/conditional_approval_test.rs
- Rust: tests/e2e/rust/conditional_approval_rust.rs
Test Scenarios:
1. Small Amount ($500) - Auto-approval only
   validate_request → routing_decision → auto_approve → finalize_approval
   Expected: 4 steps created, only auto_approve path taken
2. Medium Amount ($3,000) - Manager approval only
   validate_request → routing_decision → manager_approval → finalize_approval
   Expected: 4 steps created, only manager path taken
3. Large Amount ($10,000) - Dual approval
   validate_request → routing_decision → manager_approval + finance_review → finalize_approval
   Expected: 5 steps created, both approval paths taken (parallel)
4. API Validation - Initial step count verification
   Expected: 2 steps at initialization (validate_request, routing_decision)
   Reason: finalize_approval is a transitive descendant of the decision point
Running Tests
# Run all E2E tests
cargo test --test e2e_tests
# Run Ruby conditional approval tests only
cargo test --test e2e_tests e2e::ruby::conditional_approval
# Run Rust conditional approval tests only
cargo test --test e2e_tests e2e::rust::conditional_approval_rust
# Run with output for debugging
cargo test --test e2e_tests -- --nocapture
Test Fixtures
Ruby Template: tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml
Rust Template: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml
Both templates demonstrate:
- Decision point step configuration (type: decision)
- Deferred convergence step (type: deferred)
- Dynamic step dependencies
- Namespace isolation between Ruby/Rust
Validation Checklist
When implementing decision point workflows, ensure:
- ✅ Decision point step has type: decision
- ✅ Deferred convergence step has type: deferred
- ✅ All possible dependencies listed in deferred step
- ✅ Handler embeds decision_point_outcome in the step result
- ✅ Step names in the outcome match template definitions
- ✅ E2E tests cover all routing scenarios
- ✅ Tests validate step creation and completion
- ✅ Namespace isolated if multiple implementations exist
Related Documentation
- Use Cases & Patterns - More workflow examples
- States and Lifecycles - State machine details
- Task and Step Readiness - Dependency resolution logic
- Quick Start - Getting started guide
- Crate Architecture - System architecture overview
- Decision Point E2E Tests - Detailed test documentation
← Back to Documentation Hub
Configuration Management
Last Updated: 2025-10-17 Audience: Operators, Developers, Architects Status: Active Related Docs: Environment Configuration Comparison, Deployment Patterns
← Back to Documentation Hub
Overview
Tasker Core implements a sophisticated component-based configuration system with environment-specific overrides, runtime observability, and comprehensive validation. This document explains how to manage, validate, inspect, and deploy Tasker configurations.
Key Features
| Feature | Description | Benefit |
|---|---|---|
| Component-Based Architecture | 3 focused TOML files organized by common, orchestration, and worker | Easy to understand and maintain |
| Environment Overrides | Test, development, production-specific settings | Safe defaults with production scale-out |
| Single-File Runtime Loading | Load from pre-merged configuration files at runtime | Deployment certainty - exact config known at build time |
| Runtime Observability | /config API endpoints with secret redaction | Live inspection of deployed configurations |
| CLI Tools | Generate and validate single deployable configs | Build-time verification, deployment artifacts |
| Context-Specific Validation | Orchestration and worker-specific validation rules | Catch errors before deployment |
| Secret Redaction | 12+ sensitive key patterns automatically hidden | Safe configuration inspection |
Quick Start
Inspect Running System Configuration
# Check orchestration configuration (includes common + orchestration-specific)
curl http://localhost:8080/config | jq
# Check worker configuration (includes common + worker-specific)
curl http://localhost:8081/config | jq
# Secrets are automatically redacted for safety
Generate Deployable Configuration
# Generate production orchestration config for deployment
tasker-ctl config generate \
--context orchestration \
--environment production \
--output config/tasker/orchestration-production.toml
# This merged file is then loaded at runtime via TASKER_CONFIG_PATH
export TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml
Validate Configuration
# Validate orchestration config for production
tasker-ctl config validate \
--context orchestration \
--environment production
# Validates: type safety, ranges, required fields, business rules
Part 1: Configuration Architecture
1.1 Component-Based Structure
Tasker uses a component-based TOML architecture where configuration is split into focused files with single responsibility:
config/tasker/
├── base/ # Base configuration (defaults)
│ ├── common.toml # Shared: database, circuit breakers, telemetry
│ ├── orchestration.toml # Orchestration-specific settings
│ └── worker.toml # Worker-specific settings
│
├── environments/ # Environment-specific overrides
│ ├── test/
│ │ ├── common.toml # Test overrides (small values, fast execution)
│ │ ├── orchestration.toml
│ │ └── worker.toml
│ │
│ ├── development/
│ │ ├── common.toml # Development overrides (medium values, local Docker)
│ │ ├── orchestration.toml
│ │ └── worker.toml
│ │
│ └── production/
│ ├── common.toml # Production overrides (large values, scale-out)
│ ├── orchestration.toml
│ └── worker.toml
│
├── orchestration-test.toml # Generated merged configs (used at runtime via TASKER_CONFIG_PATH)
├── orchestration-production.toml # Single-file deployment artifacts
├── worker-test.toml
└── worker-production.toml
1.2 Configuration Contexts
Tasker has three configuration contexts:
| Context | Purpose | Components |
|---|---|---|
| Common | Shared across orchestration and worker | Database, circuit breakers, telemetry, backoff, system |
| Orchestration | Orchestration-specific settings | Web API, MPSC channels, event systems, shutdown |
| Worker | Worker-specific settings | Handler discovery, resource limits, health monitoring |
1.3 Environment Detection
Configuration loading uses TASKER_ENV environment variable:
# Test environment (default) - small values for fast tests
export TASKER_ENV=test
# Development environment - medium values for local Docker
export TASKER_ENV=development
# Production environment - large values for scale-out
export TASKER_ENV=production
Detection Order:
1. TASKER_ENV environment variable
2. Default to "development" if not set
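In code, the detection order reduces to roughly the following (an illustrative sketch, not the actual loader):

fn detect_environment() -> String {
    // 1. TASKER_ENV environment variable; 2. fall back to "development".
    std::env::var("TASKER_ENV").unwrap_or_else(|_| "development".to_string())
}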
1.4 Runtime Configuration Loading
Production/Docker Deployment: Single-file loading via TASKER_CONFIG_PATH
Runtime systems (orchestration and worker) load configuration from pre-merged single files:
# Set path to merged configuration file
export TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml
# System loads this single file at startup
# No directory merging at runtime - configuration is fully determined at build time
Key Benefits:
- Deployment Certainty: Exact configuration known before deployment
- Simplified Debugging: Single file shows exactly what’s running
- Configuration Auditing: One file to version control and code review
- Fail Loudly: Missing or invalid config halts startup with explicit errors
Configuration Path Precedence:
The system uses a two-tier configuration loading strategy with clear precedence:
1. Primary: TASKER_CONFIG_PATH (explicit single file - Docker/production)
   - When set, the system loads configuration from this exact file path
   - Intended for production and Docker deployments
   - Example: TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml
   - Source logging: "📋 Loading orchestration configuration from: /app/config/tasker/orchestration-production.toml (source: TASKER_CONFIG_PATH)"
2. Fallback: TASKER_CONFIG_ROOT (convention-based - tests/development)
   - When TASKER_CONFIG_PATH is not set, the system looks for config using convention
   - Convention: {TASKER_CONFIG_ROOT}/tasker/{context}-{environment}.toml
   - Examples:
     - Orchestration: /config/tasker/generated/orchestration-test.toml
     - Worker: /config/tasker/worker-production.toml
   - Source logging: "📋 Loading orchestration configuration from: /config/tasker/generated/orchestration-test.toml (source: TASKER_CONFIG_ROOT (convention))"
Logging and Transparency:
The system clearly logs which approach was taken at startup:
# Explicit path approach (TASKER_CONFIG_PATH set)
INFO tasker_shared::system_context: 📋 Loading orchestration configuration from: /app/config/tasker/orchestration-production.toml (source: TASKER_CONFIG_PATH)
# Convention-based approach (TASKER_CONFIG_ROOT set)
INFO tasker_shared::system_context: Using convention-based config path: /config/tasker/generated/orchestration-test.toml (environment=test)
INFO tasker_shared::system_context: 📋 Loading orchestration configuration from: /config/tasker/generated/orchestration-test.toml (source: TASKER_CONFIG_ROOT (convention))
When to Use Each:
| Environment | Recommended Approach | Reason |
|---|---|---|
| Production | TASKER_CONFIG_PATH | Explicit, auditable, matches what’s reviewed |
| Docker | TASKER_CONFIG_PATH | Single source of truth, no ambiguity |
| Kubernetes | TASKER_CONFIG_PATH | ConfigMap contains exact file |
| Tests (nextest) | TASKER_CONFIG_ROOT | Tests span multiple contexts, convention handles both |
| Local dev | Either | Personal preference |
Error Handling:
If neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set:
ConfigurationError("Neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set.
For Docker/production: set TASKER_CONFIG_PATH to the merged config file.
For tests/development: set TASKER_CONFIG_ROOT to the config directory.")
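The documented precedence and error are easy to mirror in a small resolver. A sketch of the documented behavior (the real loader lives in tasker_shared; names here are illustrative):

use std::env;
use std::path::PathBuf;

// Illustrative resolution of the documented precedence, not the actual loader code.
fn resolve_config_path(context: &str, environment: &str) -> Result<PathBuf, String> {
    // 1. Explicit single file (Docker/production).
    if let Ok(path) = env::var("TASKER_CONFIG_PATH") {
        return Ok(PathBuf::from(path));
    }
    // 2. Convention-based fallback (tests/development):
    //    {TASKER_CONFIG_ROOT}/tasker/{context}-{environment}.toml
    if let Ok(root) = env::var("TASKER_CONFIG_ROOT") {
        return Ok(PathBuf::from(root)
            .join("tasker")
            .join(format!("{context}-{environment}.toml")));
    }
    Err("Neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set.".to_string())
}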
Local Development: Directory-based loading (legacy tests only)
For legacy test compatibility, you can still use directory-based loading via the load_context_direct() method, but this is not supported for production use.
1.5 Merging Strategy
Configuration merging follows environment overrides win pattern:
# base/common.toml
[database.pool]
max_connections = 30
min_connections = 8
# environments/production/common.toml
[database.pool]
max_connections = 50
# Result: max_connections = 50, min_connections = 8 (inherited from base)
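The merge itself is a recursive table merge where environment values win. A minimal sketch using the toml crate (not the actual merge implementation):

use toml::Value;

// Recursive merge where overlay (environment) values win over base values.
fn merge(base: &mut Value, overlay: Value) {
    match (base, overlay) {
        (Value::Table(base_table), Value::Table(overlay_table)) => {
            for (key, overlay_value) in overlay_table {
                if base_table.contains_key(&key) {
                    merge(base_table.get_mut(&key).expect("checked above"), overlay_value);
                } else {
                    base_table.insert(key, overlay_value);
                }
            }
        }
        // Scalars and arrays: the environment override replaces the base value.
        (base_slot, overlay_value) => *base_slot = overlay_value,
    }
}

fn main() {
    let mut base: Value =
        toml::from_str("[database.pool]\nmax_connections = 30\nmin_connections = 8").unwrap();
    let overlay: Value = toml::from_str("[database.pool]\nmax_connections = 50").unwrap();
    merge(&mut base, overlay);
    // max_connections = 50 (override wins), min_connections = 8 (inherited from base)
    println!("{base}");
}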
Part 2: Runtime Observability
2.1 Configuration API Endpoints
Tasker provides unified configuration endpoints that return complete configuration (common + context-specific) in a single response.
Orchestration API
Endpoint: GET /config (system endpoint at root level)
Purpose: Inspect complete orchestration configuration including common settings
Example Request:
curl http://localhost:8080/config | jq
Response Structure:
{
"environment": "production",
"common": {
"database": {
"url": "***REDACTED***",
"pool": {
"max_connections": 50,
"min_connections": 15
}
},
"circuit_breakers": { "...": "..." },
"telemetry": { "...": "..." },
"system": { "...": "..." },
"backoff": { "...": "..." },
"task_templates": { "...": "..." }
},
"orchestration": {
"web": {
"bind_address": "0.0.0.0:8080",
"request_timeout_ms": 60000
},
"mpsc_channels": {
"command_buffer_size": 5000,
"pgmq_notification_buffer_size": 50000
},
"event_systems": { "...": "..." }
},
"metadata": {
"timestamp": "2025-10-17T15:30:45Z",
"source": "runtime",
"redacted_fields": [
"database.url",
"telemetry.api_key"
]
}
}
Worker API
Endpoint: GET /config (system endpoint at root level)
Purpose: Inspect complete worker configuration including common settings
Example Request:
curl http://localhost:8081/config | jq
Response Structure:
{
"environment": "production",
"common": {
"database": { "...": "..." },
"circuit_breakers": { "...": "..." },
"telemetry": { "...": "..." }
},
"worker": {
"template_path": "/app/templates",
"max_concurrent_steps": 500,
"resource_limits": {
"max_memory_mb": 4096,
"max_cpu_percent": 90
},
"web": {
"bind_address": "0.0.0.0:8081",
"request_timeout_ms": 60000
}
},
"metadata": {
"timestamp": "2025-10-17T15:30:45Z",
"source": "runtime",
"redacted_fields": [
"database.url",
"worker.auth_token"
]
}
}
2.2 Design Philosophy
Single Endpoint, Complete Configuration: Each system has one /config endpoint that returns both common and context-specific configuration in a single response.
Benefits:
- Single curl command: Get complete picture without correlation
- Easy comparison: Compare orchestration vs worker configs for compatibility
- Tooling-friendly: Automated tools can validate shared config matches
- Debugging-friendly: No mental correlation between multiple endpoints
- System endpoint: At root level like /health and /metrics (not under /v1/)
2.3 Comprehensive Secret Redaction
All sensitive configuration values are automatically redacted before returning to clients.
Sensitive Key Patterns (12+ patterns, case-insensitive):
password, secret, token, key, api_key, private_key, jwt_private_key, jwt_public_key, auth_token, credentials, database_url, url
Key Features:
- Recursive Processing: Handles deeply nested objects and arrays
- Field Path Tracking: Reports which fields were redacted (e.g., database.url)
- Smart Skipping: Empty strings and booleans not redacted
- Case-Insensitive: Catches API_KEY, Secret_Token, database_PASSWORD
- Structure Preservation: Non-sensitive data remains intact
Example:
{
"database": {
"url": "***REDACTED***",
"adapter": "postgresql",
"pool": {
"max_connections": 30
}
},
"metadata": {
"redacted_fields": ["database.url"]
}
}
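A simplified recursive redaction pass over a serde_json::Value looks roughly like this (an illustrative sketch with a trimmed pattern list; the real implementation covers more patterns and edge cases):

use serde_json::{json, Value};

const SENSITIVE: &[&str] = &["password", "secret", "token", "key", "api_key", "url", "credentials"];

// Recursive redaction sketch; the actual code tracks additional patterns and cases.
fn redact(value: &mut Value, path: &str, redacted: &mut Vec<String>) {
    match value {
        Value::Object(map) => {
            for (key, child) in map.iter_mut() {
                let child_path = if path.is_empty() { key.clone() } else { format!("{path}.{key}") };
                let lower = key.to_lowercase();
                let sensitive = SENSITIVE.iter().any(|p| lower.contains(p));
                // Skip empty strings and booleans, mirroring the documented behavior.
                let skip = matches!(child, Value::Bool(_))
                    || child.as_str().map_or(false, |s| s.is_empty());
                if sensitive && !skip {
                    *child = Value::String("***REDACTED***".into());
                    redacted.push(child_path);
                } else {
                    redact(child, &child_path, redacted);
                }
            }
        }
        Value::Array(items) => {
            for item in items.iter_mut() {
                redact(item, path, redacted);
            }
        }
        _ => {}
    }
}

fn main() {
    let mut config = json!({ "database": { "url": "postgresql://user:pw@db/tasker", "pool": { "max_connections": 30 } } });
    let mut redacted = Vec::new();
    redact(&mut config, "", &mut redacted);
    assert_eq!(redacted, vec!["database.url"]);
}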
2.4 OpenAPI/Swagger Integration
All configuration endpoints are documented with OpenAPI 3.0 and Swagger UI.
Access Swagger UI:
- Orchestration: http://localhost:8080/api-docs/ui
- Worker: http://localhost:8081/api-docs/ui
OpenAPI Specification:
- Orchestration: http://localhost:8080/api-docs/openapi.json
- Worker: http://localhost:8081/api-docs/openapi.json
Part 3: CLI Tools
3.1 Generate Command
Purpose: Generate a single merged configuration file from base + environment overrides for deployment.
Command Signature:
tasker-ctl config generate \
--context <common|orchestration|worker> \
--environment <test|development|production>
Examples:
# Generate orchestration config for production
tasker-ctl config generate --context orchestration --environment production
# Generate worker config for development
tasker-ctl config generate --context worker --environment development
# Generate common config for test
tasker-ctl config generate --context common --environment test
Output Location: Automatically generated at:
config/tasker/generated/{context}-{environment}.toml
Key Features:
- Automatic Paths: No need for --source-dir or --output flags
- Metadata Headers: Generated files include rich metadata:
  # Generated by Tasker Configuration System
  # Context: orchestration
  # Environment: production
  # Generated At: 2025-10-17T15:30:45Z
  # Base Config: config/tasker/base/orchestration.toml
  # Environment Override: config/tasker/environments/production/orchestration.toml
  #
  # This is a merged configuration file combining base settings with
  # environment-specific overrides. Environment values take precedence.
- Automatic Validation: Validates during generation
- Smart Merging: TOML-level merging preserves structure
3.2 Validate Command
Purpose: Validate configuration files with context-specific validation rules.
Command Signature:
tasker-ctl config validate \
--context <common|orchestration|worker> \
--environment <test|development|production>
Examples:
# Validate orchestration config for production
tasker-ctl config validate --context orchestration --environment production
# Validate worker config for test
tasker-ctl config validate --context worker --environment test
Validation Features:
- Environment variable substitution (${VAR:-default})
- Type checking (numeric ranges, boolean values)
- Required field validation
- Context-specific business rules
- Clear error messages
Example Output:
🔍 Validating configuration...
Context: orchestration
Environment: production
✓ Configuration loaded
✓ Validation passed
✅ Configuration is valid!
📊 Configuration Summary:
Context: orchestration
Environment: production
Database: postgresql://tasker:***@localhost/tasker_production
Web API: 0.0.0.0:8080
MPSC Channels: 5 configured
3.3 Configuration Validator Binary
For quick validation without the full CLI:
# Validate all three environments
TASKER_ENV=test cargo run --bin config-validator
TASKER_ENV=development cargo run --bin config-validator
TASKER_ENV=production cargo run --bin config-validator
Part 4: Environment-Specific Configurations
See Environment Configuration Comparison for complete details on configuration values across environments.
4.1 Scaling Pattern
Tasker follows a 1:5:50 scaling pattern across environments:
| Component | Test | Development | Production | Pattern |
|---|---|---|---|---|
| Database Connections | 10 | 25 | 50 | 1x → 2.5x → 5x |
| Concurrent Steps | 10 | 50 | 500 | 1x → 5x → 50x |
| MPSC Channel Buffers | 100-500 | 500-1000 | 2000-50000 | 1x → 5-10x → 20-100x |
| Memory Limits | 512MB | 2GB | 4GB | 1x → 4x → 8x |
4.2 Environment Philosophy
Test Environment:
- Goal: Fast execution, test isolation
- Strategy: Minimal resources, small buffers
- Example: 10 database connections, 100-500 MPSC buffers
Development Environment:
- Goal: Comfortable local Docker development
- Strategy: Medium values, realistic workflows
- Example: 25 database connections, 2GB RAM, 500-1000 MPSC buffers
- Cluster Testing: 2 orchestrators to test multi-instance coordination
Production Environment:
- Goal: High throughput, scale-out capacity
- Strategy: Large values, production resilience
- Example: 50 database connections, 4GB RAM, 2000-50000 MPSC buffers
Part 5: Deployment Workflows
5.1 Docker Deployment
Build-Time Configuration Generation:
FROM rust:1.75 as builder
WORKDIR /app
COPY . .
# Build CLI tool
RUN cargo build --release --bin tasker-ctl
# Generate production config (single merged file)
RUN ./target/release/tasker-ctl config generate \
--context orchestration \
--environment production \
--output config/tasker/orchestration-production.toml
# Build orchestration binary
RUN cargo build --release --bin tasker-orchestration
FROM rust:1.75-slim
WORKDIR /app
# Copy orchestration binary
COPY --from=builder /app/target/release/tasker-orchestration /usr/local/bin/
# Copy generated config (single file with all merged settings)
COPY --from=builder /app/config/tasker/orchestration-production.toml /app/config/orchestration.toml
# Set environment - TASKER_CONFIG_PATH is REQUIRED
ENV TASKER_CONFIG_PATH=/app/config/orchestration.toml
ENV TASKER_ENV=production
CMD ["tasker-orchestration"]
Key Changes from Phase 2:
- ✅ Single merged file generated at build time
- ✅ TASKER_CONFIG_PATH environment variable (required)
- ✅ No runtime merging - exact config known at build time
- ✅ Fail loudly if TASKER_CONFIG_PATH is not set
5.2 Kubernetes Deployment
ConfigMap Strategy with Pre-Generated Config:
# Step 1: Generate merged configuration locally
tasker-ctl config generate \
--context orchestration \
--environment production \
--output orchestration-production.toml
# Step 2: Create ConfigMap from generated file
kubectl create configmap tasker-orchestration-config \
--from-file=orchestration.toml=orchestration-production.toml
Deployment Manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration
spec:
replicas: 2
selector:
matchLabels:
app: tasker-orchestration
template:
metadata:
labels:
app: tasker-orchestration
spec:
containers:
- name: orchestration
image: tasker/orchestration:latest
env:
- name: TASKER_ENV
value: "production"
# REQUIRED: Path to single merged configuration file
- name: TASKER_CONFIG_PATH
value: "/config/orchestration.toml"
# DATABASE_URL should be in a separate secret
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-db-credentials
key: database-url
volumeMounts:
- name: config
mountPath: /config
readOnly: true
volumes:
- name: config
configMap:
name: tasker-orchestration-config
items:
- key: orchestration.toml
path: orchestration.toml
Key Benefits:
- ✅ Generated file reviewed before deployment
- ✅ Single source of truth for runtime configuration
- ✅ Easy to diff between environments
- ✅ ConfigMap contains exact runtime configuration
5.3 Local Development and Testing
For Tests (Legacy directory-based loading):
# Set test environment
export TASKER_ENV=test
# Tests use legacy load_context_direct() method
cargo test --all-features
For Docker Compose (Single-file loading):
# Generate test configs first
tasker-ctl config generate --context orchestration --environment test \
--output config/tasker/generated/orchestration-test.toml
tasker-ctl config generate --context worker --environment test \
--output config/tasker/generated/worker-test.toml
# Start services with generated configs
docker-compose -f docker/docker-compose.test.yml up
Docker Compose Configuration:
services:
orchestration:
environment:
# REQUIRED: Path to single merged file
TASKER_CONFIG_PATH: /app/config/tasker/generated/orchestration-test.toml
volumes:
# Mount config directory (contains generated files)
- ./config/tasker:/app/config/tasker:ro
Key Points:
- ✅ Tests use legacy directory-based loading for convenience
- ✅ Docker Compose uses single-file loading (matches production)
- ✅ Generated files should be committed to repo for reproducibility
- ✅ Both approaches work; choose based on use case
Part 6: Configuration Validation
6.1 Context-Specific Validation
Each configuration context has specific validation rules:
Common Configuration:
- Database URL format and connectivity
- Pool size ranges (1-1000 connections)
- Circuit breaker thresholds (1-100 failures)
- Timeout durations (1-3600 seconds)
Orchestration Configuration:
- Web API bind address format
- Request timeout ranges (1000-300000 ms)
- MPSC channel buffer sizes (100-100000)
- Event system configuration consistency
Worker Configuration:
- Template path existence
- Resource limit ranges (memory, CPU %)
- Handler discovery path validation
- Concurrent step limits (1-10000)
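A couple of these rules in code, to show the flavor of the checks (an illustrative sketch, not the actual validator):

use std::net::SocketAddr;

// Illustrative checks mirroring the documented ranges; not the real validation code.
fn validate_orchestration(max_connections: u32, bind_address: &str, request_timeout_ms: u64) -> Vec<String> {
    let mut errors = Vec::new();
    if !(1..=1000).contains(&max_connections) {
        errors.push(format!("database.pool.max_connections: {max_connections} outside 1-1000"));
    }
    if bind_address.parse::<SocketAddr>().is_err() {
        errors.push(format!("web.bind_address: '{bind_address}' is not a valid IP:port"));
    }
    if !(1_000..=300_000).contains(&request_timeout_ms) {
        errors.push(format!("web.request_timeout_ms: {request_timeout_ms} outside 1000-300000"));
    }
    errors
}

fn main() {
    let errors = validate_orchestration(5000, "invalid:port", 60_000);
    for e in &errors {
        println!("❌ Validation Error: {e}");
    }
    assert_eq!(errors.len(), 2);
}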
6.2 Validation Workflow
Pre-Deployment Validation:
# Validate before generating deployment artifact
tasker-ctl config validate --context orchestration --environment production
# Generate only if validation passes
tasker-ctl config generate --context orchestration --environment production
Runtime Validation:
- Configuration validated on application startup
- Invalid config prevents startup (fail-fast)
- Clear error messages for troubleshooting
6.3 Common Validation Errors
Example Error Messages:
❌ Validation Error: database.pool.max_connections
Value: 5000
Issue: Exceeds maximum allowed value (1000)
Fix: Reduce to 1000 or less
❌ Validation Error: web.bind_address
Value: "invalid:port"
Issue: Invalid IP:port format
Fix: Use format like "0.0.0.0:8080" or "127.0.0.1:3000"
Part 7: Operational Workflows
7.1 Compare Deployed Configurations
Cross-System Comparison:
# Get orchestration config
curl http://orchestration:8080/config > orch-config.json
# Get worker config
curl http://worker:8081/config > worker-config.json
# Compare common sections for compatibility
jq '.common' orch-config.json > orch-common.json
jq '.common' worker-config.json > worker-common.json
diff orch-common.json worker-common.json
Why This Matters:
- Ensures orchestration and worker share same database config
- Validates circuit breaker settings match
- Confirms telemetry endpoints aligned
7.2 Debug Configuration Issues
Step 1: Inspect Runtime Config
# Check what's actually deployed
curl http://localhost:8080/config | jq '.orchestration.web'
Step 2: Compare to Expected
# Check generated config file
cat config/tasker/generated/orchestration-production.toml
# Compare values
Step 3: Trace Configuration Source
# Check metadata for source files
curl http://localhost:8080/config | jq '.metadata'
# Metadata shows:
# - Environment (production)
# - Timestamp (when config was loaded)
# - Source (runtime)
# - Redacted fields (for transparency)
7.3 Configuration Drift Detection
Manual Comparison:
# Generate what should be deployed
tasker-ctl config generate --context orchestration --environment production
# Compare to runtime
diff config/tasker/generated/orchestration-production.toml \
<(curl -s http://localhost:8080/config | jq -r '.orchestration')
Automated Monitoring (future):
- Periodic config snapshots
- Alert on unexpected changes
- Configuration version tracking
Part 8: Best Practices
8.1 Configuration Management
DO:
✅ Use environment variables for secrets (${DATABASE_URL})
✅ Validate configs before deployment
✅ Generate single deployable artifacts for production
✅ Use /config endpoints for debugging
✅ Keep environment overrides minimal (only what changes)
✅ Document configuration changes in commit messages
DON’T:
❌ Commit production secrets to config files
❌ Mix test and production configurations
❌ Skip validation before deployment
❌ Use unbounded configuration values
❌ Override all settings in environment files
8.2 Security Best Practices
Secrets Management:
# ✅ GOOD: Use environment variable substitution
[database]
url = "${DATABASE_URL}"
# ❌ BAD: Hard-code credentials
[database]
url = "postgresql://user:password@localhost/db"
Production Deployment:
# ✅ GOOD: Use Kubernetes secrets
kubectl create secret generic tasker-db-url \
--from-literal=url='postgresql://...'
# ❌ BAD: Commit secrets to config files
Runtime Inspection:
- The /config endpoint automatically redacts secrets
- Safe to use in logging and monitoring
- Field path tracking shows what was redacted
8.3 Testing Strategy
Test All Environments:
# Ensure all environments validate
for env in test development production; do
echo "Validating $env..."
tasker-ctl config validate --context orchestration --environment $env
done
Integration Testing:
# Test with generated configs
tasker-ctl config generate --context orchestration --environment test
export TASKER_CONFIG_PATH=config/tasker/generated/orchestration-test.toml
cargo test --all-features
Part 9: Troubleshooting
9.1 Common Issues
Issue: Configuration fails to load
# Check environment variable
echo $TASKER_ENV
# Check config files exist
ls -la config/tasker/base/
ls -la config/tasker/environments/$TASKER_ENV/
# Validate config
tasker-ctl config validate --context orchestration --environment $TASKER_ENV
Issue: Unexpected configuration values at runtime
# Check runtime config
curl http://localhost:8080/config | jq
# Compare to expected
cat config/tasker/generated/orchestration-$TASKER_ENV.toml
Issue: Validation errors
# Run validation with detailed output
RUST_LOG=debug tasker-ctl config validate \
--context orchestration \
--environment production
9.2 Debug Mode
Enable Configuration Debug Logging:
# Detailed config loading logs
RUST_LOG=tasker_shared::config=debug cargo run
# Shows:
# - Which files are loaded
# - Merge order
# - Environment variable substitution
# - Validation results
Part 10: Future Enhancements
10.1 Planned Features
Explain Command (Deferred):
# Get documentation for a parameter
tasker-ctl config explain --parameter database.pool.max_connections
# Shows:
# - Purpose and system impact
# - Valid range and type
# - Environment-specific recommendations
# - Related parameters
# - Example usage
Detect-Unused Command (Deferred):
# Find unused configuration parameters
tasker-ctl config detect-unused --context orchestration
# Auto-remove with backup
tasker-ctl config detect-unused --context orchestration --fix
10.2 Operational Enhancements
Configuration Versioning:
- Track configuration changes over time
- Compare configs across versions
- Rollback capability
Automated Drift Detection:
- Periodic config snapshots
- Alert on unexpected changes
- Configuration compliance checking
Configuration Templates:
- Pre-built configurations for common scenarios
- Quick-start templates for new deployments
- Best practice configurations
Related Documentation
- Environment Configuration Comparison - Detailed comparison of configuration values across environments
- Deployment Patterns - Deployment modes and strategies
- Quick Start Guide - Getting started with Tasker
Summary
Tasker’s configuration system provides:
- Component-Based Architecture: Focused TOML files with single responsibility
- Environment Scaling: 1:5:50 pattern from test → development → production
- Single-File Runtime Loading: Deploy the exact configuration known at build time via TASKER_CONFIG_PATH
- Runtime Observability: /config endpoints with comprehensive secret redaction
- CLI Tools: Generate and validate single deployable configs
- Context-Specific Validation: Catch errors before deployment
- Security First: Automatic secret redaction, environment variable substitution
Key Workflows:
- Production/Docker: Generate the single-file config at build time, set TASKER_CONFIG_PATH, deploy
- Testing: Use legacy directory-based loading for convenience
- Debugging: Use /config endpoints to inspect runtime configuration
- Validation: Validate before generating deployment artifacts
Phase 3 Changes (October 2025):
- ✅ Runtime systems now require the TASKER_CONFIG_PATH environment variable
- ✅ Configuration loaded from single merged files (no runtime merging)
- ✅ Deployment certainty: exact config known at build time
- ✅ Fail loudly: missing/invalid config halts startup with explicit errors
- ✅ Generated configs committed to repo for reproducibility
← Back to Documentation Hub
Dead Letter Queue (DLQ) System Architecture
Purpose: Investigation tracking system for stuck, stale, or problematic tasks
Last Updated: 2025-11-01
Executive Summary
The DLQ (Dead Letter Queue) system is an investigation tracking system, NOT a task manipulation layer.
Key Principles:
- DLQ tracks “why task is stuck” and “who investigated”
- Resolution happens at step level via step APIs
- No task-level “requeue” - fix the problem steps instead
- Steps carry their own retry, attempt, and state lifecycles independent of DLQ
- DLQ is for audit, visibility, and investigation only
Architecture: PostgreSQL-based system with:
- tasks_dlq table for investigation tracking
- 3 database views for monitoring and analysis
- 6 REST endpoints for operator interaction
- Background staleness detection service
DLQ vs Step Resolution
What DLQ Does
✅ Investigation Tracking:
- Record when and why task became stuck
- Capture complete task snapshot for debugging
- Track operator investigation workflow
- Provide visibility into systemic issues
✅ Visibility and Monitoring:
- Dashboard statistics by DLQ reason
- Prioritized investigation queue for triage
- Proactive staleness monitoring (before DLQ)
- Alerting integration for high-priority entries
What DLQ Does NOT Do
❌ Task Manipulation:
- Does NOT retry failed steps
- Does NOT requeue tasks
- Does NOT modify step state
- Does NOT execute business logic
Why This Separation Matters
Steps are mutable - Operators can:
- Manually resolve failed steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- View step readiness status: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Check retry eligibility and dependency satisfaction
- Trigger next steps by completing blocked steps
DLQ is immutable audit trail - Operators should:
- Review task snapshot to understand what went wrong
- Use step endpoints to fix the underlying problem
- Update DLQ investigation status to track resolution
- Analyze DLQ patterns to prevent future occurrences
DLQ Reasons
staleness_timeout
Definition: Task exceeded state-specific staleness threshold
States:
- waiting_for_dependencies - Default 60 minutes
- waiting_for_retry - Default 30 minutes
- steps_in_process - Default 30 minutes
Template Override: Configure per-template thresholds:
lifecycle:
max_waiting_for_dependencies_minutes: 120
max_waiting_for_retry_minutes: 45
max_steps_in_process_minutes: 60
max_duration_minutes: 1440 # 24 hours
Resolution Pattern:
- Operator: GET /v1/dlq/task/{task_uuid} - Review task snapshot
- Identify stuck steps: Check current_state in the snapshot
- Fix steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Task state machine automatically progresses when steps are fixed
- Operator: PATCH /v1/dlq/entry/{dlq_entry_uuid} - Mark investigation resolved
Prevention: Use /v1/dlq/staleness endpoint for proactive monitoring
max_retries_exceeded
Definition: Step exhausted all retry attempts and remains in Error state
Resolution Pattern:
- Review step results: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Analyze last_failure_at and error details
- Fix the underlying issue (infrastructure, data, etc.)
- Manually resolve the step: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Update DLQ investigation status
dependency_cycle_detected
Definition: Circular dependency detected in workflow step graph
Resolution Pattern:
- Review task template configuration
- Identify cycle in step dependencies
- Update template to break cycle
- Manually cancel affected tasks
- Re-submit tasks with corrected template
worker_unavailable
Definition: No worker available for task’s namespace
Resolution Pattern:
- Check worker service health
- Verify namespace configuration
- Scale worker capacity if needed
- Tasks automatically progress when worker available
manual_dlq
Definition: Operator manually sent task to DLQ for investigation
Resolution Pattern: Custom per-investigation
Database Schema
tasks_dlq Table
CREATE TABLE tasker.tasks_dlq (
dlq_entry_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
task_uuid UUID NOT NULL UNIQUE, -- One pending entry per task
original_state VARCHAR(50) NOT NULL,
dlq_reason dlq_reason NOT NULL,
dlq_timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
task_snapshot JSONB, -- Complete task state for debugging
resolution_status dlq_resolution_status NOT NULL DEFAULT 'pending',
resolution_notes TEXT,
resolved_at TIMESTAMPTZ,
resolved_by VARCHAR(255),
metadata JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Unique constraint: Only one pending DLQ entry per task
CREATE UNIQUE INDEX idx_dlq_unique_pending_task
ON tasker.tasks_dlq (task_uuid)
WHERE resolution_status = 'pending';
Key Fields:
- dlq_entry_uuid - UUID v7 (time-ordered) for investigation tracking
- task_uuid - Foreign key to the task (unique for pending entries)
- original_state - Task state when sent to DLQ
- task_snapshot - JSONB snapshot with debugging context
- resolution_status - Investigation workflow status
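To make the shape concrete, here is a minimal client-side sketch of how a DLQ entry might be modeled in Rust, assuming serde, chrono, and uuid. Field names mirror the schema above, but this is not the actual tasker-shared model.

use chrono::{DateTime, Utc};
use serde::Deserialize;
use serde_json::Value;
use uuid::Uuid;

// Illustrative only: mirrors the tasks_dlq columns above, not the real
// tasker-shared model.
#[derive(Debug, Deserialize)]
pub struct DlqEntry {
    pub dlq_entry_uuid: Uuid,
    pub task_uuid: Uuid,
    pub original_state: String,
    pub dlq_reason: String,               // e.g. "staleness_timeout"
    pub dlq_timestamp: DateTime<Utc>,
    pub task_snapshot: Option<Value>,     // full JSONB snapshot for debugging
    pub resolution_status: String,        // "pending", "manually_resolved", ...
    pub resolution_notes: Option<String>,
    pub resolved_at: Option<DateTime<Utc>>,
    pub resolved_by: Option<String>,
    pub metadata: Option<Value>,
}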
Database Views
v_dlq_dashboard
Purpose: Aggregated statistics for monitoring dashboard
Columns:
- dlq_reason - Why tasks are in DLQ
- total_entries - Count of entries
- pending, manually_resolved, permanent_failures, cancelled - Breakdown by status
- oldest_entry, newest_entry - Time range
- avg_resolution_time_minutes - Average time to resolve
Use Case: High-level DLQ health monitoring
v_dlq_investigation_queue
Purpose: Prioritized queue for operator triage
Columns:
- Task and DLQ entry UUIDs
- priority_score - Composite score (base reason priority + age factor)
- minutes_in_dlq - How long the entry has been pending
- Task metadata for context
Ordering: Priority score DESC (most urgent first)
Use Case: Operator dashboard showing “what to investigate next”
v_task_staleness_monitoring
Purpose: Proactive staleness monitoring BEFORE tasks hit DLQ
Columns:
- task_uuid, namespace_name, task_name
- current_state, time_in_state_minutes
- staleness_threshold_minutes - Threshold for this state
- health_status - healthy | warning | stale
- priority - Task priority for ordering
Health Status Classification:
- healthy - < 80% of threshold
- warning - 80-99% of threshold
- stale - ≥ 100% of threshold
Use Case: Alerting at 80% threshold to prevent DLQ entries
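As a quick restatement of the classification rule, here is a small Rust sketch; the authoritative logic lives in the v_task_staleness_monitoring view, so treat this as illustrative only.

// Illustrative restatement of the 80% / 100% thresholds described above.
fn classify_health(time_in_state_minutes: f64, threshold_minutes: f64) -> &'static str {
    let ratio = time_in_state_minutes / threshold_minutes;
    if ratio >= 1.0 {
        "stale"            // at or past the threshold
    } else if ratio >= 0.8 {
        "warning"          // 80-99% of the threshold: intervene now
    } else {
        "healthy"
    }
}

fn main() {
    // 50 minutes into a 60-minute threshold is ~83% -> warning
    assert_eq!(classify_health(50.0, 60.0), "warning");
}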
REST API Endpoints
1. List DLQ Entries
GET /v1/dlq?resolution_status=pending&limit=50
Purpose: Browse DLQ entries with filtering
Query Parameters:
- resolution_status - Filter by status (optional)
- limit - Max entries (default: 50)
- offset - Pagination offset (default: 0)
Response: Array of DlqEntry objects
Use Case: General DLQ browsing and pagination
2. Get DLQ Entry with Task Snapshot
GET /v1/dlq/task/{task_uuid}
Purpose: Retrieve most recent DLQ entry for a task with complete snapshot
Response: DlqEntry with full task_snapshot JSONB
Task Snapshot Contains:
- Task UUID, namespace, name
- Current state and time in state
- Staleness threshold
- Task age and priority
- Template configuration
- Detection time
Use Case: Investigation starting point - “why is this task stuck?”
3. Update DLQ Investigation Status
PATCH /v1/dlq/entry/{dlq_entry_uuid}
Purpose: Track investigation workflow
Request Body:
{
"resolution_status": "manually_resolved",
"resolution_notes": "Fixed by manually completing stuck step using step API",
"resolved_by": "operator@example.com",
"metadata": {
"fixed_step_uuid": "...",
"root_cause": "database connection timeout"
}
}
Use Case: Document investigation findings and resolution
4. Get DLQ Statistics
GET /v1/dlq/stats
Purpose: Aggregated statistics for monitoring
Response: Statistics grouped by dlq_reason
Use Case: Dashboard metrics, identifying systemic issues
5. Get Investigation Queue
GET /v1/dlq/investigation-queue?limit=100
Purpose: Prioritized queue for operator triage
Response: Array of DlqInvestigationQueueEntry ordered by priority
Priority Factors:
- Base reason priority (staleness_timeout: 10, max_retries: 20, etc.)
- Age multiplier (older entries = higher priority)
Use Case: “What should I investigate next?”
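The exact weights live in the v_dlq_investigation_queue view, but conceptually the score combines a per-reason base priority with an age factor. The sketch below uses the example base values cited above; the age term and fallback value are assumptions for illustration.

// Conceptual sketch only; the real scoring is computed in the database view.
fn priority_score(dlq_reason: &str, minutes_in_dlq: f64) -> f64 {
    // Base priorities follow the examples above; other values are assumptions.
    let base = match dlq_reason {
        "staleness_timeout" => 10.0,
        "max_retries_exceeded" => 20.0,
        _ => 15.0,
    };
    // Hypothetical age factor: older pending entries bubble up the queue.
    base + minutes_in_dlq / 60.0
}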
6. Get Staleness Monitoring
GET /v1/dlq/staleness?limit=100
Purpose: Proactive monitoring BEFORE tasks hit DLQ
Response: Array of StalenessMonitoring with health status
Ordering: Stale first, then warning, then healthy
Use Case: Alerting and prevention
Alert Integration:
# Alert when warning count exceeds threshold
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'
Step Endpoints and Resolution Workflow
Step Endpoints
1. List Task Steps
GET /v1/tasks/{uuid}/workflow_steps
Returns: Array of steps with readiness status
Key Fields:
- current_state - Step state (pending, enqueued, in_progress, complete, error)
- dependencies_satisfied - Can the step execute?
- retry_eligible - Can the step retry?
- ready_for_execution - Ready to enqueue?
- attempts / max_attempts - Retry tracking
- last_failure_at - When the step last failed
- next_retry_at - When the step becomes eligible for retry
Use Case: Understand task execution status
2. Get Step Details
GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
Returns: Single step with full readiness analysis
Use Case: Deep dive into specific step
3. Manually Resolve Step
PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
Purpose: Operator actions to handle stuck or failed steps
Action Types:
- ResetForRetry - Reset attempt counter and return to pending for automatic retry:
{
"action_type": "reset_for_retry",
"reset_by": "operator@example.com",
"reason": "Database connection restored, resetting attempts"
}
- ResolveManually - Mark step as manually resolved without results:
{
"action_type": "resolve_manually",
"resolved_by": "operator@example.com",
"reason": "Non-critical step, bypassing for workflow continuation"
}
- CompleteManually - Complete step with execution results for dependent steps:
{
"action_type": "complete_manually",
"completion_data": {
"result": {
"validated": true,
"score": 95
},
"metadata": {
"manually_verified": true,
"verification_method": "manual_inspection"
}
},
"reason": "Manual verification completed after infrastructure fix",
"completed_by": "operator@example.com"
}
Behavior by Action Type:
- reset_for_retry: Clears the attempt counter, transitions to pending, enables automatic retry
- resolve_manually: Transitions to resolved_manually (terminal state)
- complete_manually: Transitions to complete with results available for dependent steps
Common Effects:
- Triggers task state machine re-evaluation
- Task automatically discovers next ready steps
- Task progresses when all dependencies satisfied
Use Case: Unblock stuck workflow by fixing problem step
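If you are building these PATCH requests from code, the three bodies above map naturally onto a tagged enum. The sketch below is a hypothetical client-side model using serde, not an official tasker-client type.

use serde::Serialize;
use serde_json::Value;

// Hypothetical request model; the field names follow the JSON bodies above.
#[derive(Serialize)]
#[serde(tag = "action_type", rename_all = "snake_case")]
enum StepResolutionAction {
    ResetForRetry { reset_by: String, reason: String },
    ResolveManually { resolved_by: String, reason: String },
    CompleteManually { completion_data: Value, reason: String, completed_by: String },
}

fn main() {
    let body = StepResolutionAction::ResetForRetry {
        reset_by: "operator@example.com".into(),
        reason: "Database connection restored, resetting attempts".into(),
    };
    // Serializes to {"action_type":"reset_for_retry","reset_by":...,"reason":...}
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
}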
Complete Resolution Workflow
Scenario: Task Stuck in waiting_for_dependencies
1. Operator receives DLQ alert
GET /v1/dlq/investigation-queue
# Response shows task_uuid: abc-123 with high priority
2. Operator reviews task snapshot
GET /v1/dlq/task/abc-123
# Response:
{
"dlq_entry_uuid": "xyz-789",
"task_uuid": "abc-123",
"original_state": "waiting_for_dependencies",
"dlq_reason": "staleness_timeout",
"task_snapshot": {
"task_uuid": "abc-123",
"namespace": "order_processing",
"task_name": "fulfill_order",
"current_state": "error",
"time_in_state_minutes": 65,
"threshold_minutes": 60
}
}
3. Operator checks task steps
GET /v1/tasks/abc-123/workflow_steps
# Response shows:
# step_1: complete
# step_2: error (blocked, max_attempts exceeded)
# step_3: waiting_for_dependencies (blocked by step_2)
4. Operator investigates step_2 failure
GET /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
# Response shows last_failure_at and error details
# Root cause: database connection timeout
5. Operator fixes infrastructure issue
# Fix database connection pool configuration
# Verify database connectivity
6. Operator chooses resolution strategy
Option A: Reset for retry (infrastructure fixed, retry should work):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "reset_for_retry",
"reset_by": "operator@example.com",
"reason": "Database connection pool fixed, resetting attempts for automatic retry"
}
Option B: Resolve manually (bypass step entirely):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "resolve_manually",
"resolved_by": "operator@example.com",
"reason": "Non-critical validation step, bypassing"
}
Option C: Complete manually (provide results for dependent steps):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "complete_manually",
"completion_data": {
"result": {
"validation_status": "passed",
"score": 100
},
"metadata": {
"manually_verified": true
}
},
"reason": "Manual validation completed",
"completed_by": "operator@example.com"
}
7. Task state machine automatically progresses
Outcome depends on action type chosen:
If Option A (reset_for_retry):
- Step 2 → pending (attempts reset to 0)
- Automatic retry begins when dependencies are satisfied
- Step 2 re-enqueued to worker
- If successful, workflow continues normally
If Option B (resolve_manually):
- Step 2 → resolved_manually (terminal state)
- Step 3 dependencies satisfied (manual resolution counts as success)
- Task transitions: error → enqueuing_steps
- Step 3 enqueued to worker
- Task resumes normal execution
If Option C (complete_manually):
- Step 2 → complete (with operator-provided results)
- Step 3 can consume results from completion_data
- Task transitions: error → enqueuing_steps
- Step 3 enqueued to worker with access to step 2 results
- Task resumes normal execution
8. Operator updates DLQ investigation
PATCH /v1/dlq/entry/xyz-789
{
"resolution_status": "manually_resolved",
"resolution_notes": "Fixed database connection pool configuration. Manually resolved step_2 to unblock workflow. Task resumed execution.",
"resolved_by": "operator@example.com",
"metadata": {
"root_cause": "database_connection_timeout",
"fixed_step_uuid": "{step_2_uuid}",
"infrastructure_fix": "increased_connection_pool_size"
}
}
Step Retry and Attempt Lifecycles
Step State Machine
States:
- pending - Initial state, awaiting dependencies
- enqueued - Sent to worker queue
- in_progress - Worker actively processing
- enqueued_for_orchestration - Result submitted, awaiting orchestration
- complete - Successfully finished
- error - Failed (may be retryable)
- cancelled - Manually cancelled
- resolved_manually - Operator intervention
Retry Logic
Configured per step in template:
retry:
retryable: true
max_attempts: 3
backoff: exponential
backoff_base_ms: 1000
max_backoff_ms: 30000
Retry Eligibility Criteria:
- retryable: true in configuration
- attempts < max_attempts
- Current state is error
- next_retry_at timestamp has passed (backoff elapsed)
Backoff Calculation:
backoff_ms = min(backoff_base_ms * (2 ^ (attempts - 1)), max_backoff_ms)
Example (base=1000ms, max=30000ms):
- Attempt 1 fails → wait 1s
- Attempt 2 fails → wait 2s
- Attempt 3 fails → wait 4s
SQL Function: get_step_readiness_status() calculates retry_eligible and next_retry_at
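The same formula expressed in Rust, for clarity; the authoritative calculation is the SQL function above, so treat this as a sketch.

// Sketch of the exponential backoff formula with the cap applied.
fn backoff_ms(backoff_base_ms: u64, max_backoff_ms: u64, attempts: u32) -> u64 {
    let exponent = attempts.saturating_sub(1).min(63);
    let raw = backoff_base_ms.saturating_mul(1u64 << exponent);
    raw.min(max_backoff_ms)
}

fn main() {
    // base = 1000ms, max = 30000ms, matching the example above
    assert_eq!(backoff_ms(1000, 30_000, 1), 1_000);   // attempt 1 fails -> wait 1s
    assert_eq!(backoff_ms(1000, 30_000, 2), 2_000);   // attempt 2 fails -> wait 2s
    assert_eq!(backoff_ms(1000, 30_000, 3), 4_000);   // attempt 3 fails -> wait 4s
    assert_eq!(backoff_ms(1000, 30_000, 10), 30_000); // capped at max_backoff_ms
}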
Attempt Tracking
Fields (on workflow_steps table):
- attempts - Current attempt count
- max_attempts - Configuration limit
- last_attempted_at - Timestamp of last execution
- last_failure_at - Timestamp of last failure
Workflow:
- Step enqueued → attempts++
- Step fails → Record last_failure_at, calculate next_retry_at
- Backoff elapses → Step becomes retry_eligible: true
- Orchestration discovers ready steps → Step re-enqueued
- Repeat until success or attempts >= max_attempts
Max Attempts Exceeded:
- Step remains in error state
- retry_eligible: false
- Task transitions to error state
- May trigger a DLQ entry with reason max_retries_exceeded
Independence from DLQ
Key Point: Step retry logic is INDEPENDENT of DLQ
- Steps retry automatically based on configuration
- DLQ does NOT trigger retries
- DLQ does NOT modify retry counters
- DLQ is pure observation and investigation
Why This Matters:
- Retry logic is predictable and configuration-driven
- DLQ doesn’t interfere with normal workflow execution
- Operators can manually resolve to bypass retry limits
- DLQ provides visibility into retry exhaustion patterns
Staleness Detection
Background Service
Component: tasker-orchestration/src/orchestration/staleness_detector.rs
Configuration:
[staleness_detection]
enabled = true
batch_size = 100
detection_interval_seconds = 300 # 5 minutes
Operation:
- Timer triggers every 5 minutes
- Calls the detect_and_transition_stale_tasks() SQL function
- Function identifies tasks exceeding thresholds
- Creates DLQ entries for stale tasks
- Transitions tasks to error state
- Records OpenTelemetry metrics
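A highly simplified sketch of that cycle is shown below, assuming tokio, sqlx, and tracing; the real staleness_detector.rs also honors batch_size, records OpenTelemetry metrics, and handles errors properly.

use std::time::Duration;
use sqlx::PgPool;

// Simplified sketch of the detection loop; not the actual implementation.
async fn run_staleness_detector(pool: PgPool, interval_secs: u64) -> Result<(), sqlx::Error> {
    let mut ticker = tokio::time::interval(Duration::from_secs(interval_secs));
    loop {
        ticker.tick().await; // fires every detection_interval_seconds
        // Discovery, DLQ entry creation, and state transitions all happen in SQL.
        let rows = sqlx::query("SELECT * FROM detect_and_transition_stale_tasks()")
            .fetch_all(&pool)
            .await?;
        tracing::info!(stale_tasks = rows.len(), "staleness detection cycle complete");
    }
}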
Staleness Thresholds
Per-State Defaults (configurable):
- waiting_for_dependencies: 60 minutes
- waiting_for_retry: 30 minutes
- steps_in_process: 30 minutes
Per-Template Override:
lifecycle:
max_waiting_for_dependencies_minutes: 120
max_waiting_for_retry_minutes: 45
max_steps_in_process_minutes: 60
Precedence: Template config > Global defaults
Staleness SQL Function
Function: detect_and_transition_stale_tasks()
Architecture:
v_task_state_analysis (base view)
│
├── get_stale_tasks_for_dlq() (discovery function)
│ │
│ └── detect_and_transition_stale_tasks() (main orchestration)
│ ├── create_dlq_entry() (DLQ creation)
│ └── transition_stale_task_to_error() (state transition)
Performance Optimization:
- Expensive joins happen ONCE in base view
- Discovery function filters stale tasks
- Main function processes results in loop
- LEFT JOIN anti-join pattern for excluding tasks with pending DLQ entries
Output: Returns StalenessResult records with:
- Task identification (UUID, namespace, name)
- State and timing information
- action_taken - What happened (enum: TransitionedToDlqAndError, MovedToDlqOnly, etc.)
- moved_to_dlq - Boolean
- transition_success - Boolean
OpenTelemetry Metrics
Metrics Exported
Counters:
- tasker.dlq.entries_created.total - DLQ entries created
- tasker.staleness.tasks_detected.total - Stale tasks detected
- tasker.staleness.tasks_transitioned_to_error.total - Tasks moved to Error
- tasker.staleness.detection_runs.total - Detection cycles
Histograms:
- tasker.staleness.detection.duration - Detection execution time (ms)
- tasker.dlq.time_in_queue - Time in DLQ before resolution (hours)
Gauges:
- tasker.dlq.pending_investigations - Current pending DLQ count
Alert Examples
Prometheus Alerting Rules:
# Alert on high pending investigations
- alert: HighPendingDLQInvestigations
expr: tasker_dlq_pending_investigations > 50
for: 15m
labels:
severity: warning
annotations:
summary: "High number of pending DLQ investigations ({{ $value }})"
# Alert on slow detection cycles
- alert: SlowStalenessDetection
expr: tasker_staleness_detection_duration > 5000
for: 5m
labels:
severity: warning
annotations:
summary: "Staleness detection taking >5s ({{ $value }}ms)"
# Alert on high stale task rate
- alert: HighStalenessRate
expr: rate(tasker_staleness_tasks_detected_total[5m]) > 10
for: 10m
labels:
severity: critical
annotations:
summary: "High rate of stale task detection ({{ $value }}/sec)"
CLI Usage Examples
The tasker-ctl tool provides commands for managing workflow steps directly from the command line.
List Workflow Steps
# List all steps for a task
tasker-ctl task steps <TASK_UUID>
# Example output:
# ✓ Found 3 workflow steps:
#
# Step: validate_input (01933d7c-...)
# State: complete
# Dependencies satisfied: true
# Ready for execution: false
# Attempts: 1/3
#
# Step: process_order (01933d7c-...)
# State: error
# Dependencies satisfied: true
# Ready for execution: false
# Attempts: 3/3
# ⚠ Retry eligible
Get Step Details
# Get detailed information about a specific step
tasker-ctl task step <TASK_UUID> <STEP_UUID>
# Example output:
# ✓ Step Details:
#
# UUID: 01933d7c-...
# Name: process_order
# State: error
# Dependencies satisfied: true
# Ready for execution: false
# Retry eligible: false
# Attempts: 3/3
# Last failure: 2025-11-02T14:23:45Z
Reset Step for Retry
When infrastructure is fixed and you want to reset attempt counter:
tasker-ctl task reset-step <TASK_UUID> <STEP_UUID> \
--reason "Database connection pool increased" \
--reset-by "ops-team@example.com"
# Example output:
# ✓ Step reset successfully!
# New state: pending
# Reason: Database connection pool increased
# Reset by: ops-team@example.com
Resolve Step Manually
When you want to bypass a non-critical step:
tasker-ctl task resolve-step <TASK_UUID> <STEP_UUID> \
--reason "Non-critical validation, bypassing" \
--resolved-by "ops-team@example.com"
# Example output:
# ✓ Step resolved manually!
# New state: resolved_manually
# Reason: Non-critical validation, bypassing
# Resolved by: ops-team@example.com
Complete Step Manually with Results
When you’ve manually performed the step’s work and need to provide results:
tasker-ctl task complete-step <TASK_UUID> <STEP_UUID> \
--result '{"validated": true, "score": 95}' \
--metadata '{"verification_method": "manual_review"}' \
--reason "Manual verification after infrastructure fix" \
--completed-by "ops-team@example.com"
# Example output:
# ✓ Step completed manually with results!
# New state: complete
# Reason: Manual verification after infrastructure fix
# Completed by: ops-team@example.com
JSON Formatting Tips:
# Use single quotes around JSON to avoid shell escaping issues
--result '{"key": "value"}'
# For complex JSON, use a heredoc or file
--result "$(cat <<'EOF'
{
"validation_status": "passed",
"checks": ["auth", "permissions", "rate_limit"],
"score": 100
}
EOF
)"
# Or read from a file
--result "$(cat result.json)"
Operational Runbooks
Runbook 1: Investigating High DLQ Count
Trigger: tasker_dlq_pending_investigations > 50
Steps:
- Check DLQ dashboard:
curl /v1/dlq/stats | jq
- Identify dominant reason:
{
"dlq_reason": "staleness_timeout",
"total_entries": 45,
"pending": 45
}
- Get investigation queue:
curl /v1/dlq/investigation-queue?limit=10 | jq
- Check staleness monitoring:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "stale")'
- Identify patterns:
- Common namespace?
- Common task template?
- Common time period?
- Take action:
- Infrastructure issue? → Fix and manually resolve affected tasks
- Template misconfiguration? → Update template thresholds
- Worker unavailable? → Scale worker capacity
- Systemic dependency issue? → Investigate upstream systems
Runbook 2: Proactive Staleness Prevention
Trigger: Regular monitoring (not incident-driven)
Steps:
- Monitor warning threshold:
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'
- Alert when warning count exceeds baseline:
if [ $warning_count -gt 10 ]; then
alert "High staleness warning count: $warning_count tasks at 80%+ threshold"
fi
- Investigate early:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "warning") | {
task_uuid,
current_state,
time_in_state_minutes,
staleness_threshold_minutes,
threshold_percentage: ((.time_in_state_minutes / .staleness_threshold_minutes) * 100)
}'
- Intervene before DLQ:
- Check task steps for blockages
- Review dependencies
- Manually resolve if appropriate
Best Practices
For Operators
✅ DO:
- Use staleness monitoring for proactive prevention
- Document investigation findings in DLQ resolution notes
- Fix root causes, not just symptoms
- Update DLQ investigation status promptly
- Use step endpoints to resolve stuck workflows
- Monitor DLQ statistics for systemic patterns
❌ DON’T:
- Don’t try to “requeue” from DLQ - fix the steps instead
- Don’t ignore warning health status - investigate early
- Don’t manually resolve steps without fixing root cause
- Don’t leave DLQ investigations in pending status indefinitely
For Developers
✅ DO:
- Configure appropriate staleness thresholds per template
- Make steps retryable with sensible backoff
- Implement idempotent step handlers
- Add defensive timeouts to prevent hanging
- Test workflows under failure scenarios
❌ DON’T:
- Don’t set thresholds too low (causes false positives)
- Don’t set thresholds too high (delays detection)
- Don’t make all steps non-retryable
- Don’t ignore DLQ patterns - they indicate design issues
- Don’t rely on DLQ for normal workflow control flow
Testing
Test Coverage
Unit Tests: SQL function testing (17 tests)
- Staleness detection logic
- DLQ entry creation
- Threshold calculation with template overrides
- View query correctness
Integration Tests: Lifecycle testing (4 tests)
- Waiting for dependencies staleness (test_dlq_lifecycle_waiting_for_dependencies_staleness)
- Steps in process staleness (test_dlq_lifecycle_steps_in_process_staleness)
- Proactive monitoring with health status progression (test_dlq_lifecycle_proactive_monitoring)
- Complete investigation workflow (test_dlq_investigation_workflow)
Metrics Tests: OpenTelemetry integration (1 test)
- Staleness detection metrics recording
- DLQ investigation metrics recording
- Pending investigations gauge query
Test Template: tests/fixtures/task_templates/rust/dlq_staleness_test.yaml
- 2-step linear workflow
- 2-minute staleness thresholds for fast test execution
- Test-only template for lifecycle validation
Performance: All 22 tests complete in 0.95s (< 1s target)
Implementation Notes
File Locations:
- Staleness detector: tasker-orchestration/src/orchestration/staleness_detector.rs
- DLQ models: tasker-shared/src/models/orchestration/dlq.rs
- SQL functions: migrations/20251122000004_add_dlq_discovery_function.sql
- Database views: migrations/20251122000003_add_dlq_views.sql
Key Design Decisions:
- Investigation tracking only - no task manipulation
- Step-level resolution via existing step endpoints
- Proactive monitoring at 80% threshold
- Template-specific threshold overrides
- Atomic DLQ entry creation with unique constraint
- Time-ordered UUID v7 for investigation tracking
Future Enhancements
Potential improvements (not currently planned):
- DLQ Patterns Analysis
  - Machine learning to identify systemic issues
  - Automated root cause suggestions
  - Pattern clustering by namespace/template
- Advanced Alerting
  - Anomaly detection on staleness rates
  - Predictive DLQ entry forecasting
  - Correlation with infrastructure metrics
- Investigation Workflow
  - Automated triage rules
  - Escalation policies
  - Integration with incident management systems
- Performance Optimization
  - Materialized views for the dashboard
  - Query result caching
  - Incremental staleness detection
End of Documentation
Handler Resolution Guide
Last Updated: 2026-01-08 Audience: Developers, Architects Status: Active Related Docs: Worker Event Systems | API Convergence Matrix
← Back to Guides
Overview
Handler resolution is the process of converting a callable address (a string in your YAML template) into an executable handler instance that can process workflow steps. The resolver chain pattern provides a flexible, extensible approach that works consistently across all language workers.
This guide covers:
- The mental model for handler resolution
- The common path for task templates
- Built-in resolvers and how they work
- Method dispatch for multi-method handlers
- Writing custom resolvers
- Cross-language considerations
Mental Model
Handler resolution uses three key concepts:
handler:
callable: "PaymentProcessor" # 1. Address: WHERE to find the handler
method: "refund" # 2. Entry Point: WHICH method to invoke
resolver: "explicit_mapping" # 3. Resolution Hint: HOW to resolve
1. Address (callable)
The callable field is a logical address that identifies the handler. Think of it like a URL - it points to where the handler lives, but the format depends on your resolution strategy:
| Format | Example | Resolver |
|---|---|---|
| Short name | payment_processor | ExplicitMappingResolver |
| Class path (Ruby) | PaymentHandlers::ProcessPaymentHandler | ClassConstantResolver |
| Module path (Python) | payment_handlers.ProcessPaymentHandler | ClassLookupResolver |
| Namespace path (TS) | PaymentHandlers.ProcessPaymentHandler | ClassLookupResolver |
2. Entry Point (method)
The method field specifies which method to invoke on the handler. This enables multi-method handlers - a single handler class that exposes multiple entry points:
# Default: calls the `call` method
handler:
callable: payment_processor
# Explicit method: calls the `refund` method instead
handler:
callable: payment_processor
method: refund
When to use method dispatch:
- Payment handlers with charge, refund, void methods
- Validation handlers with validate_input, validate_output methods
- CRUD handlers with create, read, update, delete methods
3. Resolution Hint (resolver)
The resolver field is an optional optimization that bypasses the resolver chain and goes directly to a specific resolver:
# Let the chain figure it out (default)
handler:
callable: payment_processor
# Skip directly to explicit mapping (faster, explicit)
handler:
callable: payment_processor
resolver: explicit_mapping
When to use resolver hints:
- Performance optimization for high-throughput steps
- Explicit documentation of resolution strategy
- Avoiding ambiguity when multiple resolvers could match
The Common Path
For most templates, you don’t need to think about resolution at all. The default resolution flow handles common cases automatically:
# Most common pattern - just specify the callable
steps:
- name: process_payment
handler:
callable: process_payment # Resolved by ExplicitMappingResolver
initialization:
timeout_ms: 5000
What happens under the hood:
- Worker receives step execution event
- HandlerDispatchService extracts the HandlerDefinition
- ResolverChain iterates through resolvers by priority
- ExplicitMappingResolver (priority 10) finds the registered handler
- Handler is invoked with the call() method (default)
Resolver Chain Architecture
The resolver chain is an ordered list of resolvers, each with a priority. Lower priority numbers are checked first:
┌─────────────────────────────────────────────────────────────────┐
│ ResolverChain │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ ExplicitMapping │ │ ClassConstant │ │
│ │ Priority: 10 │──│ Priority: 100 │──► ... │
│ │ │ │ │ │
│ │ "process_payment" ──►│ │ "Handlers::Payment"──► │
│ │ Handler instance │ │ constantize() │ │
│ └──────────────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Resolution Flow
HandlerDefinition
│
▼
┌──────────────────┐
│ Has resolver │──Yes──► Go directly to named resolver
│ hint? │
└────────┬─────────┘
│ No
▼
┌──────────────────┐
│ ExplicitMapping │──can_resolve?──Yes──► Return handler
│ (priority 10) │
└────────┬─────────┘
│ No
▼
┌──────────────────┐
│ ClassConstant │──can_resolve?──Yes──► Return handler
│ (priority 100) │
└────────┬─────────┘
│ No
▼
ResolutionError
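In code, the flow above amounts to a short loop. The sketch below is conceptual rather than the actual tasker-shared implementation; it reuses the trait names shown later under “Writing Custom Resolvers”, and the resolver-hint field and error variant are illustrative assumptions.

use std::sync::Arc;

// Conceptual sketch of chain resolution; names mirror the custom-resolver
// examples below, but the hint field and error variant are assumptions.
async fn resolve_handler(
    resolvers: &[Arc<dyn StepHandlerResolver>], // assumed sorted by ascending priority
    definition: &HandlerDefinition,
    context: &ResolutionContext,
) -> Result<Arc<dyn ResolvedHandler>, ResolutionError> {
    // 1. A resolver hint short-circuits the chain.
    if let Some(hint) = &definition.resolver {
        if let Some(resolver) = resolvers.iter().find(|r| r.resolver_name() == hint.as_str()) {
            return resolver.resolve(definition, context).await;
        }
    }
    // 2. Otherwise walk the chain in priority order and take the first match.
    for resolver in resolvers {
        if resolver.can_resolve(definition) {
            return resolver.resolve(definition, context).await;
        }
    }
    Err(ResolutionError::NotFound(definition.callable.clone())) // illustrative variant
}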
Built-in Resolvers
ExplicitMappingResolver (Priority 10)
The primary resolver for all workers. Handlers are registered with string keys at startup:
#![allow(unused)]
fn main() {
// Rust registration
registry.register("process_payment", Arc::new(ProcessPaymentHandler::new()));
}
# Ruby registration
registry.register("process_payment", ProcessPaymentHandler)
# Python registration
registry.register("process_payment", ProcessPaymentHandler)
// TypeScript registration
registry.register("process_payment", ProcessPaymentHandler);
When it resolves: When the callable exactly matches a registered key.
Best for:
- Native Rust handlers (required - no runtime reflection)
- Performance-critical handlers
- Explicit, predictable resolution
Class Lookup Resolvers (Priority 100)
Dynamic language only (Ruby, Python, TypeScript). Interprets the callable as a class path and instantiates it at runtime.
Naming Note: Ruby uses ClassConstantResolver (Ruby terminology for classes). Python and TypeScript use ClassLookupResolver. The functionality is equivalent.
# Ruby: Uses Object.const_get (ClassConstantResolver)
handler:
callable: PaymentHandlers::ProcessPaymentHandler
# Python: Uses importlib (ClassLookupResolver)
handler:
callable: payment_handlers.ProcessPaymentHandler
# TypeScript: Uses dynamic import (ClassLookupResolver)
handler:
callable: PaymentHandlers.ProcessPaymentHandler
When it resolves: When the callable looks like a class/module path (contains ::, ., or starts with uppercase).
Best for:
- Convention-over-configuration setups
- Handlers that don’t need explicit registration
- Dynamic handler loading
Not available in Rust: Rust has no runtime reflection, so class lookup resolvers always return None. Use ExplicitMappingResolver instead.
Method Dispatch
Method dispatch allows a single handler to expose multiple entry points. This is useful for handlers that perform related operations:
Defining a Multi-Method Handler
# Ruby
class PaymentHandler < TaskerCore::StepHandler::Base
def call(context)
# Default method - standard payment processing
end
def refund(context)
# Refund-specific logic
end
def void(context)
# Void-specific logic
end
end
# Python
class PaymentHandler(StepHandler):
def call(self, context: StepContext) -> StepHandlerResult:
# Default method
pass
def refund(self, context: StepContext) -> StepHandlerResult:
# Refund-specific logic
pass
// TypeScript
class PaymentHandler extends StepHandler {
async call(context: StepContext): Promise<StepHandlerResult> {
// Default method
}
async refund(context: StepContext): Promise<StepHandlerResult> {
// Refund-specific logic
}
}
#![allow(unused)]
fn main() {
// Rust - requires explicit method routing
impl RustStepHandler for PaymentHandler {
async fn call(&self, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
// Default method
}
async fn invoke_method(&self, method: &str, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
match method {
"refund" => self.refund(step).await,
"void" => self.void(step).await,
_ => self.call(step).await,
}
}
}
}
Using Method Dispatch in Templates
steps:
- name: process_refund
handler:
callable: payment_handler
method: refund # Invokes refund() instead of call()
initialization:
reason_required: true
How Method Dispatch Works
- Resolver chain resolves the handler from callable
- If method is specified and is not “call”, a MethodDispatchWrapper is applied
- When invoked, the wrapper calls the specified method instead of call()
┌───────────────────┐
HandlerDefinition ──│ ResolverChain │── Handler
(method: "refund") │ │
└─────────┬─────────┘
│
▼
┌───────────────────┐
│ MethodDispatch │
│ Wrapper │
│ │
│ inner.refund() │
└───────────────────┘
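Conceptually, the wrapper is a thin delegation layer around the resolved handler. The Rust sketch below reuses the trait names from the multi-method example above and is illustrative, not the actual MethodDispatchWrapper.

// Illustrative wrapper; the real MethodDispatchWrapper in each worker may differ.
struct MethodDispatchWrapper<H> {
    inner: H,
    method: String,
}

impl<H: RustStepHandler> MethodDispatchWrapper<H> {
    async fn call(&self, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        if self.method == "call" {
            self.inner.call(step).await
        } else {
            // Delegate to the handler's explicit method router.
            self.inner.invoke_method(&self.method, step).await
        }
    }
}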
Writing Custom Resolvers
You can extend the resolver chain with custom resolution strategies for your domain.
Rust Custom Resolver
#![allow(unused)]
fn main() {
use tasker_shared::registry::{StepHandlerResolver, ResolutionContext, ResolvedHandler};
use async_trait::async_trait;
#[derive(Debug)]
pub struct ServiceDiscoveryResolver {
service_registry: Arc<ServiceRegistry>,
}
#[async_trait]
impl StepHandlerResolver for ServiceDiscoveryResolver {
fn resolver_name(&self) -> &str {
"service_discovery"
}
fn priority(&self) -> u32 {
50 // Between explicit (10) and class constant (100)
}
fn can_resolve(&self, definition: &HandlerDefinition) -> bool {
// Resolve callables that start with "service://"
definition.callable.starts_with("service://")
}
async fn resolve(
&self,
definition: &HandlerDefinition,
context: &ResolutionContext,
) -> Result<Arc<dyn ResolvedHandler>, ResolutionError> {
let service_name = definition.callable.strip_prefix("service://").unwrap();
let handler = self.service_registry.lookup(service_name).await?;
Ok(Arc::new(StepHandlerAsResolved::new(handler)))
}
}
}
Ruby Custom Resolver
module TaskerCore
module Registry
class ServiceDiscoveryResolver < BaseResolver
def resolver_name
"service_discovery"
end
def priority
50
end
def can_resolve?(definition)
definition.callable.start_with?("service://")
end
def resolve(definition, context)
service_name = definition.callable.delete_prefix("service://")
handler_class = ServiceRegistry.lookup(service_name)
handler_class.new
end
end
end
end
Python Custom Resolver
from tasker_core.registry import BaseResolver, ResolutionError
class ServiceDiscoveryResolver(BaseResolver):
def resolver_name(self) -> str:
return "service_discovery"
def priority(self) -> int:
return 50
def can_resolve(self, definition: HandlerDefinition) -> bool:
return definition.callable.startswith("service://")
async def resolve(
self, definition: HandlerDefinition, context: ResolutionContext
) -> ResolvedHandler:
service_name = definition.callable.removeprefix("service://")
handler_class = self.service_registry.lookup(service_name)
return handler_class()
TypeScript Custom Resolver
import { BaseResolver, HandlerDefinition, ResolutionContext } from './registry';
export class ServiceDiscoveryResolver extends BaseResolver {
resolverName(): string {
return 'service_discovery';
}
priority(): number {
return 50;
}
canResolve(definition: HandlerDefinition): boolean {
return definition.callable.startsWith('service://');
}
async resolve(
definition: HandlerDefinition,
context: ResolutionContext
): Promise<ResolvedHandler> {
const serviceName = definition.callable.replace('service://', '');
const HandlerClass = await this.serviceRegistry.lookup(serviceName);
return new HandlerClass();
}
}
Registering Custom Resolvers
#![allow(unused)]
fn main() {
// Rust
let mut chain = ResolverChain::new();
chain.register(Arc::new(ExplicitMappingResolver::new()));
chain.register(Arc::new(ServiceDiscoveryResolver::new(service_registry)));
chain.register(Arc::new(ClassConstantResolver::new()));
}
# Ruby
chain = TaskerCore::Registry::ResolverChain.new
chain.register(TaskerCore::Registry::ExplicitMappingResolver.new)
chain.register(ServiceDiscoveryResolver.new(service_registry))
chain.register(TaskerCore::Registry::ClassConstantResolver.new)
Cross-Language Considerations
Why Rust is Different
Rust has no runtime reflection, which affects handler resolution:
| Capability | Ruby/Python/TypeScript | Rust |
|---|---|---|
| Class Lookup Resolver | ✅ Works | ❌ Always returns None |
| Method dispatch | ✅ Native (send, getattr) | ⚠️ Requires invoke_method |
| Dynamic handler loading | ✅ const_get, importlib | ❌ Must pre-register |
Best Practice for Rust:
- Always use ExplicitMappingResolver with explicit registration
- Implement invoke_method() for multi-method handlers
- Use resolver hints (resolver: explicit_mapping) for clarity
Method Dispatch by Language
| Language | Default Method | Dynamic Dispatch |
|---|---|---|
| Ruby | call | handler.public_send(method, context) |
| Python | call | getattr(handler, method)(context) |
| TypeScript | call | handler[method](context) |
| Rust | call | handler.invoke_method(method, step) |
Troubleshooting
“Handler not found” Error
Symptoms: ResolutionError: No resolver could resolve callable 'my_handler'
Causes:
- Handler not registered with ExplicitMappingResolver
- Class path typo (for ClassConstantResolver)
- Handler registered with different name than callable
Solutions:
#![allow(unused)]
fn main() {
// Verify registration
assert!(registry.is_registered("my_handler"));
// Check registered handlers
println!("{:?}", registry.list_handlers());
}
Method Not Found
Symptoms: MethodNotFound: Handler 'my_handler' does not respond to 'refund'
Causes:
- Method name typo in YAML template
- Method not defined on handler class
- Method is private (Ruby) or underscore-prefixed (Python)
Solutions:
# Verify method name matches exactly
handler:
callable: payment_handler
method: refund # Must match method name in handler
Resolver Hint Ignored
Symptoms: Resolution works but seems slow, or wrong resolver is used
Causes:
- Resolver hint name doesn’t match any registered resolver
- Resolver with that name returns None for this callable
Solutions:
# Use exact resolver name
handler:
callable: my_handler
resolver: explicit_mapping # Not "explicit" or "mapping"
Best Practices
1. Prefer Explicit Registration
# Good: Clear, predictable, works in all languages
handler:
callable: process_payment
# Avoid: Relies on runtime class lookup, not portable to Rust
handler:
callable: PaymentHandlers::ProcessPaymentHandler
2. Use Method Dispatch for Related Operations
# Good: Single handler, multiple entry points
steps:
- name: validate_input
handler:
callable: validator
method: validate_input
- name: validate_output
handler:
callable: validator
method: validate_output
# Avoid: Separate handlers for closely related operations
steps:
- name: validate_input
handler:
callable: input_validator
- name: validate_output
handler:
callable: output_validator
3. Document Resolution Strategy
# Good: Explicit about how resolution works
handler:
callable: payment_processor
resolver: explicit_mapping # Self-documenting
method: refund
initialization:
timeout_ms: 5000
4. Test Resolution in Isolation
#![allow(unused)]
fn main() {
#[test]
fn test_handler_resolution() {
let chain = create_resolver_chain();
let definition = HandlerDefinition::builder()
.callable("process_payment")
.build();
assert!(chain.can_resolve(&definition));
}
}
Summary
| Concept | Purpose | Default |
|---|---|---|
| callable | Handler address | Required |
| method | Entry point method | "call" |
| resolver | Resolution strategy hint | Chain iteration |
| ExplicitMappingResolver | Registered handlers | Priority 10 |
| ClassConstantResolver / ClassLookupResolver | Dynamic class lookup | Priority 100 |
| MethodDispatchWrapper | Multi-method support | Applied when method != "call" |
The resolver chain provides a flexible, extensible system for handler resolution that works consistently across all language workers while respecting each language’s capabilities.
Task Identity Strategy Pattern
Last Updated: 2026-01-20 Audience: Developers, Operators Status: Active Related Docs: Documentation Hub | Idempotency and Atomicity
← Back to Documentation Hub
Overview
Task identity determines how Tasker deduplicates task creation requests. The identity strategy pattern allows named tasks to configure their deduplication behavior based on domain requirements.
When a task creation request arrives, Tasker computes an identity hash based on the configured strategy. If a task with that identity hash already exists, the request is rejected with a 409 Conflict response.
Why This Matters
Task identity is domain-specific:
| Use Case | Same Template + Same Context | Desired Behavior |
|---|---|---|
| Payment processing | Likely accidental duplicate | Deduplicate (safety) |
| Nightly batch job | Intentional repetition | Allow (operational) |
| Report generation | Could be either | Configurable |
| Event-driven triggers | Often intentional | Allow |
| Retry with same params | Intentional | Allow |
A TaskRequest with identical context might be:
- An accidental duplicate (network retry, user double-click) → should deduplicate
- An intentional repetition (scheduled job, legitimate re-run) → should allow
Identity Strategies
STRICT (Default)
identity_hash = hash(named_task_uuid, normalized_context)
Same named task + same context = same identity hash = deduplicated.
Use when:
- Accidental duplicates are a risk (payments, orders, notifications)
- Context fully describes the work to be done
- Network retries or user double-clicks should be safe
Example:
#![allow(unused)]
fn main() {
// Payment processing - same payment_id should never create duplicate tasks
TaskRequest {
namespace: "payments".to_string(),
name: "process_payment".to_string(),
context: json!({
"payment_id": "PAY-12345",
"amount": 100.00,
"currency": "USD"
}),
idempotency_key: None, // Uses STRICT strategy
..Default::default()
}
}
CALLER_PROVIDED
identity_hash = hash(named_task_uuid, idempotency_key)
Caller must provide idempotency_key. Request is rejected with 400 Bad Request if the key is missing.
Use when:
- Caller has a natural idempotency key (order_id, transaction_id, request_id)
- Caller needs control over deduplication scope
- Similar to Stripe’s Idempotency-Key pattern
Example:
#![allow(unused)]
fn main() {
// Order processing - caller controls idempotency with their order ID
TaskRequest {
namespace: "orders".to_string(),
name: "fulfill_order".to_string(),
context: json!({
"order_id": "ORD-98765",
"items": [...]
}),
idempotency_key: Some("ORD-98765".to_string()), // Required for CallerProvided
..Default::default()
}
}
ALWAYS_UNIQUE
identity_hash = uuidv7()
Every request creates a new task. No deduplication.
Use when:
- Every submission should create work (notifications, events)
- Repetition is intentional (scheduled jobs, cron-like triggers)
- Context doesn’t define uniqueness
Example:
#![allow(unused)]
fn main() {
// Notification sending - every call should send a notification
TaskRequest {
namespace: "notifications".to_string(),
name: "send_email".to_string(),
context: json!({
"user_id": 123,
"template": "welcome",
"data": {...}
}),
idempotency_key: None, // ALWAYS_UNIQUE ignores this
..Default::default()
}
}
Configuration
Named Task Configuration
Set the identity strategy in your task template:
# templates/payments/process_payment.yaml
namespace: payments
name: process_payment
version: "1.0.0"
identity_strategy: strict # strict | caller_provided | always_unique
steps:
- name: validate_payment
handler: payment_validator
# ...
Per-Request Override
The idempotency_key field overrides any strategy:
#![allow(unused)]
fn main() {
// Even if named task is ALWAYS_UNIQUE, this key makes it deduplicate
TaskRequest {
idempotency_key: Some("my-custom-key-12345".to_string()),
// ... other fields
}
}
Precedence:
- idempotency_key (if provided) → always uses the hash of the key
- Named task’s identity_strategy → applies if no key is provided
- Default → STRICT (if no strategy is configured)
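The precedence can be summarized with a small sketch; the actual identity hash is computed server-side at task creation, so this is only a restatement of the rules above.

// Conceptual restatement of the precedence rules; not production code.
#[derive(Clone, Copy)]
enum IdentityStrategy {
    Strict,
    CallerProvided,
    AlwaysUnique,
}

fn identity_source(idempotency_key: Option<&str>, configured: Option<IdentityStrategy>) -> String {
    match (idempotency_key, configured.unwrap_or(IdentityStrategy::Strict)) {
        // 1. A caller-provided key always wins, regardless of strategy.
        (Some(key), _) => format!("hash(named_task_uuid, {key})"),
        // 2. Otherwise the named task's configured strategy applies (default: STRICT).
        (None, IdentityStrategy::Strict) => "hash(named_task_uuid, normalized_context)".to_string(),
        (None, IdentityStrategy::CallerProvided) => "400 Bad Request: idempotency_key required".to_string(),
        (None, IdentityStrategy::AlwaysUnique) => "uuidv7()".to_string(),
    }
}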
API Behavior
Successful Creation (201 Created)
{
"task_uuid": "019bddae-b818-7d82-b7c5-bd42e5db27fc",
"step_count": 4,
"message": "Task created successfully"
}
Duplicate Identity (409 Conflict)
When a task with the same identity hash exists:
{
"error": {
"code": "CONFLICT",
"message": "A task with this identity already exists. The task's identity strategy prevents duplicate creation."
}
}
Security Note: The API returns 409 Conflict rather than the existing task’s UUID. This prevents potential data leakage where attackers could probe for existing task UUIDs by submitting requests with guessed contexts.
Missing Idempotency Key (400 Bad Request)
When CallerProvided strategy requires a key:
{
"error": {
"code": "BAD_REQUEST",
"message": "idempotency_key is required when named task uses CallerProvided identity strategy"
}
}
JSON Normalization
For STRICT strategy, the context JSON is normalized before hashing:
- Key ordering: Keys are sorted alphabetically (recursively)
- Whitespace: Removed for consistency
- Semantic equivalence: {"b":2,"a":1} and {"a":1,"b":2} produce the same hash
This means these two requests produce the same identity hash:
#![allow(unused)]
fn main() {
// Request 1
context: json!({"user_id": 123, "action": "create"})
// Request 2 - same content, different key order
context: json!({"action": "create", "user_id": 123})
}
Note: Array order is preserved (arrays are ordered by definition).
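For intuition, a recursive normalization could look like the sketch below (object keys sorted, array order preserved); the actual canonicalization and hash function used by Tasker may differ.

use serde_json::Value;

// Illustrative canonicalization: sorts object keys recursively, preserves arrays.
fn normalize(value: &Value) -> String {
    match value {
        Value::Object(map) => {
            let mut keys: Vec<&String> = map.keys().collect();
            keys.sort(); // alphabetical key order, applied recursively
            let fields: Vec<String> = keys
                .iter()
                .map(|k| format!("{:?}:{}", k, normalize(map.get(k.as_str()).unwrap())))
                .collect();
            format!("{{{}}}", fields.join(","))
        }
        Value::Array(items) => {
            // Array order is preserved (arrays are ordered by definition).
            let parts: Vec<String> = items.iter().map(normalize).collect();
            format!("[{}]", parts.join(","))
        }
        other => other.to_string(),
    }
}

fn main() {
    let a = serde_json::json!({"b": 2, "a": 1});
    let b = serde_json::json!({"a": 1, "b": 2});
    assert_eq!(normalize(&a), normalize(&b)); // same canonical form -> same identity hash
}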
Recommended Patterns
Pattern 1: Time-Bucketed Keys
For deduplication within a time window but allowing repetition across windows:
#![allow(unused)]
fn main() {
// Dedupe within same hour, allow across hours
let hour_bucket = chrono::Utc::now().format("%Y-%m-%d-%H");
let idempotency_key = format!("{}-{}-{}", job_name, customer_id, hour_bucket);
TaskRequest {
namespace: "reports".to_string(),
name: "generate_report".to_string(),
context: json!({ "customer_id": 12345 }),
idempotency_key: Some(idempotency_key),
..Default::default()
}
}
Pattern 2: Time-Aware Context
Include scheduling context directly in the request:
#![allow(unused)]
fn main() {
TaskRequest {
namespace: "batch".to_string(),
name: "daily_reconciliation".to_string(),
context: json!({
"account_id": "ACC-001",
"run_date": "2026-01-20", // Changes daily
"run_window": "morning" // Optional: finer granularity
}),
..Default::default()
}
}
Granularity Guide
| Dedup Window | Key/Context Pattern | Use Case |
|---|---|---|
| Per-minute | {job}-{YYYY-MM-DD-HH-mm} | High-frequency event processing |
| Per-hour | {job}-{YYYY-MM-DD-HH} | Hourly reports, rate-limited APIs |
| Per-day | {job}-{YYYY-MM-DD} | Daily batch jobs, EOD processing |
| Per-week | {job}-{YYYY-Www} | Weekly aggregations |
| Per-month | {job}-{YYYY-MM} | Monthly billing cycles |
Anti-Patterns
Don’t Rely on Timing
#![allow(unused)]
fn main() {
// BAD: Hoping requests are "far enough apart"
TaskRequest { context: json!({ "customer_id": 123 }) }
}
Don’t Use ALWAYS_UNIQUE for Critical Operations
#![allow(unused)]
fn main() {
// BAD: Creates duplicate work on network retries
// Named task with AlwaysUnique for payment processing
}
Do Make Identity Explicit
#![allow(unused)]
fn main() {
// GOOD: Clear what makes this task unique
TaskRequest {
context: json!({
"payment_id": "PAY-123", // Natural idempotency key
"amount": 100
}),
..Default::default()
}
}
Database Implementation
The identity strategy is enforced at the database level:
- UNIQUE constraint on the identity_hash column prevents duplicates
- identity_strategy column on named_tasks stores the configured strategy
- Atomic insertion with constraint violation returns 409 Conflict
-- Identity hash has unique constraint
CREATE UNIQUE INDEX idx_tasks_identity_hash ON tasker.tasks(identity_hash);
-- Named tasks store their strategy
ALTER TABLE tasker.named_tasks
ADD COLUMN identity_strategy VARCHAR(20) DEFAULT 'strict';
Testing Considerations
When writing tests that create tasks, inject a unique identifier to avoid identity hash collisions:
#![allow(unused)]
fn main() {
// Test utility that ensures unique identity per test run
fn create_task_request(namespace: &str, name: &str, context: Value) -> TaskRequest {
let mut ctx = context.as_object().cloned().unwrap_or_default();
ctx.insert("_test_run_id".to_string(), json!(Uuid::now_v7().to_string()));
TaskRequest {
namespace: namespace.to_string(),
name: name.to_string(),
context: Value::Object(ctx),
..Default::default()
}
}
}
Summary
| Strategy | Identity Hash | Deduplicates? | Key Required? |
|---|---|---|---|
| STRICT | hash(uuid, context) | Yes | No |
| CALLER_PROVIDED | hash(uuid, key) | Yes | Yes |
| ALWAYS_UNIQUE | uuidv7() | No | No |
Choose STRICT (default) unless you have a specific reason not to. It’s the safest option for preventing accidental duplicate task creation.
Quick Start Guide
Last Updated: 2025-10-10 Audience: Developers Status: Active Time to Complete: 5 minutes Related Docs: Documentation Hub | Use Cases | Crate Architecture
← Back to Documentation Hub
Get Tasker Core Running in 5 Minutes
This guide will get you from zero to running your first workflow in under 5 minutes using Docker Compose.
Prerequisites
Before starting, ensure you have:
- Docker and Docker Compose installed
- Git to clone the repository
- curl for testing (or any HTTP client)
That’s it! Docker Compose handles all the complexity.
Step 1: Clone and Start Services (2 minutes)
# Clone the repository
git clone https://github.com/tasker-systems/tasker-core
cd tasker-core
# Start PostgreSQL (includes PGMQ extension for default messaging)
docker-compose up -d postgres
# Wait for PostgreSQL to be ready (about 10 seconds)
docker-compose logs -f postgres
# Press Ctrl+C when you see "database system is ready to accept connections"
# Run database migrations
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
docker-compose exec postgres psql -U tasker -d tasker_rust_test -c "SELECT 1" # Verify connection
# Start orchestration server and workers
docker-compose --profile server up -d
# Verify all services are healthy
docker-compose ps
You should see:
NAME STATUS PORTS
tasker-postgres Up (healthy) 5432
tasker-orchestration Up (healthy) 0.0.0.0:8080->8080/tcp
tasker-worker Up (healthy) 0.0.0.0:8081->8081/tcp
tasker-ruby-worker Up (healthy) 0.0.0.0:8082->8082/tcp
Step 2: Verify Services (30 seconds)
Check that all services are responding:
# Check orchestration health
curl http://localhost:8080/health
# Expected response:
# {
# "status": "healthy",
# "database": "connected",
# "message_queue": "operational"
# }
# Check Rust worker health
curl http://localhost:8081/health
# Check Ruby worker health (if started)
curl http://localhost:8082/health
Step 3: Create Your First Task (1 minute)
Now let’s create a simple linear workflow with 4 steps:
# Create a task using the linear_workflow template
curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"template_name": "linear_workflow",
"namespace": "rust_e2e_linear",
"configuration": {
"test_value": "hello_world"
}
}'
Response:
{
"task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
"status": "pending",
"namespace": "rust_e2e_linear",
"created_at": "2025-10-10T12:00:00Z"
}
Save the task_uuid from the response! You’ll need it to check the task status.
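If you would rather submit the task from code than curl, here is a minimal Rust sketch using reqwest and serde_json against the same endpoint; it is a plain HTTP call, not the official tasker-client crate.

use serde_json::json;

// Plain HTTP sketch; assumes reqwest (with the "json" feature) and tokio.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    let response = client
        .post("http://localhost:8080/v1/tasks")
        .json(&json!({
            "template_name": "linear_workflow",
            "namespace": "rust_e2e_linear",
            "configuration": { "test_value": "hello_world" }
        }))
        .send()
        .await?;
    let body: serde_json::Value = response.json().await?;
    println!("created task: {}", body["task_uuid"]);
    Ok(())
}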
Step 4: Monitor Task Execution (1 minute)
Watch your workflow execute in real-time:
# Replace {task_uuid} with your actual task UUID
TASK_UUID="01234567-89ab-cdef-0123-456789abcdef"
# Check task status
curl http://localhost:8080/v1/tasks/${TASK_UUID}
Initial Response (task just created):
{
"task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
"current_state": "initializing",
"total_steps": 4,
"completed_steps": 0,
"namespace": "rust_e2e_linear"
}
Wait a few seconds and check again:
# Check again after a few seconds
curl http://localhost:8080/v1/tasks/${TASK_UUID}
Final Response (task completed):
{
"task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
"current_state": "complete",
"total_steps": 4,
"completed_steps": 4,
"namespace": "rust_e2e_linear",
"completed_at": "2025-10-10T12:00:05Z",
"duration_ms": 134
}
Congratulations! 🎉 You’ve just executed your first workflow with Tasker Core!
What Just Happened?
Let’s break down what happened in those ~100-150ms:
1. Orchestration received task creation request
↓
2. Task initialized with "linear_workflow" template
↓
3. 4 workflow steps created with dependencies:
- mathematical_add (no dependencies)
- mathematical_multiply (depends on add)
- mathematical_subtract (depends on multiply)
- mathematical_divide (depends on subtract)
↓
4. Orchestration discovered step 1 was ready
↓
5. Step 1 enqueued to "rust_e2e_linear" namespace queue
↓
6. Worker claimed and executed step 1
↓
7. Worker sent result back to orchestration
↓
8. Orchestration processed result, discovered step 2
↓
9. Steps 2, 3, 4 executed sequentially (due to dependencies)
↓
10. All steps complete → Task marked "complete"
Key Observations:
- Each step executed by autonomous workers
- Steps executed in dependency order automatically
- Complete workflow: ~130-150ms (including all coordination)
- All state changes recorded in audit trail
View Detailed Task Information
Get complete task execution details:
# Get full task details including steps
curl http://localhost:8080/v1/tasks/${TASK_UUID}/details
Response includes:
{
"task": {
"task_uuid": "...",
"current_state": "complete",
"namespace": "rust_e2e_linear"
},
"steps": [
{
"name": "mathematical_add",
"current_state": "complete",
"result": {"value": 15},
"duration_ms": 12
},
{
"name": "mathematical_multiply",
"current_state": "complete",
"result": {"value": 30},
"duration_ms": 8
},
// ... remaining steps
],
"state_transitions": [
{
"from_state": null,
"to_state": "pending",
"timestamp": "2025-10-10T12:00:00.000Z"
},
{
"from_state": "pending",
"to_state": "initializing",
"timestamp": "2025-10-10T12:00:00.050Z"
},
// ... complete transition history
]
}
Try a More Complex Workflow
Now try the diamond workflow pattern (parallel execution):
curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"template_name": "diamond_workflow",
"namespace": "rust_e2e_diamond",
"configuration": {
"test_value": "parallel_test"
}
}'
Diamond pattern:
step_1 (root)
/ \
step_2 step_3 ← Execute in PARALLEL
\ /
step_4 (join)
Steps 2 and 3 execute simultaneously because they both depend only on step 1!
View Logs
See what’s happening inside the services:
# Orchestration logs
docker-compose logs -f orchestration
# Worker logs
docker-compose logs -f worker
# All logs
docker-compose logs -f
Key log patterns to look for:
- Task initialized: task_uuid=... - Task created
- Step enqueued: step_uuid=... - Step sent to worker
- Step claimed: step_uuid=... - Worker picked up step
- Step completed: step_uuid=... - Step finished successfully
- Task finalized: task_uuid=... - Workflow complete
Explore the API
List All Tasks
curl http://localhost:8080/v1/tasks
Get Task Execution Context
curl http://localhost:8080/v1/tasks/${TASK_UUID}/context
View Available Templates
curl http://localhost:8080/v1/templates
Check System Health
curl http://localhost:8080/health/detailed
Next Steps
1. Understand What You Just Built
Read about the architecture:
- Crate Architecture - How the workspace is organized
- Events and Commands - How orchestration and workers coordinate
- States and Lifecycles - Task and step state machines
2. See Real-World Examples
Explore practical use cases:
- Use Cases and Patterns - E-commerce, payments, ETL, microservices
- See example templates in:
tests/fixtures/task_templates/
3. Create Your Own Workflow
Option A: Rust Handler (Native Performance)
// workers/rust/src/handlers/my_handler.rs
pub struct MyCustomHandler;
#[async_trait]
impl StepHandler for MyCustomHandler {
async fn execute(&self, context: StepContext) -> Result<StepResult> {
// Your business logic here
let input: String = context.configuration.get("input")?;
let result = process_data(&input).await?;
Ok(StepResult::success(json!({
"output": result
})))
}
}
Option B: Ruby Handler (via FFI)
# workers/ruby/app/tasker/tasks/templates/my_workflow/handlers/my_handler.rb
class MyHandler < TaskerCore::StepHandler
def execute(context)
input = context.configuration['input']
result = process_data(input)
{ success: true, output: result }
end
end
Define Your Workflow Template
# tests/fixtures/task_templates/rust/my_workflow.yaml
namespace: my_namespace
name: my_workflow
version: "1.0"
steps:
- name: my_step
handler: my_handler
dependencies: []
retry:
retryable: true
max_attempts: 3
backoff: exponential
backoff_base_ms: 1000
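To exercise the new template, register the handler under the name used in the YAML and submit a task against it. A minimal Rust sketch, reusing the OrchestrationClient, TaskRequest, and register_handler shapes shown elsewhere in these docs; the registry handle and module paths are assumptions:
use serde_json::json;
// Sketch only: adjust types and paths to your project; `registry` is whatever
// handler registry your worker exposes.
pub async fn run_my_workflow(registry: &HandlerRegistry) -> Result<Uuid> {
    // Make "my_handler" resolvable under the name referenced in my_workflow.yaml
    registry.register_handler("my_handler", MyCustomHandler);
    // Submit a task against the template defined above
    let client = OrchestrationClient::new("http://localhost:8080").await?;
    let response = client.create_task(TaskRequest {
        template_name: "my_workflow".to_string(),
        namespace: "my_namespace".to_string(),
        configuration: json!({ "input": "hello" }),
        priority: 5,
    }).await?;
    Ok(response.task_uuid)
}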
4. Deploy to Production
Learn about deployment:
- Deployment Patterns - Hybrid, EventDriven, PollingOnly modes
- Observability - Metrics, logging, monitoring
- Benchmarks - Performance validation
5. Run Tests Locally
# Build the workspace
cargo build --all-features
# Run all tests
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo test --all-features
# Run benchmarks
cargo bench --all-features
Troubleshooting
Services Won’t Start
# Check Docker service status
docker-compose ps
# View service logs
docker-compose logs postgres
docker-compose logs orchestration
# Restart services
docker-compose restart
# Clean restart
docker-compose down
docker-compose up -d
Task Stays in “pending” or “initializing”
Possible causes:
- Template not found - Check available templates: curl http://localhost:8080/v1/templates
- Worker not running - Check worker status: curl http://localhost:8081/health
- Database connection issue - Check logs: docker-compose logs postgres
Solution:
# Verify template exists
curl http://localhost:8080/v1/templates | jq '.[] | select(.name == "linear_workflow")'
# Restart workers
docker-compose restart worker
# Check orchestration logs for errors
docker-compose logs orchestration | grep ERROR
“Connection refused” Errors
Cause: Services not fully started yet
Solution: Wait 10-15 seconds after docker-compose up, then check health:
curl http://localhost:8080/health
PostgreSQL Connection Issues
# Verify PostgreSQL is running
docker-compose ps postgres
# Test connection
docker-compose exec postgres psql -U tasker -d tasker_rust_test -c "SELECT 1"
# View PostgreSQL logs
docker-compose logs postgres | tail -50
Cleanup
When you’re done exploring:
# Stop all services
docker-compose down
# Stop and remove volumes (cleans database)
docker-compose down -v
# Remove all Docker resources (complete cleanup)
docker-compose down -v
docker system prune -f
Summary
You’ve successfully:
- ✅ Started Tasker Core services with Docker Compose
- ✅ Created and executed a linear workflow
- ✅ Monitored task execution in real-time
- ✅ Viewed detailed task and step information
- ✅ Explored the REST API
Total time: ~5 minutes from zero to working workflow! 🚀
Getting Help
- Documentation Issues: Open an issue on GitHub
- Architecture Questions: See Crate Architecture
- Use Case Examples: See Use Cases and Patterns
- Deployment Help: See Deployment Patterns
← Back to Documentation Hub
Next: Use Cases and Patterns | Crate Architecture
Retry Semantics: max_attempts and retryable
Last Updated: 2025-10-10 Audience: Developers Status: Active Related Docs: Documentation Hub | Bug Report: Retry Eligibility Logic | States and Lifecycles
← Back to Documentation Hub
Overview
The Tasker orchestration system uses two configuration fields to control step execution and retry behavior:
- max_attempts: Maximum number of total execution attempts (including the first execution)
- retryable: Whether the step can be retried after failure
Semantic Definitions
max_attempts
Definition: The maximum number of times a step can be attempted, including the first execution.
This is NOT “number of retries” - it’s total attempts:
- max_attempts=0: Likely a configuration error; the first execution still runs (first attempts are always eligible) but no retries can ever follow
- max_attempts=1: Exactly one attempt (no retries after failure)
- max_attempts=3: First attempt + up to 2 retries = 3 total attempts
Implementation: SQL formula attempts < max_attempts where attempts starts at 0.
retryable
Definition: Whether a step can be retried after the first execution fails.
Important: The retryable flag does NOT affect the first execution attempt:
- First execution (attempts=0): Always eligible regardless of retryable setting
- Retry attempts (attempts>0): Require retryable=true
Configuration Examples
Single Execution, No Retries
retry:
  retryable: false
  max_attempts: 1   # First attempt only
  backoff: exponential
Behavior:
| attempts | retry_eligible | Outcome |
|---|---|---|
| 0 | ✅ true | First execution allowed |
| 1 | ❌ false | No retries (retryable=false) |
Use Case: Idempotent operations that should not retry (e.g., record creation with unique constraints)
Multiple Attempts with Retries
retry:
  retryable: true
  max_attempts: 3   # First attempt + 2 retries
  backoff: exponential
  backoff_base_ms: 1000
Behavior:
| attempts | retry_eligible | Outcome |
|---|---|---|
| 0 | ✅ true | First execution allowed |
| 1 | ✅ true | First retry allowed (1 < 3) |
| 2 | ✅ true | Second retry allowed (2 < 3) |
| 3 | ❌ false | Max attempts exhausted (3 >= 3) |
Use Case: External API calls that might have transient failures
Unlimited Retries (Not Recommended)
retry:
  retryable: true
  max_attempts: 999999
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 300000  # Cap at 5 minutes
Behavior: Will retry until external intervention (task cancellation, system restart)
Use Case: Critical operations that must eventually succeed (use with caution!)
Retry Eligibility Logic
SQL Implementation
From migrations/20251006000000_fix_retry_eligibility_logic.sql:
-- retry_eligible calculation
(
COALESCE(ws.attempts, 0) = 0 -- First attempt always eligible
OR (
COALESCE(ws.retryable, true) = true -- Must be retryable for retries
AND COALESCE(ws.attempts, 0) < COALESCE(ws.max_attempts, 3)
)
) as retry_eligible
Decision Tree
Is attempts = 0?
├─ YES → retry_eligible = TRUE (first execution)
└─ NO → Is retryable = true?
├─ YES → Is attempts < max_attempts?
│ ├─ YES → retry_eligible = TRUE (retry allowed)
│ └─ NO → retry_eligible = FALSE (max attempts exhausted)
└─ NO → retry_eligible = FALSE (retries disabled)
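For readers who prefer code to SQL, here is a minimal Rust sketch of the same rule; it is not the production implementation, and the defaults mirror the COALESCE fallbacks above:
// First attempts are always eligible; retries require retryable=true and
// remaining attempts. Defaults match the SQL COALESCEs (retryable=true, max_attempts=3).
fn retry_eligible(attempts: Option<u32>, retryable: Option<bool>, max_attempts: Option<u32>) -> bool {
    let attempts = attempts.unwrap_or(0);
    let retryable = retryable.unwrap_or(true);
    let max_attempts = max_attempts.unwrap_or(3);
    attempts == 0 || (retryable && attempts < max_attempts)
}
#[test]
fn follows_the_decision_tree() {
    assert!(retry_eligible(Some(0), Some(false), Some(1)));  // first execution
    assert!(retry_eligible(Some(2), Some(true), Some(3)));   // retry allowed (2 < 3)
    assert!(!retry_eligible(Some(3), Some(true), Some(3)));  // max attempts exhausted
    assert!(!retry_eligible(Some(1), Some(false), Some(3))); // retries disabled
}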
Edge Cases
max_attempts=0
retry:
  max_attempts: 0
Behavior: The first execution still runs (attempts=0 is always eligible), but no retries are ever possible (0 < 0 is false), so this behaves like max_attempts: 1
Status: ⚠️ Configuration error - likely unintended
Recommendation: Use max_attempts: 1 for single execution
retryable=false with max_attempts > 1
retry:
  retryable: false
  max_attempts: 3   # Only first attempt will execute
Behavior: First execution allowed, but no retries regardless of max_attempts
Effective Result: Same as max_attempts: 1
Recommendation: Set max_attempts: 1 when retryable: false for clarity
Historical Context
Why “max_attempts” instead of “retry_limit”?
The original field name retry_limit was semantically confusing:
Old Interpretation (incorrect):
- retry_limit=1 → "1 retry allowed" → 2 total attempts?
- retry_limit=0 → "0 retries" → 1 attempt or blocked?
New Interpretation (clear):
- max_attempts=1 → "1 total attempt" → exactly 1 execution
- max_attempts=0 → "0 attempts" → clearly invalid
Migration Timeline
- Original: retry_limit field with ambiguous semantics
- 2025-10-05: Bug discovered - retry_limit=0 blocked all execution
- 2025-10-06: Fixed SQL logic + renamed to max_attempts
- 2025-10-06: Added 6 SQL boundary tests for edge cases
Testing
Boundary Condition Tests
See tests/integration/sql_functions/retry_boundary_tests.rs for comprehensive coverage:
- test_max_attempts_zero_allows_first_execution - Edge case handling
- test_max_attempts_zero_blocks_after_first - Exhaustion after first attempt
- test_max_attempts_one_semantics - Single execution semantics
- test_max_attempts_three_progression - Standard retry progression
- test_first_attempt_ignores_retryable_flag - First execution independence
- test_retries_require_retryable_true - Retry flag enforcement
All tests passing as of 2025-10-06.
Best Practices
For Single-Execution Steps
retry:
  retryable: false
  max_attempts: 1
  backoff: exponential  # Ignored, but required for schema
Why: Makes intent crystal clear - execute once, never retry
For Transient Failure Tolerance
retry:
  retryable: true
  max_attempts: 3
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 30000
Why: Reasonable retry count with exponential backoff prevents thundering herd
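As a rough illustration of what this configuration implies (the engine's actual backoff and jitter behavior is internal to the orchestrator), capped exponential backoff is just:
// With base_ms = 1_000 and max_ms = 30_000: 1s, 2s, 4s, 8s, 16s, 30s, 30s, ...
fn backoff_delay_ms(retry_number: u32, base_ms: u64, max_ms: u64) -> u64 {
    base_ms.saturating_mul(2u64.saturating_pow(retry_number)).min(max_ms)
}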
For Critical Operations
retry:
  retryable: true
  max_attempts: 10
  backoff: exponential
  backoff_base_ms: 5000
  max_backoff_ms: 300000  # 5 minutes
Why: More attempts with longer backoff for operations that must succeed
Related Documentation
- Bug Report: Retry Eligibility Logic
- State Machine Documentation
- SQL Function: get_step_readiness_status_batch
- Migration: 20251006000000_fix_retry_eligibility_logic.sql
Questions or Issues? See test suite for comprehensive examples or consult bug report for historical context.
Use Cases and Patterns
Last Updated: 2025-10-10 Audience: Developers, Architects, Product Managers Status: Active Related Docs: Documentation Hub | Quick Start | Crate Architecture
← Back to Documentation Hub
Overview
This guide provides practical examples of when and how to use Tasker Core for workflow orchestration. Each use case includes architectural patterns, example workflows, and implementation guidance based on real-world scenarios.
Table of Contents
- E-Commerce Order Fulfillment
- Payment Processing Pipeline
- Data Transformation ETL
- Microservices Orchestration
- Scheduled Job Coordination
- Conditional Workflows and Decision Points
- Anti-Patterns
E-Commerce Order Fulfillment
Problem Statement
An e-commerce platform needs to coordinate multiple steps when processing orders:
- Validate order details and inventory
- Reserve inventory and process payment (parallel)
- Ship order after both payment and inventory confirmed
- Send confirmation emails
- Handle failures gracefully with retries
Why Tasker Core?
- Complex Dependencies: Steps have clear dependency relationships
- Parallel Execution: Payment and inventory can happen simultaneously
- Retry Logic: External API calls need retry with backoff
- Audit Trail: Complete history needed for compliance
- Idempotency: Steps must handle duplicate executions safely
Workflow Structure
Task: order_fulfillment_#{order_id}
Priority: Based on order value and customer tier
Namespace: fulfillment
Steps:
1. validate_order
- Handler: ValidateOrderHandler
- Dependencies: None (root step)
- Retry: retryable=true, max_attempts=3
- Validates order data, checks fraud
2. check_inventory
- Handler: InventoryCheckHandler
- Dependencies: validate_order (must complete)
- Retry: retryable=true, max_attempts=5
- Queries inventory system
3. reserve_inventory
- Handler: InventoryReservationHandler
- Dependencies: check_inventory
- Retry: retryable=true, max_attempts=3
- Reserves stock with timeout
4. process_payment
- Handler: PaymentProcessingHandler
- Dependencies: validate_order
- Retry: retryable=true, max_attempts=3
- Charges customer payment method
- **Runs in parallel with reserve_inventory**
5. ship_order
- Handler: ShippingHandler
- Dependencies: reserve_inventory AND process_payment
- Retry: retryable=false, max_attempts=1
- Creates shipping label, schedules pickup
6. send_confirmation
- Handler: EmailNotificationHandler
- Dependencies: ship_order
- Retry: retryable=true, max_attempts=10
- Sends confirmation email to customer
Implementation Pattern
Task Template (YAML configuration):
namespace: fulfillment
name: order_fulfillment
version: "1.0"
steps:
  - name: validate_order
    handler: validate_order
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      backoff_base_ms: 1000
  - name: check_inventory
    handler: check_inventory
    dependencies:
      - validate_order
    retry:
      retryable: true
      max_attempts: 5
      backoff: exponential
      backoff_base_ms: 2000
  # ... remaining steps
Step Handler (Rust implementation):
pub struct ValidateOrderHandler;
#[async_trait]
impl StepHandler for ValidateOrderHandler {
async fn execute(&self, context: StepContext) -> Result<StepResult> {
// Extract order data from context
let order_id: String = context.configuration.get("order_id")?;
let customer_id: String = context.configuration.get("customer_id")?;
// Validate order
let order = validate_order_data(&order_id).await?;
// Check fraud detection
if check_fraud_risk(&customer_id, &order).await? {
return Ok(StepResult::permanent_failure(
"fraud_detected",
json!({"reason": "High fraud risk"})
));
}
// Success - pass data to next steps
Ok(StepResult::success(json!({
"order_id": order_id,
"validated_at": Utc::now(),
"total_amount": order.total
})))
}
}
Ruby Handler Alternative:
class ProcessPaymentHandler < TaskerCore::StepHandler
def execute(context)
order_id = context.configuration['order_id']
amount = context.configuration['amount']
# Process payment via payment gateway
result = PaymentGateway.charge(
amount: amount,
idempotency_key: context.step_uuid
)
if result.success?
{ success: true, transaction_id: result.transaction_id }
else
# Retryable failure with backoff
{ success: false, retryable: true, error: result.error }
end
rescue PaymentGateway::NetworkError => e
# Transient error, retry
{ success: false, retryable: true, error: e.message }
rescue PaymentGateway::CardDeclined => e
# Permanent failure, don't retry
{ success: false, retryable: false, error: e.message }
end
end
Key Patterns
1. Parallel Execution
- reserve_inventory and process_payment both depend only on earlier steps
- Tasker automatically executes them in parallel
- ship_order waits for both to complete
2. Idempotent Handlers
- Use step_uuid as the idempotency key for external APIs
- Handle duplicate executions gracefully
3. Smart Retry Logic
- Network errors → retryable with exponential backoff
- Business logic failures → permanent, no retry
- Configure max_attempts based on criticality
4. Data Flow
- Early steps provide data to later steps via results
- Access parent results: context.parent_results["validate_order"]
- Build context as workflow progresses
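A sketch of what this looks like in a downstream handler, following the StepContext usage in the examples above; ShippingHandler's helper and result fields are illustrative:
pub struct ShippingHandler;
#[async_trait]
impl StepHandler for ShippingHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        // Results from completed parent steps are keyed by step name
        let validation = context.parent_results.get("validate_order")
            .ok_or("Missing validate_order result")?;
        let payment = context.parent_results.get("process_payment")
            .ok_or("Missing process_payment result")?;
        let order_id: String = validation.get("order_id")?;
        let transaction_id: String = payment.get("transaction_id")?;
        // Hypothetical helper: create the label once payment and inventory are confirmed
        let label = create_shipping_label(&order_id, &transaction_id).await?;
        Ok(StepResult::success(json!({ "tracking_number": label.tracking_number })))
    }
}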
Observability
Monitor these metrics for order fulfillment:
// Track order processing stages
metrics::counter!("orders.validated").increment(1);
metrics::counter!("orders.payment_processed").increment(1);
metrics::counter!("orders.shipped").increment(1);
// Track failures by reason
metrics::counter!("orders.failed", "reason" => "fraud").increment(1);
metrics::counter!("orders.failed", "reason" => "inventory").increment(1);
// Track timing
metrics::histogram!("order.fulfillment_time_ms").record(elapsed_ms);
Payment Processing Pipeline
Problem Statement
A fintech platform needs to process payments with strict requirements:
- Multiple payment methods (card, bank transfer, wallet)
- Regulatory compliance and audit trails
- Automatic retry for transient failures
- Reconciliation with accounting system
- Webhook notifications to customers
Why Tasker Core?
- Compliance: Complete audit trail with state transitions
- Reliability: Automatic retry with configurable limits
- Observability: Detailed metrics for financial operations
- Idempotency: Prevent duplicate charges
- Flexibility: Support multiple payment flows
Workflow Structure
Task: payment_processing_#{payment_id}
Namespace: payments
Priority: High (financial operations)
Steps:
1. validate_payment_request
- Verify payment details
- Check account status
- Validate payment method
2. check_fraud
- Run fraud detection
- Verify transaction limits
- Check velocity rules
3. authorize_payment
- Contact payment gateway
- Reserve funds (authorization hold)
- Return authorization code
4. capture_payment (depends on authorize_payment)
- Capture authorized funds
- Handle settlement
- Generate receipt
5. record_transaction (depends on capture_payment)
- Write to accounting ledger
- Update customer balance
- Create audit records
6. send_notification (depends on record_transaction)
- Send webhook to merchant
- Send receipt to customer
- Update payment status
Implementation Highlights
Retry Strategy for Payment Gateway:
impl StepHandler for AuthorizePaymentHandler {
async fn execute(&self, context: StepContext) -> Result<StepResult> {
let payment_id = context.configuration.get("payment_id")?;
match gateway.authorize(payment_id, &context.step_uuid).await {
Ok(auth) => {
Ok(StepResult::success(json!({
"authorization_code": auth.code,
"authorized_at": Utc::now(),
"gateway_transaction_id": auth.transaction_id
})))
}
Err(GatewayError::NetworkTimeout) => {
// Transient - retry with backoff
Ok(StepResult::retryable_failure(
"network_timeout",
json!({"retry_recommended": true})
))
}
Err(GatewayError::InsufficientFunds) => {
// Permanent - don't retry
Ok(StepResult::permanent_failure(
"insufficient_funds",
json!({"requires_manual_intervention": false})
))
}
Err(GatewayError::InvalidCard) => {
// Permanent - don't retry
Ok(StepResult::permanent_failure(
"invalid_card",
json!({"requires_manual_intervention": true})
))
}
}
}
}
Idempotency Pattern:
async fn capture_payment(context: &StepContext) -> Result<StepResult> {
let idempotency_key = context.step_uuid.to_string();
// Check if we already captured this payment
if let Some(existing) = check_existing_capture(&idempotency_key).await? {
return Ok(StepResult::success(json!({
"already_captured": true,
"transaction_id": existing.transaction_id,
"note": "Idempotent duplicate detected"
})));
}
// Proceed with capture
let result = gateway.capture(&idempotency_key).await?;
// Store idempotency record
store_capture_record(&idempotency_key, &result).await?;
Ok(StepResult::success(json!(result)))
}
Key Patterns
1. Two-Phase Commit
- Authorize (reserve) → Capture (settle)
- Allows cancellation between phases
- Common in payment processing
2. Audit Trail
- Every state transition recorded
- Regulatory compliance built-in
- Forensic investigation support
3. Circuit Breaking
- Protect against payment gateway failures
- Automatic backoff when gateway degraded
- Fallback to alternate gateways
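Tasker ships circuit breaking in the engine, so handlers rarely need to hand-roll it; the sketch below only illustrates the fail-fast shape of the pattern. The breaker, threshold, and gateway calls are all illustrative:
use std::sync::atomic::{AtomicU32, Ordering};
// Illustrative breaker: opens after N consecutive gateway failures.
pub struct GatewayBreaker {
    consecutive_failures: AtomicU32,
    threshold: u32,
}
impl GatewayBreaker {
    fn is_open(&self) -> bool {
        self.consecutive_failures.load(Ordering::Relaxed) >= self.threshold
    }
    fn record(&self, ok: bool) {
        if ok {
            self.consecutive_failures.store(0, Ordering::Relaxed);
        } else {
            self.consecutive_failures.fetch_add(1, Ordering::Relaxed);
        }
    }
}
async fn authorize_guarded(breaker: &GatewayBreaker, payment_id: &str, key: &str) -> Result<StepResult> {
    if breaker.is_open() {
        // Fail fast: report a retryable failure and let orchestration back off
        return Ok(StepResult::retryable_failure(
            "gateway_circuit_open",
            json!({ "retry_recommended": true }),
        ));
    }
    let outcome = gateway.authorize(payment_id, key).await;
    breaker.record(outcome.is_ok());
    match outcome {
        Ok(auth) => Ok(StepResult::success(json!({ "authorization_code": auth.code }))),
        Err(e) => Ok(StepResult::retryable_failure("gateway_error", json!({ "error": e.to_string() }))),
    }
}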
Data Transformation ETL
Problem Statement
A data analytics platform needs to process data through multiple transformation stages:
- Extract data from multiple sources (APIs, databases, files)
- Transform data (clean, enrich, aggregate)
- Load to data warehouse
- Handle large datasets with partitioning
- Retry transient failures, skip corrupted data
Why Tasker Core?
- DAG Execution: Complex transformation pipelines
- Parallel Processing: Independent partitions processed concurrently
- Error Handling: Skip corrupted records, retry transient failures
- Observability: Track data quality and processing metrics
- Scheduling: Integrate with cron/scheduler for periodic runs
Workflow Structure
Task: etl_customer_data_#{date}
Namespace: data_pipeline
Steps:
1. extract_customer_profiles
- Fetch from customer database
- Partition by customer_id ranges
- Creates multiple output partitions
2. extract_transaction_history
- Fetch from transactions database
- Runs in parallel with extract_customer_profiles
- Time-based partitioning
3. enrich_customer_data (depends on extract_customer_profiles)
- Add demographic data from external API
- Process partitions in parallel
- Each partition is independent
4. join_transactions (depends on enrich_customer_data, extract_transaction_history)
- Join enriched profiles with transactions
- Aggregate metrics per customer
- Parallel processing per partition
5. load_to_warehouse (depends on join_transactions)
- Bulk load to data warehouse
- Verify data quality
- Update metadata tables
6. generate_summary_report (depends on load_to_warehouse)
- Generate processing statistics
- Send notification with summary
- Archive source files
Implementation Pattern
Partition-Based Processing:
pub struct ExtractCustomerProfilesHandler;
#[async_trait]
impl StepHandler for ExtractCustomerProfilesHandler {
async fn execute(&self, context: StepContext) -> Result<StepResult> {
let date: String = context.configuration.get("processing_date")?;
// Determine partitions (e.g., by customer_id ranges)
let partitions = calculate_partitions(1000000, 100000)?; // 10 partitions
// Extract data for each partition
let mut partition_files = Vec::new();
for partition in partitions {
let filename = extract_partition(&date, partition).await?;
partition_files.push(filename);
}
// Return partition info for downstream steps
Ok(StepResult::success(json!({
"partitions": partition_files,
"total_records": 1000000,
"extracted_at": Utc::now()
})))
}
}
Error Handling for Data Quality:
async fn enrich_customer_data(context: &StepContext) -> Result<StepResult> {
let partition_file: String = context.configuration.get("partition_file")?;
let mut processed = 0;
let mut skipped = 0;
let mut errors = Vec::new();
for record in read_partition(&partition_file).await? {
match enrich_record(record).await {
Ok(enriched) => {
write_enriched(enriched).await?;
processed += 1;
}
Err(EnrichmentError::MalformedData(e)) => {
// Skip corrupted record, continue processing
skipped += 1;
errors.push(format!("Skipped record: {}", e));
}
Err(EnrichmentError::ApiTimeout(e)) => {
// Transient failure, retry entire step
return Ok(StepResult::retryable_failure(
"api_timeout",
json!({"error": e.to_string()})
));
}
}
}
if skipped as f64 / processed as f64 > 0.1 {
// Too many skipped records
return Ok(StepResult::permanent_failure(
"data_quality_issue",
json!({
"processed": processed,
"skipped": skipped,
"error_rate": skipped as f64 / processed as f64
})
));
}
Ok(StepResult::success(json!({
"processed": processed,
"skipped": skipped,
"errors": errors
})))
}
Key Patterns
1. Partition-Based Parallelism
- Split large datasets into partitions
- Process partitions independently
- Aggregate results in final step
2. Graceful Degradation
- Skip corrupted individual records
- Continue processing remaining data
- Report data quality issues
3. Monitoring Data Quality
- Track record counts through pipeline
- Alert on unexpected error rates
- Validate schema at boundaries
Microservices Orchestration
Problem Statement
Coordinate operations across multiple microservices:
- User registration flow (auth, profile, notifications, analytics)
- Distributed transactions with compensation
- Service dependency management
- Timeout and circuit breaking
Why Tasker Core?
- Service Coordination: Orchestrate distributed operations
- Saga Pattern: Implement compensation for failures
- Resilience: Circuit breakers and timeouts
- Observability: End-to-end tracing with correlation IDs
- Flexibility: Handle heterogeneous service protocols
Workflow Structure (User Registration Example)
Task: user_registration_#{user_id}
Namespace: user_onboarding
Steps:
1. create_auth_account
- Call auth service to create account
- Generate user credentials
- Store authentication tokens
2. create_user_profile (depends on create_auth_account)
- Call profile service
- Initialize user preferences
- Set default settings
3. setup_notification_preferences (depends on create_user_profile)
- Call notification service
- Configure email preferences
- Set up push notifications
4. track_user_signup (depends on create_user_profile)
- Call analytics service
- Record signup event
- Runs in parallel with setup_notification_preferences
5. send_welcome_email (depends on setup_notification_preferences)
- Send welcome email
- Provide onboarding links
- Track email delivery
Compensation Steps (on failure):
- If create_user_profile fails → delete_auth_account
- If any step fails after profile → deactivate_user
Implementation Pattern (Saga with Compensation)
pub struct CreateUserProfileHandler;
#[async_trait]
impl StepHandler for CreateUserProfileHandler {
async fn execute(&self, context: StepContext) -> Result<StepResult> {
let user_id: String = context.configuration.get("user_id")?;
let email: String = context.configuration.get("email")?;
// Get auth details from previous step
let auth_result = context.parent_results.get("create_auth_account")
.ok_or("Missing auth result")?;
let auth_token: String = auth_result.get("auth_token")?;
// Call profile service
match profile_service.create_profile(&user_id, &email, &auth_token).await {
Ok(profile) => {
Ok(StepResult::success(json!({
"profile_id": profile.id,
"created_at": profile.created_at,
"user_id": user_id
})))
}
Err(ProfileServiceError::DuplicateEmail) => {
// Permanent failure - email already exists
// Trigger compensation
Ok(StepResult::permanent_failure_with_compensation(
"duplicate_email",
json!({"email": email}),
vec!["delete_auth_account"] // Compensation steps
))
}
Err(ProfileServiceError::ServiceUnavailable) => {
// Transient - retry
Ok(StepResult::retryable_failure(
"service_unavailable",
json!({"retry_recommended": true})
))
}
}
}
}
Compensation Handler:
pub struct DeleteAuthAccountHandler;
#[async_trait]
impl StepHandler for DeleteAuthAccountHandler {
async fn execute(&self, context: StepContext) -> Result<StepResult> {
let user_id: String = context.configuration.get("user_id")?;
// Best-effort deletion
match auth_service.delete_account(&user_id).await {
Ok(_) => {
Ok(StepResult::success(json!({
"compensated": true,
"user_id": user_id
})))
}
Err(e) => {
// Log error but don't fail - compensation is best-effort
warn!("Compensation failed for user {}: {}", user_id, e);
Ok(StepResult::success(json!({
"compensated": false,
"error": e.to_string(),
"requires_manual_cleanup": true
})))
}
}
}
}
Key Patterns
1. Correlation IDs
- Pass correlation_id through all services
- Enable end-to-end tracing
- Simplify debugging distributed issues
2. Compensation (Saga Pattern)
- Define compensation steps for cleanup
- Execute on permanent failures
- Best-effort execution, log failures
3. Service Circuit Breakers
- Wrap service calls in circuit breakers
- Fail fast when services degraded
- Automatic recovery detection
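A small sketch of the correlation ID pattern, assuming a plain HTTP client (reqwest with its json feature) and an x-correlation-id header; use whatever header name your tracing stack expects:
use reqwest::Client;
use serde_json::json;
// Forward the workflow's correlation ID on every outbound call so traces can
// be stitched together end to end. URL and header name are illustrative.
async fn create_profile(
    client: &Client,
    correlation_id: &str,
    user_id: &str,
    email: &str,
) -> Result<reqwest::Response, reqwest::Error> {
    client
        .post("http://profile-service/profiles")
        .header("x-correlation-id", correlation_id)
        .json(&json!({ "user_id": user_id, "email": email }))
        .send()
        .await
}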
Scheduled Job Coordination
Problem Statement
Run periodic jobs with dependencies:
- Daily report generation (depends on data refresh)
- Scheduled data backups (depends on maintenance window)
- Cleanup jobs (depends on retention policies)
Why Tasker Core?
- Dependency Management: Jobs run in correct order
- Failure Handling: Automatic retry of failed jobs
- Observability: Track job execution history
- Flexibility: Dynamic scheduling based on results
Implementation Pattern
// External scheduler (cron, Kubernetes CronJob, etc.) creates tasks
pub async fn schedule_daily_reports() -> Result<Uuid> {
let client = OrchestrationClient::new("http://orchestration:8080").await?;
let task_request = TaskRequest {
template_name: "daily_reporting".to_string(),
namespace: "scheduled_jobs".to_string(),
configuration: json!({
"report_date": Utc::now().format("%Y-%m-%d").to_string(),
"report_types": ["sales", "inventory", "customer_activity"]
}),
priority: 5, // Normal priority
};
let response = client.create_task(task_request).await?;
Ok(response.task_uuid)
}
Conditional Workflows and Decision Points
Problem Statement
Many workflows require runtime decision-making where the execution path depends on business logic evaluated at runtime:
- Approval routing based on request amount or risk level
- Tiered processing based on customer status
- Compliance checks varying by jurisdiction
- Dynamic resource allocation based on workload
Why Use Decision Points?
Traditional Approach (Static DAG):
# Must define ALL possible paths upfront
Steps:
- validate
- route_A # Always created
- route_B # Always created
- route_C # Always created
- converge # Must handle all paths
Decision Point Approach (Dynamic DAG):
# Create ONLY the needed path at runtime
Steps:
- validate
- routing_decision # Decides which path
- route_A # Created dynamically if needed
- route_B # Created dynamically if needed
- route_C # Created dynamically if needed
- converge # Uses intersection semantics
Benefits
- Efficiency: Only execute steps actually needed
- Clarity: Workflow reflects actual business logic
- Cost Savings: Reduce API calls, processing time, and resource usage
- Flexibility: Add new paths without changing core logic
Core Pattern
Task: conditional_approval
Steps:
1. validate_request # Regular step
2. routing_decision # Decision point (type: decision_point)
→ Evaluates business logic
→ Returns: CreateSteps(['manager_approval']) or NoBranches
3. auto_approve # Might be created
4. manager_approval # Might be created
5. finance_review # Might be created
6. finalize_approval # Convergence (type: deferred)
→ Waits for intersection of dependencies
Example: Amount-Based Approval Routing
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
def call(context)
amount = context.get_task_field('amount')
# Business logic determines which steps to create
steps = if amount < 1_000
['auto_approve']
elsif amount < 5_000
['manager_approval']
else
['manager_approval', 'finance_review']
end
# Return decision outcome
decision_success(
steps: steps,
result_data: {
route_type: determine_route_type(amount),
amount: amount
}
)
end
end
Real-World Scenarios
1. E-Commerce Returns Processing
- Low-value returns: Auto-approve
- Medium-value: Manager review
- High-value or suspicious: Fraud investigation + manager review
2. Financial Risk Assessment
- Low-risk transactions: Standard processing
- Medium-risk: Additional verification
- High-risk: Manual review + compliance checks + legal review
3. Healthcare Prior Authorization
- Standard procedures: Auto-approve
- Specialized care: Medical director review
- Experimental treatments: Medical director + insurance review + compliance
4. Customer Support Escalation
- Simple issues: Tier 1 resolution
- Complex issues: Tier 2 specialist
- VIP customers: Immediate senior support + account manager notification
Key Features
Decision Point Steps:
- Special step type that returns DecisionPointOutcome
- Can return NoBranches (no additional steps) or CreateSteps (list of step names)
- Fully atomic - either all steps created or none
- Supports nested decisions (configurable depth limit)
Deferred Steps:
- Use intersection semantics for dependencies
- Wait for: (declared dependencies) ∩ (actually created steps)
- Enable convergence regardless of path taken
Type-Safe Implementation:
- Ruby: TaskerCore::StepHandler::Decision base class
- Rust: DecisionPointOutcome enum with serde support
- Automatic validation and serialization
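The Rust-side outcome mentioned above is roughly shaped like the sketch below; the serde attributes are an assumption, so consult the conditional workflows guide for the exact definitions:
use serde::{Deserialize, Serialize};
// Rough shape only: either no extra branches, or a list of step names to create.
#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "type", content = "step_names")]
pub enum DecisionPointOutcome {
    NoBranches,
    CreateSteps(Vec<String>),
}
// Amount-based routing, mirroring the Ruby example earlier in this section.
pub fn route_by_amount(amount: u64) -> DecisionPointOutcome {
    match amount {
        0..=999 => DecisionPointOutcome::CreateSteps(vec!["auto_approve".into()]),
        1_000..=4_999 => DecisionPointOutcome::CreateSteps(vec!["manager_approval".into()]),
        _ => DecisionPointOutcome::CreateSteps(vec!["manager_approval".into(), "finance_review".into()]),
    }
}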
Implementation
See the complete guide: Conditional Workflows and Decision Points
Covers:
- When to use conditional workflows
- YAML configuration
- Ruby and Rust implementation patterns
- Simple and complex examples
- Best practices and limitations
Anti-Patterns
❌ Don’t Use Tasker Core For:
1. Simple Cron Jobs
# ❌ Anti-pattern: Single-step scheduled job
Task: send_daily_email
Steps:
- send_email # No dependencies, no retry needed
Why: Overhead not justified. Use native cron or systemd timers.
2. Real-Time Sub-Millisecond Operations
# ❌ Anti-pattern: High-frequency trading
Task: execute_trade_#{microseconds}
Steps:
- check_price # Needs <1ms latency
- execute_order
Why: Architectural overhead (~10-20ms) too high. Use in-memory queues or direct service calls.
3. Pure Fan-Out
# ❌ Anti-pattern: Simple message broadcasting
Task: broadcast_notification
Steps:
- send_to_user_1
- send_to_user_2
- send_to_user_3
# ... 1000s of independent steps
Why: Use message bus (Kafka, RabbitMQ) for pub/sub patterns. Tasker is for orchestration, not broadcasting.
4. Stateless Single Operations
# ❌ Anti-pattern: Single API call with no retry
Task: fetch_user_data
Steps:
- call_api # No dependencies, no state management needed
Why: Direct API call with client-side retry is simpler.
Pattern Selection Guide
| Characteristic | Use Tasker Core? | Alternative |
|---|---|---|
| Multiple dependent steps | ✅ Yes | N/A |
| Parallel execution needed | ✅ Yes | Thread pools for simple cases |
| Retry logic required | ✅ Yes | Client-side retry libraries |
| Audit trail needed | ✅ Yes | Append-only logs |
| Single step, no retry | ❌ No | Direct function call |
| Sub-second latency required | ❌ No | In-memory queues |
| Pure broadcast/fan-out | ❌ No | Message bus (Kafka, etc.) |
| Simple scheduled job | ❌ No | Cron, systemd timers |
Related Documentation
- Quick Start - Get your first workflow running
- Conditional Workflows - Runtime decision-making and dynamic step creation
- Crate Architecture - Understand the codebase
- Deployment Patterns - Deploy to production
- States and Lifecycles - State machine deep dive
- Events and Commands - Event-driven patterns
← Back to Documentation Hub
Worker Crates Overview
Last Updated: 2025-12-27 Audience: Developers, Architects, Operators Status: Active Related Docs: Worker Event Systems | Worker Actors
← Back to Documentation Hub
The tasker-core workspace provides four worker implementations for executing workflow step handlers. Each implementation targets different deployment scenarios and developer ecosystems while sharing the same core Rust foundation.
Quick Navigation
| Document | Description |
|---|---|
| API Convergence Matrix | Quick reference for aligned APIs across languages |
| Example Handlers | Side-by-side handler examples |
| Patterns and Practices | Common patterns across all workers |
| Rust Worker | Native Rust implementation |
| Ruby Worker | Ruby gem for Rails integration |
| Python Worker | Python package for data pipelines |
| TypeScript Worker | TypeScript/JS for Bun/Node/Deno |
Overview
Four Workers, One Foundation
All workers share the same Rust core (tasker-worker crate) for orchestration, queueing, and state management. The language-specific workers add handler execution in their respective runtimes.
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
PostgreSQL + PGMQ
│
▼
┌─────────────────────────────┐
│ Rust Core (tasker-worker) │
│ ─────────────────────────│
│ • Queue Management │
│ • State Machines │
│ • Orchestration │
│ • Actor System │
└─────────────────────────────┘
│
┌───────────────┬───────────┼───────────┬───────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌─────────────┐
│ Rust │ │ Ruby │ │ Python │ │ TypeScript │
│ Worker │ │ Worker │ │ Worker │ │ Worker │
│───────────│ │───────────│ │───────────│ │─────────────│
│ Native │ │ FFI Bridge│ │ FFI Bridge│ │ FFI Bridge │
│ Handlers │ │ + Gem │ │ + Package │ │ Bun/Node/Deno│
└───────────┘ └───────────┘ └───────────┘ └─────────────┘
Comparison Table
| Feature | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Performance | Native | GVL-limited | GIL-limited | V8/Bun native |
| Integration | Standalone | Rails/Rack apps | Data pipelines | Node/Bun/Deno apps |
| Handler Style | Async traits | Class-based | ABC-based | Class-based |
| Concurrency | Tokio async | Thread + FFI poll | Thread + FFI poll | Event loop + FFI poll |
| Deployment | Binary | Gem + Server | Package + Server | Package + Server |
| Headless Mode | N/A | Library embed | Library embed | Library embed |
| Runtimes | - | MRI | CPython | Bun, Node.js, Deno |
When to Use Each
Rust Worker - Best for:
- Maximum throughput requirements
- Resource-constrained environments
- Standalone microservices
- Performance-critical handlers
Ruby Worker - Best for:
- Rails/Ruby applications
- ActiveRecord/ORM integration
- Existing Ruby codebases
- Quick prototyping with Ruby ecosystem
Python Worker - Best for:
- Data processing pipelines
- ML/AI integration
- Scientific computing workflows
- Python-native team preferences
TypeScript Worker - Best for:
- Modern JavaScript/TypeScript applications
- Full-stack Node.js teams
- Edge computing with Bun or Deno
- React/Vue/Angular backend services
- Multi-runtime deployment flexibility
Deployment Modes
Server Mode
All workers can run as standalone servers:
Rust:
cargo run -p workers-rust
Ruby:
cd workers/ruby
./bin/server.rb
Python:
cd workers/python
python bin/server.py
TypeScript (Bun):
cd workers/typescript
bun run bin/server.ts
TypeScript (Node.js):
cd workers/typescript
npx tsx bin/server.ts
Headless/Embedded Mode (Ruby, Python & TypeScript)
Ruby, Python, and TypeScript workers can be embedded into existing applications without running the HTTP server. Headless mode is controlled via TOML configuration, not bootstrap parameters.
TOML Configuration (e.g., config/tasker/base/worker.toml):
[web]
enabled = false # Disables HTTP server for headless/embedded mode
Ruby (in Rails):
# config/initializers/tasker.rb
require 'tasker_core'
# Bootstrap worker (web server disabled via TOML config)
TaskerCore::Worker::Bootstrap.start!
# Register handlers
TaskerCore::Registry::HandlerRegistry.instance.register_handler(
'MyHandler',
MyHandler
)
Python (in application):
from tasker_core import bootstrap_worker, HandlerRegistry
from tasker_core.types import BootstrapConfig
# Bootstrap worker (web server disabled via TOML config)
config = BootstrapConfig(namespace="my-app")
bootstrap_worker(config)
# Register handlers
registry = HandlerRegistry.instance()
registry.register("my_handler", MyHandler)
TypeScript (in application):
import { createRuntime, HandlerRegistry, EventEmitter, EventPoller, StepExecutionSubscriber } from '@tasker-systems/tasker';
// Bootstrap worker (web server disabled via TOML config)
const runtime = createRuntime();
await runtime.load('/path/to/libtasker_ts.dylib');
runtime.bootstrapWorker({ namespace: 'my-app' });
// Register handlers
const registry = new HandlerRegistry();
registry.register('my_handler', MyHandler);
// Start event processing
const emitter = new EventEmitter();
const subscriber = new StepExecutionSubscriber(emitter, registry, runtime, {});
subscriber.start();
const poller = new EventPoller(runtime, emitter);
poller.start();
Core Concepts
1. Handler Registration
All workers use a registry pattern for handler discovery:
┌─────────────────────┐
│ HandlerRegistry │
│ (Singleton) │
└─────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Handler A│ │Handler B│ │Handler C│
└─────────┘ └─────────┘ └─────────┘
2. Event Flow
Step events flow through a consistent pipeline:
1. PGMQ Queue → Event received
2. Worker claims step (atomic)
3. Handler resolved by name
4. Handler.call(context) executed
5. Result sent to completion channel
6. Orchestration receives result
3. Error Classification
All workers distinguish between:
- Retryable Errors: Transient failures → Re-enqueue step
- Permanent Errors: Unrecoverable → Mark step failed
4. Graceful Shutdown
All workers handle shutdown signals (SIGTERM, SIGINT):
1. Signal received
2. Stop accepting new work
3. Complete in-flight handlers
4. Flush completion channel
5. Shutdown Rust foundation
6. Exit cleanly
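A minimal Rust sketch of the first step in that sequence, waiting on SIGTERM/SIGINT with Tokio before starting the drain; the drain itself is worker-internal:
use tokio::signal::unix::{signal, SignalKind};
// Block until SIGTERM or SIGINT arrives, then let the caller run its drain
// sequence (stop claiming work, finish in-flight handlers, flush, exit).
async fn wait_for_shutdown_signal() -> std::io::Result<()> {
    let mut sigterm = signal(SignalKind::terminate())?;
    let mut sigint = signal(SignalKind::interrupt())?;
    tokio::select! {
        _ = sigterm.recv() => tracing::info!("SIGTERM received; draining in-flight work"),
        _ = sigint.recv() => tracing::info!("SIGINT received; draining in-flight work"),
    }
    Ok(())
}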
Configuration
Environment Variables
Common across all workers:
| Variable | Description | Default |
|---|---|---|
DATABASE_URL | PostgreSQL connection string | Required |
TASKER_ENV | Environment (test/development/production) | development |
TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
TASKER_NAMESPACE | Worker namespace for queue isolation | default |
RUST_LOG | Log level (trace/debug/info/warn/error) | info |
Language-Specific
Ruby:
| Variable | Description |
|---|---|
RUBY_GC_HEAP_GROWTH_FACTOR | GC tuning for production |
Python:
| Variable | Description |
|---|---|
PYTHON_HANDLER_PATH | Path for handler auto-discovery |
Handler Types
All workers support specialized handler types:
StepHandler (Base)
Basic step execution:
class MyHandler(StepHandler):
    handler_name = "my_handler"

    def call(self, context):
        return self.success({"result": "done"})
ApiHandler
HTTP/REST API integration with automatic error classification:
class FetchDataHandler < TaskerCore::StepHandler::Api
def call(context)
user_id = context.get_task_field('user_id')
response = connection.get("/users/#{user_id}")
process_response(response)
success(result: response.body)
end
end
DecisionHandler
Dynamic workflow routing:
class RouteHandler(DecisionHandler):
    handler_name = "route_handler"

    def call(self, context):
        if context.input_data["amount"] < 1000:
            return self.route_to_steps(["auto_approve"])
        return self.route_to_steps(["manager_approval"])
Batchable
Large dataset processing. Note: Ruby uses subclass inheritance, Python uses mixin:
Ruby (subclass of Base):
class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
def call(context)
batch_ctx = get_batch_context(context)
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Process batch using batch_ctx.start_cursor, batch_ctx.end_cursor
batch_worker_complete(processed_count: batch_ctx.batch_size)
end
end
Python (mixin):
class CsvBatchProcessor(StepHandler, Batchable):
    handler_name = "csv_batch_processor"

    def call(self, context: StepContext) -> StepHandlerResult:
        batch_ctx = self.get_batch_context(context)
        if batch_ctx is None:
            return self.failure(message="No batch context", error_type="missing_context")
        # Process batch using batch_ctx.start_cursor, batch_ctx.end_cursor
        batch_size = batch_ctx.cursor_config.end_cursor - batch_ctx.cursor_config.start_cursor
        return self.batch_worker_success(items_processed=batch_size)
Quick Start
Rust
# Build and run
cd workers/rust
cargo run
# With custom configuration
TASKER_CONFIG_PATH=/path/to/config.toml cargo run
Ruby
# Install dependencies
cd workers/ruby
bundle install
bundle exec rake compile
# Run server
./bin/server.rb
Python
# Install dependencies
cd workers/python
uv sync
uv run maturin develop
# Run server
python bin/server.py
TypeScript
# Install dependencies
cd workers/typescript
bun install
cargo build --release -p tasker-ts
# Run server (Bun)
bun run bin/server.ts
# Run server (Node.js)
npx tsx bin/server.ts
# Run server (Deno)
deno run --allow-ffi --allow-env --allow-net bin/server.ts
Monitoring
Health Checks
All workers expose health status:
# Python
from tasker_core import get_health_check
health = get_health_check()
# Ruby
health = TaskerCore::FFI.health_check
Metrics
Common metrics available:
| Metric | Description |
|---|---|
pending_count | Events awaiting processing |
in_flight_count | Events being processed |
completed_count | Successfully completed |
failed_count | Failed events |
starvation_detected | Processing bottleneck |
Logging
All workers use structured logging:
2025-01-15T10:30:00Z [INFO] python-worker: Processing step step_uuid=abc-123 handler=process_order
2025-01-15T10:30:01Z [INFO] python-worker: Step completed step_uuid=abc-123 success=true duration_ms=150
Architecture Deep Dive
For detailed architectural documentation:
- Worker Event Systems - Dual-channel architecture, event-driven processing
- Worker Actors - Actor-based coordination, message handling
- Events and Commands - Event definitions, command routing
See Also
- API Convergence Matrix - Quick reference tables
- Example Handlers - Side-by-side code examples
- Patterns and Practices - Common patterns
- Rust Worker - Native implementation details
- Ruby Worker - Ruby gem documentation
- Python Worker - Python package documentation
- TypeScript Worker - TypeScript/JS package documentation
API Convergence Matrix
Last Updated: 2026-01-08 Status: Active
← Back to Worker Crates Overview
Overview
This document provides a quick reference for the aligned APIs across Ruby, Python, TypeScript, and Rust worker implementations. All four languages share consistent patterns for handler execution, result creation, registry operations, and composition via mixins/traits.
Handler Signatures
| Language | Base Class | Signature |
|---|---|---|
| Ruby | TaskerCore::StepHandler::Base | def call(context) |
| Python | BaseStepHandler | def call(self, context: StepContext) -> StepHandlerResult |
| TypeScript | StepHandler | async call(context: StepContext): Promise<StepHandlerResult> |
| Rust | StepHandler trait | async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult |
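For reference, the Rust row corresponds to an implementation shaped roughly like this sketch; the handler, payload, and default metadata argument are illustrative:
use async_trait::async_trait;
pub struct ProcessOrderHandler;
#[async_trait]
impl StepHandler for ProcessOrderHandler {
    // Signature from the table: borrow the step data, return a structured result
    async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult {
        let _ = step_data; // inspect inputs / dependency results here
        StepExecutionResult::success(serde_json::json!({ "status": "done" }), Default::default())
    }
}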
Composition Pattern
All languages use composition via mixins/traits rather than inheritance hierarchies.
Handler Composition
| Language | Base | Mixin Syntax | Example |
|---|---|---|---|
| Ruby | StepHandler::Base | include Mixins::API | class Handler < Base; include Mixins::API |
| Python | StepHandler | Multiple inheritance | class Handler(StepHandler, APIMixin) |
| TypeScript | StepHandler | applyAPI(this) | Mixin functions applied in constructor |
| Rust | impl StepHandler | impl APICapable | Multiple trait implementations |
Available Mixins/Traits
| Capability | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| API | Mixins::API | APIMixin | applyAPI() | APICapable |
| Decision | Mixins::Decision | DecisionMixin | applyDecision() | DecisionCapable |
| Batchable | Mixins::Batchable | BatchableMixin | BatchableHandler | BatchableCapable |
StepContext Fields
The StepContext provides unified access to step execution data across Ruby, Python, and TypeScript.
| Field | Type | Description |
|---|---|---|
task_uuid | String | Unique task identifier (UUID v7) |
step_uuid | String | Unique step identifier (UUID v7) |
input_data | Dict/Hash | Input data for the step from workflow_step.inputs |
step_inputs | Dict/Hash | Alias for input_data |
step_config | Dict/Hash | Handler configuration from step_definition.handler.initialization |
dependency_results | Wrapper | Results from parent steps (DependencyResultsWrapper) |
retry_count | Integer | Current retry attempt (from workflow_step.attempts) |
max_retries | Integer | Maximum retry attempts (from workflow_step.max_attempts) |
Convenience Methods
| Method | Description |
|---|---|
get_task_field(name) | Get field from task context |
get_dependency_result(step_name) | Get result from a parent step |
Ruby-Specific Accessors
| Property | Type | Description |
|---|---|---|
task | TaskWrapper | Full task wrapper with context and metadata |
workflow_step | WorkflowStepWrapper | Workflow step with execution state |
step_definition | StepDefinitionWrapper | Step definition from task template |
Result Factories
Success Results
| Language | Method | Example |
|---|---|---|
| Ruby | success(result:, metadata:) | success(result: { id: 123 }, metadata: { ms: 50 }) |
| Python | self.success(result, metadata) | self.success({"id": 123}, {"ms": 50}) |
| Rust | StepExecutionResult::success(...) | StepExecutionResult::success(result, metadata) |
Failure Results
| Language | Method | Key Parameters |
|---|---|---|
| Ruby | failure(message:, error_type:, error_code:, retryable:, metadata:) | keyword arguments |
| Python | self.failure(message, error_type, error_code, retryable, metadata) | positional/keyword |
| Rust | StepExecutionResult::failure(...) | structured fields |
Result Fields
| Field | Ruby | Python | Rust | Description |
|---|---|---|---|---|
| success | bool | bool | bool | Whether step succeeded |
| result | Hash | Dict | HashMap | Result data |
| metadata | Hash | Dict | HashMap | Additional context |
| error_message | String | str | String | Human-readable error |
| error_type | String | str | String | Error classification |
| error_code | String (optional) | str (optional) | String (optional) | Application error code |
| retryable | bool | bool | bool | Whether to retry |
Standard error_type Values
Use these standard values for consistent error classification:
| Value | Description | Retry Behavior |
|---|---|---|
PermanentError | Non-recoverable failure | Never retry |
RetryableError | Temporary failure | Will retry |
ValidationError | Input validation failed | No retry |
TimeoutError | Operation timed out | May retry |
UnexpectedError | Unexpected handler error | May retry |
Registry API
| Operation | Ruby | Python | Rust |
|---|---|---|---|
| Register | register(name, klass) | register(name, klass) | register_handler(name, handler) |
| Check | is_registered(name) | is_registered(name) | is_registered(name) |
| Resolve | resolve(name) | resolve(name) | get_handler(name) |
| List | list_handlers | list_handlers() | list_handlers() |
Note: Ruby also provides original method names (register_handler, handler_available?, resolve_handler, registered_handlers) as the primary API with the above as cross-language aliases.
Resolver Chain API
Handler resolution uses a chain-of-responsibility pattern to convert callable addresses into executable handlers.
StepHandlerResolver Interface
| Method | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| Get Name | name | resolver_name() | resolverName() | resolver_name(&self) |
| Get Priority | priority | priority() | priority() | priority(&self) |
| Can Resolve? | can_resolve?(definition, config) | can_resolve(definition) | canResolve(definition) | can_resolve(&self, definition) |
| Resolve | resolve(definition, config) | resolve(definition, context) | resolve(definition, context) | resolve(&self, definition, context) |
ResolverChain Operations
| Operation | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| Create | ResolverChain.new | ResolverChain() | new ResolverChain() | ResolverChain::new() |
| Register | register(resolver) | register(resolver) | register(resolver) | register(resolver) |
| Resolve | resolve(definition, context) | resolve(definition, context) | resolve(definition, context) | resolve(definition, context) |
| Can Resolve? | can_resolve?(definition) | can_resolve(definition) | canResolve(definition) | can_resolve(definition) |
| List | resolvers | resolvers | resolvers | resolvers() |
Built-in Resolvers
| Resolver | Priority | Function | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|---|---|
| ExplicitMappingResolver | 10 | Hash lookup of registered handlers | ✅ | ✅ | ✅ | ✅ |
| ClassConstantResolver | 100 | Runtime class lookup (Ruby) | ❌ | ✅ | - | - |
| ClassLookupResolver | 100 | Runtime class lookup (Python/TS) | ❌ | - | ✅ | ✅ |
Note: Class lookup resolvers are not available in Rust due to lack of runtime reflection. Rust handlers must use ExplicitMappingResolver. Ruby uses ClassConstantResolver (Ruby terminology); Python and TypeScript use ClassLookupResolver (same functionality, language-appropriate naming).
HandlerDefinition Fields
| Field | Type | Description | Required |
|---|---|---|---|
callable | String | Handler address (name or class path) | Yes |
method | String | Entry point method (default: "call") | No |
resolver | String | Resolution hint to bypass chain | No |
initialization | Dict/Hash | Handler configuration | No |
Method Dispatch
Multi-method handlers expose multiple entry points through the method field:
| Language | Default Method | Dynamic Dispatch |
|---|---|---|
| Ruby | call | handler.public_send(method, context) |
| Python | call | getattr(handler, method)(context) |
| TypeScript | call | handler[method](context) |
| Rust | call | handler.invoke_method(method, step) |
Creating Multi-Method Handlers:
| Language | Signature |
|---|---|
| Ruby | Define additional methods alongside call |
| Python | Define additional methods alongside call |
| TypeScript | Define additional async methods alongside call |
| Rust | Implement invoke_method to dispatch to internal methods |
See Handler Resolution Guide for complete documentation.
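A Rust sketch of the multi-method dispatch described above; the generate and archive methods and their payloads are illustrative:
use async_trait::async_trait;
pub struct ReportHandler;
impl ReportHandler {
    async fn generate(&self, _step: &TaskSequenceStep) -> StepExecutionResult {
        StepExecutionResult::success(serde_json::json!({ "report": "generated" }), Default::default())
    }
    async fn archive(&self, _step: &TaskSequenceStep) -> StepExecutionResult {
        StepExecutionResult::success(serde_json::json!({ "archived": true }), Default::default())
    }
}
#[async_trait]
impl StepHandler for ReportHandler {
    async fn call(&self, step: &TaskSequenceStep) -> StepExecutionResult {
        self.generate(step).await // default entry point
    }
    async fn invoke_method(&self, method: &str, step: &TaskSequenceStep) -> StepExecutionResult {
        // Dispatch on the `method` field from the HandlerDefinition
        match method {
            "archive" => self.archive(step).await,
            _ => self.call(step).await, // "call" or anything unrecognized
        }
    }
}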
Specialized Handlers
API Handler
| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| GET | get(path, params: {}, headers: {}) | self.get(path, params={}, headers={}) | this.get(path, params?, headers?) |
| POST | post(path, data: {}, headers: {}) | self.post(path, data={}, headers={}) | this.post(path, data?, headers?) |
| PUT | put(path, data: {}, headers: {}) | self.put(path, data={}, headers={}) | this.put(path, data?, headers?) |
| DELETE | delete(path, params: {}, headers: {}) | self.delete(path, params={}, headers={}) | this.delete(path, params?, headers?) |
Decision Handler
| Language | Simple API | Result Fields |
|---|---|---|
| Ruby | decision_success(steps:, routing_context:) | decision_point_outcome: { type, step_names } |
| Python | decision_success(steps, routing_context) | decision_point_outcome: { type, step_names } |
| TypeScript | decisionSuccess(steps, routingContext?) | decision_point_outcome: { type, step_names } |
| Rust | decision_success(step_uuid, step_names, ...) | Pattern-based |
Decision Helper Methods (Cross-Language):
- decision_success(steps, routing_context) - Create dynamic steps
- skip_branches(reason, routing_context) - Skip all conditional branches
- decision_failure(message, error_type) - Decision could not be made
Batchable Handler
| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Get Context | get_batch_context(context) | get_batch_context(context) | getBatchContext(context) |
| Complete Batch | batch_worker_complete(processed_count:, result_data:) | batch_worker_complete(processed_count, result_data) | batchWorkerComplete(processedCount, resultData) |
| Handle No-Op | handle_no_op_worker(batch_ctx) | handle_no_op_worker(batch_ctx) | handleNoOpWorker(batchCtx) |
Standard Batch Result Fields:
- processed_count / items_processed
- items_succeeded / items_failed
- start_cursor, end_cursor, batch_size, last_cursor
Cursor Indexing:
- All languages use 0-indexed cursors (start at 0, not 1)
- Ruby was updated from 1-indexed to 0-indexed for consistency
Checkpoint Yielding
Checkpoint yielding enables batch workers to persist progress and yield control for re-dispatch.
| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Checkpoint | checkpoint_yield(cursor:, items_processed:, accumulated_results:) | checkpoint_yield(cursor, items_processed, accumulated_results) | checkpointYield({ cursor, itemsProcessed, accumulatedResults }) |
BatchWorkerContext Checkpoint Accessors:
| Accessor | Ruby | Python | TypeScript |
|---|---|---|---|
| Cursor | checkpoint_cursor | checkpoint_cursor | checkpointCursor |
| Accumulated Results | accumulated_results | accumulated_results | accumulatedResults |
| Has Checkpoint? | has_checkpoint? | has_checkpoint() | hasCheckpoint() |
| Items Processed | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed |
FFI Contract:
| Function | Description |
|---|---|
checkpoint_yield_step_event(event_id, data) | Persist checkpoint and re-dispatch step |
Key Invariants:
- Progress is atomically saved before re-dispatch
- Step remains InProgress during the checkpoint yield cycle
- Only Success/Failure trigger state transitions
See Batch Processing Guide - Checkpoint Yielding for full documentation.
Domain Events
Publisher Contract
| Language | Base Class | Key Method |
|---|---|---|
| Ruby | TaskerCore::DomainEvents::BasePublisher | publish(ctx) |
| Python | BasePublisher | publish(ctx) |
| TypeScript | BasePublisher | publish(ctx) |
| Rust | StepEventPublisher trait | publish(ctx) |
Publisher Lifecycle Hooks
All languages support publisher lifecycle hooks for instrumentation:
| Hook | Ruby | Python | TypeScript | Description |
|---|---|---|---|---|
| Before Publish | before_publish(ctx) | before_publish(ctx) | beforePublish(ctx) | Called before publishing |
| After Publish | after_publish(ctx, result) | after_publish(ctx, result) | afterPublish(ctx, result) | Called after successful publish |
| On Error | on_publish_error(ctx, error) | on_publish_error(ctx, error) | onPublishError(ctx, error) | Called on publish failure |
| Metadata | additional_metadata(ctx) | additional_metadata(ctx) | additionalMetadata(ctx) | Inject custom metadata |
StepEventContext Fields
| Field | Description |
|---|---|
task_uuid | Task identifier |
step_uuid | Step identifier |
step_name | Handler/step name |
namespace | Task namespace |
correlation_id | Tracing correlation ID |
result | Step execution result |
metadata | Additional metadata |
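As a rough Python sketch of the publisher surface (the import path, attribute-style access to ctx, and the body of publish are assumptions; the hook names and context fields follow the tables above):
# Import path is an assumption; see your worker package for the actual module.
from tasker_core.domain_events import BasePublisher

class OrderEventsPublisher(BasePublisher):
    def before_publish(self, ctx):
        # Instrumentation hook: runs before the event is published
        print(f"publishing for step {ctx.step_name} ({ctx.step_uuid})")

    def additional_metadata(self, ctx):
        # Merge custom metadata into the published event
        return {"correlation_id": ctx.correlation_id, "team": "orders"}

    def publish(self, ctx):
        # Assemble an event payload from the step result; the actual delivery
        # mechanism is provided by the framework and not shown here
        return {
            "event": "order.step.completed",
            "task_uuid": ctx.task_uuid,
            "result": ctx.result,
        }

    def on_publish_error(self, ctx, error):
        print(f"publish failed for {ctx.step_uuid}: {error}")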
Subscriber Contract
| Language | Base Class | Key Methods |
|---|---|---|
| Ruby | TaskerCore::DomainEvents::BaseSubscriber | subscribes_to, handle(event) |
| Python | BaseSubscriber | subscribes_to(), handle(event) |
| TypeScript | BaseSubscriber | subscribesTo(), handle(event) |
| Rust | EventHandler closures | N/A |
Subscriber Lifecycle Hooks
All languages support subscriber lifecycle hooks:
| Hook | Ruby | Python | TypeScript | Description |
|---|---|---|---|---|
| Before Handle | before_handle(event) | before_handle(event) | beforeHandle(event) | Called before handling |
| After Handle | after_handle(event, result) | after_handle(event, result) | afterHandle(event, result) | Called after handling |
| On Error | on_handle_error(event, error) | on_handle_error(event, error) | onHandleError(event, error) | Called on handler failure |
Registries
| Language | Publisher Registry | Subscriber Registry |
|---|---|---|
| Ruby | PublisherRegistry.instance | SubscriberRegistry.instance |
| Python | PublisherRegistry.instance() | SubscriberRegistry.instance() |
| TypeScript | PublisherRegistry.getInstance() | SubscriberRegistry.getInstance() |
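A matching Python subscriber sketch (again, the import path and event names are assumptions; the methods and hooks follow the contract above, and the registration call itself is worker-specific):
from tasker_core.domain_events import BaseSubscriber, SubscriberRegistry  # assumed path

class OrderAuditSubscriber(BaseSubscriber):
    def subscribes_to(self):
        # Event names are illustrative
        return ["order.step.completed"]

    def before_handle(self, event):
        print(f"received {event}")

    def handle(self, event):
        # Keep handlers idempotent: events may be redelivered
        print(f"audited {event}")

    def on_handle_error(self, event, error):
        print(f"audit subscriber failed: {error}")

# Subscribers are looked up through the singleton registry shown above;
# the exact registration call is worker-specific and not shown here.
registry = SubscriberRegistry.instance()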
Migration Summary
Ruby
| Before | After |
|---|---|
def call(task, sequence, step) | def call(context) |
class Handler < API | class Handler < Base; include Mixins::API |
task.context['field'] | context.get_task_field('field') |
sequence.get_results('step') | context.get_dependency_result('step') |
| 1-indexed cursors | 0-indexed cursors |
Python
| Before | After |
|---|---|
def handle(self, task, sequence, step) | def call(self, context) |
class Handler(APIHandler) | class Handler(StepHandler, APIMixin) |
| N/A | self.success(result, metadata) |
| N/A | Publisher/Subscriber lifecycle hooks |
TypeScript
| Before | After |
|---|---|
class Handler extends APIHandler | class Handler extends StepHandler implements APICapable |
| No domain events | Complete domain events module |
| N/A | Publisher/Subscriber lifecycle hooks |
| N/A | applyAPI(this), applyDecision(this) mixins |
Rust
| Before | After |
|---|---|
| (already aligned) | (already aligned) |
| N/A | APICapable, DecisionCapable, BatchableCapable traits |
See Also
- Example Handlers - Side-by-side code examples
- Patterns and Practices - Common patterns
- Ruby Worker - Ruby implementation details
- Python Worker - Python implementation details
- TypeScript Worker - TypeScript implementation details
- Rust Worker - Rust implementation details
- Composition Over Inheritance - Why mixins over inheritance
- FFI Boundary Types - Cross-language type alignment
- Handler Resolution Guide - Custom resolver strategies
Example Handlers - Cross-Language Reference
Last Updated: 2025-12-21 Status: Active <- Back to Worker Crates Overview
Overview
This document provides side-by-side handler examples across Ruby, Python, and Rust. These examples demonstrate the aligned APIs that enable consistent patterns across all worker implementations.
Simple Step Handler
Ruby
class ProcessOrderHandler < TaskerCore::StepHandler::Base
def call(context)
order_id = context.get_task_field('order_id')
amount = context.get_task_field('amount')
result = process_order(order_id, amount)
success(
result: {
order_id: order_id,
status: 'processed',
total: result[:total]
},
metadata: { processed_at: Time.now.iso8601 }
)
rescue StandardError => e
failure(
message: e.message,
error_type: 'UnexpectedError',
retryable: true,
metadata: { order_id: order_id }
)
end
private
def process_order(order_id, amount)
# Business logic here
{ total: amount * 1.08 }
end
end
Python
from datetime import datetime
from tasker_core import BaseStepHandler, StepContext, StepHandlerResult
class ProcessOrderHandler(BaseStepHandler):
def call(self, context: StepContext) -> StepHandlerResult:
try:
order_id = context.get_task_field("order_id")
amount = context.get_task_field("amount")
result = self.process_order(order_id, amount)
return self.success(
result={
"order_id": order_id,
"status": "processed",
"total": result["total"],
},
metadata={"processed_at": datetime.now().isoformat()},
)
except Exception as e:
return self.failure(
message=str(e),
error_type="handler_error",
retryable=True,
metadata={"order_id": order_id},
)
def process_order(self, order_id: str, amount: float) -> dict:
# Business logic here
return {"total": amount * 1.08}
Rust
#![allow(unused)]
fn main() {
use tasker_shared::types::{TaskSequenceStep, StepExecutionResult};
pub struct ProcessOrderHandler;
impl ProcessOrderHandler {
pub async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult {
let order_id = step_data.task.context.get("order_id")
.and_then(|v| v.as_str())
.unwrap_or_default();
let amount = step_data.task.context.get("amount")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
match self.process_order(order_id, amount).await {
Ok(result) => StepExecutionResult::success(
serde_json::json!({
"order_id": order_id,
"status": "processed",
"total": result.total,
}),
Some(serde_json::json!({
"processed_at": chrono::Utc::now().to_rfc3339(),
})),
),
Err(e) => StepExecutionResult::failure(
&e.to_string(),
"handler_error",
true, // retryable
),
}
}
async fn process_order(&self, _order_id: &str, amount: f64) -> Result<OrderResult, Error> {
Ok(OrderResult { total: amount * 1.08 })
}
}
}
Handler with Dependencies
Ruby
class ShipOrderHandler < TaskerCore::StepHandler::Base
def call(context)
# Get results from dependent steps
validation = context.get_dependency_result('validate_order')
payment = context.get_dependency_result('process_payment')
unless validation && validation['valid']
return failure(
message: 'Order validation failed',
error_type: 'ValidationError',
retryable: false
)
end
unless payment && payment['status'] == 'completed'
return failure(
message: 'Payment not completed',
error_type: 'PermanentError',
retryable: false
)
end
# Access task context
order_id = context.get_task_field('order_id')
shipping_address = context.get_task_field('shipping_address')
tracking_number = create_shipment(order_id, shipping_address)
success(result: {
order_id: order_id,
tracking_number: tracking_number,
shipped_at: Time.now.iso8601
})
end
end
Python
class ShipOrderHandler(BaseStepHandler):
def call(self, context: StepContext) -> StepHandlerResult:
# Get results from dependent steps
validation = context.get_dependency_result("validate_order")
payment = context.get_dependency_result("process_payment")
if not validation or not validation.get("valid"):
return self.failure(
message="Order validation failed",
error_type="validation_error",
retryable=False,
)
if not payment or payment.get("status") != "completed":
return self.failure(
message="Payment not completed",
error_type="permanent_error",
retryable=False,
)
# Access task context
order_id = context.get_task_field("order_id")
shipping_address = context.get_task_field("shipping_address")
tracking_number = self.create_shipment(order_id, shipping_address)
return self.success(
result={
"order_id": order_id,
"tracking_number": tracking_number,
"shipped_at": datetime.now().isoformat(),
}
)
Decision Handler
Ruby
class ApprovalRoutingHandler < TaskerCore::StepHandler::Decision
THRESHOLDS = {
auto_approve: 1000,
manager_only: 5000
}.freeze
def call(context)
amount = context.get_task_field('amount').to_f
department = context.get_task_field('department')
if amount < THRESHOLDS[:auto_approve]
decision_success(
steps: ['auto_approve'],
result_data: {
route_type: 'automatic',
amount: amount,
reason: 'Below threshold'
}
)
elsif amount < THRESHOLDS[:manager_only]
decision_success(
steps: ['manager_approval'],
result_data: {
route_type: 'manager',
amount: amount,
approver: find_manager(department)
}
)
else
decision_success(
steps: ['manager_approval', 'finance_review'],
result_data: {
route_type: 'dual_approval',
amount: amount,
requires_cfo: amount > 50_000
}
)
end
end
private
def find_manager(department)
# Lookup logic
"manager@example.com"
end
end
Python
class ApprovalRoutingHandler(DecisionHandler):
THRESHOLDS = {
"auto_approve": 1000,
"manager_only": 5000,
}
def call(self, context: StepContext) -> StepHandlerResult:
amount = float(context.get_task_field("amount") or 0)
department = context.get_task_field("department")
if amount < self.THRESHOLDS["auto_approve"]:
return self.decision_success(
steps=["auto_approve"],
routing_context={
"route_type": "automatic",
"amount": amount,
"reason": "Below threshold",
},
)
elif amount < self.THRESHOLDS["manager_only"]:
return self.decision_success(
steps=["manager_approval"],
routing_context={
"route_type": "manager",
"amount": amount,
"approver": self.find_manager(department),
},
)
else:
return self.decision_success(
steps=["manager_approval", "finance_review"],
routing_context={
"route_type": "dual_approval",
"amount": amount,
"requires_cfo": amount > 50000,
},
)
def find_manager(self, department: str) -> str:
return "manager@example.com"
Batch Processing Handler
Ruby (Analyzer)
class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
BATCH_SIZE = 100
def call(context)
file_path = context.get_task_field('csv_file_path')
total_rows = count_csv_rows(file_path)
if total_rows <= BATCH_SIZE
# Small file - process inline, no batches needed
outcome = TaskerCore::Types::BatchProcessingOutcome.no_batches
success(
result: {
batch_processing_outcome: outcome.to_h,
total_rows: total_rows,
processing_mode: 'inline'
}
)
else
# Large file - create batch workers
cursor_configs = calculate_batches(total_rows, BATCH_SIZE)
outcome = TaskerCore::Types::BatchProcessingOutcome.create_batches(
worker_template_name: 'process_csv_batch',
worker_count: cursor_configs.size,
cursor_configs: cursor_configs,
total_items: total_rows
)
success(
result: {
batch_processing_outcome: outcome.to_h,
total_rows: total_rows,
batch_count: cursor_configs.size
}
)
end
end
private
def calculate_batches(total, batch_size)
(0...total).step(batch_size).map.with_index do |start, idx|
{
'batch_id' => format('%03d', idx),
'start_cursor' => start,
'end_cursor' => [start + batch_size, total].min,
'batch_size' => [batch_size, total - start].min
}
end
end
end
Ruby (Batch Worker)
class CsvBatchWorkerHandler < TaskerCore::StepHandler::Batchable
def call(context)
batch_ctx = get_batch_context(context)
# Handle placeholder batches
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Get file path from analyzer step
analyzer_result = context.get_dependency_result('analyze_csv')
file_path = analyzer_result&.dig('csv_file_path')
# Process this batch
records = read_csv_range(file_path, batch_ctx.start_cursor, batch_ctx.batch_size)
processed = records.map { |row| transform_row(row) }
batch_worker_complete(
processed_count: processed.size,
result_data: {
batch_id: batch_ctx.batch_id,
records_processed: processed.size,
summary: calculate_summary(processed)
}
)
end
end
Python (Batch Worker)
class CsvBatchWorkerHandler(BatchableHandler):
def call(self, context: StepContext) -> StepHandlerResult:
batch_ctx = self.get_batch_context(context)
# Handle placeholder batches
no_op_result = self.handle_no_op_worker(batch_ctx)
if no_op_result:
return no_op_result
# Get file path from analyzer step
analyzer_result = context.get_dependency_result("analyze_csv")
file_path = analyzer_result.get("csv_file_path") if analyzer_result else None
# Process this batch
records = self.read_csv_range(
file_path, batch_ctx.start_cursor, batch_ctx.batch_size
)
processed = [self.transform_row(row) for row in records]
return self.batch_worker_complete(
processed_count=len(processed),
result_data={
"batch_id": batch_ctx.batch_id,
"records_processed": len(processed),
"summary": self.calculate_summary(processed),
},
)
API Handler
Ruby
class FetchUserHandler < TaskerCore::StepHandler::Api
def call(context)
user_id = context.get_task_field('user_id')
# Automatic error classification (429 -> retryable, 404 -> permanent)
response = connection.get("/users/#{user_id}")
process_response(response)
success(result: {
user_id: user_id,
email: response.body['email'],
name: response.body['name']
})
end
def base_url
'https://api.example.com'
end
def configure_connection
Faraday.new(base_url) do |conn|
conn.request :json
conn.response :json
conn.options.timeout = 30
end
end
end
Python
class FetchUserHandler(ApiStepHandler):
def call(self, context: StepContext) -> StepHandlerResult:
user_id = context.get_task_field("user_id")
# Automatic error classification
response = self.get(f"/users/{user_id}")
return self.success(
result={
"user_id": user_id,
"email": response["email"],
"name": response["name"],
}
)
@property
def base_url(self) -> str:
return "https://api.example.com"
def configure_session(self, session):
session.headers["Authorization"] = f"Bearer {self.get_token()}"
session.timeout = 30
Error Handling Patterns
Ruby - Raising Exceptions
class ValidateOrderHandler < TaskerCore::StepHandler::Base
def call(context)
order = context.task.context
# Permanent error - will not retry
if order['amount'].to_f <= 0
raise TaskerCore::Errors::PermanentError.new(
'Order amount must be positive',
error_code: 'INVALID_AMOUNT',
context: { amount: order['amount'] }
)
end
# Retryable error - will retry with backoff
if external_service_unavailable?
raise TaskerCore::Errors::RetryableError.new(
'External service temporarily unavailable',
retry_after: 30,
context: { service: 'payment_gateway' }
)
end
success(result: { valid: true })
end
end
Python - Returning Failures
class ValidateOrderHandler(BaseStepHandler):
def call(self, context: StepContext) -> StepHandlerResult:
order = context.task.context
# Permanent error - will not retry
amount = float(order.get("amount", 0))
if amount <= 0:
return self.failure(
message="Order amount must be positive",
error_type="validation_error",
error_code="INVALID_AMOUNT",
retryable=False,
metadata={"amount": amount},
)
# Retryable error - will retry with backoff
if self.external_service_unavailable():
return self.failure(
message="External service temporarily unavailable",
error_type="retryable_error",
retryable=True,
metadata={"service": "payment_gateway"},
)
return self.success(result={"valid": True})
See Also
- API Convergence Matrix - Quick reference tables
- Patterns and Practices - Common patterns
- Ruby Worker - Ruby implementation details
- Python Worker - Python implementation details
- Rust Worker - Rust implementation details
FFI Safety Safeguards
Last Updated: 2026-02-02 Status: Production Implementation Applies To: Ruby (Magnus), Python (PyO3), TypeScript (C FFI) workers
Overview
Tasker’s FFI workers embed the Rust tasker-worker runtime inside language-specific host processes (Ruby, Python, TypeScript/JavaScript). This document describes the safeguards that prevent Rust-side failures from crashing or corrupting the host process, ensuring that infrastructure unavailability, misconfiguration, and unexpected panics are surfaced as language-native errors rather than process faults.
FFI Architecture
Host Process (Ruby / Python / Node.js)
│
▼
FFI Boundary
┌─────────────────────────────────────┐
│ Language Binding Layer │
│ (Magnus / PyO3 / extern "C") │
│ │
│ ┌─────────────────────────────┐ │
│ │ Bridge Module │ │
│ │ (bootstrap, poll, complete)│ │
│ └────────────┬────────────────┘ │
│ │ │
│ ┌────────────▼────────────────┐ │
│ │ FfiDispatchChannel │ │
│ │ (event dispatch, callbacks)│ │
│ └────────────┬────────────────┘ │
│ │ │
│ ┌────────────▼────────────────┐ │
│ │ WorkerBootstrap │ │
│ │ (runtime, DB, messaging) │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────┘
Panic Safety by Framework
Each FFI framework provides different levels of automatic panic protection:
| Framework | Panic Handling | Mechanism |
|---|---|---|
| Magnus (Ruby) | Automatic | Catches panics at FFI boundary, converts to Ruby RuntimeError |
| PyO3 (Python) | Automatic | Catches panics at #[pyfunction] boundary, converts to PanicException |
| C FFI (TypeScript) | Manual | Requires explicit std::panic::catch_unwind wrappers |
TypeScript C FFI: Explicit Panic Guards
Because the TypeScript worker uses raw extern "C" functions (for compatibility with Node.js, Bun, and Deno FFI), panics unwinding through this boundary would be undefined behavior. All extern "C" functions that call into bridge internals are wrapped with catch_unwind:
#![allow(unused)]
fn main() {
// workers/typescript/src-rust/lib.rs
#[no_mangle]
pub unsafe extern "C" fn bootstrap_worker(config_json: *const c_char) -> *mut c_char {
// ... parse config_json ...
let result = panic::catch_unwind(AssertUnwindSafe(|| {
bridge::bootstrap_worker_internal(config_str)
}));
match result {
Ok(Ok(json)) => /* return JSON */,
Ok(Err(e)) => json_error(&format!("Bootstrap failed: {}", e)),
Err(panic_info) => {
// Extract panic message, log it, return JSON error
json_error(&msg)
}
}
}
}
Protected functions: bootstrap_worker, stop_worker, get_worker_status, transition_to_graceful_shutdown, poll_step_events, poll_in_process_events, complete_step_event, checkpoint_yield_step_event, get_ffi_dispatch_metrics, check_starvation_warnings, cleanup_timeouts.
Error Handling at FFI Boundaries
Bootstrap Failures
When infrastructure is unavailable during worker startup, errors flow through the normal Result path rather than panicking:
| Failure Scenario | Handling | Host Process Impact |
|---|---|---|
| Database unreachable | TaskerError::DatabaseError returned | Language exception, app can retry |
| Config TOML missing | TaskerError::ConfigurationError returned | Language exception with descriptive message |
| Worker config section absent | TaskerError::ConfigurationError returned | Language exception (was previously a panic) |
| Messaging backend unavailable | TaskerError::ConfigurationError returned | Language exception |
| Tokio runtime creation fails | Logged + language error returned | Language exception |
| Port already in use | TaskerError::WorkerError returned | Language exception |
| Redis/cache unavailable | Graceful degradation to noop cache | No error - worker starts without cache |
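From the host process these failures arrive as ordinary exceptions, so startup can be wrapped in application-level retry logic. A Python sketch (the exception classes named in the comment come from the Python error hierarchy documented later; the retry policy is illustrative):
import time
from tasker_core import bootstrap_worker
from tasker_core.types import BootstrapConfig

def start_worker_with_retry(attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            return bootstrap_worker(BootstrapConfig(namespace="my-app"))
        except Exception as error:  # e.g. WorkerBootstrapError, ConfigurationError
            # Infrastructure errors surface as language exceptions, not process
            # faults, so the host application can log and retry on its own schedule
            print(f"bootstrap attempt {attempt} failed: {error}")
            time.sleep(5)
    raise RuntimeError("worker failed to bootstrap after retries")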
Steady-State Operation Failures
Once bootstrapped, the worker handles infrastructure failures gracefully:
| Failure Scenario | Handling | Host Process Impact |
|---|---|---|
| Database goes down during poll | Poll returns None (no events) | No impact - polling continues |
| Completion channel full | Retry loop with timeout, then logged | Step result may be lost after timeout |
| Completion channel closed | Returns false to caller | App code sees completion failure |
| Callback timeout (5s) | Logged, step completion unaffected | Domain events may be delayed |
| Messaging down during callback | Callback times out, logged | Domain events may not publish |
| Lock poisoned | Error returned to caller | Language exception |
| Worker not initialized | Error returned to caller | Language exception |
Lock Acquisition
All three workers validate lock acquisition before proceeding:
#![allow(unused)]
fn main() {
// Pattern used in all workers
let handle_guard = WORKER_SYSTEM.lock().map_err(|e| {
error!("Failed to acquire worker system lock: {}", e);
// Convert to language-appropriate error
})?;
}
A poisoned mutex (from a previous panic) produces a language exception rather than propagating the original panic.
EventRouter Availability
Post-bootstrap access to the EventRouter uses fallible error handling rather than .expect():
#![allow(unused)]
fn main() {
// Use ok_or_else instead of expect to prevent panic at FFI boundary
let event_router = worker_core.event_router().ok_or_else(|| {
error!("EventRouter not available from WorkerCore after bootstrap");
// Return language-appropriate error
})?;
}
Callback Safety
The FfiDispatchChannel uses a fire-and-forget pattern for post-completion callbacks, preventing the host process from being blocked or deadlocked by Rust-side async operations:
- Completion is sent first - the step result is delivered to the completion channel before any callback fires
- Callback is spawned separately - runs in the Tokio runtime, not the FFI caller’s thread
- Timeout protection - callbacks are bounded by a configurable timeout (default 5s)
- Callback failures are logged - they never affect step completion or the host process
FFI Thread (Ruby/Python/JS) Tokio Runtime
│ │
├──► complete(event_id, result) │
│ ├──► send result to channel │
│ └──► spawn callback ─────────┼──► callback.on_handler_complete()
│ │ (with 5s timeout)
◄──── return true ────────────────│
│ (immediate, non-blocking) │
See docs/development/ffi-callback-safety.md for detailed callback safety guidelines.
Backpressure Protection
Completion Channel
The completion channel uses a try-send retry loop with timeout to prevent indefinite blocking:
- Try-send avoids blocking the FFI thread
- Retry with sleep (10ms intervals) handles transient backpressure
- Timeout (configurable, default 30s) prevents permanent stalls
- Logged when backpressure delays exceed 100ms
Starvation Detection
The FfiDispatchChannel tracks event age and warns when polling falls behind:
- Events older than starvation_warning_threshold_ms (default 10s) trigger warnings
- check_starvation_warnings() can be called periodically from the host process
- FfiDispatchMetrics exposes pending count, oldest event age, and starvation status
Infrastructure Dependency Matrix
| Component | Bootstrap | Poll | Complete | Callback |
|---|---|---|---|---|
| Database | Required (error on failure) | Not needed | Not needed | Errors logged |
| Message Bus | Required (error on failure) | Not needed | Not needed | Errors logged |
| Config System | Required (error on failure) | Not needed | Not needed | Not needed |
| Cache (Redis) | Optional (degrades to noop) | Not needed | Not needed | Not needed |
| Tokio Runtime | Required (error on failure) | Used | Used | Used |
Worker Lifecycle Safety
Start (bootstrap_worker)
- Validates configuration, creates runtime, initializes all subsystems
- All failures return language-appropriate errors
- Already-running detection prevents double initialization
Status (get_worker_status)
- Safe when worker is not initialized (returns running: false)
- Safe when worker is running (queries internal state)
- Lock acquisition failure returns error
Stop (stop_worker)
- Safe when worker is not running (returns success message)
- Sends shutdown signal and clears handle
- In-flight operations complete before shutdown
Graceful Shutdown (transition_to_graceful_shutdown)
- Initiates graceful shutdown allowing in-flight work to drain
- Errors during transition are logged and returned
- Requires worker to be running (error otherwise)
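Put together, a host process can drive this lifecycle with plain exception handling. A Python sketch (bootstrap_worker, stop_worker, and BootstrapConfig are the Python entry points documented later; the signal wiring is illustrative):
import signal
import threading
from tasker_core import bootstrap_worker, stop_worker
from tasker_core.types import BootstrapConfig

shutdown = threading.Event()
signal.signal(signal.SIGTERM, lambda *_: shutdown.set())

# Start: all bootstrap failures are raised as Python exceptions
bootstrap_worker(BootstrapConfig(namespace="my-app"))

# ... register handlers and start polling here ...

# Stop: safe to call even if the worker has already stopped
shutdown.wait()
stop_worker()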
Adding a New FFI Worker
When implementing a new language worker:
- Check framework panic safety - if the framework (like Magnus/PyO3) catches panics automatically, you get protection for free. If using raw C FFI, wrap all extern "C" functions with catch_unwind.
- Use the standard bridge pattern - global WORKER_SYSTEM mutex, BridgeHandle struct containing WorkerSystemHandle + FfiDispatchChannel + runtime.
- Handle all lock acquisitions - always use .map_err() on .lock() calls.
- Avoid .expect() and .unwrap() in FFI code - use ok_or_else() or map_err() to convert to language-appropriate errors.
- Use fire-and-forget callbacks - never block the FFI thread on async operations.
- Integrate starvation detection - call check_starvation_warnings() periodically.
- Expose metrics - expose FfiDispatchMetrics for health monitoring.
Related Documentation
- FFI Callback Safety - Detailed callback patterns and deadlock prevention
- Worker Event Systems - Dispatch and completion channel architecture
- MPSC Channel Guidelines - Channel sizing and configuration
- Worker Patterns & Practices - General worker development patterns
- Memory Management - FFI memory management across languages
FFI Memory Management in TypeScript Workers
Status: Active
Applies To: TypeScript/Bun/Node.js FFI
Related: Ruby (Magnus), Python (PyO3)
Overview
This document explains the memory management pattern used when calling Rust functions from TypeScript via FFI (Foreign Function Interface). Understanding this pattern is critical for preventing memory leaks and undefined behavior.
Key Principle: When Rust hands memory to JavaScript across the FFI boundary, Rust’s ownership system no longer applies. The JavaScript code becomes responsible for explicitly freeing that memory.
The Memory Handoff Pattern
Three-Step Process
// 1. ALLOCATE: Rust allocates memory and returns a pointer
const ptr = this.lib.symbols.get_worker_status() as Pointer;
// 2. READ: JavaScript reads/copies the data from that pointer
const json = new CString(ptr); // Read C string into JS string
const status = JSON.parse(json); // Parse into JS object
// 3. FREE: JavaScript tells Rust to deallocate the memory
this.lib.symbols.free_rust_string(ptr); // Rust frees the memory
// After this point, 'status' is a safe JavaScript object
// and the Rust memory has been freed (no leak)
Why This Pattern Exists
When Rust returns a pointer across the FFI boundary, it deliberately leaks the memory from Rust’s perspective:
#![allow(unused)]
fn main() {
// Rust side:
#[no_mangle]
pub extern "C" fn get_worker_status() -> *mut c_char {
let status = WorkerStatus { /* ... */ };
let json = serde_json::to_string(&status).unwrap();
// into_raw() transfers ownership OUT of Rust's memory system
CString::new(json).unwrap().into_raw()
// Rust's Drop trait will NOT run on this memory!
}
}
The .into_raw() method:
- Converts CString to a raw pointer
- Prevents Rust from freeing the memory when it goes out of scope
- Transfers ownership responsibility to the caller
Without this, Rust would free the memory immediately, and JavaScript would read garbage data (use-after-free).
The Free Function
JavaScript must call back into Rust to free the memory:
#![allow(unused)]
fn main() {
// Rust side:
#[no_mangle]
pub extern "C" fn free_rust_string(ptr: *mut c_char) {
if ptr.is_null() {
return;
}
// SAFETY: We know this pointer came from CString::into_raw()
// and this function is only called once per pointer
unsafe {
let _ = CString::from_raw(ptr);
// CString goes out of scope here and properly frees the memory
}
}
}
This reconstructs the CString from the raw pointer, which causes Rust’s Drop trait to run and free the memory.
Safety Guarantees
This pattern is safe because of three key properties:
1. Single-Threaded JavaScript Runtime
JavaScript (and TypeScript) runs on a single thread (ignoring Web Workers), which means:
- No race conditions: The read → free sequence is atomic from Rust’s perspective
- No concurrent access: Only one piece of code can access the pointer at a time
- Predictable execution order: Steps always happen in sequence
2. One-Way Handoff
Rust follows a strict contract:
Rust allocates → Returns pointer → NEVER TOUCHES IT AGAIN
- Rust doesn’t keep any references to the memory
- Rust never reads or writes to that memory after returning the pointer
- The memory is “orphaned” from Rust’s perspective until free_rust_string is called
3. JavaScript Copies Before Freeing
JavaScript creates a new copy of the data before freeing:
const ptr = this.lib.symbols.get_worker_status() as Pointer;
// Step 1: Read bytes from Rust memory into a JavaScript string
const json = new CString(ptr); // COPY operation
// Step 2: Parse string into JavaScript objects
const status = JSON.parse(json); // Creates new JS objects
// Step 3: Free the Rust memory
this.lib.symbols.free_rust_string(ptr);
// At this point:
// - 'status' is pure JavaScript (managed by V8/JavaScriptCore)
// - Rust memory has been freed (no leak)
// - 'ptr' is invalid (but we never use it again)
The status object is fully owned by JavaScript’s garbage collector. It has no connection to the freed Rust memory.
Comparison to Ruby and Python FFI
Ruby (Magnus)
# Ruby FFI with Magnus
result = TaskerCore::FFI.get_worker_status()
# No explicit free needed - Magnus manages memory via Rust Drop traits
How it works: Magnus creates a bridge between Ruby’s GC and Rust’s ownership system. When Ruby no longer references the object, Rust’s Drop trait eventually runs.
Python (PyO3)
# Python FFI with PyO3
result = tasker_core.get_worker_status()
# No explicit free needed - PyO3 uses Python's reference counting
How it works: PyO3 wraps Rust data in PyObject wrappers. When Python’s reference count reaches zero, the Rust data is dropped.
TypeScript (Bun/Node FFI)
// TypeScript FFI - manual memory management required
const ptr = lib.symbols.get_worker_status();
const json = new CString(ptr);
const status = JSON.parse(json);
lib.symbols.free_rust_string(ptr); // MUST call explicitly
Why different: Bun and Node.js use raw C FFI (similar to ctypes in Python or FFI gem in Ruby). There’s no automatic memory management bridge, so we must manually free.
Tradeoff: More verbose, but gives us complete control and makes memory lifetime explicit.
Common Pitfalls and How We Avoid Them
1. Memory Leak (Forgetting to Free)
Problem:
// BAD: Memory leak
const ptr = this.lib.symbols.get_worker_status();
const json = new CString(ptr);
const status = JSON.parse(json);
// Oops! Forgot to call free_rust_string(ptr)
How we avoid it: Every code path that allocates a pointer must free it. We wrap this in methods like pollStepEvents() that handle the complete lifecycle:
pollStepEvents(): FfiStepEvent[] {
const ptr = this.lib.symbols.poll_step_events() as Pointer;
if (!ptr) {
return []; // No allocation, no free needed
}
const json = new CString(ptr);
const events = JSON.parse(json);
this.lib.symbols.free_rust_string(ptr); // Always freed
return events;
}
2. Double-Free
Problem:
// BAD: Double-free (undefined behavior)
const ptr = this.lib.symbols.get_worker_status();
const json = new CString(ptr);
this.lib.symbols.free_rust_string(ptr);
this.lib.symbols.free_rust_string(ptr); // CRASH! Already freed
How we avoid it: We free the pointer exactly once in each code path, and we never store pointers for reuse. Each pointer is used in a single scope and immediately freed.
3. Use-After-Free
Problem:
// BAD: Use-after-free
const ptr = this.lib.symbols.get_worker_status();
this.lib.symbols.free_rust_string(ptr);
const json = new CString(ptr); // CRASH! Memory is gone
How we avoid it: We always read/copy before freeing. The order is strictly: allocate → read → free.
Pattern in Practice
Example: Worker Status
getWorkerStatus(): WorkerStatus {
// 1. Allocate: Rust allocates memory for JSON string
const ptr = this.lib.symbols.get_worker_status() as Pointer;
// 2. Read: Copy data into JavaScript
const json = new CString(ptr); // Rust memory → JS string
const status = JSON.parse(json); // JS string → JS object
// 3. Free: Deallocate Rust memory
this.lib.symbols.free_rust_string(ptr);
// 4. Return: Pure JavaScript object (safe)
return status;
}
Example: Polling Step Events
pollStepEvents(): FfiStepEvent[] {
const ptr = this.lib.symbols.poll_step_events() as Pointer;
// Handle null pointer (no events available)
if (!ptr) {
return [];
}
const json = new CString(ptr);
const events = JSON.parse(json);
this.lib.symbols.free_rust_string(ptr);
return events;
}
Example: Bootstrap Worker
bootstrapWorker(config: BootstrapConfig): BootstrapResult {
const configJson = JSON.stringify(config);
// Pass JavaScript data TO Rust (no pointer returned)
const ptr = this.lib.symbols.bootstrap_worker(configJson) as Pointer;
// Read the result
const json = new CString(ptr);
const result = JSON.parse(json);
// Free the result pointer
this.lib.symbols.free_rust_string(ptr);
return result;
}
Memory Lifetime Diagrams
Successful Pattern
Time →
JavaScript: [allocate ptr] → [read data] → [free ptr] → [use data]
Rust Memory: [allocated] → [allocated] → [freed] → [freed]
JS Objects: [none] → [created] → [exists] → [exists]
↑
Data copied here
Memory Leak (Anti-Pattern)
Time →
JavaScript: [allocate ptr] → [read data] → [use data] → ...
Rust Memory: [allocated] → [allocated] → [LEAK] → [LEAK]
JS Objects: [none] → [created] → [exists] → [exists]
↑
Forgot to free! Memory leaked
Use-After-Free (Anti-Pattern)
Time →
JavaScript: [allocate ptr] → [free ptr] → [read ptr] → CRASH!
Rust Memory: [allocated] → [freed] → [freed]
JS Objects: [none] → [none] → [CORRUPT]
↑
Reading freed memory!
Best Practices
1. Keep Pointer Lifetime Short
// GOOD: Pointer freed in same scope
const result = this.getWorkerStatus();
// BAD: Don't store pointers
this.statusPtr = this.lib.symbols.get_worker_status(); // Leak risk
2. Always Free in Same Method
// GOOD: Allocate and free in same method
pollStepEvents(): FfiStepEvent[] {
const ptr = this.lib.symbols.poll_step_events();
if (!ptr) return [];
const json = new CString(ptr);
const events = JSON.parse(json);
this.lib.symbols.free_rust_string(ptr);
return events;
}
// BAD: Returning pointer for later freeing
getPtrToStatus(): Pointer {
return this.lib.symbols.get_worker_status(); // Who will free this?
}
3. Handle Null Pointers
// GOOD: Check for null before freeing
const ptr = this.lib.symbols.poll_step_events();
if (!ptr) {
return []; // No memory allocated, nothing to free
}
const json = new CString(ptr);
const events = JSON.parse(json);
this.lib.symbols.free_rust_string(ptr);
return events;
4. Document Ownership in Comments
/**
* Poll for step events from FFI.
*
* MEMORY: This function manages the lifetime of the pointer returned
* by poll_step_events(). The pointer is freed before returning.
*/
pollStepEvents(): FfiStepEvent[] {
// ...
}
Testing Memory Safety
Rust Tests
Rust’s test suite can verify FFI functions don’t leak:
#![allow(unused)]
fn main() {
#[test]
fn test_status_no_leak() {
let ptr = get_worker_status();
assert!(!ptr.is_null());
// Manually free to ensure it works
free_rust_string(ptr);
// If we had a leak, tools like valgrind or AddressSanitizer
// would catch it
}
}
TypeScript Tests
TypeScript tests verify proper usage:
test('status retrieval frees memory', () => {
const runtime = new BunTaskerRuntime();
// This should not leak - memory freed internally
const status = runtime.getWorkerStatus();
expect(status.running).toBeDefined();
// Call multiple times to stress test
for (let i = 0; i < 100; i++) {
runtime.getWorkerStatus();
}
// If we leaked, we'd have 100 leaked strings
});
Leak Detection Tools
- Valgrind (Linux): Detects memory leaks in Rust code
- AddressSanitizer: Detects use-after-free and double-free
- Process memory monitoring: Track RSS growth over time
When in Doubt
Golden Rule: Every *mut c_char pointer returned by a Rust FFI function must have a corresponding free_rust_string() call in the TypeScript code, executed exactly once per pointer, after all reads are complete.
If you see a pattern like:
const ptr = this.lib.symbols.some_function();
Ask yourself:
- Does this return a pointer to allocated memory? (Check Rust signature)
- Am I reading the data before freeing?
- Am I freeing exactly once?
- Am I never using ptr after freeing?
If the answer to all is “yes”, you’re following the pattern correctly.
References
- Rust FFI Guidelines: https://doc.rust-lang.org/nomicon/ffi.html
- Bun FFI Documentation: https://bun.sh/docs/api/ffi
- Node.js ffi-napi: https://github.com/node-ffi-napi/node-ffi-napi
- docs/worker-crates/patterns-and-practices.md: General worker patterns
Worker Crates: Common Patterns and Practices
Last Updated: 2026-01-06 Audience: Developers, Architects Status: Active Related Docs: Worker Event Systems | Worker Actors
<- Back to Worker Crates Overview
This document describes the common patterns and practices shared across all three worker implementations (Rust, Ruby, Python). Understanding these patterns helps developers write consistent handlers regardless of the language.
Table of Contents
- Architectural Patterns
- Handler Lifecycle
- Error Handling
- Polling Architecture
- Event Bridge Pattern
- Singleton Pattern
- Observability
- Checkpoint Yielding
Architectural Patterns
Dual-Channel Architecture
All workers implement a dual-channel architecture for non-blocking step execution:
┌─────────────────────────────────────────────────────────────────┐
│ DUAL-CHANNEL PATTERN │
└─────────────────────────────────────────────────────────────────┘
PostgreSQL PGMQ
│
▼
┌───────────────────┐
│ Dispatch Channel │ ──→ Step events flow TO handlers
└───────────────────┘
│
▼
┌───────────────────┐
│ Handler Execution │ ──→ Business logic runs here
└───────────────────┘
│
▼
┌───────────────────┐
│ Completion Channel │ ──→ Results flow BACK to orchestration
└───────────────────┘
│
▼
Orchestration
Benefits:
- Fire-and-forget dispatch (non-blocking)
- Bounded concurrency via semaphores
- Results processed independently from dispatch
- Consistent pattern across all languages
Language-Specific Implementations
| Component | Rust | Ruby | Python |
|---|---|---|---|
| Dispatch Channel | mpsc::channel | poll_step_events FFI | poll_step_events FFI |
| Completion Channel | mpsc::channel | complete_step_event FFI | complete_step_event FFI |
| Concurrency Model | Tokio async tasks | Ruby threads + FFI polling | Python threads + FFI polling |
| GIL Handling | N/A | Pull-based polling | Pull-based polling |
Handler Lifecycle
Handler Registration
All implementations follow the same registration pattern:
1. Define handler class/struct
2. Set handler_name identifier
3. Register with HandlerRegistry
4. Handler ready for resolution
Ruby Example:
class ProcessOrderHandler < TaskerCore::StepHandler::Base
def call(context)
# Access data via cross-language standard methods
order_id = context.get_task_field('order_id')
# Business logic here...
# Return result using base class helper (keyword args required)
success(result: { order_id: order_id, status: 'processed' })
end
end
# Registration
registry = TaskerCore::Registry::HandlerRegistry.instance
registry.register_handler('ProcessOrderHandler', ProcessOrderHandler)
Python Example:
from tasker_core import StepHandler, StepHandlerResult, HandlerRegistry
class ProcessOrderHandler(StepHandler):
handler_name = "process_order"
def call(self, context):
order_id = context.input_data.get("order_id")
return StepHandlerResult.success_handler_result(
{"order_id": order_id, "status": "processed"}
)
# Registration
registry = HandlerRegistry.instance()
registry.register("process_order", ProcessOrderHandler)
Handler Resolution Flow
1. Step event received with handler name
2. Registry.resolve(handler_name) called
3. Handler class instantiated
4. handler.call(context) invoked
5. Result returned to completion channel
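A simplified Python sketch of this flow (illustrative only - the real dispatch happens inside StepExecutionSubscriber, and the event attribute names here are assumptions):
from tasker_core import HandlerRegistry

registry = HandlerRegistry.instance()

def dispatch(step_event):
    # 1-2. Look up the handler class registered under the event's handler name
    handler_cls = registry.resolve(step_event.handler_name)
    # 3. Instantiate the handler
    handler = handler_cls()
    # 4. Invoke it with the step context
    result = handler.call(step_event.context)
    # 5. The result is then handed back to the completion channel (via FFI in Ruby/Python)
    return result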
Handler Context
All handlers receive a context object containing:
| Field | Description |
|---|---|
task_uuid | Unique identifier for the task |
step_uuid | Unique identifier for the step |
input_data | Task context data passed to the step |
dependency_results | Results from parent/dependency steps |
step_config | Configuration from step definition |
step_inputs | Runtime inputs from workflow_step.inputs |
retry_count | Current retry attempt number |
max_retries | Maximum retry attempts allowed |
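For example, a handler can combine retry_count and max_retries to give up cleanly near the end of its retry budget (a sketch; push_to_inventory is a placeholder for business logic):
from tasker_core import StepHandler

def push_to_inventory(reservation):
    # Placeholder for real business logic
    return bool(reservation)

class SyncInventoryHandler(StepHandler):
    handler_name = "sync_inventory"

    def call(self, context):
        upstream = context.dependency_results.get("reserve_stock", {})
        final_attempt = context.max_retries - context.retry_count <= 1
        try:
            synced = push_to_inventory(upstream)
            return self.success(result={"synced": synced})
        except Exception as error:
            # Treat the last attempt as permanent so the step fails cleanly
            return self.failure(
                message=str(error),
                error_type="inventory_sync_error",
                retryable=not final_attempt,
            )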
Handler Results
All handlers return a structured result indicating success or failure. However, the APIs differ between Ruby and Python - this is a known design inconsistency that may be addressed in a future ticket.
Ruby - Uses keyword arguments and separate Success/Error types:
# Via base handler shortcuts
success(result: { key: "value" }, metadata: { duration_ms: 150 })
failure(
message: "Something went wrong",
error_type: "PermanentError",
error_code: "VALIDATION_ERROR", # Ruby has error_code field
retryable: false,
metadata: { field: "email" }
)
# Or via type factory methods
TaskerCore::Types::StepHandlerCallResult.success(result: { key: "value" })
TaskerCore::Types::StepHandlerCallResult.error(
error_type: "PermanentError",
message: "Error message",
error_code: "ERR_001"
)
Python - Uses positional/keyword arguments and a single result type:
# Via base handler shortcuts
self.success(result={"key": "value"}, metadata={"duration_ms": 150})
self.failure(
message="Something went wrong",
error_type="ValidationError", # Python has error_type only (no error_code)
retryable=False,
metadata={"field": "email"}
)
# Or via class factory methods
StepHandlerResult.success_handler_result(
{"key": "value"}, # First positional arg is result
{"duration_ms": 150} # Second positional arg is metadata
)
StepHandlerResult.failure_handler_result(
message="Something went wrong",
error_type="ValidationError",
retryable=False,
metadata={"field": "email"}
)
Key Differences:
| Aspect | Ruby | Python |
|---|---|---|
| Factory method names | .success(), .error() | .success_handler_result(), .failure_handler_result() |
| Result type | Success / Error structs | Single StepHandlerResult class |
| Error code field | error_code (freeform) | Not present |
| Argument style | Keyword required (result:) | Positional allowed |
Error Handling
Error Classification
All workers classify errors into two categories:
| Type | Description | Behavior |
|---|---|---|
| Retryable | Transient errors that may succeed on retry | Step re-enqueued up to max_retries |
| Permanent | Unrecoverable errors | Step marked as failed immediately |
HTTP Status Code Classification (ApiHandler)
400, 401, 403, 404, 422 → Permanent Error (client errors)
429 → Retryable Error (rate limiting)
500-599 → Retryable Error (server errors)
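Handlers that make HTTP calls without the ApiHandler can mirror the same classification manually. A sketch using requests (the endpoint and handler names are hypothetical; the status buckets follow the mapping above):
import requests
from tasker_core import StepHandler

RETRYABLE_STATUSES = {429} | set(range(500, 600))

class FetchProfileHandler(StepHandler):
    handler_name = "fetch_profile"

    def call(self, context):
        user_id = context.input_data.get("user_id")
        response = requests.get(f"https://api.example.com/users/{user_id}", timeout=30)
        if response.ok:
            return self.success(result=response.json())
        return self.failure(
            message=f"HTTP {response.status_code}",
            error_type="http_error",
            # 429 and 5xx are transient; other 4xx client errors are permanent
            retryable=response.status_code in RETRYABLE_STATUSES,
        )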
Exception Hierarchy
Ruby:
TaskerCore::Error # Base class
├── TaskerCore::RetryableError # Transient failures
├── TaskerCore::PermanentError # Unrecoverable failures
├── TaskerCore::FFIError # FFI bridge errors
└── TaskerCore::ConfigurationError # Configuration issues
Python:
TaskerError # Base class
├── WorkerNotInitializedError # Worker not bootstrapped
├── WorkerBootstrapError # Bootstrap failed
├── WorkerAlreadyRunningError # Double initialization
├── FFIError # FFI bridge errors
├── ConversionError # Type conversion errors
└── StepExecutionError # Handler execution errors
Error Context Propagation
All errors should include context for debugging:
StepHandlerResult.failure_handler_result(
message="Payment gateway timeout",
error_type="gateway_timeout",
retryable=True,
metadata={
"gateway": "stripe",
"request_id": "req_xyz",
"response_time_ms": 30000
}
)
Polling Architecture
Why Polling?
Ruby and Python workers use a pull-based polling model due to language runtime constraints:
Ruby: The Global VM Lock (GVL) prevents Rust from safely calling Ruby methods from Rust threads. Polling allows Ruby to control thread context.
Python: The Global Interpreter Lock (GIL) has the same limitation. Python must initiate all cross-language calls.
Polling Characteristics
| Parameter | Default Value | Description |
|---|---|---|
| Poll Interval | 10ms | Time between polls when no events |
| Max Latency | ~10ms | Time from event generation to processing start |
| Starvation Check | Every 100 polls (1 second) | Detect processing bottlenecks |
| Cleanup Interval | Every 1000 polls (10 seconds) | Clean up timed-out events |
Poll Loop Structure
while running:
# 1. Poll for event
event = poll_step_events()
if event:
# 2. Process event through handler
process_event(event)
else:
# 3. Sleep when no events
time.sleep(0.01) # 10ms
# 4. Periodic maintenance
if poll_count % 100 == 0:
check_starvation_warnings()
if poll_count % 1000 == 0:
cleanup_timeouts()
FFI Contract
Ruby and Python share the same FFI contract:
| Function | Description |
|---|---|
poll_step_events() | Get next pending event (returns None if empty) |
complete_step_event(event_id, result) | Submit handler result |
get_ffi_dispatch_metrics() | Get dispatch channel metrics |
check_starvation_warnings() | Trigger starvation logging |
cleanup_timeouts() | Clean up timed-out events |
Event Bridge Pattern
Overview
All workers implement an EventBridge (pub/sub) pattern for internal coordination:
┌─────────────────────────────────────────────────────────────────┐
│ EVENT BRIDGE PATTERN │
└─────────────────────────────────────────────────────────────────┘
Publishers EventBridge Subscribers
───────── ─────────── ───────────
HandlerRegistry ──publish──→ ──notify──→ StepExecutionSubscriber
EventPoller ──publish──→ [Events] ──notify──→ MetricsCollector
Worker ──publish──→ ──notify──→ Custom Subscribers
Standard Event Names
| Event | Description | Payload |
|---|---|---|
handler_registered | Handler added to registry | (name, handler_class) |
step_execution_received | Step event received | FfiStepEvent |
step_execution_completed | Handler finished | StepHandlerResult |
worker_started | Worker bootstrap complete | worker_id |
worker_stopped | Worker shutdown | worker_id |
Implementation Libraries
| Language | Library | Pattern |
|---|---|---|
| Ruby | dry-events | Publisher/Subscriber |
| Python | pyee | EventEmitter |
| Rust | Native channels | mpsc |
Usage Example (Python)
from tasker_core import EventBridge, EventNames
bridge = EventBridge.instance()
# Subscribe to events
def on_step_received(event):
print(f"Processing step {event.step_uuid}")
bridge.subscribe(EventNames.STEP_EXECUTION_RECEIVED, on_step_received)
# Publish events
bridge.publish(EventNames.HANDLER_REGISTERED, "my_handler", MyHandler)
Singleton Pattern
Worker State Management
All workers store global state in a thread-safe singleton:
┌─────────────────────────────────────────────────────────────────┐
│ SINGLETON WORKER STATE │
└─────────────────────────────────────────────────────────────────┘
Thread-Safe Global
│
▼
┌──────────────────┐
│ WorkerSystem │
│ ┌────────────┐ │
│ │ Mutex/Lock │ │
│ │ Inner │ │
│ │ State │ │
│ └────────────┘ │
└──────────────────┘
│
├──→ HandlerRegistry
├──→ EventBridge
├──→ EventPoller
└──→ Configuration
Singleton Classes
| Language | Singleton Implementation |
|---|---|
| Rust | OnceLock<Mutex<WorkerSystem>> |
| Ruby | Singleton module |
| Python | Class-level _instance with instance() classmethod |
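The Python variant reduces to a class-level instance guarded by a lock. A generic sketch of the pattern (not the actual tasker_core implementation):
import threading

class WorkerSystem:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def instance(cls):
        # Double-checked locking keeps the accessor cheap after first use
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    @classmethod
    def reset_instance(cls):
        # Test-only hook, mirroring the reset methods described below
        with cls._lock:
            cls._instance = None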
Reset for Testing
All singletons provide reset methods for test isolation:
# Python
HandlerRegistry.reset_instance()
EventBridge.reset_instance()
# Ruby
TaskerCore::Registry::HandlerRegistry.reset_instance!
Observability
Health Checks
All workers expose health information via FFI:
from tasker_core import get_health_check
health = get_health_check()
# Returns: HealthCheck with component statuses
Metrics
Standard metrics available from all workers:
| Metric | Description |
|---|---|
pending_count | Events awaiting processing |
in_flight_count | Events currently being processed |
completed_count | Successfully completed events |
failed_count | Failed events |
starvation_detected | Whether events are timing out |
starving_event_count | Events exceeding timeout threshold |
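A sketch of reading these from Python, assuming get_ffi_dispatch_metrics is exported from tasker_core like get_health_check (attribute-style access is also an assumption):
from tasker_core import get_ffi_dispatch_metrics

metrics = get_ffi_dispatch_metrics()
if metrics.starvation_detected:
    # Surface dispatch backlog in the host application's own monitoring
    print(
        f"{metrics.starving_event_count} events starving; "
        f"{metrics.pending_count} pending, {metrics.in_flight_count} in flight"
    )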
Structured Logging
All workers use structured logging with consistent fields:
from tasker_core import log_info, LogContext
context = LogContext(
correlation_id="abc-123",
task_uuid="task-456",
operation="process_order"
)
log_info("Processing order", context)
Specialized Handlers
Handler Type Hierarchy
Ruby (all are subclasses):
TaskerCore::StepHandler::Base
├── TaskerCore::StepHandler::Api # HTTP/REST API integration
├── TaskerCore::StepHandler::Decision # Dynamic workflow decisions
└── TaskerCore::StepHandler::Batchable # Batch processing support
Python (Batchable is a mixin):
StepHandler (ABC)
├── ApiHandler # HTTP/REST API integration (subclass)
├── DecisionHandler # Dynamic workflow decisions (subclass)
└── + Batchable # Batch processing (mixin via multiple inheritance)
ApiHandler
For HTTP API integration with automatic error classification:
class FetchUserHandler(ApiHandler):
handler_name = "fetch_user"
def call(self, context):
response = self.get(f"/users/{context.input_data['user_id']}")
return self.success(result=response.json())
DecisionHandler
For dynamic workflow routing:
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
def call(context)
amount = context.get_task_field('amount')
if amount < 1000
decision_success(steps: ['auto_approve'], result_data: { route: 'auto' })
else
decision_success(steps: ['manager_approval', 'finance_review'])
end
end
end
Batchable
For processing large datasets in chunks. Note: Ruby uses subclass inheritance, Python uses mixin.
Ruby (subclass):
class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
def call(context)
batch_ctx = get_batch_context(context)
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Process records using batch_ctx.start_cursor, batch_ctx.end_cursor
batch_worker_complete(processed_count: batch_ctx.batch_size)
end
end
Python (mixin):
class CsvProcessorHandler(StepHandler, Batchable):
handler_name = "csv_processor"
def call(self, context: StepContext) -> StepHandlerResult:
batch_ctx = self.get_batch_context(context)
# Process records using batch_ctx.start_cursor, batch_ctx.end_cursor
return self.batch_worker_success(processed_count=batch_ctx.batch_size)
Checkpoint Yielding
Checkpoint yielding enables batch workers to persist progress and yield control back to the orchestrator for re-dispatch. This is essential for long-running batch operations.
When to Use
- Processing takes longer than visibility timeout
- You need resumable processing after failures
- Long-running operations need progress visibility
Cross-Language API
All Batchable handlers provide checkpoint_yield() (or checkpointYield() in TypeScript):
Ruby:
class MyBatchWorker < TaskerCore::StepHandler::Batchable
def call(context)
batch_ctx = get_batch_context(context)
# Resume from checkpoint if present
start = batch_ctx.has_checkpoint? ? batch_ctx.checkpoint_cursor : 0
items.each_with_index do |item, idx|
process_item(item)
# Checkpoint every 1000 items
if (idx + 1) % 1000 == 0
checkpoint_yield(
cursor: start + idx + 1,
items_processed: idx + 1,
accumulated_results: { partial: "data" }
)
end
end
batch_worker_complete(processed_count: items.size)
end
end
Python:
class MyBatchWorker(StepHandler, Batchable):
def call(self, context):
batch_ctx = self.get_batch_context(context)
# Resume from checkpoint if present
start = batch_ctx.checkpoint_cursor if batch_ctx.has_checkpoint() else 0
for idx, item in enumerate(items):
self.process_item(item)
# Checkpoint every 1000 items
if (idx + 1) % 1000 == 0:
self.checkpoint_yield(
cursor=start + idx + 1,
items_processed=idx + 1,
accumulated_results={"partial": "data"}
)
return self.batch_worker_success(processed_count=len(items))
TypeScript:
class MyBatchWorker extends BatchableHandler {
async call(context: StepContext): Promise<StepHandlerResult> {
const batchCtx = this.getBatchContext(context);
// Resume from checkpoint if present
const start = batchCtx.hasCheckpoint() ? batchCtx.checkpointCursor : 0;
for (let idx = 0; idx < items.length; idx++) {
await this.processItem(items[idx]);
// Checkpoint every 1000 items
if ((idx + 1) % 1000 === 0) {
await this.checkpointYield({
cursor: start + idx + 1,
itemsProcessed: idx + 1,
accumulatedResults: { partial: "data" }
});
}
}
return this.batchWorkerSuccess({ processedCount: items.length });
}
}
BatchWorkerContext Checkpoint Accessors
All languages provide consistent accessors for checkpoint data:
| Accessor | Ruby | Python | TypeScript |
|---|---|---|---|
| Cursor position | checkpoint_cursor | checkpoint_cursor | checkpointCursor |
| Accumulated data | accumulated_results | accumulated_results | accumulatedResults |
| Has checkpoint? | has_checkpoint? | has_checkpoint() | hasCheckpoint() |
| Items processed | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed |
FFI Contract
| Function | Description |
|---|---|
checkpoint_yield_step_event(event_id, data) | Persist checkpoint and re-dispatch step |
Key Invariants
- Checkpoint-Persist-Then-Redispatch: Progress saved before re-dispatch
- Step Stays InProgress: No state machine transitions during yield
- Handler-Driven: Handlers decide when to checkpoint
See Batch Processing Guide - Checkpoint Yielding for comprehensive documentation.
Best Practices
1. Keep Handlers Focused
Each handler should do one thing well:
- Validate input
- Perform single operation
- Return clear result
2. Use Error Classification
Always specify whether errors are retryable:
# Good - clear error classification
return self.failure("API rate limit", retryable=True)
# Bad - ambiguous error handling
raise Exception("API error")
3. Include Context in Errors
return StepHandlerResult.failure_handler_result(
message="Database connection failed",
error_type="database_error",
retryable=True,
metadata={
"host": "db.example.com",
"port": 5432,
"connection_timeout_ms": 5000
}
)
4. Use Structured Logging
log_info("Order processed", {
"order_id": order_id,
"total": total,
"items_count": len(items)
})
5. Test Handler Isolation
Reset singletons between tests:
def setup_method(self):
HandlerRegistry.reset_instance()
EventBridge.reset_instance()
See Also
- Worker Crates Overview - High-level introduction
- Rust Worker - Native Rust implementation
- Ruby Worker - Ruby gem documentation
- Python Worker - Python package documentation
- Worker Event Systems - Detailed architecture
- Worker Actors - Actor pattern documentation
Python Worker
Last Updated: 2026-01-01
Audience: Python Developers
Status: Active
Package: tasker_core
Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix
<- Back to Worker Crates Overview
The Python worker provides a package-based interface for integrating tasker-core workflow execution into Python applications. It supports both standalone server deployment and headless embedding in existing codebases.
Quick Start
Installation
cd workers/python
uv sync # Install dependencies
uv run maturin develop # Build FFI extension
Running the Server
python bin/server.py
Environment Variables
| Variable | Description | Default |
|---|---|---|
DATABASE_URL | PostgreSQL connection string | Required |
TASKER_ENV | Environment (test/development/production) | development |
TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
PYTHON_HANDLER_PATH | Path for handler auto-discovery | Not set |
RUST_LOG | Log level (trace/debug/info/warn/error) | info |
Architecture
Server Mode
Location: workers/python/bin/server.py
The server bootstraps the Rust foundation and manages Python handler execution:
from tasker_core import (
bootstrap_worker,
EventBridge,
EventPoller,
HandlerRegistry,
StepExecutionSubscriber,
)
# Bootstrap Rust worker foundation
result = bootstrap_worker(config)
# Start event dispatch system
event_bridge = EventBridge.instance()
event_bridge.start()
# Create step execution subscriber
handler_registry = HandlerRegistry.instance()
step_subscriber = StepExecutionSubscriber(
event_bridge=event_bridge,
handler_registry=handler_registry,
worker_id="python-worker-001"
)
step_subscriber.start()
# Start event poller (10ms polling)
event_poller = EventPoller(polling_interval_ms=10)
event_poller.on_step_event(lambda e: event_bridge.publish("step_execution_received", e))
event_poller.start()
# Wait for shutdown signal
shutdown_event.wait()
# Graceful shutdown
event_poller.stop()
step_subscriber.stop()
event_bridge.stop()
stop_worker()
Headless/Embedded Mode
For embedding in existing Python applications:
from tasker_core import (
bootstrap_worker,
HandlerRegistry,
EventBridge,
EventPoller,
StepExecutionSubscriber,
)
from tasker_core.types import BootstrapConfig
# Bootstrap worker (headless mode controlled via TOML: web.enabled = false)
config = BootstrapConfig(namespace="my-app")
bootstrap_worker(config)
# Register handlers
registry = HandlerRegistry.instance()
registry.register("process_data", ProcessDataHandler)
# Start event dispatch (required for embedded usage)
bridge = EventBridge.instance()
bridge.start()
subscriber = StepExecutionSubscriber(bridge, registry, "embedded-worker")
subscriber.start()
poller = EventPoller()
poller.on_step_event(lambda e: bridge.publish("step_execution_received", e))
poller.start()
FFI Bridge
Python communicates with the Rust foundation via FFI polling:
┌────────────────────────────────────────────────────────────────┐
│ PYTHON FFI BRIDGE │
└────────────────────────────────────────────────────────────────┘
Rust Worker System
│
│ FFI (poll_step_events)
▼
┌─────────────────────┐
│ EventPoller │
│ (daemon thread) │──→ poll every 10ms
└─────────────────────┘
│
│ publish to EventBridge
▼
┌─────────────────────┐
│ StepExecution │
│ Subscriber │──→ route to handler
└─────────────────────┘
│
│ handler.call(context)
▼
┌─────────────────────┐
│ Handler Execution │
└─────────────────────┘
│
│ FFI (complete_step_event)
▼
Rust Completion Channel
Handler Development
Base Handler (ABC)
Location: python/tasker_core/step_handler/base.py
All handlers inherit from StepHandler:
from tasker_core import StepHandler, StepContext, StepHandlerResult
class ProcessOrderHandler(StepHandler):
handler_name = "process_order"
handler_version = "1.0.0"
def call(self, context: StepContext) -> StepHandlerResult:
# Access input data
order_id = context.input_data.get("order_id")
amount = context.input_data.get("amount")
# Business logic
result = self.process_order(order_id, amount)
# Return success
return self.success(result={
"order_id": order_id,
"status": "processed",
"total": result["total"]
})
Handler Signature
def call(self, context: StepContext) -> StepHandlerResult:
# context.task_uuid - Task identifier
# context.step_uuid - Step identifier
# context.input_data - Task context data
# context.dependency_results - Results from parent steps
# context.step_config - Handler configuration
# context.step_inputs - Runtime inputs
# context.retry_count - Current retry attempt
# context.max_retries - Maximum retry attempts
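The retry metadata is useful when deciding how to fail. A small sketch (the charge_card call is a hypothetical stand-in for your business logic) that stops asking for retries once the final attempt has been used:
def call(self, context: StepContext) -> StepHandlerResult:
    try:
        result = self.charge_card(context.input_data)  # hypothetical business call
    except TimeoutError as exc:
        # Stop retrying once the last configured attempt has been used
        last_attempt = context.retry_count >= context.max_retries
        return self.failure(
            message=str(exc),
            error_type="timeout_error",
            retryable=not last_attempt,
        )
    return self.success(result=result)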
Result Methods
# Success result (from base class)
return self.success(
result={"key": "value"},
metadata={"duration_ms": 100}
)
# Failure result (from base class)
return self.failure(
message="Payment declined",
error_type="payment_error",
retryable=True,
metadata={"card_last_four": "1234"}
)
# Or using factory methods
from tasker_core import StepHandlerResult
return StepHandlerResult.success_handler_result(
{"key": "value"},
{"duration_ms": 100}
)
return StepHandlerResult.failure_handler_result(
message="Error",
error_type="validation_error",
retryable=False
)
Accessing Dependencies
def call(self, context: StepContext) -> StepHandlerResult:
# Get result from a dependency step
validation = context.dependency_results.get("validate_order", {})
if validation.get("valid"):
# Process with validated data
return self.success(result={"processed": True})
return self.failure("Validation failed", retryable=False)
Composition Pattern
Python handlers use composition via mixins (multiple inheritance) rather than deep, specialized inheritance hierarchies.
Using Mixins (Recommended for New Code)
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin, DecisionMixin
class MyHandler(StepHandler, APIMixin, DecisionMixin):
handler_name = "my_handler"
def call(self, context: StepContext) -> StepHandlerResult:
# Has both API methods (get, post, put, delete)
# And Decision methods (decision_success, skip_branches)
response = self.get("/api/endpoint")
return self.decision_success(["next_step"], response)
Available Mixins
| Mixin | Location | Methods Provided |
|---|---|---|
| APIMixin | mixins/api.py | get, post, put, delete, http_client |
| DecisionMixin | mixins/decision.py | decision_success, skip_branches, decision_failure |
| BatchableMixin | (base class) | get_batch_context, handle_no_op_worker, create_cursor_configs |
Using Wrapper Classes (Backward Compatible)
The wrapper classes delegate to mixins internally:
# These are equivalent:
class MyHandler(ApiHandler):
# Inherits API methods via APIMixin internally
pass
class MyHandler(StepHandler, APIMixin):
# Explicit mixin inclusion
pass
Specialized Handlers
API Handler
Location: python/tasker_core/step_handler/api.py
For HTTP API integration with automatic error classification:
from tasker_core.step_handler import ApiHandler
class FetchUserHandler(ApiHandler):
handler_name = "fetch_user"
base_url = "https://api.example.com"
def call(self, context: StepContext) -> StepHandlerResult:
user_id = context.input_data["user_id"]
# Automatic error classification
response = self.get(f"/users/{user_id}")
return self.api_success(response)
HTTP Methods:
# GET request
response = self.get("/path", params={"key": "value"}, headers={})
# POST request
response = self.post("/path", data={"key": "value"}, headers={})
# PUT request
response = self.put("/path", data={"key": "value"}, headers={})
# DELETE request
response = self.delete("/path", params={}, headers={})
ApiResponse Properties:
response.status_code # HTTP status code
response.headers # Response headers
response.body # Parsed body (dict or str)
response.ok # True if 2xx status
response.is_client_error # True if 4xx status
response.is_server_error # True if 5xx status
response.is_retryable # True if should retry (408, 429, 500-504)
response.retry_after # Retry-After header value in seconds
Error Classification:
| Status | Classification | Behavior |
|---|---|---|
| 400, 401, 403, 404, 422 | Non-retryable | Permanent failure |
| 408, 429, 500-504 | Retryable | Standard retry |
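A handler can let this classification drive the result it returns when a call does not succeed. A minimal sketch (the handler name and endpoint are illustrative):
class SyncInventoryHandler(ApiHandler):
    handler_name = "sync_inventory"  # illustrative name
    base_url = "https://api.example.com"

    def call(self, context: StepContext) -> StepHandlerResult:
        response = self.get("/inventory", params={"sku": context.input_data["sku"]})
        if response.ok:
            return self.api_success(response)
        # Let the HTTP classification decide whether the step is retried
        return self.failure(
            message=f"Inventory API returned {response.status_code}",
            error_type="api_error",
            retryable=response.is_retryable,
            metadata={"retry_after": response.retry_after},
        )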
Decision Handler
Location: python/tasker_core/step_handler/decision.py
For dynamic workflow routing:
from tasker_core.step_handler import DecisionHandler
from tasker_core import DecisionPointOutcome
class RoutingDecisionHandler(DecisionHandler):
handler_name = "routing_decision"
def call(self, context: StepContext) -> StepHandlerResult:
amount = context.input_data.get("amount", 0)
if amount < 1000:
# Auto-approve small amounts
outcome = DecisionPointOutcome.create_steps(
["auto_approve"],
routing_context={"route_type": "auto"}
)
return self.decision_success(outcome)
elif amount < 5000:
# Manager approval for medium amounts
outcome = DecisionPointOutcome.create_steps(
["manager_approval"],
routing_context={"route_type": "manager"}
)
return self.decision_success(outcome)
else:
# Dual approval for large amounts
outcome = DecisionPointOutcome.create_steps(
["manager_approval", "finance_review"],
routing_context={"route_type": "dual"}
)
return self.decision_success(outcome)
Decision Methods:
# Create steps
outcome = DecisionPointOutcome.create_steps(
step_names=["step1", "step2"],
routing_context={"key": "value"}
)
return self.decision_success(outcome)
# No branches needed
outcome = DecisionPointOutcome.no_branches(reason="condition not met")
return self.decision_no_branches(outcome)
Batchable Mixin
Location: python/tasker_core/batch_processing/
For processing large datasets in chunks. Both analyzer and worker handlers implement the standard call() method:
Analyzer Handler (creates batch configurations):
from tasker_core import StepHandler, StepHandlerResult
from tasker_core.batch_processing import Batchable
class CsvAnalyzerHandler(StepHandler, Batchable):
handler_name = "csv_analyzer"
def call(self, context: StepContext) -> StepHandlerResult:
"""Analyze CSV and create batch worker configurations."""
csv_path = context.input_data["csv_path"]
row_count = count_csv_rows(csv_path)
if row_count == 0:
# No data to process
return self.batch_analyzer_success(
cursor_configs=[],
total_items=0,
batch_metadata={"csv_path": csv_path}
)
# Create cursor ranges for batch workers
cursor_configs = self.create_cursor_ranges(
total_items=row_count,
batch_size=100,
max_batches=5
)
return self.batch_analyzer_success(
cursor_configs=cursor_configs,
total_items=row_count,
worker_template_name="process_csv_batch",
batch_metadata={"csv_path": csv_path}
)
Worker Handler (processes a batch):
class CsvBatchProcessorHandler(StepHandler, Batchable):
handler_name = "csv_batch_processor"
def call(self, context: StepContext) -> StepHandlerResult:
"""Process a batch of CSV rows."""
# Get cursor config from step_inputs
step_inputs = context.step_inputs or {}
# Check for no-op placeholder batch
if step_inputs.get("is_no_op"):
return self.batch_worker_success(
items_processed=0,
items_succeeded=0,
metadata={"no_op": True}
)
cursor = step_inputs.get("cursor", {})
start_cursor = cursor.get("start_cursor", 0)
end_cursor = cursor.get("end_cursor", 0)
# Get CSV path from analyzer result
analyzer_result = context.get_dependency_result("analyze_csv")
csv_path = analyzer_result["batch_metadata"]["csv_path"]
# Process the batch
results = process_csv_batch(csv_path, start_cursor, end_cursor)
return self.batch_worker_success(
items_processed=results["count"],
items_succeeded=results["success"],
items_failed=results["failed"],
results=results["data"],
last_cursor=end_cursor
)
Batchable Helper Methods:
# Analyzer helpers
self.create_cursor_ranges(total_items, batch_size, max_batches)
self.batch_analyzer_success(cursor_configs, total_items, worker_template_name, batch_metadata)
# Worker helpers
self.batch_worker_success(items_processed, items_succeeded, items_failed, results, last_cursor, metadata)
self.get_batch_context(context) # Returns BatchWorkerContext or None
# Aggregator helpers
self.aggregate_worker_results(worker_results) # Returns aggregated counts
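The aggregation phase is not shown above. A rough sketch of an aggregator handler built on aggregate_worker_results; how the per-batch worker results are collected from the context is left as a hypothetical helper here:
def gather_worker_results(context: StepContext) -> list[dict]:
    """Hypothetical helper: collect the batch_worker_success payloads
    produced by the upstream batch worker steps."""
    raise NotImplementedError

class CsvAggregatorHandler(StepHandler, Batchable):
    handler_name = "csv_aggregator"  # illustrative name

    def call(self, context: StepContext) -> StepHandlerResult:
        worker_results = gather_worker_results(context)
        totals = self.aggregate_worker_results(worker_results)  # aggregated counts
        return self.success(result=totals)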
Handler Registry
Registration
Location: python/tasker_core/handler.py
from tasker_core import HandlerRegistry
registry = HandlerRegistry.instance()
# Manual registration
registry.register("process_order", ProcessOrderHandler)
# Check if registered
registry.is_registered("process_order") # True
# Resolve and instantiate
handler = registry.resolve("process_order")
result = handler.call(context)
# List all handlers
registry.list_handlers() # ["process_order", ...]
# Handler count
registry.handler_count() # 1
Auto-Discovery
# Discover handlers from a package
count = registry.discover_handlers("myapp.handlers")
print(f"Discovered {count} handlers")
Handlers are discovered by:
- Scanning the package for classes inheriting from StepHandler
- Using the handler_name class attribute for registration
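For example, a module laid out like this (package and class names are illustrative) is picked up by discover_handlers("myapp.handlers") and registered under its handler_name:
# myapp/handlers/notify_customer.py
from tasker_core import StepHandler, StepContext, StepHandlerResult

class NotifyCustomerHandler(StepHandler):
    handler_name = "notify_customer"
    handler_version = "1.0.0"

    def call(self, context: StepContext) -> StepHandlerResult:
        return self.success(result={"notified": True})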
Type System
Pydantic Models
Python types use Pydantic for validation:
from tasker_core import StepContext, StepHandlerResult, FfiStepEvent
# StepContext - validated from FFI event
context = StepContext.from_ffi_event(event, "handler_name")
context.task_uuid # UUID
context.step_uuid # UUID
context.input_data # dict
context.retry_count # int
# StepHandlerResult - structured result
result = StepHandlerResult.success_handler_result({"key": "value"})
result.success # True
result.result # {"key": "value"}
result.error_message # None
Configuration Types
from tasker_core.types import BootstrapConfig, CursorConfig
# Bootstrap configuration
# Note: Headless mode is controlled via TOML config (web.enabled = false)
config = BootstrapConfig(
namespace="my-app",
log_level="info"
)
# Cursor configuration for batch processing
cursor = CursorConfig(
batch_size=100,
start_cursor=0,
end_cursor=1000
)
Event System
EventBridge
Location: python/tasker_core/event_bridge.py
from tasker_core import EventBridge, EventNames
bridge = EventBridge.instance()
# Start the event system
bridge.start()
# Subscribe to events
def on_step_received(event):
print(f"Processing step: {event.step_uuid}")
bridge.subscribe(EventNames.STEP_EXECUTION_RECEIVED, on_step_received)
# Publish events
bridge.publish(EventNames.HANDLER_REGISTERED, "my_handler", MyHandler)
# Stop when done
bridge.stop()
Event Names
from tasker_core import EventNames
EventNames.STEP_EXECUTION_RECEIVED # Step event received from FFI
EventNames.STEP_COMPLETION_SENT # Handler result sent to FFI
EventNames.HANDLER_REGISTERED # Handler registered
EventNames.HANDLER_ERROR # Handler execution error
EventNames.POLLER_METRICS # FFI dispatch metrics update
EventNames.POLLER_ERROR # Poller encountered an error
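These names can also back lightweight monitoring. A small sketch that logs handler failures and dispatch metrics (the event payload shapes are an assumption; adapt the callbacks to the actual event objects):
from tasker_core import EventBridge, EventNames, log_error, log_info

bridge = EventBridge.instance()

def on_handler_error(event):
    log_error("Handler error observed", {"event": repr(event)})

def on_poller_metrics(event):
    log_info("Poller metrics", {"event": repr(event)})

bridge.subscribe(EventNames.HANDLER_ERROR, on_handler_error)
bridge.subscribe(EventNames.POLLER_METRICS, on_poller_metrics)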
EventPoller
Location: python/tasker_core/event_poller.py
from tasker_core import EventPoller
poller = EventPoller(
polling_interval_ms=10, # Poll every 10ms
    starvation_check_interval=100,   # Starvation check every 100 polls (~1 second at 10ms)
    cleanup_interval=1000            # Cleanup every 1000 polls (~10 seconds at 10ms)
)
# Register callbacks
poller.on_step_event(handle_step)
poller.on_metrics(handle_metrics)
poller.on_error(handle_error)
# Start polling (daemon thread)
poller.start()
# Get metrics
metrics = poller.get_metrics()
print(f"Pending: {metrics.pending_count}")
# Stop polling
poller.stop(timeout=5.0)
Domain Events
Python has full domain event support with lifecycle hooks matching Ruby and TypeScript capabilities.
Location: python/tasker_core/domain_events.py
BasePublisher
Publishers transform step execution context into domain-specific events:
from tasker_core.domain_events import BasePublisher, StepEventContext, DomainEvent
class PaymentEventPublisher(BasePublisher):
publisher_name = "payment_events"
def publishes_for(self) -> list[str]:
"""Which steps trigger this publisher."""
return ["process_payment", "refund_payment"]
async def transform_payload(self, ctx: StepEventContext) -> dict:
"""Transform step context into domain event payload."""
return {
"payment_id": ctx.result.get("payment_id"),
"amount": ctx.result.get("amount"),
"currency": ctx.result.get("currency"),
"status": ctx.result.get("status")
}
# Lifecycle hooks (optional)
async def before_publish(self, ctx: StepEventContext) -> None:
"""Called before publishing."""
print(f"Publishing payment event for step: {ctx.step_name}")
async def after_publish(self, ctx: StepEventContext, event: DomainEvent) -> None:
"""Called after successful publish."""
print(f"Published event: {event.event_name}")
async def on_publish_error(self, ctx: StepEventContext, error: Exception) -> None:
"""Called on publish failure."""
print(f"Failed to publish: {error}")
async def additional_metadata(self, ctx: StepEventContext) -> dict:
"""Inject custom metadata."""
return {"payment_processor": "stripe"}
BaseSubscriber
Subscribers react to domain events matching specific patterns:
from tasker_core.domain_events import BaseSubscriber, InProcessDomainEvent, SubscriberResult
class AuditLoggingSubscriber(BaseSubscriber):
subscriber_name = "audit_logger"
def subscribes_to(self) -> list[str]:
"""Which events to handle (glob patterns supported)."""
return ["payment.*", "order.completed"]
async def handle(self, event: InProcessDomainEvent) -> SubscriberResult:
"""Handle matching events."""
await self.log_to_audit_trail(event)
return SubscriberResult(success=True)
# Lifecycle hooks (optional)
async def before_handle(self, event: InProcessDomainEvent) -> None:
"""Called before handling."""
print(f"Handling: {event.event_name}")
async def after_handle(self, event: InProcessDomainEvent, result: SubscriberResult) -> None:
"""Called after handling."""
print(f"Handled successfully: {result.success}")
async def on_handle_error(self, event: InProcessDomainEvent, error: Exception) -> None:
"""Called on handler failure."""
print(f"Handler error: {error}")
Registries
Manage publishers and subscribers with singleton registries:
from tasker_core.domain_events import PublisherRegistry, SubscriberRegistry
# Publisher Registry
pub_registry = PublisherRegistry.instance()
pub_registry.register(PaymentEventPublisher)
pub_registry.register(OrderEventPublisher)
# Get publisher for a step
publisher = pub_registry.get_for_step("process_payment")
# Subscriber Registry
sub_registry = SubscriberRegistry.instance()
sub_registry.register(AuditLoggingSubscriber)
sub_registry.register(MetricsSubscriber)
# Start all subscribers
sub_registry.start_all()
# Stop all subscribers
sub_registry.stop_all()
Signal Handling
The Python worker handles signals for graceful shutdown:
| Signal | Behavior |
|---|---|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |
| SIGUSR1 | Report worker status |
import signal
def handle_shutdown(signum, frame):
print("Shutting down...")
shutdown_event.set()
signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)
Error Handling
Exception Classes
from tasker_core import (
TaskerError, # Base class
WorkerNotInitializedError,
WorkerBootstrapError,
WorkerAlreadyRunningError,
FFIError,
ConversionError,
StepExecutionError,
)
Using StepExecutionError
from tasker_core import StepExecutionError
def call(self, context):
# Retryable error
raise StepExecutionError(
"Database connection timeout",
error_type="database_error",
retryable=True
)
# Non-retryable error
raise StepExecutionError(
"Invalid input format",
error_type="validation_error",
retryable=False
)
Logging
Structured Logging
from tasker_core import log_info, log_error, log_warn, log_debug, LogContext
# Simple logging
log_info("Processing started")
log_error("Failed to connect")
# With context dict
log_info("Order processed", {
"order_id": "123",
"amount": "100.00"
})
# With LogContext model
context = LogContext(
correlation_id="abc-123",
task_uuid="task-456",
operation="process_order"
)
log_info("Processing", context)
File Structure
workers/python/
├── bin/
│ └── server.py # Production server
├── python/
│ └── tasker_core/
│ ├── __init__.py # Package exports
│ ├── handler.py # Handler registry
│ ├── event_bridge.py # Event pub/sub
│ ├── event_poller.py # FFI polling
│ ├── logging.py # Structured logging
│ ├── types.py # Pydantic models
│ ├── step_handler/
│ │ ├── __init__.py
│ │ ├── base.py # Base handler ABC
│ │ ├── api.py # API handler
│ │ └── decision.py # Decision handler
│ ├── batch_processing/
│ │ └── __init__.py # Batchable mixin
│ └── step_execution_subscriber.py
├── src/ # Rust/PyO3 extension
├── tests/
│ ├── test_step_handler.py
│ ├── test_module_exports.py
│ └── handlers/examples/
├── pyproject.toml
└── uv.lock
Testing
Unit Tests
cd workers/python
uv run pytest tests/
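A minimal registry-level test, combining the singleton reset shown in Patterns and Practices with the registration API (the import path for ProcessOrderHandler is illustrative):
from tasker_core import HandlerRegistry
from myapp.handlers import ProcessOrderHandler  # illustrative import path

class TestHandlerRegistration:
    def setup_method(self):
        # Reset the singleton so registrations don't leak between tests
        HandlerRegistry.reset_instance()

    def test_register_and_resolve(self):
        registry = HandlerRegistry.instance()
        registry.register("process_order", ProcessOrderHandler)
        assert registry.is_registered("process_order")
        assert isinstance(registry.resolve("process_order"), ProcessOrderHandler)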
With Coverage
uv run pytest tests/ --cov=tasker_core
Type Checking
uv run mypy python/tasker_core/
Linting
uv run ruff check python/
Example Handlers
Linear Workflow
from datetime import datetime

from tasker_core import StepHandler, StepContext, StepHandlerResult

class LinearStep1Handler(StepHandler):
handler_name = "linear_step_1"
def call(self, context: StepContext) -> StepHandlerResult:
return self.success(result={
"step1_processed": True,
"input_received": context.input_data,
"processed_at": datetime.now().isoformat()
})
Data Processing
class TransformDataHandler(StepHandler):
handler_name = "transform_data"
def call(self, context: StepContext) -> StepHandlerResult:
# Get raw data from dependency
raw_data = context.dependency_results.get("fetch_data", {})
# Transform
transformed = [
{"id": item["id"], "value": item["raw_value"] * 2}
for item in raw_data.get("items", [])
]
return self.success(result={
"items": transformed,
"count": len(transformed)
})
Conditional Approval
class ApprovalRouterHandler(DecisionHandler):
handler_name = "approval_router"
THRESHOLDS = {
"auto": 1000,
"manager": 5000
}
def call(self, context: StepContext) -> StepHandlerResult:
amount = context.input_data.get("amount", 0)
if amount < self.THRESHOLDS["auto"]:
outcome = DecisionPointOutcome.create_steps(["auto_approve"])
elif amount < self.THRESHOLDS["manager"]:
outcome = DecisionPointOutcome.create_steps(["manager_approval"])
else:
outcome = DecisionPointOutcome.create_steps(
["manager_approval", "finance_review"]
)
return self.decision_success(outcome)
See Also
- Worker Crates Overview - High-level introduction
- Patterns and Practices - Common patterns
- Ruby Worker - Ruby implementation
- Worker Event Systems - Architecture details
Ruby Worker
Last Updated: 2026-01-01
Audience: Ruby Developers
Status: Active
Package: tasker_core (gem)
Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix
The Ruby worker provides a gem-based interface for integrating tasker-core workflow execution into Ruby applications. It supports both standalone server deployment and headless embedding in Rails applications.
Quick Start
Installation
cd workers/ruby
bundle install
bundle exec rake compile # Compile FFI extension
Running the Server
./bin/server.rb
Environment Variables
| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| RUBY_GC_HEAP_GROWTH_FACTOR | GC tuning for production | Ruby default |
Architecture
Server Mode
Location: workers/ruby/bin/server.rb
The server bootstraps the Rust foundation and manages Ruby handler execution:
# Bootstrap the worker system
bootstrap = TaskerCore::Worker::Bootstrap.start!
# Signal handlers for graceful shutdown
Signal.trap('TERM') { shutdown_event.set }
Signal.trap('INT') { shutdown_event.set }
# Main loop with health checks
loop do
break if shutdown_event.set?
sleep(1)
end
# Graceful shutdown
bootstrap.shutdown!
Headless/Embedded Mode
For embedding in Rails applications without an HTTP server:
# config/initializers/tasker.rb
require 'tasker_core'
# Bootstrap worker (headless mode controlled via TOML: web.enabled = false)
TaskerCore::Worker::Bootstrap.start!
# Register application handlers
TaskerCore::Registry::HandlerRegistry.instance.register_handler(
'ProcessOrderHandler',
ProcessOrderHandler
)
FFI Bridge
Ruby communicates with the Rust foundation via FFI polling:
┌────────────────────────────────────────────────────────────────┐
│ RUBY FFI BRIDGE │
└────────────────────────────────────────────────────────────────┘
Rust Worker System
│
│ FFI (poll_step_events)
▼
┌─────────────┐
│ Ruby │
│ Thread │──→ poll every 10ms
└─────────────┘
│
▼
┌─────────────┐
│ Handler │
│ Execution │──→ handler.call(context)
└─────────────┘
│
│ FFI (complete_step_event)
▼
Rust Completion Channel
Handler Development
Base Handler
Location: lib/tasker_core/step_handler/base.rb
All handlers inherit from TaskerCore::StepHandler::Base:
class ProcessOrderHandler < TaskerCore::StepHandler::Base
def call(context)
# Access task context via cross-language standard methods
order_id = context.get_task_field('order_id')
amount = context.get_task_field('amount')
# Business logic
result = process_order(order_id, amount)
# Return success result
success(result: {
order_id: order_id,
status: 'processed',
total: result[:total]
})
end
end
Handler Signature
def call(context)
# context - StepContext with cross-language standard fields:
# context.task_uuid - Task UUID
# context.step_uuid - Step UUID
# context.input_data - Step inputs from workflow_step.inputs
# context.step_config - Handler config from step_definition
# context.retry_count - Current retry attempt
# context.max_retries - Maximum retry attempts
# context.get_task_field('field') - Get field from task context
# context.get_dependency_result('step') - Get result from parent step
end
Result Methods
# Success result (keyword arguments required)
success(
result: { key: 'value' },
metadata: { duration_ms: 100 }
)
# Failure result
# error_type must be one of: 'PermanentError', 'RetryableError',
# 'ValidationError', 'UnexpectedError', 'StepCompletionError'
failure(
message: 'Payment declined',
error_type: 'PermanentError', # Use enum value, not freeform string
error_code: 'PAYMENT_DECLINED', # Optional freeform error code
retryable: false,
metadata: { card_last_four: '1234' }
)
Accessing Dependencies
def call(context)
# Get result from a dependency step
validation_result = context.get_dependency_result('validate_order')
if validation_result && validation_result['valid']
# Process with validated data
end
end
Composition Pattern
Ruby handlers use composition via mixins rather than inheritance. You can use either:
- Wrapper classes (Api, Decision, Batchable) - simpler, backward compatible
- Mixin modules (Mixins::API, Mixins::Decision, Mixins::Batchable) - explicit composition
Using Mixins (Recommended for New Code)
class MyHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::Mixins::API
include TaskerCore::StepHandler::Mixins::Decision
def call(context)
# Has both API methods (get, post, put, delete)
# And Decision methods (decision_success, decision_no_branches)
response = get('/api/endpoint')
decision_success(steps: ['next_step'], result_data: response)
end
end
Available Mixins
| Mixin | Location | Methods Provided |
|---|---|---|
| Mixins::API | mixins/api.rb | get, post, put, delete, connection |
| Mixins::Decision | mixins/decision.rb | decision_success, decision_no_branches, skip_branches |
| Mixins::Batchable | mixins/batchable.rb | get_batch_context, handle_no_op_worker, create_cursor_configs |
Using Wrapper Classes (Backward Compatible)
The wrapper classes delegate to mixins internally:
# These are equivalent:
class MyHandler < TaskerCore::StepHandler::Api
# Inherits API methods via Mixins::API
end
class MyHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::Mixins::API
# Explicit mixin inclusion
end
Specialized Handlers
API Handler
Location: lib/tasker_core/step_handler/api.rb
For HTTP API integration with automatic error classification:
class FetchUserHandler < TaskerCore::StepHandler::Api
def call(context)
user_id = context.get_task_field('user_id')
# Automatic error classification (429 → retryable, 404 → permanent)
response = connection.get("/users/#{user_id}")
process_response(response) # Raises on errors, returns response on success
# Return success result with response data
success(result: response.body)
end
# Optional: Custom connection configuration
def configure_connection
Faraday.new(base_url) do |conn|
conn.request :json
conn.response :json
conn.options.timeout = 30
end
end
end
HTTP Methods Available:
- get(path, params: {}, headers: {})
- post(path, data: {}, headers: {})
- put(path, data: {}, headers: {})
- delete(path, params: {}, headers: {})
Error Classification:
| Status | Classification | Behavior |
|---|---|---|
| 400, 401, 403, 404, 422 | Permanent | No retry |
| 429 | Retryable | Respect Retry-After |
| 500-599 | Retryable | Standard backoff |
Decision Handler
Location: lib/tasker_core/step_handler/decision.rb
For dynamic workflow routing:
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
def call(context)
amount = context.get_task_field('amount')
if amount < 1000
# Auto-approve small amounts
decision_success(
steps: ['auto_approve'],
result_data: { route_type: 'auto', amount: amount }
)
elsif amount < 5000
# Manager approval for medium amounts
decision_success(
steps: ['manager_approval'],
result_data: { route_type: 'manager', amount: amount }
)
else
# Dual approval for large amounts
decision_success(
steps: ['manager_approval', 'finance_review'],
result_data: { route_type: 'dual', amount: amount }
)
end
end
end
Decision Methods:
- decision_success(steps:, result_data: {}) - Create steps dynamically
- decision_no_branches(result_data: {}) - Skip conditional steps
Batchable Handler
Location: lib/tasker_core/step_handler/batchable.rb
For processing large datasets in chunks:
Breaking Change: Cursors are now 0-indexed (previously 1-indexed) to match Python, TypeScript, and Rust.
class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
def call(context)
# Extract batch context from step inputs
batch_ctx = get_batch_context(context)
# Handle no-op placeholder batches
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Process this batch
csv_file = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
records = read_csv_batch(csv_file, batch_ctx.start_cursor, batch_ctx.batch_size)
processed = records.map { |record| transform_record(record) }
# Return batch completion
batch_worker_complete(
processed_count: processed.size,
result_data: { records: processed }
)
end
end
Batch Helper Methods:
- get_batch_context(context) - Get batch boundaries from StepContext
- handle_no_op_worker(batch_ctx) - Handle placeholder batches
- batch_worker_complete(processed_count:, result_data:) - Complete batch
- create_cursor_configs(total_items, worker_count) - Create 0-indexed cursor ranges
Cursor Indexing:
# Creates 0-indexed cursor ranges
configs = create_cursor_configs(1000, 5)
# => [
# { batch_id: '1', start_cursor: 0, end_cursor: 200 },
# { batch_id: '2', start_cursor: 200, end_cursor: 400 },
# { batch_id: '3', start_cursor: 400, end_cursor: 600 },
# { batch_id: '4', start_cursor: 600, end_cursor: 800 },
# { batch_id: '5', start_cursor: 800, end_cursor: 1000 }
# ]
Handler Registry
Registration
Location: lib/tasker_core/registry/handler_registry.rb
registry = TaskerCore::Registry::HandlerRegistry.instance
# Manual registration
registry.register_handler('ProcessOrderHandler', ProcessOrderHandler)
# Check availability
registry.handler_available?('ProcessOrderHandler') # => true
# List all handlers
registry.registered_handlers # => ["ProcessOrderHandler", ...]
Discovery Modes
- Preloaded Handlers (test environment)
  - ObjectSpace scanning for loaded handler classes
- Template-Driven Discovery
  - YAML templates define handler references
  - Handlers loaded from configured paths
Handler Search Paths
app/handlers/
lib/handlers/
handlers/
app/tasker/handlers/
lib/tasker/handlers/
spec/handlers/examples/ (test environment)
Configuration
Bootstrap Configuration
Bootstrap configuration is controlled via TOML files, not Ruby parameters:
# config/tasker/base/worker.toml
[web]
enabled = true # Set to false for headless/embedded mode
bind_address = "0.0.0.0"
port = 8080
# Ruby bootstrap is simple - config comes from TOML
TaskerCore::Worker::Bootstrap.start!
Handler Configuration
class MyHandler < TaskerCore::StepHandler::Base
def initialize(config: {})
super
@timeout = config[:timeout] || 30
@max_retries = config[:retries] || 3
end
def config_schema
{
type: 'object',
properties: {
timeout: { type: 'integer', minimum: 1, default: 30 },
retries: { type: 'integer', minimum: 0, default: 3 }
}
}
end
end
Signal Handling
The Ruby worker handles multiple signals:
| Signal | Behavior |
|---|---|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |
| SIGUSR1 | Report worker status |
| SIGUSR2 | Reload configuration |
# Status reporting
Signal.trap('USR1') do
logger.info "Worker Status: #{bootstrap.status.inspect}"
end
# Configuration reload
Signal.trap('USR2') do
bootstrap.reload_config
end
Error Handling
Exception Classes
TaskerCore::Errors::Error # Base class
├── TaskerCore::Errors::ConfigurationError # Configuration issues
├── TaskerCore::Errors::FFIError # FFI bridge errors
├── TaskerCore::Errors::ProceduralError # Base for workflow errors
│ ├── TaskerCore::Errors::RetryableError # Transient failures
│ ├── TaskerCore::Errors::PermanentError # Unrecoverable failures
│ │ ├── TaskerCore::Errors::ValidationError # Validation failures
│ │ └── TaskerCore::Errors::NotFoundError # Resource not found
│ ├── TaskerCore::Errors::TimeoutError # Timeout failures
│ └── TaskerCore::Errors::NetworkError # Network failures
└── TaskerCore::Errors::ServerError # Embedded server errors
Raising Errors
def call(context)
# Retryable error (will be retried)
raise TaskerCore::Errors::RetryableError.new(
'Database connection timeout',
retry_after: 5,
context: { service: 'database' }
)
# Permanent error (no retry)
raise TaskerCore::Errors::PermanentError.new(
'Invalid order format',
error_code: 'INVALID_ORDER',
context: { field: 'order_id' }
)
# Validation error (permanent, with field info)
raise TaskerCore::Errors::ValidationError.new(
'Email format is invalid',
field: 'email',
error_code: 'INVALID_EMAIL'
)
end
Logging
Structured Logging (Recommended)
New code should use TaskerCore::Tracing for unified structured logging via FFI:
# Recommended: Use Tracing directly
TaskerCore::Tracing.info('Processing order', {
order_id: order.id,
amount: order.total,
customer_id: order.customer_id
})
TaskerCore::Tracing.error('Payment failed', {
error_code: 'DECLINED',
card_last_four: '1234'
})
Legacy Logger (Deprecated)
Note: TaskerCore::Logger is maintained for backward compatibility but delegates to TaskerCore::Tracing. New code should use Tracing directly.
# Legacy (still works, but deprecated)
logger = TaskerCore::Logger.instance
logger.info('Processing order', {
order_id: order.id,
amount: order.total
})
Log Levels
Controlled via RUST_LOG environment variable:
- trace - Very detailed debugging
- debug - Debugging information
- info - Normal operation
- warn - Warning conditions
- error - Error conditions
File Structure
workers/ruby/
├── bin/
│ ├── server.rb # Production server
│ └── health_check.rb # Health check script
├── ext/
│ └── tasker_core/
│ └── extconf.rb # FFI extension config
├── lib/
│ └── tasker_core/
│ ├── errors.rb # Exception classes
│ ├── handlers.rb # Handler namespace
│ ├── internal.rb # Internal modules
│ ├── logger.rb # Logging
│ ├── models.rb # Type models
│ ├── registry/
│ │ ├── handler_registry.rb
│ │ └── step_handler_resolver.rb
│ ├── step_handler/
│ │ ├── base.rb # Base handler
│ │ ├── api.rb # API handler
│ │ ├── decision.rb # Decision handler
│ │ └── batchable.rb # Batch handler
│ ├── task_handler/
│ │ └── base.rb # Task orchestration
│ ├── types/ # Type definitions
│ └── version.rb
├── spec/
│ ├── handlers/examples/ # Example handlers
│ └── integration/ # Integration tests
├── Gemfile
└── tasker_core.gemspec
Testing
Unit Tests
cd workers/ruby
bundle exec rspec spec/
Integration Tests
DATABASE_URL=postgresql://... bundle exec rspec spec/integration/
E2E Tests (from project root)
DATABASE_URL=postgresql://... \
TASKER_ENV=test \
bundle exec rspec spec/handlers/
Example Handlers
Linear Workflow
# spec/handlers/examples/linear_workflow/step_handlers/linear_step_1_handler.rb
module LinearWorkflow
module StepHandlers
class LinearStep1Handler < TaskerCore::StepHandler::Base
def call(context)
input = context.context # Full task context
success(result: {
step1_processed: true,
input_received: input,
processed_at: Time.now.iso8601
})
end
end
end
end
Order Fulfillment
class ValidateOrderHandler < TaskerCore::StepHandler::Base
def call(context)
order = context.context # Full task context
unless order['items']&.any?
return failure(
message: 'Order must have at least one item',
error_type: 'ValidationError',
error_code: 'EMPTY_ORDER',
retryable: false
)
end
success(result: {
valid: true,
item_count: order['items'].size,
total: calculate_total(order['items'])
})
end
end
Conditional Approval
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
THRESHOLDS = {
auto_approve: 1000,
manager_only: 5000
}.freeze
def call(context)
amount = context.get_task_field('amount').to_f
if amount < THRESHOLDS[:auto_approve]
decision_success(steps: ['auto_approve'])
elsif amount < THRESHOLDS[:manager_only]
decision_success(steps: ['manager_approval'])
else
decision_success(steps: ['manager_approval', 'finance_review'])
end
end
end
See Also
- Worker Crates Overview - High-level introduction
- Patterns and Practices - Common patterns
- Python Worker - Python implementation
- Worker Event Systems - Architecture details
Rust Worker
Last Updated: 2026-01-01
Audience: Rust Developers
Status: Active
Package: workers-rust
Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix
The Rust worker is the native, high-performance implementation for workflow step execution. It demonstrates the full capability of the tasker-worker foundation with zero FFI overhead.
Quick Start
Running the Server
cd workers/rust
cargo run
With Custom Configuration
TASKER_CONFIG_PATH=/path/to/config.toml cargo run
Environment Variables
| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| RUST_LOG | Log level | info |
Architecture
Entry Point
Location: workers/rust/src/main.rs
#[tokio::main]
async fn main() -> Result<()> {
// Initialize structured logging
tasker_shared::logging::init_tracing();
// Bootstrap worker system
let mut bootstrap_result = bootstrap().await?;
// Start event handler (legacy path)
tokio::spawn(async move {
bootstrap_result.event_handler.start().await
});
// Wait for shutdown signal
tokio::select! {
_ = tokio::signal::ctrl_c() => { /* shutdown */ }
_ = wait_for_sigterm() => { /* shutdown */ }
}
bootstrap_result.worker_handle.stop()?;
Ok(())
}
Bootstrap Process
Location: workers/rust/src/bootstrap.rs
The bootstrap process:
- Creates step handler registry with all handlers
- Sets up global event system
- Bootstraps the tasker-worker foundation
- Creates domain event publisher registry
- Spawns HandlerDispatchService for non-blocking dispatch
- Creates event handler for legacy path
pub async fn bootstrap() -> Result<RustWorkerBootstrapResult> {
// Create registry with all handlers
let registry = Arc::new(RustStepHandlerRegistry::new());
// Bootstrap worker foundation
let worker_handle = WorkerBootstrap::bootstrap_with_event_system(...).await?;
// Set up dispatch service (non-blocking path)
let dispatch_service = HandlerDispatchService::with_callback(...);
Ok(RustWorkerBootstrapResult {
worker_handle,
event_handler,
dispatch_service_handle,
})
}
Handler Dispatch
The Rust worker uses the HandlerDispatchService for non-blocking handler execution:
┌────────────────────────────────────────────────────────────────┐
│ RUST HANDLER DISPATCH │
└────────────────────────────────────────────────────────────────┘
PGMQ Queue
│
▼
┌─────────────┐
│ Dispatch │
│ Channel │
└─────────────┘
│
▼
┌─────────────────────────────────────────┐
│ HandlerDispatchService │
│ ┌────────────────────────────────────┐ │
│ │ Semaphore (10 permits) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ handler.call(&step_data).await │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ DomainEventCallback │ │
│ └────────────────────────────────────┘ │
└─────────────────────────────────────────┘
│
▼
┌─────────────┐
│ Completion │
│ Channel │
└─────────────┘
│
▼
Orchestration
Handler Development
Capability Traits
Rust uses traits for handler composition, matching the mixin pattern in Ruby/Python/TypeScript.
Location: tasker-worker/src/handler_capabilities.rs
APICapable Trait
For HTTP API integration:
use tasker_worker::handler_capabilities::APICapable;
impl APICapable for MyHandler {
// Use the helper methods:
// - api_success(step_uuid, data, status, headers, execution_time_ms)
// - api_failure(step_uuid, message, status, error_type, execution_time_ms)
// - classify_status_code(status) -> ErrorClassification
}
DecisionCapable Trait
For dynamic workflow routing:
use tasker_worker::handler_capabilities::DecisionCapable;
impl DecisionCapable for MyHandler {
// Use the helper methods:
// - decision_success(step_uuid, step_names, routing_context, execution_time_ms)
// - skip_branches(step_uuid, reason, routing_context, execution_time_ms)
// - decision_failure(step_uuid, message, error_type, execution_time_ms)
}
BatchableCapable Trait
For batch processing:
use tasker_worker::handler_capabilities::BatchableCapable;
impl BatchableCapable for MyHandler {
// Use the helper methods:
// - create_cursor_configs(total_items, worker_count) -> Vec<CursorConfig>
// - create_cursor_ranges(total_items, batch_size, max_batches) -> Vec<CursorConfig>
// - batch_analyzer_success(step_uuid, worker_template, configs, total_items, ...)
// - batch_worker_success(step_uuid, processed, succeeded, failed, skipped, ...)
// - no_batches_outcome(step_uuid, reason, execution_time_ms)
// - batch_failure(step_uuid, message, error_type, retryable, ...)
}
Composing Multiple Traits
// Implement multiple capability traits for a single handler
pub struct CompositeHandler {
config: StepHandlerConfig,
}
impl APICapable for CompositeHandler {}
impl DecisionCapable for CompositeHandler {}
#[async_trait]
impl RustStepHandler for CompositeHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Use API capability to fetch data
let response = self.call_api("/users/123").await?;
// Use Decision capability to route based on response
if response.status == 200 {
self.decision_success(step_uuid, vec!["process_user"], None, 50)
} else {
self.api_failure(step_uuid, "API failed", response.status, "api_error", 50)
}
}
}
Handler Trait
Location: workers/rust/src/step_handlers/mod.rs
All Rust handlers implement the RustStepHandler trait:
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;
#[async_trait]
pub trait RustStepHandler: Send + Sync {
/// Handler name for registration
fn name(&self) -> &str;
/// Execute the handler
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult>;
/// Create a new instance with configuration from YAML
fn new(config: StepHandlerConfig) -> Self where Self: Sized;
}
Creating a Handler
use async_trait::async_trait;
use anyhow::Result;
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;
use crate::step_handlers::{RustStepHandler, StepHandlerConfig, success_result};
use serde_json::json;
pub struct ProcessOrderHandler {
_config: StepHandlerConfig,
}
#[async_trait]
impl RustStepHandler for ProcessOrderHandler {
fn name(&self) -> &str {
"process_order"
}
fn new(config: StepHandlerConfig) -> Self {
Self { _config: config }
}
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Extract input from task context
let order_id = step_data.task.context
.get("order_id")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing order_id"))?;
// Process the order
let result = process_order(order_id).await?;
// Return success using helper function
Ok(success_result(
step_uuid,
json!({
"order_id": order_id,
"status": "processed",
"total": result.total
}),
start_time.elapsed().as_millis() as i64,
None,
))
}
}
Handler Registration
Location: workers/rust/src/step_handlers/registry.rs
Handlers are registered in the RustStepHandlerRegistry:
pub struct RustStepHandlerRegistry {
handlers: HashMap<String, Arc<dyn RustStepHandler>>,
}
impl RustStepHandlerRegistry {
pub fn new() -> Self {
let mut registry = Self {
handlers: HashMap::new(),
};
registry.register_all_handlers();
registry
}
fn register_all_handlers(&mut self) {
let empty_config = StepHandlerConfig::empty();
// Linear workflow handlers
self.register_handler(Arc::new(LinearStep1Handler::new(empty_config.clone())));
self.register_handler(Arc::new(LinearStep2Handler::new(empty_config.clone())));
// Order fulfillment handlers
self.register_handler(Arc::new(ValidateOrderHandler::new(empty_config.clone())));
self.register_handler(Arc::new(ProcessPaymentHandler::new(empty_config.clone())));
// ... more handlers
}
fn register_handler(&mut self, handler: Arc<dyn RustStepHandler>) {
let name = handler.name().to_string();
self.handlers.insert(name, handler);
}
pub fn get_handler(&self, name: &str) -> Result<Arc<dyn RustStepHandler>, RustStepHandlerError> {
self.handlers
.get(name)
.cloned()
.ok_or_else(|| RustStepHandlerError::SystemError {
message: format!("Handler '{}' not found in registry", name),
})
}
}
Example Handlers
Linear Workflow
Location: workers/rust/src/step_handlers/linear_workflow.rs
Simple sequential workflow with 4 steps:
pub struct LinearStep1Handler;
#[async_trait]
impl RustStepHandler for LinearStep1Handler {
fn name(&self) -> &str {
"linear_step_1"
}
async fn call(&self, step_data: &StepExecutionData) -> Result<StepHandlerResult> {
info!("LinearStep1Handler: Processing step");
let input = step_data.input_data.clone();
let mut result = serde_json::Map::new();
result.insert("step1_processed".to_string(), json!(true));
result.insert("input_received".to_string(), input);
Ok(StepHandlerResult::success(json!(result)))
}
}
Diamond Workflow
Location: workers/rust/src/step_handlers/diamond_workflow.rs
Parallel branching with convergence:
┌─────┐
│Start│
└──┬──┘
│
┌────┴────┐
▼ ▼
┌───┐ ┌───┐
│ B │ │ C │
└─┬─┘ └─┬─┘
│ │
└────┬────┘
▼
┌─────┐
│ End │
└─────┘
Batch Processing
Location: workers/rust/src/step_handlers/batch_processing_products_csv.rs
Three-phase batch processing:
- Analyzer: Counts total records
- Batch Processor: Processes chunks
- Aggregator: Combines results
pub struct CsvBatchProcessorHandler;
#[async_trait]
impl RustStepHandler for CsvBatchProcessorHandler {
fn name(&self) -> &str {
"csv_batch_processor"
}
async fn call(&self, step_data: &StepExecutionData) -> Result<StepHandlerResult> {
let batch_size = step_data.step_inputs
.get("batch_size")
.and_then(|v| v.as_u64())
.unwrap_or(100) as usize;
let start_cursor = step_data.step_inputs
.get("start_cursor")
.and_then(|v| v.as_u64())
.unwrap_or(0) as usize;
// Process records in batch
let processed = process_batch(start_cursor, batch_size).await?;
Ok(StepHandlerResult::success(json!({
"processed_count": processed,
"batch_complete": true
})))
}
}
Error Injection (Testing)
Location: workers/rust/src/step_handlers/error_injection/
Handlers for testing retry behavior:
use std::sync::atomic::{AtomicU32, Ordering};

pub struct FailNTimesHandler {
    fail_count: u32,
    attempts: AtomicU32,
}
impl FailNTimesHandler {
/// Create handler that fails N times before succeeding
pub fn new(fail_count: u32) -> Self {
Self { fail_count, attempts: AtomicU32::new(0) }
}
}
#[async_trait]
impl RustStepHandler for FailNTimesHandler {
async fn call(&self, _step_data: &StepExecutionData) -> Result<StepHandlerResult> {
let attempt = self.attempts.fetch_add(1, Ordering::SeqCst);
if attempt < self.fail_count {
Ok(StepHandlerResult::failure(
"Intentional failure for testing",
"test_error",
true, // retryable
))
} else {
Ok(StepHandlerResult::success(json!({"attempts": attempt + 1})))
}
}
}
Domain Events
Post-Execution Publishing
Handlers can publish domain events after step execution using the StepEventPublisher trait:
use async_trait::async_trait;
use std::sync::Arc;
use tasker_shared::events::domain_events::DomainEventPublisher;
use tasker_worker::worker::step_event_publisher::{
StepEventPublisher, StepEventContext, PublishResult
};
#[derive(Debug)]
pub struct PaymentEventPublisher {
domain_publisher: Arc<DomainEventPublisher>,
}
impl PaymentEventPublisher {
pub fn new(domain_publisher: Arc<DomainEventPublisher>) -> Self {
Self { domain_publisher }
}
}
#[async_trait]
impl StepEventPublisher for PaymentEventPublisher {
fn name(&self) -> &str {
"PaymentEventPublisher"
}
fn domain_publisher(&self) -> &Arc<DomainEventPublisher> {
&self.domain_publisher
}
async fn publish(&self, ctx: &StepEventContext) -> PublishResult {
let mut result = PublishResult::default();
if ctx.step_succeeded() {
let payload = json!({
"order_id": ctx.execution_result.result["order_id"],
"amount": ctx.execution_result.result["amount"],
});
// Uses default impl from trait
match self.publish_event(ctx, "payment.completed", payload).await {
Ok(event_id) => result.published.push(event_id),
Err(e) => result.errors.push(e.to_string()),
}
}
result
}
}
Dual-Path Delivery
Events can route to different delivery paths:
| Path | Description | Use Case |
|---|---|---|
| durable | Published to PGMQ | External consumers, audit |
| fast | In-process bus | Metrics, telemetry |
Configuration
Bootstrap Configuration
pub struct WorkerBootstrapConfig {
pub worker_id: String,
pub enable_web_api: bool,
pub event_driven_enabled: bool,
pub deployment_mode_hint: Option<String>,
}
// Default configuration
let config = WorkerBootstrapConfig {
worker_id: "rust-worker-001".to_string(),
enable_web_api: true,
event_driven_enabled: true,
deployment_mode_hint: Some("Hybrid".to_string()),
..Default::default()
};
Dispatch Configuration
let config = HandlerDispatchConfig {
max_concurrent_handlers: 10,
handler_timeout: Duration::from_secs(30),
service_id: "rust-handler-dispatch".to_string(),
load_shedding: LoadSheddingConfig::default(),
};
Signal Handling
The Rust worker handles graceful shutdown:
// Wait for shutdown signal
tokio::select! {
_ = tokio::signal::ctrl_c() => {
info!("Received Ctrl+C, initiating graceful shutdown...");
}
result = wait_for_sigterm() => {
info!("Received SIGTERM, initiating graceful shutdown...");
}
}
// Graceful shutdown sequence
bootstrap_result.worker_handle.stop()?;
Performance
Characteristics
- Zero FFI Overhead: Native Rust handlers
- Async/Await: Non-blocking I/O with Tokio
- Bounded Concurrency: Semaphore-limited parallelism
- Memory Safety: Rust’s ownership model
Benchmarking
# Run with release optimizations
cargo run --release
# With performance profiling
RUST_LOG=trace cargo run --release
File Structure
workers/rust/
├── src/
│ ├── main.rs # Entry point
│ ├── bootstrap.rs # Worker initialization
│ ├── lib.rs # Library exports
│ ├── event_handler.rs # Event bridging (legacy)
│ ├── global_event_system.rs # Global event coordination
│ ├── step_handlers/
│ │ ├── mod.rs # Handler traits and types
│ │ ├── registry.rs # Handler registry
│ │ ├── linear_workflow.rs # Linear workflow handlers
│ │ ├── diamond_workflow.rs # Diamond workflow handlers
│ │ ├── tree_workflow.rs # Tree workflow handlers
│ │ ├── mixed_dag_workflow.rs
│ │ ├── order_fulfillment.rs
│ │ ├── batch_processing_*.rs
│ │ ├── error_injection/ # Test handlers
│ │ └── domain_event_*.rs # Event publishing
│ └── event_subscribers/
│ ├── mod.rs
│ ├── logging_subscriber.rs
│ └── metrics_subscriber.rs
├── Cargo.toml
└── tests/
Testing
Unit Tests
cargo test -p workers-rust
Integration Tests
# With database
DATABASE_URL=postgresql://... cargo test -p workers-rust --test integration
E2E Tests
# From project root
DATABASE_URL=postgresql://... cargo nextest run --package workers-rust
See Also
- Worker Crates Overview - High-level introduction
- Patterns and Practices - Common patterns
- Worker Event Systems - Architecture details
- Worker Actors - Actor pattern documentation
TypeScript Worker
Last Updated: 2026-01-01
Audience: TypeScript/JavaScript Developers
Status: Active
Package: @tasker-systems/tasker
Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix
The TypeScript worker provides a multi-runtime interface for integrating tasker-core workflow execution into TypeScript/JavaScript applications. It supports Bun, Node.js, and Deno runtimes with unified FFI bindings to the Rust worker foundation.
Quick Start
Installation
cd workers/typescript
bun install # Install dependencies
cargo build --release -p tasker-ts # Build FFI library
Running the Server
# With Bun (recommended for production)
bun run bin/server.ts
# With Node.js
npx tsx bin/server.ts
# With Deno
deno run --allow-ffi --allow-env --allow-net bin/server.ts
Environment Variables
| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| TASKER_FFI_LIBRARY_PATH | Path to libtasker_ts | Auto-detected |
| RUST_LOG | Log level (trace/debug/info/warn/error) | info |
| PORT | HTTP server port | 8081 |
Architecture
Server Mode
Location: workers/typescript/bin/server.ts
The server bootstraps the Rust foundation and manages TypeScript handler execution:
import { createRuntime } from '../src/ffi/index.js';
import { EventEmitter } from '../src/events/event-emitter.js';
import { EventPoller } from '../src/events/event-poller.js';
import { HandlerRegistry } from '../src/handler/registry.js';
import { StepExecutionSubscriber } from '../src/subscriber/step-execution-subscriber.js';
// Create runtime for current environment (Bun/Node/Deno)
const runtime = createRuntime();
await runtime.load(libraryPath);
// Bootstrap Rust worker foundation
const result = runtime.bootstrapWorker({ namespace: 'my-app' });
// Create event system
const emitter = new EventEmitter();
const registry = new HandlerRegistry();
// Register handlers
registry.register('process_order', ProcessOrderHandler);
// Create step execution subscriber
const subscriber = new StepExecutionSubscriber(
emitter,
registry,
runtime,
{ workerId: 'typescript-worker-001' }
);
subscriber.start();
// Start event poller (10ms polling)
const poller = new EventPoller(runtime, emitter, {
pollingIntervalMs: 10
});
poller.start();
// Wait for shutdown signal
await shutdownSignal;
// Graceful shutdown
poller.stop();
await subscriber.waitForCompletion();
runtime.stopWorker();
Headless/Embedded Mode
For embedding in existing TypeScript applications:
import { createRuntime } from '@tasker-systems/tasker';
import { EventEmitter, EventPoller, HandlerRegistry, StepExecutionSubscriber } from '@tasker-systems/tasker';
// Bootstrap worker (headless mode via TOML: web.enabled = false)
const runtime = createRuntime();
await runtime.load('/path/to/libtasker_ts.dylib');
runtime.bootstrapWorker({ namespace: 'my-app' });
// Register handlers
const registry = new HandlerRegistry();
registry.register('process_data', ProcessDataHandler);
// Start event system
const emitter = new EventEmitter();
const subscriber = new StepExecutionSubscriber(emitter, registry, runtime, {});
subscriber.start();
const poller = new EventPoller(runtime, emitter);
poller.start();
FFI Bridge
TypeScript communicates with the Rust foundation via FFI polling:
┌────────────────────────────────────────────────────────────────┐
│ TYPESCRIPT FFI BRIDGE │
└────────────────────────────────────────────────────────────────┘
Rust Worker System
│
│ FFI (pollStepEvents)
▼
┌─────────────────────┐
│ EventPoller │
│ (setInterval) │──→ poll every 10ms
└─────────────────────┘
│
│ emit to EventEmitter
▼
┌─────────────────────┐
│ StepExecution │
│ Subscriber │──→ route to handler
└─────────────────────┘
│
│ handler.call(context)
▼
┌─────────────────────┐
│ Handler Execution │
└─────────────────────┘
│
│ FFI (completeStepEvent)
▼
Rust Completion Channel
Multi-Runtime Support
| Runtime | FFI Library | Status |
|---|---|---|
| Bun | koffi | Production |
| Node.js | koffi | Production |
| Deno | Deno.dlopen | Production |
Handler Development
Base Handler
Location: workers/typescript/src/handler/base.ts
All handlers extend StepHandler:
import { StepHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';
export class ProcessOrderHandler extends StepHandler {
static handlerName = 'process_order';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
// Access input data
const orderId = context.getInput<string>('order_id');
const amount = context.getInput<number>('amount');
// Business logic
const result = await this.processOrder(orderId, amount);
// Return success
return this.success({
order_id: orderId,
status: 'processed',
total: result.total
});
}
private async processOrder(orderId: string, amount: number) {
// Implementation
return { total: amount * 1.1 };
}
}
Handler Signature
async call(context: StepContext): Promise<StepHandlerResult>
// StepContext provides:
context.taskUuid // Task identifier
context.stepUuid // Step identifier
context.stepInputs // Runtime inputs
context.stepConfig // Handler configuration
context.dependencyResults // Results from parent steps
context.taskContext // Full task context
context.retryCount // Current retry attempt
// Type-safe accessors:
context.getInput<T>(key) // Get single input
context.getDependencyResult(stepName) // Get dependency result
context.getAllDependencyResults(name) // Get all instances (batch workers)
Result Methods
// Success result (from base class)
return this.success(
{ key: 'value' }, // result
{ duration_ms: 100 } // metadata (optional)
);
// Failure result (from base class)
return this.failure(
'Payment declined', // message
'payment_error', // errorType
true, // retryable
{ card_last_four: '1234' } // metadata (optional)
);
Error Types
import { ErrorType } from '@tasker-systems/tasker';
ErrorType.PERMANENT_ERROR // Non-retryable failures
ErrorType.RETRYABLE_ERROR // Retryable failures
ErrorType.VALIDATION_ERROR // Input validation failures
ErrorType.HANDLER_ERROR // Handler execution failures
Accessing Dependencies
async call(context: StepContext): Promise<StepHandlerResult> {
// Get result from a dependency step
const validation = context.getDependencyResult('validate_order') as {
valid: boolean;
amount: number;
} | null;
if (!validation) {
return this.failure('Missing validation result', 'dependency_error', false);
}
if (validation.valid) {
return this.success({ processed: true, amount: validation.amount });
}
return this.failure('Validation failed', 'validation_error', false);
}
Specialized Handlers
Mixin Pattern
TypeScript uses composition via mixins rather than inheritance. You can use either:
- Wrapper classes (ApiHandler, DecisionHandler) - simpler, backward compatible
- Mixin functions (applyAPI, applyDecision) - explicit composition
import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, APICapable } from '@tasker-systems/tasker';
// Using mixin pattern (recommended for new code)
class MyHandler extends StepHandler implements APICapable {
constructor() {
super();
applyAPI(this); // Adds get/post/put/delete methods
}
async call(context: StepContext): Promise<StepHandlerResult> {
const response = await this.get('/api/data');
return this.apiSuccess(response);
}
}
// Or using wrapper class (simpler, backward compatible)
import { ApiHandler } from '@tasker-systems/tasker';
class MyWrapperHandler extends ApiHandler {
async call(context: StepContext): Promise<StepHandlerResult> {
const response = await this.get('/api/data');
return this.apiSuccess(response);
}
}
API Handler
Location: workers/typescript/src/handler/api.ts
For HTTP API integration with automatic error classification:
import { ApiHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';
export class FetchUserHandler extends ApiHandler {
static handlerName = 'fetch_user';
static handlerVersion = '1.0.0';
protected baseUrl = 'https://api.example.com';
async call(context: StepContext): Promise<StepHandlerResult> {
const userId = context.getInput<string>('user_id');
// Automatic error classification
const response = await this.get(`/users/${userId}`);
if (!response.ok) {
return this.apiFailure(response);
}
return this.apiSuccess(response);
}
}
HTTP Methods:
// GET request
const response = await this.get('/path', {
params: { key: 'value' },
headers: { 'Authorization': 'Bearer token' }
});
// POST request
const response = await this.post('/path', {
body: { key: 'value' },
headers: {}
});
// PUT request
const response = await this.put('/path', { body: { key: 'value' } });
// DELETE request
const response = await this.delete('/path', { params: {} });
ApiResponse Properties:
response.statusCode // HTTP status code
response.headers // Response headers
response.body // Parsed body (object or string)
response.ok // True if 2xx status
response.isClientError // True if 4xx status
response.isServerError // True if 5xx status
response.isRetryable // True if should retry (408, 429, 500-504)
response.retryAfter // Retry-After header value in seconds
Error Classification:
| Status | Classification | Behavior |
|---|---|---|
| 400, 401, 403, 404, 422 | Non-retryable | Permanent failure |
| 408, 429, 500-504 | Retryable | Standard retry |
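If a handler needs behavior beyond apiSuccess/apiFailure, the response flags can drive an explicit failure. A minimal sketch; the 'rate_limited' error-type string and the metadata key are illustrative, not defined by the library:
async call(context: StepContext): Promise<StepHandlerResult> {
  const response = await this.get('/reports/latest');
  if (response.ok) {
    return this.apiSuccess(response);
  }
  // Rate limited: mark as retryable and pass along the server's Retry-After hint
  if (response.statusCode === 429) {
    return this.failure(
      'Rate limited by upstream API',
      'rate_limited',                              // illustrative error type
      true,                                        // retryable
      { retry_after_seconds: response.retryAfter } // illustrative metadata key
    );
  }
  // Fall back to the built-in classification for everything else
  return this.apiFailure(response);
}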
Decision Handler
Location: workers/typescript/src/handler/decision.ts
For dynamic workflow routing:
import { DecisionHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';
export class RoutingDecisionHandler extends DecisionHandler {
static handlerName = 'routing_decision';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
const amount = context.getInput<number>('amount') ?? 0;
if (amount < 1000) {
// Auto-approve small amounts
return this.decisionSuccess(['auto_approve'], {
route_type: 'auto',
amount
});
} else if (amount < 5000) {
// Manager approval for medium amounts
return this.decisionSuccess(['manager_approval'], {
route_type: 'manager',
amount
});
} else {
// Dual approval for large amounts
return this.decisionSuccess(['manager_approval', 'finance_review'], {
route_type: 'dual',
amount
});
}
}
}
Decision Methods:
// Activate specific steps
return this.decisionSuccess(
['step1', 'step2'], // steps to activate
{ route_reason: 'threshold' } // routing context
);
// No branches needed
return this.decisionNoBranches('condition not met');
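A common guard is to return decisionNoBranches when the routing input is missing, so no downstream steps are activated. A minimal sketch reusing the inputs from the routing example above:
async call(context: StepContext): Promise<StepHandlerResult> {
  const amount = context.getInput<number>('amount');
  // Nothing to route without an amount -- activate no branches
  if (amount == null) {
    return this.decisionNoBranches('no amount provided');
  }
  return this.decisionSuccess(
    amount < 1000 ? ['auto_approve'] : ['manager_approval'],
    { amount }
  );
}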
BatchableStepHandler
Location: workers/typescript/src/handler/batchable.ts
For processing large datasets in chunks. Cross-language aligned with Ruby and Python implementations.
Analyzer Handler (creates batch configurations):
import { BatchableStepHandler } from '@tasker-systems/tasker';
import type { StepContext, BatchableResult } from '@tasker-systems/tasker';
export class CsvAnalyzerHandler extends BatchableStepHandler {
static handlerName = 'csv_analyzer';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<BatchableResult> {
const csvPath = context.getInput<string>('csv_path');
const rowCount = await this.countCsvRows(csvPath);
if (rowCount === 0) {
// No data to process - use cross-language standard
return this.noBatchesResult('empty_dataset', {
csv_path: csvPath,
analyzed_at: new Date().toISOString()
});
}
// Create cursor configs using Ruby-style helper
// Divides rowCount into 5 roughly equal batches
const batchConfigs = this.createCursorConfigs(rowCount, 5);
return this.batchSuccess('process_csv_batch', batchConfigs, {
csv_path: csvPath,
total_rows: rowCount,
analyzed_at: new Date().toISOString()
});
}
}
Worker Handler (processes a batch):
export class CsvBatchProcessorHandler extends BatchableStepHandler {
static handlerName = 'csv_batch_processor';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
// Cross-language standard: check for no-op worker first
const noOpResult = this.handleNoOpWorker(context);
if (noOpResult) {
return noOpResult;
}
// Get batch worker inputs from Rust
const batchInputs = this.getBatchWorkerInputs(context);
const cursor = batchInputs?.cursor;
if (!cursor) {
return this.failure('Missing batch cursor', 'batch_error', false);
}
// Process the batch
const results = await this.processCsvBatch(
cursor.start_cursor,
cursor.end_cursor
);
return this.success({
batch_id: cursor.batch_id,
rows_processed: results.count,
items_succeeded: results.success,
items_failed: results.failed
});
}
}
Aggregator Handler (combines results):
export class CsvAggregatorHandler extends StepHandler {
static handlerName = 'csv_aggregator';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
// Get all batch worker results
const workerResults = context.getAllDependencyResults('process_csv_batch') as Array<{
rows_processed: number;
items_succeeded: number;
items_failed: number;
} | null>;
// Aggregate results
let totalProcessed = 0;
let totalSucceeded = 0;
let totalFailed = 0;
for (const result of workerResults) {
if (result) {
totalProcessed += result.rows_processed ?? 0;
totalSucceeded += result.items_succeeded ?? 0;
totalFailed += result.items_failed ?? 0;
}
}
return this.success({
total_processed: totalProcessed,
total_succeeded: totalSucceeded,
total_failed: totalFailed,
worker_count: workerResults.length
});
}
}
BatchableStepHandler Methods (Cross-Language Aligned):
| Method | Ruby Equivalent | Purpose |
|---|---|---|
| `batchSuccess(template, configs, metadata)` | `batch_success` | Create batch workers |
| `noBatchesResult(reason, metadata)` | `no_batches_outcome` | Empty dataset handling |
| `createCursorConfigs(total, workers)` | `create_cursor_configs` | Divide work by worker count |
| `handleNoOpWorker(context)` | `handle_no_op_worker` | Detect no-op placeholders |
| `getBatchWorkerInputs(context)` | `get_batch_context` | Access Rust batch inputs |
| `aggregateWorkerResults(results)` | `aggregate_batch_worker_results` | Static aggregation helper |
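To make the cursor math concrete, here is a stand-alone illustration of how cursor-based batching divides a row count into roughly equal ranges. It mirrors the behavior described for createCursorConfigs but is not the library's implementation:
// Illustration only -- use this.createCursorConfigs(total, workers) in real handlers.
function splitIntoCursorRanges(totalRows: number, workerCount: number): Array<{ start: number; end: number }> {
  const ranges: Array<{ start: number; end: number }> = [];
  const base = Math.floor(totalRows / workerCount);
  const remainder = totalRows % workerCount;
  let start = 0;
  for (let i = 0; i < workerCount; i++) {
    const size = base + (i < remainder ? 1 : 0); // spread any remainder across the first workers
    if (size === 0) break;                       // fewer rows than workers: stop early
    ranges.push({ start, end: start + size - 1 });
    start += size;
  }
  return ranges;
}
// 1000 rows across 5 workers -> [0-199], [200-399], [400-599], [600-799], [800-999]
console.log(splitIntoCursorRanges(1000, 5));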
Handler Registry
Registration
Location: workers/typescript/src/handler/registry.ts
import { HandlerRegistry } from '@tasker-systems/tasker';
const registry = new HandlerRegistry();
// Manual registration
registry.register('process_order', ProcessOrderHandler);
// Check if registered
registry.isRegistered('process_order'); // true
// Resolve and instantiate
const handler = registry.resolve('process_order');
if (handler) {
const result = await handler.call(context);
}
// List all handlers
registry.listHandlers(); // ['process_order', ...]
// Handler count
registry.handlerCount(); // 1
Bulk Registration
import { registerExampleHandlers } from './handlers/examples/index.js';
// Register multiple handlers at once
registerExampleHandlers(registry);
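A bulk-registration module is just a function that wires each handler class into the registry under the name your task templates reference. A minimal sketch (the import paths are hypothetical):
import { HandlerRegistry } from '@tasker-systems/tasker';
import { ProcessOrderHandler } from './process-order.js';       // hypothetical paths
import { FetchUserHandler } from './fetch-user.js';
import { RoutingDecisionHandler } from './routing-decision.js';
export function registerOrderHandlers(registry: HandlerRegistry): void {
  // Names must match the handler names used in task templates
  registry.register('process_order', ProcessOrderHandler);
  registry.register('fetch_user', FetchUserHandler);
  registry.register('routing_decision', RoutingDecisionHandler);
}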
Type System
Core Types
import type {
StepContext,
StepHandlerResult,
BatchableResult,
FfiStepEvent,
BootstrapConfig,
WorkerStatus,
} from '@tasker-systems/tasker';
// StepContext - created from FFI event
const context = StepContext.fromFfiEvent(event, 'handler_name');
context.taskUuid; // string
context.stepUuid; // string
context.stepInputs; // Record<string, unknown>
context.retryCount; // number
// StepHandlerResult - handler output
result.success; // boolean
result.result; // Record<string, unknown>
result.errorMessage; // string | undefined
result.retryable; // boolean
Configuration Types
import type { BootstrapConfig } from '@tasker-systems/tasker';
const config: BootstrapConfig = {
namespace: 'my-app',
environment: 'production',
configPath: '/path/to/config.toml'
};
Event System
EventEmitter
Location: workers/typescript/src/events/event-emitter.ts
import { EventEmitter } from '@tasker-systems/tasker';
import { StepEventNames } from '@tasker-systems/tasker';
const emitter = new EventEmitter();
// Subscribe to events
emitter.on(StepEventNames.STEP_EXECUTION_RECEIVED, (payload) => {
console.log(`Processing step: ${payload.event.step_uuid}`);
});
emitter.on(StepEventNames.STEP_EXECUTION_COMPLETED, (payload) => {
console.log(`Step completed: ${payload.stepUuid}`);
});
// Emit events
emitter.emit(StepEventNames.STEP_EXECUTION_RECEIVED, {
event: ffiStepEvent
});
Event Names
import { StepEventNames } from '@tasker-systems/tasker';
StepEventNames.STEP_EXECUTION_RECEIVED // Step event received from FFI
StepEventNames.STEP_EXECUTION_STARTED // Handler execution started
StepEventNames.STEP_EXECUTION_COMPLETED // Handler execution completed
StepEventNames.STEP_EXECUTION_FAILED // Handler execution failed
StepEventNames.STEP_COMPLETION_SENT // Result sent to FFI
EventPoller
Location: workers/typescript/src/events/event-poller.ts
import { EventPoller } from '@tasker-systems/tasker';
const poller = new EventPoller(runtime, emitter, {
pollingIntervalMs: 10,          // Poll every 10ms
starvationCheckInterval: 100,   // Every 100 poll cycles (~1 second at 10ms polls)
cleanupInterval: 1000           // Every 1000 poll cycles (~10 seconds at 10ms polls)
});
// Start polling
poller.start();
// Get metrics
const metrics = poller.getMetrics();
console.log(`Pending: ${metrics.pendingCount}`);
// Stop polling
poller.stop();
Domain Events
TypeScript has full domain event support, matching Ruby and Python capabilities. The domain events module provides BasePublisher, BaseSubscriber, and registries for custom event handling.
Location: workers/typescript/src/handler/domain-events.ts
BasePublisher
Publishers transform step execution context into domain-specific events:
import { BasePublisher, StepEventContext, DomainEvent } from '@tasker-systems/tasker';
export class PaymentEventPublisher extends BasePublisher {
static publisherName = 'payment_events';
// Required: which steps trigger this publisher
publishesFor(): string[] {
return ['process_payment', 'refund_payment'];
}
// Transform step context into domain event
async transformPayload(ctx: StepEventContext): Promise<Record<string, unknown>> {
return {
payment_id: ctx.result?.payment_id,
amount: ctx.result?.amount,
currency: ctx.result?.currency,
status: ctx.result?.status
};
}
// Lifecycle hooks (optional)
async beforePublish(ctx: StepEventContext): Promise<void> {
console.log(`Publishing payment event for step: ${ctx.stepName}`);
}
async afterPublish(ctx: StepEventContext, event: DomainEvent): Promise<void> {
console.log(`Published event: ${event.eventName}`);
}
async onPublishError(ctx: StepEventContext, error: Error): Promise<void> {
console.error(`Failed to publish: ${error.message}`);
}
// Inject custom metadata
async additionalMetadata(ctx: StepEventContext): Promise<Record<string, unknown>> {
return { payment_processor: 'stripe' };
}
}
BaseSubscriber
Subscribers react to domain events matching specific patterns:
import { BaseSubscriber, InProcessDomainEvent, SubscriberResult } from '@tasker-systems/tasker';
export class AuditLoggingSubscriber extends BaseSubscriber {
static subscriberName = 'audit_logger';
// Which events to handle (glob patterns supported)
subscribesTo(): string[] {
return ['payment.*', 'order.completed'];
}
// Handle matching events
async handle(event: InProcessDomainEvent): Promise<SubscriberResult> {
await this.logToAuditTrail(event);
return { success: true };
}
// Lifecycle hooks (optional)
async beforeHandle(event: InProcessDomainEvent): Promise<void> {
console.log(`Handling: ${event.eventName}`);
}
async afterHandle(event: InProcessDomainEvent, result: SubscriberResult): Promise<void> {
console.log(`Handled successfully: ${result.success}`);
}
async onHandleError(event: InProcessDomainEvent, error: Error): Promise<void> {
console.error(`Handler error: ${error.message}`);
}
}
Registries
Manage publishers and subscribers with singleton registries:
import { PublisherRegistry, SubscriberRegistry } from '@tasker-systems/tasker';
// Publisher Registry
const pubRegistry = PublisherRegistry.getInstance();
pubRegistry.register(PaymentEventPublisher);
pubRegistry.register(OrderEventPublisher);
pubRegistry.freeze(); // Prevent further registrations
// Get publisher for a step
const publisher = pubRegistry.getForStep('process_payment');
// Subscriber Registry
const subRegistry = SubscriberRegistry.getInstance();
subRegistry.register(AuditLoggingSubscriber);
subRegistry.register(MetricsSubscriber);
// Start all subscribers
subRegistry.startAll();
// Stop all subscribers
subRegistry.stopAll();
FFI Integration
Domain events integrate with the Rust FFI layer for cross-language event flow:
import { createFfiPollAdapter, InProcessDomainEventPoller } from '@tasker-systems/tasker';
// Create poller connected to Rust broadcast channel
const poller = new InProcessDomainEventPoller();
// Set the FFI poll function
poller.setPollFunction(createFfiPollAdapter(runtime));
// Start polling for events
poller.start((event) => {
// Route to appropriate subscriber
const subscribers = subRegistry.getMatchingSubscribers(event.eventName);
for (const sub of subscribers) {
sub.handle(event);
}
});
Signal Handling
The TypeScript worker handles signals for graceful shutdown:
| Signal | Behavior |
|---|---|
| `SIGTERM` | Graceful shutdown |
| `SIGINT` | Graceful shutdown (Ctrl+C) |
import { ShutdownController } from '@tasker-systems/tasker';
const shutdown = new ShutdownController();
// Register signal handlers
shutdown.registerSignalHandlers();
// Wait for shutdown signal
await shutdown.waitForShutdown();
// Or check if shutdown requested
if (shutdown.isShutdownRequested()) {
// Begin cleanup
}
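Putting the pieces together, a server entry point typically stops the event poller once a shutdown signal arrives. A hedged sketch; the actual wiring in bin/server.ts may differ:
import {
  EventEmitter,
  EventPoller,
  ShutdownController,
  logInfo,
} from '@tasker-systems/tasker';
// `runtime` is the FFI runtime adapter produced during bootstrap (see src/ffi/).
export async function runWorker(runtime: ConstructorParameters<typeof EventPoller>[0]): Promise<void> {
  const emitter = new EventEmitter();
  const poller = new EventPoller(runtime, emitter, {
    pollingIntervalMs: 10,
    starvationCheckInterval: 100,
    cleanupInterval: 1000,
  });
  const shutdown = new ShutdownController();
  shutdown.registerSignalHandlers();
  poller.start();
  // Block until SIGTERM/SIGINT, then stop pulling new step events
  await shutdown.waitForShutdown();
  logInfo('Shutdown requested, stopping event poller', { component: 'server' });
  poller.stop();
}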
Error Handling
Using Failure Results
async call(context: StepContext): Promise<StepHandlerResult> {
try {
const result = await this.processData(context);
return this.success(result);
} catch (error) {
if (error instanceof NetworkError) {
// Retryable error
return this.failure(
error.message,
ErrorType.RETRYABLE_ERROR,
true,
{ endpoint: error.endpoint }
);
}
// Non-retryable error
return this.failure(
error instanceof Error ? error.message : 'Unknown error',
ErrorType.HANDLER_ERROR,
false
);
}
}
Logging
Structured Logging
import { logInfo, logError, logWarn, logDebug } from '@tasker-systems/tasker';
// Simple logging
logInfo('Processing started', { component: 'handler' });
logError('Failed to connect', { component: 'database' });
// With additional context
logInfo('Order processed', {
component: 'handler',
order_id: '123',
amount: '100.00'
});
Pino Integration
The worker uses pino for structured logging:
import pino from 'pino';
const logger = pino({
name: 'my-handler',
level: process.env.RUST_LOG ?? 'info'
});
logger.info({ orderId: '123' }, 'Processing order');
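Inside a handler, including the task and step identifiers on every log line keeps worker output correlatable with orchestration logs. A small sketch using the helpers and context fields shown above (doWork stands in for your business logic):
async call(context: StepContext): Promise<StepHandlerResult> {
  logInfo('Handler started', {
    component: 'handler',
    task_uuid: context.taskUuid,
    step_uuid: context.stepUuid,
    retry_count: String(context.retryCount),
  });
  const result = await this.doWork(context); // doWork() is a placeholder for business logic
  logInfo('Handler finished', {
    component: 'handler',
    task_uuid: context.taskUuid,
    step_uuid: context.stepUuid,
  });
  return this.success(result);
}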
File Structure
workers/typescript/
├── bin/
│ └── server.ts # Production server
├── src/
│ ├── index.ts # Package exports
│ ├── bootstrap/
│ │ └── bootstrap.ts # Worker initialization
│ ├── events/
│ │ ├── event-emitter.ts # Event pub/sub
│ │ ├── event-poller.ts # FFI polling
│ │ └── event-system.ts # Combined event system
│ ├── ffi/
│ │ ├── bun-runtime.ts # Bun FFI adapter
│ │ ├── node-runtime.ts # Node.js FFI adapter
│ │ ├── deno-runtime.ts # Deno FFI adapter
│ │ ├── runtime-interface.ts # Common interface
│ │ └── types.ts # FFI types
│ ├── handler/
│ │ ├── base.ts # Base handler class
│ │ ├── api.ts # API handler
│ │ ├── decision.ts # Decision handler
│ │ ├── batchable.ts # Batchable handler
│ │ ├── domain-events.ts # Domain events module
│ │ ├── registry.ts # Handler registry
│ │ └── mixins/ # Mixin modules
│ │ ├── index.ts # Mixin exports
│ │ ├── api.ts # APIMixin, applyAPI
│ │ └── decision.ts # DecisionMixin, applyDecision
│ ├── server/
│ │ ├── worker-server.ts # Server implementation
│ │ └── types.ts # Server types
│ ├── subscriber/
│ │ └── step-execution-subscriber.ts
│ └── types/
│ ├── step-context.ts # Step context
│ └── step-handler-result.ts
├── tests/
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── handlers/examples/ # Example handlers
├── src-rust/ # Rust FFI extension
├── package.json
├── tsconfig.json
└── biome.json # Linting config
Testing
Unit Tests
cd workers/typescript
bun test # Run all tests
bun test tests/unit/ # Run unit tests only
Integration Tests
bun test tests/integration/ # Run integration tests
With Coverage
bun test --coverage
Linting
bun run check # Biome lint + format check
bun run check:fix # Auto-fix issues
Type Checking
bunx tsc --noEmit # Type check without emit
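Because handlers are plain classes, unit tests can construct a StepContext and call the handler directly. A hedged sketch with bun:test, using the DoubleHandler from the examples below; the event object passed to StepContext.fromFfiEvent is simplified and may not match the real FfiStepEvent shape exactly:
import { describe, expect, it } from 'bun:test';
import { StepContext } from '@tasker-systems/tasker';
import { DoubleHandler } from '../src/handlers/double.js'; // hypothetical path
describe('DoubleHandler', () => {
  it('doubles the input value', async () => {
    // Simplified FFI event payload; adjust field names to match FfiStepEvent
    const event = {
      task_uuid: 'task-123',
      step_uuid: 'step-456',
      step_inputs: { value: 21 },
      dependency_results: {},
      retry_count: 0,
    } as unknown as Parameters<typeof StepContext.fromFfiEvent>[0];
    const context = StepContext.fromFfiEvent(event, 'double_value');
    const result = await new DoubleHandler().call(context);
    expect(result.success).toBe(true);
    expect(result.result).toEqual({ result: 42, operation: 'double' });
  });
});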
Example Handlers
Linear Workflow
export class DoubleHandler extends StepHandler {
static handlerName = 'double_value';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
const value = context.getInput<number>('value') ?? 0;
return this.success({
result: value * 2,
operation: 'double'
});
}
}
export class AddHandler extends StepHandler {
static handlerName = 'add_constant';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
const prev = context.getDependencyResult('double_value') as { result: number } | null;
const value = prev?.result ?? 0;
return this.success({
result: value + 10,
operation: 'add'
});
}
}
Diamond Workflow (Parallel Branches)
export class DiamondStartHandler extends StepHandler {
static handlerName = 'diamond_start';
async call(context: StepContext): Promise<StepHandlerResult> {
const input = context.getInput<number>('value') ?? 0;
return this.success({ squared: input * input });
}
}
export class BranchBHandler extends StepHandler {
static handlerName = 'branch_b';
async call(context: StepContext): Promise<StepHandlerResult> {
const start = context.getDependencyResult('diamond_start') as { squared: number };
return this.success({ result: start.squared + 25 });
}
}
export class BranchCHandler extends StepHandler {
static handlerName = 'branch_c';
async call(context: StepContext): Promise<StepHandlerResult> {
const start = context.getDependencyResult('diamond_start') as { squared: number };
return this.success({ result: start.squared * 2 });
}
}
export class DiamondEndHandler extends StepHandler {
static handlerName = 'diamond_end';
async call(context: StepContext): Promise<StepHandlerResult> {
const branchB = context.getDependencyResult('branch_b') as { result: number };
const branchC = context.getDependencyResult('branch_c') as { result: number };
return this.success({
final: (branchB.result + branchC.result) / 2
});
}
}
Error Handling
export class RetryableErrorHandler extends StepHandler {
static handlerName = 'retryable_error';
async call(context: StepContext): Promise<StepHandlerResult> {
// Simulate a retryable error (e.g., network timeout)
return this.failure(
'Connection timeout - will be retried',
ErrorType.RETRYABLE_ERROR,
true,
{ attempt: context.retryCount }
);
}
}
export class PermanentErrorHandler extends StepHandler {
static handlerName = 'permanent_error';
async call(context: StepContext): Promise<StepHandlerResult> {
// Simulate a permanent error (e.g., validation failure)
return this.failure(
'Invalid input - no retry allowed',
ErrorType.PERMANENT_ERROR,
false
);
}
}
Docker Deployment
Dockerfile
FROM oven/bun:1.1.38 AS runtime
WORKDIR /app
# Copy built artifacts
COPY workers/typescript/dist/ ./dist/
COPY workers/typescript/package.json ./
COPY target/release/libtasker_ts.dylib ./lib/
# Install production dependencies
RUN bun install --production
# Set environment
ENV TASKER_FFI_LIBRARY_PATH=/app/lib/libtasker_ts.dylib
ENV PORT=8081
EXPOSE 8081
CMD ["bun", "run", "dist/bin/server.js"]
Docker Compose
typescript-worker:
build:
context: .
dockerfile: docker/build/typescript-worker.Dockerfile
environment:
DATABASE_URL: postgresql://tasker:tasker@postgres:5432/tasker
TASKER_ENV: production
TASKER_TEMPLATE_PATH: /app/templates
PORT: 8081
ports:
- "8084:8081"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
interval: 10s
timeout: 5s
retries: 3
See Also
- Worker Crates Overview - High-level introduction
- Patterns and Practices - Common patterns
- Python Worker - Python implementation
- Ruby Worker - Ruby implementation
- Worker Event Systems - Architecture details
Observability Documentation
Last Updated: 2025-12-01 Audience: Operators, Developers Status: Active Related Docs: Documentation Hub | Benchmarks | Deployment Patterns | Domain Events
This directory contains documentation for monitoring, metrics, logging, and performance measurement in tasker-core.
Quick Navigation
📊 Performance & Benchmarking → ../benchmarks/
All benchmark documentation has been consolidated in the docs/benchmarks/ directory.
See: Benchmark README for:
- API performance benchmarks
- SQL function benchmarks
- Event propagation benchmarks
- End-to-end latency benchmarks
- Benchmark quick reference
- Performance targets and CI integration
Migration Note: The following files remain in this directory for historical context but are superseded by the consolidated benchmarks documentation:
- `benchmark-implementation-decision.md` - Decision rationale (archived)
- `benchmark-quick-reference.md` - Superseded by ../benchmarks/README.md
- `benchmark-strategy-summary.md` - Consolidated into benchmark-specific docs
- `benchmarking-guide.md` - SQL benchmarks moved to ../benchmarks/sql-benchmarks.md
- `phase-5.4-distributed-benchmarks-plan.md` - Implementation complete
Observability Categories
1. Metrics (metrics-*.md)
Purpose: System health, performance counters, and operational metrics
Documentation:
- metrics-reference.md - Complete metrics catalog
- metrics-verification.md - Verification procedures
- VERIFICATION_RESULTS.md - Test results and validation
Key Metrics Tracked:
- Task lifecycle events (created, started, completed, failed)
- Step execution metrics (claimed, executed, retried)
- Database operation performance (query times, cache hit rates)
- Worker health (active workers, queue depths, claim rates)
- System resource usage (memory, connections, threads)
Export Targets:
- OpenTelemetry (planned)
- Prometheus (supported)
- CloudWatch (planned)
- Datadog (planned)
Quick Reference:
// Example: Recording a metric
metrics::counter!("tasker.tasks.created").increment(1);
metrics::histogram!("tasker.step.execution_time_ms").record(elapsed_ms);
metrics::gauge!("tasker.workers.active").set(worker_count as f64);
2. Logging (logging-standards.md)
Purpose: Structured logging for debugging, audit trails, and operational visibility
Documentation:
- logging-standards.md - Logging standards and best practices
Log Levels:
- ERROR: Critical failures requiring immediate attention
- WARN: Degraded operation or retry scenarios
- INFO: Significant lifecycle events and state transitions
- DEBUG: Detailed execution flow for troubleshooting
- TRACE: Exhaustive detail for deep debugging
Structured Fields:
info!(
    task_uuid = %task_uuid,
    correlation_id = %correlation_id,
    step_name = %step_name,
    elapsed_ms = elapsed.as_millis(),
    "Step execution completed successfully"
);
Key Standards:
- Use structured logging (not string interpolation)
- Include correlation IDs for distributed tracing
- Log state transitions at INFO level
- Include timing information for performance analysis
- Sanitize sensitive data (credentials, PII)
3. Tracing and OpenTelemetry
Purpose: Distributed request tracing across services
Status: ✅ Active
Documentation:
- opentelemetry-improvements.md - Telemetry enhancements
Current Features:
- Distributed trace propagation via correlation IDs (UUIDv7)
- Span creation for major operations:
- API request handling
- Step execution (claim → execute → submit)
- Orchestration coordination
- Domain event publishing
- Message queue operations
- Two-phase FFI telemetry initialization (safe for Ruby/Python workers)
- Integration with Grafana LGTM stack (Prometheus, Tempo)
- Domain event metrics (`/metrics/events` endpoint)
Two-Phase FFI Initialization:
- Phase 1: Console-only logging (safe during FFI bridge setup)
- Phase 2: Full OpenTelemetry (after FFI established)
Example:
#[tracing::instrument(
    name = "publish_domain_event",
    skip(self, payload),
    fields(
        event_name = %event_name,
        namespace = %metadata.namespace,
        correlation_id = %metadata.correlation_id,
        delivery_mode = ?delivery_mode
    )
)]
async fn publish_event(&self, event_name: &str, ...) -> Result<()> {
    // Implementation
}
4. Health Checks
Purpose: Service health monitoring for orchestration, availability, and alerting
Endpoints:
- `GET /health` - Overall service health
- `GET /health/ready` - Readiness for traffic (K8s readiness probe)
- `GET /health/live` - Liveness check (K8s liveness probe)
Health Indicators:
- Database connection pool status
- Message queue connectivity
- Worker availability
- Circuit breaker states
- Resource utilization (memory, connections)
Response Format:
{
"status": "healthy",
"checks": {
"database": {
"status": "healthy",
"connections_active": 5,
"connections_idle": 15,
"connections_max": 20
},
"message_queue": {
"status": "healthy",
"queues_monitored": 3
},
"circuit_breakers": {
"status": "healthy",
"open_breakers": 0
}
},
"uptime_seconds": 3600
}
Observability Architecture
Component-Level Instrumentation
┌─────────────────────────────────────────────────────────────┐
│ Observability Stack │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Metrics │ │ Logs │ │ Traces │ │ Health │ │
│ │ (Counters│ │(Structured)│ │(Planned)│ │ Checks │ │
│ │Histograms│ │ JSON │ │ Spans │ │ HTTP │ │
│ │ Gauges) │ │ Fields │ │ Tags │ │ Probes │ │
│ └─────┬────┘ └─────┬────┘ └─────┬────┘ └─────┬────┘ │
│ │ │ │ │ │
└────────┼─────────────┼─────────────┼─────────────┼────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│Prometheus │ │ Loki / │ │ Jaeger / │ │ K8s │
│ OTLP │ │CloudWatch │ │ Tempo │ │ Probes │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
Instrumentation Points
Orchestration:
- Task lifecycle transitions
- Step discovery and enqueueing
- Result processing
- Finalization operations
- Database query performance
Worker:
- Step claiming
- Handler execution
- Result submission
- FFI call overhead (Ruby workers)
- Event propagation latency
Database:
- Query execution times
- Connection pool metrics
- Transaction commit latency
- Buffer cache hit ratio
Message Queue:
- Message send/receive latency
- Queue depth
- Notification propagation time
- Message processing errors
Performance Monitoring
Key Performance Indicators (KPIs)
| Metric | Target | Alert Threshold | Notes |
|---|---|---|---|
| API Response Time (p99) | < 100ms | > 200ms | User-facing latency |
| SQL Function Time (mean) | < 3ms | > 5ms | Orchestration efficiency |
| Event Propagation (p95) | < 10ms | > 20ms | Real-time coordination |
| E2E Task Completion (p99) | < 500ms | > 1000ms | End-user experience |
| Worker Claim Success Rate | > 95% | < 90% | Resource contention |
| Database Connection Pool | < 80% | > 90% | Resource exhaustion |
Monitoring Dashboards
Recommended Dashboard Panels:
- Task Throughput
  - Tasks created/min
  - Tasks completed/min
  - Tasks failed/min
  - Active tasks count
- Step Execution
  - Steps enqueued/min
  - Steps completed/min
  - Average step execution time
  - Step retry rate
- System Health
  - Worker health status
  - Database connection pool utilization
  - Circuit breaker status
  - API response times (p50, p95, p99)
- Error Rates
  - Task failures by namespace
  - Step failures by handler
  - Database errors
  - Message queue errors
Correlation and Debugging
Correlation ID Propagation
Every request generates a UUIDv7 correlation ID that flows through:
- API request → Task creation
- Task → Step enqueueing
- Step → Worker execution
- Worker → Result submission
- Result → Orchestration processing
Tracing a Request:
# Find correlation ID from task creation
curl http://localhost:8080/v1/tasks/{task_uuid} | jq .correlation_id
# Search logs across all services
docker logs orchestration 2>&1 | grep {correlation_id}
docker logs worker 2>&1 | grep {correlation_id}
# Query database for full timeline
psql $DATABASE_URL -c "
SELECT
created_at,
from_state,
to_state,
metadata->>'duration_ms' as duration
FROM tasker.task_transitions
WHERE metadata->>'correlation_id' = '{correlation_id}'
ORDER BY created_at;
"
Debug Logging
Enable debug logging for detailed execution flow:
# Docker Compose
RUST_LOG=debug docker-compose up
# Local development
RUST_LOG=tasker_worker=debug,tasker_orchestration=debug cargo run
# Specific modules
RUST_LOG=tasker_worker::worker::command_processor=trace cargo test
Best Practices
1. Structured Logging
✅ Do:
info!(
    task_uuid = %task.uuid,
    namespace = %task.namespace,
    elapsed_ms = elapsed.as_millis(),
    "Task completed successfully"
);
❌ Don’t:
info!("Task {} in namespace {} completed in {}ms",
    task.uuid, task.namespace, elapsed.as_millis());
2. Metric Naming
Use consistent, hierarchical naming:
metrics::counter!("tasker.tasks.created").increment(1);
metrics::counter!("tasker.tasks.completed").increment(1);
metrics::counter!("tasker.tasks.failed").increment(1);
metrics::histogram!("tasker.step.execution_time_ms").record(elapsed);
3. Performance Measurement
Measure at operation boundaries:
let start = Instant::now();
let result = operation().await;
let elapsed = start.elapsed();

metrics::histogram!("tasker.operation.duration_ms")
    .record(elapsed.as_millis() as f64);

info!(
    operation = "operation_name",
    elapsed_ms = elapsed.as_millis(),
    success = result.is_ok(),
    "Operation completed"
);
4. Error Context
Include rich context in errors:
error!(
    task_uuid = %task_uuid,
    step_uuid = %step_uuid,
    error = %err,
    retry_count = retry_count,
    "Step execution failed, will retry"
);
Tools and Integration
Development Tools
Metrics Visualization:
# Prometheus (if configured)
open http://localhost:9090
# Grafana (if configured)
open http://localhost:3000
Log Aggregation:
# Docker Compose logs
docker-compose -f docker/docker-compose.test.yml logs -f
# Specific service
docker-compose -f docker/docker-compose.test.yml logs -f orchestration
# JSON parsing
docker-compose logs orchestration | jq 'select(.level == "ERROR")'
Production Tools (Planned)
- Metrics: Prometheus + Grafana / DataDog / CloudWatch
- Logs: Loki / CloudWatch Logs / Splunk
- Traces: Jaeger / Tempo / Honeycomb
- Alerts: AlertManager / PagerDuty / Opsgenie
Related Documentation
- Benchmarks: ../benchmarks/README.md
- SQL Functions: ../task-and-step-readiness-and-execution.md
File Organization
Current Files
Active:
- `metrics-reference.md` - Complete metrics catalog
- `metrics-verification.md` - Verification procedures
- `logging-standards.md` - Logging best practices
- `opentelemetry-improvements.md` - Telemetry enhancements
- `VERIFICATION_RESULTS.md` - Test results
Archived (superseded by docs/benchmarks/):
- `benchmark-implementation-decision.md`
- `benchmark-quick-reference.md`
- `benchmark-strategy-summary.md`
- `benchmarking-guide.md`
- `phase-5.4-distributed-benchmarks-plan.md`
Recommended Cleanup
Move benchmark files to docs/archive/ or delete:
# Option 1: Archive
mkdir -p docs/archive/benchmarks
mv docs/observability/benchmark-*.md docs/archive/benchmarks/
mv docs/observability/phase-5.4-*.md docs/archive/benchmarks/
# Option 2: Delete (information consolidated)
rm docs/observability/benchmark-*.md
rm docs/observability/phase-5.4-*.md
Contributing
When adding observability instrumentation:
- Follow standards: Use structured logging and consistent metric naming
- Include context: Add correlation IDs and relevant metadata
- Document metrics: Update metrics-reference.md with new metrics
- Test instrumentation: Verify metrics and logs in development
- Consider performance: Avoid expensive operations in hot paths
Benchmark Audit & Profiling Plan
Created: 2025-10-09 Status: 📋 Planning Purpose: Audit existing benchmarks, establish profiling tooling, baseline before Actor/Services refactor
Executive Summary
Before refactoring tasker-orchestration/src/orchestration/lifecycle/ to Actor/Services pattern, we need to:
- Audit Benchmarks: Review which benchmarks are implemented vs placeholders
- Clean Up: Remove or complete placeholder benchmarks
- Establish Profiling: Set up flamegraph/samply tooling
- Baseline Profiles: Capture performance profiles for comparison post-refactor
Current Status: We have working SQL and E2E benchmarks but several placeholder component benchmarks that need decisions.
Benchmark Inventory
✅ Working & Complete Benchmarks
1. SQL Function Benchmarks
- Location:
tasker-shared/benches/sql_functions.rs - Status: ✅ Complete, Compiles, Well-documented
- Coverage:
- `get_next_ready_tasks()` (4 batch sizes)
- `get_step_readiness_status()` (5 diverse samples)
- `transition_task_state_atomic()` (5 samples)
- `get_task_execution_context()` (5 samples)
- `get_step_transitive_dependencies()` (10 samples)
- Documentation:
docs/observability/benchmarking-guide.md - Run Command:
cargo bench --package tasker-shared --features benchmarks
2. Event Propagation Benchmarks
- Location:
tasker-shared/benches/event_propagation.rs - Status: ✅ Complete, Compiles
- Coverage: PostgreSQL LISTEN/NOTIFY event propagation
- Run Command:
cargo bench --package tasker-shared --features benchmarks event_propagation
3. Task Initialization Benchmarks
- Location:
tasker-client/benches/task_initialization.rs - Status: ✅ Complete, Compiles
- Coverage: API task creation latency
- Run Command:
export SQLX_OFFLINE=true
cargo bench --package tasker-client --features benchmarks task_initialization
4. End-to-End Workflow Latency Benchmarks
- Location:
tests/benches/e2e_latency.rs - Status: ✅ Complete, Compiles
- Coverage: Complete workflow execution (API → Result)
- Linear workflow (Ruby FFI)
- Diamond workflow (Ruby FFI)
- Linear workflow (Rust native)
- Diamond workflow (Rust native)
- Prerequisites: Docker Compose services running
- Run Command:
export SQLX_OFFLINE=true
cargo bench --bench e2e_latency
⚠️ Placeholder Benchmarks (Need Decision)
5. Orchestration Benchmarks
- Location:
tasker-orchestration/benches/ - Files:
- `orchestration_benchmarks.rs` - Empty placeholder
- `step_enqueueing.rs` - Placeholder with documentation
- Status: Not implemented
- Documented Intent: Measure orchestration coordination latency
- Challenges:
- Requires triggering orchestration cycle without full execution
- Need step discovery measurement isolation
- Queue publishing and notification overhead breakdown
6. Worker Benchmarks
- Location:
tasker-worker/benches/ - Files:
- `worker_benchmarks.rs` - Empty placeholder
- `worker_execution.rs` - Placeholder with documentation
- `handler_overhead.rs` - Placeholder with documentation
- Status: Not implemented
- Documented Intent:
- Worker processing cycle (claim, execute, submit)
- Framework overhead vs pure handler execution
- Ruby FFI overhead measurement
- Challenges:
- Need pre-enqueued steps in test queues
- Noop handler implementations for baseline
- Breakdown metrics for each phase
Recommendations
Option 1: Keep Placeholders for Future Work ✅ RECOMMENDED
Rationale:
- Phase 5.4 distributed benchmarks are documented but complex to implement
- E2E benchmarks (`e2e_latency.rs`) already provide full workflow metrics
- Actor/Services refactor is more urgent than distributed component benchmarks
Action:
- Keep placeholder files with clear “NOT IMPLEMENTED” status
- Update comments to reference this audit document
- Future ticket (post-refactor) can implement if needed
Option 2: Remove Placeholders
Rationale:
- Reduce confusion about benchmark status
- E2E benchmarks already cover end-to-end latency
- SQL benchmarks cover database hot paths
Action:
- Delete placeholder bench files
- Document decision in this file
- Can recreate later if specific component isolation needed
Option 3: Implement Placeholders Now
Rationale:
- Complete benchmark suite before refactor
- Better baseline data for Actor/Services comparison
Concerns:
- 2-3 days implementation effort
- Delays Actor/Services refactor
- May need re-implementation post-refactor anyway
Decision: Option 1 (Keep Placeholders, Document Status)
We have sufficient benchmarking coverage:
- ✅ SQL functions (hot path queries)
- ✅ E2E workflows (user-facing latency)
- ✅ Event propagation (LISTEN/NOTIFY)
- ✅ Task initialization (API latency)
What’s Missing:
- Component-level orchestration breakdown (not critical for refactor)
- Worker cycle breakdown (available via OpenTelemetry traces)
- Framework overhead measurement (nice-to-have, not blocking)
Action Items:
- Update placeholder comments with “Status: Planned for future implementation”
- Reference this document for implementation guidance
- Move forward with profiling and refactor
Profiling Tooling Setup
Goals
- Identify Inefficiencies: Find hot spots in lifecycle code
- Establish Baseline: Profile before Actor/Services refactor
- Compare Post-Refactor: Validate performance impact of refactor
- Continuous Profiling: Enable ongoing performance analysis
Tool Selection
Primary: samply (macOS-friendly)
- GitHub: https://github.com/mstange/samply
- Advantages:
- Native macOS support (uses Instruments)
- Interactive web UI for flamegraphs
- Low overhead
- Works with release builds
- Use Case: Development profiling on macOS
Secondary: flamegraph (CI/production)
- GitHub: https://github.com/flamegraph-rs/flamegraph
- Advantages:
- Linux support (perf-based)
- SVG output for CI artifacts
- Well-established in Rust ecosystem
- Use Case: CI profiling, Linux production analysis
Tertiary: cargo-flamegraph (Convenience)
- Cargo Plugin: Wraps flamegraph-rs
- Advantages:
- Single command profiling
- Automatic symbol resolution
- Use Case: Quick local profiling
Installation
macOS Setup (samply)
# Install samply
cargo install samply
# macOS requires SIP adjustment for sampling (one-time setup)
# https://github.com/mstange/samply#macos-permissions
# Verify installation
samply --version
Linux Setup (flamegraph)
# Install prerequisites (Ubuntu/Debian)
sudo apt-get install linux-tools-common linux-tools-generic
# Install flamegraph
cargo install flamegraph
# Allow perf without sudo (optional)
echo 'kernel.perf_event_paranoid=-1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Verify installation
flamegraph --version
Cross-Platform (cargo-flamegraph)
# Install cargo-flamegraph
cargo install cargo-flamegraph
# Verify installation
cargo flamegraph --version
Profiling Workflows
1. Profile E2E Benchmark (Recommended for Baseline)
Captures the entire workflow execution including orchestration lifecycle:
# macOS
samply record cargo bench --bench e2e_latency -- --profile-time=60
# Linux
cargo flamegraph --bench e2e_latency -- --profile-time=60
# Output: Interactive flamegraph showing hot paths
What to Look For:
- Time spent in `lifecycle/` modules (task_initializer, step_enqueuer, result_processor, etc.)
- Serialization/deserialization overhead
- Lock contention (should be minimal with our architecture)
2. Profile SQL Benchmarks
Isolates database performance:
# Profile just SQL function benchmarks
samply record cargo bench --package tasker-shared --features benchmarks sql_functions
# Output: Shows PostgreSQL function overhead
What to Look For:
- Time in `sqlx` query execution
- Query planning time (shouldn’t be visible if using prepared statements)
3. Profile Integration Tests (Realistic Workload)
Profile actual test execution for realistic patterns:
# Profile a specific integration test
samply record cargo test --test e2e_tests e2e::rust::simple_integration_tests::test_linear_workflow
# Profile all integration tests (longer run)
samply record cargo test --test e2e_tests --all-features
What to Look For:
- Initialization overhead
- Test setup time vs actual execution time
- Repeated patterns across tests
4. Profile Specific Lifecycle Components
Isolate specific modules for deep analysis:
# Example: Profile only result processing
samply record cargo test --package tasker-orchestration --test lifecycle_integration_tests \
test_result_processing_updates_task_state --all-features -- --nocapture
# Or profile a unit test for a specific function
samply record cargo test --package tasker-orchestration \
result_processor::tests::test_process_step_result_success --all-features
Baseline Profiling Plan
Phase 1: Capture Pre-Refactor Baselines (Day 1)
Goal: Establish performance baseline of current lifecycle code before Actor/Services refactor
# 1. Clean build
cargo clean
cargo build --release --all-features
# 2. Profile E2E benchmarks (primary baseline)
samply record --output=baseline-e2e-pre-refactor.json \
cargo bench --bench e2e_latency
# 3. Profile SQL benchmarks
samply record --output=baseline-sql-pre-refactor.json \
cargo bench --package tasker-shared --features benchmarks
# 4. Profile specific lifecycle operations
samply record --output=baseline-task-init-pre-refactor.json \
cargo test --package tasker-orchestration \
lifecycle::task_initializer::tests --all-features
samply record --output=baseline-step-enqueue-pre-refactor.json \
cargo test --package tasker-orchestration \
lifecycle::step_enqueuer::tests --all-features
samply record --output=baseline-result-processor-pre-refactor.json \
cargo test --package tasker-orchestration \
lifecycle::result_processor::tests --all-features
Deliverables (completed, profiles removed — superseded by cluster benchmarks):
- Baseline profile files in `profiles/pre-refactor/` (removed)
- Performance baselines now in `docs/benchmarks/README.md`
Phase 2: Identify Optimization Opportunities (Day 1)
Goal: Document current performance characteristics to preserve in refactor
Analysis Checklist:
- ✅ Time spent in each lifecycle module (task_initializer, step_enqueuer, etc.)
- ✅ Database query time breakdown
- ✅ Serialization overhead (JSON, MessagePack)
- ✅ Lock contention points (if any)
- ✅ Unnecessary allocations or clones
- ✅ Recursive call depth
Document Findings:
Performance baselines are now documented in docs/benchmarks/README.md.
The original lifecycle-performance-baseline.md was removed — its measurements had
data quality issues and the refactor it targeted is complete.
Phase 3: Post-Refactor Validation (After Refactor)
Goal: Validate Actor/Services refactor maintains or improves performance
# Re-run same profiling commands after refactor
samply record --output=baseline-e2e-post-refactor.json \
cargo bench --bench e2e_latency
# Compare baselines
# (samply doesn't have built-in diff, use manual comparison)
Success Criteria:
- E2E latency: Within 10% of baseline (preferably faster)
- SQL latency: Unchanged (no regression from refactor)
- Lifecycle operation time: Within 20% of baseline
- No new hot paths or contention points
Regression Signals:
- E2E latency >20% slower
- New allocations/clones in hot paths
- Increased lock contention
- Message passing overhead >5% of total time
Profiling Best Practices
1. Use Release Builds
# Always profile release builds (--release flag)
cargo build --release --all-features
samply record cargo bench --bench e2e_latency
Rationale: Debug builds have 10-100x overhead that masks real performance issues
2. Run Multiple Times
# Run 3 times, compare consistency
for i in {1..3}; do
samply record --output=profile-$i.json cargo bench --bench e2e_latency
done
Rationale: Catch warm-up effects, JIT compilation, cache behavior
3. Isolate Interference
# Close other applications
# Disable background processes (Spotlight, backups)
# Use consistent hardware (don't profile on battery power)
4. Focus on Hot Paths
80/20 Rule: 80% of time is spent in 20% of code
Priority Order:
- Top 5 functions by time (>5% each)
- Recursive calls (can amplify overhead)
- Locks and synchronization (contention multiplies)
- Allocations in loops (O(n) becomes visible)
5. Benchmark-Driven Profiling
Always profile realistic workloads:
- ✅ E2E benchmarks (represents user experience)
- ✅ Integration tests (real workflow patterns)
- ❌ Unit tests (too isolated, not representative)
Flamegraph Interpretation
Reading Flamegraphs
┌─────────────────────────────────────────────┐ ← Total Program Time (100%)
│ │
│ ┌────────────────┐ ┌─────────────────┐ │
│ │ Database Ops │ │ Serialization │ │ ← High-level Operations (60%)
│ │ (30%) │ │ (30%) │ │
│ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌───────────┐ │ │
│ │ │ SQL Exec │ │ │ │ JSON Ser │ │ │ ← Leaf Operations (25%)
│ │ │ (25%) │ │ │ │ (20%) │ │ │
│ └──┴──────────┴──┘ └──┴───────────┴─┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Business Logic (20%) │ │ ← Remaining Time
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
Width = Time spent in function (including children) Height = Call stack depth Color = Function group (can be customized)
Key Patterns
1. Wide Flat Bars = Hot Path
┌───────────────────────────────────────┐
│ step_enqueuer::enqueue_ready_steps() │ ← 40% of total time
└───────────────────────────────────────┘
Action: Optimize this function
2. Deep Call Stack = Recursion/Abstractions
┌─────────────────────────┐
│ process_dependencies() │
│ ┌─────────────────────┐│
│ │ resolve_deps() ││
│ │ ┌─────────────────┐││
│ │ │ check_ready() │││
│ │ └─────────────────┘││
│ └─────────────────────┘│
└─────────────────────────┘
Action: Consider flattening or caching
3. Many Narrow Bars = Fragmentation
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│A│B│C│D│E│F│G│H│I│J│K│L│M│ ← Many small functions
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
Action: Not necessarily bad (may be inlining), but check if overhead-heavy
Integration with CI
GitHub Actions Workflow (Future Enhancement)
# .github/workflows/profile-benchmarks.yml
name: Profile Benchmarks
on:
pull_request:
paths:
- 'tasker-orchestration/src/orchestration/lifecycle/**'
- 'tasker-shared/src/**'
jobs:
profile:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install flamegraph
run: cargo install flamegraph
- name: Profile benchmarks
run: |
cargo flamegraph --bench e2e_latency -- --profile-time=60 -o flamegraph.svg
- name: Upload flamegraph
uses: actions/upload-artifact@v3
with:
name: flamegraph
path: flamegraph.svg
- name: Compare with baseline
run: |
# TODO: Implement baseline comparison
# Download previous flamegraph, compare hot paths
Documentation Structure
Created Documents
- This Document: `docs/observability/benchmark-audit-and-profiling-plan.md`
  - Benchmark inventory
  - Profiling tooling setup
  - Baseline capture plan
- Existing: `docs/observability/benchmarking-guide.md`
  - SQL benchmark documentation
  - Running instructions
  - Performance expectations
- `docs/observability/lifecycle-performance-baseline.md` (removed; superseded by `docs/benchmarks/README.md`)
Next Steps
Before Actor/Services Refactor
- ✅ Audit Complete: Documented benchmark status
- ⏳ Install Profiling Tools:
cargo install samply      # macOS
cargo install flamegraph  # Linux
- Run profiling plan Phase 1
- Generate flamegraphs
- Document hot paths
- ✅ Baseline Document: Superseded by
docs/benchmarks/README.md
During Actor/Services Refactor
- Incremental Profiling: Profile after each major component conversion
- Compare Baselines: Ensure no performance regressions
- Document Changes: Note architectural changes affecting performance
After Actor/Services Refactor
- Full Re-Profile: Run profiling plan Phase 3
- Comparison Analysis: Document performance changes
- Update Documentation: Reflect new architecture
- Benchmark Updates: Update benchmarks if Actor/Services changes measurement approach
Summary
Current State:
- ✅ SQL benchmarks working
- ✅ E2E benchmarks working
- ✅ Event propagation benchmarks working
- ✅ Task initialization benchmarks working
- ⚠️ Component benchmarks are placeholders (OK for now)
Decision:
- Keep placeholder benchmarks for future work
- Move forward with profiling and baseline capture
- Sufficient coverage to validate Actor/Services refactor
Action Plan:
- Install profiling tools (samply/flamegraph)
- Capture pre-refactor baselines (1 day)
- Document current hot paths
- Proceed with Actor/Services refactor
- Validate post-refactor performance
Success Criteria:
- Baseline profiles captured
- Hot paths documented
- Post-refactor validation plan established
- No performance regressions from refactor
Benchmark Implementation Decision: Event-Driven + E2E Focus
Date: 2025-10-08 Decision: Focus on event propagation and E2E benchmarks; infer worker metrics from traces
Context
Original Phase 5.4 plan included 7 benchmark categories:
- ✅ API Task Creation
- 🚧 Worker Processing Cycle
- ✅ Event Propagation
- 🚧 Step Enqueueing
- 🚧 Handler Overhead
- ✅ SQL Functions
- ✅ E2E Latency
Architectural Challenge: Worker Benchmarking
Problem: Direct worker benchmarking doesn’t match production reality
In a distributed system with multiple workers:
- ❌ Can’t predict which worker will claim which step
- ❌ Can’t control step distribution across workers
- ❌ Artificial scenarios required to direct specific steps to specific workers
- ❌ API queries would need to know which worker to query (unknowable in advance)
Example:
Task with 10 steps across 3 workers:
- Worker A might claim steps 1, 3, 7
- Worker B might claim steps 2, 5, 6, 9
- Worker C might claim steps 4, 8, 10
Which worker do you benchmark? How do you ensure consistent measurement?
Decision: Focus on Observable Metrics
✅ What We WILL Measure Directly
1. Event Propagation (tasker-shared/benches/event_propagation.rs)
Status: ✅ IMPLEMENTED
Measures: PostgreSQL LISTEN/NOTIFY round-trip latency
Approach:
// Setup listener on test channel
listener.listen("pgmq_message_ready.benchmark_event_test").await;

// Send message with notify
let send_time = Instant::now();
sqlx::query("SELECT pgmq_send_with_notify(...)").execute(&pool).await;

// Measure until listener receives
let received_at = listener.recv().await;
let latency = received_at.duration_since(send_time);
Why it works:
- Observable from outside the system
- Deterministic measurement (single listener, single sender)
- Matches production behavior (real LISTEN/NOTIFY path)
- Critical for worker responsiveness
Expected Performance: < 5-10ms p95
2. End-to-End Latency (tests/benches/e2e_latency.rs)
Status: ✅ IMPLEMENTED
Measures: Complete workflow execution (API → Task Complete)
Approach:
// Create task
let response = client.create_task(request).await;
let start = Instant::now();

// Poll for completion
loop {
    let task = client.get_task(task_uuid).await;
    if task.execution_status == "AllComplete" {
        return start.elapsed();
    }
    tokio::time::sleep(Duration::from_millis(50)).await;
}
Why it works:
- Measures user experience (submit → result)
- Naturally includes ALL system overhead:
- API processing
- Database writes
- Message queue latency
- Worker claim/execute/submit (embedded in total time)
- Event propagation
- Orchestration coordination
- No need to know which workers executed which steps
- Reflects real production behavior
Expected Performance:
- Linear (3 steps): < 500ms p99
- Diamond (4 steps): < 800ms p99
📊 What We WILL Infer from Traces
Worker-Level Breakdown via OpenTelemetry
Instead of direct benchmarking, use existing OpenTelemetry instrumentation:
# Query traces by correlation_id from E2E benchmark
curl "http://localhost:16686/api/traces?service=tasker-worker&tags=correlation_id:abc-123"
# Extract span timings:
{
"spans": [
{"operationName": "step_claim", "duration": 15ms},
{"operationName": "execute_handler", "duration": 42ms}, // Business logic
{"operationName": "submit_result", "duration": 23ms}
]
}
Advantages:
- ✅ Works across distributed workers (correlation ID links everything)
- ✅ Captures real production behavior (actual task execution)
- ✅ Breaks down by step type (different handlers have different timing)
- ✅ Shows which worker processed each step
- ✅ Already instrumented (Phase 3.3 work)
Metrics Available:
- `step_claim_duration` - Time to claim step from queue
- `handler_execution_duration` - Time to execute handler logic
- `result_submission_duration` - Time to submit result back
- `ffi_overhead` - Rust vs Ruby handler comparison
🚧 Benchmarks NOT Implemented (By Design)
Worker Processing Cycle (tasker-worker/benches/worker_execution.rs)
Status: 🚧 Skeleton only (placeholder)
Why not implemented:
- Requires artificial pre-arrangement of which worker claims which step
- Doesn’t match production (multiple workers competing for steps)
- Metrics available via OpenTelemetry traces instead
Alternative: Query traces for step_claim → execute_handler → submit_result span timing
Step Enqueueing (tasker-orchestration/benches/step_enqueueing.rs)
Status: 🚧 Skeleton only (placeholder)
Why not implemented:
- Difficult to trigger orchestration step discovery without full execution
- Result naturally embedded in E2E latency measurement
- Coordination overhead visible in E2E timing
Alternative: E2E benchmark includes step enqueueing naturally
Handler Overhead (tasker-worker/benches/handler_overhead.rs)
Status: 🚧 Skeleton only (placeholder)
Why not implemented:
- FFI overhead varies by handler type (can’t benchmark in isolation)
- Real overhead visible in E2E benchmark + traces
- Rust vs Ruby comparison available via trace analysis
Alternative: Compare handler_execution_duration spans for Rust vs Ruby handlers in traces
Implementation Summary
✅ Complete Benchmarks (4/7)
| Benchmark | Status | Measures | Run Command |
|---|---|---|---|
| SQL Functions | ✅ Complete | PostgreSQL function performance | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks sql_functions |
| Task Initialization | ✅ Complete | API task creation latency | cargo bench -p tasker-client --features benchmarks |
| Event Propagation | ✅ Complete | LISTEN/NOTIFY round-trip | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks event_propagation |
| E2E Latency | ✅ Complete | Complete workflow execution | cargo bench --test e2e_latency |
🚧 Placeholder Benchmarks (3/7)
| Benchmark | Status | Alternative Measurement |
|---|---|---|
| Worker Execution | 🚧 Placeholder | OpenTelemetry traces (correlation ID) |
| Step Enqueueing | 🚧 Placeholder | Embedded in E2E latency |
| Handler Overhead | 🚧 Placeholder | OpenTelemetry span comparison (Rust vs Ruby) |
Advantages of This Approach
1. Matches Production Reality
- E2E benchmark reflects actual user experience
- No artificial worker pre-arrangement required
- Measures real distributed system behavior
2. Complete Coverage
- E2E latency includes ALL components naturally
- OpenTelemetry provides worker-level breakdown
- Event propagation measures critical notification path
3. Lower Maintenance
- Fewer benchmarks to maintain
- No complex setup for worker isolation
- Traces provide flexible analysis
4. Better Insights
- Correlation IDs link entire workflow across services
- Can analyze timing for ANY task in production
- Breakdown available on-demand via trace queries
How to Use This System
Running Performance Analysis
Step 1: Run E2E benchmark
cargo bench --test e2e_latency
Step 2: Extract correlation_id from benchmark output
Created task: abc-123-def-456 (correlation_id: xyz-789)
Step 3: Query traces for breakdown
# Jaeger UI or API
curl "http://localhost:16686/api/traces?tags=correlation_id:xyz-789"
Step 4: Analyze span timing
{
"spans": [
{"service": "orchestration", "operation": "create_task", "duration": 18ms},
{"service": "orchestration", "operation": "enqueue_steps", "duration": 12ms},
{"service": "worker", "operation": "step_claim", "duration": 15ms},
{"service": "worker", "operation": "execute_handler", "duration": 42ms},
{"service": "worker", "operation": "submit_result", "duration": 23ms},
{"service": "orchestration", "operation": "process_result", "duration": 8ms}
]
}
Total E2E: ~118ms (matches benchmark) Worker overhead: 15ms + 23ms = 38ms (claim + submit, excluding business logic)
Recommendations
Completion Criteria
✅ Complete with 4 working benchmarks:
- SQL Functions
- Task Initialization
- Event Propagation
- E2E Latency
📋 Document that worker-level metrics come from OpenTelemetry
For Future Enhancement
If direct worker benchmarking becomes necessary:
- Use single-worker mode Docker Compose configuration
- Pre-create tasks with known step assignments
- Query specific worker API for deterministic steps
- Document as synthetic benchmark (not matching production)
For Production Monitoring
Use OpenTelemetry for ongoing performance analysis:
- Set up trace retention (7-30 days)
- Create Grafana dashboards for span timing
- Alert on p95 latency increases
- Analyze slow workflows via correlation ID
Conclusion
Decision: Focus on event propagation and E2E latency benchmarks, use OpenTelemetry traces for worker-level breakdown.
Rationale: Matches production reality, provides complete coverage, lower maintenance, better insights.
Status: ✅ 4/4 practical benchmarks implemented and working
Benchmark Quick Reference Guide
Last Updated: 2025-10-08
Quick commands for running all benchmarks in the distributed benchmarking suite.
Prerequisites
# Start all Docker services
docker-compose -f docker/docker-compose.test.yml up -d
# Verify services are healthy
curl http://localhost:8080/health # Orchestration
curl http://localhost:8081/health # Rust Worker
curl http://localhost:8082/health # Ruby Worker (optional)
# Set database URL (for SQL benchmarks)
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
Individual Benchmarks
✅ Implemented Benchmarks
# 1. API Task Creation (COMPLETE - 17.7-20.8ms)
cargo bench --package tasker-client --features benchmarks
# 2. SQL Function Performance (COMPLETE - 380µs-2.93ms)
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo bench --package tasker-shared --features benchmarks sql_functions
🚧 Placeholder Benchmarks
# 3. Event Propagation (placeholder)
cargo bench --package tasker-shared --features benchmarks event_propagation
# 4. Worker Execution (placeholder)
cargo bench --package tasker-worker --features benchmarks worker_execution
# 5. Handler Overhead (placeholder)
cargo bench --package tasker-worker --features benchmarks handler_overhead
# 6. Step Enqueueing (placeholder)
cargo bench --package tasker-orchestration --features benchmarks step_enqueueing
# 7. End-to-End Latency (placeholder)
cargo bench --test e2e_latency
Run All Benchmarks
# Run ALL benchmarks (implemented + placeholders)
cargo bench --all-features
# Run only SQL benchmarks
cargo bench --package tasker-shared --features benchmarks
# Run only worker benchmarks
cargo bench --package tasker-worker --features benchmarks
Benchmark Categories
| Category | Package | Benchmark Name | Status | Run Command |
|---|---|---|---|---|
| API | tasker-client | task_initialization | ✅ Complete | cargo bench -p tasker-client --features benchmarks |
| SQL | tasker-shared | sql_functions | ✅ Complete | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks sql_functions |
| Events | tasker-shared | event_propagation | 🚧 Placeholder | cargo bench -p tasker-shared --features benchmarks event_propagation |
| Worker | tasker-worker | worker_execution | 🚧 Placeholder | cargo bench -p tasker-worker --features benchmarks worker_execution |
| Worker | tasker-worker | handler_overhead | 🚧 Placeholder | cargo bench -p tasker-worker --features benchmarks handler_overhead |
| Orchestration | tasker-orchestration | step_enqueueing | 🚧 Placeholder | cargo bench -p tasker-orchestration --features benchmarks |
| E2E | tests | e2e_latency | 🚧 Placeholder | cargo bench --test e2e_latency |
Benchmark Output Locations
# Criterion HTML reports
open target/criterion/report/index.html
# Individual benchmark data
ls target/criterion/
# Proposed: Structured logs (not yet implemented)
# tmp/benchmarks/YYYY-MM-DD-benchmark-name.log
Common Options
# Save baseline for comparison
cargo bench --features benchmarks -- --save-baseline main
# Compare to baseline
cargo bench --features benchmarks -- --baseline main
# Verbose output
cargo bench --features benchmarks -- --verbose
# Run specific benchmark
cargo bench --package tasker-client --features benchmarks task_creation_api
# Skip health checks (CI mode)
TASKER_TEST_SKIP_HEALTH_CHECK=true cargo bench --features benchmarks
Troubleshooting
“Services must be running”
# Start Docker services
docker-compose -f docker/docker-compose.test.yml up -d
# Check service health
curl http://localhost:8080/health
“DATABASE_URL must be set”
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
“Task template not found”
# Ensure worker services are running (they register templates)
docker-compose -f docker/docker-compose.test.yml ps
# Check registered templates
curl -s http://localhost:8080/v1/handlers | jq
Compilation errors
# Clean and rebuild
cargo clean
cargo build --all-features
Performance Targets
| Benchmark | Metric | Target | Current | Status |
|---|---|---|---|---|
| Task Init (linear) | mean | < 50ms | 17.7ms | ✅ 3x better |
| Task Init (diamond) | mean | < 75ms | 20.8ms | ✅ 3.6x better |
| SQL Task Discovery | mean | < 3ms | 1.75-2.93ms | ✅ Pass |
| SQL Step Readiness | mean | < 1ms | 440-603µs | ✅ Pass |
| Worker Total Overhead | mean | < 60ms | TBD | 🚧 |
| Event Notify (p95) | p95 | < 10ms | TBD | 🚧 |
| Step Enqueue (3 steps) | mean | < 50ms | TBD | 🚧 |
| E2E Complete (3 steps) | p99 | < 500ms | TBD | 🚧 |
Documentation
- Full Strategy: benchmark-strategy-summary.md
- Implementation Plan: phase-5.4-distributed-benchmarks-plan.md
- SQL Benchmark Guide: benchmarking-guide.md
Distributed Benchmarking Strategy
Status: 🎯 Framework Complete | Implementation In Progress Last Updated: 2025-10-08
Overview
Complete benchmarking infrastructure for measuring distributed system performance across all components.
Benchmark Suite Structure
✅ Implemented
1. API Task Creation (tasker-client/benches/task_initialization.rs)
Status: ✅ COMPLETE - Fully implemented and tested
Measures:
- HTTP request → task initialized latency
- Task record creation in PostgreSQL
- Initial step discovery from template
- Response generation and serialization
Results (2025-10-08):
Linear (3 steps): 17.7ms (Target: < 50ms) ✅ 3x better than target
Diamond (4 steps): 20.8ms (Target: < 75ms) ✅ 3.6x better than target
Run Command:
cargo bench --package tasker-client --features benchmarks
2. SQL Function Performance (tasker-shared/benches/sql_functions.rs)
Status: ✅ COMPLETE - Fully implemented (Phase 5.2)
Measures:
- 6 critical PostgreSQL function benchmarks
- Intelligent stratified sampling (5-10 diverse samples per function)
- EXPLAIN ANALYZE query plan analysis (run once per function)
Results (from Phase 5.2):
Task discovery: 1.75-2.93ms (O(1) scaling!)
Step readiness: 440-603µs (37% variance captured)
State transitions: ~380µs (±5% variance)
Task execution context: 448-559µs
Step dependencies: 332-343µs
Query plan buffer hit: 100% (all functions)
Run Command:
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo bench --package tasker-shared --features benchmarks sql_functions
🚧 Placeholders (Ready for Implementation)
3. Worker Processing Cycle (tasker-worker/benches/worker_execution.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- Claim: PGMQ read + atomic claim
- Execute: Handler execution (framework overhead)
- Submit: Result serialization + HTTP submit
- Total: Complete worker cycle
Targets:
- Claim: < 20ms
- Execute (noop): < 10ms
- Submit: < 30ms
- Total overhead: < 60ms
Implementation Requirements:
- Pre-enqueued steps in namespace queues
- Worker client with breakdown metrics
- Multiple handler types (noop, calculation, database)
- Accurate timestamp collection for each phase
Run Command (when implemented):
cargo bench --package tasker-worker --features benchmarks worker_execution
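As a sketch of the per-phase timestamp collection listed in the requirements above, the claim/execute/submit cycle can be wrapped in a small timing helper. The three async blocks are hypothetical stand-ins for the real PGMQ claim, handler dispatch, and result submission calls:
use std::time::{Duration, Instant};
struct CyclePhases {
    claim: Duration,
    execute: Duration,
    submit: Duration,
}
/// Time a single async phase.
async fn timed<F, T>(fut: F) -> (T, Duration)
where
    F: std::future::Future<Output = T>,
{
    let start = Instant::now();
    let out = fut.await;
    (out, start.elapsed())
}
async fn measure_cycle() -> CyclePhases {
    // Hypothetical stand-ins for the real worker client calls.
    let (_claimed, claim) = timed(async { /* PGMQ read + atomic claim */ }).await;
    let (_result, execute) = timed(async { /* dispatch noop handler */ }).await;
    let (_ack, submit) = timed(async { /* serialize result + HTTP submit */ }).await;
    CyclePhases { claim, execute, submit }
}
Criterion could then report claim, execute, and submit distributions separately instead of a single opaque total.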
4. Event Propagation (tasker-shared/benches/event_propagation.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- PostgreSQL LISTEN/NOTIFY latency
- PGMQ pgmq_send_with_notify overhead
- Event system framework overhead
Targets:
- p50: < 5ms
- p95: < 10ms
- p99: < 20ms
Implementation Requirements:
- PostgreSQL LISTEN connection setup
- PGMQ notification channel configuration
- Concurrent listener with timestamp correlation
- Accurate cross-thread time measurement
Run Command (when implemented):
cargo bench --package tasker-shared --features benchmarks event_propagation
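A minimal, self-contained sketch of the round-trip measurement described above, using sqlx's PgListener. The channel name and iteration count are illustrative; the real benchmark would drive this through Criterion and the PGMQ notification channels:
use sqlx::postgres::{PgListener, PgPoolOptions};
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
    let pool = PgPoolOptions::new().max_connections(2).connect(&url).await?;

    // Dedicated LISTEN connection, as the implementation requirements call for.
    let mut listener = PgListener::connect(&url).await?;
    listener.listen("bench_event_propagation").await?;

    let mut samples = Vec::new();
    for _ in 0..100 {
        let start = Instant::now();
        // NOTIFY from a separate connection (the pool), then wait for delivery.
        sqlx::query("SELECT pg_notify('bench_event_propagation', 'ping')")
            .execute(&pool)
            .await?;
        let _notification = listener.recv().await?;
        samples.push(start.elapsed());
    }

    samples.sort();
    println!("p50: {:?}", samples[samples.len() / 2]);
    println!("p95: {:?}", samples[samples.len() * 95 / 100]);
    Ok(())
}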
5. Step Enqueueing (tasker-orchestration/benches/step_enqueueing.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- Ready step discovery (SQL query time)
- Queue publishing (PGMQ write time)
- Notification overhead (LISTEN/NOTIFY)
- Total orchestration coordination
Targets:
- 3-step workflow: < 50ms
- 10-step workflow: < 100ms
- 50-step workflow: < 500ms
Implementation Requirements:
- Pre-created tasks with dependency chains
- Orchestration client with result processing trigger
- Queue polling to detect enqueued steps
- Breakdown metrics (discovery, publish, notify)
Challenge: Triggering step discovery without full workflow execution
Run Command (when implemented):
cargo bench --package tasker-orchestration --features benchmarks step_enqueueing
6. Handler Overhead (tasker-worker/benches/handler_overhead.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- Pure Rust handler (baseline - direct call)
- Rust handler via framework (dispatch overhead)
- Ruby handler via FFI (FFI boundary cost)
Targets:
- Pure Rust: < 1µs (baseline)
- Via Framework: < 1ms
- Ruby FFI: < 5ms
Implementation Requirements:
- Noop handler implementations (Rust + Ruby)
- Direct function call benchmarks
- Framework dispatch overhead measurement
- FFI bridge overhead measurement
Run Command (when implemented):
cargo bench --package tasker-worker --features benchmarks handler_overhead
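Until the real benchmark exists, the framework-dispatch portion can be sketched with Criterion alone: compare a direct noop call against dynamic dispatch through a boxed closure, as a rough stand-in for registry dispatch. The Ruby FFI leg requires the actual bridge and is not shown:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn noop_handler(input: u64) -> u64 {
    input
}

fn bench_handler_dispatch(c: &mut Criterion) {
    let boxed: Box<dyn Fn(u64) -> u64> = Box::new(noop_handler);

    c.bench_function("handler/direct_call", |b| {
        b.iter(|| noop_handler(black_box(42)))
    });
    c.bench_function("handler/dyn_dispatch", |b| {
        b.iter(|| boxed(black_box(42)))
    });
}

criterion_group!(benches, bench_handler_dispatch);
criterion_main!(benches);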
7. End-to-End Latency (tests/benches/e2e_latency.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- Complete workflow execution (API → Task Complete)
- All system components (API, DB, Queue, Worker, Events)
- Real network overhead
- Different workflow patterns
Targets:
- Linear (3 steps): < 500ms p99
- Diamond (4 steps): < 800ms p99
- Tree (7 steps): < 1500ms p99
Implementation Requirements:
- All Docker Compose services running
- Orchestration client for task creation
- Polling mechanism for completion detection
- Multiple workflow templates
- Timeout handling for stuck workflows
Special Considerations:
- SLOW by design: Measures real workflow execution (seconds)
- Fewer samples (sample_size=10 vs 50 default)
- Higher variance expected (network + system state)
- Focus on regression detection, not absolute numbers
Run Command (when implemented):
# Requires all Docker services running
docker-compose -f docker/docker-compose.test.yml up -d
cargo bench --test e2e_latency
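The completion-polling requirement above can be sketched as follows. The status endpoint path and response shape are assumptions for illustration only; the real benchmark should go through the orchestration client:
use std::time::{Duration, Instant};

async fn wait_for_completion(
    client: &reqwest::Client,
    base_url: &str,
    task_uuid: &str,
    timeout: Duration,
) -> Result<Duration, Box<dyn std::error::Error>> {
    let start = Instant::now();
    loop {
        if start.elapsed() > timeout {
            return Err("timed out waiting for task completion".into());
        }
        // Hypothetical status endpoint; adjust to the real API.
        let status: serde_json::Value = client
            .get(format!("{base_url}/v1/tasks/{task_uuid}"))
            .send()
            .await?
            .json()
            .await?;
        match status["state"].as_str() {
            Some("complete") => return Ok(start.elapsed()),
            Some("error") => return Err("task failed".into()),
            _ => tokio::time::sleep(Duration::from_millis(50)).await,
        }
    }
}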
Benchmark Output Logging Strategy
Current State
Implemented:
- Criterion default output (terminal + HTML reports)
- Custom health check banners in benchmarks
- EXPLAIN ANALYZE output in SQL benchmarks
- Inline result commentary
Location: Results saved to target/criterion/
Proposed Consistent Structure
1. Standard Output Format
All benchmarks should follow this pattern:
═══════════════════════════════════════════════════════════════════════════════
🔍 VERIFYING PREREQUISITES
═══════════════════════════════════════════════════════════════════════════════
✅ All prerequisites met
═══════════════════════════════════════════════════════════════════════════════
Benchmarking <category>/<test_name>
...
<category>/<test_name> time: [X.XX ms Y.YY ms Z.ZZ ms]
═══════════════════════════════════════════════════════════════════════════════
📊 BENCHMARK RESULTS: <CATEGORY NAME>
═══════════════════════════════════════════════════════════════════════════════
Performance Summary:
• Test 1: X.XX ms (Target: < YY ms) ✅ Status
• Test 2: X.XX ms (Target: < YY ms) ⚠️ Status
Key Findings:
• Finding 1
• Finding 2
═══════════════════════════════════════════════════════════════════════════════
2. Structured Log Files
Proposal: Create tmp/benchmarks/ directory with dated output:
tmp/benchmarks/
├── 2025-10-08-task-initialization.log
├── 2025-10-08-sql-functions.log
├── 2025-10-08-worker-execution.log
├── ...
└── latest/
├── task-initialization.log -> ../2025-10-08-task-initialization.log
└── summary.md
Log Format (example):
# Benchmark Run: task_initialization
Date: 2025-10-08 14:23:45 UTC
Commit: abc123def456
Environment: Docker Compose Test
## Prerequisites
- [x] Orchestration service healthy (http://localhost:8080)
- [x] Worker service healthy (http://localhost:8081)
## Results
### Linear Workflow (3 steps)
- Mean: 17.748 ms
- Std Dev: 0.624 ms
- Min: 17.081 ms
- Max: 18.507 ms
- Target: < 50 ms
- Status: ✅ PASS (3.0x better than target)
- Outliers: 2/20 (10%)
### Diamond Workflow (4 steps)
- Mean: 20.805 ms
- Std Dev: 0.741 ms
- Min: 19.949 ms
- Max: 21.633 ms
- Target: < 75 ms
- Status: ✅ PASS (3.6x better than target)
- Outliers: 0/20 (0%)
## Summary
✅ All tests passed
🎯 Average performance: 3.3x better than targets
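A sketch of the proposed layout using only the standard library; directory and file names mirror the tree above, and the symlink call is Unix-specific:
use std::fs;
use std::path::Path;

fn write_benchmark_log(name: &str, date: &str, contents: &str) -> std::io::Result<()> {
    let dir = Path::new("tmp/benchmarks");
    fs::create_dir_all(dir.join("latest"))?;

    // Dated log file, e.g. tmp/benchmarks/2025-10-08-task-initialization.log
    let dated = dir.join(format!("{date}-{name}.log"));
    fs::write(&dated, contents)?;

    // Refresh the "latest" pointer (symlink on Unix; a copy would also work).
    let link = dir.join("latest").join(format!("{name}.log"));
    let _ = fs::remove_file(&link);
    #[cfg(unix)]
    std::os::unix::fs::symlink(format!("../{date}-{name}.log"), &link)?;
    Ok(())
}
Generation of summary.md and retention of old logs are left out of the sketch.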
3. Baseline Comparison Format
For tracking performance over time:
# Performance Baseline Comparison
Baseline: main branch (2025-10-01)
Current: feature/benchmarks (2025-10-08)
| Benchmark | Baseline | Current | Change | Status |
|-----------|----------|---------|--------|--------|
| task_init/linear | 18.2ms | 17.7ms | -2.7% | ✅ Improved |
| task_init/diamond | 21.1ms | 20.8ms | -1.4% | ✅ Improved |
| sql/task_discovery | 2.91ms | 2.93ms | +0.7% | ✅ Stable |
4. CI Integration Format
For GitHub Actions / CI output:
{
"benchmark_suite": "task_initialization",
"timestamp": "2025-10-08T14:23:45Z",
"commit": "abc123def456",
"results": [
{
"name": "linear_3_steps",
"mean_ms": 17.748,
"std_dev_ms": 0.624,
"target_ms": 50,
"status": "pass",
"performance_ratio": 3.0
}
],
"summary": {
"total_tests": 2,
"passed": 2,
"failed": 0,
"warnings": 0
}
}
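A minimal sketch of serde types matching the JSON above, so a wrapper around a benchmark run could emit this format with serde_json. None of these types exist in the codebase yet:
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct BenchmarkReport {
    benchmark_suite: String,
    timestamp: String,
    commit: String,
    results: Vec<BenchmarkResult>,
    summary: Summary,
}

#[derive(Serialize, Deserialize)]
struct BenchmarkResult {
    name: String,
    mean_ms: f64,
    std_dev_ms: f64,
    target_ms: f64,
    status: String,
    performance_ratio: f64,
}

#[derive(Serialize, Deserialize)]
struct Summary {
    total_tests: u32,
    passed: u32,
    failed: u32,
    warnings: u32,
}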
Running All Benchmarks
Quick Reference
# 1. Start Docker services
docker-compose -f docker/docker-compose.test.yml up -d
# 2. Run individual benchmarks
cargo bench --package tasker-client --features benchmarks # Task initialization
cargo bench --package tasker-shared --features benchmarks # SQL + Events
cargo bench --package tasker-worker --features benchmarks # Worker + Handlers
cargo bench --package tasker-orchestration --features benchmarks # Step enqueueing
cargo bench --test e2e_latency # End-to-end
# 3. Run ALL benchmarks (when all implemented)
cargo bench --all-features
Environment Variables
# Required for SQL benchmarks
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
# Optional: Skip health checks (CI)
export TASKER_TEST_SKIP_HEALTH_CHECK="true"
# Optional: Custom service URLs
export TASKER_TEST_ORCHESTRATION_URL="http://localhost:9080"
export TASKER_TEST_WORKER_URL="http://localhost:9081"
Performance Targets Summary
| Category | Component | Metric | Target | Status |
|---|---|---|---|---|
| API | Task Creation (3 steps) | p99 | < 50ms | ✅ 17.7ms |
| API | Task Creation (4 steps) | p99 | < 75ms | ✅ 20.8ms |
| SQL | Task Discovery | mean | < 3ms | ✅ 1.75-2.93ms |
| SQL | Step Readiness | mean | < 1ms | ✅ 440-603µs |
| Worker | Total Overhead | mean | < 60ms | 🚧 TBD |
| Worker | FFI Overhead | mean | < 5ms | 🚧 TBD |
| Events | Notify Latency | p95 | < 10ms | 🚧 TBD |
| Orchestration | Step Enqueueing (3 steps) | mean | < 50ms | 🚧 TBD |
| E2E | Complete Workflow (3 steps) | p99 | < 500ms | 🚧 TBD |
Next Steps
Immediate (Current Session)
- ✅ Create all benchmark skeletons
- 🎯 Design consistent logging structure
- Decide on implementation priorities
Short Term
- Implement worker execution benchmark
- Implement event propagation benchmark
- Create benchmark output logging utilities
Medium Term
- Implement step enqueueing benchmark
- Implement handler overhead benchmark
- Implement E2E latency benchmark
Long Term
- CI integration with baseline tracking
- Performance regression detection
- Automated benchmark reports
- Historical performance trending
Documentation
- Full Plan: phase-5.4-distributed-benchmarks-plan.md
- SQL Benchmarks: benchmarking-guide.md
SQL Function Benchmarking Guide
Created: 2025-10-08
Status: ✅ Complete
Location: tasker-shared/benches/sql_functions.rs
Overview
The SQL function benchmark suite measures performance of critical database operations that form the hot paths in the Tasker orchestration system. These benchmarks provide:
- Baseline Performance Metrics: Establish expected performance ranges
- Regression Detection: Identify performance degradations in code changes
- Optimization Guidance: Use EXPLAIN ANALYZE output to guide index/query improvements
- Capacity Planning: Understand scaling characteristics with data volume
Quick Start
Prerequisites
# 1. Ensure PostgreSQL is running
pg_isready
# 2. Set up test database
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo sqlx migrate run
# 3. Populate with test data - REQUIRED for representative benchmarks
cargo test --all-features
Important: The benchmarks use intelligent sampling to test diverse task/step complexities. Running integration tests first ensures the database contains various workflow patterns (linear, diamond, parallel) for representative benchmarking.
Running Benchmarks
# Run all SQL benchmarks
cargo bench --package tasker-shared --features benchmarks
# Run specific benchmark group
cargo bench --package tasker-shared --features benchmarks get_next_ready_tasks
# Run with baseline comparison
cargo bench --package tasker-shared --features benchmarks -- --save-baseline main
# ... make changes ...
cargo bench --package tasker-shared --features benchmarks -- --baseline main
Sampling Strategy
The benchmarks use intelligent sampling to ensure representative results:
Task Sampling
- Samples 5 diverse tasks from different named_task_uuid types
- Distributes samples across different workflow patterns
- Maintains deterministic ordering (same UUIDs in same order each run)
- Provides consistent results while capturing complexity variance
Step Sampling
- Samples 10 diverse steps from different tasks
- Selects up to 2 steps per task for variety
- Captures different DAG depths and dependency patterns
- Helps identify performance variance in recursive queries
Benefits
- Representativeness: No bias from single task/step selection
- Consistency: Same samples = comparable baseline comparisons
- Variance Detection: Criterion can measure performance across complexities
- Real-world Accuracy: Reflects actual production workload diversity
Example Output:
step_readiness_status/calculate_readiness/0 2.345 ms
step_readiness_status/calculate_readiness/1 1.234 ms (simple linear task)
step_readiness_status/calculate_readiness/2 5.678 ms (complex diamond DAG)
step_readiness_status/calculate_readiness/3 3.456 ms
step_readiness_status/calculate_readiness/4 2.789 ms
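The task-sampling idea can be sketched as a single deterministic query: one task per template type, ordered so the same UUIDs come back on every run. Table and column names follow those referenced elsewhere in this guide, but treat the SQL as illustrative rather than the benchmark's actual query:
use sqlx::PgPool;
use uuid::Uuid;

async fn sample_diverse_tasks(pool: &PgPool, limit: i64) -> Result<Vec<Uuid>, sqlx::Error> {
    // DISTINCT ON keeps one task per template type; ORDER BY makes the
    // selection deterministic across benchmark runs.
    let rows: Vec<(Uuid,)> = sqlx::query_as(
        "SELECT DISTINCT ON (named_task_uuid) task_uuid
         FROM tasker.tasks
         ORDER BY named_task_uuid, task_uuid
         LIMIT $1",
    )
    .bind(limit)
    .fetch_all(pool)
    .await?;
    Ok(rows.into_iter().map(|(uuid,)| uuid).collect())
}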
Benchmark Categories
1. Task Discovery (get_next_ready_tasks)
What it measures: Time to discover ready tasks for orchestration
Hot path: Orchestration coordinator → Task discovery
Test parameters:
- Batch size: 1, 10, 50, 100 tasks
- Measures function overhead even with empty database
Expected performance:
- Empty DB: < 5ms for any batch size (function overhead)
- With data: Should scale linearly, not exponentially
Optimization targets:
- Index on task state
- Index on namespace for filtering
- Efficient processor ownership checks
Example output:
get_next_ready_tasks/batch_size/1
time: [2.1234 ms 2.1567 ms 2.1845 ms]
get_next_ready_tasks/batch_size/10
time: [2.2156 ms 2.2489 ms 2.2756 ms]
get_next_ready_tasks/batch_size/50
time: [2.5123 ms 2.5456 ms 2.5789 ms]
get_next_ready_tasks/batch_size/100
time: [3.0234 ms 3.0567 ms 3.0890 ms]
Analysis: Near-constant time across batch sizes indicates efficient query planning.
2. Step Readiness (get_step_readiness_status)
What it measures: Time to calculate if a step is ready to execute
Hot path: Step enqueuer → Dependency resolution
Dependencies: Requires test data (tasks with steps)
Expected performance:
- Simple linear: < 10ms
- Diamond pattern: < 20ms
- Complex DAG: < 50ms
Optimization targets:
- Parent step completion checks
- Dependency graph traversal
- Retry backoff calculations
Graceful degradation:
⚠️ Skipping step_readiness_status benchmark - no test data found
Run integration tests first to populate test data
3. State Transitions (transition_task_state_atomic)
What it measures: Time for atomic state transitions with processor ownership
Hot path: All orchestration operations (initialization, enqueuing, finalization)
Expected performance:
- Successful transition: < 15ms
- Failed transition (wrong state): < 10ms (faster path)
- Contention scenario: < 50ms with backoff
Optimization targets:
- Atomic compare-and-swap efficiency
- Index on task_uuid + processor_uuid
- Transition history table size
4. Task Execution Context (get_task_execution_context)
What it measures: Time to retrieve comprehensive task orchestration status
Hot path: Orchestration coordinator → Status checking
Dependencies: Requires test data (tasks in database)
Expected performance:
- Simple tasks: < 10ms
- Complex tasks: < 25ms
- With many steps: < 50ms
Optimization targets:
- Step aggregation queries
- State calculation efficiency
- Join optimization for step counts
5. Transitive Dependencies (get_step_transitive_dependencies)
What it measures: Time to resolve complete dependency tree for a step
Hot path: Worker → Step execution preparation (once per step lifecycle)
Dependencies: Requires test data (steps with dependencies)
Expected performance:
- Linear dependencies: < 5ms
- Diamond pattern: < 10ms
- Complex DAG (10+ levels): < 25ms
Optimization targets:
- Recursive CTE performance
- Index on step dependencies
- Materialized dependency graphs (future)
Why it matters: Called once per step on worker side when populating step data. While not in orchestration hot path, it affects worker step initialization latency. Recursive CTEs can be expensive with deep dependency trees.
6. EXPLAIN ANALYZE (explain_analyze)
What it measures: Query execution plans, not just timing
How it works: Runs EXPLAIN ANALYZE once per function (no repeated iterations since query plans don’t change between executions)
Functions analyzed:
- get_next_ready_tasks() - Task discovery query plans
- get_task_execution_context() - Task status aggregation plans
- get_step_transitive_dependencies() - Recursive CTE dependency traversal plans
Purpose: Identify optimization opportunities:
- Sequential scans (need indexes)
- Nested loop performance
- Buffer hit ratios
- Index usage patterns
- Recursive CTE efficiency
Automatic Query Plan Logging: Captures and analyzes each query plan once, printing:
- ⏱️ Execution Time: Actual query execution duration
- 📋 Planning Time: Time spent planning the query
- 📦 Node Type: Primary operation type (Aggregate, Index Scan, etc.)
- 💰 Total Cost: PostgreSQL’s cost estimate
- ⚠️ Sequential Scan Warning: Alerts for potential missing indexes
- 📊 Buffer Hit Ratio: Cache efficiency (higher is better)
Example output:
════════════════════════════════════════════════════════════════════════════════
📊 QUERY PLAN ANALYSIS
════════════════════════════════════════════════════════════════════════════════
🔍 Function: get_next_ready_tasks
────────────────────────────────────────────────────────────────────────────────
⏱️ Execution Time: 2.345 ms
📋 Planning Time: 0.123 ms
📦 Node Type: Aggregate
💰 Total Cost: 45.67
📊 Buffer Hit Ratio: 98.5% (197/200 blocks)
────────────────────────────────────────────────────────────────────────────────
Saving Full Plans:
# Save complete JSON plans to target/query_plan_*.json
SAVE_QUERY_PLANS=1 cargo bench --package tasker-shared --features benchmarks
Red flags to investigate:
- “Seq Scan” on large tables → Add index
- “Nested Loop” with high iteration count → Optimize join strategy
- “Sort” operations on large datasets → Add index for ORDER BY
- Low buffer hit ratio (< 90%) → Increase shared_buffers or investigate I/O
Interpreting Results
Criterion Statistics
Criterion provides comprehensive statistics for each benchmark:
get_next_ready_tasks/batch_size/10
time: [2.2156 ms 2.2489 ms 2.2756 ms]
change: [-1.5% +0.2% +1.9%] (p = 0.31 > 0.05)
No change in performance detected.
Found 3 outliers among 50 measurements (6.00%)
2 (4.00%) high mild
1 (2.00%) high severe
Key metrics:
- [2.2156 ms 2.2489 ms 2.2756 ms]: Lower bound, mean, upper bound (95% confidence)
- change: Comparison to baseline (if available)
- p-value: Statistical significance (p < 0.05 = significant)
- Outliers: Measurements far from median (cache effects, GC, etc.)
Performance Expectations
Based on Phase 3 metrics verification (26 tasks executed):
| Metric | Expected | Warning | Critical |
|---|---|---|---|
| Task initialization | < 50ms | 50-100ms | > 100ms |
| Step readiness | < 20ms | 20-50ms | > 50ms |
| State transition | < 15ms | 15-30ms | > 30ms |
| Finalization claim | < 10ms | 10-25ms | > 25ms |
Note: These are function-level times, not end-to-end latencies.
Using Benchmarks for Optimization
Workflow
1. Establish Baseline

   cargo bench --package tasker-shared --features benchmarks -- --save-baseline main

2. Make Changes (e.g., add index, optimize query)

3. Compare

   cargo bench --package tasker-shared --features benchmarks -- --baseline main

4. Review Output

   get_next_ready_tasks/batch_size/100
       time:   [2.0123 ms 2.0456 ms 2.0789 ms]
       change: [-34.5% -32.1% -29.7%] (p = 0.00 < 0.05)
       Performance has improved.

5. Analyze EXPLAIN Plans (if improvement isn't clear)
Common Optimization Patterns
Pattern 1: Missing Index
Symptom: Exponential scaling with data volume
EXPLAIN shows: Seq Scan on tasks
Solution:
CREATE INDEX idx_tasks_state ON tasker.tasks(current_state)
WHERE complete = false;
Pattern 2: Inefficient Join
Symptom: High latency with complex DAGs
EXPLAIN shows: Nested Loop with high row counts
Solution: Use CTE or adjust join strategy
WITH parent_status AS (
SELECT ... -- Pre-compute parent completions
)
SELECT ... FROM tasker.workflow_steps s
JOIN parent_status ps ON ...
Pattern 3: Large Transaction History
Symptom: State transition slowing over time
EXPLAIN shows: Large scan of task_transitions
Solution: Partition by date or archive old transitions
CREATE TABLE tasker.task_transitions_archive (LIKE tasker.task_transitions);
-- Move old data periodically
Integration with Metrics
The benchmark results should correlate with production metrics:
From metrics-reference.md:
- tasker_task_initialization_duration_milliseconds → Benchmark: task discovery + initialization
- tasker_step_result_processing_duration_milliseconds → Benchmark: step readiness + state transitions
- tasker_task_finalization_duration_milliseconds → Benchmark: finalization claiming
Validation approach:
- Run benchmarks: Get ~2ms for task discovery
- Check metrics: tasker_task_initialization_duration P95 = ~45ms
- Calculate overhead: 45ms - 2ms = 43ms (business logic + framework)
This helps identify where optimization efforts should focus:
- If benchmark is slow → Optimize SQL/indexes
- If benchmark is fast but metrics slow → Optimize Rust code
Continuous Integration
Recommended CI Workflow
# .github/workflows/benchmarks.yml
name: Performance Benchmarks
on:
pull_request:
branches: [main]
jobs:
benchmark:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:15
env:
POSTGRES_PASSWORD: tasker
options: >-
--health-cmd pg_isready
--health-interval 10s
steps:
- uses: actions/checkout@v3
- uses: dtolnay/rust-toolchain@stable
- name: Run migrations
run: cargo sqlx migrate run
env:
DATABASE_URL: postgresql://postgres:tasker@localhost/test
- name: Run benchmarks
run: cargo bench --package tasker-shared --features benchmarks
- name: Check for regressions
run: |
# Parse Criterion output and fail if P95 > threshold
# This is left as an exercise for CI implementation
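One way to fill in that regression-check step is to read Criterion's estimates.json directly. The benchmark directory name below is hypothetical, and the JSON layout (mean.point_estimate, in nanoseconds) should be verified against the Criterion version in use:
use serde_json::Value;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path; depends on the actual benchmark group/function names.
    let path = "target/criterion/task_initialization/linear_3_steps/new/estimates.json";
    let estimates: Value = serde_json::from_str(&fs::read_to_string(path)?)?;

    let mean_ms = estimates["mean"]["point_estimate"]
        .as_f64()
        .ok_or("missing mean.point_estimate")?
        / 1_000_000.0; // ns -> ms

    let threshold_ms = 50.0;
    if mean_ms > threshold_ms {
        eprintln!("regression: mean {mean_ms:.2}ms exceeds {threshold_ms}ms target");
        std::process::exit(1);
    }
    println!("ok: mean {mean_ms:.2}ms within {threshold_ms}ms target");
    Ok(())
}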
Future Enhancements
Phase 5.3: Data Generation (Deferred)
The current benchmarks work with existing test data. Future work could add:
- Realistic Data Generation
  - Create 100/1,000/10,000 task datasets
  - Various DAG complexities (linear, diamond, tree)
  - State distribution (60% complete, 20% in-progress, etc.)
- Contention Testing
  - Multiple processors competing for same tasks
  - Race condition scenarios
  - Deadlock detection
- Long-Running Benchmarks
  - Memory leak detection
  - Connection pool exhaustion
  - Query plan cache effects
Troubleshooting
Benchmark fails with “DATABASE_URL must be set”
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
All benchmarks show “no test data found”
# Run integration tests to populate database
cargo test --all-features
# Or run specific test suite
cargo test --package tasker-shared --all-features
Benchmarks are inconsistent/noisy
- Close other applications
- Ensure PostgreSQL isn’t under load
- Run benchmarks multiple times
- Increase sample_size in benchmark code
Results don’t match production metrics
- Production has different data volumes
- Network latency in production
- Different PostgreSQL version/configuration
- Connection pool overhead in production
References
- Criterion Documentation: https://bheisler.github.io/criterion.rs/book/
- PostgreSQL EXPLAIN: https://www.postgresql.org/docs/current/sql-explain.html
- Phase 3 Metrics: docs/observability/metrics-reference.md
- Verification Results: docs/observability/VERIFICATION_RESULTS.md
Sign-Off
Phase 5.2 Status: ✅ COMPLETE
Benchmarks Implemented:
- ✅ get_next_ready_tasks() - 4 batch sizes
- ✅ get_step_readiness_status() - with graceful skip
- ✅ transition_task_state_atomic() - atomic operations
- ✅ get_task_execution_context() - orchestration status retrieval
- ✅ get_step_transitive_dependencies() - recursive dependency traversal
- ✅ EXPLAIN ANALYZE - query plan capture with automatic analysis
Documentation Complete:
- ✅ Quick start guide
- ✅ Interpretation guidance
- ✅ Optimization patterns
- ✅ Integration with metrics
- ✅ CI recommendations
Next Steps: Run benchmarks with real data and establish baseline performance targets.
Tasker-Core Logging Standards
Version: 1.0 Last Updated: 2025-10-07 Status: Active Related: Observability Standardization
Table of Contents
- Philosophy
- Log Levels
- Structured Fields
- Message Style
- Instrument Macro
- Error Handling
- Examples
- Enforcement
Philosophy
Principles:
- Production-First: Logs must be parseable, searchable, and professional
- Correlation-Driven: All operations include correlation_id for distributed tracing
- Structured: Fields over string interpolation for aggregation and querying
- Concise: Clear, actionable messages without noise
- Consistent: Predictable patterns across all code
Anti-Patterns to Avoid:
- ❌ Emojis (🚀✅❌) - Breaks log parsers, unprofessional
- ❌ All-caps prefixes (“BOOTSTRAP:”, “CORE:”) - Redundant with module paths
- ❌ Ticket references (“JIRA-123”, “PROJ-40”) - Internal, meaningless externally
- ❌ String interpolation - Use structured fields instead
- ❌ Verbose messages - Be concise, let fields provide detail
Log Levels
ERROR - Unrecoverable Failures
When to Use:
- Database connection permanently lost
- Critical system component failure
- Unrecoverable state machine violation
- Data corruption detected
- Message queue unavailable
Characteristics:
- Requires immediate human intervention
- Service degradation or outage
- Cannot automatically recover
- Should trigger alerts/pages
Example:
#![allow(unused)]
fn main() {
error!(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
error = %e,
"Failed to claim task for finalization: database unavailable"
);
}
WARN - Degraded Operation
When to Use:
- Retryable failures after exhausting retries
- Circuit breaker opened (degraded mode)
- Fallback behavior activated
- Rate limiting engaged
- Configuration issues (non-fatal)
- Unexpected but handled conditions
Characteristics:
- Service continues but degraded
- Automatic recovery possible
- Should be monitored for patterns
- May indicate upstream problems
Example:
#![allow(unused)]
fn main() {
warn!(
correlation_id = %correlation_id,
step_uuid = %step_uuid,
retry_count = attempts,
max_retries = max_attempts,
next_retry_at = ?next_retry,
"Step execution failed after max retries, will not retry further"
);
}
INFO - Lifecycle Events
When to Use:
- System startup/shutdown
- Task created/completed/failed
- Step enqueued/completed
- State transitions (task/step)
- Configuration loaded
- Significant business events
Characteristics:
- Normal operation milestones
- Useful for understanding flow
- Production-ready verbosity
- Default log level in production
Example:
#![allow(unused)]
fn main() {
info!(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
steps_enqueued = count,
duration_ms = elapsed.as_millis(),
"Task initialization complete"
);
}
DEBUG - Detailed Diagnostics
When to Use:
- Discovery query results
- Queue depth checks
- Dependency analysis details
- Configuration value dumps
- State machine transition details
- Detailed operation flow
Characteristics:
- Troubleshooting information
- Not shown in production (usually)
- Safe to be verbose
- Helps understand “why”
Example:
#![allow(unused)]
fn main() {
debug!(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
viable_steps = steps.len(),
pending_steps = pending.len(),
blocked_steps = blocked.len(),
"Step readiness analysis complete"
);
}
TRACE - Very Verbose
When to Use:
- Function entry/exit in hot paths
- Loop iteration details
- Deep parameter inspection
- Performance profiling hooks
Characteristics:
- Extremely verbose
- Usually disabled even in dev
- Performance impact acceptable
- Use sparingly
Example:
#![allow(unused)]
fn main() {
trace!(
correlation_id = %correlation_id,
iteration = i,
"Polling loop iteration"
);
}
Structured Fields
Required Fields (Context-Dependent)
Always Include:
#![allow(unused)]
fn main() {
correlation_id = %correlation_id, // ALWAYS when available
}
When Task Context Available:
#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
task_uuid = %task_uuid,
namespace = %namespace,
}
When Step Context Available:
#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
task_uuid = %task_uuid,
step_uuid = %step_uuid,
namespace = %namespace,
}
For Operations:
#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
// ... entity IDs ...
operation = "step_enqueue", // Operation identifier
duration_ms = elapsed.as_millis(), // Timing for operations
}
For Errors:
#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
// ... entity IDs ...
error = %e, // Error Display
error_type = %type_name::<E>(), // Optional: Error type
}
Field Ordering (MANDATORY)
Standard Order:
- correlation_id (always first)
- Entity IDs (task_uuid, step_uuid, namespace)
- Operation/Action (operation, state, status)
- Measurements (duration_ms, count, size)
- Error Info (error, error_type, context)
- Other Context (additional fields)
Example:
#![allow(unused)]
fn main() {
info!(
// 1. Correlation ID (ALWAYS FIRST)
correlation_id = %correlation_id,
// 2. Entity IDs
task_uuid = %task_uuid,
step_uuid = %step_uuid,
namespace = %namespace,
// 3. Operation
operation = "step_transition",
from_state = %old_state,
to_state = %new_state,
// 4. Measurements
duration_ms = elapsed.as_millis(),
// 5. No errors (success case)
// 6. Other context
processor_id = %processor_uuid,
"Step state transition complete"
);
}
Field Formatting
Use Display Formatting (%):
#![allow(unused)]
fn main() {
// ✅ CORRECT: Let tracing handle formatting
correlation_id = %correlation_id,
task_uuid = %task_uuid,
error = %e,
}
Avoid Manual Conversion:
#![allow(unused)]
fn main() {
// ❌ WRONG: Manual to_string()
task_uuid = task_uuid.to_string(),
// ❌ WRONG: Debug formatting for production types
task_uuid = ?task_uuid, // Use ? only for Debug types
}
Field Naming:
#![allow(unused)]
fn main() {
// ✅ Standard names
duration_ms // Not elapsed_ms, time_ms
error // Not err, error_message
step_uuid // Not workflow_step_uuid (be consistent)
retry_count // Not attempts, retries
max_retries // Not max_attempts
}
Message Style
Guidelines
DO:
- ✅ Be concise and actionable
- ✅ Use present tense for states: “Step enqueued”
- ✅ Use past tense for events: “Task completed”
- ✅ Start with the subject: “Task completed” not “Successfully completed task”
- ✅ Focus on WHAT happened (fields show HOW)
DON’T:
- ❌ Use emojis: “🚀 Starting…” → “Starting orchestration system”
- ❌ Use all-caps prefixes: “BOOTSTRAP: Starting…” → “Starting orchestration bootstrap”
- ❌ Include ticket numbers: “PROJ-40: Processing…” → “Processing command”
- ❌ Be redundant: “Successfully enqueued step successfully” → “Step enqueued”
- ❌ Include technical jargon: “Atomic CAS transition succeeded” → “State transition complete”
- ❌ Be verbose: Keep messages under 10 words ideally
Before/After Examples
Lifecycle Events:
#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("🚀 BOOTSTRAP: Starting unified orchestration system bootstrap");
// ✅ AFTER
info!("Starting orchestration system bootstrap");
}
Operation Completion:
#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("✅ STEP_ENQUEUER: Successfully marked step {} as enqueued", step_uuid);
// ✅ AFTER
info!(
correlation_id = %correlation_id,
step_uuid = %step_uuid,
"Step marked as enqueued"
);
}
Error Handling:
#![allow(unused)]
fn main() {
// ❌ BEFORE
error!("❌ ORCHESTRATION_LOOP: Failed to process task {}: {}", task_uuid, e);
// ✅ AFTER
error!(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
error = %e,
"Task processing failed"
);
}
Shutdown:
#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("🛑 Shutdown signal received, initiating graceful shutdown...");
// ✅ AFTER
info!("Shutdown signal received, initiating graceful shutdown");
}
Instrument Macro
When to Use
Use #[instrument] for:
- Function-level spans in hot paths
- Automatic correlation ID tracking
- Operations that should appear in traces
- Functions with significant duration
Benefits:
- Automatic span creation
- Automatic timing
- Better OpenTelemetry integration (Phase 2)
- Cleaner code
Example
#![allow(unused)]
fn main() {
use tracing::instrument;
#[instrument(skip(self), fields(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
namespace = %namespace
))]
pub async fn process_task(
&self,
correlation_id: Uuid,
task_uuid: Uuid,
namespace: String,
) -> Result<TaskResult> {
// Span automatically created with fields above
info!("Starting task processing");
// ... implementation ...
info!(
duration_ms = start.elapsed().as_millis(),
"Task processing complete"
);
Ok(result)
}
}
Skip Parameters
Always skip:
- self (redundant)
- Large structures (use specific fields instead)
- Sensitive data (passwords, tokens, PII)
#![allow(unused)]
fn main() {
#[instrument(
skip(self, context), // Skip large context
fields(
correlation_id = %correlation_id,
task_uuid = %context.task_uuid, // Extract specific fields
)
)]
}
Error Handling
Error Context
Always include:
#![allow(unused)]
fn main() {
error!(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
error = %e, // Error Display (user-friendly)
error_type = %type_name::<E>(), // Optional: For classification
"Operation failed"
);
}
Error Propagation
#![allow(unused)]
fn main() {
// ✅ Log and return for caller to handle
debug!(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
error = %e,
"Step discovery query failed, will retry"
);
return Err(e);
// ❌ Don't log at every level (causes noise)
// Instead: Log once at appropriate level where handled
}
Error Classification
#![allow(unused)]
fn main() {
match result {
Err(e) if e.is_retryable() => {
warn!(
correlation_id = %correlation_id,
error = %e,
retry_count = attempts,
"Operation failed, will retry"
);
}
Err(e) => {
error!(
correlation_id = %correlation_id,
error = %e,
"Operation failed permanently"
);
}
Ok(result) => {
info!(
correlation_id = %correlation_id,
duration_ms = elapsed.as_millis(),
"Operation completed successfully"
);
}
}
}
Examples
Complete Examples by Scenario
Task Initialization
#![allow(unused)]
fn main() {
#[instrument(skip(self), fields(
correlation_id = %task_request.correlation_id,
task_name = %task_request.name,
namespace = %task_request.namespace
))]
pub async fn create_task_from_request(
&self,
task_request: TaskRequest,
) -> Result<TaskInitializationResult> {
let correlation_id = task_request.correlation_id;
let start = Instant::now();
info!("Starting task initialization");
// Create task
let task = self.create_task(&task_request).await?;
debug!(
task_uuid = %task.task_uuid,
template_uuid = %task.named_task_uuid,
"Task created in database"
);
// Discover steps
let steps = self.discover_initial_steps(task.task_uuid).await?;
info!(
correlation_id = %correlation_id,
task_uuid = %task.task_uuid,
step_count = steps.len(),
duration_ms = start.elapsed().as_millis(),
"Task initialization complete"
);
Ok(TaskInitializationResult {
task_uuid: task.task_uuid,
step_count: steps.len(),
})
}
}
Step Enqueueing
#![allow(unused)]
fn main() {
pub async fn enqueue_step(
&self,
correlation_id: Uuid,
task_uuid: Uuid,
step: &ViableStep,
) -> Result<()> {
let start = Instant::now();
debug!(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
step_uuid = %step.step_uuid,
step_name = %step.name,
queue = %step.queue_name,
"Enqueueing step"
);
let message = self.create_message(correlation_id, task_uuid, step)?;
self.pgmq_client
.send(&step.queue_name, &message)
.await?;
info!(
correlation_id = %correlation_id,
task_uuid = %task_uuid,
step_uuid = %step.step_uuid,
queue = %step.queue_name,
duration_ms = start.elapsed().as_millis(),
"Step enqueued"
);
Ok(())
}
}
Error Handling
#![allow(unused)]
fn main() {
match self.process_step_result(result).await {
Ok(()) => {
info!(
correlation_id = %result.correlation_id,
task_uuid = %result.task_uuid,
step_uuid = %result.step_uuid,
duration_ms = elapsed.as_millis(),
"Step result processed"
);
}
Err(e) if e.is_retryable() => {
warn!(
correlation_id = %result.correlation_id,
task_uuid = %result.task_uuid,
step_uuid = %result.step_uuid,
error = %e,
retry_count = result.attempts,
"Step result processing failed, will retry"
);
return Err(e);
}
Err(e) => {
error!(
correlation_id = %result.correlation_id,
task_uuid = %result.task_uuid,
step_uuid = %result.step_uuid,
error = %e,
"Step result processing failed permanently"
);
return Err(e);
}
}
}
Bootstrap/Shutdown
#![allow(unused)]
fn main() {
pub async fn bootstrap() -> Result<OrchestrationSystemHandle> {
info!("Starting orchestration system bootstrap");
let config = ConfigManager::load()?;
debug!(environment = %config.environment, "Configuration loaded");
let context = SystemContext::from_config(config).await?;
info!(processor_uuid = %context.processor_uuid, "System context initialized");
let core = OrchestrationCore::new(context).await?;
info!("Orchestration core initialized");
// ... more initialization ...
info!(
processor_uuid = %core.processor_uuid,
namespaces = ?core.supported_namespaces,
"Orchestration system bootstrap complete"
);
Ok(handle)
}
pub async fn shutdown(&mut self) -> Result<()> {
info!("Initiating graceful shutdown");
if let Some(coordinator) = &self.event_coordinator {
coordinator.stop().await?;
debug!("Event coordinator stopped");
}
info!("Orchestration system shutdown complete");
Ok(())
}
}
Enforcement
Code Review Checklist
Before merging, verify:
- No emojis in log messages
- No all-caps component prefixes
- No ticket references in runtime logs
- correlation_id present in all task/step operations
- Structured fields follow standard ordering
- Messages are concise and actionable
- Appropriate log levels used
- Error context is complete
CI Checks
Recommended lints (future):
# Check for emojis
! grep -r '[🔧✅🚀❌⚠️📊🔍🎉🛡️⏱️📝🏗️🎯🔄💡📦🧪🌉🔌⏳🛑]' src/
# Check for all-caps prefixes
! grep -rE '(info|debug|warn|error)!\(".*[A-Z_]{3,}:' src/
# Check for ticket references in logs (allow in comments)
! grep -rE '(info|debug|warn|error)!.*[A-Z]+-[0-9]+' src/
Pre-commit Hook
Add to .git/hooks/pre-commit:
#!/bin/bash
./scripts/audit-logging.sh --check || {
echo "❌ Logging standards violation detected"
echo "Run: ./scripts/audit-logging.sh for details"
exit 1
}
Migration Guide
For Existing Code
- Remove emojis: Use find/replace
- Remove all-caps prefixes: Simple cleanup
- Add correlation_id: Extract from context
- Reorder fields: correlation_id first
- Shorten messages: Remove redundancy
- Verify log levels: Lifecycle = INFO, diagnostics = DEBUG
For New Code
- Always include correlation_id when context available
- Use #[instrument] for significant functions
- Keep messages concise: Under 10 words
- Choose appropriate level: ERROR (fatal), WARN (degraded), INFO (lifecycle), DEBUG (diagnostic)
FAQ
Q: Should I use info! or debug! for step enqueueing?
A: info! - It’s a significant lifecycle event even if frequent.
Q: When should I add duration_ms?
A: For any operation that:
- Calls external systems (DB, queue)
- Is in the hot path
- Takes >10ms typically
- Needs performance monitoring
Q: Can I use emojis in error messages? A: No. Never use emojis in any log message. They break parsers and are unprofessional.
Q: Should correlation_id really always be first? A: Yes. This enables easy correlation across all logs. It’s the #1 most important field for distributed tracing.
Q: What about ticket references in module docs? A: Acceptable in module-level documentation for architectural context. Remove from runtime logs and inline comments.
Q: Can I include stack traces in logs?
A: Use error = %e which includes the error chain. Only add explicit backtrace for truly exceptional cases.
References
Document End
This is a living document. Propose changes via PR with rationale.
OpenTelemetry Metrics Reference
Status: ✅ Complete Export Interval: 60 seconds OTLP Endpoint: http://localhost:4317 Grafana UI: http://localhost:3000
This document provides a complete reference for all OpenTelemetry metrics instrumented in the Tasker orchestration system.
Table of Contents
- Overview
- Configuration
- Orchestration Metrics
- Worker Metrics
- Resilience Metrics
- Database Metrics
- Messaging Metrics
- Example Queries
- Dashboard Recommendations
Overview
The Tasker system exports 47+ OpenTelemetry metrics across 5 domains:
| Domain | Metrics | Description |
|---|---|---|
| Orchestration | 11 | Task lifecycle, step coordination, finalization |
| Worker | 10 | Step execution, claiming, result submission |
| Resilience | 8+ | Circuit breakers, MPSC channels |
| Database | 7 | SQL query performance, connection pools |
| Messaging | 11 | PGMQ queue operations, message processing |
All metrics include correlation_id labels for distributed tracing correlation with Tempo traces.
Histogram Metric Naming
OpenTelemetry automatically appends _milliseconds to histogram metric names when the unit is specified as ms. This provides clarity in Prometheus queries.
Pattern: metric_name → metric_name_milliseconds_{bucket,sum,count}
Example:
- Code: tasker.step.execution.duration with unit “ms”
- Prometheus: tasker_step_execution_duration_milliseconds_*
Query Patterns: Instant vs Rate-Based
Instant/Recent Data Queries - Use these when:
- Testing with burst/batch task execution
- Viewing data from recent runs (last few minutes)
- Data points are sparse or clustered together
- You want simple averages without time windows
Rate-Based Queries - Use these when:
- Continuous production monitoring
- Data flows steadily over time
- Calculating per-second rates
- Building alerting rules
Why the difference matters: The rate() function calculates per-second change rates over a time window. It requires data points spread across that window. If you run 26 tasks in quick succession, all data points cluster at one timestamp, and rate() returns no data because there’s no rate change to calculate.
Configuration
Enable OpenTelemetry
File: config/tasker/environments/development/telemetry.toml
[telemetry]
enabled = true
service_name = "tasker-core-dev"
sample_rate = 1.0
[telemetry.opentelemetry]
enabled = true # Must be true to export metrics
Verify in Logs
# Should see:
# opentelemetry_enabled=true
# NOT: Metrics collection disabled (TELEMETRY_ENABLED=false)
Orchestration Metrics
Module: tasker-shared/src/metrics/orchestration.rs
Instrumentation: tasker-orchestration/src/orchestration/lifecycle/*.rs
Counters
tasker.tasks.requests.total
Description: Total number of task creation requests received Type: Counter (u64) Labels:
correlation_id: Request correlation ID
task_type: Task name (e.g., “mathematical_sequence”)
namespace: Task namespace (e.g., “rust_e2e_linear”)
Instrumented In: task_initializer.rs:start_task_initialization()
Example Query:
# Total task requests
tasker_tasks_requests_total
# By namespace
sum by (namespace) (tasker_tasks_requests_total)
# Specific correlation_id
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Output: (To be verified)
tasker.tasks.completions.total
Description: Total number of tasks that completed successfully Type: Counter (u64) Labels:
correlation_id: Request correlation ID
Instrumented In: task_finalizer.rs:finalize_task() (FinalizationAction::Completed)
Example Query:
# Total completions
tasker_tasks_completions_total
# Completion rate over 5 minutes
rate(tasker_tasks_completions_total[5m])
Expected Output: (To be verified)
tasker.tasks.failures.total
Description: Total number of tasks that failed Type: Counter (u64) Labels:
correlation_id: Request correlation ID
Instrumented In: task_finalizer.rs:finalize_task() (FinalizationAction::Failed)
Example Query:
# Total failures
tasker_tasks_failures_total
# Error rate over 5 minutes
rate(tasker_tasks_failures_total[5m])
Expected Output: (To be verified)
tasker.steps.enqueued.total
Description: Total number of steps enqueued to worker queues Type: Counter (u64) Labels:
correlation_id: Request correlation ID
namespace: Task namespace
step_name: Name of the enqueued step
Instrumented In: step_enqueuer.rs:enqueue_steps()
Example Query:
# Total steps enqueued
tasker_steps_enqueued_total
# By step name
sum by (step_name) (tasker_steps_enqueued_total)
# For specific task
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Output: (To be verified)
tasker.step_results.processed.total
Description: Total number of step results processed from workers Type: Counter (u64) Labels:
correlation_id: Request correlation ID
result_type: “success”, “error”, “timeout”, “cancelled”, “skipped”
Instrumented In: result_processor.rs:process_step_result()
Example Query:
# Total results processed
tasker_step_results_processed_total
# By result type
sum by (result_type) (tasker_step_results_processed_total)
# Success rate
rate(tasker_step_results_processed_total{result_type="success"}[5m])
Expected Output: (To be verified)
Histograms
tasker.task.initialization.duration
Description: Task initialization duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
tasker_task_initialization_duration_milliseconds_bucket
tasker_task_initialization_duration_milliseconds_sum
tasker_task_initialization_duration_milliseconds_count
Labels:
correlation_id: Request correlation ID
task_type: Task name
Instrumented In: task_initializer.rs:start_task_initialization()
Example Queries:
Instant/Recent Data (works immediately after task execution):
# Simple average initialization time
tasker_task_initialization_duration_milliseconds_sum /
tasker_task_initialization_duration_milliseconds_count
# P95 latency
histogram_quantile(0.95, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
# P99 latency
histogram_quantile(0.99, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
Rate-Based (for continuous monitoring, requires data spread over time):
# Average initialization time over 5 minutes
rate(tasker_task_initialization_duration_milliseconds_sum[5m]) /
rate(tasker_task_initialization_duration_milliseconds_count[5m])
# P95 latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))
Expected Output: ✅ Verified - Returns millisecond values
tasker.task.finalization.duration
Description: Task finalization duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
tasker_task_finalization_duration_milliseconds_bucket
tasker_task_finalization_duration_milliseconds_sum
tasker_task_finalization_duration_milliseconds_count
Labels:
correlation_id: Request correlation ID
final_state: “complete”, “error”, “cancelled”
Instrumented In: task_finalizer.rs:finalize_task()
Example Queries:
Instant/Recent Data:
# Simple average finalization time
tasker_task_finalization_duration_milliseconds_sum /
tasker_task_finalization_duration_milliseconds_count
# P95 by final state
histogram_quantile(0.95,
sum by (final_state, le) (
tasker_task_finalization_duration_milliseconds_bucket
)
)
Rate-Based:
# Average finalization time over 5 minutes
rate(tasker_task_finalization_duration_milliseconds_sum[5m]) /
rate(tasker_task_finalization_duration_milliseconds_count[5m])
# P95 by final state over 5 minutes
histogram_quantile(0.95,
sum by (final_state, le) (
rate(tasker_task_finalization_duration_milliseconds_bucket[5m])
)
)
Expected Output: ✅ Verified - Returns millisecond values
tasker.step_result.processing.duration
Description: Step result processing duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
tasker_step_result_processing_duration_milliseconds_bucket
tasker_step_result_processing_duration_milliseconds_sum
tasker_step_result_processing_duration_milliseconds_count
Labels:
correlation_id: Request correlation ID
result_type: “success”, “error”, “timeout”, “cancelled”, “skipped”
Instrumented In: result_processor.rs:process_step_result()
Example Queries:
Instant/Recent Data:
# Simple average result processing time
tasker_step_result_processing_duration_milliseconds_sum /
tasker_step_result_processing_duration_milliseconds_count
# P50, P95, P99 latencies
histogram_quantile(0.50, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
histogram_quantile(0.95, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
histogram_quantile(0.99, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
Rate-Based:
# Average result processing time over 5 minutes
rate(tasker_step_result_processing_duration_milliseconds_sum[5m]) /
rate(tasker_step_result_processing_duration_milliseconds_count[5m])
# P95 latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_step_result_processing_duration_milliseconds_bucket[5m])))
Expected Output: ✅ Verified - Returns millisecond values
Gauges
tasker.tasks.active
Description: Number of tasks currently being processed Type: Gauge (u64) Labels:
state: Current task state
Status: Planned (not yet instrumented)
tasker.steps.ready
Description: Number of steps ready for execution Type: Gauge (u64) Labels:
namespace: Worker namespace
Status: Planned (not yet instrumented)
Worker Metrics
Module: tasker-shared/src/metrics/worker.rs
Instrumentation: tasker-worker/src/worker/*.rs
Counters
tasker.steps.executions.total
Description: Total number of step executions attempted Type: Counter (u64) Labels:
correlation_id: Request correlation ID
Instrumented In: command_processor.rs:handle_execute_step()
Example Query:
# Total step executions
tasker_steps_executions_total
# Execution rate
rate(tasker_steps_executions_total[5m])
# For specific task
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Output: (To be verified)
tasker.steps.successes.total
Description: Total number of step executions that completed successfully Type: Counter (u64) Labels:
correlation_id: Request correlation ID
namespace: Worker namespace
Instrumented In: command_processor.rs:handle_execute_step() (success path)
Example Query:
# Total successes
tasker_steps_successes_total
# Success rate
rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])
# By namespace
sum by (namespace) (tasker_steps_successes_total)
Expected Output: (To be verified)
tasker.steps.failures.total
Description: Total number of step executions that failed Type: Counter (u64) Labels:
correlation_id: Request correlation ID
namespace: Worker namespace (or “unknown” for early failures)
error_type: “claim_failed”, “database_error”, “step_not_found”, “message_deletion_failed”
Instrumented In: command_processor.rs:handle_execute_step() (error paths)
Example Query:
# Total failures
tasker_steps_failures_total
# Failure rate
rate(tasker_steps_failures_total[5m]) / rate(tasker_steps_executions_total[5m])
# By error type
sum by (error_type) (tasker_steps_failures_total)
# Error distribution
topk(5, sum by (error_type) (tasker_steps_failures_total))
Expected Output: (To be verified)
tasker.steps.claimed.total
Description: Total number of steps claimed from queues Type: Counter (u64) Labels:
namespace: Worker namespace
claim_method: “event”, “poll”
Instrumented In: step_claim.rs:try_claim_step()
Example Query:
# Total claims
tasker_steps_claimed_total
# By claim method
sum by (claim_method) (tasker_steps_claimed_total)
# Claim rate
rate(tasker_steps_claimed_total[5m])
Expected Output: (To be verified)
tasker.steps.results_submitted.total
Description: Total number of step results submitted to orchestration Type: Counter (u64) Labels:
correlation_id: Request correlation ID
result_type: “completion”
Instrumented In: orchestration_result_sender.rs:send_completion()
Example Query:
# Total submissions
tasker_steps_results_submitted_total
# Submission rate
rate(tasker_steps_results_submitted_total[5m])
# For specific task
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Output: (To be verified)
Histograms
tasker.step.execution.duration
Description: Step execution duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
tasker_step_execution_duration_milliseconds_bucket
tasker_step_execution_duration_milliseconds_sum
tasker_step_execution_duration_milliseconds_count
Labels:
correlation_id: Request correlation ID
namespace: Worker namespace
result: “success”, “error”
Instrumented In: command_processor.rs:handle_execute_step()
Example Queries:
Instant/Recent Data:
# Simple average execution time
tasker_step_execution_duration_milliseconds_sum /
tasker_step_execution_duration_milliseconds_count
# P95 latency by namespace
histogram_quantile(0.95,
sum by (namespace, le) (
tasker_step_execution_duration_milliseconds_bucket
)
)
# P99 latency
histogram_quantile(0.99, sum by (le) (tasker_step_execution_duration_milliseconds_bucket))
Rate-Based:
# Average execution time over 5 minutes
rate(tasker_step_execution_duration_milliseconds_sum[5m]) /
rate(tasker_step_execution_duration_milliseconds_count[5m])
# P95 latency by namespace over 5 minutes
histogram_quantile(0.95,
sum by (namespace, le) (
rate(tasker_step_execution_duration_milliseconds_bucket[5m])
)
)
Expected Output: ✅ Verified - Returns millisecond values
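To make the bucket/sum/count series above concrete, here is a minimal, hypothetical sketch of recording such a duration with the OpenTelemetry Rust meter API, mirroring the builder pattern shown in the code excerpts later in this document; the helper function and call site are illustrative, not the actual command_processor.rs code.
use opentelemetry::{global, KeyValue};
use std::time::Instant;

// Illustrative helper: record one step's execution duration in milliseconds
// with the labels documented above.
fn record_step_duration(correlation_id: &str, namespace: &str, succeeded: bool, started: Instant) {
    let histogram = global::meter("tasker")
        .f64_histogram("tasker.step.execution.duration")
        .with_description("Step execution duration in milliseconds")
        .build();

    histogram.record(
        started.elapsed().as_secs_f64() * 1000.0, // elapsed time in ms
        &[
            KeyValue::new("correlation_id", correlation_id.to_string()),
            KeyValue::new("namespace", namespace.to_string()),
            KeyValue::new("result", if succeeded { "success" } else { "error" }),
        ],
    );
}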
tasker.step.claim.duration
Description: Step claiming duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
tasker_step_claim_duration_milliseconds_bucket
tasker_step_claim_duration_milliseconds_sum
tasker_step_claim_duration_milliseconds_count
Labels:
namespace: Worker namespace
claim_method: “event”, “poll”
Instrumented In: step_claim.rs:try_claim_step()
Example Queries:
Instant/Recent Data:
# Simple average claim time
tasker_step_claim_duration_milliseconds_sum /
tasker_step_claim_duration_milliseconds_count
# Compare event vs poll claiming (P95)
histogram_quantile(0.95,
sum by (claim_method, le) (
tasker_step_claim_duration_milliseconds_bucket
)
)
Rate-Based:
# Average claim time over 5 minutes
rate(tasker_step_claim_duration_milliseconds_sum[5m]) /
rate(tasker_step_claim_duration_milliseconds_count[5m])
# P95 by claim method over 5 minutes
histogram_quantile(0.95,
sum by (claim_method, le) (
rate(tasker_step_claim_duration_milliseconds_bucket[5m])
)
)
Expected Output: ✅ Verified - Returns millisecond values
tasker.step_result.submission.duration
Description: Step result submission duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
tasker_step_result_submission_duration_milliseconds_bucket
tasker_step_result_submission_duration_milliseconds_sum
tasker_step_result_submission_duration_milliseconds_count
Labels:
correlation_id: Request correlation ID
result_type: “completion”
Instrumented In: orchestration_result_sender.rs:send_completion()
Example Queries:
Instant/Recent Data:
# Simple average submission time
tasker_step_result_submission_duration_milliseconds_sum /
tasker_step_result_submission_duration_milliseconds_count
# P95 submission latency
histogram_quantile(0.95, sum by (le) (tasker_step_result_submission_duration_milliseconds_bucket))
Rate-Based:
# Average submission time over 5 minutes
rate(tasker_step_result_submission_duration_milliseconds_sum[5m]) /
rate(tasker_step_result_submission_duration_milliseconds_count[5m])
# P95 submission latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_step_result_submission_duration_milliseconds_bucket[5m])))
Expected Output: ✅ Verified - Returns millisecond values
Gauges
tasker.steps.active_executions
Description: Number of steps currently being executed Type: Gauge (u64) Labels:
namespace: Worker namespace
handler_type: “rust”, “ruby”
Status: Defined but not actively instrumented (gauge tracking removed during implementation)
tasker.queue.depth
Description: Current queue depth per namespace Type: Gauge (u64) Labels:
namespace: Worker namespace
Status: Planned (not yet instrumented)
Resilience Metrics
Module: tasker-shared/src/metrics/worker.rs, tasker-orchestration/src/web/circuit_breaker.rs
Instrumentation: Circuit breakers, MPSC channels
Related Docs: Circuit Breakers | Backpressure Architecture
Circuit Breaker Metrics
Circuit breakers provide fault isolation and cascade prevention. These metrics track breaker state transitions and related operations.
api_circuit_breaker_state
Description: Current state of the web API database circuit breaker Type: Gauge (i64) Values: 0=Closed, 1=Half-Open, 2=Open Labels: None
Instrumented In: tasker-orchestration/src/web/circuit_breaker.rs
Example Queries:
# Current state
api_circuit_breaker_state
# Alert when open
api_circuit_breaker_state == 2
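A state gauge like this is typically exported by observing shared state at collection time. The following is a minimal sketch under that assumption, modeled on the OpenTelemetry builder pattern used elsewhere in this document; the AtomicI64 holder and function name are illustrative, not the actual circuit_breaker.rs implementation.
use opentelemetry::global;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

// Illustrative sketch: export breaker state (0=Closed, 1=Half-Open, 2=Open)
// via an observable gauge that reads shared state at each collection cycle.
fn register_state_gauge(state: Arc<AtomicI64>) {
    let _gauge = global::meter("tasker")
        .i64_observable_gauge("api_circuit_breaker_state")
        .with_description("Current state of the web API database circuit breaker")
        .with_callback(move |observer| {
            observer.observe(state.load(Ordering::Relaxed), &[]);
        })
        .build();
}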
tasker_circuit_breaker_state
Description: Per-component circuit breaker state Type: Gauge (i64) Values: 0=Closed, 1=Half-Open, 2=Open Labels:
component: Circuit breaker name (e.g., “ffi_completion”, “task_readiness”, “pgmq”)
Instrumented In: Various circuit breaker implementations
Example Queries:
# All circuit breaker states
tasker_circuit_breaker_state
# Check specific component
tasker_circuit_breaker_state{component="ffi_completion"}
# Count open breakers
count(tasker_circuit_breaker_state == 2)
api_requests_rejected_total
Description: Total API requests rejected due to open circuit breaker Type: Counter (u64) Labels:
endpoint: The rejected endpoint path
Instrumented In: tasker-orchestration/src/web/circuit_breaker.rs
Example Queries:
# Total rejections
api_requests_rejected_total
# Rejection rate
rate(api_requests_rejected_total[5m])
# By endpoint
sum by (endpoint) (api_requests_rejected_total)
ffi_completion_slow_sends_total
Description: FFI completion channel sends exceeding latency threshold (100ms default) Type: Counter (u64) Labels: None
Instrumented In: tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs
Example Queries:
# Total slow sends
ffi_completion_slow_sends_total
# Slow send rate (alerts at >10/sec)
rate(ffi_completion_slow_sends_total[5m]) > 10
Alert Threshold: Warning when rate exceeds 10/second for 2 minutes
ffi_completion_circuit_open_rejections_total
Description: FFI completion operations rejected due to open circuit breaker Type: Counter (u64) Labels: None
Instrumented In: tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs
Example Queries:
# Total rejections
ffi_completion_circuit_open_rejections_total
# Rejection rate
rate(ffi_completion_circuit_open_rejections_total[5m])
MPSC Channel Metrics
Bounded MPSC channels provide backpressure control. These metrics track channel utilization and overflow events.
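As a rough sketch of what an overflow event means at the code level (assuming a tokio bounded channel and the counter-builder pattern shown in the Rust excerpts later in this document; the helper function and label values are illustrative):
use opentelemetry::{global, KeyValue};
use tokio::sync::mpsc;

// Illustrative sketch: try_send fails fast when the bounded buffer is full,
// and that overflow (backpressure) is counted with the labels documented below.
fn send_with_backpressure_metrics<T>(tx: &mpsc::Sender<T>, msg: T, channel: &str, component: &str) {
    if let Err(mpsc::error::TrySendError::Full(_dropped)) = tx.try_send(msg) {
        // A real implementation would retry or otherwise handle the dropped message.
        global::meter("tasker")
            .u64_counter("mpsc_channel_full_events_total")
            .with_description("Count of channel overflow events")
            .build()
            .add(
                1,
                &[
                    KeyValue::new("channel", channel.to_string()),
                    KeyValue::new("component", component.to_string()),
                ],
            );
    }
}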
mpsc_channel_usage_percent
Description: Current fill percentage of a bounded MPSC channel Type: Gauge (f64) Labels:
channel: Channel name (e.g., “orchestration_command”, “pgmq_notifications”)
component: Owning component
Instrumented In: Channel monitor integration points
Example Queries:
# All channel usage
mpsc_channel_usage_percent
# High usage channels
mpsc_channel_usage_percent > 80
# By component
max by (component) (mpsc_channel_usage_percent)
Alert Thresholds:
- Warning: > 80% for 15 minutes
- Critical: > 90% for 5 minutes
mpsc_channel_capacity
Description: Configured buffer capacity for an MPSC channel Type: Gauge (u64) Labels:
channel: Channel name
component: Owning component
Instrumented In: Channel monitor initialization
Example Queries:
# All channel capacities
mpsc_channel_capacity
# Compare usage to capacity
mpsc_channel_usage_percent / 100 * mpsc_channel_capacity
mpsc_channel_full_events_total
Description: Count of channel overflow events (backpressure applied) Type: Counter (u64) Labels:
channel: Channel name
component: Owning component
Instrumented In: Channel send operations with backpressure handling
Example Queries:
# Total overflow events
mpsc_channel_full_events_total
# Overflow rate
rate(mpsc_channel_full_events_total[5m])
# By channel
sum by (channel) (mpsc_channel_full_events_total)
Alert Threshold: Any overflow events indicate backpressure is occurring
Resilience Dashboard Panels
Circuit Breaker State Timeline:
# Panel: Time series with state mapping
api_circuit_breaker_state
# Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)
FFI Completion Health:
# Panel: Multi-stat showing slow sends and rejections
rate(ffi_completion_slow_sends_total[5m])
rate(ffi_completion_circuit_open_rejections_total[5m])
Channel Saturation Overview:
# Panel: Gauge showing max channel usage
max(mpsc_channel_usage_percent)
# Thresholds: Green < 70%, Yellow < 90%, Red >= 90%
Backpressure Events:
# Panel: Time series of overflow rate
rate(mpsc_channel_full_events_total[5m])
Database Metrics
Module: tasker-shared/src/metrics/database.rs
Status: ⚠️ Defined but not yet instrumented
Planned Metrics
- tasker.sql.queries.total - Counter
- tasker.sql.query.duration - Histogram
- tasker.db.pool.connections_active - Gauge
- tasker.db.pool.connections_idle - Gauge
- tasker.db.pool.wait_duration - Histogram
- tasker.db.transactions.total - Counter
- tasker.db.transaction.duration - Histogram
Messaging Metrics
Module: tasker-shared/src/metrics/messaging.rs
Status: ⚠️ Defined but not yet instrumented
Planned Metrics
- tasker.queue.messages_sent.total - Counter
- tasker.queue.messages_received.total - Counter
- tasker.queue.messages_deleted.total - Counter
- tasker.queue.message_send.duration - Histogram
- tasker.queue.message_receive.duration - Histogram
- tasker.queue.depth - Gauge
- tasker.queue.age_seconds - Gauge
- tasker.queue.visibility_timeouts.total - Counter
- tasker.queue.errors.total - Counter
- tasker.queue.retry_attempts.total - Counter
Note: Circuit breaker metrics (including queue-related circuit breakers) are documented in the Resilience Metrics section.
Example Queries
Task Execution Flow
Complete task execution for a specific correlation_id:
# 1. Task creation
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 2. Steps enqueued
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 3. Steps executed
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 4. Steps succeeded
tasker_steps_successes_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 5. Results submitted
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 6. Results processed
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 7. Task completed
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Flow: 1 → N → N → N → N → N → 1 (where N = number of steps)
Performance Analysis
Task initialization latency percentiles:
Instant/Recent Data:
# P50 (median)
histogram_quantile(0.50, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
# P95
histogram_quantile(0.95, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
# P99
histogram_quantile(0.99, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
Rate-Based (continuous monitoring):
# P50 (median)
histogram_quantile(0.50, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))
# P95
histogram_quantile(0.95, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))
# P99
histogram_quantile(0.99, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))
Step execution latency by namespace:
Instant/Recent Data:
histogram_quantile(0.95,
sum by (namespace, le) (
tasker_step_execution_duration_milliseconds_bucket
)
)
Rate-Based:
histogram_quantile(0.95,
sum by (namespace, le) (
rate(tasker_step_execution_duration_milliseconds_bucket[5m])
)
)
End-to-end task duration (from request to completion):
This requires combining initialization + step execution + finalization durations. Use the simple average approach for instant data:
# Average task initialization
tasker_task_initialization_duration_milliseconds_sum /
tasker_task_initialization_duration_milliseconds_count
# Average step execution
tasker_step_execution_duration_milliseconds_sum /
tasker_step_execution_duration_milliseconds_count
# Average finalization
tasker_task_finalization_duration_milliseconds_sum /
tasker_task_finalization_duration_milliseconds_count
Error Rate Monitoring
Overall step failure rate:
rate(tasker_steps_failures_total[5m]) /
rate(tasker_steps_executions_total[5m])
Error distribution by type:
topk(5, sum by (error_type) (tasker_steps_failures_total))
Task failure rate:
rate(tasker_tasks_failures_total[5m]) /
(rate(tasker_tasks_completions_total[5m]) + rate(tasker_tasks_failures_total[5m]))
Throughput Monitoring
Task request rate:
rate(tasker_tasks_requests_total[1m])
rate(tasker_tasks_requests_total[5m])
rate(tasker_tasks_requests_total[15m])
Step execution throughput:
sum(rate(tasker_steps_executions_total[5m]))
Step completion rate (successes + failures):
sum(rate(tasker_steps_successes_total[5m])) +
sum(rate(tasker_steps_failures_total[5m]))
Dashboard Recommendations
Task Execution Overview Dashboard
Panels:
- Task Request Rate
  - Query: rate(tasker_tasks_requests_total[5m])
  - Visualization: Time series graph
- Task Completion Rate
  - Query: rate(tasker_tasks_completions_total[5m])
  - Visualization: Time series graph
- Task Success/Failure Ratio
  - Query: Two series
    - Completions: rate(tasker_tasks_completions_total[5m])
    - Failures: rate(tasker_tasks_failures_total[5m])
  - Visualization: Stacked area chart
- Task Initialization Latency (P95)
  - Query: histogram_quantile(0.95, rate(tasker_task_initialization_duration_bucket[5m]))
  - Visualization: Time series graph
- Steps Enqueued vs Executed
  - Query: Two series
    - Enqueued: rate(tasker_steps_enqueued_total[5m])
    - Executed: rate(tasker_steps_executions_total[5m])
  - Visualization: Time series graph
Worker Performance Dashboard
Panels:
- Step Execution Throughput by Namespace
  - Query: sum by (namespace) (rate(tasker_steps_executions_total[5m]))
  - Visualization: Time series graph (multi-series)
- Step Success Rate
  - Query: rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])
  - Visualization: Gauge (0-1 scale)
- Step Execution Latency Percentiles
  - Query: Three series
    - P50: histogram_quantile(0.50, rate(tasker_step_execution_duration_bucket[5m]))
    - P95: histogram_quantile(0.95, rate(tasker_step_execution_duration_bucket[5m]))
    - P99: histogram_quantile(0.99, rate(tasker_step_execution_duration_bucket[5m]))
  - Visualization: Time series graph
- Step Claiming Performance (Event vs Poll)
  - Query: histogram_quantile(0.95, sum by (claim_method, le) (rate(tasker_step_claim_duration_bucket[5m])))
  - Visualization: Time series graph
- Error Distribution by Type
  - Query: sum by (error_type) (rate(tasker_steps_failures_total[5m]))
  - Visualization: Pie chart or bar chart
System Health Dashboard
Panels:
- Overall Task Success Rate
  - Query: rate(tasker_tasks_completions_total[5m]) / (rate(tasker_tasks_completions_total[5m]) + rate(tasker_tasks_failures_total[5m]))
  - Visualization: Stat panel with thresholds (green > 0.95, yellow > 0.90, red < 0.90)
- Step Failure Rate
  - Query: rate(tasker_steps_failures_total[5m]) / rate(tasker_steps_executions_total[5m])
  - Visualization: Stat panel with thresholds
- Average Task End-to-End Duration
  - Query: Combination of initialization, execution, and finalization durations
  - Visualization: Time series graph
- Result Processing Latency
  - Query: rate(tasker_step_result_processing_duration_sum[5m]) / rate(tasker_step_result_processing_duration_count[5m])
  - Visualization: Time series graph
- Active Operations
  - Query: Currently not instrumented (gauges removed)
  - Status: Planned future enhancement
Verification Checklist
Use this checklist to verify metrics are working correctly:
Prerequisites
- telemetry.opentelemetry.enabled = true in development config
- Services restarted after config change
- Logs show opentelemetry_enabled=true
- Grafana LGTM container running on ports 3000, 4317
Basic Verification
- At least one task created via CLI
- Correlation ID captured from task creation
- Trace visible in Grafana Tempo for correlation ID
Orchestration Metrics
- tasker_tasks_requests_total returns non-zero
- tasker_steps_enqueued_total returns expected step count
- tasker_step_results_processed_total returns expected result count
- tasker_tasks_completions_total increments on success
- tasker_task_initialization_duration_bucket has histogram data
Worker Metrics
- tasker_steps_executions_total returns non-zero
- tasker_steps_successes_total matches successful steps
- tasker_steps_claimed_total returns expected claims
- tasker_steps_results_submitted_total matches result submissions
- tasker_step_execution_duration_bucket has histogram data
Resilience Metrics
- api_circuit_breaker_state returns 0 (Closed) during normal operation
- /health/detailed endpoint shows circuit breaker states
- mpsc_channel_usage_percent returns values < 80% (no saturation)
- mpsc_channel_full_events_total is 0 or very low (no backpressure)
- FFI workers: ffi_completion_slow_sends_total is near zero
Correlation
- All metrics filterable by correlation_id
- Correlation ID in metrics matches trace ID in Tempo
- Complete execution flow visible from request to completion
Troubleshooting
No Metrics Appearing
Check 1: OpenTelemetry enabled
grep "opentelemetry_enabled" tmp/*.log
# Should show: opentelemetry_enabled=true
Check 2: OTLP endpoint accessible
curl -v http://localhost:4317 2>&1 | grep Connected
# Should show: Connected to localhost (127.0.0.1) port 4317
Check 3: Grafana LGTM running
curl -s http://localhost:3000/api/health | jq
# Should return healthy status
Check 4: Wait for the export interval (60 seconds). Metrics are batched and exported every 60 seconds, so wait at least 1 minute after task execution.
Metrics Missing Labels
If correlation_id or other labels are missing, check:
- Logs include the correlation_id field
- Metric instrumentation includes KeyValue::new() calls
- Labels match between metric definition and usage
Histogram Buckets Empty
If histogram queries return no data:
- Verify histogram is initialized: check logs for metric initialization
- Ensure duration values are non-zero and reasonable
- Check that record() is called, not add(), for histograms (see the sketch below)
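For reference, a minimal sketch of that distinction, using the meter-builder pattern shown elsewhere in this document (the function and attribute slice are illustrative):
use opentelemetry::{global, KeyValue};

// Illustrative only: histograms take record(), counters take add().
// If a duration is fed to a counter, no _bucket series is produced and
// histogram_quantile() has nothing to query.
fn example(duration_ms: f64, attrs: &[KeyValue]) {
    let meter = global::meter("tasker");

    // Durations: a histogram, recorded with record()
    let histogram = meter.f64_histogram("tasker.step.execution.duration").build();
    histogram.record(duration_ms, attrs);

    // Event counts: a counter, incremented with add()
    let counter = meter.u64_counter("tasker.steps.executions.total").build();
    counter.add(1, attrs);
}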
Next Steps
Phase 3.4 (Future)
- Instrument database metrics (7 metrics)
- Instrument messaging metrics (11 metrics)
- Add gauge tracking for active operations
- Implement queue depth monitoring
Production Readiness
- Create alert rules for error rates
- Set up automated dashboards
- Configure metric retention policies
- Add metric aggregation for long-term storage
Last Updated: 2025-12-10
Test Task: mathematical_sequence (correlation_id: 0199c3e0-ccdb-7581-87ab-3f67daeaa4a5)
Status: All orchestration and worker metrics verified and producing data ✅
Recent Updates:
- 2025-12-10: Added Resilience Metrics section (circuit breakers, MPSC channels)
- 2025-10-08: Initial metrics verification completed
Metrics Verification Guide
Purpose: Verify that documented metrics queries work with actual system data
Test Task: mathematical_sequence
Correlation ID: 0199c3e0-ccdb-7581-87ab-3f67daeaa4a5
Task ID: 0199c3e0-ccea-70f0-b6ae-3086b2f68280
Trace ID: d640f82572e231322edba0a5ef6e1405
How to Use This Guide
- Open Grafana at http://localhost:3000
- Navigate to Explore (compass icon in sidebar)
- Select Prometheus as the data source
- Copy each query below into the query editor
- Record the actual output
- Mark ✅ if query works, ❌ if it fails, or ⚠️ if partial data
Orchestration Metrics Verification
1. Task Requests Counter
Metric: tasker.tasks.requests.total
Query 1: Basic counter
tasker_tasks_requests_total
Expected: At least 1 (for our test task) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Filtered by correlation_id
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Exactly 1 Actual Result: _____________ Labels Present: [ ] correlation_id [ ] task_type [ ] namespace Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: Sum by namespace
sum by (namespace) (tasker_tasks_requests_total)
Expected: 1 for namespace “rust_e2e_linear” Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
2. Task Completions Counter
Metric: tasker.tasks.completions.total
Query 1: Basic counter
tasker_tasks_completions_total
Expected: At least 1 (if task completed successfully) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Filtered by correlation_id
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: Completion rate over 5 minutes
rate(tasker_tasks_completions_total[5m])
Expected: Some positive rate value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
3. Steps Enqueued Counter
Metric: tasker.steps.enqueued.total
Query 1: Total steps enqueued for our task
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Number of steps in mathematical_sequence workflow (likely 3-4 steps) Actual Result: _____________ Step Names Visible: [ ] Yes [ ] No Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Sum by step name
sum by (step_name) (tasker_steps_enqueued_total)
Expected: Breakdown by step name Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
4. Step Results Processed Counter
Metric: tasker.step_results.processed.total
Query 1: Results processed for our task
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Same as steps enqueued Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Sum by result type
sum by (result_type) (tasker_step_results_processed_total)
Expected: Breakdown showing “success” results Actual Result: _____________ Result Types Visible: [ ] success [ ] error [ ] timeout [ ] cancelled [ ] skipped Status: [ ] ✅ [ ] ❌ [ ] ⚠️
5. Task Initialization Duration Histogram
Metric: tasker.task.initialization.duration
Query 1: Check if histogram has data
tasker_task_initialization_duration_count
Expected: At least 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average initialization time
rate(tasker_task_initialization_duration_sum[5m]) /
rate(tasker_task_initialization_duration_count[5m])
Expected: Some millisecond value (probably < 100ms) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 latency
histogram_quantile(0.95, rate(tasker_task_initialization_duration_bucket[5m]))
Expected: P95 millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 4: P99 latency
histogram_quantile(0.99, rate(tasker_task_initialization_duration_bucket[5m]))
Expected: P99 millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
6. Task Finalization Duration Histogram
Metric: tasker.task.finalization.duration
Query 1: Check count
tasker_task_finalization_duration_count
Expected: At least 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average finalization time
rate(tasker_task_finalization_duration_sum[5m]) /
rate(tasker_task_finalization_duration_count[5m])
Expected: Some millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 by final_state
histogram_quantile(0.95,
sum by (final_state, le) (
rate(tasker_task_finalization_duration_bucket[5m])
)
)
Expected: P95 value grouped by final_state (likely “complete”) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
7. Step Result Processing Duration Histogram
Metric: tasker.step_result.processing.duration
Query 1: Check count
tasker_step_result_processing_duration_count
Expected: Number of steps processed Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average processing time
rate(tasker_step_result_processing_duration_sum[5m]) /
rate(tasker_step_result_processing_duration_count[5m])
Expected: Millisecond value for result processing Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Worker Metrics Verification
8. Step Executions Counter
Metric: tasker.steps.executions.total
Query 1: Total executions
tasker_steps_executions_total
Expected: Number of steps in workflow Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: For specific task
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Number of steps executed Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: Execution rate
rate(tasker_steps_executions_total[5m])
Expected: Positive rate Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
9. Step Successes Counter
Metric: tasker.steps.successes.total
Query 1: Total successes
tasker_steps_successes_total
Expected: Should equal executions if all succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: By namespace
sum by (namespace) (tasker_steps_successes_total)
Expected: Successes grouped by namespace Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: Success rate
rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])
Expected: ~1.0 (100%) if all steps succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
10. Step Failures Counter
Metric: tasker.steps.failures.total
Query 1: Total failures
tasker_steps_failures_total
Expected: 0 if all steps succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: By error type
sum by (error_type) (tasker_steps_failures_total)
Expected: No results if no failures, or breakdown by error type Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
11. Steps Claimed Counter
Metric: tasker.steps.claimed.total
Query 1: Total claims
tasker_steps_claimed_total
Expected: Number of steps claimed (should match executions) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: By claim method
sum by (claim_method) (tasker_steps_claimed_total)
Expected: Breakdown by “event” or “poll” Actual Result: _____________ Claim Methods Visible: [ ] event [ ] poll Status: [ ] ✅ [ ] ❌ [ ] ⚠️
12. Step Results Submitted Counter
Metric: tasker.steps.results_submitted.total
Query 1: Total submissions
tasker_steps_results_submitted_total
Expected: Number of steps that submitted results Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: For specific task
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Number of step results submitted Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
13. Step Execution Duration Histogram
Metric: tasker.step.execution.duration
Query 1: Check count
tasker_step_execution_duration_count
Expected: Number of step executions Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average execution time
rate(tasker_step_execution_duration_sum[5m]) /
rate(tasker_step_execution_duration_count[5m])
Expected: Average milliseconds per step Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 latency by namespace
histogram_quantile(0.95,
sum by (namespace, le) (
rate(tasker_step_execution_duration_bucket[5m])
)
)
Expected: P95 latency for rust_e2e_linear namespace Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 4: P99 latency
histogram_quantile(0.99, rate(tasker_step_execution_duration_bucket[5m]))
Expected: P99 latency value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
14. Step Claim Duration Histogram
Metric: tasker.step.claim.duration
Query 1: Check count
tasker_step_claim_duration_count
Expected: Number of claims Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average claim time
rate(tasker_step_claim_duration_sum[5m]) /
rate(tasker_step_claim_duration_count[5m])
Expected: Average milliseconds to claim Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 by claim method
histogram_quantile(0.95,
sum by (claim_method, le) (
rate(tasker_step_claim_duration_bucket[5m])
)
)
Expected: P95 claim latency by method Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
15. Step Result Submission Duration Histogram
Metric: tasker.step_result.submission.duration
Query 1: Check count
tasker_step_result_submission_duration_count
Expected: Number of result submissions Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average submission time
rate(tasker_step_result_submission_duration_sum[5m]) /
rate(tasker_step_result_submission_duration_count[5m])
Expected: Average milliseconds to submit Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 submission latency
histogram_quantile(0.95, rate(tasker_step_result_submission_duration_bucket[5m]))
Expected: P95 submission latency Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Complete Execution Flow Verification
Purpose: Verify the full task lifecycle is visible in metrics
Query: Complete flow for correlation_id
# Run each query and record the value
# 1. Task created
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 2. Steps enqueued
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 3. Steps claimed
tasker_steps_claimed_total
Result: _____________
# 4. Steps executed
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 5. Steps succeeded
tasker_steps_successes_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 6. Results submitted
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 7. Results processed
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 8. Task completed
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
Expected Pattern: 1 → N → N → N → N → N → N → 1 Actual Pattern: _____________ → _____ → _____ → _____ → _____ → _____ → _____ → _____
Analysis:
- Do the numbers make sense for your workflow? [ ] Yes [ ] No
- Are any steps missing? _____________
- Do counts match expectations? [ ] Yes [ ] No
Issues Found
Document any issues discovered during verification:
Issue 1
Query: _____________ Expected: _____________ Actual: _____________ Problem: _____________ Fix Required: [ ] Yes [ ] No
Issue 2
Query: _____________ Expected: _____________ Actual: _____________ Problem: _____________ Fix Required: [ ] Yes [ ] No
Summary
Total Queries Tested: _____________ Successful: _____________ ✅ Failed: _____________ ❌ Partial: _____________ ⚠️
Overall Status: [ ] All Working [ ] Some Issues [ ] Major Problems
Ready for Production: [ ] Yes [ ] No [ ] Needs Work
Verification Date: _____________ Verified By: _____________ Grafana Version: _____________ OpenTelemetry Version: 0.26
OpenTelemetry Improvements
Last Updated: 2025-12-01 Audience: Developers, Operators Status: Active Related Docs: Observability Hub | Metrics Reference | Domain Events
This document describes the OpenTelemetry improvements for the domain event system, including two-phase FFI telemetry initialization, domain event metrics, and enhanced observability for the distributed event system.
Overview
These telemetry improvements support the domain event system while addressing FFI-specific challenges:
| Improvement | Purpose | Impact |
|---|---|---|
| Two-Phase FFI Telemetry | Safe telemetry in FFI workers | No segfaults during Ruby/Python bridging |
| Domain Event Metrics | Event system observability | Real-time monitoring of event publication |
| Correlation ID Propagation | End-to-end tracing | Events traceable across distributed system |
| Worker Metrics Endpoint | Domain event statistics | /metrics/events for monitoring dashboards |
Two-Phase FFI Telemetry Initialization
The Problem
When Rust workers operate with Ruby FFI bindings, OpenTelemetry’s global tracer/meter providers can cause issues:
- Thread Safety: Ruby’s GVL (Global VM Lock) conflicts with OpenTelemetry’s internal threading
- Signal Handling: OpenTelemetry’s OTLP exporter may interfere with Ruby signal handling
- Segfaults: Premature initialization can cause crashes during FFI boundary crossings
The Solution: Two-Phase Initialization
flowchart LR
subgraph Phase1["Phase 1 (FFI-Safe)"]
A[Console logger]
B[Tracing init]
C[No OTLP export]
D[No global state]
end
subgraph Phase2["Phase 2 (Full OTel)"]
E[OTLP exporter]
F[Metrics export]
G[Full tracing]
H[Global tracer]
end
Phase1 -->|"After FFI bridge<br/>established"| Phase2
Worker Bootstrap Sequence:
- Load Rust worker library
- Initialize Phase 1 (console-only logging)
- Execute FFI bridge setup (Ruby/Python)
- Initialize Phase 2 (full OpenTelemetry)
Implementation
Phase 1: Console-Only Initialization (FFI-Safe):
#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs (lines 284-326)
/// Initialize console-only logging (FFI-safe, no Tokio runtime required)
///
/// This function sets up structured console logging without OpenTelemetry,
/// making it safe to call from FFI initialization contexts where no Tokio
/// runtime exists yet.
pub fn init_console_only() {
TRACING_INITIALIZED.get_or_init(|| {
let environment = get_environment();
let log_level = get_log_level(&environment);
// Determine if we're in a TTY for ANSI color support
let use_ansi = IsTerminal::is_terminal(&std::io::stdout());
// Create base console layer
let console_layer = fmt::layer()
.with_target(true)
.with_thread_ids(true)
.with_level(true)
.with_ansi(use_ansi)
.with_filter(EnvFilter::new(&log_level));
// Build subscriber with console layer only (no telemetry)
let subscriber = tracing_subscriber::registry().with(console_layer);
if subscriber.try_init().is_err() {
tracing::debug!(
"Global tracing subscriber already initialized"
);
} else {
tracing::info!(
environment = %environment,
opentelemetry_enabled = false,
context = "ffi_initialization",
"Console-only logging initialized (FFI-safe mode)"
);
}
// Initialize basic metrics (no OpenTelemetry exporters)
metrics::init_metrics();
metrics::orchestration::init();
metrics::worker::init();
metrics::database::init();
metrics::messaging::init();
});
}
}
Phase 2: Full OpenTelemetry Initialization:
#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs (lines 361-449)
/// Initialize tracing with console output and optional OpenTelemetry
///
/// When OpenTelemetry is enabled (via TELEMETRY_ENABLED=true), it also
/// configures distributed tracing with OTLP exporter.
///
/// **IMPORTANT**: When telemetry is enabled, this function MUST be called from
/// a Tokio runtime context because the batch exporter requires async I/O.
pub fn init_tracing() {
TRACING_INITIALIZED.get_or_init(|| {
let environment = get_environment();
let log_level = get_log_level(&environment);
let telemetry_config = TelemetryConfig::default();
// Determine if we're in a TTY for ANSI color support
let use_ansi = IsTerminal::is_terminal(&std::io::stdout());
// Create base console layer
let console_layer = fmt::layer()
.with_target(true)
.with_thread_ids(true)
.with_level(true)
.with_ansi(use_ansi)
.with_filter(EnvFilter::new(&log_level));
// Build subscriber with optional OpenTelemetry layer
let subscriber = tracing_subscriber::registry().with(console_layer);
if telemetry_config.enabled {
// Initialize OpenTelemetry tracer and logger
match (init_opentelemetry_tracer(&telemetry_config),
init_opentelemetry_logger(&telemetry_config)) {
(Ok(tracer_provider), Ok(logger_provider)) => {
// Add trace layer
let tracer = tracer_provider.tracer("tasker-core");
let telemetry_layer = OpenTelemetryLayer::new(tracer);
// Add log layer (bridge tracing logs -> OTEL logs)
let log_layer = OpenTelemetryTracingBridge::new(&logger_provider);
let subscriber = subscriber.with(telemetry_layer).with(log_layer);
if subscriber.try_init().is_ok() {
tracing::info!(
environment = %environment,
opentelemetry_enabled = true,
logs_enabled = true,
otlp_endpoint = %telemetry_config.otlp_endpoint,
service_name = %telemetry_config.service_name,
"Console logging with OpenTelemetry initialized"
);
}
}
// ... error handling with fallback to console-only
}
}
});
}
}
Worker Bootstrap Integration:
#![allow(unused)]
fn main() {
// workers/rust/src/bootstrap.rs (lines 69-131)
pub async fn bootstrap() -> Result<(WorkerSystemHandle, RustEventHandler)> {
info!("📋 Creating native Rust step handler registry...");
let registry = Arc::new(RustStepHandlerRegistry::new());
// Get global event system for connecting to worker events
info!("🔗 Setting up event system connection...");
let event_system = get_global_event_system();
// Bootstrap the worker using tasker-worker foundation
info!("🏗️ Bootstrapping worker with tasker-worker foundation...");
let worker_handle =
WorkerBootstrap::bootstrap_with_event_system(Some(event_system.clone())).await?;
// Create step event publisher registry with domain event publisher
info!("🔔 Setting up step event publisher registry...");
let domain_event_publisher = {
let worker_core = worker_handle.worker_core.lock().await;
worker_core.domain_event_publisher()
};
// Dual-Path: Create in-process event bus for fast event delivery
info!("⚡ Creating in-process event bus for fast domain events...");
let in_process_bus = Arc::new(RwLock::new(InProcessEventBus::new(
InProcessEventBusConfig::default(),
)));
// Dual-Path: Create event router for dual-path delivery
info!("🔀 Creating event router for dual-path delivery...");
let event_router = Arc::new(RwLock::new(EventRouter::new(
domain_event_publisher.clone(),
in_process_bus.clone(),
)));
// Create registry with EventRouter for dual-path delivery
let mut step_event_registry =
StepEventPublisherRegistry::with_event_router(
domain_event_publisher.clone(),
event_router
);
Ok((worker_handle, event_handler))
}
}
Configuration
Telemetry is configured exclusively via environment variables. This is intentional because logging must be initialized before the TOML config loader runs (to log any config loading errors).
# Enable OpenTelemetry
export TELEMETRY_ENABLED=true
# OTLP endpoint (default: http://localhost:4317)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
# Service identification
export OTEL_SERVICE_NAME=tasker-orchestration
export OTEL_SERVICE_VERSION=0.1.0
# Deployment environment (falls back to TASKER_ENV, then "development")
export DEPLOYMENT_ENVIRONMENT=production
# Sampling rate (0.0 to 1.0, default: 1.0 = 100%)
export OTEL_TRACES_SAMPLER_ARG=1.0
The TelemetryConfig::default() implementation in tasker-shared/src/logging.rs:144-164
reads all values from environment variables at initialization time.
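As a rough sketch of what that looks like (the struct, field names, and defaults here are assumptions based on the variables listed above, not the exact implementation in logging.rs):
use std::env;

// Hypothetical sketch: a Default impl that reads telemetry settings from the
// environment with sensible fallbacks.
struct TelemetryConfigSketch {
    enabled: bool,
    otlp_endpoint: String,
    service_name: String,
    sample_rate: f64,
}

impl Default for TelemetryConfigSketch {
    fn default() -> Self {
        Self {
            enabled: env::var("TELEMETRY_ENABLED")
                .map(|v| v == "true")
                .unwrap_or(false),
            otlp_endpoint: env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
                .unwrap_or_else(|_| "http://localhost:4317".to_string()),
            service_name: env::var("OTEL_SERVICE_NAME")
                .unwrap_or_else(|_| "tasker-orchestration".to_string()),
            sample_rate: env::var("OTEL_TRACES_SAMPLER_ARG")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(1.0),
        }
    }
}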
Domain Event Metrics
New Metrics
Domain event observability metrics:
| Metric | Type | Description |
|---|---|---|
| tasker.domain_events.published.total | Counter | Total events published |
| router.durable_routed | Counter | Events sent via durable path (PGMQ) |
| router.fast_routed | Counter | Events sent via fast path (in-process) |
| router.broadcast_routed | Counter | Events broadcast to both paths |
Implementation
Domain event metrics are emitted inline during publication:
#![allow(unused)]
fn main() {
// tasker-shared/src/events/domain_events.rs (lines 207-219)
// Emit OpenTelemetry metric
let counter = opentelemetry::global::meter("tasker")
.u64_counter("tasker.domain_events.published.total")
.with_description("Total number of domain events published")
.build();
counter.add(
1,
&[
opentelemetry::KeyValue::new("event_name", event_name.to_string()),
opentelemetry::KeyValue::new("namespace", metadata.namespace.clone()),
],
);
}
Event routing statistics are tracked in the EventRouterStats and InProcessEventBusStats structures:
#![allow(unused)]
fn main() {
// tasker-shared/src/metrics/worker.rs (lines 431-444)
/// Statistics for the event router
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
pub struct EventRouterStats {
/// Total events routed through the router
pub total_routed: u64,
/// Events sent via durable path (PGMQ)
pub durable_routed: u64,
/// Events sent via fast path (in-process)
pub fast_routed: u64,
/// Events broadcast to both paths
pub broadcast_routed: u64,
/// Fast delivery errors in broadcast mode (non-fatal, logged for monitoring)
pub fast_delivery_errors: u64,
/// Failed routing attempts (durable failures only)
pub routing_errors: u64,
}
// tasker-shared/src/metrics/worker.rs (lines 455-467)
/// Statistics for the in-process event bus
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
pub struct InProcessEventBusStats {
/// Total events dispatched through the bus
pub total_events_dispatched: u64,
/// Total events dispatched to Rust handlers
pub rust_handler_dispatches: u64,
/// Total events dispatched to FFI channel
pub ffi_channel_dispatches: u64,
}
}
Prometheus Queries
Event publication rate by namespace:
sum by (namespace) (rate(tasker_domain_events_published_total[5m]))
Event failure rate:
rate(tasker_domain_events_failed_total[5m]) /
rate(tasker_domain_events_published_total[5m])
Publication latency (P95):
histogram_quantile(0.95,
sum by (le) (rate(tasker_domain_events_publish_duration_milliseconds_bucket[5m]))
)
Latency by delivery mode:
histogram_quantile(0.95,
sum by (delivery_mode, le) (
rate(tasker_domain_events_publish_duration_milliseconds_bucket[5m])
)
)
Worker Metrics Endpoint
/metrics/events Endpoint
The worker exposes domain event statistics through a dedicated metrics endpoint:
Request:
curl http://localhost:8081/metrics/events
Response:
{
"router": {
"total_routed": 42,
"durable_routed": 10,
"fast_routed": 30,
"broadcast_routed": 2,
"fast_delivery_errors": 0,
"routing_errors": 0
},
"in_process_bus": {
"total_events_dispatched": 32,
"rust_handler_dispatches": 20,
"ffi_channel_dispatches": 12
},
"captured_at": "2025-12-01T10:30:00Z",
"worker_id": "worker-01234567"
}
Implementation
#![allow(unused)]
fn main() {
// tasker-worker/src/web/handlers/metrics.rs (lines 178-218)
/// Domain event statistics endpoint: GET /metrics/events
///
/// Returns statistics about domain event routing and delivery paths.
/// Used for monitoring event publishing and by E2E tests to verify
/// events were published through the expected delivery paths.
///
/// # Response
///
/// Returns statistics for:
/// - **Router stats**: durable_routed, fast_routed, broadcast_routed counts
/// - **In-process bus stats**: handler dispatches, FFI channel dispatches
pub async fn domain_event_stats(
State(state): State<Arc<WorkerWebState>>,
) -> Json<DomainEventStats> {
debug!("Serving domain event statistics");
// Use cached event components - does not lock worker core
let stats = state.domain_event_stats().await;
Json(stats)
}
}
The DomainEventStats structure is defined in tasker-shared/src/types/web.rs:
#![allow(unused)]
fn main() {
// tasker-shared/src/types/web.rs (lines 546-555)
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct DomainEventStats {
/// Event router statistics
pub router: EventRouterStats,
/// In-process event bus statistics
pub in_process_bus: InProcessEventBusStats,
/// Timestamp when stats were captured
pub captured_at: DateTime<Utc>,
/// Worker ID for correlation
pub worker_id: String,
}
}
Correlation ID Propagation
End-to-End Tracing
Domain events maintain correlation IDs for distributed tracing:
flowchart LR
subgraph TaskCreation["Task Creation"]
A[correlation_id<br/>UUIDv7]
end
subgraph StepExecution["Step Execution"]
B[correlation_id<br/>propagated]
end
subgraph DomainEvent["Domain Event"]
C[correlation_id<br/>in metadata]
end
TaskCreation --> StepExecution --> DomainEvent
subgraph TraceContext["Trace Context"]
D[task_uuid]
E[step_uuid]
F[step_name]
G[namespace]
H[correlation_id]
end
Tracing Integration
The DomainEventPublisher::publish_event method uses #[instrument] for automatic span creation:
#![allow(unused)]
fn main() {
// tasker-shared/src/events/domain_events.rs (lines 157-231)
#[instrument(skip(self, payload, metadata), fields(
event_name = %event_name,
namespace = %metadata.namespace,
correlation_id = %metadata.correlation_id
))]
pub async fn publish_event(
&self,
event_name: &str,
payload: DomainEventPayload,
metadata: EventMetadata,
) -> Result<Uuid, DomainEventError> {
let event_id = Uuid::now_v7();
let queue_name = format!("{}_domain_events", metadata.namespace);
debug!(
event_id = %event_id,
event_name = %event_name,
queue_name = %queue_name,
task_uuid = %metadata.task_uuid,
correlation_id = %metadata.correlation_id,
"Publishing domain event"
);
// Create and serialize domain event
let event = DomainEvent {
event_id,
event_name: event_name.to_string(),
event_version: "1.0".to_string(),
payload,
metadata: metadata.clone(),
};
// Publish to PGMQ
let message_id = self.message_client
.send_json_message(&queue_name, &event_json)
.await?;
info!(
event_id = %event_id,
message_id = message_id,
correlation_id = %metadata.correlation_id,
"Domain event published successfully"
);
Ok(event_id)
}
}
Querying by Correlation ID
Find all events for a task:
# In Grafana/Tempo
correlation_id = "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"
In PostgreSQL (PGMQ queues):
SELECT
message->>'event_name' as event,
message->'metadata'->>'step_name' as step,
message->'metadata'->>'fired_at' as fired_at
FROM pgmq.q_payments_domain_events
WHERE message->'metadata'->>'correlation_id' = '0199c3e0-ccdb-7581-87ab-3f67daeaa4a5'
ORDER BY message->'metadata'->>'fired_at';
Span Hierarchy
Domain Event Spans
Domain event spans:
Task Execution (root span)
├── Step Execution
│ ├── Handler Call
│ │ └── Business Logic
│ └── publish_domain_event ◄── NEW
│ ├── route_event
│ │ ├── publish_durable (if durable/broadcast)
│ │ └── publish_fast (if fast/broadcast)
│ └── record_metrics
└── Result Submission
Span Attributes
| Span | Attributes |
|---|---|
| publish_domain_event | event_name, namespace, correlation_id, delivery_mode |
| route_event | delivery_mode, target_queue (if durable) |
| publish_durable | queue_name, message_size |
| publish_fast | subscriber_count |
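Child spans like these can be created with the same tracing macros used in the excerpts above. A minimal, hypothetical sketch for the publish_durable span (the function and variables are illustrative, not the actual publisher code):
use tracing::info_span;

// Illustrative sketch: open a child span carrying the attributes listed above
// for the durable delivery path.
fn publish_durable_example(queue_name: &str, message_size: usize) {
    let span = info_span!(
        "publish_durable",
        queue_name = %queue_name,
        message_size = message_size
    );
    let _guard = span.enter();
    // ... send to PGMQ while the span is active ...
}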
Troubleshooting
Console-Only Mode (No OTLP Export)
Symptom: Logs show “Console-only logging initialized (FFI-safe mode)” but no OpenTelemetry traces
Cause: init_console_only() was called but init_tracing() was never called, or TELEMETRY_ENABLED=false
Fix:
- Check initialization logs: grep -E "(Console-only|OpenTelemetry)" logs/worker.log
- Verify TELEMETRY_ENABLED=true is set: grep "opentelemetry_enabled" logs/worker.log
Domain Event Metrics Missing
Symptom: /metrics/events returns zeros for all stats
Cause: Events not being published or the event router/bus not tracking statistics
Fix:
- Verify events are being published: grep "Domain event published successfully" logs/worker.log
- Check event router initialization: grep "event router" logs/worker.log
- Verify the in-process event bus is configured: grep "in-process event bus" logs/worker.log
Correlation ID Not Propagating
Symptom: Events have different correlation IDs than parent task
Cause: EventMetadata not constructed with task’s correlation_id
Fix: Verify EventMetadata is constructed with the correct correlation_id from the task:
#![allow(unused)]
fn main() {
// When constructing EventMetadata, always use the task's correlation_id
let metadata = EventMetadata {
task_uuid: step_data.task.task.task_uuid,
step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
step_name: Some(step_data.workflow_step.name.clone()),
namespace: step_data.task.namespace_name.clone(),
correlation_id: step_data.task.task.correlation_id, // Must use task's ID
fired_at: chrono::Utc::now(),
fired_by: handler_name.to_string(),
};
}
Best Practices
1. Always Use Two-Phase Init for FFI Workers
#![allow(unused)]
fn main() {
// Correct: Two-phase initialization pattern
// Phase 1: During FFI initialization (Magnus, PyO3, WASM)
tasker_shared::logging::init_console_only();
// Phase 2: After runtime creation
let runtime = tokio::runtime::Runtime::new()?;
runtime.block_on(async {
tasker_shared::logging::init_tracing();
});
// Incorrect: Calling init_tracing() during FFI initialization
// before Tokio runtime exists (may cause issues with OTLP exporter)
}
2. Include Correlation ID in All Events
#![allow(unused)]
fn main() {
// Always propagate correlation_id from the task
let metadata = EventMetadata {
task_uuid: step_data.task.task.task_uuid,
step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
step_name: Some(step_data.workflow_step.name.clone()),
namespace: step_data.task.namespace_name.clone(),
correlation_id: step_data.task.task.correlation_id, // Critical!
fired_at: chrono::Utc::now(),
fired_by: handler_name.to_string(),
};
}
3. Use Structured Logging with Correlation Context
#![allow(unused)]
fn main() {
// All logs should include correlation_id for trace correlation
info!(
event_id = %event_id,
event_name = %event_name,
correlation_id = %metadata.correlation_id,
namespace = %metadata.namespace,
"Domain event published successfully"
);
}
Related Documentation
- Metrics Reference: metrics-reference.md - Complete metrics catalog
- Domain Events: ../domain-events.md - Event system architecture
- Logging Standards: logging-standards.md - Structured logging best practices
This telemetry architecture provides robust observability for domain events while ensuring safe operation with FFI-based language bindings.
Tasker Core Principles
This directory contains the core principles and design philosophy that guide Tasker Core development. These principles are not arbitrary rules but hard-won lessons extracted from implementation experience, root cause analyses, and architectural decisions.
Core Documents
| Document | Description |
|---|---|
| Tasker Core Tenets | The 11 foundational principles that drive all architecture and design decisions |
| Defense in Depth | Multi-layered protection model for idempotency and data integrity |
| Fail Loudly | Why errors beat silent defaults, and phantom data breaks trust |
| Cross-Language Consistency | The “one API” philosophy for Rust, Ruby, Python, and TypeScript workers |
| Composition Over Inheritance | Mixin-based handler composition pattern |
| Intentional AI Partnership | Collaborative approach to AI integration |
Influences
| Document | Description |
|---|---|
| Twelve-Factor App Alignment | How the 12-factor methodology shapes our architecture, with codebase examples and honest gap assessment |
| Zen of Python (PEP-20) | Tim Peters’ guiding principles — referenced as inspiration |
How These Principles Were Derived
These principles emerged from:
- Root Cause Analyses: Ownership removal revealed that “redundant protection with harmful side effects” is worse than minimal, well-understood protection
- Cross-Language Development: Handler harmonization established patterns for consistent APIs across four languages
- Architectural Migrations: Actor pattern refactoring proved the pattern’s effectiveness
- Production Incidents: Real bugs in parallel execution (Heisenbugs becoming Bohrbugs) shaped defensive design
- Protocol Trust Analysis: gRPC client refactoring exposed how silent defaults create phantom data that breaks consumer trust
When to Consult These Documents
- Design decisions: Read Tasker Core Tenets before proposing architecture changes
- Adding protections: Consult Defense in Depth to understand existing layers
- Error handling: Review Fail Loudly before adding defaults or fallbacks
- Worker development: Review Cross-Language Consistency for API alignment
- Handler patterns: Study Composition Over Inheritance for proper structure
Related Documentation
- Architecture Decisions: docs/decisions/ for specific ADRs
- Historical Context: docs/CHRONOLOGY.md for development timeline
- Implementation Details: docs/ticket-specs/ for original specifications
Composition Over Inheritance
Last Updated: 2026-01-01
This document describes Tasker Core’s approach to handler composition using mixins and traits rather than class hierarchies.
The Core Principle
Not: class Handler < API
But: class Handler < Base; include API, include Decision, include Batchable
Handlers gain capabilities by mixing in modules, not by inheriting from specialized base classes.
Why Composition?
The Problem with Inheritance
Deep inheritance hierarchies create problems:
# BAD: Inheritance-based capabilities
class APIHandler < BaseHandler; end
class APIDecisionHandler < APIHandler; end
class APIDecisionBatchableHandler < APIDecisionHandler
  # Which methods came from where?
  # How do I override just one behavior?
  # What if I need Batchable but not API?
end
| Problem | Description |
|---|---|
| Diamond problem | Multiple paths to same ancestor |
| Tight coupling | Can’t change base without affecting all children |
| Inflexible | Can’t pick-and-choose capabilities |
| Hard to test | Must test entire hierarchy |
| Opaque behavior | Method origin unclear |
The Composition Solution
Mixins provide selective capabilities:
# GOOD: Composition-based capabilities
class MyHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::APICapable
include TaskerCore::StepHandler::DecisionCapable
def call(context)
# Has API methods (get, post, put, delete)
# Has Decision methods (decision_success, decision_no_branches)
# Does NOT have Batchable methods (didn't include it)
end
end
| Benefit | Description |
|---|---|
| Selective inclusion | Only the capabilities you need |
| Clear origin | Module name indicates where methods come from |
| Independent testing | Test each mixin in isolation |
| Flexible combination | Any combination of capabilities |
| Flat structure | No deep hierarchies to navigate |
The Discovery
Analysis of Batchable handlers revealed they already used the composition pattern:
# Batchable was the TARGET architecture all along
class BatchHandler < Base
include BatchableCapable # Already doing it right!
def call(context)
batch_ctx = get_batch_context(context)
# ...process batch...
batch_worker_complete(processed_count: count, result_data: data)
end
end
The cross-language handler harmonization recommended migrating API and Decision handlers to match this pattern.
Capability Modules
Available Capabilities
| Capability | Module (Ruby) | Methods Provided |
|---|---|---|
| API | APICapable | get, post, put, delete |
| Decision | DecisionCapable | decision_success, decision_no_branches |
| Batchable | BatchableCapable | get_batch_context, batch_worker_complete, handle_no_op_worker |
Rust Traits
#![allow(unused)]
fn main() {
// Rust uses traits for the same pattern
pub trait APICapable {
async fn get(&self, path: &str, params: Option<Params>) -> Response;
async fn post(&self, path: &str, data: Option<Value>) -> Response;
async fn put(&self, path: &str, data: Option<Value>) -> Response;
async fn delete(&self, path: &str, params: Option<Params>) -> Response;
}
pub trait DecisionCapable {
fn decision_success(&self, steps: Vec<String>, result: Value) -> StepExecutionResult;
fn decision_no_branches(&self, result: Value) -> StepExecutionResult;
}
pub trait BatchableCapable {
fn get_batch_context(&self, context: &StepContext) -> BatchContext;
fn batch_worker_complete(&self, count: usize, data: Value) -> StepExecutionResult;
}
}
Python Mixins
# Python uses multiple inheritance (mixins)
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin, DecisionMixin
class MyHandler(StepHandler, APIMixin, DecisionMixin):
def call(self, context: StepContext) -> StepHandlerResult:
# Has both API and Decision methods
response = self.get("/api/endpoint")
return self.decision_success(["next_step"], response)
TypeScript Mixins
// TypeScript uses mixin functions applied in constructor
import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, applyDecision, APICapable, DecisionCapable } from '@tasker-systems/tasker';
class MyHandler extends StepHandler implements APICapable, DecisionCapable {
constructor() {
super();
applyAPI(this); // Adds get/post/put/delete methods
applyDecision(this); // Adds decisionSuccess/skipBranches methods
}
async call(context: StepContext): Promise<StepHandlerResult> {
// Has both API and Decision methods
const response = await this.get('/api/endpoint');
return this.decisionSuccess(['next_step'], response.body);
}
}
Separation of Concerns
What Orchestration Owns
The orchestration layer handles:
- Domain event publishing (after results committed)
- Decision point step creation (from DecisionPointOutcome)
- Batch worker creation (from BatchProcessingOutcome)
- State machine transitions
What Workers Own
Workers handle:
- Decision logic (returns DecisionPointOutcome)
- Batch analysis (returns BatchProcessingOutcome)
- Handler execution (returns StepHandlerResult)
- Custom publishers/subscribers (fast path events)
The Boundary
┌─────────────────────────────────────────────────────────────────┐
│ Worker Space │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Handler.call(context) ││
│ │ - Executes business logic ││
│ │ - Uses API/Decision/Batchable capabilities ││
│ │ - Returns StepHandlerResult with outcome ││
│ └─────────────────────────────────────────────────────────────┘│
│ ↓ Result (with outcome) │
├─────────────────────────────────────────────────────────────────┤
│ Orchestration Space │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Process result ││
│ │ - Commit state transition ││
│ │ - If DecisionPointOutcome: create decision steps ││
│ │ - If BatchProcessingOutcome: create batch workers ││
│ │ - Publish domain events ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
FFI Boundary Types
Outcomes crossing the FFI boundary need explicit types:
DecisionPointOutcome
// Rust definition
pub enum DecisionPointOutcome {
    ActivateSteps { step_names: Vec<String> },
    NoBranches,
}

// Serialized (all languages)
{
  "type": "ActivateSteps",
  "step_names": ["branch_a", "branch_b"]
}
BatchProcessingOutcome
// Rust definition
pub enum BatchProcessingOutcome {
    Continue { cursor: CursorConfig },
    Complete,
    NoOp,
}

// Serialized (all languages)
{
  "type": "Continue",
  "cursor": {
    "position": "offset_123",
    "batch_size": 100
  }
}
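For illustration, here is a minimal Rust sketch of how such an outcome can round-trip across the boundary, assuming serde's internally tagged representation (which matches the JSON shape shown above); the actual derive attributes in Tasker's codebase may differ:
use serde::{Deserialize, Serialize};

// Assumption: the "type" discriminator shown above corresponds to serde's
// internally tagged enum representation.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
#[serde(tag = "type")]
pub enum DecisionPointOutcome {
    ActivateSteps { step_names: Vec<String> },
    NoBranches,
}

fn main() -> Result<(), serde_json::Error> {
    let outcome = DecisionPointOutcome::ActivateSteps {
        step_names: vec!["branch_a".into(), "branch_b".into()],
    };

    // Serialize on one side of the FFI boundary...
    let wire = serde_json::to_string(&outcome)?;
    assert_eq!(wire, r#"{"type":"ActivateSteps","step_names":["branch_a","branch_b"]}"#);

    // ...and deserialize on the other without losing the variant or its fields.
    let round_trip: DecisionPointOutcome = serde_json::from_str(&wire)?;
    assert_eq!(round_trip, outcome);
    Ok(())
}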
Migration Path
Cross-Language Migration Examples
Ruby
Before (inheritance):
class MyAPIHandler < TaskerCore::APIHandler
def call(context)
# ...
end
end
After (composition):
class MyAPIHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::Mixins::API
def call(context)
# Same implementation, different structure
end
end
Python
Before (inheritance):
class MyAPIHandler(APIHandler):
def call(self, context):
# ...
After (composition):
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin
class MyAPIHandler(StepHandler, APIMixin):
def call(self, context):
# Same implementation, different structure
TypeScript
Before (inheritance):
class MyAPIHandler extends APIHandler {
async call(context: StepContext): Promise<StepHandlerResult> {
// ...
}
}
After (composition):
import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, APICapable } from '@tasker-systems/tasker';
class MyAPIHandler extends StepHandler implements APICapable {
constructor() {
super();
applyAPI(this);
}
async call(context: StepContext): Promise<StepHandlerResult> {
// Same implementation, different structure
}
}
Rust
Rust already used the composition pattern via traits:
// Rust has always used traits (composition)
impl StepHandler for MyHandler { ... }
impl APICapable for MyHandler { ... }
impl DecisionCapable for MyHandler { ... }
Breaking Changes Implemented
The migration to composition involved breaking changes:
- Base class changes across all languages
- Module/mixin includes required
- Ruby cursor indexing changed from 1-indexed to 0-indexed
All breaking changes were accumulated and released together.
Anti-Patterns
Don’t: Inherit from Multiple Specialized Classes
# BAD: Ruby doesn't support multiple inheritance like this
class MyHandler < APIHandler, DecisionHandler # Syntax error!
Don’t: Reimplement Mixin Methods
# BAD: Overriding mixin methods defeats the purpose
class MyHandler < Base
include APICapable
def get(path, params: {})
# Custom implementation - now you own this forever
end
end
Don’t: Mix Concerns
# BAD: Handler doing orchestration's job
class MyHandler < Base
include DecisionCapable
def call(context)
# Don't create steps directly!
create_workflow_step("next_step") # Orchestration does this!
# Do return the outcome
decision_success(steps: ["next_step"], result_data: {})
end
end
Testing Composition
Test Mixins in Isolation
# Test the mixin itself
RSpec.describe TaskerCore::StepHandler::APICapable do
let(:handler) { Class.new { include TaskerCore::StepHandler::APICapable }.new }
it "provides get method" do
expect(handler).to respond_to(:get)
end
end
Test Handler with Stubs
# Test handler behavior, stub mixin methods
RSpec.describe MyHandler do
let(:handler) { MyHandler.new }
it "calls API and makes decision" do
allow(handler).to receive(:get).and_return({ status: 200 })
result = handler.call(context)
expect(result.decision_point_outcome.type).to eq("ActivateSteps")
end
end
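The same isolation idea carries over to Rust traits: a capability with default methods can be exercised through a minimal test double. The sketch below uses simplified stand-in types rather than Tasker's real DecisionCapable trait or result structs:
// Simplified stand-ins for illustration only.
#[derive(Debug, PartialEq)]
struct StepExecutionResult {
    activated_steps: Vec<String>,
}

trait DecisionCapable {
    fn decision_success(&self, steps: Vec<String>) -> StepExecutionResult {
        StepExecutionResult { activated_steps: steps }
    }
}

// A dummy handler is enough to test the capability in isolation.
struct DummyHandler;
impl DecisionCapable for DummyHandler {}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn decision_capable_activates_requested_steps() {
        let result = DummyHandler.decision_success(vec!["branch_a".to_string()]);
        assert_eq!(result.activated_steps, vec!["branch_a".to_string()]);
    }
}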
Related Documentation
- Tasker Core Tenets - Tenet #3: Composition Over Inheritance
- Cross-Language Consistency - How composition works across languages
- Patterns and Practices - Handler patterns
Cross-Language Consistency
This document describes Tasker Core’s philosophy for maintaining consistent APIs across Rust, Ruby, Python, and TypeScript workers while respecting each language’s idioms.
The Core Philosophy
“There should be one–and preferably only one–obvious way to do it.” – The Zen of Python
When a developer learns one Tasker worker language, they should understand all of them at the conceptual level. The specific syntax changes; the patterns remain constant.
Consistency Without Uniformity
What We Align
Developer-facing touchpoints that affect daily work:
| Touchpoint | Why Align |
|---|---|
| Handler signatures | Developers switch languages within projects |
| Result factories | Error handling should feel familiar |
| Registry APIs | Service configuration is cross-cutting |
| Context access patterns | Data access is the core operation |
| Specialized handlers | API, Decision, Batchable are reusable patterns |
What We Don’t Force
Language idioms that feel natural in their ecosystem:
| Ruby | Python | TypeScript | Rust |
|---|---|---|---|
| Blocks, yield | Decorators, context managers | Generics, interfaces | Traits, associated types |
| Symbols (:name) | Type hints | async/await | Pattern matching |
| Duck typing | ABC, Protocol | Union types | Enums, Result<T,E> |
The Aligned APIs
Handler Signatures
All handlers receive context, return results:
# Ruby
class MyHandler < TaskerCore::StepHandler::Base
  def call(context)
    success(result: { data: "value" })
  end
end

# Python
class MyHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        return self.success({"data": "value"})

// TypeScript
class MyHandler extends BaseStepHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    return this.success({ data: "value" });
  }
}
// Rust
impl StepHandler for MyHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult {
        StepExecutionResult::success(json!({"data": "value"}), None)
    }
}
Result Factories
Success and failure patterns are identical:
| Operation | Pattern |
|---|---|
| Success | success(result_data, metadata?) |
| Failure | failure(message, error_type, error_code?, retryable?, metadata?) |
The factory methods hide implementation details (wrapper classes, enum variants) behind a consistent interface.
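As a sketch of what these factories hide, a simplified Rust version might look like the following (field names and the retry/error-code details are simplified, not Tasker's exact types):
use serde_json::{json, Value};

// Simplified result shape; the real StepHandlerResult carries richer error metadata.
struct StepHandlerResult {
    success: bool,
    result: Value,
    error_message: Option<String>,
    retryable: bool,
}

impl StepHandlerResult {
    fn success(result: Value) -> Self {
        Self { success: true, result, error_message: None, retryable: false }
    }

    fn failure(message: &str, retryable: bool) -> Self {
        Self { success: false, result: Value::Null, error_message: Some(message.into()), retryable }
    }
}

fn main() {
    let ok = StepHandlerResult::success(json!({ "data": "value" }));
    let failed = StepHandlerResult::failure("upstream timeout", true);
    assert!(ok.success);
    assert!(!failed.success && failed.retryable);
}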
Registry Operations
All registries support the same core operations:
| Operation | Description |
|---|---|
| register(name, handler) | Register a handler by name |
| is_registered(name) | Check if handler exists |
| resolve(name) | Get handler instance |
| list_handlers() | List all registered handlers |
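A minimal Rust sketch of that registry surface (illustrative only; each worker's concrete registry differs in locking, error handling, and handler types, and the handler name used here is hypothetical):
use std::collections::HashMap;
use std::sync::Arc;

// Illustrative stand-in; real handlers implement the StepHandler trait shown elsewhere.
trait StepHandler: Send + Sync {}

#[derive(Default)]
struct HandlerRegistry {
    handlers: HashMap<String, Arc<dyn StepHandler>>,
}

impl HandlerRegistry {
    fn register(&mut self, name: &str, handler: Arc<dyn StepHandler>) {
        self.handlers.insert(name.to_string(), handler);
    }

    fn is_registered(&self, name: &str) -> bool {
        self.handlers.contains_key(name)
    }

    fn resolve(&self, name: &str) -> Option<Arc<dyn StepHandler>> {
        self.handlers.get(name).cloned()
    }

    fn list_handlers(&self) -> Vec<String> {
        self.handlers.keys().cloned().collect()
    }
}

struct NoopHandler;
impl StepHandler for NoopHandler {}

fn main() {
    let mut registry = HandlerRegistry::default();
    registry.register("process_payment", Arc::new(NoopHandler));
    assert!(registry.is_registered("process_payment"));
    assert_eq!(registry.list_handlers(), vec!["process_payment".to_string()]);
}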
Context Access Patterns
The StepContext provides unified access to execution data:
Core Fields (All Languages)
| Field | Type | Description |
|---|---|---|
| task_uuid | String | Unique task identifier |
| step_uuid | String | Unique step identifier |
| input_data | Dict/Hash | Input data for the step |
| step_config | Dict/Hash | Handler configuration |
| dependency_results | Wrapper | Results from parent steps |
| retry_count | Integer | Current retry attempt |
| max_retries | Integer | Maximum retry attempts |
Convenience Methods
| Method | Description |
|---|---|
| get_task_field(name) | Get field from task context |
| get_dependency_result(step_name) | Get result from a parent step |
Specialized Handler Patterns
API Handler
HTTP operations available in all languages:
| Method | Pattern |
|---|---|
| GET | get(path, params?, headers?) |
| POST | post(path, data?, headers?) |
| PUT | put(path, data?, headers?) |
| DELETE | delete(path, params?, headers?) |
Decision Handler
Conditional workflow branching:
# Ruby
decision_success(steps: ["branch_a", "branch_b"], result_data: { routing: "criteria" })
decision_no_branches(result_data: { reason: "no action needed" })
# Python
self.decision_success(["branch_a", "branch_b"], {"routing": "criteria"})
self.decision_no_branches({"reason": "no action needed"})
Batchable Handler
Cursor-based batch processing:
| Operation | Description |
|---|---|
| get_batch_context(context) | Extract batch metadata |
| batch_worker_complete(count, data) | Signal batch completion |
| handle_no_op_worker(batch_ctx) | Handle empty batch |
FFI Boundary Types
When data crosses the FFI boundary (Rust <-> Ruby/Python/TypeScript), types must serialize identically:
Required Explicit Types
| Type | Purpose |
|---|---|
| DecisionPointOutcome | Decision handler results |
| BatchProcessingOutcome | Batch handler results |
| CursorConfig | Batch cursor configuration |
| StepHandlerResult | All handler results |
Serialization Guarantee
The same JSON representation must work across all languages:
{
"success": true,
"result": { "data": "value" },
"metadata": { "timing_ms": 50 }
}
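For example, a Rust consumer can parse exactly the JSON a Ruby or Python worker produced. The struct below is a simplified stand-in for the real StepHandlerResult:
use serde::{Deserialize, Serialize};
use serde_json::{json, Value};

// Simplified shape for illustration; the real type carries more fields.
#[derive(Serialize, Deserialize, Debug)]
struct StepHandlerResult {
    success: bool,
    result: Value,
    metadata: Value,
}

fn main() -> Result<(), serde_json::Error> {
    let wire = r#"{ "success": true, "result": { "data": "value" }, "metadata": { "timing_ms": 50 } }"#;
    // Any worker language can produce this JSON; the Rust side parses it identically.
    let parsed: StepHandlerResult = serde_json::from_str(wire)?;
    assert!(parsed.success);
    assert_eq!(parsed.result, json!({ "data": "value" }));
    Ok(())
}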
Why This Matters
Developer Productivity
When switching from a Ruby handler to a Python handler:
- No relearning core concepts
- Same mental model applies
- Documentation transfers
Code Review Consistency
Reviewers can evaluate handlers in any language:
- Pattern violations are obvious
- Best practices are universal
- Anti-patterns are recognizable
Documentation Efficiency
One set of conceptual docs serves all languages:
- Language-specific pages show syntax only
- Core patterns documented once
- Examples parallel across languages
The Pre-Alpha Advantage
In pre-alpha, we can make breaking changes to achieve consistency:
| Change Type | Example |
|---|---|
| Method renames | handle() → call() |
| Signature changes | (task, step) → (context) |
| Return type unification | Separate Success/Error → unified result |
These changes would be costly post-release but are cheap now.
Migration Path
When APIs diverge, we follow this pattern:
- Non-Breaking First: Add aliases, helpers, new modules
- Deprecation Period: Mark old APIs deprecated (warnings in logs)
- Breaking Release: Remove old APIs, document migration
Example timeline:
Phase 1: Python migration (non-breaking + breaking)
Phase 2: Ruby migration (non-breaking + breaking)
Phase 3: Rust alignment (already aligned)
Phase 4: TypeScript alignment (new implementation)
Phase 5: Breaking changes release (all languages together)
Anti-Patterns
Don’t: Force Identical Syntax
# BAD: Ruby-style symbols in Python
def call(context) -> Hash[:success => true] # Not Python!
Don’t: Ignore Language Idioms
# BAD: Python-style type hints in Ruby
def call(context: StepContext) -> StepHandlerResult # Not Ruby!
Don’t: Duplicate Orchestration Logic
# BAD: Worker creating decision steps
def call(context)
# Don't do orchestration's job!
create_decision_steps(...) # Orchestration handles this
end
Related Documentation
- Tasker Core Tenets - Tenet #4: Cross-Language Consistency
- API Convergence Matrix - Detailed API reference
- Patterns and Practices - Common patterns
- Example Handlers - Side-by-side code examples
Defense in Depth
This document describes Tasker Core’s multi-layered protection model for idempotency and data integrity.
The Four Protection Layers
Tasker Core uses four independent protection layers. Each layer catches what others might miss, and no single layer bears full responsibility for data integrity.
┌─────────────────────────────────────────────────────────────────┐
│ Layer 4: Application Logic │
│ (State-based deduplication) │
├─────────────────────────────────────────────────────────────────┤
│ Layer 3: Transaction Boundaries │
│ (All-or-nothing semantics) │
├─────────────────────────────────────────────────────────────────┤
│ Layer 2: State Machine Guards │
│ (Current state validation) │
├─────────────────────────────────────────────────────────────────┤
│ Layer 1: Database Atomicity │
│ (Unique constraints, row locks, CAS) │
└─────────────────────────────────────────────────────────────────┘
Layer 1: Database Atomicity
The foundation layer using PostgreSQL’s transactional guarantees.
Mechanisms
| Mechanism | Purpose | Example |
|---|---|---|
| Unique constraints | Prevent duplicate records | One active task per (namespace, external_id) |
| Row-level locking | Prevent concurrent modification | SELECT ... FOR UPDATE on task claim |
| Compare-and-swap | Atomic state transitions | UPDATE ... WHERE state = $expected |
| Advisory locks | Distributed coordination | Template cache invalidation |
Atomic Claiming Pattern
-- Only one processor can claim a task
UPDATE tasks
SET state = 'in_progress',
    processor_uuid = $1,
    claimed_at = NOW()
WHERE id = $2
  AND state = 'pending'  -- CAS: only if still pending
RETURNING *;
If two processors try to claim the same task:
- First: Succeeds, task transitions to in_progress
- Second: Fails (0 rows affected), no state change
Why This Works
PostgreSQL’s MVCC ensures the WHERE state = 'pending' check and the SET state = 'in_progress' happen atomically. There’s no window where both processors see state = 'pending'.
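A hedged sketch of how a claimer might execute this pattern from Rust with sqlx (table and column names are taken from the SQL above; Tasker's actual data-access layer may differ):
use sqlx::PgPool;
use uuid::Uuid; // assumes sqlx is built with its "uuid" feature

// Returns true only for the single claimer whose CAS update matched a pending row.
async fn try_claim_task(pool: &PgPool, task_id: Uuid, processor: Uuid) -> Result<bool, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE tasks
         SET state = 'in_progress', processor_uuid = $1, claimed_at = NOW()
         WHERE id = $2 AND state = 'pending'",
    )
    .bind(processor)
    .bind(task_id)
    .execute(pool)
    .await?;

    // rows_affected() == 0 means another processor already claimed the task.
    Ok(result.rows_affected() == 1)
}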
Layer 2: State Machine Guards
State machine validation before any transition is attempted.
Implementation
impl TaskStateMachine {
    pub fn can_transition(&self, from: TaskState, to: TaskState) -> bool {
        VALID_TRANSITIONS.contains(&(from, to))
    }

    pub fn transition(&mut self, to: TaskState) -> Result<(), StateError> {
        if !self.can_transition(self.current, to) {
            return Err(StateError::InvalidTransition { from: self.current, to });
        }
        // Proceed with transition
    }
}
Valid Transitions Matrix
The state machine explicitly defines which transitions are valid:
Pending → Initializing → EnqueuingSteps → StepsInProcess
↓
Complete ← EvaluatingResults ← (step completions)
↓
Error (from any state)
Invalid transitions are rejected before reaching the database.
Why This Works
Application-level guards prevent obviously invalid operations from even attempting database changes. This reduces database load and provides better error messages.
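For illustration, here is a small excerpt of what such a guard can look like in code. The transition set below is only the happy-path subset from the diagram above, not the full matrix:
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TaskState {
    Pending,
    Initializing,
    EnqueuingSteps,
    StepsInProcess,
    EvaluatingResults,
    Complete,
    Error,
}

// Illustrative subset only; the real table covers every valid edge.
const VALID_TRANSITIONS: &[(TaskState, TaskState)] = &[
    (TaskState::Pending, TaskState::Initializing),
    (TaskState::Initializing, TaskState::EnqueuingSteps),
    (TaskState::EnqueuingSteps, TaskState::StepsInProcess),
    (TaskState::StepsInProcess, TaskState::EvaluatingResults),
    (TaskState::EvaluatingResults, TaskState::Complete),
];

fn can_transition(from: TaskState, to: TaskState) -> bool {
    // Error is reachable from any state, per the diagram above.
    to == TaskState::Error || VALID_TRANSITIONS.contains(&(from, to))
}

fn main() {
    assert!(can_transition(TaskState::Pending, TaskState::Initializing));
    assert!(!can_transition(TaskState::Pending, TaskState::Complete)); // rejected before the database
}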
Layer 3: Transaction Boundaries
All-or-nothing semantics for multi-step operations.
Example: Step Enqueueing
async fn enqueue_steps(task_id: TaskId, steps: Vec<Step>) -> Result<()> {
    let mut tx = pool.begin().await?;

    // Insert all steps
    for step in steps {
        insert_step(&mut tx, task_id, &step).await?;
    }

    // Update task state
    update_task_state(&mut tx, task_id, TaskState::StepsInProcess).await?;

    // Atomic commit - all or nothing
    tx.commit().await?;
    Ok(())
}
If step insertion fails:
- Transaction rolls back
- Task state unchanged
- No partial steps created
Why This Works
PostgreSQL transactions ensure that either all changes commit or none do. There’s no intermediate state where some steps exist but task state is wrong.
Layer 4: Application-Level Filtering
State-based deduplication in application logic.
Example: Result Processing
async fn process_result(step_id: StepId, result: HandlerResult) -> Result<()> {
    let step = get_step(step_id).await?;

    // Filter: Only process if step is in_progress
    if step.state != StepState::InProgress {
        log::info!("Ignoring result for step {} in state {:?}", step_id, step.state);
        return Ok(()); // Idempotent: already processed
    }

    // Proceed with result processing
    apply_result(step, result).await
}
Why This Works
Even if the same result arrives multiple times (network retries, duplicate messages), only the first processing has effect. Subsequent attempts see the step already transitioned and exit cleanly.
The Discovery: Ownership Was Harmful
What We Learned
Analysis of processor UUID “ownership” enforcement revealed:
// OLD: Ownership enforcement (REMOVED)
fn can_process(&self, processor_uuid: Uuid) -> bool {
    self.owner_uuid == processor_uuid // BLOCKED recovery!
}

// NEW: Ownership tracking only (for audit)
fn process(&self, processor_uuid: Uuid) -> Result<()> {
    self.record_processor(processor_uuid); // Track, don't enforce
    // ... proceed with processing
}
Why Ownership Enforcement Was Removed
| Scenario | With Enforcement | Without Enforcement |
|---|---|---|
| Normal operation | Works | Works |
| Orchestrator crash & restart | BLOCKED - new UUID | Automatic recovery |
| Duplicate message | Rejected | Layer 1 (CAS) rejects |
| Race condition | Rejected | Layer 1 (CAS) rejects |
The four protection layers already prevent corruption. Ownership added:
- Zero additional safety (layers 1-4 sufficient)
- Recovery blocking (crashed tasks stuck forever)
- Operational complexity (manual intervention needed)
The Verdict
“Processor UUID ownership was redundant protection with harmful side effects.”
When two actors receive identical messages:
- First: Succeeds atomically (Layer 1 CAS)
- Second: Fails cleanly (Layer 1 CAS)
- No partial state, no corruption
- No ownership check needed
Designing New Protections
When adding protection mechanisms, evaluate against this checklist:
Before Adding Protection
- Which layer does this belong to? (Database, state machine, transaction, application)
- What does it protect against? (Be specific: race condition, duplicate, corruption)
- Do existing layers already cover this? (Usually yes)
- What failure modes does it introduce? (Blocked recovery, increased latency)
- Can the system recover if this protection itself fails?
The Minimal Set Principle
Find the minimal set of protections that prevents corruption. Additional layers that prevent recovery are worse than none.
A system that:
- Has fewer protections
- Recovers automatically from crashes
- Handles duplicates idempotently
Is better than a system that:
- Has more protections
- Requires manual intervention after crashes
- Is “theoretically more secure”
Relationship to Fail Loudly
Defense in Depth and Fail Loudly are complementary principles:
| Defense in Depth | Fail Loudly |
|---|---|
| Multiple layers prevent corruption | Errors surface problems immediately |
| Redundancy catches edge cases | Transparency enables diagnosis |
| Protection happens before damage | Visibility happens at detection |
Both reject the same anti-pattern: silent failures.
- Defense in Depth rejects: silent corruption (data changed without protection)
- Fail Loudly rejects: silent defaults (missing data hidden with fabricated values)
Together they ensure: if something goes wrong, we know about it—either protection prevents it, or an error surfaces it.
Related Documentation
- Tasker Core Tenets - Tenet #1: Defense in Depth, Tenet #11: Fail Loudly
- Fail Loudly - Errors as first-class citizens
- Idempotency and Atomicity - Implementation details
- States and Lifecycles - State machine specifications
- Ownership Removal ADR - Processor UUID ownership removal decision
Fail Loudly
This document describes Tasker Core’s philosophy on error handling: errors are first-class citizens, not inconveniences to hide.
The Core Principle
A system that lies is worse than one that fails.
When data is missing, malformed, or unexpected, the correct response is an explicit error—not a fabricated default that makes the problem invisible.
The Problem: Phantom Data
“Phantom data” is data that:
- Looks valid to consumers
- Passes type checks and validation
- Contains no actual information from the source
- Was fabricated by defensive code trying to be “helpful”
Example: The Silent Default
// WRONG: Silent default hides protocol violation
fn get_pool_utilization(response: Response) -> PoolUtilization {
    response.pool_utilization.unwrap_or_else(|| PoolUtilization {
        active_connections: 0,
        idle_connections: 0,
        max_connections: 0,
        utilization_percent: 0.0, // Looks like "no load"
    })
}
A monitoring system receiving this response sees:
- utilization_percent: 0.0 — "Great, the system is idle!"
- Reality: The server never sent pool data. The system might be at 100% load.
The consumer cannot distinguish “server reported 0%” from “server sent nothing.”
The Trust Equation
Silent default
→ Consumer receives valid-looking data
→ Consumer makes decisions based on phantom values
→ Phantom bugs manifest in production
→ Debugging nightmare: "But the data looked correct!"
vs.
Explicit error
→ Consumer receives clear failure
→ Consumer handles error appropriately
→ Problem visible immediately
→ Fix applied at source
The Solution: Explicit Errors
Pattern: Required Fields Return Errors
// RIGHT: Explicit error on missing required data
fn get_pool_utilization(response: Response) -> Result<PoolUtilization, ClientError> {
    response.pool_utilization.ok_or_else(|| {
        ClientError::invalid_response(
            "Response.pool_utilization",
            "Server omitted required pool utilization data",
        )
    })
}
Now the consumer:
- Knows data is missing
- Can retry, alert, or degrade gracefully
- Never operates on phantom values
Pattern: Distinguish Required vs Optional
Not all fields should fail on absence. The distinction matters:
| Field Type | Missing Means | Response |
|---|---|---|
| Required | Protocol violation, server bug | Return error |
| Optional | Legitimately absent, feature not configured | Return None |
// Required: Server MUST send health checks
let checks = response.checks.ok_or_else(||
    ClientError::invalid_response("checks", "missing")
)?;

// Optional: Distributed cache may not be configured
let cache = response.distributed_cache; // Option<T> preserved
Pattern: Propagate, Don’t Swallow
Errors should flow up, not disappear:
// WRONG: Error swallowed, default returned
fn convert_response(r: Response) -> DomainType {
    let info = r.info.unwrap_or_default(); // Error hidden
    // ...
}

// RIGHT: Error propagated to caller
fn convert_response(r: Response) -> Result<DomainType, ClientError> {
    let info = r.info.ok_or_else(||
        ClientError::invalid_response("info", "missing")
    )?; // Error visible
    // ...
}
When Defaults Are Acceptable
Not every unwrap_or_default() is wrong. Defaults are acceptable when:
- The field is explicitly optional in the domain model:
  // Optional metadata that may legitimately be absent
  let metadata: Option<Value> = response.metadata;
- The default is semantically meaningful:
  // Empty tags list is valid—means "no tags"
  let tags = response.tags.unwrap_or_default(); // Vec<String>
- Absence cannot be confused with a valid value:
  // description being None vs "" are distinguishable
  let description: Option<String> = response.description;
Red Flags to Watch For
When reviewing code, these patterns indicate potential phantom data:
1. unwrap_or_default() on Numeric Types
// RED FLAG: 0 looks like a valid measurement
let active_connections = pool.active.unwrap_or_default();
2. unwrap_or_else(|| ...) with Fabricated Values
// RED FLAG: "unknown" looks like real status
let status = check.status.unwrap_or_else(|| "unknown".to_string());
3. Default Structs for Missing Nested Data
// RED FLAG: Entire section fabricated
let config = response.config.unwrap_or_else(default_config);
4. Silent Fallbacks in Health Checks
// RED FLAG: Health check that never fails is useless
let health = check_health().unwrap_or(HealthStatus::Ok);
Implementation Checklist
When implementing new conversions or response handling:
- Is this field required by the protocol/API contract?
- If missing, would a default be indistinguishable from a valid value?
- Could a consumer make incorrect decisions based on a default?
- Is the error message actionable? (includes field name, explains what’s wrong)
- Is the error type appropriate? (InvalidResponse for protocol violations)
The Discovery
What We Found
During gRPC client implementation, analysis revealed pervasive patterns like:
// Found throughout conversions.rs
let checks = response.checks.unwrap_or_else(|| ReadinessChecks {
    web_database: HealthCheck { status: "unknown".into(), ... },
    orchestration_database: HealthCheck { status: "unknown".into(), ... },
    // ... more fabricated checks
});
A client calling get_readiness() would receive what looked like a valid response with “unknown” status for all checks—when in reality, the server sent nothing.
The Refactoring
All required-field patterns were changed to explicit errors:
// After refactoring
let checks = response.checks.ok_or_else(|| {
    ClientError::invalid_response(
        "ReadinessResponse.checks",
        "Readiness response missing required health checks",
    )
})?;
Now a malformed server response immediately fails with:
Error: Invalid response: ReadinessResponse.checks - Readiness response missing required health checks
The problem is visible. The fix can be applied. Trust is preserved.
Related Principles
- Tenet #11: Fail Loudly in Tasker Core Tenets
- Meta-Principle #6: Errors Over Defaults
- Defense in Depth — fail loudly is a form of protection; silent defaults are a form of hiding
Summary
| Don’t | Do |
|---|---|
| Hide missing data with defaults | Return explicit errors |
| Make consumers guess if data is real | Distinguish required vs optional |
| Fabricate “unknown” status values | Error: “status unavailable” |
| Swallow errors in conversions | Propagate with ? operator |
| Treat all fields as optional | Model optionality in types |
The golden rule: If you can’t tell the difference between “server sent 0” and “server sent nothing,” you have a phantom data problem.
Intentional AI Partnership
A philosophy of rigorous collaboration for the age of AI-assisted engineering
The Growing Divide
There is a phrase gaining traction in software engineering circles: “Nice AI slop.”
It’s dismissive. It’s reductive. And it’s not entirely wrong.
The critique is valid: AI tools have made it possible to generate enormous volumes of code without understanding what that code does, why it’s structured the way it is, or how to maintain it when something breaks at 2 AM. Engineers who would never have shipped code they couldn’t explain are now approving pull requests they couldn’t debug. Project leads are drowning in contributions from well-meaning developers who “vibe-coded” their way into maintenance nightmares.
For those of us who have spent years—decades—in the craft of software engineering, who have sat with codebases through their full lifecycles, who have felt the weight of technical decisions made five years ago landing on our shoulders today, this is frustrating. The hard-won discipline of our profession seems to be eroding in favor of velocity.
And yet.
The response to “AI slop” cannot be rejection of AI as a partner in engineering work. That path leads to irrelevance. The question is not whether to work with AI, but how—with what principles, what practices, what commitments to quality and accountability.
This document is an attempt to articulate those principles. Not as abstract ideals, but as a working philosophy grounded in practice: building real systems, shipping real code, maintaining real accountability.
The Core Insight: Amplification, Not Replacement
AI does not create the problems we’re seeing. It amplifies them.
Teams that already had weak ownership practices now produce more poorly-understood code, faster. Organizations where “move fast and break things” meant “ship it and let someone else figure it out” now ship more of it. Engineers who never quite understood the systems they worked on can now generate more code they don’t understand.
But the inverse is also true.
Teams with strong engineering discipline—clear specifications, rigorous review, genuine ownership—can leverage AI to operate at a higher level of abstraction while maintaining (or even improving) quality. The acceleration becomes an advantage, not a liability.
This is the same dynamic that exists in any collaboration. A junior engineer paired with a senior engineer who doesn’t mentor becomes a junior engineer who writes more code without learning. A junior engineer paired with a senior engineer who invests in their growth becomes a stronger engineer, faster.
AI partnership follows the same pattern. The quality of the outcome depends on the quality of the collaboration practices surrounding it.
The discipline required for effective AI partnership is not new. It is the discipline that should characterize all engineering collaboration. AI simply makes the presence or absence of that discipline more visible, more consequential, and more urgent.
Principles of Intentional Partnership
1. Specification Before Implementation
The most effective AI collaboration begins long before code is written.
When you ask an AI to “build a feature,” you get code. When you work with an AI to understand the problem, research the landscape, evaluate approaches, and specify the solution—then implement—you get software.
This is not an AI-specific insight. It’s foundational engineering practice. But AI makes the cost of skipping specification deceptively low: you can generate code instantly, so why spend time on design? The answer is the same as it’s always been: because code without design is not software, it’s typing.
In practice:
- Begin with exploration: What problem are we solving? What does the current system look like? What will be different when this work is complete?
- Research with tools: Use AI capabilities to understand the codebase, explore patterns in the ecosystem, review prior art. Ground the work in reality, not assumptions.
- Develop evaluation criteria before evaluating solutions. Know what “good” looks like before you start judging options.
- Document the approach, not just the code. Specifications are artifacts of understanding.
2. Phased Delivery with Validation Gates
Large work should be decomposed into phases, and each phase should have clear acceptance criteria.
This principle exists because humans have limited working memory. It’s true for individual engineers, it’s true for teams, and it’s true for AI systems. Complex work exceeds the capacity of any single context—human or machine—to hold it all at once.
Phased delivery is how we manage this limitation. Each phase is small enough to understand completely, validate thoroughly, and commit to confidently. The boundaries between phases are synchronization points where understanding is verified.
In practice:
- Identify what can be parallelized versus what must be sequential. Not all work is equally dependent.
- Determine which aspects require careful attention versus which can be resolved at implementation time. Not all decisions are equally consequential.
- Each phase should be independently validatable: tests pass, acceptance criteria met, code reviewed.
- Phase documentation should include code samples for critical paths. Show, don’t just tell.
3. Validation as a First-Class Concern
Testing is not a phase that happens after implementation. It is a design constraint that shapes implementation.
AI can generate tests as easily as it generates code. This makes it tempting to treat testing as an afterthought: write the code, then generate tests to cover it. This inverts the value proposition of testing entirely.
Tests are specifications. They encode expectations about behavior. When tests are written first—or at least designed first—they constrain the implementation toward correctness. When tests are generated after the fact, they merely document whatever the implementation happens to do, bugs included.
In practice:
- Define acceptance criteria before implementation begins.
- Include edge cases, boundary conditions, and non-happy-path scenarios in specifications.
- End-to-end testing validates that the system works, not just that individual units work.
- Review tests with the same rigor as implementation code. Tests can have bugs too.
4. Human Accountability as the Final Gate
This is the principle that separates intentional partnership from “AI slop.”
The human engineer is ultimately responsible for code that ships. Not symbolically responsible—actually responsible. Responsible for understanding what the code does, why it’s structured the way it is, what trade-offs were made, and how to maintain it.
This is not about low trust in AI. It’s about the nature of accountability.
If you cannot explain why a particular approach was chosen, you should not approve it. If you cannot articulate the trade-offs embedded in a design decision, you should not sign off on it. If you cannot defend a choice—or at least explain why the choice wasn’t worth extensive deliberation—then you are not in a position to take responsibility for it.
This standard applies to all code, regardless of its origin. Human-written code that the approving engineer doesn’t understand is no better than AI-written code they don’t understand. The source is irrelevant; the accountability is what matters.
In practice:
- Review is not approval. Approval requires understanding.
- The bikeshedding threshold is a valid concept: knowing why something isn’t worth debating is also knowledge. But you must actually know this, not assume it.
- Code review agents and architectural validators are useful, but they augment human judgment rather than replacing it.
- If you wouldn’t ship code you wrote yourself without understanding it, don’t ship AI-written code without understanding it either.
5. Documentation as Extended Cognition
Documentation is not an artifact of completed work. It is a tool that enables work to continue.
Every engineer who joins a project faces the same challenge: building sufficient context to contribute effectively. Every AI session faces the same challenge: starting fresh without memory of prior work. Good documentation serves both.
This is the insight that makes documentation investment worthwhile: it extends cognition across time and across minds. The context you build today, documented well, becomes instantly available to future collaborators—human or AI.
In practice:
- Structure documentation for efficient context loading. Navigation guides, trigger patterns, clear hierarchies.
- Capture the “why” alongside the “what.” Decisions without rationale are trivia.
- Principles, architecture, guides, reference—different documents serve different needs at different times.
- Documentation that serves future AI sessions also serves future human engineers. The requirements are the same: limited working memory, need for efficient orientation.
6. Toolchain Alignment
Some development environments are better suited to intentional partnership than others.
The ideal toolchain provides fast feedback loops, enforces correctness constraints, and makes architectural decisions explicit. The compiler, the type system, the test framework—these become additional collaborators in the process, catching errors early and forcing clarity about intent.
Languages and tools that defer decisions to runtime, that allow implicit behavior, that prioritize flexibility over explicitness, make intentional partnership harder. Not impossible—but harder. The burden of verification shifts more heavily to the human.
In practice:
- Strong type systems document intent in ways that survive across sessions and collaborators.
- Compilers that enforce correctness (memory safety, exhaustive matching) catch the classes of errors most likely to slip through in high-velocity development.
- Explicit architectural patterns—actor models, channel semantics, clear ownership boundaries—force intentional design rather than emergent mess.
- The goal is not language advocacy but recognition: your toolchain affects your collaboration quality.
A Concrete Example: Building Tasker
These principles are not theoretical. They emerged from—and continue to guide—the development of Tasker, a workflow orchestration system built in Rust.
Why Rust?
Rust is not chosen as a recommendation but as an illustration of what makes a toolchain powerful for intentional partnership.
The Rust compiler forces agreement on memory ownership. You cannot be vague about who owns data and when it’s released; the borrow checker requires explicitness. This means architectural decisions about data flow must be made consciously rather than accidentally.
Exhaustive pattern matching means you cannot forget to handle a case. Every enum variant must be addressed. This is particularly valuable when working with AI: generated code that handles only the happy path fails to compile rather than failing silently in production.
The type system documents intent in ways that persist across context windows. When an AI session resumes work on a Rust codebase, the types communicate constraints that would otherwise need to be re-established through conversation.
Tokio channels, MPSC patterns, actor boundaries—these require intentional design. You cannot stumble into an actor architecture; you must choose it and implement it explicitly. This aligns well with specification-driven development.
None of this makes Rust uniquely suitable or necessary. It makes Rust an example of the properties that matter: explicitness, enforcement, feedback loops that catch errors early.
The Spec-Driven Workflow
Every significant piece of Tasker work follows a pattern:
-
Problem exploration: What are we trying to accomplish? What’s the current state? What will success look like?
-
Grounded research: Use AI capabilities to understand the codebase, explore ecosystem patterns, review tooling options. Generate a situated view of how the problem exists within the actual system.
-
Approach analysis: Develop criteria for evaluating solutions. Generate multiple approaches. Evaluate against criteria. Select and refine.
-
Phased planning: Break work into milestones with validation gates. Identify dependencies, parallelization opportunities, risk areas. Determine what needs careful specification versus what can be resolved during implementation.
-
Phase documentation: Each phase gets its own specification in a dedicated directory. Includes acceptance criteria, code samples for critical paths, and explicit validation requirements.
-
Implementation with validation: Work proceeds phase by phase. Tests are written. Code is reviewed. Each phase is complete before the next begins.
-
Human accountability gate: The human partner reviews not just for correctness but for understanding. Can they defend the choices? Do they know why alternatives were rejected? Are they prepared to maintain this code?
This workflow produces comprehensive documentation as a side effect of doing the work. The docs/ticket-specs/ directories in Tasker contain detailed specifications that serve both as implementation guides and as institutional memory. Future engineers—and future AI sessions—can understand not just what was built but why.
The Tenets as Guardrails
Tasker’s development is guided by eleven core tenets, derived from experience. Several are directly relevant to intentional partnership:
State Machine Rigor: All state transitions are atomic, audited, and validated. This principle emerged from debugging distributed systems failures; it also provides clear contracts for AI-generated code to satisfy.
Defense in Depth: Multiple overlapping protection layers rather than single points of failure. In collaboration terms: review, testing, type checking, and runtime validation each catch what others might miss.
Composition Over Inheritance: Capabilities are composed via mixins, not class hierarchies. This produces code that’s easier to understand in isolation—crucial when any given context (human or AI) can only hold part of the system at once.
These tenets emerged from building software over many years. They apply to AI partnership because they apply to engineering generally. AI is a collaborator; good engineering principles govern collaboration.
The Organizational Dimension
Intentional AI partnership is not just an individual practice. It’s an organizational capability.
What Changes
When AI acceleration is available to everyone, the differentiator becomes the quality of surrounding practices:
- Specification quality determines whether AI generates useful code or plausible-looking nonsense.
- Review rigor determines whether errors are caught before or after deployment.
- Testing discipline determines whether systems are verifiably correct or coincidentally working.
- Documentation investment determines whether institutional knowledge accumulates or evaporates.
Organizations that were already strong in these areas will find AI amplifies their strength. Organizations that were weak will find AI amplifies their weakness—faster.
The Accountability Question
The hardest organizational challenge is accountability.
When an engineer can generate a month’s worth of code in a day, traditional review processes break down. You cannot carefully review a thousand lines of code per hour. Something has to give.
The answer is not “skip review” or “automate review entirely.” The answer is to change what gets reviewed.
In intentional partnership, the specification is the primary artifact. The specification is reviewed carefully: Does this approach make sense? Does it align with architectural principles? Does it handle edge cases? Does it integrate with existing systems?
The implementation—whether AI-generated or human-written—is validated against the specification. Tests verify behavior. Type systems verify contracts. Review confirms that the implementation matches the spec.
This shifts review from “read every line of code” to “verify that implementation matches intent.” It’s a different skill, but it’s learnable. And it scales in ways that line-by-line review does not.
Building the Capability
Organizations building intentional AI partnership should focus on:
-
Specification practices: Invest in training engineers to write clear, complete specifications. This skill was always valuable; it’s now critical.
-
Review culture: Shift review culture from gatekeeping to verification. The question is not “would I have written it this way?” but “does this correctly implement the specification?”
-
Testing infrastructure: Fast, comprehensive test suites become even more valuable when implementation velocity increases. Invest accordingly.
-
Documentation standards: Establish expectations for documentation quality. Make documentation a first-class deliverable, not an afterthought.
-
Toolchain alignment: Choose languages, frameworks, and tools that provide fast feedback and enforce correctness. The compiler is a collaborator.
The Call to Action: What Becomes Possible
There is another dimension to this conversation that deserves attention.
We have focused on rigor, accountability, and the discipline required to avoid producing “slop.” This framing is necessary but insufficient. It treats AI partnership primarily as a risk to be managed rather than an opportunity to be seized.
Consider what has changed.
For decades, software engineers have carried mental backlogs of things we would build if we had the time. Ideas that were sound, architecturally feasible, genuinely useful—but the time-to-execute made them impractical. Side projects abandoned. Features deprioritized. Entire systems that existed only as sketches in notebooks because the implementation cost was prohibitive.
That calculus has shifted.
AI partnership, applied rigorously, compresses implementation timelines in ways that make previously infeasible work feasible. The system you would have built “someday” can be prototyped in a weekend. The refactoring you’ve been putting off for years can be specified, planned, and executed in weeks. The tooling you wished existed can be created rather than merely wished for.
This is not about moving faster for its own sake. It’s about what becomes creatively possible when the friction of implementation is reduced.
Tasker exists because of this shift. A workflow orchestration system supporting four languages, with comprehensive documentation, rigorous testing, and production-grade architecture—built as a labor of love alongside a demanding day job. Ten years ago, this project would have remained an idea. Five years ago, perhaps a half-finished prototype. Today, it’s real software approaching production readiness.
And Tasker is not unique. Across the industry, engineers are building things that would not have existed otherwise. Not “AI-generated slop,” but genuine contributions to the craft—systems built with care, designed with intention, maintained with accountability.
This is what’s at stake when we talk about intentional partnership.
When we approach AI collaboration carelessly, we produce code we don’t understand and can’t maintain. We waste the capability on work that creates more problems than it solves. We give ammunition to critics who argue that AI makes engineering worse.
When we approach AI collaboration with rigor, clarity, and commitment to excellence, we unlock creative possibilities that were previously out of reach. We build things that matter. We expand what a single engineer, or a small team, can accomplish.
It is not treating ourselves with respect—our time, our creativity, our professional aspirations—to squander this capability on careless work. It is not treating the partnership with respect to use it without intention.
The opportunity before us is unprecedented. The discipline required to seize it is not new—it’s the discipline of good engineering, applied to a new context.
Let’s not waste it.
Conclusion: Craft Persists
The critique of “AI slop” is fundamentally a critique of craft—or its absence.
Craft is the accumulated wisdom of how to do something well. In software engineering, craft includes knowing when to abstract and when to be concrete, when to optimize and when to leave well enough alone, when to document and when the code is the documentation. Craft is what separates software that works from software that lasts.
AI does not possess craft. AI possesses capability—vast capability—but capability without wisdom is dangerous. This is true of humans as well; we just notice it less because human capability is more limited.
Intentional AI partnership is the practice of combining AI capability with human craft. The AI brings speed, breadth of knowledge, tireless pattern matching. The human brings judgment, accountability, and the accumulated wisdom of the profession.
Neither is sufficient alone. Together, working with discipline and intention, they can build software that is not just functional but maintainable, not just shipped but understood, not just code but craft.
The divide between “AI slop” and intentional partnership is not about the tools. It’s about us—whether we bring the same standards to AI collaboration that we would (or should) bring to any engineering work.
The tools are new. The standards are not. Let’s hold ourselves to them.
This document is part of the Tasker Core project principles. It reflects one approach to AI-assisted engineering; your mileage may vary. The principles here emerged from practice and continue to evolve with it.
Tasker Core Tenets
These 11 tenets guide all architectural and design decisions in Tasker Core. Each emerged from real implementation experience, root cause analyses, or architectural migrations.
The 11 Tenets
1. Defense in Depth
Multiple overlapping protection layers provide robust idempotency without single-point dependency.
Protection comes from four independent layers:
- Database-level atomicity: Unique constraints, row locking, compare-and-swap
- State machine guards: Current state validation before transitions
- Transaction boundaries: All-or-nothing semantics
- Application-level filtering: State-based deduplication
Each layer catches what others might miss. No single layer is responsible for all protection.
Origin: Processor UUID ownership was removed when analysis proved it provided redundant protection with harmful side effects (blocking recovery after crashes).
Lesson: Find the minimal set of protections that prevents corruption. Additional layers that prevent recovery are worse than none.
2. Event-Driven with Polling Fallback
Real-time responsiveness via PostgreSQL LISTEN/NOTIFY, with polling as a reliability backstop.
The system supports three deployment modes:
- EventDrivenOnly: Lowest latency, relies on pg_notify
- PollingOnly: Traditional polling, higher latency but simple
- Hybrid (recommended): Event-driven primary, polling fallback
Events can be missed (network issues, connection drops). Polling ensures eventual consistency.
Origin: Event-driven task claiming was added for low-latency response while preserving reliability guarantees.
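A minimal sketch of the hybrid mode using sqlx's PgListener, with a hypothetical "task_ready" notification channel (Tasker's real channel names and claim logic differ); wire it into a Tokio runtime and a PgPool to run:
use std::time::Duration;
use sqlx::postgres::PgListener;
use sqlx::PgPool;

async fn listen_with_polling_fallback(pool: PgPool) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect_with(&pool).await?;
    listener.listen("task_ready").await?; // hypothetical channel name

    loop {
        tokio::select! {
            // Event-driven primary: react as soon as pg_notify fires.
            notification = listener.recv() => {
                let notification = notification?;
                println!("claim triggered by notify: {}", notification.payload());
            }
            // Polling fallback: guarantees eventual consistency if a notification is missed.
            _ = tokio::time::sleep(Duration::from_secs(5)) => {
                println!("polling fallback tick");
            }
        }
    }
}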
3. Composition Over Inheritance
Mixins and traits for handler capabilities, not class hierarchies.
Handler capabilities are composed via mixins:
Not: class Handler < API
But: class Handler < Base; include API, include Decision, include Batchable
This pattern enables:
- Selective capability inclusion
- Clear separation of concerns
- Easier testing of individual capabilities
- No diamond inheritance problems
Origin: Analysis of cross-language handler harmonization revealed Batchable handlers already used composition. This was identified as the target architecture for all handlers.
See also: Composition Over Inheritance
4. Cross-Language Consistency
Unified developer-facing APIs across Rust, Ruby, Python, and TypeScript.
Consistent touchpoints include:
- Handler signatures: call(context) pattern
- Result factories: success(data) / failure(error, retry_on)
- Registry APIs: register_handler(name, handler)
- Specialized patterns: API, Decision, Batchable
Each language expresses these idiomatically while maintaining conceptual consistency.
Origin: Cross-language API alignment established the “one obvious way” philosophy.
See also: Cross-Language Consistency
5. Actor-Based Decomposition
Lightweight actors for lifecycle management and clear boundaries.
Orchestration uses four core actors:
- TaskRequestActor: Task initialization
- ResultProcessorActor: Step result processing
- StepEnqueuerActor: Batch step enqueueing
- TaskFinalizerActor: Task completion
Worker uses five specialized actors:
- StepExecutorActor: Step execution coordination
- FFICompletionActor: FFI completion handling
- TemplateCacheActor: Template cache management
- DomainEventActor: Event dispatching
- WorkerStatusActor: Status and health
Each actor handles specific message types, enabling testability and clear ownership.
Origin: Actor pattern refactoring reduced monolithic processors from 1,575 LOC to ~150 LOC focused files.
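A skeletal example of the pattern, not any specific Tasker actor: one message enum, one owning task, a bounded mailbox, and a reply via oneshot. Real actors carry richer message types and lifecycle handling:
use tokio::sync::{mpsc, oneshot};

// Illustrative actor skeleton only; Tasker's real actors carry richer message types.
enum TaskRequestMessage {
    Initialize {
        task_uuid: String,
        reply: oneshot::Sender<bool>,
    },
}

async fn task_request_actor(mut inbox: mpsc::Receiver<TaskRequestMessage>) {
    while let Some(message) = inbox.recv().await {
        match message {
            TaskRequestMessage::Initialize { task_uuid, reply } => {
                // A real actor would validate the template and persist initial task state here.
                println!("initializing task {task_uuid}");
                let _ = reply.send(true);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(64); // bounded mailbox, per Tenet #10
    tokio::spawn(task_request_actor(rx));

    let (reply_tx, reply_rx) = oneshot::channel();
    let _ = tx
        .send(TaskRequestMessage::Initialize {
            task_uuid: "task-123".to_string(),
            reply: reply_tx,
        })
        .await;

    assert!(reply_rx.await.unwrap_or(false));
}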
6. State Machine Rigor
Dual state machines (Task + Step) for atomic transitions with full audit trails.
Task states (12): Pending → Initializing → EnqueuingSteps → StepsInProcess → EvaluatingResults → Complete/Error
Step states (8): Pending → Enqueued → InProgress → Complete/Error
All transitions are:
- Atomic (compare-and-swap at database level)
- Audited (full history in transitions table)
- Validated (state guards prevent invalid transitions)
Origin: Enhanced state machines with richer task states were introduced for better workflow visibility.
7. Audit Before Enforce
Track for observability, don’t block for “ownership.”
Processor UUID is tracked in every transition for:
- Debugging (which instance processed which step)
- Audit trails (full history of processing)
- Metrics (load distribution analysis)
But not enforced for:
- Ownership claims (blocks recovery)
- Permission checks (redundant with state guards)
Origin: Ownership enforcement removal proved that audit trails provide value without enforcement costs.
Key insight: When two actors receive identical messages, first succeeds atomically, second fails cleanly - no partial state, no corruption.
8. Pre-Alpha Freedom
Break things early to get architecture right.
In pre-alpha phase:
- Breaking changes are encouraged when architecture is fundamentally unsound
- No backward compatibility required for greenfield work
- Migration debt is cheaper than technical debt
- “Perfect” is the enemy of “architecturally sound”
This freedom enables:
- Rapid iteration on core patterns
- Learning from real implementation
- Correcting course before users depend on specifics
Origin: All major refactoring efforts made breaking changes that improved architecture fundamentally.
9. PostgreSQL as Foundation
Database-level guarantees with flexible messaging (PGMQ default, RabbitMQ optional).
PostgreSQL provides:
- State storage: Task and step state with transactional guarantees
- Advisory locks: Distributed coordination primitives
- Atomic functions: State transitions in single round-trip
- Row-level locking: Prevents concurrent modification
Messaging is provider-agnostic:
- PGMQ (default): Message queue built on PostgreSQL—single-dependency deployment
- RabbitMQ (optional): For high-throughput or existing broker infrastructure
The database is not just storage—it’s the coordination layer. Message delivery is pluggable.
Origin: Core architecture decision - PostgreSQL’s transactional guarantees eliminate entire classes of distributed systems problems. The messaging abstraction was added for deployment flexibility.
10. Bounded Resources
All channels bounded, backpressure everywhere.
Every MPSC channel is:
- Bounded: Fixed capacity, no unbounded memory growth
- Configurable: Sizes set via TOML configuration
- Monitored: Backpressure metrics exposed
Semaphores limit concurrent handler execution. Circuit breakers protect downstream services.
Origin: Bounded MPSC channels were mandated after analysis of unbounded channel risks.
Rule: Never use unbounded_channel(). Always configure bounds via TOML.
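A small sketch of the bounded-channel rule in practice; the capacity constant is a stand-in for a value that would come from TOML configuration:
use tokio::sync::mpsc;

// Stand-in for a TOML-configured bound; the actual configuration key is hypothetical.
const STEP_RESULT_CHANNEL_CAPACITY: usize = 256;

#[tokio::main]
async fn main() {
    // Bounded: senders await (backpressure) instead of growing memory without limit.
    let (tx, mut rx) = mpsc::channel::<String>(STEP_RESULT_CHANNEL_CAPACITY);

    tokio::spawn(async move {
        // When the buffer is full, this send parks until the consumer catches up.
        let _ = tx.send("step-result".to_string()).await;
    });

    while let Some(message) = rx.recv().await {
        println!("received {message}");
    }
}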
11. Fail Loudly
A system that lies is worse than one that fails. Errors are first-class citizens, not inconveniences to hide.
When data is missing, malformed, or unexpected:
- Return errors, not fabricated defaults
- Propagate failures up the call stack
- Make problems visible immediately, not downstream
- Trust nothing that hasn’t been validated
Silent defaults create phantom data—values that look valid but represent nothing real. A monitoring system that receives 0% utilization cannot distinguish “system is idle” from “data was missing.”
What this means in practice:
| Scenario | Wrong Approach | Right Approach |
|---|---|---|
| gRPC response missing field | Return default value | Return InvalidResponse error |
| Config section absent | Use empty/zero defaults | Fail with clear message |
| Health check data missing | Fabricate “unknown” status | Error: “health data unavailable” |
| Optional vs Required | Treat all as optional | Distinguish explicitly in types |
The trust equation:
A client that returns fabricated data
= A client that lies to you
= Worse than a client that fails loudly
= Debugging phantom bugs in production
Origin: gRPC client refactoring revealed pervasive unwrap_or_default() patterns that silently fabricated response data. Analysis showed consumers could receive “valid-looking” responses containing entirely phantom data, breaking the trust contract between client and caller.
Key insight: When a gRPC server omits required fields, that’s a protocol violation—not an opportunity to be “helpful” with defaults. The server is broken; pretending otherwise delays the fix and misleads operators.
Rule: Never use unwrap_or_default() or unwrap_or_else(|| fabricated_value) for required fields. Use ok_or_else(|| ClientError::invalid_response(...)) instead.
Meta-Principles
These overarching themes emerge from the tenets:
-
Simplicity Over Elegance: The minimal protection set that prevents corruption beats layered defense that prevents recovery
-
Observation-Driven Design: Let real behavior (parallel execution, edge cases) guide architecture
-
Explicit Over Implicit: Make boundaries, layers, and decisions visible in documentation and code
-
Consistency Without Uniformity: Align APIs while preserving language idioms
-
Separation of Concerns: Orchestration handles state and coordination; workers handle execution and domain events
-
Errors Over Defaults: When in doubt, fail with a clear error rather than proceeding with fabricated data
Applying These Tenets
When making design decisions:
- Check against tenets: Does this violate any of the 11 tenets?
- Find the precedent: Has a similar decision been made before? (See ticket-specs)
- Document the trade-off: What are you gaining and giving up?
- Consider recovery: If this fails, how does the system recover?
When reviewing code:
- Bounded resources: Are all channels bounded? All concurrency limited?
- State machine compliance: Do transitions use atomic database operations?
- Language consistency: Does the API align with other language workers?
- Composition pattern: Are capabilities mixed in rather than inherited?
- Fail loudly: Are missing/invalid data handled with errors, not silent defaults?
Twelve-Factor App Alignment
The Twelve-Factor App methodology, authored by Adam Wiggins and contributors at Heroku, has been a foundational influence on Tasker Core’s systems design. These principles were not adopted as a checklist but absorbed over years of building production systems. Some factors are deeply embedded in the architecture; others remain aspirational or partially realized.
This document maps each factor to where it shows up in the codebase, where we fall short, and what contributors should keep in mind. It is meant as practical guidance, not a compliance scorecard.
I. Codebase
One codebase tracked in revision control, many deploys.
Tasker Core is a single Git monorepo containing all deployable services: orchestration server, workers (Rust, Ruby, Python, TypeScript), CLI, and shared libraries.
Where this lives:
- Root `Cargo.toml` defines the workspace with all crate members
- Environment-specific Docker Compose files produce different deploys from the same source: `docker/docker-compose.prod.yml`, `docker/docker-compose.dev.yml`, `docker/docker-compose.test.yml`, `docker/docker-compose.ci.yml`
- Feature flags (`web-api`, `grpc-api`, `test-services`, `test-cluster`) control build variations without code branches
Gaps: The monorepo means all crates share a single version today (v0.1.0). As the project matures toward publishing crates independently, versioning and release-management tooling will need to evolve to coordinate releases.
II. Dependencies
Explicitly declare and isolate dependencies.
Rust’s Cargo ecosystem makes this natural. All dependencies are declared in Cargo.toml with workspace-level management and pinned in Cargo.lock.
Where this lives:
- Root `Cargo.toml` `[workspace.dependencies]` section — single source of truth for shared dependency versions
- `Cargo.lock` committed to the repository for reproducible builds
- Multi-stage Docker builds (`docker/build/orchestration.prod.Dockerfile`) use `cargo-chef` for cached, reproducible dependency resolution
- No runtime dependency fetching — everything resolved at build time
Gaps: FFI workers each bring their own dependency ecosystem (Python’s uv/pyproject.toml, Ruby’s Bundler/Gemfile, TypeScript’s bun/package.json). These are well-declared but not unified — contributors working across languages need to manage multiple lock files.
III. Config
Store config in the environment.
This is one of the strongest alignments. All runtime configuration flows through environment variables, with TOML files providing structured defaults that reference those variables.
Where this lives:
- `config/dotenv/` — environment-specific `.env` files (`base.env`, `test.env`, `orchestration.env`)
- `config/tasker/base/*.toml` — role-based defaults with `${ENV_VAR:-default}` interpolation
- `config/tasker/environments/{test,development,production}/` — environment overrides
- `docker/.env.prod.template` — production variable template
- `tasker-shared/src/config/` — config loading with environment variable resolution
- No secrets in source: `DATABASE_URL`, `POSTGRES_PASSWORD`, JWT keys all via environment
For contributors: Never hard-code connection strings, credentials, or deployment-specific values. Use environment variables with sensible defaults in the TOML layer. The configuration structure is role-based (orchestration/worker/common), not component-based — see CLAUDE.md for details.
IV. Backing Services
Treat backing services as attached resources.
Backing services are abstracted behind trait interfaces and swappable via configuration alone.
Where this lives:
- Database: PostgreSQL connection via `DATABASE_URL`, pool settings in `config/tasker/base/common.toml` under `[common.database.pool]`
- Messaging: PGMQ or RabbitMQ selected via the `TASKER_MESSAGING_BACKEND` environment variable — same code paths, different drivers
- Cache: Redis, Moka (in-process), or disabled entirely via `[common.cache]` configuration
- Observability: OpenTelemetry with pluggable backends (Honeycomb, Jaeger, Grafana Tempo) via `OTEL_EXPORTER_OTLP_ENDPOINT`
- Circuit breakers protect against backing service failures: `[common.circuit_breakers.component_configs]`
For contributors: When adding a new backing service dependency, ensure it can be configured via environment variables and that the system degrades gracefully when it’s unavailable. Follow the messaging abstraction pattern — trait-based interfaces, not concrete types.
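A minimal sketch of the trait-based pattern described above, selecting a backend from the `TASKER_MESSAGING_BACKEND` variable. The trait, types, accepted values, and queue name are illustrative assumptions, not Tasker's actual interfaces.

use std::env;

// Illustrative trait-based abstraction for a backing service.
trait MessageQueue {
    fn send(&self, queue: &str, payload: &str);
}

struct PgmqQueue;
struct RabbitMqQueue;

impl MessageQueue for PgmqQueue {
    fn send(&self, queue: &str, payload: &str) {
        println!("PGMQ: {queue} <- {payload}");
    }
}

impl MessageQueue for RabbitMqQueue {
    fn send(&self, queue: &str, payload: &str) {
        println!("RabbitMQ: {queue} <- {payload}");
    }
}

// The concrete backend is an attached resource chosen by configuration alone.
fn messaging_backend() -> Box<dyn MessageQueue> {
    match env::var("TASKER_MESSAGING_BACKEND").as_deref() {
        Ok("rabbitmq") => Box::new(RabbitMqQueue),
        _ => Box::new(PgmqQueue), // PGMQ is the default
    }
}

fn main() {
    messaging_backend().send("example_queue", "{\"step\":\"done\"}");
}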
V. Build, Release, Run
Strictly separate build and run stages.
The Docker build pipeline enforces this cleanly with multi-stage builds.
Where this lives:
- Build: `docker/build/orchestration.prod.Dockerfile` — `cargo-chef` dependency caching, `cargo build --release --all-features --locked`, binary stripping
- Release: Tagged Docker images with only runtime dependencies (no build tools), non-root user (`tasker:999`), read-only config mounts
- Run: `docker/scripts/orchestration-entrypoint.sh` — environment validation, database availability check, migrations, then `exec` into the application binary
- Deployment modes control startup behavior: `standard`, `migrate-only`, `no-migrate`, `safe`, `emergency`
Gaps: Local development doesn’t enforce the same separation — developers run cargo run directly, which conflates build and run. This is fine for development ergonomics but worth noting as a difference from the production path.
VI. Processes
Execute the app as one or more stateless processes.
All persistent state lives in PostgreSQL. Processes can be killed and restarted at any time without data loss.
Where this lives:
- Orchestration server: stateless HTTP/gRPC service backed by the `tasker.tasks` and `tasker.steps` tables
- Workers: claim steps from message queues, execute handlers, write results back — no in-memory state across requests
- Message queue visibility timeouts (`visibility_timeout_seconds` in worker config) ensure unacknowledged messages are reclaimed by other workers
- Docker Compose `replicas` setting scales workers horizontally
For contributors: Never store workflow state in memory across requests. If you need coordination state, it belongs in PostgreSQL. In-memory caches (Moka) are optimization layers, not sources of truth — the system must function correctly without them.
VII. Port Binding
Export services via port binding.
Each service is self-contained and binds its own ports.
Where this lives:
- REST: `config/tasker/base/orchestration.toml` — `[orchestration.web] bind_address = "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"`
- gRPC: `[orchestration.grpc] bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"`
- Worker REST/gRPC on separate ports (8081/9191)
- Health endpoints on both transports for load balancer integration
- Docker exposes ports via environment-configurable mappings
VIII. Concurrency
Scale out via the process model.
The system scales horizontally by adding worker processes and vertically by tuning concurrency settings.
Where this lives:
- Horizontal: `docker/docker-compose.prod.yml` — `replicas: ${WORKER_REPLICAS:-2}`, each worker is independent
- Vertical: `config/tasker/base/orchestration.toml` — `max_concurrent_operations`, `batch_size` per event system
- Worker handler parallelism: `[worker.mpsc_channels.handler_dispatch] max_concurrent_handlers = 10`
- Load shedding: `[worker.mpsc_channels.handler_dispatch.load_shedding] capacity_threshold_percent = 80.0`
Gaps: The actor pattern within a single process is more vertical than horizontal — actors share a Tokio runtime and scale via async concurrency, not OS processes. This is a pragmatic choice for Rust’s async model but means single-process scaling has limits that multiple processes solve.
IX. Disposability
Maximize robustness with fast startup and graceful shutdown.
This factor gets significant attention due to the distributed nature of task orchestration.
Where this lives:
- Graceful shutdown: Signal handlers (SIGTERM, SIGINT) in `tasker-orchestration/src/bin/server.rs` and `tasker-worker/src/bin/` — actors drain in-flight work, OpenTelemetry flushes spans, connections close cleanly
- Fast startup: Compiled binary, pooled database connections, environment-driven config (no service discovery delays)
- Crash recovery: PGMQ visibility timeouts requeue unacknowledged messages; steps claimed by a crashed worker reappear for others after `visibility_timeout_seconds`
- Entrypoint: `docker/scripts/orchestration-entrypoint.sh` uses `exec` to replace the shell with the app process (proper PID 1 signal handling)
- Health checks: Docker `start_period` allows grace time before liveness probes begin
For contributors: When adding new async subsystems, ensure they participate in the shutdown sequence. Bounded channels and drain timeouts (shutdown_drain_timeout_ms) prevent shutdown from hanging indefinitely.
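A minimal sketch of a subsystem participating in graceful shutdown with a drain timeout; the channel wiring and timeout value are illustrative, not Tasker's actual shutdown plumbing.

use std::time::Duration;
use tokio::{sync::watch, time};

// Stops accepting work when the shutdown signal fires, then drains in-flight
// work under a timeout so shutdown can never hang indefinitely.
async fn run_subsystem(mut shutdown: watch::Receiver<bool>) {
    let drain_timeout = Duration::from_millis(2_000); // e.g. shutdown_drain_timeout_ms

    loop {
        tokio::select! {
            _ = shutdown.changed() => {
                // Drain whatever is still in flight, but never wait forever.
                let _ = time::timeout(drain_timeout, drain_in_flight_work()).await;
                break;
            }
            _ = do_one_unit_of_work() => {}
        }
    }
}

async fn do_one_unit_of_work() {
    time::sleep(Duration::from_millis(100)).await;
}

async fn drain_in_flight_work() {
    time::sleep(Duration::from_millis(50)).await;
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(false);
    let handle = tokio::spawn(run_subsystem(rx));

    // Simulate a termination signal arriving after a moment.
    time::sleep(Duration::from_millis(250)).await;
    let _ = tx.send(true);
    let _ = handle.await;
}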
X. Dev/Prod Parity
Keep development, staging, and production as similar as possible.
The same code, same migrations, and same config structure run everywhere — only values change.
Where this lives:
- `config/tasker/base/` provides defaults; `config/tasker/environments/` overrides per environment — structure is identical
- `migrations/` directory contains SQL migrations shared across all environments
- Docker images use the same base (`debian:bullseye-slim`) and runtime user (`tasker:999`)
- Structured logging format (tracing crate) is consistent; only verbosity changes (`RUST_LOG`)
- E2E tests (`--features test-services`) exercise the same code paths as production
Gaps: Development uses cargo run with debug builds while production uses release-optimized Docker images. The observability stack (Grafana LGTM) is available in docker-compose.dev.yml but most local development happens without it. These are standard trade-offs, but contributors should periodically test against the full Docker stack to catch environment-specific issues.
XI. Logs
Treat logs as event streams.
All logging goes to stdout/stderr. No file-based logging is built into the application.
Where this lives:
- `tasker-shared/src/logging.rs` — tracing subscriber writes to stdout, JSON format in production, ANSI colors in development (TTY-detected)
- OpenTelemetry integration exports structured traces via `OTEL_EXPORTER_OTLP_ENDPOINT`
- Correlation IDs (`correlation_id`) propagate through tasks, steps, actors, and message queues for distributed tracing
- `docker-compose.dev.yml` includes Loki for log aggregation and Grafana for visualization
- Entrypoint scripts log to stdout/stderr with a role-prefixed format
For contributors: Use the tracing crate’s #[instrument] macro and structured fields (tracing::info!(task_id = %id, "processing")) rather than string interpolation. Never write to log files directly.
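A minimal sketch of that style, assuming the tracing and tracing-subscriber crates; function and field names are illustrative.

use tracing::{info, instrument};

// #[instrument] opens a span around the function and records its arguments;
// events carry structured fields rather than interpolated strings.
#[instrument]
fn process_step(task_id: &str, step_name: &str) {
    info!(task_id, step_name, "processing step");
}

fn main() {
    // A console subscriber writing to stdout, as Factor XI expects.
    tracing_subscriber::fmt().init();
    process_step("task-123-example", "validate_order");
}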
XII. Admin Processes
Run admin/management tasks as one-off processes.
The CLI and deployment scripts serve this role.
Where this lives:
- `tasker-ctl/` — task management (`create`, `list`, `cancel`), DLQ investigation (`dlq list`, `dlq recover`), system health, auth token management
- `docker/scripts/orchestration-entrypoint.sh` — `DEPLOYMENT_MODE=migrate-only` runs migrations and exits without starting the server
- `config-validator` binary validates TOML configuration as a one-off check
- Database migrations run as a distinct phase before application startup, with retry logic and timeout protection
Gaps: Some administrative operations (cache invalidation, circuit breaker reset) are only available through the REST/gRPC API, not the CLI. As the CLI matures, these should become first-class admin commands.
Using This as a Contributor
These factors are not rules to enforce mechanically. They’re a lens for evaluating design decisions:
- Adding a new service dependency? Factor IV says treat it as an attached resource — configure via environment, degrade gracefully without it.
- Storing state? Factor VI says processes are stateless — put it in PostgreSQL, not in memory.
- Adding configuration? Factor III says environment variables — use the existing TOML-with-env-var-interpolation pattern.
- Writing logs? Factor XI says event streams — stdout, structured fields, correlation IDs.
- Building deployment artifacts? Factor V says separate build/release/run — don’t bake configuration into images.
When a factor conflicts with practical needs, document the trade-off. The goal is not purity but awareness.
Attribution
The Twelve-Factor App methodology was created by Adam Wiggins with contributions from many others, originally published at 12factor.net. It is made available under the MIT License and has influenced how a generation of developers think about building software-as-a-service applications. Its influence on this project is gratefully acknowledged.
PEP: 20 Title: The Zen of Python Author: Tim Peters tim.peters@gmail.com Status: Active Type: Informational Created: 19-Aug-2004 Post-History: 22-Aug-2004
Abstract
Long time Pythoneer Tim Peters succinctly channels the BDFL’s guiding principles for Python’s design into 20 aphorisms, only 19 of which have been written down.
The Zen of Python
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Easter Egg
import this
References
Originally posted to comp.lang.python/python-list@python.org under a thread called "The Way of Python": https://groups.google.com/d/msg/comp.lang.python/B_VxeTBClM0/L8W9KlsiriUJ
Copyright
This document has been placed in the public domain.
Tasker Core Reference
This directory contains technical reference documentation with precise specifications and implementation details.
Documents
| Document | Description |
|---|---|
| StepContext API | Cross-language API reference for step handlers |
| Table Management | Database table structure and management |
| Task and Step Readiness | SQL functions and execution logic |
| sccache Configuration | Build caching setup |
| Library Deployment Patterns | Library distribution strategies |
| FFI Telemetry Pattern | Cross-language telemetry integration |
When to Read These
- Need exact behavior: Consult these for precise specifications
- Debugging edge cases: Check implementation details
- Database operations: See Table Management and SQL functions
- Build optimization: Review sccache Configuration
Related Documentation
- Architecture - System structure and patterns
- Guides - Practical how-to documentation
- Development - Developer tooling and patterns
FFI Boundary Types Reference
Cross-language type harmonization for Rust, Python, and TypeScript boundaries.
This document defines the canonical FFI boundary types that cross the Rust orchestration layer and the Python/TypeScript worker implementations. These types are critical for correct serialization/deserialization between languages.
Overview
The tasker-core system uses FFI (Foreign Function Interface) to integrate Rust orchestration with Python and TypeScript step handlers. Data crosses this boundary via JSON serialization. These types must remain consistent across all three languages.
Source of Truth: Rust types in tasker-shared/src/messaging/execution_types.rs and
tasker-shared/src/models/core/batch_worker.rs.
Type Mapping
| Rust Type | Python Type | TypeScript Type |
|---|---|---|
CursorConfig | RustCursorConfig | RustCursorConfig |
BatchProcessingOutcome | BatchProcessingOutcome | BatchProcessingOutcome |
BatchWorkerInputs | RustBatchWorkerInputs | RustBatchWorkerInputs |
BatchMetadata | BatchMetadata | BatchMetadata |
FailureStrategy | FailureStrategy | FailureStrategy |
CursorConfig
Cursor configuration for a single batch’s position and range.
Flexible Cursor Types
Unlike simple integer cursors, RustCursorConfig supports flexible cursor values:
- Integer for record IDs:
123 - String for timestamps:
"2025-11-01T00:00:00Z" - Object for composite keys:
{"page": 1, "offset": 0}
This enables cursor-based pagination across diverse data sources.
Rust Definition
#![allow(unused)]
fn main() {
// tasker-shared/src/messaging/execution_types.rs
pub struct CursorConfig {
pub batch_id: String,
pub start_cursor: serde_json::Value, // Flexible type
pub end_cursor: serde_json::Value, // Flexible type
pub batch_size: u32,
}
}
TypeScript Definition
// workers/typescript/src/types/batch.ts
export interface RustCursorConfig {
batch_id: string;
start_cursor: unknown; // Flexible: number | string | object
end_cursor: unknown;
batch_size: number;
}
Python Definition
# workers/python/python/tasker_core/types.py
class RustCursorConfig(BaseModel):
batch_id: str
start_cursor: Any # Flexible: int | str | dict
end_cursor: Any
batch_size: int
JSON Wire Format
{
"batch_id": "batch_001",
"start_cursor": 0,
"end_cursor": 1000,
"batch_size": 1000
}
BatchProcessingOutcome
Discriminated union representing the outcome of a batchable step.
Rust Definition
#![allow(unused)]
fn main() {
// tasker-shared/src/messaging/execution_types.rs
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BatchProcessingOutcome {
NoBatches,
CreateBatches {
worker_template_name: String,
worker_count: u32,
cursor_configs: Vec<CursorConfig>,
total_items: u64,
},
}
}
TypeScript Definition
// workers/typescript/src/types/batch.ts
export interface NoBatchesOutcome {
type: 'no_batches';
}
export interface CreateBatchesOutcome {
type: 'create_batches';
worker_template_name: string;
worker_count: number;
cursor_configs: RustCursorConfig[];
total_items: number;
}
export type BatchProcessingOutcome = NoBatchesOutcome | CreateBatchesOutcome;
Python Definition
# workers/python/python/tasker_core/types.py
class NoBatchesOutcome(BaseModel):
type: str = "no_batches"
class CreateBatchesOutcome(BaseModel):
type: str = "create_batches"
worker_template_name: str
worker_count: int
cursor_configs: list[RustCursorConfig]
total_items: int
BatchProcessingOutcome = NoBatchesOutcome | CreateBatchesOutcome
JSON Wire Formats
NoBatches:
{
"type": "no_batches"
}
CreateBatches:
{
"type": "create_batches",
"worker_template_name": "batch_worker_template",
"worker_count": 5,
"cursor_configs": [
{ "batch_id": "001", "start_cursor": 0, "end_cursor": 1000, "batch_size": 1000 },
{ "batch_id": "002", "start_cursor": 1000, "end_cursor": 2000, "batch_size": 1000 }
],
"total_items": 5000
}
BatchWorkerInputs
Initialization inputs for batch worker instances, stored in workflow_steps.inputs.
Rust Definition
#![allow(unused)]
fn main() {
// tasker-shared/src/models/core/batch_worker.rs
pub struct BatchWorkerInputs {
pub cursor: CursorConfig,
pub batch_metadata: BatchMetadata,
pub is_no_op: bool,
}
pub struct BatchMetadata {
// checkpoint_interval removed - handlers decide when to checkpoint
pub cursor_field: String,
pub failure_strategy: FailureStrategy,
}
pub enum FailureStrategy {
ContinueOnFailure,
FailFast,
Isolate,
}
}
TypeScript Definition
// workers/typescript/src/types/batch.ts
export type FailureStrategy = 'continue_on_failure' | 'fail_fast' | 'isolate';
export interface BatchMetadata {
// checkpoint_interval removed - handlers decide when to checkpoint
cursor_field: string;
failure_strategy: FailureStrategy;
}
export interface RustBatchWorkerInputs {
cursor: RustCursorConfig;
batch_metadata: BatchMetadata;
is_no_op: boolean;
}
Python Definition
# workers/python/python/tasker_core/types.py
class FailureStrategy(str, Enum):
CONTINUE_ON_FAILURE = "continue_on_failure"
FAIL_FAST = "fail_fast"
ISOLATE = "isolate"
class BatchMetadata(BaseModel):
# checkpoint_interval removed - handlers decide when to checkpoint
cursor_field: str
failure_strategy: FailureStrategy
class RustBatchWorkerInputs(BaseModel):
cursor: RustCursorConfig
batch_metadata: BatchMetadata
is_no_op: bool
JSON Wire Format
{
"cursor": {
"batch_id": "batch_001",
"start_cursor": 0,
"end_cursor": 1000,
"batch_size": 1000
},
"batch_metadata": {
"cursor_field": "id",
"failure_strategy": "continue_on_failure"
},
"is_no_op": false
}
BatchAggregationResult
Standardized result from aggregating multiple batch worker results.
Cross-Language Standard
All three languages produce identical aggregation results:
| Field | Type | Description |
|---|---|---|
total_processed | int | Items processed across all batches |
total_succeeded | int | Items that succeeded |
total_failed | int | Items that failed |
total_skipped | int | Items that were skipped |
batch_count | int | Number of batch workers that ran |
success_rate | float | Success rate (0.0 to 1.0) |
errors | array | Collected errors (limited to 100) |
error_count | int | Total error count |
Usage Examples
TypeScript:
import { aggregateBatchResults } from 'tasker-core';
const workerResults = Object.values(context.previousResults)
.filter(r => r?.batch_worker);
const summary = aggregateBatchResults(workerResults);
return this.success(summary);
Python:
from tasker_core.types import aggregate_batch_results
worker_results = [
context.get_dependency_result(f"worker_{i}")
for i in range(batch_count)
]
summary = aggregate_batch_results(worker_results)
return self.success(summary.model_dump())
Factory Functions
Creating BatchProcessingOutcome
TypeScript:
import { noBatches, createBatches, RustCursorConfig } from 'tasker-core';
// No batches needed
const outcome1 = noBatches();
// Create batch workers
const configs: RustCursorConfig[] = [
{ batch_id: '001', start_cursor: 0, end_cursor: 1000, batch_size: 1000 },
{ batch_id: '002', start_cursor: 1000, end_cursor: 2000, batch_size: 1000 },
];
const outcome2 = createBatches('process_batch', 2, configs, 2000);
Python:
from tasker_core.types import no_batches, create_batches, RustCursorConfig
# No batches needed
outcome1 = no_batches()
# Create batch workers
configs = [
RustCursorConfig(batch_id="001", start_cursor=0, end_cursor=1000, batch_size=1000),
RustCursorConfig(batch_id="002", start_cursor=1000, end_cursor=2000, batch_size=1000),
]
outcome2 = create_batches("process_batch", 2, configs, 2000)
Type Guards (TypeScript)
import {
BatchProcessingOutcome,
isNoBatches,
isCreateBatches
} from 'tasker-core';
function handleOutcome(outcome: BatchProcessingOutcome): void {
if (isNoBatches(outcome)) {
console.log('No batches needed');
return;
}
if (isCreateBatches(outcome)) {
console.log(`Creating ${outcome.worker_count} workers`);
console.log(`Total items: ${outcome.total_items}`);
}
}
Migration Notes
From Legacy Types
If migrating from older batch processing types:
-
CursorConfig → RustCursorConfig: The new type adds
batch_idfield and uses flexible cursor types (unknown/Any) instead of fixednumber/int. -
Inline batch_processing_outcome → BatchProcessingOutcome: Use the discriminated union type with factory functions instead of building JSON manually.
-
Manual aggregation → aggregateBatchResults: Use the standardized aggregation function for consistent cross-language behavior.
Backwards Compatibility
The legacy CursorConfig type (with number/int cursors) is preserved for simple
use cases. Use RustCursorConfig when:
- Working with Rust orchestration inputs
- Needing flexible cursor types (timestamps, UUIDs, composites)
- Building `BatchProcessingOutcome` structures
Related Documentation
FFI Telemetry Initialization Pattern
Overview
This document describes the two-phase telemetry initialization pattern for Foreign Function Interface (FFI) integrations where Rust code is called from languages that don’t have a Tokio runtime during initialization (Ruby, Python, WASM).
The Problem
OpenTelemetry batch exporter requires a Tokio runtime context for async I/O operations:
#![allow(unused)]
fn main() {
// This PANICS if called outside a Tokio runtime
let tracer_provider = SdkTracerProvider::builder()
.with_batch_exporter(exporter) // ❌ Requires Tokio runtime
.with_resource(resource)
.with_sampler(sampler)
.build();
}
FFI Initialization Timeline:
1. Language Runtime Loads Extension (Ruby, Python, WASM)
↓ No Tokio runtime exists yet
2. Extension Init Function Called (Magnus init, PyO3 init, etc.)
↓ Logging needed for debugging, but no async runtime
3. Later: Create Tokio Runtime
↓ Now safe to initialize telemetry
4. Bootstrap Worker System
The Solution: Two-Phase Initialization
Phase 1: Console-Only Logging (FFI-Safe)
During language extension initialization, use console-only logging that requires no Tokio runtime:
#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs
pub fn init_console_only() {
// Initialize console logging without OpenTelemetry
// Safe to call from any thread, no async runtime required
}
}
When to use:
- During Magnus initialization (Ruby)
- During PyO3 initialization (Python)
- During WASM module initialization
- Any context where no Tokio runtime exists
Phase 2: Full Telemetry (Tokio Context)
After creating the Tokio runtime, initialize full telemetry including OpenTelemetry:
#![allow(unused)]
fn main() {
// Create Tokio runtime
let runtime = tokio::runtime::Runtime::new()?;
// Initialize telemetry in runtime context
runtime.block_on(async {
tasker_shared::logging::init_tracing();
});
}
When to use:
- After creating Tokio runtime in bootstrap
- Inside `runtime.block_on()` context
- When async I/O is available
Implementation Guide
Ruby FFI (Magnus)
File Structure:
- `workers/ruby/ext/tasker_core/src/ffi_logging.rs` — Phase 1
- `workers/ruby/ext/tasker_core/src/bootstrap.rs` — Phase 2
Phase 1: Magnus Initialization
#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/ffi_logging.rs
pub fn init_ffi_logger() -> Result<(), Box<dyn std::error::Error>> {
// Check if telemetry is enabled
let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
.map(|v| v.to_lowercase() == "true")
.unwrap_or(false);
if telemetry_enabled {
// Phase 1: Defer telemetry init to runtime context
println!("Telemetry enabled - deferring logging init to runtime context");
} else {
// Phase 1: Safe to initialize console-only logging
tasker_shared::logging::init_console_only();
tasker_shared::log_ffi!(
info,
"FFI console logging initialized (no telemetry)",
component: "ffi_boundary"
);
}
Ok(())
}
}
Phase 2: After Runtime Creation
#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/bootstrap.rs
pub fn bootstrap_worker() -> Result<Value, Error> {
// Create tokio runtime
let runtime = tokio::runtime::Runtime::new()?;
// Phase 2: Initialize telemetry in Tokio runtime context
runtime.block_on(async {
tasker_shared::logging::init_tracing();
});
// Continue with bootstrap...
let system_context = runtime.block_on(async {
SystemContext::new_for_worker().await
})?;
// ... rest of bootstrap
}
}
Python FFI (PyO3)
Phase 1: PyO3 Module Initialization
#![allow(unused)]
fn main() {
// workers/python/src/lib.rs
#[pymodule]
fn tasker_core(py: Python, m: &PyModule) -> PyResult<()> {
// Check if telemetry is enabled
let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
.map(|v| v.to_lowercase() == "true")
.unwrap_or(false);
if telemetry_enabled {
println!("Telemetry enabled - deferring logging init to runtime context");
} else {
tasker_shared::logging::init_console_only();
}
// Register Python functions...
m.add_function(wrap_pyfunction!(bootstrap_worker, m)?)?;
Ok(())
}
}
Phase 2: After Runtime Creation
#![allow(unused)]
fn main() {
// workers/python/src/bootstrap.rs
#[pyfunction]
pub fn bootstrap_worker() -> PyResult<String> {
// Create tokio runtime
let runtime = tokio::runtime::Runtime::new()
.map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(
format!("Failed to create runtime: {}", e)
))?;
// Phase 2: Initialize telemetry in Tokio runtime context
runtime.block_on(async {
tasker_shared::logging::init_tracing();
});
// Continue with bootstrap...
let system_context = runtime.block_on(async {
SystemContext::new_for_worker().await
})?;
// ... rest of bootstrap
}
}
WASM FFI
Phase 1: WASM Module Initialization
#![allow(unused)]
fn main() {
// workers/wasm/src/lib.rs
#[wasm_bindgen(start)]
pub fn init_wasm() {
// Check if telemetry is enabled (from JS environment)
let telemetry_enabled = js_sys::Reflect::get(
&js_sys::global(),
&"TELEMETRY_ENABLED".into()
).ok()
.and_then(|v| v.as_bool())
.unwrap_or(false);
if telemetry_enabled {
web_sys::console::log_1(&"Telemetry enabled - deferring logging init to runtime context".into());
} else {
tasker_shared::logging::init_console_only();
}
}
}
Phase 2: After Runtime Creation
#![allow(unused)]
fn main() {
// workers/wasm/src/bootstrap.rs
#[wasm_bindgen]
pub async fn bootstrap_worker() -> Result<JsValue, JsValue> {
// In WASM, we're already in an async context
// Initialize telemetry directly
tasker_shared::logging::init_tracing();
// Continue with bootstrap...
let system_context = SystemContext::new_for_worker().await
.map_err(|e| JsValue::from_str(&format!("Bootstrap failed: {}", e)))?;
// ... rest of bootstrap
}
}
Docker Configuration
Enable telemetry in docker-compose with appropriate comments:
# docker/docker-compose.test.yml
ruby-worker:
environment:
# Two-phase FFI telemetry initialization pattern
# Phase 1: Magnus init skips telemetry (no runtime)
# Phase 2: bootstrap_worker() initializes telemetry in Tokio context
TELEMETRY_ENABLED: "true"
OTEL_EXPORTER_OTLP_ENDPOINT: http://observability:4317
OTEL_SERVICE_NAME: tasker-ruby-worker
OTEL_SERVICE_VERSION: "0.1.0"
Verification
Expected Log Sequence
Ruby Worker with Telemetry Enabled:
1. Magnus init:
Telemetry enabled - deferring logging init to runtime context
2. After runtime creation:
Console logging with OpenTelemetry initialized
environment=test
opentelemetry_enabled=true
otlp_endpoint=http://observability:4317
service_name=tasker-ruby-worker
3. OpenTelemetry components:
Global meter provider is set
OpenTelemetry Prometheus text exporter initialized
Ruby Worker with Telemetry Disabled:
1. Magnus init:
Console-only logging initialized (FFI-safe mode)
environment=test
opentelemetry_enabled=false
context=ffi_initialization
2. After runtime creation:
(No additional initialization - already complete)
Health Check
All workers should be healthy with telemetry enabled:
$ curl http://localhost:8082/health
{"status":"healthy","timestamp":"...","worker_id":"worker-..."}
Grafana Verification
With all services running with telemetry:
- Access Grafana: http://localhost:3000 (admin/admin)
- Navigate to Explore → Tempo
- Query by service:
tasker-ruby-worker - Verify traces appear with correlation IDs
Key Principles
1. Separation of Concerns
- Infrastructure Decision (Tokio runtime availability): Handled by init functions
- Business Logic (when to log): Handled by application code
- Clean separation prevents runtime panics
2. Fail-Safe Defaults
- Always provide console logging at minimum
- Telemetry is enhancement, not requirement
- Graceful degradation if telemetry unavailable
3. Explicit Over Implicit
- Clear phase separation in code
- Documented at each call site
- Easy to understand initialization flow
4. Language-Agnostic Pattern
- Same pattern works for Ruby, Python, WASM
- Consistent across all FFI bindings
- Single source of truth in tasker-shared
Troubleshooting
“no reactor running” Panic
Symptom:
thread 'main' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime'
Cause:
Calling init_tracing() when TELEMETRY_ENABLED=true outside a Tokio runtime context.
Solution: Use two-phase pattern:
#![allow(unused)]
fn main() {
// Phase 1: Skip telemetry init
if telemetry_enabled {
println!("Deferring telemetry init...");
} else {
init_console_only();
}
// Phase 2: Initialize in runtime
runtime.block_on(async {
init_tracing();
});
}
Telemetry Not Appearing
Symptom:
No traces in Grafana/Tempo despite TELEMETRY_ENABLED=true.
Check:
- Verify environment variable is set:
TELEMETRY_ENABLED=true - Check logs for initialization message
- Verify OTLP endpoint is reachable
- Check observability stack is healthy
Debug:
# Check worker logs
docker logs docker-ruby-worker-1 | grep -E "telemetry|OpenTelemetry"
# Check observability stack
curl http://localhost:4317 # Should connect to OTLP gRPC
# Check Grafana Tempo
curl http://localhost:3200/api/status/buildinfo
Performance Considerations
Minimal Overhead
- Phase 1: Simple console initialization, <1ms
- Phase 2: Batch exporter initialization, <10ms
- Total overhead: <15ms during startup
- Zero runtime overhead after initialization
Memory Usage
- Console-only: ~100KB (tracing subscriber)
- With telemetry: ~500KB (includes OTLP client buffers)
- Acceptable for all deployment scenarios
Future Enhancements
Lazy Telemetry Upgrade
Future optimization could upgrade console-only subscriber to include telemetry without restart:
#![allow(unused)]
fn main() {
// Not yet implemented - requires tracing layer hot-swapping
pub fn upgrade_to_telemetry() -> TaskerResult<()> {
// Would require custom subscriber implementation
// to support layer addition after initialization
}
}
Per-Worker Telemetry Control
Could extend pattern to support per-worker telemetry configuration:
#![allow(unused)]
fn main() {
// Not yet implemented
pub fn init_with_config(config: TelemetryConfig) -> TaskerResult<()> {
// Would allow fine-grained control per worker
}
}
Phase 1.5: Worker Span Instrumentation with Trace Context Propagation
Implemented: 2025-11-24 Status: ✅ Production Ready - Validated end-to-end with Ruby workers
The Challenge
After implementing two-phase telemetry initialization (Phase 1), we discovered a gap: while OpenTelemetry infrastructure was working, worker step execution spans lacked correlation attributes needed for distributed tracing.
The Problem:
- ✅ Orchestration spans had correlation_id, task_uuid, step_uuid
- ✅ Worker infrastructure spans existed (read_messages, reserve_capacity)
- ❌ Worker step execution spans were missing these attributes
Root Cause: Ruby workers use an async dual-event-system architecture where:
- Rust worker fires FFI event to Ruby (via EventPoller polling every 10ms)
- Ruby processes event asynchronously
- Ruby returns completion via FFI
The async boundary made traditional span scope maintenance impossible.
The Solution: Trace ID Propagation Pattern
Instead of trying to maintain span scope across the async FFI boundary, we propagate trace context as opaque strings:
Rust: Extract trace_id/span_id → Add to FFI event payload →
Ruby: Treat as opaque strings → Propagate through processing → Include in completion →
Rust: Create linked span using returned trace_id/span_id
Key Insight: Ruby doesn’t need to understand OpenTelemetry - it just passes through trace IDs like it already does with correlation_id.
Implementation: Rust Side (Phase 1.5a)
File: tasker-worker/src/worker/command_processor.rs
Step 1: Create instrumented span with all required attributes
#![allow(unused)]
fn main() {
use tracing::{span, event, Level, Instrument};
pub async fn handle_execute_step(&self, step_message: SimpleStepMessage) -> TaskerResult<()> {
// Fetch step details to get step_name and namespace
let task_sequence_step = self.fetch_task_sequence_step(&step_message).await?;
// Create span with all 5 required attributes
let step_span = span!(
Level::INFO,
"worker.step_execution",
correlation_id = %step_message.correlation_id,
task_uuid = %step_message.task_uuid,
step_uuid = %step_message.step_uuid,
step_name = %task_sequence_step.workflow_step.name,
namespace = %task_sequence_step.task.namespace_name
);
let execution_result = async {
event!(Level::INFO, "step.execution_started");
// Extract trace context for FFI propagation
let trace_id = Some(step_message.correlation_id.to_string());
let span_id = Some(format!("span-{}", step_message.step_uuid));
// Fire FFI event with trace context
let result = self.event_publisher
.fire_step_execution_event_with_trace(
&task_sequence_step,
trace_id,
span_id,
)
.await?;
event!(Level::INFO, "step.execution_completed");
Ok(result)
}
.instrument(step_span) // Wrap async block with span
.await;
execution_result
}
}
Key Points:
- All 5 attributes present:
correlation_id,task_uuid,step_uuid,step_name,namespace - Event markers:
step.execution_started,step.execution_completed .instrument(span)pattern for async code- Trace context extracted and passed to FFI
Implementation: Data Structures
File: tasker-shared/src/types/base.rs
Add trace context fields to FFI event structures:
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionEvent {
pub event_id: Uuid,
pub task_uuid: Uuid,
pub step_uuid: Uuid,
pub task_sequence_step: TaskSequenceStep,
pub correlation_id: Uuid,
// Trace context propagation
#[serde(skip_serializing_if = "Option::is_none")]
pub trace_id: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub span_id: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionCompletionEvent {
pub event_id: Uuid,
pub task_uuid: Uuid,
pub step_uuid: Uuid,
pub success: bool,
pub result: Option<serde_json::Value>,
// Trace context from Ruby
#[serde(skip_serializing_if = "Option::is_none")]
pub trace_id: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub span_id: Option<String>,
}
}
Design Notes:
- Fields are optional for backward compatibility
skip_serializing_ifprevents empty fields in JSON- Treated as opaque strings (no OpenTelemetry types)
Implementation: Ruby Side Propagation
File: workers/ruby/lib/tasker_core/event_bridge.rb
Propagate trace context like correlation_id:
def wrap_step_execution_event(event_data)
wrapped = {
event_id: event_data[:event_id],
task_uuid: event_data[:task_uuid],
step_uuid: event_data[:step_uuid],
task_sequence_step: TaskerCore::Models::TaskSequenceStepWrapper.new(event_data[:task_sequence_step])
}
# Expose correlation_id at top level for easy access
wrapped[:correlation_id] = event_data[:correlation_id] if event_data[:correlation_id]
wrapped[:parent_correlation_id] = event_data[:parent_correlation_id] if event_data[:parent_correlation_id]
# Expose trace_id and span_id for distributed tracing
wrapped[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
wrapped[:span_id] = event_data[:span_id] if event_data[:span_id]
wrapped
end
File: workers/ruby/lib/tasker_core/subscriber.rb
Include trace context in completion:
def publish_step_completion(event_data:, success:, result: nil, error_message: nil, metadata: nil)
completion_payload = {
event_id: event_data[:event_id],
task_uuid: event_data[:task_uuid],
step_uuid: event_data[:step_uuid],
success: success,
result: result,
metadata: metadata,
error_message: error_message
}
# Propagate trace context back to Rust
completion_payload[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
completion_payload[:span_id] = event_data[:span_id] if event_data[:span_id]
TaskerCore::Worker::EventBridge.instance.publish_step_completion(completion_payload)
end
Key Points:
- Ruby treats trace_id and span_id as opaque strings
- No OpenTelemetry dependency in Ruby
- Simple pass-through pattern like correlation_id
- Works with existing dual-event-system architecture
Implementation: Completion Span (Rust)
File: tasker-worker/src/worker/event_subscriber.rs
Create linked span when receiving Ruby completion:
#![allow(unused)]
fn main() {
pub fn handle_completion(&self, completion: StepExecutionCompletionEvent) -> TaskerResult<()> {
// Create linked span using trace context from Ruby
let completion_span = if let (Some(trace_id), Some(span_id)) =
(&completion.trace_id, &completion.span_id) {
span!(
Level::INFO,
"worker.step_completion_received",
trace_id = %trace_id,
span_id = %span_id,
event_id = %completion.event_id,
task_uuid = %completion.task_uuid,
step_uuid = %completion.step_uuid,
success = completion.success
)
} else {
// Fallback span without trace context
span!(
Level::INFO,
"worker.step_completion_received",
event_id = %completion.event_id,
task_uuid = %completion.task_uuid,
step_uuid = %completion.step_uuid,
success = completion.success
)
};
let _guard = completion_span.enter();
event!(Level::INFO, "step.ruby_execution_completed",
success = completion.success,
duration_ms = completion.metadata.execution_time_ms
);
// Continue with normal completion processing...
Ok(())
}
}
Key Points:
- Uses returned trace_id/span_id to create linked span
- Graceful fallback if trace context not available
- Event:
step.ruby_execution_completed
Validation Results (2025-11-24)
Test Task:
- Correlation ID:
88f21229-4085-4d53-8f52-2fde0b7228e2 - Task UUID:
019ab6f9-7a27-7d16-b298-1ea41b327373 - 4 steps executed successfully
Log Evidence:
worker.step_execution{
correlation_id=88f21229-4085-4d53-8f52-2fde0b7228e2
task_uuid=019ab6f9-7a27-7d16-b298-1ea41b327373
step_uuid=019ab6f9-7a2a-7873-a5d1-93234ae46003
step_name=linear_step_1
namespace=linear_workflow
}: step.execution_started
Step execution event with trace context fired successfully to FFI handlers
trace_id=Some("88f21229-4085-4d53-8f52-2fde0b7228e2")
span_id=Some("span-019ab6f9-7a2a-7873-a5d1-93234ae46003")
worker.step_completion_received{...}: step.ruby_execution_completed
Tempo Query Results:
- By
correlation_id: 9 traces (5 orchestration + 4 worker) - By
task_uuid: 13 traces (complete task lifecycle) - ✅ All attributes indexed and queryable
- ✅ Spans exported to Tempo successfully
Complete Trace Flow
For each step execution:
┌─────────────────────────────────────────────────────┐
│ Rust Worker (command_processor.rs) │
│ 1. Create worker.step_execution span │
│ - correlation_id, task_uuid, step_uuid │
│ - step_name, namespace │
│ 2. Emit step.execution_started event │
│ 3. Extract trace_id and span_id from span │
│ 4. Add to StepExecutionEvent │
│ 5. Fire FFI event with trace context │
│ 6. Emit step.execution_completed event │
└─────────────────┬───────────────────────────────────┘
│
│ Async FFI boundary (EventPoller polling)
▼
┌─────────────────────────────────────────────────────┐
│ Ruby EventBridge & Subscriber │
│ 1. Receive event with trace_id/span_id │
│ 2. Propagate as opaque strings │
│ 3. Execute Ruby handler (business logic) │
│ 4. Include trace_id/span_id in completion │
└─────────────────┬───────────────────────────────────┘
│
│ Completion via FFI
▼
┌─────────────────────────────────────────────────────┐
│ Rust Worker (event_subscriber.rs) │
│ 1. Receive StepExecutionCompletionEvent │
│ 2. Extract trace_id and span_id │
│ 3. Create worker.step_completion_received span │
│ 4. Emit step.ruby_execution_completed event │
└─────────────────────────────────────────────────────┘
Benefits of This Pattern
- No Breaking Changes: Optional fields, backward compatible
- Ruby Simplicity: No OpenTelemetry dependency, opaque string propagation
- Trace Continuity: Same trace_id flows Rust → Ruby → Rust
- Query-Friendly: Tempo queries show complete execution flow
- Extensible: Pattern works for Python, WASM, any FFI language
- Performance: Zero overhead in Ruby (just string passing)
Pattern for Python Workers
The exact same pattern applies to Python workers:
Python Side (PyO3):
# workers/python/tasker_core/event_bridge.py
def wrap_step_execution_event(event_data):
wrapped = {
'event_id': event_data['event_id'],
'task_uuid': event_data['task_uuid'],
'step_uuid': event_data['step_uuid'],
# ... other fields
}
# Propagate trace context as opaque strings
if 'trace_id' in event_data:
wrapped['trace_id'] = event_data['trace_id']
if 'span_id' in event_data:
wrapped['span_id'] = event_data['span_id']
return wrapped
Key Insight: Any FFI language can use this pattern - they just need to pass through trace_id and span_id as strings.
Performance Characteristics
- Rust overhead: ~50-100 microseconds per span creation
- FFI overhead: ~10-50 microseconds for extra string fields
- Ruby overhead: Zero (just string passing, no OpenTelemetry)
- Total overhead: <200 microseconds per step execution
- Network: Spans batched and exported asynchronously
Troubleshooting
Symptom: Spans missing trace_id/span_id in Tempo
Check:
- Verify Rust logs show “Step execution event with trace context fired successfully”
- Check Ruby logs don’t have errors in EventBridge
- Verify completion events include trace_id/span_id
- Query Tempo by task_uuid to see if spans exist
Debug:
# Check Rust worker logs for trace context
docker logs docker-ruby-worker-1 | grep -E "(trace_id|span_id)"
# Query Tempo by task_uuid
curl "http://localhost:3200/api/search?tags=task_uuid=<UUID>"
# Check span export metrics
curl "http://localhost:9090/metrics" | grep otel
Future Enhancements
OpenTelemetry W3C Trace Context: Currently using correlation_id as trace_id placeholder. Future enhancement:
#![allow(unused)]
fn main() {
use opentelemetry::trace::TraceContextExt;
// Extract real OpenTelemetry trace context
let cx = tracing::Span::current().context();
let span_context = cx.span().span_context();
let trace_id = span_context.trace_id().to_string();
let span_id = span_context.span_id().to_string();
}
Span Linking:
Use OpenTelemetry’s Link API for explicit parent-child relationships:
#![allow(unused)]
fn main() {
use opentelemetry::trace::{Link, SpanContext, TraceId, SpanId};
// Create linked span
let parent_context = SpanContext::new(
TraceId::from_hex(&trace_id)?,
SpanId::from_hex(&span_id)?,
TraceFlags::default(),
false,
TraceState::default(),
);
let span = span!(
Level::INFO,
"worker.step_completion_received",
links = vec![Link::new(parent_context, Vec::new())]
);
}
References
- OpenTelemetry Rust: https://github.com/open-telemetry/opentelemetry-rust
- Grafana LGTM Stack: https://grafana.com/oss/lgtm-stack/
- W3C Trace Context: https://www.w3.org/TR/trace-context/
Related Documentation
- `tasker-shared/src/logging.rs` - Core logging implementation
- `workers/rust/README.md` - Event-driven FFI architecture
- `docs/batch-processing.md` - Distributed tracing integration
- `docker/docker-compose.test.yml` - Observability stack configuration
Status: ✅ Production Ready - Two-phase initialization and Phase 1.5 worker span instrumentation patterns implemented and validated with Ruby FFI. Ready for Python and WASM implementations.
Library Deployment Patterns
This document describes the library deployment patterns feature that enables applications to consume worker observability data (health, metrics, templates, configuration) either via the HTTP API or directly through FFI, without running a web server.
Overview
Previously, applications needed to run the worker’s HTTP server to access observability data. This created deployment overhead for applications that only needed programmatic access to health checks, metrics, or template information.
The library deployment patterns feature:
- Extracts observability logic into reusable services - Business logic moved from HTTP handlers to service classes
- Exposes services via FFI - Same functionality available without HTTP overhead
- Provides Ruby wrapper layer - Type-safe Ruby interface with dry-struct types
- Makes HTTP server optional - Services always available, web server is opt-in
Architecture
Service Layer
Four services encapsulate observability logic:
tasker-worker/src/worker/services/
├── health/ # HealthService - health checks
├── metrics/ # MetricsService - metrics collection
├── template_query/ # TemplateQueryService - template operations
└── config_query/ # ConfigQueryService - configuration queries
Each service:
- Contains all business logic previously in HTTP handlers
- Is independent of HTTP transport
- Can be accessed via web handlers OR FFI
- Returns typed response structures
Service Access Patterns
┌─────────────────────────────────────────┐
│ WorkerWebState │
│ ┌────────────────────────────────────┐ │
│ │ Service Instances │ │
│ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │HealthServ.│ │MetricsService │ │ │
│ │ └────────────┘ └────────────────┘ │ │
│ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │TemplQuery │ │ConfigQuery │ │ │
│ │ └────────────┘ └────────────────┘ │ │
│ └────────────────────────────────────┘ │
└──────────────┬───────────────┬──────────┘
│ │
┌───────────────────────┴───┐ ┌─────┴──────────────────────┐
│ HTTP Handlers │ │ FFI Layer │
│ (web/handlers/*.rs) │ │ (observability_ffi.rs) │
└───────────────────────────┘ └────────────────────────────┘
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ HTTP Clients │ │ Ruby/Python │
│ curl, etc. │ │ Applications │
└───────────────┘ └───────────────┘
Usage
Ruby FFI Access
The TaskerCore::Observability module provides type-safe access to all services:
# Health checks
health = TaskerCore::Observability.health_basic
puts health.status # => "healthy"
puts health.worker_id # => "worker-abc123"
# Kubernetes-style probes
if TaskerCore::Observability.ready?
puts "Worker ready to receive requests"
end
if TaskerCore::Observability.alive?
puts "Worker is alive"
end
# Detailed health information
detailed = TaskerCore::Observability.health_detailed
detailed.checks.each do |name, check|
puts "#{name}: #{check.status} (#{check.duration_ms}ms)"
end
Metrics Access
# Domain event statistics
events = TaskerCore::Observability.event_stats
puts "Events routed: #{events.router.total_routed}"
puts "FFI dispatches: #{events.in_process_bus.ffi_channel_dispatches}"
# Prometheus format (for custom scrapers)
prometheus_text = TaskerCore::Observability.prometheus_metrics
Template Operations
# List templates (JSON string)
templates_json = TaskerCore::Observability.templates_list
# Validate a template
validation = TaskerCore::Observability.template_validate(
namespace: "payments",
name: "process_payment",
version: "v1"
)
if validation.valid
puts "Template valid with #{validation.handler_count} handlers"
else
validation.issues.each { |issue| puts "Issue: #{issue}" }
end
# Cache management
stats = TaskerCore::Observability.cache_stats
puts "Cache hits: #{stats.hits}, misses: #{stats.misses}"
TaskerCore::Observability.cache_clear # Clear all cached templates
Configuration Access
# Get runtime configuration (secrets redacted)
config = TaskerCore::Observability.config
puts "Environment: #{config.environment}"
puts "Redacted fields: #{config.metadata.redacted_fields.join(', ')}"
# Quick environment check
env = TaskerCore::Observability.environment
puts "Running in: #{env}" # => "production"
Configuration
HTTP Server Toggle
The HTTP server is now optional. Services are always created, but the HTTP server only starts if enabled:
# config/tasker/base/worker.toml
[worker.web]
enabled = true # Set to false to disable HTTP server
bind_address = "0.0.0.0:8081"
request_timeout_ms = 30000
When enabled = false:
- WorkerWebState is still created (services available)
- HTTP server does NOT start
- All services accessible via FFI only
- Reduces resource usage (no HTTP listener, no connections)
Deployment Modes
| Mode | HTTP Server | FFI Services | Use Case |
|---|---|---|---|
| Full | ✅ | ✅ | Standard deployment with monitoring |
| Library | ❌ | ✅ | Embedded in application, no external access |
| Headless | ❌ | ✅ | Container with external health checks disabled |
Type Definitions
The Ruby wrapper uses dry-struct types for structured access:
Health Types
TaskerCore::Observability::Types::BasicHealth
- status: String
- worker_id: String
- timestamp: String
TaskerCore::Observability::Types::DetailedHealth
- status: String
- timestamp: String
- worker_id: String
- checks: Hash[String, HealthCheck]
- system_info: WorkerSystemInfo
TaskerCore::Observability::Types::HealthCheck
- status: String
- message: String?
- duration_ms: Integer
- last_checked: String
Metrics Types
TaskerCore::Observability::Types::DomainEventStats
- router: EventRouterStats
- in_process_bus: InProcessEventBusStats
- captured_at: String
- worker_id: String
TaskerCore::Observability::Types::EventRouterStats
- total_routed: Integer
- durable_routed: Integer
- fast_routed: Integer
- broadcast_routed: Integer
- fast_delivery_errors: Integer
- routing_errors: Integer
Template Types
TaskerCore::Observability::Types::CacheStats
- total_entries: Integer
- hits: Integer
- misses: Integer
- evictions: Integer
- last_maintenance: String?
TaskerCore::Observability::Types::TemplateValidation
- valid: Boolean
- namespace: String
- name: String
- version: String
- handler_count: Integer
- issues: Array[String]
- handler_metadata: Hash?
Config Types
TaskerCore::Observability::Types::RuntimeConfig
- environment: String
- common: Hash
- worker: Hash
- metadata: ConfigMetadata
TaskerCore::Observability::Types::ConfigMetadata
- timestamp: String
- source: String
- redacted_fields: Array[String]
Error Handling
FFI methods raise RuntimeError on failures:
begin
health = TaskerCore::Observability.health_basic
rescue RuntimeError => e
if e.message.include?("Worker system not running")
# Worker not bootstrapped yet
elsif e.message.include?("Web state not available")
# Services not initialized
end
end
Template Operation Errors
Template operations raise RuntimeError for missing templates or namespaces:
begin
result = TaskerCore::Observability.template_get(
namespace: "unknown",
name: "missing",
version: "1.0.0"
)
rescue RuntimeError => e
puts "Template not found: #{e.message}"
end
# template_refresh handles errors gracefully, returning a result struct
result = TaskerCore::Observability.template_refresh(
namespace: "unknown",
name: "missing",
version: "1.0.0"
)
puts result.success # => false
puts result.message # => error description
Convenience Methods
The ready? and alive? methods handle errors gracefully:
# These never raise - they return false on any error
TaskerCore::Observability.ready? # => true/false
TaskerCore::Observability.alive? # => true/false
Note: alive? checks for status == "alive" (from liveness probe), while ready? checks for status == "healthy" (from readiness probe).
Best Practices
- Use type-safe methods when possible - Methods returning dry-struct types provide better validation
- Handle errors gracefully - FFI can fail if worker not bootstrapped
- Consider caching - For high-frequency health checks, cache results briefly
- Use ready?/alive? helpers - They handle exceptions and return boolean
- Prefer FFI for internal use - Less overhead than HTTP for same-process access
Migration Guide
From HTTP to FFI
Before (HTTP):
response = Faraday.get("http://localhost:8081/health")
health = JSON.parse(response.body)
After (FFI):
health = TaskerCore::Observability.health_basic
Disabling HTTP Server
1. Update configuration:
   [worker.web]
   enabled = false
2. Update health check scripts to use FFI:
   # health_check.rb
   require 'tasker_core'
   exit(TaskerCore::Observability.ready? ? 0 : 1)
3. Update monitoring to scrape via FFI:
   metrics = TaskerCore::Observability.prometheus_metrics
   # Send to Prometheus pushgateway or custom aggregator
API Reference
Health Methods
| Method | Returns | Description |
|---|---|---|
health_basic | Types::BasicHealth | Basic health status |
health_live | Types::BasicHealth | Liveness probe (status: “alive”) |
health_ready | Types::DetailedHealth | Readiness probe with all checks |
health_detailed | Types::DetailedHealth | Full health information |
ready? | Boolean | True if status == “healthy” |
alive? | Boolean | True if status == “alive” |
Metrics Methods
| Method | Returns | Description |
|---|---|---|
metrics_worker | String (JSON) | Worker metrics as JSON |
event_stats | Types::DomainEventStats | Domain event statistics |
prometheus_metrics | String | Prometheus text format |
Template Methods
| Method | Returns | Description |
|---|---|---|
templates_list(include_cache_stats: false) | String (JSON) | List all templates |
template_get(namespace:, name:, version:) | String (JSON) | Get specific template (raises on error) |
template_validate(namespace:, name:, version:) | Types::TemplateValidation | Validate template (raises on error) |
cache_stats | Types::CacheStats | Cache statistics |
cache_clear | Types::CacheOperationResult | Clear template cache |
template_refresh(namespace:, name:, version:) | Types::CacheOperationResult | Refresh specific template |
Config Methods
| Method | Returns | Description |
|---|---|---|
config | Types::RuntimeConfig | Full config (secrets redacted) |
environment | String | Current environment name |
Related Documentation
- Configuration Management - Full configuration reference
- Deployment Patterns - General deployment options
- Observability - Metrics and monitoring
- FFI Telemetry Pattern - FFI logging integration
sccache Configuration Documentation
Overview
This document records our sccache configuration for future reference. Sccache is currently disabled due to GitHub Actions cache service issues, but we plan to re-enable it once the service is stable.
Current Status
🚫 DISABLED - Temporarily disabled due to GitHub Actions cache service issues:
sccache: error: Server startup failed: cache storage failed to read: Unexpected (permanent) at read => <h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p>
Planned Configuration
Environment Variables (setup-env action)
RUSTC_WRAPPER=sccache
SCCACHE_GHA_ENABLED=true
SCCACHE_CACHE_SIZE=2G # For Docker builds
GitHub Actions Integration
Workflows Using sccache
- code-quality.yml - Build caching for clippy and rustfmt
- test-unit.yml - Build caching for unit tests
- test-integration.yml - Build caching for integration tests
Action Configuration
- uses: mozilla-actions/sccache-action@v0.0.4
Expected Benefits
- 50%+ faster builds through compilation caching
- Reduced CI costs by avoiding redundant compilation
- Better developer experience with faster feedback loops
Performance Targets
- Build cache hit rate: Target > 80%
- Compilation time reduction: 50%+ on cache hits
- Total CI time: Reduce by 10-20 minutes per run
Local Development Setup
For local development when sccache is working:
# Install sccache
cargo binstall sccache -y
# Set environment variables
export RUSTC_WRAPPER=sccache
export SCCACHE_GHA_ENABLED=true
# Check stats
sccache --show-stats
# Zero the stats counters if needed
sccache --zero-stats
Re-enabling Steps
When GitHub Actions cache service is stable:
1. Re-enable in workflows:
   - Uncomment mozilla-actions/sccache-action@v0.0.4 in workflows
   - Restore sccache environment variables in setup-env action
2. Test with minimal workflow first:
   - Start with code-quality.yml
   - Monitor for cache service issues
   - Gradually enable in other workflows
3. Monitor performance:
   - Track build times before/after
   - Monitor cache hit rates
   - Watch for any new cache service errors
Configuration Locations
Files containing sccache configuration:
- .github/actions/setup-env/action.yml - Environment variables
- .github/workflows/code-quality.yml - Action usage
- .github/workflows/test-unit.yml - Action usage
- .github/workflows/test-integration.yml - Action usage
- docs/sccache-configuration.md - This documentation
Docker Integration
For Docker builds, pass sccache variables as build args:
build-args: |
SCCACHE_GHA_ENABLED=true
RUSTC_WRAPPER=sccache
SCCACHE_CACHE_SIZE=2G
Troubleshooting
Common Issues
- Cache service unavailable: Wait for GitHub to restore service
- Cache misses: Check RUSTC_WRAPPER is set correctly
- Permission errors: Ensure sccache action has proper permissions
Monitoring
- Check sccache --show-stats for cache effectiveness
- Monitor CI run times for performance improvements
- Watch GitHub status page for cache service updates
StepContext API Reference
StepContext is the primary data access object for step handlers across all languages in the Tasker worker ecosystem. It provides a consistent interface for accessing task inputs, dependency results, configuration, and checkpoint data.
Overview
Every step handler receives a StepContext (or TaskSequenceStep in Rust) that contains:
- Task context - Input data for the workflow (JSONB from task.context)
- Dependency results - Results from upstream DAG steps
- Step configuration - Handler-specific settings from the template
- Checkpoint data - Batch processing state for resumability
- Retry information - Current attempt count and max retries
Cross-Language API Reference
Core Data Access
| Operation | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Get task input | get_input::<T>("key")? | get_input("key") | get_input("key") | getInput("key") |
| Get input with default | get_input_or("key", default) | get_input_or("key", default) | get_input_or("key", default) | getInputOr("key", default) |
| Get config value | get_config::<T>("key")? | get_config("key") | get_config("key") | getConfig("key") |
| Get dependency result | get_dependency_result_column_value::<T>("step")? | get_dependency_result("step") | get_dependency_result("step") | getDependencyResult("step") |
| Get nested dependency field | get_dependency_field::<T>("step", &["path"])? | get_dependency_field("step", *path) | get_dependency_field("step", *path) | getDependencyField("step", ...path) |
Retry Helpers
| Operation | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Check if retry | is_retry() | is_retry? | is_retry() | isRetry() |
| Check if last retry | is_last_retry() | is_last_retry? | is_last_retry() | isLastRetry() |
| Get retry count | retry_count() | retry_count | retry_count | retryCount |
| Get max retries | max_retries() | max_retries | max_retries | maxRetries |
Checkpoint Access
| Operation | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Get raw checkpoint | checkpoint() | checkpoint | checkpoint | checkpoint |
| Get cursor | checkpoint_cursor::<T>() | checkpoint_cursor | checkpoint_cursor | checkpointCursor |
| Get items processed | checkpoint_items_processed() | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed |
| Get accumulated results | accumulated_results::<T>() | accumulated_results | accumulated_results | accumulatedResults |
| Check has checkpoint | has_checkpoint() | has_checkpoint? | has_checkpoint() | hasCheckpoint() |
Standard Fields
| Field | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Task UUID | task.task.task_uuid | task_uuid | task_uuid | taskUuid |
| Step UUID | workflow_step.workflow_step_uuid | step_uuid | step_uuid | stepUuid |
| Correlation ID | task.task.correlation_id | task.correlation_id | correlation_id | correlationId |
| Input data (raw) | task.task.context | input_data / context | input_data | inputData |
| Step config (raw) | step_definition.handler.initialization | step_config | step_config | stepConfig |
Usage Examples
Rust
use tasker_shared::types::base::TaskSequenceStep;
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// Get task input
let order_id: String = step_data.get_input("order_id")?;
let batch_size: i32 = step_data.get_input_or("batch_size", 100);
// Get config
let api_url: String = step_data.get_config("api_url")?;
// Get dependency result
let validation_result: ValidationResult = step_data.get_dependency_result_column_value("validate")?;
// Extract nested field from dependency
let item_count: i32 = step_data.get_dependency_field("process", &["stats", "count"])?;
// Check retry status
if step_data.is_retry() {
println!("Retry attempt {}", step_data.retry_count());
}
// Resume from checkpoint
let cursor: Option<i64> = step_data.checkpoint_cursor();
let start_from = cursor.unwrap_or(0);
// ... handler logic ...
}
Ruby
def call(context)
# Get task input
order_id = context.get_input('order_id')
batch_size = context.get_input_or('batch_size', 100)
# Get config
api_url = context.get_config('api_url')
# Get dependency result
validation_result = context.get_dependency_result('validate')
# Extract nested field from dependency
item_count = context.get_dependency_field('process', 'stats', 'count')
# Check retry status
if context.is_retry?
logger.info("Retry attempt #{context.retry_count}")
end
# Resume from checkpoint
start_from = context.checkpoint_cursor || 0
# ... handler logic ...
end
Python
def call(self, context: StepContext) -> StepHandlerResult:
# Get task input
order_id = context.get_input("order_id")
batch_size = context.get_input_or("batch_size", 100)
# Get config
api_url = context.get_config("api_url")
# Get dependency result
validation_result = context.get_dependency_result("validate")
# Extract nested field from dependency
item_count = context.get_dependency_field("process", "stats", "count")
# Check retry status
if context.is_retry():
print(f"Retry attempt {context.retry_count}")
# Resume from checkpoint
start_from = context.checkpoint_cursor or 0
# ... handler logic ...
TypeScript
async call(context: StepContext): Promise<StepHandlerResult> {
// Get task input
const orderId = context.getInput<string>('order_id');
const batchSize = context.getInputOr('batch_size', 100);
// Get config
const apiUrl = context.getConfig<string>('api_url');
// Get dependency result
const validationResult = context.getDependencyResult('validate');
// Extract nested field from dependency
const itemCount = context.getDependencyField('process', 'stats', 'count');
// Check retry status
if (context.isRetry()) {
console.log(`Retry attempt ${context.retryCount}`);
}
// Resume from checkpoint
const startFrom = context.checkpointCursor ?? 0;
// ... handler logic ...
}
Checkpoint Usage Guide
Checkpoints enable resumable batch processing. When a handler processes large datasets, it can save progress via checkpoints and resume from where it left off on retry.
Checkpoint Fields
- cursor - Position marker (can be int, string, or object)
- items_processed - Count of items completed
- accumulated_results - Running totals or aggregated state
Reading Checkpoints
# Python example
def call(self, context: StepContext) -> StepHandlerResult:
# Check if resuming from checkpoint
if context.has_checkpoint():
cursor = context.checkpoint_cursor
items_done = context.checkpoint_items_processed
totals = context.accumulated_results or {}
print(f"Resuming from cursor {cursor}, {items_done} items done")
else:
cursor = 0
items_done = 0
totals = {}
# Process from cursor position...
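The same resume pattern in Rust, mirroring the Python example and using the checkpoint accessors from the tables above. This is a sketch; the Option-typed returns are assumptions based on the note that accessors return None when data is missing, and the surrounding handler boilerplate is omitted.
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    if step_data.has_checkpoint() {
        // Resuming: pick up where the previous attempt left off
        let cursor: Option<i64> = step_data.checkpoint_cursor();
        let start_from = cursor.unwrap_or(0);
        let items_done = step_data.checkpoint_items_processed();
        // Running totals accumulated by earlier attempts (shape is handler-defined)
        let totals: Option<serde_json::Value> = step_data.accumulated_results();
        // ... continue processing from `start_from` ...
    } else {
        // First attempt: start from the beginning with empty totals
        // ... process from position 0 ...
    }
    // ... build and return the step result ...
}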
Writing Checkpoints
Checkpoints are written by including checkpoint data in the handler result metadata. See the batch processing documentation for details on the checkpoint yield pattern.
Notes
- All accessor methods handle missing data gracefully (return None/null or use defaults)
- Dependency results are automatically unwrapped from the {"result": value} envelope
- Type conversion is handled automatically where supported (Rust, TypeScript generics)
- Checkpoint data is persisted atomically by the CheckpointService
Table Management and Growth Strategies
Last Updated: 2026-01-10 Status: Active Recommendation
Problem Statement
In high-throughput workflow orchestration systems, the core task tables (tasks, workflow_steps, task_transitions, workflow_step_transitions) can grow to millions of rows over time. Without proper management, this growth can lead to:
Note: All tables reside in the tasker schema with simplified names (e.g., tasks instead of tasker_tasks). With search_path = tasker, public, queries use unqualified table names.
- Query Performance Degradation: Even with proper indexes, very large tables require more I/O operations
- Maintenance Overhead: VACUUM, ANALYZE, and index maintenance become increasingly expensive
- Backup/Recovery Challenges: Larger tables increase backup windows and recovery times
- Storage Costs: Historical data that’s rarely accessed still consumes storage resources
Existing Performance Mitigations
The tasker-core system employs several strategies to maintain query performance even with large tables:
1. Strategic Indexing
Covering Indexes for Hot Paths
The most critical indexes use PostgreSQL’s INCLUDE clause to create covering indexes that satisfy queries without table lookups:
Active Task Processing (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Covering index for active task queries with priority sorting
CREATE INDEX IF NOT EXISTS idx_tasks_active_with_priority_covering
ON tasks (complete, priority, task_uuid)
INCLUDE (named_task_uuid, requested_at)
WHERE complete = false;
Impact: Task discovery queries can be satisfied entirely from the index without accessing the main table.
Step Readiness Processing (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Covering index for step readiness queries
CREATE INDEX IF NOT EXISTS idx_workflow_steps_ready_covering
ON workflow_steps (task_uuid, processed, in_process)
INCLUDE (workflow_step_uuid, attempts, max_attempts, retryable)
WHERE processed = false;
-- Covering index for task-based step grouping
CREATE INDEX IF NOT EXISTS idx_workflow_steps_task_covering
ON workflow_steps (task_uuid)
INCLUDE (workflow_step_uuid, processed, in_process, attempts, max_attempts);
Impact: Step dependency resolution and retry logic queries avoid heap lookups.
Transitive Dependency Optimization (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Covering index for transitive dependency traversal
CREATE INDEX IF NOT EXISTS idx_workflow_steps_transitive_deps
ON workflow_steps (workflow_step_uuid, named_step_uuid)
INCLUDE (task_uuid, results, processed);
Impact: DAG traversal operations can read all needed columns from the index.
State Transition Lookups (Partial Indexes)
Current State Resolution (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Fast current state resolution (only indexes most_recent = true)
CREATE INDEX IF NOT EXISTS idx_task_transitions_state_lookup
ON task_transitions (task_uuid, to_state, most_recent)
WHERE most_recent = true;
CREATE INDEX IF NOT EXISTS idx_workflow_step_transitions_state_lookup
ON workflow_step_transitions (workflow_step_uuid, to_state, most_recent)
WHERE most_recent = true;
Impact: State lookups index only current state, not full audit history. Reduces index size by >90%.
Correlation and Tracing Indexes
Distributed Tracing Support (migrations/tasker/20251007000000_add_correlation_ids.sql):
-- Primary correlation ID lookups
CREATE INDEX IF NOT EXISTS idx_tasks_correlation_id
ON tasks(correlation_id);
-- Hierarchical workflow traversal (parent-child relationships)
CREATE INDEX IF NOT EXISTS idx_tasks_correlation_hierarchy
ON tasks(parent_correlation_id, correlation_id)
WHERE parent_correlation_id IS NOT NULL;
Impact: Enables efficient distributed tracing and workflow hierarchy queries.
Processor Ownership and Monitoring
Processor Tracking (migrations/tasker/20250912000000_tas41_richer_task_states.sql):
-- Index for processor ownership queries (audit trail only, enforcement removed)
CREATE INDEX IF NOT EXISTS idx_task_transitions_processor
ON task_transitions(processor_uuid)
WHERE processor_uuid IS NOT NULL;
-- Index for timeout monitoring using JSONB metadata
CREATE INDEX IF NOT EXISTS idx_task_transitions_timeout
ON task_transitions((transition_metadata->>'timeout_at'))
WHERE most_recent = true;
Impact: Enables processor-level debugging and timeout monitoring. Processor ownership enforcement was removed but the audit trail is preserved.
Dependency Graph Navigation
Step Edges for DAG Operations (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Parent-to-child navigation for dependency resolution
CREATE INDEX IF NOT EXISTS idx_workflow_step_edges_from_step
ON workflow_step_edges (from_step_uuid);
-- Child-to-parent navigation for completion propagation
CREATE INDEX IF NOT EXISTS idx_workflow_step_edges_to_step
ON workflow_step_edges (to_step_uuid);
Impact: Bidirectional DAG traversal for readiness checks and completion propagation.
2. Partial Indexes
Many indexes use WHERE clauses to index only active/relevant rows:
-- Only index tasks that are actively being processed
WHERE current_state IN ('pending', 'initializing', 'steps_in_process')
-- Only index the current state transition
WHERE most_recent = true
This significantly reduces index size and maintenance overhead while keeping lookups fast.
3. SQL Function Optimizations
Complex orchestration queries are implemented as PostgreSQL functions that leverage:
- Lateral Joins: For efficient correlated subqueries
- CTEs with Materialization: For complex dependency analysis
- Targeted Filtering: Early elimination of irrelevant rows using index scans
Example from get_next_ready_tasks():
-- First filter to active tasks with priority sorting (uses index)
WITH prioritized_tasks AS (
SELECT task_uuid, priority
FROM tasks
WHERE current_state IN ('pending', 'steps_in_process')
ORDER BY priority DESC, created_at ASC
LIMIT $1 * 2 -- Get more candidates than needed for filtering
)
-- Then apply complex staleness/readiness checks only on candidates
...
4. Staleness Exclusion
The system automatically excludes stale tasks from active processing queues:
- Tasks stuck in waiting_for_dependencies for more than 60 minutes
- Tasks stuck in waiting_for_retry for more than 30 minutes
- Tasks with lifecycle timeouts exceeded
This prevents the active query set from growing indefinitely, even if old tasks aren’t archived.
Archive-and-Delete Strategy (Considered, Not Implemented)
What We Considered
We initially designed an archive-and-delete strategy:
Architecture:
- Mirror tables: tasker.archived_tasks, tasker.archived_workflow_steps, tasker.archived_task_transitions, tasker.archived_workflow_step_transitions
- Background service running every 24 hours
- Batch processing: 1000 tasks per run
- Transactional archival: INSERT into archive tables → DELETE from main tables
- Retention policies: Configurable per task state (completed, error, cancelled)
Implementation Details:
// Archive tasks in terminal states older than retention period
pub async fn archive_completed_tasks(
pool: &PgPool,
retention_days: i32,
batch_size: i32,
) -> Result<ArchiveStats> {
// 1. INSERT INTO archived_tasks SELECT * FROM tasks WHERE ...
// 2. INSERT INTO archived_workflow_steps SELECT * WHERE task_uuid IN (...)
// 3. INSERT INTO archived_task_transitions SELECT * WHERE task_uuid IN (...)
// 4. DELETE FROM workflow_step_transitions WHERE ...
// 5. DELETE FROM task_transitions WHERE ...
// 6. DELETE FROM workflow_steps WHERE ...
// 7. DELETE FROM tasks WHERE ...
}
Why We Decided Against It
After implementation and analysis, we identified critical performance issues:
1. Write Amplification
Every archived task results in:
- 2× writes per row: INSERT into archive table + original row still exists until DELETE
- 1× delete per row: DELETE from main table triggers index updates
- Cascade costs: Foreign key relationships require multiple DELETE operations in sequence
For a system processing 100,000 tasks/day with 30-day retention:
- Daily archival: ~100,000 tasks × 2 write operations = 200,000 write I/Os
- Plus associated workflow_steps (typically 5-10 per task): 500,000-1,000,000 additional writes
2. Index Maintenance Overhead
PostgreSQL must maintain indexes during both INSERT and DELETE operations:
During INSERT to archive tables:
- Build index entries for all archive table indexes
- Update statistics for query planner
During DELETE from main tables:
- Mark deleted tuples in main table indexes
- Update free space maps
- Trigger VACUUM requirements
Result: Periodic severe degradation (2-5 seconds) during archival runs, even with batch processing.
3. Lock Contention
Large DELETE operations require:
- Row-level locks on deleted rows
- Table-level locks during index updates
- Lock escalation risk with large batch sizes
This creates a “stop-the-world” effect where active task processing is blocked during archival.
4. VACUUM Pressure
Frequent large DELETEs create dead tuples that require aggressive VACUUMing:
- Increases I/O load during off-hours
- Can’t be fully eliminated even with proper tuning
- Competes with active workload for resources
5. The “Garbage Collector” Anti-Pattern
The archive-and-delete strategy essentially implements a manual garbage collector:
- Periodic runs with performance impact
- Tuning trade-offs (frequency vs. batch size vs. impact)
- Operational complexity (monitoring, alerting, recovery)
Recommended Solution: PostgreSQL Native Partitioning
Overview
PostgreSQL’s native table partitioning with pg_partman provides zero-runtime-cost table management:
Key Advantages:
- No write amplification: Data stays in place, partitions are logical divisions
- No DELETE operations: Old partitions are DETACHed and dropped as units
- Instant partition drops: Dropping a partition is O(1), not O(rows)
- Transparent to application: Queries work identically on partitioned tables
- Battle-tested: Used by pgmq (our queue infrastructure) and thousands of production systems
How It Works
-- 1. Create partitioned parent table (in tasker schema)
CREATE TABLE tasker.tasks (
task_uuid UUID NOT NULL,
created_at TIMESTAMP NOT NULL,
-- ... other columns
) PARTITION BY RANGE (created_at);
-- 2. pg_partman automatically creates child partitions
-- tasker.tasks_p2025_01 (Jan 2025)
-- tasker.tasks_p2025_02 (Feb 2025)
-- tasker.tasks_p2025_03 (Mar 2025)
-- ... etc
-- 3. Queries transparently use appropriate partitions
SELECT * FROM tasks WHERE task_uuid = $1;
-- → PostgreSQL automatically queries correct partition
-- 4. Dropping old partitions is instant
ALTER TABLE tasker.tasks DETACH PARTITION tasker.tasks_p2024_12;
DROP TABLE tasker.tasks_p2024_12; -- Instant, no row-by-row deletion
Performance Characteristics
| Operation | Archive-and-Delete | Native Partitioning |
|---|---|---|
| Write path | INSERT + DELETE (2× I/O) | INSERT only (1× I/O) |
| Index maintenance | On INSERT + DELETE | On INSERT only |
| Lock contention | Row locks during DELETE | No locks for drops |
| VACUUM pressure | High (dead tuples) | None (partition drops) |
| Old data removal | O(rows) per deletion | O(1) partition detach |
| Query performance | Scans entire table | Partition pruning |
| Runtime impact | Periodic degradation | Zero |
Implementation with pg_partman
Installation
CREATE EXTENSION pg_partman;
Setup for tasks
-- 1. Create partitioned table structure
-- (Include all existing columns and indexes)
-- 2. Initialize pg_partman for monthly partitions
SELECT partman.create_parent(
p_parent_table := 'tasker.tasks',
p_control := 'created_at',
p_type := 'native',
p_interval := 'monthly',
p_premake := 3 -- Pre-create 3 future months
);
-- 3. Configure retention (keep 90 days)
UPDATE partman.part_config
SET retention = '90 days',
retention_keep_table = false -- Drop old partitions entirely
WHERE parent_table = 'tasker.tasks';
-- 4. Enable automatic maintenance
SELECT partman.run_maintenance(p_parent_table := 'tasker.tasks');
Automation
Add to cron or pg_cron:
-- Run maintenance every hour
SELECT cron.schedule('partman-maintenance', '0 * * * *',
$$SELECT partman.run_maintenance()$$
);
This automatically:
- Creates new partitions before they’re needed
- Detaches and drops partitions older than retention period
- Updates partition constraints for query optimization
Real-World Example: pgmq
The pgmq message queue system (which tasker-core uses for orchestration) implements partitioned queues for high-throughput scenarios:
Reference: pgmq Partitioned Queues
pgmq’s Rationale (from their docs):
“For very high-throughput queues, you may want to partition the queue table by time. This allows you to drop old partitions instead of deleting rows, which is much faster and doesn’t cause table bloat.”
pgmq’s Approach:
-- pgmq uses pg_partman for message queues
SELECT pgmq.create_partitioned(
queue_name := 'high_throughput_queue',
partition_interval := '1 day',
retention_interval := '7 days'
);
Benefits They Report:
- 10× faster old message cleanup vs. DELETE
- Zero bloat from message deletion
- Consistent performance even at millions of messages per day
Applying to Tasker: Our use case is nearly identical to pgmq:
- High-throughput append-heavy workload
- Time-series data (created_at is natural partition key)
- Need to retain recent data, drop old data
- Performance-critical read path
If pgmq chose partitioning over archive-and-delete for these reasons, we should too.
Migration Path
Phase 1: Analysis (Current State)
Before implementing partitioning:
- Analyze Current Growth Rate:
SELECT
pg_size_pretty(pg_total_relation_size('tasker.tasks')) as total_size,
count(*) as row_count,
min(created_at) as oldest_task,
max(created_at) as newest_task,
count(*) / EXTRACT(day FROM (max(created_at) - min(created_at))) as avg_tasks_per_day
FROM tasks;
- Determine Partition Strategy:
- Daily partitions: For > 1M tasks/day
- Weekly partitions: For 100K-1M tasks/day
- Monthly partitions: For < 100K tasks/day
- Plan Retention Period:
- Legal/compliance requirements
- Analytics/reporting needs
- Typical task investigation window
Phase 2: Implementation
- Create Partitioned Tables (requires downtime or blue-green deployment)
- Migrate Existing Data using pg_partman.partition_data_proc()
- Update Application (no code changes needed if using same table names)
- Configure Automation (pg_cron for maintenance)
Phase 3: Monitoring
Track partition management effectiveness:
-- Check partition sizes
SELECT
schemaname || '.' || tablename as partition_name,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname = 'tasker' AND tablename LIKE 'tasks_p%'
ORDER BY tablename;
-- Verify partition pruning is working
EXPLAIN SELECT * FROM tasks
WHERE created_at > NOW() - INTERVAL '7 days';
-- Should show: "Seq Scan on tasker.tasks_p2025_11" (only current partition)
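If you prefer to surface partition sizes from application code rather than psql, the same query can be run through sqlx. This is a minimal sketch, assuming an existing PgPool; the helper is not part of tasker-core.
use sqlx::PgPool;

// Illustrative helper: list tasks partitions and their pretty-printed on-disk sizes.
async fn partition_sizes(pool: &PgPool) -> Result<Vec<(String, String)>, sqlx::Error> {
    sqlx::query_as::<_, (String, String)>(
        "SELECT schemaname || '.' || tablename AS partition_name,
                pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
         FROM pg_tables
         WHERE schemaname = 'tasker' AND tablename LIKE 'tasks_p%'
         ORDER BY tablename",
    )
    .fetch_all(pool)
    .await
}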
Decision Summary
Decision: Use PostgreSQL native partitioning with pg_partman for table growth management.
Rationale:
- Zero runtime performance impact vs. periodic degradation with archive-and-delete
- Operationally simpler (set-and-forget vs. monitoring archive jobs)
- Battle-tested solution used by pgmq and thousands of production systems
- Aligns with PostgreSQL best practices and community recommendations
Not Recommended: Archive-and-delete strategy due to write amplification, lock contention, and periodic performance degradation.
See Also
- States and Lifecycles - Task and step state management
- Task and Step Readiness - SQL function optimizations
- Observability README - Monitoring table growth and query performance
Task and Step Readiness and Execution
Last Updated: 2026-01-10 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | States and Lifecycles | Events and Commands
← Back to Documentation Hub
This document provides comprehensive documentation of the SQL functions and database logic that drives task and step readiness analysis, dependency resolution, and execution coordination in the tasker-core system.
Overview
The tasker-core system relies heavily on sophisticated PostgreSQL functions to perform complex workflow orchestration operations at the database level. This approach provides significant performance benefits through set-based operations, atomic transactions, and reduced network round trips while maintaining data consistency.
The SQL function system supports several critical categories of operations:
- Step Readiness Analysis: Complex dependency resolution and backoff calculations
- DAG Operations: Cycle detection, depth calculation, and parallel execution discovery
- State Management: Atomic state transitions with processor ownership tracking
- Analytics and Monitoring: Performance metrics and system health analysis
- Task Execution Context: Comprehensive execution metadata and results management
SQL Function Architecture
Function Categories
The SQL functions are organized into logical categories as defined in
tasker-shared/src/database/sql_functions.rs:
1. Step Readiness Analysis
- get_step_readiness_status(task_uuid, step_uuids[]): Comprehensive dependency analysis
- calculate_backoff_delay(attempts, base_delay): Exponential backoff calculation
- check_step_dependencies(step_uuid): Parent completion validation
- get_ready_steps(task_uuid): Parallel execution candidate discovery
2. DAG Operations
- detect_cycle(from_step_uuid, to_step_uuid): Cycle detection using recursive CTEs
- calculate_dependency_levels(task_uuid): Topological depth calculation
- calculate_step_depth(step_uuid): Individual step depth analysis
- get_step_transitive_dependencies(step_uuid): Full dependency tree traversal
3. State Management
- transition_task_state_atomic(task_uuid, from_state, to_state, processor_uuid): Atomic state transitions with ownership
- get_current_task_state(task_uuid): Current task state resolution
- finalize_task_completion(task_uuid): Task completion orchestration
4. Analytics and Monitoring
- get_analytics_metrics(since_timestamp): Comprehensive system analytics
- get_system_health_counts(): System-wide health and performance metrics
- get_slowest_steps(limit): Performance optimization analysis
- get_slowest_tasks(limit): Task performance analysis
5. Task Discovery and Execution
- get_next_ready_task(): Single task discovery for orchestration
- get_next_ready_tasks(limit): Batch task discovery for scaling
- get_task_ready_info(task_uuid): Detailed task readiness information
- get_task_execution_context(task_uuid): Complete execution metadata
Database Schema Foundation
Core Tables
The SQL functions operate on a comprehensive schema designed for UUID v7
performance and scalability. All tables reside in the tasker schema
with simplified names. With search_path = tasker, public, queries use unqualified
table names.
Primary Tables
- tasks: Main workflow instances with UUID v7 primary keys
- workflow_steps: Individual workflow steps with dependency relationships
- task_transitions: Task state change audit trail with processor tracking
- workflow_step_transitions: Step state change audit trail
Registry Tables
- task_namespaces: Workflow namespace definitions
- named_tasks: Task type templates and metadata
- named_steps: Step type definitions and handlers
- workflow_step_edges: Step dependency relationships (DAG structure)
Richer Task State Enhancements
The richer task states migration (migrations/tasker/20251209000000_tas41_richer_task_states.sql) enhanced the
schema with:
Task State Management:
-- 12 comprehensive task states
ALTER TABLE task_transitions
ADD CONSTRAINT chk_task_transitions_to_state
CHECK (to_state IN (
'pending', 'initializing', 'enqueuing_steps', 'steps_in_process',
'evaluating_results', 'waiting_for_dependencies', 'waiting_for_retry',
'blocked_by_failures', 'complete', 'error', 'cancelled', 'resolved_manually'
));
Processor Ownership Tracking:
ALTER TABLE task_transitions
ADD COLUMN processor_uuid UUID,
ADD COLUMN transition_metadata JSONB DEFAULT '{}';
Atomic State Transitions:
CREATE OR REPLACE FUNCTION transition_task_state_atomic(
p_task_uuid UUID,
p_from_state VARCHAR,
p_to_state VARCHAR,
p_processor_uuid UUID,
p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN
Step Readiness Analysis
Recent Enhancements
WaitingForRetry State Support (Migration 20250927000000)
The step readiness system was enhanced to support the new WaitingForRetry state, which distinguishes retryable failures from permanent errors:
Key Changes:
- Helper Functions: Added calculate_step_next_retry_time() and evaluate_step_state_readiness() for consistent backoff logic
- State Recognition: Updated readiness evaluation to treat waiting_for_retry as a ready-eligible state alongside pending
- Backoff Calculation: Centralized exponential backoff logic with configurable backoff periods
- Performance Optimization: Introduced task-scoped CTEs to eliminate table scans for batch operations
Semantic Impact:
- Before: the error state included both retryable and permanent failures
- After: error = permanent failures only, waiting_for_retry = awaiting backoff before retry
Backoff Logic Consolidation (October 2025)
The backoff calculation system was consolidated to eliminate configuration conflicts and race conditions:
Key Changes:
- Configuration Alignment: Single source of truth (TOML config) with max_backoff_seconds = 60
- Parameterized SQL Functions: calculate_step_next_retry_time() accepts configurable max delay and multiplier
- Atomic Updates: Row-level locking prevents concurrent backoff update conflicts
- Timing Consistency: last_attempted_at updated atomically with backoff_request_seconds
Issues Resolved:
- Configuration Conflicts: Eliminated three conflicting max values (30s SQL, 60s code, 300s TOML)
- Race Conditions: Added SELECT FOR UPDATE locking in BackoffCalculator
- Hardcoded Values: Removed hardcoded 30-second cap and power(2, attempts) in SQL
Helper Functions Enhanced:
- calculate_step_next_retry_time(): Now parameterized with configuration values (see the sketch after this list)

  CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
      backoff_request_seconds INTEGER,
      last_attempted_at TIMESTAMP,
      failure_time TIMESTAMP,
      attempts INTEGER,
      p_max_backoff_seconds INTEGER DEFAULT 60,
      p_backoff_multiplier NUMERIC DEFAULT 2.0
  ) RETURNS TIMESTAMP

  - Respects custom backoff periods from step configuration (primary path)
  - Falls back to exponential backoff with configurable parameters
  - Defaults aligned with TOML config (60s max, 2.0 multiplier)
  - Used consistently across all readiness evaluation

- set_step_backoff_atomic(): New atomic update function

  CREATE OR REPLACE FUNCTION set_step_backoff_atomic(
      p_step_uuid UUID,
      p_backoff_seconds INTEGER
  ) RETURNS BOOLEAN

  - Provides transactional guarantee for concurrent updates
  - Updates both backoff_request_seconds and last_attempted_at
  - Ensures timing consistency with SQL calculations

- evaluate_step_state_readiness(): Determines if a step is ready for execution

  CREATE OR REPLACE FUNCTION evaluate_step_state_readiness(
      current_state TEXT,
      processed BOOLEAN,
      in_process BOOLEAN,
      dependencies_satisfied BOOLEAN,
      retry_eligible BOOLEAN,
      retryable BOOLEAN,
      next_retry_time TIMESTAMP
  ) RETURNS BOOLEAN

  - Recognizes both pending and waiting_for_retry as ready-eligible states
  - Validates backoff period has expired before allowing retry
  - Ensures dependencies are satisfied and retry limits not exceeded
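To make the parameterization concrete, the sketch below calls the helper directly via sqlx with the TOML-aligned defaults (60s cap, 2.0 multiplier). This is illustration only; the engine invokes the helper internally during readiness evaluation, and the sketch assumes sqlx with the chrono feature enabled.
use chrono::NaiveDateTime;
use sqlx::PgPool;

// Illustrative only: ask the database when a failed step becomes retry-eligible,
// passing the configured cap and multiplier explicitly.
async fn next_retry_time(
    pool: &PgPool,
    attempts: i32,
    last_attempted_at: NaiveDateTime,
) -> Result<Option<NaiveDateTime>, sqlx::Error> {
    sqlx::query_scalar(
        "SELECT calculate_step_next_retry_time($1, $2, $3, $4, $5, $6::numeric)",
    )
    .bind(None::<i32>)       // backoff_request_seconds: no custom backoff from step config
    .bind(last_attempted_at) // last_attempted_at
    .bind(last_attempted_at) // failure_time (same timestamp in this sketch)
    .bind(attempts)          // attempts so far
    .bind(60)                // p_max_backoff_seconds, aligned with the TOML default
    .bind(2.0_f64)           // p_backoff_multiplier
    .fetch_one(pool)
    .await
}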
Step Readiness Status
The get_step_readiness_status function provides comprehensive analysis of step
execution eligibility:
CREATE OR REPLACE FUNCTION get_step_readiness_status(
task_uuid UUID,
step_uuids UUID[] DEFAULT NULL
) RETURNS TABLE(
workflow_step_uuid UUID,
task_uuid UUID,
named_step_uuid UUID,
name VARCHAR,
current_state VARCHAR,
dependencies_satisfied BOOLEAN,
retry_eligible BOOLEAN,
ready_for_execution BOOLEAN,
last_failure_at TIMESTAMP,
next_retry_at TIMESTAMP,
total_parents INTEGER,
completed_parents INTEGER,
attempts INTEGER,
retry_limit INTEGER,
backoff_request_seconds INTEGER,
last_attempted_at TIMESTAMP
)
Key Analysis Features
Dependency Satisfaction:
- Validates all parent steps are in complete or resolved_manually states
- Handles complex DAG structures with multiple dependency paths
- Supports conditional dependencies based on parent results
Retry Logic:
- Exponential backoff calculation: 2^attempts seconds (max 30)
- Custom backoff periods from step configuration
- Retry limit enforcement to prevent infinite loops
- Failure tracking with temporal analysis
Execution Readiness:
- State validation (must be pending or error)
- Dependency satisfaction confirmation
- Retry eligibility assessment
- Backoff period expiration checking
Step Readiness Implementation
The Rust integration provides type-safe access to step readiness analysis:
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct StepReadinessStatus {
pub workflow_step_uuid: Uuid,
pub task_uuid: Uuid,
pub named_step_uuid: Uuid,
pub name: String,
pub current_state: String,
pub dependencies_satisfied: bool,
pub retry_eligible: bool,
pub ready_for_execution: bool,
pub last_failure_at: Option<NaiveDateTime>,
pub next_retry_at: Option<NaiveDateTime>,
pub total_parents: i32,
pub completed_parents: i32,
pub attempts: i32,
pub retry_limit: i32,
pub backoff_request_seconds: Option<i32>,
pub last_attempted_at: Option<NaiveDateTime>,
}
impl StepReadinessStatus {
pub fn can_execute_now(&self) -> bool {
self.ready_for_execution
}
pub fn blocking_reason(&self) -> Option<&'static str> {
if !self.dependencies_satisfied {
return Some("dependencies_not_satisfied");
}
if !self.retry_eligible {
return Some("retry_not_eligible");
}
Some("invalid_state")
}
pub fn effective_backoff_seconds(&self) -> i32 {
self.backoff_request_seconds.unwrap_or_else(|| {
if self.attempts > 0 {
std::cmp::min(2_i32.pow(self.attempts as u32), 30)
} else {
0
}
})
}
}
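As a consumer-side illustration, the helpers above make it easy to split a task's readiness rows into dispatchable and blocked sets. This is a sketch; how the rows are fetched is left to the surrounding code.
// Sketch: given readiness rows for a task, separate steps that can be dispatched
// now from blocked steps, attaching a coarse reason label for logs and metrics.
fn partition_by_readiness(
    statuses: &[StepReadinessStatus],
) -> (Vec<&StepReadinessStatus>, Vec<(&StepReadinessStatus, &'static str)>) {
    let mut ready = Vec::new();
    let mut blocked = Vec::new();
    for status in statuses {
        if status.can_execute_now() {
            ready.push(status);
        } else {
            blocked.push((status, status.blocking_reason().unwrap_or("unknown")));
        }
    }
    (ready, blocked)
}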
DAG Operations and Dependency Resolution
Dependency Level Calculation
The calculate_dependency_levels function uses recursive CTEs to perform
topological analysis of the workflow DAG:
CREATE OR REPLACE FUNCTION calculate_dependency_levels(input_task_uuid UUID)
RETURNS TABLE(workflow_step_uuid UUID, dependency_level INTEGER)
LANGUAGE plpgsql STABLE AS $$
BEGIN
RETURN QUERY
WITH RECURSIVE dependency_levels AS (
-- Base case: Find root nodes (steps with no dependencies)
SELECT
ws.workflow_step_uuid,
0 as level
FROM workflow_steps ws
WHERE ws.task_uuid = input_task_uuid
AND NOT EXISTS (
SELECT 1
FROM workflow_step_edges wse
WHERE wse.to_step_uuid = ws.workflow_step_uuid
)
UNION ALL
-- Recursive case: Find children of current level nodes
SELECT
wse.to_step_uuid as workflow_step_uuid,
dl.level + 1 as level
FROM dependency_levels dl
JOIN workflow_step_edges wse ON wse.from_step_uuid = dl.workflow_step_uuid
JOIN workflow_steps ws ON ws.workflow_step_uuid = wse.to_step_uuid
WHERE ws.task_uuid = input_task_uuid
)
SELECT
dl.workflow_step_uuid,
MAX(dl.level) as dependency_level -- Use MAX to handle multiple paths
FROM dependency_levels dl
GROUP BY dl.workflow_step_uuid
ORDER BY dependency_level, workflow_step_uuid;
END;
Dependency Level Benefits
Parallel Execution Planning:
- Steps at the same dependency level can execute in parallel
- Enables optimal resource utilization across workers
- Supports batch enqueueing for scalability
Execution Ordering:
- Level 0: Root steps (no dependencies) - can start immediately
- Level N: Steps requiring completion of level N-1 steps
- Topological ordering ensures dependency satisfaction
Performance Optimization:
- Single query provides complete dependency analysis
- Avoids N+1 query problems in dependency resolution
- Enables batch processing optimizations
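To illustrate the parallel-execution planning described above, here is a hedged sqlx sketch (not the orchestrator's actual code path) that groups a task's steps by dependency level; steps sharing a level can be enqueued as one parallel batch.
use std::collections::BTreeMap;
use sqlx::PgPool;
use uuid::Uuid;

// Illustrative only: map dependency_level -> step UUIDs for a single task.
async fn steps_by_level(
    pool: &PgPool,
    task_uuid: Uuid,
) -> Result<BTreeMap<i32, Vec<Uuid>>, sqlx::Error> {
    let rows: Vec<(Uuid, i32)> = sqlx::query_as(
        "SELECT workflow_step_uuid, dependency_level FROM calculate_dependency_levels($1)",
    )
    .bind(task_uuid)
    .fetch_all(pool)
    .await?;

    let mut levels: BTreeMap<i32, Vec<Uuid>> = BTreeMap::new();
    for (step_uuid, level) in rows {
        levels.entry(level).or_default().push(step_uuid);
    }
    Ok(levels)
}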
Transitive Dependencies
The get_step_transitive_dependencies function provides complete ancestor analysis:
CREATE OR REPLACE FUNCTION get_step_transitive_dependencies(step_uuid UUID)
RETURNS TABLE(
step_name VARCHAR,
step_uuid UUID,
task_uuid UUID,
distance INTEGER,
processed BOOLEAN,
results JSONB
)
This enables step handlers to access results from any ancestor step:
impl SqlFunctionExecutor {
pub async fn get_step_dependency_results_map(
&self,
step_uuid: Uuid,
) -> Result<HashMap<String, StepExecutionResult>, sqlx::Error> {
let dependencies = self.get_step_transitive_dependencies(step_uuid).await?;
Ok(dependencies
.into_iter()
.filter_map(|dep| {
if dep.processed && dep.results.is_some() {
let results: StepExecutionResult = dep.results.unwrap().into();
Some((dep.step_name, results))
} else {
None
}
})
.collect())
}
}
Task Execution Context
Recent Enhancements
Permanently Blocked Detection Fix (Migration 20251001000000)
The get_task_execution_context function was enhanced to correctly identify tasks blocked by permanent errors:
Problem: The function only checked attempts >= retry_limit to detect permanently blocked steps, missing cases where workers marked errors as non-retryable (e.g., missing handlers, configuration errors).
Solution: Updated permanently_blocked_steps calculation to check both conditions:
COUNT(CASE WHEN sd.current_state = 'error'
AND (sd.attempts >= retry_limit OR sd.retry_eligible = false) THEN 1 END)
Impact:
- execution_status: Now correctly returns blocked_by_failures instead of waiting_for_dependencies for tasks with non-retryable errors
- recommended_action: Returns handle_failures instead of wait_for_dependencies
- health_status: Returns blocked instead of recovering when appropriate
This fix ensures the orchestration system properly identifies when manual intervention is needed versus when a task is simply waiting for retry backoff.
Task Discovery and Orchestration
Task Readiness Discovery
The system provides multiple functions for task discovery based on orchestration needs:
Single Task Discovery
CREATE OR REPLACE FUNCTION get_next_ready_task()
RETURNS TABLE(
task_uuid UUID,
task_name VARCHAR,
priority INTEGER,
namespace_name VARCHAR,
ready_steps_count BIGINT,
computed_priority NUMERIC,
current_state VARCHAR
)
Batch Task Discovery
CREATE OR REPLACE FUNCTION get_next_ready_tasks(limit_count INTEGER)
RETURNS TABLE(
task_uuid UUID,
task_name VARCHAR,
priority INTEGER,
namespace_name VARCHAR,
ready_steps_count BIGINT,
computed_priority NUMERIC,
current_state VARCHAR
)
Task Ready Information
The ReadyTaskInfo structure provides comprehensive task metadata for
orchestration decisions:
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct ReadyTaskInfo {
pub task_uuid: Uuid,
pub task_name: String,
pub priority: i32,
pub namespace_name: String,
pub ready_steps_count: i64,
pub computed_priority: Option<BigDecimal>,
pub current_state: String,
}
Priority Calculation:
- Base priority from task configuration
- Dynamic priority adjustment based on age, retry attempts
- Namespace-based priority modifiers
- SLA-based priority escalation
Ready Steps Count:
- Real-time count of steps eligible for execution
- Used for batch size optimization
- Influences orchestration scheduling decisions
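A simplified discovery loop built on these pieces might look like the sketch below. The return and error types are assumptions, and the real orchestrator adds claiming, backpressure, and error handling.
// Sketch: fetch a batch of ready tasks and hand each off for processing,
// using the executor method and ReadyTaskInfo fields shown above.
async fn discover_and_dispatch(executor: &SqlFunctionExecutor) -> anyhow::Result<()> {
    let ready_tasks = executor.get_next_ready_tasks(50).await?;
    for task in ready_tasks {
        // Priority and ready-step metadata drive scheduling decisions
        tracing::info!(
            task_uuid = %task.task_uuid,
            namespace = %task.namespace_name,
            ready_steps = task.ready_steps_count,
            priority = task.priority,
            "dispatching ready task"
        );
        // ... claim the task and enqueue its ready steps ...
    }
    Ok(())
}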
State Management and Atomic Transitions
Atomic State Transitions
The enhanced state machine provides atomic transitions with processor ownership:
CREATE OR REPLACE FUNCTION transition_task_state_atomic(
p_task_uuid UUID,
p_from_state VARCHAR,
p_to_state VARCHAR,
p_processor_uuid UUID,
p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN AS $$
DECLARE
v_sort_key INTEGER;
v_transitioned INTEGER := 0;
BEGIN
-- Get next sort key
SELECT COALESCE(MAX(sort_key), 0) + 1 INTO v_sort_key
FROM task_transitions
WHERE task_uuid = p_task_uuid;
-- Atomically transition only if in expected state
WITH current_state AS (
SELECT to_state, processor_uuid
FROM task_transitions
WHERE task_uuid = p_task_uuid
AND most_recent = true
FOR UPDATE
),
ownership_check AS (
SELECT
CASE
-- States requiring ownership
WHEN cs.to_state IN ('initializing', 'enqueuing_steps',
'steps_in_process', 'evaluating_results')
THEN cs.processor_uuid = p_processor_uuid OR cs.processor_uuid IS NULL
-- Other states don't require ownership
ELSE true
END as can_transition
FROM current_state cs
WHERE cs.to_state = p_from_state
),
do_update AS (
UPDATE task_transitions
SET most_recent = false
WHERE task_uuid = p_task_uuid
AND most_recent = true
AND EXISTS (SELECT 1 FROM ownership_check WHERE can_transition)
RETURNING task_uuid
)
INSERT INTO task_transitions (
task_uuid, from_state, to_state,
processor_uuid, transition_metadata,
sort_key, most_recent, created_at, updated_at
)
SELECT
p_task_uuid, p_from_state, p_to_state,
p_processor_uuid, p_metadata,
v_sort_key, true, NOW(), NOW()
WHERE EXISTS (SELECT 1 FROM do_update);
GET DIAGNOSTICS v_transitioned = ROW_COUNT;
RETURN v_transitioned > 0;
END;
$$ LANGUAGE plpgsql;
Key Features
Atomic Operation:
- Single transaction with row-level locking
- Compare-and-swap semantics prevent race conditions
- Returns boolean indicating success/failure
Ownership Validation:
- Processor ownership required for active states
- Prevents concurrent processing by multiple orchestrators
- Supports ownership claiming for unowned tasks
State Consistency:
- Validates current state matches expected from_state
- Maintains audit trail with complete transition history
- Updates most_recent flags atomically
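From a caller's perspective, the compare-and-swap contract looks roughly like the following sqlx sketch. It is illustrative only; the orchestrator wraps this in its own state machine types, and metadata is passed here as an empty JSONB literal.
use sqlx::PgPool;
use uuid::Uuid;

// Illustrative only: attempt a guarded transition and report whether we won the race.
async fn try_transition(
    pool: &PgPool,
    task_uuid: Uuid,
    from_state: &str,
    to_state: &str,
    processor_uuid: Uuid,
) -> Result<bool, sqlx::Error> {
    let transitioned: bool = sqlx::query_scalar(
        "SELECT transition_task_state_atomic($1, $2, $3, $4, '{}'::jsonb)",
    )
    .bind(task_uuid)
    .bind(from_state)
    .bind(to_state)
    .bind(processor_uuid)
    .fetch_one(pool)
    .await?;

    // `false` means another processor changed the state first; re-read and retry if appropriate.
    Ok(transitioned)
}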
Current State Resolution
Fast current state lookups are provided through optimized queries:
impl SqlFunctionExecutor {
pub async fn get_current_task_state(&self, task_uuid: Uuid)
-> Result<TaskState, sqlx::Error> {
let state_str = sqlx::query_scalar!(
r#"SELECT get_current_task_state($1) as "state""#,
task_uuid
)
.fetch_optional(&self.pool)
.await?
.ok_or_else(|| sqlx::Error::RowNotFound)?;
match state_str {
Some(state) => TaskState::try_from(state.as_str())
.map_err(|_| sqlx::Error::Decode("Invalid task state".into())),
None => Err(sqlx::Error::RowNotFound),
}
}
}
Analytics and System Health
System Health Monitoring
The get_system_health_counts function provides comprehensive system visibility:
CREATE OR REPLACE FUNCTION get_system_health_counts()
RETURNS TABLE(
pending_tasks BIGINT,
initializing_tasks BIGINT,
enqueuing_steps_tasks BIGINT,
steps_in_process_tasks BIGINT,
evaluating_results_tasks BIGINT,
waiting_for_dependencies_tasks BIGINT,
waiting_for_retry_tasks BIGINT,
blocked_by_failures_tasks BIGINT,
complete_tasks BIGINT,
error_tasks BIGINT,
cancelled_tasks BIGINT,
resolved_manually_tasks BIGINT,
total_tasks BIGINT,
-- step counts...
) AS $$
Health Score Calculation
The Rust implementation provides derived health metrics:
impl SystemHealthCounts {
pub fn health_score(&self) -> f64 {
if self.total_tasks == 0 {
return 1.0;
}
let success_rate = self.complete_tasks as f64 / self.total_tasks as f64;
let error_rate = self.error_tasks as f64 / self.total_tasks as f64;
let connection_health = 1.0 -
(self.active_connections as f64 / self.max_connections as f64).min(1.0);
// Weighted combination: 50% success rate, 30% error rate, 20% connection health
(success_rate * 0.5) + ((1.0 - error_rate) * 0.3) + (connection_health * 0.2)
}
pub fn is_under_heavy_load(&self) -> bool {
let connection_pressure =
self.active_connections as f64 / self.max_connections as f64;
let error_rate = if self.total_tasks > 0 {
self.error_tasks as f64 / self.total_tasks as f64
} else {
0.0
};
connection_pressure > 0.8 || error_rate > 0.2
}
}
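A small consumer sketch of these derived metrics follows; the threshold and logging shown are illustrative, and how SystemHealthCounts is fetched is up to the caller.
// Sketch: gate optional or background work on current system health.
fn should_defer_background_work(health: &SystemHealthCounts) -> bool {
    let score = health.health_score();
    if health.is_under_heavy_load() {
        tracing::warn!(score, "system under heavy load; deferring background work");
        return true;
    }
    // Treat anything below 0.8 as degraded (threshold chosen for the example only)
    score < 0.8
}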
Analytics Metrics
The get_analytics_metrics function provides comprehensive performance analysis:
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct AnalyticsMetrics {
pub active_tasks_count: i64,
pub total_namespaces_count: i64,
pub unique_task_types_count: i64,
pub system_health_score: BigDecimal,
pub task_throughput: i64,
pub completion_count: i64,
pub error_count: i64,
pub completion_rate: BigDecimal,
pub error_rate: BigDecimal,
pub avg_task_duration: BigDecimal,
pub avg_step_duration: BigDecimal,
pub step_throughput: i64,
pub analysis_period_start: DateTime<Utc>,
pub calculated_at: DateTime<Utc>,
}
Performance Optimization Analysis
Slowest Steps Analysis
The system provides performance optimization guidance through detailed analysis:
CREATE OR REPLACE FUNCTION get_slowest_steps(
since_timestamp TIMESTAMP WITH TIME ZONE,
limit_count INTEGER,
namespace_filter VARCHAR,
task_name_filter VARCHAR,
version_filter VARCHAR
) RETURNS TABLE(
named_step_uuid INTEGER,
step_name VARCHAR,
avg_duration_seconds NUMERIC,
max_duration_seconds NUMERIC,
min_duration_seconds NUMERIC,
execution_count INTEGER,
error_count INTEGER,
error_rate NUMERIC,
last_executed_at TIMESTAMP WITH TIME ZONE
)
Slowest Tasks Analysis
Similar analysis is available at the task level:
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct SlowestTaskAnalysis {
pub named_task_uuid: Uuid,
pub task_name: String,
pub avg_duration_seconds: f64,
pub max_duration_seconds: f64,
pub min_duration_seconds: f64,
pub execution_count: i32,
pub avg_step_count: f64,
pub error_count: i32,
pub error_rate: f64,
pub last_executed_at: Option<DateTime<Utc>>,
}
Critical Problem-Solving SQL Functions
PGMQ Message Race Condition Prevention
Problem: Multiple Workers Claiming Same Message
When multiple workers simultaneously try to process steps from the same queue, PGMQ’s standard
pgmq.read() function randomly selects messages, potentially causing workers to miss messages
they were specifically notified about. This creates inefficiency and potential race conditions.
Solution: pgmq_read_specific_message()
CREATE OR REPLACE FUNCTION pgmq_read_specific_message(
queue_name text,
target_msg_id bigint,
vt_seconds integer DEFAULT 30
) RETURNS TABLE (
msg_id bigint,
read_ct integer,
enqueued_at timestamp with time zone,
vt timestamp with time zone,
message jsonb
) AS $$
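Before walking through its internals, here is a hedged worker-side sketch of claiming a specific message with this function via sqlx. It is illustrative only; real workers go through their own queue client, and the sqlx json feature is assumed.
use sqlx::PgPool;

// Illustrative only: try to claim one specific message with a 30-second visibility
// timeout; `None` means another worker already holds it or it no longer exists.
async fn claim_specific_message(
    pool: &PgPool,
    queue_name: &str,
    msg_id: i64,
) -> Result<Option<serde_json::Value>, sqlx::Error> {
    let row: Option<(i64, serde_json::Value)> = sqlx::query_as(
        "SELECT msg_id, message FROM pgmq_read_specific_message($1, $2, 30)",
    )
    .bind(queue_name)
    .bind(msg_id)
    .fetch_optional(pool)
    .await?;

    Ok(row.map(|(_, message)| message))
}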
Key Problem-Solving Logic:
-
Atomic Claim with Visibility Timeout: Uses UPDATE…RETURNING pattern to atomically:
- Check if message is available (vt <= now())
- Set new visibility timeout preventing other workers from claiming
- Increment read count for monitoring retry attempts
- Return message data only if successfully claimed
- Check if message is available (
-
Race Condition Prevention: The
WHERE vt <= now()clause ensures only one worker can claim a message. If two workers try simultaneously, only one UPDATE succeeds. -
Graceful Failure Handling: Returns empty result set if message is:
- Already claimed by another worker (vt > now())
- Non-existent (deleted or never existed)
- Archived (moved to archive table)
-
Security: Validates queue name to prevent SQL injection in dynamic query construction.
Real-World Impact: Eliminates “message not found” errors when workers are notified about specific messages but can’t retrieve them due to random selection in standard read.
Task State Ownership and Atomic Transitions
Problem: Concurrent Orchestrators Processing Same Task
In distributed deployments, multiple orchestrator instances might try to process the same task simultaneously, leading to duplicate work, inconsistent state, and race conditions.
Solution: transition_task_state_atomic()
CREATE OR REPLACE FUNCTION transition_task_state_atomic(
p_task_uuid UUID,
p_from_state VARCHAR,
p_to_state VARCHAR,
p_processor_uuid UUID,
p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN AS $$
Key Problem-Solving Logic:
-
Compare-and-Swap Pattern:
- Reads current state with FOR UPDATE lock
- Only transitions if current state matches expected from_state
- Returns false if state has changed, allowing caller to retry with fresh state
- Reads current state with
-
Processor Ownership Enforcement:
CASE
    WHEN cs.to_state IN ('initializing', 'enqueuing_steps',
                         'steps_in_process', 'evaluating_results')
    THEN cs.processor_uuid = p_processor_uuid OR cs.processor_uuid IS NULL
    ELSE true
END
- Active processing states require ownership match
- Allows claiming unowned tasks (NULL processor_uuid)
- Terminal states (complete, error) don’t require ownership
-
Audit Trail Preservation:
- Updates previous transition's most_recent = false
- Inserts new transition with most_recent = true
- Maintains complete history with sort_key ordering
- Updates previous transition’s
-
Atomic Success/Failure: Returns boolean indicating whether transition succeeded, enabling callers to handle contention gracefully.
Real-World Impact: Enables safe distributed orchestration where multiple instances can operate without conflicts, automatically distributing work through ownership claiming.
Batch Task Discovery with Priority
Problem: Efficient Work Distribution Across Orchestrators
Orchestrators need to discover ready tasks efficiently without creating hotspots or missing tasks, while respecting priority and avoiding claimed tasks.
Solution: get_next_ready_tasks()
CREATE OR REPLACE FUNCTION get_next_ready_tasks(p_limit INTEGER DEFAULT 5)
RETURNS TABLE(
task_uuid UUID,
task_name TEXT,
priority INTEGER,
namespace_name TEXT,
ready_steps_count BIGINT,
computed_priority NUMERIC,
current_state VARCHAR
)
Key Problem-Solving Logic:
-
Ready Step Discovery:
WITH ready_steps AS (
    SELECT task_uuid, COUNT(*) as ready_count
    FROM workflow_steps
    WHERE current_state IN ('pending', 'error')
      AND [dependency checks]
    GROUP BY task_uuid
)
- Pre-aggregates ready steps per task for efficiency
- Considers both new steps and retryable errors
-
State-Based Filtering:
- Only returns tasks in states that need processing
- Excludes terminal states (complete, cancelled)
- Includes waiting states that might have become ready
-
Priority Computation:
computed_priority = base_priority + (age_factor * hours_waiting) + (retry_factor * retry_count)
- Dynamic priority based on age and retry attempts
- Prevents task starvation through age escalation
-
Batch Efficiency:
- Returns multiple tasks in single query
- Reduces database round trips
- Enables parallel processing across orchestrators
Real-World Impact: Enables efficient work distribution where each orchestrator can claim a batch of tasks, reducing contention and improving throughput.
Complex Dependency Resolution
Problem: Determining Step Execution Readiness
Workflow steps have complex dependencies involving parent completion, retry logic, backoff timing, and state validation. Determining which steps are ready for execution requires sophisticated dependency analysis that must handle:
- Multiple parent dependencies with conditional logic
- Exponential backoff after failures
- Retry limits and attempt tracking
- State consistency across distributed workers
Solution: get_step_readiness_status()
CREATE OR REPLACE FUNCTION get_step_readiness_status(
input_task_uuid UUID,
step_uuids UUID[] DEFAULT NULL
) RETURNS TABLE(
workflow_step_uuid UUID,
task_uuid UUID,
named_step_uuid UUID,
name VARCHAR,
current_state VARCHAR,
dependencies_satisfied BOOLEAN,
retry_eligible BOOLEAN,
ready_for_execution BOOLEAN,
-- ... additional metadata
)
Key Problem-Solving Logic:
-
Dependency Satisfaction Analysis:
WITH parent_completion AS (
    SELECT
        edge.to_step_uuid,
        COUNT(*) as total_parents,
        COUNT(CASE WHEN parent.current_state = 'complete' THEN 1 END) as completed_parents
    FROM workflow_step_edges edge
    JOIN workflow_steps parent ON parent.workflow_step_uuid = edge.from_step_uuid
    WHERE parent.task_uuid = input_task_uuid
    GROUP BY edge.to_step_uuid
)
- Counts total vs. completed parent dependencies
- Handles conditional dependencies based on parent results
- Supports complex DAG structures with multiple paths
-
Retry Eligibility Assessment:
retry_eligible = (
    current_state = 'error'
    AND attempts < retry_limit
    AND (last_attempted_at IS NULL OR last_attempted_at + backoff_interval <= NOW())
)
- Enforces retry limits to prevent infinite loops
- Calculates exponential backoff: 2^attempts seconds (max 30)
- Respects custom backoff periods from step configuration
- Considers temporal constraints for retry timing
-
State Validation:
ready_for_execution = (
    current_state IN ('pending', 'error')
    AND dependencies_satisfied
    AND retry_eligible
)
- Only pending or retryable error steps can execute
- Requires all dependencies satisfied
- Must pass retry eligibility checks
- Prevents execution of steps in terminal states
-
Backoff Calculation:
next_retry_at = CASE
    WHEN current_state = 'error' AND attempts > 0
    THEN last_attempted_at + INTERVAL '1 second' *
         COALESCE(backoff_request_seconds, LEAST(POW(2, attempts), 30))
    ELSE NULL
END
- Custom backoff from step configuration takes precedence
- Default exponential backoff with maximum cap
- Temporal precision for scheduling retry attempts
Real-World Impact: Enables complex workflow orchestration with sophisticated dependency management, retry logic, and backoff handling, supporting enterprise-grade reliability patterns while maintaining high performance through set-based operations.
Integration with Event and State Systems
PostgreSQL LISTEN/NOTIFY Integration
The SQL functions integrate with the event-driven architecture through PostgreSQL notifications:
PGMQ Wrapper Functions for Atomic Operations
The system uses wrapper functions that combine PGMQ message sending with PostgreSQL notifications atomically:
-- Atomic wrapper that sends message AND notification
CREATE OR REPLACE FUNCTION pgmq_send_with_notify(
queue_name TEXT,
message JSONB,
delay_seconds INTEGER DEFAULT 0
) RETURNS BIGINT AS $$
DECLARE
msg_id BIGINT;
namespace_name TEXT;
event_payload TEXT;
namespace_channel TEXT;
global_channel TEXT := 'pgmq_message_ready';
BEGIN
-- Send message using PGMQ's native function
SELECT pgmq.send(queue_name, message, delay_seconds) INTO msg_id;
-- Extract namespace from queue name using robust helper
namespace_name := extract_queue_namespace(queue_name);
-- Build namespace-specific channel name
namespace_channel := 'pgmq_message_ready.' || namespace_name;
-- Build event payload
event_payload := json_build_object(
'event_type', 'message_ready',
'msg_id', msg_id,
'queue_name', queue_name,
'namespace', namespace_name,
'ready_at', NOW()::timestamptz,
'delay_seconds', delay_seconds
)::text;
-- Send notifications in same transaction
PERFORM pg_notify(namespace_channel, event_payload);
-- Also send to global channel if different
IF namespace_channel != global_channel THEN
PERFORM pg_notify(global_channel, event_payload);
END IF;
RETURN msg_id;
END;
$$ LANGUAGE plpgsql;
Namespace Extraction Helper
-- Robust namespace extraction helper function
CREATE OR REPLACE FUNCTION extract_queue_namespace(queue_name TEXT)
RETURNS TEXT AS $$
BEGIN
-- Handle orchestration queues
IF queue_name ~ '^orchestration' THEN
RETURN 'orchestration';
END IF;
-- Handle worker queues: worker_namespace_queue -> namespace
IF queue_name ~ '^worker_.*_queue$' THEN
RETURN COALESCE(
(regexp_match(queue_name, '^worker_(.+?)_queue$'))[1],
'worker'
);
END IF;
-- Handle standard namespace_queue pattern
IF queue_name ~ '^[a-zA-Z][a-zA-Z0-9_]*_queue$' THEN
RETURN COALESCE(
(regexp_match(queue_name, '^([a-zA-Z][a-zA-Z0-9_]*)_queue$'))[1],
'default'
);
END IF;
-- Fallback for any other pattern
RETURN 'default';
END;
$$ LANGUAGE plpgsql;
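The helper's behavior for a few representative queue names (the names themselves are illustrative):
-- Illustrative inputs and the namespaces they resolve to
SELECT extract_queue_namespace('orchestration_task_requests');  -- 'orchestration'
SELECT extract_queue_namespace('worker_payments_queue');        -- 'payments'
SELECT extract_queue_namespace('fulfillment_queue');            -- 'fulfillment'
SELECT extract_queue_namespace('ad-hoc');                       -- 'default' (fallback)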
Fallback Polling for Task Readiness
Instead of database triggers for task readiness notifications, the system uses a fallback polling mechanism to ensure no ready tasks are missed:
FallbackPoller Configuration:
- Default polling interval: 30 seconds
- Runs StepEnqueuerService::process_batch() periodically
- Catches tasks that may have been missed by the primary PGMQ notification system
- Configurable enable/disable via TOML configuration
Key Benefits:
- Resilience: Ensures no tasks are permanently stuck if notifications fail
- Simplicity: No complex database triggers or state tracking required
- Observability: Clear metrics on fallback discovery vs. event-driven discovery
- Safety Net: Primary event-driven system + fallback polling provides redundancy
PGMQ Message Queue Integration
SQL functions coordinate with PGMQ for reliable message processing:
Queue Management Functions
-- Ensure queue exists with proper configuration
CREATE OR REPLACE FUNCTION ensure_task_queue(queue_name VARCHAR)
RETURNS BOOLEAN AS $$
BEGIN
-- Create queue if it doesn't exist
PERFORM pgmq.create_queue(queue_name);
-- Ensure headers column exists (pgmq-rs compatibility)
PERFORM pgmq_ensure_headers_column(queue_name);
RETURN TRUE;
END;
$$ LANGUAGE plpgsql;
Message Processing Support
-- Get queue statistics for monitoring
CREATE OR REPLACE FUNCTION get_queue_statistics(queue_name VARCHAR)
RETURNS TABLE(
queue_name VARCHAR,
queue_length BIGINT,
oldest_msg_age_seconds INTEGER,
newest_msg_age_seconds INTEGER
) AS $$
BEGIN
RETURN QUERY
SELECT
queue_name,
pgmq.queue_length(queue_name),
EXTRACT(EPOCH FROM (NOW() - MIN(enqueued_at)))::INTEGER,
EXTRACT(EPOCH FROM (NOW() - MAX(enqueued_at)))::INTEGER
FROM pgmq.messages(queue_name);
END;
$$ LANGUAGE plpgsql;
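Both helpers are plain SQL functions, so they can be exercised directly from psql during setup or monitoring (the queue name is illustrative):
-- Create (or confirm) a queue, then inspect its depth and message ages
SELECT ensure_task_queue('worker_payments_queue');
SELECT * FROM get_queue_statistics('worker_payments_queue');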
Transaction Safety
All SQL functions are designed with transaction safety in mind:
Atomic Operations:
- State transitions use row-level locking (FOR UPDATE)
- Compare-and-swap patterns prevent race conditions (see the sketch after these lists)
- Rollback safety for partial failures
Consistency Guarantees:
- Foreign key constraints maintained across all operations
- Check constraints validate state transitions
- Audit trails preserved for debugging and compliance
Performance Optimization:
- Efficient indexes for common query patterns
- Materialized views for expensive analytics queries
- Connection pooling for high concurrency
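A minimal sketch of the row-locking and compare-and-swap pattern referenced above, using column names from the readiness queries earlier in this chapter; the target state name and UUID are placeholders, and this is conceptual rather than one of the shipped SQL functions:
-- Conceptual sketch: claim a step only if it is still in an expected state
BEGIN;
SELECT current_state
FROM workflow_steps
WHERE workflow_step_uuid = '00000000-0000-0000-0000-000000000000'  -- placeholder UUID
FOR UPDATE;  -- row-level lock blocks concurrent claimers
UPDATE workflow_steps
SET current_state = 'in_progress',  -- placeholder state name
    attempts = attempts + 1,
    last_attempted_at = NOW()
WHERE workflow_step_uuid = '00000000-0000-0000-0000-000000000000'
  AND current_state IN ('pending', 'error');  -- compare-and-swap: no-op if another transaction won
COMMIT;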
Usage Patterns and Best Practices
Rust Integration Patterns
The SqlFunctionExecutor provides type-safe access to all SQL functions:
use tasker_shared::database::sql_functions::{SqlFunctionExecutor, FunctionRegistry};

// Direct executor usage
let executor = SqlFunctionExecutor::new(pool);
let ready_steps = executor.get_ready_steps(task_uuid).await?;

// Registry pattern for organized access
let registry = FunctionRegistry::new(pool);
let analytics = registry.analytics().get_analytics_metrics(None).await?;
let health = registry.system_health().get_system_health_counts().await?;
Batch Processing Optimization
For high-throughput scenarios, the system supports efficient batch operations:
// Batch step readiness analysis
let task_uuids = vec![task1_uuid, task2_uuid, task3_uuid];
let batch_readiness = executor.get_step_readiness_status_batch(task_uuids).await?;

// Batch task discovery
let ready_tasks = executor.get_next_ready_tasks(50).await?;
Error Handling Best Practices
SQL function errors are properly propagated through the type system:
match executor.get_current_task_state(task_uuid).await {
    Ok(state) => {
        // Process state
    }
    Err(sqlx::Error::RowNotFound) => {
        // Handle missing task
    }
    Err(e) => {
        // Handle other database errors
    }
}
Tasker Configuration Documentation Index
Coverage: 246/246 parameters documented (100%)
Common Configuration
- backoff (common.backoff) — 5 params (5 documented)
- cache (common.cache) — 10 params (10 documented)
  - moka (common.cache.moka) — 1 param
  - redis (common.cache.redis) — 4 params
- circuit_breakers (common.circuit_breakers) — 13 params (13 documented)
  - component_configs (common.circuit_breakers.component_configs) — 8 params
  - default_config (common.circuit_breakers.default_config) — 3 params
  - global_settings (common.circuit_breakers.global_settings) — 2 params
- database (common.database) — 7 params (7 documented)
  - pool (common.database.pool) — 6 params
- execution (common.execution) — 2 params (2 documented)
- mpsc_channels (common.mpsc_channels) — 4 params (4 documented)
  - event_publisher (common.mpsc_channels.event_publisher) — 1 param
  - ffi (common.mpsc_channels.ffi) — 1 param
  - overflow_policy (common.mpsc_channels.overflow_policy) — 2 params
- pgmq_database (common.pgmq_database) — 8 params (8 documented)
  - pool (common.pgmq_database.pool) — 6 params
- queues (common.queues) — 14 params (14 documented)
  - orchestration_queues (common.queues.orchestration_queues) — 3 params
  - pgmq (common.queues.pgmq) — 3 params
  - rabbitmq (common.queues.rabbitmq) — 3 params
- system (common.system) — 1 param (1 documented)
- task_templates (common.task_templates) — 1 param (1 documented)
Orchestration Configuration
- orchestration (orchestration) — 2 params (2 documented)
- batch_processing (orchestration.batch_processing) — 4 params (4 documented)
- decision_points (orchestration.decision_points) — 7 params (7 documented)
- dlq (orchestration.dlq) — 13 params (13 documented)
  - staleness_detection (orchestration.dlq.staleness_detection) — 12 params
- event_systems (orchestration.event_systems) — 36 params (36 documented)
  - orchestration (orchestration.event_systems.orchestration) — 18 params
  - task_readiness (orchestration.event_systems.task_readiness) — 18 params
- grpc (orchestration.grpc) — 9 params (9 documented)
- mpsc_channels (orchestration.mpsc_channels) — 3 params (3 documented)
  - command_processor (orchestration.mpsc_channels.command_processor) — 1 param
  - event_listeners (orchestration.mpsc_channels.event_listeners) — 1 param
  - event_systems (orchestration.mpsc_channels.event_systems) — 1 param
- web (orchestration.web) — 17 params (17 documented)
  - auth (orchestration.web.auth) — 9 params
  - database_pools (orchestration.web.database_pools) — 5 params
Worker Configuration
- worker (worker) — 2 params (2 documented)
- circuit_breakers (worker.circuit_breakers) — 4 params (4 documented)
  - ffi_completion_send (worker.circuit_breakers.ffi_completion_send) — 4 params
- event_systems (worker.event_systems) — 32 params (32 documented)
  - worker (worker.event_systems.worker) — 32 params
- grpc (worker.grpc) — 9 params (9 documented)
- mpsc_channels (worker.mpsc_channels) — 23 params (23 documented)
  - command_processor (worker.mpsc_channels.command_processor) — 1 param
  - domain_events (worker.mpsc_channels.domain_events) — 3 params
  - event_listeners (worker.mpsc_channels.event_listeners) — 1 param
  - event_subscribers (worker.mpsc_channels.event_subscribers) — 2 params
  - event_systems (worker.mpsc_channels.event_systems) — 1 param
  - ffi_dispatch (worker.mpsc_channels.ffi_dispatch) — 5 params
  - handler_dispatch (worker.mpsc_channels.handler_dispatch) — 7 params
  - in_process_events (worker.mpsc_channels.in_process_events) — 3 params
- orchestration_client (worker.orchestration_client) — 3 params (3 documented)
- web (worker.web) — 17 params (17 documented)
  - auth (worker.web.auth) — 9 params
  - database_pools (worker.web.database_pools) — 5 params
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: common
65/65 parameters documented
backoff
Path: common.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
backoff_multiplier | f64 | 2.0 | Multiplier applied to the previous delay for exponential backoff calculations |
default_backoff_seconds | Vec<u32> | [1, 5, 15, 30, 60] | Sequence of backoff delays in seconds for successive retry attempts |
jitter_enabled | bool | true | Add random jitter to backoff delays to prevent thundering herd on retry |
jitter_max_percentage | f64 | 0.15 | Maximum jitter as a fraction of the computed backoff delay |
max_backoff_seconds | u32 | 3600 | Hard upper limit on any single backoff delay |
common.backoff.backoff_multiplier
Multiplier applied to the previous delay for exponential backoff calculations
- Type: f64
- Default: 2.0
- Valid Range: 1.0-10.0
- System Impact: Controls how aggressively delays grow; 2.0 means each delay is double the previous
common.backoff.default_backoff_seconds
Sequence of backoff delays in seconds for successive retry attempts
- Type: Vec<u32>
- Default: [1, 5, 15, 30, 60]
- Valid Range: non-empty array of positive integers
- System Impact: Defines the retry cadence; after exhausting the array, the last value is reused up to max_backoff_seconds
common.backoff.jitter_enabled
Add random jitter to backoff delays to prevent thundering herd on retry
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, backoff delays are randomized within jitter_max_percentage to spread retries across time
common.backoff.jitter_max_percentage
Maximum jitter as a fraction of the computed backoff delay
- Type: f64
- Default: 0.15
- Valid Range: 0.0-1.0
- System Impact: A value of 0.15 means delays vary by up to +/-15% of the base delay
common.backoff.max_backoff_seconds
Hard upper limit on any single backoff delay
- Type: u32
- Default: 3600
- Valid Range: 1-3600
- System Impact: Caps exponential backoff growth to prevent excessively long delays between retries
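A minimal TOML sketch for this section, assuming the configuration file nests tables to mirror the parameter paths above (the exact layout of a generated config may differ); values shown are the documented defaults:
[common.backoff]
default_backoff_seconds = [1, 5, 15, 30, 60]
max_backoff_seconds = 3600
backoff_multiplier = 2.0
jitter_enabled = true
jitter_max_percentage = 0.15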
cache
Path: common.cache
| Parameter | Type | Default | Description |
|---|---|---|---|
analytics_ttl_seconds | u32 | 60 | Time-to-live in seconds for cached analytics and metrics data |
backend | String | "redis" | Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process) |
default_ttl_seconds | u32 | 3600 | Default time-to-live in seconds for cached entries |
enabled | bool | false | Enable the distributed cache layer for template and analytics data |
template_ttl_seconds | u32 | 3600 | Time-to-live in seconds for cached task template definitions |
common.cache.analytics_ttl_seconds
Time-to-live in seconds for cached analytics and metrics data
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Analytics data is write-heavy and changes frequently; short TTL (60s) keeps metrics current
common.cache.backend
Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)
- Type: String
- Default: "redis"
- Valid Range: redis | moka
- System Impact: Redis is required for multi-instance deployments to avoid stale data; moka is suitable for single-instance or DoS protection
common.cache.default_ttl_seconds
Default time-to-live in seconds for cached entries
- Type: u32
- Default: 3600
- Valid Range: 1-86400
- System Impact: Controls how long cached data remains valid before being re-fetched from the database
common.cache.enabled
Enable the distributed cache layer for template and analytics data
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all cache reads fall through to direct database queries; no cache dependency required
common.cache.template_ttl_seconds
Time-to-live in seconds for cached task template definitions
- Type: u32
- Default: 3600
- Valid Range: 1-86400
- System Impact: Template changes take up to this long to propagate; shorter values increase DB load, longer values improve performance
moka
Path: common.cache.moka
| Parameter | Type | Default | Description |
|---|---|---|---|
max_capacity | u64 | 10000 | Maximum number of entries the in-process Moka cache can hold |
common.cache.moka.max_capacity
Maximum number of entries the in-process Moka cache can hold
- Type: u64
- Default: 10000
- Valid Range: 1-1000000
- System Impact: Bounds memory usage; least-recently-used entries are evicted when capacity is reached
redis
Path: common.cache.redis
| Parameter | Type | Default | Description |
|---|---|---|---|
connection_timeout_seconds | u32 | 5 | Maximum time to wait when establishing a new Redis connection |
database | u32 | 0 | Redis database number (0-15) |
max_connections | u32 | 10 | Maximum number of connections in the Redis connection pool |
url | String | "${REDIS_URL:-redis://localhost:6379}" | Redis connection URL |
common.cache.redis.connection_timeout_seconds
Maximum time to wait when establishing a new Redis connection
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Connections that cannot be established within this timeout fail; cache falls back to database
common.cache.redis.database
Redis database number (0-15)
- Type: u32
- Default: 0
- Valid Range: 0-15
- System Impact: Isolates Tasker cache keys from other applications sharing the same Redis instance
common.cache.redis.max_connections
Maximum number of connections in the Redis connection pool
- Type: u32
- Default: 10
- Valid Range: 1-500
- System Impact: Bounds concurrent Redis operations; increase for high cache throughput workloads
common.cache.redis.url
Redis connection URL
- Type: String
- Default: "${REDIS_URL:-redis://localhost:6379}"
- Valid Range: valid Redis URI
- System Impact: Must be reachable when cache is enabled with redis backend
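A minimal TOML sketch enabling the cache with the Redis backend, again assuming tables nest to match the parameter paths above (values are the documented defaults, except enabled, which defaults to false):
[common.cache]
enabled = true
backend = "redis"
default_ttl_seconds = 3600
template_ttl_seconds = 3600
analytics_ttl_seconds = 60

[common.cache.redis]
url = "${REDIS_URL:-redis://localhost:6379}"
database = 0
max_connections = 10
connection_timeout_seconds = 5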
circuit_breakers
Path: common.circuit_breakers
component_configs
Path: common.circuit_breakers.component_configs
cache
Path: common.circuit_breakers.component_configs.cache
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the cache circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the cache breaker |
common.circuit_breakers.component_configs.cache.failure_threshold
Failures before the cache circuit breaker trips to Open
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects Redis/Dragonfly operations; when tripped, cache reads fall through to database
common.circuit_breakers.component_configs.cache.success_threshold
Successes in Half-Open required to close the cache breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Low threshold (2) for fast recovery since cache failures gracefully degrade to database
messaging
Path: common.circuit_breakers.component_configs.messaging
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the messaging circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the messaging breaker |
common.circuit_breakers.component_configs.messaging.failure_threshold
Failures before the messaging circuit breaker trips to Open
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects the messaging layer (PGMQ or RabbitMQ); when tripped, queue send/receive operations are short-circuited
common.circuit_breakers.component_configs.messaging.success_threshold
Successes in Half-Open required to close the messaging breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Lower threshold (2) allows faster recovery since messaging failures are typically transient
task_readiness
Path: common.circuit_breakers.component_configs.task_readiness
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 10 | Failures before the task readiness circuit breaker trips to Open |
success_threshold | u32 | 3 | Successes in Half-Open required to close the task readiness breaker |
common.circuit_breakers.component_configs.task_readiness.failure_threshold
Failures before the task readiness circuit breaker trips to Open
- Type: u32
- Default: 10
- Valid Range: 1-100
- System Impact: Higher than default (10 vs 5) because task readiness queries are frequent and transient failures are expected
common.circuit_breakers.component_configs.task_readiness.success_threshold
Successes in Half-Open required to close the task readiness breaker
- Type: u32
- Default: 3
- Valid Range: 1-100
- System Impact: Slightly higher than default (3) for extra confidence before resuming readiness queries
web
Path: common.circuit_breakers.component_configs.web
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the web/API database circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the web database breaker |
common.circuit_breakers.component_configs.web.failure_threshold
Failures before the web/API database circuit breaker trips to Open
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects API database operations; when tripped, API requests receive fast 503 errors instead of waiting for timeouts
common.circuit_breakers.component_configs.web.success_threshold
Successes in Half-Open required to close the web database breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Standard threshold (2) provides confidence in recovery before restoring full API traffic
default_config
Path: common.circuit_breakers.default_config
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Number of consecutive failures before a circuit breaker trips to the Open state |
success_threshold | u32 | 2 | Number of consecutive successes in Half-Open state required to close the circuit breaker |
timeout_seconds | u32 | 30 | Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests |
common.circuit_breakers.default_config.failure_threshold
Number of consecutive failures before a circuit breaker trips to the Open state
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Lower values make the breaker more sensitive; higher values tolerate more transient failures before tripping
common.circuit_breakers.default_config.success_threshold
Number of consecutive successes in Half-Open state required to close the circuit breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Higher values require more proof of recovery before restoring full traffic
common.circuit_breakers.default_config.timeout_seconds
Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: Controls recovery speed; shorter timeouts attempt recovery sooner but risk repeated failures
global_settings
Path: common.circuit_breakers.global_settings
| Parameter | Type | Default | Description |
|---|---|---|---|
metrics_collection_interval_seconds | u32 | 30 | Interval in seconds between circuit breaker metrics collection sweeps |
min_state_transition_interval_seconds | f64 | 5.0 | Minimum time in seconds between circuit breaker state transitions |
common.circuit_breakers.global_settings.metrics_collection_interval_seconds
Interval in seconds between circuit breaker metrics collection sweeps
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently circuit breaker state, failure counts, and transition counts are collected for observability
common.circuit_breakers.global_settings.min_state_transition_interval_seconds
Minimum time in seconds between circuit breaker state transitions
- Type: f64
- Default: 5.0
- Valid Range: 0.0-60.0
- System Impact: Prevents rapid oscillation between Open and Closed states during intermittent failures
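A minimal TOML sketch showing the default breaker settings plus one per-component override, assuming tables nest to match the parameter paths above (values are the documented defaults):
[common.circuit_breakers.default_config]
failure_threshold = 5
success_threshold = 2
timeout_seconds = 30

[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10  # more tolerant: readiness queries are frequent
success_threshold = 3

[common.circuit_breakers.global_settings]
metrics_collection_interval_seconds = 30
min_state_transition_interval_seconds = 5.0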
database
Path: common.database
| Parameter | Type | Default | Description |
|---|---|---|---|
url | String | "${DATABASE_URL:-postgresql://localhost/tasker}" | PostgreSQL connection URL for the primary database |
common.database.url
PostgreSQL connection URL for the primary database
- Type: String
- Default: "${DATABASE_URL:-postgresql://localhost/tasker}"
- Valid Range: valid PostgreSQL connection URI
- System Impact: All task, step, and workflow state is stored here; must be reachable at startup
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| development | postgresql://localhost/tasker | Local default, no auth |
| production | ${DATABASE_URL} | Always use env var injection for secrets rotation |
| test | postgresql://tasker:tasker@localhost:5432/tasker_rust_test | Isolated test database with known credentials |
Related: common.database.pool.max_connections, common.pgmq_database.url
pool
Path: common.database.pool
| Parameter | Type | Default | Description |
|---|---|---|---|
acquire_timeout_seconds | u32 | 10 | Maximum time to wait when acquiring a connection from the pool |
idle_timeout_seconds | u32 | 300 | Time before an idle connection is closed and removed from the pool |
max_connections | u32 | 25 | Maximum number of concurrent database connections in the pool |
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a connection before it is closed and replaced |
min_connections | u32 | 5 | Minimum number of idle connections maintained in the pool |
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which connection acquisition is logged as slow |
common.database.pool.acquire_timeout_seconds
Maximum time to wait when acquiring a connection from the pool
- Type: u32
- Default: 10
- Valid Range: 1-300
- System Impact: Queries fail with a timeout error if no connection is available within this window
common.database.pool.idle_timeout_seconds
Time before an idle connection is closed and removed from the pool
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Controls how quickly the pool shrinks back to min_connections after load drops
common.database.pool.max_connections
Maximum number of concurrent database connections in the pool
- Type: u32
- Default: 25
- Valid Range: 1-1000
- System Impact: Controls database connection concurrency; too few causes query queuing under load, too many risks DB resource exhaustion
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| development | 10-25 | Small pool for local development |
| production | 30-50 | Scale based on worker count and concurrent task volume |
| test | 10-30 | Moderate pool; cluster tests may run 10 services sharing the same DB |
Related: common.database.pool.min_connections, common.database.pool.acquire_timeout_seconds
common.database.pool.max_lifetime_seconds
Maximum total lifetime of a connection before it is closed and replaced
- Type: u32
- Default: 1800
- Valid Range: 60-86400
- System Impact: Prevents connection drift from server-side config changes or memory leaks in long-lived connections
common.database.pool.min_connections
Minimum number of idle connections maintained in the pool
- Type: u32
- Default: 5
- Valid Range: 0-100
- System Impact: Keeps connections warm to avoid cold-start latency on first queries after idle periods
common.database.pool.slow_acquire_threshold_ms
Threshold in milliseconds above which connection acquisition is logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-60000
- System Impact: Observability: slow acquire warnings indicate pool pressure or network issues
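A minimal TOML sketch for the primary database and its pool, assuming tables nest to match the parameter paths above (values are the documented defaults):
[common.database]
url = "${DATABASE_URL:-postgresql://localhost/tasker}"

[common.database.pool]
max_connections = 25
min_connections = 5
acquire_timeout_seconds = 10
idle_timeout_seconds = 300
max_lifetime_seconds = 1800
slow_acquire_threshold_ms = 100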
execution
Path: common.execution
| Parameter | Type | Default | Description |
|---|---|---|---|
environment | String | "development" | Runtime environment identifier used for configuration context selection and logging |
step_enqueue_batch_size | u32 | 50 | Number of steps to enqueue in a single batch during task initialization |
common.execution.environment
Runtime environment identifier used for configuration context selection and logging
- Type: String
- Default: "development"
- Valid Range: test | development | production
- System Impact: Affects log levels, default tuning, and environment-specific behavior throughout the system
common.execution.step_enqueue_batch_size
Number of steps to enqueue in a single batch during task initialization
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Controls step enqueueing throughput; larger batches reduce round trips but increase per-batch latency
mpsc_channels
Path: common.mpsc_channels
event_publisher
Path: common.mpsc_channels.event_publisher
| Parameter | Type | Default | Description |
|---|---|---|---|
event_queue_buffer_size | usize | 5000 | Bounded channel capacity for the event publisher MPSC channel |
common.mpsc_channels.event_publisher.event_queue_buffer_size
Bounded channel capacity for the event publisher MPSC channel
- Type: usize
- Default: 5000
- Valid Range: 100-100000
- System Impact: Controls backpressure for domain event publishing; smaller buffers apply backpressure sooner
ffi
Path: common.mpsc_channels.ffi
| Parameter | Type | Default | Description |
|---|---|---|---|
ruby_event_buffer_size | usize | 1000 | Bounded channel capacity for Ruby FFI event delivery |
common.mpsc_channels.ffi.ruby_event_buffer_size
Bounded channel capacity for Ruby FFI event delivery
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers events between the Rust runtime and Ruby FFI layer; overflow triggers backpressure on the dispatch side
overflow_policy
Path: common.mpsc_channels.overflow_policy
| Parameter | Type | Default | Description |
|---|---|---|---|
log_warning_threshold | f64 | 0.8 | Channel saturation fraction at which warning logs are emitted |
common.mpsc_channels.overflow_policy.log_warning_threshold
Channel saturation fraction at which warning logs are emitted
- Type: f64
- Default: 0.8
- Valid Range: 0.0-1.0
- System Impact: A value of 0.8 means warnings fire when any channel reaches 80% capacity
metrics
Path: common.mpsc_channels.overflow_policy.metrics
| Parameter | Type | Default | Description |
|---|---|---|---|
saturation_check_interval_seconds | u32 | 30 | Interval in seconds between channel saturation metric samples |
common.mpsc_channels.overflow_policy.metrics.saturation_check_interval_seconds
Interval in seconds between channel saturation metric samples
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Lower intervals give finer-grained capacity visibility but add sampling overhead
pgmq_database
Path: common.pgmq_database
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable PGMQ messaging subsystem |
url | String | "${PGMQ_DATABASE_URL:-}" | PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database |
common.pgmq_database.enabled
Enable PGMQ messaging subsystem
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, PGMQ queue operations are disabled; only useful if using RabbitMQ as the sole messaging backend
common.pgmq_database.url
PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database
- Type: String
- Default: "${PGMQ_DATABASE_URL:-}"
- Valid Range: valid PostgreSQL connection URI or empty string
- System Impact: Separating PGMQ to its own database isolates messaging I/O from task state queries, reducing contention under heavy load
Related: common.database.url, common.pgmq_database.enabled
pool
Path: common.pgmq_database.pool
| Parameter | Type | Default | Description |
|---|---|---|---|
acquire_timeout_seconds | u32 | 5 | Maximum time to wait when acquiring a connection from the PGMQ pool |
idle_timeout_seconds | u32 | 300 | Time before an idle PGMQ connection is closed and removed from the pool |
max_connections | u32 | 15 | Maximum number of concurrent connections in the PGMQ database pool |
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a PGMQ database connection before replacement |
min_connections | u32 | 3 | Minimum idle connections maintained in the PGMQ database pool |
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which PGMQ pool acquisition is logged as slow |
common.pgmq_database.pool.acquire_timeout_seconds
Maximum time to wait when acquiring a connection from the PGMQ pool
- Type: u32
- Default: 5
- Valid Range: 1-300
- System Impact: Queue operations fail with timeout if no PGMQ connection is available within this window
common.pgmq_database.pool.idle_timeout_seconds
Time before an idle PGMQ connection is closed and removed from the pool
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Controls how quickly the PGMQ pool shrinks after messaging load drops
common.pgmq_database.pool.max_connections
Maximum number of concurrent connections in the PGMQ database pool
- Type: u32
- Default: 15
- Valid Range: 1-500
- System Impact: Separate from the main database pool; size according to messaging throughput requirements
common.pgmq_database.pool.max_lifetime_seconds
Maximum total lifetime of a PGMQ database connection before replacement
- Type: u32
- Default: 1800
- Valid Range: 60-86400
- System Impact: Prevents connection drift in long-running PGMQ connections
common.pgmq_database.pool.min_connections
Minimum idle connections maintained in the PGMQ database pool
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Keeps PGMQ connections warm to avoid cold-start latency on queue operations
common.pgmq_database.pool.slow_acquire_threshold_ms
Threshold in milliseconds above which PGMQ pool acquisition is logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-60000
- System Impact: Observability: slow PGMQ acquire warnings indicate messaging pool pressure
queues
Path: common.queues
| Parameter | Type | Default | Description |
|---|---|---|---|
backend | String | "${TASKER_MESSAGING_BACKEND:-pgmq}" | Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker) |
default_visibility_timeout_seconds | u32 | 30 | Default time a dequeued message remains invisible to other consumers |
naming_pattern | String | "{namespace}_{name}_queue" | Template pattern for constructing queue names from namespace and name |
orchestration_namespace | String | "orchestration" | Namespace prefix for orchestration queue names |
worker_namespace | String | "worker" | Namespace prefix for worker queue names |
common.queues.backend
Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)
- Type: String
- Default: "${TASKER_MESSAGING_BACKEND:-pgmq}"
- Valid Range: pgmq | rabbitmq
- System Impact: Determines the entire message transport layer; pgmq requires only PostgreSQL, rabbitmq requires a separate AMQP broker
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | pgmq or rabbitmq | pgmq for simplicity, rabbitmq for high-throughput push semantics |
| test | pgmq | Single-dependency setup, simpler CI |
Related: common.queues.pgmq, common.queues.rabbitmq
common.queues.default_visibility_timeout_seconds
Default time a dequeued message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: If a consumer fails to process a message within this window, the message becomes visible again for retry
common.queues.naming_pattern
Template pattern for constructing queue names from namespace and name
- Type: String
- Default: "{namespace}_{name}_queue"
- Valid Range: string containing {namespace} and {name} placeholders
- System Impact: Determines the actual PGMQ/RabbitMQ queue names; changing this after deployment requires manual queue migration
common.queues.orchestration_namespace
Namespace prefix for orchestration queue names
- Type: String
- Default: "orchestration"
- Valid Range: non-empty string
- System Impact: Used in queue naming pattern to isolate orchestration queues from worker queues
common.queues.worker_namespace
Namespace prefix for worker queue names
- Type: String
- Default: "worker"
- Valid Range: non-empty string
- System Impact: Used in queue naming pattern to isolate worker queues from orchestration queues
orchestration_queues
Path: common.queues.orchestration_queues
| Parameter | Type | Default | Description |
|---|---|---|---|
step_results | String | "orchestration_step_results" | Queue name for step execution results returned by workers |
task_finalizations | String | "orchestration_task_finalizations" | Queue name for task finalization messages |
task_requests | String | "orchestration_task_requests" | Queue name for incoming task execution requests |
common.queues.orchestration_queues.step_results
Queue name for step execution results returned by workers
- Type: String
- Default: "orchestration_step_results"
- Valid Range: valid queue name
- System Impact: Workers publish step completion results here for the orchestration result processor
common.queues.orchestration_queues.task_finalizations
Queue name for task finalization messages
- Type: String
- Default: "orchestration_task_finalizations"
- Valid Range: valid queue name
- System Impact: Tasks ready for completion evaluation are enqueued here
common.queues.orchestration_queues.task_requests
Queue name for incoming task execution requests
- Type: String
- Default: "orchestration_task_requests"
- Valid Range: valid queue name
- System Impact: The orchestration system reads new task requests from this queue
pgmq
Path: common.queues.pgmq
| Parameter | Type | Default | Description |
|---|---|---|---|
poll_interval_ms | u32 | 500 | Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive |
common.queues.pgmq.poll_interval_ms
Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive
- Type: u32
- Default: 500
- Valid Range: 10-10000
- System Impact: Lower values reduce message latency in polling mode but increase database load; in Hybrid mode this is the fallback interval
queue_depth_thresholds
Path: common.queues.pgmq.queue_depth_thresholds
| Parameter | Type | Default | Description |
|---|---|---|---|
critical_threshold | i64 | 5000 | Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions |
overflow_threshold | i64 | 10000 | Queue depth indicating an emergency condition requiring manual intervention |
common.queues.pgmq.queue_depth_thresholds.critical_threshold
Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions
- Type: i64
- Default: 5000
- Valid Range: 1+
- System Impact: Backpressure mechanism: rejects new work to allow the system to drain existing messages
common.queues.pgmq.queue_depth_thresholds.overflow_threshold
Queue depth indicating an emergency condition requiring manual intervention
- Type: i64
- Default: 10000
- Valid Range: 1+
- System Impact: Highest severity threshold; triggers error-level logging and metrics for operational alerting
rabbitmq
Path: common.queues.rabbitmq
| Parameter | Type | Default | Description |
|---|---|---|---|
heartbeat_seconds | u16 | 30 | AMQP heartbeat interval for connection liveness detection |
prefetch_count | u16 | 100 | Number of unacknowledged messages RabbitMQ will deliver before waiting for acks |
url | String | "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}" | AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’ |
common.queues.rabbitmq.heartbeat_seconds
AMQP heartbeat interval for connection liveness detection
- Type: u16
- Default: 30
- Valid Range: 0-3600
- System Impact: Detects dead connections; 0 disables heartbeats (not recommended in production)
common.queues.rabbitmq.prefetch_count
Number of unacknowledged messages RabbitMQ will deliver before waiting for acks
- Type: u16
- Default: 100
- Valid Range: 1-65535
- System Impact: Controls consumer throughput vs. memory usage; higher values increase throughput but buffer more messages in-process
common.queues.rabbitmq.url
AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’
- Type: String
- Default: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
- Valid Range: valid AMQP URI
- System Impact: Only used when queues.backend = ‘rabbitmq’; must be reachable at startup
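A minimal TOML sketch for selecting and tuning the messaging backend, assuming tables nest to match the parameter paths above (values are the documented defaults):
[common.queues]
backend = "${TASKER_MESSAGING_BACKEND:-pgmq}"
naming_pattern = "{namespace}_{name}_queue"
default_visibility_timeout_seconds = 30

[common.queues.pgmq]
poll_interval_ms = 500

[common.queues.rabbitmq]  # only read when backend = "rabbitmq"
url = "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
prefetch_count = 100
heartbeat_seconds = 30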
system
Path: common.system
| Parameter | Type | Default | Description |
|---|---|---|---|
default_dependent_system | String | "default" | Default system name assigned to tasks that do not specify a dependent system |
common.system.default_dependent_system
Default system name assigned to tasks that do not specify a dependent system
- Type: String
- Default: "default"
- Valid Range: non-empty string
- System Impact: Groups tasks for routing and reporting; most single-system deployments can leave this as default
task_templates
Path: common.task_templates
| Parameter | Type | Default | Description |
|---|---|---|---|
search_paths | Vec<String> | ["config/tasks/**/*.{yml,yaml}"] | Glob patterns for discovering task template YAML files |
common.task_templates.search_paths
Glob patterns for discovering task template YAML files
- Type: Vec<String>
- Default: ["config/tasks/**/*.{yml,yaml}"]
- Valid Range: valid glob patterns
- System Impact: Templates matching these patterns are loaded at startup for task definition discovery
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: common
65/65 parameters documented
backoff
Path: common.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
backoff_multiplier | f64 | 2.0 | Multiplier applied to the previous delay for exponential backoff calculations |
default_backoff_seconds | Vec<u32> | [1, 5, 15, 30, 60] | Sequence of backoff delays in seconds for successive retry attempts |
jitter_enabled | bool | true | Add random jitter to backoff delays to prevent thundering herd on retry |
jitter_max_percentage | f64 | 0.15 | Maximum jitter as a fraction of the computed backoff delay |
max_backoff_seconds | u32 | 3600 | Hard upper limit on any single backoff delay |
common.backoff.backoff_multiplier
Multiplier applied to the previous delay for exponential backoff calculations
- Type:
f64 - Default:
2.0 - Valid Range: 1.0-10.0
- System Impact: Controls how aggressively delays grow; 2.0 means each delay is double the previous
common.backoff.default_backoff_seconds
Sequence of backoff delays in seconds for successive retry attempts
- Type:
Vec<u32> - Default:
[1, 5, 15, 30, 60] - Valid Range: non-empty array of positive integers
- System Impact: Defines the retry cadence; after exhausting the array, the last value is reused up to max_backoff_seconds
common.backoff.jitter_enabled
Add random jitter to backoff delays to prevent thundering herd on retry
- Type:
bool - Default:
true - Valid Range: true/false
- System Impact: When true, backoff delays are randomized within jitter_max_percentage to spread retries across time
common.backoff.jitter_max_percentage
Maximum jitter as a fraction of the computed backoff delay
- Type:
f64 - Default:
0.15 - Valid Range: 0.0-1.0
- System Impact: A value of 0.15 means delays vary by up to +/-15% of the base delay
common.backoff.max_backoff_seconds
Hard upper limit on any single backoff delay
- Type:
u32 - Default:
3600 - Valid Range: 1-3600
- System Impact: Caps exponential backoff growth to prevent excessively long delays between retries
cache
Path: common.cache
| Parameter | Type | Default | Description |
|---|---|---|---|
analytics_ttl_seconds | u32 | 60 | Time-to-live in seconds for cached analytics and metrics data |
backend | String | "redis" | Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process) |
default_ttl_seconds | u32 | 3600 | Default time-to-live in seconds for cached entries |
enabled | bool | false | Enable the distributed cache layer for template and analytics data |
template_ttl_seconds | u32 | 3600 | Time-to-live in seconds for cached task template definitions |
common.cache.analytics_ttl_seconds
Time-to-live in seconds for cached analytics and metrics data
- Type:
u32 - Default:
60 - Valid Range: 1-3600
- System Impact: Analytics data is write-heavy and changes frequently; short TTL (60s) keeps metrics current
common.cache.backend
Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)
- Type:
String - Default:
"redis" - Valid Range: redis | moka
- System Impact: Redis is required for multi-instance deployments to avoid stale data; moka is suitable for single-instance or DoS protection
common.cache.default_ttl_seconds
Default time-to-live in seconds for cached entries
- Type:
u32 - Default:
3600 - Valid Range: 1-86400
- System Impact: Controls how long cached data remains valid before being re-fetched from the database
common.cache.enabled
Enable the distributed cache layer for template and analytics data
- Type:
bool - Default:
false - Valid Range: true/false
- System Impact: When false, all cache reads fall through to direct database queries; no cache dependency required
common.cache.template_ttl_seconds
Time-to-live in seconds for cached task template definitions
- Type:
u32 - Default:
3600 - Valid Range: 1-86400
- System Impact: Template changes take up to this long to propagate; shorter values increase DB load, longer values improve performance
moka
Path: common.cache.moka
| Parameter | Type | Default | Description |
|---|---|---|---|
max_capacity | u64 | 10000 | Maximum number of entries the in-process Moka cache can hold |
common.cache.moka.max_capacity
Maximum number of entries the in-process Moka cache can hold
- Type:
u64 - Default:
10000 - Valid Range: 1-1000000
- System Impact: Bounds memory usage; least-recently-used entries are evicted when capacity is reached
redis
Path: common.cache.redis
| Parameter | Type | Default | Description |
|---|---|---|---|
connection_timeout_seconds | u32 | 5 | Maximum time to wait when establishing a new Redis connection |
database | u32 | 0 | Redis database number (0-15) |
max_connections | u32 | 10 | Maximum number of connections in the Redis connection pool |
url | String | "${REDIS_URL:-redis://localhost:6379}" | Redis connection URL |
common.cache.redis.connection_timeout_seconds
Maximum time to wait when establishing a new Redis connection
- Type:
u32 - Default:
5 - Valid Range: 1-60
- System Impact: Connections that cannot be established within this timeout fail; cache falls back to database
common.cache.redis.database
Redis database number (0-15)
- Type:
u32 - Default:
0 - Valid Range: 0-15
- System Impact: Isolates Tasker cache keys from other applications sharing the same Redis instance
common.cache.redis.max_connections
Maximum number of connections in the Redis connection pool
- Type:
u32 - Default:
10 - Valid Range: 1-500
- System Impact: Bounds concurrent Redis operations; increase for high cache throughput workloads
common.cache.redis.url
Redis connection URL
- Type:
String - Default:
"${REDIS_URL:-redis://localhost:6379}" - Valid Range: valid Redis URI
- System Impact: Must be reachable when cache is enabled with redis backend
circuit_breakers
Path: common.circuit_breakers
component_configs
Path: common.circuit_breakers.component_configs
cache
Path: common.circuit_breakers.component_configs.cache
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the cache circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the cache breaker |
common.circuit_breakers.component_configs.cache.failure_threshold
Failures before the cache circuit breaker trips to Open
- Type:
u32 - Default:
5 - Valid Range: 1-100
- System Impact: Protects Redis/Dragonfly operations; when tripped, cache reads fall through to database
common.circuit_breakers.component_configs.cache.success_threshold
Successes in Half-Open required to close the cache breaker
- Type:
u32 - Default:
2 - Valid Range: 1-100
- System Impact: Low threshold (2) for fast recovery since cache failures gracefully degrade to database
messaging
Path: common.circuit_breakers.component_configs.messaging
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the messaging circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the messaging breaker |
common.circuit_breakers.component_configs.messaging.failure_threshold
Failures before the messaging circuit breaker trips to Open
- Type:
u32 - Default:
5 - Valid Range: 1-100
- System Impact: Protects the messaging layer (PGMQ or RabbitMQ); when tripped, queue send/receive operations are short-circuited
common.circuit_breakers.component_configs.messaging.success_threshold
Successes in Half-Open required to close the messaging breaker
- Type:
u32 - Default:
2 - Valid Range: 1-100
- System Impact: Lower threshold (2) allows faster recovery since messaging failures are typically transient
task_readiness
Path: common.circuit_breakers.component_configs.task_readiness
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 10 | Failures before the task readiness circuit breaker trips to Open |
success_threshold | u32 | 3 | Successes in Half-Open required to close the task readiness breaker |
common.circuit_breakers.component_configs.task_readiness.failure_threshold
Failures before the task readiness circuit breaker trips to Open
- Type:
u32 - Default:
10 - Valid Range: 1-100
- System Impact: Higher than default (10 vs 5) because task readiness queries are frequent and transient failures are expected
common.circuit_breakers.component_configs.task_readiness.success_threshold
Successes in Half-Open required to close the task readiness breaker
- Type:
u32 - Default:
3 - Valid Range: 1-100
- System Impact: Slightly higher than default (3) for extra confidence before resuming readiness queries
web
Path: common.circuit_breakers.component_configs.web
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the web/API database circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the web database breaker |
common.circuit_breakers.component_configs.web.failure_threshold
Failures before the web/API database circuit breaker trips to Open
- Type:
u32 - Default:
5 - Valid Range: 1-100
- System Impact: Protects API database operations; when tripped, API requests receive fast 503 errors instead of waiting for timeouts
common.circuit_breakers.component_configs.web.success_threshold
Successes in Half-Open required to close the web database breaker
- Type:
u32 - Default:
2 - Valid Range: 1-100
- System Impact: Standard threshold (2) provides confidence in recovery before restoring full API traffic
default_config
Path: common.circuit_breakers.default_config
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Number of consecutive failures before a circuit breaker trips to the Open state |
success_threshold | u32 | 2 | Number of consecutive successes in Half-Open state required to close the circuit breaker |
timeout_seconds | u32 | 30 | Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests |
common.circuit_breakers.default_config.failure_threshold
Number of consecutive failures before a circuit breaker trips to the Open state
- Type:
u32 - Default:
5 - Valid Range: 1-100
- System Impact: Lower values make the breaker more sensitive; higher values tolerate more transient failures before tripping
common.circuit_breakers.default_config.success_threshold
Number of consecutive successes in Half-Open state required to close the circuit breaker
- Type:
u32 - Default:
2 - Valid Range: 1-100
- System Impact: Higher values require more proof of recovery before restoring full traffic
common.circuit_breakers.default_config.timeout_seconds
Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests
- Type:
u32 - Default:
30 - Valid Range: 1-300
- System Impact: Controls recovery speed; shorter timeouts attempt recovery sooner but risk repeated failures
global_settings
Path: common.circuit_breakers.global_settings
| Parameter | Type | Default | Description |
|---|---|---|---|
metrics_collection_interval_seconds | u32 | 30 | Interval in seconds between circuit breaker metrics collection sweeps |
min_state_transition_interval_seconds | f64 | 5.0 | Minimum time in seconds between circuit breaker state transitions |
common.circuit_breakers.global_settings.metrics_collection_interval_seconds
Interval in seconds between circuit breaker metrics collection sweeps
- Type:
u32 - Default:
30 - Valid Range: 1-3600
- System Impact: Controls how frequently circuit breaker state, failure counts, and transition counts are collected for observability
common.circuit_breakers.global_settings.min_state_transition_interval_seconds
Minimum time in seconds between circuit breaker state transitions
- Type:
f64 - Default:
5.0 - Valid Range: 0.0-60.0
- System Impact: Prevents rapid oscillation between Open and Closed states during intermittent failures
database
Path: common.database
| Parameter | Type | Default | Description |
|---|---|---|---|
url | String | "${DATABASE_URL:-postgresql://localhost/tasker}" | PostgreSQL connection URL for the primary database |
common.database.url
PostgreSQL connection URL for the primary database
- Type:
String - Default:
"${DATABASE_URL:-postgresql://localhost/tasker}" - Valid Range: valid PostgreSQL connection URI
- System Impact: All task, step, and workflow state is stored here; must be reachable at startup
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| development | postgresql://localhost/tasker | Local default, no auth |
| production | ${DATABASE_URL} | Always use env var injection for secrets rotation |
| test | postgresql://tasker:tasker@localhost:5432/tasker_rust_test | Isolated test database with known credentials |
Related: common.database.pool.max_connections, common.pgmq_database.url
pool
Path: common.database.pool
| Parameter | Type | Default | Description |
|---|---|---|---|
acquire_timeout_seconds | u32 | 10 | Maximum time to wait when acquiring a connection from the pool |
idle_timeout_seconds | u32 | 300 | Time before an idle connection is closed and removed from the pool |
max_connections | u32 | 25 | Maximum number of concurrent database connections in the pool |
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a connection before it is closed and replaced |
min_connections | u32 | 5 | Minimum number of idle connections maintained in the pool |
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which connection acquisition is logged as slow |
common.database.pool.acquire_timeout_seconds
Maximum time to wait when acquiring a connection from the pool
- Type:
u32 - Default:
10 - Valid Range: 1-300
- System Impact: Queries fail with a timeout error if no connection is available within this window
common.database.pool.idle_timeout_seconds
Time before an idle connection is closed and removed from the pool
- Type:
u32 - Default:
300 - Valid Range: 1-3600
- System Impact: Controls how quickly the pool shrinks back to min_connections after load drops
common.database.pool.max_connections
Maximum number of concurrent database connections in the pool
- Type:
u32 - Default:
25 - Valid Range: 1-1000
- System Impact: Controls database connection concurrency; too few causes query queuing under load, too many risks DB resource exhaustion
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| development | 10-25 | Small pool for local development |
| production | 30-50 | Scale based on worker count and concurrent task volume |
| test | 10-30 | Moderate pool; cluster tests may run 10 services sharing the same DB |
Related: common.database.pool.min_connections, common.database.pool.acquire_timeout_seconds
common.database.pool.max_lifetime_seconds
Maximum total lifetime of a connection before it is closed and replaced
- Type:
u32 - Default:
1800 - Valid Range: 60-86400
- System Impact: Prevents connection drift from server-side config changes or memory leaks in long-lived connections
common.database.pool.min_connections
Minimum number of idle connections maintained in the pool
- Type:
u32 - Default:
5 - Valid Range: 0-100
- System Impact: Keeps connections warm to avoid cold-start latency on first queries after idle periods
common.database.pool.slow_acquire_threshold_ms
Threshold in milliseconds above which connection acquisition is logged as slow
- Type:
u32 - Default:
100 - Valid Range: 10-60000
- System Impact: Observability: slow acquire warnings indicate pool pressure or network issues
execution
Path: common.execution
| Parameter | Type | Default | Description |
|---|---|---|---|
environment | String | "development" | Runtime environment identifier used for configuration context selection and logging |
step_enqueue_batch_size | u32 | 50 | Number of steps to enqueue in a single batch during task initialization |
common.execution.environment
Runtime environment identifier used for configuration context selection and logging
- Type:
String - Default:
"development" - Valid Range: test | development | production
- System Impact: Affects log levels, default tuning, and environment-specific behavior throughout the system
common.execution.step_enqueue_batch_size
Number of steps to enqueue in a single batch during task initialization
- Type:
u32 - Default:
50 - Valid Range: 1-1000
- System Impact: Controls step enqueueing throughput; larger batches reduce round trips but increase per-batch latency
mpsc_channels
Path: common.mpsc_channels
event_publisher
Path: common.mpsc_channels.event_publisher
| Parameter | Type | Default | Description |
|---|---|---|---|
event_queue_buffer_size | usize | 5000 | Bounded channel capacity for the event publisher MPSC channel |
common.mpsc_channels.event_publisher.event_queue_buffer_size
Bounded channel capacity for the event publisher MPSC channel
- Type:
usize - Default:
5000 - Valid Range: 100-100000
- System Impact: Controls backpressure for domain event publishing; smaller buffers apply backpressure sooner
ffi
Path: common.mpsc_channels.ffi
| Parameter | Type | Default | Description |
|---|---|---|---|
ruby_event_buffer_size | usize | 1000 | Bounded channel capacity for Ruby FFI event delivery |
common.mpsc_channels.ffi.ruby_event_buffer_size
Bounded channel capacity for Ruby FFI event delivery
- Type:
usize - Default:
1000 - Valid Range: 100-50000
- System Impact: Buffers events between the Rust runtime and Ruby FFI layer; overflow triggers backpressure on the dispatch side
overflow_policy
Path: common.mpsc_channels.overflow_policy
| Parameter | Type | Default | Description |
|---|---|---|---|
log_warning_threshold | f64 | 0.8 | Channel saturation fraction at which warning logs are emitted |
common.mpsc_channels.overflow_policy.log_warning_threshold
Channel saturation fraction at which warning logs are emitted
- Type:
f64 - Default:
0.8 - Valid Range: 0.0-1.0
- System Impact: A value of 0.8 means warnings fire when any channel reaches 80% capacity
metrics
Path: common.mpsc_channels.overflow_policy.metrics
| Parameter | Type | Default | Description |
|---|---|---|---|
saturation_check_interval_seconds | u32 | 30 | Interval in seconds between channel saturation metric samples |
common.mpsc_channels.overflow_policy.metrics.saturation_check_interval_seconds
Interval in seconds between channel saturation metric samples
- Type:
u32 - Default:
30 - Valid Range: 1-3600
- System Impact: Lower intervals give finer-grained capacity visibility but add sampling overhead
pgmq_database
Path: common.pgmq_database
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable PGMQ messaging subsystem |
url | String | "${PGMQ_DATABASE_URL:-}" | PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database |
common.pgmq_database.enabled
Enable PGMQ messaging subsystem
- Type:
bool - Default:
true - Valid Range: true/false
- System Impact: When false, PGMQ queue operations are disabled; only useful if using RabbitMQ as the sole messaging backend
common.pgmq_database.url
PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database
- Type:
String - Default:
"${PGMQ_DATABASE_URL:-}" - Valid Range: valid PostgreSQL connection URI or empty string
- System Impact: Separating PGMQ to its own database isolates messaging I/O from task state queries, reducing contention under heavy load
Related: common.database.url, common.pgmq_database.enabled
pool
Path: common.pgmq_database.pool
| Parameter | Type | Default | Description |
|---|---|---|---|
acquire_timeout_seconds | u32 | 5 | Maximum time to wait when acquiring a connection from the PGMQ pool |
idle_timeout_seconds | u32 | 300 | Time before an idle PGMQ connection is closed and removed from the pool |
max_connections | u32 | 15 | Maximum number of concurrent connections in the PGMQ database pool |
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a PGMQ database connection before replacement |
min_connections | u32 | 3 | Minimum idle connections maintained in the PGMQ database pool |
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which PGMQ pool acquisition is logged as slow |
common.pgmq_database.pool.acquire_timeout_seconds
Maximum time to wait when acquiring a connection from the PGMQ pool
- Type: u32
- Default: 5
- Valid Range: 1-300
- System Impact: Queue operations fail with timeout if no PGMQ connection is available within this window
common.pgmq_database.pool.idle_timeout_seconds
Time before an idle PGMQ connection is closed and removed from the pool
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Controls how quickly the PGMQ pool shrinks after messaging load drops
common.pgmq_database.pool.max_connections
Maximum number of concurrent connections in the PGMQ database pool
- Type: u32
- Default: 15
- Valid Range: 1-500
- System Impact: Separate from the main database pool; size according to messaging throughput requirements
common.pgmq_database.pool.max_lifetime_seconds
Maximum total lifetime of a PGMQ database connection before replacement
- Type: u32
- Default: 1800
- Valid Range: 60-86400
- System Impact: Prevents connection drift in long-running PGMQ connections
common.pgmq_database.pool.min_connections
Minimum idle connections maintained in the PGMQ database pool
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Keeps PGMQ connections warm to avoid cold-start latency on queue operations
common.pgmq_database.pool.slow_acquire_threshold_ms
Threshold in milliseconds above which PGMQ pool acquisition is logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-60000
- System Impact: Observability: slow PGMQ acquire warnings indicate messaging pool pressure
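To isolate messaging I/O from task-state queries, PGMQ can point at its own database. The sketch below restates the documented defaults in an illustrative YAML layout mirroring the parameter paths; the exact file format is an assumption.

```yaml
common:
  pgmq_database:
    enabled: true
    url: "${PGMQ_DATABASE_URL:-}"     # empty -> PGMQ shares the primary database
    pool:
      min_connections: 3
      max_connections: 15
      acquire_timeout_seconds: 5
      idle_timeout_seconds: 300
      max_lifetime_seconds: 1800
      slow_acquire_threshold_ms: 100  # log acquisitions slower than this
```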
queues
Path: common.queues
| Parameter | Type | Default | Description |
|---|---|---|---|
backend | String | "${TASKER_MESSAGING_BACKEND:-pgmq}" | Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker) |
default_visibility_timeout_seconds | u32 | 30 | Default time a dequeued message remains invisible to other consumers |
naming_pattern | String | "{namespace}_{name}_queue" | Template pattern for constructing queue names from namespace and name |
orchestration_namespace | String | "orchestration" | Namespace prefix for orchestration queue names |
worker_namespace | String | "worker" | Namespace prefix for worker queue names |
common.queues.backend
Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)
- Type: String
- Default: "${TASKER_MESSAGING_BACKEND:-pgmq}"
- Valid Range: pgmq | rabbitmq
- System Impact: Determines the entire message transport layer; pgmq requires only PostgreSQL, rabbitmq requires a separate AMQP broker
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | pgmq or rabbitmq | pgmq for simplicity, rabbitmq for high-throughput push semantics |
| test | pgmq | Single-dependency setup, simpler CI |
Related: common.queues.pgmq, common.queues.rabbitmq
common.queues.default_visibility_timeout_seconds
Default time a dequeued message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: If a consumer fails to process a message within this window, the message becomes visible again for retry
common.queues.naming_pattern
Template pattern for constructing queue names from namespace and name
- Type: String
- Default: "{namespace}_{name}_queue"
- Valid Range: string containing {namespace} and {name} placeholders
- System Impact: Determines the actual PGMQ/RabbitMQ queue names; changing this after deployment requires manual queue migration
common.queues.orchestration_namespace
Namespace prefix for orchestration queue names
- Type: String
- Default: "orchestration"
- Valid Range: non-empty string
- System Impact: Used in queue naming pattern to isolate orchestration queues from worker queues
common.queues.worker_namespace
Namespace prefix for worker queue names
- Type: String
- Default: "worker"
- Valid Range: non-empty string
- System Impact: Used in queue naming pattern to isolate worker queues from orchestration queues
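Putting the naming parameters together: a queue name is produced by substituting the namespace and name into naming_pattern. The YAML below is an illustrative layout using the documented defaults; the example name "payments" is hypothetical.

```yaml
common:
  queues:
    backend: "${TASKER_MESSAGING_BACKEND:-pgmq}"
    default_visibility_timeout_seconds: 30
    naming_pattern: "{namespace}_{name}_queue"
    orchestration_namespace: "orchestration"
    worker_namespace: "worker"
# Applying the pattern literally: a namespace of "worker" and a name of
# "payments" (hypothetical) would yield the queue name "worker_payments_queue".
```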
orchestration_queues
Path: common.queues.orchestration_queues
| Parameter | Type | Default | Description |
|---|---|---|---|
step_results | String | "orchestration_step_results" | Queue name for step execution results returned by workers |
task_finalizations | String | "orchestration_task_finalizations" | Queue name for task finalization messages |
task_requests | String | "orchestration_task_requests" | Queue name for incoming task execution requests |
common.queues.orchestration_queues.step_results
Queue name for step execution results returned by workers
- Type: String
- Default: "orchestration_step_results"
- Valid Range: valid queue name
- System Impact: Workers publish step completion results here for the orchestration result processor
common.queues.orchestration_queues.task_finalizations
Queue name for task finalization messages
- Type: String
- Default: "orchestration_task_finalizations"
- Valid Range: valid queue name
- System Impact: Tasks ready for completion evaluation are enqueued here
common.queues.orchestration_queues.task_requests
Queue name for incoming task execution requests
- Type: String
- Default: "orchestration_task_requests"
- Valid Range: valid queue name
- System Impact: The orchestration system reads new task requests from this queue
pgmq
Path: common.queues.pgmq
| Parameter | Type | Default | Description |
|---|---|---|---|
poll_interval_ms | u32 | 500 | Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive |
common.queues.pgmq.poll_interval_ms
Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive
- Type: u32
- Default: 500
- Valid Range: 10-10000
- System Impact: Lower values reduce message latency in polling mode but increase database load; in Hybrid mode this is the fallback interval
queue_depth_thresholds
Path: common.queues.pgmq.queue_depth_thresholds
| Parameter | Type | Default | Description |
|---|---|---|---|
critical_threshold | i64 | 5000 | Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions |
overflow_threshold | i64 | 10000 | Queue depth indicating an emergency condition requiring manual intervention |
common.queues.pgmq.queue_depth_thresholds.critical_threshold
Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions
- Type: i64
- Default: 5000
- Valid Range: 1+
- System Impact: Backpressure mechanism: rejects new work to allow the system to drain existing messages
common.queues.pgmq.queue_depth_thresholds.overflow_threshold
Queue depth indicating an emergency condition requiring manual intervention
- Type: i64
- Default: 10000
- Valid Range: 1+
- System Impact: Highest severity threshold; triggers error-level logging and metrics for operational alerting
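The two thresholds form a tiered backpressure policy: at the critical threshold the API sheds new work with HTTP 503, and at the overflow threshold error-level logging and alerting metrics fire. An illustrative YAML sketch of this group, assuming a layout that mirrors the documented paths:

```yaml
common:
  queues:
    pgmq:
      poll_interval_ms: 500
      queue_depth_thresholds:
        critical_threshold: 5000    # API returns 503 for new task submissions
        overflow_threshold: 10000   # emergency: error logs + metrics for alerting
```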
rabbitmq
Path: common.queues.rabbitmq
| Parameter | Type | Default | Description |
|---|---|---|---|
heartbeat_seconds | u16 | 30 | AMQP heartbeat interval for connection liveness detection |
prefetch_count | u16 | 100 | Number of unacknowledged messages RabbitMQ will deliver before waiting for acks |
url | String | "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}" | AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’ |
common.queues.rabbitmq.heartbeat_seconds
AMQP heartbeat interval for connection liveness detection
- Type: u16
- Default: 30
- Valid Range: 0-3600
- System Impact: Detects dead connections; 0 disables heartbeats (not recommended in production)
common.queues.rabbitmq.prefetch_count
Number of unacknowledged messages RabbitMQ will deliver before waiting for acks
- Type: u16
- Default: 100
- Valid Range: 1-65535
- System Impact: Controls consumer throughput vs. memory usage; higher values increase throughput but buffer more messages in-process
common.queues.rabbitmq.url
AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’
- Type: String
- Default: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
- Valid Range: valid AMQP URI
- System Impact: Only used when queues.backend = ‘rabbitmq’; must be reachable at startup
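For deployments that choose the AMQP backend, these settings only take effect when backend is set to rabbitmq. An illustrative YAML sketch with the documented defaults (nesting assumed from the parameter paths):

```yaml
common:
  queues:
    backend: "rabbitmq"
    rabbitmq:
      url: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
      heartbeat_seconds: 30    # 0 disables heartbeats (not recommended in production)
      prefetch_count: 100      # unacked messages delivered before waiting for acks
```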
system
Path: common.system
| Parameter | Type | Default | Description |
|---|---|---|---|
default_dependent_system | String | "default" | Default system name assigned to tasks that do not specify a dependent system |
common.system.default_dependent_system
Default system name assigned to tasks that do not specify a dependent system
- Type: String
- Default: "default"
- Valid Range: non-empty string
- System Impact: Groups tasks for routing and reporting; most single-system deployments can leave this at the default
task_templates
Path: common.task_templates
| Parameter | Type | Default | Description |
|---|---|---|---|
search_paths | Vec<String> | ["config/tasks/**/*.{yml,yaml}"] | Glob patterns for discovering task template YAML files |
common.task_templates.search_paths
Glob patterns for discovering task template YAML files
- Type: Vec<String>
- Default: ["config/tasks/**/*.{yml,yaml}"]
- Valid Range: valid glob patterns
- System Impact: Templates matching these patterns are loaded at startup for task definition discovery
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: orchestration
91/91 parameters documented
orchestration
Root-level orchestration parameters
Path: orchestration
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_performance_logging | bool | true | Enable detailed performance logging for orchestration actors |
shutdown_timeout_ms | u64 | 30000 | Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown |
orchestration.enable_performance_logging
Enable detailed performance logging for orchestration actors
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Emits timing metrics for task processing, step enqueueing, and result evaluation; disable in production if log volume is a concern
orchestration.shutdown_timeout_ms
Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown
- Type: u64
- Default: 30000
- Valid Range: 1000-300000
- System Impact: If shutdown exceeds this timeout, the process exits forcefully to avoid hanging indefinitely; 30s is conservative for most deployments
batch_processing
Path: orchestration.batch_processing
| Parameter | Type | Default | Description |
|---|---|---|---|
checkpoint_stall_minutes | u32 | 15 | Minutes without a checkpoint update before a batch is considered stalled |
default_batch_size | u32 | 1000 | Default number of items in a single batch when not specified by the handler |
enabled | bool | true | Enable the batch processing subsystem for large-scale step execution |
max_parallel_batches | u32 | 50 | Maximum number of batch operations that can execute concurrently |
orchestration.batch_processing.checkpoint_stall_minutes
Minutes without a checkpoint update before a batch is considered stalled
- Type: u32
- Default: 15
- Valid Range: 1-1440
- System Impact: Stalled batches are flagged for investigation or automatic recovery; lower values detect issues faster
orchestration.batch_processing.default_batch_size
Default number of items in a single batch when not specified by the handler
- Type: u32
- Default: 1000
- Valid Range: 1-100000
- System Impact: Larger batches improve throughput but increase memory usage and per-batch latency
orchestration.batch_processing.enabled
Enable the batch processing subsystem for large-scale step execution
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, batch step handlers cannot be used; all steps must be processed individually
orchestration.batch_processing.max_parallel_batches
Maximum number of batch operations that can execute concurrently
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Bounds resource usage from concurrent batch processing; increase for high-throughput batch workloads
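For quick reference, the batch-processing knobs above combine as follows in an illustrative YAML layout mirroring the documented paths:

```yaml
orchestration:
  batch_processing:
    enabled: true
    default_batch_size: 1000        # throughput vs. memory and per-batch latency
    max_parallel_batches: 50        # bounds concurrent batch operations
    checkpoint_stall_minutes: 15    # flag batches with no checkpoint progress
```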
decision_points
Path: orchestration.decision_points
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_detailed_logging | bool | false | Enable verbose logging of decision point evaluation including expression results |
enable_metrics | bool | true | Enable metrics collection for decision point evaluations |
enabled | bool | true | Enable the decision point evaluation subsystem for conditional workflow branching |
max_decision_depth | u32 | 20 | Maximum depth of nested decision point chains |
max_steps_per_decision | u32 | 100 | Maximum number of steps that can be generated by a single decision point evaluation |
warn_threshold_depth | u32 | 10 | Decision depth above which a warning is logged |
warn_threshold_steps | u32 | 50 | Number of steps per decision above which a warning is logged |
orchestration.decision_points.enable_detailed_logging
Enable verbose logging of decision point evaluation including expression results
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: Produces high-volume logs; enable only for debugging specific decision point behavior
orchestration.decision_points.enable_metrics
Enable metrics collection for decision point evaluations
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks evaluation counts, timings, and branch selection distribution
orchestration.decision_points.enabled
Enable the decision point evaluation subsystem for conditional workflow branching
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, all decision points are skipped and conditional steps are not evaluated
orchestration.decision_points.max_decision_depth
Maximum depth of nested decision point chains
- Type: u32
- Default: 20
- Valid Range: 1-100
- System Impact: Prevents infinite recursion from circular decision point references
orchestration.decision_points.max_steps_per_decision
Maximum number of steps that can be generated by a single decision point evaluation
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Safety limit to prevent decision points from creating unbounded step graphs
orchestration.decision_points.warn_threshold_depth
Decision depth above which a warning is logged
- Type: u32
- Default: 10
- Valid Range: 1-100
- System Impact: Observability: identifies deeply nested decision chains that may indicate design issues
orchestration.decision_points.warn_threshold_steps
Number of steps per decision above which a warning is logged
- Type: u32
- Default: 50
- Valid Range: 1-10000
- System Impact: Observability: identifies decision points that generate unusually large step sets
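The decision-point limits are layered: the warn thresholds fire first as observability signals, and the hard maxima stop runaway step graphs. An illustrative YAML sketch, assuming a layout that mirrors the documented paths:

```yaml
orchestration:
  decision_points:
    enabled: true
    max_decision_depth: 20       # hard cap; guards against circular decision chains
    warn_threshold_depth: 10     # warn well before the hard cap
    max_steps_per_decision: 100  # hard cap on steps generated per evaluation
    warn_threshold_steps: 50     # warning threshold for unusually large step sets
    enable_metrics: true
    enable_detailed_logging: false
```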
dlq
Path: orchestration.dlq
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable the Dead Letter Queue subsystem for handling unrecoverable tasks |
orchestration.dlq.enabled
Enable the Dead Letter Queue subsystem for handling unrecoverable tasks
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, stale or failed tasks remain in their error state without DLQ routing
staleness_detection
Path: orchestration.dlq.staleness_detection
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 100 | Number of potentially stale tasks to evaluate in a single detection sweep |
detection_interval_seconds | u32 | 300 | Interval in seconds between staleness detection sweeps |
dry_run | bool | false | Run staleness detection in observation-only mode without taking action |
enabled | bool | true | Enable periodic scanning for stale tasks |
orchestration.dlq.staleness_detection.batch_size
Number of potentially stale tasks to evaluate in a single detection sweep
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Larger batches process more stale tasks per sweep but increase per-sweep query cost
orchestration.dlq.staleness_detection.detection_interval_seconds
Interval in seconds between staleness detection sweeps
- Type: u32
- Default: 300
- Valid Range: 30-3600
- System Impact: Lower values detect stale tasks faster but increase database query frequency
orchestration.dlq.staleness_detection.dry_run
Run staleness detection in observation-only mode without taking action
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: Logs what would be DLQ’d without actually transitioning tasks; useful for tuning thresholds
orchestration.dlq.staleness_detection.enabled
Enable periodic scanning for stale tasks
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no automatic staleness detection runs; tasks must be manually DLQ’d
actions
Path: orchestration.dlq.staleness_detection.actions
| Parameter | Type | Default | Description |
|---|---|---|---|
auto_move_to_dlq | bool | true | Automatically move stale tasks to the DLQ after transitioning to error |
auto_transition_to_error | bool | true | Automatically transition stale tasks to the Error state |
emit_events | bool | true | Emit domain events when staleness is detected |
event_channel | String | "task_staleness_detected" | PGMQ channel name for staleness detection events |
orchestration.dlq.staleness_detection.actions.auto_move_to_dlq
Automatically move stale tasks to the DLQ after transitioning to error
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, stale tasks are routed to the DLQ; when false, they remain in Error state for manual review
orchestration.dlq.staleness_detection.actions.auto_transition_to_error
Automatically transition stale tasks to the Error state
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, stale tasks are moved to Error before DLQ routing; when false, tasks stay in their current state
orchestration.dlq.staleness_detection.actions.emit_events
Emit domain events when staleness is detected
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, staleness events are published to the event_channel for external alerting or custom handling
orchestration.dlq.staleness_detection.actions.event_channel
PGMQ channel name for staleness detection events
- Type: String
- Default: "task_staleness_detected"
- Valid Range: 1-255 characters
- System Impact: Consumers can subscribe to this channel for alerting or custom staleness handling
thresholds
Path: orchestration.dlq.staleness_detection.thresholds
| Parameter | Type | Default | Description |
|---|---|---|---|
steps_in_process_minutes | u32 | 30 | Minutes a task can have steps in process before being considered stale |
task_max_lifetime_hours | u32 | 24 | Absolute maximum lifetime for any task regardless of state |
waiting_for_dependencies_minutes | u32 | 60 | Minutes a task can wait for step dependencies before being considered stale |
waiting_for_retry_minutes | u32 | 30 | Minutes a task can wait for step retries before being considered stale |
orchestration.dlq.staleness_detection.thresholds.steps_in_process_minutes
Minutes a task can have steps in process before being considered stale
- Type: u32
- Default: 30
- Valid Range: 1-1440
- System Impact: Tasks in StepsInProcess state exceeding this age may indicate hung workers and are flagged for investigation
orchestration.dlq.staleness_detection.thresholds.task_max_lifetime_hours
Absolute maximum lifetime for any task regardless of state
- Type: u32
- Default: 24
- Valid Range: 1-168
- System Impact: Hard cap; tasks exceeding this age are considered stale even if actively processing
orchestration.dlq.staleness_detection.thresholds.waiting_for_dependencies_minutes
Minutes a task can wait for step dependencies before being considered stale
- Type: u32
- Default: 60
- Valid Range: 1-1440
- System Impact: Tasks in WaitingForDependencies state exceeding this age are flagged for DLQ consideration
orchestration.dlq.staleness_detection.thresholds.waiting_for_retry_minutes
Minutes a task can wait for step retries before being considered stale
- Type: u32
- Default: 30
- Valid Range: 1-1440
- System Impact: Tasks in WaitingForRetry state exceeding this age are flagged for DLQ consideration
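A common rollout pattern is to start staleness detection in dry-run mode, watch what would be DLQ'd, then enable the automatic actions. The YAML below is an illustrative layout of the documented parameters for that observation-only first phase; the nesting is assumed from the parameter paths.

```yaml
orchestration:
  dlq:
    enabled: true
    staleness_detection:
      enabled: true
      dry_run: true                 # log candidates without transitioning tasks
      detection_interval_seconds: 300
      batch_size: 100
      thresholds:
        waiting_for_dependencies_minutes: 60
        waiting_for_retry_minutes: 30
        steps_in_process_minutes: 30
        task_max_lifetime_hours: 24
      actions:                      # take effect once dry_run is set to false
        auto_transition_to_error: true
        auto_move_to_dlq: true
        emit_events: true
        event_channel: "task_staleness_detected"
```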
event_systems
Path: orchestration.event_systems
orchestration
Path: orchestration.event_systems.orchestration
| Parameter | Type | Default | Description |
|---|---|---|---|
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
system_id | String | "orchestration-event-system" | Unique identifier for the orchestration event system instance |
orchestration.event_systems.orchestration.deployment_mode
Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
orchestration.event_systems.orchestration.system_id
Unique identifier for the orchestration event system instance
- Type: String
- Default: "orchestration-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish this event system from others
health
Path: orchestration.event_systems.orchestration.health
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable health monitoring for the orchestration event system |
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the event system reports as unhealthy |
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the event system reports as unhealthy |
performance_monitoring_enabled | bool | true | Enable detailed performance metrics collection for event processing |
orchestration.event_systems.orchestration.health.enabled
Enable health monitoring for the orchestration event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks or error tracking run for this event system
orchestration.event_systems.orchestration.health.error_rate_threshold_per_minute
Error rate per minute above which the event system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal; complements max_consecutive_errors for burst error detection
orchestration.event_systems.orchestration.health.max_consecutive_errors
Number of consecutive errors before the event system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation after sustained failures; resets on any success
orchestration.event_systems.orchestration.health.performance_monitoring_enabled
Enable detailed performance metrics collection for event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks processing latency percentiles and throughput; adds minor overhead
processing
Path: orchestration.event_systems.orchestration.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 20 | Number of events dequeued in a single batch read |
max_concurrent_operations | u32 | 50 | Maximum number of events processed concurrently by the orchestration event system |
max_retries | u32 | 3 | Maximum retry attempts for a failed event processing operation |
orchestration.event_systems.orchestration.processing.batch_size
Number of events dequeued in a single batch read
- Type: u32
- Default: 20
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput but increase per-batch processing time
orchestration.event_systems.orchestration.processing.max_concurrent_operations
Maximum number of events processed concurrently by the orchestration event system
- Type: u32
- Default: 50
- Valid Range: 1-10000
- System Impact: Controls parallelism for task request, result, and finalization processing
orchestration.event_systems.orchestration.processing.max_retries
Maximum retry attempts for a failed event processing operation
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Events exceeding this retry count are dropped or sent to the DLQ
backoff
Path: orchestration.event_systems.orchestration.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first event processing failure |
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between event processing retries |
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
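With these defaults, each retry delay grows geometrically from initial_delay_ms by multiplier per consecutive failure, capped at max_delay_ms, with up to 10% jitter applied. A worked sketch of the resulting schedule in illustrative YAML (nesting assumed from the documented paths):

```yaml
orchestration:
  event_systems:
    orchestration:
      processing:
        backoff:
          initial_delay_ms: 100
          multiplier: 2.0
          max_delay_ms: 10000
          jitter_percent: 0.1
# Approximate delays per consecutive failure, before jitter:
#   100 ms, 200 ms, 400 ms, 800 ms, 1600 ms, ... capped at 10,000 ms around the
#   8th failure. With the default processing.max_retries of 3, only the first
#   few delays apply in practice; the cap matters if retries are raised.
```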
timing
Path: orchestration.event_systems.orchestration.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
claim_timeout_seconds | u32 | 300 | Maximum time in seconds an event claim remains valid |
fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable |
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the orchestration event system |
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single event |
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued message remains invisible to other consumers |
orchestration.event_systems.orchestration.timing.claim_timeout_seconds
Maximum time in seconds an event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned claims from blocking event processing indefinitely
orchestration.event_systems.orchestration.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Only active in Hybrid mode when event-driven delivery fails; lower values reduce latency but increase DB load
orchestration.event_systems.orchestration.timing.health_check_interval_seconds
Interval in seconds between health check probes for the orchestration event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the event system verifies its own connectivity and responsiveness
orchestration.event_systems.orchestration.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Events exceeding this timeout are considered failed and may be retried
orchestration.event_systems.orchestration.timing.visibility_timeout_seconds
Time in seconds a dequeued message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: If processing is not completed within this window, the message becomes visible again for redelivery
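Taken together, the orchestration event system is configured by its deployment mode, processing limits, and timing windows. An illustrative YAML sketch of the documented defaults, assuming a layout that mirrors the parameter paths:

```yaml
orchestration:
  event_systems:
    orchestration:
      system_id: "orchestration-event-system"
      deployment_mode: "Hybrid"     # LISTEN/NOTIFY with polling fallback
      processing:
        batch_size: 20
        max_concurrent_operations: 50
        max_retries: 3
      timing:
        visibility_timeout_seconds: 30
        processing_timeout_seconds: 60
        claim_timeout_seconds: 300
        fallback_polling_interval_seconds: 5
        health_check_interval_seconds: 30
```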
task_readiness
Path: orchestration.event_systems.task_readiness
| Parameter | Type | Default | Description |
|---|---|---|---|
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’ |
system_id | String | "task-readiness-event-system" | Unique identifier for the task readiness event system instance |
orchestration.event_systems.task_readiness.deployment_mode
Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; task readiness events trigger step processing and benefit from low-latency delivery
orchestration.event_systems.task_readiness.system_id
Unique identifier for the task readiness event system instance
- Type: String
- Default: "task-readiness-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish task readiness events from other event systems
health
Path: orchestration.event_systems.task_readiness.health
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable health monitoring for the task readiness event system |
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the task readiness system reports as unhealthy |
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the task readiness system reports as unhealthy |
performance_monitoring_enabled | bool | true | Enable detailed performance metrics for task readiness event processing |
orchestration.event_systems.task_readiness.health.enabled
Enable health monitoring for the task readiness event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks run for task readiness processing
orchestration.event_systems.task_readiness.health.error_rate_threshold_per_minute
Error rate per minute above which the task readiness system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal complementing max_consecutive_errors
orchestration.event_systems.task_readiness.health.max_consecutive_errors
Number of consecutive errors before the task readiness system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation; resets on any successful readiness check
orchestration.event_systems.task_readiness.health.performance_monitoring_enabled
Enable detailed performance metrics for task readiness event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks readiness check latency and throughput; useful for tuning batch_size and concurrency
processing
Path: orchestration.event_systems.task_readiness.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 50 | Number of task readiness events dequeued in a single batch |
max_concurrent_operations | u32 | 100 | Maximum number of task readiness events processed concurrently |
max_retries | u32 | 3 | Maximum retry attempts for a failed task readiness event |
orchestration.event_systems.task_readiness.processing.batch_size
Number of task readiness events dequeued in a single batch
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput for readiness evaluation; 50 balances latency and throughput
orchestration.event_systems.task_readiness.processing.max_concurrent_operations
Maximum number of task readiness events processed concurrently
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Higher than orchestration (100 vs 50) because readiness checks are lightweight SQL queries
orchestration.event_systems.task_readiness.processing.max_retries
Maximum retry attempts for a failed task readiness event
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Readiness events are idempotent so retries are safe; limits retry storms
backoff
Path: orchestration.event_systems.task_readiness.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first task readiness processing failure |
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay for readiness retries |
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds for task readiness retries |
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive readiness failure |
timing
Path: orchestration.event_systems.task_readiness.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
claim_timeout_seconds | u32 | 300 | Maximum time in seconds a task readiness event claim remains valid |
fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles for task readiness |
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the task readiness event system |
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single task readiness event |
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued task readiness message remains invisible to other consumers |
orchestration.event_systems.task_readiness.timing.claim_timeout_seconds
Maximum time in seconds a task readiness event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned readiness claims from blocking task evaluation
orchestration.event_systems.task_readiness.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles for task readiness
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Fallback interval when LISTEN/NOTIFY is unavailable; lower values improve responsiveness
orchestration.event_systems.task_readiness.timing.health_check_interval_seconds
Interval in seconds between health check probes for the task readiness event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the task readiness system verifies its own connectivity
orchestration.event_systems.task_readiness.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single task readiness event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Readiness events exceeding this timeout are considered failed
orchestration.event_systems.task_readiness.timing.visibility_timeout_seconds
Time in seconds a dequeued task readiness message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Prevents duplicate processing of readiness events during normal operation
grpc
Path: orchestration.grpc
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}" | Socket address for the gRPC server |
enable_health_service | bool | true | Enable the gRPC health checking service (grpc.health.v1) |
enable_reflection | bool | true | Enable gRPC server reflection for service discovery |
enabled | bool | true | Enable the gRPC API server alongside the REST API |
keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames |
keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
max_concurrent_streams | u32 | 200 | Maximum number of concurrent gRPC streams per connection |
max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame |
tls_enabled | bool | false | Enable TLS encryption for gRPC connections |
orchestration.grpc.bind_address
Socket address for the gRPC server
- Type: String
- Default: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
- Valid Range: host:port
- System Impact: Must not conflict with the REST API bind_address; default 9190 avoids Prometheus port conflict
orchestration.grpc.enable_health_service
Enable the gRPC health checking service (grpc.health.v1)
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators
orchestration.grpc.enable_reflection
Enable gRPC server reflection for service discovery
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Allows tools like grpcurl to list and inspect services; safe to enable in development, consider disabling in production
orchestration.grpc.enabled
Enable the gRPC API server alongside the REST API
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no gRPC endpoints are available; clients must use REST
orchestration.grpc.keepalive_interval_seconds
Interval in seconds between gRPC keepalive ping frames
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Detects dead connections; lower values detect failures faster but increase network overhead
orchestration.grpc.keepalive_timeout_seconds
Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
- Type: u32
- Default: 20
- Valid Range: 1-300
- System Impact: Connections that fail to acknowledge within this window are considered dead and closed
orchestration.grpc.max_concurrent_streams
Maximum number of concurrent gRPC streams per connection
- Type: u32
- Default: 200
- Valid Range: 1-10000
- System Impact: Limits multiplexed request parallelism per connection; 200 is conservative for orchestration workloads
orchestration.grpc.max_frame_size
Maximum size in bytes of a single HTTP/2 frame
- Type: u32
- Default: 16384
- Valid Range: 16384-16777215
- System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream
orchestration.grpc.tls_enabled
Enable TLS encryption for gRPC connections
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When true, tls_cert_path and tls_key_path must be provided; required for production gRPC deployments
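For a production-leaning gRPC setup, the main decisions are the bind address, TLS, and whether to expose reflection. The sketch below uses the documented parameters in an illustrative YAML layout; the TLS certificate and key settings are referenced only in a comment because their values are deployment-specific.

```yaml
orchestration:
  grpc:
    enabled: true
    bind_address: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
    enable_health_service: true   # needed for LB / container-orchestrator health checks
    enable_reflection: false      # consider disabling outside development
    tls_enabled: true             # tls_cert_path and tls_key_path must also be set (not shown)
    keepalive_interval_seconds: 30
    keepalive_timeout_seconds: 20
    max_concurrent_streams: 200
    max_frame_size: 16384
```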
mpsc_channels
Path: orchestration.mpsc_channels
command_processor
Path: orchestration.mpsc_channels.command_processor
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 5000 | Bounded channel capacity for the orchestration command processor |
orchestration.mpsc_channels.command_processor.command_buffer_size
Bounded channel capacity for the orchestration command processor
- Type: usize
- Default: 5000
- Valid Range: 100-100000
- System Impact: Buffers incoming orchestration commands; larger values absorb traffic spikes but use more memory
event_listeners
Path: orchestration.mpsc_channels.event_listeners
| Parameter | Type | Default | Description |
|---|---|---|---|
pgmq_event_buffer_size | usize | 50000 | Bounded channel capacity for PGMQ event listener notifications |
orchestration.mpsc_channels.event_listeners.pgmq_event_buffer_size
Bounded channel capacity for PGMQ event listener notifications
- Type: usize
- Default: 50000
- Valid Range: 1000-1000000
- System Impact: Large buffer (50000) absorbs high-volume PGMQ LISTEN/NOTIFY events without backpressure on PostgreSQL
event_systems
Path: orchestration.mpsc_channels.event_systems
| Parameter | Type | Default | Description |
|---|---|---|---|
event_channel_buffer_size | usize | 10000 | Bounded channel capacity for the orchestration event system internal channel |
orchestration.mpsc_channels.event_systems.event_channel_buffer_size
Bounded channel capacity for the orchestration event system internal channel
- Type: usize
- Default: 10000
- Valid Range: 100-100000
- System Impact: Buffers events between the event listener and event processor; larger values absorb notification bursts
web
Path: orchestration.web
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}" | Socket address for the REST API server |
enabled | bool | true | Enable the REST API server for the orchestration service |
request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for an HTTP request to complete before timeout |
orchestration.web.bind_address
Socket address for the REST API server
- Type: String
- Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
- Valid Range: host:port
- System Impact: Determines where the orchestration REST API listens; use 0.0.0.0 for container deployments
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8080 | Standard port; use TASKER_WEB_BIND_ADDRESS env var to override in CI |
| test | 0.0.0.0:8080 | Default port for test fixtures |
orchestration.web.enabled
Enable the REST API server for the orchestration service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no HTTP endpoints are available; the service operates via messaging only
orchestration.web.request_timeout_ms
Maximum time in milliseconds for an HTTP request to complete before timeout
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections
auth
Path: orchestration.web.auth
| Parameter | Type | Default | Description |
|---|---|---|---|
api_key | String | "" | Static API key for simple key-based authentication |
api_key_header | String | "X-API-Key" | HTTP header name for API key authentication |
enabled | bool | false | Enable authentication for the REST API |
jwt_audience | String | "tasker-api" | Expected ‘aud’ claim in JWT tokens |
jwt_issuer | String | "tasker-core" | Expected ‘iss’ claim in JWT tokens |
jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if this service issues tokens) |
jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures |
jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for JWT verification |
jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours |
orchestration.web.auth.api_key
Static API key for simple key-based authentication
- Type: String
- Default: ""
- Valid Range: non-empty string or empty to disable
- System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header
orchestration.web.auth.api_key_header
HTTP header name for API key authentication
- Type: String
- Default: "X-API-Key"
- Valid Range: valid HTTP header name
- System Impact: Clients send their API key in this header; default is X-API-Key
orchestration.web.auth.enabled
Enable authentication for the REST API
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all API endpoints are unauthenticated; enable in production with JWT or API key auth
orchestration.web.auth.jwt_audience
Expected ‘aud’ claim in JWT tokens
- Type: String
- Default: "tasker-api"
- Valid Range: non-empty string
- System Impact: Tokens with a different audience are rejected during validation
orchestration.web.auth.jwt_issuer
Expected ‘iss’ claim in JWT tokens
- Type: String
- Default: "tasker-core"
- Valid Range: non-empty string
- System Impact: Tokens with a different issuer are rejected during validation
orchestration.web.auth.jwt_private_key
PEM-encoded private key for signing JWT tokens (if this service issues tokens)
- Type: String
- Default: ""
- Valid Range: valid PEM private key or empty
- System Impact: Required only if the orchestration service issues its own JWT tokens; leave empty when using external identity providers
orchestration.web.auth.jwt_public_key
PEM-encoded public key for verifying JWT token signatures
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY:-}"
- Valid Range: valid PEM public key or empty
- System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management in production
orchestration.web.auth.jwt_public_key_path
File path to a PEM-encoded public key for JWT verification
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
- Valid Range: valid file path or empty
- System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file
orchestration.web.auth.jwt_token_expiry_hours
Default JWT token validity period in hours
- Type: u32
- Default: 24
- Valid Range: 1-720
- System Impact: Tokens older than this are rejected; shorter values improve security but require more frequent re-authentication
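For JWT-based auth against an external identity provider, only the verification side needs to be configured. The sketch below is an illustrative YAML layout of the documented parameters; whether to source the key from an env var or a mounted file is a deployment choice.

```yaml
orchestration:
  web:
    auth:
      enabled: true
      jwt_issuer: "tasker-core"
      jwt_audience: "tasker-api"
      jwt_public_key_path: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"  # path to a PEM file; supports key rotation
      jwt_token_expiry_hours: 24
      # jwt_private_key stays empty unless this service issues its own tokens
      api_key: ""                   # leave empty to rely on JWT only
      api_key_header: "X-API-Key"
```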
database_pools
Path: orchestration.web.database_pools
| Parameter | Type | Default | Description |
|---|---|---|---|
max_total_connections_hint | u32 | 50 | Advisory hint for the total number of database connections across all orchestration pools |
web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the web API pool |
web_api_idle_timeout_seconds | u32 | 600 | Time before an idle web API connection is closed |
web_api_max_connections | u32 | 30 | Maximum number of connections the web API pool can grow to under load |
web_api_pool_size | u32 | 20 | Target number of connections in the web API database pool |
orchestration.web.database_pools.max_total_connections_hint
Advisory hint for the total number of database connections across all orchestration pools
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint
orchestration.web.database_pools.web_api_connection_timeout_seconds
Maximum time to wait when acquiring a connection from the web API pool
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: API requests that cannot acquire a connection within this window return an error
orchestration.web.database_pools.web_api_idle_timeout_seconds
Time before an idle web API connection is closed
- Type: u32
- Default: 600
- Valid Range: 1-3600
- System Impact: Controls how quickly the web API pool shrinks after traffic subsides
orchestration.web.database_pools.web_api_max_connections
Maximum number of connections the web API pool can grow to under load
- Type: u32
- Default: 30
- Valid Range: 1-500
- System Impact: Hard ceiling for web API database connections; prevents connection exhaustion from traffic spikes
orchestration.web.database_pools.web_api_pool_size
Target number of connections in the web API database pool
- Type: u32
- Default: 20
- Valid Range: 1-200
- System Impact: Determines how many concurrent database queries the REST API can execute
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: worker
90/90 parameters documented
worker
Root-level worker parameters
Path: worker
| Parameter | Type | Default | Description |
|---|---|---|---|
worker_id | String | "worker-default-001" | Unique identifier for this worker instance |
worker_type | String | "general" | Worker type classification for routing and reporting |
worker.worker_id
Unique identifier for this worker instance
- Type: String
- Default: "worker-default-001"
- Valid Range: non-empty string
- System Impact: Used in logging, metrics, and step claim attribution; must be unique across all worker instances in a cluster
worker.worker_type
Worker type classification for routing and reporting
- Type: String
- Default: "general"
- Valid Range: non-empty string
- System Impact: Used to match worker capabilities with step handler requirements; ‘general’ handles all step types
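Each worker instance needs a cluster-unique worker_id, while worker_type is used to match capabilities to step handlers. A minimal illustrative YAML sketch; the specific id shown is hypothetical.

```yaml
worker:
  worker_id: "payments-worker-01"   # hypothetical; must be unique per instance in the cluster
  worker_type: "general"            # "general" handles all step types
```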
circuit_breakers
Path: worker.circuit_breakers
ffi_completion_send
Path: worker.circuit_breakers.ffi_completion_send
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Number of consecutive FFI completion send failures before the circuit breaker trips |
recovery_timeout_seconds | u32 | 5 | Time the FFI completion breaker stays Open before probing with a test send |
slow_send_threshold_ms | u32 | 100 | Threshold in milliseconds above which FFI completion channel sends are logged as slow |
success_threshold | u32 | 2 | Consecutive successful sends in Half-Open required to close the breaker |
worker.circuit_breakers.ffi_completion_send.failure_threshold
Number of consecutive FFI completion send failures before the circuit breaker trips
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects the FFI completion channel from cascading failures; when tripped, sends are short-circuited
worker.circuit_breakers.ffi_completion_send.recovery_timeout_seconds
Time the FFI completion breaker stays Open before probing with a test send
- Type: u32
- Default: 5
- Valid Range: 1-300
- System Impact: Short timeout (5s) because FFI channel issues are typically transient
worker.circuit_breakers.ffi_completion_send.slow_send_threshold_ms
Threshold in milliseconds above which FFI completion channel sends are logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-10000
- System Impact: Observability: identifies when the FFI completion channel is under pressure from slow consumers
worker.circuit_breakers.ffi_completion_send.success_threshold
Consecutive successful sends in Half-Open required to close the breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Low threshold (2) allows fast recovery since FFI send failures are usually transient
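The FFI completion breaker follows the usual Open and Half-Open cycle described above: trip after failure_threshold consecutive failures, stay Open for recovery_timeout_seconds, then require success_threshold successful probe sends to close. Illustrative YAML with the documented defaults (nesting assumed from the parameter paths):

```yaml
worker:
  circuit_breakers:
    ffi_completion_send:
      failure_threshold: 5          # consecutive failures before tripping Open
      recovery_timeout_seconds: 5   # time spent Open before a Half-Open probe
      success_threshold: 2          # Half-Open successes needed to close
      slow_send_threshold_ms: 100   # log sends slower than this
```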
event_systems
Path: worker.event_systems
worker
Path: worker.event_systems.worker
| Parameter | Type | Default | Description |
|---|---|---|---|
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
system_id | String | "worker-event-system" | Unique identifier for the worker event system instance |
worker.event_systems.worker.deployment_mode
Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
worker.event_systems.worker.system_id
Unique identifier for the worker event system instance
- Type: String
- Default: "worker-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish this event system from others
health
Path: worker.event_systems.worker.health
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable health monitoring for the worker event system |
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the worker event system reports as unhealthy |
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the worker event system reports as unhealthy |
performance_monitoring_enabled | bool | true | Enable detailed performance metrics for worker event processing |
worker.event_systems.worker.health.enabled
Enable health monitoring for the worker event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks or error tracking run for the worker event system
worker.event_systems.worker.health.error_rate_threshold_per_minute
Error rate per minute above which the worker event system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal complementing max_consecutive_errors
worker.event_systems.worker.health.max_consecutive_errors
Number of consecutive errors before the worker event system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation; resets on any successful event processing
worker.event_systems.worker.health.performance_monitoring_enabled
Enable detailed performance metrics for worker event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks step dispatch latency and throughput; useful for tuning concurrency settings
metadata
Path: worker.event_systems.worker.metadata
fallback_poller
Path: worker.event_systems.worker.metadata.fallback_poller
| Parameter | Type | Default | Description |
|---|---|---|---|
age_threshold_seconds | u32 | 5 | Minimum age in seconds of a message before the fallback poller will pick it up |
batch_size | u32 | 20 | Number of messages to dequeue in a single fallback poll |
enabled | bool | true | Enable the fallback polling mechanism for step dispatch |
max_age_hours | u32 | 24 | Maximum age in hours of messages the fallback poller will process |
polling_interval_ms | u32 | 1000 | Interval in milliseconds between fallback polling cycles |
supported_namespaces | Vec<String> | [] | List of queue namespaces the fallback poller monitors; empty means all namespaces |
visibility_timeout_seconds | u32 | 30 | Time in seconds a message polled by the fallback mechanism remains invisible |
in_process_events
Path: worker.event_systems.worker.metadata.in_process_events
| Parameter | Type | Default | Description |
|---|---|---|---|
deduplication_cache_size | usize | 10000 | Number of event IDs to cache for deduplication of in-process events |
ffi_integration_enabled | bool | true | Enable FFI integration for in-process event delivery to Ruby/Python workers |
listener
Path: worker.event_systems.worker.metadata.listener
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_processing | bool | true | Enable batch processing of accumulated LISTEN/NOTIFY events |
connection_timeout_seconds | u32 | 30 | Maximum time to wait when establishing the LISTEN/NOTIFY PostgreSQL connection |
event_timeout_seconds | u32 | 60 | Maximum time to wait for a LISTEN/NOTIFY event before yielding |
max_retry_attempts | u32 | 5 | Maximum number of listener reconnection attempts before falling back to polling |
retry_interval_seconds | u32 | 5 | Interval in seconds between LISTEN/NOTIFY listener reconnection attempts |
processing
Path: worker.event_systems.worker.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 20 | Number of events dequeued in a single batch read by the worker |
max_concurrent_operations | u32 | 100 | Maximum number of events processed concurrently by the worker event system |
max_retries | u32 | 3 | Maximum retry attempts for a failed worker event processing operation |
worker.event_systems.worker.processing.batch_size
Number of events dequeued in a single batch read by the worker
- Type: u32
- Default: 20
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput but increase per-batch processing time
worker.event_systems.worker.processing.max_concurrent_operations
Maximum number of events processed concurrently by the worker event system
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Controls parallelism for step dispatch and completion processing
worker.event_systems.worker.processing.max_retries
Maximum retry attempts for a failed worker event processing operation
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Events exceeding this retry count are dropped or sent to the DLQ
backoff
Path: worker.event_systems.worker.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first worker event processing failure |
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between worker event retries |
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
timing
Path: worker.event_systems.worker.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
claim_timeout_seconds | u32 | 300 | Maximum time in seconds a worker event claim remains valid |
fallback_polling_interval_seconds | u32 | 2 | Interval in seconds between fallback polling cycles for step dispatch |
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the worker event system |
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single worker event |
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued step dispatch message remains invisible to other workers |
worker.event_systems.worker.timing.claim_timeout_seconds
Maximum time in seconds a worker event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned claims from blocking step processing indefinitely
worker.event_systems.worker.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles for step dispatch
- Type: u32
- Default: 2
- Valid Range: 1-60
- System Impact: Shorter than orchestration (2s vs 5s) because workers need fast step pickup for low latency
worker.event_systems.worker.timing.health_check_interval_seconds
Interval in seconds between health check probes for the worker event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the worker event system verifies its own connectivity
worker.event_systems.worker.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single worker event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Events exceeding this timeout are considered failed and may be retried
worker.event_systems.worker.timing.visibility_timeout_seconds
Time in seconds a dequeued step dispatch message remains invisible to other workers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Prevents duplicate step execution; must be longer than typical step processing time
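Because visibility_timeout_seconds guards against duplicate step execution, it should exceed typical step processing time, and the worker's shorter fallback polling interval (2s versus 5s for orchestration) keeps step pickup latency low. An illustrative YAML sketch of the worker timing defaults:

```yaml
worker:
  event_systems:
    worker:
      timing:
        fallback_polling_interval_seconds: 2   # fast step pickup in fallback mode
        visibility_timeout_seconds: 30         # must exceed typical step duration
        processing_timeout_seconds: 60
        claim_timeout_seconds: 300
        health_check_interval_seconds: 30
```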
grpc
Path: worker.grpc
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}" | Socket address for the worker gRPC server |
enable_health_service | bool | true | Enable the gRPC health checking service on the worker |
enable_reflection | bool | true | Enable gRPC server reflection for the worker service |
enabled | bool | true | Enable the gRPC API server for the worker service |
keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames on the worker |
keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
max_concurrent_streams | u32 | 1000 | Maximum number of concurrent gRPC streams per connection |
max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server |
tls_enabled | bool | false | Enable TLS encryption for worker gRPC connections |
worker.grpc.bind_address
Socket address for the worker gRPC server
- Type: String
- Default: "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}"
- Valid Range: host:port
- System Impact: Must not conflict with the REST API or orchestration gRPC ports; default 9191
worker.grpc.enable_health_service
Enable the gRPC health checking service on the worker
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators
worker.grpc.enable_reflection
Enable gRPC server reflection for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Allows tools like grpcurl to list and inspect worker services; consider disabling in production
worker.grpc.enabled
Enable the gRPC API server for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no gRPC endpoints are available; clients must use REST
worker.grpc.keepalive_interval_seconds
Interval in seconds between gRPC keepalive ping frames on the worker
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Detects dead connections; lower values detect failures faster but increase network overhead
worker.grpc.keepalive_timeout_seconds
Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
- Type: u32
- Default: 20
- Valid Range: 1-300
- System Impact: Connections that fail to acknowledge within this window are considered dead and closed
worker.grpc.max_concurrent_streams
Maximum number of concurrent gRPC streams per connection
- Type: u32
- Default: 1000
- Valid Range: 1-10000
- System Impact: Workers typically handle more concurrent streams than orchestration; default 1000 reflects this
worker.grpc.max_frame_size
Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server
- Type: u32
- Default: 16384
- Valid Range: 16384-16777215
- System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream
worker.grpc.tls_enabled
Enable TLS encryption for worker gRPC connections
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When true, TLS cert and key paths must be provided; required for production gRPC deployments
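
Taken together, a worker gRPC block for a typical internal deployment might look like the following sketch. It uses only the parameters documented above; the TOML layout is an assumption about the config file format.

```toml
# Sketch of a worker gRPC block; adjust to your deployment.
[worker.grpc]
enabled                    = true
bind_address               = "0.0.0.0:9191"  # must not collide with the worker REST port (8081) or orchestration gRPC (9190)
enable_health_service      = true            # lets load balancers use gRPC-native health probes
enable_reflection          = false           # consider disabling reflection in production
keepalive_interval_seconds = 30
keepalive_timeout_seconds  = 20
tls_enabled                = false           # enable and supply cert/key paths for production deployments
```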
mpsc_channels
Path: worker.mpsc_channels
command_processor
Path: worker.mpsc_channels.command_processor
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 2000 | Bounded channel capacity for the worker command processor |
worker.mpsc_channels.command_processor.command_buffer_size
Bounded channel capacity for the worker command processor
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Buffers incoming worker commands; smaller than orchestration (2000 vs 5000) since workers process fewer command types
domain_events
Path: worker.mpsc_channels.domain_events
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 1000 | Bounded channel capacity for domain event system commands |
log_dropped_events | bool | true | Log a warning when domain events are dropped due to channel saturation |
shutdown_drain_timeout_ms | u32 | 5000 | Maximum time in milliseconds to drain pending domain events during shutdown |
worker.mpsc_channels.domain_events.command_buffer_size
Bounded channel capacity for domain event system commands
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers domain event system control commands such as publish, subscribe, and shutdown
worker.mpsc_channels.domain_events.log_dropped_events
Log a warning when domain events are dropped due to channel saturation
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Observability: helps detect when event volume exceeds channel capacity
worker.mpsc_channels.domain_events.shutdown_drain_timeout_ms
Maximum time in milliseconds to drain pending domain events during shutdown
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Ensures in-flight domain events are delivered before the worker exits; prevents event loss
event_listeners
Path: worker.mpsc_channels.event_listeners
| Parameter | Type | Default | Description |
|---|---|---|---|
pgmq_event_buffer_size | usize | 10000 | Bounded channel capacity for PGMQ event listener notifications on the worker |
worker.mpsc_channels.event_listeners.pgmq_event_buffer_size
Bounded channel capacity for PGMQ event listener notifications on the worker
- Type: usize
- Default: 10000
- Valid Range: 1000-1000000
- System Impact: Buffers PGMQ LISTEN/NOTIFY events; smaller than orchestration (10000 vs 50000) since workers handle fewer event types
event_subscribers
Path: worker.mpsc_channels.event_subscribers
| Parameter | Type | Default | Description |
|---|---|---|---|
completion_buffer_size | usize | 1000 | Bounded channel capacity for step completion event subscribers |
result_buffer_size | usize | 1000 | Bounded channel capacity for step result event subscribers |
worker.mpsc_channels.event_subscribers.completion_buffer_size
Bounded channel capacity for step completion event subscribers
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step completion notifications before they are forwarded to the orchestration service
worker.mpsc_channels.event_subscribers.result_buffer_size
Bounded channel capacity for step result event subscribers
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step execution results before they are published to the result queue
event_systems
Path: worker.mpsc_channels.event_systems
| Parameter | Type | Default | Description |
|---|---|---|---|
event_channel_buffer_size | usize | 2000 | Bounded channel capacity for the worker event system internal channel |
worker.mpsc_channels.event_systems.event_channel_buffer_size
Bounded channel capacity for the worker event system internal channel
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Buffers events between the listener and processor; sized for worker-level throughput
ffi_dispatch
Path: worker.mpsc_channels.ffi_dispatch
| Parameter | Type | Default | Description |
|---|---|---|---|
callback_timeout_ms | u32 | 5000 | Maximum time in milliseconds for FFI fire-and-forget domain event callbacks |
completion_send_timeout_ms | u32 | 10000 | Maximum time in milliseconds to retry sending FFI completion results when the channel is full |
completion_timeout_ms | u32 | 30000 | Maximum time in milliseconds to wait for an FFI step handler to complete |
dispatch_buffer_size | usize | 1000 | Bounded channel capacity for FFI step dispatch requests |
starvation_warning_threshold_ms | u32 | 10000 | Age in milliseconds of pending FFI events that triggers a starvation warning |
worker.mpsc_channels.ffi_dispatch.callback_timeout_ms
Maximum time in milliseconds for FFI fire-and-forget domain event callbacks
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Prevents indefinite blocking of FFI threads during domain event publishing
worker.mpsc_channels.ffi_dispatch.completion_send_timeout_ms
Maximum time in milliseconds to retry sending FFI completion results when the channel is full
- Type: u32
- Default: 10000
- Valid Range: 1000-300000
- System Impact: Uses try_send with a retry loop instead of a blocking send to prevent deadlocks
worker.mpsc_channels.ffi_dispatch.completion_timeout_ms
Maximum time in milliseconds to wait for an FFI step handler to complete
- Type: u32
- Default: 30000
- Valid Range: 1000-600000
- System Impact: FFI handlers exceeding this timeout are considered failed; guards against hung FFI threads
worker.mpsc_channels.ffi_dispatch.dispatch_buffer_size
Bounded channel capacity for FFI step dispatch requests
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step execution requests destined for Ruby/Python FFI handlers
worker.mpsc_channels.ffi_dispatch.starvation_warning_threshold_ms
Age in milliseconds of pending FFI events that triggers a starvation warning
- Type: u32
- Default: 10000
- Valid Range: 1000-300000
- System Impact: Proactive detection of FFI channel starvation before completion_timeout_ms is reached
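
The FFI dispatch timeouts are tiered: the starvation warning should fire well before a handler is declared failed. A sketch of the documented defaults, with the ordering called out in comments (TOML layout assumed):

```toml
# Sketch of the FFI dispatch defaults documented above.
[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size            = 1000
callback_timeout_ms             = 5000   # fire-and-forget domain event callbacks
starvation_warning_threshold_ms = 10000  # warn about aging events well before they time out
completion_send_timeout_ms      = 10000  # bounded try_send retries when the completion channel is full
completion_timeout_ms           = 30000  # hard limit for an FFI handler; keep above the warning threshold
```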
handler_dispatch
Path: worker.mpsc_channels.handler_dispatch
| Parameter | Type | Default | Description |
|---|---|---|---|
completion_buffer_size | usize | 1000 | Bounded channel capacity for step handler completion notifications |
dispatch_buffer_size | usize | 1000 | Bounded channel capacity for step handler dispatch requests |
handler_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a step handler to complete execution |
max_concurrent_handlers | u32 | 10 | Maximum number of step handlers executing simultaneously |
worker.mpsc_channels.handler_dispatch.completion_buffer_size
Bounded channel capacity for step handler completion notifications
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers handler completion results before they are forwarded to the result processor
worker.mpsc_channels.handler_dispatch.dispatch_buffer_size
Bounded channel capacity for step handler dispatch requests
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers incoming step execution requests before handler assignment
worker.mpsc_channels.handler_dispatch.handler_timeout_ms
Maximum time in milliseconds for a step handler to complete execution
- Type: u32
- Default: 30000
- Valid Range: 1000-600000
- System Impact: Handlers exceeding this timeout are cancelled; prevents hung handlers from consuming capacity
worker.mpsc_channels.handler_dispatch.max_concurrent_handlers
Maximum number of step handlers executing simultaneously
- Type: u32
- Default: 10
- Valid Range: 1-10000
- System Impact: Controls per-worker parallelism; bounded by the handler dispatch semaphore
load_shedding
Path: worker.mpsc_channels.handler_dispatch.load_shedding
| Parameter | Type | Default | Description |
|---|---|---|---|
capacity_threshold_percent | f64 | 80.0 | Handler capacity percentage above which new step claims are refused |
enabled | bool | true | Enable load shedding to refuse step claims when handler capacity is nearly exhausted |
warning_threshold_percent | f64 | 70.0 | Handler capacity percentage at which warning logs are emitted |
worker.mpsc_channels.handler_dispatch.load_shedding.capacity_threshold_percent
Handler capacity percentage above which new step claims are refused
- Type: f64
- Default: 80.0
- Valid Range: 0.0-100.0
- System Impact: With the default of 80.0, the worker stops accepting new step claims once 80% of max_concurrent_handlers are busy
worker.mpsc_channels.handler_dispatch.load_shedding.enabled
Enable load shedding to refuse step claims when handler capacity is nearly exhausted
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, the worker refuses new step claims above the capacity threshold to prevent overload
worker.mpsc_channels.handler_dispatch.load_shedding.warning_threshold_percent
Handler capacity percentage at which warning logs are emitted
- Type: f64
- Default: 70.0
- Valid Range: 0.0-100.0
- System Impact: Observability: alerts operators that the worker is approaching its capacity limit
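
Concretely, with the defaults above and max_concurrent_handlers = 10, the worker warns once 7 handlers are busy and refuses new step claims once 8 are busy. A sketch of that configuration (TOML layout assumed):

```toml
# Sketch: load shedding thresholds relative to handler capacity.
[worker.mpsc_channels.handler_dispatch]
max_concurrent_handlers = 10

[worker.mpsc_channels.handler_dispatch.load_shedding]
enabled                    = true
warning_threshold_percent  = 70.0  # warn at 7 of 10 busy handlers
capacity_threshold_percent = 80.0  # refuse new claims at 8 of 10 busy handlers
```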
in_process_events
Path: worker.mpsc_channels.in_process_events
| Parameter | Type | Default | Description |
|---|---|---|---|
broadcast_buffer_size | usize | 2000 | Bounded broadcast channel capacity for in-process domain event delivery |
dispatch_timeout_ms | u32 | 5000 | Maximum time in milliseconds to wait when dispatching an in-process event |
log_subscriber_errors | bool | true | Log errors when in-process event subscribers fail to receive events |
worker.mpsc_channels.in_process_events.broadcast_buffer_size
Bounded broadcast channel capacity for in-process domain event delivery
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Controls how many domain events can be buffered before slow subscribers cause backpressure
worker.mpsc_channels.in_process_events.dispatch_timeout_ms
Maximum time in milliseconds to wait when dispatching an in-process event
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Prevents event dispatch from blocking indefinitely if all subscribers are slow
worker.mpsc_channels.in_process_events.log_subscriber_errors
Log errors when in-process event subscribers fail to receive events
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Observability: helps identify slow or failing event subscribers
orchestration_client
Path: worker.orchestration_client
| Parameter | Type | Default | Description |
|---|---|---|---|
base_url | String | "http://localhost:8080" | Base URL of the orchestration REST API that this worker reports to |
max_retries | u32 | 3 | Maximum retry attempts for failed orchestration API calls |
timeout_ms | u32 | 30000 | HTTP request timeout in milliseconds for orchestration API calls |
worker.orchestration_client.base_url
Base URL of the orchestration REST API that this worker reports to
- Type: String
- Default: "http://localhost:8080"
- Valid Range: valid HTTP(S) URL
- System Impact: Workers send step completion results and health reports to this endpoint
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | http://orchestration:8080 | Container-internal DNS in Kubernetes/Docker |
| test | http://localhost:8080 | Local orchestration for testing |
Related: orchestration.web.bind_address
worker.orchestration_client.max_retries
Maximum retry attempts for failed orchestration API calls
- Type: u32
- Default: 3
- Valid Range: 0-10
- System Impact: Retries use backoff; higher values improve resilience to transient network issues
worker.orchestration_client.timeout_ms
HTTP request timeout in milliseconds for orchestration API calls
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Worker-to-orchestration calls exceeding this timeout fail and may be retried
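
Putting the client settings together, a production worker pointed at a container-internal orchestration service might use the following sketch, per the environment recommendation above (TOML layout assumed; the hostname is illustrative):

```toml
# Sketch of a production orchestration client block.
[worker.orchestration_client]
base_url    = "http://orchestration:8080"  # container-internal DNS for the orchestration REST API
timeout_ms  = 30000
max_retries = 3                            # retried with backoff on transient network failures
```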
web
Path: worker.web
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}" | Socket address for the worker REST API server |
enabled | bool | true | Enable the REST API server for the worker service |
request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a worker HTTP request to complete before timeout |
worker.web.bind_address
Socket address for the worker REST API server
- Type: String
- Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}"
- Valid Range: host:port
- System Impact: Must not conflict with orchestration.web.bind_address when co-located; default 8081
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8081 | Standard worker port; use TASKER_WEB_BIND_ADDRESS env var to override |
| test | 0.0.0.0:8081 | Default port offset from orchestration (8080) |
worker.web.enabled
Enable the REST API server for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no HTTP endpoints are available; the worker operates via messaging only
worker.web.request_timeout_ms
Maximum time in milliseconds for a worker HTTP request to complete before timeout
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections
auth
Path: worker.web.auth
| Parameter | Type | Default | Description |
|---|---|---|---|
api_key | String | "" | Static API key for simple key-based authentication on the worker API |
api_key_header | String | "X-API-Key" | HTTP header name for API key authentication on the worker API |
enabled | bool | false | Enable authentication for the worker REST API |
jwt_audience | String | "worker-api" | Expected ‘aud’ claim in JWT tokens for the worker API |
jwt_issuer | String | "tasker-worker" | Expected ‘iss’ claim in JWT tokens for the worker API |
jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if the worker issues tokens) |
jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures on the worker API |
jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for worker JWT verification |
jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours for worker API tokens |
worker.web.auth.api_key
Static API key for simple key-based authentication on the worker API
- Type: String
- Default: ""
- Valid Range: non-empty string or empty to disable
- System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header
worker.web.auth.api_key_header
HTTP header name for API key authentication on the worker API
- Type: String
- Default: "X-API-Key"
- Valid Range: valid HTTP header name
- System Impact: Clients send their API key in this header; default is X-API-Key
worker.web.auth.enabled
Enable authentication for the worker REST API
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all worker API endpoints are unauthenticated
worker.web.auth.jwt_audience
Expected ‘aud’ claim in JWT tokens for the worker API
- Type: String
- Default: "worker-api"
- Valid Range: non-empty string
- System Impact: Tokens with a different audience are rejected during validation
worker.web.auth.jwt_issuer
Expected ‘iss’ claim in JWT tokens for the worker API
- Type: String
- Default: "tasker-worker"
- Valid Range: non-empty string
- System Impact: Tokens with a different issuer are rejected; default ‘tasker-worker’ distinguishes worker tokens from orchestration tokens
worker.web.auth.jwt_private_key
PEM-encoded private key for signing JWT tokens (if the worker issues tokens)
- Type: String
- Default: ""
- Valid Range: valid PEM private key or empty
- System Impact: Required only if the worker service issues its own JWT tokens; typically empty
worker.web.auth.jwt_public_key
PEM-encoded public key for verifying JWT token signatures on the worker API
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY:-}"
- Valid Range: valid PEM public key or empty
- System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management
worker.web.auth.jwt_public_key_path
File path to a PEM-encoded public key for worker JWT verification
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
- Valid Range: valid file path or empty
- System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file
worker.web.auth.jwt_token_expiry_hours
Default JWT token validity period in hours for worker API tokens
- Type: u32
- Default: 24
- Valid Range: 1-720
- System Impact: Tokens older than this are rejected; shorter values improve security
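
A minimal sketch for enabling JWT authentication on the worker API, using the parameters documented above (TOML layout assumed; the key path is illustrative and would normally come from the TASKER_JWT_PUBLIC_KEY_PATH environment variable):

```toml
# Sketch: JWT auth on the worker REST API.
[worker.web.auth]
enabled                = true
jwt_public_key_path    = "/etc/tasker/keys/jwt.pub"  # file-based key supports rotation by replacing the file
jwt_issuer             = "tasker-worker"
jwt_audience           = "worker-api"
jwt_token_expiry_hours = 12                          # tighter than the 24 h default
```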
database_pools
Path: worker.web.database_pools
| Parameter | Type | Default | Description |
|---|---|---|---|
max_total_connections_hint | u32 | 25 | Advisory hint for the total number of database connections across all worker pools |
web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the worker web API pool |
web_api_idle_timeout_seconds | u32 | 600 | Time before an idle worker web API connection is closed |
web_api_max_connections | u32 | 15 | Maximum number of connections the worker web API pool can grow to under load |
web_api_pool_size | u32 | 10 | Target number of connections in the worker web API database pool |
worker.web.database_pools.max_total_connections_hint
Advisory hint for the total number of database connections across all worker pools
- Type: u32
- Default: 25
- Valid Range: 1-1000
- System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint
worker.web.database_pools.web_api_connection_timeout_seconds
Maximum time to wait when acquiring a connection from the worker web API pool
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: Worker API requests that cannot acquire a connection within this window return an error
worker.web.database_pools.web_api_idle_timeout_seconds
Time before an idle worker web API connection is closed
- Type: u32
- Default: 600
- Valid Range: 1-3600
- System Impact: Controls how quickly the worker web API pool shrinks after traffic subsides
worker.web.database_pools.web_api_max_connections
Maximum number of connections the worker web API pool can grow to under load
- Type: u32
- Default: 15
- Valid Range: 1-500
- System Impact: Hard ceiling for worker web API database connections
worker.web.database_pools.web_api_pool_size
Target number of connections in the worker web API database pool
- Type: u32
- Default: 10
- Valid Range: 1-200
- System Impact: Determines how many concurrent database queries the worker REST API can execute; smaller than orchestration
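
The pool parameters are related: the steady-state pool size should not exceed the burst ceiling, and both should stay within the advisory total hint. A sketch of the documented defaults (TOML layout assumed):

```toml
# Sketch of the worker web API pool defaults and how they relate.
[worker.web.database_pools]
web_api_pool_size                  = 10   # steady-state connections
web_api_max_connections            = 15   # burst ceiling; keep >= pool_size
web_api_connection_timeout_seconds = 30
web_api_idle_timeout_seconds       = 600
max_total_connections_hint         = 25   # advisory only; exceeding it is logged, not enforced
```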
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: orchestration
91/91 parameters documented
orchestration
Root-level orchestration parameters
Path: orchestration
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_performance_logging | bool | true | Enable detailed performance logging for orchestration actors |
shutdown_timeout_ms | u64 | 30000 | Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown |
orchestration.enable_performance_logging
Enable detailed performance logging for orchestration actors
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Emits timing metrics for task processing, step enqueueing, and result evaluation; disable in production if log volume is a concern
orchestration.shutdown_timeout_ms
Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown
- Type: u64
- Default: 30000
- Valid Range: 1000-300000
- System Impact: If shutdown exceeds this timeout, the process exits forcefully to avoid hanging indefinitely; 30s is conservative for most deployments
batch_processing
Path: orchestration.batch_processing
| Parameter | Type | Default | Description |
|---|---|---|---|
checkpoint_stall_minutes | u32 | 15 | Minutes without a checkpoint update before a batch is considered stalled |
default_batch_size | u32 | 1000 | Default number of items in a single batch when not specified by the handler |
enabled | bool | true | Enable the batch processing subsystem for large-scale step execution |
max_parallel_batches | u32 | 50 | Maximum number of batch operations that can execute concurrently |
orchestration.batch_processing.checkpoint_stall_minutes
Minutes without a checkpoint update before a batch is considered stalled
- Type: u32
- Default: 15
- Valid Range: 1-1440
- System Impact: Stalled batches are flagged for investigation or automatic recovery; lower values detect issues faster
orchestration.batch_processing.default_batch_size
Default number of items in a single batch when not specified by the handler
- Type: u32
- Default: 1000
- Valid Range: 1-100000
- System Impact: Larger batches improve throughput but increase memory usage and per-batch latency
orchestration.batch_processing.enabled
Enable the batch processing subsystem for large-scale step execution
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, batch step handlers cannot be used; all steps must be processed individually
orchestration.batch_processing.max_parallel_batches
Maximum number of batch operations that can execute concurrently
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Bounds resource usage from concurrent batch processing; increase for high-throughput batch workloads
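
A sketch of a batch-processing block tuned with the documented defaults (TOML layout assumed); the trade-offs from the parameter notes above are repeated as comments:

```toml
# Sketch: batch processing defaults and their trade-offs.
[orchestration.batch_processing]
enabled                  = true
default_batch_size       = 1000  # larger batches raise throughput but cost memory and per-batch latency
max_parallel_batches     = 50    # bounds resource usage from concurrent batches
checkpoint_stall_minutes = 15    # batches silent this long are flagged as stalled
```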
decision_points
Path: orchestration.decision_points
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_detailed_logging | bool | false | Enable verbose logging of decision point evaluation including expression results |
enable_metrics | bool | true | Enable metrics collection for decision point evaluations |
enabled | bool | true | Enable the decision point evaluation subsystem for conditional workflow branching |
max_decision_depth | u32 | 20 | Maximum depth of nested decision point chains |
max_steps_per_decision | u32 | 100 | Maximum number of steps that can be generated by a single decision point evaluation |
warn_threshold_depth | u32 | 10 | Decision depth above which a warning is logged |
warn_threshold_steps | u32 | 50 | Number of steps per decision above which a warning is logged |
orchestration.decision_points.enable_detailed_logging
Enable verbose logging of decision point evaluation including expression results
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: Produces high-volume logs; enable only for debugging specific decision point behavior
orchestration.decision_points.enable_metrics
Enable metrics collection for decision point evaluations
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks evaluation counts, timings, and branch selection distribution
orchestration.decision_points.enabled
Enable the decision point evaluation subsystem for conditional workflow branching
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, all decision points are skipped and conditional steps are not evaluated
orchestration.decision_points.max_decision_depth
Maximum depth of nested decision point chains
- Type: u32
- Default: 20
- Valid Range: 1-100
- System Impact: Prevents infinite recursion from circular decision point references
orchestration.decision_points.max_steps_per_decision
Maximum number of steps that can be generated by a single decision point evaluation
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Safety limit to prevent decision points from creating unbounded step graphs
orchestration.decision_points.warn_threshold_depth
Decision depth above which a warning is logged
- Type: u32
- Default: 10
- Valid Range: 1-100
- System Impact: Observability: identifies deeply nested decision chains that may indicate design issues
orchestration.decision_points.warn_threshold_steps
Number of steps per decision above which a warning is logged
- Type: u32
- Default: 50
- Valid Range: 1-10000
- System Impact: Observability: identifies decision points that generate unusually large step sets
dlq
Path: orchestration.dlq
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable the Dead Letter Queue subsystem for handling unrecoverable tasks |
orchestration.dlq.enabled
Enable the Dead Letter Queue subsystem for handling unrecoverable tasks
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, stale or failed tasks remain in their error state without DLQ routing
staleness_detection
Path: orchestration.dlq.staleness_detection
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 100 | Number of potentially stale tasks to evaluate in a single detection sweep |
detection_interval_seconds | u32 | 300 | Interval in seconds between staleness detection sweeps |
dry_run | bool | false | Run staleness detection in observation-only mode without taking action |
enabled | bool | true | Enable periodic scanning for stale tasks |
orchestration.dlq.staleness_detection.batch_size
Number of potentially stale tasks to evaluate in a single detection sweep
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Larger batches process more stale tasks per sweep but increase per-sweep query cost
orchestration.dlq.staleness_detection.detection_interval_seconds
Interval in seconds between staleness detection sweeps
- Type: u32
- Default: 300
- Valid Range: 30-3600
- System Impact: Lower values detect stale tasks faster but increase database query frequency
orchestration.dlq.staleness_detection.dry_run
Run staleness detection in observation-only mode without taking action
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: Logs what would be DLQ’d without actually transitioning tasks; useful for tuning thresholds
orchestration.dlq.staleness_detection.enabled
Enable periodic scanning for stale tasks
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no automatic staleness detection runs; tasks must be manually DLQ’d
actions
Path: orchestration.dlq.staleness_detection.actions
| Parameter | Type | Default | Description |
|---|---|---|---|
auto_move_to_dlq | bool | true | Automatically move stale tasks to the DLQ after transitioning to error |
auto_transition_to_error | bool | true | Automatically transition stale tasks to the Error state |
emit_events | bool | true | Emit domain events when staleness is detected |
event_channel | String | "task_staleness_detected" | PGMQ channel name for staleness detection events |
orchestration.dlq.staleness_detection.actions.auto_move_to_dlq
Automatically move stale tasks to the DLQ after transitioning to error
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, stale tasks are routed to the DLQ; when false, they remain in Error state for manual review
orchestration.dlq.staleness_detection.actions.auto_transition_to_error
Automatically transition stale tasks to the Error state
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, stale tasks are moved to Error before DLQ routing; when false, tasks stay in their current state
orchestration.dlq.staleness_detection.actions.emit_events
Emit domain events when staleness is detected
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, staleness events are published to the event_channel for external alerting or custom handling
orchestration.dlq.staleness_detection.actions.event_channel
PGMQ channel name for staleness detection events
- Type: String
- Default: "task_staleness_detected"
- Valid Range: 1-255 characters
- System Impact: Consumers can subscribe to this channel for alerting or custom staleness handling
thresholds
Path: orchestration.dlq.staleness_detection.thresholds
| Parameter | Type | Default | Description |
|---|---|---|---|
steps_in_process_minutes | u32 | 30 | Minutes a task can have steps in process before being considered stale |
task_max_lifetime_hours | u32 | 24 | Absolute maximum lifetime for any task regardless of state |
waiting_for_dependencies_minutes | u32 | 60 | Minutes a task can wait for step dependencies before being considered stale |
waiting_for_retry_minutes | u32 | 30 | Minutes a task can wait for step retries before being considered stale |
orchestration.dlq.staleness_detection.thresholds.steps_in_process_minutes
Minutes a task can have steps in process before being considered stale
- Type: u32
- Default: 30
- Valid Range: 1-1440
- System Impact: Tasks in StepsInProcess state exceeding this age may have hung workers; flags for investigation
orchestration.dlq.staleness_detection.thresholds.task_max_lifetime_hours
Absolute maximum lifetime for any task regardless of state
- Type: u32
- Default: 24
- Valid Range: 1-168
- System Impact: Hard cap; tasks exceeding this age are considered stale even if actively processing
orchestration.dlq.staleness_detection.thresholds.waiting_for_dependencies_minutes
Minutes a task can wait for step dependencies before being considered stale
- Type: u32
- Default: 60
- Valid Range: 1-1440
- System Impact: Tasks in WaitingForDependencies state exceeding this age are flagged for DLQ consideration
orchestration.dlq.staleness_detection.thresholds.waiting_for_retry_minutes
Minutes a task can wait for step retries before being considered stale
- Type: u32
- Default: 30
- Valid Range: 1-1440
- System Impact: Tasks in WaitingForRetry state exceeding this age are flagged for DLQ consideration
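
When first enabling staleness detection, it can be safer to observe before acting. The sketch below starts in dry-run mode with the documented defaults, so you can verify thresholds against real traffic before letting tasks flow to the DLQ (TOML layout assumed):

```toml
# Sketch: conservative staleness detection rollout.
[orchestration.dlq.staleness_detection]
enabled                    = true
dry_run                    = true  # log what would be DLQ'd; flip to false once the thresholds look right
detection_interval_seconds = 300
batch_size                 = 100

[orchestration.dlq.staleness_detection.thresholds]
waiting_for_dependencies_minutes = 60
waiting_for_retry_minutes        = 30
steps_in_process_minutes         = 30
task_max_lifetime_hours          = 24  # hard cap regardless of state
```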
event_systems
Path: orchestration.event_systems
orchestration
Path: orchestration.event_systems.orchestration
| Parameter | Type | Default | Description |
|---|---|---|---|
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
system_id | String | "orchestration-event-system" | Unique identifier for the orchestration event system instance |
orchestration.event_systems.orchestration.deployment_mode
Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
orchestration.event_systems.orchestration.system_id
Unique identifier for the orchestration event system instance
- Type: String
- Default: "orchestration-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish this event system from others
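
A sketch of the event system root settings with the recommended Hybrid mode (TOML layout assumed); the comments summarize the trade-off stated above:

```toml
# Sketch: orchestration event system delivery mode.
[orchestration.event_systems.orchestration]
system_id       = "orchestration-event-system"
deployment_mode = "Hybrid"  # LISTEN/NOTIFY for low latency, polling as a safety net
# "EventDrivenOnly" drops the polling fallback; "PollingOnly" avoids the LISTEN/NOTIFY dependency at higher latency
```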
health
Path: orchestration.event_systems.orchestration.health
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable health monitoring for the orchestration event system |
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the event system reports as unhealthy |
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the event system reports as unhealthy |
performance_monitoring_enabled | bool | true | Enable detailed performance metrics collection for event processing |
orchestration.event_systems.orchestration.health.enabled
Enable health monitoring for the orchestration event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks or error tracking run for this event system
orchestration.event_systems.orchestration.health.error_rate_threshold_per_minute
Error rate per minute above which the event system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal; complements max_consecutive_errors for burst error detection
orchestration.event_systems.orchestration.health.max_consecutive_errors
Number of consecutive errors before the event system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation after sustained failures; resets on any success
orchestration.event_systems.orchestration.health.performance_monitoring_enabled
Enable detailed performance metrics collection for event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks processing latency percentiles and throughput; adds minor overhead
processing
Path: orchestration.event_systems.orchestration.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 20 | Number of events dequeued in a single batch read |
max_concurrent_operations | u32 | 50 | Maximum number of events processed concurrently by the orchestration event system |
max_retries | u32 | 3 | Maximum retry attempts for a failed event processing operation |
orchestration.event_systems.orchestration.processing.batch_size
Number of events dequeued in a single batch read
- Type: u32
- Default: 20
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput but increase per-batch processing time
orchestration.event_systems.orchestration.processing.max_concurrent_operations
Maximum number of events processed concurrently by the orchestration event system
- Type: u32
- Default: 50
- Valid Range: 1-10000
- System Impact: Controls parallelism for task request, result, and finalization processing
orchestration.event_systems.orchestration.processing.max_retries
Maximum retry attempts for a failed event processing operation
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Events exceeding this retry count are dropped or sent to the DLQ
backoff
Path: orchestration.event_systems.orchestration.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first event processing failure |
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between event processing retries |
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
timing
Path: orchestration.event_systems.orchestration.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
claim_timeout_seconds | u32 | 300 | Maximum time in seconds an event claim remains valid |
fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable |
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the orchestration event system |
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single event |
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued message remains invisible to other consumers |
orchestration.event_systems.orchestration.timing.claim_timeout_seconds
Maximum time in seconds an event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned claims from blocking event processing indefinitely
orchestration.event_systems.orchestration.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Only active in Hybrid mode when event-driven delivery fails; lower values reduce latency but increase DB load
orchestration.event_systems.orchestration.timing.health_check_interval_seconds
Interval in seconds between health check probes for the orchestration event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the event system verifies its own connectivity and responsiveness
orchestration.event_systems.orchestration.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Events exceeding this timeout are considered failed and may be retried
orchestration.event_systems.orchestration.timing.visibility_timeout_seconds
Time in seconds a dequeued message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: If processing is not completed within this window, the message becomes visible again for redelivery
task_readiness
Path: orchestration.event_systems.task_readiness
| Parameter | Type | Default | Description |
|---|---|---|---|
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’ |
system_id | String | "task-readiness-event-system" | Unique identifier for the task readiness event system instance |
orchestration.event_systems.task_readiness.deployment_mode
Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; task readiness events trigger step processing and benefit from low-latency delivery
orchestration.event_systems.task_readiness.system_id
Unique identifier for the task readiness event system instance
- Type: String
- Default: "task-readiness-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish task readiness events from other event systems
health
Path: orchestration.event_systems.task_readiness.health
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable health monitoring for the task readiness event system |
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the task readiness system reports as unhealthy |
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the task readiness system reports as unhealthy |
performance_monitoring_enabled | bool | true | Enable detailed performance metrics for task readiness event processing |
orchestration.event_systems.task_readiness.health.enabled
Enable health monitoring for the task readiness event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks run for task readiness processing
orchestration.event_systems.task_readiness.health.error_rate_threshold_per_minute
Error rate per minute above which the task readiness system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal complementing max_consecutive_errors
orchestration.event_systems.task_readiness.health.max_consecutive_errors
Number of consecutive errors before the task readiness system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation; resets on any successful readiness check
orchestration.event_systems.task_readiness.health.performance_monitoring_enabled
Enable detailed performance metrics for task readiness event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks readiness check latency and throughput; useful for tuning batch_size and concurrency
processing
Path: orchestration.event_systems.task_readiness.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 50 | Number of task readiness events dequeued in a single batch |
max_concurrent_operations | u32 | 100 | Maximum number of task readiness events processed concurrently |
max_retries | u32 | 3 | Maximum retry attempts for a failed task readiness event |
orchestration.event_systems.task_readiness.processing.batch_size
Number of task readiness events dequeued in a single batch
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput for readiness evaluation; 50 balances latency and throughput
orchestration.event_systems.task_readiness.processing.max_concurrent_operations
Maximum number of task readiness events processed concurrently
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Higher than orchestration (100 vs 50) because readiness checks are lightweight SQL queries
orchestration.event_systems.task_readiness.processing.max_retries
Maximum retry attempts for a failed task readiness event
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Readiness events are idempotent so retries are safe; limits retry storms
backoff
Path: orchestration.event_systems.task_readiness.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first task readiness processing failure |
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay for readiness retries |
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds for task readiness retries |
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive readiness failure |
timing
Path: orchestration.event_systems.task_readiness.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
claim_timeout_seconds | u32 | 300 | Maximum time in seconds a task readiness event claim remains valid |
fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles for task readiness |
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the task readiness event system |
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single task readiness event |
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued task readiness message remains invisible to other consumers |
orchestration.event_systems.task_readiness.timing.claim_timeout_seconds
Maximum time in seconds a task readiness event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned readiness claims from blocking task evaluation
orchestration.event_systems.task_readiness.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles for task readiness
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Fallback interval when LISTEN/NOTIFY is unavailable; lower values improve responsiveness
orchestration.event_systems.task_readiness.timing.health_check_interval_seconds
Interval in seconds between health check probes for the task readiness event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the task readiness system verifies its own connectivity
orchestration.event_systems.task_readiness.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single task readiness event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Readiness events exceeding this timeout are considered failed
orchestration.event_systems.task_readiness.timing.visibility_timeout_seconds
Time in seconds a dequeued task readiness message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Prevents duplicate processing of readiness events during normal operation
grpc
Path: orchestration.grpc
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}" | Socket address for the gRPC server |
enable_health_service | bool | true | Enable the gRPC health checking service (grpc.health.v1) |
enable_reflection | bool | true | Enable gRPC server reflection for service discovery |
enabled | bool | true | Enable the gRPC API server alongside the REST API |
keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames |
keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
max_concurrent_streams | u32 | 200 | Maximum number of concurrent gRPC streams per connection |
max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame |
tls_enabled | bool | false | Enable TLS encryption for gRPC connections |
orchestration.grpc.bind_address
Socket address for the gRPC server
- Type: String
- Default: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
- Valid Range: host:port
- System Impact: Must not conflict with the REST API bind_address; default 9190 avoids Prometheus port conflict
orchestration.grpc.enable_health_service
Enable the gRPC health checking service (grpc.health.v1)
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators
orchestration.grpc.enable_reflection
Enable gRPC server reflection for service discovery
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Allows tools like grpcurl to list and inspect services; safe to enable in development, consider disabling in production
orchestration.grpc.enabled
Enable the gRPC API server alongside the REST API
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no gRPC endpoints are available; clients must use REST
orchestration.grpc.keepalive_interval_seconds
Interval in seconds between gRPC keepalive ping frames
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Detects dead connections; lower values detect failures faster but increase network overhead
orchestration.grpc.keepalive_timeout_seconds
Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
- Type: u32
- Default: 20
- Valid Range: 1-300
- System Impact: Connections that fail to acknowledge within this window are considered dead and closed
orchestration.grpc.max_concurrent_streams
Maximum number of concurrent gRPC streams per connection
- Type: u32
- Default: 200
- Valid Range: 1-10000
- System Impact: Limits multiplexed request parallelism per connection; 200 is conservative for orchestration workloads
orchestration.grpc.max_frame_size
Maximum size in bytes of a single HTTP/2 frame
- Type: u32
- Default: 16384
- Valid Range: 16384-16777215
- System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream
orchestration.grpc.tls_enabled
Enable TLS encryption for gRPC connections
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When true, tls_cert_path and tls_key_path must be provided; required for production gRPC deployments
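
A sketch of a TLS-enabled orchestration gRPC block (TOML layout assumed). The tls_cert_path and tls_key_path names come from the note above but are not listed in the parameter table here, and the file paths are illustrative:

```toml
# Sketch: production-leaning orchestration gRPC settings.
[orchestration.grpc]
enabled           = true
bind_address      = "0.0.0.0:9190"
tls_enabled       = true
tls_cert_path     = "/etc/tasker/tls/server.crt"  # illustrative path
tls_key_path      = "/etc/tasker/tls/server.key"  # illustrative path
enable_reflection = false                          # typically disabled in production
```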
mpsc_channels
Path: orchestration.mpsc_channels
command_processor
Path: orchestration.mpsc_channels.command_processor
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 5000 | Bounded channel capacity for the orchestration command processor |
orchestration.mpsc_channels.command_processor.command_buffer_size
Bounded channel capacity for the orchestration command processor
- Type: usize
- Default: 5000
- Valid Range: 100-100000
- System Impact: Buffers incoming orchestration commands; larger values absorb traffic spikes but use more memory
event_listeners
Path: orchestration.mpsc_channels.event_listeners
| Parameter | Type | Default | Description |
|---|---|---|---|
pgmq_event_buffer_size | usize | 50000 | Bounded channel capacity for PGMQ event listener notifications |
orchestration.mpsc_channels.event_listeners.pgmq_event_buffer_size
Bounded channel capacity for PGMQ event listener notifications
- Type: usize
- Default: 50000
- Valid Range: 1000-1000000
- System Impact: Large buffer (50000) absorbs high-volume PGMQ LISTEN/NOTIFY events without backpressure on PostgreSQL
event_systems
Path: orchestration.mpsc_channels.event_systems
| Parameter | Type | Default | Description |
|---|---|---|---|
event_channel_buffer_size | usize | 10000 | Bounded channel capacity for the orchestration event system internal channel |
orchestration.mpsc_channels.event_systems.event_channel_buffer_size
Bounded channel capacity for the orchestration event system internal channel
- Type: usize
- Default: 10000
- Valid Range: 100-100000
- System Impact: Buffers events between the event listener and event processor; larger values absorb notification bursts
web
Path: orchestration.web
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}" | Socket address for the REST API server |
enabled | bool | true | Enable the REST API server for the orchestration service |
request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for an HTTP request to complete before timeout |
orchestration.web.bind_address
Socket address for the REST API server
- Type: String
- Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
- Valid Range: host:port
- System Impact: Determines where the orchestration REST API listens; use 0.0.0.0 for container deployments
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8080 | Standard port; use TASKER_WEB_BIND_ADDRESS env var to override in CI |
| test | 0.0.0.0:8080 | Default port for test fixtures |
orchestration.web.enabled
Enable the REST API server for the orchestration service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no HTTP endpoints are available; the service operates via messaging only
orchestration.web.request_timeout_ms
Maximum time in milliseconds for an HTTP request to complete before timeout
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections
auth
Path: orchestration.web.auth
| Parameter | Type | Default | Description |
|---|---|---|---|
api_key | String | "" | Static API key for simple key-based authentication |
api_key_header | String | "X-API-Key" | HTTP header name for API key authentication |
enabled | bool | false | Enable authentication for the REST API |
jwt_audience | String | "tasker-api" | Expected ‘aud’ claim in JWT tokens |
jwt_issuer | String | "tasker-core" | Expected ‘iss’ claim in JWT tokens |
jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if this service issues tokens) |
jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures |
jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for JWT verification |
jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours |
orchestration.web.auth.api_key
Static API key for simple key-based authentication
- Type: String
- Default: ""
- Valid Range: non-empty string or empty to disable
- System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header
orchestration.web.auth.api_key_header
HTTP header name for API key authentication
- Type: String
- Default: "X-API-Key"
- Valid Range: valid HTTP header name
- System Impact: Clients send their API key in this header; default is X-API-Key
orchestration.web.auth.enabled
Enable authentication for the REST API
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all API endpoints are unauthenticated; enable in production with JWT or API key auth
orchestration.web.auth.jwt_audience
Expected ‘aud’ claim in JWT tokens
- Type: String
- Default: "tasker-api"
- Valid Range: non-empty string
- System Impact: Tokens with a different audience are rejected during validation
orchestration.web.auth.jwt_issuer
Expected ‘iss’ claim in JWT tokens
- Type: String
- Default: "tasker-core"
- Valid Range: non-empty string
- System Impact: Tokens with a different issuer are rejected during validation
orchestration.web.auth.jwt_private_key
PEM-encoded private key for signing JWT tokens (if this service issues tokens)
- Type: String
- Default: ""
- Valid Range: valid PEM private key or empty
- System Impact: Required only if the orchestration service issues its own JWT tokens; leave empty when using external identity providers
orchestration.web.auth.jwt_public_key
PEM-encoded public key for verifying JWT token signatures
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY:-}"
- Valid Range: valid PEM public key or empty
- System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management in production
orchestration.web.auth.jwt_public_key_path
File path to a PEM-encoded public key for JWT verification
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
- Valid Range: valid file path or empty
- System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file
orchestration.web.auth.jwt_token_expiry_hours
Default JWT token validity period in hours
- Type: u32
- Default: 24
- Valid Range: 1-720
- System Impact: Tokens older than this are rejected; shorter values improve security but require more frequent re-authentication
database_pools
Path: orchestration.web.database_pools
| Parameter | Type | Default | Description |
|---|---|---|---|
max_total_connections_hint | u32 | 50 | Advisory hint for the total number of database connections across all orchestration pools |
web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the web API pool |
web_api_idle_timeout_seconds | u32 | 600 | Time before an idle web API connection is closed |
web_api_max_connections | u32 | 30 | Maximum number of connections the web API pool can grow to under load |
web_api_pool_size | u32 | 20 | Target number of connections in the web API database pool |
orchestration.web.database_pools.max_total_connections_hint
Advisory hint for the total number of database connections across all orchestration pools
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint
orchestration.web.database_pools.web_api_connection_timeout_seconds
Maximum time to wait when acquiring a connection from the web API pool
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: API requests that cannot acquire a connection within this window return an error
orchestration.web.database_pools.web_api_idle_timeout_seconds
Time before an idle web API connection is closed
- Type: u32
- Default: 600
- Valid Range: 1-3600
- System Impact: Controls how quickly the web API pool shrinks after traffic subsides
orchestration.web.database_pools.web_api_max_connections
Maximum number of connections the web API pool can grow to under load
- Type: u32
- Default: 30
- Valid Range: 1-500
- System Impact: Hard ceiling for web API database connections; prevents connection exhaustion from traffic spikes
orchestration.web.database_pools.web_api_pool_size
Target number of connections in the web API database pool
- Type: u32
- Default: 20
- Valid Range: 1-200
- System Impact: Determines how many concurrent database queries the REST API can execute
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: worker
90/90 parameters documented
worker
Root-level worker parameters
Path: worker
| Parameter | Type | Default | Description |
|---|---|---|---|
worker_id | String | "worker-default-001" | Unique identifier for this worker instance |
worker_type | String | "general" | Worker type classification for routing and reporting |
worker.worker_id
Unique identifier for this worker instance
- Type: String
- Default: "worker-default-001"
- Valid Range: non-empty string
- System Impact: Used in logging, metrics, and step claim attribution; must be unique across all worker instances in a cluster
worker.worker_type
Worker type classification for routing and reporting
- Type: String
- Default: "general"
- Valid Range: non-empty string
- System Impact: Used to match worker capabilities with step handler requirements; ‘general’ handles all step types
circuit_breakers
Path: worker.circuit_breakers
ffi_completion_send
Path: worker.circuit_breakers.ffi_completion_send
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Number of consecutive FFI completion send failures before the circuit breaker trips |
recovery_timeout_seconds | u32 | 5 | Time the FFI completion breaker stays Open before probing with a test send |
slow_send_threshold_ms | u32 | 100 | Threshold in milliseconds above which FFI completion channel sends are logged as slow |
success_threshold | u32 | 2 | Consecutive successful sends in Half-Open required to close the breaker |
worker.circuit_breakers.ffi_completion_send.failure_threshold
Number of consecutive FFI completion send failures before the circuit breaker trips
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects the FFI completion channel from cascading failures; when tripped, sends are short-circuited
worker.circuit_breakers.ffi_completion_send.recovery_timeout_seconds
Time the FFI completion breaker stays Open before probing with a test send
- Type: u32
- Default: 5
- Valid Range: 1-300
- System Impact: Short timeout (5s) because FFI channel issues are typically transient
worker.circuit_breakers.ffi_completion_send.slow_send_threshold_ms
Threshold in milliseconds above which FFI completion channel sends are logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-10000
- System Impact: Observability: identifies when the FFI completion channel is under pressure from slow consumers
worker.circuit_breakers.ffi_completion_send.success_threshold
Consecutive successful sends in Half-Open required to close the breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Low threshold (2) allows fast recovery since FFI send failures are usually transient
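Taken together, these four parameters drive a conventional three-state breaker (Closed → Open → Half-Open). The sketch below is illustrative only and is not the worker's implementation; the type and method names are invented, and the field names simply mirror the configuration keys above.

use std::time::{Duration, Instant};

// Hypothetical names; these mirror the configuration keys, not real tasker-worker types.
enum BreakerState {
    Closed { consecutive_failures: u32 },
    Open { since: Instant },
    HalfOpen { consecutive_successes: u32 },
}

struct CompletionSendBreaker {
    state: BreakerState,
    failure_threshold: u32,     // failure_threshold (default 5)
    success_threshold: u32,     // success_threshold (default 2)
    recovery_timeout: Duration, // recovery_timeout_seconds (default 5s)
}

impl CompletionSendBreaker {
    fn allow_send(&mut self) -> bool {
        if let BreakerState::Open { since } = self.state {
            if since.elapsed() < self.recovery_timeout {
                return false; // Open: sends are short-circuited
            }
            // Recovery timeout elapsed: move to Half-Open and probe with a test send.
            self.state = BreakerState::HalfOpen { consecutive_successes: 0 };
        }
        true
    }

    fn record_success(&mut self) {
        self.state = match self.state {
            BreakerState::HalfOpen { consecutive_successes }
                if consecutive_successes + 1 >= self.success_threshold =>
            {
                BreakerState::Closed { consecutive_failures: 0 }
            }
            BreakerState::HalfOpen { consecutive_successes } => {
                BreakerState::HalfOpen { consecutive_successes: consecutive_successes + 1 }
            }
            BreakerState::Open { since } => BreakerState::Open { since },
            BreakerState::Closed { .. } => BreakerState::Closed { consecutive_failures: 0 },
        };
    }

    fn record_failure(&mut self) {
        self.state = match self.state {
            BreakerState::Closed { consecutive_failures }
                if consecutive_failures + 1 < self.failure_threshold =>
            {
                BreakerState::Closed { consecutive_failures: consecutive_failures + 1 }
            }
            // Threshold reached, or a failure while probing in Half-Open: trip the breaker.
            _ => BreakerState::Open { since: Instant::now() },
        };
    }
}

fn main() {
    let mut breaker = CompletionSendBreaker {
        state: BreakerState::Closed { consecutive_failures: 0 },
        failure_threshold: 5,
        success_threshold: 2,
        recovery_timeout: Duration::from_secs(5),
    };
    for _ in 0..5 {
        breaker.record_failure();
    }
    // With the defaults, the fifth consecutive failure trips the breaker.
    assert!(!breaker.allow_send());
}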
event_systems
Path: worker.event_systems
worker
Path: worker.event_systems.worker
| Parameter | Type | Default | Description |
|---|---|---|---|
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
system_id | String | "worker-event-system" | Unique identifier for the worker event system instance |
worker.event_systems.worker.deployment_mode
Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
worker.event_systems.worker.system_id
Unique identifier for the worker event system instance
- Type: String
- Default: "worker-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish this event system from others
health
Path: worker.event_systems.worker.health
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable health monitoring for the worker event system |
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the worker event system reports as unhealthy |
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the worker event system reports as unhealthy |
performance_monitoring_enabled | bool | true | Enable detailed performance metrics for worker event processing |
worker.event_systems.worker.health.enabled
Enable health monitoring for the worker event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks or error tracking run for the worker event system
worker.event_systems.worker.health.error_rate_threshold_per_minute
Error rate per minute above which the worker event system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal complementing max_consecutive_errors
worker.event_systems.worker.health.max_consecutive_errors
Number of consecutive errors before the worker event system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation; resets on any successful event processing
worker.event_systems.worker.health.performance_monitoring_enabled
Enable detailed performance metrics for worker event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks step dispatch latency and throughput; useful for tuning concurrency settings
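A minimal sketch of how the two error thresholds can combine into a single health verdict, assuming the signals are OR'd; the constants mirror the parameters above and the helper itself is hypothetical.

// Hypothetical helper illustrating how the two thresholds above can combine.
fn is_unhealthy(consecutive_errors: u32, errors_last_minute: u32) -> bool {
    const MAX_CONSECUTIVE_ERRORS: u32 = 10;          // max_consecutive_errors
    const ERROR_RATE_THRESHOLD_PER_MINUTE: u32 = 20; // error_rate_threshold_per_minute
    consecutive_errors >= MAX_CONSECUTIVE_ERRORS
        || errors_last_minute > ERROR_RATE_THRESHOLD_PER_MINUTE
}

fn main() {
    assert!(is_unhealthy(10, 0)); // consecutive-error trip
    assert!(is_unhealthy(0, 25)); // rate-based trip
    assert!(!is_unhealthy(3, 5)); // healthy
}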
metadata
Path: worker.event_systems.worker.metadata
fallback_poller
Path: worker.event_systems.worker.metadata.fallback_poller
| Parameter | Type | Default | Description |
|---|---|---|---|
age_threshold_seconds | u32 | 5 | Minimum age in seconds of a message before the fallback poller will pick it up |
batch_size | u32 | 20 | Number of messages to dequeue in a single fallback poll |
enabled | bool | true | Enable the fallback polling mechanism for step dispatch |
max_age_hours | u32 | 24 | Maximum age in hours of messages the fallback poller will process |
polling_interval_ms | u32 | 1000 | Interval in milliseconds between fallback polling cycles |
supported_namespaces | Vec<String> | [] | List of queue namespaces the fallback poller monitors; empty means all namespaces |
visibility_timeout_seconds | u32 | 30 | Time in seconds a message polled by the fallback mechanism remains invisible |
in_process_events
Path: worker.event_systems.worker.metadata.in_process_events
| Parameter | Type | Default | Description |
|---|---|---|---|
deduplication_cache_size | usize | 10000 | Number of event IDs to cache for deduplication of in-process events |
ffi_integration_enabled | bool | true | Enable FFI integration for in-process event delivery to Ruby/Python workers |
listener
Path: worker.event_systems.worker.metadata.listener
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_processing | bool | true | Enable batch processing of accumulated LISTEN/NOTIFY events |
connection_timeout_seconds | u32 | 30 | Maximum time to wait when establishing the LISTEN/NOTIFY PostgreSQL connection |
event_timeout_seconds | u32 | 60 | Maximum time to wait for a LISTEN/NOTIFY event before yielding |
max_retry_attempts | u32 | 5 | Maximum number of listener reconnection attempts before falling back to polling |
retry_interval_seconds | u32 | 5 | Interval in seconds between LISTEN/NOTIFY listener reconnection attempts |
processing
Path: worker.event_systems.worker.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 20 | Number of events dequeued in a single batch read by the worker |
max_concurrent_operations | u32 | 100 | Maximum number of events processed concurrently by the worker event system |
max_retries | u32 | 3 | Maximum retry attempts for a failed worker event processing operation |
worker.event_systems.worker.processing.batch_size
Number of events dequeued in a single batch read by the worker
- Type: u32
- Default: 20
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput but increase per-batch processing time
worker.event_systems.worker.processing.max_concurrent_operations
Maximum number of events processed concurrently by the worker event system
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Controls parallelism for step dispatch and completion processing
worker.event_systems.worker.processing.max_retries
Maximum retry attempts for a failed worker event processing operation
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Events exceeding this retry count are dropped or sent to the DLQ
backoff
Path: worker.event_systems.worker.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first worker event processing failure |
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between worker event retries |
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
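These four values describe a capped exponential backoff curve. The sketch below shows one common way to combine them; the exact formula and jitter handling inside the worker are not shown in this reference, so treat the helper as an assumption.

// Hypothetical helper: computes the capped exponential delay for the n-th retry
// (attempt 0 = first failure) and the maximum jitter that may be added on top.
fn backoff_delay_ms(attempt: u32) -> (u64, u64) {
    const INITIAL_DELAY_MS: u64 = 100; // initial_delay_ms
    const MAX_DELAY_MS: u64 = 10_000;  // max_delay_ms
    const MULTIPLIER: f64 = 2.0;       // multiplier
    const JITTER_PERCENT: f64 = 0.1;   // jitter_percent

    let raw = INITIAL_DELAY_MS as f64 * MULTIPLIER.powi(attempt as i32);
    let capped = raw.min(MAX_DELAY_MS as f64);
    let max_jitter = capped * JITTER_PERCENT;
    (capped as u64, max_jitter as u64)
}

fn main() {
    // Defaults give 100ms, 200ms, 400ms, ... capped at 10s, each with up to 10% jitter.
    for attempt in 0..8 {
        let (delay, jitter) = backoff_delay_ms(attempt);
        println!("attempt {attempt}: {delay}ms (+ up to {jitter}ms jitter)");
    }
}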
timing
Path: worker.event_systems.worker.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
claim_timeout_seconds | u32 | 300 | Maximum time in seconds a worker event claim remains valid |
fallback_polling_interval_seconds | u32 | 2 | Interval in seconds between fallback polling cycles for step dispatch |
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the worker event system |
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single worker event |
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued step dispatch message remains invisible to other workers |
worker.event_systems.worker.timing.claim_timeout_seconds
Maximum time in seconds a worker event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned claims from blocking step processing indefinitely
worker.event_systems.worker.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles for step dispatch
- Type: u32
- Default: 2
- Valid Range: 1-60
- System Impact: Shorter than orchestration (2s vs 5s) because workers need fast step pickup for low latency
worker.event_systems.worker.timing.health_check_interval_seconds
Interval in seconds between health check probes for the worker event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the worker event system verifies its own connectivity
worker.event_systems.worker.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single worker event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Events exceeding this timeout are considered failed and may be retried
worker.event_systems.worker.timing.visibility_timeout_seconds
Time in seconds a dequeued step dispatch message remains invisible to other workers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Prevents duplicate step execution; must be longer than typical step processing time
grpc
Path: worker.grpc
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}" | Socket address for the worker gRPC server |
enable_health_service | bool | true | Enable the gRPC health checking service on the worker |
enable_reflection | bool | true | Enable gRPC server reflection for the worker service |
enabled | bool | true | Enable the gRPC API server for the worker service |
keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames on the worker |
keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
max_concurrent_streams | u32 | 1000 | Maximum number of concurrent gRPC streams per connection |
max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server |
tls_enabled | bool | false | Enable TLS encryption for worker gRPC connections |
worker.grpc.bind_address
Socket address for the worker gRPC server
- Type: String
- Default: "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}"
- Valid Range: host:port
- System Impact: Must not conflict with the REST API or orchestration gRPC ports; default 9191
worker.grpc.enable_health_service
Enable the gRPC health checking service on the worker
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators
worker.grpc.enable_reflection
Enable gRPC server reflection for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Allows tools like grpcurl to list and inspect worker services; consider disabling in production
worker.grpc.enabled
Enable the gRPC API server for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no gRPC endpoints are available; clients must use REST
worker.grpc.keepalive_interval_seconds
Interval in seconds between gRPC keepalive ping frames on the worker
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Detects dead connections; lower values detect failures faster but increase network overhead
worker.grpc.keepalive_timeout_seconds
Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
- Type: u32
- Default: 20
- Valid Range: 1-300
- System Impact: Connections that fail to acknowledge within this window are considered dead and closed
worker.grpc.max_concurrent_streams
Maximum number of concurrent gRPC streams per connection
- Type: u32
- Default: 1000
- Valid Range: 1-10000
- System Impact: Workers typically handle more concurrent streams than orchestration; default 1000 reflects this
worker.grpc.max_frame_size
Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server
- Type: u32
- Default: 16384
- Valid Range: 16384-16777215
- System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream
worker.grpc.tls_enabled
Enable TLS encryption for worker gRPC connections
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When true, TLS cert and key paths must be provided; required for production gRPC deployments
mpsc_channels
Path: worker.mpsc_channels
command_processor
Path: worker.mpsc_channels.command_processor
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 2000 | Bounded channel capacity for the worker command processor |
worker.mpsc_channels.command_processor.command_buffer_size
Bounded channel capacity for the worker command processor
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Buffers incoming worker commands; smaller than orchestration (2000 vs 5000) since workers process fewer command types
domain_events
Path: worker.mpsc_channels.domain_events
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 1000 | Bounded channel capacity for domain event system commands |
log_dropped_events | bool | true | Log a warning when domain events are dropped due to channel saturation |
shutdown_drain_timeout_ms | u32 | 5000 | Maximum time in milliseconds to drain pending domain events during shutdown |
worker.mpsc_channels.domain_events.command_buffer_size
Bounded channel capacity for domain event system commands
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers domain event system control commands such as publish, subscribe, and shutdown
worker.mpsc_channels.domain_events.log_dropped_events
Log a warning when domain events are dropped due to channel saturation
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Observability: helps detect when event volume exceeds channel capacity
worker.mpsc_channels.domain_events.shutdown_drain_timeout_ms
Maximum time in milliseconds to drain pending domain events during shutdown
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Ensures in-flight domain events are delivered before the worker exits; prevents event loss
event_listeners
Path: worker.mpsc_channels.event_listeners
| Parameter | Type | Default | Description |
|---|---|---|---|
pgmq_event_buffer_size | usize | 10000 | Bounded channel capacity for PGMQ event listener notifications on the worker |
worker.mpsc_channels.event_listeners.pgmq_event_buffer_size
Bounded channel capacity for PGMQ event listener notifications on the worker
- Type: usize
- Default: 10000
- Valid Range: 1000-1000000
- System Impact: Buffers PGMQ LISTEN/NOTIFY events; smaller than orchestration (10000 vs 50000) since workers handle fewer event types
event_subscribers
Path: worker.mpsc_channels.event_subscribers
| Parameter | Type | Default | Description |
|---|---|---|---|
completion_buffer_size | usize | 1000 | Bounded channel capacity for step completion event subscribers |
result_buffer_size | usize | 1000 | Bounded channel capacity for step result event subscribers |
worker.mpsc_channels.event_subscribers.completion_buffer_size
Bounded channel capacity for step completion event subscribers
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step completion notifications before they are forwarded to the orchestration service
worker.mpsc_channels.event_subscribers.result_buffer_size
Bounded channel capacity for step result event subscribers
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step execution results before they are published to the result queue
event_systems
Path: worker.mpsc_channels.event_systems
| Parameter | Type | Default | Description |
|---|---|---|---|
event_channel_buffer_size | usize | 2000 | Bounded channel capacity for the worker event system internal channel |
worker.mpsc_channels.event_systems.event_channel_buffer_size
Bounded channel capacity for the worker event system internal channel
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Buffers events between the listener and processor; sized for worker-level throughput
ffi_dispatch
Path: worker.mpsc_channels.ffi_dispatch
| Parameter | Type | Default | Description |
|---|---|---|---|
callback_timeout_ms | u32 | 5000 | Maximum time in milliseconds for FFI fire-and-forget domain event callbacks |
completion_send_timeout_ms | u32 | 10000 | Maximum time in milliseconds to retry sending FFI completion results when the channel is full |
completion_timeout_ms | u32 | 30000 | Maximum time in milliseconds to wait for an FFI step handler to complete |
dispatch_buffer_size | usize | 1000 | Bounded channel capacity for FFI step dispatch requests |
starvation_warning_threshold_ms | u32 | 10000 | Age in milliseconds of pending FFI events that triggers a starvation warning |
worker.mpsc_channels.ffi_dispatch.callback_timeout_ms
Maximum time in milliseconds for FFI fire-and-forget domain event callbacks
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Prevents indefinite blocking of FFI threads during domain event publishing
worker.mpsc_channels.ffi_dispatch.completion_send_timeout_ms
Maximum time in milliseconds to retry sending FFI completion results when the channel is full
- Type: u32
- Default: 10000
- Valid Range: 1000-300000
- System Impact: Uses try_send with retry loop instead of blocking send to prevent deadlocks
worker.mpsc_channels.ffi_dispatch.completion_timeout_ms
Maximum time in milliseconds to wait for an FFI step handler to complete
- Type: u32
- Default: 30000
- Valid Range: 1000-600000
- System Impact: FFI handlers exceeding this timeout are considered failed; guards against hung FFI threads
worker.mpsc_channels.ffi_dispatch.dispatch_buffer_size
Bounded channel capacity for FFI step dispatch requests
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step execution requests destined for Ruby/Python FFI handlers
worker.mpsc_channels.ffi_dispatch.starvation_warning_threshold_ms
Age in milliseconds of pending FFI events that triggers a starvation warning
- Type: u32
- Default: 10000
- Valid Range: 1000-300000
- System Impact: Proactive detection of FFI channel starvation before completion_timeout_ms is reached
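The completion-send path is described above as a try_send retry loop rather than a blocking send. Below is a minimal sketch of that pattern on a tokio mpsc channel; the helper name, retry cadence, and test harness are assumptions rather than the worker's actual code, and it assumes a tokio dependency with the full feature set.

use std::time::Duration;
use tokio::sync::mpsc::{self, error::TrySendError};
use tokio::time::{sleep, Instant};

// Hypothetical sketch of "try_send with retry loop": returns Ok if the value was
// accepted before the deadline (completion_send_timeout_ms) elapsed.
async fn send_with_retry<T>(tx: &mpsc::Sender<T>, mut value: T, timeout: Duration) -> Result<(), T> {
    let deadline = Instant::now() + timeout;
    loop {
        match tx.try_send(value) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Closed(v)) => return Err(v), // receiver gone: give up
            Err(TrySendError::Full(v)) => {
                if Instant::now() >= deadline {
                    return Err(v); // deadline exceeded, surface the failure
                }
                value = v;
                sleep(Duration::from_millis(10)).await; // back off briefly, then retry
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(1);
    tx.send(1).await.unwrap(); // fill the channel so the first try_send fails
    let sender = tokio::spawn({
        let tx = tx.clone();
        async move { send_with_retry(&tx, 2, Duration::from_millis(200)).await }
    });
    sleep(Duration::from_millis(50)).await;
    assert_eq!(rx.recv().await, Some(1)); // drain one slot so the retry succeeds
    assert!(sender.await.unwrap().is_ok());
}

Because the loop never blocks the sending thread, a saturated channel degrades into retries and an eventual timeout rather than a deadlock.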
handler_dispatch
Path: worker.mpsc_channels.handler_dispatch
| Parameter | Type | Default | Description |
|---|---|---|---|
completion_buffer_size | usize | 1000 | Bounded channel capacity for step handler completion notifications |
dispatch_buffer_size | usize | 1000 | Bounded channel capacity for step handler dispatch requests |
handler_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a step handler to complete execution |
max_concurrent_handlers | u32 | 10 | Maximum number of step handlers executing simultaneously |
worker.mpsc_channels.handler_dispatch.completion_buffer_size
Bounded channel capacity for step handler completion notifications
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers handler completion results before they are forwarded to the result processor
worker.mpsc_channels.handler_dispatch.dispatch_buffer_size
Bounded channel capacity for step handler dispatch requests
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers incoming step execution requests before handler assignment
worker.mpsc_channels.handler_dispatch.handler_timeout_ms
Maximum time in milliseconds for a step handler to complete execution
- Type: u32
- Default: 30000
- Valid Range: 1000-600000
- System Impact: Handlers exceeding this timeout are cancelled; prevents hung handlers from consuming capacity
worker.mpsc_channels.handler_dispatch.max_concurrent_handlers
Maximum number of step handlers executing simultaneously
- Type: u32
- Default: 10
- Valid Range: 1-10000
- System Impact: Controls per-worker parallelism; bounded by the handler dispatch semaphore
load_shedding
Path: worker.mpsc_channels.handler_dispatch.load_shedding
| Parameter | Type | Default | Description |
|---|---|---|---|
capacity_threshold_percent | f64 | 80.0 | Handler capacity percentage above which new step claims are refused |
enabled | bool | true | Enable load shedding to refuse step claims when handler capacity is nearly exhausted |
warning_threshold_percent | f64 | 70.0 | Handler capacity percentage at which warning logs are emitted |
worker.mpsc_channels.handler_dispatch.load_shedding.capacity_threshold_percent
Handler capacity percentage above which new step claims are refused
- Type: f64
- Default: 80.0
- Valid Range: 0.0-100.0
- System Impact: At 80%, the worker stops accepting new steps when 80% of max_concurrent_handlers are busy
worker.mpsc_channels.handler_dispatch.load_shedding.enabled
Enable load shedding to refuse step claims when handler capacity is nearly exhausted
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, the worker refuses new step claims above the capacity threshold to prevent overload
worker.mpsc_channels.handler_dispatch.load_shedding.warning_threshold_percent
Handler capacity percentage at which warning logs are emitted
- Type: f64
- Default: 70.0
- Valid Range: 0.0-100.0
- System Impact: Observability: alerts operators that the worker is approaching its capacity limit
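With the defaults above and handler_dispatch.max_concurrent_handlers = 10, the worker would warn at 7 busy handlers and refuse new claims at 8. A hypothetical sketch of that decision follows; whether the real check uses >= or > at the exact threshold is an assumption.

// Hypothetical check combining the load-shedding thresholds with max_concurrent_handlers.
fn claim_decision(busy_handlers: u32, max_concurrent_handlers: u32) -> &'static str {
    const CAPACITY_THRESHOLD_PERCENT: f64 = 80.0; // refuse new claims above this
    const WARNING_THRESHOLD_PERCENT: f64 = 70.0;  // log a warning above this

    let utilization = busy_handlers as f64 / max_concurrent_handlers as f64 * 100.0;
    if utilization >= CAPACITY_THRESHOLD_PERCENT {
        "refuse claim"
    } else if utilization >= WARNING_THRESHOLD_PERCENT {
        "accept claim, emit warning"
    } else {
        "accept claim"
    }
}

fn main() {
    assert_eq!(claim_decision(6, 10), "accept claim");
    assert_eq!(claim_decision(7, 10), "accept claim, emit warning");
    assert_eq!(claim_decision(8, 10), "refuse claim");
}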
in_process_events
Path: worker.mpsc_channels.in_process_events
| Parameter | Type | Default | Description |
|---|---|---|---|
broadcast_buffer_size | usize | 2000 | Bounded broadcast channel capacity for in-process domain event delivery |
dispatch_timeout_ms | u32 | 5000 | Maximum time in milliseconds to wait when dispatching an in-process event |
log_subscriber_errors | bool | true | Log errors when in-process event subscribers fail to receive events |
worker.mpsc_channels.in_process_events.broadcast_buffer_size
Bounded broadcast channel capacity for in-process domain event delivery
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Controls how many domain events can be buffered before slow subscribers cause backpressure
worker.mpsc_channels.in_process_events.dispatch_timeout_ms
Maximum time in milliseconds to wait when dispatching an in-process event
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Prevents event dispatch from blocking indefinitely if all subscribers are slow
worker.mpsc_channels.in_process_events.log_subscriber_errors
Log errors when in-process event subscribers fail to receive events
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Observability: helps identify slow or failing event subscribers
orchestration_client
Path: worker.orchestration_client
| Parameter | Type | Default | Description |
|---|---|---|---|
base_url | String | "http://localhost:8080" | Base URL of the orchestration REST API that this worker reports to |
max_retries | u32 | 3 | Maximum retry attempts for failed orchestration API calls |
timeout_ms | u32 | 30000 | HTTP request timeout in milliseconds for orchestration API calls |
worker.orchestration_client.base_url
Base URL of the orchestration REST API that this worker reports to
- Type: String
- Default: "http://localhost:8080"
- Valid Range: valid HTTP(S) URL
- System Impact: Workers send step completion results and health reports to this endpoint
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | http://orchestration:8080 | Container-internal DNS in Kubernetes/Docker |
| test | http://localhost:8080 | Local orchestration for testing |
Related: orchestration.web.bind_address
worker.orchestration_client.max_retries
Maximum retry attempts for failed orchestration API calls
- Type: u32
- Default: 3
- Valid Range: 0-10
- System Impact: Retries use backoff; higher values improve resilience to transient network issues
worker.orchestration_client.timeout_ms
HTTP request timeout in milliseconds for orchestration API calls
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Worker-to-orchestration calls exceeding this timeout fail and may be retried
web
Path: worker.web
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}" | Socket address for the worker REST API server |
enabled | bool | true | Enable the REST API server for the worker service |
request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a worker HTTP request to complete before timeout |
worker.web.bind_address
Socket address for the worker REST API server
- Type: String
- Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}"
- Valid Range: host:port
- System Impact: Must not conflict with orchestration.web.bind_address when co-located; default 8081
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8081 | Standard worker port; use TASKER_WEB_BIND_ADDRESS env var to override |
| test | 0.0.0.0:8081 | Default port offset from orchestration (8080) |
worker.web.enabled
Enable the REST API server for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no HTTP endpoints are available; the worker operates via messaging only
worker.web.request_timeout_ms
Maximum time in milliseconds for a worker HTTP request to complete before timeout
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections
auth
Path: worker.web.auth
| Parameter | Type | Default | Description |
|---|---|---|---|
api_key | String | "" | Static API key for simple key-based authentication on the worker API |
api_key_header | String | "X-API-Key" | HTTP header name for API key authentication on the worker API |
enabled | bool | false | Enable authentication for the worker REST API |
jwt_audience | String | "worker-api" | Expected ‘aud’ claim in JWT tokens for the worker API |
jwt_issuer | String | "tasker-worker" | Expected ‘iss’ claim in JWT tokens for the worker API |
jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if the worker issues tokens) |
jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures on the worker API |
jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for worker JWT verification |
jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours for worker API tokens |
worker.web.auth.api_key
Static API key for simple key-based authentication on the worker API
- Type: String
- Default: ""
- Valid Range: non-empty string or empty to disable
- System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header
worker.web.auth.api_key_header
HTTP header name for API key authentication on the worker API
- Type: String
- Default: "X-API-Key"
- Valid Range: valid HTTP header name
- System Impact: Clients send their API key in this header; default is X-API-Key
worker.web.auth.enabled
Enable authentication for the worker REST API
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all worker API endpoints are unauthenticated
worker.web.auth.jwt_audience
Expected ‘aud’ claim in JWT tokens for the worker API
- Type: String
- Default: "worker-api"
- Valid Range: non-empty string
- System Impact: Tokens with a different audience are rejected during validation
worker.web.auth.jwt_issuer
Expected ‘iss’ claim in JWT tokens for the worker API
- Type: String
- Default: "tasker-worker"
- Valid Range: non-empty string
- System Impact: Tokens with a different issuer are rejected; default ‘tasker-worker’ distinguishes worker tokens from orchestration tokens
worker.web.auth.jwt_private_key
PEM-encoded private key for signing JWT tokens (if the worker issues tokens)
- Type: String
- Default: ""
- Valid Range: valid PEM private key or empty
- System Impact: Required only if the worker service issues its own JWT tokens; typically empty
worker.web.auth.jwt_public_key
PEM-encoded public key for verifying JWT token signatures on the worker API
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY:-}"
- Valid Range: valid PEM public key or empty
- System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management
worker.web.auth.jwt_public_key_path
File path to a PEM-encoded public key for worker JWT verification
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
- Valid Range: valid file path or empty
- System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file
worker.web.auth.jwt_token_expiry_hours
Default JWT token validity period in hours for worker API tokens
- Type: u32
- Default: 24
- Valid Range: 1-720
- System Impact: Tokens older than this are rejected; shorter values improve security
database_pools
Path: worker.web.database_pools
| Parameter | Type | Default | Description |
|---|---|---|---|
max_total_connections_hint | u32 | 25 | Advisory hint for the total number of database connections across all worker pools |
web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the worker web API pool |
web_api_idle_timeout_seconds | u32 | 600 | Time before an idle worker web API connection is closed |
web_api_max_connections | u32 | 15 | Maximum number of connections the worker web API pool can grow to under load |
web_api_pool_size | u32 | 10 | Target number of connections in the worker web API database pool |
worker.web.database_pools.max_total_connections_hint
Advisory hint for the total number of database connections across all worker pools
- Type: u32
- Default: 25
- Valid Range: 1-1000
- System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint
worker.web.database_pools.web_api_connection_timeout_seconds
Maximum time to wait when acquiring a connection from the worker web API pool
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: Worker API requests that cannot acquire a connection within this window return an error
worker.web.database_pools.web_api_idle_timeout_seconds
Time before an idle worker web API connection is closed
- Type: u32
- Default: 600
- Valid Range: 1-3600
- System Impact: Controls how quickly the worker web API pool shrinks after traffic subsides
worker.web.database_pools.web_api_max_connections
Maximum number of connections the worker web API pool can grow to under load
- Type: u32
- Default: 15
- Valid Range: 1-500
- System Impact: Hard ceiling for worker web API database connections
worker.web.database_pools.web_api_pool_size
Target number of connections in the worker web API database pool
- Type: u32
- Default: 10
- Valid Range: 1-200
- System Impact: Determines how many concurrent database queries the worker REST API can execute; smaller than orchestration
Generated by tasker-ctl docs — Tasker Configuration System
Authentication & Authorization
API-level security for Tasker’s orchestration and worker HTTP endpoints, providing JWT bearer token and API key authentication with permission-based access control.
Architecture
┌──────────────────────────────┐
Request ──► Middleware │ SecurityService │
(per-route) │ ├─ JwtAuthenticator │
│ ├─ JwksKeyStore (optional) │
│ └─ ApiKeyRegistry (optional) │
└───────────┬──────────────────┘
│
▼
SecurityContext
(injected into request extensions)
│
▼
┌───────────────────────┐
│ authorize() wrapper │
│ Resource + Action │
└───────────┬───────────┘
│
┌─────────┴─────────┐
▼ ▼
Body parsing 403 (denied)
│
▼
Handler body
│
▼
200 (success)
Key Components
| Component | Location | Role |
|---|---|---|
SecurityService | tasker-shared/src/services/security_service.rs | Unified auth backend: validates JWTs (static key or JWKS) and API keys |
SecurityContext | tasker-shared/src/types/security.rs | Per-request identity + permissions, extracted by handlers |
Permission enum | tasker-shared/src/types/permissions.rs | Compile-time permission vocabulary (resource:action) |
Resource, Action | tasker-shared/src/types/resources.rs | Resource-based authorization types |
authorize() wrapper | tasker-shared/src/web/authorize.rs | Handler wrapper for declarative permission checks |
| Auth middleware | */src/web/middleware/auth.rs | Axum middleware injecting SecurityContext |
require_permission() | */src/web/middleware/permission.rs | Legacy per-handler permission gate (still available) |
Request Flow
- Middleware (conditional_auth) runs on protected routes
- If auth is disabled → injects SecurityContext::disabled_context() (all permissions)
- If auth is enabled → extracts the Bearer token or API key from headers
- SecurityService validates credentials and returns a SecurityContext
- authorize() wrapper checks the permission BEFORE body deserialization → 403 if denied
- Body deserialization and handler execution proceed if authorized
Route Layers
Routes are split into public (never require auth) and protected (auth middleware applied):
Orchestration (port 8080):
- Public: /health/*, /metrics, /api-docs/*
- Protected: /v1/*, /config (opt-in)
Worker (port 8081):
- Public: /health/*, /metrics, /api-docs/*
- Protected: /v1/templates/*, /config (opt-in)
Quick Start
# 1. Generate RSA key pair
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys
# 2. Generate a token
cargo run --bin tasker-ctl -- auth generate-token \
--private-key ./keys/jwt-private-key.pem \
--permissions "tasks:create,tasks:read,tasks:list" \
--subject my-service \
--expiry-hours 24
# 3. Enable auth in config (orchestration.toml)
# [web.auth]
# enabled = true
# jwt_public_key_path = "./keys/jwt-public-key.pem"
# 4. Use the token
curl -H "Authorization: Bearer <token>" http://localhost:8080/v1/tasks
Documentation Index
| Document | Contents |
|---|---|
| Permissions | Permission vocabulary, route mapping, wildcards, role patterns |
| Configuration | TOML config, environment variables, deployment patterns |
| Testing | E2E test infrastructure, cargo-make tasks, writing auth tests |
Cross-References
| Document | Contents |
|---|---|
| API Security Guide | Quick start, CLI commands, error responses, observability |
| Auth Integration Guide | JWKS, Auth0, Keycloak, Okta configuration |
Design Decisions
Auth Disabled by Default
Security is opt-in (enabled = false default). Existing deployments are unaffected. When disabled, all handlers receive a SecurityContext with AuthMethod::Disabled and permissions: ["*"].
Config Endpoint Opt-In
The /config endpoint exposes runtime configuration (secrets redacted). It is controlled by a separate toggle (config_endpoint_enabled, default false). When disabled, the route is not registered (404, not 401).
Resource-Based Authorization
Permission checks happen at the route level via authorize() wrappers BEFORE body deserialization:
.route("/tasks", post(authorize(Resource::Tasks, Action::Create, create_task)))
This approach:
- Rejects unauthorized requests before parsing request bodies
- Provides a declarative, visible permission model at the route level
- Is protocol-agnostic (the same Resource/Action types work for REST and gRPC)
- Documents permissions in OpenAPI via x-required-permission extensions
The legacy require_permission() function is still available for cases where permission checks need to happen inside handler logic.
Credential Priority (Client)
The tasker-client library resolves credentials in this order (sketched below):
- Endpoint-specific token (TASKER_ORCHESTRATION_AUTH_TOKEN / TASKER_WORKER_AUTH_TOKEN)
- Global token (TASKER_AUTH_TOKEN)
- API key (TASKER_API_KEY)
- JWT generation from private key (if configured)
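A minimal sketch of that priority order using plain environment lookups; it is illustrative only and not the tasker-client implementation (step 4, generating a JWT from a private key, is elided).

use std::env;

// Illustrative only: mirrors the documented priority order.
enum Credential {
    BearerToken(String),
    ApiKey(String),
}

fn resolve_orchestration_credential() -> Option<Credential> {
    // 1. Endpoint-specific token beats everything else.
    if let Ok(token) = env::var("TASKER_ORCHESTRATION_AUTH_TOKEN") {
        return Some(Credential::BearerToken(token));
    }
    // 2. Global token shared by both APIs.
    if let Ok(token) = env::var("TASKER_AUTH_TOKEN") {
        return Some(Credential::BearerToken(token));
    }
    // 3. API key, sent via the configured header (default X-API-Key).
    if let Ok(key) = env::var("TASKER_API_KEY") {
        return Some(Credential::ApiKey(key));
    }
    // 4. Fall back to generating a JWT from TASKER_JWT_PRIVATE_KEY_PATH (not shown here).
    None
}

fn main() {
    match resolve_orchestration_credential() {
        Some(Credential::BearerToken(_)) => println!("using Authorization: Bearer <token>"),
        Some(Credential::ApiKey(_)) => println!("using the API key header"),
        None => println!("no credential configured"),
    }
}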
Known Limitations
- Body-before-permission ordering for POST/PATCH endpoints — resolved by resource-based authorization
- No token refresh — tokens are stateless; clients must generate new tokens before expiry
- API keys have no expiration — rotate by removing from config and redeploying
Configuration Reference
Complete configuration for Tasker authentication: server-side TOML, environment variables, and client settings.
Server-Side Configuration
Auth config lives under [web.auth] in both orchestration and worker TOML files.
Location
config/tasker/base/orchestration.toml → [web.auth]
config/tasker/base/worker.toml → [web.auth]
config/tasker/environments/{env}/... → environment overrides
Configuration follows the role-based structure (see Configuration Management).
Full Reference
[web]
# Whether the /config endpoint is registered (default: false).
# When false, GET /config returns 404. When true, requires system:config_read permission.
config_endpoint_enabled = false
[web.auth]
# Master switch (default: false). When disabled, all routes are accessible without credentials.
enabled = false
# --- JWT Configuration ---
# Token issuer claim (validated against incoming tokens)
jwt_issuer = "tasker-core"
# Token audience claim (validated against incoming tokens)
jwt_audience = "tasker-api"
# Token expiry for generated tokens (via CLI)
jwt_token_expiry_hours = 24
# Verification method: "public_key" (static RSA key) or "jwks" (dynamic key rotation)
jwt_verification_method = "public_key"
# Static public key (one of these, path takes precedence):
jwt_public_key_path = "/etc/tasker/keys/jwt-public-key.pem"
jwt_public_key = "" # Inline PEM string (use path instead for production)
# Private key (for token generation only, not needed for verification):
jwt_private_key = ""
# --- JWKS Configuration (when jwt_verification_method = "jwks") ---
# JWKS endpoint URL
jwks_url = "https://auth.example.com/.well-known/jwks.json"
# How often to refresh the key set (seconds)
jwks_refresh_interval_seconds = 3600
# --- Permission Validation ---
# JWT claim name containing the permissions array
permissions_claim = "permissions"
# Reject tokens with unrecognized permission strings
strict_validation = true
# Log unrecognized permissions even when strict_validation = false
log_unknown_permissions = true
# --- API Key Authentication ---
# Header name for API key authentication
api_key_header = "X-API-Key"
# Enable multi-key registry (default: false)
api_keys_enabled = false
# API key registry (multiple keys with individual permissions)
[[web.auth.api_keys]]
key = "sk-prod-monitoring-key"
permissions = ["tasks:read", "tasks:list", "dlq:read", "dlq:stats"]
description = "Production monitoring service"
[[web.auth.api_keys]]
key = "sk-prod-admin-key"
permissions = ["*"]
description = "Production admin"
Environment Variables
Server-Side
| Variable | Description | Overrides |
|---|---|---|
TASKER_JWT_PUBLIC_KEY_PATH | Path to RSA public key PEM file | web.auth.jwt_public_key_path |
TASKER_JWT_PUBLIC_KEY | Inline PEM public key | web.auth.jwt_public_key |
These override TOML values via the config loader’s environment interpolation.
Client-Side
| Variable | Priority | Description |
|---|---|---|
TASKER_ORCHESTRATION_AUTH_TOKEN | 1 (highest) | Bearer token for orchestration API only |
TASKER_WORKER_AUTH_TOKEN | 1 (highest) | Bearer token for worker API only |
TASKER_AUTH_TOKEN | 2 | Bearer token for both APIs |
TASKER_API_KEY | 3 | API key (sent via configured header) |
TASKER_API_KEY_HEADER | — | Custom header name (default: X-API-Key) |
TASKER_JWT_PRIVATE_KEY_PATH | 4 (lowest) | Private key for on-demand token generation |
The tasker-client library checks these in priority order and uses the first available credential.
Deployment Patterns
Development (Auth Disabled)
[web.auth]
enabled = false
All endpoints accessible without credentials. Default behavior.
Development (Auth Enabled, Static Key)
[web.auth]
enabled = true
jwt_verification_method = "public_key"
jwt_public_key_path = "./keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
strict_validation = false
[[web.auth.api_keys]]
key = "dev-key"
permissions = ["*"]
description = "Dev superuser key"
Production (JWKS + API Keys)
[web.auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://auth.company.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://auth.company.com/"
jwt_audience = "tasker-api"
strict_validation = true
log_unknown_permissions = true
api_keys_enabled = true
api_key_header = "X-API-Key"
[[web.auth.api_keys]]
key = "sk-monitoring-prod"
permissions = ["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]
description = "Monitoring service"
[[web.auth.api_keys]]
key = "sk-submitter-prod"
permissions = ["tasks:create", "tasks:read", "tasks:list"]
description = "Task submission service"
Production (Config Endpoint Enabled)
[web]
config_endpoint_enabled = true
[web.auth]
enabled = true
# ... auth config ...
Exposes GET /config (requires system:config_read permission). Secrets are redacted in the response.
Key Management
Generating Keys
# Generate 2048-bit RSA key pair
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys --key-size 2048
# Output:
# keys/jwt-private-key.pem (keep secret, used for token generation)
# keys/jwt-public-key.pem (distribute to servers for verification)
Key Rotation (Static Key)
- Generate a new key pair
- Update jwt_public_key_path in server config
- Restart servers
- Re-generate tokens with the new private key
- Old tokens become invalid immediately
Key Rotation (JWKS)
Handled automatically by the identity provider. Tasker refreshes keys on:
- Timer interval (jwks_refresh_interval_seconds)
- Unknown kid in an incoming token (triggers an immediate refresh)
Security Hardening Checklist
- Private keys never committed to version control
- enabled = true in production configs
- strict_validation = true to reject unknown permissions
- Token expiry set appropriately (1-24h recommended)
- API keys use descriptive names for audit trails
- config_endpoint_enabled = false unless needed (default)
- Monitor the tasker.auth.failures.total metric for anomalies
- Use JWKS in production for automatic key rotation
- Least-privilege: each service gets only the permissions it needs
Related
- API Security Guide — Quick start, CLI commands, error responses
- Auth Integration Guide — Auth0, Keycloak, Okta, JWKS setup
- Permissions — Full permission vocabulary and route mapping
Permissions
Permission-based access control using a resource:action vocabulary with wildcard support.
Permission Vocabulary
17 permissions organized by resource:
Tasks
| Permission | Description | Endpoints |
|---|---|---|
tasks:create | Create new tasks | POST /v1/tasks |
tasks:read | Read task details | GET /v1/tasks/{uuid} |
tasks:list | List tasks | GET /v1/tasks |
tasks:cancel | Cancel running tasks | DELETE /v1/tasks/{uuid} |
tasks:context_read | Read task context data | GET /v1/tasks/{uuid}/context |
Steps
| Permission | Description | Endpoints |
|---|---|---|
steps:read | Read workflow step details | GET /v1/tasks/{uuid}/workflow_steps, GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}, GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}/audit |
steps:resolve | Manually resolve steps | PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid} |
Dead Letter Queue
| Permission | Description | Endpoints |
|---|---|---|
dlq:read | Read DLQ entries | GET /v1/dlq, GET /v1/dlq/task/{task_uuid}, GET /v1/dlq/investigation-queue, GET /v1/dlq/staleness |
dlq:update | Update DLQ investigations | PATCH /v1/dlq/entry/{dlq_entry_uuid} |
dlq:stats | View DLQ statistics | GET /v1/dlq/stats |
Templates
| Permission | Description | Endpoints |
|---|---|---|
templates:read | Read task templates | Orchestration: GET /v1/templates, GET /v1/templates/{namespace}/{name}/{version} |
templates:validate | Validate templates | Worker: POST /v1/templates/{namespace}/{name}/{version}/validate |
System (Orchestration)
| Permission | Description | Endpoints |
|---|---|---|
system:config_read | Read system configuration | GET /config |
system:handlers_read | Read handler registry | GET /v1/handlers, GET /v1/handlers/{namespace}, GET /v1/handlers/{namespace}/{name} |
system:analytics_read | Read analytics data | GET /v1/analytics/performance, GET /v1/analytics/bottlenecks |
Worker
| Permission | Description | Endpoints |
|---|---|---|
worker:config_read | Read worker configuration | Worker: GET /config |
worker:templates_read | Read worker templates | Worker: GET /v1/templates, GET /v1/templates/{namespace}/{name}/{version} |
Wildcards
Resource-level wildcards allow broad access within a resource domain:
| Pattern | Matches |
|---|---|
tasks:* | All task permissions |
steps:* | All step permissions |
dlq:* | All DLQ permissions |
templates:* | All template permissions |
system:* | All system permissions |
worker:* | All worker permissions |
Note: Global wildcards (*) are NOT supported. Use explicit resource wildcards for broad access (e.g., tasks:*, system:*). This follows AWS IAM-style resource-level granularity.
Wildcard matching is implemented in permission_matches() (a sketch follows below):
- resource:* → matches if the required permission’s resource component equals the prefix
- Exact string → matches if the strings are identical
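A minimal sketch of that rule; the real permission_matches() lives in tasker-shared, and its signature and edge-case handling may differ.

// Sketch of the matching rule described above; names and signature are assumed.
fn permission_matches(granted: &str, required: &str) -> bool {
    if let Some(resource) = granted.strip_suffix(":*") {
        // Resource wildcard: "tasks:*" matches any "tasks:<action>".
        required.split(':').next() == Some(resource)
    } else {
        // Otherwise only an exact string match counts.
        granted == required
    }
}

fn main() {
    assert!(permission_matches("tasks:*", "tasks:create"));
    assert!(permission_matches("tasks:read", "tasks:read"));
    assert!(!permission_matches("tasks:*", "dlq:read"));
    assert!(!permission_matches("*", "tasks:read")); // global wildcard is not supported
}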
Role Patterns
Common permission sets for different service roles:
Read-Only Operator
["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]
Suitable for dashboards, monitoring services, and read-only admin UIs.
Task Submitter
["tasks:create", "tasks:read", "tasks:list"]
Services that submit work to Tasker and track their submissions.
Ops Admin
["tasks:*", "steps:*", "dlq:*", "system:*"]
Full operational access including step resolution, DLQ investigation, and system observability.
Worker Service
["worker:config_read", "worker:templates_read"]
Worker processes that need to read their configuration and available templates.
Full Access (Admin)
["tasks:*", "steps:*", "dlq:*", "templates:*", "system:*", "worker:*"]
Full access to all resources via resource wildcards. Use sparingly.
Strict Validation
When strict_validation = true (default), tokens containing permission strings not in the vocabulary are rejected with 401:
Unknown permissions: custom:action, tasks:delete
Set strict_validation = false if your identity provider includes additional scopes that are not part of Tasker’s vocabulary. Use log_unknown_permissions = true to still log unrecognized permissions for monitoring.
Permission Check Implementation
Resource-Based Authorization
Permissions are enforced declaratively at the route level using authorize() wrappers. This ensures authorization happens before body deserialization:
// In routes.rs
use tasker_shared::web::authorize;
use tasker_shared::types::resources::{Resource, Action};

Router::new()
    .route("/tasks", post(authorize(Resource::Tasks, Action::Create, create_task)))
    .route("/tasks", get(authorize(Resource::Tasks, Action::List, list_tasks)))
    .route("/tasks/{uuid}", get(authorize(Resource::Tasks, Action::Read, get_task)))
The authorize() wrapper:
- Extracts SecurityContext from request extensions (set by the auth middleware)
- If the resource is public (Health/Metrics/Docs) → proceeds to the handler
- If auth is disabled (AuthMethod::Disabled) → proceeds to the handler
- Checks has_permission(required) → if yes, proceeds; if no, returns 403
Resource → Permission Mapping
The ResourceAction type maps resource+action combinations to permissions:
| Resource | Action | Permission |
|---|---|---|
| Tasks | Create | tasks:create |
| Tasks | Read | tasks:read |
| Tasks | List | tasks:list |
| Tasks | Cancel | tasks:cancel |
| Tasks | ContextRead | tasks:context_read |
| Steps | Read/List | steps:read |
| Steps | Resolve | steps:resolve |
| Dlq | Read/List | dlq:read |
| Dlq | Update | dlq:update |
| Dlq | Stats | dlq:stats |
| Templates | Read/List | templates:read |
| Templates | Validate | templates:validate |
| System | ConfigRead | system:config_read |
| System | HandlersRead | system:handlers_read |
| System | AnalyticsRead | system:analytics_read |
| Worker | ConfigRead | worker:config_read |
| Worker | Read/List | worker:templates_read |
Public Resources
These resources don’t require authentication:
- Resource::Health - Health check endpoints
- Resource::Metrics - Prometheus metrics
- Resource::Docs - OpenAPI/Swagger documentation
Legacy Handler-Level Check (Still Available)
For cases where you need permission checks inside handler logic:
use tasker_shared::services::require_permission;
use tasker_shared::types::Permission;

fn my_handler(ctx: SecurityContext) -> Result<(), ApiError> {
    require_permission(&ctx, Permission::TasksCreate)?;
    // ... handler logic
}
Source: tasker-shared/src/web/authorize.rs, tasker-shared/src/types/resources.rs
OpenAPI Documentation
Permission Extensions
Each protected endpoint in the OpenAPI spec includes an x-required-permission extension that documents the exact permission required:
{
"paths": {
"/v1/tasks": {
"post": {
"security": [
{ "bearer_auth": [] },
{ "api_key_auth": [] }
],
"x-required-permission": "tasks:create",
...
}
}
}
}
Why Extensions Instead of OAuth2 Scopes?
OpenAPI 3.x only formally supports scopes for OAuth2 and OpenID Connect security schemes—not for HTTP Bearer or API Key authentication. Since Tasker uses JWT Bearer tokens with JWKS validation (not OAuth2 flows), we use vendor extensions (x-required-permission) to document permissions in a standards-compliant way.
This approach:
- Is OpenAPI compliant (tools ignore unknown x- fields gracefully)
- Doesn’t misrepresent our authentication mechanism
- Is machine-readable for SDK generators and tooling
- Is visible in generated documentation
Viewing Permissions in Swagger UI
Each operation’s description includes a Required Permission line:
**Required Permission:** `tasks:create`
This provides human-readable permission information directly in the Swagger UI.
Programmatic Access
To extract permission requirements from the OpenAPI spec:
import json
spec = json.load(open("orchestration-openapi.json"))
for path, methods in spec["paths"].items():
    for method, operation in methods.items():
        if "x-required-permission" in operation:
            print(f"{method.upper()} {path}: {operation['x-required-permission']}")
CLI: List Permissions
cargo run --bin tasker-ctl -- auth show-permissions
Outputs all 17 permissions with their resource grouping.
Auth Testing
E2E test infrastructure for validating authentication and permission enforcement.
Test Organization
tasker-orchestration/tests/web/auth/
├── mod.rs # Module declarations
├── common.rs # AuthWebTestClient, token generators, constants
├── tasks.rs # Task endpoint auth tests
├── workflow_steps.rs # Step resolution auth tests
├── dlq.rs # DLQ endpoint auth tests
├── handlers.rs # Handler registry auth tests
├── analytics.rs # Analytics endpoint auth tests
├── config.rs # Config endpoint auth tests
├── health.rs # Health endpoint public access tests
└── api_keys.rs # API key auth tests (full/read/tasks/none)
All tests are feature-gated: #[cfg(feature = "test-services")]
Running Auth Tests
# Run all auth E2E tests (requires database running)
cargo make test-auth-e2e # or: cargo make tae
# Run a specific test file
cargo nextest run --features test-services \
-E 'test(auth::tasks)' \
--package tasker-orchestration
# Run with output
cargo nextest run --features test-services \
-E 'test(auth::)' \
--package tasker-orchestration \
--nocapture
Test Infrastructure
AuthWebTestClient
A specialized HTTP client that starts an auth-enabled Axum server:
use crate::web::auth::common::AuthWebTestClient;

#[tokio::test]
async fn test_example() {
    let client = AuthWebTestClient::new().await;
    // client.base_url is http://127.0.0.1:{dynamic_port}
}
AuthWebTestClient::new() does:
- Loads config/tasker/generated/auth-test.toml (auth enabled, test keys)
- Resolves jwt-public-key-test.pem via CARGO_MANIFEST_DIR
- Creates SystemContext + OrchestrationCore + AppState
- Starts Axum on a dynamically-allocated port (127.0.0.1:0)
- Provides HTTP methods: get(), post_json(), patch_json(), delete()
Token Generators
use crate::web::auth::common::{generate_jwt, generate_expired_jwt, generate_jwt_wrong_issuer};

// Valid token with specific permissions
let token = generate_jwt(&["tasks:create", "tasks:read"]);

// Expired token (1 hour ago)
let token = generate_expired_jwt(&["tasks:create"]);

// Wrong issuer (won’t validate)
let token = generate_jwt_wrong_issuer(&["tasks:create"]);
Token generation uses the test RSA private key (tests/fixtures/auth/jwt-private-key-test.pem) embedded as a constant.
API Key Constants
use crate::web::auth::common::{
    TEST_API_KEY_FULL_ACCESS,    // permissions: ["*"]
    TEST_API_KEY_READ_ONLY,      // permissions: tasks/steps/dlq read + system read
    TEST_API_KEY_TASKS_ONLY,     // permissions: ["tasks:*"]
    TEST_API_KEY_NO_PERMISSIONS, // permissions: []
    INVALID_API_KEY,             // not registered
};
These match the keys configured in config/tasker/generated/auth-test.toml.
Test Configuration
config/tasker/generated/auth-test.toml
A copy of complete-test.toml with auth overrides:
[orchestration.web.auth]
enabled = true
jwt_issuer = "tasker-core-test"
jwt_audience = "tasker-api-test"
jwt_verification_method = "public_key"
jwt_public_key_path = "" # Set via TASKER_JWT_PUBLIC_KEY_PATH at runtime
api_keys_enabled = true
strict_validation = false
[[orchestration.web.auth.api_keys]]
key = "test-api-key-full-access"
permissions = ["*"]
[[orchestration.web.auth.api_keys]]
key = "test-api-key-read-only"
permissions = ["tasks:read", "tasks:list", "steps:read", ...]
# ... more keys ...
Test Fixture Keys
tests/fixtures/auth/
├── jwt-private-key-test.pem # RSA private key (for token generation in tests)
└── jwt-public-key-test.pem # RSA public key (loaded by SecurityService)
These are deterministic test keys committed to the repository. They are only used in tests and have no security value.
Test Patterns
Pattern: No Credentials → 401
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_no_credentials_returns_401() {
let client = AuthWebTestClient::new().await;
let response = client.get("/v1/tasks").await.unwrap();
assert_eq!(response.status(), 401);
}
}
Pattern: Valid JWT with Required Permission → 200
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_jwt_with_permission_succeeds() {
let client = AuthWebTestClient::new().await;
let token = generate_jwt(&["tasks:list"]);
let response = client
.get_with_token("/v1/tasks", &token)
.await
.unwrap();
assert_eq!(response.status(), 200);
}
}
Pattern: Valid JWT Missing Permission → 403
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_jwt_without_permission_returns_403() {
let client = AuthWebTestClient::new().await;
let token = generate_jwt(&["tasks:read"]); // missing tasks:create
let body = serde_json::json!({ /* ... */ });
let response = client
.post_json_with_token("/v1/tasks", &body, &token)
.await
.unwrap();
assert_eq!(response.status(), 403);
}
}
Pattern: API Key with Permissions → 200
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_api_key_full_access() {
let client = AuthWebTestClient::new().await;
let response = client
.get_with_api_key("/v1/tasks", TEST_API_KEY_FULL_ACCESS)
.await
.unwrap();
assert_eq!(response.status(), 200);
}
}
Pattern: Health Always Public
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_health_no_auth_required() {
let client = AuthWebTestClient::new().await;
let response = client.get("/health").await.unwrap();
assert_eq!(response.status(), 200);
}
}
Test Coverage Matrix
| Scenario | Expected | Test File |
|---|---|---|
| No credentials on protected routes | 401 | All files |
| JWT with exact permission | 200 | tasks, dlq, handlers, analytics, config |
| JWT with resource wildcard (tasks:*) | 200 | tasks |
| JWT with global wildcard (*) | 200 | All files |
| JWT missing required permission | 403 | tasks, dlq, handlers, analytics |
| JWT wrong issuer | 401 | tasks |
| JWT wrong audience | 401 | tasks |
| Expired JWT | 401 | tasks |
| Malformed JWT | 401 | tasks |
| API key full access | 200 | api_keys |
| API key read-only | 200/403 | api_keys |
| API key tasks-only | 200/403 | api_keys |
| API key no permissions | 403 | api_keys |
| Invalid API key | 401 | api_keys |
| Health endpoints without auth | 200 | health |
CI Compatibility
Auth tests are compatible with CI without special environment setup:
- Dynamic port allocation: TcpListener::bind("127.0.0.1:0") avoids port conflicts
- Self-configuring paths: Uses CARGO_MANIFEST_DIR to resolve fixture paths at compile time
- No external services: Auth validation is in-process (no external JWKS/IdP needed)
- Nextest isolation: Each test runs in its own process, preventing env var conflicts
Adding New Auth Tests
- Identify the endpoint and required permission (see Permissions)
- Add tests to the appropriate file (by resource) or create a new one
- Test at minimum: no credentials (401), correct permission (200), wrong permission (403)
- For POST/PATCH endpoints, use a valid request body (deserialization runs before permission check)
- Run cargo make test-auth-e2e to verify (a minimal example combining these checks follows below)
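For orientation, here is a minimal sketch of such a test, combining the three minimum cases in one function. It reuses the AuthWebTestClient and generate_jwt helpers shown earlier; the /v1/widgets endpoint and widgets:list permission are hypothetical placeholders for whatever resource you are adding.
#![allow(unused)]
fn main() {
use crate::web::auth::common::{generate_jwt, AuthWebTestClient};
#[tokio::test]
async fn test_widgets_auth_minimum_cases() {
    let client = AuthWebTestClient::new().await;
    // 1. No credentials -> 401
    let response = client.get("/v1/widgets").await.unwrap();
    assert_eq!(response.status(), 401);
    // 2. JWT carrying the required permission -> 200
    let token = generate_jwt(&["widgets:list"]);
    let response = client.get_with_token("/v1/widgets", &token).await.unwrap();
    assert_eq!(response.status(), 200);
    // 3. JWT with an unrelated permission -> 403
    let token = generate_jwt(&["tasks:read"]);
    let response = client.get_with_token("/v1/widgets", &token).await.unwrap();
    assert_eq!(response.status(), 403);
}
}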
Related
- Permissions — Full permission vocabulary and endpoint mapping
- Configuration — Auth config reference
- config/tasker/generated/auth-test.toml — Test auth configuration
Backpressure Monitoring Runbook
Last Updated: 2026-02-05 Audience: Operations, SRE, On-Call Engineers Status: Active Related Docs: Backpressure Architecture | MPSC Channel Tuning
This runbook provides guidance for monitoring, alerting, and responding to backpressure events in tasker-core.
Quick Reference
Critical Metrics Dashboard
| Metric | Normal | Warning | Critical | Action |
|---|---|---|---|---|
| api_circuit_breaker_state | closed | - | open | See Circuit Breaker Open |
| messaging_circuit_breaker_state | closed | half-open | open | See Messaging Circuit Breaker Open |
| api_requests_rejected_total | < 1/min | > 5/min | > 20/min | See API Rejections |
| mpsc_channel_saturation | < 50% | > 70% | > 90% | See Channel Saturation |
| pgmq_queue_depth | < 50% max | > 70% max | > 90% max | See Queue Depth High |
| worker_claim_refusals_total | < 5/min | > 20/min | > 50/min | See Claim Refusals |
| handler_semaphore_wait_ms_p99 | < 100ms | > 500ms | > 2000ms | See Handler Wait |
| domain_events_dropped_total | < 10/min | > 50/min | > 200/min | See Domain Events |
Key Metrics
API Layer Metrics
api_requests_total
- Type: Counter
- Labels: endpoint, status_code, method
- Description: Total API requests received
- Usage: Calculate request rate, error rate
api_requests_rejected_total
- Type: Counter
- Labels: endpoint, reason (rate_limit, circuit_breaker, validation)
- Description: Requests rejected due to backpressure
- Alert: > 10/min sustained
api_circuit_breaker_state
- Type: Gauge
- Values: 0 = closed, 1 = half-open, 2 = open
- Description: Current circuit breaker state
- Alert: state = 2 (open)
api_request_latency_ms
- Type: Histogram
- Labels: endpoint
- Description: Request processing time
- Alert: p99 > 5000ms
Messaging Metrics
messaging_circuit_breaker_state
- Type: Gauge
- Values: 0 = closed, 1 = half-open, 2 = open
- Description: Current messaging circuit breaker state
- Alert: state = 2 (open) — both orchestration and workers lose queue access
messaging_circuit_breaker_rejections_total
- Type: Counter
- Labels: operation (send, receive)
- Description: Messaging operations rejected due to open circuit breaker
- Alert: > 0 (any rejection indicates messaging outage)
Orchestration Metrics
orchestration_command_channel_size
- Type: Gauge
- Description: Current command channel buffer usage
- Alert: > 80% of command_buffer_size
orchestration_command_channel_saturation
- Type: Gauge (0.0 - 1.0)
- Description: Channel saturation ratio
- Alert: > 0.8 sustained for > 1min
pgmq_queue_depth
- Type: Gauge
- Labels: queue_name
- Description: Messages in queue
- Alert: > configured max_queue_depth * 0.8
pgmq_enqueue_failures_total
- Type: Counter
- Labels: queue_name, reason
- Description: Failed enqueue operations
- Alert: > 0 (any failure)
Worker Metrics
worker_claim_refusals_total
- Type: Counter
- Labels: worker_id, namespace
- Description: Step claims refused due to capacity
- Alert: > 10/min sustained
worker_handler_semaphore_permits_available
- Type: Gauge
- Labels: worker_id
- Description: Available handler permits
- Alert: = 0 sustained for > 30s
worker_handler_semaphore_wait_ms
- Type: Histogram
- Labels: worker_id
- Description: Time waiting for semaphore permit
- Alert: p99 > 1000ms
worker_dispatch_channel_saturation
- Type: Gauge
- Labels: worker_id
- Description: Dispatch channel saturation
- Alert: > 0.8 sustained
worker_completion_channel_saturation
- Type: Gauge
- Labels: worker_id
- Description: Completion channel saturation
- Alert: > 0.8 sustained
domain_events_dropped_total
- Type: Counter
- Labels: worker_id, event_type
- Description: Domain events dropped due to channel full
- Alert: > 50/min (informational)
Alert Configurations
Prometheus Alert Rules
groups:
- name: tasker_backpressure
rules:
# API Layer
- alert: TaskerCircuitBreakerOpen
expr: api_circuit_breaker_state == 2
for: 30s
labels:
severity: critical
annotations:
summary: "Tasker API circuit breaker is open"
description: "Circuit breaker {{ $labels.instance }} has been open for > 30s"
runbook: "https://docs/operations/backpressure-monitoring.md#circuit-breaker-open"
- alert: TaskerMessagingCircuitBreakerOpen
expr: messaging_circuit_breaker_state == 2
for: 30s
labels:
severity: critical
annotations:
summary: "Tasker messaging circuit breaker is open"
description: "Messaging circuit breaker has been open for > 30s — queue operations are failing"
runbook: "https://docs/operations/backpressure-monitoring.md#messaging-circuit-breaker-open"
- alert: TaskerAPIRejectionsHigh
expr: rate(api_requests_rejected_total[5m]) > 0.1
for: 2m
labels:
severity: warning
annotations:
summary: "High rate of API request rejections"
description: "{{ $value | printf \"%.2f\" }} requests/sec being rejected"
runbook: "https://docs/operations/backpressure-monitoring.md#api-rejections-high"
# Orchestration Layer
- alert: TaskerCommandChannelSaturated
expr: orchestration_command_channel_saturation > 0.8
for: 1m
labels:
severity: warning
annotations:
summary: "Orchestration command channel is saturated"
description: "Channel saturation at {{ $value | printf \"%.0f\" }}%"
runbook: "https://docs/operations/backpressure-monitoring.md#channel-saturation"
- alert: TaskerPGMQQueueDepthHigh
expr: pgmq_queue_depth / pgmq_queue_max_depth > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "PGMQ queue depth is high"
description: "Queue {{ $labels.queue_name }} at {{ $value | printf \"%.0f\" }}% capacity"
runbook: "https://docs/operations/backpressure-monitoring.md#pgmq-queue-depth-high"
# Worker Layer
- alert: TaskerWorkerClaimRefusalsHigh
expr: rate(worker_claim_refusals_total[5m]) > 0.2
for: 2m
labels:
severity: warning
annotations:
summary: "High rate of worker claim refusals"
description: "Worker {{ $labels.worker_id }} refusing {{ $value | printf \"%.1f\" }} claims/sec"
runbook: "https://docs/operations/backpressure-monitoring.md#worker-claim-refusals-high"
- alert: TaskerHandlerWaitTimeHigh
expr: histogram_quantile(0.99, worker_handler_semaphore_wait_ms_bucket) > 2000
for: 2m
labels:
severity: warning
annotations:
summary: "Handler wait time is high"
description: "p99 handler wait time is {{ $value | printf \"%.0f\" }}ms"
runbook: "https://docs/operations/backpressure-monitoring.md#handler-wait-time-high"
- alert: TaskerDomainEventsDropped
expr: rate(domain_events_dropped_total[5m]) > 1
for: 5m
labels:
severity: info
annotations:
summary: "Domain events being dropped"
description: "{{ $value | printf \"%.1f\" }} events/sec dropped"
runbook: "https://docs/operations/backpressure-monitoring.md#domain-events-dropped"
Incident Response Procedures
Circuit Breaker Open
Severity: Critical
Symptoms:
- API returning 503 responses
- api_circuit_breaker_state = 2
- Downstream operations failing
Immediate Actions:
- Check database connectivity: psql $DATABASE_URL -c "SELECT 1"
- Check PGMQ extension health: psql $DATABASE_URL -c "SELECT * FROM pgmq.meta LIMIT 5"
- Check recent error logs: kubectl logs -l app=tasker-orchestration --tail=100 | grep ERROR
Recovery:
- Circuit breaker will automatically attempt recovery after timeout_seconds (default: 30s)
- If database is healthy, breaker should close after success_threshold (default: 2) successful requests
- If database is unhealthy, fix database first
Escalation:
- If breaker remains open > 5 min after database recovery: Escalate to engineering
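The recovery behavior above can be pictured as a small state machine. The sketch below is illustrative only, not the tasker-core implementation; timeout and success_threshold correspond to the timeout_seconds and success_threshold settings referenced in the recovery steps, and the gauge values match the metrics section (0 = closed, 1 = half-open, 2 = open).
#![allow(unused)]
fn main() {
use std::time::{Duration, Instant};
enum BreakerState {
    Closed,
    Open { since: Instant },
    HalfOpen { successes: u32 },
}
struct Breaker {
    state: BreakerState,
    timeout: Duration,      // timeout_seconds (default: 30s)
    success_threshold: u32, // default: 2
}
impl Breaker {
    // Called before each protected operation; false means callers see 503s.
    fn allow_request(&mut self) -> bool {
        match self.state {
            BreakerState::Closed | BreakerState::HalfOpen { .. } => true,
            BreakerState::Open { since } => {
                if since.elapsed() >= self.timeout {
                    // After the timeout, allow trial requests through.
                    self.state = BreakerState::HalfOpen { successes: 0 };
                    true
                } else {
                    false
                }
            }
        }
    }
    // A successful trial request moves the breaker toward closed.
    fn record_success(&mut self) {
        if let BreakerState::HalfOpen { successes } = self.state {
            let successes = successes + 1;
            self.state = if successes >= self.success_threshold {
                BreakerState::Closed // traffic resumes
            } else {
                BreakerState::HalfOpen { successes }
            };
        }
    }
    // Simplified: a real breaker only opens after a failure threshold while closed.
    fn record_failure(&mut self) {
        self.state = BreakerState::Open { since: Instant::now() };
    }
}
}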
Messaging Circuit Breaker Open
Severity: Critical
Symptoms:
- Orchestration cannot enqueue steps or send task finalizations
- Workers cannot receive step messages or send results
- messaging_circuit_breaker_state = 2
- MessagingError::CircuitBreakerOpen in logs
Immediate Actions:
- Check messaging backend health:
  # For PGMQ (default)
  psql $PGMQ_DATABASE_URL -c "SELECT * FROM pgmq.meta LIMIT 5"
  # For RabbitMQ
  rabbitmqctl status
- Check PGMQ database connectivity (may differ from main database): psql $PGMQ_DATABASE_URL -c "SELECT 1"
- Check recent messaging errors: kubectl logs -l app=tasker-orchestration --tail=100 | grep -E "messaging|circuit_breaker|CircuitBreakerOpen"
Impact:
- Orchestration: Task initialization stalls, step results cannot be received, task finalizations blocked
- Workers: Step messages not received, results cannot be sent back to orchestration
- Safety: Messages remain in queues protected by visibility timeouts; no data loss occurs
- Health checks: Unaffected (bypass circuit breaker to detect recovery)
Recovery:
- Circuit breaker will automatically test recovery after timeout_seconds (default: 30s)
- On recovery, queued messages will be processed normally (visibility timeouts protect against loss)
- If messaging backend is unhealthy, fix it first — the breaker protects against cascading timeouts
Escalation:
- If breaker remains open > 5 min after backend recovery: Escalate to engineering
- If both web and messaging breakers are open simultaneously: Likely database-wide issue, escalate to DBA
API Rejections High
Severity: Warning
Symptoms:
- Clients receiving 429 or 503 responses
- api_requests_rejected_total increasing
Diagnosis:
- Check rejection reason distribution: sum by (reason) (rate(api_requests_rejected_total[5m]))
- If reason=rate_limit: Legitimate load spike or client misbehavior
- If reason=circuit_breaker: See Circuit Breaker Open
Actions:
- Rate limit rejections:
- Identify high-volume client
- Consider increasing rate limit or contacting client
- Circuit breaker rejections:
- Follow circuit breaker procedure
Channel Saturation
Severity: Warning → Critical if sustained
Symptoms:
- mpsc_channel_saturation > 0.8
- Increased latency
- Potential backpressure cascade
Diagnosis:
- Identify saturated channel: orchestration_command_channel_saturation > 0.8
- Check upstream rate: rate(orchestration_commands_received_total[5m])
- Check downstream processing rate: rate(orchestration_commands_processed_total[5m])
Actions:
- Temporary: Scale up orchestration replicas
- Short-term: Increase channel buffer size
  [orchestration.mpsc_channels.command_processor]
  command_buffer_size = 10000  # Increase from 5000
- Long-term: Investigate why processing is slow
PGMQ Queue Depth High
Severity: Warning → Critical if approaching max
Symptoms:
- pgmq_queue_depth growing
- Step execution delays
- Potential OOM if queue grows unbounded
Diagnosis:
- Identify growing queue: pgmq_queue_depth{queue_name=~".*"}
- Check worker health: sum(worker_handler_semaphore_permits_available)
- Check for stuck workers: count(worker_claim_refusals_total) by (worker_id)
Actions:
- Scale workers: Add more worker replicas
- Increase handler concurrency (short-term):
  [worker.mpsc_channels.handler_dispatch]
  max_concurrent_handlers = 20  # Increase from 10
- Investigate slow handlers: Check handler execution latency
Worker Claim Refusals High
Severity: Warning
Symptoms:
- worker_claim_refusals_total increasing
- Workers at capacity
- Step execution delayed
Diagnosis:
- Check handler permit usage: worker_handler_semaphore_permits_available
- Check handler execution time: histogram_quantile(0.99, worker_handler_execution_ms_bucket)
Actions:
- Scale workers: Add replicas
- Optimize handlers: If execution time is high
- Adjust threshold: If refusals are premature
  [worker]
  claim_capacity_threshold = 0.9  # More aggressive claiming
Handler Wait Time High
Severity: Warning
Symptoms:
- handler_semaphore_wait_ms_p99 > 1000ms
- Steps waiting for execution
- Increased end-to-end latency
Diagnosis:
- Check permit utilization: 1 - (worker_handler_semaphore_permits_available / worker_handler_semaphore_permits_total)
- Check completion channel saturation: worker_completion_channel_saturation
Actions:
- Increase permits (if CPU/memory allow):
  [worker.mpsc_channels.handler_dispatch]
  max_concurrent_handlers = 15
- Optimize handlers: Reduce execution time
- Scale workers: If resources constrained
Domain Events Dropped
Severity: Informational
Symptoms:
- domain_events_dropped_total increasing
- Downstream event consumers missing events
Diagnosis:
- This is expected behavior under load
- Check if drop rate is excessive: rate(domain_events_dropped_total[5m]) / rate(domain_events_dispatched_total[5m])
Actions:
- If < 1% dropped: Normal, no action needed
- If > 5% dropped: Consider increasing event channel buffer
  [shared.domain_events]
  buffer_size = 20000  # Increase from 10000
- Note: Domain events are non-critical. Dropping does not affect step execution.
Capacity Planning
Determining Appropriate Limits
Command Channel Size
Required buffer = (peak_requests_per_second) * (avg_processing_time_ms / 1000) * safety_factor
Example:
peak_requests_per_second = 100
avg_processing_time_ms = 50
safety_factor = 2
Required buffer = 100 * 0.05 * 2 = 10 messages
Recommended: 5000 (50x headroom for bursts)
Handler Concurrency
Optimal concurrency = (worker_cpu_cores) * (1 + (io_wait_ratio))
Example:
worker_cpu_cores = 4
io_wait_ratio = 0.8 (handlers are mostly I/O bound)
Optimal concurrency = 4 * 1.8 = 7.2
Recommended: 8-10 permits
PGMQ Queue Depth
Max depth = (expected_processing_rate) * (max_acceptable_delay_seconds)
Example:
expected_processing_rate = 100 steps/sec
max_acceptable_delay = 60 seconds
Max depth = 100 * 60 = 6000 messages
Recommended: 10000 (headroom for bursts)
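As a quick aid, the three formulas above are restated below as plain Rust arithmetic, using the same example inputs as this section; substitute your own measured rates before changing configuration.
#![allow(unused)]
fn main() {
// Command channel: peak rate * average processing time * safety factor.
let peak_requests_per_second = 100.0_f64;
let avg_processing_time_ms = 50.0_f64;
let safety_factor = 2.0_f64;
let command_buffer = peak_requests_per_second * (avg_processing_time_ms / 1000.0) * safety_factor;
println!("command buffer minimum: {command_buffer:.0} (recommended: 5000 for burst headroom)");
// Handler concurrency: CPU cores * (1 + I/O wait ratio).
let worker_cpu_cores = 4.0_f64;
let io_wait_ratio = 0.8_f64;
let handler_concurrency = worker_cpu_cores * (1.0 + io_wait_ratio);
println!("handler permits: {handler_concurrency:.1} (recommended: 8-10)");
// PGMQ queue depth: processing rate * maximum acceptable delay.
let processing_rate_steps_per_sec = 100.0_f64;
let max_acceptable_delay_seconds = 60.0_f64;
let max_queue_depth = processing_rate_steps_per_sec * max_acceptable_delay_seconds;
println!("max queue depth: {max_queue_depth:.0} (recommended: 10000 for burst headroom)");
}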
Grafana Dashboard
Import this dashboard for backpressure monitoring:
{
"dashboard": {
"title": "Tasker Backpressure",
"panels": [
{
"title": "Circuit Breaker State",
"type": "stat",
"targets": [{"expr": "api_circuit_breaker_state"}]
},
{
"title": "API Rejections Rate",
"type": "graph",
"targets": [{"expr": "rate(api_requests_rejected_total[5m])"}]
},
{
"title": "Channel Saturation",
"type": "graph",
"targets": [
{"expr": "orchestration_command_channel_saturation", "legendFormat": "orchestration"},
{"expr": "worker_dispatch_channel_saturation", "legendFormat": "worker-dispatch"},
{"expr": "worker_completion_channel_saturation", "legendFormat": "worker-completion"}
]
},
{
"title": "PGMQ Queue Depth",
"type": "graph",
"targets": [{"expr": "pgmq_queue_depth", "legendFormat": "{{queue_name}}"}]
},
{
"title": "Handler Wait Time (p99)",
"type": "graph",
"targets": [{"expr": "histogram_quantile(0.99, worker_handler_semaphore_wait_ms_bucket)"}]
},
{
"title": "Worker Claim Refusals",
"type": "graph",
"targets": [{"expr": "rate(worker_claim_refusals_total[5m])"}]
}
]
}
}
Related Documentation
- Backpressure Architecture - Strategy overview
- MPSC Channel Tuning - Channel configuration
- Worker Event Systems - Worker architecture
Checkpoint Operations Guide
Last Updated: 2026-01-06 Status: Active Related: Batch Processing - Checkpoint Yielding
Overview
This guide covers operational concerns for checkpoint yielding in production environments, including monitoring, troubleshooting, and maintenance tasks.
Monitoring Checkpoints
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Checkpoint history size | Length of history array | >100 entries |
| Checkpoint age | Time since last checkpoint | >10 minutes (step-dependent) |
| Accumulated results size | Size of accumulated_results JSONB | >1MB |
| Checkpoint frequency | Checkpoints per step execution | <1 per minute (may indicate issues) |
SQL Queries for Monitoring
Steps with large checkpoint history:
SELECT
ws.workflow_step_uuid,
ws.name,
t.task_uuid,
jsonb_array_length(ws.checkpoint->'history') as history_length,
ws.checkpoint->>'timestamp' as last_checkpoint
FROM tasker.workflow_steps ws
JOIN tasker.tasks t ON ws.task_uuid = t.task_uuid
WHERE ws.checkpoint IS NOT NULL
AND jsonb_array_length(ws.checkpoint->'history') > 50
ORDER BY history_length DESC
LIMIT 20;
Steps with stale checkpoints (in progress but not checkpointed recently):
SELECT
ws.workflow_step_uuid,
ws.name,
ws.current_state,
ws.checkpoint->>'timestamp' as last_checkpoint,
NOW() - (ws.checkpoint->>'timestamp')::timestamptz as checkpoint_age
FROM tasker.workflow_steps ws
WHERE ws.current_state = 'in_progress'
AND ws.checkpoint IS NOT NULL
AND (ws.checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes'
ORDER BY checkpoint_age DESC;
Large accumulated results:
SELECT
ws.workflow_step_uuid,
ws.name,
pg_column_size(ws.checkpoint->'accumulated_results') as results_size_bytes,
pg_size_pretty(pg_column_size(ws.checkpoint->'accumulated_results')::bigint) as results_size
FROM tasker.workflow_steps ws
WHERE ws.checkpoint->'accumulated_results' IS NOT NULL
AND pg_column_size(ws.checkpoint->'accumulated_results') > 100000
ORDER BY results_size_bytes DESC
LIMIT 20;
Logging
Checkpoint operations emit structured logs:
INFO checkpoint_yield_step_event step_uuid=abc-123 cursor=1000 items_processed=1000
INFO checkpoint_saved step_uuid=abc-123 history_length=5
Log fields to monitor:
- step_uuid: Step being checkpointed
- cursor: Current position
- items_processed: Total items at checkpoint
- history_length: Number of checkpoint entries
Troubleshooting
Step Not Resuming from Checkpoint
Symptoms: Step restarts from beginning instead of checkpoint position.
Checks:
- Verify checkpoint exists: SELECT checkpoint FROM tasker.workflow_steps WHERE workflow_step_uuid = 'uuid';
- Check handler uses the BatchWorkerContext accessors: has_checkpoint? / has_checkpoint() / hasCheckpoint() and checkpoint_cursor / checkpointCursor
- Verify handler respects the checkpoint in its processing loop (see the sketch below)
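The shape of a resuming handler is sketched below. The Rust names (has_checkpoint(), checkpoint_cursor(), checkpoint_yield()) are assumed from the cross-language accessor names listed above and the Batch Processing guide; treat the types and signatures as illustrative rather than the exact worker API.
#![allow(unused)]
fn main() {
// Illustrative only: a stand-in for the real BatchWorkerContext.
struct BatchWorkerContext {
    checkpoint_cursor: Option<u64>,
}
impl BatchWorkerContext {
    fn has_checkpoint(&self) -> bool {
        self.checkpoint_cursor.is_some()
    }
    fn checkpoint_cursor(&self) -> Option<u64> {
        self.checkpoint_cursor
    }
    fn checkpoint_yield(&mut self, cursor: u64) {
        // In a real handler this persists the cursor (and any accumulated results).
        self.checkpoint_cursor = Some(cursor);
    }
}
fn process_batch(ctx: &mut BatchWorkerContext, items: &[String]) {
    // Resume from the saved cursor instead of position 0.
    let start = if ctx.has_checkpoint() {
        ctx.checkpoint_cursor().unwrap_or(0) as usize
    } else {
        0
    };
    for (offset, item) in items.iter().enumerate().skip(start) {
        let _ = item; // ... per-item work goes here ...
        // Checkpoint periodically so a retry resumes near where it stopped.
        if (offset + 1) % 1000 == 0 {
            ctx.checkpoint_yield((offset + 1) as u64);
        }
    }
}
}
If the step restarts from the beginning despite a stored checkpoint, the resume branch above (or its equivalent in your handler language) is usually what is missing.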
Checkpoint Not Persisting
Symptoms: checkpoint_yield() returns but data not in database.
Checks:
- Check for errors in worker logs
- Verify FFI bridge is healthy
- Check database connectivity
Excessive Checkpoint History Growth
Symptoms: Steps have hundreds or thousands of checkpoint history entries.
Causes:
- Very long-running processes with frequent checkpoints
- Small checkpoint intervals relative to work
Remediation:
- Increase checkpoint interval (process more items between checkpoints)
- Clear history for completed steps (see Maintenance section)
- Monitor with history size query above
Large Accumulated Results
Symptoms: Database bloat, slow step queries.
Causes:
- Storing full result sets instead of summaries
- Unbounded accumulation without size checks
Remediation:
- Modify handler to store summaries, not full data
- Use external storage for large intermediate results
- Add size checks before checkpoint
Maintenance Tasks
Clear Checkpoint for Completed Steps
Completed steps retain checkpoint data for debugging. To clear:
-- Clear checkpoints for completed steps older than 7 days
UPDATE tasker.workflow_steps
SET checkpoint = NULL
WHERE current_state = 'complete'
AND checkpoint IS NOT NULL
AND updated_at < NOW() - INTERVAL '7 days';
Truncate History Array
For steps with excessive history:
-- Keep only last 10 history entries
UPDATE tasker.workflow_steps
SET checkpoint = jsonb_set(
checkpoint,
'{history}',
(SELECT jsonb_agg(elem)
FROM (
SELECT elem
FROM jsonb_array_elements(checkpoint->'history') elem
ORDER BY (elem->>'timestamp')::timestamptz DESC
LIMIT 10
) sub)
)
WHERE jsonb_array_length(checkpoint->'history') > 10;
Clear Checkpoint for Manual Reset
When manually resetting a step to reprocess from scratch:
-- Clear checkpoint to force reprocessing from beginning
UPDATE tasker.workflow_steps
SET checkpoint = NULL
WHERE workflow_step_uuid = 'step-uuid-here';
Warning: Only clear checkpoints if you want the step to restart from the beginning.
Capacity Planning
Database Sizing
Checkpoint column considerations:
- Each checkpoint: ~1-10KB typical (cursor, timestamp, metadata)
- History array: ~100 bytes per entry
- Accumulated results: Variable (handler-dependent)
Formula for checkpoint storage:
Storage = Active Steps × (Base Checkpoint Size + History Entries × 100 bytes + Accumulated Size)
Example: 10,000 active steps with 50-entry history and 5KB accumulated results:
10,000 × (5KB + 50 × 100B + 5KB) = 10,000 × 15KB = 150MB
Performance Impact
Checkpoint write: ~1-5ms per checkpoint (single row UPDATE)
Checkpoint read: Included in step data fetch (no additional query)
Recommendations:
- Checkpoint every 1000-10000 items or every 1-5 minutes (see the sketch after this list)
- Too frequent: Excessive database writes
- Too infrequent: Lost progress on failure
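A minimal sketch of that cadence decision, assuming the handler tracks items and elapsed time itself; the thresholds are the ranges recommended above, not tasker defaults.
#![allow(unused)]
fn main() {
use std::time::{Duration, Instant};
struct CheckpointCadence {
    items_since_checkpoint: u64,
    last_checkpoint: Instant,
    max_items: u64,         // e.g. somewhere in the 1_000-10_000 range
    max_interval: Duration, // e.g. 60-300 seconds
}
impl CheckpointCadence {
    // Returns true when the handler should call checkpoint_yield().
    fn record_item(&mut self) -> bool {
        self.items_since_checkpoint += 1;
        let due = self.items_since_checkpoint >= self.max_items
            || self.last_checkpoint.elapsed() >= self.max_interval;
        if due {
            self.items_since_checkpoint = 0;
            self.last_checkpoint = Instant::now();
        }
        due
    }
}
}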
Alerting Recommendations
Prometheus/Grafana Metrics
If exporting to Prometheus:
# Alert on stale checkpoints
- alert: StaleCheckpoint
expr: tasker_checkpoint_age_seconds > 600
for: 5m
labels:
severity: warning
annotations:
summary: "Step checkpoint is stale"
# Alert on large history
- alert: CheckpointHistoryGrowth
expr: tasker_checkpoint_history_size > 100
for: 1h
labels:
severity: warning
annotations:
summary: "Checkpoint history exceeding threshold"
Database-Based Alerting
For periodic SQL-based monitoring:
-- Return non-zero if any issues detected
SELECT COUNT(*)
FROM tasker.workflow_steps
WHERE (
-- Stale in-progress checkpoints
(current_state = 'in_progress'
AND checkpoint IS NOT NULL
AND (checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes')
OR
-- Excessive history
(checkpoint IS NOT NULL
AND jsonb_array_length(checkpoint->'history') > 100)
);
See Also
- Batch Processing Guide - Full checkpoint yielding documentation
- DLQ System - Dead letter queue for failed steps
- Retry Semantics - Retry behavior with checkpoints
Connection Pool Tuning Guide
Overview
Tasker maintains two connection pools: tasker (task/step/transition operations) and pgmq (queue operations). Pool observability is provided via:
- /health/detailed - Pool utilization in the pool_utilization field
- /metrics - Prometheus gauges tasker_db_pool_connections{pool,state}
- Atomic counters tracking acquire latency and errors
Pool Sizing Guidelines
Formula
max_connections = (peak_concurrent_operations * avg_hold_time_ms) / 1000 + headroom
Rules of thumb:
- Orchestration pool: 2-3x the number of concurrent tasks expected
- PGMQ pool: 1-2x the number of workers × batch size
- min_connections: 20-30% of max to avoid cold-start latency
- Never exceed PostgreSQL’s max_connections / number_of_services
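A worked example of the formula and the min_connections rule of thumb, with assumed inputs (substitute your own measurements):
#![allow(unused)]
fn main() {
// max_connections = (peak_concurrent_operations * avg_hold_time_ms) / 1000 + headroom
// Interpreting peak_concurrent_operations as operations started per second,
// each holding a connection for avg_hold_time_ms (Little's law).
let peak_operations_per_second = 200.0_f64; // assumed
let avg_hold_time_ms = 80.0_f64;            // assumed
let headroom = 10.0_f64;
let max_connections = (peak_operations_per_second * avg_hold_time_ms) / 1000.0 + headroom;
let min_connections = max_connections * 0.25; // 20-30% of max avoids cold-start latency
println!("max_connections ~ {max_connections:.0}, min_connections ~ {min_connections:.0}");
}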
Environment Defaults
| Parameter | Base | Production | Development | Test |
|---|---|---|---|---|
| max_connections (tasker) | 25 | 50 | 25 | 30 |
| min_connections (tasker) | 5 | 10 | 5 | 2 |
| max_connections (pgmq) | 15 | 25 | 15 | 10 |
| slow_acquire_threshold_ms | 100 | 50 | 200 | 500 |
Metrics Interpretation
Utilization Thresholds
| Level | Utilization | Action |
|---|---|---|
| Healthy | < 80% | Normal operation |
| Degraded | 80-95% | Monitor closely, consider increasing max_connections |
| Unhealthy | > 95% | Pool exhaustion imminent; increase pool size or reduce load |
Slow Acquires
The slow_acquire_threshold_ms setting controls when an acquire is classified as “slow”:
- Production (50ms): Tight threshold for SLO-sensitive workloads
- Development (200ms): Relaxed for local debugging with fewer resources
- Test (500ms): Very relaxed for CI environments with contention
A high slow_acquires count relative to total_acquires (>5%) suggests:
- Pool is undersized for the workload
- Connections are held too long (long queries or transactions)
- Connection creation is slow (network latency to DB)
Acquire Errors
Non-zero acquire_errors indicates pool exhaustion (timeout waiting for connection). Remediation:
- Increase max_connections
- Increase acquire_timeout_seconds (masks the problem)
- Reduce query execution time
- Check for connection leaks (connections not returned to pool)
PostgreSQL Server-Side Considerations
max_connections
PostgreSQL’s max_connections is a hard limit across all clients. For cluster deployments:
pg_max_connections >= sum(service_max_pool * service_instance_count) + superuser_reserved
Default PostgreSQL max_connections is 100. For production:
- Set max_connections = 500 or higher
- Reserve 5-10 connections for superuser (superuser_reserved_connections)
- Monitor with SELECT count(*) FROM pg_stat_activity
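The cluster-wide budget check above can be made concrete with a small calculation; the service list and replica counts below are illustrative, not a reference deployment.
#![allow(unused)]
fn main() {
let services = [
    // (service, max pool per instance, instance count) -- assumed values
    ("orchestration: tasker pool", 50u32, 2u32),
    ("orchestration: pgmq pool", 25, 2),
    ("workers: pgmq pool", 25, 4),
];
let superuser_reserved = 10u32;
let required: u32 = services.iter().map(|(_, pool, count)| pool * count).sum::<u32>()
    + superuser_reserved;
println!("PostgreSQL max_connections should be >= {required}");
// At roughly 5-10MB of server RAM per connection (see Connection Overhead below),
// that budget also implies a memory check before raising max_connections.
}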
Connection Overhead
Each PostgreSQL connection consumes ~5-10MB RAM. Size accordingly:
- 100 connections ~ 0.5-1GB additional RAM
- 500 connections ~ 2.5-5GB additional RAM
Statement Timeout
The statement_timeout database variable protects against runaway queries:
- Production: 30s (default)
- Test: 5s (fail fast)
Alert Threshold Recommendations
| Metric | Warning | Critical |
|---|---|---|
| Pool utilization | > 80% for 5 min | > 95% for 1 min |
| Slow acquires / total | > 5% over 5 min | > 20% over 1 min |
| Acquire errors | > 0 in 5 min | > 10 in 1 min |
| Average acquire time | > 50ms | > 200ms |
Configuration Reference
Pool settings are in config/tasker/base/common.toml under [common.database.pool] and [common.pgmq_database.pool]. Environment-specific overrides are in config/tasker/environments/{env}/common.toml.
[common.database.pool]
max_connections = 25
min_connections = 5
acquire_timeout_seconds = 10
idle_timeout_seconds = 300
max_lifetime_seconds = 1800
slow_acquire_threshold_ms = 100
MPSC Channel Tuning - Operational Runbook
Last Updated: 2025-12-10 Owner: Platform Engineering Related: ADR: Bounded MPSC Channels | Circuit Breakers | Backpressure Architecture
Overview
This runbook provides operational guidance for monitoring, diagnosing, and tuning MPSC channel buffer sizes in the tasker-core system. All channels are bounded with configurable capacities to prevent unbounded memory growth.
Quick Reference
Configuration Files
| File | Purpose | When to Edit |
|---|---|---|
| config/tasker/base/mpsc_channels.toml | Base configuration | Default values |
| config/tasker/environments/test/mpsc_channels.toml | Test overrides | Test environment tuning |
| config/tasker/environments/development/mpsc_channels.toml | Dev overrides | Local development tuning |
| config/tasker/environments/production/mpsc_channels.toml | Prod overrides | Production capacity planning |
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| mpsc_channel_usage_percent | Current fill percentage | > 80% |
| mpsc_channel_capacity | Configured buffer size | N/A (informational) |
| mpsc_channel_full_events_total | Overflow events counter | > 0 (indicates backpressure) |
Default Buffer Sizes
| Channel | Base | Test | Development | Production |
|---|---|---|---|---|
| Orchestration command | 1000 | 100 | 1000 | 5000 |
| PGMQ notifications | 10000 | 10000 | 10000 | 50000 |
| Task readiness | 1000 | 100 | 500 | 5000 |
| Worker command | 1000 | 1000 | 1000 | 2000 |
| Event publisher | 5000 | 5000 | 5000 | 10000 |
| Ruby FFI | 1000 | 1000 | 500 | 2000 |
Monitoring and Alerting
Recommended Alerts
Critical: Channel Saturation
# Alert when any channel is >90% full for 5 minutes
mpsc_channel_usage_percent > 90
Action: Immediate capacity increase or identify bottleneck
Warning: Channel High Usage
# Alert when any channel is >80% full for 15 minutes
mpsc_channel_usage_percent > 80
Action: Plan capacity increase, investigate throughput
Info: Channel Overflow Events
# Alert on any overflow events
rate(mpsc_channel_full_events_total[5m]) > 0
Action: Review backpressure handling, consider capacity increase
Grafana Queries
Channel Usage by Component
max by (channel, component) (mpsc_channel_usage_percent)
Channel Capacity Configuration
max by (channel, component) (mpsc_channel_capacity)
Overflow Event Rate
rate(mpsc_channel_full_events_total[5m])
Log Patterns
Saturation Warning (80% full)
WARN mpsc_channel_saturation channel=orchestration_command usage_percent=82.5
Overflow Event (channel full)
ERROR mpsc_channel_full channel=event_publisher action=dropped
Backpressure Applied
ERROR Ruby FFI event channel full - backpressure applied
Common Issues and Solutions
Issue 1: High Channel Saturation
Symptoms:
- mpsc_channel_usage_percent consistently > 80%
- Slow message processing
- Increased latency
Diagnosis:
- Check which channel is saturated:
  # Grep logs for saturation warnings
  grep "mpsc_channel_saturation" logs/tasker.log | tail -20
- Check metrics for the specific channel:
  mpsc_channel_usage_percent{channel="orchestration_command"}
Solutions:
Short-term (Immediate Relief):
# Edit appropriate environment file
# Example: config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 10000 # Increase from 5000
Long-term:
- Investigate message producer rate
- Optimize message consumer processing
- Consider horizontal scaling
Issue 2: PGMQ Notification Bursts
Symptoms:
- Spike in mpsc_channel_usage_percent{channel="pgmq_notifications"}
- During bulk task creation (1000+ tasks)
- Temporary saturation followed by recovery
Diagnosis:
- Correlate with bulk task operations:
  # Check for bulk task creation in logs
  grep "Bulk task creation" logs/tasker.log
- Verify buffer size configuration:
  # Check current production configuration
  cat config/tasker/environments/production/mpsc_channels.toml | \
    grep -A 2 "event_listeners"
Solutions:
If production buffer < 50000:
# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.orchestration.event_listeners]
pgmq_event_buffer_size = 50000 # Recommended for production
If already at 50000 and still saturating:
- Consider notification coalescing (future feature)
- Implement batch notification processing
- Scale orchestration services horizontally
Issue 3: Ruby FFI Backpressure
Symptoms:
- Errors: “Ruby FFI event channel full - backpressure applied”
- Ruby handler slowness
- Increased Rust-side latency
Diagnosis:
- Check Ruby handler processing time:
  # Add timing to Ruby handlers
  time_start = Time.now
  result = handler.execute(step)
  duration = Time.now - time_start
  logger.warn("Slow handler: #{duration}s") if duration > 1.0
- Check FFI channel saturation:
  mpsc_channel_usage_percent{channel="ruby_ffi"}
Solutions:
If Ruby handlers are slow:
- Optimize Ruby handler code
- Consider async Ruby processing
- Profile Ruby handler performance
If FFI buffer too small:
# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.shared.ffi]
ruby_event_buffer_size = 2000 # Increase from 1000
If Rust-side producing too fast:
- Add rate limiting to Rust event production
- Batch events before FFI crossing
Issue 4: Event Publisher Drops
Symptoms:
- Counter increasing: mpsc_channel_full_events_total{channel="event_publisher"}
- Log warnings: “Event channel full, dropping event”
Diagnosis:
- Check drop rate:
  rate(mpsc_channel_full_events_total{channel="event_publisher"}[5m])
- Identify event types being dropped:
  grep "dropping event" logs/tasker.log | awk '{print $NF}' | sort | uniq -c
Solutions:
If drops are rare (< 1/min):
- Acceptable for non-critical events
- Monitor but no action needed
If drops are frequent (> 10/min):
# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.shared.event_publisher]
event_queue_buffer_size = 20000 # Increase from 10000
If drops are continuous:
- Investigate event storm cause
- Consider event sampling/filtering
- Review event subscriber performance
Capacity Planning
Sizing Formula
General guideline:
buffer_size = (peak_message_rate_per_sec * avg_processing_time_sec) * safety_factor
Where:
- peak_message_rate_per_sec: Expected peak throughput
- avg_processing_time_sec: Average consumer processing time
- safety_factor: 2-5x for bursts
Example calculation:
# Orchestration command channel
peak_rate = 500 messages/sec
processing_time = 0.01 sec (10ms)
safety_factor = 2x
buffer_size = (500 * 0.01) * 2 = 10 messages minimum
# Use 1000 for burst handling
Environment-Specific Guidelines
Test Environment:
- Use small buffers (100-500)
- Exposes backpressure issues early
- Forces proper error handling
Development Environment:
- Use moderate buffers (500-1000)
- Balances local resource usage
- Mimics test environment behavior
Production Environment:
- Use large buffers (2000-50000)
- Handles real-world burst traffic
- Prioritizes availability over memory
When to Increase Buffer Sizes
Increase if:
- ✅ Saturation > 80% for extended periods
- ✅ Overflow events occur regularly
- ✅ Latency increases during peak load
- ✅ Known traffic increase incoming
Don’t increase if:
- ❌ Consumer is bottleneck (fix consumer instead)
- ❌ Saturation is brief and recovers quickly
- ❌ Would mask underlying performance issue
Configuration Change Procedure
1. Identify Need
Review metrics and logs to determine which channel needs adjustment.
2. Calculate New Size
Use sizing formula or apply percentage increase:
new_size = current_size * (100 / (100 - target_usage_percent))
# Example: Currently 90% full, target 70%
new_size = 5000 * (100 / (100 - 70)) = 5000 * 3.33 = 16,650
# Round up: 20,000
3. Update Configuration
Important: Environment overrides MUST use full [mpsc_channels.*] prefix!
# ✅ CORRECT
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 20000
# ❌ WRONG - creates conflicting top-level key
[orchestration.command_processor]
command_buffer_size = 20000
4. Deploy
Local/Development:
# Restart service - picks up new config automatically
cargo run --package tasker-orchestration --bin tasker-server --features web-api
Production:
# Standard deployment process
# Configuration is loaded at service startup
kubectl rollout restart deployment/tasker-orchestration
5. Monitor
Watch metrics for 1-2 hours post-change:
- Channel usage percentage should decrease
- Overflow events should stop
- Latency should improve
6. Document
Update this runbook with:
- Why change was made
- New values
- Observed impact
Troubleshooting Checklist
□ Check metric: mpsc_channel_usage_percent for affected channel
□ Review logs for saturation warnings in last 24 hours
□ Verify configuration file has correct [mpsc_channels] prefix
□ Confirm environment variable TASKER_ENV matches intended environment
□ Check if issue correlates with specific operations (bulk tasks, etc.)
□ Verify consumer processing time hasn't increased
□ Check for resource constraints (CPU, memory)
□ Review recent code changes that might affect throughput
□ Consider if horizontal scaling is needed vs buffer increase
Emergency Response
Critical Saturation (>95%)
Immediate Actions:
- Increase buffer size by 2-5x in production config
- Deploy immediately via rolling restart
- Page on-call if service degradation visible
Example:
# Edit config
vim config/tasker/environments/production/mpsc_channels.toml
# Deploy
kubectl rollout restart deployment/tasker-orchestration
# Monitor
watch -n 5 'curl -s localhost:9090/api/v1/query?query=mpsc_channel_usage_percent | jq'
Service Unresponsive Due to Backpressure
Symptoms:
- All channels showing 100% usage
- No message processing
- Health checks failing
Actions:
- Check for downstream bottleneck (database, queue service)
- Scale out consumer services
- Temporarily increase all buffer sizes
- Check circuit breaker states (/health/detailed endpoint) - if circuit breakers are open, address underlying database/service issues first
Note: MPSC channels and circuit breakers are complementary resilience mechanisms. Channel saturation indicates internal backpressure, while circuit breaker state indicates external service health. See Circuit Breakers for operational guidance.
Best Practices
- Monitor Proactively: Don’t wait for alerts - review metrics weekly
- Test Changes in Dev: Validate buffer changes in development first
- Document Rationale: Note why each production override exists
- Gradual Increases: Prefer 2x increases over 10x jumps
- Review Quarterly: Adjust defaults based on production patterns
- Alert on Changes: Get notified of configuration file commits
Related Documentation
Architecture:
- Backpressure Architecture - How MPSC channels fit into the broader resilience strategy
- Circuit Breakers - Fault isolation working alongside bounded channels
- ADR: Bounded MPSC Channels - Design decisions
Development:
- Developer Guidelines - Creating and using MPSC channels
Operations:
- Backpressure Monitoring - Unified alerting and incident response
Support
Questions? Ask in #platform-engineering Slack channel
Issues? File ticket with label infrastructure/channels
Escalation? Page on-call via PagerDuty
Cluster Testing Guide
Last Updated: 2026-01-19 Audience: Developers, QA Status: Active Related: Tooling | Idempotency and Atomicity
Overview
This guide covers multi-instance cluster testing for validating horizontal scaling, race condition detection, and concurrent processing scenarios.
Key Capabilities:
- Run N orchestration instances with M worker instances
- Test concurrent task creation across instances
- Validate state consistency across cluster
- Detect race conditions and data corruption
- Measure performance under concurrent load
Test Infrastructure
Feature Flags
Tests are organized by infrastructure requirements using Cargo feature flags:
| Feature Flag | Infrastructure Required | In CI? |
|---|---|---|
| test-db | PostgreSQL database | Yes |
| test-messaging | DB + messaging (PGMQ/RabbitMQ) | Yes |
| test-services | DB + messaging + services running | Yes |
| test-cluster | Multi-instance cluster running | No |
Hierarchy: Each flag implies the previous (test-cluster includes test-services includes test-messaging includes test-db).
Test Commands
# Unit tests (DB + messaging only)
cargo make test-rust-unit
# E2E tests (services running)
cargo make test-rust-e2e
# Cluster tests (cluster running - LOCAL ONLY)
cargo make test-rust-cluster
# All tests including cluster
cargo make test-rust-all
Test Entry Points
tests/
├── basic_tests.rs # Always compiles
├── integration_tests.rs # #[cfg(feature = "test-messaging")]
├── e2e_tests.rs # #[cfg(feature = "test-services")]
└── e2e/
└── multi_instance/ # #[cfg(feature = "test-cluster")]
├── mod.rs
├── concurrent_task_creation_test.rs
└── consistency_test.rs
Multi-Instance Test Manager
The MultiInstanceTestManager provides high-level APIs for cluster testing.
Location
tests/common/multi_instance_test_manager.rs
tests/common/orchestration_cluster.rs
Basic Usage
#![allow(unused)]
fn main() {
use crate::common::multi_instance_test_manager::MultiInstanceTestManager;
#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_concurrent_operations() -> Result<()> {
// Setup from environment (reads TASKER_TEST_ORCHESTRATION_URLS, etc.)
let manager = MultiInstanceTestManager::setup_from_env().await?;
// Wait for all instances to become healthy
manager.wait_for_healthy(Duration::from_secs(30)).await?;
// Create tasks concurrently across the cluster
let requests = vec![create_task_request("namespace", "task", json!({})); 10];
let responses = manager.create_tasks_concurrent(requests).await?;
// Wait for completion
let task_uuids: Vec<_> = responses.iter()
.map(|r| uuid::Uuid::parse_str(&r.task_uuid).unwrap())
.collect();
let completed = manager.wait_for_tasks_completion(task_uuids.clone(), timeout).await?;
// Verify consistency across all instances
for uuid in &task_uuids {
manager.verify_task_consistency(*uuid).await?;
}
Ok(())
}
}
Key Methods
| Method | Description |
|---|---|
| setup_from_env() | Create manager from environment variables |
| setup(orch_count, worker_count) | Create manager with explicit counts |
| wait_for_healthy(timeout) | Wait for all instances to be healthy |
| create_tasks_concurrent(requests) | Create tasks using round-robin distribution |
| wait_for_task_completion(uuid, timeout) | Wait for single task completion |
| wait_for_tasks_completion(uuids, timeout) | Wait for multiple tasks |
| verify_task_consistency(uuid) | Verify task state across all instances |
| orchestration_count() | Number of orchestration instances |
| worker_count() | Number of worker instances |
OrchestrationCluster
Lower-level cluster abstraction with load balancing:
#![allow(unused)]
fn main() {
use crate::common::orchestration_cluster::{OrchestrationCluster, ClusterConfig, LoadBalancingStrategy};
// Create cluster with round-robin load balancing
let config = ClusterConfig {
orchestration_urls: vec!["http://localhost:8080", "http://localhost:8081"],
worker_urls: vec!["http://localhost:8100", "http://localhost:8101"],
load_balancing: LoadBalancingStrategy::RoundRobin,
health_timeout: Duration::from_secs(5),
};
let cluster = OrchestrationCluster::new(config).await?;
// Get client using load balancing strategy
let client = cluster.get_client();
// Get all clients for parallel operations
for client in cluster.all_clients() {
let task = client.get_task(task_uuid).await?;
}
}
Running Cluster Tests
Prerequisites
- PostgreSQL running with PGMQ extension
- Environment configured for cluster mode
Step-by-Step
# 1. Start PostgreSQL (if not already running)
cargo make docker-up
# 2. Setup cluster environment
cargo make setup-env-cluster
# 3. Start the full cluster
cargo make cluster-start-all
# 4. Verify cluster health
cargo make cluster-status
# Expected output:
# Instance Status:
# ─────────────────────────────────────────────────────────────
# INSTANCE STATUS PID PORT
# ─────────────────────────────────────────────────────────────
# orchestration-1 healthy 12345 8080
# orchestration-2 healthy 12346 8081
# worker-rust-1 healthy 12347 8100
# worker-rust-2 healthy 12348 8101
# ... (more workers)
# 5. Run cluster tests
cargo make test-rust-cluster
# 6. Stop cluster when done
cargo make cluster-stop
Monitoring During Tests
# In separate terminal: Watch cluster logs
cargo make cluster-logs
# Or orchestration logs only
cargo make cluster-logs-orchestration
# Quick status check (no health probes)
cargo make cluster-status-quick
Test Scenarios
Concurrent Task Creation
Validates that tasks can be created concurrently across orchestration instances without conflicts.
File: tests/e2e/multi_instance/concurrent_task_creation_test.rs
Test: test_concurrent_task_creation_across_instances
Validates:
- Tasks created through different orchestration instances
- All tasks complete successfully
- State is consistent across all instances
- No duplicate UUIDs generated
Rapid Task Burst
Stress tests the system by creating many tasks in quick succession.
Test: test_rapid_task_creation_burst
Validates:
- System handles high task creation rate
- No duplicate task UUIDs
- All tasks created successfully
Round-Robin Distribution
Verifies tasks are distributed across instances using round-robin.
Test: test_task_creation_round_robin_distribution
Validates:
- Tasks distributed across instances
- Distribution is approximately even
- No single-instance bottleneck
Validation Results
The cluster testing infrastructure was validated with the following results:
Test Summary
| Metric | Result |
|---|---|
| Tests Passed | 1645 |
| Intermittent Failures | 3 (resource contention, not race conditions) |
| Tests Skipped | 21 (domain event tests, require single-instance) |
| Cluster Configuration | 2x orchestration + 2x each worker type (10 total) |
Key Findings
- No Race Conditions Detected: All concurrent operations completed without data corruption or invalid states
- Defense-in-Depth Validated: Four protection layers (database atomicity, state machine guards, transaction boundaries, application logic) work correctly together
- Recovery Mechanism Works: Tasks and steps recover correctly after simulated failures
- Consistent State: Task state is consistent when queried from any orchestration instance
Connection Pool Deadlock (Fixed)
Initial testing revealed intermittent failures under high parallelization:
- Cause: Connection pool deadlock in task initialization - transactions held connections while template loading needed additional connections
- Root Cause Fix: Moved template loading BEFORE transaction begins in task_initialization/service.rs
- Additional Tuning: Increased pool sizes (20→30 max, 1→2 min connections)
- Status: ✅ Fixed - all 9 cluster tests now pass in parallel
See the connection pool deadlock pattern documentation in docs/ticket-specs/ for details.
Domain Event Tests
21 tests were skipped in cluster mode (marked with #[cfg(not(feature = "test-cluster"))]):
- Reason: Domain event tests verify in-process event delivery, incompatible with multi-process cluster
- Status: Working as designed - these tests run in single-instance CI
Test Feature Flag Implementation
Adding the Feature Gate
Tests requiring cluster infrastructure should use the feature gate:
#![allow(unused)]
fn main() {
// At module level
#![cfg(feature = "test-cluster")]
// Or at test level
#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_cluster_specific_behavior() -> Result<()> {
// ...
}
}
Skipping Tests in Cluster Mode
Some tests (like domain events) don’t work in cluster mode:
#![allow(unused)]
fn main() {
// Only run when NOT in cluster mode
#[tokio::test]
#[cfg(not(feature = "test-cluster"))]
async fn test_domain_event_delivery() -> Result<()> {
// In-process event testing
}
}
Conditional Imports
#![allow(unused)]
fn main() {
// Only import cluster test utilities when needed
#[cfg(feature = "test-cluster")]
use crate::common::multi_instance_test_manager::MultiInstanceTestManager;
}
Nextest Configuration
The .config/nextest.toml configures test execution for cluster scenarios:
[profile.default]
retries = 0
leak-timeout = { period = "500ms", result = "fail" }
fail-fast = false
# Multi-instance tests can run in parallel once cluster is warmed up
[[profile.default.overrides]]
filter = 'test(multi_instance)'
[profile.ci]
# Limit parallelism to avoid database connection pool exhaustion
test-threads = 4
Cluster Warmup: Multi-instance tests can run in parallel. Connection pools now start with min_connections=2 for faster warmup. The 5-second delay built into cluster-start-all usually suffices. If you see “Failed to create task after all retries” errors immediately after startup, wait a few more seconds for pools to fully initialize.
Troubleshooting
Cluster Won’t Start
# Check for port conflicts
lsof -i :8080-8089
lsof -i :8100-8109
# Check for stale PID files
ls -la .pids/
rm -rf .pids/*.pid # Clean up stale PIDs
# Retry start
cargo make cluster-start-all
Tests Timeout / “Failed to create task after all retries”
This typically indicates the cluster wasn’t fully warmed up:
# Check cluster health
cargo make cluster-status
# If health is green but tests fail, wait for warmup
sleep 10 && cargo make test-rust-cluster
# Check logs for errors
cargo make cluster-logs | head -100
# Restart cluster with extra warmup
cargo make cluster-stop
cargo make cluster-start-all
sleep 10
cargo make test-rust-cluster
Root cause: Connection pools start at min_connections=2 and grow on demand. The first requests after startup may timeout while pools are establishing connections.
Connection Pool Exhaustion
If tests fail with “pool timed out” errors, ensure you have the latest code with:
- Template loading before transaction in task_initialization/service.rs
- Pool sizes: max_connections=30, min_connections=2 in test config
If issues persist, verify pool configuration:
# Check test config
cat config/tasker/generated/orchestration-test.toml | grep -A5 "pool"
Environment Variables Not Set
# Verify environment
env | grep TASKER_TEST
# Re-source environment
source .env
# Or regenerate
cargo make setup-env-cluster
CI Considerations
Cluster tests are NOT run in CI due to GitHub Actions resource constraints:
- Running multiple orchestration + worker instances requires more memory than free GHA runners provide
- This is a conscious tradeoff for an open-source, pre-alpha project
Future Options (when project matures):
- Self-hosted runners with more resources
- Paid GHA larger runners
- Separate manual workflow trigger for cluster tests
Workaround: Run cluster tests locally before PRs that touch concurrent processing code.
Related Documentation
- Tooling - Cluster deployment tasks
- Idempotency and Atomicity - Protection mechanisms
Comprehensive Lifecycle Testing Framework Guide
This guide demonstrates the complete lifecycle testing framework, showing patterns, examples, and best practices for validating task and workflow step lifecycles with integrated SQL function validation.
Table of Contents
- Framework Overview
- Core Testing Patterns
- Advanced Assertion Traits
- Template-Based Testing
- SQL Function Integration
- Example Test Executions
- Tracing Output Examples
- Best Practices
- Troubleshooting
Framework Overview
Architecture
The comprehensive lifecycle testing framework consists of several key components:
#![allow(unused)]
fn main() {
// Core Infrastructure
TestOrchestrator // Wrapper around orchestration components
StepErrorSimulator // Realistic error scenario simulation
SqlLifecycleAssertion // SQL function validation
TestScenarioBuilder // YAML template loading
// Advanced Patterns
TemplateTestRunner // Parameterized error pattern testing
ErrorPattern // Comprehensive error configuration
TaskAssertions // Task-level validation trait
StepAssertions // Step-level validation trait
}
Integration Strategy
Each test follows the integrated validation pattern:
- Exercise Lifecycle: Use orchestration framework to create scenario
- Capture SQL State: Call SQL functions to get current state
- Assert Integration: Validate SQL functions return expected values
- Document Relationship: Structured tracing showing cause → effect
Core Testing Patterns
Pattern 1: Basic Lifecycle Validation
#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_basic_lifecycle_validation(pool: PgPool) -> Result<()> {
tracing::info!("🧪 Testing basic lifecycle progression");
// STEP 1: Exercise lifecycle using framework
let orchestrator = TestOrchestrator::new(pool.clone());
let task = orchestrator.create_simple_task("test", "basic_validation").await?;
let step = get_first_step(&pool, task.task_uuid).await?;
// STEP 2: Validate initial state
pool.assert_step_ready(step.workflow_step_uuid).await?;
// STEP 3: Execute step
let result = orchestrator.execute_step(&step, true, 1000).await?;
assert!(result.success);
// STEP 4: Validate final state
pool.assert_step_complete(step.workflow_step_uuid).await?;
tracing::info!("✅ Basic lifecycle validation complete");
Ok(())
}
}
Pattern 2: Error and Retry Validation
#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_error_retry_validation(pool: PgPool) -> Result<()> {
tracing::info!("🔄 Testing error and retry behavior");
let orchestrator = TestOrchestrator::new(pool.clone());
let task = orchestrator.create_simple_task("test", "retry_validation").await?;
let step = get_first_step(&pool, task.task_uuid).await?;
// STEP 1: Simulate retryable error
StepErrorSimulator::simulate_execution_error(
&pool,
&step,
1 // attempt number
).await?;
// STEP 2: Validate retry behavior
pool.assert_step_retry_behavior(
step.workflow_step_uuid,
1, // expected attempts
None, // no custom backoff
true // still retry eligible
).await?;
tracing::info!("✅ Error retry validation complete");
Ok(())
}
}
Pattern 3: Complex Dependency Validation
#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_dependency_validation(pool: PgPool) -> Result<()> {
tracing::info!("🔗 Testing dependency relationships");
let orchestrator = TestOrchestrator::new(pool.clone());
// Create diamond pattern workflow
let task = create_diamond_workflow_task(&orchestrator).await?;
let steps = get_task_steps(&pool, task.task_uuid).await?;
// Execute start step
let result = orchestrator.execute_step(&steps[0], true, 1000).await?;
assert!(result.success);
// Fail one branch
StepErrorSimulator::simulate_validation_error(
&pool,
&steps[1],
"dependency_test_error"
).await?;
// Complete other branch
let result = orchestrator.execute_step(&steps[2], true, 1000).await?;
assert!(result.success);
// Validate convergence step is blocked
pool.assert_step_blocked(steps[3].workflow_step_uuid).await?;
tracing::info!("✅ Dependency validation complete");
Ok(())
}
}
Advanced Assertion Traits
TaskAssertions Trait Usage
#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::{TaskAssertions, TaskStepDistribution};
// Task completion validation
pool.assert_task_complete(task_uuid).await?;
// Task error state validation
pool.assert_task_error(task_uuid, 2).await?; // 2 error steps
// Complex step distribution validation
pool.assert_task_step_distribution(
task_uuid,
TaskStepDistribution {
total_steps: 4,
completed_steps: 2,
failed_steps: 1,
ready_steps: 0,
pending_steps: 1,
in_progress_steps: 0,
error_steps: 1,
}
).await?;
// Execution status validation
pool.assert_task_execution_status(
task_uuid,
ExecutionStatus::BlockedByFailures,
Some(RecommendedAction::HandleFailures)
).await?;
// Completion percentage validation
pool.assert_task_completion_percentage(task_uuid, 75.0, 5.0).await?;
}
StepAssertions Trait Usage
#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::StepAssertions;
// Basic step state validations
pool.assert_step_ready(step_uuid).await?;
pool.assert_step_complete(step_uuid).await?;
pool.assert_step_blocked(step_uuid).await?;
// Retry behavior validation
pool.assert_step_retry_behavior(
step_uuid,
3, // expected attempts
Some(30), // custom backoff seconds
false // not retry eligible (exhausted)
).await?;
// Dependency validation
pool.assert_step_dependencies_satisfied(step_uuid, true).await?;
// State transition sequence validation
pool.assert_step_state_sequence(
step_uuid,
vec!["Pending".to_string(), "InProgress".to_string(), "Complete".to_string()]
).await?;
// Permanent failure validation
pool.assert_step_failed_permanently(step_uuid).await?;
// Waiting for retry with specific time
let retry_time = chrono::Utc::now() + chrono::Duration::seconds(60);
pool.assert_step_waiting(step_uuid, retry_time).await?;
}
Template-Based Testing
ErrorPattern Configuration
#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::{ErrorPattern, TemplateTestRunner};
// Simple patterns
let success_pattern = ErrorPattern::AllSuccess;
let first_fail_pattern = ErrorPattern::FirstStepFails { retryable: true };
let last_fail_pattern = ErrorPattern::LastStepFails { permanently: false };
// Advanced patterns
let targeted_pattern = ErrorPattern::MiddleStepFails {
step_name: "process_payment".to_string(),
attempts_before_success: 3
};
let dependency_pattern = ErrorPattern::DependencyBlockage {
blocked_step: "finalize_order".to_string(),
blocking_step: "validate_payment".to_string()
};
// Custom pattern with full control
let custom_pattern = ErrorPattern::Custom {
step_configs: {
let mut configs = HashMap::new();
configs.insert("critical_step".to_string(), StepErrorConfig {
error_type: StepErrorType::ExternalServiceError,
attempts_before_success: Some(5),
custom_backoff_seconds: Some(120),
permanently_fails: false,
});
configs
}
};
Template Runner Usage
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_template_patterns(pool: PgPool) -> Result<()> {
let template_runner = TemplateTestRunner::new(pool.clone()).await?;
// Test single pattern
let summary = template_runner.run_template_with_errors(
"order_fulfillment.yaml",
ErrorPattern::FirstStepFails { retryable: true }
).await?;
assert!(summary.sql_validations_passed > 0);
assert_eq!(summary.sql_validations_failed, 0);
// Test all patterns automatically
let summaries = template_runner
.run_template_with_all_patterns("linear_workflow.yaml")
.await?;
for summary in summaries {
tracing::info!(
pattern = summary.error_pattern,
execution_time = summary.execution_time_ms,
validations = summary.sql_validations_passed,
"Pattern execution complete"
);
}
Ok(())
}
SQL Function Integration
Direct SQL Function Testing
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_direct_sql_functions(pool: PgPool) -> Result<()> {
// Test get_step_readiness_status
let step_status = sqlx::query!(
"SELECT ready_for_execution, dependencies_satisfied, retry_eligible, attempts,
backoff_request_seconds, next_retry_at
FROM get_step_readiness_status($1)",
step_uuid
)
.fetch_one(&pool)
.await?;
// Validate individual fields
assert_eq!(step_status.ready_for_execution, Some(true));
assert_eq!(step_status.dependencies_satisfied, Some(true));
assert_eq!(step_status.retry_eligible, Some(false));
assert_eq!(step_status.attempts, Some(0));
// Test get_task_execution_context
let task_context = sqlx::query!(
"SELECT total_steps, completed_steps, failed_steps, ready_steps,
pending_steps, in_progress_steps, error_steps,
completion_percentage, execution_status, recommended_action,
blocked_by_errors
FROM get_task_execution_context($1)",
task_uuid
)
.fetch_one(&pool)
.await?;
// Validate task aggregations
assert!(task_context.total_steps.unwrap_or(0) > 0);
assert_eq!(task_context.completed_steps, Some(0));
assert_eq!(task_context.failed_steps, Some(0));
Ok(())
}
Integrated SQL Validation Pattern
// The standard pattern used throughout the framework
async fn validate_integrated_sql_behavior(
pool: &PgPool,
task_uuid: Uuid,
step: &WorkflowStep, // step row under test (illustrative type name; any step model exposing workflow_step_uuid)
) -> Result<()> {
let step_uuid = step.workflow_step_uuid;
// STEP 1: Execute lifecycle action
StepErrorSimulator::simulate_execution_error(pool, step, 2).await?;
// STEP 2: Immediately validate SQL functions
SqlLifecycleAssertion::assert_step_scenario(
pool,
task_uuid,
step_uuid,
ExpectedStepState {
state: "Error".to_string(),
ready_for_execution: false,
dependencies_satisfied: true,
retry_eligible: true,
attempts: 2,
next_retry_at: Some(calculate_expected_retry_time()),
backoff_request_seconds: None,
retry_limit: 3,
}
).await?;
// STEP 3: Document the relationship
tracing::info!(
lifecycle_action = "simulate_execution_error",
sql_result = "retry_eligible=true, attempts=2",
"✅ INTEGRATION: Lifecycle → SQL alignment verified"
);
Ok(())
}
Example Test Executions
Running Individual Tests
# Run specific test with detailed output
RUST_LOG=info cargo test --test complex_retry_scenarios \
test_cascading_retries_with_dependencies -- --nocapture
# Run all lifecycle tests
cargo test --all-features --test '*lifecycle*' -- --nocapture
# Run with specific environment
TASKER_ENV=test cargo test --test step_retry_lifecycle_tests -- --nocapture
Running Test Suites
# Run comprehensive validation
cargo test --test sql_function_integration_validation -- --nocapture
# Run complex scenarios
cargo test --test complex_retry_scenarios -- --nocapture
# Run task finalization tests
cargo test --test task_finalization_error_scenarios -- --nocapture
Tracing Output Examples
Successful Test Execution
2025-01-15T10:30:45.123Z INFO test_cascading_retries_with_dependencies:
🧪 Testing cascading retries with diamond dependency pattern
2025-01-15T10:30:45.125Z INFO test_cascading_retries_with_dependencies:
🏗️ Creating diamond workflow: Start → BranchA/BranchB → Convergence
2025-01-15T10:30:45.145Z INFO test_cascading_retries_with_dependencies:
📋 STEP 1: Executing start step successfully
step_uuid=01JGJX7K8QMRNP4W2X3Y5Z6ABC
2025-01-15T10:30:45.167Z INFO test_cascading_retries_with_dependencies:
🔄 STEP 2: Simulating BranchA failure (attempt 1)
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
error_type="ExecutionError" retryable=true
2025-01-15T10:30:45.189Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Retry behavior matches expectations
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
attempts=1 backoff=null retry_eligible=true
2025-01-15T10:30:45.201Z INFO test_cascading_retries_with_dependencies:
🔄 STEP 3: BranchA retry attempt (attempt 2)
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
2025-01-15T10:30:45.223Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step completed successfully
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
2025-01-15T10:30:45.245Z INFO test_cascading_retries_with_dependencies:
❌ STEP 4: Simulating BranchB permanent failure
step_uuid=01JGJX7K8SMRNP4W2X3Y5Z6GHI
error_type="ValidationError" retryable=false
2025-01-15T10:30:45.267Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step failed permanently (not retryable)
step_uuid=01JGJX7K8SMRNP4W2X3Y5Z6GHI
2025-01-15T10:30:45.289Z INFO test_cascading_retries_with_dependencies:
🚫 STEP 5: Validating Convergence step is blocked
step_uuid=01JGJX7K8TMRNP4W2X3Y5Z6JKL
2025-01-15T10:30:45.301Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step blocked by dependencies
step_uuid=01JGJX7K8TMRNP4W2X3Y5Z6JKL
2025-01-15T10:30:45.323Z INFO test_cascading_retries_with_dependencies:
📊 TASK ASSERTION: Step distribution matches expectations
task_uuid=01JGJX7K8PMRNP4W2X3Y5Z6MNO
total=4 completed=2 failed=0 ready=0 pending=0 in_progress=0 error=2
2025-01-15T10:30:45.345Z INFO test_cascading_retries_with_dependencies:
✅ INTEGRATION: Lifecycle → SQL alignment verified
lifecycle_action="cascading_retry_with_dependency_blocking"
sql_result="blocked_by_errors=true, error_steps=2"
2025-01-15T10:30:45.356Z INFO test_cascading_retries_with_dependencies:
🧪 CASCADING RETRY TEST COMPLETE: Diamond pattern with mixed outcomes validated
Error Pattern Testing Output
2025-01-15T10:35:12.123Z INFO test_template_runner_all_patterns:
🎭 TEMPLATE DEMO: All error patterns with multiple templates
2025-01-15T10:35:12.145Z INFO test_template_runner_all_patterns:
📋 Testing template with all error patterns
template="linear_workflow.yaml"
2025-01-15T10:35:12.167Z INFO template_runner:
🎭 TEMPLATE TEST: Starting parameterized test execution
template_path="linear_workflow.yaml"
error_pattern=r#"AllSuccess"#
2025-01-15T10:35:12.234Z INFO template_runner:
🎭 TEMPLATE TEST: Execution complete
template_path="linear_workflow.yaml"
execution_time_ms=67
successful_steps=4 failed_steps=0 retried_steps=0
final_state="Complete"
validations_passed=12 validations_failed=0
2025-01-15T10:35:12.256Z INFO template_runner:
🎭 TEMPLATE TEST: Starting parameterized test execution
template_path="linear_workflow.yaml"
error_pattern=r#"FirstStepFails { retryable: true }"#
2025-01-15T10:35:12.334Z INFO template_runner:
📋 TEMPLATE: Simulated retryable error
step_name="initialize" attempt=1
2025-01-15T10:35:12.356Z INFO template_runner:
📋 TEMPLATE: Simulated retryable error
step_name="initialize" attempt=2
2025-01-15T10:35:12.423Z INFO template_runner:
🎭 TEMPLATE TEST: Execution complete
template_path="linear_workflow.yaml"
execution_time_ms=167
successful_steps=4 failed_steps=0 retried_steps=1
final_state="Complete"
validations_passed=15 validations_failed=0
2025-01-15T10:35:12.445Z INFO test_template_runner_all_patterns:
📊 Template pattern result
template="linear_workflow.yaml" pattern_index=0
pattern="AllSuccess" execution_time_ms=67
final_state="Complete" total_validations=12 success_rate="100.0%"
2025-01-15T10:35:12.467Z INFO test_template_runner_all_patterns:
📊 Template pattern result
template="linear_workflow.yaml" pattern_index=1
pattern=r#"FirstStepFails { retryable: true }"# execution_time_ms=167
final_state="Complete" total_validations=15 success_rate="100.0%"
SQL Function Validation Output
2025-01-15T10:40:30.123Z INFO test_comprehensive_sql_function_integration:
🔍 SQL INTEGRATION: Starting comprehensive validation across all scenarios
2025-01-15T10:40:30.145Z INFO test_comprehensive_sql_function_integration:
📋 SCENARIO 1: Basic lifecycle progression validation
2025-01-15T10:40:30.167Z INFO validate_initial_state:
✅ Initial state validation passed
2025-01-15T10:40:30.189Z INFO validate_step_completion:
✅ Step completion validation passed
step_uuid=01JGJX7M8QMRNP4W2X3Y5Z6PQR
2025-01-15T10:40:30.201Z INFO test_comprehensive_sql_function_integration:
✅ SCENARIO 1: Basic lifecycle validation complete
scenario="basic_lifecycle" validations=2
2025-01-15T10:40:30.223Z INFO test_comprehensive_sql_function_integration:
🔄 SCENARIO 2: Error handling and retry behavior validation
2025-01-15T10:40:30.245Z INFO validate_retry_behavior:
✅ Retry behavior validation passed
step_uuid=01JGJX7M8RMRNP4W2X3Y5Z6STU
attempts=1 backoff=Some(5) retry_eligible=true
2025-01-15T10:40:30.267Z INFO test_comprehensive_sql_function_integration:
✅ SCENARIO 2: Error and retry validation complete
scenario="error_retry" validations=1
2025-01-15T10:40:30.289Z INFO test_comprehensive_sql_function_integration:
🎯 FINAL VALIDATION: Comprehensive results summary
total_validations=25 successful_validations=25
success_rate="100.00%" scenarios_tested=6
2025-01-15T10:40:30.301Z INFO test_comprehensive_sql_function_integration:
🔍 SQL INTEGRATION VALIDATION COMPLETE: All scenarios validated successfully
Best Practices
1. Always Use Integrated Validation Pattern
// ✅ GOOD: Integrated lifecycle + SQL validation
async fn test_step_retry_behavior(pool: PgPool) -> Result<()> {
// Exercise lifecycle
StepErrorSimulator::simulate_execution_error(pool, step, 1).await?;
// Immediately validate SQL functions
pool.assert_step_retry_behavior(step_uuid, 1, None, true).await?;
// Document relationship
tracing::info!("✅ INTEGRATION: Retry behavior alignment verified");
Ok(())
}
// ❌ BAD: Testing SQL functions in isolation
async fn test_sql_only(pool: PgPool) -> Result<()> {
// Directly manipulating database state
sqlx::query!("UPDATE steps SET attempts = 3").execute(pool).await?;
// This doesn't prove the integration works
let status = sqlx::query!("SELECT * FROM get_step_readiness_status($1)", uuid)
.fetch_one(pool).await?;
Ok(())
}
2. Use Structured Tracing
// ✅ GOOD: Structured tracing with context
tracing::info!(
step_uuid = %step.workflow_step_uuid,
attempts = expected_attempts,
backoff = ?expected_backoff,
retry_eligible = expected_retry_eligible,
"✅ STEP ASSERTION: Retry behavior matches expectations"
);
// ❌ BAD: Unstructured logging
println!("Step retry test passed");
3. Test Multiple Scenarios
// ✅ GOOD: Comprehensive scenario coverage
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_complete_retry_scenarios(pool: PgPool) -> Result<()> {
// Test retryable error
test_retryable_error_scenario(&pool).await?;
// Test non-retryable error
test_non_retryable_error_scenario(&pool).await?;
// Test retry exhaustion
test_retry_exhaustion_scenario(&pool).await?;
// Test custom backoff
test_custom_backoff_scenario(&pool).await?;
Ok(())
}
4. Validate State Transitions
// ✅ GOOD: Validate complete state transition sequence
pool.assert_step_state_sequence(
step_uuid,
vec![
"Pending".to_string(),
"InProgress".to_string(),
"Error".to_string(),
"WaitingForRetry".to_string(),
"Ready".to_string(),
"Complete".to_string()
]
).await?;
5. Use Assertion Traits for Readability
// ✅ GOOD: Clear, readable assertions
pool.assert_task_complete(task_uuid).await?;
pool.assert_step_failed_permanently(step_uuid).await?;
// ❌ BAD: Manual SQL queries everywhere
let task_status = sqlx::query!("SELECT ...").fetch_one(pool).await?;
assert_eq!(task_status.some_field, Some("Complete"));
Troubleshooting
Common Issues
1. Assertion Failures
Error: Task 01JGJX... assertion failed: expected Complete, found Processing
// Solution: Ensure lifecycle actions complete before asserting
tokio::time::sleep(Duration::from_millis(100)).await;
pool.assert_task_complete(task_uuid).await?;
2. SQL Function Mismatches
Error: Step 01JGJX... retry assertion failed: attempts expected 2, got Some(1)
// Solution: Verify error simulator is configured correctly
StepErrorSimulator::simulate_execution_error(pool, step, 2).await?; // 2 attempts
3. State Machine Violations
Error: Cannot transition from Complete to InProgress
// Solution: Use proper orchestration framework, not direct DB manipulation
let result = orchestrator.execute_step(step, true, 1000).await?;
// Don't: sqlx::query!("UPDATE steps SET state = 'InProgress'").execute(pool).await?;
4. Template Loading Issues
Error: Template 'nonexistent.yaml' not found
// Solution: Ensure template exists in correct directory
// templates should be in tests/fixtures/task_templates/rust/
Debugging Techniques
1. Enable Detailed Tracing
RUST_LOG=debug cargo test test_name -- --nocapture
2. Inspect SQL Function Results Directly
let step_status = sqlx::query!(
"SELECT * FROM get_step_readiness_status($1)",
step_uuid
)
.fetch_one(&pool)
.await?;
tracing::debug!("Step status: {:?}", step_status);
3. Validate Test Prerequisites
// Ensure test setup is correct
assert_eq!(steps.len(), 4, "Test requires 4 steps");
assert_eq!(task.namespace, "expected_namespace");
4. Use Incremental Validation
// Validate after each step
orchestrator.execute_step(&step1, true, 1000).await?;
pool.assert_step_complete(step1.workflow_step_uuid).await?;
orchestrator.execute_step(&step2, false, 1000).await?;
pool.assert_step_retry_behavior(step2.workflow_step_uuid, 1, None, true).await?;
Migration from Old Tests
Before (Direct Database Manipulation)
// ❌ OLD: Bypassing orchestration framework
async fn test_task_finalization_old(pool: PgPool) -> Result<()> {
// Direct database manipulation
sqlx::query!("UPDATE tasks SET state = 'Error'").execute(&pool).await?;
sqlx::query!("UPDATE steps SET state = 'Error'").execute(&pool).await?;
// Test SQL functions in isolation
let context = get_task_execution_context(&pool, task_uuid).await?;
assert_eq!(context.execution_status, ExecutionStatus::Error);
Ok(())
}
After (Integrated Framework)
// ✅ NEW: Using integrated framework
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_task_finalization_new(pool: PgPool) -> Result<()> {
tracing::info!("🧪 Testing task finalization with integrated approach");
// Use orchestration framework
let orchestrator = TestOrchestrator::new(pool.clone());
let task = orchestrator.create_simple_task("test", "finalization").await?;
let step = get_first_step(&pool, task.task_uuid).await?;
// Create error state through framework
StepErrorSimulator::simulate_validation_error(
&pool,
&step,
"finalization_test_error"
).await?;
// Immediately validate SQL functions
pool.assert_step_failed_permanently(step.workflow_step_uuid).await?;
pool.assert_task_error(task.task_uuid, 1).await?;
tracing::info!("✅ INTEGRATION: Finalization behavior verified");
Ok(())
}
This guide covers the lifecycle testing framework end to end, giving developers the tools to validate complex workflow behavior while maintaining confidence in the system’s correctness.
Decision Point E2E Tests
This document describes the E2E tests for decision point functionality and how to run them.
Test Location
tests/e2e/ruby/conditional_approval_test.rs
Design Note: Deferred Step Type (Added 2025-10-27)
A critical design refinement was introduced to handle convergence patterns in decision point workflows:
The Convergence Problem
In conditional_approval, all three possible outcomes (auto_approve, manager_approval, finance_review) converge to the same finalize_approval step. However, we cannot create finalize_approval at task initialization because:
- We don’t know which approval steps will be created
- finalize_approval needs different dependencies depending on the decision point’s choice
Solution: type: deferred
A new step type was added to handle this pattern:
- name: finalize_approval
  type: deferred  # NEW STEP TYPE!
  dependencies: [auto_approve, manager_approval, finance_review]  # All possible deps
How it works:
- Deferred steps list ALL possible dependencies in the template
- At initialization, deferred steps are excluded (they’re descendants of decision points)
- When a decision point creates outcome steps, the system:
  - Detects downstream deferred steps
  - Computes declared_deps ∩ actually_created_steps = the actual DAG
  - Creates deferred steps with resolved dependencies
Example:
- When routing_decision chooses auto_approve:
  - Creates: auto_approve
  - Detects: finalize_approval is deferred with declared deps [auto_approve, manager_approval, finance_review]
  - Intersection: [auto_approve, manager_approval, finance_review] ∩ [auto_approve] = [auto_approve]
  - Creates: finalize_approval depending on auto_approve only
This elegantly solves convergence without requiring handlers to explicitly list convergence steps or special orchestration logic.
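To make the resolution rule concrete, here is a minimal sketch of the intersection step. The function and type names are hypothetical, chosen only to illustrate the rule; the real orchestration code performs this computation internally.
use std::collections::HashSet;

// Illustrative only: resolve a deferred step's concrete dependencies as the
// declared dependency list intersected with the steps actually created.
fn resolve_deferred_dependencies(
    declared_deps: &[&str],
    actually_created: &HashSet<String>,
) -> Vec<String> {
    declared_deps
        .iter()
        .copied()
        .filter(|dep| actually_created.contains(*dep))
        .map(|dep| dep.to_string())
        .collect()
}

fn main() {
    // routing_decision created only auto_approve, so finalize_approval
    // ends up depending on [auto_approve] alone.
    let created: HashSet<String> = ["auto_approve".to_string()].into_iter().collect();
    let deps = resolve_deferred_dependencies(
        &["auto_approve", "manager_approval", "finance_review"],
        &created,
    );
    assert_eq!(deps, vec!["auto_approve".to_string()]);
}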
Test Coverage
The test suite validates the conditional approval workflow, which demonstrates decision point functionality with dynamic step creation based on runtime conditions (approval amount thresholds).
Test Cases
- test_small_amount_auto_approval() - Tests amounts < $1,000
  - Expected path: validate_request → routing_decision → auto_approve → finalize_approval
  - Verifies only 4 steps created
  - Confirms manager_approval and finance_review are NOT created
- test_medium_amount_manager_approval() - Tests amounts $1,000-$4,999
  - Expected path: validate_request → routing_decision → manager_approval → finalize_approval
  - Verifies only 4 steps created
  - Confirms auto_approve and finance_review are NOT created
- test_large_amount_dual_approval() - Tests amounts >= $5,000
  - Expected path: validate_request → routing_decision → manager_approval + finance_review → finalize_approval
  - Verifies 5 steps created
  - Confirms both parallel approval steps complete
  - Verifies auto_approve is NOT created
- test_decision_point_step_dependency_structure() - Validates dependency resolution
  - Verifies dynamically created steps depend on routing_decision
  - Confirms finalize_approval waits for all approval steps
  - Tests proper execution order
- test_boundary_conditions() - Tests exactly at $1,000 threshold
  - Verifies manager approval is used (not auto)
- test_boundary_large_threshold() - Tests exactly at $5,000 threshold
  - Verifies dual approval path is triggered
- test_very_small_amount() - Tests $0.01 amount
  - Verifies auto-approval for very small amounts
Running the Tests
Prerequisites
The tests require the full integration environment to be running. Use the Docker Compose test strategy:
# From the tasker-core directory
# 1. Stop any existing containers and clean up
docker-compose -f docker/docker-compose.test.yml down -v
# 2. Rebuild containers with latest changes
docker-compose -f docker/docker-compose.test.yml up --build -d
# 3. Wait for services to be healthy (about 10-15 seconds)
sleep 15
# 4. Run the conditional approval E2E tests
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo test --test e2e_tests e2e::ruby::conditional_approval_test -- --nocapture
# 5. Clean up after testing (optional)
docker-compose -f docker/docker-compose.test.yml down
Running Specific Tests
# Run just the small amount test
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_small_amount_auto_approval -- --nocapture
# Run just the large amount test
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_large_amount_dual_approval -- --nocapture
# Run all boundary tests
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_boundary -- --nocapture
Environment Variables
The tests use the following environment variables (set automatically in docker-compose.test.yml):
- DATABASE_URL: PostgreSQL connection string
- TASKER_ENV: Set to “test” for test configuration
- TASK_TEMPLATE_PATH: Points to test fixtures directory
- RUST_LOG: Set to “info” or “debug” for detailed logging
Test Workflow Details
Conditional Approval Workflow
The workflow implements amount-based routing:
┌─────────────────┐
│ validate_request│
│ (initial) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ routing_decision│ ◄─── DECISION POINT (type: decision)
│ (decision) │
└────────┬────────┘
│
├─────────── < $1,000 ─────────┐
│ │
│ ▼
│ ┌────────────────┐
│ │ auto_approve │
│ └────────┬───────┘
│ │
├─────── $1,000-$4,999 ────────┼────┐
│ │ │
│ │ ▼
│ │ ┌──────────────────┐
│ │ │ manager_approval │
│ │ └────────┬─────────┘
│ │ │
└──────── >= $5,000 ───────────┼───────────┼────┐
│ │ │
│ │ ▼
│ │ ┌───────────────┐
│ │ │ finance_review│
│ │ └───────┬───────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────┐
│ finalize_approval │
│ (convergence) │
└─────────────────────────┘
Decision Point Mechanism
- routing_decision step executes with the type: decision marker
- Handler returns DecisionPointOutcome::CreateSteps with step names
- Orchestration creates those steps dynamically and adds dependencies
- Dynamically created steps execute like normal steps
- Convergence step (finalize_approval) waits for all paths
Task Template Location
The test uses the task template at:
tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml
Ruby Handler Implementation
The Ruby handlers are located at:
workers/ruby/spec/handlers/examples/conditional_approval/
├── handlers/
│ └── conditional_approval_handler.rb
└── step_handlers/
├── validate_request_handler.rb
├── routing_decision_handler.rb ◄─── DECISION POINT HANDLER
├── auto_approve_handler.rb
├── manager_approval_handler.rb
├── finance_review_handler.rb
└── finalize_approval_handler.rb
Key Implementation Detail
The routing_decision_handler.rb returns a decision point outcome:
outcome = if steps_to_create.empty?
  TaskerCore::Types::DecisionPointOutcome.no_branches
else
  TaskerCore::Types::DecisionPointOutcome.create_steps(steps_to_create)
end

TaskerCore::Types::StepHandlerCallResult.success(
  result: {
    # IMPORTANT: The decision point outcome MUST be in this key
    decision_point_outcome: outcome.to_h,
    route_type: route[:type],
    # ... other result fields
  }
)
Troubleshooting
Tests Fail with “Template Not Found”
Ensure the Ruby worker container has the correct template path:
docker-compose -f docker/docker-compose.test.yml logs ruby-worker
# Should show: TASK_TEMPLATE_PATH=/app/tests/fixtures/task_templates/ruby
Tests Timeout
Increase wait time in docker-compose startup:
sleep 30 # Instead of sleep 15
Database Connection Errors
Verify PostgreSQL is running and healthy:
docker-compose -f docker/docker-compose.test.yml ps
docker-compose -f docker/docker-compose.test.yml logs postgres
Step Creation Doesn’t Happen
Check orchestration logs for decision point processing:
docker-compose -f docker/docker-compose.test.yml logs orchestration | grep -i decision
Success Criteria
All tests should pass with output similar to:
test e2e::ruby::conditional_approval_test::test_small_amount_auto_approval ... ok
test e2e::ruby::conditional_approval_test::test_medium_amount_manager_approval ... ok
test e2e::ruby::conditional_approval_test::test_large_amount_dual_approval ... ok
test e2e::ruby::conditional_approval_test::test_decision_point_step_dependency_structure ... ok
test e2e::ruby::conditional_approval_test::test_boundary_conditions ... ok
test e2e::ruby::conditional_approval_test::test_boundary_large_threshold ... ok
test e2e::ruby::conditional_approval_test::test_very_small_amount ... ok
test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Next Steps
After validating Ruby workers:
- Phase 8a: Implement Rust worker support for decision points
- Phase 9a: Create E2E tests for Rust worker decision points
Focused Architectural and Security Audit Report
Audit Date: 2026-02-05 Auditor: Claude (Opus 4.6 / Sonnet 4.5 sub-agents) Status: Complete
Executive Summary
This audit evaluates all Tasker Core crates for alpha readiness across security, error handling, resilience, and architecture dimensions. Findings are categorized by severity (Critical/High/Medium/Low/Info) per the methodology defined in the audit specification.
Alpha Readiness Verdict
ALPHA READY with targeted fixes. No Critical vulnerabilities found. The High-severity items (dependency CVE, input validation gaps, shutdown timeouts) are straightforward fixes that can be completed in a single sprint.
Consolidated Finding Counts (All Crates)
| Severity | Count | Status |
|---|---|---|
| Critical | 0 | None found |
| High | 9 | Must fix before alpha |
| Medium | 22 | Document as known limitations |
| Low | 13 | Track for post-alpha |
High-Severity Findings (Must Fix Before Alpha)
| ID | Finding | Crate | Fix Effort | Remediation |
|---|---|---|---|---|
| S-1 | Queue name validation missing | tasker-shared | Small | Queue name validation |
| S-2 | SQL error details exposed to clients | tasker-shared | Medium | Error message sanitization |
| S-3 | #[allow] → #[expect] (systemic) | All | Small (batch) | Lint compliance cleanup |
| P-1 | NOTIFY channel name unvalidated | tasker-pgmq | Small | Queue name validation |
| O-1 | No actor panic recovery | tasker-orchestration | Medium | Shutdown and recovery hardening |
| O-2 | Graceful shutdown lacks timeout | tasker-orchestration | Small | Shutdown and recovery hardening |
| W-1 | checkpoint_yield blocks FFI without timeout | tasker-worker | Small | FFI checkpoint timeout |
| X-1 | bytes v1.11.0 CVE (RUSTSEC-2026-0007) | Workspace | Trivial | Dependency upgrade |
| P-2 | CLI migration SQL generation unescaped | tasker-pgmq | Small | Queue name validation |
Crate 1: tasker-shared
Overall Rating: A- (Strong foundation with targeted improvements needed)
The tasker-shared crate is the largest and most foundational crate in the workspace. It provides core types, error handling, messaging abstraction, security services, circuit breakers, configuration management, database utilities, and shared models. The crate demonstrates strong security practices overall.
Strengths
- Zero unsafe code across the entire crate
- Excellent cryptographic hygiene: Constant-time API key comparison via subtle::ConstantTimeEq (src/types/api_key_auth.rs:53-62), JWKS hardening with SSRF prevention (blocks private IPs, cloud metadata endpoints, requires HTTPS), algorithm allowlist enforcement (no alg: none)
- Comprehensive input validation: JSONB validation with size/depth/key count limits (src/validation.rs), namespace validation with PostgreSQL identifier rules, XSS sanitization
- 100% SQLx macro usage: All database queries use compile-time verified sqlx::query! macros, zero string interpolation in SQL
- Lock-free circuit breakers: Atomic state management (AtomicU8 for state, AtomicU64 for metrics), proper memory ordering, correct state machine transitions
- All MPSC channels bounded and config-driven: Full bounded-channel compliance
- Exemplary config security: Environment variable allowlist with regex validation, TOML injection prevention via escape_toml_string(), fail-fast on validation errors
- No hardcoded secrets: All sensitive values come from env vars or file paths
- Well-organized API surface: Feature-gated modules (web-api, grpc-api), selective re-exports
Finding S-1 (HIGH): Queue Name Validation Missing
Location: tasker-shared/src/messaging/service/router.rs:96-97
Queue names are constructed via format! with unvalidated namespace input:
fn step_queue(&self, namespace: &str) -> String {
format!("{}_{}_queue", self.worker_queue_prefix, namespace)
}
The MessagingError::InvalidQueueName variant exists (src/messaging/errors.rs:56) but is never raised. Neither the router nor the provider implementations (pgmq.rs:134-139, rabbitmq.rs:276-375) validate queue names before passing them to native queue APIs.
Risk: PGMQ creates PostgreSQL tables named after queues — special characters in namespace could cause SQL issues at the DDL level. RabbitMQ queue creation could fail with unexpected characters.
Recommendation: Add validate_queue_name() that enforces alphanumeric + underscore/hyphen, 1-255 chars. Call it in DefaultMessageRouter methods and/or ensure_queue().
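A minimal sketch of such a guard (illustrative; the real implementation would return the existing MessagingError::InvalidQueueName variant rather than a String):
// Illustrative validation: alphanumeric plus underscore/hyphen, 1-255 chars.
fn validate_queue_name(name: &str) -> Result<(), String> {
    let valid_len = (1..=255).contains(&name.len());
    let valid_chars = name
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    if valid_len && valid_chars {
        Ok(())
    } else {
        // Real code: Err(MessagingError::InvalidQueueName(name.to_string()))
        Err(format!("invalid queue name: {name}"))
    }
}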
Finding S-2 (HIGH): SQL Error Details Exposed to Clients
Location: tasker-shared/src/errors.rs:71-74, 431-437
impl From<sqlx::Error> for TaskerError {
fn from(err: sqlx::Error) -> Self {
TaskerError::DatabaseError(err.to_string())
}
}
sqlx::Error::to_string() can expose SQL query details, table/column names, constraint names, and potentially connection string information. These error messages may propagate to API responses.
Recommendation: Create a sanitized error mapper that logs full details internally but returns generic messages to API clients (e.g., “Database operation failed” with an internal error ID for correlation).
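One possible shape for that mapper, shown as a sketch; the correlation-id scheme and the exact log fields are assumptions, not the crate's current behavior:
// Illustrative: log full details internally, expose only a generic message
// plus a correlation id that operators can match against server logs.
impl From<sqlx::Error> for TaskerError {
    fn from(err: sqlx::Error) -> Self {
        let error_id = uuid::Uuid::new_v4();
        tracing::error!(%error_id, error = %err, "database operation failed");
        TaskerError::DatabaseError(format!("Database operation failed (ref: {error_id})"))
    }
}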
Finding S-3 (HIGH): #[allow] Used Instead of #[expect] (Lint Policy Violation)
Locations:
- src/messaging/execution_types.rs:383 - #[allow(clippy::too_many_arguments)]
- src/web/authorize.rs:194 - #[allow(dead_code)]
- src/utils/serde.rs:46-47 - #[allow(dead_code)]
Project lint policy mandates #[expect(lint_name, reason = "...")] instead of #[allow]. This is a policy compliance issue.
Recommendation: Convert all #[allow] to #[expect] with documented reasons.
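The mechanical change looks like this; the function name and reason string are illustrative, not taken from the codebase:
// Before: #[allow(dead_code)]
// After: the lint is still suppressed, but the compiler warns if the
// expectation ever becomes unnecessary.
#[expect(dead_code, reason = "kept for upcoming authorization middleware work")]
fn legacy_authorize_helper() {}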
Finding S-4 (MEDIUM): unwrap_or_default() Violations of Tenet #11 (Fail Loudly)
Locations (20+ instances across crate):
- src/messaging/execution_types.rs:120,186,213 - Step execution status defaults to empty string
- src/database/sql_functions.rs:377,558 - Query results default to empty vectors
- src/registry/task_handler_registry.rs:214,268,656,700,942 - Config schema fields default silently
- src/proto/conversions.rs:32 - Invalid timestamps silently default to UNIX epoch
Risk: Required fields silently defaulting to empty values can mask real errors and produce incorrect behavior that’s hard to debug.
Recommendation: Audit all unwrap_or_default() usages. Replace with explicit error returns for required fields. Keep unwrap_or_default() only for truly optional fields with documented rationale.
Finding S-5 (MEDIUM): Error Context Loss in .map_err(|_| ...)
14 instances where original error context is discarded:
- src/messaging/service/providers/rabbitmq.rs:544 - Discards parse error
- src/messaging/service/providers/in_memory.rs:305,331,368 - 3 instances
- src/state_machine/task_state_machine.rs:114 - Discards parse error
- src/state_machine/actions.rs:256,372,434,842 - 4 instances discarding publisher errors
- src/config/config_loader.rs:220,417 - 2 instances discarding env var errors
- src/database/sql_functions.rs:1032 - Discards decode error
- src/types/auth.rs:283 - Discards parse error
Recommendation: Include original error via .map_err(|e| SomeError::new(context, e.to_string())).
Finding S-6 (MEDIUM): Production expect() Calls
- src/macros.rs:65 - Panics if Tokio task spawning fails
- src/cache/provider.rs:399,429,459,489,522 - Multiple expect("checked in should_use") calls
Risk: Panics in production code. While guarded by preconditions, they bypass error propagation.
Recommendation: Replace with Result propagation or add detailed safety comments explaining invariant guarantees.
Finding S-7 (MEDIUM): Database Pool Config Lacks Validation
Database pool configuration (PoolConfig) does not have a validate() method. Unlike circuit breaker config which validates ranges (failure_threshold > 0, timeout <= 300s), pool config relies on sqlx to reject invalid values at runtime.
Recommendation: Add validation: max_connections > 0, min_connections <= max_connections, acquire_timeout_seconds > 0.
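A sketch of the missing check; the field names mirror the recommendation and are assumptions about the PoolConfig struct:
// Assumed shape of the relevant PoolConfig fields, for illustration.
struct PoolConfig {
    max_connections: u32,
    min_connections: u32,
    acquire_timeout_seconds: u64,
}

impl PoolConfig {
    fn validate(&self) -> Result<(), String> {
        if self.max_connections == 0 {
            return Err("max_connections must be > 0".into());
        }
        if self.min_connections > self.max_connections {
            return Err("min_connections must be <= max_connections".into());
        }
        if self.acquire_timeout_seconds == 0 {
            return Err("acquire_timeout_seconds must be > 0".into());
        }
        Ok(())
    }
}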
Finding S-8 (MEDIUM): Individual Query Timeouts Missing
While database pools have acquire_timeout configured (src/database/pools.rs:169-170), individual sqlx::query! calls lack explicit timeout wrappers. Long-running queries rely solely on pool-level timeouts.
Recommendation: Consider PostgreSQL statement_timeout at the connection level, or add tokio::time::timeout() wrappers around critical query paths.
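For the tokio::time::timeout() route, a wrapper along these lines works as a sketch; the 5-second budget, the table name, and the use of anyhow are assumptions:
use std::time::Duration;
use sqlx::PgPool;

// Illustrative: bound an individual query so a stuck statement surfaces as an
// error instead of holding a pooled connection indefinitely.
async fn count_tasks(pool: &PgPool) -> anyhow::Result<i64> {
    let (count,): (i64,) = tokio::time::timeout(
        Duration::from_secs(5), // per-query budget (assumed value)
        sqlx::query_as("SELECT COUNT(*) FROM tasks").fetch_one(pool), // hypothetical table
    )
    .await??;
    Ok(count)
}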
Finding S-9 (LOW): Message Size Limits Not Enforced
Messaging deserialization uses serde_json::from_slice() without explicit size limits. While PGMQ has implicit limits from PostgreSQL column sizes, a very large message could cause memory issues during deserialization.
Recommendation: Add configurable message size limits at the provider level.
Finding S-10 (LOW): File Path Exposure in Config Errors
src/services/security_service.rs:184-187 — Configuration errors include filesystem paths. Only occurs during startup (not exposed to API clients in normal operation).
Finding S-11 (LOW): Timestamp Conversion Silently Defaults to Epoch
src/proto/conversions.rs:32 — DateTime::from_timestamp().unwrap_or_default() silently converts invalid timestamps to 1970-01-01 instead of returning an error.
Finding S-12 (LOW): cargo-machete Ignore List Has 19 Entries
Cargo.toml:12-39 — Most are legitimately feature-gated or used via macros, but the list should be periodically audited to prevent dependency bloat.
Finding S-13 (LOW): Global Wildcard Permission Rejection Undocumented
src/types/permissions.rs — The permission_matches() function correctly rejects global wildcard (*) permissions but this behavior isn’t documented in user-facing comments.
Crate 2: tasker-pgmq
Overall Rating: B+ (Good with one high-priority fix needed)
The tasker-pgmq crate is a PGMQ wrapper providing PostgreSQL LISTEN/NOTIFY support for event-driven message processing. ~3,345 source lines across 9 files. No dependencies on tasker-shared (clean separation).
Strengths
- No unsafe code across the entire crate
- Payload uses parameterized queries: Message payloads bound via $1 parameter in NOTIFY
- Payload size validation: Enforces pg_notify 8KB limit
- Comprehensive thiserror error types with context preservation
- Bounded channels: All MPSC channels bounded
- Good test coverage: 6 integration test files covering major flows
- Clean separation from tasker-shared: No duplication, standalone library
Finding P-1 (HIGH): SQL Injection via NOTIFY Channel Name
Location: tasker-pgmq/src/emitter.rs:122
let sql = format!("NOTIFY {}, $1", channel);
sqlx::query(&sql).bind(payload).execute(&self.pool)
PostgreSQL’s NOTIFY does not support parameterized channel identifiers. The channel name is interpolated directly via format!. Channel names flow from config.build_channel_name() which concatenates channels_prefix (from TOML config) with base channel names and namespace strings.
Risk: While the NOTIFY command has limited injection surface (it’s not a general SQL execution vector), malformed channel names could cause PostgreSQL errors, unexpected channel routing, or denial of service. The channels_prefix comes from config (lower risk), but namespace strings flow from queue operations.
Recommendation: Add channel name validation — allow only [a-zA-Z0-9_.]+, max 63 chars (PostgreSQL identifier limit). Apply in build_channel_name() and/or notify_channel().
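A sketch of the suggested check (illustrative; the real code would return a PgmqNotifyError variant rather than a String):
// Illustrative: PostgreSQL identifiers are limited to 63 bytes, so reject
// anything longer or containing characters outside [a-zA-Z0-9_.].
fn validate_channel_name(channel: &str) -> Result<(), String> {
    let valid = !channel.is_empty()
        && channel.len() <= 63
        && channel
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '.');
    if valid {
        Ok(())
    } else {
        Err(format!("invalid NOTIFY channel name: {channel}"))
    }
}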
Finding P-2 (HIGH): CLI Migration SQL Generation with Unescaped Input
Location: tasker-pgmq/src/bin/cli.rs:179-353
User-provided regex patterns and channel prefixes are directly interpolated into SQL migration strings when generating migration files. While these are generated files that should be reviewed before application, the lack of escaping creates a risk if the generation process is automated.
Recommendation: Validate inputs against strict patterns before interpolation. Add a warning comment in generated files that they should be reviewed.
Finding P-3 (MEDIUM): unwrap_or_default() on Database Results (Tenet #11)
Location: tasker-pgmq/src/client.rs:164
.read_batch(queue_name, visibility_timeout, l).await?.unwrap_or_default()
When read_batch returns None, this silently produces an empty vector instead of failing loudly. Could mask permission errors, connection failures, or other serious issues.
Recommendation: Return explicit error on unexpected None.
Finding P-4 (MEDIUM): RwLock Poison Handling Masks Panics
Location: tasker-pgmq/src/listener.rs (22 instances)
self.stats.write().unwrap_or_else(|p| p.into_inner())
Silently recovers from poisoned RwLock without logging. Could propagate corrupted state from a panicked thread.
Recommendation: Log warning on poison recovery, or switch to parking_lot::RwLock (doesn’t poison).
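For the logging option, a small helper along these lines would make the recovery visible (sketch; the helper name is illustrative):
use std::sync::{RwLock, RwLockWriteGuard};

// Illustrative: recover from a poisoned lock, but leave a trace that a
// panicked thread may have left the stats in a partially updated state.
fn write_recovering<T>(lock: &RwLock<T>) -> RwLockWriteGuard<'_, T> {
    lock.write().unwrap_or_else(|poisoned| {
        tracing::warn!("stats lock poisoned by a panicked thread; recovering");
        poisoned.into_inner()
    })
}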
Finding P-5 (MEDIUM): Hardcoded Pool Size
Location: tasker-pgmq/src/client.rs:41-44
let pool = sqlx::postgres::PgPoolOptions::new()
.max_connections(20) // Hard-coded
.connect(database_url).await?;
Pool size should be configurable for different deployment scenarios.
Finding P-6 (MEDIUM): Missing Async Operation Timeouts
Database operations in client.rs, emitter.rs, and listener.rs lack explicit tokio::time::timeout() wrappers. Relies solely on pool-level acquire timeouts.
Finding P-7 (LOW): Error Context Loss in Regex Compilation
Location: tasker-pgmq/src/config.rs:169
Regex::new(&self.queue_naming_pattern)
.map_err(|_| PgmqNotifyError::invalid_pattern(&self.queue_naming_pattern))
Original regex error details discarded.
Finding P-8 (LOW): #[allow] Instead of #[expect] (Lint Policy)
Location: tasker-pgmq/src/emitter.rs:299-320 — 3 instances of #[allow(dead_code)] on EmitterFactory.
Crate 3: tasker-orchestration
Overall Rating: A- (Strong security with targeted resilience improvements needed)
The tasker-orchestration crate handles core orchestration logic: actors, state machines, REST + gRPC APIs, and auth middleware. This is the largest service crate and the primary attack surface.
Strengths
- Zero unsafe code across the entire crate
- Excellent auth architecture: Constant-time API key comparison, JWT algorithm allowlist, JWKS SSRF prevention, auth before body parsing
- gRPC/REST auth parity verified: All 6 gRPC task methods enforce identical permissions to REST counterparts
- No auth bypass found: All API v1 routes wrapped in authorize(), health/metrics public by design
- Database-level atomic claiming: FOR UPDATE SKIP LOCKED prevents concurrent state corruption
- State transitions enforce ownership: No API endpoint allows direct state manipulation
- Sanitized error responses: No stack traces, database errors genericized, consistent JSON format
- Backpressure checked before resource operations: 503 with Retry-After header
- Full bounded-channel compliance: All MPSC channels bounded and config-driven (0 unbounded channels)
- HTTP request timeout: TimeoutLayer with configurable 30s default
Finding O-1 (HIGH): No Actor Panic Recovery
Location: tasker-orchestration/src/actors/command_processor_actor.rs:139
Actors spawn via spawn_named! but have no supervisor/restart logic. If OrchestrationCommandProcessorActor panics, the entire orchestration processing stops. Recovery requires full process restart.
Recommendation: Implement panic-catching wrapper with logged restart, or document that process-level supervision (systemd, k8s) handles this.
Finding O-2 (HIGH): Graceful Shutdown Lacks Timeout
Locations:
- tasker-orchestration/src/orchestration/bootstrap.rs:177-213
- tasker-orchestration/src/bin/server.rs:68-82
Shutdown calls coordinator.lock().await.stop().await and orchestration_handle.stop().await with no timeout. If the event coordinator or actors hang during shutdown, the server never completes graceful shutdown.
Recommendation: Add 30-second timeout with force-kill fallback.
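A sketch of what the bounded shutdown could look like; the 30-second budget and force-exit policy follow the recommendation, and the names are illustrative:
use std::time::Duration;

// Illustrative: run the existing shutdown sequence under a deadline and fall
// back to a hard exit if it hangs.
async fn shutdown_with_deadline<F>(shutdown: F)
where
    F: std::future::Future<Output = ()>,
{
    if tokio::time::timeout(Duration::from_secs(30), shutdown)
        .await
        .is_err()
    {
        tracing::warn!("graceful shutdown exceeded 30s; forcing process exit");
        std::process::exit(1);
    }
}

// Usage (schematic):
// shutdown_with_deadline(async {
//     coordinator.lock().await.stop().await;
//     orchestration_handle.stop().await;
// }).await;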
Finding O-3 (HIGH): #[allow] Instead of #[expect] (Lint Policy)
21 instances of #[allow] found across the crate (most without reason = clause):
- src/actors/traits.rs:67,81
- src/web/extractors.rs:6
- src/health/channel_status.rs:87
- src/grpc/conversions.rs:42
- And 16 more locations
Finding O-4 (MEDIUM): Request Validation Not Enforced at Handler Layer
Location: src/web/handlers/tasks.rs:47
TaskRequest has #[derive(Validate)] with constraints (name length 1-255, namespace length 1-255, priority range -100 to 100) but handlers accept Json<TaskRequest> without calling .validate(). Validation happens later at the service layer.
Impact: Oversized payloads are deserialized before rejection. Not a security vulnerability per se, but the defense-in-depth pattern would catch malformed input earlier.
Recommendation: Add .validate() at handler entry or use Valid<Json<TaskRequest>> extractor.
Finding O-5 (MEDIUM): Actor Shutdown May Lose In-Flight Work
Location: tasker-orchestration/src/actors/registry.rs:216-259
Shutdown uses Arc::get_mut() which only works if no other references exist. If get_mut fails, stopped() is silently skipped. In-flight work may be lost.
Finding O-6 (MEDIUM): Database Query Timeouts Missing
Same pattern as tasker-shared (Finding S-8). Individual sqlx::query! calls lack explicit timeout wrappers:
- src/services/health/service.rs:284 - health check query
- src/orchestration/backoff_calculator.rs:232,245,290,345,368 - multiple queries
Pool-level acquire timeout (30s) provides partial mitigation.
Finding O-7 (MEDIUM): unwrap_or_default() on Config Fields
- src/orchestration/event_systems/unified_event_coordinator.rs:89 - event system config
- src/orchestration/bootstrap.rs:581 - namespace config
- src/grpc/services/config.rs:96-97 - jwt_issuer and jwt_audience default to empty strings
Finding O-8 (MEDIUM): Error Context Loss
~12 instances of .map_err(|_| ...) discarding error context:
- src/orchestration/bootstrap.rs:203 - oneshot send error
- src/web/handlers/health.rs:53 - timeout error
- src/web/handlers/tasks.rs:113 - UUID parse error
Finding O-9 (MEDIUM): Hardcoded Magic Numbers
- src/services/task_service.rs:257-259 - per_page > 100 validation
- src/orchestration/event_systems/orchestration_event_system.rs:142 - 24h max message age
- src/services/analytics_query_service.rs:229 - 30.0s slow step threshold
Finding O-10 (LOW): gRPC Internal Error May Leak Details
Location: src/grpc/conversions.rs:152-153
tonic::Status::internal(error.to_string()) — depending on error Display implementations, could expose implementation details in gRPC error messages.
Finding O-11 (LOW): CORS Allows Any Origin
Location: src/web/mod.rs
CorsLayer::new()
.allow_origin(tower_http::cors::Any)
.allow_methods(tower_http::cors::Any)
.allow_headers(tower_http::cors::Any)
Acceptable for alpha/API service, but should be configurable for production deployments.
Crate 4: tasker-worker
Overall Rating: A- (Strong FFI safety with one notable gap)
The tasker-worker crate handles handler dispatch, FFI integration, and completion processing. Despite complex FFI requirements, it achieves this with zero unsafe blocks in the crate itself.
Strengths
- Zero unsafe code despite handling Ruby/Python FFI integration
- All SQL queries via sqlx macros, no string interpolation
- Handler panic containment: catch_unwind() + AssertUnwindSafe wraps all handler calls
- Error classification preserved: Permanent/Retryable distinction maintained across FFI boundary
- Fire-and-forget callbacks: Spawned into runtime, 5s timeout, no deadlock risk
- FFI completion circuit breaker: Latency-based, 100ms threshold, lock-free metrics
- All MPSC channels bounded, full bounded-channel compliance
- No production unwrap()/expect() in core paths
Finding W-1 (HIGH): checkpoint_yield Blocks FFI Thread Without Timeout
Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs:904
let result = self.config.runtime_handle.block_on(async {
self.handle_checkpoint_yield_async(/* ... */).await
});
Uses block_on which blocks the Ruby/Python thread while persisting checkpoint data to the database. No timeout wrapper. If the database is slow, this blocks the FFI thread indefinitely, potentially exhausting the thread pool.
Recommendation: Add tokio::time::timeout() around the block_on body (configurable, suggest 10s default).
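A sketch of the bounded variant; the 10-second budget, the function shape, and the error mapping are assumptions:
use std::time::Duration;
use tokio::runtime::Handle;

// Illustrative: bound the blocking checkpoint write so a slow database cannot
// pin the calling Ruby/Python thread indefinitely.
fn persist_checkpoint_with_deadline<F, T, E>(runtime: &Handle, persist: F) -> Result<T, String>
where
    F: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    runtime.block_on(async {
        match tokio::time::timeout(Duration::from_secs(10), persist).await {
            Ok(Ok(value)) => Ok(value),
            Ok(Err(err)) => Err(format!("checkpoint persistence failed: {err}")),
            Err(_) => Err("checkpoint persistence timed out after 10s".to_string()),
        }
    })
}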
Finding W-2 (MEDIUM): Starvation Detection is Warning-Only
Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs:772-793
check_starvation_warnings() logs warnings but doesn’t enforce any action. Also requires manual invocation by the caller — no automatic monitoring loop.
Finding W-3 (MEDIUM): FFI Thread Safety Documentation Gap
The FfiDispatchChannel uses Arc<Mutex<mpsc::Receiver>> (thread-safe) but lacks documentation about thread-safety guarantees, poll() contention behavior, and block_on safety in FFI context.
Finding W-4 (MEDIUM): #[allow] vs #[expect] (Lint Policy)
5 instances in web/middleware/mod.rs and web/middleware/request_id.rs.
Finding W-5 (MEDIUM): Missing Database Query Timeouts
Same systemic pattern as other crates. Checkpoint service and step claim queries lack explicit timeout wrappers.
Finding W-6 (LOW): unwrap_or_default() in worker/core.rs
Several instances, appear to be for optional fields (likely legitimate), but warrants review.
Crates 5-6: tasker-client & tasker-cli
Overall Rating: A (Excellent — cleanest crates in the workspace)
These client crates demonstrate the strongest compliance across all audit dimensions. Notably, lint policy compliant (using #[expect] already). No Critical or High findings.
Strengths
- No unsafe code in either crate
- No hardcoded credentials: all auth from env vars or config files
- RSA key generation validates minimum 2048-bit keys
- Proper error context preservation in all From conversions
- Complete transport abstraction: REST and gRPC both implement 11/11 methods
- HTTP/gRPC timeouts configured: 30s request, 10s connect
- Exponential backoff retry for create_task with configurable max retries
- Lint policy compliant: uses #[expect] with reasons
- User-facing CLI errors informative without leaking internals
Finding C-1 (MEDIUM): TLS Certificate Validation Not Explicitly Enforced
Location: tasker-client/src/api_clients/orchestration_client.rs:220
HTTP client uses reqwest::Client::builder() without explicitly setting .danger_accept_invalid_certs(false). Default is secure, but explicit enforcement prevents accidental changes.
Finding C-2 (MEDIUM): Default URLs Use HTTP
Location: tasker-client/src/config.rs:276
Default base_url is http://localhost:8080. Credentials transmitted over HTTP are vulnerable to interception. Appropriate for local dev, but should warn when HTTP is used with authentication enabled.
Finding C-3 (MEDIUM): Retry Logic Only on create_task
Other operations (get_task, list_tasks, etc.) do not retry on transient failures. Should either extend retry logic or document the limitation.
Finding C-4 (LOW): Production expect() in Config Initialization
tasker-client/src/api_clients/orchestration_client.rs:123 — panics if config is malformed. Acceptable during startup but could return Result instead.
Crates 7-10: Language Workers (Rust, Ruby, Python, TypeScript)
Overall Rating: A- (Strong FFI engineering, no critical gaps)
All 4 language workers share common architecture via FfiDispatchChannel for poll-based event dispatch. Audited ~22,000 lines of Rust FFI code plus language wrappers.
Strengths
- TypeScript: Comprehensive panic handling, catch_unwind on all critical FFI functions, errors converted to JSON error responses
- Ruby/Python: Managed FFI via Magnus and PyO3; these frameworks handle panic unwinding automatically via their exception systems
- Error classification preserved across all FFI boundaries: Permanent/Retryable distinction maintained
- Fire-and-forget callbacks: No deadlock risk identified
- Starvation detection functional in all workers
- Proper Arc usage for thread-safe shared ownership across FFI
- TypeScript C FFI: Correct string memory management with into_raw()/from_raw() pattern and free_rust_string() for caller cleanup
- Checkpoint support uniformly implemented across all 4 workers
- Consistent error hierarchy across all languages
Finding LW-1 (MEDIUM): TypeScript FFI Missing Safety Documentation
Location: workers/typescript/src-rust/lib.rs:38
#![allow(clippy::missing_safety_doc)] — suppresses docs for 9 unsafe extern "C" functions. Should use #[expect] per lint policy and add # Safety sections.
Finding LW-2 (MEDIUM): Rust Worker #[allow(dead_code)] (Lint Policy)
Location: workers/rust/src/event_subscribers/logging_subscriber.rs:60,98,132
3 instances of #[allow(dead_code)] instead of #[expect].
Finding LW-3 (LOW): Ruby Bootstrap Uses expect() on Ruby Runtime
Location: workers/ruby/ext/tasker_core/src/bridge.rs:19-20, bootstrap.rs:29-30
Ruby::get().expect("Ruby runtime should be available") — safe in practice (guaranteed by Magnus FFI contract) but could use ? for defensive programming.
Finding LW-4 (LOW): Timeout Cleanup Requires Manual Polling
cleanup_timeouts() exists in all FFI workers but documentation doesn’t specify recommended polling frequency. Workers must call this periodically.
Finding LW-5 (LOW): Ruby Tokio Thread Pool Hardcoded to 8
Location: workers/ruby/ext/tasker_core/src/bootstrap.rs:74-79
Hardcoded .worker_threads(8) for M2/M4 Pro compatibility. Python/TypeScript use defaults. Consider making configurable.
Cross-Cutting Concerns
Dependency Audit (cargo audit)
Finding X-1 (HIGH): bytes v1.11.0 Integer Overflow (RUSTSEC-2026-0007)
Published 2026-02-03. Integer overflow in BytesMut::reserve. Fix: upgrade to bytes >= 1.11.1. This is a transitive dependency used by tokio, hyper, axum, tonic, reqwest, sqlx — deeply embedded.
Recommendation: Add to workspace Cargo.toml: bytes = "1.11.1"
Finding X-2 (LOW): rustls-pemfile Unmaintained (RUSTSEC-2025-0134)
Transitive from lapin (RabbitMQ) → amq-protocol → tcp-stream → rustls-pemfile. No action available from this project; depends on upstream lapin update.
Clippy Compliance
Zero warnings across entire workspace with --all-targets --all-features. Excellent.
Systemic: #[allow] vs #[expect] (Lint Policy)
27 instances of #[allow] found across all crates. Distribution:
- tasker-shared: ~5 instances
- tasker-pgmq: 3 instances
- tasker-orchestration: 21 instances (highest)
- tasker-worker: 5 instances
- tasker-client/cli: 0 (compliant)
- Language workers: ~3 instances
Recommendation: Batch fix in a single PR — mechanical replacement of #[allow] → #[expect] with reason strings.
Systemic: Database Query Timeouts
Found across tasker-shared, tasker-orchestration, tasker-worker, and tasker-pgmq. Individual sqlx::query! calls lack explicit tokio::time::timeout() wrappers. Pool-level acquire timeouts (30s) provide partial mitigation.
Recommendation: Consider PostgreSQL statement_timeout at the connection level as a blanket fix, or add tokio::time::timeout() around critical query paths.
Systemic: unwrap_or_default() on Required Fields (Tenet #11)
Found across tasker-shared (20+ instances), tasker-orchestration (3 instances), tasker-pgmq (1 instance). Silent failures on required fields violate the Fail Loudly principle.
Recommendation: Audit all instances and replace with explicit error handling for required fields.
Appendix: Methodology
Each crate was evaluated across these dimensions:
- Security — Input validation, SQL safety, auth checks, unsafe blocks, crypto, secrets
- Error Handling — Fail Loudly (Tenet #11), context preservation, structured errors
- Resilience — Bounded channels, timeouts, circuit breakers, backpressure
- Architecture — API surface, documentation consistency, test coverage, dead code
- FFI-Specific (language workers) — Error classification, deadlock risk, starvation detection, memory safety
Severity definitions follow the audit specification.
Appendix: Remediation Tracking
Remediation work items for all High-severity findings:
| Work Item | Findings | Priority | Summary |
|---|---|---|---|
| Dependency upgrade | X-1 | Urgent | Upgrade bytes to fix RUSTSEC-2026-0007 CVE |
| Queue name validation | S-1, P-1, P-2 | High | Add queue name and NOTIFY channel validation |
| Lint compliance cleanup | S-3, O-3, W-4, LW-1, LW-2, P-8 | Medium | Replace #[allow] with #[expect] workspace-wide |
| Shutdown and recovery hardening | O-1, O-2 | High | Add shutdown timeout and actor panic recovery |
| FFI checkpoint timeout | W-1 | High | Add timeout to checkpoint_yield block_on |
| Error message sanitization | S-2 | High | Sanitize database error messages in API responses |
Architecture Decision Records (ADRs)
This directory contains Architecture Decision Records that document significant design decisions in Tasker Core. Each ADR captures the context, decision, and consequences of a specific architectural choice.
ADR Index
Active Decisions
| ADR | Title | Date | Status |
|---|---|---|---|
| ADR-001 | Actor-Based Orchestration Architecture | 2025-10 | Accepted |
| ADR-002 | Bounded MPSC Channels | 2025-10 | Accepted |
| ADR-003 | Processor UUID Ownership Removal | 2025-10 | Accepted |
| ADR-004 | Backoff Strategy Consolidation | 2025-10 | Accepted |
| ADR-005 | Worker Dual-Channel Event System | 2025-12 | Accepted |
| ADR-006 | Worker Actor-Service Decomposition | 2025-12 | Accepted |
| ADR-007 | FFI Over WASM for Language Workers | 2025-12 | Accepted |
| ADR-008 | Handler Composition Pattern | 2025-12 | Accepted |
Root Cause Analyses
| Document | Title | Date |
|---|---|---|
| RCA | Parallel Execution Timing Bugs | 2025-12 |
ADR Template
When creating a new ADR, use this template:
# ADR: [Title]
**Status**: [Proposed | Accepted | Deprecated | Superseded]
**Date**: YYYY-MM-DD
**Ticket**: TAS-XXX
## Context
What is the issue that we're seeing that is motivating this decision or change?
## Decision
What is the change that we're proposing and/or doing?
## Consequences
What becomes easier or more difficult to do because of this change?
### Positive
- Benefit 1
- Benefit 2
### Negative
- Trade-off 1
- Trade-off 2
### Neutral
- Side effect that is neither positive nor negative
## Alternatives Considered
What other options were considered and why were they rejected?
### Alternative 1: [Name]
Description and why it was rejected.
### Alternative 2: [Name]
Description and why it was rejected.
## References
- Related documents
- External references
When to Create an ADR
Create an ADR when:
- Making a significant architectural change that affects multiple components
- Choosing between alternatives with meaningful trade-offs
- Establishing a pattern that should be followed consistently
- Removing or deprecating an existing pattern or approach
- Learning from an incident (RCA format)
Don’t create an ADR for:
- Minor implementation details
- Bug fixes without architectural impact
- Documentation updates
- Routine refactoring
Related Documentation
- Tasker Core Tenets - Core design principles
ADR: Actor-Based Orchestration Architecture
Status: Accepted Date: 2025-10 Ticket: TAS-46
Context
The orchestration system used a command pattern with direct service delegation, but lacked formal boundaries between commands and lifecycle components. This created:
- Testing Complexity: Lifecycle components tightly coupled to command processor
- Unclear Boundaries: No formal interface between commands and lifecycle operations
- Limited Supervision: No standardized lifecycle hooks for resource management
- Inconsistent Patterns: Each component had different initialization patterns
- Coupling: Command processor had direct dependencies on multiple service instances
The command processor was 1,164 lines, mixing routing, hydration, validation, and delegation.
Decision
Adopt a lightweight actor pattern with message-based interfaces:
Core Abstractions:
- OrchestrationActor trait with lifecycle hooks (started(), stopped())
- Message trait for type-safe messages with associated Response type
- Handler<M> trait for async message processing
- ActorRegistry for centralized actor management
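A schematic of these abstractions; the signatures are illustrative sketches, not the exact crate API:
use async_trait::async_trait; // assumed helper crate for async trait methods

#[async_trait]
trait OrchestrationActor: Send + Sync {
    async fn started(&self) {}
    async fn stopped(&self) {}
}

trait Message: Send + 'static {
    type Response: Send + 'static;
}

#[async_trait]
trait Handler<M: Message>: OrchestrationActor {
    async fn handle(&self, message: M) -> M::Response;
}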
Four Orchestration Actors:
- TaskRequestActor: Task initialization and request processing
- ResultProcessorActor: Step result processing
- StepEnqueuerActor: Step enqueueing coordination
- TaskFinalizerActor: Task finalization with atomic claiming
Implementation Approach:
- Greenfield migration (no dual support)
- Actors wrap existing services, not replace them
- Arc-wrapped actors for efficient cloning across threads
- No full actor framework (keeping it lightweight)
Consequences
Positive
- 92% reduction in command processor complexity (1,575 LOC → 123 LOC main file)
- Clear boundaries: Each actor handles specific message types
- Testability: Message-based interfaces enable isolated testing
- Consistent patterns: Established migration pattern for all actors
- Lifecycle management: Standardized `started()`/`stopped()` hooks
- Thread safety: Arc-wrapped actors with Send+Sync guarantees
Negative
- Additional abstraction: One more layer between commands and services
- Learning curve: New pattern to understand
- Message overhead: ~100-500ns per actor call (acceptable for our use case)
- Not a full framework: Lacks supervision trees, mailboxes, etc.
Neutral
- Services remain unchanged; actors are thin wrappers
- Performance impact minimal (<1μs per operation)
Alternatives Considered
Alternative 1: Full Actor Framework (Actix)
Would provide supervision, mailboxes, and advanced patterns.
Rejected: Too heavyweight for our needs. We need lifecycle hooks and message-based testing, not a full distributed actor system.
Alternative 2: Keep Direct Service Delegation
Continue with command processor calling services directly.
Rejected: Doesn’t address testing complexity, unclear boundaries, or lifecycle management needs.
Alternative 3: Trait-Based Service Abstraction
Define Service trait and implement on each lifecycle component.
Partially adopted: Combined with actor pattern. Services implement business logic; actors provide message-based coordination.
References
- See the actor pattern implementation in `tasker-orchestration/`
- Actors Architecture - Actor pattern documentation
- Events and Commands - Integration context
ADR: Bounded MPSC Channel Migration
Status: Implemented Date: 2025-10-14 Decision Makers: Engineering Team Ticket: TAS-51
Context and Problem Statement
Prior to this change, the tasker-core system had inconsistent and risky MPSC channel usage:
- Unbounded Channels (3 critical sites): Risk of unbounded memory growth under load
  - PGMQ notification listener: Could exhaust memory during notification bursts
  - Event publisher: Vulnerable to event storms
  - Ruby FFI handler: No backpressure across FFI boundary
- Configuration Disconnect (6 sites): TOML configuration existed but wasn't used
  - Hard-coded values (100, 1000) with no rationale
  - Test/dev/prod environments used identical capacities
  - No ability to tune without code changes
- No Backpressure Strategy: Missing overflow handling policies
  - No monitoring of channel saturation
  - No documented behavior when channels fill
  - No metrics for operational visibility
Production Impact
- Memory Risk: OOM possible under high database notification load (10k+ tasks enqueued)
- Operational Pain: Cannot tune channel sizes without code deployment
- Environment Mismatch: Test environment uses production-scale buffers, masking issues
- Technical Debt: Wasted configuration infrastructure
Decision
Migrate to 100% bounded, configuration-driven MPSC channels with explicit backpressure handling.
Key Principles
- All Channels Bounded: Zero `unbounded_channel()` calls in production code
- Configuration-Driven: All capacities from TOML with environment overrides
- Separation of Concerns: Infrastructure (sizing) separate from business logic (retry behavior)
- Explicit Backpressure: Document and implement overflow policies
- Full Observability: Metrics for channel saturation and overflows
Solution Architecture
Configuration Structure
Created unified MPSC channel configuration in config/tasker/base/mpsc_channels.toml:
[mpsc_channels]
# Orchestration subsystem
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 1000
[mpsc_channels.orchestration.event_listeners]
pgmq_event_buffer_size = 10000 # Large for notification bursts
# Task readiness subsystem
[mpsc_channels.task_readiness.event_channel]
buffer_size = 1000
send_timeout_ms = 1000
# Worker subsystem
[mpsc_channels.worker.command_processor]
command_buffer_size = 1000
[mpsc_channels.worker.in_process_events]
broadcast_buffer_size = 1000 # Rust → Ruby FFI
# Shared/cross-cutting
[mpsc_channels.shared.event_publisher]
event_queue_buffer_size = 5000
[mpsc_channels.shared.ffi]
ruby_event_buffer_size = 1000
# Overflow policy
[mpsc_channels.overflow_policy]
log_warning_threshold = 0.8 # Warn at 80% full
drop_policy = "block"
Environment-Specific Overrides
Production (config/tasker/environments/production/mpsc_channels.toml):
- Orchestration command: 5000 (5x base)
- PGMQ listeners: 50000 (5x base) - handles bulk task creation bursts
- Event publisher: 10000 (2x base)
Development (config/tasker/environments/development/mpsc_channels.toml):
- Task readiness: 500 (0.5x base)
- Worker FFI: 500 (0.5x base)
Test (config/tasker/environments/test/mpsc_channels.toml):
- Orchestration command: 100 (0.1x base) - exposes backpressure issues
- Task readiness: 100 (0.1x base)
Critical Implementation Detail
Environment override files MUST use full [mpsc_channels.*] prefix:
# ✅ CORRECT
[mpsc_channels.task_readiness.event_channel]
buffer_size = 100
# ❌ WRONG - creates top-level key that overrides correct config
[task_readiness.event_channel]
buffer_size = 100
This was discovered during implementation when environment files created conflicting top-level configuration keys.
Configuration Migration
Migrated MPSC sizing fields from event_systems.toml to mpsc_channels.toml:
Moved to mpsc_channels.toml:
- `event_systems.task_readiness.metadata.event_channel.buffer_size`
- `event_systems.task_readiness.metadata.event_channel.send_timeout_ms`
- `event_systems.worker.metadata.in_process_events.broadcast_buffer_size`
Kept in event_systems.toml (event processing logic):
- `event_systems.task_readiness.metadata.event_channel.max_retries`
- `event_systems.task_readiness.metadata.event_channel.backoff`
Rationale: Separation of concerns - infrastructure sizing vs business logic behavior.
Rust Type System
Created comprehensive type system in tasker-shared/src/config/mpsc_channels.rs:
pub struct MpscChannelsConfig {
pub orchestration: OrchestrationChannelsConfig,
pub task_readiness: TaskReadinessChannelsConfig,
pub worker: WorkerChannelsConfig,
pub shared: SharedChannelsConfig,
pub overflow_policy: OverflowPolicyConfig,
}
All channel creation sites updated to use configuration:
// Before
let (tx, rx) = mpsc::unbounded_channel();
// After
let buffer_size = config.mpsc_channels
.orchestration.event_listeners.pgmq_event_buffer_size;
let (tx, rx) = mpsc::channel(buffer_size);
Observability
ChannelMonitor Integration:
- Tracks channel usage in real-time
- Logs warnings at 80% saturation
- Exposes metrics via OpenTelemetry
Metrics Available:
- `mpsc_channel_usage_percent` - Current channel fill percentage
- `mpsc_channel_capacity` - Configured capacity
- Component and channel name labels for filtering
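As a rough illustration of how such a gauge can be derived from Tokio's bounded sender (the actual ChannelMonitor implementation may differ), consider:

use tokio::sync::mpsc;

// Fill percentage for a bounded channel: configured capacity minus free slots.
fn channel_usage_percent<T>(tx: &mpsc::Sender<T>) -> f64 {
    let max = tx.max_capacity() as f64;
    let free = tx.capacity() as f64;
    (max - free) / max * 100.0
}

fn warn_if_saturated<T>(tx: &mpsc::Sender<T>, component: &str, channel: &str) {
    let usage = channel_usage_percent(tx);
    if usage >= 80.0 {
        // Stands in for structured logging / metric emission in the real system.
        eprintln!("WARN {component}.{channel} at {usage:.0}% capacity");
    }
}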
Consequences
Positive
- Memory Safety: Bounded channels prevent OOM from unbounded growth
- Operational Flexibility: Tune channel sizes via configuration without code changes
- Environment Appropriateness: Test uses small buffers (exposes issues), production uses large buffers (handles load)
- Observability: Channel saturation visible in metrics and logs
- Documentation: Clear guidelines for future channel additions
Negative
- Backpressure Complexity: Must handle full channel conditions
- Configuration Overhead: More configuration files to maintain
- Tuning Required: May need adjustment based on production load patterns
Neutral
- No Performance Impact: Bounded channels with appropriate sizes perform identically to unbounded
- Backward Compatible: Existing deployments automatically use new defaults
Implementation Notes
Backpressure Strategies by Component
PGMQ Notification Listener:
- Strategy: Block sender (apply backpressure)
- Rationale: Cannot drop database notifications
- Buffer: Large (10000 base, 50000 production) to handle bursts
Event Publisher:
- Strategy: Drop events with metrics when full
- Rationale: Internal events are non-critical
- Buffer: Medium (5000 base, 10000 production)
Ruby FFI Handler:
- Strategy: Return error to Rust (signal backpressure)
- Rationale: Ruby must handle gracefully
- Buffer: Standard (1000) with monitoring
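A minimal sketch of what these three overflow policies look like against Tokio's mpsc API (illustrative only; the real components wrap this in their own error and metrics types):

use tokio::sync::mpsc::{self, error::TrySendError};

async fn overflow_policies(tx: &mpsc::Sender<String>) -> Result<(), String> {
    // 1. Block sender (notification listener): wait for capacity rather than drop.
    tx.send("notification".into()).await.map_err(|e| e.to_string())?;

    // 2. Drop with metrics (event publisher): fail fast and count the drop.
    if let Err(TrySendError::Full(_)) = tx.try_send("event".into()) {
        // increment a dropped-events counter here
    }

    // 3. Signal backpressure (FFI handler): surface the error to the caller.
    tx.try_send("ffi event".into())
        .map_err(|_| "channel full: caller should retry".to_string())?;
    Ok(())
}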
Sizing Guidelines
Command Channels (orchestration, worker):
- Base: 1000
- Test: 100 (expose issues)
- Production: 2000-5000 (concurrent load)
Event Channels:
- Base: 1000
- Production: Higher if event-driven architecture
Notification Channels:
- Base: 10000 (burst handling)
- Production: 50000 (bulk operations)
Validation
Testing Performed
- Unit Tests: Configuration loading and validation ✅
- Integration Tests: All tests pass with bounded channels ✅
- Local Verification: Service starts successfully in test environment ✅
- Configuration Verification: All environments load correctly ✅
Success Criteria Met
- ✅ Zero unbounded channels in production code
- ✅ 100% configurable channel capacities
- ✅ Environment-specific overrides working
- ✅ Backpressure handling implemented
- ✅ Observability through ChannelMonitor
- ✅ All tests passing
Future Considerations
- Dynamic Sizing: Consider runtime buffer adjustment based on load (not current scope)
- Priority Queues: Allow critical events to bypass overflow drops (evaluate based on metrics)
- Notification Coalescing: Reduce PGMQ notification volume during bursts (future optimization)
- Advanced Metrics: Percentile latencies for channel send operations
References
- Configuration Files: `config/tasker/base/mpsc_channels.toml`
- Rust Module: `tasker-shared/src/config/mpsc_channels.rs`
- Related ADRs: Command Pattern, Actor Pattern
Lessons Learned
- Configuration Structure Matters: Environment override files must use proper prefixes
- Separation of Concerns: Keep infrastructure config (sizing) separate from business logic (retry behavior)
- Test Appropriately: Small buffers in test environment expose backpressure issues early
- Migration Strategy: Moving config fields requires coordinated struct updates across all files
- Type Safety: Rust’s type system caught many configuration mismatches during development
Decision: Approved and Implemented Review Date: 2025-10-14 Next Review: 2026-Q1 (evaluate sizing based on production metrics)
ADR: Processor UUID Ownership Removal
Status: Accepted Date: 2025-10 Ticket: TAS-54
Context
When orchestrators crash with tasks in active processing states (Initializing, EnqueuingSteps, EvaluatingResults), the processor UUID ownership enforcement prevented new orchestrators from taking over. Tasks became permanently stuck until manual intervention.
Root Cause: Three states required ownership enforcement (the original state machine pattern), but when orchestrator A crashed and orchestrator B tried to recover, the ownership check failed: B != A.
Production Impact:
- Stuck tasks requiring manual intervention
- Orchestrator restarts caused task processing to halt
- 15-second gap between crash and retry, but tasks permanently blocked
Decision
Move to audit-only processor UUID tracking:
- Keep processor UUID in all transitions (audit trail for debugging)
- Remove ownership enforcement from state transitions
- Rely on existing state machine guards for idempotency
- Add configuration flag for gradual rollout
Key Insight: The original problem (race conditions) had been solved by multiple other mechanisms:
- Atomic finalization claiming via SQL functions
- Command pattern with stateless async processors
- Actor pattern with 4 production-ready actors
Idempotency Without Ownership
| Actor | Idempotency Mechanism | Race Condition Protection |
|---|---|---|
| TaskRequestActor | identity_hash unique constraint | Transaction atomicity |
| ResultProcessorActor | Current state guards | State machine atomicity |
| StepEnqueuerActor | SQL function atomicity | PGMQ transactional operations |
| TaskFinalizerActor | Atomic claiming | SQL compare-and-swap |
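To make the "atomic claiming" row concrete, the sketch below shows a compare-and-swap style claim through sqlx. The table and column names are assumptions for illustration, not the actual schema, and it assumes sqlx's uuid support.

use sqlx::PgPool;
use uuid::Uuid;

// Only one orchestrator's UPDATE matches the WHERE clause, so a crashed
// orchestrator's work can be picked up without ownership enforcement.
async fn try_claim_finalization(pool: &PgPool, task_uuid: Uuid, processor_uuid: Uuid) -> Result<bool, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE tasker_tasks
         SET finalization_claimed_by = $2
         WHERE task_uuid = $1 AND finalization_claimed_by IS NULL",
    )
    .bind(task_uuid)
    .bind(processor_uuid)
    .execute(pool)
    .await?;
    Ok(result.rows_affected() == 1)
}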
Consequences
Positive
- Task recovery: Tasks automatically recover after orchestrator crashes
- Zero manual interventions: Stuck task count approaches zero
- Audit trail preserved: Full debugging capability retained
- Instant rollback: Configuration flag allows quick revert
Negative
- New debugging patterns: Processor ownership changes visible in audit trail
- Team training: Operators need to understand audit-only interpretation
Neutral
- No database schema changes required
- No performance impact (one fewer query per transition)
Alternatives Considered
Alternative 1: Timeout-Based Ownership Transfer
Add timeout after which ownership can be claimed by another processor.
Rejected: Adds complexity; the existing idempotency guards already make ownership enforcement redundant.
Alternative 2: Keep Ownership Enforcement
Continue with existing ownership enforcement behavior, add manual recovery tools.
Rejected: Doesn’t address root cause; manual intervention doesn’t scale.
References
- Defense in Depth - Multi-layer protection philosophy
- Idempotency and Atomicity - Defense layer documentation
ADR: Backoff Logic Consolidation
Status: Implemented Date: 2025-10-29 Deciders: Engineering Team Ticket: TAS-57
Context
The tasker-core distributed workflow orchestration system had multiple, potentially conflicting implementations of exponential backoff logic for step retry coordination. This created several critical issues:
Problems Identified
- Configuration Conflicts: Three different maximum backoff values existed across the system:
  - SQL Migration (hardcoded): 30 seconds
  - Rust Code Default: 60 seconds
  - TOML Configuration: 300 seconds
- Race Conditions: No atomic guarantees on backoff updates when multiple orchestrators processed the same step failure simultaneously, leading to potential lost updates and inconsistent state.
- Implementation Divergence: Dual calculation paths (Rust BackoffCalculator vs SQL fallback) could produce different results due to:
  - Different time sources (`last_attempted_at` vs `failure_time`)
  - Hardcoded vs configurable parameters
  - Lack of timestamp synchronization
- Hardcoded SQL Values: The SQL migration contained non-configurable exponential backoff logic:

  -- Old hardcoded implementation
  power(2, COALESCE(attempts, 1)) * interval '1 second', interval '30 seconds'
Decision
We consolidated the backoff logic with the following architectural decisions:
1. Single Source of Truth: TOML Configuration
Decision: All backoff parameters originate from TOML configuration files.
Rationale:
- Centralized configuration management
- Environment-specific overrides (test/development/production)
- Runtime validation and type safety
- Clear documentation of system behavior
Implementation:
# config/tasker/base/orchestration.toml
[backoff]
default_backoff_seconds = [1, 2, 4, 8, 16, 32]
max_backoff_seconds = 60 # Standard across all environments
backoff_multiplier = 2.0
jitter_enabled = true
jitter_max_percentage = 0.1
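For illustration, the calculation these parameters drive looks roughly like the following. This is a sketch, not the actual BackoffCalculator code, and it assumes the rand crate.

use rand::Rng;

// Exponential backoff capped at max_seconds, with up to jitter_max_pct extra delay.
fn backoff_delay_seconds(attempt: u32, multiplier: f64, max_seconds: f64, jitter_max_pct: f64) -> f64 {
    let base = multiplier.powi(attempt as i32).min(max_seconds);
    let jitter = rand::thread_rng().gen_range(0.0..base * jitter_max_pct);
    (base + jitter).min(max_seconds)
}

// e.g. backoff_delay_seconds(6, 2.0, 60.0, 0.1) caps at 60s even though 2^6 = 64.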
2. Standard Maximum Backoff: 60 Seconds
Decision: Standardize on 60 seconds as the maximum backoff delay.
Rationale:
- Balance: 60 seconds balances retry speed with system load reduction
- Not Too Short: 30 seconds (old SQL max) insufficient for rate limiting scenarios
- Not Too Long: 300 seconds (old TOML config) creates excessive delays in failure scenarios
- Alignment: Matches Rust code defaults and production requirements
Impact:
- Tasks recover faster from transient failures
- Rate-limited APIs get adequate cooldown
- User experience improved with reasonable retry times
3. Parameterized SQL Functions
Decision: SQL functions accept configuration parameters with sensible defaults.
Implementation:
CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
backoff_request_seconds INTEGER,
last_attempted_at TIMESTAMP,
failure_time TIMESTAMP,
attempts INTEGER,
p_max_backoff_seconds INTEGER DEFAULT 60,
p_backoff_multiplier NUMERIC DEFAULT 2.0
) RETURNS TIMESTAMP
Rationale:
- Eliminates hardcoded values in SQL
- Allows runtime configuration without schema changes
- Maintains SQL fallback safety net
- Defaults prevent breaking existing code
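A hedged sketch of invoking the parameterized function from Rust, with the extra arguments supplied from TOML config. It assumes sqlx with the chrono feature; the production call site may look different.

use chrono::NaiveDateTime;
use sqlx::PgPool;

async fn next_retry_time(
    pool: &PgPool,
    backoff_request_seconds: Option<i32>,
    last_attempted_at: Option<NaiveDateTime>,
    failure_time: Option<NaiveDateTime>,
    attempts: i32,
    max_backoff_seconds: i32,   // from [backoff] max_backoff_seconds
    backoff_multiplier: f64,    // from [backoff] backoff_multiplier
) -> Result<Option<NaiveDateTime>, sqlx::Error> {
    // $6 is cast to NUMERIC to match the function's p_backoff_multiplier parameter.
    sqlx::query_scalar(
        "SELECT calculate_step_next_retry_time($1, $2, $3, $4, $5, $6::numeric)",
    )
    .bind(backoff_request_seconds)
    .bind(last_attempted_at)
    .bind(failure_time)
    .bind(attempts)
    .bind(max_backoff_seconds)
    .bind(backoff_multiplier)
    .fetch_one(pool)
    .await
}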
4. Atomic Backoff Updates with Row-Level Locking
Decision: Use PostgreSQL SELECT FOR UPDATE for atomic backoff updates.
Implementation:
// Rust BackoffCalculator
async fn update_backoff_atomic(&self, step_uuid: &Uuid, delay_seconds: u32) -> Result<(), sqlx::Error> {
    let mut tx = self.pool.begin().await?;
    // Acquire row-level lock
    sqlx::query!("SELECT ... FROM tasker_workflow_steps WHERE ... FOR UPDATE")
        .fetch_one(&mut *tx).await?;
    // Update with lock held
    sqlx::query!("UPDATE tasker_workflow_steps SET ...")
        .execute(&mut *tx).await?;
    tx.commit().await?;
    Ok(())
}
Rationale:
- Correctness: Prevents lost updates from concurrent orchestrators
- Simplicity: PostgreSQL’s row-level locking is well-understood and reliable
- Performance: Minimal overhead - locks only held during UPDATE operation
- Idempotency: Multiple retries produce consistent results
Alternative Considered: Optimistic concurrency with version field
- Rejected: More complex implementation, retry logic in application layer
- Benefit of Chosen Approach: Database guarantees atomicity
5. Timing Consistency: Update last_attempted_at with backoff_request_seconds
Decision: Always update both backoff_request_seconds and last_attempted_at atomically.
Rationale:
- SQL fallback calculation:
last_attempted_at + backoff_request_seconds - Prevents timing window where calculation uses stale timestamp
- Single transaction ensures consistency
Before:
// Old: Only updated backoff_request_seconds
sqlx::query!("UPDATE tasker_workflow_steps SET backoff_request_seconds = $1 ...")
After:
// New: Updates both atomically
sqlx::query!(
"UPDATE tasker_workflow_steps
SET backoff_request_seconds = $1,
last_attempted_at = NOW()
WHERE ..."
)
6. Dual-Path Strategy: Rust Primary, SQL Fallback
Decision: Maintain both Rust calculation and SQL fallback, but ensure they use same configuration.
Rationale:
- Rust Primary: Fast, configurable, with jitter support
- SQL Fallback: Safety net if `backoff_request_seconds` is NULL
- Consistency: Both paths now use the same max delay and multiplier
Path Selection Logic:
CASE
-- Primary: Rust-calculated backoff
WHEN backoff_request_seconds IS NOT NULL AND last_attempted_at IS NOT NULL THEN
last_attempted_at + (backoff_request_seconds * interval '1 second')
-- Fallback: SQL exponential with configurable params
WHEN failure_time IS NOT NULL THEN
failure_time + LEAST(
power(p_backoff_multiplier, attempts) * interval '1 second',
p_max_backoff_seconds * interval '1 second'
)
ELSE NULL
END
Consequences
Positive
- Configuration Clarity: Single max_backoff_seconds value (60s) across entire system
- Race Condition Prevention: Atomic updates guarantee correctness in distributed deployments
- Flexibility: Parameterized SQL allows future config changes without migrations
- Timing Consistency: Synchronized timestamp updates eliminate calculation errors
- Maintainability: Clear separation of concerns - Rust for calculation, SQL for fallback
- Test Coverage: All 518 unit tests pass, validating correctness
Negative
- Performance Overhead: Row-level locking adds ~1-2ms per backoff update
  - Mitigation: Negligible compared to step execution time (typically seconds)
  - Acceptable Trade-off: Correctness more important than microseconds
- Lock Contention Risk: High-frequency failures on same step could cause lock queuing
  - Mitigation: Exponential backoff naturally spreads out retries
  - Monitoring: Added metrics for lock contention detection
  - Real-World Impact: Minimal - failures are infrequent by design
- Complexity: Transaction management adds code complexity
  - Mitigation: Encapsulated in `update_backoff_atomic()` method
  - Benefit: Hidden behind clean interface, testable in isolation
Neutral
- Breaking Change: SQL function signature changed (added parameters)
  - Not an Issue: Greenfield alpha project, no production dependencies
  - Future-Proof: Default parameters maintain backward compatibility
- Configuration Migration: Changed max from 300s → 60s
  - Impact: Tasks retry faster, reducing user-perceived latency
  - Validation: All tests pass with new values
Validation
Testing
- Unit Tests: All 518 unit tests pass
  - BackoffCalculator calculation correctness
  - Jitter bounds validation
  - Max cap enforcement
- Database Tests: SQL function behavior validated
  - Parameterization with various max values
  - Exponential calculation matches Rust
  - Boundary conditions (attempts 0, 10, 20)
- Integration Tests: End-to-end flow verified
  - Worker failure → Backoff applied → Readiness respects delay
  - SQL fallback when backoff_request_seconds NULL
  - Rust and SQL calculations produce consistent results
Verification Steps Completed
- ✅ Configuration alignment (TOML, Rust defaults)
- ✅ SQL function rewrite with parameters
- ✅ BackoffCalculator atomic updates implemented
- ✅ Database reset successful with new migration
- ✅ All unit tests passing
- ✅ Architecture documentation updated
Implementation Notes
Files Modified
- Configuration:
  - `config/tasker/base/orchestration.toml`: max_backoff_seconds = 60
  - `tasker-shared/src/config/tasker.rs`: jitter_max_percentage = 0.1
- Database Migration:
  - `migrations/20250927000000_add_waiting_for_retry_state.sql`: Parameterized functions
- Rust Implementation:
  - `tasker-orchestration/src/orchestration/backoff_calculator.rs`: Atomic updates
- Documentation:
  - `docs/task-and-step-readiness-and-execution.md`: Backoff section added
  - This ADR
Migration Path
Since this is greenfield alpha:
- Drop and recreate test database
- Run migrations with updated SQL functions
- Rebuild sqlx cache
- Run full test suite
Future Production Path (when needed):
- Deploy parameterized SQL functions alongside old functions
- Update Rust code to use new atomic methods
- Enable in staging, monitor metrics
- Gradual production rollout with feature flag
- Remove old functions after validation period
Future Enhancements
Potential Improvements (Post-Alpha)
- Configuration Table: Store backoff config in database for runtime updates
- Metrics: OpenTelemetry metrics for backoff application and lock contention
- Adaptive Backoff: Adjust multiplier based on system load or error patterns
- Per-Namespace Policies: Different backoff configs per workflow namespace
- Backoff Profiles: Named profiles (aggressive, moderate, conservative)
Monitoring Recommendations
Key Metrics to Track:
- `backoff_calculation_duration_seconds`: Time to calculate and apply backoff
- `backoff_lock_contention_total`: Lock acquisition failures
- `backoff_sql_fallback_total`: Frequency of SQL fallback usage
- `backoff_delay_applied_seconds`: Histogram of actual delays
Alert Conditions:
- SQL fallback usage > 5% (indicates Rust path failing)
- Lock contention > threshold (indicates hot spots)
- Backoff delays > max_backoff_seconds (configuration issue)
References
- Task and Step Readiness Documentation
- States and Lifecycles Documentation
- BackoffCalculator Implementation
- SQL Migration 20250927000000
Related ADRs
- Ownership Removal - Concurrent access patterns
Decision Status: ✅ Implemented and Validated (2025-10-29)
ADR: Worker Dual-Channel Event System
Status: Accepted Date: 2025-12 Ticket: TAS-67
Context
The original Rust worker used a blocking .call() pattern in the event handler:
let result = handler.call(&event.payload.task_sequence_step).await; // BLOCKS
This created effectively sequential execution even for independent steps, preventing true concurrency and causing domain event race conditions where downstream systems saw events before orchestration processed results.
Decision
Adopt a dual-channel command pattern where handler invocation is fire-and-forget, and completions flow back through a separate channel.
Architecture:
[1] WorkerEventSystem receives StepExecutionEvent
↓
[2] ActorCommandProcessor routes to StepExecutorActor
↓
[3] StepExecutorActor claims step, publishes to HANDLER DISPATCH CHANNEL
↓ (fire-and-forget, non-blocking)
[4] HandlerDispatchService receives from channel
↓
[5] Resolves handler from registry, invokes handler.call()
↓
[6] Handler completes, publishes to COMPLETION CHANNEL
↓
[7] CompletionProcessorService receives from channel
↓
[8] Routes to FFICompletionService → Orchestration queue
Key Design Decisions:
- Bounded Parallel Execution: Semaphore-bounded concurrency (configurable via TOML)
- Ordered Domain Events: Events fire AFTER result is committed to completion channel
- Comprehensive Error Handling: Panics, timeouts, handler errors all generate proper failure results
- Fire-and-Forget FFI Callbacks: `runtime_handle.spawn()` instead of `block_on()` prevents deadlocks
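The essence of the pattern, stripped of Tasker's actual types, might be sketched like this. The channel payloads and the handler call are placeholders, not the real tasker-worker API.

use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

struct DispatchMsg { step_id: u64 }
struct CompletionMsg { step_id: u64, success: bool }

async fn handler_dispatch_loop(
    mut dispatch_rx: mpsc::Receiver<DispatchMsg>,
    completion_tx: mpsc::Sender<CompletionMsg>,
    concurrency: Arc<Semaphore>,   // bounded parallelism, sized from TOML
) {
    while let Some(msg) = dispatch_rx.recv().await {
        let permit = concurrency.clone().acquire_owned().await.expect("semaphore closed");
        let completions = completion_tx.clone();
        // Fire-and-forget: the dispatch loop never blocks on handler execution.
        tokio::spawn(async move {
            let success = run_handler(msg.step_id).await;  // stands in for handler.call()
            drop(permit);                                   // release permit before send
            let _ = completions.send(CompletionMsg { step_id: msg.step_id, success }).await;
        });
    }
}

async fn run_handler(_step_id: u64) -> bool { true }        // placeholder handler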
Consequences
Positive
- True parallelism: Parallel handler execution with bounded concurrency
- Eliminated race conditions: Domain events only fire after results committed
- Comprehensive error handling: All failure modes produce proper step failures
- Foundation for FFI: Reusable abstractions for Ruby/Python/TypeScript workers
- Bug discovery: Parallel execution surfaced latent SQL precedence bug
Negative
- Increased complexity: Two channels to manage instead of one
- Debugging complexity: Tracing flow across multiple channels requires structured logging
Neutral
- Channel saturation monitoring available via metrics
- Configurable buffer sizes per environment
Risk Mitigations Implemented
| Risk | Mitigation |
|---|---|
| Semaphore acquisition failure | Generate failure result instead of silent exit |
| FFI polling starvation | Metrics + starvation warnings + timeout |
| Completion channel backpressure | Release permit before send |
| FFI thread runtime context | Fire-and-forget callbacks |
Alternatives Considered
Alternative 1: Thread Pool Pattern
Use dedicated thread pool for handler execution.
Rejected: Tokio already provides excellent async runtime; adding threads increases complexity without benefit.
Alternative 2: Single Channel with Priority Queue
Priority queue for completions within single channel.
Rejected: Doesn’t address the fundamental blocking issue; still couples dispatch and completion.
Alternative 3: Keep Blocking Pattern with Larger Buffer
Increase buffer size to mask sequential execution.
Rejected: Doesn’t solve concurrency; just delays the problem.
References
- Worker Event Systems - Architecture documentation
- RCA: Parallel Execution Timing Bugs - Bug discovered during implementation
- FFI Callback Safety - FFI patterns established
ADR: Worker Actor-Service Decomposition
Status: Accepted Date: 2025-12 Ticket: TAS-69
Context
The tasker-worker crate had a monolithic command processor architecture:
- `WorkerProcessor`: 1,575 lines of code
- All command handling inline
- Difficult to test individual behaviors
- Inconsistent with orchestration actor architecture
Decision
Transform the worker from monolithic command processor to actor-based design, mirroring the orchestration actor pattern.
Before: Monolithic Design
WorkerCore
└── WorkerProcessor (1,575 LOC)
└── All command handling inline
After: Actor-Based Design
WorkerCore
└── ActorCommandProcessor (~350 LOC)
└── WorkerActorRegistry
├── StepExecutorActor → StepExecutorService
├── FFICompletionActor → FFICompletionService
├── TemplateCacheActor → TaskTemplateManager
├── DomainEventActor → DomainEventSystem
└── WorkerStatusActor → WorkerStatusService
Five Actors:
| Actor | Responsibility | Messages |
|---|---|---|
| StepExecutorActor | Step execution coordination | 4 |
| FFICompletionActor | FFI completion handling | 2 |
| TemplateCacheActor | Template cache management | 2 |
| DomainEventActor | Event dispatching | 1 |
| WorkerStatusActor | Status and health | 4 |
Three Services:
| Service | Lines | Purpose |
|---|---|---|
| StepExecutorService | ~400 | Step claiming, verification, FFI invocation |
| FFICompletionService | ~200 | Result delivery to orchestration |
| WorkerStatusService | ~200 | Stats tracking, health reporting |
Consequences
Positive
- 92% reduction in command processor complexity (1,575 LOC → 123 LOC main file)
- Single responsibility: Each file handles one concern
- Testability: Services testable in isolation, actors via message handlers
- Consistency: Mirrors orchestration architecture
- Extensibility: New actors/services follow established pattern
Negative
- Two-phase initialization: Registry requires careful startup ordering
- Actor shutdown ordering: Must coordinate graceful shutdown
- Learning curve: New pattern to understand for contributors
Neutral
- Public API unchanged (`WorkerCore::new()`, `send_command()`, `stop()`)
- Internal restructuring transparent to users
Gaps Identified and Fixed
| Gap | Issue | Fix |
|---|---|---|
| Domain Event Dispatch | Events not dispatched after step completion | Explicit dispatch call in actor |
| Silent Error Handling | Orchestration send errors swallowed | Explicit error propagation |
| Namespace Sharing | Registry created new manager, losing namespaces | Shared pre-initialized manager |
Alternatives Considered
Alternative 1: Service-Only Pattern
Extract services without actor layer.
Rejected: Loses message-based interfaces that enable testing and future distributed execution.
Alternative 2: Keep Monolithic with Better Organization
Refactor WorkerProcessor into methods without extraction.
Rejected: Doesn’t address testability or architectural consistency goals.
Alternative 3: Full Actor Framework (Actix)
Use production actor framework.
Rejected: Too heavyweight; we need lifecycle hooks and message-based testing, not distributed supervision.
References
- Worker Actors - Architecture documentation
- Actor Pattern - Orchestration actor precedent
ADR: FFI Over WASM for Language Workers
Status: Accepted Date: 2025-12 Ticket: TAS-100
Context
For the TypeScript worker implementation, we needed to decide between two integration approaches:
- FFI (Foreign Function Interface): Direct C ABI calls to compiled Rust library
- WASM (WebAssembly): Compile Rust to wasm32-wasi target
Ruby (Magnus) and Python (PyO3) workers already used FFI successfully.
Decision
Proceed with FFI for all language workers. Reserve WASM for future serverless handler execution.
Decision Matrix:
| Criteria | FFI | WASM |
|---|---|---|
| Pattern Consistency | Matches Ruby/Python | Requires new architecture |
| Production Readiness | Node FFI mature, Bun stabilizing | WASI networking immature |
| Implementation Speed | 2-3 weeks | 2-3 months + unknowns |
| PostgreSQL Access | Native via Rust | Needs host functions |
| Multi-threading | Full Tokio support | Single-threaded WASM |
| Async Runtime | Tokio works | Incompatible |
| Debugging | Standard tools | Limited tooling |
Score: FFI 8/10, WASM 3/10 for current requirements.
WASM Deal-Breakers:
- No mature PostgreSQL client for `wasm32-wasi`
- Single-threaded execution (our `HandlerDispatchService` relies on Tokio multi-threading)
- Tokio doesn't compile to the `wasm32-wasi` target
- WASI networking still experimental (Preview 2 adoption low)
Consequences
Positive
- Pattern consistency: Single Rust codebase serves all four workers
- Proven approach: Ruby/Python FFI already validated
- Full feature access: PostgreSQL, PGMQ, Tokio, domain events all work
- Standard debugging: lldb, gdb, structured logging across boundary
- Fast implementation: Estimated 2-3 weeks for TypeScript worker
Negative
- FFI safety concerns: Incorrect types can cause segfaults
- Platform builds: Must distribute `.dylib`/`.so`/`.dll` per platform
- Runtime compatibility: Different FFI semantics between Bun and Node
Neutral
- Bun FFI experimental but fast-stabilizing
- Pre-built binaries via GitHub releases address distribution
Future Vision
WASM Research: Revisit when WASI 0.3+ stabilizes with networking.
Serverless WASM Handlers:
- Compile individual handlers to WASM (not orchestration)
- Deploy to serverless platforms (AWS Lambda, Cloudflare Workers)
- Cold start optimization (1ms vs 100ms)
- Extreme scalability for compute-heavy workflows
Separation of Concerns:
- Orchestration: Stays Rust (PostgreSQL, PGMQ, state machines)
- Handlers: Optionally WASM (stateless compute units)
Alternatives Considered
Alternative 1: WASM with Host Functions
Implement database operations as host functions.
Rejected: Defeats the purpose - logic split between WASM and host, loses Rust benefits.
Alternative 2: Wait for WASI 0.3
Delay TypeScript worker until WASI matures.
Rejected: Timeline uncertain (6+ months); FFI works today.
Alternative 3: Spin Framework
Use Spin’s WASM abstraction layer.
Rejected: Framework lock-in; requires Spin APIs, can’t reuse Axum/Tower patterns.
References
- Cross-Language Consistency - API philosophy
- Workers Documentation - Language-specific implementation guides
ADR: Handler Composition Pattern
Status: Accepted Date: 2025-12 Ticket: TAS-112
Context
Cross-language step handler ergonomics research revealed an architectural inconsistency:
- Batchable handlers: Already use composition via mixins (target pattern)
- API handlers: Use inheritance (subclass pattern)
- Decision handlers: Use inheritance (subclass pattern)
Current State:
✅ Batchable: class Handler(StepHandler, Batchable) # Composition
❌ API: class Handler < APIHandler # Inheritance
❌ Decision: class Handler extends DecisionHandler # Inheritance
Guiding Principle (Zen of Python): “There should be one– and preferably only one –obvious way to do it.”
Decision
Migrate all handler patterns to composition (mixins/traits), using batchable as the reference implementation.
Target Architecture:
All patterns use composition:
Ruby: include Base, include API, include Decision, include Batchable
Python: class Handler(StepHandler, API, Decision, Batchable)
TypeScript: interface composition + mixins
Rust: trait composition (impl Base + API + Decision + Batchable)
Benefits:
- Single responsibility - each mixin handles one concern
- Flexible composition - handlers can mix capabilities as needed
- Easier testing - can test each capability independently
- Matches batchable pattern (already proven successful)
Example Migration:
# Old pattern (deprecated)
class MyHandler < TaskerCore::StepHandler::API
def call(context)
api_success(data)
end
end
# New pattern
class MyHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::Mixins::API
def call(context)
api_success(data)
end
end
Consequences
Positive
- Consistent architecture: One pattern for all handler capabilities
- Composable capabilities: Mix API + Decision + Batchable as needed
- Testable in isolation: Each mixin can be tested independently
- Matches proven pattern: Batchable already validates approach
- Cross-language alignment: Same mental model in all languages
Negative
- Breaking change: All existing handlers need migration
- Learning curve: Contributors must understand mixin pattern
- Migration effort: All examples and documentation need updates
Neutral
- Pre-alpha status means breaking changes are acceptable
- Migration can be phased with deprecation warnings
Related Decisions
Ruby Result Unification
Ruby uses separate Success/Error classes while Python/TypeScript use unified result with success flag. Recommend unifying Ruby to match.
Rust Handler Traits
Rust needs ergonomic traits for API, Decision, and Batchable capabilities to match other languages:
pub trait APICapable {
fn api_success(&self, data: Value, status: u16) -> StepExecutionResult;
fn api_failure(&self, message: &str, status: u16) -> StepExecutionResult;
}
pub trait DecisionCapable {
fn decision_success(&self, step_names: Vec<String>) -> StepExecutionResult;
fn skip_branches(&self, reason: &str) -> StepExecutionResult;
}
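Assuming traits along those lines, composition in Rust mirrors the mixin pattern in the other languages. The snippet below is a self-contained illustration with simplified stand-in types so it compiles on its own; it is not the real tasker-worker API.

// Simplified stand-ins for the capability traits sketched above.
#[derive(Debug)]
pub struct StepExecutionResult { pub success: bool, pub detail: String }

pub trait APICapable {
    fn api_success(&self, detail: &str) -> StepExecutionResult {
        StepExecutionResult { success: true, detail: detail.to_string() }
    }
}

pub trait DecisionCapable {
    fn decision_success(&self, step_names: Vec<String>) -> StepExecutionResult {
        StepExecutionResult { success: true, detail: format!("route to {step_names:?}") }
    }
}

// One handler opts into both capabilities, mirroring `include` in Ruby.
pub struct RoutePaymentHandler;
impl APICapable for RoutePaymentHandler {}
impl DecisionCapable for RoutePaymentHandler {}

fn route(handler: &RoutePaymentHandler) -> StepExecutionResult {
    let _api_result = handler.api_success("payment captured");
    handler.decision_success(vec!["send_receipt".to_string()])
}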
FFI Boundary Types
Data structures crossing FFI boundaries must have identical serialization. Create explicit type mirrors in all languages:
- `DecisionPointOutcome`
- `BatchProcessingOutcome`
- `CursorConfig`
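As an illustration of what "explicit type mirrors" means in practice, a boundary type can pin its wire format with serde so every language serializes it identically. The field names below are hypothetical, not the real DecisionPointOutcome definition.

use serde::{Deserialize, Serialize};

// Stable, explicit field names: the Ruby/Python/TypeScript mirrors must match exactly.
#[derive(Debug, Serialize, Deserialize)]
pub struct DecisionPointOutcome {
    pub decision: String,
    pub next_steps: Vec<String>,
}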
Alternatives Considered
Alternative 1: Keep Inheritance Pattern
Continue with subclass pattern for API and Decision.
Rejected: Inconsistent with batchable; makes multi-capability handlers awkward.
Alternative 2: Migrate Batchable to Inheritance
Make batchable use inheritance to match others.
Rejected: Batchable composition is the better pattern; others should follow it.
Alternative 3: Language-Specific Patterns
Let each language use its idiomatic pattern.
Rejected: Violates cross-language consistency principle; increases cognitive load.
References
- Composition Over Inheritance - Principle documentation
- Cross-Language Consistency - API philosophy
- API Convergence Matrix - Cross-language API reference
RCA: Parallel Execution Exposing Latent Timing Bugs
Date: 2025-12-07
Related: Worker Dual-Channel Event System
Status: Resolved
Impact: Flaky E2E test test_mixed_workflow_scenario
Executive Summary
During the dual-channel event system implementation (fire-and-forget handler dispatch), a previously hidden bug in the SQL function get_task_execution_context() became consistently reproducible. The bug was a logical precedence error that had always existed but was masked by sequential execution timing. Introducing true parallelism changed the probability distribution of state combinations, transforming a Heisenbug into a Bohrbug.
This document captures the root cause analysis as a reference for understanding how architectural changes to concurrency can surface latent bugs in distributed systems.
The Bug
Symptom
Test test_mixed_workflow_scenario intermittently failed with timeout waiting for BlockedByFailures status, while the API returned HasReadySteps.
⏳ Waiting for task to fail (max 10s)...
Task execution status: processing (processing)
Task execution status: has_ready_steps (has_ready_steps) ← Wrong!
Task execution status: has_ready_steps (has_ready_steps)
... timeout ...
Root Cause
The SQL function get_task_execution_context() checked ready_steps > 0 BEFORE permanently_blocked_steps > 0:
-- BUGGY: Wrong precedence order
CASE
WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps' -- ← Checked FIRST
WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures'
...
END as execution_status
When a task had BOTH permanently blocked steps AND ready steps, the function returned has_ready_steps instead of blocked_by_failures.
The Fix
Migration 20251207000000_fix_execution_status_priority.sql corrects the precedence:
-- FIXED: blocked_by_failures takes semantic priority
CASE
WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures' -- ← Now FIRST
WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps'
...
END as execution_status
Why Did This Surface Now?
The Test Scenario
# 3 parallel steps with NO dependencies (can all run concurrently)
steps:
- name: success_step
retryable: false
- name: permanent_error_step
retryable: false # Fails permanently
- name: retryable_error_step
retryable: true
max_attempts: 2 # Fails, but becomes "ready" after backoff
Before: Blocking Handler Dispatch
The original architecture used blocking .call() in the event handler:
// workers/rust/src/event_handler.rs (before)
let result = handler.call(&step).await; // BLOCKS until handler completes
This created effectively sequential execution even for independent steps:
Timeline (Sequential):
────────────────────────────────────────────────────────────────────
t=0ms [success_step starts]
t=50ms [success_step completes]
t=51ms [permanent_error_step starts]
t=100ms [permanent_error_step fails → PERMANENTLY BLOCKED]
t=101ms [retryable_error_step starts]
t=150ms [retryable_error_step fails → enters 100ms backoff]
t=151ms ──► STATUS CHECK
permanently_blocked_steps = 1
ready_steps = 0 (still in backoff!)
──► Returns: blocked_by_failures ✓
The backoff hadn't elapsed yet because steps were processed one at a time.
After: Fire-and-Forget Handler Dispatch
The dual-channel event system introduced non-blocking dispatch via channels:
// Fire-and-forget pattern
dispatch_sender.send(DispatchHandlerMessage { step, ... }).await;
// Returns immediately - handler executes in separate task
This enables true parallel execution:
Timeline (Parallel):
────────────────────────────────────────────────────────────────────
t=0ms [success_step starts]──────────────────►[completes t=50ms]
t=0ms [permanent_error_step starts]──────────►[fails t=50ms → BLOCKED]
t=0ms [retryable_error_step starts]──────────►[fails t=50ms → backoff]
t=150ms [retryable_error_step backoff expires → becomes READY]
t=151ms ──► STATUS CHECK
permanently_blocked_steps = 1
ready_steps = 1 (backoff elapsed!)
──► Returns: has_ready_steps ✗ (BUG!)
Probability Analysis
The “Both States” Window
The bug manifests when checking status while the task has BOTH:
- At least one permanently blocked step
- At least one ready step (e.g., retryable step after backoff)
Sequential Processing:
├────────────────────────────────────────────────────────────────────┤
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│ Very LOW probability of "both states" window │
│ Steps complete serially; backoff rarely overlaps with status check │
└────────────────────────────────────────────────────────────────────┘
Parallel Processing:
├────────────────────────────────────────────────────────────────────┤
│░░░░░░░░░░░░████████████████████████████████████████████░░░░░░░░░░░│
│ ↑ ↑ │
│ │ HIGH probability "both states" window │ │
│ │ All steps complete ~simultaneously │ │
│ │ Backoff expires while status is polled │ │
└────────────────────────────────────────────────────────────────────┘
Quantifying the Change
| Metric | Sequential | Parallel |
|---|---|---|
| Step completion spread | ~150ms | ~50ms |
| “Both states” window duration | ~0ms (transient) | ~100ms+ (stable) |
| Probability of hitting bug | <1% | >50% |
| Bug classification | Heisenbug | Bohrbug |
Bug Classification
Heisenbug → Bohrbug Transformation
| Property | Before (Heisenbug) | After (Bohrbug) |
|---|---|---|
| Reproducibility | Intermittent, timing-dependent | Consistent, deterministic |
| Root cause | Logical precedence error | Same |
| Visibility | Hidden by sequential timing | Exposed by parallel timing |
| Debug difficulty | Extremely hard (may never reproduce) | Straightforward |
| Detection in CI | Might pass for months | Fails consistently under load |
Why This Matters
- The bug was always present - It existed in the SQL function since it was written
- Sequential execution hid it - Incidental timing prevented the problematic state
- Parallelization surfaced it - Not by introducing a bug, but by applying concurrency pressure
- This is good - Better to find in tests than production
Semantic Correctness
The Correct Mental Model
“If ANY step is permanently blocked, the task cannot make further progress toward completion, even if other steps are ready to execute.”
A task with permanent failures is blocked by failures regardless of what else might be runnable. The old code implicitly assumed:
“If work is available, we’re making progress”
This is incorrect for workflows where:
- Convergence points require ALL branches to complete
- Final task status depends on ALL steps succeeding
- Partial progress doesn’t constitute overall success
State Precedence (Correct Order)
-- 1. Permanent failures block overall progress
WHEN permanently_blocked_steps > 0 THEN 'blocked_by_failures'
-- 2. Ready work can continue (but may not lead to completion)
WHEN ready_steps > 0 THEN 'has_ready_steps'
-- 3. Work in flight
WHEN in_progress_steps > 0 THEN 'processing'
-- 4. All done
WHEN completed_steps = total_steps THEN 'all_complete'
-- 5. Waiting for dependencies
ELSE 'waiting_for_dependencies'
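The same precedence reads naturally as an ordered match in application code; here is a small illustrative mirror (enum and counts are simplified, not the actual tasker-shared types):

#[derive(Debug, PartialEq)]
enum ExecutionStatus { BlockedByFailures, HasReadySteps, Processing, AllComplete, WaitingForDependencies }

// Order matters: permanent blockage outranks ready work (the bug was the reverse).
fn execution_status(blocked: u32, ready: u32, in_progress: u32, completed: u32, total: u32) -> ExecutionStatus {
    if blocked > 0 {
        ExecutionStatus::BlockedByFailures
    } else if ready > 0 {
        ExecutionStatus::HasReadySteps
    } else if in_progress > 0 {
        ExecutionStatus::Processing
    } else if completed == total {
        ExecutionStatus::AllComplete
    } else {
        ExecutionStatus::WaitingForDependencies
    }
}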
Patterns to Watch For
1. State Combination Explosions
Sequential processing often means only one state at a time. Parallelism creates state combinations that were previously impossible:
Sequential: A → B → C (states are mutually exclusive in time)
Parallel: A + B + C (states can coexist)
Watch for: CASE statements, if/else chains, and state machines that assume mutual exclusivity.
2. Timing-Dependent Invariants
Code may accidentally depend on timing:
// Assumes step_a completes before step_b starts
if step_a.is_complete() {
// Safe to check step_b
}
Watch for: Implicit ordering assumptions in status calculations, rollups, and aggregations.
3. Transient vs Stable States
Some states were transient under sequential processing but become stable under parallel:
| State | Sequential | Parallel |
|---|---|---|
| “1 complete, 1 in-progress” | Transient (~ms) | Stable (seconds) |
| “blocked + ready” | Nearly impossible | Common |
| “multiple errors” | Rare | Frequent |
Watch for: Error handling, status rollups, and progress calculations that assumed single-state scenarios.
4. Test Timing Sensitivity
Tests written for sequential execution may have implicit timing dependencies:
// This worked when steps were sequential
wait_for_status(BlockedByFailures, timeout: 10s);
// But fails when parallel execution creates a different status first
Watch for: Tests that pass in isolation but fail under concurrent load.
Verification Strategy
After Parallelization Changes
- Run tests multiple times - Timing bugs may not manifest on first run
- Run tests under load - Concurrent test execution increases probability
- Add explicit state combination tests - Test scenarios that were previously impossible
- Review CASE/if-else precedence - Check all status calculations for correct ordering
Example: Testing State Combinations
#[tokio::test]
async fn test_blocked_with_ready_steps() {
// Explicitly create the state combination
let task = create_task_with_parallel_steps();
// Force one step to permanent failure
force_step_to_permanent_failure(&task, "step_a").await;
// Force another step to ready (after backoff)
force_step_to_ready_after_backoff(&task, "step_b").await;
// Verify correct precedence
let status = get_task_execution_status(&task).await;
assert_eq!(status, ExecutionStatus::BlockedByFailures);
}
Conclusion
This bug exemplifies how architectural improvements to concurrency can surface latent correctness issues. The parallelization didn’t introduce a bug—it revealed one that had been hidden by incidental sequential timing.
This is a positive outcome: the bug was found in testing rather than production. The fix ensures correct semantic precedence regardless of execution timing, making the system more robust under parallel load.
Key Takeaways
- Parallelization is a stress test - It exposes timing-dependent bugs
- Sequential execution hides bugs - Incidental ordering masks logical errors
- State precedence matters - Review all status calculations when adding concurrency
- Heisenbugs become Bohrbugs - Parallel execution makes rare bugs reproducible
- This is good engineering - Finding bugs through architectural improvements validates the testing strategy
References
- Migration: `migrations/20251207000000_fix_execution_status_priority.sql`
- Test: `tests/e2e/ruby/error_scenarios_test.rs::test_mixed_workflow_scenario`
- SQL Function: `get_task_execution_context()` in `migrations/20251001000000_fix_permanently_blocked_detection.sql`
- Dual-Channel Event System ADR
Tasker Core Benchmarks
Last Updated: 2026-01-23 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Observability | Deployment Patterns
This directory contains documentation for all performance benchmarks in the tasker-core workspace.
Quick Reference
# E2E benchmarks (cluster mode, all tiers)
cargo make setup-env-all-cluster
cargo make cluster-start-all
set -a && source .env && set +a && cargo bench --bench e2e_latency
cargo make bench-report # Percentile JSON
cargo make bench-analysis # Markdown analysis
cargo make cluster-stop
# Component benchmarks (requires Docker services)
docker-compose -f docker/docker-compose.test.yml up -d
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo bench --package tasker-client --features benchmarks # API benchmarks
cargo bench --package tasker-shared --features benchmarks # SQL + Event benchmarks
Benchmark Categories
1. End-to-End Latency (tests/benches)
Location: tests/benches/e2e_latency.rs
Documentation: e2e-benchmarks.md
Measures complete workflow execution from API call through orchestration, message queue, worker execution, result processing, and dependency resolution — across all distributed components in a 10-instance cluster.
| Tier | Benchmark | Steps | Parallelism | P50 | Target (p99) |
|---|---|---|---|---|---|
| 1 | Linear Rust | 4 sequential | none | 255-258ms | < 500ms |
| 1 | Diamond Rust | 4 (2 parallel) | 2-way | 200-259ms | < 500ms |
| 2 | Complex DAG | 7 (mixed) | 2+3-way | 382ms | < 800ms |
| 2 | Hierarchical Tree | 8 (4 parallel) | 4-way | 389-426ms | < 800ms |
| 2 | Conditional | 5 (3 executed) | dynamic | 251-262ms | < 500ms |
| 3 | Cluster single task | 4 sequential | none | 261ms | < 500ms |
| 3 | Cluster concurrent 2x | 4+4 | distributed | 332-384ms | < 800ms |
| 4 | FFI linear (Ruby/Python/TS) | 4 sequential | none | 312-316ms | < 800ms |
| 4 | FFI diamond (Ruby/Python/TS) | 4 (2 parallel) | 2-way | 260-275ms | < 800ms |
| 5 | Batch 1000 rows | 7 (5 parallel) | 5-way | 358-368ms | < 1000ms |
Each step involves ~19 database operations, 2 message queue round-trips, 4+ state transitions, and dependency graph evaluation. See e2e-benchmarks.md for the detailed per-step lifecycle.
Key Characteristics:
- FFI overhead: ~23% vs native Rust (all languages within 3ms of each other)
- Linear patterns: highly reproducible (<2% variance between runs)
- Parallel patterns: environment-sensitive (I/O contention affects parallelism)
- Batch processing: 2,700-2,800 rows/second with tight P95/P50 ratios
Run Commands:
cargo make bench-e2e # Tier 1: Rust core
cargo make bench-e2e-full # Tier 1+2: + complexity
cargo make bench-e2e-cluster # Tier 3: Multi-instance
cargo make bench-e2e-languages # Tier 4: FFI comparison
cargo make bench-e2e-batch # Tier 5: Batch processing
cargo make bench-e2e-all # All tiers
2. API Performance (tasker-client)
Location: tasker-client/benches/task_initialization.rs
Measures orchestration API response times for task creation (HTTP round-trip + DB insert + step initialization).
| Benchmark | Target | Current | Status |
|---|---|---|---|
| Linear task init | < 50ms | 17.7ms | 2.8x better |
| Diamond task init | < 75ms | 20.8ms | 3.6x better |
cargo bench --package tasker-client --features benchmarks
3. SQL Function Performance (tasker-shared)
Location: tasker-shared/benches/sql_functions.rs
Measures critical PostgreSQL function performance for orchestration polling.
| Function | Target | Current (5K tasks) | Status |
|---|---|---|---|
| get_next_ready_tasks | < 3ms | 1.75-2.93ms | Pass |
| get_step_readiness_status | < 1ms | 440-603us | Pass |
| get_task_execution_context | < 1ms | 380-460us | Pass |
DATABASE_URL="..." cargo bench --package tasker-shared --features benchmarks sql_functions
4. Event Propagation (tasker-shared)
Location: tasker-shared/benches/event_propagation.rs
Measures PostgreSQL LISTEN/NOTIFY round-trip latency for real-time coordination.
| Metric | Target (p95) | Current | Status |
|---|---|---|---|
| Notify round-trip | < 10ms | 14.1ms | Slightly above, p99 < 20ms |
DATABASE_URL="..." cargo bench --package tasker-shared --features benchmarks event_propagation
Performance Targets
System-Wide Goals
| Category | Metric | Target | Rationale |
|---|---|---|---|
| API Latency | p99 | < 100ms | User-facing responsiveness |
| SQL Functions | mean | < 3ms | Orchestration polling efficiency |
| Event Propagation | p95 | < 10ms | Real-time coordination overhead |
| E2E Linear (4 steps) | p99 | < 500ms | End-user task completion |
| E2E Complex (7-8 steps) | p99 | < 800ms | Complex workflow completion |
| E2E Batch (1000 rows) | p99 | < 1000ms | Bulk operation completion |
Scaling Targets
| Dataset Size | get_next_ready_tasks | Notes |
|---|---|---|
| 1K tasks | < 2ms | Initial implementation |
| 5K tasks | < 3ms | Current verified |
| 10K tasks | < 5ms | Target |
| 100K tasks | < 10ms | Production scale |
Cluster Topology (E2E Benchmarks)
| Service | Instances | Ports | Build |
|---|---|---|---|
| Orchestration | 2 | 8080, 8081 | Release |
| Rust Worker | 2 | 8100, 8101 | Release |
| Ruby Worker | 2 | 8200, 8201 | Release extension |
| Python Worker | 2 | 8300, 8301 | Maturin develop |
| TypeScript Worker | 2 | 8400, 8401 | Bun FFI |
Deployment Mode: Hybrid (event-driven with polling fallback) Database: PostgreSQL (with PGMQ extension available) Messaging: RabbitMQ (via MessagingService provider abstraction; PGMQ also supported) Sample Size: 50 per benchmark
Running Benchmarks
E2E Benchmarks (Full Suite)
# 1. Setup cluster environment
cargo make setup-env-all-cluster
# 2. Start 10-instance cluster
cargo make cluster-start-all
# 3. Verify cluster health
cargo make cluster-status
# 4. Run benchmarks
set -a && source .env && set +a && cargo bench --bench e2e_latency
# 5. Generate reports
cargo make bench-report # → target/criterion/percentile_report.json
cargo make bench-analysis # → tmp/benchmark-results/benchmark-results.md
# 6. Stop cluster
cargo make cluster-stop
Component Benchmarks
# Start database
docker-compose -f docker/docker-compose.test.yml up -d
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
# Run individual suites
cargo bench --package tasker-client --features benchmarks # API
cargo bench --package tasker-shared --features benchmarks # SQL + Events
# Run all at once
cargo bench --all-features
Baseline Comparison
# Save current performance as baseline
cargo bench --all-features -- --save-baseline main
# After changes, compare
cargo bench --all-features -- --baseline main
# View report
open target/criterion/report/index.html
Interpreting Results
Stable Metrics (Reliable for Regression Detection)
These metrics show <2% variance between runs:
- Linear pattern P50 (sequential execution baseline)
- FFI linear P50 (framework overhead measurement)
- Single task in cluster (cluster overhead measurement)
- Batch P50 (parallel I/O throughput)
Environment-Sensitive Metrics
These metrics vary 10-30% depending on system load:
- Diamond pattern P50 (parallelism benefit depends on I/O capacity)
- Concurrent 2x (scheduling contention varies)
- Hierarchical tree (deep dependency chains amplify I/O latency)
Key Ratios (Always Valid)
- FFI overhead %: ~23% for all languages (framework-dominated)
- P95/P50 ratio: 1.01-1.12 (execution stability indicator)
- Cluster vs single overhead: <3ms (negligible cluster tax)
- FFI language spread: <3ms (language runtime is not the bottleneck)
Design Principles
Natural Measurement
Benchmarks measure real system behavior without artificial test harnesses:
- API benchmarks hit actual HTTP endpoints
- SQL benchmarks use real database with realistic data volumes
- E2E benchmarks execute complete workflows through all distributed components
Distributed System Focus
All benchmarks account for distributed system characteristics:
- Network latency included (HTTP, PostgreSQL, message queues)
- Database transaction timing considered
- Message queue delivery overhead measured
- Worker coordination and scheduling included
Load-Based Validation
Benchmarks serve dual purpose:
- Performance measurement: Track regressions and improvements
- Load testing: Expose race conditions and timing bugs
E2E benchmark warmup has historically discovered critical race conditions that manual testing never revealed.
Statistical Rigor
- 50 samples per benchmark for P50/P95 validity
- Criterion framework with statistical regression detection
- Multiple independent runs recommended for absolute comparisons
- Relative metrics (ratios, overhead %) preferred over absolute milliseconds
Troubleshooting
“Services must be running”
cargo make cluster-status # Check cluster health
cargo make cluster-start-all # Restart cluster
Tier 3/4 benchmarks skipped
# Ensure cluster env is configured (not single-service)
cargo make setup-env-all-cluster # Generates .env with cluster URLs
High variance between runs
- Close resource-intensive applications (browsers, IDEs)
- Ensure machine is plugged in (not throttling)
- Focus on stable metrics (linear P50, FFI overhead %) for comparisons
- Run benchmarks twice and compare for reproducibility
Benchmark takes too long
# Reduce sample size (default: 50)
cargo bench -- --sample-size 10
# Run single tier
cargo make bench-e2e # Only Tier 1
CI Integration
# Example: .github/workflows/benchmarks.yml
name: Performance Benchmarks
on:
  pull_request:
    paths:
      - 'tasker-*/src/**'
      - 'migrations/**'
jobs:
  benchmark:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: ghcr.io/pgmq/pg18-pgmq:v1.8.1
        env:
          POSTGRES_DB: tasker_rust_test
          POSTGRES_USER: tasker
          POSTGRES_PASSWORD: tasker
    steps:
      - uses: actions/checkout@v3
      - run: cargo bench --all-features -- --save-baseline pr
      - uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'criterion'
          output-file-path: target/criterion/report/index.html
Criterion automatically detects performance regressions with statistical comparison to baselines and alerts on >5% slowdowns.
Contributing
When adding new benchmarks:
- Follow naming convention: <tier>_<category>/<group>/<scenario>
- Include targets: Document expected performance in this README
- Add fixture: Create workflow template YAML in tests/fixtures/task_templates/
- Document shape: Update e2e-benchmarks.md with topology
- Consider variance: Account for distributed system characteristics
- Use 50 samples: Minimum for P50/P95 statistical validity
Benchmark Template
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use std::time::Duration;

fn bench_my_scenario(c: &mut Criterion) {
    let mut group = c.benchmark_group("e2e_my_tier");
    group.sample_size(50);                              // minimum for P50/P95 validity
    group.measurement_time(Duration::from_secs(30));
    group.bench_function(BenchmarkId::new("workflow", "my_scenario"), |b| {
        b.iter(|| {
            // runtime, client, namespace, handler, context, and timeout come from
            // the benchmark's setup code (see the existing suites for the helpers).
            runtime.block_on(async {
                execute_benchmark_scenario(&client, namespace, handler, context, timeout).await
            })
        });
    });
    group.finish();
}

criterion_group!(benches, bench_my_scenario);
criterion_main!(benches);
E2E Benchmark Scenarios: Workflow Shapes and Per-Step Lifecycle
Last Updated: 2026-01-23 · Audience: Architects, Developers, Performance Engineers · Related Docs: Benchmarks README | States & Lifecycles | Actor Pattern
What Each Benchmark Measures
Each E2E benchmark executes a complete workflow through the full distributed system: HTTP API call, task initialization, step discovery, message queue dispatch, worker execution, result submission, dependency graph re-evaluation, and task finalization.
A 4-step linear workflow at P50=257ms means the system completes 76+ database operations, 8 message queue round-trips, 16+ state machine transitions, and 4 dependency graph evaluations — all across a 10-instance distributed cluster — in approximately one quarter of a second.
Per-Step Lifecycle: What Happens for Every Step
Before examining the benchmark scenarios, it’s important to understand the work the system performs for each individual step. Every step in every benchmark goes through this complete lifecycle.
Messaging Backend: Tasker uses a MessagingService trait with provider variants for
PGMQ (PostgreSQL-native, single-dependency) and RabbitMQ (high-throughput). The benchmark
results documented here were captured using the RabbitMQ backend. The per-step lifecycle
is identical regardless of backend — only the transport layer differs.
State Machine Transitions (per step)
Step: Pending → Enqueued → InProgress → EnqueuedForOrchestration → Complete
Task: StepsInProcess → EvaluatingResults → (EnqueuingSteps if more ready) → Complete
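A minimal sketch of the per-step state machine described above. The type and function names are illustrative rather than the engine's internal API; what they show is the transition order and the guard that rejects any transition whose from-state does not match.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StepState {
    Pending,
    Enqueued,
    InProgress,
    EnqueuedForOrchestration,
    Complete,
}

// Guarded transition: only the documented from→to pairs are legal, which is
// what makes duplicate or out-of-order messages safe to reject.
fn transition(from: StepState, to: StepState) -> Result<StepState, String> {
    use StepState::*;
    match (from, to) {
        (Pending, Enqueued)
        | (Enqueued, InProgress)
        | (InProgress, EnqueuedForOrchestration)
        | (EnqueuedForOrchestration, Complete) => Ok(to),
        _ => Err(format!("illegal transition {from:?} -> {to:?}")),
    }
}

fn main() {
    use StepState::*;
    assert!(transition(Pending, Enqueued).is_ok());
    assert!(transition(Pending, Complete).is_err()); // cannot skip states
}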
Database Operations (per step): ~19 operations
| Phase | Operations | Description |
|---|---|---|
| Discovery | 2 queries | get_next_ready_tasks + get_step_readiness_status_batch (8-CTE query) |
| Enqueueing | 4 writes | Fetch correlation_id, transition Pending→Enqueued (SELECT sort_key + UPDATE most_recent + INSERT transition) |
| Message send | 1 op | Send step dispatch to worker queue (via MessagingService) |
| Worker claim | 1 op | Claim message with visibility timeout (via MessagingService) |
| Worker transition | 3 writes | Transition Enqueued→InProgress |
| Result submission | 4 writes | Transition InProgress→EnqueuedForOrchestration + audit trigger INSERT + send completion to orchestration queue |
| Result processing | 4 writes | Fetch step state, transition →Complete, delete consumed message |
| Task coordination | 1+ queries | Re-evaluate get_step_readiness_status_batch for remaining steps |
| Total | ~19 ops | |
Message Queue Round-Trips (per step): 2
- Orchestration → Worker: Step dispatch message (task_uuid, step_uuid, handler, context)
- Worker → Orchestration: Completion notification (task_uuid, step_uuid, results)
Dependency Graph Evaluation (per step completion)
After each step completes, the orchestration:
- Queries all steps in the task for current state
- Evaluates dependency edges (parent steps must be Complete)
- Calculates retry eligibility (attempts < max_attempts, backoff expired)
- Identifies newly-ready steps for enqueueing
- Updates task state (more steps ready → EnqueuingSteps, all complete → Complete)
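A condensed sketch of that readiness decision. The field and function names here are hypothetical; in Tasker the evaluation runs as SQL (the 8-CTE readiness query mentioned earlier), but the logic it encodes is the same.

use std::time::SystemTime;

struct StepStatus {
    parents_complete: bool,            // every dependency edge points at a Complete step
    attempts: u32,
    max_attempts: u32,
    backoff_until: Option<SystemTime>, // set after a retryable failure
    not_yet_enqueued: bool,            // step is still in Pending
}

// A step is ready to enqueue when its parents are done, it still has retry
// budget, and any backoff window has expired.
fn is_ready(s: &StepStatus, now: SystemTime) -> bool {
    s.not_yet_enqueued
        && s.parents_complete
        && s.attempts < s.max_attempts
        && s.backoff_until.map_or(true, |until| now >= until)
}

fn main() {
    let step = StepStatus {
        parents_complete: true,
        attempts: 0,
        max_attempts: 3,
        backoff_until: None,
        not_yet_enqueued: true,
    };
    assert!(is_ready(&step, SystemTime::now()));
}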
Idempotency Guarantees
- Message visibility timeout: MessagingService prevents duplicate processing (30s window)
- State machine guards: Transitions validate from-state before applying
- Atomic claiming: Workers claim via the messaging backend’s atomic read operation
- Audit trail: Every transition creates an immutable
workflow_step_transitionsrecord
Tier 1: Core Performance (Rust Native)
Linear Rust (4 steps, sequential)
Fixture: tests/fixtures/task_templates/rust/mathematical_sequence.yaml
Namespace: rust_e2e_linear | Handler: mathematical_sequence
linear_step_1 → linear_step_2 → linear_step_3 → linear_step_4
| Step | Handler | Operation | Depends On | Math |
|---|---|---|---|---|
| linear_step_1 | LinearStep1 | square | none | 6^2 = 36 |
| linear_step_2 | LinearStep2 | square | step_1 | 36^2 = 1,296 |
| linear_step_3 | LinearStep3 | square | step_2 | 1,296^2 = 1,679,616 |
| linear_step_4 | LinearStep4 | square | step_3 | 1,679,616^2 |
Distributed system work for this workflow:
| Metric | Count |
|---|---|
| State machine transitions (step) | 16 (4 per step) |
| State machine transitions (task) | 6 (Pending→Init→Enqueue→InProcess→Eval→Complete) |
| Database operations | 76 (19 per step) |
| MQ messages | 8 (2 per step) |
| Dependency evaluations | 4 (after each step completes) |
| HTTP calls (benchmark→API) | 1 create + ~5 polls |
| Sequential stages | 4 |
Why this matters: This is the purest sequential latency test. Each step must fully complete (all 19 DB operations + 2 message round-trips) before the next step can begin. The P50 of ~257ms means each step’s complete lifecycle averages ~64ms including all distributed coordination.
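The handler work itself is deliberately trivial; each step just squares the previous result, so essentially all of the measured latency is coordination. A sketch of the four-step chain from the table above:

// The mathematical_sequence arithmetic from the table above, starting from 6.
fn main() {
    let mut value: u64 = 6;
    for step in 1..=4 {
        value *= value; // each linear_step_N squares its input
        println!("linear_step_{step}: {value}");
    }
    // Prints 36, 1296, 1679616, 2821109907456: microseconds of CPU work,
    // so the ~64ms per step is almost entirely distributed coordination.
}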
Diamond Rust (4 steps, 2-way parallel)
Fixture: tests/fixtures/task_templates/rust/diamond_pattern.yaml
Namespace: rust_e2e_diamond | Handler: diamond_pattern
          diamond_start
           /         \
          /           \
diamond_branch_b   diamond_branch_c   ← parallel execution
          \           /
           \         /
          diamond_end                 ← 2-way convergence
| Step | Handler | Operation | Depends On | Parallelism |
|---|---|---|---|---|
| diamond_start | Start | square | none | - |
| diamond_branch_b | BranchB | square | start | parallel with C |
| diamond_branch_c | BranchC | square | start | parallel with B |
| diamond_end | End | multiply_and_square | branch_b AND branch_c | convergence |
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions (step) | 16 |
| Database operations | 76 |
| MQ messages | 8 |
| Dependency evaluations | 4 |
| Sequential stages | 3 (start → parallel → end) |
| Convergence points | 1 (diamond_end waits for both branches) |
| Dependency edge checks | 4 (start→B, start→C, B→end, C→end) |
Why this matters: Tests the system’s ability to dispatch and execute steps concurrently. The convergence point (diamond_end) requires the orchestration to correctly evaluate that BOTH branch_b AND branch_c are Complete before enqueueing diamond_end. Under light load, this completes in 3 sequential stages vs 4 for linear (~30% faster).
Tier 2: Complexity Scaling
Complex DAG (7 steps, mixed parallelism)
Fixture: tests/fixtures/task_templates/rust/complex_dag.yaml
Namespace: rust_e2e_mixed_dag | Handler: complex_dag
dag_init
/ \
dag_process_left dag_process_right ← 2-way parallel
/ | | \
/ | | \
dag_validate dag_transform dag_analyze ← mixed dependencies
\ | /
\ | /
dag_finalize ← 3-way convergence
| Step | Depends On | Type |
|---|---|---|
| dag_init | none | init |
| dag_process_left | init | parallel branch |
| dag_process_right | init | parallel branch |
| dag_validate | left AND right | 2-way convergence |
| dag_transform | left only | linear continuation |
| dag_analyze | right only | linear continuation |
| dag_finalize | validate AND transform AND analyze | 3-way convergence |
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions (step) | 28 (7 steps x 4) |
| Database operations | 133 (7 x 19) |
| MQ messages | 14 (7 x 2) |
| Dependency evaluations | 7 |
| Sequential stages | 4 (init → left/right → validate/transform/analyze → finalize) |
| Convergence points | 2 (dag_validate: 2-way, dag_finalize: 3-way) |
| Dependency edge checks | 8 |
Why this matters: Tests multiple convergence points with different fan-in widths. The orchestration must correctly handle that dag_validate needs 2 parents while dag_finalize needs 3. Also tests mixed patterns: some steps continue from a single parent (transform from left only) while others require multiple.
Hierarchical Tree (8 steps, 4-way convergence)
Fixture: tests/fixtures/task_templates/rust/hierarchical_tree.yaml
Namespace: rust_e2e_tree | Handler: hierarchical_tree
                         tree_root
                        /         \
        tree_branch_left           tree_branch_right        ← 2-way parallel
          /          \               /          \
   tree_leaf_d   tree_leaf_e   tree_leaf_f   tree_leaf_g    ← 4-way parallel
          \           |             |           /
           \          |             |          /
                 tree_final_convergence                     ← 4-way convergence
| Level | Steps | Parallelism | Operation |
|---|---|---|---|
| 0 | root | sequential | square |
| 1 | branch_left, branch_right | 2-way parallel | square |
| 2 | leaf_d, leaf_e, leaf_f, leaf_g | 4-way parallel | square |
| 3 | final_convergence | 4-way convergence | multiply_all_and_square |
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions (step) | 32 (8 x 4) |
| Database operations | 152 (8 x 19) |
| MQ messages | 16 (8 x 2) |
| Dependency evaluations | 8 |
| Sequential stages | 4 (root → branches → leaves → convergence) |
| Maximum fan-out | 2-way (each branch → 2 leaves) |
| Maximum fan-in | 4-way (convergence waits for all 4 leaves) |
| Dependency edge checks | 9 |
Why this matters: Tests the widest convergence pattern — 4 parallel leaves must all complete before the final step can execute. This exercises the dependency evaluation with a large number of parent checks per step. Also tests hierarchical fan-out (root→2 branches→4 leaves).
Conditional Routing (6 steps, 4 executed)
Fixture: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml
Namespace: conditional_approval_rust | Handler: approval_routing
Context: {"amount": 500, "requester": "benchmark"}
              validate_request
                     ↓
              routing_decision            ← DECISION POINT (routes based on amount)
               /      |       \
              /       |        \
    auto_approve  manager_approval  finance_review
     (< $1000)     ($1000-$5000)      (> $5000)
              \       |        /
               \      |       /
              finalize_approval           ← deferred convergence
With benchmark context amount=500, only the auto_approve path executes:
validate_request → routing_decision → auto_approve → finalize_approval
| Step | Executed | Condition |
|---|---|---|
| validate_request | Yes | always |
| routing_decision | Yes | always (decision point) |
| auto_approve | Yes | amount < 1000 |
| manager_approval | Skipped | amount 1000-5000 |
| finance_review | Skipped | amount > 5000 |
| finalize_approval | Yes | deferred convergence (waits for executed paths only) |
Distributed system work (executed steps only):
| Metric | Count |
|---|---|
| State machine transitions (step) | 16 (4 executed x 4) |
| Database operations | 76 (4 executed x 19) |
| MQ messages | 8 (4 executed x 2) |
| Dependency evaluations | 4 |
| Sequential stages | 4 (validate → decision → approve → finalize) |
| Skipped steps | 2 (manager_approval, finance_review) |
Why this matters: Tests deferred convergence — the finalize_approval step depends on ALL conditional branches, but only blocks on branches that actually executed. The orchestration must correctly determine that manager_approval and finance_review were skipped (not just incomplete) and allow finalize_approval to proceed. Also tests the decision point routing pattern.
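A sketch of the routing decision itself, with thresholds taken from the diagram above. The function shape is illustrative, not the actual decision-handler API.

// Which conditional branch the decision point activates for a given amount.
fn route(amount: u64) -> &'static str {
    match amount {
        a if a < 1_000 => "auto_approve",
        a if a <= 5_000 => "manager_approval",
        _ => "finance_review",
    }
}

fn main() {
    // The benchmark context uses amount = 500, so only auto_approve executes;
    // manager_approval and finance_review are marked skipped, and
    // finalize_approval converges on the one path that actually ran.
    assert_eq!(route(500), "auto_approve");
}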
Tier 3: Cluster Performance
Single Task Linear (4 steps, round-robin across 2 orchestrators)
Same workflow as Tier 1 linear_rust, but benchmarked with round-robin across 2 orchestration instances to measure cluster coordination overhead.
Distributed system work: Same as linear_rust (76 DB ops, 8 MQ messages) plus cluster coordination overhead (shared database, message queue visibility).
Why this matters: Validates that running in cluster mode adds negligible overhead compared to single-instance. The P50 difference (261ms vs 257ms = ~4ms) represents the entire cluster coordination tax.
Concurrent Tasks 2x (2 tasks simultaneously across 2 orchestrators)
Two linear workflows submitted simultaneously, one to each orchestration instance.
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions | 44 (22 per task) |
| Database operations | 152 (76 per task) |
| MQ messages | 16 (8 per task) |
| Concurrent step executions | up to 2 |
| Database connection contention | 2 orchestrators + 2 workers competing |
Why this matters: Tests work distribution across cluster instances under concurrent load. The P50 of ~332-384ms for TWO tasks (vs ~261ms for one) shows that the second task adds only 30-50% latency, not 100% — demonstrating effective parallelism in the cluster.
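A back-of-the-envelope check on that claim, using the documented P50s (illustrative arithmetic only):

// Illustrative arithmetic only, using the documented cluster P50s.
fn main() {
    let single_task_ms = 261.0;                     // Tier 3, single task
    let (two_low_ms, two_high_ms) = (332.0, 384.0); // Tier 3, concurrent 2x range
    let serial_estimate_ms = 2.0 * single_task_ms;  // 522ms if the tasks ran back to back

    let added_low = (two_low_ms / single_task_ms - 1.0) * 100.0;   // ~27%
    let added_high = (two_high_ms / single_task_ms - 1.0) * 100.0; // ~47%
    println!("second task adds {added_low:.0}%-{added_high:.0}% latency (100% would be fully serial)");
    println!(
        "effective speedup vs serial: {:.2}x-{:.2}x",
        serial_estimate_ms / two_high_ms,
        serial_estimate_ms / two_low_ms
    );
}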
Tier 4: FFI Language Comparison
Same linear and diamond patterns as Tier 1, but using FFI workers (Ruby via Magnus, Python via PyO3, TypeScript via Bun FFI) instead of native Rust handlers.
Additional per-step work for FFI:
| Phase | Additional Operations |
|---|---|
| Handler dispatch | FFI bridge call (Rust → language runtime) |
| Context serialization | JSON serialize context for foreign runtime |
| Result deserialization | JSON deserialize results back to Rust |
| Circuit breaker check | should_allow() (sync, atomic check) |
| Completion callback | FFI completion channel (bounded MPSC) |
FFI overhead: ~23% (~60ms for 4 steps)
The overhead is framework-dominated (Rust dispatch + serialization + completion channel), not language-dominated — all three languages perform within 3ms of each other.
Tier 5: Batch Processing
CSV Products 1000 Rows (7 steps, 5-way parallel)
Fixture: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml
Namespace: csv_processing_rust | Handler: csv_product_inventory_analyzer
analyze_csv ← reads CSV, returns BatchProcessingOutcome
↓
[orchestration creates 5 dynamic workers from batch template]
↓
process_csv_batch_001 ──┐
process_csv_batch_002 ──┤
process_csv_batch_003 ──├──→ aggregate_csv_results ← deferred convergence
process_csv_batch_004 ──┤
process_csv_batch_005 ──┘
| Step | Type | Rows | Operation |
|---|---|---|---|
| analyze_csv | batchable | all 1000 | Count rows, compute batch ranges |
| process_csv_batch_001 | batch_worker | 1-200 | Compute inventory metrics |
| process_csv_batch_002 | batch_worker | 201-400 | Compute inventory metrics |
| process_csv_batch_003 | batch_worker | 401-600 | Compute inventory metrics |
| process_csv_batch_004 | batch_worker | 601-800 | Compute inventory metrics |
| process_csv_batch_005 | batch_worker | 801-1000 | Compute inventory metrics |
| aggregate_csv_results | deferred_convergence | all | Merge batch results |
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions (step) | 28 (7 x 4) |
| Database operations | 133 (7 x 19) |
| MQ messages | 14 (7 x 2) |
| Dynamic step creation | 5 (batch workers created at runtime) |
| Dependency edges (dynamic) | 6 (batch workers → analyze, aggregate → batch_template) |
| File I/O operations | 6 (1 analysis read + 5 batch reads of CSV) |
| CSV rows processed | 1000 |
| Sequential stages | 3 (analyze → 5 parallel workers → aggregate) |
Why this matters: Tests the most complex orchestration pattern — dynamic step
generation. The analyze_csv step returns a BatchProcessingOutcome that tells the
orchestration to create N worker steps at runtime. The orchestration must:
- Create new step records in the database
- Create dependency edges dynamically
- Enqueue all batch workers for parallel execution
- Use deferred convergence for the aggregate step (waits for batch template, not specific steps)
At P50=358-368ms for 1000 rows, throughput is ~2,700 rows/second with all the distributed system overhead included.
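A sketch of the batch planning the analyze step performs. The BatchRange struct here is illustrative, not the actual BatchProcessingOutcome type from the worker SDKs; the point is that batch boundaries are computed once, up front, from the row count.

// Illustrative only; not the actual BatchProcessingOutcome definition.
struct BatchRange {
    start_row: usize, // 1-based, inclusive
    end_row: usize,   // inclusive
}

// Batch boundaries are derived once from the total row count; the orchestration
// then creates one batch_worker step per range.
fn plan_batches(total_rows: usize, batch_size: usize) -> Vec<BatchRange> {
    (0..total_rows)
        .step_by(batch_size)
        .map(|start| BatchRange {
            start_row: start + 1,
            end_row: (start + batch_size).min(total_rows),
        })
        .collect()
}

fn main() {
    // 1000 rows with batch_size 200 yields the five ranges shown in the table:
    // 1-200, 201-400, 401-600, 601-800, 801-1000
    for (i, b) in plan_batches(1000, 200).iter().enumerate() {
        println!("process_csv_batch_{:03}: rows {}-{}", i + 1, b.start_row, b.end_row);
    }
}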
Summary: Operations Per Benchmark
| Benchmark | Steps | DB Ops | MQ Msgs | Transitions | Convergence | P50 |
|---|---|---|---|---|---|---|
| Linear Rust | 4 | 76 | 8 | 22 | none | 257ms |
| Diamond Rust | 4 | 76 | 8 | 22 | 2-way | 200-259ms |
| Complex DAG | 7 | 133 | 14 | 34 | 2+3-way | 382ms |
| Hierarchical Tree | 8 | 152 | 16 | 38 | 4-way | 389-426ms |
| Conditional | 4* | 76 | 8 | 22 | deferred | 251-262ms |
| Cluster single | 4 | 76 | 8 | 22 | none | 261ms |
| Cluster 2x | 8 | 152 | 16 | 44 | none | 332-384ms |
| FFI linear | 4 | 76 | 8 | 22 | none | 312-316ms |
| FFI diamond | 4 | 76 | 8 | 22 | 2-way | 260-275ms |
| Batch 1000 rows | 7 | 133 | 14 | 34 | deferred | 358-368ms |
*Conditional executes 4 of 6 defined steps (2 skipped by routing decision)
Performance per Sequential Stage
For workflows with known sequential depth, we can calculate per-stage overhead:
| Benchmark | Sequential Stages | P50 | Per-Stage Avg |
|---|---|---|---|
| Linear (4 seq) | 4 | 257ms | 64ms |
| Diamond (3 seq) | 3 | 200ms* | 67ms |
| Complex DAG (4 seq) | 4 | 382ms | 96ms** |
| Tree (4 seq) | 4 | 389ms | 97ms** |
| Conditional (4 seq) | 4 | 257ms | 64ms |
| Batch (3 seq) | 3 | 363ms | 121ms*** |
*Diamond under light load (parallelism helping)
**Higher per-stage due to multiple steps per stage (more DB ops per evaluation cycle)
***Higher per-stage due to batch worker creation overhead + file I/O
The ~64ms per sequential stage for simple patterns represents the total distributed round-trip: orchestration discovery → MQ dispatch → worker claim → handler execute (~1ms for math operations) → MQ completion → orchestration result processing → dependency re-evaluation. The handler execution itself is negligible; the 64ms is almost entirely orchestration infrastructure.
Tasker Contrib Documentation
Quick Links
| Document | Description |
|---|---|
| README.md | Repository overview, vision, and structure |
| DEVELOPMENT.md | Local development and cross-repo setup |
Implementation Specifications
| Ticket | Status | Description |
|---|---|---|
| TAS-126 | 🚧 In Progress | Foundations: repo structure, vision, CLI plugin design |
TAS-126 Documents
| Document | Description |
|---|---|
| README.md | Ticket summary and deliverables |
| foundations.md | Architectural deep-dive and design rationale |
| rails.md | Rails-specific implementation plan |
| cli-plugin-architecture.md | CLI plugin system design |
Architecture
The foundations document covers:
- Design rationale (why separate repos, why Railtie over Engine)
- Framework integration patterns (lifecycle, events, generators)
- Configuration architecture (three-layer model)
- Testing architecture (unit, integration, E2E)
- Versioning strategy
Milestones
| Milestone | Status | Description |
|---|---|---|
| Foundations and CLI | 🚧 In Progress | TAS-126: Repo structure, vision, CLI plugin design |
| Rails | 📋 Planned | tasker-contrib-rails gem, generators, event bridge |
| Python | 📋 Planned | tasker-contrib-fastapi, pytest integration |
| TypeScript | 📋 Planned | tasker-contrib-bun, Bun.serve patterns |
Framework Guides
Coming soon as packages are implemented
- Rails Integration Guide
- FastAPI Integration Guide
- Bun Integration Guide
- Axum Integration Guide
Operational Guides
Coming soon
- Helm Chart Deployment
- Terraform Infrastructure
- Monitoring Setup
- Production Checklist
Example Applications
Complete example applications demonstrating Tasker patterns.
Examples
| Example | Description |
|---|---|
| e-commerce-workflow/ | Order processing with payment, inventory, and shipping |
| etl-pipeline/ | Data extraction, transformation, and loading workflow |
| approval-system/ | Multi-level approval with conditional routing |
Purpose
These examples demonstrate:
- Real-world workflow patterns
- Multi-language handler implementations
- Testing strategies
- Deployment configurations
Status
📋 Planned
Approval System Example
Multi-level approval workflow demonstrating:
- Decision handlers for routing
- Convergence patterns (all approvals required)
- Human-in-the-loop integration
- Timeout and escalation
Status
📋 Planned
E-Commerce Workflow Example
Order processing workflow demonstrating:
- Diamond dependency patterns (parallel payment + inventory check)
- External API integration (payment gateway)
- Conditional routing (shipping method selection)
- Error handling and retries
Status
📋 Planned
ETL Pipeline Example
Data processing workflow demonstrating:
- Batchable handlers for large datasets
- Checkpoint/resume for long-running processes
- Multiple data sources
- Transformation chains
Status
📋 Planned
Engineering Stories
A progressive-disclosure blog series that teaches Tasker concepts through real-world scenarios. Each story builds on the previous, showing how a growing engineering team adopts workflow orchestration across all four supported languages.
These stories are being rewritten for the current Tasker architecture. See the archive for the original GitBook-era versions.
| Story | Theme | Status |
|---|---|---|
| 01: E-commerce Checkout | Basic workflows, error handling | Planned |
| 02: Data Pipeline | ETL patterns, resilience | Planned |
| 03: Microservices | Service coordination | Planned |
| 04: Team Scaling | Namespace isolation | Planned |
| 05: Observability | OpenTelemetry + domain events | Planned |
| 06: Batch Processing | Batch step patterns | Planned |
| 07: Conditional Workflows | Decision handlers | Planned |
| 08: Production Debugging | DLQ investigation | Planned |