Tasker

Workflow orchestration that meets your code where it lives.

Tasker is an open-source workflow orchestration engine built on PostgreSQL and PGMQ. You define workflows as task templates with ordered steps, implement handlers in Rust, Ruby, Python, or TypeScript, and the engine handles execution, retries, circuit breaking, and observability.

Your existing business logic — API calls, database operations, service integrations — becomes a distributed, event-driven, retryable workflow with minimal ceremony. No DSLs to learn, no framework rewrites. Just thin handler wrappers around code you already have.


Get Started

Getting Started Guide

From zero to your first workflow. Install, write a handler, define a template, submit a task, and watch it run.

Why Tasker?

An honest look at where Tasker fits in the workflow orchestration landscape — and where established tools might be a better choice.

Architecture

How Tasker works under the hood: actors, state machines, event systems, circuit breakers, and the PostgreSQL-native execution model.

Configuration Reference

Complete reference for all 246 configuration parameters across orchestration, workers, and shared settings.


Choose Your Language

Tasker is polyglot from the ground up. The orchestration engine is Rust; workers can be any of four languages, all sharing the same core abstractions expressed idiomatically.

| Language | Package | Install | Registry |
| --- | --- | --- | --- |
| Rust | tasker-client / tasker-worker | cargo add tasker-client tasker-worker | crates.io |
| Ruby | tasker-rb | gem install tasker-rb | rubygems.org |
| Python | tasker-py | pip install tasker-py | pypi.org |
| TypeScript | @tasker-systems/tasker | npm install @tasker-systems/tasker | npmjs.com |

Each language guide covers installation, handler patterns, testing, and production considerations:

Rust · Ruby · Python · TypeScript


Explore the Documentation

For New Users

Architecture & Design

Operational Guides

Reference

Framework Integrations


Engineering Stories

A progressive-disclosure blog series teaching Tasker concepts through real-world scenarios. Each story follows an engineering team as they adopt workflow orchestration, with working code examples across all four languages.

| Story | What You’ll Learn |
| --- | --- |
| 01: E-commerce Checkout | Basic workflows, error handling, retry patterns |
| 02: Data Pipeline Resilience | ETL orchestration, resilience under failure |
| 03: Microservices Coordination | Cross-service workflows, distributed tracing |
| 04: Team Scaling | Namespace isolation, multi-team patterns |
| 05: Observability | OpenTelemetry integration, domain events |
| 06: Batch Processing | Batch step patterns, throughput optimization |
| 07: Conditional Workflows | Decision handlers, approval flows |
| 08: Production Debugging | DLQ investigation, diagnostics tooling |

Stories are being rewritten for the current Tasker architecture. View archive →


The Project

Tasker is open-source software (MIT license) built by an engineer who has spent years designing workflow systems at multiple organizations — and finally had the opportunity to build the one that was always in his head.

It’s not venture-backed. It’s not chasing a market. It’s a labor of love built for the engineering community.

Read the full story →

Source Repositories

| Repository | Description |
| --- | --- |
| tasker-core | Rust orchestration engine, polyglot workers, and CLI |
| tasker-contrib | Framework integrations and community packages |
| tasker-book | This documentation site |

Why Tasker

Last Updated: 2025-01-09 · Audience: Engineers evaluating workflow orchestration tools · Status: Pre-Alpha


The Story

Tasker is a labor of love.

Over the years, I’ve built workflow systems at multiple organizations—each time encountering the same fundamental challenges: orchestrating complex, multi-step processes with proper dependency management, ensuring idempotency, handling retries gracefully, and doing all of this in a way that doesn’t require teams to rewrite their existing business logic.

Each time, I’d design parts of the solution I wished we could build—but the investment was never justifiable. General-purpose workflow infrastructure rarely makes sense for a single company to build from scratch when there are urgent product features to ship. So I’d compromise, cobble together something workable, and move on.

Tasker represents the opportunity to finally build that system properly—the one that’s been evolving in my head for years. Not as a venture-backed startup chasing a market, but as open-source software built by someone who genuinely cares about the problem space and wants to give back to the engineering community.


The Landscape

Honesty is important, and so in full candor: Tasker is not solving an unsolved problem. The workflow orchestration space has mature, battle-tested options.

Apache Airflow

What it does well: Airflow is the industry standard for data pipeline orchestration. Born at Airbnb and now an Apache project with thousands of contributors, it excels at scheduled, batch-oriented workflows defined as Python DAGs. Its ecosystem of operators and integrations is unmatched—if you need to connect to a cloud service, there’s probably an Airflow provider for it.

When to choose it: You have scheduled ETL/ELT workloads, your team is Python-native, you need managed cloud options (AWS MWAA, Google Cloud Composer, Astronomer), and you value ecosystem breadth over ergonomic simplicity.

Honest comparison: Airflow’s 10+ years of production use across thousands of companies represents a level of battle-testing Tasker simply cannot match. If your primary use case is data pipeline orchestration with scheduled intervals, Airflow is likely the safer choice.

Temporal

What it does well: Temporal pioneered “durable execution”—workflows that automatically survive crashes, network failures, and infrastructure outages. It reconstructs application state transparently, letting developers write code as if failures don’t exist. The event history and replay capabilities are genuinely impressive.

When to choose it: You need long-running workflows (hours, days, or longer), your operations require true durability guarantees, you’re building microservice orchestration with complex saga patterns, or you need human-in-the-loop workflows with unbounded wait times.

Honest comparison: Temporal’s durable execution model is architecturally different from Tasker. If your workflows genuinely need to survive arbitrary failures mid-execution and resume from exact state, Temporal was purpose-built for this. Tasker provides resilience through retries and idempotent step execution, but doesn’t offer Temporal’s deterministic replay.

Prefect

What it does well: Prefect feels like “what if workflow orchestration were just Python decorators?” It emphasizes minimal boilerplate—add @flow and @task decorators to existing functions, and you have an orchestrated workflow. Prefect 3.0 embraces dynamic workflows with native Python control flow.

When to choose it: Your team is Python-native, you want the fastest path from script to production pipeline, you value simplicity and developer experience, or you’re doing ML/data science workflows where Jupyter-to-production is important.

Honest comparison: Prefect’s decorator-based ergonomics are genuinely excellent for Python-only teams. If your organization is homogeneously Python and you don’t need polyglot support, Prefect delivers a very clean experience.

Dagster

What it does well: Dagster introduced “software-defined assets” as first-class primitives—you define what data assets should exist and their dependencies, and the orchestrator figures out how to materialize them. This asset-centric model provides excellent lineage tracking and observability.

When to choose it: You’re building a data platform where understanding asset lineage is critical, you want a declarative approach focused on data products rather than task sequences, or you need strong dbt integration and data quality built into your orchestration layer.

Honest comparison: Dagster’s asset-centric philosophy is a genuinely different way of thinking about orchestration. If your mental model centers on “what data assets need to exist” rather than “what steps need to execute,” Dagster may be a better conceptual fit.


So Why Tasker?

Given this landscape, why build another workflow orchestrator?

Philosophy: Meet Teams Where They Are

Most workflow tools require you to think in their terms. Define your work as DAGs using their DSL. Adopt their scheduling model. Often, rewrite your business logic to fit their execution model.

Tasker takes a different approach: bring workflow orchestration to your existing code, rather than bringing your code to a workflow framework.

If your codebase already has reasonable SOLID characteristics—services with clear responsibilities, well-defined interfaces, operations that can be made idempotent—Tasker aims to turn that code into distributed, event-driven, retryable workflows with minimal ceremony.

This philosophy manifests in several ways:

Polyglot from the ground up. Tasker’s orchestration engine is written in Rust, but workers can be written in Ruby, Python, TypeScript, or native Rust. Each language implementation shares the same core abstractions—same handler signatures, same result factories, same patterns—expressed idiomatically for each language. This isn’t an afterthought; cross-language consistency is a core design principle.

Minimal migration burden. Your existing business logic—API calls, database operations, external service integrations—can become step handlers with thin wrappers. You don’t need to restructure your application around the orchestrator.
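
To make "thin wrappers" concrete, here is a minimal Rust-flavored sketch. The StepHandler trait, StepContext, and billing::charge_customer call are illustrative stand-ins rather than the actual worker API; the retryable/permanent/success result factories follow the error-classification pattern described elsewhere in these docs.

use async_trait::async_trait;

// Hypothetical sketch: a step handler that wraps an existing service call.
// `StepHandler`, `StepContext`, and `StepResult` stand in for the worker API;
// `billing::charge_customer` represents business logic you already own.
pub struct ChargePaymentHandler;

#[async_trait]
impl StepHandler for ChargePaymentHandler {
    async fn call(&self, ctx: &StepContext) -> StepResult {
        // Read inputs from the task context instead of restructuring the service.
        let order_id: String = match ctx.input("order_id") {
            Some(id) => id,
            None => return StepResult::permanent("missing order_id"),
        };

        // Delegate to code you already have; classify failures for the engine.
        match billing::charge_customer(&order_id).await {
            Ok(receipt) => StepResult::success(receipt),
            Err(e) if e.is_transient() => StepResult::retryable(e.to_string()),
            Err(e) => StepResult::permanent(e.to_string()),
        }
    }
}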

Framework-agnostic core. Tasker Core provides the fundamentals without framework opinions. Tasker Contrib then provides framework-specific integrations (Rails, FastAPI, Bun) for teams who want batteries-included ergonomics. Progressive disclosure: learn the core concepts first, add framework sugar when needed.

Architecture: Event-Driven with Resilience Built In

Tasker’s architecture reflects lessons learned from building distributed systems:

PostgreSQL-native by default. Everything flows through Postgres—task state, step queues (via PGMQ), event coordination (via LISTEN/NOTIFY). This isn’t because Postgres is trendy; it’s because many teams already have Postgres expertise and operational knowledge. Tasker works as a single-dependency system on PostgreSQL alone. For environments requiring higher throughput or existing RabbitMQ infrastructure, Tasker also supports RabbitMQ as an alternative messaging backend—switch with a configuration change.

Event-driven with polling fallback. Real-time responsiveness through Postgres notifications, but with polling as a reliability backstop. Events can be missed; polling ensures eventual consistency.

Defense in depth. Multiple overlapping protection layers provide robust idempotency without single-point dependency. Database-level atomicity, state machine guards, transaction boundaries, and application-level filtering each catch what others might miss.

Composition over inheritance. Handler capabilities are composed via mixins/traits, not class hierarchies. This enables selective capability inclusion, clear separation of concerns, and easier testing.

Performance: Fast by Default

Tasker is built in Rust not for marketing purposes, but because workflow orchestration has real performance implications. When you’re coordinating thousands of steps across distributed workers, overhead matters.

  • Complex 7-step DAG workflows complete in under 133ms with push-based notifications
  • Concurrent execution via work-stealing thread pools
  • Lock-free channel-based internal coordination
  • Zero-copy where possible in the FFI boundaries

The Honest Assessment

Tasker excels when:

  • You need polyglot worker support across Ruby, Python, TypeScript, and Rust
  • Your team already has Postgres expertise and wants to avoid additional infrastructure
  • You want to bring orchestration to existing business logic rather than rewriting
  • You value clean, consistent APIs across languages
  • Performance matters and you’re willing to trade ecosystem breadth for it

Tasker may not be the right choice when:

  • You need the battle-tested maturity and ecosystem of Airflow
  • Your workflows require Temporal-style durable execution with deterministic replay
  • You’re an all-Python team and Prefect’s ergonomics fit perfectly
  • You’re building a data platform where asset-centric thinking (Dagster) is the right model
  • You need managed cloud offerings with SLAs and enterprise support

What Tasker Is (and Isn’t)

Tasker Is:

  • A workflow orchestration engine for step-based DAG execution with complex dependencies
  • PostgreSQL-native with flexible messaging using PGMQ (default) or RabbitMQ
  • Polyglot by design with first-class support for multiple languages
  • Focused on developer experience for teams who want minimal intrusion
  • Open source (MIT license) and built as a labor of love

Tasker Is Not:

  • A data orchestration platform like Dagster with asset lineage and data quality primitives
  • A durable execution engine like Temporal with deterministic replay and unlimited durability
  • A scheduled job runner for simple cron-style workloads (use actual cron)
  • A message bus for pure pub/sub fan-out (use Kafka or dedicated brokers)
  • Enterprise software with commercial support, SLAs, or managed offerings

Current State

Tasker is pre-alpha software. This is important context:

What this means:

  • The architecture is solidifying but breaking changes are expected
  • Documentation is comprehensive but evolving
  • There are no production deployments (that I know of) outside development
  • You should not bet critical business processes on Tasker today

What this enables:

  • Rapid iteration based on real feedback
  • Willingness to break APIs to get the design right
  • Focus on architectural correctness over backward compatibility
  • Honest experimentation without legacy constraints

If you’re evaluating Tasker, I’d encourage you to explore it for non-critical workloads, provide feedback, and help shape what it becomes. If you need production-ready workflow orchestration today, please consider the established tools above—I genuinely recommend them for their respective strengths.


The Path Forward

Tasker is being built with care, not speed. The goal isn’t to capture market share or compete with well-funded companies. The goal is to create something genuinely useful—a workflow orchestration system that respects developers’ time and meets them where they are.

The codebase is open, the design decisions are documented, and contributions are welcome. This is software built by an engineer for engineers, not a product chasing metrics.

If that resonates with you, welcome. Let’s build something good together.



← Back to Documentation Hub

Getting Started

This section walks you from “what is Tasker?” to running your first workflow.

Path

  1. Overview - What Tasker is and why it exists
  2. Core Concepts - Tasks, steps, handlers, templates, and namespaces
  3. Installation - Installing packages and running infrastructure
  4. Choosing Your Package - Which package do you need?
  5. Your First Handler - Write a step handler in your language
  6. Your First Workflow - Define a template, submit a task, watch it run
  7. Next Steps - Where to go from here

Overview

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Core Concepts

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Installation

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Choosing Your Package

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Your First Handler

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Your First Workflow

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Next Steps

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Tasker Core Architecture

This directory contains architectural reference documentation describing how Tasker Core’s components work together.

Documents

| Document | Description |
| --- | --- |
| Crate Architecture | Workspace structure and crate responsibilities |
| Messaging Abstraction | Provider-agnostic messaging (PGMQ, RabbitMQ) |
| Actors | Actor-based orchestration lifecycle components |
| Worker Actors | Actor pattern for worker step execution |
| Worker Event Systems | Dual-channel event architecture for workers |
| States and Lifecycles | Dual state machine architecture (Task + Step) |
| Events and Commands | Event-driven coordination patterns |
| Domain Events | Business event publishing (durable/fast/broadcast) |
| Idempotency and Atomicity | Defense-in-depth guarantees |
| Backpressure Architecture | Unified resilience and flow control |
| Circuit Breakers | Fault isolation and cascade prevention |
| Deployment Patterns | Hybrid, EventDriven, PollingOnly modes; PGMQ/RabbitMQ backends |

When to Read These

  • Designing new features: Understand how components interact
  • Debugging flow issues: Trace message paths through actors
  • Understanding trade-offs: See why patterns were chosen
  • Onboarding: Build mental model of the system
  • Principles - The “why” behind architectural decisions
  • Guides - Practical “how-to” documentation
  • CHRONOLOGY - Historical context for decisions

Actor-Based Architecture

Last Updated: 2025-12-04 · Audience: Architects, Developers · Status: Active · Related Docs: Documentation Hub | Worker Actor Architecture | Events and Commands | States and Lifecycles

← Back to Documentation Hub


This document describes the actor-based architecture in tasker-core: a lightweight Actor pattern that formalizes the relationship between Commands and Lifecycle Components, replacing imperative delegation with message-based actor coordination.

Overview

The tasker-core system implements a lightweight Actor pattern inspired by frameworks like Actix, but designed specifically for our orchestration needs without external dependencies. The architecture provides:

  1. Actor Abstraction: Lifecycle components encapsulated as actors with clear lifecycle hooks
  2. Message-Based Communication: Type-safe message handling via Handler trait
  3. Central Registry: ActorRegistry for managing all orchestration actors
  4. Service Decomposition: Focused components following single responsibility principle
  5. Direct Integration: Command processor calls actors directly without wrapper layers

This architecture eliminates inconsistencies in lifecycle component initialization, provides type-safe message handling, and creates a clear separation between command processing and business logic execution.

Implementation Status

All phases implemented and production-ready: core abstractions, all 4 primary actors, message hydration, module reorganization, service decomposition, and direct actor integration.

Core Concepts

What is an Actor?

In the tasker-core context, an Actor is an encapsulated lifecycle component that:

  • Manages its own state: Each actor owns its dependencies and configuration
  • Processes messages: Responds to typed command messages via the Handler trait
  • Has lifecycle hooks: Initialization (started) and cleanup (stopped) methods
  • Is isolated: Actors communicate through message passing
  • Is thread-safe: All actors are Send + Sync + 'static

Why Actors?

The previous architecture had several inconsistencies:

// OLD: Inconsistent initialization patterns
pub struct TaskInitializer {
    // Constructor pattern
}

pub struct TaskFinalizer {
    // Builder pattern with new()
}

pub struct StepEnqueuer {
    // Factory pattern with create()
}

The actor pattern provides consistency:

// NEW: Consistent actor pattern
impl OrchestrationActor for TaskRequestActor {
    fn name(&self) -> &'static str { "TaskRequestActor" }
    fn context(&self) -> &Arc<SystemContext> { &self.context }
    fn started(&mut self) -> TaskerResult<()> { /* initialization */ }
    fn stopped(&mut self) -> TaskerResult<()> { /* cleanup */ }
}

Actor vs Service

Services (underlying business logic):

  • Encapsulate business logic
  • Stateless operations on domain models
  • Direct method invocation
  • Examples: TaskFinalizer, StepEnqueuerService, OrchestrationResultProcessor

Actors (message-based coordination):

  • Wrap services with message-based interface
  • Manage service lifecycle
  • Asynchronous message handling
  • Examples: TaskRequestActor, ResultProcessorActor, StepEnqueuerActor, TaskFinalizerActor

The relationship:

pub struct TaskFinalizerActor {
    context: Arc<SystemContext>,
    service: TaskFinalizer,  // Wraps underlying service
}

impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
    type Response = FinalizationResult;

    async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
        // Delegates to service
        self.service.finalize_task(msg.task_uuid).await
    }
}

Actor Traits

OrchestrationActor Trait

The base trait for all orchestration actors, defined in tasker-orchestration/src/actors/traits.rs:

/// Base trait for all orchestration actors
///
/// Provides lifecycle management and context access for all actors in the
/// orchestration system. All actors must implement this trait to participate
/// in the actor registry and lifecycle management.
///
/// # Lifecycle
///
/// 1. **Construction**: Actor is created by ActorRegistry
/// 2. **Initialization**: `started()` is called during registry build
/// 3. **Operation**: Actor processes messages via Handler<M> implementations
/// 4. **Shutdown**: `stopped()` is called during registry shutdown
pub trait OrchestrationActor: Send + Sync + 'static {
    /// Returns the unique name of this actor
    ///
    /// Used for logging, metrics, and debugging. Should be a static string
    /// that clearly identifies the actor's purpose.
    fn name(&self) -> &'static str;

    /// Returns a reference to the system context
    ///
    /// Provides access to database pool, configuration, and other
    /// framework-level resources.
    fn context(&self) -> &Arc<SystemContext>;

    /// Called when the actor is started
    ///
    /// Perform any initialization work here, such as:
    /// - Setting up database connections
    /// - Loading configuration
    /// - Initializing caches
    ///
    /// # Errors
    ///
    /// Return an error if initialization fails. The actor will not be
    /// registered and the system will fail to start.
    fn started(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor started");
        Ok(())
    }

    /// Called when the actor is stopped
    ///
    /// Perform any cleanup work here, such as:
    /// - Closing database connections
    /// - Flushing caches
    /// - Releasing resources
    ///
    /// # Errors
    ///
    /// Return an error if cleanup fails. Errors are logged but do not
    /// prevent other actors from shutting down.
    fn stopped(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor stopped");
        Ok(())
    }
}

Key Design Decisions:

  1. Send + Sync + 'static: Enables actors to be shared across threads
  2. Default lifecycle hooks: Actors only override when needed
  3. Context injection: All actors have access to SystemContext
  4. Error handling: Lifecycle failures are TaskerResult for proper error propagation

Handler Trait

The message handling trait, enabling type-safe message processing:

/// Message handler trait for specific message types
///
/// Actors implement Handler<M> for each message type they can process.
/// This provides type-safe, asynchronous message handling with clear
/// input/output contracts.
#[async_trait]
pub trait Handler<M: Message>: OrchestrationActor {
    /// The response type returned by this handler
    type Response: Send;

    /// Handle a message asynchronously
    ///
    /// Process the message and return a response. This method should be
    /// idempotent where possible and handle errors gracefully.
    async fn handle(&self, msg: M) -> TaskerResult<Self::Response>;
}

Key Design Decisions:

  1. async_trait: All message handling is asynchronous
  2. Type safety: Message and Response types are checked at compile time
  3. Multiple implementations: Actor can implement Handler for multiple message types
  4. Error propagation: TaskerResult ensures proper error handling

Message Trait

The marker trait for command messages:

/// Marker trait for command messages
///
/// All messages sent to actors must implement this trait. The associated
/// `Response` type defines what the handler will return.
pub trait Message: Send + 'static {
    /// The response type for this message
    type Response: Send;
}

Key Design Decisions:

  1. Marker trait: No methods, just type constraints
  2. Associated type: Response type is part of the message definition
  3. Send + 'static: Enables messages to cross thread boundaries

ActorRegistry

The central registry managing all orchestration actors, defined in tasker-orchestration/src/actors/registry.rs:

Purpose

The ActorRegistry serves as:

  1. Single Source of Truth: All actors are registered here
  2. Lifecycle Manager: Handles initialization and shutdown
  3. Dependency Injection: Provides SystemContext to all actors
  4. Type-Safe Access: Strongly-typed access to each actor

Structure

/// Registry managing all orchestration actors
///
/// The ActorRegistry holds Arc references to all actors in the system,
/// providing centralized access and lifecycle management.
#[derive(Clone)]
pub struct ActorRegistry {
    /// System context shared by all actors
    context: Arc<SystemContext>,

    /// Task request actor for processing task initialization requests
    pub task_request_actor: Arc<TaskRequestActor>,

    /// Result processor actor for processing step execution results
    pub result_processor_actor: Arc<ResultProcessorActor>,

    /// Step enqueuer actor for batch processing ready tasks
    pub step_enqueuer_actor: Arc<StepEnqueuerActor>,

    /// Task finalizer actor for task finalization with atomic claiming
    pub task_finalizer_actor: Arc<TaskFinalizerActor>,
}

Initialization

The build() method creates and initializes all actors:

impl ActorRegistry {
    pub async fn build(context: Arc<SystemContext>) -> TaskerResult<Self> {
        tracing::info!("Building ActorRegistry with actors");

        // Create shared StepEnqueuerService (used by multiple actors)
        let task_claim_step_enqueuer = StepEnqueuerService::new(context.clone()).await?;
        let task_claim_step_enqueuer = Arc::new(task_claim_step_enqueuer);

        // Create TaskRequestActor and its dependencies
        let task_initializer = Arc::new(TaskInitializer::new(
            context.clone(),
            task_claim_step_enqueuer.clone(),
        ));

        let task_request_processor = Arc::new(TaskRequestProcessor::new(
            context.message_client.clone(),
            context.task_handler_registry.clone(),
            task_initializer,
            TaskRequestProcessorConfig::default(),
        ));

        let mut task_request_actor = TaskRequestActor::new(context.clone(), task_request_processor);
        task_request_actor.started()?;
        let task_request_actor = Arc::new(task_request_actor);

        // Create ResultProcessorActor and its dependencies
        let task_finalizer = TaskFinalizer::new(context.clone(), task_claim_step_enqueuer.clone());
        let result_processor = Arc::new(OrchestrationResultProcessor::new(
            task_finalizer,
            context.clone(),
        ));

        let mut result_processor_actor =
            ResultProcessorActor::new(context.clone(), result_processor);
        result_processor_actor.started()?;
        let result_processor_actor = Arc::new(result_processor_actor);

        // Create StepEnqueuerActor using shared StepEnqueuerService
        let mut step_enqueuer_actor =
            StepEnqueuerActor::new(context.clone(), task_claim_step_enqueuer.clone());
        step_enqueuer_actor.started()?;
        let step_enqueuer_actor = Arc::new(step_enqueuer_actor);

        // Create TaskFinalizerActor using shared StepEnqueuerService
        let task_finalizer = TaskFinalizer::new(context.clone(), task_claim_step_enqueuer.clone());
        let mut task_finalizer_actor = TaskFinalizerActor::new(context.clone(), task_finalizer);
        task_finalizer_actor.started()?;
        let task_finalizer_actor = Arc::new(task_finalizer_actor);

        tracing::info!("✅ ActorRegistry built successfully with 4 actors");

        Ok(Self {
            context,
            task_request_actor,
            result_processor_actor,
            step_enqueuer_actor,
            task_finalizer_actor,
        })
    }
}

Shutdown

The shutdown() method gracefully stops all actors:

impl ActorRegistry {
    pub async fn shutdown(&mut self) {
        tracing::info!("Shutting down ActorRegistry");

        // Call stopped() on all actors in reverse initialization order
        if let Some(actor) = Arc::get_mut(&mut self.task_finalizer_actor) {
            if let Err(e) = actor.stopped() {
                tracing::error!(error = %e, "Failed to stop TaskFinalizerActor");
            }
        }

        if let Some(actor) = Arc::get_mut(&mut self.step_enqueuer_actor) {
            if let Err(e) = actor.stopped() {
                tracing::error!(error = %e, "Failed to stop StepEnqueuerActor");
            }
        }

        if let Some(actor) = Arc::get_mut(&mut self.result_processor_actor) {
            if let Err(e) = actor.stopped() {
                tracing::error!(error = %e, "Failed to stop ResultProcessorActor");
            }
        }

        if let Some(actor) = Arc::get_mut(&mut self.task_request_actor) {
            if let Err(e) = actor.stopped() {
                tracing::error!(error = %e, "Failed to stop TaskRequestActor");
            }
        }

        tracing::info!("✅ ActorRegistry shutdown complete");
    }
}

Implemented Actors

TaskRequestActor

Handles task initialization requests from external clients.

Location: tasker-orchestration/src/actors/task_request_actor.rs

Message: ProcessTaskRequestMessage

  • Input: TaskRequestMessage with task details
  • Response: Uuid of created task

Delegation: Wraps TaskRequestProcessor service

Purpose: Entry point for new workflow instances, coordinates task creation and initial step discovery.
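
As a sketch of how such a message could be declared against the Message trait above (the request field name follows the command-processor example later in this document; the exact shape is illustrative):

// Illustrative only: a message whose response is the UUID of the created task.
pub struct ProcessTaskRequestMessage {
    pub request: TaskRequestMessage,
}

impl Message for ProcessTaskRequestMessage {
    // Matches the description above: the handler returns the new task's Uuid.
    type Response = Uuid;
}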

ResultProcessorActor

Processes step execution results from workers.

Location: tasker-orchestration/src/actors/result_processor_actor.rs

Message: ProcessStepResultMessage

  • Input: StepExecutionResult with execution outcome
  • Response: () (unit type)

Delegation: Wraps OrchestrationResultProcessor service

Purpose: Handles step completion, coordinates task finalization when appropriate.

StepEnqueuerActor

Manages batch processing of ready tasks.

Location: tasker-orchestration/src/actors/step_enqueuer_actor.rs

Message: ProcessBatchMessage

  • Input: Empty (uses system state)
  • Response: StepEnqueuerServiceResult with batch stats

Delegation: Wraps StepEnqueuerService

Purpose: Discovers ready tasks and enqueues their executable steps.

TaskFinalizerActor

Handles task finalization with atomic claiming.

Location: tasker-orchestration/src/actors/task_finalizer_actor.rs

Message: FinalizeTaskMessage

  • Input: task_uuid to finalize
  • Response: FinalizationResult with action taken

Delegation: Wraps TaskFinalizer service (decomposed into focused components)

Purpose: Completes or fails tasks based on step execution results, prevents race conditions through atomic claiming.

Integration with Commands

Command Processor Integration

The command processor calls actors directly without intermediate wrapper layers:

// From: tasker-orchestration/src/orchestration/command_processor.rs

/// Handle task initialization using TaskRequestActor directly
async fn handle_initialize_task(
    &self,
    request: TaskRequestMessage,
) -> TaskerResult<TaskInitializeResult> {
    // Direct actor-based task initialization
    let msg = ProcessTaskRequestMessage { request };
    let task_uuid = self.actors.task_request_actor.handle(msg).await?;

    Ok(TaskInitializeResult::Success {
        task_uuid,
        message: "Task initialized successfully".to_string(),
    })
}

/// Handle step result processing using ResultProcessorActor directly
async fn handle_process_step_result(
    &self,
    step_result: StepExecutionResult,
) -> TaskerResult<StepProcessResult> {
    // Direct actor-based step result processing
    let msg = ProcessStepResultMessage {
        result: step_result.clone(),
    };

    match self.actors.result_processor_actor.handle(msg).await {
        Ok(()) => Ok(StepProcessResult::Success {
            message: format!(
                "Step {} result processed successfully",
                step_result.step_uuid
            ),
        }),
        Err(e) => Ok(StepProcessResult::Error {
            message: format!("Failed to process step result: {e}"),
        }),
    }
}

/// Handle task finalization using TaskFinalizerActor directly
async fn handle_finalize_task(&self, task_uuid: Uuid) -> TaskerResult<TaskFinalizationResult> {
    // Direct actor-based task finalization
    let msg = FinalizeTaskMessage { task_uuid };

    let result = self.actors.task_finalizer_actor.handle(msg).await?;

    Ok(TaskFinalizationResult::Success {
        task_uuid: result.task_uuid,
        final_status: format!("{:?}", result.action),
        completion_time: Some(chrono::Utc::now()),
    })
}

Design Evolution: Initially planned to use lifecycle_services/ as a wrapper layer between command processor and actors. After implementing Phase 7 service decomposition, we found direct actor calls were simpler and cleaner, so we removed the intermediate layer.

Service Decomposition (Phase 7)

Large services (800-900 lines) were decomposed into focused components following single responsibility principle:

TaskFinalizer Decomposition

task_finalization/ (848 lines → 6 files)
├── mod.rs                          # Public API and types
├── service.rs                      # Main TaskFinalizer service (~200 lines)
├── completion_handler.rs           # Task completion logic
├── event_publisher.rs              # Lifecycle event publishing
├── execution_context_provider.rs   # Context fetching
└── state_handlers.rs               # State-specific handling

StepEnqueuerService Decomposition

step_enqueuer_services/ (781 lines → 4 files)
├── mod.rs                          # Public API
├── service.rs                      # Main service (~250 lines)
├── batch_processor.rs              # Batch processing logic
└── state_handlers.rs               # State-specific processing

ResultProcessor Decomposition

result_processing/ (889 lines → 5 files)
├── mod.rs                          # Public API
├── service.rs                      # Main processor
├── metadata_processor.rs           # Metadata handling
├── error_handler.rs                # Error processing
└── result_validator.rs             # Result validation

Actor Lifecycle

Lifecycle Phases

┌─────────────────┐
│  Construction   │  ActorRegistry::build() creates actor instances
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Initialization  │  started() hook called on each actor
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Operation     │  Actors process messages via Handler<M>::handle()
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Shutdown     │  stopped() hook called on each actor (reverse order)
└─────────────────┘

Example Actor Implementation

use tasker_orchestration::actors::{OrchestrationActor, Handler, Message};

// Define the actor
pub struct TaskFinalizerActor {
    context: Arc<SystemContext>,
    service: TaskFinalizer,
}

// Implement base actor trait
impl OrchestrationActor for TaskFinalizerActor {
    fn name(&self) -> &'static str {
        "TaskFinalizerActor"
    }

    fn context(&self) -> &Arc<SystemContext> {
        &self.context
    }

    fn started(&mut self) -> TaskerResult<()> {
        tracing::info!("TaskFinalizerActor starting");
        Ok(())
    }

    fn stopped(&mut self) -> TaskerResult<()> {
        tracing::info!("TaskFinalizerActor stopping");
        Ok(())
    }
}

// Define message type
pub struct FinalizeTaskMessage {
    pub task_uuid: Uuid,
}

impl Message for FinalizeTaskMessage {
    type Response = FinalizationResult;
}

// Implement message handler
#[async_trait]
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
    type Response = FinalizationResult;

    async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
        tracing::debug!(
            actor = %self.name(),
            task_uuid = %msg.task_uuid,
            "Processing FinalizeTaskMessage"
        );

        // Delegate to service
        self.service.finalize_task(msg.task_uuid).await
            .map_err(|e| e.into())
    }
}

Benefits

1. Consistency

All lifecycle components follow the same pattern:

  • Uniform initialization via started()
  • Uniform cleanup via stopped()
  • Uniform message handling via Handler<M>

2. Type Safety

Messages and responses are type-checked at compile time:

// Compile error if message/response types don't match
impl Handler<WrongMessage> for TaskFinalizerActor {
    type Response = WrongResponse;  // ❌ Won't compile
    // ...
}

3. Testability

  • Clear message boundaries for mocking
  • Isolated actor lifecycle for unit tests
  • Type-safe message construction

4. Maintainability

  • Clear separation of concerns
  • Explicit message contracts
  • Centralized lifecycle management
  • Decomposed services (<300 lines per file)

5. Simplicity

  • Direct actor calls (no wrapper layers)
  • Pure routing in command processor
  • Easy to trace message flow

Summary

The actor-based architecture provides a consistent, type-safe foundation for lifecycle component management in tasker-core. Key takeaways:

  1. Lightweight Pattern: Actors wrap decomposed services, providing message-based interface
  2. Lifecycle Management: Consistent initialization and shutdown via traits
  3. Type Safety: Compile-time verification of message contracts
  4. Service Decomposition: Focused components following single responsibility principle
  5. Direct Integration: Command processor calls actors directly without intermediate wrappers
  6. Production Ready: All phases complete, zero breaking changes, full test coverage

The architecture provides a solid foundation for future scalability and maintainability while maintaining the proven reliability of existing orchestration logic.


← Back to Documentation Hub

Backpressure Architecture

Last Updated: 2026-02-05 · Audience: Architects, Developers, Operations · Status: Active · Related Docs: Worker Event Systems | MPSC Channel Guidelines

← Back to Documentation Hub


This document provides the unified backpressure strategy for tasker-core, covering all system components from API ingestion through worker execution.

Core Principle

Step idempotency is the primary constraint. Any backpressure mechanism must ensure that step claiming, business logic execution, and result persistence remain stable and consistent. The system must gracefully degrade under load without compromising workflow correctness.

System Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         BACKPRESSURE FLOW OVERVIEW                           │
└─────────────────────────────────────────────────────────────────────────────┘

                            ┌─────────────────┐
                            │  External Client │
                            └────────┬────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [1] API LAYER BACKPRESSURE      │
                    │  • Circuit breaker (503)         │
                    │  • System overload (503)         │
                    │  • Request validation            │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [2] ORCHESTRATION BACKPRESSURE  │
                    │  • Command channel (bounded)     │
                    │  • Connection pool limits        │
                    │  • PGMQ depth checks             │
                    └────────────────┬────────────────┘
                                     │
                         ┌───────────┴───────────┐
                         │     PGMQ Queues       │
                         │  • Namespace queues   │
                         │  • Result queues      │
                         └───────────┬───────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [3] WORKER BACKPRESSURE         │
                    │  • Claim capacity check          │
                    │  • Semaphore-bounded handlers    │
                    │  • Completion channel bounds     │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [4] RESULT FLOW BACKPRESSURE    │
                    │  • Completion channel bounds     │
                    │  • Domain event drop semantics   │
                    └─────────────────────────────────┘

Backpressure Points by Component

1. API Layer

The API layer provides backpressure through 503 responses with intelligent Retry-After headers.

Rate Limiting (429): Rate limiting is intentionally out of scope for tasker-core. This responsibility belongs to upstream infrastructure (API Gateways, NLB/ALB, service mesh). Tasker focuses on system health-based backpressure via 503 responses.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Circuit Breaker | Implemented | Return 503 when database breaker open |
| System Overload | Planned | Return 503 when queue/channel saturation detected |
| Request Validation | Implemented | Return 400 for invalid requests |

Response Codes:

  • 200 OK - Request accepted
  • 400 Bad Request - Invalid request format
  • 503 Service Unavailable - System overloaded (includes Retry-After header)

503 Response Triggers:

  1. Circuit Breaker Open: Database operations failing repeatedly
  2. Queue Depth High (Planned): PGMQ namespace queues approaching capacity
  3. Channel Saturation (Planned): Command channel buffer > 80% full

Retry-After Header Strategy:

503 Service Unavailable
Retry-After: {calculated_delay_seconds}

Calculation:
- Circuit breaker open: Use breaker timeout (default 30s)
- Queue depth high: Estimate based on processing rate
- Channel saturation: Short delay (5-10s) for buffer drain
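
A minimal sketch of that calculation in Rust, assuming an OverloadReason enum and the default 30-second breaker timeout (names are illustrative, not the actual API):

// Illustrative only: map an overload reason to a Retry-After value in seconds.
enum OverloadReason {
    CircuitBreakerOpen { breaker_timeout_secs: u64 },
    QueueDepthHigh { backlog: u64, drain_rate_per_sec: u64 },
    ChannelSaturation,
}

fn retry_after_seconds(reason: OverloadReason) -> u64 {
    match reason {
        // Circuit breaker open: wait out the breaker timeout (default 30s).
        OverloadReason::CircuitBreakerOpen { breaker_timeout_secs } => breaker_timeout_secs,
        // Queue depth high: rough estimate from backlog and processing rate.
        OverloadReason::QueueDepthHigh { backlog, drain_rate_per_sec } => {
            (backlog / drain_rate_per_sec.max(1)).clamp(5, 300)
        }
        // Channel saturation: short delay while buffers drain.
        OverloadReason::ChannelSaturation => 10,
    }
}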

Configuration:

# config/tasker/base/common.toml
[common.circuit_breakers.component_configs.web]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)

2. Orchestration Layer

The orchestration layer protects internal processing from command flooding.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Command Channel | Implemented | Bounded MPSC with monitoring |
| Connection Pool | Implemented | Database connection limits |
| PGMQ Depth Check | Planned | Reject enqueue when queue too deep |

Command Channel Backpressure:

Command Sender → [Bounded Channel] → Command Processor
                      │
                      └── If full: Block with timeout → Reject
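
A sketch of the "block with timeout, then reject" behavior using a bounded tokio MPSC channel (the command type and error mapping are illustrative, not the actual orchestration code):

use std::time::Duration;
use tokio::sync::mpsc;

// Illustrative only: enqueue a command, blocking briefly if the buffer is full,
// and return a retriable rejection instead of blocking indefinitely.
async fn enqueue_command<T>(
    tx: &mpsc::Sender<T>,
    command: T,
    wait: Duration,
) -> Result<(), String> {
    match tx.send_timeout(command, wait).await {
        Ok(()) => Ok(()),
        // Buffer stayed full for the whole wait: reject so callers can back off (503 upstream).
        Err(mpsc::error::SendTimeoutError::Timeout(_)) => Err("command channel saturated".into()),
        Err(mpsc::error::SendTimeoutError::Closed(_)) => Err("command processor shut down".into()),
    }
}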

Configuration:

# config/tasker/base/orchestration.toml
[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000

[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000

3. Messaging Layer

The messaging layer provides the backbone between orchestration and workers. Provider-agnostic via MessageClient, supporting PGMQ (default) and RabbitMQ backends.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Visibility Timeout | Implemented | Messages return to queue after timeout |
| Batch Size Limits | Implemented | Bounded message reads |
| Queue Depth Check | Planned | Reject enqueue when depth exceeded |
| Messaging Circuit Breaker | Implemented | Fast-fail send/receive when provider unhealthy |

Messaging Circuit Breaker: MessageClient wraps send/receive operations with circuit breaker protection. When the messaging provider (PGMQ or RabbitMQ) fails repeatedly, the breaker opens and returns MessagingError::CircuitBreakerOpen immediately, preventing slow timeouts from cascading into orchestration and worker processing loops. Ack/nack and health check operations bypass the breaker — ack/nack failures are safe (visibility timeout handles redelivery), and health check must work when the breaker is open to detect recovery. See Circuit Breakers for details.
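
The fast-fail path has roughly the shape sketched below; the breaker and provider types are hypothetical stand-ins for the real MessageClient internals, but the decision order (check breaker, send, record outcome) is the point:

// Illustrative only: gate a send on breaker state, record the outcome,
// and let ack/nack and health checks bypass the breaker entirely.
async fn send_with_breaker(
    breaker: &CircuitBreaker,          // hypothetical breaker handle
    provider: &dyn MessagingProvider,  // hypothetical provider trait
    queue: &str,
    payload: &[u8],
) -> Result<(), MessagingError> {
    if breaker.is_open() {
        // Fast fail: no slow provider timeout on the critical path.
        return Err(MessagingError::CircuitBreakerOpen);
    }
    match provider.send(queue, payload).await {
        Ok(()) => {
            breaker.record_success();
            Ok(())
        }
        Err(e) => {
            breaker.record_failure();
            Err(e)
        }
    }
}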

Queue Depth Monitoring (Planned):

The system will work with PGMQ’s native capabilities rather than enforcing arbitrary limits. Queue depth monitoring provides visibility without hard rejection:

┌──────────────────────────────────────────────────────────────────────┐
│ QUEUE DEPTH STRATEGY                                                  │
├──────────────────────────────────────────────────────────────────────┤
│ Level    │ Depth Ratio │ Action                                       │
├──────────────────────────────────────────────────────────────────────┤
│ Normal   │ < 70%       │ Normal operation                             │
│ Warning  │ 70-85%      │ Log warning, emit metric                     │
│ Critical │ 85-95%      │ API returns 503 for new tasks                │
│ Overflow │ > 95%       │ API rejects all writes, alert operators      │
└──────────────────────────────────────────────────────────────────────┘

Note: Depth ratio = current_depth / configured_soft_limit
Soft limit is advisory, not a hard PGMQ constraint.
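
A self-contained sketch of that classification, assuming an advisory soft limit (thresholds mirror the table above; names are illustrative):

// Illustrative only: classify queue depth against a configured soft limit.
#[derive(Debug, PartialEq)]
enum QueueDepthLevel {
    Normal,   // < 70% of soft limit
    Warning,  // 70-85%: log warning, emit metric
    Critical, // 85-95%: API returns 503 for new tasks
    Overflow, // > 95%: reject writes, alert operators
}

fn classify_depth(current_depth: u64, soft_limit: u64) -> QueueDepthLevel {
    let ratio = current_depth as f64 / soft_limit.max(1) as f64;
    match ratio {
        r if r > 0.95 => QueueDepthLevel::Overflow,
        r if r > 0.85 => QueueDepthLevel::Critical,
        r if r > 0.70 => QueueDepthLevel::Warning,
        _ => QueueDepthLevel::Normal,
    }
}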

Portability Considerations:

  • Queue depth semantics vary by backend (PGMQ vs RabbitMQ vs SQS)
  • Configuration is backend-agnostic where possible
  • Backend-specific tuning goes in backend-specific config sections

Configuration:

# config/tasker/base/common.toml
[common.queues]
default_visibility_timeout_seconds = 30

[common.queues.pgmq]
poll_interval_ms = 250

[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000

# Messaging circuit breaker
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)

4. Worker Layer

The worker layer protects handler execution from saturation.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Semaphore-Bounded Dispatch | Implemented | Max concurrent handlers |
| Claim Capacity Check | Planned | Refuse claims when at capacity |
| Handler Timeout | Implemented | Kill stuck handlers |
| Completion Channel | Implemented | Bounded result buffer |

Handler Dispatch Flow:

Step Message
     │
     ▼
┌─────────────────┐
│ Capacity Check  │──── At capacity? ──── Leave in queue
│ (Planned)       │                       (visibility timeout)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Acquire Permit  │
│ (Semaphore)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Execute Handler │
│ (with timeout)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Release Permit  │──── BEFORE sending to completion channel
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Send Completion │
└─────────────────┘
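
The permit/timeout/release ordering can be sketched with tokio primitives; the handler is a generic future and the result types are illustrative (the real dispatcher differs):

use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::timeout};

// Illustrative only: bound concurrent handlers with a semaphore, enforce a
// handler timeout, and release the permit before the completion send.
async fn dispatch_step<F>(
    permits: Arc<Semaphore>,
    handler_timeout: Duration,
    run_handler: F,
) -> Result<String, String>
where
    F: std::future::Future<Output = Result<String, String>>,
{
    // Acquire a permit; this waits when max_concurrent_handlers is reached.
    let permit = permits
        .acquire_owned()
        .await
        .map_err(|_| "semaphore closed".to_string())?;

    // A timeout produces a FAILURE result, never a dropped step.
    let result = match timeout(handler_timeout, run_handler).await {
        Ok(res) => res,
        Err(_) => Err("handler timed out".to_string()),
    };

    // Release the permit BEFORE the caller sends on the completion channel,
    // so a slow completion channel cannot starve handler capacity.
    drop(permit);
    result
}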

Configuration:

# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000

5. Domain Events

Domain events use fire-and-forget semantics to avoid blocking the critical path.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Try-Send | Implemented | Non-blocking send |
| Drop on Full | Implemented | Events dropped if channel full |
| Metrics | Planned | Track dropped events |

Domain Event Flow:

Handler Complete
     │
     ├── Result → Completion Channel (blocking, must succeed)
     │
     └── Domain Events → try_send() → If full: DROP with metric
                                       │
                                       └── Step execution NOT affected
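
A sketch of the fire-and-forget send using tokio's try_send (the event type and dropped-event counter are illustrative):

use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc::{self, error::TrySendError};

// Illustrative only: domain events never block the critical path. If the
// channel is full (or closed), the event is dropped and a counter records it.
fn publish_domain_event<E>(tx: &mpsc::Sender<E>, event: E, dropped_events: &AtomicU64) {
    match tx.try_send(event) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) | Err(TrySendError::Closed(_)) => {
            // Step execution is NOT affected; only observability degrades.
            dropped_events.fetch_add(1, Ordering::Relaxed);
        }
    }
}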

Segmentation of Responsibility

Orchestration System

The orchestration system must protect itself from:

  1. Client overload: Too many /v1/tasks requests
  2. Internal saturation: Command channel overflow
  3. Database exhaustion: Connection pool depletion
  4. Queue explosion: Unbounded PGMQ growth

Backpressure Response Hierarchy:

  1. Return 503 to client with Retry-After (fastest, cheapest)
  2. Block at command channel (internal buffering)
  3. Soft-reject at queue depth threshold (503 to new tasks)
  4. Circuit breaker opens (stop accepting work)

Worker System

The worker system must protect itself from:

  1. Handler saturation: Too many concurrent handlers
  2. FFI backlog: Ruby/Python handlers falling behind
  3. Completion overflow: Results backing up
  4. Step starvation: Claims outpacing processing

Backpressure Response Hierarchy:

  1. Refuse step claim (leave in queue, visibility timeout)
  2. Block at dispatch channel (internal buffering)
  3. Block at completion channel (handler waits)
  4. Circuit breaker opens (stop claiming work)

Step Idempotency Guarantees

Safe Backpressure Points

These backpressure points preserve step idempotency:

| Point | Why Safe |
| --- | --- |
| API 503 rejection | Task not yet created |
| Queue depth soft-limit | Step not yet enqueued |
| Step claim refusal | Message stays in queue, visibility timeout protects |
| Handler dispatch channel full | Step claimed but execution queued |
| Completion channel backpressure | Handler completed, result buffered |

Unsafe Patterns (NEVER DO)

| Pattern | Risk | Mitigation |
| --- | --- | --- |
| Drop step after claiming | Lost work | Always send result (success or failure) |
| Timeout during handler execution | Duplicate execution on retry | Handlers MUST be idempotent |
| Drop completion result | Orchestration unaware of completion | Completion channel blocks, never drops |
| Reset step state without visibility timeout | Race with other workers | Use PGMQ visibility timeout |

Idempotency Contract

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STEP EXECUTION IDEMPOTENCY CONTRACT                       │
└─────────────────────────────────────────────────────────────────────────────┘

1. CLAIM: Atomic via pgmq_read_specific_message()
   ├── Only one worker can claim a message
   ├── Visibility timeout protects against worker crash
   └── If claim fails: Message stays in queue → another worker claims

2. EXECUTE: Handler invocation (FFI boundary critical - see below)
   ├── Handlers SHOULD be idempotent (business logic recommendation)
   ├── Timeout generates FAILURE result (not drop)
   ├── Panic generates FAILURE result (not drop)
   └── Error generates FAILURE result (not drop)

3. PERSIST: Result submission
   ├── Completion channel is bounded but BLOCKING
   ├── Result MUST reach orchestration (never dropped)
   └── If send fails: Step remains "in_progress" → recovered by orchestration

4. FINALIZE: Orchestration processes result
   ├── State transition is atomic
   ├── Duplicate results handled by state guards
   └── Idempotent: Same result processed twice = same outcome

FFI Boundary Idempotency Semantics

The FFI boundary (Rust → Ruby/Python handler) creates a critical demarcation for error classification:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    FFI BOUNDARY ERROR CLASSIFICATION                         │
└─────────────────────────────────────────────────────────────────────────────┘

                           FFI BOUNDARY
                                │
    BEFORE FFI CROSSING         │         AFTER FFI CROSSING
    (System Layer)              │         (Business Logic Layer)
                                │
    ┌─────────────────────┐     │     ┌─────────────────────┐
    │ System errors are   │     │     │ System failures     │
    │ RETRYABLE:          │     │     │ are PERMANENT:      │
    │                     │     │     │                     │
    │ • Channel timeout   │     │     │ • Worker crash      │
    │ • Queue unavailable │     │     │ • FFI panic         │
    │ • Claim race lost   │     │     │ • Process killed    │
    │ • Network partition │     │     │                     │
    │ • Message malformed │     │     │ We cannot know if   │
    │                     │     │     │ business logic      │
    │ Step has NOT been   │     │     │ executed or not.    │
    │ handed to handler.  │     │     │                     │
    └─────────────────────┘     │     └─────────────────────┘
                                │
                                │     ┌─────────────────────┐
                                │     │ Developer errors    │
                                │     │ are TRUSTED:        │
                                │     │                     │
                                │     │ • RetryableError →  │
                                │     │   System retries    │
                                │     │                     │
                                │     │ • PermanentError →  │
                                │     │   Step fails        │
                                │     │                     │
                                │     │ Developer knows     │
                                │     │ their domain logic. │
                                │     └─────────────────────┘

Key Principles:

  1. Before FFI: Any system error is safe to retry because no business logic has executed.

  2. After FFI, system failure: If the worker crashes or FFI call fails after dispatch, we MUST treat it as permanent failure. We cannot know if the handler:

    • Never started (safe to retry)
    • Started but didn’t complete (unknown side effects)
    • Completed but didn’t return (work is done)
  3. After FFI, developer error: Trust the developer’s classification:

    • RetryableError: Developer explicitly signals safe to retry (e.g., temporary API unavailable)
    • PermanentError: Developer explicitly signals not retriable (e.g., invalid data, business rule violation)

Implementation Guidance:

// BEFORE FFI - system error, retryable
match dispatch_to_handler(step).await {
    Err(DispatchError::ChannelFull) => StepResult::retryable("dispatch_channel_full"),
    Err(DispatchError::Timeout) => StepResult::retryable("dispatch_timeout"),
    Ok(ffi_handle) => {
        // AFTER FFI - different rules apply
        match ffi_handle.await {
            // System crash after FFI = permanent (unknown state)
            Err(FfiError::ProcessCrash) => StepResult::permanent("handler_crash"),
            Err(FfiError::Panic) => StepResult::permanent("handler_panic"),

            // Developer-returned errors = trust their classification
            Ok(HandlerResult::RetryableError(msg)) => StepResult::retryable(msg),
            Ok(HandlerResult::PermanentError(msg)) => StepResult::permanent(msg),
            Ok(HandlerResult::Success(data)) => StepResult::success(data),
        }
    }
}

Note: We RECOMMEND handlers be idempotent but cannot REQUIRE it—business logic is developer-controlled. The system provides visibility timeout protection and duplicate result handling, but ultimate idempotency responsibility lies with handler implementations.


Backpressure Decision Tree

Use this decision tree when designing new backpressure mechanisms:

                    ┌─────────────────────────┐
                    │ New Backpressure Point  │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │ Does this affect step   │
                    │ execution correctness?  │
                    └────────────┬────────────┘
                                 │
                   ┌─────────────┴─────────────┐
                   │                           │
                  Yes                          No
                   │                           │
                   ▼                           ▼
         ┌─────────────────┐         ┌─────────────────┐
         │ Can the work be │         │ Safe to drop    │
         │ retried safely? │         │ or timeout      │
         └────────┬────────┘         └─────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
       Yes                  No
        │                   │
        ▼                   ▼
  ┌───────────┐      ┌───────────────┐
  │ Use block │      │ MUST NOT DROP │
  │ or reject │      │ Block until   │
  │ (retriable│      │ success       │
  │ error)    │      └───────────────┘
  └───────────┘
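
The two branches that affect correctness map naturally onto the two send modes of a bounded channel. A minimal sketch using tokio::sync::mpsc; the WorkItem type and error string are illustrative, not Tasker types:

use tokio::sync::mpsc;

#[derive(Debug)]
struct WorkItem {
    step_uuid: String,
}

/// Retryable work: reject immediately when the channel is full so the caller
/// can surface a retryable error instead of silently dropping the item.
fn enqueue_retryable(tx: &mpsc::Sender<WorkItem>, item: WorkItem) -> Result<(), String> {
    tx.try_send(item).map_err(|_| "dispatch_channel_full".to_string())
}

/// Non-retryable work: block until the channel accepts it, because dropping
/// would lose work that cannot be safely re-executed.
async fn enqueue_must_not_drop(tx: &mpsc::Sender<WorkItem>, item: WorkItem) {
    // `send` awaits until capacity is available (or the receiver is gone).
    let _ = tx.send(item).await;
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<WorkItem>(2);
    enqueue_retryable(&tx, WorkItem { step_uuid: "a".into() }).unwrap();
    enqueue_must_not_drop(&tx, WorkItem { step_uuid: "b".into() }).await;
    while let Ok(item) = rx.try_recv() {
        println!("received {item:?}");
    }
}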

Configuration Reference

TOML Structure: Configuration files are organized as config/tasker/base/{common,worker,orchestration}.toml with environment overrides in config/tasker/environments/{test,development,production}/.

Complete Backpressure Configuration

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/common.toml - Shared settings
# ════════════════════════════════════════════════════════════════════════════

# Circuit breaker defaults (inherited by all component breakers)
[common.circuit_breakers.default_config]
failure_threshold = 5      # Failures before opening
timeout_seconds = 30       # Time in open state before half-open
success_threshold = 2      # Successes in half-open to close

# Web/API database circuit breaker
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2

# Messaging circuit breaker - PGMQ/RabbitMQ operations
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2

# Queue configuration
[common.queues]
default_visibility_timeout_seconds = 30

[common.queues.pgmq]
poll_interval_ms = 250

[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/orchestration.toml - Orchestration layer
# ════════════════════════════════════════════════════════════════════════════

[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000

[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000

[orchestration.mpsc_channels.event_channel]
event_channel_buffer_size = 10000

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/worker.toml - Worker layer
# ════════════════════════════════════════════════════════════════════════════

[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000        # Steps waiting for handler
completion_buffer_size = 1000      # Results waiting for orchestration
max_concurrent_handlers = 10       # Semaphore permits
handler_timeout_ms = 30000         # Max handler execution time

[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000        # FFI events waiting for Ruby/Python
completion_timeout_ms = 30000      # Time to wait for FFI completion
starvation_warning_threshold_ms = 10000  # Warn if event waits this long

# Planned:
# claim_capacity_threshold = 0.8   # Refuse claims at 80% capacity

Monitoring and Alerting

See Backpressure Monitoring Runbook for:

  • Key metrics to monitor
  • Alerting thresholds
  • Incident response procedures

Key Metrics Summary

MetricTypeAlert Threshold
api_requests_rejected_totalCounter> 10/min
circuit_breaker_stateGaugestate = open
mpsc_channel_saturationGauge> 80%
pgmq_queue_depthGauge> 80% of max
worker_claim_refusals_totalCounter> 10/min
handler_semaphore_wait_time_msHistogramp99 > 1000ms
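
As one example, mpsc_channel_saturation can be derived from a bounded tokio channel's remaining capacity. A minimal sketch; the gauge name and the wiring into a metrics backend are left to the deployment:

use tokio::sync::mpsc;

/// Saturation (0.0..=1.0) of a bounded tokio mpsc channel: used slots / buffer size.
fn channel_saturation<T>(tx: &mpsc::Sender<T>) -> f64 {
    let max = tx.max_capacity() as f64;    // configured buffer size
    let available = tx.capacity() as f64;  // currently free slots
    if max == 0.0 { 0.0 } else { (max - available) / max }
}

fn main() {
    let (tx, _rx) = mpsc::channel::<u64>(1000);
    // Report this value as the mpsc_channel_saturation gauge and alert above 0.8.
    println!("saturation = {:.2}", channel_saturation(&tx));
}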


← Back to Documentation Hub

Circuit Breakers

Last Updated: 2026-02-04 Audience: Architects, Operators, Developers Status: Active Related Docs: Backpressure Architecture | Observability | Operations: Backpressure Monitoring

← Back to Documentation Hub


Circuit breakers provide fault isolation and cascade prevention across tasker-core. This document covers the circuit breaker architecture, implementations, configuration, and operational monitoring.

Core Concept

Circuit breakers prevent cascading failures by failing fast when a component is unhealthy. Instead of waiting for slow or failing operations to timeout, circuit breakers detect failure patterns and immediately reject calls, giving the downstream system time to recover.

State Machine

┌─────────────────────────────────────────────────────────────────────────────┐
│                     CIRCUIT BREAKER STATE MACHINE                            │
└─────────────────────────────────────────────────────────────────────────────┘

                    Success
                  ┌─────────┐
                  │         │
                  ▼         │
              ┌───────┐     │
      ───────>│CLOSED │─────┘
              └───┬───┘
                  │
                  │ failure_threshold
                  │ consecutive failures
                  │
                  ▼
              ┌───────┐
              │ OPEN  │◄─────────────────────┐
              └───┬───┘                      │
                  │                          │
                  │ timeout_seconds          │ Any failure
                  │ elapsed                  │ in half-open
                  │                          │
                  ▼                          │
            ┌──────────┐                     │
            │HALF-OPEN │─────────────────────┘
            └────┬─────┘
                 │
                 │ success_threshold
                 │ consecutive successes
                 │
                 ▼
            ┌───────┐
            │CLOSED │
            └───────┘

States:

  • Closed: Normal operation. All calls allowed. Tracks consecutive failures.
  • Open: Failing fast. All calls rejected immediately. Waiting for timeout.
  • Half-Open: Testing recovery. Limited calls allowed. Single failure reopens.
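
A simplified sketch of these transition rules (counters only, no timestamps or atomics; the real implementation is lock-free and also tracks the open-state timeout):

#[derive(Debug, Clone, Copy, PartialEq)]
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

struct Breaker {
    state: CircuitState,
    consecutive_failures: u32,
    consecutive_successes: u32,
    failure_threshold: u32,
    success_threshold: u32,
}

impl Breaker {
    fn record_failure(&mut self) {
        match self.state {
            CircuitState::Closed => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = CircuitState::Open; // start failing fast
                }
            }
            // Any failure while testing recovery reopens the circuit.
            CircuitState::HalfOpen => self.state = CircuitState::Open,
            CircuitState::Open => {}
        }
    }

    fn record_success(&mut self) {
        match self.state {
            CircuitState::Closed => self.consecutive_failures = 0,
            CircuitState::HalfOpen => {
                self.consecutive_successes += 1;
                if self.consecutive_successes >= self.success_threshold {
                    // Enough proof of recovery: close and reset counters.
                    self.state = CircuitState::Closed;
                    self.consecutive_failures = 0;
                }
            }
            CircuitState::Open => {}
        }
    }

    /// Called once timeout_seconds has elapsed in the open state.
    fn on_timeout_elapsed(&mut self) {
        if self.state == CircuitState::Open {
            self.state = CircuitState::HalfOpen;
            self.consecutive_successes = 0;
        }
    }
}

fn main() {
    let mut breaker = Breaker {
        state: CircuitState::Closed,
        consecutive_failures: 0,
        consecutive_successes: 0,
        failure_threshold: 5,
        success_threshold: 2,
    };
    for _ in 0..5 { breaker.record_failure(); }  // CLOSED -> OPEN
    breaker.on_timeout_elapsed();                // OPEN -> HALF-OPEN
    breaker.record_success();
    breaker.record_success();                    // HALF-OPEN -> CLOSED
    assert_eq!(breaker.state, CircuitState::Closed);
}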

Unified Trait: CircuitBreakerBehavior

All circuit breaker implementations share a common trait defined in tasker-shared/src/resilience/behavior.rs:

#![allow(unused)]
fn main() {
pub trait CircuitBreakerBehavior: Send + Sync + Debug {
    fn name(&self) -> &str;
    fn state(&self) -> CircuitState;
    fn should_allow(&self) -> bool;
    fn record_success(&self, duration: Duration);
    fn record_failure(&self, duration: Duration);
    fn is_healthy(&self) -> bool;
    fn force_open(&self);
    fn force_closed(&self);
    fn metrics(&self) -> CircuitBreakerMetrics;
}
}

Each specialized breaker wraps the generic CircuitBreaker (composition pattern) and implements this trait. This means:

  • Consistent state machine behavior across all breakers
  • Proper half-open → closed recovery via success_threshold
  • Lock-free atomic state management
  • Domain-specific methods remain as additional methods on each type
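
A hedged sketch of the composition pattern: the specialized type holds the generic breaker and delegates all state bookkeeping to it. The ExampleMessagingBreaker name and the stubbed inner breaker are illustrative stand-ins, not the actual Tasker types:

use std::time::{Duration, Instant};

// Stand-in for the generic CircuitBreaker from tasker_shared::resilience;
// only the methods this sketch needs are stubbed here.
struct CircuitBreaker;

impl CircuitBreaker {
    fn should_allow(&self) -> bool { true }
    fn record_success(&self, _duration: Duration) {}
    fn record_failure(&self, _duration: Duration) {}
}

// Illustrative specialized breaker (not an actual Tasker type): wraps the
// generic breaker via composition and adds a domain-specific guarded call.
struct ExampleMessagingBreaker {
    inner: CircuitBreaker,
}

impl ExampleMessagingBreaker {
    /// Returns None when the circuit is open (fail fast), otherwise runs the
    /// operation and delegates success/failure bookkeeping to the inner breaker.
    fn guarded_call<T, E>(&self, op: impl FnOnce() -> Result<T, E>) -> Option<Result<T, E>> {
        if !self.inner.should_allow() {
            return None;
        }
        let started = Instant::now();
        let outcome = op();
        match &outcome {
            Ok(_) => self.inner.record_success(started.elapsed()),
            Err(_) => self.inner.record_failure(started.elapsed()),
        }
        Some(outcome)
    }
}

fn main() {
    let breaker = ExampleMessagingBreaker { inner: CircuitBreaker };
    let result = breaker.guarded_call(|| Ok::<_, String>("sent"));
    println!("{result:?}");
}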

Circuit Breaker Implementations

Tasker Core has four circuit breaker implementations, each protecting a specific component. All wrap the generic CircuitBreaker from tasker_shared::resilience:

Circuit BreakerLocationPurposeTrigger Type
Web Databasetasker-orchestrationAPI database operationsError-based
Task Readinesstasker-orchestrationFallback poller database checksError-based
FFI Completiontasker-workerRuby/Python handler completion channelLatency-based
Messagingtasker-sharedMessage queue operations (PGMQ/RabbitMQ)Error-based

1. Web Database Circuit Breaker

Purpose: Protects API endpoints from cascading database failures.

Scope: Independent from orchestration system’s internal operations.

Behavior:

  • Opens when database queries fail repeatedly
  • Returns 503 with Retry-After header when open
  • Fast-fail rejection with atomic state management

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.web]
failure_threshold = 5      # Consecutive failures before opening
success_threshold = 2      # Successes in half-open to fully close
# timeout_seconds inherited from default_config (30s)

Health Check Integration:

  • Included in /health/ready endpoint
  • State reported in /health/detailed response
  • Metric: api_circuit_breaker_state (0=closed, 1=half-open, 2=open)
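
A hedged sketch of the fast-fail rejection described above, using axum response types. The real handlers consult the shared WebDatabaseCircuitBreaker; here its decision is passed in as a plain boolean, and the 30-second Retry-After value is illustrative:

use axum::{
    http::{header::RETRY_AFTER, StatusCode},
    response::{IntoResponse, Response},
};

// Illustrative guard: returns a ready-made 503 response when the breaker is open.
fn reject_if_open(breaker_allows: bool) -> Result<(), Response> {
    if breaker_allows {
        return Ok(());
    }
    // Fail fast and tell clients when to retry instead of letting requests
    // pile up against a struggling database.
    Err((
        StatusCode::SERVICE_UNAVAILABLE,
        [(RETRY_AFTER, "30")],
        "database circuit breaker is open",
    )
        .into_response())
}

fn main() {
    let rejected = reject_if_open(false).unwrap_err();
    assert_eq!(rejected.status(), StatusCode::SERVICE_UNAVAILABLE);
}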

2. Task Readiness Circuit Breaker

Purpose: Protects fallback poller from database overload during polling cycles.

Scope: Independent from web circuit breaker, specific to task readiness queries.

Behavior:

  • Opens when task readiness queries fail repeatedly
  • Skips polling cycles when open (doesn’t fail-fast, just skips)
  • Allows orchestration to continue processing existing work

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10     # Higher threshold for polling
timeout_seconds = 60       # Longer recovery window
success_threshold = 3      # More successes needed for confidence

Why Separate from Web?:

  • Different failure patterns (polling vs request-driven)
  • Different recovery semantics (skip vs reject)
  • Isolation prevents web failures from stopping polling (and vice versa)

3. FFI Completion Circuit Breaker

Purpose: Protects Ruby/Python worker completion channels from backpressure.

Scope: Worker-specific, protects FFI boundary.

Behavior:

  • Latency-based: Treats slow sends (>100ms) as failures
  • Opens when completion channel is consistently slow
  • Prevents FFI threads from blocking on saturated channels
  • Drops completions when open (with metrics), allowing handler threads to continue

Configuration (config/tasker/base/worker.toml):

[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5            # Slow sends before opening
recovery_timeout_seconds = 5     # Short recovery window
success_threshold = 2            # Successes to close
slow_send_threshold_ms = 100     # Latency threshold (100ms)

Why Latency-Based?:

  • Slow channel sends indicate backpressure buildup
  • Blocking FFI threads can cascade to Ruby/Python handler starvation
  • Error-only detection misses slow-but-completing operations
  • Latency detection catches degradation before total failure

Metrics:

  • ffi_completion_slow_sends_total - Sends exceeding latency threshold
  • ffi_completion_circuit_open_rejections_total - Rejections due to open circuit
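
A minimal sketch of latency-based classification: time the send and record anything slower than the threshold as a failure, even if it succeeded. The RecordsOutcome trait is a stand-in for the CircuitBreakerBehavior trait shown earlier:

use std::time::{Duration, Instant};

const SLOW_SEND_THRESHOLD: Duration = Duration::from_millis(100);

// Stand-in for the CircuitBreakerBehavior trait; only the two recording
// methods matter for this sketch.
trait RecordsOutcome {
    fn record_success(&self, duration: Duration);
    fn record_failure(&self, duration: Duration);
}

/// Time the completion send and treat "slow but successful" as a failure, so
/// sustained backpressure opens the breaker before anything actually errors.
fn record_send_latency<B: RecordsOutcome>(breaker: &B, send: impl FnOnce()) {
    let started = Instant::now();
    send();
    let elapsed = started.elapsed();
    if elapsed > SLOW_SEND_THRESHOLD {
        breaker.record_failure(elapsed); // latency alone counts as a failure
    } else {
        breaker.record_success(elapsed);
    }
}

struct NoopBreaker;

impl RecordsOutcome for NoopBreaker {
    fn record_success(&self, _duration: Duration) {}
    fn record_failure(&self, _duration: Duration) {}
}

fn main() {
    record_send_latency(&NoopBreaker, || std::thread::sleep(Duration::from_millis(5)));
}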

4. Messaging Circuit Breaker

Purpose: Protects message queue operations from provider failures (PGMQ or RabbitMQ).

Scope: Integrated into MessageClient, shared across orchestration and worker messaging.

Behavior:

  • Opens when send/receive operations fail repeatedly
  • Protected operations: send_step_message, receive_step_messages, send_step_result, receive_step_results, send_task_request, receive_task_requests, send_task_finalization, receive_task_finalizations, send_message, receive_messages
  • Unprotected operations (safe to fail or needed for recovery): ack_message, nack_message, extend_visibility, health_check, ensure_queue, queue stats
  • Coordinates with visibility timeout for message safety
  • Provider-agnostic: works with both PGMQ and RabbitMQ backends

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes to close
# timeout_seconds inherited from default_config (30s)

Why ack/nack bypass the breaker?:

  • Ack/nack failure causes message redelivery via visibility timeout, which is safe
  • Health check must work when breaker is open to detect recovery
  • Queue management is startup-only and should not be gated

Configuration Reference

Global Settings

[common.circuit_breakers.global_settings]
metrics_collection_interval_seconds = 30    # Metrics aggregation interval
min_state_transition_interval_seconds = 5.0 # Debounce for rapid transitions

Default Configuration

Applied to any circuit breaker without explicit configuration:

[common.circuit_breakers.default_config]
failure_threshold = 5      # 1-100 range
timeout_seconds = 30       # 1-300 range
success_threshold = 2      # 1-50 range

Component-Specific Overrides

# Task readiness (polling-specific)
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10
success_threshold = 3

# Messaging operations (PGMQ/RabbitMQ)
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2

# Web/API database operations
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2

Note: timeout_seconds is inherited from default_config for all component circuit breakers. The pgmq key is accepted as an alias for messaging for backward compatibility.

Worker-Specific Configuration

# FFI completion (latency-based)
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5
recovery_timeout_seconds = 5
success_threshold = 2
slow_send_threshold_ms = 100

Environment Overrides

Different environments may need different thresholds:

Test (config/tasker/environments/test/common.toml):

[common.circuit_breakers.default_config]
failure_threshold = 2      # Faster failure detection
timeout_seconds = 5        # Quick recovery for tests
success_threshold = 1

Production (config/tasker/environments/production/common.toml):

[common.circuit_breakers.default_config]
failure_threshold = 10     # More tolerance for transient failures
timeout_seconds = 60       # Longer recovery window
success_threshold = 5      # More confidence before closing

Health Endpoint Integration

Circuit breaker states are exposed through health endpoints for monitoring and Kubernetes probes.

Orchestration Health (/health/detailed)

{
  "status": "healthy",
  "checks": {
    "circuit_breakers": {
      "status": "healthy",
      "message": "Circuit breaker state: Closed",
      "duration_ms": 1,
      "last_checked": "2025-12-10T10:00:00Z"
    }
  }
}

Worker Health (/health/detailed)

{
  "status": "healthy",
  "checks": {
    "circuit_breakers": {
      "status": "healthy",
      "message": "2 circuit breakers: 2 closed, 0 open, 0 half-open. Details: ffi_completion: closed (100 calls, 2 failures); task_readiness: closed (50 calls, 0 failures)",
      "duration_ms": 0,
      "last_checked": "2025-12-10T10:00:00Z"
    }
  }
}

Health Status Mapping

Circuit Breaker StateHealth StatusImpact
All ClosedhealthyNormal operation
Any Half-OpendegradedTesting recovery
Any OpenunhealthyFailing fast
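
A small sketch of the aggregation rule in this table (worst state wins). The enum and function names are illustrative, not the actual Tasker types:

#[derive(Clone, Copy, PartialEq)]
enum CircuitState { Closed, HalfOpen, Open }

#[derive(Debug, PartialEq)]
enum HealthStatus { Healthy, Degraded, Unhealthy }

/// Any open breaker makes the component unhealthy, any half-open breaker
/// degrades it, otherwise it is healthy.
fn aggregate_health(states: &[CircuitState]) -> HealthStatus {
    if states.iter().any(|s| *s == CircuitState::Open) {
        HealthStatus::Unhealthy
    } else if states.iter().any(|s| *s == CircuitState::HalfOpen) {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}

fn main() {
    let states = [CircuitState::Closed, CircuitState::HalfOpen];
    assert_eq!(aggregate_health(&states), HealthStatus::Degraded);
}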

Monitoring and Alerting

Key Metrics

MetricTypeDescription
api_circuit_breaker_stateGaugeWeb breaker state (0/1/2)
tasker_circuit_breaker_stateGaugePer-component state
api_requests_rejected_totalCounterRejections due to open breaker
ffi_completion_slow_sends_totalCounterSlow send detections
ffi_completion_circuit_open_rejections_totalCounterFFI breaker rejections

Prometheus Alerts

groups:
  - name: circuit_breakers
    rules:
      - alert: TaskerCircuitBreakerOpen
        expr: api_circuit_breaker_state == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker is OPEN"
          description: "Circuit breaker {{ $labels.component }} has been open for >1 minute"

      - alert: TaskerCircuitBreakerHalfOpen
        expr: api_circuit_breaker_state == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker stuck in half-open"
          description: "Circuit breaker {{ $labels.component }} in half-open state >5 minutes"

      - alert: TaskerFFISlowSendsHigh
        expr: rate(ffi_completion_slow_sends_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "FFI completion channel experiencing backpressure"
          description: "Slow sends averaging >10/second, circuit breaker may open"

Grafana Dashboard Panels

Circuit Breaker State Timeline:

Panel: Time series
Query: api_circuit_breaker_state
Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)

FFI Latency Percentiles:

Panel: Time series
Queries:
  - histogram_quantile(0.50, ffi_completion_send_duration_seconds_bucket)
  - histogram_quantile(0.95, ffi_completion_send_duration_seconds_bucket)
  - histogram_quantile(0.99, ffi_completion_send_duration_seconds_bucket)
Thresholds: 100ms warning, 500ms critical

Operational Procedures

When Circuit Breaker Opens

Immediate Actions:

  1. Check database connectivity: pg_isready -h <host> -p 5432
  2. Check connection pool status: /health/detailed endpoint
  3. Review recent error logs for root cause
  4. Monitor queue depth for message backlog

Recovery:

  • Circuit automatically tests recovery after timeout_seconds
  • No manual intervention needed for transient failures
  • For persistent failures, fix underlying issue first

Escalation:

  • If breaker stays open >5 minutes, escalate to database team
  • If breaker oscillates (open/half-open/open), increase failure_threshold

Tuning Guidelines

Symptom: Breaker opens too frequently

  • Increase failure_threshold
  • Investigate root cause of failures
  • Consider if failures are transient vs systemic

Symptom: Breaker stays open too long

  • Decrease timeout_seconds
  • Verify downstream system has recovered
  • Check if success_threshold is too high

Symptom: FFI breaker opens unnecessarily

  • Increase slow_send_threshold_ms
  • Verify channel buffer sizes are adequate
  • Check Ruby/Python handler throughput

Architecture Integration

Relationship to Backpressure

Circuit breakers are one layer of the broader backpressure strategy:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        RESILIENCE LAYER STACK                                │
└─────────────────────────────────────────────────────────────────────────────┘

Layer 1: Circuit Breakers     → Fast-fail on component failure
Layer 2: Bounded Channels     → Backpressure on internal queues
Layer 3: Visibility Timeouts  → Message-level retry safety
Layer 4: Semaphore Limits     → Handler execution rate limiting
Layer 5: Connection Pools     → Database resource management

See Backpressure Architecture for the complete strategy.

Independence Principle

Each circuit breaker operates independently:

  • Web breaker can be open while task readiness breaker is closed
  • FFI breaker state doesn’t affect PGMQ breaker
  • Prevents single failure mode from cascading across components
  • Allows targeted recovery per component

Integration Points

ComponentCircuit BreakerIntegration Point
tasker-orchestration/src/webWeb DatabaseAPI request handlers
tasker-orchestration/src/orchestration/task_readinessTask ReadinessFallback poller loop
tasker-worker/src/worker/handlersFFI CompletionCompletion channel sends
tasker-shared/src/messaging/client.rsMessagingMessageClient send/receive methods

Troubleshooting

Common Issues

Issue: Web circuit breaker flapping (open → half-open → open rapidly)

Diagnosis:

  1. Check database query latency (slow queries can cause timeout failures)
  2. Review connection pool saturation
  3. Check if PostgreSQL is under memory pressure

Resolution:

  • Increase failure_threshold if failures are transient
  • Increase timeout_seconds to give more recovery time
  • Fix underlying database performance issues

Issue: FFI completion circuit breaker opens during normal load

Diagnosis:

  1. Check Ruby/Python handler execution time
  2. Review completion channel buffer utilization
  3. Verify worker concurrency settings

Resolution:

  • Increase slow_send_threshold_ms if handlers are legitimately slow
  • Increase channel buffer size in worker config
  • Reduce handler concurrency if system is overloaded

Issue: Task readiness breaker open but web API working fine

Diagnosis:

  • Task readiness queries may be slower/different than API queries
  • Polling may hit database at different times (e.g., during maintenance)

Resolution:

  • Independent breakers are working as designed
  • Check specific task readiness query performance
  • Consider database index optimization for readiness queries

Source Code Reference

ComponentFile
CircuitBreakerBehavior Traittasker-shared/src/resilience/behavior.rs
Generic CircuitBreakertasker-shared/src/resilience/circuit_breaker.rs
Circuit Breaker Configtasker-shared/src/config/circuit_breaker.rs
MessageClient (messaging breaker)tasker-shared/src/messaging/client.rs
WebDatabaseCircuitBreakertasker-orchestration/src/api_common/circuit_breaker.rs
Web CB Helperstasker-orchestration/src/web/circuit_breaker.rs
TaskReadinessCircuitBreakertasker-orchestration/src/orchestration/task_readiness/circuit_breaker.rs
FfiCompletionCircuitBreakertasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs
Worker Health Integrationtasker-worker/src/web/handlers/health.rs
Circuit Breaker Typestasker-shared/src/types/api/worker.rs

← Back to Documentation Hub

Crate Architecture

Last Updated: 2026-01-15 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Events and Commands | Quick Start

← Back to Documentation Hub


Overview

Tasker Core is organized as a Cargo workspace with 7 member crates, each with a specific responsibility in the workflow orchestration system. This document explains the role of each crate, their inter-dependencies, and how they work together to provide a complete orchestration solution.

Design Philosophy

The crate structure follows these principles:

  1. Separation of Concerns: Each crate has a well-defined responsibility
  2. Minimal Dependencies: Crates depend on the minimum necessary dependencies
  3. Shared Foundation: Common types and utilities in tasker-shared
  4. Language Flexibility: Support for multiple worker implementations (Rust, Ruby, Python planned)
  5. Production Ready: Workers and the orchestration system can be deployed and scaled independently

Workspace Structure

tasker-core/
├── tasker-pgmq/              # PGMQ wrapper with notification support
├── tasker-shared/            # Shared types, SQL functions, utilities
├── tasker-orchestration/     # Task coordination and lifecycle management
├── tasker-worker/            # Step execution and handler integration
├── tasker-client/            # API client library (REST + gRPC transport)
├── tasker-ctl/               # CLI binary (depends on tasker-client)
└── workers/
    ├── ruby/ext/tasker_core/ # Ruby FFI bindings
    └── rust/                 # Rust native worker

Crate Dependency Graph

┌─────────────────────────────────────────────────────────┐
│                   External Dependencies                 │
│  (sqlx, tokio, serde, pgmq, magnus, axum, etc.)       │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                    tasker-pgmq                          │
│  PGMQ wrapper with PostgreSQL LISTEN/NOTIFY            │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                    tasker-shared                        │
│  Core types, SQL functions, state machines             │
└─────────────────────────────────────────────────────────┘
                            │
               ┌────────────┴────────────┐
               │                         │
               ▼                         ▼
┌──────────────────────────┐  ┌──────────────────────────┐
│  tasker-orchestration    │  │    tasker-worker         │
│  Task coordination       │  │    Step execution        │
│  Lifecycle management    │  │    Handler integration   │
│  REST API                │  │    FFI support           │
└──────────────────────────┘  └──────────────────────────┘
               │                         │
               ▼                         │
┌──────────────────────────┐            │
│    tasker-client         │            │
│    API client library    │            │
│    REST + gRPC transport │            │
└──────────────────────────┘            │
               │                        │
               ▼                        │
┌──────────────────────────┐            │
│    tasker-ctl            │            │
│    CLI binary            │            │
└──────────────────────────┘            │
                                        │
               ┌────────────────────────┘
               │
      ┌────────┴────────┐
      ▼                 ▼
┌────────────┐  ┌────────────┐
│ workers/   │  │ workers/   │
│   ruby/    │  │   rust/    │
│   ext/     │  │            │
└────────────┘  └────────────┘

Core Crates

tasker-pgmq

Purpose: Wrapper around PostgreSQL Message Queue (PGMQ) with native PostgreSQL LISTEN/NOTIFY support

Location: tasker-pgmq/

Key Responsibilities:

  • Wrap pgmq crate with notification capabilities
  • Provide atomic pgmq_send_with_notify() operations
  • Handle notification channel management
  • Support namespace-aware queue naming

Public API:

#![allow(unused)]
fn main() {
pub struct PgmqClient {
    // Send message with atomic notification
    pub async fn send_with_notify<T>(&self, queue: &str, msg: T) -> Result<i64>;

    // Read message with visibility timeout
    pub async fn read<T>(&self, queue: &str, vt: i32) -> Result<Option<Message<T>>>;

    // Delete processed message
    pub async fn delete(&self, queue: &str, msg_id: i64) -> Result<bool>;
}
}
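
A hedged usage sketch that follows the signatures in the summary above. Queue name, payload type, and error boxing are illustrative assumptions, and client construction is omitted because it is not shown in the summary:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct StepPayload {
    step_uuid: String,
}

// Assumes an already-constructed client handle.
async fn roundtrip(client: &tasker_pgmq::PgmqClient) -> Result<(), Box<dyn std::error::Error>> {
    // Atomic send + pg_notify so listeners wake immediately.
    let msg_id = client
        .send_with_notify("fulfillment_queue", StepPayload { step_uuid: "abc".into() })
        .await?;

    // Read with a 30-second visibility timeout; the message becomes visible
    // again if it is not deleted in time.
    if let Some(_message) = client.read::<StepPayload>("fulfillment_queue", 30).await? {
        // ... handle the payload, then acknowledge by deleting it.
        client.delete("fulfillment_queue", msg_id).await?;
    }
    Ok(())
}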

When to Use:

  • When you need reliable message queuing with PostgreSQL
  • When you need atomic send + notify operations
  • When building event-driven systems on PostgreSQL

Dependencies:

  • pgmq - Core PostgreSQL message queue functionality
  • sqlx - Database connectivity
  • tokio - Async runtime

tasker-shared

Purpose: Foundation crate containing all shared types, utilities, and SQL function interfaces

Location: tasker-shared/

Key Responsibilities:

  • Core domain models (Task, WorkflowStep, TaskTransition, etc.)
  • State machine implementations (Task + Step)
  • SQL function executor and registry
  • Database utilities and migrations
  • Event system traits and types
  • Messaging abstraction layer: Provider-agnostic messaging with PGMQ, RabbitMQ, and InMemory backends
  • Factory system for testing
  • Metrics and observability primitives

Public API:

#![allow(unused)]
fn main() {
// Core Models
pub mod models {
    pub struct Task { /* ... */ }
    pub struct WorkflowStep { /* ... */ }
    pub struct TaskTransition { /* ... */ }
    pub struct WorkflowStepTransition { /* ... */ }
}

// State Machines
pub mod state_machine {
    pub struct TaskStateMachine { /* ... */ }
    pub struct StepStateMachine { /* ... */ }
    pub enum TaskState { /* 12 states */ }
    pub enum WorkflowStepState { /* 9 states */ }
}

// SQL Functions
pub mod database {
    pub struct SqlFunctionExecutor { /* ... */ }
    pub async fn get_step_readiness_status(...) -> Result<Vec<StepReadinessStatus>>;
    pub async fn get_next_ready_tasks(...) -> Result<Vec<ReadyTaskInfo>>;
}

// Event System
pub mod event_system {
    pub trait EventDrivenSystem { /* ... */ }
    pub enum DeploymentMode { Hybrid, EventDrivenOnly, PollingOnly }
}

// Messaging
pub mod messaging {
    // Provider abstraction
    pub enum MessagingProvider { Pgmq, RabbitMq, InMemory }
    pub trait MessagingService { /* send_message, receive_messages, ack_message, ... */ }
    pub trait SupportsPushNotifications { /* subscribe, subscribe_many, requires_fallback_polling */ }
    pub enum MessageNotification { Available { ... }, Message(...) }

    // Domain client
    pub struct MessageClient { /* High-level queue operations */ }

    // Message types
    pub struct SimpleStepMessage { /* ... */ }
    pub struct TaskRequestMessage { /* ... */ }
    pub struct StepExecutionResult { /* ... */ }
}
}

When to Use:

  • Always - This is the foundation for all other crates
  • When you need core domain models
  • When you need state machine logic
  • When you need SQL function access
  • When you need testing factories

Dependencies:

  • tasker-pgmq - Message queue operations
  • sqlx - Database operations
  • serde - Serialization
  • Many workspace-shared dependencies

Why It’s Separate:

  • Eliminates circular dependencies between orchestration and worker
  • Provides single source of truth for domain models
  • Enables independent testing of core logic
  • Allows multiple implementations (orchestration vs worker) to share code

tasker-orchestration

Purpose: Task coordination, lifecycle management, and orchestration REST API

Location: tasker-orchestration/

Key Responsibilities:

  • Actor-based lifecycle coordination
  • Task initialization and finalization
  • Step discovery and enqueueing
  • Result processing from workers
  • Dynamic executor pool management
  • Event-driven coordination
  • REST API endpoints
  • Health monitoring
  • Metrics collection

Public API:

#![allow(unused)]
fn main() {
// Core orchestration
pub struct OrchestrationCore {
    pub async fn new() -> Result<Self>;
    pub async fn from_config(config: ConfigManager) -> Result<Self>;
}

// Actor-based coordination
pub mod actors {
    pub struct ActorRegistry { /* ... */ }
    pub struct TaskRequestActor { /* ... */ }
    pub struct ResultProcessorActor { /* ... */ }
    pub struct StepEnqueuerActor { /* ... */ }
    pub struct TaskFinalizerActor { /* ... */ }

    pub trait OrchestrationActor { /* ... */ }
    pub trait Handler<M: Message> { /* ... */ }
    pub trait Message { /* ... */ }
}

// Lifecycle services (wrapped by actors)
pub mod lifecycle {
    pub struct TaskInitializer { /* ... */ }
    pub struct StepEnqueuerService { /* ... */ }
    pub struct OrchestrationResultProcessor { /* ... */ }
    pub struct TaskFinalizer { /* ... */ }
}

// Message hydration (Phase 4)
pub mod hydration {
    pub struct StepResultHydrator { /* ... */ }
    pub struct TaskRequestHydrator { /* ... */ }
    pub struct FinalizationHydrator { /* ... */ }
}

// REST API (Axum)
pub mod web {
    // POST /v1/tasks
    pub async fn create_task(request: TaskRequest) -> Result<TaskResponse>;

    // GET /v1/tasks/{uuid}
    pub async fn get_task(uuid: Uuid) -> Result<TaskResponse>;

    // GET /health
    pub async fn health_check() -> Result<HealthResponse>;
}

// gRPC API (Tonic)
// Feature-gated behind `grpc-api`
pub mod grpc {
    pub struct GrpcServer { /* ... */ }
    pub struct GrpcState { /* wraps Arc<SharedApiServices> */ }

    pub mod services {
        pub struct TaskServiceImpl { /* 6 RPCs */ }
        pub struct StepServiceImpl { /* 4 RPCs */ }
        pub struct TemplateServiceImpl { /* 2 RPCs */ }
        pub struct HealthServiceImpl { /* 4 RPCs */ }
        pub struct AnalyticsServiceImpl { /* 2 RPCs */ }
        pub struct DlqServiceImpl { /* 6 RPCs */ }
        pub struct ConfigServiceImpl { /* 1 RPC */ }
    }

    pub mod interceptors {
        pub struct AuthInterceptor { /* Bearer token, API key */ }
    }
}

// Event systems
pub mod event_systems {
    pub struct OrchestrationEventSystem { /* ... */ }
    pub struct TaskReadinessEventSystem { /* ... */ }
}
}

Actor Architecture:

The orchestration crate implements a lightweight actor pattern for lifecycle component coordination:

  • ActorRegistry: Manages all 4 orchestration actors with lifecycle hooks
  • Message-Based Communication: Type-safe message handling via Handler<M> trait
  • Service Decomposition: Large services decomposed into focused components (<300 lines per file)
  • Direct Integration: Command processor calls actors directly without wrapper layers

See Actor-Based Architecture for comprehensive documentation.
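
To illustrate the message-based pattern, a simplified sketch of a Handler<M> implementation. The trait shapes and the ProcessStepResult message are stand-ins, not the actual actor API; see the Actor-Based Architecture doc for the real definitions:

use async_trait::async_trait;

// Simplified stand-ins for the Message and Handler traits described above.
trait Message: Send + 'static {
    type Response: Send + 'static;
}

#[async_trait]
trait Handler<M: Message>: Send + Sync {
    async fn handle(&self, message: M) -> M::Response;
}

// Illustrative message: ask the result processor to record a step outcome.
struct ProcessStepResult {
    step_uuid: String,
    success: bool,
}

impl Message for ProcessStepResult {
    type Response = ();
}

struct ResultProcessorActor;

#[async_trait]
impl Handler<ProcessStepResult> for ResultProcessorActor {
    async fn handle(&self, message: ProcessStepResult) {
        // The real actor delegates to its lifecycle service; this just logs.
        println!("step {} finished (success = {})", message.step_uuid, message.success);
    }
}

#[tokio::main]
async fn main() {
    let actor = ResultProcessorActor;
    actor
        .handle(ProcessStepResult { step_uuid: "abc".into(), success: true })
        .await;
}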

When to Use:

  • When you need to run the orchestration server
  • When you need task coordination logic
  • When building custom orchestration components
  • When integrating with the REST API

Dependencies:

  • tasker-shared - Core types and SQL functions
  • tasker-pgmq - Message queuing
  • axum - REST API framework
  • tower-http - HTTP middleware

Deployment: Typically deployed as a server process (tasker-server binary)

Dual-Server Architecture:

Orchestration supports both REST and gRPC APIs running simultaneously via SharedApiServices:

#![allow(unused)]
fn main() {
pub struct SharedApiServices {
    pub security_service: Option<Arc<SecurityService>>,
    pub task_service: TaskService,
    pub step_service: StepService,
    pub health_service: HealthService,
    // ... other services
}

// Both APIs share the same service instances
AppState { services: Arc<SharedApiServices>, ... }   // REST
GrpcState { services: Arc<SharedApiServices>, ... }  // gRPC
}

Port Allocation:

  • REST: 8080 (configurable)
  • gRPC: 9190 (configurable)

tasker-worker

Purpose: Step execution, handler integration, and worker coordination

Location: tasker-worker/

Key Responsibilities:

  • Claim steps from namespace queues
  • Execute step handlers (Rust or FFI)
  • Submit results to orchestration
  • Template management and caching
  • Event-driven step claiming
  • Worker health monitoring
  • FFI integration layer

Public API:

#![allow(unused)]
fn main() {
// Worker core
pub struct WorkerCore {
    pub async fn new(config: WorkerConfig) -> Result<Self>;
    pub async fn start(&mut self) -> Result<()>;
}

// Handler execution
pub mod handlers {
    pub trait StepHandler {
        async fn execute(&self, context: StepContext) -> Result<StepResult>;
    }
}

// Template management
pub mod task_template_manager {
    pub struct TaskTemplateManager {
        pub async fn load_templates(&mut self) -> Result<()>;
        pub fn get_template(&self, name: &str) -> Option<&TaskTemplate>;
    }
}

// Event systems
pub mod event_systems {
    pub struct WorkerEventSystem { /* ... */ }
}
}

When to Use:

  • When you need to run a worker process
  • When implementing custom step handlers
  • When integrating with Ruby/Python handlers via FFI
  • When building worker-specific tools

Dependencies:

  • tasker-shared - Core types and messaging
  • tasker-pgmq - Message queuing
  • magnus (optional) - Ruby FFI bindings

Deployment: Deployed as worker processes, typically one per namespace or scaled horizontally


tasker-client

Purpose: Transport-agnostic API client library for REST and gRPC

Location: tasker-client/

Key Responsibilities:

  • HTTP client for orchestration REST API
  • gRPC client for orchestration gRPC API (feature-gated)
  • Transport abstraction via unified client traits
  • Configuration management and auth resolution
  • Client-side request building

Public API:

#![allow(unused)]
fn main() {
// REST client
pub struct RestOrchestrationClient {
    pub async fn new(base_url: &str) -> Result<Self>;
    // Task, step, template, health operations
}

// gRPC client (feature-gated)
#[cfg(feature = "grpc")]
pub struct GrpcOrchestrationClient {
    pub async fn connect(endpoint: &str) -> Result<Self>;
    pub async fn connect_with_auth(endpoint: &str, auth: GrpcAuthConfig) -> Result<Self>;
    // Same operations as REST client
}

// Transport-agnostic client
pub enum UnifiedOrchestrationClient {
    Rest(Box<RestOrchestrationClient>),
    Grpc(Box<GrpcOrchestrationClient>),
}

// Client trait for transport abstraction
pub trait OrchestrationClient: Send + Sync {
    async fn create_task(&self, request: TaskRequest) -> Result<TaskResponse>;
    async fn get_task(&self, uuid: Uuid) -> Result<TaskResponse>;
    async fn list_tasks(&self, filters: TaskFilters) -> Result<Vec<TaskResponse>>;
    async fn health_check(&self) -> Result<HealthResponse>;
    // ... more operations
}
}
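
A hedged usage sketch following the trait above. It assumes RestOrchestrationClient implements OrchestrationClient as described and that a TaskRequest has been built elsewhere; error conversion is simplified:

use tasker_client::{OrchestrationClient, RestOrchestrationClient, TaskRequest};

// Transport-agnostic: the same code works with the REST or gRPC client because
// both implement the OrchestrationClient trait.
async fn submit_task(
    client: &impl OrchestrationClient,
    request: TaskRequest,
) -> Result<(), Box<dyn std::error::Error>> {
    let _task = client.create_task(request).await?;
    let _health = client.health_check().await?;
    Ok(())
}

async fn example(request: TaskRequest) -> Result<(), Box<dyn std::error::Error>> {
    // REST transport against a local orchestration server (default port 8080).
    let client = RestOrchestrationClient::new("http://localhost:8080").await?;
    submit_task(&client, request).await
}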

When to Use:

  • When you need to interact with orchestration API from Rust
  • When building integration tests
  • When implementing client applications or FFI bindings
  • When building UI frontends (TUI, web) that need API access

tasker-ctl

Purpose: Command-line interface for Tasker (split from tasker-client)

Location: tasker-ctl/

Key Responsibilities:

  • CLI argument parsing and command dispatch (via clap)
  • Task, worker, system, config, auth, and DLQ commands
  • Configuration documentation generation (via askama, feature-gated)
  • API key generation and management

CLI Tools:

# Task management
tasker-ctl task create --template linear_workflow
tasker-ctl task get <uuid>
tasker-ctl task list --namespace payments

# Health checks
tasker-ctl health

# Configuration docs generation
tasker-ctl docs generate

When to Use:

  • When managing tasks from the command line
  • When generating configuration documentation
  • When performing administrative operations (auth, DLQ management)

Dependencies:

  • reqwest - HTTP client
  • clap - CLI argument parsing
  • serde_json - JSON serialization

Worker Implementations

workers/ruby/ext/tasker_core

Purpose: Ruby FFI bindings enabling Ruby workers to execute Rust-orchestrated workflows

Location: workers/ruby/ext/tasker_core/

Key Responsibilities:

  • Expose Rust worker functionality to Ruby via Magnus (FFI)
  • Handle Ruby handler execution
  • Manage Ruby <-> Rust type conversions
  • Provide Ruby API for template registration
  • FFI performance optimization

Ruby API:

# Worker bootstrap
result = TaskerCore::Worker::Bootstrap.start!

# Template registration (automatic)
# Ruby templates in workers/ruby/app/tasker/tasks/templates/

# Handler execution (automatic via FFI)
class MyHandler < TaskerCore::StepHandler
  def execute(context)
    # Step implementation
    { success: true, result: "done" }
  end
end

When to Use:

  • When you have existing Ruby handlers
  • When you need Ruby-specific libraries or gems
  • When migrating from Ruby-based orchestration
  • When team expertise is primarily Ruby

Dependencies:

  • magnus - Ruby FFI bindings
  • tasker-worker - Core worker logic
  • Ruby runtime

Performance Considerations:

  • FFI overhead: ~5-10ms per step (measured)
  • Ruby GC can impact latency
  • Thread-safe FFI calls via Ruby global lock
  • Best for I/O-bound operations, not CPU-intensive

workers/rust

Purpose: Native Rust worker implementation for maximum performance

Location: workers/rust/

Key Responsibilities:

  • Native Rust step handler execution
  • Template definitions in Rust
  • Direct integration with tasker-worker
  • Maximum performance for CPU-intensive operations

Handler API:

#![allow(unused)]
fn main() {
// Define handler in Rust
pub struct MyHandler;

#[async_trait]
impl StepHandler for MyHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        // Step implementation
        Ok(StepResult::success(json!({"result": "done"})))
    }
}

// Register in template
pub fn register_template() -> TaskTemplate {
    TaskTemplate {
        name: "my_workflow",
        steps: vec![
            StepTemplate {
                name: "my_step",
                handler: Box::new(MyHandler),
                // ...
            }
        ]
    }
}
}

When to Use:

  • When you need maximum performance
  • For CPU-intensive operations
  • When building new workflows in Rust
  • When minimizing latency is critical

Dependencies:

  • tasker-worker - Core worker logic
  • tokio - Async runtime

Performance: Native Rust handlers have zero FFI overhead


Crate Relationships

How Crates Work Together

Task Creation Flow

Client Application
  ↓ [HTTP POST]
tasker-client
  ↓ [REST API]
tasker-orchestration::web
  ↓ [Task lifecycle]
tasker-orchestration::lifecycle::TaskInitializer
  ↓ [Uses]
tasker-shared::models::Task
tasker-shared::database::sql_functions
  ↓ [PostgreSQL]
Database + PGMQ

Step Execution Flow

tasker-orchestration::lifecycle::StepEnqueuer
  ↓ [pgmq_send_with_notify]
PGMQ namespace queue
  ↓ [pg_notify event]
tasker-worker::event_systems::WorkerEventSystem
  ↓ [Claims step]
tasker-worker::handlers::execute_handler
  ↓ [FFI or native]
workers/ruby or workers/rust
  ↓ [Returns result]
tasker-worker::orchestration_result_sender
  ↓ [pgmq_send_with_notify]
PGMQ orchestration_step_results queue
  ↓ [pg_notify event]
tasker-orchestration::lifecycle::ResultProcessor
  ↓ [Updates state]
tasker-shared::models::WorkflowStepTransition

Dependency Rationale

Why tasker-shared exists:

  • Prevents circular dependencies (orchestration ↔ worker)
  • Single source of truth for domain models
  • Enables independent testing
  • Allows SQL function reuse

Why workers are separate from tasker-worker:

  • Language-specific implementations
  • Independent deployment
  • FFI boundary separation
  • Multiple worker types supported

Why tasker-pgmq is separate:

  • Reusable in other projects
  • Focused responsibility
  • Easy to test independently
  • Can be published as separate crate

Building and Testing

Build All Crates

# Build everything with all features
cargo build --all-features

# Build specific crate
cargo build --package tasker-orchestration --all-features

# Build workspace root (minimal, mostly for integration)
cargo build

Test All Crates

# Test everything
cargo test --all-features

# Test specific crate
cargo test --package tasker-shared --all-features

# Test with database
DATABASE_URL="postgresql://..." cargo test --all-features

Feature Flags

# Root workspace features
[features]
benchmarks = [
    "tasker-shared/benchmarks",
    # ...
]
test-utils = [
    "tasker-orchestration/test-utils",
    "tasker-shared/test-utils",
    "tasker-worker/test-utils",
]

Migration Notes

Root Crate Being Phased Out

The root tasker-core crate (defined in the workspace root Cargo.toml) is being phased out:

  • Current: Contains minimal code, mostly workspace configuration
  • Future: Will be removed entirely, replaced by individual crates
  • Impact: No functional impact, internal restructuring only
  • Timeline: Complete when all functionality moved to member crates

Why: Cleaner workspace structure, better separation of concerns, easier to understand

Adding New Crates

When adding a new crate to the workspace:

  1. Add to [workspace.members] in root Cargo.toml
  2. Create crate: cargo new --lib tasker-new-crate
  3. Add workspace dependencies to crate’s Cargo.toml
  4. Update this documentation
  5. Add to dependency graph above
  6. Document public API

Best Practices

When to Create a New Crate

Create a new crate when:

  • ✅ You have a distinct, reusable component
  • ✅ You need independent versioning
  • ✅ You want to reduce compile times
  • ✅ You need isolation for testing
  • ✅ You have language-specific implementations

Don’t create a new crate when:

  • ❌ It’s tightly coupled to existing crates
  • ❌ It’s only used in one place
  • ❌ It would create circular dependencies
  • ❌ It’s a small utility module

Dependency Management

  • Use workspace dependencies: Define versions in root Cargo.toml
  • Minimize dependencies: Only depend on what you need
  • Version consistently: Use workspace = true in member crates
  • Document dependencies: Explain why each dependency is needed

API Design

  • Stable public API: Changes should be backward compatible
  • Clear documentation: Every public item needs docs
  • Examples in docs: Show how to use the API
  • Error handling: Use Result with meaningful error types


← Back to Documentation Hub

Deployment Patterns and Configuration

Last Updated: 2026-01-15 Audience: Architects, Operators Status: Active Related Docs: Documentation Hub | Quick Start | Observability | Messaging Abstraction

← Back to Documentation Hub


Overview

Tasker Core supports three deployment modes, each optimized for different operational requirements and infrastructure constraints. This guide covers deployment patterns, configuration management, and production considerations.

Key Deployment Modes:

  • Hybrid Mode (Recommended) - Event-driven with polling fallback
  • EventDrivenOnly Mode - Pure event-driven for lowest latency
  • PollingOnly Mode - Traditional polling for restricted environments

Messaging Backend Options:

  • PGMQ (Default) - PostgreSQL-based, single infrastructure dependency
  • RabbitMQ - AMQP broker, higher throughput for high-volume scenarios

Messaging Backend Selection

Tasker Core supports multiple messaging backends through a provider-agnostic abstraction layer. The choice of backend affects deployment architecture and operational requirements.

Backend Comparison

FeaturePGMQRabbitMQ
InfrastructurePostgreSQL onlyPostgreSQL + RabbitMQ
Delivery ModelPoll + pg_notify signalsNative push (basic_consume)
Fallback PollingRequired for reliabilityNot needed
ThroughputGoodHigher
LatencyLow (~10-50ms)Lowest (~5-20ms)
Operational ComplexityLowerHigher
Message PersistencePostgreSQL transactionsRabbitMQ durability

PGMQ (Default)

PostgreSQL Message Queue is the default backend, ideal for:

  • Simpler deployments: Single database dependency
  • Transactional workflows: Messages participate in PostgreSQL transactions
  • Smaller to medium scale: Excellent for most workloads

Configuration:

# Default - no additional configuration needed
TASKER_MESSAGING_BACKEND=pgmq

Deployment Mode Interaction:

  • Uses pg_notify for real-time notifications
  • Fallback polling recommended for reliability
  • Hybrid mode provides best balance

RabbitMQ

AMQP-based messaging for high-throughput scenarios:

  • High-volume workloads: Better throughput characteristics
  • Existing RabbitMQ infrastructure: Leverage existing investments
  • Pure push delivery: No fallback polling required

Configuration:

TASKER_MESSAGING_BACKEND=rabbitmq
RABBITMQ_URL=amqp://user:password@rabbitmq:5672/%2F

Deployment Mode Interaction:

  • EventDrivenOnly mode is natural fit (no fallback needed)
  • Native push delivery via basic_consume()
  • Protocol-guaranteed message delivery

Choosing a Backend

Decision Tree:
                              ┌─────────────────┐
                              │ Do you need the │
                              │ highest possible │
                              │ throughput?     │
                              └────────┬────────┘
                                       │
                            ┌──────────┴──────────┐
                            │                     │
                           Yes                    No
                            │                     │
                            ▼                     ▼
                   ┌────────────────┐   ┌────────────────┐
                   │ Do you have    │   │ Use PGMQ       │
                   │ existing       │   │ (simpler ops)  │
                   │ RabbitMQ?      │   └────────────────┘
                   └───────┬────────┘
                           │
                ┌──────────┴──────────┐
                │                     │
               Yes                    No
                │                     │
                ▼                     ▼
       ┌────────────────┐    ┌────────────────┐
       │ Use RabbitMQ   │    │ Evaluate       │
       └────────────────┘    │ operational    │
                             │ tradeoffs      │
                             └────────────────┘

Recommendation: Start with PGMQ. Migrate to RabbitMQ only when throughput requirements demand it.


Production Deployment Strategy: Mixed Mode Architecture

Important: In production-grade Kubernetes environments, you typically run multiple orchestration containers simultaneously with different deployment modes. This is not just about horizontal scaling with identical configurations—it’s about deploying containers with different coordination strategies to optimize for both throughput and reliability.

High-Throughput + Safety Net Architecture:

# Most orchestration containers in EventDrivenOnly mode for maximum throughput
- EventDrivenOnly containers: 8-12 replicas (handles 80-90% of workload)
- PollingOnly containers: 2-3 replicas (safety net for missed events)

Why this works:

  1. EventDrivenOnly containers handle the bulk of work with ~10ms latency
  2. PollingOnly containers catch any events that might be missed during network issues or LISTEN/NOTIFY failures
  3. Both sets of containers coordinate through atomic SQL operations (no conflicts)
  4. Scale each mode independently based on throughput needs

Alternative: All-Hybrid Deployment

You can also deploy all containers in Hybrid mode and scale horizontally:

# All containers use Hybrid mode
- Hybrid containers: 10-15 replicas

This is simpler but less flexible. The mixed-mode approach lets you:

  • Tune for specific workload patterns (event-heavy vs. polling-heavy)
  • Adapt to infrastructure constraints (some networks better for events, others for polling)
  • Optimize resource usage (EventDrivenOnly uses less CPU than Hybrid)
  • Scale dimensions independently (scale up event listeners without scaling pollers)

Key Insight

The different deployment modes exist not just for config tuning, but to enable sophisticated deployment strategies where you mix coordination approaches across containers to meet production throughput and reliability requirements.


Deployment Mode Comparison

FeatureHybridEventDrivenOnlyPollingOnly
LatencyLow (event-driven primary)Lowest (~10ms)Higher (~100-500ms)
ReliabilityHighest (automatic fallback)Good (requires stable connections)Good (no dependencies)
Resource UsageMedium (listeners + pollers)Low (listeners only)Medium (pollers only)
Network RequirementsStandard PostgreSQLPersistent connections requiredStandard PostgreSQL
Production Recommended✅ Yes⚠️ With stable network⚠️ For restricted environments
ComplexityMediumLowLow

Hybrid Mode

Overview

Hybrid mode combines the best of both worlds: event-driven coordination for real-time performance with polling fallback for reliability.

How it works:

  1. PostgreSQL LISTEN/NOTIFY provides real-time event notifications
  2. If event listeners fail or lag, polling automatically takes over
  3. System continuously monitors and switches between modes
  4. No manual intervention required
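
A simplified sketch of the fallback decision: wait for a push notification, and run a polling pass if nothing arrives within the activation threshold. This is illustrative only; the real system also monitors listener health and reconnects automatically:

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

/// Wait for a push notification, but fall back to a polling pass if nothing
/// arrives within the activation threshold.
async fn next_wakeup(
    notifications: &mut mpsc::Receiver<String>,
    fallback_threshold: Duration,
) -> &'static str {
    match timeout(fallback_threshold, notifications.recv()).await {
        Ok(Some(_event)) => "woken by pg_notify event",
        // Listener closed or silent for too long: poll the queues directly.
        _ => "fallback polling pass",
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(16);
    tx.send("pgmq_message_ready.orchestration".into()).await.unwrap();
    println!("{}", next_wakeup(&mut rx, Duration::from_millis(5000)).await);
}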

Configuration

# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"

[orchestration.hybrid]
# Event listener settings
enable_event_listeners = true
listener_reconnect_interval_ms = 5000
listener_health_check_interval_ms = 30000

# Polling fallback settings
enable_polling_fallback = true
polling_interval_ms = 1000
fallback_activation_threshold_ms = 5000

# Worker event settings
[orchestration.worker_events]
enable_worker_listeners = true
worker_listener_reconnect_ms = 5000

When to Use Hybrid Mode

Ideal for:

  • Production deployments requiring high reliability
  • Environments with occasional network instability
  • Systems requiring both low latency and guaranteed delivery
  • Multi-region deployments with variable network quality

Example: Production E-commerce Platform

# docker-compose.production.yml
version: '3.8'

services:
  orchestration:
    image: tasker-orchestration:latest
    environment:
      - TASKER_ENV=production
      - TASKER_DEPLOYMENT_MODE=Hybrid
      - DATABASE_URL=postgresql://tasker:${DB_PASSWORD}@postgres:5432/tasker_production
      - RUST_LOG=info
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '1'
          memory: 1G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  postgres:
    image: postgres:16
    environment:
      - POSTGRES_DB=tasker_production
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G

volumes:
  postgres-data:

Monitoring Hybrid Mode

Key Metrics:

# Hybrid mode health indicators
tasker_event_listener_active{mode="hybrid"} = 1           # Listener is active
tasker_event_listener_lag_ms{mode="hybrid"} < 100         # Event lag is acceptable
tasker_polling_fallback_active{mode="hybrid"} = 0         # Not in fallback mode
tasker_mode_switches_total{mode="hybrid"} < 10/hour       # Infrequent mode switching

Alert conditions:

  • Event listener down for > 60 seconds
  • Polling fallback active for > 5 minutes
  • Mode switches > 20 per hour (indicates instability)

EventDrivenOnly Mode

Overview

EventDrivenOnly mode provides the lowest possible latency by relying entirely on PostgreSQL LISTEN/NOTIFY for coordination.

How it works:

  1. Orchestration and workers establish persistent PostgreSQL connections
  2. LISTEN on specific channels for events
  3. Immediate notification on queue changes
  4. No polling overhead or delay
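
A minimal sketch of the LISTEN side using sqlx's PgListener, subscribing to one of the channels listed in the configuration below. Wiring the notification into step claiming is omitted:

use sqlx::postgres::PgListener;

async fn listen_for_queue_events(database_url: &str) -> Result<(), sqlx::Error> {
    // Dedicated persistent connection for LISTEN/NOTIFY.
    let mut listener = PgListener::connect(database_url).await?;
    listener.listen("pgmq_message_ready.orchestration").await?;

    loop {
        // Blocks until PostgreSQL delivers a NOTIFY on the channel.
        let notification = listener.recv().await?;
        println!(
            "channel={} payload={}",
            notification.channel(),
            notification.payload()
        );
        // The real system would claim and process the ready message here.
    }
}

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
    listen_for_queue_events(&url).await
}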

Configuration

# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "EventDrivenOnly"

[orchestration.event_driven]
# Listener configuration
listener_reconnect_interval_ms = 2000
listener_health_check_interval_ms = 15000
max_reconnect_attempts = 10

# Event channels
channels = [
    "pgmq_message_ready.orchestration",
    "pgmq_message_ready.*",
    "pgmq_queue_created"
]

# Connection pool for listeners
listener_pool_size = 5
connection_timeout_ms = 5000

When to Use EventDrivenOnly Mode

Ideal for:

  • High-throughput, low-latency requirements
  • Stable network environments
  • Development and testing environments
  • Systems with reliable PostgreSQL infrastructure

Not recommended for:

  • Unstable network connections
  • Environments with frequent PostgreSQL failovers
  • Systems requiring guaranteed operation during network issues

Example: High-Performance Payment Processing

#![allow(unused)]
fn main() {
// Worker configuration for event-driven mode
use tasker_worker::WorkerConfig;

let config = WorkerConfig {
    deployment_mode: DeploymentMode::EventDrivenOnly,
    namespaces: vec!["payments".to_string()],
    event_driven_settings: EventDrivenSettings {
        listener_reconnect_interval_ms: 2000,
        health_check_interval_ms: 15000,
        max_reconnect_attempts: 10,
    },
    ..Default::default()
};

// Start worker with event-driven mode
let worker = WorkerCore::from_config(config).await?;
worker.start().await?;
}

Monitoring EventDrivenOnly Mode

Critical Metrics:

# Event-driven health indicators
tasker_event_listener_active{mode="event_driven"} = 1    # Must be 1
tasker_event_notifications_received_total                 # Should be > 0
tasker_event_processing_duration_seconds                  # Should be < 0.01
tasker_listener_reconnections_total                       # Should be low

Alert conditions:

  • Event listener inactive
  • No events received for > 60 seconds (when activity expected)
  • Reconnections > 5 per hour

PollingOnly Mode

Overview

PollingOnly mode provides the most reliable operation in restricted or unstable network environments by using traditional polling.

How it works:

  1. Orchestration and workers poll message queues at regular intervals
  2. No dependency on persistent connections or LISTEN/NOTIFY
  3. Configurable polling intervals for performance/resource trade-offs
  4. Automatic retry and backoff on failures
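
A simplified sketch of a polling loop with exponential backoff on errors. The intervals mirror the configuration below, and poll_once is a stand-in for a real pass over the queues:

use std::time::Duration;
use tokio::time::sleep;

const POLL_INTERVAL: Duration = Duration::from_millis(1000);
const BACKOFF_BASE: Duration = Duration::from_millis(1000);
const BACKOFF_MAX: Duration = Duration::from_secs(30);

// Stand-in for one polling pass over the message queues.
async fn poll_once() -> Result<usize, String> {
    Ok(0) // number of messages processed
}

#[tokio::main]
async fn main() {
    let mut backoff = BACKOFF_BASE;
    loop {
        match poll_once().await {
            Ok(_processed) => {
                backoff = BACKOFF_BASE;      // reset after a healthy cycle
                sleep(POLL_INTERVAL).await;  // normal cadence
            }
            Err(err) => {
                eprintln!("poll failed: {err}; backing off {backoff:?}");
                sleep(backoff).await;
                // Double the delay up to the configured maximum.
                backoff = (backoff * 2).min(BACKOFF_MAX);
            }
        }
    }
}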

Configuration

# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"

[orchestration.polling]
# Polling intervals
task_request_poll_interval_ms = 1000
step_result_poll_interval_ms = 500
finalization_poll_interval_ms = 2000

# Batch processing
batch_size = 10
max_messages_per_poll = 100

# Backoff on errors
error_backoff_base_ms = 1000
error_backoff_max_ms = 30000
error_backoff_multiplier = 2.0

When to Use PollingOnly Mode

Ideal for:

  • Restricted network environments (firewalls blocking persistent connections)
  • Environments with frequent PostgreSQL connection issues
  • Systems prioritizing reliability over latency
  • Legacy infrastructure with limited LISTEN/NOTIFY support

Not recommended for:

  • High-frequency, low-latency requirements
  • Systems with strict resource constraints
  • Environments where polling overhead is problematic

Example: Batch Processing System

# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"

[orchestration.polling]
# Longer intervals for batch processing
task_request_poll_interval_ms = 5000
step_result_poll_interval_ms = 2000
finalization_poll_interval_ms = 10000

# Large batches for efficiency
batch_size = 50
max_messages_per_poll = 500

Monitoring PollingOnly Mode

Key Metrics:

# Polling health indicators
tasker_polling_cycles_total                               # Should be increasing
tasker_polling_messages_processed_total                   # Should be > 0
tasker_polling_duration_seconds                           # Should be stable
tasker_polling_errors_total                               # Should be low

Alert conditions:

  • Polling stopped (no cycles in last 60 seconds)
  • Polling duration > 10x interval (indicates overload)
  • Error rate > 5% of polling cycles
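
These thresholds can be checked with PromQL. The expressions below are a sketch: they assume the polling metrics above are exported under these names and that tasker_polling_duration_seconds is a histogram or summary (hence the _sum/_count suffixes):

# Polling stopped: no cycles in the last 60 seconds
increase(tasker_polling_cycles_total[1m]) == 0

# Polling overloaded: average cycle duration above 10 seconds (10x a 1s interval)
rate(tasker_polling_duration_seconds_sum[5m]) / rate(tasker_polling_duration_seconds_count[5m]) > 10

# Error rate above 5% of polling cycles
rate(tasker_polling_errors_total[5m]) / rate(tasker_polling_cycles_total[5m]) > 0.05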

Configuration Management

Component-Based Configuration

Tasker Core uses a component-based TOML configuration system with environment-specific overrides.

Configuration Structure:

config/tasker/
├── base/                          # Base configuration (all environments)
│   ├── database.toml             # Database connection pool settings
│   ├── orchestration.toml        # Orchestration and deployment mode
│   ├── circuit_breakers.toml    # Circuit breaker thresholds
│   ├── executor_pools.toml      # Executor pool sizing
│   ├── pgmq.toml                # Message queue configuration
│   ├── query_cache.toml         # Query caching settings
│   └── telemetry.toml           # Metrics and logging
│
└── environments/                  # Environment-specific overrides
    ├── development/
    │   └── *.toml               # Development overrides
    ├── test/
    │   └── *.toml               # Test overrides
    └── production/
        └── *.toml               # Production overrides
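
Environment files layer on top of the base files. Assuming key-level merging (an override file only needs the keys it changes), the relationship looks like this sketch; the specific values are illustrative:

# config/tasker/base/orchestration.toml (shared defaults)
[orchestration]
deployment_mode = "Hybrid"
max_concurrent_tasks = 100

# config/tasker/environments/production/orchestration.toml (override)
[orchestration]
max_concurrent_tasks = 1000    # Only the keys that differ from base need to appear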

Environment Detection

# Set environment via TASKER_ENV
export TASKER_ENV=production

# Validate configuration
cargo run --bin config-validator

# Expected output:
# ✓ Configuration loaded successfully
# ✓ Environment: production
# ✓ Deployment mode: Hybrid
# ✓ Database pool: 50 connections
# ✓ Circuit breakers: 10 configurations

Example: Production Configuration

# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
max_concurrent_tasks = 1000
task_timeout_seconds = 3600

[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true
polling_interval_ms = 2000
fallback_activation_threshold_ms = 10000

[orchestration.health]
health_check_interval_ms = 30000
unhealthy_threshold = 3
recovery_threshold = 2

# config/tasker/environments/production/database.toml
[database]
max_connections = 50
min_connections = 10
connection_timeout_ms = 5000
idle_timeout_seconds = 600
max_lifetime_seconds = 1800

[database.query_cache]
enabled = true
max_size = 1000
ttl_seconds = 300

# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5
timeout_seconds = 60
half_open_timeout_seconds = 30

[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60

Docker Compose Deployment

Development Setup

# docker-compose.yml
version: '3.8'

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: tasker
      POSTGRES_PASSWORD: tasker
      POSTGRES_DB: tasker_rust_test
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U tasker"]
      interval: 5s
      timeout: 5s
      retries: 5

  orchestration:
    build:
      context: .
      target: orchestration
    environment:
      - TASKER_ENV=development
      - DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
      - RUST_LOG=debug
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
    profiles:
      - server

  worker:
    build:
      context: .
      target: worker
    environment:
      - TASKER_ENV=development
      - DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
      - RUST_LOG=debug
    ports:
      - "8081:8081"
    depends_on:
      postgres:
        condition: service_healthy
    profiles:
      - server

  ruby-worker:
    build:
      context: ./workers/ruby
      dockerfile: Dockerfile
    environment:
      - TASKER_ENV=development
      - DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
      - RUST_LOG=debug
    ports:
      - "8082:8082"
    depends_on:
      postgres:
        condition: service_healthy
    profiles:
      - server

volumes:
  postgres-data:

Production Deployment

# docker-compose.production.yml
version: '3.8'

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: tasker
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
      POSTGRES_DB: tasker_production
    volumes:
      - postgres-data:/var/lib/postgresql/data
    deploy:
      placement:
        constraints:
          - node.labels.database == true
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '2'
          memory: 4G
    secrets:
      - db_password

  orchestration:
    image: tasker-orchestration:${VERSION}
    environment:
      - TASKER_ENV=production
      - DATABASE_URL_FILE=/run/secrets/database_url
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
      rollback_config:
        parallelism: 0
        order: stop-first
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '1'
          memory: 1G
    secrets:
      - database_url

  worker:
    image: tasker-worker:${VERSION}
    environment:
      - TASKER_ENV=production
      - DATABASE_URL_FILE=/run/secrets/database_url
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
    secrets:
      - database_url

secrets:
  db_password:
    external: true
  database_url:
    external: true

volumes:
  postgres-data:
    driver: local

Kubernetes Deployment

This example demonstrates the recommended production pattern: multiple orchestration deployments with different modes.

# k8s/orchestration-event-driven.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-orchestration-event-driven
  namespace: tasker
  labels:
    app: tasker-orchestration
    mode: event-driven
spec:
  replicas: 10  # Majority of orchestration capacity
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tasker-orchestration
      mode: event-driven
  template:
    metadata:
      labels:
        app: tasker-orchestration
        mode: event-driven
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: orchestration
        image: tasker-orchestration:1.0.0
        env:
        - name: TASKER_ENV
          value: "production"
        - name: DEPLOYMENT_MODE
          value: "EventDrivenOnly"  # High-throughput mode
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-secrets
              key: database-url
        - name: RUST_LOG
          value: "info"
        ports:
        - containerPort: 8080
          name: http
        resources:
          requests:
            cpu: 500m      # Lower CPU for event-driven
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

---
# k8s/orchestration-polling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-orchestration-polling
  namespace: tasker
  labels:
    app: tasker-orchestration
    mode: polling
spec:
  replicas: 3  # Safety net for missed events
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tasker-orchestration
      mode: polling
  template:
    metadata:
      labels:
        app: tasker-orchestration
        mode: polling
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: orchestration
        image: tasker-orchestration:1.0.0
        env:
        - name: TASKER_ENV
          value: "production"
        - name: DEPLOYMENT_MODE
          value: "PollingOnly"  # Reliability safety net
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-secrets
              key: database-url
        - name: RUST_LOG
          value: "info"
        ports:
        - containerPort: 8080
          name: http
        resources:
          requests:
            cpu: 750m      # Higher CPU for polling
            memory: 512Mi
          limits:
            cpu: 1500m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

---
# k8s/orchestration-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: tasker-orchestration
  namespace: tasker
spec:
  selector:
    app: tasker-orchestration  # Matches BOTH deployments
  ports:
  - port: 8080
    targetPort: 8080
    protocol: TCP
    name: http
  type: ClusterIP

Key points about this mixed-mode deployment:

  1. 10 EventDrivenOnly pods handle 80-90% of work with ~10ms latency
  2. 3 PollingOnly pods catch anything missed by event listeners
  3. Single service load balances across all 13 pods
  4. No conflicts - atomic SQL operations prevent duplicate processing
  5. Independent scaling - scale event-driven pods for throughput, polling pods for reliability
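
To verify the split after a rollout, one option is to list orchestration pods by their mode label (the labels match the manifests above):

# Show orchestration pods with their mode label
kubectl get pods -n tasker -l app=tasker-orchestration -L mode

# Count pods per mode
kubectl get pods -n tasker -l app=tasker-orchestration,mode=event-driven --no-headers | wc -l
kubectl get pods -n tasker -l app=tasker-orchestration,mode=polling --no-headers | wc -l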

Single-Mode Orchestration Deployment

# k8s/orchestration-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-orchestration
  namespace: tasker
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tasker-orchestration
  template:
    metadata:
      labels:
        app: tasker-orchestration
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: orchestration
        image: tasker-orchestration:1.0.0
        env:
        - name: TASKER_ENV
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-secrets
              key: database-url
        - name: RUST_LOG
          value: "info"
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2

---
apiVersion: v1
kind: Service
metadata:
  name: tasker-orchestration
  namespace: tasker
spec:
  selector:
    app: tasker-orchestration
  ports:
  - port: 8080
    targetPort: 8080
    protocol: TCP
    name: http
  type: ClusterIP

Worker Deployment

# k8s/worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-worker-payments
  namespace: tasker
spec:
  replicas: 5
  selector:
    matchLabels:
      app: tasker-worker
      namespace: payments
  template:
    metadata:
      labels:
        app: tasker-worker
        namespace: payments
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8081"
    spec:
      containers:
      - name: worker
        image: tasker-worker:1.0.0
        env:
        - name: TASKER_ENV
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-secrets
              key: database-url
        - name: RUST_LOG
          value: "info"
        - name: WORKER_NAMESPACES
          value: "payments"
        ports:
        - containerPort: 8081
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8081
          initialDelaySeconds: 20
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 5

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tasker-worker-payments
  namespace: tasker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tasker-worker-payments
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Health Monitoring

Health Check Endpoints

Orchestration Health:

# Basic health check
curl http://localhost:8080/health

# Response:
{
  "status": "healthy",
  "database": "connected",
  "message_queue": "operational"
}

# Detailed health check
curl http://localhost:8080/health/detailed

# Response:
{
  "status": "healthy",
  "deployment_mode": "Hybrid",
  "event_listeners": {
    "active": true,
    "channels": 3,
    "lag_ms": 12
  },
  "polling": {
    "active": false,
    "fallback_triggered": false
  },
  "database": {
    "status": "connected",
    "pool_size": 50,
    "active_connections": 23
  },
  "circuit_breakers": {
    "database": "closed",
    "message_queue": "closed"
  },
  "executors": {
    "task_initializer": {
      "active": 3,
      "max": 10,
      "queue_depth": 5
    },
    "result_processor": {
      "active": 5,
      "max": 10,
      "queue_depth": 12
    }
  }
}

Worker Health:

# Worker health check
curl http://localhost:8081/health

# Response:
{
  "status": "healthy",
  "namespaces": ["payments", "inventory"],
  "active_executions": 8,
  "claimed_steps": 3
}

Kubernetes Probes

# Liveness probe - restart if unhealthy
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

# Readiness probe - remove from load balancer if not ready
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

gRPC Health Checks

Tasker Core exposes gRPC health endpoints alongside REST for Kubernetes gRPC health probes.

Port Allocation:

| Service | REST Port | gRPC Port |
|---|---|---|
| Orchestration | 8080 | 9190 |
| Rust Worker | 8081 | 9191 |
| Ruby Worker | 8082 | 9200 |
| Python Worker | 8083 | 9300 |
| TypeScript Worker | 8085 | 9400 |

gRPC Health Endpoints:

# Using grpcurl
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckLiveness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckReadiness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckDetailedHealth

Kubernetes gRPC Probes (Kubernetes 1.24+):

# gRPC liveness probe
livenessProbe:
  grpc:
    port: 9190
    service: tasker.v1.HealthService
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

# gRPC readiness probe
readinessProbe:
  grpc:
    port: 9190
    service: tasker.v1.HealthService
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

Configuration (config/tasker/base/orchestration.toml):

[orchestration.grpc]
enabled = true
bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
enable_reflection = true       # Service discovery via grpcurl
enable_health_service = true   # gRPC health checks

Scaling Patterns

Horizontal Scaling

Scale different deployment modes independently to optimize for throughput and reliability:

# Scale event-driven pods for throughput
kubectl scale deployment tasker-orchestration-event-driven --replicas=15 -n tasker

# Scale polling pods for reliability
kubectl scale deployment tasker-orchestration-polling --replicas=5 -n tasker

Scaling strategy by workload:

| Scenario | Event-Driven Pods | Polling Pods | Rationale |
|---|---|---|---|
| High throughput | 15-20 | 3-5 | Maximize event-driven capacity |
| Network unstable | 5-8 | 5-8 | Balance between modes |
| Cost optimization | 10-12 | 2-3 | Minimize polling overhead |
| Maximum reliability | 8-10 | 8-10 | Ensure complete coverage |

Single-Mode Orchestration Scaling

If using single deployment mode (Hybrid or EventDrivenOnly):

# Scale orchestration to 10 replicas (all same mode)
kubectl scale deployment tasker-orchestration --replicas=10 -n tasker

Key principles:

  • Multiple orchestration instances process tasks independently
  • Atomic finalization claiming prevents duplicate processing
  • Load balancer distributes API requests across instances

Worker Scaling

Workers scale independently per namespace:

# Scale payment workers to 10 replicas
kubectl scale deployment tasker-worker-payments --replicas=10 -n tasker

  • Each worker claims steps from namespace-specific queues
  • No coordination required between workers
  • Scale per namespace based on queue depth
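
Queue depth can be inspected directly in PostgreSQL via the pgmq extension's metrics functions. Treat this as a sketch: the queue names you see depend on your namespace configuration:

-- Depth and oldest-message age for every PGMQ queue
SELECT queue_name, queue_length, oldest_msg_age_sec
FROM pgmq.metrics_all()
ORDER BY queue_length DESC;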

Vertical Scaling

Resource Allocation:

# High-throughput orchestration
resources:
  requests:
    cpu: 2000m
    memory: 4Gi
  limits:
    cpu: 4000m
    memory: 8Gi

# Standard worker
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

Auto-Scaling

HPA Configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tasker-orchestration
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tasker-orchestration
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: tasker_tasks_active
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

Production Considerations

Database Configuration

Connection Pooling:

# config/tasker/environments/production/database.toml
[database]
max_connections = 50              # Total pool size
min_connections = 10              # Minimum maintained connections
connection_timeout_ms = 5000      # Connection acquisition timeout
idle_timeout_seconds = 600        # Close idle connections after 10 minutes
max_lifetime_seconds = 1800       # Recycle connections after 30 minutes

Calculation:

Total DB Connections = (Orchestration Replicas × Pool Size) + (Worker Replicas × Pool Size)
Example: (3 × 50) + (10 × 20) = 350 connections

Ensure PostgreSQL max_connections > Total DB Connections + Buffer
Recommended: max_connections = 500 for above example
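
To raise the server-side limit, set it on PostgreSQL itself; max_connections only takes effect after a restart, so plan a maintenance window (or use your managed provider's parameter mechanism):

-- On the PostgreSQL server; requires a restart to take effect
ALTER SYSTEM SET max_connections = 500;

-- Verify after the restart
SHOW max_connections;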

Circuit Breaker Tuning

# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5               # Open after 5 consecutive errors
timeout_seconds = 60              # Stay open for 60 seconds
half_open_timeout_seconds = 30    # Test recovery for 30 seconds

[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60

Executor Pool Sizing

# config/tasker/environments/production/executor_pools.toml
[executor_pools.task_initializer]
min_executors = 2
max_executors = 10
queue_high_watermark = 100
queue_low_watermark = 10

[executor_pools.result_processor]
min_executors = 5
max_executors = 20
queue_high_watermark = 200
queue_low_watermark = 20

[executor_pools.step_enqueuer]
min_executors = 3
max_executors = 15
queue_high_watermark = 150
queue_low_watermark = 15

Observability Integration

Prometheus Metrics:

# Prometheus scrape config
scrape_configs:
  - job_name: 'tasker-orchestration'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - tasker
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Key Alerts:

# alerts.yaml
groups:
  - name: tasker
    interval: 30s
    rules:
      - alert: TaskerOrchestrationDown
        expr: up{job="tasker-orchestration"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Tasker orchestration instance down"

      - alert: TaskerHighErrorRate
        expr: rate(tasker_step_errors_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in step execution"

      - alert: TaskerCircuitBreakerOpen
        expr: tasker_circuit_breaker_state{state="open"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is open"

      - alert: TaskerDatabasePoolExhausted
        expr: tasker_database_pool_active >= tasker_database_pool_max
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool exhausted"

Migration Strategies

Migrating to Hybrid Mode

Step 1: Enable event listeners

# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"

[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true    # Keep polling enabled during migration

Step 2: Monitor event listener health

# Check metrics for event listener stability
curl http://localhost:8080/health/detailed | jq '.event_listeners'

Step 3: Gradually reduce polling frequency

# Once event listeners are stable
[orchestration.hybrid]
polling_interval_ms = 5000        # Increase from 1000ms to 5000ms

Step 4: Validate performance

  • Monitor latency metrics: tasker_step_discovery_duration_seconds
  • Verify no missed events: tasker_polling_messages_found_total should be near zero
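
As a sketch, both checks can be expressed as PromQL queries, assuming the metrics above are exported under these names and tasker_step_discovery_duration_seconds is a histogram:

# p95 step discovery latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(tasker_step_discovery_duration_seconds_bucket[5m])) by (le))

# Work found by the polling fallback; should trend toward zero once listeners are healthy
rate(tasker_polling_messages_found_total[15m])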

Rollback Plan

If event-driven mode fails:

# Immediate rollback to PollingOnly
[orchestration]
deployment_mode = "PollingOnly"

[orchestration.polling]
task_request_poll_interval_ms = 500    # Aggressive polling

Gradual rollback:

  1. Increase polling frequency in Hybrid mode
  2. Monitor for stability
  3. Disable event listeners once polling is stable
  4. Switch to PollingOnly mode

Troubleshooting

Event Listener Issues

Problem: Event listeners not receiving notifications

Diagnosis:

-- Check PostgreSQL LISTEN/NOTIFY is working
NOTIFY pgmq_message_ready, 'test';

# Check listener status
curl http://localhost:8080/health/detailed | jq '.event_listeners'
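
A quick end-to-end check is to LISTEN in one psql session and NOTIFY from a second; if the first session never reports the notification, the problem is at the PostgreSQL or network layer rather than in Tasker:

-- Session 1: subscribe to the channel the listeners use
LISTEN pgmq_message_ready;

-- Session 2: send a test notification
NOTIFY pgmq_message_ready, 'manual-test';

-- Session 1 should report (psql only prints notifications after its next statement, e.g. SELECT 1;):
--   Asynchronous notification "pgmq_message_ready" with payload "manual-test" received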

Solutions:

  • Verify PostgreSQL version supports LISTEN/NOTIFY (9.0+)
  • Check firewall rules allow persistent connections
  • Increase listener_reconnect_interval_ms if connections drop frequently
  • Switch to Hybrid or PollingOnly mode if issues persist

Polling Performance Issues

Problem: High CPU usage from polling

Diagnosis:

# Check polling frequency and batch sizes
curl http://localhost:8080/health/detailed | jq '.polling'

Solutions:

  • Increase polling intervals
  • Increase batch sizes to process more messages per poll
  • Switch to Hybrid or EventDrivenOnly mode for better performance
  • Scale horizontally to distribute polling load
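
For example, the first two solutions map onto the polling settings shown earlier:

# config/tasker/environments/production/orchestration.toml
[orchestration.polling]
# Poll less often...
task_request_poll_interval_ms = 5000      # up from 1000
step_result_poll_interval_ms = 2000       # up from 500

# ...and do more work per poll
batch_size = 50                           # up from 10
max_messages_per_poll = 500               # up from 100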

Database Connection Exhaustion

Problem: “connection pool exhausted” errors

Diagnosis:

-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'tasker_production';

-- Check max connections
SHOW max_connections;

Solutions:

  • Increase max_connections in database.toml
  • Increase PostgreSQL max_connections setting
  • Reduce number of replicas
  • Implement connection pooling at infrastructure level (PgBouncer)
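
If pool sizes cannot shrink further, a transaction-pooling PgBouncer in front of PostgreSQL multiplexes many client connections onto a smaller server-side pool. A minimal sketch (values illustrative, not tuned for any particular deployment):

; pgbouncer.ini
[databases]
tasker_production = host=postgres port=5432 dbname=tasker_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; multiplex server connections per transaction
max_client_conn = 1000         ; connections accepted from Tasker services
default_pool_size = 50         ; actual connections opened to PostgreSQL

Note that LISTEN/NOTIFY does not survive transaction pooling, so connections used by event listeners should bypass PgBouncer (or use session pooling) while regular query traffic goes through it.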

Best Practices

Configuration Management

  1. Use environment-specific overrides instead of modifying base configuration
  2. Validate configuration with config-validator before deployment
  3. Version control all configuration including environment overrides
  4. Use secrets management for sensitive values (passwords, keys)

Deployment Strategy

  1. Use mixed-mode architecture in production (EventDrivenOnly + PollingOnly)
    • Deploy 80-90% of orchestration pods in EventDrivenOnly mode for throughput
    • Deploy 10-20% of orchestration pods in PollingOnly mode as safety net
    • Single service load balances across all pods
  2. Alternative: Deploy all pods in Hybrid mode for simpler configuration
    • Trade-off: Less tuning flexibility, slightly higher resource usage
  3. Scale each mode independently based on workload characteristics
  4. Monitor deployment mode metrics to adjust ratios over time
  5. Test mixed-mode deployments in staging before production

Deployment Operations

  1. Always test configuration changes in staging first
  2. Use rolling updates with health checks to prevent downtime
  3. Monitor deployment mode health during and after deployments
  4. Keep polling capacity available even when event-driven is primary

Scaling Guidelines

  1. Mixed-mode orchestration: Scale EventDrivenOnly and PollingOnly deployments independently
    • Scale event-driven pods based on throughput requirements
    • Scale polling pods based on reliability requirements
  2. Single-mode orchestration: Scale based on API request rate and task initialization throughput
  3. Workers: Scale based on namespace-specific queue depth
  4. Database connections: Monitor and adjust pool sizes as replicas scale
  5. Use HPA for automatic scaling based on CPU/memory and custom metrics

Observability

  1. Enable comprehensive metrics in production
  2. Set up alerts for circuit breaker states, connection pool exhaustion
  3. Monitor deployment mode distribution in mixed-mode deployments
  4. Track event listener lag in EventDrivenOnly and Hybrid modes
  5. Monitor polling overhead to optimize resource usage
  6. Track step execution latency per namespace and handler

Summary

Tasker Core’s flexible deployment modes enable sophisticated production architectures:

Deployment Modes

  • Hybrid Mode: Event-driven with polling fallback in a single container
  • EventDrivenOnly Mode: Maximum throughput with ~10ms latency
  • PollingOnly Mode: Reliable safety net with traditional polling

Mixed-Mode Architecture (recommended for production at scale):

  • Deploy majority of orchestration pods in EventDrivenOnly mode for high throughput
  • Deploy minority of orchestration pods in PollingOnly mode as reliability safety net
  • Both deployments coordinate through atomic SQL operations with no conflicts
  • Scale each mode independently based on workload characteristics

Alternative: Deploy all pods in Hybrid mode for simpler configuration with automatic fallback.

The key insight: deployment modes exist not just for configuration tuning, but to enable mixing coordination strategies across containers to meet production requirements for both throughput and reliability.


← Back to Documentation Hub

Next: Observability | Benchmarks | Quick Start

Domain Events Architecture

Last Updated: 2025-12-01 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Events and Commands | Observability | States and Lifecycles

← Back to Documentation Hub


This document provides comprehensive documentation of the domain event system in tasker-core, covering event delivery modes, publisher patterns, subscriber implementations, and integration with the workflow orchestration system.

Overview

Domain Events vs System Events

The tasker-core system distinguishes between two types of events:

| Aspect | System Events | Domain Events |
|---|---|---|
| Purpose | Internal coordination | Business observability |
| Producers | Orchestration components | Step handlers during execution |
| Consumers | Event systems, command processors | External systems, analytics, audit |
| Delivery | PGMQ + LISTEN/NOTIFY | Configurable (Durable/Fast/Broadcast) |
| Semantics | At-least-once | Fire-and-forget (best effort) |

System events handle internal workflow coordination: task initialization, step enqueueing, result processing, and finalization. These are documented in Events and Commands.

Domain events enable business observability: payment processed, order fulfilled, inventory updated. Step handlers publish these events to enable external systems to react to business outcomes.

Key Design Principle: Non-Blocking Publication

Domain event publishing never fails the step. This is a fundamental design decision:

  • Event publish errors are logged with warn! level
  • Step execution continues regardless of publish outcome
  • Business logic success is independent of event delivery
  • A handler that successfully processes a payment should not fail if event publishing fails

#![allow(unused)]
fn main() {
// Event publishing is fire-and-forget
if let Err(e) = publisher.publish_event(event_name, payload, metadata).await {
    warn!(
        handler = self.handler_name(),
        event_name = event_name,
        error = %e,
        "Failed to publish domain event - step will continue"
    );
}
// Step continues regardless of publish result
}

Architecture

Data Flow

flowchart TB
    subgraph handlers["Step Handlers"]
        SH["Step Handler<br/>(Rust/Ruby)"]
    end

    SH -->|"publish_domain_event(name, payload)"| ER

    subgraph routing["Event Routing"]
        ER["EventRouter<br/>(Delivery Mode)"]
    end

    ER --> Durable
    ER --> Fast
    ER --> Broadcast

    subgraph modes["Delivery Modes"]
        Durable["Durable<br/>(PGMQ)"]
        Fast["Fast<br/>(In-Process)"]
        Broadcast["Broadcast<br/>(Both Paths)"]
    end

    Durable --> NEQ["Namespace<br/>Event Queue"]
    Fast --> IPB["InProcessEventBus"]
    Broadcast --> NEQ
    Broadcast --> IPB

    subgraph external["External Integration"]
        NEQ --> EC["External Consumer<br/>(Polling)"]
    end

    subgraph internal["Internal Subscribers"]
        IPB --> RS["Rust<br/>Subscribers"]
        IPB --> RFF["Ruby FFI<br/>Channel"]
    end

    style handlers fill:#e1f5fe
    style routing fill:#fff3e0
    style modes fill:#f3e5f5
    style external fill:#ffebee
    style internal fill:#e8f5e9

Component Summary

| Component | Purpose | Location |
|---|---|---|
| EventRouter | Routes events based on delivery mode | tasker-shared/src/events/domain_events/router.rs |
| DomainEventPublisher | Durable PGMQ-based publishing | tasker-shared/src/events/domain_events/publisher.rs |
| InProcessEventBus | Fast in-memory event dispatch | tasker-shared/src/events/domain_events/in_process_bus.rs |
| EventRegistry | Pattern-based subscriber registration | tasker-shared/src/events/domain_events/registry.rs |
| StepEventPublisher | Handler callback trait | tasker-shared/src/events/domain_events/step_event_publisher.rs |
| GenericStepEventPublisher | Default publisher implementation | tasker-shared/src/events/domain_events/generic_publisher.rs |

Delivery Modes

Overview

The domain event system supports three delivery modes, configured per event in YAML templates:

| Mode | Durability | Latency | Use Case |
|---|---|---|---|
| Durable | High (PGMQ) | Higher (~5-10ms) | External system integration, audit trails |
| Fast | Low (memory) | Lowest (<1ms) | Internal subscribers, metrics, real-time processing |
| Broadcast | High + Low | Both paths | Events needing both internal and external delivery |

Durable Mode (PGMQ) - External Integration Boundary

Durable events define the integration boundary between Tasker and external systems. Events are published to namespace-specific PGMQ queues where external consumers can poll and process them.

Key Design Decision: Tasker does NOT consume durable events internally. PGMQ serves as a lightweight, PostgreSQL-native alternative to external message brokers (Kafka, AWS SNS/SQS, RabbitMQ). External systems or middleware proxies can:

  • Poll PGMQ queues directly
  • Forward events to Kafka, SNS/SQS, or other messaging systems
  • Implement custom event processing pipelines

payment.processed → payments_domain_events (PGMQ queue) → External Systems
order.fulfilled   → fulfillment_domain_events (PGMQ queue) → External Systems

Characteristics:

  • Persisted in PostgreSQL (survives restarts)
  • For external consumer integration only
  • No internal Tasker polling or subscription
  • Consumer acknowledgment and retry handled by external consumers
  • Ordered within namespace

Implementation:

#![allow(unused)]
fn main() {
// DomainEventPublisher routes durable events to PGMQ
pub async fn publish_event(
    &self,
    event_name: &str,
    payload: Value,
    metadata: EventMetadata,
) -> TaskerResult<()> {
    let queue_name = format!("{}_domain_events", metadata.namespace);
    let message = DomainEventMessage {
        event_name: event_name.to_string(),
        payload,
        metadata,
    };

    self.message_client
        .send_message(&queue_name, &message)
        .await
}
}

Fast Mode (In-Process) - Internal Subscriber Pattern

Fast events are the only delivery mode with internal subscriber support. Events are dispatched immediately to in-memory subscribers within the Tasker worker process.

#![allow(unused)]
fn main() {
// InProcessEventBus provides dual-path delivery
pub struct InProcessEventBus {
    event_sender: tokio::sync::broadcast::Sender<DomainEvent>,
    ffi_event_sender: Option<mpsc::Sender<DomainEvent>>,
}
}

Characteristics:

  • Zero persistence overhead
  • Sub-millisecond latency
  • Lost on service restart
  • Internal to Tasker process only
  • Dual-path: Rust subscribers + Ruby FFI channel
  • Non-blocking broadcast semantics

Dual-Path Architecture:

InProcessEventBus
       │
       ├──► tokio::broadcast::Sender ──► Rust Subscribers (EventRegistry)
       │
       └──► mpsc::Sender ──► Ruby FFI Channel ──► Ruby Event Handlers

Use Cases:

  • Real-time metrics collection
  • Internal logging and telemetry
  • Secondary actions that sit outside the business-critical Task → WorkflowStep DAG
  • Examples: DataDog, Sentry, NewRelic, PagerDuty, Salesforce, Slack, Zapier

Broadcast Mode - Internal + External Delivery

Broadcast mode delivers events to both paths simultaneously: the fast in-process bus for internal subscribers AND the durable PGMQ queue for external systems. This ensures internal subscribers receive the same event shape as external consumers.

#![allow(unused)]
fn main() {
// EventRouter handles broadcast semantics
async fn route_event(&self, event: DomainEvent, mode: EventDeliveryMode) {
    match mode {
        EventDeliveryMode::Durable => {
            self.durable_publisher.publish(event).await;
        }
        EventDeliveryMode::Fast => {
            self.in_process_bus.publish(event).await;
        }
        EventDeliveryMode::Broadcast => {
            // Send to both paths concurrently
            let (durable, fast) = tokio::join!(
                self.durable_publisher.publish(event.clone()),
                self.in_process_bus.publish(event)
            );
            // Log errors but don't fail
        }
    }
}
}

When to Use Broadcast:

  • Internal subscribers need the same event that external systems receive
  • Real-time internal metrics tracking for events also exported externally
  • Audit logging both internally and to external compliance systems

Important: Data published via broadcast goes to BOTH the internal process AND the public PGMQ boundary. Do not use broadcast for sensitive internal-only data (use fast for those).

Publisher Patterns

Default Publisher (GenericStepEventPublisher)

The default publisher automatically handles event publication for step handlers:

#![allow(unused)]
fn main() {
pub struct GenericStepEventPublisher {
    router: Arc<EventRouter>,
    default_delivery_mode: EventDeliveryMode,
}

impl GenericStepEventPublisher {
    /// Publish event with metadata extracted from step context
    pub async fn publish(
        &self,
        step_data: &TaskSequenceStep,
        event_name: &str,
        payload: Value,
    ) -> TaskerResult<()> {
        let metadata = EventMetadata {
            task_uuid: step_data.task.task.task_uuid,
            step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
            step_name: Some(step_data.workflow_step.name.clone()),
            namespace: step_data.task.namespace_name.clone(),
            correlation_id: step_data.task.task.correlation_id,
            fired_at: Utc::now(),
            fired_by: "generic_publisher".to_string(),
        };

        self.router.route_event(event_name, payload, metadata).await
    }
}
}

Custom Publishers

Custom publishers extend TaskerCore::DomainEvents::BasePublisher (Ruby) to provide specialized event handling with payload transformation, conditional publishing, and lifecycle hooks.

Real Example: PaymentEventPublisher (workers/ruby/spec/handlers/examples/domain_events/publishers/payment_event_publisher.rb):

# Custom publisher for payment-related domain events
# Demonstrates durable delivery mode with custom payload enrichment
module DomainEvents
  module Publishers
    class PaymentEventPublisher < TaskerCore::DomainEvents::BasePublisher
      # Must match the `publisher:` field in YAML
      def name
        'DomainEvents::Publishers::PaymentEventPublisher'
      end

      # Transform step result into payment event payload
      def transform_payload(step_result, event_declaration, step_context = nil)
        result = step_result[:result] || {}
        event_name = event_declaration[:name]

        if step_result[:success] && event_name&.include?('processed')
          build_success_payload(result, step_result, step_context)
        elsif !step_result[:success] && event_name&.include?('failed')
          build_failure_payload(result, step_result, step_context)
        else
          result
        end
      end

      # Determine if this event should be published
      def should_publish?(step_result, event_declaration, step_context = nil)
        result = step_result[:result] || {}
        event_name = event_declaration[:name]

        # For success events, verify we have transaction data
        if event_name&.include?('processed')
          return step_result[:success] && result[:transaction_id].present?
        end

        # For failure events, verify we have error info
        if event_name&.include?('failed')
          metadata = step_result[:metadata] || {}
          return !step_result[:success] && metadata[:error_code].present?
        end

        true  # Default: always publish
      end

      # Add execution metrics to event metadata
      def additional_metadata(step_result, event_declaration, step_context = nil)
        metadata = step_result[:metadata] || {}
        {
          execution_time_ms: metadata[:execution_time_ms],
          publisher_type: 'custom',
          publisher_name: name,
          payment_provider: metadata[:payment_provider]
        }
      end

      private

      def build_success_payload(result, step_result, step_context)
        {
          transaction_id: result[:transaction_id],
          amount: result[:amount],
          currency: result[:currency] || 'USD',
          payment_method: result[:payment_method] || 'credit_card',
          processed_at: result[:processed_at] || Time.now.iso8601,
          delivery_mode: 'durable',
          publisher: name
        }
      end
    end
  end
end

YAML Configuration for Custom Publisher:

steps:
  - name: process_payment
    publishes_events:
      - name: payment.processed
        condition: success
        delivery_mode: durable
        publisher: DomainEvents::Publishers::PaymentEventPublisher
      - name: payment.failed
        condition: failure
        delivery_mode: durable
        publisher: DomainEvents::Publishers::PaymentEventPublisher

YAML Event Declaration

Events are declared in task template YAML files using the publishes_events field at the step level:

# config/tasks/payments/credit_card_payment/1.0.0.yaml
name: credit_card_payment
namespace_name: payments
version: 1.0.0
description: Process credit card payments with validation and fraud detection

# Task-level domain events (optional)
domain_events: []

steps:
  - name: process_payment
    description: Process the payment transaction
    handler:
      callable: PaymentProcessing::StepHandler::ProcessPaymentHandler
      initialization:
        gateway_url: "${PAYMENT_GATEWAY_URL}"
    dependencies:
      - validate_payment
    retry:
      retryable: true
      limit: 3
      backoff: exponential
    timeout_seconds: 120

    # Step-level event declarations
    publishes_events:
      - name: payment.processed
        description: "Payment successfully processed"
        condition: success  # success, failure, retryable_failure, permanent_failure, always
        schema:
          type: object
          required: [transaction_id, amount]
          properties:
            transaction_id: { type: string }
            amount: { type: number }
        delivery_mode: broadcast  # durable, fast, or broadcast
        publisher: PaymentEventPublisher  # optional custom publisher

      - name: payment.failed
        description: "Payment processing failed permanently"
        condition: permanent_failure
        schema:
          type: object
          required: [error_code, reason]
          properties:
            error_code: { type: string }
            reason: { type: string }
        delivery_mode: durable

Publication Conditions:

  • success: Publish only when step completes successfully
  • failure: Publish on any step failure (backward compatible)
  • retryable_failure: Publish only on retryable failures (step can be retried)
  • permanent_failure: Publish only on permanent failures (exhausted retries or non-retryable)
  • always: Publish regardless of step outcome

Event Declaration Fields:

  • name: Event name in dotted notation (e.g., payment.processed)
  • description: Human-readable description of when this event is published
  • condition: When to publish (defaults to success)
  • schema: JSON Schema for validating event payloads
  • delivery_mode: Delivery mode (defaults to durable)
  • publisher: Optional custom publisher class name

Subscriber Patterns

Subscriber patterns apply only to fast (in-process) events. Durable events are consumed by external systems, not by internal Tasker subscribers.

Rust Subscribers (InProcessEventBus)

Rust subscribers are registered with the InProcessEventBus using the EventHandler type. Subscribers are async closures that receive DomainEvent instances.

Real Example: Logging Subscriber (workers/rust/src/event_subscribers/logging_subscriber.rs):

#![allow(unused)]
fn main() {
use std::sync::Arc;
use tasker_shared::events::registry::EventHandler;
use tracing::info;

/// Create a logging subscriber that logs all events matching a pattern
pub fn create_logging_subscriber(prefix: &str) -> EventHandler {
    let prefix = prefix.to_string();

    Arc::new(move |event| {
        let prefix = prefix.clone();

        Box::pin(async move {
            let step_name = event.metadata.step_name.as_deref().unwrap_or("unknown");

            info!(
                prefix = %prefix,
                event_name = %event.event_name,
                event_id = %event.event_id,
                task_uuid = %event.metadata.task_uuid,
                step_name = %step_name,
                namespace = %event.metadata.namespace,
                correlation_id = %event.metadata.correlation_id,
                fired_at = %event.metadata.fired_at,
                "Domain event received"
            );

            Ok(())
        })
    })
}
}

Real Example: Metrics Collector (workers/rust/src/event_subscribers/metrics_subscriber.rs):

#![allow(unused)]
fn main() {
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

/// Collects metrics from domain events (thread-safe)
pub struct EventMetricsCollector {
    events_received: AtomicU64,
    success_events: AtomicU64,
    failure_events: AtomicU64,
    // ... additional fields
}

impl EventMetricsCollector {
    pub fn new() -> Arc<Self> {
        Arc::new(Self {
            events_received: AtomicU64::new(0),
            success_events: AtomicU64::new(0),
            failure_events: AtomicU64::new(0),
        })
    }

    /// Create an event handler for this collector
    pub fn create_handler(self: &Arc<Self>) -> EventHandler {
        let metrics = Arc::clone(self);

        Arc::new(move |event| {
            let metrics = Arc::clone(&metrics);

            Box::pin(async move {
                metrics.events_received.fetch_add(1, Ordering::Relaxed);

                if event.payload.execution_result.success {
                    metrics.success_events.fetch_add(1, Ordering::Relaxed);
                } else {
                    metrics.failure_events.fetch_add(1, Ordering::Relaxed);
                }

                Ok(())
            })
        })
    }

    pub fn events_received(&self) -> u64 {
        self.events_received.load(Ordering::Relaxed)
    }
}
}

Registration with InProcessEventBus:

#![allow(unused)]
fn main() {
use tasker_worker::worker::in_process_event_bus::InProcessEventBus;

let mut bus = InProcessEventBus::new(config);

// Subscribe to all events
bus.subscribe("*", create_logging_subscriber("[ALL]")).unwrap();

// Subscribe to specific patterns
bus.subscribe("payment.*", create_logging_subscriber("[PAYMENT]")).unwrap();

// Use metrics collector
let metrics = EventMetricsCollector::new();
bus.subscribe("*", metrics.create_handler()).unwrap();
}

Ruby Subscribers (BaseSubscriber)

Ruby subscribers extend TaskerCore::DomainEvents::BaseSubscriber and use the class-level subscribes_to pattern declaration.

Real Example: LoggingSubscriber (workers/ruby/spec/handlers/examples/domain_events/subscribers/logging_subscriber.rb):

# Example logging subscriber for fast/in-process domain events
module DomainEvents
  module Subscribers
    class LoggingSubscriber < TaskerCore::DomainEvents::BaseSubscriber
      # Subscribe to all events using pattern matching
      subscribes_to '*'

      # Handle any domain event by logging its details
      def handle(event)
        event_name = event[:event_name]
        metadata = event[:metadata] || {}

        logger.info "[LoggingSubscriber] Event: #{event_name}"
        logger.info "  Task: #{metadata[:task_uuid]}"
        logger.info "  Step: #{metadata[:step_name]}"
        logger.info "  Namespace: #{metadata[:namespace]}"
        logger.info "  Correlation: #{metadata[:correlation_id]}"
      end
    end
  end
end

Real Example: MetricsSubscriber (workers/ruby/spec/handlers/examples/domain_events/subscribers/metrics_subscriber.rb):

# Example metrics subscriber for fast/in-process domain events
module DomainEvents
  module Subscribers
    class MetricsSubscriber < TaskerCore::DomainEvents::BaseSubscriber
      subscribes_to '*'

      class << self
        attr_accessor :events_received, :success_events, :failure_events,
                      :events_by_namespace, :last_event_at

        def reset_counters!
          @mutex = Mutex.new
          @events_received = 0
          @success_events = 0
          @failure_events = 0
          @events_by_namespace = Hash.new(0)
          @last_event_at = nil
        end
      end

      reset_counters!

      def handle(event)
        event_name = event[:event_name]
        metadata = event[:metadata] || {}
        execution_result = event[:execution_result] || {}

        self.class.increment(:events_received)

        if execution_result[:success]
          self.class.increment(:success_events)
        else
          self.class.increment(:failure_events)
        end

        namespace = metadata[:namespace] || 'unknown'
        self.class.increment_hash(:events_by_namespace, namespace)
        self.class.set(:last_event_at, Time.now)
      end
    end
  end
end

Registration in Bootstrap:

# Register subscribers with the registry
registry = TaskerCore::DomainEvents::SubscriberRegistry.instance
registry.register(DomainEvents::Subscribers::LoggingSubscriber)
registry.register(DomainEvents::Subscribers::MetricsSubscriber)
registry.start_all!

# Later, query metrics
puts "Total events: #{DomainEvents::Subscribers::MetricsSubscriber.events_received}"
puts "By namespace: #{DomainEvents::Subscribers::MetricsSubscriber.events_by_namespace}"

External PGMQ Consumers (Durable Events)

Durable events are published to PGMQ queues for external consumption. Tasker does not provide internal consumers for these queues. External systems can consume events using:

  1. Direct PGMQ Polling: Query pgmq.q_{namespace}_domain_events tables directly
  2. PGMQ Client Libraries: Use pgmq client libraries in Python, Node.js, Go, etc.
  3. Middleware Proxies: Build adapters that forward events to Kafka, SNS/SQS, etc.

Example: External Python Consumer:

import pgmq

# Connect to PGMQ
queue = pgmq.Queue("payments_domain_events", dsn="postgresql://...")

# Poll for events
while True:
    messages = queue.read(batch_size=50, vt=30)
    for msg in messages:
        process_event(msg.message)
        queue.delete(msg.msg_id)

Configuration

Domain event system configuration is part of the worker configuration in worker.toml files.

TOML Configuration

# config/tasker/base/worker.toml

# In-process event bus configuration for fast domain event delivery
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 2000        # Channel capacity for broadcast events
log_subscriber_errors = true        # Log errors from event subscribers
dispatch_timeout_ms = 5000          # Timeout for event dispatch

# Domain Event System MPSC Configuration
[worker.mpsc_channels.domain_events]
command_buffer_size = 1000          # Channel capacity for domain event commands
shutdown_drain_timeout_ms = 5000    # Time to drain events on shutdown
log_dropped_events = true           # Log when events are dropped due to backpressure

# In-process event settings (part of worker event systems)
[worker.event_systems.worker.metadata.in_process_events]
ffi_integration_enabled = true      # Enable Ruby/Python FFI event channel
deduplication_cache_size = 10000    # Event deduplication cache size

Environment Overrides

Test Environment (config/tasker/environments/test/worker.toml):

# In-process event bus - smaller buffers for testing
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 1000
log_subscriber_errors = true
dispatch_timeout_ms = 2000

# Domain Event System - smaller buffers for testing
[worker.mpsc_channels.domain_events]
command_buffer_size = 100
shutdown_drain_timeout_ms = 1000
log_dropped_events = true

[worker.event_systems.worker.metadata.in_process_events]
deduplication_cache_size = 1000

Production Environment (config/tasker/environments/production/worker.toml):

# In-process event bus - large buffers for production throughput
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 5000
log_subscriber_errors = false       # Reduce log noise in production
dispatch_timeout_ms = 10000

# Domain Event System - large buffers for production throughput
[worker.mpsc_channels.domain_events]
command_buffer_size = 5000
shutdown_drain_timeout_ms = 10000
log_dropped_events = false          # Reduce log noise in production

Configuration Parameters

| Parameter | Description | Default |
|---|---|---|
| broadcast_buffer_size | Capacity of the broadcast channel for fast events | 2000 |
| log_subscriber_errors | Whether to log subscriber errors | true |
| dispatch_timeout_ms | Timeout for event dispatch to subscribers | 5000 |
| command_buffer_size | Capacity of domain event command channel | 1000 |
| shutdown_drain_timeout_ms | Time to drain pending events on shutdown | 5000 |
| log_dropped_events | Whether to log events dropped due to backpressure | true |
| ffi_integration_enabled | Enable FFI event channel for Ruby/Python | true |
| deduplication_cache_size | Size of event deduplication cache | 10000 |

Integration with Step Execution

Event-Driven Domain Event Publishing

The worker uses an event-driven command pattern for step execution and domain event publishing. Nothing blocks - domain events are dispatched after successful orchestration notification using fire-and-forget semantics.

Flow (tasker-worker/src/worker/command_processor.rs):

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  FFI Handler    │────►│ Completion       │────►│ WorkerProcessor     │
│  (Ruby/Rust)    │     │ Channel          │     │ Command Loop        │
└─────────────────┘     └──────────────────┘     └──────────┬──────────┘
                                                            │
                        ┌───────────────────────────────────┴───────────────┐
                        │                                                   │
                        ▼                                                   ▼
              ┌─────────────────────┐                          ┌────────────────────┐
              │ 1. Send result to   │                          │ 2. Dispatch domain │
              │    orchestration    │──── on success ─────────►│    events          │
              │    (PGMQ)           │                          │    (fire-and-forget)│
              └─────────────────────┘                          └────────────────────┘

Implementation:

#![allow(unused)]
fn main() {
// tasker-worker/src/worker/command_processor.rs (lines 512-525)
// Worker command processor receives step completions via FFI channel
match self.handle_send_step_result(step_result.clone()).await {
    Ok(()) => {
        // Dispatch domain events AFTER successful orchestration notification.
        // Domain events are declarative of what HAS happened - the step is only
        // truly complete once orchestration has been notified successfully.
        // Fire-and-forget semantics (try_send) - never blocks the worker.
        self.dispatch_domain_events(&step_result, None);
        info!(
            worker_id = %self.worker_id,
            step_uuid = %step_result.step_uuid,
            "Step completion forwarded to orchestration successfully"
        );
    }
    Err(e) => {
        // Don't dispatch domain events - orchestration wasn't notified,
        // so the step isn't truly complete from the system's perspective
        error!(
            worker_id = %self.worker_id,
            step_uuid = %step_result.step_uuid,
            error = %e,
            "Failed to forward step completion to orchestration"
        );
    }
}
}

Domain Event Dispatch (fire-and-forget):

#![allow(unused)]
fn main() {
// tasker-worker/src/worker/command_processor.rs (lines 362-432)
fn dispatch_domain_events(&mut self, step_result: &StepExecutionResult, correlation_id: Option<Uuid>) {
    // Retrieve cached step context (stored when step was claimed)
    let task_sequence_step = match self.step_execution_contexts.remove(&step_result.step_uuid) {
        Some(ctx) => ctx,
        None => return, // No context = can't build events
    };

    // Build events from step definition's publishes_events declarations
    for event_def in &task_sequence_step.step_definition.publishes_events {
        // Check publication condition before building event
        if !event_def.should_publish(step_result.success) {
            continue; // Skip events whose condition doesn't match
        }

        let event = DomainEventToPublish {
            event_name: event_def.name.clone(),
            delivery_mode: event_def.delivery_mode,
            business_payload: step_result.result.clone(),
            metadata: EventMetadata { /* ... */ },
            task_sequence_step: task_sequence_step.clone(),
            execution_result: step_result.clone(),
        };
        domain_events.push(event);
    }

    // Fire-and-forget dispatch - try_send never blocks
    let dispatched = handle.dispatch_events(domain_events, publisher_name, correlation);

    if !dispatched {
        warn!(
            step_uuid = %step_result.step_uuid,
            "Domain event dispatch failed - channel full (events dropped)"
        );
    }
}
}

Key Design Decisions:

  • Events only after orchestration success: Domain events are declarative of what HAS happened. If orchestration notification fails, the step isn’t truly complete from the system’s perspective.
  • Fire-and-forget via try_send: Never blocks the worker command loop. If the channel is full, events are dropped and logged.
  • Context caching: Step execution context is cached when the step is claimed, then retrieved for event building after completion.

Correlation ID Propagation

Domain events maintain correlation IDs for end-to-end distributed tracing. The correlation ID originates from the task and flows through all step executions and domain events.

EventMetadata Structure (tasker-shared/src/events/domain_events.rs):

#![allow(unused)]
fn main() {
pub struct EventMetadata {
    pub task_uuid: Uuid,
    pub step_uuid: Option<Uuid>,
    pub step_name: Option<String>,
    pub namespace: String,
    pub correlation_id: Uuid,      // From task for end-to-end tracing
    pub fired_at: DateTime<Utc>,
    pub fired_by: String,          // Publisher identifier (worker_id)
}
}

Getting Correlation ID via API:

Use the orchestration API to get the correlation ID for a task:

# Get task details including correlation_id
curl http://localhost:8080/v1/tasks/{task_uuid} | jq '.correlation_id'

# Response includes correlation_id
{
  "task_uuid": "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5",
  "correlation_id": "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5",
  "status": "complete",
  ...
}

Tracing Events in PGMQ:

# Find all durable events for a correlation ID
psql $DATABASE_URL -c "
  SELECT
    message->>'event_name' as event,
    message->'metadata'->>'step_name' as step,
    message->'metadata'->>'fired_at' as fired_at
  FROM pgmq.q_payments_domain_events
  WHERE message->'metadata'->>'correlation_id' = '0199c3e0-ccdb-7581-87ab-3f67daeaa4a5'
  ORDER BY message->'metadata'->>'fired_at';
"

Metrics and Observability

OpenTelemetry Metrics

Domain event publication emits OpenTelemetry counter metrics (tasker-shared/src/events/domain_events.rs:207-219):

#![allow(unused)]
fn main() {
// Metric emitted on every domain event publication
let counter = opentelemetry::global::meter("tasker")
    .u64_counter("tasker.domain_events.published.total")
    .with_description("Total number of domain events published")
    .build();

counter.add(1, &[
    opentelemetry::KeyValue::new("event_name", event_name.to_string()),
    opentelemetry::KeyValue::new("namespace", metadata.namespace.clone()),
]);
}

Prometheus Metrics Endpoint

The orchestration service exposes Prometheus-format metrics:

# Get Prometheus metrics from orchestration service
curl http://localhost:8080/metrics

# Get Prometheus metrics from worker service
curl http://localhost:8081/metrics

OpenTelemetry Tracing

Domain event publication is instrumented with tracing spans (tasker-shared/src/events/domain_events.rs:157-161):

#![allow(unused)]
fn main() {
#[instrument(skip(self, payload, metadata), fields(
    event_name = %event_name,
    namespace = %metadata.namespace,
    correlation_id = %metadata.correlation_id
))]
pub async fn publish_event(
    &self,
    event_name: &str,
    payload: DomainEventPayload,
    metadata: EventMetadata,
) -> Result<Uuid, DomainEventError> {
    // ... implementation with debug! and info! logs including correlation_id
}
}

Grafana Query Examples

Loki Query - Domain Events by Correlation ID:

{service_name="tasker-worker"} |= "Domain event published" | json | correlation_id = "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"

Loki Query - All Domain Event Publications:

{service_name=~"tasker.*"} |= "Domain event" | json | line_format "{{.event_name}} - {{.namespace}} - {{.correlation_id}}"

Tempo Query - Trace by Correlation ID:

{resource.service.name="tasker-worker"} && {span.correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Prometheus Query - Event Publication Rate by Namespace:

sum by (namespace) (rate(tasker_domain_events_published_total[5m]))

Prometheus Query - Event Publication Rate by Event Name:

topk(10, sum by (event_name) (rate(tasker_domain_events_published_total[5m])))

Structured Log Fields

Domain event logs include structured fields for querying:

| Field | Description | Example |
|---|---|---|
| event_id | Unique event UUID (v7, time-ordered) | 0199c3e0-d123-... |
| event_name | Event name in dot notation | payment.processed |
| queue_name | Target PGMQ queue | payments_domain_events |
| task_uuid | Parent task UUID | 0199c3e0-ccdb-... |
| correlation_id | End-to-end trace correlation | 0199c3e0-ccdb-... |
| namespace | Event namespace | payments |
| message_id | PGMQ message ID | 12345 |

Best Practices

1. Choose the Right Delivery Mode

| Scenario | Recommended Mode | Rationale |
|---|---|---|
| External system integration | Durable | Reliable delivery to external consumers |
| Internal metrics/telemetry | Fast | Internal subscribers only, low latency |
| Internal + external needs | Broadcast | Same event shape to both internal and external |
| Audit trails for compliance | Durable | Persisted for external audit systems |
| Real-time internal dashboards | Fast | In-process subscribers handle immediately |

Key Decision Criteria:

  • Need internal Tasker subscribers? → Use fast or broadcast
  • Need external system integration? → Use durable or broadcast
  • Internal-only, sensitive data? → Use fast (never reaches PGMQ boundary)
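
To make the routing concrete, here is a minimal sketch of mode-based routing. The DeliveryMode variants mirror the table above, but the enum and the two sink functions are illustrative stand-ins, not the actual tasker-core publisher API.

#![allow(unused)]
fn main() {
// Sketch only: Fast stays in-process and never crosses the PGMQ boundary,
// Durable persists to PGMQ for external consumers, Broadcast does both.
enum DeliveryMode { Fast, Durable, Broadcast }

fn route_event(mode: DeliveryMode, event: serde_json::Value) {
    match mode {
        DeliveryMode::Fast => publish_in_process(&event),
        DeliveryMode::Durable => send_to_pgmq(&event),
        DeliveryMode::Broadcast => {
            publish_in_process(&event);
            send_to_pgmq(&event);
        }
    }
}

// Hypothetical sinks standing in for the real publisher internals.
fn publish_in_process(_event: &serde_json::Value) { /* in-process subscribers */ }
fn send_to_pgmq(_event: &serde_json::Value) { /* durable domain events queue */ }
}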

2. Design Event Payloads

Do:

#![allow(unused)]
fn main() {
json!({
    "transaction_id": "TXN-123",
    "amount": 99.99,
    "currency": "USD",
    "timestamp": "2025-12-01T10:00:00Z",
    "idempotency_key": step_uuid
})
}

Don’t:

#![allow(unused)]
fn main() {
json!({
    "data": "payment processed",  // No structure
    "info": full_database_record  // Too much data
})
}

3. Handle Subscriber Failures Gracefully

#![allow(unused)]
fn main() {
#[async_trait]
impl EventSubscriber for MySubscriber {
    async fn handle(&self, event: &DomainEvent) -> TaskerResult<()> {
        // Wrap in timeout
        match timeout(Duration::from_secs(5), self.process(event)).await {
            Ok(result) => result,
            Err(_) => {
                warn!(event = %event.name, "Subscriber timeout");
                Ok(()) // Don't fail the dispatch
            }
        }
    }
}
}

4. Use Correlation IDs for Debugging

#![allow(unused)]
fn main() {
// Always include correlation ID in logs
info!(
    correlation_id = %event.metadata.correlation_id,
    event_name = %event.name,
    namespace = %event.metadata.namespace,
    "Processing domain event"
);
}


This domain event architecture provides a flexible, reliable foundation for business observability in the tasker-core workflow orchestration system.

Events and Commands Architecture

Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Messaging Abstraction | States and Lifecycles | Deployment Patterns

← Back to Documentation Hub


This document provides comprehensive documentation of the event-driven and command pattern architecture in tasker-core, covering the unified event system foundation, orchestration and worker implementations, and the flow of tasks and steps through the system.

Overview

The tasker-core system implements a sophisticated hybrid architecture that combines:

  1. Event-Driven Systems: Real-time coordination using PostgreSQL LISTEN/NOTIFY and PGMQ notifications
  2. Command Pattern: Async command processors using tokio mpsc channels for orchestration and worker operations
  3. Hybrid Deployment Modes: PollingOnly, EventDrivenOnly, and Hybrid modes with fallback polling
  4. Queue-Based Communication: Provider-agnostic message queues (PGMQ or RabbitMQ) for reliable step execution and result processing

This architecture eliminates polling complexity while maintaining resilience through fallback mechanisms and provides horizontal scaling capabilities with atomic operation guarantees.

Event System Foundation

EventDrivenSystem Trait

The foundation of the event architecture is defined in tasker-shared/src/event_system/event_driven.rs with the EventDrivenSystem trait:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait EventDrivenSystem: Send + Sync {
    type SystemId: Send + Sync + Clone + fmt::Display + fmt::Debug;
    type Event: Send + Sync + Clone + fmt::Debug;
    type Config: Send + Sync + Clone;
    type Statistics: EventSystemStatistics + Send + Sync + Clone;

    // Core lifecycle methods
    async fn start(&mut self) -> Result<(), DeploymentModeError>;
    async fn stop(&mut self) -> Result<(), DeploymentModeError>;
    fn is_running(&self) -> bool;

    // Event processing
    async fn process_event(&self, event: Self::Event) -> Result<(), DeploymentModeError>;

    // Monitoring and health
    async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError>;
    fn statistics(&self) -> Self::Statistics;

    // Configuration
    fn deployment_mode(&self) -> DeploymentMode;
    fn config(&self) -> &Self::Config;
}
}

Deployment Modes

The system supports three deployment modes for different operational requirements:

PollingOnly Mode

  • Traditional polling-based coordination
  • No event listeners or real-time notifications
  • Reliable fallback for environments with networking restrictions
  • Higher latency but guaranteed operation

EventDrivenOnly Mode

  • Pure event-driven coordination using PostgreSQL LISTEN/NOTIFY
  • Real-time response to database changes
  • Lowest latency for step discovery and task coordination
  • Requires reliable PostgreSQL connections

Hybrid Mode

  • Primary event-driven coordination with polling fallback
  • Best of both worlds: real-time when possible, reliable when needed
  • Automatic fallback during connection issues
  • Production-ready with resilience guarantees

Selecting a Deployment Mode

The Tasker system is built for distributed deployment, with multiple instances of both orchestration core servers and worker servers operating simultaneously. Separating deployment mode lets operators scale event-driven-only processing nodes to meet demand while keeping polling-only nodes running with a reasonable fallback polling interval and batch size, or deploy everything in hybrid mode and tune these settings instance by instance.
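
As a rough illustration, a deployment might select the mode per instance at startup. This is a sketch under assumptions: the environment variable name and the local enum are illustrative, and the real DeploymentMode type and configuration loading live in tasker-shared (see Configuration Management below).

#![allow(unused)]
fn main() {
// Illustrative per-instance selection: scale-out nodes run event-driven only,
// a small pool of fallback nodes polls, and everything else runs hybrid.
#[derive(Debug, Clone, Copy)]
enum DeploymentMode { PollingOnly, EventDrivenOnly, Hybrid }

fn select_mode() -> DeploymentMode {
    match std::env::var("TASKER_DEPLOYMENT_MODE").as_deref() {
        Ok("EventDrivenOnly") => DeploymentMode::EventDrivenOnly,
        Ok("PollingOnly") => DeploymentMode::PollingOnly,
        _ => DeploymentMode::Hybrid, // resilient default: events with polling fallback
    }
}
}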

Event Types and Sources

Queue-Level Events (Provider-Agnostic)

The system supports multiple messaging backends through MessageNotification:

#![allow(unused)]
fn main() {
pub enum MessageNotification {
    /// Signal-only notification (PGMQ style)
    /// Indicates a message is available but requires separate fetch
    Available {
        queue_name: String,
        msg_id: Option<i64>,
    },

    /// Full message notification (RabbitMQ style)
    /// Contains the complete message payload
    Message(QueuedMessage<Vec<u8>>),
}
}

Event Sources by Provider:

| Provider | Notification Type | Fetch Required | Fallback Polling |
|---|---|---|---|
| PGMQ | Available | Yes (read by msg_id) | Required |
| RabbitMQ | Message | No (full payload) | Not needed |
| InMemory | Message | No | Not needed |

Common Event Types:

  • Step Results: Worker completion notifications
  • Task Requests: New task initialization requests
  • Message Ready Events: Queue message availability notifications
  • Transport: Provider-agnostic via MessagingProvider.subscribe_many()

Command Pattern Architecture

Command Processor Pattern

Both orchestration and worker systems implement the command pattern to replace complex polling-based coordinators:

Benefits:

  • No Polling Loops (except intentional fallback polling): pure tokio mpsc command processing
  • Simplified Architecture: ~100-line command processors replace 1000+ lines of complex coordination code
  • Race Condition Prevention: Atomic operations through proper delegation
  • Observability Preservation: Maintains metrics through delegated components

Command Flow Patterns

Both systems follow consistent command processing patterns:

sequenceDiagram
    participant Client
    participant CommandChannel
    participant Processor
    participant Delegate
    participant Response

    Client->>CommandChannel: Send Command + ResponseChannel
    CommandChannel->>Processor: Receive Command
    Processor->>Delegate: Delegate to Business Logic Component
    Delegate-->>Processor: Return Result
    Processor->>Response: Send Result via ResponseChannel
    Response-->>Client: Receive Result
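
In code, this pattern boils down to a tokio mpsc receive loop with a oneshot response channel per command. The sketch below uses simplified stand-ins for the real OrchestrationCommand/WorkerCommand enums and CommandResponder type.

#![allow(unused)]
fn main() {
use tokio::sync::{mpsc, oneshot};

// Simplified command carrying a oneshot responder (the "ResponseChannel"
// in the diagram above).
enum Command {
    DoWork { input: String, resp: oneshot::Sender<Result<String, String>> },
}

async fn command_processor(mut rx: mpsc::Receiver<Command>) {
    while let Some(cmd) = rx.recv().await {
        match cmd {
            Command::DoWork { input, resp } => {
                // Delegate to the business-logic component, then reply.
                let result = Ok(format!("processed: {input}"));
                let _ = resp.send(result); // receiver may have gone away
            }
        }
    }
}

// Client side: send a command and await its response.
async fn client(tx: mpsc::Sender<Command>) -> Result<String, String> {
    let (resp_tx, resp_rx) = oneshot::channel();
    tx.send(Command::DoWork { input: "step result".into(), resp: resp_tx })
        .await
        .map_err(|_| "processor stopped".to_string())?;
    resp_rx.await.map_err(|_| "no response".to_string())?
}
}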

Orchestration Event Systems

OrchestrationEventSystem

Implemented in tasker-orchestration/src/orchestration/event_systems/orchestration_event_system.rs:

#![allow(unused)]
fn main() {
pub struct OrchestrationEventSystem {
    system_id: String,
    deployment_mode: DeploymentMode,
    queue_listener: Option<OrchestrationQueueListener>,
    fallback_poller: Option<OrchestrationFallbackPoller>,
    context: Arc<SystemContext>,
    orchestration_core: Arc<OrchestrationCore>,
    command_sender: mpsc::Sender<OrchestrationCommand>,
    // ... statistics and state
}
}

Orchestration Command Types

The command processor handles both full-message and signal-only notification types:

#![allow(unused)]
fn main() {
pub enum OrchestrationCommand {
    // Task lifecycle
    InitializeTask { request: TaskRequestMessage, resp: CommandResponder<TaskInitializeResult> },
    ProcessStepResult { result: StepExecutionResult, resp: CommandResponder<StepProcessResult> },
    FinalizeTask { task_uuid: Uuid, resp: CommandResponder<TaskFinalizationResult> },

    // Full message processing (RabbitMQ style - MessageNotification::Message)
    // Used when provider delivers complete message payload
    ProcessStepResultFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<StepProcessResult> },
    InitializeTaskFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<TaskInitializeResult> },
    FinalizeTaskFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<TaskFinalizationResult> },

    // Signal-only processing (PGMQ style - MessageNotification::Available)
    // Used when provider sends notification that requires separate fetch
    ProcessStepResultFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<StepProcessResult> },
    InitializeTaskFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<TaskInitializeResult> },
    FinalizeTaskFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<TaskFinalizationResult> },

    // Task readiness (database events)
    ProcessTaskReadiness { task_uuid: Uuid, namespace: String, priority: i32, ready_steps: i32, triggered_by: String, resp: CommandResponder<TaskReadinessResult> },

    // System operations
    GetProcessingStats { resp: CommandResponder<OrchestrationProcessingStats> },
    HealthCheck { resp: CommandResponder<SystemHealth> },
    Shutdown { resp: CommandResponder<()> },
}
}

Command Routing by Notification Type:

  • MessageNotification::Message -> *FromMessage commands (immediate processing)
  • MessageNotification::Available -> *FromMessageEvent commands (requires fetch)
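
Sketched against the MessageNotification enum shown earlier, a listener's routing decision looks roughly like this; the two sender functions are hypothetical stand-ins for pushing the corresponding commands onto the command channel.

#![allow(unused)]
fn main() {
// Sketch: map provider notifications onto the two command families.
fn route(notification: MessageNotification) {
    match notification {
        MessageNotification::Message(message) => {
            // Full payload already delivered (RabbitMQ / InMemory):
            // dispatch a *FromMessage command for immediate processing.
            send_process_from_message(message);
        }
        MessageNotification::Available { queue_name, msg_id } => {
            // Signal only (PGMQ): dispatch a *FromMessageEvent command;
            // the processor reads the payload by msg_id before handling it.
            send_process_from_message_event(queue_name, msg_id);
        }
    }
}

// Hypothetical senders standing in for command_tx.send(OrchestrationCommand::...)
fn send_process_from_message(_message: QueuedMessage<Vec<u8>>) {}
fn send_process_from_message_event(_queue_name: String, _msg_id: Option<i64>) {}
}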

Orchestration Queue Architecture

The orchestration system coordinates multiple queue types:

  1. orchestration_step_results: Step completion results from workers
  2. orchestration_task_requests: New task initialization requests
  3. orchestration_task_finalization: Task finalization notifications
  4. Namespace Queues: Per-namespace step queues (e.g., fulfillment_queue, inventory_queue)

TaskReadinessEventSystem

Handles database-level events for task readiness using PostgreSQL LISTEN/NOTIFY:

#![allow(unused)]
fn main() {
pub struct TaskReadinessEventSystem {
    system_id: String,
    deployment_mode: DeploymentMode,
    listener: Option<TaskReadinessListener>,
    fallback_poller: Option<TaskReadinessFallbackPoller>,
    context: Arc<SystemContext>,
    command_sender: mpsc::Sender<OrchestrationCommand>,
    // ... configuration and statistics
}
}

PGMQ Notification Channels:

  • pgmq_message_ready.orchestration: Orchestration queue messages ready (task requests, step results, finalizations)
  • pgmq_message_ready.{namespace}: Worker namespace queue messages ready (e.g., payments, fulfillment, linear_workflow)
  • pgmq_message_ready: Global channel for all queue messages (fallback)
  • pgmq_queue_created: Queue creation notifications
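
For reference, subscribing to these channels with sqlx looks roughly like the sketch below. The actual tasker listeners add reconnection handling and translate notification payloads into commands; the channel names follow the list above.

#![allow(unused)]
fn main() {
use sqlx::postgres::PgListener;
use sqlx::PgPool;

async fn listen_for_messages(pool: &PgPool) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect_with(pool).await?;

    // Orchestration queues plus one worker namespace, as described above.
    listener
        .listen_all(["pgmq_message_ready.orchestration", "pgmq_message_ready.payments"])
        .await?;

    loop {
        let notification = listener.recv().await?;
        // The payload identifies the queue/message; hand it to the event system.
        println!("channel={} payload={}", notification.channel(), notification.payload());
    }
}
}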

Unified Event Coordination

The UnifiedEventCoordinator demonstrates coordinated management of multiple event systems:

#![allow(unused)]
fn main() {
pub struct UnifiedEventCoordinator {
    orchestration_system: OrchestrationEventSystem,
    task_readiness_fallback: FallbackPoller,
    deployment_mode: DeploymentMode,
    health_monitor: EventSystemHealthMonitor,
    // ... coordination logic
}
}

Coordination Features:

  • Shared Command Channel: Both systems send commands to same orchestration processor
  • Health Monitoring: Unified health checking across all event systems
  • Deployment Mode Management: Synchronized mode changes
  • Statistics Aggregation: Combined metrics from all systems

Worker Event Systems

WorkerEventSystem

Implemented in tasker-worker/src/worker/event_systems/worker_event_system.rs:

#![allow(unused)]
fn main() {
pub struct WorkerEventSystem {
    system_id: String,
    deployment_mode: DeploymentMode,
    queue_listeners: HashMap<String, WorkerQueueListener>,
    fallback_pollers: HashMap<String, WorkerFallbackPoller>,
    context: Arc<SystemContext>,
    command_sender: mpsc::Sender<WorkerCommand>,
    // ... statistics and configuration
}
}

Worker Command Types

#![allow(unused)]
fn main() {
pub enum WorkerCommand {
    // Step execution
    ExecuteStep { message: PgmqMessage<SimpleStepMessage>, queue_name: String, resp: CommandResponder<()> },
    ExecuteStepWithCorrelation { message: PgmqMessage<SimpleStepMessage>, queue_name: String, correlation_id: Uuid, resp: CommandResponder<()> },

    // Result processing
    SendStepResult { result: StepExecutionResult, resp: CommandResponder<()> },
    ProcessStepCompletion { step_result: StepExecutionResult, correlation_id: Option<Uuid>, resp: CommandResponder<()> },

    // Event integration
    ExecuteStepFromMessage { queue_name: String, message: PgmqMessage, resp: CommandResponder<()> },
    ExecuteStepFromEvent { message_event: MessageReadyEvent, resp: CommandResponder<()> },

    // System operations
    GetWorkerStatus { resp: CommandResponder<WorkerStatus> },
    SetEventIntegration { enabled: bool, resp: CommandResponder<()> },
    GetEventStatus { resp: CommandResponder<EventIntegrationStatus> },
    RefreshTemplateCache { namespace: Option<String>, resp: CommandResponder<()> },
    HealthCheck { resp: CommandResponder<WorkerHealthStatus> },
    Shutdown { resp: CommandResponder<()> },
}
}

Worker Queue Architecture

Workers monitor namespace-specific queues for step execution. These custom namespace queues are dynamically configured per deployment.

Example queues:

  1. fulfillment_queue: All fulfillment namespace steps
  2. inventory_queue: All inventory namespace steps
  3. notifications_queue: All notification namespace steps
  4. payment_queue: All payment processing steps

Event Flow and System Interactions

Complete Task Execution Flow

sequenceDiagram
    participant Client
    participant Orchestration
    participant TaskDB
    participant StepQueue
    participant Worker
    participant ResultQueue

    %% Task Initialization
    Client->>Orchestration: TaskRequestMessage (via pgmq_send_with_notify)
    Orchestration->>TaskDB: Create Task + Steps

    %% Step Discovery and Enqueueing (Event-Driven or Fallback Polling)
    Orchestration->>StepQueue: pgmq_send_with_notify(ready steps)
    StepQueue-->>Worker: pg_notify('pgmq_message_ready.{namespace}')

    %% Step Execution
    Worker->>StepQueue: pgmq.read() to claim step
    Worker->>Worker: Execute Step Handler
    Worker->>ResultQueue: pgmq_send_with_notify(StepExecutionResult)
    ResultQueue-->>Orchestration: pg_notify('pgmq_message_ready.orchestration')

    %% Result Processing
    Orchestration->>Orchestration: ProcessStepResult Command
    Orchestration->>TaskDB: Update Step State
    Note over Orchestration: Fallback poller discovers ready steps if events missed

    %% Task Completion
    Note over Orchestration: All Steps Complete
    Orchestration->>Orchestration: FinalizeTask Command
    Orchestration->>TaskDB: Mark Task Complete
    Orchestration-->>Client: Task Completed

Event-Driven Step Discovery

sequenceDiagram
    participant Worker
    participant PostgreSQL
    participant PgmqNotify
    participant OrchestrationListener
    participant StepEnqueuer

    Worker->>PostgreSQL: pgmq_send_with_notify('orchestration_step_results', result)
    PostgreSQL->>PostgreSQL: Atomic: pgmq.send() + pg_notify()
    PostgreSQL->>PgmqNotify: NOTIFY 'pgmq_message_ready.orchestration'
    PgmqNotify->>OrchestrationListener: MessageReadyEvent
    OrchestrationListener->>StepEnqueuer: ProcessStepResult Command
    StepEnqueuer->>PostgreSQL: Query ready steps, enqueue via pgmq_send_with_notify()

Hybrid Mode Operation

stateDiagram-v2
    [*] --> EventDriven

    EventDriven --> Processing : Event Received
    Processing --> EventDriven : Success
    Processing --> PollingFallback : Event Failed

    PollingFallback --> FallbackPolling : Start Polling
    FallbackPolling --> EventDriven : Connection Restored
    FallbackPolling --> Processing : Poll Found Work

    EventDriven --> HealthCheck : Periodic Check
    HealthCheck --> EventDriven : Healthy
    HealthCheck --> PollingFallback : Event Issues Detected

Queue Architecture and Message Flow

PGMQ Integration

The system uses PostgreSQL Message Queue (PGMQ) for reliable message delivery:

Queue Types and Purposes

| Queue Name | Purpose | Message Type | Processing System |
|---|---|---|---|
| orchestration_step_results | Step completion results | StepExecutionResult | Orchestration |
| orchestration_task_requests | New task requests | TaskRequestMessage | Orchestration |
| orchestration_task_finalization | Task finalization | TaskFinalizationMessage | Orchestration |
| {namespace}_queue | Namespace-specific steps | SimpleStepMessage | Workers |

Message Processing Patterns

Event-Driven Processing:

  1. Message arrives in PGMQ queue
  2. PostgreSQL triggers pg_notify with MessageReadyEvent
  3. Event system receives notification
  4. System processes message via command pattern
  5. Message deleted after successful processing

Polling-Based Processing (Fallback):

  1. Periodic queue polling (configurable interval)
  2. Fetch available messages in batches
  3. Process messages via command pattern
  4. Delete processed messages
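
The fallback path maps directly onto PGMQ's SQL API. A minimal sketch using sqlx (the queue name, visibility timeout, and batch size are illustrative values, and the real poller feeds each message through the command pattern rather than processing inline):

#![allow(unused)]
fn main() {
use sqlx::{PgPool, Row};

async fn poll_once(pool: &PgPool) -> Result<(), sqlx::Error> {
    // 1. Fetch a batch of visible messages (30s visibility timeout, up to 20).
    let rows = sqlx::query("SELECT msg_id, message FROM pgmq.read('fulfillment_queue', 30, 20)")
        .fetch_all(pool)
        .await?;

    for row in rows {
        let msg_id: i64 = row.get("msg_id");
        let message: serde_json::Value = row.get("message");

        // 2./3. Process the message via the command pattern (elided here).
        let _ = message;

        // 4. Delete only after successful processing, so failures become
        //    visible again once the visibility timeout expires.
        sqlx::query("SELECT pgmq.delete('fulfillment_queue', $1)")
            .bind(msg_id)
            .execute(pool)
            .await?;
    }
    Ok(())
}
}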

Circuit Breaker Integration

All PGMQ operations are protected by circuit breakers:

#![allow(unused)]
fn main() {
pub struct UnifiedPgmqClient {
    standard_client: Box<dyn PgmqClientTrait + Send + Sync>,
    protected_client: Option<ProtectedPgmqClient>,
    circuit_breaker_enabled: bool,
}
}

Circuit Breaker Features:

  • Automatic Protection: Failure detection and circuit opening
  • Configurable Thresholds: Error rate and timeout configuration
  • Seamless Fallback: Automatic switching between standard and protected clients
  • Recovery Detection: Automatic circuit closing when service recovers

Statistics and Monitoring

Event System Statistics

Both orchestration and worker event systems implement comprehensive statistics:

#![allow(unused)]
fn main() {
pub trait EventSystemStatistics {
    fn events_processed(&self) -> u64;
    fn events_failed(&self) -> u64;
    fn processing_rate(&self) -> f64;         // events/second
    fn average_latency_ms(&self) -> f64;
    fn deployment_mode_score(&self) -> f64;   // 0.0-1.0 effectiveness
    fn success_rate(&self) -> f64;            // derived: processed/(processed+failed)
}
}

Health Monitoring

Deployment Mode Health Status

#![allow(unused)]
fn main() {
pub enum DeploymentModeHealthStatus {
    Healthy,                    // All systems operational
    Degraded { reason: String },// Some issues but functional
    Unhealthy { reason: String },// Significant issues
    Critical { reason: String }, // System failure imminent
}
}

Health Check Integration

  • Event System Health: Connection status, processing latency, error rates
  • Command Processor Health: Queue backlog, processing timeout detection
  • Database Health: Connection pool status, query performance
  • Circuit Breaker Status: Circuit state, failure rates, recovery status

Metrics Collection

Key metrics collected across the system:

Orchestration Metrics

  • Task Initialization Rate: Tasks/minute initialized
  • Step Enqueueing Rate: Steps/minute enqueued to worker queues
  • Result Processing Rate: Results/minute processed from workers
  • Task Completion Rate: Tasks/minute completed successfully
  • Error Rates: Failures by operation type and cause

Worker Metrics

  • Step Execution Rate: Steps/minute executed
  • Handler Performance: Execution time by handler type
  • Queue Processing: Messages claimed/processed by queue
  • Result Submission Rate: Results/minute sent to orchestration
  • FFI Integration: Event correlation and handler communication stats

Error Handling and Resilience

Error Categories

The system handles multiple error categories with appropriate strategies:

Transient Errors

  • Database Connection Issues: Circuit breaker protection + retry with exponential backoff
  • Queue Processing Failures: Message retry with backoff, poison message detection
  • Network Interruptions: Automatic fallback to polling mode

Permanent Errors

  • Invalid Message Format: Dead letter queue for manual analysis
  • Handler Execution Failures: Step failure state with retry limits
  • Configuration Errors: System startup prevention with clear error messages

System Errors

  • Resource Exhaustion: Graceful degradation and load shedding
  • Component Crashes: Automatic restart with state recovery
  • Data Corruption: Transaction rollback and consistency validation

Fallback Mechanisms

Event System Fallbacks

  1. Event-Driven -> Polling: Automatic fallback when event connection fails
  2. Real-time -> Batch: Switch to batch processing during high load
  3. Primary -> Secondary: Database failover support for high availability

Command Processing Fallbacks

  1. Async -> Sync: Degraded operation for critical operations
  2. Distributed -> Local: Local processing when coordination fails
  3. Optimistic -> Pessimistic: Conservative processing during uncertainty

Configuration Management

Event System Configuration

Event systems are configured via TOML with environment overrides:

# config/tasker/base/event_systems.toml
[orchestration_event_system]
system_id = "orchestration-events"
deployment_mode = "Hybrid"
# PGMQ channels handled by listeners, not direct postgres channels
supported_namespaces = ["orchestration"]
health_monitoring_enabled = true
health_check_interval = "30s"
max_concurrent_processors = 10
processing_timeout = "100ms"

[orchestration_event_system.queue_listener]
enabled = true
batch_size = 50
poll_interval = "1s"
connection_timeout = "5s"

[orchestration_event_system.fallback_poller]
enabled = true
poll_interval = "5s"
batch_size = 20
max_retry_attempts = 3

[task_readiness]
enabled = true
polling_interval_seconds = 30

Runtime Configuration Changes

Certain configuration changes can be applied at runtime:

  • Deployment Mode Switching: EventDrivenOnly <-> Hybrid <-> PollingOnly
  • Event Integration Toggle: Enable/disable event processing
  • Health Check Intervals: Adjust monitoring frequency
  • Circuit Breaker Thresholds: Modify failure detection sensitivity
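
For example, toggling event integration is just another worker command. The sketch below assumes CommandResponder is a tokio oneshot sender and that command_tx is the worker's command channel:

#![allow(unused)]
fn main() {
// Sketch: disable event-driven processing on a running worker, forcing it
// onto its fallback polling path until re-enabled.
async fn disable_event_integration(
    command_tx: &tokio::sync::mpsc::Sender<WorkerCommand>,
) -> Result<(), String> {
    let (resp, resp_rx) = tokio::sync::oneshot::channel();
    command_tx
        .send(WorkerCommand::SetEventIntegration { enabled: false, resp })
        .await
        .map_err(|_| "worker command channel closed".to_string())?;
    resp_rx.await.map_err(|_| "no response from worker".to_string())?;
    Ok(())
}
}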

Integration Points

State Machine Integration

Event systems integrate tightly with the state machines documented in states-and-lifecycles.md:

  1. Task State Changes: Event systems react to task transitions
  2. Step State Changes: Step completion triggers task readiness checks
  3. Event Generation: State transitions generate events for system coordination
  4. Atomic Operations: Event processing maintains state machine consistency

Database Integration

Event systems coordinate with PostgreSQL at multiple levels:

  1. LISTEN/NOTIFY: Real-time notifications for database changes
  2. PGMQ Integration: Reliable message queues built on PostgreSQL
  3. Transaction Coordination: Event processing within database transactions
  4. SQL Functions: Database functions generate events and notifications

External System Integration

The event architecture supports integration with external systems:

  1. Webhook Events: HTTP callbacks for external system notifications
  2. Message Bus Integration: Apache Kafka, RabbitMQ, etc. for enterprise messaging
  3. Monitoring Integration: Prometheus, DataDog, etc. for metrics export
  4. API Integration: REST and GraphQL APIs for external coordination

Actor Integration

Overview

The tasker-core system implements a lightweight Actor pattern that formalizes the relationship between Commands and Lifecycle Components. This architecture provides a consistent, type-safe foundation for orchestration component management with all lifecycle operations coordinated through actors.

Status: Complete (Phases 1-7) - Production ready

For comprehensive actor documentation, see Actor-Based Architecture.

Actor Pattern Basics

The actor pattern introduces three core traits:

  1. OrchestrationActor: Base trait for all actors with lifecycle hooks
  2. Handler: Message handling trait for type-safe command processing
  3. Message: Marker trait for command messages

#![allow(unused)]
fn main() {
// Actor definition
pub struct TaskFinalizerActor {
    context: Arc<SystemContext>,
    service: TaskFinalizer,
}

// Message definition
pub struct FinalizeTaskMessage {
    pub task_uuid: Uuid,
}

impl Message for FinalizeTaskMessage {
    type Response = FinalizationResult;
}

// Message handler
#[async_trait]
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
    type Response = FinalizationResult;

    async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
        self.service.finalize_task(msg.task_uuid).await
            .map_err(|e| e.into())
    }
}
}

Integration with Command Processor

The actor pattern integrates seamlessly with the command processor through direct actor calls:

#![allow(unused)]
fn main() {
// From: tasker-orchestration/src/orchestration/command_processor.rs

async fn handle_finalize_task(&self, task_uuid: Uuid) -> TaskerResult<TaskFinalizationResult> {
    // Direct actor-based task finalization
    let msg = FinalizeTaskMessage { task_uuid };
    let result = self.actors.task_finalizer_actor.handle(msg).await?;

    Ok(TaskFinalizationResult::Success {
        task_uuid: result.task_uuid,
        final_status: format!("{:?}", result.action),
        completion_time: Some(chrono::Utc::now()),
    })
}

async fn handle_process_step_result(
    &self,
    step_result: StepExecutionResult,
) -> TaskerResult<StepProcessResult> {
    // Direct actor-based step result processing
    let msg = ProcessStepResultMessage {
        result: step_result.clone(),
    };

    match self.actors.result_processor_actor.handle(msg).await {
        Ok(()) => Ok(StepProcessResult::Success {
            message: format!(
                "Step {} result processed successfully",
                step_result.step_uuid
            ),
        }),
        Err(e) => Ok(StepProcessResult::Error {
            message: format!("Failed to process step result: {e}"),
        }),
    }
}
}

Event → Command → Actor Flow

The complete event-to-actor flow:

┌──────────────┐
│ PGMQ Message │ Message arrives in queue
└──────┬───────┘
       │
       ▼
┌──────────────────┐
│  Event Listener  │ EventDrivenSystem processes notification
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Command Channel  │ Send command to processor via tokio::mpsc
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Command Processor│ Convert command to actor message
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Actor Registry  │ Route message to appropriate actor
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Handler<M>::     │ Actor processes message
│    handle()      │ Delegates to underlying service
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Response        │ Return result to command processor
└──────────────────┘

ActorRegistry and Lifecycle

The ActorRegistry manages all 4 orchestration actors and integrates with the system lifecycle:

#![allow(unused)]
fn main() {
// During system startup
let context = Arc::new(SystemContext::with_pool(pool).await?);
let actors = ActorRegistry::build(context).await?;  // Calls started() on all actors

// During operation
let msg = FinalizeTaskMessage { task_uuid };
let result = actors.task_finalizer_actor.handle(msg).await?;

// During shutdown
actors.shutdown().await;  // Calls stopped() on all actors in reverse order
}

Current Actors:

  • TaskRequestActor: Handles task initialization requests
  • ResultProcessorActor: Processes step execution results
  • StepEnqueuerActor: Manages batch processing of ready tasks
  • TaskFinalizerActor: Handles task finalization with atomic claiming

Benefits for Event-Driven Architecture

The actor pattern enhances the event-driven architecture by providing:

  1. Type Safety: Compile-time verification of message contracts
  2. Consistency: Uniform lifecycle management across all components
  3. Testability: Clear message boundaries for isolated testing
  4. Observability: Actor-level metrics and tracing
  5. Evolvability: Easy to add new message handlers and actors

Implementation Status

The actor integration is complete:

  1. Phase 1 ✅: Actor infrastructure and test harness

    • OrchestrationActor, Handler, Message traits
    • ActorRegistry structure
  2. Phase 2-3 ✅: All 4 primary actors implemented

    • TaskRequestActor, ResultProcessorActor
    • StepEnqueuerActor, TaskFinalizerActor
  3. Phase 4-6 ✅: Message hydration and module reorganization

    • Hydration layer for PGMQ messages
    • Clean module organization
  4. Phase 7 ✅: Service decomposition

    • Large services decomposed into focused components
    • All files <300 lines following single responsibility principle
  5. Cleanup ✅: Direct actor integration

    • Command processor calls actors directly
    • Removed intermediate wrapper layers
    • Production-ready implementation

Service Decomposition

Large services (800-900 lines) were decomposed into focused components:

TaskFinalizer (848 → 6 files):

  • service.rs: Main TaskFinalizer (~200 lines)
  • completion_handler.rs: Task completion logic
  • event_publisher.rs: Lifecycle event publishing
  • execution_context_provider.rs: Context fetching
  • state_handlers.rs: State-specific handling

StepEnqueuerService (781 → 3 files):

  • service.rs: Main service (~250 lines)
  • batch_processor.rs: Batch processing logic
  • state_handlers.rs: State-specific processing

ResultProcessor (889 → 4 files):

  • service.rs: Main processor
  • metadata_processor.rs: Metadata handling
  • error_handler.rs: Error processing
  • result_validator.rs: Result validation

This comprehensive event and command architecture, now enhanced with the actor pattern, provides the foundation for scalable, reliable, and maintainable workflow orchestration in the tasker-core system while maintaining the flexibility to operate in diverse deployment environments.

Idempotency and Atomicity Guarantees

Last Updated: 2025-01-19 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | States and Lifecycles | Events and Commands | Task Readiness & Execution

← Back to Documentation Hub


Overview

Tasker Core is designed for distributed orchestration with multiple orchestrator instances processing tasks concurrently. This document explains the defense-in-depth approach that ensures safe concurrent operation without race conditions, data corruption, or lost work.

The system provides idempotency and atomicity through four overlapping protection layers:

  1. Database Atomicity: PostgreSQL constraints, row locking, and compare-and-swap operations
  2. State Machine Guards: Current-state validation before all transitions
  3. Transaction Boundaries: All-or-nothing semantics for complex operations
  4. Application Logic: State-based filtering and idempotent patterns

These layers work together to ensure that operations can be safely retried, multiple orchestrators can process work concurrently, and crashes don’t leave the system in an inconsistent state.


Core Protection Mechanisms

Layer 1: Database Atomicity

PostgreSQL provides fundamental atomic guarantees through several mechanisms:

Unique Constraints

Purpose: Prevent duplicate creation of entities

Key Constraints:

  • tasker.tasks.identity_hash (UNIQUE) - Prevents duplicate task creation from identical requests
  • tasker.task_namespaces.name (UNIQUE) - Namespace name uniqueness
  • tasker.named_tasks (namespace_id, name, version) (UNIQUE) - Task template uniqueness
  • tasker.named_steps.system_name (UNIQUE) - Step handler uniqueness

Example Protection:

#![allow(unused)]
fn main() {
// Two orchestrators receive identical TaskRequestMessage
// Orchestrator A creates task first -> commits successfully
// Orchestrator B attempts to create -> unique constraint violation
// Result: Exactly one task created, error cleanly handled
}

See Task Initialization for details on how this protects task creation.

Row-Level Locking

Purpose: Prevent concurrent modifications to the same database row

Locking Patterns:

  1. FOR UPDATE - Exclusive lock, blocks concurrent transactions

    -- Used in: transition_task_state_atomic()
    SELECT * FROM tasker.tasks WHERE task_uuid = $1 FOR UPDATE;
    -- Blocks until transaction commits or rolls back
    
  2. FOR UPDATE SKIP LOCKED - Lock-free work distribution

    -- Used in: get_next_ready_tasks()
    SELECT * FROM tasker.tasks
    WHERE state = ANY($1)
    FOR UPDATE SKIP LOCKED
    LIMIT $2;
    -- Each orchestrator gets different tasks, no blocking
    

Example Protection:

#![allow(unused)]
fn main() {
// Scenario: Two orchestrators attempt state transition on same task
// Orchestrator A: BEGIN; SELECT FOR UPDATE; UPDATE state; COMMIT;
// Orchestrator B: BEGIN; SELECT FOR UPDATE (BLOCKS until A commits)
//                 UPDATE fails due to state validation
// Result: Only one transition succeeds, no race condition
}

Compare-and-Swap Semantics

Purpose: Validate expected state before making changes

Pattern: All state transitions validate current state in the same transaction as the update

-- From transition_task_state_atomic()
UPDATE tasker.tasks
SET state = $new_state, updated_at = NOW()
WHERE task_uuid = $uuid
  AND state = $expected_current_state  -- Critical: CAS validation
RETURNING *;

Example Protection:

#![allow(unused)]
fn main() {
// Orchestrator A and B both think task is in "Pending" state
// A transitions: WHERE state = 'Pending' -> succeeds, now "Initializing"
// B transitions: WHERE state = 'Pending' -> returns 0 rows (fails gracefully)
// Result: Atomic transition, no invalid state
}

See SQL Function Architecture for more on database-level guarantees.

Layer 2: State Machine Guards

Purpose: Enforce valid state transitions through application-level validation

Both task and step state machines validate current state before allowing transitions. This provides protection even when database constraints alone wouldn’t catch invalid operations.

Task State Machine

Defined in tasker-shared/src/state_machine/task_state_machine.rs, the TaskStateMachine validates:

  1. Current state retrieval: Always fetch latest state from database
  2. Event applicability: Check if event is valid for current state
  3. Terminal state protection: Cannot transition from Complete/Error/Cancelled
  4. Ownership tracking: Processor UUID tracked for audit (not enforced after ownership removal)

Example Protection:

#![allow(unused)]
fn main() {
// TaskStateMachine prevents invalid transitions
let mut state_machine = TaskStateMachine::new(task, context);

// Attempt to mark complete when still processing
let result = state_machine.transition(TaskEvent::MarkComplete).await;
// Result: Error - cannot mark complete while steps are in progress

// Current state validation prevents:
// - Completing tasks with pending steps
// - Re-initializing completed tasks
// - Transitioning from terminal states
}

See States and Lifecycles for complete state machine documentation.

Workflow Step State Machine

Defined in tasker-shared/src/state_machine/step_state_machine.rs, the StepStateMachine ensures:

  1. Execution claiming: Only Pending/Enqueued steps can transition to InProgress
  2. Completion validation: Only InProgress steps can be marked complete
  3. Retry eligibility: Validates max_attempts and backoff timing

Example Protection:

#![allow(unused)]
fn main() {
// Worker attempts to claim already-processing step
let mut step_machine = StepStateMachine::new(step.into(), context);

match step_machine.current_state().await {
    WorkflowStepState::InProgress => {
        // Already being processed by another worker
        return Ok(false); // Cannot claim
    }
    WorkflowStepState::Pending | WorkflowStepState::Enqueued => {
        // Attempt atomic transition
        step_machine.transition(StepEvent::Start).await?;
    }
    _ => {
        // Any other state (Complete, Cancelled, Error, ...) is not claimable
        return Ok(false);
    }
}
}

This prevents:

  • Multiple workers executing the same step concurrently
  • Marking steps complete that weren’t started
  • Retrying steps that exceeded max_attempts

Layer 3: Transaction Boundaries

Purpose: Ensure all-or-nothing semantics for multi-step operations

Critical operations wrap multiple database changes in a single transaction, ensuring atomic completion or full rollback on failure.

Task Initialization Transaction

Task creation involves multiple dependent entities that must all succeed or all fail:

#![allow(unused)]
fn main() {
// From TaskInitializer.initialize_task()
let mut tx = pool.begin().await?;

// 1. Create or find namespace (find-or-create is idempotent)
let namespace = NamespaceResolver::resolve_namespace(&mut tx, namespace_name).await?;

// 2. Create or find named task
let named_task = NamespaceResolver::resolve_named_task(&mut tx, namespace, task_name).await?;

// 3. Create task record
let task = create_task(&mut tx, named_task.uuid, context).await?;

// 4. Create all workflow steps and edges
let (step_count, step_mapping) = WorkflowStepBuilder::create_workflow_steps(
    &mut tx, task.uuid, template
).await?;

// 5. Initialize state machine
StateInitializer::initialize_task_state(&mut tx, task.uuid).await?;

// ALL OR NOTHING: Commit entire transaction
tx.commit().await?;
}

Example Protection:

#![allow(unused)]
fn main() {
// Scenario: Task creation partially fails
// - Namespace created ✓
// - Named task created ✓
// - Task record created ✓
// - Workflow steps: Cycle detected ✗ (error thrown)
// Result: tx.rollback() -> ALL changes reverted, clean failure
}

Cycle Detection Enforcement

Workflow dependencies are validated during task initialization to prevent circular references:

#![allow(unused)]
fn main() {
// From WorkflowStepBuilder::create_step_dependencies()
for dependency in &step_definition.dependencies {
    let from_uuid = step_mapping[dependency];
    let to_uuid = step_mapping[&step_definition.name];

    // Check for self-reference
    if from_uuid == to_uuid {
        return Err(CycleDetected { from, to });
    }

    // Check for path that would create cycle
    if WorkflowStepEdge::would_create_cycle(pool, from_uuid, to_uuid).await? {
        return Err(CycleDetected { from, to });
    }

    // Safe to create edge
    WorkflowStepEdge::create_with_transaction(&mut tx, edge).await?;
}
}

This prevents invalid DAG structures from ever being persisted to the database.

Layer 4: Application Logic Patterns

Purpose: Implement idempotent patterns at the application level

Beyond database and state machine protections, application code uses several patterns to ensure safe retry and concurrent operation.

Find-or-Create Pattern

Used for entities that should be unique but may be created by multiple orchestrators:

#![allow(unused)]
fn main() {
// From NamespaceResolver
pub async fn resolve_namespace(
    tx: &mut Transaction<'_, Postgres>,
    name: &str,
) -> Result<TaskNamespace> {
    // Try to find existing
    if let Some(namespace) = TaskNamespace::find_by_name(pool, name).await? {
        return Ok(namespace);
    }

    // Create if not found
    match TaskNamespace::create_with_transaction(tx, NewTaskNamespace { name }).await {
        Ok(namespace) => Ok(namespace),
        Err(sqlx::Error::Database(e)) if is_unique_violation(&e) => {
            // Another orchestrator created it between our find and create
            // Re-query to get the one that won the race
            TaskNamespace::find_by_name(pool, name).await?
                .ok_or(Error::NotFound)
        }
        Err(e) => Err(e),
    }
}
}

Why This Works:

  • First attempt: Finds existing → idempotent
  • Create attempt: Unique constraint prevents duplicates
  • Retry after unique violation: Gets the winner → idempotent
  • Result: Exactly one namespace, regardless of concurrent attempts

State-Based Filtering

Operations filter by state to naturally deduplicate work:

#![allow(unused)]
fn main() {
// From StepEnqueuerService
// Only enqueue steps in specific states
let ready_steps = steps.iter()
    .filter(|step| matches!(
        step.state,
        WorkflowStepState::Pending | WorkflowStepState::WaitingForRetry
    ))
    .collect();

// Skip steps already:
// - Enqueued (another orchestrator handled it)
// - InProgress (worker is executing)
// - Complete (already done)
// - Error (terminal state)
}

Example Protection:

#![allow(unused)]
fn main() {
// Scenario: Orchestrator crash mid-batch
// Before crash: Enqueued steps 1-5 of 10
// After restart: Process task again
// State filtering:
//   - Steps 1-5: state = Enqueued → skip
//   - Steps 6-10: state = Pending → enqueue
// Result: Each step enqueued exactly once
}

State-Before-Queue Pattern

Ensures workers only see steps in correct state:

#![allow(unused)]
fn main() {
// 1. Commit state transition to database FIRST
step_state_machine.transition(StepEvent::Enqueue).await?;
// Step now in Enqueued state in database

// 2. THEN send PGMQ notification
pgmq_client.send_with_notify(queue_name, step_message).await?;

// Worker receives notification and:
// - Queries database for step
// - Sees state = Enqueued (committed)
// - Can safely claim and execute
}

Why Order Matters:

#![allow(unused)]
fn main() {
// Wrong order (queue-before-state):
// 1. Send PGMQ message
// 2. Worker receives immediately
// 3. Worker queries database → state still Pending
// 4. Worker might skip or fail to claim
// 5. State transition commits

// Correct order (state-before-queue):
// 1. State transition commits
// 2. Send PGMQ message
// 3. Worker receives
// 4. Worker queries → state correctly Enqueued
// 5. Worker can claim
}

See Events and Commands for event system details.


Component-by-Component Guarantees

Task Initialization Idempotency

Component: TaskRequestActor and TaskInitializer service Operation: Creating a new task from a template File: tasker-orchestration/src/orchestration/lifecycle/task_initialization/

Protection Mechanisms

  1. Identity Hash Unique Constraint

    #![allow(unused)]
    fn main() {
    // Tasks are identified by hash of (namespace, task_name, context)
    let identity_hash = calculate_identity_hash(namespace, name, context);
    
    NewTask {
        identity_hash,  // Unique constraint prevents duplicates
        named_task_uuid,
        context,
        // ...
    }
    }
  2. Transaction Atomicity

    • All entities created in single transaction
    • Namespace, named task, task, workflow steps, edges
    • Cycle detection validates DAG before committing
    • Any failure rolls back everything
  3. Find-or-Create for Shared Entities

    • Namespaces can be created by any orchestrator
    • Named tasks shared across workflow instances
    • Named steps reused across tasks

Concurrent Scenario

Two orchestrators receive identical TaskRequestMessage:

T0: Orchestrator A begins transaction
T1: Orchestrator B begins transaction
T2: A creates namespace "payments"
T3: B attempts to create namespace "payments"
T4: A creates task with identity_hash "abc123"
T5: B attempts to create task with identity_hash "abc123"
T6: A commits successfully ✓
T7: B attempts commit → unique constraint violation on identity_hash
T8: B transaction rolled back

Result:

  • Exactly one task created
  • No partial state in database
  • Orchestrator B receives clear error
  • Retry-safe: B can check if task exists and return it
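
A retry-safe caller can treat the unique-constraint violation as "this task already exists" and fetch the winner, mirroring the find-or-create pattern described below for namespaces. The model methods and error type in this sketch (Task::create_with_transaction, Task::find_by_identity_hash, TaskError) are illustrative names, not necessarily the exact tasker-shared API:

#![allow(unused)]
fn main() {
async fn create_or_fetch_task(
    pool: &sqlx::PgPool,
    new_task: NewTask,
) -> Result<Task, TaskError> {
    // Assumes TaskError: From<sqlx::Error> for the `?` conversions below.
    let mut tx = pool.begin().await?;
    match Task::create_with_transaction(&mut tx, &new_task).await {
        Ok(task) => {
            tx.commit().await?;
            Ok(task)
        }
        Err(e) if is_unique_violation(&e) => {
            // Another orchestrator won the race on identity_hash: roll back
            // our partial work and return the task that already exists.
            tx.rollback().await?;
            Task::find_by_identity_hash(pool, &new_task.identity_hash)
                .await?
                .ok_or(TaskError::NotFound)
        }
        Err(e) => Err(e),
    }
}
}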

Cycle Detection

Prevents invalid workflow definitions:

#![allow(unused)]
fn main() {
// Template defines: A depends on B, B depends on C, C depends on A
// During initialization:
//   - Create steps A, B, C
//   - Create edge A -> B (valid)
//   - Create edge B -> C (valid)
//   - Attempt edge C -> A
//     - would_create_cycle() returns true
//     - Error: CycleDetected
//   - Transaction rolled back
// Result: Invalid workflow rejected, no partial data
}

See tasker-shared/src/models/core/workflow_step_edge.rs:236-270 for cycle detection implementation.

Step Enqueueing Idempotency

Component: StepEnqueuerActor and StepEnqueuerService Operation: Enqueueing ready workflow steps to worker queues File: tasker-orchestration/src/orchestration/lifecycle/step_enqueuer_services/

Multi-Layer Protection

  1. SQL-Level Row Locking

    -- get_next_ready_tasks() uses SKIP LOCKED
    SELECT task_uuid FROM tasker.tasks
    WHERE state = ANY($states)
    FOR UPDATE SKIP LOCKED  -- Prevents concurrent claiming
    LIMIT $batch_size;
    

    Each orchestrator gets different tasks, no overlap

  2. State Machine Compare-and-Swap

    #![allow(unused)]
    fn main() {
    // Only transition if task in expected state
    state_machine.transition(TaskEvent::EnqueueSteps(uuids)).await?;
    // Fails if another orchestrator already transitioned
    }
  3. Step State Filtering

    #![allow(unused)]
    fn main() {
    // Only enqueue steps in specific states
    let enqueueable = steps.filter(|s| matches!(
        s.state,
        WorkflowStepState::Pending | WorkflowStepState::WaitingForRetry
    ));
    }
  4. State-Before-Queue Ordering

    #![allow(unused)]
    fn main() {
    // 1. Commit step state to Enqueued
    step.transition(StepEvent::Enqueue).await?;
    
    // 2. Send PGMQ message
    pgmq.send_with_notify(queue, message).await?;
    }

Concurrent Scenario

Two orchestrators discover the same ready steps:

T0: Orchestrator A queries get_next_ready_tasks(batch=100)
T1: Orchestrator B queries get_next_ready_tasks(batch=100)
T2: A gets tasks [1,2,3] (locked by A's transaction)
T3: B gets tasks [4,5,6] (different rows, SKIP LOCKED)
T4: A enqueues steps for tasks 1,2,3
T5: B enqueues steps for tasks 4,5,6
T6: Both commit successfully

Result: No overlap, each task processed once

Orchestrator Crash Mid-Batch:

T0: Orchestrator A gets task 1 with steps [A, B, C, D]
T1: A enqueues steps A, B to "payments_queue"
T2: A crashes before processing steps C, D
T3: Task 1 state still EnqueuingSteps
T4: Orchestrator B picks up task 1 (A's transaction rolled back)
T5: B queries steps for task 1
T6: Steps A, B have state = Enqueued → skip
T7: Steps C, D have state = Pending → enqueue

Result: Steps A, B enqueued once, C, D recovered and enqueued

Result Processing Idempotency

Component: ResultProcessorActor and OrchestrationResultProcessor Operation: Processing step execution results from workers File: tasker-orchestration/src/orchestration/lifecycle/result_processing/

Protection Mechanisms

  1. State Guard Validation

    #![allow(unused)]
    fn main() {
    // TaskCoordinator validates step state before processing result
    let current_state = step_state_machine.current_state().await?;
    
    match current_state {
        WorkflowStepState::InProgress => {
            // Valid: step is being processed
            step_state_machine.transition(StepEvent::Complete).await?;
        }
        WorkflowStepState::Complete => {
            // Idempotent: already processed this result
            return Ok(AlreadyComplete);
        }
        _ => {
            // Invalid state for result processing
            return Err(InvalidState);
        }
    }
    }
  2. Atomic State Transitions

    • Step result processing uses compare-and-swap
    • Task state transitions validate current state
    • All updates in same transaction as state check
  3. Ownership Removed

    • Processor UUID tracked for audit only
    • Not enforced for transitions
    • Any orchestrator can process results
    • Enables recovery after crashes

Concurrent Scenario

Worker submits result, orchestrator crashes, retry arrives:

T0: Worker completes step A, submits result to orchestration_step_results queue
T1: Orchestrator A pulls message, begins processing
T2: A transitions step A to Complete
T3: A begins task state evaluation
T4: A crashes before deleting PGMQ message
T5: PGMQ visibility timeout expires → message reappears
T6: Orchestrator B pulls same message
T7: B queries step A state → Complete
T8: B returns early (idempotent, already processed)
T9: B deletes PGMQ message

Result: Step processed exactly once, retry is harmless

Before Ownership Removal (Ownership Enforced):

// Orchestrator A owned task in EvaluatingResults state
// A crashes
// B receives retry
// B checks: task.processor_uuid != B.uuid
// Error: Ownership violation → TASK STUCK

After Ownership Removal (Ownership Audit-Only):

// Orchestrator A owned task in EvaluatingResults state
// A crashes
// B receives retry
// B checks: current task state (no ownership check)
// B processes successfully → TASK RECOVERS

See the Ownership Removal ADR for full analysis.

Task Finalization Idempotency

Component: TaskFinalizerActor and TaskFinalizer service Operation: Finalizing task to terminal state File: tasker-orchestration/src/orchestration/lifecycle/task_finalization/

Current Protection (Sufficient for Recovery)

  1. State Guard Protection

    #![allow(unused)]
    fn main() {
    // TaskFinalizer checks current task state
    let context = ExecutionContextProvider::fetch(task_uuid).await?;
    
    match context.should_finalize() {
        true => {
            // Transition to Complete
            task_state_machine.transition(TaskEvent::MarkComplete).await?;
        }
        false => {
            // Not ready to finalize (steps still pending)
            return Ok(NotReady);
        }
    }
    }
  2. Idempotent for Recovery

    #![allow(unused)]
    fn main() {
    // Scenario: Orchestrator crashes during finalization
    // - Task state already Complete → state guard returns early
    // - Task state still StepsInProcess → retry succeeds
    // Result: Recovery works, final state reached
    }

Concurrent Scenario (Not Graceful)

Two orchestrators attempt finalization simultaneously:

T0: Orchestrators A and B both receive finalization trigger
T1: A checks: all steps complete → proceed
T2: B checks: all steps complete → proceed
T3: A transitions task to Complete (succeeds)
T4: B attempts transition to Complete
T5: State guard: task already Complete
T6: B receives StateMachineError (invalid transition)

Result:

  • ✓ Task finalized exactly once (correct)
  • ✓ No data corruption
  • ⚠️ Orchestrator B gets error (not graceful)

Future Enhancement: Atomic Finalization Claiming

Atomic claiming would make concurrent finalization graceful:

-- Proposed claim_task_for_finalization() function
UPDATE tasker.tasks
SET finalization_claimed_at = NOW(),
    finalization_claimed_by = $processor_uuid
WHERE task_uuid = $uuid
  AND state = 'StepsInProcess'
  AND finalization_claimed_at IS NULL
RETURNING *;

With atomic finalization claiming:

T0: Orchestrators A and B both receive finalization trigger
T1: A calls claim_task_for_finalization() → succeeds
T2: B calls claim_task_for_finalization() → returns 0 rows
T3: A proceeds with finalization
T4: B returns early (silent success, already claimed)

This enhancement is deferred (implementation not yet scheduled).


SQL Function Atomicity

File: tasker-shared/src/database/sql/ Documented: Task Readiness & Execution

Atomic State Transitions

Function: transition_task_state_atomic() Protection: Compare-and-swap with row locking

-- Atomic state transition with validation
-- (the function first locks the row via SELECT ... FOR UPDATE, as shown earlier)
UPDATE tasker.tasks
SET state = $new_state,
    updated_at = NOW()
WHERE task_uuid = $uuid
  AND state = $expected_current_state  -- CAS: only if state matches
RETURNING *;

Key Guarantees:

  • Returns 0 rows if state doesn’t match → safe retry (sketched below)
  • Row lock prevents concurrent transitions
  • Processor UUID tracked for audit, not enforced
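
As a sketch of how a caller treats this guarantee (assuming sqlx, which the codebase already uses for PgPool access), losing the compare-and-swap race is a normal outcome rather than an error:

async fn try_transition(
    pool: &sqlx::PgPool,
    task_uuid: uuid::Uuid,
    expected_state: &str,
    new_state: &str,
) -> Result<bool, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE tasker.tasks
         SET state = $1, updated_at = NOW()
         WHERE task_uuid = $2 AND state = $3",
    )
    .bind(new_state)
    .bind(task_uuid)
    .bind(expected_state)
    .execute(pool)
    .await?;

    // 0 rows affected => another processor already transitioned the task;
    // re-read the current state and decide whether anything remains to do.
    Ok(result.rows_affected() == 1)
}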

Work Distribution Without Contention

Function: get_next_ready_tasks() Protection: Lock-free claiming via SKIP LOCKED

SELECT task_uuid, correlation_id, state
FROM tasker.tasks
WHERE state = ANY($processable_states)
  AND (
    state NOT IN ('WaitingForRetry') OR
    last_retry_at + retry_interval < NOW()
  )
ORDER BY
  CASE state
    WHEN 'Pending' THEN 1
    WHEN 'WaitingForRetry' THEN 2
    ELSE 3
  END,
  created_at ASC
FOR UPDATE SKIP LOCKED  -- Skip locked rows, no blocking
LIMIT $batch_size;

Key Guarantees:

  • Each orchestrator gets different tasks
  • No blocking or contention
  • Dynamic priority (Pending before WaitingForRetry)
  • Prevents task starvation

Step Readiness with Dependency Validation

Function: get_step_readiness_status() Protection: Validates dependencies in single query

WITH step_dependencies AS (
  SELECT COUNT(*) as total_deps,
         SUM(CASE WHEN dep_step.state = 'Complete' THEN 1 ELSE 0 END) as completed_deps
  FROM tasker.workflow_step_edges e
  JOIN tasker.workflow_steps dep_step ON e.from_step_uuid = dep_step.uuid
  WHERE e.to_step_uuid = $step_uuid
)
SELECT
  CASE
    WHEN total_deps = completed_deps THEN 'Ready'
    WHEN step.state = 'Error' AND step.attempts < step.max_attempts THEN 'WaitingForRetry'
    ELSE 'Blocked'
  END as readiness
FROM step_dependencies, tasker.workflow_steps step
WHERE step.uuid = $step_uuid;

Key Guarantees:

  • Atomic dependency check
  • Handles retry logic with backoff
  • Prevents premature execution

Cycle Detection

Function: WorkflowStepEdge::would_create_cycle() (Rust, uses SQL) Protection: Recursive CTE path traversal

WITH RECURSIVE step_path AS (
  -- Base: Start from proposed destination
  SELECT from_step_uuid, to_step_uuid, 1 as depth
  FROM tasker.workflow_step_edges
  WHERE from_step_uuid = $proposed_to

  UNION ALL

  -- Recursive: Follow edges
  SELECT sp.from_step_uuid, wse.to_step_uuid, sp.depth + 1
  FROM step_path sp
  JOIN tasker.workflow_step_edges wse ON sp.to_step_uuid = wse.from_step_uuid
  WHERE sp.depth < 100  -- Prevent infinite recursion
)
SELECT COUNT(*) as has_path
FROM step_path
WHERE to_step_uuid = $proposed_from;

Returns: True if adding edge would create cycle

Enforcement: Called by WorkflowStepBuilder during task initialization

  • Self-reference check: from_uuid == to_uuid
  • Path check: Would adding edge create cycle?
  • Error before commit: Transaction rolled back on cycle

See tasker-orchestration/src/orchestration/lifecycle/task_initialization/workflow_step_builder.rs for enforcement.
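
A sketch of that enforcement order; WorkflowStepEdge::would_create_cycle is the documented check, but its exact signature and error type are assumptions here:

async fn validate_edge(
    pool: &sqlx::PgPool,
    from_step_uuid: uuid::Uuid,
    to_step_uuid: uuid::Uuid,
) -> Result<(), String> {
    // 1. Self-reference check
    if from_step_uuid == to_step_uuid {
        return Err("self-referencing edge".to_string());
    }
    // 2. Path check via the recursive CTE shown above
    //    (assumed to return Result<bool, sqlx::Error>)
    let creates_cycle = WorkflowStepEdge::would_create_cycle(pool, from_step_uuid, to_step_uuid)
        .await
        .map_err(|e| e.to_string())?;
    if creates_cycle {
        // 3. The builder returns this error before commit,
        //    so the surrounding transaction rolls back.
        return Err("edge would create a cycle".to_string());
    }
    Ok(())
}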


Cross-Cutting Scenarios

Multiple Orchestrators Processing Same Task

Scenario: Load balancer distributes work to multiple orchestrators

Protection:

  1. Work Distribution:

    -- Each orchestrator gets different tasks via SKIP LOCKED
    Orchestrator A: Tasks [1, 2, 3]
    Orchestrator B: Tasks [4, 5, 6]
    
  2. State Transitions:

    #![allow(unused)]
    fn main() {
    // Both attempt to transition the same task (shouldn't happen, but...)
    // A: transition(Pending -> Initializing) → succeeds
    // B: transition(Pending -> Initializing) → fails (state already changed)
    }
  3. Step Enqueueing:

    #![allow(unused)]
    fn main() {
    // Task in EnqueuingSteps state
    // A: Processes task, enqueues steps A, B
    // B: Cannot claim task (state not in processable states)
    // OR if B claims during transition:
    // B: Filters steps by state → A, B already Enqueued, skips them
    }

Result: No duplicate work, clean coordination

Orchestrator Crashes and Recovers

Scenario: Orchestrator crashes mid-operation, another takes over

During Task Initialization

Before ownership removal:
T0: Orchestrator A initializes task 1
T1: Task transitions to Initializing (processor_uuid = A)
T2: A crashes
T3: Task stuck in Initializing forever (ownership blocks recovery)

After ownership removal:
T0: Orchestrator A initializes task 1
T1: Task transitions to Initializing (processor_uuid = A for audit)
T2: A crashes
T3: Orchestrator B picks up task 1
T4: B transitions Initializing -> EnqueuingSteps (succeeds, no ownership check)
T5: Task recovers automatically

During Step Enqueueing

T0: Orchestrator A enqueues steps [A, B] of task 1
T1: A crashes before committing
T2: Transaction rolls back
T3: Steps A, B remain in Pending state
T4: Orchestrator B picks up task 1
T5: B enqueues steps A, B (state still Pending)
T6: No duplicate work

During Result Processing

T0: Worker completes step A
T1: Orchestrator A receives result, transitions step to Complete
T2: A crashes before updating task state
T3: PGMQ message visibility timeout expires
T4: Orchestrator B receives same result message
T5: B queries step A → already Complete
T6: B skips processing (idempotent)
T7: B evaluates task state, continues workflow

Result: Complete recovery, no manual intervention

Retry After Transient Failure

Scenario: Database connection lost during operation

#![allow(unused)]
fn main() {
// Orchestrator attempts task initialization
let result = task_initializer.initialize(request).await;

match result {
    Err(TaskInitializationError::Database(_)) => {
        // Transient failure (connection lost)
        // Retry same request
        let retry_result = task_initializer.initialize(request).await;

        // Possibilities:
        // 1. Succeeds: Transaction completed before connection lost
        //    → identity_hash unique constraint prevents duplicate
        //    → Get existing task
        // 2. Succeeds: Transaction rolled back
        //    → Create task successfully
        // 3. Fails: Different error
        //    → Handle appropriately
    }
    Ok(task) => { /* Success */ }
}
}

Key Pattern: Operations are designed to be retry-safe

  • Database constraints prevent duplicates
  • State guards prevent invalid transitions
  • Find-or-create handles concurrent creation (see the sketch below)
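
A minimal find-or-create sketch, assuming sqlx and a unique constraint on an identity_hash column; the column list and UUID default are abbreviated for illustration:

async fn find_or_create_task(
    pool: &sqlx::PgPool,
    identity_hash: &str,
) -> Result<uuid::Uuid, sqlx::Error> {
    // ON CONFLICT DO NOTHING makes concurrent creation race-free:
    // exactly one inserter wins, everyone else falls through to the SELECT.
    if let Some(task_uuid) = sqlx::query_scalar::<_, uuid::Uuid>(
        "INSERT INTO tasker.tasks (task_uuid, identity_hash)
         VALUES (uuid_generate_v7(), $1)
         ON CONFLICT (identity_hash) DO NOTHING
         RETURNING task_uuid",
    )
    .bind(identity_hash)
    .fetch_optional(pool)
    .await?
    {
        return Ok(task_uuid);
    }

    // Someone else created the task first; return the existing row.
    sqlx::query_scalar::<_, uuid::Uuid>(
        "SELECT task_uuid FROM tasker.tasks WHERE identity_hash = $1",
    )
    .bind(identity_hash)
    .fetch_one(pool)
    .await
}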

PGMQ Message Duplicate Delivery

Scenario: PGMQ message processed twice due to visibility timeout

#![allow(unused)]
fn main() {
// Worker completes step, sends result
pgmq.send("orchestration_step_results", result).await?;

// Orchestrator A receives message
let message = pgmq.read("orchestration_step_results").await?;

// A processes result
result_processor.process(message.payload).await?;

// A about to delete message, crashes
// Message visibility timeout expires → message reappears

// Orchestrator B receives same message
let duplicate = pgmq.read("orchestration_step_results").await?;

// B processes result
// State machine checks: step already Complete
// Returns early (idempotent)
result_processor.process(duplicate.payload).await?; // Harmless

// B deletes message
pgmq.delete(duplicate.msg_id).await?;
}

Protection:

  • State guards: Check current state before processing
  • Idempotent handlers: Safe to process same message multiple times
  • Message deletion: Only after confirmed processing

See Events and Commands for PGMQ architecture.


Multi-Instance Validation

The defense-in-depth architecture was validated through comprehensive multi-instance cluster testing. This section documents the validation results and confirms the effectiveness of the protection mechanisms.

Test Configuration

  • Orchestration Instances: 2 (ports 8080, 8081)
  • Worker Instances: 2 per type (Rust: 8100-8101, Ruby: 8200-8201, Python: 8300-8301, TypeScript: 8400-8401)
  • Total Services: 10 concurrent instances
  • Database: Shared PostgreSQL with PGMQ messaging

Validation Results

Metric | Result
Tests Passed | 1,645
Intermittent Failures | 3 (resource contention, not race conditions)
Tests Skipped | 21 (domain event tests, require single-instance)
Race Conditions Detected | 0
Data Corruption Detected | 0

What Was Validated

  1. Concurrent Task Creation

    • Tasks created through different orchestration instances
    • No duplicate tasks or UUIDs
    • All tasks complete successfully
    • State consistent across all instances
  2. Work Distribution

    • SKIP LOCKED distributes tasks without overlap
    • Multiple workers claim different steps
    • No duplicate step processing
  3. State Machine Guards

    • Invalid transitions rejected at state machine layer
    • Compare-and-swap prevents concurrent modifications
    • Terminal states protected from re-entry
  4. Transaction Boundaries

    • All-or-nothing semantics maintained under load
    • No partial task initialization observed
    • Crash recovery works correctly
  5. Cross-Instance Consistency

    • Task state queries return same result from any instance
    • Step state transitions visible immediately to all instances
    • No stale reads observed

Protection Layer Effectiveness

Layer | Validation Method | Result
Database Atomicity | Concurrent unique constraint tests | Duplicates correctly rejected
State Machine Guards | Parallel transition attempts | Invalid transitions blocked
Transaction Boundaries | Crash injection tests | Clean rollback, no corruption
Application Logic | State filtering under load | Idempotent processing confirmed

Intermittent Failures Analysis

Three tests showed intermittent failures under heavy parallelization:

  • Root Cause: Database connection pool exhaustion when running 1600+ tests in parallel
  • Evidence: Failures occurred only at high parallelism (>4 threads), not with serialized execution
  • Classification: Resource contention, NOT race conditions
  • Mitigation: Nextest configured with test-threads = 1 for multi_instance tests

Key Finding: No race conditions were detected. All intermittent failures traced to resource limits.

Domain Event Tests

21 tests were excluded from cluster mode using #[cfg(not(feature = "test-cluster"))]:

  • Reason: Domain event tests verify in-process event delivery (publish/subscribe within single process)
  • Behavior in Cluster: Events published in one instance aren’t delivered to subscribers in another instance
  • Status: Working as designed - these tests run correctly in single-instance CI

Stress Test Results

Rapid Task Burst Test:

  • 25 tasks created in <1 second
  • All tasks completed successfully
  • No duplicate UUIDs
  • Creation rate: ~50 tasks/second sustained

Round-Robin Distribution Test:

  • Tasks distributed evenly across orchestration instances
  • Load balancing working correctly
  • No single-instance bottleneck

Recommendations Validated

The following architectural decisions were validated by cluster testing:

  1. Ownership Removal: Processor UUID as audit-only (not enforced) enables automatic recovery
  2. SKIP LOCKED Pattern: Effective for contention-free work distribution
  3. State-Before-Queue Pattern: Prevents workers from seeing uncommitted state
  4. Find-or-Create Pattern: Handles concurrent entity creation correctly

Future Enhancements Identified

Testing identified one P2 improvement opportunity:

Atomic Finalization Claiming

  • Current: Second orchestrator gets StateMachineError during concurrent finalization
  • Proposed: Transaction-based locking for graceful handling
  • Priority: P2 (operational improvement, correctness already ensured)

Running Cluster Validation

To reproduce the validation:

# Setup cluster environment
cargo make setup-env-cluster

# Start full cluster
cargo make cluster-start-all

# Run all tests including cluster tests
cargo make test-rust-all

# Stop cluster
cargo make cluster-stop

See Cluster Testing Guide for detailed instructions.


Design Principles

Defense in Depth

The system intentionally provides multiple overlapping protection layers rather than relying on a single mechanism. This ensures:

  1. Resilience: If one layer fails (e.g., application bug), others prevent corruption
  2. Clear Semantics: Each layer has a specific purpose and failure mode
  3. Ease of Reasoning: Developers can understand guarantees at each level
  4. Graceful Degradation: System remains safe even under partial failures

Fail-Safe Defaults

When in doubt, the system errs on the side of caution:

  • State transitions fail if current state doesn’t match → prevents invalid states
  • Unique constraints fail creation → prevents duplicates
  • Row locks block concurrent access → prevents race conditions
  • Cycle detection fails initialization → prevents invalid workflows

Better to fail cleanly than to corrupt data.

Retry Safety

All critical operations are designed to be safely retryable:

  • Idempotent: Same operation, repeated → same outcome
  • State-Based: Check current state before acting
  • Atomic: All-or-nothing commits
  • No Side Effects: Operations don’t accumulate partial state

This enables:

  • Automatic retry after transient failures
  • Duplicate message handling
  • Recovery after crashes
  • Horizontal scaling without coordination overhead
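
The same principle can be captured in a small generic helper: because each operation is idempotent, simply invoking it again after a transient failure is safe. This is an illustrative sketch, not an API from the codebase:

async fn retry_idempotent<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..max_attempts {
        match op().await {
            Ok(value) => return Ok(value),    // same outcome no matter which attempt succeeds
            Err(err) => last_err = Some(err), // transient failure: safe to call again
        }
    }
    Err(last_err.expect("max_attempts must be at least 1"))
}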

Audit Trail Without Enforcement

Ownership Decision: Track ownership for observability, don’t enforce for correctness

#![allow(unused)]
fn main() {
// Processor UUID recorded in all transitions
pub struct TaskTransition {
    pub task_uuid: Uuid,
    pub from_state: TaskState,
    pub to_state: TaskState,
    pub processor_uuid: Uuid,  // For audit and debugging
    pub event: String,
    pub timestamp: DateTime<Utc>,
}

// But NOT enforced in transition logic
impl TaskStateMachine {
    pub async fn transition(&mut self, event: TaskEvent) -> Result<TaskState> {
        // ✅ Tracks processor UUID
        // ❌ Does NOT require ownership match
        // Reason: Enables recovery after crashes
    }
}
}

Why This Works:

  • State guards provide correctness (current state validation)
  • Processor UUID provides observability (who did what when)
  • No ownership blocking means automatic recovery
  • Full audit trail for debugging and monitoring

Implementation Checklist

When implementing new orchestration operations, ensure:

Database Layer

  • Unique constraints for entities that must be singular
  • FOR UPDATE locking for state transitions
  • FOR UPDATE SKIP LOCKED for work distribution
  • Compare-and-swap (CAS) in UPDATE WHERE clauses
  • Transaction wrapping for multi-step operations

State Machine Layer

  • Current state retrieval before transitions
  • Event applicability validation
  • Terminal state protection
  • Error handling for invalid transitions

Application Layer

  • Find-or-create pattern for shared entities
  • State-based filtering before processing
  • State-before-queue ordering for events
  • Idempotent message handlers

Testing

  • Concurrent operation tests (multiple orchestrators)
  • Crash recovery tests (mid-operation failures)
  • Retry safety tests (duplicate message handling)
  • Race condition tests (timing-dependent scenarios)

Core Architecture

Implementation Details

Multi-Instance Validation

Testing


← Back to Documentation Hub

Messaging Abstraction Architecture

Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Events and Commands | Deployment Patterns | Crate Architecture

← Back to Documentation Hub


Overview

The provider-agnostic messaging abstraction enables Tasker Core to support multiple messaging backends through a unified interface. This architecture allows switching between PGMQ (PostgreSQL Message Queue) and RabbitMQ without changes to business logic.

Key Benefits:

  • Zero handler changes required: Switching providers requires only configuration changes
  • Provider-specific optimizations: Each backend can leverage its native strengths
  • Testability: In-memory provider for fast unit testing
  • Gradual migration: Systems can transition between providers incrementally

Core Concepts

Message Delivery Models

Different messaging providers have fundamentally different delivery models:

Provider | Native Model | Push Support | Notification Type | Fallback Needed
PGMQ | Poll | Yes (pg_notify) | Signal only | Yes (catch-up)
RabbitMQ | Push | Yes (native) | Full message | No
InMemory | Push | Yes | Full message | No

PGMQ (Signal-Only):

  • pg_notify sends a signal that a message exists
  • Worker must fetch the message after receiving the signal
  • Fallback polling catches missed signals

RabbitMQ (Full Message Push):

  • basic_consume() delivers complete messages
  • No separate fetch required
  • Protocol guarantees delivery

Architecture Layers

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Application Layer                                  │
│  (Orchestration, Workers, Event Systems)                                    │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Uses MessageClient
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           MessageClient                                      │
│  Domain-level facade with queue classification                              │
│  Location: tasker-shared/src/messaging/client.rs                           │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Delegates to
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         MessagingProvider Enum                               │
│  Runtime dispatch without trait objects (zero-cost abstraction)             │
│  Location: tasker-shared/src/messaging/service/provider.rs                 │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    │               │               │
                    ▼               ▼               ▼
            ┌───────────┐   ┌───────────┐   ┌───────────┐
            │   PGMQ    │   │ RabbitMQ  │   │ InMemory  │
            │ Provider  │   │ Provider  │   │ Provider  │
            └───────────┘   └───────────┘   └───────────┘

Core Traits and Types

MessagingService Trait

Location: tasker-shared/src/messaging/service/traits.rs

The foundational trait defining queue operations:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait MessagingService: Send + Sync {
    // Queue lifecycle
    async fn create_queue(&self, name: &str) -> Result<(), MessagingError>;
    async fn delete_queue(&self, name: &str) -> Result<(), MessagingError>;
    async fn queue_exists(&self, name: &str) -> Result<bool, MessagingError>;
    async fn list_queues(&self) -> Result<Vec<String>, MessagingError>;

    // Message operations
    async fn send_message(&self, queue: &str, payload: &[u8]) -> Result<i64, MessagingError>;
    async fn send_message_with_delay(&self, queue: &str, payload: &[u8], delay_seconds: i64) -> Result<i64, MessagingError>;
    async fn receive_messages(&self, queue: &str, limit: i32, visibility_timeout: i32) -> Result<Vec<QueuedMessage<Vec<u8>>>, MessagingError>;
    async fn ack_message(&self, queue: &str, msg_id: i64) -> Result<(), MessagingError>;
    async fn nack_message(&self, queue: &str, msg_id: i64) -> Result<(), MessagingError>;

    // Provider information
    fn provider_name(&self) -> &'static str;
}
}

SupportsPushNotifications Trait

Location: tasker-shared/src/messaging/service/traits.rs

Extends MessagingService with push notification capabilities:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait SupportsPushNotifications: MessagingService {
    /// Subscribe to messages on a single queue
    fn subscribe(&self, queue_name: &str)
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>;

    /// Subscribe to messages on multiple queues
    fn subscribe_many(&self, queue_names: &[String])
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>;

    /// Whether this provider requires fallback polling for reliability
    fn requires_fallback_polling(&self) -> bool;

    /// Suggested polling interval if fallback is needed
    fn fallback_polling_interval(&self) -> Option<Duration>;

    /// Whether this provider supports fetching by message ID
    fn supports_fetch_by_message_id(&self) -> bool;
}
}

MessageNotification Enum

Location: tasker-shared/src/messaging/service/traits.rs

Abstracts the two notification models:

#![allow(unused)]
fn main() {
pub enum MessageNotification {
    /// Signal-only notification (PGMQ style)
    /// Indicates a message is available but requires separate fetch
    Available {
        queue_name: String,
        msg_id: Option<i64>,
    },

    /// Full message notification (RabbitMQ style)
    /// Contains the complete message payload
    Message(QueuedMessage<Vec<u8>>),
}
}

Provider Implementations

PGMQ Provider

Location: tasker-shared/src/messaging/service/providers/pgmq.rs

PostgreSQL-based message queue with LISTEN/NOTIFY integration:

#![allow(unused)]
fn main() {
impl SupportsPushNotifications for PgmqMessagingService {
    fn subscribe_many(&self, queue_names: &[String])
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
    {
        // Uses PgmqNotifyListener for pg_notify subscription
        // Returns MessageNotification::Available (signal-only) for large messages
        // Returns MessageNotification::Message for small messages (<7KB)
    }

    fn requires_fallback_polling(&self) -> bool {
        true  // pg_notify can miss signals during connection issues
    }

    fn supports_fetch_by_message_id(&self) -> bool {
        true  // PGMQ supports read_specific_message()
    }
}
}

Characteristics:

  • Uses PostgreSQL for storage and delivery
  • pg_notify for real-time notifications
  • Fallback polling required for reliability
  • Supports visibility timeout for message claiming

RabbitMQ Provider

Location: tasker-shared/src/messaging/service/providers/rabbitmq.rs

AMQP-based message broker with native push delivery:

#![allow(unused)]
fn main() {
impl SupportsPushNotifications for RabbitMqMessagingService {
    fn subscribe_many(&self, queue_names: &[String])
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
    {
        // Uses lapin basic_consume() for native push delivery
        // Always returns MessageNotification::Message (full payload)
    }

    fn requires_fallback_polling(&self) -> bool {
        false  // AMQP protocol guarantees delivery
    }

    fn supports_fetch_by_message_id(&self) -> bool {
        false  // RabbitMQ doesn't support fetch-by-ID
    }
}
}

Characteristics:

  • Native push delivery via AMQP protocol
  • No fallback polling needed
  • Higher throughput for high-volume scenarios
  • Requires separate infrastructure (RabbitMQ server)

InMemory Provider

Location: tasker-shared/src/messaging/service/providers/in_memory.rs

In-process message queue for testing:

#![allow(unused)]
fn main() {
impl SupportsPushNotifications for InMemoryMessagingService {
    fn requires_fallback_polling(&self) -> bool {
        false  // In-memory is reliable within process
    }
}
}

Use Cases:

  • Unit testing without external dependencies
  • Integration testing with controlled timing
  • Development environments

MessagingProvider Enum

Location: tasker-shared/src/messaging/service/provider.rs

Enum dispatch pattern for runtime provider selection without trait objects:

#![allow(unused)]
fn main() {
pub enum MessagingProvider {
    Pgmq(PgmqMessagingService),
    RabbitMq(RabbitMqMessagingService),
    InMemory(InMemoryMessagingService),
}

impl MessagingProvider {
    /// Delegate all MessagingService methods to the underlying provider
    pub async fn send_message(&self, queue: &str, payload: &[u8]) -> Result<i64, MessagingError> {
        match self {
            Self::Pgmq(p) => p.send_message(queue, payload).await,
            Self::RabbitMq(p) => p.send_message(queue, payload).await,
            Self::InMemory(p) => p.send_message(queue, payload).await,
        }
    }

    /// Subscribe to push notifications
    pub fn subscribe_many(&self, queue_names: &[String])
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
    {
        match self {
            Self::Pgmq(p) => p.subscribe_many(queue_names),
            Self::RabbitMq(p) => p.subscribe_many(queue_names),
            Self::InMemory(p) => p.subscribe_many(queue_names),
        }
    }

    /// Check if fallback polling is required
    pub fn requires_fallback_polling(&self) -> bool {
        match self {
            Self::Pgmq(p) => p.requires_fallback_polling(),
            Self::RabbitMq(p) => p.requires_fallback_polling(),
            Self::InMemory(p) => p.requires_fallback_polling(),
        }
    }
}
}

Benefits:

  • Zero-cost abstraction (no vtable indirection)
  • Exhaustive match ensures all providers handled
  • Easy to add new providers

MessageClient Facade

Location: tasker-shared/src/messaging/client.rs

Domain-level facade providing high-level queue operations:

#![allow(unused)]
fn main() {
pub struct MessageClient {
    provider: Arc<MessagingProvider>,
    classifier: QueueClassifier,
}

impl MessageClient {
    /// Send a step message to the appropriate namespace queue
    pub async fn send_step_message(
        &self,
        namespace: &str,
        step: &SimpleStepMessage,
    ) -> Result<i64, MessagingError> {
        let queue_name = self.classifier.step_queue_for_namespace(namespace);
        let payload = serde_json::to_vec(step)?;
        self.provider.send_message(&queue_name, &payload).await
    }

    /// Send a step result to the orchestration queue
    pub async fn send_step_result(
        &self,
        result: &StepExecutionResult,
    ) -> Result<i64, MessagingError> {
        let queue_name = self.classifier.orchestration_results_queue();
        let payload = serde_json::to_vec(result)?;
        self.provider.send_message(&queue_name, &payload).await
    }

    /// Access the underlying provider for advanced operations
    pub fn provider(&self) -> &MessagingProvider {
        &self.provider
    }
}
}

Event System Integration

Provider-Agnostic Queue Listeners

Both orchestration and worker queue listeners use provider.subscribe_many():

#![allow(unused)]
fn main() {
// tasker-orchestration/src/orchestration/orchestration_queues/listener.rs
impl OrchestrationQueueListener {
    pub async fn start(&mut self) -> Result<(), MessagingError> {
        let queues = vec![
            self.classifier.orchestration_results_queue(),
            self.classifier.orchestration_requests_queue(),
            self.classifier.orchestration_finalization_queue(),
        ];

        // Provider-agnostic subscription
        let stream = self.provider.subscribe_many(&queues)?;

        // Process notifications
        while let Some(notification) = stream.next().await {
            match notification {
                MessageNotification::Available { queue_name, msg_id } => {
                    // PGMQ style: send event command to fetch message
                    self.send_event_command(queue_name, msg_id).await;
                }
                MessageNotification::Message(msg) => {
                    // RabbitMQ style: send message command with full payload
                    self.send_message_command(msg).await;
                }
            }
        }
    }
}
}

Deployment Mode Selection

Event systems select the appropriate mode based on provider capabilities:

#![allow(unused)]
fn main() {
// Determine effective deployment mode for this provider
let effective_mode = deployment_mode.effective_for_provider(provider.provider_name());

match effective_mode {
    DeploymentMode::EventDrivenOnly => {
        // Start queue listener only (no fallback poller)
        // RabbitMQ typically uses this mode
    }
    DeploymentMode::Hybrid => {
        // Start both listener and fallback poller
        // PGMQ uses this mode for reliability
    }
    DeploymentMode::PollingOnly => {
        // Start fallback poller only
        // For restricted environments
    }
}
}

Command Routing

Dual Command Variants

Command processors handle both notification types:

#![allow(unused)]
fn main() {
pub enum OrchestrationCommand {
    // For full message notifications (RabbitMQ)
    ProcessStepResultFromMessage {
        queue_name: String,
        message: QueuedMessage<Vec<u8>>,
        resp: CommandResponder<StepProcessResult>,
    },

    // For signal-only notifications (PGMQ)
    ProcessStepResultFromMessageEvent {
        message_event: MessageReadyEvent,
        resp: CommandResponder<StepProcessResult>,
    },
}
}

Routing Logic:

  • MessageNotification::Message -> ProcessStepResultFromMessage
  • MessageNotification::Available -> ProcessStepResultFromMessageEvent

Type-Safe Channel Wrappers

NewType wrappers for MPSC channels prevent accidental misuse:

Orchestration Channels

Location: tasker-orchestration/src/orchestration/channels.rs

#![allow(unused)]
fn main() {
/// Strongly-typed sender for orchestration commands
#[derive(Debug, Clone)]
pub struct OrchestrationCommandSender(pub(crate) mpsc::Sender<OrchestrationCommand>);

/// Strongly-typed receiver for orchestration commands
#[derive(Debug)]
pub struct OrchestrationCommandReceiver(pub(crate) mpsc::Receiver<OrchestrationCommand>);

/// Strongly-typed sender for orchestration notifications
#[derive(Debug, Clone)]
pub struct OrchestrationNotificationSender(pub(crate) mpsc::Sender<OrchestrationNotification>);

/// Strongly-typed receiver for orchestration notifications
#[derive(Debug)]
pub struct OrchestrationNotificationReceiver(pub(crate) mpsc::Receiver<OrchestrationNotification>);
}

Worker Channels

Location: tasker-worker/src/worker/channels.rs

#![allow(unused)]
fn main() {
/// Strongly-typed sender for worker commands
#[derive(Debug, Clone)]
pub struct WorkerCommandSender(pub(crate) mpsc::Sender<WorkerCommand>);

/// Strongly-typed receiver for worker commands
#[derive(Debug)]
pub struct WorkerCommandReceiver(pub(crate) mpsc::Receiver<WorkerCommand>);
}

Channel Factory

#![allow(unused)]
fn main() {
pub struct ChannelFactory;

impl ChannelFactory {
    /// Create type-safe orchestration command channel pair
    pub fn orchestration_command_channel(buffer_size: usize)
        -> (OrchestrationCommandSender, OrchestrationCommandReceiver)
    {
        let (tx, rx) = mpsc::channel(buffer_size);
        (OrchestrationCommandSender(tx), OrchestrationCommandReceiver(rx))
    }
}
}

Benefits:

  • Compile-time prevention of channel misuse
  • Self-documenting function signatures
  • Zero runtime overhead (NewTypes compile away)
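
For example (import path and buffer size are illustrative), the factory hands back a matched pair:

// Import path assumed (tasker-orchestration/src/orchestration/channels.rs).
use tasker_orchestration::orchestration::channels::ChannelFactory;

fn main() {
    let (tx, rx) = ChannelFactory::orchestration_command_channel(1024);
    // tx is an OrchestrationCommandSender and rx an OrchestrationCommandReceiver;
    // handing tx to code that expects an OrchestrationNotificationSender fails to compile.
    drop((tx, rx));
}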

Configuration

Provider Selection

# config/dotenv/test.env
# Valid values: pgmq (default), rabbitmq
TASKER_MESSAGING_BACKEND=pgmq

# RabbitMQ connection (only used when backend=rabbitmq)
RABBITMQ_URL=amqp://tasker:tasker@localhost:5672/%2F
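
A wiring sketch of how the backend variable might be mapped onto the MessagingProvider enum; module paths and the RabbitMQ constructor name are assumptions, while the PGMQ constructor matches the test example later in this document:

// Paths assumed; the types are described earlier in this document.
use tasker_shared::messaging::service::provider::MessagingProvider;
use tasker_shared::messaging::service::providers::{PgmqMessagingService, RabbitMqMessagingService};
use tasker_shared::messaging::MessagingError;

async fn build_provider(pool: sqlx::PgPool) -> Result<MessagingProvider, MessagingError> {
    match std::env::var("TASKER_MESSAGING_BACKEND").as_deref() {
        Ok("rabbitmq") => {
            let url = std::env::var("RABBITMQ_URL")
                .expect("RABBITMQ_URL is required when the backend is rabbitmq");
            // Constructor name assumed for illustration.
            Ok(MessagingProvider::RabbitMq(RabbitMqMessagingService::connect(&url).await?))
        }
        // pgmq is the documented default.
        _ => Ok(MessagingProvider::Pgmq(PgmqMessagingService::new(pool).await?)),
    }
}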

Provider-Specific Settings

# config/tasker/base/common.toml
[pgmq]
visibility_timeout_seconds = 60
max_message_size_bytes = 1048576
batch_size = 100

[rabbitmq]
prefetch_count = 100
connection_timeout_seconds = 30
heartbeat_seconds = 60

Migration Guide

Switching from PGMQ to RabbitMQ

  1. Deploy RabbitMQ infrastructure
  2. Update configuration:
    export TASKER_MESSAGING_BACKEND=rabbitmq
    export RABBITMQ_URL=amqp://user:pass@rabbitmq:5672/%2F
    
  3. Restart services - No code changes required

Gradual Migration

For zero-downtime migration:

  1. Deploy new services with RabbitMQ configuration
  2. Gradually shift traffic to new services
  3. Monitor for any issues
  4. Decommission PGMQ-based services

Testing

Provider-Agnostic Tests

Most tests should use InMemoryMessagingService for speed:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_step_execution() {
    let provider = MessagingProvider::InMemory(InMemoryMessagingService::new());
    let client = MessageClient::new(Arc::new(provider));

    // Test with in-memory provider
    client.send_step_message("payments", &step_msg).await.unwrap();
}
}

Provider-Specific Tests

For integration tests that need specific provider behavior:

#![allow(unused)]
fn main() {
#[tokio::test]
#[cfg(feature = "integration-tests")]
async fn test_pgmq_notifications() {
    let provider = MessagingProvider::Pgmq(PgmqMessagingService::new(pool).await?);
    // Test PGMQ-specific behavior
}
}

Best Practices

1. Use MessageClient for Application Code

#![allow(unused)]
fn main() {
// Good: Use domain-level facade
let client = context.message_client();
client.send_step_result(&result).await?;

// Avoid: Direct provider access unless necessary
let provider = context.messaging_provider();
provider.send_message("queue", &payload).await?;
}

2. Handle Both Notification Types

#![allow(unused)]
fn main() {
match notification {
    MessageNotification::Available { queue_name, msg_id } => {
        // Signal-only: need to fetch message
    }
    MessageNotification::Message(msg) => {
        // Full message: can process immediately
    }
}
}

3. Respect Provider Capabilities

#![allow(unused)]
fn main() {
if provider.supports_fetch_by_message_id() {
    // Can use read_specific_message()
} else {
    // Must use alternative approach
}
}

4. Configure Fallback Appropriately

#![allow(unused)]
fn main() {
if provider.requires_fallback_polling() {
    // Start fallback poller for reliability
}
}


← Back to Documentation Hub

Next: Events and Commands | Deployment Patterns

States and Lifecycles

Last Updated: 2025-10-10 Audience: All Status: Active Related Docs: Documentation Hub | Events and Commands | Task Readiness & Execution

← Back to Documentation Hub


This document provides comprehensive documentation of the state machine architecture in tasker-core, covering both task and workflow step lifecycles, their state transitions, and the underlying persistence mechanisms.

Overview

The tasker-core system implements a sophisticated dual-state-machine architecture:

  1. Task State Machine: Manages overall workflow orchestration with 12 comprehensive states
  2. Workflow Step State Machine: Manages individual step execution with 10 states, including orchestration queuing

Both state machines work in coordination to provide atomic, auditable, and resilient workflow execution with proper event-driven communication between orchestration and worker systems.

Task State Machine Architecture

Task State Definitions

The task state machine implements 12 comprehensive states as defined in tasker-shared/src/state_machine/states.rs:

Initial States

  • Pending: Created but not started (default initial state)
  • Initializing: Discovering initial ready steps and setting up task context

Active Processing States

  • EnqueuingSteps: Actively enqueuing ready steps to worker queues
  • StepsInProcess: Steps are being processed by workers (orchestration monitoring)
  • EvaluatingResults: Processing results from completed steps and determining next actions

Waiting States

  • WaitingForDependencies: No ready steps, waiting for dependencies to be satisfied
  • WaitingForRetry: Waiting for retry timeout before attempting failed steps again
  • BlockedByFailures: Has failures that prevent progress (manual intervention may be needed)

Terminal States

  • Complete: All steps completed successfully (terminal)
  • Error: Task failed permanently (terminal)
  • Cancelled: Task was cancelled (terminal)
  • ResolvedManually: Manually resolved by operator (terminal)

Task State Properties

Each state has key properties that drive system behavior:

#![allow(unused)]
fn main() {
impl TaskState {
    pub fn is_terminal(&self) -> bool         // Cannot transition further
    pub fn requires_ownership(&self) -> bool  // Processor ownership required
    pub fn is_active(&self) -> bool          // Currently being processed  
    pub fn is_waiting(&self) -> bool         // Waiting for external conditions
    pub fn can_be_processed(&self) -> bool   // Available for orchestration pickup
}
}

Ownership-Required States: Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults
Processable States: Pending, WaitingForDependencies, WaitingForRetry
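
For example (module path assumed), orchestration code can rely on these helpers rather than matching on concrete variants:

use tasker_shared::state_machine::states::TaskState; // module path assumed

fn main() {
    let state = TaskState::WaitingForRetry;
    assert!(state.can_be_processed());    // listed above as a processable state
    assert!(state.is_waiting());          // waiting out the retry backoff
    assert!(!state.requires_ownership()); // only the four active states require ownership
    assert!(!state.is_terminal());        // further transitions remain possible
}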

Task Lifecycle Flow

stateDiagram-v2
    [*] --> Pending
    
    %% Initial Flow
    Pending --> Initializing : Start
    
    %% From Initializing
    Initializing --> EnqueuingSteps : ReadyStepsFound(count)
    Initializing --> Complete : NoStepsFound
    Initializing --> WaitingForDependencies : NoDependenciesReady
    
    %% Processing Flow
    EnqueuingSteps --> StepsInProcess : StepsEnqueued(uuids)
    EnqueuingSteps --> Error : EnqueueFailed(error)
    
    StepsInProcess --> EvaluatingResults : AllStepsCompleted
    StepsInProcess --> EvaluatingResults : StepCompleted(uuid)
    StepsInProcess --> WaitingForRetry : StepFailed(uuid)
    
    %% Result Evaluation
    EvaluatingResults --> Complete : AllStepsSuccessful
    EvaluatingResults --> EnqueuingSteps : ReadyStepsFound(count)
    EvaluatingResults --> WaitingForDependencies : NoDependenciesReady
    EvaluatingResults --> BlockedByFailures : PermanentFailure(error)
    
    %% Waiting States
    WaitingForDependencies --> EvaluatingResults : DependenciesReady
    WaitingForRetry --> EnqueuingSteps : RetryReady
    
    %% Problem Resolution
    BlockedByFailures --> Error : GiveUp
    BlockedByFailures --> ResolvedManually : ManualResolution
    
    %% Cancellation (from any non-terminal state)
    Pending --> Cancelled : Cancel
    Initializing --> Cancelled : Cancel
    EnqueuingSteps --> Cancelled : Cancel
    StepsInProcess --> Cancelled : Cancel
    EvaluatingResults --> Cancelled : Cancel
    WaitingForDependencies --> Cancelled : Cancel
    WaitingForRetry --> Cancelled : Cancel
    BlockedByFailures --> Cancelled : Cancel
    
    %% Legacy Support
    Error --> Pending : Reset
    
    %% Terminal States
    Complete --> [*]
    Error --> [*]
    Cancelled --> [*]
    ResolvedManually --> [*]

Task Event System

Task state transitions are driven by events defined in tasker-shared/src/state_machine/events.rs:

Lifecycle Events

  • Start: Begin task processing
  • Cancel: Cancel task execution
  • GiveUp: Abandon task (BlockedByFailures -> Error)
  • ManualResolution: Manually resolve task

Discovery Events

  • ReadyStepsFound(count): Ready steps discovered during initialization/evaluation
  • NoStepsFound: No steps defined - task can complete immediately
  • NoDependenciesReady: Dependencies not satisfied - wait required
  • DependenciesReady: Dependencies now ready - can proceed

Processing Events

  • StepsEnqueued(vec<Uuid>): Steps successfully queued for workers
  • EnqueueFailed(error): Failed to enqueue steps
  • StepCompleted(uuid): Individual step completed
  • StepFailed(uuid): Individual step failed
  • AllStepsCompleted: All current batch steps finished
  • AllStepsSuccessful: All steps completed successfully

System Events

  • PermanentFailure(error): Unrecoverable failure
  • RetryReady: Retry timeout expired
  • Timeout: Operation timeout occurred
  • ProcessorCrashed: Processor became unavailable

Processor Ownership

The task state machine implements processor ownership for active states to prevent race conditions:

#![allow(unused)]
fn main() {
// Ownership validation in task_state_machine.rs
if target_state.requires_ownership() {
    let current_processor = self.get_current_processor().await?;
    TransitionGuard::check_ownership(target_state, current_processor, self.processor_uuid)?;
}
}

Ownership Rules:

  • States requiring ownership: Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults
  • Processor UUID stored in tasker.task_transitions.processor_uuid column
  • Atomic ownership claiming prevents concurrent processing
  • Ownership validated on each transition attempt

Workflow Step State Machine Architecture

Step State Definitions

The workflow step state machine implements 10 states for individual step execution:

Processing Pipeline States

  • Pending: Initial state when step is created
  • Enqueued: Queued for processing but not yet claimed by worker
  • InProgress: Currently being executed by a worker
  • EnqueuedForOrchestration: Worker completed, queued for orchestration processing
  • EnqueuedAsErrorForOrchestration: Worker failed, queued for orchestration error processing

Waiting States

  • WaitingForRetry: Step failed with retryable error, waiting for backoff period before retry

Terminal States

  • Complete: Step completed successfully (after orchestration processing)
  • Error: Step failed permanently (non-retryable or max retries exceeded)
  • Cancelled: Step was cancelled
  • ResolvedManually: Step was manually resolved by operator

State Machine Evolution

Previously, the Error state was used for both retryable and permanent failures. The introduction of WaitingForRetry created a semantic change:

  • Before: Error = any failure (retryable or permanent)
  • After: Error = permanent failure only, WaitingForRetry = retryable failure awaiting backoff

This change required updates to:

  1. get_step_readiness_status() to recognize WaitingForRetry as a ready-eligible state
  2. get_task_execution_context() to properly detect blocked vs recovering tasks
  3. Error classification logic to distinguish permanent from retryable errors (see the sketch after this list)
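
A rough sketch of what that classification decision looks like; the enum and helper below are hypothetical stand-ins, not the actual tasker-core types:

// Illustrative only: the real classification lives in the result-processing path.
enum StepFailureKind {
    Retryable, // transient: timeouts, connection resets, 5xx responses
    Permanent, // validation failures, non-retryable business errors
}

fn next_step_event(kind: StepFailureKind, attempts: u32, max_attempts: u32) -> &'static str {
    match kind {
        StepFailureKind::Retryable if attempts < max_attempts => "WaitForRetry", // -> WaitingForRetry
        _ => "Fail",                                                             // -> Error (permanent)
    }
}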

Step State Properties

#![allow(unused)]
fn main() {
impl WorkflowStepState {
    pub fn is_terminal(&self) -> bool                    // No further transitions
    pub fn is_error(&self) -> bool                       // In error state (may allow retry)
    pub fn is_active(&self) -> bool                      // Being processed by worker
    pub fn is_in_processing_pipeline(&self) -> bool     // In execution pipeline
    pub fn is_ready_for_claiming(&self) -> bool         // Available for worker claim
    pub fn satisfies_dependencies(&self) -> bool        // Can satisfy other step dependencies
}
}

Step Lifecycle Flow

stateDiagram-v2
    [*] --> Pending

    %% Main Execution Path
    Pending --> Enqueued : Enqueue
    Enqueued --> InProgress : Start (worker claims)
    InProgress --> EnqueuedForOrchestration : EnqueueForOrchestration(success)
    EnqueuedForOrchestration --> Complete : Complete(results) [orchestration]

    %% Error Handling Path
    InProgress --> EnqueuedAsErrorForOrchestration : EnqueueForOrchestration(error)
    EnqueuedAsErrorForOrchestration --> WaitingForRetry : WaitForRetry(error) [retryable]
    EnqueuedAsErrorForOrchestration --> Error : Fail(error) [permanent/max retries]

    %% Retry Path
    WaitingForRetry --> Pending : Retry (after backoff)

    %% Legacy Direct Path (deprecated)
    InProgress --> Complete : Complete(results) [direct - legacy]
    InProgress --> Error : Fail(error) [direct - legacy]

    %% Legacy Backward Compatibility
    Pending --> InProgress : Start [legacy]

    %% Direct Failure Paths (error before worker processing)
    Pending --> Error : Fail(error)
    Enqueued --> Error : Fail(error)

    %% Cancellation Paths
    Pending --> Cancelled : Cancel
    Enqueued --> Cancelled : Cancel
    InProgress --> Cancelled : Cancel
    EnqueuedForOrchestration --> Cancelled : Cancel
    EnqueuedAsErrorForOrchestration --> Cancelled : Cancel
    WaitingForRetry --> Cancelled : Cancel
    Error --> Cancelled : Cancel

    %% Manual Resolution (from any state)
    Pending --> ResolvedManually : ResolveManually
    Enqueued --> ResolvedManually : ResolveManually
    InProgress --> ResolvedManually : ResolveManually
    EnqueuedForOrchestration --> ResolvedManually : ResolveManually
    EnqueuedAsErrorForOrchestration --> ResolvedManually : ResolveManually
    WaitingForRetry --> ResolvedManually : ResolveManually
    Error --> ResolvedManually : ResolveManually

    %% Terminal States
    Complete --> [*]
    Error --> [*]
    Cancelled --> [*]
    ResolvedManually --> [*]

Step Event System

Step transitions are driven by StepEvent types:

Processing Events

  • Enqueue: Queue step for worker processing
  • Start: Begin step execution (worker claims step)
  • EnqueueForOrchestration(results): Worker completes, queues for orchestration
  • Complete(results): Mark step complete (from orchestration or legacy direct)
  • Fail(error): Mark step as permanently failed
  • WaitForRetry(error): Mark step for retry after backoff

Control Events

  • Cancel: Cancel step execution
  • ResolveManually: Manual operator resolution
  • Retry: Retry step from WaitingForRetry or Error state

Step Execution Flow Integration

The step state machine integrates tightly with the task state machine:

  1. Task Discovers Ready Steps: TaskEvent::ReadyStepsFound(count) -> Task moves to EnqueuingSteps
  2. Steps Get Enqueued: StepEvent::Enqueue -> Steps move to Enqueued state
  3. Workers Claim Steps: StepEvent::Start -> Steps move to InProgress
  4. Workers Complete Steps: StepEvent::EnqueueForOrchestration(results) -> Steps move to EnqueuedForOrchestration
  5. Orchestration Processes Results: StepEvent::Complete(results) -> Steps move to Complete
  6. Task Evaluates Progress: TaskEvent::StepCompleted(uuid) -> Task moves to EvaluatingResults
  7. Task Completes or Continues: Based on remaining steps -> Task moves to Complete or back to EnqueuingSteps

Guard Conditions and Validation

Both state machines implement comprehensive guard conditions in tasker-shared/src/state_machine/guards.rs:

Task Guards

TransitionGuard

  • Validates all task state transitions
  • Prevents invalid state combinations
  • Enforces terminal state immutability
  • Supports legacy transition compatibility

Ownership Validation

  • Checks processor ownership for ownership-required states
  • Prevents concurrent task processing
  • Allows ownership claiming for unowned tasks

Step Guards

StepDependenciesMetGuard

  • Validates all step dependencies are satisfied
  • Delegates to WorkflowStep::dependencies_met()
  • Prevents premature step execution

StepNotInProgressGuard

  • Ensures step is not already being processed
  • Prevents duplicate worker claims
  • Validates step availability

Retry Guards

  • StepCanBeRetriedGuard: Validates step is in Error state
  • Checks retry limits and conditions
  • Prevents infinite retry loops

Orchestration Guards

  • StepCanBeEnqueuedForOrchestrationGuard: Step must be InProgress
  • StepCanBeCompletedFromOrchestrationGuard: Step must be EnqueuedForOrchestration
  • StepCanBeFailedFromOrchestrationGuard: Step must be EnqueuedForOrchestration

Persistence Layer Architecture

Delegation Pattern

The persistence layer in tasker-shared/src/state_machine/persistence.rs implements a delegation pattern to the model layer:

#![allow(unused)]
fn main() {
// TaskTransitionPersistence -> TaskTransition::create() & TaskTransition::get_current()
// StepTransitionPersistence -> WorkflowStepTransition::create() & WorkflowStepTransition::get_current()
}

Benefits:

  • No SQL duplication between state machine and models
  • Atomic transaction handling in models
  • Single source of truth for database operations
  • Independent testability of model methods

Transition Storage

Task Transitions (tasker.task_transitions)

CREATE TABLE tasker.task_transitions (
  task_transition_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  task_uuid UUID NOT NULL,
  to_state VARCHAR NOT NULL,
  from_state VARCHAR,
  processor_uuid UUID,           -- Ownership tracking
  metadata JSONB,
  sort_key INTEGER NOT NULL,
  most_recent BOOLEAN DEFAULT false,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Step Transitions (tasker.workflow_step_transitions)

CREATE TABLE tasker.workflow_step_transitions (
  workflow_step_transition_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  workflow_step_uuid UUID NOT NULL,
  to_state VARCHAR NOT NULL,
  from_state VARCHAR,
  metadata JSONB,
  sort_key INTEGER NOT NULL,
  most_recent BOOLEAN DEFAULT false,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Current State Resolution

Both transition models implement efficient current state resolution:

#![allow(unused)]
fn main() {
// O(1) current state lookup using most_recent flag
TaskTransition::get_current(pool, task_uuid) -> Option<TaskTransition>
WorkflowStepTransition::get_current(pool, step_uuid) -> Option<WorkflowStepTransition>
}

Performance Optimization:

  • most_recent = true flag on latest transition only
  • Indexed queries: (task_uuid, most_recent) WHERE most_recent = true
  • Atomic flag updates during transition creation
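
The lookup behind TaskTransition::get_current() has roughly the following shape (a sketch; the real query lives in the model layer):

async fn current_task_state(
    pool: &sqlx::PgPool,
    task_uuid: uuid::Uuid,
) -> Result<Option<String>, sqlx::Error> {
    // The partial index on (task_uuid, most_recent) makes this an O(1) lookup.
    sqlx::query_scalar(
        "SELECT to_state
         FROM tasker.task_transitions
         WHERE task_uuid = $1 AND most_recent = true",
    )
    .bind(task_uuid)
    .fetch_optional(pool)
    .await
}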

Atomic Transitions with Ownership

Atomic transitions with processor ownership:

#![allow(unused)]
fn main() {
impl TaskTransitionPersistence {
    pub async fn transition_with_ownership(
        &self,
        task_uuid: Uuid,
        from_state: TaskState,
        to_state: TaskState, 
        processor_uuid: Uuid,
        metadata: Option<Value>,
        pool: &PgPool,
    ) -> PersistenceResult<bool>
}
}

Atomicity Guarantees:

  • Single database transaction for state change
  • Processor UUID stored in dedicated column
  • most_recent flag updated atomically
  • Race condition prevention through database constraints
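
Usage sketch for the documented signature above; persistence, processor_uuid, and pool are assumed to be in scope inside an async context:

// Returns false when the compare-and-swap was lost to another processor.
let persisted = persistence
    .transition_with_ownership(
        task_uuid,
        TaskState::Pending,      // from_state
        TaskState::Initializing, // to_state
        processor_uuid,          // recorded for the audit trail
        None,                    // optional metadata
        &pool,
    )
    .await?;

if !persisted {
    // Another processor recorded the transition first; nothing further to do.
}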

Action System

Both state machines execute actions after successful transitions:

Task Actions

  1. PublishTransitionEventAction: Publishes task state change events
  2. UpdateTaskCompletionAction: Updates task completion status
  3. ErrorStateCleanupAction: Performs error state cleanup

Step Actions

  1. PublishTransitionEventAction: Publishes step state change events
  2. UpdateStepResultsAction: Updates step results and execution data
  3. TriggerStepDiscoveryAction: Triggers task-level step discovery
  4. ErrorStateCleanupAction: Performs step error cleanup

Actions execute sequentially after transition persistence, ensuring consistency.

State Machine Integration Points

Task <-> Step Coordination

  1. Step Discovery: Task initialization discovers ready steps
  2. Step Enqueueing: Task enqueues discovered steps to worker queues
  3. Progress Monitoring: Task monitors step completion via events
  4. Result Processing: Task processes step results and discovers next steps
  5. Completion Detection: Task completes when all steps are complete

Event-Driven Communication

  • pg_notify: PostgreSQL notifications for real-time coordination
  • Event Publishers: Publish state transition events to event system
  • Event Subscribers: React to state changes across system boundaries
  • Queue Integration: Provider-agnostic message queues (PGMQ or RabbitMQ) for worker communication

Worker Integration

  • Step Claiming: Workers claim Enqueued steps from queues
  • Progress Updates: Workers transition steps to InProgress
  • Result Submission: Workers submit results via EnqueueForOrchestration
  • Orchestration Processing: Orchestration processes results and completes steps

This sophisticated state machine architecture provides the foundation for reliable, auditable, and scalable workflow orchestration in the tasker-core system.

Step Result Audit System

The step result audit system provides SOC2-compliant audit trails for workflow step execution results, enabling complete attribution tracking for compliance and debugging.

Audit Table Design

The tasker.workflow_step_result_audit table stores lightweight references with attribution data:

CREATE TABLE tasker.workflow_step_result_audit (
    workflow_step_result_audit_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
    workflow_step_uuid UUID NOT NULL REFERENCES tasker.workflow_steps,
    workflow_step_transition_uuid UUID NOT NULL REFERENCES tasker.workflow_step_transitions,
    task_uuid UUID NOT NULL REFERENCES tasker.tasks,
    recorded_at TIMESTAMP NOT NULL DEFAULT NOW(),

    -- Attribution (NEW data not in transitions)
    worker_uuid UUID,
    correlation_id UUID,

    -- Extracted scalars for indexing/filtering
    success BOOLEAN NOT NULL,
    execution_time_ms BIGINT,

    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    UNIQUE (workflow_step_uuid, workflow_step_transition_uuid)
);

Design Principles

  1. No Data Duplication: Full execution results already exist in tasker.workflow_step_transitions.metadata. The audit table stores references only.

  2. Attribution Capture: The audit system captures NEW attribution data:

    • worker_uuid: Which worker instance processed the step
    • correlation_id: Distributed tracing identifier for request correlation
  3. Indexed Scalars: Success and execution time are extracted for efficient filtering without JSON parsing.

  4. SQL Trigger: A database trigger (trg_step_result_audit) guarantees audit record creation when workers persist results, ensuring SOC2 compliance.

Attribution Flow

Attribution data flows through the system via TransitionContext:

#![allow(unused)]
fn main() {
// Worker creates attribution context
let context = TransitionContext::with_worker(
    worker_uuid,
    Some(correlation_id),
);

// Context is merged into transition metadata
state_machine.transition_with_context(event, Some(context)).await?;

// The SQL trigger then extracts attribution from the transition metadata:
//   v_worker_uuid    := (NEW.metadata->>'worker_uuid')::UUID;
//   v_correlation_id := (NEW.metadata->>'correlation_id')::UUID;
}

Trigger Behavior

The create_step_result_audit trigger fires on transitions to:

  • enqueued_for_orchestration: Successful step completion
  • enqueued_as_error_for_orchestration: Failed step completion

These states represent when workers persist execution results, creating the audit trail.

Querying Audit History

Via API

GET /v1/tasks/{task_uuid}/workflow_steps/{step_uuid}/audit

Returns audit records with full transition details via JOIN, ordered by recorded_at DESC.

Via Client

#![allow(unused)]
fn main() {
let audit_history = client.get_step_audit_history(task_uuid, step_uuid).await?;
for record in audit_history {
    println!("Worker: {:?}, Success: {}, Time: {:?}ms",
        record.worker_uuid,
        record.success,
        record.execution_time_ms
    );
}
}

Via Model

#![allow(unused)]
fn main() {
// Get audit history for a step with full transition details
let history = WorkflowStepResultAudit::get_audit_history(&pool, step_uuid).await?;

// Get all audit records for a task
let task_history = WorkflowStepResultAudit::get_task_audit_history(&pool, task_uuid).await?;

// Query by worker for attribution investigation
let worker_records = WorkflowStepResultAudit::get_by_worker(&pool, worker_uuid, Some(100)).await?;

// Query by correlation ID for distributed tracing
let correlated = WorkflowStepResultAudit::get_by_correlation_id(&pool, correlation_id).await?;
}

Indexes for Common Query Patterns

The audit table includes optimized indexes:

  • idx_audit_step_uuid: Primary query - get audit history for a step
  • idx_audit_task_uuid: Get all audit records for a task
  • idx_audit_recorded_at: Time-range queries for SOC2 audit reports
  • idx_audit_worker_uuid: Attribution investigation (partial index)
  • idx_audit_correlation_id: Distributed tracing queries (partial index)
  • idx_audit_success: Success/failure filtering

Historical Data

The migration includes a backfill for existing transitions. Historical records will have NULL attribution (worker_uuid, correlation_id) since that data wasn’t captured before the audit system was introduced.

Worker Actor-Based Architecture

Last Updated: 2025-12-04 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Events and Commands

← Back to Documentation Hub


This document provides comprehensive documentation of the worker actor-based architecture in tasker-worker, covering the lightweight Actor pattern that mirrors the orchestration architecture for step execution and worker coordination.

Overview

The tasker-worker system implements a lightweight Actor pattern that mirrors the orchestration architecture, providing:

  1. Actor Abstraction: Worker components encapsulated as actors with clear lifecycle hooks
  2. Message-Based Communication: Type-safe message handling via Handler<M> trait
  3. Central Registry: WorkerActorRegistry for managing all worker actors
  4. Service Decomposition: Focused services following single responsibility principle
  5. Lock-Free Statistics: AtomicU64 counters for hot-path performance
  6. Direct Integration: Command processor routes to actors without wrapper layers

This architecture provides consistency between orchestration and worker systems, enabling clearer code organization and improved maintainability.

Implementation Status

Complete: All phases implemented and production-ready

  • Phase 1: Core abstractions (traits, registry, lifecycle management)
  • Phase 2: Service decomposition from 1575 LOC command_processor.rs
  • Phase 3: All 5 primary actors implemented
  • Phase 4: Command processor refactored to pure routing (~200 LOC)
  • Phase 5: Stateless service design eliminating lock contention
  • Cleanup: Lock-free AtomicU64 statistics, shared event system

Current State: Production-ready actor-based worker with 5 actors managing all step execution operations.

Core Concepts

What is a Worker Actor?

In the tasker-worker context, a Worker Actor is an encapsulated step execution component that:

  • Manages its own state: Each actor owns its dependencies and configuration
  • Processes messages: Responds to typed command messages via the Handler<M> trait
  • Has lifecycle hooks: Initialization (started) and cleanup (stopped) methods
  • Is isolated: Actors communicate through message passing
  • Is thread-safe: All actors are Send + Sync + 'static

Why Actors for Workers?

The previous architecture had a monolithic command processor:

#![allow(unused)]
fn main() {
// OLD: 1575 LOC monolithic command processor
pub struct WorkerProcessor {
    // All logic mixed together
    // RwLock contention on hot path
    // Two-phase initialization complexity
}
}

The actor pattern provides:

#![allow(unused)]
fn main() {
// NEW: Pure routing command processor (~200 LOC)
impl ActorCommandProcessor {
    async fn handle_command(&self, command: WorkerCommand) -> bool {
        match command {
            WorkerCommand::ExecuteStep { message, queue_name, resp } => {
                let msg = ExecuteStepMessage { message, queue_name };
                let result = self.actors.step_executor_actor.handle(msg).await;
                let _ = resp.send(result);
                true
            }
            // ... pure routing, no business logic
        }
    }
}
}

Actor vs Service

Services (underlying business logic):

  • Encapsulate step execution logic
  • Stateless operations on step data
  • Direct method invocation
  • Examples: StepExecutorService, FFICompletionService, WorkerStatusService

Actors (message-based coordination):

  • Wrap services with message-based interface
  • Manage service lifecycle
  • Asynchronous message handling
  • Examples: StepExecutorActor, FFICompletionActor, WorkerStatusActor

The relationship:

#![allow(unused)]
fn main() {
pub struct StepExecutorActor {
    context: Arc<SystemContext>,
    service: Arc<StepExecutorService>,  // Wraps underlying service
}

#[async_trait]
impl Handler<ExecuteStepMessage> for StepExecutorActor {
    async fn handle(&self, msg: ExecuteStepMessage) -> TaskerResult<bool> {
        // Delegates to stateless service
        self.service.execute_step(msg.message, &msg.queue_name).await
    }
}
}

Worker Actor Traits

WorkerActor Trait

The base trait for all worker actors, defined in tasker-worker/src/worker/actors/traits.rs:

#![allow(unused)]
fn main() {
/// Base trait for all worker actors
///
/// Provides lifecycle management and context access for all actors in the
/// worker system. All actors must implement this trait to participate
/// in the actor registry and lifecycle management.
pub trait WorkerActor: Send + Sync + 'static {
    /// Returns the unique name of this actor
    fn name(&self) -> &'static str;

    /// Returns a reference to the system context
    fn context(&self) -> &Arc<SystemContext>;

    /// Called when the actor is started
    fn started(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor started");
        Ok(())
    }

    /// Called when the actor is stopped
    fn stopped(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor stopped");
        Ok(())
    }
}
}

Handler Trait

The message handling trait, enabling type-safe message processing:

#![allow(unused)]
fn main() {
/// Message handler trait for specific message types
#[async_trait]
pub trait Handler<M: Message>: WorkerActor {
    /// Handle a message asynchronously
    async fn handle(&self, msg: M) -> TaskerResult<M::Response>;
}
}

Message Trait

The marker trait for command messages:

#![allow(unused)]
fn main() {
/// Marker trait for command messages
pub trait Message: Send + 'static {
    /// The response type for this message
    type Response: Send;
}
}
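
Together, WorkerActor, Handler<M>, and Message tie an actor, a message, and its response type together at compile time. The following is a minimal hypothetical sketch of the wiring; PingWorkerMessage and PingActor are invented for illustration and are not part of tasker-worker:

#![allow(unused)]
fn main() {
// Hypothetical example only: the real messages and actors (ExecuteStepMessage,
// StepExecutorActor, ...) live in tasker-worker.
pub struct PingWorkerMessage {
    pub source: String,
}

impl Message for PingWorkerMessage {
    // Handlers for this message must resolve to TaskerResult<bool>
    type Response = bool;
}

pub struct PingActor {
    context: Arc<SystemContext>,
}

impl WorkerActor for PingActor {
    fn name(&self) -> &'static str {
        "ping_actor"
    }

    fn context(&self) -> &Arc<SystemContext> {
        &self.context
    }
    // started()/stopped() fall back to the default logging implementations
}

#[async_trait]
impl Handler<PingWorkerMessage> for PingActor {
    // The compiler enforces the bool return type via PingWorkerMessage::Response
    async fn handle(&self, msg: PingWorkerMessage) -> TaskerResult<bool> {
        tracing::debug!(source = %msg.source, "ping received");
        Ok(true)
    }
}
}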

WorkerActorRegistry

The central registry managing all worker actors, defined in tasker-worker/src/worker/actors/registry.rs:

Structure

#![allow(unused)]
fn main() {
/// Registry managing all worker actors
#[derive(Clone)]
pub struct WorkerActorRegistry {
    /// System context shared by all actors
    context: Arc<SystemContext>,

    /// Worker ID for this registry
    worker_id: String,

    /// Step executor actor for step execution
    pub step_executor_actor: Arc<StepExecutorActor>,

    /// FFI completion actor for handling step completions
    pub ffi_completion_actor: Arc<FFICompletionActor>,

    /// Template cache actor for template management
    pub template_cache_actor: Arc<TemplateCacheActor>,

    /// Domain event actor for event dispatching
    pub domain_event_actor: Arc<DomainEventActor>,

    /// Worker status actor for health and status
    pub worker_status_actor: Arc<WorkerStatusActor>,
}
}

Initialization

All dependencies required at construction time (no two-phase initialization):

#![allow(unused)]
fn main() {
impl WorkerActorRegistry {
    pub async fn build(
        context: Arc<SystemContext>,
        worker_id: String,
        task_template_manager: Arc<TaskTemplateManager>,
        event_publisher: WorkerEventPublisher,
        domain_event_handle: DomainEventSystemHandle,
    ) -> TaskerResult<Self> {
        // Create actors with all dependencies upfront
        let mut step_executor_actor = StepExecutorActor::new(
            context.clone(),
            worker_id.clone(),
            task_template_manager.clone(),
            event_publisher,
            domain_event_handle,
        );

        // Call started() lifecycle hook
        step_executor_actor.started()?;

        // ... create other actors ...

        Ok(Self {
            context,
            worker_id,
            step_executor_actor: Arc::new(step_executor_actor),
            // ...
        })
    }
}
}

Implemented Actors

StepExecutorActor

Handles step execution from PGMQ messages and events.

Location: tasker-worker/src/worker/actors/step_executor_actor.rs

Messages:

  • ExecuteStepMessage - Execute step from raw data
  • ExecuteStepWithCorrelationMessage - Execute with FFI correlation
  • ExecuteStepFromPgmqMessage - Execute from PGMQ message
  • ExecuteStepFromEventMessage - Execute from event notification

Delegation: Wraps StepExecutorService (stateless, no locks)

Purpose: Central coordinator for all step execution, handles claiming, handler invocation, and result construction.

FFICompletionActor

Handles step completion results from FFI handlers.

Location: tasker-worker/src/worker/actors/ffi_completion_actor.rs

Messages:

  • SendStepResultMessage - Send result to orchestration
  • ProcessStepCompletionMessage - Process completion with correlation

Delegation: Wraps FFICompletionService

Purpose: Forwards step execution results to orchestration queue, manages correlation for async FFI handlers.

TemplateCacheActor

Manages task template caching and refresh.

Location: tasker-worker/src/worker/actors/template_cache_actor.rs

Messages:

  • RefreshTemplateCacheMessage - Refresh cache for namespace

Delegation: Wraps TaskTemplateManager

Purpose: Maintains handler template cache for efficient step execution.

DomainEventActor

Dispatches domain events after step completion.

Location: tasker-worker/src/worker/actors/domain_event_actor.rs

Messages:

  • DispatchDomainEventsMessage - Dispatch events for completed step

Delegation: Wraps DomainEventSystemHandle

Purpose: Fire-and-forget domain event dispatch (never blocks step completion).

WorkerStatusActor

Provides worker health and status reporting.

Location: tasker-worker/src/worker/actors/worker_status_actor.rs

Messages:

  • GetWorkerStatusMessage - Get current worker status
  • HealthCheckMessage - Perform health check
  • GetEventStatusMessage - Get event integration status
  • SetEventIntegrationMessage - Enable/disable event integration

Features:

  • Lock-free statistics via AtomicStepExecutionStats
  • AtomicU64 counters for total_executed, total_succeeded, total_failed
  • Average execution time computed on read from sum / count

Purpose: Real-time health monitoring and statistics without lock contention.

Lock-Free Statistics

The WorkerStatusActor uses atomic counters for lock-free statistics on the hot path:

#![allow(unused)]
fn main() {
/// Lock-free step execution statistics using atomic counters
#[derive(Debug)]
pub struct AtomicStepExecutionStats {
    total_executed: AtomicU64,
    total_succeeded: AtomicU64,
    total_failed: AtomicU64,
    total_execution_time_ms: AtomicU64,
}

impl AtomicStepExecutionStats {
    /// Record a successful step execution (lock-free)
    #[inline]
    pub fn record_success(&self, execution_time_ms: u64) {
        self.total_executed.fetch_add(1, Ordering::Relaxed);
        self.total_succeeded.fetch_add(1, Ordering::Relaxed);
        self.total_execution_time_ms.fetch_add(execution_time_ms, Ordering::Relaxed);
    }

    /// Record a failed step execution (lock-free)
    #[inline]
    pub fn record_failure(&self) {
        self.total_executed.fetch_add(1, Ordering::Relaxed);
        self.total_failed.fetch_add(1, Ordering::Relaxed);
    }

    /// Get a snapshot of current statistics
    pub fn snapshot(&self) -> StepExecutionStats {
        let total_executed = self.total_executed.load(Ordering::Relaxed);
        let total_time = self.total_execution_time_ms.load(Ordering::Relaxed);
        let average_execution_time_ms = if total_executed > 0 {
            total_time as f64 / total_executed as f64
        } else {
            0.0
        };
        StepExecutionStats {
            total_executed,
            total_succeeded: self.total_succeeded.load(Ordering::Relaxed),
            total_failed: self.total_failed.load(Ordering::Relaxed),
            average_execution_time_ms,
        }
    }
}
}

Benefits:

  • Zero lock contention on step completion (every step calls record_success or record_failure)
  • Sub-microsecond overhead per operation
  • Consistent averages computed from totals

Integration with Commands

ActorCommandProcessor

The command processor provides pure routing to actors:

#![allow(unused)]
fn main() {
impl ActorCommandProcessor {
    async fn handle_command(&self, command: WorkerCommand) -> bool {
        match command {
            // Step Execution Commands -> StepExecutorActor
            WorkerCommand::ExecuteStep { message, queue_name, resp } => {
                let msg = ExecuteStepMessage { message, queue_name };
                let result = self.actors.step_executor_actor.handle(msg).await;
                let _ = resp.send(result);
                true
            }

            // Completion Commands -> FFICompletionActor
            WorkerCommand::SendStepResult { result, resp } => {
                let msg = SendStepResultMessage { result };
                let send_result = self.actors.ffi_completion_actor.handle(msg).await;
                let _ = resp.send(send_result);
                true
            }

            // Status Commands -> WorkerStatusActor
            WorkerCommand::HealthCheck { resp } => {
                let result = self.actors.worker_status_actor.handle(HealthCheckMessage).await;
                let _ = resp.send(result);
                true
            }

            WorkerCommand::Shutdown { resp } => {
                let _ = resp.send(Ok(()));
                false  // Exit command loop
            }
        }
    }
}
}

FFI Completion Flow

Domain events are dispatched after successful orchestration notification:

#![allow(unused)]
fn main() {
async fn handle_ffi_completion(&self, step_result: StepExecutionResult) {
    // Record stats (lock-free)
    if step_result.success {
        self.actors.worker_status_actor
            .record_success(step_result.metadata.execution_time_ms as f64).await;
    } else {
        self.actors.worker_status_actor.record_failure().await;
    }

    // Send to orchestration FIRST
    let msg = SendStepResultMessage { result: step_result.clone() };
    match self.actors.ffi_completion_actor.handle(msg).await {
        Ok(()) => {
            // Domain events dispatched AFTER successful orchestration notification
            // Fire-and-forget - never blocks the worker
            self.actors.step_executor_actor
                .dispatch_domain_events(step_result.step_uuid, &step_result, None).await;
        }
        Err(e) => {
            // Don't dispatch domain events - orchestration wasn't notified
            tracing::error!(error = %e, "Failed to forward step completion to orchestration");
        }
    }
}
}

Service Decomposition

Large services were decomposed from the monolithic command processor:

StepExecutorService

services/step_execution/
├── mod.rs                  # Public API
├── service.rs              # StepExecutorService (~250 lines)
├── step_claimer.rs         # Step claiming logic
├── handler_invoker.rs      # Handler invocation
└── result_builder.rs       # Result construction

Key Design: Completely stateless service using &self methods. Wrapped in Arc<StepExecutorService> without any locks.
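
Because the service holds no mutable state, a single Arc can be cloned into as many concurrent tasks as needed without a Mutex or RwLock. A minimal sketch of the pattern (the service type below is a stand-in, not the real StepExecutorService):

#![allow(unused)]
fn main() {
use std::sync::Arc;

// Stand-in for a stateless service: configuration is read-only after construction
struct StatelessService {
    queue_prefix: String,
}

impl StatelessService {
    // &self only - no interior mutability, so no locks are required
    async fn execute(&self, step_id: u64) -> bool {
        tracing::debug!(queue = %self.queue_prefix, step_id, "executing");
        true
    }
}

async fn run_concurrently(service: Arc<StatelessService>) {
    let mut handles = Vec::new();
    for step_id in 0..10u64 {
        // Each task gets a cheap Arc clone; the service itself is never locked
        let svc = Arc::clone(&service);
        handles.push(tokio::spawn(async move { svc.execute(step_id).await }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}
}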

FFICompletionService

services/ffi_completion/
├── mod.rs                  # Public API
├── service.rs              # FFICompletionService
└── result_sender.rs        # Orchestration result sender

WorkerStatusService

services/worker_status/
├── mod.rs                  # Public API
└── service.rs              # WorkerStatusService

Key Architectural Decisions

1. Stateless Services

Services use &self methods with no mutable state:

#![allow(unused)]
fn main() {
impl StepExecutorService {
    pub async fn execute_step(
        &self,  // Immutable reference
        message: PgmqMessage<SimpleStepMessage>,
        queue_name: &str,
    ) -> TaskerResult<bool> {
        // Stateless execution - no mutable state
    }
}
}

Benefits:

  • Zero lock contention
  • Maximum concurrency per worker
  • Simplified reasoning about state

2. Constructor-Based Dependency Injection

All dependencies required at construction time:

#![allow(unused)]
fn main() {
pub async fn new(
    context: Arc<SystemContext>,
    worker_id: String,
    task_template_manager: Arc<TaskTemplateManager>,
    event_publisher: WorkerEventPublisher,        // Required
    domain_event_handle: DomainEventSystemHandle, // Required
) -> TaskerResult<Self>
}

Benefits:

  • Compiler enforces complete initialization
  • No “partially initialized” states
  • Clear dependency graph

3. Shared Event System

Event publisher and subscriber share the same WorkerEventSystem:

#![allow(unused)]
fn main() {
let shared_event_system = event_system
    .unwrap_or_else(|| Arc::new(WorkerEventSystem::new()));
let event_publisher =
    WorkerEventPublisher::with_event_system(worker_id.clone(), shared_event_system.clone());

// Enable subscriber with same shared system
processor.enable_event_subscriber(Some(shared_event_system)).await;
}

Benefits:

  • FFI handlers reliably receive step execution events
  • No isolated event systems causing silent failures

4. Graceful Degradation

Domain events never fail step completion:

#![allow(unused)]
fn main() {
// dispatch_domain_events returns () not TaskerResult<()>
// Errors logged but never propagated
pub async fn dispatch_domain_events(
    &self,
    step_uuid: Uuid,
    result: &StepExecutionResult,
    metadata: Option<HashMap<String, serde_json::Value>>,
) {
    // Fire-and-forget with error logging
    // Channel full? Log and continue
    // Dispatch error? Log and continue
}
}

Comparison with Orchestration Actors

Aspect | Orchestration | Worker
Actor Count | 4 actors | 5 actors
Registry | ActorRegistry | WorkerActorRegistry
Base Trait | OrchestrationActor | WorkerActor
Message Trait | Handler<M> | Handler<M> (same)
Service Design | Decomposed | Stateless
Statistics | N/A | Lock-free AtomicU64
LOC Reduction | ~800 -> ~200 | 1575 -> ~200

Benefits

1. Consistency with Orchestration

Same patterns and traits as orchestration actors:

  • Identical Handler<M> trait interface
  • Similar registry lifecycle management
  • Consistent message-based communication

2. Zero Lock Contention

  • Stateless services eliminate RwLock on hot path
  • AtomicU64 counters for statistics
  • Maximum concurrent step execution

3. Type Safety

Messages and responses checked at compile time:

#![allow(unused)]
fn main() {
// Compile error if types don't match
impl Handler<ExecuteStepMessage> for StepExecutorActor {
    async fn handle(&self, msg: ExecuteStepMessage) -> TaskerResult<bool> {
        // Must return bool, not something else
    }
}
}

4. Testability

  • Clear message boundaries for mocking
  • Isolated actor lifecycle for unit tests
  • 119 unit tests, 73 E2E tests passing

5. Maintainability

  • 1575 LOC -> ~200 LOC command processor
  • Focused services (<300 lines per file)
  • Clear separation of concerns

Detailed Analysis

For design rationale, see the Worker Decomposition ADR.

Summary

The worker actor-based architecture provides a consistent, type-safe foundation for step execution in tasker-worker. Key takeaways:

  1. Mirrors Orchestration: Same patterns as orchestration actors for consistency
  2. Lock-Free Performance: Stateless services and AtomicU64 counters
  3. Type Safety: Compile-time verification of message contracts
  4. Pure Routing: Command processor delegates without business logic
  5. Graceful Degradation: Domain events never fail step completion
  6. Production Ready: 119 unit tests, 73 E2E tests, full regression coverage

The architecture provides a solid foundation for high-throughput step execution while maintaining the proven reliability of the orchestration system.


<- Back to Documentation Hub

Worker Event Systems Architecture

Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Worker Actors | Events and Commands | Messaging Abstraction

<- Back to Documentation Hub


This document provides comprehensive documentation of the worker event system architecture in tasker-worker, covering the dual-channel event pattern, domain event publishing, and FFI integration.

Overview

The worker event system implements a dual-channel architecture for non-blocking step execution:

  1. WorkerEventSystem: Receives step execution events via provider-agnostic subscriptions
  2. HandlerDispatchService: Fire-and-forget handler invocation with bounded concurrency
  3. CompletionProcessorService: Routes results back to orchestration
  4. DomainEventSystem: Fire-and-forget domain event publishing

Messaging Backend Support: The worker event system supports multiple messaging backends (PGMQ, RabbitMQ) through a provider-agnostic abstraction. See Messaging Abstraction for details.

This architecture enables true parallel handler execution while maintaining strict ordering guarantees for domain events.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                           WORKER EVENT FLOW                                  │
└─────────────────────────────────────────────────────────────────────────────┘

                    MessagingProvider (PGMQ or RabbitMQ)
                                  │
                                  │ provider.subscribe_many()
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         WorkerEventSystem                                    │
│  ┌──────────────────────┐    ┌──────────────────────┐                       │
│  │  WorkerQueueListener │    │  WorkerFallbackPoller │                      │
│  │  (provider-agnostic) │    │  (PGMQ only)          │                      │
│  └──────────┬───────────┘    └──────────┬───────────┘                       │
│             │                           │                                    │
│             └───────────┬───────────────┘                                    │
│                         │                                                    │
│                         ▼                                                    │
│   MessageNotification::Message → ExecuteStepFromMessage (RabbitMQ)          │
│   MessageNotification::Available → ExecuteStepFromEvent (PGMQ)              │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      ActorCommandProcessor                                   │
│                              │                                               │
│                              ▼                                               │
│                      StepExecutorActor                                       │
│                              │                                               │
│                              │ claim step, send to dispatch channel          │
│                              ▼                                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
                    ┌─────────────┴─────────────┐
                    │                           │
           Rust Workers               FFI Workers (Ruby/Python)
                    │                           │
                    ▼                           ▼
┌───────────────────────────────┐   ┌───────────────────────────────┐
│   HandlerDispatchService      │   │     FfiDispatchChannel        │
│                               │   │                               │
│   dispatch_receiver           │   │   pending_events HashMap      │
│         │                     │   │         │                     │
│         ▼                     │   │         ▼                     │
│   [Semaphore] N permits       │   │   poll_step_events()          │
│         │                     │   │         │                     │
│         ▼                     │   │         ▼                     │
│   handler.call()              │   │   Ruby/Python handler         │
│         │                     │   │         │                     │
│         ▼                     │   │         ▼                     │
│   PostHandlerCallback         │   │   complete_step_event()       │
│         │                     │   │         │                     │
│         ▼                     │   │         ▼                     │
│   completion_sender           │   │   PostHandlerCallback         │
│                               │   │         │                     │
└───────────────┬───────────────┘   │         ▼                     │
                │                   │   completion_sender           │
                │                   │                               │
                │                   └───────────────┬───────────────┘
                │                                   │
                └───────────────┬───────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CompletionProcessorService                                │
│                              │                                               │
│                              ▼                                               │
│                    FFICompletionService                                      │
│                              │                                               │
│                              ▼                                               │
│               orchestration_step_results queue                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
                         Orchestration

Core Components

1. WorkerEventSystem

Location: tasker-worker/src/worker/event_systems/worker_event_system.rs

Implements the EventDrivenSystem trait for worker namespace queue processing. Supports three deployment modes with provider-agnostic message handling:

Mode | Description | PGMQ Behavior | RabbitMQ Behavior
PollingOnly | Traditional polling | Poll PGMQ tables | Poll via basic_get
EventDrivenOnly | Pure push delivery | pg_notify signals | basic_consume push
Hybrid | Event-driven + polling | pg_notify + fallback | Push only (no fallback)

Provider-Specific Behavior:

  • PGMQ: Uses MessageNotification::Available (signal-only), requires fallback polling
  • RabbitMQ: Uses MessageNotification::Message (full payload), no fallback needed

Key Features:

  • Unified configuration via WorkerEventSystemConfig
  • Atomic statistics with AtomicU64 counters
  • Converts WorkerNotification to WorkerCommand for processing
#![allow(unused)]
fn main() {
// Worker notification to command conversion (provider-agnostic)
match notification {
    // RabbitMQ style - full message delivered
    WorkerNotification::Message(msg) => {
        command_sender.send(WorkerCommand::ExecuteStepFromMessage {
            queue_name: msg.queue_name.clone(),
            message: msg,
            resp: resp_tx,
        }).await;
    }
    // PGMQ style - signal-only, requires fetch
    WorkerNotification::Event(WorkerQueueEvent::StepMessage(msg_event)) => {
        command_sender.send(WorkerCommand::ExecuteStepFromEvent {
            message_event: msg_event,
            resp: resp_tx,
        }).await;
    }
    // ...
}
}

2. HandlerDispatchService

Location: tasker-worker/src/worker/handlers/dispatch_service.rs

Non-blocking handler dispatch with bounded parallelism.

Architecture:

dispatch_receiver → [Semaphore] → handler.call() → [callback] → completion_sender
                         │                              │
                         └─→ Bounded to N concurrent    └─→ Domain events
                              tasks

Key Design Decisions:

  1. Semaphore-Bounded Concurrency: Limits concurrent handlers to prevent resource exhaustion
  2. Permit Release Before Send: Prevents backpressure cascade
  3. Post-Handler Callback: Domain events fire only after result is committed
#![allow(unused)]
fn main() {
tokio::spawn(async move {
    let permit = semaphore.acquire().await?;

    let result = execute_with_timeout(&registry, &msg, timeout).await;

    // Release permit BEFORE sending to completion channel
    drop(permit);

    // Send result FIRST
    sender.send(result.clone()).await?;

    // Callback fires AFTER result is committed
    if let Some(cb) = callback {
        cb.on_handler_complete(&step, &result, &worker_id).await;
    }
});
}

Error Handling:

Scenario | Behavior
Handler timeout | StepExecutionResult::failure() with error_type=handler_timeout
Handler panic | Caught via catch_unwind(), failure result generated
Handler error | Failure result with error_type=handler_error
Semaphore closed | Failure result with error_type=semaphore_acquisition_failed

Handler Resolution

Before handler execution, the dispatch service resolves the handler using a resolver chain pattern:

HandlerDefinition                    ResolverChain                    Handler
     │                                    │                              │
     │  callable: "process_payment"       │                              │
     │  method: "refund"                  │                              │
     │  resolver: null                    │                              │
     │                                    │                              │
     ├───────────────────────────────────►│                              │
     │                                    │                              │
     │                    ┌───────────────┴───────────────┐              │
     │                    │ ExplicitMappingResolver (10)  │              │
     │                    │ can_resolve? ─► YES           │              │
     │                    │ resolve() ─────────────────────────────────►│
     │                    └───────────────────────────────┘              │
     │                                                                   │
     │                    ┌───────────────────────────────┐              │
     │                    │ MethodDispatchWrapper         │              │
     │                    │ (if method != "call")         │◄─────────────┤
     │                    └───────────────────────────────┘              │

Built-in Resolvers:

Resolver | Priority | Function
ExplicitMappingResolver | 10 | Hash lookup of registered handlers
ClassConstantResolver | 100 | Runtime class lookup (Ruby only)
ClassLookupResolver | 100 | Runtime class lookup (Python/TypeScript only)

Method Dispatch: When handler.method is specified and not "call", a MethodDispatchWrapper is applied to invoke the specified method instead of the default call() method.
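
Conceptually, the chain tries resolvers in ascending priority order and the first one that can resolve the definition wins. The sketch below is a hypothetical illustration of that shape in Rust; the trait and type names are invented here and are not the binding APIs:

#![allow(unused)]
fn main() {
// Hypothetical resolver-chain shape, invented for illustration
struct HandlerDefinition {
    callable: String,
    method: Option<String>,
}

struct ResolvedHandler {
    name: String,
}

trait HandlerResolver {
    fn priority(&self) -> u32;
    fn can_resolve(&self, definition: &HandlerDefinition) -> bool;
    fn resolve(&self, definition: &HandlerDefinition) -> Option<ResolvedHandler>;
}

struct ResolverChain {
    resolvers: Vec<Box<dyn HandlerResolver>>,
}

impl ResolverChain {
    fn resolve(&self, definition: &HandlerDefinition) -> Option<ResolvedHandler> {
        // Lower priority value wins first: ExplicitMappingResolver (10) is tried
        // before the class-lookup resolvers (100)
        let mut ordered: Vec<&Box<dyn HandlerResolver>> = self.resolvers.iter().collect();
        ordered.sort_by_key(|r| r.priority());

        let handler = ordered
            .into_iter()
            .find(|r| r.can_resolve(definition))
            .and_then(|r| r.resolve(definition));

        // A MethodDispatchWrapper would then be applied here when
        // definition.method is Some and not "call"
        handler
    }
}
}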

See Handler Resolution Guide for complete documentation.

3. FfiDispatchChannel

Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs

Pull-based polling interface for FFI workers (Ruby, Python). Enables language-specific handlers without complex FFI memory management.

Flow:

Rust                           Ruby/Python
  │                                 │
  │  dispatch(step)                 │
  │ ──────────────────────────────► │
  │                                 │ pending_events.insert()
  │                                 │
  │  poll_step_events()             │
  │ ◄────────────────────────────── │
  │                                 │
  │                                 │ handler.call()
  │                                 │
  │  complete_step_event(result)    │
  │ ◄────────────────────────────── │
  │                                 │
  │  PostHandlerCallback            │
  │  completion_sender.send()       │
  │                                 │

Key Features:

  • Thread-safe pending events map with lock poisoning recovery
  • Configurable completion timeout (default 30s)
  • Starvation detection and warnings
  • Fire-and-forget callbacks via runtime_handle.spawn()

4. CompletionProcessorService

Location: tasker-worker/src/worker/handlers/completion_processor.rs

Receives completed step results and routes to orchestration queue via FFICompletionService.

completion_receiver → CompletionProcessorService → FFICompletionService → orchestration_step_results

Note: Currently processes completions sequentially. Parallel processing is planned as a future enhancement.

5. DomainEventSystem

Location: tasker-worker/src/worker/event_systems/domain_event_system.rs

Async system for fire-and-forget domain event publishing.

Architecture:

command_processor.rs                  DomainEventSystem
      │                                     │
      │ try_send(command)                   │ spawn process_loop()
      ▼                                     ▼
mpsc::Sender<DomainEventCommand>  →  mpsc::Receiver
                                            │
                                            ▼
                                    EventRouter → PGMQ / InProcess

Key Design:

  • try_send() never blocks - if channel is full, events are dropped with metrics
  • Background task processes commands asynchronously
  • Graceful shutdown drains fast events up to configurable timeout
  • Three delivery modes: Durable (PGMQ), Fast (in-process), Broadcast

Shared Event Abstractions

EventDrivenSystem Trait

Location: tasker-shared/src/event_system/event_driven.rs

Unified trait for all event-driven systems:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait EventDrivenSystem: Send + Sync {
    type SystemId: Send + Sync + Clone;
    type Event: Send + Sync + Clone;
    type Config: Send + Sync + Clone;
    type Statistics: EventSystemStatistics;

    fn system_id(&self) -> Self::SystemId;
    fn deployment_mode(&self) -> DeploymentMode;
    fn is_running(&self) -> bool;

    async fn start(&mut self) -> Result<(), DeploymentModeError>;
    async fn stop(&mut self) -> Result<(), DeploymentModeError>;
    async fn process_event(&self, event: Self::Event) -> Result<(), DeploymentModeError>;
    async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError>;

    fn statistics(&self) -> Self::Statistics;
    fn config(&self) -> &Self::Config;
}
}

Deployment Modes

Location: tasker-shared/src/event_system/deployment.rs

#![allow(unused)]
fn main() {
pub enum DeploymentMode {
    PollingOnly,      // Traditional polling, no events
    EventDrivenOnly,  // Pure event-driven, no polling
    Hybrid,           // Event-driven with polling fallback
}
}

PostHandlerCallback Trait

Location: tasker-worker/src/worker/handlers/dispatch_service.rs

Extensibility point for post-handler actions:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait PostHandlerCallback: Send + Sync + 'static {
    /// Called after a handler completes
    async fn on_handler_complete(
        &self,
        step: &TaskSequenceStep,
        result: &StepExecutionResult,
        worker_id: &str,
    );

    /// Name of this callback for logging purposes
    fn name(&self) -> &str;
}
}
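
A concrete callback only needs to implement this trait. The sketch below follows the shape of DomainEventCallback, but the struct, field, and body are assumptions for illustration rather than the real implementation:

#![allow(unused)]
fn main() {
// Illustrative sketch only; the real DomainEventCallback lives in tasker-worker
pub struct ExampleDomainEventCallback {
    handle: DomainEventSystemHandle,
}

#[async_trait]
impl PostHandlerCallback for ExampleDomainEventCallback {
    async fn on_handler_complete(
        &self,
        step: &TaskSequenceStep,
        result: &StepExecutionResult,
        worker_id: &str,
    ) {
        // Fire-and-forget: failures here are logged, never propagated to the step
        tracing::debug!(worker_id, success = result.success, "handler complete");
        // ... build domain events from `step` and `result`, then hand them to the
        // DomainEventSystem handle (which uses try_send internally)
    }

    fn name(&self) -> &str {
        "example_domain_event_callback"
    }
}
}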

Implementations:

  • NoOpCallback: Default no-operation callback
  • DomainEventCallback: Publishes domain events to DomainEventSystem

Configuration

Worker Event System

# config/tasker/base/event_systems.toml
[event_systems.worker]
system_id = "worker-event-system"
deployment_mode = "Hybrid"

[event_systems.worker.metadata.listener]
retry_interval_seconds = 5
max_retry_attempts = 3
event_timeout_seconds = 60
batch_processing = true
connection_timeout_seconds = 30

[event_systems.worker.metadata.fallback_poller]
enabled = true
polling_interval_ms = 100
batch_size = 10
age_threshold_seconds = 30
max_age_hours = 24
visibility_timeout_seconds = 60

Handler Dispatch

# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000

[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000
completion_timeout_ms = 30000
starvation_warning_threshold_ms = 10000
callback_timeout_ms = 5000
completion_send_timeout_ms = 10000

Integration with Worker Actors

The event systems integrate with the worker actor architecture:

WorkerEventSystem
       │
       ▼
ActorCommandProcessor
       │
       ├──► StepExecutorActor ──► dispatch_sender
       │
       ├──► FFICompletionActor ◄── completion_receiver
       │
       └──► DomainEventActor ◄── PostHandlerCallback

See Worker Actors Documentation for actor details.

Event Flow Guarantees

Ordering Guarantee

Domain events fire AFTER result is committed to completion channel:

handler.call()
    → result committed to completion_sender
    → PostHandlerCallback.on_handler_complete()
    → domain events dispatched

This eliminates race conditions where downstream systems see events before orchestration processes results.

Idempotency Guarantee

State machine guards prevent duplicate execution:

  1. Step claimed atomically via transition_step_state_atomic()
  2. State guards reject duplicate claims
  3. Results are deduplicated by completion channel

Fire-and-Forget Guarantee

Domain event failures never fail step completion:

#![allow(unused)]
fn main() {
// DomainEventCallback
pub async fn on_handler_complete(
    &self,
    step: &TaskSequenceStep,
    result: &StepExecutionResult,
    worker_id: &str,
) {
    // dispatch_events uses try_send() - never blocks
    // If channel full, events dropped with metrics
    // Step completion is NOT affected
    self.handle.dispatch_events(events, publisher_name, correlation_id);
}
}

Monitoring

Key Metrics

Metric | Description
tasker.worker.events_processed | Total events processed
tasker.worker.events_failed | Events that failed processing
tasker.ffi.pending_events | Pending FFI events (starvation indicator)
tasker.ffi.oldest_event_age_ms | Age of oldest pending event
tasker.channel.completion.saturation | Completion channel utilization
tasker.domain_events.dispatched | Domain events dispatched
tasker.domain_events.dropped | Domain events dropped (backpressure)

Health Checks

#![allow(unused)]
fn main() {
async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError> {
    if self.is_running.load(Ordering::Acquire) {
        Ok(DeploymentModeHealthStatus::Healthy)
    } else {
        Ok(DeploymentModeHealthStatus::Critical)
    }
}
}

Backpressure Handling

The worker event system implements multiple backpressure mechanisms to ensure graceful degradation under load while preserving step idempotency.

Backpressure Points

┌─────────────────────────────────────────────────────────────────────────────┐
│                      WORKER BACKPRESSURE FLOW                                │
└─────────────────────────────────────────────────────────────────────────────┘

[1] Step Claiming
    │
    ├── Planned: Capacity check before claiming
    │   └── If at capacity: Leave message in queue (visibility timeout)
    │
    ▼
[2] Handler Dispatch Channel (Bounded)
    │
    ├── dispatch_buffer_size = 1000
    │   └── If full: Sender blocks until space available
    │
    ▼
[3] Semaphore-Bounded Execution
    │
    ├── max_concurrent_handlers = 10
    │   └── If permits exhausted: Task waits for permit
    │
    ├── CRITICAL: Permit released BEFORE sending to completion channel
    │   └── Prevents backpressure cascade
    │
    ▼
[4] Completion Channel (Bounded)
    │
    ├── completion_buffer_size = 1000
    │   └── If full: Handler task blocks until space available
    │
    ▼
[5] Domain Events (Fire-and-Forget)
    │
    └── try_send() semantics
        └── If channel full: Events DROPPED (step execution unaffected)

Handler Dispatch Backpressure

The HandlerDispatchService uses semaphore-bounded parallelism:

#![allow(unused)]
fn main() {
// Permit acquisition blocks if all permits in use
let permit = semaphore.acquire().await?;

let result = execute_with_timeout(&registry, &msg, timeout).await;

// CRITICAL: Release permit BEFORE sending to completion channel
// This prevents backpressure cascade where full completion channel
// holds permits, starving new handler execution
drop(permit);

// Now send to completion channel (may block if full)
sender.send(result).await?;
}

Why permit release before send matters:

  • If completion channel is full, handler task blocks on send
  • If permit is held during block, no new handlers can start
  • By releasing permit first, new handlers can start even if completions are backing up

FFI Dispatch Backpressure

The FfiDispatchChannel handles backpressure for Ruby/Python workers:

Scenario | Behavior
Dispatch channel full | Sender blocks
FFI polling too slow | Starvation warning logged
Completion timeout | Failure result generated
Callback timeout | Callback fire-and-forget, logged

Starvation Detection:

[worker.mpsc_channels.ffi_dispatch]
starvation_warning_threshold_ms = 10000  # Warn if event waits > 10s

Domain Event Drop Semantics

Domain events use try_send() and are explicitly designed to be droppable:

#![allow(unused)]
fn main() {
// Domain events fire AFTER result is committed
// They are non-critical and use fire-and-forget semantics
match event_sender.try_send(event) {
    Ok(()) => { /* Event dispatched */ }
    Err(TrySendError::Full(_)) => {
        // Event dropped - step execution NOT affected
        warn!("Domain event dropped: channel full");
        metrics.increment("domain_events_dropped");
    }
    Err(TrySendError::Closed(_)) => {
        // Receiver gone (e.g. during shutdown) - likewise non-fatal for the step
        warn!("Domain event dropped: channel closed");
    }
}
}

Why this is safe: Domain events are informational. Dropping them does not affect step execution correctness. The step result is already committed to the completion channel before domain events fire.

Step Claiming Backpressure (Planned)

Future enhancement: Workers will check capacity before claiming steps:

#![allow(unused)]
fn main() {
// Planned implementation
fn should_claim_step(&self) -> bool {
    let available = self.semaphore.available_permits();
    let threshold = self.config.claim_capacity_threshold;  // e.g., 0.8
    let max = self.config.max_concurrent_handlers;

    available as f64 / max as f64 > (1.0 - threshold)
}
}

If at capacity:

  • Worker does NOT acknowledge the PGMQ message
  • Message returns to queue after visibility timeout
  • Another worker (or same worker later) claims it

Idempotency Under Backpressure

All backpressure mechanisms preserve step idempotency:

Backpressure Point | Idempotency Guarantee
Claim refusal | Message stays in queue, visibility timeout protects
Dispatch channel full | Step claimed but queued for execution
Semaphore wait | Step claimed, waiting for permit
Completion channel full | Handler completed, result buffered
Domain event drop | Non-critical, step result already persisted

Critical Rule: A claimed step MUST produce a result (success or failure). Backpressure may delay but never drop step execution.

For comprehensive backpressure strategy, see Backpressure Architecture.

Best Practices

1. Choose Deployment Mode

  • Production: Use Hybrid for reliability with event-driven performance
  • Development: Use EventDrivenOnly for fastest iteration
  • Restricted environments: Use PollingOnly when pg_notify unavailable

2. Tune Concurrency

[worker.mpsc_channels.handler_dispatch]
max_concurrent_handlers = 10  # Start here, increase based on monitoring

Monitor:

  • Semaphore wait times
  • Handler execution latency
  • Completion channel saturation

3. Configure Timeouts

handler_timeout_ms = 30000        # Match your slowest handler
completion_timeout_ms = 30000     # FFI completion timeout
callback_timeout_ms = 5000        # Domain event callback timeout

4. Monitor Starvation

For FFI workers, monitor pending event age:

# Ruby
metrics = Tasker.ffi_dispatch_metrics
if metrics[:oldest_pending_age_ms] > 10000
  warn "FFI polling falling behind"
end

<- Back to Documentation Hub

Tasker Core Guides

This directory contains practical how-to guides for working with Tasker Core.

Documents

Document | Description
Quick Start | Get running in 5 minutes
Use Cases and Patterns | Practical workflow examples
Conditional Workflows | Runtime decision-making and dynamic steps
Batch Processing | Parallel processing with cursor-based workers
DLQ System | Dead letter queue investigation and resolution
Retry Semantics | Understanding max_attempts and retryable flags
Identity Strategy | Task deduplication with STRICT, CALLER_PROVIDED, ALWAYS_UNIQUE
Configuration Management | TOML architecture, CLI tools, runtime observability

When to Read These

  • Getting started: Begin with Quick Start
  • Implementing features: Check Use Cases and Patterns
  • Handling errors: See Retry Semantics and DLQ System
  • Processing data: Review Batch Processing
  • Deploying: Consult Configuration Management

Related documentation:

  • Architecture - The “what” - system structure
  • Principles - The “why” - design philosophy
  • Workers - Language-specific handler development

API Security Guide

API-level security for orchestration (8080) and worker (8081) endpoints using JWT bearer tokens and API key authentication with permission-based access control.

Security is disabled by default for backward compatibility. Enable it explicitly in configuration.

See also: Auth Documentation Hub for architecture overview, Permissions for route mapping, Configuration for full reference, Testing for E2E test patterns.


Quick Start

1. Generate Keys

cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys

2. Generate a Token

cargo run --bin tasker-ctl -- auth generate-token \
  --private-key ./keys/jwt-private-key.pem \
  --permissions "tasks:create,tasks:read,tasks:list,steps:read" \
  --subject my-service \
  --expiry-hours 24

3. Enable Auth in Configuration

In config/tasker/base/orchestration.toml:

[auth]
enabled = true
jwt_public_key_path = "./keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"

4. Use the Token

export TASKER_AUTH_TOKEN=<generated-token>
cargo run --bin tasker-ctl -- task list

Or with curl:

curl -H "Authorization: Bearer $TASKER_AUTH_TOKEN" http://localhost:8080/v1/tasks

Permission Vocabulary

Permission | Resource | Description
tasks:create | tasks | Create new tasks
tasks:read | tasks | Read task details
tasks:list | tasks | List tasks
tasks:cancel | tasks | Cancel running tasks
tasks:context_read | tasks | Read task context data
steps:read | steps | Read workflow step details
steps:resolve | steps | Manually resolve steps
dlq:read | dlq | Read DLQ entries
dlq:update | dlq | Update DLQ investigations
dlq:stats | dlq | View DLQ statistics
templates:read | templates | Read task templates
templates:validate | templates | Validate templates
system:config_read | system | Read system configuration
system:handlers_read | system | Read handler registry
system:analytics_read | system | Read analytics data
worker:config_read | worker | Read worker configuration
worker:templates_read | worker | Read worker templates

Wildcards

  • tasks:* - All task permissions
  • steps:* - All step permissions
  • dlq:* - All DLQ permissions
  • * - All permissions (superuser); see the matching sketch below
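
For intuition, the wildcard check reduces to a simple string comparison. The sketch below is hypothetical and not Tasker's actual implementation:

#![allow(unused)]
fn main() {
// Hypothetical permission check illustrating wildcard semantics
fn grants(granted: &str, required: &str) -> bool {
    if granted == "*" {
        return true; // superuser
    }
    if let Some(resource) = granted.strip_suffix(":*") {
        // e.g. "tasks:*" grants "tasks:create", "tasks:read", ...
        return required.split(':').next() == Some(resource);
    }
    granted == required
}

fn has_permission(token_permissions: &[String], required: &str) -> bool {
    token_permissions.iter().any(|p| grants(p, required))
}

// has_permission(&["tasks:*".to_string()], "tasks:create") -> true
// has_permission(&["tasks:read".to_string()], "tasks:create") -> false
}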

Show All Permissions

cargo run --bin tasker-ctl -- auth show-permissions

Configuration Reference

Server-Side (orchestration.toml / worker.toml)

[auth]
enabled = true

# JWT Configuration
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
jwt_token_expiry_hours = 24

# Key Configuration (one of these):
jwt_public_key_path = "./keys/jwt-public-key.pem"   # File path (preferred)
jwt_public_key = "-----BEGIN RSA PUBLIC KEY-----..." # Inline PEM
# Or set env: TASKER_JWT_PUBLIC_KEY_PATH

# JWKS (for dynamic key rotation)
jwt_verification_method = "jwks"  # "public_key" (default) or "jwks"
jwks_url = "https://auth.example.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600

# Permission validation
permissions_claim = "permissions"   # JWT claim containing permissions
strict_validation = true            # Reject tokens with unknown permissions
log_unknown_permissions = true

# API Key Authentication
api_key_header = "X-API-Key"
api_keys_enabled = true

[[auth.api_keys]]
key = "sk-prod-key-1"
permissions = ["tasks:read", "tasks:list", "steps:read"]
description = "Read-only monitoring service"

[[auth.api_keys]]
key = "sk-admin-key"
permissions = ["*"]
description = "Admin key"

Client-Side (Environment Variables)

Variable | Description
TASKER_AUTH_TOKEN | Bearer token for both APIs
TASKER_ORCHESTRATION_AUTH_TOKEN | Override token for orchestration only
TASKER_WORKER_AUTH_TOKEN | Override token for worker only
TASKER_API_KEY | API key (fallback if no token)
TASKER_API_KEY_HEADER | Custom header name (default: X-API-Key)

Priority: endpoint-specific token > global token > API key > config file.
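
As an illustration of that precedence, a client might resolve its credential roughly as follows. This is a sketch only; the real clients also fall back to the config file, which is elided here, and the Credential enum is invented for the example:

#![allow(unused)]
fn main() {
use std::env;

// Hypothetical credential type for the example
enum Credential {
    Bearer(String),
    ApiKey(String),
}

// Endpoint-specific token > global token > API key (> config file, not shown)
fn resolve_orchestration_credential() -> Option<Credential> {
    env::var("TASKER_ORCHESTRATION_AUTH_TOKEN")
        .or_else(|_| env::var("TASKER_AUTH_TOKEN"))
        .map(Credential::Bearer)
        .or_else(|_| env::var("TASKER_API_KEY").map(Credential::ApiKey))
        .ok()
}
}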


JWT Token Structure

{
  "sub": "my-service",
  "iss": "tasker-core",
  "aud": "tasker-api",
  "iat": 1706000000,
  "exp": 1706086400,
  "permissions": [
    "tasks:create",
    "tasks:read",
    "tasks:list",
    "steps:read"
  ],
  "worker_namespaces": []
}

Common Role Patterns

Read-only operator:

permissions: ["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]

Task submitter:

permissions: ["tasks:create", "tasks:read", "tasks:list"]

Ops admin:

permissions: ["tasks:*", "steps:*", "dlq:*", "system:*"]

Worker service:

permissions: ["worker:config_read", "worker:templates_read"]

Superuser:

permissions: ["*"]

Public Endpoints

These endpoints never require authentication:

  • GET /health - Basic health check
  • GET /health/detailed - Detailed health
  • GET /metrics - Prometheus metrics

API Key Authentication

API keys are validated against a configured registry. Each key has its own set of permissions.

# Using API key
curl -H "X-API-Key: sk-prod-key-1" http://localhost:8080/v1/tasks

API keys are simpler than JWTs but have limitations:

  • No expiration (rotate by removing from config)
  • No claims beyond permissions
  • Best for service-to-service communication with static permissions

Error Responses

401 Unauthorized (Missing/Invalid Credentials)

{
  "error": "unauthorized",
  "message": "Missing authentication credentials"
}

403 Forbidden (Insufficient Permissions)

{
  "error": "forbidden",
  "message": "Missing required permission: tasks:create"
}

Migration Guide: Disabled to Enabled

  1. Generate keys and distribute the public key to server config
  2. Generate tokens for each service/user with appropriate permissions
  3. Set enabled = true in auth config
  4. Deploy - services without valid tokens will get 401 responses
  5. Monitor the tasker.auth.failures.total metric for issues

All endpoints remain accessible without auth when enabled = false.


Observability

Structured Logs

  • info on successful authentication (subject, method)
  • warn on authentication failure (error details)
  • warn on permission denial (subject, required permission)

Prometheus Metrics

Metric | Type | Labels
tasker.auth.requests.total | Counter | method, result
tasker.auth.failures.total | Counter | reason
tasker.permission.denials.total | Counter | permission
tasker.auth.jwt.verification.duration | Histogram | result

CLI Auth Commands

# Generate RSA key pair
tasker-ctl auth generate-keys [--output-dir ./keys] [--key-size 2048]

# Generate JWT token
tasker-ctl auth generate-token \
  --permissions tasks:create,tasks:read \
  --subject my-service \
  --private-key ./keys/jwt-private-key.pem \
  --expiry-hours 24

# List all permissions
tasker-ctl auth show-permissions

# Validate a token
tasker-ctl auth validate-token \
  --token <JWT> \
  --public-key ./keys/jwt-public-key.pem

gRPC Authentication

gRPC endpoints support the same authentication methods as REST, using gRPC metadata instead of HTTP headers.

gRPC Ports

Service | REST Port | gRPC Port
Orchestration | 8080 | 9190
Rust Worker | 8081 | 9191

Bearer Token (gRPC)

# Using grpcurl with Bearer token
grpcurl -plaintext \
  -H "Authorization: Bearer $TASKER_AUTH_TOKEN" \
  localhost:9190 tasker.v1.TaskService/ListTasks

API Key (gRPC)

# Using grpcurl with API key
grpcurl -plaintext \
  -H "X-API-Key: sk-prod-key-1" \
  localhost:9190 tasker.v1.TaskService/ListTasks

gRPC Client Configuration

#![allow(unused)]
fn main() {
use tasker_client::grpc_clients::{OrchestrationGrpcClient, GrpcAuthConfig};

// With API key
let client = OrchestrationGrpcClient::connect_with_auth(
    "http://localhost:9190",
    GrpcAuthConfig::ApiKey("sk-prod-key-1".to_string()),
).await?;

// With Bearer token
let client = OrchestrationGrpcClient::connect_with_auth(
    "http://localhost:9190",
    GrpcAuthConfig::Bearer("eyJ...".to_string()),
).await?;
}

gRPC Error Codes

gRPC Status | HTTP Equivalent | Meaning
UNAUTHENTICATED | 401 | Missing or invalid credentials
PERMISSION_DENIED | 403 | Valid credentials but insufficient permissions
NOT_FOUND | 404 | Resource not found
UNAVAILABLE | 503 | Service unavailable

Public gRPC Endpoints

These endpoints never require authentication:

  • HealthService/CheckHealth - Basic health check
  • HealthService/CheckLiveness - Kubernetes liveness probe
  • HealthService/CheckReadiness - Kubernetes readiness probe
  • HealthService/CheckDetailedHealth - Detailed health metrics

Security Considerations

  • Key storage: Private keys should never be committed to git. Use file paths or environment variables.
  • Token expiry: Set appropriate expiry times. Short-lived tokens (1-24h) are preferred.
  • Least privilege: Grant only the permissions each service needs.
  • Key rotation: Use JWKS for automatic key rotation in production.
  • API key rotation: Remove old keys from config and redeploy.
  • Audit: Monitor tasker.auth.failures.total and tasker.permission.denials.total for anomalies.

External Auth Provider Integration

Integrating Tasker’s API security with external identity providers via JWKS endpoints.

See also: Auth Documentation Hub for architecture overview, Configuration for full TOML reference.


JWKS Integration

Tasker supports JWKS (JSON Web Key Set) for dynamic public key discovery. This enables key rotation without redeploying Tasker.

Configuration

[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://your-provider.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://your-provider.com/"
jwt_audience = "tasker-api"
permissions_claim = "permissions"  # or custom claim name

How It Works

  1. On first request, Tasker fetches the JWKS from the configured URL
  2. Keys are cached for the configured refresh interval
  3. When a token has an unknown kid (Key ID), a refresh is triggered
  4. RSA keys are parsed from the JWK n and e components

Auth0

Auth0 Configuration

  1. Create an API in Auth0 Dashboard:

    • Name: Tasker API
    • Identifier: tasker-api (this becomes the audience)
    • Signing Algorithm: RS256
  2. Create permissions in the API settings matching Tasker’s vocabulary:

    • tasks:create, tasks:read, tasks:list, etc.
  3. Assign permissions to users/applications via Auth0 roles

Tasker Configuration for Auth0

[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://YOUR_DOMAIN.auth0.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://YOUR_DOMAIN.auth0.com/"
jwt_audience = "tasker-api"
permissions_claim = "permissions"

Token Request

curl --request POST \
  --url https://YOUR_DOMAIN.auth0.com/oauth/token \
  --header 'content-type: application/json' \
  --data '{
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "audience": "tasker-api",
    "grant_type": "client_credentials"
  }'

Keycloak

Keycloak Configuration

  1. Create a realm and client for Tasker
  2. Define client roles matching Tasker permissions
  3. Configure the client to include roles in the permissions token claim via a protocol mapper

Tasker Configuration for Keycloak

[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://keycloak.example.com/realms/YOUR_REALM/protocol/openid-connect/certs"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://keycloak.example.com/realms/YOUR_REALM"
jwt_audience = "tasker-api"
permissions_claim = "permissions"  # Configure via protocol mapper

Okta

Okta Configuration

  1. Create an API authorization server
  2. Add custom claims for permissions
  3. Define scopes matching Tasker permissions

Tasker Configuration for Okta

[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://YOUR_DOMAIN.okta.com/oauth2/YOUR_AUTH_SERVER_ID/v1/keys"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://YOUR_DOMAIN.okta.com/oauth2/YOUR_AUTH_SERVER_ID"
jwt_audience = "tasker-api"
permissions_claim = "scp"  # Okta uses "scp" for scopes by default

Custom JWKS Endpoint

Any provider that serves a standard JWKS endpoint works. The endpoint must return:

{
  "keys": [
    {
      "kty": "RSA",
      "kid": "key-id-1",
      "use": "sig",
      "alg": "RS256",
      "n": "<base64url-encoded modulus>",
      "e": "<base64url-encoded exponent>"
    }
  ]
}
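
On the Tasker side, verification only needs the keys cached by kid and refreshed when the cache is stale or a token arrives with an unknown kid. The following is a hedged sketch of that refresh decision, not Tasker's implementation; the HTTP fetch and RSA parsing are elided:

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Placeholder for an RSA public key parsed from a JWK's `n` and `e` fields
struct CachedKey;

struct JwksCache {
    keys: HashMap<String, CachedKey>, // keyed by `kid`
    last_refresh: Instant,
    refresh_interval: Duration, // e.g. jwks_refresh_interval_seconds = 3600
}

impl JwksCache {
    fn key_for(&mut self, kid: &str) -> Option<&CachedKey> {
        let stale = self.last_refresh.elapsed() >= self.refresh_interval;
        // Refresh when the cache is stale OR the token carries an unknown kid
        if stale || !self.keys.contains_key(kid) {
            self.refresh();
        }
        self.keys.get(kid)
    }

    fn refresh(&mut self) {
        // Fetch the JWKS document from jwks_url, parse each key's `n`/`e` into
        // an RSA public key, and replace the cached map (elided)
        self.last_refresh = Instant::now();
    }
}
}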

Static Public Key (Development)

For development or simple deployments without a JWKS endpoint:

[auth]
enabled = true
jwt_verification_method = "public_key"
jwt_public_key_path = "/etc/tasker/keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"

Generate keys with:

tasker-ctl auth generate-keys --output-dir /etc/tasker/keys

Permission Claim Mapping

If your identity provider uses a different claim name for permissions:

permissions_claim = "custom_permissions"  # Default: "permissions"

The claim must be a JSON array of strings:

{
  "custom_permissions": ["tasks:create", "tasks:read"]
}

Strict Validation

When strict_validation = true (default), tokens containing unknown permission strings are rejected. Set to false if your provider includes additional scopes/permissions not in Tasker’s vocabulary:

strict_validation = false
log_unknown_permissions = true  # Still log unknown permissions for monitoring

Batch Processing in Tasker

Last Updated: 2026-01-06 Status: Production Ready Related: Conditional Workflows, DLQ System



Overview

Batch processing in Tasker enables parallel processing of large datasets by dynamically creating worker steps at runtime. A single “batchable” step analyzes a workload and instructs orchestration to create N worker instances, each processing a subset of data using cursor-based boundaries.

Key Characteristics

Dynamic Worker Creation: Workers are created at runtime based on dataset analysis, using step templates predefined in the task template for structure but scaled according to need.

Cursor-Based Resumability: Each worker processes a specific range (cursor) and can resume from checkpoints on failure.

Deferred Convergence: Aggregation steps use intersection semantics to wait for all created workers, regardless of count.

Standard Lifecycle: Workers use existing retry, timeout, and DLQ mechanics - no special recovery system needed.

Example Flow

Task: Process 1000-row CSV file

1. analyze_csv (batchable step)
   → Counts rows: 1000
   → Calculates workers: 5 (200 rows each)
   → Returns BatchProcessingOutcome::CreateBatches

2. Orchestration creates workers dynamically:
   ├─ process_csv_batch_001 (rows 1-200)
   ├─ process_csv_batch_002 (rows 201-400)
   ├─ process_csv_batch_003 (rows 401-600)
   ├─ process_csv_batch_004 (rows 601-800)
   └─ process_csv_batch_005 (rows 801-1000)

3. Workers process in parallel

4. aggregate_csv_results (deferred convergence)
   → Waits for all 5 workers (intersection semantics)
   → Aggregates results from completed workers
   → Returns combined metrics

Architecture Foundations

Batch processing builds on and extends three foundational Tasker patterns:

1. DAG (Directed Acyclic Graph) Workflow Orchestration

What Batch Processing Inherits:

  • Worker steps are full DAG nodes with standard state machines
  • Parent-child dependencies enforced via tasker_workflow_step_edges
  • Cycle detection prevents circular dependencies
  • Topological ordering ensures correct execution sequence

What Batch Processing Extends:

  • Dynamic node creation: Template steps instantiated N times at runtime
  • Edge generation: Batchable step → worker instances → convergence step
  • Transactional atomicity: All workers created in single database transaction

Code Reference: tasker-orchestration/src/orchestration/lifecycle/batch_processing/service.rs:357-400

// Transaction ensures all-or-nothing worker creation
let mut tx = pool.begin().await?;

for (i, cursor_config) in cursor_configs.iter().enumerate() {
    // Create worker instance from template
    let worker_step = WorkflowStepCreator::create_from_template(
        &mut tx,
        task_uuid,
        &worker_template,
        &format!("{}_{:03}", worker_template_name, i + 1),
        Some(batch_worker_inputs.clone()),
    ).await?;

    // Create edge: batchable → worker
    WorkflowStepEdge::create_with_transaction(
        &mut tx,
        batchable_step.workflow_step_uuid,
        worker_step.workflow_step_uuid,
        "batch_dependency",
    ).await?;
}

tx.commit().await?; // Atomic - all workers or none

2. Retryability and Lifecycle Management

What Batch Processing Inherits:

  • Standard lifecycle.max_retries configuration per template
  • Exponential backoff via lifecycle.backoff_multiplier
  • Staleness detection using lifecycle.max_steps_in_process_minutes
  • Standard state transitions (Pending → Enqueued → InProgress → Complete/Error)

What Batch Processing Extends:

  • Checkpoint-based resumability: Workers checkpoint progress and resume from last cursor position
  • Cursor preservation during retry: workflow_steps.results field preserved by ResetForRetry action
  • Additional staleness detection: Checkpoint timestamp tracking alongside duration-based detection

Key Simplification:

  • No BatchRecoveryService - Uses standard retry + DLQ
  • No duplicate timeout settings - Uses lifecycle config only
  • Cursor data preserved during ResetForRetry

Configuration Example: tests/fixtures/task_templates/ruby/batch_processing_products_csv.yaml:749-752

- name: process_csv_batch
  type: batch_worker
  lifecycle:
    max_steps_in_process_minutes: 120  # DLQ timeout
    max_retries: 3                     # Standard retry limit
    backoff_multiplier: 2.0            # Exponential backoff

3. Deferred Convergence

What Batch Processing Inherits:

  • Intersection semantics: Wait for declared dependencies ∩ actually created steps
  • Template-level dependencies: Convergence step depends on worker template, not instances
  • Runtime resolution: System computes effective dependencies when workers are created

What Batch Processing Extends:

  • Batch aggregation pattern: Convergence steps aggregate results from N workers
  • NoBatches scenario handling: Placeholder worker created when dataset too small
  • Scenario detection helpers: BatchAggregationScenario::detect() for both cases

Flow Comparison:

Conditional Workflows (Decision Points):

decision_step → creates → option_a, option_b (conditional)
                            ↓
convergence_step (depends on option_a AND option_b templates)
                 → waits for whichever were created (intersection)

Batch Processing (Batchable Steps):

batchable_step → creates → worker_001, worker_002, ..., worker_N
                            ↓
convergence_step (depends on worker template)
                 → waits for ALL workers created (intersection)

Code Reference: tasker-orchestration/src/orchestration/lifecycle/batch_processing/service.rs:600-650

// Determine and create convergence step with intersection semantics
pub async fn determine_and_create_convergence_step(
    &self,
    tx: &mut PgTransaction,
    task_uuid: Uuid,
    convergence_template: &StepDefinition,
    created_workers: &[WorkflowStep],
) -> Result<Option<WorkflowStep>> {
    // Create convergence step as deferred type
    let convergence_step = WorkflowStepCreator::create_from_template(
        tx,
        task_uuid,
        convergence_template,
        &convergence_template.name,
        None,
    ).await?;

    // Create edges from ALL worker instances to convergence step
    for worker in created_workers {
        WorkflowStepEdge::create_with_transaction(
            tx,
            worker.workflow_step_uuid,
            convergence_step.workflow_step_uuid,
            "batch_convergence_dependency",
        ).await?;
    }

    Ok(Some(convergence_step))
}

Core Concepts

Batchable Steps

Purpose: Analyze a workload and decide whether to create batch workers.

Responsibilities:

  1. Examine dataset (size, complexity, business logic)
  2. Calculate optimal worker count based on batch size
  3. Generate cursor configurations defining batch boundaries
  4. Return BatchProcessingOutcome instructing orchestration

Returns: BatchProcessingOutcome enum with two variants:

  • NoBatches: Dataset too small or empty - create placeholder worker
  • CreateBatches: Create N workers with cursor configurations

Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:60-120

// Batchable handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    let csv_file_path = step_data.task.context.get("csv_file_path").unwrap();

    // Count rows in CSV (excluding header)
    let total_rows = count_csv_rows(csv_file_path)?;

    // Get batch configuration from handler initialization
    let batch_size = step_data.handler_initialization
        .get("batch_size").and_then(|v| v.as_u64()).unwrap_or(200);

    if total_rows == 0 {
        // No batches needed
        let outcome = BatchProcessingOutcome::no_batches();
        return Ok(success_result(
            step_uuid,
            json!({ "batch_processing_outcome": outcome.to_value() }),
            elapsed_ms,
            None,
        ));
    }

    // Calculate workers
    let worker_count = (total_rows as f64 / batch_size as f64).ceil() as u32;

    // Generate cursor configs
    let cursor_configs = create_cursor_configs(total_rows, worker_count);

    // Return CreateBatches outcome
    let outcome = BatchProcessingOutcome::create_batches(
        "process_csv_batch".to_string(),
        worker_count,
        cursor_configs,
        total_rows,
    );

    Ok(success_result(
        step_uuid,
        json!({
            "batch_processing_outcome": outcome.to_value(),
            "worker_count": worker_count,
            "total_rows": total_rows
        }),
        elapsed_ms,
        None,
    ))
}

Batch Workers

Purpose: Process a specific subset of data defined by cursor configuration.

Responsibilities:

  1. Extract cursor config from workflow_step.inputs
  2. Check for is_no_op flag (NoBatches placeholder scenario)
  3. Process items within cursor range (start_cursor to end_cursor)
  4. Checkpoint progress periodically for resumability
  5. Return processed results for aggregation

Cursor Configuration: Each worker receives BatchWorkerInputs in workflow_step.inputs:

{
  "cursor": {
    "batch_id": "001",
    "start_cursor": 1,
    "end_cursor": 200,
    "batch_size": 200
  },
  "batch_metadata": {
    "checkpoint_interval": 100,
    "cursor_field": "row_number",
    "failure_strategy": "fail_fast"
  },
  "is_no_op": false
}

Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:200-280

// Batch worker handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // Extract context using helper
    let context = BatchWorkerContext::from_step_data(step_data)?;

    // Check for no-op placeholder worker
    if context.is_no_op() {
        return Ok(success_result(
            step_uuid,
            json!({
                "no_op": true,
                "reason": "NoBatches scenario - no items to process"
            }),
            elapsed_ms,
            None,
        ));
    }

    // Get cursor range
    let start_row = context.start_position();
    let end_row = context.end_position();

    // Get CSV file path from dependency results
    let csv_file_path = step_data
        .dependency_results
        .get("analyze_csv")
        .and_then(|r| r.result.get("csv_file_path"))
        .unwrap();

    // Process CSV rows in cursor range
    let mut processed_count = 0;
    let mut metrics = initialize_metrics();

    let file = File::open(csv_file_path)?;
    let mut csv_reader = csv::ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(file));

    for (row_idx, result) in csv_reader.deserialize::<Product>().enumerate() {
        let data_row_num = row_idx + 1; // 1-indexed after header

        if data_row_num < start_row {
            continue; // Skip rows before our range
        }
        if data_row_num >= end_row {
            break; // Processed all our rows
        }

        let product: Product = result?;

        // Update metrics
        metrics.total_inventory_value += product.price * (product.stock as f64);
        *metrics.category_counts.entry(product.category.clone())
            .or_insert(0) += 1;

        processed_count += 1;

        // Checkpoint progress periodically
        if processed_count % context.checkpoint_interval() == 0 {
            checkpoint_progress(step_uuid, data_row_num).await?;
        }
    }

    // Return results for aggregation
    Ok(success_result(
        step_uuid,
        json!({
            "processed_count": processed_count,
            "total_inventory_value": metrics.total_inventory_value,
            "category_counts": metrics.category_counts,
            "batch_id": context.batch_id(),
            "start_row": start_row,
            "end_row": end_row
        }),
        elapsed_ms,
        None,
    ))
}

Convergence Steps

Purpose: Aggregate results from all batch workers using deferred intersection semantics.

Responsibilities:

  1. Detect scenario using BatchAggregationScenario::detect()
  2. Handle both NoBatches and WithBatches scenarios
  3. Aggregate metrics from all worker results
  4. Return combined results for task completion

Scenario Detection:

pub enum BatchAggregationScenario {
    /// No batches created - placeholder worker used
    NoBatches {
        batchable_result: StepDependencyResult,
    },

    /// Batches created - multiple workers processed data
    WithBatches {
        batch_results: Vec<(String, StepDependencyResult)>,
        worker_count: u32,
    },
}

Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:400-480

// Convergence handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // Detect scenario using helper
    let scenario = BatchAggregationScenario::detect(
        &step_data.dependency_results,
        "analyze_csv",        // batchable step name
        "process_csv_batch_", // batch worker prefix
    )?;

    match scenario {
        BatchAggregationScenario::NoBatches { batchable_result } => {
            // No workers created - get dataset size from batchable step
            let total_rows = batchable_result
                .result.get("total_rows")
                .and_then(|v| v.as_u64())
                .unwrap_or(0);

            // Return zero metrics
            Ok(success_result(
                step_uuid,
                json!({
                    "total_processed": total_rows,
                    "total_inventory_value": 0.0,
                    "category_counts": {},
                    "worker_count": 0
                }),
                elapsed_ms,
                None,
            ))
        }

        BatchAggregationScenario::WithBatches { batch_results, worker_count } => {
            // Aggregate results from all workers
            let mut total_processed = 0u64;
            let mut total_inventory_value = 0.0;
            let mut global_category_counts = HashMap::new();
            let mut max_price = 0.0;
            let mut max_price_product = None;

            for (step_name, result) in batch_results {
                // Sum processed counts
                total_processed += result.result
                    .get("processed_count")
                    .and_then(|v| v.as_u64())
                    .unwrap_or(0);

                // Sum inventory values
                total_inventory_value += result.result
                    .get("total_inventory_value")
                    .and_then(|v| v.as_f64())
                    .unwrap_or(0.0);

                // Merge category counts
                if let Some(categories) = result.result
                    .get("category_counts")
                    .and_then(|v| v.as_object()) {
                    for (category, count) in categories {
                        *global_category_counts.entry(category.clone()).or_insert(0)
                            += count.as_u64().unwrap_or(0);
                    }
                }

                // Find global max price
                let batch_max_price = result.result
                    .get("max_price")
                    .and_then(|v| v.as_f64())
                    .unwrap_or(0.0);
                if batch_max_price > max_price {
                    max_price = batch_max_price;
                    max_price_product = result.result
                        .get("max_price_product")
                        .and_then(|v| v.as_str())
                        .map(String::from);
                }
            }

            // Return aggregated metrics
            Ok(success_result(
                step_uuid,
                json!({
                    "total_processed": total_processed,
                    "total_inventory_value": total_inventory_value,
                    "category_counts": global_category_counts,
                    "max_price": max_price,
                    "max_price_product": max_price_product,
                    "worker_count": worker_count
                }),
                elapsed_ms,
                None,
            ))
        }
    }
}

Checkpoint Yielding

Checkpoint yielding enables handler-driven progress persistence during long-running batch operations. Handlers explicitly checkpoint their progress, persist state to the database, and yield control back to the orchestrator for re-dispatch.

Key Characteristics

Handler-Driven: Handlers decide when to checkpoint based on business logic, not configuration timers. This gives handlers full control over checkpoint frequency and timing.

Checkpoint-Persist-Then-Redispatch: Progress is atomically saved to the database before the step is re-dispatched. This ensures no progress is ever lost, even during infrastructure failures.

Step Remains In-Progress: During checkpoint yield cycles, the step stays in InProgress state. It is not released or re-enqueued through normal channels—the re-dispatch happens internally.

State Machine Integrity: Only Success or Failure results trigger state transitions. Checkpoint yields are internal handler mechanics that don’t affect the step state machine.

When to Use Checkpoint Yielding

Use checkpoint yielding when:

  • Processing takes longer than your visibility timeout (prevents DLQ escalation)
  • You want resumable processing after transient failures
  • You need to periodically release resources (memory, connections)
  • Long-running operations need progress visibility

Don’t use checkpoint yielding when:

  • Batch processing completes quickly (<30 seconds)
  • The overhead of checkpointing exceeds the benefit
  • Operations are inherently non-resumable

API Reference

All languages provide a checkpoint_yield() method (or checkpointYield() in TypeScript) on the Batchable mixin:

Ruby

class MyBatchWorkerHandler
  include Tasker::StepHandler::Batchable

  def call(step_data)
    context = BatchWorkerContext.from_step_data(step_data)

    # Resume from checkpoint if present
    start_item = context.has_checkpoint? ? context.checkpoint_cursor : 0
    accumulated = context.accumulated_results || []

    items = fetch_items_to_process(start_item)

    items.each_with_index do |item, idx|
      result = process_item(item)
      accumulated << result

      # Checkpoint every 1000 items
      if (idx + 1) % 1000 == 0
        checkpoint_yield(
          cursor: start_item + idx + 1,
          items_processed: accumulated.size,
          accumulated_results: { processed: accumulated }
        )
        # Handler execution stops here and resumes on re-dispatch
      end
    end

    # Return final success result
    success_result(results: { all_processed: accumulated })
  end
end

BatchWorkerContext Accessors (Ruby):

  • checkpoint_cursor - Current cursor position (or nil if no checkpoint)
  • accumulated_results - Previously accumulated results (or nil)
  • has_checkpoint? - Returns true if checkpoint data exists
  • checkpoint_items_processed - Number of items processed at checkpoint

Python

class MyBatchWorkerHandler(BatchableHandler):
    def call(self, step_data: TaskSequenceStep) -> StepExecutionResult:
        context = BatchWorkerContext.from_step_data(step_data)

        # Resume from checkpoint if present
        start_item = context.checkpoint_cursor if context.has_checkpoint() else 0
        accumulated = context.accumulated_results or []

        items = self.fetch_items_to_process(start_item)

        for idx, item in enumerate(items):
            result = self.process_item(item)
            accumulated.append(result)

            # Checkpoint every 1000 items
            if (idx + 1) % 1000 == 0:
                self.checkpoint_yield(
                    cursor=start_item + idx + 1,
                    items_processed=len(accumulated),
                    accumulated_results={"processed": accumulated}
                )
                # Handler execution stops here and resumes on re-dispatch

        # Return final success result
        return self.success_result(results={"all_processed": accumulated})

BatchWorkerContext Accessors (Python):

  • checkpoint_cursor: int | str | dict | None - Current cursor position
  • accumulated_results: dict | None - Previously accumulated results
  • has_checkpoint() -> bool - Returns true if checkpoint data exists
  • checkpoint_items_processed: int - Number of items processed at checkpoint

TypeScript

class MyBatchWorkerHandler extends BatchableHandler {
  async call(stepData: TaskSequenceStep): Promise<StepExecutionResult> {
    const context = BatchWorkerContext.fromStepData(stepData);

    // Resume from checkpoint if present
    const startItem = context.hasCheckpoint() ? context.checkpointCursor : 0;
    const accumulated = context.accumulatedResults ?? [];

    const items = await this.fetchItemsToProcess(startItem);

    for (let idx = 0; idx < items.length; idx++) {
      const result = await this.processItem(items[idx]);
      accumulated.push(result);

      // Checkpoint every 1000 items
      if ((idx + 1) % 1000 === 0) {
        await this.checkpointYield({
          cursor: startItem + idx + 1,
          itemsProcessed: accumulated.length,
          accumulatedResults: { processed: accumulated }
        });
        // Handler execution stops here and resumes on re-dispatch
      }
    }

    // Return final success result
    return this.successResult({ results: { allProcessed: accumulated } });
  }
}

BatchWorkerContext Properties (TypeScript):

  • checkpointCursor: number | string | Record<string, unknown> | undefined
  • accumulatedResults: Record<string, unknown> | undefined
  • hasCheckpoint(): boolean
  • checkpointItemsProcessed: number

Checkpoint Data Structure

Checkpoints are persisted in the checkpoint JSONB column on workflow_steps:

{
  "cursor": 1000,
  "items_processed": 1000,
  "timestamp": "2026-01-06T12:00:00Z",
  "accumulated_results": {
    "processed": ["item1", "item2", "..."]
  },
  "history": [
    {"cursor": 500, "timestamp": "2026-01-06T11:59:30Z"},
    {"cursor": 1000, "timestamp": "2026-01-06T12:00:00Z"}
  ]
}

Fields:

  • cursor - Flexible JSON value representing position (integer, string, or object)
  • items_processed - Total items processed at this checkpoint
  • timestamp - ISO 8601 timestamp when checkpoint was created
  • accumulated_results - Optional intermediate results to preserve
  • history - Array of previous checkpoint positions (appended automatically)
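
For Rust handlers that need to read this payload directly, a minimal deserialization sketch might look like the following (hypothetical types; the field names mirror the JSON above, not Tasker's internal structs):

use serde::{Deserialize, Serialize};
use serde_json::Value;

/// One prior checkpoint position (mirrors entries in the `history` array above).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CheckpointHistoryEntry {
    pub cursor: Value,
    pub timestamp: String, // ISO 8601
}

/// The full checkpoint payload stored in the `checkpoint` JSONB column.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CheckpointPayload {
    pub cursor: Value,                      // integer, string, or object
    pub items_processed: u64,
    pub timestamp: String,                  // ISO 8601
    #[serde(default)]
    pub accumulated_results: Option<Value>, // keep small; see Best Practices below
    #[serde(default)]
    pub history: Vec<CheckpointHistoryEntry>,
}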

Checkpoint Flow

┌──────────────────────────────────────────────────────────────────┐
│  Handler calls checkpoint_yield(cursor, items_processed, ...)   │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  FFI Bridge: checkpoint_yield_step_event()                       │
│  Converts language-specific types to CheckpointYieldData         │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  CheckpointService::save_checkpoint()                            │
│  - Atomic SQL update with history append                         │
│  - Uses JSONB jsonb_set for history array                        │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Worker re-dispatches step via internal MPSC channel             │
│  - Step stays InProgress (not released)                          │
│  - Re-queued for immediate processing                            │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Handler resumes with checkpoint data in workflow_step           │
│  - BatchWorkerContext provides checkpoint accessors              │
│  - Handler continues from saved cursor position                  │
└──────────────────────────────────────────────────────────────────┘
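
The atomic update performed by CheckpointService::save_checkpoint() is equivalent in spirit to a single JSONB update with a history append; the following is a hypothetical SQL sketch of that shape (not the engine's actual statement), assuming the checkpoint column described above:

UPDATE tasker.workflow_steps
SET checkpoint = jsonb_set(
        COALESCE(checkpoint, '{}'::jsonb)
          || jsonb_build_object(
               'cursor', '1000'::jsonb,
               'items_processed', 1000,
               'timestamp', to_jsonb(now())
             ),
        '{history}',
        COALESCE(checkpoint->'history', '[]'::jsonb)
          || jsonb_build_object('cursor', '1000'::jsonb, 'timestamp', to_jsonb(now()))
    )
WHERE workflow_step_uuid = 'step-uuid-here';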

Failure and Recovery

Transient Failure After Checkpoint:

  1. Handler checkpoints at position 500
  2. Handler fails at position 750 (transient error)
  3. Step is retried (standard retry semantics)
  4. Handler resumes from checkpoint (position 500)
  5. Items 500-750 are reprocessed (idempotency required)
  6. Processing continues to completion

Permanent Failure:

  1. Handler checkpoints at position 500
  2. Handler encounters non-retryable error
  3. Step transitions to Error state
  4. Checkpoint data preserved for operator inspection
  5. Manual intervention may use checkpoint to resume later

Best Practices

Checkpoint Frequency:

  • Too frequent: Overhead dominates (database writes, re-dispatch latency)
  • Too infrequent: Lost progress on failure, long recovery time
  • Rule of thumb: Checkpoint every 1-5 minutes of work, or every 1000-10000 items

Accumulated Results:

  • Keep accumulated results small (summaries, counts, IDs)
  • For large result sets, write to external storage and store reference
  • Unbounded accumulated results can cause performance degradation

Cursor Design:

  • Use monotonic cursors (integers, timestamps) when possible
  • Complex cursors (objects) are supported but harder to debug
  • Cursor must uniquely identify resume position

Idempotency:

  • Items between last checkpoint and failure will be reprocessed
  • Ensure item processing is idempotent or use deduplication
  • Consider storing processed item IDs in accumulated_results (see the sketch below)
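
A minimal sketch of that deduplication idea in Rust (hypothetical helper; already_done would be rebuilt from the IDs carried in accumulated_results at resume time):

use std::collections::HashSet;

/// Process only items whose IDs are not already recorded in accumulated_results.
/// Items in the window between the last checkpoint and a failure are skipped on
/// replay instead of being applied twice.
fn process_pending(items: &[(u64, String)], already_done: &HashSet<u64>) -> Vec<u64> {
    let mut newly_processed = Vec::new();
    for (id, payload) in items {
        if already_done.contains(id) {
            continue; // already handled before the failure; skip on replay
        }
        apply_side_effect(payload); // the real work; must itself be safe per id
        newly_processed.push(*id);
    }
    newly_processed
}

fn apply_side_effect(_payload: &str) {
    // placeholder for the actual (idempotent) item processing
}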

Monitoring

Checkpoint Events (logged automatically):

INFO checkpoint_yield_step_event step_uuid=abc cursor=1000 items_processed=1000
INFO checkpoint_saved step_uuid=abc history_length=2

Metrics to Monitor:

  • Checkpoint frequency per step
  • Average items processed between checkpoints
  • Checkpoint history size (detect unbounded growth)
  • Re-dispatch latency after checkpoint

Known Limitations

History Array Growth: The history array grows with each checkpoint. For very long-running processes with frequent checkpoints, this can lead to large JSONB values. Consider:

  • Setting a maximum history length (future enhancement)
  • Clearing history on step completion
  • Using external storage for detailed history

Accumulated Results Size: No built-in size limit on accumulated_results. Handlers must self-regulate to prevent database bloat. Consider:

  • Storing summaries instead of raw data
  • Using external storage for large intermediate results
  • Implementing size checks before checkpoint

Workflow Pattern

Template Definition

Batch processing workflows use three step types in YAML templates:

name: csv_product_inventory_analyzer
namespace_name: csv_processing
version: "1.0.0"

steps:
  # BATCHABLE STEP: Analyzes dataset and decides batching strategy
  - name: analyze_csv
    type: batchable
    dependencies: []
    handler:
      callable: BatchProcessing::CsvAnalyzerHandler
      initialization:
        batch_size: 200
        max_workers: 5

  # BATCH WORKER TEMPLATE: Single batch processing unit
  # Orchestration creates N instances from this template at runtime
  - name: process_csv_batch
    type: batch_worker
    dependencies:
      - analyze_csv
    lifecycle:
      max_steps_in_process_minutes: 120
      max_retries: 3
      backoff_multiplier: 2.0
    handler:
      callable: BatchProcessing::CsvBatchProcessorHandler
      initialization:
        operation: "inventory_analysis"

  # DEFERRED CONVERGENCE STEP: Aggregates results from all workers
  - name: aggregate_csv_results
    type: deferred_convergence
    dependencies:
      - process_csv_batch  # Template dependency - resolves to all instances
    handler:
      callable: BatchProcessing::CsvResultsAggregatorHandler
      initialization:
        aggregation_type: "inventory_metrics"
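
Submitting a task against this template uses the standard task creation API (the same endpoint shape as in the operator examples later in this document):

curl -X POST http://localhost:8080/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "csv_processing",
    "template_name": "csv_product_inventory_analyzer",
    "context": {
      "csv_file_path": "/path/to/data.csv"
    }
  }'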

Runtime Execution Flow

1. Task Initialization

User creates task with context: { "csv_file_path": "/path/to/data.csv" }
↓
Task enters Initializing state
↓
Orchestration discovers ready steps: [analyze_csv]

2. Batchable Step Execution

analyze_csv step enqueued to worker queue
↓
Worker claims step, executes CsvAnalyzerHandler
↓
Handler counts rows: 1000
Handler calculates workers: 5 (200 rows each)
Handler generates cursor configs
Handler returns BatchProcessingOutcome::CreateBatches
↓
Step completes with batch_processing_outcome in results

3. Batch Worker Creation (Orchestration)

ResultProcessorActor processes analyze_csv completion
↓
Detects batch_processing_outcome in step results
↓
Sends ProcessBatchableStepMessage to BatchProcessingActor
↓
BatchProcessingService.process_batchable_step():
  - Begins database transaction
  - Creates 5 worker instances from process_csv_batch template:
    * process_csv_batch_001 (cursor: rows 1-200)
    * process_csv_batch_002 (cursor: rows 201-400)
    * process_csv_batch_003 (cursor: rows 401-600)
    * process_csv_batch_004 (cursor: rows 601-800)
    * process_csv_batch_005 (cursor: rows 801-1000)
  - Creates edges: analyze_csv → each worker
  - Creates convergence step: aggregate_csv_results
  - Creates edges: each worker → aggregate_csv_results
  - Commits transaction (all-or-nothing)
↓
Workers enqueued to worker queue with PGMQ notifications

4. Parallel Worker Execution

5 workers execute in parallel:

Worker 001:
  - Extracts cursor: start=1, end=200
  - Processes CSV rows 1-200
  - Returns: processed_count=200, metrics={...}

Worker 002:
  - Extracts cursor: start=201, end=400
  - Processes CSV rows 201-400
  - Returns: processed_count=200, metrics={...}

... (workers 003-005 similar)

All workers complete

5. Convergence Step Execution

Orchestration discovers aggregate_csv_results is ready
(all parent workers completed - intersection semantics)
↓
aggregate_csv_results enqueued to worker queue
↓
Worker claims step, executes CsvResultsAggregatorHandler
↓
Handler detects scenario: WithBatches (5 workers)
Handler aggregates results from all 5 workers:
  - total_processed: 1000
  - total_inventory_value: $XXX,XXX.XX
  - category_counts: {electronics: 300, clothing: 250, ...}
Handler returns aggregated metrics
↓
Step completes

6. Task Completion

Orchestration detects all steps complete
↓
TaskFinalizerActor finalizes task
↓
Task state: Complete

NoBatches Scenario Flow

When dataset is too small or empty:

analyze_csv determines dataset too small (e.g., 50 rows < 200 batch_size)
↓
Returns BatchProcessingOutcome::NoBatches
↓
Orchestration creates single placeholder worker:
  - process_csv_batch_001 (is_no_op: true)
  - No cursor configuration needed
  - Still maintains DAG structure
↓
Placeholder worker executes:
  - Detects is_no_op flag
  - Returns immediately with no_op: true
  - No actual data processing
↓
Convergence step detects NoBatches scenario:
  - Uses batchable step result directly
  - Returns zero metrics or empty aggregation

Why placeholder workers?

  • Maintains consistent DAG structure
  • Convergence step logic handles both scenarios uniformly
  • No special-case orchestration logic needed
  • Standard retry/DLQ mechanics still apply

Data Structures

BatchProcessingOutcome

Location: tasker-shared/src/messaging/execution_types.rs

Purpose: Returned by batchable handlers to instruct orchestration.

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BatchProcessingOutcome {
    /// No batching needed - create placeholder worker
    NoBatches,

    /// Create N batch workers with cursor configurations
    CreateBatches {
        /// Template step name (e.g., "process_csv_batch")
        worker_template_name: String,

        /// Number of workers to create
        worker_count: u32,

        /// Cursor configurations for each worker
        cursor_configs: Vec<CursorConfig>,

        /// Total items across all batches
        total_items: u64,
    },
}

impl BatchProcessingOutcome {
    pub fn no_batches() -> Self {
        BatchProcessingOutcome::NoBatches
    }

    pub fn create_batches(
        worker_template_name: String,
        worker_count: u32,
        cursor_configs: Vec<CursorConfig>,
        total_items: u64,
    ) -> Self {
        BatchProcessingOutcome::CreateBatches {
            worker_template_name,
            worker_count,
            cursor_configs,
            total_items,
        }
    }

    pub fn to_value(&self) -> serde_json::Value {
        serde_json::to_value(self).unwrap_or(json!({}))
    }
}
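
With the serde attributes above, the outcome serializes into the batch_processing_outcome value stored in the batchable step's results. For example (only the first of five cursor configs shown):

// CreateBatches
{
  "type": "create_batches",
  "worker_template_name": "process_csv_batch",
  "worker_count": 5,
  "cursor_configs": [
    { "batch_id": "001", "start_cursor": 1, "end_cursor": 200, "batch_size": 200 }
  ],
  "total_items": 1000
}

// NoBatches
{
  "type": "no_batches"
}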

Ruby Mirror: workers/ruby/lib/tasker_core/types/batch_processing_outcome.rb

module TaskerCore
  module Types
    module BatchProcessingOutcome
      class NoBatches < Dry::Struct
        attribute :type, Types::String.default('no_batches')

        def to_h
          { 'type' => 'no_batches' }
        end

        def requires_batch_creation?
          false
        end
      end

      class CreateBatches < Dry::Struct
        attribute :type, Types::String.default('create_batches')
        attribute :worker_template_name, Types::Strict::String
        attribute :worker_count, Types::Coercible::Integer.constrained(gteq: 1)
        attribute :cursor_configs, Types::Array.of(Types::Hash).constrained(min_size: 1)
        attribute :total_items, Types::Coercible::Integer.constrained(gteq: 0)

        def to_h
          {
            'type' => 'create_batches',
            'worker_template_name' => worker_template_name,
            'worker_count' => worker_count,
            'cursor_configs' => cursor_configs,
            'total_items' => total_items
          }
        end

        def requires_batch_creation?
          true
        end
      end

      class << self
        def no_batches
          NoBatches.new
        end

        def create_batches(worker_template_name:, worker_count:, cursor_configs:, total_items:)
          CreateBatches.new(
            worker_template_name: worker_template_name,
            worker_count: worker_count,
            cursor_configs: cursor_configs,
            total_items: total_items
          )
        end
      end
    end
  end
end

CursorConfig

Location: tasker-shared/src/messaging/execution_types.rs

Purpose: Defines batch boundaries for each worker.

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct CursorConfig {
    /// Batch identifier (e.g., "001", "002", "003")
    pub batch_id: String,

    /// Starting position (inclusive) - flexible JSON value
    pub start_cursor: serde_json::Value,

    /// Ending position (exclusive) - flexible JSON value
    pub end_cursor: serde_json::Value,

    /// Number of items in this batch
    pub batch_size: u32,
}

Design Notes:

  • Cursor values use serde_json::Value for flexibility
  • Supports integers, strings, timestamps, UUIDs, etc.
  • Batch IDs are zero-padded strings for consistent ordering
  • start_cursor is inclusive, end_cursor is exclusive

Example Cursor Configs:

// Numeric cursors (CSV row numbers)
{
  "batch_id": "001",
  "start_cursor": 1,
  "end_cursor": 200,
  "batch_size": 200
}

// Timestamp cursors (event processing)
{
  "batch_id": "002",
  "start_cursor": "2025-01-01T00:00:00Z",
  "end_cursor": "2025-01-01T01:00:00Z",
  "batch_size": 3600
}

// UUID cursors (database pagination)
{
  "batch_id": "003",
  "start_cursor": "00000000-0000-0000-0000-000000000000",
  "end_cursor": "3fffffff-ffff-ffff-ffff-ffffffffffff",
  "batch_size": 1000000
}

BatchWorkerInputs

Location: tasker-shared/src/models/core/batch_worker.rs

Purpose: Stored in workflow_steps.inputs for each worker instance.

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct BatchWorkerInputs {
    /// Cursor configuration defining this worker's batch range
    pub cursor: CursorConfig,

    /// Batch processing metadata
    pub batch_metadata: BatchMetadata,

    /// Flag indicating if this is a placeholder worker (NoBatches scenario)
    #[serde(default)]
    pub is_no_op: bool,
}

impl BatchWorkerInputs {
    pub fn new(
        cursor_config: CursorConfig,
        batch_config: &BatchConfiguration,
        is_no_op: bool,
    ) -> Self {
        Self {
            cursor: cursor_config,
            batch_metadata: BatchMetadata {
                checkpoint_interval: batch_config.checkpoint_interval,
                cursor_field: batch_config.cursor_field.clone(),
                failure_strategy: batch_config.failure_strategy.clone(),
            },
            is_no_op,
        }
    }
}

Storage Location:

  • ✅ workflow_steps.inputs (instance-specific runtime data)
  • ❌ NOT in step_definition.handler.initialization (that’s the template)

BatchMetadata

Location: tasker-shared/src/models/core/batch_worker.rs

Purpose: Runtime configuration for batch processing behavior.

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct BatchMetadata {
    /// Checkpoint frequency (every N items)
    pub checkpoint_interval: u32,

    /// Field name used for cursor tracking (e.g., "id", "row_number")
    pub cursor_field: String,

    /// How to handle failures during batch processing
    pub failure_strategy: FailureStrategy,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum FailureStrategy {
    /// Fail immediately if any item fails
    FailFast,

    /// Continue processing remaining items, report failures at end
    ContinueOnFailure,

    /// Isolate failed items to separate queue
    IsolateFailed,
}

Implementation Patterns

Rust Implementation

1. Batchable Handler Pattern:

use tasker_shared::messaging::execution_types::{BatchProcessingOutcome, CursorConfig};
use serde_json::json;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // 1. Analyze dataset
    let dataset_size = analyze_dataset(step_data)?;
    let batch_size = get_batch_size_from_config(step_data)?;

    // 2. Check if batching needed
    if dataset_size == 0 || dataset_size < batch_size {
        let outcome = BatchProcessingOutcome::no_batches();
        return Ok(success_result(
            step_uuid,
            json!({ "batch_processing_outcome": outcome.to_value() }),
            elapsed_ms,
            None,
        ));
    }

    // 3. Calculate worker count
    let worker_count = (dataset_size as f64 / batch_size as f64).ceil() as u32;

    // 4. Generate cursor configs
    let cursor_configs = create_cursor_configs(dataset_size, worker_count, batch_size);

    // 5. Return CreateBatches outcome
    let outcome = BatchProcessingOutcome::create_batches(
        "worker_template_name".to_string(),
        worker_count,
        cursor_configs,
        dataset_size,
    );

    Ok(success_result(
        step_uuid,
        json!({
            "batch_processing_outcome": outcome.to_value(),
            "worker_count": worker_count,
            "total_items": dataset_size
        }),
        elapsed_ms,
        None,
    ))
}

fn create_cursor_configs(
    total_items: u64,
    worker_count: u32,
    batch_size: u64,
) -> Vec<CursorConfig> {
    let mut cursor_configs = Vec::new();
    let items_per_worker = (total_items as f64 / worker_count as f64).ceil() as u64;

    for i in 0..worker_count {
        let start_position = i as u64 * items_per_worker;
        let end_position = ((i + 1) as u64 * items_per_worker).min(total_items);

        cursor_configs.push(CursorConfig {
            batch_id: format!("{:03}", i + 1),
            start_cursor: json!(start_position),
            end_cursor: json!(end_position),
            batch_size: (end_position - start_position) as u32,
        });
    }

    cursor_configs
}

2. Batch Worker Handler Pattern:

use tasker_worker::batch_processing::BatchWorkerContext;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // 1. Extract batch worker context using helper
    let context = BatchWorkerContext::from_step_data(step_data)?;

    // 2. Check for no-op placeholder worker
    if context.is_no_op() {
        return Ok(success_result(
            step_uuid,
            json!({
                "no_op": true,
                "reason": "NoBatches scenario",
                "batch_id": context.batch_id()
            }),
            elapsed_ms,
            None,
        ));
    }

    // 3. Extract cursor range
    let start_idx = context.start_position();
    let end_idx = context.end_position();
    let checkpoint_interval = context.checkpoint_interval();

    // 4. Process items in range
    let mut processed_count = 0;
    let mut results = initialize_results();

    for idx in start_idx..end_idx {
        // Process item
        let item = get_item(idx)?;
        update_results(&mut results, item);

        processed_count += 1;

        // 5. Checkpoint progress periodically
        if processed_count % checkpoint_interval == 0 {
            checkpoint_progress(step_uuid, idx).await?;
        }
    }

    // 6. Return results for aggregation
    Ok(success_result(
        step_uuid,
        json!({
            "processed_count": processed_count,
            "results": results,
            "batch_id": context.batch_id(),
            "start_position": start_idx,
            "end_position": end_idx
        }),
        elapsed_ms,
        None,
    ))
}

3. Convergence Handler Pattern:

use tasker_worker::batch_processing::BatchAggregationScenario;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // 1. Detect scenario using helper
    let scenario = BatchAggregationScenario::detect(
        &step_data.dependency_results,
        "batchable_step_name",
        "batch_worker_prefix_",
    )?;

    // 2. Handle both scenarios
    let aggregated_results = match scenario {
        BatchAggregationScenario::NoBatches { batchable_result } => {
            // Get dataset info from batchable step
            let total_items = batchable_result
                .result.get("total_items")
                .and_then(|v| v.as_u64())
                .unwrap_or(0);

            // Return zero metrics
            json!({
                "total_processed": total_items,
                "worker_count": 0
            })
        }

        BatchAggregationScenario::WithBatches { batch_results, worker_count } => {
            // Aggregate results from all workers
            let mut total_processed = 0u64;

            for (step_name, result) in batch_results {
                total_processed += result.result
                    .get("processed_count")
                    .and_then(|v| v.as_u64())
                    .unwrap_or(0);

                // Additional aggregation logic...
            }

            json!({
                "total_processed": total_processed,
                "worker_count": worker_count
            })
        }
    };

    // 3. Return aggregated results
    Ok(success_result(
        step_uuid,
        aggregated_results,
        elapsed_ms,
        None,
    ))
}

Ruby Implementation

1. Batchable Handler Pattern (using Batchable base class):

module BatchProcessing
  class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
    def call(task, _sequence, step)
      csv_file_path = task.context['csv_file_path']
      total_rows = count_csv_rows(csv_file_path)

      # Get batch configuration
      batch_size = step_definition_initialization['batch_size'] || 200
      max_workers = step_definition_initialization['max_workers'] || 5

      # Calculate worker count
      worker_count = [(total_rows.to_f / batch_size).ceil, max_workers].min

      if worker_count == 0 || total_rows == 0
        # Use helper for NoBatches outcome
        return no_batches_success(
          reason: 'dataset_too_small',
          total_rows: total_rows
        )
      end

      # Generate cursor configs using helper
      cursor_configs = generate_cursor_configs(
        total_items: total_rows,
        worker_count: worker_count
      )

      # Use helper for CreateBatches outcome
      create_batches_success(
        worker_template_name: 'process_csv_batch',
        worker_count: worker_count,
        cursor_configs: cursor_configs,
        total_items: total_rows,
        additional_data: {
          'csv_file_path' => csv_file_path
        }
      )
    end

    private

    def count_csv_rows(csv_file_path)
      CSV.read(csv_file_path, headers: true).length
    end
  end
end

2. Batch Worker Handler Pattern (using Batchable base class):

module BatchProcessing
  class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
    def call(context)
      # Extract batch context using helper
      batch_ctx = get_batch_context(context)

      # Use helper to check for no-op worker
      no_op_result = handle_no_op_worker(batch_ctx)
      return no_op_result if no_op_result

      # Get CSV file path from dependency results
      csv_file_path = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')

      # Process CSV rows in cursor range
      metrics = process_csv_batch(
        csv_file_path,
        batch_ctx.start_cursor,
        batch_ctx.end_cursor
      )

      # Return results for aggregation
      success(
        result_data: {
          'processed_count' => metrics[:processed_count],
          'total_inventory_value' => metrics[:total_inventory_value],
          'category_counts' => metrics[:category_counts],
          'batch_id' => batch_ctx.batch_id
        }
      )
    end

    private

    def process_csv_batch(csv_file_path, start_row, end_row)
      metrics = {
        processed_count: 0,
        total_inventory_value: 0.0,
        category_counts: Hash.new(0)
      }

      CSV.foreach(csv_file_path, headers: true).with_index(1) do |row, data_row_num|
        next if data_row_num < start_row
        break if data_row_num >= end_row

        product = parse_product(row)

        metrics[:total_inventory_value] += product.price * product.stock
        metrics[:category_counts][product.category] += 1
        metrics[:processed_count] += 1
      end

      metrics
    end
  end
end

3. Convergence Handler Pattern (using Batchable base class):

module BatchProcessing
  class CsvResultsAggregatorHandler < TaskerCore::StepHandler::Batchable
    def call(_task, sequence, _step)
      # Detect scenario using helper
      scenario = detect_aggregation_scenario(
        sequence,
        batchable_step_name: 'analyze_csv',
        batch_worker_prefix: 'process_csv_batch_'
      )

      # Use helper for aggregation with custom block
      aggregate_batch_worker_results(scenario) do |batch_results|
        # Custom aggregation logic
        total_processed = 0
        total_inventory_value = 0.0
        global_category_counts = Hash.new(0)

        batch_results.each do |step_name, result|
          total_processed += result.dig('result', 'processed_count') || 0
          total_inventory_value += result.dig('result', 'total_inventory_value') || 0.0

          (result.dig('result', 'category_counts') || {}).each do |category, count|
            global_category_counts[category] += count
          end
        end

        {
          'total_processed' => total_processed,
          'total_inventory_value' => total_inventory_value,
          'category_counts' => global_category_counts,
          'worker_count' => batch_results.size
        }
      end
    end
  end
end

Use Cases

1. Large Dataset Processing

Scenario: Process millions of records from a database, file, or API.

Why Batch Processing?

  • Single worker would timeout
  • Memory constraints prevent loading entire dataset
  • Want parallelism for speed

Example: Product catalog synchronization

Batchable: Analyze product table (5 million products)
Workers: 100 workers × 50,000 products each
Convergence: Aggregate sync statistics
Result: 5M products synced in 10 minutes vs 2 hours sequential

2. Time-Based Event Processing

Scenario: Process events from a time-series database or log aggregation system.

Why Batch Processing?

  • Events span long time ranges
  • Want to process hourly/daily chunks in parallel
  • Need resumability for long-running processing

Example: Analytics event processing

Batchable: Analyze events (30 days × 24 hours)
Workers: 720 workers (1 per hour)
Cursors: Timestamp ranges (2025-01-01T00:00 to 2025-01-01T01:00)
Convergence: Aggregate daily/monthly metrics
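
A hedged sketch of generating hourly timestamp cursors with the CursorConfig type defined in the Data Structures section (chrono is assumed for the time arithmetic; the cursor shape mirrors the timestamp example shown there):

use chrono::{DateTime, Duration, Utc};
use serde_json::json;
use tasker_shared::messaging::execution_types::CursorConfig;

/// Build one CursorConfig per hour in [start, end), following the
/// inclusive-start / exclusive-end convention described earlier.
fn hourly_cursor_configs(start: DateTime<Utc>, end: DateTime<Utc>) -> Vec<CursorConfig> {
    let mut configs = Vec::new();
    let mut window_start = start;
    let mut batch_id = 1u32;

    while window_start < end {
        let window_end = (window_start + Duration::hours(1)).min(end);
        configs.push(CursorConfig {
            batch_id: format!("{:03}", batch_id),
            start_cursor: json!(window_start.to_rfc3339()),
            end_cursor: json!(window_end.to_rfc3339()),
            batch_size: 3600, // mirrors the timestamp-cursor example above
        });
        window_start = window_end;
        batch_id += 1;
    }
    configs
}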

3. Multi-Source Data Integration

Scenario: Fetch data from multiple external APIs or services.

Why Batch Processing?

  • Each source is independent
  • Want parallel fetching for speed
  • Some sources may fail (need retry per source)

Example: Third-party data enrichment

Batchable: Analyze customer list (partition by data provider)
Workers: 5 workers (1 per provider: Stripe, Salesforce, HubSpot, etc.)
Cursors: Provider-specific identifiers
Convergence: Merge enriched customer profiles

4. Bulk File Processing

Scenario: Process multiple files (CSVs, images, documents).

Why Batch Processing?

  • Each file is independent processing unit
  • Want parallelism across files
  • File sizes vary (dynamic batch sizing)

Example: Image transformation pipeline

Batchable: List S3 bucket objects (1000 images)
Workers: 20 workers × 50 images each
Cursors: S3 object key ranges
Convergence: Verify all images transformed

5. Geographical Data Partitioning

Scenario: Process data partitioned by geography (regions, countries, cities).

Why Batch Processing?

  • Geographic boundaries provide natural partitions
  • Want parallel processing per region
  • Different regions may have different data volumes

Example: Regional sales report generation

Batchable: Analyze sales data (50 US states)
Workers: 50 workers (1 per state)
Cursors: State codes (AL, AK, AZ, ...)
Convergence: National sales dashboard

Operator Workflows

Batch processing integrates seamlessly with the DLQ (Dead Letter Queue) system for operator visibility and manual intervention. This section shows how operators manage failed batch workers.

DLQ Integration Principles

From DLQ System Documentation:

  1. Investigation Tracking Only: DLQ tracks “why task is stuck” and “who investigated” - it doesn’t manipulate tasks
  2. Step-Level Resolution: Operators fix problem steps using step APIs, not task-level operations
  3. Three Resolution Types:
    • ResetForRetry: Reset attempts, return step to pending (cursor preserved)
    • ResolveManually: Skip step, mark resolved without results
    • CompleteManually: Provide manual results for dependent steps

Key for Batch Processing: Cursor data in workflow_steps.results is preserved during ResetForRetry, enabling resumability without data loss.

Staleness Detection for Batch Workers

Batch workers have two staleness detection mechanisms:

1. Duration-Based (Standard):

lifecycle:
  max_steps_in_process_minutes: 120  # DLQ threshold

If a worker stays in the InProgress state for more than 120 minutes, it is flagged as stale.

2. Checkpoint-Based (Batch-Specific):

// Workers checkpoint progress periodically
if processed_count % checkpoint_interval == 0 {
    checkpoint_progress(step_uuid, current_cursor).await?;
}

If the last checkpoint timestamp is too old, the worker is flagged as stale even if it is still within the duration threshold.
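
For example, an operator might surface in-progress workers whose last checkpoint is older than a chosen threshold (a sketch using the same results fields queried in Scenario 3 below):

SELECT
    ws.workflow_step_uuid,
    ws.name,
    ws.results->>'last_checkpoint_cursor' as checkpoint_cursor,
    NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
  AND ws.current_state = 'InProgress'
  AND (ws.results->>'checkpoint_timestamp')::timestamptz < NOW() - INTERVAL '15 minutes';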

Common Operator Scenarios

Scenario 1: Transient Database Failure

Problem: 3 out of 5 batch workers failed due to database connection timeout.

Step 1: Find the stuck task in DLQ:

# Get investigation queue (prioritized by age and reason)
curl http://localhost:8080/v1/dlq/investigation-queue | jq

Step 2: Get task details and identify failed workers:

-- Get DLQ entry for the task
SELECT
    dlq.dlq_entry_uuid,
    dlq.task_uuid,
    dlq.dlq_reason,
    dlq.resolution_status,
    dlq.task_snapshot->'workflow_steps' as steps
FROM tasker.tasks_dlq dlq
WHERE dlq.task_uuid = 'task-uuid-here'
  AND dlq.resolution_status = 'pending';

-- Query task's workflow steps to find failed batch workers
SELECT
    ws.workflow_step_uuid,
    ws.name,
    ws.current_state,
    ws.attempts,
    ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = 'task-uuid-here'
  AND ws.name LIKE 'process_csv_batch_%'
  AND ws.current_state = 'Error';

Result:

workflow_step_uuid | name                   | current_state | attempts | last_error
-------------------|------------------------|---------------|----------|------------------
uuid-worker-2      | process_csv_batch_002  | Error         | 3        | DB timeout
uuid-worker-4      | process_csv_batch_004  | Error         | 3        | DB timeout
uuid-worker-5      | process_csv_batch_005  | Error         | 3        | DB timeout

Operator Action: Database is now healthy - reset workers for retry

# Get task UUID from DLQ entry
TASK_UUID="abc-123-task-uuid"

# Reset worker 2 (preserves cursor: rows 201-400)
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-2 \
  -H "Content-Type: application/json" \
  -d '{
    "action_type": "reset_for_retry",
    "reset_by": "operator@example.com",
    "reason": "Database connection restored, resetting attempts"
  }'

# Reset workers 4 and 5
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-4 \
  -H "Content-Type: application/json" \
  -d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Database connection restored"}'

curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-5 \
  -H "Content-Type: application/json" \
  -d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Database connection restored"}'

# Update DLQ entry to track resolution
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "resolution_status": "manually_resolved",
    "resolution_notes": "Reset 3 failed batch workers after database connection restored",
    "resolved_by": "operator@example.com"
  }'

Result:

  • Workers 2, 4, 5 return to Pending state
  • Cursor configs preserved in workflow_steps.inputs
  • Retry attempt counter reset to 0
  • Workers re-enqueued for execution
  • DLQ entry updated with resolution metadata

Scenario 2: Bad Data in Specific Batch

Problem: Worker 3 repeatedly fails due to malformed CSV row in its range (rows 401-600).

Investigation:

-- Get worker details
SELECT
    ws.name,
    ws.current_state,
    ws.attempts,
    ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.workflow_step_uuid = 'uuid-worker-3';

Result:

name: process_csv_batch_003
current_state: Error
attempts: 3
last_error: "CSV parsing failed at row 523: invalid price format"

Operator Decision: Row 523 has known data quality issue, already fixed in source system.

Option 1: Complete Manually (provide results for this batch):

TASK_UUID="abc-123-task-uuid"
STEP_UUID="uuid-worker-3"

curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/${STEP_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "action_type": "complete_manually",
    "completion_data": {
      "result": {
        "processed_count": 199,
        "total_inventory_value": 45232.50,
        "category_counts": {"electronics": 150, "clothing": 49},
        "batch_id": "003",
        "note": "Row 523 skipped due to data quality issue, manually verified totals"
      },
      "metadata": {
        "manually_verified": true,
        "verification_method": "manual_inspection",
        "skipped_rows": [523]
      }
    },
    "reason": "Manual completion after verifying corrected data in source system",
    "completed_by": "operator@example.com"
  }'

Option 2: Resolve Manually (skip this batch):

curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/${STEP_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "action_type": "resolve_manually",
    "resolved_by": "operator@example.com",
    "reason": "Non-critical batch containing known bad data, skipping 200 rows out of 1000 total"
  }'

Result (Option 1):

  • Worker 3 marked Complete with manual results
  • Convergence step receives manual results in aggregation
  • Task completes successfully with note about manual intervention

Result (Option 2):

  • Worker 3 marked ResolvedManually (no results provided)
  • Convergence step detects missing results, adjusts aggregation
  • Task completes with reduced total (800 rows instead of 1000)

Scenario 3: Long-Running Worker Needs Checkpoint

Problem: Worker 1 processing 10,000 rows, operator notices it’s been running 90 minutes (threshold: 120 minutes).

Investigation:

-- Check last checkpoint
SELECT
    ws.name,
    ws.current_state,
    ws.results->>'last_checkpoint_cursor' as last_checkpoint,
    ws.results->>'checkpoint_timestamp' as checkpoint_time,
    NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint
FROM tasker.workflow_steps ws
WHERE ws.workflow_step_uuid = 'uuid-worker-1';

Result:

name: process_large_batch_001
current_state: InProgress
last_checkpoint: 7850
checkpoint_time: 2025-01-15 11:30:00
time_since_checkpoint: 00:05:00

Operator Action: Worker is healthy and making progress (checkpointed 5 minutes ago at row 7850).

No action needed; the worker will complete normally. The operator adds an investigation note to the DLQ entry:

DLQ_ENTRY_UUID="dlq-entry-uuid-here"

curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {
      "investigation_notes": "Worker healthy, last checkpoint at row 7850 (5 min ago), estimated 15 min remaining",
      "investigator": "operator@example.com",
      "timestamp": "2025-01-15T11:35:00Z",
      "action_taken": "none - monitoring"
    }
  }'

Scenario 4: All Workers Failed - Batch Strategy Issue

Problem: All 10 workers fail with a “memory exhausted” error because the batch size is too large.

Investigation via API:

TASK_UUID="task-uuid-here"

# Get task details including all workflow steps
curl http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps | jq '.[] | select(.name | startswith("process_large_batch_")) | {name, current_state, attempts, last_error}'

Result: All workers show current_state: "Error" with the same OOM error in last_error.

Operator Action: Cancel the entire task and re-run it with a smaller batch size.

DLQ_ENTRY_UUID="dlq-entry-uuid-here"

# Cancel task (cancels all workers)
curl -X DELETE http://localhost:8080/v1/tasks/${TASK_UUID}

# Update DLQ entry to track resolution
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "resolution_status": "permanently_failed",
    "resolution_notes": "Batch size too large causing OOM. Cancelled task and created new task with batch_size: 5000 instead of 10000",
    "resolved_by": "operator@example.com",
    "metadata": {
      "root_cause": "configuration_error",
      "corrective_action": "reduced_batch_size",
      "new_task_uuid": "new-task-uuid-here"
    }
  }'

# Create new task with corrected configuration
curl -X POST http://localhost:8080/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "data_processing",
    "template_name": "large_dataset_processor",
    "context": {
      "dataset_id": "dataset-123",
      "batch_size": 5000,
      "max_workers": 20
    }
  }'

DLQ Query Patterns for Batch Processing

1. Find DLQ entry for a batch processing task:

-- Get DLQ entry with task snapshot
SELECT
    dlq.dlq_entry_uuid,
    dlq.task_uuid,
    dlq.dlq_reason,
    dlq.resolution_status,
    dlq.dlq_timestamp,
    dlq.resolution_notes,
    dlq.resolved_by,
    dlq.task_snapshot->'namespace_name' as namespace,
    dlq.task_snapshot->'template_name' as template,
    dlq.task_snapshot->'current_state' as task_state
FROM tasker.tasks_dlq dlq
WHERE dlq.task_uuid = :task_uuid
  AND dlq.resolution_status = 'pending'
ORDER BY dlq.dlq_timestamp DESC
LIMIT 1;

2. Check batch completion progress:

SELECT
    COUNT(*) FILTER (WHERE ws.current_state = 'Complete') as completed_workers,
    COUNT(*) FILTER (WHERE ws.current_state = 'InProgress') as in_progress_workers,
    COUNT(*) FILTER (WHERE ws.current_state = 'Error') as failed_workers,
    COUNT(*) FILTER (WHERE ws.current_state = 'Pending') as pending_workers,
    COUNT(*) FILTER (WHERE ws.current_state = 'WaitingForRetry') as waiting_retry,
    COUNT(*) as total_workers
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
  AND ws.name LIKE 'process_%_batch_%';

3. Find workers with stale checkpoints:

SELECT
    ws.workflow_step_uuid,
    ws.name,
    ws.current_state,
    ws.results->>'last_checkpoint_cursor' as checkpoint_cursor,
    ws.results->>'checkpoint_timestamp' as checkpoint_time,
    NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint,
    ws.attempts,
    ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
  AND ws.name LIKE 'process_%_batch_%'
  AND ws.current_state = 'InProgress'
  AND ws.results->>'checkpoint_timestamp' IS NOT NULL
  AND NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz > interval '15 minutes'
ORDER BY time_since_checkpoint DESC;

4. Get aggregated batch task health:

SELECT
    t.task_uuid,
    t.namespace_name,
    t.template_name,
    t.current_state as task_state,
    t.execution_status,
    COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as worker_count,
    jsonb_object_agg(
        ws.current_state,
        COUNT(*)
    ) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as worker_states,
    dlq.dlq_reason,
    dlq.resolution_status
FROM tasker.tasks t
JOIN tasker.workflow_steps ws ON ws.task_uuid = t.task_uuid
LEFT JOIN tasker.tasks_dlq dlq ON dlq.task_uuid = t.task_uuid
    AND dlq.resolution_status = 'pending'
WHERE t.task_uuid = :task_uuid
GROUP BY t.task_uuid, t.namespace_name, t.template_name, t.current_state, t.execution_status,
         dlq.dlq_reason, dlq.resolution_status;

5. Find all batch tasks in DLQ:

-- Find tasks with batch workers that are stuck
SELECT
    dlq.dlq_entry_uuid,
    dlq.task_uuid,
    dlq.dlq_reason,
    dlq.dlq_timestamp,
    t.namespace_name,
    t.template_name,
    t.current_state,
    COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as batch_worker_count,
    COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.current_state = 'Error' AND ws.name LIKE 'process_%_batch_%') as failed_workers
FROM tasker.tasks_dlq dlq
JOIN tasker.tasks t ON t.task_uuid = dlq.task_uuid
JOIN tasker.workflow_steps ws ON ws.task_uuid = dlq.task_uuid
WHERE dlq.resolution_status = 'pending'
GROUP BY dlq.dlq_entry_uuid, dlq.task_uuid, dlq.dlq_reason, dlq.dlq_timestamp,
         t.namespace_name, t.template_name, t.current_state
HAVING COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') > 0
ORDER BY dlq.dlq_timestamp DESC;

Operator Dashboard Recommendations

For monitoring batch processing tasks, operators should have dashboards showing:

  1. Batch Progress:

    • Total workers vs completed workers
    • Estimated completion time based on worker velocity
    • Current throughput (items/second across all workers)
  2. Stale Worker Alerts:

    • Workers exceeding duration threshold
    • Workers with stale checkpoints
    • Workers with repeated failures
  3. Batch Health Metrics:

    • Success rate per batch
    • Average processing time per worker
    • Resource utilization (CPU, memory)
  4. Resolution Actions:

    • Recent operator interventions
    • Resolution action distribution (ResetForRetry vs ResolveManually)
    • Time to resolution for stale workers
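
The estimated-completion panel can be driven by a simple extrapolation from worker counts. A minimal sketch, assuming workers take roughly similar time; the helper below is illustrative and not part of the Tasker API:

use std::time::Duration;

/// Rough completion estimate from worker velocity.
fn estimate_remaining(total_workers: u64, completed_workers: u64, elapsed: Duration) -> Option<Duration> {
    if completed_workers == 0 {
        return None; // no velocity data yet
    }
    let per_worker = elapsed.as_secs_f64() / completed_workers as f64;
    let remaining = total_workers.saturating_sub(completed_workers) as f64;
    Some(Duration::from_secs_f64(per_worker * remaining))
}

fn main() {
    // 6 of 10 workers done after 12 minutes => roughly 8 minutes remaining
    let eta = estimate_remaining(10, 6, Duration::from_secs(12 * 60));
    println!("estimated remaining: {:?}", eta);
}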

Code Examples

Complete Working Example: CSV Product Inventory

This section shows a complete end-to-end implementation processing a 1000-row CSV file in 5 parallel batches.

Rust Implementation

1. Batchable Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:60-150

pub struct CsvAnalyzerHandler;

#[async_trait]
impl StepHandler for CsvAnalyzerHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Get CSV file path from task context
        let csv_file_path = step_data
            .task
            .context
            .get("csv_file_path")
            .and_then(|v| v.as_str())
            .ok_or_else(|| anyhow!("Missing csv_file_path in task context"))?;

        // Count total data rows (excluding header)
        let file = File::open(csv_file_path)?;
        let reader = BufReader::new(file);
        let total_rows = reader.lines().count().saturating_sub(1) as u64;

        info!("CSV Analysis: {} rows in {}", total_rows, csv_file_path);

        // Get batch configuration
        let handler_init = step_data.handler_initialization.as_object().unwrap();
        let batch_size = handler_init
            .get("batch_size")
            .and_then(|v| v.as_u64())
            .unwrap_or(200);
        let max_workers = handler_init
            .get("max_workers")
            .and_then(|v| v.as_u64())
            .unwrap_or(5);

        // Determine if batching needed
        if total_rows == 0 {
            let outcome = BatchProcessingOutcome::no_batches();
            let elapsed_ms = start_time.elapsed().as_millis() as u64;

            return Ok(success_result(
                step_uuid,
                json!({
                    "batch_processing_outcome": outcome.to_value(),
                    "reason": "empty_dataset",
                    "total_rows": 0
                }),
                elapsed_ms,
                None,
            ));
        }

        // Calculate worker count
        let worker_count = ((total_rows as f64 / batch_size as f64).ceil() as u64)
            .min(max_workers);

        // Generate cursor configurations
        let actual_batch_size = (total_rows as f64 / worker_count as f64).ceil() as u64;
        let mut cursor_configs = Vec::new();

        for i in 0..worker_count {
            let start_row = (i * actual_batch_size) + 1; // 1-indexed after header
            let end_row = ((i + 1) * actual_batch_size).min(total_rows) + 1;

            cursor_configs.push(CursorConfig {
                batch_id: format!("{:03}", i + 1),
                start_cursor: json!(start_row),
                end_cursor: json!(end_row),
                batch_size: (end_row - start_row) as u32,
            });
        }

        info!(
            "Creating {} batch workers for {} rows (batch_size: {})",
            worker_count, total_rows, actual_batch_size
        );

        // Return CreateBatches outcome
        let outcome = BatchProcessingOutcome::create_batches(
            "process_csv_batch".to_string(),
            worker_count as u32,
            cursor_configs,
            total_rows,
        );

        let elapsed_ms = start_time.elapsed().as_millis() as u64;

        Ok(success_result(
            step_uuid,
            json!({
                "batch_processing_outcome": outcome.to_value(),
                "worker_count": worker_count,
                "total_rows": total_rows,
                "csv_file_path": csv_file_path
            }),
            elapsed_ms,
            Some(json!({
                "batch_size": actual_batch_size,
                "file_path": csv_file_path
            })),
        ))
    }
}

2. Batch Worker Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:200-350

pub struct CsvBatchProcessorHandler;

#[derive(Debug, Deserialize)]
struct Product {
    id: u32,
    title: String,
    category: String,
    price: f64,
    stock: u32,
    rating: f64,
}

#[async_trait]
impl StepHandler for CsvBatchProcessorHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Extract batch worker context using helper
        let context = BatchWorkerContext::from_step_data(step_data)?;

        // Check for no-op placeholder worker
        if context.is_no_op() {
            let elapsed_ms = start_time.elapsed().as_millis() as u64;
            return Ok(success_result(
                step_uuid,
                json!({
                    "no_op": true,
                    "reason": "NoBatches scenario - no items to process",
                    "batch_id": context.batch_id()
                }),
                elapsed_ms,
                None,
            ));
        }

        // Get CSV file path from dependency results
        let csv_file_path = step_data
            .dependency_results
            .get("analyze_csv")
            .and_then(|r| r.result.get("csv_file_path"))
            .and_then(|v| v.as_str())
            .ok_or_else(|| anyhow!("Missing csv_file_path from analyze_csv"))?;

        // Extract cursor range
        let start_row = context.start_position();
        let end_row = context.end_position();

        info!(
            "Processing batch {} (rows {}-{})",
            context.batch_id(),
            start_row,
            end_row
        );

        // Initialize metrics
        let mut processed_count = 0u64;
        let mut total_inventory_value = 0.0;
        let mut category_counts: HashMap<String, u32> = HashMap::new();
        let mut max_price = 0.0;
        let mut max_price_product = None;
        let mut total_rating = 0.0;

        // Open CSV and process rows in cursor range
        let file = File::open(Path::new(csv_file_path))?;
        let mut csv_reader = csv::ReaderBuilder::new()
            .has_headers(true)
            .from_reader(BufReader::new(file));

        for (row_idx, result) in csv_reader.deserialize::<Product>().enumerate() {
            let data_row_num = row_idx + 1; // 1-indexed after header

            if data_row_num < start_row {
                continue; // Skip rows before our range
            }
            if data_row_num >= end_row {
                break; // Processed all our rows
            }

            let product: Product = result?;

            // Calculate inventory metrics
            let inventory_value = product.price * (product.stock as f64);
            total_inventory_value += inventory_value;

            *category_counts.entry(product.category.clone()).or_insert(0) += 1;

            if product.price > max_price {
                max_price = product.price;
                max_price_product = Some(product.title.clone());
            }

            total_rating += product.rating;
            processed_count += 1;

            // Checkpoint progress periodically
            if processed_count % context.checkpoint_interval() == 0 {
                debug!(
                    "Checkpoint: batch {} processed {} items",
                    context.batch_id(),
                    processed_count
                );
            }
        }

        let average_rating = if processed_count > 0 {
            total_rating / (processed_count as f64)
        } else {
            0.0
        };

        let elapsed_ms = start_time.elapsed().as_millis() as u64;

        info!(
            "Batch {} complete: {} items processed",
            context.batch_id(),
            processed_count
        );

        Ok(success_result(
            step_uuid,
            json!({
                "processed_count": processed_count,
                "total_inventory_value": total_inventory_value,
                "category_counts": category_counts,
                "max_price": max_price,
                "max_price_product": max_price_product,
                "average_rating": average_rating,
                "batch_id": context.batch_id(),
                "start_row": start_row,
                "end_row": end_row
            }),
            elapsed_ms,
            None,
        ))
    }
}

3. Convergence Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:400-520

pub struct CsvResultsAggregatorHandler;

#[async_trait]
impl StepHandler for CsvResultsAggregatorHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Detect scenario using helper
        let scenario = BatchAggregationScenario::detect(
            &step_data.dependency_results,
            "analyze_csv",
            "process_csv_batch_",
        )?;

        let (total_processed, total_inventory_value, category_counts, max_price, max_price_product, overall_avg_rating, worker_count) = match scenario {
            BatchAggregationScenario::NoBatches { batchable_result } => {
                // No batch workers - get dataset size from batchable step
                let total_rows = batchable_result
                    .result
                    .get("total_rows")
                    .and_then(|v| v.as_u64())
                    .unwrap_or(0);

                info!("NoBatches scenario: {} rows (no processing needed)", total_rows);

                (total_rows, 0.0, HashMap::new(), 0.0, None, 0.0, 0)
            }

            BatchAggregationScenario::WithBatches {
                batch_results,
                worker_count,
            } => {
                info!("Aggregating results from {} batch workers", worker_count);

                let mut total_processed = 0u64;
                let mut total_inventory_value = 0.0;
                let mut global_category_counts: HashMap<String, u64> = HashMap::new();
                let mut max_price = 0.0;
                let mut max_price_product = None;
                let mut weighted_ratings = Vec::new();

                for (step_name, result) in batch_results {
                    // Sum processed counts
                    let count = result
                        .result
                        .get("processed_count")
                        .and_then(|v| v.as_u64())
                        .unwrap_or(0);
                    total_processed += count;

                    // Sum inventory values
                    let value = result
                        .result
                        .get("total_inventory_value")
                        .and_then(|v| v.as_f64())
                        .unwrap_or(0.0);
                    total_inventory_value += value;

                    // Merge category counts
                    if let Some(categories) = result
                        .result
                        .get("category_counts")
                        .and_then(|v| v.as_object())
                    {
                        for (category, cat_count) in categories {
                            *global_category_counts
                                .entry(category.clone())
                                .or_insert(0) += cat_count.as_u64().unwrap_or(0);
                        }
                    }

                    // Find global max price
                    let batch_max_price = result
                        .result
                        .get("max_price")
                        .and_then(|v| v.as_f64())
                        .unwrap_or(0.0);
                    if batch_max_price > max_price {
                        max_price = batch_max_price;
                        max_price_product = result
                            .result
                            .get("max_price_product")
                            .and_then(|v| v.as_str())
                            .map(String::from);
                    }

                    // Collect ratings for weighted average
                    let avg_rating = result
                        .result
                        .get("average_rating")
                        .and_then(|v| v.as_f64())
                        .unwrap_or(0.0);
                    weighted_ratings.push((count, avg_rating));
                }

                // Calculate overall weighted average rating
                let total_items = weighted_ratings.iter().map(|(c, _)| c).sum::<u64>();
                let overall_avg_rating = if total_items > 0 {
                    weighted_ratings
                        .iter()
                        .map(|(count, avg)| (*count as f64) * avg)
                        .sum::<f64>()
                        / (total_items as f64)
                } else {
                    0.0
                };

                (
                    total_processed,
                    total_inventory_value,
                    global_category_counts,
                    max_price,
                    max_price_product,
                    overall_avg_rating,
                    worker_count,
                )
            }
        };

        let elapsed_ms = start_time.elapsed().as_millis() as u64;

        info!(
            "Aggregation complete: {} total items processed by {} workers",
            total_processed, worker_count
        );

        Ok(success_result(
            step_uuid,
            json!({
                "total_processed": total_processed,
                "total_inventory_value": total_inventory_value,
                "category_counts": category_counts,
                "max_price": max_price,
                "max_price_product": max_price_product,
                "overall_average_rating": overall_avg_rating,
                "worker_count": worker_count
            }),
            elapsed_ms,
            None,
        ))
    }
}

Ruby Implementation

1. Batchable Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_analyzer_handler.rb

module BatchProcessing
  module StepHandlers
    # CSV Analyzer - Batchable Step
    class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
      def call(task, _sequence, step)
        csv_file_path = task.context['csv_file_path']
        raise ArgumentError, 'Missing csv_file_path in task context' unless csv_file_path

        # Count CSV rows (excluding header)
        total_rows = count_csv_rows(csv_file_path)

        Rails.logger.info("CSV Analysis: #{total_rows} rows in #{csv_file_path}")

        # Get batch configuration from handler initialization
        batch_size = step_definition_initialization['batch_size'] || 200
        max_workers = step_definition_initialization['max_workers'] || 5

        # Calculate worker count
        worker_count = [(total_rows.to_f / batch_size).ceil, max_workers].min

        if worker_count.zero? || total_rows.zero?
          # Use helper for NoBatches outcome
          return no_batches_success(
            reason: 'empty_dataset',
            total_rows: total_rows
          )
        end

        # Generate cursor configs using helper
        cursor_configs = generate_cursor_configs(
          total_items: total_rows,
          worker_count: worker_count
        ) do |batch_idx, start_pos, end_pos, items_in_batch|
          # Adjust to 1-indexed row numbers (after header)
          {
            'batch_id' => format('%03d', batch_idx + 1),
            'start_cursor' => start_pos + 1,
            'end_cursor' => end_pos + 1,
            'batch_size' => items_in_batch
          }
        end

        Rails.logger.info("Creating #{worker_count} batch workers for #{total_rows} rows")

        # Use helper for CreateBatches outcome
        create_batches_success(
          worker_template_name: 'process_csv_batch',
          worker_count: worker_count,
          cursor_configs: cursor_configs,
          total_items: total_rows,
          additional_data: {
            'csv_file_path' => csv_file_path
          }
        )
      end

      private

      def count_csv_rows(csv_file_path)
        CSV.read(csv_file_path, headers: true).length
      end

      def step_definition_initialization
        @step_definition_initialization ||= {}
      end
    end
  end
end

2. Batch Worker Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_batch_processor_handler.rb

module BatchProcessing
  module StepHandlers
    # CSV Batch Processor - Batch Worker
    class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
      Product = Struct.new(
        :id, :title, :description, :category, :price,
        :discount_percentage, :rating, :stock, :brand, :sku, :weight,
        keyword_init: true
      )

      def call(context)
        # Extract batch context using helper
        batch_ctx = get_batch_context(context)

        # Use helper to check for no-op worker
        no_op_result = handle_no_op_worker(batch_ctx)
        return no_op_result if no_op_result

        # Get CSV file path from dependency results
        csv_file_path = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
        raise ArgumentError, 'Missing csv_file_path from analyze_csv' unless csv_file_path

        Rails.logger.info("Processing batch #{batch_ctx.batch_id} (rows #{batch_ctx.start_cursor}-#{batch_ctx.end_cursor})")

        # Process CSV rows in cursor range
        metrics = process_csv_batch(
          csv_file_path,
          batch_ctx.start_cursor,
          batch_ctx.end_cursor
        )

        Rails.logger.info("Batch #{batch_ctx.batch_id} complete: #{metrics[:processed_count]} items processed")

        # Return results for aggregation
        success(
          result_data: {
            'processed_count' => metrics[:processed_count],
            'total_inventory_value' => metrics[:total_inventory_value],
            'category_counts' => metrics[:category_counts],
            'max_price' => metrics[:max_price],
            'max_price_product' => metrics[:max_price_product],
            'average_rating' => metrics[:average_rating],
            'batch_id' => batch_ctx.batch_id,
            'start_row' => batch_ctx.start_cursor,
            'end_row' => batch_ctx.end_cursor
          }
        )
      end

      private

      def process_csv_batch(csv_file_path, start_row, end_row)
        metrics = {
          processed_count: 0,
          total_inventory_value: 0.0,
          category_counts: Hash.new(0),
          max_price: 0.0,
          max_price_product: nil,
          ratings: []
        }

        CSV.foreach(csv_file_path, headers: true).with_index(1) do |row, data_row_num|
          # Skip rows before our range
          next if data_row_num < start_row
          # Break when we've processed all our rows
          break if data_row_num >= end_row

          product = parse_product(row)

          # Calculate inventory metrics
          inventory_value = product.price * product.stock
          metrics[:total_inventory_value] += inventory_value

          metrics[:category_counts][product.category] += 1

          if product.price > metrics[:max_price]
            metrics[:max_price] = product.price
            metrics[:max_price_product] = product.title
          end

          metrics[:ratings] << product.rating
          metrics[:processed_count] += 1
        end

        # Calculate average rating
        metrics[:average_rating] = if metrics[:ratings].any?
                                      metrics[:ratings].sum / metrics[:ratings].size.to_f
                                    else
                                      0.0
                                    end

        metrics.except(:ratings)
      end

      def parse_product(row)
        Product.new(
          id: row['id'].to_i,
          title: row['title'],
          description: row['description'],
          category: row['category'],
          price: row['price'].to_f,
          discount_percentage: row['discountPercentage'].to_f,
          rating: row['rating'].to_f,
          stock: row['stock'].to_i,
          brand: row['brand'],
          sku: row['sku'],
          weight: row['weight'].to_i
        )
      end
    end
  end
end

3. Convergence Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_results_aggregator_handler.rb

module BatchProcessing
  module StepHandlers
    # CSV Results Aggregator - Deferred Convergence Step
    class CsvResultsAggregatorHandler < TaskerCore::StepHandler::Batchable
      def call(_task, sequence, _step)
        # Detect scenario using helper
        scenario = detect_aggregation_scenario(
          sequence,
          batchable_step_name: 'analyze_csv',
          batch_worker_prefix: 'process_csv_batch_'
        )

        # Use helper for aggregation with custom block
        aggregate_batch_worker_results(scenario) do |batch_results|
          aggregate_csv_metrics(batch_results)
        end
      end

      private

      def aggregate_csv_metrics(batch_results)
        total_processed = 0
        total_inventory_value = 0.0
        global_category_counts = Hash.new(0)
        max_price = 0.0
        max_price_product = nil
        weighted_ratings = []

        batch_results.each do |step_name, batch_result|
          result = batch_result['result'] || {}

          # Sum processed counts
          count = result['processed_count'] || 0
          total_processed += count

          # Sum inventory values
          total_inventory_value += result['total_inventory_value'] || 0.0

          # Merge category counts
          (result['category_counts'] || {}).each do |category, cat_count|
            global_category_counts[category] += cat_count
          end

          # Find global max price
          batch_max_price = result['max_price'] || 0.0
          if batch_max_price > max_price
            max_price = batch_max_price
            max_price_product = result['max_price_product']
          end

          # Collect ratings for weighted average
          avg_rating = result['average_rating'] || 0.0
          weighted_ratings << { count: count, avg: avg_rating }
        end

        # Calculate overall weighted average rating
        total_items = weighted_ratings.sum { |r| r[:count] }
        overall_avg_rating = if total_items.positive?
                               weighted_ratings.sum { |r| r[:avg] * r[:count] } / total_items.to_f
                             else
                               0.0
                             end

        Rails.logger.info("Aggregation complete: #{total_processed} total items processed by #{batch_results.size} workers")

        {
          'total_processed' => total_processed,
          'total_inventory_value' => total_inventory_value,
          'category_counts' => global_category_counts,
          'max_price' => max_price,
          'max_price_product' => max_price_product,
          'overall_average_rating' => overall_avg_rating,
          'worker_count' => batch_results.size
        }
      end
    end
  end
end

YAML Template

File: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml

---
name: csv_product_inventory_analyzer
namespace_name: csv_processing
version: "1.0.0"
description: "Process CSV product data in parallel batches"
task_handler:
  callable: rust_handler
  initialization: {}

steps:
  # BATCHABLE STEP: CSV Analysis and Batch Planning
  - name: analyze_csv
    type: batchable
    dependencies: []
    handler:
      callable: CsvAnalyzerHandler
      initialization:
        batch_size: 200
        max_workers: 5

  # BATCH WORKER TEMPLATE: Single CSV Batch Processing
  # Orchestration creates N instances from this template
  - name: process_csv_batch
    type: batch_worker
    dependencies:
      - analyze_csv
    lifecycle:
      max_steps_in_process_minutes: 120
      max_retries: 3
      backoff_multiplier: 2.0
    handler:
      callable: CsvBatchProcessorHandler
      initialization:
        operation: "inventory_analysis"

  # DEFERRED CONVERGENCE STEP: CSV Results Aggregation
  - name: aggregate_csv_results
    type: deferred_convergence
    dependencies:
      - process_csv_batch  # Template dependency - resolves to all worker instances
    handler:
      callable: CsvResultsAggregatorHandler
      initialization:
        aggregation_type: "inventory_metrics"

Best Practices

1. Batch Size Calculation

Guideline: Balance parallelism with overhead.

Too Small:

  • Excessive orchestration overhead
  • Too many database transactions
  • Diminishing returns on parallelism

Too Large:

  • Workers timeout or OOM
  • Long retry times on failure
  • Reduced parallelism

Recommended Approach:

def calculate_optimal_batch_size(total_items, item_processing_time_ms)
  # Target: Each batch takes 5-10 minutes
  target_duration_ms = 7.5 * 60 * 1000

  # Calculate items per batch
  items_per_batch = (target_duration_ms / item_processing_time_ms).ceil

  # Enforce min/max bounds
  [[items_per_batch, 100].max, 10000].min
end

2. Worker Count Limits

Guideline: Limit concurrency based on system resources.

handler:
  initialization:
    batch_size: 200
    max_workers: 10  # Prevents creating 100 workers for 20,000 items

Considerations:

  • Database connection pool size
  • Memory per worker
  • External API rate limits
  • CPU cores available
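
A minimal sketch of deriving an effective worker cap from these constraints; the resource numbers are illustrative assumptions, not Tasker defaults:

/// Cap worker count by the tightest resource constraint (illustrative values).
fn effective_max_workers(configured_max: u64) -> u64 {
    let db_pool_headroom = 10;       // connections we can afford to dedicate
    let memory_budget_workers = 8;   // workers that fit in available memory
    let api_rate_limit_workers = 12; // concurrent callers the upstream API tolerates
    let cpu_cores = std::thread::available_parallelism()
        .map(|n| n.get() as u64)
        .unwrap_or(4);

    configured_max
        .min(db_pool_headroom)
        .min(memory_budget_workers)
        .min(api_rate_limit_workers)
        .min(cpu_cores)
}

fn main() {
    println!("effective max_workers: {}", effective_max_workers(20));
}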

3. Cursor Design

Guideline: Use cursors that support resumability.

Good Cursor Types:

  • ✅ Integer offsets: start_cursor: 1000, end_cursor: 2000
  • ✅ Timestamps: start_cursor: "2025-01-01T00:00:00Z"
  • ✅ Database IDs: start_cursor: uuid_a, end_cursor: uuid_b
  • ✅ Composite keys: { date: "2025-01-01", partition: "US-WEST" }

Bad Cursor Types:

  • ❌ Page numbers (data can shift between pages)
  • ❌ Non-deterministic ordering (random, relevance scores)
  • ❌ Mutable values (last_modified_at can change)
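
As an illustration, a timestamp-range cursor expressed in the same CursorConfig shape used by the CSV example above (the struct is re-declared here only to keep the sketch self-contained):

use serde_json::json;

/// Mirrors the CursorConfig shape from the CSV example (field names assumed from that example).
struct CursorConfig {
    batch_id: String,
    start_cursor: serde_json::Value,
    end_cursor: serde_json::Value,
    batch_size: u32,
}

fn main() {
    // One worker per day of data: deterministic ordering, immutable boundaries, resumable.
    let day_batch = CursorConfig {
        batch_id: "001".to_string(),
        start_cursor: json!("2025-01-01T00:00:00Z"),
        end_cursor: json!("2025-01-02T00:00:00Z"),
        batch_size: 5_000, // rough estimate of items expected in this range
    };
    println!(
        "batch {} covers {} .. {} (~{} items)",
        day_batch.batch_id, day_batch.start_cursor, day_batch.end_cursor, day_batch.batch_size
    );
}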

4. Checkpoint Frequency

Guideline: Balance resumability with performance.

// Checkpoint every 100 items
if processed_count % 100 == 0 {
    checkpoint_progress(step_uuid, current_cursor).await?;
}

Factors:

  • Item processing time (faster = higher frequency)
  • Worker failure rate (higher = more frequent checkpoints)
  • Database write overhead (less frequent = better performance)

Recommended:

  • Fast items (< 10ms each): Checkpoint every 1000 items
  • Medium items (10-100ms each): Checkpoint every 100 items
  • Slow items (> 100ms each): Checkpoint every 10 items
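
A minimal sketch that turns these tiers into a lookup; the helper is hypothetical, but the thresholds are the ones listed above:

/// Choose a checkpoint interval from observed per-item latency.
fn checkpoint_interval(avg_item_ms: f64) -> u64 {
    if avg_item_ms < 10.0 {
        1_000 // fast items: checkpoint every 1000 items
    } else if avg_item_ms <= 100.0 {
        100 // medium items: checkpoint every 100 items
    } else {
        10 // slow items: checkpoint every 10 items
    }
}

fn main() {
    assert_eq!(checkpoint_interval(3.0), 1_000);
    assert_eq!(checkpoint_interval(45.0), 100);
    assert_eq!(checkpoint_interval(250.0), 10);
}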

5. Error Handling Strategies

FailFast (default):

FailureStrategy::FailFast
  • Worker fails immediately on first error
  • Suitable for: Data validation, schema violations
  • The cursor is preserved for retry

ContinueOnFailure:

FailureStrategy::ContinueOnFailure
  • Worker processes all items, collects errors
  • Suitable for: Best-effort processing, partial results acceptable
  • Returns both results and error list

IsolateFailed:

FailureStrategy::IsolateFailed
  • Failed items moved to separate queue
  • Suitable for: Large batches with few expected failures
  • Allows manual review of failed items
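
As a sketch of what ContinueOnFailure means inside a worker loop (the item type and process_item below are placeholders, not Tasker APIs):

struct ItemError {
    index: usize,
    message: String,
}

// Placeholder item processor: parse a number, failing on malformed input.
fn process_item(item: &str) -> Result<u64, String> {
    item.parse::<u64>().map_err(|e| e.to_string())
}

fn main() {
    let items = ["10", "not-a-number", "32"];
    let mut results = Vec::new();
    let mut errors = Vec::new();

    // ContinueOnFailure: keep processing, collect both successes and failures
    for (index, item) in items.iter().enumerate() {
        match process_item(item) {
            Ok(value) => results.push(value),
            Err(message) => errors.push(ItemError { index, message }),
        }
    }

    println!("processed {} items, {} failures", results.len(), errors.len());
    for e in &errors {
        println!("item {} failed: {}", e.index, e.message);
    }
}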

6. Aggregation Patterns

Sum/Count:

let total = batch_results.iter()
    .map(|(_, r)| r.result.get("count").unwrap().as_u64().unwrap())
    .sum::<u64>();

Max/Min:

let max_value = batch_results.iter()
    .filter_map(|(_, r)| r.result.get("max").and_then(|v| v.as_f64()))
    .max_by(|a, b| a.partial_cmp(b).unwrap())
    .unwrap_or(0.0);

Weighted Average:

let total_weight: u64 = weighted_values.iter().map(|(w, _)| w).sum();
let weighted_avg = weighted_values.iter()
    .map(|(weight, value)| (*weight as f64) * value)
    .sum::<f64>() / (total_weight as f64);

Merge HashMaps:

let mut merged = HashMap::new();
for (_, result) in batch_results {
    if let Some(counts) = result.get("counts").and_then(|v| v.as_object()) {
        for (key, count) in counts {
            *merged.entry(key.clone()).or_insert(0) += count.as_u64().unwrap();
        }
    }
}

7. Testing Strategies

Unit Tests: Test handler logic independently

#[test]
fn test_cursor_generation() {
    let configs = create_cursor_configs(1000, 5, 200);
    assert_eq!(configs.len(), 5);
    assert_eq!(configs[0].start_cursor, json!(0));
    assert_eq!(configs[0].end_cursor, json!(200));
}

Integration Tests: Test with small datasets

#[tokio::test]
async fn test_batch_processing_integration() {
    let task = create_task_with_csv("test_data_10_rows.csv").await;
    assert_eq!(task.current_state, TaskState::Complete);

    let steps = get_workflow_steps(task.task_uuid).await;
    let workers = steps.iter().filter(|s| s.step_type == "batch_worker").count();
    assert_eq!(workers, 1); // 10 rows = 1 worker with batch_size 200
}

E2E Tests: Test complete workflow with realistic data

#[tokio::test]
async fn test_csv_batch_processing_e2e() {
    let task = create_task_with_csv("products_1000_rows.csv").await;
    wait_for_completion(task.task_uuid, Duration::from_secs(60)).await;

    let results = get_aggregation_results(task.task_uuid).await;
    assert_eq!(results["total_processed"], 1000);
    assert_eq!(results["worker_count"], 5);
}

8. Monitoring and Observability

Metrics to Track:

  • Worker creation time
  • Individual worker duration
  • Batch size distribution
  • Retry rate per batch
  • Aggregation duration

Recommended Dashboards:

-- Batch processing health
SELECT
    COUNT(*) FILTER (WHERE step_type = 'batch_worker') as total_workers,
    AVG(EXTRACT(EPOCH FROM (updated_at - created_at))) as avg_worker_duration_sec,
    MAX(EXTRACT(EPOCH FROM (updated_at - created_at))) as max_worker_duration_sec,
    COUNT(*) FILTER (WHERE current_state = 'Error') as failed_workers
FROM tasker.workflow_steps
WHERE task_uuid = :task_uuid
  AND step_type = 'batch_worker';

Summary

Batch processing in Tasker provides a robust, production-ready pattern for parallel dataset processing:

Key Strengths:

  • ✅ Builds on proven DAG, retry, and deferred convergence foundations
  • ✅ No special recovery system needed (uses standard DLQ + retry)
  • ✅ Transaction-based worker creation prevents corruption
  • ✅ Cursor-based resumability enables long-running processing
  • ✅ Language-agnostic design works across Rust and Ruby workers

Integration Points:

  • DAG: Workers are full nodes with standard lifecycle
  • Retryability: Uses lifecycle.max_retries and exponential backoff
  • Deferred Convergence: Intersection semantics aggregate dynamic worker counts
  • DLQ: Standard operator workflows with cursor preservation

Production Readiness:

  • 908 tests passing (Ruby workers)
  • Real-world CSV processing (1000 rows)
  • Docker integration working
  • Code review complete with recommended fixes

For More Information:

  • Conditional Workflows: See docs/conditional-workflows.md
  • DLQ System: See docs/dlq-system.md
  • Code Examples: See workers/rust/src/step_handlers/batch_processing_*.rs

Caching Guide

This guide covers Tasker’s distributed caching system, including configuration, backend selection, circuit breaker protection, and operational considerations.

Overview

Tasker provides optional caching for:

  • Task Templates: Reduces database queries when loading workflow definitions
  • Analytics: Caches performance metrics and bottleneck analysis results

Caching is disabled by default and must be explicitly enabled in configuration.

Configuration

Basic Setup

[common.cache]
enabled = true
backend = "redis"              # or "dragonfly" / "moka" / "memory" / "in-memory"
default_ttl_seconds = 3600     # 1 hour default
template_ttl_seconds = 3600    # 1 hour for templates
analytics_ttl_seconds = 60     # 1 minute for analytics
key_prefix = "tasker"          # Namespace for cache keys

[common.cache.redis]
url = "${REDIS_URL:-redis://localhost:6379}"
max_connections = 10
connection_timeout_seconds = 5
database = 0

[common.cache.moka]
max_capacity = 10000           # Maximum entries in cache

Backend Selection

| Backend   | Config Value                  | Use Case                                                    |
|-----------|-------------------------------|-------------------------------------------------------------|
| Redis     | "redis"                       | Multi-instance deployments (production)                     |
| Dragonfly | "dragonfly"                   | Redis-compatible with better multi-threaded performance     |
| Memcached | "memcached"                   | Simple distributed cache (requires cache-memcached feature) |
| Moka      | "moka", "memory", "in-memory" | Single-instance, development, DoS protection                |
| NoOp      | (enabled = false)             | Disabled, always-miss                                       |

Cache Backends

Redis (Distributed)

Redis is the recommended backend for production deployments:

  • Shared state: All instances see the same cache entries
  • Invalidation works: Worker bootstrap invalidations propagate to all instances
  • Persistence: Survives process restarts (if Redis is configured for persistence)
[common.cache]
enabled = true
backend = "redis"

[common.cache.redis]
url = "redis://redis.internal:6379"

Dragonfly (Distributed)

Dragonfly is a Redis-compatible in-memory data store with better multi-threaded performance. It uses the same port (6379) and protocol as Redis, so no code changes are required.

  • Redis compatible: Drop-in replacement for Redis
  • Better performance: Multi-threaded architecture for higher throughput
  • Shared state: Same distributed semantics as Redis
[common.cache]
enabled = true
backend = "dragonfly"  # Uses Redis provider internally

[common.cache.redis]
url = "redis://dragonfly.internal:6379"

Note: Dragonfly is used in Tasker’s test and CI environments for improved performance. For production, either Redis or Dragonfly works.

Memcached (Distributed)

Memcached is a simple, high-performance distributed cache. It requires the cache-memcached feature flag (not enabled by default).

  • Simple protocol: Lightweight key-value store
  • Distributed: State is shared across instances
  • No pattern deletion: Relies on TTL expiry (like Moka)
[common.cache]
enabled = true
backend = "memcached"

[common.cache.memcached]
url = "tcp://memcached.internal:11211"
connection_timeout_seconds = 5

Note: Enable with cargo build --features cache-memcached. Not enabled by default to reduce dependency footprint.

Moka (In-Memory)

Moka provides a high-performance in-memory cache:

  • Zero network latency: All operations are in-process
  • DoS protection: Rate-limits expensive operations without Redis dependency
  • Single-instance only: Cache is not shared across processes
[common.cache]
enabled = true
backend = "moka"

[common.cache.moka]
max_capacity = 10000

Important: Moka is only suitable for:

  • Single-instance deployments
  • Development environments
  • Analytics caching (where brief staleness is acceptable)

NoOp (Disabled)

When caching is disabled or a backend fails to initialize:

[common.cache]
enabled = false

The NoOp provider always returns cache misses and succeeds on writes (no-op). It is also used as a graceful fallback when the Redis connection fails.

Circuit Breaker Protection

The cache circuit breaker prevents repeated timeout penalties when Redis/Dragonfly is unavailable. Instead of waiting for connection timeouts on every request, the circuit breaker fails fast after detecting failures.

Configuration

[common.circuit_breakers.component_configs.cache]
failure_threshold = 5    # Open after 5 consecutive failures
timeout_seconds = 15     # Test recovery after 15 seconds
success_threshold = 2    # Close after 2 successful calls

Behavior When Circuit is Open

When the circuit breaker is open (cache unavailable):

| Operation      | Behavior                   |
|----------------|----------------------------|
| get()          | Returns None (cache miss)  |
| set()          | Returns Ok(()) (no-op)     |
| delete()       | Returns Ok(()) (no-op)     |
| health_check() | Returns false (unhealthy)  |

This fail-fast behavior ensures:

  1. Requests don’t wait for connection timeouts
  2. Database queries still work (cache miss → DB fallback)
  3. Recovery is automatic when Redis/Dragonfly becomes available
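
A minimal sketch of the fail-fast idea, using a simplified stand-in for the circuit state rather than Tasker's actual types:

#[derive(Debug, PartialEq)]
enum CircuitState { Closed, Open }

/// When the circuit is open, skip the backend call and report a miss so the
/// caller falls through to the database instead of waiting on a timeout.
fn cached_get(state: &CircuitState, key: &str) -> Option<String> {
    if *state == CircuitState::Open {
        return None; // fail fast: treated as a cache miss
    }
    let _ = key; // the real backend lookup would happen here
    None
}

fn main() {
    for state in [CircuitState::Closed, CircuitState::Open] {
        match cached_get(&state, "tasker:analytics:performance:24") {
            Some(v) => println!("cache hit: {v}"),
            None => println!("cache miss ({state:?}), querying database"),
        }
    }
}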

Circuit States

| State     | Description                                 |
|-----------|---------------------------------------------|
| Closed    | Normal operation, all calls go through      |
| Open      | Failing fast, calls return fallback values  |
| Half-Open | Testing recovery, limited calls allowed     |

Monitoring

Circuit state is logged at state transitions:

INFO  Circuit breaker half-open (testing recovery)
INFO  Circuit breaker closed (recovered)
ERROR Circuit breaker opened (failing fast)

Usage Context Constraints

Different caching use cases have different consistency requirements. Tasker enforces these constraints at runtime:

Template Caching

Constraint: Requires distributed cache (Redis) or no cache (NoOp)

Templates are cached to avoid repeated database queries when loading workflow definitions. However, workers invalidate the template cache on bootstrap when they register new handler versions.

If an in-memory cache (Moka) is used:

  1. Orchestration server caches templates in its local memory
  2. Worker boots and invalidates templates in Redis (or nowhere, if Moka)
  3. Orchestration server never sees the invalidation
  4. Stale templates are served → operational errors

Behavior with Moka: Template caching is automatically disabled with a warning:

WARN Cache provider 'moka' is not safe for template caching (in-memory cache
     would drift from worker invalidations). Template caching disabled.

Analytics Caching

Constraint: Any backend allowed

Analytics data is informational and TTL-bounded. Brief staleness is acceptable, and in-memory caching provides DoS protection for expensive aggregation queries.

Behavior with Moka: Analytics caching works normally.

Cache Keys

Cache keys are prefixed with the configured key_prefix to allow multiple Tasker deployments to share a Redis instance:

| Resource            | Key Pattern                                             |
|---------------------|---------------------------------------------------------|
| Templates           | {prefix}:template:{namespace}:{name}:{version}          |
| Performance Metrics | {prefix}:analytics:performance:{hours}                  |
| Bottleneck Analysis | {prefix}:analytics:bottlenecks:{limit}:{min_executions} |
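
A sketch of how these keys compose; the helper functions are hypothetical, but the patterns match the table above:

fn template_key(prefix: &str, namespace: &str, name: &str, version: &str) -> String {
    format!("{prefix}:template:{namespace}:{name}:{version}")
}

fn performance_key(prefix: &str, hours: u32) -> String {
    format!("{prefix}:analytics:performance:{hours}")
}

fn main() {
    assert_eq!(
        template_key("tasker", "csv_processing", "csv_product_inventory_analyzer", "1.0.0"),
        "tasker:template:csv_processing:csv_product_inventory_analyzer:1.0.0"
    );
    println!("{}", performance_key("tasker", 24));
}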

Operational Patterns

Multi-Instance Production

[common.cache]
enabled = true
backend = "redis"
template_ttl_seconds = 3600    # Long TTL, rely on invalidation
analytics_ttl_seconds = 60     # Short TTL for fresh data
  • Templates cached for 1 hour but invalidated on worker registration
  • Analytics cached briefly to reduce database load

Single-Instance / Development

[common.cache]
enabled = true
backend = "moka"
template_ttl_seconds = 300     # Shorter TTL since no invalidation
analytics_ttl_seconds = 30
  • Template caching automatically disabled (Moka constraint)
  • Analytics caching works, provides DoS protection

Caching Disabled

[common.cache]
enabled = false
  • All cache operations are no-ops
  • Every request hits the database
  • Useful for debugging or when cache adds complexity without benefit

Graceful Degradation

Tasker never fails to start due to cache issues:

  1. Redis connection failure: Falls back to NoOp with warning
  2. Backend misconfiguration: Falls back to NoOp with warning
  3. Cache operation errors: Logged as warnings, never propagated
WARN Failed to connect to Redis, falling back to NoOp cache (graceful degradation)

The cache layer uses “best-effort” writes—failures are logged but never block request processing.
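
A minimal sketch of a best-effort write, assuming a stand-in backend call rather than the real cache API:

/// Log-and-continue on cache write failure; never bubble the error up.
fn best_effort_set(key: &str, value: &str) {
    match try_set(key, value) {
        Ok(()) => {}
        Err(err) => eprintln!("WARN cache set failed for {key}: {err} (continuing without cache)"),
    }
}

// Stand-in for the real backend call; always fails here to show the fallback path.
fn try_set(_key: &str, _value: &str) -> Result<(), String> {
    Err("connection refused".to_string())
}

fn main() {
    best_effort_set("tasker:analytics:performance:24", "{}"); // placeholder payload
    println!("request processing continues");
}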

Monitoring

Cache Hit/Miss Rates

Cache operations are logged at DEBUG level:

DEBUG hours=24 "Performance metrics cache HIT"
DEBUG hours=24 "Performance metrics cache MISS, querying DB"

Provider Status

On startup, the active cache provider is logged:

INFO backend="redis" "Distributed cache provider initialized successfully"
INFO backend="moka" max_capacity=10000 "In-memory cache provider initialized"
INFO "Distributed cache disabled by configuration"

Troubleshooting

Templates Not Caching

  1. Check if backend is Moka—template caching is disabled with Moka
  2. Check for Redis connection warnings in logs
  3. Verify enabled = true in configuration

Stale Templates Being Served

  1. Verify all instances point to the same Redis
  2. Check that workers are properly invalidating on bootstrap
  3. Consider reducing template_ttl_seconds

High Cache Miss Rate

  1. Check Redis connectivity and latency
  2. Verify TTL settings aren’t too aggressive
  3. Check for cache key collisions (multiple deployments, same prefix)

Memory Growth with Moka

  1. Reduce max_capacity setting
  2. Check TTL settings—items evict on TTL or capacity limit
  3. Monitor entry count if metrics are available

Conditional Workflows and Decision Points

Last Updated: 2025-10-27 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | Use Cases & Patterns | States and Lifecycles

← Back to Documentation Hub


Overview

Conditional workflows enable runtime decision-making that dynamically determines which workflow steps to execute based on business logic. Unlike static DAG workflows where all steps are predefined, conditional workflows use decision point steps to create steps on-demand based on runtime conditions.

Dynamic Workflow Decision Points provide this capability through:

  • Decision Point Steps: Special step type that evaluates business logic and returns step names to create
  • Deferred Steps: Step type with dynamic dependency resolution using intersection semantics
  • Type-Safe Integration: Ruby and Rust helpers ensuring clean serialization between languages

Table of Contents

  1. When to Use Conditional Workflows
  2. Logical Pattern
  3. Architecture and Implementation
  4. YAML Configuration
  5. Simple Example: Approval Routing
  6. Complex Example: Multi-Tier Approval
  7. Ruby Implementation Guide
  8. Rust Implementation Guide
  9. Best Practices
  10. Limitations and Constraints

When to Use Conditional Workflows

✅ Use Conditional Workflows When:

1. Business Logic Determines Execution Path

  • Approval workflows with amount-based routing (small/medium/large)
  • Risk-based processing (low/medium/high risk paths)
  • Tiered customer service (bronze/silver/gold/platinum)
  • Regulatory compliance with jurisdictional variations

2. Step Requirements Are Unknown Until Runtime

  • Dynamic validation checks based on request type
  • Multi-stage approvals where approval count depends on amount
  • Conditional enrichment steps based on data completeness
  • Parallel processing with variable worker count

3. Workflow Complexity Varies By Input

  • Simple cases skip expensive steps
  • Complex cases trigger additional validation
  • Emergency processing bypasses normal checks
  • VIP customers get expedited handling

❌ Don’t Use Conditional Workflows When:

1. Static DAG is Sufficient

  • All possible execution paths known at design time
  • Complexity overhead not justified
  • Simple if/else can be handled in handler code

2. Purely Sequential Logic

  • No parallelism or branching needed
  • Handler code can make decisions directly

3. Real-Time Sub-Second Decisions

  • Decision overhead (~10-20ms) not acceptable
  • In-memory processing required

Logical Pattern

Core Concepts

Task Initialization
       ↓
Regular Step(s)
       ↓
Decision Point Step ← Evaluates business logic
       ↓
   [Decision Made]
       ↓
   ┌───┴───┐
   ↓       ↓
Path A  Path B  ← Steps created dynamically
   ↓       ↓
   └───┬───┘
       ↓
Convergence Step ← Deferred dependencies resolve via intersection
       ↓
Task Complete

Decision Point Pattern

  1. Evaluation Phase: Decision point step executes handler
  2. Decision Output: Handler returns list of step names to create
  3. Dynamic Creation: Orchestration creates specified steps with proper dependencies
  4. Execution: Created steps execute like normal steps
  5. Convergence: Deferred steps wait for intersection of declared dependencies + created steps

Intersection Semantics for Deferred Steps

Declared Dependencies (in template):

- step_a
- step_b
- step_c

Actually Created Steps (by decision point):

Only step_a and step_c were created

Effective Dependencies (intersection):

step_a AND step_c  (step_b ignored since not created)

This enables convergence steps that work regardless of which path was taken.
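
A minimal sketch of the intersection computation (set types chosen for illustration; the orchestrator's internal representation differs):

use std::collections::BTreeSet;

/// Effective dependencies = declared dependencies ∩ steps that were actually created.
fn effective_dependencies(declared: &BTreeSet<String>, created: &BTreeSet<String>) -> BTreeSet<String> {
    declared.intersection(created).cloned().collect()
}

fn main() {
    let declared: BTreeSet<String> =
        ["step_a", "step_b", "step_c"].iter().map(|s| s.to_string()).collect();
    let created: BTreeSet<String> =
        ["step_a", "step_c"].iter().map(|s| s.to_string()).collect();

    // step_b was never created, so the deferred step waits only on step_a and step_c
    let expected: BTreeSet<String> =
        ["step_a", "step_c"].iter().map(|s| s.to_string()).collect();
    assert_eq!(effective_dependencies(&declared, &created), expected);
}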


Architecture and Implementation

Step Type: Decision Point

Decision point steps are regular steps with a special handler that returns a DecisionPointOutcome:

pub enum DecisionPointOutcome {
    NoBranches,               // No additional steps needed
    CreateSteps {             // Dynamically create these steps
        step_names: Vec<String>,
    },
}

Key Characteristics:

  • Executes like a normal step
  • Result includes decision_point_outcome field
  • Orchestration detects outcome and creates steps
  • Created steps depend on the decision point step
  • Fully atomic - either all steps created or none

Step Type: Deferred

Deferred steps use intersection semantics for dependency resolution:

type: deferred  # Special step type
dependencies:
  - routing_decision  # Must wait for decision point
  - step_a           # Might be created
  - step_b           # Might be created
  - step_c           # Might be created

Resolution Logic:

  1. Wait for decision point to complete
  2. Check which declared dependencies actually exist
  3. Wait only for intersection of declared + created
  4. Execute when all existing dependencies complete

Orchestration Flow

┌─────────────────────────────────────────┐
│ Step Result Processor                   │
│                                         │
│ 1. Check if result has                  │
│    decision_point_outcome field         │
│                                         │
│ 2. If CreateSteps:                      │
│    - Validate step names exist          │
│    - Create WorkflowStep records        │
│    - Set dependencies                   │
│    - Enqueue for execution              │
│                                         │
│ 3. If NoBranches:                       │
│    - Continue normally                  │
│                                         │
│ 4. Metrics and telemetry:               │
│    - Track steps_created count          │
│    - Log decision outcome               │
│    - Warn if depth limit approached     │
└─────────────────────────────────────────┘

Configuration

Decision point behavior is configured per environment:

# config/tasker/base/orchestration.toml
[orchestration.decision_points]
enabled = true
max_depth = 3           # Prevent infinite recursion
warn_threshold = 2      # Warn when nearing limit

YAML Configuration

Task Template Structure

Actual Implementation (from tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml):

---
name: approval_routing
namespace_name: conditional_approval
version: 1.0.0
description: >
  Ruby implementation of conditional approval workflow demonstrating dynamic decision points.
  Routes approval requests through different paths based on amount thresholds.
task_handler:
  callable: tasker_worker_ruby::TaskHandler
  initialization: {}
steps:
  - name: validate_request
    type: standard
    dependencies: []
    handler:
      callable: ConditionalApproval::StepHandlers::ValidateRequestHandler
      initialization: {}

  - name: routing_decision
    type: decision  # DECISION POINT
    dependencies:
      - validate_request
    handler:
      callable: ConditionalApproval::StepHandlers::RoutingDecisionHandler
      initialization: {}

  - name: finalize_approval
    type: deferred  # DEFERRED - uses intersection semantics
    dependencies:
      - auto_approve       # ALL possible dependencies listed
      - manager_approval   # System computes intersection at runtime
      - finance_review
    handler:
      callable: ConditionalApproval::StepHandlers::FinalizeApprovalHandler
      initialization: {}

  # Possible dynamic branches (created by decision point)
  - name: auto_approve
    type: standard
    dependencies:
      - routing_decision
    handler:
      callable: ConditionalApproval::StepHandlers::AutoApproveHandler
      initialization: {}

  - name: manager_approval
    type: standard
    dependencies:
      - routing_decision
    handler:
      callable: ConditionalApproval::StepHandlers::ManagerApprovalHandler
      initialization: {}

  - name: finance_review
    type: standard
    dependencies:
      - routing_decision
    handler:
      callable: ConditionalApproval::StepHandlers::FinanceReviewHandler
      initialization: {}

Key Points:

  • type: decision marks the decision point step
  • type: deferred enables intersection semantics for convergence
  • ALL possible dependencies listed in deferred step
  • Orchestration computes: declared deps ∩ actually created steps

Simple Example: Approval Routing

Business Requirement

Route approval requests based on amount:

  • < $1,000: Auto-approve (no human intervention)
  • $1,000 - $4,999: Manager approval required
  • ≥ $5,000: Manager + Finance approval required

Template Configuration

namespace: approval_workflows
name: simple_routing
version: "1.0"

steps:
  - name: validate_request
    handler: validate_request

  - name: routing_decision
    handler: routing_decision
    type: decision
    dependencies:
      - validate_request

  - name: auto_approve
    handler: auto_approve
    dependencies:
      - routing_decision

  - name: manager_approval
    handler: manager_approval
    dependencies:
      - routing_decision

  - name: finance_review
    handler: finance_review
    dependencies:
      - routing_decision

  - name: finalize_approval
    handler: finalize_approval
    type: deferred
    dependencies:
      - routing_decision
      - auto_approve
      - manager_approval
      - finance_review

Ruby Handler Implementation

Actual Implementation (from workers/ruby/spec/handlers/examples/conditional_approval/step_handlers/routing_decision_handler.rb):

# frozen_string_literal: true

module ConditionalApproval
  module StepHandlers
    # Routing Decision: DECISION POINT that routes approval based on amount
    #
    # Uses TaskerCore::StepHandler::Decision base class for clean, type-safe
    # decision outcome serialization consistent with Rust expectations.
    class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
      SMALL_AMOUNT_THRESHOLD = 1_000
      LARGE_AMOUNT_THRESHOLD = 5_000

      def call(task, _sequence, _step)
        # Get amount from validated request
        amount = task.context['amount']
        raise 'Amount is required for routing decision' unless amount

        # Make routing decision based on amount
        route = determine_route(amount)

        # Use Decision base class helper for clean outcome serialization
        decision_success(
          steps: route[:steps],
          result_data: {
            route_type: route[:type],
            reasoning: route[:reasoning],
            amount: amount
          },
          metadata: {
            operation: 'routing_decision',
            route_thresholds: {
              small: SMALL_AMOUNT_THRESHOLD,
              large: LARGE_AMOUNT_THRESHOLD
            }
          }
        )
      end

      private

      def determine_route(amount)
        if amount < SMALL_AMOUNT_THRESHOLD
          {
            type: 'auto_approval',
            steps: ['auto_approve'],
            reasoning: "Amount $#{amount} below threshold - auto-approval"
          }
        elsif amount < LARGE_AMOUNT_THRESHOLD
          {
            type: 'manager_only',
            steps: ['manager_approval'],
            reasoning: "Amount $#{amount} requires manager approval"
          }
        else
          {
            type: 'dual_approval',
            steps: %w[manager_approval finance_review],
            reasoning: "Amount $#{amount} >= $#{LARGE_AMOUNT_THRESHOLD} - dual approval required"
          }
        end
      end
    end
  end
end

Key Ruby Patterns:

  • Inherit from TaskerCore::StepHandler::Decision - Specialized base class for decision points
  • Use helper method decision_success(steps:, result_data:, metadata:) - Clean API for decision outcomes
  • Helper automatically creates DecisionPointOutcome and embeds it correctly
  • No manual serialization needed - base class handles Rust compatibility
  • For no-branch scenarios, use decision_no_branches(result_data:, metadata:)

Execution Flow Examples

Example 1: Small Amount ($500)

1. validate_request → Complete
2. routing_decision → Complete (creates: auto_approve)
3. auto_approve     → Complete
4. finalize_approval → Complete
   (waits for: declared deps ∩ created steps = {auto_approve})

Total Steps Created: 4
Execution Time: ~500ms

Example 2: Medium Amount ($2,500)

1. validate_request  → Complete
2. routing_decision  → Complete (creates: manager_approval)
3. manager_approval  → Complete
4. finalize_approval → Complete
   (waits for: declared deps ∩ created steps = {manager_approval})

Total Steps Created: 4
Execution Time: ~2s (human approval delay)

Example 3: Large Amount ($10,000)

1. validate_request  → Complete
2. routing_decision  → Complete (creates: manager_approval, finance_review)
3. manager_approval  → Complete (parallel)
3. finance_review    → Complete (parallel)
4. finalize_approval → Complete
   (waits for: declared deps ∩ created steps = {manager_approval, finance_review})

Total Steps Created: 5
Execution Time: ~3s (parallel approvals)

Complex Example: Multi-Tier Approval

Business Requirement

Implement sophisticated approval routing with:

  • Risk assessment step
  • Tiered approval requirements
  • Emergency override path
  • Compliance checks based on jurisdiction

Template Configuration

namespace: approval_workflows
name: multi_tier_approval
version: "1.0"

steps:
  # Phase 1: Initial validation and risk assessment
  - name: validate_request
    handler: validate_request

  - name: assess_risk
    handler: assess_risk
    dependencies:
      - validate_request

  # Phase 2: Primary routing decision
  - name: primary_routing
    handler: primary_routing
    type: decision_point
    dependencies:
      - assess_risk

  # Phase 3: Conditional approval paths
  - name: emergency_approval
    handler: emergency_approval
    dependencies:
      - primary_routing

  - name: standard_manager_approval
    handler: standard_manager_approval
    dependencies:
      - primary_routing

  - name: senior_manager_approval
    handler: senior_manager_approval
    dependencies:
      - primary_routing

  # Phase 4: Secondary routing for high-risk cases
  - name: compliance_routing
    handler: compliance_routing
    type: decision_point
    dependencies:
      - primary_routing
      - senior_manager_approval  # Only if created

  # Phase 5: Compliance paths
  - name: legal_review
    handler: legal_review
    dependencies:
      - compliance_routing

  - name: fraud_investigation
    handler: fraud_investigation
    dependencies:
      - compliance_routing

  - name: jurisdictional_check
    handler: jurisdictional_check
    dependencies:
      - compliance_routing

  # Phase 6: Convergence
  - name: finalize_approval
    handler: finalize_approval
    type: deferred
    dependencies:
      - primary_routing
      - emergency_approval
      - standard_manager_approval
      - senior_manager_approval
      - compliance_routing
      - legal_review
      - fraud_investigation
      - jurisdictional_check

Ruby Handler: Primary Routing

class PrimaryRoutingHandler < TaskerCore::StepHandler::Decision
  def call(task, sequence, _step)
    amount = task.context['amount']
    risk_score = sequence.get_results('assess_risk')['risk_score']
    is_emergency = task.context['emergency'] == true

    steps_to_create = if is_emergency && amount < 10_000
      # Emergency override path
      ['emergency_approval']
    elsif risk_score < 30 && amount < 5_000
      # Low risk, standard approval
      ['standard_manager_approval']
    else
      # High risk or large amount - senior approval + compliance routing
      ['senior_manager_approval', 'compliance_routing']
    end

    decision_success(
      steps: steps_to_create,
      result_data: {
        route_type: determine_route_type(is_emergency, risk_score, amount),
        risk_score: risk_score,
        amount: amount,
        emergency: is_emergency
      }
    )
  end

  private

  # Illustrative helper (not part of the quoted excerpt): labels the chosen route
  def determine_route_type(is_emergency, risk_score, amount)
    return 'emergency_override' if is_emergency && amount < 10_000
    return 'standard_approval' if risk_score < 30 && amount < 5_000

    'senior_with_compliance'
  end
end

Ruby Handler: Compliance Routing (Nested Decision)

class ComplianceRoutingHandler < TaskerCore::StepHandler::Decision
  def call(task, sequence, _step)
    amount = task.context['amount']
    risk_score = sequence.get_results('assess_risk')['risk_score']
    jurisdiction = task.context['jurisdiction']

    steps_to_create = []

    # Large amounts always need legal review
    steps_to_create << 'legal_review' if amount >= 50_000

    # High risk triggers fraud investigation
    steps_to_create << 'fraud_investigation' if risk_score >= 70

    # Certain jurisdictions need special checks
    steps_to_create << 'jurisdictional_check' if high_regulation_jurisdiction?(jurisdiction)

    if steps_to_create.empty?
      # No additional compliance steps needed
      decision_no_branches(
        result_data: { reason: 'no_compliance_requirements' }
      )
    else
      decision_success(
        steps: steps_to_create,
        result_data: {
          compliance_level: 'enhanced',
          checks_required: steps_to_create
        }
      )
    end
  end

  private

  def high_regulation_jurisdiction?(jurisdiction)
    %w[EU UK APAC].include?(jurisdiction)
  end
end

Execution Scenarios

Scenario 1: Emergency Low-Risk Request ($5,000)

Path: validate → assess_risk → primary_routing → emergency_approval → finalize
Steps Created: 5
Decision Points: 1 (primary_routing creates emergency_approval)
Complexity: Low

Scenario 2: Standard Medium-Risk Request ($3,000, Risk 25)

Path: validate → assess_risk → primary_routing → standard_manager_approval → finalize
Steps Created: 5
Decision Points: 1 (primary_routing creates standard_manager_approval)
Complexity: Low

Scenario 3: High-Risk Large Amount ($75,000, Risk 80, EU)

Path: validate → assess_risk → primary_routing → senior_manager_approval + compliance_routing
      → legal_review + fraud_investigation + jurisdictional_check → finalize
Steps Created: 9
Decision Points: 2 (primary_routing → compliance_routing)
Complexity: High (nested decisions)

Ruby Implementation Guide

Using the Decision Base Class

The TaskerCore::StepHandler::Decision base class provides type-safe helpers:

class MyDecisionHandler < TaskerCore::StepHandler::Decision
  def call(context)
    # Your business logic here
    amount = context.get_task_field('amount')

    if amount < 1000
      # Create single step
      decision_success(
        steps: 'auto_approve',  # Can pass string or array
        result_data: { route: 'auto' }
      )
    elsif amount < 5000
      # Create multiple steps
      decision_success(
        steps: ['manager_approval', 'risk_check'],
        result_data: { route: 'standard' }
      )
    else
      # No additional steps needed
      decision_no_branches(
        result_data: { route: 'none', reason: 'manual_review_required' }
      )
    end
  end
end

Helper Methods

decision_success(steps:, result_data: {}, metadata: {})

  • Creates steps dynamically
  • steps: String or Array of step names
  • result_data: Additional data to store in step results
  • metadata: Observability metadata

decision_no_branches(result_data: {}, metadata: {})

  • No additional steps created
  • Workflow proceeds to next static step

decision_with_custom_outcome(outcome:, result_data: {}, metadata: {})

  • Advanced: Full control over outcome structure
  • Most handlers should use decision_success or decision_no_branches

validate_decision_outcome!(outcome)

  • Validates custom outcome structure
  • Raises error if invalid
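
For completeness, here is a hedged sketch of the advanced helpers inside a handler's call method, assuming decision_with_custom_outcome accepts a DecisionPointOutcome built with the factory shown in the next section. Most handlers should not need this:

# Hedged sketch: advanced outcome helpers (prefer decision_success / decision_no_branches)
outcome = TaskerCore::Types::DecisionPointOutcome.create_steps(['manager_approval'])

validate_decision_outcome!(outcome)  # raises if the outcome structure is invalid

decision_with_custom_outcome(
  outcome: outcome,
  result_data: { route: 'custom_escalation' },
  metadata: { operation: 'custom_routing' }
)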

Type Definitions

# workers/ruby/lib/tasker_core/types/decision_point_outcome.rb

module TaskerCore
  module Types
    module DecisionPointOutcome
      # Factory methods
      def self.no_branches
        NoBranches.new
      end

      def self.create_steps(step_names)
        CreateSteps.new(step_names: step_names)
      end

      # Serialization format (matches Rust)
      class NoBranches
        def to_h
          { type: 'no_branches' }
        end
      end

      class CreateSteps
        def to_h
          { type: 'create_steps', step_names: step_names }
        end
      end
    end
  end
end
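
A brief usage sketch of these factory methods (assuming the accessors omitted from the excerpt exist in the shipped module):

# Illustrative usage of the factory methods shown above
outcome = TaskerCore::Types::DecisionPointOutcome.create_steps(%w[manager_approval finance_review])
outcome.to_h
# => { type: 'create_steps', step_names: ['manager_approval', 'finance_review'] }

TaskerCore::Types::DecisionPointOutcome.no_branches.to_h
# => { type: 'no_branches' }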

Rust Implementation Guide

Decision Handler Implementation

Actual Implementation (from workers/rust/src/step_handlers/conditional_approval_rust.rs):

#![allow(unused)]
fn main() {
use super::{error_result, success_result, RustStepHandler, StepHandlerConfig};
use anyhow::Result;
use async_trait::async_trait;
use chrono::Utc;
use serde_json::json;
use std::collections::HashMap;
use tasker_shared::messaging::{DecisionPointOutcome, StepExecutionResult};
use tasker_shared::types::TaskSequenceStep;

const SMALL_AMOUNT_THRESHOLD: i64 = 1000;
const LARGE_AMOUNT_THRESHOLD: i64 = 5000;

pub struct RoutingDecisionHandler {
    #[allow(dead_code)]
    config: StepHandlerConfig,
}

#[async_trait]
impl RustStepHandler for RoutingDecisionHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Extract amount from task context
        let amount: i64 = step_data.get_context_field("amount")?;

        // Business logic: determine routing
        let (route_type, steps, reasoning) = if amount < SMALL_AMOUNT_THRESHOLD {
            (
                "auto_approval",
                vec!["auto_approve"],
                format!("Amount ${} under threshold", amount)
            )
        } else if amount < LARGE_AMOUNT_THRESHOLD {
            (
                "manager_only",
                vec!["manager_approval"],
                format!("Amount ${} requires manager approval", amount)
            )
        } else {
            (
                "dual_approval",
                vec!["manager_approval", "finance_review"],
                format!("Amount ${} requires dual approval", amount)
            )
        };

        // Create decision point outcome
        let outcome = DecisionPointOutcome::create_steps(
            steps.iter().map(|s| s.to_string()).collect()
        );

        // Build result with embedded outcome
        let result_data = json!({
            "route_type": route_type,
            "reasoning": reasoning,
            "amount": amount,
            "decision_point_outcome": outcome.to_value()  // Embedded outcome
        });

        let metadata = HashMap::from([
            ("route_type".to_string(), json!(route_type)),
            ("steps_to_create".to_string(), json!(steps)),
        ]);

        Ok(success_result(
            step_uuid,
            result_data,
            start_time.elapsed().as_millis() as i64,
            Some(metadata),
        ))
    }

    fn name(&self) -> &str {
        "routing_decision"
    }

    fn new(config: StepHandlerConfig) -> Self {
        Self { config }
    }
}
}

DecisionPointOutcome Type

Type Definition (from tasker-shared/src/messaging/execution_types.rs):

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum DecisionPointOutcome {
    NoBranches,
    CreateSteps {
        step_names: Vec<String>,
    },
}

impl DecisionPointOutcome {
    /// Create outcome that creates specific steps
    pub fn create_steps(step_names: Vec<String>) -> Self {
        Self::CreateSteps { step_names }
    }

    /// Create outcome with no additional steps
    pub fn no_branches() -> Self {
        Self::NoBranches
    }

    /// Convert to JSON value for embedding in StepExecutionResult
    pub fn to_value(&self) -> serde_json::Value {
        serde_json::to_value(self).expect("DecisionPointOutcome serialization should not fail")
    }

    /// Extract decision outcome from step execution result
    pub fn from_step_result(result: &serde_json::Value) -> Option<Self> {
        result
            .as_object()?
            .get("decision_point_outcome")
            .and_then(|v| serde_json::from_value(v.clone()).ok())
    }
}
}

Key Rust Patterns:

  • DecisionPointOutcome::create_steps(vec![...]) - Type-safe factory
  • outcome.to_value() - Serializes to JSON matching Ruby format
  • Embedded in result JSON as decision_point_outcome field
  • Serde handles serialization: { "type": "create_steps", "step_names": [...] }

Best Practices

1. Keep Decision Logic Deterministic

# ✅ Good: Deterministic decision based on input
def call(context)
  amount = context.get_task_field('amount')

  steps = if amount < 1000
    ['auto_approve']
  else
    ['manager_approval']
  end

  decision_success(steps: steps)
end

# ❌ Bad: Non-deterministic (time-based, random)
def call(context)
  # Decision changes based on when it runs
  steps = if Time.now.hour < 9
    ['emergency_approval']
  else
    ['standard_approval']
  end

  decision_success(steps: steps)
end

2. Validate Step Names

Ensure all step names in decision outcomes exist in template:

VALID_STEPS = %w[auto_approve manager_approval finance_review].freeze

def call(context)
  steps_to_create = determine_steps(context)

  # Validate step names
  invalid = steps_to_create - VALID_STEPS
  unless invalid.empty?
    raise "Invalid step names: #{invalid.join(', ')}"
  end

  decision_success(steps: steps_to_create)
end

3. Use Deferred Type for Convergence

Any step that might depend on dynamically created steps should be type: deferred:

# ✅ Correct
- name: finalize
  type: deferred  # Uses intersection semantics
  dependencies:
    - routing_decision
    - auto_approve
    - manager_approval

# ❌ Wrong - will fail if dependencies don't all exist
- name: finalize
  dependencies:
    - routing_decision
    - auto_approve
    - manager_approval

4. Limit Decision Depth

Prevent infinite recursion:

[orchestration.decision_points]
max_depth = 3  # Maximum nesting level
warn_threshold = 2  # Warn when approaching limit

# ✅ Good: Linear decision chain (depth 1-2)
validate → routing_decision → compliance_check → finalize

# ⚠️ Be Careful: Deep nesting (depth 3)
validate → routing_1 → routing_2 → routing_3 → finalize

# ❌ Bad: Circular or unbounded nesting
routing_decision creates steps that create more routing decisions...

5. Handle No-Branch Cases

Explicitly return no_branches when no steps needed:

def call(context)
  amount = context.get_task_field('amount')

  if context.get_task_field('skip_approval')
    # No additional steps needed
    decision_no_branches(
      result_data: { reason: 'approval_skipped' }
    )
  else
    decision_success(steps: determine_steps(amount))
  end
end

6. Meaningful Result Data

Include context for debugging and audit trails:

decision_success(
  steps: ['manager_approval', 'finance_review'],
  result_data: {
    route_type: 'dual_approval',
    reasoning: "Amount $#{amount} >= $5,000 threshold",
    amount: amount,
    thresholds_applied: {
      small: 1_000,
      large: 5_000
    }
  },
  metadata: {
    decision_time_ms: elapsed_ms,
    steps_created_count: 2
  }
)

Limitations and Constraints

Technical Limits

1. Maximum Decision Depth

  • Default: 3 levels of nested decision points
  • Configurable via orchestration.decision_points.max_depth
  • Prevents infinite recursion

2. Step Names Must Exist in Template

  • All step names in CreateSteps must be defined in template
  • Orchestration validates before creating steps
  • Invalid names cause permanent failure

3. Decision Logic is Non-Retryable by Default

  • Decision steps should be deterministic
  • Retry disabled by default (max_attempts: 1)
  • External API calls should be in separate steps

4. Created Steps Cannot Modify Template

  • Decision points create instances of template steps
  • Cannot dynamically define new step types
  • All possible steps must be in template

Performance Considerations

1. Decision Overhead

  • Each decision point adds ~10-20ms overhead
  • Includes: handler execution + step creation + dependency resolution
  • Factor into SLA planning

2. Database Impact

  • Each created step = 1 WorkflowStep record + edges
  • Large branch counts increase database operations
  • Monitor workflow_steps table growth

3. Observability

  • Decision outcomes logged with telemetry
  • Metrics track: decision_points.steps_created, decision_points.depth
  • Use structured logging for audit trails

Semantic Constraints

1. Deferred Dependencies Must Include Decision Point

# ✅ Correct
- name: finalize
  type: deferred
  dependencies:
    - routing_decision  # Must list the decision point
    - auto_approve
    - manager_approval

# ❌ Wrong - missing decision point
- name: finalize
  type: deferred
  dependencies:
    - auto_approve
    - manager_approval

2. Decision Points Cannot Be Circular

# ❌ Not allowed - circular dependency
routing_a creates routing_b
routing_b creates routing_a

3. No Dynamic Template Modification

  • Cannot add new handler types at runtime
  • Cannot modify step configurations
  • All possibilities must be predefined

Testing Decision Point Workflows

E2E Test Structure

Both Ruby and Rust implementations include comprehensive E2E tests covering all routing scenarios:

Test Locations:

  • Ruby: tests/e2e/ruby/conditional_approval_test.rs
  • Rust: tests/e2e/rust/conditional_approval_rust.rs

Test Scenarios:

  1. Small Amount ($500) - Auto-approval only

    validate_request → routing_decision → auto_approve → finalize_approval
    Expected: 4 steps created, only auto_approve path taken
    
  2. Medium Amount ($3,000) - Manager approval only

    validate_request → routing_decision → manager_approval → finalize_approval
    Expected: 4 steps created, only manager path taken
    
  3. Large Amount ($10,000) - Dual approval

    validate_request → routing_decision → manager_approval + finance_review → finalize_approval
    Expected: 5 steps created, both approval paths taken (parallel)
    
  4. API Validation - Initial step count verification

    Expected: 2 steps at initialization (validate_request, routing_decision)
    Reason: finalize_approval is a transitive descendant of the decision point
    

Running Tests

# Run all E2E tests
cargo test --test e2e_tests

# Run Ruby conditional approval tests only
cargo test --test e2e_tests e2e::ruby::conditional_approval

# Run Rust conditional approval tests only
cargo test --test e2e_tests e2e::rust::conditional_approval_rust

# Run with output for debugging
cargo test --test e2e_tests -- --nocapture

Test Fixtures

Ruby Template: tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml
Rust Template: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml

Both templates demonstrate:

  • Decision point step configuration (type: decision)
  • Deferred convergence step (type: deferred)
  • Dynamic step dependencies
  • Namespace isolation between Ruby/Rust

Validation Checklist

When implementing decision point workflows, ensure:

  • ✅ Decision point step has type: decision
  • ✅ Deferred convergence step has type: deferred
  • ✅ All possible dependencies listed in deferred step
  • ✅ Handler embeds decision_point_outcome in result
  • ✅ Step names in outcome match template definitions
  • ✅ E2E tests cover all routing scenarios
  • ✅ Tests validate step creation and completion
  • ✅ Namespace isolated if multiple implementations exist


← Back to Documentation Hub

Configuration Management

Last Updated: 2025-10-17 Audience: Operators, Developers, Architects Status: Active Related Docs: Environment Configuration Comparison, Deployment Patterns


Overview

Tasker Core implements a sophisticated component-based configuration system with environment-specific overrides, runtime observability, and comprehensive validation. This document explains how to manage, validate, inspect, and deploy Tasker configurations.

Key Features

Feature | Description | Benefit
Component-Based Architecture | 3 focused TOML files organized by common, orchestration, and worker | Easy to understand and maintain
Environment Overrides | Test, development, production-specific settings | Safe defaults with production scale-out
Single-File Runtime Loading | Load from pre-merged configuration files at runtime | Deployment certainty - exact config known at build time
Runtime Observability | /config API endpoints with secret redaction | Live inspection of deployed configurations
CLI Tools | Generate and validate single deployable configs | Build-time verification, deployment artifacts
Context-Specific Validation | Orchestration and worker-specific validation rules | Catch errors before deployment
Secret Redaction | 12+ sensitive key patterns automatically hidden | Safe configuration inspection

Quick Start

Inspect Running System Configuration

# Check orchestration configuration (includes common + orchestration-specific)
curl http://localhost:8080/config | jq

# Check worker configuration (includes common + worker-specific)
curl http://localhost:8081/config | jq

# Secrets are automatically redacted for safety

Generate Deployable Configuration

# Generate production orchestration config for deployment
tasker-ctl config generate \
    --context orchestration \
    --environment production \
    --output config/tasker/orchestration-production.toml

# This merged file is then loaded at runtime via TASKER_CONFIG_PATH
export TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml

Validate Configuration

# Validate orchestration config for production
tasker-ctl config validate \
    --context orchestration \
    --environment production

# Validates: type safety, ranges, required fields, business rules

Part 1: Configuration Architecture

1.1 Component-Based Structure

Tasker uses a component-based TOML architecture where configuration is split into focused files with single responsibility:

config/tasker/
├── base/                           # Base configuration (defaults)
│   ├── common.toml                 # Shared: database, circuit breakers, telemetry
│   ├── orchestration.toml          # Orchestration-specific settings
│   └── worker.toml                 # Worker-specific settings
│
├── environments/                   # Environment-specific overrides
│   ├── test/
│   │   ├── common.toml             # Test overrides (small values, fast execution)
│   │   ├── orchestration.toml
│   │   └── worker.toml
│   │
│   ├── development/
│   │   ├── common.toml             # Development overrides (medium values, local Docker)
│   │   ├── orchestration.toml
│   │   └── worker.toml
│   │
│   └── production/
│       ├── common.toml             # Production overrides (large values, scale-out)
│       ├── orchestration.toml
│       └── worker.toml
│
├── orchestration-test.toml         # Generated merged configs (used at runtime via TASKER_CONFIG_PATH)
├── orchestration-production.toml   # Single-file deployment artifacts
├── worker-test.toml
└── worker-production.toml

1.2 Configuration Contexts

Tasker has three configuration contexts:

Context | Purpose | Components
Common | Shared across orchestration and worker | Database, circuit breakers, telemetry, backoff, system
Orchestration | Orchestration-specific settings | Web API, MPSC channels, event systems, shutdown
Worker | Worker-specific settings | Handler discovery, resource limits, health monitoring

1.3 Environment Detection

Configuration loading uses TASKER_ENV environment variable:

# Test environment - small values for fast tests
export TASKER_ENV=test

# Development environment - medium values for local Docker
export TASKER_ENV=development

# Production environment - large values for scale-out
export TASKER_ENV=production

Detection Order:

  1. TASKER_ENV environment variable
  2. Default to “development” if not set

1.4 Runtime Configuration Loading

Production/Docker Deployment: Single-file loading via TASKER_CONFIG_PATH

Runtime systems (orchestration and worker) load configuration from pre-merged single files:

# Set path to merged configuration file
export TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml

# System loads this single file at startup
# No directory merging at runtime - configuration is fully determined at build time

Key Benefits:

  • Deployment Certainty: Exact configuration known before deployment
  • Simplified Debugging: Single file shows exactly what’s running
  • Configuration Auditing: One file to version control and code review
  • Fail Loudly: Missing or invalid config halts startup with explicit errors

Configuration Path Precedence:

The system uses a two-tier configuration loading strategy with clear precedence:

  1. Primary: TASKER_CONFIG_PATH (Explicit single file - Docker/production)

    • When set, system loads configuration from this exact file path
    • Intended for production and Docker deployments
    • Example: TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml
    • Source logging: "📋 Loading orchestration configuration from: /app/config/tasker/orchestration-production.toml (source: TASKER_CONFIG_PATH)"
  2. Fallback: TASKER_CONFIG_ROOT (Convention-based - tests/development)

    • When TASKER_CONFIG_PATH is not set, system looks for config using convention
    • Convention: {TASKER_CONFIG_ROOT}/tasker/{context}-{environment}.toml
    • Examples:
      • Orchestration: /config/tasker/generated/orchestration-test.toml
      • Worker: /config/tasker/worker-production.toml
    • Source logging: "📋 Loading orchestration configuration from: /config/tasker/generated/orchestration-test.toml (source: TASKER_CONFIG_ROOT (convention))"

Logging and Transparency:

The system clearly logs which approach was taken at startup:

# Explicit path approach (TASKER_CONFIG_PATH set)
INFO tasker_shared::system_context: 📋 Loading orchestration configuration from: /app/config/tasker/orchestration-production.toml (source: TASKER_CONFIG_PATH)

# Convention-based approach (TASKER_CONFIG_ROOT set)
INFO tasker_shared::system_context: Using convention-based config path: /config/tasker/generated/orchestration-test.toml (environment=test)
INFO tasker_shared::system_context: 📋 Loading orchestration configuration from: /config/tasker/generated/orchestration-test.toml (source: TASKER_CONFIG_ROOT (convention))

When to Use Each:

Environment | Recommended Approach | Reason
Production | TASKER_CONFIG_PATH | Explicit, auditable, matches what’s reviewed
Docker | TASKER_CONFIG_PATH | Single source of truth, no ambiguity
Kubernetes | TASKER_CONFIG_PATH | ConfigMap contains exact file
Tests (nextest) | TASKER_CONFIG_ROOT | Tests span multiple contexts, convention handles both
Local dev | Either | Personal preference

Error Handling:

If neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set:

ConfigurationError("Neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set.
For Docker/production: set TASKER_CONFIG_PATH to the merged config file.
For tests/development: set TASKER_CONFIG_ROOT to the config directory.")

Local Development: Directory-based loading (legacy tests only)

For legacy test compatibility, you can still use directory-based loading via the load_context_direct() method, but this is not supported for production use.

1.5 Merging Strategy

Configuration merging follows environment overrides win pattern:

# base/common.toml
[database.pool]
max_connections = 30
min_connections = 8

# environments/production/common.toml
[database.pool]
max_connections = 50

# Result: max_connections = 50, min_connections = 8 (inherited from base)
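
The override-wins behavior can be sketched with a simple deep merge (illustrative Ruby only; the real merge happens at the TOML level inside the configuration loader):

# Illustrative deep merge: environment values win, untouched base values are inherited
def deep_merge(base, override)
  base.merge(override) do |_key, base_val, override_val|
    base_val.is_a?(Hash) && override_val.is_a?(Hash) ? deep_merge(base_val, override_val) : override_val
  end
end

base     = { 'database' => { 'pool' => { 'max_connections' => 30, 'min_connections' => 8 } } }
override = { 'database' => { 'pool' => { 'max_connections' => 50 } } }

deep_merge(base, override)
# => { 'database' => { 'pool' => { 'max_connections' => 50, 'min_connections' => 8 } } }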

Part 2: Runtime Observability

2.1 Configuration API Endpoints

Tasker provides unified configuration endpoints that return complete configuration (common + context-specific) in a single response.

Orchestration API

Endpoint: GET /config (system endpoint at root level)

Purpose: Inspect complete orchestration configuration including common settings

Example Request:

curl http://localhost:8080/config | jq

Response Structure:

{
  "environment": "production",
  "common": {
    "database": {
      "url": "***REDACTED***",
      "pool": {
        "max_connections": 50,
        "min_connections": 15
      }
    },
    "circuit_breakers": { "...": "..." },
    "telemetry": { "...": "..." },
    "system": { "...": "..." },
    "backoff": { "...": "..." },
    "task_templates": { "...": "..." }
  },
  "orchestration": {
    "web": {
      "bind_address": "0.0.0.0:8080",
      "request_timeout_ms": 60000
    },
    "mpsc_channels": {
      "command_buffer_size": 5000,
      "pgmq_notification_buffer_size": 50000
    },
    "event_systems": { "...": "..." }
  },
  "metadata": {
    "timestamp": "2025-10-17T15:30:45Z",
    "source": "runtime",
    "redacted_fields": [
      "database.url",
      "telemetry.api_key"
    ]
  }
}

Worker API

Endpoint: GET /config (system endpoint at root level)

Purpose: Inspect complete worker configuration including common settings

Example Request:

curl http://localhost:8081/config | jq

Response Structure:

{
  "environment": "production",
  "common": {
    "database": { "...": "..." },
    "circuit_breakers": { "...": "..." },
    "telemetry": { "...": "..." }
  },
  "worker": {
    "template_path": "/app/templates",
    "max_concurrent_steps": 500,
    "resource_limits": {
      "max_memory_mb": 4096,
      "max_cpu_percent": 90
    },
    "web": {
      "bind_address": "0.0.0.0:8081",
      "request_timeout_ms": 60000
    }
  },
  "metadata": {
    "timestamp": "2025-10-17T15:30:45Z",
    "source": "runtime",
    "redacted_fields": [
      "database.url",
      "worker.auth_token"
    ]
  }
}

2.2 Design Philosophy

Single Endpoint, Complete Configuration: Each system has one /config endpoint that returns both common and context-specific configuration in a single response.

Benefits:

  1. Single curl command: Get complete picture without correlation
  2. Easy comparison: Compare orchestration vs worker configs for compatibility
  3. Tooling-friendly: Automated tools can validate shared config matches
  4. Debugging-friendly: No mental correlation between multiple endpoints
  5. System endpoint: At root level like /health, /metrics (not under /v1/)

2.3 Comprehensive Secret Redaction

All sensitive configuration values are automatically redacted before returning to clients.

Sensitive Key Patterns (12+ patterns, case-insensitive):

  • password, secret, token, key, api_key
  • private_key, jwt_private_key, jwt_public_key
  • auth_token, credentials, database_url, url

Key Features:

  • Recursive Processing: Handles deeply nested objects and arrays
  • Field Path Tracking: Reports which fields were redacted (e.g., database.url)
  • Smart Skipping: Empty strings and booleans not redacted
  • Case-Insensitive: Catches API_KEY, Secret_Token, database_PASSWORD
  • Structure Preservation: Non-sensitive data remains intact

Example:

{
  "database": {
    "url": "***REDACTED***",
    "adapter": "postgresql",
    "pool": {
      "max_connections": 30
    }
  },
  "metadata": {
    "redacted_fields": ["database.url"]
  }
}
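
The redaction behavior described above can be approximated in a few lines of Ruby (an illustrative sketch with an abbreviated pattern list; the actual implementation lives in the Rust web layer):

# Illustrative recursive redaction with field-path tracking
SENSITIVE_KEY = /password|secret|token|key|credentials|url/i

def redact(value, path = [], redacted_fields = [])
  case value
  when Hash
    value.each_with_object({}) do |(k, v), out|
      if SENSITIVE_KEY.match?(k.to_s) && v.is_a?(String) && !v.empty?
        redacted_fields << (path + [k]).join('.')
        out[k] = '***REDACTED***'
      else
        out[k] = redact(v, path + [k], redacted_fields)
      end
    end
  when Array
    value.map { |item| redact(item, path, redacted_fields) }
  else
    value
  end
end

redacted_fields = []
redact({ 'database' => { 'url' => 'postgresql://...', 'adapter' => 'postgresql' } }, [], redacted_fields)
# => { 'database' => { 'url' => '***REDACTED***', 'adapter' => 'postgresql' } }
# redacted_fields => ['database.url']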

2.4 OpenAPI/Swagger Integration

All configuration endpoints are documented with OpenAPI 3.0 and Swagger UI.

Access Swagger UI:

  • Orchestration: http://localhost:8080/api-docs/ui
  • Worker: http://localhost:8081/api-docs/ui

OpenAPI Specification:

  • Orchestration: http://localhost:8080/api-docs/openapi.json
  • Worker: http://localhost:8081/api-docs/openapi.json

Part 3: CLI Tools

3.1 Generate Command

Purpose: Generate a single merged configuration file from base + environment overrides for deployment.

Command Signature:

tasker-ctl config generate \
    --context <common|orchestration|worker> \
    --environment <test|development|production>

Examples:

# Generate orchestration config for production
tasker-ctl config generate --context orchestration --environment production

# Generate worker config for development
tasker-ctl config generate --context worker --environment development

# Generate common config for test
tasker-ctl config generate --context common --environment test

Output Location: Automatically generated at:

config/tasker/generated/{context}-{environment}.toml

Key Features:

  1. Automatic Paths: No need for --source-dir or --output flags
  2. Metadata Headers: Generated files include rich metadata:
    # Generated by Tasker Configuration System
    # Context: orchestration
    # Environment: production
    # Generated At: 2025-10-17T15:30:45Z
    # Base Config: config/tasker/base/orchestration.toml
    # Environment Override: config/tasker/environments/production/orchestration.toml
    #
    # This is a merged configuration file combining base settings with
    # environment-specific overrides. Environment values take precedence.
    
  3. Automatic Validation: Validates during generation
  4. Smart Merging: TOML-level merging preserves structure

3.2 Validate Command

Purpose: Validate configuration files with context-specific validation rules.

Command Signature:

tasker-ctl config validate \
    --context <common|orchestration|worker> \
    --environment <test|development|production>

Examples:

# Validate orchestration config for production
tasker-ctl config validate --context orchestration --environment production

# Validate worker config for test
tasker-ctl config validate --context worker --environment test

Validation Features:

  • Environment variable substitution (${VAR:-default})
  • Type checking (numeric ranges, boolean values)
  • Required field validation
  • Context-specific business rules
  • Clear error messages

Example Output:

🔍 Validating configuration...
   Context: orchestration
   Environment: production
   ✓ Configuration loaded
   ✓ Validation passed

✅ Configuration is valid!

📊 Configuration Summary:
   Context: orchestration
   Environment: production
   Database: postgresql://tasker:***@localhost/tasker_production
   Web API: 0.0.0.0:8080
   MPSC Channels: 5 configured

3.3 Configuration Validator Binary

For quick validation without the full CLI:

# Validate all three environments
TASKER_ENV=test cargo run --bin config-validator
TASKER_ENV=development cargo run --bin config-validator
TASKER_ENV=production cargo run --bin config-validator

Part 4: Environment-Specific Configurations

See Environment Configuration Comparison for complete details on configuration values across environments.

4.1 Scaling Pattern

Tasker follows a 1:5:50 scaling pattern across environments:

Component | Test | Development | Production | Pattern
Database Connections | 10 | 25 | 50 | 1x → 2.5x → 5x
Concurrent Steps | 10 | 50 | 500 | 1x → 5x → 50x
MPSC Channel Buffers | 100-500 | 500-1000 | 2000-50000 | 1x → 5-10x → 20-100x
Memory Limits | 512MB | 2GB | 4GB | 1x → 4x → 8x

4.2 Environment Philosophy

Test Environment:

  • Goal: Fast execution, test isolation
  • Strategy: Minimal resources, small buffers
  • Example: 10 database connections, 100-500 MPSC buffers

Development Environment:

  • Goal: Comfortable local Docker development
  • Strategy: Medium values, realistic workflows
  • Example: 25 database connections, 2GB RAM, 500-1000 MPSC buffers
  • Cluster Testing: 2 orchestrators to test multi-instance coordination

Production Environment:

  • Goal: High throughput, scale-out capacity
  • Strategy: Large values, production resilience
  • Example: 50 database connections, 4GB RAM, 2000-50000 MPSC buffers

Part 5: Deployment Workflows

5.1 Docker Deployment

Build-Time Configuration Generation:

FROM rust:1.75 as builder

WORKDIR /app
COPY . .

# Build CLI tool
RUN cargo build --release --bin tasker-ctl

# Generate production config (single merged file)
RUN ./target/release/tasker-ctl config generate \
    --context orchestration \
    --environment production \
    --output config/tasker/orchestration-production.toml

# Build orchestration binary
RUN cargo build --release --bin tasker-orchestration

FROM rust:1.75-slim

WORKDIR /app

# Copy orchestration binary
COPY --from=builder /app/target/release/tasker-orchestration /usr/local/bin/

# Copy generated config (single file with all merged settings)
COPY --from=builder /app/config/tasker/orchestration-production.toml /app/config/orchestration.toml

# Set environment - TASKER_CONFIG_PATH is REQUIRED
ENV TASKER_CONFIG_PATH=/app/config/orchestration.toml
ENV TASKER_ENV=production

CMD ["tasker-orchestration"]

Key Changes from Phase 2:

  • ✅ Single merged file generated at build time
  • ✅ TASKER_CONFIG_PATH environment variable (required)
  • ✅ No runtime merging - exact config known at build time
  • ✅ Fail loudly if TASKER_CONFIG_PATH not set

5.2 Kubernetes Deployment

ConfigMap Strategy with Pre-Generated Config:

# Step 1: Generate merged configuration locally
tasker-ctl config generate \
  --context orchestration \
  --environment production \
  --output orchestration-production.toml

# Step 2: Create ConfigMap from generated file
kubectl create configmap tasker-orchestration-config \
  --from-file=orchestration.toml=orchestration-production.toml

Deployment Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-orchestration
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tasker-orchestration
  template:
    metadata:
      labels:
        app: tasker-orchestration
    spec:
      containers:
      - name: orchestration
        image: tasker/orchestration:latest
        env:
        - name: TASKER_ENV
          value: "production"
        # REQUIRED: Path to single merged configuration file
        - name: TASKER_CONFIG_PATH
          value: "/config/orchestration.toml"
        # DATABASE_URL should be in a separate secret
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-db-credentials
              key: database-url
        volumeMounts:
        - name: config
          mountPath: /config
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: tasker-orchestration-config
          items:
          - key: orchestration.toml
            path: orchestration.toml

Key Benefits:

  • ✅ Generated file reviewed before deployment
  • ✅ Single source of truth for runtime configuration
  • ✅ Easy to diff between environments
  • ✅ ConfigMap contains exact runtime configuration

5.3 Local Development and Testing

For Tests (Legacy directory-based loading):

# Set test environment
export TASKER_ENV=test

# Tests use legacy load_context_direct() method
cargo test --all-features

For Docker Compose (Single-file loading):

# Generate test configs first
tasker-ctl config generate --context orchestration --environment test \
  --output config/tasker/generated/orchestration-test.toml

tasker-ctl config generate --context worker --environment test \
  --output config/tasker/generated/worker-test.toml

# Start services with generated configs
docker-compose -f docker/docker-compose.test.yml up

Docker Compose Configuration:

services:
  orchestration:
    environment:
      # REQUIRED: Path to single merged file
      TASKER_CONFIG_PATH: /app/config/tasker/generated/orchestration-test.toml
    volumes:
      # Mount config directory (contains generated files)
      - ./config/tasker:/app/config/tasker:ro

Key Points:

  • ✅ Tests use legacy directory-based loading for convenience
  • ✅ Docker Compose uses single-file loading (matches production)
  • ✅ Generated files should be committed to repo for reproducibility
  • ✅ Both approaches work; choose based on use case

Part 6: Configuration Validation

6.1 Context-Specific Validation

Each configuration context has specific validation rules:

Common Configuration:

  • Database URL format and connectivity
  • Pool size ranges (1-1000 connections)
  • Circuit breaker thresholds (1-100 failures)
  • Timeout durations (1-3600 seconds)

Orchestration Configuration:

  • Web API bind address format
  • Request timeout ranges (1000-300000 ms)
  • MPSC channel buffer sizes (100-100000)
  • Event system configuration consistency

Worker Configuration:

  • Template path existence
  • Resource limit ranges (memory, CPU %)
  • Handler discovery path validation
  • Concurrent step limits (1-10000)

6.2 Validation Workflow

Pre-Deployment Validation:

# Validate before generating deployment artifact
tasker-ctl config validate --context orchestration --environment production

# Generate only if validation passes
tasker-ctl config generate --context orchestration --environment production

Runtime Validation:

  • Configuration validated on application startup
  • Invalid config prevents startup (fail-fast)
  • Clear error messages for troubleshooting

6.3 Common Validation Errors

Example Error Messages:

❌ Validation Error: database.pool.max_connections
   Value: 5000
   Issue: Exceeds maximum allowed value (1000)
   Fix: Reduce to 1000 or less

❌ Validation Error: web.bind_address
   Value: "invalid:port"
   Issue: Invalid IP:port format
   Fix: Use format like "0.0.0.0:8080" or "127.0.0.1:3000"

Part 7: Operational Workflows

7.1 Compare Deployed Configurations

Cross-System Comparison:

# Get orchestration config
curl http://orchestration:8080/config > orch-config.json

# Get worker config
curl http://worker:8081/config > worker-config.json

# Compare common sections for compatibility
jq '.common' orch-config.json > orch-common.json
jq '.common' worker-config.json > worker-common.json

diff orch-common.json worker-common.json

Why This Matters:

  • Ensures orchestration and worker share same database config
  • Validates circuit breaker settings match
  • Confirms telemetry endpoints aligned

7.2 Debug Configuration Issues

Step 1: Inspect Runtime Config

# Check what's actually deployed
curl http://localhost:8080/config | jq '.orchestration.web'

Step 2: Compare to Expected

# Check generated config file
cat config/tasker/generated/orchestration-production.toml

# Compare values

Step 3: Trace Configuration Source

# Check metadata for source files
curl http://localhost:8080/config | jq '.metadata'

# Metadata shows:
# - Environment (production)
# - Timestamp (when config was loaded)
# - Source (runtime)
# - Redacted fields (for transparency)

7.3 Configuration Drift Detection

Manual Comparison:

# Generate what should be deployed
tasker-ctl config generate --context orchestration --environment production

# Compare to runtime
diff config/tasker/generated/orchestration-production.toml \
     <(curl -s http://localhost:8080/config | jq -r '.orchestration')

Automated Monitoring (future):

  • Periodic config snapshots
  • Alert on unexpected changes
  • Configuration version tracking

Part 8: Best Practices

8.1 Configuration Management

DO:

  • ✅ Use environment variables for secrets (${DATABASE_URL})
  • ✅ Validate configs before deployment
  • ✅ Generate single deployable artifacts for production
  • ✅ Use /config endpoints for debugging
  • ✅ Keep environment overrides minimal (only what changes)
  • ✅ Document configuration changes in commit messages

DON’T:

  • ❌ Commit production secrets to config files
  • ❌ Mix test and production configurations
  • ❌ Skip validation before deployment
  • ❌ Use unbounded configuration values
  • ❌ Override all settings in environment files

8.2 Security Best Practices

Secrets Management:

# ✅ GOOD: Use environment variable substitution
[database]
url = "${DATABASE_URL}"

# ❌ BAD: Hard-code credentials
[database]
url = "postgresql://user:password@localhost/db"

Production Deployment:

# ✅ GOOD: Use Kubernetes secrets
kubectl create secret generic tasker-db-url \
  --from-literal=url='postgresql://...'

# ❌ BAD: Commit secrets to config files

Runtime Inspection:

  • /config endpoint automatically redacts secrets
  • Safe to use in logging and monitoring
  • Field path tracking shows what was redacted

8.3 Testing Strategy

Test All Environments:

# Ensure all environments validate
for env in test development production; do
  echo "Validating $env..."
  tasker-ctl config validate --context orchestration --environment $env
done

Integration Testing:

# Test with generated configs
tasker-ctl config generate --context orchestration --environment test
export TASKER_CONFIG_PATH=config/tasker/generated/orchestration-test.toml
cargo test --all-features

Part 9: Troubleshooting

9.1 Common Issues

Issue: Configuration fails to load

# Check environment variable
echo $TASKER_ENV

# Check config files exist
ls -la config/tasker/base/
ls -la config/tasker/environments/$TASKER_ENV/

# Validate config
tasker-ctl config validate --context orchestration --environment $TASKER_ENV

Issue: Unexpected configuration values at runtime

# Check runtime config
curl http://localhost:8080/config | jq

# Compare to expected
cat config/tasker/generated/orchestration-$TASKER_ENV.toml

Issue: Validation errors

# Run validation with detailed output
RUST_LOG=debug tasker-ctl config validate \
  --context orchestration \
  --environment production

9.2 Debug Mode

Enable Configuration Debug Logging:

# Detailed config loading logs
RUST_LOG=tasker_shared::config=debug cargo run

# Shows:
# - Which files are loaded
# - Merge order
# - Environment variable substitution
# - Validation results

Part 10: Future Enhancements

10.1 Planned Features

Explain Command (Deferred):

# Get documentation for a parameter
tasker-ctl config explain --parameter database.pool.max_connections

# Shows:
# - Purpose and system impact
# - Valid range and type
# - Environment-specific recommendations
# - Related parameters
# - Example usage

Detect-Unused Command (Deferred):

# Find unused configuration parameters
tasker-ctl config detect-unused --context orchestration

# Auto-remove with backup
tasker-ctl config detect-unused --context orchestration --fix

10.2 Operational Enhancements

Configuration Versioning:

  • Track configuration changes over time
  • Compare configs across versions
  • Rollback capability

Automated Drift Detection:

  • Periodic config snapshots
  • Alert on unexpected changes
  • Configuration compliance checking

Configuration Templates:

  • Pre-built configurations for common scenarios
  • Quick-start templates for new deployments
  • Best practice configurations


Summary

Tasker’s configuration system provides:

  1. Component-Based Architecture: Focused TOML files with single responsibility
  2. Environment Scaling: 1:5:50 pattern from test → development → production
  3. Single-File Runtime Loading: Deploy exact configuration known at build time via TASKER_CONFIG_PATH
  4. Runtime Observability: /config endpoints with comprehensive secret redaction
  5. CLI Tools: Generate and validate single deployable configs
  6. Context-Specific Validation: Catch errors before deployment
  7. Security First: Automatic secret redaction, environment variable substitution

Key Workflows:

  • Production/Docker: Generate single-file config at build time, set TASKER_CONFIG_PATH, deploy
  • Testing: Use legacy directory-based loading for convenience
  • Debugging: Use /config endpoints to inspect runtime configuration
  • Validation: Validate before generating deployment artifacts

Phase 3 Changes (October 2025):

  • ✅ Runtime systems now require TASKER_CONFIG_PATH environment variable
  • ✅ Configuration loaded from single merged files (no runtime merging)
  • ✅ Deployment certainty: exact config known at build time
  • ✅ Fail loudly: missing/invalid config halts startup with explicit errors
  • ✅ Generated configs committed to repo for reproducibility

← Back to Documentation Hub

Dead Letter Queue (DLQ) System Architecture

Purpose: Investigation tracking system for stuck, stale, or problematic tasks

Last Updated: 2025-11-01


Executive Summary

The DLQ (Dead Letter Queue) system is an investigation tracking system, NOT a task manipulation layer.

Key Principles:

  • DLQ tracks “why task is stuck” and “who investigated”
  • Resolution happens at step level via step APIs
  • No task-level “requeue” - fix the problem steps instead
  • Steps carry their own retry, attempt, and state lifecycles independent of DLQ
  • DLQ is for audit, visibility, and investigation only

Architecture: PostgreSQL-based system with:

  • tasks_dlq table for investigation tracking
  • 3 database views for monitoring and analysis
  • 6 REST endpoints for operator interaction
  • Background staleness detection service

DLQ vs Step Resolution

What DLQ Does

Investigation Tracking:

  • Record when and why task became stuck
  • Capture complete task snapshot for debugging
  • Track operator investigation workflow
  • Provide visibility into systemic issues

Visibility and Monitoring:

  • Dashboard statistics by DLQ reason
  • Prioritized investigation queue for triage
  • Proactive staleness monitoring (before DLQ)
  • Alerting integration for high-priority entries

What DLQ Does NOT Do

Task Manipulation:

  • Does NOT retry failed steps
  • Does NOT requeue tasks
  • Does NOT modify step state
  • Does NOT execute business logic

Why This Separation Matters

Steps are mutable - Operators can:

  • Manually resolve failed steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  • View step readiness status: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  • Check retry eligibility and dependency satisfaction
  • Trigger next steps by completing blocked steps

DLQ is immutable audit trail - Operators should:

  • Review task snapshot to understand what went wrong
  • Use step endpoints to fix the underlying problem
  • Update DLQ investigation status to track resolution
  • Analyze DLQ patterns to prevent future occurrences

DLQ Reasons

staleness_timeout

Definition: Task exceeded state-specific staleness threshold

States:

  • waiting_for_dependencies - Default 60 minutes
  • waiting_for_retry - Default 30 minutes
  • steps_in_process - Default 30 minutes

Template Override: Configure per-template thresholds:

lifecycle:
  max_waiting_for_dependencies_minutes: 120
  max_waiting_for_retry_minutes: 45
  max_steps_in_process_minutes: 60
  max_duration_minutes: 1440  # 24 hours

Resolution Pattern:

  1. Operator: GET /v1/dlq/task/{task_uuid} - Review task snapshot
  2. Identify stuck steps: Check current_state in snapshot
  3. Fix steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  4. Task state machine automatically progresses when steps fixed
  5. Operator: PATCH /v1/dlq/entry/{dlq_entry_uuid} - Mark investigation resolved

Prevention: Use /v1/dlq/staleness endpoint for proactive monitoring

max_retries_exceeded

Definition: Step exhausted all retry attempts and remains in Error state

Resolution Pattern:

  1. Review step results: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  2. Analyze last_failure_at and error details
  3. Fix underlying issue (infrastructure, data, etc.)
  4. Manually resolve step: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  5. Update DLQ investigation status

dependency_cycle_detected

Definition: Circular dependency detected in workflow step graph

Resolution Pattern:

  1. Review task template configuration
  2. Identify cycle in step dependencies
  3. Update template to break cycle
  4. Manually cancel affected tasks
  5. Re-submit tasks with corrected template

worker_unavailable

Definition: No worker available for task’s namespace

Resolution Pattern:

  1. Check worker service health
  2. Verify namespace configuration
  3. Scale worker capacity if needed
  4. Tasks automatically progress when worker available

manual_dlq

Definition: Operator manually sent task to DLQ for investigation

Resolution Pattern: Custom per-investigation


Database Schema

tasks_dlq Table

CREATE TABLE tasker.tasks_dlq (
    dlq_entry_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
    task_uuid UUID NOT NULL UNIQUE,  -- One pending entry per task
    original_state VARCHAR(50) NOT NULL,
    dlq_reason dlq_reason NOT NULL,
    dlq_timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    task_snapshot JSONB,  -- Complete task state for debugging
    resolution_status dlq_resolution_status NOT NULL DEFAULT 'pending',
    resolution_notes TEXT,
    resolved_at TIMESTAMPTZ,
    resolved_by VARCHAR(255),
    metadata JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Unique constraint: Only one pending DLQ entry per task
CREATE UNIQUE INDEX idx_dlq_unique_pending_task
    ON tasker.tasks_dlq (task_uuid)
    WHERE resolution_status = 'pending';

Key Fields:

  • dlq_entry_uuid - UUID v7 (time-ordered) for investigation tracking
  • task_uuid - Foreign key to task (unique for pending entries)
  • original_state - Task state when sent to DLQ
  • task_snapshot - JSONB snapshot with debugging context
  • resolution_status - Investigation workflow status

Database Views

v_dlq_dashboard

Purpose: Aggregated statistics for monitoring dashboard

Columns:

  • dlq_reason - Why tasks are in DLQ
  • total_entries - Count of entries
  • pending, manually_resolved, permanent_failures, cancelled - Breakdown by status
  • oldest_entry, newest_entry - Time range
  • avg_resolution_time_minutes - Average time to resolve

Use Case: High-level DLQ health monitoring

v_dlq_investigation_queue

Purpose: Prioritized queue for operator triage

Columns:

  • Task and DLQ entry UUIDs
  • priority_score - Composite score (base reason priority + age factor)
  • minutes_in_dlq - How long entry has been pending
  • Task metadata for context

Ordering: Priority score DESC (most urgent first)

Use Case: Operator dashboard showing “what to investigate next”

v_task_staleness_monitoring

Purpose: Proactive staleness monitoring BEFORE tasks hit DLQ

Columns:

  • task_uuid, namespace_name, task_name
  • current_state, time_in_state_minutes
  • staleness_threshold_minutes - Threshold for this state
  • health_status - healthy | warning | stale
  • priority - Task priority for ordering

Health Status Classification:

  • healthy - < 80% of threshold
  • warning - 80-99% of threshold
  • stale - ≥ 100% of threshold

Use Case: Alerting at 80% threshold to prevent DLQ entries
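
The classification reduces to a simple ratio check; a Ruby sketch mirroring the documented thresholds:

# Illustrative classification matching the documented 80% / 100% thresholds
def health_status(time_in_state_minutes, staleness_threshold_minutes)
  ratio = time_in_state_minutes.to_f / staleness_threshold_minutes

  if ratio >= 1.0
    'stale'
  elsif ratio >= 0.8
    'warning'
  else
    'healthy'
  end
end

health_status(50, 60)  # => 'warning' (at 80%+ of a 60-minute threshold)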


REST API Endpoints

1. List DLQ Entries

GET /v1/dlq?resolution_status=pending&limit=50

Purpose: Browse DLQ entries with filtering

Query Parameters:

  • resolution_status - Filter by status (optional)
  • limit - Max entries (default: 50)
  • offset - Pagination offset (default: 0)

Response: Array of DlqEntry objects

Use Case: General DLQ browsing and pagination


2. Get DLQ Entry with Task Snapshot

GET /v1/dlq/task/{task_uuid}

Purpose: Retrieve most recent DLQ entry for a task with complete snapshot

Response: DlqEntry with full task_snapshot JSONB

Task Snapshot Contains:

  • Task UUID, namespace, name
  • Current state and time in state
  • Staleness threshold
  • Task age and priority
  • Template configuration
  • Detection time

Use Case: Investigation starting point - “why is this task stuck?”


3. Update DLQ Investigation Status

PATCH /v1/dlq/entry/{dlq_entry_uuid}

Purpose: Track investigation workflow

Request Body:

{
  "resolution_status": "manually_resolved",
  "resolution_notes": "Fixed by manually completing stuck step using step API",
  "resolved_by": "operator@example.com",
  "metadata": {
    "fixed_step_uuid": "...",
    "root_cause": "database connection timeout"
  }
}

Use Case: Document investigation findings and resolution


4. Get DLQ Statistics

GET /v1/dlq/stats

Purpose: Aggregated statistics for monitoring

Response: Statistics grouped by dlq_reason

Use Case: Dashboard metrics, identifying systemic issues


5. Get Investigation Queue

GET /v1/dlq/investigation-queue?limit=100

Purpose: Prioritized queue for operator triage

Response: Array of DlqInvestigationQueueEntry ordered by priority

Priority Factors:

  • Base reason priority (staleness_timeout: 10, max_retries: 20, etc.)
  • Age multiplier (older entries = higher priority)

Use Case: “What should I investigate next?”
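
The exact weighting is internal to the view, but the shape of the score is "base priority for the reason, plus an age factor". A purely hypothetical Rust illustration (the age factor below is invented for this sketch, not Tasker's formula):

// Hypothetical illustration only -- not the view's actual formula.
fn priority_score(base_reason_priority: u32, minutes_in_dlq: u32) -> u32 {
    // Assumed weighting for illustration: one extra point per 30 minutes in the DLQ.
    base_reason_priority + minutes_in_dlq / 30
}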


6. Get Staleness Monitoring

GET /v1/dlq/staleness?limit=100

Purpose: Proactive monitoring BEFORE tasks hit DLQ

Response: Array of StalenessMonitoring with health status

Ordering: Stale first, then warning, then healthy

Use Case: Alerting and prevention

Alert Integration:

# Alert when warning count exceeds threshold
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'

Step Endpoints and Resolution Workflow

Step Endpoints

1. List Task Steps

GET /v1/tasks/{uuid}/workflow_steps

Returns: Array of steps with readiness status

Key Fields:

  • current_state - Step state (pending, enqueued, in_progress, complete, error)
  • dependencies_satisfied - Can step execute?
  • retry_eligible - Can step retry?
  • ready_for_execution - Ready to enqueue?
  • attempts / max_attempts - Retry tracking
  • last_failure_at - When step last failed
  • next_retry_at - When step eligible for retry

Use Case: Understand task execution status


2. Get Step Details

GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}

Returns: Single step with full readiness analysis

Use Case: Deep dive into specific step


3. Manually Resolve Step

PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}

Purpose: Operator actions to handle stuck or failed steps

Action Types:

  1. ResetForRetry - Reset attempt counter and return to pending for automatic retry:
{
  "action_type": "reset_for_retry",
  "reset_by": "operator@example.com",
  "reason": "Database connection restored, resetting attempts"
}
  2. ResolveManually - Mark step as manually resolved without results:
{
  "action_type": "resolve_manually",
  "resolved_by": "operator@example.com",
  "reason": "Non-critical step, bypassing for workflow continuation"
}
  3. CompleteManually - Complete step with execution results for dependent steps:
{
  "action_type": "complete_manually",
  "completion_data": {
    "result": {
      "validated": true,
      "score": 95
    },
    "metadata": {
      "manually_verified": true,
      "verification_method": "manual_inspection"
    }
  },
  "reason": "Manual verification completed after infrastructure fix",
  "completed_by": "operator@example.com"
}

Behavior by Action Type:

  • reset_for_retry: Clears attempt counter, transitions to pending, enables automatic retry
  • resolve_manually: Transitions to resolved_manually (terminal state)
  • complete_manually: Transitions to complete with results available for dependent steps

Common Effects:

  • Triggers task state machine re-evaluation
  • Task automatically discovers next ready steps
  • Task progresses when all dependencies satisfied

Use Case: Unblock stuck workflow by fixing problem step


Complete Resolution Workflow

Scenario: Task Stuck in waiting_for_dependencies

1. Operator receives DLQ alert

GET /v1/dlq/investigation-queue
# Response shows task_uuid: abc-123 with high priority

2. Operator reviews task snapshot

GET /v1/dlq/task/abc-123
# Response:
{
  "dlq_entry_uuid": "xyz-789",
  "task_uuid": "abc-123",
  "original_state": "waiting_for_dependencies",
  "dlq_reason": "staleness_timeout",
  "task_snapshot": {
    "task_uuid": "abc-123",
    "namespace": "order_processing",
    "task_name": "fulfill_order",
    "current_state": "error",
    "time_in_state_minutes": 65,
    "threshold_minutes": 60
  }
}

3. Operator checks task steps

GET /v1/tasks/abc-123/workflow_steps
# Response shows:
# step_1: complete
# step_2: error (blocked, max_attempts exceeded)
# step_3: waiting_for_dependencies (blocked by step_2)

4. Operator investigates step_2 failure

GET /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
# Response shows last_failure_at and error details
# Root cause: database connection timeout

5. Operator fixes infrastructure issue

# Fix database connection pool configuration
# Verify database connectivity

6. Operator chooses resolution strategy

Option A: Reset for retry (infrastructure fixed, retry should work):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "reset_for_retry",
  "reset_by": "operator@example.com",
  "reason": "Database connection pool fixed, resetting attempts for automatic retry"
}

Option B: Resolve manually (bypass step entirely):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "resolve_manually",
  "resolved_by": "operator@example.com",
  "reason": "Non-critical validation step, bypassing"
}

Option C: Complete manually (provide results for dependent steps):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "complete_manually",
  "completion_data": {
    "result": {
      "validation_status": "passed",
      "score": 100
    },
    "metadata": {
      "manually_verified": true
    }
  },
  "reason": "Manual validation completed",
  "completed_by": "operator@example.com"
}

7. Task state machine automatically progresses

Outcome depends on action type chosen:

If Option A (reset_for_retry):

  • Step 2 → pending (attempts reset to 0)
  • Automatic retry begins when dependencies satisfied
  • Step 2 re-enqueued to worker
  • If successful, workflow continues normally

If Option B (resolve_manually):

  • Step 2 → resolved_manually (terminal state)
  • Step 3 dependencies satisfied (manual resolution counts as success)
  • Task transitions: error → enqueuing_steps
  • Step 3 enqueued to worker
  • Task resumes normal execution

If Option C (complete_manually):

  • Step 2 → complete (with operator-provided results)
  • Step 3 can consume results from completion_data
  • Task transitions: error → enqueuing_steps
  • Step 3 enqueued to worker with access to step 2 results
  • Task resumes normal execution

8. Operator updates DLQ investigation

PATCH /v1/dlq/entry/xyz-789
{
  "resolution_status": "manually_resolved",
  "resolution_notes": "Fixed database connection pool configuration. Manually resolved step_2 to unblock workflow. Task resumed execution.",
  "resolved_by": "operator@example.com",
  "metadata": {
    "root_cause": "database_connection_timeout",
    "fixed_step_uuid": "{step_2_uuid}",
    "infrastructure_fix": "increased_connection_pool_size"
  }
}

Step Retry and Attempt Lifecycles

Step State Machine

States:

  • pending - Initial state, awaiting dependencies
  • enqueued - Sent to worker queue
  • in_progress - Worker actively processing
  • enqueued_for_orchestration - Result submitted, awaiting orchestration
  • complete - Successfully finished
  • error - Failed (may be retryable)
  • cancelled - Manually cancelled
  • resolved_manually - Operator intervention

Retry Logic

Configured per step in template:

retry:
  retryable: true
  max_attempts: 3
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 30000

Retry Eligibility Criteria:

  1. retryable: true in configuration
  2. attempts < max_attempts
  3. Current state is error
  4. next_retry_at timestamp has passed (backoff elapsed)

Backoff Calculation:

backoff_ms = min(backoff_base_ms * (2 ^ (attempts - 1)), max_backoff_ms)

Example (base=1000ms, max=30000ms):

  • Attempt 1 fails → wait 1s
  • Attempt 2 fails → wait 2s
  • Attempt 3 fails → wait 4s

SQL Function: get_step_readiness_status() calculates retry_eligible and next_retry_at
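
A Rust rendering of the same formula, for reference (the engine evaluates this in SQL via the readiness function):

fn backoff_ms(attempts: u32, backoff_base_ms: u64, max_backoff_ms: u64) -> u64 {
    // `attempts` is the number of attempts already made (>= 1 after a failure).
    let exponential = backoff_base_ms
        .saturating_mul(2u64.saturating_pow(attempts.saturating_sub(1)));
    exponential.min(max_backoff_ms)
}

// With base=1000 and max=30000: attempt 1 -> 1000ms, 2 -> 2000ms, 3 -> 4000ms, 6 -> capped at 30000ms.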

Attempt Tracking

Fields (on workflow_steps table):

  • attempts - Current attempt count
  • max_attempts - Configuration limit
  • last_attempted_at - Timestamp of last execution
  • last_failure_at - Timestamp of last failure

Workflow:

  1. Step enqueued → attempts++
  2. Step fails → Record last_failure_at, calculate next_retry_at
  3. Backoff elapses → Step becomes retry_eligible: true
  4. Orchestration discovers ready steps → Step re-enqueued
  5. Repeat until success or attempts >= max_attempts

Max Attempts Exceeded:

  • Step remains in error state
  • retry_eligible: false
  • Task transitions to error state
  • May trigger DLQ entry with reason max_retries_exceeded

Independence from DLQ

Key Point: Step retry logic is INDEPENDENT of DLQ

  • Steps retry automatically based on configuration
  • DLQ does NOT trigger retries
  • DLQ does NOT modify retry counters
  • DLQ is pure observation and investigation

Why This Matters:

  • Retry logic is predictable and configuration-driven
  • DLQ doesn’t interfere with normal workflow execution
  • Operators can manually resolve to bypass retry limits
  • DLQ provides visibility into retry exhaustion patterns

Staleness Detection

Background Service

Component: tasker-orchestration/src/orchestration/staleness_detector.rs

Configuration:

[staleness_detection]
enabled = true
batch_size = 100
detection_interval_seconds = 300  # 5 minutes

Operation:

  1. Timer triggers every 5 minutes
  2. Calls detect_and_transition_stale_tasks() SQL function
  3. Function identifies tasks exceeding thresholds
  4. Creates DLQ entries for stale tasks
  5. Transitions tasks to error state
  6. Records OpenTelemetry metrics

Staleness Thresholds

Per-State Defaults (configurable):

  • waiting_for_dependencies: 60 minutes
  • waiting_for_retry: 30 minutes
  • steps_in_process: 30 minutes

Per-Template Override:

lifecycle:
  max_waiting_for_dependencies_minutes: 120
  max_waiting_for_retry_minutes: 45
  max_steps_in_process_minutes: 60

Precedence: Template config > Global defaults
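
A minimal sketch of that precedence rule (names are illustrative):

fn effective_staleness_threshold(template_override_minutes: Option<u32>, global_default_minutes: u32) -> u32 {
    // A template-level override, when present, wins over the global default.
    template_override_minutes.unwrap_or(global_default_minutes)
}

// effective_staleness_threshold(Some(120), 60) == 120  (template override wins)
// effective_staleness_threshold(None, 60)      == 60   (falls back to the global default)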

Staleness SQL Function

Function: detect_and_transition_stale_tasks()

Architecture:

v_task_state_analysis (base view)
    │
    ├── get_stale_tasks_for_dlq() (discovery function)
    │       │
    │       └── detect_and_transition_stale_tasks() (main orchestration)
    │               ├── create_dlq_entry() (DLQ creation)
    │               └── transition_stale_task_to_error() (state transition)

Performance Optimization:

  • Expensive joins happen ONCE in base view
  • Discovery function filters stale tasks
  • Main function processes results in loop
  • LEFT JOIN anti-join pattern for excluding tasks with pending DLQ entries

Output: Returns StalenessResult records with:

  • Task identification (UUID, namespace, name)
  • State and timing information
  • action_taken - What happened (enum: TransitionedToDlqAndError, MovedToDlqOnly, etc.)
  • moved_to_dlq - Boolean
  • transition_success - Boolean
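
As a rough guide to consuming these records, a Rust shape assumed from the field list above (the actual struct in tasker-orchestration may differ in names and types):

use uuid::Uuid;

// Assumed shape based on the fields listed above; types are guesses.
pub struct StalenessResult {
    pub task_uuid: Uuid,
    pub namespace_name: String,
    pub task_name: String,
    pub current_state: String,
    pub time_in_state_minutes: i64,
    pub action_taken: String, // an enum in the real code, e.g. "TransitionedToDlqAndError"
    pub moved_to_dlq: bool,
    pub transition_success: bool,
}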

OpenTelemetry Metrics

Metrics Exported

Counters:

  • tasker.dlq.entries_created.total - DLQ entries created
  • tasker.staleness.tasks_detected.total - Stale tasks detected
  • tasker.staleness.tasks_transitioned_to_error.total - Tasks moved to Error
  • tasker.staleness.detection_runs.total - Detection cycles

Histograms:

  • tasker.staleness.detection.duration - Detection execution time (ms)
  • tasker.dlq.time_in_queue - Time in DLQ before resolution (hours)

Gauges:

  • tasker.dlq.pending_investigations - Current pending DLQ count

Alert Examples

Prometheus Alerting Rules:

# Alert on high pending investigations
- alert: HighPendingDLQInvestigations
  expr: tasker_dlq_pending_investigations > 50
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High number of pending DLQ investigations ({{ $value }})"

# Alert on slow detection cycles
- alert: SlowStalenessDetection
  expr: tasker_staleness_detection_duration > 5000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Staleness detection taking >5s ({{ $value }}ms)"

# Alert on high stale task rate
- alert: HighStalenessRate
  expr: rate(tasker_staleness_tasks_detected_total[5m]) > 10
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High rate of stale task detection ({{ $value }}/sec)"

CLI Usage Examples

The tasker-ctl tool provides commands for managing workflow steps directly from the command line.

List Workflow Steps

# List all steps for a task
tasker-ctl task steps <TASK_UUID>

# Example output:
# ✓ Found 3 workflow steps:
#
#   Step: validate_input (01933d7c-...)
#     State: complete
#     Dependencies satisfied: true
#     Ready for execution: false
#     Attempts: 1/3
#
#   Step: process_order (01933d7c-...)
#     State: error
#     Dependencies satisfied: true
#     Ready for execution: false
#     Attempts: 3/3
#     ⚠ Not retry eligible (max attempts reached)

Get Step Details

# Get detailed information about a specific step
tasker-ctl task step <TASK_UUID> <STEP_UUID>

# Example output:
# ✓ Step Details:
#
#   UUID: 01933d7c-...
#   Name: process_order
#   State: error
#   Dependencies satisfied: true
#   Ready for execution: false
#   Retry eligible: false
#   Attempts: 3/3
#   Last failure: 2025-11-02T14:23:45Z

Reset Step for Retry

When infrastructure is fixed and you want to reset attempt counter:

tasker-ctl task reset-step <TASK_UUID> <STEP_UUID> \
  --reason "Database connection pool increased" \
  --reset-by "ops-team@example.com"

# Example output:
# ✓ Step reset successfully!
#   New state: pending
#   Reason: Database connection pool increased
#   Reset by: ops-team@example.com

Resolve Step Manually

When you want to bypass a non-critical step:

tasker-ctl task resolve-step <TASK_UUID> <STEP_UUID> \
  --reason "Non-critical validation, bypassing" \
  --resolved-by "ops-team@example.com"

# Example output:
# ✓ Step resolved manually!
#   New state: resolved_manually
#   Reason: Non-critical validation, bypassing
#   Resolved by: ops-team@example.com

Complete Step Manually with Results

When you’ve manually performed the step’s work and need to provide results:

tasker-ctl task complete-step <TASK_UUID> <STEP_UUID> \
  --result '{"validated": true, "score": 95}' \
  --metadata '{"verification_method": "manual_review"}' \
  --reason "Manual verification after infrastructure fix" \
  --completed-by "ops-team@example.com"

# Example output:
# ✓ Step completed manually with results!
#   New state: complete
#   Reason: Manual verification after infrastructure fix
#   Completed by: ops-team@example.com

JSON Formatting Tips:

# Use single quotes around JSON to avoid shell escaping issues
--result '{"key": "value"}'

# For complex JSON, use a heredoc or file
--result "$(cat <<'EOF'
{
  "validation_status": "passed",
  "checks": ["auth", "permissions", "rate_limit"],
  "score": 100
}
EOF
)"

# Or read from a file
--result "$(cat result.json)"

Operational Runbooks

Runbook 1: Investigating High DLQ Count

Trigger: tasker_dlq_pending_investigations > 50

Steps:

  1. Check DLQ dashboard:
curl /v1/dlq/stats | jq
  2. Identify dominant reason:
{
  "dlq_reason": "staleness_timeout",
  "total_entries": 45,
  "pending": 45
}
  3. Get investigation queue:
curl /v1/dlq/investigation-queue?limit=10 | jq
  4. Check staleness monitoring:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "stale")'
  5. Identify patterns:
  • Common namespace?
  • Common task template?
  • Common time period?
  6. Take action:
  • Infrastructure issue? → Fix and manually resolve affected tasks
  • Template misconfiguration? → Update template thresholds
  • Worker unavailable? → Scale worker capacity
  • Systemic dependency issue? → Investigate upstream systems

Runbook 2: Proactive Staleness Prevention

Trigger: Regular monitoring (not incident-driven)

Steps:

  1. Monitor warning threshold:
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'
  2. Alert when warning count exceeds baseline:
if [ $warning_count -gt 10 ]; then
  alert "High staleness warning count: $warning_count tasks at 80%+ threshold"
fi
  3. Investigate early:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "warning") | {
  task_uuid,
  current_state,
  time_in_state_minutes,
  staleness_threshold_minutes,
  threshold_percentage: ((.time_in_state_minutes / .staleness_threshold_minutes) * 100)
}'
  4. Intervene before DLQ:
  • Check task steps for blockages
  • Review dependencies
  • Manually resolve if appropriate

Best Practices

For Operators

DO:

  • Use staleness monitoring for proactive prevention
  • Document investigation findings in DLQ resolution notes
  • Fix root causes, not just symptoms
  • Update DLQ investigation status promptly
  • Use step endpoints to resolve stuck workflows
  • Monitor DLQ statistics for systemic patterns

DON’T:

  • Don’t try to “requeue” from DLQ - fix the steps instead
  • Don’t ignore warning health status - investigate early
  • Don’t manually resolve steps without fixing root cause
  • Don’t leave DLQ investigations in pending status indefinitely

For Developers

DO:

  • Configure appropriate staleness thresholds per template
  • Make steps retryable with sensible backoff
  • Implement idempotent step handlers
  • Add defensive timeouts to prevent hanging
  • Test workflows under failure scenarios

DON’T:

  • Don’t set thresholds too low (causes false positives)
  • Don’t set thresholds too high (delays detection)
  • Don’t make all steps non-retryable
  • Don’t ignore DLQ patterns - they indicate design issues
  • Don’t rely on DLQ for normal workflow control flow

Testing

Test Coverage

Unit Tests: SQL function testing (17 tests)

  • Staleness detection logic
  • DLQ entry creation
  • Threshold calculation with template overrides
  • View query correctness

Integration Tests: Lifecycle testing (4 tests)

  • Waiting for dependencies staleness (test_dlq_lifecycle_waiting_for_dependencies_staleness)
  • Steps in process staleness (test_dlq_lifecycle_steps_in_process_staleness)
  • Proactive monitoring with health status progression (test_dlq_lifecycle_proactive_monitoring)
  • Complete investigation workflow (test_dlq_investigation_workflow)

Metrics Tests: OpenTelemetry integration (1 test)

  • Staleness detection metrics recording
  • DLQ investigation metrics recording
  • Pending investigations gauge query

Test Template: tests/fixtures/task_templates/rust/dlq_staleness_test.yaml

  • 2-step linear workflow
  • 2-minute staleness thresholds for fast test execution
  • Test-only template for lifecycle validation

Performance: All 22 tests complete in 0.95s (< 1s target)


Implementation Notes

File Locations:

  • Staleness detector: tasker-orchestration/src/orchestration/staleness_detector.rs
  • DLQ models: tasker-shared/src/models/orchestration/dlq.rs
  • SQL functions: migrations/20251122000004_add_dlq_discovery_function.sql
  • Database views: migrations/20251122000003_add_dlq_views.sql

Key Design Decisions:

  • Investigation tracking only - no task manipulation
  • Step-level resolution via existing step endpoints
  • Proactive monitoring at 80% threshold
  • Template-specific threshold overrides
  • Atomic DLQ entry creation with unique constraint
  • Time-ordered UUID v7 for investigation tracking

Future Enhancements

Potential improvements (not currently planned):

  1. DLQ Patterns Analysis

    • Machine learning to identify systemic issues
    • Automated root cause suggestions
    • Pattern clustering by namespace/template
  2. Advanced Alerting

    • Anomaly detection on staleness rates
    • Predictive DLQ entry forecasting
    • Correlation with infrastructure metrics
  3. Investigation Workflow

    • Automated triage rules
    • Escalation policies
    • Integration with incident management systems
  4. Performance Optimization

    • Materialized views for dashboard
    • Query result caching
    • Incremental staleness detection

End of Documentation

Handler Resolution Guide

Last Updated: 2026-01-08 Audience: Developers, Architects Status: Active Related Docs: Worker Event Systems | API Convergence Matrix

← Back to Guides


Overview

Handler resolution is the process of converting a callable address (a string in your YAML template) into an executable handler instance that can process workflow steps. The resolver chain pattern provides a flexible, extensible approach that works consistently across all language workers.

This guide covers:

  • The mental model for handler resolution
  • The common path for task templates
  • Built-in resolvers and how they work
  • Method dispatch for multi-method handlers
  • Writing custom resolvers
  • Cross-language considerations

Mental Model

Handler resolution uses three key concepts:

handler:
  callable: "PaymentProcessor"      # 1. Address: WHERE to find the handler
  method: "refund"                  # 2. Entry Point: WHICH method to invoke
  resolver: "explicit_mapping"      # 3. Resolution Hint: HOW to resolve

1. Address (callable)

The callable field is a logical address that identifies the handler. Think of it like a URL - it points to where the handler lives, but the format depends on your resolution strategy:

| Format | Example | Resolver |
| --- | --- | --- |
| Short name | payment_processor | ExplicitMappingResolver |
| Class path (Ruby) | PaymentHandlers::ProcessPaymentHandler | ClassConstantResolver |
| Module path (Python) | payment_handlers.ProcessPaymentHandler | ClassLookupResolver |
| Namespace path (TS) | PaymentHandlers.ProcessPaymentHandler | ClassLookupResolver |

2. Entry Point (method)

The method field specifies which method to invoke on the handler. This enables multi-method handlers - a single handler class that exposes multiple entry points:

# Default: calls the `call` method
handler:
  callable: payment_processor

# Explicit method: calls the `refund` method instead
handler:
  callable: payment_processor
  method: refund

When to use method dispatch:

  • Payment handlers with charge, refund, void methods
  • Validation handlers with validate_input, validate_output methods
  • CRUD handlers with create, read, update, delete methods

3. Resolution Hint (resolver)

The resolver field is an optional optimization that bypasses the resolver chain and goes directly to a specific resolver:

# Let the chain figure it out (default)
handler:
  callable: payment_processor

# Skip directly to explicit mapping (faster, explicit)
handler:
  callable: payment_processor
  resolver: explicit_mapping

When to use resolver hints:

  • Performance optimization for high-throughput steps
  • Explicit documentation of resolution strategy
  • Avoiding ambiguity when multiple resolvers could match

The Common Path

For most templates, you don’t need to think about resolution at all. The default resolution flow handles common cases automatically:

# Most common pattern - just specify the callable
steps:
  - name: process_payment
    handler:
      callable: process_payment  # Resolved by ExplicitMappingResolver
      initialization:
        timeout_ms: 5000

What happens under the hood:

  1. Worker receives step execution event
  2. HandlerDispatchService extracts the HandlerDefinition
  3. ResolverChain iterates through resolvers by priority
  4. ExplicitMappingResolver (priority 10) finds the registered handler
  5. Handler is invoked with call() method (default)

Resolver Chain Architecture

The resolver chain is an ordered list of resolvers, each with a priority. Lower priority numbers are checked first:

┌─────────────────────────────────────────────────────────────────┐
│                      ResolverChain                               │
│                                                                  │
│  ┌──────────────────────┐  ┌──────────────────────┐            │
│  │ ExplicitMapping      │  │ ClassConstant        │            │
│  │ Priority: 10         │──│ Priority: 100        │──► ...     │
│  │                      │  │                      │            │
│  │ "process_payment" ──►│  │ "Handlers::Payment"──►           │
│  │  Handler instance    │  │  constantize()       │            │
│  └──────────────────────┘  └──────────────────────┘            │
└─────────────────────────────────────────────────────────────────┘

Resolution Flow

HandlerDefinition
       │
       ▼
┌──────────────────┐
│ Has resolver     │──Yes──► Go directly to named resolver
│ hint?            │
└────────┬─────────┘
         │ No
         ▼
┌──────────────────┐
│ ExplicitMapping  │──can_resolve?──Yes──► Return handler
│ (priority 10)    │
└────────┬─────────┘
         │ No
         ▼
┌──────────────────┐
│ ClassConstant    │──can_resolve?──Yes──► Return handler
│ (priority 100)   │
└────────┬─────────┘
         │ No
         ▼
    ResolutionError

Built-in Resolvers

ExplicitMappingResolver (Priority 10)

The primary resolver for all workers. Handlers are registered with string keys at startup:

#![allow(unused)]
fn main() {
// Rust registration
registry.register("process_payment", Arc::new(ProcessPaymentHandler::new()));
}
# Ruby registration
registry.register("process_payment", ProcessPaymentHandler)
# Python registration
registry.register("process_payment", ProcessPaymentHandler)
// TypeScript registration
registry.register("process_payment", ProcessPaymentHandler);

When it resolves: When the callable exactly matches a registered key.

Best for:

  • Native Rust handlers (required - no runtime reflection)
  • Performance-critical handlers
  • Explicit, predictable resolution

Class Lookup Resolvers (Priority 100)

Dynamic language only (Ruby, Python, TypeScript). Interprets the callable as a class path and instantiates it at runtime.

Naming Note: Ruby uses ClassConstantResolver (Ruby terminology for classes). Python and TypeScript use ClassLookupResolver. The functionality is equivalent.

# Ruby: Uses Object.const_get (ClassConstantResolver)
handler:
  callable: PaymentHandlers::ProcessPaymentHandler

# Python: Uses importlib (ClassLookupResolver)
handler:
  callable: payment_handlers.ProcessPaymentHandler

# TypeScript: Uses dynamic import (ClassLookupResolver)
handler:
  callable: PaymentHandlers.ProcessPaymentHandler

When it resolves: When the callable looks like a class/module path (contains ::, ., or starts with uppercase).

Best for:

  • Convention-over-configuration setups
  • Handlers that don’t need explicit registration
  • Dynamic handler loading

Not available in Rust: Rust has no runtime reflection, so class lookup resolvers always return None. Use ExplicitMappingResolver instead.
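
A sketch of the "looks like a class path" heuristic above in Rust, assuming the documented rules are the whole check:

fn looks_like_class_path(callable: &str) -> bool {
    callable.contains("::")
        || callable.contains('.')
        || callable.chars().next().map_or(false, char::is_uppercase)
}

// looks_like_class_path("PaymentHandlers::ProcessPaymentHandler") == true
// looks_like_class_path("payment_handlers.ProcessPaymentHandler") == true
// looks_like_class_path("process_payment")                        == false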


Method Dispatch

Method dispatch allows a single handler to expose multiple entry points. This is useful for handlers that perform related operations:

Defining a Multi-Method Handler

# Ruby
class PaymentHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Default method - standard payment processing
  end

  def refund(context)
    # Refund-specific logic
  end

  def void(context)
    # Void-specific logic
  end
end
# Python
class PaymentHandler(StepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        # Default method
        pass

    def refund(self, context: StepContext) -> StepHandlerResult:
        # Refund-specific logic
        pass
// TypeScript
class PaymentHandler extends StepHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    // Default method
  }

  async refund(context: StepContext): Promise<StepHandlerResult> {
    // Refund-specific logic
  }
}
#![allow(unused)]
fn main() {
// Rust - requires explicit method routing
impl RustStepHandler for PaymentHandler {
    async fn call(&self, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        // Default method
    }

    async fn invoke_method(&self, method: &str, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        match method {
            "refund" => self.refund(step).await,
            "void" => self.void(step).await,
            _ => self.call(step).await,
        }
    }
}
}

Using Method Dispatch in Templates

steps:
  - name: process_refund
    handler:
      callable: payment_handler
      method: refund  # Invokes refund() instead of call()
      initialization:
        reason_required: true

How Method Dispatch Works

  1. Resolver chain resolves the handler from callable
  2. If method is specified and not “call”, a MethodDispatchWrapper is applied
  3. When invoked, the wrapper calls the specified method instead of call()
                    ┌───────────────────┐
HandlerDefinition ──│ ResolverChain     │── Handler
(method: "refund")  │                   │
                    └─────────┬─────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │ MethodDispatch    │
                    │ Wrapper           │
                    │                   │
                    │ inner.refund()    │
                    └───────────────────┘
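
A minimal Rust sketch of such a wrapper, reusing the RustStepHandler signatures shown earlier in this guide (the real wrapper's types, trait bounds, and error handling will differ):

use async_trait::async_trait;

// Sketch only: forwards non-default methods to invoke_method on the wrapped handler.
pub struct MethodDispatchWrapper<H> {
    inner: H,
    method: String,
}

#[async_trait]
impl<H: RustStepHandler + Send + Sync> RustStepHandler for MethodDispatchWrapper<H> {
    async fn call(&self, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        if self.method == "call" {
            // Default entry point: no dispatch needed.
            self.inner.call(step).await
        } else {
            // Route "refund", "void", etc. through the handler's invoke_method.
            self.inner.invoke_method(&self.method, step).await
        }
    }

    async fn invoke_method(&self, method: &str, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        self.inner.invoke_method(method, step).await
    }
}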

Writing Custom Resolvers

You can extend the resolver chain with custom resolution strategies for your domain.

Rust Custom Resolver

#![allow(unused)]
fn main() {
use tasker_shared::registry::{StepHandlerResolver, ResolutionContext, ResolvedHandler};
use async_trait::async_trait;

#[derive(Debug)]
pub struct ServiceDiscoveryResolver {
    service_registry: Arc<ServiceRegistry>,
}

#[async_trait]
impl StepHandlerResolver for ServiceDiscoveryResolver {
    fn resolver_name(&self) -> &str {
        "service_discovery"
    }

    fn priority(&self) -> u32 {
        50  // Between explicit (10) and class constant (100)
    }

    fn can_resolve(&self, definition: &HandlerDefinition) -> bool {
        // Resolve callables that start with "service://"
        definition.callable.starts_with("service://")
    }

    async fn resolve(
        &self,
        definition: &HandlerDefinition,
        context: &ResolutionContext,
    ) -> Result<Arc<dyn ResolvedHandler>, ResolutionError> {
        let service_name = definition.callable.strip_prefix("service://").unwrap();
        let handler = self.service_registry.lookup(service_name).await?;
        Ok(Arc::new(StepHandlerAsResolved::new(handler)))
    }
}
}

Ruby Custom Resolver

module TaskerCore
  module Registry
    class ServiceDiscoveryResolver < BaseResolver
      def resolver_name
        "service_discovery"
      end

      def priority
        50
      end

      def can_resolve?(definition)
        definition.callable.start_with?("service://")
      end

      def resolve(definition, context)
        service_name = definition.callable.delete_prefix("service://")
        handler_class = ServiceRegistry.lookup(service_name)
        handler_class.new
      end
    end
  end
end

Python Custom Resolver

from tasker_core.registry import BaseResolver, ResolutionError

class ServiceDiscoveryResolver(BaseResolver):
    def resolver_name(self) -> str:
        return "service_discovery"

    def priority(self) -> int:
        return 50

    def can_resolve(self, definition: HandlerDefinition) -> bool:
        return definition.callable.startswith("service://")

    async def resolve(
        self, definition: HandlerDefinition, context: ResolutionContext
    ) -> ResolvedHandler:
        service_name = definition.callable.removeprefix("service://")
        handler_class = self.service_registry.lookup(service_name)
        return handler_class()

TypeScript Custom Resolver

import { BaseResolver, HandlerDefinition, ResolutionContext } from './registry';

export class ServiceDiscoveryResolver extends BaseResolver {
  resolverName(): string {
    return 'service_discovery';
  }

  priority(): number {
    return 50;
  }

  canResolve(definition: HandlerDefinition): boolean {
    return definition.callable.startsWith('service://');
  }

  async resolve(
    definition: HandlerDefinition,
    context: ResolutionContext
  ): Promise<ResolvedHandler> {
    const serviceName = definition.callable.replace('service://', '');
    const HandlerClass = await this.serviceRegistry.lookup(serviceName);
    return new HandlerClass();
  }
}

Registering Custom Resolvers

#![allow(unused)]
fn main() {
// Rust
let mut chain = ResolverChain::new();
chain.register(Arc::new(ExplicitMappingResolver::new()));
chain.register(Arc::new(ServiceDiscoveryResolver::new(service_registry)));
chain.register(Arc::new(ClassConstantResolver::new()));
}
# Ruby
chain = TaskerCore::Registry::ResolverChain.new
chain.register(TaskerCore::Registry::ExplicitMappingResolver.new)
chain.register(ServiceDiscoveryResolver.new(service_registry))
chain.register(TaskerCore::Registry::ClassConstantResolver.new)

Cross-Language Considerations

Why Rust is Different

Rust has no runtime reflection, which affects handler resolution:

| Capability | Ruby/Python/TypeScript | Rust |
| --- | --- | --- |
| Class Lookup Resolver | ✅ Works | ❌ Always returns None |
| Method dispatch | ✅ Native (send, getattr) | ⚠️ Requires invoke_method |
| Dynamic handler loading | const_get, importlib | ❌ Must pre-register |

Best Practice for Rust:

  • Always use ExplicitMappingResolver with explicit registration
  • Implement invoke_method() for multi-method handlers
  • Use resolver hints (resolver: explicit_mapping) for clarity

Method Dispatch by Language

| Language | Default Method | Dynamic Dispatch |
| --- | --- | --- |
| Ruby | call | handler.public_send(method, context) |
| Python | call | getattr(handler, method)(context) |
| TypeScript | call | handler[method](context) |
| Rust | call | handler.invoke_method(method, step) |

Troubleshooting

“Handler not found” Error

Symptoms: ResolutionError: No resolver could resolve callable 'my_handler'

Causes:

  1. Handler not registered with ExplicitMappingResolver
  2. Class path typo (for ClassConstantResolver)
  3. Handler registered with different name than callable

Solutions:

#![allow(unused)]
fn main() {
// Verify registration
assert!(registry.is_registered("my_handler"));

// Check registered handlers
println!("{:?}", registry.list_handlers());
}

Method Not Found

Symptoms: MethodNotFound: Handler 'my_handler' does not respond to 'refund'

Causes:

  1. Method name typo in YAML template
  2. Method not defined on handler class
  3. Method is private (Ruby) or underscore-prefixed (Python)

Solutions:

# Verify method name matches exactly
handler:
  callable: payment_handler
  method: refund  # Must match method name in handler

Resolver Hint Ignored

Symptoms: Resolution works but seems slow, or wrong resolver is used

Causes:

  1. Resolver hint name doesn’t match any registered resolver
  2. Resolver with that name returns None for this callable

Solutions:

# Use exact resolver name
handler:
  callable: my_handler
  resolver: explicit_mapping  # Not "explicit" or "mapping"

Best Practices

1. Prefer Explicit Registration

# Good: Clear, predictable, works in all languages
handler:
  callable: process_payment

# Avoid: Relies on runtime class lookup, not portable to Rust
handler:
  callable: PaymentHandlers::ProcessPaymentHandler

2. Use Method Dispatch for Related Operations

# Good: Single handler, multiple entry points
steps:
  - name: validate_input
    handler:
      callable: validator
      method: validate_input

  - name: validate_output
    handler:
      callable: validator
      method: validate_output

# Avoid: Separate handlers for closely related operations
steps:
  - name: validate_input
    handler:
      callable: input_validator
  - name: validate_output
    handler:
      callable: output_validator

3. Document Resolution Strategy

# Good: Explicit about how resolution works
handler:
  callable: payment_processor
  resolver: explicit_mapping  # Self-documenting
  method: refund
  initialization:
    timeout_ms: 5000

4. Test Resolution in Isolation

#![allow(unused)]
fn main() {
#[test]
fn test_handler_resolution() {
    let chain = create_resolver_chain();
    let definition = HandlerDefinition::builder()
        .callable("process_payment")
        .build();

    assert!(chain.can_resolve(&definition));
}
}

Summary

| Concept | Purpose | Default |
| --- | --- | --- |
| callable | Handler address | Required |
| method | Entry point method | "call" |
| resolver | Resolution strategy hint | Chain iteration |
| ExplicitMappingResolver | Registered handlers | Priority 10 |
| ClassConstantResolver / ClassLookupResolver | Dynamic class lookup | Priority 100 |
| MethodDispatchWrapper | Multi-method support | Applied when method != "call" |

The resolver chain provides a flexible, extensible system for handler resolution that works consistently across all language workers while respecting each language’s capabilities.

Task Identity Strategy Pattern

Last Updated: 2026-01-20 Audience: Developers, Operators Status: Active Related Docs: Documentation Hub | Idempotency and Atomicity

← Back to Documentation Hub


Overview

Task identity determines how Tasker deduplicates task creation requests. The identity strategy pattern allows named tasks to configure their deduplication behavior based on domain requirements.

When a task creation request arrives, Tasker computes an identity hash based on the configured strategy. If a task with that identity hash already exists, the request is rejected with a 409 Conflict response.

Why This Matters

Task identity is domain-specific:

| Use Case | Same Template + Same Context | Desired Behavior |
| --- | --- | --- |
| Payment processing | Likely accidental duplicate | Deduplicate (safety) |
| Nightly batch job | Intentional repetition | Allow (operational) |
| Report generation | Could be either | Configurable |
| Event-driven triggers | Often intentional | Allow |
| Retry with same params | Intentional | Allow |

A TaskRequest with identical context might be:

  • An accidental duplicate (network retry, user double-click) → should deduplicate
  • An intentional repetition (scheduled job, legitimate re-run) → should allow

Identity Strategies

STRICT (Default)

identity_hash = hash(named_task_uuid, normalized_context)

Same named task + same context = same identity hash = deduplicated.

Use when:

  • Accidental duplicates are a risk (payments, orders, notifications)
  • Context fully describes the work to be done
  • Network retries or user double-clicks should be safe

Example:

#![allow(unused)]
fn main() {
// Payment processing - same payment_id should never create duplicate tasks
TaskRequest {
    namespace: "payments".to_string(),
    name: "process_payment".to_string(),
    context: json!({
        "payment_id": "PAY-12345",
        "amount": 100.00,
        "currency": "USD"
    }),
    idempotency_key: None,  // Uses STRICT strategy
    ..Default::default()
}
}

CALLER_PROVIDED

identity_hash = hash(named_task_uuid, idempotency_key)

Caller must provide idempotency_key. Request is rejected with 400 Bad Request if the key is missing.

Use when:

  • Caller has a natural idempotency key (order_id, transaction_id, request_id)
  • Caller needs control over deduplication scope
  • Similar to Stripe’s Idempotency-Key pattern

Example:

#![allow(unused)]
fn main() {
// Order processing - caller controls idempotency with their order ID
TaskRequest {
    namespace: "orders".to_string(),
    name: "fulfill_order".to_string(),
    context: json!({
        "order_id": "ORD-98765",
        "items": [...]
    }),
    idempotency_key: Some("ORD-98765".to_string()),  // Required for CallerProvided
    ..Default::default()
}
}

ALWAYS_UNIQUE

identity_hash = uuidv7()

Every request creates a new task. No deduplication.

Use when:

  • Every submission should create work (notifications, events)
  • Repetition is intentional (scheduled jobs, cron-like triggers)
  • Context doesn’t define uniqueness

Example:

#![allow(unused)]
fn main() {
// Notification sending - every call should send a notification
TaskRequest {
    namespace: "notifications".to_string(),
    name: "send_email".to_string(),
    context: json!({
        "user_id": 123,
        "template": "welcome",
        "data": {...}
    }),
    idempotency_key: None,  // ALWAYS_UNIQUE ignores this
    ..Default::default()
}
}

Configuration

Named Task Configuration

Set the identity strategy in your task template:

# templates/payments/process_payment.yaml
namespace: payments
name: process_payment
version: "1.0.0"
identity_strategy: strict  # strict | caller_provided | always_unique

steps:
  - name: validate_payment
    handler: payment_validator
    # ...

Per-Request Override

The idempotency_key field overrides any strategy:

#![allow(unused)]
fn main() {
// Even if named task is ALWAYS_UNIQUE, this key makes it deduplicate
TaskRequest {
    idempotency_key: Some("my-custom-key-12345".to_string()),
    // ... other fields
}
}

Precedence:

  1. idempotency_key (if provided) → always uses hash of key
  2. Named task’s identity_strategy → applies if no key provided
  3. Default → STRICT (if strategy not configured)
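
A sketch of this precedence in Rust; the strategy enum, stable_hash helper, and error handling are illustrative stand-ins (not the engine's code), and Uuid::now_v7 assumes the uuid crate's v7 feature:

use serde_json::Value;
use uuid::Uuid;

// Illustrative stand-ins for the engine's strategy enum and hashing.
enum IdentityStrategy { Strict, CallerProvided, AlwaysUnique }

fn stable_hash(named_task_uuid: &Uuid, material: &str) -> String {
    // Placeholder: the engine uses its own hashing over these inputs.
    format!("{}:{}", named_task_uuid, material)
}

fn identity_hash(
    named_task_uuid: &Uuid,
    normalized_context: &Value,
    idempotency_key: Option<&str>,
    strategy: &IdentityStrategy,
) -> Result<String, &'static str> {
    match (idempotency_key, strategy) {
        // 1. An explicit idempotency_key always wins, regardless of strategy.
        (Some(key), _) => Ok(stable_hash(named_task_uuid, key)),
        // 2. Otherwise the named task's configured strategy applies (STRICT is the default).
        (None, IdentityStrategy::Strict) => {
            Ok(stable_hash(named_task_uuid, &normalized_context.to_string()))
        }
        (None, IdentityStrategy::AlwaysUnique) => Ok(Uuid::now_v7().to_string()),
        // CallerProvided without a key is rejected upstream with 400 Bad Request.
        (None, IdentityStrategy::CallerProvided) => Err("idempotency_key is required"),
    }
}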

API Behavior

Successful Creation (201 Created)

{
  "task_uuid": "019bddae-b818-7d82-b7c5-bd42e5db27fc",
  "step_count": 4,
  "message": "Task created successfully"
}

Duplicate Identity (409 Conflict)

When a task with the same identity hash exists:

{
  "error": {
    "code": "CONFLICT",
    "message": "A task with this identity already exists. The task's identity strategy prevents duplicate creation."
  }
}

Security Note: The API returns 409 Conflict rather than the existing task’s UUID. This prevents potential data leakage where attackers could probe for existing task UUIDs by submitting requests with guessed contexts.

Missing Idempotency Key (400 Bad Request)

When CallerProvided strategy requires a key:

{
  "error": {
    "code": "BAD_REQUEST",
    "message": "idempotency_key is required when named task uses CallerProvided identity strategy"
  }
}

JSON Normalization

For STRICT strategy, the context JSON is normalized before hashing:

  • Key ordering: Keys are sorted alphabetically (recursively)
  • Whitespace: Removed for consistency
  • Semantic equivalence: {"b":2,"a":1} and {"a":1,"b":2} produce the same hash

This means these two requests produce the same identity hash:

#![allow(unused)]
fn main() {
// Request 1
context: json!({"user_id": 123, "action": "create"})

// Request 2 - same content, different key order
context: json!({"action": "create", "user_id": 123})
}

Note: Array order is preserved (arrays are ordered by definition).
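
A sketch of this normalization using serde_json (the engine's implementation may differ in detail, but the semantics match the rules above):

use serde_json::{json, Value};
use std::collections::BTreeMap;

// Objects are key-sorted recursively; array order is preserved.
fn normalize(value: &Value) -> Value {
    match value {
        Value::Object(map) => {
            // BTreeMap iterates in key order, which sorts the object keys.
            let sorted: BTreeMap<String, Value> = map
                .iter()
                .map(|(k, v)| (k.clone(), normalize(v)))
                .collect();
            Value::Object(sorted.into_iter().collect())
        }
        Value::Array(items) => Value::Array(items.iter().map(normalize).collect()),
        other => other.clone(),
    }
}

fn main() {
    let a = json!({"user_id": 123, "action": "create"});
    let b = json!({"action": "create", "user_id": 123});
    // Both serialize identically after normalization, so they hash identically.
    assert_eq!(normalize(&a).to_string(), normalize(&b).to_string());
}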

Pattern 1: Time-Bucketed Keys

For deduplication within a time window but allowing repetition across windows:

#![allow(unused)]
fn main() {
// Dedupe within same hour, allow across hours
let hour_bucket = chrono::Utc::now().format("%Y-%m-%d-%H");
let idempotency_key = format!("{}-{}-{}", job_name, customer_id, hour_bucket);

TaskRequest {
    namespace: "reports".to_string(),
    name: "generate_report".to_string(),
    context: json!({ "customer_id": 12345 }),
    idempotency_key: Some(idempotency_key),
    ..Default::default()
}
}

Pattern 2: Time-Aware Context

Include scheduling context directly in the request:

#![allow(unused)]
fn main() {
TaskRequest {
    namespace: "batch".to_string(),
    name: "daily_reconciliation".to_string(),
    context: json!({
        "account_id": "ACC-001",
        "run_date": "2026-01-20",      // Changes daily
        "run_window": "morning"         // Optional: finer granularity
    }),
    ..Default::default()
}
}

Granularity Guide

| Dedup Window | Key/Context Pattern | Use Case |
| --- | --- | --- |
| Per-minute | {job}-{YYYY-MM-DD-HH-mm} | High-frequency event processing |
| Per-hour | {job}-{YYYY-MM-DD-HH} | Hourly reports, rate-limited APIs |
| Per-day | {job}-{YYYY-MM-DD} | Daily batch jobs, EOD processing |
| Per-week | {job}-{YYYY-Www} | Weekly aggregations |
| Per-month | {job}-{YYYY-MM} | Monthly billing cycles |

Anti-Patterns

Don’t Rely on Timing

#![allow(unused)]
fn main() {
// BAD: Hoping requests are "far enough apart"
TaskRequest { context: json!({ "customer_id": 123 }) }
}

Don’t Use ALWAYS_UNIQUE for Critical Operations

#![allow(unused)]
fn main() {
// BAD: Creates duplicate work on network retries
// Named task with AlwaysUnique for payment processing
}

Do Make Identity Explicit

#![allow(unused)]
fn main() {
// GOOD: Clear what makes this task unique
TaskRequest {
    context: json!({
        "payment_id": "PAY-123",  // Natural idempotency key
        "amount": 100
    }),
    ..Default::default()
}
}

Database Implementation

The identity strategy is enforced at the database level:

  1. UNIQUE constraint on identity_hash column prevents duplicates
  2. identity_strategy column on named_tasks stores the configured strategy
  3. Atomic insertion with constraint violation returns 409 Conflict
-- Identity hash has unique constraint
CREATE UNIQUE INDEX idx_tasks_identity_hash ON tasker.tasks(identity_hash);

-- Named tasks store their strategy
ALTER TABLE tasker.named_tasks
ADD COLUMN identity_strategy VARCHAR(20) DEFAULT 'strict';

Testing Considerations

When writing tests that create tasks, inject a unique identifier to avoid identity hash collisions:

#![allow(unused)]
fn main() {
// Test utility that ensures unique identity per test run
fn create_task_request(namespace: &str, name: &str, context: Value) -> TaskRequest {
    let mut ctx = context.as_object().cloned().unwrap_or_default();
    ctx.insert("_test_run_id".to_string(), json!(Uuid::now_v7().to_string()));

    TaskRequest {
        namespace: namespace.to_string(),
        name: name.to_string(),
        context: Value::Object(ctx),
        ..Default::default()
    }
}
}

Summary

| Strategy | Identity Hash | Deduplicates? | Key Required? |
| --- | --- | --- | --- |
| STRICT | hash(uuid, context) | Yes | No |
| CALLER_PROVIDED | hash(uuid, key) | Yes | Yes |
| ALWAYS_UNIQUE | uuidv7() | No | No |

Choose STRICT (default) unless you have a specific reason not to. It’s the safest option for preventing accidental duplicate task creation.

Quick Start Guide

Last Updated: 2025-10-10 Audience: Developers Status: Active Time to Complete: 5 minutes Related Docs: Documentation Hub | Use Cases | Crate Architecture

← Back to Documentation Hub


Get Tasker Core Running in 5 Minutes

This guide will get you from zero to running your first workflow in under 5 minutes using Docker Compose.

Prerequisites

Before starting, ensure you have:

  • Docker and Docker Compose installed
  • Git to clone the repository
  • curl for testing (or any HTTP client)

That’s it! Docker Compose handles all the complexity.


Step 1: Clone and Start Services (2 minutes)

# Clone the repository
git clone https://github.com/tasker-systems/tasker-core
cd tasker-core

# Start PostgreSQL (includes PGMQ extension for default messaging)
docker-compose up -d postgres

# Wait for PostgreSQL to be ready (about 10 seconds)
docker-compose logs -f postgres
# Press Ctrl+C when you see "database system is ready to accept connections"

# Set DATABASE_URL and verify the database connection
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
docker-compose exec postgres psql -U tasker -d tasker_rust_test -c "SELECT 1"  # Verify connection

# Start orchestration server and workers
docker-compose --profile server up -d

# Verify all services are healthy
docker-compose ps

You should see:

NAME                     STATUS              PORTS
tasker-postgres          Up (healthy)        5432
tasker-orchestration     Up (healthy)        0.0.0.0:8080->8080/tcp
tasker-worker            Up (healthy)        0.0.0.0:8081->8081/tcp
tasker-ruby-worker       Up (healthy)        0.0.0.0:8082->8082/tcp

Step 2: Verify Services (30 seconds)

Check that all services are responding:

# Check orchestration health
curl http://localhost:8080/health

# Expected response:
# {
#   "status": "healthy",
#   "database": "connected",
#   "message_queue": "operational"
# }

# Check Rust worker health
curl http://localhost:8081/health

# Check Ruby worker health (if started)
curl http://localhost:8082/health

Step 3: Create Your First Task (1 minute)

Now let’s create a simple linear workflow with 4 steps:

# Create a task using the linear_workflow template
curl -X POST http://localhost:8080/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "template_name": "linear_workflow",
    "namespace": "rust_e2e_linear",
    "configuration": {
      "test_value": "hello_world"
    }
  }'

Response:

{
  "task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
  "status": "pending",
  "namespace": "rust_e2e_linear",
  "created_at": "2025-10-10T12:00:00Z"
}

Save the task_uuid from the response! You’ll need it to check the task status.


Step 4: Monitor Task Execution (1 minute)

Watch your workflow execute in real-time:

# Replace {task_uuid} with your actual task UUID
TASK_UUID="01234567-89ab-cdef-0123-456789abcdef"

# Check task status
curl http://localhost:8080/v1/tasks/${TASK_UUID}

Initial Response (task just created):

{
  "task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
  "current_state": "initializing",
  "total_steps": 4,
  "completed_steps": 0,
  "namespace": "rust_e2e_linear"
}

Wait a few seconds and check again:

# Check again after a few seconds
curl http://localhost:8080/v1/tasks/${TASK_UUID}

Final Response (task completed):

{
  "task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
  "current_state": "complete",
  "total_steps": 4,
  "completed_steps": 4,
  "namespace": "rust_e2e_linear",
  "completed_at": "2025-10-10T12:00:05Z",
  "duration_ms": 134
}

Congratulations! 🎉 You’ve just executed your first workflow with Tasker Core!


What Just Happened?

Let’s break down what happened in those ~100-150ms:

1. Orchestration received task creation request
   ↓
2. Task initialized with "linear_workflow" template
   ↓
3. 4 workflow steps created with dependencies:
   - mathematical_add (no dependencies)
   - mathematical_multiply (depends on add)
   - mathematical_subtract (depends on multiply)
   - mathematical_divide (depends on subtract)
   ↓
4. Orchestration discovered step 1 was ready
   ↓
5. Step 1 enqueued to "rust_e2e_linear" namespace queue
   ↓
6. Worker claimed and executed step 1
   ↓
7. Worker sent result back to orchestration
   ↓
8. Orchestration processed result, discovered step 2
   ↓
9. Steps 2, 3, 4 executed sequentially (due to dependencies)
   ↓
10. All steps complete → Task marked "complete"

Key Observations:

  • Each step executed by autonomous workers
  • Steps executed in dependency order automatically
  • Complete workflow: ~130-150ms (including all coordination)
  • All state changes recorded in audit trail

View Detailed Task Information

Get complete task execution details:

# Get full task details including steps
curl http://localhost:8080/v1/tasks/${TASK_UUID}/details

Response includes:

{
  "task": {
    "task_uuid": "...",
    "current_state": "complete",
    "namespace": "rust_e2e_linear"
  },
  "steps": [
    {
      "name": "mathematical_add",
      "current_state": "complete",
      "result": {"value": 15},
      "duration_ms": 12
    },
    {
      "name": "mathematical_multiply",
      "current_state": "complete",
      "result": {"value": 30},
      "duration_ms": 8
    },
    // ... remaining steps
  ],
  "state_transitions": [
    {
      "from_state": null,
      "to_state": "pending",
      "timestamp": "2025-10-10T12:00:00.000Z"
    },
    {
      "from_state": "pending",
      "to_state": "initializing",
      "timestamp": "2025-10-10T12:00:00.050Z"
    },
    // ... complete transition history
  ]
}

Try a More Complex Workflow

Now try the diamond workflow pattern (parallel execution):

curl -X POST http://localhost:8080/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "template_name": "diamond_workflow",
    "namespace": "rust_e2e_diamond",
    "configuration": {
      "test_value": "parallel_test"
    }
  }'

Diamond pattern:

        step_1 (root)
       /            \
   step_2          step_3    ← Execute in PARALLEL
       \            /
        step_4 (join)

Steps 2 and 3 execute simultaneously because they both depend only on step 1!


View Logs

See what’s happening inside the services:

# Orchestration logs
docker-compose logs -f orchestration

# Worker logs
docker-compose logs -f worker

# All logs
docker-compose logs -f

Key log patterns to look for:

  • Task initialized: task_uuid=... - Task created
  • Step enqueued: step_uuid=... - Step sent to worker
  • Step claimed: step_uuid=... - Worker picked up step
  • Step completed: step_uuid=... - Step finished successfully
  • Task finalized: task_uuid=... - Workflow complete

Explore the API

List All Tasks

curl http://localhost:8080/v1/tasks

Get Task Execution Context

curl http://localhost:8080/v1/tasks/${TASK_UUID}/context

View Available Templates

curl http://localhost:8080/v1/templates

Check System Health

curl http://localhost:8080/health/detailed

Next Steps

1. Understand What You Just Built

Read about the architecture:

2. See Real-World Examples

Explore practical use cases:

  • Use Cases and Patterns - E-commerce, payments, ETL, microservices
  • See example templates in: tests/fixtures/task_templates/

3. Create Your Own Workflow

Option A: Rust Handler (Native Performance)

#![allow(unused)]
fn main() {
// workers/rust/src/handlers/my_handler.rs
pub struct MyCustomHandler;

#[async_trait]
impl StepHandler for MyCustomHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        // Your business logic here
        let input: String = context.configuration.get("input")?;

        let result = process_data(&input).await?;

        Ok(StepResult::success(json!({
            "output": result
        })))
    }
}
}

Option B: Ruby Handler (via FFI)

# workers/ruby/app/tasker/tasks/templates/my_workflow/handlers/my_handler.rb
class MyHandler < TaskerCore::StepHandler
  def execute(context)
    input = context.configuration['input']

    result = process_data(input)

    { success: true, output: result }
  end
end

Define Your Workflow Template

# tests/fixtures/task_templates/rust/my_workflow.yaml
namespace: my_namespace
name: my_workflow
version: "1.0"

steps:
  - name: my_step
    handler: my_handler
    dependencies: []
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      backoff_base_ms: 1000

4. Deploy to Production

Learn about deployment:

5. Run Tests Locally

# Build the workspace
cargo build --all-features

# Run all tests
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo test --all-features

# Run benchmarks
cargo bench --all-features

Troubleshooting

Services Won’t Start

# Check Docker service status
docker-compose ps

# View service logs
docker-compose logs postgres
docker-compose logs orchestration

# Restart services
docker-compose restart

# Clean restart
docker-compose down
docker-compose up -d

Task Stays in “pending” or “initializing”

Possible causes:

  1. Template not found - Check available templates: curl http://localhost:8080/v1/templates
  2. Worker not running - Check worker status: curl http://localhost:8081/health
  3. Database connection issue - Check logs: docker-compose logs postgres

Solution:

# Verify template exists
curl http://localhost:8080/v1/templates | jq '.[] | select(.name == "linear_workflow")'

# Restart workers
docker-compose restart worker

# Check orchestration logs for errors
docker-compose logs orchestration | grep ERROR

“Connection refused” Errors

Cause: Services not fully started yet

Solution: Wait 10-15 seconds after docker-compose up, then check health:

curl http://localhost:8080/health

PostgreSQL Connection Issues

# Verify PostgreSQL is running
docker-compose ps postgres

# Test connection
docker-compose exec postgres psql -U tasker -d tasker_rust_test -c "SELECT 1"

# View PostgreSQL logs
docker-compose logs postgres | tail -50

Cleanup

When you’re done exploring:

# Stop all services
docker-compose down

# Stop and remove volumes (cleans database)
docker-compose down -v

# Remove all Docker resources (complete cleanup)
docker-compose down -v
docker system prune -f

Summary

You’ve successfully:

  • ✅ Started Tasker Core services with Docker Compose
  • ✅ Created and executed a linear workflow
  • ✅ Monitored task execution in real-time
  • ✅ Viewed detailed task and step information
  • ✅ Explored the REST API

Total time: ~5 minutes from zero to working workflow! 🚀


Getting Help


← Back to Documentation Hub

Next: Use Cases and Patterns | Crate Architecture

Retry Semantics: max_attempts and retryable

Last Updated: 2025-10-10 Audience: Developers Status: Active Related Docs: Documentation Hub | Bug Report: Retry Eligibility Logic | States and Lifecycles

← Back to Documentation Hub


Overview

The Tasker orchestration system uses two configuration fields to control step execution and retry behavior:

  1. max_attempts: Maximum number of total execution attempts (including first execution)
  2. retryable: Whether the step can be retried after failure

Semantic Definitions

max_attempts

Definition: The maximum number of times a step can be attempted, including the first execution.

This is NOT “number of retries” - it’s total attempts:

  • max_attempts=0: Likely a configuration error - the first execution is still allowed (attempts = 0 is always eligible), but no retries follow
  • max_attempts=1: Exactly one attempt (no retries after failure)
  • max_attempts=3: First attempt + up to 2 retries = 3 total attempts

Implementation: SQL formula attempts < max_attempts where attempts starts at 0.

retryable

Definition: Whether a step can be retried after the first execution fails.

Important: The retryable flag does NOT affect the first execution attempt:

  • First execution (attempts=0): Always eligible regardless of retryable setting
  • Retry attempts (attempts>0): Require retryable=true

Configuration Examples

Single Execution, No Retries

retry:
  retryable: false
  max_attempts: 1  # First attempt only
  backoff: exponential

Behavior:

| attempts | retry_eligible | Outcome |
|---|---|---|
| 0 | ✅ true | First execution allowed |
| 1 | ❌ false | No retries (retryable=false) |

Use Case: Idempotent operations that should not retry (e.g., record creation with unique constraints)

Multiple Attempts with Retries

retry:
  retryable: true
  max_attempts: 3  # First attempt + 2 retries
  backoff: exponential
  backoff_base_ms: 1000

Behavior:

| attempts | retry_eligible | Outcome |
|---|---|---|
| 0 | ✅ true | First execution allowed |
| 1 | ✅ true | First retry allowed (1 < 3) |
| 2 | ✅ true | Second retry allowed (2 < 3) |
| 3 | ❌ false | Max attempts exhausted (3 >= 3) |

Use Case: External API calls that might have transient failures

Effectively Unlimited Retries

retry:
  retryable: true
  max_attempts: 999999
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 300000  # Cap at 5 minutes

Behavior: Will retry until external intervention (task cancellation, system restart)

Use Case: Critical operations that must eventually succeed (use with caution!)

Retry Eligibility Logic

SQL Implementation

From migrations/20251006000000_fix_retry_eligibility_logic.sql:

-- retry_eligible calculation
(
  COALESCE(ws.attempts, 0) = 0  -- First attempt always eligible
  OR (
    COALESCE(ws.retryable, true) = true  -- Must be retryable for retries
    AND COALESCE(ws.attempts, 0) < COALESCE(ws.max_attempts, 3)
  )
) as retry_eligible

Decision Tree

Is attempts = 0?
├─ YES → retry_eligible = TRUE (first execution)
└─ NO  → Is retryable = true?
    ├─ YES → Is attempts < max_attempts?
    │   ├─ YES → retry_eligible = TRUE (retry allowed)
    │   └─ NO  → retry_eligible = FALSE (max attempts exhausted)
    └─ NO  → retry_eligible = FALSE (retries disabled)
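The same logic can be expressed as a small predicate. This is an illustrative sketch that mirrors the SQL expression above; the function name is hypothetical, not part of the engine's API.

fn retry_eligible(attempts: u32, retryable: bool, max_attempts: u32) -> bool {
    // First attempt is always eligible; retries require retryable=true and remaining headroom.
    attempts == 0 || (retryable && attempts < max_attempts)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn boundary_conditions() {
        assert!(retry_eligible(0, false, 1));  // first execution ignores retryable
        assert!(!retry_eligible(1, false, 3)); // retries disabled
        assert!(retry_eligible(2, true, 3));   // 2 < 3, retry allowed
        assert!(!retry_eligible(3, true, 3));  // max attempts exhausted
        assert!(retry_eligible(0, true, 0));   // max_attempts=0 still allows the first execution
    }
}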

Edge Cases

max_attempts=0

retry:
  max_attempts: 0

Behavior: First execution is still allowed (attempts=0 is always eligible), but no attempts after that (0 < 0 is false), so the step can never retry

Status: ⚠️ Configuration error - likely unintended

Recommendation: Use max_attempts: 1 for single execution

retryable=false with max_attempts > 1

retry:
  retryable: false
  max_attempts: 3  # Only first attempt will execute

Behavior: First execution allowed, but no retries regardless of max_attempts

Effective Result: Same as max_attempts: 1

Recommendation: Set max_attempts: 1 when retryable: false for clarity

Historical Context

Why “max_attempts” instead of “retry_limit”?

The original field name retry_limit was semantically confusing:

Old Interpretation (incorrect):

  • retry_limit=1 → “1 retry allowed” → 2 total attempts?
  • retry_limit=0 → “0 retries” → 1 attempt or blocked?

New Interpretation (clear):

  • max_attempts=1 → “1 total attempt” → exactly 1 execution
  • max_attempts=0 → “0 attempts” → clearly invalid

Migration Timeline

  • Original: retry_limit field with ambiguous semantics
  • 2025-10-05: Bug discovered - retry_limit=0 blocked all execution
  • 2025-10-06: Fixed SQL logic + renamed to max_attempts
  • 2025-10-06: Added 6 SQL boundary tests for edge cases

Testing

Boundary Condition Tests

See tests/integration/sql_functions/retry_boundary_tests.rs for comprehensive coverage:

  1. test_max_attempts_zero_allows_first_execution - Edge case handling
  2. test_max_attempts_zero_blocks_after_first - Exhaustion after first
  3. test_max_attempts_one_semantics - Single execution semantics
  4. test_max_attempts_three_progression - Standard retry progression
  5. test_first_attempt_ignores_retryable_flag - First execution independence
  6. test_retries_require_retryable_true - Retry flag enforcement

All tests passing as of 2025-10-06.

Best Practices

For Single-Execution Steps

retry:
  retryable: false
  max_attempts: 1
  backoff: exponential  # Ignored, but required for schema

Why: Makes intent crystal clear - execute once, never retry

For Transient Failure Tolerance

retry:
  retryable: true
  max_attempts: 3
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 30000

Why: Reasonable retry count with exponential backoff prevents thundering herd

For Critical Operations

retry:
  retryable: true
  max_attempts: 10
  backoff: exponential
  backoff_base_ms: 5000
  max_backoff_ms: 300000  # 5 minutes

Why: More attempts with longer backoff for operations that must succeed
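For intuition, exponential backoff with a ceiling doubles the delay per attempt until it hits the cap. The sketch below is illustrative only; the engine's exact formula (and any jitter it applies) may differ, and the function name is hypothetical.

fn backoff_delay_ms(attempt: u32, base_ms: u64, max_backoff_ms: u64) -> u64 {
    // attempt 1 -> base, attempt 2 -> base*2, attempt 3 -> base*4, ... capped at max_backoff_ms
    let exponent = attempt.saturating_sub(1).min(20); // clamp to avoid shift overflow
    base_ms.saturating_mul(1u64 << exponent).min(max_backoff_ms)
}

// With backoff_base_ms=5000 and max_backoff_ms=300000:
// attempts 1-6 wait 5s, 10s, 20s, 40s, 80s, 160s; from attempt 7 onward the delay is capped at 300s.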


Questions or Issues? See the test suite for comprehensive examples, or consult the bug report for historical context.

Use Cases and Patterns

Last Updated: 2025-10-10 Audience: Developers, Architects, Product Managers Status: Active Related Docs: Documentation Hub | Quick Start | Crate Architecture

← Back to Documentation Hub


Overview

This guide provides practical examples of when and how to use Tasker Core for workflow orchestration. Each use case includes architectural patterns, example workflows, and implementation guidance based on real-world scenarios.


Table of Contents

  1. E-Commerce Order Fulfillment
  2. Payment Processing Pipeline
  3. Data Transformation ETL
  4. Microservices Orchestration
  5. Scheduled Job Coordination
  6. Conditional Workflows and Decision Points
  7. Anti-Patterns

E-Commerce Order Fulfillment

Problem Statement

An e-commerce platform needs to coordinate multiple steps when processing orders:

  • Validate order details and inventory
  • Reserve inventory and process payment (parallel)
  • Ship order after both payment and inventory confirmed
  • Send confirmation emails
  • Handle failures gracefully with retries

Why Tasker Core?

  • Complex Dependencies: Steps have clear dependency relationships
  • Parallel Execution: Payment and inventory can happen simultaneously
  • Retry Logic: External API calls need retry with backoff
  • Audit Trail: Complete history needed for compliance
  • Idempotency: Steps must handle duplicate executions safely

Workflow Structure

Task: order_fulfillment_#{order_id}
  Priority: Based on order value and customer tier
  Namespace: fulfillment

  Steps:
    1. validate_order
       - Handler: ValidateOrderHandler
       - Dependencies: None (root step)
       - Retry: retryable=true, max_attempts=3
       - Validates order data, checks fraud

    2. check_inventory
       - Handler: InventoryCheckHandler
       - Dependencies: validate_order (must complete)
       - Retry: retryable=true, max_attempts=5
       - Queries inventory system

    3. reserve_inventory
       - Handler: InventoryReservationHandler
       - Dependencies: check_inventory
       - Retry: retryable=true, max_attempts=3
       - Reserves stock with timeout

    4. process_payment
       - Handler: PaymentProcessingHandler
       - Dependencies: validate_order
       - Retry: retryable=true, max_attempts=3
       - Charges customer payment method
       - **Runs in parallel with reserve_inventory**

    5. ship_order
       - Handler: ShippingHandler
       - Dependencies: reserve_inventory AND process_payment
       - Retry: retryable=false, max_attempts=1
       - Creates shipping label, schedules pickup

    6. send_confirmation
       - Handler: EmailNotificationHandler
       - Dependencies: ship_order
       - Retry: retryable=true, max_attempts=10
       - Sends confirmation email to customer

Implementation Pattern

Task Template (YAML configuration):

namespace: fulfillment
name: order_fulfillment
version: "1.0"

steps:
  - name: validate_order
    handler: validate_order
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      backoff_base_ms: 1000

  - name: check_inventory
    handler: check_inventory
    dependencies:
      - validate_order
    retry:
      retryable: true
      max_attempts: 5
      backoff: exponential
      backoff_base_ms: 2000

  # ... remaining steps

Step Handler (Rust implementation):

#![allow(unused)]
fn main() {
pub struct ValidateOrderHandler;

#[async_trait]
impl StepHandler for ValidateOrderHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        // Extract order data from context
        let order_id: String = context.configuration.get("order_id")?;
        let customer_id: String = context.configuration.get("customer_id")?;

        // Validate order
        let order = validate_order_data(&order_id).await?;

        // Check fraud detection
        if check_fraud_risk(&customer_id, &order).await? {
            return Ok(StepResult::permanent_failure(
                "fraud_detected",
                json!({"reason": "High fraud risk"})
            ));
        }

        // Success - pass data to next steps
        Ok(StepResult::success(json!({
            "order_id": order_id,
            "validated_at": Utc::now(),
            "total_amount": order.total
        })))
    }
}
}

Ruby Handler Alternative:

class ProcessPaymentHandler < TaskerCore::StepHandler
  def execute(context)
    order_id = context.configuration['order_id']
    amount = context.configuration['amount']

    # Process payment via payment gateway
    result = PaymentGateway.charge(
      amount: amount,
      idempotency_key: context.step_uuid
    )

    if result.success?
      { success: true, transaction_id: result.transaction_id }
    else
      # Retryable failure with backoff
      { success: false, retryable: true, error: result.error }
    end
  rescue PaymentGateway::NetworkError => e
    # Transient error, retry
    { success: false, retryable: true, error: e.message }
  rescue PaymentGateway::CardDeclined => e
    # Permanent failure, don't retry
    { success: false, retryable: false, error: e.message }
  end
end

Key Patterns

1. Parallel Execution

  • reserve_inventory and process_payment both depend only on earlier steps
  • Tasker automatically executes them in parallel
  • ship_order waits for both to complete

2. Idempotent Handlers

  • Use step_uuid as idempotency key for external APIs
  • Check if operation already completed before retrying
  • Handle duplicate executions gracefully

3. Smart Retry Logic

  • Network errors → retryable with exponential backoff
  • Business logic failures → permanent, no retry
  • Configure max_attempts based on criticality

4. Data Flow

  • Early steps provide data to later steps via results
  • Access parent results: context.parent_results["validate_order"]
  • Build context as workflow progresses
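For example, a downstream handler can read an upstream step's output before doing its own work. The sketch below follows the shapes used in the examples in this guide (StepContext, parent_results, StepResult); it is illustrative rather than a verbatim engine API, and create_shipping_label is a hypothetical helper.

async fn ship_order(context: &StepContext) -> Result<StepResult> {
    // Pull the data produced by the validate_order step
    let validation = context.parent_results.get("validate_order")
        .ok_or("missing validate_order result")?;

    let order_id = validation["order_id"].as_str().unwrap_or_default();
    let total = validation["total_amount"].as_f64().unwrap_or(0.0);

    // create_shipping_label is a placeholder for your shipping integration
    let tracking_number = create_shipping_label(order_id, total).await?;

    Ok(StepResult::success(json!({
        "order_id": order_id,
        "tracking_number": tracking_number
    })))
}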

Observability

Monitor these metrics for order fulfillment:

#![allow(unused)]
fn main() {
// Track order processing stages
metrics::counter!("orders.validated").increment(1);
metrics::counter!("orders.payment_processed").increment(1);
metrics::counter!("orders.shipped").increment(1);

// Track failures by reason
metrics::counter!("orders.failed", "reason" => "fraud").increment(1);
metrics::counter!("orders.failed", "reason" => "inventory").increment(1);

// Track timing
metrics::histogram!("order.fulfillment_time_ms").record(elapsed_ms);
}

Payment Processing Pipeline

Problem Statement

A fintech platform needs to process payments with strict requirements:

  • Multiple payment methods (card, bank transfer, wallet)
  • Regulatory compliance and audit trails
  • Automatic retry for transient failures
  • Reconciliation with accounting system
  • Webhook notifications to customers

Why Tasker Core?

  • Compliance: Complete audit trail with state transitions
  • Reliability: Automatic retry with configurable limits
  • Observability: Detailed metrics for financial operations
  • Idempotency: Prevent duplicate charges
  • Flexibility: Support multiple payment flows

Workflow Structure

Task: payment_processing_#{payment_id}
  Namespace: payments
  Priority: High (financial operations)

  Steps:
    1. validate_payment_request
       - Verify payment details
       - Check account status
       - Validate payment method

    2. check_fraud
       - Run fraud detection
       - Verify transaction limits
       - Check velocity rules

    3. authorize_payment
       - Contact payment gateway
       - Reserve funds (authorization hold)
       - Return authorization code

    4. capture_payment (depends on authorize_payment)
       - Capture authorized funds
       - Handle settlement
       - Generate receipt

    5. record_transaction (depends on capture_payment)
       - Write to accounting ledger
       - Update customer balance
       - Create audit records

    6. send_notification (depends on record_transaction)
       - Send webhook to merchant
       - Send receipt to customer
       - Update payment status

Implementation Highlights

Retry Strategy for Payment Gateway:

#![allow(unused)]
fn main() {
impl StepHandler for AuthorizePaymentHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        let payment_id = context.configuration.get("payment_id")?;

        match gateway.authorize(payment_id, &context.step_uuid).await {
            Ok(auth) => {
                Ok(StepResult::success(json!({
                    "authorization_code": auth.code,
                    "authorized_at": Utc::now(),
                    "gateway_transaction_id": auth.transaction_id
                })))
            }

            Err(GatewayError::NetworkTimeout) => {
                // Transient - retry with backoff
                Ok(StepResult::retryable_failure(
                    "network_timeout",
                    json!({"retry_recommended": true})
                ))
            }

            Err(GatewayError::InsufficientFunds) => {
                // Permanent - don't retry
                Ok(StepResult::permanent_failure(
                    "insufficient_funds",
                    json!({"requires_manual_intervention": false})
                ))
            }

            Err(GatewayError::InvalidCard) => {
                // Permanent - don't retry
                Ok(StepResult::permanent_failure(
                    "invalid_card",
                    json!({"requires_manual_intervention": true})
                ))
            }
        }
    }
}
}

Idempotency Pattern:

#![allow(unused)]
fn main() {
async fn capture_payment(context: &StepContext) -> Result<StepResult> {
    let idempotency_key = context.step_uuid.to_string();

    // Check if we already captured this payment
    if let Some(existing) = check_existing_capture(&idempotency_key).await? {
        return Ok(StepResult::success(json!({
            "already_captured": true,
            "transaction_id": existing.transaction_id,
            "note": "Idempotent duplicate detected"
        })));
    }

    // Proceed with capture
    let result = gateway.capture(&idempotency_key).await?;

    // Store idempotency record
    store_capture_record(&idempotency_key, &result).await?;

    Ok(StepResult::success(json!(result)))
}
}

Key Patterns

1. Two-Phase Commit

  • Authorize (reserve) → Capture (settle)
  • Allows cancellation between phases
  • Common in payment processing

2. Audit Trail

  • Every state transition recorded
  • Regulatory compliance built-in
  • Forensic investigation support

3. Circuit Breaking

  • Protect against payment gateway failures
  • Automatic backoff when gateway degraded
  • Fallback to alternate gateways

Data Transformation ETL

Problem Statement

A data analytics platform needs to process data through multiple transformation stages:

  • Extract data from multiple sources (APIs, databases, files)
  • Transform data (clean, enrich, aggregate)
  • Load to data warehouse
  • Handle large datasets with partitioning
  • Retry transient failures, skip corrupted data

Why Tasker Core?

  • DAG Execution: Complex transformation pipelines
  • Parallel Processing: Independent partitions processed concurrently
  • Error Handling: Skip corrupted records, retry transient failures
  • Observability: Track data quality and processing metrics
  • Scheduling: Integrate with cron/scheduler for periodic runs

Workflow Structure

Task: etl_customer_data_#{date}
  Namespace: data_pipeline

  Steps:
    1. extract_customer_profiles
       - Fetch from customer database
       - Partition by customer_id ranges
       - Creates multiple output partitions

    2. extract_transaction_history
       - Fetch from transactions database
       - Runs in parallel with extract_customer_profiles
       - Time-based partitioning

    3. enrich_customer_data (depends on extract_customer_profiles)
       - Add demographic data from external API
       - Process partitions in parallel
       - Each partition is independent

    4. join_transactions (depends on enrich_customer_data, extract_transaction_history)
       - Join enriched profiles with transactions
       - Aggregate metrics per customer
       - Parallel processing per partition

    5. load_to_warehouse (depends on join_transactions)
       - Bulk load to data warehouse
       - Verify data quality
       - Update metadata tables

    6. generate_summary_report (depends on load_to_warehouse)
       - Generate processing statistics
       - Send notification with summary
       - Archive source files

Implementation Pattern

Partition-Based Processing:

#![allow(unused)]
fn main() {
pub struct ExtractCustomerProfilesHandler;

#[async_trait]
impl StepHandler for ExtractCustomerProfilesHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        let date: String = context.configuration.get("processing_date")?;

        // Determine partitions (e.g., by customer_id ranges)
        let partitions = calculate_partitions(1000000, 100000)?; // 10 partitions

        // Extract data for each partition
        let mut partition_files = Vec::new();
        for partition in partitions {
            let filename = extract_partition(&date, partition).await?;
            partition_files.push(filename);
        }

        // Return partition info for downstream steps
        Ok(StepResult::success(json!({
            "partitions": partition_files,
            "total_records": 1000000,
            "extracted_at": Utc::now()
        })))
    }
}
}

Error Handling for Data Quality:

#![allow(unused)]
fn main() {
async fn enrich_customer_data(context: &StepContext) -> Result<StepResult> {
    let partition_file: String = context.configuration.get("partition_file")?;

    let mut processed = 0;
    let mut skipped = 0;
    let mut errors = Vec::new();

    for record in read_partition(&partition_file).await? {
        match enrich_record(record).await {
            Ok(enriched) => {
                write_enriched(enriched).await?;
                processed += 1;
            }
            Err(EnrichmentError::MalformedData(e)) => {
                // Skip corrupted record, continue processing
                skipped += 1;
                errors.push(format!("Skipped record: {}", e));
            }
            Err(EnrichmentError::ApiTimeout(e)) => {
                // Transient failure, retry entire step
                return Ok(StepResult::retryable_failure(
                    "api_timeout",
                    json!({"error": e.to_string()})
                ));
            }
        }
    }

    if skipped as f64 / processed as f64 > 0.1 {
        // Too many skipped records
        return Ok(StepResult::permanent_failure(
            "data_quality_issue",
            json!({
                "processed": processed,
                "skipped": skipped,
                "error_rate": skipped as f64 / processed as f64
            })
        ));
    }

    Ok(StepResult::success(json!({
        "processed": processed,
        "skipped": skipped,
        "errors": errors
    })))
}
}

Key Patterns

1. Partition-Based Parallelism

  • Split large datasets into partitions
  • Process partitions independently
  • Aggregate results in final step

2. Graceful Degradation

  • Skip corrupted individual records
  • Continue processing remaining data
  • Report data quality issues

3. Monitoring Data Quality

  • Track record counts through pipeline
  • Alert on unexpected error rates
  • Validate schema at boundaries

Microservices Orchestration

Problem Statement

Coordinate operations across multiple microservices:

  • User registration flow (auth, profile, notifications, analytics)
  • Distributed transactions with compensation
  • Service dependency management
  • Timeout and circuit breaking

Why Tasker Core?

  • Service Coordination: Orchestrate distributed operations
  • Saga Pattern: Implement compensation for failures
  • Resilience: Circuit breakers and timeouts
  • Observability: End-to-end tracing with correlation IDs
  • Flexibility: Handle heterogeneous service protocols

Workflow Structure (User Registration Example)

Task: user_registration_#{user_id}
  Namespace: user_onboarding

  Steps:
    1. create_auth_account
       - Call auth service to create account
       - Generate user credentials
       - Store authentication tokens

    2. create_user_profile (depends on create_auth_account)
       - Call profile service
       - Initialize user preferences
       - Set default settings

    3. setup_notification_preferences (depends on create_user_profile)
       - Call notification service
       - Configure email preferences
       - Set up push notifications

    4. track_user_signup (depends on create_user_profile)
       - Call analytics service
       - Record signup event
       - Runs in parallel with setup_notification_preferences

    5. send_welcome_email (depends on setup_notification_preferences)
       - Send welcome email
       - Provide onboarding links
       - Track email delivery

  Compensation Steps (on failure):
    - If create_user_profile fails → delete_auth_account
    - If any step fails after profile → deactivate_user

Implementation Pattern (Saga with Compensation)

#![allow(unused)]
fn main() {
pub struct CreateUserProfileHandler;

#[async_trait]
impl StepHandler for CreateUserProfileHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        let user_id: String = context.configuration.get("user_id")?;
        let email: String = context.configuration.get("email")?;

        // Get auth details from previous step
        let auth_result = context.parent_results.get("create_auth_account")
            .ok_or("Missing auth result")?;
        let auth_token: String = auth_result.get("auth_token")?;

        // Call profile service
        match profile_service.create_profile(&user_id, &email, &auth_token).await {
            Ok(profile) => {
                Ok(StepResult::success(json!({
                    "profile_id": profile.id,
                    "created_at": profile.created_at,
                    "user_id": user_id
                })))
            }

            Err(ProfileServiceError::DuplicateEmail) => {
                // Permanent failure - email already exists
                // Trigger compensation
                Ok(StepResult::permanent_failure_with_compensation(
                    "duplicate_email",
                    json!({"email": email}),
                    vec!["delete_auth_account"] // Compensation steps
                ))
            }

            Err(ProfileServiceError::ServiceUnavailable) => {
                // Transient - retry
                Ok(StepResult::retryable_failure(
                    "service_unavailable",
                    json!({"retry_recommended": true})
                ))
            }
        }
    }
}
}

Compensation Handler:

#![allow(unused)]
fn main() {
pub struct DeleteAuthAccountHandler;

#[async_trait]
impl StepHandler for DeleteAuthAccountHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        let user_id: String = context.configuration.get("user_id")?;

        // Best-effort deletion
        match auth_service.delete_account(&user_id).await {
            Ok(_) => {
                Ok(StepResult::success(json!({
                    "compensated": true,
                    "user_id": user_id
                })))
            }
            Err(e) => {
                // Log error but don't fail - compensation is best-effort
                warn!("Compensation failed for user {}: {}", user_id, e);
                Ok(StepResult::success(json!({
                    "compensated": false,
                    "error": e.to_string(),
                    "requires_manual_cleanup": true
                })))
            }
        }
    }
}
}

Key Patterns

1. Correlation IDs

  • Pass correlation_id through all services
  • Enable end-to-end tracing
  • Simplify debugging distributed issues (see the sketch after this list)

2. Compensation (Saga Pattern)

  • Define compensation steps for cleanup
  • Execute on permanent failures
  • Best-effort execution, log failures

3. Service Circuit Breakers

  • Wrap service calls in circuit breakers
  • Fail fast when services degraded
  • Automatic recovery detection
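A minimal sketch of correlation ID propagation, assuming the ID is available to the handler and using reqwest purely for illustration (the service URL, header name, and function are hypothetical):

use reqwest::Client;

async fn call_profile_service(
    client: &Client,
    correlation_id: &str,
    user_id: &str,
) -> Result<reqwest::Response, reqwest::Error> {
    // Forward the same correlation ID on every outbound call so traces line up end to end
    client
        .get(format!("https://profile-service.internal/users/{user_id}"))
        .header("X-Correlation-ID", correlation_id)
        .send()
        .await
}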

Scheduled Job Coordination

Problem Statement

Run periodic jobs with dependencies:

  • Daily report generation (depends on data refresh)
  • Scheduled data backups (depends on maintenance window)
  • Cleanup jobs (depends on retention policies)

Why Tasker Core?

  • Dependency Management: Jobs run in correct order
  • Failure Handling: Automatic retry of failed jobs
  • Observability: Track job execution history
  • Flexibility: Dynamic scheduling based on results

Implementation Pattern

#![allow(unused)]
fn main() {
// External scheduler (cron, Kubernetes CronJob, etc.) creates tasks
pub async fn schedule_daily_reports() -> Result<Uuid> {
    let client = OrchestrationClient::new("http://orchestration:8080").await?;

    let task_request = TaskRequest {
        template_name: "daily_reporting".to_string(),
        namespace: "scheduled_jobs".to_string(),
        configuration: json!({
            "report_date": Utc::now().format("%Y-%m-%d").to_string(),
            "report_types": ["sales", "inventory", "customer_activity"]
        }),
        priority: 5, // Normal priority
    };

    let response = client.create_task(task_request).await?;
    Ok(response.task_uuid)
}
}

Conditional Workflows and Decision Points

Problem Statement

Many workflows require runtime decision-making where the execution path depends on business logic evaluated at runtime:

  • Approval routing based on request amount or risk level
  • Tiered processing based on customer status
  • Compliance checks varying by jurisdiction
  • Dynamic resource allocation based on workload

Why Use Decision Points?

Traditional Approach (Static DAG):

# Must define ALL possible paths upfront
Steps:
  - validate
  - route_A  # Always created
  - route_B  # Always created
  - route_C  # Always created
  - converge # Must handle all paths

Decision Point Approach (Dynamic DAG):

# Create ONLY the needed path at runtime
Steps:
  - validate
  - routing_decision  # Decides which path
  - route_A           # Created dynamically if needed
  - route_B           # Created dynamically if needed
  - route_C           # Created dynamically if needed
  - converge          # Uses intersection semantics

Benefits

  • Efficiency: Only execute steps actually needed
  • Clarity: Workflow reflects actual business logic
  • Cost Savings: Reduce API calls, processing time, and resource usage
  • Flexibility: Add new paths without changing core logic

Core Pattern

Task: conditional_approval
  Steps:
    1. validate_request       # Regular step
    2. routing_decision       # Decision point (type: decision_point)
       → Evaluates business logic
       → Returns: CreateSteps(['manager_approval']) or NoBranches
    3. auto_approve          # Might be created
    4. manager_approval      # Might be created
    5. finance_review        # Might be created
    6. finalize_approval     # Convergence (type: deferred)
       → Waits for intersection of dependencies

Example: Amount-Based Approval Routing

class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
  def call(context)
    amount = context.get_task_field('amount')

    # Business logic determines which steps to create
    steps = if amount < 1_000
      ['auto_approve']
    elsif amount < 5_000
      ['manager_approval']
    else
      ['manager_approval', 'finance_review']
    end

    # Return decision outcome
    decision_success(
      steps: steps,
      result_data: {
        route_type: determine_route_type(amount),
        amount: amount
      }
    )
  end
end

Real-World Scenarios

1. E-Commerce Returns Processing

  • Low-value returns: Auto-approve
  • Medium-value: Manager review
  • High-value or suspicious: Fraud investigation + manager review

2. Financial Risk Assessment

  • Low-risk transactions: Standard processing
  • Medium-risk: Additional verification
  • High-risk: Manual review + compliance checks + legal review

3. Healthcare Prior Authorization

  • Standard procedures: Auto-approve
  • Specialized care: Medical director review
  • Experimental treatments: Medical director + insurance review + compliance

4. Customer Support Escalation

  • Simple issues: Tier 1 resolution
  • Complex issues: Tier 2 specialist
  • VIP customers: Immediate senior support + account manager notification

Key Features

Decision Point Steps:

  • Special step type that returns DecisionPointOutcome
  • Can return NoBranches (no additional steps) or CreateSteps (list of step names)
  • Fully atomic - either all steps created or none
  • Supports nested decisions (configurable depth limit)

Deferred Steps:

  • Use intersection semantics for dependencies
  • Wait for: (declared dependencies) ∩ (actually created steps)
  • Enable convergence regardless of path taken

Type-Safe Implementation:

  • Ruby: TaskerCore::StepHandler::Decision base class
  • Rust: DecisionPointOutcome enum with serde support
  • Automatic validation and serialization
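The outcome shape described above can be pictured as a small serde enum. This is a sketch of the idea, not the exact definition in tasker-core; the tag and variant names are illustrative, chosen to match the decision_point_outcome: { type, step_names } result shape.

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", content = "step_names")]
pub enum DecisionPointOutcome {
    /// No additional steps are needed on this path
    NoBranches,
    /// Create these steps dynamically (atomic: all created or none)
    CreateSteps(Vec<String>),
}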

Implementation

See the complete guide: Conditional Workflows and Decision Points

Covers:

  • When to use conditional workflows
  • YAML configuration
  • Ruby and Rust implementation patterns
  • Simple and complex examples
  • Best practices and limitations

Anti-Patterns

❌ Don’t Use Tasker Core For:

1. Simple Cron Jobs

# ❌ Anti-pattern: Single-step scheduled job
Task: send_daily_email
  Steps:
    - send_email  # No dependencies, no retry needed

Why: Overhead not justified. Use native cron or systemd timers.

2. Real-Time Sub-Millisecond Operations

# ❌ Anti-pattern: High-frequency trading
Task: execute_trade_#{microseconds}
  Steps:
    - check_price   # Needs <1ms latency
    - execute_order

Why: Architectural overhead (~10-20ms) too high. Use in-memory queues or direct service calls.

3. Pure Fan-Out

# ❌ Anti-pattern: Simple message broadcasting
Task: broadcast_notification
  Steps:
    - send_to_user_1
    - send_to_user_2
    - send_to_user_3
    # ... 1000s of independent steps

Why: Use message bus (Kafka, RabbitMQ) for pub/sub patterns. Tasker is for orchestration, not broadcasting.

4. Stateless Single Operations

# ❌ Anti-pattern: Single API call with no retry
Task: fetch_user_data
  Steps:
    - call_api  # No dependencies, no state management needed

Why: Direct API call with client-side retry is simpler.


Pattern Selection Guide

| Characteristic | Use Tasker Core? | Alternative |
|---|---|---|
| Multiple dependent steps | ✅ Yes | N/A |
| Parallel execution needed | ✅ Yes | Thread pools for simple cases |
| Retry logic required | ✅ Yes | Client-side retry libraries |
| Audit trail needed | ✅ Yes | Append-only logs |
| Single step, no retry | ❌ No | Direct function call |
| Sub-second latency required | ❌ No | In-memory queues |
| Pure broadcast/fan-out | ❌ No | Message bus (Kafka, etc.) |
| Simple scheduled job | ❌ No | Cron, systemd timers |


← Back to Documentation Hub

Worker Crates Overview

Last Updated: 2025-12-27 Audience: Developers, Architects, Operators Status: Active Related Docs: Worker Event Systems | Worker Actors

← Back to Documentation Hub


The tasker-core workspace provides four worker implementations for executing workflow step handlers. Each implementation targets different deployment scenarios and developer ecosystems while sharing the same core Rust foundation.

Quick Navigation

| Document | Description |
|---|---|
| API Convergence Matrix | Quick reference for aligned APIs across languages |
| Example Handlers | Side-by-side handler examples |
| Patterns and Practices | Common patterns across all workers |
| Rust Worker | Native Rust implementation |
| Ruby Worker | Ruby gem for Rails integration |
| Python Worker | Python package for data pipelines |
| TypeScript Worker | TypeScript/JS for Bun/Node/Deno |

Overview

Four Workers, One Foundation

All workers share the same Rust core (tasker-worker crate) for orchestration, queueing, and state management. The language-specific workers add handler execution in their respective runtimes.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           WORKER ARCHITECTURE                                 │
└─────────────────────────────────────────────────────────────────────────────┘

                              PostgreSQL + PGMQ
                                      │
                                      ▼
                    ┌─────────────────────────────┐
                    │   Rust Core (tasker-worker) │
                    │   ─────────────────────────│
                    │   • Queue Management        │
                    │   • State Machines          │
                    │   • Orchestration           │
                    │   • Actor System            │
                    └─────────────────────────────┘
                                      │
          ┌───────────────┬───────────┼───────────┬───────────────┐
          │               │           │           │               │
          ▼               ▼           ▼           ▼               ▼
    ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌─────────────┐
    │   Rust    │   │   Ruby    │   │  Python   │   │ TypeScript  │
    │  Worker   │   │  Worker   │   │  Worker   │   │   Worker    │
    │───────────│   │───────────│   │───────────│   │─────────────│
    │ Native    │   │ FFI Bridge│   │ FFI Bridge│   │ FFI Bridge  │
    │ Handlers  │   │ + Gem     │   │ + Package │   │ Bun/Node/Deno│
    └───────────┘   └───────────┘   └───────────┘   └─────────────┘

Comparison Table

| Feature | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Performance | Native | GVL-limited | GIL-limited | V8/Bun native |
| Integration | Standalone | Rails/Rack apps | Data pipelines | Node/Bun/Deno apps |
| Handler Style | Async traits | Class-based | ABC-based | Class-based |
| Concurrency | Tokio async | Thread + FFI poll | Thread + FFI poll | Event loop + FFI poll |
| Deployment | Binary | Gem + Server | Package + Server | Package + Server |
| Headless Mode | N/A | Library embed | Library embed | Library embed |
| Runtimes | - | MRI | CPython | Bun, Node.js, Deno |

When to Use Each

Rust Worker - Best for:

  • Maximum throughput requirements
  • Resource-constrained environments
  • Standalone microservices
  • Performance-critical handlers

Ruby Worker - Best for:

  • Rails/Ruby applications
  • ActiveRecord/ORM integration
  • Existing Ruby codebases
  • Quick prototyping with Ruby ecosystem

Python Worker - Best for:

  • Data processing pipelines
  • ML/AI integration
  • Scientific computing workflows
  • Python-native team preferences

TypeScript Worker - Best for:

  • Modern JavaScript/TypeScript applications
  • Full-stack Node.js teams
  • Edge computing with Bun or Deno
  • React/Vue/Angular backend services
  • Multi-runtime deployment flexibility

Deployment Modes

Server Mode

All workers can run as standalone servers:

Rust:

cargo run -p workers-rust

Ruby:

cd workers/ruby
./bin/server.rb

Python:

cd workers/python
python bin/server.py

TypeScript (Bun):

cd workers/typescript
bun run bin/server.ts

TypeScript (Node.js):

cd workers/typescript
npx tsx bin/server.ts

Headless/Embedded Mode (Ruby, Python & TypeScript)

Ruby, Python, and TypeScript workers can be embedded into existing applications without running the HTTP server. Headless mode is controlled via TOML configuration, not bootstrap parameters.

TOML Configuration (e.g., config/tasker/base/worker.toml):

[web]
enabled = false  # Disables HTTP server for headless/embedded mode

Ruby (in Rails):

# config/initializers/tasker.rb
require 'tasker_core'

# Bootstrap worker (web server disabled via TOML config)
TaskerCore::Worker::Bootstrap.start!

# Register handlers
TaskerCore::Registry::HandlerRegistry.instance.register_handler(
  'MyHandler',
  MyHandler
)

Python (in application):

from tasker_core import bootstrap_worker, HandlerRegistry
from tasker_core.types import BootstrapConfig

# Bootstrap worker (web server disabled via TOML config)
config = BootstrapConfig(namespace="my-app")
bootstrap_worker(config)

# Register handlers
registry = HandlerRegistry.instance()
registry.register("my_handler", MyHandler)

TypeScript (in application):

import { createRuntime, HandlerRegistry, EventEmitter, EventPoller, StepExecutionSubscriber } from '@tasker-systems/tasker';

// Bootstrap worker (web server disabled via TOML config)
const runtime = createRuntime();
await runtime.load('/path/to/libtasker_ts.dylib');
runtime.bootstrapWorker({ namespace: 'my-app' });

// Register handlers
const registry = new HandlerRegistry();
registry.register('my_handler', MyHandler);

// Start event processing
const emitter = new EventEmitter();
const subscriber = new StepExecutionSubscriber(emitter, registry, runtime, {});
subscriber.start();

const poller = new EventPoller(runtime, emitter);
poller.start();

Core Concepts

1. Handler Registration

All workers use a registry pattern for handler discovery:

                    ┌─────────────────────┐
                    │  HandlerRegistry    │
                    │  (Singleton)        │
                    └─────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
         ┌─────────┐    ┌─────────┐    ┌─────────┐
         │Handler A│    │Handler B│    │Handler C│
         └─────────┘    └─────────┘    └─────────┘

2. Event Flow

Step events flow through a consistent pipeline:

1. PGMQ Queue → Event received
2. Worker claims step (atomic)
3. Handler resolved by name
4. Handler.call(context) executed
5. Result sent to completion channel
6. Orchestration receives result

3. Error Classification

All workers distinguish between:

  • Retryable Errors: Transient failures → Re-enqueue step
  • Permanent Errors: Unrecoverable → Mark step failed
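As a sketch of what that classification looks like at the handler boundary (the types and names here are illustrative, not the worker's internal API):

enum HandlerError {
    Timeout(String),      // transient: network blips, upstream 5xx responses
    InvalidInput(String), // permanent: bad data will not get better on retry
}

fn is_retryable(err: &HandlerError) -> bool {
    match err {
        HandlerError::Timeout(_) => true,       // re-enqueue the step
        HandlerError::InvalidInput(_) => false, // mark the step failed
    }
}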

4. Graceful Shutdown

All workers handle shutdown signals (SIGTERM, SIGINT):

1. Signal received
2. Stop accepting new work
3. Complete in-flight handlers
4. Flush completion channel
5. Shutdown Rust foundation
6. Exit cleanly
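A rough sketch of that sequence using tokio signal handling; the worker's actual shutdown plumbing differs, and WorkerHandle with its methods is a hypothetical stand-in for the real handle.

use tokio::signal;

// WorkerHandle and its methods are hypothetical placeholders.
async fn run_until_shutdown(worker: WorkerHandle) -> std::io::Result<()> {
    // 1. Wait for a shutdown signal (Ctrl-C here; a real worker also handles SIGTERM)
    signal::ctrl_c().await?;

    // 2-4. Stop accepting new work, finish in-flight handlers, flush the completion channel
    worker.stop_accepting();
    worker.drain_in_flight().await;
    worker.flush_completions().await;

    // 5-6. Shut down the Rust foundation and exit cleanly
    worker.shutdown().await;
    Ok(())
}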

Configuration

Environment Variables

Common across all workers:

| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| TASKER_NAMESPACE | Worker namespace for queue isolation | default |
| RUST_LOG | Log level (trace/debug/info/warn/error) | info |
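For instance, a worker process typically resolves these at startup, falling back to the defaults above. A sketch using std::env, not the actual bootstrap code:

use std::env;

fn load_env_config() -> Result<(String, String, String), env::VarError> {
    let database_url = env::var("DATABASE_URL")?; // required, no default
    let tasker_env = env::var("TASKER_ENV").unwrap_or_else(|_| "development".into());
    let namespace = env::var("TASKER_NAMESPACE").unwrap_or_else(|_| "default".into());
    Ok((database_url, tasker_env, namespace))
}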

Language-Specific

Ruby:

| Variable | Description |
|---|---|
| RUBY_GC_HEAP_GROWTH_FACTOR | GC tuning for production |

Python:

| Variable | Description |
|---|---|
| PYTHON_HANDLER_PATH | Path for handler auto-discovery |

Handler Types

All workers support specialized handler types:

StepHandler (Base)

Basic step execution:

class MyHandler(StepHandler):
    handler_name = "my_handler"

    def call(self, context):
        return self.success({"result": "done"})

ApiHandler

HTTP/REST API integration with automatic error classification:

class FetchDataHandler < TaskerCore::StepHandler::Api
  def call(context)
    user_id = context.get_task_field('user_id')
    response = connection.get("/users/#{user_id}")
    process_response(response)
    success(result: response.body)
  end
end

DecisionHandler

Dynamic workflow routing:

class RouteHandler(DecisionHandler):
    handler_name = "route_handler"

    def call(self, context):
        if context.input_data["amount"] < 1000:
            return self.route_to_steps(["auto_approve"])
        return self.route_to_steps(["manager_approval"])

Batchable

Large dataset processing. Note: Ruby uses subclass inheritance, Python uses mixin:

Ruby (subclass of Base):

class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)
    no_op_result = handle_no_op_worker(batch_ctx)
    return no_op_result if no_op_result

    # Process batch using batch_ctx.start_cursor, batch_ctx.end_cursor
    batch_worker_complete(processed_count: batch_ctx.batch_size)
  end
end

Python (mixin):

class CsvBatchProcessor(StepHandler, Batchable):
    handler_name = "csv_batch_processor"

    def call(self, context: StepContext) -> StepHandlerResult:
        batch_ctx = self.get_batch_context(context)
        if batch_ctx is None:
            return self.failure(message="No batch context", error_type="missing_context")
        # Process batch using batch_ctx.start_cursor, batch_ctx.end_cursor
        batch_size = batch_ctx.cursor_config.end_cursor - batch_ctx.cursor_config.start_cursor
        return self.batch_worker_success(items_processed=batch_size)

Quick Start

Rust

# Build and run
cd workers/rust
cargo run

# With custom configuration
TASKER_CONFIG_PATH=/path/to/config.toml cargo run

Ruby

# Install dependencies
cd workers/ruby
bundle install
bundle exec rake compile

# Run server
./bin/server.rb

Python

# Install dependencies
cd workers/python
uv sync
uv run maturin develop

# Run server
python bin/server.py

TypeScript

# Install dependencies
cd workers/typescript
bun install
cargo build --release -p tasker-ts

# Run server (Bun)
bun run bin/server.ts

# Run server (Node.js)
npx tsx bin/server.ts

# Run server (Deno)
deno run --allow-ffi --allow-env --allow-net bin/server.ts

Monitoring

Health Checks

All workers expose health status:

# Python
from tasker_core import get_health_check
health = get_health_check()
# Ruby
health = TaskerCore::FFI.health_check

Metrics

Common metrics available:

| Metric | Description |
|---|---|
| pending_count | Events awaiting processing |
| in_flight_count | Events being processed |
| completed_count | Successfully completed |
| failed_count | Failed events |
| starvation_detected | Processing bottleneck |

Logging

All workers use structured logging:

2025-01-15T10:30:00Z [INFO] python-worker: Processing step step_uuid=abc-123 handler=process_order
2025-01-15T10:30:01Z [INFO] python-worker: Step completed step_uuid=abc-123 success=true duration_ms=150

Architecture Deep Dive

For detailed architectural documentation:


See Also

API Convergence Matrix

Last Updated: 2026-01-08 Status: Active ← Back to Worker Crates Overview


Overview

This document provides a quick reference for the aligned APIs across Ruby, Python, TypeScript, and Rust worker implementations. All four languages share consistent patterns for handler execution, result creation, registry operations, and composition via mixins/traits.


Handler Signatures

| Language | Base Class | Signature |
|---|---|---|
| Ruby | TaskerCore::StepHandler::Base | def call(context) |
| Python | BaseStepHandler | def call(self, context: StepContext) -> StepHandlerResult |
| TypeScript | StepHandler | async call(context: StepContext): Promise<StepHandlerResult> |
| Rust | StepHandler trait | async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult |

Composition Pattern

All languages use composition via mixins/traits rather than inheritance hierarchies.

Handler Composition

| Language | Base | Mixin Syntax | Example |
|---|---|---|---|
| Ruby | StepHandler::Base | include Mixins::API | class Handler < Base; include Mixins::API |
| Python | StepHandler | Multiple inheritance | class Handler(StepHandler, APIMixin) |
| TypeScript | StepHandler | applyAPI(this) | Mixin functions applied in constructor |
| Rust | impl StepHandler | impl APICapable | Multiple trait implementations |

Available Mixins/Traits

| Capability | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| API | Mixins::API | APIMixin | applyAPI() | APICapable |
| Decision | Mixins::Decision | DecisionMixin | applyDecision() | DecisionCapable |
| Batchable | Mixins::Batchable | BatchableMixin | BatchableHandler | BatchableCapable |

StepContext Fields

The StepContext provides unified access to step execution data across Ruby, Python, and TypeScript.

| Field | Type | Description |
|---|---|---|
| task_uuid | String | Unique task identifier (UUID v7) |
| step_uuid | String | Unique step identifier (UUID v7) |
| input_data | Dict/Hash | Input data for the step from workflow_step.inputs |
| step_inputs | Dict/Hash | Alias for input_data |
| step_config | Dict/Hash | Handler configuration from step_definition.handler.initialization |
| dependency_results | Wrapper | Results from parent steps (DependencyResultsWrapper) |
| retry_count | Integer | Current retry attempt (from workflow_step.attempts) |
| max_retries | Integer | Maximum retry attempts (from workflow_step.max_attempts) |

Convenience Methods

| Method | Description |
|---|---|
| get_task_field(name) | Get field from task context |
| get_dependency_result(step_name) | Get result from a parent step |

Ruby-Specific Accessors

| Property | Type | Description |
|---|---|---|
| task | TaskWrapper | Full task wrapper with context and metadata |
| workflow_step | WorkflowStepWrapper | Workflow step with execution state |
| step_definition | StepDefinitionWrapper | Step definition from task template |

Result Factories

Success Results

| Language | Method | Example |
|---|---|---|
| Ruby | success(result:, metadata:) | success(result: { id: 123 }, metadata: { ms: 50 }) |
| Python | self.success(result, metadata) | self.success({"id": 123}, {"ms": 50}) |
| Rust | StepExecutionResult::success(...) | StepExecutionResult::success(result, metadata) |

Failure Results

| Language | Method | Key Parameters |
|---|---|---|
| Ruby | failure(message:, error_type:, error_code:, retryable:, metadata:) | keyword arguments |
| Python | self.failure(message, error_type, error_code, retryable, metadata) | positional/keyword |
| Rust | StepExecutionResult::failure(...) | structured fields |

Result Fields

| Field | Ruby | Python | Rust | Description |
|---|---|---|---|---|
| success | bool | bool | bool | Whether step succeeded |
| result | Hash | Dict | HashMap | Result data |
| metadata | Hash | Dict | HashMap | Additional context |
| error_message | String | str | String | Human-readable error |
| error_type | String | str | String | Error classification |
| error_code | String (optional) | str (optional) | String (optional) | Application error code |
| retryable | bool | bool | bool | Whether to retry |

Standard error_type Values

Use these standard values for consistent error classification:

| Value | Description | Retry Behavior |
|---|---|---|
| PermanentError | Non-recoverable failure | Never retry |
| RetryableError | Temporary failure | Will retry |
| ValidationError | Input validation failed | No retry |
| TimeoutError | Operation timed out | May retry |
| UnexpectedError | Unexpected handler error | May retry |

Registry API

| Operation | Ruby | Python | Rust |
|---|---|---|---|
| Register | register(name, klass) | register(name, klass) | register_handler(name, handler) |
| Check | is_registered(name) | is_registered(name) | is_registered(name) |
| Resolve | resolve(name) | resolve(name) | get_handler(name) |
| List | list_handlers | list_handlers() | list_handlers() |

Note: Ruby also provides original method names (register_handler, handler_available?, resolve_handler, registered_handlers) as the primary API with the above as cross-language aliases.


Resolver Chain API

Handler resolution uses a chain-of-responsibility pattern to convert callable addresses into executable handlers.

StepHandlerResolver Interface

| Method | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| Get Name | name | resolver_name() | resolverName() | resolver_name(&self) |
| Get Priority | priority | priority() | priority() | priority(&self) |
| Can Resolve? | can_resolve?(definition, config) | can_resolve(definition) | canResolve(definition) | can_resolve(&self, definition) |
| Resolve | resolve(definition, config) | resolve(definition, context) | resolve(definition, context) | resolve(&self, definition, context) |

ResolverChain Operations

| Operation | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| Create | ResolverChain.new | ResolverChain() | new ResolverChain() | ResolverChain::new() |
| Register | register(resolver) | register(resolver) | register(resolver) | register(resolver) |
| Resolve | resolve(definition, context) | resolve(definition, context) | resolve(definition, context) | resolve(definition, context) |
| Can Resolve? | can_resolve?(definition) | can_resolve(definition) | canResolve(definition) | can_resolve(definition) |
| List | resolvers | resolvers | resolvers | resolvers() |

Built-in Resolvers

| Resolver | Priority | Function | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|---|---|
| ExplicitMappingResolver | 10 | Hash lookup of registered handlers | ✓ | ✓ | ✓ | ✓ |
| ClassConstantResolver | 100 | Runtime class lookup (Ruby) | - | ✓ | - | - |
| ClassLookupResolver | 100 | Runtime class lookup (Python/TS) | - | - | ✓ | ✓ |

Note: Class lookup resolvers are not available in Rust due to lack of runtime reflection. Rust handlers must use ExplicitMappingResolver. Ruby uses ClassConstantResolver (Ruby terminology); Python and TypeScript use ClassLookupResolver (same functionality, language-appropriate naming).
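Conceptually, the chain tries resolvers in priority order and takes the first one that can handle the definition. The sketch below illustrates the chain-of-responsibility pattern with simplified, hypothetical types; it is not the tasker-worker implementation.

use std::sync::Arc;

struct HandlerDefinition {
    callable: String, // handler address (name or class path)
}

trait Handler: Send + Sync {}

trait StepHandlerResolver {
    fn priority(&self) -> u32;
    fn can_resolve(&self, definition: &HandlerDefinition) -> bool;
    fn resolve(&self, definition: &HandlerDefinition) -> Option<Arc<dyn Handler>>;
}

struct ResolverChain {
    resolvers: Vec<Box<dyn StepHandlerResolver>>,
}

impl ResolverChain {
    fn register(&mut self, resolver: Box<dyn StepHandlerResolver>) {
        self.resolvers.push(resolver);
        // Lower priority value wins first (e.g., an explicit mapping at 10 before class lookup at 100)
        self.resolvers.sort_by_key(|r| r.priority());
    }

    fn resolve(&self, definition: &HandlerDefinition) -> Option<Arc<dyn Handler>> {
        self.resolvers
            .iter()
            .find(|r| r.can_resolve(definition))
            .and_then(|r| r.resolve(definition))
    }
}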

HandlerDefinition Fields

| Field | Type | Description | Required |
|---|---|---|---|
| callable | String | Handler address (name or class path) | Yes |
| method | String | Entry point method (default: "call") | No |
| resolver | String | Resolution hint to bypass chain | No |
| initialization | Dict/Hash | Handler configuration | No |

Method Dispatch

Multi-method handlers expose multiple entry points through the method field:

| Language | Default Method | Dynamic Dispatch |
|---|---|---|
| Ruby | call | handler.public_send(method, context) |
| Python | call | getattr(handler, method)(context) |
| TypeScript | call | handler[method](context) |
| Rust | call | handler.invoke_method(method, step) |

Creating Multi-Method Handlers:

| Language | Approach |
|---|---|
| Ruby | Define additional methods alongside call |
| Python | Define additional methods alongside call |
| TypeScript | Define additional async methods alongside call |
| Rust | Implement invoke_method to dispatch to internal methods |
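In Rust, dispatch is explicit since there is no runtime reflection. A hedged sketch of what an invoke_method implementation might look like, following the handler conventions used in the examples in this documentation; the refund entry point is hypothetical.

impl ProcessOrderHandler {
    pub async fn invoke_method(
        &self,
        method: &str,
        step_data: &TaskSequenceStep,
    ) -> StepExecutionResult {
        match method {
            "call" => self.call(step_data).await,
            "refund" => self.refund(step_data).await, // hypothetical additional entry point
            other => StepExecutionResult::failure(
                &format!("unknown method: {other}"),
                "PermanentError",
                false, // not retryable
            ),
        }
    }
}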

See Handler Resolution Guide for complete documentation.


Specialized Handlers

API Handler

| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| GET | get(path, params: {}, headers: {}) | self.get(path, params={}, headers={}) | this.get(path, params?, headers?) |
| POST | post(path, data: {}, headers: {}) | self.post(path, data={}, headers={}) | this.post(path, data?, headers?) |
| PUT | put(path, data: {}, headers: {}) | self.put(path, data={}, headers={}) | this.put(path, data?, headers?) |
| DELETE | delete(path, params: {}, headers: {}) | self.delete(path, params={}, headers={}) | this.delete(path, params?, headers?) |

Decision Handler

| Language | Simple API | Result Fields |
|---|---|---|
| Ruby | decision_success(steps:, routing_context:) | decision_point_outcome: { type, step_names } |
| Python | decision_success(steps, routing_context) | decision_point_outcome: { type, step_names } |
| TypeScript | decisionSuccess(steps, routingContext?) | decision_point_outcome: { type, step_names } |
| Rust | decision_success(step_uuid, step_names, ...) | Pattern-based |

Decision Helper Methods (Cross-Language):

  • decision_success(steps, routing_context) - Create dynamic steps
  • skip_branches(reason, routing_context) - Skip all conditional branches
  • decision_failure(message, error_type) - Decision could not be made

Batchable Handler

| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Get Context | get_batch_context(context) | get_batch_context(context) | getBatchContext(context) |
| Complete Batch | batch_worker_complete(processed_count:, result_data:) | batch_worker_complete(processed_count, result_data) | batchWorkerComplete(processedCount, resultData) |
| Handle No-Op | handle_no_op_worker(batch_ctx) | handle_no_op_worker(batch_ctx) | handleNoOpWorker(batchCtx) |

Standard Batch Result Fields:

  • processed_count / items_processed
  • items_succeeded / items_failed
  • start_cursor, end_cursor, batch_size, last_cursor

Cursor Indexing:

  • All languages use 0-indexed cursors (start at 0, not 1)
  • Ruby was updated from 1-indexed to 0-indexed for consistency

Checkpoint Yielding

Checkpoint yielding enables batch workers to persist progress and yield control for re-dispatch.

| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Checkpoint | checkpoint_yield(cursor:, items_processed:, accumulated_results:) | checkpoint_yield(cursor, items_processed, accumulated_results) | checkpointYield({ cursor, itemsProcessed, accumulatedResults }) |

BatchWorkerContext Checkpoint Accessors:

| Accessor | Ruby | Python | TypeScript |
|---|---|---|---|
| Cursor | checkpoint_cursor | checkpoint_cursor | checkpointCursor |
| Accumulated Results | accumulated_results | accumulated_results | accumulatedResults |
| Has Checkpoint? | has_checkpoint? | has_checkpoint() | hasCheckpoint() |
| Items Processed | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed |

FFI Contract:

| Function | Description |
|---|---|
| checkpoint_yield_step_event(event_id, data) | Persist checkpoint and re-dispatch step |

Key Invariants:

  • Progress is atomically saved before re-dispatch
  • Step remains InProgress during checkpoint yield cycle
  • Only Success/Failure trigger state transitions

See Batch Processing Guide - Checkpoint Yielding for full documentation.


Domain Events

Publisher Contract

| Language | Base Class | Key Method |
|---|---|---|
| Ruby | TaskerCore::DomainEvents::BasePublisher | publish(ctx) |
| Python | BasePublisher | publish(ctx) |
| TypeScript | BasePublisher | publish(ctx) |
| Rust | StepEventPublisher trait | publish(ctx) |

Publisher Lifecycle Hooks

All languages support publisher lifecycle hooks for instrumentation:

| Hook | Ruby | Python | TypeScript | Description |
|---|---|---|---|---|
| Before Publish | before_publish(ctx) | before_publish(ctx) | beforePublish(ctx) | Called before publishing |
| After Publish | after_publish(ctx, result) | after_publish(ctx, result) | afterPublish(ctx, result) | Called after successful publish |
| On Error | on_publish_error(ctx, error) | on_publish_error(ctx, error) | onPublishError(ctx, error) | Called on publish failure |
| Metadata | additional_metadata(ctx) | additional_metadata(ctx) | additionalMetadata(ctx) | Inject custom metadata |

StepEventContext Fields

| Field | Description |
|---|---|
| task_uuid | Task identifier |
| step_uuid | Step identifier |
| step_name | Handler/step name |
| namespace | Task namespace |
| correlation_id | Tracing correlation ID |
| result | Step execution result |
| metadata | Additional metadata |

Subscriber Contract

| Language | Base Class | Key Methods |
|---|---|---|
| Ruby | TaskerCore::DomainEvents::BaseSubscriber | subscribes_to, handle(event) |
| Python | BaseSubscriber | subscribes_to(), handle(event) |
| TypeScript | BaseSubscriber | subscribesTo(), handle(event) |
| Rust | EventHandler closures | N/A |

Subscriber Lifecycle Hooks

All languages support subscriber lifecycle hooks:

| Hook | Ruby | Python | TypeScript | Description |
|---|---|---|---|---|
| Before Handle | before_handle(event) | before_handle(event) | beforeHandle(event) | Called before handling |
| After Handle | after_handle(event, result) | after_handle(event, result) | afterHandle(event, result) | Called after handling |
| On Error | on_handle_error(event, error) | on_handle_error(event, error) | onHandleError(event, error) | Called on handler failure |

Registries

| Language | Publisher Registry | Subscriber Registry |
|---|---|---|
| Ruby | PublisherRegistry.instance | SubscriberRegistry.instance |
| Python | PublisherRegistry.instance() | SubscriberRegistry.instance() |
| TypeScript | PublisherRegistry.getInstance() | SubscriberRegistry.getInstance() |

Migration Summary

Ruby

| Before | After |
|---|---|
| def call(task, sequence, step) | def call(context) |
| class Handler < API | class Handler < Base; include Mixins::API |
| task.context['field'] | context.get_task_field('field') |
| sequence.get_results('step') | context.get_dependency_result('step') |
| 1-indexed cursors | 0-indexed cursors |

Python

| Before | After |
|---|---|
| def handle(self, task, sequence, step) | def call(self, context) |
| class Handler(APIHandler) | class Handler(StepHandler, APIMixin) |
| N/A | self.success(result, metadata) |
| N/A | Publisher/Subscriber lifecycle hooks |

TypeScript

| Before | After |
|---|---|
| class Handler extends APIHandler | class Handler extends StepHandler implements APICapable |
| No domain events | Complete domain events module |
| N/A | Publisher/Subscriber lifecycle hooks |
| N/A | applyAPI(this), applyDecision(this) mixins |

Rust

| Before | After |
|---|---|
| (already aligned) | (already aligned) |
| N/A | APICapable, DecisionCapable, BatchableCapable traits |

See Also

Example Handlers - Cross-Language Reference

Last Updated: 2025-12-21 Status: Active ← Back to Worker Crates Overview


Overview

This document provides side-by-side handler examples across Ruby, Python, and Rust. These examples demonstrate the aligned APIs that enable consistent patterns across all worker implementations.


Simple Step Handler

Ruby

class ProcessOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    order_id = context.get_task_field('order_id')
    amount = context.get_task_field('amount')

    result = process_order(order_id, amount)

    success(
      result: {
        order_id: order_id,
        status: 'processed',
        total: result[:total]
      },
      metadata: { processed_at: Time.now.iso8601 }
    )
  rescue StandardError => e
    failure(
      message: e.message,
      error_type: 'UnexpectedError',
      retryable: true,
      metadata: { order_id: order_id }
    )
  end

  private

  def process_order(order_id, amount)
    # Business logic here
    { total: amount * 1.08 }
  end
end

Python

from tasker_core import BaseStepHandler, StepContext, StepHandlerResult


class ProcessOrderHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        try:
            order_id = context.get_task_field("order_id")
            amount = context.get_task_field("amount")

            result = self.process_order(order_id, amount)

            return self.success(
                result={
                    "order_id": order_id,
                    "status": "processed",
                    "total": result["total"],
                },
                metadata={"processed_at": datetime.now().isoformat()},
            )
        except Exception as e:
            return self.failure(
                message=str(e),
                error_type="handler_error",
                retryable=True,
                metadata={"order_id": order_id},
            )

    def process_order(self, order_id: str, amount: float) -> dict:
        # Business logic here
        return {"total": amount * 1.08}

Rust

use tasker_shared::types::{TaskSequenceStep, StepExecutionResult};

pub struct ProcessOrderHandler;

impl ProcessOrderHandler {
    pub async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult {
        let order_id = step_data.task.context.get("order_id")
            .and_then(|v| v.as_str())
            .unwrap_or_default();
        let amount = step_data.task.context.get("amount")
            .and_then(|v| v.as_f64())
            .unwrap_or(0.0);

        match self.process_order(order_id, amount).await {
            Ok(result) => StepExecutionResult::success(
                serde_json::json!({
                    "order_id": order_id,
                    "status": "processed",
                    "total": result.total,
                }),
                Some(serde_json::json!({
                    "processed_at": chrono::Utc::now().to_rfc3339(),
                })),
            ),
            Err(e) => StepExecutionResult::failure(
                &e.to_string(),
                "handler_error",
                true, // retryable
            ),
        }
    }

    async fn process_order(&self, _order_id: &str, amount: f64) -> Result<OrderResult, Error> {
        Ok(OrderResult { total: amount * 1.08 })
    }
}

Handler with Dependencies

Ruby

class ShipOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Get results from dependent steps
    validation = context.get_dependency_result('validate_order')
    payment = context.get_dependency_result('process_payment')

    unless validation && validation['valid']
      return failure(
        message: 'Order validation failed',
        error_type: 'ValidationError',
        retryable: false
      )
    end

    unless payment && payment['status'] == 'completed'
      return failure(
        message: 'Payment not completed',
        error_type: 'PermanentError',
        retryable: false
      )
    end

    # Access task context
    order_id = context.get_task_field('order_id')
    shipping_address = context.get_task_field('shipping_address')

    tracking_number = create_shipment(order_id, shipping_address)

    success(result: {
      order_id: order_id,
      tracking_number: tracking_number,
      shipped_at: Time.now.iso8601
    })
  end
end

Python

class ShipOrderHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        # Get results from dependent steps
        validation = context.get_dependency_result("validate_order")
        payment = context.get_dependency_result("process_payment")

        if not validation or not validation.get("valid"):
            return self.failure(
                message="Order validation failed",
                error_type="validation_error",
                retryable=False,
            )

        if not payment or payment.get("status") != "completed":
            return self.failure(
                message="Payment not completed",
                error_type="permanent_error",
                retryable=False,
            )

        # Access task context
        order_id = context.get_task_field("order_id")
        shipping_address = context.get_task_field("shipping_address")

        tracking_number = self.create_shipment(order_id, shipping_address)

        return self.success(
            result={
                "order_id": order_id,
                "tracking_number": tracking_number,
                "shipped_at": datetime.now().isoformat(),
            }
        )

Decision Handler

Ruby

class ApprovalRoutingHandler < TaskerCore::StepHandler::Decision
  THRESHOLDS = {
    auto_approve: 1000,
    manager_only: 5000
  }.freeze

  def call(context)
    amount = context.get_task_field('amount').to_f
    department = context.get_task_field('department')

    if amount < THRESHOLDS[:auto_approve]
      decision_success(
        steps: ['auto_approve'],
        result_data: {
          route_type: 'automatic',
          amount: amount,
          reason: 'Below threshold'
        }
      )
    elsif amount < THRESHOLDS[:manager_only]
      decision_success(
        steps: ['manager_approval'],
        result_data: {
          route_type: 'manager',
          amount: amount,
          approver: find_manager(department)
        }
      )
    else
      decision_success(
        steps: ['manager_approval', 'finance_review'],
        result_data: {
          route_type: 'dual_approval',
          amount: amount,
          requires_cfo: amount > 50_000
        }
      )
    end
  end

  private

  def find_manager(department)
    # Lookup logic
    "manager@example.com"
  end
end

Python

class ApprovalRoutingHandler(DecisionHandler):
    THRESHOLDS = {
        "auto_approve": 1000,
        "manager_only": 5000,
    }

    def call(self, context: StepContext) -> StepHandlerResult:
        amount = float(context.get_task_field("amount") or 0)
        department = context.get_task_field("department")

        if amount < self.THRESHOLDS["auto_approve"]:
            return self.decision_success(
                steps=["auto_approve"],
                routing_context={
                    "route_type": "automatic",
                    "amount": amount,
                    "reason": "Below threshold",
                },
            )
        elif amount < self.THRESHOLDS["manager_only"]:
            return self.decision_success(
                steps=["manager_approval"],
                routing_context={
                    "route_type": "manager",
                    "amount": amount,
                    "approver": self.find_manager(department),
                },
            )
        else:
            return self.decision_success(
                steps=["manager_approval", "finance_review"],
                routing_context={
                    "route_type": "dual_approval",
                    "amount": amount,
                    "requires_cfo": amount > 50000,
                },
            )

    def find_manager(self, department: str) -> str:
        return "manager@example.com"

Batch Processing Handler

Ruby (Analyzer)

class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
  BATCH_SIZE = 100

  def call(context)
    file_path = context.get_task_field('csv_file_path')
    total_rows = count_csv_rows(file_path)

    if total_rows <= BATCH_SIZE
      # Small file - process inline, no batches needed
      outcome = TaskerCore::Types::BatchProcessingOutcome.no_batches

      success(
        result: {
          batch_processing_outcome: outcome.to_h,
          total_rows: total_rows,
          processing_mode: 'inline'
        }
      )
    else
      # Large file - create batch workers
      cursor_configs = calculate_batches(total_rows, BATCH_SIZE)
      outcome = TaskerCore::Types::BatchProcessingOutcome.create_batches(
        worker_template_name: 'process_csv_batch',
        worker_count: cursor_configs.size,
        cursor_configs: cursor_configs,
        total_items: total_rows
      )

      success(
        result: {
          batch_processing_outcome: outcome.to_h,
          total_rows: total_rows,
          batch_count: cursor_configs.size
        }
      )
    end
  end

  private

  def calculate_batches(total, batch_size)
    (0...total).step(batch_size).map.with_index do |start, idx|
      {
        'batch_id' => format('%03d', idx),
        'start_cursor' => start,
        'end_cursor' => [start + batch_size, total].min,
        'batch_size' => [batch_size, total - start].min
      }
    end
  end
end

Ruby (Batch Worker)

class CsvBatchWorkerHandler < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)

    # Handle placeholder batches
    no_op_result = handle_no_op_worker(batch_ctx)
    return no_op_result if no_op_result

    # Get file path from analyzer step
    analyzer_result = context.get_dependency_result('analyze_csv')
    file_path = analyzer_result&.dig('csv_file_path')

    # Process this batch
    records = read_csv_range(file_path, batch_ctx.start_cursor, batch_ctx.batch_size)
    processed = records.map { |row| transform_row(row) }

    batch_worker_complete(
      processed_count: processed.size,
      result_data: {
        batch_id: batch_ctx.batch_id,
        records_processed: processed.size,
        summary: calculate_summary(processed)
      }
    )
  end
end

Python (Batch Worker)

class CsvBatchWorkerHandler(BatchableHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        batch_ctx = self.get_batch_context(context)

        # Handle placeholder batches
        no_op_result = self.handle_no_op_worker(batch_ctx)
        if no_op_result:
            return no_op_result

        # Get file path from analyzer step
        analyzer_result = context.get_dependency_result("analyze_csv")
        file_path = analyzer_result.get("csv_file_path") if analyzer_result else None

        # Process this batch
        records = self.read_csv_range(
            file_path, batch_ctx.start_cursor, batch_ctx.batch_size
        )
        processed = [self.transform_row(row) for row in records]

        return self.batch_worker_complete(
            processed_count=len(processed),
            result_data={
                "batch_id": batch_ctx.batch_id,
                "records_processed": len(processed),
                "summary": self.calculate_summary(processed),
            },
        )

API Handler

Ruby

class FetchUserHandler < TaskerCore::StepHandler::Api
  def call(context)
    user_id = context.get_task_field('user_id')

    # Automatic error classification (429 -> retryable, 404 -> permanent)
    response = connection.get("/users/#{user_id}")
    process_response(response)

    success(result: {
      user_id: user_id,
      email: response.body['email'],
      name: response.body['name']
    })
  end

  def base_url
    'https://api.example.com'
  end

  def configure_connection
    Faraday.new(base_url) do |conn|
      conn.request :json
      conn.response :json
      conn.options.timeout = 30
    end
  end
end

Python

class FetchUserHandler(ApiStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        user_id = context.get_task_field("user_id")

        # Automatic error classification
        response = self.get(f"/users/{user_id}")

        return self.success(
            result={
                "user_id": user_id,
                "email": response["email"],
                "name": response["name"],
            }
        )

    @property
    def base_url(self) -> str:
        return "https://api.example.com"

    def configure_session(self, session):
        session.headers["Authorization"] = f"Bearer {self.get_token()}"
        session.timeout = 30

Error Handling Patterns

Ruby - Raising Exceptions

class ValidateOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    order = context.task.context

    # Permanent error - will not retry
    if order['amount'].to_f <= 0
      raise TaskerCore::Errors::PermanentError.new(
        'Order amount must be positive',
        error_code: 'INVALID_AMOUNT',
        context: { amount: order['amount'] }
      )
    end

    # Retryable error - will retry with backoff
    if external_service_unavailable?
      raise TaskerCore::Errors::RetryableError.new(
        'External service temporarily unavailable',
        retry_after: 30,
        context: { service: 'payment_gateway' }
      )
    end

    success(result: { valid: true })
  end
end

Python - Returning Failures

class ValidateOrderHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        order = context.task.context

        # Permanent error - will not retry
        amount = float(order.get("amount", 0))
        if amount <= 0:
            return self.failure(
                message="Order amount must be positive",
                error_type="validation_error",
                error_code="INVALID_AMOUNT",
                retryable=False,
                metadata={"amount": amount},
            )

        # Retryable error - will retry with backoff
        if self.external_service_unavailable():
            return self.failure(
                message="External service temporarily unavailable",
                error_type="retryable_error",
                retryable=True,
                metadata={"service": "payment_gateway"},
            )

        return self.success(result={"valid": True})

See Also

FFI Safety Safeguards

Last Updated: 2026-02-02 Status: Production Implementation Applies To: Ruby (Magnus), Python (PyO3), TypeScript (C FFI) workers


Overview

Tasker’s FFI workers embed the Rust tasker-worker runtime inside language-specific host processes (Ruby, Python, TypeScript/JavaScript). This document describes the safeguards that prevent Rust-side failures from crashing or corrupting the host process, ensuring that infrastructure unavailability, misconfiguration, and unexpected panics are surfaced as language-native errors rather than process faults.

FFI Architecture

Host Process (Ruby / Python / Node.js)
         │
         ▼
    FFI Boundary
    ┌─────────────────────────────────────┐
    │  Language Binding Layer              │
    │  (Magnus / PyO3 / extern "C")       │
    │                                     │
    │  ┌─────────────────────────────┐    │
    │  │  Bridge Module              │    │
    │  │  (bootstrap, poll, complete)│    │
    │  └────────────┬────────────────┘    │
    │               │                     │
    │  ┌────────────▼────────────────┐    │
    │  │  FfiDispatchChannel         │    │
    │  │  (event dispatch, callbacks)│    │
    │  └────────────┬────────────────┘    │
    │               │                     │
    │  ┌────────────▼────────────────┐    │
    │  │  WorkerBootstrap            │    │
    │  │  (runtime, DB, messaging)   │    │
    │  └─────────────────────────────┘    │
    └─────────────────────────────────────┘

Panic Safety by Framework

Each FFI framework provides different levels of automatic panic protection:

FrameworkPanic HandlingMechanism
Magnus (Ruby)AutomaticCatches panics at FFI boundary, converts to Ruby RuntimeError
PyO3 (Python)AutomaticCatches panics at #[pyfunction] boundary, converts to PanicException
C FFI (TypeScript)ManualRequires explicit std::panic::catch_unwind wrappers

TypeScript C FFI: Explicit Panic Guards

Because the TypeScript worker uses raw extern "C" functions (for compatibility with Node.js, Bun, and Deno FFI), panics unwinding through this boundary would be undefined behavior. All extern "C" functions that call into bridge internals are wrapped with catch_unwind:

// workers/typescript/src-rust/lib.rs
#[no_mangle]
pub unsafe extern "C" fn bootstrap_worker(config_json: *const c_char) -> *mut c_char {
    // ... parse config_json ...

    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        bridge::bootstrap_worker_internal(config_str)
    }));

    match result {
        Ok(Ok(json)) => /* return JSON */,
        Ok(Err(e)) => json_error(&format!("Bootstrap failed: {}", e)),
        Err(panic_info) => {
            // Extract panic message, log it, return JSON error
            json_error(&msg)
        }
    }
}

Protected functions: bootstrap_worker, stop_worker, get_worker_status, transition_to_graceful_shutdown, poll_step_events, poll_in_process_events, complete_step_event, checkpoint_yield_step_event, get_ffi_dispatch_metrics, check_starvation_warnings, cleanup_timeouts.

Error Handling at FFI Boundaries

Bootstrap Failures

When infrastructure is unavailable during worker startup, errors flow through the normal Result path rather than panicking:

Failure ScenarioHandlingHost Process Impact
Database unreachableTaskerError::DatabaseError returnedLanguage exception, app can retry
Config TOML missingTaskerError::ConfigurationError returnedLanguage exception with descriptive message
Worker config section absentTaskerError::ConfigurationError returnedLanguage exception (was previously a panic)
Messaging backend unavailableTaskerError::ConfigurationError returnedLanguage exception
Tokio runtime creation failsLogged + language error returnedLanguage exception
Port already in useTaskerError::WorkerError returnedLanguage exception
Redis/cache unavailableGraceful degradation to noop cacheNo error - worker starts without cache

Steady-State Operation Failures

Once bootstrapped, the worker handles infrastructure failures gracefully:

Failure ScenarioHandlingHost Process Impact
Database goes down during pollPoll returns None (no events)No impact - polling continues
Completion channel fullRetry loop with timeout, then loggedStep result may be lost after timeout
Completion channel closedReturns false to callerApp code sees completion failure
Callback timeout (5s)Logged, step completion unaffectedDomain events may be delayed
Messaging down during callbackCallback times out, loggedDomain events may not publish
Lock poisonedError returned to callerLanguage exception
Worker not initializedError returned to callerLanguage exception

Lock Acquisition

All three workers validate lock acquisition before proceeding:

// Pattern used in all workers
let handle_guard = WORKER_SYSTEM.lock().map_err(|e| {
    error!("Failed to acquire worker system lock: {}", e);
    // Convert to language-appropriate error
})?;

A poisoned mutex (from a previous panic) produces a language exception rather than propagating the original panic.

EventRouter Availability

Post-bootstrap access to the EventRouter uses fallible error handling rather than .expect():

// Use ok_or_else instead of expect to prevent panic at FFI boundary
let event_router = worker_core.event_router().ok_or_else(|| {
    error!("EventRouter not available from WorkerCore after bootstrap");
    // Return language-appropriate error
})?;

Callback Safety

The FfiDispatchChannel uses a fire-and-forget pattern for post-completion callbacks, preventing the host process from being blocked or deadlocked by Rust-side async operations:

  1. Completion is sent first - the step result is delivered to the completion channel before any callback fires
  2. Callback is spawned separately - runs in the Tokio runtime, not the FFI caller’s thread
  3. Timeout protection - callbacks are bounded by a configurable timeout (default 5s)
  4. Callback failures are logged - they never affect step completion or the host process
FFI Thread (Ruby/Python/JS)          Tokio Runtime
         │                                │
         ├──► complete(event_id, result)   │
         │    ├──► send result to channel  │
         │    └──► spawn callback ─────────┼──► callback.on_handler_complete()
         │                                 │    (with 5s timeout)
         ◄──── return true ────────────────│
         │  (immediate, non-blocking)      │

See docs/development/ffi-callback-safety.md for detailed callback safety guidelines.

Backpressure Protection

Completion Channel

The completion channel uses a try-send retry loop with timeout to prevent indefinite blocking:

  • Try-send avoids blocking the FFI thread
  • Retry with sleep (10ms intervals) handles transient backpressure
  • Timeout (configurable, default 30s) prevents permanent stalls
  • Logged when backpressure delays exceed 100ms

Starvation Detection

The FfiDispatchChannel tracks event age and warns when polling falls behind:

  • Events older than starvation_warning_threshold_ms (default 10s) trigger warnings
  • check_starvation_warnings() can be called periodically from the host process
  • FfiDispatchMetrics exposes pending count, oldest event age, and starvation status
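
As a host-side illustration, a Python worker can fold these checks into its polling loop roughly as follows; the function names match the FFI dispatch contract shared by Ruby and Python workers, while the import path and metrics attribute names are assumptions:

from tasker_core import (  # shared FFI functions; exact import path assumed
    check_starvation_warnings,
    cleanup_timeouts,
    get_ffi_dispatch_metrics,
)

def periodic_maintenance(poll_count: int) -> None:
    """Host-side upkeep interleaved with the normal step-event polling loop."""
    if poll_count % 100 == 0:
        # Ask the Rust side to log warnings for events older than the threshold
        check_starvation_warnings()
        metrics = get_ffi_dispatch_metrics()
        # Attribute names assumed from the FfiDispatchMetrics description above
        if getattr(metrics, "starvation_detected", False):
            pending = getattr(metrics, "pending_count", "?")
            print(f"FFI dispatch falling behind: {pending} events pending")

    if poll_count % 1000 == 0:
        cleanup_timeouts()  # reclaim events that exceeded their timeout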

Infrastructure Dependency Matrix

ComponentBootstrapPollCompleteCallback
DatabaseRequired (error on failure)Not neededNot neededErrors logged
Message BusRequired (error on failure)Not neededNot neededErrors logged
Config SystemRequired (error on failure)Not neededNot neededNot needed
Cache (Redis)Optional (degrades to noop)Not neededNot neededNot needed
Tokio RuntimeRequired (error on failure)UsedUsedUsed

Worker Lifecycle Safety

Start (bootstrap_worker)

  • Validates configuration, creates runtime, initializes all subsystems
  • All failures return language-appropriate errors
  • Already-running detection prevents double initialization

Status (get_worker_status)

  • Safe when worker is not initialized (returns running: false)
  • Safe when worker is running (queries internal state)
  • Lock acquisition failure returns error

Stop (stop_worker)

  • Safe when worker is not running (returns success message)
  • Sends shutdown signal and clears handle
  • In-flight operations complete before shutdown

Graceful Shutdown (transition_to_graceful_shutdown)

  • Initiates graceful shutdown allowing in-flight work to drain
  • Errors during transition are logged and returned
  • Requires worker to be running (error otherwise)

Adding a New FFI Worker

When implementing a new language worker:

  1. Check framework panic safety - if the framework (like Magnus/PyO3) catches panics automatically, you get protection for free. If using raw C FFI, wrap all extern "C" functions with catch_unwind.

  2. Use the standard bridge pattern - global WORKER_SYSTEM mutex, BridgeHandle struct containing WorkerSystemHandle + FfiDispatchChannel + runtime.

  3. Handle all lock acquisitions - always use .map_err() on .lock() calls.

  4. Avoid .expect() and .unwrap() in FFI code - use ok_or_else() or map_err() to convert to language-appropriate errors.

  5. Use fire-and-forget callbacks - never block the FFI thread on async operations.

  6. Integrate starvation detection - call check_starvation_warnings() periodically.

  7. Expose metrics - expose FfiDispatchMetrics for health monitoring.

FFI Memory Management in TypeScript Workers

Status: Active
Applies To: TypeScript/Bun/Node.js FFI Related: Ruby (Magnus), Python (PyO3)


Overview

This document explains the memory management pattern used when calling Rust functions from TypeScript via FFI (Foreign Function Interface). Understanding this pattern is critical for preventing memory leaks and undefined behavior.

Key Principle: When Rust hands memory to JavaScript across the FFI boundary, Rust’s ownership system no longer applies. The JavaScript code becomes responsible for explicitly freeing that memory.


The Memory Handoff Pattern

Three-Step Process

// 1. ALLOCATE: Rust allocates memory and returns a pointer
const ptr = this.lib.symbols.get_worker_status() as Pointer;

// 2. READ: JavaScript reads/copies the data from that pointer
const json = new CString(ptr);              // Read C string into JS string
const status = JSON.parse(json);            // Parse into JS object

// 3. FREE: JavaScript tells Rust to deallocate the memory
this.lib.symbols.free_rust_string(ptr);     // Rust frees the memory

// After this point, 'status' is a safe JavaScript object
// and the Rust memory has been freed (no leak)

Why This Pattern Exists

When Rust returns a pointer across the FFI boundary, it deliberately leaks the memory from Rust’s perspective:

// Rust side:
#[no_mangle]
pub extern "C" fn get_worker_status() -> *mut c_char {
    let status = WorkerStatus { /* ... */ };
    let json = serde_json::to_string(&status).unwrap();
    
    // into_raw() transfers ownership OUT of Rust's memory system
    CString::new(json).unwrap().into_raw()
    // Rust's Drop trait will NOT run on this memory!
}

The .into_raw() method:

  • Converts CString to a raw pointer
  • Prevents Rust from freeing the memory when it goes out of scope
  • Transfers ownership responsibility to the caller

Without this, Rust would free the memory immediately, and JavaScript would read garbage data (use-after-free).


The Free Function

JavaScript must call back into Rust to free the memory:

// Rust side:
#[no_mangle]
pub extern "C" fn free_rust_string(ptr: *mut c_char) {
    if ptr.is_null() {
        return;
    }
    
    // SAFETY: We know this pointer came from CString::into_raw()
    // and this function is only called once per pointer
    unsafe {
        let _ = CString::from_raw(ptr);
        // CString goes out of scope here and properly frees the memory
    }
}

This reconstructs the CString from the raw pointer, which causes Rust’s Drop trait to run and free the memory.


Safety Guarantees

This pattern is safe because of three key properties:

1. Single-Threaded JavaScript Runtime

JavaScript (and TypeScript) runs on a single thread (ignoring Web Workers), which means:

  • No race conditions: The read → free sequence is atomic from Rust’s perspective
  • No concurrent access: Only one piece of code can access the pointer at a time
  • Predictable execution order: Steps always happen in sequence

2. One-Way Handoff

Rust follows a strict contract:

Rust allocates → Returns pointer → NEVER TOUCHES IT AGAIN
  • Rust doesn’t keep any references to the memory
  • Rust never reads or writes to that memory after returning the pointer
  • The memory is “orphaned” from Rust’s perspective until free_rust_string is called

3. JavaScript Copies Before Freeing

JavaScript creates a new copy of the data before freeing:

const ptr = this.lib.symbols.get_worker_status() as Pointer;

// Step 1: Read bytes from Rust memory into a JavaScript string
const json = new CString(ptr);  // COPY operation

// Step 2: Parse string into JavaScript objects
const status = JSON.parse(json);  // Creates new JS objects

// Step 3: Free the Rust memory
this.lib.symbols.free_rust_string(ptr);

// At this point:
// - 'status' is pure JavaScript (managed by V8/JavaScriptCore)
// - Rust memory has been freed (no leak)
// - 'ptr' is invalid (but we never use it again)

The status object is fully owned by JavaScript’s garbage collector. It has no connection to the freed Rust memory.


Comparison to Ruby and Python FFI

Ruby (Magnus)

# Ruby FFI with Magnus
result = TaskerCore::FFI.get_worker_status()
# No explicit free needed - Magnus manages memory via Rust Drop traits

How it works: Magnus creates a bridge between Ruby’s GC and Rust’s ownership system. When Ruby no longer references the object, Rust’s Drop trait eventually runs.

Python (PyO3)

# Python FFI with PyO3
result = tasker_core.get_worker_status()
# No explicit free needed - PyO3 uses Python's reference counting

How it works: PyO3 wraps Rust data in PyObject wrappers. When Python’s reference count reaches zero, the Rust data is dropped.

TypeScript (Bun/Node FFI)

// TypeScript FFI - manual memory management required
const ptr = lib.symbols.get_worker_status();
const json = new CString(ptr);
const status = JSON.parse(json);
lib.symbols.free_rust_string(ptr);  // MUST call explicitly

Why different: Bun and Node.js use raw C FFI (similar to ctypes in Python or FFI gem in Ruby). There’s no automatic memory management bridge, so we must manually free.

Tradeoff: More verbose, but gives us complete control and makes memory lifetime explicit.


Common Pitfalls and How We Avoid Them

1. Memory Leak (Forgetting to Free)

Problem:

// BAD: Memory leak
const ptr = this.lib.symbols.get_worker_status();
const json = new CString(ptr);
const status = JSON.parse(json);
// Oops! Forgot to call free_rust_string(ptr)

How we avoid it: Every code path that allocates a pointer must free it. We wrap this in methods like pollStepEvents() that handle the complete lifecycle:

pollStepEvents(): FfiStepEvent[] {
  const ptr = this.lib.symbols.poll_step_events() as Pointer;
  if (!ptr) {
    return [];  // No allocation, no free needed
  }
  
  const json = new CString(ptr);
  const events = JSON.parse(json);
  this.lib.symbols.free_rust_string(ptr);  // Always freed
  return events;
}

2. Double-Free

Problem:

// BAD: Double-free (undefined behavior)
const ptr = this.lib.symbols.get_worker_status();
const json = new CString(ptr);
this.lib.symbols.free_rust_string(ptr);
this.lib.symbols.free_rust_string(ptr);  // CRASH! Already freed

How we avoid it: We free the pointer exactly once in each code path, and we never store pointers for reuse. Each pointer is used in a single scope and immediately freed.

3. Use-After-Free

Problem:

// BAD: Use-after-free
const ptr = this.lib.symbols.get_worker_status();
this.lib.symbols.free_rust_string(ptr);
const json = new CString(ptr);  // CRASH! Memory is gone

How we avoid it: We always read/copy before freeing. The order is strictly: allocate → read → free.


Pattern in Practice

Example: Worker Status

getWorkerStatus(): WorkerStatus {
  // 1. Allocate: Rust allocates memory for JSON string
  const ptr = this.lib.symbols.get_worker_status() as Pointer;
  
  // 2. Read: Copy data into JavaScript
  const json = new CString(ptr);        // Rust memory → JS string
  const status = JSON.parse(json);      // JS string → JS object
  
  // 3. Free: Deallocate Rust memory
  this.lib.symbols.free_rust_string(ptr);
  
  // 4. Return: Pure JavaScript object (safe)
  return status;
}

Example: Polling Step Events

pollStepEvents(): FfiStepEvent[] {
  const ptr = this.lib.symbols.poll_step_events() as Pointer;
  
  // Handle null pointer (no events available)
  if (!ptr) {
    return [];
  }
  
  const json = new CString(ptr);
  const events = JSON.parse(json);
  this.lib.symbols.free_rust_string(ptr);
  
  return events;
}

Example: Bootstrap Worker

bootstrapWorker(config: BootstrapConfig): BootstrapResult {
  const configJson = JSON.stringify(config);
  
  // Pass JavaScript data TO Rust (no pointer returned)
  const ptr = this.lib.symbols.bootstrap_worker(configJson) as Pointer;
  
  // Read the result
  const json = new CString(ptr);
  const result = JSON.parse(json);
  
  // Free the result pointer
  this.lib.symbols.free_rust_string(ptr);
  
  return result;
}

Memory Lifetime Diagrams

Successful Pattern

Time →

JavaScript:    [allocate ptr] → [read data] → [free ptr] → [use data]
Rust Memory:   [allocated]    → [allocated] → [freed]    → [freed]
JS Objects:    [none]         → [created]   → [exists]   → [exists]
                                  ↑
                            Data copied here

Memory Leak (Anti-Pattern)

Time →

JavaScript:    [allocate ptr] → [read data] → [use data] → ...
Rust Memory:   [allocated]    → [allocated] → [LEAK]     → [LEAK]
JS Objects:    [none]         → [created]   → [exists]   → [exists]
                                                ↑
                                    Forgot to free! Memory leaked

Use-After-Free (Anti-Pattern)

Time →

JavaScript:    [allocate ptr] → [free ptr] → [read ptr] → CRASH!
Rust Memory:   [allocated]    → [freed]    → [freed]
JS Objects:    [none]         → [none]     → [CORRUPT]
                                              ↑
                                    Reading freed memory!

Best Practices

1. Keep Pointer Lifetime Short

// GOOD: Pointer freed in same scope
const result = this.getWorkerStatus();

// BAD: Don't store pointers
this.statusPtr = this.lib.symbols.get_worker_status();  // Leak risk

2. Always Free in Same Method

// GOOD: Allocate and free in same method
pollStepEvents(): FfiStepEvent[] {
  const ptr = this.lib.symbols.poll_step_events();
  if (!ptr) return [];
  
  const json = new CString(ptr);
  const events = JSON.parse(json);
  this.lib.symbols.free_rust_string(ptr);
  return events;
}

// BAD: Returning pointer for later freeing
getPtrToStatus(): Pointer {
  return this.lib.symbols.get_worker_status();  // Who will free this?
}

3. Handle Null Pointers

// GOOD: Check for null before freeing
const ptr = this.lib.symbols.poll_step_events();
if (!ptr) {
  return [];  // No memory allocated, nothing to free
}

const json = new CString(ptr);
const events = JSON.parse(json);
this.lib.symbols.free_rust_string(ptr);
return events;

4. Document Ownership in Comments

/**
 * Poll for step events from FFI.
 * 
 * MEMORY: This function manages the lifetime of the pointer returned
 * by poll_step_events(). The pointer is freed before returning.
 */
pollStepEvents(): FfiStepEvent[] {
  // ...
}

Testing Memory Safety

Rust Tests

Rust’s test suite can verify FFI functions don’t leak:

#[test]
fn test_status_no_leak() {
    let ptr = get_worker_status();
    assert!(!ptr.is_null());
    
    // Manually free to ensure it works
    free_rust_string(ptr);
    
    // If we had a leak, tools like valgrind or AddressSanitizer
    // would catch it
}

TypeScript Tests

TypeScript tests verify proper usage:

test('status retrieval frees memory', () => {
  const runtime = new BunTaskerRuntime();
  
  // This should not leak - memory freed internally
  const status = runtime.getWorkerStatus();
  
  expect(status.running).toBeDefined();
  
  // Call multiple times to stress test
  for (let i = 0; i < 100; i++) {
    runtime.getWorkerStatus();
  }
  // If we leaked, we'd have 100 leaked strings
});

Leak Detection Tools

  • Valgrind (Linux): Detects memory leaks in Rust code
  • AddressSanitizer: Detects use-after-free and double-free
  • Process memory monitoring: Track RSS growth over time

When in Doubt

Golden Rule: Every *mut c_char pointer returned by a Rust FFI function must have a corresponding free_rust_string() call in the TypeScript code, executed exactly once per pointer, after all reads are complete.

If you see a pattern like:

const ptr = this.lib.symbols.some_function();

Ask yourself:

  1. Does this return a pointer to allocated memory? (Check Rust signature)
  2. Am I reading the data before freeing?
  3. Am I freeing exactly once?
  4. Am I never using ptr after freeing?

If the answer to all is “yes”, you’re following the pattern correctly.


References

  • Rust FFI Guidelines: https://doc.rust-lang.org/nomicon/ffi.html
  • Bun FFI Documentation: https://bun.sh/docs/api/ffi
  • Node.js ffi-napi: https://github.com/node-ffi-napi/node-ffi-napi
  • docs/worker-crates/patterns-and-practices.md: General worker patterns

Worker Crates: Common Patterns and Practices

Last Updated: 2026-01-06 Audience: Developers, Architects Status: Active Related Docs: Worker Event Systems | Worker Actors

<- Back to Worker Crates Overview


This document describes the common patterns and practices shared across all three worker implementations (Rust, Ruby, Python). Understanding these patterns helps developers write consistent handlers regardless of the language.


Architectural Patterns

Dual-Channel Architecture

All workers implement a dual-channel architecture for non-blocking step execution:

┌─────────────────────────────────────────────────────────────────┐
│                    DUAL-CHANNEL PATTERN                         │
└─────────────────────────────────────────────────────────────────┘

    PostgreSQL PGMQ
          │
          ▼
  ┌───────────────────┐
  │  Dispatch Channel │  ──→  Step events flow TO handlers
  └───────────────────┘
          │
          ▼
  ┌───────────────────┐
  │  Handler Execution │  ──→  Business logic runs here
  └───────────────────┘
          │
          ▼
  ┌───────────────────┐
  │ Completion Channel │  ──→  Results flow BACK to orchestration
  └───────────────────┘
          │
          ▼
    Orchestration

Benefits:

  • Fire-and-forget dispatch (non-blocking)
  • Bounded concurrency via semaphores
  • Results processed independently from dispatch
  • Consistent pattern across all languages

Language-Specific Implementations

ComponentRustRubyPython
Dispatch Channelmpsc::channelpoll_step_events FFIpoll_step_events FFI
Completion Channelmpsc::channelcomplete_step_event FFIcomplete_step_event FFI
Concurrency ModelTokio async tasksRuby threads + FFI pollingPython threads + FFI polling
GIL HandlingN/APull-based pollingPull-based polling

Handler Lifecycle

Handler Registration

All implementations follow the same registration pattern:

1. Define handler class/struct
2. Set handler_name identifier
3. Register with HandlerRegistry
4. Handler ready for resolution

Ruby Example:

class ProcessOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Access data via cross-language standard methods
    order_id = context.get_task_field('order_id')

    # Business logic here...

    # Return result using base class helper (keyword args required)
    success(result: { order_id: order_id, status: 'processed' })
  end
end

# Registration
registry = TaskerCore::Registry::HandlerRegistry.instance
registry.register_handler('ProcessOrderHandler', ProcessOrderHandler)

Python Example:

from tasker_core import StepHandler, StepHandlerResult, HandlerRegistry

class ProcessOrderHandler(StepHandler):
    handler_name = "process_order"

    def call(self, context):
        order_id = context.input_data.get("order_id")
        return StepHandlerResult.success_handler_result(
            {"order_id": order_id, "status": "processed"}
        )

# Registration
registry = HandlerRegistry.instance()
registry.register("process_order", ProcessOrderHandler)

Handler Resolution Flow

1. Step event received with handler name
2. Registry.resolve(handler_name) called
3. Handler class instantiated
4. handler.call(context) invoked
5. Result returned to completion channel
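
A rough Python sketch of this flow (the resolve and completion calls follow the steps above and the shared FFI contract; the event attribute names and import path are illustrative):

from tasker_core import HandlerRegistry, complete_step_event  # import path assumed

def dispatch(event):
    registry = HandlerRegistry.instance()
    handler_cls = registry.resolve(event.handler_name)  # 2. look up by name
    handler = handler_cls()                             # 3. instantiate
    result = handler.call(event.context)                # 4. run business logic
    complete_step_event(event.event_id, result)         # 5. report to completion channel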

Handler Context

All handlers receive a context object containing:

FieldDescription
task_uuidUnique identifier for the task
step_uuidUnique identifier for the step
input_dataTask context data passed to the step
dependency_resultsResults from parent/dependency steps
step_configConfiguration from step definition
step_inputsRuntime inputs from workflow_step.inputs
retry_countCurrent retry attempt number
max_retriesMaximum retry attempts allowed

Handler Results

All handlers return a structured result indicating success or failure. However, the APIs differ between Ruby and Python - this is a known design inconsistency that may be addressed in a future ticket.

Ruby - Uses keyword arguments and separate Success/Error types:

# Via base handler shortcuts
success(result: { key: "value" }, metadata: { duration_ms: 150 })

failure(
  message: "Something went wrong",
  error_type: "PermanentError",
  error_code: "VALIDATION_ERROR",  # Ruby has error_code field
  retryable: false,
  metadata: { field: "email" }
)

# Or via type factory methods
TaskerCore::Types::StepHandlerCallResult.success(result: { key: "value" })
TaskerCore::Types::StepHandlerCallResult.error(
  error_type: "PermanentError",
  message: "Error message",
  error_code: "ERR_001"
)

Python - Uses positional/keyword arguments and a single result type:

# Via base handler shortcuts
self.success(result={"key": "value"}, metadata={"duration_ms": 150})

self.failure(
    message="Something went wrong",
    error_type="ValidationError",  # Python has error_type only (no error_code)
    retryable=False,
    metadata={"field": "email"}
)

# Or via class factory methods
StepHandlerResult.success_handler_result(
    {"key": "value"},             # First positional arg is result
    {"duration_ms": 150}          # Second positional arg is metadata
)
StepHandlerResult.failure_handler_result(
    message="Something went wrong",
    error_type="ValidationError",
    retryable=False,
    metadata={"field": "email"}
)

Key Differences:

AspectRubyPython
Factory method names.success(), .error().success_handler_result(), .failure_handler_result()
Result typeSuccess / Error structsSingle StepHandlerResult class
Error code fielderror_code (freeform)Not present
Argument styleKeyword required (result:)Positional allowed

Error Handling

Error Classification

All workers classify errors into two categories:

TypeDescriptionBehavior
RetryableTransient errors that may succeed on retryStep re-enqueued up to max_retries
PermanentUnrecoverable errorsStep marked as failed immediately

HTTP Status Code Classification (ApiHandler)

400, 401, 403, 404, 422  →  Permanent Error (client errors)
429                       →  Retryable Error (rate limiting)
500-599                   →  Retryable Error (server errors)
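
The same classification expressed as a small Python sketch (illustrative only, not the actual ApiHandler internals):

PERMANENT_CLIENT_ERRORS = {400, 401, 403, 404, 422}

def is_retryable(status_code: int) -> bool:
    """Map an HTTP status code onto the retryable/permanent split above."""
    if status_code in PERMANENT_CLIENT_ERRORS:
        return False                      # client errors: fail permanently
    if status_code == 429:
        return True                       # rate limiting: retry with backoff
    if 500 <= status_code <= 599:
        return True                       # server errors: retry
    return False                          # this sketch treats unknown codes as permanent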

Exception Hierarchy

Ruby:

TaskerCore::Error                  # Base class
├── TaskerCore::RetryableError     # Transient failures
├── TaskerCore::PermanentError     # Unrecoverable failures
├── TaskerCore::FFIError           # FFI bridge errors
└── TaskerCore::ConfigurationError # Configuration issues

Python:

TaskerError                        # Base class
├── WorkerNotInitializedError      # Worker not bootstrapped
├── WorkerBootstrapError           # Bootstrap failed
├── WorkerAlreadyRunningError      # Double initialization
├── FFIError                       # FFI bridge errors
├── ConversionError                # Type conversion errors
└── StepExecutionError             # Handler execution errors

Error Context Propagation

All errors should include context for debugging:

StepHandlerResult.failure_handler_result(
    message="Payment gateway timeout",
    error_type="gateway_timeout",
    retryable=True,
    metadata={
        "gateway": "stripe",
        "request_id": "req_xyz",
        "response_time_ms": 30000
    }
)

Polling Architecture

Why Polling?

Ruby and Python workers use a pull-based polling model due to language runtime constraints:

Ruby: The Global VM Lock (GVL) prevents Rust from safely calling Ruby methods from Rust threads. Polling allows Ruby to control thread context.

Python: The Global Interpreter Lock (GIL) has the same limitation. Python must initiate all cross-language calls.

Polling Characteristics

ParameterDefault ValueDescription
Poll Interval10msTime between polls when no events
Max Latency~10msTime from event generation to processing start
Starvation CheckEvery 100 polls (1 second)Detect processing bottlenecks
Cleanup IntervalEvery 1000 polls (10 seconds)Clean up timed-out events

Poll Loop Structure

poll_count = 0

while running:
    poll_count += 1

    # 1. Poll for event
    event = poll_step_events()

    if event:
        # 2. Process event through handler
        process_event(event)
    else:
        # 3. Sleep when no events
        time.sleep(0.01)  # 10ms

    # 4. Periodic maintenance
    if poll_count % 100 == 0:
        check_starvation_warnings()

    if poll_count % 1000 == 0:
        cleanup_timeouts()

FFI Contract

Ruby and Python share the same FFI contract:

FunctionDescription
poll_step_events()Get next pending event (returns None if empty)
complete_step_event(event_id, result)Submit handler result
get_ffi_dispatch_metrics()Get dispatch channel metrics
check_starvation_warnings()Trigger starvation logging
cleanup_timeouts()Clean up timed-out events

Event Bridge Pattern

Overview

All workers implement an EventBridge (pub/sub) pattern for internal coordination:

┌─────────────────────────────────────────────────────────────────┐
│                      EVENT BRIDGE PATTERN                        │
└─────────────────────────────────────────────────────────────────┘

  Publishers                    EventBridge                 Subscribers
  ─────────                    ───────────                 ───────────
  HandlerRegistry  ──publish──→            ──notify──→  StepExecutionSubscriber
  EventPoller      ──publish──→  [Events]  ──notify──→  MetricsCollector
  Worker           ──publish──→            ──notify──→  Custom Subscribers

Standard Event Names

EventDescriptionPayload
handler_registeredHandler added to registry(name, handler_class)
step_execution_receivedStep event receivedFfiStepEvent
step_execution_completedHandler finishedStepHandlerResult
worker_startedWorker bootstrap completeworker_id
worker_stoppedWorker shutdownworker_id

Implementation Libraries

LanguageLibraryPattern
Rubydry-eventsPublisher/Subscriber
PythonpyeeEventEmitter
RustNative channelsmpsc

Usage Example (Python)

from tasker_core import EventBridge, EventNames

bridge = EventBridge.instance()

# Subscribe to events
def on_step_received(event):
    print(f"Processing step {event.step_uuid}")

bridge.subscribe(EventNames.STEP_EXECUTION_RECEIVED, on_step_received)

# Publish events
bridge.publish(EventNames.HANDLER_REGISTERED, "my_handler", MyHandler)

Singleton Pattern

Worker State Management

All workers store global state in a thread-safe singleton:

┌─────────────────────────────────────────────────────────────────┐
│                    SINGLETON WORKER STATE                        │
└─────────────────────────────────────────────────────────────────┘

    Thread-Safe Global
           │
           ▼
    ┌──────────────────┐
    │   WorkerSystem   │
    │  ┌────────────┐  │
    │  │ Mutex/Lock │  │
    │  │  Inner     │  │
    │  │  State     │  │
    │  └────────────┘  │
    └──────────────────┘
           │
           ├──→ HandlerRegistry
           ├──→ EventBridge
           ├──→ EventPoller
           └──→ Configuration

Singleton Classes

LanguageSingleton Implementation
RustOnceLock<Mutex<WorkerSystem>>
RubySingleton module
PythonClass-level _instance with instance() classmethod
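
For illustration, the Python convention follows roughly this shape; this is a sketch of the pattern, not the actual tasker_core classes, which also guard their inner state:

import threading

class WorkerSingleton:  # stand-in for HandlerRegistry, EventBridge, etc.
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def instance(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    @classmethod
    def reset_instance(cls):
        # Used for test isolation, as shown below
        with cls._lock:
            cls._instance = None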

Reset for Testing

All singletons provide reset methods for test isolation:

# Python
HandlerRegistry.reset_instance()
EventBridge.reset_instance()
# Ruby
TaskerCore::Registry::HandlerRegistry.reset_instance!

Observability

Health Checks

All workers expose health information via FFI:

from tasker_core import get_health_check

health = get_health_check()
# Returns: HealthCheck with component statuses

Metrics

Standard metrics available from all workers:

MetricDescription
pending_countEvents awaiting processing
in_flight_countEvents currently being processed
completed_countSuccessfully completed events
failed_countFailed events
starvation_detectedWhether events are timing out
starving_event_countEvents exceeding timeout threshold

Structured Logging

All workers use structured logging with consistent fields:

from tasker_core import log_info, LogContext

context = LogContext(
    correlation_id="abc-123",
    task_uuid="task-456",
    operation="process_order"
)
log_info("Processing order", context)

Specialized Handlers

Handler Type Hierarchy

Ruby (all are subclasses):

TaskerCore::StepHandler::Base
├── TaskerCore::StepHandler::Api        # HTTP/REST API integration
├── TaskerCore::StepHandler::Decision   # Dynamic workflow decisions
└── TaskerCore::StepHandler::Batchable  # Batch processing support

Python (Batchable is a mixin):

StepHandler (ABC)
├── ApiHandler         # HTTP/REST API integration (subclass)
├── DecisionHandler    # Dynamic workflow decisions (subclass)
└── + Batchable        # Batch processing (mixin via multiple inheritance)

ApiHandler

For HTTP API integration with automatic error classification:

class FetchUserHandler(ApiHandler):
    handler_name = "fetch_user"

    def call(self, context):
        response = self.get(f"/users/{context.input_data['user_id']}")
        return self.success(result=response.json())

DecisionHandler

For dynamic workflow routing:

class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
  def call(context)
    amount = context.get_task_field('amount')

    if amount < 1000
      decision_success(steps: ['auto_approve'], result_data: { route: 'auto' })
    else
      decision_success(steps: ['manager_approval', 'finance_review'])
    end
  end
end

Batchable

For processing large datasets in chunks. Note: Ruby uses subclass inheritance, Python uses mixin.

Ruby (subclass):

class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)
    no_op_result = handle_no_op_worker(batch_ctx)
    return no_op_result if no_op_result

    # Process records using batch_ctx.start_cursor, batch_ctx.end_cursor
    batch_worker_complete(processed_count: batch_ctx.batch_size)
  end
end

Python (mixin):

class CsvProcessorHandler(StepHandler, Batchable):
    handler_name = "csv_processor"

    def call(self, context: StepContext) -> StepHandlerResult:
        batch_ctx = self.get_batch_context(context)
        # Process records using batch_ctx.start_cursor, batch_ctx.end_cursor
        return self.batch_worker_success(processed_count=batch_ctx.batch_size)

Checkpoint Yielding

Checkpoint yielding enables batch workers to persist progress and yield control back to the orchestrator for re-dispatch. This is essential for long-running batch operations.

When to Use

  • Processing takes longer than visibility timeout
  • You need resumable processing after failures
  • Long-running operations need progress visibility

Cross-Language API

All Batchable handlers provide checkpoint_yield() (or checkpointYield() in TypeScript):

Ruby:

class MyBatchWorker < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)

    # Resume from checkpoint if present
    start = batch_ctx.has_checkpoint? ? batch_ctx.checkpoint_cursor : 0

    items.each_with_index do |item, idx|
      process_item(item)

      # Checkpoint every 1000 items
      if (idx + 1) % 1000 == 0
        checkpoint_yield(
          cursor: start + idx + 1,
          items_processed: idx + 1,
          accumulated_results: { partial: "data" }
        )
      end
    end

    batch_worker_complete(processed_count: items.size)
  end
end

Python:

class MyBatchWorker(StepHandler, Batchable):
    def call(self, context):
        batch_ctx = self.get_batch_context(context)

        # Resume from checkpoint if present
        start = batch_ctx.checkpoint_cursor if batch_ctx.has_checkpoint() else 0

        for idx, item in enumerate(items):
            self.process_item(item)

            # Checkpoint every 1000 items
            if (idx + 1) % 1000 == 0:
                self.checkpoint_yield(
                    cursor=start + idx + 1,
                    items_processed=idx + 1,
                    accumulated_results={"partial": "data"}
                )

        return self.batch_worker_success(processed_count=len(items))

TypeScript:

class MyBatchWorker extends BatchableHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    const batchCtx = this.getBatchContext(context);

    // Resume from checkpoint if present
    const start = batchCtx.hasCheckpoint() ? batchCtx.checkpointCursor : 0;

    for (let idx = 0; idx < items.length; idx++) {
      await this.processItem(items[idx]);

      // Checkpoint every 1000 items
      if ((idx + 1) % 1000 === 0) {
        await this.checkpointYield({
          cursor: start + idx + 1,
          itemsProcessed: idx + 1,
          accumulatedResults: { partial: "data" }
        });
      }
    }

    return this.batchWorkerSuccess({ processedCount: items.length });
  }
}

BatchWorkerContext Checkpoint Accessors

All languages provide consistent accessors for checkpoint data:

AccessorRubyPythonTypeScript
Cursor positioncheckpoint_cursorcheckpoint_cursorcheckpointCursor
Accumulated dataaccumulated_resultsaccumulated_resultsaccumulatedResults
Has checkpoint?has_checkpoint?has_checkpoint()hasCheckpoint()
Items processedcheckpoint_items_processedcheckpoint_items_processedcheckpointItemsProcessed

FFI Contract

FunctionDescription
checkpoint_yield_step_event(event_id, data)Persist checkpoint and re-dispatch step

Key Invariants

  1. Checkpoint-Persist-Then-Redispatch: Progress saved before re-dispatch
  2. Step Stays InProgress: No state machine transitions during yield
  3. Handler-Driven: Handlers decide when to checkpoint

See Batch Processing Guide - Checkpoint Yielding for comprehensive documentation.


Best Practices

1. Keep Handlers Focused

Each handler should do one thing well:

  • Validate input
  • Perform single operation
  • Return clear result

2. Use Error Classification

Always specify whether errors are retryable:

# Good - clear error classification
return self.failure("API rate limit", retryable=True)

# Bad - ambiguous error handling
raise Exception("API error")

3. Include Context in Errors

return StepHandlerResult.failure_handler_result(
    message="Database connection failed",
    error_type="database_error",
    retryable=True,
    metadata={
        "host": "db.example.com",
        "port": 5432,
        "connection_timeout_ms": 5000
    }
)

4. Use Structured Logging

log_info("Order processed", {
    "order_id": order_id,
    "total": total,
    "items_count": len(items)
})

5. Test Handler Isolation

Reset singletons between tests:

def setup_method(self):
    HandlerRegistry.reset_instance()
    EventBridge.reset_instance()

See Also

Python Worker

Last Updated: 2026-01-01 Audience: Python Developers Status: Active Package: tasker_core Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix

<- Back to Worker Crates Overview


The Python worker provides a package-based interface for integrating tasker-core workflow execution into Python applications. It supports both standalone server deployment and headless embedding in existing codebases.

Quick Start

Installation

cd workers/python
uv sync                    # Install dependencies
uv run maturin develop     # Build FFI extension

Running the Server

python bin/server.py

Environment Variables

VariableDescriptionDefault
DATABASE_URLPostgreSQL connection stringRequired
TASKER_ENVEnvironment (test/development/production)development
TASKER_CONFIG_PATHPath to TOML configurationAuto-detected
TASKER_TEMPLATE_PATHPath to task templatesAuto-detected
PYTHON_HANDLER_PATHPath for handler auto-discoveryNot set
RUST_LOGLog level (trace/debug/info/warn/error)info

Architecture

Server Mode

Location: workers/python/bin/server.py

The server bootstraps the Rust foundation and manages Python handler execution:

from tasker_core import (
    bootstrap_worker,
    EventBridge,
    EventPoller,
    HandlerRegistry,
    StepExecutionSubscriber,
)

# Bootstrap Rust worker foundation
result = bootstrap_worker(config)

# Start event dispatch system
event_bridge = EventBridge.instance()
event_bridge.start()

# Create step execution subscriber
handler_registry = HandlerRegistry.instance()
step_subscriber = StepExecutionSubscriber(
    event_bridge=event_bridge,
    handler_registry=handler_registry,
    worker_id="python-worker-001"
)
step_subscriber.start()

# Start event poller (10ms polling)
event_poller = EventPoller(polling_interval_ms=10)
event_poller.on_step_event(lambda e: event_bridge.publish("step_execution_received", e))
event_poller.start()

# Wait for shutdown signal
shutdown_event.wait()

# Graceful shutdown
event_poller.stop()
step_subscriber.stop()
event_bridge.stop()
stop_worker()

Headless/Embedded Mode

For embedding in existing Python applications:

from tasker_core import (
    bootstrap_worker,
    HandlerRegistry,
    EventBridge,
    EventPoller,
    StepExecutionSubscriber,
)
from tasker_core.types import BootstrapConfig

# Bootstrap worker (headless mode controlled via TOML: web.enabled = false)
config = BootstrapConfig(namespace="my-app")
bootstrap_worker(config)

# Register handlers
registry = HandlerRegistry.instance()
registry.register("process_data", ProcessDataHandler)

# Start event dispatch (required for embedded usage)
bridge = EventBridge.instance()
bridge.start()

subscriber = StepExecutionSubscriber(bridge, registry, "embedded-worker")
subscriber.start()

poller = EventPoller()
poller.on_step_event(lambda e: bridge.publish("step_execution_received", e))
poller.start()

FFI Bridge

Python communicates with the Rust foundation via FFI polling:

┌────────────────────────────────────────────────────────────────┐
│                    PYTHON FFI BRIDGE                            │
└────────────────────────────────────────────────────────────────┘

   Rust Worker System
          │
          │ FFI (poll_step_events)
          ▼
   ┌─────────────────────┐
   │    EventPoller      │
   │  (daemon thread)    │──→ poll every 10ms
   └─────────────────────┘
          │
          │ publish to EventBridge
          ▼
   ┌─────────────────────┐
   │ StepExecution       │
   │ Subscriber          │──→ route to handler
   └─────────────────────┘
          │
          │ handler.call(context)
          ▼
   ┌─────────────────────┐
   │  Handler Execution  │
   └─────────────────────┘
          │
          │ FFI (complete_step_event)
          ▼
   Rust Completion Channel

Handler Development

Base Handler (ABC)

Location: python/tasker_core/step_handler/base.py

All handlers inherit from StepHandler:

from tasker_core import StepHandler, StepContext, StepHandlerResult

class ProcessOrderHandler(StepHandler):
    handler_name = "process_order"
    handler_version = "1.0.0"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Access input data
        order_id = context.input_data.get("order_id")
        amount = context.input_data.get("amount")

        # Business logic
        result = self.process_order(order_id, amount)

        # Return success
        return self.success(result={
            "order_id": order_id,
            "status": "processed",
            "total": result["total"]
        })

Handler Signature

def call(self, context: StepContext) -> StepHandlerResult:
    # context.task_uuid       - Task identifier
    # context.step_uuid       - Step identifier
    # context.input_data      - Task context data
    # context.dependency_results - Results from parent steps
    # context.step_config     - Handler configuration
    # context.step_inputs     - Runtime inputs
    # context.retry_count     - Current retry attempt
    # context.max_retries     - Maximum retry attempts

Result Methods

# Success result (from base class)
return self.success(
    result={"key": "value"},
    metadata={"duration_ms": 100}
)

# Failure result (from base class)
return self.failure(
    message="Payment declined",
    error_type="payment_error",
    retryable=True,
    metadata={"card_last_four": "1234"}
)

# Or using factory methods
from tasker_core import StepHandlerResult

return StepHandlerResult.success_handler_result(
    {"key": "value"},
    {"duration_ms": 100}
)

return StepHandlerResult.failure_handler_result(
    message="Error",
    error_type="validation_error",
    retryable=False
)

Accessing Dependencies

def call(self, context: StepContext) -> StepHandlerResult:
    # Get result from a dependency step
    validation = context.dependency_results.get("validate_order", {})

    if validation.get("valid"):
        # Process with validated data
        return self.success(result={"processed": True})

    return self.failure("Validation failed", retryable=False)

Composition Pattern

Python handlers compose capabilities via mixins (multiple inheritance) rather than relying on a single deep inheritance chain.

from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin, DecisionMixin

class MyHandler(StepHandler, APIMixin, DecisionMixin):
    handler_name = "my_handler"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Has both API methods (get, post, put, delete)
        # And Decision methods (decision_success, skip_branches)
        response = self.get("/api/endpoint")
        return self.decision_success(["next_step"], response)

Available Mixins

| Mixin | Location | Methods Provided |
|-------|----------|------------------|
| APIMixin | mixins/api.py | get, post, put, delete, http_client |
| DecisionMixin | mixins/decision.py | decision_success, skip_branches, decision_failure |
| BatchableMixin | (base class) | get_batch_context, handle_no_op_worker, create_cursor_configs |

Using Wrapper Classes (Backward Compatible)

The wrapper classes delegate to mixins internally:

# These are equivalent:
class MyHandler(ApiHandler):
    # Inherits API methods via APIMixin internally
    pass

class MyHandler(StepHandler, APIMixin):
    # Explicit mixin inclusion
    pass

Specialized Handlers

API Handler

Location: python/tasker_core/step_handler/api.py

For HTTP API integration with automatic error classification:

from tasker_core.step_handler import ApiHandler

class FetchUserHandler(ApiHandler):
    handler_name = "fetch_user"
    base_url = "https://api.example.com"

    def call(self, context: StepContext) -> StepHandlerResult:
        user_id = context.input_data["user_id"]

        # Automatic error classification
        response = self.get(f"/users/{user_id}")

        return self.api_success(response)

HTTP Methods:

# GET request
response = self.get("/path", params={"key": "value"}, headers={})

# POST request
response = self.post("/path", data={"key": "value"}, headers={})

# PUT request
response = self.put("/path", data={"key": "value"}, headers={})

# DELETE request
response = self.delete("/path", params={}, headers={})

ApiResponse Properties:

response.status_code     # HTTP status code
response.headers         # Response headers
response.body            # Parsed body (dict or str)
response.ok              # True if 2xx status
response.is_client_error # True if 4xx status
response.is_server_error # True if 5xx status
response.is_retryable    # True if should retry (408, 429, 500-504)
response.retry_after     # Retry-After header value in seconds

Error Classification:

| Status | Classification | Behavior |
|--------|----------------|----------|
| 400, 401, 403, 404, 422 | Non-retryable | Permanent failure |
| 408, 429, 500-504 | Retryable | Standard retry |
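
If a handler needs to branch on status itself rather than rely on the automatic classification, the ApiResponse flags shown above map directly onto failure results. A minimal sketch, assuming the request helper returns the ApiResponse rather than raising; the handler and endpoint here are hypothetical:

from tasker_core import StepContext, StepHandlerResult
from tasker_core.step_handler import ApiHandler

class ChargeCardHandler(ApiHandler):
    # Hypothetical handler used only to illustrate manual classification
    handler_name = "charge_card"
    base_url = "https://payments.example.com"

    def call(self, context: StepContext) -> StepHandlerResult:
        response = self.post("/charges", data=context.input_data)

        if response.ok:
            return self.api_success(response)

        # Mirror the classification table: 408/429/500-504 retry, other 4xx fail permanently
        return self.failure(
            message=f"Charge failed with HTTP {response.status_code}",
            error_type="api_error",
            retryable=response.is_retryable,
            metadata={"retry_after": response.retry_after},
        )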

Decision Handler

Location: python/tasker_core/step_handler/decision.py

For dynamic workflow routing:

from tasker_core.step_handler import DecisionHandler
from tasker_core import DecisionPointOutcome

class RoutingDecisionHandler(DecisionHandler):
    handler_name = "routing_decision"

    def call(self, context: StepContext) -> StepHandlerResult:
        amount = context.input_data.get("amount", 0)

        if amount < 1000:
            # Auto-approve small amounts
            outcome = DecisionPointOutcome.create_steps(
                ["auto_approve"],
                routing_context={"route_type": "auto"}
            )
            return self.decision_success(outcome)

        elif amount < 5000:
            # Manager approval for medium amounts
            outcome = DecisionPointOutcome.create_steps(
                ["manager_approval"],
                routing_context={"route_type": "manager"}
            )
            return self.decision_success(outcome)

        else:
            # Dual approval for large amounts
            outcome = DecisionPointOutcome.create_steps(
                ["manager_approval", "finance_review"],
                routing_context={"route_type": "dual"}
            )
            return self.decision_success(outcome)

Decision Methods:

# Create steps
outcome = DecisionPointOutcome.create_steps(
    step_names=["step1", "step2"],
    routing_context={"key": "value"}
)
return self.decision_success(outcome)

# No branches needed
outcome = DecisionPointOutcome.no_branches(reason="condition not met")
return self.decision_no_branches(outcome)

Batchable Mixin

Location: python/tasker_core/batch_processing/

For processing large datasets in chunks. Both analyzer and worker handlers implement the standard call() method:

Analyzer Handler (creates batch configurations):

from tasker_core import StepHandler, StepHandlerResult
from tasker_core.batch_processing import Batchable

class CsvAnalyzerHandler(StepHandler, Batchable):
    handler_name = "csv_analyzer"

    def call(self, context: StepContext) -> StepHandlerResult:
        """Analyze CSV and create batch worker configurations."""
        csv_path = context.input_data["csv_path"]
        row_count = count_csv_rows(csv_path)

        if row_count == 0:
            # No data to process
            return self.batch_analyzer_success(
                cursor_configs=[],
                total_items=0,
                batch_metadata={"csv_path": csv_path}
            )

        # Create cursor ranges for batch workers
        cursor_configs = self.create_cursor_ranges(
            total_items=row_count,
            batch_size=100,
            max_batches=5
        )

        return self.batch_analyzer_success(
            cursor_configs=cursor_configs,
            total_items=row_count,
            worker_template_name="process_csv_batch",
            batch_metadata={"csv_path": csv_path}
        )

Worker Handler (processes a batch):

class CsvBatchProcessorHandler(StepHandler, Batchable):
    handler_name = "csv_batch_processor"

    def call(self, context: StepContext) -> StepHandlerResult:
        """Process a batch of CSV rows."""
        # Get cursor config from step_inputs
        step_inputs = context.step_inputs or {}

        # Check for no-op placeholder batch
        if step_inputs.get("is_no_op"):
            return self.batch_worker_success(
                items_processed=0,
                items_succeeded=0,
                metadata={"no_op": True}
            )

        cursor = step_inputs.get("cursor", {})
        start_cursor = cursor.get("start_cursor", 0)
        end_cursor = cursor.get("end_cursor", 0)

        # Get CSV path from analyzer result
        analyzer_result = context.get_dependency_result("analyze_csv")
        csv_path = analyzer_result["batch_metadata"]["csv_path"]

        # Process the batch
        results = process_csv_batch(csv_path, start_cursor, end_cursor)

        return self.batch_worker_success(
            items_processed=results["count"],
            items_succeeded=results["success"],
            items_failed=results["failed"],
            results=results["data"],
            last_cursor=end_cursor
        )

Batchable Helper Methods:

# Analyzer helpers
self.create_cursor_ranges(total_items, batch_size, max_batches)
self.batch_analyzer_success(cursor_configs, total_items, worker_template_name, batch_metadata)

# Worker helpers
self.batch_worker_success(items_processed, items_succeeded, items_failed, results, last_cursor, metadata)
self.get_batch_context(context)  # Returns BatchWorkerContext or None

# Aggregator helpers
self.aggregate_worker_results(worker_results)  # Returns aggregated counts

Handler Registry

Registration

Location: python/tasker_core/handler.py

from tasker_core import HandlerRegistry

registry = HandlerRegistry.instance()

# Manual registration
registry.register("process_order", ProcessOrderHandler)

# Check if registered
registry.is_registered("process_order")  # True

# Resolve and instantiate
handler = registry.resolve("process_order")
result = handler.call(context)

# List all handlers
registry.list_handlers()  # ["process_order", ...]

# Handler count
registry.handler_count()  # 1

Auto-Discovery

# Discover handlers from a package
count = registry.discover_handlers("myapp.handlers")
print(f"Discovered {count} handlers")

Handlers are discovered by (see the sketch after this list):

  1. Scanning the package for classes inheriting from StepHandler
  2. Using the handler_name class attribute for registration
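
A conceptual sketch of that scan, assuming standard importlib/pkgutil tooling; the real implementation lives inside tasker_core and the function name here is illustrative:

import importlib
import inspect
import pkgutil

from tasker_core import StepHandler

def find_handlers(package_name: str) -> dict[str, type]:
    """Walk a package and collect StepHandler subclasses keyed by handler_name."""
    package = importlib.import_module(package_name)
    found: dict[str, type] = {}

    for module_info in pkgutil.walk_packages(package.__path__, prefix=f"{package_name}."):
        module = importlib.import_module(module_info.name)
        for _, cls in inspect.getmembers(module, inspect.isclass):
            name = getattr(cls, "handler_name", None)
            if issubclass(cls, StepHandler) and cls is not StepHandler and name:
                found[name] = cls

    return found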

Type System

Pydantic Models

Python types use Pydantic for validation:

from tasker_core import StepContext, StepHandlerResult, FfiStepEvent

# StepContext - validated from FFI event
context = StepContext.from_ffi_event(event, "handler_name")
context.task_uuid      # UUID
context.step_uuid      # UUID
context.input_data     # dict
context.retry_count    # int

# StepHandlerResult - structured result
result = StepHandlerResult.success_handler_result({"key": "value"})
result.success         # True
result.result          # {"key": "value"}
result.error_message   # None

Configuration Types

from tasker_core.types import BootstrapConfig, CursorConfig

# Bootstrap configuration
# Note: Headless mode is controlled via TOML config (web.enabled = false)
config = BootstrapConfig(
    namespace="my-app",
    log_level="info"
)

# Cursor configuration for batch processing
cursor = CursorConfig(
    batch_size=100,
    start_cursor=0,
    end_cursor=1000
)

Event System

EventBridge

Location: python/tasker_core/event_bridge.py

from tasker_core import EventBridge, EventNames

bridge = EventBridge.instance()

# Start the event system
bridge.start()

# Subscribe to events
def on_step_received(event):
    print(f"Processing step: {event.step_uuid}")

bridge.subscribe(EventNames.STEP_EXECUTION_RECEIVED, on_step_received)

# Publish events
bridge.publish(EventNames.HANDLER_REGISTERED, "my_handler", MyHandler)

# Stop when done
bridge.stop()

Event Names

from tasker_core import EventNames

EventNames.STEP_EXECUTION_RECEIVED  # Step event received from FFI
EventNames.STEP_COMPLETION_SENT     # Handler result sent to FFI
EventNames.HANDLER_REGISTERED       # Handler registered
EventNames.HANDLER_ERROR            # Handler execution error
EventNames.POLLER_METRICS           # FFI dispatch metrics update
EventNames.POLLER_ERROR             # Poller encountered an error

EventPoller

Location: python/tasker_core/event_poller.py

from tasker_core import EventPoller

poller = EventPoller(
    polling_interval_ms=10,       # Poll every 10ms
    starvation_check_interval=100, # Check every 1 second
    cleanup_interval=1000          # Cleanup every 10 seconds
)

# Register callbacks
poller.on_step_event(handle_step)
poller.on_metrics(handle_metrics)
poller.on_error(handle_error)

# Start polling (daemon thread)
poller.start()

# Get metrics
metrics = poller.get_metrics()
print(f"Pending: {metrics.pending_count}")

# Stop polling
poller.stop(timeout=5.0)

Domain Events

Python has full domain event support with lifecycle hooks matching Ruby and TypeScript capabilities.

Location: python/tasker_core/domain_events.py

BasePublisher

Publishers transform step execution context into domain-specific events:

from tasker_core.domain_events import BasePublisher, StepEventContext, DomainEvent

class PaymentEventPublisher(BasePublisher):
    publisher_name = "payment_events"

    def publishes_for(self) -> list[str]:
        """Which steps trigger this publisher."""
        return ["process_payment", "refund_payment"]

    async def transform_payload(self, ctx: StepEventContext) -> dict:
        """Transform step context into domain event payload."""
        return {
            "payment_id": ctx.result.get("payment_id"),
            "amount": ctx.result.get("amount"),
            "currency": ctx.result.get("currency"),
            "status": ctx.result.get("status")
        }

    # Lifecycle hooks (optional)
    async def before_publish(self, ctx: StepEventContext) -> None:
        """Called before publishing."""
        print(f"Publishing payment event for step: {ctx.step_name}")

    async def after_publish(self, ctx: StepEventContext, event: DomainEvent) -> None:
        """Called after successful publish."""
        print(f"Published event: {event.event_name}")

    async def on_publish_error(self, ctx: StepEventContext, error: Exception) -> None:
        """Called on publish failure."""
        print(f"Failed to publish: {error}")

    async def additional_metadata(self, ctx: StepEventContext) -> dict:
        """Inject custom metadata."""
        return {"payment_processor": "stripe"}

BaseSubscriber

Subscribers react to domain events matching specific patterns:

from tasker_core.domain_events import BaseSubscriber, InProcessDomainEvent, SubscriberResult

class AuditLoggingSubscriber(BaseSubscriber):
    subscriber_name = "audit_logger"

    def subscribes_to(self) -> list[str]:
        """Which events to handle (glob patterns supported)."""
        return ["payment.*", "order.completed"]

    async def handle(self, event: InProcessDomainEvent) -> SubscriberResult:
        """Handle matching events."""
        await self.log_to_audit_trail(event)
        return SubscriberResult(success=True)

    # Lifecycle hooks (optional)
    async def before_handle(self, event: InProcessDomainEvent) -> None:
        """Called before handling."""
        print(f"Handling: {event.event_name}")

    async def after_handle(self, event: InProcessDomainEvent, result: SubscriberResult) -> None:
        """Called after handling."""
        print(f"Handled successfully: {result.success}")

    async def on_handle_error(self, event: InProcessDomainEvent, error: Exception) -> None:
        """Called on handler failure."""
        print(f"Handler error: {error}")

Registries

Manage publishers and subscribers with singleton registries:

from tasker_core.domain_events import PublisherRegistry, SubscriberRegistry

# Publisher Registry
pub_registry = PublisherRegistry.instance()
pub_registry.register(PaymentEventPublisher)
pub_registry.register(OrderEventPublisher)

# Get publisher for a step
publisher = pub_registry.get_for_step("process_payment")

# Subscriber Registry
sub_registry = SubscriberRegistry.instance()
sub_registry.register(AuditLoggingSubscriber)
sub_registry.register(MetricsSubscriber)

# Start all subscribers
sub_registry.start_all()

# Stop all subscribers
sub_registry.stop_all()

Signal Handling

The Python worker handles signals for graceful shutdown:

| Signal | Behavior |
|--------|----------|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |
| SIGUSR1 | Report worker status |

import signal

def handle_shutdown(signum, frame):
    print("Shutting down...")
    shutdown_event.set()

signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)
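
The SIGUSR1 status report from the table above is not shown in the snippet. A minimal sketch, assuming the server keeps a reference to the event_poller created earlier and reuses the get_metrics() accessor from the EventPoller section:

def handle_status(signum, frame):
    # Illustrative status report; pending_count comes from the poller metrics shown above
    metrics = event_poller.get_metrics()
    print(f"Worker status: pending={metrics.pending_count}")

signal.signal(signal.SIGUSR1, handle_status)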

Error Handling

Exception Classes

from tasker_core import (
    TaskerError,              # Base class
    WorkerNotInitializedError,
    WorkerBootstrapError,
    WorkerAlreadyRunningError,
    FFIError,
    ConversionError,
    StepExecutionError,
)

Using StepExecutionError

from tasker_core import StepExecutionError

def call(self, context):
    # Retryable error
    raise StepExecutionError(
        "Database connection timeout",
        error_type="database_error",
        retryable=True
    )

    # Non-retryable error
    raise StepExecutionError(
        "Invalid input format",
        error_type="validation_error",
        retryable=False
    )

Logging

Structured Logging

from tasker_core import log_info, log_error, log_warn, log_debug, LogContext

# Simple logging
log_info("Processing started")
log_error("Failed to connect")

# With context dict
log_info("Order processed", {
    "order_id": "123",
    "amount": "100.00"
})

# With LogContext model
context = LogContext(
    correlation_id="abc-123",
    task_uuid="task-456",
    operation="process_order"
)
log_info("Processing", context)

File Structure

workers/python/
├── bin/
│   └── server.py              # Production server
├── python/
│   └── tasker_core/
│       ├── __init__.py        # Package exports
│       ├── handler.py         # Handler registry
│       ├── event_bridge.py    # Event pub/sub
│       ├── event_poller.py    # FFI polling
│       ├── logging.py         # Structured logging
│       ├── types.py           # Pydantic models
│       ├── step_handler/
│       │   ├── __init__.py
│       │   ├── base.py        # Base handler ABC
│       │   ├── api.py         # API handler
│       │   └── decision.py    # Decision handler
│       ├── batch_processing/
│       │   └── __init__.py    # Batchable mixin
│       └── step_execution_subscriber.py
├── src/                       # Rust/PyO3 extension
├── tests/
│   ├── test_step_handler.py
│   ├── test_module_exports.py
│   └── handlers/examples/
├── pyproject.toml
└── uv.lock

Testing

Unit Tests

cd workers/python
uv run pytest tests/

With Coverage

uv run pytest tests/ --cov=tasker_core

Type Checking

uv run mypy python/tasker_core/

Linting

uv run ruff check python/

Example Handlers

Linear Workflow

from datetime import datetime

class LinearStep1Handler(StepHandler):
    handler_name = "linear_step_1"

    def call(self, context: StepContext) -> StepHandlerResult:
        return self.success(result={
            "step1_processed": True,
            "input_received": context.input_data,
            "processed_at": datetime.now().isoformat()
        })

Data Processing

class TransformDataHandler(StepHandler):
    handler_name = "transform_data"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Get raw data from dependency
        raw_data = context.dependency_results.get("fetch_data", {})

        # Transform
        transformed = [
            {"id": item["id"], "value": item["raw_value"] * 2}
            for item in raw_data.get("items", [])
        ]

        return self.success(result={
            "items": transformed,
            "count": len(transformed)
        })

Conditional Approval

class ApprovalRouterHandler(DecisionHandler):
    handler_name = "approval_router"

    THRESHOLDS = {
        "auto": 1000,
        "manager": 5000
    }

    def call(self, context: StepContext) -> StepHandlerResult:
        amount = context.input_data.get("amount", 0)

        if amount < self.THRESHOLDS["auto"]:
            outcome = DecisionPointOutcome.create_steps(["auto_approve"])
        elif amount < self.THRESHOLDS["manager"]:
            outcome = DecisionPointOutcome.create_steps(["manager_approval"])
        else:
            outcome = DecisionPointOutcome.create_steps(
                ["manager_approval", "finance_review"]
            )

        return self.decision_success(outcome)

See Also

Ruby Worker

Last Updated: 2026-01-01 Audience: Ruby Developers Status: Active Package: tasker_core (gem) Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix <- Back to Worker Crates Overview


The Ruby worker provides a gem-based interface for integrating tasker-core workflow execution into Ruby applications. It supports both standalone server deployment and headless embedding in Rails applications.

Quick Start

Installation

cd workers/ruby
bundle install
bundle exec rake compile  # Compile FFI extension

Running the Server

./bin/server.rb

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| RUBY_GC_HEAP_GROWTH_FACTOR | GC tuning for production | Ruby default |

Architecture

Server Mode

Location: workers/ruby/bin/server.rb

The server bootstraps the Rust foundation and manages Ruby handler execution:

# Bootstrap the worker system
bootstrap = TaskerCore::Worker::Bootstrap.start!

# Signal handlers for graceful shutdown
Signal.trap('TERM') { shutdown_event.set }
Signal.trap('INT') { shutdown_event.set }

# Main loop with health checks
loop do
  break if shutdown_event.set?
  sleep(1)
end

# Graceful shutdown
bootstrap.shutdown!

Headless/Embedded Mode

For embedding in Rails applications without an HTTP server:

# config/initializers/tasker.rb
require 'tasker_core'

# Bootstrap worker (headless mode controlled via TOML: web.enabled = false)
TaskerCore::Worker::Bootstrap.start!

# Register application handlers
TaskerCore::Registry::HandlerRegistry.instance.register_handler(
  'ProcessOrderHandler',
  ProcessOrderHandler
)

FFI Bridge

Ruby communicates with the Rust foundation via FFI polling:

┌────────────────────────────────────────────────────────────────┐
│                    RUBY FFI BRIDGE                              │
└────────────────────────────────────────────────────────────────┘

   Rust Worker System
          │
          │ FFI (poll_step_events)
          ▼
   ┌─────────────┐
   │   Ruby      │
   │   Thread    │──→ poll every 10ms
   └─────────────┘
          │
          ▼
   ┌─────────────┐
   │  Handler    │
   │  Execution  │──→ handler.call(context)
   └─────────────┘
          │
          │ FFI (complete_step_event)
          ▼
   Rust Completion Channel

Handler Development

Base Handler

Location: lib/tasker_core/step_handler/base.rb

All handlers inherit from TaskerCore::StepHandler::Base:

class ProcessOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Access task context via cross-language standard methods
    order_id = context.get_task_field('order_id')
    amount = context.get_task_field('amount')

    # Business logic
    result = process_order(order_id, amount)

    # Return success result
    success(result: {
      order_id: order_id,
      status: 'processed',
      total: result[:total]
    })
  end
end

Handler Signature

def call(context)
  # context - StepContext with cross-language standard fields:
  #   context.task_uuid       - Task UUID
  #   context.step_uuid       - Step UUID
  #   context.input_data      - Step inputs from workflow_step.inputs
  #   context.step_config     - Handler config from step_definition
  #   context.retry_count     - Current retry attempt
  #   context.max_retries     - Maximum retry attempts
  #   context.get_task_field('field')       - Get field from task context
  #   context.get_dependency_result('step') - Get result from parent step
end

Result Methods

# Success result (keyword arguments required)
success(
  result: { key: 'value' },
  metadata: { duration_ms: 100 }
)

# Failure result
# error_type must be one of: 'PermanentError', 'RetryableError',
# 'ValidationError', 'UnexpectedError', 'StepCompletionError'
failure(
  message: 'Payment declined',
  error_type: 'PermanentError',   # Use enum value, not freeform string
  error_code: 'PAYMENT_DECLINED', # Optional freeform error code
  retryable: false,
  metadata: { card_last_four: '1234' }
)

Accessing Dependencies

def call(context)
  # Get result from a dependency step
  validation_result = context.get_dependency_result('validate_order')

  if validation_result && validation_result['valid']
    # Process with validated data
  end
end

Composition Pattern

Ruby handlers use composition via mixins rather than inheritance. You can use either:

  1. Wrapper classes (Api, Decision, Batchable) - simpler, backward compatible
  2. Mixin modules (Mixins::API, Mixins::Decision, Mixins::Batchable) - explicit composition

class MyHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::Mixins::API
  include TaskerCore::StepHandler::Mixins::Decision

  def call(context)
    # Has both API methods (get, post, put, delete)
    # And Decision methods (decision_success, decision_no_branches)
    response = get('/api/endpoint')
    decision_success(steps: ['next_step'], result_data: response)
  end
end

Available Mixins

| Mixin | Location | Methods Provided |
|-------|----------|------------------|
| Mixins::API | mixins/api.rb | get, post, put, delete, connection |
| Mixins::Decision | mixins/decision.rb | decision_success, decision_no_branches, skip_branches |
| Mixins::Batchable | mixins/batchable.rb | get_batch_context, handle_no_op_worker, create_cursor_configs |

Using Wrapper Classes (Backward Compatible)

The wrapper classes delegate to mixins internally:

# These are equivalent:
class MyHandler < TaskerCore::StepHandler::Api
  # Inherits API methods via Mixins::API
end

class MyHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::Mixins::API
  # Explicit mixin inclusion
end

Specialized Handlers

API Handler

Location: lib/tasker_core/step_handler/api.rb

For HTTP API integration with automatic error classification:

class FetchUserHandler < TaskerCore::StepHandler::Api
  def call(context)
    user_id = context.get_task_field('user_id')

    # Automatic error classification (429 → retryable, 404 → permanent)
    response = connection.get("/users/#{user_id}")
    process_response(response)  # Raises on errors, returns response on success

    # Return success result with response data
    success(result: response.body)
  end

  # Optional: Custom connection configuration
  def configure_connection
    Faraday.new(base_url) do |conn|
      conn.request :json
      conn.response :json
      conn.options.timeout = 30
    end
  end
end

HTTP Methods Available:

  • get(path, params: {}, headers: {})
  • post(path, data: {}, headers: {})
  • put(path, data: {}, headers: {})
  • delete(path, params: {}, headers: {})

Error Classification:

| Status | Classification | Behavior |
|--------|----------------|----------|
| 400, 401, 403, 404, 422 | Permanent | No retry |
| 429 | Retryable | Respect Retry-After |
| 500-599 | Retryable | Standard backoff |

Decision Handler

Location: lib/tasker_core/step_handler/decision.rb

For dynamic workflow routing:

class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
  def call(context)
    amount = context.get_task_field('amount')

    if amount < 1000
      # Auto-approve small amounts
      decision_success(
        steps: ['auto_approve'],
        result_data: { route_type: 'auto', amount: amount }
      )
    elsif amount < 5000
      # Manager approval for medium amounts
      decision_success(
        steps: ['manager_approval'],
        result_data: { route_type: 'manager', amount: amount }
      )
    else
      # Dual approval for large amounts
      decision_success(
        steps: ['manager_approval', 'finance_review'],
        result_data: { route_type: 'dual', amount: amount }
      )
    end
  end
end

Decision Methods:

  • decision_success(steps:, result_data: {}) - Create steps dynamically
  • decision_no_branches(result_data: {}) - Skip conditional steps

Batchable Handler

Location: lib/tasker_core/step_handler/batchable.rb

For processing large datasets in chunks:

Breaking Change: Cursors are now 0-indexed (previously 1-indexed) to match Python, TypeScript, and Rust.

class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
  def call(context)
    # Extract batch context from step inputs
    batch_ctx = get_batch_context(context)

    # Handle no-op placeholder batches
    no_op_result = handle_no_op_worker(batch_ctx)
    return no_op_result if no_op_result

    # Process this batch
    csv_file = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
    records = read_csv_batch(csv_file, batch_ctx.start_cursor, batch_ctx.batch_size)

    processed = records.map { |record| transform_record(record) }

    # Return batch completion
    batch_worker_complete(
      processed_count: processed.size,
      result_data: { records: processed }
    )
  end
end

Batch Helper Methods:

  • get_batch_context(context) - Get batch boundaries from StepContext
  • handle_no_op_worker(batch_ctx) - Handle placeholder batches
  • batch_worker_complete(processed_count:, result_data:) - Complete batch
  • create_cursor_configs(total_items, worker_count) - Create 0-indexed cursor ranges

Cursor Indexing:

# Creates 0-indexed cursor ranges
configs = create_cursor_configs(1000, 5)
# => [
#   { batch_id: '1', start_cursor: 0, end_cursor: 200 },
#   { batch_id: '2', start_cursor: 200, end_cursor: 400 },
#   { batch_id: '3', start_cursor: 400, end_cursor: 600 },
#   { batch_id: '4', start_cursor: 600, end_cursor: 800 },
#   { batch_id: '5', start_cursor: 800, end_cursor: 1000 }
# ]

Handler Registry

Registration

Location: lib/tasker_core/registry/handler_registry.rb

registry = TaskerCore::Registry::HandlerRegistry.instance

# Manual registration
registry.register_handler('ProcessOrderHandler', ProcessOrderHandler)

# Check availability
registry.handler_available?('ProcessOrderHandler')  # => true

# List all handlers
registry.registered_handlers  # => ["ProcessOrderHandler", ...]

Discovery Modes

  1. Preloaded Handlers (Test environment)

    • ObjectSpace scanning for loaded handler classes
  2. Template-Driven Discovery

    • YAML templates define handler references
    • Handlers loaded from configured paths

Handler Search Paths

app/handlers/
lib/handlers/
handlers/
app/tasker/handlers/
lib/tasker/handlers/
spec/handlers/examples/  (test environment)

Configuration

Bootstrap Configuration

Bootstrap configuration is controlled via TOML files, not Ruby parameters:

# config/tasker/base/worker.toml
[web]
enabled = true              # Set to false for headless/embedded mode
bind_address = "0.0.0.0"
port = 8080

# Ruby bootstrap is simple - config comes from TOML
TaskerCore::Worker::Bootstrap.start!

Handler Configuration

class MyHandler < TaskerCore::StepHandler::Base
  def initialize(config: {})
    super
    @timeout = config[:timeout] || 30
    @max_retries = config[:retries] || 3
  end

  def config_schema
    {
      type: 'object',
      properties: {
        timeout: { type: 'integer', minimum: 1, default: 30 },
        retries: { type: 'integer', minimum: 0, default: 3 }
      }
    }
  end
end

Signal Handling

The Ruby worker handles multiple signals:

| Signal | Behavior |
|--------|----------|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |
| SIGUSR1 | Report worker status |
| SIGUSR2 | Reload configuration |

# Status reporting
Signal.trap('USR1') do
  logger.info "Worker Status: #{bootstrap.status.inspect}"
end

# Configuration reload
Signal.trap('USR2') do
  bootstrap.reload_config
end

Error Handling

Exception Classes

TaskerCore::Errors::Error                  # Base class
├── TaskerCore::Errors::ConfigurationError # Configuration issues
├── TaskerCore::Errors::FFIError           # FFI bridge errors
├── TaskerCore::Errors::ProceduralError    # Base for workflow errors
│   ├── TaskerCore::Errors::RetryableError # Transient failures
│   ├── TaskerCore::Errors::PermanentError # Unrecoverable failures
│   │   ├── TaskerCore::Errors::ValidationError # Validation failures
│   │   └── TaskerCore::Errors::NotFoundError   # Resource not found
│   ├── TaskerCore::Errors::TimeoutError   # Timeout failures
│   └── TaskerCore::Errors::NetworkError   # Network failures
└── TaskerCore::Errors::ServerError        # Embedded server errors

Raising Errors

def call(context)
  # Retryable error (will be retried)
  raise TaskerCore::Errors::RetryableError.new(
    'Database connection timeout',
    retry_after: 5,
    context: { service: 'database' }
  )

  # Permanent error (no retry)
  raise TaskerCore::Errors::PermanentError.new(
    'Invalid order format',
    error_code: 'INVALID_ORDER',
    context: { field: 'order_id' }
  )

  # Validation error (permanent, with field info)
  raise TaskerCore::Errors::ValidationError.new(
    'Email format is invalid',
    field: 'email',
    error_code: 'INVALID_EMAIL'
  )
end

Logging

New code should use TaskerCore::Tracing for unified structured logging via FFI:

# Recommended: Use Tracing directly
TaskerCore::Tracing.info('Processing order', {
  order_id: order.id,
  amount: order.total,
  customer_id: order.customer_id
})

TaskerCore::Tracing.error('Payment failed', {
  error_code: 'DECLINED',
  card_last_four: '1234'
})

Legacy Logger (Deprecated)

Note: TaskerCore::Logger is maintained for backward compatibility but delegates to TaskerCore::Tracing. New code should use Tracing directly.

# Legacy (still works, but deprecated)
logger = TaskerCore::Logger.instance
logger.info('Processing order', {
  order_id: order.id,
  amount: order.total
})

Log Levels

Controlled via RUST_LOG environment variable:

  • trace - Very detailed debugging
  • debug - Debugging information
  • info - Normal operation
  • warn - Warning conditions
  • error - Error conditions

File Structure

workers/ruby/
├── bin/
│   ├── server.rb            # Production server
│   └── health_check.rb      # Health check script
├── ext/
│   └── tasker_core/
│       └── extconf.rb       # FFI extension config
├── lib/
│   └── tasker_core/
│       ├── errors.rb        # Exception classes
│       ├── handlers.rb      # Handler namespace
│       ├── internal.rb      # Internal modules
│       ├── logger.rb        # Logging
│       ├── models.rb        # Type models
│       ├── registry/
│       │   ├── handler_registry.rb
│       │   └── step_handler_resolver.rb
│       ├── step_handler/
│       │   ├── base.rb      # Base handler
│       │   ├── api.rb       # API handler
│       │   ├── decision.rb  # Decision handler
│       │   └── batchable.rb # Batch handler
│       ├── task_handler/
│       │   └── base.rb      # Task orchestration
│       ├── types/           # Type definitions
│       └── version.rb
├── spec/
│   ├── handlers/examples/   # Example handlers
│   └── integration/         # Integration tests
├── Gemfile
└── tasker_core.gemspec

Testing

Unit Tests

cd workers/ruby
bundle exec rspec spec/

Integration Tests

DATABASE_URL=postgresql://... bundle exec rspec spec/integration/

E2E Tests (from project root)

DATABASE_URL=postgresql://... \
TASKER_ENV=test \
bundle exec rspec spec/handlers/

Example Handlers

Linear Workflow

# spec/handlers/examples/linear_workflow/step_handlers/linear_step_1_handler.rb
module LinearWorkflow
  module StepHandlers
    class LinearStep1Handler < TaskerCore::StepHandler::Base
      def call(context)
        input = context.context  # Full task context
        success(result: {
          step1_processed: true,
          input_received: input,
          processed_at: Time.now.iso8601
        })
      end
    end
  end
end

Order Fulfillment

class ValidateOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    order = context.context  # Full task context

    unless order['items']&.any?
      return failure(
        message: 'Order must have at least one item',
        error_type: 'ValidationError',
        error_code: 'EMPTY_ORDER',
        retryable: false
      )
    end

    success(result: {
      valid: true,
      item_count: order['items'].size,
      total: calculate_total(order['items'])
    })
  end
end

Conditional Approval

class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
  THRESHOLDS = {
    auto_approve: 1000,
    manager_only: 5000
  }.freeze

  def call(context)
    amount = context.get_task_field('amount').to_f

    if amount < THRESHOLDS[:auto_approve]
      decision_success(steps: ['auto_approve'])
    elsif amount < THRESHOLDS[:manager_only]
      decision_success(steps: ['manager_approval'])
    else
      decision_success(steps: ['manager_approval', 'finance_review'])
    end
  end
end

See Also

Rust Worker

Last Updated: 2026-01-01 Audience: Rust Developers Status: Active Package: workers-rust Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix <- Back to Worker Crates Overview


The Rust worker is the native, high-performance implementation for workflow step execution. It demonstrates the full capability of the tasker-worker foundation with zero FFI overhead.

Quick Start

Running the Server

cd workers/rust
cargo run

With Custom Configuration

TASKER_CONFIG_PATH=/path/to/config.toml cargo run

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| RUST_LOG | Log level | info |

Architecture

Entry Point

Location: workers/rust/src/main.rs

#[tokio::main]
async fn main() -> Result<()> {
    // Initialize structured logging
    tasker_shared::logging::init_tracing();

    // Bootstrap worker system
    let mut bootstrap_result = bootstrap().await?;

    // Start event handler (legacy path); move only the event handler into the
    // spawned task so the rest of the bootstrap result stays available for shutdown
    let mut event_handler = bootstrap_result.event_handler;
    tokio::spawn(async move {
        event_handler.start().await
    });

    // Wait for shutdown signal
    tokio::select! {
        _ = tokio::signal::ctrl_c() => { /* shutdown */ }
        _ = wait_for_sigterm() => { /* shutdown */ }
    }

    bootstrap_result.worker_handle.stop()?;
    Ok(())
}

Bootstrap Process

Location: workers/rust/src/bootstrap.rs

The bootstrap process:

  1. Creates step handler registry with all handlers
  2. Sets up global event system
  3. Bootstraps tasker-worker foundation
  4. Creates domain event publisher registry
  5. Spawns HandlerDispatchService for non-blocking dispatch
  6. Creates event handler for legacy path

#![allow(unused)]
fn main() {
pub async fn bootstrap() -> Result<RustWorkerBootstrapResult> {
    // Create registry with all handlers
    let registry = Arc::new(RustStepHandlerRegistry::new());

    // Bootstrap worker foundation
    let worker_handle = WorkerBootstrap::bootstrap_with_event_system(...).await?;

    // Set up dispatch service (non-blocking path)
    let dispatch_service = HandlerDispatchService::with_callback(...);

    Ok(RustWorkerBootstrapResult {
        worker_handle,
        event_handler,
        dispatch_service_handle,
    })
}
}

Handler Dispatch

The Rust worker uses the HandlerDispatchService for non-blocking handler execution:

┌────────────────────────────────────────────────────────────────┐
│                    RUST HANDLER DISPATCH                        │
└────────────────────────────────────────────────────────────────┘

   PGMQ Queue
        │
        ▼
  ┌─────────────┐
  │  Dispatch   │
  │  Channel    │
  └─────────────┘
        │
        ▼
  ┌─────────────────────────────────────────┐
  │       HandlerDispatchService            │
  │  ┌────────────────────────────────────┐ │
  │  │  Semaphore (10 permits)            │ │
  │  │       │                            │ │
  │  │       ▼                            │ │
  │  │  handler.call(&step_data).await    │ │
  │  │       │                            │ │
  │  │       ▼                            │ │
  │  │  DomainEventCallback               │ │
  │  └────────────────────────────────────┘ │
  └─────────────────────────────────────────┘
        │
        ▼
  ┌─────────────┐
  │ Completion  │
  │  Channel    │
  └─────────────┘
        │
        ▼
   Orchestration

Handler Development

Capability Traits

Rust uses traits for handler composition, matching the mixin pattern in Ruby/Python/TypeScript.

Location: tasker-worker/src/handler_capabilities.rs

APICapable Trait

For HTTP API integration:

#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::APICapable;

impl APICapable for MyHandler {
    // Use the helper methods:
    // - api_success(step_uuid, data, status, headers, execution_time_ms)
    // - api_failure(step_uuid, message, status, error_type, execution_time_ms)
    // - classify_status_code(status) -> ErrorClassification
}
}

DecisionCapable Trait

For dynamic workflow routing:

#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::DecisionCapable;

impl DecisionCapable for MyHandler {
    // Use the helper methods:
    // - decision_success(step_uuid, step_names, routing_context, execution_time_ms)
    // - skip_branches(step_uuid, reason, routing_context, execution_time_ms)
    // - decision_failure(step_uuid, message, error_type, execution_time_ms)
}
}

BatchableCapable Trait

For batch processing:

#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::BatchableCapable;

impl BatchableCapable for MyHandler {
    // Use the helper methods:
    // - create_cursor_configs(total_items, worker_count) -> Vec<CursorConfig>
    // - create_cursor_ranges(total_items, batch_size, max_batches) -> Vec<CursorConfig>
    // - batch_analyzer_success(step_uuid, worker_template, configs, total_items, ...)
    // - batch_worker_success(step_uuid, processed, succeeded, failed, skipped, ...)
    // - no_batches_outcome(step_uuid, reason, execution_time_ms)
    // - batch_failure(step_uuid, message, error_type, retryable, ...)
}
}

Composing Multiple Traits

#![allow(unused)]
fn main() {
// Implement multiple capability traits for a single handler
pub struct CompositeHandler {
    config: StepHandlerConfig,
}

impl APICapable for CompositeHandler {}
impl DecisionCapable for CompositeHandler {}

#[async_trait]
impl RustStepHandler for CompositeHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Use API capability to fetch data
        let response = self.call_api("/users/123").await?;

        // Use Decision capability to route based on response
        if response.status == 200 {
            self.decision_success(step_uuid, vec!["process_user"], None, 50)
        } else {
            self.api_failure(step_uuid, "API failed", response.status, "api_error", 50)
        }
    }
}
}

Handler Trait

Location: workers/rust/src/step_handlers/mod.rs

All Rust handlers implement the RustStepHandler trait:

#![allow(unused)]
fn main() {
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;

#[async_trait]
pub trait RustStepHandler: Send + Sync {
    /// Handler name for registration
    fn name(&self) -> &str;

    /// Execute the handler
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult>;

    /// Create a new instance with configuration from YAML
    fn new(config: StepHandlerConfig) -> Self where Self: Sized;
}
}

Creating a Handler

#![allow(unused)]
fn main() {
use async_trait::async_trait;
use anyhow::Result;
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;
use crate::step_handlers::{RustStepHandler, StepHandlerConfig, success_result};
use serde_json::json;

pub struct ProcessOrderHandler {
    _config: StepHandlerConfig,
}

#[async_trait]
impl RustStepHandler for ProcessOrderHandler {
    fn name(&self) -> &str {
        "process_order"
    }

    fn new(config: StepHandlerConfig) -> Self {
        Self { _config: config }
    }

    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Extract input from task context
        let order_id = step_data.task.context
            .get("order_id")
            .and_then(|v| v.as_str())
            .ok_or_else(|| anyhow::anyhow!("Missing order_id"))?;

        // Process the order
        let result = process_order(order_id).await?;

        // Return success using helper function
        Ok(success_result(
            step_uuid,
            json!({
                "order_id": order_id,
                "status": "processed",
                "total": result.total
            }),
            start_time.elapsed().as_millis() as i64,
            None,
        ))
    }
}
}

Handler Registration

Location: workers/rust/src/step_handlers/registry.rs

Handlers are registered in the RustStepHandlerRegistry:

#![allow(unused)]
fn main() {
pub struct RustStepHandlerRegistry {
    handlers: HashMap<String, Arc<dyn RustStepHandler>>,
}

impl RustStepHandlerRegistry {
    pub fn new() -> Self {
        let mut registry = Self {
            handlers: HashMap::new(),
        };

        registry.register_all_handlers();
        registry
    }

    fn register_all_handlers(&mut self) {
        let empty_config = StepHandlerConfig::empty();

        // Linear workflow handlers
        self.register_handler(Arc::new(LinearStep1Handler::new(empty_config.clone())));
        self.register_handler(Arc::new(LinearStep2Handler::new(empty_config.clone())));

        // Order fulfillment handlers
        self.register_handler(Arc::new(ValidateOrderHandler::new(empty_config.clone())));
        self.register_handler(Arc::new(ProcessPaymentHandler::new(empty_config.clone())));

        // ... more handlers
    }

    fn register_handler(&mut self, handler: Arc<dyn RustStepHandler>) {
        let name = handler.name().to_string();
        self.handlers.insert(name, handler);
    }

    pub fn get_handler(&self, name: &str) -> Result<Arc<dyn RustStepHandler>, RustStepHandlerError> {
        self.handlers
            .get(name)
            .cloned()
            .ok_or_else(|| RustStepHandlerError::SystemError {
                message: format!("Handler '{}' not found in registry", name),
            })
    }
}
}

Example Handlers

Linear Workflow

Location: workers/rust/src/step_handlers/linear_workflow.rs

Simple sequential workflow with 4 steps:

#![allow(unused)]
fn main() {
pub struct LinearStep1Handler;

#[async_trait]
impl RustStepHandler for LinearStep1Handler {
    fn name(&self) -> &str {
        "linear_step_1"
    }

    async fn call(&self, step_data: &StepExecutionData) -> Result<StepHandlerResult> {
        info!("LinearStep1Handler: Processing step");

        let input = step_data.input_data.clone();
        let mut result = serde_json::Map::new();
        result.insert("step1_processed".to_string(), json!(true));
        result.insert("input_received".to_string(), input);

        Ok(StepHandlerResult::success(json!(result)))
    }
}
}

Diamond Workflow

Location: workers/rust/src/step_handlers/diamond_workflow.rs

Parallel branching with convergence:

    ┌─────┐
    │Start│
    └──┬──┘
       │
  ┌────┴────┐
  ▼         ▼
┌───┐     ┌───┐
│ B │     │ C │
└─┬─┘     └─┬─┘
  │         │
  └────┬────┘
       ▼
    ┌─────┐
    │ End │
    └─────┘

Batch Processing

Location: workers/rust/src/step_handlers/batch_processing_products_csv.rs

Three-phase batch processing:

  1. Analyzer: Counts total records
  2. Batch Processor: Processes chunks
  3. Aggregator: Combines results

#![allow(unused)]
fn main() {
pub struct CsvBatchProcessorHandler;

#[async_trait]
impl RustStepHandler for CsvBatchProcessorHandler {
    fn name(&self) -> &str {
        "csv_batch_processor"
    }

    async fn call(&self, step_data: &StepExecutionData) -> Result<StepHandlerResult> {
        let batch_size = step_data.step_inputs
            .get("batch_size")
            .and_then(|v| v.as_u64())
            .unwrap_or(100) as usize;

        let start_cursor = step_data.step_inputs
            .get("start_cursor")
            .and_then(|v| v.as_u64())
            .unwrap_or(0) as usize;

        // Process records in batch
        let processed = process_batch(start_cursor, batch_size).await?;

        Ok(StepHandlerResult::success(json!({
            "processed_count": processed,
            "batch_complete": true
        })))
    }
}
}

Error Injection (Testing)

Location: workers/rust/src/step_handlers/error_injection/

Handlers for testing retry behavior:

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicU32, Ordering};

pub struct FailNTimesHandler {
    fail_count: u32,
    attempts: AtomicU32,
}

impl FailNTimesHandler {
    /// Create handler that fails N times before succeeding
    pub fn new(fail_count: u32) -> Self {
        Self { fail_count, attempts: AtomicU32::new(0) }
    }
}

#[async_trait]
impl RustStepHandler for FailNTimesHandler {
    async fn call(&self, _step_data: &StepExecutionData) -> Result<StepHandlerResult> {
        let attempt = self.attempts.fetch_add(1, Ordering::SeqCst);

        if attempt < self.fail_count {
            Ok(StepHandlerResult::failure(
                "Intentional failure for testing",
                "test_error",
                true, // retryable
            ))
        } else {
            Ok(StepHandlerResult::success(json!({"attempts": attempt + 1})))
        }
    }
}
}

Domain Events

Post-Execution Publishing

Handlers can publish domain events after step execution using the StepEventPublisher trait:

#![allow(unused)]
fn main() {
use async_trait::async_trait;
use serde_json::json;
use std::sync::Arc;
use tasker_shared::events::domain_events::DomainEventPublisher;
use tasker_worker::worker::step_event_publisher::{
    StepEventPublisher, StepEventContext, PublishResult
};

#[derive(Debug)]
pub struct PaymentEventPublisher {
    domain_publisher: Arc<DomainEventPublisher>,
}

impl PaymentEventPublisher {
    pub fn new(domain_publisher: Arc<DomainEventPublisher>) -> Self {
        Self { domain_publisher }
    }
}

#[async_trait]
impl StepEventPublisher for PaymentEventPublisher {
    fn name(&self) -> &str {
        "PaymentEventPublisher"
    }

    fn domain_publisher(&self) -> &Arc<DomainEventPublisher> {
        &self.domain_publisher
    }

    async fn publish(&self, ctx: &StepEventContext) -> PublishResult {
        let mut result = PublishResult::default();

        if ctx.step_succeeded() {
            let payload = json!({
                "order_id": ctx.execution_result.result["order_id"],
                "amount": ctx.execution_result.result["amount"],
            });

            // Uses default impl from trait
            match self.publish_event(ctx, "payment.completed", payload).await {
                Ok(event_id) => result.published.push(event_id),
                Err(e) => result.errors.push(e.to_string()),
            }
        }

        result
    }
}
}

Dual-Path Delivery

Events can route to different delivery paths:

| Path | Description | Use Case |
|------|-------------|----------|
| durable | Published to PGMQ | External consumers, audit |
| fast | In-process bus | Metrics, telemetry |

Configuration

Bootstrap Configuration

#![allow(unused)]
fn main() {
pub struct WorkerBootstrapConfig {
    pub worker_id: String,
    pub enable_web_api: bool,
    pub event_driven_enabled: bool,
    pub deployment_mode_hint: Option<String>,
}

// Example configuration
let config = WorkerBootstrapConfig {
    worker_id: "rust-worker-001".to_string(),
    enable_web_api: true,
    event_driven_enabled: true,
    deployment_mode_hint: Some("Hybrid".to_string()),
    ..Default::default()
};
}

Dispatch Configuration

#![allow(unused)]
fn main() {
let config = HandlerDispatchConfig {
    max_concurrent_handlers: 10,
    handler_timeout: Duration::from_secs(30),
    service_id: "rust-handler-dispatch".to_string(),
    load_shedding: LoadSheddingConfig::default(),
};
}

Signal Handling

The Rust worker handles graceful shutdown:

#![allow(unused)]
fn main() {
// Wait for shutdown signal
tokio::select! {
    _ = tokio::signal::ctrl_c() => {
        info!("Received Ctrl+C, initiating graceful shutdown...");
    }
    result = wait_for_sigterm() => {
        info!("Received SIGTERM, initiating graceful shutdown...");
    }
}

// Graceful shutdown sequence
bootstrap_result.worker_handle.stop()?;
}

Performance

Characteristics

  • Zero FFI Overhead: Native Rust handlers
  • Async/Await: Non-blocking I/O with Tokio
  • Bounded Concurrency: Semaphore-limited parallelism
  • Memory Safety: Rust’s ownership model

Benchmarking

# Run with release optimizations
cargo run --release

# With performance profiling
RUST_LOG=trace cargo run --release

File Structure

workers/rust/
├── src/
│   ├── main.rs                  # Entry point
│   ├── bootstrap.rs             # Worker initialization
│   ├── lib.rs                   # Library exports
│   ├── event_handler.rs         # Event bridging (legacy)
│   ├── global_event_system.rs   # Global event coordination
│   ├── step_handlers/
│   │   ├── mod.rs               # Handler traits and types
│   │   ├── registry.rs          # Handler registry
│   │   ├── linear_workflow.rs   # Linear workflow handlers
│   │   ├── diamond_workflow.rs  # Diamond workflow handlers
│   │   ├── tree_workflow.rs     # Tree workflow handlers
│   │   ├── mixed_dag_workflow.rs
│   │   ├── order_fulfillment.rs
│   │   ├── batch_processing_*.rs
│   │   ├── error_injection/     # Test handlers
│   │   └── domain_event_*.rs    # Event publishing
│   └── event_subscribers/
│       ├── mod.rs
│       ├── logging_subscriber.rs
│       └── metrics_subscriber.rs
├── Cargo.toml
└── tests/

Testing

Unit Tests

cargo test -p workers-rust

Integration Tests

# With database
DATABASE_URL=postgresql://... cargo test -p workers-rust --test integration

E2E Tests

# From project root
DATABASE_URL=postgresql://... cargo nextest run --package workers-rust

See Also

TypeScript Worker

Last Updated: 2026-01-01 Audience: TypeScript/JavaScript Developers Status: Active Package: @tasker-systems/tasker Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix <- Back to Worker Crates Overview


The TypeScript worker provides a multi-runtime interface for integrating tasker-core workflow execution into TypeScript/JavaScript applications. It supports Bun, Node.js, and Deno runtimes with unified FFI bindings to the Rust worker foundation.

Quick Start

Installation

cd workers/typescript
bun install                     # Install dependencies
cargo build --release -p tasker-ts  # Build FFI library

Running the Server

# With Bun (recommended for production)
bun run bin/server.ts

# With Node.js
npx tsx bin/server.ts

# With Deno
deno run --allow-ffi --allow-env --allow-net bin/server.ts

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| TASKER_FFI_LIBRARY_PATH | Path to libtasker_ts | Auto-detected |
| RUST_LOG | Log level (trace/debug/info/warn/error) | info |
| PORT | HTTP server port | 8081 |

Architecture

Server Mode

Location: workers/typescript/bin/server.ts

The server bootstraps the Rust foundation and manages TypeScript handler execution:

import { createRuntime } from '../src/ffi/index.js';
import { EventEmitter } from '../src/events/event-emitter.js';
import { EventPoller } from '../src/events/event-poller.js';
import { HandlerRegistry } from '../src/handler/registry.js';
import { StepExecutionSubscriber } from '../src/subscriber/step-execution-subscriber.js';

// Create runtime for current environment (Bun/Node/Deno)
const runtime = createRuntime();
await runtime.load(libraryPath);

// Bootstrap Rust worker foundation
const result = runtime.bootstrapWorker({ namespace: 'my-app' });

// Create event system
const emitter = new EventEmitter();
const registry = new HandlerRegistry();

// Register handlers
registry.register('process_order', ProcessOrderHandler);

// Create step execution subscriber
const subscriber = new StepExecutionSubscriber(
  emitter,
  registry,
  runtime,
  { workerId: 'typescript-worker-001' }
);
subscriber.start();

// Start event poller (10ms polling)
const poller = new EventPoller(runtime, emitter, {
  pollingIntervalMs: 10
});
poller.start();

// Wait for shutdown signal
await shutdownSignal;

// Graceful shutdown
poller.stop();
await subscriber.waitForCompletion();
runtime.stopWorker();

Headless/Embedded Mode

For embedding in existing TypeScript applications:

import { createRuntime } from '@tasker-systems/tasker';
import { EventEmitter, EventPoller, HandlerRegistry, StepExecutionSubscriber } from '@tasker-systems/tasker';

// Bootstrap worker (headless mode via TOML: web.enabled = false)
const runtime = createRuntime();
await runtime.load('/path/to/libtasker_ts.dylib');
runtime.bootstrapWorker({ namespace: 'my-app' });

// Register handlers
const registry = new HandlerRegistry();
registry.register('process_data', ProcessDataHandler);

// Start event system
const emitter = new EventEmitter();
const subscriber = new StepExecutionSubscriber(emitter, registry, runtime, {});
subscriber.start();

const poller = new EventPoller(runtime, emitter);
poller.start();

FFI Bridge

TypeScript communicates with the Rust foundation via FFI polling:

┌────────────────────────────────────────────────────────────────┐
│                  TYPESCRIPT FFI BRIDGE                          │
└────────────────────────────────────────────────────────────────┘

   Rust Worker System
          │
          │ FFI (pollStepEvents)
          ▼
   ┌─────────────────────┐
   │    EventPoller      │
   │  (setInterval)      │──→ poll every 10ms
   └─────────────────────┘
          │
          │ emit to EventEmitter
          ▼
   ┌─────────────────────┐
   │ StepExecution       │
   │ Subscriber          │──→ route to handler
   └─────────────────────┘
          │
          │ handler.call(context)
          ▼
   ┌─────────────────────┐
   │  Handler Execution  │
   └─────────────────────┘
          │
          │ FFI (completeStepEvent)
          ▼
   Rust Completion Channel

Multi-Runtime Support

| Runtime | FFI Library | Status |
|---------|-------------|--------|
| Bun | koffi | Production |
| Node.js | koffi | Production |
| Deno | Deno.dlopen | Production |
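
All three runtimes go through the same two calls shown in the Quick Start. The sketch below shows one way to resolve the library path before loading; the fallback path and the Node-style process global are assumptions for illustration, not package behavior:

import { createRuntime } from '@tasker-systems/tasker';

// Hedged sketch: resolve the FFI library path before loading.
// TASKER_FFI_LIBRARY_PATH comes from the environment table above; the
// fallback filename and the Node-style `process` global are assumptions.
const ext = process.platform === 'darwin' ? 'dylib' : 'so';
const libraryPath =
  process.env.TASKER_FFI_LIBRARY_PATH ?? `./target/release/libtasker_ts.${ext}`;

const runtime = createRuntime(); // picks the Bun, Node.js, or Deno adapter
await runtime.load(libraryPath);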

Handler Development

Base Handler

Location: workers/typescript/src/handler/base.ts

All handlers extend StepHandler:

import { StepHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';

export class ProcessOrderHandler extends StepHandler {
  static handlerName = 'process_order';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Access input data
    const orderId = context.getInput<string>('order_id');
    const amount = context.getInput<number>('amount');

    // Business logic
    const result = await this.processOrder(orderId, amount);

    // Return success
    return this.success({
      order_id: orderId,
      status: 'processed',
      total: result.total
    });
  }

  private async processOrder(orderId: string, amount: number) {
    // Implementation
    return { total: amount * 1.1 };
  }
}

Handler Signature

async call(context: StepContext): Promise<StepHandlerResult>

// StepContext provides:
context.taskUuid          // Task identifier
context.stepUuid          // Step identifier
context.stepInputs        // Runtime inputs
context.stepConfig        // Handler configuration
context.dependencyResults // Results from parent steps
context.taskContext       // Full task context
context.retryCount        // Current retry attempt

// Type-safe accessors:
context.getInput<T>(key)              // Get single input
context.getDependencyResult(stepName) // Get dependency result
context.getAllDependencyResults(name) // Get all instances (batch workers)

Result Methods

// Success result (from base class)
return this.success(
  { key: 'value' },           // result
  { duration_ms: 100 }        // metadata (optional)
);

// Failure result (from base class)
return this.failure(
  'Payment declined',         // message
  'payment_error',            // errorType
  true,                       // retryable
  { card_last_four: '1234' }  // metadata (optional)
);

Error Types

import { ErrorType } from '@tasker-systems/tasker';

ErrorType.PERMANENT_ERROR   // Non-retryable failures
ErrorType.RETRYABLE_ERROR   // Retryable failures
ErrorType.VALIDATION_ERROR  // Input validation failures
ErrorType.HANDLER_ERROR     // Handler execution failures

Accessing Dependencies

async call(context: StepContext): Promise<StepHandlerResult> {
  // Get result from a dependency step
  const validation = context.getDependencyResult('validate_order') as {
    valid: boolean;
    amount: number;
  } | null;

  if (!validation) {
    return this.failure('Missing validation result', 'dependency_error', false);
  }

  if (validation.valid) {
    return this.success({ processed: true, amount: validation.amount });
  }

  return this.failure('Validation failed', 'validation_error', false);
}

Specialized Handlers

Mixin Pattern

TypeScript uses composition via mixins rather than inheritance. You can use either:

  1. Wrapper classes (ApiHandler, DecisionHandler) - simpler, backward compatible
  2. Mixin functions (applyAPI, applyDecision) - explicit composition

import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, APICapable } from '@tasker-systems/tasker';

// Using mixin pattern (recommended for new code)
class MyHandler extends StepHandler implements APICapable {
  constructor() {
    super();
    applyAPI(this);  // Adds get/post/put/delete methods
  }

  async call(context: StepContext): Promise<StepHandlerResult> {
    const response = await this.get('/api/data');
    return this.apiSuccess(response);
  }
}

// Or using wrapper class (simpler, backward compatible)
import { ApiHandler } from '@tasker-systems/tasker';

class MyHandler extends ApiHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    const response = await this.get('/api/data');
    return this.apiSuccess(response);
  }
}

API Handler

Location: workers/typescript/src/handler/api.ts

For HTTP API integration with automatic error classification:

import { ApiHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';

export class FetchUserHandler extends ApiHandler {
  static handlerName = 'fetch_user';
  static handlerVersion = '1.0.0';

  protected baseUrl = 'https://api.example.com';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const userId = context.getInput<string>('user_id');

    // Automatic error classification
    const response = await this.get(`/users/${userId}`);

    if (!response.ok) {
      return this.apiFailure(response);
    }

    return this.apiSuccess(response);
  }
}

HTTP Methods:

// GET request
const response = await this.get('/path', {
  params: { key: 'value' },
  headers: { 'Authorization': 'Bearer token' }
});

// POST request
const response = await this.post('/path', {
  body: { key: 'value' },
  headers: {}
});

// PUT request
const response = await this.put('/path', { body: { key: 'value' } });

// DELETE request
const response = await this.delete('/path', { params: {} });

ApiResponse Properties:

response.statusCode      // HTTP status code
response.headers         // Response headers
response.body            // Parsed body (object or string)
response.ok              // True if 2xx status
response.isClientError   // True if 4xx status
response.isServerError   // True if 5xx status
response.isRetryable     // True if should retry (408, 429, 500-504)
response.retryAfter      // Retry-After header value in seconds

Error Classification:

| Status | Classification | Behavior |
|--------|----------------|----------|
| 400, 401, 403, 404, 422 | Non-retryable | Permanent failure |
| 408, 429, 500-504 | Retryable | Standard retry |
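
apiFailure(response) applies this classification for you. If you need to build the failure yourself, a minimal sketch using only the documented ApiResponse properties might look like this (the error-type strings are illustrative, not a fixed contract):

// Hedged sketch: mapping the classification table onto a failure result by hand.
// apiFailure(response) is the normal path; the error-type strings are illustrative.
if (!response.ok) {
  return this.failure(
    `Upstream request failed with status ${response.statusCode}`,
    response.isServerError ? 'server_error' : 'client_error',
    response.isRetryable,                       // true only for 408, 429, 500-504
    { retry_after_seconds: response.retryAfter }
  );
}
return this.apiSuccess(response);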

Decision Handler

Location: workers/typescript/src/handler/decision.ts

For dynamic workflow routing:

import { DecisionHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';

export class RoutingDecisionHandler extends DecisionHandler {
  static handlerName = 'routing_decision';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const amount = context.getInput<number>('amount') ?? 0;

    if (amount < 1000) {
      // Auto-approve small amounts
      return this.decisionSuccess(['auto_approve'], {
        route_type: 'auto',
        amount
      });
    } else if (amount < 5000) {
      // Manager approval for medium amounts
      return this.decisionSuccess(['manager_approval'], {
        route_type: 'manager',
        amount
      });
    } else {
      // Dual approval for large amounts
      return this.decisionSuccess(['manager_approval', 'finance_review'], {
        route_type: 'dual',
        amount
      });
    }
  }
}

Decision Methods:

// Activate specific steps
return this.decisionSuccess(
  ['step1', 'step2'],           // steps to activate
  { route_reason: 'threshold' } // routing context
);

// No branches needed
return this.decisionNoBranches('condition not met');

BatchableStepHandler

Location: workers/typescript/src/handler/batchable.ts

For processing large datasets in chunks. Cross-language aligned with Ruby and Python implementations.

Analyzer Handler (creates batch configurations):

import { BatchableStepHandler } from '@tasker-systems/tasker';
import type { StepContext, BatchableResult } from '@tasker-systems/tasker';

export class CsvAnalyzerHandler extends BatchableStepHandler {
  static handlerName = 'csv_analyzer';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<BatchableResult> {
    const csvPath = context.getInput<string>('csv_path');
    const rowCount = await this.countCsvRows(csvPath);

    if (rowCount === 0) {
      // No data to process - use cross-language standard
      return this.noBatchesResult('empty_dataset', {
        csv_path: csvPath,
        analyzed_at: new Date().toISOString()
      });
    }

    // Create cursor configs using Ruby-style helper
    // Divides rowCount into 5 roughly equal batches
    const batchConfigs = this.createCursorConfigs(rowCount, 5);

    return this.batchSuccess('process_csv_batch', batchConfigs, {
      csv_path: csvPath,
      total_rows: rowCount,
      analyzed_at: new Date().toISOString()
    });
  }
}

Worker Handler (processes a batch):

export class CsvBatchProcessorHandler extends BatchableStepHandler {
  static handlerName = 'csv_batch_processor';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Cross-language standard: check for no-op worker first
    const noOpResult = this.handleNoOpWorker(context);
    if (noOpResult) {
      return noOpResult;
    }

    // Get batch worker inputs from Rust
    const batchInputs = this.getBatchWorkerInputs(context);
    const cursor = batchInputs?.cursor;

    if (!cursor) {
      return this.failure('Missing batch cursor', 'batch_error', false);
    }

    // Process the batch
    const results = await this.processCsvBatch(
      cursor.start_cursor,
      cursor.end_cursor
    );

    return this.success({
      batch_id: cursor.batch_id,
      rows_processed: results.count,
      items_succeeded: results.success,
      items_failed: results.failed
    });
  }
}

Aggregator Handler (combines results):

export class CsvAggregatorHandler extends StepHandler {
  static handlerName = 'csv_aggregator';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Get all batch worker results
    const workerResults = context.getAllDependencyResults('process_csv_batch') as Array<{
      rows_processed: number;
      items_succeeded: number;
      items_failed: number;
    } | null>;

    // Aggregate results
    let totalProcessed = 0;
    let totalSucceeded = 0;
    let totalFailed = 0;

    for (const result of workerResults) {
      if (result) {
        totalProcessed += result.rows_processed ?? 0;
        totalSucceeded += result.items_succeeded ?? 0;
        totalFailed += result.items_failed ?? 0;
      }
    }

    return this.success({
      total_processed: totalProcessed,
      total_succeeded: totalSucceeded,
      total_failed: totalFailed,
      worker_count: workerResults.length
    });
  }
}

BatchableStepHandler Methods (Cross-Language Aligned):

| Method | Ruby Equivalent | Purpose |
|--------|-----------------|---------|
| batchSuccess(template, configs, metadata) | batch_success | Create batch workers |
| noBatchesResult(reason, metadata) | no_batches_outcome | Empty dataset handling |
| createCursorConfigs(total, workers) | create_cursor_configs | Divide work by worker count |
| handleNoOpWorker(context) | handle_no_op_worker | Detect no-op placeholders |
| getBatchWorkerInputs(context) | get_batch_context | Access Rust batch inputs |
| aggregateWorkerResults(results) | aggregate_batch_worker_results | Static aggregation helper |
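
For reference, the static aggregation helper listed above could stand in for the manual loop in CsvAggregatorHandler. The call shape follows the table, but the returned summary shape is an assumption and should be checked against the package types:

// Hedged sketch: delegating aggregation to the static helper from the table above.
// The shape of `summary` is assumed to mirror the manual totals computed earlier.
const workerResults = context.getAllDependencyResults('process_csv_batch');
const summary = BatchableStepHandler.aggregateWorkerResults(workerResults);
return this.success(summary);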

Handler Registry

Registration

Location: workers/typescript/src/handler/registry.ts

import { HandlerRegistry } from '@tasker-systems/tasker';

const registry = new HandlerRegistry();

// Manual registration
registry.register('process_order', ProcessOrderHandler);

// Check if registered
registry.isRegistered('process_order'); // true

// Resolve and instantiate
const handler = registry.resolve('process_order');
if (handler) {
  const result = await handler.call(context);
}

// List all handlers
registry.listHandlers(); // ['process_order', ...]

// Handler count
registry.handlerCount(); // 1

Bulk Registration

import { registerExampleHandlers } from './handlers/examples/index.js';

// Register multiple handlers at once
registerExampleHandlers(registry);
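
A bulk-registration helper is just a function that accepts the registry and registers each handler in turn. A minimal sketch, with hypothetical handler modules:

import { HandlerRegistry } from '@tasker-systems/tasker';
// Hypothetical handler modules, for illustration only.
import { ProcessOrderHandler } from './process-order.js';
import { FetchUserHandler } from './fetch-user.js';

// Register a group of related handlers in one call.
export function registerOrderHandlers(registry: HandlerRegistry): void {
  registry.register('process_order', ProcessOrderHandler);
  registry.register('fetch_user', FetchUserHandler);
}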

Type System

Core Types

import type {
  StepContext,
  StepHandlerResult,
  BatchableResult,
  FfiStepEvent,
  BootstrapConfig,
  WorkerStatus,
} from '@tasker-systems/tasker';

// StepContext - created from FFI event
const context = StepContext.fromFfiEvent(event, 'handler_name');
context.taskUuid;      // string
context.stepUuid;      // string
context.stepInputs;    // Record<string, unknown>
context.retryCount;    // number

// StepHandlerResult - handler output
result.success;        // boolean
result.result;         // Record<string, unknown>
result.errorMessage;   // string | undefined
result.retryable;      // boolean

Configuration Types

import type { BootstrapConfig } from '@tasker-systems/tasker';

const config: BootstrapConfig = {
  namespace: 'my-app',
  environment: 'production',
  configPath: '/path/to/config.toml'
};

Event System

EventEmitter

Location: workers/typescript/src/events/event-emitter.ts

import { EventEmitter } from '@tasker-systems/tasker';
import { StepEventNames } from '@tasker-systems/tasker';

const emitter = new EventEmitter();

// Subscribe to events
emitter.on(StepEventNames.STEP_EXECUTION_RECEIVED, (payload) => {
  console.log(`Processing step: ${payload.event.step_uuid}`);
});

emitter.on(StepEventNames.STEP_EXECUTION_COMPLETED, (payload) => {
  console.log(`Step completed: ${payload.stepUuid}`);
});

// Emit events
emitter.emit(StepEventNames.STEP_EXECUTION_RECEIVED, {
  event: ffiStepEvent
});

Event Names

import { StepEventNames } from '@tasker-systems/tasker';

StepEventNames.STEP_EXECUTION_RECEIVED  // Step event received from FFI
StepEventNames.STEP_EXECUTION_STARTED   // Handler execution started
StepEventNames.STEP_EXECUTION_COMPLETED // Handler execution completed
StepEventNames.STEP_EXECUTION_FAILED    // Handler execution failed
StepEventNames.STEP_COMPLETION_SENT     // Result sent to FFI

EventPoller

Location: workers/typescript/src/events/event-poller.ts

import { EventPoller } from '@tasker-systems/tasker';

const poller = new EventPoller(runtime, emitter, {
  pollingIntervalMs: 10,        // Poll every 10ms
  starvationCheckInterval: 100, // Check every 1 second
  cleanupInterval: 1000         // Cleanup every 10 seconds
});

// Start polling
poller.start();

// Get metrics
const metrics = poller.getMetrics();
console.log(`Pending: ${metrics.pendingCount}`);

// Stop polling
poller.stop();

Domain Events

TypeScript has full domain event support, matching Ruby and Python capabilities. The domain events module provides BasePublisher, BaseSubscriber, and registries for custom event handling.

Location: workers/typescript/src/handler/domain-events.ts

BasePublisher

Publishers transform step execution context into domain-specific events:

import { BasePublisher, StepEventContext, DomainEvent } from '@tasker-systems/tasker';

export class PaymentEventPublisher extends BasePublisher {
  static publisherName = 'payment_events';

  // Required: which steps trigger this publisher
  publishesFor(): string[] {
    return ['process_payment', 'refund_payment'];
  }

  // Transform step context into domain event
  async transformPayload(ctx: StepEventContext): Promise<Record<string, unknown>> {
    return {
      payment_id: ctx.result?.payment_id,
      amount: ctx.result?.amount,
      currency: ctx.result?.currency,
      status: ctx.result?.status
    };
  }

  // Lifecycle hooks (optional)
  async beforePublish(ctx: StepEventContext): Promise<void> {
    console.log(`Publishing payment event for step: ${ctx.stepName}`);
  }

  async afterPublish(ctx: StepEventContext, event: DomainEvent): Promise<void> {
    console.log(`Published event: ${event.eventName}`);
  }

  async onPublishError(ctx: StepEventContext, error: Error): Promise<void> {
    console.error(`Failed to publish: ${error.message}`);
  }

  // Inject custom metadata
  async additionalMetadata(ctx: StepEventContext): Promise<Record<string, unknown>> {
    return { payment_processor: 'stripe' };
  }
}

BaseSubscriber

Subscribers react to domain events matching specific patterns:

import { BaseSubscriber, InProcessDomainEvent, SubscriberResult } from '@tasker-systems/tasker';

export class AuditLoggingSubscriber extends BaseSubscriber {
  static subscriberName = 'audit_logger';

  // Which events to handle (glob patterns supported)
  subscribesTo(): string[] {
    return ['payment.*', 'order.completed'];
  }

  // Handle matching events
  async handle(event: InProcessDomainEvent): Promise<SubscriberResult> {
    await this.logToAuditTrail(event);
    return { success: true };
  }

  // Lifecycle hooks (optional)
  async beforeHandle(event: InProcessDomainEvent): Promise<void> {
    console.log(`Handling: ${event.eventName}`);
  }

  async afterHandle(event: InProcessDomainEvent, result: SubscriberResult): Promise<void> {
    console.log(`Handled successfully: ${result.success}`);
  }

  async onHandleError(event: InProcessDomainEvent, error: Error): Promise<void> {
    console.error(`Handler error: ${error.message}`);
  }
}

Registries

Manage publishers and subscribers with singleton registries:

import { PublisherRegistry, SubscriberRegistry } from '@tasker-systems/tasker';

// Publisher Registry
const pubRegistry = PublisherRegistry.getInstance();
pubRegistry.register(PaymentEventPublisher);
pubRegistry.register(OrderEventPublisher);
pubRegistry.freeze(); // Prevent further registrations

// Get publisher for a step
const publisher = pubRegistry.getForStep('process_payment');

// Subscriber Registry
const subRegistry = SubscriberRegistry.getInstance();
subRegistry.register(AuditLoggingSubscriber);
subRegistry.register(MetricsSubscriber);

// Start all subscribers
subRegistry.startAll();

// Stop all subscribers
subRegistry.stopAll();

FFI Integration

Domain events integrate with the Rust FFI layer for cross-language event flow:

import { createFfiPollAdapter, InProcessDomainEventPoller } from '@tasker-systems/tasker';

// Create poller connected to Rust broadcast channel
const poller = new InProcessDomainEventPoller();

// Set the FFI poll function
poller.setPollFunction(createFfiPollAdapter(runtime));

// Start polling for events
poller.start((event) => {
  // Route to appropriate subscriber
  const subscribers = subRegistry.getMatchingSubscribers(event.eventName);
  for (const sub of subscribers) {
    sub.handle(event);
  }
});

Signal Handling

The TypeScript worker handles signals for graceful shutdown:

| Signal | Behavior |
|--------|----------|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |

import { ShutdownController } from '@tasker-systems/tasker';

const shutdown = new ShutdownController();

// Register signal handlers
shutdown.registerSignalHandlers();

// Wait for shutdown signal
await shutdown.waitForShutdown();

// Or check if shutdown requested
if (shutdown.isShutdownRequested()) {
  // Begin cleanup
}

Error Handling

Using Failure Results

async call(context: StepContext): Promise<StepHandlerResult> {
  try {
    const result = await this.processData(context);
    return this.success(result);
  } catch (error) {
    if (error instanceof NetworkError) {
      // Retryable error
      return this.failure(
        error.message,
        ErrorType.RETRYABLE_ERROR,
        true,
        { endpoint: error.endpoint }
      );
    }

    // Non-retryable error
    return this.failure(
      error instanceof Error ? error.message : 'Unknown error',
      ErrorType.HANDLER_ERROR,
      false
    );
  }
}

Logging

Structured Logging

import { logInfo, logError, logWarn, logDebug } from '@tasker-systems/tasker';

// Simple logging
logInfo('Processing started', { component: 'handler' });
logError('Failed to connect', { component: 'database' });

// With additional context
logInfo('Order processed', {
  component: 'handler',
  order_id: '123',
  amount: '100.00'
});

Pino Integration

The worker uses pino for structured logging:

import pino from 'pino';

const logger = pino({
  name: 'my-handler',
  level: process.env.RUST_LOG ?? 'info'
});

logger.info({ orderId: '123' }, 'Processing order');

File Structure

workers/typescript/
├── bin/
│   └── server.ts               # Production server
├── src/
│   ├── index.ts                # Package exports
│   ├── bootstrap/
│   │   └── bootstrap.ts        # Worker initialization
│   ├── events/
│   │   ├── event-emitter.ts    # Event pub/sub
│   │   ├── event-poller.ts     # FFI polling
│   │   └── event-system.ts     # Combined event system
│   ├── ffi/
│   │   ├── bun-runtime.ts      # Bun FFI adapter
│   │   ├── node-runtime.ts     # Node.js FFI adapter
│   │   ├── deno-runtime.ts     # Deno FFI adapter
│   │   ├── runtime-interface.ts # Common interface
│   │   └── types.ts            # FFI types
│   ├── handler/
│   │   ├── base.ts             # Base handler class
│   │   ├── api.ts              # API handler
│   │   ├── decision.ts         # Decision handler
│   │   ├── batchable.ts        # Batchable handler
│   │   ├── domain-events.ts    # Domain events module
│   │   ├── registry.ts         # Handler registry
│   │   └── mixins/             # Mixin modules
│   │       ├── index.ts        # Mixin exports
│   │       ├── api.ts          # APIMixin, applyAPI
│   │       └── decision.ts     # DecisionMixin, applyDecision
│   ├── server/
│   │   ├── worker-server.ts    # Server implementation
│   │   └── types.ts            # Server types
│   ├── subscriber/
│   │   └── step-execution-subscriber.ts
│   └── types/
│       ├── step-context.ts     # Step context
│       └── step-handler-result.ts
├── tests/
│   ├── unit/                   # Unit tests
│   ├── integration/            # Integration tests
│   └── handlers/examples/      # Example handlers
├── src-rust/                   # Rust FFI extension
├── package.json
├── tsconfig.json
└── biome.json                  # Linting config

Testing

Unit Tests

cd workers/typescript
bun test                        # Run all tests
bun test tests/unit/            # Run unit tests only

Integration Tests

bun test tests/integration/     # Run integration tests

With Coverage

bun test --coverage

Linting

bun run check                   # Biome lint + format check
bun run check:fix               # Auto-fix issues

Type Checking

bunx tsc --noEmit               # Type check without emit

Example Handlers

Linear Workflow

export class DoubleHandler extends StepHandler {
  static handlerName = 'double_value';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const value = context.getInput<number>('value') ?? 0;
    return this.success({
      result: value * 2,
      operation: 'double'
    });
  }
}

export class AddHandler extends StepHandler {
  static handlerName = 'add_constant';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const prev = context.getDependencyResult('double_value') as { result: number } | null;
    const value = prev?.result ?? 0;
    return this.success({
      result: value + 10,
      operation: 'add'
    });
  }
}

Diamond Workflow (Parallel Branches)

export class DiamondStartHandler extends StepHandler {
  static handlerName = 'diamond_start';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const input = context.getInput<number>('value') ?? 0;
    return this.success({ squared: input * input });
  }
}

export class BranchBHandler extends StepHandler {
  static handlerName = 'branch_b';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const start = context.getDependencyResult('diamond_start') as { squared: number };
    return this.success({ result: start.squared + 25 });
  }
}

export class BranchCHandler extends StepHandler {
  static handlerName = 'branch_c';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const start = context.getDependencyResult('diamond_start') as { squared: number };
    return this.success({ result: start.squared * 2 });
  }
}

export class DiamondEndHandler extends StepHandler {
  static handlerName = 'diamond_end';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const branchB = context.getDependencyResult('branch_b') as { result: number };
    const branchC = context.getDependencyResult('branch_c') as { result: number };
    return this.success({
      final: (branchB.result + branchC.result) / 2
    });
  }
}

Error Handling

export class RetryableErrorHandler extends StepHandler {
  static handlerName = 'retryable_error';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Simulate a retryable error (e.g., network timeout)
    return this.failure(
      'Connection timeout - will be retried',
      ErrorType.RETRYABLE_ERROR,
      true,
      { attempt: context.retryCount }
    );
  }
}

export class PermanentErrorHandler extends StepHandler {
  static handlerName = 'permanent_error';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Simulate a permanent error (e.g., validation failure)
    return this.failure(
      'Invalid input - no retry allowed',
      ErrorType.PERMANENT_ERROR,
      false
    );
  }
}

Docker Deployment

Dockerfile

FROM oven/bun:1.1.38 AS runtime

WORKDIR /app

# Copy built artifacts
COPY workers/typescript/dist/ ./dist/
COPY workers/typescript/package.json ./
COPY target/release/libtasker_ts.so ./lib/

# Install production dependencies
RUN bun install --production

# Set environment
ENV TASKER_FFI_LIBRARY_PATH=/app/lib/libtasker_ts.so
ENV PORT=8081

EXPOSE 8081

CMD ["bun", "run", "dist/bin/server.js"]

Docker Compose

typescript-worker:
  build:
    context: .
    dockerfile: docker/build/typescript-worker.Dockerfile
  environment:
    DATABASE_URL: postgresql://tasker:tasker@postgres:5432/tasker
    TASKER_ENV: production
    TASKER_TEMPLATE_PATH: /app/templates
    PORT: 8081
  ports:
    - "8084:8081"
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
    interval: 10s
    timeout: 5s
    retries: 3

See Also

Observability Documentation

Last Updated: 2025-12-01 Audience: Operators, Developers Status: Active Related Docs: Documentation Hub | Benchmarks | Deployment Patterns | Domain Events

← Back to Documentation Hub


This directory contains documentation for monitoring, metrics, logging, and performance measurement in tasker-core.


Quick Navigation

📊 Performance & Benchmarking: ../benchmarks/

All benchmark documentation has been consolidated in the docs/benchmarks/ directory.

See: Benchmark README for:

  • API performance benchmarks
  • SQL function benchmarks
  • Event propagation benchmarks
  • End-to-end latency benchmarks
  • Benchmark quick reference
  • Performance targets and CI integration

Migration Note: The following files remain in this directory for historical context but are superseded by the consolidated benchmarks documentation:

  • benchmark-implementation-decision.md - Decision rationale (archived)
  • benchmark-quick-reference.md - Superseded by ../benchmarks/README.md
  • benchmark-strategy-summary.md - Consolidated into benchmark-specific docs
  • benchmarking-guide.md - SQL benchmarks moved to ../benchmarks/sql-benchmarks.md
  • phase-5.4-distributed-benchmarks-plan.md - Implementation complete

Observability Categories

1. Metrics (metrics-*.md)

Purpose: System health, performance counters, and operational metrics

Documentation:

Key Metrics Tracked:

  • Task lifecycle events (created, started, completed, failed)
  • Step execution metrics (claimed, executed, retried)
  • Database operation performance (query times, cache hit rates)
  • Worker health (active workers, queue depths, claim rates)
  • System resource usage (memory, connections, threads)

Export Targets:

  • OpenTelemetry (planned)
  • Prometheus (supported)
  • CloudWatch (planned)
  • Datadog (planned)

Quick Reference:

// Example: Recording a metric
metrics::counter!("tasker.tasks.created").increment(1);
metrics::histogram!("tasker.step.execution_time_ms").record(elapsed_ms);
metrics::gauge!("tasker.workers.active").set(worker_count as f64);

2. Logging (logging-standards.md)

Purpose: Structured logging for debugging, audit trails, and operational visibility

Documentation:

Log Levels:

  • ERROR: Critical failures requiring immediate attention
  • WARN: Degraded operation or retry scenarios
  • INFO: Significant lifecycle events and state transitions
  • DEBUG: Detailed execution flow for troubleshooting
  • TRACE: Exhaustive detail for deep debugging

Structured Fields:

info!(
    task_uuid = %task_uuid,
    correlation_id = %correlation_id,
    step_name = %step_name,
    elapsed_ms = elapsed.as_millis(),
    "Step execution completed successfully"
);

Key Standards:

  • Use structured logging (not string interpolation)
  • Include correlation IDs for distributed tracing
  • Log state transitions at INFO level
  • Include timing information for performance analysis
  • Sanitize sensitive data (credentials, PII)

3. Tracing and OpenTelemetry

Purpose: Distributed request tracing across services

Status: ✅ Active

Documentation:

Current Features:

  • Distributed trace propagation via correlation IDs (UUIDv7)
  • Span creation for major operations:
    • API request handling
    • Step execution (claim → execute → submit)
    • Orchestration coordination
    • Domain event publishing
    • Message queue operations
  • Two-phase FFI telemetry initialization (safe for Ruby/Python workers)
  • Integration with Grafana LGTM stack (Prometheus, Tempo)
  • Domain event metrics (/metrics/events endpoint)

Two-Phase FFI Initialization:

  • Phase 1: Console-only logging (safe during FFI bridge setup)
  • Phase 2: Full OpenTelemetry (after FFI established)

Example:

#[tracing::instrument(
    name = "publish_domain_event",
    skip(self, payload),
    fields(
        event_name = %event_name,
        namespace = %metadata.namespace,
        correlation_id = %metadata.correlation_id,
        delivery_mode = ?delivery_mode
    )
)]
async fn publish_event(&self, event_name: &str, ...) -> Result<()> {
    // Implementation
}

4. Health Checks

Purpose: Service health monitoring for orchestration, availability, and alerting

Endpoints:

  • GET /health - Overall service health
  • GET /health/ready - Readiness for traffic (K8s readiness probe)
  • GET /health/live - Liveness check (K8s liveness probe)

Health Indicators:

  • Database connection pool status
  • Message queue connectivity
  • Worker availability
  • Circuit breaker states
  • Resource utilization (memory, connections)

Response Format:

{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "connections_active": 5,
      "connections_idle": 15,
      "connections_max": 20
    },
    "message_queue": {
      "status": "healthy",
      "queues_monitored": 3
    },
    "circuit_breakers": {
      "status": "healthy",
      "open_breakers": 0
    }
  },
  "uptime_seconds": 3600
}
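
A minimal sketch of consuming these endpoints from a probe script, assuming the API is reachable on localhost:8080 as in the tracing examples later in this document (the TASKER_API_URL override is hypothetical):

// Hedged sketch: simple health probe; host, port, and TASKER_API_URL are assumptions.
const base = process.env.TASKER_API_URL ?? 'http://localhost:8080';

const res = await fetch(`${base}/health`);
const health = (await res.json()) as { status: string; uptime_seconds: number };

if (health.status !== 'healthy') {
  console.error(`Service unhealthy after ${health.uptime_seconds}s uptime`);
  process.exit(1);
}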

Observability Architecture

Component-Level Instrumentation

┌──────────────────────────────────────────────────────────┐
│                   Observability Stack                    │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Metrics  │  │   Logs   │  │  Traces  │  │  Health  │  │
│  │ Counters │  │Structured│  │   OTel   │  │  Checks  │  │
│  │Histograms│  │   JSON   │  │  Spans   │  │   HTTP   │  │
│  │  Gauges  │  │  Fields  │  │   Tags   │  │  Probes  │  │
│  └─────┬────┘  └─────┬────┘  └─────┬────┘  └─────┬────┘  │
│        │             │             │             │       │
└────────┼─────────────┼─────────────┼─────────────┼───────┘
         │             │             │             │
         ▼             ▼             ▼             ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
  │Prometheus │ │  Loki /   │ │  Jaeger / │ │    K8s    │
  │   OTLP    │ │CloudWatch │ │   Tempo   │ │  Probes   │
  └───────────┘ └───────────┘ └───────────┘ └───────────┘

Instrumentation Points

Orchestration:

  • Task lifecycle transitions
  • Step discovery and enqueueing
  • Result processing
  • Finalization operations
  • Database query performance

Worker:

  • Step claiming
  • Handler execution
  • Result submission
  • FFI call overhead (Ruby workers)
  • Event propagation latency

Database:

  • Query execution times
  • Connection pool metrics
  • Transaction commit latency
  • Buffer cache hit ratio

Message Queue:

  • Message send/receive latency
  • Queue depth
  • Notification propagation time
  • Message processing errors

Performance Monitoring

Key Performance Indicators (KPIs)

| Metric | Target | Alert Threshold | Notes |
|--------|--------|-----------------|-------|
| API Response Time (p99) | < 100ms | > 200ms | User-facing latency |
| SQL Function Time (mean) | < 3ms | > 5ms | Orchestration efficiency |
| Event Propagation (p95) | < 10ms | > 20ms | Real-time coordination |
| E2E Task Completion (p99) | < 500ms | > 1000ms | End-user experience |
| Worker Claim Success Rate | > 95% | < 90% | Resource contention |
| Database Connection Pool | < 80% | > 90% | Resource exhaustion |

Monitoring Dashboards

Recommended Dashboard Panels:

  1. Task Throughput

    • Tasks created/min
    • Tasks completed/min
    • Tasks failed/min
    • Active tasks count
  2. Step Execution

    • Steps enqueued/min
    • Steps completed/min
    • Average step execution time
    • Step retry rate
  3. System Health

    • Worker health status
    • Database connection pool utilization
    • Circuit breaker status
    • API response times (p50, p95, p99)
  4. Error Rates

    • Task failures by namespace
    • Step failures by handler
    • Database errors
    • Message queue errors

Correlation and Debugging

Correlation ID Propagation

Every request generates a UUIDv7 correlation ID that flows through:

  1. API request → Task creation
  2. Task → Step enqueueing
  3. Step → Worker execution
  4. Worker → Result submission
  5. Result → Orchestration processing

Tracing a Request:

# Find correlation ID from task creation
curl http://localhost:8080/v1/tasks/{task_uuid} | jq .correlation_id

# Search logs across all services
docker logs orchestration 2>&1 | grep {correlation_id}
docker logs worker 2>&1 | grep {correlation_id}

# Query database for full timeline
psql $DATABASE_URL -c "
  SELECT
    created_at,
    from_state,
    to_state,
    metadata->>'duration_ms' as duration
  FROM tasker.task_transitions
  WHERE metadata->>'correlation_id' = '{correlation_id}'
  ORDER BY created_at;
"

Debug Logging

Enable debug logging for detailed execution flow:

# Docker Compose
RUST_LOG=debug docker-compose up

# Local development
RUST_LOG=tasker_worker=debug,tasker_orchestration=debug cargo run

# Specific modules
RUST_LOG=tasker_worker::worker::command_processor=trace cargo test

Best Practices

1. Structured Logging

Do:

info!(
    task_uuid = %task.uuid,
    namespace = %task.namespace,
    elapsed_ms = elapsed.as_millis(),
    "Task completed successfully"
);

Don’t:

info!("Task {} in namespace {} completed in {}ms",
    task.uuid, task.namespace, elapsed.as_millis());

2. Metric Naming

Use consistent, hierarchical naming:

metrics::counter!("tasker.tasks.created").increment(1);
metrics::counter!("tasker.tasks.completed").increment(1);
metrics::counter!("tasker.tasks.failed").increment(1);
metrics::histogram!("tasker.step.execution_time_ms").record(elapsed);

3. Performance Measurement

Measure at operation boundaries:

let start = Instant::now();
let result = operation().await?;
let elapsed = start.elapsed();

metrics::histogram!("tasker.operation.duration_ms")
    .record(elapsed.as_millis() as f64);

info!(
    operation = "operation_name",
    elapsed_ms = elapsed.as_millis(),
    success = result.is_ok(),
    "Operation completed"
);

4. Error Context

Include rich context in errors:

error!(
    task_uuid = %task_uuid,
    step_uuid = %step_uuid,
    error = %err,
    retry_count = retry_count,
    "Step execution failed, will retry"
);

Tools and Integration

Development Tools

Metrics Visualization:

# Prometheus (if configured)
open http://localhost:9090

# Grafana (if configured)
open http://localhost:3000

Log Aggregation:

# Docker Compose logs
docker-compose -f docker/docker-compose.test.yml logs -f

# Specific service
docker-compose -f docker/docker-compose.test.yml logs -f orchestration

# JSON parsing
docker-compose logs orchestration | jq 'select(.level == "ERROR")'

Production Tools (Planned)

  • Metrics: Prometheus + Grafana / DataDog / CloudWatch
  • Logs: Loki / CloudWatch Logs / Splunk
  • Traces: Jaeger / Tempo / Honeycomb
  • Alerts: AlertManager / PagerDuty / Opsgenie


File Organization

Current Files

Active:

  • metrics-reference.md - Complete metrics catalog
  • metrics-verification.md - Verification procedures
  • logging-standards.md - Logging best practices
  • opentelemetry-improvements.md - Telemetry enhancements
  • VERIFICATION_RESULTS.md - Test results

Archived (superseded by docs/benchmarks/):

  • benchmark-implementation-decision.md
  • benchmark-quick-reference.md
  • benchmark-strategy-summary.md
  • benchmarking-guide.md
  • phase-5.4-distributed-benchmarks-plan.md

Move benchmark files to docs/archive/ or delete:

# Option 1: Archive
mkdir -p docs/archive/benchmarks
mv docs/observability/benchmark-*.md docs/archive/benchmarks/
mv docs/observability/phase-5.4-*.md docs/archive/benchmarks/

# Option 2: Delete (information consolidated)
rm docs/observability/benchmark-*.md
rm docs/observability/phase-5.4-*.md

Contributing

When adding observability instrumentation:

  1. Follow standards: Use structured logging and consistent metric naming
  2. Include context: Add correlation IDs and relevant metadata
  3. Document metrics: Update metrics-reference.md with new metrics
  4. Test instrumentation: Verify metrics and logs in development
  5. Consider performance: Avoid expensive operations in hot paths

Benchmark Audit & Profiling Plan

Created: 2025-10-09 Status: 📋 Planning Purpose: Audit existing benchmarks, establish profiling tooling, baseline before Actor/Services refactor


Executive Summary

Before refactoring tasker-orchestration/src/orchestration/lifecycle/ to Actor/Services pattern, we need to:

  1. Audit Benchmarks: Review which benchmarks are implemented vs placeholders
  2. Clean Up: Remove or complete placeholder benchmarks
  3. Establish Profiling: Set up flamegraph/samply tooling
  4. Baseline Profiles: Capture performance profiles for comparison post-refactor

Current Status: We have working SQL and E2E benchmarks but several placeholder component benchmarks that need decisions.


Benchmark Inventory

✅ Working & Complete Benchmarks

1. SQL Function Benchmarks

  • Location: tasker-shared/benches/sql_functions.rs
  • Status: ✅ Complete, Compiles, Well-documented
  • Coverage:
    • get_next_ready_tasks() (4 batch sizes)
    • get_step_readiness_status() (5 diverse samples)
    • transition_task_state_atomic() (5 samples)
    • get_task_execution_context() (5 samples)
    • get_step_transitive_dependencies() (10 samples)
  • Documentation: docs/observability/benchmarking-guide.md
  • Run Command:
    cargo bench --package tasker-shared --features benchmarks
    

2. Event Propagation Benchmarks

  • Location: tasker-shared/benches/event_propagation.rs
  • Status: ✅ Complete, Compiles
  • Coverage: PostgreSQL LISTEN/NOTIFY event propagation
  • Run Command:
    cargo bench --package tasker-shared --features benchmarks event_propagation
    

3. Task Initialization Benchmarks

  • Location: tasker-client/benches/task_initialization.rs
  • Status: ✅ Complete, Compiles
  • Coverage: API task creation latency
  • Run Command:
    export SQLX_OFFLINE=true
    cargo bench --package tasker-client --features benchmarks task_initialization
    

4. End-to-End Workflow Latency Benchmarks

  • Location: tests/benches/e2e_latency.rs
  • Status: ✅ Complete, Compiles
  • Coverage: Complete workflow execution (API → Result)
    • Linear workflow (Ruby FFI)
    • Diamond workflow (Ruby FFI)
    • Linear workflow (Rust native)
    • Diamond workflow (Rust native)
  • Prerequisites: Docker Compose services running
  • Run Command:
    export SQLX_OFFLINE=true
    cargo bench --bench e2e_latency
    

⚠️ Placeholder Benchmarks (Need Decision)

5. Orchestration Benchmarks

  • Location: tasker-orchestration/benches/
  • Files:
    • orchestration_benchmarks.rs - Empty placeholder
    • step_enqueueing.rs - Placeholder with documentation
  • Status: Not implemented
  • Documented Intent: Measure orchestration coordination latency
  • Challenges:
    • Requires triggering orchestration cycle without full execution
    • Need step discovery measurement isolation
    • Queue publishing and notification overhead breakdown

6. Worker Benchmarks

  • Location: tasker-worker/benches/
  • Files:
    • worker_benchmarks.rs - Empty placeholder
    • worker_execution.rs - Placeholder with documentation
    • handler_overhead.rs - Placeholder with documentation
  • Status: Not implemented
  • Documented Intent:
    • Worker processing cycle (claim, execute, submit)
    • Framework overhead vs pure handler execution
    • Ruby FFI overhead measurement
  • Challenges:
    • Need pre-enqueued steps in test queues
    • Noop handler implementations for baseline
    • Breakdown metrics for each phase

Recommendations

Option 1: Keep Placeholders, Document Status (Recommended)

Rationale:

  • Phase 5.4 distributed benchmarks are documented but complex to implement
  • E2E benchmarks (e2e_latency.rs) already provide full workflow metrics
  • SQL benchmarks provide component-level detail
  • Actor/Services refactor is more urgent than distributed component benchmarks

Action:

  • Keep placeholder files with clear “NOT IMPLEMENTED” status
  • Update comments to reference this audit document
  • Future ticket (post-refactor) can implement if needed

Option 2: Remove Placeholders

Rationale:

  • Reduce confusion about benchmark status
  • E2E benchmarks already cover end-to-end latency
  • SQL benchmarks cover database hot paths

Action:

  • Delete placeholder bench files
  • Document decision in this file
  • Can recreate later if specific component isolation needed

Option 3: Implement Placeholders Now

Rationale:

  • Complete benchmark suite before refactor
  • Better baseline data for Actor/Services comparison

Concerns:

  • 2-3 days implementation effort
  • Delays Actor/Services refactor
  • May need re-implementation post-refactor anyway

Decision: Option 1 (Keep Placeholders, Document Status)

We have sufficient benchmarking coverage:

  1. ✅ SQL functions (hot path queries)
  2. ✅ E2E workflows (user-facing latency)
  3. ✅ Event propagation (LISTEN/NOTIFY)
  4. ✅ Task initialization (API latency)

What’s Missing:

  • Component-level orchestration breakdown (not critical for refactor)
  • Worker cycle breakdown (available via OpenTelemetry traces)
  • Framework overhead measurement (nice-to-have, not blocking)

Action Items:

  1. Update placeholder comments with “Status: Planned for future implementation”
  2. Reference this document for implementation guidance
  3. Move forward with profiling and refactor

Profiling Tooling Setup

Goals

  1. Identify Inefficiencies: Find hot spots in lifecycle code
  2. Establish Baseline: Profile before Actor/Services refactor
  3. Compare Post-Refactor: Validate performance impact of refactor
  4. Continuous Profiling: Enable ongoing performance analysis

Tool Selection

Primary: samply (macOS-friendly)

  • GitHub: https://github.com/mstange/samply
  • Advantages:
    • Native macOS support (uses Instruments)
    • Interactive web UI for flamegraphs
    • Low overhead
    • Works with release builds
  • Use Case: Development profiling on macOS

Secondary: flamegraph (CI/production)

  • GitHub: https://github.com/flamegraph-rs/flamegraph
  • Advantages:
    • Linux support (perf-based)
    • SVG output for CI artifacts
    • Well-established in Rust ecosystem
  • Use Case: CI profiling, Linux production analysis

Tertiary: cargo-flamegraph (Convenience)

  • Cargo Plugin: Wraps flamegraph-rs
  • Advantages:
    • Single command profiling
    • Automatic symbol resolution
  • Use Case: Quick local profiling

Installation

macOS Setup (samply)

# Install samply
cargo install samply

# macOS requires SIP adjustment for sampling (one-time setup)
# https://github.com/mstange/samply#macos-permissions

# Verify installation
samply --version

Linux Setup (flamegraph)

# Install prerequisites (Ubuntu/Debian)
sudo apt-get install linux-tools-common linux-tools-generic

# Install flamegraph
cargo install flamegraph

# Allow perf without sudo (optional)
echo 'kernel.perf_event_paranoid=-1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Verify installation
flamegraph --version

Cross-Platform (cargo-flamegraph)

# Install cargo-flamegraph
cargo install cargo-flamegraph

# Verify installation
cargo flamegraph --version

Profiling Workflows

1. Profile E2E Benchmarks

Captures the entire workflow execution including the orchestration lifecycle:

# macOS
samply record cargo bench --bench e2e_latency -- --profile-time=60

# Linux
cargo flamegraph --bench e2e_latency -- --profile-time=60

# Output: Interactive flamegraph showing hot paths

What to Look For:

  • Time spent in lifecycle/ modules (task_initializer, step_enqueuer, result_processor, etc.)
  • Database query time vs business logic time
  • Serialization/deserialization overhead
  • Lock contention (should be minimal with our architecture)

2. Profile SQL Benchmarks

Isolates database performance:

# Profile just SQL function benchmarks
samply record cargo bench --package tasker-shared --features benchmarks sql_functions

# Output: Shows PostgreSQL function overhead

What to Look For:

  • Time in sqlx query execution
  • Connection pool overhead
  • Query planning time (shouldn’t be visible if using prepared statements)

3. Profile Integration Tests (Realistic Workload)

Profile actual test execution for realistic patterns:

# Profile a specific integration test
samply record cargo test --test e2e_tests e2e::rust::simple_integration_tests::test_linear_workflow

# Profile all integration tests (longer run)
samply record cargo test --test e2e_tests --all-features

What to Look For:

  • Initialization overhead
  • Test setup time vs actual execution time
  • Repeated patterns across tests

4. Profile Specific Lifecycle Components

Isolate specific modules for deep analysis:

# Example: Profile only result processing
samply record cargo test --package tasker-orchestration --test lifecycle_integration_tests \
  test_result_processing_updates_task_state --all-features -- --nocapture

# Or profile a unit test for a specific function
samply record cargo test --package tasker-orchestration \
  result_processor::tests::test_process_step_result_success --all-features

Baseline Profiling Plan

Phase 1: Capture Pre-Refactor Baselines (Day 1)

Goal: Establish performance baseline of current lifecycle code before Actor/Services refactor

# 1. Clean build
cargo clean
cargo build --release --all-features

# 2. Profile E2E benchmarks (primary baseline)
samply record --output=baseline-e2e-pre-refactor.json \
  cargo bench --bench e2e_latency

# 3. Profile SQL benchmarks
samply record --output=baseline-sql-pre-refactor.json \
  cargo bench --package tasker-shared --features benchmarks

# 4. Profile specific lifecycle operations
samply record --output=baseline-task-init-pre-refactor.json \
  cargo test --package tasker-orchestration \
  lifecycle::task_initializer::tests --all-features

samply record --output=baseline-step-enqueue-pre-refactor.json \
  cargo test --package tasker-orchestration \
  lifecycle::step_enqueuer::tests --all-features

samply record --output=baseline-result-processor-pre-refactor.json \
  cargo test --package tasker-orchestration \
  lifecycle::result_processor::tests --all-features

Deliverables (completed, profiles removed — superseded by cluster benchmarks):

  • Baseline profile files in profiles/pre-refactor/ (removed)
  • Performance baselines now in docs/benchmarks/README.md

Phase 2: Identify Optimization Opportunities (Day 1)

Goal: Document current performance characteristics to preserve in refactor

Analysis Checklist:

  1. ✅ Time spent in each lifecycle module (task_initializer, step_enqueuer, etc.)
  2. ✅ Database query time breakdown
  3. ✅ Serialization overhead (JSON, MessagePack)
  4. ✅ Lock contention points (if any)
  5. ✅ Unnecessary allocations or clones
  6. ✅ Recursive call depth

Document Findings: Performance baselines are now documented in docs/benchmarks/README.md. The original lifecycle-performance-baseline.md was removed — its measurements had data quality issues and the refactor it targeted is complete.

Phase 3: Post-Refactor Validation (After Refactor)

Goal: Validate Actor/Services refactor maintains or improves performance

# Re-run same profiling commands after refactor
samply record --output=baseline-e2e-post-refactor.json \
  cargo bench --bench e2e_latency

# Compare baselines
# (samply doesn't have built-in diff, use manual comparison)

Success Criteria:

  • E2E latency: Within 10% of baseline (preferably faster)
  • SQL latency: Unchanged (no regression from refactor)
  • Lifecycle operation time: Within 20% of baseline
  • No new hot paths or contention points

Regression Signals:

  • E2E latency >20% slower
  • New allocations/clones in hot paths
  • Increased lock contention
  • Message passing overhead >5% of total time

Profiling Best Practices

1. Use Release Builds

# Always profile release builds (--release flag)
cargo build --release --all-features
samply record cargo bench --bench e2e_latency

Rationale: Debug builds have 10-100x overhead that masks real performance issues

2. Run Multiple Times

# Run 3 times, compare consistency
for i in {1..3}; do
  samply record --output=profile-$i.json cargo bench --bench e2e_latency
done

Rationale: Catch warm-up effects, JIT compilation, cache behavior

3. Isolate Interference

# Close other applications
# Disable background processes (Spotlight, backups)
# Use consistent hardware (don't profile on battery power)

4. Focus on Hot Paths

80/20 Rule: 80% of time is spent in 20% of code

Priority Order:

  1. Top 5 functions by time (>5% each)
  2. Recursive calls (can amplify overhead)
  3. Locks and synchronization (contention multiplies)
  4. Allocations in loops (O(n) becomes visible)

5. Benchmark-Driven Profiling

Always profile realistic workloads:

  • ✅ E2E benchmarks (represents user experience)
  • ✅ Integration tests (real workflow patterns)
  • ❌ Unit tests (too isolated, not representative)

Flamegraph Interpretation

Reading Flamegraphs

┌─────────────────────────────────────────────┐ ← Total Program Time (100%)
│                                             │
│  ┌────────────────┐  ┌─────────────────┐  │
│  │ Database Ops   │  │ Serialization   │  │ ← High-level Operations (60%)
│  │ (30%)          │  │ (30%)           │  │
│  │                │  │                 │  │
│  │  ┌──────────┐  │  │  ┌───────────┐ │  │
│  │  │ SQL Exec │  │  │  │ JSON Ser  │ │  │ ← Leaf Operations (25%)
│  │  │ (25%)    │  │  │  │ (20%)     │ │  │
│  └──┴──────────┴──┘  └──┴───────────┴─┘  │
│                                             │
│  ┌──────────────────────────────────────┐  │
│  │ Business Logic (20%)                  │  │ ← Remaining Time
│  └──────────────────────────────────────┘  │
└─────────────────────────────────────────────┘

Width = Time spent in function (including children) Height = Call stack depth Color = Function group (can be customized)

Key Patterns

1. Wide Flat Bars = Hot Path

┌───────────────────────────────────────┐
│ step_enqueuer::enqueue_ready_steps()  │  ← 40% of total time
└───────────────────────────────────────┘

Action: Optimize this function

2. Deep Call Stack = Recursion/Abstractions

┌─────────────────────────┐
│ process_dependencies()  │
│  ┌─────────────────────┐│
│  │ resolve_deps()      ││
│  │  ┌─────────────────┐││
│  │  │ check_ready()   │││
│  │  └─────────────────┘││
│  └─────────────────────┘│
└─────────────────────────┘

Action: Consider flattening or caching

3. Many Narrow Bars = Fragmentation

┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│A│B│C│D│E│F│G│H│I│J│K│L│M│  ← Many small functions
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘

Action: Not necessarily bad (may be inlining), but check if overhead-heavy


Integration with CI

GitHub Actions Workflow (Future Enhancement)

# .github/workflows/profile-benchmarks.yml
name: Profile Benchmarks

on:
  pull_request:
    paths:
      - 'tasker-orchestration/src/orchestration/lifecycle/**'
      - 'tasker-shared/src/**'

jobs:
  profile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install flamegraph
        run: cargo install flamegraph

      - name: Profile benchmarks
        run: |
          cargo flamegraph --bench e2e_latency -- --profile-time=60 -o flamegraph.svg

      - name: Upload flamegraph
        uses: actions/upload-artifact@v3
        with:
          name: flamegraph
          path: flamegraph.svg

      - name: Compare with baseline
        run: |
          # TODO: Implement baseline comparison
          # Download previous flamegraph, compare hot paths

Documentation Structure

Created Documents

  1. This Document: docs/observability/benchmark-audit-and-profiling-plan.md

    • Benchmark inventory
    • Profiling tooling setup
    • Baseline capture plan
  2. Existing: docs/observability/benchmarking-guide.md

    • SQL benchmark documentation
    • Running instructions
    • Performance expectations
  3. docs/observability/lifecycle-performance-baseline.md (Removed — superseded by docs/benchmarks/README.md)


Next Steps

Before Actor/Services Refactor

  1. Audit Complete: Documented benchmark status
  2. Install Profiling Tools:
    cargo install samply  # macOS
    cargo install flamegraph  # Linux
    
  3. Capture Baselines (1 day):
    • Run profiling plan Phase 1
    • Generate flamegraphs
    • Document hot paths
  4. Baseline Document: Superseded by docs/benchmarks/README.md

During Actor/Services Refactor

  1. Incremental Profiling: Profile after each major component conversion
  2. Compare Baselines: Ensure no performance regressions
  3. Document Changes: Note architectural changes affecting performance

After Actor/Services Refactor

  1. Full Re-Profile: Run profiling plan Phase 3
  2. Comparison Analysis: Document performance changes
  3. Update Documentation: Reflect new architecture
  4. Benchmark Updates: Update benchmarks if Actor/Services changes measurement approach

Summary

Current State:

  • ✅ SQL benchmarks working
  • ✅ E2E benchmarks working
  • ✅ Event propagation benchmarks working
  • ✅ Task initialization benchmarks working
  • ⚠️ Component benchmarks are placeholders (OK for now)

Decision:

  • Keep placeholder benchmarks for future work
  • Move forward with profiling and baseline capture
  • Sufficient coverage to validate Actor/Services refactor

Action Plan:

  1. Install profiling tools (samply/flamegraph)
  2. Capture pre-refactor baselines (1 day)
  3. Document current hot paths
  4. Proceed with Actor/Services refactor
  5. Validate post-refactor performance

Success Criteria:

  • Baseline profiles captured
  • Hot paths documented
  • Post-refactor validation plan established
  • No performance regressions from refactor

Benchmark Implementation Decision: Event-Driven + E2E Focus

Date: 2025-10-08 Decision: Focus on event propagation and E2E benchmarks; infer worker metrics from traces


Context

Original Phase 5.4 plan included 7 benchmark categories:

  1. ✅ API Task Creation
  2. 🚧 Worker Processing Cycle
  3. ✅ Event Propagation
  4. 🚧 Step Enqueueing
  5. 🚧 Handler Overhead
  6. ✅ SQL Functions
  7. ✅ E2E Latency

Architectural Challenge: Worker Benchmarking

Problem: Direct worker benchmarking doesn’t match production reality

In a distributed system with multiple workers:

  • Can’t predict which worker will claim which step
  • Can’t control step distribution across workers
  • Artificial scenarios required to direct specific steps to specific workers
  • API queries would need to know which worker to query (unknowable in advance)

Example:

Task with 10 steps across 3 workers:
- Worker A might claim steps 1, 3, 7
- Worker B might claim steps 2, 5, 6, 9
- Worker C might claim steps 4, 8, 10

Which worker do you benchmark? How do you ensure consistent measurement?

Decision: Focus on Observable Metrics

✅ What We WILL Measure Directly

1. Event Propagation (tasker-shared/benches/event_propagation.rs)

Status: ✅ IMPLEMENTED

Measures: PostgreSQL LISTEN/NOTIFY round-trip latency

Approach:

#![allow(unused)]
fn main() {
// Setup listener on test channel
listener.listen("pgmq_message_ready.benchmark_event_test").await;

// Send message with notify
let send_time = Instant::now();
sqlx::query("SELECT pgmq_send_with_notify(...)").execute(&pool).await;

// Measure until listener receives
let _notification = listener.recv().await;
let latency = send_time.elapsed();
}

Why it works:

  • Observable from outside the system
  • Deterministic measurement (single listener, single sender)
  • Matches production behavior (real LISTEN/NOTIFY path)
  • Critical for worker responsiveness

Expected Performance: < 10ms p95 (< 5ms p50)


2. End-to-End Latency (tests/benches/e2e_latency.rs)

Status: ✅ IMPLEMENTED

Measures: Complete workflow execution (API → Task Complete)

Approach:

#![allow(unused)]
fn main() {
// Start the clock, then create the task so API processing is included
let start = Instant::now();
let response = client.create_task(request).await;

// Poll for completion
loop {
    let task = client.get_task(response.task_uuid).await;
    if task.execution_status == "AllComplete" {
        return start.elapsed();
    }
    tokio::time::sleep(Duration::from_millis(50)).await;
}
}

Why it works:

  • Measures user experience (submit → result)
  • Naturally includes ALL system overhead:
    • API processing
    • Database writes
    • Message queue latency
    • Worker claim/execute/submit (embedded in total time)
    • Event propagation
    • Orchestration coordination
  • No need to know which workers executed which steps
  • Reflects real production behavior

Expected Performance:

  • Linear (3 steps): < 500ms p99
  • Diamond (4 steps): < 800ms p99

📊 What We WILL Infer from Traces

Worker-Level Breakdown via OpenTelemetry

Instead of direct benchmarking, use existing OpenTelemetry instrumentation:

# Query traces by correlation_id from E2E benchmark
curl "http://localhost:16686/api/traces?service=tasker-worker&tags=correlation_id:abc-123"

# Extract span timings:
{
  "spans": [
    {"operationName": "step_claim",       "duration": 15ms},
    {"operationName": "execute_handler",  "duration": 42ms},  // Business logic
    {"operationName": "submit_result",    "duration": 23ms}
  ]
}

Advantages:

  • ✅ Works across distributed workers (correlation ID links everything)
  • ✅ Captures real production behavior (actual task execution)
  • ✅ Breaks down by step type (different handlers have different timing)
  • ✅ Shows which worker processed each step
  • ✅ Already instrumented (Phase 3.3 work)

Metrics Available:

  • step_claim_duration - Time to claim step from queue
  • handler_execution_duration - Time to execute handler logic
  • result_submission_duration - Time to submit result back
  • ffi_overhead - Rust vs Ruby handler comparison
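
To make the trace-based breakdown concrete, the sketch below pulls span timings out of Jaeger by correlation ID and groups them by operation name. It is a minimal illustration, assuming reqwest (with its json feature), serde_json, and tokio as dependencies, Jaeger's query API on localhost:16686 as in the curl example above, and span durations reported in microseconds; the exact tag-query syntax varies by Jaeger version, and URL-encoding is omitted for brevity.

// Sketch: summarize worker span timings for one benchmark run via Jaeger's HTTP API.
use serde_json::Value;
use std::collections::BTreeMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let correlation_id = "xyz-789"; // copied from the E2E benchmark output
    let url = format!(
        "http://localhost:16686/api/traces?service=tasker-worker&tags={{\"correlation_id\":\"{correlation_id}\"}}"
    );

    let body: Value = reqwest::get(&url).await?.json().await?;

    // Group span durations (assumed to be microseconds) by operation name.
    let mut totals: BTreeMap<String, u64> = BTreeMap::new();
    let traces = body["data"].as_array().cloned().unwrap_or_default();
    for trace in &traces {
        let spans = trace["spans"].as_array().cloned().unwrap_or_default();
        for span in &spans {
            let op = span["operationName"].as_str().unwrap_or("unknown").to_string();
            let micros = span["duration"].as_u64().unwrap_or(0);
            *totals.entry(op).or_default() += micros;
        }
    }

    for (op, micros) in totals {
        println!("{op}: {:.1} ms", micros as f64 / 1000.0);
    }
    Ok(())
}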

🚧 Benchmarks NOT Implemented (By Design)

Worker Processing Cycle (tasker-worker/benches/worker_execution.rs)

Status: 🚧 Skeleton only (placeholder)

Why not implemented:

  • Requires artificial pre-arrangement of which worker claims which step
  • Doesn’t match production (multiple workers competing for steps)
  • Metrics available via OpenTelemetry traces instead

Alternative: Query traces for step_claim → execute_handler → submit_result span timing


Step Enqueueing (tasker-orchestration/benches/step_enqueueing.rs)

Status: 🚧 Skeleton only (placeholder)

Why not implemented:

  • Difficult to trigger orchestration step discovery without full execution
  • Result naturally embedded in E2E latency measurement
  • Coordination overhead visible in E2E timing

Alternative: E2E benchmark includes step enqueueing naturally


Handler Overhead (tasker-worker/benches/handler_overhead.rs)

Status: 🚧 Skeleton only (placeholder)

Why not implemented:

  • FFI overhead varies by handler type (can’t benchmark in isolation)
  • Real overhead visible in E2E benchmark + traces
  • Rust vs Ruby comparison available via trace analysis

Alternative: Compare handler_execution_duration spans for Rust vs Ruby handlers in traces


Implementation Summary

✅ Complete Benchmarks (4/7)

| Benchmark | Status | Measures | Run Command |
|-----------|--------|----------|-------------|
| SQL Functions | ✅ Complete | PostgreSQL function performance | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks sql_functions |
| Task Initialization | ✅ Complete | API task creation latency | cargo bench -p tasker-client --features benchmarks |
| Event Propagation | ✅ Complete | LISTEN/NOTIFY round-trip | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks event_propagation |
| E2E Latency | ✅ Complete | Complete workflow execution | cargo bench --test e2e_latency |

🚧 Placeholder Benchmarks (3/7)

| Benchmark | Status | Alternative Measurement |
|-----------|--------|-------------------------|
| Worker Execution | 🚧 Placeholder | OpenTelemetry traces (correlation ID) |
| Step Enqueueing | 🚧 Placeholder | Embedded in E2E latency |
| Handler Overhead | 🚧 Placeholder | OpenTelemetry span comparison (Rust vs Ruby) |

Advantages of This Approach

1. Matches Production Reality

  • E2E benchmark reflects actual user experience
  • No artificial worker pre-arrangement required
  • Measures real distributed system behavior

2. Complete Coverage

  • E2E latency includes ALL components naturally
  • OpenTelemetry provides worker-level breakdown
  • Event propagation measures critical notification path

3. Lower Maintenance

  • Fewer benchmarks to maintain
  • No complex setup for worker isolation
  • Traces provide flexible analysis

4. Better Insights

  • Correlation IDs link entire workflow across services
  • Can analyze timing for ANY task in production
  • Breakdown available on-demand via trace queries

How to Use This System

Running Performance Analysis

Step 1: Run E2E benchmark

cargo bench --test e2e_latency

Step 2: Extract correlation_id from benchmark output

Created task: abc-123-def-456 (correlation_id: xyz-789)

Step 3: Query traces for breakdown

# Jaeger UI or API
curl "http://localhost:16686/api/traces?tags=correlation_id:xyz-789"

Step 4: Analyze span timing

{
  "spans": [
    {"service": "orchestration", "operation": "create_task", "duration": 18ms},
    {"service": "orchestration", "operation": "enqueue_steps", "duration": 12ms},
    {"service": "worker", "operation": "step_claim", "duration": 15ms},
    {"service": "worker", "operation": "execute_handler", "duration": 42ms},
    {"service": "worker", "operation": "submit_result", "duration": 23ms},
    {"service": "orchestration", "operation": "process_result", "duration": 8ms}
  ]
}

Total E2E: ~118ms (matches benchmark)
Worker overhead: 15ms + 23ms = 38ms (claim + submit, excluding business logic)


Recommendations

Completion Criteria

Complete with 4 working benchmarks:

  1. SQL Functions
  2. Task Initialization
  3. Event Propagation
  4. E2E Latency

📋 Document that worker-level metrics come from OpenTelemetry

For Future Enhancement

If direct worker benchmarking becomes necessary:

  1. Use single-worker mode Docker Compose configuration
  2. Pre-create tasks with known step assignments
  3. Query specific worker API for deterministic steps
  4. Document as synthetic benchmark (not matching production)

For Production Monitoring

Use OpenTelemetry for ongoing performance analysis:

  • Set up trace retention (7-30 days)
  • Create Grafana dashboards for span timing
  • Alert on p95 latency increases
  • Analyze slow workflows via correlation ID

Conclusion

Decision: Focus on event propagation and E2E latency benchmarks, use OpenTelemetry traces for worker-level breakdown.

Rationale: Matches production reality, provides complete coverage, lower maintenance, better insights.

Status: ✅ 4/4 practical benchmarks implemented and working

Benchmark Quick Reference Guide

Last Updated: 2025-10-08

Quick commands for running all benchmarks in the distributed benchmarking suite.


Prerequisites

# Start all Docker services
docker-compose -f docker/docker-compose.test.yml up -d

# Verify services are healthy
curl http://localhost:8080/health  # Orchestration
curl http://localhost:8081/health  # Rust Worker
curl http://localhost:8082/health  # Ruby Worker (optional)

# Set database URL (for SQL benchmarks)
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

Individual Benchmarks

✅ Implemented Benchmarks

# 1. API Task Creation (COMPLETE - 17.7-20.8ms)
cargo bench --package tasker-client --features benchmarks

# 2. SQL Function Performance (COMPLETE - 380µs-2.93ms)
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo bench --package tasker-shared --features benchmarks sql_functions

🚧 Placeholder Benchmarks

# 3. Event Propagation (placeholder)
cargo bench --package tasker-shared --features benchmarks event_propagation

# 4. Worker Execution (placeholder)
cargo bench --package tasker-worker --features benchmarks worker_execution

# 5. Handler Overhead (placeholder)
cargo bench --package tasker-worker --features benchmarks handler_overhead

# 6. Step Enqueueing (placeholder)
cargo bench --package tasker-orchestration --features benchmarks step_enqueueing

# 7. End-to-End Latency (placeholder)
cargo bench --test e2e_latency

Run All Benchmarks

# Run ALL benchmarks (implemented + placeholders)
cargo bench --all-features

# Run only SQL benchmarks
cargo bench --package tasker-shared --features benchmarks

# Run only worker benchmarks
cargo bench --package tasker-worker --features benchmarks

Benchmark Categories

| Category | Package | Benchmark Name | Status | Run Command |
|----------|---------|----------------|--------|-------------|
| API | tasker-client | task_initialization | ✅ Complete | cargo bench -p tasker-client --features benchmarks |
| SQL | tasker-shared | sql_functions | ✅ Complete | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks sql_functions |
| Events | tasker-shared | event_propagation | 🚧 Placeholder | cargo bench -p tasker-shared --features benchmarks event_propagation |
| Worker | tasker-worker | worker_execution | 🚧 Placeholder | cargo bench -p tasker-worker --features benchmarks worker_execution |
| Worker | tasker-worker | handler_overhead | 🚧 Placeholder | cargo bench -p tasker-worker --features benchmarks handler_overhead |
| Orchestration | tasker-orchestration | step_enqueueing | 🚧 Placeholder | cargo bench -p tasker-orchestration --features benchmarks |
| E2E | tests | e2e_latency | 🚧 Placeholder | cargo bench --test e2e_latency |

Benchmark Output Locations

# Criterion HTML reports
open target/criterion/report/index.html

# Individual benchmark data
ls target/criterion/

# Proposed: Structured logs (not yet implemented)
# tmp/benchmarks/YYYY-MM-DD-benchmark-name.log

Common Options

# Save baseline for comparison
cargo bench --features benchmarks -- --save-baseline main

# Compare to baseline
cargo bench --features benchmarks -- --baseline main

# Verbose output
cargo bench --features benchmarks -- --verbose

# Run specific benchmark
cargo bench --package tasker-client --features benchmarks task_creation_api

# Skip health checks (CI mode)
TASKER_TEST_SKIP_HEALTH_CHECK=true cargo bench --features benchmarks

Troubleshooting

“Services must be running”

# Start Docker services
docker-compose -f docker/docker-compose.test.yml up -d

# Check service health
curl http://localhost:8080/health

“DATABASE_URL must be set”

export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

“Task template not found”

# Ensure worker services are running (they register templates)
docker-compose -f docker/docker-compose.test.yml ps

# Check registered templates
curl -s http://localhost:8080/v1/handlers | jq

Compilation errors

# Clean and rebuild
cargo clean
cargo build --all-features

Performance Targets

| Benchmark | Metric | Target | Current | Status |
|-----------|--------|--------|---------|--------|
| Task Init (linear) | mean | < 50ms | 17.7ms | ✅ 3x better |
| Task Init (diamond) | mean | < 75ms | 20.8ms | ✅ 3.6x better |
| SQL Task Discovery | mean | < 3ms | 1.75-2.93ms | ✅ Pass |
| SQL Step Readiness | mean | < 1ms | 440-603µs | ✅ Pass |
| Worker Total Overhead | mean | < 60ms | TBD | 🚧 |
| Event Notify (p95) | p95 | < 10ms | TBD | 🚧 |
| Step Enqueue (3 steps) | mean | < 50ms | TBD | 🚧 |
| E2E Complete (3 steps) | p99 | < 500ms | TBD | 🚧 |


Distributed Benchmarking Strategy

Status: 🎯 Framework Complete | Implementation In Progress Last Updated: 2025-10-08


Overview

Complete benchmarking infrastructure for measuring distributed system performance across all components.

Benchmark Suite Structure

✅ Implemented

1. API Task Creation (tasker-client/benches/task_initialization.rs)

Status: ✅ COMPLETE - Fully implemented and tested

Measures:

  • HTTP request → task initialized latency
  • Task record creation in PostgreSQL
  • Initial step discovery from template
  • Response generation and serialization

Results (2025-10-08):

Linear (3 steps):   17.7ms  (Target: < 50ms)  ✅ 3x better than target
Diamond (4 steps):  20.8ms  (Target: < 75ms)  ✅ 3.6x better than target

Run Command:

cargo bench --package tasker-client --features benchmarks

2. SQL Function Performance (tasker-shared/benches/sql_functions.rs)

Status: ✅ COMPLETE - Fully implemented (Phase 5.2)

Measures:

  • 6 critical PostgreSQL function benchmarks
  • Intelligent stratified sampling (5-10 diverse samples per function)
  • EXPLAIN ANALYZE query plan analysis (run once per function)

Results (from Phase 5.2):

Task discovery:            1.75-2.93ms  (O(1) scaling!)
Step readiness:            440-603µs    (37% variance captured)
State transitions:         ~380µs       (±5% variance)
Task execution context:    448-559µs
Step dependencies:         332-343µs
Query plan buffer hit:     100%         (all functions)

Run Command:

DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo bench --package tasker-shared --features benchmarks sql_functions

🚧 Placeholders (Ready for Implementation)

3. Worker Processing Cycle (tasker-worker/benches/worker_execution.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • Claim: PGMQ read + atomic claim
  • Execute: Handler execution (framework overhead)
  • Submit: Result serialization + HTTP submit
  • Total: Complete worker cycle

Targets:

  • Claim: < 20ms
  • Execute (noop): < 10ms
  • Submit: < 30ms
  • Total overhead: < 60ms

Implementation Requirements:

  • Pre-enqueued steps in namespace queues
  • Worker client with breakdown metrics
  • Multiple handler types (noop, calculation, database)
  • Accurate timestamp collection for each phase

Run Command (when implemented):

cargo bench --package tasker-worker --features benchmarks worker_execution
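
As a rough illustration of the "pre-enqueued steps in namespace queues" requirement above, the sketch below pushes placeholder messages into a queue through pgmq's SQL API with sqlx. The queue name, message shape, and call signature are assumptions to check against the project's actual queue conventions; sqlx is assumed to be built with its postgres and json features.

// Sketch only: pre-enqueue placeholder step messages into a namespace queue via pgmq.
use sqlx::PgPool;

async fn pre_enqueue_noop_steps(pool: &PgPool, queue: &str, count: usize) -> Result<(), sqlx::Error> {
    for i in 0..count {
        // Message shape is illustrative; real step messages carry task/step UUIDs.
        let msg = serde_json::json!({ "step_name": "noop", "index": i });
        sqlx::query("SELECT pgmq.send($1, $2)")
            .bind(queue)
            .bind(msg)
            .execute(pool)
            .await?;
    }
    Ok(())
}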

4. Event Propagation (tasker-shared/benches/event_propagation.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • PostgreSQL LISTEN/NOTIFY latency
  • PGMQ pgmq_send_with_notify overhead
  • Event system framework overhead

Targets:

  • p50: < 5ms
  • p95: < 10ms
  • p99: < 20ms

Implementation Requirements:

  • PostgreSQL LISTEN connection setup
  • PGMQ notification channel configuration
  • Concurrent listener with timestamp correlation
  • Accurate cross-thread time measurement

Run Command (when implemented):

cargo bench --package tasker-shared --features benchmarks event_propagation
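
A minimal sketch of that listener-plus-sender measurement using sqlx's PgListener is shown below. The channel name and the pgmq_send_with_notify call follow the plan in this document, but their exact signatures should be confirmed against the schema; sqlx is assumed to be built with its tokio runtime and postgres features.

// Sketch of the LISTEN/NOTIFY round-trip measurement.
use sqlx::postgres::{PgListener, PgPoolOptions};
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
    let pool = PgPoolOptions::new().max_connections(2).connect(&url).await?;

    // Subscribe before sending so the notification cannot be missed.
    let mut listener = PgListener::connect(&url).await?;
    listener.listen("pgmq_message_ready.benchmark_event_test").await?;

    // Send a message that triggers NOTIFY, then time until the listener wakes up.
    // (Function arguments are placeholders; the real schema may differ.)
    let start = Instant::now();
    sqlx::query("SELECT pgmq_send_with_notify('benchmark_event_test', '{}'::jsonb)")
        .execute(&pool)
        .await?;
    let _notification = listener.recv().await?;
    println!("LISTEN/NOTIFY round trip: {:?}", start.elapsed());
    Ok(())
}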

5. Step Enqueueing (tasker-orchestration/benches/step_enqueueing.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • Ready step discovery (SQL query time)
  • Queue publishing (PGMQ write time)
  • Notification overhead (LISTEN/NOTIFY)
  • Total orchestration coordination

Targets:

  • 3-step workflow: < 50ms
  • 10-step workflow: < 100ms
  • 50-step workflow: < 500ms

Implementation Requirements:

  • Pre-created tasks with dependency chains
  • Orchestration client with result processing trigger
  • Queue polling to detect enqueued steps
  • Breakdown metrics (discovery, publish, notify)

Challenge: Triggering step discovery without full workflow execution

Run Command (when implemented):

cargo bench --package tasker-orchestration --features benchmarks step_enqueueing
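
For the "queue polling to detect enqueued steps" requirement, a sketch of a polling helper built on pgmq's read function is shown below. Reading messages applies a visibility timeout, which a real harness may need to account for; the queue name and schema details are assumptions.

// Sketch: poll a pgmq queue until the expected number of steps has been enqueued.
use sqlx::PgPool;
use std::time::{Duration, Instant};

async fn wait_for_enqueued_steps(
    pool: &PgPool,
    queue: &str,
    expected: i32,
    timeout: Duration,
) -> Result<Duration, sqlx::Error> {
    let start = Instant::now();
    let mut seen = 0i32;

    while seen < expected && start.elapsed() < timeout {
        // Read up to `expected` messages with a short (5s) visibility timeout.
        let rows = sqlx::query("SELECT msg_id FROM pgmq.read($1, 5, $2)")
            .bind(queue)
            .bind(expected)
            .fetch_all(pool)
            .await?;
        seen += rows.len() as i32;
        if seen < expected {
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    }

    Ok(start.elapsed())
}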

6. Handler Overhead (tasker-worker/benches/handler_overhead.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • Pure Rust handler (baseline - direct call)
  • Rust handler via framework (dispatch overhead)
  • Ruby handler via FFI (FFI boundary cost)

Targets:

  • Pure Rust: < 1µs (baseline)
  • Via Framework: < 1ms
  • Ruby FFI: < 5ms

Implementation Requirements:

  • Noop handler implementations (Rust + Ruby)
  • Direct function call benchmarks
  • Framework dispatch overhead measurement
  • FFI bridge overhead measurement

Run Command (when implemented):

cargo bench --package tasker-worker --features benchmarks handler_overhead
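
To show the shape such a benchmark could take, here is a minimal Criterion sketch contrasting a direct (statically dispatched) call with trait-object dispatch for a hypothetical NoopHandler. The real handler trait in tasker-worker will differ, and the Ruby FFI leg is omitted because it needs the actual FFI bridge.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical handler trait used only for this sketch.
trait StepHandler {
    fn call(&self, input: u64) -> u64;
}

struct NoopHandler;

impl StepHandler for NoopHandler {
    fn call(&self, input: u64) -> u64 {
        input
    }
}

fn bench_handler_overhead(c: &mut Criterion) {
    let direct = NoopHandler;
    let dynamic: Box<dyn StepHandler> = Box::new(NoopHandler);

    // Baseline: direct, statically dispatched call.
    c.bench_function("handler_overhead/direct_call", |b| {
        b.iter(|| direct.call(black_box(42)))
    });

    // Framework-style dynamic dispatch through a trait object.
    c.bench_function("handler_overhead/dyn_dispatch", |b| {
        b.iter(|| dynamic.call(black_box(42)))
    });
}

criterion_group!(benches, bench_handler_overhead);
criterion_main!(benches);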

7. End-to-End Latency (tests/benches/e2e_latency.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • Complete workflow execution (API → Task Complete)
  • All system components (API, DB, Queue, Worker, Events)
  • Real network overhead
  • Different workflow patterns

Targets:

  • Linear (3 steps): < 500ms p99
  • Diamond (4 steps): < 800ms p99
  • Tree (7 steps): < 1500ms p99

Implementation Requirements:

  • All Docker Compose services running
  • Orchestration client for task creation
  • Polling mechanism for completion detection
  • Multiple workflow templates
  • Timeout handling for stuck workflows

Special Considerations:

  • SLOW by design: Measures real workflow execution (seconds)
  • Fewer samples (sample_size=10 vs 50 default)
  • Higher variance expected (network + system state)
  • Focus on regression detection, not absolute numbers

Run Command (when implemented):

# Requires all Docker services running
docker-compose -f docker/docker-compose.test.yml up -d

cargo bench --test e2e_latency
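
A sketch of the Criterion group configuration implied by the special considerations above is shown below; the sample size, measurement time, and benchmark name are illustrative, and the client code that submits and polls a task is elided.

use criterion::{criterion_group, criterion_main, Criterion};
use std::time::Duration;

fn bench_e2e_latency(c: &mut Criterion) {
    let mut group = c.benchmark_group("e2e_latency");
    group.sample_size(10);                            // far fewer samples than Criterion's default
    group.measurement_time(Duration::from_secs(60));  // each sample executes a real workflow
    group.bench_function("linear_3_steps", |b| {
        b.iter(|| {
            // Submit a task via the orchestration API and poll until execution_status
            // is "AllComplete" (client code elided; see the approach sketch earlier).
        })
    });
    group.finish();
}

criterion_group!(benches, bench_e2e_latency);
criterion_main!(benches);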

Benchmark Output Logging Strategy

Current State

Implemented:

  • Criterion default output (terminal + HTML reports)
  • Custom health check banners in benchmarks
  • EXPLAIN ANALYZE output in SQL benchmarks
  • Inline result commentary

Location: Results saved to target/criterion/

Proposed Consistent Structure

1. Standard Output Format

All benchmarks should follow this pattern:

═══════════════════════════════════════════════════════════════════════════════
🔍 VERIFYING PREREQUISITES
═══════════════════════════════════════════════════════════════════════════════
✅ All prerequisites met
═══════════════════════════════════════════════════════════════════════════════

Benchmarking <category>/<test_name>
...
<category>/<test_name>   time: [X.XX ms Y.YY ms Z.ZZ ms]

═══════════════════════════════════════════════════════════════════════════════
📊 BENCHMARK RESULTS: <CATEGORY NAME>
═══════════════════════════════════════════════════════════════════════════════

Performance Summary:
  • Test 1: X.XX ms  (Target: < YY ms)  ✅ Status
  • Test 2: X.XX ms  (Target: < YY ms)  ⚠️  Status

Key Findings:
  • Finding 1
  • Finding 2

═══════════════════════════════════════════════════════════════════════════════

2. Structured Log Files

Proposal: Create tmp/benchmarks/ directory with dated output:

tmp/benchmarks/
├── 2025-10-08-task-initialization.log
├── 2025-10-08-sql-functions.log
├── 2025-10-08-worker-execution.log
├── ...
└── latest/
    ├── task-initialization.log -> ../2025-10-08-task-initialization.log
    └── summary.md

Log Format (example):

# Benchmark Run: task_initialization
Date: 2025-10-08 14:23:45 UTC
Commit: abc123def456
Environment: Docker Compose Test

## Prerequisites
- [x] Orchestration service healthy (http://localhost:8080)
- [x] Worker service healthy (http://localhost:8081)

## Results

### Linear Workflow (3 steps)
- Mean: 17.748 ms
- Std Dev: 0.624 ms
- Min: 17.081 ms
- Max: 18.507 ms
- Target: < 50 ms
- Status: ✅ PASS (3.0x better than target)
- Outliers: 2/20 (10%)

### Diamond Workflow (4 steps)
- Mean: 20.805 ms
- Std Dev: 0.741 ms
- Min: 19.949 ms
- Max: 21.633 ms
- Target: < 75 ms
- Status: ✅ PASS (3.6x better than target)
- Outliers: 0/20 (0%)

## Summary
✅ All tests passed
🎯 Average performance: 3.3x better than targets
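
As a sketch of the logging utility this proposal implies, the hypothetical helper below writes one dated log file per benchmark run into tmp/benchmarks/. It is not part of the current codebase; the latest/ symlinking and summary generation are left out.

use std::fs;
use std::io::Write;
use std::path::PathBuf;

/// Write one dated benchmark log, e.g. tmp/benchmarks/2025-10-08-task-initialization.log
fn write_benchmark_log(benchmark: &str, date: &str, body: &str) -> std::io::Result<PathBuf> {
    let dir = PathBuf::from("tmp/benchmarks");
    fs::create_dir_all(&dir)?;

    let path = dir.join(format!("{date}-{benchmark}.log"));
    let mut file = fs::File::create(&path)?;
    file.write_all(body.as_bytes())?;
    Ok(path)
}

fn main() -> std::io::Result<()> {
    let report = "# Benchmark Run: task_initialization\n...\n";
    let path = write_benchmark_log("task-initialization", "2025-10-08", report)?;
    println!("wrote {}", path.display());
    Ok(())
}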

3. Baseline Comparison Format

For tracking performance over time:

# Performance Baseline Comparison
Baseline: main branch (2025-10-01)
Current: feature/benchmarks (2025-10-08)

| Benchmark | Baseline | Current | Change | Status |
|-----------|----------|---------|--------|--------|
| task_init/linear | 18.2ms | 17.7ms | -2.7% | ✅ Improved |
| task_init/diamond | 21.1ms | 20.8ms | -1.4% | ✅ Improved |
| sql/task_discovery | 2.91ms | 2.93ms | +0.7% | ✅ Stable |

4. CI Integration Format

For GitHub Actions / CI output:

{
  "benchmark_suite": "task_initialization",
  "timestamp": "2025-10-08T14:23:45Z",
  "commit": "abc123def456",
  "results": [
    {
      "name": "linear_3_steps",
      "mean_ms": 17.748,
      "std_dev_ms": 0.624,
      "target_ms": 50,
      "status": "pass",
      "performance_ratio": 3.0
    }
  ],
  "summary": {
    "total_tests": 2,
    "passed": 2,
    "failed": 0,
    "warnings": 0
  }
}
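
A sketch of serde types that could model this CI payload is shown below; the field names mirror the example above, but this is an illustration rather than an existing schema in the repository (serde with the derive feature and serde_json are assumed).

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct BenchmarkSuiteReport {
    benchmark_suite: String,
    timestamp: String,
    commit: String,
    results: Vec<BenchmarkResult>,
    summary: Summary,
}

#[derive(Serialize, Deserialize)]
struct BenchmarkResult {
    name: String,
    mean_ms: f64,
    std_dev_ms: f64,
    target_ms: f64,
    status: String,
    performance_ratio: f64,
}

#[derive(Serialize, Deserialize)]
struct Summary {
    total_tests: u32,
    passed: u32,
    failed: u32,
    warnings: u32,
}

fn main() -> serde_json::Result<()> {
    // Build the example payload from the values shown above and print it.
    let report = BenchmarkSuiteReport {
        benchmark_suite: "task_initialization".into(),
        timestamp: "2025-10-08T14:23:45Z".into(),
        commit: "abc123def456".into(),
        results: vec![BenchmarkResult {
            name: "linear_3_steps".into(),
            mean_ms: 17.748,
            std_dev_ms: 0.624,
            target_ms: 50.0,
            status: "pass".into(),
            performance_ratio: 3.0,
        }],
        summary: Summary { total_tests: 2, passed: 2, failed: 0, warnings: 0 },
    };
    println!("{}", serde_json::to_string_pretty(&report)?);
    Ok(())
}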

Running All Benchmarks

Quick Reference

# 1. Start Docker services
docker-compose -f docker/docker-compose.test.yml up -d

# 2. Run individual benchmarks
cargo bench --package tasker-client --features benchmarks     # Task initialization
cargo bench --package tasker-shared --features benchmarks     # SQL + Events
cargo bench --package tasker-worker --features benchmarks     # Worker + Handlers
cargo bench --package tasker-orchestration --features benchmarks  # Step enqueueing
cargo bench --test e2e_latency                                # End-to-end

# 3. Run ALL benchmarks (when all implemented)
cargo bench --all-features

Environment Variables

# Required for SQL benchmarks
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

# Optional: Skip health checks (CI)
export TASKER_TEST_SKIP_HEALTH_CHECK="true"

# Optional: Custom service URLs
export TASKER_TEST_ORCHESTRATION_URL="http://localhost:9080"
export TASKER_TEST_WORKER_URL="http://localhost:9081"

Performance Targets Summary

| Category | Component | Metric | Target | Status |
|----------|-----------|--------|--------|--------|
| API | Task Creation (3 steps) | p99 | < 50ms | ✅ 17.7ms |
| API | Task Creation (4 steps) | p99 | < 75ms | ✅ 20.8ms |
| SQL | Task Discovery | mean | < 3ms | ✅ 1.75-2.93ms |
| SQL | Step Readiness | mean | < 1ms | ✅ 440-603µs |
| Worker | Total Overhead | mean | < 60ms | 🚧 TBD |
| Worker | FFI Overhead | mean | < 5ms | 🚧 TBD |
| Events | Notify Latency | p95 | < 10ms | 🚧 TBD |
| Orchestration | Step Enqueueing (3 steps) | mean | < 50ms | 🚧 TBD |
| E2E | Complete Workflow (3 steps) | p99 | < 500ms | 🚧 TBD |

Next Steps

Immediate (Current Session)

  1. ✅ Create all benchmark skeletons
  2. 🎯 Design consistent logging structure
  3. Decide on implementation priorities

Short Term

  1. Implement worker execution benchmark
  2. Implement event propagation benchmark
  3. Create benchmark output logging utilities

Medium Term

  1. Implement step enqueueing benchmark
  2. Implement handler overhead benchmark
  3. Implement E2E latency benchmark

Long Term

  1. CI integration with baseline tracking
  2. Performance regression detection
  3. Automated benchmark reports
  4. Historical performance trending


SQL Function Benchmarking Guide

Created: 2025-10-08 Status: ✅ Complete Location: tasker-shared/benches/sql_functions.rs


Overview

The SQL function benchmark suite measures performance of critical database operations that form the hot paths in the Tasker orchestration system. These benchmarks provide:

  1. Baseline Performance Metrics: Establish expected performance ranges
  2. Regression Detection: Identify performance degradations in code changes
  3. Optimization Guidance: Use EXPLAIN ANALYZE output to guide index/query improvements
  4. Capacity Planning: Understand scaling characteristics with data volume

Quick Start

Prerequisites

# 1. Ensure PostgreSQL is running
pg_isready

# 2. Set up test database
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo sqlx migrate run

# 3. Populate with test data - REQUIRED for representative benchmarks
cargo test --all-features

Important: The benchmarks use intelligent sampling to test diverse task/step complexities. Running integration tests first ensures the database contains various workflow patterns (linear, diamond, parallel) for representative benchmarking.

Running Benchmarks

# Run all SQL benchmarks
cargo bench --package tasker-shared --features benchmarks

# Run specific benchmark group
cargo bench --package tasker-shared --features benchmarks get_next_ready_tasks

# Run with baseline comparison
cargo bench --package tasker-shared --features benchmarks -- --save-baseline main
# ... make changes ...
cargo bench --package tasker-shared --features benchmarks -- --baseline main

Sampling Strategy

The benchmarks use intelligent sampling to ensure representative results:

Task Sampling

  • Samples 5 diverse tasks from different named_task_uuid types
  • Distributes samples across different workflow patterns
  • Maintains deterministic ordering (same UUIDs in same order each run)
  • Provides consistent results while capturing complexity variance

Step Sampling

  • Samples 10 diverse steps from different tasks
  • Selects up to 2 steps per task for variety
  • Captures different DAG depths and dependency patterns
  • Helps identify performance variance in recursive queries

Benefits

  1. Representativeness: No bias from single task/step selection
  2. Consistency: Same samples = comparable baseline comparisons
  3. Variance Detection: Criterion can measure performance across complexities
  4. Real-world Accuracy: Reflects actual production workload diversity

Example Output:

step_readiness_status/calculate_readiness/0    2.345 ms
step_readiness_status/calculate_readiness/1    1.234 ms  (simple linear task)
step_readiness_status/calculate_readiness/2    5.678 ms  (complex diamond DAG)
step_readiness_status/calculate_readiness/3    3.456 ms
step_readiness_status/calculate_readiness/4    2.789 ms
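
A sketch of how this stratified, deterministic sampling might be expressed with sqlx is shown below. The table and column names (tasker.tasks, named_task_uuid, task_uuid) follow the conventions used elsewhere in this document, and the real query in tasker-shared/benches/sql_functions.rs may differ.

use sqlx::PgPool;
use uuid::Uuid;

/// Select a small, deterministic set of tasks spread across template types.
async fn sample_diverse_tasks(pool: &PgPool, per_template: i64) -> Result<Vec<Uuid>, sqlx::Error> {
    let rows: Vec<(Uuid,)> = sqlx::query_as(
        r#"
        SELECT task_uuid FROM (
            SELECT task_uuid,
                   ROW_NUMBER() OVER (PARTITION BY named_task_uuid ORDER BY task_uuid) AS rn
            FROM tasker.tasks
        ) ranked
        WHERE rn <= $1
        ORDER BY task_uuid   -- deterministic ordering keeps samples comparable across runs
        LIMIT 5
        "#,
    )
    .bind(per_template)
    .fetch_all(pool)
    .await?;

    Ok(rows.into_iter().map(|(uuid,)| uuid).collect())
}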

Benchmark Categories

1. Task Discovery (get_next_ready_tasks)

What it measures: Time to discover ready tasks for orchestration

Hot path: Orchestration coordinator → Task discovery

Test parameters:

  • Batch size: 1, 10, 50, 100 tasks
  • Measures function overhead even with empty database

Expected performance:

  • Empty DB: < 5ms for any batch size (function overhead)
  • With data: Should scale linearly, not exponentially

Optimization targets:

  • Index on task state
  • Index on namespace for filtering
  • Efficient processor ownership checks

Example output:

get_next_ready_tasks/batch_size/1
                        time:   [2.1234 ms 2.1567 ms 2.1845 ms]
get_next_ready_tasks/batch_size/10
                        time:   [2.2156 ms 2.2489 ms 2.2756 ms]
get_next_ready_tasks/batch_size/50
                        time:   [2.5123 ms 2.5456 ms 2.5789 ms]
get_next_ready_tasks/batch_size/100
                        time:   [3.0234 ms 3.0567 ms 3.0890 ms]

Analysis: Near-constant time across batch sizes indicates efficient query planning.
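
For orientation, here is a minimal sketch of how a Criterion benchmark can drive an async SQL function across these batch sizes. It assumes Criterion's async_tokio feature, sqlx, and a DATABASE_URL in the environment; the argument list of get_next_ready_tasks shown here is a placeholder and the real signature may differ.

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use sqlx::PgPool;
use tokio::runtime::Runtime;

fn bench_task_discovery(c: &mut Criterion) {
    let rt = Runtime::new().unwrap();
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
    let pool = rt.block_on(async { PgPool::connect(&url).await.unwrap() });

    let mut group = c.benchmark_group("get_next_ready_tasks");
    for batch_size in [1i32, 10, 50, 100] {
        group.bench_with_input(BenchmarkId::new("batch_size", batch_size), &batch_size, |b, &n| {
            b.to_async(&rt).iter(|| async {
                // Argument shape is assumed for illustration only.
                sqlx::query("SELECT * FROM get_next_ready_tasks($1)")
                    .bind(n)
                    .fetch_all(&pool)
                    .await
                    .unwrap()
            })
        });
    }
    group.finish();
}

criterion_group!(benches, bench_task_discovery);
criterion_main!(benches);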


2. Step Readiness (get_step_readiness_status)

What it measures: Time to calculate if a step is ready to execute

Hot path: Step enqueuer → Dependency resolution

Dependencies: Requires test data (tasks with steps)

Expected performance:

  • Simple linear: < 10ms
  • Diamond pattern: < 20ms
  • Complex DAG: < 50ms

Optimization targets:

  • Parent step completion checks
  • Dependency graph traversal
  • Retry backoff calculations

Graceful degradation:

⚠️  Skipping step_readiness_status benchmark - no test data found
    Run integration tests first to populate test data

3. State Transitions (transition_task_state_atomic)

What it measures: Time for atomic state transitions with processor ownership

Hot path: All orchestration operations (initialization, enqueuing, finalization)

Expected performance:

  • Successful transition: < 15ms
  • Failed transition (wrong state): < 10ms (faster path)
  • Contention scenario: < 50ms with backoff

Optimization targets:

  • Atomic compare-and-swap efficiency
  • Index on task_uuid + processor_uuid
  • Transition history table size

4. Task Execution Context (get_task_execution_context)

What it measures: Time to retrieve comprehensive task orchestration status

Hot path: Orchestration coordinator → Status checking

Dependencies: Requires test data (tasks in database)

Expected performance:

  • Simple tasks: < 10ms
  • Complex tasks: < 25ms
  • With many steps: < 50ms

Optimization targets:

  • Step aggregation queries
  • State calculation efficiency
  • Join optimization for step counts

5. Transitive Dependencies (get_step_transitive_dependencies)

What it measures: Time to resolve complete dependency tree for a step

Hot path: Worker → Step execution preparation (once per step lifecycle)

Dependencies: Requires test data (steps with dependencies)

Expected performance:

  • Linear dependencies: < 5ms
  • Diamond pattern: < 10ms
  • Complex DAG (10+ levels): < 25ms

Optimization targets:

  • Recursive CTE performance
  • Index on step dependencies
  • Materialized dependency graphs (future)

Why it matters: Called once per step on worker side when populating step data. While not in orchestration hot path, it affects worker step initialization latency. Recursive CTEs can be expensive with deep dependency trees.


6. EXPLAIN ANALYZE (explain_analyze)

What it measures: Query execution plans, not just timing

How it works: Runs EXPLAIN ANALYZE once per function (no repeated iterations since query plans don’t change between executions)

Functions analyzed:

  • get_next_ready_tasks() - Task discovery query plans
  • get_task_execution_context() - Task status aggregation plans
  • get_step_transitive_dependencies() - Recursive CTE dependency traversal plans

Purpose: Identify optimization opportunities:

  • Sequential scans (need indexes)
  • Nested loop performance
  • Buffer hit ratios
  • Index usage patterns
  • Recursive CTE efficiency

Automatic Query Plan Logging: Each query plan is captured once and analyzed, printing:

  • ⏱️ Execution Time: Actual query execution duration
  • 📋 Planning Time: Time spent planning the query
  • 📦 Node Type: Primary operation type (Aggregate, Index Scan, etc.)
  • 💰 Total Cost: PostgreSQL’s cost estimate
  • ⚠️ Sequential Scan Warning: Alerts for potential missing indexes
  • 📊 Buffer Hit Ratio: Cache efficiency (higher is better)

Example output:

════════════════════════════════════════════════════════════════════════════════
📊 QUERY PLAN ANALYSIS
════════════════════════════════════════════════════════════════════════════════

🔍 Function: get_next_ready_tasks
────────────────────────────────────────────────────────────────────────────────
⏱️  Execution Time: 2.345 ms
📋 Planning Time: 0.123 ms
📦 Node Type: Aggregate
💰 Total Cost: 45.67
📊 Buffer Hit Ratio: 98.5% (197/200 blocks)
────────────────────────────────────────────────────────────────────────────────

Saving Full Plans:

# Save complete JSON plans to target/query_plan_*.json
SAVE_QUERY_PLANS=1 cargo bench --package tasker-shared --features benchmarks

Red flags to investigate:

  • “Seq Scan” on large tables → Add index
  • “Nested Loop” with high iteration count → Optimize join strategy
  • “Sort” operations on large datasets → Add index for ORDER BY
  • Low buffer hit ratio (< 90%) → Increase shared_buffers or investigate I/O

Interpreting Results

Criterion Statistics

Criterion provides comprehensive statistics for each benchmark:

get_next_ready_tasks/batch_size/10
                        time:   [2.2156 ms 2.2489 ms 2.2756 ms]
                        change: [-1.5% +0.2% +1.9%] (p = 0.31 > 0.05)
                        No change in performance detected.
Found 3 outliers among 50 measurements (6.00%)
  2 (4.00%) high mild
  1 (2.00%) high severe

Key metrics:

  • [2.2156 ms 2.2489 ms 2.2756 ms]: Lower bound, mean, upper bound (95% confidence)
  • change: Comparison to baseline (if available)
  • p-value: Statistical significance (p < 0.05 = significant)
  • Outliers: Measurements far from median (cache effects, GC, etc.)

Performance Expectations

Based on Phase 3 metrics verification (26 tasks executed):

| Metric | Expected | Warning | Critical |
|--------|----------|---------|----------|
| Task initialization | < 50ms | 50-100ms | > 100ms |
| Step readiness | < 20ms | 20-50ms | > 50ms |
| State transition | < 15ms | 15-30ms | > 30ms |
| Finalization claim | < 10ms | 10-25ms | > 25ms |

Note: These are function-level times, not end-to-end latencies.


Using Benchmarks for Optimization

Workflow

  1. Establish Baseline

    cargo bench --package tasker-shared --features benchmarks -- --save-baseline main
    
  2. Make Changes (e.g., add index, optimize query)

  3. Compare

    cargo bench --package tasker-shared --features benchmarks -- --baseline main
    
  4. Review Output

    get_next_ready_tasks/batch_size/100
                         time:   [2.0123 ms 2.0456 ms 2.0789 ms]
                         change: [-34.5% -32.1% -29.7%] (p = 0.00 < 0.05)
                         Performance has improved.
    
  5. Analyze EXPLAIN Plans (if improvement isn’t clear)


Common Optimization Patterns

Pattern 1: Missing Index

Symptom: Exponential scaling with data volume

EXPLAIN shows: Seq Scan on tasks

Solution:

CREATE INDEX idx_tasks_state ON tasker.tasks(current_state)
WHERE complete = false;

Pattern 2: Inefficient Join

Symptom: High latency with complex DAGs

EXPLAIN shows: Nested Loop with high row counts

Solution: Use CTE or adjust join strategy

WITH parent_status AS (
  SELECT ... -- Pre-compute parent completions
)
SELECT ... FROM tasker.workflow_steps s
JOIN parent_status ps ON ...

Pattern 3: Large Transaction History

Symptom: State transition slowing over time

EXPLAIN shows: Large scan of task_transitions

Solution: Partition by date or archive old transitions

CREATE TABLE tasker.task_transitions_archive (LIKE tasker.task_transitions);
-- Move old data periodically

Integration with Metrics

The benchmark results should correlate with production metrics:

From metrics-reference.md:

  • tasker_task_initialization_duration_milliseconds → Benchmark: task discovery + initialization
  • tasker_step_result_processing_duration_milliseconds → Benchmark: step readiness + state transitions
  • tasker_task_finalization_duration_milliseconds → Benchmark: finalization claiming

Validation approach:

  1. Run benchmarks: Get ~2ms for task discovery
  2. Check metrics: tasker_task_initialization_duration P95 = ~45ms
  3. Calculate overhead: 45ms - 2ms = 43ms (business logic + framework)

This helps identify where optimization efforts should focus:

  • If benchmark is slow → Optimize SQL/indexes
  • If benchmark is fast but metrics slow → Optimize Rust code

Continuous Integration

# .github/workflows/benchmarks.yml
name: Performance Benchmarks

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: tasker
        options: >-
          --health-cmd pg_isready
          --health-interval 10s

    steps:
      - uses: actions/checkout@v3
      - uses: dtolnay/rust-toolchain@stable

      - name: Run migrations
        run: cargo sqlx migrate run
        env:
          DATABASE_URL: postgresql://postgres:tasker@localhost/test

      - name: Run benchmarks
        run: cargo bench --package tasker-shared --features benchmarks

      - name: Check for regressions
        run: |
          # Parse Criterion output and fail if P95 > threshold
          # This is left as an exercise for CI implementation

Future Enhancements

Phase 5.3: Data Generation (Deferred)

The current benchmarks work with existing test data. Future work could add:

  1. Realistic Data Generation

    • Create 100/1,000/10,000 task datasets
    • Various DAG complexities (linear, diamond, tree)
    • State distribution (60% complete, 20% in-progress, etc.)
  2. Contention Testing

    • Multiple processors competing for same tasks
    • Race condition scenarios
    • Deadlock detection
  3. Long-Running Benchmarks

    • Memory leak detection
    • Connection pool exhaustion
    • Query plan cache effects

Troubleshooting

Benchmark fails with “DATABASE_URL must be set”

export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

All benchmarks show “no test data found”

# Run integration tests to populate database
cargo test --all-features

# Or run specific test suite
cargo test --package tasker-shared --all-features

Benchmarks are inconsistent/noisy

  • Close other applications
  • Ensure PostgreSQL isn’t under load
  • Run benchmarks multiple times
  • Increase sample_size in benchmark code

Results don’t match production metrics

  • Production has different data volumes
  • Network latency in production
  • Different PostgreSQL version/configuration
  • Connection pool overhead in production

References

  • Criterion Documentation: https://bheisler.github.io/criterion.rs/book/
  • PostgreSQL EXPLAIN: https://www.postgresql.org/docs/current/sql-explain.html
  • Phase 3 Metrics: docs/observability/metrics-reference.md
  • Verification Results: docs/observability/VERIFICATION_RESULTS.md

Sign-Off

Phase 5.2 Status: ✅ COMPLETE

Benchmarks Implemented:

  • get_next_ready_tasks() - 4 batch sizes
  • get_step_readiness_status() - with graceful skip
  • transition_task_state_atomic() - atomic operations
  • get_task_execution_context() - orchestration status retrieval
  • get_step_transitive_dependencies() - recursive dependency traversal
  • EXPLAIN ANALYZE - query plan capture with automatic analysis

Documentation Complete:

  • ✅ Quick start guide
  • ✅ Interpretation guidance
  • ✅ Optimization patterns
  • ✅ Integration with metrics
  • ✅ CI recommendations

Next Steps: Run benchmarks with real data and establish baseline performance targets.

Tasker-Core Logging Standards

Version: 1.0 Last Updated: 2025-10-07 Status: Active Related: Observability Standardization


Table of Contents

  1. Philosophy
  2. Log Levels
  3. Structured Fields
  4. Message Style
  5. Instrument Macro
  6. Error Handling
  7. Examples
  8. Enforcement

Philosophy

Principles:

  • Production-First: Logs must be parseable, searchable, and professional
  • Correlation-Driven: All operations include correlation_id for distributed tracing
  • Structured: Fields over string interpolation for aggregation and querying
  • Concise: Clear, actionable messages without noise
  • Consistent: Predictable patterns across all code

Anti-Patterns to Avoid:

  • ❌ Emojis (🚀✅❌) - Breaks log parsers, unprofessional
  • ❌ All-caps prefixes (“BOOTSTRAP:”, “CORE:”) - Redundant with module paths
  • ❌ Ticket references (“JIRA-123”, “PROJ-40”) - Internal, meaningless externally
  • ❌ String interpolation - Use structured fields instead
  • ❌ Verbose messages - Be concise, let fields provide detail

Log Levels

ERROR - Unrecoverable Failures

When to Use:

  • Database connection permanently lost
  • Critical system component failure
  • Unrecoverable state machine violation
  • Data corruption detected
  • Message queue unavailable

Characteristics:

  • Requires immediate human intervention
  • Service degradation or outage
  • Cannot automatically recover
  • Should trigger alerts/pages

Example:

#![allow(unused)]
fn main() {
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Failed to claim task for finalization: database unavailable"
);
}

WARN - Degraded Operation

When to Use:

  • Retryable failures after exhausting retries
  • Circuit breaker opened (degraded mode)
  • Fallback behavior activated
  • Rate limiting engaged
  • Configuration issues (non-fatal)
  • Unexpected but handled conditions

Characteristics:

  • Service continues but degraded
  • Automatic recovery possible
  • Should be monitored for patterns
  • May indicate upstream problems

Example:

#![allow(unused)]
fn main() {
warn!(
    correlation_id = %correlation_id,
    step_uuid = %step_uuid,
    retry_count = attempts,
    max_retries = max_attempts,
    next_retry_at = ?next_retry,
    "Step execution failed after max retries, will not retry further"
);
}

INFO - Lifecycle Events

When to Use:

  • System startup/shutdown
  • Task created/completed/failed
  • Step enqueued/completed
  • State transitions (task/step)
  • Configuration loaded
  • Significant business events

Characteristics:

  • Normal operation milestones
  • Useful for understanding flow
  • Production-ready verbosity
  • Default log level in production

Example:

#![allow(unused)]
fn main() {
info!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    steps_enqueued = count,
    duration_ms = elapsed.as_millis(),
    "Task initialization complete"
);
}

DEBUG - Detailed Diagnostics

When to Use:

  • Discovery query results
  • Queue depth checks
  • Dependency analysis details
  • Configuration value dumps
  • State machine transition details
  • Detailed operation flow

Characteristics:

  • Troubleshooting information
  • Not shown in production (usually)
  • Safe to be verbose
  • Helps understand “why”

Example:

#![allow(unused)]
fn main() {
debug!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    viable_steps = steps.len(),
    pending_steps = pending.len(),
    blocked_steps = blocked.len(),
    "Step readiness analysis complete"
);
}

TRACE - Very Verbose

When to Use:

  • Function entry/exit in hot paths
  • Loop iteration details
  • Deep parameter inspection
  • Performance profiling hooks

Characteristics:

  • Extremely verbose
  • Usually disabled even in dev
  • Performance impact acceptable
  • Use sparingly

Example:

#![allow(unused)]
fn main() {
trace!(
    correlation_id = %correlation_id,
    iteration = i,
    "Polling loop iteration"
);
}

Structured Fields

Required Fields (Context-Dependent)

Always Include:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,  // ALWAYS when available
}

When Task Context Available:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
task_uuid = %task_uuid,
namespace = %namespace,
}

When Step Context Available:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
task_uuid = %task_uuid,
step_uuid = %step_uuid,
namespace = %namespace,
}

For Operations:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
// ... entity IDs ...
operation = "step_enqueue",        // Operation identifier
duration_ms = elapsed.as_millis(), // Timing for operations
}

For Errors:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
// ... entity IDs ...
error = %e,                        // Error Display
error_type = %type_name::<E>(),   // Optional: Error type
}

Field Ordering (MANDATORY)

Standard Order:

  1. correlation_id (always first)
  2. Entity IDs (task_uuid, step_uuid, namespace)
  3. Operation/Action (operation, state, status)
  4. Measurements (duration_ms, count, size)
  5. Error Info (error, error_type, context)
  6. Other Context (additional fields)

Example:

#![allow(unused)]
fn main() {
info!(
    // 1. Correlation ID (ALWAYS FIRST)
    correlation_id = %correlation_id,

    // 2. Entity IDs
    task_uuid = %task_uuid,
    step_uuid = %step_uuid,
    namespace = %namespace,

    // 3. Operation
    operation = "step_transition",
    from_state = %old_state,
    to_state = %new_state,

    // 4. Measurements
    duration_ms = elapsed.as_millis(),

    // 5. No errors (success case)

    // 6. Other context
    processor_id = %processor_uuid,

    "Step state transition complete"
);
}

Field Formatting

Use Display Formatting (%):

#![allow(unused)]
fn main() {
// ✅ CORRECT: Let tracing handle formatting
correlation_id = %correlation_id,
task_uuid = %task_uuid,
error = %e,
}

Avoid Manual Conversion:

#![allow(unused)]
fn main() {
// ❌ WRONG: Manual to_string()
task_uuid = task_uuid.to_string(),

// ❌ WRONG: Debug formatting for production types
task_uuid = ?task_uuid,  // Use ? only for Debug types
}

Field Naming:

#![allow(unused)]
fn main() {
// ✅ Standard names
duration_ms          // Not elapsed_ms, time_ms
error                // Not err, error_message
step_uuid            // Not workflow_step_uuid (be consistent)
retry_count          // Not attempts, retries
max_retries          // Not max_attempts
}

Message Style

Guidelines

DO:

  • ✅ Be concise and actionable
  • ✅ Use present tense for states: “Step enqueued”
  • ✅ Use past tense for events: “Task completed”
  • ✅ Start with the subject: “Task completed” not “Successfully completed task”
  • ✅ Focus on WHAT happened (fields show HOW)

DON’T:

  • ❌ Use emojis: “🚀 Starting…” → “Starting orchestration system”
  • ❌ Use all-caps prefixes: “BOOTSTRAP: Starting…” → “Starting orchestration bootstrap”
  • ❌ Include ticket numbers: “PROJ-40: Processing…” → “Processing command”
  • ❌ Be redundant: “Successfully enqueued step successfully” → “Step enqueued”
  • ❌ Include technical jargon: “Atomic CAS transition succeeded” → “State transition complete”
  • ❌ Be verbose: Keep messages under 10 words ideally

Before/After Examples

Lifecycle Events:

#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("🚀 BOOTSTRAP: Starting unified orchestration system bootstrap");

// ✅ AFTER
info!("Starting orchestration system bootstrap");
}

Operation Completion:

#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("✅ STEP_ENQUEUER: Successfully marked step {} as enqueued", step_uuid);

// ✅ AFTER
info!(
    correlation_id = %correlation_id,
    step_uuid = %step_uuid,
    "Step marked as enqueued"
);
}

Error Handling:

#![allow(unused)]
fn main() {
// ❌ BEFORE
error!("❌ ORCHESTRATION_LOOP: Failed to process task {}: {}", task_uuid, e);

// ✅ AFTER
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Task processing failed"
);
}

Shutdown:

#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("🛑 Shutdown signal received, initiating graceful shutdown...");

// ✅ AFTER
info!("Shutdown signal received, initiating graceful shutdown");
}

Instrument Macro

When to Use

Use #[instrument] for:

  • Function-level spans in hot paths
  • Automatic correlation ID tracking
  • Operations that should appear in traces
  • Functions with significant duration

Benefits:

  • Automatic span creation
  • Automatic timing
  • Better OpenTelemetry integration (Phase 2)
  • Cleaner code

Example

#![allow(unused)]
fn main() {
use tracing::instrument;

#[instrument(skip(self), fields(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    namespace = %namespace
))]
pub async fn process_task(
    &self,
    correlation_id: Uuid,
    task_uuid: Uuid,
    namespace: String,
) -> Result<TaskResult> {
    // Span automatically created with fields above
    info!("Starting task processing");

    // ... implementation ...

    info!(
        duration_ms = start.elapsed().as_millis(),
        "Task processing complete"
    );

    Ok(result)
}
}

Skip Parameters

Always skip:

  • self (redundant)
  • Large structures (use specific fields instead)
  • Sensitive data (passwords, tokens, PII)
#![allow(unused)]
fn main() {
#[instrument(
    skip(self, context),  // Skip large context
    fields(
        correlation_id = %correlation_id,
        task_uuid = %context.task_uuid,  // Extract specific fields
    )
)]
}

Error Handling

Error Context

Always include:

#![allow(unused)]
fn main() {
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,              // Error Display (user-friendly)
    error_type = %type_name::<E>(),  // Optional: For classification
    "Operation failed"
);
}

Error Propagation

#![allow(unused)]
fn main() {
// ✅ Log and return for caller to handle
debug!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Step discovery query failed, will retry"
);
return Err(e);

// ❌ Don't log at every level (causes noise)
// Instead: Log once at appropriate level where handled
}

Error Classification

#![allow(unused)]
fn main() {
match result {
    Err(e) if e.is_retryable() => {
        warn!(
            correlation_id = %correlation_id,
            error = %e,
            retry_count = attempts,
            "Operation failed, will retry"
        );
    }
    Err(e) => {
        error!(
            correlation_id = %correlation_id,
            error = %e,
            "Operation failed permanently"
        );
    }
    Ok(result) => {
        info!(
            correlation_id = %correlation_id,
            duration_ms = elapsed.as_millis(),
            "Operation completed successfully"
        );
    }
}
}

Examples

Complete Examples by Scenario

Task Initialization

#![allow(unused)]
fn main() {
#[instrument(skip(self), fields(
    correlation_id = %task_request.correlation_id,
    task_name = %task_request.name,
    namespace = %task_request.namespace
))]
pub async fn create_task_from_request(
    &self,
    task_request: TaskRequest,
) -> Result<TaskInitializationResult> {
    let correlation_id = task_request.correlation_id;
    let start = Instant::now();

    info!("Starting task initialization");

    // Create task
    let task = self.create_task(&task_request).await?;

    debug!(
        task_uuid = %task.task_uuid,
        template_uuid = %task.named_task_uuid,
        "Task created in database"
    );

    // Discover steps
    let steps = self.discover_initial_steps(task.task_uuid).await?;

    info!(
        correlation_id = %correlation_id,
        task_uuid = %task.task_uuid,
        step_count = steps.len(),
        duration_ms = start.elapsed().as_millis(),
        "Task initialization complete"
    );

    Ok(TaskInitializationResult {
        task_uuid: task.task_uuid,
        step_count: steps.len(),
    })
}
}

Step Enqueueing

#![allow(unused)]
fn main() {
pub async fn enqueue_step(
    &self,
    correlation_id: Uuid,
    task_uuid: Uuid,
    step: &ViableStep,
) -> Result<()> {
    let start = Instant::now();

    debug!(
        correlation_id = %correlation_id,
        task_uuid = %task_uuid,
        step_uuid = %step.step_uuid,
        step_name = %step.name,
        queue = %step.queue_name,
        "Enqueueing step"
    );

    let message = self.create_message(correlation_id, task_uuid, step)?;

    self.pgmq_client
        .send(&step.queue_name, &message)
        .await?;

    info!(
        correlation_id = %correlation_id,
        task_uuid = %task_uuid,
        step_uuid = %step.step_uuid,
        queue = %step.queue_name,
        duration_ms = start.elapsed().as_millis(),
        "Step enqueued"
    );

    Ok(())
}
}

Error Handling

#![allow(unused)]
fn main() {
match self.process_step_result(result).await {
    Ok(()) => {
        info!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            duration_ms = elapsed.as_millis(),
            "Step result processed"
        );
    }
    Err(e) if e.is_retryable() => {
        warn!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            error = %e,
            retry_count = result.attempts,
            "Step result processing failed, will retry"
        );
        return Err(e);
    }
    Err(e) => {
        error!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            error = %e,
            "Step result processing failed permanently"
        );
        return Err(e);
    }
}
}

Bootstrap/Shutdown

#![allow(unused)]
fn main() {
pub async fn bootstrap() -> Result<OrchestrationSystemHandle> {
    info!("Starting orchestration system bootstrap");

    let config = ConfigManager::load()?;
    debug!(environment = %config.environment, "Configuration loaded");

    let context = SystemContext::from_config(config).await?;
    info!(processor_uuid = %context.processor_uuid, "System context initialized");

    let core = OrchestrationCore::new(context).await?;
    info!("Orchestration core initialized");

    // ... more initialization ...

    info!(
        processor_uuid = %core.processor_uuid,
        namespaces = ?core.supported_namespaces,
        "Orchestration system bootstrap complete"
    );

    Ok(handle)
}

pub async fn shutdown(&mut self) -> Result<()> {
    info!("Initiating graceful shutdown");

    if let Some(coordinator) = &self.event_coordinator {
        coordinator.stop().await?;
        debug!("Event coordinator stopped");
    }

    info!("Orchestration system shutdown complete");
    Ok(())
}
}

Enforcement

Code Review Checklist

Before merging, verify:

  • No emojis in log messages
  • No all-caps component prefixes
  • No ticket references in runtime logs
  • correlation_id present in all task/step operations
  • Structured fields follow standard ordering
  • Messages are concise and actionable
  • Appropriate log levels used
  • Error context is complete

CI Checks

Recommended lints (future):

# Check for emojis
! grep -r '[🔧✅🚀❌⚠️📊🔍🎉🛡️⏱️📝🏗️🎯🔄💡📦🧪🌉🔌⏳🛑]' src/

# Check for all-caps prefixes
! grep -rE '(info|debug|warn|error)!\(".*[A-Z_]{3,}:' src/

# Check for ticket references in logs (allow in comments)
! grep -rE '(info|debug|warn|error)!.*[A-Z]+-[0-9]+' src/

Pre-commit Hook

Add to .git/hooks/pre-commit:

#!/bin/bash
./scripts/audit-logging.sh --check || {
    echo "❌ Logging standards violation detected"
    echo "Run: ./scripts/audit-logging.sh for details"
    exit 1
}

Migration Guide

For Existing Code

  1. Remove emojis: Use find/replace
  2. Remove all-caps prefixes: Simple cleanup
  3. Add correlation_id: Extract from context
  4. Reorder fields: correlation_id first
  5. Shorten messages: Remove redundancy
  6. Verify log levels: Lifecycle = INFO, diagnostics = DEBUG

For New Code

  1. Always include correlation_id when context available
  2. Use #[instrument] for significant functions
  3. Follow field ordering: correlation_id, IDs, operation, measurements, errors
  4. Keep messages concise: Under 10 words
  5. Choose appropriate level: ERROR (fatal), WARN (degraded), INFO (lifecycle), DEBUG (diagnostic)

FAQ

Q: Should I use info! or debug! for step enqueueing? A: info! - It’s a significant lifecycle event even if frequent.

Q: When should I add duration_ms? A: For any operation that:

  • Calls external systems (DB, queue)
  • Is in the hot path
  • Takes >10ms typically
  • Needs performance monitoring

Q: Can I use emojis in error messages? A: No. Never use emojis in any log message. They break parsers and are unprofessional.

Q: Should correlation_id really always be first? A: Yes. This enables easy correlation across all logs. It’s the #1 most important field for distributed tracing.

Q: What about ticket references in module docs? A: Acceptable in module-level documentation for architectural context. Remove from runtime logs and inline comments.

Q: Can I include stack traces in logs? A: Use error = %e which includes the error chain. Only add explicit backtrace for truly exceptional cases.




Document End

This is a living document. Propose changes via PR with rationale.

OpenTelemetry Metrics Reference

Status: ✅ Complete Export Interval: 60 seconds OTLP Endpoint: http://localhost:4317 Grafana UI: http://localhost:3000

This document provides a complete reference for all OpenTelemetry metrics instrumented in the Tasker orchestration system.



Overview

The Tasker system exports 47+ OpenTelemetry metrics across 5 domains:

| Domain | Metrics | Description |
|--------|---------|-------------|
| Orchestration | 11 | Task lifecycle, step coordination, finalization |
| Worker | 10 | Step execution, claiming, result submission |
| Resilience | 8+ | Circuit breakers, MPSC channels |
| Database | 7 | SQL query performance, connection pools |
| Messaging | 11 | PGMQ queue operations, message processing |

All metrics include correlation_id labels for distributed tracing correlation with Tempo traces.

Histogram Metric Naming

OpenTelemetry automatically appends _milliseconds to histogram metric names when the unit is specified as ms. This provides clarity in Prometheus queries.

Pattern: metric_name → metric_name_milliseconds_{bucket,sum,count}

Example:

  • Code: tasker.step.execution.duration with unit “ms”
  • Prometheus: tasker_step_execution_duration_milliseconds_*

Query Patterns: Instant vs Rate-Based

Instant/Recent Data Queries - Use these when:

  • Testing with burst/batch task execution
  • Viewing data from recent runs (last few minutes)
  • Data points are sparse or clustered together
  • You want simple averages without time windows

Rate-Based Queries - Use these when:

  • Continuous production monitoring
  • Data flows steadily over time
  • Calculating per-second rates
  • Building alerting rules

Why the difference matters: The rate() function calculates a per-second rate of change over a time window, which requires data points spread across that window. If you run 26 tasks in quick succession, the data points cluster at essentially one timestamp, and rate() returns no data because there are too few samples across the window to compute a rate.
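For example, both forms of the task completion counter query (documented in full below) are valid, but they behave differently on sparse data:

# Instant: current counter value, returns data immediately after a burst run
tasker_tasks_completions_total

# Rate-based: per-second completion rate, needs samples spread across the 5m window
rate(tasker_tasks_completions_total[5m])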


Configuration

Enable OpenTelemetry

File: config/tasker/environments/development/telemetry.toml

[telemetry]
enabled = true
service_name = "tasker-core-dev"
sample_rate = 1.0

[telemetry.opentelemetry]
enabled = true  # Must be true to export metrics

Verify in Logs

# Check the orchestration/worker logs for the telemetry init line:
grep "opentelemetry_enabled" tmp/*.log

# Should see:
# opentelemetry_enabled=true
# NOT: Metrics collection disabled (TELEMETRY_ENABLED=false)

Orchestration Metrics

Module: tasker-shared/src/metrics/orchestration.rs Instrumentation: tasker-orchestration/src/orchestration/lifecycle/*.rs

Counters

tasker.tasks.requests.total

Description: Total number of task creation requests received Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • task_type: Task name (e.g., “mathematical_sequence”)
  • namespace: Task namespace (e.g., “rust_e2e_linear”)

Instrumented In: task_initializer.rs:start_task_initialization()

Example Query:

# Total task requests
tasker_tasks_requests_total

# By namespace
sum by (namespace) (tasker_tasks_requests_total)

# Specific correlation_id
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Output: (To be verified)


tasker.tasks.completions.total

Description: Total number of tasks that completed successfully Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID

Instrumented In: task_finalizer.rs:finalize_task() (FinalizationAction::Completed)

Example Query:

# Total completions
tasker_tasks_completions_total

# Completion rate over 5 minutes
rate(tasker_tasks_completions_total[5m])

Expected Output: (To be verified)


tasker.tasks.failures.total

Description: Total number of tasks that failed Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID

Instrumented In: task_finalizer.rs:finalize_task() (FinalizationAction::Failed)

Example Query:

# Total failures
tasker_tasks_failures_total

# Error rate over 5 minutes
rate(tasker_tasks_failures_total[5m])

Expected Output: (To be verified)


tasker.steps.enqueued.total

Description: Total number of steps enqueued to worker queues Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • namespace: Task namespace
  • step_name: Name of the enqueued step

Instrumented In: step_enqueuer.rs:enqueue_steps()

Example Query:

# Total steps enqueued
tasker_steps_enqueued_total

# By step name
sum by (step_name) (tasker_steps_enqueued_total)

# For specific task
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Output: (To be verified)


tasker.step_results.processed.total

Description: Total number of step results processed from workers Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • result_type: “success”, “error”, “timeout”, “cancelled”, “skipped”

Instrumented In: result_processor.rs:process_step_result()

Example Query:

# Total results processed
tasker_step_results_processed_total

# By result type
sum by (result_type) (tasker_step_results_processed_total)

# Success rate
rate(tasker_step_results_processed_total{result_type="success"}[5m])

Expected Output: (To be verified)


Histograms

tasker.task.initialization.duration

Description: Task initialization duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_task_initialization_duration_milliseconds_bucket
  • tasker_task_initialization_duration_milliseconds_sum
  • tasker_task_initialization_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • task_type: Task name

Instrumented In: task_initializer.rs:start_task_initialization()

Example Queries:

Instant/Recent Data (works immediately after task execution):

# Simple average initialization time
tasker_task_initialization_duration_milliseconds_sum /
tasker_task_initialization_duration_milliseconds_count

# P95 latency
histogram_quantile(0.95, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

# P99 latency
histogram_quantile(0.99, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

Rate-Based (for continuous monitoring, requires data spread over time):

# Average initialization time over 5 minutes
rate(tasker_task_initialization_duration_milliseconds_sum[5m]) /
rate(tasker_task_initialization_duration_milliseconds_count[5m])

# P95 latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))

Expected Output: ✅ Verified - Returns millisecond values


tasker.task.finalization.duration

Description: Task finalization duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_task_finalization_duration_milliseconds_bucket
  • tasker_task_finalization_duration_milliseconds_sum
  • tasker_task_finalization_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • final_state: “complete”, “error”, “cancelled”

Instrumented In: task_finalizer.rs:finalize_task()

Example Queries:

Instant/Recent Data:

# Simple average finalization time
tasker_task_finalization_duration_milliseconds_sum /
tasker_task_finalization_duration_milliseconds_count

# P95 by final state
histogram_quantile(0.95,
  sum by (final_state, le) (
    tasker_task_finalization_duration_milliseconds_bucket
  )
)

Rate-Based:

# Average finalization time over 5 minutes
rate(tasker_task_finalization_duration_milliseconds_sum[5m]) /
rate(tasker_task_finalization_duration_milliseconds_count[5m])

# P95 by final state over 5 minutes
histogram_quantile(0.95,
  sum by (final_state, le) (
    rate(tasker_task_finalization_duration_milliseconds_bucket[5m])
  )
)

Expected Output: ✅ Verified - Returns millisecond values


tasker.step_result.processing.duration

Description: Step result processing duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_step_result_processing_duration_milliseconds_bucket
  • tasker_step_result_processing_duration_milliseconds_sum
  • tasker_step_result_processing_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • result_type: “success”, “error”, “timeout”, “cancelled”, “skipped”

Instrumented In: result_processor.rs:process_step_result()

Example Queries:

Instant/Recent Data:

# Simple average result processing time
tasker_step_result_processing_duration_milliseconds_sum /
tasker_step_result_processing_duration_milliseconds_count

# P50, P95, P99 latencies
histogram_quantile(0.50, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
histogram_quantile(0.95, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
histogram_quantile(0.99, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))

Rate-Based:

# Average result processing time over 5 minutes
rate(tasker_step_result_processing_duration_milliseconds_sum[5m]) /
rate(tasker_step_result_processing_duration_milliseconds_count[5m])

# P95 latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_step_result_processing_duration_milliseconds_bucket[5m])))

Expected Output: ✅ Verified - Returns millisecond values


Gauges

tasker.tasks.active

Description: Number of tasks currently being processed Type: Gauge (u64) Labels:

  • state: Current task state

Status: Planned (not yet instrumented)


tasker.steps.ready

Description: Number of steps ready for execution Type: Gauge (u64) Labels:

  • namespace: Worker namespace

Status: Planned (not yet instrumented)


Worker Metrics

Module: tasker-shared/src/metrics/worker.rs Instrumentation: tasker-worker/src/worker/*.rs

Counters

tasker.steps.executions.total

Description: Total number of step executions attempted Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID

Instrumented In: command_processor.rs:handle_execute_step()

Example Query:

# Total step executions
tasker_steps_executions_total

# Execution rate
rate(tasker_steps_executions_total[5m])

# For specific task
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Output: (To be verified)


tasker.steps.successes.total

Description: Total number of step executions that completed successfully Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • namespace: Worker namespace

Instrumented In: command_processor.rs:handle_execute_step() (success path)

Example Query:

# Total successes
tasker_steps_successes_total

# Success rate
rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])

# By namespace
sum by (namespace) (tasker_steps_successes_total)

Expected Output: (To be verified)


tasker.steps.failures.total

Description: Total number of step executions that failed Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • namespace: Worker namespace (or “unknown” for early failures)
  • error_type: “claim_failed”, “database_error”, “step_not_found”, “message_deletion_failed”

Instrumented In: command_processor.rs:handle_execute_step() (error paths)

Example Query:

# Total failures
tasker_steps_failures_total

# Failure rate
rate(tasker_steps_failures_total[5m]) / rate(tasker_steps_executions_total[5m])

# By error type
sum by (error_type) (tasker_steps_failures_total)

# Error distribution
topk(5, sum by (error_type) (tasker_steps_failures_total))

Expected Output: (To be verified)


tasker.steps.claimed.total

Description: Total number of steps claimed from queues Type: Counter (u64) Labels:

  • namespace: Worker namespace
  • claim_method: “event”, “poll”

Instrumented In: step_claim.rs:try_claim_step()

Example Query:

# Total claims
tasker_steps_claimed_total

# By claim method
sum by (claim_method) (tasker_steps_claimed_total)

# Claim rate
rate(tasker_steps_claimed_total[5m])

Expected Output: (To be verified)


tasker.steps.results_submitted.total

Description: Total number of step results submitted to orchestration Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • result_type: “completion”

Instrumented In: orchestration_result_sender.rs:send_completion()

Example Query:

# Total submissions
tasker_steps_results_submitted_total

# Submission rate
rate(tasker_steps_results_submitted_total[5m])

# For specific task
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Output: (To be verified)


Histograms

tasker.step.execution.duration

Description: Step execution duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_step_execution_duration_milliseconds_bucket
  • tasker_step_execution_duration_milliseconds_sum
  • tasker_step_execution_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • namespace: Worker namespace
  • result: “success”, “error”

Instrumented In: command_processor.rs:handle_execute_step()

Example Queries:

Instant/Recent Data:

# Simple average execution time
tasker_step_execution_duration_milliseconds_sum /
tasker_step_execution_duration_milliseconds_count

# P95 latency by namespace
histogram_quantile(0.95,
  sum by (namespace, le) (
    tasker_step_execution_duration_milliseconds_bucket
  )
)

# P99 latency
histogram_quantile(0.99, sum by (le) (tasker_step_execution_duration_milliseconds_bucket))

Rate-Based:

# Average execution time over 5 minutes
rate(tasker_step_execution_duration_milliseconds_sum[5m]) /
rate(tasker_step_execution_duration_milliseconds_count[5m])

# P95 latency by namespace over 5 minutes
histogram_quantile(0.95,
  sum by (namespace, le) (
    rate(tasker_step_execution_duration_milliseconds_bucket[5m])
  )
)

Expected Output: ✅ Verified - Returns millisecond values


tasker.step.claim.duration

Description: Step claiming duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_step_claim_duration_milliseconds_bucket
  • tasker_step_claim_duration_milliseconds_sum
  • tasker_step_claim_duration_milliseconds_count

Labels:

  • namespace: Worker namespace
  • claim_method: “event”, “poll”

Instrumented In: step_claim.rs:try_claim_step()

Example Queries:

Instant/Recent Data:

# Simple average claim time
tasker_step_claim_duration_milliseconds_sum /
tasker_step_claim_duration_milliseconds_count

# Compare event vs poll claiming (P95)
histogram_quantile(0.95,
  sum by (claim_method, le) (
    tasker_step_claim_duration_milliseconds_bucket
  )
)

Rate-Based:

# Average claim time over 5 minutes
rate(tasker_step_claim_duration_milliseconds_sum[5m]) /
rate(tasker_step_claim_duration_milliseconds_count[5m])

# P95 by claim method over 5 minutes
histogram_quantile(0.95,
  sum by (claim_method, le) (
    rate(tasker_step_claim_duration_milliseconds_bucket[5m])
  )
)

Expected Output: ✅ Verified - Returns millisecond values


tasker.step_result.submission.duration

Description: Step result submission duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_step_result_submission_duration_milliseconds_bucket
  • tasker_step_result_submission_duration_milliseconds_sum
  • tasker_step_result_submission_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • result_type: “completion”

Instrumented In: orchestration_result_sender.rs:send_completion()

Example Queries:

Instant/Recent Data:

# Simple average submission time
tasker_step_result_submission_duration_milliseconds_sum /
tasker_step_result_submission_duration_milliseconds_count

# P95 submission latency
histogram_quantile(0.95, sum by (le) (tasker_step_result_submission_duration_milliseconds_bucket))

Rate-Based:

# Average submission time over 5 minutes
rate(tasker_step_result_submission_duration_milliseconds_sum[5m]) /
rate(tasker_step_result_submission_duration_milliseconds_count[5m])

# P95 submission latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_step_result_submission_duration_milliseconds_bucket[5m])))

Expected Output: ✅ Verified - Returns millisecond values


Gauges

tasker.steps.active_executions

Description: Number of steps currently being executed Type: Gauge (u64) Labels:

  • namespace: Worker namespace
  • handler_type: “rust”, “ruby”

Status: Defined but not actively instrumented (gauge tracking removed during implementation)


tasker.queue.depth

Description: Current queue depth per namespace Type: Gauge (u64) Labels:

  • namespace: Worker namespace

Status: Planned (not yet instrumented)


Resilience Metrics

Module: tasker-shared/src/metrics/worker.rs, tasker-orchestration/src/web/circuit_breaker.rs Instrumentation: Circuit breakers, MPSC channels Related Docs: Circuit Breakers | Backpressure Architecture

Circuit Breaker Metrics

Circuit breakers provide fault isolation and cascade prevention. These metrics track breaker state transitions and related operations.

api_circuit_breaker_state

Description: Current state of the web API database circuit breaker Type: Gauge (i64) Values: 0=Closed, 1=Half-Open, 2=Open Labels: None

Instrumented In: tasker-orchestration/src/web/circuit_breaker.rs

Example Queries:

# Current state
api_circuit_breaker_state

# Alert when open
api_circuit_breaker_state == 2

tasker_circuit_breaker_state

Description: Per-component circuit breaker state Type: Gauge (i64) Values: 0=Closed, 1=Half-Open, 2=Open Labels:

  • component: Circuit breaker name (e.g., “ffi_completion”, “task_readiness”, “pgmq”)

Instrumented In: Various circuit breaker implementations

Example Queries:

# All circuit breaker states
tasker_circuit_breaker_state

# Check specific component
tasker_circuit_breaker_state{component="ffi_completion"}

# Count open breakers
count(tasker_circuit_breaker_state == 2)

api_requests_rejected_total

Description: Total API requests rejected due to open circuit breaker Type: Counter (u64) Labels:

  • endpoint: The rejected endpoint path

Instrumented In: tasker-orchestration/src/web/circuit_breaker.rs

Example Queries:

# Total rejections
api_requests_rejected_total

# Rejection rate
rate(api_requests_rejected_total[5m])

# By endpoint
sum by (endpoint) (api_requests_rejected_total)

ffi_completion_slow_sends_total

Description: FFI completion channel sends exceeding latency threshold (100ms default) Type: Counter (u64) Labels: None

Instrumented In: tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs

Example Queries:

# Total slow sends
ffi_completion_slow_sends_total

# Slow send rate (alerts at >10/sec)
rate(ffi_completion_slow_sends_total[5m]) > 10

Alert Threshold: Warning when rate exceeds 10/second for 2 minutes


ffi_completion_circuit_open_rejections_total

Description: FFI completion operations rejected due to open circuit breaker Type: Counter (u64) Labels: None

Instrumented In: tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs

Example Queries:

# Total rejections
ffi_completion_circuit_open_rejections_total

# Rejection rate
rate(ffi_completion_circuit_open_rejections_total[5m])

MPSC Channel Metrics

Bounded MPSC channels provide backpressure control. These metrics track channel utilization and overflow events.

mpsc_channel_usage_percent

Description: Current fill percentage of a bounded MPSC channel Type: Gauge (f64) Labels:

  • channel: Channel name (e.g., “orchestration_command”, “pgmq_notifications”)
  • component: Owning component

Instrumented In: Channel monitor integration points

Example Queries:

# All channel usage
mpsc_channel_usage_percent

# High usage channels
mpsc_channel_usage_percent > 80

# By component
max by (component) (mpsc_channel_usage_percent)

Alert Thresholds:

  • Warning: > 80% for 15 minutes
  • Critical: > 90% for 5 minutes

mpsc_channel_capacity

Description: Configured buffer capacity for an MPSC channel Type: Gauge (u64) Labels:

  • channel: Channel name
  • component: Owning component

Instrumented In: Channel monitor initialization

Example Queries:

# All channel capacities
mpsc_channel_capacity

# Compare usage to capacity
mpsc_channel_usage_percent / 100 * mpsc_channel_capacity

mpsc_channel_full_events_total

Description: Count of channel overflow events (backpressure applied) Type: Counter (u64) Labels:

  • channel: Channel name
  • component: Owning component

Instrumented In: Channel send operations with backpressure handling

Example Queries:

# Total overflow events
mpsc_channel_full_events_total

# Overflow rate
rate(mpsc_channel_full_events_total[5m])

# By channel
sum by (channel) (mpsc_channel_full_events_total)

Alert Threshold: Any overflow events indicate backpressure is occurring
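A simple alert expression for this condition (the 10-minute window is a suggestion, not a project standard):

# Fire if any channel overflowed in the last 10 minutes
increase(mpsc_channel_full_events_total[10m]) > 0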


Resilience Dashboard Panels

Circuit Breaker State Timeline:

# Panel: Time series with state mapping
api_circuit_breaker_state
# Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)

FFI Completion Health:

# Panel: Multi-stat showing slow sends and rejections
rate(ffi_completion_slow_sends_total[5m])
rate(ffi_completion_circuit_open_rejections_total[5m])

Channel Saturation Overview:

# Panel: Gauge showing max channel usage
max(mpsc_channel_usage_percent)
# Thresholds: Green < 70%, Yellow < 90%, Red >= 90%

Backpressure Events:

# Panel: Time series of overflow rate
rate(mpsc_channel_full_events_total[5m])

Database Metrics

Module: tasker-shared/src/metrics/database.rs Status: ⚠️ Defined but not yet instrumented

Planned Metrics

  • tasker.sql.queries.total - Counter
  • tasker.sql.query.duration - Histogram
  • tasker.db.pool.connections_active - Gauge
  • tasker.db.pool.connections_idle - Gauge
  • tasker.db.pool.wait_duration - Histogram
  • tasker.db.transactions.total - Counter
  • tasker.db.transaction.duration - Histogram

Messaging Metrics

Module: tasker-shared/src/metrics/messaging.rs Status: ⚠️ Defined but not yet instrumented

Planned Metrics

  • tasker.queue.messages_sent.total - Counter
  • tasker.queue.messages_received.total - Counter
  • tasker.queue.messages_deleted.total - Counter
  • tasker.queue.message_send.duration - Histogram
  • tasker.queue.message_receive.duration - Histogram
  • tasker.queue.depth - Gauge
  • tasker.queue.age_seconds - Gauge
  • tasker.queue.visibility_timeouts.total - Counter
  • tasker.queue.errors.total - Counter
  • tasker.queue.retry_attempts.total - Counter

Note: Circuit breaker metrics (including queue-related circuit breakers) are documented in the Resilience Metrics section.


Example Queries

Task Execution Flow

Complete task execution for a specific correlation_id:

# 1. Task creation
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 2. Steps enqueued
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 3. Steps executed
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 4. Steps succeeded
tasker_steps_successes_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 5. Results submitted
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 6. Results processed
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 7. Task completed
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Flow: 1 → N → N → N → N → N → 1 (where N = number of steps)


Performance Analysis

Task initialization latency percentiles:

Instant/Recent Data:

# P50 (median)
histogram_quantile(0.50, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

# P95
histogram_quantile(0.95, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

# P99
histogram_quantile(0.99, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

Rate-Based (continuous monitoring):

# P50 (median)
histogram_quantile(0.50, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))

# P95
histogram_quantile(0.95, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))

# P99
histogram_quantile(0.99, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))

Step execution latency by namespace:

Instant/Recent Data:

histogram_quantile(0.95,
  sum by (namespace, le) (
    tasker_step_execution_duration_milliseconds_bucket
  )
)

Rate-Based:

histogram_quantile(0.95,
  sum by (namespace, le) (
    rate(tasker_step_execution_duration_milliseconds_bucket[5m])
  )
)

End-to-end task duration (from request to completion):

This requires combining initialization + step execution + finalization durations. Use the simple average approach for instant data:

# Average task initialization
tasker_task_initialization_duration_milliseconds_sum /
tasker_task_initialization_duration_milliseconds_count

# Average step execution
tasker_step_execution_duration_milliseconds_sum /
tasker_step_execution_duration_milliseconds_count

# Average finalization
tasker_task_finalization_duration_milliseconds_sum /
tasker_task_finalization_duration_milliseconds_count
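If a single number is wanted, the three averages can be added in one expression. Note this is only a rough approximation: it uses the per-step execution average (not the per-task total) and ignores queue wait time.

# sum() collapses labels so the three ratios share a label set and can be added
(sum(tasker_task_initialization_duration_milliseconds_sum) / sum(tasker_task_initialization_duration_milliseconds_count))
+
(sum(tasker_step_execution_duration_milliseconds_sum) / sum(tasker_step_execution_duration_milliseconds_count))
+
(sum(tasker_task_finalization_duration_milliseconds_sum) / sum(tasker_task_finalization_duration_milliseconds_count))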

Error Rate Monitoring

Overall step failure rate:

rate(tasker_steps_failures_total[5m]) /
rate(tasker_steps_executions_total[5m])

Error distribution by type:

topk(5, sum by (error_type) (tasker_steps_failures_total))

Task failure rate:

rate(tasker_tasks_failures_total[5m]) /
(rate(tasker_tasks_completions_total[5m]) + rate(tasker_tasks_failures_total[5m]))

Throughput Monitoring

Task request rate:

rate(tasker_tasks_requests_total[1m])
rate(tasker_tasks_requests_total[5m])
rate(tasker_tasks_requests_total[15m])

Step execution throughput:

sum(rate(tasker_steps_executions_total[5m]))

Step completion rate (successes + failures):

sum(rate(tasker_steps_successes_total[5m])) +
sum(rate(tasker_steps_failures_total[5m]))

Dashboard Recommendations

Task Execution Overview Dashboard

Panels:

  1. Task Request Rate

    • Query: rate(tasker_tasks_requests_total[5m])
    • Visualization: Time series graph
  2. Task Completion Rate

    • Query: rate(tasker_tasks_completions_total[5m])
    • Visualization: Time series graph
  3. Task Success/Failure Ratio

    • Query: Two series
      • Completions: rate(tasker_tasks_completions_total[5m])
      • Failures: rate(tasker_tasks_failures_total[5m])
    • Visualization: Stacked area chart
  4. Task Initialization Latency (P95)

    • Query: histogram_quantile(0.95, rate(tasker_task_initialization_duration_milliseconds_bucket[5m]))
    • Visualization: Time series graph
  5. Steps Enqueued vs Executed

    • Query: Two series
      • Enqueued: rate(tasker_steps_enqueued_total[5m])
      • Executed: rate(tasker_steps_executions_total[5m])
    • Visualization: Time series graph

Worker Performance Dashboard

Panels:

  1. Step Execution Throughput by Namespace

    • Query: sum by (namespace) (rate(tasker_steps_executions_total[5m]))
    • Visualization: Time series graph (multi-series)
  2. Step Success Rate

    • Query: rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])
    • Visualization: Gauge (0-1 scale)
  3. Step Execution Latency Percentiles

    • Query: Three series
      • P50: histogram_quantile(0.50, rate(tasker_step_execution_duration_milliseconds_bucket[5m]))
      • P95: histogram_quantile(0.95, rate(tasker_step_execution_duration_milliseconds_bucket[5m]))
      • P99: histogram_quantile(0.99, rate(tasker_step_execution_duration_milliseconds_bucket[5m]))
    • Visualization: Time series graph
  4. Step Claiming Performance (Event vs Poll)

    • Query: histogram_quantile(0.95, sum by (claim_method, le) (rate(tasker_step_claim_duration_milliseconds_bucket[5m])))
    • Visualization: Time series graph
  5. Error Distribution by Type

    • Query: sum by (error_type) (rate(tasker_steps_failures_total[5m]))
    • Visualization: Pie chart or bar chart

System Health Dashboard

Panels:

  1. Overall Task Success Rate

    • Query: rate(tasker_tasks_completions_total[5m]) / (rate(tasker_tasks_completions_total[5m]) + rate(tasker_tasks_failures_total[5m]))
    • Visualization: Stat panel with thresholds (green > 0.95, yellow > 0.90, red < 0.90)
  2. Step Failure Rate

    • Query: rate(tasker_steps_failures_total[5m]) / rate(tasker_steps_executions_total[5m])
    • Visualization: Stat panel with thresholds
  3. Average Task End-to-End Duration

    • Query: Combination of initialization, execution, and finalization durations
    • Visualization: Time series graph
  4. Result Processing Latency

    • Query: rate(tasker_step_result_processing_duration_milliseconds_sum[5m]) / rate(tasker_step_result_processing_duration_milliseconds_count[5m])
    • Visualization: Time series graph
  5. Active Operations

    • Query: Currently not instrumented (gauges removed)
    • Status: Planned future enhancement

Verification Checklist

Use this checklist to verify metrics are working correctly:

Prerequisites

  • telemetry.opentelemetry.enabled = true in development config
  • Services restarted after config change
  • Logs show opentelemetry_enabled=true
  • Grafana LGTM container running on ports 3000, 4317
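If the LGTM container is not already running, one typical way to start it (assuming Docker and the grafana/otel-lgtm all-in-one image) is:

docker run -d --name lgtm \
  -p 3000:3000 -p 4317:4317 -p 4318:4318 \
  grafana/otel-lgtm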

Basic Verification

  • At least one task created via CLI
  • Correlation ID captured from task creation
  • Trace visible in Grafana Tempo for correlation ID

Orchestration Metrics

  • tasker_tasks_requests_total returns non-zero
  • tasker_steps_enqueued_total returns expected step count
  • tasker_step_results_processed_total returns expected result count
  • tasker_tasks_completions_total increments on success
  • tasker_task_initialization_duration_milliseconds_bucket has histogram data

Worker Metrics

  • tasker_steps_executions_total returns non-zero
  • tasker_steps_successes_total matches successful steps
  • tasker_steps_claimed_total returns expected claims
  • tasker_steps_results_submitted_total matches result submissions
  • tasker_step_execution_duration_milliseconds_bucket has histogram data

Resilience Metrics

  • api_circuit_breaker_state returns 0 (Closed) during normal operation
  • /health/detailed endpoint shows circuit breaker states
  • mpsc_channel_usage_percent returns values < 80% (no saturation)
  • mpsc_channel_full_events_total is 0 or very low (no backpressure)
  • FFI workers: ffi_completion_slow_sends_total is near zero

Correlation

  • All metrics filterable by correlation_id
  • Correlation ID in metrics matches trace ID in Tempo
  • Complete execution flow visible from request to completion

Troubleshooting

No Metrics Appearing

Check 1: OpenTelemetry enabled

grep "opentelemetry_enabled" tmp/*.log
# Should show: opentelemetry_enabled=true

Check 2: OTLP endpoint accessible

curl -v http://localhost:4317 2>&1 | grep Connected
# Should show: Connected to localhost (127.0.0.1) port 4317

Check 3: Grafana LGTM running

curl -s http://localhost:3000/api/health | jq
# Should return healthy status

Check 4: Wait for export interval (60 seconds)

Metrics are batched and exported every 60 seconds. Wait at least 1 minute after task execution.

Metrics Missing Labels

If correlation_id or other labels are missing, check:

  • Logs for correlation_id field presence
  • Metric instrumentation includes KeyValue::new() calls
  • Labels match between metric definition and usage

Histogram Buckets Empty

If histogram queries return no data:

  • Verify histogram is initialized: check logs for metric initialization
  • Ensure duration values are non-zero and reasonable
  • Check that record() is called, not add() for histograms
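A sketch of both checks in one place (metric names taken from this reference; exact builder method names may differ slightly between opentelemetry crate versions):

use opentelemetry::{global, KeyValue};

fn record_step_metrics(correlation_id: &str, duration_ms: f64) {
    let meter = global::meter("tasker");

    // Counters: add() with KeyValue attributes so correlation_id shows up as a label
    let executions = meter
        .u64_counter("tasker.steps.executions.total")
        .with_description("Total number of step executions attempted")
        .build();
    executions.add(1, &[KeyValue::new("correlation_id", correlation_id.to_string())]);

    // Histograms: record(), not add(), otherwise no bucket data is exported
    let duration = meter
        .f64_histogram("tasker.step.execution.duration")
        .with_description("Step execution duration in milliseconds")
        .build();
    duration.record(
        duration_ms,
        &[KeyValue::new("correlation_id", correlation_id.to_string())],
    );
}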

Next Steps

Phase 3.4 (Future)

  • Instrument database metrics (7 metrics)
  • Instrument messaging metrics (11 metrics)
  • Add gauge tracking for active operations
  • Implement queue depth monitoring

Production Readiness

  • Create alert rules for error rates
  • Set up automated dashboards
  • Configure metric retention policies
  • Add metric aggregation for long-term storage

Last Updated: 2025-12-10 Test Task: mathematical_sequence (correlation_id: 0199c3e0-ccdb-7581-87ab-3f67daeaa4a5) Status: All orchestration and worker metrics verified and producing data ✅

Recent Updates:

  • 2025-12-10: Added Resilience Metrics section (circuit breakers, MPSC channels)
  • 2025-10-08: Initial metrics verification completed

Metrics Verification Guide

Purpose: Verify that documented metrics queries work with actual system data Test Task: mathematical_sequence Correlation ID: 0199c3e0-ccdb-7581-87ab-3f67daeaa4a5 Task ID: 0199c3e0-ccea-70f0-b6ae-3086b2f68280 Trace ID: d640f82572e231322edba0a5ef6e1405

How to Use This Guide

  1. Open Grafana at http://localhost:3000
  2. Navigate to Explore (compass icon in sidebar)
  3. Select Prometheus as the data source
  4. Copy each query below into the query editor
  5. Record the actual output
  6. Mark ✅ if query works, ❌ if it fails, or ⚠️ if partial data

Orchestration Metrics Verification

1. Task Requests Counter

Metric: tasker.tasks.requests.total

Query 1: Basic counter

tasker_tasks_requests_total

Expected: At least 1 (for our test task) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Filtered by correlation_id

tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Exactly 1 Actual Result: _____________ Labels Present: [ ] correlation_id [ ] task_type [ ] namespace Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: Sum by namespace

sum by (namespace) (tasker_tasks_requests_total)

Expected: 1 for namespace “rust_e2e_linear” Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


2. Task Completions Counter

Metric: tasker.tasks.completions.total

Query 1: Basic counter

tasker_tasks_completions_total

Expected: At least 1 (if task completed successfully) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Filtered by correlation_id

tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: Completion rate over 5 minutes

rate(tasker_tasks_completions_total[5m])

Expected: Some positive rate value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


3. Steps Enqueued Counter

Metric: tasker.steps.enqueued.total

Query 1: Total steps enqueued for our task

tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Number of steps in mathematical_sequence workflow (likely 3-4 steps) Actual Result: _____________ Step Names Visible: [ ] Yes [ ] No Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Sum by step name

sum by (step_name) (tasker_steps_enqueued_total)

Expected: Breakdown by step name Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


4. Step Results Processed Counter

Metric: tasker.step_results.processed.total

Query 1: Results processed for our task

tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Same as steps enqueued Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Sum by result type

sum by (result_type) (tasker_step_results_processed_total)

Expected: Breakdown showing “success” results Actual Result: _____________ Result Types Visible: [ ] success [ ] error [ ] timeout [ ] cancelled [ ] skipped Status: [ ] ✅ [ ] ❌ [ ] ⚠️


5. Task Initialization Duration Histogram

Metric: tasker.task.initialization.duration

Query 1: Check if histogram has data

tasker_task_initialization_duration_count

Expected: At least 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average initialization time

rate(tasker_task_initialization_duration_sum[5m]) /
rate(tasker_task_initialization_duration_count[5m])

Expected: Some millisecond value (probably < 100ms) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 latency

histogram_quantile(0.95, rate(tasker_task_initialization_duration_bucket[5m]))

Expected: P95 millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 4: P99 latency

histogram_quantile(0.99, rate(tasker_task_initialization_duration_bucket[5m]))

Expected: P99 millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


6. Task Finalization Duration Histogram

Metric: tasker.task.finalization.duration

Query 1: Check count

tasker_task_finalization_duration_count

Expected: At least 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average finalization time

rate(tasker_task_finalization_duration_sum[5m]) /
rate(tasker_task_finalization_duration_count[5m])

Expected: Some millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 by final_state

histogram_quantile(0.95,
  sum by (final_state, le) (
    rate(tasker_task_finalization_duration_bucket[5m])
  )
)

Expected: P95 value grouped by final_state (likely “complete”) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


7. Step Result Processing Duration Histogram

Metric: tasker.step_result.processing.duration

Query 1: Check count

tasker_step_result_processing_duration_count

Expected: Number of steps processed Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average processing time

rate(tasker_step_result_processing_duration_sum[5m]) /
rate(tasker_step_result_processing_duration_count[5m])

Expected: Millisecond value for result processing Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Worker Metrics Verification

8. Step Executions Counter

Metric: tasker.steps.executions.total

Query 1: Total executions

tasker_steps_executions_total

Expected: Number of steps in workflow Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: For specific task

tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Number of steps executed Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: Execution rate

rate(tasker_steps_executions_total[5m])

Expected: Positive rate Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


9. Step Successes Counter

Metric: tasker.steps.successes.total

Query 1: Total successes

tasker_steps_successes_total

Expected: Should equal executions if all succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: By namespace

sum by (namespace) (tasker_steps_successes_total)

Expected: Successes grouped by namespace Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: Success rate

rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])

Expected: ~1.0 (100%) if all steps succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


10. Step Failures Counter

Metric: tasker.steps.failures.total

Query 1: Total failures

tasker_steps_failures_total

Expected: 0 if all steps succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: By error type

sum by (error_type) (tasker_steps_failures_total)

Expected: No results if no failures, or breakdown by error type Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


11. Steps Claimed Counter

Metric: tasker.steps.claimed.total

Query 1: Total claims

tasker_steps_claimed_total

Expected: Number of steps claimed (should match executions) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: By claim method

sum by (claim_method) (tasker_steps_claimed_total)

Expected: Breakdown by “event” or “poll” Actual Result: _____________ Claim Methods Visible: [ ] event [ ] poll Status: [ ] ✅ [ ] ❌ [ ] ⚠️


12. Step Results Submitted Counter

Metric: tasker.steps.results_submitted.total

Query 1: Total submissions

tasker_steps_results_submitted_total

Expected: Number of steps that submitted results Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: For specific task

tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Number of step results submitted Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


13. Step Execution Duration Histogram

Metric: tasker.step.execution.duration

Query 1: Check count

tasker_step_execution_duration_count

Expected: Number of step executions Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average execution time

rate(tasker_step_execution_duration_sum[5m]) /
rate(tasker_step_execution_duration_count[5m])

Expected: Average milliseconds per step Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 latency by namespace

histogram_quantile(0.95,
  sum by (namespace, le) (
    rate(tasker_step_execution_duration_bucket[5m])
  )
)

Expected: P95 latency for rust_e2e_linear namespace Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 4: P99 latency

histogram_quantile(0.99, rate(tasker_step_execution_duration_bucket[5m]))

Expected: P99 latency value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


14. Step Claim Duration Histogram

Metric: tasker.step.claim.duration

Query 1: Check count

tasker_step_claim_duration_count

Expected: Number of claims Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average claim time

rate(tasker_step_claim_duration_sum[5m]) /
rate(tasker_step_claim_duration_count[5m])

Expected: Average milliseconds to claim Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 by claim method

histogram_quantile(0.95,
  sum by (claim_method, le) (
    rate(tasker_step_claim_duration_bucket[5m])
  )
)

Expected: P95 claim latency by method Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


15. Step Result Submission Duration Histogram

Metric: tasker.step_result.submission.duration

Query 1: Check count

tasker_step_result_submission_duration_count

Expected: Number of result submissions Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average submission time

rate(tasker_step_result_submission_duration_sum[5m]) /
rate(tasker_step_result_submission_duration_count[5m])

Expected: Average milliseconds to submit Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 submission latency

histogram_quantile(0.95, rate(tasker_step_result_submission_duration_bucket[5m]))

Expected: P95 submission latency Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Complete Execution Flow Verification

Purpose: Verify the full task lifecycle is visible in metrics

Query: Complete flow for correlation_id

# Run each query and record the value

# 1. Task created
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 2. Steps enqueued
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 3. Steps claimed
tasker_steps_claimed_total
Result: _____________

# 4. Steps executed
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 5. Steps succeeded
tasker_steps_successes_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 6. Results submitted
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 7. Results processed
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 8. Task completed
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

Expected Pattern: 1 → N → N → N → N → N → N → 1 Actual Pattern: _____________ → _____ → _____ → _____ → _____ → _____ → _____ → _____

Analysis:

  • Do the numbers make sense for your workflow? [ ] Yes [ ] No
  • Are any steps missing? _____________
  • Do counts match expectations? [ ] Yes [ ] No

Issues Found

Document any issues discovered during verification:

Issue 1

Query: _____________ Expected: _____________ Actual: _____________ Problem: _____________ Fix Required: [ ] Yes [ ] No

Issue 2

Query: _____________ Expected: _____________ Actual: _____________ Problem: _____________ Fix Required: [ ] Yes [ ] No


Summary

Total Queries Tested: _____________ Successful: _____________ ✅ Failed: _____________ ❌ Partial: _____________ ⚠️

Overall Status: [ ] All Working [ ] Some Issues [ ] Major Problems

Ready for Production: [ ] Yes [ ] No [ ] Needs Work


Verification Date: _____________ Verified By: _____________ Grafana Version: _____________ OpenTelemetry Version: 0.26

OpenTelemetry Improvements

Last Updated: 2025-12-01 Audience: Developers, Operators Status: Active Related Docs: Observability Hub | Metrics Reference | Domain Events

← Back to Observability Hub


This document describes the OpenTelemetry improvements for the domain event system, including two-phase FFI telemetry initialization, domain event metrics, and enhanced observability for the distributed event system.

Overview

These telemetry improvements support the domain event system while addressing FFI-specific challenges:

| Improvement | Purpose | Impact |
|---|---|---|
| Two-Phase FFI Telemetry | Safe telemetry in FFI workers | No segfaults during Ruby/Python bridging |
| Domain Event Metrics | Event system observability | Real-time monitoring of event publication |
| Correlation ID Propagation | End-to-end tracing | Events traceable across distributed system |
| Worker Metrics Endpoint | Domain event statistics | /metrics/events for monitoring dashboards |

Two-Phase FFI Telemetry Initialization

The Problem

When Rust workers operate with Ruby FFI bindings, OpenTelemetry’s global tracer/meter providers can cause issues:

  1. Thread Safety: Ruby’s GVL (Global VM Lock) conflicts with OpenTelemetry’s internal threading
  2. Signal Handling: OpenTelemetry’s OTLP exporter may interfere with Ruby signal handling
  3. Segfaults: Premature initialization can cause crashes during FFI boundary crossings

The Solution: Two-Phase Initialization

flowchart LR
    subgraph Phase1["Phase 1 (FFI-Safe)"]
        A[Console logger]
        B[Tracing init]
        C[No OTLP export]
        D[No global state]
    end

    subgraph Phase2["Phase 2 (Full OTel)"]
        E[OTLP exporter]
        F[Metrics export]
        G[Full tracing]
        H[Global tracer]
    end

    Phase1 -->|"After FFI bridge<br/>established"| Phase2

Worker Bootstrap Sequence:

  1. Load Rust worker library
  2. Initialize Phase 1 (console-only logging)
  3. Execute FFI bridge setup (Ruby/Python)
  4. Initialize Phase 2 (full OpenTelemetry)

Implementation

Phase 1: Console-Only Initialization (FFI-Safe):

#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs (lines 284-326)

/// Initialize console-only logging (FFI-safe, no Tokio runtime required)
///
/// This function sets up structured console logging without OpenTelemetry,
/// making it safe to call from FFI initialization contexts where no Tokio
/// runtime exists yet.
pub fn init_console_only() {
    TRACING_INITIALIZED.get_or_init(|| {
        let environment = get_environment();
        let log_level = get_log_level(&environment);

        // Determine if we're in a TTY for ANSI color support
        let use_ansi = IsTerminal::is_terminal(&std::io::stdout());

        // Create base console layer
        let console_layer = fmt::layer()
            .with_target(true)
            .with_thread_ids(true)
            .with_level(true)
            .with_ansi(use_ansi)
            .with_filter(EnvFilter::new(&log_level));

        // Build subscriber with console layer only (no telemetry)
        let subscriber = tracing_subscriber::registry().with(console_layer);

        if subscriber.try_init().is_err() {
            tracing::debug!(
                "Global tracing subscriber already initialized"
            );
        } else {
            tracing::info!(
                environment = %environment,
                opentelemetry_enabled = false,
                context = "ffi_initialization",
                "Console-only logging initialized (FFI-safe mode)"
            );
        }

        // Initialize basic metrics (no OpenTelemetry exporters)
        metrics::init_metrics();
        metrics::orchestration::init();
        metrics::worker::init();
        metrics::database::init();
        metrics::messaging::init();
    });
}
}

Phase 2: Full OpenTelemetry Initialization:

#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs (lines 361-449)

/// Initialize tracing with console output and optional OpenTelemetry
///
/// When OpenTelemetry is enabled (via TELEMETRY_ENABLED=true), it also
/// configures distributed tracing with OTLP exporter.
///
/// **IMPORTANT**: When telemetry is enabled, this function MUST be called from
/// a Tokio runtime context because the batch exporter requires async I/O.
pub fn init_tracing() {
    TRACING_INITIALIZED.get_or_init(|| {
        let environment = get_environment();
        let log_level = get_log_level(&environment);
        let telemetry_config = TelemetryConfig::default();

        // Determine if we're in a TTY for ANSI color support
        let use_ansi = IsTerminal::is_terminal(&std::io::stdout());

        // Create base console layer
        let console_layer = fmt::layer()
            .with_target(true)
            .with_thread_ids(true)
            .with_level(true)
            .with_ansi(use_ansi)
            .with_filter(EnvFilter::new(&log_level));

        // Build subscriber with optional OpenTelemetry layer
        let subscriber = tracing_subscriber::registry().with(console_layer);

        if telemetry_config.enabled {
            // Initialize OpenTelemetry tracer and logger
            match (init_opentelemetry_tracer(&telemetry_config),
                   init_opentelemetry_logger(&telemetry_config)) {
                (Ok(tracer_provider), Ok(logger_provider)) => {
                    // Add trace layer
                    let tracer = tracer_provider.tracer("tasker-core");
                    let telemetry_layer = OpenTelemetryLayer::new(tracer);

                    // Add log layer (bridge tracing logs -> OTEL logs)
                    let log_layer = OpenTelemetryTracingBridge::new(&logger_provider);

                    let subscriber = subscriber.with(telemetry_layer).with(log_layer);

                    if subscriber.try_init().is_ok() {
                        tracing::info!(
                            environment = %environment,
                            opentelemetry_enabled = true,
                            logs_enabled = true,
                            otlp_endpoint = %telemetry_config.otlp_endpoint,
                            service_name = %telemetry_config.service_name,
                            "Console logging with OpenTelemetry initialized"
                        );
                    }
                }
                // ... error handling with fallback to console-only
            }
        }
    });
}
}

Worker Bootstrap Integration:

#![allow(unused)]
fn main() {
// workers/rust/src/bootstrap.rs (lines 69-131)

pub async fn bootstrap() -> Result<(WorkerSystemHandle, RustEventHandler)> {
    info!("📋 Creating native Rust step handler registry...");
    let registry = Arc::new(RustStepHandlerRegistry::new());

    // Get global event system for connecting to worker events
    info!("🔗 Setting up event system connection...");
    let event_system = get_global_event_system();

    // Bootstrap the worker using tasker-worker foundation
    info!("🏗️  Bootstrapping worker with tasker-worker foundation...");
    let worker_handle =
        WorkerBootstrap::bootstrap_with_event_system(Some(event_system.clone())).await?;

    // Create step event publisher registry with domain event publisher
    info!("🔔 Setting up step event publisher registry...");
    let domain_event_publisher = {
        let worker_core = worker_handle.worker_core.lock().await;
        worker_core.domain_event_publisher()
    };

    // Dual-Path: Create in-process event bus for fast event delivery
    info!("⚡ Creating in-process event bus for fast domain events...");
    let in_process_bus = Arc::new(RwLock::new(InProcessEventBus::new(
        InProcessEventBusConfig::default(),
    )));

    // Dual-Path: Create event router for dual-path delivery
    info!("🔀 Creating event router for dual-path delivery...");
    let event_router = Arc::new(RwLock::new(EventRouter::new(
        domain_event_publisher.clone(),
        in_process_bus.clone(),
    )));

    // Create registry with EventRouter for dual-path delivery
    let mut step_event_registry =
        StepEventPublisherRegistry::with_event_router(
            domain_event_publisher.clone(),
            event_router
        );

    // ... remaining setup elided in this excerpt (constructs the RustEventHandler returned below) ...

    Ok((worker_handle, event_handler))
}
}

Configuration

Telemetry is configured exclusively via environment variables. This is intentional because logging must be initialized before the TOML config loader runs (to log any config loading errors).

# Enable OpenTelemetry
export TELEMETRY_ENABLED=true

# OTLP endpoint (default: http://localhost:4317)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Service identification
export OTEL_SERVICE_NAME=tasker-orchestration
export OTEL_SERVICE_VERSION=0.1.0

# Deployment environment (falls back to TASKER_ENV, then "development")
export DEPLOYMENT_ENVIRONMENT=production

# Sampling rate (0.0 to 1.0, default: 1.0 = 100%)
export OTEL_TRACES_SAMPLER_ARG=1.0

The TelemetryConfig::default() implementation in tasker-shared/src/logging.rs:144-164 reads all values from environment variables at initialization time.
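An illustrative sketch of that pattern follows (a hypothetical, trimmed struct; the real TelemetryConfig in tasker-shared/src/logging.rs carries additional fields such as sampling and service version):

use std::env;

struct TelemetryConfig {
    enabled: bool,
    otlp_endpoint: String,
    service_name: String,
}

impl Default for TelemetryConfig {
    fn default() -> Self {
        Self {
            // TELEMETRY_ENABLED=true turns on OTLP export
            enabled: env::var("TELEMETRY_ENABLED")
                .map(|v| v == "true")
                .unwrap_or(false),
            // Defaults to the local OTLP gRPC endpoint
            otlp_endpoint: env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
                .unwrap_or_else(|_| "http://localhost:4317".to_string()),
            // Fallback value here is illustrative only
            service_name: env::var("OTEL_SERVICE_NAME")
                .unwrap_or_else(|_| "tasker-core".to_string()),
        }
    }
}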

Domain Event Metrics

New Metrics

Domain event observability metrics:

| Metric | Type | Description |
|---|---|---|
| tasker.domain_events.published.total | Counter | Total events published |
| router.durable_routed | Counter | Events sent via durable path (PGMQ) |
| router.fast_routed | Counter | Events sent via fast path (in-process) |
| router.broadcast_routed | Counter | Events broadcast to both paths |

Implementation

Domain event metrics are emitted inline during publication:

#![allow(unused)]
fn main() {
// tasker-shared/src/events/domain_events.rs (lines 207-219)

// Emit OpenTelemetry metric
let counter = opentelemetry::global::meter("tasker")
    .u64_counter("tasker.domain_events.published.total")
    .with_description("Total number of domain events published")
    .build();

counter.add(
    1,
    &[
        opentelemetry::KeyValue::new("event_name", event_name.to_string()),
        opentelemetry::KeyValue::new("namespace", metadata.namespace.clone()),
    ],
);
}

Event routing statistics are tracked in the EventRouterStats and InProcessEventBusStats structures:

#![allow(unused)]
fn main() {
// tasker-shared/src/metrics/worker.rs (lines 431-444)

/// Statistics for the event router
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
pub struct EventRouterStats {
    /// Total events routed through the router
    pub total_routed: u64,
    /// Events sent via durable path (PGMQ)
    pub durable_routed: u64,
    /// Events sent via fast path (in-process)
    pub fast_routed: u64,
    /// Events broadcast to both paths
    pub broadcast_routed: u64,
    /// Fast delivery errors in broadcast mode (non-fatal, logged for monitoring)
    pub fast_delivery_errors: u64,
    /// Failed routing attempts (durable failures only)
    pub routing_errors: u64,
}

// tasker-shared/src/metrics/worker.rs (lines 455-467)

/// Statistics for the in-process event bus
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
pub struct InProcessEventBusStats {
    /// Total events dispatched through the bus
    pub total_events_dispatched: u64,
    /// Total events dispatched to Rust handlers
    pub rust_handler_dispatches: u64,
    /// Total events dispatched to FFI channel
    pub ffi_channel_dispatches: u64,
}
}

Prometheus Queries

Event publication rate by namespace:

sum by (namespace) (rate(tasker_domain_events_published_total[5m]))

Event failure rate:

rate(tasker_domain_events_failed_total[5m]) /
rate(tasker_domain_events_published_total[5m])

Publication latency (P95):

histogram_quantile(0.95,
  sum by (le) (rate(tasker_domain_events_publish_duration_milliseconds_bucket[5m]))
)

Latency by delivery mode:

histogram_quantile(0.95,
  sum by (delivery_mode, le) (
    rate(tasker_domain_events_publish_duration_milliseconds_bucket[5m])
  )
)

Worker Metrics Endpoint

/metrics/events Endpoint

The worker exposes domain event statistics through a dedicated metrics endpoint:

Request:

curl http://localhost:8081/metrics/events

Response:

{
  "router": {
    "total_routed": 42,
    "durable_routed": 10,
    "fast_routed": 30,
    "broadcast_routed": 2,
    "fast_delivery_errors": 0,
    "routing_errors": 0
  },
  "in_process_bus": {
    "total_events_dispatched": 32,
    "rust_handler_dispatches": 20,
    "ffi_channel_dispatches": 12
  },
  "captured_at": "2025-12-01T10:30:00Z",
  "worker_id": "worker-01234567"
}
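These stats are plain JSON, so they can be checked from scripts or E2E tests with standard tooling, for example:

# Count of events delivered via the fast (in-process) path
curl -s http://localhost:8081/metrics/events | jq '.router.fast_routed'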

Implementation

#![allow(unused)]
fn main() {
// tasker-worker/src/web/handlers/metrics.rs (lines 178-218)

/// Domain event statistics endpoint: GET /metrics/events
///
/// Returns statistics about domain event routing and delivery paths.
/// Used for monitoring event publishing and by E2E tests to verify
/// events were published through the expected delivery paths.
///
/// # Response
///
/// Returns statistics for:
/// - **Router stats**: durable_routed, fast_routed, broadcast_routed counts
/// - **In-process bus stats**: handler dispatches, FFI channel dispatches
pub async fn domain_event_stats(
    State(state): State<Arc<WorkerWebState>>,
) -> Json<DomainEventStats> {
    debug!("Serving domain event statistics");

    // Use cached event components - does not lock worker core
    let stats = state.domain_event_stats().await;

    Json(stats)
}
}

The DomainEventStats structure is defined in tasker-shared/src/types/web.rs:

#![allow(unused)]
fn main() {
// tasker-shared/src/types/web.rs (lines 546-555)

#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct DomainEventStats {
    /// Event router statistics
    pub router: EventRouterStats,
    /// In-process event bus statistics
    pub in_process_bus: InProcessEventBusStats,
    /// Timestamp when stats were captured
    pub captured_at: DateTime<Utc>,
    /// Worker ID for correlation
    pub worker_id: String,
}
}

Correlation ID Propagation

End-to-End Tracing

Domain events maintain correlation IDs for distributed tracing:

flowchart LR
    subgraph TaskCreation["Task Creation"]
        A[correlation_id<br/>UUIDv7]
    end

    subgraph StepExecution["Step Execution"]
        B[correlation_id<br/>propagated]
    end

    subgraph DomainEvent["Domain Event"]
        C[correlation_id<br/>in metadata]
    end

    TaskCreation --> StepExecution --> DomainEvent

    subgraph TraceContext["Trace Context"]
        D[task_uuid]
        E[step_uuid]
        F[step_name]
        G[namespace]
        H[correlation_id]
    end

Tracing Integration

The DomainEventPublisher::publish_event method uses #[instrument] for automatic span creation:

#![allow(unused)]
fn main() {
// tasker-shared/src/events/domain_events.rs (lines 157-231)

#[instrument(skip(self, payload, metadata), fields(
    event_name = %event_name,
    namespace = %metadata.namespace,
    correlation_id = %metadata.correlation_id
))]
pub async fn publish_event(
    &self,
    event_name: &str,
    payload: DomainEventPayload,
    metadata: EventMetadata,
) -> Result<Uuid, DomainEventError> {
    let event_id = Uuid::now_v7();
    let queue_name = format!("{}_domain_events", metadata.namespace);

    debug!(
        event_id = %event_id,
        event_name = %event_name,
        queue_name = %queue_name,
        task_uuid = %metadata.task_uuid,
        correlation_id = %metadata.correlation_id,
        "Publishing domain event"
    );

    // Create and serialize domain event
    let event = DomainEvent {
        event_id,
        event_name: event_name.to_string(),
        event_version: "1.0".to_string(),
        payload,
        metadata: metadata.clone(),
    };

    // Serialize for the queue payload (error conversion elided in this excerpt)
    let event_json = serde_json::to_value(&event)?;

    // Publish to PGMQ
    let message_id = self.message_client
        .send_json_message(&queue_name, &event_json)
        .await?;

    info!(
        event_id = %event_id,
        message_id = message_id,
        correlation_id = %metadata.correlation_id,
        "Domain event published successfully"
    );

    Ok(event_id)
}
}

Querying by Correlation ID

Find all events for a task:

# In Grafana/Tempo
correlation_id = "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"

In PostgreSQL (PGMQ queues):

SELECT
    message->>'event_name' as event,
    message->'metadata'->>'step_name' as step,
    message->'metadata'->>'fired_at' as fired_at
FROM pgmq.q_payments_domain_events
WHERE message->'metadata'->>'correlation_id' = '0199c3e0-ccdb-7581-87ab-3f67daeaa4a5'
ORDER BY message->'metadata'->>'fired_at';

Span Hierarchy

Domain Event Spans

A task execution that publishes domain events produces the following span tree:

Task Execution (root span)
├── Step Execution
│   ├── Handler Call
│   │   └── Business Logic
│   └── publish_domain_event           ◄── NEW
│       ├── route_event
│       │   ├── publish_durable        (if durable/broadcast)
│       │   └── publish_fast           (if fast/broadcast)
│       └── record_metrics
└── Result Submission

Span Attributes

| Span | Attributes |
|------|------------|
| publish_domain_event | event_name, namespace, correlation_id, delivery_mode |
| route_event | delivery_mode, target_queue (if durable) |
| publish_durable | queue_name, message_size |
| publish_fast | subscriber_count |
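
The child spans can be emitted with the tracing crate's span macros. A simplified, illustrative sketch only; the real instrumentation lives in tasker-shared and may differ in structure and attribute values:

use tracing::info_span;

// Sketch: emitting route_event / publish_durable / publish_fast spans with
// attribute names from the table above. Values here are placeholders.
fn route_event(delivery_mode: &str) {
    let _route = info_span!("route_event", delivery_mode = delivery_mode).entered();

    if delivery_mode == "durable" || delivery_mode == "broadcast" {
        let _durable =
            info_span!("publish_durable", queue_name = "payments_domain_events").entered();
        // ... send to PGMQ ...
    }

    if delivery_mode == "fast" || delivery_mode == "broadcast" {
        let _fast = info_span!("publish_fast", subscriber_count = 3u64).entered();
        // ... dispatch to the in-process bus ...
    }
}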

Troubleshooting

Console-Only Mode (No OTLP Export)

Symptom: Logs show “Console-only logging initialized (FFI-safe mode)” but no OpenTelemetry traces

Cause: init_console_only() was called but init_tracing() was never called, or TELEMETRY_ENABLED=false

Fix:

  1. Check initialization logs:
    grep -E "(Console-only|OpenTelemetry)" logs/worker.log
    
  2. Verify TELEMETRY_ENABLED=true is set:
    grep "opentelemetry_enabled" logs/worker.log
    

Domain Event Metrics Missing

Symptom: /metrics/events returns zeros for all stats

Cause: Events not being published or the event router/bus not tracking statistics

Fix:

  1. Verify events are being published:
    grep "Domain event published successfully" logs/worker.log
    
  2. Check event router initialization:
    grep "event router" logs/worker.log
    
  3. Verify in-process event bus is configured:
    grep "in-process event bus" logs/worker.log
    

Correlation ID Not Propagating

Symptom: Events have different correlation IDs than parent task

Cause: EventMetadata not constructed with task’s correlation_id

Fix: Verify EventMetadata is constructed with the correct correlation_id from the task:

#![allow(unused)]
fn main() {
// When constructing EventMetadata, always use the task's correlation_id
let metadata = EventMetadata {
    task_uuid: step_data.task.task.task_uuid,
    step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
    step_name: Some(step_data.workflow_step.name.clone()),
    namespace: step_data.task.namespace_name.clone(),
    correlation_id: step_data.task.task.correlation_id,  // Must use task's ID
    fired_at: chrono::Utc::now(),
    fired_by: handler_name.to_string(),
};
}

Best Practices

1. Always Use Two-Phase Init for FFI Workers

#![allow(unused)]
fn main() {
// Correct: Two-phase initialization pattern
// Phase 1: During FFI initialization (Magnus, PyO3, WASM)
tasker_shared::logging::init_console_only();

// Phase 2: After runtime creation
let runtime = tokio::runtime::Runtime::new()?;
runtime.block_on(async {
    tasker_shared::logging::init_tracing();
});

// Incorrect: Calling init_tracing() during FFI initialization
// before Tokio runtime exists (may cause issues with OTLP exporter)
}

2. Include Correlation ID in All Events

#![allow(unused)]
fn main() {
// Always propagate correlation_id from the task
let metadata = EventMetadata {
    task_uuid: step_data.task.task.task_uuid,
    step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
    step_name: Some(step_data.workflow_step.name.clone()),
    namespace: step_data.task.namespace_name.clone(),
    correlation_id: step_data.task.task.correlation_id,  // Critical!
    fired_at: chrono::Utc::now(),
    fired_by: handler_name.to_string(),
};
}

3. Use Structured Logging with Correlation Context

#![allow(unused)]
fn main() {
// All logs should include correlation_id for trace correlation
info!(
    event_id = %event_id,
    event_name = %event_name,
    correlation_id = %metadata.correlation_id,
    namespace = %metadata.namespace,
    "Domain event published successfully"
);
}


This telemetry architecture provides robust observability for domain events while ensuring safe operation with FFI-based language bindings.

Tasker Core Principles

This directory contains the core principles and design philosophy that guide Tasker Core development. These principles are not arbitrary rules but hard-won lessons extracted from implementation experience, root cause analyses, and architectural decisions.

Core Documents

| Document | Description |
|----------|-------------|
| Tasker Core Tenets | The 11 foundational principles that drive all architecture and design decisions |
| Defense in Depth | Multi-layered protection model for idempotency and data integrity |
| Fail Loudly | Why errors beat silent defaults, and phantom data breaks trust |
| Cross-Language Consistency | The “one API” philosophy for Rust, Ruby, Python, and TypeScript workers |
| Composition Over Inheritance | Mixin-based handler composition pattern |
| Intentional AI Partnership | Collaborative approach to AI integration |

Influences

| Document | Description |
|----------|-------------|
| Twelve-Factor App Alignment | How the 12-factor methodology shapes our architecture, with codebase examples and honest gap assessment |
| Zen of Python (PEP-20) | Tim Peters’ guiding principles — referenced as inspiration |

How These Principles Were Derived

These principles emerged from:

  1. Root Cause Analyses: Ownership removal revealed that “redundant protection with harmful side effects” is worse than minimal, well-understood protection
  2. Cross-Language Development: Handler harmonization established patterns for consistent APIs across four languages
  3. Architectural Migrations: Actor pattern refactoring proved the pattern’s effectiveness
  4. Production Incidents: Real bugs in parallel execution (Heisenbugs becoming Bohrbugs) shaped defensive design
  5. Protocol Trust Analysis: gRPC client refactoring exposed how silent defaults create phantom data that breaks consumer trust

When to Consult These Documents

  • Architecture Decisions: docs/decisions/ for specific ADRs
  • Historical Context: docs/CHRONOLOGY.md for development timeline
  • Implementation Details: docs/ticket-specs/ for original specifications

Composition Over Inheritance

Last Updated: 2026-01-01

This document describes Tasker Core’s approach to handler composition using mixins and traits rather than class hierarchies.

The Core Principle

Not: class Handler < API
But: class Handler < Base; include API; include Decision; include Batchable

Handlers gain capabilities by mixing in modules, not by inheriting from specialized base classes.


Why Composition?

The Problem with Inheritance

Deep inheritance hierarchies create problems:

# BAD: Inheritance-based capabilities
class APIDecisionBatchableHandler < APIDecisionHandler < APIHandler < BaseHandler
  # Which methods came from where?
  # How do I override just one behavior?
  # What if I need Batchable but not API?
end

| Problem | Description |
|---------|-------------|
| Diamond problem | Multiple paths to same ancestor |
| Tight coupling | Can’t change base without affecting all children |
| Inflexible | Can’t pick-and-choose capabilities |
| Hard to test | Must test entire hierarchy |
| Opaque behavior | Method origin unclear |

The Composition Solution

Mixins provide selective capabilities:

# GOOD: Composition-based capabilities
class MyHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::APICapable
  include TaskerCore::StepHandler::DecisionCapable

  def call(context)
    # Has API methods (get, post, put, delete)
    # Has Decision methods (decision_success, decision_no_branches)
    # Does NOT have Batchable methods (didn't include it)
  end
end

| Benefit | Description |
|---------|-------------|
| Selective inclusion | Only the capabilities you need |
| Clear origin | Module name indicates where methods come from |
| Independent testing | Test each mixin in isolation |
| Flexible combination | Any combination of capabilities |
| Flat structure | No deep hierarchies to navigate |

The Discovery

Analysis of Batchable handlers revealed they already used the composition pattern:

# Batchable was the TARGET architecture all along
class BatchHandler < Base
  include BatchableCapable  # Already doing it right!

  def call(context)
    batch_ctx = get_batch_context(context)
    # ...process batch...
    batch_worker_complete(processed_count: count, result_data: data)
  end
end

The cross-language handler harmonization recommended migrating API and Decision handlers to match this pattern.


Capability Modules

Available Capabilities

| Capability | Module (Ruby) | Methods Provided |
|------------|---------------|------------------|
| API | APICapable | get, post, put, delete |
| Decision | DecisionCapable | decision_success, decision_no_branches |
| Batchable | BatchableCapable | get_batch_context, batch_worker_complete, handle_no_op_worker |

Rust Traits

#![allow(unused)]
fn main() {
// Rust uses traits for the same pattern
pub trait APICapable {
    async fn get(&self, path: &str, params: Option<Params>) -> Response;
    async fn post(&self, path: &str, data: Option<Value>) -> Response;
    async fn put(&self, path: &str, data: Option<Value>) -> Response;
    async fn delete(&self, path: &str, params: Option<Params>) -> Response;
}

pub trait DecisionCapable {
    fn decision_success(&self, steps: Vec<String>, result: Value) -> StepExecutionResult;
    fn decision_no_branches(&self, result: Value) -> StepExecutionResult;
}

pub trait BatchableCapable {
    fn get_batch_context(&self, context: &StepContext) -> BatchContext;
    fn batch_worker_complete(&self, count: usize, data: Value) -> StepExecutionResult;
}
}

Python Mixins

# Python uses multiple inheritance (mixins)
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin, DecisionMixin

class MyHandler(StepHandler, APIMixin, DecisionMixin):
    def call(self, context: StepContext) -> StepHandlerResult:
        # Has both API and Decision methods
        response = self.get("/api/endpoint")
        return self.decision_success(["next_step"], response)

TypeScript Mixins

// TypeScript uses mixin functions applied in constructor
import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, applyDecision, APICapable, DecisionCapable } from '@tasker-systems/tasker';

class MyHandler extends StepHandler implements APICapable, DecisionCapable {
  constructor() {
    super();
    applyAPI(this);       // Adds get/post/put/delete methods
    applyDecision(this);  // Adds decisionSuccess/decisionNoBranches methods
  }

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Has both API and Decision methods
    const response = await this.get('/api/endpoint');
    return this.decisionSuccess(['next_step'], response.body);
  }
}

Separation of Concerns

What Orchestration Owns

The orchestration layer handles:

  • Domain event publishing (after results committed)
  • Decision point step creation (from DecisionPointOutcome)
  • Batch worker creation (from BatchProcessingOutcome)
  • State machine transitions

What Workers Own

Workers handle:

  • Decision logic (returns DecisionPointOutcome)
  • Batch analysis (returns BatchProcessingOutcome)
  • Handler execution (returns StepHandlerResult)
  • Custom publishers/subscribers (fast path events)

The Boundary

┌─────────────────────────────────────────────────────────────────┐
│                        Worker Space                              │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Handler.call(context)                                       ││
│  │   - Executes business logic                                 ││
│  │   - Uses API/Decision/Batchable capabilities               ││
│  │   - Returns StepHandlerResult with outcome                  ││
│  └─────────────────────────────────────────────────────────────┘│
│                              ↓ Result (with outcome)             │
├─────────────────────────────────────────────────────────────────┤
│                    Orchestration Space                           │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Process result                                              ││
│  │   - Commit state transition                                 ││
│  │   - If DecisionPointOutcome: create decision steps          ││
│  │   - If BatchProcessingOutcome: create batch workers         ││
│  │   - Publish domain events                                   ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

FFI Boundary Types

Outcomes crossing the FFI boundary need explicit types:

DecisionPointOutcome

#![allow(unused)]
fn main() {
// Rust definition
pub enum DecisionPointOutcome {
    ActivateSteps { step_names: Vec<String> },
    NoBranches,
}

// Serialized (all languages)
{
    "type": "ActivateSteps",
    "step_names": ["branch_a", "branch_b"]
}
}

BatchProcessingOutcome

#![allow(unused)]
fn main() {
// Rust definition
pub enum BatchProcessingOutcome {
    Continue { cursor: CursorConfig },
    Complete,
    NoOp,
}

// Serialized (all languages)
{
    "type": "Continue",
    "cursor": {
        "position": "offset_123",
        "batch_size": 100
    }
}
}
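
The tagged JSON shape shown above maps naturally onto serde's internally tagged enums. The following is an illustrative sketch of producing that wire shape; the actual tasker-shared definitions may use different serde attributes:

use serde::{Deserialize, Serialize};

// Sketch only: reproducing the {"type": ..., ...} wire shape with serde.
#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "type")]
enum DecisionPointOutcome {
    ActivateSteps { step_names: Vec<String> },
    NoBranches,
}

fn main() -> serde_json::Result<()> {
    let outcome = DecisionPointOutcome::ActivateSteps {
        step_names: vec!["branch_a".into(), "branch_b".into()],
    };
    // Prints: {"type":"ActivateSteps","step_names":["branch_a","branch_b"]}
    println!("{}", serde_json::to_string(&outcome)?);
    Ok(())
}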

Migration Path

Cross-Language Migration Examples

Ruby

Before (inheritance):

class MyAPIHandler < TaskerCore::APIHandler
  def call(context)
    # ...
  end
end

After (composition):

class MyAPIHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::Mixins::API

  def call(context)
    # Same implementation, different structure
  end
end

Python

Before (inheritance):

class MyAPIHandler(APIHandler):
    def call(self, context):
        # ...

After (composition):

from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin

class MyAPIHandler(StepHandler, APIMixin):
    def call(self, context):
        # Same implementation, different structure

TypeScript

Before (inheritance):

class MyAPIHandler extends APIHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    // ...
  }
}

After (composition):

import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, APICapable } from '@tasker-systems/tasker';

class MyAPIHandler extends StepHandler implements APICapable {
  constructor() {
    super();
    applyAPI(this);
  }

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Same implementation, different structure
  }
}

Rust

Rust already used the composition pattern via traits:

#![allow(unused)]
fn main() {
// Rust has always used traits (composition)
impl StepHandler for MyHandler { ... }
impl APICapable for MyHandler { ... }
impl DecisionCapable for MyHandler { ... }
}

Breaking Changes Implemented

The migration to composition involved breaking changes:

  1. Base class changes across all languages
  2. Module/mixin includes required
  3. Ruby cursor indexing changed from 1-indexed to 0-indexed

All breaking changes were accumulated and released together.


Anti-Patterns

Don’t: Inherit from Multiple Specialized Classes

# BAD: Ruby doesn't support multiple inheritance like this
class MyHandler < APIHandler, DecisionHandler  # Syntax error!

Don’t: Reimplement Mixin Methods

# BAD: Overriding mixin methods defeats the purpose
class MyHandler < Base
  include APICapable

  def get(path, params: {})
    # Custom implementation - now you own this forever
  end
end

Don’t: Mix Concerns

# BAD: Handler doing orchestration's job
class MyHandler < Base
  include DecisionCapable

  def call(context)
    # Don't create steps directly!
    create_workflow_step("next_step")  # Orchestration does this!

    # Do return the outcome
    decision_success(steps: ["next_step"], result_data: {})
  end
end

Testing Composition

Test Mixins in Isolation

# Test the mixin itself
RSpec.describe TaskerCore::StepHandler::APICapable do
  let(:handler) { Class.new { include TaskerCore::StepHandler::APICapable }.new }

  it "provides get method" do
    expect(handler).to respond_to(:get)
  end
end

Test Handler with Stubs

# Test handler behavior, stub mixin methods
RSpec.describe MyHandler do
  let(:handler) { MyHandler.new }

  it "calls API and makes decision" do
    allow(handler).to receive(:get).and_return({ status: 200 })

    result = handler.call(context)

    expect(result.decision_point_outcome.type).to eq("ActivateSteps")
  end
end

Cross-Language Consistency

This document describes Tasker Core’s philosophy for maintaining consistent APIs across Rust, Ruby, Python, and TypeScript workers while respecting each language’s idioms.

The Core Philosophy

“There should be one–and preferably only one–obvious way to do it.” – The Zen of Python

When a developer learns one Tasker worker language, they should understand all of them at the conceptual level. The specific syntax changes; the patterns remain constant.


Consistency Without Uniformity

What We Align

Developer-facing touchpoints that affect daily work:

| Touchpoint | Why Align |
|------------|-----------|
| Handler signatures | Developers switch languages within projects |
| Result factories | Error handling should feel familiar |
| Registry APIs | Service configuration is cross-cutting |
| Context access patterns | Data access is the core operation |
| Specialized handlers | API, Decision, Batchable are reusable patterns |

What We Don’t Force

Language idioms that feel natural in their ecosystem:

| Ruby | Python | TypeScript | Rust |
|------|--------|------------|------|
| Blocks, yield | Decorators, context managers | Generics, interfaces | Traits, associated types |
| Symbols (:name) | Type hints | async/await | Pattern matching |
| Duck typing | ABC, Protocol | Union types | Enums, Result<T,E> |

The Aligned APIs

Handler Signatures

All handlers receive context, return results:

# Ruby
class MyHandler < TaskerCore::StepHandler::Base
  def call(context)
    success(result: { data: "value" })
  end
end
# Python
class MyHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        return self.success({"data": "value"})
// TypeScript
class MyHandler extends BaseStepHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    return this.success({ data: "value" });
  }
}
#![allow(unused)]
fn main() {
// Rust
impl StepHandler for MyHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult {
        StepExecutionResult::success(json!({"data": "value"}), None)
    }
}
}

Result Factories

Success and failure patterns are identical:

| Operation | Pattern |
|-----------|---------|
| Success | success(result_data, metadata?) |
| Failure | failure(message, error_type, error_code?, retryable?, metadata?) |

The factory methods hide implementation details (wrapper classes, enum variants) behind a consistent interface.
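
As an illustration of the idea only (not the actual tasker-worker API), the same two factories can be expressed as constructors on a single result type:

use serde_json::{json, Value};

// Illustrative sketch of a unified result type with success/failure factories;
// field and parameter names follow the table above, not the real crate.
struct HandlerResult {
    success: bool,
    result: Value,
    error_message: Option<String>,
    error_type: Option<String>,
    retryable: bool,
}

impl HandlerResult {
    fn success(result: Value) -> Self {
        Self { success: true, result, error_message: None, error_type: None, retryable: false }
    }

    fn failure(message: &str, error_type: &str, retryable: bool) -> Self {
        Self {
            success: false,
            result: json!(null),
            error_message: Some(message.to_string()),
            error_type: Some(error_type.to_string()),
            retryable,
        }
    }
}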

Registry Operations

All registries support the same core operations:

| Operation | Description |
|-----------|-------------|
| register(name, handler) | Register a handler by name |
| is_registered(name) | Check if handler exists |
| resolve(name) | Get handler instance |
| list_handlers() | List all registered handlers |
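
A minimal sketch of what this surface looks like, for illustration only; each language binding ships its own concrete registry type:

use std::collections::HashMap;
use std::sync::Arc;

// Illustrative registry sketch mirroring the four operations above.
trait StepHandler: Send + Sync {}

#[derive(Default)]
struct HandlerRegistry {
    handlers: HashMap<String, Arc<dyn StepHandler>>,
}

impl HandlerRegistry {
    fn register(&mut self, name: &str, handler: Arc<dyn StepHandler>) {
        self.handlers.insert(name.to_string(), handler);
    }

    fn is_registered(&self, name: &str) -> bool {
        self.handlers.contains_key(name)
    }

    fn resolve(&self, name: &str) -> Option<Arc<dyn StepHandler>> {
        self.handlers.get(name).cloned()
    }

    fn list_handlers(&self) -> Vec<String> {
        self.handlers.keys().cloned().collect()
    }
}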

Context Access Patterns

The StepContext provides unified access to execution data:

Core Fields (All Languages)

| Field | Type | Description |
|-------|------|-------------|
| task_uuid | String | Unique task identifier |
| step_uuid | String | Unique step identifier |
| input_data | Dict/Hash | Input data for the step |
| step_config | Dict/Hash | Handler configuration |
| dependency_results | Wrapper | Results from parent steps |
| retry_count | Integer | Current retry attempt |
| max_retries | Integer | Maximum retry attempts |

Convenience Methods

| Method | Description |
|--------|-------------|
| get_task_field(name) | Get field from task context |
| get_dependency_result(step_name) | Get result from a parent step |

Specialized Handler Patterns

API Handler

HTTP operations available in all languages:

| Method | Pattern |
|--------|---------|
| GET | get(path, params?, headers?) |
| POST | post(path, data?, headers?) |
| PUT | put(path, data?, headers?) |
| DELETE | delete(path, params?, headers?) |

Decision Handler

Conditional workflow branching:

# Ruby
decision_success(steps: ["branch_a", "branch_b"], result_data: { routing: "criteria" })
decision_no_branches(result_data: { reason: "no action needed" })
# Python
self.decision_success(["branch_a", "branch_b"], {"routing": "criteria"})
self.decision_no_branches({"reason": "no action needed"})

Batchable Handler

Cursor-based batch processing:

| Operation | Description |
|-----------|-------------|
| get_batch_context(context) | Extract batch metadata |
| batch_worker_complete(count, data) | Signal batch completion |
| handle_no_op_worker(batch_ctx) | Handle empty batch |

FFI Boundary Types

When data crosses the FFI boundary (Rust <-> Ruby/Python/TypeScript), types must serialize identically:

Required Explicit Types

| Type | Purpose |
|------|---------|
| DecisionPointOutcome | Decision handler results |
| BatchProcessingOutcome | Batch handler results |
| CursorConfig | Batch cursor configuration |
| StepHandlerResult | All handler results |

Serialization Guarantee

The same JSON representation must work across all languages:

{
  "success": true,
  "result": { "data": "value" },
  "metadata": { "timing_ms": 50 }
}
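
One way to exercise this guarantee from the Rust side is to round-trip the wire format through a typed struct. A sketch only; field names follow the JSON above, not the actual tasker-shared type:

use serde::{Deserialize, Serialize};
use serde_json::json;

// Sketch: the wire format shown above, parsed into a strongly typed struct.
#[derive(Debug, Serialize, Deserialize)]
struct WireResult {
    success: bool,
    result: serde_json::Value,
    metadata: serde_json::Value,
}

fn main() -> serde_json::Result<()> {
    let wire = json!({
        "success": true,
        "result": { "data": "value" },
        "metadata": { "timing_ms": 50 }
    });

    let parsed: WireResult = serde_json::from_value(wire)?;
    assert!(parsed.success);
    println!("{parsed:?}");
    Ok(())
}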

Why This Matters

Developer Productivity

When switching from a Ruby handler to a Python handler:

  • No relearning core concepts
  • Same mental model applies
  • Documentation transfers

Code Review Consistency

Reviewers can evaluate handlers in any language:

  • Pattern violations are obvious
  • Best practices are universal
  • Anti-patterns are recognizable

Documentation Efficiency

One set of conceptual docs serves all languages:

  • Language-specific pages show syntax only
  • Core patterns documented once
  • Examples parallel across languages

The Pre-Alpha Advantage

In pre-alpha, we can make breaking changes to achieve consistency:

| Change Type | Example |
|-------------|---------|
| Method renames | handle() → call() |
| Signature changes | (task, step) → (context) |
| Return type unification | Separate Success/Error → unified result |

These changes would be costly post-release but are cheap now.


Migration Path

When APIs diverge, we follow this pattern:

  1. Non-Breaking First: Add aliases, helpers, new modules
  2. Deprecation Period: Mark old APIs deprecated (warnings in logs)
  3. Breaking Release: Remove old APIs, document migration

Example timeline:

Phase 1: Python migration (non-breaking + breaking)
Phase 2: Ruby migration (non-breaking + breaking)
Phase 3: Rust alignment (already aligned)
Phase 4: TypeScript alignment (new implementation)
Phase 5: Breaking changes release (all languages together)

Anti-Patterns

Don’t: Force Identical Syntax

# BAD: Ruby-style symbols in Python
def call(context) -> Hash[:success => true]  # Not Python!

Don’t: Ignore Language Idioms

# BAD: Python-style type hints in Ruby
def call(context: StepContext) -> StepHandlerResult  # Not Ruby!

Don’t: Duplicate Orchestration Logic

# BAD: Worker creating decision steps
def call(context)
  # Don't do orchestration's job!
  create_decision_steps(...)  # Orchestration handles this
end

Defense in Depth

This document describes Tasker Core’s multi-layered protection model for idempotency and data integrity.

The Four Protection Layers

Tasker Core uses four independent protection layers. Each layer catches what others might miss, and no single layer bears full responsibility for data integrity.

┌─────────────────────────────────────────────────────────────────┐
│                    Layer 4: Application Logic                   │
│                    (State-based deduplication)                  │
├─────────────────────────────────────────────────────────────────┤
│                    Layer 3: Transaction Boundaries              │
│                    (All-or-nothing semantics)                   │
├─────────────────────────────────────────────────────────────────┤
│                    Layer 2: State Machine Guards                │
│                    (Current state validation)                   │
├─────────────────────────────────────────────────────────────────┤
│                    Layer 1: Database Atomicity                  │
│                    (Unique constraints, row locks, CAS)         │
└─────────────────────────────────────────────────────────────────┘

Layer 1: Database Atomicity

The foundation layer using PostgreSQL’s transactional guarantees.

Mechanisms

| Mechanism | Purpose | Example |
|-----------|---------|---------|
| Unique constraints | Prevent duplicate records | One active task per (namespace, external_id) |
| Row-level locking | Prevent concurrent modification | SELECT ... FOR UPDATE on task claim |
| Compare-and-swap | Atomic state transitions | UPDATE ... WHERE state = $expected |
| Advisory locks | Distributed coordination | Template cache invalidation |

Atomic Claiming Pattern

-- Only one processor can claim a task
UPDATE tasks
SET state = 'in_progress',
    processor_uuid = $1,
    claimed_at = NOW()
WHERE id = $2
  AND state = 'pending'  -- CAS: only if still pending
RETURNING *;

If two processors try to claim the same task:

  • First: Succeeds, task transitions to in_progress
  • Second: Fails (0 rows affected), no state change

Why This Works

PostgreSQL’s MVCC ensures the WHERE state = 'pending' check and the SET state = 'in_progress' happen atomically. There’s no window where both processors see state = 'pending'.


Layer 2: State Machine Guards

State machine validation before any transition is attempted.

Implementation

#![allow(unused)]
fn main() {
impl TaskStateMachine {
    pub fn can_transition(&self, from: TaskState, to: TaskState) -> bool {
        VALID_TRANSITIONS.contains(&(from, to))
    }

    pub fn transition(&mut self, to: TaskState) -> Result<(), StateError> {
        if !self.can_transition(self.current, to) {
            return Err(StateError::InvalidTransition { from: self.current, to });
        }
        // Proceed with transition
    }
}
}

Valid Transitions Matrix

The state machine explicitly defines which transitions are valid:

Pending → Initializing → EnqueuingSteps → StepsInProcess
                                              ↓
Complete ← EvaluatingResults ← (step completions)
                  ↓
               Error (from any state)

Invalid transitions are rejected before reaching the database.
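
For illustration, the matrix from the diagram could be expressed as data that the guard checks. This is a sketch only; the actual VALID_TRANSITIONS definition in tasker-core may differ:

// Sketch: the transition matrix from the diagram above as checkable data.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TaskState {
    Pending,
    Initializing,
    EnqueuingSteps,
    StepsInProcess,
    EvaluatingResults,
    Complete,
    Error,
}

use TaskState::*;

const VALID_TRANSITIONS: &[(TaskState, TaskState)] = &[
    (Pending, Initializing),
    (Initializing, EnqueuingSteps),
    (EnqueuingSteps, StepsInProcess),
    (StepsInProcess, EvaluatingResults),
    (EvaluatingResults, Complete),
];

fn can_transition(from: TaskState, to: TaskState) -> bool {
    // Error is reachable from any state, per the diagram above
    to == Error || VALID_TRANSITIONS.contains(&(from, to))
}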

Why This Works

Application-level guards prevent obviously invalid operations from even attempting database changes. This reduces database load and provides better error messages.


Layer 3: Transaction Boundaries

All-or-nothing semantics for multi-step operations.

Example: Step Enqueueing

#![allow(unused)]
fn main() {
async fn enqueue_steps(task_id: TaskId, steps: Vec<Step>) -> Result<()> {
    let mut tx = pool.begin().await?;

    // Insert all steps
    for step in steps {
        insert_step(&mut tx, task_id, &step).await?;
    }

    // Update task state
    update_task_state(&mut tx, task_id, TaskState::StepsInProcess).await?;

    // Atomic commit - all or nothing
    tx.commit().await?;
    Ok(())
}
}

If step insertion fails:

  • Transaction rolls back
  • Task state unchanged
  • No partial steps created

Why This Works

PostgreSQL transactions ensure that either all changes commit or none do. There’s no intermediate state where some steps exist but task state is wrong.


Layer 4: Application-Level Filtering

State-based deduplication in application logic.

Example: Result Processing

#![allow(unused)]
fn main() {
async fn process_result(step_id: StepId, result: HandlerResult) -> Result<()> {
    let step = get_step(step_id).await?;

    // Filter: Only process if step is in_progress
    if step.state != StepState::InProgress {
        log::info!("Ignoring result for step {} in state {:?}", step_id, step.state);
        return Ok(()); // Idempotent: already processed
    }

    // Proceed with result processing
    apply_result(step, result).await
}
}

Why This Works

Even if the same result arrives multiple times (network retries, duplicate messages), only the first processing has effect. Subsequent attempts see the step already transitioned and exit cleanly.


The Discovery: Ownership Was Harmful

What We Learned

Analysis of processor UUID “ownership” enforcement revealed:

#![allow(unused)]
fn main() {
// OLD: Ownership enforcement (REMOVED)
fn can_process(&self, processor_uuid: Uuid) -> bool {
    self.owner_uuid == processor_uuid  // BLOCKED recovery!
}

// NEW: Ownership tracking only (for audit)
fn process(&self, processor_uuid: Uuid) -> Result<()> {
    self.record_processor(processor_uuid);  // Track, don't enforce
    // ... proceed with processing
}
}

Why Ownership Enforcement Was Removed

| Scenario | With Enforcement | Without Enforcement |
|----------|------------------|---------------------|
| Normal operation | Works | Works |
| Orchestrator crash & restart | BLOCKED - new UUID | Automatic recovery |
| Duplicate message | Rejected | Layer 1 (CAS) rejects |
| Race condition | Rejected | Layer 1 (CAS) rejects |

The four protection layers already prevent corruption. Ownership added:

  • Zero additional safety (layers 1-4 sufficient)
  • Recovery blocking (crashed tasks stuck forever)
  • Operational complexity (manual intervention needed)

The Verdict

“Processor UUID ownership was redundant protection with harmful side effects.”

When two actors receive identical messages:

  • First: Succeeds atomically (Layer 1 CAS)
  • Second: Fails cleanly (Layer 1 CAS)
  • No partial state, no corruption
  • No ownership check needed

Designing New Protections

When adding protection mechanisms, evaluate against this checklist:

Before Adding Protection

  1. Which layer does this belong to? (Database, state machine, transaction, application)
  2. What does it protect against? (Be specific: race condition, duplicate, corruption)
  3. Do existing layers already cover this? (Usually yes)
  4. What failure modes does it introduce? (Blocked recovery, increased latency)
  5. Can the system recover if this protection itself fails?

The Minimal Set Principle

Find the minimal set of protections that prevents corruption. Additional layers that prevent recovery are worse than none.

A system that:

  • Has fewer protections
  • Recovers automatically from crashes
  • Handles duplicates idempotently

Is better than a system that:

  • Has more protections
  • Requires manual intervention after crashes
  • Is “theoretically more secure”


Relationship to Fail Loudly

Defense in Depth and Fail Loudly are complementary principles:

| Defense in Depth | Fail Loudly |
|------------------|-------------|
| Multiple layers prevent corruption | Errors surface problems immediately |
| Redundancy catches edge cases | Transparency enables diagnosis |
| Protection happens before damage | Visibility happens at detection |

Both reject the same anti-pattern: silent failures.

  • Defense in Depth rejects: silent corruption (data changed without protection)
  • Fail Loudly rejects: silent defaults (missing data hidden with fabricated values)

Together they ensure: if something goes wrong, we know about it—either protection prevents it, or an error surfaces it.


Fail Loudly

This document describes Tasker Core’s philosophy on error handling: errors are first-class citizens, not inconveniences to hide.

The Core Principle

A system that lies is worse than one that fails.

When data is missing, malformed, or unexpected, the correct response is an explicit error—not a fabricated default that makes the problem invisible.


The Problem: Phantom Data

“Phantom data” is data that:

  • Looks valid to consumers
  • Passes type checks and validation
  • Contains no actual information from the source
  • Was fabricated by defensive code trying to be “helpful”

Example: The Silent Default

#![allow(unused)]
fn main() {
// WRONG: Silent default hides protocol violation
fn get_pool_utilization(response: Response) -> PoolUtilization {
    response.pool_utilization.unwrap_or_else(|| PoolUtilization {
        active_connections: 0,
        idle_connections: 0,
        max_connections: 0,
        utilization_percent: 0.0,  // Looks like "no load"
    })
}
}

A monitoring system receiving this response sees:

  • utilization_percent: 0.0 — “Great, the system is idle!”
  • Reality: The server never sent pool data. The system might be at 100% load.

The consumer cannot distinguish “server reported 0%” from “server sent nothing.”

The Trust Equation

Silent default
  → Consumer receives valid-looking data
  → Consumer makes decisions based on phantom values
  → Phantom bugs manifest in production
  → Debugging nightmare: "But the data looked correct!"

vs.

Explicit error
  → Consumer receives clear failure
  → Consumer handles error appropriately
  → Problem visible immediately
  → Fix applied at source

The Solution: Explicit Errors

Pattern: Required Fields Return Errors

#![allow(unused)]
fn main() {
// RIGHT: Explicit error on missing required data
fn get_pool_utilization(response: Response) -> Result<PoolUtilization, ClientError> {
    response.pool_utilization.ok_or_else(|| {
        ClientError::invalid_response(
            "Response.pool_utilization",
            "Server omitted required pool utilization data",
        )
    })
}
}

Now the consumer:

  • Knows data is missing
  • Can retry, alert, or degrade gracefully
  • Never operates on phantom values

Pattern: Distinguish Required vs Optional

Not all fields should fail on absence. The distinction matters:

| Field Type | Missing Means | Response |
|------------|---------------|----------|
| Required | Protocol violation, server bug | Return error |
| Optional | Legitimately absent, feature not configured | Return None |

#![allow(unused)]
fn main() {
// Required: Server MUST send health checks
let checks = response.checks.ok_or_else(||
    ClientError::invalid_response("checks", "missing")
)?;

// Optional: Distributed cache may not be configured
let cache = response.distributed_cache; // Option<T> preserved
}

Pattern: Propagate, Don’t Swallow

Errors should flow up, not disappear:

#![allow(unused)]
fn main() {
// WRONG: Error swallowed, default returned
fn convert_response(r: Response) -> DomainType {
    let info = r.info.unwrap_or_default();  // Error hidden
    // ...
}

// RIGHT: Error propagated to caller
fn convert_response(r: Response) -> Result<DomainType, ClientError> {
    let info = r.info.ok_or_else(||
        ClientError::invalid_response("info", "missing")
    )?;  // Error visible
    // ...
}
}

When Defaults Are Acceptable

Not every unwrap_or_default() is wrong. Defaults are acceptable when:

  1. The field is explicitly optional in the domain model

    #![allow(unused)]
    fn main() {
    // Optional metadata that may legitimately be absent
    let metadata: Option<Value> = response.metadata;
    }
  2. The default is semantically meaningful

    #![allow(unused)]
    fn main() {
    // Empty tags list is valid—means "no tags"
    let tags = response.tags.unwrap_or_default(); // Vec<String>
    }
  3. Absence cannot be confused with a valid value

    #![allow(unused)]
    fn main() {
    // description being None vs "" are distinguishable
    let description: Option<String> = response.description;
    }

Red Flags to Watch For

When reviewing code, these patterns indicate potential phantom data:

1. unwrap_or_default() on Numeric Types

#![allow(unused)]
fn main() {
// RED FLAG: 0 looks like a valid measurement
let active_connections = pool.active.unwrap_or_default();
}

2. unwrap_or_else(|| ...) with Fabricated Values

#![allow(unused)]
fn main() {
// RED FLAG: "unknown" looks like real status
let status = check.status.unwrap_or_else(|| "unknown".to_string());
}

3. Default Structs for Missing Nested Data

#![allow(unused)]
fn main() {
// RED FLAG: Entire section fabricated
let config = response.config.unwrap_or_else(default_config);
}

4. Silent Fallbacks in Health Checks

#![allow(unused)]
fn main() {
// RED FLAG: Health check that never fails is useless
let health = check_health().unwrap_or(HealthStatus::Ok);
}

Implementation Checklist

When implementing new conversions or response handling:

  • Is this field required by the protocol/API contract?
  • If missing, would a default be indistinguishable from a valid value?
  • Could a consumer make incorrect decisions based on a default?
  • Is the error message actionable? (includes field name, explains what’s wrong)
  • Is the error type appropriate? (InvalidResponse for protocol violations)

The Discovery

What We Found

During gRPC client implementation, analysis revealed pervasive patterns like:

#![allow(unused)]
fn main() {
// Found throughout conversions.rs
let checks = response.checks.unwrap_or_else(|| ReadinessChecks {
    web_database: HealthCheck { status: "unknown".into(), ... },
    orchestration_database: HealthCheck { status: "unknown".into(), ... },
    // ... more fabricated checks
});
}

A client calling get_readiness() would receive what looked like a valid response with “unknown” status for all checks—when in reality, the server sent nothing.

The Refactoring

All required-field patterns were changed to explicit errors:

#![allow(unused)]
fn main() {
// After refactoring
let checks = response.checks.ok_or_else(|| {
    ClientError::invalid_response(
        "ReadinessResponse.checks",
        "Readiness response missing required health checks",
    )
})?;
}

Now a malformed server response immediately fails with:

Error: Invalid response: ReadinessResponse.checks - Readiness response missing required health checks

The problem is visible. The fix can be applied. Trust is preserved.


  • Tenet #11: Fail Loudly in Tasker Core Tenets
  • Meta-Principle #6: Errors Over Defaults
  • Defense in Depth — fail loudly is a form of protection; silent defaults are a form of hiding

Summary

| Don’t | Do |
|-------|----|
| Hide missing data with defaults | Return explicit errors |
| Make consumers guess if data is real | Distinguish required vs optional |
| Fabricate “unknown” status values | Error: “status unavailable” |
| Swallow errors in conversions | Propagate with ? operator |
| Treat all fields as optional | Model optionality in types |

The golden rule: If you can’t tell the difference between “server sent 0” and “server sent nothing,” you have a phantom data problem.

Intentional AI Partnership

A philosophy of rigorous collaboration for the age of AI-assisted engineering


The Growing Divide

There is a phrase gaining traction in software engineering circles: “Nice AI slop.”

It’s dismissive. It’s reductive. And it’s not entirely wrong.

The critique is valid: AI tools have made it possible to generate enormous volumes of code without understanding what that code does, why it’s structured the way it is, or how to maintain it when something breaks at 2 AM. Engineers who would never have shipped code they couldn’t explain are now approving pull requests they couldn’t debug. Project leads are drowning in contributions from well-meaning developers who “vibe-coded” their way into maintenance nightmares.

For those of us who have spent years—decades—in the craft of software engineering, who have sat with codebases through their full lifecycles, who have felt the weight of technical decisions made five years ago landing on our shoulders today, this is frustrating. The hard-won discipline of our profession seems to be eroding in favor of velocity.

And yet.

The response to “AI slop” cannot be rejection of AI as a partner in engineering work. That path leads to irrelevance. The question is not whether to work with AI, but how—with what principles, what practices, what commitments to quality and accountability.

This document is an attempt to articulate those principles. Not as abstract ideals, but as a working philosophy grounded in practice: building real systems, shipping real code, maintaining real accountability.


The Core Insight: Amplification, Not Replacement

AI does not create the problems we’re seeing. It amplifies them.

Teams that already had weak ownership practices now produce more poorly-understood code, faster. Organizations where “move fast and break things” meant “ship it and let someone else figure it out” now ship more of it. Engineers who never quite understood the systems they worked on can now generate more code they don’t understand.

But the inverse is also true.

Teams with strong engineering discipline—clear specifications, rigorous review, genuine ownership—can leverage AI to operate at a higher level of abstraction while maintaining (or even improving) quality. The acceleration becomes an advantage, not a liability.

This is the same dynamic that exists in any collaboration. A junior engineer paired with a senior engineer who doesn’t mentor becomes a junior engineer who writes more code without learning. A junior engineer paired with a senior engineer who invests in their growth becomes a stronger engineer, faster.

AI partnership follows the same pattern. The quality of the outcome depends on the quality of the collaboration practices surrounding it.

The discipline required for effective AI partnership is not new. It is the discipline that should characterize all engineering collaboration. AI simply makes the presence or absence of that discipline more visible, more consequential, and more urgent.


Principles of Intentional Partnership

1. Specification Before Implementation

The most effective AI collaboration begins long before code is written.

When you ask an AI to “build a feature,” you get code. When you work with an AI to understand the problem, research the landscape, evaluate approaches, and specify the solution—then implement—you get software.

This is not an AI-specific insight. It’s foundational engineering practice. But AI makes the cost of skipping specification deceptively low: you can generate code instantly, so why spend time on design? The answer is the same as it’s always been: because code without design is not software, it’s typing.

In practice:

  • Begin with exploration: What problem are we solving? What does the current system look like? What will be different when this work is complete?
  • Research with tools: Use AI capabilities to understand the codebase, explore patterns in the ecosystem, review prior art. Ground the work in reality, not assumptions.
  • Develop evaluation criteria before evaluating solutions. Know what “good” looks like before you start judging options.
  • Document the approach, not just the code. Specifications are artifacts of understanding.

2. Phased Delivery with Validation Gates

Large work should be decomposed into phases, and each phase should have clear acceptance criteria.

This principle exists because humans have limited working memory. It’s true for individual engineers, it’s true for teams, and it’s true for AI systems. Complex work exceeds the capacity of any single context—human or machine—to hold it all at once.

Phased delivery is how we manage this limitation. Each phase is small enough to understand completely, validate thoroughly, and commit to confidently. The boundaries between phases are synchronization points where understanding is verified.

In practice:

  • Identify what can be parallelized versus what must be sequential. Not all work is equally dependent.
  • Determine which aspects require careful attention versus which can be resolved at implementation time. Not all decisions are equally consequential.
  • Each phase should be independently validatable: tests pass, acceptance criteria met, code reviewed.
  • Phase documentation should include code samples for critical paths. Show, don’t just tell.

3. Validation as a First-Class Concern

Testing is not a phase that happens after implementation. It is a design constraint that shapes implementation.

AI can generate tests as easily as it generates code. This makes it tempting to treat testing as an afterthought: write the code, then generate tests to cover it. This inverts the value proposition of testing entirely.

Tests are specifications. They encode expectations about behavior. When tests are written first—or at least designed first—they constrain the implementation toward correctness. When tests are generated after the fact, they merely document whatever the implementation happens to do, bugs included.

In practice:

  • Define acceptance criteria before implementation begins.
  • Include edge cases, boundary conditions, and non-happy-path scenarios in specifications.
  • End-to-end testing validates that the system works, not just that individual units work.
  • Review tests with the same rigor as implementation code. Tests can have bugs too.

4. Human Accountability as the Final Gate

This is the principle that separates intentional partnership from “AI slop.”

The human engineer is ultimately responsible for code that ships. Not symbolically responsible—actually responsible. Responsible for understanding what the code does, why it’s structured the way it is, what trade-offs were made, and how to maintain it.

This is not about low trust in AI. It’s about the nature of accountability.

If you cannot explain why a particular approach was chosen, you should not approve it. If you cannot articulate the trade-offs embedded in a design decision, you should not sign off on it. If you cannot defend a choice—or at least explain why the choice wasn’t worth extensive deliberation—then you are not in a position to take responsibility for it.

This standard applies to all code, regardless of its origin. Human-written code that the approving engineer doesn’t understand is no better than AI-written code they don’t understand. The source is irrelevant; the accountability is what matters.

In practice:

  • Review is not approval. Approval requires understanding.
  • The bikeshedding threshold is a valid concept: knowing why something isn’t worth debating is also knowledge. But you must actually know this, not assume it.
  • Code review agents and architectural validators are useful, but they augment human judgment rather than replacing it.
  • If you wouldn’t ship code you wrote yourself without understanding it, don’t ship AI-written code without understanding it either.

5. Documentation as Extended Cognition

Documentation is not an artifact of completed work. It is a tool that enables work to continue.

Every engineer who joins a project faces the same challenge: building sufficient context to contribute effectively. Every AI session faces the same challenge: starting fresh without memory of prior work. Good documentation serves both.

This is the insight that makes documentation investment worthwhile: it extends cognition across time and across minds. The context you build today, documented well, becomes instantly available to future collaborators—human or AI.

In practice:

  • Structure documentation for efficient context loading. Navigation guides, trigger patterns, clear hierarchies.
  • Capture the “why” alongside the “what.” Decisions without rationale are trivia.
  • Principles, architecture, guides, reference—different documents serve different needs at different times.
  • Documentation that serves future AI sessions also serves future human engineers. The requirements are the same: limited working memory, need for efficient orientation.

6. Toolchain Alignment

Some development environments are better suited to intentional partnership than others.

The ideal toolchain provides fast feedback loops, enforces correctness constraints, and makes architectural decisions explicit. The compiler, the type system, the test framework—these become additional collaborators in the process, catching errors early and forcing clarity about intent.

Languages and tools that defer decisions to runtime, that allow implicit behavior, that prioritize flexibility over explicitness, make intentional partnership harder. Not impossible—but harder. The burden of verification shifts more heavily to the human.

In practice:

  • Strong type systems document intent in ways that survive across sessions and collaborators.
  • Compilers that enforce correctness (memory safety, exhaustive matching) catch the classes of errors most likely to slip through in high-velocity development.
  • Explicit architectural patterns—actor models, channel semantics, clear ownership boundaries—force intentional design rather than emergent mess.
  • The goal is not language advocacy but recognition: your toolchain affects your collaboration quality.

A Concrete Example: Building Tasker

These principles are not theoretical. They emerged from—and continue to guide—the development of Tasker, a workflow orchestration system built in Rust.

Why Rust?

Rust is not chosen as a recommendation but as an illustration of what makes a toolchain powerful for intentional partnership.

The Rust compiler forces agreement on memory ownership. You cannot be vague about who owns data and when it’s released; the borrow checker requires explicitness. This means architectural decisions about data flow must be made consciously rather than accidentally.

Exhaustive pattern matching means you cannot forget to handle a case. Every enum variant must be addressed. This is particularly valuable when working with AI: generated code that handles only the happy path fails to compile rather than failing silently in production.

The type system documents intent in ways that persist across context windows. When an AI session resumes work on a Rust codebase, the types communicate constraints that would otherwise need to be re-established through conversation.

Tokio channels, MPSC patterns, actor boundaries—these require intentional design. You cannot stumble into an actor architecture; you must choose it and implement it explicitly. This aligns well with specification-driven development.

None of this makes Rust uniquely suitable or necessary. It makes Rust an example of the properties that matter: explicitness, enforcement, feedback loops that catch errors early.

The Spec-Driven Workflow

Every significant piece of Tasker work follows a pattern:

  1. Problem exploration: What are we trying to accomplish? What’s the current state? What will success look like?

  2. Grounded research: Use AI capabilities to understand the codebase, explore ecosystem patterns, review tooling options. Generate a situated view of how the problem exists within the actual system.

  3. Approach analysis: Develop criteria for evaluating solutions. Generate multiple approaches. Evaluate against criteria. Select and refine.

  4. Phased planning: Break work into milestones with validation gates. Identify dependencies, parallelization opportunities, risk areas. Determine what needs careful specification versus what can be resolved during implementation.

  5. Phase documentation: Each phase gets its own specification in a dedicated directory. Includes acceptance criteria, code samples for critical paths, and explicit validation requirements.

  6. Implementation with validation: Work proceeds phase by phase. Tests are written. Code is reviewed. Each phase is complete before the next begins.

  7. Human accountability gate: The human partner reviews not just for correctness but for understanding. Can they defend the choices? Do they know why alternatives were rejected? Are they prepared to maintain this code?

This workflow produces comprehensive documentation as a side effect of doing the work. The docs/ticket-specs/ directories in Tasker contain detailed specifications that serve both as implementation guides and as institutional memory. Future engineers—and future AI sessions—can understand not just what was built but why.

The Tenets as Guardrails

Tasker’s development is guided by ten core tenets, derived from experience. Several are directly relevant to intentional partnership:

State Machine Rigor: All state transitions are atomic, audited, and validated. This principle emerged from debugging distributed systems failures; it also provides clear contracts for AI-generated code to satisfy.

Defense in Depth: Multiple overlapping protection layers rather than single points of failure. In collaboration terms: review, testing, type checking, and runtime validation each catch what others might miss.

Composition Over Inheritance: Capabilities are composed via mixins, not class hierarchies. This produces code that’s easier to understand in isolation—crucial when any given context (human or AI) can only hold part of the system at once.

These tenets emerged from building software over many years. They apply to AI partnership because they apply to engineering generally. AI is a collaborator; good engineering principles govern collaboration.


The Organizational Dimension

Intentional AI partnership is not just an individual practice. It’s an organizational capability.

What Changes

When AI acceleration is available to everyone, the differentiator becomes the quality of surrounding practices:

  • Specification quality determines whether AI generates useful code or plausible-looking nonsense.
  • Review rigor determines whether errors are caught before or after deployment.
  • Testing discipline determines whether systems are verifiably correct or coincidentally working.
  • Documentation investment determines whether institutional knowledge accumulates or evaporates.

Organizations that were already strong in these areas will find AI amplifies their strength. Organizations that were weak will find AI amplifies their weakness—faster.

The Accountability Question

The hardest organizational challenge is accountability.

When an engineer can generate a month’s worth of code in a day, traditional review processes break down. You cannot carefully review a thousand lines of code per hour. Something has to give.

The answer is not “skip review” or “automate review entirely.” The answer is to change what gets reviewed.

In intentional partnership, the specification is the primary artifact. The specification is reviewed carefully: Does this approach make sense? Does it align with architectural principles? Does it handle edge cases? Does it integrate with existing systems?

The implementation—whether AI-generated or human-written—is validated against the specification. Tests verify behavior. Type systems verify contracts. Review confirms that the implementation matches the spec.

This shifts review from “read every line of code” to “verify that implementation matches intent.” It’s a different skill, but it’s learnable. And it scales in ways that line-by-line review does not.

Building the Capability

Organizations building intentional AI partnership should focus on:

  1. Specification practices: Invest in training engineers to write clear, complete specifications. This skill was always valuable; it’s now critical.

  2. Review culture: Shift review culture from gatekeeping to verification. The question is not “would I have written it this way?” but “does this correctly implement the specification?”

  3. Testing infrastructure: Fast, comprehensive test suites become even more valuable when implementation velocity increases. Invest accordingly.

  4. Documentation standards: Establish expectations for documentation quality. Make documentation a first-class deliverable, not an afterthought.

  5. Toolchain alignment: Choose languages, frameworks, and tools that provide fast feedback and enforce correctness. The compiler is a collaborator.


The Call to Action: What Becomes Possible

There is another dimension to this conversation that deserves attention.

We have focused on rigor, accountability, and the discipline required to avoid producing “slop.” This framing is necessary but insufficient. It treats AI partnership primarily as a risk to be managed rather than an opportunity to be seized.

Consider what has changed.

For decades, software engineers have carried mental backlogs of things we would build if we had the time. Ideas that were sound, architecturally feasible, genuinely useful—but the time-to-execute made them impractical. Side projects abandoned. Features deprioritized. Entire systems that existed only as sketches in notebooks because the implementation cost was prohibitive.

That calculus has shifted.

AI partnership, applied rigorously, compresses implementation timelines in ways that make previously infeasible work feasible. The system you would have built “someday” can be prototyped in a weekend. The refactoring you’ve been putting off for years can be specified, planned, and executed in weeks. The tooling you wished existed can be created rather than merely wished for.

This is not about moving faster for its own sake. It’s about what becomes creatively possible when the friction of implementation is reduced.

Tasker exists because of this shift. A workflow orchestration system supporting four languages, with comprehensive documentation, rigorous testing, and production-grade architecture—built as a labor of love alongside a demanding day job. Ten years ago, this project would have remained an idea. Five years ago, perhaps a half-finished prototype. Today, it’s real software approaching production readiness.

And Tasker is not unique. Across the industry, engineers are building things that would not have existed otherwise. Not “AI-generated slop,” but genuine contributions to the craft—systems built with care, designed with intention, maintained with accountability.

This is what’s at stake when we talk about intentional partnership.

When we approach AI collaboration carelessly, we produce code we don’t understand and can’t maintain. We waste the capability on work that creates more problems than it solves. We give ammunition to critics who argue that AI makes engineering worse.

When we approach AI collaboration with rigor, clarity, and commitment to excellence, we unlock creative possibilities that were previously out of reach. We build things that matter. We expand what a single engineer, or a small team, can accomplish.

Squandering this capability on careless work fails to respect ourselves: our time, our creativity, our professional aspirations. Using the partnership without intention fails to respect the partnership.

The opportunity before us is unprecedented. The discipline required to seize it is not new—it’s the discipline of good engineering, applied to a new context.

Let’s not waste it.


Conclusion: Craft Persists

The critique of “AI slop” is fundamentally a critique of craft—or its absence.

Craft is the accumulated wisdom of how to do something well. In software engineering, craft includes knowing when to abstract and when to be concrete, when to optimize and when to leave well enough alone, when to document and when the code is the documentation. Craft is what separates software that works from software that lasts.

AI does not possess craft. AI possesses capability—vast capability—but capability without wisdom is dangerous. This is true of humans as well; we just notice it less because human capability is more limited.

Intentional AI partnership is the practice of combining AI capability with human craft. The AI brings speed, breadth of knowledge, tireless pattern matching. The human brings judgment, accountability, and the accumulated wisdom of the profession.

Neither is sufficient alone. Together, working with discipline and intention, they can build software that is not just functional but maintainable, not just shipped but understood, not just code but craft.

The divide between “AI slop” and intentional partnership is not about the tools. It’s about us—whether we bring the same standards to AI collaboration that we would (or should) bring to any engineering work.

The tools are new. The standards are not. Let’s hold ourselves to them.


This document is part of the Tasker Core project principles. It reflects one approach to AI-assisted engineering; your mileage may vary. The principles here emerged from practice and continue to evolve with it.

Tasker Core Tenets

These 11 tenets guide all architectural and design decisions in Tasker Core. Each emerged from real implementation experience, root cause analyses, or architectural migrations.


The 11 Tenets

1. Defense in Depth

Multiple overlapping protection layers provide robust idempotency without single-point dependency.

Protection comes from four independent layers:

  • Database-level atomicity: Unique constraints, row locking, compare-and-swap
  • State machine guards: Current state validation before transitions
  • Transaction boundaries: All-or-nothing semantics
  • Application-level filtering: State-based deduplication

Each layer catches what others might miss. No single layer is responsible for all protection.

Origin: Processor UUID ownership was removed when analysis proved it provided redundant protection with harmful side effects (blocking recovery after crashes).

Lesson: Find the minimal set of protections that prevents corruption. Additional layers that prevent recovery are worse than none.


2. Event-Driven with Polling Fallback

Real-time responsiveness via PostgreSQL LISTEN/NOTIFY, with polling as a reliability backstop.

The system supports three deployment modes:

  • EventDrivenOnly: Lowest latency, relies on pg_notify
  • PollingOnly: Traditional polling, higher latency but simple
  • Hybrid (recommended): Event-driven primary, polling fallback

Events can be missed (network issues, connection drops). Polling ensures eventual consistency.

Origin: Event-driven task claiming was added for low-latency response while preserving reliability guarantees.
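
The sketch below illustrates the Hybrid mode as a race between a pg_notify wakeup and a polling tick. The channel name, interval, and claiming function are illustrative assumptions, not the actual implementation.

// Hypothetical sketch of Hybrid mode: wake on NOTIFY when possible, but
// never wait longer than the polling interval, so dropped notifications
// are still picked up eventually.
use sqlx::postgres::PgListener;
use std::time::Duration;

async fn hybrid_claim_loop(database_url: &str) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect(database_url).await?;
    listener.listen("task_ready").await?; // channel name is illustrative
    let mut poll = tokio::time::interval(Duration::from_secs(5));

    loop {
        tokio::select! {
            // Event-driven path: a pg_notify arrived.
            notification = listener.recv() => {
                let _notification = notification?;
                claim_ready_tasks().await;
            }
            // Polling fallback: fires even if notifications were missed.
            _ = poll.tick() => {
                claim_ready_tasks().await;
            }
        }
    }
}

async fn claim_ready_tasks() {
    // Placeholder for the actual claiming logic.
}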


3. Composition Over Inheritance

Mixins and traits for handler capabilities, not class hierarchies.

Handler capabilities are composed via mixins:

Not:
  class Handler < API

But:
  class Handler < Base
    include API
    include Decision
    include Batchable
  end

This pattern enables:

  • Selective capability inclusion
  • Clear separation of concerns
  • Easier testing of individual capabilities
  • No diamond inheritance problems
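
In the Rust worker, the same principle shows up as trait composition. The sketch below is illustrative only; the trait and handler names are hypothetical, not real Tasker types.

// Illustrative only: capabilities as small traits that a handler opts into,
// rather than a class hierarchy it is forced to inherit from.
trait ApiCapable {
    fn request_timeout_secs(&self) -> u64 {
        30
    }
}

trait Batchable {
    fn batch_size(&self) -> u32 {
        1000
    }
}

struct OrderHandler;

// OrderHandler composes exactly the capabilities it needs.
impl ApiCapable for OrderHandler {}
impl Batchable for OrderHandler {
    fn batch_size(&self) -> u32 {
        500
    }
}

fn main() {
    let handler = OrderHandler;
    println!(
        "timeout: {}s, batch size: {}",
        handler.request_timeout_secs(),
        handler.batch_size()
    );
}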

Origin: Analysis of cross-language handler harmonization revealed Batchable handlers already used composition. This was identified as the target architecture for all handlers.

See also: Composition Over Inheritance


4. Cross-Language Consistency

Unified developer-facing APIs across Rust, Ruby, Python, and TypeScript.

Consistent touchpoints include:

  • Handler signatures: call(context) pattern
  • Result factories: success(data) / failure(error, retry_on)
  • Registry APIs: register_handler(name, handler)
  • Specialized patterns: API, Decision, Batchable

Each language expresses these idiomatically while maintaining conceptual consistency.
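
The sketch below shows that conceptual shape in Rust; the type names and signatures are simplified assumptions, not the exact published API.

// Simplified sketch of the call(context) / success / failure shape.
use std::collections::HashMap;

struct StepContext {
    inputs: HashMap<String, String>,
}

enum StepResult {
    Success { data: HashMap<String, String> },
    Failure { error: String, retryable: bool },
}

trait StepHandler {
    fn call(&self, context: &StepContext) -> StepResult;
}

struct ValidateOrder;

impl StepHandler for ValidateOrder {
    fn call(&self, context: &StepContext) -> StepResult {
        match context.inputs.get("order_id") {
            Some(order_id) => StepResult::Success {
                data: HashMap::from([("validated_order_id".to_string(), order_id.clone())]),
            },
            None => StepResult::Failure {
                error: "missing order_id".to_string(),
                retryable: false,
            },
        }
    }
}

fn main() {
    let context = StepContext {
        inputs: HashMap::from([("order_id".to_string(), "42".to_string())]),
    };
    match ValidateOrder.call(&context) {
        StepResult::Success { data } => println!("ok: {data:?}"),
        StepResult::Failure { error, .. } => println!("failed: {error}"),
    }
}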

Origin: Cross-language API alignment established the “one obvious way” philosophy.

See also: Cross-Language Consistency


5. Actor-Based Decomposition

Lightweight actors for lifecycle management and clear boundaries.

Orchestration uses four core actors:

  1. TaskRequestActor: Task initialization
  2. ResultProcessorActor: Step result processing
  3. StepEnqueuerActor: Batch step enqueueing
  4. TaskFinalizerActor: Task completion

Worker uses five specialized actors:

  1. StepExecutorActor: Step execution coordination
  2. FFICompletionActor: FFI completion handling
  3. TemplateCacheActor: Template cache management
  4. DomainEventActor: Event dispatching
  5. WorkerStatusActor: Status and health

Each actor handles specific message types, enabling testability and clear ownership.

Origin: Actor-pattern refactoring broke monolithic processors (1,575 LOC) into focused files of roughly 150 LOC each.
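
The sketch below shows the general shape of such an actor: one owned mailbox, one message enum, one loop. The message and actor names are hypothetical, not the real ones.

// Minimal actor sketch: a task that owns a bounded receiver and handles
// one message type until asked to stop.
use tokio::sync::mpsc;

enum StepMessage {
    Execute { step_uuid: String },
    Shutdown,
}

async fn step_executor_actor(mut inbox: mpsc::Receiver<StepMessage>) {
    while let Some(message) = inbox.recv().await {
        match message {
            StepMessage::Execute { step_uuid } => {
                // The real actor would dispatch to a handler here.
                println!("executing step {step_uuid}");
            }
            StepMessage::Shutdown => break,
        }
    }
}

#[tokio::main]
async fn main() {
    // Bounded mailbox: backpressure applies when the actor falls behind.
    let (tx, rx) = mpsc::channel(64);
    let actor = tokio::spawn(step_executor_actor(rx));

    tx.send(StepMessage::Execute { step_uuid: "abc".into() }).await.unwrap();
    tx.send(StepMessage::Shutdown).await.unwrap();
    actor.await.unwrap();
}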


6. State Machine Rigor

Dual state machines (Task + Step) for atomic transitions with full audit trails.

Task states (12; main path shown): Pending → Initializing → EnqueuingSteps → StepsInProcess → EvaluatingResults → Complete/Error

Step states (8; main path shown): Pending → Enqueued → InProgress → Complete/Error

All transitions are:

  • Atomic (compare-and-swap at database level)
  • Audited (full history in transitions table)
  • Validated (state guards prevent invalid transitions)

Origin: Enhanced state machines with richer task states were introduced for better workflow visibility.
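
A minimal sketch of the compare-and-swap idea at the SQL level follows; the table and column names are illustrative, not the actual schema, and the audit-trail insert into the transitions table is omitted.

// Illustrative compare-and-swap: the UPDATE only succeeds if the row is
// still in the expected state, so concurrent writers cannot both win.
use sqlx::PgPool;

async fn transition_step(
    pool: &PgPool,
    step_uuid: &str,
    from_state: &str,
    to_state: &str,
) -> Result<bool, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE workflow_steps SET state = $1 WHERE step_uuid = $2 AND state = $3",
    )
    .bind(to_state)
    .bind(step_uuid)
    .bind(from_state)
    .execute(pool)
    .await?;

    // rows_affected() == 0 means another actor transitioned the step first;
    // the caller fails cleanly instead of corrupting state.
    Ok(result.rows_affected() == 1)
}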


7. Audit Before Enforce

Track for observability, don’t block for “ownership.”

Processor UUID is tracked in every transition for:

  • Debugging (which instance processed which step)
  • Audit trails (full history of processing)
  • Metrics (load distribution analysis)

But not enforced for:

  • Ownership claims (blocks recovery)
  • Permission checks (redundant with state guards)

Origin: Ownership enforcement removal proved that audit trails provide value without enforcement costs.

Key insight: When two actors receive identical messages, the first succeeds atomically and the second fails cleanly: no partial state, no corruption.


8. Pre-Alpha Freedom

Break things early to get architecture right.

In pre-alpha phase:

  • Breaking changes are encouraged when architecture is fundamentally unsound
  • No backward compatibility required for greenfield work
  • Migration debt is cheaper than technical debt
  • “Perfect” is the enemy of “architecturally sound”

This freedom enables:

  • Rapid iteration on core patterns
  • Learning from real implementation
  • Correcting course before users depend on specifics

Origin: All major refactoring efforts made breaking changes that improved architecture fundamentally.


9. PostgreSQL as Foundation

Database-level guarantees with flexible messaging (PGMQ default, RabbitMQ optional).

PostgreSQL provides:

  • State storage: Task and step state with transactional guarantees
  • Advisory locks: Distributed coordination primitives
  • Atomic functions: State transitions in single round-trip
  • Row-level locking: Prevents concurrent modification

Messaging is provider-agnostic:

  • PGMQ (default): Message queue built on PostgreSQL—single-dependency deployment
  • RabbitMQ (optional): For high-throughput or existing broker infrastructure

The database is not just storage—it’s the coordination layer. Message delivery is pluggable.

Origin: A foundational architecture decision. PostgreSQL’s transactional guarantees eliminate entire classes of distributed systems problems; the messaging abstraction was added for deployment flexibility.


10. Bounded Resources

All channels bounded, backpressure everywhere.

Every MPSC channel is:

  • Bounded: Fixed capacity, no unbounded memory growth
  • Configurable: Sizes set via TOML configuration
  • Monitored: Backpressure metrics exposed

Semaphores limit concurrent handler execution. Circuit breakers protect downstream services.

Origin: Bounded MPSC channels were mandated after analysis of unbounded channel risks.

Rule: Never use unbounded_channel(). Always configure bounds via TOML.
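
A short sketch of the rule in practice (the capacity value is hard-coded here for illustration; in Tasker it comes from TOML configuration):

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Wrong: mpsc::unbounded_channel() can grow without limit under load.
    // let (tx, rx) = mpsc::unbounded_channel::<String>();

    // Right: a bounded channel; send() awaits when the buffer is full,
    // which is exactly the backpressure this tenet asks for.
    let capacity = 1024; // in practice, loaded from TOML configuration
    let (tx, mut rx) = mpsc::channel::<String>(capacity);

    tx.send("step result".to_string()).await.unwrap();
    if let Some(message) = rx.recv().await {
        println!("received: {message}");
    }
}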


11. Fail Loudly

A system that lies is worse than one that fails. Errors are first-class citizens, not inconveniences to hide.

When data is missing, malformed, or unexpected:

  • Return errors, not fabricated defaults
  • Propagate failures up the call stack
  • Make problems visible immediately, not downstream
  • Trust nothing that hasn’t been validated

Silent defaults create phantom data—values that look valid but represent nothing real. A monitoring system that receives 0% utilization cannot distinguish “system is idle” from “data was missing.”

What this means in practice:

| Scenario | Wrong Approach | Right Approach |
|---|---|---|
| gRPC response missing field | Return default value | Return InvalidResponse error |
| Config section absent | Use empty/zero defaults | Fail with clear message |
| Health check data missing | Fabricate “unknown” status | Error: “health data unavailable” |
| Optional vs Required | Treat all as optional | Distinguish explicitly in types |

The trust equation:

A client that returns fabricated data
  = A client that lies to you
  = Worse than a client that fails loudly
  = Debugging phantom bugs in production

Origin: gRPC client refactoring revealed pervasive unwrap_or_default() patterns that silently fabricated response data. Analysis showed consumers could receive “valid-looking” responses containing entirely phantom data, breaking the trust contract between client and caller.

Key insight: When a gRPC server omits required fields, that’s a protocol violation—not an opportunity to be “helpful” with defaults. The server is broken; pretending otherwise delays the fix and misleads operators.

Rule: Never use unwrap_or_default() or unwrap_or_else(|| fabricated_value) for required fields. Use ok_or_else(|| ClientError::invalid_response(...)) instead.
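
A hedged sketch of the rule; the error and response types are illustrative, not the real client code:

// Illustrative: required fields on a decoded response must produce errors,
// never fabricated defaults.
#[derive(Debug)]
enum ClientError {
    InvalidResponse(String),
}

struct HealthResponse {
    utilization_percent: Option<f64>, // optional in the wire format
}

fn utilization(response: &HealthResponse) -> Result<f64, ClientError> {
    // Wrong: response.utilization_percent.unwrap_or_default() would turn
    // "data was missing" into a plausible-looking 0.0.
    response
        .utilization_percent
        .ok_or_else(|| ClientError::InvalidResponse("utilization_percent missing".into()))
}

fn main() {
    let response = HealthResponse { utilization_percent: None };
    // The caller sees the protocol violation immediately.
    println!("{:?}", utilization(&response));
}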


Meta-Principles

These overarching themes emerge from the tenets:

  1. Simplicity Over Elegance: The minimal protection set that prevents corruption beats layered defense that prevents recovery

  2. Observation-Driven Design: Let real behavior (parallel execution, edge cases) guide architecture

  3. Explicit Over Implicit: Make boundaries, layers, and decisions visible in documentation and code

  4. Consistency Without Uniformity: Align APIs while preserving language idioms

  5. Separation of Concerns: Orchestration handles state and coordination; workers handle execution and domain events

  6. Errors Over Defaults: When in doubt, fail with a clear error rather than proceeding with fabricated data


Applying These Tenets

When making design decisions:

  1. Check against tenets: Does this violate any of the 11 tenets?
  2. Find the precedent: Has a similar decision been made before? (See ticket-specs)
  3. Document the trade-off: What are you gaining and giving up?
  4. Consider recovery: If this fails, how does the system recover?

When reviewing code:

  1. Bounded resources: Are all channels bounded? All concurrency limited?
  2. State machine compliance: Do transitions use atomic database operations?
  3. Language consistency: Does the API align with other language workers?
  4. Composition pattern: Are capabilities mixed in rather than inherited?
  5. Fail loudly: Are missing/invalid data handled with errors, not silent defaults?

Twelve-Factor App Alignment

The Twelve-Factor App methodology, authored by Adam Wiggins and contributors at Heroku, has been a foundational influence on Tasker Core’s systems design. These principles were not adopted as a checklist but absorbed over years of building production systems. Some factors are deeply embedded in the architecture; others remain aspirational or partially realized.

This document maps each factor to where it shows up in the codebase, where we fall short, and what contributors should keep in mind. It is meant as practical guidance, not a compliance scorecard.


I. Codebase

One codebase tracked in revision control, many deploys.

Tasker Core is a single Git monorepo containing all deployable services: orchestration server, workers (Rust, Ruby, Python, TypeScript), CLI, and shared libraries.

Where this lives:

  • Root Cargo.toml defines the workspace with all crate members
  • Environment-specific Docker Compose files produce different deploys from the same source: docker/docker-compose.prod.yml, docker/docker-compose.dev.yml, docker/docker-compose.test.yml, docker/docker-compose.ci.yml
  • Feature flags (web-api, grpc-api, test-services, test-cluster) control build variations without code branches

Gaps: The monorepo means all crates share a single version today (v0.1.0). As the project matures toward publishing crates independently, version coordination and release-management tooling will need to evolve.


II. Dependencies

Explicitly declare and isolate dependencies.

Rust’s Cargo ecosystem makes this natural. All dependencies are declared in Cargo.toml with workspace-level management and pinned in Cargo.lock.

Where this lives:

  • Root Cargo.toml [workspace.dependencies] section — single source of truth for shared dependency versions
  • Cargo.lock committed to the repository for reproducible builds
  • Multi-stage Docker builds (docker/build/orchestration.prod.Dockerfile) use cargo-chef for cached, reproducible dependency resolution
  • No runtime dependency fetching — everything resolved at build time

Gaps: FFI workers each bring their own dependency ecosystem (Python’s uv/pyproject.toml, Ruby’s Bundler/Gemfile, TypeScript’s bun/package.json). These are well-declared but not unified — contributors working across languages need to manage multiple lock files.


III. Config

Store config in the environment.

This is one of the strongest alignments. All runtime configuration flows through environment variables, with TOML files providing structured defaults that reference those variables.

Where this lives:

  • config/dotenv/ — environment-specific .env files (base.env, test.env, orchestration.env)
  • config/tasker/base/*.toml — role-based defaults with ${ENV_VAR:-default} interpolation
  • config/tasker/environments/{test,development,production}/ — environment overrides
  • docker/.env.prod.template — production variable template
  • tasker-shared/src/config/ — config loading with environment variable resolution
  • No secrets in source: DATABASE_URL, POSTGRES_PASSWORD, JWT keys all via environment

For contributors: Never hard-code connection strings, credentials, or deployment-specific values. Use environment variables with sensible defaults in the TOML layer. The configuration structure is role-based (orchestration/worker/common), not component-based — see CLAUDE.md for details.
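
A minimal Rust sketch of the env-with-default idea behind the ${ENV_VAR:-default} interpolation (not the actual config loader; the helper name is hypothetical):

use std::env;

// Illustrative helper mirroring the ${VAR:-default} pattern used in the
// TOML layer: read from the environment, fall back to a sensible default.
fn env_or_default(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

fn main() {
    let bind_address = env_or_default("TASKER_WEB_BIND_ADDRESS", "0.0.0.0:8080");
    println!("binding to {bind_address}");
}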


IV. Backing Services

Treat backing services as attached resources.

Backing services are abstracted behind trait interfaces and swappable via configuration alone.

Where this lives:

  • Database: PostgreSQL connection via DATABASE_URL, pool settings in config/tasker/base/common.toml under [common.database.pool]
  • Messaging: PGMQ or RabbitMQ selected via TASKER_MESSAGING_BACKEND environment variable — same code paths, different drivers
  • Cache: Redis, Moka (in-process), or disabled entirely via [common.cache] configuration
  • Observability: OpenTelemetry with pluggable backends (Honeycomb, Jaeger, Grafana Tempo) via OTEL_EXPORTER_OTLP_ENDPOINT
  • Circuit breakers protect against backing service failures: [common.circuit_breakers.component_configs]

For contributors: When adding a new backing service dependency, ensure it can be configured via environment variables and that the system degrades gracefully when it’s unavailable. Follow the messaging abstraction pattern — trait-based interfaces, not concrete types.
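
The sketch below shows the general shape of that pattern; the trait, type, and queue names are hypothetical, and the real interface is async rather than synchronous.

// Hypothetical sketch: the backing service sits behind a trait, so switching
// providers is a configuration decision, not a code change.
trait MessageQueue {
    fn enqueue(&self, queue_name: &str, payload: &str) -> Result<(), String>;
}

struct PgmqBackend;
struct RabbitMqBackend;

impl MessageQueue for PgmqBackend {
    fn enqueue(&self, queue_name: &str, _payload: &str) -> Result<(), String> {
        println!("PGMQ enqueue to {queue_name}");
        Ok(())
    }
}

impl MessageQueue for RabbitMqBackend {
    fn enqueue(&self, queue_name: &str, _payload: &str) -> Result<(), String> {
        println!("RabbitMQ enqueue to {queue_name}");
        Ok(())
    }
}

fn select_backend(name: &str) -> Box<dyn MessageQueue> {
    match name {
        "rabbitmq" => Box::new(RabbitMqBackend),
        _ => Box::new(PgmqBackend), // PGMQ is the default
    }
}

fn main() {
    let backend_name =
        std::env::var("TASKER_MESSAGING_BACKEND").unwrap_or_else(|_| "pgmq".to_string());
    let queue = select_backend(&backend_name);
    queue.enqueue("step_results", "{}").unwrap();
}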


V. Build, Release, Run

Strictly separate build and run stages.

The Docker build pipeline enforces this cleanly with multi-stage builds.

Where this lives:

  • Build: docker/build/orchestration.prod.Dockerfilecargo-chef dependency caching, cargo build --release --all-features --locked, binary stripping
  • Release: Tagged Docker images with only runtime dependencies (no build tools), non-root user (tasker:999), read-only config mounts
  • Run: docker/scripts/orchestration-entrypoint.sh — environment validation, database availability check, migrations, then exec into the application binary
  • Deployment modes control startup behavior: standard, migrate-only, no-migrate, safe, emergency

Gaps: Local development doesn’t enforce the same separation — developers run cargo run directly, which conflates build and run. This is fine for development ergonomics but worth noting as a difference from the production path.


VI. Processes

Execute the app as one or more stateless processes.

All persistent state lives in PostgreSQL. Processes can be killed and restarted at any time without data loss.

Where this lives:

  • Orchestration server: stateless HTTP/gRPC service backed by tasker.tasks and tasker.steps tables
  • Workers: claim steps from message queues, execute handlers, write results back — no in-memory state across requests
  • Message queue visibility timeouts (visibility_timeout_seconds in worker config) ensure unacknowledged messages are reclaimed by other workers
  • Docker Compose replicas setting scales workers horizontally

For contributors: Never store workflow state in memory across requests. If you need coordination state, it belongs in PostgreSQL. In-memory caches (Moka) are optimization layers, not sources of truth — the system must function correctly without them.


VII. Port Binding

Export services via port binding.

Each service is self-contained and binds its own ports.

Where this lives:

  • REST: config/tasker/base/orchestration.toml[orchestration.web] bind_address = "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
  • gRPC: [orchestration.grpc] bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
  • Worker REST/gRPC on separate ports (8081/9191)
  • Health endpoints on both transports for load balancer integration
  • Docker exposes ports via environment-configurable mappings

VIII. Concurrency

Scale out via the process model.

The system scales horizontally by adding worker processes and vertically by tuning concurrency settings.

Where this lives:

  • Horizontal: docker/docker-compose.prod.ymlreplicas: ${WORKER_REPLICAS:-2}, each worker is independent
  • Vertical: config/tasker/base/orchestration.tomlmax_concurrent_operations, batch_size per event system
  • Worker handler parallelism: [worker.mpsc_channels.handler_dispatch] max_concurrent_handlers = 10
  • Load shedding: [worker.mpsc_channels.handler_dispatch.load_shedding] capacity_threshold_percent = 80.0

Gaps: The actor pattern within a single process is more vertical than horizontal — actors share a Tokio runtime and scale via async concurrency, not OS processes. This is a pragmatic choice for Rust’s async model but means single-process scaling has limits that multiple processes solve.


IX. Disposability

Maximize robustness with fast startup and graceful shutdown.

This factor gets significant attention due to the distributed nature of task orchestration.

Where this lives:

  • Graceful shutdown: Signal handlers (SIGTERM, SIGINT) in tasker-orchestration/src/bin/server.rs and tasker-worker/src/bin/ — actors drain in-flight work, OpenTelemetry flushes spans, connections close cleanly
  • Fast startup: Compiled binary, pooled database connections, environment-driven config (no service discovery delays)
  • Crash recovery: PGMQ visibility timeouts requeue unacknowledged messages; steps claimed by a crashed worker reappear for others after visibility_timeout_seconds
  • Entrypoint: docker/scripts/orchestration-entrypoint.sh uses exec to replace shell with app process (proper PID 1 signal handling)
  • Health checks: Docker start_period allows grace time before liveness probes begin

For contributors: When adding new async subsystems, ensure they participate in the shutdown sequence. Bounded channels and drain timeouts (shutdown_drain_timeout_ms) prevent shutdown from hanging indefinitely.
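
A hedged sketch of that shutdown shape using tokio signal handling; the drain logic and timeout value are simplified placeholders, not the actual server code.

use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut sigterm = signal(SignalKind::terminate())?;
    let (shutdown_tx, mut shutdown_rx) = mpsc::channel::<()>(1);

    // Worker loop: does work until asked to drain.
    let worker = tokio::spawn(async move {
        loop {
            tokio::select! {
                _ = shutdown_rx.recv() => {
                    // Drain in-flight work here before exiting.
                    break;
                }
                _ = tokio::time::sleep(Duration::from_millis(100)) => {
                    // ... process the next message ...
                }
            }
        }
    });

    // Wait for SIGTERM or Ctrl-C, then ask the worker to drain.
    tokio::select! {
        _ = sigterm.recv() => {}
        _ = tokio::signal::ctrl_c() => {}
    }
    let _ = shutdown_tx.send(()).await;

    // Bound the drain so shutdown cannot hang indefinitely.
    let _ = tokio::time::timeout(Duration::from_secs(10), worker).await;
    Ok(())
}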


X. Dev/Prod Parity

Keep development, staging, and production as similar as possible.

The same code, same migrations, and same config structure run everywhere — only values change.

Where this lives:

  • config/tasker/base/ provides defaults; config/tasker/environments/ overrides per-environment — structure is identical
  • migrations/ directory contains SQL migrations shared across all environments
  • Docker images use the same base (debian:bullseye-slim) and runtime user (tasker:999)
  • Structured logging format (tracing crate) is consistent; only verbosity changes (RUST_LOG)
  • E2E tests (--features test-services) exercise the same code paths as production

Gaps: Development uses cargo run with debug builds while production uses release-optimized Docker images. The observability stack (Grafana LGTM) is available in docker-compose.dev.yml but most local development happens without it. These are standard trade-offs, but contributors should periodically test against the full Docker stack to catch environment-specific issues.


XI. Logs

Treat logs as event streams.

All logging goes to stdout/stderr. No file-based logging is built into the application.

Where this lives:

  • tasker-shared/src/logging.rs — tracing subscriber writes to stdout, JSON format in production, ANSI colors in development (TTY-detected)
  • OpenTelemetry integration exports structured traces via OTEL_EXPORTER_OTLP_ENDPOINT
  • Correlation IDs (correlation_id) propagate through tasks, steps, actors, and message queues for distributed tracing
  • docker-compose.dev.yml includes Loki for log aggregation and Grafana for visualization
  • Entrypoint scripts log to stdout/stderr with role-prefixed format

For contributors: Use the tracing crate’s #[instrument] macro and structured fields (tracing::info!(task_id = %id, "processing")) rather than string interpolation. Never write to log files directly.
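
For example, a minimal sketch of that style (the function and field names are illustrative):

use tracing::{info, instrument};

#[instrument(skip(payload))]
fn process_step(step_uuid: &str, payload: &str) {
    // Structured fields, not string interpolation: these become queryable
    // attributes in the exported event stream.
    info!(step_uuid = %step_uuid, payload_bytes = payload.len(), "processing step");
}

fn main() {
    // Subscriber writes structured JSON events to stdout
    // (requires tracing_subscriber's `json` feature).
    tracing_subscriber::fmt().json().init();
    process_step("019ab6f9-7a2a", "{\"order_id\": 42}");
}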


XII. Admin Processes

Run admin/management tasks as one-off processes.

The CLI and deployment scripts serve this role.

Where this lives:

  • tasker-ctl/ — task management (create, list, cancel), DLQ investigation (dlq list, dlq recover), system health, auth token management
  • docker/scripts/orchestration-entrypoint.shDEPLOYMENT_MODE=migrate-only runs migrations and exits without starting the server
  • config-validator binary validates TOML configuration as a one-off check
  • Database migrations run as a distinct phase before application startup, with retry logic and timeout protection

Gaps: Some administrative operations (cache invalidation, circuit breaker reset) are only available through the REST/gRPC API, not the CLI. As the CLI matures, these should become first-class admin commands.


Using This as a Contributor

These factors are not rules to enforce mechanically. They’re a lens for evaluating design decisions:

  • Adding a new service dependency? Factor IV says treat it as an attached resource — configure via environment, degrade gracefully without it.
  • Storing state? Factor VI says processes are stateless — put it in PostgreSQL, not in memory.
  • Adding configuration? Factor III says environment variables — use the existing TOML-with-env-var-interpolation pattern.
  • Writing logs? Factor XI says event streams — stdout, structured fields, correlation IDs.
  • Building deployment artifacts? Factor V says separate build/release/run — don’t bake configuration into images.

When a factor conflicts with practical needs, document the trade-off. The goal is not purity but awareness.


Attribution

The Twelve-Factor App methodology was created by Adam Wiggins with contributions from many others, originally published at 12factor.net. It is made available under the MIT License and has influenced how a generation of developers think about building software-as-a-service applications. Its influence on this project is gratefully acknowledged.

PEP: 20
Title: The Zen of Python
Author: Tim Peters <tim.peters@gmail.com>
Status: Active
Type: Informational
Created: 19-Aug-2004
Post-History: 22-Aug-2004

Abstract

Long time Pythoneer Tim Peters succinctly channels the BDFL’s guiding principles for Python’s design into 20 aphorisms, only 19 of which have been written down.

The Zen of Python


Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Easter Egg


import this

References

Originally posted to comp.lang.python / python-list@python.org in a thread called “The Way of Python”: https://groups.google.com/d/msg/comp.lang.python/B_VxeTBClM0/L8W9KlsiriUJ

Copyright

This document has been placed in the public domain.

Tasker Core Reference

This directory contains technical reference documentation with precise specifications and implementation details.

Documents

| Document | Description |
|---|---|
| StepContext API | Cross-language API reference for step handlers |
| Table Management | Database table structure and management |
| Task and Step Readiness | SQL functions and execution logic |
| sccache Configuration | Build caching setup |
| Library Deployment Patterns | Library distribution strategies |
| FFI Telemetry Pattern | Cross-language telemetry integration |

When to Read These

  • Need exact behavior: Consult these for precise specifications
  • Debugging edge cases: Check implementation details
  • Database operations: See Table Management and SQL functions
  • Build optimization: Review sccache Configuration

FFI Boundary Types Reference

Cross-language type harmonization for Rust, Python, and TypeScript boundaries.

This document defines the canonical FFI boundary types that cross the Rust orchestration layer and the Python/TypeScript worker implementations. These types are critical for correct serialization/deserialization between languages.

Overview

The tasker-core system uses FFI (Foreign Function Interface) to integrate Rust orchestration with Python and TypeScript step handlers. Data crosses this boundary via JSON serialization. These types must remain consistent across all three languages.

Source of Truth: Rust types in tasker-shared/src/messaging/execution_types.rs and tasker-shared/src/models/core/batch_worker.rs.

Type Mapping

| Rust Type | Python Type | TypeScript Type |
|---|---|---|
| CursorConfig | RustCursorConfig | RustCursorConfig |
| BatchProcessingOutcome | BatchProcessingOutcome | BatchProcessingOutcome |
| BatchWorkerInputs | RustBatchWorkerInputs | RustBatchWorkerInputs |
| BatchMetadata | BatchMetadata | BatchMetadata |
| FailureStrategy | FailureStrategy | FailureStrategy |

CursorConfig

Cursor configuration for a single batch’s position and range.

Flexible Cursor Types

Unlike simple integer cursors, RustCursorConfig supports flexible cursor values:

  • Integer for record IDs: 123
  • String for timestamps: "2025-11-01T00:00:00Z"
  • Object for composite keys: {"page": 1, "offset": 0}

This enables cursor-based pagination across diverse data sources.

Rust Definition

#![allow(unused)]
fn main() {
// tasker-shared/src/messaging/execution_types.rs
pub struct CursorConfig {
    pub batch_id: String,
    pub start_cursor: serde_json::Value,  // Flexible type
    pub end_cursor: serde_json::Value,    // Flexible type
    pub batch_size: u32,
}
}

TypeScript Definition

// workers/typescript/src/types/batch.ts
export interface RustCursorConfig {
  batch_id: string;
  start_cursor: unknown;  // Flexible: number | string | object
  end_cursor: unknown;
  batch_size: number;
}

Python Definition

# workers/python/python/tasker_core/types.py
from typing import Any

from pydantic import BaseModel

class RustCursorConfig(BaseModel):
    batch_id: str
    start_cursor: Any  # Flexible: int | str | dict
    end_cursor: Any
    batch_size: int

JSON Wire Format

{
  "batch_id": "batch_001",
  "start_cursor": 0,
  "end_cursor": 1000,
  "batch_size": 1000
}

BatchProcessingOutcome

Discriminated union representing the outcome of a batchable step.

Rust Definition

#![allow(unused)]
fn main() {
// tasker-shared/src/messaging/execution_types.rs
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BatchProcessingOutcome {
    NoBatches,
    CreateBatches {
        worker_template_name: String,
        worker_count: u32,
        cursor_configs: Vec<CursorConfig>,
        total_items: u64,
    },
}
}

TypeScript Definition

// workers/typescript/src/types/batch.ts
export interface NoBatchesOutcome {
  type: 'no_batches';
}

export interface CreateBatchesOutcome {
  type: 'create_batches';
  worker_template_name: string;
  worker_count: number;
  cursor_configs: RustCursorConfig[];
  total_items: number;
}

export type BatchProcessingOutcome = NoBatchesOutcome | CreateBatchesOutcome;

Python Definition

# workers/python/python/tasker_core/types.py
class NoBatchesOutcome(BaseModel):
    type: str = "no_batches"

class CreateBatchesOutcome(BaseModel):
    type: str = "create_batches"
    worker_template_name: str
    worker_count: int
    cursor_configs: list[RustCursorConfig]
    total_items: int

BatchProcessingOutcome = NoBatchesOutcome | CreateBatchesOutcome

JSON Wire Formats

NoBatches:

{
  "type": "no_batches"
}

CreateBatches:

{
  "type": "create_batches",
  "worker_template_name": "batch_worker_template",
  "worker_count": 5,
  "cursor_configs": [
    { "batch_id": "001", "start_cursor": 0, "end_cursor": 1000, "batch_size": 1000 },
    { "batch_id": "002", "start_cursor": 1000, "end_cursor": 2000, "batch_size": 1000 }
  ],
  "total_items": 5000
}

BatchWorkerInputs

Initialization inputs for batch worker instances, stored in workflow_steps.inputs.

Rust Definition

#![allow(unused)]
fn main() {
// tasker-shared/src/models/core/batch_worker.rs
pub struct BatchWorkerInputs {
    pub cursor: CursorConfig,
    pub batch_metadata: BatchMetadata,
    pub is_no_op: bool,
}

pub struct BatchMetadata {
    // checkpoint_interval removed - handlers decide when to checkpoint
    pub cursor_field: String,
    pub failure_strategy: FailureStrategy,
}

pub enum FailureStrategy {
    ContinueOnFailure,
    FailFast,
    Isolate,
}
}

TypeScript Definition

// workers/typescript/src/types/batch.ts
export type FailureStrategy = 'continue_on_failure' | 'fail_fast' | 'isolate';

export interface BatchMetadata {
  // checkpoint_interval removed - handlers decide when to checkpoint
  cursor_field: string;
  failure_strategy: FailureStrategy;
}

export interface RustBatchWorkerInputs {
  cursor: RustCursorConfig;
  batch_metadata: BatchMetadata;
  is_no_op: boolean;
}

Python Definition

# workers/python/python/tasker_core/types.py
class FailureStrategy(str, Enum):
    CONTINUE_ON_FAILURE = "continue_on_failure"
    FAIL_FAST = "fail_fast"
    ISOLATE = "isolate"

class BatchMetadata(BaseModel):
    # checkpoint_interval removed - handlers decide when to checkpoint
    cursor_field: str
    failure_strategy: FailureStrategy

class RustBatchWorkerInputs(BaseModel):
    cursor: RustCursorConfig
    batch_metadata: BatchMetadata
    is_no_op: bool

JSON Wire Format

{
  "cursor": {
    "batch_id": "batch_001",
    "start_cursor": 0,
    "end_cursor": 1000,
    "batch_size": 1000
  },
  "batch_metadata": {
    "cursor_field": "id",
    "failure_strategy": "continue_on_failure"
  },
  "is_no_op": false
}

BatchAggregationResult

Standardized result from aggregating multiple batch worker results.

Cross-Language Standard

All three languages produce identical aggregation results:

| Field | Type | Description |
|---|---|---|
| total_processed | int | Items processed across all batches |
| total_succeeded | int | Items that succeeded |
| total_failed | int | Items that failed |
| total_skipped | int | Items that were skipped |
| batch_count | int | Number of batch workers that ran |
| success_rate | float | Success rate (0.0 to 1.0) |
| errors | array | Collected errors (limited to 100) |
| error_count | int | Total error count |

Usage Examples

TypeScript:

import { aggregateBatchResults } from 'tasker-core';

const workerResults = Object.values(context.previousResults)
  .filter(r => r?.batch_worker);
const summary = aggregateBatchResults(workerResults);
return this.success(summary);

Python:

from tasker_core.types import aggregate_batch_results

worker_results = [
    context.get_dependency_result(f"worker_{i}")
    for i in range(batch_count)
]
summary = aggregate_batch_results(worker_results)
return self.success(summary.model_dump())

Factory Functions

Creating BatchProcessingOutcome

TypeScript:

import { noBatches, createBatches, RustCursorConfig } from 'tasker-core';

// No batches needed
const outcome1 = noBatches();

// Create batch workers
const configs: RustCursorConfig[] = [
  { batch_id: '001', start_cursor: 0, end_cursor: 1000, batch_size: 1000 },
  { batch_id: '002', start_cursor: 1000, end_cursor: 2000, batch_size: 1000 },
];
const outcome2 = createBatches('process_batch', 2, configs, 2000);

Python:

from tasker_core.types import no_batches, create_batches, RustCursorConfig

# No batches needed
outcome1 = no_batches()

# Create batch workers
configs = [
    RustCursorConfig(batch_id="001", start_cursor=0, end_cursor=1000, batch_size=1000),
    RustCursorConfig(batch_id="002", start_cursor=1000, end_cursor=2000, batch_size=1000),
]
outcome2 = create_batches("process_batch", 2, configs, 2000)

Type Guards (TypeScript)

import {
  BatchProcessingOutcome,
  isNoBatches,
  isCreateBatches
} from 'tasker-core';

function handleOutcome(outcome: BatchProcessingOutcome): void {
  if (isNoBatches(outcome)) {
    console.log('No batches needed');
    return;
  }

  if (isCreateBatches(outcome)) {
    console.log(`Creating ${outcome.worker_count} workers`);
    console.log(`Total items: ${outcome.total_items}`);
  }
}

Migration Notes

From Legacy Types

If migrating from older batch processing types:

  1. CursorConfig → RustCursorConfig: The new type adds a batch_id field and uses flexible cursor types (unknown/Any) instead of fixed number/int.

  2. Inline batch_processing_outcomeBatchProcessingOutcome: Use the discriminated union type with factory functions instead of building JSON manually.

  3. Manual aggregationaggregateBatchResults: Use the standardized aggregation function for consistent cross-language behavior.

Backwards Compatibility

The legacy CursorConfig type (with number/int cursors) is preserved for simple use cases. Use RustCursorConfig when:

  • Working with Rust orchestration inputs
  • Needing flexible cursor types (timestamps, UUIDs, composites)
  • Building BatchProcessingOutcome structures

FFI Telemetry Initialization Pattern

Overview

This document describes the two-phase telemetry initialization pattern for Foreign Function Interface (FFI) integrations where Rust code is called from languages that don’t have a Tokio runtime during initialization (Ruby, Python, WASM).

The Problem

OpenTelemetry batch exporter requires a Tokio runtime context for async I/O operations:

#![allow(unused)]
fn main() {
// This PANICS if called outside a Tokio runtime
let tracer_provider = SdkTracerProvider::builder()
    .with_batch_exporter(exporter)  // ❌ Requires Tokio runtime
    .with_resource(resource)
    .with_sampler(sampler)
    .build();
}

FFI Initialization Timeline:

1. Language Runtime Loads Extension (Ruby, Python, WASM)
   ↓ No Tokio runtime exists yet
2. Extension Init Function Called (Magnus init, PyO3 init, etc.)
   ↓ Logging needed for debugging, but no async runtime
3. Later: Create Tokio Runtime
   ↓ Now safe to initialize telemetry
4. Bootstrap Worker System

The Solution: Two-Phase Initialization

Phase 1: Console-Only Logging (FFI-Safe)

During language extension initialization, use console-only logging that requires no Tokio runtime:

#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs
pub fn init_console_only() {
    // Initialize console logging without OpenTelemetry
    // Safe to call from any thread, no async runtime required
}
}

When to use:

  • During Magnus initialization (Ruby)
  • During PyO3 initialization (Python)
  • During WASM module initialization
  • Any context where no Tokio runtime exists

Phase 2: Full Telemetry (Tokio Context)

After creating the Tokio runtime, initialize full telemetry including OpenTelemetry:

#![allow(unused)]
fn main() {
// Create Tokio runtime
let runtime = tokio::runtime::Runtime::new()?;

// Initialize telemetry in runtime context
runtime.block_on(async {
    tasker_shared::logging::init_tracing();
});
}

When to use:

  • After creating Tokio runtime in bootstrap
  • Inside runtime.block_on() context
  • When async I/O is available

Implementation Guide

Ruby FFI (Magnus)

File Structure:

  • workers/ruby/ext/tasker_core/src/ffi_logging.rs - Phase 1
  • workers/ruby/ext/tasker_core/src/bootstrap.rs - Phase 2

Phase 1: Magnus Initialization

#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/ffi_logging.rs

pub fn init_ffi_logger() -> Result<(), Box<dyn std::error::Error>> {
    // Check if telemetry is enabled
    let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
        .map(|v| v.to_lowercase() == "true")
        .unwrap_or(false);

    if telemetry_enabled {
        // Phase 1: Defer telemetry init to runtime context
        println!("Telemetry enabled - deferring logging init to runtime context");
    } else {
        // Phase 1: Safe to initialize console-only logging
        tasker_shared::logging::init_console_only();
        tasker_shared::log_ffi!(
            info,
            "FFI console logging initialized (no telemetry)",
            component: "ffi_boundary"
        );
    }

    Ok(())
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/bootstrap.rs

pub fn bootstrap_worker() -> Result<Value, Error> {
    // Create tokio runtime
    let runtime = tokio::runtime::Runtime::new()?;

    // Phase 2: Initialize telemetry in Tokio runtime context
    runtime.block_on(async {
        tasker_shared::logging::init_tracing();
    });

    // Continue with bootstrap...
    let system_context = runtime.block_on(async {
        SystemContext::new_for_worker().await
    })?;

    // ... rest of bootstrap
}
}

Python FFI (PyO3)

Phase 1: PyO3 Module Initialization

#![allow(unused)]
fn main() {
// workers/python/src/lib.rs

#[pymodule]
fn tasker_core(py: Python, m: &PyModule) -> PyResult<()> {
    // Check if telemetry is enabled
    let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
        .map(|v| v.to_lowercase() == "true")
        .unwrap_or(false);

    if telemetry_enabled {
        println!("Telemetry enabled - deferring logging init to runtime context");
    } else {
        tasker_shared::logging::init_console_only();
    }

    // Register Python functions...
    m.add_function(wrap_pyfunction!(bootstrap_worker, m)?)?;
    Ok(())
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/python/src/bootstrap.rs

#[pyfunction]
pub fn bootstrap_worker() -> PyResult<String> {
    // Create tokio runtime
    let runtime = tokio::runtime::Runtime::new()
        .map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(
            format!("Failed to create runtime: {}", e)
        ))?;

    // Phase 2: Initialize telemetry in Tokio runtime context
    runtime.block_on(async {
        tasker_shared::logging::init_tracing();
    });

    // Continue with bootstrap...
    let system_context = runtime.block_on(async {
        SystemContext::new_for_worker().await
    })?;

    // ... rest of bootstrap
}
}

WASM FFI

Phase 1: WASM Module Initialization

#![allow(unused)]
fn main() {
// workers/wasm/src/lib.rs

#[wasm_bindgen(start)]
pub fn init_wasm() {
    // Check if telemetry is enabled (from JS environment)
    let telemetry_enabled = js_sys::Reflect::get(
        &js_sys::global(),
        &"TELEMETRY_ENABLED".into()
    ).ok()
    .and_then(|v| v.as_bool())
    .unwrap_or(false);

    if telemetry_enabled {
        web_sys::console::log_1(&"Telemetry enabled - deferring logging init to runtime context".into());
    } else {
        tasker_shared::logging::init_console_only();
    }
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/wasm/src/bootstrap.rs

#[wasm_bindgen]
pub async fn bootstrap_worker() -> Result<JsValue, JsValue> {
    // In WASM, we're already in an async context
    // Initialize telemetry directly
    tasker_shared::logging::init_tracing();

    // Continue with bootstrap...
    let system_context = SystemContext::new_for_worker().await
        .map_err(|e| JsValue::from_str(&format!("Bootstrap failed: {}", e)))?;

    // ... rest of bootstrap
}
}

Docker Configuration

Enable telemetry in docker-compose with appropriate comments:

# docker/docker-compose.test.yml

ruby-worker:
  environment:
    # Two-phase FFI telemetry initialization pattern
    # Phase 1: Magnus init skips telemetry (no runtime)
    # Phase 2: bootstrap_worker() initializes telemetry in Tokio context
    TELEMETRY_ENABLED: "true"
    OTEL_EXPORTER_OTLP_ENDPOINT: http://observability:4317
    OTEL_SERVICE_NAME: tasker-ruby-worker
    OTEL_SERVICE_VERSION: "0.1.0"

Verification

Expected Log Sequence

Ruby Worker with Telemetry Enabled:

1. Magnus init:
Telemetry enabled - deferring logging init to runtime context

2. After runtime creation:
Console logging with OpenTelemetry initialized
  environment=test
  opentelemetry_enabled=true
  otlp_endpoint=http://observability:4317
  service_name=tasker-ruby-worker

3. OpenTelemetry components:
Global meter provider is set
OpenTelemetry Prometheus text exporter initialized

Ruby Worker with Telemetry Disabled:

1. Magnus init:
Console-only logging initialized (FFI-safe mode)
  environment=test
  opentelemetry_enabled=false
  context=ffi_initialization

2. After runtime creation:
(No additional initialization - already complete)

Health Check

All workers should be healthy with telemetry enabled:

$ curl http://localhost:8082/health
{"status":"healthy","timestamp":"...","worker_id":"worker-..."}

Grafana Verification

With all services running with telemetry:

  1. Access Grafana: http://localhost:3000 (admin/admin)
  2. Navigate to Explore → Tempo
  3. Query by service: tasker-ruby-worker
  4. Verify traces appear with correlation IDs

Key Principles

1. Separation of Concerns

  • Infrastructure Decision (Tokio runtime availability): Handled by init functions
  • Business Logic (when to log): Handled by application code
  • Clean separation prevents runtime panics

2. Fail-Safe Defaults

  • Always provide console logging at minimum
  • Telemetry is enhancement, not requirement
  • Graceful degradation if telemetry unavailable

3. Explicit Over Implicit

  • Clear phase separation in code
  • Documented at each call site
  • Easy to understand initialization flow

4. Language-Agnostic Pattern

  • Same pattern works for Ruby, Python, WASM
  • Consistent across all FFI bindings
  • Single source of truth in tasker-shared

Troubleshooting

“no reactor running” Panic

Symptom:

thread 'main' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime'

Cause: Calling init_tracing() when TELEMETRY_ENABLED=true outside a Tokio runtime context.

Solution: Use two-phase pattern:

#![allow(unused)]
fn main() {
// Phase 1: Skip telemetry init
if telemetry_enabled {
    println!("Deferring telemetry init...");
} else {
    init_console_only();
}

// Phase 2: Initialize in runtime
runtime.block_on(async {
    init_tracing();
});
}

Telemetry Not Appearing

Symptom: No traces in Grafana/Tempo despite TELEMETRY_ENABLED=true.

Check:

  1. Verify environment variable is set: TELEMETRY_ENABLED=true
  2. Check logs for initialization message
  3. Verify OTLP endpoint is reachable
  4. Check observability stack is healthy

Debug:

# Check worker logs
docker logs docker-ruby-worker-1 | grep -E "telemetry|OpenTelemetry"

# Check observability stack
curl http://localhost:4317  # Should connect to OTLP gRPC

# Check Grafana Tempo
curl http://localhost:3200/api/status/buildinfo

Performance Considerations

Minimal Overhead

  • Phase 1: Simple console initialization, <1ms
  • Phase 2: Batch exporter initialization, <10ms
  • Total overhead: <15ms during startup
  • Zero runtime overhead after initialization

Memory Usage

  • Console-only: ~100KB (tracing subscriber)
  • With telemetry: ~500KB (includes OTLP client buffers)
  • Acceptable for all deployment scenarios

Future Enhancements

Lazy Telemetry Upgrade

Future optimization could upgrade console-only subscriber to include telemetry without restart:

#![allow(unused)]
fn main() {
// Not yet implemented - requires tracing layer hot-swapping
pub fn upgrade_to_telemetry() -> TaskerResult<()> {
    // Would require custom subscriber implementation
    // to support layer addition after initialization
}
}

Per-Worker Telemetry Control

Could extend pattern to support per-worker telemetry configuration:

#![allow(unused)]
fn main() {
// Not yet implemented
pub fn init_with_config(config: TelemetryConfig) -> TaskerResult<()> {
    // Would allow fine-grained control per worker
}
}

Phase 1.5: Worker Span Instrumentation with Trace Context Propagation

Implemented: 2025-11-24 Status: ✅ Production Ready - Validated end-to-end with Ruby workers

The Challenge

After implementing two-phase telemetry initialization (Phase 1), we discovered a gap: while OpenTelemetry infrastructure was working, worker step execution spans lacked correlation attributes needed for distributed tracing.

The Problem:

  • ✅ Orchestration spans had correlation_id, task_uuid, step_uuid
  • ✅ Worker infrastructure spans existed (read_messages, reserve_capacity)
  • ❌ Worker step execution spans were missing these attributes

Root Cause: Ruby workers use an async dual-event-system architecture where:

  1. Rust worker fires FFI event to Ruby (via EventPoller polling every 10ms)
  2. Ruby processes event asynchronously
  3. Ruby returns completion via FFI

The async boundary made traditional span scope maintenance impossible.

The Solution: Trace ID Propagation Pattern

Instead of trying to maintain span scope across the async FFI boundary, we propagate trace context as opaque strings:

Rust: Extract trace_id/span_id → Add to FFI event payload →
Ruby: Treat as opaque strings → Propagate through processing → Include in completion →
Rust: Create linked span using returned trace_id/span_id

Key Insight: Ruby doesn’t need to understand OpenTelemetry; it simply passes trace IDs through, just as it already does with correlation_id.

Implementation: Rust Side (Phase 1.5a)

File: tasker-worker/src/worker/command_processor.rs

Step 1: Create instrumented span with all required attributes

#![allow(unused)]
fn main() {
use tracing::{span, event, Level, Instrument};

pub async fn handle_execute_step(&self, step_message: SimpleStepMessage) -> TaskerResult<()> {
    // Fetch step details to get step_name and namespace
    let task_sequence_step = self.fetch_task_sequence_step(&step_message).await?;

    // Create span with all 5 required attributes
    let step_span = span!(
        Level::INFO,
        "worker.step_execution",
        correlation_id = %step_message.correlation_id,
        task_uuid = %step_message.task_uuid,
        step_uuid = %step_message.step_uuid,
        step_name = %task_sequence_step.workflow_step.name,
        namespace = %task_sequence_step.task.namespace_name
    );

    let execution_result = async {
        event!(Level::INFO, "step.execution_started");

        // Extract trace context for FFI propagation
        let trace_id = Some(step_message.correlation_id.to_string());
        let span_id = Some(format!("span-{}", step_message.step_uuid));

        // Fire FFI event with trace context
        let result = self.event_publisher
            .fire_step_execution_event_with_trace(
                &task_sequence_step,
                trace_id,
                span_id,
            )
            .await?;

        event!(Level::INFO, "step.execution_completed");
        Ok(result)
    }
    .instrument(step_span)  // Wrap async block with span
    .await;

    execution_result
}
}

Key Points:

  • All 5 attributes present: correlation_id, task_uuid, step_uuid, step_name, namespace
  • Event markers: step.execution_started, step.execution_completed
  • .instrument(span) pattern for async code
  • Trace context extracted and passed to FFI

Implementation: Data Structures

File: tasker-shared/src/types/base.rs

Add trace context fields to FFI event structures:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionEvent {
    pub event_id: Uuid,
    pub task_uuid: Uuid,
    pub step_uuid: Uuid,
    pub task_sequence_step: TaskSequenceStep,
    pub correlation_id: Uuid,

    // Trace context propagation
    #[serde(skip_serializing_if = "Option::is_none")]
    pub trace_id: Option<String>,

    #[serde(skip_serializing_if = "Option::is_none")]
    pub span_id: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionCompletionEvent {
    pub event_id: Uuid,
    pub task_uuid: Uuid,
    pub step_uuid: Uuid,
    pub success: bool,
    pub result: Option<serde_json::Value>,

    // Trace context from Ruby
    #[serde(skip_serializing_if = "Option::is_none")]
    pub trace_id: Option<String>,

    #[serde(skip_serializing_if = "Option::is_none")]
    pub span_id: Option<String>,
}
}

Design Notes:

  • Fields are optional for backward compatibility
  • skip_serializing_if prevents empty fields in JSON
  • Treated as opaque strings (no OpenTelemetry types)

Implementation: Ruby Side Propagation

File: workers/ruby/lib/tasker_core/event_bridge.rb

Propagate trace context like correlation_id:

def wrap_step_execution_event(event_data)
  wrapped = {
    event_id: event_data[:event_id],
    task_uuid: event_data[:task_uuid],
    step_uuid: event_data[:step_uuid],
    task_sequence_step: TaskerCore::Models::TaskSequenceStepWrapper.new(event_data[:task_sequence_step])
  }

  # Expose correlation_id at top level for easy access
  wrapped[:correlation_id] = event_data[:correlation_id] if event_data[:correlation_id]
  wrapped[:parent_correlation_id] = event_data[:parent_correlation_id] if event_data[:parent_correlation_id]

  # Expose trace_id and span_id for distributed tracing
  wrapped[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
  wrapped[:span_id] = event_data[:span_id] if event_data[:span_id]

  wrapped
end

File: workers/ruby/lib/tasker_core/subscriber.rb

Include trace context in completion:

def publish_step_completion(event_data:, success:, result: nil, error_message: nil, metadata: nil)
  completion_payload = {
    event_id: event_data[:event_id],
    task_uuid: event_data[:task_uuid],
    step_uuid: event_data[:step_uuid],
    success: success,
    result: result,
    metadata: metadata,
    error_message: error_message
  }

  # Propagate trace context back to Rust
  completion_payload[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
  completion_payload[:span_id] = event_data[:span_id] if event_data[:span_id]

  TaskerCore::Worker::EventBridge.instance.publish_step_completion(completion_payload)
end

Key Points:

  • Ruby treats trace_id and span_id as opaque strings
  • No OpenTelemetry dependency in Ruby
  • Simple pass-through pattern like correlation_id
  • Works with existing dual-event-system architecture

Implementation: Completion Span (Rust)

File: tasker-worker/src/worker/event_subscriber.rs

Create linked span when receiving Ruby completion:

#![allow(unused)]
fn main() {
pub fn handle_completion(&self, completion: StepExecutionCompletionEvent) -> TaskerResult<()> {
    // Create linked span using trace context from Ruby
    let completion_span = if let (Some(trace_id), Some(span_id)) =
        (&completion.trace_id, &completion.span_id) {
        span!(
            Level::INFO,
            "worker.step_completion_received",
            trace_id = %trace_id,
            span_id = %span_id,
            event_id = %completion.event_id,
            task_uuid = %completion.task_uuid,
            step_uuid = %completion.step_uuid,
            success = completion.success
        )
    } else {
        // Fallback span without trace context
        span!(
            Level::INFO,
            "worker.step_completion_received",
            event_id = %completion.event_id,
            task_uuid = %completion.task_uuid,
            step_uuid = %completion.step_uuid,
            success = completion.success
        )
    };

    let _guard = completion_span.enter();

    event!(Level::INFO, "step.ruby_execution_completed",
        success = completion.success,
        duration_ms = completion.metadata.execution_time_ms
    );

    // Continue with normal completion processing...
    Ok(())
}
}

Key Points:

  • Uses returned trace_id/span_id to create linked span
  • Graceful fallback if trace context not available
  • Event: step.ruby_execution_completed

Validation Results (2025-11-24)

Test Task:

  • Correlation ID: 88f21229-4085-4d53-8f52-2fde0b7228e2
  • Task UUID: 019ab6f9-7a27-7d16-b298-1ea41b327373
  • 4 steps executed successfully

Log Evidence:

worker.step_execution{
  correlation_id=88f21229-4085-4d53-8f52-2fde0b7228e2
  task_uuid=019ab6f9-7a27-7d16-b298-1ea41b327373
  step_uuid=019ab6f9-7a2a-7873-a5d1-93234ae46003
  step_name=linear_step_1
  namespace=linear_workflow
}: step.execution_started

Step execution event with trace context fired successfully to FFI handlers
  trace_id=Some("88f21229-4085-4d53-8f52-2fde0b7228e2")
  span_id=Some("span-019ab6f9-7a2a-7873-a5d1-93234ae46003")

worker.step_completion_received{...}: step.ruby_execution_completed

Tempo Query Results:

  • By correlation_id: 9 traces (5 orchestration + 4 worker)
  • By task_uuid: 13 traces (complete task lifecycle)
  • ✅ All attributes indexed and queryable
  • ✅ Spans exported to Tempo successfully

Complete Trace Flow

For each step execution:

┌─────────────────────────────────────────────────────┐
│ Rust Worker (command_processor.rs)                 │
│ 1. Create worker.step_execution span               │
│    - correlation_id, task_uuid, step_uuid          │
│    - step_name, namespace                          │
│ 2. Emit step.execution_started event               │
│ 3. Extract trace_id and span_id from span          │
│ 4. Add to StepExecutionEvent                       │
│ 5. Fire FFI event with trace context               │
│ 6. Emit step.execution_completed event             │
└─────────────────┬───────────────────────────────────┘
                  │
                  │ Async FFI boundary (EventPoller polling)
                  ▼
┌─────────────────────────────────────────────────────┐
│ Ruby EventBridge & Subscriber                       │
│ 1. Receive event with trace_id/span_id            │
│ 2. Propagate as opaque strings                     │
│ 3. Execute Ruby handler (business logic)           │
│ 4. Include trace_id/span_id in completion          │
└─────────────────┬───────────────────────────────────┘
                  │
                  │ Completion via FFI
                  ▼
┌─────────────────────────────────────────────────────┐
│ Rust Worker (event_subscriber.rs)                  │
│ 1. Receive StepExecutionCompletionEvent            │
│ 2. Extract trace_id and span_id                    │
│ 3. Create worker.step_completion_received span     │
│ 4. Emit step.ruby_execution_completed event        │
└─────────────────────────────────────────────────────┘

Benefits of This Pattern

  1. No Breaking Changes: Optional fields, backward compatible
  2. Ruby Simplicity: No OpenTelemetry dependency, opaque string propagation
  3. Trace Continuity: Same trace_id flows Rust → Ruby → Rust
  4. Query-Friendly: Tempo queries show complete execution flow
  5. Extensible: Pattern works for Python, WASM, any FFI language
  6. Performance: Zero overhead in Ruby (just string passing)

Pattern for Python Workers

The exact same pattern applies to Python workers:

Python Side (PyO3):

# workers/python/tasker_core/event_bridge.py

def wrap_step_execution_event(event_data):
    wrapped = {
        'event_id': event_data['event_id'],
        'task_uuid': event_data['task_uuid'],
        'step_uuid': event_data['step_uuid'],
        # ... other fields
    }

    # Propagate trace context as opaque strings
    if 'trace_id' in event_data:
        wrapped['trace_id'] = event_data['trace_id']
    if 'span_id' in event_data:
        wrapped['span_id'] = event_data['span_id']

    return wrapped
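
The completion side mirrors the Ruby subscriber shown earlier. A hypothetical sketch (the module path and function signature are illustrative, not the actual tasker-py API):

# workers/python/tasker_core/subscriber.py (illustrative path)

def publish_step_completion(event_data, success, result=None, error_message=None, metadata=None):
    completion_payload = {
        'event_id': event_data['event_id'],
        'task_uuid': event_data['task_uuid'],
        'step_uuid': event_data['step_uuid'],
        'success': success,
        'result': result,
        'metadata': metadata,
        'error_message': error_message,
    }

    # Propagate trace context back to Rust, exactly like correlation_id
    if 'trace_id' in event_data:
        completion_payload['trace_id'] = event_data['trace_id']
    if 'span_id' in event_data:
        completion_payload['span_id'] = event_data['span_id']

    return completion_payload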

Key Insight: Any FFI language can use this pattern; it only needs to pass trace_id and span_id through as opaque strings.

Performance Characteristics

  • Rust overhead: ~50-100 microseconds per span creation
  • FFI overhead: ~10-50 microseconds for extra string fields
  • Ruby overhead: Zero (just string passing, no OpenTelemetry)
  • Total overhead: <200 microseconds per step execution
  • Network: Spans batched and exported asynchronously

Troubleshooting

Symptom: Spans missing trace_id/span_id in Tempo

Check:

  1. Verify Rust logs show “Step execution event with trace context fired successfully”
  2. Check Ruby logs don’t have errors in EventBridge
  3. Verify completion events include trace_id/span_id
  4. Query Tempo by task_uuid to see if spans exist

Debug:

# Check Rust worker logs for trace context
docker logs docker-ruby-worker-1 | grep -E "(trace_id|span_id)"

# Query Tempo by task_uuid
curl "http://localhost:3200/api/search?tags=task_uuid=<UUID>"

# Check span export metrics
curl "http://localhost:9090/metrics" | grep otel

Future Enhancements

OpenTelemetry W3C Trace Context: Currently using correlation_id as trace_id placeholder. Future enhancement:

#![allow(unused)]
fn main() {
use opentelemetry::trace::TraceContextExt;

// Extract real OpenTelemetry trace context
let cx = tracing::Span::current().context();
let span_context = cx.span().span_context();
let trace_id = span_context.trace_id().to_string();
let span_id = span_context.span_id().to_string();
}

Span Linking: Use OpenTelemetry’s Link API for explicit parent-child relationships:

#![allow(unused)]
fn main() {
use opentelemetry::trace::{Link, SpanContext, TraceId, SpanId};

// Create linked span
let parent_context = SpanContext::new(
    TraceId::from_hex(&trace_id)?,
    SpanId::from_hex(&span_id)?,
    TraceFlags::default(),
    false,
    TraceState::default(),
);

let span = span!(
    Level::INFO,
    "worker.step_completion_received",
    links = vec![Link::new(parent_context, Vec::new())]
);
}

References

  • OpenTelemetry Rust: https://github.com/open-telemetry/opentelemetry-rust
  • Grafana LGTM Stack: https://grafana.com/oss/lgtm-stack/
  • W3C Trace Context: https://www.w3.org/TR/trace-context/
  • tasker-shared/src/logging.rs - Core logging implementation
  • workers/rust/README.md - Event-driven FFI architecture
  • docs/batch-processing.md - Distributed tracing integration
  • docker/docker-compose.test.yml - Observability stack configuration

Status: ✅ Production Ready - Two-phase initialization and Phase 1.5 worker span instrumentation patterns implemented and validated with Ruby FFI. Ready for Python and WASM implementations.

Library Deployment Patterns

This document describes the library deployment patterns feature, which lets applications consume worker observability data (health, metrics, templates, configuration) either via the HTTP API or directly through FFI; the FFI path requires no web server at all.

Overview

Previously, applications needed to run the worker’s HTTP server to access observability data. This created deployment overhead for applications that only needed programmatic access to health checks, metrics, or template information.

The library deployment patterns feature:

  1. Extracts observability logic into reusable services - Business logic moved from HTTP handlers to service classes
  2. Exposes services via FFI - Same functionality available without HTTP overhead
  3. Provides Ruby wrapper layer - Type-safe Ruby interface with dry-struct types
  4. Makes HTTP server optional - Services always available, web server is opt-in

Architecture

Service Layer

Four services encapsulate observability logic:

tasker-worker/src/worker/services/
├── health/          # HealthService - health checks
├── metrics/         # MetricsService - metrics collection
├── template_query/  # TemplateQueryService - template operations
└── config_query/    # ConfigQueryService - configuration queries

Each service:

  • Contains all business logic previously in HTTP handlers
  • Is independent of HTTP transport
  • Can be accessed via web handlers OR FFI
  • Returns typed response structures
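
As a rough sketch of that shape (type and method names here are illustrative, not the actual tasker-worker API), the HTTP handler and the FFI export both collapse into thin wrappers around the same service call:

#![allow(unused)]
fn main() {
// Illustrative only: the real services live under
// tasker-worker/src/worker/services/ and their names may differ.
pub struct BasicHealth {
    pub status: String,
    pub worker_id: String,
    pub timestamp: String,
}

pub struct HealthService {
    worker_id: String,
}

impl HealthService {
    // All business logic lives in the service, independent of transport.
    pub fn basic(&self) -> BasicHealth {
        BasicHealth {
            status: "healthy".to_string(),
            worker_id: self.worker_id.clone(),
            timestamp: format!("{:?}", std::time::SystemTime::now()),
        }
    }
}

// HTTP handler: wrap health_service.basic() in a JSON response
// FFI export:   serialize health_service.basic() for Ruby/Python callers
}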

Service Access Patterns

                    ┌─────────────────────────────────────────┐
                    │            WorkerWebState               │
                    │  ┌────────────────────────────────────┐ │
                    │  │         Service Instances           │ │
                    │  │  ┌────────────┐ ┌────────────────┐ │ │
                    │  │  │HealthServ.│ │MetricsService  │ │ │
                    │  │  └────────────┘ └────────────────┘ │ │
                    │  │  ┌────────────┐ ┌────────────────┐ │ │
                    │  │  │TemplQuery │ │ConfigQuery     │ │ │
                    │  │  └────────────┘ └────────────────┘ │ │
                    │  └────────────────────────────────────┘ │
                    └──────────────┬───────────────┬──────────┘
                                   │               │
           ┌───────────────────────┴───┐     ┌─────┴──────────────────────┐
           │     HTTP Handlers         │     │     FFI Layer              │
           │  (web/handlers/*.rs)      │     │  (observability_ffi.rs)    │
           └───────────────────────────┘     └────────────────────────────┘
                       │                                 │
                       ▼                                 ▼
               ┌───────────────┐                ┌───────────────┐
               │  HTTP Clients │                │  Ruby/Python  │
               │  curl, etc.   │                │  Applications │
               └───────────────┘                └───────────────┘

Usage

Ruby FFI Access

The TaskerCore::Observability module provides type-safe access to all services:

# Health checks
health = TaskerCore::Observability.health_basic
puts health.status        # => "healthy"
puts health.worker_id     # => "worker-abc123"

# Kubernetes-style probes
if TaskerCore::Observability.ready?
  puts "Worker ready to receive requests"
end

if TaskerCore::Observability.alive?
  puts "Worker is alive"
end

# Detailed health information
detailed = TaskerCore::Observability.health_detailed
detailed.checks.each do |name, check|
  puts "#{name}: #{check.status} (#{check.duration_ms}ms)"
end

Metrics Access

# Domain event statistics
events = TaskerCore::Observability.event_stats
puts "Events routed: #{events.router.total_routed}"
puts "FFI dispatches: #{events.in_process_bus.ffi_channel_dispatches}"

# Prometheus format (for custom scrapers)
prometheus_text = TaskerCore::Observability.prometheus_metrics

Template Operations

# List templates (JSON string)
templates_json = TaskerCore::Observability.templates_list

# Validate a template
validation = TaskerCore::Observability.template_validate(
  namespace: "payments",
  name: "process_payment",
  version: "v1"
)

if validation.valid
  puts "Template valid with #{validation.handler_count} handlers"
else
  validation.issues.each { |issue| puts "Issue: #{issue}" }
end

# Cache management
stats = TaskerCore::Observability.cache_stats
puts "Cache hits: #{stats.hits}, misses: #{stats.misses}"

TaskerCore::Observability.cache_clear  # Clear all cached templates

Configuration Access

# Get runtime configuration (secrets redacted)
config = TaskerCore::Observability.config
puts "Environment: #{config.environment}"
puts "Redacted fields: #{config.metadata.redacted_fields.join(', ')}"

# Quick environment check
env = TaskerCore::Observability.environment
puts "Running in: #{env}"  # => "production"

Configuration

HTTP Server Toggle

The HTTP server is now optional. Services are always created, but the HTTP server only starts if enabled:

# config/tasker/base/worker.toml
[worker.web]
enabled = true              # Set to false to disable HTTP server
bind_address = "0.0.0.0:8081"
request_timeout_ms = 30000

When enabled = false:

  • WorkerWebState is still created (services available)
  • HTTP server does NOT start
  • All services accessible via FFI only
  • Reduces resource usage (no HTTP listener, no connections)

Deployment Modes

Mode | HTTP Server | FFI Services | Use Case
Full | Enabled | Available | Standard deployment with monitoring
Library | Disabled | Available | Embedded in application, no external access
Headless | Disabled | Available | Container with external health checks disabled

Type Definitions

The Ruby wrapper uses dry-struct types for structured access:

Health Types

TaskerCore::Observability::Types::BasicHealth
  - status: String
  - worker_id: String
  - timestamp: String

TaskerCore::Observability::Types::DetailedHealth
  - status: String
  - timestamp: String
  - worker_id: String
  - checks: Hash[String, HealthCheck]
  - system_info: WorkerSystemInfo

TaskerCore::Observability::Types::HealthCheck
  - status: String
  - message: String?
  - duration_ms: Integer
  - last_checked: String

Metrics Types

TaskerCore::Observability::Types::DomainEventStats
  - router: EventRouterStats
  - in_process_bus: InProcessEventBusStats
  - captured_at: String
  - worker_id: String

TaskerCore::Observability::Types::EventRouterStats
  - total_routed: Integer
  - durable_routed: Integer
  - fast_routed: Integer
  - broadcast_routed: Integer
  - fast_delivery_errors: Integer
  - routing_errors: Integer

Template Types

TaskerCore::Observability::Types::CacheStats
  - total_entries: Integer
  - hits: Integer
  - misses: Integer
  - evictions: Integer
  - last_maintenance: String?

TaskerCore::Observability::Types::TemplateValidation
  - valid: Boolean
  - namespace: String
  - name: String
  - version: String
  - handler_count: Integer
  - issues: Array[String]
  - handler_metadata: Hash?

Config Types

TaskerCore::Observability::Types::RuntimeConfig
  - environment: String
  - common: Hash
  - worker: Hash
  - metadata: ConfigMetadata

TaskerCore::Observability::Types::ConfigMetadata
  - timestamp: String
  - source: String
  - redacted_fields: Array[String]

Error Handling

FFI methods raise RuntimeError on failures:

begin
  health = TaskerCore::Observability.health_basic
rescue RuntimeError => e
  if e.message.include?("Worker system not running")
    # Worker not bootstrapped yet
  elsif e.message.include?("Web state not available")
    # Services not initialized
  end
end

Template Operation Errors

Template operations raise RuntimeError for missing templates or namespaces:

begin
  result = TaskerCore::Observability.template_get(
    namespace: "unknown",
    name: "missing",
    version: "1.0.0"
  )
rescue RuntimeError => e
  puts "Template not found: #{e.message}"
end

# template_refresh handles errors gracefully, returning a result struct
result = TaskerCore::Observability.template_refresh(
  namespace: "unknown",
  name: "missing",
  version: "1.0.0"
)
puts result.success  # => false
puts result.message  # => error description

Convenience Methods

The ready? and alive? methods handle errors gracefully:

# These never raise - they return false on any error
TaskerCore::Observability.ready?  # => true/false
TaskerCore::Observability.alive?  # => true/false

Note: alive? checks for status == "alive" (from liveness probe), while ready? checks for status == "healthy" (from readiness probe).

Best Practices

  1. Use type-safe methods when possible - Methods returning dry-struct types provide better validation
  2. Handle errors gracefully - FFI can fail if worker not bootstrapped
  3. Consider caching - For high-frequency health checks, cache results briefly (see the sketch after this list)
  4. Use ready?/alive? helpers - They handle exceptions and return boolean
  5. Prefer FFI for internal use - Less overhead than HTTP for same-process access
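
For practice 3, here is a minimal sketch of a short-lived cache around the FFI readiness check. The wrapper class is illustrative and not part of TaskerCore; only TaskerCore::Observability.ready? comes from the API above.

# Cache the FFI readiness result for a few seconds to avoid
# re-checking on every incoming probe.
class CachedReadiness
  def initialize(ttl_seconds: 5)
    @ttl = ttl_seconds
    @checked_at = nil
    @value = false
  end

  def ready?
    now = Time.now
    if @checked_at.nil? || (now - @checked_at) > @ttl
      @value = TaskerCore::Observability.ready?  # never raises
      @checked_at = now
    end
    @value
  end
end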

Migration Guide

From HTTP to FFI

Before (HTTP):

response = Faraday.get("http://localhost:8081/health")
health = JSON.parse(response.body)

After (FFI):

health = TaskerCore::Observability.health_basic

Disabling HTTP Server

  1. Update configuration:

    [worker.web]
    enabled = false
    
  2. Update health check scripts to use FFI:

    # health_check.rb
    require 'tasker_core'
    
    exit(TaskerCore::Observability.ready? ? 0 : 1)
    
  3. Update monitoring to scrape via FFI:

    metrics = TaskerCore::Observability.prometheus_metrics
    # Send to Prometheus pushgateway or custom aggregator
    

API Reference

Health Methods

Method | Returns | Description
health_basic | Types::BasicHealth | Basic health status
health_live | Types::BasicHealth | Liveness probe (status: “alive”)
health_ready | Types::DetailedHealth | Readiness probe with all checks
health_detailed | Types::DetailedHealth | Full health information
ready? | Boolean | True if status == “healthy”
alive? | Boolean | True if status == “alive”

Metrics Methods

Method | Returns | Description
metrics_worker | String (JSON) | Worker metrics as JSON
event_stats | Types::DomainEventStats | Domain event statistics
prometheus_metrics | String | Prometheus text format

Template Methods

Method | Returns | Description
templates_list(include_cache_stats: false) | String (JSON) | List all templates
template_get(namespace:, name:, version:) | String (JSON) | Get specific template (raises on error)
template_validate(namespace:, name:, version:) | Types::TemplateValidation | Validate template (raises on error)
cache_stats | Types::CacheStats | Cache statistics
cache_clear | Types::CacheOperationResult | Clear template cache
template_refresh(namespace:, name:, version:) | Types::CacheOperationResult | Refresh specific template

Config Methods

Method | Returns | Description
config | Types::RuntimeConfig | Full config (secrets redacted)
environment | String | Current environment name

sccache Configuration Documentation

Overview

This document records our sccache configuration for future reference. Sccache is currently disabled due to GitHub Actions cache service issues, but we plan to re-enable it once the service is stable.

Current Status

🚫 DISABLED - Temporarily disabled due to GitHub Actions cache service issues:

sccache: error: Server startup failed: cache storage failed to read: Unexpected (permanent) at read => <h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p>

Planned Configuration

Environment Variables (setup-env action)

RUSTC_WRAPPER=sccache
SCCACHE_GHA_ENABLED=true
SCCACHE_CACHE_SIZE=2G  # For Docker builds

GitHub Actions Integration

Workflows Using sccache

  1. code-quality.yml - Build caching for clippy and rustfmt
  2. test-unit.yml - Build caching for unit tests
  3. test-integration.yml - Build caching for integration tests

Action Configuration

- uses: mozilla-actions/sccache-action@v0.0.4

Expected Benefits

  • 50%+ faster builds through compilation caching
  • Reduced CI costs by avoiding redundant compilation
  • Better developer experience with faster feedback loops

Performance Targets

  • Build cache hit rate: Target > 80%
  • Compilation time reduction: 50%+ on cache hits
  • Total CI time: Reduce by 10-20 minutes per run

Local Development Setup

For local development when sccache is working:

# Install sccache
cargo binstall sccache -y

# Set environment variables
export RUSTC_WRAPPER=sccache
export SCCACHE_GHA_ENABLED=true

# Check stats
sccache --show-stats

# Reset the statistics counters (e.g., before measuring a fresh build)
sccache --zero-stats

Re-enabling Steps

When GitHub Actions cache service is stable:

  1. Re-enable in workflows:

    • Uncomment mozilla-actions/sccache-action@v0.0.4 in workflows
    • Restore sccache environment variables in setup-env action
  2. Test with minimal workflow first:

    • Start with code-quality.yml
    • Monitor for cache service issues
    • Gradually enable in other workflows
  3. Monitor performance:

    • Track build times before/after
    • Monitor cache hit rates
    • Watch for any new cache service errors

Configuration Locations

Files containing sccache configuration:

  • .github/actions/setup-env/action.yml - Environment variables
  • .github/workflows/code-quality.yml - Action usage
  • .github/workflows/test-unit.yml - Action usage
  • .github/workflows/test-integration.yml - Action usage
  • docs/sccache-configuration.md - This documentation

Docker Integration

For Docker builds, pass sccache variables as build args:

build-args: |
  SCCACHE_GHA_ENABLED=true
  RUSTC_WRAPPER=sccache
  SCCACHE_CACHE_SIZE=2G

Troubleshooting

Common Issues

  • Cache service unavailable: Wait for GitHub to restore service
  • Cache misses: Check RUSTC_WRAPPER is set correctly
  • Permission errors: Ensure sccache action has proper permissions

Monitoring

  • Check sccache --show-stats for cache effectiveness
  • Monitor CI run times for performance improvements
  • Watch GitHub status page for cache service updates

StepContext API Reference

StepContext is the primary data access object for step handlers across all languages in the Tasker worker ecosystem. It provides a consistent interface for accessing task inputs, dependency results, configuration, and checkpoint data.

Overview

Every step handler receives a StepContext (or TaskSequenceStep in Rust) that contains:

  • Task context - Input data for the workflow (JSONB from task.context)
  • Dependency results - Results from upstream DAG steps
  • Step configuration - Handler-specific settings from the template
  • Checkpoint data - Batch processing state for resumability
  • Retry information - Current attempt count and max retries

Cross-Language API Reference

Core Data Access

Operation | Rust | Ruby | Python | TypeScript
Get task input | get_input::<T>("key")? | get_input("key") | get_input("key") | getInput("key")
Get input with default | get_input_or("key", default) | get_input_or("key", default) | get_input_or("key", default) | getInputOr("key", default)
Get config value | get_config::<T>("key")? | get_config("key") | get_config("key") | getConfig("key")
Get dependency result | get_dependency_result_column_value::<T>("step")? | get_dependency_result("step") | get_dependency_result("step") | getDependencyResult("step")
Get nested dependency field | get_dependency_field::<T>("step", &["path"])? | get_dependency_field("step", *path) | get_dependency_field("step", *path) | getDependencyField("step", ...path)

Retry Helpers

Operation | Rust | Ruby | Python | TypeScript
Check if retry | is_retry() | is_retry? | is_retry() | isRetry()
Check if last retry | is_last_retry() | is_last_retry? | is_last_retry() | isLastRetry()
Get retry count | retry_count() | retry_count | retry_count | retryCount
Get max retries | max_retries() | max_retries | max_retries | maxRetries

Checkpoint Access

Operation | Rust | Ruby | Python | TypeScript
Get raw checkpoint | checkpoint() | checkpoint | checkpoint | checkpoint
Get cursor | checkpoint_cursor::<T>() | checkpoint_cursor | checkpoint_cursor | checkpointCursor
Get items processed | checkpoint_items_processed() | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed
Get accumulated results | accumulated_results::<T>() | accumulated_results | accumulated_results | accumulatedResults
Check has checkpoint | has_checkpoint() | has_checkpoint? | has_checkpoint() | hasCheckpoint()

Standard Fields

Field | Rust | Ruby | Python | TypeScript
Task UUID | task.task.task_uuid | task_uuid | task_uuid | taskUuid
Step UUID | workflow_step.workflow_step_uuid | step_uuid | step_uuid | stepUuid
Correlation ID | task.task.correlation_id | task.correlation_id | correlation_id | correlationId
Input data (raw) | task.task.context | input_data / context | input_data | inputData
Step config (raw) | step_definition.handler.initialization | step_config | step_config | stepConfig

Usage Examples

Rust

#![allow(unused)]
fn main() {
use tasker_shared::types::base::TaskSequenceStep;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // Get task input
    let order_id: String = step_data.get_input("order_id")?;
    let batch_size: i32 = step_data.get_input_or("batch_size", 100);

    // Get config
    let api_url: String = step_data.get_config("api_url")?;

    // Get dependency result
    let validation_result: ValidationResult = step_data.get_dependency_result_column_value("validate")?;

    // Extract nested field from dependency
    let item_count: i32 = step_data.get_dependency_field("process", &["stats", "count"])?;

    // Check retry status
    if step_data.is_retry() {
        println!("Retry attempt {}", step_data.retry_count());
    }

    // Resume from checkpoint
    let cursor: Option<i64> = step_data.checkpoint_cursor();
    let start_from = cursor.unwrap_or(0);

    // ... handler logic ...
}
}

Ruby

def call(context)
  # Get task input
  order_id = context.get_input('order_id')
  batch_size = context.get_input_or('batch_size', 100)

  # Get config
  api_url = context.get_config('api_url')

  # Get dependency result
  validation_result = context.get_dependency_result('validate')

  # Extract nested field from dependency
  item_count = context.get_dependency_field('process', 'stats', 'count')

  # Check retry status
  if context.is_retry?
    logger.info("Retry attempt #{context.retry_count}")
  end

  # Resume from checkpoint
  start_from = context.checkpoint_cursor || 0

  # ... handler logic ...
end

Python

def call(self, context: StepContext) -> StepHandlerResult:
    # Get task input
    order_id = context.get_input("order_id")
    batch_size = context.get_input_or("batch_size", 100)

    # Get config
    api_url = context.get_config("api_url")

    # Get dependency result
    validation_result = context.get_dependency_result("validate")

    # Extract nested field from dependency
    item_count = context.get_dependency_field("process", "stats", "count")

    # Check retry status
    if context.is_retry():
        print(f"Retry attempt {context.retry_count}")

    # Resume from checkpoint
    start_from = context.checkpoint_cursor or 0

    # ... handler logic ...

TypeScript

async call(context: StepContext): Promise<StepHandlerResult> {
  // Get task input
  const orderId = context.getInput<string>('order_id');
  const batchSize = context.getInputOr('batch_size', 100);

  // Get config
  const apiUrl = context.getConfig<string>('api_url');

  // Get dependency result
  const validationResult = context.getDependencyResult('validate');

  // Extract nested field from dependency
  const itemCount = context.getDependencyField('process', 'stats', 'count');

  // Check retry status
  if (context.isRetry()) {
    console.log(`Retry attempt ${context.retryCount}`);
  }

  // Resume from checkpoint
  const startFrom = context.checkpointCursor ?? 0;

  // ... handler logic ...
}

Checkpoint Usage Guide

Checkpoints enable resumable batch processing. When a handler processes large datasets, it can save progress via checkpoints and resume from where it left off on retry.

Checkpoint Fields

  • cursor - Position marker (can be int, string, or object)
  • items_processed - Count of items completed
  • accumulated_results - Running totals or aggregated state

Reading Checkpoints

# Python example
def call(self, context: StepContext) -> StepHandlerResult:
    # Check if resuming from checkpoint
    if context.has_checkpoint():
        cursor = context.checkpoint_cursor
        items_done = context.checkpoint_items_processed
        totals = context.accumulated_results or {}
        print(f"Resuming from cursor {cursor}, {items_done} items done")
    else:
        cursor = 0
        items_done = 0
        totals = {}

    # Process from cursor position...

Writing Checkpoints

Checkpoints are written by including checkpoint data in the handler result metadata. See the batch processing documentation for details on the checkpoint yield pattern.

Notes

  • All accessor methods handle missing data gracefully (return None/null or use defaults)
  • Dependency results are automatically unwrapped from the {"result": value} envelope
  • Type conversion is handled automatically where supported (Rust, TypeScript generics)
  • Checkpoint data is persisted atomically by the CheckpointService

Table Management and Growth Strategies

Last Updated: 2026-01-10 Status: Active Recommendation

Problem Statement

In high-throughput workflow orchestration systems, the core task tables (tasks, workflow_steps, task_transitions, workflow_step_transitions) can grow to millions of rows over time. Without proper management, this growth can lead to:

Note: All tables reside in the tasker schema with simplified names (e.g., tasks instead of tasker_tasks). With search_path = tasker, public, queries use unqualified table names.

  • Query Performance Degradation: Even with proper indexes, very large tables require more I/O operations
  • Maintenance Overhead: VACUUM, ANALYZE, and index maintenance become increasingly expensive
  • Backup/Recovery Challenges: Larger tables increase backup windows and recovery times
  • Storage Costs: Historical data that’s rarely accessed still consumes storage resources

Existing Performance Mitigations

The tasker-core system employs several strategies to maintain query performance even with large tables:

1. Strategic Indexing

Covering Indexes for Hot Paths

The most critical indexes use PostgreSQL’s INCLUDE clause to create covering indexes that satisfy queries without table lookups:

Active Task Processing (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Covering index for active task queries with priority sorting
CREATE INDEX IF NOT EXISTS idx_tasks_active_with_priority_covering
    ON tasks (complete, priority, task_uuid)
    INCLUDE (named_task_uuid, requested_at)
    WHERE complete = false;

Impact: Task discovery queries can be satisfied entirely from the index without accessing the main table.
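
For example, a discovery query of this shape can be answered by an index-only scan (verify with EXPLAIN (ANALYZE, BUFFERS)), since every referenced column is either an index key or an INCLUDE column:

SELECT task_uuid, named_task_uuid, requested_at
FROM tasks
WHERE complete = false
ORDER BY priority DESC, task_uuid
LIMIT 50;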

Step Readiness Processing (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Covering index for step readiness queries
CREATE INDEX IF NOT EXISTS idx_workflow_steps_ready_covering
    ON workflow_steps (task_uuid, processed, in_process)
    INCLUDE (workflow_step_uuid, attempts, max_attempts, retryable)
    WHERE processed = false;

-- Covering index for task-based step grouping
CREATE INDEX IF NOT EXISTS idx_workflow_steps_task_covering
    ON workflow_steps (task_uuid)
    INCLUDE (workflow_step_uuid, processed, in_process, attempts, max_attempts);

Impact: Step dependency resolution and retry logic queries avoid heap lookups.

Transitive Dependency Optimization (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Covering index for transitive dependency traversal
CREATE INDEX IF NOT EXISTS idx_workflow_steps_transitive_deps
    ON workflow_steps (workflow_step_uuid, named_step_uuid)
    INCLUDE (task_uuid, results, processed);

Impact: DAG traversal operations can read all needed columns from the index.

State Transition Lookups (Partial Indexes)

Current State Resolution (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Fast current state resolution (only indexes most_recent = true)
CREATE INDEX IF NOT EXISTS idx_task_transitions_state_lookup
    ON task_transitions (task_uuid, to_state, most_recent)
    WHERE most_recent = true;

CREATE INDEX IF NOT EXISTS idx_workflow_step_transitions_state_lookup
    ON workflow_step_transitions (workflow_step_uuid, to_state, most_recent)
    WHERE most_recent = true;

Impact: State lookups index only current state, not full audit history. Reduces index size by >90%.
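
For example, resolving a task's current state only touches the most_recent = true subset, so the lookup hits the small partial index rather than the full audit history:

SELECT to_state
FROM task_transitions
WHERE task_uuid = $1
  AND most_recent = true;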

Correlation and Tracing Indexes

Distributed Tracing Support (migrations/tasker/20251007000000_add_correlation_ids.sql):

-- Primary correlation ID lookups
CREATE INDEX IF NOT EXISTS idx_tasks_correlation_id
    ON tasks(correlation_id);

-- Hierarchical workflow traversal (parent-child relationships)
CREATE INDEX IF NOT EXISTS idx_tasks_correlation_hierarchy
    ON tasks(parent_correlation_id, correlation_id)
    WHERE parent_correlation_id IS NOT NULL;

Impact: Enables efficient distributed tracing and workflow hierarchy queries.

Processor Ownership and Monitoring

Processor Tracking (migrations/tasker/20250912000000_tas41_richer_task_states.sql):

-- Index for processor ownership queries (audit trail only, enforcement removed)
CREATE INDEX IF NOT EXISTS idx_task_transitions_processor
    ON task_transitions(processor_uuid)
    WHERE processor_uuid IS NOT NULL;

-- Index for timeout monitoring using JSONB metadata
CREATE INDEX IF NOT EXISTS idx_task_transitions_timeout
    ON task_transitions((transition_metadata->>'timeout_at'))
    WHERE most_recent = true;

Impact: Enables processor-level debugging and timeout monitoring. Processor ownership enforcement was removed but the audit trail is preserved.

Dependency Graph Navigation

Step Edges for DAG Operations (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Parent-to-child navigation for dependency resolution
CREATE INDEX IF NOT EXISTS idx_workflow_step_edges_from_step
    ON workflow_step_edges (from_step_uuid);

-- Child-to-parent navigation for completion propagation
CREATE INDEX IF NOT EXISTS idx_workflow_step_edges_to_step
    ON workflow_step_edges (to_step_uuid);

Impact: Bidirectional DAG traversal for readiness checks and completion propagation.

2. Partial Indexes

Many indexes use WHERE clauses to index only active/relevant rows:

-- Only index tasks that are actively being processed
WHERE current_state IN ('pending', 'initializing', 'steps_in_process')

-- Only index the current state transition
WHERE most_recent = true

This significantly reduces index size and maintenance overhead while keeping lookups fast.

3. SQL Function Optimizations

Complex orchestration queries are implemented as PostgreSQL functions that leverage:

  • Lateral Joins: For efficient correlated subqueries
  • CTEs with Materialization: For complex dependency analysis
  • Targeted Filtering: Early elimination of irrelevant rows using index scans

Example from get_next_ready_tasks():

-- First filter to active tasks with priority sorting (uses index)
WITH prioritized_tasks AS (
  SELECT task_uuid, priority
  FROM tasks
  WHERE current_state IN ('pending', 'steps_in_process')
  ORDER BY priority DESC, created_at ASC
  LIMIT $1 * 2  -- Get more candidates than needed for filtering
)
-- Then apply complex staleness/readiness checks only on candidates
...

4. Staleness Exclusion

The system automatically excludes stale tasks from active processing queues:

  • Tasks stuck in waiting_for_dependencies > 60 minutes
  • Tasks stuck in waiting_for_retry > 30 minutes
  • Tasks with lifecycle timeouts exceeded

This prevents the active query set from growing indefinitely, even if old tasks aren’t archived.

Archive-and-Delete Strategy (Considered, Not Implemented)

What We Considered

We initially designed an archive-and-delete strategy:

Architecture:

  • Mirror tables: tasker.archived_tasks, tasker.archived_workflow_steps, tasker.archived_task_transitions, tasker.archived_workflow_step_transitions
  • Background service running every 24 hours
  • Batch processing: 1000 tasks per run
  • Transactional archival: INSERT into archive tables → DELETE from main tables
  • Retention policies: Configurable per task state (completed, error, cancelled)

Implementation Details:

#![allow(unused)]
fn main() {
// Archive tasks in terminal states older than retention period
pub async fn archive_completed_tasks(
    pool: &PgPool,
    retention_days: i32,
    batch_size: i32,
) -> Result<ArchiveStats> {
    // 1. INSERT INTO archived_tasks SELECT * FROM tasks WHERE ...
    // 2. INSERT INTO archived_workflow_steps SELECT * WHERE task_uuid IN (...)
    // 3. INSERT INTO archived_task_transitions SELECT * WHERE task_uuid IN (...)
    // 4. DELETE FROM workflow_step_transitions WHERE ...
    // 5. DELETE FROM task_transitions WHERE ...
    // 6. DELETE FROM workflow_steps WHERE ...
    // 7. DELETE FROM tasks WHERE ...
}
}

Why We Decided Against It

After implementation and analysis, we identified critical performance issues:

1. Write Amplification

Every archived task results in:

  • 2× writes per row: INSERT into archive table + original row still exists until DELETE
  • 1× delete per row: DELETE from main table triggers index updates
  • Cascade costs: Foreign key relationships require multiple DELETE operations in sequence

For a system processing 100,000 tasks/day with 30-day retention:

  • Daily archival: ~100,000 tasks × 2 write operations = 200,000 write I/Os
  • Plus associated workflow_steps (typically 5-10 per task): 500,000-1,000,000 additional writes

2. Index Maintenance Overhead

PostgreSQL must maintain indexes during both INSERT and DELETE operations:

During INSERT to archive tables:

  • Build index entries for all archive table indexes
  • Update statistics for query planner

During DELETE from main tables:

  • Mark deleted tuples in main table indexes
  • Update free space maps
  • Trigger VACUUM requirements

Result: Periodic severe degradation (2-5 seconds) during archival runs, even with batch processing.

3. Lock Contention

Large DELETE operations require:

  • Row-level locks on deleted rows
  • Table-level locks during index updates
  • Lock escalation risk with large batch sizes

This creates a “stop-the-world” effect where active task processing is blocked during archival.

4. VACUUM Pressure

Frequent large DELETEs create dead tuples that require aggressive VACUUMing:

  • Increases I/O load during off-hours
  • Can’t be fully eliminated even with proper tuning
  • Competes with active workload for resources

5. The “Garbage Collector” Anti-Pattern

The archive-and-delete strategy essentially implements a manual garbage collector:

  • Periodic runs with performance impact
  • Tuning trade-offs (frequency vs. batch size vs. impact)
  • Operational complexity (monitoring, alerting, recovery)

Recommended Strategy: Native Partitioning with pg_partman

Overview

PostgreSQL’s native table partitioning with pg_partman provides zero-runtime-cost table management:

Key Advantages:

  • No write amplification: Data stays in place, partitions are logical divisions
  • No DELETE operations: Old partitions are DETACHed and dropped as units
  • Instant partition drops: Dropping a partition is O(1), not O(rows)
  • Transparent to application: Queries work identically on partitioned tables
  • Battle-tested: Used by pgmq (our queue infrastructure) and thousands of production systems

How It Works

-- 1. Create partitioned parent table (in tasker schema)
CREATE TABLE tasker.tasks (
    task_uuid UUID NOT NULL,
    created_at TIMESTAMP NOT NULL,
    -- ... other columns
) PARTITION BY RANGE (created_at);

-- 2. pg_partman automatically creates child partitions
-- tasker.tasks_p2025_01  (Jan 2025)
-- tasker.tasks_p2025_02  (Feb 2025)
-- tasker.tasks_p2025_03  (Mar 2025)
-- ... etc

-- 3. Queries work unchanged; including the partition key (created_at)
--    in the predicate lets PostgreSQL prune to the relevant partitions
SELECT * FROM tasks
WHERE task_uuid = $1
  AND created_at > NOW() - INTERVAL '30 days';
-- → PostgreSQL only scans the partitions covering that time range

-- 4. Dropping old partitions is instant
ALTER TABLE tasker.tasks DETACH PARTITION tasker.tasks_p2024_12;
DROP TABLE tasker.tasks_p2024_12;  -- Instant, no row-by-row deletion

Performance Characteristics

Operation | Archive-and-Delete | Native Partitioning
Write path | INSERT + DELETE (2× I/O) | INSERT only (1× I/O)
Index maintenance | On INSERT + DELETE | On INSERT only
Lock contention | Row locks during DELETE | No locks for drops
VACUUM pressure | High (dead tuples) | None (partition drops)
Old data removal | O(rows) per deletion | O(1) partition detach
Query performance | Scans entire table | Partition pruning
Runtime impact | Periodic degradation | Zero

Implementation with pg_partman

Installation

CREATE EXTENSION pg_partman;

Setup for tasks

-- 1. Create partitioned table structure
-- (Include all existing columns and indexes)

-- 2. Initialize pg_partman for monthly partitions
SELECT partman.create_parent(
    p_parent_table := 'tasker.tasks',
    p_control := 'created_at',
    p_type := 'native',
    p_interval := 'monthly',
    p_premake := 3  -- Pre-create 3 future months
);

-- 3. Configure retention (keep 90 days)
UPDATE partman.part_config
SET retention = '90 days',
    retention_keep_table = false  -- Drop old partitions entirely
WHERE parent_table = 'tasker.tasks';

-- 4. Enable automatic maintenance
SELECT partman.run_maintenance(p_parent_table := 'tasker.tasks');

Automation

Add to cron or pg_cron:

-- Run maintenance every hour
SELECT cron.schedule('partman-maintenance', '0 * * * *',
    $$SELECT partman.run_maintenance()$$
);

This automatically:

  • Creates new partitions before they’re needed
  • Detaches and drops partitions older than retention period
  • Updates partition constraints for query optimization

Real-World Example: pgmq

The pgmq message queue system (which tasker-core uses for orchestration) implements partitioned queues for high-throughput scenarios:

Reference: pgmq Partitioned Queues

pgmq’s Rationale (from their docs):

“For very high-throughput queues, you may want to partition the queue table by time. This allows you to drop old partitions instead of deleting rows, which is much faster and doesn’t cause table bloat.”

pgmq’s Approach:

-- pgmq uses pg_partman for message queues
SELECT pgmq.create_partitioned(
    queue_name := 'high_throughput_queue',
    partition_interval := '1 day',
    retention_interval := '7 days'
);

Benefits They Report:

  • 10× faster old message cleanup vs. DELETE
  • Zero bloat from message deletion
  • Consistent performance even at millions of messages per day

Applying to Tasker: Our use case is nearly identical to pgmq:

  • High-throughput append-heavy workload
  • Time-series data (created_at is natural partition key)
  • Need to retain recent data, drop old data
  • Performance-critical read path

If pgmq chose partitioning over archive-and-delete for these reasons, we should too.

Migration Path

Phase 1: Analysis (Current State)

Before implementing partitioning:

  1. Analyze Current Growth Rate:
SELECT
    pg_size_pretty(pg_total_relation_size('tasker.tasks')) as total_size,
    count(*) as row_count,
    min(created_at) as oldest_task,
    max(created_at) as newest_task,
    count(*) / EXTRACT(day FROM (max(created_at) - min(created_at))) as avg_tasks_per_day
FROM tasks;
  2. Determine Partition Strategy:

    • Daily partitions: For > 1M tasks/day
    • Weekly partitions: For 100K-1M tasks/day (see the example after this list)
    • Monthly partitions: For < 100K tasks/day
  3. Plan Retention Period:

    • Legal/compliance requirements
    • Analytics/reporting needs
    • Typical task investigation window
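
For example, the 100K-1M tasks/day tier in step 2 above maps to the same create_parent call as in the setup section, just with a weekly interval (values are illustrative and should be tuned to the measured growth rate):

SELECT partman.create_parent(
    p_parent_table := 'tasker.tasks',
    p_control := 'created_at',
    p_type := 'native',
    p_interval := 'weekly',
    p_premake := 4  -- Pre-create 4 future weeks
);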

Phase 2: Implementation

  1. Create Partitioned Tables (requires downtime or blue-green deployment)
  2. Migrate Existing Data using pg_partman.partition_data_proc()
  3. Update Application (no code changes needed if using same table names)
  4. Configure Automation (pg_cron for maintenance)

Phase 3: Monitoring

Track partition management effectiveness:

-- Check partition sizes
SELECT
    schemaname || '.' || tablename as partition_name,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname = 'tasker' AND tablename LIKE 'tasks_p%'
ORDER BY tablename;

-- Verify partition pruning is working
EXPLAIN SELECT * FROM tasks
WHERE created_at > NOW() - INTERVAL '7 days';
-- Should show: "Seq Scan on tasker.tasks_p2025_11" (only current partition)

Decision Summary

Decision: Use PostgreSQL native partitioning with pg_partman for table growth management.

Rationale:

  • Zero runtime performance impact vs. periodic degradation with archive-and-delete
  • Operationally simpler (set-and-forget vs. monitoring archive jobs)
  • Battle-tested solution used by pgmq and thousands of production systems
  • Aligns with PostgreSQL best practices and community recommendations

Not Recommended: Archive-and-delete strategy due to write amplification, lock contention, and periodic performance degradation.

Task and Step Readiness and Execution

Last Updated: 2026-01-10 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | States and Lifecycles | Events and Commands

← Back to Documentation Hub


This document provides comprehensive documentation of the SQL functions and database logic that drives task and step readiness analysis, dependency resolution, and execution coordination in the tasker-core system.

Overview

The tasker-core system relies heavily on sophisticated PostgreSQL functions to perform complex workflow orchestration operations at the database level. This approach provides significant performance benefits through set-based operations, atomic transactions, and reduced network round trips while maintaining data consistency.

The SQL function system supports several critical categories of operations:

  1. Step Readiness Analysis: Complex dependency resolution and backoff calculations
  2. DAG Operations: Cycle detection, depth calculation, and parallel execution discovery
  3. State Management: Atomic state transitions with processor ownership tracking
  4. Analytics and Monitoring: Performance metrics and system health analysis
  5. Task Execution Context: Comprehensive execution metadata and results management

SQL Function Architecture

Function Categories

The SQL functions are organized into logical categories as defined in tasker-shared/src/database/sql_functions.rs:

1. Step Readiness Analysis

  • get_step_readiness_status(task_uuid, step_uuids[]): Comprehensive dependency analysis
  • calculate_backoff_delay(attempts, base_delay): Exponential backoff calculation
  • check_step_dependencies(step_uuid): Parent completion validation
  • get_ready_steps(task_uuid): Parallel execution candidate discovery

2. DAG Operations

  • detect_cycle(from_step_uuid, to_step_uuid): Cycle detection using recursive CTEs
  • calculate_dependency_levels(task_uuid): Topological depth calculation
  • calculate_step_depth(step_uuid): Individual step depth analysis
  • get_step_transitive_dependencies(step_uuid): Full dependency tree traversal

3. State Management

  • transition_task_state_atomic(task_uuid, from_state, to_state, processor_uuid): Atomic state transitions with ownership
  • get_current_task_state(task_uuid): Current task state resolution
  • finalize_task_completion(task_uuid): Task completion orchestration

4. Analytics and Monitoring

  • get_analytics_metrics(since_timestamp): Comprehensive system analytics
  • get_system_health_counts(): System-wide health and performance metrics
  • get_slowest_steps(limit): Performance optimization analysis
  • get_slowest_tasks(limit): Task performance analysis

5. Task Discovery and Execution

  • get_next_ready_task(): Single task discovery for orchestration
  • get_next_ready_tasks(limit): Batch task discovery for scaling
  • get_task_ready_info(task_uuid): Detailed task readiness information
  • get_task_execution_context(task_uuid): Complete execution metadata

Database Schema Foundation

Core Tables

The SQL functions operate on a comprehensive schema designed for UUID v7 performance and scalability. All tables reside in the tasker schema with simplified names. With search_path = tasker, public, queries use unqualified table names.

Primary Tables

  • tasks: Main workflow instances with UUID v7 primary keys
  • workflow_steps: Individual workflow steps with dependency relationships
  • task_transitions: Task state change audit trail with processor tracking
  • workflow_step_transitions: Step state change audit trail

Registry Tables

  • task_namespaces: Workflow namespace definitions
  • named_tasks: Task type templates and metadata
  • named_steps: Step type definitions and handlers
  • workflow_step_edges: Step dependency relationships (DAG structure)

Richer Task State Enhancements

The richer task states migration (migrations/tasker/20251209000000_tas41_richer_task_states.sql) enhanced the schema with:

Task State Management:

-- 12 comprehensive task states
ALTER TABLE task_transitions
ADD CONSTRAINT chk_task_transitions_to_state
CHECK (to_state IN (
    'pending', 'initializing', 'enqueuing_steps', 'steps_in_process',
    'evaluating_results', 'waiting_for_dependencies', 'waiting_for_retry',
    'blocked_by_failures', 'complete', 'error', 'cancelled', 'resolved_manually'
));

Processor Ownership Tracking:

ALTER TABLE task_transitions
ADD COLUMN processor_uuid UUID,
ADD COLUMN transition_metadata JSONB DEFAULT '{}';

Atomic State Transitions:

CREATE OR REPLACE FUNCTION transition_task_state_atomic(
    p_task_uuid UUID,
    p_from_state VARCHAR,
    p_to_state VARCHAR,
    p_processor_uuid UUID,
    p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN

Step Readiness Analysis

Recent Enhancements

WaitingForRetry State Support (Migration 20250927000000)

The step readiness system was enhanced to support the new WaitingForRetry state, which distinguishes retryable failures from permanent errors:

Key Changes:

  1. Helper Functions: Added calculate_step_next_retry_time() and evaluate_step_state_readiness() for consistent backoff logic
  2. State Recognition: Updated readiness evaluation to treat waiting_for_retry as a ready-eligible state alongside pending
  3. Backoff Calculation: Centralized exponential backoff logic with configurable backoff periods
  4. Performance Optimization: Introduced task-scoped CTEs to eliminate table scans for batch operations

Semantic Impact:

  • Before: error state included both retryable and permanent failures
  • After: error = permanent only, waiting_for_retry = awaiting backoff for retry

Backoff Logic Consolidation (October 2025)

The backoff calculation system was consolidated to eliminate configuration conflicts and race conditions:

Key Changes:

  1. Configuration Alignment: Single source of truth (TOML config) with max_backoff_seconds = 60
  2. Parameterized SQL Functions: calculate_step_next_retry_time() accepts configurable max delay and multiplier
  3. Atomic Updates: Row-level locking prevents concurrent backoff update conflicts
  4. Timing Consistency: last_attempted_at updated atomically with backoff_request_seconds

Issues Resolved:

  • Configuration Conflicts: Eliminated three conflicting max values (30s SQL, 60s code, 300s TOML)
  • Race Conditions: Added SELECT FOR UPDATE locking in BackoffCalculator
  • Hardcoded Values: Removed hardcoded 30-second cap and power(2, attempts) in SQL

Helper Functions Enhanced:

  1. calculate_step_next_retry_time(): Now parameterized with configuration values

    CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
        backoff_request_seconds INTEGER,
        last_attempted_at TIMESTAMP,
        failure_time TIMESTAMP,
        attempts INTEGER,
        p_max_backoff_seconds INTEGER DEFAULT 60,
        p_backoff_multiplier NUMERIC DEFAULT 2.0
    ) RETURNS TIMESTAMP
    
    • Respects custom backoff periods from step configuration (primary path)
    • Falls back to exponential backoff with configurable parameters
    • Defaults aligned with TOML config (60s max, 2.0 multiplier)
    • Used consistently across all readiness evaluation (see the example call after this list)
  2. set_step_backoff_atomic(): New atomic update function

    CREATE OR REPLACE FUNCTION set_step_backoff_atomic(
        p_step_uuid UUID,
        p_backoff_seconds INTEGER
    ) RETURNS BOOLEAN
    
    • Provides transactional guarantee for concurrent updates
    • Updates both backoff_request_seconds and last_attempted_at
    • Ensures timing consistency with SQL calculations
  3. evaluate_step_state_readiness(): Determines if a step is ready for execution

    CREATE OR REPLACE FUNCTION evaluate_step_state_readiness(
        current_state TEXT,
        processed BOOLEAN,
        in_process BOOLEAN,
        dependencies_satisfied BOOLEAN,
        retry_eligible BOOLEAN,
        retryable BOOLEAN,
        next_retry_time TIMESTAMP
    ) RETURNS BOOLEAN
    
    • Recognizes both pending and waiting_for_retry as ready-eligible states
    • Validates backoff period has expired before allowing retry
    • Ensures dependencies are satisfied and retry limits not exceeded
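
For example, calculate_step_next_retry_time can be exercised directly in psql. A call of this shape (no custom backoff, defaults matching the TOML config) returns the earliest retry time for a step on its third attempt:

SELECT calculate_step_next_retry_time(
    NULL::integer,      -- backoff_request_seconds: no custom backoff requested
    NOW()::timestamp,   -- last_attempted_at
    NOW()::timestamp,   -- failure_time
    3,                  -- attempts
    60,                 -- p_max_backoff_seconds
    2.0                 -- p_backoff_multiplier
);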

Step Readiness Status

The get_step_readiness_status function provides comprehensive analysis of step execution eligibility:

CREATE OR REPLACE FUNCTION get_step_readiness_status(
    task_uuid UUID,
    step_uuids UUID[] DEFAULT NULL
) RETURNS TABLE(
    workflow_step_uuid UUID,
    task_uuid UUID,
    named_step_uuid UUID,
    name VARCHAR,
    current_state VARCHAR,
    dependencies_satisfied BOOLEAN,
    retry_eligible BOOLEAN,
    ready_for_execution BOOLEAN,
    last_failure_at TIMESTAMP,
    next_retry_at TIMESTAMP,
    total_parents INTEGER,
    completed_parents INTEGER,
    attempts INTEGER,
    retry_limit INTEGER,
    backoff_request_seconds INTEGER,
    last_attempted_at TIMESTAMP
)

Key Analysis Features

Dependency Satisfaction:

  • Validates all parent steps are in complete or resolved_manually states
  • Handles complex DAG structures with multiple dependency paths
  • Supports conditional dependencies based on parent results

Retry Logic:

  • Exponential backoff with a configurable cap and multiplier (defaults: 60s max, 2.0 multiplier; see Backoff Logic Consolidation above)
  • Custom backoff periods from step configuration
  • Retry limit enforcement to prevent infinite loops
  • Failure tracking with temporal analysis

Execution Readiness:

  • State validation (must be pending or waiting_for_retry)
  • Dependency satisfaction confirmation
  • Retry eligibility assessment
  • Backoff period expiration checking
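
For example, a readiness snapshot for every step in a task comes straight from the function's column set:

SELECT name, current_state, dependencies_satisfied, retry_eligible,
       ready_for_execution, next_retry_at
FROM get_step_readiness_status($1);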

Step Readiness Implementation

The Rust integration provides type-safe access to step readiness analysis:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct StepReadinessStatus {
    pub workflow_step_uuid: Uuid,
    pub task_uuid: Uuid,
    pub named_step_uuid: Uuid,
    pub name: String,
    pub current_state: String,
    pub dependencies_satisfied: bool,
    pub retry_eligible: bool,
    pub ready_for_execution: bool,
    pub last_failure_at: Option<NaiveDateTime>,
    pub next_retry_at: Option<NaiveDateTime>,
    pub total_parents: i32,
    pub completed_parents: i32,
    pub attempts: i32,
    pub retry_limit: i32,
    pub backoff_request_seconds: Option<i32>,
    pub last_attempted_at: Option<NaiveDateTime>,
}

impl StepReadinessStatus {
    pub fn can_execute_now(&self) -> bool {
        self.ready_for_execution
    }

    pub fn blocking_reason(&self) -> Option<&'static str> {
        if !self.dependencies_satisfied {
            return Some("dependencies_not_satisfied");
        }
        if !self.retry_eligible {
            return Some("retry_not_eligible");
        }
        Some("invalid_state")
    }

    pub fn effective_backoff_seconds(&self) -> i32 {
        self.backoff_request_seconds.unwrap_or_else(|| {
            if self.attempts > 0 {
                std::cmp::min(2_i32.pow(self.attempts as u32), 30)
            } else {
                0
            }
        })
    }
}
}

DAG Operations and Dependency Resolution

Dependency Level Calculation

The calculate_dependency_levels function uses recursive CTEs to perform topological analysis of the workflow DAG:

CREATE OR REPLACE FUNCTION calculate_dependency_levels(input_task_uuid UUID)
RETURNS TABLE(workflow_step_uuid UUID, dependency_level INTEGER)
LANGUAGE plpgsql STABLE AS $$
BEGIN
  RETURN QUERY
  WITH RECURSIVE dependency_levels AS (
    -- Base case: Find root nodes (steps with no dependencies)
    SELECT
      ws.workflow_step_uuid,
      0 as level
    FROM workflow_steps ws
    WHERE ws.task_uuid = input_task_uuid
      AND NOT EXISTS (
        SELECT 1
        FROM workflow_step_edges wse
        WHERE wse.to_step_uuid = ws.workflow_step_uuid
      )

    UNION ALL

    -- Recursive case: Find children of current level nodes
    SELECT
      wse.to_step_uuid as workflow_step_uuid,
      dl.level + 1 as level
    FROM dependency_levels dl
    JOIN workflow_step_edges wse ON wse.from_step_uuid = dl.workflow_step_uuid
    JOIN workflow_steps ws ON ws.workflow_step_uuid = wse.to_step_uuid
    WHERE ws.task_uuid = input_task_uuid
  )
  SELECT
    dl.workflow_step_uuid,
    MAX(dl.level) as dependency_level  -- Use MAX to handle multiple paths
  FROM dependency_levels dl
  GROUP BY dl.workflow_step_uuid
  ORDER BY dependency_level, workflow_step_uuid;
END;

Dependency Level Benefits

Parallel Execution Planning:

  • Steps at the same dependency level can execute in parallel
  • Enables optimal resource utilization across workers
  • Supports batch enqueueing for scalability

Execution Ordering:

  • Level 0: Root steps (no dependencies) - can start immediately
  • Level N: Steps requiring completion of level N-1 steps
  • Topological ordering ensures dependency satisfaction

Performance Optimization:

  • Single query provides complete dependency analysis
  • Avoids N+1 query problems in dependency resolution
  • Enables batch processing optimizations
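
For example, a single call returns the level of every step in a task's DAG; steps sharing a dependency_level can be enqueued together:

SELECT workflow_step_uuid, dependency_level
FROM calculate_dependency_levels($1)
ORDER BY dependency_level, workflow_step_uuid;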

Transitive Dependencies

The get_step_transitive_dependencies function provides complete ancestor analysis:

CREATE OR REPLACE FUNCTION get_step_transitive_dependencies(step_uuid UUID)
RETURNS TABLE(
    step_name VARCHAR,
    step_uuid UUID,
    task_uuid UUID,
    distance INTEGER,
    processed BOOLEAN,
    results JSONB
)

This enables step handlers to access results from any ancestor step:

#![allow(unused)]
fn main() {
impl SqlFunctionExecutor {
    pub async fn get_step_dependency_results_map(
        &self,
        step_uuid: Uuid,
    ) -> Result<HashMap<String, StepExecutionResult>, sqlx::Error> {
        let dependencies = self.get_step_transitive_dependencies(step_uuid).await?;
        Ok(dependencies
            .into_iter()
            .filter_map(|dep| {
                if dep.processed && dep.results.is_some() {
                    let results: StepExecutionResult = dep.results.unwrap().into();
                    Some((dep.step_name, results))
                } else {
                    None
                }
            })
            .collect())
    }
}
}
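
In a step handler, this map is typically consumed by ancestor step name; the handler and step names below are illustrative:

#![allow(unused)]
fn main() {
use uuid::Uuid;

async fn build_shipping_label(
    executor: &SqlFunctionExecutor,
    step_uuid: Uuid,
) -> Result<(), sqlx::Error> {
    // All processed ancestors, keyed by step name.
    let ancestors = executor.get_step_dependency_results_map(step_uuid).await?;

    // "process_payment" is an example ancestor step name.
    if let Some(payment_result) = ancestors.get("process_payment") {
        // Use the ancestor's output to build this step's work.
        let _ = payment_result;
    }
    Ok(())
}
}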

Task Execution Context

Recent Enhancements

Permanently Blocked Detection Fix (Migration 20251001000000)

The get_task_execution_context function was enhanced to correctly identify tasks blocked by permanent errors:

Problem: The function only checked attempts >= retry_limit to detect permanently blocked steps, missing cases where workers marked errors as non-retryable (e.g., missing handlers, configuration errors).

Solution: Updated permanently_blocked_steps calculation to check both conditions:

COUNT(CASE WHEN sd.current_state = 'error'
            AND (sd.attempts >= retry_limit OR sd.retry_eligible = false) THEN 1 END)

Impact:

  • execution_status: Now correctly returns blocked_by_failures instead of waiting_for_dependencies for tasks with non-retryable errors
  • recommended_action: Returns handle_failures instead of wait_for_dependencies
  • health_status: Returns blocked instead of recovering when appropriate

This fix ensures the orchestration system properly identifies when manual intervention is needed versus when a task is simply waiting for retry backoff.

Task Discovery and Orchestration

Task Readiness Discovery

The system provides multiple functions for task discovery based on orchestration needs:

Single Task Discovery

CREATE OR REPLACE FUNCTION get_next_ready_task()
RETURNS TABLE(
    task_uuid UUID,
    task_name VARCHAR,
    priority INTEGER,
    namespace_name VARCHAR,
    ready_steps_count BIGINT,
    computed_priority NUMERIC,
    current_state VARCHAR
)

Batch Task Discovery

CREATE OR REPLACE FUNCTION get_next_ready_tasks(limit_count INTEGER)
RETURNS TABLE(
    task_uuid UUID,
    task_name VARCHAR,
    priority INTEGER,
    namespace_name VARCHAR,
    ready_steps_count BIGINT,
    computed_priority NUMERIC,
    current_state VARCHAR
)

Task Ready Information

The ReadyTaskInfo structure provides comprehensive task metadata for orchestration decisions:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct ReadyTaskInfo {
    pub task_uuid: Uuid,
    pub task_name: String,
    pub priority: i32,
    pub namespace_name: String,
    pub ready_steps_count: i64,
    pub computed_priority: Option<BigDecimal>,
    pub current_state: String,
}
}

Priority Calculation:

  • Base priority from task configuration
  • Dynamic priority adjustment based on age, retry attempts
  • Namespace-based priority modifiers
  • SLA-based priority escalation

Ready Steps Count:

  • Real-time count of steps eligible for execution
  • Used for batch size optimization
  • Influences orchestration scheduling decisions

State Management and Atomic Transitions

Atomic State Transitions

The enhanced state machine provides atomic transitions with processor ownership:

CREATE OR REPLACE FUNCTION transition_task_state_atomic(
    p_task_uuid UUID,
    p_from_state VARCHAR,
    p_to_state VARCHAR,
    p_processor_uuid UUID,
    p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN AS $$
DECLARE
    v_sort_key INTEGER;
    v_transitioned INTEGER := 0;
BEGIN
    -- Get next sort key
    SELECT COALESCE(MAX(sort_key), 0) + 1 INTO v_sort_key
    FROM task_transitions
    WHERE task_uuid = p_task_uuid;

    -- Atomically transition only if in expected state
    WITH current_state AS (
        SELECT to_state, processor_uuid
        FROM task_transitions
        WHERE task_uuid = p_task_uuid
        AND most_recent = true
        FOR UPDATE
    ),
    ownership_check AS (
        SELECT
            CASE
                -- States requiring ownership
                WHEN cs.to_state IN ('initializing', 'enqueuing_steps',
                                   'steps_in_process', 'evaluating_results')
                THEN cs.processor_uuid = p_processor_uuid OR cs.processor_uuid IS NULL
                -- Other states don't require ownership
                ELSE true
            END as can_transition
        FROM current_state cs
        WHERE cs.to_state = p_from_state
    ),
    do_update AS (
        UPDATE task_transitions
        SET most_recent = false
        WHERE task_uuid = p_task_uuid
        AND most_recent = true
        AND EXISTS (SELECT 1 FROM ownership_check WHERE can_transition)
        RETURNING task_uuid
    )
    INSERT INTO task_transitions (
        task_uuid, from_state, to_state,
        processor_uuid, transition_metadata,
        sort_key, most_recent, created_at, updated_at
    )
    SELECT
        p_task_uuid, p_from_state, p_to_state,
        p_processor_uuid, p_metadata,
        v_sort_key, true, NOW(), NOW()
    WHERE EXISTS (SELECT 1 FROM do_update);

    GET DIAGNOSTICS v_transitioned = ROW_COUNT;
    RETURN v_transitioned > 0;
END;
$$ LANGUAGE plpgsql;

Key Features

Atomic Operation:

  • Single transaction with row-level locking
  • Compare-and-swap semantics prevent race conditions
  • Returns boolean indicating success/failure

Ownership Validation:

  • Processor ownership required for active states
  • Prevents concurrent processing by multiple orchestrators
  • Supports ownership claiming for unowned tasks

State Consistency:

  • Validates current state matches expected from_state
  • Maintains audit trail with complete transition history
  • Updates most_recent flags atomically

Current State Resolution

Fast current state lookups are provided through optimized queries:

#![allow(unused)]
fn main() {
impl SqlFunctionExecutor {
    pub async fn get_current_task_state(&self, task_uuid: Uuid)
        -> Result<TaskState, sqlx::Error> {
        let state_str = sqlx::query_scalar!(
            r#"SELECT get_current_task_state($1) as "state""#,
            task_uuid
        )
        .fetch_optional(&self.pool)
        .await?
        .ok_or_else(|| sqlx::Error::RowNotFound)?;

        match state_str {
            Some(state) => TaskState::try_from(state.as_str())
                .map_err(|_| sqlx::Error::Decode("Invalid task state".into())),
            None => Err(sqlx::Error::RowNotFound),
        }
    }
}
}

Analytics and System Health

System Health Monitoring

The get_system_health_counts function provides comprehensive system visibility:

CREATE OR REPLACE FUNCTION get_system_health_counts()
RETURNS TABLE(
    pending_tasks BIGINT,
    initializing_tasks BIGINT,
    enqueuing_steps_tasks BIGINT,
    steps_in_process_tasks BIGINT,
    evaluating_results_tasks BIGINT,
    waiting_for_dependencies_tasks BIGINT,
    waiting_for_retry_tasks BIGINT,
    blocked_by_failures_tasks BIGINT,
    complete_tasks BIGINT,
    error_tasks BIGINT,
    cancelled_tasks BIGINT,
    resolved_manually_tasks BIGINT,
    total_tasks BIGINT,
    -- step counts...
) AS $$

Health Score Calculation

The Rust implementation provides derived health metrics:

#![allow(unused)]
fn main() {
impl SystemHealthCounts {
    pub fn health_score(&self) -> f64 {
        if self.total_tasks == 0 {
            return 1.0;
        }

        let success_rate = self.complete_tasks as f64 / self.total_tasks as f64;
        let error_rate = self.error_tasks as f64 / self.total_tasks as f64;
        let connection_health = 1.0 -
            (self.active_connections as f64 / self.max_connections as f64).min(1.0);

        // Weighted combination: 50% success rate, 30% non-error rate, 20% connection health
        (success_rate * 0.5) + ((1.0 - error_rate) * 0.3) + (connection_health * 0.2)
    }

    pub fn is_under_heavy_load(&self) -> bool {
        let connection_pressure =
            self.active_connections as f64 / self.max_connections as f64;
        let error_rate = if self.total_tasks > 0 {
            self.error_tasks as f64 / self.total_tasks as f64
        } else {
            0.0
        };

        connection_pressure > 0.8 || error_rate > 0.2
    }
}
}
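
A small sketch of how these derived metrics might gate operational alerting; the 0.7 threshold is illustrative, not a Tasker default:

#![allow(unused)]
fn main() {
fn evaluate_health(counts: &SystemHealthCounts) {
    let score = counts.health_score();

    // Illustrative alerting threshold.
    if score < 0.7 {
        eprintln!("system health degraded: score {score:.2}");
    }
    if counts.is_under_heavy_load() {
        eprintln!("heavy load: consider scaling workers or draining queues");
    }
}
}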

Analytics Metrics

The get_analytics_metrics function provides comprehensive performance analysis:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct AnalyticsMetrics {
    pub active_tasks_count: i64,
    pub total_namespaces_count: i64,
    pub unique_task_types_count: i64,
    pub system_health_score: BigDecimal,
    pub task_throughput: i64,
    pub completion_count: i64,
    pub error_count: i64,
    pub completion_rate: BigDecimal,
    pub error_rate: BigDecimal,
    pub avg_task_duration: BigDecimal,
    pub avg_step_duration: BigDecimal,
    pub step_throughput: i64,
    pub analysis_period_start: DateTime<Utc>,
    pub calculated_at: DateTime<Utc>,
}
}

Performance Optimization Analysis

Slowest Steps Analysis

The system provides performance optimization guidance through detailed analysis:

CREATE OR REPLACE FUNCTION get_slowest_steps(
    since_timestamp TIMESTAMP WITH TIME ZONE,
    limit_count INTEGER,
    namespace_filter VARCHAR,
    task_name_filter VARCHAR,
    version_filter VARCHAR
) RETURNS TABLE(
    named_step_uuid UUID,
    step_name VARCHAR,
    avg_duration_seconds NUMERIC,
    max_duration_seconds NUMERIC,
    min_duration_seconds NUMERIC,
    execution_count INTEGER,
    error_count INTEGER,
    error_rate NUMERIC,
    last_executed_at TIMESTAMP WITH TIME ZONE
)

Slowest Tasks Analysis

Similar analysis is available at the task level:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct SlowestTaskAnalysis {
    pub named_task_uuid: Uuid,
    pub task_name: String,
    pub avg_duration_seconds: f64,
    pub max_duration_seconds: f64,
    pub min_duration_seconds: f64,
    pub execution_count: i32,
    pub avg_step_count: f64,
    pub error_count: i32,
    pub error_rate: f64,
    pub last_executed_at: Option<DateTime<Utc>>,
}
}

Critical Problem-Solving SQL Functions

PGMQ Message Race Condition Prevention

Problem: Multiple Workers Claiming Same Message

When multiple workers simultaneously try to process steps from the same queue, PGMQ’s standard pgmq.read() function randomly selects messages, potentially causing workers to miss messages they were specifically notified about. This creates inefficiency and potential race conditions.

Solution: pgmq_read_specific_message()

CREATE OR REPLACE FUNCTION pgmq_read_specific_message(
    queue_name text,
    target_msg_id bigint,
    vt_seconds integer DEFAULT 30
) RETURNS TABLE (
    msg_id bigint,
    read_ct integer,
    enqueued_at timestamp with time zone,
    vt timestamp with time zone,
    message jsonb
) AS $$

Key Problem-Solving Logic:

  1. Atomic Claim with Visibility Timeout: Uses UPDATE…RETURNING pattern to atomically:

    • Check if message is available (vt <= now())
    • Set new visibility timeout preventing other workers from claiming
    • Increment read count for monitoring retry attempts
    • Return message data only if successfully claimed
  2. Race Condition Prevention: The WHERE vt <= now() clause ensures only one worker can claim a message. If two workers try simultaneously, only one UPDATE succeeds.

  3. Graceful Failure Handling: Returns empty result set if message is:

    • Already claimed by another worker (vt > now())
    • Non-existent (deleted or never existed)
    • Archived (moved to archive table)
  4. Security: Validates queue name to prevent SQL injection in dynamic query construction.

Real-World Impact: Eliminates “message not found” errors when workers are notified about specific messages but can’t retrieve them due to random selection in standard read.
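
A hedged sketch of how a worker might call this function directly through sqlx after being notified about a specific msg_id; the query shape is an assumption, not the worker crate's actual API:

#![allow(unused)]
fn main() {
async fn claim_specific_message(
    pool: &sqlx::PgPool,
    queue: &str,
    msg_id: i64,
) -> Result<Option<serde_json::Value>, sqlx::Error> {
    // An empty result means another worker already holds the visibility
    // timeout, or the message was deleted/archived.
    let message: Option<Option<serde_json::Value>> = sqlx::query_scalar(
        "SELECT message FROM pgmq_read_specific_message($1, $2, $3)",
    )
    .bind(queue)
    .bind(msg_id)
    .bind(30_i32) // visibility timeout in seconds
    .fetch_optional(pool)
    .await?;

    Ok(message.flatten())
}
}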

Task State Ownership and Atomic Transitions

Problem: Concurrent Orchestrators Processing Same Task

In distributed deployments, multiple orchestrator instances might try to process the same task simultaneously, leading to duplicate work, inconsistent state, and race conditions.

Solution: transition_task_state_atomic()

CREATE OR REPLACE FUNCTION transition_task_state_atomic(
    p_task_uuid UUID,
    p_from_state VARCHAR,
    p_to_state VARCHAR,
    p_processor_uuid UUID,
    p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN AS $$

Key Problem-Solving Logic:

  1. Compare-and-Swap Pattern:

    • Reads current state with FOR UPDATE lock
    • Only transitions if current state matches expected from_state
    • Returns false if state has changed, allowing caller to retry with fresh state
  2. Processor Ownership Enforcement:

    CASE
        WHEN cs.to_state IN ('initializing', 'enqueuing_steps',
                            'steps_in_process', 'evaluating_results')
        THEN cs.processor_uuid = p_processor_uuid OR cs.processor_uuid IS NULL
        ELSE true
    END
    
    • Active processing states require ownership match
    • Allows claiming unowned tasks (NULL processor_uuid)
    • Terminal states (complete, error) don’t require ownership
  3. Audit Trail Preservation:

    • Updates previous transition’s most_recent = false
    • Inserts new transition with most_recent = true
    • Maintains complete history with sort_key ordering
  4. Atomic Success/Failure: Returns boolean indicating whether transition succeeded, enabling callers to handle contention gracefully.

Real-World Impact: Enables safe distributed orchestration where multiple instances can operate without conflicts, automatically distributing work through ownership claiming.
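
A sketch of the caller side, assuming the function is invoked directly through sqlx; when the function returns false, the orchestrator re-reads the current state before deciding whether to retry or yield:

#![allow(unused)]
fn main() {
async fn try_claim_for_initializing(
    pool: &sqlx::PgPool,
    task_uuid: uuid::Uuid,
    processor_uuid: uuid::Uuid,
) -> Result<bool, sqlx::Error> {
    // Compare-and-swap: succeeds only if the task is still in 'pending'.
    let transitioned: bool = sqlx::query_scalar(
        "SELECT transition_task_state_atomic($1, $2, $3, $4, '{}'::jsonb)",
    )
    .bind(task_uuid)
    .bind("pending")
    .bind("initializing")
    .bind(processor_uuid)
    .fetch_one(pool)
    .await?;

    // false => another orchestrator won the race for this task.
    Ok(transitioned)
}
}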

Batch Task Discovery with Priority

Problem: Efficient Work Distribution Across Orchestrators

Orchestrators need to discover ready tasks efficiently without creating hotspots or missing tasks, while respecting priority and avoiding claimed tasks.

Solution: get_next_ready_tasks()

CREATE OR REPLACE FUNCTION get_next_ready_tasks(p_limit INTEGER DEFAULT 5)
RETURNS TABLE(
    task_uuid UUID,
    task_name TEXT,
    priority INTEGER,
    namespace_name TEXT,
    ready_steps_count BIGINT,
    computed_priority NUMERIC,
    current_state VARCHAR
)

Key Problem-Solving Logic:

  1. Ready Step Discovery:

    WITH ready_steps AS (
        SELECT task_uuid, COUNT(*) as ready_count
        FROM workflow_steps
        WHERE current_state IN ('pending', 'error')
        AND [dependency checks]
        GROUP BY task_uuid
    )
    
    • Pre-aggregates ready steps per task for efficiency
    • Considers both new steps and retryable errors
  2. State-Based Filtering:

    • Only returns tasks in states that need processing
    • Excludes terminal states (complete, cancelled)
    • Includes waiting states that might have become ready
  3. Priority Computation:

    computed_priority = base_priority +
                       (age_factor * hours_waiting) +
                       (retry_factor * retry_count)
    
    • Dynamic priority based on age and retry attempts
    • Prevents task starvation through age escalation
  4. Batch Efficiency:

    • Returns multiple tasks in single query
    • Reduces database round trips
    • Enables parallel processing across orchestrators

Real-World Impact: Enables efficient work distribution where each orchestrator can claim a batch of tasks, reducing contention and improving throughput.
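
Putting discovery and claiming together, a simplified orchestration cycle might look like the sketch below (assuming the executor method returns a Vec<ReadyTaskInfo>; the claim step is the atomic transition shown earlier):

#![allow(unused)]
fn main() {
async fn discovery_cycle(executor: &SqlFunctionExecutor) -> Result<(), sqlx::Error> {
    // One round trip returns up to 50 ready tasks, ordered by computed priority.
    let ready_tasks = executor.get_next_ready_tasks(50).await?;

    for task in ready_tasks {
        // Claiming via transition_task_state_atomic ensures only one
        // orchestrator instance processes each candidate.
        println!(
            "candidate task {} ({} ready steps, priority {:?})",
            task.task_uuid, task.ready_steps_count, task.computed_priority
        );
    }
    Ok(())
}
}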

Complex Dependency Resolution

Problem: Determining Step Execution Readiness

Workflow steps have complex dependencies involving parent completion, retry logic, backoff timing, and state validation. Determining which steps are ready for execution requires sophisticated dependency analysis that must handle:

  • Multiple parent dependencies with conditional logic
  • Exponential backoff after failures
  • Retry limits and attempt tracking
  • State consistency across distributed workers

Solution: get_step_readiness_status()

CREATE OR REPLACE FUNCTION get_step_readiness_status(
    input_task_uuid UUID,
    step_uuids UUID[] DEFAULT NULL
) RETURNS TABLE(
    workflow_step_uuid UUID,
    task_uuid UUID,
    named_step_uuid UUID,
    name VARCHAR,
    current_state VARCHAR,
    dependencies_satisfied BOOLEAN,
    retry_eligible BOOLEAN,
    ready_for_execution BOOLEAN,
    -- ... additional metadata
)

Key Problem-Solving Logic:

  1. Dependency Satisfaction Analysis:

    WITH parent_completion AS (
        SELECT
            edge.to_step_uuid,
            COUNT(*) as total_parents,
            COUNT(CASE WHEN parent.current_state = 'complete' THEN 1 END) as completed_parents
        FROM workflow_step_edges edge
        JOIN workflow_steps parent ON parent.workflow_step_uuid = edge.from_step_uuid
        WHERE parent.task_uuid = input_task_uuid
        GROUP BY edge.to_step_uuid
    )
    
    • Counts total vs. completed parent dependencies
    • Handles conditional dependencies based on parent results
    • Supports complex DAG structures with multiple paths
  2. Retry Eligibility Assessment:

    retry_eligible = (
        current_state = 'error' AND
        attempts < retry_limit AND
        (last_attempted_at IS NULL OR
         last_attempted_at + backoff_interval <= NOW())
    )
    
    • Enforces retry limits to prevent infinite loops
    • Calculates exponential backoff: 2^attempts seconds (max 30)
    • Respects custom backoff periods from step configuration
    • Considers temporal constraints for retry timing
  3. State Validation:

    ready_for_execution = (
        current_state IN ('pending', 'error') AND
        dependencies_satisfied AND
        retry_eligible
    )
    
    • Only pending or retryable error steps can execute
    • Requires all dependencies satisfied
    • Must pass retry eligibility checks
    • Prevents execution of steps in terminal states
  4. Backoff Calculation:

    next_retry_at = CASE
        WHEN current_state = 'error' AND attempts > 0
        THEN last_attempted_at + INTERVAL '1 second' *
             COALESCE(backoff_request_seconds, LEAST(POW(2, attempts), 30))
        ELSE NULL
    END
    
    • Custom backoff from step configuration takes precedence
    • Default exponential backoff with maximum cap
    • Temporal precision for scheduling retry attempts

Real-World Impact: Enables complex workflow orchestration with sophisticated dependency management, retry logic, and backoff handling, supporting enterprise-grade reliability patterns while maintaining high performance through set-based operations.

Integration with Event and State Systems

PostgreSQL LISTEN/NOTIFY Integration

The SQL functions integrate with the event-driven architecture through PostgreSQL notifications:

PGMQ Wrapper Functions for Atomic Operations

The system uses wrapper functions that combine PGMQ message sending with PostgreSQL notifications atomically:

-- Atomic wrapper that sends message AND notification
CREATE OR REPLACE FUNCTION pgmq_send_with_notify(
    queue_name TEXT,
    message JSONB,
    delay_seconds INTEGER DEFAULT 0
) RETURNS BIGINT AS $$
DECLARE
    msg_id BIGINT;
    namespace_name TEXT;
    event_payload TEXT;
    namespace_channel TEXT;
    global_channel TEXT := 'pgmq_message_ready';
BEGIN
    -- Send message using PGMQ's native function
    SELECT pgmq.send(queue_name, message, delay_seconds) INTO msg_id;

    -- Extract namespace from queue name using robust helper
    namespace_name := extract_queue_namespace(queue_name);

    -- Build namespace-specific channel name
    namespace_channel := 'pgmq_message_ready.' || namespace_name;

    -- Build event payload
    event_payload := json_build_object(
        'event_type', 'message_ready',
        'msg_id', msg_id,
        'queue_name', queue_name,
        'namespace', namespace_name,
        'ready_at', NOW()::timestamptz,
        'delay_seconds', delay_seconds
    )::text;

    -- Send notifications in same transaction
    PERFORM pg_notify(namespace_channel, event_payload);

    -- Also send to global channel if different
    IF namespace_channel != global_channel THEN
        PERFORM pg_notify(global_channel, event_payload);
    END IF;

    RETURN msg_id;
END;
$$ LANGUAGE plpgsql;
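
On the consuming side, a worker or orchestrator can subscribe to these channels with sqlx's PgListener; the channel name below follows the namespace convention used by the wrapper:

#![allow(unused)]
fn main() {
use sqlx::postgres::PgListener;

async fn listen_for_ready_messages(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect_with(pool).await?;

    // Namespace-specific channel emitted by pgmq_send_with_notify.
    listener.listen("pgmq_message_ready.orchestration").await?;

    loop {
        let notification = listener.recv().await?;
        // Payload is the JSON object built in pgmq_send_with_notify
        // (event_type, msg_id, queue_name, namespace, ready_at, delay_seconds).
        println!("message ready: {}", notification.payload());
    }
}
}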

Namespace Extraction Helper

-- Robust namespace extraction helper function
CREATE OR REPLACE FUNCTION extract_queue_namespace(queue_name TEXT)
RETURNS TEXT AS $$
BEGIN
    -- Handle orchestration queues
    IF queue_name ~ '^orchestration' THEN
        RETURN 'orchestration';
    END IF;

    -- Handle worker queues: worker_namespace_queue -> namespace
    IF queue_name ~ '^worker_.*_queue$' THEN
        RETURN COALESCE(
            (regexp_match(queue_name, '^worker_(.+?)_queue$'))[1],
            'worker'
        );
    END IF;

    -- Handle standard namespace_queue pattern
    IF queue_name ~ '^[a-zA-Z][a-zA-Z0-9_]*_queue$' THEN
        RETURN COALESCE(
            (regexp_match(queue_name, '^([a-zA-Z][a-zA-Z0-9_]*)_queue$'))[1],
            'default'
        );
    END IF;

    -- Fallback for any other pattern
    RETURN 'default';
END;
$$ LANGUAGE plpgsql;
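
A few illustrative mappings, exercised here through a direct query; the queue names are examples only:

#![allow(unused)]
fn main() {
async fn namespace_examples(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    // Expected results given the patterns above:
    //   "orchestration_task_requests" -> "orchestration"
    //   "worker_payments_queue"       -> "payments"
    //   "fulfillment_queue"           -> "fulfillment"
    //   "not-a-queue"                 -> "default"
    let namespace: String = sqlx::query_scalar("SELECT extract_queue_namespace($1)")
        .bind("worker_payments_queue")
        .fetch_one(pool)
        .await?;
    assert_eq!(namespace, "payments");
    Ok(())
}
}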

Fallback Polling for Task Readiness

Instead of database triggers for task readiness notifications, the system uses a fallback polling mechanism to ensure no ready tasks are missed:

FallbackPoller Configuration:

  • Default polling interval: 30 seconds
  • Runs StepEnqueuerService::process_batch() periodically
  • Catches tasks that may have been missed by primary PGMQ notification system
  • Configurable enable/disable via TOML configuration

Key Benefits:

  • Resilience: Ensures no tasks are permanently stuck if notifications fail
  • Simplicity: No complex database triggers or state tracking required
  • Observability: Clear metrics on fallback discovery vs. event-driven discovery
  • Safety Net: Primary event-driven system + fallback polling provides redundancy
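
A sketch of the fallback loop's shape; the process_batch closure stands in for StepEnqueuerService::process_batch(), and the signature is an assumption:

#![allow(unused)]
fn main() {
use std::time::Duration;

async fn run_fallback_poller<F, Fut>(mut process_batch: F, interval_secs: u64)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), sqlx::Error>>,
{
    let mut ticker = tokio::time::interval(Duration::from_secs(interval_secs));
    loop {
        ticker.tick().await;
        // Safety net: catches any ready work the event-driven path missed.
        if let Err(e) = process_batch().await {
            eprintln!("fallback poll failed: {e}");
        }
    }
}
}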

PGMQ Message Queue Integration

SQL functions coordinate with PGMQ for reliable message processing:

Queue Management Functions

-- Ensure queue exists with proper configuration
CREATE OR REPLACE FUNCTION ensure_task_queue(queue_name VARCHAR)
RETURNS BOOLEAN AS $$
BEGIN
    -- Create queue if it doesn't exist
    PERFORM pgmq.create_queue(queue_name);

    -- Ensure headers column exists (pgmq-rs compatibility)
    PERFORM pgmq_ensure_headers_column(queue_name);

    RETURN TRUE;
END;
$$ LANGUAGE plpgsql;

Message Processing Support

-- Get queue statistics for monitoring
CREATE OR REPLACE FUNCTION get_queue_statistics(queue_name VARCHAR)
RETURNS TABLE(
    queue_name VARCHAR,
    queue_length BIGINT,
    oldest_msg_age_seconds INTEGER,
    newest_msg_age_seconds INTEGER
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        queue_name,
        pgmq.queue_length(queue_name),
        EXTRACT(EPOCH FROM (NOW() - MIN(enqueued_at)))::INTEGER,
        EXTRACT(EPOCH FROM (NOW() - MAX(enqueued_at)))::INTEGER
    FROM pgmq.messages(queue_name);
END;
$$ LANGUAGE plpgsql;

Transaction Safety

All SQL functions are designed with transaction safety in mind:

Atomic Operations:

  • State transitions use row-level locking (FOR UPDATE)
  • Compare-and-swap patterns prevent race conditions
  • Rollback safety for partial failures

Consistency Guarantees:

  • Foreign key constraints maintained across all operations
  • Check constraints validate state transitions
  • Audit trails preserved for debugging and compliance

Performance Optimization:

  • Efficient indexes for common query patterns
  • Materialized views for expensive analytics queries
  • Connection pooling for high concurrency

Usage Patterns and Best Practices

Rust Integration Patterns

The SqlFunctionExecutor provides type-safe access to all SQL functions:

#![allow(unused)]
fn main() {
use tasker_shared::database::sql_functions::{SqlFunctionExecutor, FunctionRegistry};

// Direct executor usage
let executor = SqlFunctionExecutor::new(pool);
let ready_steps = executor.get_ready_steps(task_uuid).await?;

// Registry pattern for organized access
let registry = FunctionRegistry::new(pool);
let analytics = registry.analytics().get_analytics_metrics(None).await?;
let health = registry.system_health().get_system_health_counts().await?;
}

Batch Processing Optimization

For high-throughput scenarios, the system supports efficient batch operations:

#![allow(unused)]
fn main() {
// Batch step readiness analysis
let task_uuids = vec![task1_uuid, task2_uuid, task3_uuid];
let batch_readiness = executor.get_step_readiness_status_batch(task_uuids).await?;

// Batch task discovery
let ready_tasks = executor.get_next_ready_tasks(50).await?;
}

Error Handling Best Practices

SQL function errors are properly propagated through the type system:

#![allow(unused)]
fn main() {
match executor.get_current_task_state(task_uuid).await {
    Ok(state) => {
        // Process state
    }
    Err(sqlx::Error::RowNotFound) => {
        // Handle missing task
    }
    Err(e) => {
        // Handle other database errors
    }
}
}

Tasker Configuration Documentation Index

Coverage: 246/246 parameters documented (100%)


Common Configuration

  • backoff (common.backoff) — 5 params (5 documented)
  • cache (common.cache) — 10 params (10 documented)
  • moka (common.cache.moka) — 1 params
  • redis (common.cache.redis) — 4 params
  • circuit_breakers (common.circuit_breakers) — 13 params (13 documented)
  • component_configs (common.circuit_breakers.component_configs) — 8 params
  • default_config (common.circuit_breakers.default_config) — 3 params
  • global_settings (common.circuit_breakers.global_settings) — 2 params
  • database (common.database) — 7 params (7 documented)
  • pool (common.database.pool) — 6 params
  • execution (common.execution) — 2 params (2 documented)
  • mpsc_channels (common.mpsc_channels) — 4 params (4 documented)
  • event_publisher (common.mpsc_channels.event_publisher) — 1 params
  • ffi (common.mpsc_channels.ffi) — 1 params
  • overflow_policy (common.mpsc_channels.overflow_policy) — 2 params
  • pgmq_database (common.pgmq_database) — 8 params (8 documented)
  • pool (common.pgmq_database.pool) — 6 params
  • queues (common.queues) — 14 params (14 documented)
  • orchestration_queues (common.queues.orchestration_queues) — 3 params
  • pgmq (common.queues.pgmq) — 3 params
  • rabbitmq (common.queues.rabbitmq) — 3 params
  • system (common.system) — 1 params (1 documented)
  • task_templates (common.task_templates) — 1 params (1 documented)

Orchestration Configuration

  • orchestration (orchestration) — 2 params (2 documented)
  • batch_processing (orchestration.batch_processing) — 4 params (4 documented)
  • decision_points (orchestration.decision_points) — 7 params (7 documented)
  • dlq (orchestration.dlq) — 13 params (13 documented)
  • staleness_detection (orchestration.dlq.staleness_detection) — 12 params
  • event_systems (orchestration.event_systems) — 36 params (36 documented)
  • orchestration (orchestration.event_systems.orchestration) — 18 params
  • task_readiness (orchestration.event_systems.task_readiness) — 18 params
  • grpc (orchestration.grpc) — 9 params (9 documented)
  • mpsc_channels (orchestration.mpsc_channels) — 3 params (3 documented)
  • command_processor (orchestration.mpsc_channels.command_processor) — 1 params
  • event_listeners (orchestration.mpsc_channels.event_listeners) — 1 params
  • event_systems (orchestration.mpsc_channels.event_systems) — 1 params
  • web (orchestration.web) — 17 params (17 documented)
  • auth (orchestration.web.auth) — 9 params
  • database_pools (orchestration.web.database_pools) — 5 params

Worker Configuration

  • worker (worker) — 2 params (2 documented)
  • circuit_breakers (worker.circuit_breakers) — 4 params (4 documented)
  • ffi_completion_send (worker.circuit_breakers.ffi_completion_send) — 4 params
  • event_systems (worker.event_systems) — 32 params (32 documented)
  • worker (worker.event_systems.worker) — 32 params
  • grpc (worker.grpc) — 9 params (9 documented)
  • mpsc_channels (worker.mpsc_channels) — 23 params (23 documented)
  • command_processor (worker.mpsc_channels.command_processor) — 1 params
  • domain_events (worker.mpsc_channels.domain_events) — 3 params
  • event_listeners (worker.mpsc_channels.event_listeners) — 1 params
  • event_subscribers (worker.mpsc_channels.event_subscribers) — 2 params
  • event_systems (worker.mpsc_channels.event_systems) — 1 params
  • ffi_dispatch (worker.mpsc_channels.ffi_dispatch) — 5 params
  • handler_dispatch (worker.mpsc_channels.handler_dispatch) — 7 params
  • in_process_events (worker.mpsc_channels.in_process_events) — 3 params
  • orchestration_client (worker.orchestration_client) — 3 params (3 documented)
  • web (worker.web) — 17 params (17 documented)
  • auth (worker.web.auth) — 9 params
  • database_pools (worker.web.database_pools) — 5 params

Generated by tasker-ctl docs

Configuration Reference: common

65/65 parameters documented


backoff

Path: common.backoff

Parameter | Type | Default | Description
backoff_multiplier | f64 | 2.0 | Multiplier applied to the previous delay for exponential backoff calculations
default_backoff_seconds | Vec<u32> | [1, 5, 15, 30, 60] | Sequence of backoff delays in seconds for successive retry attempts
jitter_enabled | bool | true | Add random jitter to backoff delays to prevent thundering herd on retry
jitter_max_percentage | f64 | 0.15 | Maximum jitter as a fraction of the computed backoff delay
max_backoff_seconds | u32 | 3600 | Hard upper limit on any single backoff delay

common.backoff.backoff_multiplier

Multiplier applied to the previous delay for exponential backoff calculations

  • Type: f64
  • Default: 2.0
  • Valid Range: 1.0-10.0
  • System Impact: Controls how aggressively delays grow; 2.0 means each delay is double the previous

common.backoff.default_backoff_seconds

Sequence of backoff delays in seconds for successive retry attempts

  • Type: Vec<u32>
  • Default: [1, 5, 15, 30, 60]
  • Valid Range: non-empty array of positive integers
  • System Impact: Defines the retry cadence; after exhausting the array, the last value is reused up to max_backoff_seconds

common.backoff.jitter_enabled

Add random jitter to backoff delays to prevent thundering herd on retry

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, backoff delays are randomized within jitter_max_percentage to spread retries across time

common.backoff.jitter_max_percentage

Maximum jitter as a fraction of the computed backoff delay

  • Type: f64
  • Default: 0.15
  • Valid Range: 0.0-1.0
  • System Impact: A value of 0.15 means delays vary by up to +/-15% of the base delay

common.backoff.max_backoff_seconds

Hard upper limit on any single backoff delay

  • Type: u32
  • Default: 3600
  • Valid Range: 1-3600
  • System Impact: Caps exponential backoff growth to prevent excessively long delays between retries
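
To illustrate how these settings combine (a sketch of the documented behavior, not the engine's implementation): the attempt index selects a delay from default_backoff_seconds, the last value is reused once the array is exhausted, the result is capped by max_backoff_seconds, and jitter perturbs it by up to jitter_max_percentage. backoff_multiplier governs multiplicative growth when delays are computed exponentially and is omitted here:

#![allow(unused)]
fn main() {
fn backoff_delay_seconds(attempt: usize) -> f64 {
    let default_backoff_seconds = [1u32, 5, 15, 30, 60];
    let max_backoff_seconds = 3600u32;
    let jitter_enabled = true;
    let jitter_max_percentage = 0.15;

    // Use the configured sequence, reusing the last value once exhausted.
    let base = *default_backoff_seconds
        .get(attempt)
        .unwrap_or(default_backoff_seconds.last().unwrap());
    let capped = base.min(max_backoff_seconds) as f64;

    if jitter_enabled {
        // In the real system the jitter fraction is random within
        // [-jitter_max_percentage, +jitter_max_percentage]; fixed here to
        // keep the sketch dependency-free.
        let jitter_fraction = jitter_max_percentage / 2.0;
        capped * (1.0 + jitter_fraction)
    } else {
        capped
    }
}
}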

cache

Path: common.cache

Parameter | Type | Default | Description
analytics_ttl_seconds | u32 | 60 | Time-to-live in seconds for cached analytics and metrics data
backend | String | "redis" | Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)
default_ttl_seconds | u32 | 3600 | Default time-to-live in seconds for cached entries
enabled | bool | false | Enable the distributed cache layer for template and analytics data
template_ttl_seconds | u32 | 3600 | Time-to-live in seconds for cached task template definitions

common.cache.analytics_ttl_seconds

Time-to-live in seconds for cached analytics and metrics data

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Analytics data is write-heavy and changes frequently; short TTL (60s) keeps metrics current

common.cache.backend

Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)

  • Type: String
  • Default: "redis"
  • Valid Range: redis | moka
  • System Impact: Redis is required for multi-instance deployments to avoid stale data; moka is suitable for single-instance or DoS protection

common.cache.default_ttl_seconds

Default time-to-live in seconds for cached entries

  • Type: u32
  • Default: 3600
  • Valid Range: 1-86400
  • System Impact: Controls how long cached data remains valid before being re-fetched from the database

common.cache.enabled

Enable the distributed cache layer for template and analytics data

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all cache reads fall through to direct database queries; no cache dependency required

common.cache.template_ttl_seconds

Time-to-live in seconds for cached task template definitions

  • Type: u32
  • Default: 3600
  • Valid Range: 1-86400
  • System Impact: Template changes take up to this long to propagate; shorter values increase DB load, longer values improve performance

moka

Path: common.cache.moka

Parameter | Type | Default | Description
max_capacity | u64 | 10000 | Maximum number of entries the in-process Moka cache can hold

common.cache.moka.max_capacity

Maximum number of entries the in-process Moka cache can hold

  • Type: u64
  • Default: 10000
  • Valid Range: 1-1000000
  • System Impact: Bounds memory usage; least-recently-used entries are evicted when capacity is reached

redis

Path: common.cache.redis

Parameter | Type | Default | Description
connection_timeout_seconds | u32 | 5 | Maximum time to wait when establishing a new Redis connection
database | u32 | 0 | Redis database number (0-15)
max_connections | u32 | 10 | Maximum number of connections in the Redis connection pool
url | String | "${REDIS_URL:-redis://localhost:6379}" | Redis connection URL

common.cache.redis.connection_timeout_seconds

Maximum time to wait when establishing a new Redis connection

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Connections that cannot be established within this timeout fail; cache falls back to database

common.cache.redis.database

Redis database number (0-15)

  • Type: u32
  • Default: 0
  • Valid Range: 0-15
  • System Impact: Isolates Tasker cache keys from other applications sharing the same Redis instance

common.cache.redis.max_connections

Maximum number of connections in the Redis connection pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-500
  • System Impact: Bounds concurrent Redis operations; increase for high cache throughput workloads

common.cache.redis.url

Redis connection URL

  • Type: String
  • Default: "${REDIS_URL:-redis://localhost:6379}"
  • Valid Range: valid Redis URI
  • System Impact: Must be reachable when cache is enabled with redis backend

circuit_breakers

Path: common.circuit_breakers

component_configs

Path: common.circuit_breakers.component_configs

cache

Path: common.circuit_breakers.component_configs.cache

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the cache circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the cache breaker

common.circuit_breakers.component_configs.cache.failure_threshold

Failures before the cache circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects Redis/Dragonfly operations; when tripped, cache reads fall through to database

common.circuit_breakers.component_configs.cache.success_threshold

Successes in Half-Open required to close the cache breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Low threshold (2) for fast recovery since cache failures gracefully degrade to database

messaging

Path: common.circuit_breakers.component_configs.messaging

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the messaging circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the messaging breaker

common.circuit_breakers.component_configs.messaging.failure_threshold

Failures before the messaging circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects the messaging layer (PGMQ or RabbitMQ); when tripped, queue send/receive operations are short-circuited

common.circuit_breakers.component_configs.messaging.success_threshold

Successes in Half-Open required to close the messaging breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Lower threshold (2) allows faster recovery since messaging failures are typically transient

task_readiness

Path: common.circuit_breakers.component_configs.task_readiness

Parameter | Type | Default | Description
failure_threshold | u32 | 10 | Failures before the task readiness circuit breaker trips to Open
success_threshold | u32 | 3 | Successes in Half-Open required to close the task readiness breaker

common.circuit_breakers.component_configs.task_readiness.failure_threshold

Failures before the task readiness circuit breaker trips to Open

  • Type: u32
  • Default: 10
  • Valid Range: 1-100
  • System Impact: Higher than default (10 vs 5) because task readiness queries are frequent and transient failures are expected

common.circuit_breakers.component_configs.task_readiness.success_threshold

Successes in Half-Open required to close the task readiness breaker

  • Type: u32
  • Default: 3
  • Valid Range: 1-100
  • System Impact: Slightly higher than default (3) for extra confidence before resuming readiness queries

web

Path: common.circuit_breakers.component_configs.web

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the web/API database circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the web database breaker

common.circuit_breakers.component_configs.web.failure_threshold

Failures before the web/API database circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects API database operations; when tripped, API requests receive fast 503 errors instead of waiting for timeouts

common.circuit_breakers.component_configs.web.success_threshold

Successes in Half-Open required to close the web database breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Standard threshold (2) provides confidence in recovery before restoring full API traffic

default_config

Path: common.circuit_breakers.default_config

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Number of consecutive failures before a circuit breaker trips to the Open state
success_threshold | u32 | 2 | Number of consecutive successes in Half-Open state required to close the circuit breaker
timeout_seconds | u32 | 30 | Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests

common.circuit_breakers.default_config.failure_threshold

Number of consecutive failures before a circuit breaker trips to the Open state

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Lower values make the breaker more sensitive; higher values tolerate more transient failures before tripping

common.circuit_breakers.default_config.success_threshold

Number of consecutive successes in Half-Open state required to close the circuit breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Higher values require more proof of recovery before restoring full traffic

common.circuit_breakers.default_config.timeout_seconds

Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: Controls recovery speed; shorter timeouts attempt recovery sooner but risk repeated failures
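
These three values map onto a conventional breaker state machine; the following is a minimal sketch of those semantics, not Tasker's internal implementation:

#![allow(unused)]
fn main() {
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum BreakerState {
    Closed { consecutive_failures: u32 },
    Open { since: Instant },
    HalfOpen { consecutive_successes: u32 },
}

struct Breaker {
    state: BreakerState,
    failure_threshold: u32, // default 5
    success_threshold: u32, // default 2
    timeout: Duration,      // default 30s (timeout_seconds)
}

impl Breaker {
    fn on_failure(&mut self) {
        self.state = match self.state {
            BreakerState::Closed { consecutive_failures } => {
                let failures = consecutive_failures + 1;
                if failures >= self.failure_threshold {
                    BreakerState::Open { since: Instant::now() }
                } else {
                    BreakerState::Closed { consecutive_failures: failures }
                }
            }
            // Any failure while probing re-opens the breaker.
            BreakerState::HalfOpen { .. } => BreakerState::Open { since: Instant::now() },
            BreakerState::Open { since } => BreakerState::Open { since },
        };
    }

    fn on_success(&mut self) {
        if let BreakerState::HalfOpen { consecutive_successes } = self.state {
            let successes = consecutive_successes + 1;
            self.state = if successes >= self.success_threshold {
                BreakerState::Closed { consecutive_failures: 0 }
            } else {
                BreakerState::HalfOpen { consecutive_successes: successes }
            };
        }
    }

    fn allow_request(&mut self) -> bool {
        match self.state {
            BreakerState::Closed { .. } | BreakerState::HalfOpen { .. } => true,
            BreakerState::Open { since } => {
                if since.elapsed() >= self.timeout {
                    // timeout_seconds elapsed: allow probe traffic in Half-Open.
                    self.state = BreakerState::HalfOpen { consecutive_successes: 0 };
                    true
                } else {
                    false
                }
            }
        }
    }
}
}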

global_settings

Path: common.circuit_breakers.global_settings

Parameter | Type | Default | Description
metrics_collection_interval_seconds | u32 | 30 | Interval in seconds between circuit breaker metrics collection sweeps
min_state_transition_interval_seconds | f64 | 5.0 | Minimum time in seconds between circuit breaker state transitions

common.circuit_breakers.global_settings.metrics_collection_interval_seconds

Interval in seconds between circuit breaker metrics collection sweeps

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently circuit breaker state, failure counts, and transition counts are collected for observability

common.circuit_breakers.global_settings.min_state_transition_interval_seconds

Minimum time in seconds between circuit breaker state transitions

  • Type: f64
  • Default: 5.0
  • Valid Range: 0.0-60.0
  • System Impact: Prevents rapid oscillation between Open and Closed states during intermittent failures

database

Path: common.database

Parameter | Type | Default | Description
url | String | "${DATABASE_URL:-postgresql://localhost/tasker}" | PostgreSQL connection URL for the primary database

common.database.url

PostgreSQL connection URL for the primary database

  • Type: String
  • Default: "${DATABASE_URL:-postgresql://localhost/tasker}"
  • Valid Range: valid PostgreSQL connection URI
  • System Impact: All task, step, and workflow state is stored here; must be reachable at startup

Environment Recommendations:

Environment | Value | Rationale
development | postgresql://localhost/tasker | Local default, no auth
production | ${DATABASE_URL} | Always use env var injection for secrets rotation
test | postgresql://tasker:tasker@localhost:5432/tasker_rust_test | Isolated test database with known credentials

Related: common.database.pool.max_connections, common.pgmq_database.url

pool

Path: common.database.pool

Parameter | Type | Default | Description
acquire_timeout_seconds | u32 | 10 | Maximum time to wait when acquiring a connection from the pool
idle_timeout_seconds | u32 | 300 | Time before an idle connection is closed and removed from the pool
max_connections | u32 | 25 | Maximum number of concurrent database connections in the pool
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a connection before it is closed and replaced
min_connections | u32 | 5 | Minimum number of idle connections maintained in the pool
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which connection acquisition is logged as slow

common.database.pool.acquire_timeout_seconds

Maximum time to wait when acquiring a connection from the pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-300
  • System Impact: Queries fail with a timeout error if no connection is available within this window

common.database.pool.idle_timeout_seconds

Time before an idle connection is closed and removed from the pool

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the pool shrinks back to min_connections after load drops

common.database.pool.max_connections

Maximum number of concurrent database connections in the pool

  • Type: u32
  • Default: 25
  • Valid Range: 1-1000
  • System Impact: Controls database connection concurrency; too few causes query queuing under load, too many risks DB resource exhaustion

Environment Recommendations:

Environment | Value | Rationale
development | 10-25 | Small pool for local development
production | 30-50 | Scale based on worker count and concurrent task volume
test | 10-30 | Moderate pool; cluster tests may run 10 services sharing the same DB

Related: common.database.pool.min_connections, common.database.pool.acquire_timeout_seconds

common.database.pool.max_lifetime_seconds

Maximum total lifetime of a connection before it is closed and replaced

  • Type: u32
  • Default: 1800
  • Valid Range: 60-86400
  • System Impact: Prevents connection drift from server-side config changes or memory leaks in long-lived connections

common.database.pool.min_connections

Minimum number of idle connections maintained in the pool

  • Type: u32
  • Default: 5
  • Valid Range: 0-100
  • System Impact: Keeps connections warm to avoid cold-start latency on first queries after idle periods

common.database.pool.slow_acquire_threshold_ms

Threshold in milliseconds above which connection acquisition is logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-60000
  • System Impact: Observability: slow acquire warnings indicate pool pressure or network issues

execution

Path: common.execution

Parameter | Type | Default | Description
environment | String | "development" | Runtime environment identifier used for configuration context selection and logging
step_enqueue_batch_size | u32 | 50 | Number of steps to enqueue in a single batch during task initialization

common.execution.environment

Runtime environment identifier used for configuration context selection and logging

  • Type: String
  • Default: "development"
  • Valid Range: test | development | production
  • System Impact: Affects log levels, default tuning, and environment-specific behavior throughout the system

common.execution.step_enqueue_batch_size

Number of steps to enqueue in a single batch during task initialization

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Controls step enqueueing throughput; larger batches reduce round trips but increase per-batch latency

mpsc_channels

Path: common.mpsc_channels

event_publisher

Path: common.mpsc_channels.event_publisher

Parameter | Type | Default | Description
event_queue_buffer_size | usize | 5000 | Bounded channel capacity for the event publisher MPSC channel

common.mpsc_channels.event_publisher.event_queue_buffer_size

Bounded channel capacity for the event publisher MPSC channel

  • Type: usize
  • Default: 5000
  • Valid Range: 100-100000
  • System Impact: Controls backpressure for domain event publishing; smaller buffers apply backpressure sooner

ffi

Path: common.mpsc_channels.ffi

Parameter | Type | Default | Description
ruby_event_buffer_size | usize | 1000 | Bounded channel capacity for Ruby FFI event delivery

common.mpsc_channels.ffi.ruby_event_buffer_size

Bounded channel capacity for Ruby FFI event delivery

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers events between the Rust runtime and Ruby FFI layer; overflow triggers backpressure on the dispatch side

overflow_policy

Path: common.mpsc_channels.overflow_policy

Parameter | Type | Default | Description
log_warning_threshold | f64 | 0.8 | Channel saturation fraction at which warning logs are emitted

common.mpsc_channels.overflow_policy.log_warning_threshold

Channel saturation fraction at which warning logs are emitted

  • Type: f64
  • Default: 0.8
  • Valid Range: 0.0-1.0
  • System Impact: A value of 0.8 means warnings fire when any channel reaches 80% capacity

metrics

Path: common.mpsc_channels.overflow_policy.metrics

Parameter | Type | Default | Description
saturation_check_interval_seconds | u32 | 30 | Interval in seconds between channel saturation metric samples

common.mpsc_channels.overflow_policy.metrics.saturation_check_interval_seconds

Interval in seconds between channel saturation metric samples

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Lower intervals give finer-grained capacity visibility but add sampling overhead

pgmq_database

Path: common.pgmq_database

Parameter | Type | Default | Description
enabled | bool | true | Enable PGMQ messaging subsystem
url | String | "${PGMQ_DATABASE_URL:-}" | PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database

common.pgmq_database.enabled

Enable PGMQ messaging subsystem

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, PGMQ queue operations are disabled; only useful if using RabbitMQ as the sole messaging backend

common.pgmq_database.url

PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database

  • Type: String
  • Default: "${PGMQ_DATABASE_URL:-}"
  • Valid Range: valid PostgreSQL connection URI or empty string
  • System Impact: Separating PGMQ to its own database isolates messaging I/O from task state queries, reducing contention under heavy load

Related: common.database.url, common.pgmq_database.enabled

pool

Path: common.pgmq_database.pool

Parameter | Type | Default | Description
acquire_timeout_seconds | u32 | 5 | Maximum time to wait when acquiring a connection from the PGMQ pool
idle_timeout_seconds | u32 | 300 | Time before an idle PGMQ connection is closed and removed from the pool
max_connections | u32 | 15 | Maximum number of concurrent connections in the PGMQ database pool
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a PGMQ database connection before replacement
min_connections | u32 | 3 | Minimum idle connections maintained in the PGMQ database pool
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which PGMQ pool acquisition is logged as slow

common.pgmq_database.pool.acquire_timeout_seconds

Maximum time to wait when acquiring a connection from the PGMQ pool

  • Type: u32
  • Default: 5
  • Valid Range: 1-300
  • System Impact: Queue operations fail with timeout if no PGMQ connection is available within this window

common.pgmq_database.pool.idle_timeout_seconds

Time before an idle PGMQ connection is closed and removed from the pool

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the PGMQ pool shrinks after messaging load drops

common.pgmq_database.pool.max_connections

Maximum number of concurrent connections in the PGMQ database pool

  • Type: u32
  • Default: 15
  • Valid Range: 1-500
  • System Impact: Separate from the main database pool; size according to messaging throughput requirements

common.pgmq_database.pool.max_lifetime_seconds

Maximum total lifetime of a PGMQ database connection before replacement

  • Type: u32
  • Default: 1800
  • Valid Range: 60-86400
  • System Impact: Prevents connection drift in long-running PGMQ connections

common.pgmq_database.pool.min_connections

Minimum idle connections maintained in the PGMQ database pool

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Keeps PGMQ connections warm to avoid cold-start latency on queue operations

common.pgmq_database.pool.slow_acquire_threshold_ms

Threshold in milliseconds above which PGMQ pool acquisition is logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-60000
  • System Impact: Observability: slow PGMQ acquire warnings indicate messaging pool pressure

queues

Path: common.queues

Parameter | Type | Default | Description
backend | String | "${TASKER_MESSAGING_BACKEND:-pgmq}" | Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)
default_visibility_timeout_seconds | u32 | 30 | Default time a dequeued message remains invisible to other consumers
naming_pattern | String | "{namespace}_{name}_queue" | Template pattern for constructing queue names from namespace and name
orchestration_namespace | String | "orchestration" | Namespace prefix for orchestration queue names
worker_namespace | String | "worker" | Namespace prefix for worker queue names

common.queues.backend

Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)

  • Type: String
  • Default: "${TASKER_MESSAGING_BACKEND:-pgmq}"
  • Valid Range: pgmq | rabbitmq
  • System Impact: Determines the entire message transport layer; pgmq requires only PostgreSQL, rabbitmq requires a separate AMQP broker

Environment Recommendations:

Environment | Value | Rationale
production | pgmq or rabbitmq | pgmq for simplicity, rabbitmq for high-throughput push semantics
test | pgmq | Single-dependency setup, simpler CI

Related: common.queues.pgmq, common.queues.rabbitmq

common.queues.default_visibility_timeout_seconds

Default time a dequeued message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: If a consumer fails to process a message within this window, the message becomes visible again for retry

common.queues.naming_pattern

Template pattern for constructing queue names from namespace and name

  • Type: String
  • Default: "{namespace}_{name}_queue"
  • Valid Range: string containing {namespace} and {name} placeholders
  • System Impact: Determines the actual PGMQ/RabbitMQ queue names; changing this after deployment requires manual queue migration

common.queues.orchestration_namespace

Namespace prefix for orchestration queue names

  • Type: String
  • Default: "orchestration"
  • Valid Range: non-empty string
  • System Impact: Used in queue naming pattern to isolate orchestration queues from worker queues

common.queues.worker_namespace

Namespace prefix for worker queue names

  • Type: String
  • Default: "worker"
  • Valid Range: non-empty string
  • System Impact: Used in queue naming pattern to isolate worker queues from orchestration queues

orchestration_queues

Path: common.queues.orchestration_queues

Parameter | Type | Default | Description
step_results | String | "orchestration_step_results" | Queue name for step execution results returned by workers
task_finalizations | String | "orchestration_task_finalizations" | Queue name for task finalization messages
task_requests | String | "orchestration_task_requests" | Queue name for incoming task execution requests

common.queues.orchestration_queues.step_results

Queue name for step execution results returned by workers

  • Type: String
  • Default: "orchestration_step_results"
  • Valid Range: valid queue name
  • System Impact: Workers publish step completion results here for the orchestration result processor

common.queues.orchestration_queues.task_finalizations

Queue name for task finalization messages

  • Type: String
  • Default: "orchestration_task_finalizations"
  • Valid Range: valid queue name
  • System Impact: Tasks ready for completion evaluation are enqueued here

common.queues.orchestration_queues.task_requests

Queue name for incoming task execution requests

  • Type: String
  • Default: "orchestration_task_requests"
  • Valid Range: valid queue name
  • System Impact: The orchestration system reads new task requests from this queue

pgmq

Path: common.queues.pgmq

Parameter | Type | Default | Description
poll_interval_ms | u32 | 500 | Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive

common.queues.pgmq.poll_interval_ms

Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive

  • Type: u32
  • Default: 500
  • Valid Range: 10-10000
  • System Impact: Lower values reduce message latency in polling mode but increase database load; in Hybrid mode this is the fallback interval

queue_depth_thresholds

Path: common.queues.pgmq.queue_depth_thresholds

Parameter | Type | Default | Description
critical_threshold | i64 | 5000 | Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions
overflow_threshold | i64 | 10000 | Queue depth indicating an emergency condition requiring manual intervention

common.queues.pgmq.queue_depth_thresholds.critical_threshold

Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions

  • Type: i64
  • Default: 5000
  • Valid Range: 1+
  • System Impact: Backpressure mechanism: rejects new work to allow the system to drain existing messages

common.queues.pgmq.queue_depth_thresholds.overflow_threshold

Queue depth indicating an emergency condition requiring manual intervention

  • Type: i64
  • Default: 10000
  • Valid Range: 1+
  • System Impact: Highest severity threshold; triggers error-level logging and metrics for operational alerting

rabbitmq

Path: common.queues.rabbitmq

Parameter | Type | Default | Description
heartbeat_seconds | u16 | 30 | AMQP heartbeat interval for connection liveness detection
prefetch_count | u16 | 100 | Number of unacknowledged messages RabbitMQ will deliver before waiting for acks
url | String | "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}" | AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’

common.queues.rabbitmq.heartbeat_seconds

AMQP heartbeat interval for connection liveness detection

  • Type: u16
  • Default: 30
  • Valid Range: 0-3600
  • System Impact: Detects dead connections; 0 disables heartbeats (not recommended in production)

common.queues.rabbitmq.prefetch_count

Number of unacknowledged messages RabbitMQ will deliver before waiting for acks

  • Type: u16
  • Default: 100
  • Valid Range: 1-65535
  • System Impact: Controls consumer throughput vs. memory usage; higher values increase throughput but buffer more messages in-process

common.queues.rabbitmq.url

AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’

  • Type: String
  • Default: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
  • Valid Range: valid AMQP URI
  • System Impact: Only used when queues.backend = ‘rabbitmq’; must be reachable at startup

system

Path: common.system

Parameter | Type | Default | Description
default_dependent_system | String | "default" | Default system name assigned to tasks that do not specify a dependent system

common.system.default_dependent_system

Default system name assigned to tasks that do not specify a dependent system

  • Type: String
  • Default: "default"
  • Valid Range: non-empty string
  • System Impact: Groups tasks for routing and reporting; most single-system deployments can leave this as default

task_templates

Path: common.task_templates

Parameter | Type | Default | Description
search_paths | Vec<String> | ["config/tasks/**/*.{yml,yaml}"] | Glob patterns for discovering task template YAML files

common.task_templates.search_paths

Glob patterns for discovering task template YAML files

  • Type: Vec<String>
  • Default: ["config/tasks/**/*.{yml,yaml}"]
  • Valid Range: valid glob patterns
  • System Impact: Templates matching these patterns are loaded at startup for task definition discovery

Generated by tasker-ctl docsTasker Configuration System

Configuration Reference: common

65/65 parameters documented


backoff

Path: common.backoff

Parameter | Type | Default | Description
backoff_multiplier | f64 | 2.0 | Multiplier applied to the previous delay for exponential backoff calculations
default_backoff_seconds | Vec<u32> | [1, 5, 15, 30, 60] | Sequence of backoff delays in seconds for successive retry attempts
jitter_enabled | bool | true | Add random jitter to backoff delays to prevent thundering herd on retry
jitter_max_percentage | f64 | 0.15 | Maximum jitter as a fraction of the computed backoff delay
max_backoff_seconds | u32 | 3600 | Hard upper limit on any single backoff delay

common.backoff.backoff_multiplier

Multiplier applied to the previous delay for exponential backoff calculations

  • Type: f64
  • Default: 2.0
  • Valid Range: 1.0-10.0
  • System Impact: Controls how aggressively delays grow; 2.0 means each delay is double the previous

common.backoff.default_backoff_seconds

Sequence of backoff delays in seconds for successive retry attempts

  • Type: Vec<u32>
  • Default: [1, 5, 15, 30, 60]
  • Valid Range: non-empty array of positive integers
  • System Impact: Defines the retry cadence; after exhausting the array, the last value is reused up to max_backoff_seconds

common.backoff.jitter_enabled

Add random jitter to backoff delays to prevent thundering herd on retry

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, backoff delays are randomized within jitter_max_percentage to spread retries across time

common.backoff.jitter_max_percentage

Maximum jitter as a fraction of the computed backoff delay

  • Type: f64
  • Default: 0.15
  • Valid Range: 0.0-1.0
  • System Impact: A value of 0.15 means delays vary by up to +/-15% of the base delay

common.backoff.max_backoff_seconds

Hard upper limit on any single backoff delay

  • Type: u32
  • Default: 3600
  • Valid Range: 1-3600
  • System Impact: Caps exponential backoff growth to prevent excessively long delays between retries
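
To make the interaction of these parameters concrete, here is a minimal sketch (not the engine's actual implementation) of how a retry delay could be derived from default_backoff_seconds, jitter_max_percentage, and max_backoff_seconds. The function name and the caller-supplied jitter_fraction are illustrative assumptions; the real system draws its own random jitter.

```rust
/// Illustrative only: the delay before retry attempt `attempt` (0-based), assuming
/// the configured sequence is reused at its last value, jittered, and then capped.
fn backoff_delay_secs(
    default_backoff_seconds: &[u32], // e.g. [1, 5, 15, 30, 60]
    max_backoff_seconds: u32,        // e.g. 3600
    jitter_max_percentage: f64,      // e.g. 0.15
    jitter_fraction: f64,            // random value in [-1.0, 1.0] supplied by the caller
    attempt: usize,
) -> f64 {
    // Use the attempt's entry, reusing the last value once the sequence is exhausted.
    let idx = attempt.min(default_backoff_seconds.len() - 1);
    let base = default_backoff_seconds[idx].min(max_backoff_seconds) as f64;
    // Vary the delay by up to +/- jitter_max_percentage of the base delay.
    let jittered = base * (1.0 + jitter_max_percentage * jitter_fraction.clamp(-1.0, 1.0));
    jittered.clamp(0.0, max_backoff_seconds as f64)
}

fn main() {
    let seq = [1, 5, 15, 30, 60];
    // Attempt 7 reuses the last entry (60s); +10% jitter yields roughly 66s.
    println!("{:.1}", backoff_delay_secs(&seq, 3600, 0.15, 0.666, 7));
}
```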

cache

Path: common.cache

Parameter | Type | Default | Description
analytics_ttl_seconds | u32 | 60 | Time-to-live in seconds for cached analytics and metrics data
backend | String | "redis" | Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)
default_ttl_seconds | u32 | 3600 | Default time-to-live in seconds for cached entries
enabled | bool | false | Enable the distributed cache layer for template and analytics data
template_ttl_seconds | u32 | 3600 | Time-to-live in seconds for cached task template definitions

common.cache.analytics_ttl_seconds

Time-to-live in seconds for cached analytics and metrics data

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Analytics data is write-heavy and changes frequently; short TTL (60s) keeps metrics current

common.cache.backend

Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)

  • Type: String
  • Default: "redis"
  • Valid Range: redis | moka
  • System Impact: Redis is required for multi-instance deployments to avoid stale data; moka is suitable for single-instance deployments or for DoS protection

common.cache.default_ttl_seconds

Default time-to-live in seconds for cached entries

  • Type: u32
  • Default: 3600
  • Valid Range: 1-86400
  • System Impact: Controls how long cached data remains valid before being re-fetched from the database

common.cache.enabled

Enable the distributed cache layer for template and analytics data

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all cache reads fall through to direct database queries; no cache dependency required

common.cache.template_ttl_seconds

Time-to-live in seconds for cached task template definitions

  • Type: u32
  • Default: 3600
  • Valid Range: 1-86400
  • System Impact: Template changes take up to this long to propagate; shorter values increase DB load, longer values improve performance

moka

Path: common.cache.moka

Parameter | Type | Default | Description
max_capacity | u64 | 10000 | Maximum number of entries the in-process Moka cache can hold

common.cache.moka.max_capacity

Maximum number of entries the in-process Moka cache can hold

  • Type: u64
  • Default: 10000
  • Valid Range: 1-1000000
  • System Impact: Bounds memory usage; least-recently-used entries are evicted when capacity is reached

redis

Path: common.cache.redis

Parameter | Type | Default | Description
connection_timeout_seconds | u32 | 5 | Maximum time to wait when establishing a new Redis connection
database | u32 | 0 | Redis database number (0-15)
max_connections | u32 | 10 | Maximum number of connections in the Redis connection pool
url | String | "${REDIS_URL:-redis://localhost:6379}" | Redis connection URL

common.cache.redis.connection_timeout_seconds

Maximum time to wait when establishing a new Redis connection

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Connections that cannot be established within this timeout fail; cache falls back to database

common.cache.redis.database

Redis database number (0-15)

  • Type: u32
  • Default: 0
  • Valid Range: 0-15
  • System Impact: Isolates Tasker cache keys from other applications sharing the same Redis instance

common.cache.redis.max_connections

Maximum number of connections in the Redis connection pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-500
  • System Impact: Bounds concurrent Redis operations; increase for high cache throughput workloads

common.cache.redis.url

Redis connection URL

  • Type: String
  • Default: "${REDIS_URL:-redis://localhost:6379}"
  • Valid Range: valid Redis URI
  • System Impact: Must be reachable when cache is enabled with redis backend

circuit_breakers

Path: common.circuit_breakers

component_configs

Path: common.circuit_breakers.component_configs

cache

Path: common.circuit_breakers.component_configs.cache

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the cache circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the cache breaker

common.circuit_breakers.component_configs.cache.failure_threshold

Failures before the cache circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects Redis/Dragonfly operations; when tripped, cache reads fall through to database

common.circuit_breakers.component_configs.cache.success_threshold

Successes in Half-Open required to close the cache breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Low threshold (2) for fast recovery since cache failures gracefully degrade to database

messaging

Path: common.circuit_breakers.component_configs.messaging

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the messaging circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the messaging breaker

common.circuit_breakers.component_configs.messaging.failure_threshold

Failures before the messaging circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects the messaging layer (PGMQ or RabbitMQ); when tripped, queue send/receive operations are short-circuited

common.circuit_breakers.component_configs.messaging.success_threshold

Successes in Half-Open required to close the messaging breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Lower threshold (2) allows faster recovery since messaging failures are typically transient

task_readiness

Path: common.circuit_breakers.component_configs.task_readiness

Parameter | Type | Default | Description
failure_threshold | u32 | 10 | Failures before the task readiness circuit breaker trips to Open
success_threshold | u32 | 3 | Successes in Half-Open required to close the task readiness breaker

common.circuit_breakers.component_configs.task_readiness.failure_threshold

Failures before the task readiness circuit breaker trips to Open

  • Type: u32
  • Default: 10
  • Valid Range: 1-100
  • System Impact: Higher than default (10 vs 5) because task readiness queries are frequent and transient failures are expected

common.circuit_breakers.component_configs.task_readiness.success_threshold

Successes in Half-Open required to close the task readiness breaker

  • Type: u32
  • Default: 3
  • Valid Range: 1-100
  • System Impact: Slightly higher than default (3) for extra confidence before resuming readiness queries

web

Path: common.circuit_breakers.component_configs.web

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the web/API database circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the web database breaker

common.circuit_breakers.component_configs.web.failure_threshold

Failures before the web/API database circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects API database operations; when tripped, API requests receive fast 503 errors instead of waiting for timeouts

common.circuit_breakers.component_configs.web.success_threshold

Successes in Half-Open required to close the web database breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Standard threshold (2) provides confidence in recovery before restoring full API traffic

default_config

Path: common.circuit_breakers.default_config

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Number of consecutive failures before a circuit breaker trips to the Open state
success_threshold | u32 | 2 | Number of consecutive successes in Half-Open state required to close the circuit breaker
timeout_seconds | u32 | 30 | Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests

common.circuit_breakers.default_config.failure_threshold

Number of consecutive failures before a circuit breaker trips to the Open state

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Lower values make the breaker more sensitive; higher values tolerate more transient failures before tripping

common.circuit_breakers.default_config.success_threshold

Number of consecutive successes in Half-Open state required to close the circuit breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Higher values require more proof of recovery before restoring full traffic

common.circuit_breakers.default_config.timeout_seconds

Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: Controls recovery speed; shorter timeouts attempt recovery sooner but risk repeated failures
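
As an illustration of how these three defaults interact, the following is a minimal sketch of a Closed/Open/Half-Open breaker driven by failure_threshold, success_threshold, and timeout_seconds. The type and method names are assumptions for the example; this is not the engine's implementation.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

struct Breaker {
    state: State,
    failures: u32,
    successes: u32,
    failure_threshold: u32, // default 5
    success_threshold: u32, // default 2
    timeout: Duration,      // timeout_seconds, default 30s
}

impl Breaker {
    /// Should the protected call be attempted at all?
    fn allow(&mut self) -> bool {
        if let State::Open { since } = self.state {
            // After timeout_seconds, move to Half-Open and let probe requests through.
            if since.elapsed() >= self.timeout {
                self.state = State::HalfOpen;
            }
        }
        !matches!(self.state, State::Open { .. })
    }

    fn record_success(&mut self) {
        self.failures = 0;
        if self.state == State::HalfOpen {
            self.successes += 1;
            // success_threshold consecutive successes close the breaker again.
            if self.successes >= self.success_threshold {
                self.state = State::Closed;
                self.successes = 0;
            }
        }
    }

    fn record_failure(&mut self) {
        self.successes = 0;
        self.failures += 1;
        // failure_threshold consecutive failures (or any failure while Half-Open) trip it Open.
        if self.state == State::HalfOpen || self.failures >= self.failure_threshold {
            self.state = State::Open { since: Instant::now() };
            self.failures = 0;
        }
    }
}

fn main() {
    let mut breaker = Breaker {
        state: State::Closed,
        failures: 0,
        successes: 0,
        failure_threshold: 5,
        success_threshold: 2,
        timeout: Duration::from_secs(30),
    };
    for _ in 0..5 {
        breaker.record_failure(); // the fifth consecutive failure trips the breaker
    }
    assert!(!breaker.allow()); // calls are short-circuited while Open
}
```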

global_settings

Path: common.circuit_breakers.global_settings

Parameter | Type | Default | Description
metrics_collection_interval_seconds | u32 | 30 | Interval in seconds between circuit breaker metrics collection sweeps
min_state_transition_interval_seconds | f64 | 5.0 | Minimum time in seconds between circuit breaker state transitions

common.circuit_breakers.global_settings.metrics_collection_interval_seconds

Interval in seconds between circuit breaker metrics collection sweeps

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently circuit breaker state, failure counts, and transition counts are collected for observability

common.circuit_breakers.global_settings.min_state_transition_interval_seconds

Minimum time in seconds between circuit breaker state transitions

  • Type: f64
  • Default: 5.0
  • Valid Range: 0.0-60.0
  • System Impact: Prevents rapid oscillation between Open and Closed states during intermittent failures

database

Path: common.database

Parameter | Type | Default | Description
url | String | "${DATABASE_URL:-postgresql://localhost/tasker}" | PostgreSQL connection URL for the primary database

common.database.url

PostgreSQL connection URL for the primary database

  • Type: String
  • Default: "${DATABASE_URL:-postgresql://localhost/tasker}"
  • Valid Range: valid PostgreSQL connection URI
  • System Impact: All task, step, and workflow state is stored here; must be reachable at startup

Environment Recommendations:

Environment | Value | Rationale
development | postgresql://localhost/tasker | Local default, no auth
production | ${DATABASE_URL} | Always use env var injection for secrets rotation
test | postgresql://tasker:tasker@localhost:5432/tasker_rust_test | Isolated test database with known credentials

Related: common.database.pool.max_connections, common.pgmq_database.url
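
The "${DATABASE_URL:-postgresql://localhost/tasker}" default uses shell-style ${VAR:-fallback} substitution. The following sketch illustrates that convention only; it is not Tasker's configuration loader, and the resolve function is an assumed name for the example.

```rust
use std::env;

/// Illustrative resolver for "${VAR:-fallback}" style defaults.
/// Returns the environment variable's value when set and non-empty,
/// otherwise the inline fallback; plain strings pass through unchanged.
fn resolve(template: &str) -> String {
    if let Some(inner) = template.strip_prefix("${").and_then(|s| s.strip_suffix('}')) {
        if let Some((var, fallback)) = inner.split_once(":-") {
            return match env::var(var) {
                Ok(v) if !v.is_empty() => v,
                _ => fallback.to_string(),
            };
        }
    }
    template.to_string()
}

fn main() {
    // Falls back to the inline default unless DATABASE_URL is set in the environment.
    println!("{}", resolve("${DATABASE_URL:-postgresql://localhost/tasker}"));
}
```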

pool

Path: common.database.pool

Parameter | Type | Default | Description
acquire_timeout_seconds | u32 | 10 | Maximum time to wait when acquiring a connection from the pool
idle_timeout_seconds | u32 | 300 | Time before an idle connection is closed and removed from the pool
max_connections | u32 | 25 | Maximum number of concurrent database connections in the pool
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a connection before it is closed and replaced
min_connections | u32 | 5 | Minimum number of idle connections maintained in the pool
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which connection acquisition is logged as slow

common.database.pool.acquire_timeout_seconds

Maximum time to wait when acquiring a connection from the pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-300
  • System Impact: Queries fail with a timeout error if no connection is available within this window

common.database.pool.idle_timeout_seconds

Time before an idle connection is closed and removed from the pool

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the pool shrinks back to min_connections after load drops

common.database.pool.max_connections

Maximum number of concurrent database connections in the pool

  • Type: u32
  • Default: 25
  • Valid Range: 1-1000
  • System Impact: Controls database connection concurrency; too few causes query queuing under load, too many risks DB resource exhaustion

Environment Recommendations:

Environment | Value | Rationale
development | 10-25 | Small pool for local development
production | 30-50 | Scale based on worker count and concurrent task volume
test | 10-30 | Moderate pool; cluster tests may run 10 services sharing the same DB

Related: common.database.pool.min_connections, common.database.pool.acquire_timeout_seconds

common.database.pool.max_lifetime_seconds

Maximum total lifetime of a connection before it is closed and replaced

  • Type: u32
  • Default: 1800
  • Valid Range: 60-86400
  • System Impact: Prevents connection drift from server-side config changes or memory leaks in long-lived connections

common.database.pool.min_connections

Minimum number of idle connections maintained in the pool

  • Type: u32
  • Default: 5
  • Valid Range: 0-100
  • System Impact: Keeps connections warm to avoid cold-start latency on first queries after idle periods

common.database.pool.slow_acquire_threshold_ms

Threshold in milliseconds above which connection acquisition is logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-60000
  • System Impact: Observability: slow acquire warnings indicate pool pressure or network issues

execution

Path: common.execution

Parameter | Type | Default | Description
environment | String | "development" | Runtime environment identifier used for configuration context selection and logging
step_enqueue_batch_size | u32 | 50 | Number of steps to enqueue in a single batch during task initialization

common.execution.environment

Runtime environment identifier used for configuration context selection and logging

  • Type: String
  • Default: "development"
  • Valid Range: test | development | production
  • System Impact: Affects log levels, default tuning, and environment-specific behavior throughout the system

common.execution.step_enqueue_batch_size

Number of steps to enqueue in a single batch during task initialization

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Controls step enqueueing throughput; larger batches reduce round trips but increase per-batch latency

mpsc_channels

Path: common.mpsc_channels

event_publisher

Path: common.mpsc_channels.event_publisher

Parameter | Type | Default | Description
event_queue_buffer_size | usize | 5000 | Bounded channel capacity for the event publisher MPSC channel

common.mpsc_channels.event_publisher.event_queue_buffer_size

Bounded channel capacity for the event publisher MPSC channel

  • Type: usize
  • Default: 5000
  • Valid Range: 100-100000
  • System Impact: Controls backpressure for domain event publishing; smaller buffers apply backpressure sooner

ffi

Path: common.mpsc_channels.ffi

Parameter | Type | Default | Description
ruby_event_buffer_size | usize | 1000 | Bounded channel capacity for Ruby FFI event delivery

common.mpsc_channels.ffi.ruby_event_buffer_size

Bounded channel capacity for Ruby FFI event delivery

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers events between the Rust runtime and Ruby FFI layer; overflow triggers backpressure on the dispatch side

overflow_policy

Path: common.mpsc_channels.overflow_policy

Parameter | Type | Default | Description
log_warning_threshold | f64 | 0.8 | Channel saturation fraction at which warning logs are emitted

common.mpsc_channels.overflow_policy.log_warning_threshold

Channel saturation fraction at which warning logs are emitted

  • Type: f64
  • Default: 0.8
  • Valid Range: 0.0-1.0
  • System Impact: A value of 0.8 means warnings fire when any channel reaches 80% capacity
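
A minimal sketch of how a saturation check against log_warning_threshold could work. The function name and log wording are illustrative, not the engine's internals.

```rust
/// Illustrative saturation check for a bounded channel.
fn warn_if_saturated(queued: usize, capacity: usize, log_warning_threshold: f64) {
    let saturation = queued as f64 / capacity as f64;
    if saturation >= log_warning_threshold {
        eprintln!(
            "channel at {:.0}% of capacity ({queued}/{capacity})",
            saturation * 100.0
        );
    }
}

fn main() {
    // With the default threshold of 0.8, 4000 queued items in a 5000-slot buffer warns.
    warn_if_saturated(4_000, 5_000, 0.8);
}
```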

metrics

Path: common.mpsc_channels.overflow_policy.metrics

Parameter | Type | Default | Description
saturation_check_interval_seconds | u32 | 30 | Interval in seconds between channel saturation metric samples

common.mpsc_channels.overflow_policy.metrics.saturation_check_interval_seconds

Interval in seconds between channel saturation metric samples

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Lower intervals give finer-grained capacity visibility but add sampling overhead

pgmq_database

Path: common.pgmq_database

Parameter | Type | Default | Description
enabled | bool | true | Enable PGMQ messaging subsystem
url | String | "${PGMQ_DATABASE_URL:-}" | PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database

common.pgmq_database.enabled

Enable PGMQ messaging subsystem

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, PGMQ queue operations are disabled; only useful if using RabbitMQ as the sole messaging backend

common.pgmq_database.url

PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database

  • Type: String
  • Default: "${PGMQ_DATABASE_URL:-}"
  • Valid Range: valid PostgreSQL connection URI or empty string
  • System Impact: Separating PGMQ to its own database isolates messaging I/O from task state queries, reducing contention under heavy load

Related: common.database.url, common.pgmq_database.enabled

pool

Path: common.pgmq_database.pool

Parameter | Type | Default | Description
acquire_timeout_seconds | u32 | 5 | Maximum time to wait when acquiring a connection from the PGMQ pool
idle_timeout_seconds | u32 | 300 | Time before an idle PGMQ connection is closed and removed from the pool
max_connections | u32 | 15 | Maximum number of concurrent connections in the PGMQ database pool
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a PGMQ database connection before replacement
min_connections | u32 | 3 | Minimum idle connections maintained in the PGMQ database pool
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which PGMQ pool acquisition is logged as slow

common.pgmq_database.pool.acquire_timeout_seconds

Maximum time to wait when acquiring a connection from the PGMQ pool

  • Type: u32
  • Default: 5
  • Valid Range: 1-300
  • System Impact: Queue operations fail with timeout if no PGMQ connection is available within this window

common.pgmq_database.pool.idle_timeout_seconds

Time before an idle PGMQ connection is closed and removed from the pool

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the PGMQ pool shrinks after messaging load drops

common.pgmq_database.pool.max_connections

Maximum number of concurrent connections in the PGMQ database pool

  • Type: u32
  • Default: 15
  • Valid Range: 1-500
  • System Impact: Separate from the main database pool; size according to messaging throughput requirements

common.pgmq_database.pool.max_lifetime_seconds

Maximum total lifetime of a PGMQ database connection before replacement

  • Type: u32
  • Default: 1800
  • Valid Range: 60-86400
  • System Impact: Prevents connection drift in long-running PGMQ connections

common.pgmq_database.pool.min_connections

Minimum idle connections maintained in the PGMQ database pool

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Keeps PGMQ connections warm to avoid cold-start latency on queue operations

common.pgmq_database.pool.slow_acquire_threshold_ms

Threshold in milliseconds above which PGMQ pool acquisition is logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-60000
  • System Impact: Observability: slow PGMQ acquire warnings indicate messaging pool pressure

queues

Path: common.queues

Parameter | Type | Default | Description
backend | String | "${TASKER_MESSAGING_BACKEND:-pgmq}" | Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)
default_visibility_timeout_seconds | u32 | 30 | Default time a dequeued message remains invisible to other consumers
naming_pattern | String | "{namespace}_{name}_queue" | Template pattern for constructing queue names from namespace and name
orchestration_namespace | String | "orchestration" | Namespace prefix for orchestration queue names
worker_namespace | String | "worker" | Namespace prefix for worker queue names

common.queues.backend

Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)

  • Type: String
  • Default: "${TASKER_MESSAGING_BACKEND:-pgmq}"
  • Valid Range: pgmq | rabbitmq
  • System Impact: Determines the entire message transport layer; pgmq requires only PostgreSQL, rabbitmq requires a separate AMQP broker

Environment Recommendations:

Environment | Value | Rationale
production | pgmq or rabbitmq | pgmq for simplicity, rabbitmq for high-throughput push semantics
test | pgmq | Single-dependency setup, simpler CI

Related: common.queues.pgmq, common.queues.rabbitmq

common.queues.default_visibility_timeout_seconds

Default time a dequeued message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: If a consumer fails to process a message within this window, the message becomes visible again for retry
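
The redelivery rule can be stated in a few lines. The sketch below is illustrative only; the function name and ack flag are assumptions for the example.

```rust
use std::time::{Duration, Instant};

/// Illustrative check: a message dequeued at `dequeued_at` becomes visible to
/// other consumers again once the visibility timeout elapses without an ack.
fn visible_again(dequeued_at: Instant, visibility_timeout: Duration, acked: bool) -> bool {
    !acked && dequeued_at.elapsed() >= visibility_timeout
}

fn main() {
    let dequeued_at = Instant::now();
    // Immediately after dequeue the message is still invisible to other consumers.
    assert!(!visible_again(dequeued_at, Duration::from_secs(30), false));
}
```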

common.queues.naming_pattern

Template pattern for constructing queue names from namespace and name

  • Type: String
  • Default: "{namespace}_{name}_queue"
  • Valid Range: string containing {namespace} and {name} placeholders
  • System Impact: Determines the actual PGMQ/RabbitMQ queue names; changing this after deployment requires manual queue migration
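
The pattern is a simple placeholder substitution. The sketch below shows the mechanics only; the queue_name function and the "worker"/"payments" example values are hypothetical and not taken from a real deployment.

```rust
/// Illustrative expansion of the "{namespace}_{name}_queue" pattern.
fn queue_name(pattern: &str, namespace: &str, name: &str) -> String {
    pattern.replace("{namespace}", namespace).replace("{name}", name)
}

fn main() {
    // A hypothetical "payments" queue in the worker namespace with the default pattern.
    assert_eq!(
        queue_name("{namespace}_{name}_queue", "worker", "payments"),
        "worker_payments_queue"
    );
}
```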

common.queues.orchestration_namespace

Namespace prefix for orchestration queue names

  • Type: String
  • Default: "orchestration"
  • Valid Range: non-empty string
  • System Impact: Used in queue naming pattern to isolate orchestration queues from worker queues

common.queues.worker_namespace

Namespace prefix for worker queue names

  • Type: String
  • Default: "worker"
  • Valid Range: non-empty string
  • System Impact: Used in queue naming pattern to isolate worker queues from orchestration queues

orchestration_queues

Path: common.queues.orchestration_queues

Parameter | Type | Default | Description
step_results | String | "orchestration_step_results" | Queue name for step execution results returned by workers
task_finalizations | String | "orchestration_task_finalizations" | Queue name for task finalization messages
task_requests | String | "orchestration_task_requests" | Queue name for incoming task execution requests

common.queues.orchestration_queues.step_results

Queue name for step execution results returned by workers

  • Type: String
  • Default: "orchestration_step_results"
  • Valid Range: valid queue name
  • System Impact: Workers publish step completion results here for the orchestration result processor

common.queues.orchestration_queues.task_finalizations

Queue name for task finalization messages

  • Type: String
  • Default: "orchestration_task_finalizations"
  • Valid Range: valid queue name
  • System Impact: Tasks ready for completion evaluation are enqueued here

common.queues.orchestration_queues.task_requests

Queue name for incoming task execution requests

  • Type: String
  • Default: "orchestration_task_requests"
  • Valid Range: valid queue name
  • System Impact: The orchestration system reads new task requests from this queue

pgmq

Path: common.queues.pgmq

Parameter | Type | Default | Description
poll_interval_ms | u32 | 500 | Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive

common.queues.pgmq.poll_interval_ms

Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive

  • Type: u32
  • Default: 500
  • Valid Range: 10-10000
  • System Impact: Lower values reduce message latency in polling mode but increase database load; in Hybrid mode this is the fallback interval

queue_depth_thresholds

Path: common.queues.pgmq.queue_depth_thresholds

Parameter | Type | Default | Description
critical_threshold | i64 | 5000 | Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions
overflow_threshold | i64 | 10000 | Queue depth indicating an emergency condition requiring manual intervention

common.queues.pgmq.queue_depth_thresholds.critical_threshold

Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions

  • Type: i64
  • Default: 5000
  • Valid Range: 1+
  • System Impact: Backpressure mechanism: rejects new work to allow the system to drain existing messages

common.queues.pgmq.queue_depth_thresholds.overflow_threshold

Queue depth indicating an emergency condition requiring manual intervention

  • Type: i64
  • Default: 10000
  • Valid Range: 1+
  • System Impact: Highest severity threshold; triggers error-level logging and metrics for operational alerting
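
A sketch of how the two thresholds could classify queue depth into a backpressure decision. The enum and function names are illustrative, not the engine's API.

```rust
#[derive(Debug, PartialEq)]
enum QueuePressure {
    Normal,   // accept new task submissions
    Critical, // API returns HTTP 503 for new submissions
    Overflow, // emergency: error-level logging/metrics, manual intervention
}

fn classify(depth: i64, critical_threshold: i64, overflow_threshold: i64) -> QueuePressure {
    if depth >= overflow_threshold {
        QueuePressure::Overflow
    } else if depth >= critical_threshold {
        QueuePressure::Critical
    } else {
        QueuePressure::Normal
    }
}

fn main() {
    // With the defaults (5000 / 10000), a depth of 7200 rejects new work but is not yet an emergency.
    assert_eq!(classify(7_200, 5_000, 10_000), QueuePressure::Critical);
}
```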

rabbitmq

Path: common.queues.rabbitmq

Parameter | Type | Default | Description
heartbeat_seconds | u16 | 30 | AMQP heartbeat interval for connection liveness detection
prefetch_count | u16 | 100 | Number of unacknowledged messages RabbitMQ will deliver before waiting for acks
url | String | "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}" | AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’

common.queues.rabbitmq.heartbeat_seconds

AMQP heartbeat interval for connection liveness detection

  • Type: u16
  • Default: 30
  • Valid Range: 0-3600
  • System Impact: Detects dead connections; 0 disables heartbeats (not recommended in production)

common.queues.rabbitmq.prefetch_count

Number of unacknowledged messages RabbitMQ will deliver before waiting for acks

  • Type: u16
  • Default: 100
  • Valid Range: 1-65535
  • System Impact: Controls consumer throughput vs. memory usage; higher values increase throughput but buffer more messages in-process

common.queues.rabbitmq.url

AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’

  • Type: String
  • Default: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
  • Valid Range: valid AMQP URI
  • System Impact: Only used when queues.backend = ‘rabbitmq’; must be reachable at startup

system

Path: common.system

Parameter | Type | Default | Description
default_dependent_system | String | "default" | Default system name assigned to tasks that do not specify a dependent system

common.system.default_dependent_system

Default system name assigned to tasks that do not specify a dependent system

  • Type: String
  • Default: "default"
  • Valid Range: non-empty string
  • System Impact: Groups tasks for routing and reporting; most single-system deployments can leave this as default

task_templates

Path: common.task_templates

Parameter | Type | Default | Description
search_paths | Vec<String> | ["config/tasks/**/*.{yml,yaml}"] | Glob patterns for discovering task template YAML files

common.task_templates.search_paths

Glob patterns for discovering task template YAML files

  • Type: Vec<String>
  • Default: ["config/tasks/**/*.{yml,yaml}"]
  • Valid Range: valid glob patterns
  • System Impact: Templates matching these patterns are loaded at startup for task definition discovery

Generated by tasker-ctl docs · Tasker Configuration System

Configuration Reference: orchestration

91/91 parameters documented


orchestration

Root-level orchestration parameters

Path: orchestration

Parameter | Type | Default | Description
enable_performance_logging | bool | true | Enable detailed performance logging for orchestration actors
shutdown_timeout_ms | u64 | 30000 | Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown

orchestration.enable_performance_logging

Enable detailed performance logging for orchestration actors

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Emits timing metrics for task processing, step enqueueing, and result evaluation; disable in production if log volume is a concern

orchestration.shutdown_timeout_ms

Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown

  • Type: u64
  • Default: 30000
  • Valid Range: 1000-300000
  • System Impact: If shutdown exceeds this timeout, the process exits forcefully to avoid hanging indefinitely; 30s is conservative for most deployments

batch_processing

Path: orchestration.batch_processing

Parameter | Type | Default | Description
checkpoint_stall_minutes | u32 | 15 | Minutes without a checkpoint update before a batch is considered stalled
default_batch_size | u32 | 1000 | Default number of items in a single batch when not specified by the handler
enabled | bool | true | Enable the batch processing subsystem for large-scale step execution
max_parallel_batches | u32 | 50 | Maximum number of batch operations that can execute concurrently

orchestration.batch_processing.checkpoint_stall_minutes

Minutes without a checkpoint update before a batch is considered stalled

  • Type: u32
  • Default: 15
  • Valid Range: 1-1440
  • System Impact: Stalled batches are flagged for investigation or automatic recovery; lower values detect issues faster

orchestration.batch_processing.default_batch_size

Default number of items in a single batch when not specified by the handler

  • Type: u32
  • Default: 1000
  • Valid Range: 1-100000
  • System Impact: Larger batches improve throughput but increase memory usage and per-batch latency

orchestration.batch_processing.enabled

Enable the batch processing subsystem for large-scale step execution

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, batch step handlers cannot be used; all steps must be processed individually

orchestration.batch_processing.max_parallel_batches

Maximum number of batch operations that can execute concurrently

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Bounds resource usage from concurrent batch processing; increase for high-throughput batch workloads
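
For intuition, splitting a work set into batches of default_batch_size is essentially a chunking operation. This sketch only illustrates the arithmetic, not the batch subsystem itself.

```rust
fn main() {
    let default_batch_size = 1_000; // orchestration.batch_processing.default_batch_size
    let items: Vec<u32> = (0..2_500).collect();

    // 2500 items become three batches: 1000 + 1000 + 500.
    let batches: Vec<&[u32]> = items.chunks(default_batch_size).collect();
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[2].len(), 500);
}
```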

decision_points

Path: orchestration.decision_points

Parameter | Type | Default | Description
enable_detailed_logging | bool | false | Enable verbose logging of decision point evaluation including expression results
enable_metrics | bool | true | Enable metrics collection for decision point evaluations
enabled | bool | true | Enable the decision point evaluation subsystem for conditional workflow branching
max_decision_depth | u32 | 20 | Maximum depth of nested decision point chains
max_steps_per_decision | u32 | 100 | Maximum number of steps that can be generated by a single decision point evaluation
warn_threshold_depth | u32 | 10 | Decision depth above which a warning is logged
warn_threshold_steps | u32 | 50 | Number of steps per decision above which a warning is logged

orchestration.decision_points.enable_detailed_logging

Enable verbose logging of decision point evaluation including expression results

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: Produces high-volume logs; enable only for debugging specific decision point behavior

orchestration.decision_points.enable_metrics

Enable metrics collection for decision point evaluations

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks evaluation counts, timings, and branch selection distribution

orchestration.decision_points.enabled

Enable the decision point evaluation subsystem for conditional workflow branching

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, all decision points are skipped and conditional steps are not evaluated

orchestration.decision_points.max_decision_depth

Maximum depth of nested decision point chains

  • Type: u32
  • Default: 20
  • Valid Range: 1-100
  • System Impact: Prevents infinite recursion from circular decision point references

orchestration.decision_points.max_steps_per_decision

Maximum number of steps that can be generated by a single decision point evaluation

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Safety limit to prevent decision points from creating unbounded step graphs

orchestration.decision_points.warn_threshold_depth

Decision depth above which a warning is logged

  • Type: u32
  • Default: 10
  • Valid Range: 1-100
  • System Impact: Observability: identifies deeply nested decision chains that may indicate design issues

orchestration.decision_points.warn_threshold_steps

Number of steps per decision above which a warning is logged

  • Type: u32
  • Default: 50
  • Valid Range: 1-10000
  • System Impact: Observability: identifies decision points that generate unusually large step sets
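
Taken together, the four limits act as hard caps plus softer warning lines. The sketch below shows how such guards could be applied; the check_decision function and its messages are assumptions for the example, not the evaluator's real code.

```rust
/// Illustrative guard for a decision point evaluation result.
fn check_decision(depth: u32, generated_steps: u32) -> Result<(), String> {
    let (max_depth, max_steps) = (20, 100);  // hard limits (defaults)
    let (warn_depth, warn_steps) = (10, 50); // warning thresholds (defaults)

    if depth > max_depth {
        return Err(format!("decision depth {depth} exceeds max_decision_depth {max_depth}"));
    }
    if generated_steps > max_steps {
        return Err(format!("{generated_steps} steps exceed max_steps_per_decision {max_steps}"));
    }
    if depth > warn_depth {
        eprintln!("warning: decision depth {depth} above warn_threshold_depth {warn_depth}");
    }
    if generated_steps > warn_steps {
        eprintln!("warning: {generated_steps} steps above warn_threshold_steps {warn_steps}");
    }
    Ok(())
}

fn main() {
    // Depth 12 with 60 generated steps passes the hard limits but logs both warnings.
    assert!(check_decision(12, 60).is_ok());
}
```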

dlq

Path: orchestration.dlq

Parameter | Type | Default | Description
enabled | bool | true | Enable the Dead Letter Queue subsystem for handling unrecoverable tasks

orchestration.dlq.enabled

Enable the Dead Letter Queue subsystem for handling unrecoverable tasks

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, stale or failed tasks remain in their error state without DLQ routing

staleness_detection

Path: orchestration.dlq.staleness_detection

Parameter | Type | Default | Description
batch_size | u32 | 100 | Number of potentially stale tasks to evaluate in a single detection sweep
detection_interval_seconds | u32 | 300 | Interval in seconds between staleness detection sweeps
dry_run | bool | false | Run staleness detection in observation-only mode without taking action
enabled | bool | true | Enable periodic scanning for stale tasks

orchestration.dlq.staleness_detection.batch_size

Number of potentially stale tasks to evaluate in a single detection sweep

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Larger batches process more stale tasks per sweep but increase per-sweep query cost

orchestration.dlq.staleness_detection.detection_interval_seconds

Interval in seconds between staleness detection sweeps

  • Type: u32
  • Default: 300
  • Valid Range: 30-3600
  • System Impact: Lower values detect stale tasks faster but increase database query frequency

orchestration.dlq.staleness_detection.dry_run

Run staleness detection in observation-only mode without taking action

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: Logs what would be DLQ’d without actually transitioning tasks; useful for tuning thresholds

orchestration.dlq.staleness_detection.enabled

Enable periodic scanning for stale tasks

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no automatic staleness detection runs; tasks must be manually DLQ’d

actions

Path: orchestration.dlq.staleness_detection.actions

Parameter | Type | Default | Description
auto_move_to_dlq | bool | true | Automatically move stale tasks to the DLQ after transitioning to error
auto_transition_to_error | bool | true | Automatically transition stale tasks to the Error state
emit_events | bool | true | Emit domain events when staleness is detected
event_channel | String | "task_staleness_detected" | PGMQ channel name for staleness detection events

orchestration.dlq.staleness_detection.actions.auto_move_to_dlq

Automatically move stale tasks to the DLQ after transitioning to error

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, stale tasks are routed to the DLQ; when false, they remain in Error state for manual review

orchestration.dlq.staleness_detection.actions.auto_transition_to_error

Automatically transition stale tasks to the Error state

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, stale tasks are moved to Error before DLQ routing; when false, tasks stay in their current state

orchestration.dlq.staleness_detection.actions.emit_events

Emit domain events when staleness is detected

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, staleness events are published to the event_channel for external alerting or custom handling

orchestration.dlq.staleness_detection.actions.event_channel

PGMQ channel name for staleness detection events

  • Type: String
  • Default: "task_staleness_detected"
  • Valid Range: 1-255 characters
  • System Impact: Consumers can subscribe to this channel for alerting or custom staleness handling

thresholds

Path: orchestration.dlq.staleness_detection.thresholds

Parameter | Type | Default | Description
steps_in_process_minutes | u32 | 30 | Minutes a task can have steps in process before being considered stale
task_max_lifetime_hours | u32 | 24 | Absolute maximum lifetime for any task regardless of state
waiting_for_dependencies_minutes | u32 | 60 | Minutes a task can wait for step dependencies before being considered stale
waiting_for_retry_minutes | u32 | 30 | Minutes a task can wait for step retries before being considered stale

orchestration.dlq.staleness_detection.thresholds.steps_in_process_minutes

Minutes a task can have steps in process before being considered stale

  • Type: u32
  • Default: 30
  • Valid Range: 1-1440
  • System Impact: Tasks in StepsInProcess state exceeding this age may have hung workers; flags for investigation

orchestration.dlq.staleness_detection.thresholds.task_max_lifetime_hours

Absolute maximum lifetime for any task regardless of state

  • Type: u32
  • Default: 24
  • Valid Range: 1-168
  • System Impact: Hard cap; tasks exceeding this age are considered stale even if actively processing

orchestration.dlq.staleness_detection.thresholds.waiting_for_dependencies_minutes

Minutes a task can wait for step dependencies before being considered stale

  • Type: u32
  • Default: 60
  • Valid Range: 1-1440
  • System Impact: Tasks in WaitingForDependencies state exceeding this age are flagged for DLQ consideration

orchestration.dlq.staleness_detection.thresholds.waiting_for_retry_minutes

Minutes a task can wait for step retries before being considered stale

  • Type: u32
  • Default: 30
  • Valid Range: 1-1440
  • System Impact: Tasks in WaitingForRetry state exceeding this age are flagged for DLQ consideration
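
A sketch of how a detection sweep could compare per-state ages against these thresholds. The state names follow the descriptions above; the is_stale function and the minute-based age handling are assumptions for the example.

```rust
enum TaskState { WaitingForDependencies, WaitingForRetry, StepsInProcess, Other }

/// Illustrative staleness check: true when the task exceeds its per-state
/// threshold or the absolute task_max_lifetime_hours cap.
fn is_stale(state: &TaskState, age_minutes: u32) -> bool {
    let per_state_limit = match state {
        TaskState::WaitingForDependencies => 60, // waiting_for_dependencies_minutes
        TaskState::WaitingForRetry => 30,        // waiting_for_retry_minutes
        TaskState::StepsInProcess => 30,         // steps_in_process_minutes
        TaskState::Other => u32::MAX,
    };
    let max_lifetime_minutes = 24 * 60;          // task_max_lifetime_hours
    age_minutes > per_state_limit || age_minutes > max_lifetime_minutes
}

fn main() {
    // 90 minutes waiting for dependencies exceeds the 60-minute default.
    assert!(is_stale(&TaskState::WaitingForDependencies, 90));
}
```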

event_systems

Path: orchestration.event_systems

orchestration

Path: orchestration.event_systems.orchestration

Parameter | Type | Default | Description
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
system_id | String | "orchestration-event-system" | Unique identifier for the orchestration event system instance

orchestration.event_systems.orchestration.deployment_mode

Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
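
Conceptually the three modes reduce to an enum like the sketch below; the variant comments restate the trade-offs above, and the needs_polling_loop helper is an illustrative assumption, not the engine's type or API.

```rust
/// Illustrative model of the three delivery modes.
enum DeploymentMode {
    Hybrid,          // LISTEN/NOTIFY first, polling as a fallback
    EventDrivenOnly, // lowest latency, but no fallback if LISTEN/NOTIFY drops
    PollingOnly,     // highest latency, no LISTEN/NOTIFY dependency
}

/// Whether a polling loop runs at all for a given mode.
fn needs_polling_loop(mode: &DeploymentMode) -> bool {
    matches!(mode, DeploymentMode::Hybrid | DeploymentMode::PollingOnly)
}

fn main() {
    assert!(needs_polling_loop(&DeploymentMode::Hybrid));
    assert!(!needs_polling_loop(&DeploymentMode::EventDrivenOnly));
}
```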

orchestration.event_systems.orchestration.system_id

Unique identifier for the orchestration event system instance

  • Type: String
  • Default: "orchestration-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish this event system from others

health

Path: orchestration.event_systems.orchestration.health

Parameter | Type | Default | Description
enabled | bool | true | Enable health monitoring for the orchestration event system
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the event system reports as unhealthy
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the event system reports as unhealthy
performance_monitoring_enabled | bool | true | Enable detailed performance metrics collection for event processing

orchestration.event_systems.orchestration.health.enabled

Enable health monitoring for the orchestration event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks or error tracking run for this event system

orchestration.event_systems.orchestration.health.error_rate_threshold_per_minute

Error rate per minute above which the event system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal; complements max_consecutive_errors for burst error detection

orchestration.event_systems.orchestration.health.max_consecutive_errors

Number of consecutive errors before the event system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation after sustained failures; resets on any success

orchestration.event_systems.orchestration.health.performance_monitoring_enabled

Enable detailed performance metrics collection for event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks processing latency percentiles and throughput; adds minor overhead
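
A sketch of how max_consecutive_errors and error_rate_threshold_per_minute could combine into a single health verdict. The is_healthy function is an illustrative assumption, not the monitor's real implementation.

```rust
/// Illustrative health verdict combining the two error signals.
fn is_healthy(
    consecutive_errors: u32,
    errors_in_last_minute: u32,
    max_consecutive_errors: u32,          // default 10
    error_rate_threshold_per_minute: u32, // default 20
) -> bool {
    consecutive_errors < max_consecutive_errors
        && errors_in_last_minute <= error_rate_threshold_per_minute
}

fn main() {
    // A burst of 25 errors in one minute is unhealthy even if successes keep
    // resetting the consecutive-error counter.
    assert!(!is_healthy(2, 25, 10, 20));
}
```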

processing

Path: orchestration.event_systems.orchestration.processing

Parameter | Type | Default | Description
batch_size | u32 | 20 | Number of events dequeued in a single batch read
max_concurrent_operations | u32 | 50 | Maximum number of events processed concurrently by the orchestration event system
max_retries | u32 | 3 | Maximum retry attempts for a failed event processing operation

orchestration.event_systems.orchestration.processing.batch_size

Number of events dequeued in a single batch read

  • Type: u32
  • Default: 20
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput but increase per-batch processing time

orchestration.event_systems.orchestration.processing.max_concurrent_operations

Maximum number of events processed concurrently by the orchestration event system

  • Type: u32
  • Default: 50
  • Valid Range: 1-10000
  • System Impact: Controls parallelism for task request, result, and finalization processing

orchestration.event_systems.orchestration.processing.max_retries

Maximum retry attempts for a failed event processing operation

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Events exceeding this retry count are dropped or sent to the DLQ

backoff

Path: orchestration.event_systems.orchestration.processing.backoff

Parameter | Type | Default | Description
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first event processing failure
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between event processing retries
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure

timing

Path: orchestration.event_systems.orchestration.timing

Parameter | Type | Default | Description
claim_timeout_seconds | u32 | 300 | Maximum time in seconds an event claim remains valid
fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the orchestration event system
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single event
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued message remains invisible to other consumers

orchestration.event_systems.orchestration.timing.claim_timeout_seconds

Maximum time in seconds an event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned claims from blocking event processing indefinitely

orchestration.event_systems.orchestration.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Only active in Hybrid mode when event-driven delivery fails; lower values reduce latency but increase DB load

orchestration.event_systems.orchestration.timing.health_check_interval_seconds

Interval in seconds between health check probes for the orchestration event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the event system verifies its own connectivity and responsiveness

orchestration.event_systems.orchestration.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Events exceeding this timeout are considered failed and may be retried

orchestration.event_systems.orchestration.timing.visibility_timeout_seconds

Time in seconds a dequeued message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: If processing is not completed within this window, the message becomes visible again for redelivery

task_readiness

Path: orchestration.event_systems.task_readiness

Parameter | Type | Default | Description
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’
system_id | String | "task-readiness-event-system" | Unique identifier for the task readiness event system instance

orchestration.event_systems.task_readiness.deployment_mode

Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; task readiness events trigger step processing and benefit from low-latency delivery

orchestration.event_systems.task_readiness.system_id

Unique identifier for the task readiness event system instance

  • Type: String
  • Default: "task-readiness-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish task readiness events from other event systems

health

Path: orchestration.event_systems.task_readiness.health

Parameter | Type | Default | Description
enabled | bool | true | Enable health monitoring for the task readiness event system
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the task readiness system reports as unhealthy
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the task readiness system reports as unhealthy
performance_monitoring_enabled | bool | true | Enable detailed performance metrics for task readiness event processing

orchestration.event_systems.task_readiness.health.enabled

Enable health monitoring for the task readiness event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks run for task readiness processing

orchestration.event_systems.task_readiness.health.error_rate_threshold_per_minute

Error rate per minute above which the task readiness system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal complementing max_consecutive_errors

orchestration.event_systems.task_readiness.health.max_consecutive_errors

Number of consecutive errors before the task readiness system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation; resets on any successful readiness check

orchestration.event_systems.task_readiness.health.performance_monitoring_enabled

Enable detailed performance metrics for task readiness event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks readiness check latency and throughput; useful for tuning batch_size and concurrency

processing

Path: orchestration.event_systems.task_readiness.processing

Parameter | Type | Default | Description
batch_size | u32 | 50 | Number of task readiness events dequeued in a single batch
max_concurrent_operations | u32 | 100 | Maximum number of task readiness events processed concurrently
max_retries | u32 | 3 | Maximum retry attempts for a failed task readiness event

orchestration.event_systems.task_readiness.processing.batch_size

Number of task readiness events dequeued in a single batch

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput for readiness evaluation; 50 balances latency and throughput

orchestration.event_systems.task_readiness.processing.max_concurrent_operations

Maximum number of task readiness events processed concurrently

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Higher than orchestration (100 vs 50) because readiness checks are lightweight SQL queries

orchestration.event_systems.task_readiness.processing.max_retries

Maximum retry attempts for a failed task readiness event

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Readiness events are idempotent so retries are safe; limits retry storms

backoff

Path: orchestration.event_systems.task_readiness.processing.backoff

Parameter | Type | Default | Description
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first task readiness processing failure
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay for readiness retries
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds for task readiness retries
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive readiness failure

timing

Path: orchestration.event_systems.task_readiness.timing

Parameter | Type | Default | Description
claim_timeout_seconds | u32 | 300 | Maximum time in seconds a task readiness event claim remains valid
fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles for task readiness
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the task readiness event system
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single task readiness event
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued task readiness message remains invisible to other consumers

orchestration.event_systems.task_readiness.timing.claim_timeout_seconds

Maximum time in seconds a task readiness event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned readiness claims from blocking task evaluation

orchestration.event_systems.task_readiness.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles for task readiness

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Fallback interval when LISTEN/NOTIFY is unavailable; lower values improve responsiveness

orchestration.event_systems.task_readiness.timing.health_check_interval_seconds

Interval in seconds between health check probes for the task readiness event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the task readiness system verifies its own connectivity

orchestration.event_systems.task_readiness.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single task readiness event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Readiness events exceeding this timeout are considered failed

orchestration.event_systems.task_readiness.timing.visibility_timeout_seconds

Time in seconds a dequeued task readiness message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Prevents duplicate processing of readiness events during normal operation

grpc

Path: orchestration.grpc

Parameter | Type | Default | Description
bind_address | String | "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}" | Socket address for the gRPC server
enable_health_service | bool | true | Enable the gRPC health checking service (grpc.health.v1)
enable_reflection | bool | true | Enable gRPC server reflection for service discovery
enabled | bool | true | Enable the gRPC API server alongside the REST API
keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames
keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
max_concurrent_streams | u32 | 200 | Maximum number of concurrent gRPC streams per connection
max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame
tls_enabled | bool | false | Enable TLS encryption for gRPC connections

orchestration.grpc.bind_address

Socket address for the gRPC server

  • Type: String
  • Default: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
  • Valid Range: host:port
  • System Impact: Must not conflict with the REST API bind_address; default 9190 avoids Prometheus port conflict

orchestration.grpc.enable_health_service

Enable the gRPC health checking service (grpc.health.v1)

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators

orchestration.grpc.enable_reflection

Enable gRPC server reflection for service discovery

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Allows tools like grpcurl to list and inspect services; safe to enable in development, consider disabling in production

orchestration.grpc.enabled

Enable the gRPC API server alongside the REST API

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no gRPC endpoints are available; clients must use REST

orchestration.grpc.keepalive_interval_seconds

Interval in seconds between gRPC keepalive ping frames

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Detects dead connections; lower values detect failures faster but increase network overhead

orchestration.grpc.keepalive_timeout_seconds

Time in seconds to wait for a keepalive ping acknowledgment before closing the connection

  • Type: u32
  • Default: 20
  • Valid Range: 1-300
  • System Impact: Connections that fail to acknowledge within this window are considered dead and closed

orchestration.grpc.max_concurrent_streams

Maximum number of concurrent gRPC streams per connection

  • Type: u32
  • Default: 200
  • Valid Range: 1-10000
  • System Impact: Limits multiplexed request parallelism per connection; 200 is conservative for orchestration workloads

orchestration.grpc.max_frame_size

Maximum size in bytes of a single HTTP/2 frame

  • Type: u32
  • Default: 16384
  • Valid Range: 16384-16777215
  • System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream

orchestration.grpc.tls_enabled

Enable TLS encryption for gRPC connections

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When true, tls_cert_path and tls_key_path must be provided; required for production gRPC deployments

mpsc_channels

Path: orchestration.mpsc_channels

command_processor

Path: orchestration.mpsc_channels.command_processor

Parameter | Type | Default | Description
command_buffer_size | usize | 5000 | Bounded channel capacity for the orchestration command processor

orchestration.mpsc_channels.command_processor.command_buffer_size

Bounded channel capacity for the orchestration command processor

  • Type: usize
  • Default: 5000
  • Valid Range: 100-100000
  • System Impact: Buffers incoming orchestration commands; larger values absorb traffic spikes but use more memory

event_listeners

Path: orchestration.mpsc_channels.event_listeners

Parameter | Type | Default | Description
pgmq_event_buffer_size | usize | 50000 | Bounded channel capacity for PGMQ event listener notifications

orchestration.mpsc_channels.event_listeners.pgmq_event_buffer_size

Bounded channel capacity for PGMQ event listener notifications

  • Type: usize
  • Default: 50000
  • Valid Range: 1000-1000000
  • System Impact: Large buffer (50000) absorbs high-volume PGMQ LISTEN/NOTIFY events without backpressure on PostgreSQL

event_systems

Path: orchestration.mpsc_channels.event_systems

| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 10000 | Bounded channel capacity for the orchestration event system internal channel |

orchestration.mpsc_channels.event_systems.event_channel_buffer_size

Bounded channel capacity for the orchestration event system internal channel

  • Type: usize
  • Default: 10000
  • Valid Range: 100-100000
  • System Impact: Buffers events between the event listener and event processor; larger values absorb notification bursts

web

Path: orchestration.web

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}" | Socket address for the REST API server |
| enabled | bool | true | Enable the REST API server for the orchestration service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for an HTTP request to complete before timeout |

orchestration.web.bind_address

Socket address for the REST API server

  • Type: String
  • Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
  • Valid Range: host:port
  • System Impact: Determines where the orchestration REST API listens; use 0.0.0.0 for container deployments

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8080 | Standard port; use TASKER_WEB_BIND_ADDRESS env var to override in CI |
| test | 0.0.0.0:8080 | Default port for test fixtures |
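
The `${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}` default uses shell-style "value or fallback" interpolation. Exactly where Tasker's config loader resolves this is not shown here, but the semantics are equivalent to the following sketch:

```rust
use std::env;

/// Resolve an environment variable with a literal fallback, mirroring `${VAR:-default}`.
fn resolve_bind_address() -> String {
    env::var("TASKER_WEB_BIND_ADDRESS").unwrap_or_else(|_| "0.0.0.0:8080".to_string())
}

fn main() {
    println!("binding REST API to {}", resolve_bind_address());
}
```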

orchestration.web.enabled

Enable the REST API server for the orchestration service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no HTTP endpoints are available; the service operates via messaging only

orchestration.web.request_timeout_ms

Maximum time in milliseconds for an HTTP request to complete before timeout

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections

auth

Path: orchestration.web.auth

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication |
| enabled | bool | false | Enable authentication for the REST API |
| jwt_audience | String | "tasker-api" | Expected ‘aud’ claim in JWT tokens |
| jwt_issuer | String | "tasker-core" | Expected ‘iss’ claim in JWT tokens |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if this service issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours |

orchestration.web.auth.api_key

Static API key for simple key-based authentication

  • Type: String
  • Default: ""
  • Valid Range: non-empty string or empty to disable
  • System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header

orchestration.web.auth.api_key_header

HTTP header name for API key authentication

  • Type: String
  • Default: "X-API-Key"
  • Valid Range: valid HTTP header name
  • System Impact: Clients send their API key in this header; default is X-API-Key

orchestration.web.auth.enabled

Enable authentication for the REST API

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all API endpoints are unauthenticated; enable in production with JWT or API key auth

orchestration.web.auth.jwt_audience

Expected ‘aud’ claim in JWT tokens

  • Type: String
  • Default: "tasker-api"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different audience are rejected during validation

orchestration.web.auth.jwt_issuer

Expected ‘iss’ claim in JWT tokens

  • Type: String
  • Default: "tasker-core"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different issuer are rejected during validation

orchestration.web.auth.jwt_private_key

PEM-encoded private key for signing JWT tokens (if this service issues tokens)

  • Type: String
  • Default: ""
  • Valid Range: valid PEM private key or empty
  • System Impact: Required only if the orchestration service issues its own JWT tokens; leave empty when using external identity providers

orchestration.web.auth.jwt_public_key

PEM-encoded public key for verifying JWT token signatures

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY:-}"
  • Valid Range: valid PEM public key or empty
  • System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management in production

orchestration.web.auth.jwt_public_key_path

File path to a PEM-encoded public key for JWT verification

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
  • Valid Range: valid file path or empty
  • System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file

orchestration.web.auth.jwt_token_expiry_hours

Default JWT token validity period in hours

  • Type: u32
  • Default: 24
  • Valid Range: 1-720
  • System Impact: Tokens older than this are rejected; shorter values improve security but require more frequent re-authentication
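
For reference, validating the ‘iss’ and ‘aud’ claims described above looks roughly like the sketch below using the jsonwebtoken crate; the RS256 algorithm and the Claims shape are assumptions, not Tasker's actual implementation.

```rust
use jsonwebtoken::{decode, Algorithm, DecodingKey, Validation};
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Claims {
    sub: String,
    exp: usize,
}

fn verify(token: &str, public_key_pem: &[u8]) -> Result<Claims, jsonwebtoken::errors::Error> {
    // jwt_public_key / jwt_public_key_path supply this PEM material.
    let key = DecodingKey::from_rsa_pem(public_key_pem)?;
    let mut validation = Validation::new(Algorithm::RS256);
    validation.set_issuer(&["tasker-core"]);  // jwt_issuer
    validation.set_audience(&["tasker-api"]); // jwt_audience
    Ok(decode::<Claims>(token, &key, &validation)?.claims)
}
```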

database_pools

Path: orchestration.web.database_pools

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 50 | Advisory hint for the total number of database connections across all orchestration pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle web API connection is closed |
| web_api_max_connections | u32 | 30 | Maximum number of connections the web API pool can grow to under load |
| web_api_pool_size | u32 | 20 | Target number of connections in the web API database pool |

orchestration.web.database_pools.max_total_connections_hint

Advisory hint for the total number of database connections across all orchestration pools

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint

orchestration.web.database_pools.web_api_connection_timeout_seconds

Maximum time to wait when acquiring a connection from the web API pool

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: API requests that cannot acquire a connection within this window return an error

orchestration.web.database_pools.web_api_idle_timeout_seconds

Time before an idle web API connection is closed

  • Type: u32
  • Default: 600
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the web API pool shrinks after traffic subsides

orchestration.web.database_pools.web_api_max_connections

Maximum number of connections the web API pool can grow to under load

  • Type: u32
  • Default: 30
  • Valid Range: 1-500
  • System Impact: Hard ceiling for web API database connections; prevents connection exhaustion from traffic spikes

orchestration.web.database_pools.web_api_pool_size

Target number of connections in the web API database pool

  • Type: u32
  • Default: 20
  • Valid Range: 1-200
  • System Impact: Determines how many concurrent database queries the REST API can execute
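
One plausible mapping of these pool parameters onto sqlx's PgPoolOptions, treating web_api_pool_size as the minimum and web_api_max_connections as the ceiling; the connection string is a placeholder and this is not necessarily how the engine builds its pools.

```rust
use std::time::Duration;
use sqlx::postgres::PgPoolOptions;

async fn build_web_api_pool() -> Result<sqlx::PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .min_connections(20)                      // web_api_pool_size (target)
        .max_connections(30)                      // web_api_max_connections (hard ceiling)
        .acquire_timeout(Duration::from_secs(30)) // web_api_connection_timeout_seconds
        .idle_timeout(Duration::from_secs(600))   // web_api_idle_timeout_seconds
        .connect("postgres://localhost/tasker")   // placeholder connection string
        .await
}
```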

Generated by tasker-ctl docs · Tasker Configuration System

Configuration Reference: worker

90/90 parameters documented


worker

Root-level worker parameters

Path: worker

| Parameter | Type | Default | Description |
|---|---|---|---|
| worker_id | String | "worker-default-001" | Unique identifier for this worker instance |
| worker_type | String | "general" | Worker type classification for routing and reporting |

worker.worker_id

Unique identifier for this worker instance

  • Type: String
  • Default: "worker-default-001"
  • Valid Range: non-empty string
  • System Impact: Used in logging, metrics, and step claim attribution; must be unique across all worker instances in a cluster

worker.worker_type

Worker type classification for routing and reporting

  • Type: String
  • Default: "general"
  • Valid Range: non-empty string
  • System Impact: Used to match worker capabilities with step handler requirements; ‘general’ handles all step types

circuit_breakers

Path: worker.circuit_breakers

ffi_completion_send

Path: worker.circuit_breakers.ffi_completion_send

| Parameter | Type | Default | Description |
|---|---|---|---|
| failure_threshold | u32 | 5 | Number of consecutive FFI completion send failures before the circuit breaker trips |
| recovery_timeout_seconds | u32 | 5 | Time the FFI completion breaker stays Open before probing with a test send |
| slow_send_threshold_ms | u32 | 100 | Threshold in milliseconds above which FFI completion channel sends are logged as slow |
| success_threshold | u32 | 2 | Consecutive successful sends in Half-Open required to close the breaker |

worker.circuit_breakers.ffi_completion_send.failure_threshold

Number of consecutive FFI completion send failures before the circuit breaker trips

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects the FFI completion channel from cascading failures; when tripped, sends are short-circuited

worker.circuit_breakers.ffi_completion_send.recovery_timeout_seconds

Time the FFI completion breaker stays Open before probing with a test send

  • Type: u32
  • Default: 5
  • Valid Range: 1-300
  • System Impact: Short timeout (5s) because FFI channel issues are typically transient

worker.circuit_breakers.ffi_completion_send.slow_send_threshold_ms

Threshold in milliseconds above which FFI completion channel sends are logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-10000
  • System Impact: Observability: identifies when the FFI completion channel is under pressure from slow consumers

worker.circuit_breakers.ffi_completion_send.success_threshold

Consecutive successful sends in Half-Open required to close the breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Low threshold (2) allows fast recovery since FFI send failures are usually transient
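
The four parameters above describe a conventional Closed → Open → Half-Open circuit breaker. A self-contained sketch of that state machine using the documented defaults; this is an illustration of the pattern, not Tasker's implementation.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

struct CircuitBreaker {
    state: State,
    consecutive_failures: u32,
    half_open_successes: u32,
    failure_threshold: u32,     // 5
    success_threshold: u32,     // 2
    recovery_timeout: Duration, // 5s
}

impl CircuitBreaker {
    /// Should a send be attempted right now?
    fn allow(&mut self) -> bool {
        match self.state {
            State::Closed | State::HalfOpen => true,
            State::Open { since } if since.elapsed() >= self.recovery_timeout => {
                self.state = State::HalfOpen; // probe with a test send
                true
            }
            State::Open { .. } => false, // short-circuit while Open
        }
    }

    /// Record the outcome of a send.
    fn record(&mut self, ok: bool) {
        match (self.state, ok) {
            (State::HalfOpen, true) => {
                self.half_open_successes += 1;
                if self.half_open_successes >= self.success_threshold {
                    self.state = State::Closed;
                    self.consecutive_failures = 0;
                }
            }
            (State::HalfOpen, false) => {
                // A failed probe reopens the breaker immediately.
                self.state = State::Open { since: Instant::now() };
                self.half_open_successes = 0;
            }
            (_, true) => self.consecutive_failures = 0,
            (_, false) => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = State::Open { since: Instant::now() };
                    self.half_open_successes = 0;
                }
            }
        }
    }
}
```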

event_systems

Path: worker.event_systems

worker

Path: worker.event_systems.worker

| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "worker-event-system" | Unique identifier for the worker event system instance |

worker.event_systems.worker.deployment_mode

Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
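
‘Hybrid’ pairs PostgreSQL LISTEN/NOTIFY with a polling safety net. A rough sketch of that shape using sqlx's PgListener and a Tokio interval; the channel name and the polling body are hypothetical, and the real event loop is more involved.

```rust
use std::time::Duration;
use sqlx::postgres::PgListener;

async fn run(database_url: &str) -> Result<(), sqlx::Error> {
    // Event-driven path: subscribe to notifications (channel name is hypothetical).
    let mut listener = PgListener::connect(database_url).await?;
    listener.listen("tasker_step_dispatch").await?;

    // Polling fallback: fallback_polling_interval_seconds (2s for workers).
    let mut poll = tokio::time::interval(Duration::from_secs(2));

    loop {
        tokio::select! {
            // Lowest latency: react to LISTEN/NOTIFY as soon as it arrives.
            notification = listener.recv() => {
                let n = notification?;
                let _payload = n.payload();
                // dispatch the step referenced by the payload ...
            }
            // Guaranteed progress even if a notification is missed or the listener
            // connection drops; PollingOnly behaves like this branch alone.
            _ = poll.tick() => {
                // read ready messages from PGMQ ...
            }
        }
    }
}
```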

worker.event_systems.worker.system_id

Unique identifier for the worker event system instance

  • Type: String
  • Default: "worker-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish this event system from others

health

Path: worker.event_systems.worker.health

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the worker event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the worker event system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the worker event system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for worker event processing |

worker.event_systems.worker.health.enabled

Enable health monitoring for the worker event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks or error tracking run for the worker event system

worker.event_systems.worker.health.error_rate_threshold_per_minute

Error rate per minute above which the worker event system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal complementing max_consecutive_errors

worker.event_systems.worker.health.max_consecutive_errors

Number of consecutive errors before the worker event system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation; resets on any successful event processing

worker.event_systems.worker.health.performance_monitoring_enabled

Enable detailed performance metrics for worker event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks step dispatch latency and throughput; useful for tuning concurrency settings

metadata

Path: worker.event_systems.worker.metadata

fallback_poller

Path: worker.event_systems.worker.metadata.fallback_poller

| Parameter | Type | Default | Description |
|---|---|---|---|
| age_threshold_seconds | u32 | 5 | Minimum age in seconds of a message before the fallback poller will pick it up |
| batch_size | u32 | 20 | Number of messages to dequeue in a single fallback poll |
| enabled | bool | true | Enable the fallback polling mechanism for step dispatch |
| max_age_hours | u32 | 24 | Maximum age in hours of messages the fallback poller will process |
| polling_interval_ms | u32 | 1000 | Interval in milliseconds between fallback polling cycles |
| supported_namespaces | Vec<String> | [] | List of queue namespaces the fallback poller monitors; empty means all namespaces |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a message polled by the fallback mechanism remains invisible |

in_process_events

Path: worker.event_systems.worker.metadata.in_process_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| deduplication_cache_size | usize | 10000 | Number of event IDs to cache for deduplication of in-process events |
| ffi_integration_enabled | bool | true | Enable FFI integration for in-process event delivery to Ruby/Python workers |

listener

Path: worker.event_systems.worker.metadata.listener

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_processing | bool | true | Enable batch processing of accumulated LISTEN/NOTIFY events |
| connection_timeout_seconds | u32 | 30 | Maximum time to wait when establishing the LISTEN/NOTIFY PostgreSQL connection |
| event_timeout_seconds | u32 | 60 | Maximum time to wait for a LISTEN/NOTIFY event before yielding |
| max_retry_attempts | u32 | 5 | Maximum number of listener reconnection attempts before falling back to polling |
| retry_interval_seconds | u32 | 5 | Interval in seconds between LISTEN/NOTIFY listener reconnection attempts |

processing

Path: worker.event_systems.worker.processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read by the worker |
| max_concurrent_operations | u32 | 100 | Maximum number of events processed concurrently by the worker event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed worker event processing operation |

worker.event_systems.worker.processing.batch_size

Number of events dequeued in a single batch read by the worker

  • Type: u32
  • Default: 20
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput but increase per-batch processing time

worker.event_systems.worker.processing.max_concurrent_operations

Maximum number of events processed concurrently by the worker event system

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Controls parallelism for step dispatch and completion processing

worker.event_systems.worker.processing.max_retries

Maximum retry attempts for a failed worker event processing operation

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Events exceeding this retry count are dropped or sent to the DLQ

backoff

Path: worker.event_systems.worker.processing.backoff

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first worker event processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between worker event retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
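
These four values describe standard exponential backoff with jitter. A worked sketch of the computation using the defaults; whether jitter is added on top or applied symmetrically is an assumption here.

```rust
use std::time::Duration;

/// delay(attempt) = min(initial_delay_ms * multiplier^attempt, max_delay_ms),
/// plus up to jitter_percent of that value as random jitter.
fn backoff_delay(attempt: u32) -> Duration {
    let initial_delay_ms = 100.0; // initial_delay_ms
    let multiplier: f64 = 2.0;    // multiplier
    let max_delay_ms = 10_000.0;  // max_delay_ms
    let jitter_percent = 0.1;     // jitter_percent

    let base = (initial_delay_ms * multiplier.powi(attempt as i32)).min(max_delay_ms);
    let jitter = base * jitter_percent * rand::random::<f64>();
    Duration::from_millis((base + jitter) as u64)
}

fn main() {
    // attempt 0 -> ~100ms, attempt 3 -> ~800ms, attempt 7+ -> capped near 10s
    for attempt in 0..8 {
        println!("attempt {attempt}: {:?}", backoff_delay(attempt));
    }
}
```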

timing

Path: worker.event_systems.worker.timing

| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds a worker event claim remains valid |
| fallback_polling_interval_seconds | u32 | 2 | Interval in seconds between fallback polling cycles for step dispatch |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the worker event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single worker event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued step dispatch message remains invisible to other workers |

worker.event_systems.worker.timing.claim_timeout_seconds

Maximum time in seconds a worker event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned claims from blocking step processing indefinitely

worker.event_systems.worker.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles for step dispatch

  • Type: u32
  • Default: 2
  • Valid Range: 1-60
  • System Impact: Shorter than orchestration (2s vs 5s) because workers need fast step pickup for low latency

worker.event_systems.worker.timing.health_check_interval_seconds

Interval in seconds between health check probes for the worker event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the worker event system verifies its own connectivity

worker.event_systems.worker.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single worker event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Events exceeding this timeout are considered failed and may be retried

worker.event_systems.worker.timing.visibility_timeout_seconds

Time in seconds a dequeued step dispatch message remains invisible to other workers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Prevents duplicate step execution; must be longer than typical step processing time

grpc

Path: worker.grpc

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}" | Socket address for the worker gRPC server |
| enable_health_service | bool | true | Enable the gRPC health checking service on the worker |
| enable_reflection | bool | true | Enable gRPC server reflection for the worker service |
| enabled | bool | true | Enable the gRPC API server for the worker service |
| keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames on the worker |
| keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
| max_concurrent_streams | u32 | 1000 | Maximum number of concurrent gRPC streams per connection |
| max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server |
| tls_enabled | bool | false | Enable TLS encryption for worker gRPC connections |

worker.grpc.bind_address

Socket address for the worker gRPC server

  • Type: String
  • Default: "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}"
  • Valid Range: host:port
  • System Impact: Must not conflict with the REST API or orchestration gRPC ports; default 9191

worker.grpc.enable_health_service

Enable the gRPC health checking service on the worker

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators

worker.grpc.enable_reflection

Enable gRPC server reflection for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Allows tools like grpcurl to list and inspect worker services; consider disabling in production

worker.grpc.enabled

Enable the gRPC API server for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no gRPC endpoints are available; clients must use REST

worker.grpc.keepalive_interval_seconds

Interval in seconds between gRPC keepalive ping frames on the worker

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Detects dead connections; lower values detect failures faster but increase network overhead

worker.grpc.keepalive_timeout_seconds

Time in seconds to wait for a keepalive ping acknowledgment before closing the connection

  • Type: u32
  • Default: 20
  • Valid Range: 1-300
  • System Impact: Connections that fail to acknowledge within this window are considered dead and closed

worker.grpc.max_concurrent_streams

Maximum number of concurrent gRPC streams per connection

  • Type: u32
  • Default: 1000
  • Valid Range: 1-10000
  • System Impact: Workers typically handle more concurrent streams than orchestration; default 1000 reflects this

worker.grpc.max_frame_size

Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server

  • Type: u32
  • Default: 16384
  • Valid Range: 16384-16777215
  • System Impact: Larger frames reduce framing overhead for large messages but increase per-stream memory usage

worker.grpc.tls_enabled

Enable TLS encryption for worker gRPC connections

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When true, TLS cert and key paths must be provided; required for production gRPC deployments

mpsc_channels

Path: worker.mpsc_channels

command_processor

Path: worker.mpsc_channels.command_processor

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 2000 | Bounded channel capacity for the worker command processor |

worker.mpsc_channels.command_processor.command_buffer_size

Bounded channel capacity for the worker command processor

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Buffers incoming worker commands; smaller than orchestration (2000 vs 5000) since workers process fewer command types

domain_events

Path: worker.mpsc_channels.domain_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 1000 | Bounded channel capacity for domain event system commands |
| log_dropped_events | bool | true | Log a warning when domain events are dropped due to channel saturation |
| shutdown_drain_timeout_ms | u32 | 5000 | Maximum time in milliseconds to drain pending domain events during shutdown |

worker.mpsc_channels.domain_events.command_buffer_size

Bounded channel capacity for domain event system commands

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers domain event system control commands such as publish, subscribe, and shutdown

worker.mpsc_channels.domain_events.log_dropped_events

Log a warning when domain events are dropped due to channel saturation

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Observability: helps detect when event volume exceeds channel capacity

worker.mpsc_channels.domain_events.shutdown_drain_timeout_ms

Maximum time in milliseconds to drain pending domain events during shutdown

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Ensures in-flight domain events are delivered before the worker exits; prevents event loss

event_listeners

Path: worker.mpsc_channels.event_listeners

| Parameter | Type | Default | Description |
|---|---|---|---|
| pgmq_event_buffer_size | usize | 10000 | Bounded channel capacity for PGMQ event listener notifications on the worker |

worker.mpsc_channels.event_listeners.pgmq_event_buffer_size

Bounded channel capacity for PGMQ event listener notifications on the worker

  • Type: usize
  • Default: 10000
  • Valid Range: 1000-1000000
  • System Impact: Buffers PGMQ LISTEN/NOTIFY events; smaller than orchestration (10000 vs 50000) since workers handle fewer event types

event_subscribers

Path: worker.mpsc_channels.event_subscribers

| Parameter | Type | Default | Description |
|---|---|---|---|
| completion_buffer_size | usize | 1000 | Bounded channel capacity for step completion event subscribers |
| result_buffer_size | usize | 1000 | Bounded channel capacity for step result event subscribers |

worker.mpsc_channels.event_subscribers.completion_buffer_size

Bounded channel capacity for step completion event subscribers

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step completion notifications before they are forwarded to the orchestration service

worker.mpsc_channels.event_subscribers.result_buffer_size

Bounded channel capacity for step result event subscribers

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step execution results before they are published to the result queue

event_systems

Path: worker.mpsc_channels.event_systems

| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 2000 | Bounded channel capacity for the worker event system internal channel |

worker.mpsc_channels.event_systems.event_channel_buffer_size

Bounded channel capacity for the worker event system internal channel

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Buffers events between the listener and processor; sized for worker-level throughput

ffi_dispatch

Path: worker.mpsc_channels.ffi_dispatch

| Parameter | Type | Default | Description |
|---|---|---|---|
| callback_timeout_ms | u32 | 5000 | Maximum time in milliseconds for FFI fire-and-forget domain event callbacks |
| completion_send_timeout_ms | u32 | 10000 | Maximum time in milliseconds to retry sending FFI completion results when the channel is full |
| completion_timeout_ms | u32 | 30000 | Maximum time in milliseconds to wait for an FFI step handler to complete |
| dispatch_buffer_size | usize | 1000 | Bounded channel capacity for FFI step dispatch requests |
| starvation_warning_threshold_ms | u32 | 10000 | Age in milliseconds of pending FFI events that triggers a starvation warning |

worker.mpsc_channels.ffi_dispatch.callback_timeout_ms

Maximum time in milliseconds for FFI fire-and-forget domain event callbacks

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Prevents indefinite blocking of FFI threads during domain event publishing

worker.mpsc_channels.ffi_dispatch.completion_send_timeout_ms

Maximum time in milliseconds to retry sending FFI completion results when the channel is full

  • Type: u32
  • Default: 10000
  • Valid Range: 1000-300000
  • System Impact: Uses try_send with retry loop instead of blocking send to prevent deadlocks
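
The "try_send with retry loop" behaviour called out above can be pictured like this: keep retrying while the channel is full, up to completion_send_timeout_ms, instead of blocking the FFI thread on an awaited send. The helper and its retry pause are illustrative, not the engine's code.

```rust
use std::time::{Duration, Instant};
use tokio::sync::mpsc::{error::TrySendError, Sender};

/// Try to deliver a completion result without blocking indefinitely.
async fn send_completion<T>(tx: &Sender<T>, mut msg: T) -> Result<(), T> {
    let deadline = Instant::now() + Duration::from_millis(10_000); // completion_send_timeout_ms
    loop {
        match tx.try_send(msg) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Full(m)) if Instant::now() < deadline => {
                msg = m;
                // Brief pause before retrying while the channel drains.
                tokio::time::sleep(Duration::from_millis(10)).await;
            }
            // Deadline exceeded or channel closed: give the message back to the caller.
            Err(TrySendError::Full(m)) | Err(TrySendError::Closed(m)) => return Err(m),
        }
    }
}
```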

worker.mpsc_channels.ffi_dispatch.completion_timeout_ms

Maximum time in milliseconds to wait for an FFI step handler to complete

  • Type: u32
  • Default: 30000
  • Valid Range: 1000-600000
  • System Impact: FFI handlers exceeding this timeout are considered failed; guards against hung FFI threads

worker.mpsc_channels.ffi_dispatch.dispatch_buffer_size

Bounded channel capacity for FFI step dispatch requests

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step execution requests destined for Ruby/Python FFI handlers

worker.mpsc_channels.ffi_dispatch.starvation_warning_threshold_ms

Age in milliseconds of pending FFI events that triggers a starvation warning

  • Type: u32
  • Default: 10000
  • Valid Range: 1000-300000
  • System Impact: Proactive detection of FFI channel starvation before completion_timeout_ms is reached

handler_dispatch

Path: worker.mpsc_channels.handler_dispatch

| Parameter | Type | Default | Description |
|---|---|---|---|
| completion_buffer_size | usize | 1000 | Bounded channel capacity for step handler completion notifications |
| dispatch_buffer_size | usize | 1000 | Bounded channel capacity for step handler dispatch requests |
| handler_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a step handler to complete execution |
| max_concurrent_handlers | u32 | 10 | Maximum number of step handlers executing simultaneously |

worker.mpsc_channels.handler_dispatch.completion_buffer_size

Bounded channel capacity for step handler completion notifications

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers handler completion results before they are forwarded to the result processor

worker.mpsc_channels.handler_dispatch.dispatch_buffer_size

Bounded channel capacity for step handler dispatch requests

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers incoming step execution requests before handler assignment

worker.mpsc_channels.handler_dispatch.handler_timeout_ms

Maximum time in milliseconds for a step handler to complete execution

  • Type: u32
  • Default: 30000
  • Valid Range: 1000-600000
  • System Impact: Handlers exceeding this timeout are cancelled; prevents hung handlers from consuming capacity

worker.mpsc_channels.handler_dispatch.max_concurrent_handlers

Maximum number of step handlers executing simultaneously

  • Type: u32
  • Default: 10
  • Valid Range: 1-10000
  • System Impact: Controls per-worker parallelism; bounded by the handler dispatch semaphore
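
max_concurrent_handlers and handler_timeout_ms together bound per-worker parallelism and per-handler runtime. A sketch of that pattern with a Tokio semaphore and timeout; the dispatch function and its shape are hypothetical.

```rust
use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::timeout};

async fn dispatch<F, Fut>(handlers: Arc<Semaphore>, handler: F)
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    // max_concurrent_handlers: `handlers` is created once with Semaphore::new(10);
    // acquiring a permit pauses dispatch while all 10 are busy.
    let _permit = handlers.acquire_owned().await.expect("semaphore closed");

    // handler_timeout_ms: a handler running past 30s is cancelled when its future is dropped.
    if timeout(Duration::from_millis(30_000), handler()).await.is_err() {
        eprintln!("step handler exceeded handler_timeout_ms and was cancelled");
    }
}
```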

load_shedding

Path: worker.mpsc_channels.handler_dispatch.load_shedding

| Parameter | Type | Default | Description |
|---|---|---|---|
| capacity_threshold_percent | f64 | 80.0 | Handler capacity percentage above which new step claims are refused |
| enabled | bool | true | Enable load shedding to refuse step claims when handler capacity is nearly exhausted |
| warning_threshold_percent | f64 | 70.0 | Handler capacity percentage at which warning logs are emitted |

worker.mpsc_channels.handler_dispatch.load_shedding.capacity_threshold_percent

Handler capacity percentage above which new step claims are refused

  • Type: f64
  • Default: 80.0
  • Valid Range: 0.0-100.0
  • System Impact: At 80%, the worker stops accepting new steps when 80% of max_concurrent_handlers are busy

worker.mpsc_channels.handler_dispatch.load_shedding.enabled

Enable load shedding to refuse step claims when handler capacity is nearly exhausted

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, the worker refuses new step claims above the capacity threshold to prevent overload

worker.mpsc_channels.handler_dispatch.load_shedding.warning_threshold_percent

Handler capacity percentage at which warning logs are emitted

  • Type: f64
  • Default: 70.0
  • Valid Range: 0.0-100.0
  • System Impact: Observability: alerts operators that the worker is approaching its capacity limit
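
A minimal sketch of the load-shedding decision these three parameters describe; the function name and logging are illustrative.

```rust
/// Decide whether a worker with `busy_handlers` of `max_concurrent_handlers` in use
/// should accept another step claim.
fn should_accept_step(busy_handlers: u32, max_concurrent_handlers: u32) -> bool {
    let capacity_threshold_percent = 80.0; // refuse claims at or above this utilization
    let warning_threshold_percent = 70.0;  // warn at or above this utilization

    let utilization = busy_handlers as f64 / max_concurrent_handlers as f64 * 100.0;
    if utilization >= warning_threshold_percent {
        eprintln!("handler capacity at {utilization:.0}% of max_concurrent_handlers");
    }
    utilization < capacity_threshold_percent
}

fn main() {
    assert!(should_accept_step(7, 10));  // 70%: warn, but still accept
    assert!(!should_accept_step(8, 10)); // 80%: refuse new claims
}
```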

in_process_events

Path: worker.mpsc_channels.in_process_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| broadcast_buffer_size | usize | 2000 | Bounded broadcast channel capacity for in-process domain event delivery |
| dispatch_timeout_ms | u32 | 5000 | Maximum time in milliseconds to wait when dispatching an in-process event |
| log_subscriber_errors | bool | true | Log errors when in-process event subscribers fail to receive events |

worker.mpsc_channels.in_process_events.broadcast_buffer_size

Bounded broadcast channel capacity for in-process domain event delivery

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Controls how many domain events can be buffered before slow subscribers cause backpressure

worker.mpsc_channels.in_process_events.dispatch_timeout_ms

Maximum time in milliseconds to wait when dispatching an in-process event

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Prevents event dispatch from blocking indefinitely if all subscribers are slow

worker.mpsc_channels.in_process_events.log_subscriber_errors

Log errors when in-process event subscribers fail to receive events

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Observability: helps identify slow or failing event subscribers
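
These parameters describe a Tokio-style broadcast channel, where a subscriber that falls more than broadcast_buffer_size events behind starts losing events. A sketch; the event type is a placeholder.

```rust
use tokio::sync::broadcast;

#[derive(Clone, Debug)]
struct DomainEvent; // placeholder payload

#[tokio::main]
async fn main() {
    // broadcast_buffer_size: each subscriber may lag by up to 2000 events.
    let (tx, mut rx) = broadcast::channel::<DomainEvent>(2_000);

    tokio::spawn(async move {
        loop {
            match rx.recv().await {
                Ok(_event) => { /* handle the in-process domain event */ }
                Err(broadcast::error::RecvError::Lagged(missed)) => {
                    // log_subscriber_errors: this subscriber fell behind and lost events.
                    eprintln!("subscriber lagged and missed {missed} events");
                }
                Err(broadcast::error::RecvError::Closed) => break,
            }
        }
    });

    let _ = tx.send(DomainEvent);
}
```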

orchestration_client

Path: worker.orchestration_client

| Parameter | Type | Default | Description |
|---|---|---|---|
| base_url | String | "http://localhost:8080" | Base URL of the orchestration REST API that this worker reports to |
| max_retries | u32 | 3 | Maximum retry attempts for failed orchestration API calls |
| timeout_ms | u32 | 30000 | HTTP request timeout in milliseconds for orchestration API calls |

worker.orchestration_client.base_url

Base URL of the orchestration REST API that this worker reports to

  • Type: String
  • Default: "http://localhost:8080"
  • Valid Range: valid HTTP(S) URL
  • System Impact: Workers send step completion results and health reports to this endpoint

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | http://orchestration:8080 | Container-internal DNS in Kubernetes/Docker |
| test | http://localhost:8080 | Local orchestration for testing |

Related: orchestration.web.bind_address

worker.orchestration_client.max_retries

Maximum retry attempts for failed orchestration API calls

  • Type: u32
  • Default: 3
  • Valid Range: 0-10
  • System Impact: Retries use backoff; higher values improve resilience to transient network issues

worker.orchestration_client.timeout_ms

HTTP request timeout in milliseconds for orchestration API calls

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Worker-to-orchestration calls exceeding this timeout fail and may be retried
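
Putting base_url, timeout_ms, and max_retries together, a worker's call to the orchestration API might look like the following reqwest sketch; the endpoint path and the pause between attempts are assumptions, not the actual client.

```rust
use std::time::Duration;

async fn post_step_result(body: &serde_json::Value) -> Result<(), reqwest::Error> {
    // timeout_ms: applies to every orchestration API call made by this client.
    let client = reqwest::Client::builder()
        .timeout(Duration::from_millis(30_000))
        .build()?;

    let max_retries: u32 = 3; // worker.orchestration_client.max_retries
    let mut attempt: u32 = 0;
    loop {
        // base_url is the configured orchestration endpoint; the path here is illustrative.
        let result = client
            .post("http://localhost:8080/v1/steps/results")
            .json(body)
            .send()
            .await;

        match result {
            Ok(resp) if resp.status().is_success() => return Ok(()),
            _ if attempt < max_retries => {
                attempt += 1;
                // Simple exponential pause between attempts; the real backoff policy may differ.
                tokio::time::sleep(Duration::from_millis(100 * 2u64.pow(attempt))).await;
            }
            Ok(resp) => return resp.error_for_status().map(|_| ()),
            Err(err) => return Err(err),
        }
    }
}
```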

web

Path: worker.web

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}" | Socket address for the worker REST API server |
| enabled | bool | true | Enable the REST API server for the worker service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a worker HTTP request to complete before timeout |

worker.web.bind_address

Socket address for the worker REST API server

  • Type: String
  • Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}"
  • Valid Range: host:port
  • System Impact: Must not conflict with orchestration.web.bind_address when co-located; default 8081

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8081 | Standard worker port; use TASKER_WEB_BIND_ADDRESS env var to override |
| test | 0.0.0.0:8081 | Default port offset from orchestration (8080) |

worker.web.enabled

Enable the REST API server for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no HTTP endpoints are available; the worker operates via messaging only

worker.web.request_timeout_ms

Maximum time in milliseconds for a worker HTTP request to complete before timeout

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections

auth

Path: worker.web.auth

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication on the worker API |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication on the worker API |
| enabled | bool | false | Enable authentication for the worker REST API |
| jwt_audience | String | "worker-api" | Expected ‘aud’ claim in JWT tokens for the worker API |
| jwt_issuer | String | "tasker-worker" | Expected ‘iss’ claim in JWT tokens for the worker API |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if the worker issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures on the worker API |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for worker JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours for worker API tokens |

worker.web.auth.api_key

Static API key for simple key-based authentication on the worker API

  • Type: String
  • Default: ""
  • Valid Range: non-empty string or empty to disable
  • System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header

worker.web.auth.api_key_header

HTTP header name for API key authentication on the worker API

  • Type: String
  • Default: "X-API-Key"
  • Valid Range: valid HTTP header name
  • System Impact: Clients send their API key in this header; default is X-API-Key

worker.web.auth.enabled

Enable authentication for the worker REST API

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all worker API endpoints are unauthenticated

worker.web.auth.jwt_audience

Expected ‘aud’ claim in JWT tokens for the worker API

  • Type: String
  • Default: "worker-api"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different audience are rejected during validation

worker.web.auth.jwt_issuer

Expected ‘iss’ claim in JWT tokens for the worker API

  • Type: String
  • Default: "tasker-worker"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different issuer are rejected; default ‘tasker-worker’ distinguishes worker tokens from orchestration tokens

worker.web.auth.jwt_private_key

PEM-encoded private key for signing JWT tokens (if the worker issues tokens)

  • Type: String
  • Default: ""
  • Valid Range: valid PEM private key or empty
  • System Impact: Required only if the worker service issues its own JWT tokens; typically empty

worker.web.auth.jwt_public_key

PEM-encoded public key for verifying JWT token signatures on the worker API

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY:-}"
  • Valid Range: valid PEM public key or empty
  • System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management

worker.web.auth.jwt_public_key_path

File path to a PEM-encoded public key for worker JWT verification

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
  • Valid Range: valid file path or empty
  • System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file

worker.web.auth.jwt_token_expiry_hours

Default JWT token validity period in hours for worker API tokens

  • Type: u32
  • Default: 24
  • Valid Range: 1-720
  • System Impact: Tokens older than this are rejected; shorter values improve security

database_pools

Path: worker.web.database_pools

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 25 | Advisory hint for the total number of database connections across all worker pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the worker web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle worker web API connection is closed |
| web_api_max_connections | u32 | 15 | Maximum number of connections the worker web API pool can grow to under load |
| web_api_pool_size | u32 | 10 | Target number of connections in the worker web API database pool |

worker.web.database_pools.max_total_connections_hint

Advisory hint for the total number of database connections across all worker pools

  • Type: u32
  • Default: 25
  • Valid Range: 1-1000
  • System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint

worker.web.database_pools.web_api_connection_timeout_seconds

Maximum time to wait when acquiring a connection from the worker web API pool

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: Worker API requests that cannot acquire a connection within this window return an error

worker.web.database_pools.web_api_idle_timeout_seconds

Time before an idle worker web API connection is closed

  • Type: u32
  • Default: 600
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the worker web API pool shrinks after traffic subsides

worker.web.database_pools.web_api_max_connections

Maximum number of connections the worker web API pool can grow to under load

  • Type: u32
  • Default: 15
  • Valid Range: 1-500
  • System Impact: Hard ceiling for worker web API database connections

worker.web.database_pools.web_api_pool_size

Target number of connections in the worker web API database pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-200
  • System Impact: Determines how many concurrent database queries the worker REST API can execute; smaller than orchestration

Generated by tasker-ctl docs · Tasker Configuration System

Configuration Reference: orchestration

91/91 parameters documented


orchestration

Root-level orchestration parameters

Path: orchestration

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_performance_logging | bool | true | Enable detailed performance logging for orchestration actors |
| shutdown_timeout_ms | u64 | 30000 | Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown |

orchestration.enable_performance_logging

Enable detailed performance logging for orchestration actors

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Emits timing metrics for task processing, step enqueueing, and result evaluation; disable in production if log volume is a concern

orchestration.shutdown_timeout_ms

Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown

  • Type: u64
  • Default: 30000
  • Valid Range: 1000-300000
  • System Impact: If shutdown exceeds this timeout, the process exits forcefully to avoid hanging indefinitely; 30s is conservative for most deployments

batch_processing

Path: orchestration.batch_processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| checkpoint_stall_minutes | u32 | 15 | Minutes without a checkpoint update before a batch is considered stalled |
| default_batch_size | u32 | 1000 | Default number of items in a single batch when not specified by the handler |
| enabled | bool | true | Enable the batch processing subsystem for large-scale step execution |
| max_parallel_batches | u32 | 50 | Maximum number of batch operations that can execute concurrently |

orchestration.batch_processing.checkpoint_stall_minutes

Minutes without a checkpoint update before a batch is considered stalled

  • Type: u32
  • Default: 15
  • Valid Range: 1-1440
  • System Impact: Stalled batches are flagged for investigation or automatic recovery; lower values detect issues faster

orchestration.batch_processing.default_batch_size

Default number of items in a single batch when not specified by the handler

  • Type: u32
  • Default: 1000
  • Valid Range: 1-100000
  • System Impact: Larger batches improve throughput but increase memory usage and per-batch latency

orchestration.batch_processing.enabled

Enable the batch processing subsystem for large-scale step execution

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, batch step handlers cannot be used; all steps must be processed individually

orchestration.batch_processing.max_parallel_batches

Maximum number of batch operations that can execute concurrently

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Bounds resource usage from concurrent batch processing; increase for high-throughput batch workloads

decision_points

Path: orchestration.decision_points

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_detailed_logging | bool | false | Enable verbose logging of decision point evaluation including expression results |
| enable_metrics | bool | true | Enable metrics collection for decision point evaluations |
| enabled | bool | true | Enable the decision point evaluation subsystem for conditional workflow branching |
| max_decision_depth | u32 | 20 | Maximum depth of nested decision point chains |
| max_steps_per_decision | u32 | 100 | Maximum number of steps that can be generated by a single decision point evaluation |
| warn_threshold_depth | u32 | 10 | Decision depth above which a warning is logged |
| warn_threshold_steps | u32 | 50 | Number of steps per decision above which a warning is logged |

orchestration.decision_points.enable_detailed_logging

Enable verbose logging of decision point evaluation including expression results

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: Produces high-volume logs; enable only for debugging specific decision point behavior

orchestration.decision_points.enable_metrics

Enable metrics collection for decision point evaluations

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks evaluation counts, timings, and branch selection distribution

orchestration.decision_points.enabled

Enable the decision point evaluation subsystem for conditional workflow branching

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, all decision points are skipped and conditional steps are not evaluated

orchestration.decision_points.max_decision_depth

Maximum depth of nested decision point chains

  • Type: u32
  • Default: 20
  • Valid Range: 1-100
  • System Impact: Prevents infinite recursion from circular decision point references

orchestration.decision_points.max_steps_per_decision

Maximum number of steps that can be generated by a single decision point evaluation

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Safety limit to prevent decision points from creating unbounded step graphs

orchestration.decision_points.warn_threshold_depth

Decision depth above which a warning is logged

  • Type: u32
  • Default: 10
  • Valid Range: 1-100
  • System Impact: Observability: identifies deeply nested decision chains that may indicate design issues

orchestration.decision_points.warn_threshold_steps

Number of steps per decision above which a warning is logged

  • Type: u32
  • Default: 50
  • Valid Range: 1-10000
  • System Impact: Observability: identifies decision points that generate unusually large step sets

dlq

Path: orchestration.dlq

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable the Dead Letter Queue subsystem for handling unrecoverable tasks |

orchestration.dlq.enabled

Enable the Dead Letter Queue subsystem for handling unrecoverable tasks

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, stale or failed tasks remain in their error state without DLQ routing

staleness_detection

Path: orchestration.dlq.staleness_detection

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 100 | Number of potentially stale tasks to evaluate in a single detection sweep |
| detection_interval_seconds | u32 | 300 | Interval in seconds between staleness detection sweeps |
| dry_run | bool | false | Run staleness detection in observation-only mode without taking action |
| enabled | bool | true | Enable periodic scanning for stale tasks |

orchestration.dlq.staleness_detection.batch_size

Number of potentially stale tasks to evaluate in a single detection sweep

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Larger batches process more stale tasks per sweep but increase per-sweep query cost

orchestration.dlq.staleness_detection.detection_interval_seconds

Interval in seconds between staleness detection sweeps

  • Type: u32
  • Default: 300
  • Valid Range: 30-3600
  • System Impact: Lower values detect stale tasks faster but increase database query frequency

orchestration.dlq.staleness_detection.dry_run

Run staleness detection in observation-only mode without taking action

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: Logs what would be DLQ’d without actually transitioning tasks; useful for tuning thresholds

orchestration.dlq.staleness_detection.enabled

Enable periodic scanning for stale tasks

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no automatic staleness detection runs; tasks must be manually DLQ’d

actions

Path: orchestration.dlq.staleness_detection.actions

| Parameter | Type | Default | Description |
|---|---|---|---|
| auto_move_to_dlq | bool | true | Automatically move stale tasks to the DLQ after transitioning to error |
| auto_transition_to_error | bool | true | Automatically transition stale tasks to the Error state |
| emit_events | bool | true | Emit domain events when staleness is detected |
| event_channel | String | "task_staleness_detected" | PGMQ channel name for staleness detection events |

orchestration.dlq.staleness_detection.actions.auto_move_to_dlq

Automatically move stale tasks to the DLQ after transitioning to error

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, stale tasks are routed to the DLQ; when false, they remain in Error state for manual review

orchestration.dlq.staleness_detection.actions.auto_transition_to_error

Automatically transition stale tasks to the Error state

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, stale tasks are moved to Error before DLQ routing; when false, tasks stay in their current state

orchestration.dlq.staleness_detection.actions.emit_events

Emit domain events when staleness is detected

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, staleness events are published to the event_channel for external alerting or custom handling

orchestration.dlq.staleness_detection.actions.event_channel

PGMQ channel name for staleness detection events

  • Type: String
  • Default: "task_staleness_detected"
  • Valid Range: 1-255 characters
  • System Impact: Consumers can subscribe to this channel for alerting or custom staleness handling

thresholds

Path: orchestration.dlq.staleness_detection.thresholds

| Parameter | Type | Default | Description |
|---|---|---|---|
| steps_in_process_minutes | u32 | 30 | Minutes a task can have steps in process before being considered stale |
| task_max_lifetime_hours | u32 | 24 | Absolute maximum lifetime for any task regardless of state |
| waiting_for_dependencies_minutes | u32 | 60 | Minutes a task can wait for step dependencies before being considered stale |
| waiting_for_retry_minutes | u32 | 30 | Minutes a task can wait for step retries before being considered stale |

orchestration.dlq.staleness_detection.thresholds.steps_in_process_minutes

Minutes a task can have steps in process before being considered stale

  • Type: u32
  • Default: 30
  • Valid Range: 1-1440
  • System Impact: Tasks in StepsInProcess state exceeding this age may have hung workers; flags for investigation

orchestration.dlq.staleness_detection.thresholds.task_max_lifetime_hours

Absolute maximum lifetime for any task regardless of state

  • Type: u32
  • Default: 24
  • Valid Range: 1-168
  • System Impact: Hard cap; tasks exceeding this age are considered stale even if actively processing

orchestration.dlq.staleness_detection.thresholds.waiting_for_dependencies_minutes

Minutes a task can wait for step dependencies before being considered stale

  • Type: u32
  • Default: 60
  • Valid Range: 1-1440
  • System Impact: Tasks in WaitingForDependencies state exceeding this age are flagged for DLQ consideration

orchestration.dlq.staleness_detection.thresholds.waiting_for_retry_minutes

Minutes a task can wait for step retries before being considered stale

  • Type: u32
  • Default: 30
  • Valid Range: 1-1440
  • System Impact: Tasks in WaitingForRetry state exceeding this age are flagged for DLQ consideration
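
These thresholds are per-state age limits plus a global lifetime cap. A sketch of how a detection sweep might evaluate them; the state enum mirrors the names in this section and is illustrative, not the engine's types.

```rust
use std::time::Duration;

/// Task states referenced by the thresholds above (illustrative).
enum TaskState {
    StepsInProcess,
    WaitingForDependencies,
    WaitingForRetry,
    Other,
}

/// A task is stale once it exceeds its per-state threshold, or task_max_lifetime_hours overall.
fn is_stale(state: &TaskState, time_in_state: Duration, total_age: Duration) -> bool {
    let max_lifetime = Duration::from_secs(24 * 3600); // task_max_lifetime_hours
    let state_threshold = match state {
        TaskState::StepsInProcess => Duration::from_secs(30 * 60), // steps_in_process_minutes
        TaskState::WaitingForDependencies => Duration::from_secs(60 * 60), // waiting_for_dependencies_minutes
        TaskState::WaitingForRetry => Duration::from_secs(30 * 60), // waiting_for_retry_minutes
        TaskState::Other => max_lifetime,
    };
    total_age >= max_lifetime || time_in_state >= state_threshold
}

fn main() {
    let stale = is_stale(
        &TaskState::WaitingForDependencies,
        Duration::from_secs(61 * 60),
        Duration::from_secs(2 * 3600),
    );
    assert!(stale); // 61 minutes waiting for dependencies exceeds the 60-minute threshold
}
```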

event_systems

Path: orchestration.event_systems

orchestration

Path: orchestration.event_systems.orchestration

| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "orchestration-event-system" | Unique identifier for the orchestration event system instance |

orchestration.event_systems.orchestration.deployment_mode

Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency

orchestration.event_systems.orchestration.system_id

Unique identifier for the orchestration event system instance

  • Type: String
  • Default: "orchestration-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish this event system from others

health

Path: orchestration.event_systems.orchestration.health

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the orchestration event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the event system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the event system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics collection for event processing |

orchestration.event_systems.orchestration.health.enabled

Enable health monitoring for the orchestration event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks or error tracking run for this event system

orchestration.event_systems.orchestration.health.error_rate_threshold_per_minute

Error rate per minute above which the event system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal; complements max_consecutive_errors for burst error detection

orchestration.event_systems.orchestration.health.max_consecutive_errors

Number of consecutive errors before the event system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation after sustained failures; resets on any success

orchestration.event_systems.orchestration.health.performance_monitoring_enabled

Enable detailed performance metrics collection for event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks processing latency percentiles and throughput; adds minor overhead

processing

Path: orchestration.event_systems.orchestration.processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read |
| max_concurrent_operations | u32 | 50 | Maximum number of events processed concurrently by the orchestration event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed event processing operation |

orchestration.event_systems.orchestration.processing.batch_size

Number of events dequeued in a single batch read

  • Type: u32
  • Default: 20
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput but increase per-batch processing time

orchestration.event_systems.orchestration.processing.max_concurrent_operations

Maximum number of events processed concurrently by the orchestration event system

  • Type: u32
  • Default: 50
  • Valid Range: 1-10000
  • System Impact: Controls parallelism for task request, result, and finalization processing

orchestration.event_systems.orchestration.processing.max_retries

Maximum retry attempts for a failed event processing operation

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Events exceeding this retry count are dropped or sent to the DLQ

backoff

Path: orchestration.event_systems.orchestration.processing.backoff

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first event processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between event processing retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |

timing

Path: orchestration.event_systems.orchestration.timing

| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds an event claim remains valid |
| fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the orchestration event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued message remains invisible to other consumers |

orchestration.event_systems.orchestration.timing.claim_timeout_seconds

Maximum time in seconds an event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned claims from blocking event processing indefinitely

orchestration.event_systems.orchestration.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Only active in Hybrid mode when event-driven delivery fails; lower values reduce latency but increase DB load

orchestration.event_systems.orchestration.timing.health_check_interval_seconds

Interval in seconds between health check probes for the orchestration event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the event system verifies its own connectivity and responsiveness

orchestration.event_systems.orchestration.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Events exceeding this timeout are considered failed and may be retried

orchestration.event_systems.orchestration.timing.visibility_timeout_seconds

Time in seconds a dequeued message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: If processing is not completed within this window, the message becomes visible again for redelivery

task_readiness

Path: orchestration.event_systems.task_readiness

| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "task-readiness-event-system" | Unique identifier for the task readiness event system instance |

orchestration.event_systems.task_readiness.deployment_mode

Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; task readiness events trigger step processing and benefit from low-latency delivery

orchestration.event_systems.task_readiness.system_id

Unique identifier for the task readiness event system instance

  • Type: String
  • Default: "task-readiness-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish task readiness events from other event systems

health

Path: orchestration.event_systems.task_readiness.health

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the task readiness event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the task readiness system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the task readiness system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for task readiness event processing |

orchestration.event_systems.task_readiness.health.enabled

Enable health monitoring for the task readiness event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks run for task readiness processing

orchestration.event_systems.task_readiness.health.error_rate_threshold_per_minute

Error rate per minute above which the task readiness system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal complementing max_consecutive_errors

orchestration.event_systems.task_readiness.health.max_consecutive_errors

Number of consecutive errors before the task readiness system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation; resets on any successful readiness check

orchestration.event_systems.task_readiness.health.performance_monitoring_enabled

Enable detailed performance metrics for task readiness event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks readiness check latency and throughput; useful for tuning batch_size and concurrency

processing

Path: orchestration.event_systems.task_readiness.processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 50 | Number of task readiness events dequeued in a single batch |
| max_concurrent_operations | u32 | 100 | Maximum number of task readiness events processed concurrently |
| max_retries | u32 | 3 | Maximum retry attempts for a failed task readiness event |

orchestration.event_systems.task_readiness.processing.batch_size

Number of task readiness events dequeued in a single batch

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput for readiness evaluation; 50 balances latency and throughput

orchestration.event_systems.task_readiness.processing.max_concurrent_operations

Maximum number of task readiness events processed concurrently

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Higher than orchestration (100 vs 50) because readiness checks are lightweight SQL queries

orchestration.event_systems.task_readiness.processing.max_retries

Maximum retry attempts for a failed task readiness event

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Readiness events are idempotent so retries are safe; limits retry storms

backoff

Path: orchestration.event_systems.task_readiness.processing.backoff

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first task readiness processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay for readiness retries |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds for task readiness retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive readiness failure |

timing

Path: orchestration.event_systems.task_readiness.timing

| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds a task readiness event claim remains valid |
| fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles for task readiness |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the task readiness event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single task readiness event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued task readiness message remains invisible to other consumers |

orchestration.event_systems.task_readiness.timing.claim_timeout_seconds

Maximum time in seconds a task readiness event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned readiness claims from blocking task evaluation

orchestration.event_systems.task_readiness.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles for task readiness

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Fallback interval when LISTEN/NOTIFY is unavailable; lower values improve responsiveness

orchestration.event_systems.task_readiness.timing.health_check_interval_seconds

Interval in seconds between health check probes for the task readiness event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the task readiness system verifies its own connectivity

orchestration.event_systems.task_readiness.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single task readiness event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Readiness events exceeding this timeout are considered failed

orchestration.event_systems.task_readiness.timing.visibility_timeout_seconds

Time in seconds a dequeued task readiness message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Prevents duplicate processing of readiness events during normal operation

grpc

Path: orchestration.grpc

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}" | Socket address for the gRPC server |
| enable_health_service | bool | true | Enable the gRPC health checking service (grpc.health.v1) |
| enable_reflection | bool | true | Enable gRPC server reflection for service discovery |
| enabled | bool | true | Enable the gRPC API server alongside the REST API |
| keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames |
| keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
| max_concurrent_streams | u32 | 200 | Maximum number of concurrent gRPC streams per connection |
| max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame |
| tls_enabled | bool | false | Enable TLS encryption for gRPC connections |

orchestration.grpc.bind_address

Socket address for the gRPC server

  • Type: String
  • Default: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
  • Valid Range: host:port
  • System Impact: Must not conflict with the REST API bind_address; default 9190 avoids Prometheus port conflict

orchestration.grpc.enable_health_service

Enable the gRPC health checking service (grpc.health.v1)

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators

orchestration.grpc.enable_reflection

Enable gRPC server reflection for service discovery

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Allows tools like grpcurl to list and inspect services; safe to enable in development, consider disabling in production

orchestration.grpc.enabled

Enable the gRPC API server alongside the REST API

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no gRPC endpoints are available; clients must use REST

orchestration.grpc.keepalive_interval_seconds

Interval in seconds between gRPC keepalive ping frames

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Detects dead connections; lower values detect failures faster but increase network overhead

orchestration.grpc.keepalive_timeout_seconds

Time in seconds to wait for a keepalive ping acknowledgment before closing the connection

  • Type: u32
  • Default: 20
  • Valid Range: 1-300
  • System Impact: Connections that fail to acknowledge within this window are considered dead and closed

orchestration.grpc.max_concurrent_streams

Maximum number of concurrent gRPC streams per connection

  • Type: u32
  • Default: 200
  • Valid Range: 1-10000
  • System Impact: Limits multiplexed request parallelism per connection; 200 is conservative for orchestration workloads

orchestration.grpc.max_frame_size

Maximum size in bytes of a single HTTP/2 frame

  • Type: u32
  • Default: 16384
  • Valid Range: 16384-16777215
  • System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream

orchestration.grpc.tls_enabled

Enable TLS encryption for gRPC connections

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When true, tls_cert_path and tls_key_path must be provided; required for production gRPC deployments
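
Because tls_enabled = true also requires certificate and key paths, a production-leaning override might look like the sketch below. The [grpc] table name assumes the documented orchestration.grpc path minus the role prefix, and the certificate file locations are illustrative only:

[grpc]
enabled = true
bind_address = "0.0.0.0:9190"
tls_enabled = true
tls_cert_path = "/etc/tasker/tls/server.crt"   # required when tls_enabled = true
tls_key_path = "/etc/tasker/tls/server.key"    # required when tls_enabled = true
enable_reflection = false                      # consider disabling in production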

mpsc_channels

Path: orchestration.mpsc_channels

command_processor

Path: orchestration.mpsc_channels.command_processor

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 5000 | Bounded channel capacity for the orchestration command processor |

orchestration.mpsc_channels.command_processor.command_buffer_size

Bounded channel capacity for the orchestration command processor

  • Type: usize
  • Default: 5000
  • Valid Range: 100-100000
  • System Impact: Buffers incoming orchestration commands; larger values absorb traffic spikes but use more memory

event_listeners

Path: orchestration.mpsc_channels.event_listeners

| Parameter | Type | Default | Description |
|---|---|---|---|
| pgmq_event_buffer_size | usize | 50000 | Bounded channel capacity for PGMQ event listener notifications |

orchestration.mpsc_channels.event_listeners.pgmq_event_buffer_size

Bounded channel capacity for PGMQ event listener notifications

  • Type: usize
  • Default: 50000
  • Valid Range: 1000-1000000
  • System Impact: Large buffer (50000) absorbs high-volume PGMQ LISTEN/NOTIFY events without backpressure on PostgreSQL

event_systems

Path: orchestration.mpsc_channels.event_systems

| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 10000 | Bounded channel capacity for the orchestration event system internal channel |

orchestration.mpsc_channels.event_systems.event_channel_buffer_size

Bounded channel capacity for the orchestration event system internal channel

  • Type: usize
  • Default: 10000
  • Valid Range: 100-100000
  • System Impact: Buffers events between the event listener and event processor; larger values absorb notification bursts
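
The three orchestration channel capacities trade memory for burst absorption: commands, raw PGMQ notifications, and the listener-to-processor event channel are buffered independently. A hedged tuning sketch for spikier traffic (section names again assume the documented path minus the orchestration. prefix):

[mpsc_channels.command_processor]
command_buffer_size = 10000        # default 5000; absorbs larger command spikes

[mpsc_channels.event_listeners]
pgmq_event_buffer_size = 100000    # default 50000; LISTEN/NOTIFY burst headroom

[mpsc_channels.event_systems]
event_channel_buffer_size = 20000  # default 10000; listener-to-processor buffer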

web

Path: orchestration.web

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}" | Socket address for the REST API server |
| enabled | bool | true | Enable the REST API server for the orchestration service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for an HTTP request to complete before timeout |

orchestration.web.bind_address

Socket address for the REST API server

  • Type: String
  • Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
  • Valid Range: host:port
  • System Impact: Determines where the orchestration REST API listens; use 0.0.0.0 for container deployments

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8080 | Standard port; use TASKER_WEB_BIND_ADDRESS env var to override in CI |
| test | 0.0.0.0:8080 | Default port for test fixtures |

orchestration.web.enabled

Enable the REST API server for the orchestration service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no HTTP endpoints are available; the service operates via messaging only

orchestration.web.request_timeout_ms

Maximum time in milliseconds for an HTTP request to complete before timeout

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections

auth

Path: orchestration.web.auth

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication |
| enabled | bool | false | Enable authentication for the REST API |
| jwt_audience | String | "tasker-api" | Expected ‘aud’ claim in JWT tokens |
| jwt_issuer | String | "tasker-core" | Expected ‘iss’ claim in JWT tokens |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if this service issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours |

orchestration.web.auth.api_key

Static API key for simple key-based authentication

  • Type: String
  • Default: ""
  • Valid Range: non-empty string or empty to disable
  • System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header

orchestration.web.auth.api_key_header

HTTP header name for API key authentication

  • Type: String
  • Default: "X-API-Key"
  • Valid Range: valid HTTP header name
  • System Impact: Clients send their API key in this header; default is X-API-Key

orchestration.web.auth.enabled

Enable authentication for the REST API

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all API endpoints are unauthenticated; enable in production with JWT or API key auth

orchestration.web.auth.jwt_audience

Expected ‘aud’ claim in JWT tokens

  • Type: String
  • Default: "tasker-api"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different audience are rejected during validation

orchestration.web.auth.jwt_issuer

Expected ‘iss’ claim in JWT tokens

  • Type: String
  • Default: "tasker-core"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different issuer are rejected during validation

orchestration.web.auth.jwt_private_key

PEM-encoded private key for signing JWT tokens (if this service issues tokens)

  • Type: String
  • Default: ""
  • Valid Range: valid PEM private key or empty
  • System Impact: Required only if the orchestration service issues its own JWT tokens; leave empty when using external identity providers

orchestration.web.auth.jwt_public_key

PEM-encoded public key for verifying JWT token signatures

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY:-}"
  • Valid Range: valid PEM public key or empty
  • System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management in production

orchestration.web.auth.jwt_public_key_path

File path to a PEM-encoded public key for JWT verification

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
  • Valid Range: valid file path or empty
  • System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file

orchestration.web.auth.jwt_token_expiry_hours

Default JWT token validity period in hours

  • Type: u32
  • Default: 24
  • Valid Range: 1-720
  • System Impact: Tokens older than this are rejected; shorter values improve security but require more frequent re-authentication

database_pools

Path: orchestration.web.database_pools

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 50 | Advisory hint for the total number of database connections across all orchestration pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle web API connection is closed |
| web_api_max_connections | u32 | 30 | Maximum number of connections the web API pool can grow to under load |
| web_api_pool_size | u32 | 20 | Target number of connections in the web API database pool |

orchestration.web.database_pools.max_total_connections_hint

Advisory hint for the total number of database connections across all orchestration pools

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint

orchestration.web.database_pools.web_api_connection_timeout_seconds

Maximum time to wait when acquiring a connection from the web API pool

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: API requests that cannot acquire a connection within this window return an error

orchestration.web.database_pools.web_api_idle_timeout_seconds

Time before an idle web API connection is closed

  • Type: u32
  • Default: 600
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the web API pool shrinks after traffic subsides

orchestration.web.database_pools.web_api_max_connections

Maximum number of connections the web API pool can grow to under load

  • Type: u32
  • Default: 30
  • Valid Range: 1-500
  • System Impact: Hard ceiling for web API database connections; prevents connection exhaustion from traffic spikes

orchestration.web.database_pools.web_api_pool_size

Target number of connections in the web API database pool

  • Type: u32
  • Default: 20
  • Valid Range: 1-200
  • System Impact: Determines how many concurrent database queries the REST API can execute
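
These pool settings are related: web_api_pool_size is the steady-state target, web_api_max_connections is the hard ceiling it can grow to, and max_total_connections_hint is an advisory figure that should cover all orchestration pools combined. A sketch keeping that ordering intact (TOML path assumed as above):

[web.database_pools]
web_api_pool_size = 20                   # steady-state connections
web_api_max_connections = 30             # burst ceiling; keep >= pool_size
web_api_connection_timeout_seconds = 30  # error out if no connection within 30 s
web_api_idle_timeout_seconds = 600       # shrink back after traffic subsides
max_total_connections_hint = 50          # advisory only; logged if exceeded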


Configuration Reference: worker

90/90 parameters documented


worker

Root-level worker parameters

Path: worker

| Parameter | Type | Default | Description |
|---|---|---|---|
| worker_id | String | "worker-default-001" | Unique identifier for this worker instance |
| worker_type | String | "general" | Worker type classification for routing and reporting |

worker.worker_id

Unique identifier for this worker instance

  • Type: String
  • Default: "worker-default-001"
  • Valid Range: non-empty string
  • System Impact: Used in logging, metrics, and step claim attribution; must be unique across all worker instances in a cluster

worker.worker_type

Worker type classification for routing and reporting

  • Type: String
  • Default: "general"
  • Valid Range: non-empty string
  • System Impact: Used to match worker capabilities with step handler requirements; ‘general’ handles all step types

circuit_breakers

Path: worker.circuit_breakers

ffi_completion_send

Path: worker.circuit_breakers.ffi_completion_send

| Parameter | Type | Default | Description |
|---|---|---|---|
| failure_threshold | u32 | 5 | Number of consecutive FFI completion send failures before the circuit breaker trips |
| recovery_timeout_seconds | u32 | 5 | Time the FFI completion breaker stays Open before probing with a test send |
| slow_send_threshold_ms | u32 | 100 | Threshold in milliseconds above which FFI completion channel sends are logged as slow |
| success_threshold | u32 | 2 | Consecutive successful sends in Half-Open required to close the breaker |

worker.circuit_breakers.ffi_completion_send.failure_threshold

Number of consecutive FFI completion send failures before the circuit breaker trips

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects the FFI completion channel from cascading failures; when tripped, sends are short-circuited

worker.circuit_breakers.ffi_completion_send.recovery_timeout_seconds

Time the FFI completion breaker stays Open before probing with a test send

  • Type: u32
  • Default: 5
  • Valid Range: 1-300
  • System Impact: Short timeout (5s) because FFI channel issues are typically transient

worker.circuit_breakers.ffi_completion_send.slow_send_threshold_ms

Threshold in milliseconds above which FFI completion channel sends are logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-10000
  • System Impact: Observability: identifies when the FFI completion channel is under pressure from slow consumers

worker.circuit_breakers.ffi_completion_send.success_threshold

Consecutive successful sends in Half-Open required to close the breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Low threshold (2) allows fast recovery since FFI send failures are usually transient
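
These four values describe a conventional Closed → Open → Half-Open cycle: the breaker trips after failure_threshold consecutive failed sends, stays Open for recovery_timeout_seconds, then needs success_threshold clean sends in Half-Open to close again. A worker.toml sketch with the defaults spelled out (section name assumes the documented worker.circuit_breakers path minus the worker. prefix):

[circuit_breakers.ffi_completion_send]
failure_threshold = 5          # consecutive failures before tripping Open
recovery_timeout_seconds = 5   # time spent Open before a Half-Open probe
success_threshold = 2          # clean Half-Open sends required to close
slow_send_threshold_ms = 100   # sends slower than this are logged as slow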

event_systems

Path: worker.event_systems

worker

Path: worker.event_systems.worker

| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "worker-event-system" | Unique identifier for the worker event system instance |

worker.event_systems.worker.deployment_mode

Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
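
Switching delivery modes is a single setting. For example, an environment where PostgreSQL LISTEN/NOTIFY is unavailable could force polling, as in this sketch (section name assumes the documented worker.event_systems.worker path minus the worker. prefix):

[event_systems.worker]
system_id = "worker-event-system"
deployment_mode = "PollingOnly"   # alternatives: "Hybrid" (default) or "EventDrivenOnly"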

worker.event_systems.worker.system_id

Unique identifier for the worker event system instance

  • Type: String
  • Default: "worker-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish this event system from others

health

Path: worker.event_systems.worker.health

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the worker event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the worker event system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the worker event system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for worker event processing |

worker.event_systems.worker.health.enabled

Enable health monitoring for the worker event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks or error tracking run for the worker event system

worker.event_systems.worker.health.error_rate_threshold_per_minute

Error rate per minute above which the worker event system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal complementing max_consecutive_errors

worker.event_systems.worker.health.max_consecutive_errors

Number of consecutive errors before the worker event system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation; resets on any successful event processing

worker.event_systems.worker.health.performance_monitoring_enabled

Enable detailed performance metrics for worker event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks step dispatch latency and throughput; useful for tuning concurrency settings

metadata

Path: worker.event_systems.worker.metadata

fallback_poller

Path: worker.event_systems.worker.metadata.fallback_poller

| Parameter | Type | Default | Description |
|---|---|---|---|
| age_threshold_seconds | u32 | 5 | Minimum age in seconds of a message before the fallback poller will pick it up |
| batch_size | u32 | 20 | Number of messages to dequeue in a single fallback poll |
| enabled | bool | true | Enable the fallback polling mechanism for step dispatch |
| max_age_hours | u32 | 24 | Maximum age in hours of messages the fallback poller will process |
| polling_interval_ms | u32 | 1000 | Interval in milliseconds between fallback polling cycles |
| supported_namespaces | Vec<String> | [] | List of queue namespaces the fallback poller monitors; empty means all namespaces |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a message polled by the fallback mechanism remains invisible |

in_process_events

Path: worker.event_systems.worker.metadata.in_process_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| deduplication_cache_size | usize | 10000 | Number of event IDs to cache for deduplication of in-process events |
| ffi_integration_enabled | bool | true | Enable FFI integration for in-process event delivery to Ruby/Python workers |

listener

Path: worker.event_systems.worker.metadata.listener

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_processing | bool | true | Enable batch processing of accumulated LISTEN/NOTIFY events |
| connection_timeout_seconds | u32 | 30 | Maximum time to wait when establishing the LISTEN/NOTIFY PostgreSQL connection |
| event_timeout_seconds | u32 | 60 | Maximum time to wait for a LISTEN/NOTIFY event before yielding |
| max_retry_attempts | u32 | 5 | Maximum number of listener reconnection attempts before falling back to polling |
| retry_interval_seconds | u32 | 5 | Interval in seconds between LISTEN/NOTIFY listener reconnection attempts |

processing

Path: worker.event_systems.worker.processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read by the worker |
| max_concurrent_operations | u32 | 100 | Maximum number of events processed concurrently by the worker event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed worker event processing operation |

worker.event_systems.worker.processing.batch_size

Number of events dequeued in a single batch read by the worker

  • Type: u32
  • Default: 20
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput but increase per-batch processing time

worker.event_systems.worker.processing.max_concurrent_operations

Maximum number of events processed concurrently by the worker event system

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Controls parallelism for step dispatch and completion processing

worker.event_systems.worker.processing.max_retries

Maximum retry attempts for a failed worker event processing operation

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Events exceeding this retry count are dropped or sent to the DLQ

backoff

Path: worker.event_systems.worker.processing.backoff

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first worker event processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between worker event retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |

timing

Path: worker.event_systems.worker.timing

| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds a worker event claim remains valid |
| fallback_polling_interval_seconds | u32 | 2 | Interval in seconds between fallback polling cycles for step dispatch |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the worker event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single worker event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued step dispatch message remains invisible to other workers |

worker.event_systems.worker.timing.claim_timeout_seconds

Maximum time in seconds a worker event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned claims from blocking step processing indefinitely

worker.event_systems.worker.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles for step dispatch

  • Type: u32
  • Default: 2
  • Valid Range: 1-60
  • System Impact: Shorter than orchestration (2s vs 5s) because workers need fast step pickup for low latency

worker.event_systems.worker.timing.health_check_interval_seconds

Interval in seconds between health check probes for the worker event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the worker event system verifies its own connectivity

worker.event_systems.worker.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single worker event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Events exceeding this timeout are considered failed and may be retried

worker.event_systems.worker.timing.visibility_timeout_seconds

Time in seconds a dequeued step dispatch message remains invisible to other workers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Prevents duplicate step execution; must be longer than typical step processing time

grpc

Path: worker.grpc

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}" | Socket address for the worker gRPC server |
| enable_health_service | bool | true | Enable the gRPC health checking service on the worker |
| enable_reflection | bool | true | Enable gRPC server reflection for the worker service |
| enabled | bool | true | Enable the gRPC API server for the worker service |
| keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames on the worker |
| keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
| max_concurrent_streams | u32 | 1000 | Maximum number of concurrent gRPC streams per connection |
| max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server |
| tls_enabled | bool | false | Enable TLS encryption for worker gRPC connections |

worker.grpc.bind_address

Socket address for the worker gRPC server

  • Type: String
  • Default: "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}"
  • Valid Range: host:port
  • System Impact: Must not conflict with the REST API or orchestration gRPC ports; default 9191

worker.grpc.enable_health_service

Enable the gRPC health checking service on the worker

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators

worker.grpc.enable_reflection

Enable gRPC server reflection for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Allows tools like grpcurl to list and inspect worker services; consider disabling in production

worker.grpc.enabled

Enable the gRPC API server for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no gRPC endpoints are available; clients must use REST

worker.grpc.keepalive_interval_seconds

Interval in seconds between gRPC keepalive ping frames on the worker

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Detects dead connections; lower values detect failures faster but increase network overhead

worker.grpc.keepalive_timeout_seconds

Time in seconds to wait for a keepalive ping acknowledgment before closing the connection

  • Type: u32
  • Default: 20
  • Valid Range: 1-300
  • System Impact: Connections that fail to acknowledge within this window are considered dead and closed

worker.grpc.max_concurrent_streams

Maximum number of concurrent gRPC streams per connection

  • Type: u32
  • Default: 1000
  • Valid Range: 1-10000
  • System Impact: Workers typically handle more concurrent streams than orchestration; default 1000 reflects this

worker.grpc.max_frame_size

Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server

  • Type: u32
  • Default: 16384
  • Valid Range: 16384-16777215
  • System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream

worker.grpc.tls_enabled

Enable TLS encryption for worker gRPC connections

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When true, TLS cert and key paths must be provided; required for production gRPC deployments

mpsc_channels

Path: worker.mpsc_channels

command_processor

Path: worker.mpsc_channels.command_processor

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 2000 | Bounded channel capacity for the worker command processor |

worker.mpsc_channels.command_processor.command_buffer_size

Bounded channel capacity for the worker command processor

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Buffers incoming worker commands; smaller than orchestration (2000 vs 5000) since workers process fewer command types

domain_events

Path: worker.mpsc_channels.domain_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 1000 | Bounded channel capacity for domain event system commands |
| log_dropped_events | bool | true | Log a warning when domain events are dropped due to channel saturation |
| shutdown_drain_timeout_ms | u32 | 5000 | Maximum time in milliseconds to drain pending domain events during shutdown |

worker.mpsc_channels.domain_events.command_buffer_size

Bounded channel capacity for domain event system commands

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers domain event system control commands such as publish, subscribe, and shutdown

worker.mpsc_channels.domain_events.log_dropped_events

Log a warning when domain events are dropped due to channel saturation

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Observability: helps detect when event volume exceeds channel capacity

worker.mpsc_channels.domain_events.shutdown_drain_timeout_ms

Maximum time in milliseconds to drain pending domain events during shutdown

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Ensures in-flight domain events are delivered before the worker exits; prevents event loss

event_listeners

Path: worker.mpsc_channels.event_listeners

| Parameter | Type | Default | Description |
|---|---|---|---|
| pgmq_event_buffer_size | usize | 10000 | Bounded channel capacity for PGMQ event listener notifications on the worker |

worker.mpsc_channels.event_listeners.pgmq_event_buffer_size

Bounded channel capacity for PGMQ event listener notifications on the worker

  • Type: usize
  • Default: 10000
  • Valid Range: 1000-1000000
  • System Impact: Buffers PGMQ LISTEN/NOTIFY events; smaller than orchestration (10000 vs 50000) since workers handle fewer event types

event_subscribers

Path: worker.mpsc_channels.event_subscribers

| Parameter | Type | Default | Description |
|---|---|---|---|
| completion_buffer_size | usize | 1000 | Bounded channel capacity for step completion event subscribers |
| result_buffer_size | usize | 1000 | Bounded channel capacity for step result event subscribers |

worker.mpsc_channels.event_subscribers.completion_buffer_size

Bounded channel capacity for step completion event subscribers

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step completion notifications before they are forwarded to the orchestration service

worker.mpsc_channels.event_subscribers.result_buffer_size

Bounded channel capacity for step result event subscribers

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step execution results before they are published to the result queue

event_systems

Path: worker.mpsc_channels.event_systems

| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 2000 | Bounded channel capacity for the worker event system internal channel |

worker.mpsc_channels.event_systems.event_channel_buffer_size

Bounded channel capacity for the worker event system internal channel

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Buffers events between the listener and processor; sized for worker-level throughput

ffi_dispatch

Path: worker.mpsc_channels.ffi_dispatch

| Parameter | Type | Default | Description |
|---|---|---|---|
| callback_timeout_ms | u32 | 5000 | Maximum time in milliseconds for FFI fire-and-forget domain event callbacks |
| completion_send_timeout_ms | u32 | 10000 | Maximum time in milliseconds to retry sending FFI completion results when the channel is full |
| completion_timeout_ms | u32 | 30000 | Maximum time in milliseconds to wait for an FFI step handler to complete |
| dispatch_buffer_size | usize | 1000 | Bounded channel capacity for FFI step dispatch requests |
| starvation_warning_threshold_ms | u32 | 10000 | Age in milliseconds of pending FFI events that triggers a starvation warning |

worker.mpsc_channels.ffi_dispatch.callback_timeout_ms

Maximum time in milliseconds for FFI fire-and-forget domain event callbacks

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Prevents indefinite blocking of FFI threads during domain event publishing

worker.mpsc_channels.ffi_dispatch.completion_send_timeout_ms

Maximum time in milliseconds to retry sending FFI completion results when the channel is full

  • Type: u32
  • Default: 10000
  • Valid Range: 1000-300000
  • System Impact: Uses try_send with a retry loop instead of a blocking send to prevent deadlocks

worker.mpsc_channels.ffi_dispatch.completion_timeout_ms

Maximum time in milliseconds to wait for an FFI step handler to complete

  • Type: u32
  • Default: 30000
  • Valid Range: 1000-600000
  • System Impact: FFI handlers exceeding this timeout are considered failed; guards against hung FFI threads

worker.mpsc_channels.ffi_dispatch.dispatch_buffer_size

Bounded channel capacity for FFI step dispatch requests

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step execution requests destined for Ruby/Python FFI handlers

worker.mpsc_channels.ffi_dispatch.starvation_warning_threshold_ms

Age in milliseconds of pending FFI events that triggers a starvation warning

  • Type: u32
  • Default: 10000
  • Valid Range: 1000-300000
  • System Impact: Proactive detection of FFI channel starvation before completion_timeout_ms is reached
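
These timeouts are deliberately layered: the starvation warning (10 s) fires well before the completion timeout (30 s), giving operators an early signal before FFI handlers are declared failed. A sketch that keeps that ordering intact (TOML path assumed as for the other worker sections):

[mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000
callback_timeout_ms = 5000                # fire-and-forget domain event callbacks
completion_send_timeout_ms = 10000        # retry window when the channel is full
starvation_warning_threshold_ms = 10000   # warn here first...
completion_timeout_ms = 30000             # ...then declare the handler failed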

handler_dispatch

Path: worker.mpsc_channels.handler_dispatch

| Parameter | Type | Default | Description |
|---|---|---|---|
| completion_buffer_size | usize | 1000 | Bounded channel capacity for step handler completion notifications |
| dispatch_buffer_size | usize | 1000 | Bounded channel capacity for step handler dispatch requests |
| handler_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a step handler to complete execution |
| max_concurrent_handlers | u32 | 10 | Maximum number of step handlers executing simultaneously |

worker.mpsc_channels.handler_dispatch.completion_buffer_size

Bounded channel capacity for step handler completion notifications

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers handler completion results before they are forwarded to the result processor

worker.mpsc_channels.handler_dispatch.dispatch_buffer_size

Bounded channel capacity for step handler dispatch requests

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers incoming step execution requests before handler assignment

worker.mpsc_channels.handler_dispatch.handler_timeout_ms

Maximum time in milliseconds for a step handler to complete execution

  • Type: u32
  • Default: 30000
  • Valid Range: 1000-600000
  • System Impact: Handlers exceeding this timeout are cancelled; prevents hung handlers from consuming capacity

worker.mpsc_channels.handler_dispatch.max_concurrent_handlers

Maximum number of step handlers executing simultaneously

  • Type: u32
  • Default: 10
  • Valid Range: 1-10000
  • System Impact: Controls per-worker parallelism; bounded by the handler dispatch semaphore

load_shedding

Path: worker.mpsc_channels.handler_dispatch.load_shedding

| Parameter | Type | Default | Description |
|---|---|---|---|
| capacity_threshold_percent | f64 | 80.0 | Handler capacity percentage above which new step claims are refused |
| enabled | bool | true | Enable load shedding to refuse step claims when handler capacity is nearly exhausted |
| warning_threshold_percent | f64 | 70.0 | Handler capacity percentage at which warning logs are emitted |

worker.mpsc_channels.handler_dispatch.load_shedding.capacity_threshold_percent

Handler capacity percentage above which new step claims are refused

  • Type: f64
  • Default: 80.0
  • Valid Range: 0.0-100.0
  • System Impact: At 80%, the worker stops accepting new steps when 80% of max_concurrent_handlers are busy

worker.mpsc_channels.handler_dispatch.load_shedding.enabled

Enable load shedding to refuse step claims when handler capacity is nearly exhausted

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, the worker refuses new step claims above the capacity threshold to prevent overload

worker.mpsc_channels.handler_dispatch.load_shedding.warning_threshold_percent

Handler capacity percentage at which warning logs are emitted

  • Type: f64
  • Default: 70.0
  • Valid Range: 0.0-100.0
  • System Impact: Observability: alerts operators that the worker is approaching its capacity limit
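
With the defaults above and max_concurrent_handlers = 10, the worker starts warning once 7 handlers are busy (70%) and refuses new step claims once 8 are busy (80%). A sketch that tightens those thresholds for a latency-sensitive worker (TOML path assumed as above):

[mpsc_channels.handler_dispatch]
max_concurrent_handlers = 10

[mpsc_channels.handler_dispatch.load_shedding]
enabled = true
warning_threshold_percent = 60.0    # warn once 6 handlers are busy
capacity_threshold_percent = 70.0   # refuse new claims once 7 handlers are busy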

in_process_events

Path: worker.mpsc_channels.in_process_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| broadcast_buffer_size | usize | 2000 | Bounded broadcast channel capacity for in-process domain event delivery |
| dispatch_timeout_ms | u32 | 5000 | Maximum time in milliseconds to wait when dispatching an in-process event |
| log_subscriber_errors | bool | true | Log errors when in-process event subscribers fail to receive events |

worker.mpsc_channels.in_process_events.broadcast_buffer_size

Bounded broadcast channel capacity for in-process domain event delivery

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Controls how many domain events can be buffered before slow subscribers cause backpressure

worker.mpsc_channels.in_process_events.dispatch_timeout_ms

Maximum time in milliseconds to wait when dispatching an in-process event

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Prevents event dispatch from blocking indefinitely if all subscribers are slow

worker.mpsc_channels.in_process_events.log_subscriber_errors

Log errors when in-process event subscribers fail to receive events

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Observability: helps identify slow or failing event subscribers

orchestration_client

Path: worker.orchestration_client

| Parameter | Type | Default | Description |
|---|---|---|---|
| base_url | String | "http://localhost:8080" | Base URL of the orchestration REST API that this worker reports to |
| max_retries | u32 | 3 | Maximum retry attempts for failed orchestration API calls |
| timeout_ms | u32 | 30000 | HTTP request timeout in milliseconds for orchestration API calls |

worker.orchestration_client.base_url

Base URL of the orchestration REST API that this worker reports to

  • Type: String
  • Default: "http://localhost:8080"
  • Valid Range: valid HTTP(S) URL
  • System Impact: Workers send step completion results and health reports to this endpoint

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | http://orchestration:8080 | Container-internal DNS in Kubernetes/Docker |
| test | http://localhost:8080 | Local orchestration for testing |

Related: orchestration.web.bind_address

worker.orchestration_client.max_retries

Maximum retry attempts for failed orchestration API calls

  • Type: u32
  • Default: 3
  • Valid Range: 0-10
  • System Impact: Retries use backoff; higher values improve resilience to transient network issues

worker.orchestration_client.timeout_ms

HTTP request timeout in milliseconds for orchestration API calls

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Worker-to-orchestration calls exceeding this timeout fail and may be retried
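
In containerized deployments the value that usually changes is base_url, pointed at the orchestration service's internal DNS name, as the recommendation table above suggests. A worker.toml sketch (section name assumed as for the other worker tables):

[orchestration_client]
base_url = "http://orchestration:8080"   # container-internal DNS; use localhost for local dev
timeout_ms = 30000
max_retries = 3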

web

Path: worker.web

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}" | Socket address for the worker REST API server |
| enabled | bool | true | Enable the REST API server for the worker service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a worker HTTP request to complete before timeout |

worker.web.bind_address

Socket address for the worker REST API server

  • Type: String
  • Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}"
  • Valid Range: host:port
  • System Impact: Must not conflict with orchestration.web.bind_address when co-located; default 8081

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8081 | Standard worker port; use TASKER_WEB_BIND_ADDRESS env var to override |
| test | 0.0.0.0:8081 | Default port offset from orchestration (8080) |

worker.web.enabled

Enable the REST API server for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no HTTP endpoints are available; the worker operates via messaging only

worker.web.request_timeout_ms

Maximum time in milliseconds for a worker HTTP request to complete before timeout

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections

auth

Path: worker.web.auth

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication on the worker API |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication on the worker API |
| enabled | bool | false | Enable authentication for the worker REST API |
| jwt_audience | String | "worker-api" | Expected ‘aud’ claim in JWT tokens for the worker API |
| jwt_issuer | String | "tasker-worker" | Expected ‘iss’ claim in JWT tokens for the worker API |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if the worker issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures on the worker API |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for worker JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours for worker API tokens |

worker.web.auth.api_key

Static API key for simple key-based authentication on the worker API

  • Type: String
  • Default: ""
  • Valid Range: non-empty string or empty to disable
  • System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header

worker.web.auth.api_key_header

HTTP header name for API key authentication on the worker API

  • Type: String
  • Default: "X-API-Key"
  • Valid Range: valid HTTP header name
  • System Impact: Clients send their API key in this header; default is X-API-Key

worker.web.auth.enabled

Enable authentication for the worker REST API

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all worker API endpoints are unauthenticated

worker.web.auth.jwt_audience

Expected ‘aud’ claim in JWT tokens for the worker API

  • Type: String
  • Default: "worker-api"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different audience are rejected during validation

worker.web.auth.jwt_issuer

Expected ‘iss’ claim in JWT tokens for the worker API

  • Type: String
  • Default: "tasker-worker"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different issuer are rejected; default ‘tasker-worker’ distinguishes worker tokens from orchestration tokens

worker.web.auth.jwt_private_key

PEM-encoded private key for signing JWT tokens (if the worker issues tokens)

  • Type: String
  • Default: ""
  • Valid Range: valid PEM private key or empty
  • System Impact: Required only if the worker service issues its own JWT tokens; typically empty

worker.web.auth.jwt_public_key

PEM-encoded public key for verifying JWT token signatures on the worker API

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY:-}"
  • Valid Range: valid PEM public key or empty
  • System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management

worker.web.auth.jwt_public_key_path

File path to a PEM-encoded public key for worker JWT verification

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
  • Valid Range: valid file path or empty
  • System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file

worker.web.auth.jwt_token_expiry_hours

Default JWT token validity period in hours for worker API tokens

  • Type: u32
  • Default: 24
  • Valid Range: 1-720
  • System Impact: Tokens older than this are rejected; shorter values improve security

database_pools

Path: worker.web.database_pools

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 25 | Advisory hint for the total number of database connections across all worker pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the worker web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle worker web API connection is closed |
| web_api_max_connections | u32 | 15 | Maximum number of connections the worker web API pool can grow to under load |
| web_api_pool_size | u32 | 10 | Target number of connections in the worker web API database pool |

worker.web.database_pools.max_total_connections_hint

Advisory hint for the total number of database connections across all worker pools

  • Type: u32
  • Default: 25
  • Valid Range: 1-1000
  • System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint

worker.web.database_pools.web_api_connection_timeout_seconds

Maximum time to wait when acquiring a connection from the worker web API pool

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: Worker API requests that cannot acquire a connection within this window return an error

worker.web.database_pools.web_api_idle_timeout_seconds

Time before an idle worker web API connection is closed

  • Type: u32
  • Default: 600
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the worker web API pool shrinks after traffic subsides

worker.web.database_pools.web_api_max_connections

Maximum number of connections the worker web API pool can grow to under load

  • Type: u32
  • Default: 15
  • Valid Range: 1-500
  • System Impact: Hard ceiling for worker web API database connections

worker.web.database_pools.web_api_pool_size

Target number of connections in the worker web API database pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-200
  • System Impact: Determines how many concurrent database queries the worker REST API can execute; smaller than orchestration


Authentication & Authorization

API-level security for Tasker’s orchestration and worker HTTP endpoints, providing JWT bearer token and API key authentication with permission-based access control.


Architecture

                         ┌──────────────────────────────┐
Request ──►  Middleware  │  SecurityService              │
             (per-route) │  ├─ JwtAuthenticator          │
                         │  ├─ JwksKeyStore (optional)   │
                         │  └─ ApiKeyRegistry (optional) │
                         └───────────┬──────────────────┘
                                     │
                                     ▼
                           SecurityContext
                           (injected into request extensions)
                                     │
                                     ▼
                         ┌───────────────────────┐
                         │  authorize() wrapper  │
                         │  Resource + Action    │
                         └───────────┬───────────┘
                                     │
                           ┌─────────┴─────────┐
                           ▼                   ▼
                     Body parsing        403 (denied)
                         │
                         ▼
                    Handler body
                         │
                         ▼
                   200 (success)

Key Components

| Component | Location | Role |
|---|---|---|
| SecurityService | tasker-shared/src/services/security_service.rs | Unified auth backend: validates JWTs (static key or JWKS) and API keys |
| SecurityContext | tasker-shared/src/types/security.rs | Per-request identity + permissions, extracted by handlers |
| Permission enum | tasker-shared/src/types/permissions.rs | Compile-time permission vocabulary (resource:action) |
| Resource, Action | tasker-shared/src/types/resources.rs | Resource-based authorization types |
| authorize() wrapper | tasker-shared/src/web/authorize.rs | Handler wrapper for declarative permission checks |
| Auth middleware | */src/web/middleware/auth.rs | Axum middleware injecting SecurityContext |
| require_permission() | */src/web/middleware/permission.rs | Legacy per-handler permission gate (still available) |

Request Flow

  1. Middleware (conditional_auth) runs on protected routes
  2. If auth disabled → injects SecurityContext::disabled_context() (all permissions)
  3. If auth enabled → extracts Bearer token or API key from headers
  4. SecurityService validates credentials, returns SecurityContext
  5. authorize() wrapper checks permission BEFORE body deserialization → 403 if denied
  6. Body deserialization and handler execution proceed if authorized

Route Layers

Routes are split into public (never require auth) and protected (auth middleware applied):

Orchestration (port 8080):

  • Public: /health/*, /metrics, /api-docs/*
  • Protected: /v1/*, /config (opt-in)

Worker (port 8081):

  • Public: /health/*, /metrics, /api-docs/*
  • Protected: /v1/templates/*, /config (opt-in)

Quick Start

# 1. Generate RSA key pair
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys

# 2. Generate a token
cargo run --bin tasker-ctl -- auth generate-token \
  --private-key ./keys/jwt-private-key.pem \
  --permissions "tasks:create,tasks:read,tasks:list" \
  --subject my-service \
  --expiry-hours 24

# 3. Enable auth in config (orchestration.toml)
# [web.auth]
# enabled = true
# jwt_public_key_path = "./keys/jwt-public-key.pem"

# 4. Use the token
curl -H "Authorization: Bearer <token>" http://localhost:8080/v1/tasks

Documentation Index

| Document | Contents |
|---|---|
| Permissions | Permission vocabulary, route mapping, wildcards, role patterns |
| Configuration | TOML config, environment variables, deployment patterns |
| Testing | E2E test infrastructure, cargo-make tasks, writing auth tests |

Cross-References

| Document | Contents |
|---|---|
| API Security Guide | Quick start, CLI commands, error responses, observability |
| Auth Integration Guide | JWKS, Auth0, Keycloak, Okta configuration |

Design Decisions

Auth Disabled by Default

Security is opt-in (enabled = false default). Existing deployments are unaffected. When disabled, all handlers receive a SecurityContext with AuthMethod::Disabled and permissions: ["*"].

Config Endpoint Opt-In

The /config endpoint exposes runtime configuration (secrets redacted). It is controlled by a separate toggle (config_endpoint_enabled, default false). When disabled, the route is not registered (404, not 401).

Resource-Based Authorization

Permission checks happen at the route level via authorize() wrappers BEFORE body deserialization:

.route("/tasks", post(authorize(Resource::Tasks, Action::Create, create_task)))

This approach:

  • Rejects unauthorized requests before parsing request bodies
  • Provides a declarative, visible permission model at the route level
  • Is protocol-agnostic (same Resource/Action types work for REST and gRPC)
  • Documents permissions in OpenAPI via x-required-permission extensions

The legacy require_permission() function is still available for cases where permission checks need to happen inside handler logic.

Credential Priority (Client)

The tasker-client library resolves credentials in this order:

  1. Endpoint-specific token (TASKER_ORCHESTRATION_AUTH_TOKEN / TASKER_WORKER_AUTH_TOKEN)
  2. Global token (TASKER_AUTH_TOKEN)
  3. API key (TASKER_API_KEY)
  4. JWT generation from private key (if configured)

Known Limitations

  • Body-before-permission ordering for POST/PATCH endpoints — Resolved by resource-based authorization
  • No token refresh — tokens are stateless; clients must generate new tokens before expiry
  • API keys have no expiration — rotate by removing from config and redeploying

Configuration Reference

Complete configuration for Tasker authentication: server-side TOML, environment variables, and client settings.


Server-Side Configuration

Auth config lives under [web.auth] in both orchestration and worker TOML files.

Location

config/tasker/base/orchestration.toml    → [web.auth]
config/tasker/base/worker.toml           → [web.auth]
config/tasker/environments/{env}/...     → environment overrides

Configuration follows the role-based structure (see Configuration Management).

Full Reference

[web]
# Whether the /config endpoint is registered (default: false).
# When false, GET /config returns 404. When true, requires system:config_read permission.
config_endpoint_enabled = false

[web.auth]
# Master switch (default: false). When disabled, all routes are accessible without credentials.
enabled = false

# --- JWT Configuration ---

# Token issuer claim (validated against incoming tokens)
jwt_issuer = "tasker-core"

# Token audience claim (validated against incoming tokens)
jwt_audience = "tasker-api"

# Token expiry for generated tokens (via CLI)
jwt_token_expiry_hours = 24

# Verification method: "public_key" (static RSA key) or "jwks" (dynamic key rotation)
jwt_verification_method = "public_key"

# Static public key (one of these, path takes precedence):
jwt_public_key_path = "/etc/tasker/keys/jwt-public-key.pem"
jwt_public_key = ""  # Inline PEM string (use path instead for production)

# Private key (for token generation only, not needed for verification):
jwt_private_key = ""

# --- JWKS Configuration (when jwt_verification_method = "jwks") ---

# JWKS endpoint URL
jwks_url = "https://auth.example.com/.well-known/jwks.json"

# How often to refresh the key set (seconds)
jwks_refresh_interval_seconds = 3600

# --- Permission Validation ---

# JWT claim name containing the permissions array
permissions_claim = "permissions"

# Reject tokens with unrecognized permission strings
strict_validation = true

# Log unrecognized permissions even when strict_validation = false
log_unknown_permissions = true

# --- API Key Authentication ---

# Header name for API key authentication
api_key_header = "X-API-Key"

# Enable multi-key registry (default: false)
api_keys_enabled = false

# API key registry (multiple keys with individual permissions)
[[web.auth.api_keys]]
key = "sk-prod-monitoring-key"
permissions = ["tasks:read", "tasks:list", "dlq:read", "dlq:stats"]
description = "Production monitoring service"

[[web.auth.api_keys]]
key = "sk-prod-admin-key"
permissions = ["*"]
description = "Production admin"

Environment Variables

Server-Side

| Variable                   | Description                     | Overrides                    |
|----------------------------|---------------------------------|------------------------------|
| TASKER_JWT_PUBLIC_KEY_PATH | Path to RSA public key PEM file | web.auth.jwt_public_key_path |
| TASKER_JWT_PUBLIC_KEY      | Inline PEM public key           | web.auth.jwt_public_key      |

These override TOML values via the config loader’s environment interpolation.

Client-Side

| Variable                        | Priority    | Description                                |
|---------------------------------|-------------|--------------------------------------------|
| TASKER_ORCHESTRATION_AUTH_TOKEN | 1 (highest) | Bearer token for orchestration API only    |
| TASKER_WORKER_AUTH_TOKEN        | 1 (highest) | Bearer token for worker API only           |
| TASKER_AUTH_TOKEN               | 2           | Bearer token for both APIs                 |
| TASKER_API_KEY                  | 3           | API key (sent via configured header)       |
| TASKER_API_KEY_HEADER           | —           | Custom header name (default: X-API-Key)    |
| TASKER_JWT_PRIVATE_KEY_PATH     | 4 (lowest)  | Private key for on-demand token generation |

The tasker-client library checks these in priority order and uses the first available credential.
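
For example, a client that uses one bearer token for both APIs only needs the global variable set (values here are placeholders):

export TASKER_AUTH_TOKEN="<jwt-from-your-identity-provider>"

# Or, to authenticate with an API key instead (lower priority than a token):
# export TASKER_API_KEY="sk-prod-monitoring-key"
# export TASKER_API_KEY_HEADER="X-API-Key"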


Deployment Patterns

Development (Auth Disabled)

[web.auth]
enabled = false

All endpoints accessible without credentials. Default behavior.

Development (Auth Enabled, Static Key)

[web.auth]
enabled = true
jwt_verification_method = "public_key"
jwt_public_key_path = "./keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
strict_validation = false

[[web.auth.api_keys]]
key = "dev-key"
permissions = ["*"]
description = "Dev superuser key"

Production (JWKS + API Keys)

[web.auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://auth.company.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://auth.company.com/"
jwt_audience = "tasker-api"
strict_validation = true
log_unknown_permissions = true
api_keys_enabled = true
api_key_header = "X-API-Key"

[[web.auth.api_keys]]
key = "sk-monitoring-prod"
permissions = ["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]
description = "Monitoring service"

[[web.auth.api_keys]]
key = "sk-submitter-prod"
permissions = ["tasks:create", "tasks:read", "tasks:list"]
description = "Task submission service"

Production (Config Endpoint Enabled)

[web]
config_endpoint_enabled = true

[web.auth]
enabled = true
# ... auth config ...

Exposes GET /config (requires system:config_read permission). Secrets are redacted in the response.


Key Management

Generating Keys

# Generate 2048-bit RSA key pair
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys --key-size 2048

# Output:
#   keys/jwt-private-key.pem  (keep secret, used for token generation)
#   keys/jwt-public-key.pem   (distribute to servers for verification)
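
If tasker-ctl is not available (for example in a CI image), an equivalent 2048-bit RSA key pair can be produced with openssl; this is a sketch of the same output layout, not the supported path:

# Roughly equivalent key pair using openssl
mkdir -p keys
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out keys/jwt-private-key.pem
openssl rsa -in keys/jwt-private-key.pem -pubout -out keys/jwt-public-key.pem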

Key Rotation (Static Key)

  1. Generate a new key pair
  2. Update jwt_public_key_path in server config
  3. Restart servers
  4. Re-generate tokens with the new private key
  5. Old tokens become invalid immediately

Key Rotation (JWKS)

Handled automatically by the identity provider. Tasker refreshes keys on:

  • Timer interval (jwks_refresh_interval_seconds)
  • Unknown kid in incoming token (triggers immediate refresh)

Security Hardening Checklist

  • Private keys never committed to version control
  • enabled = true in production configs
  • strict_validation = true to reject unknown permissions
  • Token expiry set appropriately (1-24h recommended)
  • API keys use descriptive names for audit trails
  • config_endpoint_enabled = false unless needed (default)
  • Monitor tasker.auth.failures.total metric for anomalies
  • Use JWKS in production for automatic key rotation
  • Least-privilege: each service gets only the permissions it needs

Permissions

Permission-based access control using a resource:action vocabulary with wildcard support.


Permission Vocabulary

17 permissions organized by resource:

Tasks

| Permission         | Description            | Endpoints                    |
|--------------------|------------------------|------------------------------|
| tasks:create       | Create new tasks       | POST /v1/tasks               |
| tasks:read         | Read task details      | GET /v1/tasks/{uuid}         |
| tasks:list         | List tasks             | GET /v1/tasks                |
| tasks:cancel       | Cancel running tasks   | DELETE /v1/tasks/{uuid}      |
| tasks:context_read | Read task context data | GET /v1/tasks/{uuid}/context |

Steps

| Permission    | Description                | Endpoints |
|---------------|----------------------------|-----------|
| steps:read    | Read workflow step details | GET /v1/tasks/{uuid}/workflow_steps, GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}, GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}/audit |
| steps:resolve | Manually resolve steps     | PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid} |

Dead Letter Queue

| Permission | Description               | Endpoints |
|------------|---------------------------|-----------|
| dlq:read   | Read DLQ entries          | GET /v1/dlq, GET /v1/dlq/task/{task_uuid}, GET /v1/dlq/investigation-queue, GET /v1/dlq/staleness |
| dlq:update | Update DLQ investigations | PATCH /v1/dlq/entry/{dlq_entry_uuid} |
| dlq:stats  | View DLQ statistics       | GET /v1/dlq/stats |

Templates

| Permission         | Description         | Endpoints |
|--------------------|---------------------|-----------|
| templates:read     | Read task templates | Orchestration: GET /v1/templates, GET /v1/templates/{namespace}/{name}/{version} |
| templates:validate | Validate templates  | Worker: POST /v1/templates/{namespace}/{name}/{version}/validate |

System (Orchestration)

| Permission            | Description               | Endpoints |
|-----------------------|---------------------------|-----------|
| system:config_read    | Read system configuration | GET /config |
| system:handlers_read  | Read handler registry     | GET /v1/handlers, GET /v1/handlers/{namespace}, GET /v1/handlers/{namespace}/{name} |
| system:analytics_read | Read analytics data       | GET /v1/analytics/performance, GET /v1/analytics/bottlenecks |

Worker

| Permission            | Description               | Endpoints |
|-----------------------|---------------------------|-----------|
| worker:config_read    | Read worker configuration | Worker: GET /config |
| worker:templates_read | Read worker templates     | Worker: GET /v1/templates, GET /v1/templates/{namespace}/{name}/{version} |

Wildcards

Resource-level wildcards allow broad access within a resource domain:

| Pattern     | Matches                  |
|-------------|--------------------------|
| tasks:*     | All task permissions     |
| steps:*     | All step permissions     |
| dlq:*       | All DLQ permissions      |
| templates:* | All template permissions |
| system:*    | All system permissions   |
| worker:*    | All worker permissions   |

Note: Global wildcards (*) are NOT supported. Use explicit resource wildcards for broad access (e.g., tasks:*, system:*). This follows AWS IAM-style resource-level granularity.

Wildcard matching is implemented in permission_matches():

  • resource:* → matches if required permission’s resource component equals the prefix
  • Exact string → matches if strings are identical
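
A minimal sketch of that rule (the function signature is assumed here; the actual implementation lives in tasker-shared):

// Sketch only: resource wildcards match on the resource component, otherwise exact match.
fn permission_matches(granted: &str, required: &str) -> bool {
    if let Some(resource) = granted.strip_suffix(":*") {
        // "tasks:*" matches "tasks:read", "tasks:create", ...
        return required.split(':').next() == Some(resource);
    }
    // Exact string comparison, e.g. "tasks:read" == "tasks:read"
    granted == required
}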

Role Patterns

Common permission sets for different service roles:

Read-Only Operator

["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]

Suitable for dashboards, monitoring services, and read-only admin UIs.

Task Submitter

["tasks:create", "tasks:read", "tasks:list"]

Services that submit work to Tasker and track their submissions.

Ops Admin

["tasks:*", "steps:*", "dlq:*", "system:*"]

Full operational access including step resolution, DLQ investigation, and system observability.
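
As a sketch, an API key carrying this role in the server config (key value and description are illustrative):

[[web.auth.api_keys]]
key = "sk-prod-ops-admin"
permissions = ["tasks:*", "steps:*", "dlq:*", "system:*"]
description = "Ops admin"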

Worker Service

["worker:config_read", "worker:templates_read"]

Worker processes that need to read their configuration and available templates.

Full Access (Admin)

["tasks:*", "steps:*", "dlq:*", "templates:*", "system:*", "worker:*"]

Full access to all resources via resource wildcards. Use sparingly.


Strict Validation

When strict_validation = true (default), tokens containing permission strings not in the vocabulary are rejected with 401:

Unknown permissions: custom:action, tasks:delete

Set strict_validation = false if your identity provider includes additional scopes that are not part of Tasker’s vocabulary. Use log_unknown_permissions = true to still log unrecognized permissions for monitoring.
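
For example, to accept IdP-issued scopes outside Tasker's vocabulary while still surfacing them in logs:

[web.auth]
strict_validation = false
log_unknown_permissions = true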


Permission Check Implementation

Resource-Based Authorization

Permissions are enforced declaratively at the route level using authorize() wrappers. This ensures authorization happens before body deserialization:

// In routes.rs
use tasker_shared::web::authorize;
use tasker_shared::types::resources::{Resource, Action};

Router::new()
    .route("/tasks", post(authorize(Resource::Tasks, Action::Create, create_task)))
    .route("/tasks", get(authorize(Resource::Tasks, Action::List, list_tasks)))
    .route("/tasks/{uuid}", get(authorize(Resource::Tasks, Action::Read, get_task)))

The authorize() wrapper:

  1. Extracts SecurityContext from request extensions (set by auth middleware)
  2. If resource is public (Health/Metrics/Docs) → proceeds to handler
  3. If auth disabled (AuthMethod::Disabled) → proceeds to handler
  4. Checks has_permission(required) → if yes, proceeds; if no, returns 403

Resource → Permission Mapping

The ResourceAction type maps resource+action combinations to permissions:

| Resource  | Action        | Permission            |
|-----------|---------------|-----------------------|
| Tasks     | Create        | tasks:create          |
| Tasks     | Read          | tasks:read            |
| Tasks     | List          | tasks:list            |
| Tasks     | Cancel        | tasks:cancel          |
| Tasks     | ContextRead   | tasks:context_read    |
| Steps     | Read/List     | steps:read            |
| Steps     | Resolve       | steps:resolve         |
| Dlq       | Read/List     | dlq:read              |
| Dlq       | Update        | dlq:update            |
| Dlq       | Stats         | dlq:stats             |
| Templates | Read/List     | templates:read        |
| Templates | Validate      | templates:validate    |
| System    | ConfigRead    | system:config_read    |
| System    | HandlersRead  | system:handlers_read  |
| System    | AnalyticsRead | system:analytics_read |
| Worker    | ConfigRead    | worker:config_read    |
| Worker    | Read/List     | worker:templates_read |

Public Resources

These resources don’t require authentication:

  • Resource::Health - Health check endpoints
  • Resource::Metrics - Prometheus metrics
  • Resource::Docs - OpenAPI/Swagger documentation

Legacy Handler-Level Check (Still Available)

For cases where you need permission checks inside handler logic:

use tasker_shared::services::require_permission;
use tasker_shared::types::Permission;

fn my_handler(ctx: SecurityContext) -> Result<(), ApiError> {
    require_permission(&ctx, Permission::TasksCreate)?;
    // ... handler logic
}

Source: tasker-shared/src/web/authorize.rs, tasker-shared/src/types/resources.rs


OpenAPI Documentation

Permission Extensions

Each protected endpoint in the OpenAPI spec includes an x-required-permission extension that documents the exact permission required:

{
  "paths": {
    "/v1/tasks": {
      "post": {
        "security": [
          { "bearer_auth": [] },
          { "api_key_auth": [] }
        ],
        "x-required-permission": "tasks:create",
        ...
      }
    }
  }
}

Why Extensions Instead of OAuth2 Scopes?

OpenAPI 3.x only formally supports scopes for OAuth2 and OpenID Connect security schemes—not for HTTP Bearer or API Key authentication. Since Tasker uses JWT Bearer tokens with JWKS validation (not OAuth2 flows), we use vendor extensions (x-required-permission) to document permissions in a standards-compliant way.

This approach:

  • Is OpenAPI compliant (tools ignore unknown x- fields gracefully)
  • Doesn’t misrepresent our authentication mechanism
  • Is machine-readable for SDK generators and tooling
  • Is visible in generated documentation

Viewing Permissions in Swagger UI

Each operation’s description includes a Required Permission line:

**Required Permission:** `tasks:create`

This provides human-readable permission information directly in the Swagger UI.

Programmatic Access

To extract permission requirements from the OpenAPI spec:

import json

spec = json.load(open("orchestration-openapi.json"))
for path, methods in spec["paths"].items():
    for method, operation in methods.items():
        if "x-required-permission" in operation:
            print(f"{method.upper()} {path}: {operation['x-required-permission']}")

CLI: List Permissions

cargo run --bin tasker-ctl -- auth show-permissions

Outputs all 17 permissions with their resource grouping.

Auth Testing

E2E test infrastructure for validating authentication and permission enforcement.


Test Organization

tasker-orchestration/tests/web/auth/
├── mod.rs                  # Module declarations
├── common.rs               # AuthWebTestClient, token generators, constants
├── tasks.rs                # Task endpoint auth tests
├── workflow_steps.rs       # Step resolution auth tests
├── dlq.rs                  # DLQ endpoint auth tests
├── handlers.rs             # Handler registry auth tests
├── analytics.rs            # Analytics endpoint auth tests
├── config.rs               # Config endpoint auth tests
├── health.rs               # Health endpoint public access tests
└── api_keys.rs             # API key auth tests (full/read/tasks/none)

All tests are feature-gated: #[cfg(feature = "test-services")]


Running Auth Tests

# Run all auth E2E tests (requires database running)
cargo make test-auth-e2e    # or: cargo make tae

# Run a specific test file
cargo nextest run --features test-services \
  -E 'test(auth::tasks)' \
  --package tasker-orchestration

# Run with output
cargo nextest run --features test-services \
  -E 'test(auth::)' \
  --package tasker-orchestration \
  --no-capture

Test Infrastructure

AuthWebTestClient

A specialized HTTP client that starts an auth-enabled Axum server:

use crate::web::auth::common::AuthWebTestClient;

#[tokio::test]
async fn test_example() {
    let client = AuthWebTestClient::new().await;
    // client.base_url is http://127.0.0.1:{dynamic_port}
}

AuthWebTestClient::new() does:

  1. Loads config/tasker/generated/auth-test.toml (auth enabled, test keys)
  2. Resolves jwt-public-key-test.pem via CARGO_MANIFEST_DIR
  3. Creates SystemContext + OrchestrationCore + AppState
  4. Starts Axum on a dynamically-allocated port (127.0.0.1:0)
  5. Provides HTTP methods: get(), post_json(), patch_json(), delete()

Token Generators

use crate::web::auth::common::{generate_jwt, generate_expired_jwt, generate_jwt_wrong_issuer};

// Valid token with specific permissions
let token = generate_jwt(&["tasks:create", "tasks:read"]);

// Expired token (1 hour ago)
let token = generate_expired_jwt(&["tasks:create"]);

// Wrong issuer (won't validate)
let token = generate_jwt_wrong_issuer(&["tasks:create"]);

Token generation uses the test RSA private key (tests/fixtures/auth/jwt-private-key-test.pem) embedded as a constant.

API Key Constants

use crate::web::auth::common::{
    TEST_API_KEY_FULL_ACCESS,      // permissions: ["*"]
    TEST_API_KEY_READ_ONLY,        // permissions: tasks/steps/dlq read + system read
    TEST_API_KEY_TASKS_ONLY,       // permissions: ["tasks:*"]
    TEST_API_KEY_NO_PERMISSIONS,   // permissions: []
    INVALID_API_KEY,               // not registered
};

These match the keys configured in config/tasker/generated/auth-test.toml.


Test Configuration

config/tasker/generated/auth-test.toml

A copy of complete-test.toml with auth overrides:

[orchestration.web.auth]
enabled = true
jwt_issuer = "tasker-core-test"
jwt_audience = "tasker-api-test"
jwt_verification_method = "public_key"
jwt_public_key_path = ""  # Set via TASKER_JWT_PUBLIC_KEY_PATH at runtime
api_keys_enabled = true
strict_validation = false

[[orchestration.web.auth.api_keys]]
key = "test-api-key-full-access"
permissions = ["*"]

[[orchestration.web.auth.api_keys]]
key = "test-api-key-read-only"
permissions = ["tasks:read", "tasks:list", "steps:read", ...]

# ... more keys ...

Test Fixture Keys

tests/fixtures/auth/
├── jwt-private-key-test.pem   # RSA private key (for token generation in tests)
└── jwt-public-key-test.pem    # RSA public key (loaded by SecurityService)

These are deterministic test keys committed to the repository. They are only used in tests and have no security value.


Test Patterns

Pattern: No Credentials → 401

#[tokio::test]
async fn test_no_credentials_returns_401() {
    let client = AuthWebTestClient::new().await;
    let response = client.get("/v1/tasks").await.unwrap();
    assert_eq!(response.status(), 401);
}

Pattern: Valid JWT with Required Permission → 200

#[tokio::test]
async fn test_jwt_with_permission_succeeds() {
    let client = AuthWebTestClient::new().await;
    let token = generate_jwt(&["tasks:list"]);
    let response = client
        .get_with_token("/v1/tasks", &token)
        .await
        .unwrap();
    assert_eq!(response.status(), 200);
}

Pattern: Valid JWT Missing Permission → 403

#[tokio::test]
async fn test_jwt_without_permission_returns_403() {
    let client = AuthWebTestClient::new().await;
    let token = generate_jwt(&["tasks:read"]);  // missing tasks:create
    let body = serde_json::json!({ /* ... */ });
    let response = client
        .post_json_with_token("/v1/tasks", &body, &token)
        .await
        .unwrap();
    assert_eq!(response.status(), 403);
}

Pattern: API Key with Permissions → 200

#[tokio::test]
async fn test_api_key_full_access() {
    let client = AuthWebTestClient::new().await;
    let response = client
        .get_with_api_key("/v1/tasks", TEST_API_KEY_FULL_ACCESS)
        .await
        .unwrap();
    assert_eq!(response.status(), 200);
}

Pattern: Health Always Public

#[tokio::test]
async fn test_health_no_auth_required() {
    let client = AuthWebTestClient::new().await;
    let response = client.get("/health").await.unwrap();
    assert_eq!(response.status(), 200);
}

Test Coverage Matrix

| Scenario                             | Expected | Test File                               |
|--------------------------------------|----------|-----------------------------------------|
| No credentials on protected routes   | 401      | All files                               |
| JWT with exact permission            | 200      | tasks, dlq, handlers, analytics, config |
| JWT with resource wildcard (tasks:*) | 200      | tasks                                   |
| JWT with global wildcard (*)         | 200      | All files                               |
| JWT missing required permission      | 403      | tasks, dlq, handlers, analytics         |
| JWT wrong issuer                     | 401      | tasks                                   |
| JWT wrong audience                   | 401      | tasks                                   |
| Expired JWT                          | 401      | tasks                                   |
| Malformed JWT                        | 401      | tasks                                   |
| API key full access                  | 200      | api_keys                                |
| API key read-only                    | 200/403  | api_keys                                |
| API key tasks-only                   | 200/403  | api_keys                                |
| API key no permissions               | 403      | api_keys                                |
| Invalid API key                      | 401      | api_keys                                |
| Health endpoints without auth        | 200      | health                                  |

CI Compatibility

Auth tests are compatible with CI without special environment setup:

  • Dynamic port allocation: TcpListener::bind("127.0.0.1:0") avoids port conflicts
  • Self-configuring paths: Uses CARGO_MANIFEST_DIR to resolve fixture paths at compile time
  • No external services: Auth validation is in-process (no external JWKS/IdP needed)
  • Nextest isolation: Each test runs in its own process, preventing env var conflicts

Adding New Auth Tests

  1. Identify the endpoint and required permission (see Permissions)
  2. Add tests to the appropriate file (by resource) or create a new one
  3. Test at minimum: no credentials (401), correct permission (200), wrong permission (403)
  4. For POST/PATCH endpoints, use a valid request body so responses reflect the authorization outcome rather than a request validation error
  5. Run cargo make test-auth-e2e to verify

Backpressure Monitoring Runbook

Last Updated: 2026-02-05 Audience: Operations, SRE, On-Call Engineers Status: Active Related Docs: Backpressure Architecture | MPSC Channel Tuning


This runbook provides guidance for monitoring, alerting, and responding to backpressure events in tasker-core.

Quick Reference

Critical Metrics Dashboard

| Metric                          | Normal    | Warning   | Critical  | Action                             |
|---------------------------------|-----------|-----------|-----------|------------------------------------|
| api_circuit_breaker_state       | closed    | -         | open      | See Circuit Breaker Open           |
| messaging_circuit_breaker_state | closed    | half-open | open      | See Messaging Circuit Breaker Open |
| api_requests_rejected_total     | < 1/min   | > 5/min   | > 20/min  | See API Rejections                 |
| mpsc_channel_saturation         | < 50%     | > 70%     | > 90%     | See Channel Saturation             |
| pgmq_queue_depth                | < 50% max | > 70% max | > 90% max | See Queue Depth High               |
| worker_claim_refusals_total     | < 5/min   | > 20/min  | > 50/min  | See Claim Refusals                 |
| handler_semaphore_wait_ms_p99   | < 100ms   | > 500ms   | > 2000ms  | See Handler Wait                   |
| domain_events_dropped_total     | < 10/min  | > 50/min  | > 200/min | See Domain Events                  |

Key Metrics

API Layer Metrics

api_requests_total

  • Type: Counter
  • Labels: endpoint, status_code, method
  • Description: Total API requests received
  • Usage: Calculate request rate, error rate

api_requests_rejected_total

  • Type: Counter
  • Labels: endpoint, reason (rate_limit, circuit_breaker, validation)
  • Description: Requests rejected due to backpressure
  • Alert: > 10/min sustained

api_circuit_breaker_state

  • Type: Gauge
  • Values: 0 = closed, 1 = half-open, 2 = open
  • Description: Current circuit breaker state
  • Alert: state = 2 (open)

api_request_latency_ms

  • Type: Histogram
  • Labels: endpoint
  • Description: Request processing time
  • Alert: p99 > 5000ms

Messaging Metrics

messaging_circuit_breaker_state

  • Type: Gauge
  • Values: 0 = closed, 1 = half-open, 2 = open
  • Description: Current messaging circuit breaker state
  • Alert: state = 2 (open) — both orchestration and workers lose queue access

messaging_circuit_breaker_rejections_total

  • Type: Counter
  • Labels: operation (send, receive)
  • Description: Messaging operations rejected due to open circuit breaker
  • Alert: > 0 (any rejection indicates messaging outage)

Orchestration Metrics

orchestration_command_channel_size

  • Type: Gauge
  • Description: Current command channel buffer usage
  • Alert: > 80% of command_buffer_size

orchestration_command_channel_saturation

  • Type: Gauge (0.0 - 1.0)
  • Description: Channel saturation ratio
  • Alert: > 0.8 sustained for > 1min

pgmq_queue_depth

  • Type: Gauge
  • Labels: queue_name
  • Description: Messages in queue
  • Alert: > configured max_queue_depth * 0.8

pgmq_enqueue_failures_total

  • Type: Counter
  • Labels: queue_name, reason
  • Description: Failed enqueue operations
  • Alert: > 0 (any failure)

Worker Metrics

worker_claim_refusals_total

  • Type: Counter
  • Labels: worker_id, namespace
  • Description: Step claims refused due to capacity
  • Alert: > 10/min sustained

worker_handler_semaphore_permits_available

  • Type: Gauge
  • Labels: worker_id
  • Description: Available handler permits
  • Alert: = 0 sustained for > 30s

worker_handler_semaphore_wait_ms

  • Type: Histogram
  • Labels: worker_id
  • Description: Time waiting for semaphore permit
  • Alert: p99 > 1000ms

worker_dispatch_channel_saturation

  • Type: Gauge
  • Labels: worker_id
  • Description: Dispatch channel saturation
  • Alert: > 0.8 sustained

worker_completion_channel_saturation

  • Type: Gauge
  • Labels: worker_id
  • Description: Completion channel saturation
  • Alert: > 0.8 sustained

domain_events_dropped_total

  • Type: Counter
  • Labels: worker_id, event_type
  • Description: Domain events dropped due to channel full
  • Alert: > 50/min (informational)

Alert Configurations

Prometheus Alert Rules

groups:
  - name: tasker_backpressure
    rules:
      # API Layer
      - alert: TaskerCircuitBreakerOpen
        expr: api_circuit_breaker_state == 2
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Tasker API circuit breaker is open"
          description: "Circuit breaker {{ $labels.instance }} has been open for > 30s"
          runbook: "https://docs/operations/backpressure-monitoring.md#circuit-breaker-open"

      - alert: TaskerMessagingCircuitBreakerOpen
        expr: messaging_circuit_breaker_state == 2
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Tasker messaging circuit breaker is open"
          description: "Messaging circuit breaker has been open for > 30s — queue operations are failing"
          runbook: "https://docs/operations/backpressure-monitoring.md#messaging-circuit-breaker-open"

      - alert: TaskerAPIRejectionsHigh
        expr: rate(api_requests_rejected_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of API request rejections"
          description: "{{ $value | printf \"%.2f\" }} requests/sec being rejected"
          runbook: "https://docs/operations/backpressure-monitoring.md#api-rejections-high"

      # Orchestration Layer
      - alert: TaskerCommandChannelSaturated
        expr: orchestration_command_channel_saturation > 0.8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Orchestration command channel is saturated"
          description: "Channel saturation at {{ $value | humanizePercentage }}"
          runbook: "https://docs/operations/backpressure-monitoring.md#channel-saturation"

      - alert: TaskerPGMQQueueDepthHigh
        expr: pgmq_queue_depth / pgmq_queue_max_depth > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PGMQ queue depth is high"
          description: "Queue {{ $labels.queue_name }} at {{ $value | humanizePercentage }} of capacity"
          runbook: "https://docs/operations/backpressure-monitoring.md#pgmq-queue-depth-high"

      # Worker Layer
      - alert: TaskerWorkerClaimRefusalsHigh
        expr: rate(worker_claim_refusals_total[5m]) > 0.2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of worker claim refusals"
          description: "Worker {{ $labels.worker_id }} refusing {{ $value | printf \"%.1f\" }} claims/sec"
          runbook: "https://docs/operations/backpressure-monitoring.md#worker-claim-refusals-high"

      - alert: TaskerHandlerWaitTimeHigh
        expr: histogram_quantile(0.99, sum by (le) (rate(worker_handler_semaphore_wait_ms_bucket[5m]))) > 2000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Handler wait time is high"
          description: "p99 handler wait time is {{ $value | printf \"%.0f\" }}ms"
          runbook: "https://docs/operations/backpressure-monitoring.md#handler-wait-time-high"

      - alert: TaskerDomainEventsDropped
        expr: rate(domain_events_dropped_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Domain events being dropped"
          description: "{{ $value | printf \"%.1f\" }} events/sec dropped"
          runbook: "https://docs/operations/backpressure-monitoring.md#domain-events-dropped"

Incident Response Procedures

Circuit Breaker Open

Severity: Critical

Symptoms:

  • API returning 503 responses
  • api_circuit_breaker_state = 2
  • Downstream operations failing

Immediate Actions:

  1. Check database connectivity
    psql $DATABASE_URL -c "SELECT 1"
    
  2. Check PGMQ extension health
    psql $DATABASE_URL -c "SELECT * FROM pgmq.meta LIMIT 5"
    
  3. Check recent error logs
    kubectl logs -l app=tasker-orchestration --tail=100 | grep ERROR
    

Recovery:

  • Circuit breaker will automatically attempt recovery after timeout_seconds (default: 30s)
  • If database is healthy, breaker should close after success_threshold (default: 2) successful requests
  • If database is unhealthy, fix database first

Escalation:

  • If breaker remains open > 5 min after database recovery: Escalate to engineering

Messaging Circuit Breaker Open

Severity: Critical

Symptoms:

  • Orchestration cannot enqueue steps or send task finalizations
  • Workers cannot receive step messages or send results
  • messaging_circuit_breaker_state = 2
  • MessagingError::CircuitBreakerOpen in logs

Immediate Actions:

  1. Check messaging backend health
    # For PGMQ (default)
    psql $PGMQ_DATABASE_URL -c "SELECT * FROM pgmq.meta LIMIT 5"
    
    # For RabbitMQ
    rabbitmqctl status
    
  2. Check PGMQ database connectivity (may differ from main database)
    psql $PGMQ_DATABASE_URL -c "SELECT 1"
    
  3. Check recent messaging errors
    kubectl logs -l app=tasker-orchestration --tail=100 | grep -E "messaging|circuit_breaker|CircuitBreakerOpen"
    

Impact:

  • Orchestration: Task initialization stalls, step results cannot be received, task finalizations blocked
  • Workers: Step messages not received, results cannot be sent back to orchestration
  • Safety: Messages remain in queues protected by visibility timeouts; no data loss occurs
  • Health checks: Unaffected (bypass circuit breaker to detect recovery)

Recovery:

  • Circuit breaker will automatically test recovery after timeout_seconds (default: 30s)
  • On recovery, queued messages will be processed normally (visibility timeouts protect against loss)
  • If messaging backend is unhealthy, fix it first — the breaker protects against cascading timeouts

Escalation:

  • If breaker remains open > 5 min after backend recovery: Escalate to engineering
  • If both web and messaging breakers are open simultaneously: Likely database-wide issue, escalate to DBA

API Rejections High

Severity: Warning

Symptoms:

  • Clients receiving 429 or 503 responses
  • api_requests_rejected_total increasing

Diagnosis:

  1. Check rejection reason distribution
    sum by (reason) (rate(api_requests_rejected_total[5m]))
    
  2. If reason=rate_limit: Legitimate load spike or client misbehavior
  3. If reason=circuit_breaker: See Circuit Breaker Open

Actions:

  • Rate limit rejections:
    • Identify high-volume client
    • Consider increasing rate limit or contacting client
  • Circuit breaker rejections:
    • Follow circuit breaker procedure

Channel Saturation

Severity: Warning → Critical if sustained

Symptoms:

  • mpsc_channel_saturation > 0.8
  • Increased latency
  • Potential backpressure cascade

Diagnosis:

  1. Identify saturated channel
    orchestration_command_channel_saturation > 0.8
    
  2. Check upstream rate
    rate(orchestration_commands_received_total[5m])
    
  3. Check downstream processing rate
    rate(orchestration_commands_processed_total[5m])
    

Actions:

  1. Temporary: Scale up orchestration replicas
  2. Short-term: Increase channel buffer size
    [orchestration.mpsc_channels.command_processor]
    command_buffer_size = 10000  # Increase from 5000
    
  3. Long-term: Investigate why processing is slow

PGMQ Queue Depth High

Severity: Warning → Critical if approaching max

Symptoms:

  • pgmq_queue_depth growing
  • Step execution delays
  • Potential OOM if queue grows unbounded

Diagnosis:

  1. Identify growing queue
    pgmq_queue_depth{queue_name=~".*"}
    
  2. Check worker health
    sum(worker_handler_semaphore_permits_available)
    
  3. Check for stuck workers
    count(worker_claim_refusals_total) by (worker_id)
    

Actions:

  1. Scale workers: Add more worker replicas
  2. Increase handler concurrency (short-term):
    [worker.mpsc_channels.handler_dispatch]
    max_concurrent_handlers = 20  # Increase from 10
    
  3. Investigate slow handlers: Check handler execution latency

Worker Claim Refusals High

Severity: Warning

Symptoms:

  • worker_claim_refusals_total increasing
  • Workers at capacity
  • Step execution delayed

Diagnosis:

  1. Check handler permit usage
    worker_handler_semaphore_permits_available
    
  2. Check handler execution time
    histogram_quantile(0.99, sum by (le) (rate(worker_handler_execution_ms_bucket[5m])))
    

Actions:

  1. Scale workers: Add replicas
  2. Optimize handlers: If execution time is high
  3. Adjust threshold: If refusals are premature
    [worker]
    claim_capacity_threshold = 0.9  # More aggressive claiming
    

Handler Wait Time High

Severity: Warning

Symptoms:

  • handler_semaphore_wait_ms_p99 > 1000ms
  • Steps waiting for execution
  • Increased end-to-end latency

Diagnosis:

  1. Check permit utilization
    1 - (worker_handler_semaphore_permits_available / worker_handler_semaphore_permits_total)
    
  2. Check completion channel saturation
    worker_completion_channel_saturation
    

Actions:

  1. Increase permits (if CPU/memory allow):
    [worker.mpsc_channels.handler_dispatch]
    max_concurrent_handlers = 15
    
  2. Optimize handlers: Reduce execution time
  3. Scale workers: If resources constrained

Domain Events Dropped

Severity: Informational

Symptoms:

  • domain_events_dropped_total increasing
  • Downstream event consumers missing events

Diagnosis:

  1. This is expected behavior under load
  2. Check if drop rate is excessive
    rate(domain_events_dropped_total[5m]) / rate(domain_events_dispatched_total[5m])
    

Actions:

  • If < 1% dropped: Normal, no action needed
  • If > 5% dropped: Consider increasing event channel buffer
    [shared.domain_events]
    buffer_size = 20000  # Increase from 10000
    
  • Note: Domain events are non-critical. Dropping does not affect step execution.

Capacity Planning

Determining Appropriate Limits

Command Channel Size

Required buffer = (peak_requests_per_second) * (avg_processing_time_ms / 1000) * safety_factor

Example:
  peak_requests_per_second = 100
  avg_processing_time_ms = 50
  safety_factor = 2

  Required buffer = 100 * 0.05 * 2 = 10 messages
  Recommended: 5000 (500x headroom for bursts)

Handler Concurrency

Optimal concurrency = (worker_cpu_cores) * (1 + (io_wait_ratio))

Example:
  worker_cpu_cores = 4
  io_wait_ratio = 0.8 (handlers are mostly I/O bound)

  Optimal concurrency = 4 * 1.8 = 7.2
  Recommended: 8-10 permits

PGMQ Queue Depth

Max depth = (expected_processing_rate) * (max_acceptable_delay_seconds)

Example:
  expected_processing_rate = 100 steps/sec
  max_acceptable_delay = 60 seconds

  Max depth = 100 * 60 = 6000 messages
  Recommended: 10000 (headroom for bursts)

Grafana Dashboard

Import this dashboard for backpressure monitoring:

{
  "dashboard": {
    "title": "Tasker Backpressure",
    "panels": [
      {
        "title": "Circuit Breaker State",
        "type": "stat",
        "targets": [{"expr": "api_circuit_breaker_state"}]
      },
      {
        "title": "API Rejections Rate",
        "type": "graph",
        "targets": [{"expr": "rate(api_requests_rejected_total[5m])"}]
      },
      {
        "title": "Channel Saturation",
        "type": "graph",
        "targets": [
          {"expr": "orchestration_command_channel_saturation", "legendFormat": "orchestration"},
          {"expr": "worker_dispatch_channel_saturation", "legendFormat": "worker-dispatch"},
          {"expr": "worker_completion_channel_saturation", "legendFormat": "worker-completion"}
        ]
      },
      {
        "title": "PGMQ Queue Depth",
        "type": "graph",
        "targets": [{"expr": "pgmq_queue_depth", "legendFormat": "{{queue_name}}"}]
      },
      {
        "title": "Handler Wait Time (p99)",
        "type": "graph",
        "targets": [{"expr": "histogram_quantile(0.99, worker_handler_semaphore_wait_ms_bucket)"}]
      },
      {
        "title": "Worker Claim Refusals",
        "type": "graph",
        "targets": [{"expr": "rate(worker_claim_refusals_total[5m])"}]
      }
    ]
  }
}

Checkpoint Operations Guide

Last Updated: 2026-01-06 Status: Active Related: Batch Processing - Checkpoint Yielding


Overview

This guide covers operational concerns for checkpoint yielding in production environments, including monitoring, troubleshooting, and maintenance tasks.


Monitoring Checkpoints

Key Metrics

| Metric                   | Description                       | Alert Threshold                     |
|--------------------------|-----------------------------------|-------------------------------------|
| Checkpoint history size  | Length of history array           | >100 entries                        |
| Checkpoint age           | Time since last checkpoint        | >10 minutes (step-dependent)        |
| Accumulated results size | Size of accumulated_results JSONB | >1MB                                |
| Checkpoint frequency     | Checkpoints per step execution    | <1 per minute (may indicate issues) |

SQL Queries for Monitoring

Steps with large checkpoint history:

SELECT
    ws.workflow_step_uuid,
    ws.name,
    t.task_uuid,
    jsonb_array_length(ws.checkpoint->'history') as history_length,
    ws.checkpoint->>'timestamp' as last_checkpoint
FROM tasker.workflow_steps ws
JOIN tasker.tasks t ON ws.task_uuid = t.task_uuid
WHERE ws.checkpoint IS NOT NULL
  AND jsonb_array_length(ws.checkpoint->'history') > 50
ORDER BY history_length DESC
LIMIT 20;

Steps with stale checkpoints (in progress but not checkpointed recently):

SELECT
    ws.workflow_step_uuid,
    ws.name,
    ws.current_state,
    ws.checkpoint->>'timestamp' as last_checkpoint,
    NOW() - (ws.checkpoint->>'timestamp')::timestamptz as checkpoint_age
FROM tasker.workflow_steps ws
WHERE ws.current_state = 'in_progress'
  AND ws.checkpoint IS NOT NULL
  AND (ws.checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes'
ORDER BY checkpoint_age DESC;

Large accumulated results:

SELECT
    ws.workflow_step_uuid,
    ws.name,
    pg_column_size(ws.checkpoint->'accumulated_results') as results_size_bytes,
    pg_size_pretty(pg_column_size(ws.checkpoint->'accumulated_results')::bigint) as results_size
FROM tasker.workflow_steps ws
WHERE ws.checkpoint->'accumulated_results' IS NOT NULL
  AND pg_column_size(ws.checkpoint->'accumulated_results') > 100000
ORDER BY results_size_bytes DESC
LIMIT 20;

Logging

Checkpoint operations emit structured logs:

INFO checkpoint_yield_step_event step_uuid=abc-123 cursor=1000 items_processed=1000
INFO checkpoint_saved step_uuid=abc-123 history_length=5

Log fields to monitor:

  • step_uuid - Step being checkpointed
  • cursor - Current position
  • items_processed - Total items at checkpoint
  • history_length - Number of checkpoint entries

Troubleshooting

Step Not Resuming from Checkpoint

Symptoms: Step restarts from beginning instead of checkpoint position.

Checks:

  1. Verify checkpoint exists:
    SELECT checkpoint FROM tasker.workflow_steps WHERE workflow_step_uuid = 'uuid';
    
  2. Check handler uses BatchWorkerContext accessors:
    • has_checkpoint? / has_checkpoint() / hasCheckpoint()
    • checkpoint_cursor / checkpointCursor
  3. Verify handler respects checkpoint in processing loop

Checkpoint Not Persisting

Symptoms: checkpoint_yield() returns but data not in database.

Checks:

  1. Check for errors in worker logs
  2. Verify FFI bridge is healthy
  3. Check database connectivity

Excessive Checkpoint History Growth

Symptoms: Steps have hundreds or thousands of checkpoint history entries.

Causes:

  • Very long-running processes with frequent checkpoints
  • Small checkpoint intervals relative to work

Remediation:

  1. Increase checkpoint interval (process more items between checkpoints)
  2. Clear history for completed steps (see Maintenance section)
  3. Monitor with history size query above

Large Accumulated Results

Symptoms: Database bloat, slow step queries.

Causes:

  • Storing full result sets instead of summaries
  • Unbounded accumulation without size checks

Remediation:

  1. Modify handler to store summaries, not full data
  2. Use external storage for large intermediate results
  3. Add size checks before checkpoint

Maintenance Tasks

Clear Checkpoint for Completed Steps

Completed steps retain checkpoint data for debugging. To clear:

-- Clear checkpoints for completed steps older than 7 days
UPDATE tasker.workflow_steps
SET checkpoint = NULL
WHERE current_state = 'complete'
  AND checkpoint IS NOT NULL
  AND updated_at < NOW() - INTERVAL '7 days';

Truncate History Array

For steps with excessive history:

-- Keep only last 10 history entries
UPDATE tasker.workflow_steps
SET checkpoint = jsonb_set(
    checkpoint,
    '{history}',
    (SELECT jsonb_agg(elem)
     FROM (
         SELECT elem
         FROM jsonb_array_elements(checkpoint->'history') elem
         ORDER BY (elem->>'timestamp')::timestamptz DESC
         LIMIT 10
     ) sub)
)
WHERE jsonb_array_length(checkpoint->'history') > 10;

Clear Checkpoint for Manual Reset

When manually resetting a step to reprocess from scratch:

-- Clear checkpoint to force reprocessing from beginning
UPDATE tasker.workflow_steps
SET checkpoint = NULL
WHERE workflow_step_uuid = 'step-uuid-here';

Warning: Only clear checkpoints if you want the step to restart from the beginning.


Capacity Planning

Database Sizing

Checkpoint column considerations:

  • Each checkpoint: ~1-10KB typical (cursor, timestamp, metadata)
  • History array: ~100 bytes per entry
  • Accumulated results: Variable (handler-dependent)

Formula for checkpoint storage:

Storage = Active Steps × (Base Checkpoint Size + History Entries × 100 bytes + Accumulated Size)

Example: 10,000 active steps with 50-entry history and 5KB accumulated results:

10,000 × (5KB + 50 × 100B + 5KB) = 10,000 × 15KB = 150MB

Performance Impact

Checkpoint write: ~1-5ms per checkpoint (single row UPDATE)

Checkpoint read: Included in step data fetch (no additional query)

Recommendations:

  • Checkpoint every 1000-10000 items or every 1-5 minutes
  • Too frequent: Excessive database writes
  • Too infrequent: Lost progress on failure

Alerting Recommendations

Prometheus/Grafana Metrics

If exporting to Prometheus:

# Alert on stale checkpoints
- alert: StaleCheckpoint
  expr: tasker_checkpoint_age_seconds > 600
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Step checkpoint is stale"

# Alert on large history
- alert: CheckpointHistoryGrowth
  expr: tasker_checkpoint_history_size > 100
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Checkpoint history exceeding threshold"

Database-Based Alerting

For periodic SQL-based monitoring:

-- Return non-zero if any issues detected
SELECT COUNT(*)
FROM tasker.workflow_steps
WHERE (
    -- Stale in-progress checkpoints
    (current_state = 'in_progress'
     AND checkpoint IS NOT NULL
     AND (checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes')
    OR
    -- Excessive history
    (checkpoint IS NOT NULL
     AND jsonb_array_length(checkpoint->'history') > 100)
);

Connection Pool Tuning Guide

Overview

Tasker maintains two connection pools: tasker (task/step/transition operations) and pgmq (queue operations). Pool observability is provided via:

  • /health/detailed - Pool utilization in pool_utilization field
  • /metrics - Prometheus gauges tasker_db_pool_connections{pool,state}
  • Atomic counters tracking acquire latency and errors

Pool Sizing Guidelines

Formula

max_connections = (peak_concurrent_operations * avg_hold_time_ms) / 1000 + headroom
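
A worked example, reading peak_concurrent_operations as operations started per second (the interpretation assumed here):

Example:
  peak operations per second = 200
  avg_hold_time_ms = 50
  headroom = 10

  max_connections = (200 * 50) / 1000 + 10 = 10 + 10 = 20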

Rules of thumb:

  • Orchestration pool: 2-3x the number of concurrent tasks expected
  • PGMQ pool: 1-2x the number of workers × batch size
  • min_connections: 20-30% of max to avoid cold-start latency
  • Never exceed PostgreSQL’s max_connections / number_of_services

Environment Defaults

| Parameter                 | Base | Production | Development | Test |
|---------------------------|------|------------|-------------|------|
| max_connections (tasker)  | 25   | 50         | 25          | 30   |
| min_connections (tasker)  | 5    | 10         | 5           | 2    |
| max_connections (pgmq)    | 15   | 25         | 15          | 10   |
| slow_acquire_threshold_ms | 100  | 50         | 200         | 500  |

Metrics Interpretation

Utilization Thresholds

| Level     | Utilization | Action                                                       |
|-----------|-------------|--------------------------------------------------------------|
| Healthy   | < 80%       | Normal operation                                             |
| Degraded  | 80-95%      | Monitor closely, consider increasing max_connections         |
| Unhealthy | > 95%       | Pool exhaustion imminent; increase pool size or reduce load  |

Slow Acquires

The slow_acquire_threshold_ms setting controls when an acquire is classified as “slow”:

  • Production (50ms): Tight threshold for SLO-sensitive workloads
  • Development (200ms): Relaxed for local debugging with fewer resources
  • Test (500ms): Very relaxed for CI environments with contention
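
To tighten or relax the threshold per environment, override it in the corresponding environment file (see the Configuration Reference below for paths):

# config/tasker/environments/production/common.toml
[common.database.pool]
slow_acquire_threshold_ms = 50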

A high slow_acquires count relative to total_acquires (>5%) suggests:

  1. Pool is undersized for the workload
  2. Connections are held too long (long queries or transactions)
  3. Connection creation is slow (network latency to DB)

Acquire Errors

Non-zero acquire_errors indicates pool exhaustion (timeout waiting for connection). Remediation:

  1. Increase max_connections
  2. Increase acquire_timeout_seconds (masks the problem)
  3. Reduce query execution time
  4. Check for connection leaks (connections not returned to pool)

PostgreSQL Server-Side Considerations

max_connections

PostgreSQL’s max_connections is a hard limit across all clients. For cluster deployments:

pg_max_connections >= sum(service_max_pool * service_instance_count) + superuser_reserved
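
A worked example for a hypothetical cluster (instance counts and pool sizes are illustrative):

  orchestration (tasker pool): 50 × 2 instances = 100
  workers (tasker pool):       25 × 4 instances = 100
  pgmq pools:                  25 × 6 instances = 150
  superuser_reserved:          10

  pg_max_connections >= 100 + 100 + 150 + 10 = 360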

Default PostgreSQL max_connections is 100. For production:

  • Set max_connections = 500 or higher
  • Reserve 5-10 connections for superuser (superuser_reserved_connections)
  • Monitor with SELECT count(*) FROM pg_stat_activity

Connection Overhead

Each PostgreSQL connection consumes ~5-10MB RAM. Size accordingly:

  • 100 connections ~ 0.5-1GB additional RAM
  • 500 connections ~ 2.5-5GB additional RAM

Statement Timeout

The statement_timeout database variable protects against runaway queries:

  • Production: 30s (default)
  • Test: 5s (fail fast)

Alert Threshold Recommendations

| Metric                | Warning         | Critical         |
|-----------------------|-----------------|------------------|
| Pool utilization      | > 80% for 5 min | > 95% for 1 min  |
| Slow acquires / total | > 5% over 5 min | > 20% over 1 min |
| Acquire errors        | > 0 in 5 min    | > 10 in 1 min    |
| Average acquire time  | > 50ms          | > 200ms          |

Configuration Reference

Pool settings are in config/tasker/base/common.toml under [common.database.pool] and [common.pgmq_database.pool]. Environment-specific overrides are in config/tasker/environments/{env}/common.toml.

[common.database.pool]
max_connections = 25
min_connections = 5
acquire_timeout_seconds = 10
idle_timeout_seconds = 300
max_lifetime_seconds = 1800
slow_acquire_threshold_ms = 100

MPSC Channel Tuning - Operational Runbook

Last Updated: 2025-12-10 Owner: Platform Engineering Related: ADR: Bounded MPSC Channels | Circuit Breakers | Backpressure Architecture

Overview

This runbook provides operational guidance for monitoring, diagnosing, and tuning MPSC channel buffer sizes in the tasker-core system. All channels are bounded with configurable capacities to prevent unbounded memory growth.

Quick Reference

Configuration Files

| File                                                       | Purpose            | When to Edit                 |
|------------------------------------------------------------|--------------------|------------------------------|
| config/tasker/base/mpsc_channels.toml                      | Base configuration | Default values               |
| config/tasker/environments/test/mpsc_channels.toml         | Test overrides     | Test environment tuning      |
| config/tasker/environments/development/mpsc_channels.toml  | Dev overrides      | Local development tuning     |
| config/tasker/environments/production/mpsc_channels.toml   | Prod overrides     | Production capacity planning |

Key Metrics

| Metric                         | Description             | Alert Threshold              |
|--------------------------------|-------------------------|------------------------------|
| mpsc_channel_usage_percent     | Current fill percentage | > 80%                        |
| mpsc_channel_capacity          | Configured buffer size  | N/A (informational)          |
| mpsc_channel_full_events_total | Overflow events counter | > 0 (indicates backpressure) |

Default Buffer Sizes

| Channel               | Base  | Test  | Development | Production |
|-----------------------|-------|-------|-------------|------------|
| Orchestration command | 1000  | 100   | 1000        | 5000       |
| PGMQ notifications    | 10000 | 10000 | 10000       | 50000      |
| Task readiness        | 1000  | 100   | 500         | 5000       |
| Worker command        | 1000  | 1000  | 1000        | 2000       |
| Event publisher       | 5000  | 5000  | 5000        | 10000      |
| Ruby FFI              | 1000  | 1000  | 500         | 2000       |

Monitoring and Alerting

Critical: Channel Saturation

# Alert when any channel is >90% full for 5 minutes
mpsc_channel_usage_percent > 90

Action: Immediate capacity increase or identify bottleneck

Warning: Channel High Usage

# Alert when any channel is >80% full for 15 minutes
mpsc_channel_usage_percent > 80

Action: Plan capacity increase, investigate throughput

Info: Channel Overflow Events

# Alert on any overflow events
rate(mpsc_channel_full_events_total[5m]) > 0

Action: Review backpressure handling, consider capacity increase

Grafana Queries

Channel Usage by Component

max by (channel, component) (mpsc_channel_usage_percent)

Channel Capacity Configuration

max by (channel, component) (mpsc_channel_capacity)

Overflow Event Rate

rate(mpsc_channel_full_events_total[5m])

Log Patterns

Saturation Warning (80% full)

WARN mpsc_channel_saturation channel=orchestration_command usage_percent=82.5

Overflow Event (channel full)

ERROR mpsc_channel_full channel=event_publisher action=dropped

Backpressure Applied

ERROR Ruby FFI event channel full - backpressure applied

Common Issues and Solutions

Issue 1: High Channel Saturation

Symptoms:

  • mpsc_channel_usage_percent consistently > 80%
  • Slow message processing
  • Increased latency

Diagnosis:

  1. Check which channel is saturated:

    # Grep logs for saturation warnings
    grep "mpsc_channel_saturation" logs/tasker.log | tail -20
    
  2. Check metrics for specific channel:

    mpsc_channel_usage_percent{channel="orchestration_command"}
    

Solutions:

Short-term (Immediate Relief):

# Edit appropriate environment file
# Example: config/tasker/environments/production/mpsc_channels.toml

[mpsc_channels.orchestration.command_processor]
command_buffer_size = 10000  # Increase from 5000

Long-term:

  • Investigate message producer rate
  • Optimize message consumer processing
  • Consider horizontal scaling

Issue 2: PGMQ Notification Bursts

Symptoms:

  • Spike in mpsc_channel_usage_percent{channel="pgmq_notifications"}
  • During bulk task creation (1000+ tasks)
  • Temporary saturation followed by recovery

Diagnosis:

  1. Correlate with bulk task operations:

    # Check for bulk task creation in logs
    grep "Bulk task creation" logs/tasker.log
    
  2. Verify buffer size configuration:

    # Check current production configuration
    cat config/tasker/environments/production/mpsc_channels.toml | \
      grep -A 2 "event_listeners"
    

Solutions:

If production buffer < 50000:

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.orchestration.event_listeners]
pgmq_event_buffer_size = 50000  # Recommended for production

If already at 50000 and still saturating:

  • Consider notification coalescing (future feature)
  • Implement batch notification processing
  • Scale orchestration services horizontally

Issue 3: Ruby FFI Backpressure

Symptoms:

  • Errors: “Ruby FFI event channel full - backpressure applied”
  • Ruby handler slowness
  • Increased Rust-side latency

Diagnosis:

  1. Check Ruby handler processing time:

    # Add timing to Ruby handlers
    time_start = Time.now
    result = handler.execute(step)
    duration = Time.now - time_start
    logger.warn("Slow handler: #{duration}s") if duration > 1.0
    
  2. Check FFI channel saturation:

    mpsc_channel_usage_percent{channel="ruby_ffi"}
    

Solutions:

If Ruby handlers are slow:

  • Optimize Ruby handler code
  • Consider async Ruby processing
  • Profile Ruby handler performance

If FFI buffer too small:

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.shared.ffi]
ruby_event_buffer_size = 2000  # Increase from 1000

If Rust-side producing too fast:

  • Add rate limiting to Rust event production
  • Batch events before FFI crossing

Issue 4: Event Publisher Drops

Symptoms:

  • Counter increasing: mpsc_channel_full_events_total{channel="event_publisher"}
  • Log warnings: “Event channel full, dropping event”

Diagnosis:

  1. Check drop rate:

    rate(mpsc_channel_full_events_total{channel="event_publisher"}[5m])
    
  2. Identify event types being dropped:

    grep "dropping event" logs/tasker.log | awk '{print $NF}' | sort | uniq -c
    

Solutions:

If drops are rare (< 1/min):

  • Acceptable for non-critical events
  • Monitor but no action needed

If drops are frequent (> 10/min):

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.shared.event_publisher]
event_queue_buffer_size = 20000  # Increase from 10000

If drops are continuous:

  • Investigate event storm cause
  • Consider event sampling/filtering
  • Review event subscriber performance

Capacity Planning

Sizing Formula

General guideline:

buffer_size = (peak_message_rate_per_sec * avg_processing_time_sec) * safety_factor

Where:

  • peak_message_rate_per_sec: Expected peak throughput
  • avg_processing_time_sec: Average consumer processing time
  • safety_factor: 2-5x for bursts

Example calculation:

# Orchestration command channel
peak_rate = 500 messages/sec
processing_time = 0.01 sec (10ms)
safety_factor = 2x

buffer_size = (500 * 0.01) * 2 = 10 messages minimum
# Use 1000 for burst handling

Environment-Specific Guidelines

Test Environment:

  • Use small buffers (100-500)
  • Exposes backpressure issues early
  • Forces proper error handling

Development Environment:

  • Use moderate buffers (500-1000)
  • Balances local resource usage
  • Mimics test environment behavior

Production Environment:

  • Use large buffers (2000-50000)
  • Handles real-world burst traffic
  • Prioritizes availability over memory

When to Increase Buffer Sizes

Increase if:

  • ✅ Saturation > 80% for extended periods
  • ✅ Overflow events occur regularly
  • ✅ Latency increases during peak load
  • ✅ Known traffic increase incoming

Don’t increase if:

  • ❌ Consumer is bottleneck (fix consumer instead)
  • ❌ Saturation is brief and recovers quickly
  • ❌ Would mask underlying performance issue

Configuration Change Procedure

1. Identify Need

Review metrics and logs to determine which channel needs adjustment.

2. Calculate New Size

Use sizing formula or apply percentage increase:

new_size = current_size * (100 / (100 - target_usage_percent))

# Example: Currently 90% full, target 70%
new_size = 5000 * (100 / (100 - 70)) = 5000 * 3.33 = 16,650
# Round up: 20,000

3. Update Configuration

Important: Environment overrides MUST use full [mpsc_channels.*] prefix!

# ✅ CORRECT
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 20000

# ❌ WRONG - creates conflicting top-level key
[orchestration.command_processor]
command_buffer_size = 20000

4. Deploy

Local/Development:

# Restart service - picks up new config automatically
cargo run --package tasker-orchestration --bin tasker-server --features web-api

Production:

# Standard deployment process
# Configuration is loaded at service startup
kubectl rollout restart deployment/tasker-orchestration

5. Monitor

Watch metrics for 1-2 hours post-change:

  • Channel usage percentage should decrease
  • Overflow events should stop
  • Latency should improve

6. Document

Update this runbook with:

  • Why change was made
  • New values
  • Observed impact

Troubleshooting Checklist

□ Check metric: mpsc_channel_usage_percent for affected channel
□ Review logs for saturation warnings in last 24 hours
□ Verify configuration file has correct [mpsc_channels] prefix
□ Confirm environment variable TASKER_ENV matches intended environment
□ Check if issue correlates with specific operations (bulk tasks, etc.)
□ Verify consumer processing time hasn't increased
□ Check for resource constraints (CPU, memory)
□ Review recent code changes that might affect throughput
□ Consider if horizontal scaling is needed vs buffer increase

Emergency Response

Critical Saturation (>95%)

Immediate Actions:

  1. Increase buffer size by 2-5x in production config
  2. Deploy immediately via rolling restart
  3. Page on-call if service degradation visible

Example:

# Edit config
vim config/tasker/environments/production/mpsc_channels.toml

# Deploy
kubectl rollout restart deployment/tasker-orchestration

# Monitor
watch -n 5 'curl -s localhost:9090/api/v1/query?query=mpsc_channel_usage_percent | jq'

Service Unresponsive Due to Backpressure

Symptoms:

  • All channels showing 100% usage
  • No message processing
  • Health checks failing

Actions:

  1. Check for downstream bottleneck (database, queue service)
  2. Scale out consumer services
  3. Temporarily increase all buffer sizes
  4. Check circuit breaker states (/health/detailed endpoint) - if circuit breakers are open, address underlying database/service issues first

Note: MPSC channels and circuit breakers are complementary resilience mechanisms. Channel saturation indicates internal backpressure, while circuit breaker state indicates external service health. See Circuit Breakers for operational guidance.

Best Practices

  1. Monitor Proactively: Don’t wait for alerts - review metrics weekly
  2. Test Changes in Dev: Validate buffer changes in development first
  3. Document Rationale: Note why each production override exists
  4. Gradual Increases: Prefer 2x increases over 10x jumps
  5. Review Quarterly: Adjust defaults based on production patterns
  6. Alert on Changes: Get notified of configuration file commits


Support

  • Questions? Ask in the #platform-engineering Slack channel
  • Issues? File a ticket with the infrastructure/channels label
  • Escalation? Page on-call via PagerDuty

Cluster Testing Guide

Last Updated: 2026-01-19 Audience: Developers, QA Status: Active Related: Tooling | Idempotency and Atomicity


Overview

This guide covers multi-instance cluster testing for validating horizontal scaling, race condition detection, and concurrent processing scenarios.

Key Capabilities:

  • Run N orchestration instances with M worker instances
  • Test concurrent task creation across instances
  • Validate state consistency across cluster
  • Detect race conditions and data corruption
  • Measure performance under concurrent load

Test Infrastructure

Feature Flags

Tests are organized by infrastructure requirements using Cargo feature flags:

Feature Flag    | Infrastructure Required           | In CI?
test-db         | PostgreSQL database               | Yes
test-messaging  | DB + messaging (PGMQ/RabbitMQ)    | Yes
test-services   | DB + messaging + services running | Yes
test-cluster    | Multi-instance cluster running    | No

Hierarchy: Each flag implies the previous (test-cluster includes test-services includes test-messaging includes test-db).

Test Commands

# Unit tests (DB + messaging only)
cargo make test-rust-unit

# E2E tests (services running)
cargo make test-rust-e2e

# Cluster tests (cluster running - LOCAL ONLY)
cargo make test-rust-cluster

# All tests including cluster
cargo make test-rust-all

Test Entry Points

tests/
├── basic_tests.rs        # Always compiles
├── integration_tests.rs  # #[cfg(feature = "test-messaging")]
├── e2e_tests.rs         # #[cfg(feature = "test-services")]
└── e2e/
    └── multi_instance/   # #[cfg(feature = "test-cluster")]
        ├── mod.rs
        ├── concurrent_task_creation_test.rs
        └── consistency_test.rs

Multi-Instance Test Manager

The MultiInstanceTestManager provides high-level APIs for cluster testing.

Location

tests/common/multi_instance_test_manager.rs
tests/common/orchestration_cluster.rs

Basic Usage

#![allow(unused)]
fn main() {
use crate::common::multi_instance_test_manager::MultiInstanceTestManager;

#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_concurrent_operations() -> Result<()> {
    // Setup from environment (reads TASKER_TEST_ORCHESTRATION_URLS, etc.)
    let manager = MultiInstanceTestManager::setup_from_env().await?;

    // Wait for all instances to become healthy
    manager.wait_for_healthy(Duration::from_secs(30)).await?;

    // Create tasks concurrently across the cluster
    let requests = vec![create_task_request("namespace", "task", json!({})); 10];
    let responses = manager.create_tasks_concurrent(requests).await?;

    // Wait for completion
    let task_uuids: Vec<_> = responses.iter()
        .map(|r| uuid::Uuid::parse_str(&r.task_uuid).unwrap())
        .collect();
    let timeout = Duration::from_secs(60); // completion budget for the batch
    let completed = manager.wait_for_tasks_completion(task_uuids.clone(), timeout).await?;

    // Verify consistency across all instances
    for uuid in &task_uuids {
        manager.verify_task_consistency(*uuid).await?;
    }

    Ok(())
}
}

Key Methods

Method                                    | Description
setup_from_env()                          | Create manager from environment variables
setup(orch_count, worker_count)           | Create manager with explicit counts
wait_for_healthy(timeout)                 | Wait for all instances to be healthy
create_tasks_concurrent(requests)         | Create tasks using round-robin distribution
wait_for_task_completion(uuid, timeout)   | Wait for single task completion
wait_for_tasks_completion(uuids, timeout) | Wait for multiple tasks
verify_task_consistency(uuid)             | Verify task state across all instances
orchestration_count()                     | Number of orchestration instances
worker_count()                            | Number of worker instances
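
The sketch below combines several of these methods, assuming the same test helpers as the Basic Usage example above; the instance counts, namespace, task name, and timeouts are illustrative.

use crate::common::multi_instance_test_manager::MultiInstanceTestManager;
use std::time::Duration;

#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_explicit_cluster_setup() -> Result<()> {
    // Explicit counts instead of environment-driven setup_from_env()
    let manager = MultiInstanceTestManager::setup(2, 4).await?;
    manager.wait_for_healthy(Duration::from_secs(30)).await?;

    assert_eq!(manager.orchestration_count(), 2);
    assert_eq!(manager.worker_count(), 4);

    // Submit a single task and wait for it individually
    let mut responses = manager
        .create_tasks_concurrent(vec![create_task_request("namespace", "task", json!({}))])
        .await?;
    let task_uuid = uuid::Uuid::parse_str(&responses.remove(0).task_uuid).unwrap();
    manager
        .wait_for_task_completion(task_uuid, Duration::from_secs(60))
        .await?;

    Ok(())
}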

OrchestrationCluster

Lower-level cluster abstraction with load balancing:

#![allow(unused)]
fn main() {
use crate::common::orchestration_cluster::{OrchestrationCluster, ClusterConfig, LoadBalancingStrategy};

// Create cluster with round-robin load balancing
let config = ClusterConfig {
    orchestration_urls: vec!["http://localhost:8080", "http://localhost:8081"],
    worker_urls: vec!["http://localhost:8100", "http://localhost:8101"],
    load_balancing: LoadBalancingStrategy::RoundRobin,
    health_timeout: Duration::from_secs(5),
};
let cluster = OrchestrationCluster::new(config).await?;

// Get client using load balancing strategy
let client = cluster.get_client();

// Get all clients for parallel operations
for client in cluster.all_clients() {
    let task = client.get_task(task_uuid).await?;
}
}

Running Cluster Tests

Prerequisites

  1. PostgreSQL running with PGMQ extension
  2. Environment configured for cluster mode

Step-by-Step

# 1. Start PostgreSQL (if not already running)
cargo make docker-up

# 2. Setup cluster environment
cargo make setup-env-cluster

# 3. Start the full cluster
cargo make cluster-start-all

# 4. Verify cluster health
cargo make cluster-status

# Expected output:
# Instance Status:
# ─────────────────────────────────────────────────────────────
# INSTANCE              STATUS     PID        PORT
# ─────────────────────────────────────────────────────────────
# orchestration-1       healthy    12345      8080
# orchestration-2       healthy    12346      8081
# worker-rust-1         healthy    12347      8100
# worker-rust-2         healthy    12348      8101
# ... (more workers)

# 5. Run cluster tests
cargo make test-rust-cluster

# 6. Stop cluster when done
cargo make cluster-stop

Monitoring During Tests

# In separate terminal: Watch cluster logs
cargo make cluster-logs

# Or orchestration logs only
cargo make cluster-logs-orchestration

# Quick status check (no health probes)
cargo make cluster-status-quick

Test Scenarios

Concurrent Task Creation

Validates that tasks can be created concurrently across orchestration instances without conflicts.

File: tests/e2e/multi_instance/concurrent_task_creation_test.rs

Test: test_concurrent_task_creation_across_instances

Validates:

  1. Tasks created through different orchestration instances
  2. All tasks complete successfully
  3. State is consistent across all instances
  4. No duplicate UUIDs generated
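
A minimal sketch of the no-duplicate-UUIDs check, assuming responses came back from manager.create_tasks_concurrent(...) as in the Basic Usage example:

use std::collections::HashSet;

// Every orchestration instance should have produced a distinct task UUID.
let uuids: Vec<String> = responses.iter().map(|r| r.task_uuid.clone()).collect();
let unique: HashSet<&String> = uuids.iter().collect();
assert_eq!(
    unique.len(),
    uuids.len(),
    "duplicate task UUIDs returned across orchestration instances"
);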

Rapid Task Burst

Stress tests the system by creating many tasks in quick succession.

Test: test_rapid_task_creation_burst

Validates:

  1. System handles high task creation rate
  2. No duplicate task UUIDs
  3. All tasks created successfully

Round-Robin Distribution

Verifies tasks are distributed across instances using round-robin.

Test: test_task_creation_round_robin_distribution

Validates:

  1. Tasks distributed across instances
  2. Distribution is approximately even
  3. No single-instance bottleneck

Validation Results

The cluster testing infrastructure was validated with the following results:

Test Summary

Metric                | Result
Tests Passed          | 1645
Intermittent Failures | 3 (resource contention, not race conditions)
Tests Skipped         | 21 (domain event tests, require single-instance)
Cluster Configuration | 2x orchestration + 2x each worker type (10 total)

Key Findings

  1. No Race Conditions Detected: All concurrent operations completed without data corruption or invalid states

  2. Defense-in-Depth Validated: Four protection layers (database atomicity, state machine guards, transaction boundaries, application logic) work correctly together

  3. Recovery Mechanism Works: Tasks and steps recover correctly after simulated failures

  4. Consistent State: Task state is consistent when queried from any orchestration instance

Connection Pool Deadlock (Fixed)

Initial testing revealed intermittent failures under high parallelization:

  • Cause: Connection pool deadlock in task initialization - transactions held connections while template loading needed additional connections
  • Root Cause Fix: Moved template loading BEFORE transaction begins in task_initialization/service.rs
  • Additional Tuning: Increased pool sizes (20→30 max, 1→2 min connections)
  • Status: ✅ Fixed - all 9 cluster tests now pass in parallel

See the connection pool deadlock pattern documentation in docs/ticket-specs/ for details.

Domain Event Tests

21 tests were skipped in cluster mode (marked with #[cfg(not(feature = "test-cluster"))]):

  • Reason: Domain event tests verify in-process event delivery, incompatible with multi-process cluster
  • Status: Working as designed - these tests run in single-instance CI

Test Feature Flag Implementation

Adding the Feature Gate

Tests requiring cluster infrastructure should use the feature gate:

#![allow(unused)]
fn main() {
// At module level
#![cfg(feature = "test-cluster")]

// Or at test level
#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_cluster_specific_behavior() -> Result<()> {
    // ...
}
}

Skipping Tests in Cluster Mode

Some tests (like domain events) don’t work in cluster mode:

#![allow(unused)]
fn main() {
// Only run when NOT in cluster mode
#[tokio::test]
#[cfg(not(feature = "test-cluster"))]
async fn test_domain_event_delivery() -> Result<()> {
    // In-process event testing
}
}

Conditional Imports

#![allow(unused)]
fn main() {
// Only import cluster test utilities when needed
#[cfg(feature = "test-cluster")]
use crate::common::multi_instance_test_manager::MultiInstanceTestManager;
}

Nextest Configuration

The .config/nextest.toml configures test execution for cluster scenarios:

[profile.default]
retries = 0
leak-timeout = { period = "500ms", result = "fail" }
fail-fast = false

# Multi-instance tests can run in parallel once cluster is warmed up
[[profile.default.overrides]]
filter = 'test(multi_instance)'

[profile.ci]
# Limit parallelism to avoid database connection pool exhaustion
test-threads = 4

Cluster Warmup: Multi-instance tests can run in parallel. Connection pools now start with min_connections=2 for faster warmup. The 5-second delay built into cluster-start-all usually suffices. If you see “Failed to create task after all retries” errors immediately after startup, wait a few more seconds for pools to fully initialize.


Troubleshooting

Cluster Won’t Start

# Check for port conflicts
lsof -i :8080-8089
lsof -i :8100-8109

# Check for stale PID files
ls -la .pids/
rm -rf .pids/*.pid  # Clean up stale PIDs

# Retry start
cargo make cluster-start-all

Tests Timeout / “Failed to create task after all retries”

This typically indicates the cluster wasn’t fully warmed up:

# Check cluster health
cargo make cluster-status

# If health is green but tests fail, wait for warmup
sleep 10 && cargo make test-rust-cluster

# Check logs for errors
cargo make cluster-logs | head -100

# Restart cluster with extra warmup
cargo make cluster-stop
cargo make cluster-start-all
sleep 10
cargo make test-rust-cluster

Root cause: Connection pools start at min_connections=2 and grow on demand. The first requests after startup may timeout while pools are establishing connections.

Connection Pool Exhaustion

If tests fail with “pool timed out” errors, ensure you have the latest code with:

  • Template loading before transaction in task_initialization/service.rs
  • Pool sizes: max_connections=30, min_connections=2 in test config

If issues persist, verify pool configuration:

# Check test config
cat config/tasker/generated/orchestration-test.toml | grep -A5 "pool"

Environment Variables Not Set

# Verify environment
env | grep TASKER_TEST

# Re-source environment
source .env

# Or regenerate
cargo make setup-env-cluster

CI Considerations

Cluster tests are NOT run in CI due to GitHub Actions resource constraints:

  • Running multiple orchestration + worker instances requires more memory than free GHA runners provide
  • This is a conscious tradeoff for an open-source, pre-alpha project

Future Options (when project matures):

  • Self-hosted runners with more resources
  • Paid GHA larger runners
  • Separate manual workflow trigger for cluster tests

Workaround: Run cluster tests locally before PRs that touch concurrent processing code.


Comprehensive Lifecycle Testing Framework Guide

This guide demonstrates the complete lifecycle testing framework, showing patterns, examples, and best practices for validating task and workflow step lifecycles with integrated SQL function validation.

Table of Contents

  1. Framework Overview
  2. Core Testing Patterns
  3. Advanced Assertion Traits
  4. Template-Based Testing
  5. SQL Function Integration
  6. Example Test Executions
  7. Tracing Output Examples
  8. Best Practices
  9. Troubleshooting

Framework Overview

Architecture

The comprehensive lifecycle testing framework consists of several key components:

#![allow(unused)]
fn main() {
// Core Infrastructure
TestOrchestrator          // Wrapper around orchestration components
StepErrorSimulator        // Realistic error scenario simulation
SqlLifecycleAssertion     // SQL function validation
TestScenarioBuilder       // YAML template loading

// Advanced Patterns
TemplateTestRunner        // Parameterized error pattern testing
ErrorPattern              // Comprehensive error configuration
TaskAssertions           // Task-level validation trait
StepAssertions           // Step-level validation trait
}

Integration Strategy

Each test follows the integrated validation pattern:

  1. Exercise Lifecycle: Use orchestration framework to create scenario
  2. Capture SQL State: Call SQL functions to get current state
  3. Assert Integration: Validate SQL functions return expected values
  4. Document Relationship: Structured tracing showing cause → effect

Core Testing Patterns

Pattern 1: Basic Lifecycle Validation

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_basic_lifecycle_validation(pool: PgPool) -> Result<()> {
    tracing::info!("🧪 Testing basic lifecycle progression");

    // STEP 1: Exercise lifecycle using framework
    let orchestrator = TestOrchestrator::new(pool.clone());
    let task = orchestrator.create_simple_task("test", "basic_validation").await?;
    let step = get_first_step(&pool, task.task_uuid).await?;

    // STEP 2: Validate initial state
    pool.assert_step_ready(step.workflow_step_uuid).await?;

    // STEP 3: Execute step
    let result = orchestrator.execute_step(&step, true, 1000).await?;
    assert!(result.success);

    // STEP 4: Validate final state
    pool.assert_step_complete(step.workflow_step_uuid).await?;

    tracing::info!("✅ Basic lifecycle validation complete");
    Ok(())
}
}

Pattern 2: Error and Retry Validation

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_error_retry_validation(pool: PgPool) -> Result<()> {
    tracing::info!("🔄 Testing error and retry behavior");

    let orchestrator = TestOrchestrator::new(pool.clone());
    let task = orchestrator.create_simple_task("test", "retry_validation").await?;
    let step = get_first_step(&pool, task.task_uuid).await?;

    // STEP 1: Simulate retryable error
    StepErrorSimulator::simulate_execution_error(
        &pool,
        &step,
        1 // attempt number
    ).await?;

    // STEP 2: Validate retry behavior
    pool.assert_step_retry_behavior(
        step.workflow_step_uuid,
        1,    // expected attempts
        None, // no custom backoff
        true  // still retry eligible
    ).await?;

    tracing::info!("✅ Error retry validation complete");
    Ok(())
}
}

Pattern 3: Complex Dependency Validation

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_dependency_validation(pool: PgPool) -> Result<()> {
    tracing::info!("🔗 Testing dependency relationships");

    let orchestrator = TestOrchestrator::new(pool.clone());

    // Create diamond pattern workflow
    let task = create_diamond_workflow_task(&orchestrator).await?;
    let steps = get_task_steps(&pool, task.task_uuid).await?;

    // Execute start step
    let result = orchestrator.execute_step(&steps[0], true, 1000).await?;
    assert!(result.success);

    // Fail one branch
    StepErrorSimulator::simulate_validation_error(
        &pool,
        &steps[1],
        "dependency_test_error"
    ).await?;

    // Complete other branch
    let result = orchestrator.execute_step(&steps[2], true, 1000).await?;
    assert!(result.success);

    // Validate convergence step is blocked
    pool.assert_step_blocked(steps[3].workflow_step_uuid).await?;

    tracing::info!("✅ Dependency validation complete");
    Ok(())
}
}

Advanced Assertion Traits

TaskAssertions Trait Usage

#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::{TaskAssertions, TaskStepDistribution};

// Task completion validation
pool.assert_task_complete(task_uuid).await?;

// Task error state validation
pool.assert_task_error(task_uuid, 2).await?; // 2 error steps

// Complex step distribution validation
pool.assert_task_step_distribution(
    task_uuid,
    TaskStepDistribution {
        total_steps: 4,
        completed_steps: 2,
        failed_steps: 1,
        ready_steps: 0,
        pending_steps: 1,
        in_progress_steps: 0,
        error_steps: 1,
    }
).await?;

// Execution status validation
pool.assert_task_execution_status(
    task_uuid,
    ExecutionStatus::BlockedByFailures,
    Some(RecommendedAction::HandleFailures)
).await?;

// Completion percentage validation
pool.assert_task_completion_percentage(task_uuid, 75.0, 5.0).await?;
}

StepAssertions Trait Usage

#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::StepAssertions;

// Basic step state validations
pool.assert_step_ready(step_uuid).await?;
pool.assert_step_complete(step_uuid).await?;
pool.assert_step_blocked(step_uuid).await?;

// Retry behavior validation
pool.assert_step_retry_behavior(
    step_uuid,
    3,        // expected attempts
    Some(30), // custom backoff seconds
    false     // not retry eligible (exhausted)
).await?;

// Dependency validation
pool.assert_step_dependencies_satisfied(step_uuid, true).await?;

// State transition sequence validation
pool.assert_step_state_sequence(
    step_uuid,
    vec!["Pending".to_string(), "InProgress".to_string(), "Complete".to_string()]
).await?;

// Permanent failure validation
pool.assert_step_failed_permanently(step_uuid).await?;

// Waiting for retry with specific time
let retry_time = chrono::Utc::now() + chrono::Duration::seconds(60);
pool.assert_step_waiting(step_uuid, retry_time).await?;
}

Template-Based Testing

ErrorPattern Configuration

#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::{ErrorPattern, TemplateTestRunner};

// Simple patterns
let success_pattern = ErrorPattern::AllSuccess;
let first_fail_pattern = ErrorPattern::FirstStepFails { retryable: true };
let last_fail_pattern = ErrorPattern::LastStepFails { permanently: false };

// Advanced patterns
let targeted_pattern = ErrorPattern::MiddleStepFails {
    step_name: "process_payment".to_string(),
    attempts_before_success: 3
};

let dependency_pattern = ErrorPattern::DependencyBlockage {
    blocked_step: "finalize_order".to_string(),
    blocking_step: "validate_payment".to_string()
};

// Custom pattern with full control
let custom_pattern = ErrorPattern::Custom {
    step_configs: {
        let mut configs = HashMap::new();
        configs.insert("critical_step".to_string(), StepErrorConfig {
            error_type: StepErrorType::ExternalServiceError,
            attempts_before_success: Some(5),
            custom_backoff_seconds: Some(120),
            permanently_fails: false,
        });
        configs
    }
};
}

Template Runner Usage

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_template_patterns(pool: PgPool) -> Result<()> {
    let template_runner = TemplateTestRunner::new(pool.clone()).await?;

    // Test single pattern
    let summary = template_runner.run_template_with_errors(
        "order_fulfillment.yaml",
        ErrorPattern::FirstStepFails { retryable: true }
    ).await?;

    assert!(summary.sql_validations_passed > 0);
    assert_eq!(summary.sql_validations_failed, 0);

    // Test all patterns automatically
    let summaries = template_runner
        .run_template_with_all_patterns("linear_workflow.yaml")
        .await?;

    for summary in summaries {
        tracing::info!(
            pattern = summary.error_pattern,
            execution_time = summary.execution_time_ms,
            validations = summary.sql_validations_passed,
            "Pattern execution complete"
        );
    }

    Ok(())
}
}

SQL Function Integration

Direct SQL Function Testing

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_direct_sql_functions(pool: PgPool) -> Result<()> {
    // Test get_step_readiness_status
    let step_status = sqlx::query!(
        "SELECT ready_for_execution, dependencies_satisfied, retry_eligible, attempts,
                backoff_request_seconds, next_retry_at
         FROM get_step_readiness_status($1)",
        step_uuid
    )
    .fetch_one(&pool)
    .await?;

    // Validate individual fields
    assert_eq!(step_status.ready_for_execution, Some(true));
    assert_eq!(step_status.dependencies_satisfied, Some(true));
    assert_eq!(step_status.retry_eligible, Some(false));
    assert_eq!(step_status.attempts, Some(0));

    // Test get_task_execution_context
    let task_context = sqlx::query!(
        "SELECT total_steps, completed_steps, failed_steps, ready_steps,
                pending_steps, in_progress_steps, error_steps,
                completion_percentage, execution_status, recommended_action,
                blocked_by_errors
         FROM get_task_execution_context($1)",
        task_uuid
    )
    .fetch_one(&pool)
    .await?;

    // Validate task aggregations
    assert!(task_context.total_steps.unwrap_or(0) > 0);
    assert_eq!(task_context.completed_steps, Some(0));
    assert_eq!(task_context.failed_steps, Some(0));

    Ok(())
}
}

Integrated SQL Validation Pattern

#![allow(unused)]
fn main() {
// The standard pattern used throughout the framework
async fn validate_integrated_sql_behavior(
    pool: &PgPool,
    task_uuid: Uuid,
    step_uuid: Uuid
) -> Result<()> {
    // STEP 1: Execute lifecycle action
    StepErrorSimulator::simulate_execution_error(pool, step, 2).await?;

    // STEP 2: Immediately validate SQL functions
    SqlLifecycleAssertion::assert_step_scenario(
        pool,
        task_uuid,
        step_uuid,
        ExpectedStepState {
            state: "Error".to_string(),
            ready_for_execution: false,
            dependencies_satisfied: true,
            retry_eligible: true,
            attempts: 2,
            next_retry_at: Some(calculate_expected_retry_time()),
            backoff_request_seconds: None,
            retry_limit: 3,
        }
    ).await?;

    // STEP 3: Document the relationship
    tracing::info!(
        lifecycle_action = "simulate_execution_error",
        sql_result = "retry_eligible=true, attempts=2",
        "✅ INTEGRATION: Lifecycle → SQL alignment verified"
    );

    Ok(())
}
}

Example Test Executions

Running Individual Tests

# Run specific test with detailed output
RUST_LOG=info cargo test --test complex_retry_scenarios \
    test_cascading_retries_with_dependencies -- --nocapture

# Run all lifecycle tests
cargo test --all-features --test '*lifecycle*' -- --nocapture

# Run with specific environment
TASKER_ENV=test cargo test --test step_retry_lifecycle_tests -- --nocapture

Running Test Suites

# Run comprehensive validation
cargo test --test sql_function_integration_validation -- --nocapture

# Run complex scenarios
cargo test --test complex_retry_scenarios -- --nocapture

# Run task finalization tests
cargo test --test task_finalization_error_scenarios -- --nocapture

Tracing Output Examples

Successful Test Execution

2025-01-15T10:30:45.123Z INFO test_cascading_retries_with_dependencies:
🧪 Testing cascading retries with diamond dependency pattern

2025-01-15T10:30:45.125Z INFO test_cascading_retries_with_dependencies:
🏗️ Creating diamond workflow: Start → BranchA/BranchB → Convergence

2025-01-15T10:30:45.145Z INFO test_cascading_retries_with_dependencies:
📋 STEP 1: Executing start step successfully
step_uuid=01JGJX7K8QMRNP4W2X3Y5Z6ABC

2025-01-15T10:30:45.167Z INFO test_cascading_retries_with_dependencies:
🔄 STEP 2: Simulating BranchA failure (attempt 1)
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
error_type="ExecutionError" retryable=true

2025-01-15T10:30:45.189Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Retry behavior matches expectations
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
attempts=1 backoff=null retry_eligible=true

2025-01-15T10:30:45.201Z INFO test_cascading_retries_with_dependencies:
🔄 STEP 3: BranchA retry attempt (attempt 2)
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF

2025-01-15T10:30:45.223Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step completed successfully
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF

2025-01-15T10:30:45.245Z INFO test_cascading_retries_with_dependencies:
❌ STEP 4: Simulating BranchB permanent failure
step_uuid=01JGJX7K8SMRNP4W2X3Y5Z6GHI
error_type="ValidationError" retryable=false

2025-01-15T10:30:45.267Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step failed permanently (not retryable)
step_uuid=01JGJX7K8SMRNP4W2X3Y5Z6GHI

2025-01-15T10:30:45.289Z INFO test_cascading_retries_with_dependencies:
🚫 STEP 5: Validating Convergence step is blocked
step_uuid=01JGJX7K8TMRNP4W2X3Y5Z6JKL

2025-01-15T10:30:45.301Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step blocked by dependencies
step_uuid=01JGJX7K8TMRNP4W2X3Y5Z6JKL

2025-01-15T10:30:45.323Z INFO test_cascading_retries_with_dependencies:
📊 TASK ASSERTION: Step distribution matches expectations
task_uuid=01JGJX7K8PMRNP4W2X3Y5Z6MNO
total=4 completed=2 failed=0 ready=0 pending=0 in_progress=0 error=2

2025-01-15T10:30:45.345Z INFO test_cascading_retries_with_dependencies:
✅ INTEGRATION: Lifecycle → SQL alignment verified
lifecycle_action="cascading_retry_with_dependency_blocking"
sql_result="blocked_by_errors=true, error_steps=2"

2025-01-15T10:30:45.356Z INFO test_cascading_retries_with_dependencies:
🧪 CASCADING RETRY TEST COMPLETE: Diamond pattern with mixed outcomes validated

Error Pattern Testing Output

2025-01-15T10:35:12.123Z INFO test_template_runner_all_patterns:
🎭 TEMPLATE DEMO: All error patterns with multiple templates

2025-01-15T10:35:12.145Z INFO test_template_runner_all_patterns:
📋 Testing template with all error patterns
template="linear_workflow.yaml"

2025-01-15T10:35:12.167Z INFO template_runner:
🎭 TEMPLATE TEST: Starting parameterized test execution
template_path="linear_workflow.yaml"
error_pattern=r#"AllSuccess"#

2025-01-15T10:35:12.234Z INFO template_runner:
🎭 TEMPLATE TEST: Execution complete
template_path="linear_workflow.yaml"
execution_time_ms=67
successful_steps=4 failed_steps=0 retried_steps=0
final_state="Complete"
validations_passed=12 validations_failed=0

2025-01-15T10:35:12.256Z INFO template_runner:
🎭 TEMPLATE TEST: Starting parameterized test execution
template_path="linear_workflow.yaml"
error_pattern=r#"FirstStepFails { retryable: true }"#

2025-01-15T10:35:12.334Z INFO template_runner:
📋 TEMPLATE: Simulated retryable error
step_name="initialize" attempt=1

2025-01-15T10:35:12.356Z INFO template_runner:
📋 TEMPLATE: Simulated retryable error
step_name="initialize" attempt=2

2025-01-15T10:35:12.423Z INFO template_runner:
🎭 TEMPLATE TEST: Execution complete
template_path="linear_workflow.yaml"
execution_time_ms=167
successful_steps=4 failed_steps=0 retried_steps=1
final_state="Complete"
validations_passed=15 validations_failed=0

2025-01-15T10:35:12.445Z INFO test_template_runner_all_patterns:
📊 Template pattern result
template="linear_workflow.yaml" pattern_index=0
pattern="AllSuccess" execution_time_ms=67
final_state="Complete" total_validations=12 success_rate="100.0%"

2025-01-15T10:35:12.467Z INFO test_template_runner_all_patterns:
📊 Template pattern result
template="linear_workflow.yaml" pattern_index=1
pattern=r#"FirstStepFails { retryable: true }"# execution_time_ms=167
final_state="Complete" total_validations=15 success_rate="100.0%"

SQL Function Validation Output

2025-01-15T10:40:30.123Z INFO test_comprehensive_sql_function_integration:
🔍 SQL INTEGRATION: Starting comprehensive validation across all scenarios

2025-01-15T10:40:30.145Z INFO test_comprehensive_sql_function_integration:
📋 SCENARIO 1: Basic lifecycle progression validation

2025-01-15T10:40:30.167Z INFO validate_initial_state:
✅ Initial state validation passed

2025-01-15T10:40:30.189Z INFO validate_step_completion:
✅ Step completion validation passed
step_uuid=01JGJX7M8QMRNP4W2X3Y5Z6PQR

2025-01-15T10:40:30.201Z INFO test_comprehensive_sql_function_integration:
✅ SCENARIO 1: Basic lifecycle validation complete
scenario="basic_lifecycle" validations=2

2025-01-15T10:40:30.223Z INFO test_comprehensive_sql_function_integration:
🔄 SCENARIO 2: Error handling and retry behavior validation

2025-01-15T10:40:30.245Z INFO validate_retry_behavior:
✅ Retry behavior validation passed
step_uuid=01JGJX7M8RMRNP4W2X3Y5Z6STU
attempts=1 backoff=Some(5) retry_eligible=true

2025-01-15T10:40:30.267Z INFO test_comprehensive_sql_function_integration:
✅ SCENARIO 2: Error and retry validation complete
scenario="error_retry" validations=1

2025-01-15T10:40:30.289Z INFO test_comprehensive_sql_function_integration:
🎯 FINAL VALIDATION: Comprehensive results summary
total_validations=25 successful_validations=25
success_rate="100.00%" scenarios_tested=6

2025-01-15T10:40:30.301Z INFO test_comprehensive_sql_function_integration:
🔍 SQL INTEGRATION VALIDATION COMPLETE: All scenarios validated successfully

Best Practices

1. Always Use Integrated Validation Pattern

#![allow(unused)]
fn main() {
// ✅ GOOD: Integrated lifecycle + SQL validation
async fn test_step_retry_behavior(pool: PgPool) -> Result<()> {
    // Exercise lifecycle
    StepErrorSimulator::simulate_execution_error(pool, step, 1).await?;

    // Immediately validate SQL functions
    pool.assert_step_retry_behavior(step_uuid, 1, None, true).await?;

    // Document relationship
    tracing::info!("✅ INTEGRATION: Retry behavior alignment verified");
    Ok(())
}

// ❌ BAD: Testing SQL functions in isolation
async fn test_sql_only(pool: PgPool) -> Result<()> {
    // Directly manipulating database state
    sqlx::query!("UPDATE steps SET attempts = 3").execute(pool).await?;

    // This doesn't prove the integration works
    let status = sqlx::query!("SELECT * FROM get_step_readiness_status($1)", uuid)
        .fetch_one(pool).await?;
    Ok(())
}
}

2. Use Structured Tracing

#![allow(unused)]
fn main() {
// ✅ GOOD: Structured tracing with context
tracing::info!(
    step_uuid = %step.workflow_step_uuid,
    attempts = expected_attempts,
    backoff = ?expected_backoff,
    retry_eligible = expected_retry_eligible,
    "✅ STEP ASSERTION: Retry behavior matches expectations"
);

// ❌ BAD: Unstructured logging
println!("Step retry test passed");
}

3. Test Multiple Scenarios

#![allow(unused)]
fn main() {
// ✅ GOOD: Comprehensive scenario coverage
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_complete_retry_scenarios(pool: PgPool) -> Result<()> {
    // Test retryable error
    test_retryable_error_scenario(&pool).await?;

    // Test non-retryable error
    test_non_retryable_error_scenario(&pool).await?;

    // Test retry exhaustion
    test_retry_exhaustion_scenario(&pool).await?;

    // Test custom backoff
    test_custom_backoff_scenario(&pool).await?;

    Ok(())
}
}

4. Validate State Transitions

#![allow(unused)]
fn main() {
// ✅ GOOD: Validate complete state transition sequence
pool.assert_step_state_sequence(
    step_uuid,
    vec![
        "Pending".to_string(),
        "InProgress".to_string(),
        "Error".to_string(),
        "WaitingForRetry".to_string(),
        "Ready".to_string(),
        "Complete".to_string()
    ]
).await?;
}

5. Use Assertion Traits for Readability

#![allow(unused)]
fn main() {
// ✅ GOOD: Clear, readable assertions
pool.assert_task_complete(task_uuid).await?;
pool.assert_step_failed_permanently(step_uuid).await?;

// ❌ BAD: Manual SQL queries everywhere
let task_status = sqlx::query!("SELECT ...").fetch_one(pool).await?;
assert_eq!(task_status.some_field, Some("Complete"));
}

Troubleshooting

Common Issues

1. Assertion Failures

Error: Task 01JGJX... assertion failed: expected Complete, found Processing

// Solution: Ensure lifecycle actions complete before asserting
tokio::time::sleep(Duration::from_millis(100)).await;
pool.assert_task_complete(task_uuid).await?;

2. SQL Function Mismatches

Error: Step 01JGJX... retry assertion failed: attempts expected 2, got Some(1)

// Solution: Verify error simulator is configured correctly
StepErrorSimulator::simulate_execution_error(pool, step, 2).await?; // 2 attempts

3. State Machine Violations

Error: Cannot transition from Complete to InProgress

// Solution: Use proper orchestration framework, not direct DB manipulation
let result = orchestrator.execute_step(step, true, 1000).await?;
// Don't: sqlx::query!("UPDATE steps SET state = 'InProgress'").execute(pool).await?;

4. Template Loading Issues

Error: Template 'nonexistent.yaml' not found

// Solution: Ensure template exists in correct directory
// templates should be in tests/fixtures/task_templates/rust/

Debugging Techniques

1. Enable Detailed Tracing

RUST_LOG=debug cargo test test_name -- --nocapture

2. Inspect SQL Function Results Directly

#![allow(unused)]
fn main() {
let step_status = sqlx::query!(
    "SELECT * FROM get_step_readiness_status($1)",
    step_uuid
)
.fetch_one(&pool)
.await?;

tracing::debug!("Step status: {:?}", step_status);
}

3. Validate Test Prerequisites

#![allow(unused)]
fn main() {
// Ensure test setup is correct
assert_eq!(steps.len(), 4, "Test requires 4 steps");
assert_eq!(task.namespace, "expected_namespace");
}

4. Use Incremental Validation

#![allow(unused)]
fn main() {
// Validate after each step
orchestrator.execute_step(&step1, true, 1000).await?;
pool.assert_step_complete(step1.workflow_step_uuid).await?;

orchestrator.execute_step(&step2, false, 1000).await?;
pool.assert_step_retry_behavior(step2.workflow_step_uuid, 1, None, true).await?;
}

Migration from Old Tests

Before (Direct Database Manipulation)

#![allow(unused)]
fn main() {
// ❌ OLD: Bypassing orchestration framework
async fn test_task_finalization_old(pool: PgPool) -> Result<()> {
    // Direct database manipulation
    sqlx::query!("UPDATE tasks SET state = 'Error'").execute(&pool).await?;
    sqlx::query!("UPDATE steps SET state = 'Error'").execute(&pool).await?;

    // Test SQL functions in isolation
    let context = get_task_execution_context(&pool, task_uuid).await?;
    assert_eq!(context.execution_status, ExecutionStatus::Error);

    Ok(())
}
}

After (Integrated Framework)

#![allow(unused)]
fn main() {
// ✅ NEW: Using integrated framework
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_task_finalization_new(pool: PgPool) -> Result<()> {
    tracing::info!("🧪 Testing task finalization with integrated approach");

    // Use orchestration framework
    let orchestrator = TestOrchestrator::new(pool.clone());
    let task = orchestrator.create_simple_task("test", "finalization").await?;
    let step = get_first_step(&pool, task.task_uuid).await?;

    // Create error state through framework
    StepErrorSimulator::simulate_validation_error(
        &pool,
        &step,
        "finalization_test_error"
    ).await?;

    // Immediately validate SQL functions
    pool.assert_step_failed_permanently(step.workflow_step_uuid).await?;
    pool.assert_task_error(task.task_uuid, 1).await?;

    tracing::info!("✅ INTEGRATION: Finalization behavior verified");
    Ok(())
}
}

This comprehensive guide demonstrates the power and flexibility of the lifecycle testing framework, providing developers with the tools needed to validate complex workflow behavior while maintaining confidence in the system’s correctness.

Decision Point E2E Tests

This document describes the E2E tests for decision point functionality and how to run them.

Test Location

tests/e2e/ruby/conditional_approval_test.rs

Design Note: Deferred Step Type (Added 2025-10-27)

A critical design refinement was introduced to handle convergence patterns in decision point workflows:

The Convergence Problem

In conditional_approval, all three possible outcomes (auto_approve, manager_approval, finance_review) converge to the same finalize_approval step. However, we cannot create finalize_approval at task initialization because:

  1. We don’t know which approval steps will be created
  2. finalize_approval needs different dependencies depending on the decision point’s choice

Solution: type: deferred

A new step type was added to handle this pattern:

- name: finalize_approval
  type: deferred  # NEW STEP TYPE!
  dependencies: [auto_approve, manager_approval, finance_review]  # All possible deps

How it works:

  1. Deferred steps list ALL possible dependencies in the template
  2. At initialization, deferred steps are excluded (they’re descendants of decision points)
  3. When a decision point creates outcome steps, the system:
    • Detects downstream deferred steps
    • Computes: declared_deps ∩ actually_created_steps = actual DAG
    • Creates deferred steps with resolved dependencies

Example:

  • When routing_decision chooses auto_approve:
    • Creates: auto_approve
    • Detects: finalize_approval is deferred with deps [auto_approve, manager_approval, finance_review]
    • Intersection: [auto_approve, manager_approval, finance_review] ∩ [auto_approve] = [auto_approve]
    • Creates: finalize_approval depending on auto_approve only

This elegantly solves convergence without requiring handlers to explicitly list convergence steps or special orchestration logic.
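
A hedged sketch of that dependency-resolution rule as a plain set intersection; the function name is illustrative, not the engine's actual API.

use std::collections::HashSet;

/// Illustrative helper: a deferred step's resolved dependencies are the
/// declared dependencies intersected with the steps the decision actually created.
fn resolve_deferred_dependencies(declared: &[&str], created: &[&str]) -> Vec<String> {
    let created: HashSet<&str> = created.iter().copied().collect();
    declared
        .iter()
        .filter(|dep| created.contains(**dep))
        .map(|dep| dep.to_string())
        .collect()
}

#[test]
fn finalize_approval_depends_only_on_created_branch() {
    let resolved = resolve_deferred_dependencies(
        &["auto_approve", "manager_approval", "finance_review"],
        &["auto_approve"],
    );
    assert_eq!(resolved, vec!["auto_approve".to_string()]);
}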

Test Coverage

The test suite validates the conditional approval workflow, which demonstrates decision point functionality with dynamic step creation based on runtime conditions (approval amount thresholds).

Test Cases

  1. test_small_amount_auto_approval() - Tests amounts < $1,000

    • Expected path: validate_request → routing_decision → auto_approve → finalize_approval
    • Verifies only 4 steps created
    • Confirms manager_approval and finance_review are NOT created
  2. test_medium_amount_manager_approval() - Tests amounts $1,000-$4,999

    • Expected path: validate_request → routing_decision → manager_approval → finalize_approval
    • Verifies only 4 steps created
    • Confirms auto_approve and finance_review are NOT created
  3. test_large_amount_dual_approval() - Tests amounts >= $5,000

    • Expected path: validate_request → routing_decision → manager_approval + finance_review → finalize_approval
    • Verifies 5 steps created
    • Confirms both parallel approval steps complete
    • Verifies auto_approve is NOT created
  4. test_decision_point_step_dependency_structure() - Validates dependency resolution

    • Verifies dynamically created steps depend on routing_decision
    • Confirms finalize_approval waits for all approval steps
    • Tests proper execution order
  5. test_boundary_conditions() - Tests exactly at $1,000 threshold

    • Verifies manager approval is used (not auto)
  6. test_boundary_large_threshold() - Tests exactly at $5,000 threshold

    • Verifies dual approval path is triggered
  7. test_very_small_amount() - Tests $0.01 amount

    • Verifies auto-approval for very small amounts

Running the Tests

Prerequisites

The tests require the full integration environment to be running. Use the Docker Compose test strategy:

# From the tasker-core directory

# 1. Stop any existing containers and clean up
docker-compose -f docker/docker-compose.test.yml down -v

# 2. Rebuild containers with latest changes
docker-compose -f docker/docker-compose.test.yml up --build -d

# 3. Wait for services to be healthy (about 10-15 seconds)
sleep 15

# 4. Run the conditional approval E2E tests
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo test --test e2e_tests e2e::ruby::conditional_approval_test -- --nocapture

# 5. Clean up after testing (optional)
docker-compose -f docker/docker-compose.test.yml down

Running Specific Tests

# Run just the small amount test
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_small_amount_auto_approval -- --nocapture

# Run just the large amount test
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_large_amount_dual_approval -- --nocapture

# Run all boundary tests
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_boundary -- --nocapture

Environment Variables

The tests use the following environment variables (set automatically in docker-compose.test.yml):

  • DATABASE_URL: PostgreSQL connection string
  • TASKER_ENV: Set to “test” for test configuration
  • TASK_TEMPLATE_PATH: Points to test fixtures directory
  • RUST_LOG: Set to “info” or “debug” for detailed logging

Test Workflow Details

Conditional Approval Workflow

The workflow implements amount-based routing:

┌─────────────────┐
│ validate_request│
│   (initial)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ routing_decision│ ◄─── DECISION POINT (type: decision)
│  (decision)     │
└────────┬────────┘
         │
         ├─────────── < $1,000 ─────────┐
         │                              │
         │                              ▼
         │                     ┌────────────────┐
         │                     │  auto_approve  │
         │                     └────────┬───────┘
         │                              │
         ├─────── $1,000-$4,999 ────────┼────┐
         │                              │    │
         │                              │    ▼
         │                              │  ┌──────────────────┐
         │                              │  │ manager_approval │
         │                              │  └────────┬─────────┘
         │                              │           │
         └──────── >= $5,000 ───────────┼───────────┼────┐
                                        │           │    │
                                        │           │    ▼
                                        │           │  ┌───────────────┐
                                        │           │  │ finance_review│
                                        │           │  └───────┬───────┘
                                        │           │          │
                                        ▼           ▼          ▼
                                     ┌─────────────────────────┐
                                     │   finalize_approval     │
                                     │    (convergence)        │
                                     └─────────────────────────┘

Decision Point Mechanism

  1. routing_decision step executes with type: decision marker
  2. Handler returns DecisionPointOutcome::CreateSteps with step names
  3. Orchestration creates those steps dynamically and adds dependencies
  4. Dynamically created steps execute like normal steps
  5. Convergence step (finalize_approval) waits for all paths

Task Template Location

The test uses the task template at:

tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml

Ruby Handler Implementation

The Ruby handlers are located at:

workers/ruby/spec/handlers/examples/conditional_approval/
├── handlers/
│   └── conditional_approval_handler.rb
└── step_handlers/
    ├── validate_request_handler.rb
    ├── routing_decision_handler.rb      ◄─── DECISION POINT HANDLER
    ├── auto_approve_handler.rb
    ├── manager_approval_handler.rb
    ├── finance_review_handler.rb
    └── finalize_approval_handler.rb

Key Implementation Detail

The routing_decision_handler.rb returns a decision point outcome:

outcome = if steps_to_create.empty?
            TaskerCore::Types::DecisionPointOutcome.no_branches
          else
            TaskerCore::Types::DecisionPointOutcome.create_steps(steps_to_create)
          end

TaskerCore::Types::StepHandlerCallResult.success(
  result: {
    # IMPORTANT: The decision point outcome MUST be in this key
    decision_point_outcome: outcome.to_h,
    route_type: route[:type],
    # ... other result fields
  }
)

Troubleshooting

Tests Fail with “Template Not Found”

Ensure the Ruby worker container has the correct template path:

docker-compose -f docker/docker-compose.test.yml logs ruby-worker
# Should show: TASK_TEMPLATE_PATH=/app/tests/fixtures/task_templates/ruby

Tests Timeout

Increase wait time in docker-compose startup:

sleep 30  # Instead of sleep 15

Database Connection Errors

Verify PostgreSQL is running and healthy:

docker-compose -f docker/docker-compose.test.yml ps
docker-compose -f docker/docker-compose.test.yml logs postgres

Step Creation Doesn’t Happen

Check orchestration logs for decision point processing:

docker-compose -f docker/docker-compose.test.yml logs orchestration | grep -i decision

Success Criteria

All tests should pass with output similar to:

test e2e::ruby::conditional_approval_test::test_small_amount_auto_approval ... ok
test e2e::ruby::conditional_approval_test::test_medium_amount_manager_approval ... ok
test e2e::ruby::conditional_approval_test::test_large_amount_dual_approval ... ok
test e2e::ruby::conditional_approval_test::test_decision_point_step_dependency_structure ... ok
test e2e::ruby::conditional_approval_test::test_boundary_conditions ... ok
test e2e::ruby::conditional_approval_test::test_boundary_large_threshold ... ok
test e2e::ruby::conditional_approval_test::test_very_small_amount ... ok

test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Next Steps

After validating Ruby workers:

  • Phase 8a: Implement Rust worker support for decision points
  • Phase 9a: Create E2E tests for Rust worker decision points

Focused Architectural and Security Audit Report

Audit Date: 2026-02-05 Auditor: Claude (Opus 4.6 / Sonnet 4.5 sub-agents) Status: Complete


Executive Summary

This audit evaluates all Tasker Core crates for alpha readiness across security, error handling, resilience, and architecture dimensions. Findings are categorized by severity (Critical/High/Medium/Low/Info) per the methodology defined in the audit specification.

Alpha Readiness Verdict

ALPHA READY with targeted fixes. No Critical vulnerabilities found. The High-severity items (dependency CVE, input validation gaps, shutdown timeouts) are straightforward fixes that can be completed in a single sprint.

Consolidated Finding Counts (All Crates)

Severity | Count | Status
Critical | 0     | None found
High     | 9     | Must fix before alpha
Medium   | 22    | Document as known limitations
Low      | 13    | Track for post-alpha

High-Severity Findings (Must Fix Before Alpha)

ID  | Finding                                     | Crate                | Fix Effort    | Remediation
S-1 | Queue name validation missing               | tasker-shared        | Small         | Queue name validation
S-2 | SQL error details exposed to clients        | tasker-shared        | Medium        | Error message sanitization
S-3 | #[allow] instead of #[expect] (systemic)    | All                  | Small (batch) | Lint compliance cleanup
P-1 | NOTIFY channel name unvalidated             | tasker-pgmq          | Small         | Queue name validation
O-1 | No actor panic recovery                     | tasker-orchestration | Medium        | Shutdown and recovery hardening
O-2 | Graceful shutdown lacks timeout             | tasker-orchestration | Small         | Shutdown and recovery hardening
W-1 | checkpoint_yield blocks FFI without timeout | tasker-worker        | Small         | FFI checkpoint timeout
X-1 | bytes v1.11.0 CVE (RUSTSEC-2026-0007)       | Workspace            | Trivial       | Dependency upgrade
P-2 | CLI migration SQL generation unescaped      | tasker-pgmq          | Small         | Queue name validation

Crate 1: tasker-shared

Overall Rating: A- (Strong foundation with targeted improvements needed)

The tasker-shared crate is the largest and most foundational crate in the workspace. It provides core types, error handling, messaging abstraction, security services, circuit breakers, configuration management, database utilities, and shared models. The crate demonstrates strong security practices overall.

Strengths

  • Zero unsafe code across the entire crate
  • Excellent cryptographic hygiene: Constant-time API key comparison via subtle::ConstantTimeEq (src/types/api_key_auth.rs:53-62), JWKS hardening with SSRF prevention (blocks private IPs, cloud metadata endpoints, requires HTTPS), algorithm allowlist enforcement (no alg: none)
  • Comprehensive input validation: JSONB validation with size/depth/key count limits (src/validation.rs), namespace validation with PostgreSQL identifier rules, XSS sanitization
  • 100% SQLx macro usage: All database queries use compile-time verified sqlx::query! macros, zero string interpolation in SQL
  • Lock-free circuit breakers: Atomic state management (AtomicU8 for state, AtomicU64 for metrics), proper memory ordering, correct state machine transitions
  • All MPSC channels bounded and config-driven: Full bounded-channel compliance
  • Exemplary config security: Environment variable allowlist with regex validation, TOML injection prevention via escape_toml_string(), fail-fast on validation errors
  • No hardcoded secrets: All sensitive values come from env vars or file paths
  • Well-organized API surface: Feature-gated modules (web-api, grpc-api), selective re-exports

Finding S-1 (HIGH): Queue Name Validation Missing

Location: tasker-shared/src/messaging/service/router.rs:96-97

Queue names are constructed via format! with unvalidated namespace input:

#![allow(unused)]
fn main() {
fn step_queue(&self, namespace: &str) -> String {
    format!("{}_{}_queue", self.worker_queue_prefix, namespace)
}
}

The MessagingError::InvalidQueueName variant exists (src/messaging/errors.rs:56) but is never raised. Neither the router nor the provider implementations (pgmq.rs:134-139, rabbitmq.rs:276-375) validate queue names before passing them to native queue APIs.

Risk: PGMQ creates PostgreSQL tables named after queues — special characters in namespace could cause SQL issues at the DDL level. RabbitMQ queue creation could fail with unexpected characters.

Recommendation: Add validate_queue_name() that enforces alphanumeric + underscore/hyphen, 1-255 chars. Call it in DefaultMessageRouter methods and/or ensure_queue().
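
A minimal sketch of such a validator, using a plain String error for brevity (wiring it into the existing MessagingError::InvalidQueueName variant is left out):

/// Illustrative queue-name check: alphanumeric plus underscore/hyphen, 1-255 chars.
fn validate_queue_name(name: &str) -> Result<(), String> {
    if name.is_empty() || name.len() > 255 {
        return Err(format!("invalid queue name length: {}", name.len()));
    }
    if !name
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
    {
        return Err(format!("queue name contains disallowed characters: {name}"));
    }
    Ok(())
}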

Finding S-2 (HIGH): SQL Error Details Exposed to Clients

Location: tasker-shared/src/errors.rs:71-74, 431-437

#![allow(unused)]
fn main() {
impl From<sqlx::Error> for TaskerError {
    fn from(err: sqlx::Error) -> Self {
        TaskerError::DatabaseError(err.to_string())
    }
}
}

sqlx::Error::to_string() can expose SQL query details, table/column names, constraint names, and potentially connection string information. These error messages may propagate to API responses.

Recommendation: Create a sanitized error mapper that logs full details internally but returns generic messages to API clients (e.g., “Database operation failed” with an internal error ID for correlation).
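
A hedged sketch of the recommended split between internal logging and the client-facing message; the correlation-ID scheme is an assumption, and TaskerError::DatabaseError is the existing variant from the snippet above.

use uuid::Uuid;

/// Illustrative mapper: log full error details internally, return only a generic
/// message plus a correlation ID to clients.
fn sanitize_database_error(err: sqlx::Error) -> TaskerError {
    let error_id = Uuid::new_v4();
    tracing::error!(error_id = %error_id, error = %err, "database operation failed");
    TaskerError::DatabaseError(format!("Database operation failed (error_id: {error_id})"))
}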

Finding S-3 (HIGH): #[allow] Used Instead of #[expect] (Lint Policy Violation)

Locations:

  • src/messaging/execution_types.rs:383 — #[allow(clippy::too_many_arguments)]
  • src/web/authorize.rs:194 — #[allow(dead_code)]
  • src/utils/serde.rs:46-47 — #[allow(dead_code)]

Project lint policy mandates #[expect(lint_name, reason = "...")] instead of #[allow]. This is a policy compliance issue.

Recommendation: Convert all #[allow] to #[expect] with documented reasons.
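
For example, the first location above could be rewritten along these lines (the function name and reason string are placeholders):

// Replacing #[allow(clippy::too_many_arguments)] at execution_types.rs:383 with an
// expectation that documents intent and warns if the lint ever stops firing.
#[expect(
    clippy::too_many_arguments,
    reason = "mirrors the wire-format field list; refactor tracked separately"
)]
fn build_step_execution_message(/* ... */) { /* ... */ }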

Finding S-4 (MEDIUM): unwrap_or_default() Violations of Tenet #11 (Fail Loudly)

Locations (20+ instances across crate):

  • src/messaging/execution_types.rs:120,186,213 — Step execution status defaults to empty string
  • src/database/sql_functions.rs:377,558 — Query results default to empty vectors
  • src/registry/task_handler_registry.rs:214,268,656,700,942 — Config schema fields default silently
  • src/proto/conversions.rs:32 — Invalid timestamps silently default to UNIX epoch

Risk: Required fields silently defaulting to empty values can mask real errors and produce incorrect behavior that’s hard to debug.

Recommendation: Audit all unwrap_or_default() usages. Replace with explicit error returns for required fields. Keep unwrap_or_default() only for truly optional fields with documented rationale.

Finding S-5 (MEDIUM): Error Context Loss in .map_err(|_| ...)

14 instances where original error context is discarded:

  • src/messaging/service/providers/rabbitmq.rs:544 — Discards parse error
  • src/messaging/service/providers/in_memory.rs:305,331,368 — 3 instances
  • src/state_machine/task_state_machine.rs:114 — Discards parse error
  • src/state_machine/actions.rs:256,372,434,842 — 4 instances discarding publisher errors
  • src/config/config_loader.rs:220,417 — 2 instances discarding env var errors
  • src/database/sql_functions.rs:1032 — Discards decode error
  • src/types/auth.rs:283 — Discards parse error

Recommendation: Include original error via .map_err(|e| SomeError::new(context, e.to_string())).
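
A small before/after sketch of the recommended pattern; the type and variant names are illustrative stand-ins for the call sites listed above.

// Before: the original parse error is discarded
let state = raw
    .parse::<TaskState>()
    .map_err(|_| StateMachineError::InvalidState(raw.to_string()))?;

// After: the original error travels with the context
let state = raw
    .parse::<TaskState>()
    .map_err(|e| StateMachineError::InvalidState(format!("{raw}: {e}")))?;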

Finding S-6 (MEDIUM): Production expect() Calls

  • src/macros.rs:65 — Panics if Tokio task spawning fails
  • src/cache/provider.rs:399,429,459,489,522 — Multiple expect("checked in should_use") calls

Risk: Panics in production code. While guarded by preconditions, they bypass error propagation.

Recommendation: Replace with Result propagation or add detailed safety comments explaining invariant guarantees.

Finding S-7 (MEDIUM): Database Pool Config Lacks Validation

Database pool configuration (PoolConfig) does not have a validate() method. Unlike circuit breaker config which validates ranges (failure_threshold > 0, timeout <= 300s), pool config relies on sqlx to reject invalid values at runtime.

Recommendation: Add validation: max_connections > 0, min_connections <= max_connections, acquire_timeout_seconds > 0.
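
A hedged sketch of the suggested check, with an illustrative field set and a plain String error:

struct PoolConfig {
    max_connections: u32,
    min_connections: u32,
    acquire_timeout_seconds: u64,
}

impl PoolConfig {
    /// Illustrative validation mirroring the circuit breaker config's range checks.
    fn validate(&self) -> Result<(), String> {
        if self.max_connections == 0 {
            return Err("max_connections must be greater than 0".into());
        }
        if self.min_connections > self.max_connections {
            return Err("min_connections must not exceed max_connections".into());
        }
        if self.acquire_timeout_seconds == 0 {
            return Err("acquire_timeout_seconds must be greater than 0".into());
        }
        Ok(())
    }
}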

Finding S-8 (MEDIUM): Individual Query Timeouts Missing

While database pools have acquire_timeout configured (src/database/pools.rs:169-170), individual sqlx::query! calls lack explicit timeout wrappers. Long-running queries rely solely on pool-level timeouts.

Recommendation: Consider PostgreSQL statement_timeout at the connection level, or add tokio::time::timeout() wrappers around critical query paths.
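
A minimal sketch of the timeout-wrapper option; the 5-second budget, query, and error handling are illustrative (the statement_timeout alternative is configured on the connection instead):

use std::time::Duration;
use tokio::time::timeout;

// Illustrative: give a critical query an explicit budget instead of relying
// only on pool-level acquire timeouts.
match timeout(Duration::from_secs(5), sqlx::query!("SELECT 1 AS one").fetch_one(&pool)).await {
    Ok(Ok(_row)) => { /* proceed with the result */ }
    Ok(Err(db_err)) => return Err(db_err.into()),
    Err(_elapsed) => return Err("query exceeded its 5s statement budget".into()),
}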

Finding S-9 (LOW): Message Size Limits Not Enforced

Messaging deserialization uses serde_json::from_slice() without explicit size limits. While PGMQ has implicit limits from PostgreSQL column sizes, a very large message could cause memory issues during deserialization.

Recommendation: Add configurable message size limits at the provider level.

Finding S-10 (LOW): File Path Exposure in Config Errors

src/services/security_service.rs:184-187 — Configuration errors include filesystem paths. Only occurs during startup (not exposed to API clients in normal operation).

Finding S-11 (LOW): Timestamp Conversion Silently Defaults to Epoch

src/proto/conversions.rs:32 — DateTime::from_timestamp().unwrap_or_default() silently converts invalid timestamps to 1970-01-01 instead of returning an error.

Finding S-12 (LOW): cargo-machete Ignore List Has 19 Entries

Cargo.toml:12-39 — Most are legitimately feature-gated or used via macros, but the list should be periodically audited to prevent dependency bloat.

Finding S-13 (LOW): Global Wildcard Permission Rejection Undocumented

src/types/permissions.rs — The permission_matches() function correctly rejects global wildcard (*) permissions but this behavior isn’t documented in user-facing comments.


Crate 2: tasker-pgmq

Overall Rating: B+ (Good with one high-priority fix needed)

The tasker-pgmq crate is a PGMQ wrapper providing PostgreSQL LISTEN/NOTIFY support for event-driven message processing. ~3,345 source lines across 9 files. No dependencies on tasker-shared (clean separation).

Strengths

  • No unsafe code across the entire crate
  • Payload uses parameterized queries: Message payloads bound via $1 parameter in NOTIFY
  • Payload size validation: Enforces pg_notify 8KB limit
  • Comprehensive thiserror error types with context preservation
  • Bounded channels: All MPSC channels bounded
  • Good test coverage: 6 integration test files covering major flows
  • Clean separation from tasker-shared: No duplication, standalone library

Finding P-1 (HIGH): SQL Injection via NOTIFY Channel Name

Location: tasker-pgmq/src/emitter.rs:122

#![allow(unused)]
fn main() {
let sql = format!("NOTIFY {}, $1", channel);
sqlx::query(&sql).bind(payload).execute(&self.pool)
}

PostgreSQL’s NOTIFY does not support parameterized channel identifiers. The channel name is interpolated directly via format!. Channel names flow from config.build_channel_name() which concatenates channels_prefix (from TOML config) with base channel names and namespace strings.

Risk: While the NOTIFY command has limited injection surface (it’s not a general SQL execution vector), malformed channel names could cause PostgreSQL errors, unexpected channel routing, or denial of service. The channels_prefix comes from config (lower risk), but namespace strings flow from queue operations.

Recommendation: Add channel name validation — allow only [a-zA-Z0-9_.]+, max 63 chars (PostgreSQL identifier limit). Apply in build_channel_name() and/or notify_channel().
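
A hedged sketch of the channel-name check; regex is already used by this crate's config module (see Finding P-7 below), and the error type here is a placeholder.

use regex::Regex;

/// Illustrative: reject NOTIFY channel names before they are interpolated into
/// the NOTIFY statement ([a-zA-Z0-9_.]+, max 63 chars per PostgreSQL identifier limits).
fn validate_channel_name(channel: &str) -> Result<(), String> {
    // Compiled per call for brevity; a real implementation would cache the regex.
    let pattern = Regex::new(r"^[A-Za-z0-9_.]{1,63}$").expect("static pattern compiles");
    if pattern.is_match(channel) {
        Ok(())
    } else {
        Err(format!("invalid NOTIFY channel name: {channel}"))
    }
}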

Finding P-2 (HIGH): CLI Migration SQL Generation with Unescaped Input

Location: tasker-pgmq/src/bin/cli.rs:179-353

User-provided regex patterns and channel prefixes are directly interpolated into SQL migration strings when generating migration files. While these are generated files that should be reviewed before application, the lack of escaping creates a risk if the generation process is automated.

Recommendation: Validate inputs against strict patterns before interpolation. Add a warning comment in generated files that they should be reviewed.

Finding P-3 (MEDIUM): unwrap_or_default() on Database Results (Tenet #11)

Location: tasker-pgmq/src/client.rs:164

.read_batch(queue_name, visibility_timeout, l).await?.unwrap_or_default()

When read_batch returns None, this silently produces an empty vector instead of failing loudly. Could mask permission errors, connection failures, or other serious issues.

Recommendation: Return explicit error on unexpected None.
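
The shape of that change, sketched generically (the error type is illustrative):

use anyhow::{anyhow, Result};

// Illustrative: treat an unexpected `None` from a batch read as an error
// rather than silently substituting an empty vector.
fn require_batch<T>(batch: Option<Vec<T>>, queue_name: &str) -> Result<Vec<T>> {
    batch.ok_or_else(|| anyhow!("read_batch returned None for queue {queue_name}"))
}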

Finding P-4 (MEDIUM): RwLock Poison Handling Masks Panics

Location: tasker-pgmq/src/listener.rs (22 instances)

self.stats.write().unwrap_or_else(|p| p.into_inner())

Silently recovers from poisoned RwLock without logging. Could propagate corrupted state from a panicked thread.

Recommendation: Log warning on poison recovery, or switch to parking_lot::RwLock (doesn’t poison).
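
A sketch of the first option, assuming the tracing crate is available (field names are illustrative):

use std::sync::RwLock;

// Illustrative: still recover from the poisoned lock, but leave evidence that
// a writer panicked while holding it instead of recovering silently.
fn record_stat(stats: &RwLock<u64>, delta: u64) {
    let mut guard = stats.write().unwrap_or_else(|poisoned| {
        tracing::warn!("listener stats RwLock was poisoned; recovering");
        poisoned.into_inner()
    });
    *guard += delta;
}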

Finding P-5 (MEDIUM): Hardcoded Pool Size

Location: tasker-pgmq/src/client.rs:41-44

let pool = sqlx::postgres::PgPoolOptions::new()
    .max_connections(20)  // Hard-coded
    .connect(database_url).await?;

Pool size should be configurable for different deployment scenarios.

Finding P-6 (MEDIUM): Missing Async Operation Timeouts

Database operations in client.rs, emitter.rs, and listener.rs lack explicit tokio::time::timeout() wrappers. Relies solely on pool-level acquire timeouts.

Finding P-7 (LOW): Error Context Loss in Regex Compilation

Location: tasker-pgmq/src/config.rs:169

Regex::new(&self.queue_naming_pattern)
    .map_err(|_| PgmqNotifyError::invalid_pattern(&self.queue_naming_pattern))

Original regex error details discarded.

Finding P-8 (LOW): #[allow] Instead of #[expect] (Lint Policy)

Location: tasker-pgmq/src/emitter.rs:299-320 — 3 instances of #[allow(dead_code)] on EmitterFactory.


Crate 3: tasker-orchestration

Overall Rating: A- (Strong security with targeted resilience improvements needed)

The tasker-orchestration crate handles core orchestration logic: actors, state machines, REST + gRPC APIs, and auth middleware. This is the largest service crate and the primary attack surface.

Strengths

  • Zero unsafe code across the entire crate
  • Excellent auth architecture: Constant-time API key comparison, JWT algorithm allowlist, JWKS SSRF prevention, auth before body parsing
  • gRPC/REST auth parity verified: All 6 gRPC task methods enforce identical permissions to REST counterparts
  • No auth bypass found: All API v1 routes wrapped in authorize(), health/metrics public by design
  • Database-level atomic claiming: FOR UPDATE SKIP LOCKED prevents concurrent state corruption
  • State transitions enforce ownership: No API endpoint allows direct state manipulation
  • Sanitized error responses: No stack traces, database errors genericized, consistent JSON format
  • Backpressure checked before resource operations: 503 with Retry-After header
  • Full bounded-channel compliance: All MPSC channels bounded and config-driven (0 unbounded channels)
  • HTTP request timeout: TimeoutLayer with configurable 30s default

Finding O-1 (HIGH): No Actor Panic Recovery

Location: tasker-orchestration/src/actors/command_processor_actor.rs:139

Actors spawn via spawn_named! but have no supervisor/restart logic. If OrchestrationCommandProcessorActor panics, the entire orchestration processing stops. Recovery requires full process restart.

Recommendation: Implement panic-catching wrapper with logged restart, or document that process-level supervision (systemd, k8s) handles this.
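
A minimal supervision sketch for the first option; the run loop, delay, and logging are placeholders rather than the crate's actual API:

use std::time::Duration;

// Illustrative supervisor: rerun a panicked actor loop after a short delay,
// and stop only when the loop exits cleanly or is cancelled.
async fn supervise<F, Fut>(run_actor: F)
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = ()> + Send + 'static,
{
    loop {
        match tokio::spawn(run_actor()).await {
            Ok(()) => break, // clean shutdown
            Err(err) if err.is_panic() => {
                tracing::error!("actor panicked; restarting in 1s");
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
            Err(_) => break, // task was cancelled
        }
    }
}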

Finding O-2 (HIGH): Graceful Shutdown Lacks Timeout

Locations:

  • tasker-orchestration/src/orchestration/bootstrap.rs:177-213
  • tasker-orchestration/src/bin/server.rs:68-82

Shutdown calls coordinator.lock().await.stop().await and orchestration_handle.stop().await with no timeout. If the event coordinator or actors hang during shutdown, the server never completes graceful shutdown.

Recommendation: Add 30-second timeout with force-kill fallback.
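
A sketch of bounding the shutdown sequence; the future passed in stands for the existing stop() calls:

use std::time::Duration;

// Illustrative: give graceful shutdown a 30-second budget, then log and fall
// through so the process can exit instead of hanging on a stuck component.
async fn shutdown_with_deadline(stop: impl std::future::Future<Output = ()>) {
    match tokio::time::timeout(Duration::from_secs(30), stop).await {
        Ok(()) => tracing::info!("graceful shutdown completed"),
        Err(_) => tracing::warn!("graceful shutdown timed out after 30s; forcing exit"),
    }
}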

Finding O-3 (HIGH): #[allow] Instead of #[expect] (Lint Policy)

21 instances of #[allow] found across the crate (most without a reason clause):

  • src/actors/traits.rs:67,81
  • src/web/extractors.rs:6
  • src/health/channel_status.rs:87
  • src/grpc/conversions.rs:42
  • And 16 more locations

Finding O-4 (MEDIUM): Request Validation Not Enforced at Handler Layer

Location: src/web/handlers/tasks.rs:47

TaskRequest has #[derive(Validate)] with constraints (name length 1-255, namespace length 1-255, priority range -100 to 100) but handlers accept Json<TaskRequest> without calling .validate(). Validation happens later at the service layer.

Impact: Oversized payloads are deserialized before rejection. Not a security vulnerability per se, but the defense-in-depth pattern would catch malformed input earlier.

Recommendation: Add .validate() at handler entry or use Valid<Json<TaskRequest>> extractor.
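
A hedged sketch of the first option, assuming the validator crate backs the #[derive(Validate)] mentioned above and an axum-style handler (the handler name and field set are abbreviated):

use axum::{http::StatusCode, Json};
use serde::Deserialize;
use validator::Validate;

// Illustrative request type mirroring the constraints described above.
#[derive(Debug, Deserialize, Validate)]
struct TaskRequest {
    #[validate(length(min = 1, max = 255))]
    name: String,
    #[validate(length(min = 1, max = 255))]
    namespace: String,
    #[validate(range(min = -100, max = 100))]
    priority: i32,
}

// Validate at the handler boundary, before any service-layer work happens.
async fn create_task(Json(req): Json<TaskRequest>) -> Result<StatusCode, StatusCode> {
    req.validate().map_err(|_| StatusCode::UNPROCESSABLE_ENTITY)?;
    // ... hand off to the task service ...
    Ok(StatusCode::ACCEPTED)
}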

Finding O-5 (MEDIUM): Actor Shutdown May Lose In-Flight Work

Location: tasker-orchestration/src/actors/registry.rs:216-259

Shutdown uses Arc::get_mut() which only works if no other references exist. If get_mut fails, stopped() is silently skipped. In-flight work may be lost.

Finding O-6 (MEDIUM): Database Query Timeouts Missing

Same pattern as tasker-shared (Finding S-8). Individual sqlx::query! calls lack explicit timeout wrappers:

  • src/services/health/service.rs:284 — health check query
  • src/orchestration/backoff_calculator.rs:232,245,290,345,368 — multiple queries

Pool-level acquire timeout (30s) provides partial mitigation.

Finding O-7 (MEDIUM): unwrap_or_default() on Config Fields

  • src/orchestration/event_systems/unified_event_coordinator.rs:89 — event system config
  • src/orchestration/bootstrap.rs:581 — namespace config
  • src/grpc/services/config.rs:96-97 — jwt_issuer and jwt_audience default to empty strings

Finding O-8 (MEDIUM): Error Context Loss

~12 instances of .map_err(|_| ...) discarding error context:

  • src/orchestration/bootstrap.rs:203 — oneshot send error
  • src/web/handlers/health.rs:53 — timeout error
  • src/web/handlers/tasks.rs:113 — UUID parse error

Finding O-9 (MEDIUM): Hardcoded Magic Numbers

  • src/services/task_service.rs:257-259 — per_page > 100 validation
  • src/orchestration/event_systems/orchestration_event_system.rs:142 — 24h max message age
  • src/services/analytics_query_service.rs:229 — 30.0s slow step threshold

Finding O-10 (LOW): gRPC Internal Error May Leak Details

Location: src/grpc/conversions.rs:152-153

tonic::Status::internal(error.to_string()) — depending on error Display implementations, could expose implementation details in gRPC error messages.

Finding O-11 (LOW): CORS Allows Any Origin

Location: src/web/mod.rs

CorsLayer::new()
    .allow_origin(tower_http::cors::Any)
    .allow_methods(tower_http::cors::Any)
    .allow_headers(tower_http::cors::Any)

Acceptable for alpha/API service, but should be configurable for production deployments.
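
A sketch of making the origin configurable while keeping today's permissive behavior as the fallback (the config plumbing is illustrative):

use axum::http::HeaderValue;
use tower_http::cors::{Any, CorsLayer};

// Illustrative: restrict CORS to a configured origin when one is provided,
// and only fall back to the current permissive behavior otherwise.
fn cors_layer(allowed_origin: Option<&str>) -> CorsLayer {
    match allowed_origin.and_then(|o| o.parse::<HeaderValue>().ok()) {
        Some(origin) => CorsLayer::new()
            .allow_origin(origin)
            .allow_methods(Any)
            .allow_headers(Any),
        None => CorsLayer::new()
            .allow_origin(Any)
            .allow_methods(Any)
            .allow_headers(Any),
    }
}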

Crate 4: tasker-worker

Overall Rating: A- (Strong FFI safety with one notable gap)

The tasker-worker crate handles handler dispatch, FFI integration, and completion processing. Despite complex FFI requirements, it achieves this with zero unsafe blocks in the crate itself.

Strengths

  • Zero unsafe code despite handling Ruby/Python FFI integration
  • All SQL queries via sqlx macros — no string interpolation
  • Handler panic containment: catch_unwind() + AssertUnwindSafe wraps all handler calls
  • Error classification preserved: Permanent/Retryable distinction maintained across FFI boundary
  • Fire-and-forget callbacks: Spawned into runtime, 5s timeout, no deadlock risk
  • FFI completion circuit breaker: Latency-based, 100ms threshold, lock-free metrics
  • All MPSC channels bounded — full bounded-channel compliance
  • No production unwrap()/expect() in core paths

Finding W-1 (HIGH): checkpoint_yield Blocks FFI Thread Without Timeout

Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs:904

let result = self.config.runtime_handle.block_on(async {
    self.handle_checkpoint_yield_async(/* ... */).await
});

Uses block_on which blocks the Ruby/Python thread while persisting checkpoint data to the database. No timeout wrapper. If the database is slow, this blocks the FFI thread indefinitely, potentially exhausting the thread pool.

Recommendation: Add tokio::time::timeout() around the block_on body (configurable, suggest 10s default).
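
A sketch of the recommendation; the function boundary, persist future, and error types are placeholders for the real checkpoint path:

use std::time::Duration;

// Illustrative: bound the blocking checkpoint persistence so a slow database
// cannot pin the Ruby/Python FFI thread indefinitely.
fn checkpoint_yield_blocking(
    runtime: &tokio::runtime::Handle,
    persist: impl std::future::Future<Output = Result<(), sqlx::Error>>,
) -> Result<(), String> {
    runtime.block_on(async {
        tokio::time::timeout(Duration::from_secs(10), persist)
            .await
            .map_err(|_| "checkpoint persistence exceeded 10s".to_string())?
            .map_err(|e| e.to_string())
    })
}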

Finding W-2 (MEDIUM): Starvation Detection is Warning-Only

Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs:772-793

check_starvation_warnings() logs warnings but doesn’t enforce any action. Also requires manual invocation by the caller — no automatic monitoring loop.

Finding W-3 (MEDIUM): FFI Thread Safety Documentation Gap

The FfiDispatchChannel uses Arc<Mutex<mpsc::Receiver>> (thread-safe) but lacks documentation about thread-safety guarantees, poll() contention behavior, and block_on safety in FFI context.

Finding W-4 (MEDIUM): #[allow] vs #[expect] (Lint Policy)

5 instances in web/middleware/mod.rs and web/middleware/request_id.rs.

Finding W-5 (MEDIUM): Missing Database Query Timeouts

Same systemic pattern as other crates. Checkpoint service and step claim queries lack explicit timeout wrappers.

Finding W-6 (LOW): unwrap_or_default() in worker/core.rs

Several instances; they appear to cover optional fields (likely legitimate) but warrant review.


Crates 5-6: tasker-client & tasker-cli

Overall Rating: A (Excellent — cleanest crates in the workspace)

These client crates demonstrate the strongest compliance across all audit dimensions. Notably, lint policy compliant (using #[expect] already). No Critical or High findings.

Strengths

  • No unsafe code in either crate
  • No hardcoded credentials — all auth from env vars or config files
  • RSA key generation validates minimum 2048-bit keys
  • Proper error context preservation in all From conversions
  • Complete transport abstraction: REST and gRPC both implement 11/11 methods
  • HTTP/gRPC timeouts configured: 30s request, 10s connect
  • Exponential backoff retry for create_task with configurable max retries
  • Lint policy compliant — uses #[expect] with reasons
  • User-facing CLI errors informative without leaking internals

Finding C-1 (MEDIUM): TLS Certificate Validation Not Explicitly Enforced

Location: tasker-client/src/api_clients/orchestration_client.rs:220

HTTP client uses reqwest::Client::builder() without explicitly setting .danger_accept_invalid_certs(false). Default is secure, but explicit enforcement prevents accidental changes.
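
A sketch of stating the secure defaults explicitly (the builder options shown are standard reqwest settings; the exact set used by the client may differ):

use std::time::Duration;

// Illustrative: state the secure TLS default explicitly so a future edit
// cannot silently relax certificate validation.
fn build_http_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .danger_accept_invalid_certs(false) // explicit, even though it is the default
        .timeout(Duration::from_secs(30))
        .connect_timeout(Duration::from_secs(10))
        .build()
}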

Finding C-2 (MEDIUM): Default URLs Use HTTP

Location: tasker-client/src/config.rs:276

Default base_url is http://localhost:8080. Credentials transmitted over HTTP are vulnerable to interception. Appropriate for local dev, but should warn when HTTP is used with authentication enabled.

Finding C-3 (MEDIUM): Retry Logic Only on create_task

Other operations (get_task, list_tasks, etc.) do not retry on transient failures. Should either extend retry logic or document the limitation.

Finding C-4 (LOW): Production expect() in Config Initialization

tasker-client/src/api_clients/orchestration_client.rs:123 — panics if config is malformed. Acceptable during startup but could return Result instead.


Crates 7-10: Language Workers (Rust, Ruby, Python, TypeScript)

Overall Rating: A- (Strong FFI engineering, no critical gaps)

All 4 language workers share common architecture via FfiDispatchChannel for poll-based event dispatch. Audited ~22,000 lines of Rust FFI code plus language wrappers.

Strengths

  • TypeScript: Comprehensive panic handling — catch_unwind on all critical FFI functions, errors converted to JSON error responses
  • Ruby/Python: Managed FFI via Magnus and PyO3 — these frameworks handle panic unwinding automatically via their exception systems
  • Error classification preserved across all FFI boundaries: Permanent/Retryable distinction maintained
  • Fire-and-forget callbacks: No deadlock risk identified
  • Starvation detection functional in all workers
  • Proper Arc usage for thread-safe shared ownership across FFI
  • TypeScript C FFI: Correct string memory management with into_raw()/from_raw() pattern and free_rust_string() for caller cleanup
  • Checkpoint support uniformly implemented across all 4 workers
  • Consistent error hierarchy across all languages

Finding LW-1 (MEDIUM): TypeScript FFI Missing Safety Documentation

Location: workers/typescript/src-rust/lib.rs:38

#![allow(clippy::missing_safety_doc)] — suppresses docs for 9 unsafe extern "C" functions. Should use #[expect] per lint policy and add # Safety sections.
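
A hedged sketch of what one documented function could look like; the real signature in the TypeScript worker's src-rust crate may differ:

/// Frees a string previously handed to the TypeScript side via into_raw().
///
/// # Safety
///
/// `ptr` must have been produced by CString::into_raw in this library and must
/// not have been freed already; passing any other pointer is undefined behavior.
#[no_mangle]
pub unsafe extern "C" fn free_rust_string(ptr: *mut std::os::raw::c_char) {
    if !ptr.is_null() {
        // Retake ownership so the CString is dropped and its buffer released.
        unsafe { drop(std::ffi::CString::from_raw(ptr)) };
    }
}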

Finding LW-2 (MEDIUM): Rust Worker #[allow(dead_code)] (Lint Policy)

Location: workers/rust/src/event_subscribers/logging_subscriber.rs:60,98,132

3 instances of #[allow(dead_code)] instead of #[expect].

Finding LW-3 (LOW): Ruby Bootstrap Uses expect() on Ruby Runtime

Location: workers/ruby/ext/tasker_core/src/bridge.rs:19-20, bootstrap.rs:29-30

Ruby::get().expect("Ruby runtime should be available") — safe in practice (guaranteed by Magnus FFI contract) but could use ? for defensive programming.

Finding LW-4 (LOW): Timeout Cleanup Requires Manual Polling

cleanup_timeouts() exists in all FFI workers but documentation doesn’t specify recommended polling frequency. Workers must call this periodically.

Finding LW-5 (LOW): Ruby Tokio Thread Pool Hardcoded to 8

Location: workers/ruby/ext/tasker_core/src/bootstrap.rs:74-79

Hardcoded .worker_threads(8) for M2/M4 Pro compatibility. Python/TypeScript use defaults. Consider making configurable.


Cross-Cutting Concerns

Dependency Audit (cargo audit)

Finding X-1 (HIGH): bytes v1.11.0 Integer Overflow (RUSTSEC-2026-0007)

Published 2026-02-03. Integer overflow in BytesMut::reserve. Fix: upgrade to bytes >= 1.11.1. This is a transitive dependency used by tokio, hyper, axum, tonic, reqwest, sqlx — deeply embedded.

Recommendation: Add to workspace Cargo.toml: bytes = "1.11.1"

Finding X-2 (LOW): rustls-pemfile Unmaintained (RUSTSEC-2025-0134)

Transitive from lapin (RabbitMQ) → amq-protocol → tcp-stream → rustls-pemfile. No action available from this project; depends on upstream lapin update.

Clippy Compliance

Zero warnings across entire workspace with --all-targets --all-features. Excellent.

Systemic: #[allow] vs #[expect] (Lint Policy)

Instances of #[allow] remain across most of the workspace (per-crate counts below are taken from the individual findings). Distribution:

  • tasker-shared: ~5 instances
  • tasker-pgmq: 3 instances
  • tasker-orchestration: 21 instances (highest)
  • tasker-worker: 5 instances
  • tasker-client/cli: 0 (compliant)
  • Language workers: ~3 instances

Recommendation: Batch fix in a single PR — mechanical replacement of #[allow] → #[expect] with reason strings.
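
The mechanical shape of the replacement; the lint and reason string are illustrative:

// Before: the suppression stays even if the lint stops firing.
#[allow(dead_code)]
struct LegacyFactory;

// After: the compiler warns if the expectation is ever unfulfilled.
#[expect(dead_code, reason = "retained for the public factory API surface")]
struct MigratedFactory;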

Systemic: Database Query Timeouts

Found across tasker-shared, tasker-orchestration, tasker-worker, and tasker-pgmq. Individual sqlx::query! calls lack explicit tokio::time::timeout() wrappers. Pool-level acquire timeouts (30s) provide partial mitigation.

Recommendation: Consider PostgreSQL statement_timeout at the connection level as a blanket fix, or add tokio::time::timeout() around critical query paths.

Systemic: unwrap_or_default() on Required Fields (Tenet #11)

Found across tasker-shared (20+ instances), tasker-orchestration (3 instances), tasker-pgmq (1 instance). Silent failures on required fields violate the Fail Loudly principle.

Recommendation: Audit all instances and replace with explicit error handling for required fields.


Appendix: Methodology

Each crate was evaluated across these dimensions:

  1. Security — Input validation, SQL safety, auth checks, unsafe blocks, crypto, secrets
  2. Error Handling — Fail Loudly (Tenet #11), context preservation, structured errors
  3. Resilience — Bounded channels, timeouts, circuit breakers, backpressure
  4. Architecture — API surface, documentation consistency, test coverage, dead code
  5. FFI-Specific (language workers) — Error classification, deadlock risk, starvation detection, memory safety

Severity definitions follow the audit specification.


Appendix: Remediation Tracking

Remediation work items for all High-severity findings:

| Work Item | Findings | Priority | Summary |
|---|---|---|---|
| Dependency upgrade | X-1 | Urgent | Upgrade bytes to fix RUSTSEC-2026-0007 CVE |
| Queue name validation | S-1, P-1, P-2 | High | Add queue name and NOTIFY channel validation |
| Lint compliance cleanup | S-3, O-3, W-4, LW-1, LW-2, P-8 | Medium | Replace #[allow] with #[expect] workspace-wide |
| Shutdown and recovery hardening | O-1, O-2 | High | Add shutdown timeout and actor panic recovery |
| FFI checkpoint timeout | W-1 | High | Add timeout to checkpoint_yield block_on |
| Error message sanitization | S-2 | High | Sanitize database error messages in API responses |

Architecture Decision Records (ADRs)

This directory contains Architecture Decision Records that document significant design decisions in Tasker Core. Each ADR captures the context, decision, and consequences of a specific architectural choice.

ADR Index

Active Decisions

| ADR | Title | Date | Status |
|---|---|---|---|
| ADR-001 | Actor-Based Orchestration Architecture | 2025-10 | Accepted |
| ADR-002 | Bounded MPSC Channels | 2025-10 | Accepted |
| ADR-003 | Processor UUID Ownership Removal | 2025-10 | Accepted |
| ADR-004 | Backoff Strategy Consolidation | 2025-10 | Accepted |
| ADR-005 | Worker Dual-Channel Event System | 2025-12 | Accepted |
| ADR-006 | Worker Actor-Service Decomposition | 2025-12 | Accepted |
| ADR-007 | FFI Over WASM for Language Workers | 2025-12 | Accepted |
| ADR-008 | Handler Composition Pattern | 2025-12 | Accepted |

Root Cause Analyses

| Document | Title | Date |
|---|---|---|
| RCA | Parallel Execution Timing Bugs | 2025-12 |

ADR Template

When creating a new ADR, use this template:

# ADR: [Title]

**Status**: [Proposed | Accepted | Deprecated | Superseded]
**Date**: YYYY-MM-DD
**Ticket**: TAS-XXX

## Context

What is the issue that we're seeing that is motivating this decision or change?

## Decision

What is the change that we're proposing and/or doing?

## Consequences

What becomes easier or more difficult to do because of this change?

### Positive

- Benefit 1
- Benefit 2

### Negative

- Trade-off 1
- Trade-off 2

### Neutral

- Side effect that is neither positive nor negative

## Alternatives Considered

What other options were considered and why were they rejected?

### Alternative 1: [Name]

Description and why it was rejected.

### Alternative 2: [Name]

Description and why it was rejected.

## References

- Related documents
- External references

When to Create an ADR

Create an ADR when:

  1. Making a significant architectural change that affects multiple components
  2. Choosing between alternatives with meaningful trade-offs
  3. Establishing a pattern that should be followed consistently
  4. Removing or deprecating an existing pattern or approach
  5. Learning from an incident (RCA format)

Don’t create an ADR for:

  • Minor implementation details
  • Bug fixes without architectural impact
  • Documentation updates
  • Routine refactoring

ADR: Actor-Based Orchestration Architecture

Status: Accepted Date: 2025-10 Ticket: TAS-46

Context

The orchestration system used a command pattern with direct service delegation, but lacked formal boundaries between commands and lifecycle components. This created:

  1. Testing Complexity: Lifecycle components tightly coupled to command processor
  2. Unclear Boundaries: No formal interface between commands and lifecycle operations
  3. Limited Supervision: No standardized lifecycle hooks for resource management
  4. Inconsistent Patterns: Each component had different initialization patterns
  5. Coupling: Command processor had direct dependencies on multiple service instances

The command processor was 1,164 lines, mixing routing, hydration, validation, and delegation.

Decision

Adopt a lightweight actor pattern with message-based interfaces:

Core Abstractions:

  1. OrchestrationActor trait with lifecycle hooks (started(), stopped())
  2. Message trait for type-safe messages with associated Response type
  3. Handler<M> trait for async message processing
  4. ActorRegistry for centralized actor management

Four Orchestration Actors:

  1. TaskRequestActor: Task initialization and request processing
  2. ResultProcessorActor: Step result processing
  3. StepEnqueuerActor: Step enqueueing coordination
  4. TaskFinalizerActor: Task finalization with atomic claiming

Implementation Approach:

  • Greenfield migration (no dual support)
  • Actors wrap existing services, not replace them
  • Arc-wrapped actors for efficient cloning across threads
  • No full actor framework (keeping it lightweight)

Consequences

Positive

  • 92% reduction in command processor complexity (1,575 LOC → 123 LOC main file)
  • Clear boundaries: Each actor handles specific message types
  • Testability: Message-based interfaces enable isolated testing
  • Consistent patterns: Established migration pattern for all actors
  • Lifecycle management: Standardized started()/stopped() hooks
  • Thread safety: Arc-wrapped actors with Send+Sync guarantees

Negative

  • Additional abstraction: One more layer between commands and services
  • Learning curve: New pattern to understand
  • Message overhead: ~100-500ns per actor call (acceptable for our use case)
  • Not a full framework: Lacks supervision trees, mailboxes, etc.

Neutral

  • Services remain unchanged; actors are thin wrappers
  • Performance impact minimal (<1μs per operation)

Alternatives Considered

Alternative 1: Full Actor Framework (Actix)

Would provide supervision, mailboxes, and advanced patterns.

Rejected: Too heavyweight for our needs. We need lifecycle hooks and message-based testing, not a full distributed actor system.

Alternative 2: Keep Direct Service Delegation

Continue with command processor calling services directly.

Rejected: Doesn’t address testing complexity, unclear boundaries, or lifecycle management needs.

Alternative 3: Trait-Based Service Abstraction

Define Service trait and implement on each lifecycle component.

Partially adopted: Combined with actor pattern. Services implement business logic; actors provide message-based coordination.

References

ADR: Bounded MPSC Channel Migration

Status: Implemented Date: 2025-10-14 Decision Makers: Engineering Team Ticket: TAS-51

Context and Problem Statement

Prior to this change, the tasker-core system had inconsistent and risky MPSC channel usage:

  1. Unbounded Channels (3 critical sites): Risk of unbounded memory growth under load

    • PGMQ notification listener: Could exhaust memory during notification bursts
    • Event publisher: Vulnerable to event storms
    • Ruby FFI handler: No backpressure across FFI boundary
  2. Configuration Disconnect (6 sites): TOML configuration existed but wasn’t used

    • Hard-coded values (100, 1000) with no rationale
    • Test/dev/prod environments used identical capacities
    • No ability to tune without code changes
  3. No Backpressure Strategy: Missing overflow handling policies

    • No monitoring of channel saturation
    • No documented behavior when channels fill
    • No metrics for operational visibility

Production Impact

  • Memory Risk: OOM possible under high database notification load (10k+ tasks enqueued)
  • Operational Pain: Cannot tune channel sizes without code deployment
  • Environment Mismatch: Test environment uses production-scale buffers, masking issues
  • Technical Debt: Wasted configuration infrastructure

Decision

Migrate to 100% bounded, configuration-driven MPSC channels with explicit backpressure handling.

Key Principles

  1. All Channels Bounded: Zero unbounded_channel() calls in production code
  2. Configuration-Driven: All capacities from TOML with environment overrides
  3. Separation of Concerns: Infrastructure (sizing) separate from business logic (retry behavior)
  4. Explicit Backpressure: Document and implement overflow policies
  5. Full Observability: Metrics for channel saturation and overflows

Solution Architecture

Configuration Structure

Created unified MPSC channel configuration in config/tasker/base/mpsc_channels.toml:

[mpsc_channels]

# Orchestration subsystem
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 1000

[mpsc_channels.orchestration.event_listeners]
pgmq_event_buffer_size = 10000  # Large for notification bursts

# Task readiness subsystem
[mpsc_channels.task_readiness.event_channel]
buffer_size = 1000
send_timeout_ms = 1000

# Worker subsystem
[mpsc_channels.worker.command_processor]
command_buffer_size = 1000

[mpsc_channels.worker.in_process_events]
broadcast_buffer_size = 1000  # Rust → Ruby FFI

# Shared/cross-cutting
[mpsc_channels.shared.event_publisher]
event_queue_buffer_size = 5000

[mpsc_channels.shared.ffi]
ruby_event_buffer_size = 1000

# Overflow policy
[mpsc_channels.overflow_policy]
log_warning_threshold = 0.8  # Warn at 80% full
drop_policy = "block"

Environment-Specific Overrides

Production (config/tasker/environments/production/mpsc_channels.toml):

  • Orchestration command: 5000 (5x base)
  • PGMQ listeners: 50000 (5x base) - handles bulk task creation bursts
  • Event publisher: 10000 (2x base)

Development (config/tasker/environments/development/mpsc_channels.toml):

  • Task readiness: 500 (0.5x base)
  • Worker FFI: 500 (0.5x base)

Test (config/tasker/environments/test/mpsc_channels.toml):

  • Orchestration command: 100 (0.1x base) - exposes backpressure issues
  • Task readiness: 100 (0.1x base)

Critical Implementation Detail

Environment override files MUST use full [mpsc_channels.*] prefix:

# ✅ CORRECT
[mpsc_channels.task_readiness.event_channel]
buffer_size = 100

# ❌ WRONG - creates top-level key that overrides correct config
[task_readiness.event_channel]
buffer_size = 100

This was discovered during implementation when environment files created conflicting top-level configuration keys.

Configuration Migration

Migrated MPSC sizing fields from event_systems.toml to mpsc_channels.toml:

Moved to mpsc_channels.toml:

  • event_systems.task_readiness.metadata.event_channel.buffer_size
  • event_systems.task_readiness.metadata.event_channel.send_timeout_ms
  • event_systems.worker.metadata.in_process_events.broadcast_buffer_size

Kept in event_systems.toml (event processing logic):

  • event_systems.task_readiness.metadata.event_channel.max_retries
  • event_systems.task_readiness.metadata.event_channel.backoff

Rationale: Separation of concerns - infrastructure sizing vs business logic behavior.

Rust Type System

Created comprehensive type system in tasker-shared/src/config/mpsc_channels.rs:

pub struct MpscChannelsConfig {
    pub orchestration: OrchestrationChannelsConfig,
    pub task_readiness: TaskReadinessChannelsConfig,
    pub worker: WorkerChannelsConfig,
    pub shared: SharedChannelsConfig,
    pub overflow_policy: OverflowPolicyConfig,
}

All channel creation sites updated to use configuration:

// Before
let (tx, rx) = mpsc::unbounded_channel();

// After
let buffer_size = config.mpsc_channels
    .orchestration.event_listeners.pgmq_event_buffer_size;
let (tx, rx) = mpsc::channel(buffer_size);

Observability

ChannelMonitor Integration:

  • Tracks channel usage in real-time
  • Logs warnings at 80% saturation
  • Exposes metrics via OpenTelemetry

Metrics Available:

  • mpsc_channel_usage_percent - Current channel fill percentage
  • mpsc_channel_capacity - Configured capacity
  • Component and channel name labels for filtering

Consequences

Positive

  1. Memory Safety: Bounded channels prevent OOM from unbounded growth
  2. Operational Flexibility: Tune channel sizes via configuration without code changes
  3. Environment Appropriateness: Test uses small buffers (exposes issues), production uses large buffers (handles load)
  4. Observability: Channel saturation visible in metrics and logs
  5. Documentation: Clear guidelines for future channel additions

Negative

  1. Backpressure Complexity: Must handle full channel conditions
  2. Configuration Overhead: More configuration files to maintain
  3. Tuning Required: May need adjustment based on production load patterns

Neutral

  1. No Performance Impact: Bounded channels with appropriate sizes perform identically to unbounded
  2. Backward Compatible: Existing deployments automatically use new defaults

Implementation Notes

Backpressure Strategies by Component

PGMQ Notification Listener:

  • Strategy: Block sender (apply backpressure)
  • Rationale: Cannot drop database notifications
  • Buffer: Large (10000 base, 50000 production) to handle bursts

Event Publisher:

  • Strategy: Drop events with metrics when full
  • Rationale: Internal events are non-critical
  • Buffer: Medium (5000 base, 10000 production)

Ruby FFI Handler:

  • Strategy: Return error to Rust (signal backpressure)
  • Rationale: Ruby must handle gracefully
  • Buffer: Standard (1000) with monitoring

Sizing Guidelines

Command Channels (orchestration, worker):

  • Base: 1000
  • Test: 100 (expose issues)
  • Production: 2000-5000 (concurrent load)

Event Channels:

  • Base: 1000
  • Production: Higher if event-driven architecture

Notification Channels:

  • Base: 10000 (burst handling)
  • Production: 50000 (bulk operations)

Validation

Testing Performed

  1. Unit Tests: Configuration loading and validation ✅
  2. Integration Tests: All tests pass with bounded channels ✅
  3. Local Verification: Service starts successfully in test environment ✅
  4. Configuration Verification: All environments load correctly ✅

Success Criteria Met

  • ✅ Zero unbounded channels in production code
  • ✅ 100% configurable channel capacities
  • ✅ Environment-specific overrides working
  • ✅ Backpressure handling implemented
  • ✅ Observability through ChannelMonitor
  • ✅ All tests passing

Future Considerations

  1. Dynamic Sizing: Consider runtime buffer adjustment based on load (not current scope)
  2. Priority Queues: Allow critical events to bypass overflow drops (evaluate based on metrics)
  3. Notification Coalescing: Reduce PGMQ notification volume during bursts (future optimization)
  4. Advanced Metrics: Percentile latencies for channel send operations

References

  • Configuration Files: config/tasker/base/mpsc_channels.toml
  • Rust Module: tasker-shared/src/config/mpsc_channels.rs
  • Related ADRs: Command Pattern, Actor Pattern

Lessons Learned

  1. Configuration Structure Matters: Environment override files must use proper prefixes
  2. Separation of Concerns: Keep infrastructure config (sizing) separate from business logic (retry behavior)
  3. Test Appropriately: Small buffers in test environment expose backpressure issues early
  4. Migration Strategy: Moving config fields requires coordinated struct updates across all files
  5. Type Safety: Rust’s type system caught many configuration mismatches during development

Decision: Approved and Implemented Review Date: 2025-10-14 Next Review: 2026-Q1 (evaluate sizing based on production metrics)

ADR: Processor UUID Ownership Removal

Status: Accepted Date: 2025-10 Ticket: TAS-54

Context

When orchestrators crash with tasks in active processing states (Initializing, EnqueuingSteps, EvaluatingResults), the processor UUID ownership enforcement prevented new orchestrators from taking over. Tasks became permanently stuck until manual intervention.

Root Cause: Three states required ownership enforcement (the original state machine pattern), but when orchestrator A crashed and orchestrator B tried to recover, the ownership check failed: B != A.

Production Impact:

  • Stuck tasks requiring manual intervention
  • Orchestrator restarts caused task processing to halt
  • 15-second gap between crash and retry, but tasks permanently blocked

Decision

Move to audit-only processor UUID tracking:

  1. Keep processor UUID in all transitions (audit trail for debugging)
  2. Remove ownership enforcement from state transitions
  3. Rely on existing state machine guards for idempotency
  4. Add configuration flag for gradual rollout

Key Insight: The original problem (race conditions) had been solved by multiple other mechanisms:

  • Atomic finalization claiming via SQL functions
  • Command pattern with stateless async processors
  • Actor pattern with 4 production-ready actors

Idempotency Without Ownership

| Actor | Idempotency Mechanism | Race Condition Protection |
|---|---|---|
| TaskRequestActor | identity_hash unique constraint | Transaction atomicity |
| ResultProcessorActor | Current state guards | State machine atomicity |
| StepEnqueuerActor | SQL function atomicity | PGMQ transactional operations |
| TaskFinalizerActor | Atomic claiming | SQL compare-and-swap |

Consequences

Positive

  • Task recovery: Tasks automatically recover after orchestrator crashes
  • Zero manual interventions: Stuck task count approaches zero
  • Audit trail preserved: Full debugging capability retained
  • Instant rollback: Configuration flag allows quick revert

Negative

  • New debugging patterns: Processor ownership changes visible in audit trail
  • Team training: Operators need to understand audit-only interpretation

Neutral

  • No database schema changes required
  • No performance impact (one fewer query per transition)

Alternatives Considered

Alternative 1: Timeout-Based Ownership Transfer

Add timeout after which ownership can be claimed by another processor.

Rejected: Adds complexity; existing idempotency guards make ownership redundant entirely.

Alternative 2: Keep Ownership Enforcement

Continue with existing ownership enforcement behavior, add manual recovery tools.

Rejected: Doesn’t address root cause; manual intervention doesn’t scale.

References

ADR: Backoff Logic Consolidation

Status: Implemented Date: 2025-10-29 Deciders: Engineering Team Ticket: TAS-57

Context

The tasker-core distributed workflow orchestration system had multiple, potentially conflicting implementations of exponential backoff logic for step retry coordination. This created several critical issues:

Problems Identified

  1. Configuration Conflicts: Three different maximum backoff values existed across the system:

    • SQL Migration (hardcoded): 30 seconds
    • Rust Code Default: 60 seconds
    • TOML Configuration: 300 seconds
  2. Race Conditions: No atomic guarantees on backoff updates when multiple orchestrators processed the same step failure simultaneously, leading to potential lost updates and inconsistent state.

  3. Implementation Divergence: Dual calculation paths (Rust BackoffCalculator vs SQL fallback) could produce different results due to:

    • Different time sources (last_attempted_at vs failure_time)
    • Hardcoded vs configurable parameters
    • Lack of timestamp synchronization
  4. Hardcoded SQL Values: The SQL migration contained non-configurable exponential backoff logic:

    -- Old hardcoded implementation
    power(2, COALESCE(attempts, 1)) * interval '1 second', interval '30 seconds'
    

Decision

We consolidated the backoff logic with the following architectural decisions:

1. Single Source of Truth: TOML Configuration

Decision: All backoff parameters originate from TOML configuration files.

Rationale:

  • Centralized configuration management
  • Environment-specific overrides (test/development/production)
  • Runtime validation and type safety
  • Clear documentation of system behavior

Implementation:

# config/tasker/base/orchestration.toml
[backoff]
default_backoff_seconds = [1, 2, 4, 8, 16, 32]
max_backoff_seconds = 60  # Standard across all environments
backoff_multiplier = 2.0
jitter_enabled = true
jitter_max_percentage = 0.1

2. Standard Maximum Backoff: 60 Seconds

Decision: Standardize on 60 seconds as the maximum backoff delay.

Rationale:

  • Balance: 60 seconds balances retry speed with system load reduction
  • Not Too Short: 30 seconds (old SQL max) insufficient for rate limiting scenarios
  • Not Too Long: 300 seconds (old TOML config) creates excessive delays in failure scenarios
  • Alignment: Matches Rust code defaults and production requirements

Impact:

  • Tasks recover faster from transient failures
  • Rate-limited APIs get adequate cooldown
  • User experience improved with reasonable retry times

3. Parameterized SQL Functions

Decision: SQL functions accept configuration parameters with sensible defaults.

Implementation:

CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
    backoff_request_seconds INTEGER,
    last_attempted_at TIMESTAMP,
    failure_time TIMESTAMP,
    attempts INTEGER,
    p_max_backoff_seconds INTEGER DEFAULT 60,
    p_backoff_multiplier NUMERIC DEFAULT 2.0
) RETURNS TIMESTAMP

Rationale:

  • Eliminates hardcoded values in SQL
  • Allows runtime configuration without schema changes
  • Maintains SQL fallback safety net
  • Defaults prevent breaking existing code

4. Atomic Backoff Updates with Row-Level Locking

Decision: Use PostgreSQL SELECT FOR UPDATE for atomic backoff updates.

Implementation:

// Rust BackoffCalculator (sketch; the concrete error type may differ)
async fn update_backoff_atomic(&self, step_uuid: &Uuid, delay_seconds: u32) -> Result<(), sqlx::Error> {
    let mut tx = self.pool.begin().await?;

    // Acquire row-level lock
    sqlx::query!("SELECT ... FROM tasker_workflow_steps WHERE ... FOR UPDATE")
        .fetch_one(&mut *tx).await?;

    // Update with lock held
    sqlx::query!("UPDATE tasker_workflow_steps SET ...")
        .execute(&mut *tx).await?;

    tx.commit().await?;
    Ok(())
}

Rationale:

  • Correctness: Prevents lost updates from concurrent orchestrators
  • Simplicity: PostgreSQL’s row-level locking is well-understood and reliable
  • Performance: Minimal overhead - locks only held during UPDATE operation
  • Idempotency: Multiple retries produce consistent results

Alternative Considered: Optimistic concurrency with version field

  • Rejected: More complex implementation, retry logic in application layer
  • Benefit of Chosen Approach: Database guarantees atomicity

5. Timing Consistency: Update last_attempted_at with backoff_request_seconds

Decision: Always update both backoff_request_seconds and last_attempted_at atomically.

Rationale:

  • SQL fallback calculation: last_attempted_at + backoff_request_seconds
  • Prevents timing window where calculation uses stale timestamp
  • Single transaction ensures consistency

Before:

// Old: Only updated backoff_request_seconds
sqlx::query!("UPDATE tasker_workflow_steps SET backoff_request_seconds = $1 ...")

After:

// New: Updates both atomically
sqlx::query!(
    "UPDATE tasker_workflow_steps
     SET backoff_request_seconds = $1,
         last_attempted_at = NOW()
     WHERE ..."
)

6. Dual-Path Strategy: Rust Primary, SQL Fallback

Decision: Maintain both Rust calculation and SQL fallback, but ensure they use same configuration.

Rationale:

  • Rust Primary: Fast, configurable, with jitter support
  • SQL Fallback: Safety net if backoff_request_seconds is NULL
  • Consistency: Both paths now use same max delay and multiplier

Path Selection Logic:

CASE
    -- Primary: Rust-calculated backoff
    WHEN backoff_request_seconds IS NOT NULL AND last_attempted_at IS NOT NULL THEN
        last_attempted_at + (backoff_request_seconds * interval '1 second')

    -- Fallback: SQL exponential with configurable params
    WHEN failure_time IS NOT NULL THEN
        failure_time + LEAST(
            power(p_backoff_multiplier, attempts) * interval '1 second',
            p_max_backoff_seconds * interval '1 second'
        )

    ELSE NULL
END

Consequences

Positive

  1. Configuration Clarity: Single max_backoff_seconds value (60s) across entire system
  2. Race Condition Prevention: Atomic updates guarantee correctness in distributed deployments
  3. Flexibility: Parameterized SQL allows future config changes without migrations
  4. Timing Consistency: Synchronized timestamp updates eliminate calculation errors
  5. Maintainability: Clear separation of concerns - Rust for calculation, SQL for fallback
  6. Test Coverage: All 518 unit tests pass, validating correctness

Negative

  1. Performance Overhead: Row-level locking adds ~1-2ms per backoff update

    • Mitigation: Negligible compared to step execution time (typically seconds)
    • Acceptable Trade-off: Correctness more important than microseconds
  2. Lock Contention Risk: High-frequency failures on same step could cause lock queuing

    • Mitigation: Exponential backoff naturally spreads out retries
    • Monitoring: Added metrics for lock contention detection
    • Real-World Impact: Minimal - failures are infrequent by design
  3. Complexity: Transaction management adds code complexity

    • Mitigation: Encapsulated in update_backoff_atomic() method
    • Benefit: Hidden behind clean interface, testable in isolation

Neutral

  1. Breaking Change: SQL function signature changed (added parameters)

    • Not an Issue: Greenfield alpha project, no production dependencies
    • Future-Proof: Default parameters maintain backward compatibility
  2. Configuration Migration: Changed max from 300s → 60s

    • Impact: Tasks retry faster, reducing user-perceived latency
    • Validation: All tests pass with new values

Validation

Testing

  1. Unit Tests: All 518 unit tests pass

    • BackoffCalculator calculation correctness
    • Jitter bounds validation
    • Max cap enforcement
  2. Database Tests: SQL function behavior validated

    • Parameterization with various max values
    • Exponential calculation matches Rust
    • Boundary conditions (attempts 0, 10, 20)
  3. Integration Tests: End-to-end flow verified

    • Worker failure → Backoff applied → Readiness respects delay
    • SQL fallback when backoff_request_seconds NULL
    • Rust and SQL calculations produce consistent results

Verification Steps Completed

  • ✅ Configuration alignment (TOML, Rust defaults)
  • ✅ SQL function rewrite with parameters
  • ✅ BackoffCalculator atomic updates implemented
  • ✅ Database reset successful with new migration
  • ✅ All unit tests passing
  • ✅ Architecture documentation updated

Implementation Notes

Files Modified

  1. Configuration:

    • config/tasker/base/orchestration.toml: max_backoff_seconds = 60
    • tasker-shared/src/config/tasker.rs: jitter_max_percentage = 0.1
  2. Database Migration:

    • migrations/20250927000000_add_waiting_for_retry_state.sql: Parameterized functions
  3. Rust Implementation:

    • tasker-orchestration/src/orchestration/backoff_calculator.rs: Atomic updates
  4. Documentation:

    • docs/task-and-step-readiness-and-execution.md: Backoff section added
    • This ADR

Migration Path

Since this is greenfield alpha:

  1. Drop and recreate test database
  2. Run migrations with updated SQL functions
  3. Rebuild sqlx cache
  4. Run full test suite

Future Production Path (when needed):

  1. Deploy parameterized SQL functions alongside old functions
  2. Update Rust code to use new atomic methods
  3. Enable in staging, monitor metrics
  4. Gradual production rollout with feature flag
  5. Remove old functions after validation period

Future Enhancements

Potential Improvements (Post-Alpha)

  1. Configuration Table: Store backoff config in database for runtime updates
  2. Metrics: OpenTelemetry metrics for backoff application and lock contention
  3. Adaptive Backoff: Adjust multiplier based on system load or error patterns
  4. Per-Namespace Policies: Different backoff configs per workflow namespace
  5. Backoff Profiles: Named profiles (aggressive, moderate, conservative)

Monitoring Recommendations

Key Metrics to Track:

  • backoff_calculation_duration_seconds: Time to calculate and apply backoff
  • backoff_lock_contention_total: Lock acquisition failures
  • backoff_sql_fallback_total: Frequency of SQL fallback usage
  • backoff_delay_applied_seconds: Histogram of actual delays

Alert Conditions:

  • SQL fallback usage > 5% (indicates Rust path failing)
  • Lock contention > threshold (indicates hot spots)
  • Backoff delays > max_backoff_seconds (configuration issue)

References


Decision Status: ✅ Implemented and Validated (2025-10-29)

ADR: Worker Dual-Channel Event System

Status: Accepted Date: 2025-12 Ticket: TAS-67

Context

The original Rust worker used a blocking .call() pattern in the event handler:

let result = handler.call(&event.payload.task_sequence_step).await;  // BLOCKS

This created effectively sequential execution even for independent steps, preventing true concurrency and causing domain event race conditions where downstream systems saw events before orchestration processed results.

Decision

Adopt a dual-channel command pattern where handler invocation is fire-and-forget, and completions flow back through a separate channel.

Architecture:

[1] WorkerEventSystem receives StepExecutionEvent
        ↓
[2] ActorCommandProcessor routes to StepExecutorActor
        ↓
[3] StepExecutorActor claims step, publishes to HANDLER DISPATCH CHANNEL
        ↓ (fire-and-forget, non-blocking)
[4] HandlerDispatchService receives from channel
        ↓
[5] Resolves handler from registry, invokes handler.call()
        ↓
[6] Handler completes, publishes to COMPLETION CHANNEL
        ↓
[7] CompletionProcessorService receives from channel
        ↓
[8] Routes to FFICompletionService → Orchestration queue

Key Design Decisions:

  1. Bounded Parallel Execution: Semaphore-bounded concurrency (configurable via TOML)
  2. Ordered Domain Events: Events fire AFTER result is committed to completion channel
  3. Comprehensive Error Handling: Panics, timeouts, handler errors all generate proper failure results
  4. Fire-and-Forget FFI Callbacks: runtime_handle.spawn() instead of block_on() prevents deadlocks (see the sketch below)
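
A minimal sketch of that fire-and-forget shape; the callback future and timeout handling are placeholders rather than the crate's actual API:

use std::time::Duration;

// Illustrative: run the FFI completion callback on the Tokio runtime instead
// of blocking the calling thread, and bound it with the 5s callback timeout.
fn dispatch_completion(
    runtime: &tokio::runtime::Handle,
    callback: impl std::future::Future<Output = ()> + Send + 'static,
) {
    runtime.spawn(async move {
        if tokio::time::timeout(Duration::from_secs(5), callback).await.is_err() {
            tracing::warn!("FFI completion callback exceeded 5s timeout");
        }
    });
}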

Consequences

Positive

  • True parallelism: Parallel handler execution with bounded concurrency
  • Eliminated race conditions: Domain events only fire after results committed
  • Comprehensive error handling: All failure modes produce proper step failures
  • Foundation for FFI: Reusable abstractions for Ruby/Python/TypeScript workers
  • Bug discovery: Parallel execution surfaced latent SQL precedence bug

Negative

  • Increased complexity: Two channels to manage instead of one
  • Debugging complexity: Tracing flow across multiple channels requires structured logging

Neutral

  • Channel saturation monitoring available via metrics
  • Configurable buffer sizes per environment

Risk Mitigations Implemented

| Risk | Mitigation |
|---|---|
| Semaphore acquisition failure | Generate failure result instead of silent exit |
| FFI polling starvation | Metrics + starvation warnings + timeout |
| Completion channel backpressure | Release permit before send |
| FFI thread runtime context | Fire-and-forget callbacks |

Alternatives Considered

Alternative 1: Thread Pool Pattern

Use dedicated thread pool for handler execution.

Rejected: Tokio already provides excellent async runtime; adding threads increases complexity without benefit.

Alternative 2: Single Channel with Priority Queue

Priority queue for completions within single channel.

Rejected: Doesn’t address the fundamental blocking issue; still couples dispatch and completion.

Alternative 3: Keep Blocking Pattern with Larger Buffer

Increase buffer size to mask sequential execution.

Rejected: Doesn’t solve concurrency; just delays the problem.

References

ADR: Worker Actor-Service Decomposition

Status: Accepted Date: 2025-12 Ticket: TAS-69

Context

The tasker-worker crate had a monolithic command processor architecture:

  • WorkerProcessor: 1,575 lines of code
  • All command handling inline
  • Difficult to test individual behaviors
  • Inconsistent with orchestration actor architecture

Decision

Transform the worker from monolithic command processor to actor-based design, mirroring the orchestration actor pattern.

Before: Monolithic Design

WorkerCore
    └── WorkerProcessor (1,575 LOC)
            └── All command handling inline

After: Actor-Based Design

WorkerCore
    └── ActorCommandProcessor (~350 LOC)
            └── WorkerActorRegistry
                    ├── StepExecutorActor → StepExecutorService
                    ├── FFICompletionActor → FFICompletionService
                    ├── TemplateCacheActor → TaskTemplateManager
                    ├── DomainEventActor → DomainEventSystem
                    └── WorkerStatusActor → WorkerStatusService

Five Actors:

| Actor | Responsibility | Messages |
|---|---|---|
| StepExecutorActor | Step execution coordination | 4 |
| FFICompletionActor | FFI completion handling | 2 |
| TemplateCacheActor | Template cache management | 2 |
| DomainEventActor | Event dispatching | 1 |
| WorkerStatusActor | Status and health | 4 |

Three Services:

| Service | Lines | Purpose |
|---|---|---|
| StepExecutorService | ~400 | Step claiming, verification, FFI invocation |
| FFICompletionService | ~200 | Result delivery to orchestration |
| WorkerStatusService | ~200 | Stats tracking, health reporting |

Consequences

Positive

  • 92% reduction in command processor complexity (1,575 LOC → 123 LOC main file)
  • Single responsibility: Each file handles one concern
  • Testability: Services testable in isolation, actors via message handlers
  • Consistency: Mirrors orchestration architecture
  • Extensibility: New actors/services follow established pattern

Negative

  • Two-phase initialization: Registry requires careful startup ordering
  • Actor shutdown ordering: Must coordinate graceful shutdown
  • Learning curve: New pattern to understand for contributors

Neutral

  • Public API unchanged (WorkerCore::new(), send_command(), stop())
  • Internal restructuring transparent to users

Gaps Identified and Fixed

| Gap | Issue | Fix |
|---|---|---|
| Domain Event Dispatch | Events not dispatched after step completion | Explicit dispatch call in actor |
| Silent Error Handling | Orchestration send errors swallowed | Explicit error propagation |
| Namespace Sharing | Registry created new manager, losing namespaces | Shared pre-initialized manager |

Alternatives Considered

Alternative 1: Service-Only Pattern

Extract services without actor layer.

Rejected: Loses message-based interfaces that enable testing and future distributed execution.

Alternative 2: Keep Monolithic with Better Organization

Refactor WorkerProcessor into methods without extraction.

Rejected: Doesn’t address testability or architectural consistency goals.

Alternative 3: Full Actor Framework (Actix)

Use production actor framework.

Rejected: Too heavyweight; we need lifecycle hooks and message-based testing, not distributed supervision.

References

ADR: FFI Over WASM for Language Workers

Status: Accepted Date: 2025-12 Ticket: TAS-100

Context

For the TypeScript worker implementation, we needed to decide between two integration approaches:

  1. FFI (Foreign Function Interface): Direct C ABI calls to compiled Rust library
  2. WASM (WebAssembly): Compile Rust to wasm32-wasi target

Ruby (Magnus) and Python (PyO3) workers already used FFI successfully.

Decision

Proceed with FFI for all language workers. Reserve WASM for future serverless handler execution.

Decision Matrix:

| Criteria | FFI | WASM |
|---|---|---|
| Pattern Consistency | Matches Ruby/Python | Requires new architecture |
| Production Readiness | Node FFI mature, Bun stabilizing | WASI networking immature |
| Implementation Speed | 2-3 weeks | 2-3 months + unknowns |
| PostgreSQL Access | Native via Rust | Needs host functions |
| Multi-threading | Full Tokio support | Single-threaded WASM |
| Async Runtime | Tokio works | Incompatible |
| Debugging | Standard tools | Limited tooling |

Score: FFI 8/10, WASM 3/10 for current requirements.

WASM Deal-Breakers:

  1. No mature PostgreSQL client for wasm32-wasi
  2. Single-threaded execution (our HandlerDispatchService relies on Tokio multi-threading)
  3. Tokio doesn’t compile to wasm32-wasi target
  4. WASI networking still experimental (Preview 2 adoption low)

Consequences

Positive

  • Pattern consistency: Single Rust codebase serves all four workers
  • Proven approach: Ruby/Python FFI already validated
  • Full feature access: PostgreSQL, PGMQ, Tokio, domain events all work
  • Standard debugging: lldb, gdb, structured logging across boundary
  • Fast implementation: Estimated 2-3 weeks for TypeScript worker

Negative

  • FFI safety concerns: Incorrect types can cause segfaults
  • Platform builds: Must distribute .dylib/.so/.dll per platform
  • Runtime compatibility: Different FFI semantics between Bun and Node

Neutral

  • Bun FFI experimental but fast-stabilizing
  • Pre-built binaries via GitHub releases address distribution

Future Vision

WASM Research: Revisit when WASI 0.3+ stabilizes with networking.

Serverless WASM Handlers:

  • Compile individual handlers to WASM (not orchestration)
  • Deploy to serverless platforms (AWS Lambda, Cloudflare Workers)
  • Cold start optimization (1ms vs 100ms)
  • Extreme scalability for compute-heavy workflows

Separation of Concerns:

  • Orchestration: Stays Rust (PostgreSQL, PGMQ, state machines)
  • Handlers: Optionally WASM (stateless compute units)

Alternatives Considered

Alternative 1: WASM with Host Functions

Implement database operations as host functions.

Rejected: Defeats the purpose - logic split between WASM and host, loses Rust benefits.

Alternative 2: Wait for WASI 0.3

Delay TypeScript worker until WASI matures.

Rejected: Timeline uncertain (6+ months); FFI works today.

Alternative 3: Spin Framework

Use Spin’s WASM abstraction layer.

Rejected: Framework lock-in; requires Spin APIs, can’t reuse Axum/Tower patterns.

References

ADR: Handler Composition Pattern

Status: Accepted Date: 2025-12 Ticket: TAS-112

Context

Cross-language step handler ergonomics research revealed an architectural inconsistency:

  • Batchable handlers: Already use composition via mixins (target pattern)
  • API handlers: Use inheritance (subclass pattern)
  • Decision handlers: Use inheritance (subclass pattern)

Current State:
✅ Batchable: class Handler(StepHandler, Batchable)  # Composition
❌ API: class Handler < APIHandler                    # Inheritance
❌ Decision: class Handler extends DecisionHandler    # Inheritance

Guiding Principle (Zen of Python): “There should be one– and preferably only one –obvious way to do it.”

Decision

Migrate all handler patterns to composition (mixins/traits), using batchable as the reference implementation.

Target Architecture:

All patterns use composition:
  Ruby:      include Base, include API, include Decision, include Batchable
  Python:    class Handler(StepHandler, API, Decision, Batchable)
  TypeScript: interface composition + mixins
  Rust:      trait composition (impl Base + API + Decision + Batchable)

Benefits:

  • Single responsibility - each mixin handles one concern
  • Flexible composition - handlers can mix capabilities as needed
  • Easier testing - can test each capability independently
  • Matches batchable pattern (already proven successful)

Example Migration:

# Old pattern (deprecated)
class MyHandler < TaskerCore::StepHandler::API
  def call(context)
    api_success(data)
  end
end

# New pattern
class MyHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::Mixins::API

  def call(context)
    api_success(data)
  end
end

Consequences

Positive

  • Consistent architecture: One pattern for all handler capabilities
  • Composable capabilities: Mix API + Decision + Batchable as needed
  • Testable in isolation: Each mixin can be tested independently
  • Matches proven pattern: Batchable already validates approach
  • Cross-language alignment: Same mental model in all languages

Negative

  • Breaking change: All existing handlers need migration
  • Learning curve: Contributors must understand mixin pattern
  • Migration effort: All examples and documentation need updates

Neutral

  • Pre-alpha status means breaking changes are acceptable
  • Migration can be phased with deprecation warnings

Ruby Result Unification

Ruby uses separate Success/Error classes while Python/TypeScript use unified result with success flag. Recommend unifying Ruby to match.

Rust Handler Traits

Rust needs ergonomic traits for API, Decision, and Batchable capabilities to match other languages:

pub trait APICapable {
    fn api_success(&self, data: Value, status: u16) -> StepExecutionResult;
    fn api_failure(&self, message: &str, status: u16) -> StepExecutionResult;
}

pub trait DecisionCapable {
    fn decision_success(&self, step_names: Vec<String>) -> StepExecutionResult;
    fn skip_branches(&self, reason: &str) -> StepExecutionResult;
}

FFI Boundary Types

Data structures crossing FFI boundaries must have identical serialization. Create explicit type mirrors in all languages:

  • DecisionPointOutcome
  • BatchProcessingOutcome
  • CursorConfig

Alternatives Considered

Alternative 1: Keep Inheritance Pattern

Continue with subclass pattern for API and Decision.

Rejected: Inconsistent with batchable; makes multi-capability handlers awkward.

Alternative 2: Migrate Batchable to Inheritance

Make batchable use inheritance to match others.

Rejected: Batchable composition is the better pattern; others should follow it.

Alternative 3: Language-Specific Patterns

Let each language use its idiomatic pattern.

Rejected: Violates cross-language consistency principle; increases cognitive load.

RCA: Parallel Execution Exposing Latent Timing Bugs

Date: 2025-12-07 Related: Worker Dual-Channel Event System Status: Resolved Impact: Flaky E2E test test_mixed_workflow_scenario


Executive Summary

During the dual-channel event system implementation (fire-and-forget handler dispatch), a previously hidden bug in the SQL function get_task_execution_context() became consistently reproducible. The bug was a logical precedence error that had always existed but was masked by sequential execution timing. Introducing true parallelism changed the probability distribution of state combinations, transforming a Heisenbug into a Bohrbug.

This document captures the root cause analysis as a reference for understanding how architectural changes to concurrency can surface latent bugs in distributed systems.


The Bug

Symptom

The test test_mixed_workflow_scenario intermittently failed with a timeout while waiting for the BlockedByFailures status, because the API kept returning HasReadySteps.

⏳ Waiting for task to fail (max 10s)...
   Task execution status: processing (processing)
   Task execution status: has_ready_steps (has_ready_steps)  ← Wrong!
   Task execution status: has_ready_steps (has_ready_steps)
   ... timeout ...

Root Cause

The SQL function get_task_execution_context() checked ready_steps > 0 BEFORE permanently_blocked_steps > 0:

-- BUGGY: Wrong precedence order
CASE
  WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps'           -- ← Checked FIRST
  WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures'
  ...
END as execution_status

When a task had BOTH permanently blocked steps AND ready steps, the function returned has_ready_steps instead of blocked_by_failures.

The Fix

Migration 20251207000000_fix_execution_status_priority.sql corrects the precedence:

-- FIXED: blocked_by_failures takes semantic priority
CASE
  WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures'  -- ← Now FIRST
  WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps'
  ...
END as execution_status

Why Did This Surface Now?

The Test Scenario

# 3 parallel steps with NO dependencies (can all run concurrently)
steps:
  - name: success_step
    retryable: false

  - name: permanent_error_step
    retryable: false          # Fails permanently

  - name: retryable_error_step
    retryable: true
    max_attempts: 2           # Fails, but becomes "ready" after backoff

Before: Blocking Handler Dispatch

The original architecture used blocking .call() in the event handler:

// workers/rust/src/event_handler.rs (before)
let result = handler.call(&step).await;  // BLOCKS until handler completes

This created effectively sequential execution even for independent steps:

Timeline (Sequential):
────────────────────────────────────────────────────────────────────

t=0ms     [success_step starts]
t=50ms    [success_step completes]
t=51ms    [permanent_error_step starts]
t=100ms   [permanent_error_step fails → PERMANENTLY BLOCKED]
t=101ms   [retryable_error_step starts]
t=150ms   [retryable_error_step fails → enters 100ms backoff]
t=151ms   ──► STATUS CHECK
              permanently_blocked_steps = 1
              ready_steps = 0 (still in backoff!)
              ──► Returns: blocked_by_failures ✓

The backoff hadn't elapsed yet because steps were processed one at a time.

After: Fire-and-Forget Handler Dispatch

The dual-channel event system introduced non-blocking dispatch via channels:

// Fire-and-forget pattern
dispatch_sender.send(DispatchHandlerMessage { step, ... }).await;
// Returns immediately - handler executes in separate task

This enables true parallel execution:

Timeline (Parallel):
────────────────────────────────────────────────────────────────────

t=0ms     [success_step starts]──────────────────►[completes t=50ms]
t=0ms     [permanent_error_step starts]──────────►[fails t=50ms → BLOCKED]
t=0ms     [retryable_error_step starts]──────────►[fails t=50ms → backoff]

t=150ms   [retryable_error_step backoff expires → becomes READY]

t=151ms   ──► STATUS CHECK
              permanently_blocked_steps = 1
              ready_steps = 1 (backoff elapsed!)
              ──► Returns: has_ready_steps ✗ (BUG!)

Probability Analysis

The “Both States” Window

The bug manifests when checking status while the task has BOTH:

  • At least one permanently blocked step
  • At least one ready step (e.g., retryable step after backoff)
Sequential Processing:
├────────────────────────────────────────────────────────────────────┤
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│ Very LOW probability of "both states" window                       │
│ Steps complete serially; backoff rarely overlaps with status check │
└────────────────────────────────────────────────────────────────────┘

Parallel Processing:
├────────────────────────────────────────────────────────────────────┤
│░░░░░░░░░░░░████████████████████████████████████████████░░░░░░░░░░░│
│            ↑                                          ↑            │
│            │ HIGH probability "both states" window    │            │
│            │ All steps complete ~simultaneously       │            │
│            │ Backoff expires while status is polled  │            │
└────────────────────────────────────────────────────────────────────┘

Quantifying the Change

MetricSequentialParallel
Step completion spread~150ms~50ms
“Both states” window duration~0ms (transient)~100ms+ (stable)
Probability of hitting bug<1%>50%
Bug classificationHeisenbugBohrbug

Bug Classification

Heisenbug → Bohrbug Transformation

PropertyBefore (Heisenbug)After (Bohrbug)
ReproducibilityIntermittent, timing-dependentConsistent, deterministic
Root causeLogical precedence errorSame
VisibilityHidden by sequential timingExposed by parallel timing
Debug difficultyExtremely hard (may never reproduce)Straightforward
Detection in CIMight pass for monthsFails consistently under load

Why This Matters

  1. The bug was always present - It existed in the SQL function since it was written
  2. Sequential execution hid it - Incidental timing prevented the problematic state
  3. Parallelization surfaced it - Not by introducing a bug, but by applying concurrency pressure
  4. This is good - Better to find in tests than production

Semantic Correctness

The Correct Mental Model

“If ANY step is permanently blocked, the task cannot make further progress toward completion, even if other steps are ready to execute.”

A task with permanent failures is blocked by failures regardless of what else might be runnable. The old code implicitly assumed:

“If work is available, we’re making progress”

This is incorrect for workflows where:

  • Convergence points require ALL branches to complete
  • Final task status depends on ALL steps succeeding
  • Partial progress doesn’t constitute overall success

State Precedence (Correct Order)

-- 1. Permanent failures block overall progress
WHEN permanently_blocked_steps > 0 THEN 'blocked_by_failures'

-- 2. Ready work can continue (but may not lead to completion)
WHEN ready_steps > 0 THEN 'has_ready_steps'

-- 3. Work in flight
WHEN in_progress_steps > 0 THEN 'processing'

-- 4. All done
WHEN completed_steps = total_steps THEN 'all_complete'

-- 5. Waiting for dependencies
ELSE 'waiting_for_dependencies'

Patterns to Watch For

1. State Combination Explosions

Sequential processing often means only one state at a time. Parallelism creates state combinations that were previously impossible:

Sequential: A → B → C (states are mutually exclusive in time)
Parallel:   A + B + C (states can coexist)

Watch for: CASE statements, if/else chains, and state machines that assume mutual exclusivity.
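
A hedged Rust sketch of the same idea: when states can coexist, the status calculation must be an explicit precedence over counts rather than a set of branches that assumes mutual exclusivity. The logic mirrors the corrected SQL, but the enum and struct names are illustrative, not the actual tasker-core types:

// Illustrative only: mirrors the corrected SQL precedence, not the engine's real types.
#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    BlockedByFailures,
    HasReadySteps,
    Processing,
    AllComplete,
    WaitingForDependencies,
}

struct StepCounts {
    permanently_blocked: u32,
    ready: u32,
    in_progress: u32,
    completed: u32,
    total: u32,
}

fn execution_status(c: &StepCounts) -> ExecutionStatus {
    // Order matters: blocked steps veto overall progress even if other steps are ready.
    if c.permanently_blocked > 0 {
        ExecutionStatus::BlockedByFailures
    } else if c.ready > 0 {
        ExecutionStatus::HasReadySteps
    } else if c.in_progress > 0 {
        ExecutionStatus::Processing
    } else if c.completed == c.total {
        ExecutionStatus::AllComplete
    } else {
        ExecutionStatus::WaitingForDependencies
    }
}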

2. Timing-Dependent Invariants

Code may accidentally depend on timing:

// Assumes step_a completes before step_b starts
if step_a.is_complete() {
    // Safe to check step_b
}

Watch for: Implicit ordering assumptions in status calculations, rollups, and aggregations.

3. Transient vs Stable States

Some states that were transient under sequential processing become stable under parallel execution:

StateSequentialParallel
“1 complete, 1 in-progress”Transient (~ms)Stable (seconds)
“blocked + ready”Nearly impossibleCommon
“multiple errors”RareFrequent

Watch for: Error handling, status rollups, and progress calculations that assumed single-state scenarios.

4. Test Timing Sensitivity

Tests written for sequential execution may have implicit timing dependencies:

// This worked when steps were sequential
wait_for_status(BlockedByFailures, timeout: 10s);

// But fails when parallel execution creates a different status first

Watch for: Tests that pass in isolation but fail under concurrent load.


Verification Strategy

After Parallelization Changes

  1. Run tests multiple times - Timing bugs may not manifest on first run
  2. Run tests under load - Concurrent test execution increases probability
  3. Add explicit state combination tests - Test scenarios that were previously impossible
  4. Review CASE/if-else precedence - Check all status calculations for correct ordering

Example: Testing State Combinations

#[tokio::test]
async fn test_blocked_with_ready_steps() {
    // Explicitly create the state combination
    let task = create_task_with_parallel_steps();

    // Force one step to permanent failure
    force_step_to_permanent_failure(&task, "step_a").await;

    // Force another step to ready (after backoff)
    force_step_to_ready_after_backoff(&task, "step_b").await;

    // Verify correct precedence
    let status = get_task_execution_status(&task).await;
    assert_eq!(status, ExecutionStatus::BlockedByFailures);
}

Conclusion

This bug exemplifies how architectural improvements to concurrency can surface latent correctness issues. The parallelization didn’t introduce a bug—it revealed one that had been hidden by incidental sequential timing.

This is a positive outcome: the bug was found in testing rather than production. The fix ensures correct semantic precedence regardless of execution timing, making the system more robust under parallel load.

Key Takeaways

  1. Parallelization is a stress test - It exposes timing-dependent bugs
  2. Sequential execution hides bugs - Incidental ordering masks logical errors
  3. State precedence matters - Review all status calculations when adding concurrency
  4. Heisenbugs become Bohrbugs - Parallel execution makes rare bugs reproducible
  5. This is good engineering - Finding bugs through architectural improvements validates the testing strategy

References

  • Migration: migrations/20251207000000_fix_execution_status_priority.sql
  • Test: tests/e2e/ruby/error_scenarios_test.rs::test_mixed_workflow_scenario
  • SQL Function: get_task_execution_context() in migrations/20251001000000_fix_permanently_blocked_detection.sql
  • Dual-Channel Event System ADR

Tasker Core Benchmarks

Last Updated: 2026-01-23 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Observability | Deployment Patterns



This directory contains documentation for all performance benchmarks in the tasker-core workspace.


Quick Reference

# E2E benchmarks (cluster mode, all tiers)
cargo make setup-env-all-cluster
cargo make cluster-start-all
set -a && source .env && set +a && cargo bench --bench e2e_latency
cargo make bench-report     # Percentile JSON
cargo make bench-analysis   # Markdown analysis
cargo make cluster-stop

# Component benchmarks (requires Docker services)
docker-compose -f docker/docker-compose.test.yml up -d
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo bench --package tasker-client --features benchmarks   # API benchmarks
cargo bench --package tasker-shared --features benchmarks   # SQL + Event benchmarks

Benchmark Categories

1. End-to-End Latency (tests/benches)

Location: tests/benches/e2e_latency.rs Documentation: e2e-benchmarks.md

Measures complete workflow execution from API call through orchestration, message queue, worker execution, result processing, and dependency resolution — across all distributed components in a 10-instance cluster.

TierBenchmarkStepsParallelismP50Target (p99)
1Linear Rust4 sequentialnone255-258ms< 500ms
1Diamond Rust4 (2 parallel)2-way200-259ms< 500ms
2Complex DAG7 (mixed)2+3-way382ms< 800ms
2Hierarchical Tree8 (4 parallel)4-way389-426ms< 800ms
2Conditional6 (4 executed)dynamic251-262ms< 500ms
3Cluster single task4 sequentialnone261ms< 500ms
3Cluster concurrent 2x4+4distributed332-384ms< 800ms
4FFI linear (Ruby/Python/TS)4 sequentialnone312-316ms< 800ms
4FFI diamond (Ruby/Python/TS)4 (2 parallel)2-way260-275ms< 800ms
5Batch 1000 rows7 (5 parallel)5-way358-368ms< 1000ms

Each step involves ~19 database operations, 2 message queue round-trips, 4+ state transitions, and dependency graph evaluation. See e2e-benchmarks.md for the detailed per-step lifecycle.

Key Characteristics:

  • FFI overhead: ~23% vs native Rust (all languages within 3ms of each other)
  • Linear patterns: highly reproducible (<2% variance between runs)
  • Parallel patterns: environment-sensitive (I/O contention affects parallelism)
  • Batch processing: 2,700-2,800 rows/second with tight P95/P50 ratios

Run Commands:

cargo make bench-e2e           # Tier 1: Rust core
cargo make bench-e2e-full      # Tier 1+2: + complexity
cargo make bench-e2e-cluster   # Tier 3: Multi-instance
cargo make bench-e2e-languages # Tier 4: FFI comparison
cargo make bench-e2e-batch     # Tier 5: Batch processing
cargo make bench-e2e-all       # All tiers

2. API Performance (tasker-client)

Location: tasker-client/benches/task_initialization.rs

Measures orchestration API response times for task creation (HTTP round-trip + DB insert + step initialization).

BenchmarkTargetCurrentStatus
Linear task init< 50ms17.7ms2.8x better
Diamond task init< 75ms20.8ms3.6x better
cargo bench --package tasker-client --features benchmarks

3. SQL Function Performance (tasker-shared)

Location: tasker-shared/benches/sql_functions.rs

Measures critical PostgreSQL function performance for orchestration polling.

FunctionTargetCurrent (5K tasks)Status
get_next_ready_tasks< 3ms1.75-2.93msPass
get_step_readiness_status< 1ms440-603usPass
get_task_execution_context< 1ms380-460usPass
DATABASE_URL="..." cargo bench --package tasker-shared --features benchmarks sql_functions

4. Event Propagation (tasker-shared)

Location: tasker-shared/benches/event_propagation.rs

Measures PostgreSQL LISTEN/NOTIFY round-trip latency for real-time coordination.

MetricTarget (p95)CurrentStatus
Notify round-trip< 10ms14.1msSlightly above, p99 < 20ms
DATABASE_URL="..." cargo bench --package tasker-shared --features benchmarks event_propagation

Performance Targets

System-Wide Goals

CategoryMetricTargetRationale
API Latencyp99< 100msUser-facing responsiveness
SQL Functionsmean< 3msOrchestration polling efficiency
Event Propagationp95< 10msReal-time coordination overhead
E2E Linear (4 steps)p99< 500msEnd-user task completion
E2E Complex (7-8 steps)p99< 800msComplex workflow completion
E2E Batch (1000 rows)p99< 1000msBulk operation completion

Scaling Targets

Dataset Sizeget_next_ready_tasksNotes
1K tasks< 2msInitial implementation
5K tasks< 3msCurrent verified
10K tasks< 5msTarget
100K tasks< 10msProduction scale

Cluster Topology (E2E Benchmarks)

ServiceInstancesPortsBuild
Orchestration28080, 8081Release
Rust Worker28100, 8101Release
Ruby Worker28200, 8201Release extension
Python Worker28300, 8301Maturin develop
TypeScript Worker28400, 8401Bun FFI

Deployment Mode: Hybrid (event-driven with polling fallback) Database: PostgreSQL (with PGMQ extension available) Messaging: RabbitMQ (via MessagingService provider abstraction; PGMQ also supported) Sample Size: 50 per benchmark


Running Benchmarks

E2E Benchmarks (Full Suite)

# 1. Setup cluster environment
cargo make setup-env-all-cluster

# 2. Start 10-instance cluster
cargo make cluster-start-all

# 3. Verify cluster health
cargo make cluster-status

# 4. Run benchmarks
set -a && source .env && set +a && cargo bench --bench e2e_latency

# 5. Generate reports
cargo make bench-report    # → target/criterion/percentile_report.json
cargo make bench-analysis  # → tmp/benchmark-results/benchmark-results.md

# 6. Stop cluster
cargo make cluster-stop

Component Benchmarks

# Start database
docker-compose -f docker/docker-compose.test.yml up -d
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

# Run individual suites
cargo bench --package tasker-client --features benchmarks     # API
cargo bench --package tasker-shared --features benchmarks     # SQL + Events

# Run all at once
cargo bench --all-features

Baseline Comparison

# Save current performance as baseline
cargo bench --all-features -- --save-baseline main

# After changes, compare
cargo bench --all-features -- --baseline main

# View report
open target/criterion/report/index.html

Interpreting Results

Stable Metrics (Reliable for Regression Detection)

These metrics show <2% variance between runs:

  • Linear pattern P50 (sequential execution baseline)
  • FFI linear P50 (framework overhead measurement)
  • Single task in cluster (cluster overhead measurement)
  • Batch P50 (parallel I/O throughput)

Environment-Sensitive Metrics

These metrics vary 10-30% depending on system load:

  • Diamond pattern P50 (parallelism benefit depends on I/O capacity)
  • Concurrent 2x (scheduling contention varies)
  • Hierarchical tree (deep dependency chains amplify I/O latency)

Key Ratios (Always Valid)

  • FFI overhead %: ~23% for all languages (framework-dominated)
  • P95/P50 ratio: 1.01-1.12 (execution stability indicator)
  • Cluster vs single overhead: <3ms (negligible cluster tax)
  • FFI language spread: <3ms (language runtime is not the bottleneck)

Design Principles

Natural Measurement

Benchmarks measure real system behavior without artificial test harnesses:

  • API benchmarks hit actual HTTP endpoints
  • SQL benchmarks use real database with realistic data volumes
  • E2E benchmarks execute complete workflows through all distributed components

Distributed System Focus

All benchmarks account for distributed system characteristics:

  • Network latency included (HTTP, PostgreSQL, message queues)
  • Database transaction timing considered
  • Message queue delivery overhead measured
  • Worker coordination and scheduling included

Load-Based Validation

Benchmarks serve dual purpose:

  • Performance measurement: Track regressions and improvements
  • Load testing: Expose race conditions and timing bugs

E2E benchmark warmup has historically discovered critical race conditions that manual testing never revealed.

Statistical Rigor

  • 50 samples per benchmark for P50/P95 validity
  • Criterion framework with statistical regression detection
  • Multiple independent runs recommended for absolute comparisons
  • Relative metrics (ratios, overhead %) preferred over absolute milliseconds

Troubleshooting

“Services must be running”

cargo make cluster-status          # Check cluster health
cargo make cluster-start-all       # Restart cluster

Tier 3/4 benchmarks skipped

# Ensure cluster env is configured (not single-service)
cargo make setup-env-all-cluster   # Generates .env with cluster URLs

High variance between runs

  • Close resource-intensive applications (browsers, IDEs)
  • Ensure machine is plugged in (not throttling)
  • Focus on stable metrics (linear P50, FFI overhead %) for comparisons
  • Run benchmarks twice and compare for reproducibility

Benchmark takes too long

# Reduce sample size (default: 50)
cargo bench -- --sample-size 10

# Run single tier
cargo make bench-e2e               # Only Tier 1

CI Integration

# Example: .github/workflows/benchmarks.yml
name: Performance Benchmarks

on:
  pull_request:
    paths:
      - 'tasker-*/src/**'
      - 'migrations/**'

jobs:
  benchmark:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: ghcr.io/pgmq/pg18-pgmq:v1.8.1
        env:
          POSTGRES_DB: tasker_rust_test
          POSTGRES_USER: tasker
          POSTGRES_PASSWORD: tasker

    steps:
      - uses: actions/checkout@v3
      - run: cargo bench --all-features -- --save-baseline pr
      - uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'criterion'
          output-file-path: target/criterion/report/index.html

Criterion automatically detects performance regressions with statistical comparison to baselines and alerts on >5% slowdowns.


Contributing

When adding new benchmarks:

  1. Follow naming convention: <tier>_<category>/<group>/<scenario>
  2. Include targets: Document expected performance in this README
  3. Add fixture: Create workflow template YAML in tests/fixtures/task_templates/
  4. Document shape: Update e2e-benchmarks.md with topology
  5. Consider variance: Account for distributed system characteristics
  6. Use 50 samples: Minimum for P50/P95 statistical validity

Benchmark Template

use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};
use std::time::Duration;

fn bench_my_scenario(c: &mut Criterion) {
    let mut group = c.benchmark_group("e2e_my_tier");
    group.sample_size(50);
    group.measurement_time(Duration::from_secs(30));

    group.bench_function(BenchmarkId::new("workflow", "my_scenario"), |b| {
        b.iter(|| {
            runtime.block_on(async {
                execute_benchmark_scenario(&client, namespace, handler, context, timeout).await
            })
        });
    });

    group.finish();
}
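
The template assumes a runtime, client, and execute_benchmark_scenario helper provided by the surrounding bench setup in tests/benches. A complete benchmark file also needs the standard Criterion harness wiring, roughly:

// Register the group and generate the benchmark entry point (names from the template above).
criterion_group!(benches, bench_my_scenario);
criterion_main!(benches);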

E2E Benchmark Scenarios: Workflow Shapes and Per-Step Lifecycle

Last Updated: 2026-01-23 Audience: Architects, Developers, Performance Engineers Related Docs: Benchmarks README | States & Lifecycles | Actor Pattern



What Each Benchmark Measures

Each E2E benchmark executes a complete workflow through the full distributed system: HTTP API call, task initialization, step discovery, message queue dispatch, worker execution, result submission, dependency graph re-evaluation, and task finalization.

A 4-step linear workflow at P50=257ms means the system completes 76+ database operations, 8 message queue round-trips, 16+ state machine transitions, and 4 dependency graph evaluations — all across a 10-instance distributed cluster — in approximately one quarter of a second.


Per-Step Lifecycle: What Happens for Every Step

Before examining the benchmark scenarios, it’s important to understand the work the system performs for each individual step. Every step in every benchmark goes through this complete lifecycle.

Messaging Backend: Tasker uses a MessagingService trait with provider variants for PGMQ (PostgreSQL-native, single-dependency) and RabbitMQ (high-throughput). The benchmark results documented here were captured using the RabbitMQ backend. The per-step lifecycle is identical regardless of backend — only the transport layer differs.

State Machine Transitions (per step)

Step:  Pending → Enqueued → InProgress → EnqueuedForOrchestration → Complete
Task:  StepsInProcess → EvaluatingResults → (EnqueuingSteps if more ready) → Complete

Database Operations (per step): ~19 operations

PhaseOperationsDescription
Discovery2 queriesget_next_ready_tasks + get_step_readiness_status_batch (8-CTE query)
Enqueueing4 writesFetch correlation_id, transition Pending→Enqueued (SELECT sort_key + UPDATE most_recent + INSERT transition)
Message send1 opSend step dispatch to worker queue (via MessagingService)
Worker claim1 opClaim message with visibility timeout (via MessagingService)
Worker transition3 writesTransition Enqueued→InProgress
Result submission4 writesTransition InProgress→EnqueuedForOrchestration + audit trigger INSERT + send completion to orchestration queue
Result processing4 writesFetch step state, transition →Complete, delete consumed message
Task coordination1+ queriesRe-evaluate get_step_readiness_status_batch for remaining steps
Total~19 ops

Message Queue Round-Trips (per step): 2

  1. Orchestration → Worker: Step dispatch message (task_uuid, step_uuid, handler, context)
  2. Worker → Orchestration: Completion notification (task_uuid, step_uuid, results)

Dependency Graph Evaluation (per step completion)

After each step completes, the orchestration does the following (see the sketch after this list):

  1. Queries all steps in the task for current state
  2. Evaluates dependency edges (parent steps must be Complete)
  3. Calculates retry eligibility (attempts < max_attempts, backoff expired)
  4. Identifies newly-ready steps for enqueueing
  5. Updates task state (more steps ready → EnqueuingSteps, all complete → Complete)
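
A hedged sketch of that readiness check; the types and field names are illustrative, since the real evaluation happens inside the SQL functions and orchestration actors:

// Illustrative only: the shape of the per-step readiness decision after a completion event.
#[derive(PartialEq)]
enum StepState { Pending, Enqueued, InProgress, Complete }

struct StepView {
    state: StepState,
    parents_complete: bool, // all dependency edges satisfied
    attempts: u32,
    max_attempts: u32,
    backoff_expired: bool,
}

// A step becomes ready when its dependencies are met and retry/backoff rules allow execution.
fn is_ready(s: &StepView) -> bool {
    s.state == StepState::Pending
        && s.parents_complete
        && s.attempts < s.max_attempts
        && s.backoff_expired
}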

Idempotency Guarantees

  • Message visibility timeout: MessagingService prevents duplicate processing (30s window)
  • State machine guards: Transitions validate from-state before applying
  • Atomic claiming: Workers claim via the messaging backend’s atomic read operation
  • Audit trail: Every transition creates an immutable workflow_step_transitions record

Tier 1: Core Performance (Rust Native)

Linear Rust (4 steps, sequential)

Fixture: tests/fixtures/task_templates/rust/mathematical_sequence.yaml Namespace: rust_e2e_linear | Handler: mathematical_sequence

linear_step_1 → linear_step_2 → linear_step_3 → linear_step_4
StepHandlerOperationDepends OnMath
linear_step_1LinearStep1squarenone6^2 = 36
linear_step_2LinearStep2squarestep_136^2 = 1,296
linear_step_3LinearStep3squarestep_21,296^2 = 1,679,616
linear_step_4LinearStep4squarestep_31,679,616^2

Distributed system work for this workflow:

MetricCount
State machine transitions (step)16 (4 per step)
State machine transitions (task)6 (Pending→Init→Enqueue→InProcess→Eval→Complete)
Database operations76 (19 per step)
MQ messages8 (2 per step)
Dependency evaluations4 (after each step completes)
HTTP calls (benchmark→API)1 create + ~5 polls
Sequential stages4

Why this matters: This is the purest sequential latency test. Each step must fully complete (all 19 DB operations + 2 message round-trips) before the next step can begin. The P50 of ~257ms means each step’s complete lifecycle averages ~64ms including all distributed coordination.


Diamond Rust (4 steps, 2-way parallel)

Fixture: tests/fixtures/task_templates/rust/diamond_pattern.yaml Namespace: rust_e2e_diamond | Handler: diamond_pattern

         diamond_start
           /       \
          /         \
  diamond_branch_b  diamond_branch_c    ← parallel execution
          \         /
           \       /
         diamond_end                    ← 2-way convergence
StepHandlerOperationDepends OnParallelism
diamond_startStartsquarenone-
diamond_branch_bBranchBsquarestartparallel with C
diamond_branch_cBranchCsquarestartparallel with B
diamond_endEndmultiply_and_squarebranch_b AND branch_cconvergence

Distributed system work:

MetricCount
State machine transitions (step)16
Database operations76
MQ messages8
Dependency evaluations4
Sequential stages3 (start → parallel → end)
Convergence points1 (diamond_end waits for both branches)
Dependency edge checks4 (start→B, start→C, B→end, C→end)

Why this matters: Tests the system’s ability to dispatch and execute steps concurrently. The convergence point (diamond_end) requires the orchestration to correctly evaluate that BOTH branch_b AND branch_c are Complete before enqueueing diamond_end. Under light load, this completes in 3 sequential stages vs 4 for linear (~30% faster).


Tier 2: Complexity Scaling

Complex DAG (7 steps, mixed parallelism)

Fixture: tests/fixtures/task_templates/rust/complex_dag.yaml Namespace: rust_e2e_mixed_dag | Handler: complex_dag

              dag_init
             /        \
   dag_process_left   dag_process_right     ← 2-way parallel
        /    |              |    \
       /     |              |     \
dag_validate dag_transform dag_analyze      ← mixed dependencies
       \          |          /
        \         |         /
         dag_finalize                       ← 3-way convergence
StepDepends OnType
dag_initnoneinit
dag_process_leftinitparallel branch
dag_process_rightinitparallel branch
dag_validateleft AND right2-way convergence
dag_transformleft onlylinear continuation
dag_analyzeright onlylinear continuation
dag_finalizevalidate AND transform AND analyze3-way convergence

Distributed system work:

MetricCount
State machine transitions (step)28 (7 steps x 4)
Database operations133 (7 x 19)
MQ messages14 (7 x 2)
Dependency evaluations7
Sequential stages4 (init → left/right → validate/transform/analyze → finalize)
Convergence points2 (dag_validate: 2-way, dag_finalize: 3-way)
Dependency edge checks8

Why this matters: Tests multiple convergence points with different fan-in widths. The orchestration must correctly handle that dag_validate needs 2 parents while dag_finalize needs 3. Also tests mixed patterns: some steps continue from a single parent (transform from left only) while others require multiple.


Hierarchical Tree (8 steps, 4-way convergence)

Fixture: tests/fixtures/task_templates/rust/hierarchical_tree.yaml Namespace: rust_e2e_tree | Handler: hierarchical_tree

                    tree_root
                   /         \
        tree_branch_left    tree_branch_right    ← 2-way parallel
          /       \           /        \
  tree_leaf_d  tree_leaf_e  tree_leaf_f  tree_leaf_g  ← 4-way parallel
         \          |            |          /
          \         |            |         /
           tree_final_convergence               ← 4-way convergence
LevelStepsParallelismOperation
0rootsequentialsquare
1branch_left, branch_right2-way parallelsquare
2leaf_d, leaf_e, leaf_f, leaf_g4-way parallelsquare
3final_convergence4-way convergencemultiply_all_and_square

Distributed system work:

MetricCount
State machine transitions (step)32 (8 x 4)
Database operations152 (8 x 19)
MQ messages16 (8 x 2)
Dependency evaluations8
Sequential stages4 (root → branches → leaves → convergence)
Maximum fan-out2-way (each branch → 2 leaves)
Maximum fan-in4-way (convergence waits for all 4 leaves)
Dependency edge checks9

Why this matters: Tests the widest convergence pattern — 4 parallel leaves must all complete before the final step can execute. This exercises the dependency evaluation with a large number of parent checks per step. Also tests hierarchical fan-out (root→2 branches→4 leaves).


Conditional Routing (6 steps, 4 executed)

Fixture: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml Namespace: conditional_approval_rust | Handler: approval_routing Context: {"amount": 500, "requester": "benchmark"}

validate_request
       ↓
routing_decision          ← DECISION POINT (routes based on amount)
   /      |      \
  /       |       \
auto_approve  manager_approval  finance_review
(< $1000)     ($1000-$5000)     (> $5000)
  \       |       /
   \      |      /
  finalize_approval               ← deferred convergence

With benchmark context amount=500, only the auto_approve path executes:

validate_request → routing_decision → auto_approve → finalize_approval
StepExecutedCondition
validate_requestYesalways
routing_decisionYesalways (decision point)
auto_approveYesamount < 1000
manager_approvalSkippedamount 1000-5000
finance_reviewSkippedamount > 5000
finalize_approvalYesdeferred convergence (waits for executed paths only)

Distributed system work (executed steps only):

MetricCount
State machine transitions (step)16 (4 executed x 4)
Database operations76 (4 executed x 19)
MQ messages8 (4 executed x 2)
Dependency evaluations4
Sequential stages4 (validate → decision → approve → finalize)
Skipped steps2 (manager_approval, finance_review)

Why this matters: Tests deferred convergence — the finalize_approval step depends on ALL conditional branches, but only blocks on branches that actually executed. The orchestration must correctly determine that manager_approval and finance_review were skipped (not just incomplete) and allow finalize_approval to proceed. Also tests the decision point routing pattern.


Tier 3: Cluster Performance

Single Task Linear (4 steps, round-robin across 2 orchestrators)

Same workflow as Tier 1 linear_rust, but benchmarked with round-robin across 2 orchestration instances to measure cluster coordination overhead.

Distributed system work: Same as linear_rust (76 DB ops, 8 MQ messages) plus cluster coordination overhead (shared database, message queue visibility).

Why this matters: Validates that running in cluster mode adds negligible overhead compared to single-instance. The P50 difference (261ms vs 257ms = ~4ms) represents the entire cluster coordination tax.

Concurrent Tasks 2x (2 tasks simultaneously across 2 orchestrators)

Two linear workflows submitted simultaneously, one to each orchestration instance.

Distributed system work:

MetricCount
State machine transitions44 (22 per task)
Database operations152 (76 per task)
MQ messages16 (8 per task)
Concurrent step executionsup to 2
Database connection contention2 orchestrators + 2 workers competing

Why this matters: Tests work distribution across cluster instances under concurrent load. The P50 of ~332-384ms for TWO tasks (vs ~261ms for one) shows that the second task adds only 30-50% latency, not 100% — demonstrating effective parallelism in the cluster.


Tier 4: FFI Language Comparison

Same linear and diamond patterns as Tier 1, but using FFI workers (Ruby via Magnus, Python via PyO3, TypeScript via Bun FFI) instead of native Rust handlers.

Additional per-step work for FFI:

PhaseAdditional Operations
Handler dispatchFFI bridge call (Rust → language runtime)
Context serializationJSON serialize context for foreign runtime
Result deserializationJSON deserialize results back to Rust
Circuit breaker checkshould_allow() (sync, atomic check)
Completion callbackFFI completion channel (bounded MPSC)

FFI overhead: ~23% (~60ms for 4 steps)

The overhead is framework-dominated (Rust dispatch + serialization + completion channel), not language-dominated — all three languages perform within 3ms of each other.


Tier 5: Batch Processing

CSV Products 1000 Rows (7 steps, 5-way parallel)

Fixture: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml Namespace: csv_processing_rust | Handler: csv_product_inventory_analyzer

analyze_csv                    ← reads CSV, returns BatchProcessingOutcome
    ↓
[orchestration creates 5 dynamic workers from batch template]
    ↓
process_csv_batch_001 ──┐
process_csv_batch_002 ──┤
process_csv_batch_003 ──├──→ aggregate_csv_results    ← deferred convergence
process_csv_batch_004 ──┤
process_csv_batch_005 ──┘
StepTypeRowsOperation
analyze_csvbatchableall 1000Count rows, compute batch ranges
process_csv_batch_001batch_worker1-200Compute inventory metrics
process_csv_batch_002batch_worker201-400Compute inventory metrics
process_csv_batch_003batch_worker401-600Compute inventory metrics
process_csv_batch_004batch_worker601-800Compute inventory metrics
process_csv_batch_005batch_worker801-1000Compute inventory metrics
aggregate_csv_resultsdeferred_convergenceallMerge batch results

Distributed system work:

MetricCount
State machine transitions (step)28 (7 x 4)
Database operations133 (7 x 19)
MQ messages14 (7 x 2)
Dynamic step creation5 (batch workers created at runtime)
Dependency edges (dynamic)6 (batch workers → analyze, aggregate → batch_template)
File I/O operations6 (1 analysis read + 5 batch reads of CSV)
CSV rows processed1000
Sequential stages3 (analyze → 5 parallel workers → aggregate)

Why this matters: Tests the most complex orchestration pattern — dynamic step generation. The analyze_csv step returns a BatchProcessingOutcome that tells the orchestration to create N worker steps at runtime (a sketch of that outcome type appears at the end of this section). The orchestration must:

  1. Create new step records in the database
  2. Create dependency edges dynamically
  3. Enqueue all batch workers for parallel execution
  4. Use deferred convergence for the aggregate step (waits for batch template, not specific steps)

At P50=358-368ms for 1000 rows, throughput is ~2,700 rows/second with all the distributed system overhead included.
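
For orientation, a hedged sketch of the kind of payload the analyze step hands back so the orchestration can create the dynamic workers. The field names are illustrative; the canonical BatchProcessingOutcome is defined in tasker-core:

use serde::{Deserialize, Serialize};

// Illustrative shape only; field names are assumptions, not the canonical schema.
#[derive(Debug, Serialize, Deserialize)]
struct BatchRange {
    start_row: u64, // e.g. 1
    end_row: u64,   // e.g. 200
}

#[derive(Debug, Serialize, Deserialize)]
struct BatchProcessingOutcome {
    worker_template: String,  // batch step template to instantiate, e.g. "process_csv_batch"
    batches: Vec<BatchRange>, // one dynamic worker step per entry (5 for 1000 rows at 200/batch)
}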


Summary: Operations Per Benchmark

BenchmarkStepsDB OpsMQ MsgsTransitionsConvergenceP50
Linear Rust476822none257ms
Diamond Rust4768222-way200-259ms
Complex DAG713314342+3-way382ms
Hierarchical Tree815216384-way389-426ms
Conditional4*76822deferred251-262ms
Cluster single476822none261ms
Cluster 2x81521644none332-384ms
FFI linear476822none312-316ms
FFI diamond4768222-way260-275ms
Batch 1000 rows71331434deferred358-368ms

*Conditional executes 4 of 6 defined steps (2 skipped by routing decision)


Performance per Sequential Stage

For workflows with known sequential depth, we can calculate per-stage overhead:

BenchmarkSequential StagesP50Per-Stage Avg
Linear (4 seq)4257ms64ms
Diamond (3 seq)3200ms*67ms
Complex DAG (4 seq)4382ms96ms**
Tree (4 seq)4389ms97ms**
Conditional (4 seq)4257ms64ms
Batch (3 seq)3363ms121ms***

*Diamond under light load (parallelism helping)
**Higher per-stage due to multiple steps per stage (more DB ops per evaluation cycle)
***Higher per-stage due to batch worker creation overhead + file I/O

The ~64ms per sequential stage for simple patterns represents the total distributed round-trip: orchestration discovery → MQ dispatch → worker claim → handler execute (~1ms for math operations) → MQ completion → orchestration result processing → dependency re-evaluation. The handler execution itself is negligible; the 64ms is almost entirely orchestration infrastructure.

Tasker Contrib Documentation

DocumentDescription
README.mdRepository overview, vision, and structure
DEVELOPMENT.mdLocal development and cross-repo setup

Implementation Specifications

TicketStatusDescription
TAS-126🚧 In ProgressFoundations: repo structure, vision, CLI plugin design

TAS-126 Documents

DocumentDescription
README.mdTicket summary and deliverables
foundations.mdArchitectural deep-dive and design rationale
rails.mdRails-specific implementation plan
cli-plugin-architecture.mdCLI plugin system design

Architecture

The foundations document covers:

  • Design rationale (why separate repos, why Railtie over Engine)
  • Framework integration patterns (lifecycle, events, generators)
  • Configuration architecture (three-layer model)
  • Testing architecture (unit, integration, E2E)
  • Versioning strategy

Milestones

MilestoneStatusDescription
Foundations and CLI🚧 In ProgressTAS-126: Repo structure, vision, CLI plugin design
Rails📋 Plannedtasker-contrib-rails gem, generators, event bridge
Python📋 Plannedtasker-contrib-fastapi, pytest integration
TypeScript📋 Plannedtasker-contrib-bun, Bun.serve patterns

Framework Guides

Coming soon as packages are implemented

  • Rails Integration Guide
  • FastAPI Integration Guide
  • Bun Integration Guide
  • Axum Integration Guide

Operational Guides

Coming soon

  • Helm Chart Deployment
  • Terraform Infrastructure
  • Monitoring Setup
  • Production Checklist

Example Applications

Complete example applications demonstrating Tasker patterns.

Examples

ExampleDescription
e-commerce-workflow/Order processing with payment, inventory, and shipping
etl-pipeline/Data extraction, transformation, and loading workflow
approval-system/Multi-level approval with conditional routing

Purpose

These examples demonstrate:

  • Real-world workflow patterns
  • Multi-language handler implementations
  • Testing strategies
  • Deployment configurations

Status

📋 Planned

Approval System Example

Multi-level approval workflow demonstrating:

  • Decision handlers for routing
  • Convergence patterns (all approvals required)
  • Human-in-the-loop integration
  • Timeout and escalation

Status

📋 Planned

E-Commerce Workflow Example

Order processing workflow demonstrating:

  • Diamond dependency patterns (parallel payment + inventory check)
  • External API integration (payment gateway)
  • Conditional routing (shipping method selection)
  • Error handling and retries

Status

📋 Planned

ETL Pipeline Example

Data processing workflow demonstrating:

  • Batchable handlers for large datasets
  • Checkpoint/resume for long-running processes
  • Multiple data sources
  • Transformation chains

Status

📋 Planned

Engineering Stories

A progressive-disclosure blog series that teaches Tasker concepts through real-world scenarios. Each story builds on the previous, showing how a growing engineering team adopts workflow orchestration across all four supported languages.

These stories are being rewritten for the current Tasker architecture. See the archive for the original GitBook-era versions.

StoryThemeStatus
01: E-commerce CheckoutBasic workflows, error handlingPlanned
02: Data PipelineETL patterns, resiliencePlanned
03: MicroservicesService coordinationPlanned
04: Team ScalingNamespace isolationPlanned
05: ObservabilityOpenTelemetry + domain eventsPlanned
06: Batch ProcessingBatch step patternsPlanned
07: Conditional WorkflowsDecision handlersPlanned
08: Production DebuggingDLQ investigationPlanned