Tasker

Workflow orchestration that meets your code where it lives.

Tasker is an open-source workflow orchestration engine built on PostgreSQL and PGMQ. You define workflows as task templates with ordered steps, implement handlers in Rust, Ruby, Python, or TypeScript, and the engine handles execution, retries, circuit breaking, and observability.

Your existing business logic — API calls, database operations, service integrations — becomes a distributed, event-driven, retryable workflow with minimal ceremony. No DSLs to learn, no framework rewrites. Just thin handler wrappers around code you already have.


Get Started

Getting Started Guide

From zero to your first workflow. Install, write a handler, define a template, submit a task, and watch it run.

Why Tasker?

An honest look at where Tasker fits in the workflow orchestration landscape — and where established tools might be a better choice.

Architecture

How Tasker works under the hood: actors, state machines, event systems, circuit breakers, and the PostgreSQL-native execution model.

Configuration Reference

Complete reference for all 246 configuration parameters across orchestration, workers, and shared settings.


Choose Your Language

Tasker is polyglot from the ground up. The orchestration engine is Rust; workers can be any of four languages, all sharing the same core abstractions expressed idiomatically.

| Language | Package | Install | Registry |
| --- | --- | --- | --- |
| Rust | tasker-client / tasker-worker | cargo add tasker-client tasker-worker | crates.io |
| Ruby | tasker-rb | gem install tasker-rb | rubygems.org |
| Python | tasker-py | pip install tasker-py | pypi.org |
| TypeScript | @tasker-systems/tasker | npm install @tasker-systems/tasker | npmjs.com |

Each language guide covers installation, handler patterns, testing, and production considerations:

Rust · Ruby · Python · TypeScript


Explore the Documentation

For New Users

Architecture & Design

Operational Guides

Reference

Framework Integrations


Engineering Stories

A progressive-disclosure blog series teaching Tasker concepts through real-world scenarios. Each story follows an engineering team as they adopt workflow orchestration, with working code examples across all four languages.

| Story | What You’ll Learn |
| --- | --- |
| 01: E-commerce Checkout | Basic workflows, error handling, retry patterns |
| 02: Data Pipeline Resilience | ETL orchestration, resilience under failure |
| 03: Microservices Coordination | Cross-service workflows, distributed tracing |
| 04: Team Scaling | Namespace isolation, multi-team patterns |
| 05: Observability | OpenTelemetry integration, domain events |
| 06: Batch Processing | Batch step patterns, throughput optimization |
| 07: Conditional Workflows | Decision handlers, approval flows |
| 08: Production Debugging | DLQ investigation, diagnostics tooling |

Stories are being rewritten for the current Tasker architecture. View archive →


The Project

Tasker is open-source software (MIT license) built by an engineer who has spent years designing workflow systems at multiple organizations — and finally had the opportunity to build the one that was always in his head.

It’s not venture-backed. It’s not chasing a market. It’s a labor of love built for the engineering community.

Read the full story →

Source Repositories

| Repository | Description |
| --- | --- |
| tasker-core | Rust orchestration engine, polyglot workers, and CLI |
| tasker-contrib | Framework integrations and community packages |
| tasker-book | This documentation site |

Why Tasker

Last Updated: 2025-01-09 · Audience: Engineers evaluating workflow orchestration tools · Status: Pre-Alpha


The Story

Tasker is a labor of love.

Over the years, I’ve built workflow systems at multiple organizations—each time encountering the same fundamental challenges: orchestrating complex, multi-step processes with proper dependency management, ensuring idempotency, handling retries gracefully, and doing all of this in a way that doesn’t require teams to rewrite their existing business logic.

Each time, I’d design parts of the solution I wished we could build—but the investment was never justifiable. General-purpose workflow infrastructure rarely makes sense for a single company to build from scratch when there are urgent product features to ship. So I’d compromise, cobble together something workable, and move on.

Tasker represents the opportunity to finally build that system properly—the one that’s been evolving in my head for years. Not as a venture-backed startup chasing a market, but as open-source software built by someone who genuinely cares about the problem space and wants to give back to the engineering community.


The Landscape

Honesty is important, and so in full candor: Tasker is not solving an unsolved problem. The workflow orchestration space has mature, battle-tested options.

Apache Airflow

What it does well: Airflow is the industry standard for data pipeline orchestration. Born at Airbnb and now an Apache project with thousands of contributors, it excels at scheduled, batch-oriented workflows defined as Python DAGs. Its ecosystem of operators and integrations is unmatched—if you need to connect to a cloud service, there’s probably an Airflow provider for it.

When to choose it: You have scheduled ETL/ELT workloads, your team is Python-native, you need managed cloud options (AWS MWAA, Google Cloud Composer, Astronomer), and you value ecosystem breadth over ergonomic simplicity.

Honest comparison: Airflow’s 10+ years of production use across thousands of companies represents a level of battle-testing Tasker simply cannot match. If your primary use case is data pipeline orchestration with scheduled intervals, Airflow is likely the safer choice.

Temporal

What it does well: Temporal pioneered “durable execution”—workflows that automatically survive crashes, network failures, and infrastructure outages. It reconstructs application state transparently, letting developers write code as if failures don’t exist. The event history and replay capabilities are genuinely impressive.

When to choose it: You need long-running workflows (hours, days, or longer), your operations require true durability guarantees, you’re building microservice orchestration with complex saga patterns, or you need human-in-the-loop workflows with unbounded wait times.

Honest comparison: Temporal’s durable execution model is architecturally different from Tasker. If your workflows genuinely need to survive arbitrary failures mid-execution and resume from exact state, Temporal was purpose-built for this. Tasker provides resilience through retries and idempotent step execution, but doesn’t offer Temporal’s deterministic replay.

Prefect

What it does well: Prefect feels like “what if workflow orchestration were just Python decorators?” It emphasizes minimal boilerplate—add @flow and @task decorators to existing functions, and you have an orchestrated workflow. Prefect 3.0 embraces dynamic workflows with native Python control flow.

When to choose it: Your team is Python-native, you want the fastest path from script to production pipeline, you value simplicity and developer experience, or you’re doing ML/data science workflows where Jupyter-to-production is important.

Honest comparison: Prefect’s decorator-based ergonomics are genuinely excellent for Python-only teams. If your organization is homogeneously Python and you don’t need polyglot support, Prefect delivers a very clean experience.

Dagster

What it does well: Dagster introduced “software-defined assets” as first-class primitives—you define what data assets should exist and their dependencies, and the orchestrator figures out how to materialize them. This asset-centric model provides excellent lineage tracking and observability.

When to choose it: You’re building a data platform where understanding asset lineage is critical, you want a declarative approach focused on data products rather than task sequences, or you need strong dbt integration and data quality built into your orchestration layer.

Honest comparison: Dagster’s asset-centric philosophy is a genuinely different way of thinking about orchestration. If your mental model centers on “what data assets need to exist” rather than “what steps need to execute,” Dagster may be a better conceptual fit.


So Why Tasker?

Given this landscape, why build another workflow orchestrator?

Philosophy: Meet Teams Where They Are

Most workflow tools require you to think in their terms. Define your work as DAGs using their DSL. Adopt their scheduling model. Often, rewrite your business logic to fit their execution model.

Tasker takes a different approach: bring workflow orchestration to your existing code, rather than bringing your code to a workflow framework.

If your codebase already has reasonable SOLID characteristics—services with clear responsibilities, well-defined interfaces, operations that can be made idempotent—Tasker aims to turn that code into distributed, event-driven, retryable workflows with minimal ceremony.

This philosophy manifests in several ways:

Polyglot from the ground up. Tasker’s orchestration engine is written in Rust, but workers can be written in Ruby, Python, TypeScript, or native Rust. Each language implementation shares the same core abstractions—same handler signatures, same result factories, same patterns—expressed idiomatically for each language. This isn’t an afterthought; cross-language consistency is a core design principle.

Minimal migration burden. Your existing business logic—API calls, database operations, external service integrations—can become step handlers with thin wrappers. You don’t need to restructure your application around the orchestrator.
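
To make "thin wrappers" concrete, here is a minimal Rust-flavored sketch. The StepHandler trait, StepContext, and billing::charge_customer call are illustrative stand-ins rather than the actual worker API; the retryable/permanent/success result factories follow the error-classification pattern described elsewhere in these docs.

use async_trait::async_trait;

// Hypothetical sketch: a step handler that wraps an existing service call.
// `StepHandler`, `StepContext`, and `StepResult` stand in for the worker API;
// `billing::charge_customer` represents business logic you already own.
pub struct ChargePaymentHandler;

#[async_trait]
impl StepHandler for ChargePaymentHandler {
    async fn call(&self, ctx: &StepContext) -> StepResult {
        // Read inputs from the task context instead of restructuring the service.
        let order_id: String = match ctx.input("order_id") {
            Some(id) => id,
            None => return StepResult::permanent("missing order_id"),
        };

        // Delegate to code you already have; classify failures for the engine.
        match billing::charge_customer(&order_id).await {
            Ok(receipt) => StepResult::success(receipt),
            Err(e) if e.is_transient() => StepResult::retryable(e.to_string()),
            Err(e) => StepResult::permanent(e.to_string()),
        }
    }
}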

Framework-agnostic core. Tasker Core provides the fundamentals without framework opinions. Tasker Contrib then provides framework-specific integrations (Rails, FastAPI, Bun) for teams who want batteries-included ergonomics. Progressive disclosure: learn the core concepts first, add framework sugar when needed.

Architecture: Event-Driven with Resilience Built In

Tasker’s architecture reflects lessons learned from building distributed systems:

PostgreSQL-native by default. Everything flows through Postgres—task state, step queues (via PGMQ), event coordination (via LISTEN/NOTIFY). This isn’t because Postgres is trendy; it’s because many teams already have Postgres expertise and operational knowledge. Tasker works as a single-dependency system on PostgreSQL alone. For environments requiring higher throughput or existing RabbitMQ infrastructure, Tasker also supports RabbitMQ as an alternative messaging backend—switch with a configuration change.

Event-driven with polling fallback. Real-time responsiveness through Postgres notifications, but with polling as a reliability backstop. Events can be missed; polling ensures eventual consistency.

Defense in depth. Multiple overlapping protection layers provide robust idempotency without single-point dependency. Database-level atomicity, state machine guards, transaction boundaries, and application-level filtering each catch what others might miss.

Composition over inheritance. Handler capabilities are composed via mixins/traits, not class hierarchies. This enables selective capability inclusion, clear separation of concerns, and easier testing.

Performance: Fast by Default

Tasker is built in Rust not for marketing purposes, but because workflow orchestration has real performance implications. When you’re coordinating thousands of steps across distributed workers, overhead matters.

  • Complex 7-step DAG workflows complete in under 133ms with push-based notifications
  • Concurrent execution via work-stealing thread pools
  • Lock-free channel-based internal coordination
  • Zero-copy where possible in the FFI boundaries

The Honest Assessment

Tasker excels when:

  • You need polyglot worker support across Ruby, Python, TypeScript, and Rust
  • Your team already has Postgres expertise and wants to avoid additional infrastructure
  • You want to bring orchestration to existing business logic rather than rewriting
  • You value clean, consistent APIs across languages
  • Performance matters and you’re willing to trade ecosystem breadth for it

Tasker may not be the right choice when:

  • You need the battle-tested maturity and ecosystem of Airflow
  • Your workflows require Temporal-style durable execution with deterministic replay
  • You’re an all-Python team and Prefect’s ergonomics fit perfectly
  • You’re building a data platform where asset-centric thinking (Dagster) is the right model
  • You need managed cloud offerings with SLAs and enterprise support

What Tasker Is (and Isn’t)

Tasker Is:

  • A workflow orchestration engine for step-based DAG execution with complex dependencies
  • PostgreSQL-native with flexible messaging using PGMQ (default) or RabbitMQ
  • Polyglot by design with first-class support for multiple languages
  • Focused on developer experience for teams who want minimal intrusion
  • Open source (MIT license) and built as a labor of love

Tasker Is Not:

  • A data orchestration platform like Dagster with asset lineage and data quality primitives
  • A durable execution engine like Temporal with deterministic replay and unlimited durability
  • A scheduled job runner for simple cron-style workloads (use actual cron)
  • A message bus for pure pub/sub fan-out (use Kafka or dedicated brokers)
  • Enterprise software with commercial support, SLAs, or managed offerings

Current State

Tasker is pre-alpha software. This is important context:

What this means:

  • The architecture is solidifying but breaking changes are expected
  • Documentation is comprehensive but evolving
  • There are no production deployments (that I know of) outside development
  • You should not bet critical business processes on Tasker today

What this enables:

  • Rapid iteration based on real feedback
  • Willingness to break APIs to get the design right
  • Focus on architectural correctness over backward compatibility
  • Honest experimentation without legacy constraints

If you’re evaluating Tasker, I’d encourage you to explore it for non-critical workloads, provide feedback, and help shape what it becomes. If you need production-ready workflow orchestration today, please consider the established tools above—I genuinely recommend them for their respective strengths.


The Path Forward

Tasker is being built with care, not speed. The goal isn’t to capture market share or compete with well-funded companies. The goal is to create something genuinely useful—a workflow orchestration system that respects developers’ time and meets them where they are.

The codebase is open, the design decisions are documented, and contributions are welcome. This is software built by an engineer for engineers, not a product chasing metrics.

If that resonates with you, welcome. Let’s build something good together.



← Back to Documentation Hub

Getting Started

This section walks you from “what is Tasker?” to running your first workflow.

Path

  1. Overview - What Tasker is and why it exists
  2. Core Concepts - Tasks, steps, handlers, templates, and namespaces
  3. Installation - Installing packages and running infrastructure
  4. Choosing Your Package - Which package do you need?
  5. Your First Handler - Write a step handler in your language
  6. Your First Workflow - Define a template, submit a task, watch it run
  7. Next Steps - Where to go from here

Overview

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Core Concepts

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Installation

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Choosing Your Package

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Your First Handler

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Your First Workflow

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Next Steps

This page will be written as part of the consumer documentation effort. See TAS-215 for details.

Tasker Core Architecture

This directory contains architectural reference documentation describing how Tasker Core’s components work together.

Documents

| Document | Description |
| --- | --- |
| Crate Architecture | Workspace structure and crate responsibilities |
| Messaging Abstraction | Provider-agnostic messaging (PGMQ, RabbitMQ) |
| Actors | Actor-based orchestration lifecycle components |
| Worker Actors | Actor pattern for worker step execution |
| Worker Event Systems | Dual-channel event architecture for workers |
| States and Lifecycles | Dual state machine architecture (Task + Step) |
| Events and Commands | Event-driven coordination patterns |
| Domain Events | Business event publishing (durable/fast/broadcast) |
| Idempotency and Atomicity | Defense-in-depth guarantees |
| Backpressure Architecture | Unified resilience and flow control |
| Circuit Breakers | Fault isolation and cascade prevention |
| Deployment Patterns | Hybrid, EventDriven, PollingOnly modes; PGMQ/RabbitMQ backends |

When to Read These

  • Designing new features: Understand how components interact
  • Debugging flow issues: Trace message paths through actors
  • Understanding trade-offs: See why patterns were chosen
  • Onboarding: Build mental model of the system
  • Principles - The “why” behind architectural decisions
  • Guides - Practical “how-to” documentation
  • CHRONOLOGY - Historical context for decisions

Actor-Based Architecture

Last Updated: 2025-12-04 · Audience: Architects, Developers · Status: Active · Related Docs: Documentation Hub | Worker Actor Architecture | Events and Commands | States and Lifecycles

← Back to Documentation Hub


This document describes the actor-based architecture in tasker-core: a lightweight Actor pattern that formalizes the relationship between Commands and Lifecycle Components, replacing imperative delegation with message-based actor coordination.

Overview

The tasker-core system implements a lightweight Actor pattern inspired by frameworks like Actix, but designed specifically for our orchestration needs without external dependencies. The architecture provides:

  1. Actor Abstraction: Lifecycle components encapsulated as actors with clear lifecycle hooks
  2. Message-Based Communication: Type-safe message handling via Handler trait
  3. Central Registry: ActorRegistry for managing all orchestration actors
  4. Service Decomposition: Focused components following single responsibility principle
  5. Direct Integration: Command processor calls actors directly without wrapper layers

This architecture eliminates inconsistencies in lifecycle component initialization, provides type-safe message handling, and creates a clear separation between command processing and business logic execution.

Implementation Status

All phases implemented and production-ready: core abstractions, all 4 primary actors, message hydration, module reorganization, service decomposition, and direct actor integration.

Core Concepts

What is an Actor?

In the tasker-core context, an Actor is an encapsulated lifecycle component that:

  • Manages its own state: Each actor owns its dependencies and configuration
  • Processes messages: Responds to typed command messages via the Handler trait
  • Has lifecycle hooks: Initialization (started) and cleanup (stopped) methods
  • Is isolated: Actors communicate through message passing
  • Is thread-safe: All actors are Send + Sync + 'static

Why Actors?

The previous architecture had several inconsistencies:

// OLD: Inconsistent initialization patterns
pub struct TaskInitializer {
    // Constructor pattern
}

pub struct TaskFinalizer {
    // Builder pattern with new()
}

pub struct StepEnqueuer {
    // Factory pattern with create()
}

The actor pattern provides consistency:

// NEW: Consistent actor pattern
impl OrchestrationActor for TaskRequestActor {
    fn name(&self) -> &'static str { "TaskRequestActor" }
    fn context(&self) -> &Arc<SystemContext> { &self.context }
    fn started(&mut self) -> TaskerResult<()> { /* initialization */ }
    fn stopped(&mut self) -> TaskerResult<()> { /* cleanup */ }
}

Actor vs Service

Services (underlying business logic):

  • Encapsulate business logic
  • Stateless operations on domain models
  • Direct method invocation
  • Examples: TaskFinalizer, StepEnqueuerService, OrchestrationResultProcessor

Actors (message-based coordination):

  • Wrap services with message-based interface
  • Manage service lifecycle
  • Asynchronous message handling
  • Examples: TaskRequestActor, ResultProcessorActor, StepEnqueuerActor, TaskFinalizerActor

The relationship:

pub struct TaskFinalizerActor {
    context: Arc<SystemContext>,
    service: TaskFinalizer,  // Wraps underlying service
}

impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
    type Response = FinalizationResult;

    async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
        // Delegates to service
        self.service.finalize_task(msg.task_uuid).await
    }
}

Actor Traits

OrchestrationActor Trait

The base trait for all orchestration actors, defined in tasker-orchestration/src/actors/traits.rs:

/// Base trait for all orchestration actors
///
/// Provides lifecycle management and context access for all actors in the
/// orchestration system. All actors must implement this trait to participate
/// in the actor registry and lifecycle management.
///
/// # Lifecycle
///
/// 1. **Construction**: Actor is created by ActorRegistry
/// 2. **Initialization**: `started()` is called during registry build
/// 3. **Operation**: Actor processes messages via Handler<M> implementations
/// 4. **Shutdown**: `stopped()` is called during registry shutdown
pub trait OrchestrationActor: Send + Sync + 'static {
    /// Returns the unique name of this actor
    ///
    /// Used for logging, metrics, and debugging. Should be a static string
    /// that clearly identifies the actor's purpose.
    fn name(&self) -> &'static str;

    /// Returns a reference to the system context
    ///
    /// Provides access to database pool, configuration, and other
    /// framework-level resources.
    fn context(&self) -> &Arc<SystemContext>;

    /// Called when the actor is started
    ///
    /// Perform any initialization work here, such as:
    /// - Setting up database connections
    /// - Loading configuration
    /// - Initializing caches
    ///
    /// # Errors
    ///
    /// Return an error if initialization fails. The actor will not be
    /// registered and the system will fail to start.
    fn started(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor started");
        Ok(())
    }

    /// Called when the actor is stopped
    ///
    /// Perform any cleanup work here, such as:
    /// - Closing database connections
    /// - Flushing caches
    /// - Releasing resources
    ///
    /// # Errors
    ///
    /// Return an error if cleanup fails. Errors are logged but do not
    /// prevent other actors from shutting down.
    fn stopped(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor stopped");
        Ok(())
    }
}

Key Design Decisions:

  1. Send + Sync + 'static: Enables actors to be shared across threads
  2. Default lifecycle hooks: Actors only override when needed
  3. Context injection: All actors have access to SystemContext
  4. Error handling: Lifecycle failures are TaskerResult for proper error propagation

Handler Trait

The message handling trait, enabling type-safe message processing:

/// Message handler trait for specific message types
///
/// Actors implement Handler<M> for each message type they can process.
/// This provides type-safe, asynchronous message handling with clear
/// input/output contracts.
#[async_trait]
pub trait Handler<M: Message>: OrchestrationActor {
    /// The response type returned by this handler
    type Response: Send;

    /// Handle a message asynchronously
    ///
    /// Process the message and return a response. This method should be
    /// idempotent where possible and handle errors gracefully.
    async fn handle(&self, msg: M) -> TaskerResult<Self::Response>;
}

Key Design Decisions:

  1. async_trait: All message handling is asynchronous
  2. Type safety: Message and Response types are checked at compile time
  3. Multiple implementations: Actor can implement Handler for multiple message types
  4. Error propagation: TaskerResult ensures proper error handling

Message Trait

The marker trait for command messages:

/// Marker trait for command messages
///
/// All messages sent to actors must implement this trait. The associated
/// `Response` type defines what the handler will return.
pub trait Message: Send + 'static {
    /// The response type for this message
    type Response: Send;
}

Key Design Decisions:

  1. Marker trait: No methods, just type constraints
  2. Associated type: Response type is part of the message definition
  3. Send + 'static: Enables messages to cross thread boundaries

ActorRegistry

The central registry managing all orchestration actors, defined in tasker-orchestration/src/actors/registry.rs:

Purpose

The ActorRegistry serves as:

  1. Single Source of Truth: All actors are registered here
  2. Lifecycle Manager: Handles initialization and shutdown
  3. Dependency Injection: Provides SystemContext to all actors
  4. Type-Safe Access: Strongly-typed access to each actor

Structure

/// Registry managing all orchestration actors
///
/// The ActorRegistry holds Arc references to all actors in the system,
/// providing centralized access and lifecycle management.
#[derive(Clone)]
pub struct ActorRegistry {
    /// System context shared by all actors
    context: Arc<SystemContext>,

    /// Task request actor for processing task initialization requests
    pub task_request_actor: Arc<TaskRequestActor>,

    /// Result processor actor for processing step execution results
    pub result_processor_actor: Arc<ResultProcessorActor>,

    /// Step enqueuer actor for batch processing ready tasks
    pub step_enqueuer_actor: Arc<StepEnqueuerActor>,

    /// Task finalizer actor for task finalization with atomic claiming
    pub task_finalizer_actor: Arc<TaskFinalizerActor>,
}

Initialization

The build() method creates and initializes all actors:

impl ActorRegistry {
    pub async fn build(context: Arc<SystemContext>) -> TaskerResult<Self> {
        tracing::info!("Building ActorRegistry with actors");

        // Create shared StepEnqueuerService (used by multiple actors)
        let task_claim_step_enqueuer = StepEnqueuerService::new(context.clone()).await?;
        let task_claim_step_enqueuer = Arc::new(task_claim_step_enqueuer);

        // Create TaskRequestActor and its dependencies
        let task_initializer = Arc::new(TaskInitializer::new(
            context.clone(),
            task_claim_step_enqueuer.clone(),
        ));

        let task_request_processor = Arc::new(TaskRequestProcessor::new(
            context.message_client.clone(),
            context.task_handler_registry.clone(),
            task_initializer,
            TaskRequestProcessorConfig::default(),
        ));

        let mut task_request_actor = TaskRequestActor::new(context.clone(), task_request_processor);
        task_request_actor.started()?;
        let task_request_actor = Arc::new(task_request_actor);

        // Create ResultProcessorActor and its dependencies
        let task_finalizer = TaskFinalizer::new(context.clone(), task_claim_step_enqueuer.clone());
        let result_processor = Arc::new(OrchestrationResultProcessor::new(
            task_finalizer,
            context.clone(),
        ));

        let mut result_processor_actor =
            ResultProcessorActor::new(context.clone(), result_processor);
        result_processor_actor.started()?;
        let result_processor_actor = Arc::new(result_processor_actor);

        // Create StepEnqueuerActor using shared StepEnqueuerService
        let mut step_enqueuer_actor =
            StepEnqueuerActor::new(context.clone(), task_claim_step_enqueuer.clone());
        step_enqueuer_actor.started()?;
        let step_enqueuer_actor = Arc::new(step_enqueuer_actor);

        // Create TaskFinalizerActor using shared StepEnqueuerService
        let task_finalizer = TaskFinalizer::new(context.clone(), task_claim_step_enqueuer.clone());
        let mut task_finalizer_actor = TaskFinalizerActor::new(context.clone(), task_finalizer);
        task_finalizer_actor.started()?;
        let task_finalizer_actor = Arc::new(task_finalizer_actor);

        tracing::info!("✅ ActorRegistry built successfully with 4 actors");

        Ok(Self {
            context,
            task_request_actor,
            result_processor_actor,
            step_enqueuer_actor,
            task_finalizer_actor,
        })
    }
}

Shutdown

The shutdown() method gracefully stops all actors:

impl ActorRegistry {
    pub async fn shutdown(&mut self) {
        tracing::info!("Shutting down ActorRegistry");

        // Call stopped() on all actors in reverse initialization order
        if let Some(actor) = Arc::get_mut(&mut self.task_finalizer_actor) {
            if let Err(e) = actor.stopped() {
                tracing::error!(error = %e, "Failed to stop TaskFinalizerActor");
            }
        }

        if let Some(actor) = Arc::get_mut(&mut self.step_enqueuer_actor) {
            if let Err(e) = actor.stopped() {
                tracing::error!(error = %e, "Failed to stop StepEnqueuerActor");
            }
        }

        if let Some(actor) = Arc::get_mut(&mut self.result_processor_actor) {
            if let Err(e) = actor.stopped() {
                tracing::error!(error = %e, "Failed to stop ResultProcessorActor");
            }
        }

        if let Some(actor) = Arc::get_mut(&mut self.task_request_actor) {
            if let Err(e) = actor.stopped() {
                tracing::error!(error = %e, "Failed to stop TaskRequestActor");
            }
        }

        tracing::info!("✅ ActorRegistry shutdown complete");
    }
}

Implemented Actors

TaskRequestActor

Handles task initialization requests from external clients.

Location: tasker-orchestration/src/actors/task_request_actor.rs

Message: ProcessTaskRequestMessage

  • Input: TaskRequestMessage with task details
  • Response: Uuid of created task

Delegation: Wraps TaskRequestProcessor service

Purpose: Entry point for new workflow instances, coordinates task creation and initial step discovery.
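
As a sketch of how such a message could be declared against the Message trait above (the request field name follows the command-processor example later in this document; the exact shape is illustrative):

// Illustrative only: a message whose response is the UUID of the created task.
pub struct ProcessTaskRequestMessage {
    pub request: TaskRequestMessage,
}

impl Message for ProcessTaskRequestMessage {
    // Matches the description above: the handler returns the new task's Uuid.
    type Response = Uuid;
}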

ResultProcessorActor

Processes step execution results from workers.

Location: tasker-orchestration/src/actors/result_processor_actor.rs

Message: ProcessStepResultMessage

  • Input: StepExecutionResult with execution outcome
  • Response: () (unit type)

Delegation: Wraps OrchestrationResultProcessor service

Purpose: Handles step completion, coordinates task finalization when appropriate.

StepEnqueuerActor

Manages batch processing of ready tasks.

Location: tasker-orchestration/src/actors/step_enqueuer_actor.rs

Message: ProcessBatchMessage

  • Input: Empty (uses system state)
  • Response: StepEnqueuerServiceResult with batch stats

Delegation: Wraps StepEnqueuerService

Purpose: Discovers ready tasks and enqueues their executable steps.

TaskFinalizerActor

Handles task finalization with atomic claiming.

Location: tasker-orchestration/src/actors/task_finalizer_actor.rs

Message: FinalizeTaskMessage

  • Input: task_uuid to finalize
  • Response: FinalizationResult with action taken

Delegation: Wraps TaskFinalizer service (decomposed into focused components)

Purpose: Completes or fails tasks based on step execution results, prevents race conditions through atomic claiming.

Integration with Commands

Command Processor Integration

The command processor calls actors directly without intermediate wrapper layers:

// From: tasker-orchestration/src/orchestration/command_processor.rs

/// Handle task initialization using TaskRequestActor directly
async fn handle_initialize_task(
    &self,
    request: TaskRequestMessage,
) -> TaskerResult<TaskInitializeResult> {
    // Direct actor-based task initialization
    let msg = ProcessTaskRequestMessage { request };
    let task_uuid = self.actors.task_request_actor.handle(msg).await?;

    Ok(TaskInitializeResult::Success {
        task_uuid,
        message: "Task initialized successfully".to_string(),
    })
}

/// Handle step result processing using ResultProcessorActor directly
async fn handle_process_step_result(
    &self,
    step_result: StepExecutionResult,
) -> TaskerResult<StepProcessResult> {
    // Direct actor-based step result processing
    let msg = ProcessStepResultMessage {
        result: step_result.clone(),
    };

    match self.actors.result_processor_actor.handle(msg).await {
        Ok(()) => Ok(StepProcessResult::Success {
            message: format!(
                "Step {} result processed successfully",
                step_result.step_uuid
            ),
        }),
        Err(e) => Ok(StepProcessResult::Error {
            message: format!("Failed to process step result: {e}"),
        }),
    }
}

/// Handle task finalization using TaskFinalizerActor directly
async fn handle_finalize_task(&self, task_uuid: Uuid) -> TaskerResult<TaskFinalizationResult> {
    // Direct actor-based task finalization
    let msg = FinalizeTaskMessage { task_uuid };

    let result = self.actors.task_finalizer_actor.handle(msg).await?;

    Ok(TaskFinalizationResult::Success {
        task_uuid: result.task_uuid,
        final_status: format!("{:?}", result.action),
        completion_time: Some(chrono::Utc::now()),
    })
}

Design Evolution: Initially planned to use lifecycle_services/ as a wrapper layer between command processor and actors. After implementing Phase 7 service decomposition, we found direct actor calls were simpler and cleaner, so we removed the intermediate layer.

Service Decomposition (Phase 7)

Large services (800-900 lines) were decomposed into focused components following single responsibility principle:

TaskFinalizer Decomposition

task_finalization/ (848 lines → 6 files)
├── mod.rs                          # Public API and types
├── service.rs                      # Main TaskFinalizer service (~200 lines)
├── completion_handler.rs           # Task completion logic
├── event_publisher.rs              # Lifecycle event publishing
├── execution_context_provider.rs   # Context fetching
└── state_handlers.rs               # State-specific handling

StepEnqueuerService Decomposition

step_enqueuer_services/ (781 lines → 4 files)
├── mod.rs                          # Public API
├── service.rs                      # Main service (~250 lines)
├── batch_processor.rs              # Batch processing logic
└── state_handlers.rs               # State-specific processing

ResultProcessor Decomposition

result_processing/ (889 lines → 5 files)
├── mod.rs                          # Public API
├── service.rs                      # Main processor
├── metadata_processor.rs           # Metadata handling
├── error_handler.rs                # Error processing
└── result_validator.rs             # Result validation

Actor Lifecycle

Lifecycle Phases

┌─────────────────┐
│  Construction   │  ActorRegistry::build() creates actor instances
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Initialization  │  started() hook called on each actor
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Operation     │  Actors process messages via Handler<M>::handle()
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Shutdown     │  stopped() hook called on each actor (reverse order)
└─────────────────┘

Example Actor Implementation

use tasker_orchestration::actors::{OrchestrationActor, Handler, Message};

// Define the actor
pub struct TaskFinalizerActor {
    context: Arc<SystemContext>,
    service: TaskFinalizer,
}

// Implement base actor trait
impl OrchestrationActor for TaskFinalizerActor {
    fn name(&self) -> &'static str {
        "TaskFinalizerActor"
    }

    fn context(&self) -> &Arc<SystemContext> {
        &self.context
    }

    fn started(&mut self) -> TaskerResult<()> {
        tracing::info!("TaskFinalizerActor starting");
        Ok(())
    }

    fn stopped(&mut self) -> TaskerResult<()> {
        tracing::info!("TaskFinalizerActor stopping");
        Ok(())
    }
}

// Define message type
pub struct FinalizeTaskMessage {
    pub task_uuid: Uuid,
}

impl Message for FinalizeTaskMessage {
    type Response = FinalizationResult;
}

// Implement message handler
#[async_trait]
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
    type Response = FinalizationResult;

    async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
        tracing::debug!(
            actor = %self.name(),
            task_uuid = %msg.task_uuid,
            "Processing FinalizeTaskMessage"
        );

        // Delegate to service
        self.service.finalize_task(msg.task_uuid).await
            .map_err(|e| e.into())
    }
}

Benefits

1. Consistency

All lifecycle components follow the same pattern:

  • Uniform initialization via started()
  • Uniform cleanup via stopped()
  • Uniform message handling via Handler<M>

2. Type Safety

Messages and responses are type-checked at compile time:

// Compile error if message/response types don't match
impl Handler<WrongMessage> for TaskFinalizerActor {
    type Response = WrongResponse;  // ❌ Won't compile
    // ...
}

3. Testability

  • Clear message boundaries for mocking
  • Isolated actor lifecycle for unit tests
  • Type-safe message construction

4. Maintainability

  • Clear separation of concerns
  • Explicit message contracts
  • Centralized lifecycle management
  • Decomposed services (<300 lines per file)

5. Simplicity

  • Direct actor calls (no wrapper layers)
  • Pure routing in command processor
  • Easy to trace message flow

Summary

The actor-based architecture provides a consistent, type-safe foundation for lifecycle component management in tasker-core. Key takeaways:

  1. Lightweight Pattern: Actors wrap decomposed services, providing message-based interface
  2. Lifecycle Management: Consistent initialization and shutdown via traits
  3. Type Safety: Compile-time verification of message contracts
  4. Service Decomposition: Focused components following single responsibility principle
  5. Direct Integration: Command processor calls actors directly without intermediate wrappers
  6. Production Ready: All phases complete, zero breaking changes, full test coverage

The architecture provides a solid foundation for future scalability and maintainability while maintaining the proven reliability of existing orchestration logic.


← Back to Documentation Hub

Backpressure Architecture

Last Updated: 2026-02-05 · Audience: Architects, Developers, Operations · Status: Active · Related Docs: Worker Event Systems | MPSC Channel Guidelines

← Back to Documentation Hub


This document provides the unified backpressure strategy for tasker-core, covering all system components from API ingestion through worker execution.

Core Principle

Step idempotency is the primary constraint. Any backpressure mechanism must ensure that step claiming, business logic execution, and result persistence remain stable and consistent. The system must gracefully degrade under load without compromising workflow correctness.

System Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         BACKPRESSURE FLOW OVERVIEW                           │
└─────────────────────────────────────────────────────────────────────────────┘

                            ┌─────────────────┐
                            │  External Client │
                            └────────┬────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [1] API LAYER BACKPRESSURE      │
                    │  • Circuit breaker (503)         │
                    │  • System overload (503)         │
                    │  • Request validation            │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [2] ORCHESTRATION BACKPRESSURE  │
                    │  • Command channel (bounded)     │
                    │  • Connection pool limits        │
                    │  • PGMQ depth checks             │
                    └────────────────┬────────────────┘
                                     │
                         ┌───────────┴───────────┐
                         │     PGMQ Queues       │
                         │  • Namespace queues   │
                         │  • Result queues      │
                         └───────────┬───────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [3] WORKER BACKPRESSURE         │
                    │  • Claim capacity check          │
                    │  • Semaphore-bounded handlers    │
                    │  • Completion channel bounds     │
                    └────────────────┬────────────────┘
                                     │
                    ┌────────────────▼────────────────┐
                    │  [4] RESULT FLOW BACKPRESSURE    │
                    │  • Completion channel bounds     │
                    │  • Domain event drop semantics   │
                    └─────────────────────────────────┘

Backpressure Points by Component

1. API Layer

The API layer provides backpressure through 503 responses with intelligent Retry-After headers.

Rate Limiting (429): Rate limiting is intentionally out of scope for tasker-core. This responsibility belongs to upstream infrastructure (API Gateways, NLB/ALB, service mesh). Tasker focuses on system health-based backpressure via 503 responses.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Circuit Breaker | Implemented | Return 503 when database breaker open |
| System Overload | Planned | Return 503 when queue/channel saturation detected |
| Request Validation | Implemented | Return 400 for invalid requests |

Response Codes:

  • 200 OK - Request accepted
  • 400 Bad Request - Invalid request format
  • 503 Service Unavailable - System overloaded (includes Retry-After header)

503 Response Triggers:

  1. Circuit Breaker Open: Database operations failing repeatedly
  2. Queue Depth High (Planned): PGMQ namespace queues approaching capacity
  3. Channel Saturation (Planned): Command channel buffer > 80% full

Retry-After Header Strategy:

503 Service Unavailable
Retry-After: {calculated_delay_seconds}

Calculation:
- Circuit breaker open: Use breaker timeout (default 30s)
- Queue depth high: Estimate based on processing rate
- Channel saturation: Short delay (5-10s) for buffer drain
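
A minimal sketch of that calculation in Rust, assuming an OverloadReason enum and the default 30-second breaker timeout (names are illustrative, not the actual API):

// Illustrative only: map an overload reason to a Retry-After value in seconds.
enum OverloadReason {
    CircuitBreakerOpen { breaker_timeout_secs: u64 },
    QueueDepthHigh { backlog: u64, drain_rate_per_sec: u64 },
    ChannelSaturation,
}

fn retry_after_seconds(reason: OverloadReason) -> u64 {
    match reason {
        // Circuit breaker open: wait out the breaker timeout (default 30s).
        OverloadReason::CircuitBreakerOpen { breaker_timeout_secs } => breaker_timeout_secs,
        // Queue depth high: rough estimate from backlog and processing rate.
        OverloadReason::QueueDepthHigh { backlog, drain_rate_per_sec } => {
            (backlog / drain_rate_per_sec.max(1)).clamp(5, 300)
        }
        // Channel saturation: short delay while buffers drain.
        OverloadReason::ChannelSaturation => 10,
    }
}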

Configuration:

# config/tasker/base/common.toml
[common.circuit_breakers.component_configs.web]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)

2. Orchestration Layer

The orchestration layer protects internal processing from command flooding.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Command Channel | Implemented | Bounded MPSC with monitoring |
| Connection Pool | Implemented | Database connection limits |
| PGMQ Depth Check | Planned | Reject enqueue when queue too deep |

Command Channel Backpressure:

Command Sender → [Bounded Channel] → Command Processor
                      │
                      └── If full: Block with timeout → Reject
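
A sketch of the "block with timeout, then reject" behavior using a bounded tokio MPSC channel (the command type and error mapping are illustrative, not the actual orchestration code):

use std::time::Duration;
use tokio::sync::mpsc;

// Illustrative only: enqueue a command, blocking briefly if the buffer is full,
// and return a retriable rejection instead of blocking indefinitely.
async fn enqueue_command<T>(
    tx: &mpsc::Sender<T>,
    command: T,
    wait: Duration,
) -> Result<(), String> {
    match tx.send_timeout(command, wait).await {
        Ok(()) => Ok(()),
        // Buffer stayed full for the whole wait: reject so callers can back off (503 upstream).
        Err(mpsc::error::SendTimeoutError::Timeout(_)) => Err("command channel saturated".into()),
        Err(mpsc::error::SendTimeoutError::Closed(_)) => Err("command processor shut down".into()),
    }
}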

Configuration:

# config/tasker/base/orchestration.toml
[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000

[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000

3. Messaging Layer

The messaging layer provides the backbone between orchestration and workers. Provider-agnostic via MessageClient, supporting PGMQ (default) and RabbitMQ backends.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Visibility Timeout | Implemented | Messages return to queue after timeout |
| Batch Size Limits | Implemented | Bounded message reads |
| Queue Depth Check | Planned | Reject enqueue when depth exceeded |
| Messaging Circuit Breaker | Implemented | Fast-fail send/receive when provider unhealthy |

Messaging Circuit Breaker: MessageClient wraps send/receive operations with circuit breaker protection. When the messaging provider (PGMQ or RabbitMQ) fails repeatedly, the breaker opens and returns MessagingError::CircuitBreakerOpen immediately, preventing slow timeouts from cascading into orchestration and worker processing loops. Ack/nack and health check operations bypass the breaker — ack/nack failures are safe (visibility timeout handles redelivery), and health check must work when the breaker is open to detect recovery. See Circuit Breakers for details.
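
The fast-fail path has roughly the shape sketched below; the breaker and provider types are hypothetical stand-ins for the real MessageClient internals, but the decision order (check breaker, send, record outcome) is the point:

// Illustrative only: gate a send on breaker state, record the outcome,
// and let ack/nack and health checks bypass the breaker entirely.
async fn send_with_breaker(
    breaker: &CircuitBreaker,          // hypothetical breaker handle
    provider: &dyn MessagingProvider,  // hypothetical provider trait
    queue: &str,
    payload: &[u8],
) -> Result<(), MessagingError> {
    if breaker.is_open() {
        // Fast fail: no slow provider timeout on the critical path.
        return Err(MessagingError::CircuitBreakerOpen);
    }
    match provider.send(queue, payload).await {
        Ok(()) => {
            breaker.record_success();
            Ok(())
        }
        Err(e) => {
            breaker.record_failure();
            Err(e)
        }
    }
}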

Queue Depth Monitoring (Planned):

The system will work with PGMQ’s native capabilities rather than enforcing arbitrary limits. Queue depth monitoring provides visibility without hard rejection:

┌──────────────────────────────────────────────────────────────────────┐
│ QUEUE DEPTH STRATEGY                                                  │
├──────────────────────────────────────────────────────────────────────┤
│ Level    │ Depth Ratio │ Action                                       │
├──────────────────────────────────────────────────────────────────────┤
│ Normal   │ < 70%       │ Normal operation                             │
│ Warning  │ 70-85%      │ Log warning, emit metric                     │
│ Critical │ 85-95%      │ API returns 503 for new tasks                │
│ Overflow │ > 95%       │ API rejects all writes, alert operators      │
└──────────────────────────────────────────────────────────────────────┘

Note: Depth ratio = current_depth / configured_soft_limit
Soft limit is advisory, not a hard PGMQ constraint.
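
A self-contained sketch of that classification, assuming an advisory soft limit (thresholds mirror the table above; names are illustrative):

// Illustrative only: classify queue depth against a configured soft limit.
#[derive(Debug, PartialEq)]
enum QueueDepthLevel {
    Normal,   // < 70% of soft limit
    Warning,  // 70-85%: log warning, emit metric
    Critical, // 85-95%: API returns 503 for new tasks
    Overflow, // > 95%: reject writes, alert operators
}

fn classify_depth(current_depth: u64, soft_limit: u64) -> QueueDepthLevel {
    let ratio = current_depth as f64 / soft_limit.max(1) as f64;
    match ratio {
        r if r > 0.95 => QueueDepthLevel::Overflow,
        r if r > 0.85 => QueueDepthLevel::Critical,
        r if r > 0.70 => QueueDepthLevel::Warning,
        _ => QueueDepthLevel::Normal,
    }
}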

Portability Considerations:

  • Queue depth semantics vary by backend (PGMQ vs RabbitMQ vs SQS)
  • Configuration is backend-agnostic where possible
  • Backend-specific tuning goes in backend-specific config sections

Configuration:

# config/tasker/base/common.toml
[common.queues]
default_visibility_timeout_seconds = 30

[common.queues.pgmq]
poll_interval_ms = 250

[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000

# Messaging circuit breaker
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)

4. Worker Layer

The worker layer protects handler execution from saturation.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Semaphore-Bounded Dispatch | Implemented | Max concurrent handlers |
| Claim Capacity Check | Planned | Refuse claims when at capacity |
| Handler Timeout | Implemented | Kill stuck handlers |
| Completion Channel | Implemented | Bounded result buffer |

Handler Dispatch Flow:

Step Message
     │
     ▼
┌─────────────────┐
│ Capacity Check  │──── At capacity? ──── Leave in queue
│ (Planned)       │                       (visibility timeout)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Acquire Permit  │
│ (Semaphore)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Execute Handler │
│ (with timeout)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Release Permit  │──── BEFORE sending to completion channel
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Send Completion │
└─────────────────┘
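
The permit/timeout/release ordering can be sketched with tokio primitives; the handler is a generic future and the result types are illustrative (the real dispatcher differs):

use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::timeout};

// Illustrative only: bound concurrent handlers with a semaphore, enforce a
// handler timeout, and release the permit before the completion send.
async fn dispatch_step<F>(
    permits: Arc<Semaphore>,
    handler_timeout: Duration,
    run_handler: F,
) -> Result<String, String>
where
    F: std::future::Future<Output = Result<String, String>>,
{
    // Acquire a permit; this waits when max_concurrent_handlers is reached.
    let permit = permits
        .acquire_owned()
        .await
        .map_err(|_| "semaphore closed".to_string())?;

    // A timeout produces a FAILURE result, never a dropped step.
    let result = match timeout(handler_timeout, run_handler).await {
        Ok(res) => res,
        Err(_) => Err("handler timed out".to_string()),
    };

    // Release the permit BEFORE the caller sends on the completion channel,
    // so a slow completion channel cannot starve handler capacity.
    drop(permit);
    result
}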

Configuration:

# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000

5. Domain Events

Domain events use fire-and-forget semantics to avoid blocking the critical path.

| Mechanism | Status | Behavior |
| --- | --- | --- |
| Try-Send | Implemented | Non-blocking send |
| Drop on Full | Implemented | Events dropped if channel full |
| Metrics | Planned | Track dropped events |

Domain Event Flow:

Handler Complete
     │
     ├── Result → Completion Channel (blocking, must succeed)
     │
     └── Domain Events → try_send() → If full: DROP with metric
                                       │
                                       └── Step execution NOT affected
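
A sketch of the fire-and-forget send using tokio's try_send (the event type and dropped-event counter are illustrative):

use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc::{self, error::TrySendError};

// Illustrative only: domain events never block the critical path. If the
// channel is full (or closed), the event is dropped and a counter records it.
fn publish_domain_event<E>(tx: &mpsc::Sender<E>, event: E, dropped_events: &AtomicU64) {
    match tx.try_send(event) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) | Err(TrySendError::Closed(_)) => {
            // Step execution is NOT affected; only observability degrades.
            dropped_events.fetch_add(1, Ordering::Relaxed);
        }
    }
}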

Segmentation of Responsibility

Orchestration System

The orchestration system must protect itself from:

  1. Client overload: Too many /v1/tasks requests
  2. Internal saturation: Command channel overflow
  3. Database exhaustion: Connection pool depletion
  4. Queue explosion: Unbounded PGMQ growth

Backpressure Response Hierarchy:

  1. Return 503 to client with Retry-After (fastest, cheapest)
  2. Block at command channel (internal buffering)
  3. Soft-reject at queue depth threshold (503 to new tasks)
  4. Circuit breaker opens (stop accepting work)

Worker System

The worker system must protect itself from:

  1. Handler saturation: Too many concurrent handlers
  2. FFI backlog: Ruby/Python handlers falling behind
  3. Completion overflow: Results backing up
  4. Step starvation: Claims outpacing processing

Backpressure Response Hierarchy:

  1. Refuse step claim (leave in queue, visibility timeout)
  2. Block at dispatch channel (internal buffering)
  3. Block at completion channel (handler waits)
  4. Circuit breaker opens (stop claiming work)

Step Idempotency Guarantees

Safe Backpressure Points

These backpressure points preserve step idempotency:

| Point | Why Safe |
| --- | --- |
| API 503 rejection | Task not yet created |
| Queue depth soft-limit | Step not yet enqueued |
| Step claim refusal | Message stays in queue, visibility timeout protects |
| Handler dispatch channel full | Step claimed but execution queued |
| Completion channel backpressure | Handler completed, result buffered |

Unsafe Patterns (NEVER DO)

| Pattern | Risk | Mitigation |
| --- | --- | --- |
| Drop step after claiming | Lost work | Always send result (success or failure) |
| Timeout during handler execution | Duplicate execution on retry | Handlers MUST be idempotent |
| Drop completion result | Orchestration unaware of completion | Completion channel blocks, never drops |
| Reset step state without visibility timeout | Race with other workers | Use PGMQ visibility timeout |

Idempotency Contract

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STEP EXECUTION IDEMPOTENCY CONTRACT                       │
└─────────────────────────────────────────────────────────────────────────────┘

1. CLAIM: Atomic via pgmq_read_specific_message()
   ├── Only one worker can claim a message
   ├── Visibility timeout protects against worker crash
   └── If claim fails: Message stays in queue → another worker claims

2. EXECUTE: Handler invocation (FFI boundary critical - see below)
   ├── Handlers SHOULD be idempotent (business logic recommendation)
   ├── Timeout generates FAILURE result (not drop)
   ├── Panic generates FAILURE result (not drop)
   └── Error generates FAILURE result (not drop)

3. PERSIST: Result submission
   ├── Completion channel is bounded but BLOCKING
   ├── Result MUST reach orchestration (never dropped)
   └── If send fails: Step remains "in_progress" → recovered by orchestration

4. FINALIZE: Orchestration processes result
   ├── State transition is atomic
   ├── Duplicate results handled by state guards
   └── Idempotent: Same result processed twice = same outcome

FFI Boundary Idempotency Semantics

The FFI boundary (Rust → Ruby/Python handler) creates a critical demarcation for error classification:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    FFI BOUNDARY ERROR CLASSIFICATION                         │
└─────────────────────────────────────────────────────────────────────────────┘

                           FFI BOUNDARY
                                │
    BEFORE FFI CROSSING         │         AFTER FFI CROSSING
    (System Layer)              │         (Business Logic Layer)
                                │
    ┌─────────────────────┐     │     ┌─────────────────────┐
    │ System errors are   │     │     │ System failures     │
    │ RETRYABLE:          │     │     │ are PERMANENT:      │
    │                     │     │     │                     │
    │ • Channel timeout   │     │     │ • Worker crash      │
    │ • Queue unavailable │     │     │ • FFI panic         │
    │ • Claim race lost   │     │     │ • Process killed    │
    │ • Network partition │     │     │                     │
    │ • Message malformed │     │     │ We cannot know if   │
    │                     │     │     │ business logic      │
    │ Step has NOT been   │     │     │ executed or not.    │
    │ handed to handler.  │     │     │                     │
    └─────────────────────┘     │     └─────────────────────┘
                                │
                                │     ┌─────────────────────┐
                                │     │ Developer errors    │
                                │     │ are TRUSTED:        │
                                │     │                     │
                                │     │ • RetryableError →  │
                                │     │   System retries    │
                                │     │                     │
                                │     │ • PermanentError →  │
                                │     │   Step fails        │
                                │     │                     │
                                │     │ Developer knows     │
                                │     │ their domain logic. │
                                │     └─────────────────────┘

Key Principles:

  1. Before FFI: Any system error is safe to retry because no business logic has executed.

  2. After FFI, system failure: If the worker crashes or FFI call fails after dispatch, we MUST treat it as permanent failure. We cannot know if the handler:

    • Never started (safe to retry)
    • Started but didn’t complete (unknown side effects)
    • Completed but didn’t return (work is done)
  3. After FFI, developer error: Trust the developer’s classification:

    • RetryableError: Developer explicitly signals safe to retry (e.g., temporary API unavailable)
    • PermanentError: Developer explicitly signals not retriable (e.g., invalid data, business rule violation)

Implementation Guidance:

// BEFORE FFI - system error, retryable
match dispatch_to_handler(step).await {
    Err(DispatchError::ChannelFull) => StepResult::retryable("dispatch_channel_full"),
    Err(DispatchError::Timeout) => StepResult::retryable("dispatch_timeout"),
    Ok(ffi_handle) => {
        // AFTER FFI - different rules apply
        match ffi_handle.await {
            // System crash after FFI = permanent (unknown state)
            Err(FfiError::ProcessCrash) => StepResult::permanent("handler_crash"),
            Err(FfiError::Panic) => StepResult::permanent("handler_panic"),

            // Developer-returned errors = trust their classification
            Ok(HandlerResult::RetryableError(msg)) => StepResult::retryable(msg),
            Ok(HandlerResult::PermanentError(msg)) => StepResult::permanent(msg),
            Ok(HandlerResult::Success(data)) => StepResult::success(data),
        }
    }
}

Note: We RECOMMEND handlers be idempotent but cannot REQUIRE it—business logic is developer-controlled. The system provides visibility timeout protection and duplicate result handling, but ultimate idempotency responsibility lies with handler implementations.


Backpressure Decision Tree

Use this decision tree when designing new backpressure mechanisms:

                    ┌─────────────────────────┐
                    │ New Backpressure Point  │
                    └────────────┬────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │ Does this affect step   │
                    │ execution correctness?  │
                    └────────────┬────────────┘
                                 │
                   ┌─────────────┴─────────────┐
                   │                           │
                  Yes                          No
                   │                           │
                   ▼                           ▼
         ┌─────────────────┐         ┌─────────────────┐
         │ Can the work be │         │ Safe to drop    │
         │ retried safely? │         │ or timeout      │
         └────────┬────────┘         └─────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
       Yes                  No
        │                   │
        ▼                   ▼
  ┌───────────┐      ┌───────────────┐
  │ Use block │      │ MUST NOT DROP │
  │ or reject │      │ Block until   │
  │ (retriable│      │ success       │
  │ error)    │      └───────────────┘
  └───────────┘
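
The two branches that affect correctness map naturally onto the two send modes of a bounded channel. A minimal sketch using tokio::sync::mpsc; the WorkItem type and error string are illustrative, not Tasker types:

use tokio::sync::mpsc;

#[derive(Debug)]
struct WorkItem {
    step_uuid: String,
}

/// Retryable work: reject immediately when the channel is full so the caller
/// can surface a retryable error instead of silently dropping the item.
fn enqueue_retryable(tx: &mpsc::Sender<WorkItem>, item: WorkItem) -> Result<(), String> {
    tx.try_send(item).map_err(|_| "dispatch_channel_full".to_string())
}

/// Non-retryable work: block until the channel accepts it, because dropping
/// would lose work that cannot be safely re-executed.
async fn enqueue_must_not_drop(tx: &mpsc::Sender<WorkItem>, item: WorkItem) {
    // `send` awaits until capacity is available (or the receiver is gone).
    let _ = tx.send(item).await;
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<WorkItem>(2);
    enqueue_retryable(&tx, WorkItem { step_uuid: "a".into() }).unwrap();
    enqueue_must_not_drop(&tx, WorkItem { step_uuid: "b".into() }).await;
    while let Ok(item) = rx.try_recv() {
        println!("received {item:?}");
    }
}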

Configuration Reference

TOML Structure: Configuration files are organized as config/tasker/base/{common,worker,orchestration}.toml with environment overrides in config/tasker/environments/{test,development,production}/.

Complete Backpressure Configuration

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/common.toml - Shared settings
# ════════════════════════════════════════════════════════════════════════════

# Circuit breaker defaults (inherited by all component breakers)
[common.circuit_breakers.default_config]
failure_threshold = 5      # Failures before opening
timeout_seconds = 30       # Time in open state before half-open
success_threshold = 2      # Successes in half-open to close

# Web/API database circuit breaker
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2

# Messaging circuit breaker - PGMQ/RabbitMQ operations
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2

# Queue configuration
[common.queues]
default_visibility_timeout_seconds = 30

[common.queues.pgmq]
poll_interval_ms = 250

[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/orchestration.toml - Orchestration layer
# ════════════════════════════════════════════════════════════════════════════

[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000

[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000

[orchestration.mpsc_channels.event_channel]
event_channel_buffer_size = 10000

# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/worker.toml - Worker layer
# ════════════════════════════════════════════════════════════════════════════

[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000        # Steps waiting for handler
completion_buffer_size = 1000      # Results waiting for orchestration
max_concurrent_handlers = 10       # Semaphore permits
handler_timeout_ms = 30000         # Max handler execution time

[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000        # FFI events waiting for Ruby/Python
completion_timeout_ms = 30000      # Time to wait for FFI completion
starvation_warning_threshold_ms = 10000  # Warn if event waits this long

# Planned:
# claim_capacity_threshold = 0.8   # Refuse claims at 80% capacity

Monitoring and Alerting

See Backpressure Monitoring Runbook for:

  • Key metrics to monitor
  • Alerting thresholds
  • Incident response procedures

Key Metrics Summary

MetricTypeAlert Threshold
api_requests_rejected_totalCounter> 10/min
circuit_breaker_stateGaugestate = open
mpsc_channel_saturationGauge> 80%
pgmq_queue_depthGauge> 80% of max
worker_claim_refusals_totalCounter> 10/min
handler_semaphore_wait_time_msHistogramp99 > 1000ms
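
As one example, mpsc_channel_saturation can be derived from a bounded tokio channel's remaining capacity. A minimal sketch; the gauge name and the wiring into a metrics backend are left to the deployment:

use tokio::sync::mpsc;

/// Saturation (0.0..=1.0) of a bounded tokio mpsc channel: used slots / buffer size.
fn channel_saturation<T>(tx: &mpsc::Sender<T>) -> f64 {
    let max = tx.max_capacity() as f64;    // configured buffer size
    let available = tx.capacity() as f64;  // currently free slots
    if max == 0.0 { 0.0 } else { (max - available) / max }
}

fn main() {
    let (tx, _rx) = mpsc::channel::<u64>(1000);
    // Report this value as the mpsc_channel_saturation gauge and alert above 0.8.
    println!("saturation = {:.2}", channel_saturation(&tx));
}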


← Back to Documentation Hub

Circuit Breakers

Last Updated: 2026-02-04 Audience: Architects, Operators, Developers Status: Active Related Docs: Backpressure Architecture | Observability | Operations: Backpressure Monitoring

← Back to Documentation Hub


Circuit breakers provide fault isolation and cascade prevention across tasker-core. This document covers the circuit breaker architecture, implementations, configuration, and operational monitoring.

Core Concept

Circuit breakers prevent cascading failures by failing fast when a component is unhealthy. Instead of waiting for slow or failing operations to timeout, circuit breakers detect failure patterns and immediately reject calls, giving the downstream system time to recover.

State Machine

┌─────────────────────────────────────────────────────────────────────────────┐
│                     CIRCUIT BREAKER STATE MACHINE                            │
└─────────────────────────────────────────────────────────────────────────────┘

                    Success
                  ┌─────────┐
                  │         │
                  ▼         │
              ┌───────┐     │
      ───────>│CLOSED │─────┘
              └───┬───┘
                  │
                  │ failure_threshold
                  │ consecutive failures
                  │
                  ▼
              ┌───────┐
              │ OPEN  │◄─────────────────────┐
              └───┬───┘                      │
                  │                          │
                  │ timeout_seconds          │ Any failure
                  │ elapsed                  │ in half-open
                  │                          │
                  ▼                          │
            ┌──────────┐                     │
            │HALF-OPEN │─────────────────────┘
            └────┬─────┘
                 │
                 │ success_threshold
                 │ consecutive successes
                 │
                 ▼
            ┌───────┐
            │CLOSED │
            └───────┘

States:

  • Closed: Normal operation. All calls allowed. Tracks consecutive failures.
  • Open: Failing fast. All calls rejected immediately. Waiting for timeout.
  • Half-Open: Testing recovery. Limited calls allowed. Single failure reopens.
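
A simplified sketch of these transition rules (counters only, no timestamps or atomics; the real implementation is lock-free and also tracks the open-state timeout):

#[derive(Debug, Clone, Copy, PartialEq)]
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

struct Breaker {
    state: CircuitState,
    consecutive_failures: u32,
    consecutive_successes: u32,
    failure_threshold: u32,
    success_threshold: u32,
}

impl Breaker {
    fn record_failure(&mut self) {
        match self.state {
            CircuitState::Closed => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = CircuitState::Open; // start failing fast
                }
            }
            // Any failure while testing recovery reopens the circuit.
            CircuitState::HalfOpen => self.state = CircuitState::Open,
            CircuitState::Open => {}
        }
    }

    fn record_success(&mut self) {
        match self.state {
            CircuitState::Closed => self.consecutive_failures = 0,
            CircuitState::HalfOpen => {
                self.consecutive_successes += 1;
                if self.consecutive_successes >= self.success_threshold {
                    // Enough proof of recovery: close and reset counters.
                    self.state = CircuitState::Closed;
                    self.consecutive_failures = 0;
                }
            }
            CircuitState::Open => {}
        }
    }

    /// Called once timeout_seconds has elapsed in the open state.
    fn on_timeout_elapsed(&mut self) {
        if self.state == CircuitState::Open {
            self.state = CircuitState::HalfOpen;
            self.consecutive_successes = 0;
        }
    }
}

fn main() {
    let mut breaker = Breaker {
        state: CircuitState::Closed,
        consecutive_failures: 0,
        consecutive_successes: 0,
        failure_threshold: 5,
        success_threshold: 2,
    };
    for _ in 0..5 { breaker.record_failure(); }  // CLOSED -> OPEN
    breaker.on_timeout_elapsed();                // OPEN -> HALF-OPEN
    breaker.record_success();
    breaker.record_success();                    // HALF-OPEN -> CLOSED
    assert_eq!(breaker.state, CircuitState::Closed);
}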

Unified Trait: CircuitBreakerBehavior

All circuit breaker implementations share a common trait defined in tasker-shared/src/resilience/behavior.rs:

#![allow(unused)]
fn main() {
pub trait CircuitBreakerBehavior: Send + Sync + Debug {
    fn name(&self) -> &str;
    fn state(&self) -> CircuitState;
    fn should_allow(&self) -> bool;
    fn record_success(&self, duration: Duration);
    fn record_failure(&self, duration: Duration);
    fn is_healthy(&self) -> bool;
    fn force_open(&self);
    fn force_closed(&self);
    fn metrics(&self) -> CircuitBreakerMetrics;
}
}

Each specialized breaker wraps the generic CircuitBreaker (composition pattern) and implements this trait. This means:

  • Consistent state machine behavior across all breakers
  • Proper half-open → closed recovery via success_threshold
  • Lock-free atomic state management
  • Domain-specific methods remain as additional methods on each type
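
A hedged sketch of the composition pattern: the specialized type holds the generic breaker and delegates all state bookkeeping to it. The ExampleMessagingBreaker name and the stubbed inner breaker are illustrative stand-ins, not the actual Tasker types:

use std::time::{Duration, Instant};

// Stand-in for the generic CircuitBreaker from tasker_shared::resilience;
// only the methods this sketch needs are stubbed here.
struct CircuitBreaker;

impl CircuitBreaker {
    fn should_allow(&self) -> bool { true }
    fn record_success(&self, _duration: Duration) {}
    fn record_failure(&self, _duration: Duration) {}
}

// Illustrative specialized breaker (not an actual Tasker type): wraps the
// generic breaker via composition and adds a domain-specific guarded call.
struct ExampleMessagingBreaker {
    inner: CircuitBreaker,
}

impl ExampleMessagingBreaker {
    /// Returns None when the circuit is open (fail fast), otherwise runs the
    /// operation and delegates success/failure bookkeeping to the inner breaker.
    fn guarded_call<T, E>(&self, op: impl FnOnce() -> Result<T, E>) -> Option<Result<T, E>> {
        if !self.inner.should_allow() {
            return None;
        }
        let started = Instant::now();
        let outcome = op();
        match &outcome {
            Ok(_) => self.inner.record_success(started.elapsed()),
            Err(_) => self.inner.record_failure(started.elapsed()),
        }
        Some(outcome)
    }
}

fn main() {
    let breaker = ExampleMessagingBreaker { inner: CircuitBreaker };
    let result = breaker.guarded_call(|| Ok::<_, String>("sent"));
    println!("{result:?}");
}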

Circuit Breaker Implementations

Tasker Core has four circuit breaker implementations, each protecting a specific component. All wrap the generic CircuitBreaker from tasker_shared::resilience:

Circuit BreakerLocationPurposeTrigger Type
Web Databasetasker-orchestrationAPI database operationsError-based
Task Readinesstasker-orchestrationFallback poller database checksError-based
FFI Completiontasker-workerRuby/Python handler completion channelLatency-based
Messagingtasker-sharedMessage queue operations (PGMQ/RabbitMQ)Error-based

1. Web Database Circuit Breaker

Purpose: Protects API endpoints from cascading database failures.

Scope: Independent from orchestration system’s internal operations.

Behavior:

  • Opens when database queries fail repeatedly
  • Returns 503 with Retry-After header when open
  • Fast-fail rejection with atomic state management

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.web]
failure_threshold = 5      # Consecutive failures before opening
success_threshold = 2      # Successes in half-open to fully close
# timeout_seconds inherited from default_config (30s)

Health Check Integration:

  • Included in /health/ready endpoint
  • State reported in /health/detailed response
  • Metric: api_circuit_breaker_state (0=closed, 1=half-open, 2=open)
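
A hedged sketch of the fast-fail rejection described above, using axum response types. The real handlers consult the shared WebDatabaseCircuitBreaker; here its decision is passed in as a plain boolean, and the 30-second Retry-After value is illustrative:

use axum::{
    http::{header::RETRY_AFTER, StatusCode},
    response::{IntoResponse, Response},
};

// Illustrative guard: returns a ready-made 503 response when the breaker is open.
fn reject_if_open(breaker_allows: bool) -> Result<(), Response> {
    if breaker_allows {
        return Ok(());
    }
    // Fail fast and tell clients when to retry instead of letting requests
    // pile up against a struggling database.
    Err((
        StatusCode::SERVICE_UNAVAILABLE,
        [(RETRY_AFTER, "30")],
        "database circuit breaker is open",
    )
        .into_response())
}

fn main() {
    let rejected = reject_if_open(false).unwrap_err();
    assert_eq!(rejected.status(), StatusCode::SERVICE_UNAVAILABLE);
}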

2. Task Readiness Circuit Breaker

Purpose: Protects fallback poller from database overload during polling cycles.

Scope: Independent from web circuit breaker, specific to task readiness queries.

Behavior:

  • Opens when task readiness queries fail repeatedly
  • Skips polling cycles when open (doesn’t fail-fast, just skips)
  • Allows orchestration to continue processing existing work

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10     # Higher threshold for polling
timeout_seconds = 60       # Longer recovery window
success_threshold = 3      # More successes needed for confidence

Why Separate from Web?:

  • Different failure patterns (polling vs request-driven)
  • Different recovery semantics (skip vs reject)
  • Isolation prevents web failures from stopping polling (and vice versa)

3. FFI Completion Circuit Breaker

Purpose: Protects Ruby/Python worker completion channels from backpressure.

Scope: Worker-specific, protects FFI boundary.

Behavior:

  • Latency-based: Treats slow sends (>100ms) as failures
  • Opens when completion channel is consistently slow
  • Prevents FFI threads from blocking on saturated channels
  • Drops completions when open (with metrics), allowing handler threads to continue

Configuration (config/tasker/base/worker.toml):

[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5            # Slow sends before opening
recovery_timeout_seconds = 5     # Short recovery window
success_threshold = 2            # Successes to close
slow_send_threshold_ms = 100     # Latency threshold (100ms)

Why Latency-Based?:

  • Slow channel sends indicate backpressure buildup
  • Blocking FFI threads can cascade to Ruby/Python handler starvation
  • Error-only detection misses slow-but-completing operations
  • Latency detection catches degradation before total failure

Metrics:

  • ffi_completion_slow_sends_total - Sends exceeding latency threshold
  • ffi_completion_circuit_open_rejections_total - Rejections due to open circuit
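
A minimal sketch of latency-based classification: time the send and record anything slower than the threshold as a failure, even if it succeeded. The RecordsOutcome trait is a stand-in for the CircuitBreakerBehavior trait shown earlier:

use std::time::{Duration, Instant};

const SLOW_SEND_THRESHOLD: Duration = Duration::from_millis(100);

// Stand-in for the CircuitBreakerBehavior trait; only the two recording
// methods matter for this sketch.
trait RecordsOutcome {
    fn record_success(&self, duration: Duration);
    fn record_failure(&self, duration: Duration);
}

/// Time the completion send and treat "slow but successful" as a failure, so
/// sustained backpressure opens the breaker before anything actually errors.
fn record_send_latency<B: RecordsOutcome>(breaker: &B, send: impl FnOnce()) {
    let started = Instant::now();
    send();
    let elapsed = started.elapsed();
    if elapsed > SLOW_SEND_THRESHOLD {
        breaker.record_failure(elapsed); // latency alone counts as a failure
    } else {
        breaker.record_success(elapsed);
    }
}

struct NoopBreaker;

impl RecordsOutcome for NoopBreaker {
    fn record_success(&self, _duration: Duration) {}
    fn record_failure(&self, _duration: Duration) {}
}

fn main() {
    record_send_latency(&NoopBreaker, || std::thread::sleep(Duration::from_millis(5)));
}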

4. Messaging Circuit Breaker

Purpose: Protects message queue operations from provider failures (PGMQ or RabbitMQ).

Scope: Integrated into MessageClient, shared across orchestration and worker messaging.

Behavior:

  • Opens when send/receive operations fail repeatedly
  • Protected operations: send_step_message, receive_step_messages, send_step_result, receive_step_results, send_task_request, receive_task_requests, send_task_finalization, receive_task_finalizations, send_message, receive_messages
  • Unprotected operations (safe to fail or needed for recovery): ack_message, nack_message, extend_visibility, health_check, ensure_queue, queue stats
  • Coordinates with visibility timeout for message safety
  • Provider-agnostic: works with both PGMQ and RabbitMQ backends

Configuration (config/tasker/base/common.toml):

[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5      # Failures before opening
success_threshold = 2      # Successes to close
# timeout_seconds inherited from default_config (30s)

Why ack/nack bypass the breaker?:

  • Ack/nack failure causes message redelivery via visibility timeout, which is safe
  • Health check must work when breaker is open to detect recovery
  • Queue management is startup-only and should not be gated

Configuration Reference

Global Settings

[common.circuit_breakers.global_settings]
metrics_collection_interval_seconds = 30    # Metrics aggregation interval
min_state_transition_interval_seconds = 5.0 # Debounce for rapid transitions

Default Configuration

Applied to any circuit breaker without explicit configuration:

[common.circuit_breakers.default_config]
failure_threshold = 5      # 1-100 range
timeout_seconds = 30       # 1-300 range
success_threshold = 2      # 1-50 range

Component-Specific Overrides

# Task readiness (polling-specific)
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10
success_threshold = 3

# Messaging operations (PGMQ/RabbitMQ)
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2

# Web/API database operations
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2

Note: timeout_seconds is inherited from default_config for all component circuit breakers. The pgmq key is accepted as an alias for messaging for backward compatibility.

Worker-Specific Configuration

# FFI completion (latency-based)
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5
recovery_timeout_seconds = 5
success_threshold = 2
slow_send_threshold_ms = 100

Environment Overrides

Different environments may need different thresholds:

Test (config/tasker/environments/test/common.toml):

[common.circuit_breakers.default_config]
failure_threshold = 2      # Faster failure detection
timeout_seconds = 5        # Quick recovery for tests
success_threshold = 1

Production (config/tasker/environments/production/common.toml):

[common.circuit_breakers.default_config]
failure_threshold = 10     # More tolerance for transient failures
timeout_seconds = 60       # Longer recovery window
success_threshold = 5      # More confidence before closing

Health Endpoint Integration

Circuit breaker states are exposed through health endpoints for monitoring and Kubernetes probes.

Orchestration Health (/health/detailed)

{
  "status": "healthy",
  "checks": {
    "circuit_breakers": {
      "status": "healthy",
      "message": "Circuit breaker state: Closed",
      "duration_ms": 1,
      "last_checked": "2025-12-10T10:00:00Z"
    }
  }
}

Worker Health (/health/detailed)

{
  "status": "healthy",
  "checks": {
    "circuit_breakers": {
      "status": "healthy",
      "message": "2 circuit breakers: 2 closed, 0 open, 0 half-open. Details: ffi_completion: closed (100 calls, 2 failures); task_readiness: closed (50 calls, 0 failures)",
      "duration_ms": 0,
      "last_checked": "2025-12-10T10:00:00Z"
    }
  }
}

Health Status Mapping

Circuit Breaker StateHealth StatusImpact
All ClosedhealthyNormal operation
Any Half-OpendegradedTesting recovery
Any OpenunhealthyFailing fast
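
A small sketch of the aggregation rule in this table (worst state wins). The enum and function names are illustrative, not the actual Tasker types:

#[derive(Clone, Copy, PartialEq)]
enum CircuitState { Closed, HalfOpen, Open }

#[derive(Debug, PartialEq)]
enum HealthStatus { Healthy, Degraded, Unhealthy }

/// Any open breaker makes the component unhealthy, any half-open breaker
/// degrades it, otherwise it is healthy.
fn aggregate_health(states: &[CircuitState]) -> HealthStatus {
    if states.iter().any(|s| *s == CircuitState::Open) {
        HealthStatus::Unhealthy
    } else if states.iter().any(|s| *s == CircuitState::HalfOpen) {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}

fn main() {
    let states = [CircuitState::Closed, CircuitState::HalfOpen];
    assert_eq!(aggregate_health(&states), HealthStatus::Degraded);
}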

Monitoring and Alerting

Key Metrics

MetricTypeDescription
api_circuit_breaker_stateGaugeWeb breaker state (0/1/2)
tasker_circuit_breaker_stateGaugePer-component state
api_requests_rejected_totalCounterRejections due to open breaker
ffi_completion_slow_sends_totalCounterSlow send detections
ffi_completion_circuit_open_rejections_totalCounterFFI breaker rejections

Prometheus Alerts

groups:
  - name: circuit_breakers
    rules:
      - alert: TaskerCircuitBreakerOpen
        expr: api_circuit_breaker_state == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker is OPEN"
          description: "Circuit breaker {{ $labels.component }} has been open for >1 minute"

      - alert: TaskerCircuitBreakerHalfOpen
        expr: api_circuit_breaker_state == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker stuck in half-open"
          description: "Circuit breaker {{ $labels.component }} in half-open state >5 minutes"

      - alert: TaskerFFISlowSendsHigh
        expr: rate(ffi_completion_slow_sends_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "FFI completion channel experiencing backpressure"
          description: "Slow sends averaging >10/second, circuit breaker may open"

Grafana Dashboard Panels

Circuit Breaker State Timeline:

Panel: Time series
Query: api_circuit_breaker_state
Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)

FFI Latency Percentiles:

Panel: Time series
Queries:
  - histogram_quantile(0.50, ffi_completion_send_duration_seconds_bucket)
  - histogram_quantile(0.95, ffi_completion_send_duration_seconds_bucket)
  - histogram_quantile(0.99, ffi_completion_send_duration_seconds_bucket)
Thresholds: 100ms warning, 500ms critical

Operational Procedures

When Circuit Breaker Opens

Immediate Actions:

  1. Check database connectivity: pg_isready -h <host> -p 5432
  2. Check connection pool status: /health/detailed endpoint
  3. Review recent error logs for root cause
  4. Monitor queue depth for message backlog

Recovery:

  • Circuit automatically tests recovery after timeout_seconds
  • No manual intervention needed for transient failures
  • For persistent failures, fix underlying issue first

Escalation:

  • If breaker stays open >5 minutes, escalate to database team
  • If breaker oscillates (open/half-open/open), increase failure_threshold

Tuning Guidelines

Symptom: Breaker opens too frequently

  • Increase failure_threshold
  • Investigate root cause of failures
  • Consider if failures are transient vs systemic

Symptom: Breaker stays open too long

  • Decrease timeout_seconds
  • Verify downstream system has recovered
  • Check if success_threshold is too high

Symptom: FFI breaker opens unnecessarily

  • Increase slow_send_threshold_ms
  • Verify channel buffer sizes are adequate
  • Check Ruby/Python handler throughput

Architecture Integration

Relationship to Backpressure

Circuit breakers are one layer of the broader backpressure strategy:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        RESILIENCE LAYER STACK                                │
└─────────────────────────────────────────────────────────────────────────────┘

Layer 1: Circuit Breakers     → Fast-fail on component failure
Layer 2: Bounded Channels     → Backpressure on internal queues
Layer 3: Visibility Timeouts  → Message-level retry safety
Layer 4: Semaphore Limits     → Handler execution rate limiting
Layer 5: Connection Pools     → Database resource management

See Backpressure Architecture for the complete strategy.

Independence Principle

Each circuit breaker operates independently:

  • Web breaker can be open while task readiness breaker is closed
  • FFI breaker state doesn’t affect PGMQ breaker
  • Prevents single failure mode from cascading across components
  • Allows targeted recovery per component

Integration Points

ComponentCircuit BreakerIntegration Point
tasker-orchestration/src/webWeb DatabaseAPI request handlers
tasker-orchestration/src/orchestration/task_readinessTask ReadinessFallback poller loop
tasker-worker/src/worker/handlersFFI CompletionCompletion channel sends
tasker-shared/src/messaging/client.rsMessagingMessageClient send/receive methods

Troubleshooting

Common Issues

Issue: Web circuit breaker flapping (open → half-open → open rapidly)

Diagnosis:

  1. Check database query latency (slow queries can cause timeout failures)
  2. Review connection pool saturation
  3. Check if PostgreSQL is under memory pressure

Resolution:

  • Increase failure_threshold if failures are transient
  • Increase timeout_seconds to give more recovery time
  • Fix underlying database performance issues

Issue: FFI completion circuit breaker opens during normal load

Diagnosis:

  1. Check Ruby/Python handler execution time
  2. Review completion channel buffer utilization
  3. Verify worker concurrency settings

Resolution:

  • Increase slow_send_threshold_ms if handlers are legitimately slow
  • Increase channel buffer size in worker config
  • Reduce handler concurrency if system is overloaded

Issue: Task readiness breaker open but web API working fine

Diagnosis:

  • Task readiness queries may be slower/different than API queries
  • Polling may hit database at different times (e.g., during maintenance)

Resolution:

  • Independent breakers are working as designed
  • Check specific task readiness query performance
  • Consider database index optimization for readiness queries

Source Code Reference

ComponentFile
CircuitBreakerBehavior Traittasker-shared/src/resilience/behavior.rs
Generic CircuitBreakertasker-shared/src/resilience/circuit_breaker.rs
Circuit Breaker Configtasker-shared/src/config/circuit_breaker.rs
MessageClient (messaging breaker)tasker-shared/src/messaging/client.rs
WebDatabaseCircuitBreakertasker-orchestration/src/api_common/circuit_breaker.rs
Web CB Helperstasker-orchestration/src/web/circuit_breaker.rs
TaskReadinessCircuitBreakertasker-orchestration/src/orchestration/task_readiness/circuit_breaker.rs
FfiCompletionCircuitBreakertasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs
Worker Health Integrationtasker-worker/src/web/handlers/health.rs
Circuit Breaker Typestasker-shared/src/types/api/worker.rs

← Back to Documentation Hub

Crate Architecture

Last Updated: 2026-01-15 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Events and Commands | Quick Start

← Back to Documentation Hub


Overview

Tasker Core is organized as a Cargo workspace with 7 member crates, each with a specific responsibility in the workflow orchestration system. This document explains the role of each crate, their inter-dependencies, and how they work together to provide a complete orchestration solution.

Design Philosophy

The crate structure follows these principles:

  1. Separation of Concerns: Each crate has a well-defined responsibility
  2. Minimal Dependencies: Crates depend on the minimum necessary dependencies
  3. Shared Foundation: Common types and utilities in tasker-shared
  4. Language Flexibility: Support for multiple worker implementations (Rust, Ruby, Python planned)
  5. Production Ready: Workers and the orchestration system can be deployed and scaled independently

Workspace Structure

tasker-core/
├── tasker-pgmq/              # PGMQ wrapper with notification support
├── tasker-shared/            # Shared types, SQL functions, utilities
├── tasker-orchestration/     # Task coordination and lifecycle management
├── tasker-worker/            # Step execution and handler integration
├── tasker-client/            # API client library (REST + gRPC transport)
├── tasker-ctl/               # CLI binary (depends on tasker-client)
└── workers/
    ├── ruby/ext/tasker_core/ # Ruby FFI bindings
    └── rust/                 # Rust native worker

Crate Dependency Graph

┌─────────────────────────────────────────────────────────┐
│                   External Dependencies                 │
│  (sqlx, tokio, serde, pgmq, magnus, axum, etc.)       │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                    tasker-pgmq                          │
│  PGMQ wrapper with PostgreSQL LISTEN/NOTIFY            │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                    tasker-shared                        │
│  Core types, SQL functions, state machines             │
└─────────────────────────────────────────────────────────┘
                            │
               ┌────────────┴────────────┐
               │                         │
               ▼                         ▼
┌──────────────────────────┐  ┌──────────────────────────┐
│  tasker-orchestration    │  │    tasker-worker         │
│  Task coordination       │  │    Step execution        │
│  Lifecycle management    │  │    Handler integration   │
│  REST API                │  │    FFI support           │
└──────────────────────────┘  └──────────────────────────┘
               │                         │
               ▼                         │
┌──────────────────────────┐            │
│    tasker-client         │            │
│    API client library    │            │
│    REST + gRPC transport │            │
└──────────────────────────┘            │
               │                        │
               ▼                        │
┌──────────────────────────┐            │
│    tasker-ctl            │            │
│    CLI binary            │            │
└──────────────────────────┘            │
                                        │
               ┌────────────────────────┘
               │
      ┌────────┴────────┐
      ▼                 ▼
┌────────────┐  ┌────────────┐
│ workers/   │  │ workers/   │
│   ruby/    │  │   rust/    │
│   ext/     │  │            │
└────────────┘  └────────────┘

Core Crates

tasker-pgmq

Purpose: Wrapper around PostgreSQL Message Queue (PGMQ) with native PostgreSQL LISTEN/NOTIFY support

Location: tasker-pgmq/

Key Responsibilities:

  • Wrap pgmq crate with notification capabilities
  • Provide atomic pgmq_send_with_notify() operations
  • Handle notification channel management
  • Support namespace-aware queue naming

Public API:

#![allow(unused)]
fn main() {
pub struct PgmqClient {
    // Send message with atomic notification
    pub async fn send_with_notify<T>(&self, queue: &str, msg: T) -> Result<i64>;

    // Read message with visibility timeout
    pub async fn read<T>(&self, queue: &str, vt: i32) -> Result<Option<Message<T>>>;

    // Delete processed message
    pub async fn delete(&self, queue: &str, msg_id: i64) -> Result<bool>;
}
}
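
A hedged usage sketch that follows the signatures in the summary above. Queue name, payload type, and error boxing are illustrative assumptions, and client construction is omitted because it is not shown in the summary:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct StepPayload {
    step_uuid: String,
}

// Assumes an already-constructed client handle.
async fn roundtrip(client: &tasker_pgmq::PgmqClient) -> Result<(), Box<dyn std::error::Error>> {
    // Atomic send + pg_notify so listeners wake immediately.
    let msg_id = client
        .send_with_notify("fulfillment_queue", StepPayload { step_uuid: "abc".into() })
        .await?;

    // Read with a 30-second visibility timeout; the message becomes visible
    // again if it is not deleted in time.
    if let Some(_message) = client.read::<StepPayload>("fulfillment_queue", 30).await? {
        // ... handle the payload, then acknowledge by deleting it.
        client.delete("fulfillment_queue", msg_id).await?;
    }
    Ok(())
}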

When to Use:

  • When you need reliable message queuing with PostgreSQL
  • When you need atomic send + notify operations
  • When building event-driven systems on PostgreSQL

Dependencies:

  • pgmq - Core PostgreSQL message queue functionality
  • sqlx - Database connectivity
  • tokio - Async runtime

tasker-shared

Purpose: Foundation crate containing all shared types, utilities, and SQL function interfaces

Location: tasker-shared/

Key Responsibilities:

  • Core domain models (Task, WorkflowStep, TaskTransition, etc.)
  • State machine implementations (Task + Step)
  • SQL function executor and registry
  • Database utilities and migrations
  • Event system traits and types
  • Messaging abstraction layer: Provider-agnostic messaging with PGMQ, RabbitMQ, and InMemory backends
  • Factory system for testing
  • Metrics and observability primitives

Public API:

#![allow(unused)]
fn main() {
// Core Models
pub mod models {
    pub struct Task { /* ... */ }
    pub struct WorkflowStep { /* ... */ }
    pub struct TaskTransition { /* ... */ }
    pub struct WorkflowStepTransition { /* ... */ }
}

// State Machines
pub mod state_machine {
    pub struct TaskStateMachine { /* ... */ }
    pub struct StepStateMachine { /* ... */ }
    pub enum TaskState { /* 12 states */ }
    pub enum WorkflowStepState { /* 9 states */ }
}

// SQL Functions
pub mod database {
    pub struct SqlFunctionExecutor { /* ... */ }
    pub async fn get_step_readiness_status(...) -> Result<Vec<StepReadinessStatus>>;
    pub async fn get_next_ready_tasks(...) -> Result<Vec<ReadyTaskInfo>>;
}

// Event System
pub mod event_system {
    pub trait EventDrivenSystem { /* ... */ }
    pub enum DeploymentMode { Hybrid, EventDrivenOnly, PollingOnly }
}

// Messaging
pub mod messaging {
    // Provider abstraction
    pub enum MessagingProvider { Pgmq, RabbitMq, InMemory }
    pub trait MessagingService { /* send_message, receive_messages, ack_message, ... */ }
    pub trait SupportsPushNotifications { /* subscribe, subscribe_many, requires_fallback_polling */ }
    pub enum MessageNotification { Available { ... }, Message(...) }

    // Domain client
    pub struct MessageClient { /* High-level queue operations */ }

    // Message types
    pub struct SimpleStepMessage { /* ... */ }
    pub struct TaskRequestMessage { /* ... */ }
    pub struct StepExecutionResult { /* ... */ }
}
}

When to Use:

  • Always - This is the foundation for all other crates
  • When you need core domain models
  • When you need state machine logic
  • When you need SQL function access
  • When you need testing factories

Dependencies:

  • tasker-pgmq - Message queue operations
  • sqlx - Database operations
  • serde - Serialization
  • Many workspace-shared dependencies

Why It’s Separate:

  • Eliminates circular dependencies between orchestration and worker
  • Provides single source of truth for domain models
  • Enables independent testing of core logic
  • Allows multiple implementations (orchestration vs worker) to share code

tasker-orchestration

Purpose: Task coordination, lifecycle management, and orchestration REST API

Location: tasker-orchestration/

Key Responsibilities:

  • Actor-based lifecycle coordination
  • Task initialization and finalization
  • Step discovery and enqueueing
  • Result processing from workers
  • Dynamic executor pool management
  • Event-driven coordination
  • REST API endpoints
  • Health monitoring
  • Metrics collection

Public API:

#![allow(unused)]
fn main() {
// Core orchestration
pub struct OrchestrationCore {
    pub async fn new() -> Result<Self>;
    pub async fn from_config(config: ConfigManager) -> Result<Self>;
}

// Actor-based coordination
pub mod actors {
    pub struct ActorRegistry { /* ... */ }
    pub struct TaskRequestActor { /* ... */ }
    pub struct ResultProcessorActor { /* ... */ }
    pub struct StepEnqueuerActor { /* ... */ }
    pub struct TaskFinalizerActor { /* ... */ }

    pub trait OrchestrationActor { /* ... */ }
    pub trait Handler<M: Message> { /* ... */ }
    pub trait Message { /* ... */ }
}

// Lifecycle services (wrapped by actors)
pub mod lifecycle {
    pub struct TaskInitializer { /* ... */ }
    pub struct StepEnqueuerService { /* ... */ }
    pub struct OrchestrationResultProcessor { /* ... */ }
    pub struct TaskFinalizer { /* ... */ }
}

// Message hydration (Phase 4)
pub mod hydration {
    pub struct StepResultHydrator { /* ... */ }
    pub struct TaskRequestHydrator { /* ... */ }
    pub struct FinalizationHydrator { /* ... */ }
}

// REST API (Axum)
pub mod web {
    // POST /v1/tasks
    pub async fn create_task(request: TaskRequest) -> Result<TaskResponse>;

    // GET /v1/tasks/{uuid}
    pub async fn get_task(uuid: Uuid) -> Result<TaskResponse>;

    // GET /health
    pub async fn health_check() -> Result<HealthResponse>;
}

// gRPC API (Tonic)
// Feature-gated behind `grpc-api`
pub mod grpc {
    pub struct GrpcServer { /* ... */ }
    pub struct GrpcState { /* wraps Arc<SharedApiServices> */ }

    pub mod services {
        pub struct TaskServiceImpl { /* 6 RPCs */ }
        pub struct StepServiceImpl { /* 4 RPCs */ }
        pub struct TemplateServiceImpl { /* 2 RPCs */ }
        pub struct HealthServiceImpl { /* 4 RPCs */ }
        pub struct AnalyticsServiceImpl { /* 2 RPCs */ }
        pub struct DlqServiceImpl { /* 6 RPCs */ }
        pub struct ConfigServiceImpl { /* 1 RPC */ }
    }

    pub mod interceptors {
        pub struct AuthInterceptor { /* Bearer token, API key */ }
    }
}

// Event systems
pub mod event_systems {
    pub struct OrchestrationEventSystem { /* ... */ }
    pub struct TaskReadinessEventSystem { /* ... */ }
}
}

Actor Architecture:

The orchestration crate implements a lightweight actor pattern for lifecycle component coordination:

  • ActorRegistry: Manages all 4 orchestration actors with lifecycle hooks
  • Message-Based Communication: Type-safe message handling via Handler<M> trait
  • Service Decomposition: Large services decomposed into focused components (<300 lines per file)
  • Direct Integration: Command processor calls actors directly without wrapper layers

See Actor-Based Architecture for comprehensive documentation.
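
To illustrate the message-based pattern, a simplified sketch of a Handler<M> implementation. The trait shapes and the ProcessStepResult message are stand-ins, not the actual actor API; see the Actor-Based Architecture doc for the real definitions:

use async_trait::async_trait;

// Simplified stand-ins for the Message and Handler traits described above.
trait Message: Send + 'static {
    type Response: Send + 'static;
}

#[async_trait]
trait Handler<M: Message>: Send + Sync {
    async fn handle(&self, message: M) -> M::Response;
}

// Illustrative message: ask the result processor to record a step outcome.
struct ProcessStepResult {
    step_uuid: String,
    success: bool,
}

impl Message for ProcessStepResult {
    type Response = ();
}

struct ResultProcessorActor;

#[async_trait]
impl Handler<ProcessStepResult> for ResultProcessorActor {
    async fn handle(&self, message: ProcessStepResult) {
        // The real actor delegates to its lifecycle service; this just logs.
        println!("step {} finished (success = {})", message.step_uuid, message.success);
    }
}

#[tokio::main]
async fn main() {
    let actor = ResultProcessorActor;
    actor
        .handle(ProcessStepResult { step_uuid: "abc".into(), success: true })
        .await;
}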

When to Use:

  • When you need to run the orchestration server
  • When you need task coordination logic
  • When building custom orchestration components
  • When integrating with the REST API

Dependencies:

  • tasker-shared - Core types and SQL functions
  • tasker-pgmq - Message queuing
  • axum - REST API framework
  • tower-http - HTTP middleware

Deployment: Typically deployed as a server process (tasker-server binary)

Dual-Server Architecture:

Orchestration supports both REST and gRPC APIs running simultaneously via SharedApiServices:

#![allow(unused)]
fn main() {
pub struct SharedApiServices {
    pub security_service: Option<Arc<SecurityService>>,
    pub task_service: TaskService,
    pub step_service: StepService,
    pub health_service: HealthService,
    // ... other services
}

// Both APIs share the same service instances
AppState { services: Arc<SharedApiServices>, ... }   // REST
GrpcState { services: Arc<SharedApiServices>, ... }  // gRPC
}

Port Allocation:

  • REST: 8080 (configurable)
  • gRPC: 9190 (configurable)

tasker-worker

Purpose: Step execution, handler integration, and worker coordination

Location: tasker-worker/

Key Responsibilities:

  • Claim steps from namespace queues
  • Execute step handlers (Rust or FFI)
  • Submit results to orchestration
  • Template management and caching
  • Event-driven step claiming
  • Worker health monitoring
  • FFI integration layer

Public API:

#![allow(unused)]
fn main() {
// Worker core
pub struct WorkerCore {
    pub async fn new(config: WorkerConfig) -> Result<Self>;
    pub async fn start(&mut self) -> Result<()>;
}

// Handler execution
pub mod handlers {
    pub trait StepHandler {
        async fn execute(&self, context: StepContext) -> Result<StepResult>;
    }
}

// Template management
pub mod task_template_manager {
    pub struct TaskTemplateManager {
        pub async fn load_templates(&mut self) -> Result<()>;
        pub fn get_template(&self, name: &str) -> Option<&TaskTemplate>;
    }
}

// Event systems
pub mod event_systems {
    pub struct WorkerEventSystem { /* ... */ }
}
}

When to Use:

  • When you need to run a worker process
  • When implementing custom step handlers
  • When integrating with Ruby/Python handlers via FFI
  • When building worker-specific tools

Dependencies:

  • tasker-shared - Core types and messaging
  • tasker-pgmq - Message queuing
  • magnus (optional) - Ruby FFI bindings

Deployment: Deployed as worker processes, typically one per namespace or scaled horizontally


tasker-client

Purpose: Transport-agnostic API client library for REST and gRPC

Location: tasker-client/

Key Responsibilities:

  • HTTP client for orchestration REST API
  • gRPC client for orchestration gRPC API (feature-gated)
  • Transport abstraction via unified client traits
  • Configuration management and auth resolution
  • Client-side request building

Public API:

#![allow(unused)]
fn main() {
// REST client
pub struct RestOrchestrationClient {
    pub async fn new(base_url: &str) -> Result<Self>;
    // Task, step, template, health operations
}

// gRPC client (feature-gated)
#[cfg(feature = "grpc")]
pub struct GrpcOrchestrationClient {
    pub async fn connect(endpoint: &str) -> Result<Self>;
    pub async fn connect_with_auth(endpoint: &str, auth: GrpcAuthConfig) -> Result<Self>;
    // Same operations as REST client
}

// Transport-agnostic client
pub enum UnifiedOrchestrationClient {
    Rest(Box<RestOrchestrationClient>),
    Grpc(Box<GrpcOrchestrationClient>),
}

// Client trait for transport abstraction
pub trait OrchestrationClient: Send + Sync {
    async fn create_task(&self, request: TaskRequest) -> Result<TaskResponse>;
    async fn get_task(&self, uuid: Uuid) -> Result<TaskResponse>;
    async fn list_tasks(&self, filters: TaskFilters) -> Result<Vec<TaskResponse>>;
    async fn health_check(&self) -> Result<HealthResponse>;
    // ... more operations
}
}
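
A hedged usage sketch following the trait above. It assumes RestOrchestrationClient implements OrchestrationClient as described and that a TaskRequest has been built elsewhere; error conversion is simplified:

use tasker_client::{OrchestrationClient, RestOrchestrationClient, TaskRequest};

// Transport-agnostic: the same code works with the REST or gRPC client because
// both implement the OrchestrationClient trait.
async fn submit_task(
    client: &impl OrchestrationClient,
    request: TaskRequest,
) -> Result<(), Box<dyn std::error::Error>> {
    let _task = client.create_task(request).await?;
    let _health = client.health_check().await?;
    Ok(())
}

async fn example(request: TaskRequest) -> Result<(), Box<dyn std::error::Error>> {
    // REST transport against a local orchestration server (default port 8080).
    let client = RestOrchestrationClient::new("http://localhost:8080").await?;
    submit_task(&client, request).await
}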

When to Use:

  • When you need to interact with orchestration API from Rust
  • When building integration tests
  • When implementing client applications or FFI bindings
  • When building UI frontends (TUI, web) that need API access

tasker-ctl

Purpose: Command-line interface for Tasker (split from tasker-client)

Location: tasker-ctl/

Key Responsibilities:

  • CLI argument parsing and command dispatch (via clap)
  • Task, worker, system, config, auth, and DLQ commands
  • Configuration documentation generation (via askama, feature-gated)
  • API key generation and management

CLI Tools:

# Task management
tasker-ctl task create --template linear_workflow
tasker-ctl task get <uuid>
tasker-ctl task list --namespace payments

# Health checks
tasker-ctl health

# Configuration docs generation
tasker-ctl docs generate

When to Use:

  • When managing tasks from the command line
  • When generating configuration documentation
  • When performing administrative operations (auth, DLQ management)

Dependencies:

  • reqwest - HTTP client
  • clap - CLI argument parsing
  • serde_json - JSON serialization

Worker Implementations

workers/ruby/ext/tasker_core

Purpose: Ruby FFI bindings enabling Ruby workers to execute Rust-orchestrated workflows

Location: workers/ruby/ext/tasker_core/

Key Responsibilities:

  • Expose Rust worker functionality to Ruby via Magnus (FFI)
  • Handle Ruby handler execution
  • Manage Ruby <-> Rust type conversions
  • Provide Ruby API for template registration
  • FFI performance optimization

Ruby API:

# Worker bootstrap
result = TaskerCore::Worker::Bootstrap.start!

# Template registration (automatic)
# Ruby templates in workers/ruby/app/tasker/tasks/templates/

# Handler execution (automatic via FFI)
class MyHandler < TaskerCore::StepHandler
  def execute(context)
    # Step implementation
    { success: true, result: "done" }
  end
end

When to Use:

  • When you have existing Ruby handlers
  • When you need Ruby-specific libraries or gems
  • When migrating from Ruby-based orchestration
  • When team expertise is primarily Ruby

Dependencies:

  • magnus - Ruby FFI bindings
  • tasker-worker - Core worker logic
  • Ruby runtime

Performance Considerations:

  • FFI overhead: ~5-10ms per step (measured)
  • Ruby GC can impact latency
  • Thread-safe FFI calls via Ruby global lock
  • Best for I/O-bound operations, not CPU-intensive

workers/rust

Purpose: Native Rust worker implementation for maximum performance

Location: workers/rust/

Key Responsibilities:

  • Native Rust step handler execution
  • Template definitions in Rust
  • Direct integration with tasker-worker
  • Maximum performance for CPU-intensive operations

Handler API:

#![allow(unused)]
fn main() {
// Define handler in Rust
pub struct MyHandler;

#[async_trait]
impl StepHandler for MyHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        // Step implementation
        Ok(StepResult::success(json!({"result": "done"})))
    }
}

// Register in template
pub fn register_template() -> TaskTemplate {
    TaskTemplate {
        name: "my_workflow",
        steps: vec![
            StepTemplate {
                name: "my_step",
                handler: Box::new(MyHandler),
                // ...
            }
        ]
    }
}
}

When to Use:

  • When you need maximum performance
  • For CPU-intensive operations
  • When building new workflows in Rust
  • When minimizing latency is critical

Dependencies:

  • tasker-worker - Core worker logic
  • tokio - Async runtime

Performance: Native Rust handlers have zero FFI overhead


Crate Relationships

How Crates Work Together

Task Creation Flow

Client Application
  ↓ [HTTP POST]
tasker-client
  ↓ [REST API]
tasker-orchestration::web
  ↓ [Task lifecycle]
tasker-orchestration::lifecycle::TaskInitializer
  ↓ [Uses]
tasker-shared::models::Task
tasker-shared::database::sql_functions
  ↓ [PostgreSQL]
Database + PGMQ

Step Execution Flow

tasker-orchestration::lifecycle::StepEnqueuer
  ↓ [pgmq_send_with_notify]
PGMQ namespace queue
  ↓ [pg_notify event]
tasker-worker::event_systems::WorkerEventSystem
  ↓ [Claims step]
tasker-worker::handlers::execute_handler
  ↓ [FFI or native]
workers/ruby or workers/rust
  ↓ [Returns result]
tasker-worker::orchestration_result_sender
  ↓ [pgmq_send_with_notify]
PGMQ orchestration_step_results queue
  ↓ [pg_notify event]
tasker-orchestration::lifecycle::ResultProcessor
  ↓ [Updates state]
tasker-shared::models::WorkflowStepTransition

Dependency Rationale

Why tasker-shared exists:

  • Prevents circular dependencies (orchestration ↔ worker)
  • Single source of truth for domain models
  • Enables independent testing
  • Allows SQL function reuse

Why workers are separate from tasker-worker:

  • Language-specific implementations
  • Independent deployment
  • FFI boundary separation
  • Multiple worker types supported

Why tasker-pgmq is separate:

  • Reusable in other projects
  • Focused responsibility
  • Easy to test independently
  • Can be published as separate crate

Building and Testing

Build All Crates

# Build everything with all features
cargo build --all-features

# Build specific crate
cargo build --package tasker-orchestration --all-features

# Build workspace root (minimal, mostly for integration)
cargo build

Test All Crates

# Test everything
cargo test --all-features

# Test specific crate
cargo test --package tasker-shared --all-features

# Test with database
DATABASE_URL="postgresql://..." cargo test --all-features

Feature Flags

# Root workspace features
[features]
benchmarks = [
    "tasker-shared/benchmarks",
    # ...
]
test-utils = [
    "tasker-orchestration/test-utils",
    "tasker-shared/test-utils",
    "tasker-worker/test-utils",
]

Migration Notes

Root Crate Being Phased Out

The root tasker-core crate (defined in the workspace root Cargo.toml) is being phased out:

  • Current: Contains minimal code, mostly workspace configuration
  • Future: Will be removed entirely, replaced by individual crates
  • Impact: No functional impact, internal restructuring only
  • Timeline: Complete when all functionality moved to member crates

Why: Cleaner workspace structure, better separation of concerns, easier to understand

Adding New Crates

When adding a new crate to the workspace:

  1. Add to [workspace.members] in root Cargo.toml
  2. Create crate: cargo new --lib tasker-new-crate
  3. Add workspace dependencies to crate’s Cargo.toml
  4. Update this documentation
  5. Add to dependency graph above
  6. Document public API

Best Practices

When to Create a New Crate

Create a new crate when:

  • ✅ You have a distinct, reusable component
  • ✅ You need independent versioning
  • ✅ You want to reduce compile times
  • ✅ You need isolation for testing
  • ✅ You have language-specific implementations

Don’t create a new crate when:

  • ❌ It’s tightly coupled to existing crates
  • ❌ It’s only used in one place
  • ❌ It would create circular dependencies
  • ❌ It’s a small utility module

Dependency Management

  • Use workspace dependencies: Define versions in root Cargo.toml
  • Minimize dependencies: Only depend on what you need
  • Version consistently: Use workspace = true in member crates
  • Document dependencies: Explain why each dependency is needed

API Design

  • Stable public API: Changes should be backward compatible
  • Clear documentation: Every public item needs docs
  • Examples in docs: Show how to use the API
  • Error handling: Use Result with meaningful error types


← Back to Documentation Hub

Deployment Patterns and Configuration

Last Updated: 2026-01-15 Audience: Architects, Operators Status: Active Related Docs: Documentation Hub | Quick Start | Observability | Messaging Abstraction

← Back to Documentation Hub


Overview

Tasker Core supports three deployment modes, each optimized for different operational requirements and infrastructure constraints. This guide covers deployment patterns, configuration management, and production considerations.

Key Deployment Modes:

  • Hybrid Mode (Recommended) - Event-driven with polling fallback
  • EventDrivenOnly Mode - Pure event-driven for lowest latency
  • PollingOnly Mode - Traditional polling for restricted environments

Messaging Backend Options:

  • PGMQ (Default) - PostgreSQL-based, single infrastructure dependency
  • RabbitMQ - AMQP broker, higher throughput for high-volume scenarios

Messaging Backend Selection

Tasker Core supports multiple messaging backends through a provider-agnostic abstraction layer. The choice of backend affects deployment architecture and operational requirements.

Backend Comparison

FeaturePGMQRabbitMQ
InfrastructurePostgreSQL onlyPostgreSQL + RabbitMQ
Delivery ModelPoll + pg_notify signalsNative push (basic_consume)
Fallback PollingRequired for reliabilityNot needed
ThroughputGoodHigher
LatencyLow (~10-50ms)Lowest (~5-20ms)
Operational ComplexityLowerHigher
Message PersistencePostgreSQL transactionsRabbitMQ durability

PGMQ (Default)

PostgreSQL Message Queue is the default backend, ideal for:

  • Simpler deployments: Single database dependency
  • Transactional workflows: Messages participate in PostgreSQL transactions
  • Smaller to medium scale: Excellent for most workloads

Configuration:

# Default - no additional configuration needed
TASKER_MESSAGING_BACKEND=pgmq

Deployment Mode Interaction:

  • Uses pg_notify for real-time notifications
  • Fallback polling recommended for reliability
  • Hybrid mode provides best balance

RabbitMQ

AMQP-based messaging for high-throughput scenarios:

  • High-volume workloads: Better throughput characteristics
  • Existing RabbitMQ infrastructure: Leverage existing investments
  • Pure push delivery: No fallback polling required

Configuration:

TASKER_MESSAGING_BACKEND=rabbitmq
RABBITMQ_URL=amqp://user:password@rabbitmq:5672/%2F

Deployment Mode Interaction:

  • EventDrivenOnly mode is natural fit (no fallback needed)
  • Native push delivery via basic_consume()
  • Protocol-guaranteed message delivery

Choosing a Backend

Decision Tree:
                              ┌─────────────────┐
                              │ Do you need the │
                              │ highest possible │
                              │ throughput?     │
                              └────────┬────────┘
                                       │
                            ┌──────────┴──────────┐
                            │                     │
                           Yes                    No
                            │                     │
                            ▼                     ▼
                   ┌────────────────┐   ┌────────────────┐
                   │ Do you have    │   │ Use PGMQ       │
                   │ existing       │   │ (simpler ops)  │
                   │ RabbitMQ?      │   └────────────────┘
                   └───────┬────────┘
                           │
                ┌──────────┴──────────┐
                │                     │
               Yes                    No
                │                     │
                ▼                     ▼
       ┌────────────────┐    ┌────────────────┐
       │ Use RabbitMQ   │    │ Evaluate       │
       └────────────────┘    │ operational    │
                             │ tradeoffs      │
                             └────────────────┘

Recommendation: Start with PGMQ. Migrate to RabbitMQ only when throughput requirements demand it.


Production Deployment Strategy: Mixed Mode Architecture

Important: In production-grade Kubernetes environments, you typically run multiple orchestration containers simultaneously with different deployment modes. This is not just about horizontal scaling with identical configurations—it’s about deploying containers with different coordination strategies to optimize for both throughput and reliability.

High-Throughput + Safety Net Architecture:

# Most orchestration containers in EventDrivenOnly mode for maximum throughput
- EventDrivenOnly containers: 8-12 replicas (handles 80-90% of workload)
- PollingOnly containers: 2-3 replicas (safety net for missed events)

Why this works:

  1. EventDrivenOnly containers handle the bulk of work with ~10ms latency
  2. PollingOnly containers catch any events that might be missed during network issues or LISTEN/NOTIFY failures
  3. Both sets of containers coordinate through atomic SQL operations (no conflicts)
  4. Scale each mode independently based on throughput needs

Alternative: All-Hybrid Deployment

You can also deploy all containers in Hybrid mode and scale horizontally:

# All containers use Hybrid mode
- Hybrid containers: 10-15 replicas

This is simpler but less flexible. The mixed-mode approach lets you:

  • Tune for specific workload patterns (event-heavy vs. polling-heavy)
  • Adapt to infrastructure constraints (some networks better for events, others for polling)
  • Optimize resource usage (EventDrivenOnly uses less CPU than Hybrid)
  • Scale dimensions independently (scale up event listeners without scaling pollers)

Key Insight

The different deployment modes exist not just for config tuning, but to enable sophisticated deployment strategies where you mix coordination approaches across containers to meet production throughput and reliability requirements.


Deployment Mode Comparison

FeatureHybridEventDrivenOnlyPollingOnly
LatencyLow (event-driven primary)Lowest (~10ms)Higher (~100-500ms)
ReliabilityHighest (automatic fallback)Good (requires stable connections)Good (no dependencies)
Resource UsageMedium (listeners + pollers)Low (listeners only)Medium (pollers only)
Network RequirementsStandard PostgreSQLPersistent connections requiredStandard PostgreSQL
Production Recommended✅ Yes⚠️ With stable network⚠️ For restricted environments
ComplexityMediumLowLow

Hybrid Mode

Overview

Hybrid mode combines the best of both worlds: event-driven coordination for real-time performance with polling fallback for reliability.

How it works:

  1. PostgreSQL LISTEN/NOTIFY provides real-time event notifications
  2. If event listeners fail or lag, polling automatically takes over
  3. System continuously monitors and switches between modes
  4. No manual intervention required
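
A simplified sketch of the fallback decision: wait for a push notification, and run a polling pass if nothing arrives within the activation threshold. This is illustrative only; the real system also monitors listener health and reconnects automatically:

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

/// Wait for a push notification, but fall back to a polling pass if nothing
/// arrives within the activation threshold.
async fn next_wakeup(
    notifications: &mut mpsc::Receiver<String>,
    fallback_threshold: Duration,
) -> &'static str {
    match timeout(fallback_threshold, notifications.recv()).await {
        Ok(Some(_event)) => "woken by pg_notify event",
        // Listener closed or silent for too long: poll the queues directly.
        _ => "fallback polling pass",
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(16);
    tx.send("pgmq_message_ready.orchestration".into()).await.unwrap();
    println!("{}", next_wakeup(&mut rx, Duration::from_millis(5000)).await);
}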

Configuration

# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"

[orchestration.hybrid]
# Event listener settings
enable_event_listeners = true
listener_reconnect_interval_ms = 5000
listener_health_check_interval_ms = 30000

# Polling fallback settings
enable_polling_fallback = true
polling_interval_ms = 1000
fallback_activation_threshold_ms = 5000

# Worker event settings
[orchestration.worker_events]
enable_worker_listeners = true
worker_listener_reconnect_ms = 5000

When to Use Hybrid Mode

Ideal for:

  • Production deployments requiring high reliability
  • Environments with occasional network instability
  • Systems requiring both low latency and guaranteed delivery
  • Multi-region deployments with variable network quality

Example: Production E-commerce Platform

# docker-compose.production.yml
version: '3.8'

services:
  orchestration:
    image: tasker-orchestration:latest
    environment:
      - TASKER_ENV=production
      - TASKER_DEPLOYMENT_MODE=Hybrid
      - DATABASE_URL=postgresql://tasker:${DB_PASSWORD}@postgres:5432/tasker_production
      - RUST_LOG=info
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '1'
          memory: 1G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  postgres:
    image: postgres:16
    environment:
      - POSTGRES_DB=tasker_production
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G

volumes:
  postgres-data:

Monitoring Hybrid Mode

Key Metrics:

# Hybrid mode health indicators
tasker_event_listener_active{mode="hybrid"} = 1           # Listener is active
tasker_event_listener_lag_ms{mode="hybrid"} < 100         # Event lag is acceptable
tasker_polling_fallback_active{mode="hybrid"} = 0         # Not in fallback mode
tasker_mode_switches_total{mode="hybrid"} < 10/hour       # Infrequent mode switching

Alert conditions:

  • Event listener down for > 60 seconds
  • Polling fallback active for > 5 minutes
  • Mode switches > 20 per hour (indicates instability)

EventDrivenOnly Mode

Overview

EventDrivenOnly mode provides the lowest possible latency by relying entirely on PostgreSQL LISTEN/NOTIFY for coordination.

How it works:

  1. Orchestration and workers establish persistent PostgreSQL connections
  2. LISTEN on specific channels for events
  3. Immediate notification on queue changes
  4. No polling overhead or delay
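
A minimal sketch of the LISTEN side using sqlx's PgListener, subscribing to one of the channels listed in the configuration below. Wiring the notification into step claiming is omitted:

use sqlx::postgres::PgListener;

async fn listen_for_queue_events(database_url: &str) -> Result<(), sqlx::Error> {
    // Dedicated persistent connection for LISTEN/NOTIFY.
    let mut listener = PgListener::connect(database_url).await?;
    listener.listen("pgmq_message_ready.orchestration").await?;

    loop {
        // Blocks until PostgreSQL delivers a NOTIFY on the channel.
        let notification = listener.recv().await?;
        println!(
            "channel={} payload={}",
            notification.channel(),
            notification.payload()
        );
        // The real system would claim and process the ready message here.
    }
}

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
    listen_for_queue_events(&url).await
}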

Configuration

# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "EventDrivenOnly"

[orchestration.event_driven]
# Listener configuration
listener_reconnect_interval_ms = 2000
listener_health_check_interval_ms = 15000
max_reconnect_attempts = 10

# Event channels
channels = [
    "pgmq_message_ready.orchestration",
    "pgmq_message_ready.*",
    "pgmq_queue_created"
]

# Connection pool for listeners
listener_pool_size = 5
connection_timeout_ms = 5000

When to Use EventDrivenOnly Mode

Ideal for:

  • High-throughput, low-latency requirements
  • Stable network environments
  • Development and testing environments
  • Systems with reliable PostgreSQL infrastructure

Not recommended for:

  • Unstable network connections
  • Environments with frequent PostgreSQL failovers
  • Systems requiring guaranteed operation during network issues

Example: High-Performance Payment Processing

#![allow(unused)]
fn main() {
// Worker configuration for event-driven mode
use tasker_worker::WorkerConfig;

let config = WorkerConfig {
    deployment_mode: DeploymentMode::EventDrivenOnly,
    namespaces: vec!["payments".to_string()],
    event_driven_settings: EventDrivenSettings {
        listener_reconnect_interval_ms: 2000,
        health_check_interval_ms: 15000,
        max_reconnect_attempts: 10,
    },
    ..Default::default()
};

// Start worker with event-driven mode
let worker = WorkerCore::from_config(config).await?;
worker.start().await?;
}

Monitoring EventDrivenOnly Mode

Critical Metrics:

# Event-driven health indicators
tasker_event_listener_active{mode="event_driven"} = 1    # Must be 1
tasker_event_notifications_received_total                 # Should be > 0
tasker_event_processing_duration_seconds                  # Should be < 0.01
tasker_listener_reconnections_total                       # Should be low

Alert conditions:

  • Event listener inactive
  • No events received for > 60 seconds (when activity expected)
  • Reconnections > 5 per hour

PollingOnly Mode

Overview

PollingOnly mode provides the most reliable operation in restricted or unstable network environments by using traditional polling.

How it works:

  1. Orchestration and workers poll message queues at regular intervals
  2. No dependency on persistent connections or LISTEN/NOTIFY
  3. Configurable polling intervals for performance/resource trade-offs
  4. Automatic retry and backoff on failures
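
A simplified sketch of a polling loop with exponential backoff on errors. The intervals mirror the configuration below, and poll_once is a stand-in for a real pass over the queues:

use std::time::Duration;
use tokio::time::sleep;

const POLL_INTERVAL: Duration = Duration::from_millis(1000);
const BACKOFF_BASE: Duration = Duration::from_millis(1000);
const BACKOFF_MAX: Duration = Duration::from_secs(30);

// Stand-in for one polling pass over the message queues.
async fn poll_once() -> Result<usize, String> {
    Ok(0) // number of messages processed
}

#[tokio::main]
async fn main() {
    let mut backoff = BACKOFF_BASE;
    loop {
        match poll_once().await {
            Ok(_processed) => {
                backoff = BACKOFF_BASE;      // reset after a healthy cycle
                sleep(POLL_INTERVAL).await;  // normal cadence
            }
            Err(err) => {
                eprintln!("poll failed: {err}; backing off {backoff:?}");
                sleep(backoff).await;
                // Double the delay up to the configured maximum.
                backoff = (backoff * 2).min(BACKOFF_MAX);
            }
        }
    }
}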

Configuration

# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"

[orchestration.polling]
# Polling intervals
task_request_poll_interval_ms = 1000
step_result_poll_interval_ms = 500
finalization_poll_interval_ms = 2000

# Batch processing
batch_size = 10
max_messages_per_poll = 100

# Backoff on errors
error_backoff_base_ms = 1000
error_backoff_max_ms = 30000
error_backoff_multiplier = 2.0

When to Use PollingOnly Mode

Ideal for:

  • Restricted network environments (firewalls blocking persistent connections)
  • Environments with frequent PostgreSQL connection issues
  • Systems prioritizing reliability over latency
  • Legacy infrastructure with limited LISTEN/NOTIFY support

Not recommended for:

  • High-frequency, low-latency requirements
  • Systems with strict resource constraints
  • Environments where polling overhead is problematic

Example: Batch Processing System

# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"

[orchestration.polling]
# Longer intervals for batch processing
task_request_poll_interval_ms = 5000
step_result_poll_interval_ms = 2000
finalization_poll_interval_ms = 10000

# Large batches for efficiency
batch_size = 50
max_messages_per_poll = 500

Monitoring PollingOnly Mode

Key Metrics:

# Polling health indicators
tasker_polling_cycles_total                               # Should be increasing
tasker_polling_messages_processed_total                   # Should be > 0
tasker_polling_duration_seconds                           # Should be stable
tasker_polling_errors_total                               # Should be low

Alert conditions:

  • Polling stopped (no cycles in last 60 seconds)
  • Polling duration > 10x interval (indicates overload)
  • Error rate > 5% of polling cycles
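
These thresholds can be checked with PromQL. The expressions below are a sketch: they assume the polling metrics above are exported under these names and that tasker_polling_duration_seconds is a histogram or summary (hence the _sum/_count suffixes):

# Polling stopped: no cycles in the last 60 seconds
increase(tasker_polling_cycles_total[1m]) == 0

# Polling overloaded: average cycle duration above 10 seconds (10x a 1s interval)
rate(tasker_polling_duration_seconds_sum[5m]) / rate(tasker_polling_duration_seconds_count[5m]) > 10

# Error rate above 5% of polling cycles
rate(tasker_polling_errors_total[5m]) / rate(tasker_polling_cycles_total[5m]) > 0.05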

Configuration Management

Component-Based Configuration

Tasker Core uses a component-based TOML configuration system with environment-specific overrides.

Configuration Structure:

config/tasker/
├── base/                          # Base configuration (all environments)
│   ├── database.toml             # Database connection pool settings
│   ├── orchestration.toml        # Orchestration and deployment mode
│   ├── circuit_breakers.toml    # Circuit breaker thresholds
│   ├── executor_pools.toml      # Executor pool sizing
│   ├── pgmq.toml                # Message queue configuration
│   ├── query_cache.toml         # Query caching settings
│   └── telemetry.toml           # Metrics and logging
│
└── environments/                  # Environment-specific overrides
    ├── development/
    │   └── *.toml               # Development overrides
    ├── test/
    │   └── *.toml               # Test overrides
    └── production/
        └── *.toml               # Production overrides
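
Environment files layer on top of the base files. Assuming key-level merging (an override file only needs the keys it changes), the relationship looks like this sketch; the specific values are illustrative:

# config/tasker/base/orchestration.toml (shared defaults)
[orchestration]
deployment_mode = "Hybrid"
max_concurrent_tasks = 100

# config/tasker/environments/production/orchestration.toml (override)
[orchestration]
max_concurrent_tasks = 1000    # Only the keys that differ from base need to appear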

Environment Detection

# Set environment via TASKER_ENV
export TASKER_ENV=production

# Validate configuration
cargo run --bin config-validator

# Expected output:
# ✓ Configuration loaded successfully
# ✓ Environment: production
# ✓ Deployment mode: Hybrid
# ✓ Database pool: 50 connections
# ✓ Circuit breakers: 10 configurations

Example: Production Configuration

# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
max_concurrent_tasks = 1000
task_timeout_seconds = 3600

[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true
polling_interval_ms = 2000
fallback_activation_threshold_ms = 10000

[orchestration.health]
health_check_interval_ms = 30000
unhealthy_threshold = 3
recovery_threshold = 2

# config/tasker/environments/production/database.toml
[database]
max_connections = 50
min_connections = 10
connection_timeout_ms = 5000
idle_timeout_seconds = 600
max_lifetime_seconds = 1800

[database.query_cache]
enabled = true
max_size = 1000
ttl_seconds = 300

# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5
timeout_seconds = 60
half_open_timeout_seconds = 30

[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60

Docker Compose Deployment

Development Setup

# docker-compose.yml
version: '3.8'

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: tasker
      POSTGRES_PASSWORD: tasker
      POSTGRES_DB: tasker_rust_test
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U tasker"]
      interval: 5s
      timeout: 5s
      retries: 5

  orchestration:
    build:
      context: .
      target: orchestration
    environment:
      - TASKER_ENV=development
      - DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
      - RUST_LOG=debug
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
    profiles:
      - server

  worker:
    build:
      context: .
      target: worker
    environment:
      - TASKER_ENV=development
      - DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
      - RUST_LOG=debug
    ports:
      - "8081:8081"
    depends_on:
      postgres:
        condition: service_healthy
    profiles:
      - server

  ruby-worker:
    build:
      context: ./workers/ruby
      dockerfile: Dockerfile
    environment:
      - TASKER_ENV=development
      - DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
      - RUST_LOG=debug
    ports:
      - "8082:8082"
    depends_on:
      postgres:
        condition: service_healthy
    profiles:
      - server

volumes:
  postgres-data:

Production Deployment

# docker-compose.production.yml
version: '3.8'

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: tasker
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
      POSTGRES_DB: tasker_production
    volumes:
      - postgres-data:/var/lib/postgresql/data
    deploy:
      placement:
        constraints:
          - node.labels.database == true
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '2'
          memory: 4G
    secrets:
      - db_password

  orchestration:
    image: tasker-orchestration:${VERSION}
    environment:
      - TASKER_ENV=production
      - DATABASE_URL_FILE=/run/secrets/database_url
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
      rollback_config:
        parallelism: 0
        order: stop-first
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '1'
          memory: 1G
    secrets:
      - database_url

  worker:
    image: tasker-worker:${VERSION}
    environment:
      - TASKER_ENV=production
      - DATABASE_URL_FILE=/run/secrets/database_url
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
    secrets:
      - database_url

secrets:
  db_password:
    external: true
  database_url:
    external: true

volumes:
  postgres-data:
    driver: local

Kubernetes Deployment

This example demonstrates the recommended production pattern: multiple orchestration deployments with different modes.

# k8s/orchestration-event-driven.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-orchestration-event-driven
  namespace: tasker
  labels:
    app: tasker-orchestration
    mode: event-driven
spec:
  replicas: 10  # Majority of orchestration capacity
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tasker-orchestration
      mode: event-driven
  template:
    metadata:
      labels:
        app: tasker-orchestration
        mode: event-driven
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: orchestration
        image: tasker-orchestration:1.0.0
        env:
        - name: TASKER_ENV
          value: "production"
        - name: DEPLOYMENT_MODE
          value: "EventDrivenOnly"  # High-throughput mode
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-secrets
              key: database-url
        - name: RUST_LOG
          value: "info"
        ports:
        - containerPort: 8080
          name: http
        resources:
          requests:
            cpu: 500m      # Lower CPU for event-driven
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

---
# k8s/orchestration-polling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-orchestration-polling
  namespace: tasker
  labels:
    app: tasker-orchestration
    mode: polling
spec:
  replicas: 3  # Safety net for missed events
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tasker-orchestration
      mode: polling
  template:
    metadata:
      labels:
        app: tasker-orchestration
        mode: polling
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: orchestration
        image: tasker-orchestration:1.0.0
        env:
        - name: TASKER_ENV
          value: "production"
        - name: DEPLOYMENT_MODE
          value: "PollingOnly"  # Reliability safety net
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-secrets
              key: database-url
        - name: RUST_LOG
          value: "info"
        ports:
        - containerPort: 8080
          name: http
        resources:
          requests:
            cpu: 750m      # Higher CPU for polling
            memory: 512Mi
          limits:
            cpu: 1500m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

---
# k8s/orchestration-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: tasker-orchestration
  namespace: tasker
spec:
  selector:
    app: tasker-orchestration  # Matches BOTH deployments
  ports:
  - port: 8080
    targetPort: 8080
    protocol: TCP
    name: http
  type: ClusterIP

Key points about this mixed-mode deployment:

  1. 10 EventDrivenOnly pods handle 80-90% of work with ~10ms latency
  2. 3 PollingOnly pods catch anything missed by event listeners
  3. Single service load balances across all 13 pods
  4. No conflicts - atomic SQL operations prevent duplicate processing
  5. Independent scaling - scale event-driven pods for throughput, polling pods for reliability
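
To verify the split after a rollout, one option is to list orchestration pods by their mode label (the labels match the manifests above):

# Show orchestration pods with their mode label
kubectl get pods -n tasker -l app=tasker-orchestration -L mode

# Count pods per mode
kubectl get pods -n tasker -l app=tasker-orchestration,mode=event-driven --no-headers | wc -l
kubectl get pods -n tasker -l app=tasker-orchestration,mode=polling --no-headers | wc -l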

Single-Mode Orchestration Deployment

# k8s/orchestration-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-orchestration
  namespace: tasker
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tasker-orchestration
  template:
    metadata:
      labels:
        app: tasker-orchestration
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: orchestration
        image: tasker-orchestration:1.0.0
        env:
        - name: TASKER_ENV
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-secrets
              key: database-url
        - name: RUST_LOG
          value: "info"
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 2000m
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2

---
apiVersion: v1
kind: Service
metadata:
  name: tasker-orchestration
  namespace: tasker
spec:
  selector:
    app: tasker-orchestration
  ports:
  - port: 8080
    targetPort: 8080
    protocol: TCP
    name: http
  type: ClusterIP

Worker Deployment

# k8s/worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-worker-payments
  namespace: tasker
spec:
  replicas: 5
  selector:
    matchLabels:
      app: tasker-worker
      namespace: payments
  template:
    metadata:
      labels:
        app: tasker-worker
        namespace: payments
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8081"
    spec:
      containers:
      - name: worker
        image: tasker-worker:1.0.0
        env:
        - name: TASKER_ENV
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-secrets
              key: database-url
        - name: RUST_LOG
          value: "info"
        - name: WORKER_NAMESPACES
          value: "payments"
        ports:
        - containerPort: 8081
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8081
          initialDelaySeconds: 20
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 5

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tasker-worker-payments
  namespace: tasker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tasker-worker-payments
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Health Monitoring

Health Check Endpoints

Orchestration Health:

# Basic health check
curl http://localhost:8080/health

# Response:
{
  "status": "healthy",
  "database": "connected",
  "message_queue": "operational"
}

# Detailed health check
curl http://localhost:8080/health/detailed

# Response:
{
  "status": "healthy",
  "deployment_mode": "Hybrid",
  "event_listeners": {
    "active": true,
    "channels": 3,
    "lag_ms": 12
  },
  "polling": {
    "active": false,
    "fallback_triggered": false
  },
  "database": {
    "status": "connected",
    "pool_size": 50,
    "active_connections": 23
  },
  "circuit_breakers": {
    "database": "closed",
    "message_queue": "closed"
  },
  "executors": {
    "task_initializer": {
      "active": 3,
      "max": 10,
      "queue_depth": 5
    },
    "result_processor": {
      "active": 5,
      "max": 10,
      "queue_depth": 12
    }
  }
}

Worker Health:

# Worker health check
curl http://localhost:8081/health

# Response:
{
  "status": "healthy",
  "namespaces": ["payments", "inventory"],
  "active_executions": 8,
  "claimed_steps": 3
}

Kubernetes Probes

# Liveness probe - restart if unhealthy
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

# Readiness probe - remove from load balancer if not ready
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

gRPC Health Checks

Tasker Core exposes gRPC health endpoints alongside REST for Kubernetes gRPC health probes.

Port Allocation:

| Service | REST Port | gRPC Port |
|---|---|---|
| Orchestration | 8080 | 9190 |
| Rust Worker | 8081 | 9191 |
| Ruby Worker | 8082 | 9200 |
| Python Worker | 8083 | 9300 |
| TypeScript Worker | 8085 | 9400 |

gRPC Health Endpoints:

# Using grpcurl
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckLiveness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckReadiness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckDetailedHealth

Kubernetes gRPC Probes (Kubernetes 1.24+):

# gRPC liveness probe
livenessProbe:
  grpc:
    port: 9190
    service: tasker.v1.HealthService
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

# gRPC readiness probe
readinessProbe:
  grpc:
    port: 9190
    service: tasker.v1.HealthService
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

Configuration (config/tasker/base/orchestration.toml):

[orchestration.grpc]
enabled = true
bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
enable_reflection = true       # Service discovery via grpcurl
enable_health_service = true   # gRPC health checks

Scaling Patterns

Horizontal Scaling

Scale different deployment modes independently to optimize for throughput and reliability:

# Scale event-driven pods for throughput
kubectl scale deployment tasker-orchestration-event-driven --replicas=15 -n tasker

# Scale polling pods for reliability
kubectl scale deployment tasker-orchestration-polling --replicas=5 -n tasker

Scaling strategy by workload:

| Scenario | Event-Driven Pods | Polling Pods | Rationale |
|---|---|---|---|
| High throughput | 15-20 | 3-5 | Maximize event-driven capacity |
| Network unstable | 5-8 | 5-8 | Balance between modes |
| Cost optimization | 10-12 | 2-3 | Minimize polling overhead |
| Maximum reliability | 8-10 | 8-10 | Ensure complete coverage |

Single-Mode Orchestration Scaling

If using single deployment mode (Hybrid or EventDrivenOnly):

# Scale orchestration to 10 replicas (all same mode)
kubectl scale deployment tasker-orchestration --replicas=10 -n tasker

Key principles:

  • Multiple orchestration instances process tasks independently
  • Atomic finalization claiming prevents duplicate processing
  • Load balancer distributes API requests across instances

Worker Scaling

Workers scale independently per namespace:

# Scale payment workers to 10 replicas
kubectl scale deployment tasker-worker-payments --replicas=10 -n tasker

  • Each worker claims steps from namespace-specific queues
  • No coordination required between workers
  • Scale per namespace based on queue depth
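
Queue depth can be inspected directly in PostgreSQL via the pgmq extension's metrics functions. Treat this as a sketch: the queue names you see depend on your namespace configuration:

-- Depth and oldest-message age for every PGMQ queue
SELECT queue_name, queue_length, oldest_msg_age_sec
FROM pgmq.metrics_all()
ORDER BY queue_length DESC;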

Vertical Scaling

Resource Allocation:

# High-throughput orchestration
resources:
  requests:
    cpu: 2000m
    memory: 4Gi
  limits:
    cpu: 4000m
    memory: 8Gi

# Standard worker
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

Auto-Scaling

HPA Configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tasker-orchestration
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tasker-orchestration
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: tasker_tasks_active
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

Production Considerations

Database Configuration

Connection Pooling:

# config/tasker/environments/production/database.toml
[database]
max_connections = 50              # Total pool size
min_connections = 10              # Minimum maintained connections
connection_timeout_ms = 5000      # Connection acquisition timeout
idle_timeout_seconds = 600        # Close idle connections after 10 minutes
max_lifetime_seconds = 1800       # Recycle connections after 30 minutes

Calculation:

Total DB Connections = (Orchestration Replicas × Pool Size) + (Worker Replicas × Pool Size)
Example: (3 × 50) + (10 × 20) = 350 connections

Ensure PostgreSQL max_connections > Total DB Connections + Buffer
Recommended: max_connections = 500 for above example
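
To raise the server-side limit, set it on PostgreSQL itself; max_connections only takes effect after a restart, so plan a maintenance window (or use your managed provider's parameter mechanism):

-- On the PostgreSQL server; requires a restart to take effect
ALTER SYSTEM SET max_connections = 500;

-- Verify after the restart
SHOW max_connections;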

Circuit Breaker Tuning

# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5               # Open after 5 consecutive errors
timeout_seconds = 60              # Stay open for 60 seconds
half_open_timeout_seconds = 30    # Test recovery for 30 seconds

[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60

Executor Pool Sizing

# config/tasker/environments/production/executor_pools.toml
[executor_pools.task_initializer]
min_executors = 2
max_executors = 10
queue_high_watermark = 100
queue_low_watermark = 10

[executor_pools.result_processor]
min_executors = 5
max_executors = 20
queue_high_watermark = 200
queue_low_watermark = 20

[executor_pools.step_enqueuer]
min_executors = 3
max_executors = 15
queue_high_watermark = 150
queue_low_watermark = 15

Observability Integration

Prometheus Metrics:

# Prometheus scrape config
scrape_configs:
  - job_name: 'tasker-orchestration'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - tasker
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Key Alerts:

# alerts.yaml
groups:
  - name: tasker
    interval: 30s
    rules:
      - alert: TaskerOrchestrationDown
        expr: up{job="tasker-orchestration"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Tasker orchestration instance down"

      - alert: TaskerHighErrorRate
        expr: rate(tasker_step_errors_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in step execution"

      - alert: TaskerCircuitBreakerOpen
        expr: tasker_circuit_breaker_state{state="open"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is open"

      - alert: TaskerDatabasePoolExhausted
        expr: tasker_database_pool_active >= tasker_database_pool_max
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool exhausted"

Migration Strategies

Migrating to Hybrid Mode

Step 1: Enable event listeners

# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"

[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true    # Keep polling enabled during migration

Step 2: Monitor event listener health

# Check metrics for event listener stability
curl http://localhost:8080/health/detailed | jq '.event_listeners'

Step 3: Gradually reduce polling frequency

# Once event listeners are stable
[orchestration.hybrid]
polling_interval_ms = 5000        # Increase from 1000ms to 5000ms

Step 4: Validate performance

  • Monitor latency metrics: tasker_step_discovery_duration_seconds
  • Verify no missed events: tasker_polling_messages_found_total should be near zero
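
As a sketch, both checks can be expressed as PromQL queries, assuming the metrics above are exported under these names and tasker_step_discovery_duration_seconds is a histogram:

# p95 step discovery latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(tasker_step_discovery_duration_seconds_bucket[5m])) by (le))

# Work found by the polling fallback; should trend toward zero once listeners are healthy
rate(tasker_polling_messages_found_total[15m])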

Rollback Plan

If event-driven mode fails:

# Immediate rollback to PollingOnly
[orchestration]
deployment_mode = "PollingOnly"

[orchestration.polling]
task_request_poll_interval_ms = 500    # Aggressive polling

Gradual rollback:

  1. Increase polling frequency in Hybrid mode
  2. Monitor for stability
  3. Disable event listeners once polling is stable
  4. Switch to PollingOnly mode

Troubleshooting

Event Listener Issues

Problem: Event listeners not receiving notifications

Diagnosis:

-- Check PostgreSQL LISTEN/NOTIFY is working
NOTIFY pgmq_message_ready, 'test';

# Check listener status
curl http://localhost:8080/health/detailed | jq '.event_listeners'
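
A quick end-to-end check is to LISTEN in one psql session and NOTIFY from a second; if the first session never reports the notification, the problem is at the PostgreSQL or network layer rather than in Tasker:

-- Session 1: subscribe to the channel the listeners use
LISTEN pgmq_message_ready;

-- Session 2: send a test notification
NOTIFY pgmq_message_ready, 'manual-test';

-- Session 1 should report (psql only prints notifications after its next statement, e.g. SELECT 1;):
--   Asynchronous notification "pgmq_message_ready" with payload "manual-test" received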

Solutions:

  • Verify PostgreSQL version supports LISTEN/NOTIFY (9.0+)
  • Check firewall rules allow persistent connections
  • Increase listener_reconnect_interval_ms if connections drop frequently
  • Switch to Hybrid or PollingOnly mode if issues persist

Polling Performance Issues

Problem: High CPU usage from polling

Diagnosis:

# Check polling frequency and batch sizes
curl http://localhost:8080/health/detailed | jq '.polling'

Solutions:

  • Increase polling intervals
  • Increase batch sizes to process more messages per poll
  • Switch to Hybrid or EventDrivenOnly mode for better performance
  • Scale horizontally to distribute polling load
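
For example, the first two solutions map onto the polling settings shown earlier:

# config/tasker/environments/production/orchestration.toml
[orchestration.polling]
# Poll less often...
task_request_poll_interval_ms = 5000      # up from 1000
step_result_poll_interval_ms = 2000       # up from 500

# ...and do more work per poll
batch_size = 50                           # up from 10
max_messages_per_poll = 500               # up from 100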

Database Connection Exhaustion

Problem: “connection pool exhausted” errors

Diagnosis:

-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'tasker_production';

-- Check max connections
SHOW max_connections;

Solutions:

  • Increase max_connections in database.toml
  • Increase PostgreSQL max_connections setting
  • Reduce number of replicas
  • Implement connection pooling at infrastructure level (PgBouncer)
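
If pool sizes cannot shrink further, a transaction-pooling PgBouncer in front of PostgreSQL multiplexes many client connections onto a smaller server-side pool. A minimal sketch (values illustrative, not tuned for any particular deployment):

; pgbouncer.ini
[databases]
tasker_production = host=postgres port=5432 dbname=tasker_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; multiplex server connections per transaction
max_client_conn = 1000         ; connections accepted from Tasker services
default_pool_size = 50         ; actual connections opened to PostgreSQL

Note that LISTEN/NOTIFY does not survive transaction pooling, so connections used by event listeners should bypass PgBouncer (or use session pooling) while regular query traffic goes through it.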

Best Practices

Configuration Management

  1. Use environment-specific overrides instead of modifying base configuration
  2. Validate configuration with config-validator before deployment
  3. Version control all configuration including environment overrides
  4. Use secrets management for sensitive values (passwords, keys)

Deployment Strategy

  1. Use mixed-mode architecture in production (EventDrivenOnly + PollingOnly)
    • Deploy 80-90% of orchestration pods in EventDrivenOnly mode for throughput
    • Deploy 10-20% of orchestration pods in PollingOnly mode as safety net
    • Single service load balances across all pods
  2. Alternative: Deploy all pods in Hybrid mode for simpler configuration
    • Trade-off: Less tuning flexibility, slightly higher resource usage
  3. Scale each mode independently based on workload characteristics
  4. Monitor deployment mode metrics to adjust ratios over time
  5. Test mixed-mode deployments in staging before production

Deployment Operations

  1. Always test configuration changes in staging first
  2. Use rolling updates with health checks to prevent downtime
  3. Monitor deployment mode health during and after deployments
  4. Keep polling capacity available even when event-driven is primary

Scaling Guidelines

  1. Mixed-mode orchestration: Scale EventDrivenOnly and PollingOnly deployments independently
    • Scale event-driven pods based on throughput requirements
    • Scale polling pods based on reliability requirements
  2. Single-mode orchestration: Scale based on API request rate and task initialization throughput
  3. Workers: Scale based on namespace-specific queue depth
  4. Database connections: Monitor and adjust pool sizes as replicas scale
  5. Use HPA for automatic scaling based on CPU/memory and custom metrics

Observability

  1. Enable comprehensive metrics in production
  2. Set up alerts for circuit breaker states, connection pool exhaustion
  3. Monitor deployment mode distribution in mixed-mode deployments
  4. Track event listener lag in EventDrivenOnly and Hybrid modes
  5. Monitor polling overhead to optimize resource usage
  6. Track step execution latency per namespace and handler

Summary

Tasker Core’s flexible deployment modes enable sophisticated production architectures:

Deployment Modes

  • Hybrid Mode: Event-driven with polling fallback in a single container
  • EventDrivenOnly Mode: Maximum throughput with ~10ms latency
  • PollingOnly Mode: Reliable safety net with traditional polling

Mixed-Mode Architecture (recommended for production at scale):

  • Deploy majority of orchestration pods in EventDrivenOnly mode for high throughput
  • Deploy minority of orchestration pods in PollingOnly mode as reliability safety net
  • Both deployments coordinate through atomic SQL operations with no conflicts
  • Scale each mode independently based on workload characteristics

Alternative: Deploy all pods in Hybrid mode for simpler configuration with automatic fallback.

The key insight: deployment modes exist not just for configuration tuning, but to enable mixing coordination strategies across containers to meet production requirements for both throughput and reliability.


← Back to Documentation Hub

Next: Observability | Benchmarks | Quick Start

Domain Events Architecture

Last Updated: 2025-12-01 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Events and Commands | Observability | States and Lifecycles

← Back to Documentation Hub


This document provides comprehensive documentation of the domain event system in tasker-core, covering event delivery modes, publisher patterns, subscriber implementations, and integration with the workflow orchestration system.

Overview

Domain Events vs System Events

The tasker-core system distinguishes between two types of events:

| Aspect | System Events | Domain Events |
|---|---|---|
| Purpose | Internal coordination | Business observability |
| Producers | Orchestration components | Step handlers during execution |
| Consumers | Event systems, command processors | External systems, analytics, audit |
| Delivery | PGMQ + LISTEN/NOTIFY | Configurable (Durable/Fast/Broadcast) |
| Semantics | At-least-once | Fire-and-forget (best effort) |

System events handle internal workflow coordination: task initialization, step enqueueing, result processing, and finalization. These are documented in Events and Commands.

Domain events enable business observability: payment processed, order fulfilled, inventory updated. Step handlers publish these events to enable external systems to react to business outcomes.

Key Design Principle: Non-Blocking Publication

Domain event publishing never fails the step. This is a fundamental design decision:

  • Event publish errors are logged with warn! level
  • Step execution continues regardless of publish outcome
  • Business logic success is independent of event delivery
  • A handler that successfully processes a payment should not fail if event publishing fails

#![allow(unused)]
fn main() {
// Event publishing is fire-and-forget
if let Err(e) = publisher.publish_event(event_name, payload, metadata).await {
    warn!(
        handler = self.handler_name(),
        event_name = event_name,
        error = %e,
        "Failed to publish domain event - step will continue"
    );
}
// Step continues regardless of publish result
}

Architecture

Data Flow

flowchart TB
    subgraph handlers["Step Handlers"]
        SH["Step Handler<br/>(Rust/Ruby)"]
    end

    SH -->|"publish_domain_event(name, payload)"| ER

    subgraph routing["Event Routing"]
        ER["EventRouter<br/>(Delivery Mode)"]
    end

    ER --> Durable
    ER --> Fast
    ER --> Broadcast

    subgraph modes["Delivery Modes"]
        Durable["Durable<br/>(PGMQ)"]
        Fast["Fast<br/>(In-Process)"]
        Broadcast["Broadcast<br/>(Both Paths)"]
    end

    Durable --> NEQ["Namespace<br/>Event Queue"]
    Fast --> IPB["InProcessEventBus"]
    Broadcast --> NEQ
    Broadcast --> IPB

    subgraph external["External Integration"]
        NEQ --> EC["External Consumer<br/>(Polling)"]
    end

    subgraph internal["Internal Subscribers"]
        IPB --> RS["Rust<br/>Subscribers"]
        IPB --> RFF["Ruby FFI<br/>Channel"]
    end

    style handlers fill:#e1f5fe
    style routing fill:#fff3e0
    style modes fill:#f3e5f5
    style external fill:#ffebee
    style internal fill:#e8f5e9

Component Summary

| Component | Purpose | Location |
|---|---|---|
| EventRouter | Routes events based on delivery mode | tasker-shared/src/events/domain_events/router.rs |
| DomainEventPublisher | Durable PGMQ-based publishing | tasker-shared/src/events/domain_events/publisher.rs |
| InProcessEventBus | Fast in-memory event dispatch | tasker-shared/src/events/domain_events/in_process_bus.rs |
| EventRegistry | Pattern-based subscriber registration | tasker-shared/src/events/domain_events/registry.rs |
| StepEventPublisher | Handler callback trait | tasker-shared/src/events/domain_events/step_event_publisher.rs |
| GenericStepEventPublisher | Default publisher implementation | tasker-shared/src/events/domain_events/generic_publisher.rs |

Delivery Modes

Overview

The domain event system supports three delivery modes, configured per event in YAML templates:

| Mode | Durability | Latency | Use Case |
|---|---|---|---|
| Durable | High (PGMQ) | Higher (~5-10ms) | External system integration, audit trails |
| Fast | Low (memory) | Lowest (<1ms) | Internal subscribers, metrics, real-time processing |
| Broadcast | High + Low | Both paths | Events needing both internal and external delivery |

Durable Mode (PGMQ) - External Integration Boundary

Durable events define the integration boundary between Tasker and external systems. Events are published to namespace-specific PGMQ queues where external consumers can poll and process them.

Key Design Decision: Tasker does NOT consume durable events internally. PGMQ serves as a lightweight, PostgreSQL-native alternative to external message brokers (Kafka, AWS SNS/SQS, RabbitMQ). External systems or middleware proxies can:

  • Poll PGMQ queues directly
  • Forward events to Kafka, SNS/SQS, or other messaging systems
  • Implement custom event processing pipelines

payment.processed → payments_domain_events (PGMQ queue) → External Systems
order.fulfilled   → fulfillment_domain_events (PGMQ queue) → External Systems

Characteristics:

  • Persisted in PostgreSQL (survives restarts)
  • For external consumer integration only
  • No internal Tasker polling or subscription
  • Consumer acknowledgment and retry handled by external consumers
  • Ordered within namespace

Implementation:

#![allow(unused)]
fn main() {
// DomainEventPublisher routes durable events to PGMQ
pub async fn publish_event(
    &self,
    event_name: &str,
    payload: Value,
    metadata: EventMetadata,
) -> TaskerResult<()> {
    let queue_name = format!("{}_domain_events", metadata.namespace);
    let message = DomainEventMessage {
        event_name: event_name.to_string(),
        payload,
        metadata,
    };

    self.message_client
        .send_message(&queue_name, &message)
        .await
}
}

Fast Mode (In-Process) - Internal Subscriber Pattern

Fast events are the only delivery mode with internal subscriber support. Events are dispatched immediately to in-memory subscribers within the Tasker worker process.

#![allow(unused)]
fn main() {
// InProcessEventBus provides dual-path delivery
pub struct InProcessEventBus {
    event_sender: tokio::sync::broadcast::Sender<DomainEvent>,
    ffi_event_sender: Option<mpsc::Sender<DomainEvent>>,
}
}

Characteristics:

  • Zero persistence overhead
  • Sub-millisecond latency
  • Lost on service restart
  • Internal to Tasker process only
  • Dual-path: Rust subscribers + Ruby FFI channel
  • Non-blocking broadcast semantics

Dual-Path Architecture:

InProcessEventBus
       │
       ├──► tokio::broadcast::Sender ──► Rust Subscribers (EventRegistry)
       │
       └──► mpsc::Sender ──► Ruby FFI Channel ──► Ruby Event Handlers

Use Cases:

  • Real-time metrics collection
  • Internal logging and telemetry
  • Secondary actions that sit outside the business-critical Task → WorkflowStep DAG
  • Examples: DataDog, Sentry, NewRelic, PagerDuty, Salesforce, Slack, Zapier

Broadcast Mode - Internal + External Delivery

Broadcast mode delivers events to both paths simultaneously: the fast in-process bus for internal subscribers AND the durable PGMQ queue for external systems. This ensures internal subscribers receive the same event shape as external consumers.

#![allow(unused)]
fn main() {
// EventRouter handles broadcast semantics
async fn route_event(&self, event: DomainEvent, mode: EventDeliveryMode) {
    match mode {
        EventDeliveryMode::Durable => {
            self.durable_publisher.publish(event).await;
        }
        EventDeliveryMode::Fast => {
            self.in_process_bus.publish(event).await;
        }
        EventDeliveryMode::Broadcast => {
            // Send to both paths concurrently
            let (durable, fast) = tokio::join!(
                self.durable_publisher.publish(event.clone()),
                self.in_process_bus.publish(event)
            );
            // Log errors but don't fail
        }
    }
}
}

When to Use Broadcast:

  • Internal subscribers need the same event that external systems receive
  • Real-time internal metrics tracking for events also exported externally
  • Audit logging both internally and to external compliance systems

Important: Data published via broadcast goes to BOTH the internal process AND the public PGMQ boundary. Do not use broadcast for sensitive internal-only data (use fast for those).

Publisher Patterns

Default Publisher (GenericStepEventPublisher)

The default publisher automatically handles event publication for step handlers:

#![allow(unused)]
fn main() {
pub struct GenericStepEventPublisher {
    router: Arc<EventRouter>,
    default_delivery_mode: EventDeliveryMode,
}

impl GenericStepEventPublisher {
    /// Publish event with metadata extracted from step context
    pub async fn publish(
        &self,
        step_data: &TaskSequenceStep,
        event_name: &str,
        payload: Value,
    ) -> TaskerResult<()> {
        let metadata = EventMetadata {
            task_uuid: step_data.task.task.task_uuid,
            step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
            step_name: Some(step_data.workflow_step.name.clone()),
            namespace: step_data.task.namespace_name.clone(),
            correlation_id: step_data.task.task.correlation_id,
            fired_at: Utc::now(),
            fired_by: "generic_publisher".to_string(),
        };

        self.router.route_event(event_name, payload, metadata).await
    }
}
}

Custom Publishers

Custom publishers extend TaskerCore::DomainEvents::BasePublisher (Ruby) to provide specialized event handling with payload transformation, conditional publishing, and lifecycle hooks.

Real Example: PaymentEventPublisher (workers/ruby/spec/handlers/examples/domain_events/publishers/payment_event_publisher.rb):

# Custom publisher for payment-related domain events
# Demonstrates durable delivery mode with custom payload enrichment
module DomainEvents
  module Publishers
    class PaymentEventPublisher < TaskerCore::DomainEvents::BasePublisher
      # Must match the `publisher:` field in YAML
      def name
        'DomainEvents::Publishers::PaymentEventPublisher'
      end

      # Transform step result into payment event payload
      def transform_payload(step_result, event_declaration, step_context = nil)
        result = step_result[:result] || {}
        event_name = event_declaration[:name]

        if step_result[:success] && event_name&.include?('processed')
          build_success_payload(result, step_result, step_context)
        elsif !step_result[:success] && event_name&.include?('failed')
          build_failure_payload(result, step_result, step_context)
        else
          result
        end
      end

      # Determine if this event should be published
      def should_publish?(step_result, event_declaration, step_context = nil)
        result = step_result[:result] || {}
        event_name = event_declaration[:name]

        # For success events, verify we have transaction data
        if event_name&.include?('processed')
          return step_result[:success] && result[:transaction_id].present?
        end

        # For failure events, verify we have error info
        if event_name&.include?('failed')
          metadata = step_result[:metadata] || {}
          return !step_result[:success] && metadata[:error_code].present?
        end

        true  # Default: always publish
      end

      # Add execution metrics to event metadata
      def additional_metadata(step_result, event_declaration, step_context = nil)
        metadata = step_result[:metadata] || {}
        {
          execution_time_ms: metadata[:execution_time_ms],
          publisher_type: 'custom',
          publisher_name: name,
          payment_provider: metadata[:payment_provider]
        }
      end

      private

      def build_success_payload(result, step_result, step_context)
        {
          transaction_id: result[:transaction_id],
          amount: result[:amount],
          currency: result[:currency] || 'USD',
          payment_method: result[:payment_method] || 'credit_card',
          processed_at: result[:processed_at] || Time.now.iso8601,
          delivery_mode: 'durable',
          publisher: name
        }
      end
    end
  end
end

YAML Configuration for Custom Publisher:

steps:
  - name: process_payment
    publishes_events:
      - name: payment.processed
        condition: success
        delivery_mode: durable
        publisher: DomainEvents::Publishers::PaymentEventPublisher
      - name: payment.failed
        condition: failure
        delivery_mode: durable
        publisher: DomainEvents::Publishers::PaymentEventPublisher

YAML Event Declaration

Events are declared in task template YAML files using the publishes_events field at the step level:

# config/tasks/payments/credit_card_payment/1.0.0.yaml
name: credit_card_payment
namespace_name: payments
version: 1.0.0
description: Process credit card payments with validation and fraud detection

# Task-level domain events (optional)
domain_events: []

steps:
  - name: process_payment
    description: Process the payment transaction
    handler:
      callable: PaymentProcessing::StepHandler::ProcessPaymentHandler
      initialization:
        gateway_url: "${PAYMENT_GATEWAY_URL}"
    dependencies:
      - validate_payment
    retry:
      retryable: true
      limit: 3
      backoff: exponential
    timeout_seconds: 120

    # Step-level event declarations
    publishes_events:
      - name: payment.processed
        description: "Payment successfully processed"
        condition: success  # success, failure, retryable_failure, permanent_failure, always
        schema:
          type: object
          required: [transaction_id, amount]
          properties:
            transaction_id: { type: string }
            amount: { type: number }
        delivery_mode: broadcast  # durable, fast, or broadcast
        publisher: PaymentEventPublisher  # optional custom publisher

      - name: payment.failed
        description: "Payment processing failed permanently"
        condition: permanent_failure
        schema:
          type: object
          required: [error_code, reason]
          properties:
            error_code: { type: string }
            reason: { type: string }
        delivery_mode: durable

Publication Conditions:

  • success: Publish only when step completes successfully
  • failure: Publish on any step failure (backward compatible)
  • retryable_failure: Publish only on retryable failures (step can be retried)
  • permanent_failure: Publish only on permanent failures (exhausted retries or non-retryable)
  • always: Publish regardless of step outcome

Event Declaration Fields:

  • name: Event name in dotted notation (e.g., payment.processed)
  • description: Human-readable description of when this event is published
  • condition: When to publish (defaults to success)
  • schema: JSON Schema for validating event payloads
  • delivery_mode: Delivery mode (defaults to durable)
  • publisher: Optional custom publisher class name

Subscriber Patterns

Subscriber patterns apply only to fast (in-process) events. Durable events are consumed by external systems, not by internal Tasker subscribers.

Rust Subscribers (InProcessEventBus)

Rust subscribers are registered with the InProcessEventBus using the EventHandler type. Subscribers are async closures that receive DomainEvent instances.

Real Example: Logging Subscriber (workers/rust/src/event_subscribers/logging_subscriber.rs):

#![allow(unused)]
fn main() {
use std::sync::Arc;
use tasker_shared::events::registry::EventHandler;
use tracing::info;

/// Create a logging subscriber that logs all events matching a pattern
pub fn create_logging_subscriber(prefix: &str) -> EventHandler {
    let prefix = prefix.to_string();

    Arc::new(move |event| {
        let prefix = prefix.clone();

        Box::pin(async move {
            let step_name = event.metadata.step_name.as_deref().unwrap_or("unknown");

            info!(
                prefix = %prefix,
                event_name = %event.event_name,
                event_id = %event.event_id,
                task_uuid = %event.metadata.task_uuid,
                step_name = %step_name,
                namespace = %event.metadata.namespace,
                correlation_id = %event.metadata.correlation_id,
                fired_at = %event.metadata.fired_at,
                "Domain event received"
            );

            Ok(())
        })
    })
}
}

Real Example: Metrics Collector (workers/rust/src/event_subscribers/metrics_subscriber.rs):

#![allow(unused)]
fn main() {
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

/// Collects metrics from domain events (thread-safe)
pub struct EventMetricsCollector {
    events_received: AtomicU64,
    success_events: AtomicU64,
    failure_events: AtomicU64,
    // ... additional fields
}

impl EventMetricsCollector {
    pub fn new() -> Arc<Self> {
        Arc::new(Self {
            events_received: AtomicU64::new(0),
            success_events: AtomicU64::new(0),
            failure_events: AtomicU64::new(0),
        })
    }

    /// Create an event handler for this collector
    pub fn create_handler(self: &Arc<Self>) -> EventHandler {
        let metrics = Arc::clone(self);

        Arc::new(move |event| {
            let metrics = Arc::clone(&metrics);

            Box::pin(async move {
                metrics.events_received.fetch_add(1, Ordering::Relaxed);

                if event.payload.execution_result.success {
                    metrics.success_events.fetch_add(1, Ordering::Relaxed);
                } else {
                    metrics.failure_events.fetch_add(1, Ordering::Relaxed);
                }

                Ok(())
            })
        })
    }

    pub fn events_received(&self) -> u64 {
        self.events_received.load(Ordering::Relaxed)
    }
}
}

Registration with InProcessEventBus:

#![allow(unused)]
fn main() {
use tasker_worker::worker::in_process_event_bus::InProcessEventBus;

let mut bus = InProcessEventBus::new(config);

// Subscribe to all events
bus.subscribe("*", create_logging_subscriber("[ALL]")).unwrap();

// Subscribe to specific patterns
bus.subscribe("payment.*", create_logging_subscriber("[PAYMENT]")).unwrap();

// Use metrics collector
let metrics = EventMetricsCollector::new();
bus.subscribe("*", metrics.create_handler()).unwrap();
}

Ruby Subscribers (BaseSubscriber)

Ruby subscribers extend TaskerCore::DomainEvents::BaseSubscriber and use the class-level subscribes_to pattern declaration.

Real Example: LoggingSubscriber (workers/ruby/spec/handlers/examples/domain_events/subscribers/logging_subscriber.rb):

# Example logging subscriber for fast/in-process domain events
module DomainEvents
  module Subscribers
    class LoggingSubscriber < TaskerCore::DomainEvents::BaseSubscriber
      # Subscribe to all events using pattern matching
      subscribes_to '*'

      # Handle any domain event by logging its details
      def handle(event)
        event_name = event[:event_name]
        metadata = event[:metadata] || {}

        logger.info "[LoggingSubscriber] Event: #{event_name}"
        logger.info "  Task: #{metadata[:task_uuid]}"
        logger.info "  Step: #{metadata[:step_name]}"
        logger.info "  Namespace: #{metadata[:namespace]}"
        logger.info "  Correlation: #{metadata[:correlation_id]}"
      end
    end
  end
end

Real Example: MetricsSubscriber (workers/ruby/spec/handlers/examples/domain_events/subscribers/metrics_subscriber.rb):

# Example metrics subscriber for fast/in-process domain events
module DomainEvents
  module Subscribers
    class MetricsSubscriber < TaskerCore::DomainEvents::BaseSubscriber
      subscribes_to '*'

      class << self
        attr_accessor :events_received, :success_events, :failure_events,
                      :events_by_namespace, :last_event_at

        def reset_counters!
          @mutex = Mutex.new
          @events_received = 0
          @success_events = 0
          @failure_events = 0
          @events_by_namespace = Hash.new(0)
          @last_event_at = nil
        end
      end

      reset_counters!

      def handle(event)
        event_name = event[:event_name]
        metadata = event[:metadata] || {}
        execution_result = event[:execution_result] || {}

        self.class.increment(:events_received)

        if execution_result[:success]
          self.class.increment(:success_events)
        else
          self.class.increment(:failure_events)
        end

        namespace = metadata[:namespace] || 'unknown'
        self.class.increment_hash(:events_by_namespace, namespace)
        self.class.set(:last_event_at, Time.now)
      end
    end
  end
end

Registration in Bootstrap:

# Register subscribers with the registry
registry = TaskerCore::DomainEvents::SubscriberRegistry.instance
registry.register(DomainEvents::Subscribers::LoggingSubscriber)
registry.register(DomainEvents::Subscribers::MetricsSubscriber)
registry.start_all!

# Later, query metrics
puts "Total events: #{DomainEvents::Subscribers::MetricsSubscriber.events_received}"
puts "By namespace: #{DomainEvents::Subscribers::MetricsSubscriber.events_by_namespace}"

External PGMQ Consumers (Durable Events)

Durable events are published to PGMQ queues for external consumption. Tasker does not provide internal consumers for these queues. External systems can consume events using:

  1. Direct PGMQ Polling: Query pgmq.q_{namespace}_domain_events tables directly
  2. PGMQ Client Libraries: Use pgmq client libraries in Python, Node.js, Go, etc.
  3. Middleware Proxies: Build adapters that forward events to Kafka, SNS/SQS, etc.

Example: External Python Consumer:

import pgmq

# Connect to PGMQ
queue = pgmq.Queue("payments_domain_events", dsn="postgresql://...")

# Poll for events
while True:
    messages = queue.read(batch_size=50, vt=30)
    for msg in messages:
        process_event(msg.message)
        queue.delete(msg.msg_id)

Configuration

Domain event system configuration is part of the worker configuration in worker.toml files.

TOML Configuration

# config/tasker/base/worker.toml

# In-process event bus configuration for fast domain event delivery
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 2000        # Channel capacity for broadcast events
log_subscriber_errors = true        # Log errors from event subscribers
dispatch_timeout_ms = 5000          # Timeout for event dispatch

# Domain Event System MPSC Configuration
[worker.mpsc_channels.domain_events]
command_buffer_size = 1000          # Channel capacity for domain event commands
shutdown_drain_timeout_ms = 5000    # Time to drain events on shutdown
log_dropped_events = true           # Log when events are dropped due to backpressure

# In-process event settings (part of worker event systems)
[worker.event_systems.worker.metadata.in_process_events]
ffi_integration_enabled = true      # Enable Ruby/Python FFI event channel
deduplication_cache_size = 10000    # Event deduplication cache size

Environment Overrides

Test Environment (config/tasker/environments/test/worker.toml):

# In-process event bus - smaller buffers for testing
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 1000
log_subscriber_errors = true
dispatch_timeout_ms = 2000

# Domain Event System - smaller buffers for testing
[worker.mpsc_channels.domain_events]
command_buffer_size = 100
shutdown_drain_timeout_ms = 1000
log_dropped_events = true

[worker.event_systems.worker.metadata.in_process_events]
deduplication_cache_size = 1000

Production Environment (config/tasker/environments/production/worker.toml):

# In-process event bus - large buffers for production throughput
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 5000
log_subscriber_errors = false       # Reduce log noise in production
dispatch_timeout_ms = 10000

# Domain Event System - large buffers for production throughput
[worker.mpsc_channels.domain_events]
command_buffer_size = 5000
shutdown_drain_timeout_ms = 10000
log_dropped_events = false          # Reduce log noise in production

Configuration Parameters

| Parameter | Description | Default |
|---|---|---|
| broadcast_buffer_size | Capacity of the broadcast channel for fast events | 2000 |
| log_subscriber_errors | Whether to log subscriber errors | true |
| dispatch_timeout_ms | Timeout for event dispatch to subscribers | 5000 |
| command_buffer_size | Capacity of domain event command channel | 1000 |
| shutdown_drain_timeout_ms | Time to drain pending events on shutdown | 5000 |
| log_dropped_events | Whether to log events dropped due to backpressure | true |
| ffi_integration_enabled | Enable FFI event channel for Ruby/Python | true |
| deduplication_cache_size | Size of event deduplication cache | 10000 |

Integration with Step Execution

Event-Driven Domain Event Publishing

The worker uses an event-driven command pattern for step execution and domain event publishing. Nothing blocks - domain events are dispatched after successful orchestration notification using fire-and-forget semantics.

Flow (tasker-worker/src/worker/command_processor.rs):

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  FFI Handler    │────►│ Completion       │────►│ WorkerProcessor     │
│  (Ruby/Rust)    │     │ Channel          │     │ Command Loop        │
└─────────────────┘     └──────────────────┘     └──────────┬──────────┘
                                                            │
                        ┌───────────────────────────────────┴───────────────┐
                        │                                                   │
                        ▼                                                   ▼
              ┌─────────────────────┐                          ┌────────────────────┐
              │ 1. Send result to   │                          │ 2. Dispatch domain │
              │    orchestration    │──── on success ─────────►│    events          │
              │    (PGMQ)           │                          │    (fire-and-forget)│
              └─────────────────────┘                          └────────────────────┘

Implementation:

#![allow(unused)]
fn main() {
// tasker-worker/src/worker/command_processor.rs (lines 512-525)
// Worker command processor receives step completions via FFI channel
match self.handle_send_step_result(step_result.clone()).await {
    Ok(()) => {
        // Dispatch domain events AFTER successful orchestration notification.
        // Domain events are declarative of what HAS happened - the step is only
        // truly complete once orchestration has been notified successfully.
        // Fire-and-forget semantics (try_send) - never blocks the worker.
        self.dispatch_domain_events(&step_result, None);
        info!(
            worker_id = %self.worker_id,
            step_uuid = %step_result.step_uuid,
            "Step completion forwarded to orchestration successfully"
        );
    }
    Err(e) => {
        // Don't dispatch domain events - orchestration wasn't notified,
        // so the step isn't truly complete from the system's perspective
        error!(
            worker_id = %self.worker_id,
            step_uuid = %step_result.step_uuid,
            error = %e,
            "Failed to forward step completion to orchestration"
        );
    }
}
}

Domain Event Dispatch (fire-and-forget):

#![allow(unused)]
fn main() {
// tasker-worker/src/worker/command_processor.rs (lines 362-432)
fn dispatch_domain_events(&mut self, step_result: &StepExecutionResult, correlation_id: Option<Uuid>) {
    // Retrieve cached step context (stored when step was claimed)
    let task_sequence_step = match self.step_execution_contexts.remove(&step_result.step_uuid) {
        Some(ctx) => ctx,
        None => return, // No context = can't build events
    };

    // Build events from step definition's publishes_events declarations
    for event_def in &task_sequence_step.step_definition.publishes_events {
        // Check publication condition before building event
        if !event_def.should_publish(step_result.success) {
            continue; // Skip events whose condition doesn't match
        }

        let event = DomainEventToPublish {
            event_name: event_def.name.clone(),
            delivery_mode: event_def.delivery_mode,
            business_payload: step_result.result.clone(),
            metadata: EventMetadata { /* ... */ },
            task_sequence_step: task_sequence_step.clone(),
            execution_result: step_result.clone(),
        };
        domain_events.push(event);
    }

    // Fire-and-forget dispatch - try_send never blocks
    let dispatched = handle.dispatch_events(domain_events, publisher_name, correlation);

    if !dispatched {
        warn!(
            step_uuid = %step_result.step_uuid,
            "Domain event dispatch failed - channel full (events dropped)"
        );
    }
}
}

Key Design Decisions:

  • Events only after orchestration success: Domain events are declarative of what HAS happened. If orchestration notification fails, the step isn’t truly complete from the system’s perspective.
  • Fire-and-forget via try_send: Never blocks the worker command loop. If the channel is full, events are dropped and logged.
  • Context caching: Step execution context is cached when the step is claimed, then retrieved for event building after completion.

Correlation ID Propagation

Domain events maintain correlation IDs for end-to-end distributed tracing. The correlation ID originates from the task and flows through all step executions and domain events.

EventMetadata Structure (tasker-shared/src/events/domain_events.rs):

#![allow(unused)]
fn main() {
pub struct EventMetadata {
    pub task_uuid: Uuid,
    pub step_uuid: Option<Uuid>,
    pub step_name: Option<String>,
    pub namespace: String,
    pub correlation_id: Uuid,      // From task for end-to-end tracing
    pub fired_at: DateTime<Utc>,
    pub fired_by: String,          // Publisher identifier (worker_id)
}
}

Getting Correlation ID via API:

Use the orchestration API to get the correlation ID for a task:

# Get task details including correlation_id
curl http://localhost:8080/v1/tasks/{task_uuid} | jq '.correlation_id'

# Response includes correlation_id
{
  "task_uuid": "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5",
  "correlation_id": "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5",
  "status": "complete",
  ...
}

Tracing Events in PGMQ:

# Find all durable events for a correlation ID
psql $DATABASE_URL -c "
  SELECT
    message->>'event_name' as event,
    message->'metadata'->>'step_name' as step,
    message->'metadata'->>'fired_at' as fired_at
  FROM pgmq.q_payments_domain_events
  WHERE message->'metadata'->>'correlation_id' = '0199c3e0-ccdb-7581-87ab-3f67daeaa4a5'
  ORDER BY message->'metadata'->>'fired_at';
"

Metrics and Observability

OpenTelemetry Metrics

Domain event publication emits OpenTelemetry counter metrics (tasker-shared/src/events/domain_events.rs:207-219):

#![allow(unused)]
fn main() {
// Metric emitted on every domain event publication
let counter = opentelemetry::global::meter("tasker")
    .u64_counter("tasker.domain_events.published.total")
    .with_description("Total number of domain events published")
    .build();

counter.add(1, &[
    opentelemetry::KeyValue::new("event_name", event_name.to_string()),
    opentelemetry::KeyValue::new("namespace", metadata.namespace.clone()),
]);
}

Prometheus Metrics Endpoint

The orchestration service exposes Prometheus-format metrics:

# Get Prometheus metrics from orchestration service
curl http://localhost:8080/metrics

# Get Prometheus metrics from worker service
curl http://localhost:8081/metrics

OpenTelemetry Tracing

Domain event publication is instrumented with tracing spans (tasker-shared/src/events/domain_events.rs:157-161):

#![allow(unused)]
fn main() {
#[instrument(skip(self, payload, metadata), fields(
    event_name = %event_name,
    namespace = %metadata.namespace,
    correlation_id = %metadata.correlation_id
))]
pub async fn publish_event(
    &self,
    event_name: &str,
    payload: DomainEventPayload,
    metadata: EventMetadata,
) -> Result<Uuid, DomainEventError> {
    // ... implementation with debug! and info! logs including correlation_id
}
}

Grafana Query Examples

Loki Query - Domain Events by Correlation ID:

{service_name="tasker-worker"} |= "Domain event published" | json | correlation_id = "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"

Loki Query - All Domain Event Publications:

{service_name=~"tasker.*"} |= "Domain event" | json | line_format "{{.event_name}} - {{.namespace}} - {{.correlation_id}}"

Tempo Query - Trace by Correlation ID:

{resource.service.name="tasker-worker"} && {span.correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Prometheus Query - Event Publication Rate by Namespace:

sum by (namespace) (rate(tasker_domain_events_published_total[5m]))

Prometheus Query - Event Publication Rate by Event Name:

topk(10, sum by (event_name) (rate(tasker_domain_events_published_total[5m])))

Structured Log Fields

Domain event logs include structured fields for querying:

| Field | Description | Example |
|---|---|---|
| event_id | Unique event UUID (v7, time-ordered) | 0199c3e0-d123-... |
| event_name | Event name in dot notation | payment.processed |
| queue_name | Target PGMQ queue | payments_domain_events |
| task_uuid | Parent task UUID | 0199c3e0-ccdb-... |
| correlation_id | End-to-end trace correlation | 0199c3e0-ccdb-... |
| namespace | Event namespace | payments |
| message_id | PGMQ message ID | 12345 |

Best Practices

1. Choose the Right Delivery Mode

| Scenario | Recommended Mode | Rationale |
|---|---|---|
| External system integration | Durable | Reliable delivery to external consumers |
| Internal metrics/telemetry | Fast | Internal subscribers only, low latency |
| Internal + external needs | Broadcast | Same event shape to both internal and external |
| Audit trails for compliance | Durable | Persisted for external audit systems |
| Real-time internal dashboards | Fast | In-process subscribers handle immediately |

Key Decision Criteria:

  • Need internal Tasker subscribers? → Use fast or broadcast
  • Need external system integration? → Use durable or broadcast
  • Internal-only, sensitive data? → Use fast (never reaches PGMQ boundary)
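
To make the routing concrete, here is a minimal sketch of mode-based routing. The DeliveryMode variants mirror the table above, but the enum and the two sink functions are illustrative stand-ins, not the actual tasker-core publisher API.

#![allow(unused)]
fn main() {
// Sketch only: Fast stays in-process and never crosses the PGMQ boundary,
// Durable persists to PGMQ for external consumers, Broadcast does both.
enum DeliveryMode { Fast, Durable, Broadcast }

fn route_event(mode: DeliveryMode, event: serde_json::Value) {
    match mode {
        DeliveryMode::Fast => publish_in_process(&event),
        DeliveryMode::Durable => send_to_pgmq(&event),
        DeliveryMode::Broadcast => {
            publish_in_process(&event);
            send_to_pgmq(&event);
        }
    }
}

// Hypothetical sinks standing in for the real publisher internals.
fn publish_in_process(_event: &serde_json::Value) { /* in-process subscribers */ }
fn send_to_pgmq(_event: &serde_json::Value) { /* durable domain events queue */ }
}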

2. Design Event Payloads

Do:

#![allow(unused)]
fn main() {
json!({
    "transaction_id": "TXN-123",
    "amount": 99.99,
    "currency": "USD",
    "timestamp": "2025-12-01T10:00:00Z",
    "idempotency_key": step_uuid
})
}

Don’t:

#![allow(unused)]
fn main() {
json!({
    "data": "payment processed",  // No structure
    "info": full_database_record  // Too much data
})
}

3. Handle Subscriber Failures Gracefully

#![allow(unused)]
fn main() {
#[async_trait]
impl EventSubscriber for MySubscriber {
    async fn handle(&self, event: &DomainEvent) -> TaskerResult<()> {
        // Wrap in timeout
        match timeout(Duration::from_secs(5), self.process(event)).await {
            Ok(result) => result,
            Err(_) => {
                warn!(event = %event.name, "Subscriber timeout");
                Ok(()) // Don't fail the dispatch
            }
        }
    }
}
}

4. Use Correlation IDs for Debugging

#![allow(unused)]
fn main() {
// Always include correlation ID in logs
info!(
    correlation_id = %event.metadata.correlation_id,
    event_name = %event.name,
    namespace = %event.metadata.namespace,
    "Processing domain event"
);
}


This domain event architecture provides a flexible, reliable foundation for business observability in the tasker-core workflow orchestration system.

Events and Commands Architecture

Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Messaging Abstraction | States and Lifecycles | Deployment Patterns

← Back to Documentation Hub


This document provides comprehensive documentation of the event-driven and command pattern architecture in tasker-core, covering the unified event system foundation, orchestration and worker implementations, and the flow of tasks and steps through the system.

Overview

The tasker-core system implements a sophisticated hybrid architecture that combines:

  1. Event-Driven Systems: Real-time coordination using PostgreSQL LISTEN/NOTIFY and PGMQ notifications
  2. Command Pattern: Async command processors using tokio mpsc channels for orchestration and worker operations
  3. Hybrid Deployment Modes: PollingOnly, EventDrivenOnly, and Hybrid modes with fallback polling
  4. Queue-Based Communication: Provider-agnostic message queues (PGMQ or RabbitMQ) for reliable step execution and result processing

This architecture eliminates polling complexity while maintaining resilience through fallback mechanisms and provides horizontal scaling capabilities with atomic operation guarantees.

Event System Foundation

EventDrivenSystem Trait

The foundation of the event architecture is defined in tasker-shared/src/event_system/event_driven.rs with the EventDrivenSystem trait:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait EventDrivenSystem: Send + Sync {
    type SystemId: Send + Sync + Clone + fmt::Display + fmt::Debug;
    type Event: Send + Sync + Clone + fmt::Debug;
    type Config: Send + Sync + Clone;
    type Statistics: EventSystemStatistics + Send + Sync + Clone;

    // Core lifecycle methods
    async fn start(&mut self) -> Result<(), DeploymentModeError>;
    async fn stop(&mut self) -> Result<(), DeploymentModeError>;
    fn is_running(&self) -> bool;

    // Event processing
    async fn process_event(&self, event: Self::Event) -> Result<(), DeploymentModeError>;

    // Monitoring and health
    async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError>;
    fn statistics(&self) -> Self::Statistics;

    // Configuration
    fn deployment_mode(&self) -> DeploymentMode;
    fn config(&self) -> &Self::Config;
}
}

Deployment Modes

The system supports three deployment modes for different operational requirements:

PollingOnly Mode

  • Traditional polling-based coordination
  • No event listeners or real-time notifications
  • Reliable fallback for environments with networking restrictions
  • Higher latency but guaranteed operation

EventDrivenOnly Mode

  • Pure event-driven coordination using PostgreSQL LISTEN/NOTIFY
  • Real-time response to database changes
  • Lowest latency for step discovery and task coordination
  • Requires reliable PostgreSQL connections

Hybrid Mode

  • Primary event-driven coordination with polling fallback
  • Best of both worlds: real-time when possible, reliable when needed
  • Automatic fallback during connection issues
  • Production-ready with resilience guarantees

Selecting a Deployment Mode

The Tasker system is built for distributed deployment, with multiple instances of both orchestration core servers and worker servers operating simultaneously. Separating deployment mode lets operators scale event-driven-only processing nodes to meet demand while keeping polling-only nodes running with a reasonable fallback polling interval and batch size, or deploy everything in hybrid mode and tune these settings instance by instance.
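
As a rough illustration, a deployment might select the mode per instance at startup. This is a sketch under assumptions: the environment variable name and the local enum are illustrative, and the real DeploymentMode type and configuration loading live in tasker-shared (see Configuration Management below).

#![allow(unused)]
fn main() {
// Illustrative per-instance selection: scale-out nodes run event-driven only,
// a small pool of fallback nodes polls, and everything else runs hybrid.
#[derive(Debug, Clone, Copy)]
enum DeploymentMode { PollingOnly, EventDrivenOnly, Hybrid }

fn select_mode() -> DeploymentMode {
    match std::env::var("TASKER_DEPLOYMENT_MODE").as_deref() {
        Ok("EventDrivenOnly") => DeploymentMode::EventDrivenOnly,
        Ok("PollingOnly") => DeploymentMode::PollingOnly,
        _ => DeploymentMode::Hybrid, // resilient default: events with polling fallback
    }
}
}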

Event Types and Sources

Queue-Level Events (Provider-Agnostic)

The system supports multiple messaging backends through MessageNotification:

#![allow(unused)]
fn main() {
pub enum MessageNotification {
    /// Signal-only notification (PGMQ style)
    /// Indicates a message is available but requires separate fetch
    Available {
        queue_name: String,
        msg_id: Option<i64>,
    },

    /// Full message notification (RabbitMQ style)
    /// Contains the complete message payload
    Message(QueuedMessage<Vec<u8>>),
}
}

Event Sources by Provider:

| Provider | Notification Type | Fetch Required | Fallback Polling |
|---|---|---|---|
| PGMQ | Available | Yes (read by msg_id) | Required |
| RabbitMQ | Message | No (full payload) | Not needed |
| InMemory | Message | No | Not needed |

Common Event Types:

  • Step Results: Worker completion notifications
  • Task Requests: New task initialization requests
  • Message Ready Events: Queue message availability notifications
  • Transport: Provider-agnostic via MessagingProvider.subscribe_many()

Command Pattern Architecture

Command Processor Pattern

Both orchestration and worker systems implement the command pattern to replace complex polling-based coordinators:

Benefits:

  • No Polling Loops (except intentional fallback polling): pure tokio mpsc command processing
  • Simplified Architecture: ~100-line command processors replace 1000+ lines of complex coordination code
  • Race Condition Prevention: Atomic operations through proper delegation
  • Observability Preservation: Maintains metrics through delegated components

Command Flow Patterns

Both systems follow consistent command processing patterns:

sequenceDiagram
    participant Client
    participant CommandChannel
    participant Processor
    participant Delegate
    participant Response

    Client->>CommandChannel: Send Command + ResponseChannel
    CommandChannel->>Processor: Receive Command
    Processor->>Delegate: Delegate to Business Logic Component
    Delegate-->>Processor: Return Result
    Processor->>Response: Send Result via ResponseChannel
    Response-->>Client: Receive Result
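
In code, this pattern boils down to a tokio mpsc receive loop with a oneshot response channel per command. The sketch below uses simplified stand-ins for the real OrchestrationCommand/WorkerCommand enums and CommandResponder type.

#![allow(unused)]
fn main() {
use tokio::sync::{mpsc, oneshot};

// Simplified command carrying a oneshot responder (the "ResponseChannel"
// in the diagram above).
enum Command {
    DoWork { input: String, resp: oneshot::Sender<Result<String, String>> },
}

async fn command_processor(mut rx: mpsc::Receiver<Command>) {
    while let Some(cmd) = rx.recv().await {
        match cmd {
            Command::DoWork { input, resp } => {
                // Delegate to the business-logic component, then reply.
                let result = Ok(format!("processed: {input}"));
                let _ = resp.send(result); // receiver may have gone away
            }
        }
    }
}

// Client side: send a command and await its response.
async fn client(tx: mpsc::Sender<Command>) -> Result<String, String> {
    let (resp_tx, resp_rx) = oneshot::channel();
    tx.send(Command::DoWork { input: "step result".into(), resp: resp_tx })
        .await
        .map_err(|_| "processor stopped".to_string())?;
    resp_rx.await.map_err(|_| "no response".to_string())?
}
}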

Orchestration Event Systems

OrchestrationEventSystem

Implemented in tasker-orchestration/src/orchestration/event_systems/orchestration_event_system.rs:

#![allow(unused)]
fn main() {
pub struct OrchestrationEventSystem {
    system_id: String,
    deployment_mode: DeploymentMode,
    queue_listener: Option<OrchestrationQueueListener>,
    fallback_poller: Option<OrchestrationFallbackPoller>,
    context: Arc<SystemContext>,
    orchestration_core: Arc<OrchestrationCore>,
    command_sender: mpsc::Sender<OrchestrationCommand>,
    // ... statistics and state
}
}

Orchestration Command Types

The command processor handles both full-message and signal-only notification types:

#![allow(unused)]
fn main() {
pub enum OrchestrationCommand {
    // Task lifecycle
    InitializeTask { request: TaskRequestMessage, resp: CommandResponder<TaskInitializeResult> },
    ProcessStepResult { result: StepExecutionResult, resp: CommandResponder<StepProcessResult> },
    FinalizeTask { task_uuid: Uuid, resp: CommandResponder<TaskFinalizationResult> },

    // Full message processing (RabbitMQ style - MessageNotification::Message)
    // Used when provider delivers complete message payload
    ProcessStepResultFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<StepProcessResult> },
    InitializeTaskFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<TaskInitializeResult> },
    FinalizeTaskFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<TaskFinalizationResult> },

    // Signal-only processing (PGMQ style - MessageNotification::Available)
    // Used when provider sends notification that requires separate fetch
    ProcessStepResultFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<StepProcessResult> },
    InitializeTaskFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<TaskInitializeResult> },
    FinalizeTaskFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<TaskFinalizationResult> },

    // Task readiness (database events)
    ProcessTaskReadiness { task_uuid: Uuid, namespace: String, priority: i32, ready_steps: i32, triggered_by: String, resp: CommandResponder<TaskReadinessResult> },

    // System operations
    GetProcessingStats { resp: CommandResponder<OrchestrationProcessingStats> },
    HealthCheck { resp: CommandResponder<SystemHealth> },
    Shutdown { resp: CommandResponder<()> },
}
}

Command Routing by Notification Type:

  • MessageNotification::Message -> *FromMessage commands (immediate processing)
  • MessageNotification::Available -> *FromMessageEvent commands (requires fetch)
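
Sketched against the MessageNotification enum shown earlier, a listener's routing decision looks roughly like this; the two sender functions are hypothetical stand-ins for pushing the corresponding commands onto the command channel.

#![allow(unused)]
fn main() {
// Sketch: map provider notifications onto the two command families.
fn route(notification: MessageNotification) {
    match notification {
        MessageNotification::Message(message) => {
            // Full payload already delivered (RabbitMQ / InMemory):
            // dispatch a *FromMessage command for immediate processing.
            send_process_from_message(message);
        }
        MessageNotification::Available { queue_name, msg_id } => {
            // Signal only (PGMQ): dispatch a *FromMessageEvent command;
            // the processor reads the payload by msg_id before handling it.
            send_process_from_message_event(queue_name, msg_id);
        }
    }
}

// Hypothetical senders standing in for command_tx.send(OrchestrationCommand::...)
fn send_process_from_message(_message: QueuedMessage<Vec<u8>>) {}
fn send_process_from_message_event(_queue_name: String, _msg_id: Option<i64>) {}
}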

Orchestration Queue Architecture

The orchestration system coordinates multiple queue types:

  1. orchestration_step_results: Step completion results from workers
  2. orchestration_task_requests: New task initialization requests
  3. orchestration_task_finalization: Task finalization notifications
  4. Namespace Queues: Per-namespace step queues (e.g., fulfillment_queue, inventory_queue)

TaskReadinessEventSystem

Handles database-level events for task readiness using PostgreSQL LISTEN/NOTIFY:

#![allow(unused)]
fn main() {
pub struct TaskReadinessEventSystem {
    system_id: String,
    deployment_mode: DeploymentMode,
    listener: Option<TaskReadinessListener>,
    fallback_poller: Option<TaskReadinessFallbackPoller>,
    context: Arc<SystemContext>,
    command_sender: mpsc::Sender<OrchestrationCommand>,
    // ... configuration and statistics
}
}

PGMQ Notification Channels:

  • pgmq_message_ready.orchestration: Orchestration queue messages ready (task requests, step results, finalizations)
  • pgmq_message_ready.{namespace}: Worker namespace queue messages ready (e.g., payments, fulfillment, linear_workflow)
  • pgmq_message_ready: Global channel for all queue messages (fallback)
  • pgmq_queue_created: Queue creation notifications
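
For reference, subscribing to these channels with sqlx looks roughly like the sketch below. The actual tasker listeners add reconnection handling and translate notification payloads into commands; the channel names follow the list above.

#![allow(unused)]
fn main() {
use sqlx::postgres::PgListener;
use sqlx::PgPool;

async fn listen_for_messages(pool: &PgPool) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect_with(pool).await?;

    // Orchestration queues plus one worker namespace, as described above.
    listener
        .listen_all(["pgmq_message_ready.orchestration", "pgmq_message_ready.payments"])
        .await?;

    loop {
        let notification = listener.recv().await?;
        // The payload identifies the queue/message; hand it to the event system.
        println!("channel={} payload={}", notification.channel(), notification.payload());
    }
}
}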

Unified Event Coordination

The UnifiedEventCoordinator demonstrates coordinated management of multiple event systems:

#![allow(unused)]
fn main() {
pub struct UnifiedEventCoordinator {
    orchestration_system: OrchestrationEventSystem,
    task_readiness_fallback: FallbackPoller,
    deployment_mode: DeploymentMode,
    health_monitor: EventSystemHealthMonitor,
    // ... coordination logic
}
}

Coordination Features:

  • Shared Command Channel: Both systems send commands to same orchestration processor
  • Health Monitoring: Unified health checking across all event systems
  • Deployment Mode Management: Synchronized mode changes
  • Statistics Aggregation: Combined metrics from all systems

Worker Event Systems

WorkerEventSystem

Implemented in tasker-worker/src/worker/event_systems/worker_event_system.rs:

#![allow(unused)]
fn main() {
pub struct WorkerEventSystem {
    system_id: String,
    deployment_mode: DeploymentMode,
    queue_listeners: HashMap<String, WorkerQueueListener>,
    fallback_pollers: HashMap<String, WorkerFallbackPoller>,
    context: Arc<SystemContext>,
    command_sender: mpsc::Sender<WorkerCommand>,
    // ... statistics and configuration
}
}

Worker Command Types

#![allow(unused)]
fn main() {
pub enum WorkerCommand {
    // Step execution
    ExecuteStep { message: PgmqMessage<SimpleStepMessage>, queue_name: String, resp: CommandResponder<()> },
    ExecuteStepWithCorrelation { message: PgmqMessage<SimpleStepMessage>, queue_name: String, correlation_id: Uuid, resp: CommandResponder<()> },

    // Result processing
    SendStepResult { result: StepExecutionResult, resp: CommandResponder<()> },
    ProcessStepCompletion { step_result: StepExecutionResult, correlation_id: Option<Uuid>, resp: CommandResponder<()> },

    // Event integration
    ExecuteStepFromMessage { queue_name: String, message: PgmqMessage, resp: CommandResponder<()> },
    ExecuteStepFromEvent { message_event: MessageReadyEvent, resp: CommandResponder<()> },

    // System operations
    GetWorkerStatus { resp: CommandResponder<WorkerStatus> },
    SetEventIntegration { enabled: bool, resp: CommandResponder<()> },
    GetEventStatus { resp: CommandResponder<EventIntegrationStatus> },
    RefreshTemplateCache { namespace: Option<String>, resp: CommandResponder<()> },
    HealthCheck { resp: CommandResponder<WorkerHealthStatus> },
    Shutdown { resp: CommandResponder<()> },
}
}

Worker Queue Architecture

Workers monitor namespace-specific queues for step execution. These custom namespace queues are dynamically configured per deployment.

Example queues:

  1. fulfillment_queue: All fulfillment namespace steps
  2. inventory_queue: All inventory namespace steps
  3. notifications_queue: All notification namespace steps
  4. payment_queue: All payment processing steps

Event Flow and System Interactions

Complete Task Execution Flow

sequenceDiagram
    participant Client
    participant Orchestration
    participant TaskDB
    participant StepQueue
    participant Worker
    participant ResultQueue

    %% Task Initialization
    Client->>Orchestration: TaskRequestMessage (via pgmq_send_with_notify)
    Orchestration->>TaskDB: Create Task + Steps

    %% Step Discovery and Enqueueing (Event-Driven or Fallback Polling)
    Orchestration->>StepQueue: pgmq_send_with_notify(ready steps)
    StepQueue-->>Worker: pg_notify('pgmq_message_ready.{namespace}')

    %% Step Execution
    Worker->>StepQueue: pgmq.read() to claim step
    Worker->>Worker: Execute Step Handler
    Worker->>ResultQueue: pgmq_send_with_notify(StepExecutionResult)
    ResultQueue-->>Orchestration: pg_notify('pgmq_message_ready.orchestration')

    %% Result Processing
    Orchestration->>Orchestration: ProcessStepResult Command
    Orchestration->>TaskDB: Update Step State
    Note over Orchestration: Fallback poller discovers ready steps if events missed

    %% Task Completion
    Note over Orchestration: All Steps Complete
    Orchestration->>Orchestration: FinalizeTask Command
    Orchestration->>TaskDB: Mark Task Complete
    Orchestration-->>Client: Task Completed

Event-Driven Step Discovery

sequenceDiagram
    participant Worker
    participant PostgreSQL
    participant PgmqNotify
    participant OrchestrationListener
    participant StepEnqueuer

    Worker->>PostgreSQL: pgmq_send_with_notify('orchestration_step_results', result)
    PostgreSQL->>PostgreSQL: Atomic: pgmq.send() + pg_notify()
    PostgreSQL->>PgmqNotify: NOTIFY 'pgmq_message_ready.orchestration'
    PgmqNotify->>OrchestrationListener: MessageReadyEvent
    OrchestrationListener->>StepEnqueuer: ProcessStepResult Command
    StepEnqueuer->>PostgreSQL: Query ready steps, enqueue via pgmq_send_with_notify()

Hybrid Mode Operation

stateDiagram-v2
    [*] --> EventDriven

    EventDriven --> Processing : Event Received
    Processing --> EventDriven : Success
    Processing --> PollingFallback : Event Failed

    PollingFallback --> FallbackPolling : Start Polling
    FallbackPolling --> EventDriven : Connection Restored
    FallbackPolling --> Processing : Poll Found Work

    EventDriven --> HealthCheck : Periodic Check
    HealthCheck --> EventDriven : Healthy
    HealthCheck --> PollingFallback : Event Issues Detected

Queue Architecture and Message Flow

PGMQ Integration

The system uses PostgreSQL Message Queue (PGMQ) for reliable message delivery:

Queue Types and Purposes

| Queue Name | Purpose | Message Type | Processing System |
|---|---|---|---|
| orchestration_step_results | Step completion results | StepExecutionResult | Orchestration |
| orchestration_task_requests | New task requests | TaskRequestMessage | Orchestration |
| orchestration_task_finalization | Task finalization | TaskFinalizationMessage | Orchestration |
| {namespace}_queue | Namespace-specific steps | SimpleStepMessage | Workers |

Message Processing Patterns

Event-Driven Processing:

  1. Message arrives in PGMQ queue
  2. PostgreSQL triggers pg_notify with MessageReadyEvent
  3. Event system receives notification
  4. System processes message via command pattern
  5. Message deleted after successful processing

Polling-Based Processing (Fallback):

  1. Periodic queue polling (configurable interval)
  2. Fetch available messages in batches
  3. Process messages via command pattern
  4. Delete processed messages
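
The fallback path maps directly onto PGMQ's SQL API. A minimal sketch using sqlx (the queue name, visibility timeout, and batch size are illustrative values, and the real poller feeds each message through the command pattern rather than processing inline):

#![allow(unused)]
fn main() {
use sqlx::{PgPool, Row};

async fn poll_once(pool: &PgPool) -> Result<(), sqlx::Error> {
    // 1. Fetch a batch of visible messages (30s visibility timeout, up to 20).
    let rows = sqlx::query("SELECT msg_id, message FROM pgmq.read('fulfillment_queue', 30, 20)")
        .fetch_all(pool)
        .await?;

    for row in rows {
        let msg_id: i64 = row.get("msg_id");
        let message: serde_json::Value = row.get("message");

        // 2./3. Process the message via the command pattern (elided here).
        let _ = message;

        // 4. Delete only after successful processing, so failures become
        //    visible again once the visibility timeout expires.
        sqlx::query("SELECT pgmq.delete('fulfillment_queue', $1)")
            .bind(msg_id)
            .execute(pool)
            .await?;
    }
    Ok(())
}
}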

Circuit Breaker Integration

All PGMQ operations are protected by circuit breakers:

#![allow(unused)]
fn main() {
pub struct UnifiedPgmqClient {
    standard_client: Box<dyn PgmqClientTrait + Send + Sync>,
    protected_client: Option<ProtectedPgmqClient>,
    circuit_breaker_enabled: bool,
}
}

Circuit Breaker Features:

  • Automatic Protection: Failure detection and circuit opening
  • Configurable Thresholds: Error rate and timeout configuration
  • Seamless Fallback: Automatic switching between standard and protected clients
  • Recovery Detection: Automatic circuit closing when service recovers

Statistics and Monitoring

Event System Statistics

Both orchestration and worker event systems implement comprehensive statistics:

#![allow(unused)]
fn main() {
pub trait EventSystemStatistics {
    fn events_processed(&self) -> u64;
    fn events_failed(&self) -> u64;
    fn processing_rate(&self) -> f64;         // events/second
    fn average_latency_ms(&self) -> f64;
    fn deployment_mode_score(&self) -> f64;   // 0.0-1.0 effectiveness
    fn success_rate(&self) -> f64;            // derived: processed/(processed+failed)
}
}

Health Monitoring

Deployment Mode Health Status

#![allow(unused)]
fn main() {
pub enum DeploymentModeHealthStatus {
    Healthy,                    // All systems operational
    Degraded { reason: String },// Some issues but functional
    Unhealthy { reason: String },// Significant issues
    Critical { reason: String }, // System failure imminent
}
}

Health Check Integration

  • Event System Health: Connection status, processing latency, error rates
  • Command Processor Health: Queue backlog, processing timeout detection
  • Database Health: Connection pool status, query performance
  • Circuit Breaker Status: Circuit state, failure rates, recovery status

Metrics Collection

Key metrics collected across the system:

Orchestration Metrics

  • Task Initialization Rate: Tasks/minute initialized
  • Step Enqueueing Rate: Steps/minute enqueued to worker queues
  • Result Processing Rate: Results/minute processed from workers
  • Task Completion Rate: Tasks/minute completed successfully
  • Error Rates: Failures by operation type and cause

Worker Metrics

  • Step Execution Rate: Steps/minute executed
  • Handler Performance: Execution time by handler type
  • Queue Processing: Messages claimed/processed by queue
  • Result Submission Rate: Results/minute sent to orchestration
  • FFI Integration: Event correlation and handler communication stats

Error Handling and Resilience

Error Categories

The system handles multiple error categories with appropriate strategies:

Transient Errors

  • Database Connection Issues: Circuit breaker protection + retry with exponential backoff
  • Queue Processing Failures: Message retry with backoff, poison message detection
  • Network Interruptions: Automatic fallback to polling mode

Permanent Errors

  • Invalid Message Format: Dead letter queue for manual analysis
  • Handler Execution Failures: Step failure state with retry limits
  • Configuration Errors: System startup prevention with clear error messages

System Errors

  • Resource Exhaustion: Graceful degradation and load shedding
  • Component Crashes: Automatic restart with state recovery
  • Data Corruption: Transaction rollback and consistency validation

Fallback Mechanisms

Event System Fallbacks

  1. Event-Driven -> Polling: Automatic fallback when event connection fails
  2. Real-time -> Batch: Switch to batch processing during high load
  3. Primary -> Secondary: Database failover support for high availability

Command Processing Fallbacks

  1. Async -> Sync: Degraded operation for critical operations
  2. Distributed -> Local: Local processing when coordination fails
  3. Optimistic -> Pessimistic: Conservative processing during uncertainty

Configuration Management

Event System Configuration

Event systems are configured via TOML with environment overrides:

# config/tasker/base/event_systems.toml
[orchestration_event_system]
system_id = "orchestration-events"
deployment_mode = "Hybrid"
# PGMQ channels handled by listeners, not direct postgres channels
supported_namespaces = ["orchestration"]
health_monitoring_enabled = true
health_check_interval = "30s"
max_concurrent_processors = 10
processing_timeout = "100ms"

[orchestration_event_system.queue_listener]
enabled = true
batch_size = 50
poll_interval = "1s"
connection_timeout = "5s"

[orchestration_event_system.fallback_poller]
enabled = true
poll_interval = "5s"
batch_size = 20
max_retry_attempts = 3

[task_readiness]
enabled = true
polling_interval_seconds = 30

Runtime Configuration Changes

Certain configuration changes can be applied at runtime:

  • Deployment Mode Switching: EventDrivenOnly <-> Hybrid <-> PollingOnly
  • Event Integration Toggle: Enable/disable event processing
  • Health Check Intervals: Adjust monitoring frequency
  • Circuit Breaker Thresholds: Modify failure detection sensitivity
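
For example, toggling event integration is just another worker command. The sketch below assumes CommandResponder is a tokio oneshot sender and that command_tx is the worker's command channel:

#![allow(unused)]
fn main() {
// Sketch: disable event-driven processing on a running worker, forcing it
// onto its fallback polling path until re-enabled.
async fn disable_event_integration(
    command_tx: &tokio::sync::mpsc::Sender<WorkerCommand>,
) -> Result<(), String> {
    let (resp, resp_rx) = tokio::sync::oneshot::channel();
    command_tx
        .send(WorkerCommand::SetEventIntegration { enabled: false, resp })
        .await
        .map_err(|_| "worker command channel closed".to_string())?;
    resp_rx.await.map_err(|_| "no response from worker".to_string())?;
    Ok(())
}
}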

Integration Points

State Machine Integration

Event systems integrate tightly with the state machines documented in states-and-lifecycles.md:

  1. Task State Changes: Event systems react to task transitions
  2. Step State Changes: Step completion triggers task readiness checks
  3. Event Generation: State transitions generate events for system coordination
  4. Atomic Operations: Event processing maintains state machine consistency

Database Integration

Event systems coordinate with PostgreSQL at multiple levels:

  1. LISTEN/NOTIFY: Real-time notifications for database changes
  2. PGMQ Integration: Reliable message queues built on PostgreSQL
  3. Transaction Coordination: Event processing within database transactions
  4. SQL Functions: Database functions generate events and notifications

External System Integration

The event architecture supports integration with external systems:

  1. Webhook Events: HTTP callbacks for external system notifications
  2. Message Bus Integration: Apache Kafka, RabbitMQ, etc. for enterprise messaging
  3. Monitoring Integration: Prometheus, DataDog, etc. for metrics export
  4. API Integration: REST and GraphQL APIs for external coordination

Actor Integration

Overview

The tasker-core system implements a lightweight Actor pattern that formalizes the relationship between Commands and Lifecycle Components. This architecture provides a consistent, type-safe foundation for orchestration component management with all lifecycle operations coordinated through actors.

Status: Complete (Phases 1-7) - Production ready

For comprehensive actor documentation, see Actor-Based Architecture.

Actor Pattern Basics

The actor pattern introduces three core traits:

  1. OrchestrationActor: Base trait for all actors with lifecycle hooks
  2. Handler: Message handling trait for type-safe command processing
  3. Message: Marker trait for command messages

#![allow(unused)]
fn main() {
// Actor definition
pub struct TaskFinalizerActor {
    context: Arc<SystemContext>,
    service: TaskFinalizer,
}

// Message definition
pub struct FinalizeTaskMessage {
    pub task_uuid: Uuid,
}

impl Message for FinalizeTaskMessage {
    type Response = FinalizationResult;
}

// Message handler
#[async_trait]
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
    type Response = FinalizationResult;

    async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
        self.service.finalize_task(msg.task_uuid).await
            .map_err(|e| e.into())
    }
}
}

Integration with Command Processor

The actor pattern integrates seamlessly with the command processor through direct actor calls:

#![allow(unused)]
fn main() {
// From: tasker-orchestration/src/orchestration/command_processor.rs

async fn handle_finalize_task(&self, task_uuid: Uuid) -> TaskerResult<TaskFinalizationResult> {
    // Direct actor-based task finalization
    let msg = FinalizeTaskMessage { task_uuid };
    let result = self.actors.task_finalizer_actor.handle(msg).await?;

    Ok(TaskFinalizationResult::Success {
        task_uuid: result.task_uuid,
        final_status: format!("{:?}", result.action),
        completion_time: Some(chrono::Utc::now()),
    })
}

async fn handle_process_step_result(
    &self,
    step_result: StepExecutionResult,
) -> TaskerResult<StepProcessResult> {
    // Direct actor-based step result processing
    let msg = ProcessStepResultMessage {
        result: step_result.clone(),
    };

    match self.actors.result_processor_actor.handle(msg).await {
        Ok(()) => Ok(StepProcessResult::Success {
            message: format!(
                "Step {} result processed successfully",
                step_result.step_uuid
            ),
        }),
        Err(e) => Ok(StepProcessResult::Error {
            message: format!("Failed to process step result: {e}"),
        }),
    }
}
}

Event → Command → Actor Flow

The complete event-to-actor flow:

┌──────────────┐
│ PGMQ Message │ Message arrives in queue
└──────┬───────┘
       │
       ▼
┌──────────────────┐
│  Event Listener  │ EventDrivenSystem processes notification
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Command Channel  │ Send command to processor via tokio::mpsc
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Command Processor│ Convert command to actor message
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Actor Registry  │ Route message to appropriate actor
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Handler<M>::     │ Actor processes message
│    handle()      │ Delegates to underlying service
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Response        │ Return result to command processor
└──────────────────┘

ActorRegistry and Lifecycle

The ActorRegistry manages all 4 orchestration actors and integrates with the system lifecycle:

#![allow(unused)]
fn main() {
// During system startup
let context = Arc::new(SystemContext::with_pool(pool).await?);
let actors = ActorRegistry::build(context).await?;  // Calls started() on all actors

// During operation
let msg = FinalizeTaskMessage { task_uuid };
let result = actors.task_finalizer_actor.handle(msg).await?;

// During shutdown
actors.shutdown().await;  // Calls stopped() on all actors in reverse order
}

Current Actors:

  • TaskRequestActor: Handles task initialization requests
  • ResultProcessorActor: Processes step execution results
  • StepEnqueuerActor: Manages batch processing of ready tasks
  • TaskFinalizerActor: Handles task finalization with atomic claiming

Benefits for Event-Driven Architecture

The actor pattern enhances the event-driven architecture by providing:

  1. Type Safety: Compile-time verification of message contracts
  2. Consistency: Uniform lifecycle management across all components
  3. Testability: Clear message boundaries for isolated testing
  4. Observability: Actor-level metrics and tracing
  5. Evolvability: Easy to add new message handlers and actors

Implementation Status

The actor integration is complete:

  1. Phase 1 ✅: Actor infrastructure and test harness

    • OrchestrationActor, Handler, Message traits
    • ActorRegistry structure
  2. Phase 2-3 ✅: All 4 primary actors implemented

    • TaskRequestActor, ResultProcessorActor
    • StepEnqueuerActor, TaskFinalizerActor
  3. Phase 4-6 ✅: Message hydration and module reorganization

    • Hydration layer for PGMQ messages
    • Clean module organization
  4. Phase 7 ✅: Service decomposition

    • Large services decomposed into focused components
    • All files <300 lines following single responsibility principle
  5. Cleanup ✅: Direct actor integration

    • Command processor calls actors directly
    • Removed intermediate wrapper layers
    • Production-ready implementation

Service Decomposition

Large services (800-900 lines) were decomposed into focused components:

TaskFinalizer (848 → 6 files):

  • service.rs: Main TaskFinalizer (~200 lines)
  • completion_handler.rs: Task completion logic
  • event_publisher.rs: Lifecycle event publishing
  • execution_context_provider.rs: Context fetching
  • state_handlers.rs: State-specific handling

StepEnqueuerService (781 → 3 files):

  • service.rs: Main service (~250 lines)
  • batch_processor.rs: Batch processing logic
  • state_handlers.rs: State-specific processing

ResultProcessor (889 → 4 files):

  • service.rs: Main processor
  • metadata_processor.rs: Metadata handling
  • error_handler.rs: Error processing
  • result_validator.rs: Result validation

This comprehensive event and command architecture, now enhanced with the actor pattern, provides the foundation for scalable, reliable, and maintainable workflow orchestration in the tasker-core system while maintaining the flexibility to operate in diverse deployment environments.

Idempotency and Atomicity Guarantees

Last Updated: 2025-01-19 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | States and Lifecycles | Events and Commands | Task Readiness & Execution

← Back to Documentation Hub


Overview

Tasker Core is designed for distributed orchestration with multiple orchestrator instances processing tasks concurrently. This document explains the defense-in-depth approach that ensures safe concurrent operation without race conditions, data corruption, or lost work.

The system provides idempotency and atomicity through four overlapping protection layers:

  1. Database Atomicity: PostgreSQL constraints, row locking, and compare-and-swap operations
  2. State Machine Guards: Current-state validation before all transitions
  3. Transaction Boundaries: All-or-nothing semantics for complex operations
  4. Application Logic: State-based filtering and idempotent patterns

These layers work together to ensure that operations can be safely retried, multiple orchestrators can process work concurrently, and crashes don’t leave the system in an inconsistent state.


Core Protection Mechanisms

Layer 1: Database Atomicity

PostgreSQL provides fundamental atomic guarantees through several mechanisms:

Unique Constraints

Purpose: Prevent duplicate creation of entities

Key Constraints:

  • tasker.tasks.identity_hash (UNIQUE) - Prevents duplicate task creation from identical requests
  • tasker.task_namespaces.name (UNIQUE) - Namespace name uniqueness
  • tasker.named_tasks (namespace_id, name, version) (UNIQUE) - Task template uniqueness
  • tasker.named_steps.system_name (UNIQUE) - Step handler uniqueness

Example Protection:

#![allow(unused)]
fn main() {
// Two orchestrators receive identical TaskRequestMessage
// Orchestrator A creates task first -> commits successfully
// Orchestrator B attempts to create -> unique constraint violation
// Result: Exactly one task created, error cleanly handled
}

See Task Initialization for details on how this protects task creation.

Row-Level Locking

Purpose: Prevent concurrent modifications to the same database row

Locking Patterns:

  1. FOR UPDATE - Exclusive lock, blocks concurrent transactions

    -- Used in: transition_task_state_atomic()
    SELECT * FROM tasker.tasks WHERE task_uuid = $1 FOR UPDATE;
    -- Blocks until transaction commits or rolls back
    
  2. FOR UPDATE SKIP LOCKED - Lock-free work distribution

    -- Used in: get_next_ready_tasks()
    SELECT * FROM tasker.tasks
    WHERE state = ANY($1)
    FOR UPDATE SKIP LOCKED
    LIMIT $2;
    -- Each orchestrator gets different tasks, no blocking
    

Example Protection:

#![allow(unused)]
fn main() {
// Scenario: Two orchestrators attempt state transition on same task
// Orchestrator A: BEGIN; SELECT FOR UPDATE; UPDATE state; COMMIT;
// Orchestrator B: BEGIN; SELECT FOR UPDATE (BLOCKS until A commits)
//                 UPDATE fails due to state validation
// Result: Only one transition succeeds, no race condition
}

Compare-and-Swap Semantics

Purpose: Validate expected state before making changes

Pattern: All state transitions validate current state in the same transaction as the update

-- From transition_task_state_atomic()
UPDATE tasker.tasks
SET state = $new_state, updated_at = NOW()
WHERE task_uuid = $uuid
  AND state = $expected_current_state  -- Critical: CAS validation
RETURNING *;

Example Protection:

#![allow(unused)]
fn main() {
// Orchestrator A and B both think task is in "Pending" state
// A transitions: WHERE state = 'Pending' -> succeeds, now "Initializing"
// B transitions: WHERE state = 'Pending' -> returns 0 rows (fails gracefully)
// Result: Atomic transition, no invalid state
}

See SQL Function Architecture for more on database-level guarantees.

Layer 2: State Machine Guards

Purpose: Enforce valid state transitions through application-level validation

Both task and step state machines validate current state before allowing transitions. This provides protection even when database constraints alone wouldn’t catch invalid operations.

Task State Machine

Defined in tasker-shared/src/state_machine/task_state_machine.rs, the TaskStateMachine validates:

  1. Current state retrieval: Always fetch latest state from database
  2. Event applicability: Check if event is valid for current state
  3. Terminal state protection: Cannot transition from Complete/Error/Cancelled
  4. Ownership tracking: Processor UUID tracked for audit (not enforced after ownership removal)

Example Protection:

#![allow(unused)]
fn main() {
// TaskStateMachine prevents invalid transitions
let mut state_machine = TaskStateMachine::new(task, context);

// Attempt to mark complete when still processing
let result = state_machine.transition(TaskEvent::MarkComplete).await;
// Result: Error - cannot mark complete while steps are in progress

// Current state validation prevents:
// - Completing tasks with pending steps
// - Re-initializing completed tasks
// - Transitioning from terminal states
}

See States and Lifecycles for complete state machine documentation.

Workflow Step State Machine

Defined in tasker-shared/src/state_machine/step_state_machine.rs, the StepStateMachine ensures:

  1. Execution claiming: Only Pending/Enqueued steps can transition to InProgress
  2. Completion validation: Only InProgress steps can be marked complete
  3. Retry eligibility: Validates max_attempts and backoff timing

Example Protection:

#![allow(unused)]
fn main() {
// Worker attempts to claim already-processing step
let mut step_machine = StepStateMachine::new(step.into(), context);

match step_machine.current_state().await {
    WorkflowStepState::InProgress => {
        // Already being processed by another worker
        return Ok(false); // Cannot claim
    }
    WorkflowStepState::Pending | WorkflowStepState::Enqueued => {
        // Attempt atomic transition
        step_machine.transition(StepEvent::Start).await?;
    }
    _ => {
        // Any other state (Complete, Cancelled, Error, ...) is not claimable
        return Ok(false);
    }
}
}

This prevents:

  • Multiple workers executing the same step concurrently
  • Marking steps complete that weren’t started
  • Retrying steps that exceeded max_attempts

Layer 3: Transaction Boundaries

Purpose: Ensure all-or-nothing semantics for multi-step operations

Critical operations wrap multiple database changes in a single transaction, ensuring atomic completion or full rollback on failure.

Task Initialization Transaction

Task creation involves multiple dependent entities that must all succeed or all fail:

#![allow(unused)]
fn main() {
// From TaskInitializer.initialize_task()
let mut tx = pool.begin().await?;

// 1. Create or find namespace (find-or-create is idempotent)
let namespace = NamespaceResolver::resolve_namespace(&mut tx, namespace_name).await?;

// 2. Create or find named task
let named_task = NamespaceResolver::resolve_named_task(&mut tx, namespace, task_name).await?;

// 3. Create task record
let task = create_task(&mut tx, named_task.uuid, context).await?;

// 4. Create all workflow steps and edges
let (step_count, step_mapping) = WorkflowStepBuilder::create_workflow_steps(
    &mut tx, task.uuid, template
).await?;

// 5. Initialize state machine
StateInitializer::initialize_task_state(&mut tx, task.uuid).await?;

// ALL OR NOTHING: Commit entire transaction
tx.commit().await?;
}

Example Protection:

#![allow(unused)]
fn main() {
// Scenario: Task creation partially fails
// - Namespace created ✓
// - Named task created ✓
// - Task record created ✓
// - Workflow steps: Cycle detected ✗ (error thrown)
// Result: tx.rollback() -> ALL changes reverted, clean failure
}

Cycle Detection Enforcement

Workflow dependencies are validated during task initialization to prevent circular references:

#![allow(unused)]
fn main() {
// From WorkflowStepBuilder::create_step_dependencies()
for dependency in &step_definition.dependencies {
    let from_uuid = step_mapping[dependency];
    let to_uuid = step_mapping[&step_definition.name];

    // Check for self-reference
    if from_uuid == to_uuid {
        return Err(CycleDetected { from, to });
    }

    // Check for path that would create cycle
    if WorkflowStepEdge::would_create_cycle(pool, from_uuid, to_uuid).await? {
        return Err(CycleDetected { from, to });
    }

    // Safe to create edge
    WorkflowStepEdge::create_with_transaction(&mut tx, edge).await?;
}
}

This prevents invalid DAG structures from ever being persisted to the database.

Layer 4: Application Logic Patterns

Purpose: Implement idempotent patterns at the application level

Beyond database and state machine protections, application code uses several patterns to ensure safe retry and concurrent operation.

Find-or-Create Pattern

Used for entities that should be unique but may be created by multiple orchestrators:

#![allow(unused)]
fn main() {
// From NamespaceResolver
pub async fn resolve_namespace(
    tx: &mut Transaction<'_, Postgres>,
    name: &str,
) -> Result<TaskNamespace> {
    // Try to find existing
    if let Some(namespace) = TaskNamespace::find_by_name(pool, name).await? {
        return Ok(namespace);
    }

    // Create if not found
    match TaskNamespace::create_with_transaction(tx, NewTaskNamespace { name }).await {
        Ok(namespace) => Ok(namespace),
        Err(sqlx::Error::Database(e)) if is_unique_violation(&e) => {
            // Another orchestrator created it between our find and create
            // Re-query to get the one that won the race
            TaskNamespace::find_by_name(pool, name).await?
                .ok_or(Error::NotFound)
        }
        Err(e) => Err(e),
    }
}
}

Why This Works:

  • First attempt: Finds existing → idempotent
  • Create attempt: Unique constraint prevents duplicates
  • Retry after unique violation: Gets the winner → idempotent
  • Result: Exactly one namespace, regardless of concurrent attempts

State-Based Filtering

Operations filter by state to naturally deduplicate work:

#![allow(unused)]
fn main() {
// From StepEnqueuerService
// Only enqueue steps in specific states
let ready_steps = steps.iter()
    .filter(|step| matches!(
        step.state,
        WorkflowStepState::Pending | WorkflowStepState::WaitingForRetry
    ))
    .collect();

// Skip steps already:
// - Enqueued (another orchestrator handled it)
// - InProgress (worker is executing)
// - Complete (already done)
// - Error (terminal state)
}

Example Protection:

#![allow(unused)]
fn main() {
// Scenario: Orchestrator crash mid-batch
// Before crash: Enqueued steps 1-5 of 10
// After restart: Process task again
// State filtering:
//   - Steps 1-5: state = Enqueued → skip
//   - Steps 6-10: state = Pending → enqueue
// Result: Each step enqueued exactly once
}

State-Before-Queue Pattern

Ensures workers only see steps in correct state:

#![allow(unused)]
fn main() {
// 1. Commit state transition to database FIRST
step_state_machine.transition(StepEvent::Enqueue).await?;
// Step now in Enqueued state in database

// 2. THEN send PGMQ notification
pgmq_client.send_with_notify(queue_name, step_message).await?;

// Worker receives notification and:
// - Queries database for step
// - Sees state = Enqueued (committed)
// - Can safely claim and execute
}

Why Order Matters:

#![allow(unused)]
fn main() {
// Wrong order (queue-before-state):
// 1. Send PGMQ message
// 2. Worker receives immediately
// 3. Worker queries database → state still Pending
// 4. Worker might skip or fail to claim
// 5. State transition commits

// Correct order (state-before-queue):
// 1. State transition commits
// 2. Send PGMQ message
// 3. Worker receives
// 4. Worker queries → state correctly Enqueued
// 5. Worker can claim
}

See Events and Commands for event system details.


Component-by-Component Guarantees

Task Initialization Idempotency

Component: TaskRequestActor and TaskInitializer service Operation: Creating a new task from a template File: tasker-orchestration/src/orchestration/lifecycle/task_initialization/

Protection Mechanisms

  1. Identity Hash Unique Constraint

    #![allow(unused)]
    fn main() {
    // Tasks are identified by hash of (namespace, task_name, context)
    let identity_hash = calculate_identity_hash(namespace, name, context);
    
    NewTask {
        identity_hash,  // Unique constraint prevents duplicates
        named_task_uuid,
        context,
        // ...
    }
    }
  2. Transaction Atomicity

    • All entities created in single transaction
    • Namespace, named task, task, workflow steps, edges
    • Cycle detection validates DAG before committing
    • Any failure rolls back everything
  3. Find-or-Create for Shared Entities

    • Namespaces can be created by any orchestrator
    • Named tasks shared across workflow instances
    • Named steps reused across tasks

Concurrent Scenario

Two orchestrators receive identical TaskRequestMessage:

T0: Orchestrator A begins transaction
T1: Orchestrator B begins transaction
T2: A creates namespace "payments"
T3: B attempts to create namespace "payments"
T4: A creates task with identity_hash "abc123"
T5: B attempts to create task with identity_hash "abc123"
T6: A commits successfully ✓
T7: B attempts commit → unique constraint violation on identity_hash
T8: B transaction rolled back

Result:

  • Exactly one task created
  • No partial state in database
  • Orchestrator B receives clear error
  • Retry-safe: B can check if task exists and return it
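
A retry-safe caller can treat the unique-constraint violation as "this task already exists" and fetch the winner, mirroring the find-or-create pattern described below for namespaces. The model methods and error type in this sketch (Task::create_with_transaction, Task::find_by_identity_hash, TaskError) are illustrative names, not necessarily the exact tasker-shared API:

#![allow(unused)]
fn main() {
async fn create_or_fetch_task(
    pool: &sqlx::PgPool,
    new_task: NewTask,
) -> Result<Task, TaskError> {
    // Assumes TaskError: From<sqlx::Error> for the `?` conversions below.
    let mut tx = pool.begin().await?;
    match Task::create_with_transaction(&mut tx, &new_task).await {
        Ok(task) => {
            tx.commit().await?;
            Ok(task)
        }
        Err(e) if is_unique_violation(&e) => {
            // Another orchestrator won the race on identity_hash: roll back
            // our partial work and return the task that already exists.
            tx.rollback().await?;
            Task::find_by_identity_hash(pool, &new_task.identity_hash)
                .await?
                .ok_or(TaskError::NotFound)
        }
        Err(e) => Err(e),
    }
}
}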

Cycle Detection

Prevents invalid workflow definitions:

#![allow(unused)]
fn main() {
// Template defines: A depends on B, B depends on C, C depends on A
// During initialization:
//   - Create steps A, B, C
//   - Create edge A -> B (valid)
//   - Create edge B -> C (valid)
//   - Attempt edge C -> A
//     - would_create_cycle() returns true
//     - Error: CycleDetected
//   - Transaction rolled back
// Result: Invalid workflow rejected, no partial data
}

See tasker-shared/src/models/core/workflow_step_edge.rs:236-270 for cycle detection implementation.

Step Enqueueing Idempotency

Component: StepEnqueuerActor and StepEnqueuerService Operation: Enqueueing ready workflow steps to worker queues File: tasker-orchestration/src/orchestration/lifecycle/step_enqueuer_services/

Multi-Layer Protection

  1. SQL-Level Row Locking

    -- get_next_ready_tasks() uses SKIP LOCKED
    SELECT task_uuid FROM tasker.tasks
    WHERE state = ANY($states)
    FOR UPDATE SKIP LOCKED  -- Prevents concurrent claiming
    LIMIT $batch_size;
    

    Each orchestrator gets different tasks, no overlap

  2. State Machine Compare-and-Swap

    #![allow(unused)]
    fn main() {
    // Only transition if task in expected state
    state_machine.transition(TaskEvent::EnqueueSteps(uuids)).await?;
    // Fails if another orchestrator already transitioned
    }
  3. Step State Filtering

    #![allow(unused)]
    fn main() {
    // Only enqueue steps in specific states
    let enqueueable = steps.filter(|s| matches!(
        s.state,
        WorkflowStepState::Pending | WorkflowStepState::WaitingForRetry
    ));
    }
  4. State-Before-Queue Ordering

    #![allow(unused)]
    fn main() {
    // 1. Commit step state to Enqueued
    step.transition(StepEvent::Enqueue).await?;
    
    // 2. Send PGMQ message
    pgmq.send_with_notify(queue, message).await?;
    }

Concurrent Scenario

Two orchestrators discover the same ready steps:

T0: Orchestrator A queries get_next_ready_tasks(batch=100)
T1: Orchestrator B queries get_next_ready_tasks(batch=100)
T2: A gets tasks [1,2,3] (locked by A's transaction)
T3: B gets tasks [4,5,6] (different rows, SKIP LOCKED)
T4: A enqueues steps for tasks 1,2,3
T5: B enqueues steps for tasks 4,5,6
T6: Both commit successfully

Result: No overlap, each task processed once

Orchestrator Crash Mid-Batch:

T0: Orchestrator A gets task 1 with steps [A, B, C, D]
T1: A enqueues steps A, B to "payments_queue"
T2: A crashes before processing steps C, D
T3: Task 1 state still EnqueuingSteps
T4: Orchestrator B picks up task 1 (A's transaction rolled back)
T5: B queries steps for task 1
T6: Steps A, B have state = Enqueued → skip
T7: Steps C, D have state = Pending → enqueue

Result: Steps A, B enqueued once, C, D recovered and enqueued

Result Processing Idempotency

Component: ResultProcessorActor and OrchestrationResultProcessor Operation: Processing step execution results from workers File: tasker-orchestration/src/orchestration/lifecycle/result_processing/

Protection Mechanisms

  1. State Guard Validation

    #![allow(unused)]
    fn main() {
    // TaskCoordinator validates step state before processing result
    let current_state = step_state_machine.current_state().await?;
    
    match current_state {
        WorkflowStepState::InProgress => {
            // Valid: step is being processed
            step_state_machine.transition(StepEvent::Complete).await?;
        }
        WorkflowStepState::Complete => {
            // Idempotent: already processed this result
            return Ok(AlreadyComplete);
        }
        _ => {
            // Invalid state for result processing
            return Err(InvalidState);
        }
    }
    }
  2. Atomic State Transitions

    • Step result processing uses compare-and-swap
    • Task state transitions validate current state
    • All updates in same transaction as state check
  3. Ownership Removed

    • Processor UUID tracked for audit only
    • Not enforced for transitions
    • Any orchestrator can process results
    • Enables recovery after crashes

Concurrent Scenario

Worker submits result, orchestrator crashes, retry arrives:

T0: Worker completes step A, submits result to orchestration_step_results queue
T1: Orchestrator A pulls message, begins processing
T2: A transitions step A to Complete
T3: A begins task state evaluation
T4: A crashes before deleting PGMQ message
T5: PGMQ visibility timeout expires → message reappears
T6: Orchestrator B pulls same message
T7: B queries step A state → Complete
T8: B returns early (idempotent, already processed)
T9: B deletes PGMQ message

Result: Step processed exactly once, retry is harmless

Before Ownership Removal (Ownership Enforced):

// Orchestrator A owned task in EvaluatingResults state
// A crashes
// B receives retry
// B checks: task.processor_uuid != B.uuid
// Error: Ownership violation → TASK STUCK

After Ownership Removal (Ownership Audit-Only):

// Orchestrator A owned task in EvaluatingResults state
// A crashes
// B receives retry
// B checks: current task state (no ownership check)
// B processes successfully → TASK RECOVERS

See the Ownership Removal ADR for full analysis.

Task Finalization Idempotency

Component: TaskFinalizerActor and TaskFinalizer service Operation: Finalizing task to terminal state File: tasker-orchestration/src/orchestration/lifecycle/task_finalization/

Current Protection (Sufficient for Recovery)

  1. State Guard Protection

    #![allow(unused)]
    fn main() {
    // TaskFinalizer checks current task state
    let context = ExecutionContextProvider::fetch(task_uuid).await?;
    
    match context.should_finalize() {
        true => {
            // Transition to Complete
            task_state_machine.transition(TaskEvent::MarkComplete).await?;
        }
        false => {
            // Not ready to finalize (steps still pending)
            return Ok(NotReady);
        }
    }
    }
  2. Idempotent for Recovery

    #![allow(unused)]
    fn main() {
    // Scenario: Orchestrator crashes during finalization
    // - Task state already Complete → state guard returns early
    // - Task state still StepsInProcess → retry succeeds
    // Result: Recovery works, final state reached
    }

Concurrent Scenario (Not Graceful)

Two orchestrators attempt finalization simultaneously:

T0: Orchestrators A and B both receive finalization trigger
T1: A checks: all steps complete → proceed
T2: B checks: all steps complete → proceed
T3: A transitions task to Complete (succeeds)
T4: B attempts transition to Complete
T5: State guard: task already Complete
T6: B receives StateMachineError (invalid transition)

Result:

  • ✓ Task finalized exactly once (correct)
  • ✓ No data corruption
  • ⚠️ Orchestrator B gets error (not graceful)

Future Enhancement: Atomic Finalization Claiming

Atomic claiming would make concurrent finalization graceful:

-- Proposed claim_task_for_finalization() function
UPDATE tasker.tasks
SET finalization_claimed_at = NOW(),
    finalization_claimed_by = $processor_uuid
WHERE task_uuid = $uuid
  AND state = 'StepsInProcess'
  AND finalization_claimed_at IS NULL
RETURNING *;

With atomic finalization claiming:

T0: Orchestrators A and B both receive finalization trigger
T1: A calls claim_task_for_finalization() → succeeds
T2: B calls claim_task_for_finalization() → returns 0 rows
T3: A proceeds with finalization
T4: B returns early (silent success, already claimed)

This enhancement is deferred (implementation not yet scheduled).


SQL Function Atomicity

File: tasker-shared/src/database/sql/ Documented: Task Readiness & Execution

Atomic State Transitions

Function: transition_task_state_atomic() Protection: Compare-and-swap with row locking

-- Atomic state transition with validation
-- (the function first locks the row via SELECT ... FOR UPDATE, as shown earlier)
UPDATE tasker.tasks
SET state = $new_state,
    updated_at = NOW()
WHERE task_uuid = $uuid
  AND state = $expected_current_state  -- CAS: only if state matches
RETURNING *;

Key Guarantees:

  • Returns 0 rows if state doesn’t match → safe retry (sketched below)
  • Row lock prevents concurrent transitions
  • Processor UUID tracked for audit, not enforced
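
As a sketch of how a caller treats this guarantee (assuming sqlx, which the codebase already uses for PgPool access), losing the compare-and-swap race is a normal outcome rather than an error:

async fn try_transition(
    pool: &sqlx::PgPool,
    task_uuid: uuid::Uuid,
    expected_state: &str,
    new_state: &str,
) -> Result<bool, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE tasker.tasks
         SET state = $1, updated_at = NOW()
         WHERE task_uuid = $2 AND state = $3",
    )
    .bind(new_state)
    .bind(task_uuid)
    .bind(expected_state)
    .execute(pool)
    .await?;

    // 0 rows affected => another processor already transitioned the task;
    // re-read the current state and decide whether anything remains to do.
    Ok(result.rows_affected() == 1)
}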

Work Distribution Without Contention

Function: get_next_ready_tasks() Protection: Lock-free claiming via SKIP LOCKED

SELECT task_uuid, correlation_id, state
FROM tasker.tasks
WHERE state = ANY($processable_states)
  AND (
    state NOT IN ('WaitingForRetry') OR
    last_retry_at + retry_interval < NOW()
  )
ORDER BY
  CASE state
    WHEN 'Pending' THEN 1
    WHEN 'WaitingForRetry' THEN 2
    ELSE 3
  END,
  created_at ASC
FOR UPDATE SKIP LOCKED  -- Skip locked rows, no blocking
LIMIT $batch_size;

Key Guarantees:

  • Each orchestrator gets different tasks
  • No blocking or contention
  • Dynamic priority (Pending before WaitingForRetry)
  • Prevents task starvation

Step Readiness with Dependency Validation

Function: get_step_readiness_status() Protection: Validates dependencies in single query

WITH step_dependencies AS (
  SELECT COUNT(*) as total_deps,
         SUM(CASE WHEN dep_step.state = 'Complete' THEN 1 ELSE 0 END) as completed_deps
  FROM tasker.workflow_step_edges e
  JOIN tasker.workflow_steps dep_step ON e.from_step_uuid = dep_step.uuid
  WHERE e.to_step_uuid = $step_uuid
)
SELECT
  CASE
    WHEN total_deps = completed_deps THEN 'Ready'
    WHEN step.state = 'Error' AND step.attempts < step.max_attempts THEN 'WaitingForRetry'
    ELSE 'Blocked'
  END as readiness
FROM step_dependencies, tasker.workflow_steps step
WHERE step.uuid = $step_uuid;

Key Guarantees:

  • Atomic dependency check
  • Handles retry logic with backoff
  • Prevents premature execution

Cycle Detection

Function: WorkflowStepEdge::would_create_cycle() (Rust, uses SQL) Protection: Recursive CTE path traversal

WITH RECURSIVE step_path AS (
  -- Base: Start from proposed destination
  SELECT from_step_uuid, to_step_uuid, 1 as depth
  FROM tasker.workflow_step_edges
  WHERE from_step_uuid = $proposed_to

  UNION ALL

  -- Recursive: Follow edges
  SELECT sp.from_step_uuid, wse.to_step_uuid, sp.depth + 1
  FROM step_path sp
  JOIN tasker.workflow_step_edges wse ON sp.to_step_uuid = wse.from_step_uuid
  WHERE sp.depth < 100  -- Prevent infinite recursion
)
SELECT COUNT(*) as has_path
FROM step_path
WHERE to_step_uuid = $proposed_from;

Returns: True if adding edge would create cycle

Enforcement: Called by WorkflowStepBuilder during task initialization

  • Self-reference check: from_uuid == to_uuid
  • Path check: Would adding edge create cycle?
  • Error before commit: Transaction rolled back on cycle

See tasker-orchestration/src/orchestration/lifecycle/task_initialization/workflow_step_builder.rs for enforcement.
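
A sketch of that enforcement order; WorkflowStepEdge::would_create_cycle is the documented check, but its exact signature and error type are assumptions here:

async fn validate_edge(
    pool: &sqlx::PgPool,
    from_step_uuid: uuid::Uuid,
    to_step_uuid: uuid::Uuid,
) -> Result<(), String> {
    // 1. Self-reference check
    if from_step_uuid == to_step_uuid {
        return Err("self-referencing edge".to_string());
    }
    // 2. Path check via the recursive CTE shown above
    //    (assumed to return Result<bool, sqlx::Error>)
    let creates_cycle = WorkflowStepEdge::would_create_cycle(pool, from_step_uuid, to_step_uuid)
        .await
        .map_err(|e| e.to_string())?;
    if creates_cycle {
        // 3. The builder returns this error before commit,
        //    so the surrounding transaction rolls back.
        return Err("edge would create a cycle".to_string());
    }
    Ok(())
}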


Cross-Cutting Scenarios

Multiple Orchestrators Processing Same Task

Scenario: Load balancer distributes work to multiple orchestrators

Protection:

  1. Work Distribution:

    -- Each orchestrator gets different tasks via SKIP LOCKED
    Orchestrator A: Tasks [1, 2, 3]
    Orchestrator B: Tasks [4, 5, 6]
    
  2. State Transitions:

    #![allow(unused)]
    fn main() {
    // Both attempt to transition the same task (shouldn't happen, but...)
    // A: transition(Pending -> Initializing) → succeeds
    // B: transition(Pending -> Initializing) → fails (state already changed)
    }
  3. Step Enqueueing:

    #![allow(unused)]
    fn main() {
    // Task in EnqueuingSteps state
    // A: Processes task, enqueues steps A, B
    // B: Cannot claim task (state not in processable states)
    // OR if B claims during transition:
    // B: Filters steps by state → A, B already Enqueued, skips them
    }

Result: No duplicate work, clean coordination

Orchestrator Crashes and Recovers

Scenario: Orchestrator crashes mid-operation, another takes over

During Task Initialization

Before ownership removal:
T0: Orchestrator A initializes task 1
T1: Task transitions to Initializing (processor_uuid = A)
T2: A crashes
T3: Task stuck in Initializing forever (ownership blocks recovery)

After ownership removal:
T0: Orchestrator A initializes task 1
T1: Task transitions to Initializing (processor_uuid = A for audit)
T2: A crashes
T3: Orchestrator B picks up task 1
T4: B transitions Initializing -> EnqueuingSteps (succeeds, no ownership check)
T5: Task recovers automatically

During Step Enqueueing

T0: Orchestrator A enqueues steps [A, B] of task 1
T1: A crashes before committing
T2: Transaction rolls back
T3: Steps A, B remain in Pending state
T4: Orchestrator B picks up task 1
T5: B enqueues steps A, B (state still Pending)
T6: No duplicate work

During Result Processing

T0: Worker completes step A
T1: Orchestrator A receives result, transitions step to Complete
T2: A crashes before updating task state
T3: PGMQ message visibility timeout expires
T4: Orchestrator B receives same result message
T5: B queries step A → already Complete
T6: B skips processing (idempotent)
T7: B evaluates task state, continues workflow

Result: Complete recovery, no manual intervention

Retry After Transient Failure

Scenario: Database connection lost during operation

#![allow(unused)]
fn main() {
// Orchestrator attempts task initialization
let result = task_initializer.initialize(request).await;

match result {
    Err(TaskInitializationError::Database(_)) => {
        // Transient failure (connection lost)
        // Retry same request
        let retry_result = task_initializer.initialize(request).await;

        // Possibilities:
        // 1. Succeeds: Transaction completed before connection lost
        //    → identity_hash unique constraint prevents duplicate
        //    → Get existing task
        // 2. Succeeds: Transaction rolled back
        //    → Create task successfully
        // 3. Fails: Different error
        //    → Handle appropriately
    }
    Ok(task) => { /* Success */ }
}
}

Key Pattern: Operations are designed to be retry-safe

  • Database constraints prevent duplicates
  • State guards prevent invalid transitions
  • Find-or-create handles concurrent creation (see the sketch below)
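
A minimal find-or-create sketch, assuming sqlx and a unique constraint on an identity_hash column; the column list and UUID default are abbreviated for illustration:

async fn find_or_create_task(
    pool: &sqlx::PgPool,
    identity_hash: &str,
) -> Result<uuid::Uuid, sqlx::Error> {
    // ON CONFLICT DO NOTHING makes concurrent creation race-free:
    // exactly one inserter wins, everyone else falls through to the SELECT.
    if let Some(task_uuid) = sqlx::query_scalar::<_, uuid::Uuid>(
        "INSERT INTO tasker.tasks (task_uuid, identity_hash)
         VALUES (uuid_generate_v7(), $1)
         ON CONFLICT (identity_hash) DO NOTHING
         RETURNING task_uuid",
    )
    .bind(identity_hash)
    .fetch_optional(pool)
    .await?
    {
        return Ok(task_uuid);
    }

    // Someone else created the task first; return the existing row.
    sqlx::query_scalar::<_, uuid::Uuid>(
        "SELECT task_uuid FROM tasker.tasks WHERE identity_hash = $1",
    )
    .bind(identity_hash)
    .fetch_one(pool)
    .await
}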

PGMQ Message Duplicate Delivery

Scenario: PGMQ message processed twice due to visibility timeout

#![allow(unused)]
fn main() {
// Worker completes step, sends result
pgmq.send("orchestration_step_results", result).await?;

// Orchestrator A receives message
let message = pgmq.read("orchestration_step_results").await?;

// A processes result
result_processor.process(message.payload).await?;

// A about to delete message, crashes
// Message visibility timeout expires → message reappears

// Orchestrator B receives same message
let duplicate = pgmq.read("orchestration_step_results").await?;

// B processes result
// State machine checks: step already Complete
// Returns early (idempotent)
result_processor.process(duplicate.payload).await?; // Harmless

// B deletes message
pgmq.delete(duplicate.msg_id).await?;
}

Protection:

  • State guards: Check current state before processing
  • Idempotent handlers: Safe to process same message multiple times
  • Message deletion: Only after confirmed processing

See Events and Commands for PGMQ architecture.


Multi-Instance Validation

The defense-in-depth architecture was validated through comprehensive multi-instance cluster testing. This section documents the validation results and confirms the effectiveness of the protection mechanisms.

Test Configuration

  • Orchestration Instances: 2 (ports 8080, 8081)
  • Worker Instances: 2 per type (Rust: 8100-8101, Ruby: 8200-8201, Python: 8300-8301, TypeScript: 8400-8401)
  • Total Services: 10 concurrent instances
  • Database: Shared PostgreSQL with PGMQ messaging

Validation Results

Metric | Result
Tests Passed | 1,645
Intermittent Failures | 3 (resource contention, not race conditions)
Tests Skipped | 21 (domain event tests, require single-instance)
Race Conditions Detected | 0
Data Corruption Detected | 0

What Was Validated

  1. Concurrent Task Creation

    • Tasks created through different orchestration instances
    • No duplicate tasks or UUIDs
    • All tasks complete successfully
    • State consistent across all instances
  2. Work Distribution

    • SKIP LOCKED distributes tasks without overlap
    • Multiple workers claim different steps
    • No duplicate step processing
  3. State Machine Guards

    • Invalid transitions rejected at state machine layer
    • Compare-and-swap prevents concurrent modifications
    • Terminal states protected from re-entry
  4. Transaction Boundaries

    • All-or-nothing semantics maintained under load
    • No partial task initialization observed
    • Crash recovery works correctly
  5. Cross-Instance Consistency

    • Task state queries return same result from any instance
    • Step state transitions visible immediately to all instances
    • No stale reads observed

Protection Layer Effectiveness

Layer | Validation Method | Result
Database Atomicity | Concurrent unique constraint tests | Duplicates correctly rejected
State Machine Guards | Parallel transition attempts | Invalid transitions blocked
Transaction Boundaries | Crash injection tests | Clean rollback, no corruption
Application Logic | State filtering under load | Idempotent processing confirmed

Intermittent Failures Analysis

Three tests showed intermittent failures under heavy parallelization:

  • Root Cause: Database connection pool exhaustion when running 1600+ tests in parallel
  • Evidence: Failures occurred only at high parallelism (>4 threads), not with serialized execution
  • Classification: Resource contention, NOT race conditions
  • Mitigation: Nextest configured with test-threads = 1 for multi_instance tests

Key Finding: No race conditions were detected. All intermittent failures traced to resource limits.

Domain Event Tests

21 tests were excluded from cluster mode using #[cfg(not(feature = "test-cluster"))]:

  • Reason: Domain event tests verify in-process event delivery (publish/subscribe within single process)
  • Behavior in Cluster: Events published in one instance aren’t delivered to subscribers in another instance
  • Status: Working as designed - these tests run correctly in single-instance CI

Stress Test Results

Rapid Task Burst Test:

  • 25 tasks created in <1 second
  • All tasks completed successfully
  • No duplicate UUIDs
  • Creation rate: ~50 tasks/second sustained

Round-Robin Distribution Test:

  • Tasks distributed evenly across orchestration instances
  • Load balancing working correctly
  • No single-instance bottleneck

Recommendations Validated

The following architectural decisions were validated by cluster testing:

  1. Ownership Removal: Processor UUID as audit-only (not enforced) enables automatic recovery
  2. SKIP LOCKED Pattern: Effective for contention-free work distribution
  3. State-Before-Queue Pattern: Prevents workers from seeing uncommitted state
  4. Find-or-Create Pattern: Handles concurrent entity creation correctly

Future Enhancements Identified

Testing identified one P2 improvement opportunity:

Atomic Finalization Claiming

  • Current: Second orchestrator gets StateMachineError during concurrent finalization
  • Proposed: Transaction-based locking for graceful handling
  • Priority: P2 (operational improvement, correctness already ensured)

Running Cluster Validation

To reproduce the validation:

# Setup cluster environment
cargo make setup-env-cluster

# Start full cluster
cargo make cluster-start-all

# Run all tests including cluster tests
cargo make test-rust-all

# Stop cluster
cargo make cluster-stop

See Cluster Testing Guide for detailed instructions.


Design Principles

Defense in Depth

The system intentionally provides multiple overlapping protection layers rather than relying on a single mechanism. This ensures:

  1. Resilience: If one layer fails (e.g., application bug), others prevent corruption
  2. Clear Semantics: Each layer has a specific purpose and failure mode
  3. Ease of Reasoning: Developers can understand guarantees at each level
  4. Graceful Degradation: System remains safe even under partial failures

Fail-Safe Defaults

When in doubt, the system errs on the side of caution:

  • State transitions fail if current state doesn’t match → prevents invalid states
  • Unique constraints fail creation → prevents duplicates
  • Row locks block concurrent access → prevents race conditions
  • Cycle detection fails initialization → prevents invalid workflows

Better to fail cleanly than to corrupt data.

Retry Safety

All critical operations are designed to be safely retryable:

  • Idempotent: Same operation, repeated → same outcome
  • State-Based: Check current state before acting
  • Atomic: All-or-nothing commits
  • No Side Effects: Operations don’t accumulate partial state

This enables:

  • Automatic retry after transient failures
  • Duplicate message handling
  • Recovery after crashes
  • Horizontal scaling without coordination overhead
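
The same principle can be captured in a small generic helper: because each operation is idempotent, simply invoking it again after a transient failure is safe. This is an illustrative sketch, not an API from the codebase:

async fn retry_idempotent<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..max_attempts {
        match op().await {
            Ok(value) => return Ok(value),    // same outcome no matter which attempt succeeds
            Err(err) => last_err = Some(err), // transient failure: safe to call again
        }
    }
    Err(last_err.expect("max_attempts must be at least 1"))
}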

Audit Trail Without Enforcement

Ownership Decision: Track ownership for observability, don’t enforce for correctness

#![allow(unused)]
fn main() {
// Processor UUID recorded in all transitions
pub struct TaskTransition {
    pub task_uuid: Uuid,
    pub from_state: TaskState,
    pub to_state: TaskState,
    pub processor_uuid: Uuid,  // For audit and debugging
    pub event: String,
    pub timestamp: DateTime<Utc>,
}

// But NOT enforced in transition logic
impl TaskStateMachine {
    pub async fn transition(&mut self, event: TaskEvent) -> Result<TaskState> {
        // ✅ Tracks processor UUID
        // ❌ Does NOT require ownership match
        // Reason: Enables recovery after crashes
    }
}
}

Why This Works:

  • State guards provide correctness (current state validation)
  • Processor UUID provides observability (who did what when)
  • No ownership blocking means automatic recovery
  • Full audit trail for debugging and monitoring

Implementation Checklist

When implementing new orchestration operations, ensure:

Database Layer

  • Unique constraints for entities that must be singular
  • FOR UPDATE locking for state transitions
  • FOR UPDATE SKIP LOCKED for work distribution
  • Compare-and-swap (CAS) in UPDATE WHERE clauses
  • Transaction wrapping for multi-step operations

State Machine Layer

  • Current state retrieval before transitions
  • Event applicability validation
  • Terminal state protection
  • Error handling for invalid transitions

Application Layer

  • Find-or-create pattern for shared entities
  • State-based filtering before processing
  • State-before-queue ordering for events
  • Idempotent message handlers

Testing

  • Concurrent operation tests (multiple orchestrators)
  • Crash recovery tests (mid-operation failures)
  • Retry safety tests (duplicate message handling)
  • Race condition tests (timing-dependent scenarios)

Core Architecture

Implementation Details

Multi-Instance Validation

Testing


← Back to Documentation Hub

Messaging Abstraction Architecture

Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Events and Commands | Deployment Patterns | Crate Architecture

← Back to Documentation Hub


Overview

The provider-agnostic messaging abstraction enables Tasker Core to support multiple messaging backends through a unified interface. This architecture allows switching between PGMQ (PostgreSQL Message Queue) and RabbitMQ without changes to business logic.

Key Benefits:

  • Zero handler changes required: Switching providers requires only configuration changes
  • Provider-specific optimizations: Each backend can leverage its native strengths
  • Testability: In-memory provider for fast unit testing
  • Gradual migration: Systems can transition between providers incrementally

Core Concepts

Message Delivery Models

Different messaging providers have fundamentally different delivery models:

Provider | Native Model | Push Support | Notification Type | Fallback Needed
PGMQ | Poll | Yes (pg_notify) | Signal only | Yes (catch-up)
RabbitMQ | Push | Yes (native) | Full message | No
InMemory | Push | Yes | Full message | No

PGMQ (Signal-Only):

  • pg_notify sends a signal that a message exists
  • Worker must fetch the message after receiving the signal
  • Fallback polling catches missed signals

RabbitMQ (Full Message Push):

  • basic_consume() delivers complete messages
  • No separate fetch required
  • Protocol guarantees delivery

Architecture Layers

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Application Layer                                  │
│  (Orchestration, Workers, Event Systems)                                    │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Uses MessageClient
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           MessageClient                                      │
│  Domain-level facade with queue classification                              │
│  Location: tasker-shared/src/messaging/client.rs                           │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Delegates to
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         MessagingProvider Enum                               │
│  Runtime dispatch without trait objects (zero-cost abstraction)             │
│  Location: tasker-shared/src/messaging/service/provider.rs                 │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    │               │               │
                    ▼               ▼               ▼
            ┌───────────┐   ┌───────────┐   ┌───────────┐
            │   PGMQ    │   │ RabbitMQ  │   │ InMemory  │
            │ Provider  │   │ Provider  │   │ Provider  │
            └───────────┘   └───────────┘   └───────────┘

Core Traits and Types

MessagingService Trait

Location: tasker-shared/src/messaging/service/traits.rs

The foundational trait defining queue operations:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait MessagingService: Send + Sync {
    // Queue lifecycle
    async fn create_queue(&self, name: &str) -> Result<(), MessagingError>;
    async fn delete_queue(&self, name: &str) -> Result<(), MessagingError>;
    async fn queue_exists(&self, name: &str) -> Result<bool, MessagingError>;
    async fn list_queues(&self) -> Result<Vec<String>, MessagingError>;

    // Message operations
    async fn send_message(&self, queue: &str, payload: &[u8]) -> Result<i64, MessagingError>;
    async fn send_message_with_delay(&self, queue: &str, payload: &[u8], delay_seconds: i64) -> Result<i64, MessagingError>;
    async fn receive_messages(&self, queue: &str, limit: i32, visibility_timeout: i32) -> Result<Vec<QueuedMessage<Vec<u8>>>, MessagingError>;
    async fn ack_message(&self, queue: &str, msg_id: i64) -> Result<(), MessagingError>;
    async fn nack_message(&self, queue: &str, msg_id: i64) -> Result<(), MessagingError>;

    // Provider information
    fn provider_name(&self) -> &'static str;
}
}

SupportsPushNotifications Trait

Location: tasker-shared/src/messaging/service/traits.rs

Extends MessagingService with push notification capabilities:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait SupportsPushNotifications: MessagingService {
    /// Subscribe to messages on a single queue
    fn subscribe(&self, queue_name: &str)
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>;

    /// Subscribe to messages on multiple queues
    fn subscribe_many(&self, queue_names: &[String])
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>;

    /// Whether this provider requires fallback polling for reliability
    fn requires_fallback_polling(&self) -> bool;

    /// Suggested polling interval if fallback is needed
    fn fallback_polling_interval(&self) -> Option<Duration>;

    /// Whether this provider supports fetching by message ID
    fn supports_fetch_by_message_id(&self) -> bool;
}
}

MessageNotification Enum

Location: tasker-shared/src/messaging/service/traits.rs

Abstracts the two notification models:

#![allow(unused)]
fn main() {
pub enum MessageNotification {
    /// Signal-only notification (PGMQ style)
    /// Indicates a message is available but requires separate fetch
    Available {
        queue_name: String,
        msg_id: Option<i64>,
    },

    /// Full message notification (RabbitMQ style)
    /// Contains the complete message payload
    Message(QueuedMessage<Vec<u8>>),
}
}

Provider Implementations

PGMQ Provider

Location: tasker-shared/src/messaging/service/providers/pgmq.rs

PostgreSQL-based message queue with LISTEN/NOTIFY integration:

#![allow(unused)]
fn main() {
impl SupportsPushNotifications for PgmqMessagingService {
    fn subscribe_many(&self, queue_names: &[String])
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
    {
        // Uses PgmqNotifyListener for pg_notify subscription
        // Returns MessageNotification::Available (signal-only) for large messages
        // Returns MessageNotification::Message for small messages (<7KB)
    }

    fn requires_fallback_polling(&self) -> bool {
        true  // pg_notify can miss signals during connection issues
    }

    fn supports_fetch_by_message_id(&self) -> bool {
        true  // PGMQ supports read_specific_message()
    }
}
}

Characteristics:

  • Uses PostgreSQL for storage and delivery
  • pg_notify for real-time notifications
  • Fallback polling required for reliability
  • Supports visibility timeout for message claiming

RabbitMQ Provider

Location: tasker-shared/src/messaging/service/providers/rabbitmq.rs

AMQP-based message broker with native push delivery:

#![allow(unused)]
fn main() {
impl SupportsPushNotifications for RabbitMqMessagingService {
    fn subscribe_many(&self, queue_names: &[String])
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
    {
        // Uses lapin basic_consume() for native push delivery
        // Always returns MessageNotification::Message (full payload)
    }

    fn requires_fallback_polling(&self) -> bool {
        false  // AMQP protocol guarantees delivery
    }

    fn supports_fetch_by_message_id(&self) -> bool {
        false  // RabbitMQ doesn't support fetch-by-ID
    }
}
}

Characteristics:

  • Native push delivery via AMQP protocol
  • No fallback polling needed
  • Higher throughput for high-volume scenarios
  • Requires separate infrastructure (RabbitMQ server)

InMemory Provider

Location: tasker-shared/src/messaging/service/providers/in_memory.rs

In-process message queue for testing:

#![allow(unused)]
fn main() {
impl SupportsPushNotifications for InMemoryMessagingService {
    fn requires_fallback_polling(&self) -> bool {
        false  // In-memory is reliable within process
    }
}
}

Use Cases:

  • Unit testing without external dependencies
  • Integration testing with controlled timing
  • Development environments

MessagingProvider Enum

Location: tasker-shared/src/messaging/service/provider.rs

Enum dispatch pattern for runtime provider selection without trait objects:

#![allow(unused)]
fn main() {
pub enum MessagingProvider {
    Pgmq(PgmqMessagingService),
    RabbitMq(RabbitMqMessagingService),
    InMemory(InMemoryMessagingService),
}

impl MessagingProvider {
    /// Delegate all MessagingService methods to the underlying provider
    pub async fn send_message(&self, queue: &str, payload: &[u8]) -> Result<i64, MessagingError> {
        match self {
            Self::Pgmq(p) => p.send_message(queue, payload).await,
            Self::RabbitMq(p) => p.send_message(queue, payload).await,
            Self::InMemory(p) => p.send_message(queue, payload).await,
        }
    }

    /// Subscribe to push notifications
    pub fn subscribe_many(&self, queue_names: &[String])
        -> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
    {
        match self {
            Self::Pgmq(p) => p.subscribe_many(queue_names),
            Self::RabbitMq(p) => p.subscribe_many(queue_names),
            Self::InMemory(p) => p.subscribe_many(queue_names),
        }
    }

    /// Check if fallback polling is required
    pub fn requires_fallback_polling(&self) -> bool {
        match self {
            Self::Pgmq(p) => p.requires_fallback_polling(),
            Self::RabbitMq(p) => p.requires_fallback_polling(),
            Self::InMemory(p) => p.requires_fallback_polling(),
        }
    }
}
}

Benefits:

  • Zero-cost abstraction (no vtable indirection)
  • Exhaustive match ensures all providers handled
  • Easy to add new providers

MessageClient Facade

Location: tasker-shared/src/messaging/client.rs

Domain-level facade providing high-level queue operations:

#![allow(unused)]
fn main() {
pub struct MessageClient {
    provider: Arc<MessagingProvider>,
    classifier: QueueClassifier,
}

impl MessageClient {
    /// Send a step message to the appropriate namespace queue
    pub async fn send_step_message(
        &self,
        namespace: &str,
        step: &SimpleStepMessage,
    ) -> Result<i64, MessagingError> {
        let queue_name = self.classifier.step_queue_for_namespace(namespace);
        let payload = serde_json::to_vec(step)?;
        self.provider.send_message(&queue_name, &payload).await
    }

    /// Send a step result to the orchestration queue
    pub async fn send_step_result(
        &self,
        result: &StepExecutionResult,
    ) -> Result<i64, MessagingError> {
        let queue_name = self.classifier.orchestration_results_queue();
        let payload = serde_json::to_vec(result)?;
        self.provider.send_message(&queue_name, &payload).await
    }

    /// Access the underlying provider for advanced operations
    pub fn provider(&self) -> &MessagingProvider {
        &self.provider
    }
}
}

Event System Integration

Provider-Agnostic Queue Listeners

Both orchestration and worker queue listeners use provider.subscribe_many():

#![allow(unused)]
fn main() {
// tasker-orchestration/src/orchestration/orchestration_queues/listener.rs
impl OrchestrationQueueListener {
    pub async fn start(&mut self) -> Result<(), MessagingError> {
        let queues = vec![
            self.classifier.orchestration_results_queue(),
            self.classifier.orchestration_requests_queue(),
            self.classifier.orchestration_finalization_queue(),
        ];

        // Provider-agnostic subscription
        let stream = self.provider.subscribe_many(&queues)?;

        // Process notifications
        while let Some(notification) = stream.next().await {
            match notification {
                MessageNotification::Available { queue_name, msg_id } => {
                    // PGMQ style: send event command to fetch message
                    self.send_event_command(queue_name, msg_id).await;
                }
                MessageNotification::Message(msg) => {
                    // RabbitMQ style: send message command with full payload
                    self.send_message_command(msg).await;
                }
            }
        }
    }
}
}

Deployment Mode Selection

Event systems select the appropriate mode based on provider capabilities:

#![allow(unused)]
fn main() {
// Determine effective deployment mode for this provider
let effective_mode = deployment_mode.effective_for_provider(provider.provider_name());

match effective_mode {
    DeploymentMode::EventDrivenOnly => {
        // Start queue listener only (no fallback poller)
        // RabbitMQ typically uses this mode
    }
    DeploymentMode::Hybrid => {
        // Start both listener and fallback poller
        // PGMQ uses this mode for reliability
    }
    DeploymentMode::PollingOnly => {
        // Start fallback poller only
        // For restricted environments
    }
}
}

Command Routing

Dual Command Variants

Command processors handle both notification types:

#![allow(unused)]
fn main() {
pub enum OrchestrationCommand {
    // For full message notifications (RabbitMQ)
    ProcessStepResultFromMessage {
        queue_name: String,
        message: QueuedMessage<Vec<u8>>,
        resp: CommandResponder<StepProcessResult>,
    },

    // For signal-only notifications (PGMQ)
    ProcessStepResultFromMessageEvent {
        message_event: MessageReadyEvent,
        resp: CommandResponder<StepProcessResult>,
    },
}
}

Routing Logic:

  • MessageNotification::Message -> ProcessStepResultFromMessage
  • MessageNotification::Available -> ProcessStepResultFromMessageEvent

Type-Safe Channel Wrappers

NewType wrappers for MPSC channels prevent accidental misuse:

Orchestration Channels

Location: tasker-orchestration/src/orchestration/channels.rs

#![allow(unused)]
fn main() {
/// Strongly-typed sender for orchestration commands
#[derive(Debug, Clone)]
pub struct OrchestrationCommandSender(pub(crate) mpsc::Sender<OrchestrationCommand>);

/// Strongly-typed receiver for orchestration commands
#[derive(Debug)]
pub struct OrchestrationCommandReceiver(pub(crate) mpsc::Receiver<OrchestrationCommand>);

/// Strongly-typed sender for orchestration notifications
#[derive(Debug, Clone)]
pub struct OrchestrationNotificationSender(pub(crate) mpsc::Sender<OrchestrationNotification>);

/// Strongly-typed receiver for orchestration notifications
#[derive(Debug)]
pub struct OrchestrationNotificationReceiver(pub(crate) mpsc::Receiver<OrchestrationNotification>);
}

Worker Channels

Location: tasker-worker/src/worker/channels.rs

#![allow(unused)]
fn main() {
/// Strongly-typed sender for worker commands
#[derive(Debug, Clone)]
pub struct WorkerCommandSender(pub(crate) mpsc::Sender<WorkerCommand>);

/// Strongly-typed receiver for worker commands
#[derive(Debug)]
pub struct WorkerCommandReceiver(pub(crate) mpsc::Receiver<WorkerCommand>);
}

Channel Factory

#![allow(unused)]
fn main() {
pub struct ChannelFactory;

impl ChannelFactory {
    /// Create type-safe orchestration command channel pair
    pub fn orchestration_command_channel(buffer_size: usize)
        -> (OrchestrationCommandSender, OrchestrationCommandReceiver)
    {
        let (tx, rx) = mpsc::channel(buffer_size);
        (OrchestrationCommandSender(tx), OrchestrationCommandReceiver(rx))
    }
}
}

Benefits:

  • Compile-time prevention of channel misuse
  • Self-documenting function signatures
  • Zero runtime overhead (NewTypes compile away)
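
For example (import path and buffer size are illustrative), the factory hands back a matched pair:

// Import path assumed (tasker-orchestration/src/orchestration/channels.rs).
use tasker_orchestration::orchestration::channels::ChannelFactory;

fn main() {
    let (tx, rx) = ChannelFactory::orchestration_command_channel(1024);
    // tx is an OrchestrationCommandSender and rx an OrchestrationCommandReceiver;
    // handing tx to code that expects an OrchestrationNotificationSender fails to compile.
    drop((tx, rx));
}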

Configuration

Provider Selection

# config/dotenv/test.env
# Valid values: pgmq (default), rabbitmq
TASKER_MESSAGING_BACKEND=pgmq

# RabbitMQ connection (only used when backend=rabbitmq)
RABBITMQ_URL=amqp://tasker:tasker@localhost:5672/%2F
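
A wiring sketch of how the backend variable might be mapped onto the MessagingProvider enum; module paths and the RabbitMQ constructor name are assumptions, while the PGMQ constructor matches the test example later in this document:

// Paths assumed; the types are described earlier in this document.
use tasker_shared::messaging::service::provider::MessagingProvider;
use tasker_shared::messaging::service::providers::{PgmqMessagingService, RabbitMqMessagingService};
use tasker_shared::messaging::MessagingError;

async fn build_provider(pool: sqlx::PgPool) -> Result<MessagingProvider, MessagingError> {
    match std::env::var("TASKER_MESSAGING_BACKEND").as_deref() {
        Ok("rabbitmq") => {
            let url = std::env::var("RABBITMQ_URL")
                .expect("RABBITMQ_URL is required when the backend is rabbitmq");
            // Constructor name assumed for illustration.
            Ok(MessagingProvider::RabbitMq(RabbitMqMessagingService::connect(&url).await?))
        }
        // pgmq is the documented default.
        _ => Ok(MessagingProvider::Pgmq(PgmqMessagingService::new(pool).await?)),
    }
}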

Provider-Specific Settings

# config/tasker/base/common.toml
[pgmq]
visibility_timeout_seconds = 60
max_message_size_bytes = 1048576
batch_size = 100

[rabbitmq]
prefetch_count = 100
connection_timeout_seconds = 30
heartbeat_seconds = 60

Migration Guide

Switching from PGMQ to RabbitMQ

  1. Deploy RabbitMQ infrastructure
  2. Update configuration:
    export TASKER_MESSAGING_BACKEND=rabbitmq
    export RABBITMQ_URL=amqp://user:pass@rabbitmq:5672/%2F
    
  3. Restart services - No code changes required

Gradual Migration

For zero-downtime migration:

  1. Deploy new services with RabbitMQ configuration
  2. Gradually shift traffic to new services
  3. Monitor for any issues
  4. Decommission PGMQ-based services

Testing

Provider-Agnostic Tests

Most tests should use InMemoryMessagingService for speed:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_step_execution() {
    let provider = MessagingProvider::InMemory(InMemoryMessagingService::new());
    let client = MessageClient::new(Arc::new(provider));

    // Test with in-memory provider
    client.send_step_message("payments", &step_msg).await.unwrap();
}
}

Provider-Specific Tests

For integration tests that need specific provider behavior:

#![allow(unused)]
fn main() {
#[tokio::test]
#[cfg(feature = "integration-tests")]
async fn test_pgmq_notifications() {
    let provider = MessagingProvider::Pgmq(PgmqMessagingService::new(pool).await?);
    // Test PGMQ-specific behavior
}
}

Best Practices

1. Use MessageClient for Application Code

#![allow(unused)]
fn main() {
// Good: Use domain-level facade
let client = context.message_client();
client.send_step_result(&result).await?;

// Avoid: Direct provider access unless necessary
let provider = context.messaging_provider();
provider.send_message("queue", &payload).await?;
}

2. Handle Both Notification Types

#![allow(unused)]
fn main() {
match notification {
    MessageNotification::Available { queue_name, msg_id } => {
        // Signal-only: need to fetch message
    }
    MessageNotification::Message(msg) => {
        // Full message: can process immediately
    }
}
}

3. Respect Provider Capabilities

#![allow(unused)]
fn main() {
if provider.supports_fetch_by_message_id() {
    // Can use read_specific_message()
} else {
    // Must use alternative approach
}
}

4. Configure Fallback Appropriately

#![allow(unused)]
fn main() {
if provider.requires_fallback_polling() {
    // Start fallback poller for reliability
}
}


← Back to Documentation Hub

Next: Events and Commands | Deployment Patterns

States and Lifecycles

Last Updated: 2025-10-10 Audience: All Status: Active Related Docs: Documentation Hub | Events and Commands | Task Readiness & Execution

← Back to Documentation Hub


This document provides comprehensive documentation of the state machine architecture in tasker-core, covering both task and workflow step lifecycles, their state transitions, and the underlying persistence mechanisms.

Overview

The tasker-core system implements a sophisticated dual-state-machine architecture:

  1. Task State Machine: Manages overall workflow orchestration with 12 comprehensive states
  2. Workflow Step State Machine: Manages individual step execution with 10 states, including orchestration queuing

Both state machines work in coordination to provide atomic, auditable, and resilient workflow execution with proper event-driven communication between orchestration and worker systems.

Task State Machine Architecture

Task State Definitions

The task state machine implements 12 comprehensive states as defined in tasker-shared/src/state_machine/states.rs:

Initial States

  • Pending: Created but not started (default initial state)
  • Initializing: Discovering initial ready steps and setting up task context

Active Processing States

  • EnqueuingSteps: Actively enqueuing ready steps to worker queues
  • StepsInProcess: Steps are being processed by workers (orchestration monitoring)
  • EvaluatingResults: Processing results from completed steps and determining next actions

Waiting States

  • WaitingForDependencies: No ready steps, waiting for dependencies to be satisfied
  • WaitingForRetry: Waiting for retry timeout before attempting failed steps again
  • BlockedByFailures: Has failures that prevent progress (manual intervention may be needed)

Terminal States

  • Complete: All steps completed successfully (terminal)
  • Error: Task failed permanently (terminal)
  • Cancelled: Task was cancelled (terminal)
  • ResolvedManually: Manually resolved by operator (terminal)

Task State Properties

Each state has key properties that drive system behavior:

#![allow(unused)]
fn main() {
impl TaskState {
    pub fn is_terminal(&self) -> bool         // Cannot transition further
    pub fn requires_ownership(&self) -> bool  // Processor ownership required
    pub fn is_active(&self) -> bool          // Currently being processed  
    pub fn is_waiting(&self) -> bool         // Waiting for external conditions
    pub fn can_be_processed(&self) -> bool   // Available for orchestration pickup
}
}

Ownership-Required States: Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults
Processable States: Pending, WaitingForDependencies, WaitingForRetry
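
For example (module path assumed), orchestration code can rely on these helpers rather than matching on concrete variants:

use tasker_shared::state_machine::states::TaskState; // module path assumed

fn main() {
    let state = TaskState::WaitingForRetry;
    assert!(state.can_be_processed());    // listed above as a processable state
    assert!(state.is_waiting());          // waiting out the retry backoff
    assert!(!state.requires_ownership()); // only the four active states require ownership
    assert!(!state.is_terminal());        // further transitions remain possible
}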

Task Lifecycle Flow

stateDiagram-v2
    [*] --> Pending
    
    %% Initial Flow
    Pending --> Initializing : Start
    
    %% From Initializing
    Initializing --> EnqueuingSteps : ReadyStepsFound(count)
    Initializing --> Complete : NoStepsFound
    Initializing --> WaitingForDependencies : NoDependenciesReady
    
    %% Processing Flow
    EnqueuingSteps --> StepsInProcess : StepsEnqueued(uuids)
    EnqueuingSteps --> Error : EnqueueFailed(error)
    
    StepsInProcess --> EvaluatingResults : AllStepsCompleted
    StepsInProcess --> EvaluatingResults : StepCompleted(uuid)
    StepsInProcess --> WaitingForRetry : StepFailed(uuid)
    
    %% Result Evaluation
    EvaluatingResults --> Complete : AllStepsSuccessful
    EvaluatingResults --> EnqueuingSteps : ReadyStepsFound(count)
    EvaluatingResults --> WaitingForDependencies : NoDependenciesReady
    EvaluatingResults --> BlockedByFailures : PermanentFailure(error)
    
    %% Waiting States
    WaitingForDependencies --> EvaluatingResults : DependenciesReady
    WaitingForRetry --> EnqueuingSteps : RetryReady
    
    %% Problem Resolution
    BlockedByFailures --> Error : GiveUp
    BlockedByFailures --> ResolvedManually : ManualResolution
    
    %% Cancellation (from any non-terminal state)
    Pending --> Cancelled : Cancel
    Initializing --> Cancelled : Cancel
    EnqueuingSteps --> Cancelled : Cancel
    StepsInProcess --> Cancelled : Cancel
    EvaluatingResults --> Cancelled : Cancel
    WaitingForDependencies --> Cancelled : Cancel
    WaitingForRetry --> Cancelled : Cancel
    BlockedByFailures --> Cancelled : Cancel
    
    %% Legacy Support
    Error --> Pending : Reset
    
    %% Terminal States
    Complete --> [*]
    Error --> [*]
    Cancelled --> [*]
    ResolvedManually --> [*]

Task Event System

Task state transitions are driven by events defined in tasker-shared/src/state_machine/events.rs:

Lifecycle Events

  • Start: Begin task processing
  • Cancel: Cancel task execution
  • GiveUp: Abandon task (BlockedByFailures -> Error)
  • ManualResolution: Manually resolve task

Discovery Events

  • ReadyStepsFound(count): Ready steps discovered during initialization/evaluation
  • NoStepsFound: No steps defined - task can complete immediately
  • NoDependenciesReady: Dependencies not satisfied - wait required
  • DependenciesReady: Dependencies now ready - can proceed

Processing Events

  • StepsEnqueued(vec<Uuid>): Steps successfully queued for workers
  • EnqueueFailed(error): Failed to enqueue steps
  • StepCompleted(uuid): Individual step completed
  • StepFailed(uuid): Individual step failed
  • AllStepsCompleted: All current batch steps finished
  • AllStepsSuccessful: All steps completed successfully

System Events

  • PermanentFailure(error): Unrecoverable failure
  • RetryReady: Retry timeout expired
  • Timeout: Operation timeout occurred
  • ProcessorCrashed: Processor became unavailable

Processor Ownership

The task state machine implements processor ownership for active states to prevent race conditions:

#![allow(unused)]
fn main() {
// Ownership validation in task_state_machine.rs
if target_state.requires_ownership() {
    let current_processor = self.get_current_processor().await?;
    TransitionGuard::check_ownership(target_state, current_processor, self.processor_uuid)?;
}
}

Ownership Rules:

  • States requiring ownership: Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults
  • Processor UUID stored in tasker.task_transitions.processor_uuid column
  • Atomic ownership claiming prevents concurrent processing
  • Ownership validated on each transition attempt

Workflow Step State Machine Architecture

Step State Definitions

The workflow step state machine implements 10 states for individual step execution:

Processing Pipeline States

  • Pending: Initial state when step is created
  • Enqueued: Queued for processing but not yet claimed by worker
  • InProgress: Currently being executed by a worker
  • EnqueuedForOrchestration: Worker completed, queued for orchestration processing
  • EnqueuedAsErrorForOrchestration: Worker failed, queued for orchestration error processing

Waiting States

  • WaitingForRetry: Step failed with retryable error, waiting for backoff period before retry

Terminal States

  • Complete: Step completed successfully (after orchestration processing)
  • Error: Step failed permanently (non-retryable or max retries exceeded)
  • Cancelled: Step was cancelled
  • ResolvedManually: Step was manually resolved by operator

State Machine Evolution

Previously, the Error state was used for both retryable and permanent failures. The introduction of WaitingForRetry created a semantic change:

  • Before: Error = any failure (retryable or permanent)
  • After: Error = permanent failure only, WaitingForRetry = retryable failure awaiting backoff

This change required updates to:

  1. get_step_readiness_status() to recognize WaitingForRetry as a ready-eligible state
  2. get_task_execution_context() to properly detect blocked vs recovering tasks
  3. Error classification logic to distinguish permanent from retryable errors (see the sketch after this list)
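
A rough sketch of what that classification decision looks like; the enum and helper below are hypothetical stand-ins, not the actual tasker-core types:

// Illustrative only: the real classification lives in the result-processing path.
enum StepFailureKind {
    Retryable, // transient: timeouts, connection resets, 5xx responses
    Permanent, // validation failures, non-retryable business errors
}

fn next_step_event(kind: StepFailureKind, attempts: u32, max_attempts: u32) -> &'static str {
    match kind {
        StepFailureKind::Retryable if attempts < max_attempts => "WaitForRetry", // -> WaitingForRetry
        _ => "Fail",                                                             // -> Error (permanent)
    }
}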

Step State Properties

#![allow(unused)]
fn main() {
impl WorkflowStepState {
    pub fn is_terminal(&self) -> bool                    // No further transitions
    pub fn is_error(&self) -> bool                       // In error state (may allow retry)
    pub fn is_active(&self) -> bool                      // Being processed by worker
    pub fn is_in_processing_pipeline(&self) -> bool     // In execution pipeline
    pub fn is_ready_for_claiming(&self) -> bool         // Available for worker claim
    pub fn satisfies_dependencies(&self) -> bool        // Can satisfy other step dependencies
}
}

Step Lifecycle Flow

stateDiagram-v2
    [*] --> Pending

    %% Main Execution Path
    Pending --> Enqueued : Enqueue
    Enqueued --> InProgress : Start (worker claims)
    InProgress --> EnqueuedForOrchestration : EnqueueForOrchestration(success)
    EnqueuedForOrchestration --> Complete : Complete(results) [orchestration]

    %% Error Handling Path
    InProgress --> EnqueuedAsErrorForOrchestration : EnqueueForOrchestration(error)
    EnqueuedAsErrorForOrchestration --> WaitingForRetry : WaitForRetry(error) [retryable]
    EnqueuedAsErrorForOrchestration --> Error : Fail(error) [permanent/max retries]

    %% Retry Path
    WaitingForRetry --> Pending : Retry (after backoff)

    %% Legacy Direct Path (deprecated)
    InProgress --> Complete : Complete(results) [direct - legacy]
    InProgress --> Error : Fail(error) [direct - legacy]

    %% Legacy Backward Compatibility
    Pending --> InProgress : Start [legacy]

    %% Direct Failure Paths (error before worker processing)
    Pending --> Error : Fail(error)
    Enqueued --> Error : Fail(error)

    %% Cancellation Paths
    Pending --> Cancelled : Cancel
    Enqueued --> Cancelled : Cancel
    InProgress --> Cancelled : Cancel
    EnqueuedForOrchestration --> Cancelled : Cancel
    EnqueuedAsErrorForOrchestration --> Cancelled : Cancel
    WaitingForRetry --> Cancelled : Cancel
    Error --> Cancelled : Cancel

    %% Manual Resolution (from any state)
    Pending --> ResolvedManually : ResolveManually
    Enqueued --> ResolvedManually : ResolveManually
    InProgress --> ResolvedManually : ResolveManually
    EnqueuedForOrchestration --> ResolvedManually : ResolveManually
    EnqueuedAsErrorForOrchestration --> ResolvedManually : ResolveManually
    WaitingForRetry --> ResolvedManually : ResolveManually
    Error --> ResolvedManually : ResolveManually

    %% Terminal States
    Complete --> [*]
    Error --> [*]
    Cancelled --> [*]
    ResolvedManually --> [*]

Step Event System

Step transitions are driven by StepEvent types:

Processing Events

  • Enqueue: Queue step for worker processing
  • Start: Begin step execution (worker claims step)
  • EnqueueForOrchestration(results): Worker completes, queues for orchestration
  • Complete(results): Mark step complete (from orchestration or legacy direct)
  • Fail(error): Mark step as permanently failed
  • WaitForRetry(error): Mark step for retry after backoff

Control Events

  • Cancel: Cancel step execution
  • ResolveManually: Manual operator resolution
  • Retry: Retry step from WaitingForRetry or Error state

Step Execution Flow Integration

The step state machine integrates tightly with the task state machine:

  1. Task Discovers Ready Steps: TaskEvent::ReadyStepsFound(count) -> Task moves to EnqueuingSteps
  2. Steps Get Enqueued: StepEvent::Enqueue -> Steps move to Enqueued state
  3. Workers Claim Steps: StepEvent::Start -> Steps move to InProgress
  4. Workers Complete Steps: StepEvent::EnqueueForOrchestration(results) -> Steps move to EnqueuedForOrchestration
  5. Orchestration Processes Results: StepEvent::Complete(results) -> Steps move to Complete
  6. Task Evaluates Progress: TaskEvent::StepCompleted(uuid) -> Task moves to EvaluatingResults
  7. Task Completes or Continues: Based on remaining steps -> Task moves to Complete or back to EnqueuingSteps

Guard Conditions and Validation

Both state machines implement comprehensive guard conditions in tasker-shared/src/state_machine/guards.rs:

Task Guards

TransitionGuard

  • Validates all task state transitions
  • Prevents invalid state combinations
  • Enforces terminal state immutability
  • Supports legacy transition compatibility

Ownership Validation

  • Checks processor ownership for ownership-required states
  • Prevents concurrent task processing
  • Allows ownership claiming for unowned tasks

Step Guards

StepDependenciesMetGuard

  • Validates all step dependencies are satisfied
  • Delegates to WorkflowStep::dependencies_met()
  • Prevents premature step execution

StepNotInProgressGuard

  • Ensures step is not already being processed
  • Prevents duplicate worker claims
  • Validates step availability

Retry Guards

  • StepCanBeRetriedGuard: Validates step is in Error state
  • Checks retry limits and conditions
  • Prevents infinite retry loops

Orchestration Guards

  • StepCanBeEnqueuedForOrchestrationGuard: Step must be InProgress
  • StepCanBeCompletedFromOrchestrationGuard: Step must be EnqueuedForOrchestration
  • StepCanBeFailedFromOrchestrationGuard: Step must be EnqueuedForOrchestration

Persistence Layer Architecture

Delegation Pattern

The persistence layer in tasker-shared/src/state_machine/persistence.rs implements a delegation pattern to the model layer:

#![allow(unused)]
fn main() {
// TaskTransitionPersistence -> TaskTransition::create() & TaskTransition::get_current()
// StepTransitionPersistence -> WorkflowStepTransition::create() & WorkflowStepTransition::get_current()
}

Benefits:

  • No SQL duplication between state machine and models
  • Atomic transaction handling in models
  • Single source of truth for database operations
  • Independent testability of model methods

Transition Storage

Task Transitions (tasker.task_transitions)

CREATE TABLE tasker.task_transitions (
  task_transition_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  task_uuid UUID NOT NULL,
  to_state VARCHAR NOT NULL,
  from_state VARCHAR,
  processor_uuid UUID,           -- Ownership tracking
  metadata JSONB,
  sort_key INTEGER NOT NULL,
  most_recent BOOLEAN DEFAULT false,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Step Transitions (tasker.workflow_step_transitions)

CREATE TABLE tasker.workflow_step_transitions (
  workflow_step_transition_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
  workflow_step_uuid UUID NOT NULL,
  to_state VARCHAR NOT NULL,
  from_state VARCHAR,
  metadata JSONB,
  sort_key INTEGER NOT NULL,
  most_recent BOOLEAN DEFAULT false,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Current State Resolution

Both transition models implement efficient current state resolution:

#![allow(unused)]
fn main() {
// O(1) current state lookup using most_recent flag
TaskTransition::get_current(pool, task_uuid) -> Option<TaskTransition>
WorkflowStepTransition::get_current(pool, step_uuid) -> Option<WorkflowStepTransition>
}

Performance Optimization:

  • most_recent = true flag on latest transition only
  • Indexed queries: (task_uuid, most_recent) WHERE most_recent = true
  • Atomic flag updates during transition creation
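
The lookup behind TaskTransition::get_current() has roughly the following shape (a sketch; the real query lives in the model layer):

async fn current_task_state(
    pool: &sqlx::PgPool,
    task_uuid: uuid::Uuid,
) -> Result<Option<String>, sqlx::Error> {
    // The partial index on (task_uuid, most_recent) makes this an O(1) lookup.
    sqlx::query_scalar(
        "SELECT to_state
         FROM tasker.task_transitions
         WHERE task_uuid = $1 AND most_recent = true",
    )
    .bind(task_uuid)
    .fetch_optional(pool)
    .await
}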

Atomic Transitions with Ownership

Atomic transitions with processor ownership:

#![allow(unused)]
fn main() {
impl TaskTransitionPersistence {
    pub async fn transition_with_ownership(
        &self,
        task_uuid: Uuid,
        from_state: TaskState,
        to_state: TaskState, 
        processor_uuid: Uuid,
        metadata: Option<Value>,
        pool: &PgPool,
    ) -> PersistenceResult<bool>
}
}

Atomicity Guarantees:

  • Single database transaction for state change
  • Processor UUID stored in dedicated column
  • most_recent flag updated atomically
  • Race condition prevention through database constraints
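
Usage sketch for the documented signature above; persistence, processor_uuid, and pool are assumed to be in scope inside an async context:

// Returns false when the compare-and-swap was lost to another processor.
let persisted = persistence
    .transition_with_ownership(
        task_uuid,
        TaskState::Pending,      // from_state
        TaskState::Initializing, // to_state
        processor_uuid,          // recorded for the audit trail
        None,                    // optional metadata
        &pool,
    )
    .await?;

if !persisted {
    // Another processor recorded the transition first; nothing further to do.
}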

Action System

Both state machines execute actions after successful transitions:

Task Actions

  1. PublishTransitionEventAction: Publishes task state change events
  2. UpdateTaskCompletionAction: Updates task completion status
  3. ErrorStateCleanupAction: Performs error state cleanup

Step Actions

  1. PublishTransitionEventAction: Publishes step state change events
  2. UpdateStepResultsAction: Updates step results and execution data
  3. TriggerStepDiscoveryAction: Triggers task-level step discovery
  4. ErrorStateCleanupAction: Performs step error cleanup

Actions execute sequentially after transition persistence, ensuring consistency.

State Machine Integration Points

Task <-> Step Coordination

  1. Step Discovery: Task initialization discovers ready steps
  2. Step Enqueueing: Task enqueues discovered steps to worker queues
  3. Progress Monitoring: Task monitors step completion via events
  4. Result Processing: Task processes step results and discovers next steps
  5. Completion Detection: Task completes when all steps are complete

Event-Driven Communication

  • pg_notify: PostgreSQL notifications for real-time coordination
  • Event Publishers: Publish state transition events to event system
  • Event Subscribers: React to state changes across system boundaries
  • Queue Integration: Provider-agnostic message queues (PGMQ or RabbitMQ) for worker communication

Worker Integration

  • Step Claiming: Workers claim Enqueued steps from queues
  • Progress Updates: Workers transition steps to InProgress
  • Result Submission: Workers submit results via EnqueueForOrchestration
  • Orchestration Processing: Orchestration processes results and completes steps

This sophisticated state machine architecture provides the foundation for reliable, auditable, and scalable workflow orchestration in the tasker-core system.

Step Result Audit System

The step result audit system provides SOC2-compliant audit trails for workflow step execution results, enabling complete attribution tracking for compliance and debugging.

Audit Table Design

The tasker.workflow_step_result_audit table stores lightweight references with attribution data:

CREATE TABLE tasker.workflow_step_result_audit (
    workflow_step_result_audit_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
    workflow_step_uuid UUID NOT NULL REFERENCES tasker.workflow_steps,
    workflow_step_transition_uuid UUID NOT NULL REFERENCES tasker.workflow_step_transitions,
    task_uuid UUID NOT NULL REFERENCES tasker.tasks,
    recorded_at TIMESTAMP NOT NULL DEFAULT NOW(),

    -- Attribution (NEW data not in transitions)
    worker_uuid UUID,
    correlation_id UUID,

    -- Extracted scalars for indexing/filtering
    success BOOLEAN NOT NULL,
    execution_time_ms BIGINT,

    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    UNIQUE (workflow_step_uuid, workflow_step_transition_uuid)
);

Design Principles

  1. No Data Duplication: Full execution results already exist in tasker.workflow_step_transitions.metadata. The audit table stores references only.

  2. Attribution Capture: The audit system captures NEW attribution data:

    • worker_uuid: Which worker instance processed the step
    • correlation_id: Distributed tracing identifier for request correlation
  3. Indexed Scalars: Success and execution time are extracted for efficient filtering without JSON parsing.

  4. SQL Trigger: A database trigger (trg_step_result_audit) guarantees audit record creation when workers persist results, ensuring SOC2 compliance.

Attribution Flow

Attribution data flows through the system via TransitionContext:

#![allow(unused)]
fn main() {
// Worker creates attribution context
let context = TransitionContext::with_worker(
    worker_uuid,
    Some(correlation_id),
);

// Context is merged into transition metadata
state_machine.transition_with_context(event, Some(context)).await?;

// The SQL trigger then extracts attribution from the transition metadata:
//   v_worker_uuid    := (NEW.metadata->>'worker_uuid')::UUID;
//   v_correlation_id := (NEW.metadata->>'correlation_id')::UUID;
}

Trigger Behavior

The create_step_result_audit trigger fires on transitions to:

  • enqueued_for_orchestration: Successful step completion
  • enqueued_as_error_for_orchestration: Failed step completion

These states represent when workers persist execution results, creating the audit trail.

Querying Audit History

Via API

GET /v1/tasks/{task_uuid}/workflow_steps/{step_uuid}/audit

Returns audit records with full transition details via JOIN, ordered by recorded_at DESC.

Via Client

#![allow(unused)]
fn main() {
let audit_history = client.get_step_audit_history(task_uuid, step_uuid).await?;
for record in audit_history {
    println!("Worker: {:?}, Success: {}, Time: {:?}ms",
        record.worker_uuid,
        record.success,
        record.execution_time_ms
    );
}
}

Via Model

#![allow(unused)]
fn main() {
// Get audit history for a step with full transition details
let history = WorkflowStepResultAudit::get_audit_history(&pool, step_uuid).await?;

// Get all audit records for a task
let task_history = WorkflowStepResultAudit::get_task_audit_history(&pool, task_uuid).await?;

// Query by worker for attribution investigation
let worker_records = WorkflowStepResultAudit::get_by_worker(&pool, worker_uuid, Some(100)).await?;

// Query by correlation ID for distributed tracing
let correlated = WorkflowStepResultAudit::get_by_correlation_id(&pool, correlation_id).await?;
}

Indexes for Common Query Patterns

The audit table includes optimized indexes:

  • idx_audit_step_uuid: Primary query - get audit history for a step
  • idx_audit_task_uuid: Get all audit records for a task
  • idx_audit_recorded_at: Time-range queries for SOC2 audit reports
  • idx_audit_worker_uuid: Attribution investigation (partial index)
  • idx_audit_correlation_id: Distributed tracing queries (partial index)
  • idx_audit_success: Success/failure filtering

Historical Data

The migration includes a backfill for existing transitions. Historical records will have NULL attribution (worker_uuid, correlation_id) since that data wasn’t captured before the audit system was introduced.

Worker Actor-Based Architecture

Last Updated: 2025-12-04 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Events and Commands

← Back to Documentation Hub


This document provides comprehensive documentation of the worker actor-based architecture in tasker-worker, covering the lightweight Actor pattern that mirrors the orchestration architecture for step execution and worker coordination.

Overview

The tasker-worker system implements a lightweight Actor pattern that mirrors the orchestration architecture, providing:

  1. Actor Abstraction: Worker components encapsulated as actors with clear lifecycle hooks
  2. Message-Based Communication: Type-safe message handling via Handler<M> trait
  3. Central Registry: WorkerActorRegistry for managing all worker actors
  4. Service Decomposition: Focused services following single responsibility principle
  5. Lock-Free Statistics: AtomicU64 counters for hot-path performance
  6. Direct Integration: Command processor routes to actors without wrapper layers

This architecture provides consistency between orchestration and worker systems, enabling clearer code organization and improved maintainability.

Implementation Status

Complete: All phases implemented and production-ready

  • Phase 1: Core abstractions (traits, registry, lifecycle management)
  • Phase 2: Service decomposition from 1575 LOC command_processor.rs
  • Phase 3: All 5 primary actors implemented
  • Phase 4: Command processor refactored to pure routing (~200 LOC)
  • Phase 5: Stateless service design eliminating lock contention
  • Cleanup: Lock-free AtomicU64 statistics, shared event system

Current State: Production-ready actor-based worker with 5 actors managing all step execution operations.

Core Concepts

What is a Worker Actor?

In the tasker-worker context, a Worker Actor is an encapsulated step execution component that:

  • Manages its own state: Each actor owns its dependencies and configuration
  • Processes messages: Responds to typed command messages via the Handler<M> trait
  • Has lifecycle hooks: Initialization (started) and cleanup (stopped) methods
  • Is isolated: Actors communicate through message passing
  • Is thread-safe: All actors are Send + Sync + 'static

Why Actors for Workers?

The previous architecture had a monolithic command processor:

#![allow(unused)]
fn main() {
// OLD: 1575 LOC monolithic command processor
pub struct WorkerProcessor {
    // All logic mixed together
    // RwLock contention on hot path
    // Two-phase initialization complexity
}
}

The actor pattern provides:

#![allow(unused)]
fn main() {
// NEW: Pure routing command processor (~200 LOC)
impl ActorCommandProcessor {
    async fn handle_command(&self, command: WorkerCommand) -> bool {
        match command {
            WorkerCommand::ExecuteStep { message, queue_name, resp } => {
                let msg = ExecuteStepMessage { message, queue_name };
                let result = self.actors.step_executor_actor.handle(msg).await;
                let _ = resp.send(result);
                true
            }
            // ... pure routing, no business logic
        }
    }
}
}

Actor vs Service

Services (underlying business logic):

  • Encapsulate step execution logic
  • Stateless operations on step data
  • Direct method invocation
  • Examples: StepExecutorService, FFICompletionService, WorkerStatusService

Actors (message-based coordination):

  • Wrap services with message-based interface
  • Manage service lifecycle
  • Asynchronous message handling
  • Examples: StepExecutorActor, FFICompletionActor, WorkerStatusActor

The relationship:

#![allow(unused)]
fn main() {
pub struct StepExecutorActor {
    context: Arc<SystemContext>,
    service: Arc<StepExecutorService>,  // Wraps underlying service
}

#[async_trait]
impl Handler<ExecuteStepMessage> for StepExecutorActor {
    async fn handle(&self, msg: ExecuteStepMessage) -> TaskerResult<bool> {
        // Delegates to stateless service
        self.service.execute_step(msg.message, &msg.queue_name).await
    }
}
}

Worker Actor Traits

WorkerActor Trait

The base trait for all worker actors, defined in tasker-worker/src/worker/actors/traits.rs:

#![allow(unused)]
fn main() {
/// Base trait for all worker actors
///
/// Provides lifecycle management and context access for all actors in the
/// worker system. All actors must implement this trait to participate
/// in the actor registry and lifecycle management.
pub trait WorkerActor: Send + Sync + 'static {
    /// Returns the unique name of this actor
    fn name(&self) -> &'static str;

    /// Returns a reference to the system context
    fn context(&self) -> &Arc<SystemContext>;

    /// Called when the actor is started
    fn started(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor started");
        Ok(())
    }

    /// Called when the actor is stopped
    fn stopped(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor stopped");
        Ok(())
    }
}
}

Handler Trait

The message handling trait, enabling type-safe message processing:

#![allow(unused)]
fn main() {
/// Message handler trait for specific message types
#[async_trait]
pub trait Handler<M: Message>: WorkerActor {
    /// Handle a message asynchronously
    async fn handle(&self, msg: M) -> TaskerResult<M::Response>;
}
}

Message Trait

The marker trait for command messages:

#![allow(unused)]
fn main() {
/// Marker trait for command messages
pub trait Message: Send + 'static {
    /// The response type for this message
    type Response: Send;
}
}
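
Together, WorkerActor, Handler<M>, and Message tie an actor, a message, and its response type together at compile time. The following is a minimal hypothetical sketch of the wiring; PingWorkerMessage and PingActor are invented for illustration and are not part of tasker-worker:

#![allow(unused)]
fn main() {
// Hypothetical example only: the real messages and actors (ExecuteStepMessage,
// StepExecutorActor, ...) live in tasker-worker.
pub struct PingWorkerMessage {
    pub source: String,
}

impl Message for PingWorkerMessage {
    // Handlers for this message must resolve to TaskerResult<bool>
    type Response = bool;
}

pub struct PingActor {
    context: Arc<SystemContext>,
}

impl WorkerActor for PingActor {
    fn name(&self) -> &'static str {
        "ping_actor"
    }

    fn context(&self) -> &Arc<SystemContext> {
        &self.context
    }
    // started()/stopped() fall back to the default logging implementations
}

#[async_trait]
impl Handler<PingWorkerMessage> for PingActor {
    // The compiler enforces the bool return type via PingWorkerMessage::Response
    async fn handle(&self, msg: PingWorkerMessage) -> TaskerResult<bool> {
        tracing::debug!(source = %msg.source, "ping received");
        Ok(true)
    }
}
}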

WorkerActorRegistry

The central registry managing all worker actors, defined in tasker-worker/src/worker/actors/registry.rs:

Structure

#![allow(unused)]
fn main() {
/// Registry managing all worker actors
#[derive(Clone)]
pub struct WorkerActorRegistry {
    /// System context shared by all actors
    context: Arc<SystemContext>,

    /// Worker ID for this registry
    worker_id: String,

    /// Step executor actor for step execution
    pub step_executor_actor: Arc<StepExecutorActor>,

    /// FFI completion actor for handling step completions
    pub ffi_completion_actor: Arc<FFICompletionActor>,

    /// Template cache actor for template management
    pub template_cache_actor: Arc<TemplateCacheActor>,

    /// Domain event actor for event dispatching
    pub domain_event_actor: Arc<DomainEventActor>,

    /// Worker status actor for health and status
    pub worker_status_actor: Arc<WorkerStatusActor>,
}
}

Initialization

All dependencies required at construction time (no two-phase initialization):

#![allow(unused)]
fn main() {
impl WorkerActorRegistry {
    pub async fn build(
        context: Arc<SystemContext>,
        worker_id: String,
        task_template_manager: Arc<TaskTemplateManager>,
        event_publisher: WorkerEventPublisher,
        domain_event_handle: DomainEventSystemHandle,
    ) -> TaskerResult<Self> {
        // Create actors with all dependencies upfront
        let mut step_executor_actor = StepExecutorActor::new(
            context.clone(),
            worker_id.clone(),
            task_template_manager.clone(),
            event_publisher,
            domain_event_handle,
        );

        // Call started() lifecycle hook
        step_executor_actor.started()?;

        // ... create other actors ...

        Ok(Self {
            context,
            worker_id,
            step_executor_actor: Arc::new(step_executor_actor),
            // ...
        })
    }
}
}

Implemented Actors

StepExecutorActor

Handles step execution from PGMQ messages and events.

Location: tasker-worker/src/worker/actors/step_executor_actor.rs

Messages:

  • ExecuteStepMessage - Execute step from raw data
  • ExecuteStepWithCorrelationMessage - Execute with FFI correlation
  • ExecuteStepFromPgmqMessage - Execute from PGMQ message
  • ExecuteStepFromEventMessage - Execute from event notification

Delegation: Wraps StepExecutorService (stateless, no locks)

Purpose: Central coordinator for all step execution, handles claiming, handler invocation, and result construction.

FFICompletionActor

Handles step completion results from FFI handlers.

Location: tasker-worker/src/worker/actors/ffi_completion_actor.rs

Messages:

  • SendStepResultMessage - Send result to orchestration
  • ProcessStepCompletionMessage - Process completion with correlation

Delegation: Wraps FFICompletionService

Purpose: Forwards step execution results to orchestration queue, manages correlation for async FFI handlers.

TemplateCacheActor

Manages task template caching and refresh.

Location: tasker-worker/src/worker/actors/template_cache_actor.rs

Messages:

  • RefreshTemplateCacheMessage - Refresh cache for namespace

Delegation: Wraps TaskTemplateManager

Purpose: Maintains handler template cache for efficient step execution.

DomainEventActor

Dispatches domain events after step completion.

Location: tasker-worker/src/worker/actors/domain_event_actor.rs

Messages:

  • DispatchDomainEventsMessage - Dispatch events for completed step

Delegation: Wraps DomainEventSystemHandle

Purpose: Fire-and-forget domain event dispatch (never blocks step completion).

WorkerStatusActor

Provides worker health and status reporting.

Location: tasker-worker/src/worker/actors/worker_status_actor.rs

Messages:

  • GetWorkerStatusMessage - Get current worker status
  • HealthCheckMessage - Perform health check
  • GetEventStatusMessage - Get event integration status
  • SetEventIntegrationMessage - Enable/disable event integration

Features:

  • Lock-free statistics via AtomicStepExecutionStats
  • AtomicU64 counters for total_executed, total_succeeded, total_failed
  • Average execution time computed on read from sum / count

Purpose: Real-time health monitoring and statistics without lock contention.

Lock-Free Statistics

The WorkerStatusActor uses atomic counters for lock-free statistics on the hot path:

#![allow(unused)]
fn main() {
/// Lock-free step execution statistics using atomic counters
#[derive(Debug)]
pub struct AtomicStepExecutionStats {
    total_executed: AtomicU64,
    total_succeeded: AtomicU64,
    total_failed: AtomicU64,
    total_execution_time_ms: AtomicU64,
}

impl AtomicStepExecutionStats {
    /// Record a successful step execution (lock-free)
    #[inline]
    pub fn record_success(&self, execution_time_ms: u64) {
        self.total_executed.fetch_add(1, Ordering::Relaxed);
        self.total_succeeded.fetch_add(1, Ordering::Relaxed);
        self.total_execution_time_ms.fetch_add(execution_time_ms, Ordering::Relaxed);
    }

    /// Record a failed step execution (lock-free)
    #[inline]
    pub fn record_failure(&self) {
        self.total_executed.fetch_add(1, Ordering::Relaxed);
        self.total_failed.fetch_add(1, Ordering::Relaxed);
    }

    /// Get a snapshot of current statistics
    pub fn snapshot(&self) -> StepExecutionStats {
        let total_executed = self.total_executed.load(Ordering::Relaxed);
        let total_time = self.total_execution_time_ms.load(Ordering::Relaxed);
        let average_execution_time_ms = if total_executed > 0 {
            total_time as f64 / total_executed as f64
        } else {
            0.0
        };
        StepExecutionStats {
            total_executed,
            total_succeeded: self.total_succeeded.load(Ordering::Relaxed),
            total_failed: self.total_failed.load(Ordering::Relaxed),
            average_execution_time_ms,
        }
    }
}
}

Benefits:

  • Zero lock contention on step completion (every step calls record_success or record_failure)
  • Sub-microsecond overhead per operation
  • Consistent averages computed from totals

Integration with Commands

ActorCommandProcessor

The command processor provides pure routing to actors:

#![allow(unused)]
fn main() {
impl ActorCommandProcessor {
    async fn handle_command(&self, command: WorkerCommand) -> bool {
        match command {
            // Step Execution Commands -> StepExecutorActor
            WorkerCommand::ExecuteStep { message, queue_name, resp } => {
                let msg = ExecuteStepMessage { message, queue_name };
                let result = self.actors.step_executor_actor.handle(msg).await;
                let _ = resp.send(result);
                true
            }

            // Completion Commands -> FFICompletionActor
            WorkerCommand::SendStepResult { result, resp } => {
                let msg = SendStepResultMessage { result };
                let send_result = self.actors.ffi_completion_actor.handle(msg).await;
                let _ = resp.send(send_result);
                true
            }

            // Status Commands -> WorkerStatusActor
            WorkerCommand::HealthCheck { resp } => {
                let result = self.actors.worker_status_actor.handle(HealthCheckMessage).await;
                let _ = resp.send(result);
                true
            }

            WorkerCommand::Shutdown { resp } => {
                let _ = resp.send(Ok(()));
                false  // Exit command loop
            }
        }
    }
}
}

FFI Completion Flow

Domain events are dispatched after successful orchestration notification:

#![allow(unused)]
fn main() {
async fn handle_ffi_completion(&self, step_result: StepExecutionResult) {
    // Record stats (lock-free)
    if step_result.success {
        self.actors.worker_status_actor
            .record_success(step_result.metadata.execution_time_ms as f64).await;
    } else {
        self.actors.worker_status_actor.record_failure().await;
    }

    // Send to orchestration FIRST
    let msg = SendStepResultMessage { result: step_result.clone() };
    match self.actors.ffi_completion_actor.handle(msg).await {
        Ok(()) => {
            // Domain events dispatched AFTER successful orchestration notification
            // Fire-and-forget - never blocks the worker
            self.actors.step_executor_actor
                .dispatch_domain_events(step_result.step_uuid, &step_result, None).await;
        }
        Err(e) => {
            // Don't dispatch domain events - orchestration wasn't notified
            tracing::error!(error = %e, "Failed to forward step completion to orchestration");
        }
    }
}
}

Service Decomposition

Large services were decomposed from the monolithic command processor:

StepExecutorService

services/step_execution/
├── mod.rs                  # Public API
├── service.rs              # StepExecutorService (~250 lines)
├── step_claimer.rs         # Step claiming logic
├── handler_invoker.rs      # Handler invocation
└── result_builder.rs       # Result construction

Key Design: Completely stateless service using &self methods. Wrapped in Arc<StepExecutorService> without any locks.
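
Because the service holds no mutable state, a single Arc can be cloned into as many concurrent tasks as needed without a Mutex or RwLock. A minimal sketch of the pattern (the service type below is a stand-in, not the real StepExecutorService):

#![allow(unused)]
fn main() {
use std::sync::Arc;

// Stand-in for a stateless service: configuration is read-only after construction
struct StatelessService {
    queue_prefix: String,
}

impl StatelessService {
    // &self only - no interior mutability, so no locks are required
    async fn execute(&self, step_id: u64) -> bool {
        tracing::debug!(queue = %self.queue_prefix, step_id, "executing");
        true
    }
}

async fn run_concurrently(service: Arc<StatelessService>) {
    let mut handles = Vec::new();
    for step_id in 0..10u64 {
        // Each task gets a cheap Arc clone; the service itself is never locked
        let svc = Arc::clone(&service);
        handles.push(tokio::spawn(async move { svc.execute(step_id).await }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}
}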

FFICompletionService

services/ffi_completion/
├── mod.rs                  # Public API
├── service.rs              # FFICompletionService
└── result_sender.rs        # Orchestration result sender

WorkerStatusService

services/worker_status/
├── mod.rs                  # Public API
└── service.rs              # WorkerStatusService

Key Architectural Decisions

1. Stateless Services

Services use &self methods with no mutable state:

#![allow(unused)]
fn main() {
impl StepExecutorService {
    pub async fn execute_step(
        &self,  // Immutable reference
        message: PgmqMessage<SimpleStepMessage>,
        queue_name: &str,
    ) -> TaskerResult<bool> {
        // Stateless execution - no mutable state
    }
}
}

Benefits:

  • Zero lock contention
  • Maximum concurrency per worker
  • Simplified reasoning about state

2. Constructor-Based Dependency Injection

All dependencies required at construction time:

#![allow(unused)]
fn main() {
pub async fn new(
    context: Arc<SystemContext>,
    worker_id: String,
    task_template_manager: Arc<TaskTemplateManager>,
    event_publisher: WorkerEventPublisher,        // Required
    domain_event_handle: DomainEventSystemHandle, // Required
) -> TaskerResult<Self>
}

Benefits:

  • Compiler enforces complete initialization
  • No “partially initialized” states
  • Clear dependency graph

3. Shared Event System

Event publisher and subscriber share the same WorkerEventSystem:

#![allow(unused)]
fn main() {
let shared_event_system = event_system
    .unwrap_or_else(|| Arc::new(WorkerEventSystem::new()));
let event_publisher =
    WorkerEventPublisher::with_event_system(worker_id.clone(), shared_event_system.clone());

// Enable subscriber with same shared system
processor.enable_event_subscriber(Some(shared_event_system)).await;
}

Benefits:

  • FFI handlers reliably receive step execution events
  • No isolated event systems causing silent failures

4. Graceful Degradation

Domain events never fail step completion:

#![allow(unused)]
fn main() {
// dispatch_domain_events returns () not TaskerResult<()>
// Errors logged but never propagated
pub async fn dispatch_domain_events(
    &self,
    step_uuid: Uuid,
    result: &StepExecutionResult,
    metadata: Option<HashMap<String, serde_json::Value>>,
) {
    // Fire-and-forget with error logging
    // Channel full? Log and continue
    // Dispatch error? Log and continue
}
}

Comparison with Orchestration Actors

Aspect | Orchestration | Worker
Actor Count | 4 actors | 5 actors
Registry | ActorRegistry | WorkerActorRegistry
Base Trait | OrchestrationActor | WorkerActor
Message Trait | Handler<M> | Handler<M> (same)
Service Design | Decomposed | Stateless
Statistics | N/A | Lock-free AtomicU64
LOC Reduction | ~800 -> ~200 | 1575 -> ~200

Benefits

1. Consistency with Orchestration

Same patterns and traits as orchestration actors:

  • Identical Handler<M> trait interface
  • Similar registry lifecycle management
  • Consistent message-based communication

2. Zero Lock Contention

  • Stateless services eliminate RwLock on hot path
  • AtomicU64 counters for statistics
  • Maximum concurrent step execution

3. Type Safety

Messages and responses checked at compile time:

#![allow(unused)]
fn main() {
// Compile error if types don't match
impl Handler<ExecuteStepMessage> for StepExecutorActor {
    async fn handle(&self, msg: ExecuteStepMessage) -> TaskerResult<bool> {
        // Must return bool, not something else
    }
}
}

4. Testability

  • Clear message boundaries for mocking
  • Isolated actor lifecycle for unit tests
  • 119 unit tests, 73 E2E tests passing

5. Maintainability

  • 1575 LOC -> ~200 LOC command processor
  • Focused services (<300 lines per file)
  • Clear separation of concerns

Detailed Analysis

For design rationale, see the Worker Decomposition ADR.

Summary

The worker actor-based architecture provides a consistent, type-safe foundation for step execution in tasker-worker. Key takeaways:

  1. Mirrors Orchestration: Same patterns as orchestration actors for consistency
  2. Lock-Free Performance: Stateless services and AtomicU64 counters
  3. Type Safety: Compile-time verification of message contracts
  4. Pure Routing: Command processor delegates without business logic
  5. Graceful Degradation: Domain events never fail step completion
  6. Production Ready: 119 unit tests, 73 E2E tests, full regression coverage

The architecture provides a solid foundation for high-throughput step execution while maintaining the proven reliability of the orchestration system.


<- Back to Documentation Hub

Worker Event Systems Architecture

Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Worker Actors | Events and Commands | Messaging Abstraction

<- Back to Documentation Hub


This document provides comprehensive documentation of the worker event system architecture in tasker-worker, covering the dual-channel event pattern, domain event publishing, and FFI integration.

Overview

The worker event system implements a dual-channel architecture for non-blocking step execution:

  1. WorkerEventSystem: Receives step execution events via provider-agnostic subscriptions
  2. HandlerDispatchService: Fire-and-forget handler invocation with bounded concurrency
  3. CompletionProcessorService: Routes results back to orchestration
  4. DomainEventSystem: Fire-and-forget domain event publishing

Messaging Backend Support: The worker event system supports multiple messaging backends (PGMQ, RabbitMQ) through a provider-agnostic abstraction. See Messaging Abstraction for details.

This architecture enables true parallel handler execution while maintaining strict ordering guarantees for domain events.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                           WORKER EVENT FLOW                                  │
└─────────────────────────────────────────────────────────────────────────────┘

                    MessagingProvider (PGMQ or RabbitMQ)
                                  │
                                  │ provider.subscribe_many()
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         WorkerEventSystem                                    │
│  ┌──────────────────────┐    ┌──────────────────────┐                       │
│  │  WorkerQueueListener │    │  WorkerFallbackPoller │                      │
│  │  (provider-agnostic) │    │  (PGMQ only)          │                      │
│  └──────────┬───────────┘    └──────────┬───────────┘                       │
│             │                           │                                    │
│             └───────────┬───────────────┘                                    │
│                         │                                                    │
│                         ▼                                                    │
│   MessageNotification::Message → ExecuteStepFromMessage (RabbitMQ)          │
│   MessageNotification::Available → ExecuteStepFromEvent (PGMQ)              │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      ActorCommandProcessor                                   │
│                              │                                               │
│                              ▼                                               │
│                      StepExecutorActor                                       │
│                              │                                               │
│                              │ claim step, send to dispatch channel          │
│                              ▼                                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                  │
                    ┌─────────────┴─────────────┐
                    │                           │
           Rust Workers               FFI Workers (Ruby/Python)
                    │                           │
                    ▼                           ▼
┌───────────────────────────────┐   ┌───────────────────────────────┐
│   HandlerDispatchService      │   │     FfiDispatchChannel        │
│                               │   │                               │
│   dispatch_receiver           │   │   pending_events HashMap      │
│         │                     │   │         │                     │
│         ▼                     │   │         ▼                     │
│   [Semaphore] N permits       │   │   poll_step_events()          │
│         │                     │   │         │                     │
│         ▼                     │   │         ▼                     │
│   handler.call()              │   │   Ruby/Python handler         │
│         │                     │   │         │                     │
│         ▼                     │   │         ▼                     │
│   PostHandlerCallback         │   │   complete_step_event()       │
│         │                     │   │         │                     │
│         ▼                     │   │         ▼                     │
│   completion_sender           │   │   PostHandlerCallback         │
│                               │   │         │                     │
└───────────────┬───────────────┘   │         ▼                     │
                │                   │   completion_sender           │
                │                   │                               │
                │                   └───────────────┬───────────────┘
                │                                   │
                └───────────────┬───────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CompletionProcessorService                                │
│                              │                                               │
│                              ▼                                               │
│                    FFICompletionService                                      │
│                              │                                               │
│                              ▼                                               │
│               orchestration_step_results queue                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
                         Orchestration

Core Components

1. WorkerEventSystem

Location: tasker-worker/src/worker/event_systems/worker_event_system.rs

Implements the EventDrivenSystem trait for worker namespace queue processing. Supports three deployment modes with provider-agnostic message handling:

Mode | Description | PGMQ Behavior | RabbitMQ Behavior
PollingOnly | Traditional polling | Poll PGMQ tables | Poll via basic_get
EventDrivenOnly | Pure push delivery | pg_notify signals | basic_consume push
Hybrid | Event-driven + polling | pg_notify + fallback | Push only (no fallback)

Provider-Specific Behavior:

  • PGMQ: Uses MessageNotification::Available (signal-only), requires fallback polling
  • RabbitMQ: Uses MessageNotification::Message (full payload), no fallback needed

Key Features:

  • Unified configuration via WorkerEventSystemConfig
  • Atomic statistics with AtomicU64 counters
  • Converts WorkerNotification to WorkerCommand for processing
#![allow(unused)]
fn main() {
// Worker notification to command conversion (provider-agnostic)
match notification {
    // RabbitMQ style - full message delivered
    WorkerNotification::Message(msg) => {
        command_sender.send(WorkerCommand::ExecuteStepFromMessage {
            queue_name: msg.queue_name.clone(),
            message: msg,
            resp: resp_tx,
        }).await;
    }
    // PGMQ style - signal-only, requires fetch
    WorkerNotification::Event(WorkerQueueEvent::StepMessage(msg_event)) => {
        command_sender.send(WorkerCommand::ExecuteStepFromEvent {
            message_event: msg_event,
            resp: resp_tx,
        }).await;
    }
    // ...
}
}

2. HandlerDispatchService

Location: tasker-worker/src/worker/handlers/dispatch_service.rs

Non-blocking handler dispatch with bounded parallelism.

Architecture:

dispatch_receiver → [Semaphore] → handler.call() → [callback] → completion_sender
                         │                              │
                         └─→ Bounded to N concurrent    └─→ Domain events
                              tasks

Key Design Decisions:

  1. Semaphore-Bounded Concurrency: Limits concurrent handlers to prevent resource exhaustion
  2. Permit Release Before Send: Prevents backpressure cascade
  3. Post-Handler Callback: Domain events fire only after result is committed
#![allow(unused)]
fn main() {
tokio::spawn(async move {
    let permit = semaphore.acquire().await?;

    let result = execute_with_timeout(&registry, &msg, timeout).await;

    // Release permit BEFORE sending to completion channel
    drop(permit);

    // Send result FIRST
    sender.send(result.clone()).await?;

    // Callback fires AFTER result is committed
    if let Some(cb) = callback {
        cb.on_handler_complete(&step, &result, &worker_id).await;
    }
});
}

Error Handling:

Scenario | Behavior
Handler timeout | StepExecutionResult::failure() with error_type=handler_timeout
Handler panic | Caught via catch_unwind(), failure result generated
Handler error | Failure result with error_type=handler_error
Semaphore closed | Failure result with error_type=semaphore_acquisition_failed

Handler Resolution

Before handler execution, the dispatch service resolves the handler using a resolver chain pattern:

HandlerDefinition                    ResolverChain                    Handler
     │                                    │                              │
     │  callable: "process_payment"       │                              │
     │  method: "refund"                  │                              │
     │  resolver: null                    │                              │
     │                                    │                              │
     ├───────────────────────────────────►│                              │
     │                                    │                              │
     │                    ┌───────────────┴───────────────┐              │
     │                    │ ExplicitMappingResolver (10)  │              │
     │                    │ can_resolve? ─► YES           │              │
     │                    │ resolve() ─────────────────────────────────►│
     │                    └───────────────────────────────┘              │
     │                                                                   │
     │                    ┌───────────────────────────────┐              │
     │                    │ MethodDispatchWrapper         │              │
     │                    │ (if method != "call")         │◄─────────────┤
     │                    └───────────────────────────────┘              │

Built-in Resolvers:

Resolver | Priority | Function
ExplicitMappingResolver | 10 | Hash lookup of registered handlers
ClassConstantResolver | 100 | Runtime class lookup (Ruby only)
ClassLookupResolver | 100 | Runtime class lookup (Python/TypeScript only)

Method Dispatch: When handler.method is specified and not "call", a MethodDispatchWrapper is applied to invoke the specified method instead of the default call() method.
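
Conceptually, the chain tries resolvers in ascending priority order and the first one that can resolve the definition wins. The sketch below is a hypothetical illustration of that shape in Rust; the trait and type names are invented here and are not the binding APIs:

#![allow(unused)]
fn main() {
// Hypothetical resolver-chain shape, invented for illustration
struct HandlerDefinition {
    callable: String,
    method: Option<String>,
}

struct ResolvedHandler {
    name: String,
}

trait HandlerResolver {
    fn priority(&self) -> u32;
    fn can_resolve(&self, definition: &HandlerDefinition) -> bool;
    fn resolve(&self, definition: &HandlerDefinition) -> Option<ResolvedHandler>;
}

struct ResolverChain {
    resolvers: Vec<Box<dyn HandlerResolver>>,
}

impl ResolverChain {
    fn resolve(&self, definition: &HandlerDefinition) -> Option<ResolvedHandler> {
        // Lower priority value wins first: ExplicitMappingResolver (10) is tried
        // before the class-lookup resolvers (100)
        let mut ordered: Vec<&Box<dyn HandlerResolver>> = self.resolvers.iter().collect();
        ordered.sort_by_key(|r| r.priority());

        let handler = ordered
            .into_iter()
            .find(|r| r.can_resolve(definition))
            .and_then(|r| r.resolve(definition));

        // A MethodDispatchWrapper would then be applied here when
        // definition.method is Some and not "call"
        handler
    }
}
}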

See Handler Resolution Guide for complete documentation.

3. FfiDispatchChannel

Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs

Pull-based polling interface for FFI workers (Ruby, Python). Enables language-specific handlers without complex FFI memory management.

Flow:

Rust                           Ruby/Python
  │                                 │
  │  dispatch(step)                 │
  │ ──────────────────────────────► │
  │                                 │ pending_events.insert()
  │                                 │
  │  poll_step_events()             │
  │ ◄────────────────────────────── │
  │                                 │
  │                                 │ handler.call()
  │                                 │
  │  complete_step_event(result)    │
  │ ◄────────────────────────────── │
  │                                 │
  │  PostHandlerCallback            │
  │  completion_sender.send()       │
  │                                 │

Key Features:

  • Thread-safe pending events map with lock poisoning recovery
  • Configurable completion timeout (default 30s)
  • Starvation detection and warnings
  • Fire-and-forget callbacks via runtime_handle.spawn()

4. CompletionProcessorService

Location: tasker-worker/src/worker/handlers/completion_processor.rs

Receives completed step results and routes to orchestration queue via FFICompletionService.

completion_receiver → CompletionProcessorService → FFICompletionService → orchestration_step_results

Note: Currently processes completions sequentially. Parallel processing is planned as a future enhancement.

5. DomainEventSystem

Location: tasker-worker/src/worker/event_systems/domain_event_system.rs

Async system for fire-and-forget domain event publishing.

Architecture:

command_processor.rs                  DomainEventSystem
      │                                     │
      │ try_send(command)                   │ spawn process_loop()
      ▼                                     ▼
mpsc::Sender<DomainEventCommand>  →  mpsc::Receiver
                                            │
                                            ▼
                                    EventRouter → PGMQ / InProcess

Key Design:

  • try_send() never blocks - if channel is full, events are dropped with metrics
  • Background task processes commands asynchronously
  • Graceful shutdown drains fast events up to configurable timeout
  • Three delivery modes: Durable (PGMQ), Fast (in-process), Broadcast

Shared Event Abstractions

EventDrivenSystem Trait

Location: tasker-shared/src/event_system/event_driven.rs

Unified trait for all event-driven systems:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait EventDrivenSystem: Send + Sync {
    type SystemId: Send + Sync + Clone;
    type Event: Send + Sync + Clone;
    type Config: Send + Sync + Clone;
    type Statistics: EventSystemStatistics;

    fn system_id(&self) -> Self::SystemId;
    fn deployment_mode(&self) -> DeploymentMode;
    fn is_running(&self) -> bool;

    async fn start(&mut self) -> Result<(), DeploymentModeError>;
    async fn stop(&mut self) -> Result<(), DeploymentModeError>;
    async fn process_event(&self, event: Self::Event) -> Result<(), DeploymentModeError>;
    async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError>;

    fn statistics(&self) -> Self::Statistics;
    fn config(&self) -> &Self::Config;
}
}

Deployment Modes

Location: tasker-shared/src/event_system/deployment.rs

#![allow(unused)]
fn main() {
pub enum DeploymentMode {
    PollingOnly,      // Traditional polling, no events
    EventDrivenOnly,  // Pure event-driven, no polling
    Hybrid,           // Event-driven with polling fallback
}
}

PostHandlerCallback Trait

Location: tasker-worker/src/worker/handlers/dispatch_service.rs

Extensibility point for post-handler actions:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait PostHandlerCallback: Send + Sync + 'static {
    /// Called after a handler completes
    async fn on_handler_complete(
        &self,
        step: &TaskSequenceStep,
        result: &StepExecutionResult,
        worker_id: &str,
    );

    /// Name of this callback for logging purposes
    fn name(&self) -> &str;
}
}
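
A concrete callback only needs to implement this trait. The sketch below follows the shape of DomainEventCallback, but the struct, field, and body are assumptions for illustration rather than the real implementation:

#![allow(unused)]
fn main() {
// Illustrative sketch only; the real DomainEventCallback lives in tasker-worker
pub struct ExampleDomainEventCallback {
    handle: DomainEventSystemHandle,
}

#[async_trait]
impl PostHandlerCallback for ExampleDomainEventCallback {
    async fn on_handler_complete(
        &self,
        step: &TaskSequenceStep,
        result: &StepExecutionResult,
        worker_id: &str,
    ) {
        // Fire-and-forget: failures here are logged, never propagated to the step
        tracing::debug!(worker_id, success = result.success, "handler complete");
        // ... build domain events from `step` and `result`, then hand them to the
        // DomainEventSystem handle (which uses try_send internally)
    }

    fn name(&self) -> &str {
        "example_domain_event_callback"
    }
}
}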

Implementations:

  • NoOpCallback: Default no-operation callback
  • DomainEventCallback: Publishes domain events to DomainEventSystem

Configuration

Worker Event System

# config/tasker/base/event_systems.toml
[event_systems.worker]
system_id = "worker-event-system"
deployment_mode = "Hybrid"

[event_systems.worker.metadata.listener]
retry_interval_seconds = 5
max_retry_attempts = 3
event_timeout_seconds = 60
batch_processing = true
connection_timeout_seconds = 30

[event_systems.worker.metadata.fallback_poller]
enabled = true
polling_interval_ms = 100
batch_size = 10
age_threshold_seconds = 30
max_age_hours = 24
visibility_timeout_seconds = 60

Handler Dispatch

# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000

[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000
completion_timeout_ms = 30000
starvation_warning_threshold_ms = 10000
callback_timeout_ms = 5000
completion_send_timeout_ms = 10000

Integration with Worker Actors

The event systems integrate with the worker actor architecture:

WorkerEventSystem
       │
       ▼
ActorCommandProcessor
       │
       ├──► StepExecutorActor ──► dispatch_sender
       │
       ├──► FFICompletionActor ◄── completion_receiver
       │
       └──► DomainEventActor ◄── PostHandlerCallback

See Worker Actors Documentation for actor details.

Event Flow Guarantees

Ordering Guarantee

Domain events fire AFTER result is committed to completion channel:

handler.call()
    → result committed to completion_sender
    → PostHandlerCallback.on_handler_complete()
    → domain events dispatched

This eliminates race conditions where downstream systems see events before orchestration processes results.

Idempotency Guarantee

State machine guards prevent duplicate execution:

  1. Step claimed atomically via transition_step_state_atomic()
  2. State guards reject duplicate claims
  3. Results are deduplicated by completion channel

Fire-and-Forget Guarantee

Domain event failures never fail step completion:

#![allow(unused)]
fn main() {
// DomainEventCallback
pub async fn on_handler_complete(
    &self,
    step: &TaskSequenceStep,
    result: &StepExecutionResult,
    worker_id: &str,
) {
    // dispatch_events uses try_send() - never blocks
    // If channel full, events dropped with metrics
    // Step completion is NOT affected
    self.handle.dispatch_events(events, publisher_name, correlation_id);
}
}

Monitoring

Key Metrics

Metric | Description
tasker.worker.events_processed | Total events processed
tasker.worker.events_failed | Events that failed processing
tasker.ffi.pending_events | Pending FFI events (starvation indicator)
tasker.ffi.oldest_event_age_ms | Age of oldest pending event
tasker.channel.completion.saturation | Completion channel utilization
tasker.domain_events.dispatched | Domain events dispatched
tasker.domain_events.dropped | Domain events dropped (backpressure)

Health Checks

#![allow(unused)]
fn main() {
async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError> {
    if self.is_running.load(Ordering::Acquire) {
        Ok(DeploymentModeHealthStatus::Healthy)
    } else {
        Ok(DeploymentModeHealthStatus::Critical)
    }
}
}

Backpressure Handling

The worker event system implements multiple backpressure mechanisms to ensure graceful degradation under load while preserving step idempotency.

Backpressure Points

┌─────────────────────────────────────────────────────────────────────────────┐
│                      WORKER BACKPRESSURE FLOW                                │
└─────────────────────────────────────────────────────────────────────────────┘

[1] Step Claiming
    │
    ├── Planned: Capacity check before claiming
    │   └── If at capacity: Leave message in queue (visibility timeout)
    │
    ▼
[2] Handler Dispatch Channel (Bounded)
    │
    ├── dispatch_buffer_size = 1000
    │   └── If full: Sender blocks until space available
    │
    ▼
[3] Semaphore-Bounded Execution
    │
    ├── max_concurrent_handlers = 10
    │   └── If permits exhausted: Task waits for permit
    │
    ├── CRITICAL: Permit released BEFORE sending to completion channel
    │   └── Prevents backpressure cascade
    │
    ▼
[4] Completion Channel (Bounded)
    │
    ├── completion_buffer_size = 1000
    │   └── If full: Handler task blocks until space available
    │
    ▼
[5] Domain Events (Fire-and-Forget)
    │
    └── try_send() semantics
        └── If channel full: Events DROPPED (step execution unaffected)

Handler Dispatch Backpressure

The HandlerDispatchService uses semaphore-bounded parallelism:

#![allow(unused)]
fn main() {
// Permit acquisition blocks if all permits in use
let permit = semaphore.acquire().await?;

let result = execute_with_timeout(&registry, &msg, timeout).await;

// CRITICAL: Release permit BEFORE sending to completion channel
// This prevents backpressure cascade where full completion channel
// holds permits, starving new handler execution
drop(permit);

// Now send to completion channel (may block if full)
sender.send(result).await?;
}

Why permit release before send matters:

  • If completion channel is full, handler task blocks on send
  • If permit is held during block, no new handlers can start
  • By releasing permit first, new handlers can start even if completions are backing up

FFI Dispatch Backpressure

The FfiDispatchChannel handles backpressure for Ruby/Python workers:

Scenario | Behavior
Dispatch channel full | Sender blocks
FFI polling too slow | Starvation warning logged
Completion timeout | Failure result generated
Callback timeout | Callback fire-and-forget, logged

Starvation Detection:

[worker.mpsc_channels.ffi_dispatch]
starvation_warning_threshold_ms = 10000  # Warn if event waits > 10s

Domain Event Drop Semantics

Domain events use try_send() and are explicitly designed to be droppable:

#![allow(unused)]
fn main() {
// Domain events fire AFTER result is committed
// They are non-critical and use fire-and-forget semantics
match event_sender.try_send(event) {
    Ok(()) => { /* Event dispatched */ }
    Err(TrySendError::Full(_)) => {
        // Event dropped - step execution NOT affected
        warn!("Domain event dropped: channel full");
        metrics.increment("domain_events_dropped");
    }
    Err(TrySendError::Closed(_)) => {
        // Receiver gone (e.g. during shutdown) - likewise non-fatal for the step
        warn!("Domain event dropped: channel closed");
    }
}
}

Why this is safe: Domain events are informational. Dropping them does not affect step execution correctness. The step result is already committed to the completion channel before domain events fire.

Step Claiming Backpressure (Planned)

Future enhancement: Workers will check capacity before claiming steps:

#![allow(unused)]
fn main() {
// Planned implementation
fn should_claim_step(&self) -> bool {
    let available = self.semaphore.available_permits();
    let threshold = self.config.claim_capacity_threshold;  // e.g., 0.8
    let max = self.config.max_concurrent_handlers;

    available as f64 / max as f64 > (1.0 - threshold)
}
}

If at capacity:

  • Worker does NOT acknowledge the PGMQ message
  • Message returns to queue after visibility timeout
  • Another worker (or same worker later) claims it

Idempotency Under Backpressure

All backpressure mechanisms preserve step idempotency:

Backpressure Point | Idempotency Guarantee
Claim refusal | Message stays in queue, visibility timeout protects
Dispatch channel full | Step claimed but queued for execution
Semaphore wait | Step claimed, waiting for permit
Completion channel full | Handler completed, result buffered
Domain event drop | Non-critical, step result already persisted

Critical Rule: A claimed step MUST produce a result (success or failure). Backpressure may delay but never drop step execution.

For comprehensive backpressure strategy, see Backpressure Architecture.

Best Practices

1. Choose Deployment Mode

  • Production: Use Hybrid for reliability with event-driven performance
  • Development: Use EventDrivenOnly for fastest iteration
  • Restricted environments: Use PollingOnly when pg_notify unavailable

2. Tune Concurrency

[worker.mpsc_channels.handler_dispatch]
max_concurrent_handlers = 10  # Start here, increase based on monitoring

Monitor:

  • Semaphore wait times
  • Handler execution latency
  • Completion channel saturation

3. Configure Timeouts

handler_timeout_ms = 30000        # Match your slowest handler
completion_timeout_ms = 30000     # FFI completion timeout
callback_timeout_ms = 5000        # Domain event callback timeout

4. Monitor Starvation

For FFI workers, monitor pending event age:

# Ruby
metrics = Tasker.ffi_dispatch_metrics
if metrics[:oldest_pending_age_ms] > 10000
  warn "FFI polling falling behind"
end

<- Back to Documentation Hub

Tasker Core Guides

This directory contains practical how-to guides for working with Tasker Core.

Documents

Document | Description
Quick Start | Get running in 5 minutes
Use Cases and Patterns | Practical workflow examples
Conditional Workflows | Runtime decision-making and dynamic steps
Batch Processing | Parallel processing with cursor-based workers
DLQ System | Dead letter queue investigation and resolution
Retry Semantics | Understanding max_attempts and retryable flags
Identity Strategy | Task deduplication with STRICT, CALLER_PROVIDED, ALWAYS_UNIQUE
Configuration Management | TOML architecture, CLI tools, runtime observability

When to Read These

  • Getting started: Begin with Quick Start
  • Implementing features: Check Use Cases and Patterns
  • Handling errors: See Retry Semantics and DLQ System
  • Processing data: Review Batch Processing
  • Deploying: Consult Configuration Management

Related documentation:

  • Architecture - The “what” - system structure
  • Principles - The “why” - design philosophy
  • Workers - Language-specific handler development

API Security Guide

API-level security for orchestration (8080) and worker (8081) endpoints using JWT bearer tokens and API key authentication with permission-based access control.

Security is disabled by default for backward compatibility. Enable it explicitly in configuration.

See also: Auth Documentation Hub for architecture overview, Permissions for route mapping, Configuration for full reference, Testing for E2E test patterns.


Quick Start

1. Generate Keys

cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys

2. Generate a Token

cargo run --bin tasker-ctl -- auth generate-token \
  --private-key ./keys/jwt-private-key.pem \
  --permissions "tasks:create,tasks:read,tasks:list,steps:read" \
  --subject my-service \
  --expiry-hours 24

3. Enable Auth in Configuration

In config/tasker/base/orchestration.toml:

[auth]
enabled = true
jwt_public_key_path = "./keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"

4. Use the Token

export TASKER_AUTH_TOKEN=<generated-token>
cargo run --bin tasker-ctl -- task list

Or with curl:

curl -H "Authorization: Bearer $TASKER_AUTH_TOKEN" http://localhost:8080/v1/tasks

Permission Vocabulary

Permission | Resource | Description
tasks:create | tasks | Create new tasks
tasks:read | tasks | Read task details
tasks:list | tasks | List tasks
tasks:cancel | tasks | Cancel running tasks
tasks:context_read | tasks | Read task context data
steps:read | steps | Read workflow step details
steps:resolve | steps | Manually resolve steps
dlq:read | dlq | Read DLQ entries
dlq:update | dlq | Update DLQ investigations
dlq:stats | dlq | View DLQ statistics
templates:read | templates | Read task templates
templates:validate | templates | Validate templates
system:config_read | system | Read system configuration
system:handlers_read | system | Read handler registry
system:analytics_read | system | Read analytics data
worker:config_read | worker | Read worker configuration
worker:templates_read | worker | Read worker templates

Wildcards

  • tasks:* - All task permissions
  • steps:* - All step permissions
  • dlq:* - All DLQ permissions
  • * - All permissions (superuser); see the matching sketch below
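
For intuition, the wildcard check reduces to a simple string comparison. The sketch below is hypothetical and not Tasker's actual implementation:

#![allow(unused)]
fn main() {
// Hypothetical permission check illustrating wildcard semantics
fn grants(granted: &str, required: &str) -> bool {
    if granted == "*" {
        return true; // superuser
    }
    if let Some(resource) = granted.strip_suffix(":*") {
        // e.g. "tasks:*" grants "tasks:create", "tasks:read", ...
        return required.split(':').next() == Some(resource);
    }
    granted == required
}

fn has_permission(token_permissions: &[String], required: &str) -> bool {
    token_permissions.iter().any(|p| grants(p, required))
}

// has_permission(&["tasks:*".to_string()], "tasks:create") -> true
// has_permission(&["tasks:read".to_string()], "tasks:create") -> false
}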

Show All Permissions

cargo run --bin tasker-ctl -- auth show-permissions

Configuration Reference

Server-Side (orchestration.toml / worker.toml)

[auth]
enabled = true

# JWT Configuration
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
jwt_token_expiry_hours = 24

# Key Configuration (one of these):
jwt_public_key_path = "./keys/jwt-public-key.pem"   # File path (preferred)
jwt_public_key = "-----BEGIN RSA PUBLIC KEY-----..." # Inline PEM
# Or set env: TASKER_JWT_PUBLIC_KEY_PATH

# JWKS (for dynamic key rotation)
jwt_verification_method = "jwks"  # "public_key" (default) or "jwks"
jwks_url = "https://auth.example.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600

# Permission validation
permissions_claim = "permissions"   # JWT claim containing permissions
strict_validation = true            # Reject tokens with unknown permissions
log_unknown_permissions = true

# API Key Authentication
api_key_header = "X-API-Key"
api_keys_enabled = true

[[auth.api_keys]]
key = "sk-prod-key-1"
permissions = ["tasks:read", "tasks:list", "steps:read"]
description = "Read-only monitoring service"

[[auth.api_keys]]
key = "sk-admin-key"
permissions = ["*"]
description = "Admin key"

Client-Side (Environment Variables)

Variable | Description
TASKER_AUTH_TOKEN | Bearer token for both APIs
TASKER_ORCHESTRATION_AUTH_TOKEN | Override token for orchestration only
TASKER_WORKER_AUTH_TOKEN | Override token for worker only
TASKER_API_KEY | API key (fallback if no token)
TASKER_API_KEY_HEADER | Custom header name (default: X-API-Key)

Priority: endpoint-specific token > global token > API key > config file.
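
As an illustration of that precedence, a client might resolve its credential roughly as follows. This is a sketch only; the real clients also fall back to the config file, which is elided here, and the Credential enum is invented for the example:

#![allow(unused)]
fn main() {
use std::env;

// Hypothetical credential type for the example
enum Credential {
    Bearer(String),
    ApiKey(String),
}

// Endpoint-specific token > global token > API key (> config file, not shown)
fn resolve_orchestration_credential() -> Option<Credential> {
    env::var("TASKER_ORCHESTRATION_AUTH_TOKEN")
        .or_else(|_| env::var("TASKER_AUTH_TOKEN"))
        .map(Credential::Bearer)
        .or_else(|_| env::var("TASKER_API_KEY").map(Credential::ApiKey))
        .ok()
}
}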


JWT Token Structure

{
  "sub": "my-service",
  "iss": "tasker-core",
  "aud": "tasker-api",
  "iat": 1706000000,
  "exp": 1706086400,
  "permissions": [
    "tasks:create",
    "tasks:read",
    "tasks:list",
    "steps:read"
  ],
  "worker_namespaces": []
}

Common Role Patterns

Read-only operator:

permissions: ["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]

Task submitter:

permissions: ["tasks:create", "tasks:read", "tasks:list"]

Ops admin:

permissions: ["tasks:*", "steps:*", "dlq:*", "system:*"]

Worker service:

permissions: ["worker:config_read", "worker:templates_read"]

Superuser:

permissions: ["*"]

Public Endpoints

These endpoints never require authentication:

  • GET /health - Basic health check
  • GET /health/detailed - Detailed health
  • GET /metrics - Prometheus metrics

API Key Authentication

API keys are validated against a configured registry. Each key has its own set of permissions.

# Using API key
curl -H "X-API-Key: sk-prod-key-1" http://localhost:8080/v1/tasks

API keys are simpler than JWTs but have limitations:

  • No expiration (rotate by removing from config)
  • No claims beyond permissions
  • Best for service-to-service communication with static permissions

Error Responses

401 Unauthorized (Missing/Invalid Credentials)

{
  "error": "unauthorized",
  "message": "Missing authentication credentials"
}

403 Forbidden (Insufficient Permissions)

{
  "error": "forbidden",
  "message": "Missing required permission: tasks:create"
}

Migration Guide: Disabled to Enabled

  1. Generate keys and distribute the public key to server config
  2. Generate tokens for each service/user with appropriate permissions
  3. Set enabled = true in auth config
  4. Deploy - services without valid tokens will get 401 responses
  5. Monitor the tasker.auth.failures.total metric for issues

All endpoints remain accessible without auth when enabled = false.


Observability

Structured Logs

  • info on successful authentication (subject, method)
  • warn on authentication failure (error details)
  • warn on permission denial (subject, required permission)

Prometheus Metrics

Metric | Type | Labels
tasker.auth.requests.total | Counter | method, result
tasker.auth.failures.total | Counter | reason
tasker.permission.denials.total | Counter | permission
tasker.auth.jwt.verification.duration | Histogram | result

CLI Auth Commands

# Generate RSA key pair
tasker-ctl auth generate-keys [--output-dir ./keys] [--key-size 2048]

# Generate JWT token
tasker-ctl auth generate-token \
  --permissions tasks:create,tasks:read \
  --subject my-service \
  --private-key ./keys/jwt-private-key.pem \
  --expiry-hours 24

# List all permissions
tasker-ctl auth show-permissions

# Validate a token
tasker-ctl auth validate-token \
  --token <JWT> \
  --public-key ./keys/jwt-public-key.pem

gRPC Authentication

gRPC endpoints support the same authentication methods as REST, using gRPC metadata instead of HTTP headers.

gRPC Ports

Service | REST Port | gRPC Port
Orchestration | 8080 | 9190
Rust Worker | 8081 | 9191

Bearer Token (gRPC)

# Using grpcurl with Bearer token
grpcurl -plaintext \
  -H "Authorization: Bearer $TASKER_AUTH_TOKEN" \
  localhost:9190 tasker.v1.TaskService/ListTasks

API Key (gRPC)

# Using grpcurl with API key
grpcurl -plaintext \
  -H "X-API-Key: sk-prod-key-1" \
  localhost:9190 tasker.v1.TaskService/ListTasks

gRPC Client Configuration

#![allow(unused)]
fn main() {
use tasker_client::grpc_clients::{OrchestrationGrpcClient, GrpcAuthConfig};

// With API key
let client = OrchestrationGrpcClient::connect_with_auth(
    "http://localhost:9190",
    GrpcAuthConfig::ApiKey("sk-prod-key-1".to_string()),
).await?;

// With Bearer token
let client = OrchestrationGrpcClient::connect_with_auth(
    "http://localhost:9190",
    GrpcAuthConfig::Bearer("eyJ...".to_string()),
).await?;
}

gRPC Error Codes

gRPC Status | HTTP Equivalent | Meaning
UNAUTHENTICATED | 401 | Missing or invalid credentials
PERMISSION_DENIED | 403 | Valid credentials but insufficient permissions
NOT_FOUND | 404 | Resource not found
UNAVAILABLE | 503 | Service unavailable

Public gRPC Endpoints

These endpoints never require authentication:

  • HealthService/CheckHealth - Basic health check
  • HealthService/CheckLiveness - Kubernetes liveness probe
  • HealthService/CheckReadiness - Kubernetes readiness probe
  • HealthService/CheckDetailedHealth - Detailed health metrics

Security Considerations

  • Key storage: Private keys should never be committed to git. Use file paths or environment variables.
  • Token expiry: Set appropriate expiry times. Short-lived tokens (1-24h) are preferred.
  • Least privilege: Grant only the permissions each service needs.
  • Key rotation: Use JWKS for automatic key rotation in production.
  • API key rotation: Remove old keys from config and redeploy.
  • Audit: Monitor tasker.auth.failures.total and tasker.permission.denials.total for anomalies.

External Auth Provider Integration

Integrating Tasker’s API security with external identity providers via JWKS endpoints.

See also: Auth Documentation Hub for architecture overview, Configuration for full TOML reference.


JWKS Integration

Tasker supports JWKS (JSON Web Key Set) for dynamic public key discovery. This enables key rotation without redeploying Tasker.

Configuration

[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://your-provider.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://your-provider.com/"
jwt_audience = "tasker-api"
permissions_claim = "permissions"  # or custom claim name

How It Works

  1. On first request, Tasker fetches the JWKS from the configured URL
  2. Keys are cached for the configured refresh interval
  3. When a token has an unknown kid (Key ID), a refresh is triggered
  4. RSA keys are parsed from the JWK n and e components

Auth0

Auth0 Configuration

  1. Create an API in Auth0 Dashboard:

    • Name: Tasker API
    • Identifier: tasker-api (this becomes the audience)
    • Signing Algorithm: RS256
  2. Create permissions in the API settings matching Tasker’s vocabulary:

    • tasks:create, tasks:read, tasks:list, etc.
  3. Assign permissions to users/applications via Auth0 roles

Tasker Configuration for Auth0

[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://YOUR_DOMAIN.auth0.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://YOUR_DOMAIN.auth0.com/"
jwt_audience = "tasker-api"
permissions_claim = "permissions"

Token Request

curl --request POST \
  --url https://YOUR_DOMAIN.auth0.com/oauth/token \
  --header 'content-type: application/json' \
  --data '{
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "audience": "tasker-api",
    "grant_type": "client_credentials"
  }'

Keycloak

Keycloak Configuration

  1. Create a realm and client for Tasker
  2. Define client roles matching Tasker permissions
  3. Configure the client to include roles in the permissions token claim via a protocol mapper

Tasker Configuration for Keycloak

[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://keycloak.example.com/realms/YOUR_REALM/protocol/openid-connect/certs"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://keycloak.example.com/realms/YOUR_REALM"
jwt_audience = "tasker-api"
permissions_claim = "permissions"  # Configure via protocol mapper

Okta

Okta Configuration

  1. Create an API authorization server
  2. Add custom claims for permissions
  3. Define scopes matching Tasker permissions

Tasker Configuration for Okta

[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://YOUR_DOMAIN.okta.com/oauth2/YOUR_AUTH_SERVER_ID/v1/keys"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://YOUR_DOMAIN.okta.com/oauth2/YOUR_AUTH_SERVER_ID"
jwt_audience = "tasker-api"
permissions_claim = "scp"  # Okta uses "scp" for scopes by default

Custom JWKS Endpoint

Any provider that serves a standard JWKS endpoint works. The endpoint must return:

{
  "keys": [
    {
      "kty": "RSA",
      "kid": "key-id-1",
      "use": "sig",
      "alg": "RS256",
      "n": "<base64url-encoded modulus>",
      "e": "<base64url-encoded exponent>"
    }
  ]
}
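
On the Tasker side, verification only needs the keys cached by kid and refreshed when the cache is stale or a token arrives with an unknown kid. The following is a hedged sketch of that refresh decision, not Tasker's implementation; the HTTP fetch and RSA parsing are elided:

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Placeholder for an RSA public key parsed from a JWK's `n` and `e` fields
struct CachedKey;

struct JwksCache {
    keys: HashMap<String, CachedKey>, // keyed by `kid`
    last_refresh: Instant,
    refresh_interval: Duration, // e.g. jwks_refresh_interval_seconds = 3600
}

impl JwksCache {
    fn key_for(&mut self, kid: &str) -> Option<&CachedKey> {
        let stale = self.last_refresh.elapsed() >= self.refresh_interval;
        // Refresh when the cache is stale OR the token carries an unknown kid
        if stale || !self.keys.contains_key(kid) {
            self.refresh();
        }
        self.keys.get(kid)
    }

    fn refresh(&mut self) {
        // Fetch the JWKS document from jwks_url, parse each key's `n`/`e` into
        // an RSA public key, and replace the cached map (elided)
        self.last_refresh = Instant::now();
    }
}
}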

Static Public Key (Development)

For development or simple deployments without a JWKS endpoint:

[auth]
enabled = true
jwt_verification_method = "public_key"
jwt_public_key_path = "/etc/tasker/keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"

Generate keys with:

tasker-ctl auth generate-keys --output-dir /etc/tasker/keys

Permission Claim Mapping

If your identity provider uses a different claim name for permissions:

permissions_claim = "custom_permissions"  # Default: "permissions"

The claim must be a JSON array of strings:

{
  "custom_permissions": ["tasks:create", "tasks:read"]
}

Strict Validation

When strict_validation = true (default), tokens containing unknown permission strings are rejected. Set to false if your provider includes additional scopes/permissions not in Tasker’s vocabulary:

strict_validation = false
log_unknown_permissions = true  # Still log unknown permissions for monitoring

Batch Processing in Tasker

Last Updated: 2026-01-06 Status: Production Ready Related: Conditional Workflows, DLQ System



Overview

Batch processing in Tasker enables parallel processing of large datasets by dynamically creating worker steps at runtime. A single “batchable” step analyzes a workload and instructs orchestration to create N worker instances, each processing a subset of data using cursor-based boundaries.

Key Characteristics

Dynamic Worker Creation: Workers are created at runtime based on dataset analysis, using step templates predefined in the task template for structure but scaled according to need.

Cursor-Based Resumability: Each worker processes a specific range (cursor) and can resume from checkpoints on failure.

Deferred Convergence: Aggregation steps use intersection semantics to wait for all created workers, regardless of count.

Standard Lifecycle: Workers use existing retry, timeout, and DLQ mechanics - no special recovery system needed.

Example Flow

Task: Process 1000-row CSV file

1. analyze_csv (batchable step)
   → Counts rows: 1000
   → Calculates workers: 5 (200 rows each)
   → Returns BatchProcessingOutcome::CreateBatches

2. Orchestration creates workers dynamically:
   ├─ process_csv_batch_001 (rows 1-200)
   ├─ process_csv_batch_002 (rows 201-400)
   ├─ process_csv_batch_003 (rows 401-600)
   ├─ process_csv_batch_004 (rows 601-800)
   └─ process_csv_batch_005 (rows 801-1000)

3. Workers process in parallel

4. aggregate_csv_results (deferred convergence)
   → Waits for all 5 workers (intersection semantics)
   → Aggregates results from completed workers
   → Returns combined metrics

Architecture Foundations

Batch processing builds on and extends three foundational Tasker patterns:

1. DAG (Directed Acyclic Graph) Workflow Orchestration

What Batch Processing Inherits:

  • Worker steps are full DAG nodes with standard state machines
  • Parent-child dependencies enforced via tasker_workflow_step_edges
  • Cycle detection prevents circular dependencies
  • Topological ordering ensures correct execution sequence

What Batch Processing Extends:

  • Dynamic node creation: Template steps instantiated N times at runtime
  • Edge generation: Batchable step → worker instances → convergence step
  • Transactional atomicity: All workers created in single database transaction

Code Reference: tasker-orchestration/src/orchestration/lifecycle/batch_processing/service.rs:357-400

// Transaction ensures all-or-nothing worker creation
let mut tx = pool.begin().await?;

for (i, cursor_config) in cursor_configs.iter().enumerate() {
    // Create worker instance from template
    let worker_step = WorkflowStepCreator::create_from_template(
        &mut tx,
        task_uuid,
        &worker_template,
        &format!("{}_{:03}", worker_template_name, i + 1),
        Some(batch_worker_inputs.clone()),
    ).await?;

    // Create edge: batchable → worker
    WorkflowStepEdge::create_with_transaction(
        &mut tx,
        batchable_step.workflow_step_uuid,
        worker_step.workflow_step_uuid,
        "batch_dependency",
    ).await?;
}

tx.commit().await?; // Atomic - all workers or none

2. Retryability and Lifecycle Management

What Batch Processing Inherits:

  • Standard lifecycle.max_retries configuration per template
  • Exponential backoff via lifecycle.backoff_multiplier
  • Staleness detection using lifecycle.max_steps_in_process_minutes
  • Standard state transitions (Pending → Enqueued → InProgress → Complete/Error)

What Batch Processing Extends:

  • Checkpoint-based resumability: Workers checkpoint progress and resume from last cursor position
  • Cursor preservation during retry: workflow_steps.results field preserved by ResetForRetry action
  • Additional staleness detection: Checkpoint timestamp tracking alongside duration-based detection

Key Simplification:

  • No BatchRecoveryService - Uses standard retry + DLQ
  • No duplicate timeout settings - Uses lifecycle config only
  • Cursor data preserved during ResetForRetry

Configuration Example: tests/fixtures/task_templates/ruby/batch_processing_products_csv.yaml:749-752

- name: process_csv_batch
  type: batch_worker
  lifecycle:
    max_steps_in_process_minutes: 120  # DLQ timeout
    max_retries: 3                     # Standard retry limit
    backoff_multiplier: 2.0            # Exponential backoff

3. Deferred Convergence

What Batch Processing Inherits:

  • Intersection semantics: Wait for declared dependencies ∩ actually created steps
  • Template-level dependencies: Convergence step depends on worker template, not instances
  • Runtime resolution: System computes effective dependencies when workers are created

What Batch Processing Extends:

  • Batch aggregation pattern: Convergence steps aggregate results from N workers
  • NoBatches scenario handling: Placeholder worker created when dataset too small
  • Scenario detection helpers: BatchAggregationScenario::detect() for both cases

Flow Comparison:

Conditional Workflows (Decision Points):

decision_step → creates → option_a, option_b (conditional)
                            ↓
convergence_step (depends on option_a AND option_b templates)
                 → waits for whichever were created (intersection)

Batch Processing (Batchable Steps):

batchable_step → creates → worker_001, worker_002, ..., worker_N
                            ↓
convergence_step (depends on worker template)
                 → waits for ALL workers created (intersection)

Code Reference: tasker-orchestration/src/orchestration/lifecycle/batch_processing/service.rs:600-650

// Determine and create convergence step with intersection semantics
pub async fn determine_and_create_convergence_step(
    &self,
    tx: &mut PgTransaction,
    task_uuid: Uuid,
    convergence_template: &StepDefinition,
    created_workers: &[WorkflowStep],
) -> Result<Option<WorkflowStep>> {
    // Create convergence step as deferred type
    let convergence_step = WorkflowStepCreator::create_from_template(
        tx,
        task_uuid,
        convergence_template,
        &convergence_template.name,
        None,
    ).await?;

    // Create edges from ALL worker instances to convergence step
    for worker in created_workers {
        WorkflowStepEdge::create_with_transaction(
            tx,
            worker.workflow_step_uuid,
            convergence_step.workflow_step_uuid,
            "batch_convergence_dependency",
        ).await?;
    }

    Ok(Some(convergence_step))
}

Core Concepts

Batchable Steps

Purpose: Analyze a workload and decide whether to create batch workers.

Responsibilities:

  1. Examine dataset (size, complexity, business logic)
  2. Calculate optimal worker count based on batch size
  3. Generate cursor configurations defining batch boundaries
  4. Return BatchProcessingOutcome instructing orchestration

Returns: BatchProcessingOutcome enum with two variants:

  • NoBatches: Dataset too small or empty - create placeholder worker
  • CreateBatches: Create N workers with cursor configurations

Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:60-120

// Batchable handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    let csv_file_path = step_data.task.context.get("csv_file_path").unwrap();

    // Count rows in CSV (excluding header)
    let total_rows = count_csv_rows(csv_file_path)?;

    // Get batch configuration from handler initialization
    let batch_size = step_data.handler_initialization
        .get("batch_size").and_then(|v| v.as_u64()).unwrap_or(200);

    if total_rows == 0 {
        // No batches needed
        let outcome = BatchProcessingOutcome::no_batches();
        return Ok(success_result(
            step_uuid,
            json!({ "batch_processing_outcome": outcome.to_value() }),
            elapsed_ms,
            None,
        ));
    }

    // Calculate workers
    let worker_count = (total_rows as f64 / batch_size as f64).ceil() as u32;

    // Generate cursor configs
    let cursor_configs = create_cursor_configs(total_rows, worker_count);

    // Return CreateBatches outcome
    let outcome = BatchProcessingOutcome::create_batches(
        "process_csv_batch".to_string(),
        worker_count,
        cursor_configs,
        total_rows,
    );

    Ok(success_result(
        step_uuid,
        json!({
            "batch_processing_outcome": outcome.to_value(),
            "worker_count": worker_count,
            "total_rows": total_rows
        }),
        elapsed_ms,
        None,
    ))
}

Batch Workers

Purpose: Process a specific subset of data defined by cursor configuration.

Responsibilities:

  1. Extract cursor config from workflow_step.inputs
  2. Check for is_no_op flag (NoBatches placeholder scenario)
  3. Process items within cursor range (start_cursor to end_cursor)
  4. Checkpoint progress periodically for resumability
  5. Return processed results for aggregation

Cursor Configuration: Each worker receives BatchWorkerInputs in workflow_step.inputs:

{
  "cursor": {
    "batch_id": "001",
    "start_cursor": 1,
    "end_cursor": 200,
    "batch_size": 200
  },
  "batch_metadata": {
    "checkpoint_interval": 100,
    "cursor_field": "row_number",
    "failure_strategy": "fail_fast"
  },
  "is_no_op": false
}

Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:200-280

// Batch worker handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // Extract context using helper
    let context = BatchWorkerContext::from_step_data(step_data)?;

    // Check for no-op placeholder worker
    if context.is_no_op() {
        return Ok(success_result(
            step_uuid,
            json!({
                "no_op": true,
                "reason": "NoBatches scenario - no items to process"
            }),
            elapsed_ms,
            None,
        ));
    }

    // Get cursor range
    let start_row = context.start_position();
    let end_row = context.end_position();

    // Get CSV file path from dependency results
    let csv_file_path = step_data
        .dependency_results
        .get("analyze_csv")
        .and_then(|r| r.result.get("csv_file_path"))
        .unwrap();

    // Process CSV rows in cursor range
    let mut processed_count = 0;
    let mut metrics = initialize_metrics();

    let file = File::open(csv_file_path)?;
    let mut csv_reader = csv::ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(file));

    for (row_idx, result) in csv_reader.deserialize::<Product>().enumerate() {
        let data_row_num = row_idx + 1; // 1-indexed after header

        if data_row_num < start_row {
            continue; // Skip rows before our range
        }
        if data_row_num >= end_row {
            break; // Processed all our rows
        }

        let product: Product = result?;

        // Update metrics
        metrics.total_inventory_value += product.price * (product.stock as f64);
        *metrics.category_counts.entry(product.category.clone())
            .or_insert(0) += 1;

        processed_count += 1;

        // Checkpoint progress periodically
        if processed_count % context.checkpoint_interval() == 0 {
            checkpoint_progress(step_uuid, data_row_num).await?;
        }
    }

    // Return results for aggregation
    Ok(success_result(
        step_uuid,
        json!({
            "processed_count": processed_count,
            "total_inventory_value": metrics.total_inventory_value,
            "category_counts": metrics.category_counts,
            "batch_id": context.batch_id(),
            "start_row": start_row,
            "end_row": end_row
        }),
        elapsed_ms,
        None,
    ))
}

Convergence Steps

Purpose: Aggregate results from all batch workers using deferred intersection semantics.

Responsibilities:

  1. Detect scenario using BatchAggregationScenario::detect()
  2. Handle both NoBatches and WithBatches scenarios
  3. Aggregate metrics from all worker results
  4. Return combined results for task completion

Scenario Detection:

pub enum BatchAggregationScenario {
    /// No batches created - placeholder worker used
    NoBatches {
        batchable_result: StepDependencyResult,
    },

    /// Batches created - multiple workers processed data
    WithBatches {
        batch_results: Vec<(String, StepDependencyResult)>,
        worker_count: u32,
    },
}

Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:400-480

// Convergence handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // Detect scenario using helper
    let scenario = BatchAggregationScenario::detect(
        &step_data.dependency_results,
        "analyze_csv",        // batchable step name
        "process_csv_batch_", // batch worker prefix
    )?;

    match scenario {
        BatchAggregationScenario::NoBatches { batchable_result } => {
            // No workers created - get dataset size from batchable step
            let total_rows = batchable_result
                .result.get("total_rows")
                .and_then(|v| v.as_u64())
                .unwrap_or(0);

            // Return zero metrics
            Ok(success_result(
                step_uuid,
                json!({
                    "total_processed": total_rows,
                    "total_inventory_value": 0.0,
                    "category_counts": {},
                    "worker_count": 0
                }),
                elapsed_ms,
                None,
            ))
        }

        BatchAggregationScenario::WithBatches { batch_results, worker_count } => {
            // Aggregate results from all workers
            let mut total_processed = 0u64;
            let mut total_inventory_value = 0.0;
            let mut global_category_counts = HashMap::new();
            let mut max_price = 0.0;
            let mut max_price_product = None;

            for (step_name, result) in batch_results {
                // Sum processed counts
                total_processed += result.result
                    .get("processed_count")
                    .and_then(|v| v.as_u64())
                    .unwrap_or(0);

                // Sum inventory values
                total_inventory_value += result.result
                    .get("total_inventory_value")
                    .and_then(|v| v.as_f64())
                    .unwrap_or(0.0);

                // Merge category counts
                if let Some(categories) = result.result
                    .get("category_counts")
                    .and_then(|v| v.as_object()) {
                    for (category, count) in categories {
                        *global_category_counts.entry(category.clone()).or_insert(0)
                            += count.as_u64().unwrap_or(0);
                    }
                }

                // Find global max price
                let batch_max_price = result.result
                    .get("max_price")
                    .and_then(|v| v.as_f64())
                    .unwrap_or(0.0);
                if batch_max_price > max_price {
                    max_price = batch_max_price;
                    max_price_product = result.result
                        .get("max_price_product")
                        .and_then(|v| v.as_str())
                        .map(String::from);
                }
            }

            // Return aggregated metrics
            Ok(success_result(
                step_uuid,
                json!({
                    "total_processed": total_processed,
                    "total_inventory_value": total_inventory_value,
                    "category_counts": global_category_counts,
                    "max_price": max_price,
                    "max_price_product": max_price_product,
                    "worker_count": worker_count
                }),
                elapsed_ms,
                None,
            ))
        }
    }
}

Checkpoint Yielding

Checkpoint yielding enables handler-driven progress persistence during long-running batch operations. Handlers explicitly checkpoint their progress, persist state to the database, and yield control back to the orchestrator for re-dispatch.

Key Characteristics

Handler-Driven: Handlers decide when to checkpoint based on business logic, not configuration timers. This gives handlers full control over checkpoint frequency and timing.

Checkpoint-Persist-Then-Redispatch: Progress is atomically saved to the database before the step is re-dispatched. This ensures no progress is ever lost, even during infrastructure failures.

Step Remains In-Progress: During checkpoint yield cycles, the step stays in InProgress state. It is not released or re-enqueued through normal channels—the re-dispatch happens internally.

State Machine Integrity: Only Success or Failure results trigger state transitions. Checkpoint yields are internal handler mechanics that don’t affect the step state machine.

When to Use Checkpoint Yielding

Use checkpoint yielding when:

  • Processing takes longer than your visibility timeout (prevents DLQ escalation)
  • You want resumable processing after transient failures
  • You need to periodically release resources (memory, connections)
  • Long-running operations need progress visibility

Don’t use checkpoint yielding when:

  • Batch processing completes quickly (<30 seconds)
  • The overhead of checkpointing exceeds the benefit
  • Operations are inherently non-resumable

API Reference

All languages provide a checkpoint_yield() method (or checkpointYield() in TypeScript) on the Batchable mixin:

Ruby

class MyBatchWorkerHandler
  include Tasker::StepHandler::Batchable

  def call(step_data)
    context = BatchWorkerContext.from_step_data(step_data)

    # Resume from checkpoint if present
    start_item = context.has_checkpoint? ? context.checkpoint_cursor : 0
    accumulated = context.accumulated_results || []

    items = fetch_items_to_process(start_item)

    items.each_with_index do |item, idx|
      result = process_item(item)
      accumulated << result

      # Checkpoint every 1000 items
      if (idx + 1) % 1000 == 0
        checkpoint_yield(
          cursor: start_item + idx + 1,
          items_processed: accumulated.size,
          accumulated_results: { processed: accumulated }
        )
        # Handler execution stops here and resumes on re-dispatch
      end
    end

    # Return final success result
    success_result(results: { all_processed: accumulated })
  end
end

BatchWorkerContext Accessors (Ruby):

  • checkpoint_cursor - Current cursor position (or nil if no checkpoint)
  • accumulated_results - Previously accumulated results (or nil)
  • has_checkpoint? - Returns true if checkpoint data exists
  • checkpoint_items_processed - Number of items processed at checkpoint

Python

class MyBatchWorkerHandler(BatchableHandler):
    def call(self, step_data: TaskSequenceStep) -> StepExecutionResult:
        context = BatchWorkerContext.from_step_data(step_data)

        # Resume from checkpoint if present
        start_item = context.checkpoint_cursor if context.has_checkpoint() else 0
        accumulated = context.accumulated_results or []

        items = self.fetch_items_to_process(start_item)

        for idx, item in enumerate(items):
            result = self.process_item(item)
            accumulated.append(result)

            # Checkpoint every 1000 items
            if (idx + 1) % 1000 == 0:
                self.checkpoint_yield(
                    cursor=start_item + idx + 1,
                    items_processed=len(accumulated),
                    accumulated_results={"processed": accumulated}
                )
                # Handler execution stops here and resumes on re-dispatch

        # Return final success result
        return self.success_result(results={"all_processed": accumulated})

BatchWorkerContext Accessors (Python):

  • checkpoint_cursor: int | str | dict | None - Current cursor position
  • accumulated_results: dict | None - Previously accumulated results
  • has_checkpoint() -> bool - Returns true if checkpoint data exists
  • checkpoint_items_processed: int - Number of items processed at checkpoint

TypeScript

class MyBatchWorkerHandler extends BatchableHandler {
  async call(stepData: TaskSequenceStep): Promise<StepExecutionResult> {
    const context = BatchWorkerContext.fromStepData(stepData);

    // Resume from checkpoint if present
    const startItem = context.hasCheckpoint() ? context.checkpointCursor : 0;
    const accumulated = context.accumulatedResults ?? [];

    const items = await this.fetchItemsToProcess(startItem);

    for (let idx = 0; idx < items.length; idx++) {
      const result = await this.processItem(items[idx]);
      accumulated.push(result);

      // Checkpoint every 1000 items
      if ((idx + 1) % 1000 === 0) {
        await this.checkpointYield({
          cursor: startItem + idx + 1,
          itemsProcessed: accumulated.length,
          accumulatedResults: { processed: accumulated }
        });
        // Handler execution stops here and resumes on re-dispatch
      }
    }

    // Return final success result
    return this.successResult({ results: { allProcessed: accumulated } });
  }
}

BatchWorkerContext Properties (TypeScript):

  • checkpointCursor: number | string | Record<string, unknown> | undefined
  • accumulatedResults: Record<string, unknown> | undefined
  • hasCheckpoint(): boolean
  • checkpointItemsProcessed: number

Checkpoint Data Structure

Checkpoints are persisted in the checkpoint JSONB column on workflow_steps:

{
  "cursor": 1000,
  "items_processed": 1000,
  "timestamp": "2026-01-06T12:00:00Z",
  "accumulated_results": {
    "processed": ["item1", "item2", "..."]
  },
  "history": [
    {"cursor": 500, "timestamp": "2026-01-06T11:59:30Z"},
    {"cursor": 1000, "timestamp": "2026-01-06T12:00:00Z"}
  ]
}

Fields:

  • cursor - Flexible JSON value representing position (integer, string, or object)
  • items_processed - Total items processed at this checkpoint
  • timestamp - ISO 8601 timestamp when checkpoint was created
  • accumulated_results - Optional intermediate results to preserve
  • history - Array of previous checkpoint positions (appended automatically)
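
For Rust handlers that need to read this payload directly, a minimal deserialization sketch might look like the following (hypothetical types; the field names mirror the JSON above, not Tasker's internal structs):

use serde::{Deserialize, Serialize};
use serde_json::Value;

/// One prior checkpoint position (mirrors entries in the `history` array above).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CheckpointHistoryEntry {
    pub cursor: Value,
    pub timestamp: String, // ISO 8601
}

/// The full checkpoint payload stored in the `checkpoint` JSONB column.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CheckpointPayload {
    pub cursor: Value,                      // integer, string, or object
    pub items_processed: u64,
    pub timestamp: String,                  // ISO 8601
    #[serde(default)]
    pub accumulated_results: Option<Value>, // keep small; see Best Practices below
    #[serde(default)]
    pub history: Vec<CheckpointHistoryEntry>,
}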

Checkpoint Flow

┌──────────────────────────────────────────────────────────────────┐
│  Handler calls checkpoint_yield(cursor, items_processed, ...)   │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  FFI Bridge: checkpoint_yield_step_event()                       │
│  Converts language-specific types to CheckpointYieldData         │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  CheckpointService::save_checkpoint()                            │
│  - Atomic SQL update with history append                         │
│  - Uses JSONB jsonb_set for history array                        │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Worker re-dispatches step via internal MPSC channel             │
│  - Step stays InProgress (not released)                          │
│  - Re-queued for immediate processing                            │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Handler resumes with checkpoint data in workflow_step           │
│  - BatchWorkerContext provides checkpoint accessors              │
│  - Handler continues from saved cursor position                  │
└──────────────────────────────────────────────────────────────────┘
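
The atomic update performed by CheckpointService::save_checkpoint() is equivalent in spirit to a single JSONB update with a history append; the following is a hypothetical SQL sketch of that shape (not the engine's actual statement), assuming the checkpoint column described above:

UPDATE tasker.workflow_steps
SET checkpoint = jsonb_set(
        COALESCE(checkpoint, '{}'::jsonb)
          || jsonb_build_object(
               'cursor', '1000'::jsonb,
               'items_processed', 1000,
               'timestamp', to_jsonb(now())
             ),
        '{history}',
        COALESCE(checkpoint->'history', '[]'::jsonb)
          || jsonb_build_object('cursor', '1000'::jsonb, 'timestamp', to_jsonb(now()))
    )
WHERE workflow_step_uuid = 'step-uuid-here';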

Failure and Recovery

Transient Failure After Checkpoint:

  1. Handler checkpoints at position 500
  2. Handler fails at position 750 (transient error)
  3. Step is retried (standard retry semantics)
  4. Handler resumes from checkpoint (position 500)
  5. Items 500-750 are reprocessed (idempotency required)
  6. Processing continues to completion

Permanent Failure:

  1. Handler checkpoints at position 500
  2. Handler encounters non-retryable error
  3. Step transitions to Error state
  4. Checkpoint data preserved for operator inspection
  5. Manual intervention may use checkpoint to resume later

Best Practices

Checkpoint Frequency:

  • Too frequent: Overhead dominates (database writes, re-dispatch latency)
  • Too infrequent: Lost progress on failure, long recovery time
  • Rule of thumb: Checkpoint every 1-5 minutes of work, or every 1000-10000 items

Accumulated Results:

  • Keep accumulated results small (summaries, counts, IDs)
  • For large result sets, write to external storage and store reference
  • Unbounded accumulated results can cause performance degradation

Cursor Design:

  • Use monotonic cursors (integers, timestamps) when possible
  • Complex cursors (objects) are supported but harder to debug
  • Cursor must uniquely identify resume position

Idempotency:

  • Items between last checkpoint and failure will be reprocessed
  • Ensure item processing is idempotent or use deduplication
  • Consider storing processed item IDs in accumulated_results (see the sketch below)
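
A minimal sketch of that deduplication idea in Rust (hypothetical helper; already_done would be rebuilt from the IDs carried in accumulated_results at resume time):

use std::collections::HashSet;

/// Process only items whose IDs are not already recorded in accumulated_results.
/// Items in the window between the last checkpoint and a failure are skipped on
/// replay instead of being applied twice.
fn process_pending(items: &[(u64, String)], already_done: &HashSet<u64>) -> Vec<u64> {
    let mut newly_processed = Vec::new();
    for (id, payload) in items {
        if already_done.contains(id) {
            continue; // already handled before the failure; skip on replay
        }
        apply_side_effect(payload); // the real work; must itself be safe per id
        newly_processed.push(*id);
    }
    newly_processed
}

fn apply_side_effect(_payload: &str) {
    // placeholder for the actual (idempotent) item processing
}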

Monitoring

Checkpoint Events (logged automatically):

INFO checkpoint_yield_step_event step_uuid=abc cursor=1000 items_processed=1000
INFO checkpoint_saved step_uuid=abc history_length=2

Metrics to Monitor:

  • Checkpoint frequency per step
  • Average items processed between checkpoints
  • Checkpoint history size (detect unbounded growth)
  • Re-dispatch latency after checkpoint

Known Limitations

History Array Growth: The history array grows with each checkpoint. For very long-running processes with frequent checkpoints, this can lead to large JSONB values. Consider:

  • Setting a maximum history length (future enhancement)
  • Clearing history on step completion
  • Using external storage for detailed history

Accumulated Results Size: No built-in size limit on accumulated_results. Handlers must self-regulate to prevent database bloat. Consider:

  • Storing summaries instead of raw data
  • Using external storage for large intermediate results
  • Implementing size checks before checkpoint

Workflow Pattern

Template Definition

Batch processing workflows use three step types in YAML templates:

name: csv_product_inventory_analyzer
namespace_name: csv_processing
version: "1.0.0"

steps:
  # BATCHABLE STEP: Analyzes dataset and decides batching strategy
  - name: analyze_csv
    type: batchable
    dependencies: []
    handler:
      callable: BatchProcessing::CsvAnalyzerHandler
      initialization:
        batch_size: 200
        max_workers: 5

  # BATCH WORKER TEMPLATE: Single batch processing unit
  # Orchestration creates N instances from this template at runtime
  - name: process_csv_batch
    type: batch_worker
    dependencies:
      - analyze_csv
    lifecycle:
      max_steps_in_process_minutes: 120
      max_retries: 3
      backoff_multiplier: 2.0
    handler:
      callable: BatchProcessing::CsvBatchProcessorHandler
      initialization:
        operation: "inventory_analysis"

  # DEFERRED CONVERGENCE STEP: Aggregates results from all workers
  - name: aggregate_csv_results
    type: deferred_convergence
    dependencies:
      - process_csv_batch  # Template dependency - resolves to all instances
    handler:
      callable: BatchProcessing::CsvResultsAggregatorHandler
      initialization:
        aggregation_type: "inventory_metrics"
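
Submitting a task against this template uses the standard task creation API (the same endpoint shape as in the operator examples later in this document):

curl -X POST http://localhost:8080/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "csv_processing",
    "template_name": "csv_product_inventory_analyzer",
    "context": {
      "csv_file_path": "/path/to/data.csv"
    }
  }'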

Runtime Execution Flow

1. Task Initialization

User creates task with context: { "csv_file_path": "/path/to/data.csv" }
↓
Task enters Initializing state
↓
Orchestration discovers ready steps: [analyze_csv]

2. Batchable Step Execution

analyze_csv step enqueued to worker queue
↓
Worker claims step, executes CsvAnalyzerHandler
↓
Handler counts rows: 1000
Handler calculates workers: 5 (200 rows each)
Handler generates cursor configs
Handler returns BatchProcessingOutcome::CreateBatches
↓
Step completes with batch_processing_outcome in results

3. Batch Worker Creation (Orchestration)

ResultProcessorActor processes analyze_csv completion
↓
Detects batch_processing_outcome in step results
↓
Sends ProcessBatchableStepMessage to BatchProcessingActor
↓
BatchProcessingService.process_batchable_step():
  - Begins database transaction
  - Creates 5 worker instances from process_csv_batch template:
    * process_csv_batch_001 (cursor: rows 1-200)
    * process_csv_batch_002 (cursor: rows 201-400)
    * process_csv_batch_003 (cursor: rows 401-600)
    * process_csv_batch_004 (cursor: rows 601-800)
    * process_csv_batch_005 (cursor: rows 801-1000)
  - Creates edges: analyze_csv → each worker
  - Creates convergence step: aggregate_csv_results
  - Creates edges: each worker → aggregate_csv_results
  - Commits transaction (all-or-nothing)
↓
Workers enqueued to worker queue with PGMQ notifications

4. Parallel Worker Execution

5 workers execute in parallel:

Worker 001:
  - Extracts cursor: start=1, end=200
  - Processes CSV rows 1-200
  - Returns: processed_count=200, metrics={...}

Worker 002:
  - Extracts cursor: start=201, end=400
  - Processes CSV rows 201-400
  - Returns: processed_count=200, metrics={...}

... (workers 003-005 similar)

All workers complete

5. Convergence Step Execution

Orchestration discovers aggregate_csv_results is ready
(all parent workers completed - intersection semantics)
↓
aggregate_csv_results enqueued to worker queue
↓
Worker claims step, executes CsvResultsAggregatorHandler
↓
Handler detects scenario: WithBatches (5 workers)
Handler aggregates results from all 5 workers:
  - total_processed: 1000
  - total_inventory_value: $XXX,XXX.XX
  - category_counts: {electronics: 300, clothing: 250, ...}
Handler returns aggregated metrics
↓
Step completes

6. Task Completion

Orchestration detects all steps complete
↓
TaskFinalizerActor finalizes task
↓
Task state: Complete

NoBatches Scenario Flow

When dataset is too small or empty:

analyze_csv determines dataset too small (e.g., 50 rows < 200 batch_size)
↓
Returns BatchProcessingOutcome::NoBatches
↓
Orchestration creates single placeholder worker:
  - process_csv_batch_001 (is_no_op: true)
  - No cursor configuration needed
  - Still maintains DAG structure
↓
Placeholder worker executes:
  - Detects is_no_op flag
  - Returns immediately with no_op: true
  - No actual data processing
↓
Convergence step detects NoBatches scenario:
  - Uses batchable step result directly
  - Returns zero metrics or empty aggregation

Why placeholder workers?

  • Maintains consistent DAG structure
  • Convergence step logic handles both scenarios uniformly
  • No special-case orchestration logic needed
  • Standard retry/DLQ mechanics still apply

Data Structures

BatchProcessingOutcome

Location: tasker-shared/src/messaging/execution_types.rs

Purpose: Returned by batchable handlers to instruct orchestration.

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BatchProcessingOutcome {
    /// No batching needed - create placeholder worker
    NoBatches,

    /// Create N batch workers with cursor configurations
    CreateBatches {
        /// Template step name (e.g., "process_csv_batch")
        worker_template_name: String,

        /// Number of workers to create
        worker_count: u32,

        /// Cursor configurations for each worker
        cursor_configs: Vec<CursorConfig>,

        /// Total items across all batches
        total_items: u64,
    },
}

impl BatchProcessingOutcome {
    pub fn no_batches() -> Self {
        BatchProcessingOutcome::NoBatches
    }

    pub fn create_batches(
        worker_template_name: String,
        worker_count: u32,
        cursor_configs: Vec<CursorConfig>,
        total_items: u64,
    ) -> Self {
        BatchProcessingOutcome::CreateBatches {
            worker_template_name,
            worker_count,
            cursor_configs,
            total_items,
        }
    }

    pub fn to_value(&self) -> serde_json::Value {
        serde_json::to_value(self).unwrap_or(json!({}))
    }
}
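
With the serde attributes above, the outcome serializes into the batch_processing_outcome value stored in the batchable step's results. For example (only the first of five cursor configs shown):

// CreateBatches
{
  "type": "create_batches",
  "worker_template_name": "process_csv_batch",
  "worker_count": 5,
  "cursor_configs": [
    { "batch_id": "001", "start_cursor": 1, "end_cursor": 200, "batch_size": 200 }
  ],
  "total_items": 1000
}

// NoBatches
{
  "type": "no_batches"
}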

Ruby Mirror: workers/ruby/lib/tasker_core/types/batch_processing_outcome.rb

module TaskerCore
  module Types
    module BatchProcessingOutcome
      class NoBatches < Dry::Struct
        attribute :type, Types::String.default('no_batches')

        def to_h
          { 'type' => 'no_batches' }
        end

        def requires_batch_creation?
          false
        end
      end

      class CreateBatches < Dry::Struct
        attribute :type, Types::String.default('create_batches')
        attribute :worker_template_name, Types::Strict::String
        attribute :worker_count, Types::Coercible::Integer.constrained(gteq: 1)
        attribute :cursor_configs, Types::Array.of(Types::Hash).constrained(min_size: 1)
        attribute :total_items, Types::Coercible::Integer.constrained(gteq: 0)

        def to_h
          {
            'type' => 'create_batches',
            'worker_template_name' => worker_template_name,
            'worker_count' => worker_count,
            'cursor_configs' => cursor_configs,
            'total_items' => total_items
          }
        end

        def requires_batch_creation?
          true
        end
      end

      class << self
        def no_batches
          NoBatches.new
        end

        def create_batches(worker_template_name:, worker_count:, cursor_configs:, total_items:)
          CreateBatches.new(
            worker_template_name: worker_template_name,
            worker_count: worker_count,
            cursor_configs: cursor_configs,
            total_items: total_items
          )
        end
      end
    end
  end
end

CursorConfig

Location: tasker-shared/src/messaging/execution_types.rs

Purpose: Defines batch boundaries for each worker.

#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct CursorConfig {
    /// Batch identifier (e.g., "001", "002", "003")
    pub batch_id: String,

    /// Starting position (inclusive) - flexible JSON value
    pub start_cursor: serde_json::Value,

    /// Ending position (exclusive) - flexible JSON value
    pub end_cursor: serde_json::Value,

    /// Number of items in this batch
    pub batch_size: u32,
}

Design Notes:

  • Cursor values use serde_json::Value for flexibility
  • Supports integers, strings, timestamps, UUIDs, etc.
  • Batch IDs are zero-padded strings for consistent ordering
  • start_cursor is inclusive, end_cursor is exclusive

Example Cursor Configs:

// Numeric cursors (CSV row numbers)
{
  "batch_id": "001",
  "start_cursor": 1,
  "end_cursor": 200,
  "batch_size": 200
}

// Timestamp cursors (event processing)
{
  "batch_id": "002",
  "start_cursor": "2025-01-01T00:00:00Z",
  "end_cursor": "2025-01-01T01:00:00Z",
  "batch_size": 3600
}

// UUID cursors (database pagination)
{
  "batch_id": "003",
  "start_cursor": "00000000-0000-0000-0000-000000000000",
  "end_cursor": "3fffffff-ffff-ffff-ffff-ffffffffffff",
  "batch_size": 1000000
}

BatchWorkerInputs

Location: tasker-shared/src/models/core/batch_worker.rs

Purpose: Stored in workflow_steps.inputs for each worker instance.

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct BatchWorkerInputs {
    /// Cursor configuration defining this worker's batch range
    pub cursor: CursorConfig,

    /// Batch processing metadata
    pub batch_metadata: BatchMetadata,

    /// Flag indicating if this is a placeholder worker (NoBatches scenario)
    #[serde(default)]
    pub is_no_op: bool,
}

impl BatchWorkerInputs {
    pub fn new(
        cursor_config: CursorConfig,
        batch_config: &BatchConfiguration,
        is_no_op: bool,
    ) -> Self {
        Self {
            cursor: cursor_config,
            batch_metadata: BatchMetadata {
                checkpoint_interval: batch_config.checkpoint_interval,
                cursor_field: batch_config.cursor_field.clone(),
                failure_strategy: batch_config.failure_strategy.clone(),
            },
            is_no_op,
        }
    }
}

Storage Location:

  • ✅ workflow_steps.inputs (instance-specific runtime data)
  • ❌ NOT in step_definition.handler.initialization (that’s the template)

BatchMetadata

Location: tasker-shared/src/models/core/batch_worker.rs

Purpose: Runtime configuration for batch processing behavior.

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct BatchMetadata {
    /// Checkpoint frequency (every N items)
    pub checkpoint_interval: u32,

    /// Field name used for cursor tracking (e.g., "id", "row_number")
    pub cursor_field: String,

    /// How to handle failures during batch processing
    pub failure_strategy: FailureStrategy,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum FailureStrategy {
    /// Fail immediately if any item fails
    FailFast,

    /// Continue processing remaining items, report failures at end
    ContinueOnFailure,

    /// Isolate failed items to separate queue
    IsolateFailed,
}

Implementation Patterns

Rust Implementation

1. Batchable Handler Pattern:

use tasker_shared::messaging::execution_types::{BatchProcessingOutcome, CursorConfig};
use serde_json::json;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // 1. Analyze dataset
    let dataset_size = analyze_dataset(step_data)?;
    let batch_size = get_batch_size_from_config(step_data)?;

    // 2. Check if batching needed
    if dataset_size == 0 || dataset_size < batch_size {
        let outcome = BatchProcessingOutcome::no_batches();
        return Ok(success_result(
            step_uuid,
            json!({ "batch_processing_outcome": outcome.to_value() }),
            elapsed_ms,
            None,
        ));
    }

    // 3. Calculate worker count
    let worker_count = (dataset_size as f64 / batch_size as f64).ceil() as u32;

    // 4. Generate cursor configs
    let cursor_configs = create_cursor_configs(dataset_size, worker_count, batch_size);

    // 5. Return CreateBatches outcome
    let outcome = BatchProcessingOutcome::create_batches(
        "worker_template_name".to_string(),
        worker_count,
        cursor_configs,
        dataset_size,
    );

    Ok(success_result(
        step_uuid,
        json!({
            "batch_processing_outcome": outcome.to_value(),
            "worker_count": worker_count,
            "total_items": dataset_size
        }),
        elapsed_ms,
        None,
    ))
}

fn create_cursor_configs(
    total_items: u64,
    worker_count: u32,
    batch_size: u64,
) -> Vec<CursorConfig> {
    let mut cursor_configs = Vec::new();
    let items_per_worker = (total_items as f64 / worker_count as f64).ceil() as u64;

    for i in 0..worker_count {
        let start_position = i as u64 * items_per_worker;
        let end_position = ((i + 1) as u64 * items_per_worker).min(total_items);

        cursor_configs.push(CursorConfig {
            batch_id: format!("{:03}", i + 1),
            start_cursor: json!(start_position),
            end_cursor: json!(end_position),
            batch_size: (end_position - start_position) as u32,
        });
    }

    cursor_configs
}

2. Batch Worker Handler Pattern:

use tasker_worker::batch_processing::BatchWorkerContext;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // 1. Extract batch worker context using helper
    let context = BatchWorkerContext::from_step_data(step_data)?;

    // 2. Check for no-op placeholder worker
    if context.is_no_op() {
        return Ok(success_result(
            step_uuid,
            json!({
                "no_op": true,
                "reason": "NoBatches scenario",
                "batch_id": context.batch_id()
            }),
            elapsed_ms,
            None,
        ));
    }

    // 3. Extract cursor range
    let start_idx = context.start_position();
    let end_idx = context.end_position();
    let checkpoint_interval = context.checkpoint_interval();

    // 4. Process items in range
    let mut processed_count = 0;
    let mut results = initialize_results();

    for idx in start_idx..end_idx {
        // Process item
        let item = get_item(idx)?;
        update_results(&mut results, item);

        processed_count += 1;

        // 5. Checkpoint progress periodically
        if processed_count % checkpoint_interval == 0 {
            checkpoint_progress(step_uuid, idx).await?;
        }
    }

    // 6. Return results for aggregation
    Ok(success_result(
        step_uuid,
        json!({
            "processed_count": processed_count,
            "results": results,
            "batch_id": context.batch_id(),
            "start_position": start_idx,
            "end_position": end_idx
        }),
        elapsed_ms,
        None,
    ))
}

3. Convergence Handler Pattern:

use tasker_worker::batch_processing::BatchAggregationScenario;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // 1. Detect scenario using helper
    let scenario = BatchAggregationScenario::detect(
        &step_data.dependency_results,
        "batchable_step_name",
        "batch_worker_prefix_",
    )?;

    // 2. Handle both scenarios
    let aggregated_results = match scenario {
        BatchAggregationScenario::NoBatches { batchable_result } => {
            // Get dataset info from batchable step
            let total_items = batchable_result
                .result.get("total_items")
                .and_then(|v| v.as_u64())
                .unwrap_or(0);

            // Return zero metrics
            json!({
                "total_processed": total_items,
                "worker_count": 0
            })
        }

        BatchAggregationScenario::WithBatches { batch_results, worker_count } => {
            // Aggregate results from all workers
            let mut total_processed = 0u64;

            for (step_name, result) in batch_results {
                total_processed += result.result
                    .get("processed_count")
                    .and_then(|v| v.as_u64())
                    .unwrap_or(0);

                // Additional aggregation logic...
            }

            json!({
                "total_processed": total_processed,
                "worker_count": worker_count
            })
        }
    };

    // 3. Return aggregated results
    Ok(success_result(
        step_uuid,
        aggregated_results,
        elapsed_ms,
        None,
    ))
}

Ruby Implementation

1. Batchable Handler Pattern (using Batchable base class):

module BatchProcessing
  class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
    def call(task, _sequence, step)
      csv_file_path = task.context['csv_file_path']
      total_rows = count_csv_rows(csv_file_path)

      # Get batch configuration
      batch_size = step_definition_initialization['batch_size'] || 200
      max_workers = step_definition_initialization['max_workers'] || 5

      # Calculate worker count
      worker_count = [(total_rows.to_f / batch_size).ceil, max_workers].min

      if worker_count == 0 || total_rows == 0
        # Use helper for NoBatches outcome
        return no_batches_success(
          reason: 'dataset_too_small',
          total_rows: total_rows
        )
      end

      # Generate cursor configs using helper
      cursor_configs = generate_cursor_configs(
        total_items: total_rows,
        worker_count: worker_count
      )

      # Use helper for CreateBatches outcome
      create_batches_success(
        worker_template_name: 'process_csv_batch',
        worker_count: worker_count,
        cursor_configs: cursor_configs,
        total_items: total_rows,
        additional_data: {
          'csv_file_path' => csv_file_path
        }
      )
    end

    private

    def count_csv_rows(csv_file_path)
      CSV.read(csv_file_path, headers: true).length
    end
  end
end

2. Batch Worker Handler Pattern (using Batchable base class):

module BatchProcessing
  class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
    def call(context)
      # Extract batch context using helper
      batch_ctx = get_batch_context(context)

      # Use helper to check for no-op worker
      no_op_result = handle_no_op_worker(batch_ctx)
      return no_op_result if no_op_result

      # Get CSV file path from dependency results
      csv_file_path = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')

      # Process CSV rows in cursor range
      metrics = process_csv_batch(
        csv_file_path,
        batch_ctx.start_cursor,
        batch_ctx.end_cursor
      )

      # Return results for aggregation
      success(
        result_data: {
          'processed_count' => metrics[:processed_count],
          'total_inventory_value' => metrics[:total_inventory_value],
          'category_counts' => metrics[:category_counts],
          'batch_id' => batch_ctx.batch_id
        }
      )
    end

    private

    def process_csv_batch(csv_file_path, start_row, end_row)
      metrics = {
        processed_count: 0,
        total_inventory_value: 0.0,
        category_counts: Hash.new(0)
      }

      CSV.foreach(csv_file_path, headers: true).with_index(1) do |row, data_row_num|
        next if data_row_num < start_row
        break if data_row_num >= end_row

        product = parse_product(row)

        metrics[:total_inventory_value] += product.price * product.stock
        metrics[:category_counts][product.category] += 1
        metrics[:processed_count] += 1
      end

      metrics
    end
  end
end

3. Convergence Handler Pattern (using Batchable base class):

module BatchProcessing
  class CsvResultsAggregatorHandler < TaskerCore::StepHandler::Batchable
    def call(_task, sequence, _step)
      # Detect scenario using helper
      scenario = detect_aggregation_scenario(
        sequence,
        batchable_step_name: 'analyze_csv',
        batch_worker_prefix: 'process_csv_batch_'
      )

      # Use helper for aggregation with custom block
      aggregate_batch_worker_results(scenario) do |batch_results|
        # Custom aggregation logic
        total_processed = 0
        total_inventory_value = 0.0
        global_category_counts = Hash.new(0)

        batch_results.each do |step_name, result|
          total_processed += result.dig('result', 'processed_count') || 0
          total_inventory_value += result.dig('result', 'total_inventory_value') || 0.0

          (result.dig('result', 'category_counts') || {}).each do |category, count|
            global_category_counts[category] += count
          end
        end

        {
          'total_processed' => total_processed,
          'total_inventory_value' => total_inventory_value,
          'category_counts' => global_category_counts,
          'worker_count' => batch_results.size
        }
      end
    end
  end
end

Use Cases

1. Large Dataset Processing

Scenario: Process millions of records from a database, file, or API.

Why Batch Processing?

  • Single worker would timeout
  • Memory constraints prevent loading entire dataset
  • Want parallelism for speed

Example: Product catalog synchronization

Batchable: Analyze product table (5 million products)
Workers: 100 workers × 50,000 products each
Convergence: Aggregate sync statistics
Result: 5M products synced in 10 minutes vs 2 hours sequential

2. Time-Based Event Processing

Scenario: Process events from a time-series database or log aggregation system.

Why Batch Processing?

  • Events span long time ranges
  • Want to process hourly/daily chunks in parallel
  • Need resumability for long-running processing

Example: Analytics event processing

Batchable: Analyze events (30 days × 24 hours)
Workers: 720 workers (1 per hour)
Cursors: Timestamp ranges (2025-01-01T00:00 to 2025-01-01T01:00)
Convergence: Aggregate daily/monthly metrics
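
A hedged sketch of generating hourly timestamp cursors with the CursorConfig type defined in the Data Structures section (chrono is assumed for the time arithmetic; the cursor shape mirrors the timestamp example shown there):

use chrono::{DateTime, Duration, Utc};
use serde_json::json;
use tasker_shared::messaging::execution_types::CursorConfig;

/// Build one CursorConfig per hour in [start, end), following the
/// inclusive-start / exclusive-end convention described earlier.
fn hourly_cursor_configs(start: DateTime<Utc>, end: DateTime<Utc>) -> Vec<CursorConfig> {
    let mut configs = Vec::new();
    let mut window_start = start;
    let mut batch_id = 1u32;

    while window_start < end {
        let window_end = (window_start + Duration::hours(1)).min(end);
        configs.push(CursorConfig {
            batch_id: format!("{:03}", batch_id),
            start_cursor: json!(window_start.to_rfc3339()),
            end_cursor: json!(window_end.to_rfc3339()),
            batch_size: 3600, // mirrors the timestamp-cursor example above
        });
        window_start = window_end;
        batch_id += 1;
    }
    configs
}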

3. Multi-Source Data Integration

Scenario: Fetch data from multiple external APIs or services.

Why Batch Processing?

  • Each source is independent
  • Want parallel fetching for speed
  • Some sources may fail (need retry per source)

Example: Third-party data enrichment

Batchable: Analyze customer list (partition by data provider)
Workers: 5 workers (1 per provider: Stripe, Salesforce, HubSpot, etc.)
Cursors: Provider-specific identifiers
Convergence: Merge enriched customer profiles

4. Bulk File Processing

Scenario: Process multiple files (CSVs, images, documents).

Why Batch Processing?

  • Each file is independent processing unit
  • Want parallelism across files
  • File sizes vary (dynamic batch sizing)

Example: Image transformation pipeline

Batchable: List S3 bucket objects (1000 images)
Workers: 20 workers × 50 images each
Cursors: S3 object key ranges
Convergence: Verify all images transformed

5. Geographical Data Partitioning

Scenario: Process data partitioned by geography (regions, countries, cities).

Why Batch Processing?

  • Geographic boundaries provide natural partitions
  • Want parallel processing per region
  • Different regions may have different data volumes

Example: Regional sales report generation

Batchable: Analyze sales data (50 US states)
Workers: 50 workers (1 per state)
Cursors: State codes (AL, AK, AZ, ...)
Convergence: National sales dashboard

Operator Workflows

Batch processing integrates seamlessly with the DLQ (Dead Letter Queue) system for operator visibility and manual intervention. This section shows how operators manage failed batch workers.

DLQ Integration Principles

From DLQ System Documentation:

  1. Investigation Tracking Only: DLQ tracks “why task is stuck” and “who investigated” - it doesn’t manipulate tasks
  2. Step-Level Resolution: Operators fix problem steps using step APIs, not task-level operations
  3. Three Resolution Types:
    • ResetForRetry: Reset attempts, return step to pending (cursor preserved)
    • ResolveManually: Skip step, mark resolved without results
    • CompleteManually: Provide manual results for dependent steps

Key for Batch Processing: Cursor data in workflow_steps.results is preserved during ResetForRetry, enabling resumability without data loss.

Staleness Detection for Batch Workers

Batch workers have two staleness detection mechanisms:

1. Duration-Based (Standard):

lifecycle:
  max_steps_in_process_minutes: 120  # DLQ threshold

If a worker stays in the InProgress state for more than 120 minutes, it is flagged as stale.

2. Checkpoint-Based (Batch-Specific):

// Workers checkpoint progress periodically
if processed_count % checkpoint_interval == 0 {
    checkpoint_progress(step_uuid, current_cursor).await?;
}

If the last checkpoint timestamp is too old, the worker is flagged as stale even if it is still within the duration threshold.
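
For example, an operator might surface in-progress workers whose last checkpoint is older than a chosen threshold (a sketch using the same results fields queried in Scenario 3 below):

SELECT
    ws.workflow_step_uuid,
    ws.name,
    ws.results->>'last_checkpoint_cursor' as checkpoint_cursor,
    NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
  AND ws.current_state = 'InProgress'
  AND (ws.results->>'checkpoint_timestamp')::timestamptz < NOW() - INTERVAL '15 minutes';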

Common Operator Scenarios

Scenario 1: Transient Database Failure

Problem: 3 out of 5 batch workers failed due to database connection timeout.

Step 1: Find the stuck task in DLQ:

# Get investigation queue (prioritized by age and reason)
curl http://localhost:8080/v1/dlq/investigation-queue | jq

Step 2: Get task details and identify failed workers:

-- Get DLQ entry for the task
SELECT
    dlq.dlq_entry_uuid,
    dlq.task_uuid,
    dlq.dlq_reason,
    dlq.resolution_status,
    dlq.task_snapshot->'workflow_steps' as steps
FROM tasker.tasks_dlq dlq
WHERE dlq.task_uuid = 'task-uuid-here'
  AND dlq.resolution_status = 'pending';

-- Query task's workflow steps to find failed batch workers
SELECT
    ws.workflow_step_uuid,
    ws.name,
    ws.current_state,
    ws.attempts,
    ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = 'task-uuid-here'
  AND ws.name LIKE 'process_csv_batch_%'
  AND ws.current_state = 'Error';

Result:

workflow_step_uuid | name                   | current_state | attempts | last_error
-------------------|------------------------|---------------|----------|------------------
uuid-worker-2      | process_csv_batch_002  | Error         | 3        | DB timeout
uuid-worker-4      | process_csv_batch_004  | Error         | 3        | DB timeout
uuid-worker-5      | process_csv_batch_005  | Error         | 3        | DB timeout

Operator Action: Database is now healthy - reset workers for retry

# Get task UUID from DLQ entry
TASK_UUID="abc-123-task-uuid"

# Reset worker 2 (preserves cursor: rows 201-400)
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-2 \
  -H "Content-Type: application/json" \
  -d '{
    "action_type": "reset_for_retry",
    "reset_by": "operator@example.com",
    "reason": "Database connection restored, resetting attempts"
  }'

# Reset workers 4 and 5
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-4 \
  -H "Content-Type: application/json" \
  -d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Database connection restored"}'

curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-5 \
  -H "Content-Type: application/json" \
  -d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Database connection restored"}'

# Update DLQ entry to track resolution
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "resolution_status": "manually_resolved",
    "resolution_notes": "Reset 3 failed batch workers after database connection restored",
    "resolved_by": "operator@example.com"
  }'

Result:

  • Workers 2, 4, 5 return to Pending state
  • Cursor configs preserved in workflow_steps.inputs
  • Retry attempt counter reset to 0
  • Workers re-enqueued for execution
  • DLQ entry updated with resolution metadata

Scenario 2: Bad Data in Specific Batch

Problem: Worker 3 repeatedly fails due to malformed CSV row in its range (rows 401-600).

Investigation:

-- Get worker details
SELECT
    ws.name,
    ws.current_state,
    ws.attempts,
    ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.workflow_step_uuid = 'uuid-worker-3';

Result:

name: process_csv_batch_003
current_state: Error
attempts: 3
last_error: "CSV parsing failed at row 523: invalid price format"

Operator Decision: Row 523 has known data quality issue, already fixed in source system.

Option 1: Complete Manually (provide results for this batch):

TASK_UUID="abc-123-task-uuid"
STEP_UUID="uuid-worker-3"

curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/${STEP_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "action_type": "complete_manually",
    "completion_data": {
      "result": {
        "processed_count": 199,
        "total_inventory_value": 45232.50,
        "category_counts": {"electronics": 150, "clothing": 49},
        "batch_id": "003",
        "note": "Row 523 skipped due to data quality issue, manually verified totals"
      },
      "metadata": {
        "manually_verified": true,
        "verification_method": "manual_inspection",
        "skipped_rows": [523]
      }
    },
    "reason": "Manual completion after verifying corrected data in source system",
    "completed_by": "operator@example.com"
  }'

Option 2: Resolve Manually (skip this batch):

curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/${STEP_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "action_type": "resolve_manually",
    "resolved_by": "operator@example.com",
    "reason": "Non-critical batch containing known bad data, skipping 200 rows out of 1000 total"
  }'

Result (Option 1):

  • Worker 3 marked Complete with manual results
  • Convergence step receives manual results in aggregation
  • Task completes successfully with note about manual intervention

Result (Option 2):

  • Worker 3 marked ResolvedManually (no results provided)
  • Convergence step detects missing results, adjusts aggregation
  • Task completes with reduced total (800 rows instead of 1000)

Scenario 3: Long-Running Worker Needs Checkpoint

Problem: Worker 1 processing 10,000 rows, operator notices it’s been running 90 minutes (threshold: 120 minutes).

Investigation:

-- Check last checkpoint
SELECT
    ws.name,
    ws.current_state,
    ws.results->>'last_checkpoint_cursor' as last_checkpoint,
    ws.results->>'checkpoint_timestamp' as checkpoint_time,
    NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint
FROM tasker.workflow_steps ws
WHERE ws.workflow_step_uuid = 'uuid-worker-1';

Result:

name: process_large_batch_001
current_state: InProgress
last_checkpoint: 7850
checkpoint_time: 2025-01-15 11:30:00
time_since_checkpoint: 00:05:00

Operator Action: Worker is healthy and making progress (checkpointed 5 minutes ago at row 7850).

No action needed; the worker will complete normally. The operator adds an investigation note to the DLQ entry:

DLQ_ENTRY_UUID="dlq-entry-uuid-here"

curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {
      "investigation_notes": "Worker healthy, last checkpoint at row 7850 (5 min ago), estimated 15 min remaining",
      "investigator": "operator@example.com",
      "timestamp": "2025-01-15T11:35:00Z",
      "action_taken": "none - monitoring"
    }
  }'

Scenario 4: All Workers Failed - Batch Strategy Issue

Problem: All 10 workers fail with a “memory exhausted” error because the batch size is too large.

Investigation via API:

TASK_UUID="task-uuid-here"

# Get task details including all workflow steps
curl http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps | jq '.[] | select(.name | startswith("process_large_batch_")) | {name, current_state, attempts, last_error}'

Result: All workers show current_state: "Error" with the same OOM error in last_error.

Operator Action: Cancel the entire task and re-run it with a smaller batch size.

DLQ_ENTRY_UUID="dlq-entry-uuid-here"

# Cancel task (cancels all workers)
curl -X DELETE http://localhost:8080/v1/tasks/${TASK_UUID}

# Update DLQ entry to track resolution
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
  -H "Content-Type: application/json" \
  -d '{
    "resolution_status": "permanently_failed",
    "resolution_notes": "Batch size too large causing OOM. Cancelled task and created new task with batch_size: 5000 instead of 10000",
    "resolved_by": "operator@example.com",
    "metadata": {
      "root_cause": "configuration_error",
      "corrective_action": "reduced_batch_size",
      "new_task_uuid": "new-task-uuid-here"
    }
  }'

# Create new task with corrected configuration
curl -X POST http://localhost:8080/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "data_processing",
    "template_name": "large_dataset_processor",
    "context": {
      "dataset_id": "dataset-123",
      "batch_size": 5000,
      "max_workers": 20
    }
  }'

DLQ Query Patterns for Batch Processing

1. Find DLQ entry for a batch processing task:

-- Get DLQ entry with task snapshot
SELECT
    dlq.dlq_entry_uuid,
    dlq.task_uuid,
    dlq.dlq_reason,
    dlq.resolution_status,
    dlq.dlq_timestamp,
    dlq.resolution_notes,
    dlq.resolved_by,
    dlq.task_snapshot->'namespace_name' as namespace,
    dlq.task_snapshot->'template_name' as template,
    dlq.task_snapshot->'current_state' as task_state
FROM tasker.tasks_dlq dlq
WHERE dlq.task_uuid = :task_uuid
  AND dlq.resolution_status = 'pending'
ORDER BY dlq.dlq_timestamp DESC
LIMIT 1;

2. Check batch completion progress:

SELECT
    COUNT(*) FILTER (WHERE ws.current_state = 'Complete') as completed_workers,
    COUNT(*) FILTER (WHERE ws.current_state = 'InProgress') as in_progress_workers,
    COUNT(*) FILTER (WHERE ws.current_state = 'Error') as failed_workers,
    COUNT(*) FILTER (WHERE ws.current_state = 'Pending') as pending_workers,
    COUNT(*) FILTER (WHERE ws.current_state = 'WaitingForRetry') as waiting_retry,
    COUNT(*) as total_workers
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
  AND ws.name LIKE 'process_%_batch_%';

3. Find workers with stale checkpoints:

SELECT
    ws.workflow_step_uuid,
    ws.name,
    ws.current_state,
    ws.results->>'last_checkpoint_cursor' as checkpoint_cursor,
    ws.results->>'checkpoint_timestamp' as checkpoint_time,
    NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint,
    ws.attempts,
    ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
  AND ws.name LIKE 'process_%_batch_%'
  AND ws.current_state = 'InProgress'
  AND ws.results->>'checkpoint_timestamp' IS NOT NULL
  AND NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz > interval '15 minutes'
ORDER BY time_since_checkpoint DESC;

4. Get aggregated batch task health:

SELECT
    t.task_uuid,
    t.namespace_name,
    t.template_name,
    t.current_state as task_state,
    t.execution_status,
    COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as worker_count,
    jsonb_object_agg(
        ws.current_state,
        COUNT(*)
    ) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as worker_states,
    dlq.dlq_reason,
    dlq.resolution_status
FROM tasker.tasks t
JOIN tasker.workflow_steps ws ON ws.task_uuid = t.task_uuid
LEFT JOIN tasker.tasks_dlq dlq ON dlq.task_uuid = t.task_uuid
    AND dlq.resolution_status = 'pending'
WHERE t.task_uuid = :task_uuid
GROUP BY t.task_uuid, t.namespace_name, t.template_name, t.current_state, t.execution_status,
         dlq.dlq_reason, dlq.resolution_status;

5. Find all batch tasks in DLQ:

-- Find tasks with batch workers that are stuck
SELECT
    dlq.dlq_entry_uuid,
    dlq.task_uuid,
    dlq.dlq_reason,
    dlq.dlq_timestamp,
    t.namespace_name,
    t.template_name,
    t.current_state,
    COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as batch_worker_count,
    COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.current_state = 'Error' AND ws.name LIKE 'process_%_batch_%') as failed_workers
FROM tasker.tasks_dlq dlq
JOIN tasker.tasks t ON t.task_uuid = dlq.task_uuid
JOIN tasker.workflow_steps ws ON ws.task_uuid = dlq.task_uuid
WHERE dlq.resolution_status = 'pending'
GROUP BY dlq.dlq_entry_uuid, dlq.task_uuid, dlq.dlq_reason, dlq.dlq_timestamp,
         t.namespace_name, t.template_name, t.current_state
HAVING COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') > 0
ORDER BY dlq.dlq_timestamp DESC;

Operator Dashboard Recommendations

For monitoring batch processing tasks, operators should have dashboards showing:

  1. Batch Progress:

    • Total workers vs completed workers
    • Estimated completion time based on worker velocity
    • Current throughput (items/second across all workers)
  2. Stale Worker Alerts:

    • Workers exceeding duration threshold
    • Workers with stale checkpoints
    • Workers with repeated failures
  3. Batch Health Metrics:

    • Success rate per batch
    • Average processing time per worker
    • Resource utilization (CPU, memory)
  4. Resolution Actions:

    • Recent operator interventions
    • Resolution action distribution (ResetForRetry vs ResolveManually)
    • Time to resolution for stale workers
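
The estimated-completion panel can be driven by a simple extrapolation from worker counts. A minimal sketch, assuming workers take roughly similar time; the helper below is illustrative and not part of the Tasker API:

use std::time::Duration;

/// Rough completion estimate from worker velocity.
fn estimate_remaining(total_workers: u64, completed_workers: u64, elapsed: Duration) -> Option<Duration> {
    if completed_workers == 0 {
        return None; // no velocity data yet
    }
    let per_worker = elapsed.as_secs_f64() / completed_workers as f64;
    let remaining = total_workers.saturating_sub(completed_workers) as f64;
    Some(Duration::from_secs_f64(per_worker * remaining))
}

fn main() {
    // 6 of 10 workers done after 12 minutes => roughly 8 minutes remaining
    let eta = estimate_remaining(10, 6, Duration::from_secs(12 * 60));
    println!("estimated remaining: {:?}", eta);
}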

Code Examples

Complete Working Example: CSV Product Inventory

This section shows a complete end-to-end implementation processing a 1000-row CSV file in 5 parallel batches.

Rust Implementation

1. Batchable Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:60-150

pub struct CsvAnalyzerHandler;

#[async_trait]
impl StepHandler for CsvAnalyzerHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Get CSV file path from task context
        let csv_file_path = step_data
            .task
            .context
            .get("csv_file_path")
            .and_then(|v| v.as_str())
            .ok_or_else(|| anyhow!("Missing csv_file_path in task context"))?;

        // Count total data rows (excluding header)
        let file = File::open(csv_file_path)?;
        let reader = BufReader::new(file);
        let total_rows = reader.lines().count().saturating_sub(1) as u64;

        info!("CSV Analysis: {} rows in {}", total_rows, csv_file_path);

        // Get batch configuration
        let handler_init = step_data.handler_initialization.as_object().unwrap();
        let batch_size = handler_init
            .get("batch_size")
            .and_then(|v| v.as_u64())
            .unwrap_or(200);
        let max_workers = handler_init
            .get("max_workers")
            .and_then(|v| v.as_u64())
            .unwrap_or(5);

        // Determine if batching needed
        if total_rows == 0 {
            let outcome = BatchProcessingOutcome::no_batches();
            let elapsed_ms = start_time.elapsed().as_millis() as u64;

            return Ok(success_result(
                step_uuid,
                json!({
                    "batch_processing_outcome": outcome.to_value(),
                    "reason": "empty_dataset",
                    "total_rows": 0
                }),
                elapsed_ms,
                None,
            ));
        }

        // Calculate worker count
        let worker_count = ((total_rows as f64 / batch_size as f64).ceil() as u64)
            .min(max_workers);

        // Generate cursor configurations
        let actual_batch_size = (total_rows as f64 / worker_count as f64).ceil() as u64;
        let mut cursor_configs = Vec::new();

        for i in 0..worker_count {
            let start_row = (i * actual_batch_size) + 1; // 1-indexed after header
            let end_row = ((i + 1) * actual_batch_size).min(total_rows) + 1;

            cursor_configs.push(CursorConfig {
                batch_id: format!("{:03}", i + 1),
                start_cursor: json!(start_row),
                end_cursor: json!(end_row),
                batch_size: (end_row - start_row) as u32,
            });
        }

        info!(
            "Creating {} batch workers for {} rows (batch_size: {})",
            worker_count, total_rows, actual_batch_size
        );

        // Return CreateBatches outcome
        let outcome = BatchProcessingOutcome::create_batches(
            "process_csv_batch".to_string(),
            worker_count as u32,
            cursor_configs,
            total_rows,
        );

        let elapsed_ms = start_time.elapsed().as_millis() as u64;

        Ok(success_result(
            step_uuid,
            json!({
                "batch_processing_outcome": outcome.to_value(),
                "worker_count": worker_count,
                "total_rows": total_rows,
                "csv_file_path": csv_file_path
            }),
            elapsed_ms,
            Some(json!({
                "batch_size": actual_batch_size,
                "file_path": csv_file_path
            })),
        ))
    }
}

2. Batch Worker Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:200-350

pub struct CsvBatchProcessorHandler;

#[derive(Debug, Deserialize)]
struct Product {
    id: u32,
    title: String,
    category: String,
    price: f64,
    stock: u32,
    rating: f64,
}

#[async_trait]
impl StepHandler for CsvBatchProcessorHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Extract batch worker context using helper
        let context = BatchWorkerContext::from_step_data(step_data)?;

        // Check for no-op placeholder worker
        if context.is_no_op() {
            let elapsed_ms = start_time.elapsed().as_millis() as u64;
            return Ok(success_result(
                step_uuid,
                json!({
                    "no_op": true,
                    "reason": "NoBatches scenario - no items to process",
                    "batch_id": context.batch_id()
                }),
                elapsed_ms,
                None,
            ));
        }

        // Get CSV file path from dependency results
        let csv_file_path = step_data
            .dependency_results
            .get("analyze_csv")
            .and_then(|r| r.result.get("csv_file_path"))
            .and_then(|v| v.as_str())
            .ok_or_else(|| anyhow!("Missing csv_file_path from analyze_csv"))?;

        // Extract cursor range
        let start_row = context.start_position();
        let end_row = context.end_position();

        info!(
            "Processing batch {} (rows {}-{})",
            context.batch_id(),
            start_row,
            end_row
        );

        // Initialize metrics
        let mut processed_count = 0u64;
        let mut total_inventory_value = 0.0;
        let mut category_counts: HashMap<String, u32> = HashMap::new();
        let mut max_price = 0.0;
        let mut max_price_product = None;
        let mut total_rating = 0.0;

        // Open CSV and process rows in cursor range
        let file = File::open(Path::new(csv_file_path))?;
        let mut csv_reader = csv::ReaderBuilder::new()
            .has_headers(true)
            .from_reader(BufReader::new(file));

        for (row_idx, result) in csv_reader.deserialize::<Product>().enumerate() {
            let data_row_num = row_idx + 1; // 1-indexed after header

            if data_row_num < start_row {
                continue; // Skip rows before our range
            }
            if data_row_num >= end_row {
                break; // Processed all our rows
            }

            let product: Product = result?;

            // Calculate inventory metrics
            let inventory_value = product.price * (product.stock as f64);
            total_inventory_value += inventory_value;

            *category_counts.entry(product.category.clone()).or_insert(0) += 1;

            if product.price > max_price {
                max_price = product.price;
                max_price_product = Some(product.title.clone());
            }

            total_rating += product.rating;
            processed_count += 1;

            // Checkpoint progress periodically
            if processed_count % context.checkpoint_interval() == 0 {
                debug!(
                    "Checkpoint: batch {} processed {} items",
                    context.batch_id(),
                    processed_count
                );
            }
        }

        let average_rating = if processed_count > 0 {
            total_rating / (processed_count as f64)
        } else {
            0.0
        };

        let elapsed_ms = start_time.elapsed().as_millis() as u64;

        info!(
            "Batch {} complete: {} items processed",
            context.batch_id(),
            processed_count
        );

        Ok(success_result(
            step_uuid,
            json!({
                "processed_count": processed_count,
                "total_inventory_value": total_inventory_value,
                "category_counts": category_counts,
                "max_price": max_price,
                "max_price_product": max_price_product,
                "average_rating": average_rating,
                "batch_id": context.batch_id(),
                "start_row": start_row,
                "end_row": end_row
            }),
            elapsed_ms,
            None,
        ))
    }
}

3. Convergence Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:400-520

pub struct CsvResultsAggregatorHandler;

#[async_trait]
impl StepHandler for CsvResultsAggregatorHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Detect scenario using helper
        let scenario = BatchAggregationScenario::detect(
            &step_data.dependency_results,
            "analyze_csv",
            "process_csv_batch_",
        )?;

        let (total_processed, total_inventory_value, category_counts, max_price, max_price_product, overall_avg_rating, worker_count) = match scenario {
            BatchAggregationScenario::NoBatches { batchable_result } => {
                // No batch workers - get dataset size from batchable step
                let total_rows = batchable_result
                    .result
                    .get("total_rows")
                    .and_then(|v| v.as_u64())
                    .unwrap_or(0);

                info!("NoBatches scenario: {} rows (no processing needed)", total_rows);

                (total_rows, 0.0, HashMap::new(), 0.0, None, 0.0, 0)
            }

            BatchAggregationScenario::WithBatches {
                batch_results,
                worker_count,
            } => {
                info!("Aggregating results from {} batch workers", worker_count);

                let mut total_processed = 0u64;
                let mut total_inventory_value = 0.0;
                let mut global_category_counts: HashMap<String, u64> = HashMap::new();
                let mut max_price = 0.0;
                let mut max_price_product = None;
                let mut weighted_ratings = Vec::new();

                for (step_name, result) in batch_results {
                    // Sum processed counts
                    let count = result
                        .result
                        .get("processed_count")
                        .and_then(|v| v.as_u64())
                        .unwrap_or(0);
                    total_processed += count;

                    // Sum inventory values
                    let value = result
                        .result
                        .get("total_inventory_value")
                        .and_then(|v| v.as_f64())
                        .unwrap_or(0.0);
                    total_inventory_value += value;

                    // Merge category counts
                    if let Some(categories) = result
                        .result
                        .get("category_counts")
                        .and_then(|v| v.as_object())
                    {
                        for (category, cat_count) in categories {
                            *global_category_counts
                                .entry(category.clone())
                                .or_insert(0) += cat_count.as_u64().unwrap_or(0);
                        }
                    }

                    // Find global max price
                    let batch_max_price = result
                        .result
                        .get("max_price")
                        .and_then(|v| v.as_f64())
                        .unwrap_or(0.0);
                    if batch_max_price > max_price {
                        max_price = batch_max_price;
                        max_price_product = result
                            .result
                            .get("max_price_product")
                            .and_then(|v| v.as_str())
                            .map(String::from);
                    }

                    // Collect ratings for weighted average
                    let avg_rating = result
                        .result
                        .get("average_rating")
                        .and_then(|v| v.as_f64())
                        .unwrap_or(0.0);
                    weighted_ratings.push((count, avg_rating));
                }

                // Calculate overall weighted average rating
                let total_items = weighted_ratings.iter().map(|(c, _)| c).sum::<u64>();
                let overall_avg_rating = if total_items > 0 {
                    weighted_ratings
                        .iter()
                        .map(|(count, avg)| (*count as f64) * avg)
                        .sum::<f64>()
                        / (total_items as f64)
                } else {
                    0.0
                };

                (
                    total_processed,
                    total_inventory_value,
                    global_category_counts,
                    max_price,
                    max_price_product,
                    overall_avg_rating,
                    worker_count,
                )
            }
        };

        let elapsed_ms = start_time.elapsed().as_millis() as u64;

        info!(
            "Aggregation complete: {} total items processed by {} workers",
            total_processed, worker_count
        );

        Ok(success_result(
            step_uuid,
            json!({
                "total_processed": total_processed,
                "total_inventory_value": total_inventory_value,
                "category_counts": category_counts,
                "max_price": max_price,
                "max_price_product": max_price_product,
                "overall_average_rating": overall_avg_rating,
                "worker_count": worker_count
            }),
            elapsed_ms,
            None,
        ))
    }
}

Ruby Implementation

1. Batchable Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_analyzer_handler.rb

module BatchProcessing
  module StepHandlers
    # CSV Analyzer - Batchable Step
    class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
      def call(task, _sequence, step)
        csv_file_path = task.context['csv_file_path']
        raise ArgumentError, 'Missing csv_file_path in task context' unless csv_file_path

        # Count CSV rows (excluding header)
        total_rows = count_csv_rows(csv_file_path)

        Rails.logger.info("CSV Analysis: #{total_rows} rows in #{csv_file_path}")

        # Get batch configuration from handler initialization
        batch_size = step_definition_initialization['batch_size'] || 200
        max_workers = step_definition_initialization['max_workers'] || 5

        # Calculate worker count
        worker_count = [(total_rows.to_f / batch_size).ceil, max_workers].min

        if worker_count.zero? || total_rows.zero?
          # Use helper for NoBatches outcome
          return no_batches_success(
            reason: 'empty_dataset',
            total_rows: total_rows
          )
        end

        # Generate cursor configs using helper
        cursor_configs = generate_cursor_configs(
          total_items: total_rows,
          worker_count: worker_count
        ) do |batch_idx, start_pos, end_pos, items_in_batch|
          # Adjust to 1-indexed row numbers (after header)
          {
            'batch_id' => format('%03d', batch_idx + 1),
            'start_cursor' => start_pos + 1,
            'end_cursor' => end_pos + 1,
            'batch_size' => items_in_batch
          }
        end

        Rails.logger.info("Creating #{worker_count} batch workers for #{total_rows} rows")

        # Use helper for CreateBatches outcome
        create_batches_success(
          worker_template_name: 'process_csv_batch',
          worker_count: worker_count,
          cursor_configs: cursor_configs,
          total_items: total_rows,
          additional_data: {
            'csv_file_path' => csv_file_path
          }
        )
      end

      private

      def count_csv_rows(csv_file_path)
        CSV.read(csv_file_path, headers: true).length
      end

      def step_definition_initialization
        @step_definition_initialization ||= {}
      end
    end
  end
end

2. Batch Worker Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_batch_processor_handler.rb

module BatchProcessing
  module StepHandlers
    # CSV Batch Processor - Batch Worker
    class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
      Product = Struct.new(
        :id, :title, :description, :category, :price,
        :discount_percentage, :rating, :stock, :brand, :sku, :weight,
        keyword_init: true
      )

      def call(context)
        # Extract batch context using helper
        batch_ctx = get_batch_context(context)

        # Use helper to check for no-op worker
        no_op_result = handle_no_op_worker(batch_ctx)
        return no_op_result if no_op_result

        # Get CSV file path from dependency results
        csv_file_path = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
        raise ArgumentError, 'Missing csv_file_path from analyze_csv' unless csv_file_path

        Rails.logger.info("Processing batch #{batch_ctx.batch_id} (rows #{batch_ctx.start_cursor}-#{batch_ctx.end_cursor})")

        # Process CSV rows in cursor range
        metrics = process_csv_batch(
          csv_file_path,
          batch_ctx.start_cursor,
          batch_ctx.end_cursor
        )

        Rails.logger.info("Batch #{batch_ctx.batch_id} complete: #{metrics[:processed_count]} items processed")

        # Return results for aggregation
        success(
          result_data: {
            'processed_count' => metrics[:processed_count],
            'total_inventory_value' => metrics[:total_inventory_value],
            'category_counts' => metrics[:category_counts],
            'max_price' => metrics[:max_price],
            'max_price_product' => metrics[:max_price_product],
            'average_rating' => metrics[:average_rating],
            'batch_id' => batch_ctx.batch_id,
            'start_row' => batch_ctx.start_cursor,
            'end_row' => batch_ctx.end_cursor
          }
        )
      end

      private

      def process_csv_batch(csv_file_path, start_row, end_row)
        metrics = {
          processed_count: 0,
          total_inventory_value: 0.0,
          category_counts: Hash.new(0),
          max_price: 0.0,
          max_price_product: nil,
          ratings: []
        }

        CSV.foreach(csv_file_path, headers: true).with_index(1) do |row, data_row_num|
          # Skip rows before our range
          next if data_row_num < start_row
          # Break when we've processed all our rows
          break if data_row_num >= end_row

          product = parse_product(row)

          # Calculate inventory metrics
          inventory_value = product.price * product.stock
          metrics[:total_inventory_value] += inventory_value

          metrics[:category_counts][product.category] += 1

          if product.price > metrics[:max_price]
            metrics[:max_price] = product.price
            metrics[:max_price_product] = product.title
          end

          metrics[:ratings] << product.rating
          metrics[:processed_count] += 1
        end

        # Calculate average rating
        metrics[:average_rating] = if metrics[:ratings].any?
                                      metrics[:ratings].sum / metrics[:ratings].size.to_f
                                    else
                                      0.0
                                    end

        metrics.except(:ratings)
      end

      def parse_product(row)
        Product.new(
          id: row['id'].to_i,
          title: row['title'],
          description: row['description'],
          category: row['category'],
          price: row['price'].to_f,
          discount_percentage: row['discountPercentage'].to_f,
          rating: row['rating'].to_f,
          stock: row['stock'].to_i,
          brand: row['brand'],
          sku: row['sku'],
          weight: row['weight'].to_i
        )
      end
    end
  end
end

3. Convergence Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_results_aggregator_handler.rb

module BatchProcessing
  module StepHandlers
    # CSV Results Aggregator - Deferred Convergence Step
    class CsvResultsAggregatorHandler < TaskerCore::StepHandler::Batchable
      def call(_task, sequence, _step)
        # Detect scenario using helper
        scenario = detect_aggregation_scenario(
          sequence,
          batchable_step_name: 'analyze_csv',
          batch_worker_prefix: 'process_csv_batch_'
        )

        # Use helper for aggregation with custom block
        aggregate_batch_worker_results(scenario) do |batch_results|
          aggregate_csv_metrics(batch_results)
        end
      end

      private

      def aggregate_csv_metrics(batch_results)
        total_processed = 0
        total_inventory_value = 0.0
        global_category_counts = Hash.new(0)
        max_price = 0.0
        max_price_product = nil
        weighted_ratings = []

        batch_results.each do |step_name, batch_result|
          result = batch_result['result'] || {}

          # Sum processed counts
          count = result['processed_count'] || 0
          total_processed += count

          # Sum inventory values
          total_inventory_value += result['total_inventory_value'] || 0.0

          # Merge category counts
          (result['category_counts'] || {}).each do |category, cat_count|
            global_category_counts[category] += cat_count
          end

          # Find global max price
          batch_max_price = result['max_price'] || 0.0
          if batch_max_price > max_price
            max_price = batch_max_price
            max_price_product = result['max_price_product']
          end

          # Collect ratings for weighted average
          avg_rating = result['average_rating'] || 0.0
          weighted_ratings << { count: count, avg: avg_rating }
        end

        # Calculate overall weighted average rating
        total_items = weighted_ratings.sum { |r| r[:count] }
        overall_avg_rating = if total_items.positive?
                               weighted_ratings.sum { |r| r[:avg] * r[:count] } / total_items.to_f
                             else
                               0.0
                             end

        Rails.logger.info("Aggregation complete: #{total_processed} total items processed by #{batch_results.size} workers")

        {
          'total_processed' => total_processed,
          'total_inventory_value' => total_inventory_value,
          'category_counts' => global_category_counts,
          'max_price' => max_price,
          'max_price_product' => max_price_product,
          'overall_average_rating' => overall_avg_rating,
          'worker_count' => batch_results.size
        }
      end
    end
  end
end

YAML Template

File: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml

---
name: csv_product_inventory_analyzer
namespace_name: csv_processing
version: "1.0.0"
description: "Process CSV product data in parallel batches"
task_handler:
  callable: rust_handler
  initialization: {}

steps:
  # BATCHABLE STEP: CSV Analysis and Batch Planning
  - name: analyze_csv
    type: batchable
    dependencies: []
    handler:
      callable: CsvAnalyzerHandler
      initialization:
        batch_size: 200
        max_workers: 5

  # BATCH WORKER TEMPLATE: Single CSV Batch Processing
  # Orchestration creates N instances from this template
  - name: process_csv_batch
    type: batch_worker
    dependencies:
      - analyze_csv
    lifecycle:
      max_steps_in_process_minutes: 120
      max_retries: 3
      backoff_multiplier: 2.0
    handler:
      callable: CsvBatchProcessorHandler
      initialization:
        operation: "inventory_analysis"

  # DEFERRED CONVERGENCE STEP: CSV Results Aggregation
  - name: aggregate_csv_results
    type: deferred_convergence
    dependencies:
      - process_csv_batch  # Template dependency - resolves to all worker instances
    handler:
      callable: CsvResultsAggregatorHandler
      initialization:
        aggregation_type: "inventory_metrics"

Best Practices

1. Batch Size Calculation

Guideline: Balance parallelism with overhead.

Too Small:

  • Excessive orchestration overhead
  • Too many database transactions
  • Diminishing returns on parallelism

Too Large:

  • Workers timeout or OOM
  • Long retry times on failure
  • Reduced parallelism

Recommended Approach:

def calculate_optimal_batch_size(total_items, item_processing_time_ms)
  # Target: Each batch takes 5-10 minutes
  target_duration_ms = 7.5 * 60 * 1000

  # Calculate items per batch
  items_per_batch = (target_duration_ms / item_processing_time_ms).ceil

  # Enforce min/max bounds
  [[items_per_batch, 100].max, 10000].min
end

2. Worker Count Limits

Guideline: Limit concurrency based on system resources.

handler:
  initialization:
    batch_size: 200
    max_workers: 10  # Prevents creating 100 workers for 20,000 items

Considerations:

  • Database connection pool size
  • Memory per worker
  • External API rate limits
  • CPU cores available
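
A minimal sketch of deriving an effective worker cap from these constraints; the resource numbers are illustrative assumptions, not Tasker defaults:

/// Cap worker count by the tightest resource constraint (illustrative values).
fn effective_max_workers(configured_max: u64) -> u64 {
    let db_pool_headroom = 10;       // connections we can afford to dedicate
    let memory_budget_workers = 8;   // workers that fit in available memory
    let api_rate_limit_workers = 12; // concurrent callers the upstream API tolerates
    let cpu_cores = std::thread::available_parallelism()
        .map(|n| n.get() as u64)
        .unwrap_or(4);

    configured_max
        .min(db_pool_headroom)
        .min(memory_budget_workers)
        .min(api_rate_limit_workers)
        .min(cpu_cores)
}

fn main() {
    println!("effective max_workers: {}", effective_max_workers(20));
}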

3. Cursor Design

Guideline: Use cursors that support resumability.

Good Cursor Types:

  • ✅ Integer offsets: start_cursor: 1000, end_cursor: 2000
  • ✅ Timestamps: start_cursor: "2025-01-01T00:00:00Z"
  • ✅ Database IDs: start_cursor: uuid_a, end_cursor: uuid_b
  • ✅ Composite keys: { date: "2025-01-01", partition: "US-WEST" }

Bad Cursor Types:

  • ❌ Page numbers (data can shift between pages)
  • ❌ Non-deterministic ordering (random, relevance scores)
  • ❌ Mutable values (last_modified_at can change)
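
As an illustration, a timestamp-range cursor expressed in the same CursorConfig shape used by the CSV example above (the struct is re-declared here only to keep the sketch self-contained):

use serde_json::json;

/// Mirrors the CursorConfig shape from the CSV example (field names assumed from that example).
struct CursorConfig {
    batch_id: String,
    start_cursor: serde_json::Value,
    end_cursor: serde_json::Value,
    batch_size: u32,
}

fn main() {
    // One worker per day of data: deterministic ordering, immutable boundaries, resumable.
    let day_batch = CursorConfig {
        batch_id: "001".to_string(),
        start_cursor: json!("2025-01-01T00:00:00Z"),
        end_cursor: json!("2025-01-02T00:00:00Z"),
        batch_size: 5_000, // rough estimate of items expected in this range
    };
    println!(
        "batch {} covers {} .. {} (~{} items)",
        day_batch.batch_id, day_batch.start_cursor, day_batch.end_cursor, day_batch.batch_size
    );
}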

4. Checkpoint Frequency

Guideline: Balance resumability with performance.

// Checkpoint every 100 items
if processed_count % 100 == 0 {
    checkpoint_progress(step_uuid, current_cursor).await?;
}

Factors:

  • Item processing time (faster = higher frequency)
  • Worker failure rate (higher = more frequent checkpoints)
  • Database write overhead (less frequent = better performance)

Recommended:

  • Fast items (< 10ms each): Checkpoint every 1000 items
  • Medium items (10-100ms each): Checkpoint every 100 items
  • Slow items (> 100ms each): Checkpoint every 10 items
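
A minimal sketch that turns these tiers into a lookup; the helper is hypothetical, but the thresholds are the ones listed above:

/// Choose a checkpoint interval from observed per-item latency.
fn checkpoint_interval(avg_item_ms: f64) -> u64 {
    if avg_item_ms < 10.0 {
        1_000 // fast items: checkpoint every 1000 items
    } else if avg_item_ms <= 100.0 {
        100 // medium items: checkpoint every 100 items
    } else {
        10 // slow items: checkpoint every 10 items
    }
}

fn main() {
    assert_eq!(checkpoint_interval(3.0), 1_000);
    assert_eq!(checkpoint_interval(45.0), 100);
    assert_eq!(checkpoint_interval(250.0), 10);
}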

5. Error Handling Strategies

FailFast (default):

FailureStrategy::FailFast
  • Worker fails immediately on first error
  • Suitable for: Data validation, schema violations
  • The cursor is preserved for retry

ContinueOnFailure:

FailureStrategy::ContinueOnFailure
  • Worker processes all items, collects errors
  • Suitable for: Best-effort processing, partial results acceptable
  • Returns both results and error list

IsolateFailed:

FailureStrategy::IsolateFailed
  • Failed items moved to separate queue
  • Suitable for: Large batches with few expected failures
  • Allows manual review of failed items
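
As a sketch of what ContinueOnFailure means inside a worker loop (the item type and process_item below are placeholders, not Tasker APIs):

struct ItemError {
    index: usize,
    message: String,
}

// Placeholder item processor: parse a number, failing on malformed input.
fn process_item(item: &str) -> Result<u64, String> {
    item.parse::<u64>().map_err(|e| e.to_string())
}

fn main() {
    let items = ["10", "not-a-number", "32"];
    let mut results = Vec::new();
    let mut errors = Vec::new();

    // ContinueOnFailure: keep processing, collect both successes and failures
    for (index, item) in items.iter().enumerate() {
        match process_item(item) {
            Ok(value) => results.push(value),
            Err(message) => errors.push(ItemError { index, message }),
        }
    }

    println!("processed {} items, {} failures", results.len(), errors.len());
    for e in &errors {
        println!("item {} failed: {}", e.index, e.message);
    }
}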

6. Aggregation Patterns

Sum/Count:

let total = batch_results.iter()
    .map(|(_, r)| r.result.get("count").unwrap().as_u64().unwrap())
    .sum::<u64>();

Max/Min:

let max_value = batch_results.iter()
    .filter_map(|(_, r)| r.result.get("max").and_then(|v| v.as_f64()))
    .max_by(|a, b| a.partial_cmp(b).unwrap())
    .unwrap_or(0.0);

Weighted Average:

let total_weight: u64 = weighted_values.iter().map(|(w, _)| w).sum();
let weighted_avg = weighted_values.iter()
    .map(|(weight, value)| (*weight as f64) * value)
    .sum::<f64>() / (total_weight as f64);

Merge HashMaps:

let mut merged = HashMap::new();
for (_, result) in batch_results {
    if let Some(counts) = result.get("counts").and_then(|v| v.as_object()) {
        for (key, count) in counts {
            *merged.entry(key.clone()).or_insert(0) += count.as_u64().unwrap();
        }
    }
}

7. Testing Strategies

Unit Tests: Test handler logic independently

#[test]
fn test_cursor_generation() {
    let configs = create_cursor_configs(1000, 5, 200);
    assert_eq!(configs.len(), 5);
    assert_eq!(configs[0].start_cursor, json!(0));
    assert_eq!(configs[0].end_cursor, json!(200));
}

Integration Tests: Test with small datasets

#[tokio::test]
async fn test_batch_processing_integration() {
    let task = create_task_with_csv("test_data_10_rows.csv").await;
    assert_eq!(task.current_state, TaskState::Complete);

    let steps = get_workflow_steps(task.task_uuid).await;
    let workers = steps.iter().filter(|s| s.step_type == "batch_worker").count();
    assert_eq!(workers, 1); // 10 rows = 1 worker with batch_size 200
}

E2E Tests: Test complete workflow with realistic data

#[tokio::test]
async fn test_csv_batch_processing_e2e() {
    let task = create_task_with_csv("products_1000_rows.csv").await;
    wait_for_completion(task.task_uuid, Duration::from_secs(60)).await;

    let results = get_aggregation_results(task.task_uuid).await;
    assert_eq!(results["total_processed"], 1000);
    assert_eq!(results["worker_count"], 5);
}

8. Monitoring and Observability

Metrics to Track:

  • Worker creation time
  • Individual worker duration
  • Batch size distribution
  • Retry rate per batch
  • Aggregation duration

Recommended Dashboards:

-- Batch processing health
SELECT
    COUNT(*) FILTER (WHERE step_type = 'batch_worker') as total_workers,
    AVG(EXTRACT(EPOCH FROM (updated_at - created_at))) as avg_worker_duration_sec,
    MAX(EXTRACT(EPOCH FROM (updated_at - created_at))) as max_worker_duration_sec,
    COUNT(*) FILTER (WHERE current_state = 'Error') as failed_workers
FROM tasker.workflow_steps
WHERE task_uuid = :task_uuid
  AND step_type = 'batch_worker';

Summary

Batch processing in Tasker provides a robust, production-ready pattern for parallel dataset processing:

Key Strengths:

  • ✅ Builds on proven DAG, retry, and deferred convergence foundations
  • ✅ No special recovery system needed (uses standard DLQ + retry)
  • ✅ Transaction-based worker creation prevents corruption
  • ✅ Cursor-based resumability enables long-running processing
  • ✅ Language-agnostic design works across Rust and Ruby workers

Integration Points:

  • DAG: Workers are full nodes with standard lifecycle
  • Retryability: Uses lifecycle.max_retries and exponential backoff
  • Deferred Convergence: Intersection semantics aggregate dynamic worker counts
  • DLQ: Standard operator workflows with cursor preservation

Production Readiness:

  • 908 tests passing (Ruby workers)
  • Real-world CSV processing (1000 rows)
  • Docker integration working
  • Code review complete with recommended fixes

For More Information:

  • Conditional Workflows: See docs/conditional-workflows.md
  • DLQ System: See docs/dlq-system.md
  • Code Examples: See workers/rust/src/step_handlers/batch_processing_*.rs

Caching Guide

This guide covers Tasker’s distributed caching system, including configuration, backend selection, circuit breaker protection, and operational considerations.

Overview

Tasker provides optional caching for:

  • Task Templates: Reduces database queries when loading workflow definitions
  • Analytics: Caches performance metrics and bottleneck analysis results

Caching is disabled by default and must be explicitly enabled in configuration.

Configuration

Basic Setup

[common.cache]
enabled = true
backend = "redis"              # or "dragonfly" / "moka" / "memory" / "in-memory"
default_ttl_seconds = 3600     # 1 hour default
template_ttl_seconds = 3600    # 1 hour for templates
analytics_ttl_seconds = 60     # 1 minute for analytics
key_prefix = "tasker"          # Namespace for cache keys

[common.cache.redis]
url = "${REDIS_URL:-redis://localhost:6379}"
max_connections = 10
connection_timeout_seconds = 5
database = 0

[common.cache.moka]
max_capacity = 10000           # Maximum entries in cache

Backend Selection

| Backend   | Config Value                  | Use Case                                                    |
|-----------|-------------------------------|-------------------------------------------------------------|
| Redis     | "redis"                       | Multi-instance deployments (production)                     |
| Dragonfly | "dragonfly"                   | Redis-compatible with better multi-threaded performance     |
| Memcached | "memcached"                   | Simple distributed cache (requires cache-memcached feature) |
| Moka      | "moka", "memory", "in-memory" | Single-instance, development, DoS protection                |
| NoOp      | (enabled = false)             | Disabled, always-miss                                       |

Cache Backends

Redis (Distributed)

Redis is the recommended backend for production deployments:

  • Shared state: All instances see the same cache entries
  • Invalidation works: Worker bootstrap invalidations propagate to all instances
  • Persistence: Survives process restarts (if Redis is configured for persistence)
[common.cache]
enabled = true
backend = "redis"

[common.cache.redis]
url = "redis://redis.internal:6379"

Dragonfly (Distributed)

Dragonfly is a Redis-compatible in-memory data store with better multi-threaded performance. It uses the same port (6379) and protocol as Redis, so no code changes are required.

  • Redis compatible: Drop-in replacement for Redis
  • Better performance: Multi-threaded architecture for higher throughput
  • Shared state: Same distributed semantics as Redis
[common.cache]
enabled = true
backend = "dragonfly"  # Uses Redis provider internally

[common.cache.redis]
url = "redis://dragonfly.internal:6379"

Note: Dragonfly is used in Tasker’s test and CI environments for improved performance. For production, either Redis or Dragonfly works.

Memcached (Distributed)

Memcached is a simple, high-performance distributed cache. It requires the cache-memcached feature flag (not enabled by default).

  • Simple protocol: Lightweight key-value store
  • Distributed: State is shared across instances
  • No pattern deletion: Relies on TTL expiry (like Moka)
[common.cache]
enabled = true
backend = "memcached"

[common.cache.memcached]
url = "tcp://memcached.internal:11211"
connection_timeout_seconds = 5

Note: Enable with cargo build --features cache-memcached. Not enabled by default to reduce dependency footprint.

Moka (In-Memory)

Moka provides a high-performance in-memory cache:

  • Zero network latency: All operations are in-process
  • DoS protection: Rate-limits expensive operations without Redis dependency
  • Single-instance only: Cache is not shared across processes
[common.cache]
enabled = true
backend = "moka"

[common.cache.moka]
max_capacity = 10000

Important: Moka is only suitable for:

  • Single-instance deployments
  • Development environments
  • Analytics caching (where brief staleness is acceptable)

NoOp (Disabled)

When caching is disabled or a backend fails to initialize:

[common.cache]
enabled = false

The NoOp provider always returns cache misses and succeeds on writes (no-op). It is also used as a graceful fallback when the Redis connection fails.

Circuit Breaker Protection

The cache circuit breaker prevents repeated timeout penalties when Redis/Dragonfly is unavailable. Instead of waiting for connection timeouts on every request, the circuit breaker fails fast after detecting failures.

Configuration

[common.circuit_breakers.component_configs.cache]
failure_threshold = 5    # Open after 5 consecutive failures
timeout_seconds = 15     # Test recovery after 15 seconds
success_threshold = 2    # Close after 2 successful calls

Behavior When Circuit is Open

When the circuit breaker is open (cache unavailable):

| Operation      | Behavior                   |
|----------------|----------------------------|
| get()          | Returns None (cache miss)  |
| set()          | Returns Ok(()) (no-op)     |
| delete()       | Returns Ok(()) (no-op)     |
| health_check() | Returns false (unhealthy)  |

This fail-fast behavior ensures:

  1. Requests don’t wait for connection timeouts
  2. Database queries still work (cache miss → DB fallback)
  3. Recovery is automatic when Redis/Dragonfly becomes available
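
A minimal sketch of the fail-fast idea, using a simplified stand-in for the circuit state rather than Tasker's actual types:

#[derive(Debug, PartialEq)]
enum CircuitState { Closed, Open }

/// When the circuit is open, skip the backend call and report a miss so the
/// caller falls through to the database instead of waiting on a timeout.
fn cached_get(state: &CircuitState, key: &str) -> Option<String> {
    if *state == CircuitState::Open {
        return None; // fail fast: treated as a cache miss
    }
    let _ = key; // the real backend lookup would happen here
    None
}

fn main() {
    for state in [CircuitState::Closed, CircuitState::Open] {
        match cached_get(&state, "tasker:analytics:performance:24") {
            Some(v) => println!("cache hit: {v}"),
            None => println!("cache miss ({state:?}), querying database"),
        }
    }
}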

Circuit States

| State     | Description                                 |
|-----------|---------------------------------------------|
| Closed    | Normal operation, all calls go through      |
| Open      | Failing fast, calls return fallback values  |
| Half-Open | Testing recovery, limited calls allowed     |

Monitoring

Circuit state is logged at state transitions:

INFO  Circuit breaker half-open (testing recovery)
INFO  Circuit breaker closed (recovered)
ERROR Circuit breaker opened (failing fast)

Usage Context Constraints

Different caching use cases have different consistency requirements. Tasker enforces these constraints at runtime:

Template Caching

Constraint: Requires distributed cache (Redis) or no cache (NoOp)

Templates are cached to avoid repeated database queries when loading workflow definitions. However, workers invalidate the template cache on bootstrap when they register new handler versions.

If an in-memory cache (Moka) is used:

  1. Orchestration server caches templates in its local memory
  2. Worker boots and invalidates templates in Redis (or nowhere, if Moka)
  3. Orchestration server never sees the invalidation
  4. Stale templates are served → operational errors

Behavior with Moka: Template caching is automatically disabled with a warning:

WARN Cache provider 'moka' is not safe for template caching (in-memory cache
     would drift from worker invalidations). Template caching disabled.

Analytics Caching

Constraint: Any backend allowed

Analytics data is informational and TTL-bounded. Brief staleness is acceptable, and in-memory caching provides DoS protection for expensive aggregation queries.

Behavior with Moka: Analytics caching works normally.

Cache Keys

Cache keys are prefixed with the configured key_prefix to allow multiple Tasker deployments to share a Redis instance:

| Resource            | Key Pattern                                             |
|---------------------|---------------------------------------------------------|
| Templates           | {prefix}:template:{namespace}:{name}:{version}          |
| Performance Metrics | {prefix}:analytics:performance:{hours}                  |
| Bottleneck Analysis | {prefix}:analytics:bottlenecks:{limit}:{min_executions} |
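
A sketch of how these keys compose; the helper functions are hypothetical, but the patterns match the table above:

fn template_key(prefix: &str, namespace: &str, name: &str, version: &str) -> String {
    format!("{prefix}:template:{namespace}:{name}:{version}")
}

fn performance_key(prefix: &str, hours: u32) -> String {
    format!("{prefix}:analytics:performance:{hours}")
}

fn main() {
    assert_eq!(
        template_key("tasker", "csv_processing", "csv_product_inventory_analyzer", "1.0.0"),
        "tasker:template:csv_processing:csv_product_inventory_analyzer:1.0.0"
    );
    println!("{}", performance_key("tasker", 24));
}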

Operational Patterns

Multi-Instance Production

[common.cache]
enabled = true
backend = "redis"
template_ttl_seconds = 3600    # Long TTL, rely on invalidation
analytics_ttl_seconds = 60     # Short TTL for fresh data
  • Templates cached for 1 hour but invalidated on worker registration
  • Analytics cached briefly to reduce database load

Single-Instance / Development

[common.cache]
enabled = true
backend = "moka"
template_ttl_seconds = 300     # Shorter TTL since no invalidation
analytics_ttl_seconds = 30
  • Template caching automatically disabled (Moka constraint)
  • Analytics caching works, provides DoS protection

Caching Disabled

[common.cache]
enabled = false
  • All cache operations are no-ops
  • Every request hits the database
  • Useful for debugging or when cache adds complexity without benefit

Graceful Degradation

Tasker never fails to start due to cache issues:

  1. Redis connection failure: Falls back to NoOp with warning
  2. Backend misconfiguration: Falls back to NoOp with warning
  3. Cache operation errors: Logged as warnings, never propagated
WARN Failed to connect to Redis, falling back to NoOp cache (graceful degradation)

The cache layer uses “best-effort” writes—failures are logged but never block request processing.
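
A minimal sketch of a best-effort write, assuming a stand-in backend call rather than the real cache API:

/// Log-and-continue on cache write failure; never bubble the error up.
fn best_effort_set(key: &str, value: &str) {
    match try_set(key, value) {
        Ok(()) => {}
        Err(err) => eprintln!("WARN cache set failed for {key}: {err} (continuing without cache)"),
    }
}

// Stand-in for the real backend call; always fails here to show the fallback path.
fn try_set(_key: &str, _value: &str) -> Result<(), String> {
    Err("connection refused".to_string())
}

fn main() {
    best_effort_set("tasker:analytics:performance:24", "{}"); // placeholder payload
    println!("request processing continues");
}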

Monitoring

Cache Hit/Miss Rates

Cache operations are logged at DEBUG level:

DEBUG hours=24 "Performance metrics cache HIT"
DEBUG hours=24 "Performance metrics cache MISS, querying DB"

Provider Status

On startup, the active cache provider is logged:

INFO backend="redis" "Distributed cache provider initialized successfully"
INFO backend="moka" max_capacity=10000 "In-memory cache provider initialized"
INFO "Distributed cache disabled by configuration"

Troubleshooting

Templates Not Caching

  1. Check if backend is Moka—template caching is disabled with Moka
  2. Check for Redis connection warnings in logs
  3. Verify enabled = true in configuration

Stale Templates Being Served

  1. Verify all instances point to the same Redis
  2. Check that workers are properly invalidating on bootstrap
  3. Consider reducing template_ttl_seconds

High Cache Miss Rate

  1. Check Redis connectivity and latency
  2. Verify TTL settings aren’t too aggressive
  3. Check for cache key collisions (multiple deployments, same prefix)

Memory Growth with Moka

  1. Reduce max_capacity setting
  2. Check TTL settings—items evict on TTL or capacity limit
  3. Monitor entry count if metrics are available

Conditional Workflows and Decision Points

Last Updated: 2025-10-27 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | Use Cases & Patterns | States and Lifecycles

← Back to Documentation Hub


Overview

Conditional workflows enable runtime decision-making that dynamically determines which workflow steps to execute based on business logic. Unlike static DAG workflows where all steps are predefined, conditional workflows use decision point steps to create steps on-demand based on runtime conditions.

Dynamic Workflow Decision Points provide this capability through:

  • Decision Point Steps: Special step type that evaluates business logic and returns step names to create
  • Deferred Steps: Step type with dynamic dependency resolution using intersection semantics
  • Type-Safe Integration: Ruby and Rust helpers ensuring clean serialization between languages

Table of Contents

  1. When to Use Conditional Workflows
  2. Logical Pattern
  3. Architecture and Implementation
  4. YAML Configuration
  5. Simple Example: Approval Routing
  6. Complex Example: Multi-Tier Approval
  7. Ruby Implementation Guide
  8. Rust Implementation Guide
  9. Best Practices
  10. Limitations and Constraints

When to Use Conditional Workflows

✅ Use Conditional Workflows When:

1. Business Logic Determines Execution Path

  • Approval workflows with amount-based routing (small/medium/large)
  • Risk-based processing (low/medium/high risk paths)
  • Tiered customer service (bronze/silver/gold/platinum)
  • Regulatory compliance with jurisdictional variations

2. Step Requirements Are Unknown Until Runtime

  • Dynamic validation checks based on request type
  • Multi-stage approvals where approval count depends on amount
  • Conditional enrichment steps based on data completeness
  • Parallel processing with variable worker count

3. Workflow Complexity Varies By Input

  • Simple cases skip expensive steps
  • Complex cases trigger additional validation
  • Emergency processing bypasses normal checks
  • VIP customers get expedited handling

❌ Don’t Use Conditional Workflows When:

1. Static DAG is Sufficient

  • All possible execution paths known at design time
  • Complexity overhead not justified
  • Simple if/else can be handled in handler code

2. Purely Sequential Logic

  • No parallelism or branching needed
  • Handler code can make decisions directly

3. Real-Time Sub-Second Decisions

  • Decision overhead (~10-20ms) not acceptable
  • In-memory processing required

Logical Pattern

Core Concepts

Task Initialization
       ↓
Regular Step(s)
       ↓
Decision Point Step ← Evaluates business logic
       ↓
   [Decision Made]
       ↓
   ┌───┴───┐
   ↓       ↓
Path A  Path B  ← Steps created dynamically
   ↓       ↓
   └───┬───┘
       ↓
Convergence Step ← Deferred dependencies resolve via intersection
       ↓
Task Complete

Decision Point Pattern

  1. Evaluation Phase: Decision point step executes handler
  2. Decision Output: Handler returns list of step names to create
  3. Dynamic Creation: Orchestration creates specified steps with proper dependencies
  4. Execution: Created steps execute like normal steps
  5. Convergence: Deferred steps wait for intersection of declared dependencies + created steps

Intersection Semantics for Deferred Steps

Declared Dependencies (in template):

- step_a
- step_b
- step_c

Actually Created Steps (by decision point):

Only step_a and step_c were created

Effective Dependencies (intersection):

step_a AND step_c  (step_b ignored since not created)

This enables convergence steps that work regardless of which path was taken.
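
A minimal sketch of the intersection computation (set types chosen for illustration; the orchestrator's internal representation differs):

use std::collections::BTreeSet;

/// Effective dependencies = declared dependencies ∩ steps that were actually created.
fn effective_dependencies(declared: &BTreeSet<String>, created: &BTreeSet<String>) -> BTreeSet<String> {
    declared.intersection(created).cloned().collect()
}

fn main() {
    let declared: BTreeSet<String> =
        ["step_a", "step_b", "step_c"].iter().map(|s| s.to_string()).collect();
    let created: BTreeSet<String> =
        ["step_a", "step_c"].iter().map(|s| s.to_string()).collect();

    // step_b was never created, so the deferred step waits only on step_a and step_c
    let expected: BTreeSet<String> =
        ["step_a", "step_c"].iter().map(|s| s.to_string()).collect();
    assert_eq!(effective_dependencies(&declared, &created), expected);
}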


Architecture and Implementation

Step Type: Decision Point

Decision point steps are regular steps with a special handler that returns a DecisionPointOutcome:

pub enum DecisionPointOutcome {
    NoBranches,               // No additional steps needed
    CreateSteps {             // Dynamically create these steps
        step_names: Vec<String>,
    },
}

Key Characteristics:

  • Executes like a normal step
  • Result includes decision_point_outcome field
  • Orchestration detects outcome and creates steps
  • Created steps depend on the decision point step
  • Fully atomic - either all steps created or none

Step Type: Deferred

Deferred steps use intersection semantics for dependency resolution:

type: deferred  # Special step type
dependencies:
  - routing_decision  # Must wait for decision point
  - step_a           # Might be created
  - step_b           # Might be created
  - step_c           # Might be created

Resolution Logic:

  1. Wait for decision point to complete
  2. Check which declared dependencies actually exist
  3. Wait only for intersection of declared + created
  4. Execute when all existing dependencies complete

Orchestration Flow

┌─────────────────────────────────────────┐
│ Step Result Processor                   │
│                                         │
│ 1. Check if result has                  │
│    decision_point_outcome field         │
│                                         │
│ 2. If CreateSteps:                      │
│    - Validate step names exist          │
│    - Create WorkflowStep records        │
│    - Set dependencies                   │
│    - Enqueue for execution              │
│                                         │
│ 3. If NoBranches:                       │
│    - Continue normally                  │
│                                         │
│ 4. Metrics and telemetry:               │
│    - Track steps_created count          │
│    - Log decision outcome               │
│    - Warn if depth limit approached     │
└─────────────────────────────────────────┘

Configuration

Decision point behavior is configured per environment:

# config/tasker/base/orchestration.toml
[orchestration.decision_points]
enabled = true
max_depth = 3           # Prevent infinite recursion
warn_threshold = 2      # Warn when nearing limit

YAML Configuration

Task Template Structure

Actual Implementation (from tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml):

---
name: approval_routing
namespace_name: conditional_approval
version: 1.0.0
description: >
  Ruby implementation of conditional approval workflow demonstrating dynamic decision points.
  Routes approval requests through different paths based on amount thresholds.
task_handler:
  callable: tasker_worker_ruby::TaskHandler
  initialization: {}
steps:
  - name: validate_request
    type: standard
    dependencies: []
    handler:
      callable: ConditionalApproval::StepHandlers::ValidateRequestHandler
      initialization: {}

  - name: routing_decision
    type: decision  # DECISION POINT
    dependencies:
      - validate_request
    handler:
      callable: ConditionalApproval::StepHandlers::RoutingDecisionHandler
      initialization: {}

  - name: finalize_approval
    type: deferred  # DEFERRED - uses intersection semantics
    dependencies:
      - auto_approve       # ALL possible dependencies listed
      - manager_approval   # System computes intersection at runtime
      - finance_review
    handler:
      callable: ConditionalApproval::StepHandlers::FinalizeApprovalHandler
      initialization: {}

  # Possible dynamic branches (created by decision point)
  - name: auto_approve
    type: standard
    dependencies:
      - routing_decision
    handler:
      callable: ConditionalApproval::StepHandlers::AutoApproveHandler
      initialization: {}

  - name: manager_approval
    type: standard
    dependencies:
      - routing_decision
    handler:
      callable: ConditionalApproval::StepHandlers::ManagerApprovalHandler
      initialization: {}

  - name: finance_review
    type: standard
    dependencies:
      - routing_decision
    handler:
      callable: ConditionalApproval::StepHandlers::FinanceReviewHandler
      initialization: {}

Key Points:

  • type: decision marks the decision point step
  • type: deferred enables intersection semantics for convergence
  • ALL possible dependencies listed in deferred step
  • Orchestration computes: declared deps ∩ actually created steps

Simple Example: Approval Routing

Business Requirement

Route approval requests based on amount:

  • < $1,000: Auto-approve (no human intervention)
  • $1,000 - $4,999: Manager approval required
  • ≥ $5,000: Manager + Finance approval required

Template Configuration

namespace: approval_workflows
name: simple_routing
version: "1.0"

steps:
  - name: validate_request
    handler: validate_request

  - name: routing_decision
    handler: routing_decision
    type: decision
    dependencies:
      - validate_request

  - name: auto_approve
    handler: auto_approve
    dependencies:
      - routing_decision

  - name: manager_approval
    handler: manager_approval
    dependencies:
      - routing_decision

  - name: finance_review
    handler: finance_review
    dependencies:
      - routing_decision

  - name: finalize_approval
    handler: finalize_approval
    type: deferred
    dependencies:
      - routing_decision
      - auto_approve
      - manager_approval
      - finance_review

Ruby Handler Implementation

Actual Implementation (from workers/ruby/spec/handlers/examples/conditional_approval/step_handlers/routing_decision_handler.rb):

# frozen_string_literal: true

module ConditionalApproval
  module StepHandlers
    # Routing Decision: DECISION POINT that routes approval based on amount
    #
    # Uses TaskerCore::StepHandler::Decision base class for clean, type-safe
    # decision outcome serialization consistent with Rust expectations.
    class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
      SMALL_AMOUNT_THRESHOLD = 1_000
      LARGE_AMOUNT_THRESHOLD = 5_000

      def call(task, _sequence, _step)
        # Get amount from validated request
        amount = task.context['amount']
        raise 'Amount is required for routing decision' unless amount

        # Make routing decision based on amount
        route = determine_route(amount)

        # Use Decision base class helper for clean outcome serialization
        decision_success(
          steps: route[:steps],
          result_data: {
            route_type: route[:type],
            reasoning: route[:reasoning],
            amount: amount
          },
          metadata: {
            operation: 'routing_decision',
            route_thresholds: {
              small: SMALL_AMOUNT_THRESHOLD,
              large: LARGE_AMOUNT_THRESHOLD
            }
          }
        )
      end

      private

      def determine_route(amount)
        if amount < SMALL_AMOUNT_THRESHOLD
          {
            type: 'auto_approval',
            steps: ['auto_approve'],
            reasoning: "Amount $#{amount} below threshold - auto-approval"
          }
        elsif amount < LARGE_AMOUNT_THRESHOLD
          {
            type: 'manager_only',
            steps: ['manager_approval'],
            reasoning: "Amount $#{amount} requires manager approval"
          }
        else
          {
            type: 'dual_approval',
            steps: %w[manager_approval finance_review],
            reasoning: "Amount $#{amount} >= $#{LARGE_AMOUNT_THRESHOLD} - dual approval required"
          }
        end
      end
    end
  end
end

Key Ruby Patterns:

  • Inherit from TaskerCore::StepHandler::Decision - Specialized base class for decision points
  • Use helper method decision_success(steps:, result_data:, metadata:) - Clean API for decision outcomes
  • Helper automatically creates DecisionPointOutcome and embeds it correctly
  • No manual serialization needed - base class handles Rust compatibility
  • For no-branch scenarios, use decision_no_branches(result_data:, metadata:)

Execution Flow Examples

Example 1: Small Amount ($500)

1. validate_request → Complete
2. routing_decision → Complete (creates: auto_approve)
3. auto_approve     → Complete
4. finalize_approval → Complete
   (waits for: declared deps ∩ created steps = {auto_approve})

Total Steps Created: 4
Execution Time: ~500ms

Example 2: Medium Amount ($2,500)

1. validate_request  → Complete
2. routing_decision  → Complete (creates: manager_approval)
3. manager_approval  → Complete
4. finalize_approval → Complete
   (waits for: declared deps ∩ created steps = {manager_approval})

Total Steps Created: 4
Execution Time: ~2s (human approval delay)

Example 3: Large Amount ($10,000)

1. validate_request  → Complete
2. routing_decision  → Complete (creates: manager_approval, finance_review)
3. manager_approval  → Complete (parallel)
3. finance_review    → Complete (parallel)
4. finalize_approval → Complete
   (waits for: declared deps ∩ created steps = {manager_approval, finance_review})

Total Steps Created: 5
Execution Time: ~3s (parallel approvals)

Complex Example: Multi-Tier Approval

Business Requirement

Implement sophisticated approval routing with:

  • Risk assessment step
  • Tiered approval requirements
  • Emergency override path
  • Compliance checks based on jurisdiction

Template Configuration

namespace: approval_workflows
name: multi_tier_approval
version: "1.0"

steps:
  # Phase 1: Initial validation and risk assessment
  - name: validate_request
    handler: validate_request

  - name: assess_risk
    handler: assess_risk
    dependencies:
      - validate_request

  # Phase 2: Primary routing decision
  - name: primary_routing
    handler: primary_routing
    type: decision_point
    dependencies:
      - assess_risk

  # Phase 3: Conditional approval paths
  - name: emergency_approval
    handler: emergency_approval
    dependencies:
      - primary_routing

  - name: standard_manager_approval
    handler: standard_manager_approval
    dependencies:
      - primary_routing

  - name: senior_manager_approval
    handler: senior_manager_approval
    dependencies:
      - primary_routing

  # Phase 4: Secondary routing for high-risk cases
  - name: compliance_routing
    handler: compliance_routing
    type: decision_point
    dependencies:
      - primary_routing
      - senior_manager_approval  # Only if created

  # Phase 5: Compliance paths
  - name: legal_review
    handler: legal_review
    dependencies:
      - compliance_routing

  - name: fraud_investigation
    handler: fraud_investigation
    dependencies:
      - compliance_routing

  - name: jurisdictional_check
    handler: jurisdictional_check
    dependencies:
      - compliance_routing

  # Phase 6: Convergence
  - name: finalize_approval
    handler: finalize_approval
    type: deferred
    dependencies:
      - primary_routing
      - emergency_approval
      - standard_manager_approval
      - senior_manager_approval
      - compliance_routing
      - legal_review
      - fraud_investigation
      - jurisdictional_check

Ruby Handler: Primary Routing

class PrimaryRoutingHandler < TaskerCore::StepHandler::Decision
  def call(task, sequence, _step)
    amount = task.context['amount']
    risk_score = sequence.get_results('assess_risk')['risk_score']
    is_emergency = task.context['emergency'] == true

    steps_to_create = if is_emergency && amount < 10_000
      # Emergency override path
      ['emergency_approval']
    elsif risk_score < 30 && amount < 5_000
      # Low risk, standard approval
      ['standard_manager_approval']
    else
      # High risk or large amount - senior approval + compliance routing
      ['senior_manager_approval', 'compliance_routing']
    end

    decision_success(
      steps: steps_to_create,
      result_data: {
        route_type: determine_route_type(is_emergency, risk_score, amount),
        risk_score: risk_score,
        amount: amount,
        emergency: is_emergency
      }
    )
  end

  private

  # Illustrative helper (not part of the quoted excerpt): labels the chosen route
  def determine_route_type(is_emergency, risk_score, amount)
    return 'emergency_override' if is_emergency && amount < 10_000
    return 'standard_approval' if risk_score < 30 && amount < 5_000

    'senior_with_compliance'
  end
end

Ruby Handler: Compliance Routing (Nested Decision)

class ComplianceRoutingHandler < TaskerCore::StepHandler::Decision
  def call(task, sequence, _step)
    amount = task.context['amount']
    risk_score = sequence.get_results('assess_risk')['risk_score']
    jurisdiction = task.context['jurisdiction']

    steps_to_create = []

    # Large amounts always need legal review
    steps_to_create << 'legal_review' if amount >= 50_000

    # High risk triggers fraud investigation
    steps_to_create << 'fraud_investigation' if risk_score >= 70

    # Certain jurisdictions need special checks
    steps_to_create << 'jurisdictional_check' if high_regulation_jurisdiction?(jurisdiction)

    if steps_to_create.empty?
      # No additional compliance steps needed
      decision_no_branches(
        result_data: { reason: 'no_compliance_requirements' }
      )
    else
      decision_success(
        steps: steps_to_create,
        result_data: {
          compliance_level: 'enhanced',
          checks_required: steps_to_create
        }
      )
    end
  end

  private

  def high_regulation_jurisdiction?(jurisdiction)
    %w[EU UK APAC].include?(jurisdiction)
  end
end

Execution Scenarios

Scenario 1: Emergency Low-Risk Request ($5,000)

Path: validate → assess_risk → primary_routing → emergency_approval → finalize
Steps Created: 5
Decision Points: 1 (primary_routing creates emergency_approval)
Complexity: Low

Scenario 2: Standard Medium-Risk Request ($3,000, Risk 25)

Path: validate → assess_risk → primary_routing → standard_manager_approval → finalize
Steps Created: 5
Decision Points: 1 (primary_routing creates standard_manager_approval)
Complexity: Low

Scenario 3: High-Risk Large Amount ($75,000, Risk 80, EU)

Path: validate → assess_risk → primary_routing → senior_manager_approval + compliance_routing
      → legal_review + fraud_investigation + jurisdictional_check → finalize
Steps Created: 9
Decision Points: 2 (primary_routing → compliance_routing)
Complexity: High (nested decisions)

Ruby Implementation Guide

Using the Decision Base Class

The TaskerCore::StepHandler::Decision base class provides type-safe helpers:

class MyDecisionHandler < TaskerCore::StepHandler::Decision
  def call(context)
    # Your business logic here
    amount = context.get_task_field('amount')

    if amount < 1000
      # Create single step
      decision_success(
        steps: 'auto_approve',  # Can pass string or array
        result_data: { route: 'auto' }
      )
    elsif amount < 5000
      # Create multiple steps
      decision_success(
        steps: ['manager_approval', 'risk_check'],
        result_data: { route: 'standard' }
      )
    else
      # No additional steps needed
      decision_no_branches(
        result_data: { route: 'none', reason: 'manual_review_required' }
      )
    end
  end
end

Helper Methods

decision_success(steps:, result_data: {}, metadata: {})

  • Creates steps dynamically
  • steps: String or Array of step names
  • result_data: Additional data to store in step results
  • metadata: Observability metadata

decision_no_branches(result_data: {}, metadata: {})

  • No additional steps created
  • Workflow proceeds to next static step

decision_with_custom_outcome(outcome:, result_data: {}, metadata: {})

  • Advanced: Full control over outcome structure
  • Most handlers should use decision_success or decision_no_branches

validate_decision_outcome!(outcome)

  • Validates custom outcome structure
  • Raises error if invalid
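
For completeness, here is a hedged sketch of the advanced helpers inside a handler's call method, assuming decision_with_custom_outcome accepts a DecisionPointOutcome built with the factory shown in the next section. Most handlers should not need this:

# Hedged sketch: advanced outcome helpers (prefer decision_success / decision_no_branches)
outcome = TaskerCore::Types::DecisionPointOutcome.create_steps(['manager_approval'])

validate_decision_outcome!(outcome)  # raises if the outcome structure is invalid

decision_with_custom_outcome(
  outcome: outcome,
  result_data: { route: 'custom_escalation' },
  metadata: { operation: 'custom_routing' }
)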

Type Definitions

# workers/ruby/lib/tasker_core/types/decision_point_outcome.rb

module TaskerCore
  module Types
    module DecisionPointOutcome
      # Factory methods
      def self.no_branches
        NoBranches.new
      end

      def self.create_steps(step_names)
        CreateSteps.new(step_names: step_names)
      end

      # Serialization format (matches Rust)
      class NoBranches
        def to_h
          { type: 'no_branches' }
        end
      end

      class CreateSteps
        def to_h
          { type: 'create_steps', step_names: step_names }
        end
      end
    end
  end
end
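
A brief usage sketch of these factory methods (assuming the accessors omitted from the excerpt exist in the shipped module):

# Illustrative usage of the factory methods shown above
outcome = TaskerCore::Types::DecisionPointOutcome.create_steps(%w[manager_approval finance_review])
outcome.to_h
# => { type: 'create_steps', step_names: ['manager_approval', 'finance_review'] }

TaskerCore::Types::DecisionPointOutcome.no_branches.to_h
# => { type: 'no_branches' }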

Rust Implementation Guide

Decision Handler Implementation

Actual Implementation (from workers/rust/src/step_handlers/conditional_approval_rust.rs):

#![allow(unused)]
fn main() {
use super::{error_result, success_result, RustStepHandler, StepHandlerConfig};
use anyhow::Result;
use async_trait::async_trait;
use chrono::Utc;
use serde_json::json;
use std::collections::HashMap;
use tasker_shared::messaging::{DecisionPointOutcome, StepExecutionResult};
use tasker_shared::types::TaskSequenceStep;

const SMALL_AMOUNT_THRESHOLD: i64 = 1000;
const LARGE_AMOUNT_THRESHOLD: i64 = 5000;

pub struct RoutingDecisionHandler {
    #[allow(dead_code)]
    config: StepHandlerConfig,
}

#[async_trait]
impl RustStepHandler for RoutingDecisionHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Extract amount from task context
        let amount: i64 = step_data.get_context_field("amount")?;

        // Business logic: determine routing
        let (route_type, steps, reasoning) = if amount < SMALL_AMOUNT_THRESHOLD {
            (
                "auto_approval",
                vec!["auto_approve"],
                format!("Amount ${} under threshold", amount)
            )
        } else if amount < LARGE_AMOUNT_THRESHOLD {
            (
                "manager_only",
                vec!["manager_approval"],
                format!("Amount ${} requires manager approval", amount)
            )
        } else {
            (
                "dual_approval",
                vec!["manager_approval", "finance_review"],
                format!("Amount ${} requires dual approval", amount)
            )
        };

        // Create decision point outcome
        let outcome = DecisionPointOutcome::create_steps(
            steps.iter().map(|s| s.to_string()).collect()
        );

        // Build result with embedded outcome
        let result_data = json!({
            "route_type": route_type,
            "reasoning": reasoning,
            "amount": amount,
            "decision_point_outcome": outcome.to_value()  // Embedded outcome
        });

        let metadata = HashMap::from([
            ("route_type".to_string(), json!(route_type)),
            ("steps_to_create".to_string(), json!(steps)),
        ]);

        Ok(success_result(
            step_uuid,
            result_data,
            start_time.elapsed().as_millis() as i64,
            Some(metadata),
        ))
    }

    fn name(&self) -> &str {
        "routing_decision"
    }

    fn new(config: StepHandlerConfig) -> Self {
        Self { config }
    }
}
}

DecisionPointOutcome Type

Type Definition (from tasker-shared/src/messaging/execution_types.rs):

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum DecisionPointOutcome {
    NoBranches,
    CreateSteps {
        step_names: Vec<String>,
    },
}

impl DecisionPointOutcome {
    /// Create outcome that creates specific steps
    pub fn create_steps(step_names: Vec<String>) -> Self {
        Self::CreateSteps { step_names }
    }

    /// Create outcome with no additional steps
    pub fn no_branches() -> Self {
        Self::NoBranches
    }

    /// Convert to JSON value for embedding in StepExecutionResult
    pub fn to_value(&self) -> serde_json::Value {
        serde_json::to_value(self).expect("DecisionPointOutcome serialization should not fail")
    }

    /// Extract decision outcome from step execution result
    pub fn from_step_result(result: &serde_json::Value) -> Option<Self> {
        result
            .as_object()?
            .get("decision_point_outcome")
            .and_then(|v| serde_json::from_value(v.clone()).ok())
    }
}
}

Key Rust Patterns:

  • DecisionPointOutcome::create_steps(vec![...]) - Type-safe factory
  • outcome.to_value() - Serializes to JSON matching Ruby format
  • Embedded in result JSON as decision_point_outcome field
  • Serde handles serialization: { "type": "create_steps", "step_names": [...] }

Best Practices

1. Keep Decision Logic Deterministic

# ✅ Good: Deterministic decision based on input
def call(context)
  amount = context.get_task_field('amount')

  steps = if amount < 1000
    ['auto_approve']
  else
    ['manager_approval']
  end

  decision_success(steps: steps)
end

# ❌ Bad: Non-deterministic (time-based, random)
def call(context)
  # Decision changes based on when it runs
  steps = if Time.now.hour < 9
    ['emergency_approval']
  else
    ['standard_approval']
  end

  decision_success(steps: steps)
end

2. Validate Step Names

Ensure all step names in decision outcomes exist in template:

VALID_STEPS = %w[auto_approve manager_approval finance_review].freeze

def call(context)
  steps_to_create = determine_steps(context)

  # Validate step names
  invalid = steps_to_create - VALID_STEPS
  unless invalid.empty?
    raise "Invalid step names: #{invalid.join(', ')}"
  end

  decision_success(steps: steps_to_create)
end

3. Use Deferred Type for Convergence

Any step that might depend on dynamically created steps should be type: deferred:

# ✅ Correct
- name: finalize
  type: deferred  # Uses intersection semantics
  dependencies:
    - routing_decision
    - auto_approve
    - manager_approval

# ❌ Wrong - will fail if dependencies don't all exist
- name: finalize
  dependencies:
    - routing_decision
    - auto_approve
    - manager_approval

4. Limit Decision Depth

Prevent infinite recursion:

[orchestration.decision_points]
max_depth = 3  # Maximum nesting level
warn_threshold = 2  # Warn when approaching limit

# ✅ Good: Linear decision chain (depth 1-2)
validate → routing_decision → compliance_check → finalize

# ⚠️ Be Careful: Deep nesting (depth 3)
validate → routing_1 → routing_2 → routing_3 → finalize

# ❌ Bad: Circular or unbounded nesting
routing_decision creates steps that create more routing decisions...

5. Handle No-Branch Cases

Explicitly return no_branches when no steps needed:

def call(context)
  amount = context.get_task_field('amount')

  if context.get_task_field('skip_approval')
    # No additional steps needed
    decision_no_branches(
      result_data: { reason: 'approval_skipped' }
    )
  else
    decision_success(steps: determine_steps(amount))
  end
end

6. Meaningful Result Data

Include context for debugging and audit trails:

decision_success(
  steps: ['manager_approval', 'finance_review'],
  result_data: {
    route_type: 'dual_approval',
    reasoning: "Amount $#{amount} >= $5,000 threshold",
    amount: amount,
    thresholds_applied: {
      small: 1_000,
      large: 5_000
    }
  },
  metadata: {
    decision_time_ms: elapsed_ms,
    steps_created_count: 2
  }
)

Limitations and Constraints

Technical Limits

1. Maximum Decision Depth

  • Default: 3 levels of nested decision points
  • Configurable via orchestration.decision_points.max_depth
  • Prevents infinite recursion

2. Step Names Must Exist in Template

  • All step names in CreateSteps must be defined in template
  • Orchestration validates before creating steps
  • Invalid names cause permanent failure

3. Decision Logic is Non-Retryable by Default

  • Decision steps should be deterministic
  • Retry disabled by default (max_attempts: 1)
  • External API calls should be in separate steps

4. Created Steps Cannot Modify Template

  • Decision points create instances of template steps
  • Cannot dynamically define new step types
  • All possible steps must be in template

Performance Considerations

1. Decision Overhead

  • Each decision point adds ~10-20ms overhead
  • Includes: handler execution + step creation + dependency resolution
  • Factor into SLA planning

2. Database Impact

  • Each created step = 1 WorkflowStep record + edges
  • Large branch counts increase database operations
  • Monitor workflow_steps table growth

3. Observability

  • Decision outcomes logged with telemetry
  • Metrics track: decision_points.steps_created, decision_points.depth
  • Use structured logging for audit trails

Semantic Constraints

1. Deferred Dependencies Must Include Decision Point

# ✅ Correct
- name: finalize
  type: deferred
  dependencies:
    - routing_decision  # Must list the decision point
    - auto_approve
    - manager_approval

# ❌ Wrong - missing decision point
- name: finalize
  type: deferred
  dependencies:
    - auto_approve
    - manager_approval

2. Decision Points Cannot Be Circular

# ❌ Not allowed - circular dependency
routing_a creates routing_b
routing_b creates routing_a

3. No Dynamic Template Modification

  • Cannot add new handler types at runtime
  • Cannot modify step configurations
  • All possibilities must be predefined

Testing Decision Point Workflows

E2E Test Structure

Both Ruby and Rust implementations include comprehensive E2E tests covering all routing scenarios:

Test Locations:

  • Ruby: tests/e2e/ruby/conditional_approval_test.rs
  • Rust: tests/e2e/rust/conditional_approval_rust.rs

Test Scenarios:

  1. Small Amount ($500) - Auto-approval only

    validate_request → routing_decision → auto_approve → finalize_approval
    Expected: 4 steps created, only auto_approve path taken
    
  2. Medium Amount ($3,000) - Manager approval only

    validate_request → routing_decision → manager_approval → finalize_approval
    Expected: 4 steps created, only manager path taken
    
  3. Large Amount ($10,000) - Dual approval

    validate_request → routing_decision → manager_approval + finance_review → finalize_approval
    Expected: 5 steps created, both approval paths taken (parallel)
    
  4. API Validation - Initial step count verification

    Expected: 2 steps at initialization (validate_request, routing_decision)
    Reason: finalize_approval is a transitive descendant of the decision point
    

Running Tests

# Run all E2E tests
cargo test --test e2e_tests

# Run Ruby conditional approval tests only
cargo test --test e2e_tests e2e::ruby::conditional_approval

# Run Rust conditional approval tests only
cargo test --test e2e_tests e2e::rust::conditional_approval_rust

# Run with output for debugging
cargo test --test e2e_tests -- --nocapture

Test Fixtures

Ruby Template: tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml
Rust Template: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml

Both templates demonstrate:

  • Decision point step configuration (type: decision)
  • Deferred convergence step (type: deferred)
  • Dynamic step dependencies
  • Namespace isolation between Ruby/Rust

Validation Checklist

When implementing decision point workflows, ensure:

  • ✅ Decision point step has type: decision
  • ✅ Deferred convergence step has type: deferred
  • ✅ All possible dependencies listed in deferred step
  • ✅ Handler embeds decision_point_outcome in result
  • ✅ Step names in outcome match template definitions
  • ✅ E2E tests cover all routing scenarios
  • ✅ Tests validate step creation and completion
  • ✅ Namespace isolated if multiple implementations exist


← Back to Documentation Hub

Configuration Management

Last Updated: 2025-10-17 Audience: Operators, Developers, Architects Status: Active Related Docs: Environment Configuration Comparison, Deployment Patterns


Overview

Tasker Core implements a sophisticated component-based configuration system with environment-specific overrides, runtime observability, and comprehensive validation. This document explains how to manage, validate, inspect, and deploy Tasker configurations.

Key Features

Feature | Description | Benefit
Component-Based Architecture | 3 focused TOML files organized by common, orchestration, and worker | Easy to understand and maintain
Environment Overrides | Test, development, production-specific settings | Safe defaults with production scale-out
Single-File Runtime Loading | Load from pre-merged configuration files at runtime | Deployment certainty - exact config known at build time
Runtime Observability | /config API endpoints with secret redaction | Live inspection of deployed configurations
CLI Tools | Generate and validate single deployable configs | Build-time verification, deployment artifacts
Context-Specific Validation | Orchestration and worker-specific validation rules | Catch errors before deployment
Secret Redaction | 12+ sensitive key patterns automatically hidden | Safe configuration inspection

Quick Start

Inspect Running System Configuration

# Check orchestration configuration (includes common + orchestration-specific)
curl http://localhost:8080/config | jq

# Check worker configuration (includes common + worker-specific)
curl http://localhost:8081/config | jq

# Secrets are automatically redacted for safety

Generate Deployable Configuration

# Generate production orchestration config for deployment
tasker-ctl config generate \
    --context orchestration \
    --environment production \
    --output config/tasker/orchestration-production.toml

# This merged file is then loaded at runtime via TASKER_CONFIG_PATH
export TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml

Validate Configuration

# Validate orchestration config for production
tasker-ctl config validate \
    --context orchestration \
    --environment production

# Validates: type safety, ranges, required fields, business rules

Part 1: Configuration Architecture

1.1 Component-Based Structure

Tasker uses a component-based TOML architecture where configuration is split into focused files with single responsibility:

config/tasker/
├── base/                           # Base configuration (defaults)
│   ├── common.toml                 # Shared: database, circuit breakers, telemetry
│   ├── orchestration.toml          # Orchestration-specific settings
│   └── worker.toml                 # Worker-specific settings
│
├── environments/                   # Environment-specific overrides
│   ├── test/
│   │   ├── common.toml             # Test overrides (small values, fast execution)
│   │   ├── orchestration.toml
│   │   └── worker.toml
│   │
│   ├── development/
│   │   ├── common.toml             # Development overrides (medium values, local Docker)
│   │   ├── orchestration.toml
│   │   └── worker.toml
│   │
│   └── production/
│       ├── common.toml             # Production overrides (large values, scale-out)
│       ├── orchestration.toml
│       └── worker.toml
│
├── orchestration-test.toml         # Generated merged configs (used at runtime via TASKER_CONFIG_PATH)
├── orchestration-production.toml   # Single-file deployment artifacts
├── worker-test.toml
└── worker-production.toml

1.2 Configuration Contexts

Tasker has three configuration contexts:

Context | Purpose | Components
Common | Shared across orchestration and worker | Database, circuit breakers, telemetry, backoff, system
Orchestration | Orchestration-specific settings | Web API, MPSC channels, event systems, shutdown
Worker | Worker-specific settings | Handler discovery, resource limits, health monitoring

1.3 Environment Detection

Configuration loading uses TASKER_ENV environment variable:

# Test environment - small values for fast tests
export TASKER_ENV=test

# Development environment - medium values for local Docker
export TASKER_ENV=development

# Production environment - large values for scale-out
export TASKER_ENV=production

Detection Order:

  1. TASKER_ENV environment variable
  2. Default to “development” if not set

1.4 Runtime Configuration Loading

Production/Docker Deployment: Single-file loading via TASKER_CONFIG_PATH

Runtime systems (orchestration and worker) load configuration from pre-merged single files:

# Set path to merged configuration file
export TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml

# System loads this single file at startup
# No directory merging at runtime - configuration is fully determined at build time

Key Benefits:

  • Deployment Certainty: Exact configuration known before deployment
  • Simplified Debugging: Single file shows exactly what’s running
  • Configuration Auditing: One file to version control and code review
  • Fail Loudly: Missing or invalid config halts startup with explicit errors

Configuration Path Precedence:

The system uses a two-tier configuration loading strategy with clear precedence:

  1. Primary: TASKER_CONFIG_PATH (Explicit single file - Docker/production)

    • When set, system loads configuration from this exact file path
    • Intended for production and Docker deployments
    • Example: TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml
    • Source logging: "📋 Loading orchestration configuration from: /app/config/tasker/orchestration-production.toml (source: TASKER_CONFIG_PATH)"
  2. Fallback: TASKER_CONFIG_ROOT (Convention-based - tests/development)

    • When TASKER_CONFIG_PATH is not set, system looks for config using convention
    • Convention: {TASKER_CONFIG_ROOT}/tasker/{context}-{environment}.toml
    • Examples:
      • Orchestration: /config/tasker/generated/orchestration-test.toml
      • Worker: /config/tasker/worker-production.toml
    • Source logging: "📋 Loading orchestration configuration from: /config/tasker/generated/orchestration-test.toml (source: TASKER_CONFIG_ROOT (convention))"

Logging and Transparency:

The system clearly logs which approach was taken at startup:

# Explicit path approach (TASKER_CONFIG_PATH set)
INFO tasker_shared::system_context: 📋 Loading orchestration configuration from: /app/config/tasker/orchestration-production.toml (source: TASKER_CONFIG_PATH)

# Convention-based approach (TASKER_CONFIG_ROOT set)
INFO tasker_shared::system_context: Using convention-based config path: /config/tasker/generated/orchestration-test.toml (environment=test)
INFO tasker_shared::system_context: 📋 Loading orchestration configuration from: /config/tasker/generated/orchestration-test.toml (source: TASKER_CONFIG_ROOT (convention))

When to Use Each:

Environment | Recommended Approach | Reason
Production | TASKER_CONFIG_PATH | Explicit, auditable, matches what’s reviewed
Docker | TASKER_CONFIG_PATH | Single source of truth, no ambiguity
Kubernetes | TASKER_CONFIG_PATH | ConfigMap contains exact file
Tests (nextest) | TASKER_CONFIG_ROOT | Tests span multiple contexts, convention handles both
Local dev | Either | Personal preference

Error Handling:

If neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set:

ConfigurationError("Neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set.
For Docker/production: set TASKER_CONFIG_PATH to the merged config file.
For tests/development: set TASKER_CONFIG_ROOT to the config directory.")

Local Development: Directory-based loading (legacy tests only)

For legacy test compatibility, you can still use directory-based loading via the load_context_direct() method, but this is not supported for production use.

1.5 Merging Strategy

Configuration merging follows environment overrides win pattern:

# base/common.toml
[database.pool]
max_connections = 30
min_connections = 8

# environments/production/common.toml
[database.pool]
max_connections = 50

# Result: max_connections = 50, min_connections = 8 (inherited from base)
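
The override-wins behavior can be sketched with a simple deep merge (illustrative Ruby only; the real merge happens at the TOML level inside the configuration loader):

# Illustrative deep merge: environment values win, untouched base values are inherited
def deep_merge(base, override)
  base.merge(override) do |_key, base_val, override_val|
    base_val.is_a?(Hash) && override_val.is_a?(Hash) ? deep_merge(base_val, override_val) : override_val
  end
end

base     = { 'database' => { 'pool' => { 'max_connections' => 30, 'min_connections' => 8 } } }
override = { 'database' => { 'pool' => { 'max_connections' => 50 } } }

deep_merge(base, override)
# => { 'database' => { 'pool' => { 'max_connections' => 50, 'min_connections' => 8 } } }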

Part 2: Runtime Observability

2.1 Configuration API Endpoints

Tasker provides unified configuration endpoints that return complete configuration (common + context-specific) in a single response.

Orchestration API

Endpoint: GET /config (system endpoint at root level)

Purpose: Inspect complete orchestration configuration including common settings

Example Request:

curl http://localhost:8080/config | jq

Response Structure:

{
  "environment": "production",
  "common": {
    "database": {
      "url": "***REDACTED***",
      "pool": {
        "max_connections": 50,
        "min_connections": 15
      }
    },
    "circuit_breakers": { "...": "..." },
    "telemetry": { "...": "..." },
    "system": { "...": "..." },
    "backoff": { "...": "..." },
    "task_templates": { "...": "..." }
  },
  "orchestration": {
    "web": {
      "bind_address": "0.0.0.0:8080",
      "request_timeout_ms": 60000
    },
    "mpsc_channels": {
      "command_buffer_size": 5000,
      "pgmq_notification_buffer_size": 50000
    },
    "event_systems": { "...": "..." }
  },
  "metadata": {
    "timestamp": "2025-10-17T15:30:45Z",
    "source": "runtime",
    "redacted_fields": [
      "database.url",
      "telemetry.api_key"
    ]
  }
}

Worker API

Endpoint: GET /config (system endpoint at root level)

Purpose: Inspect complete worker configuration including common settings

Example Request:

curl http://localhost:8081/config | jq

Response Structure:

{
  "environment": "production",
  "common": {
    "database": { "...": "..." },
    "circuit_breakers": { "...": "..." },
    "telemetry": { "...": "..." }
  },
  "worker": {
    "template_path": "/app/templates",
    "max_concurrent_steps": 500,
    "resource_limits": {
      "max_memory_mb": 4096,
      "max_cpu_percent": 90
    },
    "web": {
      "bind_address": "0.0.0.0:8081",
      "request_timeout_ms": 60000
    }
  },
  "metadata": {
    "timestamp": "2025-10-17T15:30:45Z",
    "source": "runtime",
    "redacted_fields": [
      "database.url",
      "worker.auth_token"
    ]
  }
}

2.2 Design Philosophy

Single Endpoint, Complete Configuration: Each system has one /config endpoint that returns both common and context-specific configuration in a single response.

Benefits:

  1. Single curl command: Get complete picture without correlation
  2. Easy comparison: Compare orchestration vs worker configs for compatibility
  3. Tooling-friendly: Automated tools can validate shared config matches
  4. Debugging-friendly: No mental correlation between multiple endpoints
  5. System endpoint: At root level like /health, /metrics (not under /v1/)

2.3 Comprehensive Secret Redaction

All sensitive configuration values are automatically redacted before returning to clients.

Sensitive Key Patterns (12+ patterns, case-insensitive):

  • password, secret, token, key, api_key
  • private_key, jwt_private_key, jwt_public_key
  • auth_token, credentials, database_url, url

Key Features:

  • Recursive Processing: Handles deeply nested objects and arrays
  • Field Path Tracking: Reports which fields were redacted (e.g., database.url)
  • Smart Skipping: Empty strings and booleans not redacted
  • Case-Insensitive: Catches API_KEY, Secret_Token, database_PASSWORD
  • Structure Preservation: Non-sensitive data remains intact

Example:

{
  "database": {
    "url": "***REDACTED***",
    "adapter": "postgresql",
    "pool": {
      "max_connections": 30
    }
  },
  "metadata": {
    "redacted_fields": ["database.url"]
  }
}
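
The redaction behavior described above can be approximated in a few lines of Ruby (an illustrative sketch with an abbreviated pattern list; the actual implementation lives in the Rust web layer):

# Illustrative recursive redaction with field-path tracking
SENSITIVE_KEY = /password|secret|token|key|credentials|url/i

def redact(value, path = [], redacted_fields = [])
  case value
  when Hash
    value.each_with_object({}) do |(k, v), out|
      if SENSITIVE_KEY.match?(k.to_s) && v.is_a?(String) && !v.empty?
        redacted_fields << (path + [k]).join('.')
        out[k] = '***REDACTED***'
      else
        out[k] = redact(v, path + [k], redacted_fields)
      end
    end
  when Array
    value.map { |item| redact(item, path, redacted_fields) }
  else
    value
  end
end

redacted_fields = []
redact({ 'database' => { 'url' => 'postgresql://...', 'adapter' => 'postgresql' } }, [], redacted_fields)
# => { 'database' => { 'url' => '***REDACTED***', 'adapter' => 'postgresql' } }
# redacted_fields => ['database.url']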

2.4 OpenAPI/Swagger Integration

All configuration endpoints are documented with OpenAPI 3.0 and Swagger UI.

Access Swagger UI:

  • Orchestration: http://localhost:8080/api-docs/ui
  • Worker: http://localhost:8081/api-docs/ui

OpenAPI Specification:

  • Orchestration: http://localhost:8080/api-docs/openapi.json
  • Worker: http://localhost:8081/api-docs/openapi.json

Part 3: CLI Tools

3.1 Generate Command

Purpose: Generate a single merged configuration file from base + environment overrides for deployment.

Command Signature:

tasker-ctl config generate \
    --context <common|orchestration|worker> \
    --environment <test|development|production>

Examples:

# Generate orchestration config for production
tasker-ctl config generate --context orchestration --environment production

# Generate worker config for development
tasker-ctl config generate --context worker --environment development

# Generate common config for test
tasker-ctl config generate --context common --environment test

Output Location: Automatically generated at:

config/tasker/generated/{context}-{environment}.toml

Key Features:

  1. Automatic Paths: No need for --source-dir or --output flags
  2. Metadata Headers: Generated files include rich metadata:
    # Generated by Tasker Configuration System
    # Context: orchestration
    # Environment: production
    # Generated At: 2025-10-17T15:30:45Z
    # Base Config: config/tasker/base/orchestration.toml
    # Environment Override: config/tasker/environments/production/orchestration.toml
    #
    # This is a merged configuration file combining base settings with
    # environment-specific overrides. Environment values take precedence.
    
  3. Automatic Validation: Validates during generation
  4. Smart Merging: TOML-level merging preserves structure

3.2 Validate Command

Purpose: Validate configuration files with context-specific validation rules.

Command Signature:

tasker-ctl config validate \
    --context <common|orchestration|worker> \
    --environment <test|development|production>

Examples:

# Validate orchestration config for production
tasker-ctl config validate --context orchestration --environment production

# Validate worker config for test
tasker-ctl config validate --context worker --environment test

Validation Features:

  • Environment variable substitution (${VAR:-default})
  • Type checking (numeric ranges, boolean values)
  • Required field validation
  • Context-specific business rules
  • Clear error messages

Example Output:

🔍 Validating configuration...
   Context: orchestration
   Environment: production
   ✓ Configuration loaded
   ✓ Validation passed

✅ Configuration is valid!

📊 Configuration Summary:
   Context: orchestration
   Environment: production
   Database: postgresql://tasker:***@localhost/tasker_production
   Web API: 0.0.0.0:8080
   MPSC Channels: 5 configured

3.3 Configuration Validator Binary

For quick validation without the full CLI:

# Validate all three environments
TASKER_ENV=test cargo run --bin config-validator
TASKER_ENV=development cargo run --bin config-validator
TASKER_ENV=production cargo run --bin config-validator

Part 4: Environment-Specific Configurations

See Environment Configuration Comparison for complete details on configuration values across environments.

4.1 Scaling Pattern

Tasker follows a 1:5:50 scaling pattern across environments:

Component | Test | Development | Production | Pattern
Database Connections | 10 | 25 | 50 | 1x → 2.5x → 5x
Concurrent Steps | 10 | 50 | 500 | 1x → 5x → 50x
MPSC Channel Buffers | 100-500 | 500-1000 | 2000-50000 | 1x → 5-10x → 20-100x
Memory Limits | 512MB | 2GB | 4GB | 1x → 4x → 8x

4.2 Environment Philosophy

Test Environment:

  • Goal: Fast execution, test isolation
  • Strategy: Minimal resources, small buffers
  • Example: 10 database connections, 100-500 MPSC buffers

Development Environment:

  • Goal: Comfortable local Docker development
  • Strategy: Medium values, realistic workflows
  • Example: 25 database connections, 2GB RAM, 500-1000 MPSC buffers
  • Cluster Testing: 2 orchestrators to test multi-instance coordination

Production Environment:

  • Goal: High throughput, scale-out capacity
  • Strategy: Large values, production resilience
  • Example: 50 database connections, 4GB RAM, 2000-50000 MPSC buffers

Part 5: Deployment Workflows

5.1 Docker Deployment

Build-Time Configuration Generation:

FROM rust:1.75 as builder

WORKDIR /app
COPY . .

# Build CLI tool
RUN cargo build --release --bin tasker-ctl

# Generate production config (single merged file)
RUN ./target/release/tasker-ctl config generate \
    --context orchestration \
    --environment production \
    --output config/tasker/orchestration-production.toml

# Build orchestration binary
RUN cargo build --release --bin tasker-orchestration

FROM rust:1.75-slim

WORKDIR /app

# Copy orchestration binary
COPY --from=builder /app/target/release/tasker-orchestration /usr/local/bin/

# Copy generated config (single file with all merged settings)
COPY --from=builder /app/config/tasker/orchestration-production.toml /app/config/orchestration.toml

# Set environment - TASKER_CONFIG_PATH is REQUIRED
ENV TASKER_CONFIG_PATH=/app/config/orchestration.toml
ENV TASKER_ENV=production

CMD ["tasker-orchestration"]

Key Changes from Phase 2:

  • ✅ Single merged file generated at build time
  • ✅ TASKER_CONFIG_PATH environment variable (required)
  • ✅ No runtime merging - exact config known at build time
  • ✅ Fail loudly if TASKER_CONFIG_PATH not set

5.2 Kubernetes Deployment

ConfigMap Strategy with Pre-Generated Config:

# Step 1: Generate merged configuration locally
tasker-ctl config generate \
  --context orchestration \
  --environment production \
  --output orchestration-production.toml

# Step 2: Create ConfigMap from generated file
kubectl create configmap tasker-orchestration-config \
  --from-file=orchestration.toml=orchestration-production.toml

Deployment Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tasker-orchestration
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tasker-orchestration
  template:
    metadata:
      labels:
        app: tasker-orchestration
    spec:
      containers:
      - name: orchestration
        image: tasker/orchestration:latest
        env:
        - name: TASKER_ENV
          value: "production"
        # REQUIRED: Path to single merged configuration file
        - name: TASKER_CONFIG_PATH
          value: "/config/orchestration.toml"
        # DATABASE_URL should be in a separate secret
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: tasker-db-credentials
              key: database-url
        volumeMounts:
        - name: config
          mountPath: /config
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: tasker-orchestration-config
          items:
          - key: orchestration.toml
            path: orchestration.toml

Key Benefits:

  • ✅ Generated file reviewed before deployment
  • ✅ Single source of truth for runtime configuration
  • ✅ Easy to diff between environments
  • ✅ ConfigMap contains exact runtime configuration

5.3 Local Development and Testing

For Tests (Legacy directory-based loading):

# Set test environment
export TASKER_ENV=test

# Tests use legacy load_context_direct() method
cargo test --all-features

For Docker Compose (Single-file loading):

# Generate test configs first
tasker-ctl config generate --context orchestration --environment test \
  --output config/tasker/generated/orchestration-test.toml

tasker-ctl config generate --context worker --environment test \
  --output config/tasker/generated/worker-test.toml

# Start services with generated configs
docker-compose -f docker/docker-compose.test.yml up

Docker Compose Configuration:

services:
  orchestration:
    environment:
      # REQUIRED: Path to single merged file
      TASKER_CONFIG_PATH: /app/config/tasker/generated/orchestration-test.toml
    volumes:
      # Mount config directory (contains generated files)
      - ./config/tasker:/app/config/tasker:ro

Key Points:

  • ✅ Tests use legacy directory-based loading for convenience
  • ✅ Docker Compose uses single-file loading (matches production)
  • ✅ Generated files should be committed to repo for reproducibility
  • ✅ Both approaches work; choose based on use case

Part 6: Configuration Validation

6.1 Context-Specific Validation

Each configuration context has specific validation rules:

Common Configuration:

  • Database URL format and connectivity
  • Pool size ranges (1-1000 connections)
  • Circuit breaker thresholds (1-100 failures)
  • Timeout durations (1-3600 seconds)

Orchestration Configuration:

  • Web API bind address format
  • Request timeout ranges (1000-300000 ms)
  • MPSC channel buffer sizes (100-100000)
  • Event system configuration consistency

Worker Configuration:

  • Template path existence
  • Resource limit ranges (memory, CPU %)
  • Handler discovery path validation
  • Concurrent step limits (1-10000)

6.2 Validation Workflow

Pre-Deployment Validation:

# Validate before generating deployment artifact
tasker-ctl config validate --context orchestration --environment production

# Generate only if validation passes
tasker-ctl config generate --context orchestration --environment production

Runtime Validation:

  • Configuration validated on application startup
  • Invalid config prevents startup (fail-fast)
  • Clear error messages for troubleshooting

6.3 Common Validation Errors

Example Error Messages:

❌ Validation Error: database.pool.max_connections
   Value: 5000
   Issue: Exceeds maximum allowed value (1000)
   Fix: Reduce to 1000 or less

❌ Validation Error: web.bind_address
   Value: "invalid:port"
   Issue: Invalid IP:port format
   Fix: Use format like "0.0.0.0:8080" or "127.0.0.1:3000"

Part 7: Operational Workflows

7.1 Compare Deployed Configurations

Cross-System Comparison:

# Get orchestration config
curl http://orchestration:8080/config > orch-config.json

# Get worker config
curl http://worker:8081/config > worker-config.json

# Compare common sections for compatibility
jq '.common' orch-config.json > orch-common.json
jq '.common' worker-config.json > worker-common.json

diff orch-common.json worker-common.json

Why This Matters:

  • Ensures orchestration and worker share same database config
  • Validates circuit breaker settings match
  • Confirms telemetry endpoints aligned

7.2 Debug Configuration Issues

Step 1: Inspect Runtime Config

# Check what's actually deployed
curl http://localhost:8080/config | jq '.orchestration.web'

Step 2: Compare to Expected

# Check generated config file
cat config/tasker/generated/orchestration-production.toml

# Compare values

Step 3: Trace Configuration Source

# Check metadata for source files
curl http://localhost:8080/config | jq '.metadata'

# Metadata shows:
# - Environment (production)
# - Timestamp (when config was loaded)
# - Source (runtime)
# - Redacted fields (for transparency)

7.3 Configuration Drift Detection

Manual Comparison:

# Generate what should be deployed
tasker-ctl config generate --context orchestration --environment production

# Compare to runtime
diff config/tasker/generated/orchestration-production.toml \
     <(curl -s http://localhost:8080/config | jq -r '.orchestration')

Automated Monitoring (future):

  • Periodic config snapshots
  • Alert on unexpected changes
  • Configuration version tracking

Part 8: Best Practices

8.1 Configuration Management

DO:

  • ✅ Use environment variables for secrets (${DATABASE_URL})
  • ✅ Validate configs before deployment
  • ✅ Generate single deployable artifacts for production
  • ✅ Use /config endpoints for debugging
  • ✅ Keep environment overrides minimal (only what changes)
  • ✅ Document configuration changes in commit messages

DON’T:

  • ❌ Commit production secrets to config files
  • ❌ Mix test and production configurations
  • ❌ Skip validation before deployment
  • ❌ Use unbounded configuration values
  • ❌ Override all settings in environment files

8.2 Security Best Practices

Secrets Management:

# ✅ GOOD: Use environment variable substitution
[database]
url = "${DATABASE_URL}"

# ❌ BAD: Hard-code credentials
[database]
url = "postgresql://user:password@localhost/db"

Production Deployment:

# ✅ GOOD: Use Kubernetes secrets
kubectl create secret generic tasker-db-url \
  --from-literal=url='postgresql://...'

# ❌ BAD: Commit secrets to config files

Runtime Inspection:

  • /config endpoint automatically redacts secrets
  • Safe to use in logging and monitoring
  • Field path tracking shows what was redacted

8.3 Testing Strategy

Test All Environments:

# Ensure all environments validate
for env in test development production; do
  echo "Validating $env..."
  tasker-ctl config validate --context orchestration --environment $env
done

Integration Testing:

# Test with generated configs
tasker-ctl config generate --context orchestration --environment test
export TASKER_CONFIG_PATH=config/tasker/generated/orchestration-test.toml
cargo test --all-features

Part 9: Troubleshooting

9.1 Common Issues

Issue: Configuration fails to load

# Check environment variable
echo $TASKER_ENV

# Check config files exist
ls -la config/tasker/base/
ls -la config/tasker/environments/$TASKER_ENV/

# Validate config
tasker-ctl config validate --context orchestration --environment $TASKER_ENV

Issue: Unexpected configuration values at runtime

# Check runtime config
curl http://localhost:8080/config | jq

# Compare to expected
cat config/tasker/generated/orchestration-$TASKER_ENV.toml

Issue: Validation errors

# Run validation with detailed output
RUST_LOG=debug tasker-ctl config validate \
  --context orchestration \
  --environment production

9.2 Debug Mode

Enable Configuration Debug Logging:

# Detailed config loading logs
RUST_LOG=tasker_shared::config=debug cargo run

# Shows:
# - Which files are loaded
# - Merge order
# - Environment variable substitution
# - Validation results

Part 10: Future Enhancements

10.1 Planned Features

Explain Command (Deferred):

# Get documentation for a parameter
tasker-ctl config explain --parameter database.pool.max_connections

# Shows:
# - Purpose and system impact
# - Valid range and type
# - Environment-specific recommendations
# - Related parameters
# - Example usage

Detect-Unused Command (Deferred):

# Find unused configuration parameters
tasker-ctl config detect-unused --context orchestration

# Auto-remove with backup
tasker-ctl config detect-unused --context orchestration --fix

10.2 Operational Enhancements

Configuration Versioning:

  • Track configuration changes over time
  • Compare configs across versions
  • Rollback capability

Automated Drift Detection:

  • Periodic config snapshots
  • Alert on unexpected changes
  • Configuration compliance checking

Configuration Templates:

  • Pre-built configurations for common scenarios
  • Quick-start templates for new deployments
  • Best practice configurations


Summary

Tasker’s configuration system provides:

  1. Component-Based Architecture: Focused TOML files with single responsibility
  2. Environment Scaling: 1:5:50 pattern from test → development → production
  3. Single-File Runtime Loading: Deploy exact configuration known at build time via TASKER_CONFIG_PATH
  4. Runtime Observability: /config endpoints with comprehensive secret redaction
  5. CLI Tools: Generate and validate single deployable configs
  6. Context-Specific Validation: Catch errors before deployment
  7. Security First: Automatic secret redaction, environment variable substitution

Key Workflows:

  • Production/Docker: Generate single-file config at build time, set TASKER_CONFIG_PATH, deploy
  • Testing: Use legacy directory-based loading for convenience
  • Debugging: Use /config endpoints to inspect runtime configuration
  • Validation: Validate before generating deployment artifacts

Phase 3 Changes (October 2025):

  • ✅ Runtime systems now require TASKER_CONFIG_PATH environment variable
  • ✅ Configuration loaded from single merged files (no runtime merging)
  • ✅ Deployment certainty: exact config known at build time
  • ✅ Fail loudly: missing/invalid config halts startup with explicit errors
  • ✅ Generated configs committed to repo for reproducibility

← Back to Documentation Hub

Dead Letter Queue (DLQ) System Architecture

Purpose: Investigation tracking system for stuck, stale, or problematic tasks

Last Updated: 2025-11-01


Executive Summary

The DLQ (Dead Letter Queue) system is an investigation tracking system, NOT a task manipulation layer.

Key Principles:

  • DLQ tracks “why task is stuck” and “who investigated”
  • Resolution happens at step level via step APIs
  • No task-level “requeue” - fix the problem steps instead
  • Steps carry their own retry, attempt, and state lifecycles independent of DLQ
  • DLQ is for audit, visibility, and investigation only

Architecture: PostgreSQL-based system with:

  • tasks_dlq table for investigation tracking
  • 3 database views for monitoring and analysis
  • 6 REST endpoints for operator interaction
  • Background staleness detection service

DLQ vs Step Resolution

What DLQ Does

Investigation Tracking:

  • Record when and why task became stuck
  • Capture complete task snapshot for debugging
  • Track operator investigation workflow
  • Provide visibility into systemic issues

Visibility and Monitoring:

  • Dashboard statistics by DLQ reason
  • Prioritized investigation queue for triage
  • Proactive staleness monitoring (before DLQ)
  • Alerting integration for high-priority entries

What DLQ Does NOT Do

Task Manipulation:

  • Does NOT retry failed steps
  • Does NOT requeue tasks
  • Does NOT modify step state
  • Does NOT execute business logic

Why This Separation Matters

Steps are mutable - Operators can:

  • Manually resolve failed steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  • View step readiness status: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  • Check retry eligibility and dependency satisfaction
  • Trigger next steps by completing blocked steps

DLQ is immutable audit trail - Operators should:

  • Review task snapshot to understand what went wrong
  • Use step endpoints to fix the underlying problem
  • Update DLQ investigation status to track resolution
  • Analyze DLQ patterns to prevent future occurrences

DLQ Reasons

staleness_timeout

Definition: Task exceeded state-specific staleness threshold

States:

  • waiting_for_dependencies - Default 60 minutes
  • waiting_for_retry - Default 30 minutes
  • steps_in_process - Default 30 minutes

Template Override: Configure per-template thresholds:

lifecycle:
  max_waiting_for_dependencies_minutes: 120
  max_waiting_for_retry_minutes: 45
  max_steps_in_process_minutes: 60
  max_duration_minutes: 1440  # 24 hours

Resolution Pattern:

  1. Operator: GET /v1/dlq/task/{task_uuid} - Review task snapshot
  2. Identify stuck steps: Check current_state in snapshot
  3. Fix steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  4. Task state machine automatically progresses when steps fixed
  5. Operator: PATCH /v1/dlq/entry/{dlq_entry_uuid} - Mark investigation resolved

Prevention: Use /v1/dlq/staleness endpoint for proactive monitoring

max_retries_exceeded

Definition: Step exhausted all retry attempts and remains in Error state

Resolution Pattern:

  1. Review step results: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  2. Analyze last_failure_at and error details
  3. Fix underlying issue (infrastructure, data, etc.)
  4. Manually resolve step: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  5. Update DLQ investigation status

dependency_cycle_detected

Definition: Circular dependency detected in workflow step graph

Resolution Pattern:

  1. Review task template configuration
  2. Identify cycle in step dependencies
  3. Update template to break cycle
  4. Manually cancel affected tasks
  5. Re-submit tasks with corrected template

worker_unavailable

Definition: No worker available for task’s namespace

Resolution Pattern:

  1. Check worker service health
  2. Verify namespace configuration
  3. Scale worker capacity if needed
  4. Tasks automatically progress when worker available

manual_dlq

Definition: Operator manually sent task to DLQ for investigation

Resolution Pattern: Custom per-investigation


Database Schema

tasks_dlq Table

CREATE TABLE tasker.tasks_dlq (
    dlq_entry_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
    task_uuid UUID NOT NULL UNIQUE,  -- One pending entry per task
    original_state VARCHAR(50) NOT NULL,
    dlq_reason dlq_reason NOT NULL,
    dlq_timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    task_snapshot JSONB,  -- Complete task state for debugging
    resolution_status dlq_resolution_status NOT NULL DEFAULT 'pending',
    resolution_notes TEXT,
    resolved_at TIMESTAMPTZ,
    resolved_by VARCHAR(255),
    metadata JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Unique constraint: Only one pending DLQ entry per task
CREATE UNIQUE INDEX idx_dlq_unique_pending_task
    ON tasker.tasks_dlq (task_uuid)
    WHERE resolution_status = 'pending';

Key Fields:

  • dlq_entry_uuid - UUID v7 (time-ordered) for investigation tracking
  • task_uuid - Foreign key to task (unique for pending entries)
  • original_state - Task state when sent to DLQ
  • task_snapshot - JSONB snapshot with debugging context
  • resolution_status - Investigation workflow status

Database Views

v_dlq_dashboard

Purpose: Aggregated statistics for monitoring dashboard

Columns:

  • dlq_reason - Why tasks are in DLQ
  • total_entries - Count of entries
  • pending, manually_resolved, permanent_failures, cancelled - Breakdown by status
  • oldest_entry, newest_entry - Time range
  • avg_resolution_time_minutes - Average time to resolve

Use Case: High-level DLQ health monitoring

v_dlq_investigation_queue

Purpose: Prioritized queue for operator triage

Columns:

  • Task and DLQ entry UUIDs
  • priority_score - Composite score (base reason priority + age factor)
  • minutes_in_dlq - How long entry has been pending
  • Task metadata for context

Ordering: Priority score DESC (most urgent first)

Use Case: Operator dashboard showing “what to investigate next”

v_task_staleness_monitoring

Purpose: Proactive staleness monitoring BEFORE tasks hit DLQ

Columns:

  • task_uuid, namespace_name, task_name
  • current_state, time_in_state_minutes
  • staleness_threshold_minutes - Threshold for this state
  • health_status - healthy | warning | stale
  • priority - Task priority for ordering

Health Status Classification:

  • healthy - < 80% of threshold
  • warning - 80-99% of threshold
  • stale - ≥ 100% of threshold

Use Case: Alerting at 80% threshold to prevent DLQ entries
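
The classification reduces to a simple ratio check; a Ruby sketch mirroring the documented thresholds:

# Illustrative classification matching the documented 80% / 100% thresholds
def health_status(time_in_state_minutes, staleness_threshold_minutes)
  ratio = time_in_state_minutes.to_f / staleness_threshold_minutes

  if ratio >= 1.0
    'stale'
  elsif ratio >= 0.8
    'warning'
  else
    'healthy'
  end
end

health_status(50, 60)  # => 'warning' (at 80%+ of a 60-minute threshold)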


REST API Endpoints

1. List DLQ Entries

GET /v1/dlq?resolution_status=pending&limit=50

Purpose: Browse DLQ entries with filtering

Query Parameters:

  • resolution_status - Filter by status (optional)
  • limit - Max entries (default: 50)
  • offset - Pagination offset (default: 0)

Response: Array of DlqEntry objects

Use Case: General DLQ browsing and pagination


2. Get DLQ Entry with Task Snapshot

GET /v1/dlq/task/{task_uuid}

Purpose: Retrieve most recent DLQ entry for a task with complete snapshot

Response: DlqEntry with full task_snapshot JSONB

Task Snapshot Contains:

  • Task UUID, namespace, name
  • Current state and time in state
  • Staleness threshold
  • Task age and priority
  • Template configuration
  • Detection time

Use Case: Investigation starting point - “why is this task stuck?”


3. Update DLQ Investigation Status

PATCH /v1/dlq/entry/{dlq_entry_uuid}

Purpose: Track investigation workflow

Request Body:

{
  "resolution_status": "manually_resolved",
  "resolution_notes": "Fixed by manually completing stuck step using step API",
  "resolved_by": "operator@example.com",
  "metadata": {
    "fixed_step_uuid": "...",
    "root_cause": "database connection timeout"
  }
}

Use Case: Document investigation findings and resolution


4. Get DLQ Statistics

GET /v1/dlq/stats

Purpose: Aggregated statistics for monitoring

Response: Statistics grouped by dlq_reason

Use Case: Dashboard metrics, identifying systemic issues


5. Get Investigation Queue

GET /v1/dlq/investigation-queue?limit=100

Purpose: Prioritized queue for operator triage

Response: Array of DlqInvestigationQueueEntry ordered by priority

Priority Factors:

  • Base reason priority (staleness_timeout: 10, max_retries: 20, etc.)
  • Age multiplier (older entries = higher priority)

Use Case: “What should I investigate next?”
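
The exact weighting is internal to the view, but the shape of the score is "base priority for the reason, plus an age factor". A purely hypothetical Rust illustration (the age factor below is invented for this sketch, not Tasker's formula):

// Hypothetical illustration only -- not the view's actual formula.
fn priority_score(base_reason_priority: u32, minutes_in_dlq: u32) -> u32 {
    // Assumed weighting for illustration: one extra point per 30 minutes in the DLQ.
    base_reason_priority + minutes_in_dlq / 30
}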


6. Get Staleness Monitoring

GET /v1/dlq/staleness?limit=100

Purpose: Proactive monitoring BEFORE tasks hit DLQ

Response: Array of StalenessMonitoring with health status

Ordering: Stale first, then warning, then healthy

Use Case: Alerting and prevention

Alert Integration:

# Alert when warning count exceeds threshold
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'

Step Endpoints and Resolution Workflow

Step Endpoints

1. List Task Steps

GET /v1/tasks/{uuid}/workflow_steps

Returns: Array of steps with readiness status

Key Fields:

  • current_state - Step state (pending, enqueued, in_progress, complete, error)
  • dependencies_satisfied - Can step execute?
  • retry_eligible - Can step retry?
  • ready_for_execution - Ready to enqueue?
  • attempts / max_attempts - Retry tracking
  • last_failure_at - When step last failed
  • next_retry_at - When step eligible for retry

Use Case: Understand task execution status


2. Get Step Details

GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}

Returns: Single step with full readiness analysis

Use Case: Deep dive into specific step


3. Manually Resolve Step

PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}

Purpose: Operator actions to handle stuck or failed steps

Action Types:

  1. ResetForRetry - Reset attempt counter and return to pending for automatic retry:
{
  "action_type": "reset_for_retry",
  "reset_by": "operator@example.com",
  "reason": "Database connection restored, resetting attempts"
}
  2. ResolveManually - Mark step as manually resolved without results:
{
  "action_type": "resolve_manually",
  "resolved_by": "operator@example.com",
  "reason": "Non-critical step, bypassing for workflow continuation"
}
  3. CompleteManually - Complete step with execution results for dependent steps:
{
  "action_type": "complete_manually",
  "completion_data": {
    "result": {
      "validated": true,
      "score": 95
    },
    "metadata": {
      "manually_verified": true,
      "verification_method": "manual_inspection"
    }
  },
  "reason": "Manual verification completed after infrastructure fix",
  "completed_by": "operator@example.com"
}

Behavior by Action Type:

  • reset_for_retry: Clears attempt counter, transitions to pending, enables automatic retry
  • resolve_manually: Transitions to resolved_manually (terminal state)
  • complete_manually: Transitions to complete with results available for dependent steps

Common Effects:

  • Triggers task state machine re-evaluation
  • Task automatically discovers next ready steps
  • Task progresses when all dependencies satisfied

Use Case: Unblock stuck workflow by fixing problem step


Complete Resolution Workflow

Scenario: Task Stuck in waiting_for_dependencies

1. Operator receives DLQ alert

GET /v1/dlq/investigation-queue
# Response shows task_uuid: abc-123 with high priority

2. Operator reviews task snapshot

GET /v1/dlq/task/abc-123
# Response:
{
  "dlq_entry_uuid": "xyz-789",
  "task_uuid": "abc-123",
  "original_state": "waiting_for_dependencies",
  "dlq_reason": "staleness_timeout",
  "task_snapshot": {
    "task_uuid": "abc-123",
    "namespace": "order_processing",
    "task_name": "fulfill_order",
    "current_state": "error",
    "time_in_state_minutes": 65,
    "threshold_minutes": 60
  }
}

3. Operator checks task steps

GET /v1/tasks/abc-123/workflow_steps
# Response shows:
# step_1: complete
# step_2: error (blocked, max_attempts exceeded)
# step_3: waiting_for_dependencies (blocked by step_2)

4. Operator investigates step_2 failure

GET /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
# Response shows last_failure_at and error details
# Root cause: database connection timeout

5. Operator fixes infrastructure issue

# Fix database connection pool configuration
# Verify database connectivity

6. Operator chooses resolution strategy

Option A: Reset for retry (infrastructure fixed, retry should work):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "reset_for_retry",
  "reset_by": "operator@example.com",
  "reason": "Database connection pool fixed, resetting attempts for automatic retry"
}

Option B: Resolve manually (bypass step entirely):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "resolve_manually",
  "resolved_by": "operator@example.com",
  "reason": "Non-critical validation step, bypassing"
}

Option C: Complete manually (provide results for dependent steps):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "complete_manually",
  "completion_data": {
    "result": {
      "validation_status": "passed",
      "score": 100
    },
    "metadata": {
      "manually_verified": true
    }
  },
  "reason": "Manual validation completed",
  "completed_by": "operator@example.com"
}

7. Task state machine automatically progresses

Outcome depends on action type chosen:

If Option A (reset_for_retry):

  • Step 2 → pending (attempts reset to 0)
  • Automatic retry begins when dependencies satisfied
  • Step 2 re-enqueued to worker
  • If successful, workflow continues normally

If Option B (resolve_manually):

  • Step 2 → resolved_manually (terminal state)
  • Step 3 dependencies satisfied (manual resolution counts as success)
  • Task transitions: error → enqueuing_steps
  • Step 3 enqueued to worker
  • Task resumes normal execution

If Option C (complete_manually):

  • Step 2 → complete (with operator-provided results)
  • Step 3 can consume results from completion_data
  • Task transitions: error → enqueuing_steps
  • Step 3 enqueued to worker with access to step 2 results
  • Task resumes normal execution

8. Operator updates DLQ investigation

PATCH /v1/dlq/entry/xyz-789
{
  "resolution_status": "manually_resolved",
  "resolution_notes": "Fixed database connection pool configuration. Manually resolved step_2 to unblock workflow. Task resumed execution.",
  "resolved_by": "operator@example.com",
  "metadata": {
    "root_cause": "database_connection_timeout",
    "fixed_step_uuid": "{step_2_uuid}",
    "infrastructure_fix": "increased_connection_pool_size"
  }
}

Step Retry and Attempt Lifecycles

Step State Machine

States:

  • pending - Initial state, awaiting dependencies
  • enqueued - Sent to worker queue
  • in_progress - Worker actively processing
  • enqueued_for_orchestration - Result submitted, awaiting orchestration
  • complete - Successfully finished
  • error - Failed (may be retryable)
  • cancelled - Manually cancelled
  • resolved_manually - Operator intervention

Retry Logic

Configured per step in template:

retry:
  retryable: true
  max_attempts: 3
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 30000

Retry Eligibility Criteria:

  1. retryable: true in configuration
  2. attempts < max_attempts
  3. Current state is error
  4. next_retry_at timestamp has passed (backoff elapsed)

Backoff Calculation:

backoff_ms = min(backoff_base_ms * (2 ^ (attempts - 1)), max_backoff_ms)

Example (base=1000ms, max=30000ms):

  • Attempt 1 fails → wait 1s
  • Attempt 2 fails → wait 2s
  • Attempt 3 fails → wait 4s

SQL Function: get_step_readiness_status() calculates retry_eligible and next_retry_at
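
A Rust rendering of the same formula, for reference (the engine evaluates this in SQL via the readiness function):

fn backoff_ms(attempts: u32, backoff_base_ms: u64, max_backoff_ms: u64) -> u64 {
    // `attempts` is the number of attempts already made (>= 1 after a failure).
    let exponential = backoff_base_ms
        .saturating_mul(2u64.saturating_pow(attempts.saturating_sub(1)));
    exponential.min(max_backoff_ms)
}

// With base=1000 and max=30000: attempt 1 -> 1000ms, 2 -> 2000ms, 3 -> 4000ms, 6 -> capped at 30000ms.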

Attempt Tracking

Fields (on workflow_steps table):

  • attempts - Current attempt count
  • max_attempts - Configuration limit
  • last_attempted_at - Timestamp of last execution
  • last_failure_at - Timestamp of last failure

Workflow:

  1. Step enqueued → attempts++
  2. Step fails → Record last_failure_at, calculate next_retry_at
  3. Backoff elapses → Step becomes retry_eligible: true
  4. Orchestration discovers ready steps → Step re-enqueued
  5. Repeat until success or attempts >= max_attempts

Max Attempts Exceeded:

  • Step remains in error state
  • retry_eligible: false
  • Task transitions to error state
  • May trigger DLQ entry with reason max_retries_exceeded

Independence from DLQ

Key Point: Step retry logic is INDEPENDENT of DLQ

  • Steps retry automatically based on configuration
  • DLQ does NOT trigger retries
  • DLQ does NOT modify retry counters
  • DLQ is pure observation and investigation

Why This Matters:

  • Retry logic is predictable and configuration-driven
  • DLQ doesn’t interfere with normal workflow execution
  • Operators can manually resolve to bypass retry limits
  • DLQ provides visibility into retry exhaustion patterns

Staleness Detection

Background Service

Component: tasker-orchestration/src/orchestration/staleness_detector.rs

Configuration:

[staleness_detection]
enabled = true
batch_size = 100
detection_interval_seconds = 300  # 5 minutes

Operation:

  1. Timer triggers every 5 minutes
  2. Calls detect_and_transition_stale_tasks() SQL function
  3. Function identifies tasks exceeding thresholds
  4. Creates DLQ entries for stale tasks
  5. Transitions tasks to error state
  6. Records OpenTelemetry metrics

Staleness Thresholds

Per-State Defaults (configurable):

  • waiting_for_dependencies: 60 minutes
  • waiting_for_retry: 30 minutes
  • steps_in_process: 30 minutes

Per-Template Override:

lifecycle:
  max_waiting_for_dependencies_minutes: 120
  max_waiting_for_retry_minutes: 45
  max_steps_in_process_minutes: 60

Precedence: Template config > Global defaults
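
A minimal sketch of that precedence rule (names are illustrative):

fn effective_staleness_threshold(template_override_minutes: Option<u32>, global_default_minutes: u32) -> u32 {
    // A template-level override, when present, wins over the global default.
    template_override_minutes.unwrap_or(global_default_minutes)
}

// effective_staleness_threshold(Some(120), 60) == 120  (template override wins)
// effective_staleness_threshold(None, 60)      == 60   (falls back to the global default)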

Staleness SQL Function

Function: detect_and_transition_stale_tasks()

Architecture:

v_task_state_analysis (base view)
    │
    ├── get_stale_tasks_for_dlq() (discovery function)
    │       │
    │       └── detect_and_transition_stale_tasks() (main orchestration)
    │               ├── create_dlq_entry() (DLQ creation)
    │               └── transition_stale_task_to_error() (state transition)

Performance Optimization:

  • Expensive joins happen ONCE in base view
  • Discovery function filters stale tasks
  • Main function processes results in loop
  • LEFT JOIN anti-join pattern for excluding tasks with pending DLQ entries

Output: Returns StalenessResult records with:

  • Task identification (UUID, namespace, name)
  • State and timing information
  • action_taken - What happened (enum: TransitionedToDlqAndError, MovedToDlqOnly, etc.)
  • moved_to_dlq - Boolean
  • transition_success - Boolean
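
As a rough guide to consuming these records, a Rust shape assumed from the field list above (the actual struct in tasker-orchestration may differ in names and types):

use uuid::Uuid;

// Assumed shape based on the fields listed above; types are guesses.
pub struct StalenessResult {
    pub task_uuid: Uuid,
    pub namespace_name: String,
    pub task_name: String,
    pub current_state: String,
    pub time_in_state_minutes: i64,
    pub action_taken: String, // an enum in the real code, e.g. "TransitionedToDlqAndError"
    pub moved_to_dlq: bool,
    pub transition_success: bool,
}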

OpenTelemetry Metrics

Metrics Exported

Counters:

  • tasker.dlq.entries_created.total - DLQ entries created
  • tasker.staleness.tasks_detected.total - Stale tasks detected
  • tasker.staleness.tasks_transitioned_to_error.total - Tasks moved to Error
  • tasker.staleness.detection_runs.total - Detection cycles

Histograms:

  • tasker.staleness.detection.duration - Detection execution time (ms)
  • tasker.dlq.time_in_queue - Time in DLQ before resolution (hours)

Gauges:

  • tasker.dlq.pending_investigations - Current pending DLQ count

Alert Examples

Prometheus Alerting Rules:

# Alert on high pending investigations
- alert: HighPendingDLQInvestigations
  expr: tasker_dlq_pending_investigations > 50
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High number of pending DLQ investigations ({{ $value }})"

# Alert on slow detection cycles
- alert: SlowStalenessDetection
  expr: tasker_staleness_detection_duration > 5000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Staleness detection taking >5s ({{ $value }}ms)"

# Alert on high stale task rate
- alert: HighStalenessRate
  expr: rate(tasker_staleness_tasks_detected_total[5m]) > 10
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High rate of stale task detection ({{ $value }}/sec)"

CLI Usage Examples

The tasker-ctl tool provides commands for managing workflow steps directly from the command line.

List Workflow Steps

# List all steps for a task
tasker-ctl task steps <TASK_UUID>

# Example output:
# ✓ Found 3 workflow steps:
#
#   Step: validate_input (01933d7c-...)
#     State: complete
#     Dependencies satisfied: true
#     Ready for execution: false
#     Attempts: 1/3
#
#   Step: process_order (01933d7c-...)
#     State: error
#     Dependencies satisfied: true
#     Ready for execution: false
#     Attempts: 3/3
#     ⚠ Not retry eligible (max attempts reached)

Get Step Details

# Get detailed information about a specific step
tasker-ctl task step <TASK_UUID> <STEP_UUID>

# Example output:
# ✓ Step Details:
#
#   UUID: 01933d7c-...
#   Name: process_order
#   State: error
#   Dependencies satisfied: true
#   Ready for execution: false
#   Retry eligible: false
#   Attempts: 3/3
#   Last failure: 2025-11-02T14:23:45Z

Reset Step for Retry

When infrastructure is fixed and you want to reset attempt counter:

tasker-ctl task reset-step <TASK_UUID> <STEP_UUID> \
  --reason "Database connection pool increased" \
  --reset-by "ops-team@example.com"

# Example output:
# ✓ Step reset successfully!
#   New state: pending
#   Reason: Database connection pool increased
#   Reset by: ops-team@example.com

Resolve Step Manually

When you want to bypass a non-critical step:

tasker-ctl task resolve-step <TASK_UUID> <STEP_UUID> \
  --reason "Non-critical validation, bypassing" \
  --resolved-by "ops-team@example.com"

# Example output:
# ✓ Step resolved manually!
#   New state: resolved_manually
#   Reason: Non-critical validation, bypassing
#   Resolved by: ops-team@example.com

Complete Step Manually with Results

When you’ve manually performed the step’s work and need to provide results:

tasker-ctl task complete-step <TASK_UUID> <STEP_UUID> \
  --result '{"validated": true, "score": 95}' \
  --metadata '{"verification_method": "manual_review"}' \
  --reason "Manual verification after infrastructure fix" \
  --completed-by "ops-team@example.com"

# Example output:
# ✓ Step completed manually with results!
#   New state: complete
#   Reason: Manual verification after infrastructure fix
#   Completed by: ops-team@example.com

JSON Formatting Tips:

# Use single quotes around JSON to avoid shell escaping issues
--result '{"key": "value"}'

# For complex JSON, use a heredoc or file
--result "$(cat <<'EOF'
{
  "validation_status": "passed",
  "checks": ["auth", "permissions", "rate_limit"],
  "score": 100
}
EOF
)"

# Or read from a file
--result "$(cat result.json)"

Operational Runbooks

Runbook 1: Investigating High DLQ Count

Trigger: tasker_dlq_pending_investigations > 50

Steps:

  1. Check DLQ dashboard:
curl /v1/dlq/stats | jq
  2. Identify dominant reason:
{
  "dlq_reason": "staleness_timeout",
  "total_entries": 45,
  "pending": 45
}
  3. Get investigation queue:
curl /v1/dlq/investigation-queue?limit=10 | jq
  4. Check staleness monitoring:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "stale")'
  5. Identify patterns:
  • Common namespace?
  • Common task template?
  • Common time period?
  6. Take action:
  • Infrastructure issue? → Fix and manually resolve affected tasks
  • Template misconfiguration? → Update template thresholds
  • Worker unavailable? → Scale worker capacity
  • Systemic dependency issue? → Investigate upstream systems

Runbook 2: Proactive Staleness Prevention

Trigger: Regular monitoring (not incident-driven)

Steps:

  1. Monitor warning threshold:
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'
  2. Alert when warning count exceeds baseline:
if [ $warning_count -gt 10 ]; then
  alert "High staleness warning count: $warning_count tasks at 80%+ threshold"
fi
  3. Investigate early:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "warning") | {
  task_uuid,
  current_state,
  time_in_state_minutes,
  staleness_threshold_minutes,
  threshold_percentage: ((.time_in_state_minutes / .staleness_threshold_minutes) * 100)
}'
  4. Intervene before DLQ:
  • Check task steps for blockages
  • Review dependencies
  • Manually resolve if appropriate

Best Practices

For Operators

DO:

  • Use staleness monitoring for proactive prevention
  • Document investigation findings in DLQ resolution notes
  • Fix root causes, not just symptoms
  • Update DLQ investigation status promptly
  • Use step endpoints to resolve stuck workflows
  • Monitor DLQ statistics for systemic patterns

DON’T:

  • Don’t try to “requeue” from DLQ - fix the steps instead
  • Don’t ignore warning health status - investigate early
  • Don’t manually resolve steps without fixing root cause
  • Don’t leave DLQ investigations in pending status indefinitely

For Developers

DO:

  • Configure appropriate staleness thresholds per template
  • Make steps retryable with sensible backoff
  • Implement idempotent step handlers
  • Add defensive timeouts to prevent hanging
  • Test workflows under failure scenarios

DON’T:

  • Don’t set thresholds too low (causes false positives)
  • Don’t set thresholds too high (delays detection)
  • Don’t make all steps non-retryable
  • Don’t ignore DLQ patterns - they indicate design issues
  • Don’t rely on DLQ for normal workflow control flow

Testing

Test Coverage

Unit Tests: SQL function testing (17 tests)

  • Staleness detection logic
  • DLQ entry creation
  • Threshold calculation with template overrides
  • View query correctness

Integration Tests: Lifecycle testing (4 tests)

  • Waiting for dependencies staleness (test_dlq_lifecycle_waiting_for_dependencies_staleness)
  • Steps in process staleness (test_dlq_lifecycle_steps_in_process_staleness)
  • Proactive monitoring with health status progression (test_dlq_lifecycle_proactive_monitoring)
  • Complete investigation workflow (test_dlq_investigation_workflow)

Metrics Tests: OpenTelemetry integration (1 test)

  • Staleness detection metrics recording
  • DLQ investigation metrics recording
  • Pending investigations gauge query

Test Template: tests/fixtures/task_templates/rust/dlq_staleness_test.yaml

  • 2-step linear workflow
  • 2-minute staleness thresholds for fast test execution
  • Test-only template for lifecycle validation

Performance: All 22 tests complete in 0.95s (< 1s target)


Implementation Notes

File Locations:

  • Staleness detector: tasker-orchestration/src/orchestration/staleness_detector.rs
  • DLQ models: tasker-shared/src/models/orchestration/dlq.rs
  • SQL functions: migrations/20251122000004_add_dlq_discovery_function.sql
  • Database views: migrations/20251122000003_add_dlq_views.sql

Key Design Decisions:

  • Investigation tracking only - no task manipulation
  • Step-level resolution via existing step endpoints
  • Proactive monitoring at 80% threshold
  • Template-specific threshold overrides
  • Atomic DLQ entry creation with unique constraint
  • Time-ordered UUID v7 for investigation tracking

Future Enhancements

Potential improvements (not currently planned):

  1. DLQ Patterns Analysis

    • Machine learning to identify systemic issues
    • Automated root cause suggestions
    • Pattern clustering by namespace/template
  2. Advanced Alerting

    • Anomaly detection on staleness rates
    • Predictive DLQ entry forecasting
    • Correlation with infrastructure metrics
  3. Investigation Workflow

    • Automated triage rules
    • Escalation policies
    • Integration with incident management systems
  4. Performance Optimization

    • Materialized views for dashboard
    • Query result caching
    • Incremental staleness detection

End of Documentation

Handler Resolution Guide

Last Updated: 2026-01-08 Audience: Developers, Architects Status: Active Related Docs: Worker Event Systems | API Convergence Matrix

← Back to Guides


Overview

Handler resolution is the process of converting a callable address (a string in your YAML template) into an executable handler instance that can process workflow steps. The resolver chain pattern provides a flexible, extensible approach that works consistently across all language workers.

This guide covers:

  • The mental model for handler resolution
  • The common path for task templates
  • Built-in resolvers and how they work
  • Method dispatch for multi-method handlers
  • Writing custom resolvers
  • Cross-language considerations

Mental Model

Handler resolution uses three key concepts:

handler:
  callable: "PaymentProcessor"      # 1. Address: WHERE to find the handler
  method: "refund"                  # 2. Entry Point: WHICH method to invoke
  resolver: "explicit_mapping"      # 3. Resolution Hint: HOW to resolve

1. Address (callable)

The callable field is a logical address that identifies the handler. Think of it like a URL - it points to where the handler lives, but the format depends on your resolution strategy:

| Format | Example | Resolver |
| --- | --- | --- |
| Short name | payment_processor | ExplicitMappingResolver |
| Class path (Ruby) | PaymentHandlers::ProcessPaymentHandler | ClassConstantResolver |
| Module path (Python) | payment_handlers.ProcessPaymentHandler | ClassLookupResolver |
| Namespace path (TS) | PaymentHandlers.ProcessPaymentHandler | ClassLookupResolver |

2. Entry Point (method)

The method field specifies which method to invoke on the handler. This enables multi-method handlers - a single handler class that exposes multiple entry points:

# Default: calls the `call` method
handler:
  callable: payment_processor

# Explicit method: calls the `refund` method instead
handler:
  callable: payment_processor
  method: refund

When to use method dispatch:

  • Payment handlers with charge, refund, void methods
  • Validation handlers with validate_input, validate_output methods
  • CRUD handlers with create, read, update, delete methods

3. Resolution Hint (resolver)

The resolver field is an optional optimization that bypasses the resolver chain and goes directly to a specific resolver:

# Let the chain figure it out (default)
handler:
  callable: payment_processor

# Skip directly to explicit mapping (faster, explicit)
handler:
  callable: payment_processor
  resolver: explicit_mapping

When to use resolver hints:

  • Performance optimization for high-throughput steps
  • Explicit documentation of resolution strategy
  • Avoiding ambiguity when multiple resolvers could match

The Common Path

For most templates, you don’t need to think about resolution at all. The default resolution flow handles common cases automatically:

# Most common pattern - just specify the callable
steps:
  - name: process_payment
    handler:
      callable: process_payment  # Resolved by ExplicitMappingResolver
      initialization:
        timeout_ms: 5000

What happens under the hood:

  1. Worker receives step execution event
  2. HandlerDispatchService extracts the HandlerDefinition
  3. ResolverChain iterates through resolvers by priority
  4. ExplicitMappingResolver (priority 10) finds the registered handler
  5. Handler is invoked with call() method (default)

Resolver Chain Architecture

The resolver chain is an ordered list of resolvers, each with a priority. Lower priority numbers are checked first:

┌─────────────────────────────────────────────────────────────────┐
│                      ResolverChain                               │
│                                                                  │
│  ┌──────────────────────┐  ┌──────────────────────┐            │
│  │ ExplicitMapping      │  │ ClassConstant        │            │
│  │ Priority: 10         │──│ Priority: 100        │──► ...     │
│  │                      │  │                      │            │
│  │ "process_payment" ──►│  │ "Handlers::Payment"──►           │
│  │  Handler instance    │  │  constantize()       │            │
│  └──────────────────────┘  └──────────────────────┘            │
└─────────────────────────────────────────────────────────────────┘

Resolution Flow

HandlerDefinition
       │
       ▼
┌──────────────────┐
│ Has resolver     │──Yes──► Go directly to named resolver
│ hint?            │
└────────┬─────────┘
         │ No
         ▼
┌──────────────────┐
│ ExplicitMapping  │──can_resolve?──Yes──► Return handler
│ (priority 10)    │
└────────┬─────────┘
         │ No
         ▼
┌──────────────────┐
│ ClassConstant    │──can_resolve?──Yes──► Return handler
│ (priority 100)   │
└────────┬─────────┘
         │ No
         ▼
    ResolutionError

Built-in Resolvers

ExplicitMappingResolver (Priority 10)

The primary resolver for all workers. Handlers are registered with string keys at startup:

#![allow(unused)]
fn main() {
// Rust registration
registry.register("process_payment", Arc::new(ProcessPaymentHandler::new()));
}
# Ruby registration
registry.register("process_payment", ProcessPaymentHandler)
# Python registration
registry.register("process_payment", ProcessPaymentHandler)
// TypeScript registration
registry.register("process_payment", ProcessPaymentHandler);

When it resolves: When the callable exactly matches a registered key.

Best for:

  • Native Rust handlers (required - no runtime reflection)
  • Performance-critical handlers
  • Explicit, predictable resolution

Class Lookup Resolvers (Priority 100)

Dynamic language only (Ruby, Python, TypeScript). Interprets the callable as a class path and instantiates it at runtime.

Naming Note: Ruby uses ClassConstantResolver (Ruby terminology for classes). Python and TypeScript use ClassLookupResolver. The functionality is equivalent.

# Ruby: Uses Object.const_get (ClassConstantResolver)
handler:
  callable: PaymentHandlers::ProcessPaymentHandler

# Python: Uses importlib (ClassLookupResolver)
handler:
  callable: payment_handlers.ProcessPaymentHandler

# TypeScript: Uses dynamic import (ClassLookupResolver)
handler:
  callable: PaymentHandlers.ProcessPaymentHandler

When it resolves: When the callable looks like a class/module path (contains ::, ., or starts with uppercase).

Best for:

  • Convention-over-configuration setups
  • Handlers that don’t need explicit registration
  • Dynamic handler loading

Not available in Rust: Rust has no runtime reflection, so class lookup resolvers always return None. Use ExplicitMappingResolver instead.
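
A sketch of the "looks like a class path" heuristic above in Rust, assuming the documented rules are the whole check:

fn looks_like_class_path(callable: &str) -> bool {
    callable.contains("::")
        || callable.contains('.')
        || callable.chars().next().map_or(false, char::is_uppercase)
}

// looks_like_class_path("PaymentHandlers::ProcessPaymentHandler") == true
// looks_like_class_path("payment_handlers.ProcessPaymentHandler") == true
// looks_like_class_path("process_payment")                        == false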


Method Dispatch

Method dispatch allows a single handler to expose multiple entry points. This is useful for handlers that perform related operations:

Defining a Multi-Method Handler

# Ruby
class PaymentHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Default method - standard payment processing
  end

  def refund(context)
    # Refund-specific logic
  end

  def void(context)
    # Void-specific logic
  end
end
# Python
class PaymentHandler(StepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        # Default method
        pass

    def refund(self, context: StepContext) -> StepHandlerResult:
        # Refund-specific logic
        pass
// TypeScript
class PaymentHandler extends StepHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    // Default method
  }

  async refund(context: StepContext): Promise<StepHandlerResult> {
    // Refund-specific logic
  }
}
#![allow(unused)]
fn main() {
// Rust - requires explicit method routing
impl RustStepHandler for PaymentHandler {
    async fn call(&self, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        // Default method
    }

    async fn invoke_method(&self, method: &str, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        match method {
            "refund" => self.refund(step).await,
            "void" => self.void(step).await,
            _ => self.call(step).await,
        }
    }
}
}

Using Method Dispatch in Templates

steps:
  - name: process_refund
    handler:
      callable: payment_handler
      method: refund  # Invokes refund() instead of call()
      initialization:
        reason_required: true

How Method Dispatch Works

  1. Resolver chain resolves the handler from callable
  2. If method is specified and not “call”, a MethodDispatchWrapper is applied
  3. When invoked, the wrapper calls the specified method instead of call()
                    ┌───────────────────┐
HandlerDefinition ──│ ResolverChain     │── Handler
(method: "refund")  │                   │
                    └─────────┬─────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │ MethodDispatch    │
                    │ Wrapper           │
                    │                   │
                    │ inner.refund()    │
                    └───────────────────┘
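
A minimal Rust sketch of such a wrapper, reusing the RustStepHandler signatures shown earlier in this guide (the real wrapper's types, trait bounds, and error handling will differ):

use async_trait::async_trait;

// Sketch only: forwards non-default methods to invoke_method on the wrapped handler.
pub struct MethodDispatchWrapper<H> {
    inner: H,
    method: String,
}

#[async_trait]
impl<H: RustStepHandler + Send + Sync> RustStepHandler for MethodDispatchWrapper<H> {
    async fn call(&self, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        if self.method == "call" {
            // Default entry point: no dispatch needed.
            self.inner.call(step).await
        } else {
            // Route "refund", "void", etc. through the handler's invoke_method.
            self.inner.invoke_method(&self.method, step).await
        }
    }

    async fn invoke_method(&self, method: &str, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        self.inner.invoke_method(method, step).await
    }
}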

Writing Custom Resolvers

You can extend the resolver chain with custom resolution strategies for your domain.

Rust Custom Resolver

#![allow(unused)]
fn main() {
use tasker_shared::registry::{StepHandlerResolver, ResolutionContext, ResolvedHandler};
use async_trait::async_trait;

#[derive(Debug)]
pub struct ServiceDiscoveryResolver {
    service_registry: Arc<ServiceRegistry>,
}

#[async_trait]
impl StepHandlerResolver for ServiceDiscoveryResolver {
    fn resolver_name(&self) -> &str {
        "service_discovery"
    }

    fn priority(&self) -> u32 {
        50  // Between explicit (10) and class constant (100)
    }

    fn can_resolve(&self, definition: &HandlerDefinition) -> bool {
        // Resolve callables that start with "service://"
        definition.callable.starts_with("service://")
    }

    async fn resolve(
        &self,
        definition: &HandlerDefinition,
        context: &ResolutionContext,
    ) -> Result<Arc<dyn ResolvedHandler>, ResolutionError> {
        let service_name = definition.callable.strip_prefix("service://").unwrap();
        let handler = self.service_registry.lookup(service_name).await?;
        Ok(Arc::new(StepHandlerAsResolved::new(handler)))
    }
}
}

Ruby Custom Resolver

module TaskerCore
  module Registry
    class ServiceDiscoveryResolver < BaseResolver
      def resolver_name
        "service_discovery"
      end

      def priority
        50
      end

      def can_resolve?(definition)
        definition.callable.start_with?("service://")
      end

      def resolve(definition, context)
        service_name = definition.callable.delete_prefix("service://")
        handler_class = ServiceRegistry.lookup(service_name)
        handler_class.new
      end
    end
  end
end

Python Custom Resolver

from tasker_core.registry import BaseResolver, ResolutionError

class ServiceDiscoveryResolver(BaseResolver):
    def resolver_name(self) -> str:
        return "service_discovery"

    def priority(self) -> int:
        return 50

    def can_resolve(self, definition: HandlerDefinition) -> bool:
        return definition.callable.startswith("service://")

    async def resolve(
        self, definition: HandlerDefinition, context: ResolutionContext
    ) -> ResolvedHandler:
        service_name = definition.callable.removeprefix("service://")
        handler_class = self.service_registry.lookup(service_name)
        return handler_class()

TypeScript Custom Resolver

import { BaseResolver, HandlerDefinition, ResolutionContext } from './registry';

export class ServiceDiscoveryResolver extends BaseResolver {
  resolverName(): string {
    return 'service_discovery';
  }

  priority(): number {
    return 50;
  }

  canResolve(definition: HandlerDefinition): boolean {
    return definition.callable.startsWith('service://');
  }

  async resolve(
    definition: HandlerDefinition,
    context: ResolutionContext
  ): Promise<ResolvedHandler> {
    const serviceName = definition.callable.replace('service://', '');
    const HandlerClass = await this.serviceRegistry.lookup(serviceName);
    return new HandlerClass();
  }
}

Registering Custom Resolvers

#![allow(unused)]
fn main() {
// Rust
let mut chain = ResolverChain::new();
chain.register(Arc::new(ExplicitMappingResolver::new()));
chain.register(Arc::new(ServiceDiscoveryResolver::new(service_registry)));
chain.register(Arc::new(ClassConstantResolver::new()));
}
# Ruby
chain = TaskerCore::Registry::ResolverChain.new
chain.register(TaskerCore::Registry::ExplicitMappingResolver.new)
chain.register(ServiceDiscoveryResolver.new(service_registry))
chain.register(TaskerCore::Registry::ClassConstantResolver.new)

Cross-Language Considerations

Why Rust is Different

Rust has no runtime reflection, which affects handler resolution:

| Capability | Ruby/Python/TypeScript | Rust |
| --- | --- | --- |
| Class Lookup Resolver | ✅ Works | ❌ Always returns None |
| Method dispatch | ✅ Native (send, getattr) | ⚠️ Requires invoke_method |
| Dynamic handler loading | const_get, importlib | ❌ Must pre-register |

Best Practice for Rust:

  • Always use ExplicitMappingResolver with explicit registration
  • Implement invoke_method() for multi-method handlers
  • Use resolver hints (resolver: explicit_mapping) for clarity

Method Dispatch by Language

| Language | Default Method | Dynamic Dispatch |
| --- | --- | --- |
| Ruby | call | handler.public_send(method, context) |
| Python | call | getattr(handler, method)(context) |
| TypeScript | call | handler[method](context) |
| Rust | call | handler.invoke_method(method, step) |

Troubleshooting

“Handler not found” Error

Symptoms: ResolutionError: No resolver could resolve callable 'my_handler'

Causes:

  1. Handler not registered with ExplicitMappingResolver
  2. Class path typo (for ClassConstantResolver)
  3. Handler registered with different name than callable

Solutions:

#![allow(unused)]
fn main() {
// Verify registration
assert!(registry.is_registered("my_handler"));

// Check registered handlers
println!("{:?}", registry.list_handlers());
}

Method Not Found

Symptoms: MethodNotFound: Handler 'my_handler' does not respond to 'refund'

Causes:

  1. Method name typo in YAML template
  2. Method not defined on handler class
  3. Method is private (Ruby) or underscore-prefixed (Python)

Solutions:

# Verify method name matches exactly
handler:
  callable: payment_handler
  method: refund  # Must match method name in handler

Resolver Hint Ignored

Symptoms: Resolution works but seems slow, or wrong resolver is used

Causes:

  1. Resolver hint name doesn’t match any registered resolver
  2. Resolver with that name returns None for this callable

Solutions:

# Use exact resolver name
handler:
  callable: my_handler
  resolver: explicit_mapping  # Not "explicit" or "mapping"

Best Practices

1. Prefer Explicit Registration

# Good: Clear, predictable, works in all languages
handler:
  callable: process_payment

# Avoid: Relies on runtime class lookup, not portable to Rust
handler:
  callable: PaymentHandlers::ProcessPaymentHandler

2. Use Method Dispatch for Related Operations

# Good: Single handler, multiple entry points
steps:
  - name: validate_input
    handler:
      callable: validator
      method: validate_input

  - name: validate_output
    handler:
      callable: validator
      method: validate_output

# Avoid: Separate handlers for closely related operations
steps:
  - name: validate_input
    handler:
      callable: input_validator
  - name: validate_output
    handler:
      callable: output_validator

3. Document Resolution Strategy

# Good: Explicit about how resolution works
handler:
  callable: payment_processor
  resolver: explicit_mapping  # Self-documenting
  method: refund
  initialization:
    timeout_ms: 5000

4. Test Resolution in Isolation

#![allow(unused)]
fn main() {
#[test]
fn test_handler_resolution() {
    let chain = create_resolver_chain();
    let definition = HandlerDefinition::builder()
        .callable("process_payment")
        .build();

    assert!(chain.can_resolve(&definition));
}
}

Summary

| Concept | Purpose | Default |
| --- | --- | --- |
| callable | Handler address | Required |
| method | Entry point method | "call" |
| resolver | Resolution strategy hint | Chain iteration |
| ExplicitMappingResolver | Registered handlers | Priority 10 |
| ClassConstantResolver / ClassLookupResolver | Dynamic class lookup | Priority 100 |
| MethodDispatchWrapper | Multi-method support | Applied when method != "call" |

The resolver chain provides a flexible, extensible system for handler resolution that works consistently across all language workers while respecting each language’s capabilities.

Task Identity Strategy Pattern

Last Updated: 2026-01-20 Audience: Developers, Operators Status: Active Related Docs: Documentation Hub | Idempotency and Atomicity

← Back to Documentation Hub


Overview

Task identity determines how Tasker deduplicates task creation requests. The identity strategy pattern allows named tasks to configure their deduplication behavior based on domain requirements.

When a task creation request arrives, Tasker computes an identity hash based on the configured strategy. If a task with that identity hash already exists, the request is rejected with a 409 Conflict response.

Why This Matters

Task identity is domain-specific:

| Use Case | Same Template + Same Context | Desired Behavior |
| --- | --- | --- |
| Payment processing | Likely accidental duplicate | Deduplicate (safety) |
| Nightly batch job | Intentional repetition | Allow (operational) |
| Report generation | Could be either | Configurable |
| Event-driven triggers | Often intentional | Allow |
| Retry with same params | Intentional | Allow |

A TaskRequest with identical context might be:

  • An accidental duplicate (network retry, user double-click) → should deduplicate
  • An intentional repetition (scheduled job, legitimate re-run) → should allow

Identity Strategies

STRICT (Default)

identity_hash = hash(named_task_uuid, normalized_context)

Same named task + same context = same identity hash = deduplicated.

Use when:

  • Accidental duplicates are a risk (payments, orders, notifications)
  • Context fully describes the work to be done
  • Network retries or user double-clicks should be safe

Example:

#![allow(unused)]
fn main() {
// Payment processing - same payment_id should never create duplicate tasks
TaskRequest {
    namespace: "payments".to_string(),
    name: "process_payment".to_string(),
    context: json!({
        "payment_id": "PAY-12345",
        "amount": 100.00,
        "currency": "USD"
    }),
    idempotency_key: None,  // Uses STRICT strategy
    ..Default::default()
}
}

CALLER_PROVIDED

identity_hash = hash(named_task_uuid, idempotency_key)

Caller must provide idempotency_key. Request is rejected with 400 Bad Request if the key is missing.

Use when:

  • Caller has a natural idempotency key (order_id, transaction_id, request_id)
  • Caller needs control over deduplication scope
  • Similar to Stripe’s Idempotency-Key pattern

Example:

#![allow(unused)]
fn main() {
// Order processing - caller controls idempotency with their order ID
TaskRequest {
    namespace: "orders".to_string(),
    name: "fulfill_order".to_string(),
    context: json!({
        "order_id": "ORD-98765",
        "items": [...]
    }),
    idempotency_key: Some("ORD-98765".to_string()),  // Required for CallerProvided
    ..Default::default()
}
}

ALWAYS_UNIQUE

identity_hash = uuidv7()

Every request creates a new task. No deduplication.

Use when:

  • Every submission should create work (notifications, events)
  • Repetition is intentional (scheduled jobs, cron-like triggers)
  • Context doesn’t define uniqueness

Example:

#![allow(unused)]
fn main() {
// Notification sending - every call should send a notification
TaskRequest {
    namespace: "notifications".to_string(),
    name: "send_email".to_string(),
    context: json!({
        "user_id": 123,
        "template": "welcome",
        "data": {...}
    }),
    idempotency_key: None,  // ALWAYS_UNIQUE ignores this
    ..Default::default()
}
}

Configuration

Named Task Configuration

Set the identity strategy in your task template:

# templates/payments/process_payment.yaml
namespace: payments
name: process_payment
version: "1.0.0"
identity_strategy: strict  # strict | caller_provided | always_unique

steps:
  - name: validate_payment
    handler: payment_validator
    # ...

Per-Request Override

The idempotency_key field overrides any strategy:

#![allow(unused)]
fn main() {
// Even if named task is ALWAYS_UNIQUE, this key makes it deduplicate
TaskRequest {
    idempotency_key: Some("my-custom-key-12345".to_string()),
    // ... other fields
}
}

Precedence:

  1. idempotency_key (if provided) → always uses hash of key
  2. Named task’s identity_strategy → applies if no key provided
  3. Default → STRICT (if strategy not configured)
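
A sketch of this precedence in Rust; the strategy enum, stable_hash helper, and error handling are illustrative stand-ins (not the engine's code), and Uuid::now_v7 assumes the uuid crate's v7 feature:

use serde_json::Value;
use uuid::Uuid;

// Illustrative stand-ins for the engine's strategy enum and hashing.
enum IdentityStrategy { Strict, CallerProvided, AlwaysUnique }

fn stable_hash(named_task_uuid: &Uuid, material: &str) -> String {
    // Placeholder: the engine uses its own hashing over these inputs.
    format!("{}:{}", named_task_uuid, material)
}

fn identity_hash(
    named_task_uuid: &Uuid,
    normalized_context: &Value,
    idempotency_key: Option<&str>,
    strategy: &IdentityStrategy,
) -> Result<String, &'static str> {
    match (idempotency_key, strategy) {
        // 1. An explicit idempotency_key always wins, regardless of strategy.
        (Some(key), _) => Ok(stable_hash(named_task_uuid, key)),
        // 2. Otherwise the named task's configured strategy applies (STRICT is the default).
        (None, IdentityStrategy::Strict) => {
            Ok(stable_hash(named_task_uuid, &normalized_context.to_string()))
        }
        (None, IdentityStrategy::AlwaysUnique) => Ok(Uuid::now_v7().to_string()),
        // CallerProvided without a key is rejected upstream with 400 Bad Request.
        (None, IdentityStrategy::CallerProvided) => Err("idempotency_key is required"),
    }
}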

API Behavior

Successful Creation (201 Created)

{
  "task_uuid": "019bddae-b818-7d82-b7c5-bd42e5db27fc",
  "step_count": 4,
  "message": "Task created successfully"
}

Duplicate Identity (409 Conflict)

When a task with the same identity hash exists:

{
  "error": {
    "code": "CONFLICT",
    "message": "A task with this identity already exists. The task's identity strategy prevents duplicate creation."
  }
}

Security Note: The API returns 409 Conflict rather than the existing task’s UUID. This prevents potential data leakage where attackers could probe for existing task UUIDs by submitting requests with guessed contexts.

Missing Idempotency Key (400 Bad Request)

When CallerProvided strategy requires a key:

{
  "error": {
    "code": "BAD_REQUEST",
    "message": "idempotency_key is required when named task uses CallerProvided identity strategy"
  }
}

JSON Normalization

For STRICT strategy, the context JSON is normalized before hashing:

  • Key ordering: Keys are sorted alphabetically (recursively)
  • Whitespace: Removed for consistency
  • Semantic equivalence: {"b":2,"a":1} and {"a":1,"b":2} produce the same hash

This means these two requests produce the same identity hash:

#![allow(unused)]
fn main() {
// Request 1
context: json!({"user_id": 123, "action": "create"})

// Request 2 - same content, different key order
context: json!({"action": "create", "user_id": 123})
}

Note: Array order is preserved (arrays are ordered by definition).
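
A sketch of this normalization using serde_json (the engine's implementation may differ in detail, but the semantics match the rules above):

use serde_json::{json, Value};
use std::collections::BTreeMap;

// Objects are key-sorted recursively; array order is preserved.
fn normalize(value: &Value) -> Value {
    match value {
        Value::Object(map) => {
            // BTreeMap iterates in key order, which sorts the object keys.
            let sorted: BTreeMap<String, Value> = map
                .iter()
                .map(|(k, v)| (k.clone(), normalize(v)))
                .collect();
            Value::Object(sorted.into_iter().collect())
        }
        Value::Array(items) => Value::Array(items.iter().map(normalize).collect()),
        other => other.clone(),
    }
}

fn main() {
    let a = json!({"user_id": 123, "action": "create"});
    let b = json!({"action": "create", "user_id": 123});
    // Both serialize identically after normalization, so they hash identically.
    assert_eq!(normalize(&a).to_string(), normalize(&b).to_string());
}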

Pattern 1: Time-Bucketed Keys

For deduplication within a time window but allowing repetition across windows:

#![allow(unused)]
fn main() {
// Dedupe within same hour, allow across hours
let hour_bucket = chrono::Utc::now().format("%Y-%m-%d-%H");
let idempotency_key = format!("{}-{}-{}", job_name, customer_id, hour_bucket);

TaskRequest {
    namespace: "reports".to_string(),
    name: "generate_report".to_string(),
    context: json!({ "customer_id": 12345 }),
    idempotency_key: Some(idempotency_key),
    ..Default::default()
}
}

Pattern 2: Time-Aware Context

Include scheduling context directly in the request:

#![allow(unused)]
fn main() {
TaskRequest {
    namespace: "batch".to_string(),
    name: "daily_reconciliation".to_string(),
    context: json!({
        "account_id": "ACC-001",
        "run_date": "2026-01-20",      // Changes daily
        "run_window": "morning"         // Optional: finer granularity
    }),
    ..Default::default()
}
}

Granularity Guide

| Dedup Window | Key/Context Pattern | Use Case |
| --- | --- | --- |
| Per-minute | {job}-{YYYY-MM-DD-HH-mm} | High-frequency event processing |
| Per-hour | {job}-{YYYY-MM-DD-HH} | Hourly reports, rate-limited APIs |
| Per-day | {job}-{YYYY-MM-DD} | Daily batch jobs, EOD processing |
| Per-week | {job}-{YYYY-Www} | Weekly aggregations |
| Per-month | {job}-{YYYY-MM} | Monthly billing cycles |

Anti-Patterns

Don’t Rely on Timing

#![allow(unused)]
fn main() {
// BAD: Hoping requests are "far enough apart"
TaskRequest { context: json!({ "customer_id": 123 }) }
}

Don’t Use ALWAYS_UNIQUE for Critical Operations

#![allow(unused)]
fn main() {
// BAD: Creates duplicate work on network retries
// Named task with AlwaysUnique for payment processing
}

Do Make Identity Explicit

#![allow(unused)]
fn main() {
// GOOD: Clear what makes this task unique
TaskRequest {
    context: json!({
        "payment_id": "PAY-123",  // Natural idempotency key
        "amount": 100
    }),
    ..Default::default()
}
}

Database Implementation

The identity strategy is enforced at the database level:

  1. UNIQUE constraint on identity_hash column prevents duplicates
  2. identity_strategy column on named_tasks stores the configured strategy
  3. Atomic insertion with constraint violation returns 409 Conflict
-- Identity hash has unique constraint
CREATE UNIQUE INDEX idx_tasks_identity_hash ON tasker.tasks(identity_hash);

-- Named tasks store their strategy
ALTER TABLE tasker.named_tasks
ADD COLUMN identity_strategy VARCHAR(20) DEFAULT 'strict';

Testing Considerations

When writing tests that create tasks, inject a unique identifier to avoid identity hash collisions:

#![allow(unused)]
fn main() {
// Test utility that ensures unique identity per test run
fn create_task_request(namespace: &str, name: &str, context: Value) -> TaskRequest {
    let mut ctx = context.as_object().cloned().unwrap_or_default();
    ctx.insert("_test_run_id".to_string(), json!(Uuid::now_v7().to_string()));

    TaskRequest {
        namespace: namespace.to_string(),
        name: name.to_string(),
        context: Value::Object(ctx),
        ..Default::default()
    }
}
}

Summary

| Strategy | Identity Hash | Deduplicates? | Key Required? |
| --- | --- | --- | --- |
| STRICT | hash(uuid, context) | Yes | No |
| CALLER_PROVIDED | hash(uuid, key) | Yes | Yes |
| ALWAYS_UNIQUE | uuidv7() | No | No |

Choose STRICT (default) unless you have a specific reason not to. It’s the safest option for preventing accidental duplicate task creation.

Quick Start Guide

Last Updated: 2025-10-10 Audience: Developers Status: Active Time to Complete: 5 minutes Related Docs: Documentation Hub | Use Cases | Crate Architecture

← Back to Documentation Hub


Get Tasker Core Running in 5 Minutes

This guide will get you from zero to running your first workflow in under 5 minutes using Docker Compose.

Prerequisites

Before starting, ensure you have:

  • Docker and Docker Compose installed
  • Git to clone the repository
  • curl for testing (or any HTTP client)

That’s it! Docker Compose handles all the complexity.


Step 1: Clone and Start Services (2 minutes)

# Clone the repository
git clone https://github.com/tasker-systems/tasker-core
cd tasker-core

# Start PostgreSQL (includes PGMQ extension for default messaging)
docker-compose up -d postgres

# Wait for PostgreSQL to be ready (about 10 seconds)
docker-compose logs -f postgres
# Press Ctrl+C when you see "database system is ready to accept connections"

# Set DATABASE_URL and verify the database connection
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
docker-compose exec postgres psql -U tasker -d tasker_rust_test -c "SELECT 1"  # Verify connection

# Start orchestration server and workers
docker-compose --profile server up -d

# Verify all services are healthy
docker-compose ps

You should see:

NAME                     STATUS              PORTS
tasker-postgres          Up (healthy)        5432
tasker-orchestration     Up (healthy)        0.0.0.0:8080->8080/tcp
tasker-worker            Up (healthy)        0.0.0.0:8081->8081/tcp
tasker-ruby-worker       Up (healthy)        0.0.0.0:8082->8082/tcp

Step 2: Verify Services (30 seconds)

Check that all services are responding:

# Check orchestration health
curl http://localhost:8080/health

# Expected response:
# {
#   "status": "healthy",
#   "database": "connected",
#   "message_queue": "operational"
# }

# Check Rust worker health
curl http://localhost:8081/health

# Check Ruby worker health (if started)
curl http://localhost:8082/health

Step 3: Create Your First Task (1 minute)

Now let’s create a simple linear workflow with 4 steps:

# Create a task using the linear_workflow template
curl -X POST http://localhost:8080/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "template_name": "linear_workflow",
    "namespace": "rust_e2e_linear",
    "configuration": {
      "test_value": "hello_world"
    }
  }'

Response:

{
  "task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
  "status": "pending",
  "namespace": "rust_e2e_linear",
  "created_at": "2025-10-10T12:00:00Z"
}

Save the task_uuid from the response! You’ll need it to check the task status.


Step 4: Monitor Task Execution (1 minute)

Watch your workflow execute in real-time:

# Replace {task_uuid} with your actual task UUID
TASK_UUID="01234567-89ab-cdef-0123-456789abcdef"

# Check task status
curl http://localhost:8080/v1/tasks/${TASK_UUID}

Initial Response (task just created):

{
  "task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
  "current_state": "initializing",
  "total_steps": 4,
  "completed_steps": 0,
  "namespace": "rust_e2e_linear"
}

Wait a few seconds and check again:

# Check again after a few seconds
curl http://localhost:8080/v1/tasks/${TASK_UUID}

Final Response (task completed):

{
  "task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
  "current_state": "complete",
  "total_steps": 4,
  "completed_steps": 4,
  "namespace": "rust_e2e_linear",
  "completed_at": "2025-10-10T12:00:05Z",
  "duration_ms": 134
}

Congratulations! 🎉 You’ve just executed your first workflow with Tasker Core!


What Just Happened?

Let’s break down what happened in those ~100-150ms:

1. Orchestration received task creation request
   ↓
2. Task initialized with "linear_workflow" template
   ↓
3. 4 workflow steps created with dependencies:
   - mathematical_add (no dependencies)
   - mathematical_multiply (depends on add)
   - mathematical_subtract (depends on multiply)
   - mathematical_divide (depends on subtract)
   ↓
4. Orchestration discovered step 1 was ready
   ↓
5. Step 1 enqueued to "rust_e2e_linear" namespace queue
   ↓
6. Worker claimed and executed step 1
   ↓
7. Worker sent result back to orchestration
   ↓
8. Orchestration processed result, discovered step 2
   ↓
9. Steps 2, 3, 4 executed sequentially (due to dependencies)
   ↓
10. All steps complete → Task marked "complete"

Key Observations:

  • Each step executed by autonomous workers
  • Steps executed in dependency order automatically
  • Complete workflow: ~130-150ms (including all coordination)
  • All state changes recorded in audit trail

View Detailed Task Information

Get complete task execution details:

# Get full task details including steps
curl http://localhost:8080/v1/tasks/${TASK_UUID}/details

Response includes:

{
  "task": {
    "task_uuid": "...",
    "current_state": "complete",
    "namespace": "rust_e2e_linear"
  },
  "steps": [
    {
      "name": "mathematical_add",
      "current_state": "complete",
      "result": {"value": 15},
      "duration_ms": 12
    },
    {
      "name": "mathematical_multiply",
      "current_state": "complete",
      "result": {"value": 30},
      "duration_ms": 8
    },
    // ... remaining steps
  ],
  "state_transitions": [
    {
      "from_state": null,
      "to_state": "pending",
      "timestamp": "2025-10-10T12:00:00.000Z"
    },
    {
      "from_state": "pending",
      "to_state": "initializing",
      "timestamp": "2025-10-10T12:00:00.050Z"
    },
    // ... complete transition history
  ]
}

Try a More Complex Workflow

Now try the diamond workflow pattern (parallel execution):

curl -X POST http://localhost:8080/v1/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "template_name": "diamond_workflow",
    "namespace": "rust_e2e_diamond",
    "configuration": {
      "test_value": "parallel_test"
    }
  }'

Diamond pattern:

        step_1 (root)
       /            \
   step_2          step_3    ← Execute in PARALLEL
       \            /
        step_4 (join)

Steps 2 and 3 execute simultaneously because they both depend only on step 1!


View Logs

See what’s happening inside the services:

# Orchestration logs
docker-compose logs -f orchestration

# Worker logs
docker-compose logs -f worker

# All logs
docker-compose logs -f

Key log patterns to look for:

  • Task initialized: task_uuid=... - Task created
  • Step enqueued: step_uuid=... - Step sent to worker
  • Step claimed: step_uuid=... - Worker picked up step
  • Step completed: step_uuid=... - Step finished successfully
  • Task finalized: task_uuid=... - Workflow complete

Explore the API

List All Tasks

curl http://localhost:8080/v1/tasks

Get Task Execution Context

curl http://localhost:8080/v1/tasks/${TASK_UUID}/context

View Available Templates

curl http://localhost:8080/v1/templates

Check System Health

curl http://localhost:8080/health/detailed

Next Steps

1. Understand What You Just Built

Read about the architecture:

2. See Real-World Examples

Explore practical use cases:

  • Use Cases and Patterns - E-commerce, payments, ETL, microservices
  • See example templates in: tests/fixtures/task_templates/

3. Create Your Own Workflow

Option A: Rust Handler (Native Performance)

#![allow(unused)]
fn main() {
// workers/rust/src/handlers/my_handler.rs
pub struct MyCustomHandler;

#[async_trait]
impl StepHandler for MyCustomHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        // Your business logic here
        let input: String = context.configuration.get("input")?;

        let result = process_data(&input).await?;

        Ok(StepResult::success(json!({
            "output": result
        })))
    }
}
}

Option B: Ruby Handler (via FFI)

# workers/ruby/app/tasker/tasks/templates/my_workflow/handlers/my_handler.rb
class MyHandler < TaskerCore::StepHandler
  def execute(context)
    input = context.configuration['input']

    result = process_data(input)

    { success: true, output: result }
  end
end

Define Your Workflow Template

# tests/fixtures/task_templates/rust/my_workflow.yaml
namespace: my_namespace
name: my_workflow
version: "1.0"

steps:
  - name: my_step
    handler: my_handler
    dependencies: []
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      backoff_base_ms: 1000

4. Deploy to Production

Learn about deployment:

5. Run Tests Locally

# Build the workspace
cargo build --all-features

# Run all tests
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo test --all-features

# Run benchmarks
cargo bench --all-features

Troubleshooting

Services Won’t Start

# Check Docker service status
docker-compose ps

# View service logs
docker-compose logs postgres
docker-compose logs orchestration

# Restart services
docker-compose restart

# Clean restart
docker-compose down
docker-compose up -d

Task Stays in “pending” or “initializing”

Possible causes:

  1. Template not found - Check available templates: curl http://localhost:8080/v1/templates
  2. Worker not running - Check worker status: curl http://localhost:8081/health
  3. Database connection issue - Check logs: docker-compose logs postgres

Solution:

# Verify template exists
curl http://localhost:8080/v1/templates | jq '.[] | select(.name == "linear_workflow")'

# Restart workers
docker-compose restart worker

# Check orchestration logs for errors
docker-compose logs orchestration | grep ERROR

“Connection refused” Errors

Cause: Services not fully started yet

Solution: Wait 10-15 seconds after docker-compose up, then check health:

curl http://localhost:8080/health

PostgreSQL Connection Issues

# Verify PostgreSQL is running
docker-compose ps postgres

# Test connection
docker-compose exec postgres psql -U tasker -d tasker_rust_test -c "SELECT 1"

# View PostgreSQL logs
docker-compose logs postgres | tail -50

Cleanup

When you’re done exploring:

# Stop all services
docker-compose down

# Stop and remove volumes (cleans database)
docker-compose down -v

# Remove all Docker resources (complete cleanup)
docker-compose down -v
docker system prune -f

Summary

You’ve successfully:

  • ✅ Started Tasker Core services with Docker Compose
  • ✅ Created and executed a linear workflow
  • ✅ Monitored task execution in real-time
  • ✅ Viewed detailed task and step information
  • ✅ Explored the REST API

Total time: ~5 minutes from zero to working workflow! 🚀


Getting Help


← Back to Documentation Hub

Next: Use Cases and Patterns | Crate Architecture

Retry Semantics: max_attempts and retryable

Last Updated: 2025-10-10 Audience: Developers Status: Active Related Docs: Documentation Hub | Bug Report: Retry Eligibility Logic | States and Lifecycles

← Back to Documentation Hub


Overview

The Tasker orchestration system uses two configuration fields to control step execution and retry behavior:

  1. max_attempts: Maximum number of total execution attempts (including first execution)
  2. retryable: Whether the step can be retried after failure

Semantic Definitions

max_attempts

Definition: The maximum number of times a step can be attempted, including the first execution.

This is NOT “number of retries” - it’s total attempts:

  • max_attempts=0: Likely a configuration error - the first execution is still allowed (attempts = 0 is always eligible), but no retries follow
  • max_attempts=1: Exactly one attempt (no retries after failure)
  • max_attempts=3: First attempt + up to 2 retries = 3 total attempts

Implementation: SQL formula attempts < max_attempts where attempts starts at 0.

retryable

Definition: Whether a step can be retried after the first execution fails.

Important: The retryable flag does NOT affect the first execution attempt:

  • First execution (attempts=0): Always eligible regardless of retryable setting
  • Retry attempts (attempts>0): Require retryable=true

Configuration Examples

Single Execution, No Retries

retry:
  retryable: false
  max_attempts: 1  # First attempt only
  backoff: exponential

Behavior:

| attempts | retry_eligible | Outcome |
|---|---|---|
| 0 | ✅ true | First execution allowed |
| 1 | ❌ false | No retries (retryable=false) |

Use Case: Idempotent operations that should not retry (e.g., record creation with unique constraints)

Multiple Attempts with Retries

retry:
  retryable: true
  max_attempts: 3  # First attempt + 2 retries
  backoff: exponential
  backoff_base_ms: 1000

Behavior:

| attempts | retry_eligible | Outcome |
|---|---|---|
| 0 | ✅ true | First execution allowed |
| 1 | ✅ true | First retry allowed (1 < 3) |
| 2 | ✅ true | Second retry allowed (2 < 3) |
| 3 | ❌ false | Max attempts exhausted (3 >= 3) |

Use Case: External API calls that might have transient failures

Effectively Unlimited Retries

retry:
  retryable: true
  max_attempts: 999999
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 300000  # Cap at 5 minutes

Behavior: Will retry until external intervention (task cancellation, system restart)

Use Case: Critical operations that must eventually succeed (use with caution!)

Retry Eligibility Logic

SQL Implementation

From migrations/20251006000000_fix_retry_eligibility_logic.sql:

-- retry_eligible calculation
(
  COALESCE(ws.attempts, 0) = 0  -- First attempt always eligible
  OR (
    COALESCE(ws.retryable, true) = true  -- Must be retryable for retries
    AND COALESCE(ws.attempts, 0) < COALESCE(ws.max_attempts, 3)
  )
) as retry_eligible

Decision Tree

Is attempts = 0?
├─ YES → retry_eligible = TRUE (first execution)
└─ NO  → Is retryable = true?
    ├─ YES → Is attempts < max_attempts?
    │   ├─ YES → retry_eligible = TRUE (retry allowed)
    │   └─ NO  → retry_eligible = FALSE (max attempts exhausted)
    └─ NO  → retry_eligible = FALSE (retries disabled)
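The same logic can be expressed as a small predicate. This is an illustrative sketch that mirrors the SQL expression above; the function name is hypothetical, not part of the engine's API.

fn retry_eligible(attempts: u32, retryable: bool, max_attempts: u32) -> bool {
    // First attempt is always eligible; retries require retryable=true and remaining headroom.
    attempts == 0 || (retryable && attempts < max_attempts)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn boundary_conditions() {
        assert!(retry_eligible(0, false, 1));  // first execution ignores retryable
        assert!(!retry_eligible(1, false, 3)); // retries disabled
        assert!(retry_eligible(2, true, 3));   // 2 < 3, retry allowed
        assert!(!retry_eligible(3, true, 3));  // max attempts exhausted
        assert!(retry_eligible(0, true, 0));   // max_attempts=0 still allows the first execution
    }
}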

Edge Cases

max_attempts=0

retry:
  max_attempts: 0

Behavior: First execution is still allowed (attempts=0 is always eligible), but no attempts after that (0 < 0 is false), so the step can never retry

Status: ⚠️ Configuration error - likely unintended

Recommendation: Use max_attempts: 1 for single execution

retryable=false with max_attempts > 1

retry:
  retryable: false
  max_attempts: 3  # Only first attempt will execute

Behavior: First execution allowed, but no retries regardless of max_attempts

Effective Result: Same as max_attempts: 1

Recommendation: Set max_attempts: 1 when retryable: false for clarity

Historical Context

Why “max_attempts” instead of “retry_limit”?

The original field name retry_limit was semantically confusing:

Old Interpretation (incorrect):

  • retry_limit=1 → “1 retry allowed” → 2 total attempts?
  • retry_limit=0 → “0 retries” → 1 attempt or blocked?

New Interpretation (clear):

  • max_attempts=1 → “1 total attempt” → exactly 1 execution
  • max_attempts=0 → “0 attempts” → clearly invalid

Migration Timeline

  • Original: retry_limit field with ambiguous semantics
  • 2025-10-05: Bug discovered - retry_limit=0 blocked all execution
  • 2025-10-06: Fixed SQL logic + renamed to max_attempts
  • 2025-10-06: Added 6 SQL boundary tests for edge cases

Testing

Boundary Condition Tests

See tests/integration/sql_functions/retry_boundary_tests.rs for comprehensive coverage:

  1. test_max_attempts_zero_allows_first_execution - Edge case handling
  2. test_max_attempts_zero_blocks_after_first - Exhaustion after first
  3. test_max_attempts_one_semantics - Single execution semantics
  4. test_max_attempts_three_progression - Standard retry progression
  5. test_first_attempt_ignores_retryable_flag - First execution independence
  6. test_retries_require_retryable_true - Retry flag enforcement

All tests passing as of 2025-10-06.

Best Practices

For Single-Execution Steps

retry:
  retryable: false
  max_attempts: 1
  backoff: exponential  # Ignored, but required for schema

Why: Makes intent crystal clear - execute once, never retry

For Transient Failure Tolerance

retry:
  retryable: true
  max_attempts: 3
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 30000

Why: Reasonable retry count with exponential backoff prevents thundering herd

For Critical Operations

retry:
  retryable: true
  max_attempts: 10
  backoff: exponential
  backoff_base_ms: 5000
  max_backoff_ms: 300000  # 5 minutes

Why: More attempts with longer backoff for operations that must succeed
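For intuition, exponential backoff with a ceiling doubles the delay per attempt until it hits the cap. The sketch below is illustrative only; the engine's exact formula (and any jitter it applies) may differ, and the function name is hypothetical.

fn backoff_delay_ms(attempt: u32, base_ms: u64, max_backoff_ms: u64) -> u64 {
    // attempt 1 -> base, attempt 2 -> base*2, attempt 3 -> base*4, ... capped at max_backoff_ms
    let exponent = attempt.saturating_sub(1).min(20); // clamp to avoid shift overflow
    base_ms.saturating_mul(1u64 << exponent).min(max_backoff_ms)
}

// With backoff_base_ms=5000 and max_backoff_ms=300000:
// attempts 1-6 wait 5s, 10s, 20s, 40s, 80s, 160s; from attempt 7 onward the delay is capped at 300s.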


Questions or Issues? See the test suite for comprehensive examples, or consult the bug report for historical context.

Use Cases and Patterns

Last Updated: 2025-10-10 Audience: Developers, Architects, Product Managers Status: Active Related Docs: Documentation Hub | Quick Start | Crate Architecture

← Back to Documentation Hub


Overview

This guide provides practical examples of when and how to use Tasker Core for workflow orchestration. Each use case includes architectural patterns, example workflows, and implementation guidance based on real-world scenarios.


Table of Contents

  1. E-Commerce Order Fulfillment
  2. Payment Processing Pipeline
  3. Data Transformation ETL
  4. Microservices Orchestration
  5. Scheduled Job Coordination
  6. Conditional Workflows and Decision Points
  7. Anti-Patterns

E-Commerce Order Fulfillment

Problem Statement

An e-commerce platform needs to coordinate multiple steps when processing orders:

  • Validate order details and inventory
  • Reserve inventory and process payment (parallel)
  • Ship order after both payment and inventory confirmed
  • Send confirmation emails
  • Handle failures gracefully with retries

Why Tasker Core?

  • Complex Dependencies: Steps have clear dependency relationships
  • Parallel Execution: Payment and inventory can happen simultaneously
  • Retry Logic: External API calls need retry with backoff
  • Audit Trail: Complete history needed for compliance
  • Idempotency: Steps must handle duplicate executions safely

Workflow Structure

Task: order_fulfillment_#{order_id}
  Priority: Based on order value and customer tier
  Namespace: fulfillment

  Steps:
    1. validate_order
       - Handler: ValidateOrderHandler
       - Dependencies: None (root step)
       - Retry: retryable=true, max_attempts=3
       - Validates order data, checks fraud

    2. check_inventory
       - Handler: InventoryCheckHandler
       - Dependencies: validate_order (must complete)
       - Retry: retryable=true, max_attempts=5
       - Queries inventory system

    3. reserve_inventory
       - Handler: InventoryReservationHandler
       - Dependencies: check_inventory
       - Retry: retryable=true, max_attempts=3
       - Reserves stock with timeout

    4. process_payment
       - Handler: PaymentProcessingHandler
       - Dependencies: validate_order
       - Retry: retryable=true, max_attempts=3
       - Charges customer payment method
       - **Runs in parallel with reserve_inventory**

    5. ship_order
       - Handler: ShippingHandler
       - Dependencies: reserve_inventory AND process_payment
       - Retry: retryable=false, max_attempts=1
       - Creates shipping label, schedules pickup

    6. send_confirmation
       - Handler: EmailNotificationHandler
       - Dependencies: ship_order
       - Retry: retryable=true, max_attempts=10
       - Sends confirmation email to customer

Implementation Pattern

Task Template (YAML configuration):

namespace: fulfillment
name: order_fulfillment
version: "1.0"

steps:
  - name: validate_order
    handler: validate_order
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      backoff_base_ms: 1000

  - name: check_inventory
    handler: check_inventory
    dependencies:
      - validate_order
    retry:
      retryable: true
      max_attempts: 5
      backoff: exponential
      backoff_base_ms: 2000

  # ... remaining steps

Step Handler (Rust implementation):

#![allow(unused)]
fn main() {
pub struct ValidateOrderHandler;

#[async_trait]
impl StepHandler for ValidateOrderHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        // Extract order data from context
        let order_id: String = context.configuration.get("order_id")?;
        let customer_id: String = context.configuration.get("customer_id")?;

        // Validate order
        let order = validate_order_data(&order_id).await?;

        // Check fraud detection
        if check_fraud_risk(&customer_id, &order).await? {
            return Ok(StepResult::permanent_failure(
                "fraud_detected",
                json!({"reason": "High fraud risk"})
            ));
        }

        // Success - pass data to next steps
        Ok(StepResult::success(json!({
            "order_id": order_id,
            "validated_at": Utc::now(),
            "total_amount": order.total
        })))
    }
}
}

Ruby Handler Alternative:

class ProcessPaymentHandler < TaskerCore::StepHandler
  def execute(context)
    order_id = context.configuration['order_id']
    amount = context.configuration['amount']

    # Process payment via payment gateway
    result = PaymentGateway.charge(
      amount: amount,
      idempotency_key: context.step_uuid
    )

    if result.success?
      { success: true, transaction_id: result.transaction_id }
    else
      # Retryable failure with backoff
      { success: false, retryable: true, error: result.error }
    end
  rescue PaymentGateway::NetworkError => e
    # Transient error, retry
    { success: false, retryable: true, error: e.message }
  rescue PaymentGateway::CardDeclined => e
    # Permanent failure, don't retry
    { success: false, retryable: false, error: e.message }
  end
end

Key Patterns

1. Parallel Execution

  • reserve_inventory and process_payment both depend only on earlier steps
  • Tasker automatically executes them in parallel
  • ship_order waits for both to complete

2. Idempotent Handlers

  • Use step_uuid as idempotency key for external APIs
  • Check if operation already completed before retrying
  • Handle duplicate executions gracefully

3. Smart Retry Logic

  • Network errors → retryable with exponential backoff
  • Business logic failures → permanent, no retry
  • Configure max_attempts based on criticality

4. Data Flow

  • Early steps provide data to later steps via results
  • Access parent results: context.parent_results["validate_order"]
  • Build context as workflow progresses
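For example, a downstream handler can read an upstream step's output before doing its own work. The sketch below follows the shapes used in the examples in this guide (StepContext, parent_results, StepResult); it is illustrative rather than a verbatim engine API, and create_shipping_label is a hypothetical helper.

async fn ship_order(context: &StepContext) -> Result<StepResult> {
    // Pull the data produced by the validate_order step
    let validation = context.parent_results.get("validate_order")
        .ok_or("missing validate_order result")?;

    let order_id = validation["order_id"].as_str().unwrap_or_default();
    let total = validation["total_amount"].as_f64().unwrap_or(0.0);

    // create_shipping_label is a placeholder for your shipping integration
    let tracking_number = create_shipping_label(order_id, total).await?;

    Ok(StepResult::success(json!({
        "order_id": order_id,
        "tracking_number": tracking_number
    })))
}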

Observability

Monitor these metrics for order fulfillment:

#![allow(unused)]
fn main() {
// Track order processing stages
metrics::counter!("orders.validated").increment(1);
metrics::counter!("orders.payment_processed").increment(1);
metrics::counter!("orders.shipped").increment(1);

// Track failures by reason
metrics::counter!("orders.failed", "reason" => "fraud").increment(1);
metrics::counter!("orders.failed", "reason" => "inventory").increment(1);

// Track timing
metrics::histogram!("order.fulfillment_time_ms").record(elapsed_ms);
}

Payment Processing Pipeline

Problem Statement

A fintech platform needs to process payments with strict requirements:

  • Multiple payment methods (card, bank transfer, wallet)
  • Regulatory compliance and audit trails
  • Automatic retry for transient failures
  • Reconciliation with accounting system
  • Webhook notifications to customers

Why Tasker Core?

  • Compliance: Complete audit trail with state transitions
  • Reliability: Automatic retry with configurable limits
  • Observability: Detailed metrics for financial operations
  • Idempotency: Prevent duplicate charges
  • Flexibility: Support multiple payment flows

Workflow Structure

Task: payment_processing_#{payment_id}
  Namespace: payments
  Priority: High (financial operations)

  Steps:
    1. validate_payment_request
       - Verify payment details
       - Check account status
       - Validate payment method

    2. check_fraud
       - Run fraud detection
       - Verify transaction limits
       - Check velocity rules

    3. authorize_payment
       - Contact payment gateway
       - Reserve funds (authorization hold)
       - Return authorization code

    4. capture_payment (depends on authorize_payment)
       - Capture authorized funds
       - Handle settlement
       - Generate receipt

    5. record_transaction (depends on capture_payment)
       - Write to accounting ledger
       - Update customer balance
       - Create audit records

    6. send_notification (depends on record_transaction)
       - Send webhook to merchant
       - Send receipt to customer
       - Update payment status

Implementation Highlights

Retry Strategy for Payment Gateway:

#![allow(unused)]
fn main() {
impl StepHandler for AuthorizePaymentHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        let payment_id = context.configuration.get("payment_id")?;

        match gateway.authorize(payment_id, &context.step_uuid).await {
            Ok(auth) => {
                Ok(StepResult::success(json!({
                    "authorization_code": auth.code,
                    "authorized_at": Utc::now(),
                    "gateway_transaction_id": auth.transaction_id
                })))
            }

            Err(GatewayError::NetworkTimeout) => {
                // Transient - retry with backoff
                Ok(StepResult::retryable_failure(
                    "network_timeout",
                    json!({"retry_recommended": true})
                ))
            }

            Err(GatewayError::InsufficientFunds) => {
                // Permanent - don't retry
                Ok(StepResult::permanent_failure(
                    "insufficient_funds",
                    json!({"requires_manual_intervention": false})
                ))
            }

            Err(GatewayError::InvalidCard) => {
                // Permanent - don't retry
                Ok(StepResult::permanent_failure(
                    "invalid_card",
                    json!({"requires_manual_intervention": true})
                ))
            }
        }
    }
}
}

Idempotency Pattern:

#![allow(unused)]
fn main() {
async fn capture_payment(context: &StepContext) -> Result<StepResult> {
    let idempotency_key = context.step_uuid.to_string();

    // Check if we already captured this payment
    if let Some(existing) = check_existing_capture(&idempotency_key).await? {
        return Ok(StepResult::success(json!({
            "already_captured": true,
            "transaction_id": existing.transaction_id,
            "note": "Idempotent duplicate detected"
        })));
    }

    // Proceed with capture
    let result = gateway.capture(&idempotency_key).await?;

    // Store idempotency record
    store_capture_record(&idempotency_key, &result).await?;

    Ok(StepResult::success(json!(result)))
}
}

Key Patterns

1. Two-Phase Commit

  • Authorize (reserve) → Capture (settle)
  • Allows cancellation between phases
  • Common in payment processing

2. Audit Trail

  • Every state transition recorded
  • Regulatory compliance built-in
  • Forensic investigation support

3. Circuit Breaking

  • Protect against payment gateway failures
  • Automatic backoff when gateway degraded
  • Fallback to alternate gateways

Data Transformation ETL

Problem Statement

A data analytics platform needs to process data through multiple transformation stages:

  • Extract data from multiple sources (APIs, databases, files)
  • Transform data (clean, enrich, aggregate)
  • Load to data warehouse
  • Handle large datasets with partitioning
  • Retry transient failures, skip corrupted data

Why Tasker Core?

  • DAG Execution: Complex transformation pipelines
  • Parallel Processing: Independent partitions processed concurrently
  • Error Handling: Skip corrupted records, retry transient failures
  • Observability: Track data quality and processing metrics
  • Scheduling: Integrate with cron/scheduler for periodic runs

Workflow Structure

Task: etl_customer_data_#{date}
  Namespace: data_pipeline

  Steps:
    1. extract_customer_profiles
       - Fetch from customer database
       - Partition by customer_id ranges
       - Creates multiple output partitions

    2. extract_transaction_history
       - Fetch from transactions database
       - Runs in parallel with extract_customer_profiles
       - Time-based partitioning

    3. enrich_customer_data (depends on extract_customer_profiles)
       - Add demographic data from external API
       - Process partitions in parallel
       - Each partition is independent

    4. join_transactions (depends on enrich_customer_data, extract_transaction_history)
       - Join enriched profiles with transactions
       - Aggregate metrics per customer
       - Parallel processing per partition

    5. load_to_warehouse (depends on join_transactions)
       - Bulk load to data warehouse
       - Verify data quality
       - Update metadata tables

    6. generate_summary_report (depends on load_to_warehouse)
       - Generate processing statistics
       - Send notification with summary
       - Archive source files

Implementation Pattern

Partition-Based Processing:

#![allow(unused)]
fn main() {
pub struct ExtractCustomerProfilesHandler;

#[async_trait]
impl StepHandler for ExtractCustomerProfilesHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        let date: String = context.configuration.get("processing_date")?;

        // Determine partitions (e.g., by customer_id ranges)
        let partitions = calculate_partitions(1000000, 100000)?; // 10 partitions

        // Extract data for each partition
        let mut partition_files = Vec::new();
        for partition in partitions {
            let filename = extract_partition(&date, partition).await?;
            partition_files.push(filename);
        }

        // Return partition info for downstream steps
        Ok(StepResult::success(json!({
            "partitions": partition_files,
            "total_records": 1000000,
            "extracted_at": Utc::now()
        })))
    }
}
}

Error Handling for Data Quality:

#![allow(unused)]
fn main() {
async fn enrich_customer_data(context: &StepContext) -> Result<StepResult> {
    let partition_file: String = context.configuration.get("partition_file")?;

    let mut processed = 0;
    let mut skipped = 0;
    let mut errors = Vec::new();

    for record in read_partition(&partition_file).await? {
        match enrich_record(record).await {
            Ok(enriched) => {
                write_enriched(enriched).await?;
                processed += 1;
            }
            Err(EnrichmentError::MalformedData(e)) => {
                // Skip corrupted record, continue processing
                skipped += 1;
                errors.push(format!("Skipped record: {}", e));
            }
            Err(EnrichmentError::ApiTimeout(e)) => {
                // Transient failure, retry entire step
                return Ok(StepResult::retryable_failure(
                    "api_timeout",
                    json!({"error": e.to_string()})
                ));
            }
        }
    }

    if skipped as f64 / processed as f64 > 0.1 {
        // Too many skipped records
        return Ok(StepResult::permanent_failure(
            "data_quality_issue",
            json!({
                "processed": processed,
                "skipped": skipped,
                "error_rate": skipped as f64 / processed as f64
            })
        ));
    }

    Ok(StepResult::success(json!({
        "processed": processed,
        "skipped": skipped,
        "errors": errors
    })))
}
}

Key Patterns

1. Partition-Based Parallelism

  • Split large datasets into partitions
  • Process partitions independently
  • Aggregate results in final step

2. Graceful Degradation

  • Skip corrupted individual records
  • Continue processing remaining data
  • Report data quality issues

3. Monitoring Data Quality

  • Track record counts through pipeline
  • Alert on unexpected error rates
  • Validate schema at boundaries

Microservices Orchestration

Problem Statement

Coordinate operations across multiple microservices:

  • User registration flow (auth, profile, notifications, analytics)
  • Distributed transactions with compensation
  • Service dependency management
  • Timeout and circuit breaking

Why Tasker Core?

  • Service Coordination: Orchestrate distributed operations
  • Saga Pattern: Implement compensation for failures
  • Resilience: Circuit breakers and timeouts
  • Observability: End-to-end tracing with correlation IDs
  • Flexibility: Handle heterogeneous service protocols

Workflow Structure (User Registration Example)

Task: user_registration_#{user_id}
  Namespace: user_onboarding

  Steps:
    1. create_auth_account
       - Call auth service to create account
       - Generate user credentials
       - Store authentication tokens

    2. create_user_profile (depends on create_auth_account)
       - Call profile service
       - Initialize user preferences
       - Set default settings

    3. setup_notification_preferences (depends on create_user_profile)
       - Call notification service
       - Configure email preferences
       - Set up push notifications

    4. track_user_signup (depends on create_user_profile)
       - Call analytics service
       - Record signup event
       - Runs in parallel with setup_notification_preferences

    5. send_welcome_email (depends on setup_notification_preferences)
       - Send welcome email
       - Provide onboarding links
       - Track email delivery

  Compensation Steps (on failure):
    - If create_user_profile fails → delete_auth_account
    - If any step fails after profile → deactivate_user

Implementation Pattern (Saga with Compensation)

#![allow(unused)]
fn main() {
pub struct CreateUserProfileHandler;

#[async_trait]
impl StepHandler for CreateUserProfileHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        let user_id: String = context.configuration.get("user_id")?;
        let email: String = context.configuration.get("email")?;

        // Get auth details from previous step
        let auth_result = context.parent_results.get("create_auth_account")
            .ok_or("Missing auth result")?;
        let auth_token: String = auth_result.get("auth_token")?;

        // Call profile service
        match profile_service.create_profile(&user_id, &email, &auth_token).await {
            Ok(profile) => {
                Ok(StepResult::success(json!({
                    "profile_id": profile.id,
                    "created_at": profile.created_at,
                    "user_id": user_id
                })))
            }

            Err(ProfileServiceError::DuplicateEmail) => {
                // Permanent failure - email already exists
                // Trigger compensation
                Ok(StepResult::permanent_failure_with_compensation(
                    "duplicate_email",
                    json!({"email": email}),
                    vec!["delete_auth_account"] // Compensation steps
                ))
            }

            Err(ProfileServiceError::ServiceUnavailable) => {
                // Transient - retry
                Ok(StepResult::retryable_failure(
                    "service_unavailable",
                    json!({"retry_recommended": true})
                ))
            }
        }
    }
}
}

Compensation Handler:

#![allow(unused)]
fn main() {
pub struct DeleteAuthAccountHandler;

#[async_trait]
impl StepHandler for DeleteAuthAccountHandler {
    async fn execute(&self, context: StepContext) -> Result<StepResult> {
        let user_id: String = context.configuration.get("user_id")?;

        // Best-effort deletion
        match auth_service.delete_account(&user_id).await {
            Ok(_) => {
                Ok(StepResult::success(json!({
                    "compensated": true,
                    "user_id": user_id
                })))
            }
            Err(e) => {
                // Log error but don't fail - compensation is best-effort
                warn!("Compensation failed for user {}: {}", user_id, e);
                Ok(StepResult::success(json!({
                    "compensated": false,
                    "error": e.to_string(),
                    "requires_manual_cleanup": true
                })))
            }
        }
    }
}
}

Key Patterns

1. Correlation IDs

  • Pass correlation_id through all services
  • Enable end-to-end tracing
  • Simplify debugging distributed issues (see the sketch after this list)

2. Compensation (Saga Pattern)

  • Define compensation steps for cleanup
  • Execute on permanent failures
  • Best-effort execution, log failures

3. Service Circuit Breakers

  • Wrap service calls in circuit breakers
  • Fail fast when services degraded
  • Automatic recovery detection
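A minimal sketch of correlation ID propagation, assuming the ID is available to the handler and using reqwest purely for illustration (the service URL, header name, and function are hypothetical):

use reqwest::Client;

async fn call_profile_service(
    client: &Client,
    correlation_id: &str,
    user_id: &str,
) -> Result<reqwest::Response, reqwest::Error> {
    // Forward the same correlation ID on every outbound call so traces line up end to end
    client
        .get(format!("https://profile-service.internal/users/{user_id}"))
        .header("X-Correlation-ID", correlation_id)
        .send()
        .await
}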

Scheduled Job Coordination

Problem Statement

Run periodic jobs with dependencies:

  • Daily report generation (depends on data refresh)
  • Scheduled data backups (depends on maintenance window)
  • Cleanup jobs (depends on retention policies)

Why Tasker Core?

  • Dependency Management: Jobs run in correct order
  • Failure Handling: Automatic retry of failed jobs
  • Observability: Track job execution history
  • Flexibility: Dynamic scheduling based on results

Implementation Pattern

#![allow(unused)]
fn main() {
// External scheduler (cron, Kubernetes CronJob, etc.) creates tasks
pub async fn schedule_daily_reports() -> Result<Uuid> {
    let client = OrchestrationClient::new("http://orchestration:8080").await?;

    let task_request = TaskRequest {
        template_name: "daily_reporting".to_string(),
        namespace: "scheduled_jobs".to_string(),
        configuration: json!({
            "report_date": Utc::now().format("%Y-%m-%d").to_string(),
            "report_types": ["sales", "inventory", "customer_activity"]
        }),
        priority: 5, // Normal priority
    };

    let response = client.create_task(task_request).await?;
    Ok(response.task_uuid)
}
}

Conditional Workflows and Decision Points

Problem Statement

Many workflows require runtime decision-making where the execution path depends on business logic evaluated at runtime:

  • Approval routing based on request amount or risk level
  • Tiered processing based on customer status
  • Compliance checks varying by jurisdiction
  • Dynamic resource allocation based on workload

Why Use Decision Points?

Traditional Approach (Static DAG):

# Must define ALL possible paths upfront
Steps:
  - validate
  - route_A  # Always created
  - route_B  # Always created
  - route_C  # Always created
  - converge # Must handle all paths

Decision Point Approach (Dynamic DAG):

# Create ONLY the needed path at runtime
Steps:
  - validate
  - routing_decision  # Decides which path
  - route_A           # Created dynamically if needed
  - route_B           # Created dynamically if needed
  - route_C           # Created dynamically if needed
  - converge          # Uses intersection semantics

Benefits

  • Efficiency: Only execute steps actually needed
  • Clarity: Workflow reflects actual business logic
  • Cost Savings: Reduce API calls, processing time, and resource usage
  • Flexibility: Add new paths without changing core logic

Core Pattern

Task: conditional_approval
  Steps:
    1. validate_request       # Regular step
    2. routing_decision       # Decision point (type: decision_point)
       → Evaluates business logic
       → Returns: CreateSteps(['manager_approval']) or NoBranches
    3. auto_approve          # Might be created
    4. manager_approval      # Might be created
    5. finance_review        # Might be created
    6. finalize_approval     # Convergence (type: deferred)
       → Waits for intersection of dependencies

Example: Amount-Based Approval Routing

class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
  def call(context)
    amount = context.get_task_field('amount')

    # Business logic determines which steps to create
    steps = if amount < 1_000
      ['auto_approve']
    elsif amount < 5_000
      ['manager_approval']
    else
      ['manager_approval', 'finance_review']
    end

    # Return decision outcome
    decision_success(
      steps: steps,
      result_data: {
        route_type: determine_route_type(amount),
        amount: amount
      }
    )
  end
end

Real-World Scenarios

1. E-Commerce Returns Processing

  • Low-value returns: Auto-approve
  • Medium-value: Manager review
  • High-value or suspicious: Fraud investigation + manager review

2. Financial Risk Assessment

  • Low-risk transactions: Standard processing
  • Medium-risk: Additional verification
  • High-risk: Manual review + compliance checks + legal review

3. Healthcare Prior Authorization

  • Standard procedures: Auto-approve
  • Specialized care: Medical director review
  • Experimental treatments: Medical director + insurance review + compliance

4. Customer Support Escalation

  • Simple issues: Tier 1 resolution
  • Complex issues: Tier 2 specialist
  • VIP customers: Immediate senior support + account manager notification

Key Features

Decision Point Steps:

  • Special step type that returns DecisionPointOutcome
  • Can return NoBranches (no additional steps) or CreateSteps (list of step names)
  • Fully atomic - either all steps created or none
  • Supports nested decisions (configurable depth limit)

Deferred Steps:

  • Use intersection semantics for dependencies
  • Wait for: (declared dependencies) ∩ (actually created steps)
  • Enable convergence regardless of path taken

Type-Safe Implementation:

  • Ruby: TaskerCore::StepHandler::Decision base class
  • Rust: DecisionPointOutcome enum with serde support
  • Automatic validation and serialization
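The outcome shape described above can be pictured as a small serde enum. This is a sketch of the idea, not the exact definition in tasker-core; the tag and variant names are illustrative, chosen to match the decision_point_outcome: { type, step_names } result shape.

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", content = "step_names")]
pub enum DecisionPointOutcome {
    /// No additional steps are needed on this path
    NoBranches,
    /// Create these steps dynamically (atomic: all created or none)
    CreateSteps(Vec<String>),
}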

Implementation

See the complete guide: Conditional Workflows and Decision Points

Covers:

  • When to use conditional workflows
  • YAML configuration
  • Ruby and Rust implementation patterns
  • Simple and complex examples
  • Best practices and limitations

Anti-Patterns

❌ Don’t Use Tasker Core For:

1. Simple Cron Jobs

# ❌ Anti-pattern: Single-step scheduled job
Task: send_daily_email
  Steps:
    - send_email  # No dependencies, no retry needed

Why: Overhead not justified. Use native cron or systemd timers.

2. Real-Time Sub-Millisecond Operations

# ❌ Anti-pattern: High-frequency trading
Task: execute_trade_#{microseconds}
  Steps:
    - check_price   # Needs <1ms latency
    - execute_order

Why: Architectural overhead (~10-20ms) too high. Use in-memory queues or direct service calls.

3. Pure Fan-Out

# ❌ Anti-pattern: Simple message broadcasting
Task: broadcast_notification
  Steps:
    - send_to_user_1
    - send_to_user_2
    - send_to_user_3
    # ... 1000s of independent steps

Why: Use message bus (Kafka, RabbitMQ) for pub/sub patterns. Tasker is for orchestration, not broadcasting.

4. Stateless Single Operations

# ❌ Anti-pattern: Single API call with no retry
Task: fetch_user_data
  Steps:
    - call_api  # No dependencies, no state management needed

Why: Direct API call with client-side retry is simpler.


Pattern Selection Guide

| Characteristic | Use Tasker Core? | Alternative |
|---|---|---|
| Multiple dependent steps | ✅ Yes | N/A |
| Parallel execution needed | ✅ Yes | Thread pools for simple cases |
| Retry logic required | ✅ Yes | Client-side retry libraries |
| Audit trail needed | ✅ Yes | Append-only logs |
| Single step, no retry | ❌ No | Direct function call |
| Sub-second latency required | ❌ No | In-memory queues |
| Pure broadcast/fan-out | ❌ No | Message bus (Kafka, etc.) |
| Simple scheduled job | ❌ No | Cron, systemd timers |


← Back to Documentation Hub

Worker Crates Overview

Last Updated: 2025-12-27 Audience: Developers, Architects, Operators Status: Active Related Docs: Worker Event Systems | Worker Actors

← Back to Documentation Hub


The tasker-core workspace provides four worker implementations for executing workflow step handlers. Each implementation targets different deployment scenarios and developer ecosystems while sharing the same core Rust foundation.

Quick Navigation

| Document | Description |
|---|---|
| API Convergence Matrix | Quick reference for aligned APIs across languages |
| Example Handlers | Side-by-side handler examples |
| Patterns and Practices | Common patterns across all workers |
| Rust Worker | Native Rust implementation |
| Ruby Worker | Ruby gem for Rails integration |
| Python Worker | Python package for data pipelines |
| TypeScript Worker | TypeScript/JS for Bun/Node/Deno |

Overview

Four Workers, One Foundation

All workers share the same Rust core (tasker-worker crate) for orchestration, queueing, and state management. The language-specific workers add handler execution in their respective runtimes.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           WORKER ARCHITECTURE                                 │
└─────────────────────────────────────────────────────────────────────────────┘

                              PostgreSQL + PGMQ
                                      │
                                      ▼
                    ┌─────────────────────────────┐
                    │   Rust Core (tasker-worker) │
                    │   ─────────────────────────│
                    │   • Queue Management        │
                    │   • State Machines          │
                    │   • Orchestration           │
                    │   • Actor System            │
                    └─────────────────────────────┘
                                      │
          ┌───────────────┬───────────┼───────────┬───────────────┐
          │               │           │           │               │
          ▼               ▼           ▼           ▼               ▼
    ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌─────────────┐
    │   Rust    │   │   Ruby    │   │  Python   │   │ TypeScript  │
    │  Worker   │   │  Worker   │   │  Worker   │   │   Worker    │
    │───────────│   │───────────│   │───────────│   │─────────────│
    │ Native    │   │ FFI Bridge│   │ FFI Bridge│   │ FFI Bridge  │
    │ Handlers  │   │ + Gem     │   │ + Package │   │ Bun/Node/Deno│
    └───────────┘   └───────────┘   └───────────┘   └─────────────┘

Comparison Table

| Feature | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Performance | Native | GVL-limited | GIL-limited | V8/Bun native |
| Integration | Standalone | Rails/Rack apps | Data pipelines | Node/Bun/Deno apps |
| Handler Style | Async traits | Class-based | ABC-based | Class-based |
| Concurrency | Tokio async | Thread + FFI poll | Thread + FFI poll | Event loop + FFI poll |
| Deployment | Binary | Gem + Server | Package + Server | Package + Server |
| Headless Mode | N/A | Library embed | Library embed | Library embed |
| Runtimes | - | MRI | CPython | Bun, Node.js, Deno |

When to Use Each

Rust Worker - Best for:

  • Maximum throughput requirements
  • Resource-constrained environments
  • Standalone microservices
  • Performance-critical handlers

Ruby Worker - Best for:

  • Rails/Ruby applications
  • ActiveRecord/ORM integration
  • Existing Ruby codebases
  • Quick prototyping with Ruby ecosystem

Python Worker - Best for:

  • Data processing pipelines
  • ML/AI integration
  • Scientific computing workflows
  • Python-native team preferences

TypeScript Worker - Best for:

  • Modern JavaScript/TypeScript applications
  • Full-stack Node.js teams
  • Edge computing with Bun or Deno
  • React/Vue/Angular backend services
  • Multi-runtime deployment flexibility

Deployment Modes

Server Mode

All workers can run as standalone servers:

Rust:

cargo run -p workers-rust

Ruby:

cd workers/ruby
./bin/server.rb

Python:

cd workers/python
python bin/server.py

TypeScript (Bun):

cd workers/typescript
bun run bin/server.ts

TypeScript (Node.js):

cd workers/typescript
npx tsx bin/server.ts

Headless/Embedded Mode (Ruby, Python & TypeScript)

Ruby, Python, and TypeScript workers can be embedded into existing applications without running the HTTP server. Headless mode is controlled via TOML configuration, not bootstrap parameters.

TOML Configuration (e.g., config/tasker/base/worker.toml):

[web]
enabled = false  # Disables HTTP server for headless/embedded mode

Ruby (in Rails):

# config/initializers/tasker.rb
require 'tasker_core'

# Bootstrap worker (web server disabled via TOML config)
TaskerCore::Worker::Bootstrap.start!

# Register handlers
TaskerCore::Registry::HandlerRegistry.instance.register_handler(
  'MyHandler',
  MyHandler
)

Python (in application):

from tasker_core import bootstrap_worker, HandlerRegistry
from tasker_core.types import BootstrapConfig

# Bootstrap worker (web server disabled via TOML config)
config = BootstrapConfig(namespace="my-app")
bootstrap_worker(config)

# Register handlers
registry = HandlerRegistry.instance()
registry.register("my_handler", MyHandler)

TypeScript (in application):

import { createRuntime, HandlerRegistry, EventEmitter, EventPoller, StepExecutionSubscriber } from '@tasker-systems/tasker';

// Bootstrap worker (web server disabled via TOML config)
const runtime = createRuntime();
await runtime.load('/path/to/libtasker_ts.dylib');
runtime.bootstrapWorker({ namespace: 'my-app' });

// Register handlers
const registry = new HandlerRegistry();
registry.register('my_handler', MyHandler);

// Start event processing
const emitter = new EventEmitter();
const subscriber = new StepExecutionSubscriber(emitter, registry, runtime, {});
subscriber.start();

const poller = new EventPoller(runtime, emitter);
poller.start();

Core Concepts

1. Handler Registration

All workers use a registry pattern for handler discovery:

                    ┌─────────────────────┐
                    │  HandlerRegistry    │
                    │  (Singleton)        │
                    └─────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
         ┌─────────┐    ┌─────────┐    ┌─────────┐
         │Handler A│    │Handler B│    │Handler C│
         └─────────┘    └─────────┘    └─────────┘

2. Event Flow

Step events flow through a consistent pipeline:

1. PGMQ Queue → Event received
2. Worker claims step (atomic)
3. Handler resolved by name
4. Handler.call(context) executed
5. Result sent to completion channel
6. Orchestration receives result

3. Error Classification

All workers distinguish between:

  • Retryable Errors: Transient failures → Re-enqueue step
  • Permanent Errors: Unrecoverable → Mark step failed
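As a sketch of what that classification looks like at the handler boundary (the types and names here are illustrative, not the worker's internal API):

enum HandlerError {
    Timeout(String),      // transient: network blips, upstream 5xx responses
    InvalidInput(String), // permanent: bad data will not get better on retry
}

fn is_retryable(err: &HandlerError) -> bool {
    match err {
        HandlerError::Timeout(_) => true,       // re-enqueue the step
        HandlerError::InvalidInput(_) => false, // mark the step failed
    }
}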

4. Graceful Shutdown

All workers handle shutdown signals (SIGTERM, SIGINT):

1. Signal received
2. Stop accepting new work
3. Complete in-flight handlers
4. Flush completion channel
5. Shutdown Rust foundation
6. Exit cleanly
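A rough sketch of that sequence using tokio signal handling; the worker's actual shutdown plumbing differs, and WorkerHandle with its methods is a hypothetical stand-in for the real handle.

use tokio::signal;

// WorkerHandle and its methods are hypothetical placeholders.
async fn run_until_shutdown(worker: WorkerHandle) -> std::io::Result<()> {
    // 1. Wait for a shutdown signal (Ctrl-C here; a real worker also handles SIGTERM)
    signal::ctrl_c().await?;

    // 2-4. Stop accepting new work, finish in-flight handlers, flush the completion channel
    worker.stop_accepting();
    worker.drain_in_flight().await;
    worker.flush_completions().await;

    // 5-6. Shut down the Rust foundation and exit cleanly
    worker.shutdown().await;
    Ok(())
}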

Configuration

Environment Variables

Common across all workers:

| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| TASKER_NAMESPACE | Worker namespace for queue isolation | default |
| RUST_LOG | Log level (trace/debug/info/warn/error) | info |
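For instance, a worker process typically resolves these at startup, falling back to the defaults above. A sketch using std::env, not the actual bootstrap code:

use std::env;

fn load_env_config() -> Result<(String, String, String), env::VarError> {
    let database_url = env::var("DATABASE_URL")?; // required, no default
    let tasker_env = env::var("TASKER_ENV").unwrap_or_else(|_| "development".into());
    let namespace = env::var("TASKER_NAMESPACE").unwrap_or_else(|_| "default".into());
    Ok((database_url, tasker_env, namespace))
}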

Language-Specific

Ruby:

| Variable | Description |
|---|---|
| RUBY_GC_HEAP_GROWTH_FACTOR | GC tuning for production |

Python:

| Variable | Description |
|---|---|
| PYTHON_HANDLER_PATH | Path for handler auto-discovery |

Handler Types

All workers support specialized handler types:

StepHandler (Base)

Basic step execution:

class MyHandler(StepHandler):
    handler_name = "my_handler"

    def call(self, context):
        return self.success({"result": "done"})

ApiHandler

HTTP/REST API integration with automatic error classification:

class FetchDataHandler < TaskerCore::StepHandler::Api
  def call(context)
    user_id = context.get_task_field('user_id')
    response = connection.get("/users/#{user_id}")
    process_response(response)
    success(result: response.body)
  end
end

DecisionHandler

Dynamic workflow routing:

class RouteHandler(DecisionHandler):
    handler_name = "route_handler"

    def call(self, context):
        if context.input_data["amount"] < 1000:
            return self.route_to_steps(["auto_approve"])
        return self.route_to_steps(["manager_approval"])

Batchable

Large dataset processing. Note: Ruby uses subclass inheritance, Python uses mixin:

Ruby (subclass of Base):

class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)
    no_op_result = handle_no_op_worker(batch_ctx)
    return no_op_result if no_op_result

    # Process batch using batch_ctx.start_cursor, batch_ctx.end_cursor
    batch_worker_complete(processed_count: batch_ctx.batch_size)
  end
end

Python (mixin):

class CsvBatchProcessor(StepHandler, Batchable):
    handler_name = "csv_batch_processor"

    def call(self, context: StepContext) -> StepHandlerResult:
        batch_ctx = self.get_batch_context(context)
        if batch_ctx is None:
            return self.failure(message="No batch context", error_type="missing_context")
        # Process batch using batch_ctx.start_cursor, batch_ctx.end_cursor
        batch_size = batch_ctx.cursor_config.end_cursor - batch_ctx.cursor_config.start_cursor
        return self.batch_worker_success(items_processed=batch_size)

Quick Start

Rust

# Build and run
cd workers/rust
cargo run

# With custom configuration
TASKER_CONFIG_PATH=/path/to/config.toml cargo run

Ruby

# Install dependencies
cd workers/ruby
bundle install
bundle exec rake compile

# Run server
./bin/server.rb

Python

# Install dependencies
cd workers/python
uv sync
uv run maturin develop

# Run server
python bin/server.py

TypeScript

# Install dependencies
cd workers/typescript
bun install
cargo build --release -p tasker-ts

# Run server (Bun)
bun run bin/server.ts

# Run server (Node.js)
npx tsx bin/server.ts

# Run server (Deno)
deno run --allow-ffi --allow-env --allow-net bin/server.ts

Monitoring

Health Checks

All workers expose health status:

# Python
from tasker_core import get_health_check
health = get_health_check()
# Ruby
health = TaskerCore::FFI.health_check

Metrics

Common metrics available:

| Metric | Description |
|---|---|
| pending_count | Events awaiting processing |
| in_flight_count | Events being processed |
| completed_count | Successfully completed |
| failed_count | Failed events |
| starvation_detected | Processing bottleneck |

Logging

All workers use structured logging:

2025-01-15T10:30:00Z [INFO] python-worker: Processing step step_uuid=abc-123 handler=process_order
2025-01-15T10:30:01Z [INFO] python-worker: Step completed step_uuid=abc-123 success=true duration_ms=150

Architecture Deep Dive

For detailed architectural documentation:


See Also

API Convergence Matrix

Last Updated: 2026-01-08 Status: Active ← Back to Worker Crates Overview


Overview

This document provides a quick reference for the aligned APIs across Ruby, Python, TypeScript, and Rust worker implementations. All four languages share consistent patterns for handler execution, result creation, registry operations, and composition via mixins/traits.


Handler Signatures

| Language | Base Class | Signature |
|---|---|---|
| Ruby | TaskerCore::StepHandler::Base | def call(context) |
| Python | BaseStepHandler | def call(self, context: StepContext) -> StepHandlerResult |
| TypeScript | StepHandler | async call(context: StepContext): Promise<StepHandlerResult> |
| Rust | StepHandler trait | async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult |

Composition Pattern

All languages use composition via mixins/traits rather than inheritance hierarchies.

Handler Composition

| Language | Base | Mixin Syntax | Example |
|---|---|---|---|
| Ruby | StepHandler::Base | include Mixins::API | class Handler < Base; include Mixins::API |
| Python | StepHandler | Multiple inheritance | class Handler(StepHandler, APIMixin) |
| TypeScript | StepHandler | applyAPI(this) | Mixin functions applied in constructor |
| Rust | impl StepHandler | impl APICapable | Multiple trait implementations |

Available Mixins/Traits

| Capability | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| API | Mixins::API | APIMixin | applyAPI() | APICapable |
| Decision | Mixins::Decision | DecisionMixin | applyDecision() | DecisionCapable |
| Batchable | Mixins::Batchable | BatchableMixin | BatchableHandler | BatchableCapable |

StepContext Fields

The StepContext provides unified access to step execution data across Ruby, Python, and TypeScript.

| Field | Type | Description |
|---|---|---|
| task_uuid | String | Unique task identifier (UUID v7) |
| step_uuid | String | Unique step identifier (UUID v7) |
| input_data | Dict/Hash | Input data for the step from workflow_step.inputs |
| step_inputs | Dict/Hash | Alias for input_data |
| step_config | Dict/Hash | Handler configuration from step_definition.handler.initialization |
| dependency_results | Wrapper | Results from parent steps (DependencyResultsWrapper) |
| retry_count | Integer | Current retry attempt (from workflow_step.attempts) |
| max_retries | Integer | Maximum retry attempts (from workflow_step.max_attempts) |

Convenience Methods

| Method | Description |
|---|---|
| get_task_field(name) | Get field from task context |
| get_dependency_result(step_name) | Get result from a parent step |

Ruby-Specific Accessors

| Property | Type | Description |
|---|---|---|
| task | TaskWrapper | Full task wrapper with context and metadata |
| workflow_step | WorkflowStepWrapper | Workflow step with execution state |
| step_definition | StepDefinitionWrapper | Step definition from task template |

Result Factories

Success Results

| Language | Method | Example |
|---|---|---|
| Ruby | success(result:, metadata:) | success(result: { id: 123 }, metadata: { ms: 50 }) |
| Python | self.success(result, metadata) | self.success({"id": 123}, {"ms": 50}) |
| Rust | StepExecutionResult::success(...) | StepExecutionResult::success(result, metadata) |

Failure Results

| Language | Method | Key Parameters |
|---|---|---|
| Ruby | failure(message:, error_type:, error_code:, retryable:, metadata:) | keyword arguments |
| Python | self.failure(message, error_type, error_code, retryable, metadata) | positional/keyword |
| Rust | StepExecutionResult::failure(...) | structured fields |

Result Fields

| Field | Ruby | Python | Rust | Description |
|---|---|---|---|---|
| success | bool | bool | bool | Whether step succeeded |
| result | Hash | Dict | HashMap | Result data |
| metadata | Hash | Dict | HashMap | Additional context |
| error_message | String | str | String | Human-readable error |
| error_type | String | str | String | Error classification |
| error_code | String (optional) | str (optional) | String (optional) | Application error code |
| retryable | bool | bool | bool | Whether to retry |

Standard error_type Values

Use these standard values for consistent error classification:

| Value | Description | Retry Behavior |
|---|---|---|
| PermanentError | Non-recoverable failure | Never retry |
| RetryableError | Temporary failure | Will retry |
| ValidationError | Input validation failed | No retry |
| TimeoutError | Operation timed out | May retry |
| UnexpectedError | Unexpected handler error | May retry |

Registry API

| Operation | Ruby | Python | Rust |
|---|---|---|---|
| Register | register(name, klass) | register(name, klass) | register_handler(name, handler) |
| Check | is_registered(name) | is_registered(name) | is_registered(name) |
| Resolve | resolve(name) | resolve(name) | get_handler(name) |
| List | list_handlers | list_handlers() | list_handlers() |

Note: Ruby also provides original method names (register_handler, handler_available?, resolve_handler, registered_handlers) as the primary API with the above as cross-language aliases.


Resolver Chain API

Handler resolution uses a chain-of-responsibility pattern to convert callable addresses into executable handlers.

StepHandlerResolver Interface

| Method | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| Get Name | name | resolver_name() | resolverName() | resolver_name(&self) |
| Get Priority | priority | priority() | priority() | priority(&self) |
| Can Resolve? | can_resolve?(definition, config) | can_resolve(definition) | canResolve(definition) | can_resolve(&self, definition) |
| Resolve | resolve(definition, config) | resolve(definition, context) | resolve(definition, context) | resolve(&self, definition, context) |

ResolverChain Operations

| Operation | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| Create | ResolverChain.new | ResolverChain() | new ResolverChain() | ResolverChain::new() |
| Register | register(resolver) | register(resolver) | register(resolver) | register(resolver) |
| Resolve | resolve(definition, context) | resolve(definition, context) | resolve(definition, context) | resolve(definition, context) |
| Can Resolve? | can_resolve?(definition) | can_resolve(definition) | canResolve(definition) | can_resolve(definition) |
| List | resolvers | resolvers | resolvers | resolvers() |

Built-in Resolvers

| Resolver | Priority | Function | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|---|---|
| ExplicitMappingResolver | 10 | Hash lookup of registered handlers | ✓ | ✓ | ✓ | ✓ |
| ClassConstantResolver | 100 | Runtime class lookup (Ruby) | - | ✓ | - | - |
| ClassLookupResolver | 100 | Runtime class lookup (Python/TS) | - | - | ✓ | ✓ |

Note: Class lookup resolvers are not available in Rust due to lack of runtime reflection. Rust handlers must use ExplicitMappingResolver. Ruby uses ClassConstantResolver (Ruby terminology); Python and TypeScript use ClassLookupResolver (same functionality, language-appropriate naming).
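Conceptually, the chain tries resolvers in priority order and takes the first one that can handle the definition. The sketch below illustrates the chain-of-responsibility pattern with simplified, hypothetical types; it is not the tasker-worker implementation.

use std::sync::Arc;

struct HandlerDefinition {
    callable: String, // handler address (name or class path)
}

trait Handler: Send + Sync {}

trait StepHandlerResolver {
    fn priority(&self) -> u32;
    fn can_resolve(&self, definition: &HandlerDefinition) -> bool;
    fn resolve(&self, definition: &HandlerDefinition) -> Option<Arc<dyn Handler>>;
}

struct ResolverChain {
    resolvers: Vec<Box<dyn StepHandlerResolver>>,
}

impl ResolverChain {
    fn register(&mut self, resolver: Box<dyn StepHandlerResolver>) {
        self.resolvers.push(resolver);
        // Lower priority value wins first (e.g., an explicit mapping at 10 before class lookup at 100)
        self.resolvers.sort_by_key(|r| r.priority());
    }

    fn resolve(&self, definition: &HandlerDefinition) -> Option<Arc<dyn Handler>> {
        self.resolvers
            .iter()
            .find(|r| r.can_resolve(definition))
            .and_then(|r| r.resolve(definition))
    }
}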

HandlerDefinition Fields

| Field | Type | Description | Required |
|---|---|---|---|
| callable | String | Handler address (name or class path) | Yes |
| method | String | Entry point method (default: "call") | No |
| resolver | String | Resolution hint to bypass chain | No |
| initialization | Dict/Hash | Handler configuration | No |

Method Dispatch

Multi-method handlers expose multiple entry points through the method field:

| Language | Default Method | Dynamic Dispatch |
|---|---|---|
| Ruby | call | handler.public_send(method, context) |
| Python | call | getattr(handler, method)(context) |
| TypeScript | call | handler[method](context) |
| Rust | call | handler.invoke_method(method, step) |

Creating Multi-Method Handlers:

| Language | Approach |
|---|---|
| Ruby | Define additional methods alongside call |
| Python | Define additional methods alongside call |
| TypeScript | Define additional async methods alongside call |
| Rust | Implement invoke_method to dispatch to internal methods |
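In Rust, dispatch is explicit since there is no runtime reflection. A hedged sketch of what an invoke_method implementation might look like, following the handler conventions used in the examples in this documentation; the refund entry point is hypothetical.

impl ProcessOrderHandler {
    pub async fn invoke_method(
        &self,
        method: &str,
        step_data: &TaskSequenceStep,
    ) -> StepExecutionResult {
        match method {
            "call" => self.call(step_data).await,
            "refund" => self.refund(step_data).await, // hypothetical additional entry point
            other => StepExecutionResult::failure(
                &format!("unknown method: {other}"),
                "PermanentError",
                false, // not retryable
            ),
        }
    }
}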

See Handler Resolution Guide for complete documentation.


Specialized Handlers

API Handler

| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| GET | get(path, params: {}, headers: {}) | self.get(path, params={}, headers={}) | this.get(path, params?, headers?) |
| POST | post(path, data: {}, headers: {}) | self.post(path, data={}, headers={}) | this.post(path, data?, headers?) |
| PUT | put(path, data: {}, headers: {}) | self.put(path, data={}, headers={}) | this.put(path, data?, headers?) |
| DELETE | delete(path, params: {}, headers: {}) | self.delete(path, params={}, headers={}) | this.delete(path, params?, headers?) |

Decision Handler

| Language | Simple API | Result Fields |
|---|---|---|
| Ruby | decision_success(steps:, routing_context:) | decision_point_outcome: { type, step_names } |
| Python | decision_success(steps, routing_context) | decision_point_outcome: { type, step_names } |
| TypeScript | decisionSuccess(steps, routingContext?) | decision_point_outcome: { type, step_names } |
| Rust | decision_success(step_uuid, step_names, ...) | Pattern-based |

Decision Helper Methods (Cross-Language):

  • decision_success(steps, routing_context) - Create dynamic steps
  • skip_branches(reason, routing_context) - Skip all conditional branches
  • decision_failure(message, error_type) - Decision could not be made

Batchable Handler

| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Get Context | get_batch_context(context) | get_batch_context(context) | getBatchContext(context) |
| Complete Batch | batch_worker_complete(processed_count:, result_data:) | batch_worker_complete(processed_count, result_data) | batchWorkerComplete(processedCount, resultData) |
| Handle No-Op | handle_no_op_worker(batch_ctx) | handle_no_op_worker(batch_ctx) | handleNoOpWorker(batchCtx) |

Standard Batch Result Fields:

  • processed_count / items_processed
  • items_succeeded / items_failed
  • start_cursor, end_cursor, batch_size, last_cursor

Cursor Indexing:

  • All languages use 0-indexed cursors (start at 0, not 1)
  • Ruby was updated from 1-indexed to 0-indexed for consistency

Checkpoint Yielding

Checkpoint yielding enables batch workers to persist progress and yield control for re-dispatch.

| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Checkpoint | checkpoint_yield(cursor:, items_processed:, accumulated_results:) | checkpoint_yield(cursor, items_processed, accumulated_results) | checkpointYield({ cursor, itemsProcessed, accumulatedResults }) |

BatchWorkerContext Checkpoint Accessors:

| Accessor | Ruby | Python | TypeScript |
|---|---|---|---|
| Cursor | checkpoint_cursor | checkpoint_cursor | checkpointCursor |
| Accumulated Results | accumulated_results | accumulated_results | accumulatedResults |
| Has Checkpoint? | has_checkpoint? | has_checkpoint() | hasCheckpoint() |
| Items Processed | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed |

FFI Contract:

| Function | Description |
|---|---|
| checkpoint_yield_step_event(event_id, data) | Persist checkpoint and re-dispatch step |

Key Invariants:

  • Progress is atomically saved before re-dispatch
  • Step remains InProgress during checkpoint yield cycle
  • Only Success/Failure trigger state transitions

See Batch Processing Guide - Checkpoint Yielding for full documentation.


Domain Events

Publisher Contract

| Language | Base Class | Key Method |
|---|---|---|
| Ruby | TaskerCore::DomainEvents::BasePublisher | publish(ctx) |
| Python | BasePublisher | publish(ctx) |
| TypeScript | BasePublisher | publish(ctx) |
| Rust | StepEventPublisher trait | publish(ctx) |

Publisher Lifecycle Hooks

All languages support publisher lifecycle hooks for instrumentation:

| Hook | Ruby | Python | TypeScript | Description |
|---|---|---|---|---|
| Before Publish | before_publish(ctx) | before_publish(ctx) | beforePublish(ctx) | Called before publishing |
| After Publish | after_publish(ctx, result) | after_publish(ctx, result) | afterPublish(ctx, result) | Called after successful publish |
| On Error | on_publish_error(ctx, error) | on_publish_error(ctx, error) | onPublishError(ctx, error) | Called on publish failure |
| Metadata | additional_metadata(ctx) | additional_metadata(ctx) | additionalMetadata(ctx) | Inject custom metadata |

StepEventContext Fields

| Field | Description |
|---|---|
| task_uuid | Task identifier |
| step_uuid | Step identifier |
| step_name | Handler/step name |
| namespace | Task namespace |
| correlation_id | Tracing correlation ID |
| result | Step execution result |
| metadata | Additional metadata |

Subscriber Contract

| Language | Base Class | Key Methods |
|---|---|---|
| Ruby | TaskerCore::DomainEvents::BaseSubscriber | subscribes_to, handle(event) |
| Python | BaseSubscriber | subscribes_to(), handle(event) |
| TypeScript | BaseSubscriber | subscribesTo(), handle(event) |
| Rust | EventHandler closures | N/A |

Subscriber Lifecycle Hooks

All languages support subscriber lifecycle hooks:

| Hook | Ruby | Python | TypeScript | Description |
|---|---|---|---|---|
| Before Handle | before_handle(event) | before_handle(event) | beforeHandle(event) | Called before handling |
| After Handle | after_handle(event, result) | after_handle(event, result) | afterHandle(event, result) | Called after handling |
| On Error | on_handle_error(event, error) | on_handle_error(event, error) | onHandleError(event, error) | Called on handler failure |

Registries

| Language | Publisher Registry | Subscriber Registry |
|---|---|---|
| Ruby | PublisherRegistry.instance | SubscriberRegistry.instance |
| Python | PublisherRegistry.instance() | SubscriberRegistry.instance() |
| TypeScript | PublisherRegistry.getInstance() | SubscriberRegistry.getInstance() |

Migration Summary

Ruby

| Before | After |
|---|---|
| def call(task, sequence, step) | def call(context) |
| class Handler < API | class Handler < Base; include Mixins::API |
| task.context['field'] | context.get_task_field('field') |
| sequence.get_results('step') | context.get_dependency_result('step') |
| 1-indexed cursors | 0-indexed cursors |

Python

| Before | After |
|---|---|
| def handle(self, task, sequence, step) | def call(self, context) |
| class Handler(APIHandler) | class Handler(StepHandler, APIMixin) |
| N/A | self.success(result, metadata) |
| N/A | Publisher/Subscriber lifecycle hooks |

TypeScript

| Before | After |
|---|---|
| class Handler extends APIHandler | class Handler extends StepHandler implements APICapable |
| No domain events | Complete domain events module |
| N/A | Publisher/Subscriber lifecycle hooks |
| N/A | applyAPI(this), applyDecision(this) mixins |

Rust

| Before | After |
|---|---|
| (already aligned) | (already aligned) |
| N/A | APICapable, DecisionCapable, BatchableCapable traits |

See Also

Example Handlers - Cross-Language Reference

Last Updated: 2025-12-21 Status: Active ← Back to Worker Crates Overview


Overview

This document provides side-by-side handler examples across Ruby, Python, and Rust. These examples demonstrate the aligned APIs that enable consistent patterns across all worker implementations.


Simple Step Handler

Ruby

class ProcessOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    order_id = context.get_task_field('order_id')
    amount = context.get_task_field('amount')

    result = process_order(order_id, amount)

    success(
      result: {
        order_id: order_id,
        status: 'processed',
        total: result[:total]
      },
      metadata: { processed_at: Time.now.iso8601 }
    )
  rescue StandardError => e
    failure(
      message: e.message,
      error_type: 'UnexpectedError',
      retryable: true,
      metadata: { order_id: order_id }
    )
  end

  private

  def process_order(order_id, amount)
    # Business logic here
    { total: amount * 1.08 }
  end
end

Python

from tasker_core import BaseStepHandler, StepContext, StepHandlerResult


class ProcessOrderHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        try:
            order_id = context.get_task_field("order_id")
            amount = context.get_task_field("amount")

            result = self.process_order(order_id, amount)

            return self.success(
                result={
                    "order_id": order_id,
                    "status": "processed",
                    "total": result["total"],
                },
                metadata={"processed_at": datetime.now().isoformat()},
            )
        except Exception as e:
            return self.failure(
                message=str(e),
                error_type="handler_error",
                retryable=True,
                metadata={"order_id": order_id},
            )

    def process_order(self, order_id: str, amount: float) -> dict:
        # Business logic here
        return {"total": amount * 1.08}

Rust

use tasker_shared::types::{TaskSequenceStep, StepExecutionResult};

pub struct ProcessOrderHandler;

impl ProcessOrderHandler {
    pub async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult {
        let order_id = step_data.task.context.get("order_id")
            .and_then(|v| v.as_str())
            .unwrap_or_default();
        let amount = step_data.task.context.get("amount")
            .and_then(|v| v.as_f64())
            .unwrap_or(0.0);

        match self.process_order(order_id, amount).await {
            Ok(result) => StepExecutionResult::success(
                serde_json::json!({
                    "order_id": order_id,
                    "status": "processed",
                    "total": result.total,
                }),
                Some(serde_json::json!({
                    "processed_at": chrono::Utc::now().to_rfc3339(),
                })),
            ),
            Err(e) => StepExecutionResult::failure(
                &e.to_string(),
                "handler_error",
                true, // retryable
            ),
        }
    }

    async fn process_order(&self, _order_id: &str, amount: f64) -> Result<OrderResult, Error> {
        Ok(OrderResult { total: amount * 1.08 })
    }
}

Handler with Dependencies

Ruby

class ShipOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Get results from dependent steps
    validation = context.get_dependency_result('validate_order')
    payment = context.get_dependency_result('process_payment')

    unless validation && validation['valid']
      return failure(
        message: 'Order validation failed',
        error_type: 'ValidationError',
        retryable: false
      )
    end

    unless payment && payment['status'] == 'completed'
      return failure(
        message: 'Payment not completed',
        error_type: 'PermanentError',
        retryable: false
      )
    end

    # Access task context
    order_id = context.get_task_field('order_id')
    shipping_address = context.get_task_field('shipping_address')

    tracking_number = create_shipment(order_id, shipping_address)

    success(result: {
      order_id: order_id,
      tracking_number: tracking_number,
      shipped_at: Time.now.iso8601
    })
  end
end

Python

class ShipOrderHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        # Get results from dependent steps
        validation = context.get_dependency_result("validate_order")
        payment = context.get_dependency_result("process_payment")

        if not validation or not validation.get("valid"):
            return self.failure(
                message="Order validation failed",
                error_type="validation_error",
                retryable=False,
            )

        if not payment or payment.get("status") != "completed":
            return self.failure(
                message="Payment not completed",
                error_type="permanent_error",
                retryable=False,
            )

        # Access task context
        order_id = context.get_task_field("order_id")
        shipping_address = context.get_task_field("shipping_address")

        tracking_number = self.create_shipment(order_id, shipping_address)

        return self.success(
            result={
                "order_id": order_id,
                "tracking_number": tracking_number,
                "shipped_at": datetime.now().isoformat(),
            }
        )

Decision Handler

Ruby

class ApprovalRoutingHandler < TaskerCore::StepHandler::Decision
  THRESHOLDS = {
    auto_approve: 1000,
    manager_only: 5000
  }.freeze

  def call(context)
    amount = context.get_task_field('amount').to_f
    department = context.get_task_field('department')

    if amount < THRESHOLDS[:auto_approve]
      decision_success(
        steps: ['auto_approve'],
        result_data: {
          route_type: 'automatic',
          amount: amount,
          reason: 'Below threshold'
        }
      )
    elsif amount < THRESHOLDS[:manager_only]
      decision_success(
        steps: ['manager_approval'],
        result_data: {
          route_type: 'manager',
          amount: amount,
          approver: find_manager(department)
        }
      )
    else
      decision_success(
        steps: ['manager_approval', 'finance_review'],
        result_data: {
          route_type: 'dual_approval',
          amount: amount,
          requires_cfo: amount > 50_000
        }
      )
    end
  end

  private

  def find_manager(department)
    # Lookup logic
    "manager@example.com"
  end
end

Python

class ApprovalRoutingHandler(DecisionHandler):
    THRESHOLDS = {
        "auto_approve": 1000,
        "manager_only": 5000,
    }

    def call(self, context: StepContext) -> StepHandlerResult:
        amount = float(context.get_task_field("amount") or 0)
        department = context.get_task_field("department")

        if amount < self.THRESHOLDS["auto_approve"]:
            return self.decision_success(
                steps=["auto_approve"],
                routing_context={
                    "route_type": "automatic",
                    "amount": amount,
                    "reason": "Below threshold",
                },
            )
        elif amount < self.THRESHOLDS["manager_only"]:
            return self.decision_success(
                steps=["manager_approval"],
                routing_context={
                    "route_type": "manager",
                    "amount": amount,
                    "approver": self.find_manager(department),
                },
            )
        else:
            return self.decision_success(
                steps=["manager_approval", "finance_review"],
                routing_context={
                    "route_type": "dual_approval",
                    "amount": amount,
                    "requires_cfo": amount > 50000,
                },
            )

    def find_manager(self, department: str) -> str:
        return "manager@example.com"

Batch Processing Handler

Ruby (Analyzer)

class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
  BATCH_SIZE = 100

  def call(context)
    file_path = context.get_task_field('csv_file_path')
    total_rows = count_csv_rows(file_path)

    if total_rows <= BATCH_SIZE
      # Small file - process inline, no batches needed
      outcome = TaskerCore::Types::BatchProcessingOutcome.no_batches

      success(
        result: {
          batch_processing_outcome: outcome.to_h,
          total_rows: total_rows,
          processing_mode: 'inline'
        }
      )
    else
      # Large file - create batch workers
      cursor_configs = calculate_batches(total_rows, BATCH_SIZE)
      outcome = TaskerCore::Types::BatchProcessingOutcome.create_batches(
        worker_template_name: 'process_csv_batch',
        worker_count: cursor_configs.size,
        cursor_configs: cursor_configs,
        total_items: total_rows
      )

      success(
        result: {
          batch_processing_outcome: outcome.to_h,
          total_rows: total_rows,
          batch_count: cursor_configs.size
        }
      )
    end
  end

  private

  def calculate_batches(total, batch_size)
    (0...total).step(batch_size).map.with_index do |start, idx|
      {
        'batch_id' => format('%03d', idx),
        'start_cursor' => start,
        'end_cursor' => [start + batch_size, total].min,
        'batch_size' => [batch_size, total - start].min
      }
    end
  end
end

Ruby (Batch Worker)

class CsvBatchWorkerHandler < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)

    # Handle placeholder batches
    no_op_result = handle_no_op_worker(batch_ctx)
    return no_op_result if no_op_result

    # Get file path from analyzer step
    analyzer_result = context.get_dependency_result('analyze_csv')
    file_path = analyzer_result&.dig('csv_file_path')

    # Process this batch
    records = read_csv_range(file_path, batch_ctx.start_cursor, batch_ctx.batch_size)
    processed = records.map { |row| transform_row(row) }

    batch_worker_complete(
      processed_count: processed.size,
      result_data: {
        batch_id: batch_ctx.batch_id,
        records_processed: processed.size,
        summary: calculate_summary(processed)
      }
    )
  end
end

Python (Batch Worker)

class CsvBatchWorkerHandler(BatchableHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        batch_ctx = self.get_batch_context(context)

        # Handle placeholder batches
        no_op_result = self.handle_no_op_worker(batch_ctx)
        if no_op_result:
            return no_op_result

        # Get file path from analyzer step
        analyzer_result = context.get_dependency_result("analyze_csv")
        file_path = analyzer_result.get("csv_file_path") if analyzer_result else None

        # Process this batch
        records = self.read_csv_range(
            file_path, batch_ctx.start_cursor, batch_ctx.batch_size
        )
        processed = [self.transform_row(row) for row in records]

        return self.batch_worker_complete(
            processed_count=len(processed),
            result_data={
                "batch_id": batch_ctx.batch_id,
                "records_processed": len(processed),
                "summary": self.calculate_summary(processed),
            },
        )

API Handler

Ruby

class FetchUserHandler < TaskerCore::StepHandler::Api
  def call(context)
    user_id = context.get_task_field('user_id')

    # Automatic error classification (429 -> retryable, 404 -> permanent)
    response = connection.get("/users/#{user_id}")
    process_response(response)

    success(result: {
      user_id: user_id,
      email: response.body['email'],
      name: response.body['name']
    })
  end

  def base_url
    'https://api.example.com'
  end

  def configure_connection
    Faraday.new(base_url) do |conn|
      conn.request :json
      conn.response :json
      conn.options.timeout = 30
    end
  end
end

Python

class FetchUserHandler(ApiStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        user_id = context.get_task_field("user_id")

        # Automatic error classification
        response = self.get(f"/users/{user_id}")

        return self.success(
            result={
                "user_id": user_id,
                "email": response["email"],
                "name": response["name"],
            }
        )

    @property
    def base_url(self) -> str:
        return "https://api.example.com"

    def configure_session(self, session):
        session.headers["Authorization"] = f"Bearer {self.get_token()}"
        session.timeout = 30

Error Handling Patterns

Ruby - Raising Exceptions

class ValidateOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    order = context.task.context

    # Permanent error - will not retry
    if order['amount'].to_f <= 0
      raise TaskerCore::Errors::PermanentError.new(
        'Order amount must be positive',
        error_code: 'INVALID_AMOUNT',
        context: { amount: order['amount'] }
      )
    end

    # Retryable error - will retry with backoff
    if external_service_unavailable?
      raise TaskerCore::Errors::RetryableError.new(
        'External service temporarily unavailable',
        retry_after: 30,
        context: { service: 'payment_gateway' }
      )
    end

    success(result: { valid: true })
  end
end

Python - Returning Failures

class ValidateOrderHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        order = context.task.context

        # Permanent error - will not retry
        amount = float(order.get("amount", 0))
        if amount <= 0:
            return self.failure(
                message="Order amount must be positive",
                error_type="validation_error",
                error_code="INVALID_AMOUNT",
                retryable=False,
                metadata={"amount": amount},
            )

        # Retryable error - will retry with backoff
        if self.external_service_unavailable():
            return self.failure(
                message="External service temporarily unavailable",
                error_type="retryable_error",
                retryable=True,
                metadata={"service": "payment_gateway"},
            )

        return self.success(result={"valid": True})

See Also

FFI Safety Safeguards

Last Updated: 2026-02-02 Status: Production Implementation Applies To: Ruby (Magnus), Python (PyO3), TypeScript (C FFI) workers


Overview

Tasker’s FFI workers embed the Rust tasker-worker runtime inside language-specific host processes (Ruby, Python, TypeScript/JavaScript). This document describes the safeguards that prevent Rust-side failures from crashing or corrupting the host process, ensuring that infrastructure unavailability, misconfiguration, and unexpected panics are surfaced as language-native errors rather than process faults.

FFI Architecture

Host Process (Ruby / Python / Node.js)
         │
         ▼
    FFI Boundary
    ┌─────────────────────────────────────┐
    │  Language Binding Layer              │
    │  (Magnus / PyO3 / extern "C")       │
    │                                     │
    │  ┌─────────────────────────────┐    │
    │  │  Bridge Module              │    │
    │  │  (bootstrap, poll, complete)│    │
    │  └────────────┬────────────────┘    │
    │               │                     │
    │  ┌────────────▼────────────────┐    │
    │  │  FfiDispatchChannel         │    │
    │  │  (event dispatch, callbacks)│    │
    │  └────────────┬────────────────┘    │
    │               │                     │
    │  ┌────────────▼────────────────┐    │
    │  │  WorkerBootstrap            │    │
    │  │  (runtime, DB, messaging)   │    │
    │  └─────────────────────────────┘    │
    └─────────────────────────────────────┘

Panic Safety by Framework

Each FFI framework provides different levels of automatic panic protection:

FrameworkPanic HandlingMechanism
Magnus (Ruby)AutomaticCatches panics at FFI boundary, converts to Ruby RuntimeError
PyO3 (Python)AutomaticCatches panics at #[pyfunction] boundary, converts to PanicException
C FFI (TypeScript)ManualRequires explicit std::panic::catch_unwind wrappers

TypeScript C FFI: Explicit Panic Guards

Because the TypeScript worker uses raw extern "C" functions (for compatibility with Node.js, Bun, and Deno FFI), panics unwinding through this boundary would be undefined behavior. All extern "C" functions that call into bridge internals are wrapped with catch_unwind:

// workers/typescript/src-rust/lib.rs
#[no_mangle]
pub unsafe extern "C" fn bootstrap_worker(config_json: *const c_char) -> *mut c_char {
    // ... parse config_json ...

    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        bridge::bootstrap_worker_internal(config_str)
    }));

    match result {
        Ok(Ok(json)) => /* return JSON */,
        Ok(Err(e)) => json_error(&format!("Bootstrap failed: {}", e)),
        Err(panic_info) => {
            // Extract panic message, log it, return JSON error
            json_error(&msg)
        }
    }
}

Protected functions: bootstrap_worker, stop_worker, get_worker_status, transition_to_graceful_shutdown, poll_step_events, poll_in_process_events, complete_step_event, checkpoint_yield_step_event, get_ffi_dispatch_metrics, check_starvation_warnings, cleanup_timeouts.

Error Handling at FFI Boundaries

Bootstrap Failures

When infrastructure is unavailable during worker startup, errors flow through the normal Result path rather than panicking:

Failure ScenarioHandlingHost Process Impact
Database unreachableTaskerError::DatabaseError returnedLanguage exception, app can retry
Config TOML missingTaskerError::ConfigurationError returnedLanguage exception with descriptive message
Worker config section absentTaskerError::ConfigurationError returnedLanguage exception (was previously a panic)
Messaging backend unavailableTaskerError::ConfigurationError returnedLanguage exception
Tokio runtime creation failsLogged + language error returnedLanguage exception
Port already in useTaskerError::WorkerError returnedLanguage exception
Redis/cache unavailableGraceful degradation to noop cacheNo error - worker starts without cache

Steady-State Operation Failures

Once bootstrapped, the worker handles infrastructure failures gracefully:

Failure ScenarioHandlingHost Process Impact
Database goes down during pollPoll returns None (no events)No impact - polling continues
Completion channel fullRetry loop with timeout, then loggedStep result may be lost after timeout
Completion channel closedReturns false to callerApp code sees completion failure
Callback timeout (5s)Logged, step completion unaffectedDomain events may be delayed
Messaging down during callbackCallback times out, loggedDomain events may not publish
Lock poisonedError returned to callerLanguage exception
Worker not initializedError returned to callerLanguage exception

Lock Acquisition

All three workers validate lock acquisition before proceeding:

// Pattern used in all workers
let handle_guard = WORKER_SYSTEM.lock().map_err(|e| {
    error!("Failed to acquire worker system lock: {}", e);
    // Convert to language-appropriate error
})?;

A poisoned mutex (from a previous panic) produces a language exception rather than propagating the original panic.

EventRouter Availability

Post-bootstrap access to the EventRouter uses fallible error handling rather than .expect():

// Use ok_or_else instead of expect to prevent panic at FFI boundary
let event_router = worker_core.event_router().ok_or_else(|| {
    error!("EventRouter not available from WorkerCore after bootstrap");
    // Return language-appropriate error
})?;

Callback Safety

The FfiDispatchChannel uses a fire-and-forget pattern for post-completion callbacks, preventing the host process from being blocked or deadlocked by Rust-side async operations:

  1. Completion is sent first - the step result is delivered to the completion channel before any callback fires
  2. Callback is spawned separately - runs in the Tokio runtime, not the FFI caller’s thread
  3. Timeout protection - callbacks are bounded by a configurable timeout (default 5s)
  4. Callback failures are logged - they never affect step completion or the host process
FFI Thread (Ruby/Python/JS)          Tokio Runtime
         │                                │
         ├──► complete(event_id, result)   │
         │    ├──► send result to channel  │
         │    └──► spawn callback ─────────┼──► callback.on_handler_complete()
         │                                 │    (with 5s timeout)
         ◄──── return true ────────────────│
         │  (immediate, non-blocking)      │

See docs/development/ffi-callback-safety.md for detailed callback safety guidelines.

Backpressure Protection

Completion Channel

The completion channel uses a try-send retry loop with timeout to prevent indefinite blocking:

  • Try-send avoids blocking the FFI thread
  • Retry with sleep (10ms intervals) handles transient backpressure
  • Timeout (configurable, default 30s) prevents permanent stalls
  • Logged when backpressure delays exceed 100ms

Starvation Detection

The FfiDispatchChannel tracks event age and warns when polling falls behind:

  • Events older than starvation_warning_threshold_ms (default 10s) trigger warnings
  • check_starvation_warnings() can be called periodically from the host process
  • FfiDispatchMetrics exposes pending count, oldest event age, and starvation status
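
As a host-side illustration, a Python worker can fold these checks into its polling loop roughly as follows; the function names match the FFI dispatch contract shared by Ruby and Python workers, while the import path and metrics attribute names are assumptions:

from tasker_core import (  # shared FFI functions; exact import path assumed
    check_starvation_warnings,
    cleanup_timeouts,
    get_ffi_dispatch_metrics,
)

def periodic_maintenance(poll_count: int) -> None:
    """Host-side upkeep interleaved with the normal step-event polling loop."""
    if poll_count % 100 == 0:
        # Ask the Rust side to log warnings for events older than the threshold
        check_starvation_warnings()
        metrics = get_ffi_dispatch_metrics()
        # Attribute names assumed from the FfiDispatchMetrics description above
        if getattr(metrics, "starvation_detected", False):
            pending = getattr(metrics, "pending_count", "?")
            print(f"FFI dispatch falling behind: {pending} events pending")

    if poll_count % 1000 == 0:
        cleanup_timeouts()  # reclaim events that exceeded their timeout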

Infrastructure Dependency Matrix

ComponentBootstrapPollCompleteCallback
DatabaseRequired (error on failure)Not neededNot neededErrors logged
Message BusRequired (error on failure)Not neededNot neededErrors logged
Config SystemRequired (error on failure)Not neededNot neededNot needed
Cache (Redis)Optional (degrades to noop)Not neededNot neededNot needed
Tokio RuntimeRequired (error on failure)UsedUsedUsed

Worker Lifecycle Safety

Start (bootstrap_worker)

  • Validates configuration, creates runtime, initializes all subsystems
  • All failures return language-appropriate errors
  • Already-running detection prevents double initialization

Status (get_worker_status)

  • Safe when worker is not initialized (returns running: false)
  • Safe when worker is running (queries internal state)
  • Lock acquisition failure returns error

Stop (stop_worker)

  • Safe when worker is not running (returns success message)
  • Sends shutdown signal and clears handle
  • In-flight operations complete before shutdown

Graceful Shutdown (transition_to_graceful_shutdown)

  • Initiates graceful shutdown allowing in-flight work to drain
  • Errors during transition are logged and returned
  • Requires worker to be running (error otherwise)

Adding a New FFI Worker

When implementing a new language worker:

  1. Check framework panic safety - if the framework (like Magnus/PyO3) catches panics automatically, you get protection for free. If using raw C FFI, wrap all extern "C" functions with catch_unwind.

  2. Use the standard bridge pattern - global WORKER_SYSTEM mutex, BridgeHandle struct containing WorkerSystemHandle + FfiDispatchChannel + runtime.

  3. Handle all lock acquisitions - always use .map_err() on .lock() calls.

  4. Avoid .expect() and .unwrap() in FFI code - use ok_or_else() or map_err() to convert to language-appropriate errors.

  5. Use fire-and-forget callbacks - never block the FFI thread on async operations.

  6. Integrate starvation detection - call check_starvation_warnings() periodically.

  7. Expose metrics - expose FfiDispatchMetrics for health monitoring.

FFI Memory Management in TypeScript Workers

Status: Active
Applies To: TypeScript/Bun/Node.js FFI Related: Ruby (Magnus), Python (PyO3)


Overview

This document explains the memory management pattern used when calling Rust functions from TypeScript via FFI (Foreign Function Interface). Understanding this pattern is critical for preventing memory leaks and undefined behavior.

Key Principle: When Rust hands memory to JavaScript across the FFI boundary, Rust’s ownership system no longer applies. The JavaScript code becomes responsible for explicitly freeing that memory.


The Memory Handoff Pattern

Three-Step Process

// 1. ALLOCATE: Rust allocates memory and returns a pointer
const ptr = this.lib.symbols.get_worker_status() as Pointer;

// 2. READ: JavaScript reads/copies the data from that pointer
const json = new CString(ptr);              // Read C string into JS string
const status = JSON.parse(json);            // Parse into JS object

// 3. FREE: JavaScript tells Rust to deallocate the memory
this.lib.symbols.free_rust_string(ptr);     // Rust frees the memory

// After this point, 'status' is a safe JavaScript object
// and the Rust memory has been freed (no leak)

Why This Pattern Exists

When Rust returns a pointer across the FFI boundary, it deliberately leaks the memory from Rust’s perspective:

// Rust side:
#[no_mangle]
pub extern "C" fn get_worker_status() -> *mut c_char {
    let status = WorkerStatus { /* ... */ };
    let json = serde_json::to_string(&status).unwrap();
    
    // into_raw() transfers ownership OUT of Rust's memory system
    CString::new(json).unwrap().into_raw()
    // Rust's Drop trait will NOT run on this memory!
}

The .into_raw() method:

  • Converts CString to a raw pointer
  • Prevents Rust from freeing the memory when it goes out of scope
  • Transfers ownership responsibility to the caller

Without this, Rust would free the memory immediately, and JavaScript would read garbage data (use-after-free).


The Free Function

JavaScript must call back into Rust to free the memory:

// Rust side:
#[no_mangle]
pub extern "C" fn free_rust_string(ptr: *mut c_char) {
    if ptr.is_null() {
        return;
    }
    
    // SAFETY: We know this pointer came from CString::into_raw()
    // and this function is only called once per pointer
    unsafe {
        let _ = CString::from_raw(ptr);
        // CString goes out of scope here and properly frees the memory
    }
}

This reconstructs the CString from the raw pointer, which causes Rust’s Drop trait to run and free the memory.


Safety Guarantees

This pattern is safe because of three key properties:

1. Single-Threaded JavaScript Runtime

JavaScript (and TypeScript) runs on a single thread (ignoring Web Workers), which means:

  • No race conditions: The read → free sequence is atomic from Rust’s perspective
  • No concurrent access: Only one piece of code can access the pointer at a time
  • Predictable execution order: Steps always happen in sequence

2. One-Way Handoff

Rust follows a strict contract:

Rust allocates → Returns pointer → NEVER TOUCHES IT AGAIN
  • Rust doesn’t keep any references to the memory
  • Rust never reads or writes to that memory after returning the pointer
  • The memory is “orphaned” from Rust’s perspective until free_rust_string is called

3. JavaScript Copies Before Freeing

JavaScript creates a new copy of the data before freeing:

const ptr = this.lib.symbols.get_worker_status() as Pointer;

// Step 1: Read bytes from Rust memory into a JavaScript string
const json = new CString(ptr);  // COPY operation

// Step 2: Parse string into JavaScript objects
const status = JSON.parse(json);  // Creates new JS objects

// Step 3: Free the Rust memory
this.lib.symbols.free_rust_string(ptr);

// At this point:
// - 'status' is pure JavaScript (managed by V8/JavaScriptCore)
// - Rust memory has been freed (no leak)
// - 'ptr' is invalid (but we never use it again)

The status object is fully owned by JavaScript’s garbage collector. It has no connection to the freed Rust memory.


Comparison to Ruby and Python FFI

Ruby (Magnus)

# Ruby FFI with Magnus
result = TaskerCore::FFI.get_worker_status()
# No explicit free needed - Magnus manages memory via Rust Drop traits

How it works: Magnus creates a bridge between Ruby’s GC and Rust’s ownership system. When Ruby no longer references the object, Rust’s Drop trait eventually runs.

Python (PyO3)

# Python FFI with PyO3
result = tasker_core.get_worker_status()
# No explicit free needed - PyO3 uses Python's reference counting

How it works: PyO3 wraps Rust data in PyObject wrappers. When Python’s reference count reaches zero, the Rust data is dropped.

TypeScript (Bun/Node FFI)

// TypeScript FFI - manual memory management required
const ptr = lib.symbols.get_worker_status();
const json = new CString(ptr);
const status = JSON.parse(json);
lib.symbols.free_rust_string(ptr);  // MUST call explicitly

Why different: Bun and Node.js use raw C FFI (similar to ctypes in Python or FFI gem in Ruby). There’s no automatic memory management bridge, so we must manually free.

Tradeoff: More verbose, but gives us complete control and makes memory lifetime explicit.


Common Pitfalls and How We Avoid Them

1. Memory Leak (Forgetting to Free)

Problem:

// BAD: Memory leak
const ptr = this.lib.symbols.get_worker_status();
const json = new CString(ptr);
const status = JSON.parse(json);
// Oops! Forgot to call free_rust_string(ptr)

How we avoid it: Every code path that allocates a pointer must free it. We wrap this in methods like pollStepEvents() that handle the complete lifecycle:

pollStepEvents(): FfiStepEvent[] {
  const ptr = this.lib.symbols.poll_step_events() as Pointer;
  if (!ptr) {
    return [];  // No allocation, no free needed
  }
  
  const json = new CString(ptr);
  const events = JSON.parse(json);
  this.lib.symbols.free_rust_string(ptr);  // Always freed
  return events;
}

2. Double-Free

Problem:

// BAD: Double-free (undefined behavior)
const ptr = this.lib.symbols.get_worker_status();
const json = new CString(ptr);
this.lib.symbols.free_rust_string(ptr);
this.lib.symbols.free_rust_string(ptr);  // CRASH! Already freed

How we avoid it: We free the pointer exactly once in each code path, and we never store pointers for reuse. Each pointer is used in a single scope and immediately freed.

3. Use-After-Free

Problem:

// BAD: Use-after-free
const ptr = this.lib.symbols.get_worker_status();
this.lib.symbols.free_rust_string(ptr);
const json = new CString(ptr);  // CRASH! Memory is gone

How we avoid it: We always read/copy before freeing. The order is strictly: allocate → read → free.


Pattern in Practice

Example: Worker Status

getWorkerStatus(): WorkerStatus {
  // 1. Allocate: Rust allocates memory for JSON string
  const ptr = this.lib.symbols.get_worker_status() as Pointer;
  
  // 2. Read: Copy data into JavaScript
  const json = new CString(ptr);        // Rust memory → JS string
  const status = JSON.parse(json);      // JS string → JS object
  
  // 3. Free: Deallocate Rust memory
  this.lib.symbols.free_rust_string(ptr);
  
  // 4. Return: Pure JavaScript object (safe)
  return status;
}

Example: Polling Step Events

pollStepEvents(): FfiStepEvent[] {
  const ptr = this.lib.symbols.poll_step_events() as Pointer;
  
  // Handle null pointer (no events available)
  if (!ptr) {
    return [];
  }
  
  const json = new CString(ptr);
  const events = JSON.parse(json);
  this.lib.symbols.free_rust_string(ptr);
  
  return events;
}

Example: Bootstrap Worker

bootstrapWorker(config: BootstrapConfig): BootstrapResult {
  const configJson = JSON.stringify(config);
  
  // Pass JavaScript data TO Rust (no pointer returned)
  const ptr = this.lib.symbols.bootstrap_worker(configJson) as Pointer;
  
  // Read the result
  const json = new CString(ptr);
  const result = JSON.parse(json);
  
  // Free the result pointer
  this.lib.symbols.free_rust_string(ptr);
  
  return result;
}

Memory Lifetime Diagrams

Successful Pattern

Time →

JavaScript:    [allocate ptr] → [read data] → [free ptr] → [use data]
Rust Memory:   [allocated]    → [allocated] → [freed]    → [freed]
JS Objects:    [none]         → [created]   → [exists]   → [exists]
                                  ↑
                            Data copied here

Memory Leak (Anti-Pattern)

Time →

JavaScript:    [allocate ptr] → [read data] → [use data] → ...
Rust Memory:   [allocated]    → [allocated] → [LEAK]     → [LEAK]
JS Objects:    [none]         → [created]   → [exists]   → [exists]
                                                ↑
                                    Forgot to free! Memory leaked

Use-After-Free (Anti-Pattern)

Time →

JavaScript:    [allocate ptr] → [free ptr] → [read ptr] → CRASH!
Rust Memory:   [allocated]    → [freed]    → [freed]
JS Objects:    [none]         → [none]     → [CORRUPT]
                                              ↑
                                    Reading freed memory!

Best Practices

1. Keep Pointer Lifetime Short

// GOOD: Pointer freed in same scope
const result = this.getWorkerStatus();

// BAD: Don't store pointers
this.statusPtr = this.lib.symbols.get_worker_status();  // Leak risk

2. Always Free in Same Method

// GOOD: Allocate and free in same method
pollStepEvents(): FfiStepEvent[] {
  const ptr = this.lib.symbols.poll_step_events();
  if (!ptr) return [];
  
  const json = new CString(ptr);
  const events = JSON.parse(json);
  this.lib.symbols.free_rust_string(ptr);
  return events;
}

// BAD: Returning pointer for later freeing
getPtrToStatus(): Pointer {
  return this.lib.symbols.get_worker_status();  // Who will free this?
}

3. Handle Null Pointers

// GOOD: Check for null before freeing
const ptr = this.lib.symbols.poll_step_events();
if (!ptr) {
  return [];  // No memory allocated, nothing to free
}

const json = new CString(ptr);
const events = JSON.parse(json);
this.lib.symbols.free_rust_string(ptr);
return events;

4. Document Ownership in Comments

/**
 * Poll for step events from FFI.
 * 
 * MEMORY: This function manages the lifetime of the pointer returned
 * by poll_step_events(). The pointer is freed before returning.
 */
pollStepEvents(): FfiStepEvent[] {
  // ...
}

Testing Memory Safety

Rust Tests

Rust’s test suite can verify FFI functions don’t leak:

#[test]
fn test_status_no_leak() {
    let ptr = get_worker_status();
    assert!(!ptr.is_null());
    
    // Manually free to ensure it works
    free_rust_string(ptr);
    
    // If we had a leak, tools like valgrind or AddressSanitizer
    // would catch it
}

TypeScript Tests

TypeScript tests verify proper usage:

test('status retrieval frees memory', () => {
  const runtime = new BunTaskerRuntime();
  
  // This should not leak - memory freed internally
  const status = runtime.getWorkerStatus();
  
  expect(status.running).toBeDefined();
  
  // Call multiple times to stress test
  for (let i = 0; i < 100; i++) {
    runtime.getWorkerStatus();
  }
  // If we leaked, we'd have 100 leaked strings
});

Leak Detection Tools

  • Valgrind (Linux): Detects memory leaks in Rust code
  • AddressSanitizer: Detects use-after-free and double-free
  • Process memory monitoring: Track RSS growth over time

When in Doubt

Golden Rule: Every *mut c_char pointer returned by a Rust FFI function must have a corresponding free_rust_string() call in the TypeScript code, executed exactly once per pointer, after all reads are complete.

If you see a pattern like:

const ptr = this.lib.symbols.some_function();

Ask yourself:

  1. Does this return a pointer to allocated memory? (Check Rust signature)
  2. Am I reading the data before freeing?
  3. Am I freeing exactly once?
  4. Am I never using ptr after freeing?

If the answer to all is “yes”, you’re following the pattern correctly.


References

  • Rust FFI Guidelines: https://doc.rust-lang.org/nomicon/ffi.html
  • Bun FFI Documentation: https://bun.sh/docs/api/ffi
  • Node.js ffi-napi: https://github.com/node-ffi-napi/node-ffi-napi
  • docs/worker-crates/patterns-and-practices.md: General worker patterns

Worker Crates: Common Patterns and Practices

Last Updated: 2026-01-06 Audience: Developers, Architects Status: Active Related Docs: Worker Event Systems | Worker Actors

<- Back to Worker Crates Overview


This document describes the common patterns and practices shared across all three worker implementations (Rust, Ruby, Python). Understanding these patterns helps developers write consistent handlers regardless of the language.


Architectural Patterns

Dual-Channel Architecture

All workers implement a dual-channel architecture for non-blocking step execution:

┌─────────────────────────────────────────────────────────────────┐
│                    DUAL-CHANNEL PATTERN                         │
└─────────────────────────────────────────────────────────────────┘

    PostgreSQL PGMQ
          │
          ▼
  ┌───────────────────┐
  │  Dispatch Channel │  ──→  Step events flow TO handlers
  └───────────────────┘
          │
          ▼
  ┌───────────────────┐
  │  Handler Execution │  ──→  Business logic runs here
  └───────────────────┘
          │
          ▼
  ┌───────────────────┐
  │ Completion Channel │  ──→  Results flow BACK to orchestration
  └───────────────────┘
          │
          ▼
    Orchestration

Benefits:

  • Fire-and-forget dispatch (non-blocking)
  • Bounded concurrency via semaphores
  • Results processed independently from dispatch
  • Consistent pattern across all languages

Language-Specific Implementations

ComponentRustRubyPython
Dispatch Channelmpsc::channelpoll_step_events FFIpoll_step_events FFI
Completion Channelmpsc::channelcomplete_step_event FFIcomplete_step_event FFI
Concurrency ModelTokio async tasksRuby threads + FFI pollingPython threads + FFI polling
GIL HandlingN/APull-based pollingPull-based polling

Handler Lifecycle

Handler Registration

All implementations follow the same registration pattern:

1. Define handler class/struct
2. Set handler_name identifier
3. Register with HandlerRegistry
4. Handler ready for resolution

Ruby Example:

class ProcessOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Access data via cross-language standard methods
    order_id = context.get_task_field('order_id')

    # Business logic here...

    # Return result using base class helper (keyword args required)
    success(result: { order_id: order_id, status: 'processed' })
  end
end

# Registration
registry = TaskerCore::Registry::HandlerRegistry.instance
registry.register_handler('ProcessOrderHandler', ProcessOrderHandler)

Python Example:

from tasker_core import StepHandler, StepHandlerResult, HandlerRegistry

class ProcessOrderHandler(StepHandler):
    handler_name = "process_order"

    def call(self, context):
        order_id = context.input_data.get("order_id")
        return StepHandlerResult.success_handler_result(
            {"order_id": order_id, "status": "processed"}
        )

# Registration
registry = HandlerRegistry.instance()
registry.register("process_order", ProcessOrderHandler)

Handler Resolution Flow

1. Step event received with handler name
2. Registry.resolve(handler_name) called
3. Handler class instantiated
4. handler.call(context) invoked
5. Result returned to completion channel
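
A rough Python sketch of this flow (the resolve and completion calls follow the steps above and the shared FFI contract; the event attribute names and import path are illustrative):

from tasker_core import HandlerRegistry, complete_step_event  # import path assumed

def dispatch(event):
    registry = HandlerRegistry.instance()
    handler_cls = registry.resolve(event.handler_name)  # 2. look up by name
    handler = handler_cls()                             # 3. instantiate
    result = handler.call(event.context)                # 4. run business logic
    complete_step_event(event.event_id, result)         # 5. report to completion channel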

Handler Context

All handlers receive a context object containing:

FieldDescription
task_uuidUnique identifier for the task
step_uuidUnique identifier for the step
input_dataTask context data passed to the step
dependency_resultsResults from parent/dependency steps
step_configConfiguration from step definition
step_inputsRuntime inputs from workflow_step.inputs
retry_countCurrent retry attempt number
max_retriesMaximum retry attempts allowed

Handler Results

All handlers return a structured result indicating success or failure. However, the APIs differ between Ruby and Python - this is a known design inconsistency that may be addressed in a future ticket.

Ruby - Uses keyword arguments and separate Success/Error types:

# Via base handler shortcuts
success(result: { key: "value" }, metadata: { duration_ms: 150 })

failure(
  message: "Something went wrong",
  error_type: "PermanentError",
  error_code: "VALIDATION_ERROR",  # Ruby has error_code field
  retryable: false,
  metadata: { field: "email" }
)

# Or via type factory methods
TaskerCore::Types::StepHandlerCallResult.success(result: { key: "value" })
TaskerCore::Types::StepHandlerCallResult.error(
  error_type: "PermanentError",
  message: "Error message",
  error_code: "ERR_001"
)

Python - Uses positional/keyword arguments and a single result type:

# Via base handler shortcuts
self.success(result={"key": "value"}, metadata={"duration_ms": 150})

self.failure(
    message="Something went wrong",
    error_type="ValidationError",  # Python has error_type only (no error_code)
    retryable=False,
    metadata={"field": "email"}
)

# Or via class factory methods
StepHandlerResult.success_handler_result(
    {"key": "value"},             # First positional arg is result
    {"duration_ms": 150}          # Second positional arg is metadata
)
StepHandlerResult.failure_handler_result(
    message="Something went wrong",
    error_type="ValidationError",
    retryable=False,
    metadata={"field": "email"}
)

Key Differences:

AspectRubyPython
Factory method names.success(), .error().success_handler_result(), .failure_handler_result()
Result typeSuccess / Error structsSingle StepHandlerResult class
Error code fielderror_code (freeform)Not present
Argument styleKeyword required (result:)Positional allowed

Error Handling

Error Classification

All workers classify errors into two categories:

TypeDescriptionBehavior
RetryableTransient errors that may succeed on retryStep re-enqueued up to max_retries
PermanentUnrecoverable errorsStep marked as failed immediately

HTTP Status Code Classification (ApiHandler)

400, 401, 403, 404, 422  →  Permanent Error (client errors)
429                       →  Retryable Error (rate limiting)
500-599                   →  Retryable Error (server errors)
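
The same classification expressed as a small Python sketch (illustrative only, not the actual ApiHandler internals):

PERMANENT_CLIENT_ERRORS = {400, 401, 403, 404, 422}

def is_retryable(status_code: int) -> bool:
    """Map an HTTP status code onto the retryable/permanent split above."""
    if status_code in PERMANENT_CLIENT_ERRORS:
        return False                      # client errors: fail permanently
    if status_code == 429:
        return True                       # rate limiting: retry with backoff
    if 500 <= status_code <= 599:
        return True                       # server errors: retry
    return False                          # this sketch treats unknown codes as permanent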

Exception Hierarchy

Ruby:

TaskerCore::Error                  # Base class
├── TaskerCore::RetryableError     # Transient failures
├── TaskerCore::PermanentError     # Unrecoverable failures
├── TaskerCore::FFIError           # FFI bridge errors
└── TaskerCore::ConfigurationError # Configuration issues

Python:

TaskerError                        # Base class
├── WorkerNotInitializedError      # Worker not bootstrapped
├── WorkerBootstrapError           # Bootstrap failed
├── WorkerAlreadyRunningError      # Double initialization
├── FFIError                       # FFI bridge errors
├── ConversionError                # Type conversion errors
└── StepExecutionError             # Handler execution errors

Error Context Propagation

All errors should include context for debugging:

StepHandlerResult.failure_handler_result(
    message="Payment gateway timeout",
    error_type="gateway_timeout",
    retryable=True,
    metadata={
        "gateway": "stripe",
        "request_id": "req_xyz",
        "response_time_ms": 30000
    }
)

Polling Architecture

Why Polling?

Ruby and Python workers use a pull-based polling model due to language runtime constraints:

Ruby: The Global VM Lock (GVL) prevents Rust from safely calling Ruby methods from Rust threads. Polling allows Ruby to control thread context.

Python: The Global Interpreter Lock (GIL) has the same limitation. Python must initiate all cross-language calls.

Polling Characteristics

ParameterDefault ValueDescription
Poll Interval10msTime between polls when no events
Max Latency~10msTime from event generation to processing start
Starvation CheckEvery 100 polls (1 second)Detect processing bottlenecks
Cleanup IntervalEvery 1000 polls (10 seconds)Clean up timed-out events

Poll Loop Structure

poll_count = 0

while running:
    poll_count += 1

    # 1. Poll for event
    event = poll_step_events()

    if event:
        # 2. Process event through handler
        process_event(event)
    else:
        # 3. Sleep when no events
        time.sleep(0.01)  # 10ms

    # 4. Periodic maintenance
    if poll_count % 100 == 0:
        check_starvation_warnings()

    if poll_count % 1000 == 0:
        cleanup_timeouts()

FFI Contract

Ruby and Python share the same FFI contract:

FunctionDescription
poll_step_events()Get next pending event (returns None if empty)
complete_step_event(event_id, result)Submit handler result
get_ffi_dispatch_metrics()Get dispatch channel metrics
check_starvation_warnings()Trigger starvation logging
cleanup_timeouts()Clean up timed-out events

Event Bridge Pattern

Overview

All workers implement an EventBridge (pub/sub) pattern for internal coordination:

┌─────────────────────────────────────────────────────────────────┐
│                      EVENT BRIDGE PATTERN                        │
└─────────────────────────────────────────────────────────────────┘

  Publishers                    EventBridge                 Subscribers
  ─────────                    ───────────                 ───────────
  HandlerRegistry  ──publish──→            ──notify──→  StepExecutionSubscriber
  EventPoller      ──publish──→  [Events]  ──notify──→  MetricsCollector
  Worker           ──publish──→            ──notify──→  Custom Subscribers

Standard Event Names

EventDescriptionPayload
handler_registeredHandler added to registry(name, handler_class)
step_execution_receivedStep event receivedFfiStepEvent
step_execution_completedHandler finishedStepHandlerResult
worker_startedWorker bootstrap completeworker_id
worker_stoppedWorker shutdownworker_id

Implementation Libraries

LanguageLibraryPattern
Rubydry-eventsPublisher/Subscriber
PythonpyeeEventEmitter
RustNative channelsmpsc

Usage Example (Python)

from tasker_core import EventBridge, EventNames

bridge = EventBridge.instance()

# Subscribe to events
def on_step_received(event):
    print(f"Processing step {event.step_uuid}")

bridge.subscribe(EventNames.STEP_EXECUTION_RECEIVED, on_step_received)

# Publish events
bridge.publish(EventNames.HANDLER_REGISTERED, "my_handler", MyHandler)

Singleton Pattern

Worker State Management

All workers store global state in a thread-safe singleton:

┌─────────────────────────────────────────────────────────────────┐
│                    SINGLETON WORKER STATE                        │
└─────────────────────────────────────────────────────────────────┘

    Thread-Safe Global
           │
           ▼
    ┌──────────────────┐
    │   WorkerSystem   │
    │  ┌────────────┐  │
    │  │ Mutex/Lock │  │
    │  │  Inner     │  │
    │  │  State     │  │
    │  └────────────┘  │
    └──────────────────┘
           │
           ├──→ HandlerRegistry
           ├──→ EventBridge
           ├──→ EventPoller
           └──→ Configuration

Singleton Classes

LanguageSingleton Implementation
RustOnceLock<Mutex<WorkerSystem>>
RubySingleton module
PythonClass-level _instance with instance() classmethod
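
For illustration, the Python convention follows roughly this shape; this is a sketch of the pattern, not the actual tasker_core classes, which also guard their inner state:

import threading

class WorkerSingleton:  # stand-in for HandlerRegistry, EventBridge, etc.
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def instance(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    @classmethod
    def reset_instance(cls):
        # Used for test isolation, as shown below
        with cls._lock:
            cls._instance = None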

Reset for Testing

All singletons provide reset methods for test isolation:

# Python
HandlerRegistry.reset_instance()
EventBridge.reset_instance()
# Ruby
TaskerCore::Registry::HandlerRegistry.reset_instance!

Observability

Health Checks

All workers expose health information via FFI:

from tasker_core import get_health_check

health = get_health_check()
# Returns: HealthCheck with component statuses

Metrics

Standard metrics available from all workers:

MetricDescription
pending_countEvents awaiting processing
in_flight_countEvents currently being processed
completed_countSuccessfully completed events
failed_countFailed events
starvation_detectedWhether events are timing out
starving_event_countEvents exceeding timeout threshold

Structured Logging

All workers use structured logging with consistent fields:

from tasker_core import log_info, LogContext

context = LogContext(
    correlation_id="abc-123",
    task_uuid="task-456",
    operation="process_order"
)
log_info("Processing order", context)

Specialized Handlers

Handler Type Hierarchy

Ruby (all are subclasses):

TaskerCore::StepHandler::Base
├── TaskerCore::StepHandler::Api        # HTTP/REST API integration
├── TaskerCore::StepHandler::Decision   # Dynamic workflow decisions
└── TaskerCore::StepHandler::Batchable  # Batch processing support

Python (Batchable is a mixin):

StepHandler (ABC)
├── ApiHandler         # HTTP/REST API integration (subclass)
├── DecisionHandler    # Dynamic workflow decisions (subclass)
└── + Batchable        # Batch processing (mixin via multiple inheritance)

ApiHandler

For HTTP API integration with automatic error classification:

class FetchUserHandler(ApiHandler):
    handler_name = "fetch_user"

    def call(self, context):
        response = self.get(f"/users/{context.input_data['user_id']}")
        return self.success(result=response.json())

DecisionHandler

For dynamic workflow routing:

class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
  def call(context)
    amount = context.get_task_field('amount')

    if amount < 1000
      decision_success(steps: ['auto_approve'], result_data: { route: 'auto' })
    else
      decision_success(steps: ['manager_approval', 'finance_review'])
    end
  end
end

Batchable

For processing large datasets in chunks. Note: Ruby uses subclass inheritance, Python uses mixin.

Ruby (subclass):

class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)
    no_op_result = handle_no_op_worker(batch_ctx)
    return no_op_result if no_op_result

    # Process records using batch_ctx.start_cursor, batch_ctx.end_cursor
    batch_worker_complete(processed_count: batch_ctx.batch_size)
  end
end

Python (mixin):

class CsvProcessorHandler(StepHandler, Batchable):
    handler_name = "csv_processor"

    def call(self, context: StepContext) -> StepHandlerResult:
        batch_ctx = self.get_batch_context(context)
        # Process records using batch_ctx.start_cursor, batch_ctx.end_cursor
        return self.batch_worker_success(processed_count=batch_ctx.batch_size)

Checkpoint Yielding

Checkpoint yielding enables batch workers to persist progress and yield control back to the orchestrator for re-dispatch. This is essential for long-running batch operations.

When to Use

  • Processing takes longer than visibility timeout
  • You need resumable processing after failures
  • Long-running operations need progress visibility

Cross-Language API

All Batchable handlers provide checkpoint_yield() (or checkpointYield() in TypeScript):

Ruby:

class MyBatchWorker < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)

    # Resume from checkpoint if present
    start = batch_ctx.has_checkpoint? ? batch_ctx.checkpoint_cursor : 0

    items.each_with_index do |item, idx|
      process_item(item)

      # Checkpoint every 1000 items
      if (idx + 1) % 1000 == 0
        checkpoint_yield(
          cursor: start + idx + 1,
          items_processed: idx + 1,
          accumulated_results: { partial: "data" }
        )
      end
    end

    batch_worker_complete(processed_count: items.size)
  end
end

Python:

class MyBatchWorker(StepHandler, Batchable):
    def call(self, context):
        batch_ctx = self.get_batch_context(context)

        # Resume from checkpoint if present
        start = batch_ctx.checkpoint_cursor if batch_ctx.has_checkpoint() else 0

        for idx, item in enumerate(items):
            self.process_item(item)

            # Checkpoint every 1000 items
            if (idx + 1) % 1000 == 0:
                self.checkpoint_yield(
                    cursor=start + idx + 1,
                    items_processed=idx + 1,
                    accumulated_results={"partial": "data"}
                )

        return self.batch_worker_success(processed_count=len(items))

TypeScript:

class MyBatchWorker extends BatchableHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    const batchCtx = this.getBatchContext(context);

    // Resume from checkpoint if present
    const start = batchCtx.hasCheckpoint() ? batchCtx.checkpointCursor : 0;

    for (let idx = 0; idx < items.length; idx++) {
      await this.processItem(items[idx]);

      // Checkpoint every 1000 items
      if ((idx + 1) % 1000 === 0) {
        await this.checkpointYield({
          cursor: start + idx + 1,
          itemsProcessed: idx + 1,
          accumulatedResults: { partial: "data" }
        });
      }
    }

    return this.batchWorkerSuccess({ processedCount: items.length });
  }
}

BatchWorkerContext Checkpoint Accessors

All languages provide consistent accessors for checkpoint data:

AccessorRubyPythonTypeScript
Cursor positioncheckpoint_cursorcheckpoint_cursorcheckpointCursor
Accumulated dataaccumulated_resultsaccumulated_resultsaccumulatedResults
Has checkpoint?has_checkpoint?has_checkpoint()hasCheckpoint()
Items processedcheckpoint_items_processedcheckpoint_items_processedcheckpointItemsProcessed

FFI Contract

FunctionDescription
checkpoint_yield_step_event(event_id, data)Persist checkpoint and re-dispatch step

Key Invariants

  1. Checkpoint-Persist-Then-Redispatch: Progress saved before re-dispatch
  2. Step Stays InProgress: No state machine transitions during yield
  3. Handler-Driven: Handlers decide when to checkpoint

See Batch Processing Guide - Checkpoint Yielding for comprehensive documentation.


Best Practices

1. Keep Handlers Focused

Each handler should do one thing well:

  • Validate input
  • Perform single operation
  • Return clear result

2. Use Error Classification

Always specify whether errors are retryable:

# Good - clear error classification
return self.failure("API rate limit", retryable=True)

# Bad - ambiguous error handling
raise Exception("API error")

3. Include Context in Errors

return StepHandlerResult.failure_handler_result(
    message="Database connection failed",
    error_type="database_error",
    retryable=True,
    metadata={
        "host": "db.example.com",
        "port": 5432,
        "connection_timeout_ms": 5000
    }
)

4. Use Structured Logging

log_info("Order processed", {
    "order_id": order_id,
    "total": total,
    "items_count": len(items)
})

5. Test Handler Isolation

Reset singletons between tests:

def setup_method(self):
    HandlerRegistry.reset_instance()
    EventBridge.reset_instance()

See Also

Python Worker

Last Updated: 2026-01-01 Audience: Python Developers Status: Active Package: tasker_core Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix

<- Back to Worker Crates Overview


The Python worker provides a package-based interface for integrating tasker-core workflow execution into Python applications. It supports both standalone server deployment and headless embedding in existing codebases.

Quick Start

Installation

cd workers/python
uv sync                    # Install dependencies
uv run maturin develop     # Build FFI extension

Running the Server

python bin/server.py

Environment Variables

VariableDescriptionDefault
DATABASE_URLPostgreSQL connection stringRequired
TASKER_ENVEnvironment (test/development/production)development
TASKER_CONFIG_PATHPath to TOML configurationAuto-detected
TASKER_TEMPLATE_PATHPath to task templatesAuto-detected
PYTHON_HANDLER_PATHPath for handler auto-discoveryNot set
RUST_LOGLog level (trace/debug/info/warn/error)info

Architecture

Server Mode

Location: workers/python/bin/server.py

The server bootstraps the Rust foundation and manages Python handler execution:

from tasker_core import (
    bootstrap_worker,
    EventBridge,
    EventPoller,
    HandlerRegistry,
    StepExecutionSubscriber,
)

# Bootstrap Rust worker foundation
result = bootstrap_worker(config)

# Start event dispatch system
event_bridge = EventBridge.instance()
event_bridge.start()

# Create step execution subscriber
handler_registry = HandlerRegistry.instance()
step_subscriber = StepExecutionSubscriber(
    event_bridge=event_bridge,
    handler_registry=handler_registry,
    worker_id="python-worker-001"
)
step_subscriber.start()

# Start event poller (10ms polling)
event_poller = EventPoller(polling_interval_ms=10)
event_poller.on_step_event(lambda e: event_bridge.publish("step_execution_received", e))
event_poller.start()

# Wait for shutdown signal
shutdown_event.wait()

# Graceful shutdown
event_poller.stop()
step_subscriber.stop()
event_bridge.stop()
stop_worker()

Headless/Embedded Mode

For embedding in existing Python applications:

from tasker_core import (
    bootstrap_worker,
    HandlerRegistry,
    EventBridge,
    EventPoller,
    StepExecutionSubscriber,
)
from tasker_core.types import BootstrapConfig

# Bootstrap worker (headless mode controlled via TOML: web.enabled = false)
config = BootstrapConfig(namespace="my-app")
bootstrap_worker(config)

# Register handlers
registry = HandlerRegistry.instance()
registry.register("process_data", ProcessDataHandler)

# Start event dispatch (required for embedded usage)
bridge = EventBridge.instance()
bridge.start()

subscriber = StepExecutionSubscriber(bridge, registry, "embedded-worker")
subscriber.start()

poller = EventPoller()
poller.on_step_event(lambda e: bridge.publish("step_execution_received", e))
poller.start()

FFI Bridge

Python communicates with the Rust foundation via FFI polling:

┌────────────────────────────────────────────────────────────────┐
│                    PYTHON FFI BRIDGE                            │
└────────────────────────────────────────────────────────────────┘

   Rust Worker System
          │
          │ FFI (poll_step_events)
          ▼
   ┌─────────────────────┐
   │    EventPoller      │
   │  (daemon thread)    │──→ poll every 10ms
   └─────────────────────┘
          │
          │ publish to EventBridge
          ▼
   ┌─────────────────────┐
   │ StepExecution       │
   │ Subscriber          │──→ route to handler
   └─────────────────────┘
          │
          │ handler.call(context)
          ▼
   ┌─────────────────────┐
   │  Handler Execution  │
   └─────────────────────┘
          │
          │ FFI (complete_step_event)
          ▼
   Rust Completion Channel

Handler Development

Base Handler (ABC)

Location: python/tasker_core/step_handler/base.py

All handlers inherit from StepHandler:

from tasker_core import StepHandler, StepContext, StepHandlerResult

class ProcessOrderHandler(StepHandler):
    handler_name = "process_order"
    handler_version = "1.0.0"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Access input data
        order_id = context.input_data.get("order_id")
        amount = context.input_data.get("amount")

        # Business logic
        result = self.process_order(order_id, amount)

        # Return success
        return self.success(result={
            "order_id": order_id,
            "status": "processed",
            "total": result["total"]
        })

Handler Signature

def call(self, context: StepContext) -> StepHandlerResult:
    # context.task_uuid       - Task identifier
    # context.step_uuid       - Step identifier
    # context.input_data      - Task context data
    # context.dependency_results - Results from parent steps
    # context.step_config     - Handler configuration
    # context.step_inputs     - Runtime inputs
    # context.retry_count     - Current retry attempt
    # context.max_retries     - Maximum retry attempts

Result Methods

# Success result (from base class)
return self.success(
    result={"key": "value"},
    metadata={"duration_ms": 100}
)

# Failure result (from base class)
return self.failure(
    message="Payment declined",
    error_type="payment_error",
    retryable=True,
    metadata={"card_last_four": "1234"}
)

# Or using factory methods
from tasker_core import StepHandlerResult

return StepHandlerResult.success_handler_result(
    {"key": "value"},
    {"duration_ms": 100}
)

return StepHandlerResult.failure_handler_result(
    message="Error",
    error_type="validation_error",
    retryable=False
)

Accessing Dependencies

def call(self, context: StepContext) -> StepHandlerResult:
    # Get result from a dependency step
    validation = context.dependency_results.get("validate_order", {})

    if validation.get("valid"):
        # Process with validated data
        return self.success(result={"processed": True})

    return self.failure("Validation failed", retryable=False)

Composition Pattern

Python handlers compose capabilities via mixins (multiple inheritance) rather than relying on a single deep inheritance chain.

from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin, DecisionMixin

class MyHandler(StepHandler, APIMixin, DecisionMixin):
    handler_name = "my_handler"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Has both API methods (get, post, put, delete)
        # And Decision methods (decision_success, skip_branches)
        response = self.get("/api/endpoint")
        return self.decision_success(["next_step"], response)

Available Mixins

| Mixin | Location | Methods Provided |
|-------|----------|------------------|
| APIMixin | mixins/api.py | get, post, put, delete, http_client |
| DecisionMixin | mixins/decision.py | decision_success, skip_branches, decision_failure |
| BatchableMixin | (base class) | get_batch_context, handle_no_op_worker, create_cursor_configs |

Using Wrapper Classes (Backward Compatible)

The wrapper classes delegate to mixins internally:

# These are equivalent:
class MyHandler(ApiHandler):
    # Inherits API methods via APIMixin internally
    pass

class MyHandler(StepHandler, APIMixin):
    # Explicit mixin inclusion
    pass

Specialized Handlers

API Handler

Location: python/tasker_core/step_handler/api.py

For HTTP API integration with automatic error classification:

from tasker_core.step_handler import ApiHandler

class FetchUserHandler(ApiHandler):
    handler_name = "fetch_user"
    base_url = "https://api.example.com"

    def call(self, context: StepContext) -> StepHandlerResult:
        user_id = context.input_data["user_id"]

        # Automatic error classification
        response = self.get(f"/users/{user_id}")

        return self.api_success(response)

HTTP Methods:

# GET request
response = self.get("/path", params={"key": "value"}, headers={})

# POST request
response = self.post("/path", data={"key": "value"}, headers={})

# PUT request
response = self.put("/path", data={"key": "value"}, headers={})

# DELETE request
response = self.delete("/path", params={}, headers={})

ApiResponse Properties:

response.status_code     # HTTP status code
response.headers         # Response headers
response.body            # Parsed body (dict or str)
response.ok              # True if 2xx status
response.is_client_error # True if 4xx status
response.is_server_error # True if 5xx status
response.is_retryable    # True if should retry (408, 429, 500-504)
response.retry_after     # Retry-After header value in seconds

Error Classification:

| Status | Classification | Behavior |
|--------|----------------|----------|
| 400, 401, 403, 404, 422 | Non-retryable | Permanent failure |
| 408, 429, 500-504 | Retryable | Standard retry |
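
If a handler needs to branch on status itself rather than rely on the automatic classification, the ApiResponse flags shown above map directly onto failure results. A minimal sketch, assuming the request helper returns the ApiResponse rather than raising; the handler and endpoint here are hypothetical:

from tasker_core import StepContext, StepHandlerResult
from tasker_core.step_handler import ApiHandler

class ChargeCardHandler(ApiHandler):
    # Hypothetical handler used only to illustrate manual classification
    handler_name = "charge_card"
    base_url = "https://payments.example.com"

    def call(self, context: StepContext) -> StepHandlerResult:
        response = self.post("/charges", data=context.input_data)

        if response.ok:
            return self.api_success(response)

        # Mirror the classification table: 408/429/500-504 retry, other 4xx fail permanently
        return self.failure(
            message=f"Charge failed with HTTP {response.status_code}",
            error_type="api_error",
            retryable=response.is_retryable,
            metadata={"retry_after": response.retry_after},
        )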

Decision Handler

Location: python/tasker_core/step_handler/decision.py

For dynamic workflow routing:

from tasker_core.step_handler import DecisionHandler
from tasker_core import DecisionPointOutcome

class RoutingDecisionHandler(DecisionHandler):
    handler_name = "routing_decision"

    def call(self, context: StepContext) -> StepHandlerResult:
        amount = context.input_data.get("amount", 0)

        if amount < 1000:
            # Auto-approve small amounts
            outcome = DecisionPointOutcome.create_steps(
                ["auto_approve"],
                routing_context={"route_type": "auto"}
            )
            return self.decision_success(outcome)

        elif amount < 5000:
            # Manager approval for medium amounts
            outcome = DecisionPointOutcome.create_steps(
                ["manager_approval"],
                routing_context={"route_type": "manager"}
            )
            return self.decision_success(outcome)

        else:
            # Dual approval for large amounts
            outcome = DecisionPointOutcome.create_steps(
                ["manager_approval", "finance_review"],
                routing_context={"route_type": "dual"}
            )
            return self.decision_success(outcome)

Decision Methods:

# Create steps
outcome = DecisionPointOutcome.create_steps(
    step_names=["step1", "step2"],
    routing_context={"key": "value"}
)
return self.decision_success(outcome)

# No branches needed
outcome = DecisionPointOutcome.no_branches(reason="condition not met")
return self.decision_no_branches(outcome)

Batchable Mixin

Location: python/tasker_core/batch_processing/

For processing large datasets in chunks. Both analyzer and worker handlers implement the standard call() method:

Analyzer Handler (creates batch configurations):

from tasker_core import StepHandler, StepHandlerResult
from tasker_core.batch_processing import Batchable

class CsvAnalyzerHandler(StepHandler, Batchable):
    handler_name = "csv_analyzer"

    def call(self, context: StepContext) -> StepHandlerResult:
        """Analyze CSV and create batch worker configurations."""
        csv_path = context.input_data["csv_path"]
        row_count = count_csv_rows(csv_path)

        if row_count == 0:
            # No data to process
            return self.batch_analyzer_success(
                cursor_configs=[],
                total_items=0,
                batch_metadata={"csv_path": csv_path}
            )

        # Create cursor ranges for batch workers
        cursor_configs = self.create_cursor_ranges(
            total_items=row_count,
            batch_size=100,
            max_batches=5
        )

        return self.batch_analyzer_success(
            cursor_configs=cursor_configs,
            total_items=row_count,
            worker_template_name="process_csv_batch",
            batch_metadata={"csv_path": csv_path}
        )

Worker Handler (processes a batch):

class CsvBatchProcessorHandler(StepHandler, Batchable):
    handler_name = "csv_batch_processor"

    def call(self, context: StepContext) -> StepHandlerResult:
        """Process a batch of CSV rows."""
        # Get cursor config from step_inputs
        step_inputs = context.step_inputs or {}

        # Check for no-op placeholder batch
        if step_inputs.get("is_no_op"):
            return self.batch_worker_success(
                items_processed=0,
                items_succeeded=0,
                metadata={"no_op": True}
            )

        cursor = step_inputs.get("cursor", {})
        start_cursor = cursor.get("start_cursor", 0)
        end_cursor = cursor.get("end_cursor", 0)

        # Get CSV path from analyzer result
        analyzer_result = context.get_dependency_result("analyze_csv")
        csv_path = analyzer_result["batch_metadata"]["csv_path"]

        # Process the batch
        results = process_csv_batch(csv_path, start_cursor, end_cursor)

        return self.batch_worker_success(
            items_processed=results["count"],
            items_succeeded=results["success"],
            items_failed=results["failed"],
            results=results["data"],
            last_cursor=end_cursor
        )

Batchable Helper Methods:

# Analyzer helpers
self.create_cursor_ranges(total_items, batch_size, max_batches)
self.batch_analyzer_success(cursor_configs, total_items, worker_template_name, batch_metadata)

# Worker helpers
self.batch_worker_success(items_processed, items_succeeded, items_failed, results, last_cursor, metadata)
self.get_batch_context(context)  # Returns BatchWorkerContext or None

# Aggregator helpers
self.aggregate_worker_results(worker_results)  # Returns aggregated counts

Handler Registry

Registration

Location: python/tasker_core/handler.py

from tasker_core import HandlerRegistry

registry = HandlerRegistry.instance()

# Manual registration
registry.register("process_order", ProcessOrderHandler)

# Check if registered
registry.is_registered("process_order")  # True

# Resolve and instantiate
handler = registry.resolve("process_order")
result = handler.call(context)

# List all handlers
registry.list_handlers()  # ["process_order", ...]

# Handler count
registry.handler_count()  # 1

Auto-Discovery

# Discover handlers from a package
count = registry.discover_handlers("myapp.handlers")
print(f"Discovered {count} handlers")

Handlers are discovered by (see the sketch after this list):

  1. Scanning the package for classes inheriting from StepHandler
  2. Using the handler_name class attribute for registration
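
A conceptual sketch of that scan, assuming standard importlib/pkgutil tooling; the real implementation lives inside tasker_core and the function name here is illustrative:

import importlib
import inspect
import pkgutil

from tasker_core import StepHandler

def find_handlers(package_name: str) -> dict[str, type]:
    """Walk a package and collect StepHandler subclasses keyed by handler_name."""
    package = importlib.import_module(package_name)
    found: dict[str, type] = {}

    for module_info in pkgutil.walk_packages(package.__path__, prefix=f"{package_name}."):
        module = importlib.import_module(module_info.name)
        for _, cls in inspect.getmembers(module, inspect.isclass):
            name = getattr(cls, "handler_name", None)
            if issubclass(cls, StepHandler) and cls is not StepHandler and name:
                found[name] = cls

    return found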

Type System

Pydantic Models

Python types use Pydantic for validation:

from tasker_core import StepContext, StepHandlerResult, FfiStepEvent

# StepContext - validated from FFI event
context = StepContext.from_ffi_event(event, "handler_name")
context.task_uuid      # UUID
context.step_uuid      # UUID
context.input_data     # dict
context.retry_count    # int

# StepHandlerResult - structured result
result = StepHandlerResult.success_handler_result({"key": "value"})
result.success         # True
result.result          # {"key": "value"}
result.error_message   # None

Configuration Types

from tasker_core.types import BootstrapConfig, CursorConfig

# Bootstrap configuration
# Note: Headless mode is controlled via TOML config (web.enabled = false)
config = BootstrapConfig(
    namespace="my-app",
    log_level="info"
)

# Cursor configuration for batch processing
cursor = CursorConfig(
    batch_size=100,
    start_cursor=0,
    end_cursor=1000
)

Event System

EventBridge

Location: python/tasker_core/event_bridge.py

from tasker_core import EventBridge, EventNames

bridge = EventBridge.instance()

# Start the event system
bridge.start()

# Subscribe to events
def on_step_received(event):
    print(f"Processing step: {event.step_uuid}")

bridge.subscribe(EventNames.STEP_EXECUTION_RECEIVED, on_step_received)

# Publish events
bridge.publish(EventNames.HANDLER_REGISTERED, "my_handler", MyHandler)

# Stop when done
bridge.stop()

Event Names

from tasker_core import EventNames

EventNames.STEP_EXECUTION_RECEIVED  # Step event received from FFI
EventNames.STEP_COMPLETION_SENT     # Handler result sent to FFI
EventNames.HANDLER_REGISTERED       # Handler registered
EventNames.HANDLER_ERROR            # Handler execution error
EventNames.POLLER_METRICS           # FFI dispatch metrics update
EventNames.POLLER_ERROR             # Poller encountered an error

EventPoller

Location: python/tasker_core/event_poller.py

from tasker_core import EventPoller

poller = EventPoller(
    polling_interval_ms=10,       # Poll every 10ms
    starvation_check_interval=100, # Check every 1 second
    cleanup_interval=1000          # Cleanup every 10 seconds
)

# Register callbacks
poller.on_step_event(handle_step)
poller.on_metrics(handle_metrics)
poller.on_error(handle_error)

# Start polling (daemon thread)
poller.start()

# Get metrics
metrics = poller.get_metrics()
print(f"Pending: {metrics.pending_count}")

# Stop polling
poller.stop(timeout=5.0)

Domain Events

Python has full domain event support with lifecycle hooks matching Ruby and TypeScript capabilities.

Location: python/tasker_core/domain_events.py

BasePublisher

Publishers transform step execution context into domain-specific events:

from tasker_core.domain_events import BasePublisher, StepEventContext, DomainEvent

class PaymentEventPublisher(BasePublisher):
    publisher_name = "payment_events"

    def publishes_for(self) -> list[str]:
        """Which steps trigger this publisher."""
        return ["process_payment", "refund_payment"]

    async def transform_payload(self, ctx: StepEventContext) -> dict:
        """Transform step context into domain event payload."""
        return {
            "payment_id": ctx.result.get("payment_id"),
            "amount": ctx.result.get("amount"),
            "currency": ctx.result.get("currency"),
            "status": ctx.result.get("status")
        }

    # Lifecycle hooks (optional)
    async def before_publish(self, ctx: StepEventContext) -> None:
        """Called before publishing."""
        print(f"Publishing payment event for step: {ctx.step_name}")

    async def after_publish(self, ctx: StepEventContext, event: DomainEvent) -> None:
        """Called after successful publish."""
        print(f"Published event: {event.event_name}")

    async def on_publish_error(self, ctx: StepEventContext, error: Exception) -> None:
        """Called on publish failure."""
        print(f"Failed to publish: {error}")

    async def additional_metadata(self, ctx: StepEventContext) -> dict:
        """Inject custom metadata."""
        return {"payment_processor": "stripe"}

BaseSubscriber

Subscribers react to domain events matching specific patterns:

from tasker_core.domain_events import BaseSubscriber, InProcessDomainEvent, SubscriberResult

class AuditLoggingSubscriber(BaseSubscriber):
    subscriber_name = "audit_logger"

    def subscribes_to(self) -> list[str]:
        """Which events to handle (glob patterns supported)."""
        return ["payment.*", "order.completed"]

    async def handle(self, event: InProcessDomainEvent) -> SubscriberResult:
        """Handle matching events."""
        await self.log_to_audit_trail(event)
        return SubscriberResult(success=True)

    # Lifecycle hooks (optional)
    async def before_handle(self, event: InProcessDomainEvent) -> None:
        """Called before handling."""
        print(f"Handling: {event.event_name}")

    async def after_handle(self, event: InProcessDomainEvent, result: SubscriberResult) -> None:
        """Called after handling."""
        print(f"Handled successfully: {result.success}")

    async def on_handle_error(self, event: InProcessDomainEvent, error: Exception) -> None:
        """Called on handler failure."""
        print(f"Handler error: {error}")

Registries

Manage publishers and subscribers with singleton registries:

from tasker_core.domain_events import PublisherRegistry, SubscriberRegistry

# Publisher Registry
pub_registry = PublisherRegistry.instance()
pub_registry.register(PaymentEventPublisher)
pub_registry.register(OrderEventPublisher)

# Get publisher for a step
publisher = pub_registry.get_for_step("process_payment")

# Subscriber Registry
sub_registry = SubscriberRegistry.instance()
sub_registry.register(AuditLoggingSubscriber)
sub_registry.register(MetricsSubscriber)

# Start all subscribers
sub_registry.start_all()

# Stop all subscribers
sub_registry.stop_all()

Signal Handling

The Python worker handles signals for graceful shutdown:

| Signal | Behavior |
|--------|----------|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |
| SIGUSR1 | Report worker status |

import signal

def handle_shutdown(signum, frame):
    print("Shutting down...")
    shutdown_event.set()

signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)
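
The SIGUSR1 status report from the table above is not shown in the snippet. A minimal sketch, assuming the server keeps a reference to the event_poller created earlier and reuses the get_metrics() accessor from the EventPoller section:

def handle_status(signum, frame):
    # Illustrative status report; pending_count comes from the poller metrics shown above
    metrics = event_poller.get_metrics()
    print(f"Worker status: pending={metrics.pending_count}")

signal.signal(signal.SIGUSR1, handle_status)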

Error Handling

Exception Classes

from tasker_core import (
    TaskerError,              # Base class
    WorkerNotInitializedError,
    WorkerBootstrapError,
    WorkerAlreadyRunningError,
    FFIError,
    ConversionError,
    StepExecutionError,
)

Using StepExecutionError

from tasker_core import StepExecutionError

def call(self, context):
    # Retryable error
    raise StepExecutionError(
        "Database connection timeout",
        error_type="database_error",
        retryable=True
    )

    # Non-retryable error
    raise StepExecutionError(
        "Invalid input format",
        error_type="validation_error",
        retryable=False
    )

Logging

Structured Logging

from tasker_core import log_info, log_error, log_warn, log_debug, LogContext

# Simple logging
log_info("Processing started")
log_error("Failed to connect")

# With context dict
log_info("Order processed", {
    "order_id": "123",
    "amount": "100.00"
})

# With LogContext model
context = LogContext(
    correlation_id="abc-123",
    task_uuid="task-456",
    operation="process_order"
)
log_info("Processing", context)

File Structure

workers/python/
├── bin/
│   └── server.py              # Production server
├── python/
│   └── tasker_core/
│       ├── __init__.py        # Package exports
│       ├── handler.py         # Handler registry
│       ├── event_bridge.py    # Event pub/sub
│       ├── event_poller.py    # FFI polling
│       ├── logging.py         # Structured logging
│       ├── types.py           # Pydantic models
│       ├── step_handler/
│       │   ├── __init__.py
│       │   ├── base.py        # Base handler ABC
│       │   ├── api.py         # API handler
│       │   └── decision.py    # Decision handler
│       ├── batch_processing/
│       │   └── __init__.py    # Batchable mixin
│       └── step_execution_subscriber.py
├── src/                       # Rust/PyO3 extension
├── tests/
│   ├── test_step_handler.py
│   ├── test_module_exports.py
│   └── handlers/examples/
├── pyproject.toml
└── uv.lock

Testing

Unit Tests

cd workers/python
uv run pytest tests/

With Coverage

uv run pytest tests/ --cov=tasker_core

Type Checking

uv run mypy python/tasker_core/

Linting

uv run ruff check python/

Example Handlers

Linear Workflow

from datetime import datetime

class LinearStep1Handler(StepHandler):
    handler_name = "linear_step_1"

    def call(self, context: StepContext) -> StepHandlerResult:
        return self.success(result={
            "step1_processed": True,
            "input_received": context.input_data,
            "processed_at": datetime.now().isoformat()
        })

Data Processing

class TransformDataHandler(StepHandler):
    handler_name = "transform_data"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Get raw data from dependency
        raw_data = context.dependency_results.get("fetch_data", {})

        # Transform
        transformed = [
            {"id": item["id"], "value": item["raw_value"] * 2}
            for item in raw_data.get("items", [])
        ]

        return self.success(result={
            "items": transformed,
            "count": len(transformed)
        })

Conditional Approval

class ApprovalRouterHandler(DecisionHandler):
    handler_name = "approval_router"

    THRESHOLDS = {
        "auto": 1000,
        "manager": 5000
    }

    def call(self, context: StepContext) -> StepHandlerResult:
        amount = context.input_data.get("amount", 0)

        if amount < self.THRESHOLDS["auto"]:
            outcome = DecisionPointOutcome.create_steps(["auto_approve"])
        elif amount < self.THRESHOLDS["manager"]:
            outcome = DecisionPointOutcome.create_steps(["manager_approval"])
        else:
            outcome = DecisionPointOutcome.create_steps(
                ["manager_approval", "finance_review"]
            )

        return self.decision_success(outcome)

See Also

Ruby Worker

Last Updated: 2026-01-01 Audience: Ruby Developers Status: Active Package: tasker_core (gem) Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix <- Back to Worker Crates Overview


The Ruby worker provides a gem-based interface for integrating tasker-core workflow execution into Ruby applications. It supports both standalone server deployment and headless embedding in Rails applications.

Quick Start

Installation

cd workers/ruby
bundle install
bundle exec rake compile  # Compile FFI extension

Running the Server

./bin/server.rb

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| RUBY_GC_HEAP_GROWTH_FACTOR | GC tuning for production | Ruby default |

Architecture

Server Mode

Location: workers/ruby/bin/server.rb

The server bootstraps the Rust foundation and manages Ruby handler execution:

# Bootstrap the worker system
bootstrap = TaskerCore::Worker::Bootstrap.start!

# Signal handlers for graceful shutdown
Signal.trap('TERM') { shutdown_event.set }
Signal.trap('INT') { shutdown_event.set }

# Main loop with health checks
loop do
  break if shutdown_event.set?
  sleep(1)
end

# Graceful shutdown
bootstrap.shutdown!

Headless/Embedded Mode

For embedding in Rails applications without an HTTP server:

# config/initializers/tasker.rb
require 'tasker_core'

# Bootstrap worker (headless mode controlled via TOML: web.enabled = false)
TaskerCore::Worker::Bootstrap.start!

# Register application handlers
TaskerCore::Registry::HandlerRegistry.instance.register_handler(
  'ProcessOrderHandler',
  ProcessOrderHandler
)

FFI Bridge

Ruby communicates with the Rust foundation via FFI polling:

┌────────────────────────────────────────────────────────────────┐
│                    RUBY FFI BRIDGE                              │
└────────────────────────────────────────────────────────────────┘

   Rust Worker System
          │
          │ FFI (poll_step_events)
          ▼
   ┌─────────────┐
   │   Ruby      │
   │   Thread    │──→ poll every 10ms
   └─────────────┘
          │
          ▼
   ┌─────────────┐
   │  Handler    │
   │  Execution  │──→ handler.call(context)
   └─────────────┘
          │
          │ FFI (complete_step_event)
          ▼
   Rust Completion Channel

Handler Development

Base Handler

Location: lib/tasker_core/step_handler/base.rb

All handlers inherit from TaskerCore::StepHandler::Base:

class ProcessOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Access task context via cross-language standard methods
    order_id = context.get_task_field('order_id')
    amount = context.get_task_field('amount')

    # Business logic
    result = process_order(order_id, amount)

    # Return success result
    success(result: {
      order_id: order_id,
      status: 'processed',
      total: result[:total]
    })
  end
end

Handler Signature

def call(context)
  # context - StepContext with cross-language standard fields:
  #   context.task_uuid       - Task UUID
  #   context.step_uuid       - Step UUID
  #   context.input_data      - Step inputs from workflow_step.inputs
  #   context.step_config     - Handler config from step_definition
  #   context.retry_count     - Current retry attempt
  #   context.max_retries     - Maximum retry attempts
  #   context.get_task_field('field')       - Get field from task context
  #   context.get_dependency_result('step') - Get result from parent step
end

Result Methods

# Success result (keyword arguments required)
success(
  result: { key: 'value' },
  metadata: { duration_ms: 100 }
)

# Failure result
# error_type must be one of: 'PermanentError', 'RetryableError',
# 'ValidationError', 'UnexpectedError', 'StepCompletionError'
failure(
  message: 'Payment declined',
  error_type: 'PermanentError',   # Use enum value, not freeform string
  error_code: 'PAYMENT_DECLINED', # Optional freeform error code
  retryable: false,
  metadata: { card_last_four: '1234' }
)

Accessing Dependencies

def call(context)
  # Get result from a dependency step
  validation_result = context.get_dependency_result('validate_order')

  if validation_result && validation_result['valid']
    # Process with validated data
  end
end

Composition Pattern

Ruby handlers use composition via mixins rather than inheritance. You can use either:

  1. Wrapper classes (Api, Decision, Batchable) - simpler, backward compatible
  2. Mixin modules (Mixins::API, Mixins::Decision, Mixins::Batchable) - explicit composition

class MyHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::Mixins::API
  include TaskerCore::StepHandler::Mixins::Decision

  def call(context)
    # Has both API methods (get, post, put, delete)
    # And Decision methods (decision_success, decision_no_branches)
    response = get('/api/endpoint')
    decision_success(steps: ['next_step'], result_data: response)
  end
end

Available Mixins

| Mixin | Location | Methods Provided |
|-------|----------|------------------|
| Mixins::API | mixins/api.rb | get, post, put, delete, connection |
| Mixins::Decision | mixins/decision.rb | decision_success, decision_no_branches, skip_branches |
| Mixins::Batchable | mixins/batchable.rb | get_batch_context, handle_no_op_worker, create_cursor_configs |

Using Wrapper Classes (Backward Compatible)

The wrapper classes delegate to mixins internally:

# These are equivalent:
class MyHandler < TaskerCore::StepHandler::Api
  # Inherits API methods via Mixins::API
end

class MyHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::Mixins::API
  # Explicit mixin inclusion
end

Specialized Handlers

API Handler

Location: lib/tasker_core/step_handler/api.rb

For HTTP API integration with automatic error classification:

class FetchUserHandler < TaskerCore::StepHandler::Api
  def call(context)
    user_id = context.get_task_field('user_id')

    # Automatic error classification (429 → retryable, 404 → permanent)
    response = connection.get("/users/#{user_id}")
    process_response(response)  # Raises on errors, returns response on success

    # Return success result with response data
    success(result: response.body)
  end

  # Optional: Custom connection configuration
  def configure_connection
    Faraday.new(base_url) do |conn|
      conn.request :json
      conn.response :json
      conn.options.timeout = 30
    end
  end
end

HTTP Methods Available:

  • get(path, params: {}, headers: {})
  • post(path, data: {}, headers: {})
  • put(path, data: {}, headers: {})
  • delete(path, params: {}, headers: {})

Error Classification:

| Status | Classification | Behavior |
|--------|----------------|----------|
| 400, 401, 403, 404, 422 | Permanent | No retry |
| 429 | Retryable | Respect Retry-After |
| 500-599 | Retryable | Standard backoff |

Decision Handler

Location: lib/tasker_core/step_handler/decision.rb

For dynamic workflow routing:

class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
  def call(context)
    amount = context.get_task_field('amount')

    if amount < 1000
      # Auto-approve small amounts
      decision_success(
        steps: ['auto_approve'],
        result_data: { route_type: 'auto', amount: amount }
      )
    elsif amount < 5000
      # Manager approval for medium amounts
      decision_success(
        steps: ['manager_approval'],
        result_data: { route_type: 'manager', amount: amount }
      )
    else
      # Dual approval for large amounts
      decision_success(
        steps: ['manager_approval', 'finance_review'],
        result_data: { route_type: 'dual', amount: amount }
      )
    end
  end
end

Decision Methods:

  • decision_success(steps:, result_data: {}) - Create steps dynamically
  • decision_no_branches(result_data: {}) - Skip conditional steps

Batchable Handler

Location: lib/tasker_core/step_handler/batchable.rb

For processing large datasets in chunks:

Breaking Change: Cursors are now 0-indexed (previously 1-indexed) to match Python, TypeScript, and Rust.

class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
  def call(context)
    # Extract batch context from step inputs
    batch_ctx = get_batch_context(context)

    # Handle no-op placeholder batches
    no_op_result = handle_no_op_worker(batch_ctx)
    return no_op_result if no_op_result

    # Process this batch
    csv_file = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
    records = read_csv_batch(csv_file, batch_ctx.start_cursor, batch_ctx.batch_size)

    processed = records.map { |record| transform_record(record) }

    # Return batch completion
    batch_worker_complete(
      processed_count: processed.size,
      result_data: { records: processed }
    )
  end
end

Batch Helper Methods:

  • get_batch_context(context) - Get batch boundaries from StepContext
  • handle_no_op_worker(batch_ctx) - Handle placeholder batches
  • batch_worker_complete(processed_count:, result_data:) - Complete batch
  • create_cursor_configs(total_items, worker_count) - Create 0-indexed cursor ranges

Cursor Indexing:

# Creates 0-indexed cursor ranges
configs = create_cursor_configs(1000, 5)
# => [
#   { batch_id: '1', start_cursor: 0, end_cursor: 200 },
#   { batch_id: '2', start_cursor: 200, end_cursor: 400 },
#   { batch_id: '3', start_cursor: 400, end_cursor: 600 },
#   { batch_id: '4', start_cursor: 600, end_cursor: 800 },
#   { batch_id: '5', start_cursor: 800, end_cursor: 1000 }
# ]

Handler Registry

Registration

Location: lib/tasker_core/registry/handler_registry.rb

registry = TaskerCore::Registry::HandlerRegistry.instance

# Manual registration
registry.register_handler('ProcessOrderHandler', ProcessOrderHandler)

# Check availability
registry.handler_available?('ProcessOrderHandler')  # => true

# List all handlers
registry.registered_handlers  # => ["ProcessOrderHandler", ...]

Discovery Modes

  1. Preloaded Handlers (Test environment)

    • ObjectSpace scanning for loaded handler classes
  2. Template-Driven Discovery

    • YAML templates define handler references
    • Handlers loaded from configured paths

Handler Search Paths

app/handlers/
lib/handlers/
handlers/
app/tasker/handlers/
lib/tasker/handlers/
spec/handlers/examples/  (test environment)

Configuration

Bootstrap Configuration

Bootstrap configuration is controlled via TOML files, not Ruby parameters:

# config/tasker/base/worker.toml
[web]
enabled = true              # Set to false for headless/embedded mode
bind_address = "0.0.0.0"
port = 8080

# Ruby bootstrap is simple - config comes from TOML
TaskerCore::Worker::Bootstrap.start!

Handler Configuration

class MyHandler < TaskerCore::StepHandler::Base
  def initialize(config: {})
    super
    @timeout = config[:timeout] || 30
    @max_retries = config[:retries] || 3
  end

  def config_schema
    {
      type: 'object',
      properties: {
        timeout: { type: 'integer', minimum: 1, default: 30 },
        retries: { type: 'integer', minimum: 0, default: 3 }
      }
    }
  end
end

Signal Handling

The Ruby worker handles multiple signals:

| Signal | Behavior |
|--------|----------|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |
| SIGUSR1 | Report worker status |
| SIGUSR2 | Reload configuration |

# Status reporting
Signal.trap('USR1') do
  logger.info "Worker Status: #{bootstrap.status.inspect}"
end

# Configuration reload
Signal.trap('USR2') do
  bootstrap.reload_config
end

Error Handling

Exception Classes

TaskerCore::Errors::Error                  # Base class
├── TaskerCore::Errors::ConfigurationError # Configuration issues
├── TaskerCore::Errors::FFIError           # FFI bridge errors
├── TaskerCore::Errors::ProceduralError    # Base for workflow errors
│   ├── TaskerCore::Errors::RetryableError # Transient failures
│   ├── TaskerCore::Errors::PermanentError # Unrecoverable failures
│   │   ├── TaskerCore::Errors::ValidationError # Validation failures
│   │   └── TaskerCore::Errors::NotFoundError   # Resource not found
│   ├── TaskerCore::Errors::TimeoutError   # Timeout failures
│   └── TaskerCore::Errors::NetworkError   # Network failures
└── TaskerCore::Errors::ServerError        # Embedded server errors

Raising Errors

def call(context)
  # Retryable error (will be retried)
  raise TaskerCore::Errors::RetryableError.new(
    'Database connection timeout',
    retry_after: 5,
    context: { service: 'database' }
  )

  # Permanent error (no retry)
  raise TaskerCore::Errors::PermanentError.new(
    'Invalid order format',
    error_code: 'INVALID_ORDER',
    context: { field: 'order_id' }
  )

  # Validation error (permanent, with field info)
  raise TaskerCore::Errors::ValidationError.new(
    'Email format is invalid',
    field: 'email',
    error_code: 'INVALID_EMAIL'
  )
end

Logging

New code should use TaskerCore::Tracing for unified structured logging via FFI:

# Recommended: Use Tracing directly
TaskerCore::Tracing.info('Processing order', {
  order_id: order.id,
  amount: order.total,
  customer_id: order.customer_id
})

TaskerCore::Tracing.error('Payment failed', {
  error_code: 'DECLINED',
  card_last_four: '1234'
})

Legacy Logger (Deprecated)

Note: TaskerCore::Logger is maintained for backward compatibility but delegates to TaskerCore::Tracing. New code should use Tracing directly.

# Legacy (still works, but deprecated)
logger = TaskerCore::Logger.instance
logger.info('Processing order', {
  order_id: order.id,
  amount: order.total
})

Log Levels

Controlled via RUST_LOG environment variable:

  • trace - Very detailed debugging
  • debug - Debugging information
  • info - Normal operation
  • warn - Warning conditions
  • error - Error conditions

File Structure

workers/ruby/
├── bin/
│   ├── server.rb            # Production server
│   └── health_check.rb      # Health check script
├── ext/
│   └── tasker_core/
│       └── extconf.rb       # FFI extension config
├── lib/
│   └── tasker_core/
│       ├── errors.rb        # Exception classes
│       ├── handlers.rb      # Handler namespace
│       ├── internal.rb      # Internal modules
│       ├── logger.rb        # Logging
│       ├── models.rb        # Type models
│       ├── registry/
│       │   ├── handler_registry.rb
│       │   └── step_handler_resolver.rb
│       ├── step_handler/
│       │   ├── base.rb      # Base handler
│       │   ├── api.rb       # API handler
│       │   ├── decision.rb  # Decision handler
│       │   └── batchable.rb # Batch handler
│       ├── task_handler/
│       │   └── base.rb      # Task orchestration
│       ├── types/           # Type definitions
│       └── version.rb
├── spec/
│   ├── handlers/examples/   # Example handlers
│   └── integration/         # Integration tests
├── Gemfile
└── tasker_core.gemspec

Testing

Unit Tests

cd workers/ruby
bundle exec rspec spec/

Integration Tests

DATABASE_URL=postgresql://... bundle exec rspec spec/integration/

E2E Tests (from project root)

DATABASE_URL=postgresql://... \
TASKER_ENV=test \
bundle exec rspec spec/handlers/

Example Handlers

Linear Workflow

# spec/handlers/examples/linear_workflow/step_handlers/linear_step_1_handler.rb
module LinearWorkflow
  module StepHandlers
    class LinearStep1Handler < TaskerCore::StepHandler::Base
      def call(context)
        input = context.context  # Full task context
        success(result: {
          step1_processed: true,
          input_received: input,
          processed_at: Time.now.iso8601
        })
      end
    end
  end
end

Order Fulfillment

class ValidateOrderHandler < TaskerCore::StepHandler::Base
  def call(context)
    order = context.context  # Full task context

    unless order['items']&.any?
      return failure(
        message: 'Order must have at least one item',
        error_type: 'ValidationError',
        error_code: 'EMPTY_ORDER',
        retryable: false
      )
    end

    success(result: {
      valid: true,
      item_count: order['items'].size,
      total: calculate_total(order['items'])
    })
  end
end

Conditional Approval

class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
  THRESHOLDS = {
    auto_approve: 1000,
    manager_only: 5000
  }.freeze

  def call(context)
    amount = context.get_task_field('amount').to_f

    if amount < THRESHOLDS[:auto_approve]
      decision_success(steps: ['auto_approve'])
    elsif amount < THRESHOLDS[:manager_only]
      decision_success(steps: ['manager_approval'])
    else
      decision_success(steps: ['manager_approval', 'finance_review'])
    end
  end
end

See Also

Rust Worker

Last Updated: 2026-01-01 Audience: Rust Developers Status: Active Package: workers-rust Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix <- Back to Worker Crates Overview


The Rust worker is the native, high-performance implementation for workflow step execution. It demonstrates the full capability of the tasker-worker foundation with zero FFI overhead.

Quick Start

Running the Server

cd workers/rust
cargo run

With Custom Configuration

TASKER_CONFIG_PATH=/path/to/config.toml cargo run

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| RUST_LOG | Log level | info |

Architecture

Entry Point

Location: workers/rust/src/main.rs

#[tokio::main]
async fn main() -> Result<()> {
    // Initialize structured logging
    tasker_shared::logging::init_tracing();

    // Bootstrap worker system
    let mut bootstrap_result = bootstrap().await?;

    // Start event handler (legacy path); move only the event handler into the
    // spawned task so the rest of the bootstrap result stays available for shutdown
    let mut event_handler = bootstrap_result.event_handler;
    tokio::spawn(async move {
        event_handler.start().await
    });

    // Wait for shutdown signal
    tokio::select! {
        _ = tokio::signal::ctrl_c() => { /* shutdown */ }
        _ = wait_for_sigterm() => { /* shutdown */ }
    }

    bootstrap_result.worker_handle.stop()?;
    Ok(())
}

Bootstrap Process

Location: workers/rust/src/bootstrap.rs

The bootstrap process:

  1. Creates step handler registry with all handlers
  2. Sets up global event system
  3. Bootstraps tasker-worker foundation
  4. Creates domain event publisher registry
  5. Spawns HandlerDispatchService for non-blocking dispatch
  6. Creates event handler for legacy path

#![allow(unused)]
fn main() {
pub async fn bootstrap() -> Result<RustWorkerBootstrapResult> {
    // Create registry with all handlers
    let registry = Arc::new(RustStepHandlerRegistry::new());

    // Bootstrap worker foundation
    let worker_handle = WorkerBootstrap::bootstrap_with_event_system(...).await?;

    // Set up dispatch service (non-blocking path)
    let dispatch_service = HandlerDispatchService::with_callback(...);

    Ok(RustWorkerBootstrapResult {
        worker_handle,
        event_handler,
        dispatch_service_handle,
    })
}
}

Handler Dispatch

The Rust worker uses the HandlerDispatchService for non-blocking handler execution:

┌────────────────────────────────────────────────────────────────┐
│                    RUST HANDLER DISPATCH                        │
└────────────────────────────────────────────────────────────────┘

   PGMQ Queue
        │
        ▼
  ┌─────────────┐
  │  Dispatch   │
  │  Channel    │
  └─────────────┘
        │
        ▼
  ┌─────────────────────────────────────────┐
  │       HandlerDispatchService            │
  │  ┌────────────────────────────────────┐ │
  │  │  Semaphore (10 permits)            │ │
  │  │       │                            │ │
  │  │       ▼                            │ │
  │  │  handler.call(&step_data).await    │ │
  │  │       │                            │ │
  │  │       ▼                            │ │
  │  │  DomainEventCallback               │ │
  │  └────────────────────────────────────┘ │
  └─────────────────────────────────────────┘
        │
        ▼
  ┌─────────────┐
  │ Completion  │
  │  Channel    │
  └─────────────┘
        │
        ▼
   Orchestration

Handler Development

Capability Traits

Rust uses traits for handler composition, matching the mixin pattern in Ruby/Python/TypeScript.

Location: tasker-worker/src/handler_capabilities.rs

APICapable Trait

For HTTP API integration:

#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::APICapable;

impl APICapable for MyHandler {
    // Use the helper methods:
    // - api_success(step_uuid, data, status, headers, execution_time_ms)
    // - api_failure(step_uuid, message, status, error_type, execution_time_ms)
    // - classify_status_code(status) -> ErrorClassification
}
}

DecisionCapable Trait

For dynamic workflow routing:

#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::DecisionCapable;

impl DecisionCapable for MyHandler {
    // Use the helper methods:
    // - decision_success(step_uuid, step_names, routing_context, execution_time_ms)
    // - skip_branches(step_uuid, reason, routing_context, execution_time_ms)
    // - decision_failure(step_uuid, message, error_type, execution_time_ms)
}
}

BatchableCapable Trait

For batch processing:

#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::BatchableCapable;

impl BatchableCapable for MyHandler {
    // Use the helper methods:
    // - create_cursor_configs(total_items, worker_count) -> Vec<CursorConfig>
    // - create_cursor_ranges(total_items, batch_size, max_batches) -> Vec<CursorConfig>
    // - batch_analyzer_success(step_uuid, worker_template, configs, total_items, ...)
    // - batch_worker_success(step_uuid, processed, succeeded, failed, skipped, ...)
    // - no_batches_outcome(step_uuid, reason, execution_time_ms)
    // - batch_failure(step_uuid, message, error_type, retryable, ...)
}
}

Composing Multiple Traits

#![allow(unused)]
fn main() {
// Implement multiple capability traits for a single handler
pub struct CompositeHandler {
    config: StepHandlerConfig,
}

impl APICapable for CompositeHandler {}
impl DecisionCapable for CompositeHandler {}

#[async_trait]
impl RustStepHandler for CompositeHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Use API capability to fetch data
        let response = self.call_api("/users/123").await?;

        // Use Decision capability to route based on response
        if response.status == 200 {
            self.decision_success(step_uuid, vec!["process_user"], None, 50)
        } else {
            self.api_failure(step_uuid, "API failed", response.status, "api_error", 50)
        }
    }
}
}

Handler Trait

Location: workers/rust/src/step_handlers/mod.rs

All Rust handlers implement the RustStepHandler trait:

#![allow(unused)]
fn main() {
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;

#[async_trait]
pub trait RustStepHandler: Send + Sync {
    /// Handler name for registration
    fn name(&self) -> &str;

    /// Execute the handler
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult>;

    /// Create a new instance with configuration from YAML
    fn new(config: StepHandlerConfig) -> Self where Self: Sized;
}
}

Creating a Handler

#![allow(unused)]
fn main() {
use async_trait::async_trait;
use anyhow::Result;
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;
use crate::step_handlers::{RustStepHandler, StepHandlerConfig, success_result};
use serde_json::json;

pub struct ProcessOrderHandler {
    _config: StepHandlerConfig,
}

#[async_trait]
impl RustStepHandler for ProcessOrderHandler {
    fn name(&self) -> &str {
        "process_order"
    }

    fn new(config: StepHandlerConfig) -> Self {
        Self { _config: config }
    }

    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let start_time = std::time::Instant::now();
        let step_uuid = step_data.workflow_step.workflow_step_uuid;

        // Extract input from task context
        let order_id = step_data.task.context
            .get("order_id")
            .and_then(|v| v.as_str())
            .ok_or_else(|| anyhow::anyhow!("Missing order_id"))?;

        // Process the order
        let result = process_order(order_id).await?;

        // Return success using helper function
        Ok(success_result(
            step_uuid,
            json!({
                "order_id": order_id,
                "status": "processed",
                "total": result.total
            }),
            start_time.elapsed().as_millis() as i64,
            None,
        ))
    }
}
}

Handler Registration

Location: workers/rust/src/step_handlers/registry.rs

Handlers are registered in the RustStepHandlerRegistry:

#![allow(unused)]
fn main() {
pub struct RustStepHandlerRegistry {
    handlers: HashMap<String, Arc<dyn RustStepHandler>>,
}

impl RustStepHandlerRegistry {
    pub fn new() -> Self {
        let mut registry = Self {
            handlers: HashMap::new(),
        };

        registry.register_all_handlers();
        registry
    }

    fn register_all_handlers(&mut self) {
        let empty_config = StepHandlerConfig::empty();

        // Linear workflow handlers
        self.register_handler(Arc::new(LinearStep1Handler::new(empty_config.clone())));
        self.register_handler(Arc::new(LinearStep2Handler::new(empty_config.clone())));

        // Order fulfillment handlers
        self.register_handler(Arc::new(ValidateOrderHandler::new(empty_config.clone())));
        self.register_handler(Arc::new(ProcessPaymentHandler::new(empty_config.clone())));

        // ... more handlers
    }

    fn register_handler(&mut self, handler: Arc<dyn RustStepHandler>) {
        let name = handler.name().to_string();
        self.handlers.insert(name, handler);
    }

    pub fn get_handler(&self, name: &str) -> Result<Arc<dyn RustStepHandler>, RustStepHandlerError> {
        self.handlers
            .get(name)
            .cloned()
            .ok_or_else(|| RustStepHandlerError::SystemError {
                message: format!("Handler '{}' not found in registry", name),
            })
    }
}
}

Example Handlers

Linear Workflow

Location: workers/rust/src/step_handlers/linear_workflow.rs

Simple sequential workflow with 4 steps:

#![allow(unused)]
fn main() {
pub struct LinearStep1Handler;

#[async_trait]
impl RustStepHandler for LinearStep1Handler {
    fn name(&self) -> &str {
        "linear_step_1"
    }

    async fn call(&self, step_data: &StepExecutionData) -> Result<StepHandlerResult> {
        info!("LinearStep1Handler: Processing step");

        let input = step_data.input_data.clone();
        let mut result = serde_json::Map::new();
        result.insert("step1_processed".to_string(), json!(true));
        result.insert("input_received".to_string(), input);

        Ok(StepHandlerResult::success(json!(result)))
    }
}
}

Diamond Workflow

Location: workers/rust/src/step_handlers/diamond_workflow.rs

Parallel branching with convergence:

    ┌─────┐
    │Start│
    └──┬──┘
       │
  ┌────┴────┐
  ▼         ▼
┌───┐     ┌───┐
│ B │     │ C │
└─┬─┘     └─┬─┘
  │         │
  └────┬────┘
       ▼
    ┌─────┐
    │ End │
    └─────┘

Batch Processing

Location: workers/rust/src/step_handlers/batch_processing_products_csv.rs

Three-phase batch processing:

  1. Analyzer: Counts total records
  2. Batch Processor: Processes chunks
  3. Aggregator: Combines results

#![allow(unused)]
fn main() {
pub struct CsvBatchProcessorHandler;

#[async_trait]
impl RustStepHandler for CsvBatchProcessorHandler {
    fn name(&self) -> &str {
        "csv_batch_processor"
    }

    async fn call(&self, step_data: &StepExecutionData) -> Result<StepHandlerResult> {
        let batch_size = step_data.step_inputs
            .get("batch_size")
            .and_then(|v| v.as_u64())
            .unwrap_or(100) as usize;

        let start_cursor = step_data.step_inputs
            .get("start_cursor")
            .and_then(|v| v.as_u64())
            .unwrap_or(0) as usize;

        // Process records in batch
        let processed = process_batch(start_cursor, batch_size).await?;

        Ok(StepHandlerResult::success(json!({
            "processed_count": processed,
            "batch_complete": true
        })))
    }
}
}

Error Injection (Testing)

Location: workers/rust/src/step_handlers/error_injection/

Handlers for testing retry behavior:

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicU32, Ordering};

pub struct FailNTimesHandler {
    fail_count: u32,
    attempts: AtomicU32,
}

impl FailNTimesHandler {
    /// Create handler that fails N times before succeeding
    pub fn new(fail_count: u32) -> Self {
        Self { fail_count, attempts: AtomicU32::new(0) }
    }
}

#[async_trait]
impl RustStepHandler for FailNTimesHandler {
    async fn call(&self, _step_data: &StepExecutionData) -> Result<StepHandlerResult> {
        let attempt = self.attempts.fetch_add(1, Ordering::SeqCst);

        if attempt < self.fail_count {
            Ok(StepHandlerResult::failure(
                "Intentional failure for testing",
                "test_error",
                true, // retryable
            ))
        } else {
            Ok(StepHandlerResult::success(json!({"attempts": attempt + 1})))
        }
    }
}
}

Domain Events

Post-Execution Publishing

Handlers can publish domain events after step execution using the StepEventPublisher trait:

#![allow(unused)]
fn main() {
use async_trait::async_trait;
use serde_json::json;
use std::sync::Arc;
use tasker_shared::events::domain_events::DomainEventPublisher;
use tasker_worker::worker::step_event_publisher::{
    StepEventPublisher, StepEventContext, PublishResult
};

#[derive(Debug)]
pub struct PaymentEventPublisher {
    domain_publisher: Arc<DomainEventPublisher>,
}

impl PaymentEventPublisher {
    pub fn new(domain_publisher: Arc<DomainEventPublisher>) -> Self {
        Self { domain_publisher }
    }
}

#[async_trait]
impl StepEventPublisher for PaymentEventPublisher {
    fn name(&self) -> &str {
        "PaymentEventPublisher"
    }

    fn domain_publisher(&self) -> &Arc<DomainEventPublisher> {
        &self.domain_publisher
    }

    async fn publish(&self, ctx: &StepEventContext) -> PublishResult {
        let mut result = PublishResult::default();

        if ctx.step_succeeded() {
            let payload = json!({
                "order_id": ctx.execution_result.result["order_id"],
                "amount": ctx.execution_result.result["amount"],
            });

            // Uses default impl from trait
            match self.publish_event(ctx, "payment.completed", payload).await {
                Ok(event_id) => result.published.push(event_id),
                Err(e) => result.errors.push(e.to_string()),
            }
        }

        result
    }
}
}

Dual-Path Delivery

Events can route to different delivery paths:

| Path | Description | Use Case |
|------|-------------|----------|
| durable | Published to PGMQ | External consumers, audit |
| fast | In-process bus | Metrics, telemetry |

Configuration

Bootstrap Configuration

#![allow(unused)]
fn main() {
pub struct WorkerBootstrapConfig {
    pub worker_id: String,
    pub enable_web_api: bool,
    pub event_driven_enabled: bool,
    pub deployment_mode_hint: Option<String>,
}

// Example configuration
let config = WorkerBootstrapConfig {
    worker_id: "rust-worker-001".to_string(),
    enable_web_api: true,
    event_driven_enabled: true,
    deployment_mode_hint: Some("Hybrid".to_string()),
    ..Default::default()
};
}

Dispatch Configuration

#![allow(unused)]
fn main() {
let config = HandlerDispatchConfig {
    max_concurrent_handlers: 10,
    handler_timeout: Duration::from_secs(30),
    service_id: "rust-handler-dispatch".to_string(),
    load_shedding: LoadSheddingConfig::default(),
};
}

Signal Handling

The Rust worker handles graceful shutdown:

#![allow(unused)]
fn main() {
// Wait for shutdown signal
tokio::select! {
    _ = tokio::signal::ctrl_c() => {
        info!("Received Ctrl+C, initiating graceful shutdown...");
    }
    result = wait_for_sigterm() => {
        info!("Received SIGTERM, initiating graceful shutdown...");
    }
}

// Graceful shutdown sequence
bootstrap_result.worker_handle.stop()?;
}

Performance

Characteristics

  • Zero FFI Overhead: Native Rust handlers
  • Async/Await: Non-blocking I/O with Tokio
  • Bounded Concurrency: Semaphore-limited parallelism
  • Memory Safety: Rust’s ownership model

Benchmarking

# Run with release optimizations
cargo run --release

# With performance profiling
RUST_LOG=trace cargo run --release

File Structure

workers/rust/
├── src/
│   ├── main.rs                  # Entry point
│   ├── bootstrap.rs             # Worker initialization
│   ├── lib.rs                   # Library exports
│   ├── event_handler.rs         # Event bridging (legacy)
│   ├── global_event_system.rs   # Global event coordination
│   ├── step_handlers/
│   │   ├── mod.rs               # Handler traits and types
│   │   ├── registry.rs          # Handler registry
│   │   ├── linear_workflow.rs   # Linear workflow handlers
│   │   ├── diamond_workflow.rs  # Diamond workflow handlers
│   │   ├── tree_workflow.rs     # Tree workflow handlers
│   │   ├── mixed_dag_workflow.rs
│   │   ├── order_fulfillment.rs
│   │   ├── batch_processing_*.rs
│   │   ├── error_injection/     # Test handlers
│   │   └── domain_event_*.rs    # Event publishing
│   └── event_subscribers/
│       ├── mod.rs
│       ├── logging_subscriber.rs
│       └── metrics_subscriber.rs
├── Cargo.toml
└── tests/

Testing

Unit Tests

cargo test -p workers-rust

Integration Tests

# With database
DATABASE_URL=postgresql://... cargo test -p workers-rust --test integration

E2E Tests

# From project root
DATABASE_URL=postgresql://... cargo nextest run --package workers-rust

See Also

TypeScript Worker

Last Updated: 2026-01-01 Audience: TypeScript/JavaScript Developers Status: Active Package: @tasker-systems/tasker Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix <- Back to Worker Crates Overview


The TypeScript worker provides a multi-runtime interface for integrating tasker-core workflow execution into TypeScript/JavaScript applications. It supports Bun, Node.js, and Deno runtimes with unified FFI bindings to the Rust worker foundation.

Quick Start

Installation

cd workers/typescript
bun install                     # Install dependencies
cargo build --release -p tasker-ts  # Build FFI library

Running the Server

# With Bun (recommended for production)
bun run bin/server.ts

# With Node.js
npx tsx bin/server.ts

# With Deno
deno run --allow-ffi --allow-env --allow-net bin/server.ts

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| TASKER_FFI_LIBRARY_PATH | Path to libtasker_ts | Auto-detected |
| RUST_LOG | Log level (trace/debug/info/warn/error) | info |
| PORT | HTTP server port | 8081 |

Architecture

Server Mode

Location: workers/typescript/bin/server.ts

The server bootstraps the Rust foundation and manages TypeScript handler execution:

import { createRuntime } from '../src/ffi/index.js';
import { EventEmitter } from '../src/events/event-emitter.js';
import { EventPoller } from '../src/events/event-poller.js';
import { HandlerRegistry } from '../src/handler/registry.js';
import { StepExecutionSubscriber } from '../src/subscriber/step-execution-subscriber.js';

// Create runtime for current environment (Bun/Node/Deno)
const runtime = createRuntime();
await runtime.load(libraryPath);

// Bootstrap Rust worker foundation
const result = runtime.bootstrapWorker({ namespace: 'my-app' });

// Create event system
const emitter = new EventEmitter();
const registry = new HandlerRegistry();

// Register handlers
registry.register('process_order', ProcessOrderHandler);

// Create step execution subscriber
const subscriber = new StepExecutionSubscriber(
  emitter,
  registry,
  runtime,
  { workerId: 'typescript-worker-001' }
);
subscriber.start();

// Start event poller (10ms polling)
const poller = new EventPoller(runtime, emitter, {
  pollingIntervalMs: 10
});
poller.start();

// Wait for shutdown signal
await shutdownSignal;

// Graceful shutdown
poller.stop();
await subscriber.waitForCompletion();
runtime.stopWorker();

Headless/Embedded Mode

For embedding in existing TypeScript applications:

import { createRuntime } from '@tasker-systems/tasker';
import { EventEmitter, EventPoller, HandlerRegistry, StepExecutionSubscriber } from '@tasker-systems/tasker';

// Bootstrap worker (headless mode via TOML: web.enabled = false)
const runtime = createRuntime();
await runtime.load('/path/to/libtasker_ts.dylib');
runtime.bootstrapWorker({ namespace: 'my-app' });

// Register handlers
const registry = new HandlerRegistry();
registry.register('process_data', ProcessDataHandler);

// Start event system
const emitter = new EventEmitter();
const subscriber = new StepExecutionSubscriber(emitter, registry, runtime, {});
subscriber.start();

const poller = new EventPoller(runtime, emitter);
poller.start();

FFI Bridge

TypeScript communicates with the Rust foundation via FFI polling:

┌────────────────────────────────────────────────────────────────┐
│                  TYPESCRIPT FFI BRIDGE                          │
└────────────────────────────────────────────────────────────────┘

   Rust Worker System
          │
          │ FFI (pollStepEvents)
          ▼
   ┌─────────────────────┐
   │    EventPoller      │
   │  (setInterval)      │──→ poll every 10ms
   └─────────────────────┘
          │
          │ emit to EventEmitter
          ▼
   ┌─────────────────────┐
   │ StepExecution       │
   │ Subscriber          │──→ route to handler
   └─────────────────────┘
          │
          │ handler.call(context)
          ▼
   ┌─────────────────────┐
   │  Handler Execution  │
   └─────────────────────┘
          │
          │ FFI (completeStepEvent)
          ▼
   Rust Completion Channel

Multi-Runtime Support

| Runtime | FFI Library | Status |
|---------|-------------|--------|
| Bun | koffi | Production |
| Node.js | koffi | Production |
| Deno | Deno.dlopen | Production |
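
All three runtimes go through the same two calls shown in the Quick Start. The sketch below shows one way to resolve the library path before loading; the fallback path and the Node-style process global are assumptions for illustration, not package behavior:

import { createRuntime } from '@tasker-systems/tasker';

// Hedged sketch: resolve the FFI library path before loading.
// TASKER_FFI_LIBRARY_PATH comes from the environment table above; the
// fallback filename and the Node-style `process` global are assumptions.
const ext = process.platform === 'darwin' ? 'dylib' : 'so';
const libraryPath =
  process.env.TASKER_FFI_LIBRARY_PATH ?? `./target/release/libtasker_ts.${ext}`;

const runtime = createRuntime(); // picks the Bun, Node.js, or Deno adapter
await runtime.load(libraryPath);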

Handler Development

Base Handler

Location: workers/typescript/src/handler/base.ts

All handlers extend StepHandler:

import { StepHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';

export class ProcessOrderHandler extends StepHandler {
  static handlerName = 'process_order';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Access input data
    const orderId = context.getInput<string>('order_id');
    const amount = context.getInput<number>('amount');

    // Business logic
    const result = await this.processOrder(orderId, amount);

    // Return success
    return this.success({
      order_id: orderId,
      status: 'processed',
      total: result.total
    });
  }

  private async processOrder(orderId: string, amount: number) {
    // Implementation
    return { total: amount * 1.1 };
  }
}

Handler Signature

async call(context: StepContext): Promise<StepHandlerResult>

// StepContext provides:
context.taskUuid          // Task identifier
context.stepUuid          // Step identifier
context.stepInputs        // Runtime inputs
context.stepConfig        // Handler configuration
context.dependencyResults // Results from parent steps
context.taskContext       // Full task context
context.retryCount        // Current retry attempt

// Type-safe accessors:
context.getInput<T>(key)              // Get single input
context.getDependencyResult(stepName) // Get dependency result
context.getAllDependencyResults(name) // Get all instances (batch workers)

Result Methods

// Success result (from base class)
return this.success(
  { key: 'value' },           // result
  { duration_ms: 100 }        // metadata (optional)
);

// Failure result (from base class)
return this.failure(
  'Payment declined',         // message
  'payment_error',            // errorType
  true,                       // retryable
  { card_last_four: '1234' }  // metadata (optional)
);

Error Types

import { ErrorType } from '@tasker-systems/tasker';

ErrorType.PERMANENT_ERROR   // Non-retryable failures
ErrorType.RETRYABLE_ERROR   // Retryable failures
ErrorType.VALIDATION_ERROR  // Input validation failures
ErrorType.HANDLER_ERROR     // Handler execution failures

Accessing Dependencies

async call(context: StepContext): Promise<StepHandlerResult> {
  // Get result from a dependency step
  const validation = context.getDependencyResult('validate_order') as {
    valid: boolean;
    amount: number;
  } | null;

  if (!validation) {
    return this.failure('Missing validation result', 'dependency_error', false);
  }

  if (validation.valid) {
    return this.success({ processed: true, amount: validation.amount });
  }

  return this.failure('Validation failed', 'validation_error', false);
}

Specialized Handlers

Mixin Pattern

TypeScript uses composition via mixins rather than inheritance. You can use either:

  1. Wrapper classes (ApiHandler, DecisionHandler) - simpler, backward compatible
  2. Mixin functions (applyAPI, applyDecision) - explicit composition

import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, APICapable } from '@tasker-systems/tasker';

// Using mixin pattern (recommended for new code)
class MyHandler extends StepHandler implements APICapable {
  constructor() {
    super();
    applyAPI(this);  // Adds get/post/put/delete methods
  }

  async call(context: StepContext): Promise<StepHandlerResult> {
    const response = await this.get('/api/data');
    return this.apiSuccess(response);
  }
}

// Or using wrapper class (simpler, backward compatible)
import { ApiHandler } from '@tasker-systems/tasker';

class MyHandler extends ApiHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    const response = await this.get('/api/data');
    return this.apiSuccess(response);
  }
}

API Handler

Location: workers/typescript/src/handler/api.ts

For HTTP API integration with automatic error classification:

import { ApiHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';

export class FetchUserHandler extends ApiHandler {
  static handlerName = 'fetch_user';
  static handlerVersion = '1.0.0';

  protected baseUrl = 'https://api.example.com';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const userId = context.getInput<string>('user_id');

    // Automatic error classification
    const response = await this.get(`/users/${userId}`);

    if (!response.ok) {
      return this.apiFailure(response);
    }

    return this.apiSuccess(response);
  }
}

HTTP Methods:

// GET request
const response = await this.get('/path', {
  params: { key: 'value' },
  headers: { 'Authorization': 'Bearer token' }
});

// POST request
const response = await this.post('/path', {
  body: { key: 'value' },
  headers: {}
});

// PUT request
const response = await this.put('/path', { body: { key: 'value' } });

// DELETE request
const response = await this.delete('/path', { params: {} });

ApiResponse Properties:

response.statusCode      // HTTP status code
response.headers         // Response headers
response.body            // Parsed body (object or string)
response.ok              // True if 2xx status
response.isClientError   // True if 4xx status
response.isServerError   // True if 5xx status
response.isRetryable     // True if should retry (408, 429, 500-504)
response.retryAfter      // Retry-After header value in seconds

Error Classification:

| Status | Classification | Behavior |
|--------|----------------|----------|
| 400, 401, 403, 404, 422 | Non-retryable | Permanent failure |
| 408, 429, 500-504 | Retryable | Standard retry |
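
apiFailure(response) applies this classification for you. If you need to build the failure yourself, a minimal sketch using only the documented ApiResponse properties might look like this (the error-type strings are illustrative, not a fixed contract):

// Hedged sketch: mapping the classification table onto a failure result by hand.
// apiFailure(response) is the normal path; the error-type strings are illustrative.
if (!response.ok) {
  return this.failure(
    `Upstream request failed with status ${response.statusCode}`,
    response.isServerError ? 'server_error' : 'client_error',
    response.isRetryable,                       // true only for 408, 429, 500-504
    { retry_after_seconds: response.retryAfter }
  );
}
return this.apiSuccess(response);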

Decision Handler

Location: workers/typescript/src/handler/decision.ts

For dynamic workflow routing:

import { DecisionHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';

export class RoutingDecisionHandler extends DecisionHandler {
  static handlerName = 'routing_decision';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const amount = context.getInput<number>('amount') ?? 0;

    if (amount < 1000) {
      // Auto-approve small amounts
      return this.decisionSuccess(['auto_approve'], {
        route_type: 'auto',
        amount
      });
    } else if (amount < 5000) {
      // Manager approval for medium amounts
      return this.decisionSuccess(['manager_approval'], {
        route_type: 'manager',
        amount
      });
    } else {
      // Dual approval for large amounts
      return this.decisionSuccess(['manager_approval', 'finance_review'], {
        route_type: 'dual',
        amount
      });
    }
  }
}

Decision Methods:

// Activate specific steps
return this.decisionSuccess(
  ['step1', 'step2'],           // steps to activate
  { route_reason: 'threshold' } // routing context
);

// No branches needed
return this.decisionNoBranches('condition not met');

BatchableStepHandler

Location: workers/typescript/src/handler/batchable.ts

For processing large datasets in chunks. Cross-language aligned with Ruby and Python implementations.

Analyzer Handler (creates batch configurations):

import { BatchableStepHandler } from '@tasker-systems/tasker';
import type { StepContext, BatchableResult } from '@tasker-systems/tasker';

export class CsvAnalyzerHandler extends BatchableStepHandler {
  static handlerName = 'csv_analyzer';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<BatchableResult> {
    const csvPath = context.getInput<string>('csv_path');
    const rowCount = await this.countCsvRows(csvPath);

    if (rowCount === 0) {
      // No data to process - use cross-language standard
      return this.noBatchesResult('empty_dataset', {
        csv_path: csvPath,
        analyzed_at: new Date().toISOString()
      });
    }

    // Create cursor configs using Ruby-style helper
    // Divides rowCount into 5 roughly equal batches
    const batchConfigs = this.createCursorConfigs(rowCount, 5);

    return this.batchSuccess('process_csv_batch', batchConfigs, {
      csv_path: csvPath,
      total_rows: rowCount,
      analyzed_at: new Date().toISOString()
    });
  }
}

Worker Handler (processes a batch):

export class CsvBatchProcessorHandler extends BatchableStepHandler {
  static handlerName = 'csv_batch_processor';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Cross-language standard: check for no-op worker first
    const noOpResult = this.handleNoOpWorker(context);
    if (noOpResult) {
      return noOpResult;
    }

    // Get batch worker inputs from Rust
    const batchInputs = this.getBatchWorkerInputs(context);
    const cursor = batchInputs?.cursor;

    if (!cursor) {
      return this.failure('Missing batch cursor', 'batch_error', false);
    }

    // Process the batch
    const results = await this.processCsvBatch(
      cursor.start_cursor,
      cursor.end_cursor
    );

    return this.success({
      batch_id: cursor.batch_id,
      rows_processed: results.count,
      items_succeeded: results.success,
      items_failed: results.failed
    });
  }
}

Aggregator Handler (combines results):

export class CsvAggregatorHandler extends StepHandler {
  static handlerName = 'csv_aggregator';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Get all batch worker results
    const workerResults = context.getAllDependencyResults('process_csv_batch') as Array<{
      rows_processed: number;
      items_succeeded: number;
      items_failed: number;
    } | null>;

    // Aggregate results
    let totalProcessed = 0;
    let totalSucceeded = 0;
    let totalFailed = 0;

    for (const result of workerResults) {
      if (result) {
        totalProcessed += result.rows_processed ?? 0;
        totalSucceeded += result.items_succeeded ?? 0;
        totalFailed += result.items_failed ?? 0;
      }
    }

    return this.success({
      total_processed: totalProcessed,
      total_succeeded: totalSucceeded,
      total_failed: totalFailed,
      worker_count: workerResults.length
    });
  }
}

BatchableStepHandler Methods (Cross-Language Aligned):

| Method | Ruby Equivalent | Purpose |
|--------|-----------------|---------|
| batchSuccess(template, configs, metadata) | batch_success | Create batch workers |
| noBatchesResult(reason, metadata) | no_batches_outcome | Empty dataset handling |
| createCursorConfigs(total, workers) | create_cursor_configs | Divide work by worker count |
| handleNoOpWorker(context) | handle_no_op_worker | Detect no-op placeholders |
| getBatchWorkerInputs(context) | get_batch_context | Access Rust batch inputs |
| aggregateWorkerResults(results) | aggregate_batch_worker_results | Static aggregation helper |
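
For reference, the static aggregation helper listed above could stand in for the manual loop in CsvAggregatorHandler. The call shape follows the table, but the returned summary shape is an assumption and should be checked against the package types:

// Hedged sketch: delegating aggregation to the static helper from the table above.
// The shape of `summary` is assumed to mirror the manual totals computed earlier.
const workerResults = context.getAllDependencyResults('process_csv_batch');
const summary = BatchableStepHandler.aggregateWorkerResults(workerResults);
return this.success(summary);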

Handler Registry

Registration

Location: workers/typescript/src/handler/registry.ts

import { HandlerRegistry } from '@tasker-systems/tasker';

const registry = new HandlerRegistry();

// Manual registration
registry.register('process_order', ProcessOrderHandler);

// Check if registered
registry.isRegistered('process_order'); // true

// Resolve and instantiate
const handler = registry.resolve('process_order');
if (handler) {
  const result = await handler.call(context);
}

// List all handlers
registry.listHandlers(); // ['process_order', ...]

// Handler count
registry.handlerCount(); // 1

Bulk Registration

import { registerExampleHandlers } from './handlers/examples/index.js';

// Register multiple handlers at once
registerExampleHandlers(registry);
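
A bulk-registration helper is just a function that accepts the registry and registers each handler in turn. A minimal sketch, with hypothetical handler modules:

import { HandlerRegistry } from '@tasker-systems/tasker';
// Hypothetical handler modules, for illustration only.
import { ProcessOrderHandler } from './process-order.js';
import { FetchUserHandler } from './fetch-user.js';

// Register a group of related handlers in one call.
export function registerOrderHandlers(registry: HandlerRegistry): void {
  registry.register('process_order', ProcessOrderHandler);
  registry.register('fetch_user', FetchUserHandler);
}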

Type System

Core Types

import type {
  StepContext,
  StepHandlerResult,
  BatchableResult,
  FfiStepEvent,
  BootstrapConfig,
  WorkerStatus,
} from '@tasker-systems/tasker';

// StepContext - created from FFI event
const context = StepContext.fromFfiEvent(event, 'handler_name');
context.taskUuid;      // string
context.stepUuid;      // string
context.stepInputs;    // Record<string, unknown>
context.retryCount;    // number

// StepHandlerResult - handler output
result.success;        // boolean
result.result;         // Record<string, unknown>
result.errorMessage;   // string | undefined
result.retryable;      // boolean

Configuration Types

import type { BootstrapConfig } from '@tasker-systems/tasker';

const config: BootstrapConfig = {
  namespace: 'my-app',
  environment: 'production',
  configPath: '/path/to/config.toml'
};

Event System

EventEmitter

Location: workers/typescript/src/events/event-emitter.ts

import { EventEmitter } from '@tasker-systems/tasker';
import { StepEventNames } from '@tasker-systems/tasker';

const emitter = new EventEmitter();

// Subscribe to events
emitter.on(StepEventNames.STEP_EXECUTION_RECEIVED, (payload) => {
  console.log(`Processing step: ${payload.event.step_uuid}`);
});

emitter.on(StepEventNames.STEP_EXECUTION_COMPLETED, (payload) => {
  console.log(`Step completed: ${payload.stepUuid}`);
});

// Emit events
emitter.emit(StepEventNames.STEP_EXECUTION_RECEIVED, {
  event: ffiStepEvent
});

Event Names

import { StepEventNames } from '@tasker-systems/tasker';

StepEventNames.STEP_EXECUTION_RECEIVED  // Step event received from FFI
StepEventNames.STEP_EXECUTION_STARTED   // Handler execution started
StepEventNames.STEP_EXECUTION_COMPLETED // Handler execution completed
StepEventNames.STEP_EXECUTION_FAILED    // Handler execution failed
StepEventNames.STEP_COMPLETION_SENT     // Result sent to FFI

EventPoller

Location: workers/typescript/src/events/event-poller.ts

import { EventPoller } from '@tasker-systems/tasker';

const poller = new EventPoller(runtime, emitter, {
  pollingIntervalMs: 10,        // Poll every 10ms
  starvationCheckInterval: 100, // Check every 1 second
  cleanupInterval: 1000         // Cleanup every 10 seconds
});

// Start polling
poller.start();

// Get metrics
const metrics = poller.getMetrics();
console.log(`Pending: ${metrics.pendingCount}`);

// Stop polling
poller.stop();

Domain Events

TypeScript has full domain event support, matching Ruby and Python capabilities. The domain events module provides BasePublisher, BaseSubscriber, and registries for custom event handling.

Location: workers/typescript/src/handler/domain-events.ts

BasePublisher

Publishers transform step execution context into domain-specific events:

import { BasePublisher, StepEventContext, DomainEvent } from '@tasker-systems/tasker';

export class PaymentEventPublisher extends BasePublisher {
  static publisherName = 'payment_events';

  // Required: which steps trigger this publisher
  publishesFor(): string[] {
    return ['process_payment', 'refund_payment'];
  }

  // Transform step context into domain event
  async transformPayload(ctx: StepEventContext): Promise<Record<string, unknown>> {
    return {
      payment_id: ctx.result?.payment_id,
      amount: ctx.result?.amount,
      currency: ctx.result?.currency,
      status: ctx.result?.status
    };
  }

  // Lifecycle hooks (optional)
  async beforePublish(ctx: StepEventContext): Promise<void> {
    console.log(`Publishing payment event for step: ${ctx.stepName}`);
  }

  async afterPublish(ctx: StepEventContext, event: DomainEvent): Promise<void> {
    console.log(`Published event: ${event.eventName}`);
  }

  async onPublishError(ctx: StepEventContext, error: Error): Promise<void> {
    console.error(`Failed to publish: ${error.message}`);
  }

  // Inject custom metadata
  async additionalMetadata(ctx: StepEventContext): Promise<Record<string, unknown>> {
    return { payment_processor: 'stripe' };
  }
}

BaseSubscriber

Subscribers react to domain events matching specific patterns:

import { BaseSubscriber, InProcessDomainEvent, SubscriberResult } from '@tasker-systems/tasker';

export class AuditLoggingSubscriber extends BaseSubscriber {
  static subscriberName = 'audit_logger';

  // Which events to handle (glob patterns supported)
  subscribesTo(): string[] {
    return ['payment.*', 'order.completed'];
  }

  // Handle matching events
  async handle(event: InProcessDomainEvent): Promise<SubscriberResult> {
    await this.logToAuditTrail(event);
    return { success: true };
  }

  // Lifecycle hooks (optional)
  async beforeHandle(event: InProcessDomainEvent): Promise<void> {
    console.log(`Handling: ${event.eventName}`);
  }

  async afterHandle(event: InProcessDomainEvent, result: SubscriberResult): Promise<void> {
    console.log(`Handled successfully: ${result.success}`);
  }

  async onHandleError(event: InProcessDomainEvent, error: Error): Promise<void> {
    console.error(`Handler error: ${error.message}`);
  }
}

Registries

Manage publishers and subscribers with singleton registries:

import { PublisherRegistry, SubscriberRegistry } from '@tasker-systems/tasker';

// Publisher Registry
const pubRegistry = PublisherRegistry.getInstance();
pubRegistry.register(PaymentEventPublisher);
pubRegistry.register(OrderEventPublisher);
pubRegistry.freeze(); // Prevent further registrations

// Get publisher for a step
const publisher = pubRegistry.getForStep('process_payment');

// Subscriber Registry
const subRegistry = SubscriberRegistry.getInstance();
subRegistry.register(AuditLoggingSubscriber);
subRegistry.register(MetricsSubscriber);

// Start all subscribers
subRegistry.startAll();

// Stop all subscribers
subRegistry.stopAll();

FFI Integration

Domain events integrate with the Rust FFI layer for cross-language event flow:

import { createFfiPollAdapter, InProcessDomainEventPoller } from '@tasker-systems/tasker';

// Create poller connected to Rust broadcast channel
const poller = new InProcessDomainEventPoller();

// Set the FFI poll function
poller.setPollFunction(createFfiPollAdapter(runtime));

// Start polling for events
poller.start((event) => {
  // Route to appropriate subscriber
  const subscribers = subRegistry.getMatchingSubscribers(event.eventName);
  for (const sub of subscribers) {
    sub.handle(event);
  }
});

Signal Handling

The TypeScript worker handles signals for graceful shutdown:

| Signal | Behavior |
|--------|----------|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |

import { ShutdownController } from '@tasker-systems/tasker';

const shutdown = new ShutdownController();

// Register signal handlers
shutdown.registerSignalHandlers();

// Wait for shutdown signal
await shutdown.waitForShutdown();

// Or check if shutdown requested
if (shutdown.isShutdownRequested()) {
  // Begin cleanup
}

Error Handling

Using Failure Results

async call(context: StepContext): Promise<StepHandlerResult> {
  try {
    const result = await this.processData(context);
    return this.success(result);
  } catch (error) {
    if (error instanceof NetworkError) {
      // Retryable error
      return this.failure(
        error.message,
        ErrorType.RETRYABLE_ERROR,
        true,
        { endpoint: error.endpoint }
      );
    }

    // Non-retryable error
    return this.failure(
      error instanceof Error ? error.message : 'Unknown error',
      ErrorType.HANDLER_ERROR,
      false
    );
  }
}

Logging

Structured Logging

import { logInfo, logError, logWarn, logDebug } from '@tasker-systems/tasker';

// Simple logging
logInfo('Processing started', { component: 'handler' });
logError('Failed to connect', { component: 'database' });

// With additional context
logInfo('Order processed', {
  component: 'handler',
  order_id: '123',
  amount: '100.00'
});

Pino Integration

The worker uses pino for structured logging:

import pino from 'pino';

const logger = pino({
  name: 'my-handler',
  level: process.env.RUST_LOG ?? 'info'
});

logger.info({ orderId: '123' }, 'Processing order');

File Structure

workers/typescript/
├── bin/
│   └── server.ts               # Production server
├── src/
│   ├── index.ts                # Package exports
│   ├── bootstrap/
│   │   └── bootstrap.ts        # Worker initialization
│   ├── events/
│   │   ├── event-emitter.ts    # Event pub/sub
│   │   ├── event-poller.ts     # FFI polling
│   │   └── event-system.ts     # Combined event system
│   ├── ffi/
│   │   ├── bun-runtime.ts      # Bun FFI adapter
│   │   ├── node-runtime.ts     # Node.js FFI adapter
│   │   ├── deno-runtime.ts     # Deno FFI adapter
│   │   ├── runtime-interface.ts # Common interface
│   │   └── types.ts            # FFI types
│   ├── handler/
│   │   ├── base.ts             # Base handler class
│   │   ├── api.ts              # API handler
│   │   ├── decision.ts         # Decision handler
│   │   ├── batchable.ts        # Batchable handler
│   │   ├── domain-events.ts    # Domain events module
│   │   ├── registry.ts         # Handler registry
│   │   └── mixins/             # Mixin modules
│   │       ├── index.ts        # Mixin exports
│   │       ├── api.ts          # APIMixin, applyAPI
│   │       └── decision.ts     # DecisionMixin, applyDecision
│   ├── server/
│   │   ├── worker-server.ts    # Server implementation
│   │   └── types.ts            # Server types
│   ├── subscriber/
│   │   └── step-execution-subscriber.ts
│   └── types/
│       ├── step-context.ts     # Step context
│       └── step-handler-result.ts
├── tests/
│   ├── unit/                   # Unit tests
│   ├── integration/            # Integration tests
│   └── handlers/examples/      # Example handlers
├── src-rust/                   # Rust FFI extension
├── package.json
├── tsconfig.json
└── biome.json                  # Linting config

Testing

Unit Tests

cd workers/typescript
bun test                        # Run all tests
bun test tests/unit/            # Run unit tests only

Integration Tests

bun test tests/integration/     # Run integration tests

With Coverage

bun test --coverage

Linting

bun run check                   # Biome lint + format check
bun run check:fix               # Auto-fix issues

Type Checking

bunx tsc --noEmit               # Type check without emit

Example Handlers

Linear Workflow

export class DoubleHandler extends StepHandler {
  static handlerName = 'double_value';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const value = context.getInput<number>('value') ?? 0;
    return this.success({
      result: value * 2,
      operation: 'double'
    });
  }
}

export class AddHandler extends StepHandler {
  static handlerName = 'add_constant';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const prev = context.getDependencyResult('double_value') as { result: number } | null;
    const value = prev?.result ?? 0;
    return this.success({
      result: value + 10,
      operation: 'add'
    });
  }
}

Diamond Workflow (Parallel Branches)

export class DiamondStartHandler extends StepHandler {
  static handlerName = 'diamond_start';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const input = context.getInput<number>('value') ?? 0;
    return this.success({ squared: input * input });
  }
}

export class BranchBHandler extends StepHandler {
  static handlerName = 'branch_b';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const start = context.getDependencyResult('diamond_start') as { squared: number };
    return this.success({ result: start.squared + 25 });
  }
}

export class BranchCHandler extends StepHandler {
  static handlerName = 'branch_c';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const start = context.getDependencyResult('diamond_start') as { squared: number };
    return this.success({ result: start.squared * 2 });
  }
}

export class DiamondEndHandler extends StepHandler {
  static handlerName = 'diamond_end';

  async call(context: StepContext): Promise<StepHandlerResult> {
    const branchB = context.getDependencyResult('branch_b') as { result: number };
    const branchC = context.getDependencyResult('branch_c') as { result: number };
    return this.success({
      final: (branchB.result + branchC.result) / 2
    });
  }
}

Error Handling

export class RetryableErrorHandler extends StepHandler {
  static handlerName = 'retryable_error';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Simulate a retryable error (e.g., network timeout)
    return this.failure(
      'Connection timeout - will be retried',
      ErrorType.RETRYABLE_ERROR,
      true,
      { attempt: context.retryCount }
    );
  }
}

export class PermanentErrorHandler extends StepHandler {
  static handlerName = 'permanent_error';

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Simulate a permanent error (e.g., validation failure)
    return this.failure(
      'Invalid input - no retry allowed',
      ErrorType.PERMANENT_ERROR,
      false
    );
  }
}

Docker Deployment

Dockerfile

FROM oven/bun:1.1.38 AS runtime

WORKDIR /app

# Copy built artifacts
COPY workers/typescript/dist/ ./dist/
COPY workers/typescript/package.json ./
COPY target/release/libtasker_ts.so ./lib/

# Install production dependencies
RUN bun install --production

# Set environment
ENV TASKER_FFI_LIBRARY_PATH=/app/lib/libtasker_ts.so
ENV PORT=8081

EXPOSE 8081

CMD ["bun", "run", "dist/bin/server.js"]

Docker Compose

typescript-worker:
  build:
    context: .
    dockerfile: docker/build/typescript-worker.Dockerfile
  environment:
    DATABASE_URL: postgresql://tasker:tasker@postgres:5432/tasker
    TASKER_ENV: production
    TASKER_TEMPLATE_PATH: /app/templates
    PORT: 8081
  ports:
    - "8084:8081"
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
    interval: 10s
    timeout: 5s
    retries: 3

See Also

Observability Documentation

Last Updated: 2025-12-01 Audience: Operators, Developers Status: Active Related Docs: Documentation Hub | Benchmarks | Deployment Patterns | Domain Events

← Back to Documentation Hub


This directory contains documentation for monitoring, metrics, logging, and performance measurement in tasker-core.


Quick Navigation

📊 Performance & Benchmarking: ../benchmarks/

All benchmark documentation has been consolidated in the docs/benchmarks/ directory.

See: Benchmark README for:

  • API performance benchmarks
  • SQL function benchmarks
  • Event propagation benchmarks
  • End-to-end latency benchmarks
  • Benchmark quick reference
  • Performance targets and CI integration

Migration Note: The following files remain in this directory for historical context but are superseded by the consolidated benchmarks documentation:

  • benchmark-implementation-decision.md - Decision rationale (archived)
  • benchmark-quick-reference.md - Superseded by ../benchmarks/README.md
  • benchmark-strategy-summary.md - Consolidated into benchmark-specific docs
  • benchmarking-guide.md - SQL benchmarks moved to ../benchmarks/sql-benchmarks.md
  • phase-5.4-distributed-benchmarks-plan.md - Implementation complete

Observability Categories

1. Metrics (metrics-*.md)

Purpose: System health, performance counters, and operational metrics

Documentation:

Key Metrics Tracked:

  • Task lifecycle events (created, started, completed, failed)
  • Step execution metrics (claimed, executed, retried)
  • Database operation performance (query times, cache hit rates)
  • Worker health (active workers, queue depths, claim rates)
  • System resource usage (memory, connections, threads)

Export Targets:

  • OpenTelemetry (planned)
  • Prometheus (supported)
  • CloudWatch (planned)
  • Datadog (planned)

Quick Reference:

// Example: Recording a metric
metrics::counter!("tasker.tasks.created").increment(1);
metrics::histogram!("tasker.step.execution_time_ms").record(elapsed_ms);
metrics::gauge!("tasker.workers.active").set(worker_count as f64);

2. Logging (logging-standards.md)

Purpose: Structured logging for debugging, audit trails, and operational visibility

Documentation:

Log Levels:

  • ERROR: Critical failures requiring immediate attention
  • WARN: Degraded operation or retry scenarios
  • INFO: Significant lifecycle events and state transitions
  • DEBUG: Detailed execution flow for troubleshooting
  • TRACE: Exhaustive detail for deep debugging

Structured Fields:

info!(
    task_uuid = %task_uuid,
    correlation_id = %correlation_id,
    step_name = %step_name,
    elapsed_ms = elapsed.as_millis(),
    "Step execution completed successfully"
);

Key Standards:

  • Use structured logging (not string interpolation)
  • Include correlation IDs for distributed tracing
  • Log state transitions at INFO level
  • Include timing information for performance analysis
  • Sanitize sensitive data (credentials, PII)

3. Tracing and OpenTelemetry

Purpose: Distributed request tracing across services

Status: ✅ Active

Documentation:

Current Features:

  • Distributed trace propagation via correlation IDs (UUIDv7)
  • Span creation for major operations:
    • API request handling
    • Step execution (claim → execute → submit)
    • Orchestration coordination
    • Domain event publishing
    • Message queue operations
  • Two-phase FFI telemetry initialization (safe for Ruby/Python workers)
  • Integration with Grafana LGTM stack (Prometheus, Tempo)
  • Domain event metrics (/metrics/events endpoint)

Two-Phase FFI Initialization:

  • Phase 1: Console-only logging (safe during FFI bridge setup)
  • Phase 2: Full OpenTelemetry (after FFI established)

Example:

#[tracing::instrument(
    name = "publish_domain_event",
    skip(self, payload),
    fields(
        event_name = %event_name,
        namespace = %metadata.namespace,
        correlation_id = %metadata.correlation_id,
        delivery_mode = ?delivery_mode
    )
)]
async fn publish_event(&self, event_name: &str, ...) -> Result<()> {
    // Implementation
}

4. Health Checks

Purpose: Service health monitoring for orchestration, availability, and alerting

Endpoints:

  • GET /health - Overall service health
  • GET /health/ready - Readiness for traffic (K8s readiness probe)
  • GET /health/live - Liveness check (K8s liveness probe)

Health Indicators:

  • Database connection pool status
  • Message queue connectivity
  • Worker availability
  • Circuit breaker states
  • Resource utilization (memory, connections)

Response Format:

{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "connections_active": 5,
      "connections_idle": 15,
      "connections_max": 20
    },
    "message_queue": {
      "status": "healthy",
      "queues_monitored": 3
    },
    "circuit_breakers": {
      "status": "healthy",
      "open_breakers": 0
    }
  },
  "uptime_seconds": 3600
}
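
A minimal sketch of consuming these endpoints from a probe script, assuming the API is reachable on localhost:8080 as in the tracing examples later in this document (the TASKER_API_URL override is hypothetical):

// Hedged sketch: simple health probe; host, port, and TASKER_API_URL are assumptions.
const base = process.env.TASKER_API_URL ?? 'http://localhost:8080';

const res = await fetch(`${base}/health`);
const health = (await res.json()) as { status: string; uptime_seconds: number };

if (health.status !== 'healthy') {
  console.error(`Service unhealthy after ${health.uptime_seconds}s uptime`);
  process.exit(1);
}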

Observability Architecture

Component-Level Instrumentation

┌──────────────────────────────────────────────────────────┐
│                   Observability Stack                    │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Metrics  │  │   Logs   │  │  Traces  │  │  Health  │  │
│  │ Counters │  │Structured│  │   OTel   │  │  Checks  │  │
│  │Histograms│  │   JSON   │  │  Spans   │  │   HTTP   │  │
│  │  Gauges  │  │  Fields  │  │   Tags   │  │  Probes  │  │
│  └─────┬────┘  └─────┬────┘  └─────┬────┘  └─────┬────┘  │
│        │             │             │             │       │
└────────┼─────────────┼─────────────┼─────────────┼───────┘
         │             │             │             │
         ▼             ▼             ▼             ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
  │Prometheus │ │  Loki /   │ │  Jaeger / │ │    K8s    │
  │   OTLP    │ │CloudWatch │ │   Tempo   │ │  Probes   │
  └───────────┘ └───────────┘ └───────────┘ └───────────┘

Instrumentation Points

Orchestration:

  • Task lifecycle transitions
  • Step discovery and enqueueing
  • Result processing
  • Finalization operations
  • Database query performance

Worker:

  • Step claiming
  • Handler execution
  • Result submission
  • FFI call overhead (Ruby workers)
  • Event propagation latency

Database:

  • Query execution times
  • Connection pool metrics
  • Transaction commit latency
  • Buffer cache hit ratio

Message Queue:

  • Message send/receive latency
  • Queue depth
  • Notification propagation time
  • Message processing errors

Performance Monitoring

Key Performance Indicators (KPIs)

| Metric | Target | Alert Threshold | Notes |
|--------|--------|-----------------|-------|
| API Response Time (p99) | < 100ms | > 200ms | User-facing latency |
| SQL Function Time (mean) | < 3ms | > 5ms | Orchestration efficiency |
| Event Propagation (p95) | < 10ms | > 20ms | Real-time coordination |
| E2E Task Completion (p99) | < 500ms | > 1000ms | End-user experience |
| Worker Claim Success Rate | > 95% | < 90% | Resource contention |
| Database Connection Pool | < 80% | > 90% | Resource exhaustion |

Monitoring Dashboards

Recommended Dashboard Panels:

  1. Task Throughput

    • Tasks created/min
    • Tasks completed/min
    • Tasks failed/min
    • Active tasks count
  2. Step Execution

    • Steps enqueued/min
    • Steps completed/min
    • Average step execution time
    • Step retry rate
  3. System Health

    • Worker health status
    • Database connection pool utilization
    • Circuit breaker status
    • API response times (p50, p95, p99)
  4. Error Rates

    • Task failures by namespace
    • Step failures by handler
    • Database errors
    • Message queue errors

Correlation and Debugging

Correlation ID Propagation

Every request generates a UUIDv7 correlation ID that flows through:

  1. API request → Task creation
  2. Task → Step enqueueing
  3. Step → Worker execution
  4. Worker → Result submission
  5. Result → Orchestration processing

Tracing a Request:

# Find correlation ID from task creation
curl http://localhost:8080/v1/tasks/{task_uuid} | jq .correlation_id

# Search logs across all services
docker logs orchestration 2>&1 | grep {correlation_id}
docker logs worker 2>&1 | grep {correlation_id}

# Query database for full timeline
psql $DATABASE_URL -c "
  SELECT
    created_at,
    from_state,
    to_state,
    metadata->>'duration_ms' as duration
  FROM tasker.task_transitions
  WHERE metadata->>'correlation_id' = '{correlation_id}'
  ORDER BY created_at;
"

Debug Logging

Enable debug logging for detailed execution flow:

# Docker Compose
RUST_LOG=debug docker-compose up

# Local development
RUST_LOG=tasker_worker=debug,tasker_orchestration=debug cargo run

# Specific modules
RUST_LOG=tasker_worker::worker::command_processor=trace cargo test

Best Practices

1. Structured Logging

Do:

info!(
    task_uuid = %task.uuid,
    namespace = %task.namespace,
    elapsed_ms = elapsed.as_millis(),
    "Task completed successfully"
);

Don’t:

info!("Task {} in namespace {} completed in {}ms",
    task.uuid, task.namespace, elapsed.as_millis());

2. Metric Naming

Use consistent, hierarchical naming:

metrics::counter!("tasker.tasks.created").increment(1);
metrics::counter!("tasker.tasks.completed").increment(1);
metrics::counter!("tasker.tasks.failed").increment(1);
metrics::histogram!("tasker.step.execution_time_ms").record(elapsed);

3. Performance Measurement

Measure at operation boundaries:

let start = Instant::now();
let result = operation().await?;
let elapsed = start.elapsed();

metrics::histogram!("tasker.operation.duration_ms")
    .record(elapsed.as_millis() as f64);

info!(
    operation = "operation_name",
    elapsed_ms = elapsed.as_millis(),
    success = result.is_ok(),
    "Operation completed"
);

4. Error Context

Include rich context in errors:

error!(
    task_uuid = %task_uuid,
    step_uuid = %step_uuid,
    error = %err,
    retry_count = retry_count,
    "Step execution failed, will retry"
);

Tools and Integration

Development Tools

Metrics Visualization:

# Prometheus (if configured)
open http://localhost:9090

# Grafana (if configured)
open http://localhost:3000

Log Aggregation:

# Docker Compose logs
docker-compose -f docker/docker-compose.test.yml logs -f

# Specific service
docker-compose -f docker/docker-compose.test.yml logs -f orchestration

# JSON parsing
docker-compose logs orchestration | jq 'select(.level == "ERROR")'

Production Tools (Planned)

  • Metrics: Prometheus + Grafana / DataDog / CloudWatch
  • Logs: Loki / CloudWatch Logs / Splunk
  • Traces: Jaeger / Tempo / Honeycomb
  • Alerts: AlertManager / PagerDuty / Opsgenie


File Organization

Current Files

Active:

  • metrics-reference.md - Complete metrics catalog
  • metrics-verification.md - Verification procedures
  • logging-standards.md - Logging best practices
  • opentelemetry-improvements.md - Telemetry enhancements
  • VERIFICATION_RESULTS.md - Test results

Archived (superseded by docs/benchmarks/):

  • benchmark-implementation-decision.md
  • benchmark-quick-reference.md
  • benchmark-strategy-summary.md
  • benchmarking-guide.md
  • phase-5.4-distributed-benchmarks-plan.md

Move benchmark files to docs/archive/ or delete:

# Option 1: Archive
mkdir -p docs/archive/benchmarks
mv docs/observability/benchmark-*.md docs/archive/benchmarks/
mv docs/observability/phase-5.4-*.md docs/archive/benchmarks/

# Option 2: Delete (information consolidated)
rm docs/observability/benchmark-*.md
rm docs/observability/phase-5.4-*.md

Contributing

When adding observability instrumentation:

  1. Follow standards: Use structured logging and consistent metric naming
  2. Include context: Add correlation IDs and relevant metadata
  3. Document metrics: Update metrics-reference.md with new metrics
  4. Test instrumentation: Verify metrics and logs in development
  5. Consider performance: Avoid expensive operations in hot paths

Benchmark Audit & Profiling Plan

Created: 2025-10-09 Status: 📋 Planning Purpose: Audit existing benchmarks, establish profiling tooling, baseline before Actor/Services refactor


Executive Summary

Before refactoring tasker-orchestration/src/orchestration/lifecycle/ to Actor/Services pattern, we need to:

  1. Audit Benchmarks: Review which benchmarks are implemented vs placeholders
  2. Clean Up: Remove or complete placeholder benchmarks
  3. Establish Profiling: Set up flamegraph/samply tooling
  4. Baseline Profiles: Capture performance profiles for comparison post-refactor

Current Status: We have working SQL and E2E benchmarks but several placeholder component benchmarks that need decisions.


Benchmark Inventory

✅ Working & Complete Benchmarks

1. SQL Function Benchmarks

  • Location: tasker-shared/benches/sql_functions.rs
  • Status: ✅ Complete, Compiles, Well-documented
  • Coverage:
    • get_next_ready_tasks() (4 batch sizes)
    • get_step_readiness_status() (5 diverse samples)
    • transition_task_state_atomic() (5 samples)
    • get_task_execution_context() (5 samples)
    • get_step_transitive_dependencies() (10 samples)
  • Documentation: docs/observability/benchmarking-guide.md
  • Run Command:
    cargo bench --package tasker-shared --features benchmarks
    

2. Event Propagation Benchmarks

  • Location: tasker-shared/benches/event_propagation.rs
  • Status: ✅ Complete, Compiles
  • Coverage: PostgreSQL LISTEN/NOTIFY event propagation
  • Run Command:
    cargo bench --package tasker-shared --features benchmarks event_propagation
    

3. Task Initialization Benchmarks

  • Location: tasker-client/benches/task_initialization.rs
  • Status: ✅ Complete, Compiles
  • Coverage: API task creation latency
  • Run Command:
    export SQLX_OFFLINE=true
    cargo bench --package tasker-client --features benchmarks task_initialization
    

4. End-to-End Workflow Latency Benchmarks

  • Location: tests/benches/e2e_latency.rs
  • Status: ✅ Complete, Compiles
  • Coverage: Complete workflow execution (API → Result)
    • Linear workflow (Ruby FFI)
    • Diamond workflow (Ruby FFI)
    • Linear workflow (Rust native)
    • Diamond workflow (Rust native)
  • Prerequisites: Docker Compose services running
  • Run Command:
    export SQLX_OFFLINE=true
    cargo bench --bench e2e_latency
    

⚠️ Placeholder Benchmarks (Need Decision)

5. Orchestration Benchmarks

  • Location: tasker-orchestration/benches/
  • Files:
    • orchestration_benchmarks.rs - Empty placeholder
    • step_enqueueing.rs - Placeholder with documentation
  • Status: Not implemented
  • Documented Intent: Measure orchestration coordination latency
  • Challenges:
    • Requires triggering orchestration cycle without full execution
    • Need step discovery measurement isolation
    • Queue publishing and notification overhead breakdown

6. Worker Benchmarks

  • Location: tasker-worker/benches/
  • Files:
    • worker_benchmarks.rs - Empty placeholder
    • worker_execution.rs - Placeholder with documentation
    • handler_overhead.rs - Placeholder with documentation
  • Status: Not implemented
  • Documented Intent:
    • Worker processing cycle (claim, execute, submit)
    • Framework overhead vs pure handler execution
    • Ruby FFI overhead measurement
  • Challenges:
    • Need pre-enqueued steps in test queues
    • Noop handler implementations for baseline
    • Breakdown metrics for each phase

Recommendations

Option 1: Keep Placeholders, Document Status (Recommended)

Rationale:

  • Phase 5.4 distributed benchmarks are documented but complex to implement
  • E2E benchmarks (e2e_latency.rs) already provide full workflow metrics
  • SQL benchmarks provide component-level detail
  • Actor/Services refactor is more urgent than distributed component benchmarks

Action:

  • Keep placeholder files with clear “NOT IMPLEMENTED” status
  • Update comments to reference this audit document
  • Future ticket (post-refactor) can implement if needed

Option 2: Remove Placeholders

Rationale:

  • Reduce confusion about benchmark status
  • E2E benchmarks already cover end-to-end latency
  • SQL benchmarks cover database hot paths

Action:

  • Delete placeholder bench files
  • Document decision in this file
  • Can recreate later if specific component isolation needed

Option 3: Implement Placeholders Now

Rationale:

  • Complete benchmark suite before refactor
  • Better baseline data for Actor/Services comparison

Concerns:

  • 2-3 days implementation effort
  • Delays Actor/Services refactor
  • May need re-implementation post-refactor anyway

Decision: Option 1 (Keep Placeholders, Document Status)

We have sufficient benchmarking coverage:

  1. ✅ SQL functions (hot path queries)
  2. ✅ E2E workflows (user-facing latency)
  3. ✅ Event propagation (LISTEN/NOTIFY)
  4. ✅ Task initialization (API latency)

What’s Missing:

  • Component-level orchestration breakdown (not critical for refactor)
  • Worker cycle breakdown (available via OpenTelemetry traces)
  • Framework overhead measurement (nice-to-have, not blocking)

Action Items:

  1. Update placeholder comments with “Status: Planned for future implementation”
  2. Reference this document for implementation guidance
  3. Move forward with profiling and refactor

Profiling Tooling Setup

Goals

  1. Identify Inefficiencies: Find hot spots in lifecycle code
  2. Establish Baseline: Profile before Actor/Services refactor
  3. Compare Post-Refactor: Validate performance impact of refactor
  4. Continuous Profiling: Enable ongoing performance analysis

Tool Selection

Primary: samply (macOS-friendly)

  • GitHub: https://github.com/mstange/samply
  • Advantages:
    • Native macOS support (uses Instruments)
    • Interactive web UI for flamegraphs
    • Low overhead
    • Works with release builds
  • Use Case: Development profiling on macOS

Secondary: flamegraph (CI/production)

  • GitHub: https://github.com/flamegraph-rs/flamegraph
  • Advantages:
    • Linux support (perf-based)
    • SVG output for CI artifacts
    • Well-established in Rust ecosystem
  • Use Case: CI profiling, Linux production analysis

Tertiary: cargo-flamegraph (Convenience)

  • Cargo Plugin: Wraps flamegraph-rs
  • Advantages:
    • Single command profiling
    • Automatic symbol resolution
  • Use Case: Quick local profiling

Installation

macOS Setup (samply)

# Install samply
cargo install samply

# macOS requires SIP adjustment for sampling (one-time setup)
# https://github.com/mstange/samply#macos-permissions

# Verify installation
samply --version

Linux Setup (flamegraph)

# Install prerequisites (Ubuntu/Debian)
sudo apt-get install linux-tools-common linux-tools-generic

# Install flamegraph
cargo install flamegraph

# Allow perf without sudo (optional)
echo 'kernel.perf_event_paranoid=-1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Verify installation
flamegraph --version

Cross-Platform (cargo-flamegraph)

# Install cargo-flamegraph
cargo install cargo-flamegraph

# Verify installation
cargo flamegraph --version

Profiling Workflows

1. Profile E2E Benchmarks

Captures the entire workflow execution including the orchestration lifecycle:

# macOS
samply record cargo bench --bench e2e_latency -- --profile-time=60

# Linux
cargo flamegraph --bench e2e_latency -- --profile-time=60

# Output: Interactive flamegraph showing hot paths

What to Look For:

  • Time spent in lifecycle/ modules (task_initializer, step_enqueuer, result_processor, etc.)
  • Database query time vs business logic time
  • Serialization/deserialization overhead
  • Lock contention (should be minimal with our architecture)

2. Profile SQL Benchmarks

Isolates database performance:

# Profile just SQL function benchmarks
samply record cargo bench --package tasker-shared --features benchmarks sql_functions

# Output: Shows PostgreSQL function overhead

What to Look For:

  • Time in sqlx query execution
  • Connection pool overhead
  • Query planning time (shouldn’t be visible if using prepared statements)

3. Profile Integration Tests (Realistic Workload)

Profile actual test execution for realistic patterns:

# Profile a specific integration test
samply record cargo test --test e2e_tests e2e::rust::simple_integration_tests::test_linear_workflow

# Profile all integration tests (longer run)
samply record cargo test --test e2e_tests --all-features

What to Look For:

  • Initialization overhead
  • Test setup time vs actual execution time
  • Repeated patterns across tests

4. Profile Specific Lifecycle Components

Isolate specific modules for deep analysis:

# Example: Profile only result processing
samply record cargo test --package tasker-orchestration --test lifecycle_integration_tests \
  test_result_processing_updates_task_state --all-features -- --nocapture

# Or profile a unit test for a specific function
samply record cargo test --package tasker-orchestration \
  result_processor::tests::test_process_step_result_success --all-features

Baseline Profiling Plan

Phase 1: Capture Pre-Refactor Baselines (Day 1)

Goal: Establish performance baseline of current lifecycle code before Actor/Services refactor

# 1. Clean build
cargo clean
cargo build --release --all-features

# 2. Profile E2E benchmarks (primary baseline)
samply record --output=baseline-e2e-pre-refactor.json \
  cargo bench --bench e2e_latency

# 3. Profile SQL benchmarks
samply record --output=baseline-sql-pre-refactor.json \
  cargo bench --package tasker-shared --features benchmarks

# 4. Profile specific lifecycle operations
samply record --output=baseline-task-init-pre-refactor.json \
  cargo test --package tasker-orchestration \
  lifecycle::task_initializer::tests --all-features

samply record --output=baseline-step-enqueue-pre-refactor.json \
  cargo test --package tasker-orchestration \
  lifecycle::step_enqueuer::tests --all-features

samply record --output=baseline-result-processor-pre-refactor.json \
  cargo test --package tasker-orchestration \
  lifecycle::result_processor::tests --all-features

Deliverables (completed, profiles removed — superseded by cluster benchmarks):

  • Baseline profile files in profiles/pre-refactor/ (removed)
  • Performance baselines now in docs/benchmarks/README.md

Phase 2: Identify Optimization Opportunities (Day 1)

Goal: Document current performance characteristics to preserve in refactor

Analysis Checklist:

  1. ✅ Time spent in each lifecycle module (task_initializer, step_enqueuer, etc.)
  2. ✅ Database query time breakdown
  3. ✅ Serialization overhead (JSON, MessagePack)
  4. ✅ Lock contention points (if any)
  5. ✅ Unnecessary allocations or clones
  6. ✅ Recursive call depth

Document Findings: Performance baselines are now documented in docs/benchmarks/README.md. The original lifecycle-performance-baseline.md was removed — its measurements had data quality issues and the refactor it targeted is complete.

Phase 3: Post-Refactor Validation (After Refactor)

Goal: Validate Actor/Services refactor maintains or improves performance

# Re-run same profiling commands after refactor
samply record --output=baseline-e2e-post-refactor.json \
  cargo bench --bench e2e_latency

# Compare baselines
# (samply doesn't have built-in diff, use manual comparison)

Success Criteria:

  • E2E latency: Within 10% of baseline (preferably faster)
  • SQL latency: Unchanged (no regression from refactor)
  • Lifecycle operation time: Within 20% of baseline
  • No new hot paths or contention points

Regression Signals:

  • E2E latency >20% slower
  • New allocations/clones in hot paths
  • Increased lock contention
  • Message passing overhead >5% of total time

Profiling Best Practices

1. Use Release Builds

# Always profile release builds (--release flag)
cargo build --release --all-features
samply record cargo bench --bench e2e_latency

Rationale: Debug builds have 10-100x overhead that masks real performance issues

2. Run Multiple Times

# Run 3 times, compare consistency
for i in {1..3}; do
  samply record --output=profile-$i.json cargo bench --bench e2e_latency
done

Rationale: Catch warm-up effects, JIT compilation, cache behavior

3. Isolate Interference

# Close other applications
# Disable background processes (Spotlight, backups)
# Use consistent hardware (don't profile on battery power)

4. Focus on Hot Paths

80/20 Rule: 80% of time is spent in 20% of code

Priority Order:

  1. Top 5 functions by time (>5% each)
  2. Recursive calls (can amplify overhead)
  3. Locks and synchronization (contention multiplies)
  4. Allocations in loops (O(n) becomes visible)

5. Benchmark-Driven Profiling

Always profile realistic workloads:

  • ✅ E2E benchmarks (represents user experience)
  • ✅ Integration tests (real workflow patterns)
  • ❌ Unit tests (too isolated, not representative)

Flamegraph Interpretation

Reading Flamegraphs

┌─────────────────────────────────────────────┐ ← Total Program Time (100%)
│                                             │
│  ┌────────────────┐  ┌─────────────────┐  │
│  │ Database Ops   │  │ Serialization   │  │ ← High-level Operations (60%)
│  │ (30%)          │  │ (30%)           │  │
│  │                │  │                 │  │
│  │  ┌──────────┐  │  │  ┌───────────┐ │  │
│  │  │ SQL Exec │  │  │  │ JSON Ser  │ │  │ ← Leaf Operations (25%)
│  │  │ (25%)    │  │  │  │ (20%)     │ │  │
│  └──┴──────────┴──┘  └──┴───────────┴─┘  │
│                                             │
│  ┌──────────────────────────────────────┐  │
│  │ Business Logic (20%)                  │  │ ← Remaining Time
│  └──────────────────────────────────────┘  │
└─────────────────────────────────────────────┘

Width = Time spent in function (including children) Height = Call stack depth Color = Function group (can be customized)

Key Patterns

1. Wide Flat Bars = Hot Path

┌───────────────────────────────────────┐
│ step_enqueuer::enqueue_ready_steps()  │  ← 40% of total time
└───────────────────────────────────────┘

Action: Optimize this function

2. Deep Call Stack = Recursion/Abstractions

┌─────────────────────────┐
│ process_dependencies()  │
│  ┌─────────────────────┐│
│  │ resolve_deps()      ││
│  │  ┌─────────────────┐││
│  │  │ check_ready()   │││
│  │  └─────────────────┘││
│  └─────────────────────┘│
└─────────────────────────┘

Action: Consider flattening or caching

3. Many Narrow Bars = Fragmentation

┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│A│B│C│D│E│F│G│H│I│J│K│L│M│  ← Many small functions
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘

Action: Not necessarily bad (may be inlining), but check if overhead-heavy


Integration with CI

GitHub Actions Workflow (Future Enhancement)

# .github/workflows/profile-benchmarks.yml
name: Profile Benchmarks

on:
  pull_request:
    paths:
      - 'tasker-orchestration/src/orchestration/lifecycle/**'
      - 'tasker-shared/src/**'

jobs:
  profile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install flamegraph
        run: cargo install flamegraph

      - name: Profile benchmarks
        run: |
          cargo flamegraph --bench e2e_latency -- --profile-time=60 -o flamegraph.svg

      - name: Upload flamegraph
        uses: actions/upload-artifact@v3
        with:
          name: flamegraph
          path: flamegraph.svg

      - name: Compare with baseline
        run: |
          # TODO: Implement baseline comparison
          # Download previous flamegraph, compare hot paths

Documentation Structure

Created Documents

  1. This Document: docs/observability/benchmark-audit-and-profiling-plan.md

    • Benchmark inventory
    • Profiling tooling setup
    • Baseline capture plan
  2. Existing: docs/observability/benchmarking-guide.md

    • SQL benchmark documentation
    • Running instructions
    • Performance expectations
  3. docs/observability/lifecycle-performance-baseline.md (Removed — superseded by docs/benchmarks/README.md)


Next Steps

Before Actor/Services Refactor

  1. Audit Complete: Documented benchmark status
  2. Install Profiling Tools:
    cargo install samply  # macOS
    cargo install flamegraph  # Linux
    
  3. Capture Baselines (1 day):
    • Run profiling plan Phase 1
    • Generate flamegraphs
    • Document hot paths
  4. Baseline Document: Superseded by docs/benchmarks/README.md

During Actor/Services Refactor

  1. Incremental Profiling: Profile after each major component conversion
  2. Compare Baselines: Ensure no performance regressions
  3. Document Changes: Note architectural changes affecting performance

After Actor/Services Refactor

  1. Full Re-Profile: Run profiling plan Phase 3
  2. Comparison Analysis: Document performance changes
  3. Update Documentation: Reflect new architecture
  4. Benchmark Updates: Update benchmarks if Actor/Services changes measurement approach

Summary

Current State:

  • ✅ SQL benchmarks working
  • ✅ E2E benchmarks working
  • ✅ Event propagation benchmarks working
  • ✅ Task initialization benchmarks working
  • ⚠️ Component benchmarks are placeholders (OK for now)

Decision:

  • Keep placeholder benchmarks for future work
  • Move forward with profiling and baseline capture
  • Sufficient coverage to validate Actor/Services refactor

Action Plan:

  1. Install profiling tools (samply/flamegraph)
  2. Capture pre-refactor baselines (1 day)
  3. Document current hot paths
  4. Proceed with Actor/Services refactor
  5. Validate post-refactor performance

Success Criteria:

  • Baseline profiles captured
  • Hot paths documented
  • Post-refactor validation plan established
  • No performance regressions from refactor

Benchmark Implementation Decision: Event-Driven + E2E Focus

Date: 2025-10-08 Decision: Focus on event propagation and E2E benchmarks; infer worker metrics from traces


Context

Original Phase 5.4 plan included 7 benchmark categories:

  1. ✅ API Task Creation
  2. 🚧 Worker Processing Cycle
  3. ✅ Event Propagation
  4. 🚧 Step Enqueueing
  5. 🚧 Handler Overhead
  6. ✅ SQL Functions
  7. ✅ E2E Latency

Architectural Challenge: Worker Benchmarking

Problem: Direct worker benchmarking doesn’t match production reality

In a distributed system with multiple workers:

  • Can’t predict which worker will claim which step
  • Can’t control step distribution across workers
  • Artificial scenarios required to direct specific steps to specific workers
  • API queries would need to know which worker to query (unknowable in advance)

Example:

Task with 10 steps across 3 workers:
- Worker A might claim steps 1, 3, 7
- Worker B might claim steps 2, 5, 6, 9
- Worker C might claim steps 4, 8, 10

Which worker do you benchmark? How do you ensure consistent measurement?

Decision: Focus on Observable Metrics

✅ What We WILL Measure Directly

1. Event Propagation (tasker-shared/benches/event_propagation.rs)

Status: ✅ IMPLEMENTED

Measures: PostgreSQL LISTEN/NOTIFY round-trip latency

Approach:

#![allow(unused)]
fn main() {
// Setup listener on test channel
listener.listen("pgmq_message_ready.benchmark_event_test").await;

// Send message with notify
let send_time = Instant::now();
sqlx::query("SELECT pgmq_send_with_notify(...)").execute(&pool).await;

// Measure until listener receives
let _notification = listener.recv().await;
let latency = send_time.elapsed();
}

Why it works:

  • Observable from outside the system
  • Deterministic measurement (single listener, single sender)
  • Matches production behavior (real LISTEN/NOTIFY path)
  • Critical for worker responsiveness

Expected Performance: < 10ms p95 (< 5ms p50)


2. End-to-End Latency (tests/benches/e2e_latency.rs)

Status: ✅ IMPLEMENTED

Measures: Complete workflow execution (API → Task Complete)

Approach:

#![allow(unused)]
fn main() {
// Start the clock, then create the task so API processing is included
let start = Instant::now();
let response = client.create_task(request).await;

// Poll for completion
loop {
    let task = client.get_task(response.task_uuid).await;
    if task.execution_status == "AllComplete" {
        return start.elapsed();
    }
    tokio::time::sleep(Duration::from_millis(50)).await;
}
}

Why it works:

  • Measures user experience (submit → result)
  • Naturally includes ALL system overhead:
    • API processing
    • Database writes
    • Message queue latency
    • Worker claim/execute/submit (embedded in total time)
    • Event propagation
    • Orchestration coordination
  • No need to know which workers executed which steps
  • Reflects real production behavior

Expected Performance:

  • Linear (3 steps): < 500ms p99
  • Diamond (4 steps): < 800ms p99

📊 What We WILL Infer from Traces

Worker-Level Breakdown via OpenTelemetry

Instead of direct benchmarking, use existing OpenTelemetry instrumentation:

# Query traces by correlation_id from E2E benchmark
curl "http://localhost:16686/api/traces?service=tasker-worker&tags=correlation_id:abc-123"

# Extract span timings:
{
  "spans": [
    {"operationName": "step_claim",       "duration": 15ms},
    {"operationName": "execute_handler",  "duration": 42ms},  // Business logic
    {"operationName": "submit_result",    "duration": 23ms}
  ]
}

Advantages:

  • ✅ Works across distributed workers (correlation ID links everything)
  • ✅ Captures real production behavior (actual task execution)
  • ✅ Breaks down by step type (different handlers have different timing)
  • ✅ Shows which worker processed each step
  • ✅ Already instrumented (Phase 3.3 work)

Metrics Available:

  • step_claim_duration - Time to claim step from queue
  • handler_execution_duration - Time to execute handler logic
  • result_submission_duration - Time to submit result back
  • ffi_overhead - Rust vs Ruby handler comparison
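
To make the trace-based breakdown concrete, the sketch below pulls span timings out of Jaeger by correlation ID and groups them by operation name. It is a minimal illustration, assuming reqwest (with its json feature), serde_json, and tokio as dependencies, Jaeger's query API on localhost:16686 as in the curl example above, and span durations reported in microseconds; the exact tag-query syntax varies by Jaeger version, and URL-encoding is omitted for brevity.

// Sketch: summarize worker span timings for one benchmark run via Jaeger's HTTP API.
use serde_json::Value;
use std::collections::BTreeMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let correlation_id = "xyz-789"; // copied from the E2E benchmark output
    let url = format!(
        "http://localhost:16686/api/traces?service=tasker-worker&tags={{\"correlation_id\":\"{correlation_id}\"}}"
    );

    let body: Value = reqwest::get(&url).await?.json().await?;

    // Group span durations (assumed to be microseconds) by operation name.
    let mut totals: BTreeMap<String, u64> = BTreeMap::new();
    let traces = body["data"].as_array().cloned().unwrap_or_default();
    for trace in &traces {
        let spans = trace["spans"].as_array().cloned().unwrap_or_default();
        for span in &spans {
            let op = span["operationName"].as_str().unwrap_or("unknown").to_string();
            let micros = span["duration"].as_u64().unwrap_or(0);
            *totals.entry(op).or_default() += micros;
        }
    }

    for (op, micros) in totals {
        println!("{op}: {:.1} ms", micros as f64 / 1000.0);
    }
    Ok(())
}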

🚧 Benchmarks NOT Implemented (By Design)

Worker Processing Cycle (tasker-worker/benches/worker_execution.rs)

Status: 🚧 Skeleton only (placeholder)

Why not implemented:

  • Requires artificial pre-arrangement of which worker claims which step
  • Doesn’t match production (multiple workers competing for steps)
  • Metrics available via OpenTelemetry traces instead

Alternative: Query traces for step_claim → execute_handler → submit_result span timing


Step Enqueueing (tasker-orchestration/benches/step_enqueueing.rs)

Status: 🚧 Skeleton only (placeholder)

Why not implemented:

  • Difficult to trigger orchestration step discovery without full execution
  • Result naturally embedded in E2E latency measurement
  • Coordination overhead visible in E2E timing

Alternative: E2E benchmark includes step enqueueing naturally


Handler Overhead (tasker-worker/benches/handler_overhead.rs)

Status: 🚧 Skeleton only (placeholder)

Why not implemented:

  • FFI overhead varies by handler type (can’t benchmark in isolation)
  • Real overhead visible in E2E benchmark + traces
  • Rust vs Ruby comparison available via trace analysis

Alternative: Compare handler_execution_duration spans for Rust vs Ruby handlers in traces


Implementation Summary

✅ Complete Benchmarks (4/7)

| Benchmark | Status | Measures | Run Command |
|-----------|--------|----------|-------------|
| SQL Functions | ✅ Complete | PostgreSQL function performance | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks sql_functions |
| Task Initialization | ✅ Complete | API task creation latency | cargo bench -p tasker-client --features benchmarks |
| Event Propagation | ✅ Complete | LISTEN/NOTIFY round-trip | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks event_propagation |
| E2E Latency | ✅ Complete | Complete workflow execution | cargo bench --test e2e_latency |

🚧 Placeholder Benchmarks (3/7)

| Benchmark | Status | Alternative Measurement |
|-----------|--------|-------------------------|
| Worker Execution | 🚧 Placeholder | OpenTelemetry traces (correlation ID) |
| Step Enqueueing | 🚧 Placeholder | Embedded in E2E latency |
| Handler Overhead | 🚧 Placeholder | OpenTelemetry span comparison (Rust vs Ruby) |

Advantages of This Approach

1. Matches Production Reality

  • E2E benchmark reflects actual user experience
  • No artificial worker pre-arrangement required
  • Measures real distributed system behavior

2. Complete Coverage

  • E2E latency includes ALL components naturally
  • OpenTelemetry provides worker-level breakdown
  • Event propagation measures critical notification path

3. Lower Maintenance

  • Fewer benchmarks to maintain
  • No complex setup for worker isolation
  • Traces provide flexible analysis

4. Better Insights

  • Correlation IDs link entire workflow across services
  • Can analyze timing for ANY task in production
  • Breakdown available on-demand via trace queries

How to Use This System

Running Performance Analysis

Step 1: Run E2E benchmark

cargo bench --test e2e_latency

Step 2: Extract correlation_id from benchmark output

Created task: abc-123-def-456 (correlation_id: xyz-789)

Step 3: Query traces for breakdown

# Jaeger UI or API
curl "http://localhost:16686/api/traces?tags=correlation_id:xyz-789"

Step 4: Analyze span timing

{
  "spans": [
    {"service": "orchestration", "operation": "create_task", "duration": 18ms},
    {"service": "orchestration", "operation": "enqueue_steps", "duration": 12ms},
    {"service": "worker", "operation": "step_claim", "duration": 15ms},
    {"service": "worker", "operation": "execute_handler", "duration": 42ms},
    {"service": "worker", "operation": "submit_result", "duration": 23ms},
    {"service": "orchestration", "operation": "process_result", "duration": 8ms}
  ]
}

Total E2E: ~118ms (matches benchmark)
Worker overhead: 15ms + 23ms = 38ms (claim + submit, excluding business logic)


Recommendations

Completion Criteria

Complete with 4 working benchmarks:

  1. SQL Functions
  2. Task Initialization
  3. Event Propagation
  4. E2E Latency

📋 Document that worker-level metrics come from OpenTelemetry

For Future Enhancement

If direct worker benchmarking becomes necessary:

  1. Use single-worker mode Docker Compose configuration
  2. Pre-create tasks with known step assignments
  3. Query specific worker API for deterministic steps
  4. Document as synthetic benchmark (not matching production)

For Production Monitoring

Use OpenTelemetry for ongoing performance analysis:

  • Set up trace retention (7-30 days)
  • Create Grafana dashboards for span timing
  • Alert on p95 latency increases
  • Analyze slow workflows via correlation ID

Conclusion

Decision: Focus on event propagation and E2E latency benchmarks, use OpenTelemetry traces for worker-level breakdown.

Rationale: Matches production reality, provides complete coverage, lower maintenance, better insights.

Status: ✅ 4/4 practical benchmarks implemented and working

Benchmark Quick Reference Guide

Last Updated: 2025-10-08

Quick commands for running all benchmarks in the distributed benchmarking suite.


Prerequisites

# Start all Docker services
docker-compose -f docker/docker-compose.test.yml up -d

# Verify services are healthy
curl http://localhost:8080/health  # Orchestration
curl http://localhost:8081/health  # Rust Worker
curl http://localhost:8082/health  # Ruby Worker (optional)

# Set database URL (for SQL benchmarks)
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

Individual Benchmarks

✅ Implemented Benchmarks

# 1. API Task Creation (COMPLETE - 17.7-20.8ms)
cargo bench --package tasker-client --features benchmarks

# 2. SQL Function Performance (COMPLETE - 380µs-2.93ms)
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo bench --package tasker-shared --features benchmarks sql_functions

🚧 Placeholder Benchmarks

# 3. Event Propagation (placeholder)
cargo bench --package tasker-shared --features benchmarks event_propagation

# 4. Worker Execution (placeholder)
cargo bench --package tasker-worker --features benchmarks worker_execution

# 5. Handler Overhead (placeholder)
cargo bench --package tasker-worker --features benchmarks handler_overhead

# 6. Step Enqueueing (placeholder)
cargo bench --package tasker-orchestration --features benchmarks step_enqueueing

# 7. End-to-End Latency (placeholder)
cargo bench --test e2e_latency

Run All Benchmarks

# Run ALL benchmarks (implemented + placeholders)
cargo bench --all-features

# Run only SQL benchmarks
cargo bench --package tasker-shared --features benchmarks

# Run only worker benchmarks
cargo bench --package tasker-worker --features benchmarks

Benchmark Categories

| Category | Package | Benchmark Name | Status | Run Command |
|----------|---------|----------------|--------|-------------|
| API | tasker-client | task_initialization | ✅ Complete | cargo bench -p tasker-client --features benchmarks |
| SQL | tasker-shared | sql_functions | ✅ Complete | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks sql_functions |
| Events | tasker-shared | event_propagation | 🚧 Placeholder | cargo bench -p tasker-shared --features benchmarks event_propagation |
| Worker | tasker-worker | worker_execution | 🚧 Placeholder | cargo bench -p tasker-worker --features benchmarks worker_execution |
| Worker | tasker-worker | handler_overhead | 🚧 Placeholder | cargo bench -p tasker-worker --features benchmarks handler_overhead |
| Orchestration | tasker-orchestration | step_enqueueing | 🚧 Placeholder | cargo bench -p tasker-orchestration --features benchmarks |
| E2E | tests | e2e_latency | 🚧 Placeholder | cargo bench --test e2e_latency |

Benchmark Output Locations

# Criterion HTML reports
open target/criterion/report/index.html

# Individual benchmark data
ls target/criterion/

# Proposed: Structured logs (not yet implemented)
# tmp/benchmarks/YYYY-MM-DD-benchmark-name.log

Common Options

# Save baseline for comparison
cargo bench --features benchmarks -- --save-baseline main

# Compare to baseline
cargo bench --features benchmarks -- --baseline main

# Verbose output
cargo bench --features benchmarks -- --verbose

# Run specific benchmark
cargo bench --package tasker-client --features benchmarks task_creation_api

# Skip health checks (CI mode)
TASKER_TEST_SKIP_HEALTH_CHECK=true cargo bench --features benchmarks

Troubleshooting

“Services must be running”

# Start Docker services
docker-compose -f docker/docker-compose.test.yml up -d

# Check service health
curl http://localhost:8080/health

“DATABASE_URL must be set”

export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

“Task template not found”

# Ensure worker services are running (they register templates)
docker-compose -f docker/docker-compose.test.yml ps

# Check registered templates
curl -s http://localhost:8080/v1/handlers | jq

Compilation errors

# Clean and rebuild
cargo clean
cargo build --all-features

Performance Targets

| Benchmark | Metric | Target | Current | Status |
|-----------|--------|--------|---------|--------|
| Task Init (linear) | mean | < 50ms | 17.7ms | ✅ 3x better |
| Task Init (diamond) | mean | < 75ms | 20.8ms | ✅ 3.6x better |
| SQL Task Discovery | mean | < 3ms | 1.75-2.93ms | ✅ Pass |
| SQL Step Readiness | mean | < 1ms | 440-603µs | ✅ Pass |
| Worker Total Overhead | mean | < 60ms | TBD | 🚧 |
| Event Notify (p95) | p95 | < 10ms | TBD | 🚧 |
| Step Enqueue (3 steps) | mean | < 50ms | TBD | 🚧 |
| E2E Complete (3 steps) | p99 | < 500ms | TBD | 🚧 |


Distributed Benchmarking Strategy

Status: 🎯 Framework Complete | Implementation In Progress Last Updated: 2025-10-08


Overview

Complete benchmarking infrastructure for measuring distributed system performance across all components.

Benchmark Suite Structure

✅ Implemented

1. API Task Creation (tasker-client/benches/task_initialization.rs)

Status: ✅ COMPLETE - Fully implemented and tested

Measures:

  • HTTP request → task initialized latency
  • Task record creation in PostgreSQL
  • Initial step discovery from template
  • Response generation and serialization

Results (2025-10-08):

Linear (3 steps):   17.7ms  (Target: < 50ms)  ✅ 3x better than target
Diamond (4 steps):  20.8ms  (Target: < 75ms)  ✅ 3.6x better than target

Run Command:

cargo bench --package tasker-client --features benchmarks

2. SQL Function Performance (tasker-shared/benches/sql_functions.rs)

Status: ✅ COMPLETE - Fully implemented (Phase 5.2)

Measures:

  • 6 critical PostgreSQL function benchmarks
  • Intelligent stratified sampling (5-10 diverse samples per function)
  • EXPLAIN ANALYZE query plan analysis (run once per function)

Results (from Phase 5.2):

Task discovery:            1.75-2.93ms  (O(1) scaling!)
Step readiness:            440-603µs    (37% variance captured)
State transitions:         ~380µs       (±5% variance)
Task execution context:    448-559µs
Step dependencies:         332-343µs
Query plan buffer hit:     100%         (all functions)

Run Command:

DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo bench --package tasker-shared --features benchmarks sql_functions

🚧 Placeholders (Ready for Implementation)

3. Worker Processing Cycle (tasker-worker/benches/worker_execution.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • Claim: PGMQ read + atomic claim
  • Execute: Handler execution (framework overhead)
  • Submit: Result serialization + HTTP submit
  • Total: Complete worker cycle

Targets:

  • Claim: < 20ms
  • Execute (noop): < 10ms
  • Submit: < 30ms
  • Total overhead: < 60ms

Implementation Requirements:

  • Pre-enqueued steps in namespace queues
  • Worker client with breakdown metrics
  • Multiple handler types (noop, calculation, database)
  • Accurate timestamp collection for each phase

Run Command (when implemented):

cargo bench --package tasker-worker --features benchmarks worker_execution
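
As a rough illustration of the "pre-enqueued steps in namespace queues" requirement above, the sketch below pushes placeholder messages into a queue through pgmq's SQL API with sqlx. The queue name, message shape, and call signature are assumptions to check against the project's actual queue conventions; sqlx is assumed to be built with its postgres and json features.

// Sketch only: pre-enqueue placeholder step messages into a namespace queue via pgmq.
use sqlx::PgPool;

async fn pre_enqueue_noop_steps(pool: &PgPool, queue: &str, count: usize) -> Result<(), sqlx::Error> {
    for i in 0..count {
        // Message shape is illustrative; real step messages carry task/step UUIDs.
        let msg = serde_json::json!({ "step_name": "noop", "index": i });
        sqlx::query("SELECT pgmq.send($1, $2)")
            .bind(queue)
            .bind(msg)
            .execute(pool)
            .await?;
    }
    Ok(())
}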

4. Event Propagation (tasker-shared/benches/event_propagation.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • PostgreSQL LISTEN/NOTIFY latency
  • PGMQ pgmq_send_with_notify overhead
  • Event system framework overhead

Targets:

  • p50: < 5ms
  • p95: < 10ms
  • p99: < 20ms

Implementation Requirements:

  • PostgreSQL LISTEN connection setup
  • PGMQ notification channel configuration
  • Concurrent listener with timestamp correlation
  • Accurate cross-thread time measurement

Run Command (when implemented):

cargo bench --package tasker-shared --features benchmarks event_propagation
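
A minimal sketch of that listener-plus-sender measurement using sqlx's PgListener is shown below. The channel name and the pgmq_send_with_notify call follow the plan in this document, but their exact signatures should be confirmed against the schema; sqlx is assumed to be built with its tokio runtime and postgres features.

// Sketch of the LISTEN/NOTIFY round-trip measurement.
use sqlx::postgres::{PgListener, PgPoolOptions};
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
    let pool = PgPoolOptions::new().max_connections(2).connect(&url).await?;

    // Subscribe before sending so the notification cannot be missed.
    let mut listener = PgListener::connect(&url).await?;
    listener.listen("pgmq_message_ready.benchmark_event_test").await?;

    // Send a message that triggers NOTIFY, then time until the listener wakes up.
    // (Function arguments are placeholders; the real schema may differ.)
    let start = Instant::now();
    sqlx::query("SELECT pgmq_send_with_notify('benchmark_event_test', '{}'::jsonb)")
        .execute(&pool)
        .await?;
    let _notification = listener.recv().await?;
    println!("LISTEN/NOTIFY round trip: {:?}", start.elapsed());
    Ok(())
}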

5. Step Enqueueing (tasker-orchestration/benches/step_enqueueing.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • Ready step discovery (SQL query time)
  • Queue publishing (PGMQ write time)
  • Notification overhead (LISTEN/NOTIFY)
  • Total orchestration coordination

Targets:

  • 3-step workflow: < 50ms
  • 10-step workflow: < 100ms
  • 50-step workflow: < 500ms

Implementation Requirements:

  • Pre-created tasks with dependency chains
  • Orchestration client with result processing trigger
  • Queue polling to detect enqueued steps
  • Breakdown metrics (discovery, publish, notify)

Challenge: Triggering step discovery without full workflow execution

Run Command (when implemented):

cargo bench --package tasker-orchestration --features benchmarks step_enqueueing
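
For the "queue polling to detect enqueued steps" requirement, a sketch of a polling helper built on pgmq's read function is shown below. Reading messages applies a visibility timeout, which a real harness may need to account for; the queue name and schema details are assumptions.

// Sketch: poll a pgmq queue until the expected number of steps has been enqueued.
use sqlx::PgPool;
use std::time::{Duration, Instant};

async fn wait_for_enqueued_steps(
    pool: &PgPool,
    queue: &str,
    expected: i32,
    timeout: Duration,
) -> Result<Duration, sqlx::Error> {
    let start = Instant::now();
    let mut seen = 0i32;

    while seen < expected && start.elapsed() < timeout {
        // Read up to `expected` messages with a short (5s) visibility timeout.
        let rows = sqlx::query("SELECT msg_id FROM pgmq.read($1, 5, $2)")
            .bind(queue)
            .bind(expected)
            .fetch_all(pool)
            .await?;
        seen += rows.len() as i32;
        if seen < expected {
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    }

    Ok(start.elapsed())
}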

6. Handler Overhead (tasker-worker/benches/handler_overhead.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • Pure Rust handler (baseline - direct call)
  • Rust handler via framework (dispatch overhead)
  • Ruby handler via FFI (FFI boundary cost)

Targets:

  • Pure Rust: < 1µs (baseline)
  • Via Framework: < 1ms
  • Ruby FFI: < 5ms

Implementation Requirements:

  • Noop handler implementations (Rust + Ruby)
  • Direct function call benchmarks
  • Framework dispatch overhead measurement
  • FFI bridge overhead measurement

Run Command (when implemented):

cargo bench --package tasker-worker --features benchmarks handler_overhead
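
To show the shape such a benchmark could take, here is a minimal Criterion sketch contrasting a direct (statically dispatched) call with trait-object dispatch for a hypothetical NoopHandler. The real handler trait in tasker-worker will differ, and the Ruby FFI leg is omitted because it needs the actual FFI bridge.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical handler trait used only for this sketch.
trait StepHandler {
    fn call(&self, input: u64) -> u64;
}

struct NoopHandler;

impl StepHandler for NoopHandler {
    fn call(&self, input: u64) -> u64 {
        input
    }
}

fn bench_handler_overhead(c: &mut Criterion) {
    let direct = NoopHandler;
    let dynamic: Box<dyn StepHandler> = Box::new(NoopHandler);

    // Baseline: direct, statically dispatched call.
    c.bench_function("handler_overhead/direct_call", |b| {
        b.iter(|| direct.call(black_box(42)))
    });

    // Framework-style dynamic dispatch through a trait object.
    c.bench_function("handler_overhead/dyn_dispatch", |b| {
        b.iter(|| dynamic.call(black_box(42)))
    });
}

criterion_group!(benches, bench_handler_overhead);
criterion_main!(benches);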

7. End-to-End Latency (tests/benches/e2e_latency.rs)

Status: 🚧 Skeleton created - needs implementation

Measures:

  • Complete workflow execution (API → Task Complete)
  • All system components (API, DB, Queue, Worker, Events)
  • Real network overhead
  • Different workflow patterns

Targets:

  • Linear (3 steps): < 500ms p99
  • Diamond (4 steps): < 800ms p99
  • Tree (7 steps): < 1500ms p99

Implementation Requirements:

  • All Docker Compose services running
  • Orchestration client for task creation
  • Polling mechanism for completion detection
  • Multiple workflow templates
  • Timeout handling for stuck workflows

Special Considerations:

  • SLOW by design: Measures real workflow execution (seconds)
  • Fewer samples (sample_size=10 vs 50 default)
  • Higher variance expected (network + system state)
  • Focus on regression detection, not absolute numbers

Run Command (when implemented):

# Requires all Docker services running
docker-compose -f docker/docker-compose.test.yml up -d

cargo bench --test e2e_latency
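
A sketch of the Criterion group configuration implied by the special considerations above is shown below; the sample size, measurement time, and benchmark name are illustrative, and the client code that submits and polls a task is elided.

use criterion::{criterion_group, criterion_main, Criterion};
use std::time::Duration;

fn bench_e2e_latency(c: &mut Criterion) {
    let mut group = c.benchmark_group("e2e_latency");
    group.sample_size(10);                            // far fewer samples than Criterion's default
    group.measurement_time(Duration::from_secs(60));  // each sample executes a real workflow
    group.bench_function("linear_3_steps", |b| {
        b.iter(|| {
            // Submit a task via the orchestration API and poll until execution_status
            // is "AllComplete" (client code elided; see the approach sketch earlier).
        })
    });
    group.finish();
}

criterion_group!(benches, bench_e2e_latency);
criterion_main!(benches);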

Benchmark Output Logging Strategy

Current State

Implemented:

  • Criterion default output (terminal + HTML reports)
  • Custom health check banners in benchmarks
  • EXPLAIN ANALYZE output in SQL benchmarks
  • Inline result commentary

Location: Results saved to target/criterion/

Proposed Consistent Structure

1. Standard Output Format

All benchmarks should follow this pattern:

═══════════════════════════════════════════════════════════════════════════════
🔍 VERIFYING PREREQUISITES
═══════════════════════════════════════════════════════════════════════════════
✅ All prerequisites met
═══════════════════════════════════════════════════════════════════════════════

Benchmarking <category>/<test_name>
...
<category>/<test_name>   time: [X.XX ms Y.YY ms Z.ZZ ms]

═══════════════════════════════════════════════════════════════════════════════
📊 BENCHMARK RESULTS: <CATEGORY NAME>
═══════════════════════════════════════════════════════════════════════════════

Performance Summary:
  • Test 1: X.XX ms  (Target: < YY ms)  ✅ Status
  • Test 2: X.XX ms  (Target: < YY ms)  ⚠️  Status

Key Findings:
  • Finding 1
  • Finding 2

═══════════════════════════════════════════════════════════════════════════════

2. Structured Log Files

Proposal: Create tmp/benchmarks/ directory with dated output:

tmp/benchmarks/
├── 2025-10-08-task-initialization.log
├── 2025-10-08-sql-functions.log
├── 2025-10-08-worker-execution.log
├── ...
└── latest/
    ├── task-initialization.log -> ../2025-10-08-task-initialization.log
    └── summary.md

Log Format (example):

# Benchmark Run: task_initialization
Date: 2025-10-08 14:23:45 UTC
Commit: abc123def456
Environment: Docker Compose Test

## Prerequisites
- [x] Orchestration service healthy (http://localhost:8080)
- [x] Worker service healthy (http://localhost:8081)

## Results

### Linear Workflow (3 steps)
- Mean: 17.748 ms
- Std Dev: 0.624 ms
- Min: 17.081 ms
- Max: 18.507 ms
- Target: < 50 ms
- Status: ✅ PASS (3.0x better than target)
- Outliers: 2/20 (10%)

### Diamond Workflow (4 steps)
- Mean: 20.805 ms
- Std Dev: 0.741 ms
- Min: 19.949 ms
- Max: 21.633 ms
- Target: < 75 ms
- Status: ✅ PASS (3.6x better than target)
- Outliers: 0/20 (0%)

## Summary
✅ All tests passed
🎯 Average performance: 3.3x better than targets
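
As a sketch of the logging utility this proposal implies, the hypothetical helper below writes one dated log file per benchmark run into tmp/benchmarks/. It is not part of the current codebase; the latest/ symlinking and summary generation are left out.

use std::fs;
use std::io::Write;
use std::path::PathBuf;

/// Write one dated benchmark log, e.g. tmp/benchmarks/2025-10-08-task-initialization.log
fn write_benchmark_log(benchmark: &str, date: &str, body: &str) -> std::io::Result<PathBuf> {
    let dir = PathBuf::from("tmp/benchmarks");
    fs::create_dir_all(&dir)?;

    let path = dir.join(format!("{date}-{benchmark}.log"));
    let mut file = fs::File::create(&path)?;
    file.write_all(body.as_bytes())?;
    Ok(path)
}

fn main() -> std::io::Result<()> {
    let report = "# Benchmark Run: task_initialization\n...\n";
    let path = write_benchmark_log("task-initialization", "2025-10-08", report)?;
    println!("wrote {}", path.display());
    Ok(())
}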

3. Baseline Comparison Format

For tracking performance over time:

# Performance Baseline Comparison
Baseline: main branch (2025-10-01)
Current: feature/benchmarks (2025-10-08)

| Benchmark | Baseline | Current | Change | Status |
|-----------|----------|---------|--------|--------|
| task_init/linear | 18.2ms | 17.7ms | -2.7% | ✅ Improved |
| task_init/diamond | 21.1ms | 20.8ms | -1.4% | ✅ Improved |
| sql/task_discovery | 2.91ms | 2.93ms | +0.7% | ✅ Stable |

4. CI Integration Format

For GitHub Actions / CI output:

{
  "benchmark_suite": "task_initialization",
  "timestamp": "2025-10-08T14:23:45Z",
  "commit": "abc123def456",
  "results": [
    {
      "name": "linear_3_steps",
      "mean_ms": 17.748,
      "std_dev_ms": 0.624,
      "target_ms": 50,
      "status": "pass",
      "performance_ratio": 3.0
    }
  ],
  "summary": {
    "total_tests": 2,
    "passed": 2,
    "failed": 0,
    "warnings": 0
  }
}
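
A sketch of serde types that could model this CI payload is shown below; the field names mirror the example above, but this is an illustration rather than an existing schema in the repository (serde with the derive feature and serde_json are assumed).

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct BenchmarkSuiteReport {
    benchmark_suite: String,
    timestamp: String,
    commit: String,
    results: Vec<BenchmarkResult>,
    summary: Summary,
}

#[derive(Serialize, Deserialize)]
struct BenchmarkResult {
    name: String,
    mean_ms: f64,
    std_dev_ms: f64,
    target_ms: f64,
    status: String,
    performance_ratio: f64,
}

#[derive(Serialize, Deserialize)]
struct Summary {
    total_tests: u32,
    passed: u32,
    failed: u32,
    warnings: u32,
}

fn main() -> serde_json::Result<()> {
    // Build the example payload from the values shown above and print it.
    let report = BenchmarkSuiteReport {
        benchmark_suite: "task_initialization".into(),
        timestamp: "2025-10-08T14:23:45Z".into(),
        commit: "abc123def456".into(),
        results: vec![BenchmarkResult {
            name: "linear_3_steps".into(),
            mean_ms: 17.748,
            std_dev_ms: 0.624,
            target_ms: 50.0,
            status: "pass".into(),
            performance_ratio: 3.0,
        }],
        summary: Summary { total_tests: 2, passed: 2, failed: 0, warnings: 0 },
    };
    println!("{}", serde_json::to_string_pretty(&report)?);
    Ok(())
}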

Running All Benchmarks

Quick Reference

# 1. Start Docker services
docker-compose -f docker/docker-compose.test.yml up -d

# 2. Run individual benchmarks
cargo bench --package tasker-client --features benchmarks     # Task initialization
cargo bench --package tasker-shared --features benchmarks     # SQL + Events
cargo bench --package tasker-worker --features benchmarks     # Worker + Handlers
cargo bench --package tasker-orchestration --features benchmarks  # Step enqueueing
cargo bench --test e2e_latency                                # End-to-end

# 3. Run ALL benchmarks (when all implemented)
cargo bench --all-features

Environment Variables

# Required for SQL benchmarks
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

# Optional: Skip health checks (CI)
export TASKER_TEST_SKIP_HEALTH_CHECK="true"

# Optional: Custom service URLs
export TASKER_TEST_ORCHESTRATION_URL="http://localhost:9080"
export TASKER_TEST_WORKER_URL="http://localhost:9081"

Performance Targets Summary

| Category | Component | Metric | Target | Status |
|----------|-----------|--------|--------|--------|
| API | Task Creation (3 steps) | p99 | < 50ms | ✅ 17.7ms |
| API | Task Creation (4 steps) | p99 | < 75ms | ✅ 20.8ms |
| SQL | Task Discovery | mean | < 3ms | ✅ 1.75-2.93ms |
| SQL | Step Readiness | mean | < 1ms | ✅ 440-603µs |
| Worker | Total Overhead | mean | < 60ms | 🚧 TBD |
| Worker | FFI Overhead | mean | < 5ms | 🚧 TBD |
| Events | Notify Latency | p95 | < 10ms | 🚧 TBD |
| Orchestration | Step Enqueueing (3 steps) | mean | < 50ms | 🚧 TBD |
| E2E | Complete Workflow (3 steps) | p99 | < 500ms | 🚧 TBD |

Next Steps

Immediate (Current Session)

  1. ✅ Create all benchmark skeletons
  2. 🎯 Design consistent logging structure
  3. Decide on implementation priorities

Short Term

  1. Implement worker execution benchmark
  2. Implement event propagation benchmark
  3. Create benchmark output logging utilities

Medium Term

  1. Implement step enqueueing benchmark
  2. Implement handler overhead benchmark
  3. Implement E2E latency benchmark

Long Term

  1. CI integration with baseline tracking
  2. Performance regression detection
  3. Automated benchmark reports
  4. Historical performance trending


SQL Function Benchmarking Guide

Created: 2025-10-08 Status: ✅ Complete Location: tasker-shared/benches/sql_functions.rs


Overview

The SQL function benchmark suite measures performance of critical database operations that form the hot paths in the Tasker orchestration system. These benchmarks provide:

  1. Baseline Performance Metrics: Establish expected performance ranges
  2. Regression Detection: Identify performance degradations in code changes
  3. Optimization Guidance: Use EXPLAIN ANALYZE output to guide index/query improvements
  4. Capacity Planning: Understand scaling characteristics with data volume

Quick Start

Prerequisites

# 1. Ensure PostgreSQL is running
pg_isready

# 2. Set up test database
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo sqlx migrate run

# 3. Populate with test data - REQUIRED for representative benchmarks
cargo test --all-features

Important: The benchmarks use intelligent sampling to test diverse task/step complexities. Running integration tests first ensures the database contains various workflow patterns (linear, diamond, parallel) for representative benchmarking.

Running Benchmarks

# Run all SQL benchmarks
cargo bench --package tasker-shared --features benchmarks

# Run specific benchmark group
cargo bench --package tasker-shared --features benchmarks get_next_ready_tasks

# Run with baseline comparison
cargo bench --package tasker-shared --features benchmarks -- --save-baseline main
# ... make changes ...
cargo bench --package tasker-shared --features benchmarks -- --baseline main

Sampling Strategy

The benchmarks use intelligent sampling to ensure representative results:

Task Sampling

  • Samples 5 diverse tasks from different named_task_uuid types
  • Distributes samples across different workflow patterns
  • Maintains deterministic ordering (same UUIDs in same order each run)
  • Provides consistent results while capturing complexity variance

Step Sampling

  • Samples 10 diverse steps from different tasks
  • Selects up to 2 steps per task for variety
  • Captures different DAG depths and dependency patterns
  • Helps identify performance variance in recursive queries

Benefits

  1. Representativeness: No bias from single task/step selection
  2. Consistency: Same samples = comparable baseline comparisons
  3. Variance Detection: Criterion can measure performance across complexities
  4. Real-world Accuracy: Reflects actual production workload diversity

Example Output:

step_readiness_status/calculate_readiness/0    2.345 ms
step_readiness_status/calculate_readiness/1    1.234 ms  (simple linear task)
step_readiness_status/calculate_readiness/2    5.678 ms  (complex diamond DAG)
step_readiness_status/calculate_readiness/3    3.456 ms
step_readiness_status/calculate_readiness/4    2.789 ms
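
A sketch of how this stratified, deterministic sampling might be expressed with sqlx is shown below. The table and column names (tasker.tasks, named_task_uuid, task_uuid) follow the conventions used elsewhere in this document, and the real query in tasker-shared/benches/sql_functions.rs may differ.

use sqlx::PgPool;
use uuid::Uuid;

/// Select a small, deterministic set of tasks spread across template types.
async fn sample_diverse_tasks(pool: &PgPool, per_template: i64) -> Result<Vec<Uuid>, sqlx::Error> {
    let rows: Vec<(Uuid,)> = sqlx::query_as(
        r#"
        SELECT task_uuid FROM (
            SELECT task_uuid,
                   ROW_NUMBER() OVER (PARTITION BY named_task_uuid ORDER BY task_uuid) AS rn
            FROM tasker.tasks
        ) ranked
        WHERE rn <= $1
        ORDER BY task_uuid   -- deterministic ordering keeps samples comparable across runs
        LIMIT 5
        "#,
    )
    .bind(per_template)
    .fetch_all(pool)
    .await?;

    Ok(rows.into_iter().map(|(uuid,)| uuid).collect())
}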

Benchmark Categories

1. Task Discovery (get_next_ready_tasks)

What it measures: Time to discover ready tasks for orchestration

Hot path: Orchestration coordinator → Task discovery

Test parameters:

  • Batch size: 1, 10, 50, 100 tasks
  • Measures function overhead even with empty database

Expected performance:

  • Empty DB: < 5ms for any batch size (function overhead)
  • With data: Should scale linearly, not exponentially

Optimization targets:

  • Index on task state
  • Index on namespace for filtering
  • Efficient processor ownership checks

Example output:

get_next_ready_tasks/batch_size/1
                        time:   [2.1234 ms 2.1567 ms 2.1845 ms]
get_next_ready_tasks/batch_size/10
                        time:   [2.2156 ms 2.2489 ms 2.2756 ms]
get_next_ready_tasks/batch_size/50
                        time:   [2.5123 ms 2.5456 ms 2.5789 ms]
get_next_ready_tasks/batch_size/100
                        time:   [3.0234 ms 3.0567 ms 3.0890 ms]

Analysis: Near-constant time across batch sizes indicates efficient query planning.
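
For orientation, here is a minimal sketch of how a Criterion benchmark can drive an async SQL function across these batch sizes. It assumes Criterion's async_tokio feature, sqlx, and a DATABASE_URL in the environment; the argument list of get_next_ready_tasks shown here is a placeholder and the real signature may differ.

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use sqlx::PgPool;
use tokio::runtime::Runtime;

fn bench_task_discovery(c: &mut Criterion) {
    let rt = Runtime::new().unwrap();
    let url = std::env::var("DATABASE_URL").expect("DATABASE_URL must be set");
    let pool = rt.block_on(async { PgPool::connect(&url).await.unwrap() });

    let mut group = c.benchmark_group("get_next_ready_tasks");
    for batch_size in [1i32, 10, 50, 100] {
        group.bench_with_input(BenchmarkId::new("batch_size", batch_size), &batch_size, |b, &n| {
            b.to_async(&rt).iter(|| async {
                // Argument shape is assumed for illustration only.
                sqlx::query("SELECT * FROM get_next_ready_tasks($1)")
                    .bind(n)
                    .fetch_all(&pool)
                    .await
                    .unwrap()
            })
        });
    }
    group.finish();
}

criterion_group!(benches, bench_task_discovery);
criterion_main!(benches);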


2. Step Readiness (get_step_readiness_status)

What it measures: Time to calculate if a step is ready to execute

Hot path: Step enqueuer → Dependency resolution

Dependencies: Requires test data (tasks with steps)

Expected performance:

  • Simple linear: < 10ms
  • Diamond pattern: < 20ms
  • Complex DAG: < 50ms

Optimization targets:

  • Parent step completion checks
  • Dependency graph traversal
  • Retry backoff calculations

Graceful degradation:

⚠️  Skipping step_readiness_status benchmark - no test data found
    Run integration tests first to populate test data

3. State Transitions (transition_task_state_atomic)

What it measures: Time for atomic state transitions with processor ownership

Hot path: All orchestration operations (initialization, enqueuing, finalization)

Expected performance:

  • Successful transition: < 15ms
  • Failed transition (wrong state): < 10ms (faster path)
  • Contention scenario: < 50ms with backoff

Optimization targets:

  • Atomic compare-and-swap efficiency
  • Index on task_uuid + processor_uuid
  • Transition history table size

4. Task Execution Context (get_task_execution_context)

What it measures: Time to retrieve comprehensive task orchestration status

Hot path: Orchestration coordinator → Status checking

Dependencies: Requires test data (tasks in database)

Expected performance:

  • Simple tasks: < 10ms
  • Complex tasks: < 25ms
  • With many steps: < 50ms

Optimization targets:

  • Step aggregation queries
  • State calculation efficiency
  • Join optimization for step counts

5. Transitive Dependencies (get_step_transitive_dependencies)

What it measures: Time to resolve complete dependency tree for a step

Hot path: Worker → Step execution preparation (once per step lifecycle)

Dependencies: Requires test data (steps with dependencies)

Expected performance:

  • Linear dependencies: < 5ms
  • Diamond pattern: < 10ms
  • Complex DAG (10+ levels): < 25ms

Optimization targets:

  • Recursive CTE performance
  • Index on step dependencies
  • Materialized dependency graphs (future)

Why it matters: Called once per step on worker side when populating step data. While not in orchestration hot path, it affects worker step initialization latency. Recursive CTEs can be expensive with deep dependency trees.


6. EXPLAIN ANALYZE (explain_analyze)

What it measures: Query execution plans, not just timing

How it works: Runs EXPLAIN ANALYZE once per function (no repeated iterations since query plans don’t change between executions)

Functions analyzed:

  • get_next_ready_tasks() - Task discovery query plans
  • get_task_execution_context() - Task status aggregation plans
  • get_step_transitive_dependencies() - Recursive CTE dependency traversal plans

Purpose: Identify optimization opportunities:

  • Sequential scans (need indexes)
  • Nested loop performance
  • Buffer hit ratios
  • Index usage patterns
  • Recursive CTE efficiency

Automatic Query Plan Logging: Each query plan is captured once and analyzed, printing:

  • ⏱️ Execution Time: Actual query execution duration
  • 📋 Planning Time: Time spent planning the query
  • 📦 Node Type: Primary operation type (Aggregate, Index Scan, etc.)
  • 💰 Total Cost: PostgreSQL’s cost estimate
  • ⚠️ Sequential Scan Warning: Alerts for potential missing indexes
  • 📊 Buffer Hit Ratio: Cache efficiency (higher is better)

Example output:

════════════════════════════════════════════════════════════════════════════════
📊 QUERY PLAN ANALYSIS
════════════════════════════════════════════════════════════════════════════════

🔍 Function: get_next_ready_tasks
────────────────────────────────────────────────────────────────────────────────
⏱️  Execution Time: 2.345 ms
📋 Planning Time: 0.123 ms
📦 Node Type: Aggregate
💰 Total Cost: 45.67
📊 Buffer Hit Ratio: 98.5% (197/200 blocks)
────────────────────────────────────────────────────────────────────────────────

Saving Full Plans:

# Save complete JSON plans to target/query_plan_*.json
SAVE_QUERY_PLANS=1 cargo bench --package tasker-shared --features benchmarks

Red flags to investigate:

  • “Seq Scan” on large tables → Add index
  • “Nested Loop” with high iteration count → Optimize join strategy
  • “Sort” operations on large datasets → Add index for ORDER BY
  • Low buffer hit ratio (< 90%) → Increase shared_buffers or investigate I/O

Interpreting Results

Criterion Statistics

Criterion provides comprehensive statistics for each benchmark:

get_next_ready_tasks/batch_size/10
                        time:   [2.2156 ms 2.2489 ms 2.2756 ms]
                        change: [-1.5% +0.2% +1.9%] (p = 0.31 > 0.05)
                        No change in performance detected.
Found 3 outliers among 50 measurements (6.00%)
  2 (4.00%) high mild
  1 (2.00%) high severe

Key metrics:

  • [2.2156 ms 2.2489 ms 2.2756 ms]: Lower bound, mean, upper bound (95% confidence)
  • change: Comparison to baseline (if available)
  • p-value: Statistical significance (p < 0.05 = significant)
  • Outliers: Measurements far from median (cache effects, GC, etc.)

Performance Expectations

Based on Phase 3 metrics verification (26 tasks executed):

| Metric | Expected | Warning | Critical |
|--------|----------|---------|----------|
| Task initialization | < 50ms | 50-100ms | > 100ms |
| Step readiness | < 20ms | 20-50ms | > 50ms |
| State transition | < 15ms | 15-30ms | > 30ms |
| Finalization claim | < 10ms | 10-25ms | > 25ms |

Note: These are function-level times, not end-to-end latencies.


Using Benchmarks for Optimization

Workflow

  1. Establish Baseline

    cargo bench --package tasker-shared --features benchmarks -- --save-baseline main
    
  2. Make Changes (e.g., add index, optimize query)

  3. Compare

    cargo bench --package tasker-shared --features benchmarks -- --baseline main
    
  4. Review Output

    get_next_ready_tasks/batch_size/100
                         time:   [2.0123 ms 2.0456 ms 2.0789 ms]
                         change: [-34.5% -32.1% -29.7%] (p = 0.00 < 0.05)
                         Performance has improved.
    
  5. Analyze EXPLAIN Plans (if improvement isn’t clear)


Common Optimization Patterns

Pattern 1: Missing Index

Symptom: Exponential scaling with data volume

EXPLAIN shows: Seq Scan on tasks

Solution:

CREATE INDEX idx_tasks_state ON tasker.tasks(current_state)
WHERE complete = false;

Pattern 2: Inefficient Join

Symptom: High latency with complex DAGs

EXPLAIN shows: Nested Loop with high row counts

Solution: Use CTE or adjust join strategy

WITH parent_status AS (
  SELECT ... -- Pre-compute parent completions
)
SELECT ... FROM tasker.workflow_steps s
JOIN parent_status ps ON ...

Pattern 3: Large Transaction History

Symptom: State transition slowing over time

EXPLAIN shows: Large scan of task_transitions

Solution: Partition by date or archive old transitions

CREATE TABLE tasker.task_transitions_archive (LIKE tasker.task_transitions);
-- Move old data periodically

Integration with Metrics

The benchmark results should correlate with production metrics:

From metrics-reference.md:

  • tasker_task_initialization_duration_milliseconds → Benchmark: task discovery + initialization
  • tasker_step_result_processing_duration_milliseconds → Benchmark: step readiness + state transitions
  • tasker_task_finalization_duration_milliseconds → Benchmark: finalization claiming

Validation approach:

  1. Run benchmarks: Get ~2ms for task discovery
  2. Check metrics: tasker_task_initialization_duration P95 = ~45ms
  3. Calculate overhead: 45ms - 2ms = 43ms (business logic + framework)

This helps identify where optimization efforts should focus:

  • If benchmark is slow → Optimize SQL/indexes
  • If benchmark is fast but metrics slow → Optimize Rust code

Continuous Integration

# .github/workflows/benchmarks.yml
name: Performance Benchmarks

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: tasker
        options: >-
          --health-cmd pg_isready
          --health-interval 10s

    steps:
      - uses: actions/checkout@v3
      - uses: dtolnay/rust-toolchain@stable

      - name: Run migrations
        run: cargo sqlx migrate run
        env:
          DATABASE_URL: postgresql://postgres:tasker@localhost/test

      - name: Run benchmarks
        run: cargo bench --package tasker-shared --features benchmarks

      - name: Check for regressions
        run: |
          # Parse Criterion output and fail if P95 > threshold
          # This is left as an exercise for CI implementation

Future Enhancements

Phase 5.3: Data Generation (Deferred)

The current benchmarks work with existing test data. Future work could add:

  1. Realistic Data Generation

    • Create 100/1,000/10,000 task datasets
    • Various DAG complexities (linear, diamond, tree)
    • State distribution (60% complete, 20% in-progress, etc.)
  2. Contention Testing

    • Multiple processors competing for same tasks
    • Race condition scenarios
    • Deadlock detection
  3. Long-Running Benchmarks

    • Memory leak detection
    • Connection pool exhaustion
    • Query plan cache effects

Troubleshooting

Benchmark fails with “DATABASE_URL must be set”

export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

All benchmarks show “no test data found”

# Run integration tests to populate database
cargo test --all-features

# Or run specific test suite
cargo test --package tasker-shared --all-features

Benchmarks are inconsistent/noisy

  • Close other applications
  • Ensure PostgreSQL isn’t under load
  • Run benchmarks multiple times
  • Increase sample_size in benchmark code

Results don’t match production metrics

  • Production has different data volumes
  • Network latency in production
  • Different PostgreSQL version/configuration
  • Connection pool overhead in production

References

  • Criterion Documentation: https://bheisler.github.io/criterion.rs/book/
  • PostgreSQL EXPLAIN: https://www.postgresql.org/docs/current/sql-explain.html
  • Phase 3 Metrics: docs/observability/metrics-reference.md
  • Verification Results: docs/observability/VERIFICATION_RESULTS.md

Sign-Off

Phase 5.2 Status: ✅ COMPLETE

Benchmarks Implemented:

  • get_next_ready_tasks() - 4 batch sizes
  • get_step_readiness_status() - with graceful skip
  • transition_task_state_atomic() - atomic operations
  • get_task_execution_context() - orchestration status retrieval
  • get_step_transitive_dependencies() - recursive dependency traversal
  • EXPLAIN ANALYZE - query plan capture with automatic analysis

Documentation Complete:

  • ✅ Quick start guide
  • ✅ Interpretation guidance
  • ✅ Optimization patterns
  • ✅ Integration with metrics
  • ✅ CI recommendations

Next Steps: Run benchmarks with real data and establish baseline performance targets.

Tasker-Core Logging Standards

Version: 1.0 Last Updated: 2025-10-07 Status: Active Related: Observability Standardization


Table of Contents

  1. Philosophy
  2. Log Levels
  3. Structured Fields
  4. Message Style
  5. Instrument Macro
  6. Error Handling
  7. Examples
  8. Enforcement

Philosophy

Principles:

  • Production-First: Logs must be parseable, searchable, and professional
  • Correlation-Driven: All operations include correlation_id for distributed tracing
  • Structured: Fields over string interpolation for aggregation and querying
  • Concise: Clear, actionable messages without noise
  • Consistent: Predictable patterns across all code

Anti-Patterns to Avoid:

  • ❌ Emojis (🚀✅❌) - Breaks log parsers, unprofessional
  • ❌ All-caps prefixes (“BOOTSTRAP:”, “CORE:”) - Redundant with module paths
  • ❌ Ticket references (“JIRA-123”, “PROJ-40”) - Internal, meaningless externally
  • ❌ String interpolation - Use structured fields instead
  • ❌ Verbose messages - Be concise, let fields provide detail

Log Levels

ERROR - Unrecoverable Failures

When to Use:

  • Database connection permanently lost
  • Critical system component failure
  • Unrecoverable state machine violation
  • Data corruption detected
  • Message queue unavailable

Characteristics:

  • Requires immediate human intervention
  • Service degradation or outage
  • Cannot automatically recover
  • Should trigger alerts/pages

Example:

#![allow(unused)]
fn main() {
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Failed to claim task for finalization: database unavailable"
);
}

WARN - Degraded Operation

When to Use:

  • Retryable failures after exhausting retries
  • Circuit breaker opened (degraded mode)
  • Fallback behavior activated
  • Rate limiting engaged
  • Configuration issues (non-fatal)
  • Unexpected but handled conditions

Characteristics:

  • Service continues but degraded
  • Automatic recovery possible
  • Should be monitored for patterns
  • May indicate upstream problems

Example:

#![allow(unused)]
fn main() {
warn!(
    correlation_id = %correlation_id,
    step_uuid = %step_uuid,
    retry_count = attempts,
    max_retries = max_attempts,
    next_retry_at = ?next_retry,
    "Step execution failed after max retries, will not retry further"
);
}

INFO - Lifecycle Events

When to Use:

  • System startup/shutdown
  • Task created/completed/failed
  • Step enqueued/completed
  • State transitions (task/step)
  • Configuration loaded
  • Significant business events

Characteristics:

  • Normal operation milestones
  • Useful for understanding flow
  • Production-ready verbosity
  • Default log level in production

Example:

#![allow(unused)]
fn main() {
info!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    steps_enqueued = count,
    duration_ms = elapsed.as_millis(),
    "Task initialization complete"
);
}

DEBUG - Detailed Diagnostics

When to Use:

  • Discovery query results
  • Queue depth checks
  • Dependency analysis details
  • Configuration value dumps
  • State machine transition details
  • Detailed operation flow

Characteristics:

  • Troubleshooting information
  • Not shown in production (usually)
  • Safe to be verbose
  • Helps understand “why”

Example:

#![allow(unused)]
fn main() {
debug!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    viable_steps = steps.len(),
    pending_steps = pending.len(),
    blocked_steps = blocked.len(),
    "Step readiness analysis complete"
);
}

TRACE - Very Verbose

When to Use:

  • Function entry/exit in hot paths
  • Loop iteration details
  • Deep parameter inspection
  • Performance profiling hooks

Characteristics:

  • Extremely verbose
  • Usually disabled even in dev
  • Performance impact acceptable
  • Use sparingly

Example:

#![allow(unused)]
fn main() {
trace!(
    correlation_id = %correlation_id,
    iteration = i,
    "Polling loop iteration"
);
}

Structured Fields

Required Fields (Context-Dependent)

Always Include:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,  // ALWAYS when available
}

When Task Context Available:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
task_uuid = %task_uuid,
namespace = %namespace,
}

When Step Context Available:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
task_uuid = %task_uuid,
step_uuid = %step_uuid,
namespace = %namespace,
}

For Operations:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
// ... entity IDs ...
operation = "step_enqueue",        // Operation identifier
duration_ms = elapsed.as_millis(), // Timing for operations
}

For Errors:

#![allow(unused)]
fn main() {
correlation_id = %correlation_id,
// ... entity IDs ...
error = %e,                        // Error Display
error_type = %type_name::<E>(),   // Optional: Error type
}

Field Ordering (MANDATORY)

Standard Order:

  1. correlation_id (always first)
  2. Entity IDs (task_uuid, step_uuid, namespace)
  3. Operation/Action (operation, state, status)
  4. Measurements (duration_ms, count, size)
  5. Error Info (error, error_type, context)
  6. Other Context (additional fields)

Example:

#![allow(unused)]
fn main() {
info!(
    // 1. Correlation ID (ALWAYS FIRST)
    correlation_id = %correlation_id,

    // 2. Entity IDs
    task_uuid = %task_uuid,
    step_uuid = %step_uuid,
    namespace = %namespace,

    // 3. Operation
    operation = "step_transition",
    from_state = %old_state,
    to_state = %new_state,

    // 4. Measurements
    duration_ms = elapsed.as_millis(),

    // 5. No errors (success case)

    // 6. Other context
    processor_id = %processor_uuid,

    "Step state transition complete"
);
}

Field Formatting

Use Display Formatting (%):

#![allow(unused)]
fn main() {
// ✅ CORRECT: Let tracing handle formatting
correlation_id = %correlation_id,
task_uuid = %task_uuid,
error = %e,
}

Avoid Manual Conversion:

#![allow(unused)]
fn main() {
// ❌ WRONG: Manual to_string()
task_uuid = task_uuid.to_string(),

// ❌ WRONG: Debug formatting for production types
task_uuid = ?task_uuid,  // Use ? only for Debug types
}

Field Naming:

#![allow(unused)]
fn main() {
// ✅ Standard names
duration_ms          // Not elapsed_ms, time_ms
error                // Not err, error_message
step_uuid            // Not workflow_step_uuid (be consistent)
retry_count          // Not attempts, retries
max_retries          // Not max_attempts
}

Message Style

Guidelines

DO:

  • ✅ Be concise and actionable
  • ✅ Use present tense for states: “Step enqueued”
  • ✅ Use past tense for events: “Task completed”
  • ✅ Start with the subject: “Task completed” not “Successfully completed task”
  • ✅ Focus on WHAT happened (fields show HOW)

DON’T:

  • ❌ Use emojis: “🚀 Starting…” → “Starting orchestration system”
  • ❌ Use all-caps prefixes: “BOOTSTRAP: Starting…” → “Starting orchestration bootstrap”
  • ❌ Include ticket numbers: “PROJ-40: Processing…” → “Processing command”
  • ❌ Be redundant: “Successfully enqueued step successfully” → “Step enqueued”
  • ❌ Include technical jargon: “Atomic CAS transition succeeded” → “State transition complete”
  • ❌ Be verbose: Keep messages under 10 words ideally

Before/After Examples

Lifecycle Events:

#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("🚀 BOOTSTRAP: Starting unified orchestration system bootstrap");

// ✅ AFTER
info!("Starting orchestration system bootstrap");
}

Operation Completion:

#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("✅ STEP_ENQUEUER: Successfully marked step {} as enqueued", step_uuid);

// ✅ AFTER
info!(
    correlation_id = %correlation_id,
    step_uuid = %step_uuid,
    "Step marked as enqueued"
);
}

Error Handling:

#![allow(unused)]
fn main() {
// ❌ BEFORE
error!("❌ ORCHESTRATION_LOOP: Failed to process task {}: {}", task_uuid, e);

// ✅ AFTER
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Task processing failed"
);
}

Shutdown:

#![allow(unused)]
fn main() {
// ❌ BEFORE
info!("🛑 Shutdown signal received, initiating graceful shutdown...");

// ✅ AFTER
info!("Shutdown signal received, initiating graceful shutdown");
}

Instrument Macro

When to Use

Use #[instrument] for:

  • Function-level spans in hot paths
  • Automatic correlation ID tracking
  • Operations that should appear in traces
  • Functions with significant duration

Benefits:

  • Automatic span creation
  • Automatic timing
  • Better OpenTelemetry integration (Phase 2)
  • Cleaner code

Example

#![allow(unused)]
fn main() {
use tracing::instrument;

#[instrument(skip(self), fields(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    namespace = %namespace
))]
pub async fn process_task(
    &self,
    correlation_id: Uuid,
    task_uuid: Uuid,
    namespace: String,
) -> Result<TaskResult> {
    // Span automatically created with fields above
    info!("Starting task processing");

    // ... implementation ...

    info!(
        duration_ms = start.elapsed().as_millis(),
        "Task processing complete"
    );

    Ok(result)
}
}

Skip Parameters

Always skip:

  • self (redundant)
  • Large structures (use specific fields instead)
  • Sensitive data (passwords, tokens, PII)
#![allow(unused)]
fn main() {
#[instrument(
    skip(self, context),  // Skip large context
    fields(
        correlation_id = %correlation_id,
        task_uuid = %context.task_uuid,  // Extract specific fields
    )
)]
}

Error Handling

Error Context

Always include:

#![allow(unused)]
fn main() {
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,              // Error Display (user-friendly)
    error_type = %type_name::<E>(),  // Optional: For classification
    "Operation failed"
);
}

Error Propagation

#![allow(unused)]
fn main() {
// ✅ Log and return for caller to handle
debug!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Step discovery query failed, will retry"
);
return Err(e);

// ❌ Don't log at every level (causes noise)
// Instead: Log once at appropriate level where handled
}

Error Classification

#![allow(unused)]
fn main() {
match result {
    Err(e) if e.is_retryable() => {
        warn!(
            correlation_id = %correlation_id,
            error = %e,
            retry_count = attempts,
            "Operation failed, will retry"
        );
    }
    Err(e) => {
        error!(
            correlation_id = %correlation_id,
            error = %e,
            "Operation failed permanently"
        );
    }
    Ok(result) => {
        info!(
            correlation_id = %correlation_id,
            duration_ms = elapsed.as_millis(),
            "Operation completed successfully"
        );
    }
}
}

Examples

Complete Examples by Scenario

Task Initialization

#![allow(unused)]
fn main() {
#[instrument(skip(self), fields(
    correlation_id = %task_request.correlation_id,
    task_name = %task_request.name,
    namespace = %task_request.namespace
))]
pub async fn create_task_from_request(
    &self,
    task_request: TaskRequest,
) -> Result<TaskInitializationResult> {
    let correlation_id = task_request.correlation_id;
    let start = Instant::now();

    info!("Starting task initialization");

    // Create task
    let task = self.create_task(&task_request).await?;

    debug!(
        task_uuid = %task.task_uuid,
        template_uuid = %task.named_task_uuid,
        "Task created in database"
    );

    // Discover steps
    let steps = self.discover_initial_steps(task.task_uuid).await?;

    info!(
        correlation_id = %correlation_id,
        task_uuid = %task.task_uuid,
        step_count = steps.len(),
        duration_ms = start.elapsed().as_millis(),
        "Task initialization complete"
    );

    Ok(TaskInitializationResult {
        task_uuid: task.task_uuid,
        step_count: steps.len(),
    })
}
}

Step Enqueueing

#![allow(unused)]
fn main() {
pub async fn enqueue_step(
    &self,
    correlation_id: Uuid,
    task_uuid: Uuid,
    step: &ViableStep,
) -> Result<()> {
    let start = Instant::now();

    debug!(
        correlation_id = %correlation_id,
        task_uuid = %task_uuid,
        step_uuid = %step.step_uuid,
        step_name = %step.name,
        queue = %step.queue_name,
        "Enqueueing step"
    );

    let message = self.create_message(correlation_id, task_uuid, step)?;

    self.pgmq_client
        .send(&step.queue_name, &message)
        .await?;

    info!(
        correlation_id = %correlation_id,
        task_uuid = %task_uuid,
        step_uuid = %step.step_uuid,
        queue = %step.queue_name,
        duration_ms = start.elapsed().as_millis(),
        "Step enqueued"
    );

    Ok(())
}
}

Error Handling

#![allow(unused)]
fn main() {
match self.process_step_result(result).await {
    Ok(()) => {
        info!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            duration_ms = elapsed.as_millis(),
            "Step result processed"
        );
    }
    Err(e) if e.is_retryable() => {
        warn!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            error = %e,
            retry_count = result.attempts,
            "Step result processing failed, will retry"
        );
        return Err(e);
    }
    Err(e) => {
        error!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            error = %e,
            "Step result processing failed permanently"
        );
        return Err(e);
    }
}
}

Bootstrap/Shutdown

#![allow(unused)]
fn main() {
pub async fn bootstrap() -> Result<OrchestrationSystemHandle> {
    info!("Starting orchestration system bootstrap");

    let config = ConfigManager::load()?;
    debug!(environment = %config.environment, "Configuration loaded");

    let context = SystemContext::from_config(config).await?;
    info!(processor_uuid = %context.processor_uuid, "System context initialized");

    let core = OrchestrationCore::new(context).await?;
    info!("Orchestration core initialized");

    // ... more initialization ...

    info!(
        processor_uuid = %core.processor_uuid,
        namespaces = ?core.supported_namespaces,
        "Orchestration system bootstrap complete"
    );

    Ok(handle)
}

pub async fn shutdown(&mut self) -> Result<()> {
    info!("Initiating graceful shutdown");

    if let Some(coordinator) = &self.event_coordinator {
        coordinator.stop().await?;
        debug!("Event coordinator stopped");
    }

    info!("Orchestration system shutdown complete");
    Ok(())
}
}

Enforcement

Code Review Checklist

Before merging, verify:

  • No emojis in log messages
  • No all-caps component prefixes
  • No ticket references in runtime logs
  • correlation_id present in all task/step operations
  • Structured fields follow standard ordering
  • Messages are concise and actionable
  • Appropriate log levels used
  • Error context is complete

CI Checks

Recommended lints (future):

# Check for emojis
! grep -r '[🔧✅🚀❌⚠️📊🔍🎉🛡️⏱️📝🏗️🎯🔄💡📦🧪🌉🔌⏳🛑]' src/

# Check for all-caps prefixes
! grep -rE '(info|debug|warn|error)!\(".*[A-Z_]{3,}:' src/

# Check for ticket references in logs (allow in comments)
! grep -rE '(info|debug|warn|error)!.*[A-Z]+-[0-9]+' src/

Pre-commit Hook

Add to .git/hooks/pre-commit:

#!/bin/bash
./scripts/audit-logging.sh --check || {
    echo "❌ Logging standards violation detected"
    echo "Run: ./scripts/audit-logging.sh for details"
    exit 1
}

Migration Guide

For Existing Code

  1. Remove emojis: Use find/replace
  2. Remove all-caps prefixes: Simple cleanup
  3. Add correlation_id: Extract from context
  4. Reorder fields: correlation_id first
  5. Shorten messages: Remove redundancy
  6. Verify log levels: Lifecycle = INFO, diagnostics = DEBUG

For New Code

  1. Always include correlation_id when context available
  2. Use #[instrument] for significant functions
  3. Follow field ordering: correlation_id, IDs, operation, measurements, errors
  4. Keep messages concise: Under 10 words
  5. Choose appropriate level: ERROR (fatal), WARN (degraded), INFO (lifecycle), DEBUG (diagnostic)

FAQ

Q: Should I use info! or debug! for step enqueueing? A: info! - It’s a significant lifecycle event even if frequent.

Q: When should I add duration_ms? A: For any operation that:

  • Calls external systems (DB, queue)
  • Is in the hot path
  • Takes >10ms typically
  • Needs performance monitoring

Q: Can I use emojis in error messages? A: No. Never use emojis in any log message. They break parsers and are unprofessional.

Q: Should correlation_id really always be first? A: Yes. This enables easy correlation across all logs. It’s the #1 most important field for distributed tracing.

Q: What about ticket references in module docs? A: Acceptable in module-level documentation for architectural context. Remove from runtime logs and inline comments.

Q: Can I include stack traces in logs? A: Use error = %e which includes the error chain. Only add explicit backtrace for truly exceptional cases.




Document End

This is a living document. Propose changes via PR with rationale.

OpenTelemetry Metrics Reference

Status: ✅ Complete Export Interval: 60 seconds OTLP Endpoint: http://localhost:4317 Grafana UI: http://localhost:3000

This document provides a complete reference for all OpenTelemetry metrics instrumented in the Tasker orchestration system.



Overview

The Tasker system exports 47+ OpenTelemetry metrics across 5 domains:

| Domain | Metrics | Description |
|--------|---------|-------------|
| Orchestration | 11 | Task lifecycle, step coordination, finalization |
| Worker | 10 | Step execution, claiming, result submission |
| Resilience | 8+ | Circuit breakers, MPSC channels |
| Database | 7 | SQL query performance, connection pools |
| Messaging | 11 | PGMQ queue operations, message processing |

All metrics include correlation_id labels for distributed tracing correlation with Tempo traces.

Histogram Metric Naming

OpenTelemetry automatically appends _milliseconds to histogram metric names when the unit is specified as ms. This provides clarity in Prometheus queries.

Pattern: metric_name → metric_name_milliseconds_{bucket,sum,count}

Example:

  • Code: tasker.step.execution.duration with unit “ms”
  • Prometheus: tasker_step_execution_duration_milliseconds_*

Query Patterns: Instant vs Rate-Based

Instant/Recent Data Queries - Use these when:

  • Testing with burst/batch task execution
  • Viewing data from recent runs (last few minutes)
  • Data points are sparse or clustered together
  • You want simple averages without time windows

Rate-Based Queries - Use these when:

  • Continuous production monitoring
  • Data flows steadily over time
  • Calculating per-second rates
  • Building alerting rules

Why the difference matters: The rate() function calculates a per-second rate of change over a time window, which requires data points spread across that window. If you run 26 tasks in quick succession, the data points cluster at essentially one timestamp, and rate() returns no data because there are too few samples across the window to compute a rate.
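For example, both forms of the task completion counter query (documented in full below) are valid, but they behave differently on sparse data:

# Instant: current counter value, returns data immediately after a burst run
tasker_tasks_completions_total

# Rate-based: per-second completion rate, needs samples spread across the 5m window
rate(tasker_tasks_completions_total[5m])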


Configuration

Enable OpenTelemetry

File: config/tasker/environments/development/telemetry.toml

[telemetry]
enabled = true
service_name = "tasker-core-dev"
sample_rate = 1.0

[telemetry.opentelemetry]
enabled = true  # Must be true to export metrics

Verify in Logs

# Check the orchestration/worker logs for the telemetry init line:
grep "opentelemetry_enabled" tmp/*.log

# Should see:
# opentelemetry_enabled=true
# NOT: Metrics collection disabled (TELEMETRY_ENABLED=false)

Orchestration Metrics

Module: tasker-shared/src/metrics/orchestration.rs Instrumentation: tasker-orchestration/src/orchestration/lifecycle/*.rs

Counters

tasker.tasks.requests.total

Description: Total number of task creation requests received Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • task_type: Task name (e.g., “mathematical_sequence”)
  • namespace: Task namespace (e.g., “rust_e2e_linear”)

Instrumented In: task_initializer.rs:start_task_initialization()

Example Query:

# Total task requests
tasker_tasks_requests_total

# By namespace
sum by (namespace) (tasker_tasks_requests_total)

# Specific correlation_id
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Output: (To be verified)


tasker.tasks.completions.total

Description: Total number of tasks that completed successfully Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID

Instrumented In: task_finalizer.rs:finalize_task() (FinalizationAction::Completed)

Example Query:

# Total completions
tasker_tasks_completions_total

# Completion rate over 5 minutes
rate(tasker_tasks_completions_total[5m])

Expected Output: (To be verified)


tasker.tasks.failures.total

Description: Total number of tasks that failed Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID

Instrumented In: task_finalizer.rs:finalize_task() (FinalizationAction::Failed)

Example Query:

# Total failures
tasker_tasks_failures_total

# Error rate over 5 minutes
rate(tasker_tasks_failures_total[5m])

Expected Output: (To be verified)


tasker.steps.enqueued.total

Description: Total number of steps enqueued to worker queues Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • namespace: Task namespace
  • step_name: Name of the enqueued step

Instrumented In: step_enqueuer.rs:enqueue_steps()

Example Query:

# Total steps enqueued
tasker_steps_enqueued_total

# By step name
sum by (step_name) (tasker_steps_enqueued_total)

# For specific task
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Output: (To be verified)


tasker.step_results.processed.total

Description: Total number of step results processed from workers Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • result_type: “success”, “error”, “timeout”, “cancelled”, “skipped”

Instrumented In: result_processor.rs:process_step_result()

Example Query:

# Total results processed
tasker_step_results_processed_total

# By result type
sum by (result_type) (tasker_step_results_processed_total)

# Success rate
rate(tasker_step_results_processed_total{result_type="success"}[5m])

Expected Output: (To be verified)


Histograms

tasker.task.initialization.duration

Description: Task initialization duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_task_initialization_duration_milliseconds_bucket
  • tasker_task_initialization_duration_milliseconds_sum
  • tasker_task_initialization_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • task_type: Task name

Instrumented In: task_initializer.rs:start_task_initialization()

Example Queries:

Instant/Recent Data (works immediately after task execution):

# Simple average initialization time
tasker_task_initialization_duration_milliseconds_sum /
tasker_task_initialization_duration_milliseconds_count

# P95 latency
histogram_quantile(0.95, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

# P99 latency
histogram_quantile(0.99, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

Rate-Based (for continuous monitoring, requires data spread over time):

# Average initialization time over 5 minutes
rate(tasker_task_initialization_duration_milliseconds_sum[5m]) /
rate(tasker_task_initialization_duration_milliseconds_count[5m])

# P95 latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))

Expected Output: ✅ Verified - Returns millisecond values


tasker.task.finalization.duration

Description: Task finalization duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_task_finalization_duration_milliseconds_bucket
  • tasker_task_finalization_duration_milliseconds_sum
  • tasker_task_finalization_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • final_state: “complete”, “error”, “cancelled”

Instrumented In: task_finalizer.rs:finalize_task()

Example Queries:

Instant/Recent Data:

# Simple average finalization time
tasker_task_finalization_duration_milliseconds_sum /
tasker_task_finalization_duration_milliseconds_count

# P95 by final state
histogram_quantile(0.95,
  sum by (final_state, le) (
    tasker_task_finalization_duration_milliseconds_bucket
  )
)

Rate-Based:

# Average finalization time over 5 minutes
rate(tasker_task_finalization_duration_milliseconds_sum[5m]) /
rate(tasker_task_finalization_duration_milliseconds_count[5m])

# P95 by final state over 5 minutes
histogram_quantile(0.95,
  sum by (final_state, le) (
    rate(tasker_task_finalization_duration_milliseconds_bucket[5m])
  )
)

Expected Output: ✅ Verified - Returns millisecond values


tasker.step_result.processing.duration

Description: Step result processing duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_step_result_processing_duration_milliseconds_bucket
  • tasker_step_result_processing_duration_milliseconds_sum
  • tasker_step_result_processing_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • result_type: “success”, “error”, “timeout”, “cancelled”, “skipped”

Instrumented In: result_processor.rs:process_step_result()

Example Queries:

Instant/Recent Data:

# Simple average result processing time
tasker_step_result_processing_duration_milliseconds_sum /
tasker_step_result_processing_duration_milliseconds_count

# P50, P95, P99 latencies
histogram_quantile(0.50, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
histogram_quantile(0.95, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
histogram_quantile(0.99, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))

Rate-Based:

# Average result processing time over 5 minutes
rate(tasker_step_result_processing_duration_milliseconds_sum[5m]) /
rate(tasker_step_result_processing_duration_milliseconds_count[5m])

# P95 latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_step_result_processing_duration_milliseconds_bucket[5m])))

Expected Output: ✅ Verified - Returns millisecond values


Gauges

tasker.tasks.active

Description: Number of tasks currently being processed Type: Gauge (u64) Labels:

  • state: Current task state

Status: Planned (not yet instrumented)


tasker.steps.ready

Description: Number of steps ready for execution Type: Gauge (u64) Labels:

  • namespace: Worker namespace

Status: Planned (not yet instrumented)


Worker Metrics

Module: tasker-shared/src/metrics/worker.rs Instrumentation: tasker-worker/src/worker/*.rs

Counters

tasker.steps.executions.total

Description: Total number of step executions attempted Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID

Instrumented In: command_processor.rs:handle_execute_step()

Example Query:

# Total step executions
tasker_steps_executions_total

# Execution rate
rate(tasker_steps_executions_total[5m])

# For specific task
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Output: (To be verified)


tasker.steps.successes.total

Description: Total number of step executions that completed successfully Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • namespace: Worker namespace

Instrumented In: command_processor.rs:handle_execute_step() (success path)

Example Query:

# Total successes
tasker_steps_successes_total

# Success rate
rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])

# By namespace
sum by (namespace) (tasker_steps_successes_total)

Expected Output: (To be verified)


tasker.steps.failures.total

Description: Total number of step executions that failed Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • namespace: Worker namespace (or “unknown” for early failures)
  • error_type: “claim_failed”, “database_error”, “step_not_found”, “message_deletion_failed”

Instrumented In: command_processor.rs:handle_execute_step() (error paths)

Example Query:

# Total failures
tasker_steps_failures_total

# Failure rate
rate(tasker_steps_failures_total[5m]) / rate(tasker_steps_executions_total[5m])

# By error type
sum by (error_type) (tasker_steps_failures_total)

# Error distribution
topk(5, sum by (error_type) (tasker_steps_failures_total))

Expected Output: (To be verified)


tasker.steps.claimed.total

Description: Total number of steps claimed from queues Type: Counter (u64) Labels:

  • namespace: Worker namespace
  • claim_method: “event”, “poll”

Instrumented In: step_claim.rs:try_claim_step()

Example Query:

# Total claims
tasker_steps_claimed_total

# By claim method
sum by (claim_method) (tasker_steps_claimed_total)

# Claim rate
rate(tasker_steps_claimed_total[5m])

Expected Output: (To be verified)


tasker.steps.results_submitted.total

Description: Total number of step results submitted to orchestration Type: Counter (u64) Labels:

  • correlation_id: Request correlation ID
  • result_type: “completion”

Instrumented In: orchestration_result_sender.rs:send_completion()

Example Query:

# Total submissions
tasker_steps_results_submitted_total

# Submission rate
rate(tasker_steps_results_submitted_total[5m])

# For specific task
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Output: (To be verified)


Histograms

tasker.step.execution.duration

Description: Step execution duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_step_execution_duration_milliseconds_bucket
  • tasker_step_execution_duration_milliseconds_sum
  • tasker_step_execution_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • namespace: Worker namespace
  • result: “success”, “error”

Instrumented In: command_processor.rs:handle_execute_step()

Example Queries:

Instant/Recent Data:

# Simple average execution time
tasker_step_execution_duration_milliseconds_sum /
tasker_step_execution_duration_milliseconds_count

# P95 latency by namespace
histogram_quantile(0.95,
  sum by (namespace, le) (
    tasker_step_execution_duration_milliseconds_bucket
  )
)

# P99 latency
histogram_quantile(0.99, sum by (le) (tasker_step_execution_duration_milliseconds_bucket))

Rate-Based:

# Average execution time over 5 minutes
rate(tasker_step_execution_duration_milliseconds_sum[5m]) /
rate(tasker_step_execution_duration_milliseconds_count[5m])

# P95 latency by namespace over 5 minutes
histogram_quantile(0.95,
  sum by (namespace, le) (
    rate(tasker_step_execution_duration_milliseconds_bucket[5m])
  )
)

Expected Output: ✅ Verified - Returns millisecond values


tasker.step.claim.duration

Description: Step claiming duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_step_claim_duration_milliseconds_bucket
  • tasker_step_claim_duration_milliseconds_sum
  • tasker_step_claim_duration_milliseconds_count

Labels:

  • namespace: Worker namespace
  • claim_method: “event”, “poll”

Instrumented In: step_claim.rs:try_claim_step()

Example Queries:

Instant/Recent Data:

# Simple average claim time
tasker_step_claim_duration_milliseconds_sum /
tasker_step_claim_duration_milliseconds_count

# Compare event vs poll claiming (P95)
histogram_quantile(0.95,
  sum by (claim_method, le) (
    tasker_step_claim_duration_milliseconds_bucket
  )
)

Rate-Based:

# Average claim time over 5 minutes
rate(tasker_step_claim_duration_milliseconds_sum[5m]) /
rate(tasker_step_claim_duration_milliseconds_count[5m])

# P95 by claim method over 5 minutes
histogram_quantile(0.95,
  sum by (claim_method, le) (
    rate(tasker_step_claim_duration_milliseconds_bucket[5m])
  )
)

Expected Output: ✅ Verified - Returns millisecond values


tasker.step_result.submission.duration

Description: Step result submission duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:

  • tasker_step_result_submission_duration_milliseconds_bucket
  • tasker_step_result_submission_duration_milliseconds_sum
  • tasker_step_result_submission_duration_milliseconds_count

Labels:

  • correlation_id: Request correlation ID
  • result_type: “completion”

Instrumented In: orchestration_result_sender.rs:send_completion()

Example Queries:

Instant/Recent Data:

# Simple average submission time
tasker_step_result_submission_duration_milliseconds_sum /
tasker_step_result_submission_duration_milliseconds_count

# P95 submission latency
histogram_quantile(0.95, sum by (le) (tasker_step_result_submission_duration_milliseconds_bucket))

Rate-Based:

# Average submission time over 5 minutes
rate(tasker_step_result_submission_duration_milliseconds_sum[5m]) /
rate(tasker_step_result_submission_duration_milliseconds_count[5m])

# P95 submission latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_step_result_submission_duration_milliseconds_bucket[5m])))

Expected Output: ✅ Verified - Returns millisecond values


Gauges

tasker.steps.active_executions

Description: Number of steps currently being executed Type: Gauge (u64) Labels:

  • namespace: Worker namespace
  • handler_type: “rust”, “ruby”

Status: Defined but not actively instrumented (gauge tracking removed during implementation)


tasker.queue.depth

Description: Current queue depth per namespace Type: Gauge (u64) Labels:

  • namespace: Worker namespace

Status: Planned (not yet instrumented)


Resilience Metrics

Module: tasker-shared/src/metrics/worker.rs, tasker-orchestration/src/web/circuit_breaker.rs Instrumentation: Circuit breakers, MPSC channels Related Docs: Circuit Breakers | Backpressure Architecture

Circuit Breaker Metrics

Circuit breakers provide fault isolation and cascade prevention. These metrics track breaker state transitions and related operations.

api_circuit_breaker_state

Description: Current state of the web API database circuit breaker Type: Gauge (i64) Values: 0=Closed, 1=Half-Open, 2=Open Labels: None

Instrumented In: tasker-orchestration/src/web/circuit_breaker.rs

Example Queries:

# Current state
api_circuit_breaker_state

# Alert when open
api_circuit_breaker_state == 2

tasker_circuit_breaker_state

Description: Per-component circuit breaker state Type: Gauge (i64) Values: 0=Closed, 1=Half-Open, 2=Open Labels:

  • component: Circuit breaker name (e.g., “ffi_completion”, “task_readiness”, “pgmq”)

Instrumented In: Various circuit breaker implementations

Example Queries:

# All circuit breaker states
tasker_circuit_breaker_state

# Check specific component
tasker_circuit_breaker_state{component="ffi_completion"}

# Count open breakers
count(tasker_circuit_breaker_state == 2)

api_requests_rejected_total

Description: Total API requests rejected due to open circuit breaker Type: Counter (u64) Labels:

  • endpoint: The rejected endpoint path

Instrumented In: tasker-orchestration/src/web/circuit_breaker.rs

Example Queries:

# Total rejections
api_requests_rejected_total

# Rejection rate
rate(api_requests_rejected_total[5m])

# By endpoint
sum by (endpoint) (api_requests_rejected_total)

ffi_completion_slow_sends_total

Description: FFI completion channel sends exceeding latency threshold (100ms default) Type: Counter (u64) Labels: None

Instrumented In: tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs

Example Queries:

# Total slow sends
ffi_completion_slow_sends_total

# Slow send rate (alerts at >10/sec)
rate(ffi_completion_slow_sends_total[5m]) > 10

Alert Threshold: Warning when rate exceeds 10/second for 2 minutes


ffi_completion_circuit_open_rejections_total

Description: FFI completion operations rejected due to open circuit breaker Type: Counter (u64) Labels: None

Instrumented In: tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs

Example Queries:

# Total rejections
ffi_completion_circuit_open_rejections_total

# Rejection rate
rate(ffi_completion_circuit_open_rejections_total[5m])

MPSC Channel Metrics

Bounded MPSC channels provide backpressure control. These metrics track channel utilization and overflow events.

mpsc_channel_usage_percent

Description: Current fill percentage of a bounded MPSC channel Type: Gauge (f64) Labels:

  • channel: Channel name (e.g., “orchestration_command”, “pgmq_notifications”)
  • component: Owning component

Instrumented In: Channel monitor integration points

Example Queries:

# All channel usage
mpsc_channel_usage_percent

# High usage channels
mpsc_channel_usage_percent > 80

# By component
max by (component) (mpsc_channel_usage_percent)

Alert Thresholds:

  • Warning: > 80% for 15 minutes
  • Critical: > 90% for 5 minutes

mpsc_channel_capacity

Description: Configured buffer capacity for an MPSC channel Type: Gauge (u64) Labels:

  • channel: Channel name
  • component: Owning component

Instrumented In: Channel monitor initialization

Example Queries:

# All channel capacities
mpsc_channel_capacity

# Compare usage to capacity
mpsc_channel_usage_percent / 100 * mpsc_channel_capacity

mpsc_channel_full_events_total

Description: Count of channel overflow events (backpressure applied) Type: Counter (u64) Labels:

  • channel: Channel name
  • component: Owning component

Instrumented In: Channel send operations with backpressure handling

Example Queries:

# Total overflow events
mpsc_channel_full_events_total

# Overflow rate
rate(mpsc_channel_full_events_total[5m])

# By channel
sum by (channel) (mpsc_channel_full_events_total)

Alert Threshold: Any overflow events indicate backpressure is occurring
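A simple alert expression for this condition (the 10-minute window is a suggestion, not a project standard):

# Fire if any channel overflowed in the last 10 minutes
increase(mpsc_channel_full_events_total[10m]) > 0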


Resilience Dashboard Panels

Circuit Breaker State Timeline:

# Panel: Time series with state mapping
api_circuit_breaker_state
# Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)

FFI Completion Health:

# Panel: Multi-stat showing slow sends and rejections
rate(ffi_completion_slow_sends_total[5m])
rate(ffi_completion_circuit_open_rejections_total[5m])

Channel Saturation Overview:

# Panel: Gauge showing max channel usage
max(mpsc_channel_usage_percent)
# Thresholds: Green < 70%, Yellow < 90%, Red >= 90%

Backpressure Events:

# Panel: Time series of overflow rate
rate(mpsc_channel_full_events_total[5m])

Database Metrics

Module: tasker-shared/src/metrics/database.rs Status: ⚠️ Defined but not yet instrumented

Planned Metrics

  • tasker.sql.queries.total - Counter
  • tasker.sql.query.duration - Histogram
  • tasker.db.pool.connections_active - Gauge
  • tasker.db.pool.connections_idle - Gauge
  • tasker.db.pool.wait_duration - Histogram
  • tasker.db.transactions.total - Counter
  • tasker.db.transaction.duration - Histogram

Messaging Metrics

Module: tasker-shared/src/metrics/messaging.rs Status: ⚠️ Defined but not yet instrumented

Planned Metrics

  • tasker.queue.messages_sent.total - Counter
  • tasker.queue.messages_received.total - Counter
  • tasker.queue.messages_deleted.total - Counter
  • tasker.queue.message_send.duration - Histogram
  • tasker.queue.message_receive.duration - Histogram
  • tasker.queue.depth - Gauge
  • tasker.queue.age_seconds - Gauge
  • tasker.queue.visibility_timeouts.total - Counter
  • tasker.queue.errors.total - Counter
  • tasker.queue.retry_attempts.total - Counter

Note: Circuit breaker metrics (including queue-related circuit breakers) are documented in the Resilience Metrics section.


Example Queries

Task Execution Flow

Complete task execution for a specific correlation_id:

# 1. Task creation
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 2. Steps enqueued
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 3. Steps executed
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 4. Steps succeeded
tasker_steps_successes_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 5. Results submitted
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 6. Results processed
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

# 7. Task completed
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected Flow: 1 → N → N → N → N → N → 1 (where N = number of steps)


Performance Analysis

Task initialization latency percentiles:

Instant/Recent Data:

# P50 (median)
histogram_quantile(0.50, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

# P95
histogram_quantile(0.95, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

# P99
histogram_quantile(0.99, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))

Rate-Based (continuous monitoring):

# P50 (median)
histogram_quantile(0.50, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))

# P95
histogram_quantile(0.95, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))

# P99
histogram_quantile(0.99, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))

Step execution latency by namespace:

Instant/Recent Data:

histogram_quantile(0.95,
  sum by (namespace, le) (
    tasker_step_execution_duration_milliseconds_bucket
  )
)

Rate-Based:

histogram_quantile(0.95,
  sum by (namespace, le) (
    rate(tasker_step_execution_duration_milliseconds_bucket[5m])
  )
)

End-to-end task duration (from request to completion):

This requires combining initialization + step execution + finalization durations. Use the simple average approach for instant data:

# Average task initialization
tasker_task_initialization_duration_milliseconds_sum /
tasker_task_initialization_duration_milliseconds_count

# Average step execution
tasker_step_execution_duration_milliseconds_sum /
tasker_step_execution_duration_milliseconds_count

# Average finalization
tasker_task_finalization_duration_milliseconds_sum /
tasker_task_finalization_duration_milliseconds_count
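If a single number is wanted, the three averages can be added in one expression. Note this is only a rough approximation: it uses the per-step execution average (not the per-task total) and ignores queue wait time.

# sum() collapses labels so the three ratios share a label set and can be added
(sum(tasker_task_initialization_duration_milliseconds_sum) / sum(tasker_task_initialization_duration_milliseconds_count))
+
(sum(tasker_step_execution_duration_milliseconds_sum) / sum(tasker_step_execution_duration_milliseconds_count))
+
(sum(tasker_task_finalization_duration_milliseconds_sum) / sum(tasker_task_finalization_duration_milliseconds_count))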

Error Rate Monitoring

Overall step failure rate:

rate(tasker_steps_failures_total[5m]) /
rate(tasker_steps_executions_total[5m])

Error distribution by type:

topk(5, sum by (error_type) (tasker_steps_failures_total))

Task failure rate:

rate(tasker_tasks_failures_total[5m]) /
(rate(tasker_tasks_completions_total[5m]) + rate(tasker_tasks_failures_total[5m]))

Throughput Monitoring

Task request rate:

rate(tasker_tasks_requests_total[1m])
rate(tasker_tasks_requests_total[5m])
rate(tasker_tasks_requests_total[15m])

Step execution throughput:

sum(rate(tasker_steps_executions_total[5m]))

Step completion rate (successes + failures):

sum(rate(tasker_steps_successes_total[5m])) +
sum(rate(tasker_steps_failures_total[5m]))

Dashboard Recommendations

Task Execution Overview Dashboard

Panels:

  1. Task Request Rate

    • Query: rate(tasker_tasks_requests_total[5m])
    • Visualization: Time series graph
  2. Task Completion Rate

    • Query: rate(tasker_tasks_completions_total[5m])
    • Visualization: Time series graph
  3. Task Success/Failure Ratio

    • Query: Two series
      • Completions: rate(tasker_tasks_completions_total[5m])
      • Failures: rate(tasker_tasks_failures_total[5m])
    • Visualization: Stacked area chart
  4. Task Initialization Latency (P95)

    • Query: histogram_quantile(0.95, rate(tasker_task_initialization_duration_milliseconds_bucket[5m]))
    • Visualization: Time series graph
  5. Steps Enqueued vs Executed

    • Query: Two series
      • Enqueued: rate(tasker_steps_enqueued_total[5m])
      • Executed: rate(tasker_steps_executions_total[5m])
    • Visualization: Time series graph

Worker Performance Dashboard

Panels:

  1. Step Execution Throughput by Namespace

    • Query: sum by (namespace) (rate(tasker_steps_executions_total[5m]))
    • Visualization: Time series graph (multi-series)
  2. Step Success Rate

    • Query: rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])
    • Visualization: Gauge (0-1 scale)
  3. Step Execution Latency Percentiles

    • Query: Three series
      • P50: histogram_quantile(0.50, rate(tasker_step_execution_duration_milliseconds_bucket[5m]))
      • P95: histogram_quantile(0.95, rate(tasker_step_execution_duration_milliseconds_bucket[5m]))
      • P99: histogram_quantile(0.99, rate(tasker_step_execution_duration_milliseconds_bucket[5m]))
    • Visualization: Time series graph
  4. Step Claiming Performance (Event vs Poll)

    • Query: histogram_quantile(0.95, sum by (claim_method, le) (rate(tasker_step_claim_duration_milliseconds_bucket[5m])))
    • Visualization: Time series graph
  5. Error Distribution by Type

    • Query: sum by (error_type) (rate(tasker_steps_failures_total[5m]))
    • Visualization: Pie chart or bar chart

System Health Dashboard

Panels:

  1. Overall Task Success Rate

    • Query: rate(tasker_tasks_completions_total[5m]) / (rate(tasker_tasks_completions_total[5m]) + rate(tasker_tasks_failures_total[5m]))
    • Visualization: Stat panel with thresholds (green > 0.95, yellow > 0.90, red < 0.90)
  2. Step Failure Rate

    • Query: rate(tasker_steps_failures_total[5m]) / rate(tasker_steps_executions_total[5m])
    • Visualization: Stat panel with thresholds
  3. Average Task End-to-End Duration

    • Query: Combination of initialization, execution, and finalization durations
    • Visualization: Time series graph
  4. Result Processing Latency

    • Query: rate(tasker_step_result_processing_duration_milliseconds_sum[5m]) / rate(tasker_step_result_processing_duration_milliseconds_count[5m])
    • Visualization: Time series graph
  5. Active Operations

    • Query: Currently not instrumented (gauges removed)
    • Status: Planned future enhancement

Verification Checklist

Use this checklist to verify metrics are working correctly:

Prerequisites

  • telemetry.opentelemetry.enabled = true in development config
  • Services restarted after config change
  • Logs show opentelemetry_enabled=true
  • Grafana LGTM container running on ports 3000, 4317
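If the LGTM container is not already running, one typical way to start it (assuming Docker and the grafana/otel-lgtm all-in-one image) is:

docker run -d --name lgtm \
  -p 3000:3000 -p 4317:4317 -p 4318:4318 \
  grafana/otel-lgtm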

Basic Verification

  • At least one task created via CLI
  • Correlation ID captured from task creation
  • Trace visible in Grafana Tempo for correlation ID

Orchestration Metrics

  • tasker_tasks_requests_total returns non-zero
  • tasker_steps_enqueued_total returns expected step count
  • tasker_step_results_processed_total returns expected result count
  • tasker_tasks_completions_total increments on success
  • tasker_task_initialization_duration_milliseconds_bucket has histogram data

Worker Metrics

  • tasker_steps_executions_total returns non-zero
  • tasker_steps_successes_total matches successful steps
  • tasker_steps_claimed_total returns expected claims
  • tasker_steps_results_submitted_total matches result submissions
  • tasker_step_execution_duration_milliseconds_bucket has histogram data

Resilience Metrics

  • api_circuit_breaker_state returns 0 (Closed) during normal operation
  • /health/detailed endpoint shows circuit breaker states
  • mpsc_channel_usage_percent returns values < 80% (no saturation)
  • mpsc_channel_full_events_total is 0 or very low (no backpressure)
  • FFI workers: ffi_completion_slow_sends_total is near zero

Correlation

  • All metrics filterable by correlation_id
  • Correlation ID in metrics matches trace ID in Tempo
  • Complete execution flow visible from request to completion

Troubleshooting

No Metrics Appearing

Check 1: OpenTelemetry enabled

grep "opentelemetry_enabled" tmp/*.log
# Should show: opentelemetry_enabled=true

Check 2: OTLP endpoint accessible

curl -v http://localhost:4317 2>&1 | grep Connected
# Should show: Connected to localhost (127.0.0.1) port 4317

Check 3: Grafana LGTM running

curl -s http://localhost:3000/api/health | jq
# Should return healthy status

Check 4: Wait for export interval (60 seconds)

Metrics are batched and exported every 60 seconds. Wait at least 1 minute after task execution.

Metrics Missing Labels

If correlation_id or other labels are missing, check:

  • Logs for correlation_id field presence
  • Metric instrumentation includes KeyValue::new() calls
  • Labels match between metric definition and usage

Histogram Buckets Empty

If histogram queries return no data:

  • Verify histogram is initialized: check logs for metric initialization
  • Ensure duration values are non-zero and reasonable
  • Check that record() is called, not add() for histograms
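A sketch of both checks in one place (metric names taken from this reference; exact builder method names may differ slightly between opentelemetry crate versions):

use opentelemetry::{global, KeyValue};

fn record_step_metrics(correlation_id: &str, duration_ms: f64) {
    let meter = global::meter("tasker");

    // Counters: add() with KeyValue attributes so correlation_id shows up as a label
    let executions = meter
        .u64_counter("tasker.steps.executions.total")
        .with_description("Total number of step executions attempted")
        .build();
    executions.add(1, &[KeyValue::new("correlation_id", correlation_id.to_string())]);

    // Histograms: record(), not add(), otherwise no bucket data is exported
    let duration = meter
        .f64_histogram("tasker.step.execution.duration")
        .with_description("Step execution duration in milliseconds")
        .build();
    duration.record(
        duration_ms,
        &[KeyValue::new("correlation_id", correlation_id.to_string())],
    );
}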

Next Steps

Phase 3.4 (Future)

  • Instrument database metrics (7 metrics)
  • Instrument messaging metrics (11 metrics)
  • Add gauge tracking for active operations
  • Implement queue depth monitoring

Production Readiness

  • Create alert rules for error rates
  • Set up automated dashboards
  • Configure metric retention policies
  • Add metric aggregation for long-term storage

Last Updated: 2025-12-10 Test Task: mathematical_sequence (correlation_id: 0199c3e0-ccdb-7581-87ab-3f67daeaa4a5) Status: All orchestration and worker metrics verified and producing data ✅

Recent Updates:

  • 2025-12-10: Added Resilience Metrics section (circuit breakers, MPSC channels)
  • 2025-10-08: Initial metrics verification completed

Metrics Verification Guide

Purpose: Verify that documented metrics queries work with actual system data Test Task: mathematical_sequence Correlation ID: 0199c3e0-ccdb-7581-87ab-3f67daeaa4a5 Task ID: 0199c3e0-ccea-70f0-b6ae-3086b2f68280 Trace ID: d640f82572e231322edba0a5ef6e1405

How to Use This Guide

  1. Open Grafana at http://localhost:3000
  2. Navigate to Explore (compass icon in sidebar)
  3. Select Prometheus as the data source
  4. Copy each query below into the query editor
  5. Record the actual output
  6. Mark ✅ if query works, ❌ if it fails, or ⚠️ if partial data

Orchestration Metrics Verification

1. Task Requests Counter

Metric: tasker.tasks.requests.total

Query 1: Basic counter

tasker_tasks_requests_total

Expected: At least 1 (for our test task) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Filtered by correlation_id

tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Exactly 1 Actual Result: _____________ Labels Present: [ ] correlation_id [ ] task_type [ ] namespace Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: Sum by namespace

sum by (namespace) (tasker_tasks_requests_total)

Expected: 1 for namespace “rust_e2e_linear” Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


2. Task Completions Counter

Metric: tasker.tasks.completions.total

Query 1: Basic counter

tasker_tasks_completions_total

Expected: At least 1 (if task completed successfully) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Filtered by correlation_id

tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: Completion rate over 5 minutes

rate(tasker_tasks_completions_total[5m])

Expected: Some positive rate value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


3. Steps Enqueued Counter

Metric: tasker.steps.enqueued.total

Query 1: Total steps enqueued for our task

tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Number of steps in mathematical_sequence workflow (likely 3-4 steps) Actual Result: _____________ Step Names Visible: [ ] Yes [ ] No Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Sum by step name

sum by (step_name) (tasker_steps_enqueued_total)

Expected: Breakdown by step name Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


4. Step Results Processed Counter

Metric: tasker.step_results.processed.total

Query 1: Results processed for our task

tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Same as steps enqueued Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Sum by result type

sum by (result_type) (tasker_step_results_processed_total)

Expected: Breakdown showing “success” results Actual Result: _____________ Result Types Visible: [ ] success [ ] error [ ] timeout [ ] cancelled [ ] skipped Status: [ ] ✅ [ ] ❌ [ ] ⚠️


5. Task Initialization Duration Histogram

Metric: tasker.task.initialization.duration

Query 1: Check if histogram has data

tasker_task_initialization_duration_count

Expected: At least 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average initialization time

rate(tasker_task_initialization_duration_sum[5m]) /
rate(tasker_task_initialization_duration_count[5m])

Expected: Some millisecond value (probably < 100ms) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 latency

histogram_quantile(0.95, rate(tasker_task_initialization_duration_bucket[5m]))

Expected: P95 millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 4: P99 latency

histogram_quantile(0.99, rate(tasker_task_initialization_duration_bucket[5m]))

Expected: P99 millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


6. Task Finalization Duration Histogram

Metric: tasker.task.finalization.duration

Query 1: Check count

tasker_task_finalization_duration_count

Expected: At least 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average finalization time

rate(tasker_task_finalization_duration_sum[5m]) /
rate(tasker_task_finalization_duration_count[5m])

Expected: Some millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 by final_state

histogram_quantile(0.95,
  sum by (final_state, le) (
    rate(tasker_task_finalization_duration_bucket[5m])
  )
)

Expected: P95 value grouped by final_state (likely “complete”) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


7. Step Result Processing Duration Histogram

Metric: tasker.step_result.processing.duration

Query 1: Check count

tasker_step_result_processing_duration_count

Expected: Number of steps processed Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average processing time

rate(tasker_step_result_processing_duration_sum[5m]) /
rate(tasker_step_result_processing_duration_count[5m])

Expected: Millisecond value for result processing Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Worker Metrics Verification

8. Step Executions Counter

Metric: tasker.steps.executions.total

Query 1: Total executions

tasker_steps_executions_total

Expected: Number of steps in workflow Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: For specific task

tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Number of steps executed Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: Execution rate

rate(tasker_steps_executions_total[5m])

Expected: Positive rate Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


9. Step Successes Counter

Metric: tasker.steps.successes.total

Query 1: Total successes

tasker_steps_successes_total

Expected: Should equal executions if all succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: By namespace

sum by (namespace) (tasker_steps_successes_total)

Expected: Successes grouped by namespace Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: Success rate

rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])

Expected: ~1.0 (100%) if all steps succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


10. Step Failures Counter

Metric: tasker.steps.failures.total

Query 1: Total failures

tasker_steps_failures_total

Expected: 0 if all steps succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: By error type

sum by (error_type) (tasker_steps_failures_total)

Expected: No results if no failures, or breakdown by error type Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


11. Steps Claimed Counter

Metric: tasker.steps.claimed.total

Query 1: Total claims

tasker_steps_claimed_total

Expected: Number of steps claimed (should match executions) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: By claim method

sum by (claim_method) (tasker_steps_claimed_total)

Expected: Breakdown by “event” or “poll” Actual Result: _____________ Claim Methods Visible: [ ] event [ ] poll Status: [ ] ✅ [ ] ❌ [ ] ⚠️


12. Step Results Submitted Counter

Metric: tasker.steps.results_submitted.total

Query 1: Total submissions

tasker_steps_results_submitted_total

Expected: Number of steps that submitted results Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: For specific task

tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}

Expected: Number of step results submitted Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


13. Step Execution Duration Histogram

Metric: tasker.step.execution.duration

Query 1: Check count

tasker_step_execution_duration_count

Expected: Number of step executions Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average execution time

rate(tasker_step_execution_duration_sum[5m]) /
rate(tasker_step_execution_duration_count[5m])

Expected: Average milliseconds per step Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 latency by namespace

histogram_quantile(0.95,
  sum by (namespace, le) (
    rate(tasker_step_execution_duration_bucket[5m])
  )
)

Expected: P95 latency for rust_e2e_linear namespace Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 4: P99 latency

histogram_quantile(0.99, rate(tasker_step_execution_duration_bucket[5m]))

Expected: P99 latency value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


14. Step Claim Duration Histogram

Metric: tasker.step.claim.duration

Query 1: Check count

tasker_step_claim_duration_count

Expected: Number of claims Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average claim time

rate(tasker_step_claim_duration_sum[5m]) /
rate(tasker_step_claim_duration_count[5m])

Expected: Average milliseconds to claim Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 by claim method

histogram_quantile(0.95,
  sum by (claim_method, le) (
    rate(tasker_step_claim_duration_bucket[5m])
  )
)

Expected: P95 claim latency by method Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


15. Step Result Submission Duration Histogram

Metric: tasker.step_result.submission.duration

Query 1: Check count

tasker_step_result_submission_duration_count

Expected: Number of result submissions Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 2: Average submission time

rate(tasker_step_result_submission_duration_sum[5m]) /
rate(tasker_step_result_submission_duration_count[5m])

Expected: Average milliseconds to submit Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Query 3: P95 submission latency

histogram_quantile(0.95, rate(tasker_step_result_submission_duration_bucket[5m]))

Expected: P95 submission latency Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️


Complete Execution Flow Verification

Purpose: Verify the full task lifecycle is visible in metrics

Query: Complete flow for correlation_id

# Run each query and record the value

# 1. Task created
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 2. Steps enqueued
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 3. Steps claimed
tasker_steps_claimed_total
Result: _____________

# 4. Steps executed
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 5. Steps succeeded
tasker_steps_successes_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 6. Results submitted
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 7. Results processed
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

# 8. Task completed
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________

Expected Pattern: 1 → N → N → N → N → N → N → 1 Actual Pattern: _____________ → _____ → _____ → _____ → _____ → _____ → _____ → _____

Analysis:

  • Do the numbers make sense for your workflow? [ ] Yes [ ] No
  • Are any steps missing? _____________
  • Do counts match expectations? [ ] Yes [ ] No

Issues Found

Document any issues discovered during verification:

Issue 1

Query: _____________ Expected: _____________ Actual: _____________ Problem: _____________ Fix Required: [ ] Yes [ ] No

Issue 2

Query: _____________ Expected: _____________ Actual: _____________ Problem: _____________ Fix Required: [ ] Yes [ ] No


Summary

Total Queries Tested: _____________ Successful: _____________ ✅ Failed: _____________ ❌ Partial: _____________ ⚠️

Overall Status: [ ] All Working [ ] Some Issues [ ] Major Problems

Ready for Production: [ ] Yes [ ] No [ ] Needs Work


Verification Date: _____________ Verified By: _____________ Grafana Version: _____________ OpenTelemetry Version: 0.26

OpenTelemetry Improvements

Last Updated: 2025-12-01 Audience: Developers, Operators Status: Active Related Docs: Observability Hub | Metrics Reference | Domain Events

← Back to Observability Hub


This document describes the OpenTelemetry improvements for the domain event system, including two-phase FFI telemetry initialization, domain event metrics, and enhanced observability for the distributed event system.

Overview

These telemetry improvements support the domain event system while addressing FFI-specific challenges:

| Improvement | Purpose | Impact |
|---|---|---|
| Two-Phase FFI Telemetry | Safe telemetry in FFI workers | No segfaults during Ruby/Python bridging |
| Domain Event Metrics | Event system observability | Real-time monitoring of event publication |
| Correlation ID Propagation | End-to-end tracing | Events traceable across distributed system |
| Worker Metrics Endpoint | Domain event statistics | /metrics/events for monitoring dashboards |

Two-Phase FFI Telemetry Initialization

The Problem

When Rust workers operate with Ruby FFI bindings, OpenTelemetry’s global tracer/meter providers can cause issues:

  1. Thread Safety: Ruby’s GVL (Global VM Lock) conflicts with OpenTelemetry’s internal threading
  2. Signal Handling: OpenTelemetry’s OTLP exporter may interfere with Ruby signal handling
  3. Segfaults: Premature initialization can cause crashes during FFI boundary crossings

The Solution: Two-Phase Initialization

flowchart LR
    subgraph Phase1["Phase 1 (FFI-Safe)"]
        A[Console logger]
        B[Tracing init]
        C[No OTLP export]
        D[No global state]
    end

    subgraph Phase2["Phase 2 (Full OTel)"]
        E[OTLP exporter]
        F[Metrics export]
        G[Full tracing]
        H[Global tracer]
    end

    Phase1 -->|"After FFI bridge<br/>established"| Phase2

Worker Bootstrap Sequence:

  1. Load Rust worker library
  2. Initialize Phase 1 (console-only logging)
  3. Execute FFI bridge setup (Ruby/Python)
  4. Initialize Phase 2 (full OpenTelemetry)

Implementation

Phase 1: Console-Only Initialization (FFI-Safe):

#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs (lines 284-326)

/// Initialize console-only logging (FFI-safe, no Tokio runtime required)
///
/// This function sets up structured console logging without OpenTelemetry,
/// making it safe to call from FFI initialization contexts where no Tokio
/// runtime exists yet.
pub fn init_console_only() {
    TRACING_INITIALIZED.get_or_init(|| {
        let environment = get_environment();
        let log_level = get_log_level(&environment);

        // Determine if we're in a TTY for ANSI color support
        let use_ansi = IsTerminal::is_terminal(&std::io::stdout());

        // Create base console layer
        let console_layer = fmt::layer()
            .with_target(true)
            .with_thread_ids(true)
            .with_level(true)
            .with_ansi(use_ansi)
            .with_filter(EnvFilter::new(&log_level));

        // Build subscriber with console layer only (no telemetry)
        let subscriber = tracing_subscriber::registry().with(console_layer);

        if subscriber.try_init().is_err() {
            tracing::debug!(
                "Global tracing subscriber already initialized"
            );
        } else {
            tracing::info!(
                environment = %environment,
                opentelemetry_enabled = false,
                context = "ffi_initialization",
                "Console-only logging initialized (FFI-safe mode)"
            );
        }

        // Initialize basic metrics (no OpenTelemetry exporters)
        metrics::init_metrics();
        metrics::orchestration::init();
        metrics::worker::init();
        metrics::database::init();
        metrics::messaging::init();
    });
}
}

Phase 2: Full OpenTelemetry Initialization:

#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs (lines 361-449)

/// Initialize tracing with console output and optional OpenTelemetry
///
/// When OpenTelemetry is enabled (via TELEMETRY_ENABLED=true), it also
/// configures distributed tracing with OTLP exporter.
///
/// **IMPORTANT**: When telemetry is enabled, this function MUST be called from
/// a Tokio runtime context because the batch exporter requires async I/O.
pub fn init_tracing() {
    TRACING_INITIALIZED.get_or_init(|| {
        let environment = get_environment();
        let log_level = get_log_level(&environment);
        let telemetry_config = TelemetryConfig::default();

        // Determine if we're in a TTY for ANSI color support
        let use_ansi = IsTerminal::is_terminal(&std::io::stdout());

        // Create base console layer
        let console_layer = fmt::layer()
            .with_target(true)
            .with_thread_ids(true)
            .with_level(true)
            .with_ansi(use_ansi)
            .with_filter(EnvFilter::new(&log_level));

        // Build subscriber with optional OpenTelemetry layer
        let subscriber = tracing_subscriber::registry().with(console_layer);

        if telemetry_config.enabled {
            // Initialize OpenTelemetry tracer and logger
            match (init_opentelemetry_tracer(&telemetry_config),
                   init_opentelemetry_logger(&telemetry_config)) {
                (Ok(tracer_provider), Ok(logger_provider)) => {
                    // Add trace layer
                    let tracer = tracer_provider.tracer("tasker-core");
                    let telemetry_layer = OpenTelemetryLayer::new(tracer);

                    // Add log layer (bridge tracing logs -> OTEL logs)
                    let log_layer = OpenTelemetryTracingBridge::new(&logger_provider);

                    let subscriber = subscriber.with(telemetry_layer).with(log_layer);

                    if subscriber.try_init().is_ok() {
                        tracing::info!(
                            environment = %environment,
                            opentelemetry_enabled = true,
                            logs_enabled = true,
                            otlp_endpoint = %telemetry_config.otlp_endpoint,
                            service_name = %telemetry_config.service_name,
                            "Console logging with OpenTelemetry initialized"
                        );
                    }
                }
                // ... error handling with fallback to console-only
            }
        }
    });
}
}

Worker Bootstrap Integration:

#![allow(unused)]
fn main() {
// workers/rust/src/bootstrap.rs (lines 69-131)

pub async fn bootstrap() -> Result<(WorkerSystemHandle, RustEventHandler)> {
    info!("📋 Creating native Rust step handler registry...");
    let registry = Arc::new(RustStepHandlerRegistry::new());

    // Get global event system for connecting to worker events
    info!("🔗 Setting up event system connection...");
    let event_system = get_global_event_system();

    // Bootstrap the worker using tasker-worker foundation
    info!("🏗️  Bootstrapping worker with tasker-worker foundation...");
    let worker_handle =
        WorkerBootstrap::bootstrap_with_event_system(Some(event_system.clone())).await?;

    // Create step event publisher registry with domain event publisher
    info!("🔔 Setting up step event publisher registry...");
    let domain_event_publisher = {
        let worker_core = worker_handle.worker_core.lock().await;
        worker_core.domain_event_publisher()
    };

    // Dual-Path: Create in-process event bus for fast event delivery
    info!("⚡ Creating in-process event bus for fast domain events...");
    let in_process_bus = Arc::new(RwLock::new(InProcessEventBus::new(
        InProcessEventBusConfig::default(),
    )));

    // Dual-Path: Create event router for dual-path delivery
    info!("🔀 Creating event router for dual-path delivery...");
    let event_router = Arc::new(RwLock::new(EventRouter::new(
        domain_event_publisher.clone(),
        in_process_bus.clone(),
    )));

    // Create registry with EventRouter for dual-path delivery
    let mut step_event_registry =
        StepEventPublisherRegistry::with_event_router(
            domain_event_publisher.clone(),
            event_router
        );

    // ... remaining setup elided in this excerpt (constructs the RustEventHandler returned below) ...

    Ok((worker_handle, event_handler))
}
}

Configuration

Telemetry is configured exclusively via environment variables. This is intentional because logging must be initialized before the TOML config loader runs (to log any config loading errors).

# Enable OpenTelemetry
export TELEMETRY_ENABLED=true

# OTLP endpoint (default: http://localhost:4317)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Service identification
export OTEL_SERVICE_NAME=tasker-orchestration
export OTEL_SERVICE_VERSION=0.1.0

# Deployment environment (falls back to TASKER_ENV, then "development")
export DEPLOYMENT_ENVIRONMENT=production

# Sampling rate (0.0 to 1.0, default: 1.0 = 100%)
export OTEL_TRACES_SAMPLER_ARG=1.0

The TelemetryConfig::default() implementation in tasker-shared/src/logging.rs:144-164 reads all values from environment variables at initialization time.
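An illustrative sketch of that pattern follows (a hypothetical, trimmed struct; the real TelemetryConfig in tasker-shared/src/logging.rs carries additional fields such as sampling and service version):

use std::env;

struct TelemetryConfig {
    enabled: bool,
    otlp_endpoint: String,
    service_name: String,
}

impl Default for TelemetryConfig {
    fn default() -> Self {
        Self {
            // TELEMETRY_ENABLED=true turns on OTLP export
            enabled: env::var("TELEMETRY_ENABLED")
                .map(|v| v == "true")
                .unwrap_or(false),
            // Defaults to the local OTLP gRPC endpoint
            otlp_endpoint: env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
                .unwrap_or_else(|_| "http://localhost:4317".to_string()),
            // Fallback value here is illustrative only
            service_name: env::var("OTEL_SERVICE_NAME")
                .unwrap_or_else(|_| "tasker-core".to_string()),
        }
    }
}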

Domain Event Metrics

New Metrics

Domain event observability metrics:

| Metric | Type | Description |
|---|---|---|
| tasker.domain_events.published.total | Counter | Total events published |
| router.durable_routed | Counter | Events sent via durable path (PGMQ) |
| router.fast_routed | Counter | Events sent via fast path (in-process) |
| router.broadcast_routed | Counter | Events broadcast to both paths |

Implementation

Domain event metrics are emitted inline during publication:

#![allow(unused)]
fn main() {
// tasker-shared/src/events/domain_events.rs (lines 207-219)

// Emit OpenTelemetry metric
let counter = opentelemetry::global::meter("tasker")
    .u64_counter("tasker.domain_events.published.total")
    .with_description("Total number of domain events published")
    .build();

counter.add(
    1,
    &[
        opentelemetry::KeyValue::new("event_name", event_name.to_string()),
        opentelemetry::KeyValue::new("namespace", metadata.namespace.clone()),
    ],
);
}

Event routing statistics are tracked in the EventRouterStats and InProcessEventBusStats structures:

#![allow(unused)]
fn main() {
// tasker-shared/src/metrics/worker.rs (lines 431-444)

/// Statistics for the event router
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
pub struct EventRouterStats {
    /// Total events routed through the router
    pub total_routed: u64,
    /// Events sent via durable path (PGMQ)
    pub durable_routed: u64,
    /// Events sent via fast path (in-process)
    pub fast_routed: u64,
    /// Events broadcast to both paths
    pub broadcast_routed: u64,
    /// Fast delivery errors in broadcast mode (non-fatal, logged for monitoring)
    pub fast_delivery_errors: u64,
    /// Failed routing attempts (durable failures only)
    pub routing_errors: u64,
}

// tasker-shared/src/metrics/worker.rs (lines 455-467)

/// Statistics for the in-process event bus
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
pub struct InProcessEventBusStats {
    /// Total events dispatched through the bus
    pub total_events_dispatched: u64,
    /// Total events dispatched to Rust handlers
    pub rust_handler_dispatches: u64,
    /// Total events dispatched to FFI channel
    pub ffi_channel_dispatches: u64,
}
}

Prometheus Queries

Event publication rate by namespace:

sum by (namespace) (rate(tasker_domain_events_published_total[5m]))

Event failure rate:

rate(tasker_domain_events_failed_total[5m]) /
rate(tasker_domain_events_published_total[5m])

Publication latency (P95):

histogram_quantile(0.95,
  sum by (le) (rate(tasker_domain_events_publish_duration_milliseconds_bucket[5m]))
)

Latency by delivery mode:

histogram_quantile(0.95,
  sum by (delivery_mode, le) (
    rate(tasker_domain_events_publish_duration_milliseconds_bucket[5m])
  )
)

Worker Metrics Endpoint

/metrics/events Endpoint

The worker exposes domain event statistics through a dedicated metrics endpoint:

Request:

curl http://localhost:8081/metrics/events

Response:

{
  "router": {
    "total_routed": 42,
    "durable_routed": 10,
    "fast_routed": 30,
    "broadcast_routed": 2,
    "fast_delivery_errors": 0,
    "routing_errors": 0
  },
  "in_process_bus": {
    "total_events_dispatched": 32,
    "rust_handler_dispatches": 20,
    "ffi_channel_dispatches": 12
  },
  "captured_at": "2025-12-01T10:30:00Z",
  "worker_id": "worker-01234567"
}
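These stats are plain JSON, so they can be checked from scripts or E2E tests with standard tooling, for example:

# Count of events delivered via the fast (in-process) path
curl -s http://localhost:8081/metrics/events | jq '.router.fast_routed'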

Implementation

#![allow(unused)]
fn main() {
// tasker-worker/src/web/handlers/metrics.rs (lines 178-218)

/// Domain event statistics endpoint: GET /metrics/events
///
/// Returns statistics about domain event routing and delivery paths.
/// Used for monitoring event publishing and by E2E tests to verify
/// events were published through the expected delivery paths.
///
/// # Response
///
/// Returns statistics for:
/// - **Router stats**: durable_routed, fast_routed, broadcast_routed counts
/// - **In-process bus stats**: handler dispatches, FFI channel dispatches
pub async fn domain_event_stats(
    State(state): State<Arc<WorkerWebState>>,
) -> Json<DomainEventStats> {
    debug!("Serving domain event statistics");

    // Use cached event components - does not lock worker core
    let stats = state.domain_event_stats().await;

    Json(stats)
}
}

The DomainEventStats structure is defined in tasker-shared/src/types/web.rs:

#![allow(unused)]
fn main() {
// tasker-shared/src/types/web.rs (lines 546-555)

#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct DomainEventStats {
    /// Event router statistics
    pub router: EventRouterStats,
    /// In-process event bus statistics
    pub in_process_bus: InProcessEventBusStats,
    /// Timestamp when stats were captured
    pub captured_at: DateTime<Utc>,
    /// Worker ID for correlation
    pub worker_id: String,
}
}

Correlation ID Propagation

End-to-End Tracing

Domain events maintain correlation IDs for distributed tracing:

flowchart LR
    subgraph TaskCreation["Task Creation"]
        A[correlation_id<br/>UUIDv7]
    end

    subgraph StepExecution["Step Execution"]
        B[correlation_id<br/>propagated]
    end

    subgraph DomainEvent["Domain Event"]
        C[correlation_id<br/>in metadata]
    end

    TaskCreation --> StepExecution --> DomainEvent

    subgraph TraceContext["Trace Context"]
        D[task_uuid]
        E[step_uuid]
        F[step_name]
        G[namespace]
        H[correlation_id]
    end

Tracing Integration

The DomainEventPublisher::publish_event method uses #[instrument] for automatic span creation:

#![allow(unused)]
fn main() {
// tasker-shared/src/events/domain_events.rs (lines 157-231)

#[instrument(skip(self, payload, metadata), fields(
    event_name = %event_name,
    namespace = %metadata.namespace,
    correlation_id = %metadata.correlation_id
))]
pub async fn publish_event(
    &self,
    event_name: &str,
    payload: DomainEventPayload,
    metadata: EventMetadata,
) -> Result<Uuid, DomainEventError> {
    let event_id = Uuid::now_v7();
    let queue_name = format!("{}_domain_events", metadata.namespace);

    debug!(
        event_id = %event_id,
        event_name = %event_name,
        queue_name = %queue_name,
        task_uuid = %metadata.task_uuid,
        correlation_id = %metadata.correlation_id,
        "Publishing domain event"
    );

    // Create and serialize domain event
    let event = DomainEvent {
        event_id,
        event_name: event_name.to_string(),
        event_version: "1.0".to_string(),
        payload,
        metadata: metadata.clone(),
    };

    // Serialize for the queue payload (error conversion elided in this excerpt)
    let event_json = serde_json::to_value(&event)?;

    // Publish to PGMQ
    let message_id = self.message_client
        .send_json_message(&queue_name, &event_json)
        .await?;

    info!(
        event_id = %event_id,
        message_id = message_id,
        correlation_id = %metadata.correlation_id,
        "Domain event published successfully"
    );

    Ok(event_id)
}
}

Querying by Correlation ID

Find all events for a task:

# In Grafana/Tempo
correlation_id = "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"

In PostgreSQL (PGMQ queues):

SELECT
    message->>'event_name' as event,
    message->'metadata'->>'step_name' as step,
    message->'metadata'->>'fired_at' as fired_at
FROM pgmq.q_payments_domain_events
WHERE message->'metadata'->>'correlation_id' = '0199c3e0-ccdb-7581-87ab-3f67daeaa4a5'
ORDER BY message->'metadata'->>'fired_at';

Span Hierarchy

Domain Event Spans

A task execution that publishes domain events produces the following span tree:

Task Execution (root span)
├── Step Execution
│   ├── Handler Call
│   │   └── Business Logic
│   └── publish_domain_event           ◄── NEW
│       ├── route_event
│       │   ├── publish_durable        (if durable/broadcast)
│       │   └── publish_fast           (if fast/broadcast)
│       └── record_metrics
└── Result Submission

Span Attributes

| Span | Attributes |
|------|------------|
| publish_domain_event | event_name, namespace, correlation_id, delivery_mode |
| route_event | delivery_mode, target_queue (if durable) |
| publish_durable | queue_name, message_size |
| publish_fast | subscriber_count |
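
The child spans can be emitted with the tracing crate's span macros. A simplified, illustrative sketch only; the real instrumentation lives in tasker-shared and may differ in structure and attribute values:

use tracing::info_span;

// Sketch: emitting route_event / publish_durable / publish_fast spans with
// attribute names from the table above. Values here are placeholders.
fn route_event(delivery_mode: &str) {
    let _route = info_span!("route_event", delivery_mode = delivery_mode).entered();

    if delivery_mode == "durable" || delivery_mode == "broadcast" {
        let _durable =
            info_span!("publish_durable", queue_name = "payments_domain_events").entered();
        // ... send to PGMQ ...
    }

    if delivery_mode == "fast" || delivery_mode == "broadcast" {
        let _fast = info_span!("publish_fast", subscriber_count = 3u64).entered();
        // ... dispatch to the in-process bus ...
    }
}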

Troubleshooting

Console-Only Mode (No OTLP Export)

Symptom: Logs show “Console-only logging initialized (FFI-safe mode)” but no OpenTelemetry traces

Cause: init_console_only() was called but init_tracing() was never called, or TELEMETRY_ENABLED=false

Fix:

  1. Check initialization logs:
    grep -E "(Console-only|OpenTelemetry)" logs/worker.log
    
  2. Verify TELEMETRY_ENABLED=true is set:
    grep "opentelemetry_enabled" logs/worker.log
    

Domain Event Metrics Missing

Symptom: /metrics/events returns zeros for all stats

Cause: Events not being published or the event router/bus not tracking statistics

Fix:

  1. Verify events are being published:
    grep "Domain event published successfully" logs/worker.log
    
  2. Check event router initialization:
    grep "event router" logs/worker.log
    
  3. Verify in-process event bus is configured:
    grep "in-process event bus" logs/worker.log
    

Correlation ID Not Propagating

Symptom: Events have different correlation IDs than parent task

Cause: EventMetadata not constructed with task’s correlation_id

Fix: Verify EventMetadata is constructed with the correct correlation_id from the task:

#![allow(unused)]
fn main() {
// When constructing EventMetadata, always use the task's correlation_id
let metadata = EventMetadata {
    task_uuid: step_data.task.task.task_uuid,
    step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
    step_name: Some(step_data.workflow_step.name.clone()),
    namespace: step_data.task.namespace_name.clone(),
    correlation_id: step_data.task.task.correlation_id,  // Must use task's ID
    fired_at: chrono::Utc::now(),
    fired_by: handler_name.to_string(),
};
}

Best Practices

1. Always Use Two-Phase Init for FFI Workers

#![allow(unused)]
fn main() {
// Correct: Two-phase initialization pattern
// Phase 1: During FFI initialization (Magnus, PyO3, WASM)
tasker_shared::logging::init_console_only();

// Phase 2: After runtime creation
let runtime = tokio::runtime::Runtime::new()?;
runtime.block_on(async {
    tasker_shared::logging::init_tracing();
});

// Incorrect: Calling init_tracing() during FFI initialization
// before Tokio runtime exists (may cause issues with OTLP exporter)
}

2. Include Correlation ID in All Events

#![allow(unused)]
fn main() {
// Always propagate correlation_id from the task
let metadata = EventMetadata {
    task_uuid: step_data.task.task.task_uuid,
    step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
    step_name: Some(step_data.workflow_step.name.clone()),
    namespace: step_data.task.namespace_name.clone(),
    correlation_id: step_data.task.task.correlation_id,  // Critical!
    fired_at: chrono::Utc::now(),
    fired_by: handler_name.to_string(),
};
}

3. Use Structured Logging with Correlation Context

#![allow(unused)]
fn main() {
// All logs should include correlation_id for trace correlation
info!(
    event_id = %event_id,
    event_name = %event_name,
    correlation_id = %metadata.correlation_id,
    namespace = %metadata.namespace,
    "Domain event published successfully"
);
}


This telemetry architecture provides robust observability for domain events while ensuring safe operation with FFI-based language bindings.

Tasker Core Principles

This directory contains the core principles and design philosophy that guide Tasker Core development. These principles are not arbitrary rules but hard-won lessons extracted from implementation experience, root cause analyses, and architectural decisions.

Core Documents

| Document | Description |
|----------|-------------|
| Tasker Core Tenets | The 11 foundational principles that drive all architecture and design decisions |
| Defense in Depth | Multi-layered protection model for idempotency and data integrity |
| Fail Loudly | Why errors beat silent defaults, and phantom data breaks trust |
| Cross-Language Consistency | The “one API” philosophy for Rust, Ruby, Python, and TypeScript workers |
| Composition Over Inheritance | Mixin-based handler composition pattern |
| Intentional AI Partnership | Collaborative approach to AI integration |

Influences

| Document | Description |
|----------|-------------|
| Twelve-Factor App Alignment | How the 12-factor methodology shapes our architecture, with codebase examples and honest gap assessment |
| Zen of Python (PEP-20) | Tim Peters’ guiding principles — referenced as inspiration |

How These Principles Were Derived

These principles emerged from:

  1. Root Cause Analyses: Ownership removal revealed that “redundant protection with harmful side effects” is worse than minimal, well-understood protection
  2. Cross-Language Development: Handler harmonization established patterns for consistent APIs across four languages
  3. Architectural Migrations: Actor pattern refactoring proved the pattern’s effectiveness
  4. Production Incidents: Real bugs in parallel execution (Heisenbugs becoming Bohrbugs) shaped defensive design
  5. Protocol Trust Analysis: gRPC client refactoring exposed how silent defaults create phantom data that breaks consumer trust

When to Consult These Documents

  • Architecture Decisions: docs/decisions/ for specific ADRs
  • Historical Context: docs/CHRONOLOGY.md for development timeline
  • Implementation Details: docs/ticket-specs/ for original specifications

Composition Over Inheritance

Last Updated: 2026-01-01

This document describes Tasker Core’s approach to handler composition using mixins and traits rather than class hierarchies.

The Core Principle

Not: class Handler < API
But: class Handler < Base; include API; include Decision; include Batchable

Handlers gain capabilities by mixing in modules, not by inheriting from specialized base classes.


Why Composition?

The Problem with Inheritance

Deep inheritance hierarchies create problems:

# BAD: Inheritance-based capabilities
class APIDecisionBatchableHandler < APIDecisionHandler < APIHandler < BaseHandler
  # Which methods came from where?
  # How do I override just one behavior?
  # What if I need Batchable but not API?
end

| Problem | Description |
|---------|-------------|
| Diamond problem | Multiple paths to same ancestor |
| Tight coupling | Can’t change base without affecting all children |
| Inflexible | Can’t pick-and-choose capabilities |
| Hard to test | Must test entire hierarchy |
| Opaque behavior | Method origin unclear |

The Composition Solution

Mixins provide selective capabilities:

# GOOD: Composition-based capabilities
class MyHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::APICapable
  include TaskerCore::StepHandler::DecisionCapable

  def call(context)
    # Has API methods (get, post, put, delete)
    # Has Decision methods (decision_success, decision_no_branches)
    # Does NOT have Batchable methods (didn't include it)
  end
end

| Benefit | Description |
|---------|-------------|
| Selective inclusion | Only the capabilities you need |
| Clear origin | Module name indicates where methods come from |
| Independent testing | Test each mixin in isolation |
| Flexible combination | Any combination of capabilities |
| Flat structure | No deep hierarchies to navigate |

The Discovery

Analysis of Batchable handlers revealed they already used the composition pattern:

# Batchable was the TARGET architecture all along
class BatchHandler < Base
  include BatchableCapable  # Already doing it right!

  def call(context)
    batch_ctx = get_batch_context(context)
    # ...process batch...
    batch_worker_complete(processed_count: count, result_data: data)
  end
end

The cross-language handler harmonization recommended migrating API and Decision handlers to match this pattern.


Capability Modules

Available Capabilities

| Capability | Module (Ruby) | Methods Provided |
|------------|---------------|------------------|
| API | APICapable | get, post, put, delete |
| Decision | DecisionCapable | decision_success, decision_no_branches |
| Batchable | BatchableCapable | get_batch_context, batch_worker_complete, handle_no_op_worker |

Rust Traits

#![allow(unused)]
fn main() {
// Rust uses traits for the same pattern
pub trait APICapable {
    async fn get(&self, path: &str, params: Option<Params>) -> Response;
    async fn post(&self, path: &str, data: Option<Value>) -> Response;
    async fn put(&self, path: &str, data: Option<Value>) -> Response;
    async fn delete(&self, path: &str, params: Option<Params>) -> Response;
}

pub trait DecisionCapable {
    fn decision_success(&self, steps: Vec<String>, result: Value) -> StepExecutionResult;
    fn decision_no_branches(&self, result: Value) -> StepExecutionResult;
}

pub trait BatchableCapable {
    fn get_batch_context(&self, context: &StepContext) -> BatchContext;
    fn batch_worker_complete(&self, count: usize, data: Value) -> StepExecutionResult;
}
}

Python Mixins

# Python uses multiple inheritance (mixins)
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin, DecisionMixin

class MyHandler(StepHandler, APIMixin, DecisionMixin):
    def call(self, context: StepContext) -> StepHandlerResult:
        # Has both API and Decision methods
        response = self.get("/api/endpoint")
        return self.decision_success(["next_step"], response)

TypeScript Mixins

// TypeScript uses mixin functions applied in constructor
import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, applyDecision, APICapable, DecisionCapable } from '@tasker-systems/tasker';

class MyHandler extends StepHandler implements APICapable, DecisionCapable {
  constructor() {
    super();
    applyAPI(this);       // Adds get/post/put/delete methods
    applyDecision(this);  // Adds decisionSuccess/decisionNoBranches methods
  }

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Has both API and Decision methods
    const response = await this.get('/api/endpoint');
    return this.decisionSuccess(['next_step'], response.body);
  }
}

Separation of Concerns

What Orchestration Owns

The orchestration layer handles:

  • Domain event publishing (after results committed)
  • Decision point step creation (from DecisionPointOutcome)
  • Batch worker creation (from BatchProcessingOutcome)
  • State machine transitions

What Workers Own

Workers handle:

  • Decision logic (returns DecisionPointOutcome)
  • Batch analysis (returns BatchProcessingOutcome)
  • Handler execution (returns StepHandlerResult)
  • Custom publishers/subscribers (fast path events)

The Boundary

┌─────────────────────────────────────────────────────────────────┐
│                        Worker Space                              │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Handler.call(context)                                       ││
│  │   - Executes business logic                                 ││
│  │   - Uses API/Decision/Batchable capabilities               ││
│  │   - Returns StepHandlerResult with outcome                  ││
│  └─────────────────────────────────────────────────────────────┘│
│                              ↓ Result (with outcome)             │
├─────────────────────────────────────────────────────────────────┤
│                    Orchestration Space                           │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Process result                                              ││
│  │   - Commit state transition                                 ││
│  │   - If DecisionPointOutcome: create decision steps          ││
│  │   - If BatchProcessingOutcome: create batch workers         ││
│  │   - Publish domain events                                   ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

FFI Boundary Types

Outcomes crossing the FFI boundary need explicit types:

DecisionPointOutcome

#![allow(unused)]
fn main() {
// Rust definition
pub enum DecisionPointOutcome {
    ActivateSteps { step_names: Vec<String> },
    NoBranches,
}

// Serialized (all languages)
{
    "type": "ActivateSteps",
    "step_names": ["branch_a", "branch_b"]
}
}

BatchProcessingOutcome

#![allow(unused)]
fn main() {
// Rust definition
pub enum BatchProcessingOutcome {
    Continue { cursor: CursorConfig },
    Complete,
    NoOp,
}

// Serialized (all languages)
{
    "type": "Continue",
    "cursor": {
        "position": "offset_123",
        "batch_size": 100
    }
}
}
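
The tagged JSON shape shown above maps naturally onto serde's internally tagged enums. The following is an illustrative sketch of producing that wire shape; the actual tasker-shared definitions may use different serde attributes:

use serde::{Deserialize, Serialize};

// Sketch only: reproducing the {"type": ..., ...} wire shape with serde.
#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "type")]
enum DecisionPointOutcome {
    ActivateSteps { step_names: Vec<String> },
    NoBranches,
}

fn main() -> serde_json::Result<()> {
    let outcome = DecisionPointOutcome::ActivateSteps {
        step_names: vec!["branch_a".into(), "branch_b".into()],
    };
    // Prints: {"type":"ActivateSteps","step_names":["branch_a","branch_b"]}
    println!("{}", serde_json::to_string(&outcome)?);
    Ok(())
}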

Migration Path

Cross-Language Migration Examples

Ruby

Before (inheritance):

class MyAPIHandler < TaskerCore::APIHandler
  def call(context)
    # ...
  end
end

After (composition):

class MyAPIHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::Mixins::API

  def call(context)
    # Same implementation, different structure
  end
end

Python

Before (inheritance):

class MyAPIHandler(APIHandler):
    def call(self, context):
        # ...

After (composition):

from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin

class MyAPIHandler(StepHandler, APIMixin):
    def call(self, context):
        # Same implementation, different structure

TypeScript

Before (inheritance):

class MyAPIHandler extends APIHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    // ...
  }
}

After (composition):

import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, APICapable } from '@tasker-systems/tasker';

class MyAPIHandler extends StepHandler implements APICapable {
  constructor() {
    super();
    applyAPI(this);
  }

  async call(context: StepContext): Promise<StepHandlerResult> {
    // Same implementation, different structure
  }
}

Rust

Rust already used the composition pattern via traits:

#![allow(unused)]
fn main() {
// Rust has always used traits (composition)
impl StepHandler for MyHandler { ... }
impl APICapable for MyHandler { ... }
impl DecisionCapable for MyHandler { ... }
}

Breaking Changes Implemented

The migration to composition involved breaking changes:

  1. Base class changes across all languages
  2. Module/mixin includes required
  3. Ruby cursor indexing changed from 1-indexed to 0-indexed

All breaking changes were accumulated and released together.


Anti-Patterns

Don’t: Inherit from Multiple Specialized Classes

# BAD: Ruby doesn't support multiple inheritance like this
class MyHandler < APIHandler, DecisionHandler  # Syntax error!

Don’t: Reimplement Mixin Methods

# BAD: Overriding mixin methods defeats the purpose
class MyHandler < Base
  include APICapable

  def get(path, params: {})
    # Custom implementation - now you own this forever
  end
end

Don’t: Mix Concerns

# BAD: Handler doing orchestration's job
class MyHandler < Base
  include DecisionCapable

  def call(context)
    # Don't create steps directly!
    create_workflow_step("next_step")  # Orchestration does this!

    # Do return the outcome
    decision_success(steps: ["next_step"], result_data: {})
  end
end

Testing Composition

Test Mixins in Isolation

# Test the mixin itself
RSpec.describe TaskerCore::StepHandler::APICapable do
  let(:handler) { Class.new { include TaskerCore::StepHandler::APICapable }.new }

  it "provides get method" do
    expect(handler).to respond_to(:get)
  end
end

Test Handler with Stubs

# Test handler behavior, stub mixin methods
RSpec.describe MyHandler do
  let(:handler) { MyHandler.new }

  it "calls API and makes decision" do
    allow(handler).to receive(:get).and_return({ status: 200 })

    result = handler.call(context)

    expect(result.decision_point_outcome.type).to eq("ActivateSteps")
  end
end

Cross-Language Consistency

This document describes Tasker Core’s philosophy for maintaining consistent APIs across Rust, Ruby, Python, and TypeScript workers while respecting each language’s idioms.

The Core Philosophy

“There should be one–and preferably only one–obvious way to do it.” – The Zen of Python

When a developer learns one Tasker worker language, they should understand all of them at the conceptual level. The specific syntax changes; the patterns remain constant.


Consistency Without Uniformity

What We Align

Developer-facing touchpoints that affect daily work:

| Touchpoint | Why Align |
|------------|-----------|
| Handler signatures | Developers switch languages within projects |
| Result factories | Error handling should feel familiar |
| Registry APIs | Service configuration is cross-cutting |
| Context access patterns | Data access is the core operation |
| Specialized handlers | API, Decision, Batchable are reusable patterns |

What We Don’t Force

Language idioms that feel natural in their ecosystem:

| Ruby | Python | TypeScript | Rust |
|------|--------|------------|------|
| Blocks, yield | Decorators, context managers | Generics, interfaces | Traits, associated types |
| Symbols (:name) | Type hints | async/await | Pattern matching |
| Duck typing | ABC, Protocol | Union types | Enums, Result<T,E> |

The Aligned APIs

Handler Signatures

All handlers receive context, return results:

# Ruby
class MyHandler < TaskerCore::StepHandler::Base
  def call(context)
    success(result: { data: "value" })
  end
end
# Python
class MyHandler(BaseStepHandler):
    def call(self, context: StepContext) -> StepHandlerResult:
        return self.success({"data": "value"})
// TypeScript
class MyHandler extends BaseStepHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    return this.success({ data: "value" });
  }
}
#![allow(unused)]
fn main() {
// Rust
impl StepHandler for MyHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult {
        StepExecutionResult::success(json!({"data": "value"}), None)
    }
}
}

Result Factories

Success and failure patterns are identical:

| Operation | Pattern |
|-----------|---------|
| Success | success(result_data, metadata?) |
| Failure | failure(message, error_type, error_code?, retryable?, metadata?) |

The factory methods hide implementation details (wrapper classes, enum variants) behind a consistent interface.
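
As an illustration of the idea only (not the actual tasker-worker API), the same two factories can be expressed as constructors on a single result type:

use serde_json::{json, Value};

// Illustrative sketch of a unified result type with success/failure factories;
// field and parameter names follow the table above, not the real crate.
struct HandlerResult {
    success: bool,
    result: Value,
    error_message: Option<String>,
    error_type: Option<String>,
    retryable: bool,
}

impl HandlerResult {
    fn success(result: Value) -> Self {
        Self { success: true, result, error_message: None, error_type: None, retryable: false }
    }

    fn failure(message: &str, error_type: &str, retryable: bool) -> Self {
        Self {
            success: false,
            result: json!(null),
            error_message: Some(message.to_string()),
            error_type: Some(error_type.to_string()),
            retryable,
        }
    }
}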

Registry Operations

All registries support the same core operations:

| Operation | Description |
|-----------|-------------|
| register(name, handler) | Register a handler by name |
| is_registered(name) | Check if handler exists |
| resolve(name) | Get handler instance |
| list_handlers() | List all registered handlers |
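
A minimal sketch of what this surface looks like, for illustration only; each language binding ships its own concrete registry type:

use std::collections::HashMap;
use std::sync::Arc;

// Illustrative registry sketch mirroring the four operations above.
trait StepHandler: Send + Sync {}

#[derive(Default)]
struct HandlerRegistry {
    handlers: HashMap<String, Arc<dyn StepHandler>>,
}

impl HandlerRegistry {
    fn register(&mut self, name: &str, handler: Arc<dyn StepHandler>) {
        self.handlers.insert(name.to_string(), handler);
    }

    fn is_registered(&self, name: &str) -> bool {
        self.handlers.contains_key(name)
    }

    fn resolve(&self, name: &str) -> Option<Arc<dyn StepHandler>> {
        self.handlers.get(name).cloned()
    }

    fn list_handlers(&self) -> Vec<String> {
        self.handlers.keys().cloned().collect()
    }
}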

Context Access Patterns

The StepContext provides unified access to execution data:

Core Fields (All Languages)

| Field | Type | Description |
|-------|------|-------------|
| task_uuid | String | Unique task identifier |
| step_uuid | String | Unique step identifier |
| input_data | Dict/Hash | Input data for the step |
| step_config | Dict/Hash | Handler configuration |
| dependency_results | Wrapper | Results from parent steps |
| retry_count | Integer | Current retry attempt |
| max_retries | Integer | Maximum retry attempts |

Convenience Methods

| Method | Description |
|--------|-------------|
| get_task_field(name) | Get field from task context |
| get_dependency_result(step_name) | Get result from a parent step |

Specialized Handler Patterns

API Handler

HTTP operations available in all languages:

| Method | Pattern |
|--------|---------|
| GET | get(path, params?, headers?) |
| POST | post(path, data?, headers?) |
| PUT | put(path, data?, headers?) |
| DELETE | delete(path, params?, headers?) |

Decision Handler

Conditional workflow branching:

# Ruby
decision_success(steps: ["branch_a", "branch_b"], result_data: { routing: "criteria" })
decision_no_branches(result_data: { reason: "no action needed" })
# Python
self.decision_success(["branch_a", "branch_b"], {"routing": "criteria"})
self.decision_no_branches({"reason": "no action needed"})

Batchable Handler

Cursor-based batch processing:

| Operation | Description |
|-----------|-------------|
| get_batch_context(context) | Extract batch metadata |
| batch_worker_complete(count, data) | Signal batch completion |
| handle_no_op_worker(batch_ctx) | Handle empty batch |

FFI Boundary Types

When data crosses the FFI boundary (Rust <-> Ruby/Python/TypeScript), types must serialize identically:

Required Explicit Types

| Type | Purpose |
|------|---------|
| DecisionPointOutcome | Decision handler results |
| BatchProcessingOutcome | Batch handler results |
| CursorConfig | Batch cursor configuration |
| StepHandlerResult | All handler results |

Serialization Guarantee

The same JSON representation must work across all languages:

{
  "success": true,
  "result": { "data": "value" },
  "metadata": { "timing_ms": 50 }
}
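
One way to exercise this guarantee from the Rust side is to round-trip the wire format through a typed struct. A sketch only; field names follow the JSON above, not the actual tasker-shared type:

use serde::{Deserialize, Serialize};
use serde_json::json;

// Sketch: the wire format shown above, parsed into a strongly typed struct.
#[derive(Debug, Serialize, Deserialize)]
struct WireResult {
    success: bool,
    result: serde_json::Value,
    metadata: serde_json::Value,
}

fn main() -> serde_json::Result<()> {
    let wire = json!({
        "success": true,
        "result": { "data": "value" },
        "metadata": { "timing_ms": 50 }
    });

    let parsed: WireResult = serde_json::from_value(wire)?;
    assert!(parsed.success);
    println!("{parsed:?}");
    Ok(())
}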

Why This Matters

Developer Productivity

When switching from a Ruby handler to a Python handler:

  • No relearning core concepts
  • Same mental model applies
  • Documentation transfers

Code Review Consistency

Reviewers can evaluate handlers in any language:

  • Pattern violations are obvious
  • Best practices are universal
  • Anti-patterns are recognizable

Documentation Efficiency

One set of conceptual docs serves all languages:

  • Language-specific pages show syntax only
  • Core patterns documented once
  • Examples parallel across languages

The Pre-Alpha Advantage

In pre-alpha, we can make breaking changes to achieve consistency:

| Change Type | Example |
|-------------|---------|
| Method renames | handle() → call() |
| Signature changes | (task, step) → (context) |
| Return type unification | Separate Success/Error → unified result |

These changes would be costly post-release but are cheap now.


Migration Path

When APIs diverge, we follow this pattern:

  1. Non-Breaking First: Add aliases, helpers, new modules
  2. Deprecation Period: Mark old APIs deprecated (warnings in logs)
  3. Breaking Release: Remove old APIs, document migration

Example timeline:

Phase 1: Python migration (non-breaking + breaking)
Phase 2: Ruby migration (non-breaking + breaking)
Phase 3: Rust alignment (already aligned)
Phase 4: TypeScript alignment (new implementation)
Phase 5: Breaking changes release (all languages together)

Anti-Patterns

Don’t: Force Identical Syntax

# BAD: Ruby-style symbols in Python
def call(context) -> Hash[:success => true]  # Not Python!

Don’t: Ignore Language Idioms

# BAD: Python-style type hints in Ruby
def call(context: StepContext) -> StepHandlerResult  # Not Ruby!

Don’t: Duplicate Orchestration Logic

# BAD: Worker creating decision steps
def call(context)
  # Don't do orchestration's job!
  create_decision_steps(...)  # Orchestration handles this
end

Defense in Depth

This document describes Tasker Core’s multi-layered protection model for idempotency and data integrity.

The Four Protection Layers

Tasker Core uses four independent protection layers. Each layer catches what others might miss, and no single layer bears full responsibility for data integrity.

┌─────────────────────────────────────────────────────────────────┐
│                    Layer 4: Application Logic                   │
│                    (State-based deduplication)                  │
├─────────────────────────────────────────────────────────────────┤
│                    Layer 3: Transaction Boundaries              │
│                    (All-or-nothing semantics)                   │
├─────────────────────────────────────────────────────────────────┤
│                    Layer 2: State Machine Guards                │
│                    (Current state validation)                   │
├─────────────────────────────────────────────────────────────────┤
│                    Layer 1: Database Atomicity                  │
│                    (Unique constraints, row locks, CAS)         │
└─────────────────────────────────────────────────────────────────┘

Layer 1: Database Atomicity

The foundation layer using PostgreSQL’s transactional guarantees.

Mechanisms

| Mechanism | Purpose | Example |
|-----------|---------|---------|
| Unique constraints | Prevent duplicate records | One active task per (namespace, external_id) |
| Row-level locking | Prevent concurrent modification | SELECT ... FOR UPDATE on task claim |
| Compare-and-swap | Atomic state transitions | UPDATE ... WHERE state = $expected |
| Advisory locks | Distributed coordination | Template cache invalidation |

Atomic Claiming Pattern

-- Only one processor can claim a task
UPDATE tasks
SET state = 'in_progress',
    processor_uuid = $1,
    claimed_at = NOW()
WHERE id = $2
  AND state = 'pending'  -- CAS: only if still pending
RETURNING *;

If two processors try to claim the same task:

  • First: Succeeds, task transitions to in_progress
  • Second: Fails (0 rows affected), no state change

Why This Works

PostgreSQL’s MVCC ensures the WHERE state = 'pending' check and the SET state = 'in_progress' happen atomically. There’s no window where both processors see state = 'pending'.


Layer 2: State Machine Guards

State machine validation before any transition is attempted.

Implementation

#![allow(unused)]
fn main() {
impl TaskStateMachine {
    pub fn can_transition(&self, from: TaskState, to: TaskState) -> bool {
        VALID_TRANSITIONS.contains(&(from, to))
    }

    pub fn transition(&mut self, to: TaskState) -> Result<(), StateError> {
        if !self.can_transition(self.current, to) {
            return Err(StateError::InvalidTransition { from: self.current, to });
        }
        // Proceed with transition
    }
}
}

Valid Transitions Matrix

The state machine explicitly defines which transitions are valid:

Pending → Initializing → EnqueuingSteps → StepsInProcess
                                              ↓
Complete ← EvaluatingResults ← (step completions)
                  ↓
               Error (from any state)

Invalid transitions are rejected before reaching the database.
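
For illustration, the matrix from the diagram could be expressed as data that the guard checks. This is a sketch only; the actual VALID_TRANSITIONS definition in tasker-core may differ:

// Sketch: the transition matrix from the diagram above as checkable data.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TaskState {
    Pending,
    Initializing,
    EnqueuingSteps,
    StepsInProcess,
    EvaluatingResults,
    Complete,
    Error,
}

use TaskState::*;

const VALID_TRANSITIONS: &[(TaskState, TaskState)] = &[
    (Pending, Initializing),
    (Initializing, EnqueuingSteps),
    (EnqueuingSteps, StepsInProcess),
    (StepsInProcess, EvaluatingResults),
    (EvaluatingResults, Complete),
];

fn can_transition(from: TaskState, to: TaskState) -> bool {
    // Error is reachable from any state, per the diagram above
    to == Error || VALID_TRANSITIONS.contains(&(from, to))
}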

Why This Works

Application-level guards prevent obviously invalid operations from even attempting database changes. This reduces database load and provides better error messages.


Layer 3: Transaction Boundaries

All-or-nothing semantics for multi-step operations.

Example: Step Enqueueing

#![allow(unused)]
fn main() {
async fn enqueue_steps(task_id: TaskId, steps: Vec<Step>) -> Result<()> {
    let mut tx = pool.begin().await?;

    // Insert all steps
    for step in steps {
        insert_step(&mut tx, task_id, &step).await?;
    }

    // Update task state
    update_task_state(&mut tx, task_id, TaskState::StepsInProcess).await?;

    // Atomic commit - all or nothing
    tx.commit().await?;
    Ok(())
}
}

If step insertion fails:

  • Transaction rolls back
  • Task state unchanged
  • No partial steps created

Why This Works

PostgreSQL transactions ensure that either all changes commit or none do. There’s no intermediate state where some steps exist but task state is wrong.


Layer 4: Application-Level Filtering

State-based deduplication in application logic.

Example: Result Processing

#![allow(unused)]
fn main() {
async fn process_result(step_id: StepId, result: HandlerResult) -> Result<()> {
    let step = get_step(step_id).await?;

    // Filter: Only process if step is in_progress
    if step.state != StepState::InProgress {
        log::info!("Ignoring result for step {} in state {:?}", step_id, step.state);
        return Ok(()); // Idempotent: already processed
    }

    // Proceed with result processing
    apply_result(step, result).await
}
}

Why This Works

Even if the same result arrives multiple times (network retries, duplicate messages), only the first processing has effect. Subsequent attempts see the step already transitioned and exit cleanly.


The Discovery: Ownership Was Harmful

What We Learned

Analysis of processor UUID “ownership” enforcement revealed:

#![allow(unused)]
fn main() {
// OLD: Ownership enforcement (REMOVED)
fn can_process(&self, processor_uuid: Uuid) -> bool {
    self.owner_uuid == processor_uuid  // BLOCKED recovery!
}

// NEW: Ownership tracking only (for audit)
fn process(&self, processor_uuid: Uuid) -> Result<()> {
    self.record_processor(processor_uuid);  // Track, don't enforce
    // ... proceed with processing
}
}

Why Ownership Enforcement Was Removed

| Scenario | With Enforcement | Without Enforcement |
|----------|------------------|---------------------|
| Normal operation | Works | Works |
| Orchestrator crash & restart | BLOCKED - new UUID | Automatic recovery |
| Duplicate message | Rejected | Layer 1 (CAS) rejects |
| Race condition | Rejected | Layer 1 (CAS) rejects |

The four protection layers already prevent corruption. Ownership added:

  • Zero additional safety (layers 1-4 sufficient)
  • Recovery blocking (crashed tasks stuck forever)
  • Operational complexity (manual intervention needed)

The Verdict

“Processor UUID ownership was redundant protection with harmful side effects.”

When two actors receive identical messages:

  • First: Succeeds atomically (Layer 1 CAS)
  • Second: Fails cleanly (Layer 1 CAS)
  • No partial state, no corruption
  • No ownership check needed

Designing New Protections

When adding protection mechanisms, evaluate against this checklist:

Before Adding Protection

  1. Which layer does this belong to? (Database, state machine, transaction, application)
  2. What does it protect against? (Be specific: race condition, duplicate, corruption)
  3. Do existing layers already cover this? (Usually yes)
  4. What failure modes does it introduce? (Blocked recovery, increased latency)
  5. Can the system recover if this protection itself fails?

The Minimal Set Principle

Find the minimal set of protections that prevents corruption. Additional layers that prevent recovery are worse than none.

A system that:

  • Has fewer protections
  • Recovers automatically from crashes
  • Handles duplicates idempotently

Is better than a system that:

  • Has more protections
  • Requires manual intervention after crashes
  • Is “theoretically more secure”


Relationship to Fail Loudly

Defense in Depth and Fail Loudly are complementary principles:

| Defense in Depth | Fail Loudly |
|------------------|-------------|
| Multiple layers prevent corruption | Errors surface problems immediately |
| Redundancy catches edge cases | Transparency enables diagnosis |
| Protection happens before damage | Visibility happens at detection |

Both reject the same anti-pattern: silent failures.

  • Defense in Depth rejects: silent corruption (data changed without protection)
  • Fail Loudly rejects: silent defaults (missing data hidden with fabricated values)

Together they ensure: if something goes wrong, we know about it—either protection prevents it, or an error surfaces it.


Fail Loudly

This document describes Tasker Core’s philosophy on error handling: errors are first-class citizens, not inconveniences to hide.

The Core Principle

A system that lies is worse than one that fails.

When data is missing, malformed, or unexpected, the correct response is an explicit error—not a fabricated default that makes the problem invisible.


The Problem: Phantom Data

“Phantom data” is data that:

  • Looks valid to consumers
  • Passes type checks and validation
  • Contains no actual information from the source
  • Was fabricated by defensive code trying to be “helpful”

Example: The Silent Default

#![allow(unused)]
fn main() {
// WRONG: Silent default hides protocol violation
fn get_pool_utilization(response: Response) -> PoolUtilization {
    response.pool_utilization.unwrap_or_else(|| PoolUtilization {
        active_connections: 0,
        idle_connections: 0,
        max_connections: 0,
        utilization_percent: 0.0,  // Looks like "no load"
    })
}
}

A monitoring system receiving this response sees:

  • utilization_percent: 0.0 — “Great, the system is idle!”
  • Reality: The server never sent pool data. The system might be at 100% load.

The consumer cannot distinguish “server reported 0%” from “server sent nothing.”

The Trust Equation

Silent default
  → Consumer receives valid-looking data
  → Consumer makes decisions based on phantom values
  → Phantom bugs manifest in production
  → Debugging nightmare: "But the data looked correct!"

vs.

Explicit error
  → Consumer receives clear failure
  → Consumer handles error appropriately
  → Problem visible immediately
  → Fix applied at source

The Solution: Explicit Errors

Pattern: Required Fields Return Errors

#![allow(unused)]
fn main() {
// RIGHT: Explicit error on missing required data
fn get_pool_utilization(response: Response) -> Result<PoolUtilization, ClientError> {
    response.pool_utilization.ok_or_else(|| {
        ClientError::invalid_response(
            "Response.pool_utilization",
            "Server omitted required pool utilization data",
        )
    })
}
}

Now the consumer:

  • Knows data is missing
  • Can retry, alert, or degrade gracefully
  • Never operates on phantom values

Pattern: Distinguish Required vs Optional

Not all fields should fail on absence. The distinction matters:

| Field Type | Missing Means | Response |
|------------|---------------|----------|
| Required | Protocol violation, server bug | Return error |
| Optional | Legitimately absent, feature not configured | Return None |

#![allow(unused)]
fn main() {
// Required: Server MUST send health checks
let checks = response.checks.ok_or_else(||
    ClientError::invalid_response("checks", "missing")
)?;

// Optional: Distributed cache may not be configured
let cache = response.distributed_cache; // Option<T> preserved
}

Pattern: Propagate, Don’t Swallow

Errors should flow up, not disappear:

#![allow(unused)]
fn main() {
// WRONG: Error swallowed, default returned
fn convert_response(r: Response) -> DomainType {
    let info = r.info.unwrap_or_default();  // Error hidden
    // ...
}

// RIGHT: Error propagated to caller
fn convert_response(r: Response) -> Result<DomainType, ClientError> {
    let info = r.info.ok_or_else(||
        ClientError::invalid_response("info", "missing")
    )?;  // Error visible
    // ...
}
}

When Defaults Are Acceptable

Not every unwrap_or_default() is wrong. Defaults are acceptable when:

  1. The field is explicitly optional in the domain model

    #![allow(unused)]
    fn main() {
    // Optional metadata that may legitimately be absent
    let metadata: Option<Value> = response.metadata;
    }
  2. The default is semantically meaningful

    #![allow(unused)]
    fn main() {
    // Empty tags list is valid—means "no tags"
    let tags = response.tags.unwrap_or_default(); // Vec<String>
    }
  3. Absence cannot be confused with a valid value

    #![allow(unused)]
    fn main() {
    // description being None vs "" are distinguishable
    let description: Option<String> = response.description;
    }

Red Flags to Watch For

When reviewing code, these patterns indicate potential phantom data:

1. unwrap_or_default() on Numeric Types

#![allow(unused)]
fn main() {
// RED FLAG: 0 looks like a valid measurement
let active_connections = pool.active.unwrap_or_default();
}

2. unwrap_or_else(|| ...) with Fabricated Values

#![allow(unused)]
fn main() {
// RED FLAG: "unknown" looks like real status
let status = check.status.unwrap_or_else(|| "unknown".to_string());
}

3. Default Structs for Missing Nested Data

#![allow(unused)]
fn main() {
// RED FLAG: Entire section fabricated
let config = response.config.unwrap_or_else(default_config);
}

4. Silent Fallbacks in Health Checks

#![allow(unused)]
fn main() {
// RED FLAG: Health check that never fails is useless
let health = check_health().unwrap_or(HealthStatus::Ok);
}

Implementation Checklist

When implementing new conversions or response handling:

  • Is this field required by the protocol/API contract?
  • If missing, would a default be indistinguishable from a valid value?
  • Could a consumer make incorrect decisions based on a default?
  • Is the error message actionable? (includes field name, explains what’s wrong)
  • Is the error type appropriate? (InvalidResponse for protocol violations)

The Discovery

What We Found

During gRPC client implementation, analysis revealed pervasive patterns like:

#![allow(unused)]
fn main() {
// Found throughout conversions.rs
let checks = response.checks.unwrap_or_else(|| ReadinessChecks {
    web_database: HealthCheck { status: "unknown".into(), ... },
    orchestration_database: HealthCheck { status: "unknown".into(), ... },
    // ... more fabricated checks
});
}

A client calling get_readiness() would receive what looked like a valid response with “unknown” status for all checks—when in reality, the server sent nothing.

The Refactoring

All required-field patterns were changed to explicit errors:

#![allow(unused)]
fn main() {
// After refactoring
let checks = response.checks.ok_or_else(|| {
    ClientError::invalid_response(
        "ReadinessResponse.checks",
        "Readiness response missing required health checks",
    )
})?;
}

Now a malformed server response immediately fails with:

Error: Invalid response: ReadinessResponse.checks - Readiness response missing required health checks

The problem is visible. The fix can be applied. Trust is preserved.


  • Tenet #11: Fail Loudly in Tasker Core Tenets
  • Meta-Principle #6: Errors Over Defaults
  • Defense in Depth — fail loudly is a form of protection; silent defaults are a form of hiding

Summary

| Don’t | Do |
|-------|----|
| Hide missing data with defaults | Return explicit errors |
| Make consumers guess if data is real | Distinguish required vs optional |
| Fabricate “unknown” status values | Error: “status unavailable” |
| Swallow errors in conversions | Propagate with ? operator |
| Treat all fields as optional | Model optionality in types |

The golden rule: If you can’t tell the difference between “server sent 0” and “server sent nothing,” you have a phantom data problem.

Intentional AI Partnership

A philosophy of rigorous collaboration for the age of AI-assisted engineering


The Growing Divide

There is a phrase gaining traction in software engineering circles: “Nice AI slop.”

It’s dismissive. It’s reductive. And it’s not entirely wrong.

The critique is valid: AI tools have made it possible to generate enormous volumes of code without understanding what that code does, why it’s structured the way it is, or how to maintain it when something breaks at 2 AM. Engineers who would never have shipped code they couldn’t explain are now approving pull requests they couldn’t debug. Project leads are drowning in contributions from well-meaning developers who “vibe-coded” their way into maintenance nightmares.

For those of us who have spent years—decades—in the craft of software engineering, who have sat with codebases through their full lifecycles, who have felt the weight of technical decisions made five years ago landing on our shoulders today, this is frustrating. The hard-won discipline of our profession seems to be eroding in favor of velocity.

And yet.

The response to “AI slop” cannot be rejection of AI as a partner in engineering work. That path leads to irrelevance. The question is not whether to work with AI, but how—with what principles, what practices, what commitments to quality and accountability.

This document is an attempt to articulate those principles. Not as abstract ideals, but as a working philosophy grounded in practice: building real systems, shipping real code, maintaining real accountability.


The Core Insight: Amplification, Not Replacement

AI does not create the problems we’re seeing. It amplifies them.

Teams that already had weak ownership practices now produce more poorly-understood code, faster. Organizations where “move fast and break things” meant “ship it and let someone else figure it out” now ship more of it. Engineers who never quite understood the systems they worked on can now generate more code they don’t understand.

But the inverse is also true.

Teams with strong engineering discipline—clear specifications, rigorous review, genuine ownership—can leverage AI to operate at a higher level of abstraction while maintaining (or even improving) quality. The acceleration becomes an advantage, not a liability.

This is the same dynamic that exists in any collaboration. A junior engineer paired with a senior engineer who doesn’t mentor becomes a junior engineer who writes more code without learning. A junior engineer paired with a senior engineer who invests in their growth becomes a stronger engineer, faster.

AI partnership follows the same pattern. The quality of the outcome depends on the quality of the collaboration practices surrounding it.

The discipline required for effective AI partnership is not new. It is the discipline that should characterize all engineering collaboration. AI simply makes the presence or absence of that discipline more visible, more consequential, and more urgent.


Principles of Intentional Partnership

1. Specification Before Implementation

The most effective AI collaboration begins long before code is written.

When you ask an AI to “build a feature,” you get code. When you work with an AI to understand the problem, research the landscape, evaluate approaches, and specify the solution—then implement—you get software.

This is not an AI-specific insight. It’s foundational engineering practice. But AI makes the cost of skipping specification deceptively low: you can generate code instantly, so why spend time on design? The answer is the same as it’s always been: because code without design is not software, it’s typing.

In practice:

  • Begin with exploration: What problem are we solving? What does the current system look like? What will be different when this work is complete?
  • Research with tools: Use AI capabilities to understand the codebase, explore patterns in the ecosystem, review prior art. Ground the work in reality, not assumptions.
  • Develop evaluation criteria before evaluating solutions. Know what “good” looks like before you start judging options.
  • Document the approach, not just the code. Specifications are artifacts of understanding.

2. Phased Delivery with Validation Gates

Large work should be decomposed into phases, and each phase should have clear acceptance criteria.

This principle exists because humans have limited working memory. It’s true for individual engineers, it’s true for teams, and it’s true for AI systems. Complex work exceeds the capacity of any single context—human or machine—to hold it all at once.

Phased delivery is how we manage this limitation. Each phase is small enough to understand completely, validate thoroughly, and commit to confidently. The boundaries between phases are synchronization points where understanding is verified.

In practice:

  • Identify what can be parallelized versus what must be sequential. Not all work is equally dependent.
  • Determine which aspects require careful attention versus which can be resolved at implementation time. Not all decisions are equally consequential.
  • Each phase should be independently validatable: tests pass, acceptance criteria met, code reviewed.
  • Phase documentation should include code samples for critical paths. Show, don’t just tell.

3. Validation as a First-Class Concern

Testing is not a phase that happens after implementation. It is a design constraint that shapes implementation.

AI can generate tests as easily as it generates code. This makes it tempting to treat testing as an afterthought: write the code, then generate tests to cover it. This inverts the value proposition of testing entirely.

Tests are specifications. They encode expectations about behavior. When tests are written first—or at least designed first—they constrain the implementation toward correctness. When tests are generated after the fact, they merely document whatever the implementation happens to do, bugs included.

In practice:

  • Define acceptance criteria before implementation begins.
  • Include edge cases, boundary conditions, and non-happy-path scenarios in specifications.
  • End-to-end testing validates that the system works, not just that individual units work.
  • Review tests with the same rigor as implementation code. Tests can have bugs too.

4. Human Accountability as the Final Gate

This is the principle that separates intentional partnership from “AI slop.”

The human engineer is ultimately responsible for code that ships. Not symbolically responsible—actually responsible. Responsible for understanding what the code does, why it’s structured the way it is, what trade-offs were made, and how to maintain it.

This is not about low trust in AI. It’s about the nature of accountability.

If you cannot explain why a particular approach was chosen, you should not approve it. If you cannot articulate the trade-offs embedded in a design decision, you should not sign off on it. If you cannot defend a choice—or at least explain why the choice wasn’t worth extensive deliberation—then you are not in a position to take responsibility for it.

This standard applies to all code, regardless of its origin. Human-written code that the approving engineer doesn’t understand is no better than AI-written code they don’t understand. The source is irrelevant; the accountability is what matters.

In practice:

  • Review is not approval. Approval requires understanding.
  • The bikeshedding threshold is a valid concept: knowing why something isn’t worth debating is also knowledge. But you must actually know this, not assume it.
  • Code review agents and architectural validators are useful, but they augment human judgment rather than replacing it.
  • If you wouldn’t ship code you wrote yourself without understanding it, don’t ship AI-written code without understanding it either.

5. Documentation as Extended Cognition

Documentation is not an artifact of completed work. It is a tool that enables work to continue.

Every engineer who joins a project faces the same challenge: building sufficient context to contribute effectively. Every AI session faces the same challenge: starting fresh without memory of prior work. Good documentation serves both.

This is the insight that makes documentation investment worthwhile: it extends cognition across time and across minds. The context you build today, documented well, becomes instantly available to future collaborators—human or AI.

In practice:

  • Structure documentation for efficient context loading. Navigation guides, trigger patterns, clear hierarchies.
  • Capture the “why” alongside the “what.” Decisions without rationale are trivia.
  • Principles, architecture, guides, reference—different documents serve different needs at different times.
  • Documentation that serves future AI sessions also serves future human engineers. The requirements are the same: limited working memory, need for efficient orientation.

6. Toolchain Alignment

Some development environments are better suited to intentional partnership than others.

The ideal toolchain provides fast feedback loops, enforces correctness constraints, and makes architectural decisions explicit. The compiler, the type system, the test framework—these become additional collaborators in the process, catching errors early and forcing clarity about intent.

Languages and tools that defer decisions to runtime, that allow implicit behavior, that prioritize flexibility over explicitness, make intentional partnership harder. Not impossible—but harder. The burden of verification shifts more heavily to the human.

In practice:

  • Strong type systems document intent in ways that survive across sessions and collaborators.
  • Compilers that enforce correctness (memory safety, exhaustive matching) catch the classes of errors most likely to slip through in high-velocity development.
  • Explicit architectural patterns—actor models, channel semantics, clear ownership boundaries—force intentional design rather than emergent mess.
  • The goal is not language advocacy but recognition: your toolchain affects your collaboration quality.

A Concrete Example: Building Tasker

These principles are not theoretical. They emerged from—and continue to guide—the development of Tasker, a workflow orchestration system built in Rust.

Why Rust?

Rust is not chosen as a recommendation but as an illustration of what makes a toolchain powerful for intentional partnership.

The Rust compiler forces agreement on memory ownership. You cannot be vague about who owns data and when it’s released; the borrow checker requires explicitness. This means architectural decisions about data flow must be made consciously rather than accidentally.

Exhaustive pattern matching means you cannot forget to handle a case. Every enum variant must be addressed. This is particularly valuable when working with AI: generated code that handles only the happy path fails to compile rather than failing silently in production.

The type system documents intent in ways that persist across context windows. When an AI session resumes work on a Rust codebase, the types communicate constraints that would otherwise need to be re-established through conversation.

Tokio channels, MPSC patterns, actor boundaries—these require intentional design. You cannot stumble into an actor architecture; you must choose it and implement it explicitly. This aligns well with specification-driven development.

None of this makes Rust uniquely suitable or necessary. It makes Rust an example of the properties that matter: explicitness, enforcement, feedback loops that catch errors early.

The Spec-Driven Workflow

Every significant piece of Tasker work follows a pattern:

  1. Problem exploration: What are we trying to accomplish? What’s the current state? What will success look like?

  2. Grounded research: Use AI capabilities to understand the codebase, explore ecosystem patterns, review tooling options. Generate a situated view of how the problem exists within the actual system.

  3. Approach analysis: Develop criteria for evaluating solutions. Generate multiple approaches. Evaluate against criteria. Select and refine.

  4. Phased planning: Break work into milestones with validation gates. Identify dependencies, parallelization opportunities, risk areas. Determine what needs careful specification versus what can be resolved during implementation.

  5. Phase documentation: Each phase gets its own specification in a dedicated directory. Includes acceptance criteria, code samples for critical paths, and explicit validation requirements.

  6. Implementation with validation: Work proceeds phase by phase. Tests are written. Code is reviewed. Each phase is complete before the next begins.

  7. Human accountability gate: The human partner reviews not just for correctness but for understanding. Can they defend the choices? Do they know why alternatives were rejected? Are they prepared to maintain this code?

This workflow produces comprehensive documentation as a side effect of doing the work. The docs/ticket-specs/ directories in Tasker contain detailed specifications that serve both as implementation guides and as institutional memory. Future engineers—and future AI sessions—can understand not just what was built but why.

The Tenets as Guardrails

Tasker’s development is guided by ten core tenets, derived from experience. Several are directly relevant to intentional partnership:

State Machine Rigor: All state transitions are atomic, audited, and validated. This principle emerged from debugging distributed systems failures; it also provides clear contracts for AI-generated code to satisfy.

Defense in Depth: Multiple overlapping protection layers rather than single points of failure. In collaboration terms: review, testing, type checking, and runtime validation each catch what others might miss.

Composition Over Inheritance: Capabilities are composed via mixins, not class hierarchies. This produces code that’s easier to understand in isolation—crucial when any given context (human or AI) can only hold part of the system at once.

These tenets emerged from building software over many years. They apply to AI partnership because they apply to engineering generally. AI is a collaborator; good engineering principles govern collaboration.


The Organizational Dimension

Intentional AI partnership is not just an individual practice. It’s an organizational capability.

What Changes

When AI acceleration is available to everyone, the differentiator becomes the quality of surrounding practices:

  • Specification quality determines whether AI generates useful code or plausible-looking nonsense.
  • Review rigor determines whether errors are caught before or after deployment.
  • Testing discipline determines whether systems are verifiably correct or coincidentally working.
  • Documentation investment determines whether institutional knowledge accumulates or evaporates.

Organizations that were already strong in these areas will find AI amplifies their strength. Organizations that were weak will find AI amplifies their weakness—faster.

The Accountability Question

The hardest organizational challenge is accountability.

When an engineer can generate a month’s worth of code in a day, traditional review processes break down. You cannot carefully review a thousand lines of code per hour. Something has to give.

The answer is not “skip review” or “automate review entirely.” The answer is to change what gets reviewed.

In intentional partnership, the specification is the primary artifact. The specification is reviewed carefully: Does this approach make sense? Does it align with architectural principles? Does it handle edge cases? Does it integrate with existing systems?

The implementation—whether AI-generated or human-written—is validated against the specification. Tests verify behavior. Type systems verify contracts. Review confirms that the implementation matches the spec.

This shifts review from “read every line of code” to “verify that implementation matches intent.” It’s a different skill, but it’s learnable. And it scales in ways that line-by-line review does not.

Building the Capability

Organizations building intentional AI partnership should focus on:

  1. Specification practices: Invest in training engineers to write clear, complete specifications. This skill was always valuable; it’s now critical.

  2. Review culture: Shift review culture from gatekeeping to verification. The question is not “would I have written it this way?” but “does this correctly implement the specification?”

  3. Testing infrastructure: Fast, comprehensive test suites become even more valuable when implementation velocity increases. Invest accordingly.

  4. Documentation standards: Establish expectations for documentation quality. Make documentation a first-class deliverable, not an afterthought.

  5. Toolchain alignment: Choose languages, frameworks, and tools that provide fast feedback and enforce correctness. The compiler is a collaborator.


The Call to Action: What Becomes Possible

There is another dimension to this conversation that deserves attention.

We have focused on rigor, accountability, and the discipline required to avoid producing “slop.” This framing is necessary but insufficient. It treats AI partnership primarily as a risk to be managed rather than an opportunity to be seized.

Consider what has changed.

For decades, software engineers have carried mental backlogs of things we would build if we had the time. Ideas that were sound, architecturally feasible, genuinely useful—but the time-to-execute made them impractical. Side projects abandoned. Features deprioritized. Entire systems that existed only as sketches in notebooks because the implementation cost was prohibitive.

That calculus has shifted.

AI partnership, applied rigorously, compresses implementation timelines in ways that make previously infeasible work feasible. The system you would have built “someday” can be prototyped in a weekend. The refactoring you’ve been putting off for years can be specified, planned, and executed in weeks. The tooling you wished existed can be created rather than merely wished for.

This is not about moving faster for its own sake. It’s about what becomes creatively possible when the friction of implementation is reduced.

Tasker exists because of this shift. A workflow orchestration system supporting four languages, with comprehensive documentation, rigorous testing, and production-grade architecture—built as a labor of love alongside a demanding day job. Ten years ago, this project would have remained an idea. Five years ago, perhaps a half-finished prototype. Today, it’s real software approaching production readiness.

And Tasker is not unique. Across the industry, engineers are building things that would not have existed otherwise. Not “AI-generated slop,” but genuine contributions to the craft—systems built with care, designed with intention, maintained with accountability.

This is what’s at stake when we talk about intentional partnership.

When we approach AI collaboration carelessly, we produce code we don’t understand and can’t maintain. We waste the capability on work that creates more problems than it solves. We give ammunition to critics who argue that AI makes engineering worse.

When we approach AI collaboration with rigor, clarity, and commitment to excellence, we unlock creative possibilities that were previously out of reach. We build things that matter. We expand what a single engineer, or a small team, can accomplish.

Squandering this capability on careless work fails to respect ourselves: our time, our creativity, our professional aspirations. Using the partnership without intention fails to respect the partnership.

The opportunity before us is unprecedented. The discipline required to seize it is not new—it’s the discipline of good engineering, applied to a new context.

Let’s not waste it.


Conclusion: Craft Persists

The critique of “AI slop” is fundamentally a critique of craft—or its absence.

Craft is the accumulated wisdom of how to do something well. In software engineering, craft includes knowing when to abstract and when to be concrete, when to optimize and when to leave well enough alone, when to document and when the code is the documentation. Craft is what separates software that works from software that lasts.

AI does not possess craft. AI possesses capability—vast capability—but capability without wisdom is dangerous. This is true of humans as well; we just notice it less because human capability is more limited.

Intentional AI partnership is the practice of combining AI capability with human craft. The AI brings speed, breadth of knowledge, tireless pattern matching. The human brings judgment, accountability, and the accumulated wisdom of the profession.

Neither is sufficient alone. Together, working with discipline and intention, they can build software that is not just functional but maintainable, not just shipped but understood, not just code but craft.

The divide between “AI slop” and intentional partnership is not about the tools. It’s about us—whether we bring the same standards to AI collaboration that we would (or should) bring to any engineering work.

The tools are new. The standards are not. Let’s hold ourselves to them.


This document is part of the Tasker Core project principles. It reflects one approach to AI-assisted engineering; your mileage may vary. The principles here emerged from practice and continue to evolve with it.

Tasker Core Tenets

These 11 tenets guide all architectural and design decisions in Tasker Core. Each emerged from real implementation experience, root cause analyses, or architectural migrations.


The 11 Tenets

1. Defense in Depth

Multiple overlapping protection layers provide robust idempotency without single-point dependency.

Protection comes from four independent layers:

  • Database-level atomicity: Unique constraints, row locking, compare-and-swap
  • State machine guards: Current state validation before transitions
  • Transaction boundaries: All-or-nothing semantics
  • Application-level filtering: State-based deduplication

Each layer catches what others might miss. No single layer is responsible for all protection.

Origin: Processor UUID ownership was removed when analysis proved it provided redundant protection with harmful side effects (blocking recovery after crashes).

Lesson: Find the minimal set of protections that prevents corruption. Additional layers that prevent recovery are worse than none.


2. Event-Driven with Polling Fallback

Real-time responsiveness via PostgreSQL LISTEN/NOTIFY, with polling as a reliability backstop.

The system supports three deployment modes:

  • EventDrivenOnly: Lowest latency, relies on pg_notify
  • PollingOnly: Traditional polling, higher latency but simple
  • Hybrid (recommended): Event-driven primary, polling fallback

Events can be missed (network issues, connection drops). Polling ensures eventual consistency.

Origin: Event-driven task claiming was added for low-latency response while preserving reliability guarantees.
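
The sketch below illustrates the Hybrid mode as a race between a pg_notify wakeup and a polling tick. The channel name, interval, and claiming function are illustrative assumptions, not the actual implementation.

// Hypothetical sketch of Hybrid mode: wake on NOTIFY when possible, but
// never wait longer than the polling interval, so dropped notifications
// are still picked up eventually.
use sqlx::postgres::PgListener;
use std::time::Duration;

async fn hybrid_claim_loop(database_url: &str) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect(database_url).await?;
    listener.listen("task_ready").await?; // channel name is illustrative
    let mut poll = tokio::time::interval(Duration::from_secs(5));

    loop {
        tokio::select! {
            // Event-driven path: a pg_notify arrived.
            notification = listener.recv() => {
                let _notification = notification?;
                claim_ready_tasks().await;
            }
            // Polling fallback: fires even if notifications were missed.
            _ = poll.tick() => {
                claim_ready_tasks().await;
            }
        }
    }
}

async fn claim_ready_tasks() {
    // Placeholder for the actual claiming logic.
}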


3. Composition Over Inheritance

Mixins and traits for handler capabilities, not class hierarchies.

Handler capabilities are composed via mixins:

Not:
  class Handler < API

But:
  class Handler < Base
    include API
    include Decision
    include Batchable
  end

This pattern enables:

  • Selective capability inclusion
  • Clear separation of concerns
  • Easier testing of individual capabilities
  • No diamond inheritance problems
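
In the Rust worker, the same principle shows up as trait composition. The sketch below is illustrative only; the trait and handler names are hypothetical, not real Tasker types.

// Illustrative only: capabilities as small traits that a handler opts into,
// rather than a class hierarchy it is forced to inherit from.
trait ApiCapable {
    fn request_timeout_secs(&self) -> u64 {
        30
    }
}

trait Batchable {
    fn batch_size(&self) -> u32 {
        1000
    }
}

struct OrderHandler;

// OrderHandler composes exactly the capabilities it needs.
impl ApiCapable for OrderHandler {}
impl Batchable for OrderHandler {
    fn batch_size(&self) -> u32 {
        500
    }
}

fn main() {
    let handler = OrderHandler;
    println!(
        "timeout: {}s, batch size: {}",
        handler.request_timeout_secs(),
        handler.batch_size()
    );
}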

Origin: Analysis of cross-language handler harmonization revealed Batchable handlers already used composition. This was identified as the target architecture for all handlers.

See also: Composition Over Inheritance


4. Cross-Language Consistency

Unified developer-facing APIs across Rust, Ruby, Python, and TypeScript.

Consistent touchpoints include:

  • Handler signatures: call(context) pattern
  • Result factories: success(data) / failure(error, retry_on)
  • Registry APIs: register_handler(name, handler)
  • Specialized patterns: API, Decision, Batchable

Each language expresses these idiomatically while maintaining conceptual consistency.
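
The sketch below shows that conceptual shape in Rust; the type names and signatures are simplified assumptions, not the exact published API.

// Simplified sketch of the call(context) / success / failure shape.
use std::collections::HashMap;

struct StepContext {
    inputs: HashMap<String, String>,
}

enum StepResult {
    Success { data: HashMap<String, String> },
    Failure { error: String, retryable: bool },
}

trait StepHandler {
    fn call(&self, context: &StepContext) -> StepResult;
}

struct ValidateOrder;

impl StepHandler for ValidateOrder {
    fn call(&self, context: &StepContext) -> StepResult {
        match context.inputs.get("order_id") {
            Some(order_id) => StepResult::Success {
                data: HashMap::from([("validated_order_id".to_string(), order_id.clone())]),
            },
            None => StepResult::Failure {
                error: "missing order_id".to_string(),
                retryable: false,
            },
        }
    }
}

fn main() {
    let context = StepContext {
        inputs: HashMap::from([("order_id".to_string(), "42".to_string())]),
    };
    match ValidateOrder.call(&context) {
        StepResult::Success { data } => println!("ok: {data:?}"),
        StepResult::Failure { error, .. } => println!("failed: {error}"),
    }
}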

Origin: Cross-language API alignment established the “one obvious way” philosophy.

See also: Cross-Language Consistency


5. Actor-Based Decomposition

Lightweight actors for lifecycle management and clear boundaries.

Orchestration uses four core actors:

  1. TaskRequestActor: Task initialization
  2. ResultProcessorActor: Step result processing
  3. StepEnqueuerActor: Batch step enqueueing
  4. TaskFinalizerActor: Task completion

Worker uses five specialized actors:

  1. StepExecutorActor: Step execution coordination
  2. FFICompletionActor: FFI completion handling
  3. TemplateCacheActor: Template cache management
  4. DomainEventActor: Event dispatching
  5. WorkerStatusActor: Status and health

Each actor handles specific message types, enabling testability and clear ownership.

Origin: Actor-pattern refactoring broke monolithic processors (1,575 LOC) into focused files of roughly 150 LOC each.
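
The sketch below shows the general shape of such an actor: one owned mailbox, one message enum, one loop. The message and actor names are hypothetical, not the real ones.

// Minimal actor sketch: a task that owns a bounded receiver and handles
// one message type until asked to stop.
use tokio::sync::mpsc;

enum StepMessage {
    Execute { step_uuid: String },
    Shutdown,
}

async fn step_executor_actor(mut inbox: mpsc::Receiver<StepMessage>) {
    while let Some(message) = inbox.recv().await {
        match message {
            StepMessage::Execute { step_uuid } => {
                // The real actor would dispatch to a handler here.
                println!("executing step {step_uuid}");
            }
            StepMessage::Shutdown => break,
        }
    }
}

#[tokio::main]
async fn main() {
    // Bounded mailbox: backpressure applies when the actor falls behind.
    let (tx, rx) = mpsc::channel(64);
    let actor = tokio::spawn(step_executor_actor(rx));

    tx.send(StepMessage::Execute { step_uuid: "abc".into() }).await.unwrap();
    tx.send(StepMessage::Shutdown).await.unwrap();
    actor.await.unwrap();
}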


6. State Machine Rigor

Dual state machines (Task + Step) for atomic transitions with full audit trails.

Task states (12; main path shown): Pending → Initializing → EnqueuingSteps → StepsInProcess → EvaluatingResults → Complete/Error

Step states (8; main path shown): Pending → Enqueued → InProgress → Complete/Error

All transitions are:

  • Atomic (compare-and-swap at database level)
  • Audited (full history in transitions table)
  • Validated (state guards prevent invalid transitions)

Origin: Enhanced state machines with richer task states were introduced for better workflow visibility.
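
A minimal sketch of the compare-and-swap idea at the SQL level follows; the table and column names are illustrative, not the actual schema, and the audit-trail insert into the transitions table is omitted.

// Illustrative compare-and-swap: the UPDATE only succeeds if the row is
// still in the expected state, so concurrent writers cannot both win.
use sqlx::PgPool;

async fn transition_step(
    pool: &PgPool,
    step_uuid: &str,
    from_state: &str,
    to_state: &str,
) -> Result<bool, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE workflow_steps SET state = $1 WHERE step_uuid = $2 AND state = $3",
    )
    .bind(to_state)
    .bind(step_uuid)
    .bind(from_state)
    .execute(pool)
    .await?;

    // rows_affected() == 0 means another actor transitioned the step first;
    // the caller fails cleanly instead of corrupting state.
    Ok(result.rows_affected() == 1)
}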


7. Audit Before Enforce

Track for observability, don’t block for “ownership.”

Processor UUID is tracked in every transition for:

  • Debugging (which instance processed which step)
  • Audit trails (full history of processing)
  • Metrics (load distribution analysis)

But not enforced for:

  • Ownership claims (blocks recovery)
  • Permission checks (redundant with state guards)

Origin: Ownership enforcement removal proved that audit trails provide value without enforcement costs.

Key insight: When two actors receive identical messages, the first succeeds atomically and the second fails cleanly: no partial state, no corruption.


8. Pre-Alpha Freedom

Break things early to get architecture right.

In pre-alpha phase:

  • Breaking changes are encouraged when architecture is fundamentally unsound
  • No backward compatibility required for greenfield work
  • Migration debt is cheaper than technical debt
  • “Perfect” is the enemy of “architecturally sound”

This freedom enables:

  • Rapid iteration on core patterns
  • Learning from real implementation
  • Correcting course before users depend on specifics

Origin: All major refactoring efforts made breaking changes that improved architecture fundamentally.


9. PostgreSQL as Foundation

Database-level guarantees with flexible messaging (PGMQ default, RabbitMQ optional).

PostgreSQL provides:

  • State storage: Task and step state with transactional guarantees
  • Advisory locks: Distributed coordination primitives
  • Atomic functions: State transitions in single round-trip
  • Row-level locking: Prevents concurrent modification

Messaging is provider-agnostic:

  • PGMQ (default): Message queue built on PostgreSQL—single-dependency deployment
  • RabbitMQ (optional): For high-throughput or existing broker infrastructure

The database is not just storage—it’s the coordination layer. Message delivery is pluggable.

Origin: A foundational architecture decision. PostgreSQL’s transactional guarantees eliminate entire classes of distributed systems problems; the messaging abstraction was added for deployment flexibility.


10. Bounded Resources

All channels bounded, backpressure everywhere.

Every MPSC channel is:

  • Bounded: Fixed capacity, no unbounded memory growth
  • Configurable: Sizes set via TOML configuration
  • Monitored: Backpressure metrics exposed

Semaphores limit concurrent handler execution. Circuit breakers protect downstream services.

Origin: Bounded MPSC channels were mandated after analysis of unbounded channel risks.

Rule: Never use unbounded_channel(). Always configure bounds via TOML.
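
A short sketch of the rule in practice (the capacity value is hard-coded here for illustration; in Tasker it comes from TOML configuration):

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Wrong: mpsc::unbounded_channel() can grow without limit under load.
    // let (tx, rx) = mpsc::unbounded_channel::<String>();

    // Right: a bounded channel; send() awaits when the buffer is full,
    // which is exactly the backpressure this tenet asks for.
    let capacity = 1024; // in practice, loaded from TOML configuration
    let (tx, mut rx) = mpsc::channel::<String>(capacity);

    tx.send("step result".to_string()).await.unwrap();
    if let Some(message) = rx.recv().await {
        println!("received: {message}");
    }
}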


11. Fail Loudly

A system that lies is worse than one that fails. Errors are first-class citizens, not inconveniences to hide.

When data is missing, malformed, or unexpected:

  • Return errors, not fabricated defaults
  • Propagate failures up the call stack
  • Make problems visible immediately, not downstream
  • Trust nothing that hasn’t been validated

Silent defaults create phantom data—values that look valid but represent nothing real. A monitoring system that receives 0% utilization cannot distinguish “system is idle” from “data was missing.”

What this means in practice:

| Scenario | Wrong Approach | Right Approach |
|---|---|---|
| gRPC response missing field | Return default value | Return InvalidResponse error |
| Config section absent | Use empty/zero defaults | Fail with clear message |
| Health check data missing | Fabricate “unknown” status | Error: “health data unavailable” |
| Optional vs Required | Treat all as optional | Distinguish explicitly in types |

The trust equation:

A client that returns fabricated data
  = A client that lies to you
  = Worse than a client that fails loudly
  = Debugging phantom bugs in production

Origin: gRPC client refactoring revealed pervasive unwrap_or_default() patterns that silently fabricated response data. Analysis showed consumers could receive “valid-looking” responses containing entirely phantom data, breaking the trust contract between client and caller.

Key insight: When a gRPC server omits required fields, that’s a protocol violation—not an opportunity to be “helpful” with defaults. The server is broken; pretending otherwise delays the fix and misleads operators.

Rule: Never use unwrap_or_default() or unwrap_or_else(|| fabricated_value) for required fields. Use ok_or_else(|| ClientError::invalid_response(...)) instead.
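
A hedged sketch of the rule; the error and response types are illustrative, not the real client code:

// Illustrative: required fields on a decoded response must produce errors,
// never fabricated defaults.
#[derive(Debug)]
enum ClientError {
    InvalidResponse(String),
}

struct HealthResponse {
    utilization_percent: Option<f64>, // optional in the wire format
}

fn utilization(response: &HealthResponse) -> Result<f64, ClientError> {
    // Wrong: response.utilization_percent.unwrap_or_default() would turn
    // "data was missing" into a plausible-looking 0.0.
    response
        .utilization_percent
        .ok_or_else(|| ClientError::InvalidResponse("utilization_percent missing".into()))
}

fn main() {
    let response = HealthResponse { utilization_percent: None };
    // The caller sees the protocol violation immediately.
    println!("{:?}", utilization(&response));
}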


Meta-Principles

These overarching themes emerge from the tenets:

  1. Simplicity Over Elegance: The minimal protection set that prevents corruption beats layered defense that prevents recovery

  2. Observation-Driven Design: Let real behavior (parallel execution, edge cases) guide architecture

  3. Explicit Over Implicit: Make boundaries, layers, and decisions visible in documentation and code

  4. Consistency Without Uniformity: Align APIs while preserving language idioms

  5. Separation of Concerns: Orchestration handles state and coordination; workers handle execution and domain events

  6. Errors Over Defaults: When in doubt, fail with a clear error rather than proceeding with fabricated data


Applying These Tenets

When making design decisions:

  1. Check against tenets: Does this violate any of the 11 tenets?
  2. Find the precedent: Has a similar decision been made before? (See ticket-specs)
  3. Document the trade-off: What are you gaining and giving up?
  4. Consider recovery: If this fails, how does the system recover?

When reviewing code:

  1. Bounded resources: Are all channels bounded? All concurrency limited?
  2. State machine compliance: Do transitions use atomic database operations?
  3. Language consistency: Does the API align with other language workers?
  4. Composition pattern: Are capabilities mixed in rather than inherited?
  5. Fail loudly: Are missing/invalid data handled with errors, not silent defaults?

Twelve-Factor App Alignment

The Twelve-Factor App methodology, authored by Adam Wiggins and contributors at Heroku, has been a foundational influence on Tasker Core’s systems design. These principles were not adopted as a checklist but absorbed over years of building production systems. Some factors are deeply embedded in the architecture; others remain aspirational or partially realized.

This document maps each factor to where it shows up in the codebase, where we fall short, and what contributors should keep in mind. It is meant as practical guidance, not a compliance scorecard.


I. Codebase

One codebase tracked in revision control, many deploys.

Tasker Core is a single Git monorepo containing all deployable services: orchestration server, workers (Rust, Ruby, Python, TypeScript), CLI, and shared libraries.

Where this lives:

  • Root Cargo.toml defines the workspace with all crate members
  • Environment-specific Docker Compose files produce different deploys from the same source: docker/docker-compose.prod.yml, docker/docker-compose.dev.yml, docker/docker-compose.test.yml, docker/docker-compose.ci.yml
  • Feature flags (web-api, grpc-api, test-services, test-cluster) control build variations without code branches

Gaps: The monorepo means all crates share a single version today (v0.1.0). As the project matures toward publishing crates independently, version coordination and release-management tooling will need to evolve.


II. Dependencies

Explicitly declare and isolate dependencies.

Rust’s Cargo ecosystem makes this natural. All dependencies are declared in Cargo.toml with workspace-level management and pinned in Cargo.lock.

Where this lives:

  • Root Cargo.toml [workspace.dependencies] section — single source of truth for shared dependency versions
  • Cargo.lock committed to the repository for reproducible builds
  • Multi-stage Docker builds (docker/build/orchestration.prod.Dockerfile) use cargo-chef for cached, reproducible dependency resolution
  • No runtime dependency fetching — everything resolved at build time

Gaps: FFI workers each bring their own dependency ecosystem (Python’s uv/pyproject.toml, Ruby’s Bundler/Gemfile, TypeScript’s bun/package.json). These are well-declared but not unified — contributors working across languages need to manage multiple lock files.


III. Config

Store config in the environment.

This is one of the strongest alignments. All runtime configuration flows through environment variables, with TOML files providing structured defaults that reference those variables.

Where this lives:

  • config/dotenv/ — environment-specific .env files (base.env, test.env, orchestration.env)
  • config/tasker/base/*.toml — role-based defaults with ${ENV_VAR:-default} interpolation
  • config/tasker/environments/{test,development,production}/ — environment overrides
  • docker/.env.prod.template — production variable template
  • tasker-shared/src/config/ — config loading with environment variable resolution
  • No secrets in source: DATABASE_URL, POSTGRES_PASSWORD, JWT keys all via environment

For contributors: Never hard-code connection strings, credentials, or deployment-specific values. Use environment variables with sensible defaults in the TOML layer. The configuration structure is role-based (orchestration/worker/common), not component-based — see CLAUDE.md for details.
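
A minimal Rust sketch of the env-with-default idea behind the ${ENV_VAR:-default} interpolation (not the actual config loader; the helper name is hypothetical):

use std::env;

// Illustrative helper mirroring the ${VAR:-default} pattern used in the
// TOML layer: read from the environment, fall back to a sensible default.
fn env_or_default(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

fn main() {
    let bind_address = env_or_default("TASKER_WEB_BIND_ADDRESS", "0.0.0.0:8080");
    println!("binding to {bind_address}");
}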


IV. Backing Services

Treat backing services as attached resources.

Backing services are abstracted behind trait interfaces and swappable via configuration alone.

Where this lives:

  • Database: PostgreSQL connection via DATABASE_URL, pool settings in config/tasker/base/common.toml under [common.database.pool]
  • Messaging: PGMQ or RabbitMQ selected via TASKER_MESSAGING_BACKEND environment variable — same code paths, different drivers
  • Cache: Redis, Moka (in-process), or disabled entirely via [common.cache] configuration
  • Observability: OpenTelemetry with pluggable backends (Honeycomb, Jaeger, Grafana Tempo) via OTEL_EXPORTER_OTLP_ENDPOINT
  • Circuit breakers protect against backing service failures: [common.circuit_breakers.component_configs]

For contributors: When adding a new backing service dependency, ensure it can be configured via environment variables and that the system degrades gracefully when it’s unavailable. Follow the messaging abstraction pattern — trait-based interfaces, not concrete types.
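
The sketch below shows the general shape of that pattern; the trait, type, and queue names are hypothetical, and the real interface is async rather than synchronous.

// Hypothetical sketch: the backing service sits behind a trait, so switching
// providers is a configuration decision, not a code change.
trait MessageQueue {
    fn enqueue(&self, queue_name: &str, payload: &str) -> Result<(), String>;
}

struct PgmqBackend;
struct RabbitMqBackend;

impl MessageQueue for PgmqBackend {
    fn enqueue(&self, queue_name: &str, _payload: &str) -> Result<(), String> {
        println!("PGMQ enqueue to {queue_name}");
        Ok(())
    }
}

impl MessageQueue for RabbitMqBackend {
    fn enqueue(&self, queue_name: &str, _payload: &str) -> Result<(), String> {
        println!("RabbitMQ enqueue to {queue_name}");
        Ok(())
    }
}

fn select_backend(name: &str) -> Box<dyn MessageQueue> {
    match name {
        "rabbitmq" => Box::new(RabbitMqBackend),
        _ => Box::new(PgmqBackend), // PGMQ is the default
    }
}

fn main() {
    let backend_name =
        std::env::var("TASKER_MESSAGING_BACKEND").unwrap_or_else(|_| "pgmq".to_string());
    let queue = select_backend(&backend_name);
    queue.enqueue("step_results", "{}").unwrap();
}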


V. Build, Release, Run

Strictly separate build and run stages.

The Docker build pipeline enforces this cleanly with multi-stage builds.

Where this lives:

  • Build: docker/build/orchestration.prod.Dockerfilecargo-chef dependency caching, cargo build --release --all-features --locked, binary stripping
  • Release: Tagged Docker images with only runtime dependencies (no build tools), non-root user (tasker:999), read-only config mounts
  • Run: docker/scripts/orchestration-entrypoint.sh — environment validation, database availability check, migrations, then exec into the application binary
  • Deployment modes control startup behavior: standard, migrate-only, no-migrate, safe, emergency

Gaps: Local development doesn’t enforce the same separation — developers run cargo run directly, which conflates build and run. This is fine for development ergonomics but worth noting as a difference from the production path.


VI. Processes

Execute the app as one or more stateless processes.

All persistent state lives in PostgreSQL. Processes can be killed and restarted at any time without data loss.

Where this lives:

  • Orchestration server: stateless HTTP/gRPC service backed by tasker.tasks and tasker.steps tables
  • Workers: claim steps from message queues, execute handlers, write results back — no in-memory state across requests
  • Message queue visibility timeouts (visibility_timeout_seconds in worker config) ensure unacknowledged messages are reclaimed by other workers
  • Docker Compose replicas setting scales workers horizontally

For contributors: Never store workflow state in memory across requests. If you need coordination state, it belongs in PostgreSQL. In-memory caches (Moka) are optimization layers, not sources of truth — the system must function correctly without them.


VII. Port Binding

Export services via port binding.

Each service is self-contained and binds its own ports.

Where this lives:

  • REST: config/tasker/base/orchestration.toml[orchestration.web] bind_address = "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
  • gRPC: [orchestration.grpc] bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
  • Worker REST/gRPC on separate ports (8081/9191)
  • Health endpoints on both transports for load balancer integration
  • Docker exposes ports via environment-configurable mappings

VIII. Concurrency

Scale out via the process model.

The system scales horizontally by adding worker processes and vertically by tuning concurrency settings.

Where this lives:

  • Horizontal: docker/docker-compose.prod.ymlreplicas: ${WORKER_REPLICAS:-2}, each worker is independent
  • Vertical: config/tasker/base/orchestration.tomlmax_concurrent_operations, batch_size per event system
  • Worker handler parallelism: [worker.mpsc_channels.handler_dispatch] max_concurrent_handlers = 10
  • Load shedding: [worker.mpsc_channels.handler_dispatch.load_shedding] capacity_threshold_percent = 80.0

Gaps: The actor pattern within a single process is more vertical than horizontal — actors share a Tokio runtime and scale via async concurrency, not OS processes. This is a pragmatic choice for Rust’s async model but means single-process scaling has limits that multiple processes solve.


IX. Disposability

Maximize robustness with fast startup and graceful shutdown.

This factor gets significant attention due to the distributed nature of task orchestration.

Where this lives:

  • Graceful shutdown: Signal handlers (SIGTERM, SIGINT) in tasker-orchestration/src/bin/server.rs and tasker-worker/src/bin/ — actors drain in-flight work, OpenTelemetry flushes spans, connections close cleanly
  • Fast startup: Compiled binary, pooled database connections, environment-driven config (no service discovery delays)
  • Crash recovery: PGMQ visibility timeouts requeue unacknowledged messages; steps claimed by a crashed worker reappear for others after visibility_timeout_seconds
  • Entrypoint: docker/scripts/orchestration-entrypoint.sh uses exec to replace shell with app process (proper PID 1 signal handling)
  • Health checks: Docker start_period allows grace time before liveness probes begin

For contributors: When adding new async subsystems, ensure they participate in the shutdown sequence. Bounded channels and drain timeouts (shutdown_drain_timeout_ms) prevent shutdown from hanging indefinitely.
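
A hedged sketch of that shutdown shape using tokio signal handling; the drain logic and timeout value are simplified placeholders, not the actual server code.

use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut sigterm = signal(SignalKind::terminate())?;
    let (shutdown_tx, mut shutdown_rx) = mpsc::channel::<()>(1);

    // Worker loop: does work until asked to drain.
    let worker = tokio::spawn(async move {
        loop {
            tokio::select! {
                _ = shutdown_rx.recv() => {
                    // Drain in-flight work here before exiting.
                    break;
                }
                _ = tokio::time::sleep(Duration::from_millis(100)) => {
                    // ... process the next message ...
                }
            }
        }
    });

    // Wait for SIGTERM or Ctrl-C, then ask the worker to drain.
    tokio::select! {
        _ = sigterm.recv() => {}
        _ = tokio::signal::ctrl_c() => {}
    }
    let _ = shutdown_tx.send(()).await;

    // Bound the drain so shutdown cannot hang indefinitely.
    let _ = tokio::time::timeout(Duration::from_secs(10), worker).await;
    Ok(())
}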


X. Dev/Prod Parity

Keep development, staging, and production as similar as possible.

The same code, same migrations, and same config structure run everywhere — only values change.

Where this lives:

  • config/tasker/base/ provides defaults; config/tasker/environments/ overrides per-environment — structure is identical
  • migrations/ directory contains SQL migrations shared across all environments
  • Docker images use the same base (debian:bullseye-slim) and runtime user (tasker:999)
  • Structured logging format (tracing crate) is consistent; only verbosity changes (RUST_LOG)
  • E2E tests (--features test-services) exercise the same code paths as production

Gaps: Development uses cargo run with debug builds while production uses release-optimized Docker images. The observability stack (Grafana LGTM) is available in docker-compose.dev.yml but most local development happens without it. These are standard trade-offs, but contributors should periodically test against the full Docker stack to catch environment-specific issues.


XI. Logs

Treat logs as event streams.

All logging goes to stdout/stderr. No file-based logging is built into the application.

Where this lives:

  • tasker-shared/src/logging.rs — tracing subscriber writes to stdout, JSON format in production, ANSI colors in development (TTY-detected)
  • OpenTelemetry integration exports structured traces via OTEL_EXPORTER_OTLP_ENDPOINT
  • Correlation IDs (correlation_id) propagate through tasks, steps, actors, and message queues for distributed tracing
  • docker-compose.dev.yml includes Loki for log aggregation and Grafana for visualization
  • Entrypoint scripts log to stdout/stderr with role-prefixed format

For contributors: Use the tracing crate’s #[instrument] macro and structured fields (tracing::info!(task_id = %id, "processing")) rather than string interpolation. Never write to log files directly.
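
For example, a minimal sketch of that style (the function and field names are illustrative):

use tracing::{info, instrument};

#[instrument(skip(payload))]
fn process_step(step_uuid: &str, payload: &str) {
    // Structured fields, not string interpolation: these become queryable
    // attributes in the exported event stream.
    info!(step_uuid = %step_uuid, payload_bytes = payload.len(), "processing step");
}

fn main() {
    // Subscriber writes structured JSON events to stdout
    // (requires tracing_subscriber's `json` feature).
    tracing_subscriber::fmt().json().init();
    process_step("019ab6f9-7a2a", "{\"order_id\": 42}");
}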


XII. Admin Processes

Run admin/management tasks as one-off processes.

The CLI and deployment scripts serve this role.

Where this lives:

  • tasker-ctl/ — task management (create, list, cancel), DLQ investigation (dlq list, dlq recover), system health, auth token management
  • docker/scripts/orchestration-entrypoint.shDEPLOYMENT_MODE=migrate-only runs migrations and exits without starting the server
  • config-validator binary validates TOML configuration as a one-off check
  • Database migrations run as a distinct phase before application startup, with retry logic and timeout protection

Gaps: Some administrative operations (cache invalidation, circuit breaker reset) are only available through the REST/gRPC API, not the CLI. As the CLI matures, these should become first-class admin commands.


Using This as a Contributor

These factors are not rules to enforce mechanically. They’re a lens for evaluating design decisions:

  • Adding a new service dependency? Factor IV says treat it as an attached resource — configure via environment, degrade gracefully without it.
  • Storing state? Factor VI says processes are stateless — put it in PostgreSQL, not in memory.
  • Adding configuration? Factor III says environment variables — use the existing TOML-with-env-var-interpolation pattern.
  • Writing logs? Factor XI says event streams — stdout, structured fields, correlation IDs.
  • Building deployment artifacts? Factor V says separate build/release/run — don’t bake configuration into images.

When a factor conflicts with practical needs, document the trade-off. The goal is not purity but awareness.


Attribution

The Twelve-Factor App methodology was created by Adam Wiggins with contributions from many others, originally published at 12factor.net. It is made available under the MIT License and has influenced how a generation of developers think about building software-as-a-service applications. Its influence on this project is gratefully acknowledged.

PEP: 20
Title: The Zen of Python
Author: Tim Peters <tim.peters@gmail.com>
Status: Active
Type: Informational
Created: 19-Aug-2004
Post-History: 22-Aug-2004

Abstract

Long time Pythoneer Tim Peters succinctly channels the BDFL’s guiding principles for Python’s design into 20 aphorisms, only 19 of which have been written down.

The Zen of Python


Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Easter Egg


import this

References

Originally posted to comp.lang.python / python-list@python.org in a thread called “The Way of Python”: https://groups.google.com/d/msg/comp.lang.python/B_VxeTBClM0/L8W9KlsiriUJ

Copyright

This document has been placed in the public domain.

Tasker Core Reference

This directory contains technical reference documentation with precise specifications and implementation details.

Documents

| Document | Description |
|---|---|
| StepContext API | Cross-language API reference for step handlers |
| Table Management | Database table structure and management |
| Task and Step Readiness | SQL functions and execution logic |
| sccache Configuration | Build caching setup |
| Library Deployment Patterns | Library distribution strategies |
| FFI Telemetry Pattern | Cross-language telemetry integration |

When to Read These

  • Need exact behavior: Consult these for precise specifications
  • Debugging edge cases: Check implementation details
  • Database operations: See Table Management and SQL functions
  • Build optimization: Review sccache Configuration

FFI Boundary Types Reference

Cross-language type harmonization for Rust, Python, and TypeScript boundaries.

This document defines the canonical FFI boundary types that cross the Rust orchestration layer and the Python/TypeScript worker implementations. These types are critical for correct serialization/deserialization between languages.

Overview

The tasker-core system uses FFI (Foreign Function Interface) to integrate Rust orchestration with Python and TypeScript step handlers. Data crosses this boundary via JSON serialization. These types must remain consistent across all three languages.

Source of Truth: Rust types in tasker-shared/src/messaging/execution_types.rs and tasker-shared/src/models/core/batch_worker.rs.

Type Mapping

| Rust Type | Python Type | TypeScript Type |
|---|---|---|
| CursorConfig | RustCursorConfig | RustCursorConfig |
| BatchProcessingOutcome | BatchProcessingOutcome | BatchProcessingOutcome |
| BatchWorkerInputs | RustBatchWorkerInputs | RustBatchWorkerInputs |
| BatchMetadata | BatchMetadata | BatchMetadata |
| FailureStrategy | FailureStrategy | FailureStrategy |

CursorConfig

Cursor configuration for a single batch’s position and range.

Flexible Cursor Types

Unlike simple integer cursors, RustCursorConfig supports flexible cursor values:

  • Integer for record IDs: 123
  • String for timestamps: "2025-11-01T00:00:00Z"
  • Object for composite keys: {"page": 1, "offset": 0}

This enables cursor-based pagination across diverse data sources.

Rust Definition

#![allow(unused)]
fn main() {
// tasker-shared/src/messaging/execution_types.rs
pub struct CursorConfig {
    pub batch_id: String,
    pub start_cursor: serde_json::Value,  // Flexible type
    pub end_cursor: serde_json::Value,    // Flexible type
    pub batch_size: u32,
}
}

TypeScript Definition

// workers/typescript/src/types/batch.ts
export interface RustCursorConfig {
  batch_id: string;
  start_cursor: unknown;  // Flexible: number | string | object
  end_cursor: unknown;
  batch_size: number;
}

Python Definition

# workers/python/python/tasker_core/types.py
from typing import Any

from pydantic import BaseModel

class RustCursorConfig(BaseModel):
    batch_id: str
    start_cursor: Any  # Flexible: int | str | dict
    end_cursor: Any
    batch_size: int

JSON Wire Format

{
  "batch_id": "batch_001",
  "start_cursor": 0,
  "end_cursor": 1000,
  "batch_size": 1000
}

BatchProcessingOutcome

Discriminated union representing the outcome of a batchable step.

Rust Definition

#![allow(unused)]
fn main() {
// tasker-shared/src/messaging/execution_types.rs
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BatchProcessingOutcome {
    NoBatches,
    CreateBatches {
        worker_template_name: String,
        worker_count: u32,
        cursor_configs: Vec<CursorConfig>,
        total_items: u64,
    },
}
}

TypeScript Definition

// workers/typescript/src/types/batch.ts
export interface NoBatchesOutcome {
  type: 'no_batches';
}

export interface CreateBatchesOutcome {
  type: 'create_batches';
  worker_template_name: string;
  worker_count: number;
  cursor_configs: RustCursorConfig[];
  total_items: number;
}

export type BatchProcessingOutcome = NoBatchesOutcome | CreateBatchesOutcome;

Python Definition

# workers/python/python/tasker_core/types.py
class NoBatchesOutcome(BaseModel):
    type: str = "no_batches"

class CreateBatchesOutcome(BaseModel):
    type: str = "create_batches"
    worker_template_name: str
    worker_count: int
    cursor_configs: list[RustCursorConfig]
    total_items: int

BatchProcessingOutcome = NoBatchesOutcome | CreateBatchesOutcome

JSON Wire Formats

NoBatches:

{
  "type": "no_batches"
}

CreateBatches:

{
  "type": "create_batches",
  "worker_template_name": "batch_worker_template",
  "worker_count": 5,
  "cursor_configs": [
    { "batch_id": "001", "start_cursor": 0, "end_cursor": 1000, "batch_size": 1000 },
    { "batch_id": "002", "start_cursor": 1000, "end_cursor": 2000, "batch_size": 1000 }
  ],
  "total_items": 5000
}

BatchWorkerInputs

Initialization inputs for batch worker instances, stored in workflow_steps.inputs.

Rust Definition

#![allow(unused)]
fn main() {
// tasker-shared/src/models/core/batch_worker.rs
pub struct BatchWorkerInputs {
    pub cursor: CursorConfig,
    pub batch_metadata: BatchMetadata,
    pub is_no_op: bool,
}

pub struct BatchMetadata {
    // checkpoint_interval removed - handlers decide when to checkpoint
    pub cursor_field: String,
    pub failure_strategy: FailureStrategy,
}

pub enum FailureStrategy {
    ContinueOnFailure,
    FailFast,
    Isolate,
}
}

TypeScript Definition

// workers/typescript/src/types/batch.ts
export type FailureStrategy = 'continue_on_failure' | 'fail_fast' | 'isolate';

export interface BatchMetadata {
  // checkpoint_interval removed - handlers decide when to checkpoint
  cursor_field: string;
  failure_strategy: FailureStrategy;
}

export interface RustBatchWorkerInputs {
  cursor: RustCursorConfig;
  batch_metadata: BatchMetadata;
  is_no_op: boolean;
}

Python Definition

# workers/python/python/tasker_core/types.py
class FailureStrategy(str, Enum):
    CONTINUE_ON_FAILURE = "continue_on_failure"
    FAIL_FAST = "fail_fast"
    ISOLATE = "isolate"

class BatchMetadata(BaseModel):
    # checkpoint_interval removed - handlers decide when to checkpoint
    cursor_field: str
    failure_strategy: FailureStrategy

class RustBatchWorkerInputs(BaseModel):
    cursor: RustCursorConfig
    batch_metadata: BatchMetadata
    is_no_op: bool

JSON Wire Format

{
  "cursor": {
    "batch_id": "batch_001",
    "start_cursor": 0,
    "end_cursor": 1000,
    "batch_size": 1000
  },
  "batch_metadata": {
    "cursor_field": "id",
    "failure_strategy": "continue_on_failure"
  },
  "is_no_op": false
}

BatchAggregationResult

Standardized result from aggregating multiple batch worker results.

Cross-Language Standard

All three languages produce identical aggregation results:

| Field | Type | Description |
|---|---|---|
| total_processed | int | Items processed across all batches |
| total_succeeded | int | Items that succeeded |
| total_failed | int | Items that failed |
| total_skipped | int | Items that were skipped |
| batch_count | int | Number of batch workers that ran |
| success_rate | float | Success rate (0.0 to 1.0) |
| errors | array | Collected errors (limited to 100) |
| error_count | int | Total error count |

Usage Examples

TypeScript:

import { aggregateBatchResults } from 'tasker-core';

const workerResults = Object.values(context.previousResults)
  .filter(r => r?.batch_worker);
const summary = aggregateBatchResults(workerResults);
return this.success(summary);

Python:

from tasker_core.types import aggregate_batch_results

worker_results = [
    context.get_dependency_result(f"worker_{i}")
    for i in range(batch_count)
]
summary = aggregate_batch_results(worker_results)
return self.success(summary.model_dump())

Factory Functions

Creating BatchProcessingOutcome

TypeScript:

import { noBatches, createBatches, RustCursorConfig } from 'tasker-core';

// No batches needed
const outcome1 = noBatches();

// Create batch workers
const configs: RustCursorConfig[] = [
  { batch_id: '001', start_cursor: 0, end_cursor: 1000, batch_size: 1000 },
  { batch_id: '002', start_cursor: 1000, end_cursor: 2000, batch_size: 1000 },
];
const outcome2 = createBatches('process_batch', 2, configs, 2000);

Python:

from tasker_core.types import no_batches, create_batches, RustCursorConfig

# No batches needed
outcome1 = no_batches()

# Create batch workers
configs = [
    RustCursorConfig(batch_id="001", start_cursor=0, end_cursor=1000, batch_size=1000),
    RustCursorConfig(batch_id="002", start_cursor=1000, end_cursor=2000, batch_size=1000),
]
outcome2 = create_batches("process_batch", 2, configs, 2000)

Type Guards (TypeScript)

import {
  BatchProcessingOutcome,
  isNoBatches,
  isCreateBatches
} from 'tasker-core';

function handleOutcome(outcome: BatchProcessingOutcome): void {
  if (isNoBatches(outcome)) {
    console.log('No batches needed');
    return;
  }

  if (isCreateBatches(outcome)) {
    console.log(`Creating ${outcome.worker_count} workers`);
    console.log(`Total items: ${outcome.total_items}`);
  }
}

Migration Notes

From Legacy Types

If migrating from older batch processing types:

  1. CursorConfig → RustCursorConfig: The new type adds a batch_id field and uses flexible cursor types (unknown/Any) instead of fixed number/int.

  2. Inline batch_processing_outcomeBatchProcessingOutcome: Use the discriminated union type with factory functions instead of building JSON manually.

  3. Manual aggregationaggregateBatchResults: Use the standardized aggregation function for consistent cross-language behavior.

Backwards Compatibility

The legacy CursorConfig type (with number/int cursors) is preserved for simple use cases. Use RustCursorConfig when:

  • Working with Rust orchestration inputs
  • Needing flexible cursor types (timestamps, UUIDs, composites)
  • Building BatchProcessingOutcome structures

FFI Telemetry Initialization Pattern

Overview

This document describes the two-phase telemetry initialization pattern for Foreign Function Interface (FFI) integrations where Rust code is called from languages that don’t have a Tokio runtime during initialization (Ruby, Python, WASM).

The Problem

OpenTelemetry batch exporter requires a Tokio runtime context for async I/O operations:

#![allow(unused)]
fn main() {
// This PANICS if called outside a Tokio runtime
let tracer_provider = SdkTracerProvider::builder()
    .with_batch_exporter(exporter)  // ❌ Requires Tokio runtime
    .with_resource(resource)
    .with_sampler(sampler)
    .build();
}

FFI Initialization Timeline:

1. Language Runtime Loads Extension (Ruby, Python, WASM)
   ↓ No Tokio runtime exists yet
2. Extension Init Function Called (Magnus init, PyO3 init, etc.)
   ↓ Logging needed for debugging, but no async runtime
3. Later: Create Tokio Runtime
   ↓ Now safe to initialize telemetry
4. Bootstrap Worker System

The Solution: Two-Phase Initialization

Phase 1: Console-Only Logging (FFI-Safe)

During language extension initialization, use console-only logging that requires no Tokio runtime:

#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs
pub fn init_console_only() {
    // Initialize console logging without OpenTelemetry
    // Safe to call from any thread, no async runtime required
}
}

When to use:

  • During Magnus initialization (Ruby)
  • During PyO3 initialization (Python)
  • During WASM module initialization
  • Any context where no Tokio runtime exists

Phase 2: Full Telemetry (Tokio Context)

After creating the Tokio runtime, initialize full telemetry including OpenTelemetry:

#![allow(unused)]
fn main() {
// Create Tokio runtime
let runtime = tokio::runtime::Runtime::new()?;

// Initialize telemetry in runtime context
runtime.block_on(async {
    tasker_shared::logging::init_tracing();
});
}

When to use:

  • After creating Tokio runtime in bootstrap
  • Inside runtime.block_on() context
  • When async I/O is available

Implementation Guide

Ruby FFI (Magnus)

File Structure:

  • workers/ruby/ext/tasker_core/src/ffi_logging.rs - Phase 1
  • workers/ruby/ext/tasker_core/src/bootstrap.rs - Phase 2

Phase 1: Magnus Initialization

#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/ffi_logging.rs

pub fn init_ffi_logger() -> Result<(), Box<dyn std::error::Error>> {
    // Check if telemetry is enabled
    let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
        .map(|v| v.to_lowercase() == "true")
        .unwrap_or(false);

    if telemetry_enabled {
        // Phase 1: Defer telemetry init to runtime context
        println!("Telemetry enabled - deferring logging init to runtime context");
    } else {
        // Phase 1: Safe to initialize console-only logging
        tasker_shared::logging::init_console_only();
        tasker_shared::log_ffi!(
            info,
            "FFI console logging initialized (no telemetry)",
            component: "ffi_boundary"
        );
    }

    Ok(())
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/bootstrap.rs

pub fn bootstrap_worker() -> Result<Value, Error> {
    // Create tokio runtime
    let runtime = tokio::runtime::Runtime::new()?;

    // Phase 2: Initialize telemetry in Tokio runtime context
    runtime.block_on(async {
        tasker_shared::logging::init_tracing();
    });

    // Continue with bootstrap...
    let system_context = runtime.block_on(async {
        SystemContext::new_for_worker().await
    })?;

    // ... rest of bootstrap
}
}

Python FFI (PyO3)

Phase 1: PyO3 Module Initialization

#![allow(unused)]
fn main() {
// workers/python/src/lib.rs

#[pymodule]
fn tasker_core(py: Python, m: &PyModule) -> PyResult<()> {
    // Check if telemetry is enabled
    let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
        .map(|v| v.to_lowercase() == "true")
        .unwrap_or(false);

    if telemetry_enabled {
        println!("Telemetry enabled - deferring logging init to runtime context");
    } else {
        tasker_shared::logging::init_console_only();
    }

    // Register Python functions...
    m.add_function(wrap_pyfunction!(bootstrap_worker, m)?)?;
    Ok(())
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/python/src/bootstrap.rs

#[pyfunction]
pub fn bootstrap_worker() -> PyResult<String> {
    // Create tokio runtime
    let runtime = tokio::runtime::Runtime::new()
        .map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(
            format!("Failed to create runtime: {}", e)
        ))?;

    // Phase 2: Initialize telemetry in Tokio runtime context
    runtime.block_on(async {
        tasker_shared::logging::init_tracing();
    });

    // Continue with bootstrap...
    let system_context = runtime.block_on(async {
        SystemContext::new_for_worker().await
    })?;

    // ... rest of bootstrap
}
}

WASM FFI

Phase 1: WASM Module Initialization

#![allow(unused)]
fn main() {
// workers/wasm/src/lib.rs

#[wasm_bindgen(start)]
pub fn init_wasm() {
    // Check if telemetry is enabled (from JS environment)
    let telemetry_enabled = js_sys::Reflect::get(
        &js_sys::global(),
        &"TELEMETRY_ENABLED".into()
    ).ok()
    .and_then(|v| v.as_bool())
    .unwrap_or(false);

    if telemetry_enabled {
        web_sys::console::log_1(&"Telemetry enabled - deferring logging init to runtime context".into());
    } else {
        tasker_shared::logging::init_console_only();
    }
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/wasm/src/bootstrap.rs

#[wasm_bindgen]
pub async fn bootstrap_worker() -> Result<JsValue, JsValue> {
    // In WASM, we're already in an async context
    // Initialize telemetry directly
    tasker_shared::logging::init_tracing();

    // Continue with bootstrap...
    let system_context = SystemContext::new_for_worker().await
        .map_err(|e| JsValue::from_str(&format!("Bootstrap failed: {}", e)))?;

    // ... rest of bootstrap
}
}

Docker Configuration

Enable telemetry in docker-compose with appropriate comments:

# docker/docker-compose.test.yml

ruby-worker:
  environment:
    # Two-phase FFI telemetry initialization pattern
    # Phase 1: Magnus init skips telemetry (no runtime)
    # Phase 2: bootstrap_worker() initializes telemetry in Tokio context
    TELEMETRY_ENABLED: "true"
    OTEL_EXPORTER_OTLP_ENDPOINT: http://observability:4317
    OTEL_SERVICE_NAME: tasker-ruby-worker
    OTEL_SERVICE_VERSION: "0.1.0"

Verification

Expected Log Sequence

Ruby Worker with Telemetry Enabled:

1. Magnus init:
Telemetry enabled - deferring logging init to runtime context

2. After runtime creation:
Console logging with OpenTelemetry initialized
  environment=test
  opentelemetry_enabled=true
  otlp_endpoint=http://observability:4317
  service_name=tasker-ruby-worker

3. OpenTelemetry components:
Global meter provider is set
OpenTelemetry Prometheus text exporter initialized

Ruby Worker with Telemetry Disabled:

1. Magnus init:
Console-only logging initialized (FFI-safe mode)
  environment=test
  opentelemetry_enabled=false
  context=ffi_initialization

2. After runtime creation:
(No additional initialization - already complete)

Health Check

All workers should be healthy with telemetry enabled:

$ curl http://localhost:8082/health
{"status":"healthy","timestamp":"...","worker_id":"worker-..."}

Grafana Verification

With all services running with telemetry:

  1. Access Grafana: http://localhost:3000 (admin/admin)
  2. Navigate to Explore → Tempo
  3. Query by service: tasker-ruby-worker
  4. Verify traces appear with correlation IDs

Key Principles

1. Separation of Concerns

  • Infrastructure Decision (Tokio runtime availability): Handled by init functions
  • Business Logic (when to log): Handled by application code
  • Clean separation prevents runtime panics

2. Fail-Safe Defaults

  • Always provide console logging at minimum
  • Telemetry is enhancement, not requirement
  • Graceful degradation if telemetry unavailable

3. Explicit Over Implicit

  • Clear phase separation in code
  • Documented at each call site
  • Easy to understand initialization flow

4. Language-Agnostic Pattern

  • Same pattern works for Ruby, Python, WASM
  • Consistent across all FFI bindings
  • Single source of truth in tasker-shared

Troubleshooting

“no reactor running” Panic

Symptom:

thread 'main' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime'

Cause: Calling init_tracing() when TELEMETRY_ENABLED=true outside a Tokio runtime context.

Solution: Use two-phase pattern:

#![allow(unused)]
fn main() {
// Phase 1: Skip telemetry init
if telemetry_enabled {
    println!("Deferring telemetry init...");
} else {
    init_console_only();
}

// Phase 2: Initialize in runtime
runtime.block_on(async {
    init_tracing();
});
}

Telemetry Not Appearing

Symptom: No traces in Grafana/Tempo despite TELEMETRY_ENABLED=true.

Check:

  1. Verify environment variable is set: TELEMETRY_ENABLED=true
  2. Check logs for initialization message
  3. Verify OTLP endpoint is reachable
  4. Check observability stack is healthy

Debug:

# Check worker logs
docker logs docker-ruby-worker-1 | grep -E "telemetry|OpenTelemetry"

# Check observability stack
curl http://localhost:4317  # Should connect to OTLP gRPC

# Check Grafana Tempo
curl http://localhost:3200/api/status/buildinfo

Performance Considerations

Minimal Overhead

  • Phase 1: Simple console initialization, <1ms
  • Phase 2: Batch exporter initialization, <10ms
  • Total overhead: <15ms during startup
  • Zero runtime overhead after initialization

Memory Usage

  • Console-only: ~100KB (tracing subscriber)
  • With telemetry: ~500KB (includes OTLP client buffers)
  • Acceptable for all deployment scenarios

Future Enhancements

Lazy Telemetry Upgrade

Future optimization could upgrade console-only subscriber to include telemetry without restart:

#![allow(unused)]
fn main() {
// Not yet implemented - requires tracing layer hot-swapping
pub fn upgrade_to_telemetry() -> TaskerResult<()> {
    // Would require custom subscriber implementation
    // to support layer addition after initialization
}
}

Per-Worker Telemetry Control

Could extend pattern to support per-worker telemetry configuration:

#![allow(unused)]
fn main() {
// Not yet implemented
pub fn init_with_config(config: TelemetryConfig) -> TaskerResult<()> {
    // Would allow fine-grained control per worker
}
}

Phase 1.5: Worker Span Instrumentation with Trace Context Propagation

Implemented: 2025-11-24 Status: ✅ Production Ready - Validated end-to-end with Ruby workers

The Challenge

After implementing two-phase telemetry initialization (Phase 1), we discovered a gap: while OpenTelemetry infrastructure was working, worker step execution spans lacked correlation attributes needed for distributed tracing.

The Problem:

  • ✅ Orchestration spans had correlation_id, task_uuid, step_uuid
  • ✅ Worker infrastructure spans existed (read_messages, reserve_capacity)
  • ❌ Worker step execution spans were missing these attributes

Root Cause: Ruby workers use an async dual-event-system architecture where:

  1. Rust worker fires FFI event to Ruby (via EventPoller polling every 10ms)
  2. Ruby processes event asynchronously
  3. Ruby returns completion via FFI

The async boundary made traditional span scope maintenance impossible.

The Solution: Trace ID Propagation Pattern

Instead of trying to maintain span scope across the async FFI boundary, we propagate trace context as opaque strings:

Rust: Extract trace_id/span_id → Add to FFI event payload →
Ruby: Treat as opaque strings → Propagate through processing → Include in completion →
Rust: Create linked span using returned trace_id/span_id

Key Insight: Ruby doesn’t need to understand OpenTelemetry; it simply passes trace IDs through, just as it already does with correlation_id.

Implementation: Rust Side (Phase 1.5a)

File: tasker-worker/src/worker/command_processor.rs

Step 1: Create instrumented span with all required attributes

#![allow(unused)]
fn main() {
use tracing::{span, event, Level, Instrument};

pub async fn handle_execute_step(&self, step_message: SimpleStepMessage) -> TaskerResult<()> {
    // Fetch step details to get step_name and namespace
    let task_sequence_step = self.fetch_task_sequence_step(&step_message).await?;

    // Create span with all 5 required attributes
    let step_span = span!(
        Level::INFO,
        "worker.step_execution",
        correlation_id = %step_message.correlation_id,
        task_uuid = %step_message.task_uuid,
        step_uuid = %step_message.step_uuid,
        step_name = %task_sequence_step.workflow_step.name,
        namespace = %task_sequence_step.task.namespace_name
    );

    let execution_result = async {
        event!(Level::INFO, "step.execution_started");

        // Extract trace context for FFI propagation
        let trace_id = Some(step_message.correlation_id.to_string());
        let span_id = Some(format!("span-{}", step_message.step_uuid));

        // Fire FFI event with trace context
        let result = self.event_publisher
            .fire_step_execution_event_with_trace(
                &task_sequence_step,
                trace_id,
                span_id,
            )
            .await?;

        event!(Level::INFO, "step.execution_completed");
        Ok(result)
    }
    .instrument(step_span)  // Wrap async block with span
    .await;

    execution_result
}
}

Key Points:

  • All 5 attributes present: correlation_id, task_uuid, step_uuid, step_name, namespace
  • Event markers: step.execution_started, step.execution_completed
  • .instrument(span) pattern for async code
  • Trace context extracted and passed to FFI

Implementation: Data Structures

File: tasker-shared/src/types/base.rs

Add trace context fields to FFI event structures:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionEvent {
    pub event_id: Uuid,
    pub task_uuid: Uuid,
    pub step_uuid: Uuid,
    pub task_sequence_step: TaskSequenceStep,
    pub correlation_id: Uuid,

    // Trace context propagation
    #[serde(skip_serializing_if = "Option::is_none")]
    pub trace_id: Option<String>,

    #[serde(skip_serializing_if = "Option::is_none")]
    pub span_id: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionCompletionEvent {
    pub event_id: Uuid,
    pub task_uuid: Uuid,
    pub step_uuid: Uuid,
    pub success: bool,
    pub result: Option<serde_json::Value>,

    // Trace context from Ruby
    #[serde(skip_serializing_if = "Option::is_none")]
    pub trace_id: Option<String>,

    #[serde(skip_serializing_if = "Option::is_none")]
    pub span_id: Option<String>,
}
}

Design Notes:

  • Fields are optional for backward compatibility
  • skip_serializing_if prevents empty fields in JSON
  • Treated as opaque strings (no OpenTelemetry types)

Implementation: Ruby Side Propagation

File: workers/ruby/lib/tasker_core/event_bridge.rb

Propagate trace context like correlation_id:

def wrap_step_execution_event(event_data)
  wrapped = {
    event_id: event_data[:event_id],
    task_uuid: event_data[:task_uuid],
    step_uuid: event_data[:step_uuid],
    task_sequence_step: TaskerCore::Models::TaskSequenceStepWrapper.new(event_data[:task_sequence_step])
  }

  # Expose correlation_id at top level for easy access
  wrapped[:correlation_id] = event_data[:correlation_id] if event_data[:correlation_id]
  wrapped[:parent_correlation_id] = event_data[:parent_correlation_id] if event_data[:parent_correlation_id]

  # Expose trace_id and span_id for distributed tracing
  wrapped[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
  wrapped[:span_id] = event_data[:span_id] if event_data[:span_id]

  wrapped
end

File: workers/ruby/lib/tasker_core/subscriber.rb

Include trace context in completion:

def publish_step_completion(event_data:, success:, result: nil, error_message: nil, metadata: nil)
  completion_payload = {
    event_id: event_data[:event_id],
    task_uuid: event_data[:task_uuid],
    step_uuid: event_data[:step_uuid],
    success: success,
    result: result,
    metadata: metadata,
    error_message: error_message
  }

  # Propagate trace context back to Rust
  completion_payload[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
  completion_payload[:span_id] = event_data[:span_id] if event_data[:span_id]

  TaskerCore::Worker::EventBridge.instance.publish_step_completion(completion_payload)
end

Key Points:

  • Ruby treats trace_id and span_id as opaque strings
  • No OpenTelemetry dependency in Ruby
  • Simple pass-through pattern like correlation_id
  • Works with existing dual-event-system architecture

Implementation: Completion Span (Rust)

File: tasker-worker/src/worker/event_subscriber.rs

Create linked span when receiving Ruby completion:

#![allow(unused)]
fn main() {
pub fn handle_completion(&self, completion: StepExecutionCompletionEvent) -> TaskerResult<()> {
    // Create linked span using trace context from Ruby
    let completion_span = if let (Some(trace_id), Some(span_id)) =
        (&completion.trace_id, &completion.span_id) {
        span!(
            Level::INFO,
            "worker.step_completion_received",
            trace_id = %trace_id,
            span_id = %span_id,
            event_id = %completion.event_id,
            task_uuid = %completion.task_uuid,
            step_uuid = %completion.step_uuid,
            success = completion.success
        )
    } else {
        // Fallback span without trace context
        span!(
            Level::INFO,
            "worker.step_completion_received",
            event_id = %completion.event_id,
            task_uuid = %completion.task_uuid,
            step_uuid = %completion.step_uuid,
            success = completion.success
        )
    };

    let _guard = completion_span.enter();

    event!(Level::INFO, "step.ruby_execution_completed",
        success = completion.success,
        duration_ms = completion.metadata.execution_time_ms
    );

    // Continue with normal completion processing...
    Ok(())
}
}

Key Points:

  • Uses returned trace_id/span_id to create linked span
  • Graceful fallback if trace context not available
  • Event: step.ruby_execution_completed

Validation Results (2025-11-24)

Test Task:

  • Correlation ID: 88f21229-4085-4d53-8f52-2fde0b7228e2
  • Task UUID: 019ab6f9-7a27-7d16-b298-1ea41b327373
  • 4 steps executed successfully

Log Evidence:

worker.step_execution{
  correlation_id=88f21229-4085-4d53-8f52-2fde0b7228e2
  task_uuid=019ab6f9-7a27-7d16-b298-1ea41b327373
  step_uuid=019ab6f9-7a2a-7873-a5d1-93234ae46003
  step_name=linear_step_1
  namespace=linear_workflow
}: step.execution_started

Step execution event with trace context fired successfully to FFI handlers
  trace_id=Some("88f21229-4085-4d53-8f52-2fde0b7228e2")
  span_id=Some("span-019ab6f9-7a2a-7873-a5d1-93234ae46003")

worker.step_completion_received{...}: step.ruby_execution_completed

Tempo Query Results:

  • By correlation_id: 9 traces (5 orchestration + 4 worker)
  • By task_uuid: 13 traces (complete task lifecycle)
  • ✅ All attributes indexed and queryable
  • ✅ Spans exported to Tempo successfully

Complete Trace Flow

For each step execution:

┌─────────────────────────────────────────────────────┐
│ Rust Worker (command_processor.rs)                 │
│ 1. Create worker.step_execution span               │
│    - correlation_id, task_uuid, step_uuid          │
│    - step_name, namespace                          │
│ 2. Emit step.execution_started event               │
│ 3. Extract trace_id and span_id from span          │
│ 4. Add to StepExecutionEvent                       │
│ 5. Fire FFI event with trace context               │
│ 6. Emit step.execution_completed event             │
└─────────────────┬───────────────────────────────────┘
                  │
                  │ Async FFI boundary (EventPoller polling)
                  ▼
┌─────────────────────────────────────────────────────┐
│ Ruby EventBridge & Subscriber                       │
│ 1. Receive event with trace_id/span_id            │
│ 2. Propagate as opaque strings                     │
│ 3. Execute Ruby handler (business logic)           │
│ 4. Include trace_id/span_id in completion          │
└─────────────────┬───────────────────────────────────┘
                  │
                  │ Completion via FFI
                  ▼
┌─────────────────────────────────────────────────────┐
│ Rust Worker (event_subscriber.rs)                  │
│ 1. Receive StepExecutionCompletionEvent            │
│ 2. Extract trace_id and span_id                    │
│ 3. Create worker.step_completion_received span     │
│ 4. Emit step.ruby_execution_completed event        │
└─────────────────────────────────────────────────────┘

Benefits of This Pattern

  1. No Breaking Changes: Optional fields, backward compatible
  2. Ruby Simplicity: No OpenTelemetry dependency, opaque string propagation
  3. Trace Continuity: Same trace_id flows Rust → Ruby → Rust
  4. Query-Friendly: Tempo queries show complete execution flow
  5. Extensible: Pattern works for Python, WASM, any FFI language
  6. Performance: Zero overhead in Ruby (just string passing)

Pattern for Python Workers

The exact same pattern applies to Python workers:

Python Side (PyO3):

# workers/python/tasker_core/event_bridge.py

def wrap_step_execution_event(event_data):
    wrapped = {
        'event_id': event_data['event_id'],
        'task_uuid': event_data['task_uuid'],
        'step_uuid': event_data['step_uuid'],
        # ... other fields
    }

    # Propagate trace context as opaque strings
    if 'trace_id' in event_data:
        wrapped['trace_id'] = event_data['trace_id']
    if 'span_id' in event_data:
        wrapped['span_id'] = event_data['span_id']

    return wrapped
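
The completion side mirrors the Ruby subscriber shown earlier. A hypothetical sketch (the module path and function signature are illustrative, not the actual tasker-py API):

# workers/python/tasker_core/subscriber.py (illustrative path)

def publish_step_completion(event_data, success, result=None, error_message=None, metadata=None):
    completion_payload = {
        'event_id': event_data['event_id'],
        'task_uuid': event_data['task_uuid'],
        'step_uuid': event_data['step_uuid'],
        'success': success,
        'result': result,
        'metadata': metadata,
        'error_message': error_message,
    }

    # Propagate trace context back to Rust, exactly like correlation_id
    if 'trace_id' in event_data:
        completion_payload['trace_id'] = event_data['trace_id']
    if 'span_id' in event_data:
        completion_payload['span_id'] = event_data['span_id']

    return completion_payload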

Key Insight: Any FFI language can use this pattern; it only needs to pass trace_id and span_id through as opaque strings.

Performance Characteristics

  • Rust overhead: ~50-100 microseconds per span creation
  • FFI overhead: ~10-50 microseconds for extra string fields
  • Ruby overhead: Zero (just string passing, no OpenTelemetry)
  • Total overhead: <200 microseconds per step execution
  • Network: Spans batched and exported asynchronously

Troubleshooting

Symptom: Spans missing trace_id/span_id in Tempo

Check:

  1. Verify Rust logs show “Step execution event with trace context fired successfully”
  2. Check Ruby logs don’t have errors in EventBridge
  3. Verify completion events include trace_id/span_id
  4. Query Tempo by task_uuid to see if spans exist

Debug:

# Check Rust worker logs for trace context
docker logs docker-ruby-worker-1 | grep -E "(trace_id|span_id)"

# Query Tempo by task_uuid
curl "http://localhost:3200/api/search?tags=task_uuid=<UUID>"

# Check span export metrics
curl "http://localhost:9090/metrics" | grep otel

Future Enhancements

OpenTelemetry W3C Trace Context: Currently using correlation_id as trace_id placeholder. Future enhancement:

#![allow(unused)]
fn main() {
use opentelemetry::trace::TraceContextExt;

// Extract real OpenTelemetry trace context
let cx = tracing::Span::current().context();
let span_context = cx.span().span_context();
let trace_id = span_context.trace_id().to_string();
let span_id = span_context.span_id().to_string();
}

Span Linking: Use OpenTelemetry’s Link API for explicit parent-child relationships:

#![allow(unused)]
fn main() {
use opentelemetry::trace::{Link, SpanContext, TraceId, SpanId};

// Create linked span
let parent_context = SpanContext::new(
    TraceId::from_hex(&trace_id)?,
    SpanId::from_hex(&span_id)?,
    TraceFlags::default(),
    false,
    TraceState::default(),
);

let span = span!(
    Level::INFO,
    "worker.step_completion_received",
    links = vec![Link::new(parent_context, Vec::new())]
);
}

References

  • OpenTelemetry Rust: https://github.com/open-telemetry/opentelemetry-rust
  • Grafana LGTM Stack: https://grafana.com/oss/lgtm-stack/
  • W3C Trace Context: https://www.w3.org/TR/trace-context/
  • tasker-shared/src/logging.rs - Core logging implementation
  • workers/rust/README.md - Event-driven FFI architecture
  • docs/batch-processing.md - Distributed tracing integration
  • docker/docker-compose.test.yml - Observability stack configuration

Status: ✅ Production Ready - Two-phase initialization and Phase 1.5 worker span instrumentation patterns implemented and validated with Ruby FFI. Ready for Python and WASM implementations.

Library Deployment Patterns

This document describes the library deployment patterns feature, which lets applications consume worker observability data (health, metrics, templates, configuration) either via the HTTP API or directly through FFI; the FFI path requires no web server at all.

Overview

Previously, applications needed to run the worker’s HTTP server to access observability data. This created deployment overhead for applications that only needed programmatic access to health checks, metrics, or template information.

The library deployment patterns feature:

  1. Extracts observability logic into reusable services - Business logic moved from HTTP handlers to service classes
  2. Exposes services via FFI - Same functionality available without HTTP overhead
  3. Provides Ruby wrapper layer - Type-safe Ruby interface with dry-struct types
  4. Makes HTTP server optional - Services always available, web server is opt-in

Architecture

Service Layer

Four services encapsulate observability logic:

tasker-worker/src/worker/services/
├── health/          # HealthService - health checks
├── metrics/         # MetricsService - metrics collection
├── template_query/  # TemplateQueryService - template operations
└── config_query/    # ConfigQueryService - configuration queries

Each service:

  • Contains all business logic previously in HTTP handlers
  • Is independent of HTTP transport
  • Can be accessed via web handlers OR FFI
  • Returns typed response structures
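
As a rough sketch of that shape (type and method names here are illustrative, not the actual tasker-worker API), the HTTP handler and the FFI export both collapse into thin wrappers around the same service call:

#![allow(unused)]
fn main() {
// Illustrative only: the real services live under
// tasker-worker/src/worker/services/ and their names may differ.
pub struct BasicHealth {
    pub status: String,
    pub worker_id: String,
    pub timestamp: String,
}

pub struct HealthService {
    worker_id: String,
}

impl HealthService {
    // All business logic lives in the service, independent of transport.
    pub fn basic(&self) -> BasicHealth {
        BasicHealth {
            status: "healthy".to_string(),
            worker_id: self.worker_id.clone(),
            timestamp: format!("{:?}", std::time::SystemTime::now()),
        }
    }
}

// HTTP handler: wrap health_service.basic() in a JSON response
// FFI export:   serialize health_service.basic() for Ruby/Python callers
}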

Service Access Patterns

                    ┌─────────────────────────────────────────┐
                    │            WorkerWebState               │
                    │  ┌────────────────────────────────────┐ │
                    │  │         Service Instances           │ │
                    │  │  ┌────────────┐ ┌────────────────┐ │ │
                    │  │  │HealthServ.│ │MetricsService  │ │ │
                    │  │  └────────────┘ └────────────────┘ │ │
                    │  │  ┌────────────┐ ┌────────────────┐ │ │
                    │  │  │TemplQuery │ │ConfigQuery     │ │ │
                    │  │  └────────────┘ └────────────────┘ │ │
                    │  └────────────────────────────────────┘ │
                    └──────────────┬───────────────┬──────────┘
                                   │               │
           ┌───────────────────────┴───┐     ┌─────┴──────────────────────┐
           │     HTTP Handlers         │     │     FFI Layer              │
           │  (web/handlers/*.rs)      │     │  (observability_ffi.rs)    │
           └───────────────────────────┘     └────────────────────────────┘
                       │                                 │
                       ▼                                 ▼
               ┌───────────────┐                ┌───────────────┐
               │  HTTP Clients │                │  Ruby/Python  │
               │  curl, etc.   │                │  Applications │
               └───────────────┘                └───────────────┘

Usage

Ruby FFI Access

The TaskerCore::Observability module provides type-safe access to all services:

# Health checks
health = TaskerCore::Observability.health_basic
puts health.status        # => "healthy"
puts health.worker_id     # => "worker-abc123"

# Kubernetes-style probes
if TaskerCore::Observability.ready?
  puts "Worker ready to receive requests"
end

if TaskerCore::Observability.alive?
  puts "Worker is alive"
end

# Detailed health information
detailed = TaskerCore::Observability.health_detailed
detailed.checks.each do |name, check|
  puts "#{name}: #{check.status} (#{check.duration_ms}ms)"
end

Metrics Access

# Domain event statistics
events = TaskerCore::Observability.event_stats
puts "Events routed: #{events.router.total_routed}"
puts "FFI dispatches: #{events.in_process_bus.ffi_channel_dispatches}"

# Prometheus format (for custom scrapers)
prometheus_text = TaskerCore::Observability.prometheus_metrics

Template Operations

# List templates (JSON string)
templates_json = TaskerCore::Observability.templates_list

# Validate a template
validation = TaskerCore::Observability.template_validate(
  namespace: "payments",
  name: "process_payment",
  version: "v1"
)

if validation.valid
  puts "Template valid with #{validation.handler_count} handlers"
else
  validation.issues.each { |issue| puts "Issue: #{issue}" }
end

# Cache management
stats = TaskerCore::Observability.cache_stats
puts "Cache hits: #{stats.hits}, misses: #{stats.misses}"

TaskerCore::Observability.cache_clear  # Clear all cached templates

Configuration Access

# Get runtime configuration (secrets redacted)
config = TaskerCore::Observability.config
puts "Environment: #{config.environment}"
puts "Redacted fields: #{config.metadata.redacted_fields.join(', ')}"

# Quick environment check
env = TaskerCore::Observability.environment
puts "Running in: #{env}"  # => "production"

Configuration

HTTP Server Toggle

The HTTP server is now optional. Services are always created, but the HTTP server only starts if enabled:

# config/tasker/base/worker.toml
[worker.web]
enabled = true              # Set to false to disable HTTP server
bind_address = "0.0.0.0:8081"
request_timeout_ms = 30000

When enabled = false:

  • WorkerWebState is still created (services available)
  • HTTP server does NOT start
  • All services accessible via FFI only
  • Reduces resource usage (no HTTP listener, no connections)

Deployment Modes

Mode | HTTP Server | FFI Services | Use Case
Full | Enabled | Available | Standard deployment with monitoring
Library | Disabled | Available | Embedded in application, no external access
Headless | Disabled | Available | Container with external health checks disabled

Type Definitions

The Ruby wrapper uses dry-struct types for structured access:

Health Types

TaskerCore::Observability::Types::BasicHealth
  - status: String
  - worker_id: String
  - timestamp: String

TaskerCore::Observability::Types::DetailedHealth
  - status: String
  - timestamp: String
  - worker_id: String
  - checks: Hash[String, HealthCheck]
  - system_info: WorkerSystemInfo

TaskerCore::Observability::Types::HealthCheck
  - status: String
  - message: String?
  - duration_ms: Integer
  - last_checked: String

Metrics Types

TaskerCore::Observability::Types::DomainEventStats
  - router: EventRouterStats
  - in_process_bus: InProcessEventBusStats
  - captured_at: String
  - worker_id: String

TaskerCore::Observability::Types::EventRouterStats
  - total_routed: Integer
  - durable_routed: Integer
  - fast_routed: Integer
  - broadcast_routed: Integer
  - fast_delivery_errors: Integer
  - routing_errors: Integer

Template Types

TaskerCore::Observability::Types::CacheStats
  - total_entries: Integer
  - hits: Integer
  - misses: Integer
  - evictions: Integer
  - last_maintenance: String?

TaskerCore::Observability::Types::TemplateValidation
  - valid: Boolean
  - namespace: String
  - name: String
  - version: String
  - handler_count: Integer
  - issues: Array[String]
  - handler_metadata: Hash?

Config Types

TaskerCore::Observability::Types::RuntimeConfig
  - environment: String
  - common: Hash
  - worker: Hash
  - metadata: ConfigMetadata

TaskerCore::Observability::Types::ConfigMetadata
  - timestamp: String
  - source: String
  - redacted_fields: Array[String]

Error Handling

FFI methods raise RuntimeError on failures:

begin
  health = TaskerCore::Observability.health_basic
rescue RuntimeError => e
  if e.message.include?("Worker system not running")
    # Worker not bootstrapped yet
  elsif e.message.include?("Web state not available")
    # Services not initialized
  end
end

Template Operation Errors

Template operations raise RuntimeError for missing templates or namespaces:

begin
  result = TaskerCore::Observability.template_get(
    namespace: "unknown",
    name: "missing",
    version: "1.0.0"
  )
rescue RuntimeError => e
  puts "Template not found: #{e.message}"
end

# template_refresh handles errors gracefully, returning a result struct
result = TaskerCore::Observability.template_refresh(
  namespace: "unknown",
  name: "missing",
  version: "1.0.0"
)
puts result.success  # => false
puts result.message  # => error description

Convenience Methods

The ready? and alive? methods handle errors gracefully:

# These never raise - they return false on any error
TaskerCore::Observability.ready?  # => true/false
TaskerCore::Observability.alive?  # => true/false

Note: alive? checks for status == "alive" (from liveness probe), while ready? checks for status == "healthy" (from readiness probe).

Best Practices

  1. Use type-safe methods when possible - Methods returning dry-struct types provide better validation
  2. Handle errors gracefully - FFI can fail if worker not bootstrapped
  3. Consider caching - For high-frequency health checks, cache results briefly (see the sketch after this list)
  4. Use ready?/alive? helpers - They handle exceptions and return boolean
  5. Prefer FFI for internal use - Less overhead than HTTP for same-process access
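
For practice 3, here is a minimal sketch of a short-lived cache around the FFI readiness check. The wrapper class is illustrative and not part of TaskerCore; only TaskerCore::Observability.ready? comes from the API above.

# Cache the FFI readiness result for a few seconds to avoid
# re-checking on every incoming probe.
class CachedReadiness
  def initialize(ttl_seconds: 5)
    @ttl = ttl_seconds
    @checked_at = nil
    @value = false
  end

  def ready?
    now = Time.now
    if @checked_at.nil? || (now - @checked_at) > @ttl
      @value = TaskerCore::Observability.ready?  # never raises
      @checked_at = now
    end
    @value
  end
end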

Migration Guide

From HTTP to FFI

Before (HTTP):

response = Faraday.get("http://localhost:8081/health")
health = JSON.parse(response.body)

After (FFI):

health = TaskerCore::Observability.health_basic

Disabling HTTP Server

  1. Update configuration:

    [worker.web]
    enabled = false
    
  2. Update health check scripts to use FFI:

    # health_check.rb
    require 'tasker_core'
    
    exit(TaskerCore::Observability.ready? ? 0 : 1)
    
  3. Update monitoring to scrape via FFI:

    metrics = TaskerCore::Observability.prometheus_metrics
    # Send to Prometheus pushgateway or custom aggregator
    

API Reference

Health Methods

Method | Returns | Description
health_basic | Types::BasicHealth | Basic health status
health_live | Types::BasicHealth | Liveness probe (status: “alive”)
health_ready | Types::DetailedHealth | Readiness probe with all checks
health_detailed | Types::DetailedHealth | Full health information
ready? | Boolean | True if status == “healthy”
alive? | Boolean | True if status == “alive”

Metrics Methods

Method | Returns | Description
metrics_worker | String (JSON) | Worker metrics as JSON
event_stats | Types::DomainEventStats | Domain event statistics
prometheus_metrics | String | Prometheus text format

Template Methods

Method | Returns | Description
templates_list(include_cache_stats: false) | String (JSON) | List all templates
template_get(namespace:, name:, version:) | String (JSON) | Get specific template (raises on error)
template_validate(namespace:, name:, version:) | Types::TemplateValidation | Validate template (raises on error)
cache_stats | Types::CacheStats | Cache statistics
cache_clear | Types::CacheOperationResult | Clear template cache
template_refresh(namespace:, name:, version:) | Types::CacheOperationResult | Refresh specific template

Config Methods

Method | Returns | Description
config | Types::RuntimeConfig | Full config (secrets redacted)
environment | String | Current environment name

sccache Configuration Documentation

Overview

This document records our sccache configuration for future reference. Sccache is currently disabled due to GitHub Actions cache service issues, but we plan to re-enable it once the service is stable.

Current Status

🚫 DISABLED - Temporarily disabled due to GitHub Actions cache service issues:

sccache: error: Server startup failed: cache storage failed to read: Unexpected (permanent) at read => <h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p>

Planned Configuration

Environment Variables (setup-env action)

RUSTC_WRAPPER=sccache
SCCACHE_GHA_ENABLED=true
SCCACHE_CACHE_SIZE=2G  # For Docker builds

GitHub Actions Integration

Workflows Using sccache

  1. code-quality.yml - Build caching for clippy and rustfmt
  2. test-unit.yml - Build caching for unit tests
  3. test-integration.yml - Build caching for integration tests

Action Configuration

- uses: mozilla-actions/sccache-action@v0.0.4

Expected Benefits

  • 50%+ faster builds through compilation caching
  • Reduced CI costs by avoiding redundant compilation
  • Better developer experience with faster feedback loops

Performance Targets

  • Build cache hit rate: Target > 80%
  • Compilation time reduction: 50%+ on cache hits
  • Total CI time: Reduce by 10-20 minutes per run

Local Development Setup

For local development when sccache is working:

# Install sccache
cargo binstall sccache -y

# Set environment variables
export RUSTC_WRAPPER=sccache
export SCCACHE_GHA_ENABLED=true

# Check stats
sccache --show-stats

# Reset the statistics counters (e.g., before measuring a fresh build)
sccache --zero-stats

Re-enabling Steps

When GitHub Actions cache service is stable:

  1. Re-enable in workflows:

    • Uncomment mozilla-actions/sccache-action@v0.0.4 in workflows
    • Restore sccache environment variables in setup-env action
  2. Test with minimal workflow first:

    • Start with code-quality.yml
    • Monitor for cache service issues
    • Gradually enable in other workflows
  3. Monitor performance:

    • Track build times before/after
    • Monitor cache hit rates
    • Watch for any new cache service errors

Configuration Locations

Files containing sccache configuration:

  • .github/actions/setup-env/action.yml - Environment variables
  • .github/workflows/code-quality.yml - Action usage
  • .github/workflows/test-unit.yml - Action usage
  • .github/workflows/test-integration.yml - Action usage
  • docs/sccache-configuration.md - This documentation

Docker Integration

For Docker builds, pass sccache variables as build args:

build-args: |
  SCCACHE_GHA_ENABLED=true
  RUSTC_WRAPPER=sccache
  SCCACHE_CACHE_SIZE=2G

Troubleshooting

Common Issues

  • Cache service unavailable: Wait for GitHub to restore service
  • Cache misses: Check RUSTC_WRAPPER is set correctly
  • Permission errors: Ensure sccache action has proper permissions

Monitoring

  • Check sccache --show-stats for cache effectiveness
  • Monitor CI run times for performance improvements
  • Watch GitHub status page for cache service updates

StepContext API Reference

StepContext is the primary data access object for step handlers across all languages in the Tasker worker ecosystem. It provides a consistent interface for accessing task inputs, dependency results, configuration, and checkpoint data.

Overview

Every step handler receives a StepContext (or TaskSequenceStep in Rust) that contains:

  • Task context - Input data for the workflow (JSONB from task.context)
  • Dependency results - Results from upstream DAG steps
  • Step configuration - Handler-specific settings from the template
  • Checkpoint data - Batch processing state for resumability
  • Retry information - Current attempt count and max retries

Cross-Language API Reference

Core Data Access

Operation | Rust | Ruby | Python | TypeScript
Get task input | get_input::<T>("key")? | get_input("key") | get_input("key") | getInput("key")
Get input with default | get_input_or("key", default) | get_input_or("key", default) | get_input_or("key", default) | getInputOr("key", default)
Get config value | get_config::<T>("key")? | get_config("key") | get_config("key") | getConfig("key")
Get dependency result | get_dependency_result_column_value::<T>("step")? | get_dependency_result("step") | get_dependency_result("step") | getDependencyResult("step")
Get nested dependency field | get_dependency_field::<T>("step", &["path"])? | get_dependency_field("step", *path) | get_dependency_field("step", *path) | getDependencyField("step", ...path)

Retry Helpers

Operation | Rust | Ruby | Python | TypeScript
Check if retry | is_retry() | is_retry? | is_retry() | isRetry()
Check if last retry | is_last_retry() | is_last_retry? | is_last_retry() | isLastRetry()
Get retry count | retry_count() | retry_count | retry_count | retryCount
Get max retries | max_retries() | max_retries | max_retries | maxRetries

Checkpoint Access

Operation | Rust | Ruby | Python | TypeScript
Get raw checkpoint | checkpoint() | checkpoint | checkpoint | checkpoint
Get cursor | checkpoint_cursor::<T>() | checkpoint_cursor | checkpoint_cursor | checkpointCursor
Get items processed | checkpoint_items_processed() | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed
Get accumulated results | accumulated_results::<T>() | accumulated_results | accumulated_results | accumulatedResults
Check has checkpoint | has_checkpoint() | has_checkpoint? | has_checkpoint() | hasCheckpoint()

Standard Fields

Field | Rust | Ruby | Python | TypeScript
Task UUID | task.task.task_uuid | task_uuid | task_uuid | taskUuid
Step UUID | workflow_step.workflow_step_uuid | step_uuid | step_uuid | stepUuid
Correlation ID | task.task.correlation_id | task.correlation_id | correlation_id | correlationId
Input data (raw) | task.task.context | input_data / context | input_data | inputData
Step config (raw) | step_definition.handler.initialization | step_config | step_config | stepConfig

Usage Examples

Rust

#![allow(unused)]
fn main() {
use tasker_shared::types::base::TaskSequenceStep;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // Get task input
    let order_id: String = step_data.get_input("order_id")?;
    let batch_size: i32 = step_data.get_input_or("batch_size", 100);

    // Get config
    let api_url: String = step_data.get_config("api_url")?;

    // Get dependency result
    let validation_result: ValidationResult = step_data.get_dependency_result_column_value("validate")?;

    // Extract nested field from dependency
    let item_count: i32 = step_data.get_dependency_field("process", &["stats", "count"])?;

    // Check retry status
    if step_data.is_retry() {
        println!("Retry attempt {}", step_data.retry_count());
    }

    // Resume from checkpoint
    let cursor: Option<i64> = step_data.checkpoint_cursor();
    let start_from = cursor.unwrap_or(0);

    // ... handler logic ...
}
}

Ruby

def call(context)
  # Get task input
  order_id = context.get_input('order_id')
  batch_size = context.get_input_or('batch_size', 100)

  # Get config
  api_url = context.get_config('api_url')

  # Get dependency result
  validation_result = context.get_dependency_result('validate')

  # Extract nested field from dependency
  item_count = context.get_dependency_field('process', 'stats', 'count')

  # Check retry status
  if context.is_retry?
    logger.info("Retry attempt #{context.retry_count}")
  end

  # Resume from checkpoint
  start_from = context.checkpoint_cursor || 0

  # ... handler logic ...
end

Python

def call(self, context: StepContext) -> StepHandlerResult:
    # Get task input
    order_id = context.get_input("order_id")
    batch_size = context.get_input_or("batch_size", 100)

    # Get config
    api_url = context.get_config("api_url")

    # Get dependency result
    validation_result = context.get_dependency_result("validate")

    # Extract nested field from dependency
    item_count = context.get_dependency_field("process", "stats", "count")

    # Check retry status
    if context.is_retry():
        print(f"Retry attempt {context.retry_count}")

    # Resume from checkpoint
    start_from = context.checkpoint_cursor or 0

    # ... handler logic ...

TypeScript

async call(context: StepContext): Promise<StepHandlerResult> {
  // Get task input
  const orderId = context.getInput<string>('order_id');
  const batchSize = context.getInputOr('batch_size', 100);

  // Get config
  const apiUrl = context.getConfig<string>('api_url');

  // Get dependency result
  const validationResult = context.getDependencyResult('validate');

  // Extract nested field from dependency
  const itemCount = context.getDependencyField('process', 'stats', 'count');

  // Check retry status
  if (context.isRetry()) {
    console.log(`Retry attempt ${context.retryCount}`);
  }

  // Resume from checkpoint
  const startFrom = context.checkpointCursor ?? 0;

  // ... handler logic ...
}

Checkpoint Usage Guide

Checkpoints enable resumable batch processing. When a handler processes large datasets, it can save progress via checkpoints and resume from where it left off on retry.

Checkpoint Fields

  • cursor - Position marker (can be int, string, or object)
  • items_processed - Count of items completed
  • accumulated_results - Running totals or aggregated state

Reading Checkpoints

# Python example
def call(self, context: StepContext) -> StepHandlerResult:
    # Check if resuming from checkpoint
    if context.has_checkpoint():
        cursor = context.checkpoint_cursor
        items_done = context.checkpoint_items_processed
        totals = context.accumulated_results or {}
        print(f"Resuming from cursor {cursor}, {items_done} items done")
    else:
        cursor = 0
        items_done = 0
        totals = {}

    # Process from cursor position...

Writing Checkpoints

Checkpoints are written by including checkpoint data in the handler result metadata. See the batch processing documentation for details on the checkpoint yield pattern.

Notes

  • All accessor methods handle missing data gracefully (return None/null or use defaults)
  • Dependency results are automatically unwrapped from the {"result": value} envelope
  • Type conversion is handled automatically where supported (Rust, TypeScript generics)
  • Checkpoint data is persisted atomically by the CheckpointService

Table Management and Growth Strategies

Last Updated: 2026-01-10 Status: Active Recommendation

Problem Statement

In high-throughput workflow orchestration systems, the core task tables (tasks, workflow_steps, task_transitions, workflow_step_transitions) can grow to millions of rows over time. Without proper management, this growth can lead to:

Note: All tables reside in the tasker schema with simplified names (e.g., tasks instead of tasker_tasks). With search_path = tasker, public, queries use unqualified table names.

  • Query Performance Degradation: Even with proper indexes, very large tables require more I/O operations
  • Maintenance Overhead: VACUUM, ANALYZE, and index maintenance become increasingly expensive
  • Backup/Recovery Challenges: Larger tables increase backup windows and recovery times
  • Storage Costs: Historical data that’s rarely accessed still consumes storage resources

Existing Performance Mitigations

The tasker-core system employs several strategies to maintain query performance even with large tables:

1. Strategic Indexing

Covering Indexes for Hot Paths

The most critical indexes use PostgreSQL’s INCLUDE clause to create covering indexes that satisfy queries without table lookups:

Active Task Processing (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Covering index for active task queries with priority sorting
CREATE INDEX IF NOT EXISTS idx_tasks_active_with_priority_covering
    ON tasks (complete, priority, task_uuid)
    INCLUDE (named_task_uuid, requested_at)
    WHERE complete = false;

Impact: Task discovery queries can be satisfied entirely from the index without accessing the main table.
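
For example, a discovery query of this shape can be answered by an index-only scan (verify with EXPLAIN (ANALYZE, BUFFERS)), since every referenced column is either an index key or an INCLUDE column:

SELECT task_uuid, named_task_uuid, requested_at
FROM tasks
WHERE complete = false
ORDER BY priority DESC, task_uuid
LIMIT 50;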

Step Readiness Processing (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Covering index for step readiness queries
CREATE INDEX IF NOT EXISTS idx_workflow_steps_ready_covering
    ON workflow_steps (task_uuid, processed, in_process)
    INCLUDE (workflow_step_uuid, attempts, max_attempts, retryable)
    WHERE processed = false;

-- Covering index for task-based step grouping
CREATE INDEX IF NOT EXISTS idx_workflow_steps_task_covering
    ON workflow_steps (task_uuid)
    INCLUDE (workflow_step_uuid, processed, in_process, attempts, max_attempts);

Impact: Step dependency resolution and retry logic queries avoid heap lookups.

Transitive Dependency Optimization (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Covering index for transitive dependency traversal
CREATE INDEX IF NOT EXISTS idx_workflow_steps_transitive_deps
    ON workflow_steps (workflow_step_uuid, named_step_uuid)
    INCLUDE (task_uuid, results, processed);

Impact: DAG traversal operations can read all needed columns from the index.

State Transition Lookups (Partial Indexes)

Current State Resolution (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Fast current state resolution (only indexes most_recent = true)
CREATE INDEX IF NOT EXISTS idx_task_transitions_state_lookup
    ON task_transitions (task_uuid, to_state, most_recent)
    WHERE most_recent = true;

CREATE INDEX IF NOT EXISTS idx_workflow_step_transitions_state_lookup
    ON workflow_step_transitions (workflow_step_uuid, to_state, most_recent)
    WHERE most_recent = true;

Impact: State lookups index only current state, not full audit history. Reduces index size by >90%.
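
For example, resolving a task's current state only touches the most_recent = true subset, so the lookup hits the small partial index rather than the full audit history:

SELECT to_state
FROM task_transitions
WHERE task_uuid = $1
  AND most_recent = true;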

Correlation and Tracing Indexes

Distributed Tracing Support (migrations/tasker/20251007000000_add_correlation_ids.sql):

-- Primary correlation ID lookups
CREATE INDEX IF NOT EXISTS idx_tasks_correlation_id
    ON tasks(correlation_id);

-- Hierarchical workflow traversal (parent-child relationships)
CREATE INDEX IF NOT EXISTS idx_tasks_correlation_hierarchy
    ON tasks(parent_correlation_id, correlation_id)
    WHERE parent_correlation_id IS NOT NULL;

Impact: Enables efficient distributed tracing and workflow hierarchy queries.

Processor Ownership and Monitoring

Processor Tracking (migrations/tasker/20250912000000_tas41_richer_task_states.sql):

-- Index for processor ownership queries (audit trail only, enforcement removed)
CREATE INDEX IF NOT EXISTS idx_task_transitions_processor
    ON task_transitions(processor_uuid)
    WHERE processor_uuid IS NOT NULL;

-- Index for timeout monitoring using JSONB metadata
CREATE INDEX IF NOT EXISTS idx_task_transitions_timeout
    ON task_transitions((transition_metadata->>'timeout_at'))
    WHERE most_recent = true;

Impact: Enables processor-level debugging and timeout monitoring. Processor ownership enforcement was removed but the audit trail is preserved.

Dependency Graph Navigation

Step Edges for DAG Operations (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):

-- Parent-to-child navigation for dependency resolution
CREATE INDEX IF NOT EXISTS idx_workflow_step_edges_from_step
    ON workflow_step_edges (from_step_uuid);

-- Child-to-parent navigation for completion propagation
CREATE INDEX IF NOT EXISTS idx_workflow_step_edges_to_step
    ON workflow_step_edges (to_step_uuid);

Impact: Bidirectional DAG traversal for readiness checks and completion propagation.

2. Partial Indexes

Many indexes use WHERE clauses to index only active/relevant rows:

-- Only index tasks that are actively being processed
WHERE current_state IN ('pending', 'initializing', 'steps_in_process')

-- Only index the current state transition
WHERE most_recent = true

This significantly reduces index size and maintenance overhead while keeping lookups fast.

3. SQL Function Optimizations

Complex orchestration queries are implemented as PostgreSQL functions that leverage:

  • Lateral Joins: For efficient correlated subqueries
  • CTEs with Materialization: For complex dependency analysis
  • Targeted Filtering: Early elimination of irrelevant rows using index scans

Example from get_next_ready_tasks():

-- First filter to active tasks with priority sorting (uses index)
WITH prioritized_tasks AS (
  SELECT task_uuid, priority
  FROM tasks
  WHERE current_state IN ('pending', 'steps_in_process')
  ORDER BY priority DESC, created_at ASC
  LIMIT $1 * 2  -- Get more candidates than needed for filtering
)
-- Then apply complex staleness/readiness checks only on candidates
...

4. Staleness Exclusion

The system automatically excludes stale tasks from active processing queues:

  • Tasks stuck in waiting_for_dependencies > 60 minutes
  • Tasks stuck in waiting_for_retry > 30 minutes
  • Tasks with lifecycle timeouts exceeded

This prevents the active query set from growing indefinitely, even if old tasks aren’t archived.

Archive-and-Delete Strategy (Considered, Not Implemented)

What We Considered

We initially designed an archive-and-delete strategy:

Architecture:

  • Mirror tables: tasker.archived_tasks, tasker.archived_workflow_steps, tasker.archived_task_transitions, tasker.archived_workflow_step_transitions
  • Background service running every 24 hours
  • Batch processing: 1000 tasks per run
  • Transactional archival: INSERT into archive tables → DELETE from main tables
  • Retention policies: Configurable per task state (completed, error, cancelled)

Implementation Details:

#![allow(unused)]
fn main() {
// Archive tasks in terminal states older than retention period
pub async fn archive_completed_tasks(
    pool: &PgPool,
    retention_days: i32,
    batch_size: i32,
) -> Result<ArchiveStats> {
    // 1. INSERT INTO archived_tasks SELECT * FROM tasks WHERE ...
    // 2. INSERT INTO archived_workflow_steps SELECT * WHERE task_uuid IN (...)
    // 3. INSERT INTO archived_task_transitions SELECT * WHERE task_uuid IN (...)
    // 4. DELETE FROM workflow_step_transitions WHERE ...
    // 5. DELETE FROM task_transitions WHERE ...
    // 6. DELETE FROM workflow_steps WHERE ...
    // 7. DELETE FROM tasks WHERE ...
}
}

Why We Decided Against It

After implementation and analysis, we identified critical performance issues:

1. Write Amplification

Every archived task results in:

  • 2× writes per row: INSERT into archive table + original row still exists until DELETE
  • 1× delete per row: DELETE from main table triggers index updates
  • Cascade costs: Foreign key relationships require multiple DELETE operations in sequence

For a system processing 100,000 tasks/day with 30-day retention:

  • Daily archival: ~100,000 tasks × 2 write operations = 200,000 write I/Os
  • Plus associated workflow_steps (typically 5-10 per task): 500,000-1,000,000 additional writes

2. Index Maintenance Overhead

PostgreSQL must maintain indexes during both INSERT and DELETE operations:

During INSERT to archive tables:

  • Build index entries for all archive table indexes
  • Update statistics for query planner

During DELETE from main tables:

  • Mark deleted tuples in main table indexes
  • Update free space maps
  • Trigger VACUUM requirements

Result: Periodic severe degradation (2-5 seconds) during archival runs, even with batch processing.

3. Lock Contention

Large DELETE operations require:

  • Row-level locks on deleted rows
  • Table-level locks during index updates
  • Lock escalation risk with large batch sizes

This creates a “stop-the-world” effect where active task processing is blocked during archival.

4. VACUUM Pressure

Frequent large DELETEs create dead tuples that require aggressive VACUUMing:

  • Increases I/O load during off-hours
  • Can’t be fully eliminated even with proper tuning
  • Competes with active workload for resources

5. The “Garbage Collector” Anti-Pattern

The archive-and-delete strategy essentially implements a manual garbage collector:

  • Periodic runs with performance impact
  • Tuning trade-offs (frequency vs. batch size vs. impact)
  • Operational complexity (monitoring, alerting, recovery)

Recommended Strategy: Native Partitioning with pg_partman

Overview

PostgreSQL’s native table partitioning with pg_partman provides zero-runtime-cost table management:

Key Advantages:

  • No write amplification: Data stays in place, partitions are logical divisions
  • No DELETE operations: Old partitions are DETACHed and dropped as units
  • Instant partition drops: Dropping a partition is O(1), not O(rows)
  • Transparent to application: Queries work identically on partitioned tables
  • Battle-tested: Used by pgmq (our queue infrastructure) and thousands of production systems

How It Works

-- 1. Create partitioned parent table (in tasker schema)
CREATE TABLE tasker.tasks (
    task_uuid UUID NOT NULL,
    created_at TIMESTAMP NOT NULL,
    -- ... other columns
) PARTITION BY RANGE (created_at);

-- 2. pg_partman automatically creates child partitions
-- tasker.tasks_p2025_01  (Jan 2025)
-- tasker.tasks_p2025_02  (Feb 2025)
-- tasker.tasks_p2025_03  (Mar 2025)
-- ... etc

-- 3. Queries work unchanged; including the partition key (created_at)
--    in the predicate lets PostgreSQL prune to the relevant partitions
SELECT * FROM tasks
WHERE task_uuid = $1
  AND created_at > NOW() - INTERVAL '30 days';
-- → PostgreSQL only scans the partitions covering that time range

-- 4. Dropping old partitions is instant
ALTER TABLE tasker.tasks DETACH PARTITION tasker.tasks_p2024_12;
DROP TABLE tasker.tasks_p2024_12;  -- Instant, no row-by-row deletion

Performance Characteristics

Operation | Archive-and-Delete | Native Partitioning
Write path | INSERT + DELETE (2× I/O) | INSERT only (1× I/O)
Index maintenance | On INSERT + DELETE | On INSERT only
Lock contention | Row locks during DELETE | No locks for drops
VACUUM pressure | High (dead tuples) | None (partition drops)
Old data removal | O(rows) per deletion | O(1) partition detach
Query performance | Scans entire table | Partition pruning
Runtime impact | Periodic degradation | Zero

Implementation with pg_partman

Installation

CREATE EXTENSION pg_partman;

Setup for tasks

-- 1. Create partitioned table structure
-- (Include all existing columns and indexes)

-- 2. Initialize pg_partman for monthly partitions
SELECT partman.create_parent(
    p_parent_table := 'tasker.tasks',
    p_control := 'created_at',
    p_type := 'native',
    p_interval := 'monthly',
    p_premake := 3  -- Pre-create 3 future months
);

-- 3. Configure retention (keep 90 days)
UPDATE partman.part_config
SET retention = '90 days',
    retention_keep_table = false  -- Drop old partitions entirely
WHERE parent_table = 'tasker.tasks';

-- 4. Enable automatic maintenance
SELECT partman.run_maintenance(p_parent_table := 'tasker.tasks');

Automation

Add to cron or pg_cron:

-- Run maintenance every hour
SELECT cron.schedule('partman-maintenance', '0 * * * *',
    $$SELECT partman.run_maintenance()$$
);

This automatically:

  • Creates new partitions before they’re needed
  • Detaches and drops partitions older than retention period
  • Updates partition constraints for query optimization

Real-World Example: pgmq

The pgmq message queue system (which tasker-core uses for orchestration) implements partitioned queues for high-throughput scenarios:

Reference: pgmq Partitioned Queues

pgmq’s Rationale (from their docs):

“For very high-throughput queues, you may want to partition the queue table by time. This allows you to drop old partitions instead of deleting rows, which is much faster and doesn’t cause table bloat.”

pgmq’s Approach:

-- pgmq uses pg_partman for message queues
SELECT pgmq.create_partitioned(
    queue_name := 'high_throughput_queue',
    partition_interval := '1 day',
    retention_interval := '7 days'
);

Benefits They Report:

  • 10× faster old message cleanup vs. DELETE
  • Zero bloat from message deletion
  • Consistent performance even at millions of messages per day

Applying to Tasker: Our use case is nearly identical to pgmq:

  • High-throughput append-heavy workload
  • Time-series data (created_at is natural partition key)
  • Need to retain recent data, drop old data
  • Performance-critical read path

If pgmq chose partitioning over archive-and-delete for these reasons, we should too.

Migration Path

Phase 1: Analysis (Current State)

Before implementing partitioning:

  1. Analyze Current Growth Rate:
SELECT
    pg_size_pretty(pg_total_relation_size('tasker.tasks')) as total_size,
    count(*) as row_count,
    min(created_at) as oldest_task,
    max(created_at) as newest_task,
    count(*) / EXTRACT(day FROM (max(created_at) - min(created_at))) as avg_tasks_per_day
FROM tasks;
  2. Determine Partition Strategy:

    • Daily partitions: For > 1M tasks/day
    • Weekly partitions: For 100K-1M tasks/day (see the example after this list)
    • Monthly partitions: For < 100K tasks/day
  3. Plan Retention Period:

    • Legal/compliance requirements
    • Analytics/reporting needs
    • Typical task investigation window
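
For example, the 100K-1M tasks/day tier in step 2 above maps to the same create_parent call as in the setup section, just with a weekly interval (values are illustrative and should be tuned to the measured growth rate):

SELECT partman.create_parent(
    p_parent_table := 'tasker.tasks',
    p_control := 'created_at',
    p_type := 'native',
    p_interval := 'weekly',
    p_premake := 4  -- Pre-create 4 future weeks
);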

Phase 2: Implementation

  1. Create Partitioned Tables (requires downtime or blue-green deployment)
  2. Migrate Existing Data using pg_partman.partition_data_proc()
  3. Update Application (no code changes needed if using same table names)
  4. Configure Automation (pg_cron for maintenance)

Phase 3: Monitoring

Track partition management effectiveness:

-- Check partition sizes
SELECT
    schemaname || '.' || tablename as partition_name,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname = 'tasker' AND tablename LIKE 'tasks_p%'
ORDER BY tablename;

-- Verify partition pruning is working
EXPLAIN SELECT * FROM tasks
WHERE created_at > NOW() - INTERVAL '7 days';
-- Should show: "Seq Scan on tasker.tasks_p2025_11" (only current partition)

Decision Summary

Decision: Use PostgreSQL native partitioning with pg_partman for table growth management.

Rationale:

  • Zero runtime performance impact vs. periodic degradation with archive-and-delete
  • Operationally simpler (set-and-forget vs. monitoring archive jobs)
  • Battle-tested solution used by pgmq and thousands of production systems
  • Aligns with PostgreSQL best practices and community recommendations

Not Recommended: Archive-and-delete strategy due to write amplification, lock contention, and periodic performance degradation.

Task and Step Readiness and Execution

Last Updated: 2026-01-10 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | States and Lifecycles | Events and Commands

← Back to Documentation Hub


This document provides comprehensive documentation of the SQL functions and database logic that drives task and step readiness analysis, dependency resolution, and execution coordination in the tasker-core system.

Overview

The tasker-core system relies heavily on sophisticated PostgreSQL functions to perform complex workflow orchestration operations at the database level. This approach provides significant performance benefits through set-based operations, atomic transactions, and reduced network round trips while maintaining data consistency.

The SQL function system supports several critical categories of operations:

  1. Step Readiness Analysis: Complex dependency resolution and backoff calculations
  2. DAG Operations: Cycle detection, depth calculation, and parallel execution discovery
  3. State Management: Atomic state transitions with processor ownership tracking
  4. Analytics and Monitoring: Performance metrics and system health analysis
  5. Task Execution Context: Comprehensive execution metadata and results management

SQL Function Architecture

Function Categories

The SQL functions are organized into logical categories as defined in tasker-shared/src/database/sql_functions.rs:

1. Step Readiness Analysis

  • get_step_readiness_status(task_uuid, step_uuids[]): Comprehensive dependency analysis
  • calculate_backoff_delay(attempts, base_delay): Exponential backoff calculation
  • check_step_dependencies(step_uuid): Parent completion validation
  • get_ready_steps(task_uuid): Parallel execution candidate discovery

2. DAG Operations

  • detect_cycle(from_step_uuid, to_step_uuid): Cycle detection using recursive CTEs
  • calculate_dependency_levels(task_uuid): Topological depth calculation
  • calculate_step_depth(step_uuid): Individual step depth analysis
  • get_step_transitive_dependencies(step_uuid): Full dependency tree traversal

3. State Management

  • transition_task_state_atomic(task_uuid, from_state, to_state, processor_uuid): Atomic state transitions with ownership
  • get_current_task_state(task_uuid): Current task state resolution
  • finalize_task_completion(task_uuid): Task completion orchestration

4. Analytics and Monitoring

  • get_analytics_metrics(since_timestamp): Comprehensive system analytics
  • get_system_health_counts(): System-wide health and performance metrics
  • get_slowest_steps(limit): Performance optimization analysis
  • get_slowest_tasks(limit): Task performance analysis

5. Task Discovery and Execution

  • get_next_ready_task(): Single task discovery for orchestration
  • get_next_ready_tasks(limit): Batch task discovery for scaling
  • get_task_ready_info(task_uuid): Detailed task readiness information
  • get_task_execution_context(task_uuid): Complete execution metadata

Database Schema Foundation

Core Tables

The SQL functions operate on a comprehensive schema designed for UUID v7 performance and scalability. All tables reside in the tasker schema with simplified names. With search_path = tasker, public, queries use unqualified table names.

Primary Tables

  • tasks: Main workflow instances with UUID v7 primary keys
  • workflow_steps: Individual workflow steps with dependency relationships
  • task_transitions: Task state change audit trail with processor tracking
  • workflow_step_transitions: Step state change audit trail

Registry Tables

  • task_namespaces: Workflow namespace definitions
  • named_tasks: Task type templates and metadata
  • named_steps: Step type definitions and handlers
  • workflow_step_edges: Step dependency relationships (DAG structure)

Richer Task State Enhancements

The richer task states migration (migrations/tasker/20251209000000_tas41_richer_task_states.sql) enhanced the schema with:

Task State Management:

-- 12 comprehensive task states
ALTER TABLE task_transitions
ADD CONSTRAINT chk_task_transitions_to_state
CHECK (to_state IN (
    'pending', 'initializing', 'enqueuing_steps', 'steps_in_process',
    'evaluating_results', 'waiting_for_dependencies', 'waiting_for_retry',
    'blocked_by_failures', 'complete', 'error', 'cancelled', 'resolved_manually'
));

Processor Ownership Tracking:

ALTER TABLE task_transitions
ADD COLUMN processor_uuid UUID,
ADD COLUMN transition_metadata JSONB DEFAULT '{}';

Atomic State Transitions:

CREATE OR REPLACE FUNCTION transition_task_state_atomic(
    p_task_uuid UUID,
    p_from_state VARCHAR,
    p_to_state VARCHAR,
    p_processor_uuid UUID,
    p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN

Step Readiness Analysis

Recent Enhancements

WaitingForRetry State Support (Migration 20250927000000)

The step readiness system was enhanced to support the new WaitingForRetry state, which distinguishes retryable failures from permanent errors:

Key Changes:

  1. Helper Functions: Added calculate_step_next_retry_time() and evaluate_step_state_readiness() for consistent backoff logic
  2. State Recognition: Updated readiness evaluation to treat waiting_for_retry as a ready-eligible state alongside pending
  3. Backoff Calculation: Centralized exponential backoff logic with configurable backoff periods
  4. Performance Optimization: Introduced task-scoped CTEs to eliminate table scans for batch operations

Semantic Impact:

  • Before: error state included both retryable and permanent failures
  • After: error = permanent only, waiting_for_retry = awaiting backoff for retry

Backoff Logic Consolidation (October 2025)

The backoff calculation system was consolidated to eliminate configuration conflicts and race conditions:

Key Changes:

  1. Configuration Alignment: Single source of truth (TOML config) with max_backoff_seconds = 60
  2. Parameterized SQL Functions: calculate_step_next_retry_time() accepts configurable max delay and multiplier
  3. Atomic Updates: Row-level locking prevents concurrent backoff update conflicts
  4. Timing Consistency: last_attempted_at updated atomically with backoff_request_seconds

Issues Resolved:

  • Configuration Conflicts: Eliminated three conflicting max values (30s SQL, 60s code, 300s TOML)
  • Race Conditions: Added SELECT FOR UPDATE locking in BackoffCalculator
  • Hardcoded Values: Removed hardcoded 30-second cap and power(2, attempts) in SQL

Helper Functions Enhanced:

  1. calculate_step_next_retry_time(): Now parameterized with configuration values

    CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
        backoff_request_seconds INTEGER,
        last_attempted_at TIMESTAMP,
        failure_time TIMESTAMP,
        attempts INTEGER,
        p_max_backoff_seconds INTEGER DEFAULT 60,
        p_backoff_multiplier NUMERIC DEFAULT 2.0
    ) RETURNS TIMESTAMP
    
    • Respects custom backoff periods from step configuration (primary path)
    • Falls back to exponential backoff with configurable parameters
    • Defaults aligned with TOML config (60s max, 2.0 multiplier)
    • Used consistently across all readiness evaluation (see the example call after this list)
  2. set_step_backoff_atomic(): New atomic update function

    CREATE OR REPLACE FUNCTION set_step_backoff_atomic(
        p_step_uuid UUID,
        p_backoff_seconds INTEGER
    ) RETURNS BOOLEAN
    
    • Provides transactional guarantee for concurrent updates
    • Updates both backoff_request_seconds and last_attempted_at
    • Ensures timing consistency with SQL calculations
  3. evaluate_step_state_readiness(): Determines if a step is ready for execution

    CREATE OR REPLACE FUNCTION evaluate_step_state_readiness(
        current_state TEXT,
        processed BOOLEAN,
        in_process BOOLEAN,
        dependencies_satisfied BOOLEAN,
        retry_eligible BOOLEAN,
        retryable BOOLEAN,
        next_retry_time TIMESTAMP
    ) RETURNS BOOLEAN
    
    • Recognizes both pending and waiting_for_retry as ready-eligible states
    • Validates backoff period has expired before allowing retry
    • Ensures dependencies are satisfied and retry limits not exceeded
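
For example, calculate_step_next_retry_time can be exercised directly in psql. A call of this shape (no custom backoff, defaults matching the TOML config) returns the earliest retry time for a step on its third attempt:

SELECT calculate_step_next_retry_time(
    NULL::integer,      -- backoff_request_seconds: no custom backoff requested
    NOW()::timestamp,   -- last_attempted_at
    NOW()::timestamp,   -- failure_time
    3,                  -- attempts
    60,                 -- p_max_backoff_seconds
    2.0                 -- p_backoff_multiplier
);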

Step Readiness Status

The get_step_readiness_status function provides comprehensive analysis of step execution eligibility:

CREATE OR REPLACE FUNCTION get_step_readiness_status(
    task_uuid UUID,
    step_uuids UUID[] DEFAULT NULL
) RETURNS TABLE(
    workflow_step_uuid UUID,
    task_uuid UUID,
    named_step_uuid UUID,
    name VARCHAR,
    current_state VARCHAR,
    dependencies_satisfied BOOLEAN,
    retry_eligible BOOLEAN,
    ready_for_execution BOOLEAN,
    last_failure_at TIMESTAMP,
    next_retry_at TIMESTAMP,
    total_parents INTEGER,
    completed_parents INTEGER,
    attempts INTEGER,
    retry_limit INTEGER,
    backoff_request_seconds INTEGER,
    last_attempted_at TIMESTAMP
)

Key Analysis Features

Dependency Satisfaction:

  • Validates all parent steps are in complete or resolved_manually states
  • Handles complex DAG structures with multiple dependency paths
  • Supports conditional dependencies based on parent results

Retry Logic:

  • Exponential backoff with a configurable cap and multiplier (defaults: 60s max, 2.0 multiplier; see Backoff Logic Consolidation above)
  • Custom backoff periods from step configuration
  • Retry limit enforcement to prevent infinite loops
  • Failure tracking with temporal analysis

Execution Readiness:

  • State validation (must be pending or waiting_for_retry)
  • Dependency satisfaction confirmation
  • Retry eligibility assessment
  • Backoff period expiration checking
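
For example, a readiness snapshot for every step in a task comes straight from the function's column set:

SELECT name, current_state, dependencies_satisfied, retry_eligible,
       ready_for_execution, next_retry_at
FROM get_step_readiness_status($1);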

Step Readiness Implementation

The Rust integration provides type-safe access to step readiness analysis:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct StepReadinessStatus {
    pub workflow_step_uuid: Uuid,
    pub task_uuid: Uuid,
    pub named_step_uuid: Uuid,
    pub name: String,
    pub current_state: String,
    pub dependencies_satisfied: bool,
    pub retry_eligible: bool,
    pub ready_for_execution: bool,
    pub last_failure_at: Option<NaiveDateTime>,
    pub next_retry_at: Option<NaiveDateTime>,
    pub total_parents: i32,
    pub completed_parents: i32,
    pub attempts: i32,
    pub retry_limit: i32,
    pub backoff_request_seconds: Option<i32>,
    pub last_attempted_at: Option<NaiveDateTime>,
}

impl StepReadinessStatus {
    pub fn can_execute_now(&self) -> bool {
        self.ready_for_execution
    }

    pub fn blocking_reason(&self) -> Option<&'static str> {
        if !self.dependencies_satisfied {
            return Some("dependencies_not_satisfied");
        }
        if !self.retry_eligible {
            return Some("retry_not_eligible");
        }
        Some("invalid_state")
    }

    pub fn effective_backoff_seconds(&self) -> i32 {
        self.backoff_request_seconds.unwrap_or_else(|| {
            if self.attempts > 0 {
                std::cmp::min(2_i32.pow(self.attempts as u32), 30)
            } else {
                0
            }
        })
    }
}
}

DAG Operations and Dependency Resolution

Dependency Level Calculation

The calculate_dependency_levels function uses recursive CTEs to perform topological analysis of the workflow DAG:

CREATE OR REPLACE FUNCTION calculate_dependency_levels(input_task_uuid UUID)
RETURNS TABLE(workflow_step_uuid UUID, dependency_level INTEGER)
LANGUAGE plpgsql STABLE AS $$
BEGIN
  RETURN QUERY
  WITH RECURSIVE dependency_levels AS (
    -- Base case: Find root nodes (steps with no dependencies)
    SELECT
      ws.workflow_step_uuid,
      0 as level
    FROM workflow_steps ws
    WHERE ws.task_uuid = input_task_uuid
      AND NOT EXISTS (
        SELECT 1
        FROM workflow_step_edges wse
        WHERE wse.to_step_uuid = ws.workflow_step_uuid
      )

    UNION ALL

    -- Recursive case: Find children of current level nodes
    SELECT
      wse.to_step_uuid as workflow_step_uuid,
      dl.level + 1 as level
    FROM dependency_levels dl
    JOIN workflow_step_edges wse ON wse.from_step_uuid = dl.workflow_step_uuid
    JOIN workflow_steps ws ON ws.workflow_step_uuid = wse.to_step_uuid
    WHERE ws.task_uuid = input_task_uuid
  )
  SELECT
    dl.workflow_step_uuid,
    MAX(dl.level) as dependency_level  -- Use MAX to handle multiple paths
  FROM dependency_levels dl
  GROUP BY dl.workflow_step_uuid
  ORDER BY dependency_level, workflow_step_uuid;
END;

Dependency Level Benefits

Parallel Execution Planning:

  • Steps at the same dependency level can execute in parallel
  • Enables optimal resource utilization across workers
  • Supports batch enqueueing for scalability

Execution Ordering:

  • Level 0: Root steps (no dependencies) - can start immediately
  • Level N: Steps requiring completion of level N-1 steps
  • Topological ordering ensures dependency satisfaction

Performance Optimization:

  • Single query provides complete dependency analysis
  • Avoids N+1 query problems in dependency resolution
  • Enables batch processing optimizations
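
For example, a single call returns the level of every step in a task's DAG; steps sharing a dependency_level can be enqueued together:

SELECT workflow_step_uuid, dependency_level
FROM calculate_dependency_levels($1)
ORDER BY dependency_level, workflow_step_uuid;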

Transitive Dependencies

The get_step_transitive_dependencies function provides complete ancestor analysis:

CREATE OR REPLACE FUNCTION get_step_transitive_dependencies(step_uuid UUID)
RETURNS TABLE(
    step_name VARCHAR,
    step_uuid UUID,
    task_uuid UUID,
    distance INTEGER,
    processed BOOLEAN,
    results JSONB
)

This enables step handlers to access results from any ancestor step:

#![allow(unused)]
fn main() {
impl SqlFunctionExecutor {
    pub async fn get_step_dependency_results_map(
        &self,
        step_uuid: Uuid,
    ) -> Result<HashMap<String, StepExecutionResult>, sqlx::Error> {
        let dependencies = self.get_step_transitive_dependencies(step_uuid).await?;
        Ok(dependencies
            .into_iter()
            .filter_map(|dep| {
                if dep.processed && dep.results.is_some() {
                    let results: StepExecutionResult = dep.results.unwrap().into();
                    Some((dep.step_name, results))
                } else {
                    None
                }
            })
            .collect())
    }
}
}
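
In a step handler, this map is typically consumed by ancestor step name; the handler and step names below are illustrative:

#![allow(unused)]
fn main() {
use uuid::Uuid;

async fn build_shipping_label(
    executor: &SqlFunctionExecutor,
    step_uuid: Uuid,
) -> Result<(), sqlx::Error> {
    // All processed ancestors, keyed by step name.
    let ancestors = executor.get_step_dependency_results_map(step_uuid).await?;

    // "process_payment" is an example ancestor step name.
    if let Some(payment_result) = ancestors.get("process_payment") {
        // Use the ancestor's output to build this step's work.
        let _ = payment_result;
    }
    Ok(())
}
}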

Task Execution Context

Recent Enhancements

Permanently Blocked Detection Fix (Migration 20251001000000)

The get_task_execution_context function was enhanced to correctly identify tasks blocked by permanent errors:

Problem: The function only checked attempts >= retry_limit to detect permanently blocked steps, missing cases where workers marked errors as non-retryable (e.g., missing handlers, configuration errors).

Solution: Updated permanently_blocked_steps calculation to check both conditions:

COUNT(CASE WHEN sd.current_state = 'error'
            AND (sd.attempts >= retry_limit OR sd.retry_eligible = false) THEN 1 END)

Impact:

  • execution_status: Now correctly returns blocked_by_failures instead of waiting_for_dependencies for tasks with non-retryable errors
  • recommended_action: Returns handle_failures instead of wait_for_dependencies
  • health_status: Returns blocked instead of recovering when appropriate

This fix ensures the orchestration system properly identifies when manual intervention is needed versus when a task is simply waiting for retry backoff.

Task Discovery and Orchestration

Task Readiness Discovery

The system provides multiple functions for task discovery based on orchestration needs:

Single Task Discovery

CREATE OR REPLACE FUNCTION get_next_ready_task()
RETURNS TABLE(
    task_uuid UUID,
    task_name VARCHAR,
    priority INTEGER,
    namespace_name VARCHAR,
    ready_steps_count BIGINT,
    computed_priority NUMERIC,
    current_state VARCHAR
)

Batch Task Discovery

CREATE OR REPLACE FUNCTION get_next_ready_tasks(limit_count INTEGER)
RETURNS TABLE(
    task_uuid UUID,
    task_name VARCHAR,
    priority INTEGER,
    namespace_name VARCHAR,
    ready_steps_count BIGINT,
    computed_priority NUMERIC,
    current_state VARCHAR
)

Task Ready Information

The ReadyTaskInfo structure provides comprehensive task metadata for orchestration decisions:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct ReadyTaskInfo {
    pub task_uuid: Uuid,
    pub task_name: String,
    pub priority: i32,
    pub namespace_name: String,
    pub ready_steps_count: i64,
    pub computed_priority: Option<BigDecimal>,
    pub current_state: String,
}
}

Priority Calculation:

  • Base priority from task configuration
  • Dynamic priority adjustment based on age, retry attempts
  • Namespace-based priority modifiers
  • SLA-based priority escalation

Ready Steps Count:

  • Real-time count of steps eligible for execution
  • Used for batch size optimization
  • Influences orchestration scheduling decisions

State Management and Atomic Transitions

Atomic State Transitions

The enhanced state machine provides atomic transitions with processor ownership:

CREATE OR REPLACE FUNCTION transition_task_state_atomic(
    p_task_uuid UUID,
    p_from_state VARCHAR,
    p_to_state VARCHAR,
    p_processor_uuid UUID,
    p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN AS $$
DECLARE
    v_sort_key INTEGER;
    v_transitioned INTEGER := 0;
BEGIN
    -- Get next sort key
    SELECT COALESCE(MAX(sort_key), 0) + 1 INTO v_sort_key
    FROM task_transitions
    WHERE task_uuid = p_task_uuid;

    -- Atomically transition only if in expected state
    WITH current_state AS (
        SELECT to_state, processor_uuid
        FROM task_transitions
        WHERE task_uuid = p_task_uuid
        AND most_recent = true
        FOR UPDATE
    ),
    ownership_check AS (
        SELECT
            CASE
                -- States requiring ownership
                WHEN cs.to_state IN ('initializing', 'enqueuing_steps',
                                   'steps_in_process', 'evaluating_results')
                THEN cs.processor_uuid = p_processor_uuid OR cs.processor_uuid IS NULL
                -- Other states don't require ownership
                ELSE true
            END as can_transition
        FROM current_state cs
        WHERE cs.to_state = p_from_state
    ),
    do_update AS (
        UPDATE task_transitions
        SET most_recent = false
        WHERE task_uuid = p_task_uuid
        AND most_recent = true
        AND EXISTS (SELECT 1 FROM ownership_check WHERE can_transition)
        RETURNING task_uuid
    )
    INSERT INTO task_transitions (
        task_uuid, from_state, to_state,
        processor_uuid, transition_metadata,
        sort_key, most_recent, created_at, updated_at
    )
    SELECT
        p_task_uuid, p_from_state, p_to_state,
        p_processor_uuid, p_metadata,
        v_sort_key, true, NOW(), NOW()
    WHERE EXISTS (SELECT 1 FROM do_update);

    GET DIAGNOSTICS v_transitioned = ROW_COUNT;
    RETURN v_transitioned > 0;
END;
$$ LANGUAGE plpgsql;

Key Features

Atomic Operation:

  • Single transaction with row-level locking
  • Compare-and-swap semantics prevent race conditions
  • Returns boolean indicating success/failure

Ownership Validation:

  • Processor ownership required for active states
  • Prevents concurrent processing by multiple orchestrators
  • Supports ownership claiming for unowned tasks

State Consistency:

  • Validates current state matches expected from_state
  • Maintains audit trail with complete transition history
  • Updates most_recent flags atomically

Current State Resolution

Fast current state lookups are provided through optimized queries:

#![allow(unused)]
fn main() {
impl SqlFunctionExecutor {
    pub async fn get_current_task_state(&self, task_uuid: Uuid)
        -> Result<TaskState, sqlx::Error> {
        let state_str = sqlx::query_scalar!(
            r#"SELECT get_current_task_state($1) as "state""#,
            task_uuid
        )
        .fetch_optional(&self.pool)
        .await?
        .ok_or_else(|| sqlx::Error::RowNotFound)?;

        match state_str {
            Some(state) => TaskState::try_from(state.as_str())
                .map_err(|_| sqlx::Error::Decode("Invalid task state".into())),
            None => Err(sqlx::Error::RowNotFound),
        }
    }
}
}

Analytics and System Health

System Health Monitoring

The get_system_health_counts function provides comprehensive system visibility:

CREATE OR REPLACE FUNCTION get_system_health_counts()
RETURNS TABLE(
    pending_tasks BIGINT,
    initializing_tasks BIGINT,
    enqueuing_steps_tasks BIGINT,
    steps_in_process_tasks BIGINT,
    evaluating_results_tasks BIGINT,
    waiting_for_dependencies_tasks BIGINT,
    waiting_for_retry_tasks BIGINT,
    blocked_by_failures_tasks BIGINT,
    complete_tasks BIGINT,
    error_tasks BIGINT,
    cancelled_tasks BIGINT,
    resolved_manually_tasks BIGINT,
    total_tasks BIGINT,
    -- step counts...
) AS $$

Health Score Calculation

The Rust implementation provides derived health metrics:

#![allow(unused)]
fn main() {
impl SystemHealthCounts {
    pub fn health_score(&self) -> f64 {
        if self.total_tasks == 0 {
            return 1.0;
        }

        let success_rate = self.complete_tasks as f64 / self.total_tasks as f64;
        let error_rate = self.error_tasks as f64 / self.total_tasks as f64;
        let connection_health = 1.0 -
            (self.active_connections as f64 / self.max_connections as f64).min(1.0);

        // Weighted combination: 50% success rate, 30% non-error rate, 20% connection health
        (success_rate * 0.5) + ((1.0 - error_rate) * 0.3) + (connection_health * 0.2)
    }

    pub fn is_under_heavy_load(&self) -> bool {
        let connection_pressure =
            self.active_connections as f64 / self.max_connections as f64;
        let error_rate = if self.total_tasks > 0 {
            self.error_tasks as f64 / self.total_tasks as f64
        } else {
            0.0
        };

        connection_pressure > 0.8 || error_rate > 0.2
    }
}
}
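
A small sketch of how these derived metrics might gate operational alerting; the 0.7 threshold is illustrative, not a Tasker default:

#![allow(unused)]
fn main() {
fn evaluate_health(counts: &SystemHealthCounts) {
    let score = counts.health_score();

    // Illustrative alerting threshold.
    if score < 0.7 {
        eprintln!("system health degraded: score {score:.2}");
    }
    if counts.is_under_heavy_load() {
        eprintln!("heavy load: consider scaling workers or draining queues");
    }
}
}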

Analytics Metrics

The get_analytics_metrics function provides comprehensive performance analysis:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct AnalyticsMetrics {
    pub active_tasks_count: i64,
    pub total_namespaces_count: i64,
    pub unique_task_types_count: i64,
    pub system_health_score: BigDecimal,
    pub task_throughput: i64,
    pub completion_count: i64,
    pub error_count: i64,
    pub completion_rate: BigDecimal,
    pub error_rate: BigDecimal,
    pub avg_task_duration: BigDecimal,
    pub avg_step_duration: BigDecimal,
    pub step_throughput: i64,
    pub analysis_period_start: DateTime<Utc>,
    pub calculated_at: DateTime<Utc>,
}
}

Performance Optimization Analysis

Slowest Steps Analysis

The system provides performance optimization guidance through detailed analysis:

CREATE OR REPLACE FUNCTION get_slowest_steps(
    since_timestamp TIMESTAMP WITH TIME ZONE,
    limit_count INTEGER,
    namespace_filter VARCHAR,
    task_name_filter VARCHAR,
    version_filter VARCHAR
) RETURNS TABLE(
    named_step_uuid UUID,
    step_name VARCHAR,
    avg_duration_seconds NUMERIC,
    max_duration_seconds NUMERIC,
    min_duration_seconds NUMERIC,
    execution_count INTEGER,
    error_count INTEGER,
    error_rate NUMERIC,
    last_executed_at TIMESTAMP WITH TIME ZONE
)

Slowest Tasks Analysis

Similar analysis is available at the task level:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct SlowestTaskAnalysis {
    pub named_task_uuid: Uuid,
    pub task_name: String,
    pub avg_duration_seconds: f64,
    pub max_duration_seconds: f64,
    pub min_duration_seconds: f64,
    pub execution_count: i32,
    pub avg_step_count: f64,
    pub error_count: i32,
    pub error_rate: f64,
    pub last_executed_at: Option<DateTime<Utc>>,
}
}

Critical Problem-Solving SQL Functions

PGMQ Message Race Condition Prevention

Problem: Multiple Workers Claiming Same Message

When multiple workers simultaneously try to process steps from the same queue, PGMQ’s standard pgmq.read() function randomly selects messages, potentially causing workers to miss messages they were specifically notified about. This creates inefficiency and potential race conditions.

Solution: pgmq_read_specific_message()

CREATE OR REPLACE FUNCTION pgmq_read_specific_message(
    queue_name text,
    target_msg_id bigint,
    vt_seconds integer DEFAULT 30
) RETURNS TABLE (
    msg_id bigint,
    read_ct integer,
    enqueued_at timestamp with time zone,
    vt timestamp with time zone,
    message jsonb
) AS $$

Key Problem-Solving Logic:

  1. Atomic Claim with Visibility Timeout: Uses UPDATE…RETURNING pattern to atomically:

    • Check if message is available (vt <= now())
    • Set new visibility timeout preventing other workers from claiming
    • Increment read count for monitoring retry attempts
    • Return message data only if successfully claimed
  2. Race Condition Prevention: The WHERE vt <= now() clause ensures only one worker can claim a message. If two workers try simultaneously, only one UPDATE succeeds.

  3. Graceful Failure Handling: Returns empty result set if message is:

    • Already claimed by another worker (vt > now())
    • Non-existent (deleted or never existed)
    • Archived (moved to archive table)
  4. Security: Validates queue name to prevent SQL injection in dynamic query construction.

Real-World Impact: Eliminates “message not found” errors when workers are notified about specific messages but can’t retrieve them due to random selection in standard read.
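
A hedged sketch of how a worker might call this function directly through sqlx after being notified about a specific msg_id; the query shape is an assumption, not the worker crate's actual API:

#![allow(unused)]
fn main() {
async fn claim_specific_message(
    pool: &sqlx::PgPool,
    queue: &str,
    msg_id: i64,
) -> Result<Option<serde_json::Value>, sqlx::Error> {
    // An empty result means another worker already holds the visibility
    // timeout, or the message was deleted/archived.
    let message: Option<Option<serde_json::Value>> = sqlx::query_scalar(
        "SELECT message FROM pgmq_read_specific_message($1, $2, $3)",
    )
    .bind(queue)
    .bind(msg_id)
    .bind(30_i32) // visibility timeout in seconds
    .fetch_optional(pool)
    .await?;

    Ok(message.flatten())
}
}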

Task State Ownership and Atomic Transitions

Problem: Concurrent Orchestrators Processing Same Task

In distributed deployments, multiple orchestrator instances might try to process the same task simultaneously, leading to duplicate work, inconsistent state, and race conditions.

Solution: transition_task_state_atomic()

CREATE OR REPLACE FUNCTION transition_task_state_atomic(
    p_task_uuid UUID,
    p_from_state VARCHAR,
    p_to_state VARCHAR,
    p_processor_uuid UUID,
    p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN AS $$

Key Problem-Solving Logic:

  1. Compare-and-Swap Pattern:

    • Reads current state with FOR UPDATE lock
    • Only transitions if current state matches expected from_state
    • Returns false if state has changed, allowing caller to retry with fresh state
  2. Processor Ownership Enforcement:

    CASE
        WHEN cs.to_state IN ('initializing', 'enqueuing_steps',
                            'steps_in_process', 'evaluating_results')
        THEN cs.processor_uuid = p_processor_uuid OR cs.processor_uuid IS NULL
        ELSE true
    END
    
    • Active processing states require ownership match
    • Allows claiming unowned tasks (NULL processor_uuid)
    • Terminal states (complete, error) don’t require ownership
  3. Audit Trail Preservation:

    • Updates previous transition’s most_recent = false
    • Inserts new transition with most_recent = true
    • Maintains complete history with sort_key ordering
  4. Atomic Success/Failure: Returns boolean indicating whether transition succeeded, enabling callers to handle contention gracefully.

Real-World Impact: Enables safe distributed orchestration where multiple instances can operate without conflicts, automatically distributing work through ownership claiming.
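
A sketch of the caller side, assuming the function is invoked directly through sqlx; when the function returns false, the orchestrator re-reads the current state before deciding whether to retry or yield:

#![allow(unused)]
fn main() {
async fn try_claim_for_initializing(
    pool: &sqlx::PgPool,
    task_uuid: uuid::Uuid,
    processor_uuid: uuid::Uuid,
) -> Result<bool, sqlx::Error> {
    // Compare-and-swap: succeeds only if the task is still in 'pending'.
    let transitioned: bool = sqlx::query_scalar(
        "SELECT transition_task_state_atomic($1, $2, $3, $4, '{}'::jsonb)",
    )
    .bind(task_uuid)
    .bind("pending")
    .bind("initializing")
    .bind(processor_uuid)
    .fetch_one(pool)
    .await?;

    // false => another orchestrator won the race for this task.
    Ok(transitioned)
}
}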

Batch Task Discovery with Priority

Problem: Efficient Work Distribution Across Orchestrators

Orchestrators need to discover ready tasks efficiently without creating hotspots or missing tasks, while respecting priority and avoiding claimed tasks.

Solution: get_next_ready_tasks()

CREATE OR REPLACE FUNCTION get_next_ready_tasks(p_limit INTEGER DEFAULT 5)
RETURNS TABLE(
    task_uuid UUID,
    task_name TEXT,
    priority INTEGER,
    namespace_name TEXT,
    ready_steps_count BIGINT,
    computed_priority NUMERIC,
    current_state VARCHAR
)

Key Problem-Solving Logic:

  1. Ready Step Discovery:

    WITH ready_steps AS (
        SELECT task_uuid, COUNT(*) as ready_count
        FROM workflow_steps
        WHERE current_state IN ('pending', 'error')
        AND [dependency checks]
        GROUP BY task_uuid
    )
    
    • Pre-aggregates ready steps per task for efficiency
    • Considers both new steps and retryable errors
  2. State-Based Filtering:

    • Only returns tasks in states that need processing
    • Excludes terminal states (complete, cancelled)
    • Includes waiting states that might have become ready
  3. Priority Computation:

    computed_priority = base_priority +
                       (age_factor * hours_waiting) +
                       (retry_factor * retry_count)
    
    • Dynamic priority based on age and retry attempts
    • Prevents task starvation through age escalation
  4. Batch Efficiency:

    • Returns multiple tasks in single query
    • Reduces database round trips
    • Enables parallel processing across orchestrators

Real-World Impact: Enables efficient work distribution where each orchestrator can claim a batch of tasks, reducing contention and improving throughput.
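
Putting discovery and claiming together, a simplified orchestration cycle might look like the sketch below (assuming the executor method returns a Vec<ReadyTaskInfo>; the claim step is the atomic transition shown earlier):

#![allow(unused)]
fn main() {
async fn discovery_cycle(executor: &SqlFunctionExecutor) -> Result<(), sqlx::Error> {
    // One round trip returns up to 50 ready tasks, ordered by computed priority.
    let ready_tasks = executor.get_next_ready_tasks(50).await?;

    for task in ready_tasks {
        // Claiming via transition_task_state_atomic ensures only one
        // orchestrator instance processes each candidate.
        println!(
            "candidate task {} ({} ready steps, priority {:?})",
            task.task_uuid, task.ready_steps_count, task.computed_priority
        );
    }
    Ok(())
}
}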

Complex Dependency Resolution

Problem: Determining Step Execution Readiness

Workflow steps have complex dependencies involving parent completion, retry logic, backoff timing, and state validation. Determining which steps are ready for execution requires sophisticated dependency analysis that must handle:

  • Multiple parent dependencies with conditional logic
  • Exponential backoff after failures
  • Retry limits and attempt tracking
  • State consistency across distributed workers

Solution: get_step_readiness_status()

CREATE OR REPLACE FUNCTION get_step_readiness_status(
    input_task_uuid UUID,
    step_uuids UUID[] DEFAULT NULL
) RETURNS TABLE(
    workflow_step_uuid UUID,
    task_uuid UUID,
    named_step_uuid UUID,
    name VARCHAR,
    current_state VARCHAR,
    dependencies_satisfied BOOLEAN,
    retry_eligible BOOLEAN,
    ready_for_execution BOOLEAN,
    -- ... additional metadata
)

Key Problem-Solving Logic:

  1. Dependency Satisfaction Analysis:

    WITH parent_completion AS (
        SELECT
            edge.to_step_uuid,
            COUNT(*) as total_parents,
            COUNT(CASE WHEN parent.current_state = 'complete' THEN 1 END) as completed_parents
        FROM workflow_step_edges edge
        JOIN workflow_steps parent ON parent.workflow_step_uuid = edge.from_step_uuid
        WHERE parent.task_uuid = input_task_uuid
        GROUP BY edge.to_step_uuid
    )
    
    • Counts total vs. completed parent dependencies
    • Handles conditional dependencies based on parent results
    • Supports complex DAG structures with multiple paths
  2. Retry Eligibility Assessment:

    retry_eligible = (
        current_state = 'error' AND
        attempts < retry_limit AND
        (last_attempted_at IS NULL OR
         last_attempted_at + backoff_interval <= NOW())
    )
    
    • Enforces retry limits to prevent infinite loops
    • Calculates exponential backoff: 2^attempts seconds (max 30)
    • Respects custom backoff periods from step configuration
    • Considers temporal constraints for retry timing
  3. State Validation:

    ready_for_execution = (
        current_state IN ('pending', 'error') AND
        dependencies_satisfied AND
        retry_eligible
    )
    
    • Only pending or retryable error steps can execute
    • Requires all dependencies satisfied
    • Must pass retry eligibility checks
    • Prevents execution of steps in terminal states
  4. Backoff Calculation:

    next_retry_at = CASE
        WHEN current_state = 'error' AND attempts > 0
        THEN last_attempted_at + INTERVAL '1 second' *
             COALESCE(backoff_request_seconds, LEAST(POW(2, attempts), 30))
        ELSE NULL
    END
    
    • Custom backoff from step configuration takes precedence
    • Default exponential backoff with maximum cap
    • Temporal precision for scheduling retry attempts

Real-World Impact: Enables complex workflow orchestration with sophisticated dependency management, retry logic, and backoff handling, supporting enterprise-grade reliability patterns while maintaining high performance through set-based operations.

Integration with Event and State Systems

PostgreSQL LISTEN/NOTIFY Integration

The SQL functions integrate with the event-driven architecture through PostgreSQL notifications:

PGMQ Wrapper Functions for Atomic Operations

The system uses wrapper functions that combine PGMQ message sending with PostgreSQL notifications atomically:

-- Atomic wrapper that sends message AND notification
CREATE OR REPLACE FUNCTION pgmq_send_with_notify(
    queue_name TEXT,
    message JSONB,
    delay_seconds INTEGER DEFAULT 0
) RETURNS BIGINT AS $$
DECLARE
    msg_id BIGINT;
    namespace_name TEXT;
    event_payload TEXT;
    namespace_channel TEXT;
    global_channel TEXT := 'pgmq_message_ready';
BEGIN
    -- Send message using PGMQ's native function
    SELECT pgmq.send(queue_name, message, delay_seconds) INTO msg_id;

    -- Extract namespace from queue name using robust helper
    namespace_name := extract_queue_namespace(queue_name);

    -- Build namespace-specific channel name
    namespace_channel := 'pgmq_message_ready.' || namespace_name;

    -- Build event payload
    event_payload := json_build_object(
        'event_type', 'message_ready',
        'msg_id', msg_id,
        'queue_name', queue_name,
        'namespace', namespace_name,
        'ready_at', NOW()::timestamptz,
        'delay_seconds', delay_seconds
    )::text;

    -- Send notifications in same transaction
    PERFORM pg_notify(namespace_channel, event_payload);

    -- Also send to global channel if different
    IF namespace_channel != global_channel THEN
        PERFORM pg_notify(global_channel, event_payload);
    END IF;

    RETURN msg_id;
END;
$$ LANGUAGE plpgsql;
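
On the consuming side, a worker or orchestrator can subscribe to these channels with sqlx's PgListener; the channel name below follows the namespace convention used by the wrapper:

#![allow(unused)]
fn main() {
use sqlx::postgres::PgListener;

async fn listen_for_ready_messages(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect_with(pool).await?;

    // Namespace-specific channel emitted by pgmq_send_with_notify.
    listener.listen("pgmq_message_ready.orchestration").await?;

    loop {
        let notification = listener.recv().await?;
        // Payload is the JSON object built in pgmq_send_with_notify
        // (event_type, msg_id, queue_name, namespace, ready_at, delay_seconds).
        println!("message ready: {}", notification.payload());
    }
}
}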

Namespace Extraction Helper

-- Robust namespace extraction helper function
CREATE OR REPLACE FUNCTION extract_queue_namespace(queue_name TEXT)
RETURNS TEXT AS $$
BEGIN
    -- Handle orchestration queues
    IF queue_name ~ '^orchestration' THEN
        RETURN 'orchestration';
    END IF;

    -- Handle worker queues: worker_namespace_queue -> namespace
    IF queue_name ~ '^worker_.*_queue$' THEN
        RETURN COALESCE(
            (regexp_match(queue_name, '^worker_(.+?)_queue$'))[1],
            'worker'
        );
    END IF;

    -- Handle standard namespace_queue pattern
    IF queue_name ~ '^[a-zA-Z][a-zA-Z0-9_]*_queue$' THEN
        RETURN COALESCE(
            (regexp_match(queue_name, '^([a-zA-Z][a-zA-Z0-9_]*)_queue$'))[1],
            'default'
        );
    END IF;

    -- Fallback for any other pattern
    RETURN 'default';
END;
$$ LANGUAGE plpgsql;
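
A few illustrative mappings, exercised here through a direct query; the queue names are examples only:

#![allow(unused)]
fn main() {
async fn namespace_examples(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    // Expected results given the patterns above:
    //   "orchestration_task_requests" -> "orchestration"
    //   "worker_payments_queue"       -> "payments"
    //   "fulfillment_queue"           -> "fulfillment"
    //   "not-a-queue"                 -> "default"
    let namespace: String = sqlx::query_scalar("SELECT extract_queue_namespace($1)")
        .bind("worker_payments_queue")
        .fetch_one(pool)
        .await?;
    assert_eq!(namespace, "payments");
    Ok(())
}
}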

Fallback Polling for Task Readiness

Instead of database triggers for task readiness notifications, the system uses a fallback polling mechanism to ensure no ready tasks are missed:

FallbackPoller Configuration:

  • Default polling interval: 30 seconds
  • Runs StepEnqueuerService::process_batch() periodically
  • Catches tasks that may have been missed by primary PGMQ notification system
  • Configurable enable/disable via TOML configuration

Key Benefits:

  • Resilience: Ensures no tasks are permanently stuck if notifications fail
  • Simplicity: No complex database triggers or state tracking required
  • Observability: Clear metrics on fallback discovery vs. event-driven discovery
  • Safety Net: Primary event-driven system + fallback polling provides redundancy
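
A sketch of the fallback loop's shape; the process_batch closure stands in for StepEnqueuerService::process_batch(), and the signature is an assumption:

#![allow(unused)]
fn main() {
use std::time::Duration;

async fn run_fallback_poller<F, Fut>(mut process_batch: F, interval_secs: u64)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), sqlx::Error>>,
{
    let mut ticker = tokio::time::interval(Duration::from_secs(interval_secs));
    loop {
        ticker.tick().await;
        // Safety net: catches any ready work the event-driven path missed.
        if let Err(e) = process_batch().await {
            eprintln!("fallback poll failed: {e}");
        }
    }
}
}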

PGMQ Message Queue Integration

SQL functions coordinate with PGMQ for reliable message processing:

Queue Management Functions

-- Ensure queue exists with proper configuration
CREATE OR REPLACE FUNCTION ensure_task_queue(queue_name VARCHAR)
RETURNS BOOLEAN AS $$
BEGIN
    -- Create queue if it doesn't exist
    PERFORM pgmq.create_queue(queue_name);

    -- Ensure headers column exists (pgmq-rs compatibility)
    PERFORM pgmq_ensure_headers_column(queue_name);

    RETURN TRUE;
END;
$$ LANGUAGE plpgsql;

Message Processing Support

-- Get queue statistics for monitoring
CREATE OR REPLACE FUNCTION get_queue_statistics(queue_name VARCHAR)
RETURNS TABLE(
    queue_name VARCHAR,
    queue_length BIGINT,
    oldest_msg_age_seconds INTEGER,
    newest_msg_age_seconds INTEGER
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        queue_name,
        pgmq.queue_length(queue_name),
        EXTRACT(EPOCH FROM (NOW() - MIN(enqueued_at)))::INTEGER,
        EXTRACT(EPOCH FROM (NOW() - MAX(enqueued_at)))::INTEGER
    FROM pgmq.messages(queue_name);
END;
$$ LANGUAGE plpgsql;

Transaction Safety

All SQL functions are designed with transaction safety in mind:

Atomic Operations:

  • State transitions use row-level locking (FOR UPDATE)
  • Compare-and-swap patterns prevent race conditions
  • Rollback safety for partial failures

Consistency Guarantees:

  • Foreign key constraints maintained across all operations
  • Check constraints validate state transitions
  • Audit trails preserved for debugging and compliance

Performance Optimization:

  • Efficient indexes for common query patterns
  • Materialized views for expensive analytics queries
  • Connection pooling for high concurrency

Usage Patterns and Best Practices

Rust Integration Patterns

The SqlFunctionExecutor provides type-safe access to all SQL functions:

#![allow(unused)]
fn main() {
use tasker_shared::database::sql_functions::{SqlFunctionExecutor, FunctionRegistry};

// Direct executor usage
let executor = SqlFunctionExecutor::new(pool);
let ready_steps = executor.get_ready_steps(task_uuid).await?;

// Registry pattern for organized access
let registry = FunctionRegistry::new(pool);
let analytics = registry.analytics().get_analytics_metrics(None).await?;
let health = registry.system_health().get_system_health_counts().await?;
}

Batch Processing Optimization

For high-throughput scenarios, the system supports efficient batch operations:

#![allow(unused)]
fn main() {
// Batch step readiness analysis
let task_uuids = vec![task1_uuid, task2_uuid, task3_uuid];
let batch_readiness = executor.get_step_readiness_status_batch(task_uuids).await?;

// Batch task discovery
let ready_tasks = executor.get_next_ready_tasks(50).await?;
}

Error Handling Best Practices

SQL function errors are properly propagated through the type system:

#![allow(unused)]
fn main() {
match executor.get_current_task_state(task_uuid).await {
    Ok(state) => {
        // Process state
    }
    Err(sqlx::Error::RowNotFound) => {
        // Handle missing task
    }
    Err(e) => {
        // Handle other database errors
    }
}
}

Tasker Configuration Documentation Index

Coverage: 246/246 parameters documented (100%)


Common Configuration

  • backoff (common.backoff) — 5 params (5 documented)
  • cache (common.cache) — 10 params (10 documented)
  • moka (common.cache.moka) — 1 params
  • redis (common.cache.redis) — 4 params
  • circuit_breakers (common.circuit_breakers) — 13 params (13 documented)
  • component_configs (common.circuit_breakers.component_configs) — 8 params
  • default_config (common.circuit_breakers.default_config) — 3 params
  • global_settings (common.circuit_breakers.global_settings) — 2 params
  • database (common.database) — 7 params (7 documented)
  • pool (common.database.pool) — 6 params
  • execution (common.execution) — 2 params (2 documented)
  • mpsc_channels (common.mpsc_channels) — 4 params (4 documented)
  • event_publisher (common.mpsc_channels.event_publisher) — 1 params
  • ffi (common.mpsc_channels.ffi) — 1 params
  • overflow_policy (common.mpsc_channels.overflow_policy) — 2 params
  • pgmq_database (common.pgmq_database) — 8 params (8 documented)
  • pool (common.pgmq_database.pool) — 6 params
  • queues (common.queues) — 14 params (14 documented)
  • orchestration_queues (common.queues.orchestration_queues) — 3 params
  • pgmq (common.queues.pgmq) — 3 params
  • rabbitmq (common.queues.rabbitmq) — 3 params
  • system (common.system) — 1 params (1 documented)
  • task_templates (common.task_templates) — 1 params (1 documented)

Orchestration Configuration

  • orchestration (orchestration) — 2 params (2 documented)
  • batch_processing (orchestration.batch_processing) — 4 params (4 documented)
  • decision_points (orchestration.decision_points) — 7 params (7 documented)
  • dlq (orchestration.dlq) — 13 params (13 documented)
  • staleness_detection (orchestration.dlq.staleness_detection) — 12 params
  • event_systems (orchestration.event_systems) — 36 params (36 documented)
  • orchestration (orchestration.event_systems.orchestration) — 18 params
  • task_readiness (orchestration.event_systems.task_readiness) — 18 params
  • grpc (orchestration.grpc) — 9 params (9 documented)
  • mpsc_channels (orchestration.mpsc_channels) — 3 params (3 documented)
  • command_processor (orchestration.mpsc_channels.command_processor) — 1 params
  • event_listeners (orchestration.mpsc_channels.event_listeners) — 1 params
  • event_systems (orchestration.mpsc_channels.event_systems) — 1 params
  • web (orchestration.web) — 17 params (17 documented)
  • auth (orchestration.web.auth) — 9 params
  • database_pools (orchestration.web.database_pools) — 5 params

Worker Configuration

  • worker (worker) — 2 params (2 documented)
  • circuit_breakers (worker.circuit_breakers) — 4 params (4 documented)
  • ffi_completion_send (worker.circuit_breakers.ffi_completion_send) — 4 params
  • event_systems (worker.event_systems) — 32 params (32 documented)
  • worker (worker.event_systems.worker) — 32 params
  • grpc (worker.grpc) — 9 params (9 documented)
  • mpsc_channels (worker.mpsc_channels) — 23 params (23 documented)
  • command_processor (worker.mpsc_channels.command_processor) — 1 params
  • domain_events (worker.mpsc_channels.domain_events) — 3 params
  • event_listeners (worker.mpsc_channels.event_listeners) — 1 params
  • event_subscribers (worker.mpsc_channels.event_subscribers) — 2 params
  • event_systems (worker.mpsc_channels.event_systems) — 1 params
  • ffi_dispatch (worker.mpsc_channels.ffi_dispatch) — 5 params
  • handler_dispatch (worker.mpsc_channels.handler_dispatch) — 7 params
  • in_process_events (worker.mpsc_channels.in_process_events) — 3 params
  • orchestration_client (worker.orchestration_client) — 3 params (3 documented)
  • web (worker.web) — 17 params (17 documented)
  • auth (worker.web.auth) — 9 params
  • database_pools (worker.web.database_pools) — 5 params

Generated by tasker-ctl docs

Configuration Reference: common

65/65 parameters documented


backoff

Path: common.backoff

Parameter | Type | Default | Description
backoff_multiplier | f64 | 2.0 | Multiplier applied to the previous delay for exponential backoff calculations
default_backoff_seconds | Vec<u32> | [1, 5, 15, 30, 60] | Sequence of backoff delays in seconds for successive retry attempts
jitter_enabled | bool | true | Add random jitter to backoff delays to prevent thundering herd on retry
jitter_max_percentage | f64 | 0.15 | Maximum jitter as a fraction of the computed backoff delay
max_backoff_seconds | u32 | 3600 | Hard upper limit on any single backoff delay

common.backoff.backoff_multiplier

Multiplier applied to the previous delay for exponential backoff calculations

  • Type: f64
  • Default: 2.0
  • Valid Range: 1.0-10.0
  • System Impact: Controls how aggressively delays grow; 2.0 means each delay is double the previous

common.backoff.default_backoff_seconds

Sequence of backoff delays in seconds for successive retry attempts

  • Type: Vec<u32>
  • Default: [1, 5, 15, 30, 60]
  • Valid Range: non-empty array of positive integers
  • System Impact: Defines the retry cadence; after exhausting the array, the last value is reused up to max_backoff_seconds

common.backoff.jitter_enabled

Add random jitter to backoff delays to prevent thundering herd on retry

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, backoff delays are randomized within jitter_max_percentage to spread retries across time

common.backoff.jitter_max_percentage

Maximum jitter as a fraction of the computed backoff delay

  • Type: f64
  • Default: 0.15
  • Valid Range: 0.0-1.0
  • System Impact: A value of 0.15 means delays vary by up to +/-15% of the base delay

common.backoff.max_backoff_seconds

Hard upper limit on any single backoff delay

  • Type: u32
  • Default: 3600
  • Valid Range: 1-3600
  • System Impact: Caps exponential backoff growth to prevent excessively long delays between retries
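
To illustrate how these settings combine (a sketch of the documented behavior, not the engine's implementation): the attempt index selects a delay from default_backoff_seconds, the last value is reused once the array is exhausted, the result is capped by max_backoff_seconds, and jitter perturbs it by up to jitter_max_percentage. backoff_multiplier governs multiplicative growth when delays are computed exponentially and is omitted here:

#![allow(unused)]
fn main() {
fn backoff_delay_seconds(attempt: usize) -> f64 {
    let default_backoff_seconds = [1u32, 5, 15, 30, 60];
    let max_backoff_seconds = 3600u32;
    let jitter_enabled = true;
    let jitter_max_percentage = 0.15;

    // Use the configured sequence, reusing the last value once exhausted.
    let base = *default_backoff_seconds
        .get(attempt)
        .unwrap_or(default_backoff_seconds.last().unwrap());
    let capped = base.min(max_backoff_seconds) as f64;

    if jitter_enabled {
        // In the real system the jitter fraction is random within
        // [-jitter_max_percentage, +jitter_max_percentage]; fixed here to
        // keep the sketch dependency-free.
        let jitter_fraction = jitter_max_percentage / 2.0;
        capped * (1.0 + jitter_fraction)
    } else {
        capped
    }
}
}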

cache

Path: common.cache

Parameter | Type | Default | Description
analytics_ttl_seconds | u32 | 60 | Time-to-live in seconds for cached analytics and metrics data
backend | String | "redis" | Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)
default_ttl_seconds | u32 | 3600 | Default time-to-live in seconds for cached entries
enabled | bool | false | Enable the distributed cache layer for template and analytics data
template_ttl_seconds | u32 | 3600 | Time-to-live in seconds for cached task template definitions

common.cache.analytics_ttl_seconds

Time-to-live in seconds for cached analytics and metrics data

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Analytics data is write-heavy and changes frequently; short TTL (60s) keeps metrics current

common.cache.backend

Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)

  • Type: String
  • Default: "redis"
  • Valid Range: redis | moka
  • System Impact: Redis is required for multi-instance deployments to avoid stale data; moka is suitable for single-instance or DoS protection

common.cache.default_ttl_seconds

Default time-to-live in seconds for cached entries

  • Type: u32
  • Default: 3600
  • Valid Range: 1-86400
  • System Impact: Controls how long cached data remains valid before being re-fetched from the database

common.cache.enabled

Enable the distributed cache layer for template and analytics data

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all cache reads fall through to direct database queries; no cache dependency required

common.cache.template_ttl_seconds

Time-to-live in seconds for cached task template definitions

  • Type: u32
  • Default: 3600
  • Valid Range: 1-86400
  • System Impact: Template changes take up to this long to propagate; shorter values increase DB load, longer values improve performance

moka

Path: common.cache.moka

Parameter | Type | Default | Description
max_capacity | u64 | 10000 | Maximum number of entries the in-process Moka cache can hold

common.cache.moka.max_capacity

Maximum number of entries the in-process Moka cache can hold

  • Type: u64
  • Default: 10000
  • Valid Range: 1-1000000
  • System Impact: Bounds memory usage; least-recently-used entries are evicted when capacity is reached

redis

Path: common.cache.redis

Parameter | Type | Default | Description
connection_timeout_seconds | u32 | 5 | Maximum time to wait when establishing a new Redis connection
database | u32 | 0 | Redis database number (0-15)
max_connections | u32 | 10 | Maximum number of connections in the Redis connection pool
url | String | "${REDIS_URL:-redis://localhost:6379}" | Redis connection URL

common.cache.redis.connection_timeout_seconds

Maximum time to wait when establishing a new Redis connection

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Connections that cannot be established within this timeout fail; cache falls back to database

common.cache.redis.database

Redis database number (0-15)

  • Type: u32
  • Default: 0
  • Valid Range: 0-15
  • System Impact: Isolates Tasker cache keys from other applications sharing the same Redis instance

common.cache.redis.max_connections

Maximum number of connections in the Redis connection pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-500
  • System Impact: Bounds concurrent Redis operations; increase for high cache throughput workloads

common.cache.redis.url

Redis connection URL

  • Type: String
  • Default: "${REDIS_URL:-redis://localhost:6379}"
  • Valid Range: valid Redis URI
  • System Impact: Must be reachable when cache is enabled with redis backend

circuit_breakers

Path: common.circuit_breakers

component_configs

Path: common.circuit_breakers.component_configs

cache

Path: common.circuit_breakers.component_configs.cache

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the cache circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the cache breaker

common.circuit_breakers.component_configs.cache.failure_threshold

Failures before the cache circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects Redis/Dragonfly operations; when tripped, cache reads fall through to database

common.circuit_breakers.component_configs.cache.success_threshold

Successes in Half-Open required to close the cache breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Low threshold (2) for fast recovery since cache failures gracefully degrade to database

messaging

Path: common.circuit_breakers.component_configs.messaging

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the messaging circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the messaging breaker

common.circuit_breakers.component_configs.messaging.failure_threshold

Failures before the messaging circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects the messaging layer (PGMQ or RabbitMQ); when tripped, queue send/receive operations are short-circuited

common.circuit_breakers.component_configs.messaging.success_threshold

Successes in Half-Open required to close the messaging breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Lower threshold (2) allows faster recovery since messaging failures are typically transient

task_readiness

Path: common.circuit_breakers.component_configs.task_readiness

Parameter | Type | Default | Description
failure_threshold | u32 | 10 | Failures before the task readiness circuit breaker trips to Open
success_threshold | u32 | 3 | Successes in Half-Open required to close the task readiness breaker

common.circuit_breakers.component_configs.task_readiness.failure_threshold

Failures before the task readiness circuit breaker trips to Open

  • Type: u32
  • Default: 10
  • Valid Range: 1-100
  • System Impact: Higher than default (10 vs 5) because task readiness queries are frequent and transient failures are expected

common.circuit_breakers.component_configs.task_readiness.success_threshold

Successes in Half-Open required to close the task readiness breaker

  • Type: u32
  • Default: 3
  • Valid Range: 1-100
  • System Impact: Slightly higher than default (3) for extra confidence before resuming readiness queries

web

Path: common.circuit_breakers.component_configs.web

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the web/API database circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the web database breaker

common.circuit_breakers.component_configs.web.failure_threshold

Failures before the web/API database circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects API database operations; when tripped, API requests receive fast 503 errors instead of waiting for timeouts

common.circuit_breakers.component_configs.web.success_threshold

Successes in Half-Open required to close the web database breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Standard threshold (2) provides confidence in recovery before restoring full API traffic

default_config

Path: common.circuit_breakers.default_config

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Number of consecutive failures before a circuit breaker trips to the Open state
success_threshold | u32 | 2 | Number of consecutive successes in Half-Open state required to close the circuit breaker
timeout_seconds | u32 | 30 | Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests

common.circuit_breakers.default_config.failure_threshold

Number of consecutive failures before a circuit breaker trips to the Open state

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Lower values make the breaker more sensitive; higher values tolerate more transient failures before tripping

common.circuit_breakers.default_config.success_threshold

Number of consecutive successes in Half-Open state required to close the circuit breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Higher values require more proof of recovery before restoring full traffic

common.circuit_breakers.default_config.timeout_seconds

Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: Controls recovery speed; shorter timeouts attempt recovery sooner but risk repeated failures
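
These three values map onto a conventional breaker state machine; the following is a minimal sketch of those semantics, not Tasker's internal implementation:

#![allow(unused)]
fn main() {
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum BreakerState {
    Closed { consecutive_failures: u32 },
    Open { since: Instant },
    HalfOpen { consecutive_successes: u32 },
}

struct Breaker {
    state: BreakerState,
    failure_threshold: u32, // default 5
    success_threshold: u32, // default 2
    timeout: Duration,      // default 30s (timeout_seconds)
}

impl Breaker {
    fn on_failure(&mut self) {
        self.state = match self.state {
            BreakerState::Closed { consecutive_failures } => {
                let failures = consecutive_failures + 1;
                if failures >= self.failure_threshold {
                    BreakerState::Open { since: Instant::now() }
                } else {
                    BreakerState::Closed { consecutive_failures: failures }
                }
            }
            // Any failure while probing re-opens the breaker.
            BreakerState::HalfOpen { .. } => BreakerState::Open { since: Instant::now() },
            BreakerState::Open { since } => BreakerState::Open { since },
        };
    }

    fn on_success(&mut self) {
        if let BreakerState::HalfOpen { consecutive_successes } = self.state {
            let successes = consecutive_successes + 1;
            self.state = if successes >= self.success_threshold {
                BreakerState::Closed { consecutive_failures: 0 }
            } else {
                BreakerState::HalfOpen { consecutive_successes: successes }
            };
        }
    }

    fn allow_request(&mut self) -> bool {
        match self.state {
            BreakerState::Closed { .. } | BreakerState::HalfOpen { .. } => true,
            BreakerState::Open { since } => {
                if since.elapsed() >= self.timeout {
                    // timeout_seconds elapsed: allow probe traffic in Half-Open.
                    self.state = BreakerState::HalfOpen { consecutive_successes: 0 };
                    true
                } else {
                    false
                }
            }
        }
    }
}
}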

global_settings

Path: common.circuit_breakers.global_settings

Parameter | Type | Default | Description
metrics_collection_interval_seconds | u32 | 30 | Interval in seconds between circuit breaker metrics collection sweeps
min_state_transition_interval_seconds | f64 | 5.0 | Minimum time in seconds between circuit breaker state transitions

common.circuit_breakers.global_settings.metrics_collection_interval_seconds

Interval in seconds between circuit breaker metrics collection sweeps

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently circuit breaker state, failure counts, and transition counts are collected for observability

common.circuit_breakers.global_settings.min_state_transition_interval_seconds

Minimum time in seconds between circuit breaker state transitions

  • Type: f64
  • Default: 5.0
  • Valid Range: 0.0-60.0
  • System Impact: Prevents rapid oscillation between Open and Closed states during intermittent failures

database

Path: common.database

Parameter | Type | Default | Description
url | String | "${DATABASE_URL:-postgresql://localhost/tasker}" | PostgreSQL connection URL for the primary database

common.database.url

PostgreSQL connection URL for the primary database

  • Type: String
  • Default: "${DATABASE_URL:-postgresql://localhost/tasker}"
  • Valid Range: valid PostgreSQL connection URI
  • System Impact: All task, step, and workflow state is stored here; must be reachable at startup

Environment Recommendations:

Environment | Value | Rationale
development | postgresql://localhost/tasker | Local default, no auth
production | ${DATABASE_URL} | Always use env var injection for secrets rotation
test | postgresql://tasker:tasker@localhost:5432/tasker_rust_test | Isolated test database with known credentials

Related: common.database.pool.max_connections, common.pgmq_database.url

pool

Path: common.database.pool

Parameter | Type | Default | Description
acquire_timeout_seconds | u32 | 10 | Maximum time to wait when acquiring a connection from the pool
idle_timeout_seconds | u32 | 300 | Time before an idle connection is closed and removed from the pool
max_connections | u32 | 25 | Maximum number of concurrent database connections in the pool
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a connection before it is closed and replaced
min_connections | u32 | 5 | Minimum number of idle connections maintained in the pool
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which connection acquisition is logged as slow

common.database.pool.acquire_timeout_seconds

Maximum time to wait when acquiring a connection from the pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-300
  • System Impact: Queries fail with a timeout error if no connection is available within this window

common.database.pool.idle_timeout_seconds

Time before an idle connection is closed and removed from the pool

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the pool shrinks back to min_connections after load drops

common.database.pool.max_connections

Maximum number of concurrent database connections in the pool

  • Type: u32
  • Default: 25
  • Valid Range: 1-1000
  • System Impact: Controls database connection concurrency; too few causes query queuing under load, too many risks DB resource exhaustion

Environment Recommendations:

Environment | Value | Rationale
development | 10-25 | Small pool for local development
production | 30-50 | Scale based on worker count and concurrent task volume
test | 10-30 | Moderate pool; cluster tests may run 10 services sharing the same DB

Related: common.database.pool.min_connections, common.database.pool.acquire_timeout_seconds

common.database.pool.max_lifetime_seconds

Maximum total lifetime of a connection before it is closed and replaced

  • Type: u32
  • Default: 1800
  • Valid Range: 60-86400
  • System Impact: Prevents connection drift from server-side config changes or memory leaks in long-lived connections

common.database.pool.min_connections

Minimum number of idle connections maintained in the pool

  • Type: u32
  • Default: 5
  • Valid Range: 0-100
  • System Impact: Keeps connections warm to avoid cold-start latency on first queries after idle periods

common.database.pool.slow_acquire_threshold_ms

Threshold in milliseconds above which connection acquisition is logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-60000
  • System Impact: Observability: slow acquire warnings indicate pool pressure or network issues

execution

Path: common.execution

Parameter | Type | Default | Description
environment | String | "development" | Runtime environment identifier used for configuration context selection and logging
step_enqueue_batch_size | u32 | 50 | Number of steps to enqueue in a single batch during task initialization

common.execution.environment

Runtime environment identifier used for configuration context selection and logging

  • Type: String
  • Default: "development"
  • Valid Range: test | development | production
  • System Impact: Affects log levels, default tuning, and environment-specific behavior throughout the system

common.execution.step_enqueue_batch_size

Number of steps to enqueue in a single batch during task initialization

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Controls step enqueueing throughput; larger batches reduce round trips but increase per-batch latency

mpsc_channels

Path: common.mpsc_channels

event_publisher

Path: common.mpsc_channels.event_publisher

Parameter | Type | Default | Description
event_queue_buffer_size | usize | 5000 | Bounded channel capacity for the event publisher MPSC channel

common.mpsc_channels.event_publisher.event_queue_buffer_size

Bounded channel capacity for the event publisher MPSC channel

  • Type: usize
  • Default: 5000
  • Valid Range: 100-100000
  • System Impact: Controls backpressure for domain event publishing; smaller buffers apply backpressure sooner

ffi

Path: common.mpsc_channels.ffi

Parameter | Type | Default | Description
ruby_event_buffer_size | usize | 1000 | Bounded channel capacity for Ruby FFI event delivery

common.mpsc_channels.ffi.ruby_event_buffer_size

Bounded channel capacity for Ruby FFI event delivery

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers events between the Rust runtime and Ruby FFI layer; overflow triggers backpressure on the dispatch side

overflow_policy

Path: common.mpsc_channels.overflow_policy

Parameter | Type | Default | Description
log_warning_threshold | f64 | 0.8 | Channel saturation fraction at which warning logs are emitted

common.mpsc_channels.overflow_policy.log_warning_threshold

Channel saturation fraction at which warning logs are emitted

  • Type: f64
  • Default: 0.8
  • Valid Range: 0.0-1.0
  • System Impact: A value of 0.8 means warnings fire when any channel reaches 80% capacity

metrics

Path: common.mpsc_channels.overflow_policy.metrics

Parameter | Type | Default | Description
saturation_check_interval_seconds | u32 | 30 | Interval in seconds between channel saturation metric samples

common.mpsc_channels.overflow_policy.metrics.saturation_check_interval_seconds

Interval in seconds between channel saturation metric samples

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Lower intervals give finer-grained capacity visibility but add sampling overhead

pgmq_database

Path: common.pgmq_database

Parameter | Type | Default | Description
enabled | bool | true | Enable PGMQ messaging subsystem
url | String | "${PGMQ_DATABASE_URL:-}" | PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database

common.pgmq_database.enabled

Enable PGMQ messaging subsystem

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, PGMQ queue operations are disabled; only useful if using RabbitMQ as the sole messaging backend

common.pgmq_database.url

PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database

  • Type: String
  • Default: "${PGMQ_DATABASE_URL:-}"
  • Valid Range: valid PostgreSQL connection URI or empty string
  • System Impact: Separating PGMQ to its own database isolates messaging I/O from task state queries, reducing contention under heavy load

Related: common.database.url, common.pgmq_database.enabled

pool

Path: common.pgmq_database.pool

Parameter | Type | Default | Description
acquire_timeout_seconds | u32 | 5 | Maximum time to wait when acquiring a connection from the PGMQ pool
idle_timeout_seconds | u32 | 300 | Time before an idle PGMQ connection is closed and removed from the pool
max_connections | u32 | 15 | Maximum number of concurrent connections in the PGMQ database pool
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a PGMQ database connection before replacement
min_connections | u32 | 3 | Minimum idle connections maintained in the PGMQ database pool
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which PGMQ pool acquisition is logged as slow

common.pgmq_database.pool.acquire_timeout_seconds

Maximum time to wait when acquiring a connection from the PGMQ pool

  • Type: u32
  • Default: 5
  • Valid Range: 1-300
  • System Impact: Queue operations fail with timeout if no PGMQ connection is available within this window

common.pgmq_database.pool.idle_timeout_seconds

Time before an idle PGMQ connection is closed and removed from the pool

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the PGMQ pool shrinks after messaging load drops

common.pgmq_database.pool.max_connections

Maximum number of concurrent connections in the PGMQ database pool

  • Type: u32
  • Default: 15
  • Valid Range: 1-500
  • System Impact: Separate from the main database pool; size according to messaging throughput requirements

common.pgmq_database.pool.max_lifetime_seconds

Maximum total lifetime of a PGMQ database connection before replacement

  • Type: u32
  • Default: 1800
  • Valid Range: 60-86400
  • System Impact: Prevents connection drift in long-running PGMQ connections

common.pgmq_database.pool.min_connections

Minimum idle connections maintained in the PGMQ database pool

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Keeps PGMQ connections warm to avoid cold-start latency on queue operations

common.pgmq_database.pool.slow_acquire_threshold_ms

Threshold in milliseconds above which PGMQ pool acquisition is logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-60000
  • System Impact: Observability: slow PGMQ acquire warnings indicate messaging pool pressure

queues

Path: common.queues

Parameter | Type | Default | Description
backend | String | "${TASKER_MESSAGING_BACKEND:-pgmq}" | Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)
default_visibility_timeout_seconds | u32 | 30 | Default time a dequeued message remains invisible to other consumers
naming_pattern | String | "{namespace}_{name}_queue" | Template pattern for constructing queue names from namespace and name
orchestration_namespace | String | "orchestration" | Namespace prefix for orchestration queue names
worker_namespace | String | "worker" | Namespace prefix for worker queue names

common.queues.backend

Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)

  • Type: String
  • Default: "${TASKER_MESSAGING_BACKEND:-pgmq}"
  • Valid Range: pgmq | rabbitmq
  • System Impact: Determines the entire message transport layer; pgmq requires only PostgreSQL, rabbitmq requires a separate AMQP broker

Environment Recommendations:

Environment | Value | Rationale
production | pgmq or rabbitmq | pgmq for simplicity, rabbitmq for high-throughput push semantics
test | pgmq | Single-dependency setup, simpler CI

Related: common.queues.pgmq, common.queues.rabbitmq

common.queues.default_visibility_timeout_seconds

Default time a dequeued message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: If a consumer fails to process a message within this window, the message becomes visible again for retry

common.queues.naming_pattern

Template pattern for constructing queue names from namespace and name

  • Type: String
  • Default: "{namespace}_{name}_queue"
  • Valid Range: string containing {namespace} and {name} placeholders
  • System Impact: Determines the actual PGMQ/RabbitMQ queue names; changing this after deployment requires manual queue migration

common.queues.orchestration_namespace

Namespace prefix for orchestration queue names

  • Type: String
  • Default: "orchestration"
  • Valid Range: non-empty string
  • System Impact: Used in queue naming pattern to isolate orchestration queues from worker queues

common.queues.worker_namespace

Namespace prefix for worker queue names

  • Type: String
  • Default: "worker"
  • Valid Range: non-empty string
  • System Impact: Used in queue naming pattern to isolate worker queues from orchestration queues

orchestration_queues

Path: common.queues.orchestration_queues

Parameter | Type | Default | Description
step_results | String | "orchestration_step_results" | Queue name for step execution results returned by workers
task_finalizations | String | "orchestration_task_finalizations" | Queue name for task finalization messages
task_requests | String | "orchestration_task_requests" | Queue name for incoming task execution requests

common.queues.orchestration_queues.step_results

Queue name for step execution results returned by workers

  • Type: String
  • Default: "orchestration_step_results"
  • Valid Range: valid queue name
  • System Impact: Workers publish step completion results here for the orchestration result processor

common.queues.orchestration_queues.task_finalizations

Queue name for task finalization messages

  • Type: String
  • Default: "orchestration_task_finalizations"
  • Valid Range: valid queue name
  • System Impact: Tasks ready for completion evaluation are enqueued here

common.queues.orchestration_queues.task_requests

Queue name for incoming task execution requests

  • Type: String
  • Default: "orchestration_task_requests"
  • Valid Range: valid queue name
  • System Impact: The orchestration system reads new task requests from this queue

pgmq

Path: common.queues.pgmq

Parameter | Type | Default | Description
poll_interval_ms | u32 | 500 | Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive

common.queues.pgmq.poll_interval_ms

Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive

  • Type: u32
  • Default: 500
  • Valid Range: 10-10000
  • System Impact: Lower values reduce message latency in polling mode but increase database load; in Hybrid mode this is the fallback interval

queue_depth_thresholds

Path: common.queues.pgmq.queue_depth_thresholds

Parameter | Type | Default | Description
critical_threshold | i64 | 5000 | Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions
overflow_threshold | i64 | 10000 | Queue depth indicating an emergency condition requiring manual intervention

common.queues.pgmq.queue_depth_thresholds.critical_threshold

Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions

  • Type: i64
  • Default: 5000
  • Valid Range: 1+
  • System Impact: Backpressure mechanism: rejects new work to allow the system to drain existing messages

common.queues.pgmq.queue_depth_thresholds.overflow_threshold

Queue depth indicating an emergency condition requiring manual intervention

  • Type: i64
  • Default: 10000
  • Valid Range: 1+
  • System Impact: Highest severity threshold; triggers error-level logging and metrics for operational alerting

rabbitmq

Path: common.queues.rabbitmq

Parameter | Type | Default | Description
heartbeat_seconds | u16 | 30 | AMQP heartbeat interval for connection liveness detection
prefetch_count | u16 | 100 | Number of unacknowledged messages RabbitMQ will deliver before waiting for acks
url | String | "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}" | AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’

common.queues.rabbitmq.heartbeat_seconds

AMQP heartbeat interval for connection liveness detection

  • Type: u16
  • Default: 30
  • Valid Range: 0-3600
  • System Impact: Detects dead connections; 0 disables heartbeats (not recommended in production)

common.queues.rabbitmq.prefetch_count

Number of unacknowledged messages RabbitMQ will deliver before waiting for acks

  • Type: u16
  • Default: 100
  • Valid Range: 1-65535
  • System Impact: Controls consumer throughput vs. memory usage; higher values increase throughput but buffer more messages in-process

common.queues.rabbitmq.url

AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’

  • Type: String
  • Default: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
  • Valid Range: valid AMQP URI
  • System Impact: Only used when queues.backend = ‘rabbitmq’; must be reachable at startup

system

Path: common.system

Parameter | Type | Default | Description
default_dependent_system | String | "default" | Default system name assigned to tasks that do not specify a dependent system

common.system.default_dependent_system

Default system name assigned to tasks that do not specify a dependent system

  • Type: String
  • Default: "default"
  • Valid Range: non-empty string
  • System Impact: Groups tasks for routing and reporting; most single-system deployments can leave this as default

task_templates

Path: common.task_templates

Parameter | Type | Default | Description
search_paths | Vec<String> | ["config/tasks/**/*.{yml,yaml}"] | Glob patterns for discovering task template YAML files

common.task_templates.search_paths

Glob patterns for discovering task template YAML files

  • Type: Vec<String>
  • Default: ["config/tasks/**/*.{yml,yaml}"]
  • Valid Range: valid glob patterns
  • System Impact: Templates matching these patterns are loaded at startup for task definition discovery

Generated by tasker-ctl docsTasker Configuration System

Configuration Reference: common

65/65 parameters documented


backoff

Path: common.backoff

Parameter | Type | Default | Description
backoff_multiplier | f64 | 2.0 | Multiplier applied to the previous delay for exponential backoff calculations
default_backoff_seconds | Vec<u32> | [1, 5, 15, 30, 60] | Sequence of backoff delays in seconds for successive retry attempts
jitter_enabled | bool | true | Add random jitter to backoff delays to prevent thundering herd on retry
jitter_max_percentage | f64 | 0.15 | Maximum jitter as a fraction of the computed backoff delay
max_backoff_seconds | u32 | 3600 | Hard upper limit on any single backoff delay

common.backoff.backoff_multiplier

Multiplier applied to the previous delay for exponential backoff calculations

  • Type: f64
  • Default: 2.0
  • Valid Range: 1.0-10.0
  • System Impact: Controls how aggressively delays grow; 2.0 means each delay is double the previous

common.backoff.default_backoff_seconds

Sequence of backoff delays in seconds for successive retry attempts

  • Type: Vec<u32>
  • Default: [1, 5, 15, 30, 60]
  • Valid Range: non-empty array of positive integers
  • System Impact: Defines the retry cadence; after exhausting the array, the last value is reused up to max_backoff_seconds

common.backoff.jitter_enabled

Add random jitter to backoff delays to prevent thundering herd on retry

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, backoff delays are randomized within jitter_max_percentage to spread retries across time

common.backoff.jitter_max_percentage

Maximum jitter as a fraction of the computed backoff delay

  • Type: f64
  • Default: 0.15
  • Valid Range: 0.0-1.0
  • System Impact: A value of 0.15 means delays vary by up to +/-15% of the base delay

common.backoff.max_backoff_seconds

Hard upper limit on any single backoff delay

  • Type: u32
  • Default: 3600
  • Valid Range: 1-3600
  • System Impact: Caps exponential backoff growth to prevent excessively long delays between retries
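
To make the interaction of these parameters concrete, here is a minimal sketch (not the engine's actual implementation) of how a retry delay could be derived from default_backoff_seconds, jitter_max_percentage, and max_backoff_seconds. The function name and the caller-supplied jitter_fraction are illustrative assumptions; the real system draws its own random jitter.

```rust
/// Illustrative only: the delay before retry attempt `attempt` (0-based), assuming
/// the configured sequence is reused at its last value, jittered, and then capped.
fn backoff_delay_secs(
    default_backoff_seconds: &[u32], // e.g. [1, 5, 15, 30, 60]
    max_backoff_seconds: u32,        // e.g. 3600
    jitter_max_percentage: f64,      // e.g. 0.15
    jitter_fraction: f64,            // random value in [-1.0, 1.0] supplied by the caller
    attempt: usize,
) -> f64 {
    // Use the attempt's entry, reusing the last value once the sequence is exhausted.
    let idx = attempt.min(default_backoff_seconds.len() - 1);
    let base = default_backoff_seconds[idx].min(max_backoff_seconds) as f64;
    // Vary the delay by up to +/- jitter_max_percentage of the base delay.
    let jittered = base * (1.0 + jitter_max_percentage * jitter_fraction.clamp(-1.0, 1.0));
    jittered.clamp(0.0, max_backoff_seconds as f64)
}

fn main() {
    let seq = [1, 5, 15, 30, 60];
    // Attempt 7 reuses the last entry (60s); +10% jitter yields roughly 66s.
    println!("{:.1}", backoff_delay_secs(&seq, 3600, 0.15, 0.666, 7));
}
```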

cache

Path: common.cache

Parameter | Type | Default | Description
analytics_ttl_seconds | u32 | 60 | Time-to-live in seconds for cached analytics and metrics data
backend | String | "redis" | Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)
default_ttl_seconds | u32 | 3600 | Default time-to-live in seconds for cached entries
enabled | bool | false | Enable the distributed cache layer for template and analytics data
template_ttl_seconds | u32 | 3600 | Time-to-live in seconds for cached task template definitions

common.cache.analytics_ttl_seconds

Time-to-live in seconds for cached analytics and metrics data

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Analytics data is write-heavy and changes frequently; short TTL (60s) keeps metrics current

common.cache.backend

Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)

  • Type: String
  • Default: "redis"
  • Valid Range: redis | moka
  • System Impact: Redis is required for multi-instance deployments to avoid stale data; moka is suitable for single-instance deployments or for DoS protection

common.cache.default_ttl_seconds

Default time-to-live in seconds for cached entries

  • Type: u32
  • Default: 3600
  • Valid Range: 1-86400
  • System Impact: Controls how long cached data remains valid before being re-fetched from the database

common.cache.enabled

Enable the distributed cache layer for template and analytics data

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all cache reads fall through to direct database queries; no cache dependency required

common.cache.template_ttl_seconds

Time-to-live in seconds for cached task template definitions

  • Type: u32
  • Default: 3600
  • Valid Range: 1-86400
  • System Impact: Template changes take up to this long to propagate; shorter values increase DB load, longer values improve performance

moka

Path: common.cache.moka

Parameter | Type | Default | Description
max_capacity | u64 | 10000 | Maximum number of entries the in-process Moka cache can hold

common.cache.moka.max_capacity

Maximum number of entries the in-process Moka cache can hold

  • Type: u64
  • Default: 10000
  • Valid Range: 1-1000000
  • System Impact: Bounds memory usage; least-recently-used entries are evicted when capacity is reached

redis

Path: common.cache.redis

Parameter | Type | Default | Description
connection_timeout_seconds | u32 | 5 | Maximum time to wait when establishing a new Redis connection
database | u32 | 0 | Redis database number (0-15)
max_connections | u32 | 10 | Maximum number of connections in the Redis connection pool
url | String | "${REDIS_URL:-redis://localhost:6379}" | Redis connection URL

common.cache.redis.connection_timeout_seconds

Maximum time to wait when establishing a new Redis connection

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Connections that cannot be established within this timeout fail; cache falls back to database

common.cache.redis.database

Redis database number (0-15)

  • Type: u32
  • Default: 0
  • Valid Range: 0-15
  • System Impact: Isolates Tasker cache keys from other applications sharing the same Redis instance

common.cache.redis.max_connections

Maximum number of connections in the Redis connection pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-500
  • System Impact: Bounds concurrent Redis operations; increase for high cache throughput workloads

common.cache.redis.url

Redis connection URL

  • Type: String
  • Default: "${REDIS_URL:-redis://localhost:6379}"
  • Valid Range: valid Redis URI
  • System Impact: Must be reachable when cache is enabled with redis backend

circuit_breakers

Path: common.circuit_breakers

component_configs

Path: common.circuit_breakers.component_configs

cache

Path: common.circuit_breakers.component_configs.cache

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the cache circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the cache breaker

common.circuit_breakers.component_configs.cache.failure_threshold

Failures before the cache circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects Redis/Dragonfly operations; when tripped, cache reads fall through to database

common.circuit_breakers.component_configs.cache.success_threshold

Successes in Half-Open required to close the cache breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Low threshold (2) for fast recovery since cache failures gracefully degrade to database

messaging

Path: common.circuit_breakers.component_configs.messaging

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the messaging circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the messaging breaker

common.circuit_breakers.component_configs.messaging.failure_threshold

Failures before the messaging circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects the messaging layer (PGMQ or RabbitMQ); when tripped, queue send/receive operations are short-circuited

common.circuit_breakers.component_configs.messaging.success_threshold

Successes in Half-Open required to close the messaging breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Lower threshold (2) allows faster recovery since messaging failures are typically transient

task_readiness

Path: common.circuit_breakers.component_configs.task_readiness

Parameter | Type | Default | Description
failure_threshold | u32 | 10 | Failures before the task readiness circuit breaker trips to Open
success_threshold | u32 | 3 | Successes in Half-Open required to close the task readiness breaker

common.circuit_breakers.component_configs.task_readiness.failure_threshold

Failures before the task readiness circuit breaker trips to Open

  • Type: u32
  • Default: 10
  • Valid Range: 1-100
  • System Impact: Higher than default (10 vs 5) because task readiness queries are frequent and transient failures are expected

common.circuit_breakers.component_configs.task_readiness.success_threshold

Successes in Half-Open required to close the task readiness breaker

  • Type: u32
  • Default: 3
  • Valid Range: 1-100
  • System Impact: Slightly higher than default (3) for extra confidence before resuming readiness queries

web

Path: common.circuit_breakers.component_configs.web

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Failures before the web/API database circuit breaker trips to Open
success_threshold | u32 | 2 | Successes in Half-Open required to close the web database breaker

common.circuit_breakers.component_configs.web.failure_threshold

Failures before the web/API database circuit breaker trips to Open

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects API database operations; when tripped, API requests receive fast 503 errors instead of waiting for timeouts

common.circuit_breakers.component_configs.web.success_threshold

Successes in Half-Open required to close the web database breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Standard threshold (2) provides confidence in recovery before restoring full API traffic

default_config

Path: common.circuit_breakers.default_config

Parameter | Type | Default | Description
failure_threshold | u32 | 5 | Number of consecutive failures before a circuit breaker trips to the Open state
success_threshold | u32 | 2 | Number of consecutive successes in Half-Open state required to close the circuit breaker
timeout_seconds | u32 | 30 | Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests

common.circuit_breakers.default_config.failure_threshold

Number of consecutive failures before a circuit breaker trips to the Open state

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Lower values make the breaker more sensitive; higher values tolerate more transient failures before tripping

common.circuit_breakers.default_config.success_threshold

Number of consecutive successes in Half-Open state required to close the circuit breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Higher values require more proof of recovery before restoring full traffic

common.circuit_breakers.default_config.timeout_seconds

Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: Controls recovery speed; shorter timeouts attempt recovery sooner but risk repeated failures
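
As an illustration of how these three defaults interact, the following is a minimal sketch of a Closed/Open/Half-Open breaker driven by failure_threshold, success_threshold, and timeout_seconds. The type and method names are assumptions for the example; this is not the engine's implementation.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

struct Breaker {
    state: State,
    failures: u32,
    successes: u32,
    failure_threshold: u32, // default 5
    success_threshold: u32, // default 2
    timeout: Duration,      // timeout_seconds, default 30s
}

impl Breaker {
    /// Should the protected call be attempted at all?
    fn allow(&mut self) -> bool {
        if let State::Open { since } = self.state {
            // After timeout_seconds, move to Half-Open and let probe requests through.
            if since.elapsed() >= self.timeout {
                self.state = State::HalfOpen;
            }
        }
        !matches!(self.state, State::Open { .. })
    }

    fn record_success(&mut self) {
        self.failures = 0;
        if self.state == State::HalfOpen {
            self.successes += 1;
            // success_threshold consecutive successes close the breaker again.
            if self.successes >= self.success_threshold {
                self.state = State::Closed;
                self.successes = 0;
            }
        }
    }

    fn record_failure(&mut self) {
        self.successes = 0;
        self.failures += 1;
        // failure_threshold consecutive failures (or any failure while Half-Open) trip it Open.
        if self.state == State::HalfOpen || self.failures >= self.failure_threshold {
            self.state = State::Open { since: Instant::now() };
            self.failures = 0;
        }
    }
}

fn main() {
    let mut breaker = Breaker {
        state: State::Closed,
        failures: 0,
        successes: 0,
        failure_threshold: 5,
        success_threshold: 2,
        timeout: Duration::from_secs(30),
    };
    for _ in 0..5 {
        breaker.record_failure(); // the fifth consecutive failure trips the breaker
    }
    assert!(!breaker.allow()); // calls are short-circuited while Open
}
```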

global_settings

Path: common.circuit_breakers.global_settings

Parameter | Type | Default | Description
metrics_collection_interval_seconds | u32 | 30 | Interval in seconds between circuit breaker metrics collection sweeps
min_state_transition_interval_seconds | f64 | 5.0 | Minimum time in seconds between circuit breaker state transitions

common.circuit_breakers.global_settings.metrics_collection_interval_seconds

Interval in seconds between circuit breaker metrics collection sweeps

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently circuit breaker state, failure counts, and transition counts are collected for observability

common.circuit_breakers.global_settings.min_state_transition_interval_seconds

Minimum time in seconds between circuit breaker state transitions

  • Type: f64
  • Default: 5.0
  • Valid Range: 0.0-60.0
  • System Impact: Prevents rapid oscillation between Open and Closed states during intermittent failures

database

Path: common.database

Parameter | Type | Default | Description
url | String | "${DATABASE_URL:-postgresql://localhost/tasker}" | PostgreSQL connection URL for the primary database

common.database.url

PostgreSQL connection URL for the primary database

  • Type: String
  • Default: "${DATABASE_URL:-postgresql://localhost/tasker}"
  • Valid Range: valid PostgreSQL connection URI
  • System Impact: All task, step, and workflow state is stored here; must be reachable at startup

Environment Recommendations:

Environment | Value | Rationale
development | postgresql://localhost/tasker | Local default, no auth
production | ${DATABASE_URL} | Always use env var injection for secrets rotation
test | postgresql://tasker:tasker@localhost:5432/tasker_rust_test | Isolated test database with known credentials

Related: common.database.pool.max_connections, common.pgmq_database.url
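
The "${DATABASE_URL:-postgresql://localhost/tasker}" default uses shell-style ${VAR:-fallback} substitution. The following sketch illustrates that convention only; it is not Tasker's configuration loader, and the resolve function is an assumed name for the example.

```rust
use std::env;

/// Illustrative resolver for "${VAR:-fallback}" style defaults.
/// Returns the environment variable's value when set and non-empty,
/// otherwise the inline fallback; plain strings pass through unchanged.
fn resolve(template: &str) -> String {
    if let Some(inner) = template.strip_prefix("${").and_then(|s| s.strip_suffix('}')) {
        if let Some((var, fallback)) = inner.split_once(":-") {
            return match env::var(var) {
                Ok(v) if !v.is_empty() => v,
                _ => fallback.to_string(),
            };
        }
    }
    template.to_string()
}

fn main() {
    // Falls back to the inline default unless DATABASE_URL is set in the environment.
    println!("{}", resolve("${DATABASE_URL:-postgresql://localhost/tasker}"));
}
```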

pool

Path: common.database.pool

Parameter | Type | Default | Description
acquire_timeout_seconds | u32 | 10 | Maximum time to wait when acquiring a connection from the pool
idle_timeout_seconds | u32 | 300 | Time before an idle connection is closed and removed from the pool
max_connections | u32 | 25 | Maximum number of concurrent database connections in the pool
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a connection before it is closed and replaced
min_connections | u32 | 5 | Minimum number of idle connections maintained in the pool
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which connection acquisition is logged as slow

common.database.pool.acquire_timeout_seconds

Maximum time to wait when acquiring a connection from the pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-300
  • System Impact: Queries fail with a timeout error if no connection is available within this window

common.database.pool.idle_timeout_seconds

Time before an idle connection is closed and removed from the pool

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the pool shrinks back to min_connections after load drops

common.database.pool.max_connections

Maximum number of concurrent database connections in the pool

  • Type: u32
  • Default: 25
  • Valid Range: 1-1000
  • System Impact: Controls database connection concurrency; too few causes query queuing under load, too many risks DB resource exhaustion

Environment Recommendations:

Environment | Value | Rationale
development | 10-25 | Small pool for local development
production | 30-50 | Scale based on worker count and concurrent task volume
test | 10-30 | Moderate pool; cluster tests may run 10 services sharing the same DB

Related: common.database.pool.min_connections, common.database.pool.acquire_timeout_seconds

common.database.pool.max_lifetime_seconds

Maximum total lifetime of a connection before it is closed and replaced

  • Type: u32
  • Default: 1800
  • Valid Range: 60-86400
  • System Impact: Prevents connection drift from server-side config changes or memory leaks in long-lived connections

common.database.pool.min_connections

Minimum number of idle connections maintained in the pool

  • Type: u32
  • Default: 5
  • Valid Range: 0-100
  • System Impact: Keeps connections warm to avoid cold-start latency on first queries after idle periods

common.database.pool.slow_acquire_threshold_ms

Threshold in milliseconds above which connection acquisition is logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-60000
  • System Impact: Observability: slow acquire warnings indicate pool pressure or network issues

execution

Path: common.execution

Parameter | Type | Default | Description
environment | String | "development" | Runtime environment identifier used for configuration context selection and logging
step_enqueue_batch_size | u32 | 50 | Number of steps to enqueue in a single batch during task initialization

common.execution.environment

Runtime environment identifier used for configuration context selection and logging

  • Type: String
  • Default: "development"
  • Valid Range: test | development | production
  • System Impact: Affects log levels, default tuning, and environment-specific behavior throughout the system

common.execution.step_enqueue_batch_size

Number of steps to enqueue in a single batch during task initialization

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Controls step enqueueing throughput; larger batches reduce round trips but increase per-batch latency

mpsc_channels

Path: common.mpsc_channels

event_publisher

Path: common.mpsc_channels.event_publisher

Parameter | Type | Default | Description
event_queue_buffer_size | usize | 5000 | Bounded channel capacity for the event publisher MPSC channel

common.mpsc_channels.event_publisher.event_queue_buffer_size

Bounded channel capacity for the event publisher MPSC channel

  • Type: usize
  • Default: 5000
  • Valid Range: 100-100000
  • System Impact: Controls backpressure for domain event publishing; smaller buffers apply backpressure sooner

ffi

Path: common.mpsc_channels.ffi

Parameter | Type | Default | Description
ruby_event_buffer_size | usize | 1000 | Bounded channel capacity for Ruby FFI event delivery

common.mpsc_channels.ffi.ruby_event_buffer_size

Bounded channel capacity for Ruby FFI event delivery

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers events between the Rust runtime and Ruby FFI layer; overflow triggers backpressure on the dispatch side

overflow_policy

Path: common.mpsc_channels.overflow_policy

Parameter | Type | Default | Description
log_warning_threshold | f64 | 0.8 | Channel saturation fraction at which warning logs are emitted

common.mpsc_channels.overflow_policy.log_warning_threshold

Channel saturation fraction at which warning logs are emitted

  • Type: f64
  • Default: 0.8
  • Valid Range: 0.0-1.0
  • System Impact: A value of 0.8 means warnings fire when any channel reaches 80% capacity
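
A minimal sketch of how a saturation check against log_warning_threshold could work. The function name and log wording are illustrative, not the engine's internals.

```rust
/// Illustrative saturation check for a bounded channel.
fn warn_if_saturated(queued: usize, capacity: usize, log_warning_threshold: f64) {
    let saturation = queued as f64 / capacity as f64;
    if saturation >= log_warning_threshold {
        eprintln!(
            "channel at {:.0}% of capacity ({queued}/{capacity})",
            saturation * 100.0
        );
    }
}

fn main() {
    // With the default threshold of 0.8, 4000 queued items in a 5000-slot buffer warns.
    warn_if_saturated(4_000, 5_000, 0.8);
}
```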

metrics

Path: common.mpsc_channels.overflow_policy.metrics

Parameter | Type | Default | Description
saturation_check_interval_seconds | u32 | 30 | Interval in seconds between channel saturation metric samples

common.mpsc_channels.overflow_policy.metrics.saturation_check_interval_seconds

Interval in seconds between channel saturation metric samples

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Lower intervals give finer-grained capacity visibility but add sampling overhead

pgmq_database

Path: common.pgmq_database

Parameter | Type | Default | Description
enabled | bool | true | Enable PGMQ messaging subsystem
url | String | "${PGMQ_DATABASE_URL:-}" | PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database

common.pgmq_database.enabled

Enable PGMQ messaging subsystem

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, PGMQ queue operations are disabled; only useful if using RabbitMQ as the sole messaging backend

common.pgmq_database.url

PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database

  • Type: String
  • Default: "${PGMQ_DATABASE_URL:-}"
  • Valid Range: valid PostgreSQL connection URI or empty string
  • System Impact: Separating PGMQ to its own database isolates messaging I/O from task state queries, reducing contention under heavy load

Related: common.database.url, common.pgmq_database.enabled

pool

Path: common.pgmq_database.pool

Parameter | Type | Default | Description
acquire_timeout_seconds | u32 | 5 | Maximum time to wait when acquiring a connection from the PGMQ pool
idle_timeout_seconds | u32 | 300 | Time before an idle PGMQ connection is closed and removed from the pool
max_connections | u32 | 15 | Maximum number of concurrent connections in the PGMQ database pool
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a PGMQ database connection before replacement
min_connections | u32 | 3 | Minimum idle connections maintained in the PGMQ database pool
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which PGMQ pool acquisition is logged as slow

common.pgmq_database.pool.acquire_timeout_seconds

Maximum time to wait when acquiring a connection from the PGMQ pool

  • Type: u32
  • Default: 5
  • Valid Range: 1-300
  • System Impact: Queue operations fail with timeout if no PGMQ connection is available within this window

common.pgmq_database.pool.idle_timeout_seconds

Time before an idle PGMQ connection is closed and removed from the pool

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the PGMQ pool shrinks after messaging load drops

common.pgmq_database.pool.max_connections

Maximum number of concurrent connections in the PGMQ database pool

  • Type: u32
  • Default: 15
  • Valid Range: 1-500
  • System Impact: Separate from the main database pool; size according to messaging throughput requirements

common.pgmq_database.pool.max_lifetime_seconds

Maximum total lifetime of a PGMQ database connection before replacement

  • Type: u32
  • Default: 1800
  • Valid Range: 60-86400
  • System Impact: Prevents connection drift in long-running PGMQ connections

common.pgmq_database.pool.min_connections

Minimum idle connections maintained in the PGMQ database pool

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Keeps PGMQ connections warm to avoid cold-start latency on queue operations

common.pgmq_database.pool.slow_acquire_threshold_ms

Threshold in milliseconds above which PGMQ pool acquisition is logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-60000
  • System Impact: Observability: slow PGMQ acquire warnings indicate messaging pool pressure

queues

Path: common.queues

Parameter | Type | Default | Description
backend | String | "${TASKER_MESSAGING_BACKEND:-pgmq}" | Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)
default_visibility_timeout_seconds | u32 | 30 | Default time a dequeued message remains invisible to other consumers
naming_pattern | String | "{namespace}_{name}_queue" | Template pattern for constructing queue names from namespace and name
orchestration_namespace | String | "orchestration" | Namespace prefix for orchestration queue names
worker_namespace | String | "worker" | Namespace prefix for worker queue names

common.queues.backend

Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)

  • Type: String
  • Default: "${TASKER_MESSAGING_BACKEND:-pgmq}"
  • Valid Range: pgmq | rabbitmq
  • System Impact: Determines the entire message transport layer; pgmq requires only PostgreSQL, rabbitmq requires a separate AMQP broker

Environment Recommendations:

Environment | Value | Rationale
production | pgmq or rabbitmq | pgmq for simplicity, rabbitmq for high-throughput push semantics
test | pgmq | Single-dependency setup, simpler CI

Related: common.queues.pgmq, common.queues.rabbitmq

common.queues.default_visibility_timeout_seconds

Default time a dequeued message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: If a consumer fails to process a message within this window, the message becomes visible again for retry
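
The redelivery rule can be stated in a few lines. The sketch below is illustrative only; the function name and ack flag are assumptions for the example.

```rust
use std::time::{Duration, Instant};

/// Illustrative check: a message dequeued at `dequeued_at` becomes visible to
/// other consumers again once the visibility timeout elapses without an ack.
fn visible_again(dequeued_at: Instant, visibility_timeout: Duration, acked: bool) -> bool {
    !acked && dequeued_at.elapsed() >= visibility_timeout
}

fn main() {
    let dequeued_at = Instant::now();
    // Immediately after dequeue the message is still invisible to other consumers.
    assert!(!visible_again(dequeued_at, Duration::from_secs(30), false));
}
```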

common.queues.naming_pattern

Template pattern for constructing queue names from namespace and name

  • Type: String
  • Default: "{namespace}_{name}_queue"
  • Valid Range: string containing {namespace} and {name} placeholders
  • System Impact: Determines the actual PGMQ/RabbitMQ queue names; changing this after deployment requires manual queue migration
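
The pattern is a simple placeholder substitution. The sketch below shows the mechanics only; the queue_name function and the "worker"/"payments" example values are hypothetical and not taken from a real deployment.

```rust
/// Illustrative expansion of the "{namespace}_{name}_queue" pattern.
fn queue_name(pattern: &str, namespace: &str, name: &str) -> String {
    pattern.replace("{namespace}", namespace).replace("{name}", name)
}

fn main() {
    // A hypothetical "payments" queue in the worker namespace with the default pattern.
    assert_eq!(
        queue_name("{namespace}_{name}_queue", "worker", "payments"),
        "worker_payments_queue"
    );
}
```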

common.queues.orchestration_namespace

Namespace prefix for orchestration queue names

  • Type: String
  • Default: "orchestration"
  • Valid Range: non-empty string
  • System Impact: Used in queue naming pattern to isolate orchestration queues from worker queues

common.queues.worker_namespace

Namespace prefix for worker queue names

  • Type: String
  • Default: "worker"
  • Valid Range: non-empty string
  • System Impact: Used in queue naming pattern to isolate worker queues from orchestration queues

orchestration_queues

Path: common.queues.orchestration_queues

Parameter | Type | Default | Description
step_results | String | "orchestration_step_results" | Queue name for step execution results returned by workers
task_finalizations | String | "orchestration_task_finalizations" | Queue name for task finalization messages
task_requests | String | "orchestration_task_requests" | Queue name for incoming task execution requests

common.queues.orchestration_queues.step_results

Queue name for step execution results returned by workers

  • Type: String
  • Default: "orchestration_step_results"
  • Valid Range: valid queue name
  • System Impact: Workers publish step completion results here for the orchestration result processor

common.queues.orchestration_queues.task_finalizations

Queue name for task finalization messages

  • Type: String
  • Default: "orchestration_task_finalizations"
  • Valid Range: valid queue name
  • System Impact: Tasks ready for completion evaluation are enqueued here

common.queues.orchestration_queues.task_requests

Queue name for incoming task execution requests

  • Type: String
  • Default: "orchestration_task_requests"
  • Valid Range: valid queue name
  • System Impact: The orchestration system reads new task requests from this queue

pgmq

Path: common.queues.pgmq

Parameter | Type | Default | Description
poll_interval_ms | u32 | 500 | Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive

common.queues.pgmq.poll_interval_ms

Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive

  • Type: u32
  • Default: 500
  • Valid Range: 10-10000
  • System Impact: Lower values reduce message latency in polling mode but increase database load; in Hybrid mode this is the fallback interval

queue_depth_thresholds

Path: common.queues.pgmq.queue_depth_thresholds

Parameter | Type | Default | Description
critical_threshold | i64 | 5000 | Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions
overflow_threshold | i64 | 10000 | Queue depth indicating an emergency condition requiring manual intervention

common.queues.pgmq.queue_depth_thresholds.critical_threshold

Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions

  • Type: i64
  • Default: 5000
  • Valid Range: 1+
  • System Impact: Backpressure mechanism: rejects new work to allow the system to drain existing messages

common.queues.pgmq.queue_depth_thresholds.overflow_threshold

Queue depth indicating an emergency condition requiring manual intervention

  • Type: i64
  • Default: 10000
  • Valid Range: 1+
  • System Impact: Highest severity threshold; triggers error-level logging and metrics for operational alerting
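
A sketch of how the two thresholds could classify queue depth into a backpressure decision. The enum and function names are illustrative, not the engine's API.

```rust
#[derive(Debug, PartialEq)]
enum QueuePressure {
    Normal,   // accept new task submissions
    Critical, // API returns HTTP 503 for new submissions
    Overflow, // emergency: error-level logging/metrics, manual intervention
}

fn classify(depth: i64, critical_threshold: i64, overflow_threshold: i64) -> QueuePressure {
    if depth >= overflow_threshold {
        QueuePressure::Overflow
    } else if depth >= critical_threshold {
        QueuePressure::Critical
    } else {
        QueuePressure::Normal
    }
}

fn main() {
    // With the defaults (5000 / 10000), a depth of 7200 rejects new work but is not yet an emergency.
    assert_eq!(classify(7_200, 5_000, 10_000), QueuePressure::Critical);
}
```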

rabbitmq

Path: common.queues.rabbitmq

Parameter | Type | Default | Description
heartbeat_seconds | u16 | 30 | AMQP heartbeat interval for connection liveness detection
prefetch_count | u16 | 100 | Number of unacknowledged messages RabbitMQ will deliver before waiting for acks
url | String | "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}" | AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’

common.queues.rabbitmq.heartbeat_seconds

AMQP heartbeat interval for connection liveness detection

  • Type: u16
  • Default: 30
  • Valid Range: 0-3600
  • System Impact: Detects dead connections; 0 disables heartbeats (not recommended in production)

common.queues.rabbitmq.prefetch_count

Number of unacknowledged messages RabbitMQ will deliver before waiting for acks

  • Type: u16
  • Default: 100
  • Valid Range: 1-65535
  • System Impact: Controls consumer throughput vs. memory usage; higher values increase throughput but buffer more messages in-process

common.queues.rabbitmq.url

AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’

  • Type: String
  • Default: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
  • Valid Range: valid AMQP URI
  • System Impact: Only used when queues.backend = ‘rabbitmq’; must be reachable at startup

system

Path: common.system

Parameter | Type | Default | Description
default_dependent_system | String | "default" | Default system name assigned to tasks that do not specify a dependent system

common.system.default_dependent_system

Default system name assigned to tasks that do not specify a dependent system

  • Type: String
  • Default: "default"
  • Valid Range: non-empty string
  • System Impact: Groups tasks for routing and reporting; most single-system deployments can leave this as default

task_templates

Path: common.task_templates

Parameter | Type | Default | Description
search_paths | Vec<String> | ["config/tasks/**/*.{yml,yaml}"] | Glob patterns for discovering task template YAML files

common.task_templates.search_paths

Glob patterns for discovering task template YAML files

  • Type: Vec<String>
  • Default: ["config/tasks/**/*.{yml,yaml}"]
  • Valid Range: valid glob patterns
  • System Impact: Templates matching these patterns are loaded at startup for task definition discovery

Generated by tasker-ctl docs · Tasker Configuration System

Configuration Reference: orchestration

91/91 parameters documented


orchestration

Root-level orchestration parameters

Path: orchestration

Parameter | Type | Default | Description
enable_performance_logging | bool | true | Enable detailed performance logging for orchestration actors
shutdown_timeout_ms | u64 | 30000 | Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown

orchestration.enable_performance_logging

Enable detailed performance logging for orchestration actors

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Emits timing metrics for task processing, step enqueueing, and result evaluation; disable in production if log volume is a concern

orchestration.shutdown_timeout_ms

Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown

  • Type: u64
  • Default: 30000
  • Valid Range: 1000-300000
  • System Impact: If shutdown exceeds this timeout, the process exits forcefully to avoid hanging indefinitely; 30s is conservative for most deployments

batch_processing

Path: orchestration.batch_processing

Parameter | Type | Default | Description
checkpoint_stall_minutes | u32 | 15 | Minutes without a checkpoint update before a batch is considered stalled
default_batch_size | u32 | 1000 | Default number of items in a single batch when not specified by the handler
enabled | bool | true | Enable the batch processing subsystem for large-scale step execution
max_parallel_batches | u32 | 50 | Maximum number of batch operations that can execute concurrently

orchestration.batch_processing.checkpoint_stall_minutes

Minutes without a checkpoint update before a batch is considered stalled

  • Type: u32
  • Default: 15
  • Valid Range: 1-1440
  • System Impact: Stalled batches are flagged for investigation or automatic recovery; lower values detect issues faster

orchestration.batch_processing.default_batch_size

Default number of items in a single batch when not specified by the handler

  • Type: u32
  • Default: 1000
  • Valid Range: 1-100000
  • System Impact: Larger batches improve throughput but increase memory usage and per-batch latency

orchestration.batch_processing.enabled

Enable the batch processing subsystem for large-scale step execution

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, batch step handlers cannot be used; all steps must be processed individually

orchestration.batch_processing.max_parallel_batches

Maximum number of batch operations that can execute concurrently

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Bounds resource usage from concurrent batch processing; increase for high-throughput batch workloads
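
For intuition, splitting a work set into batches of default_batch_size is essentially a chunking operation. This sketch only illustrates the arithmetic, not the batch subsystem itself.

```rust
fn main() {
    let default_batch_size = 1_000; // orchestration.batch_processing.default_batch_size
    let items: Vec<u32> = (0..2_500).collect();

    // 2500 items become three batches: 1000 + 1000 + 500.
    let batches: Vec<&[u32]> = items.chunks(default_batch_size).collect();
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[2].len(), 500);
}
```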

decision_points

Path: orchestration.decision_points

Parameter | Type | Default | Description
enable_detailed_logging | bool | false | Enable verbose logging of decision point evaluation including expression results
enable_metrics | bool | true | Enable metrics collection for decision point evaluations
enabled | bool | true | Enable the decision point evaluation subsystem for conditional workflow branching
max_decision_depth | u32 | 20 | Maximum depth of nested decision point chains
max_steps_per_decision | u32 | 100 | Maximum number of steps that can be generated by a single decision point evaluation
warn_threshold_depth | u32 | 10 | Decision depth above which a warning is logged
warn_threshold_steps | u32 | 50 | Number of steps per decision above which a warning is logged

orchestration.decision_points.enable_detailed_logging

Enable verbose logging of decision point evaluation including expression results

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: Produces high-volume logs; enable only for debugging specific decision point behavior

orchestration.decision_points.enable_metrics

Enable metrics collection for decision point evaluations

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks evaluation counts, timings, and branch selection distribution

orchestration.decision_points.enabled

Enable the decision point evaluation subsystem for conditional workflow branching

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, all decision points are skipped and conditional steps are not evaluated

orchestration.decision_points.max_decision_depth

Maximum depth of nested decision point chains

  • Type: u32
  • Default: 20
  • Valid Range: 1-100
  • System Impact: Prevents infinite recursion from circular decision point references

orchestration.decision_points.max_steps_per_decision

Maximum number of steps that can be generated by a single decision point evaluation

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Safety limit to prevent decision points from creating unbounded step graphs

orchestration.decision_points.warn_threshold_depth

Decision depth above which a warning is logged

  • Type: u32
  • Default: 10
  • Valid Range: 1-100
  • System Impact: Observability: identifies deeply nested decision chains that may indicate design issues

orchestration.decision_points.warn_threshold_steps

Number of steps per decision above which a warning is logged

  • Type: u32
  • Default: 50
  • Valid Range: 1-10000
  • System Impact: Observability: identifies decision points that generate unusually large step sets
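
Taken together, the four limits act as hard caps plus softer warning lines. The sketch below shows how such guards could be applied; the check_decision function and its messages are assumptions for the example, not the evaluator's real code.

```rust
/// Illustrative guard for a decision point evaluation result.
fn check_decision(depth: u32, generated_steps: u32) -> Result<(), String> {
    let (max_depth, max_steps) = (20, 100);  // hard limits (defaults)
    let (warn_depth, warn_steps) = (10, 50); // warning thresholds (defaults)

    if depth > max_depth {
        return Err(format!("decision depth {depth} exceeds max_decision_depth {max_depth}"));
    }
    if generated_steps > max_steps {
        return Err(format!("{generated_steps} steps exceed max_steps_per_decision {max_steps}"));
    }
    if depth > warn_depth {
        eprintln!("warning: decision depth {depth} above warn_threshold_depth {warn_depth}");
    }
    if generated_steps > warn_steps {
        eprintln!("warning: {generated_steps} steps above warn_threshold_steps {warn_steps}");
    }
    Ok(())
}

fn main() {
    // Depth 12 with 60 generated steps passes the hard limits but logs both warnings.
    assert!(check_decision(12, 60).is_ok());
}
```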

dlq

Path: orchestration.dlq

Parameter | Type | Default | Description
enabled | bool | true | Enable the Dead Letter Queue subsystem for handling unrecoverable tasks

orchestration.dlq.enabled

Enable the Dead Letter Queue subsystem for handling unrecoverable tasks

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, stale or failed tasks remain in their error state without DLQ routing

staleness_detection

Path: orchestration.dlq.staleness_detection

Parameter | Type | Default | Description
batch_size | u32 | 100 | Number of potentially stale tasks to evaluate in a single detection sweep
detection_interval_seconds | u32 | 300 | Interval in seconds between staleness detection sweeps
dry_run | bool | false | Run staleness detection in observation-only mode without taking action
enabled | bool | true | Enable periodic scanning for stale tasks

orchestration.dlq.staleness_detection.batch_size

Number of potentially stale tasks to evaluate in a single detection sweep

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Larger batches process more stale tasks per sweep but increase per-sweep query cost

orchestration.dlq.staleness_detection.detection_interval_seconds

Interval in seconds between staleness detection sweeps

  • Type: u32
  • Default: 300
  • Valid Range: 30-3600
  • System Impact: Lower values detect stale tasks faster but increase database query frequency

orchestration.dlq.staleness_detection.dry_run

Run staleness detection in observation-only mode without taking action

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: Logs what would be DLQ’d without actually transitioning tasks; useful for tuning thresholds

orchestration.dlq.staleness_detection.enabled

Enable periodic scanning for stale tasks

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no automatic staleness detection runs; tasks must be manually DLQ’d

actions

Path: orchestration.dlq.staleness_detection.actions

Parameter | Type | Default | Description
auto_move_to_dlq | bool | true | Automatically move stale tasks to the DLQ after transitioning to error
auto_transition_to_error | bool | true | Automatically transition stale tasks to the Error state
emit_events | bool | true | Emit domain events when staleness is detected
event_channel | String | "task_staleness_detected" | PGMQ channel name for staleness detection events

orchestration.dlq.staleness_detection.actions.auto_move_to_dlq

Automatically move stale tasks to the DLQ after transitioning to error

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, stale tasks are routed to the DLQ; when false, they remain in Error state for manual review

orchestration.dlq.staleness_detection.actions.auto_transition_to_error

Automatically transition stale tasks to the Error state

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, stale tasks are moved to Error before DLQ routing; when false, tasks stay in their current state

orchestration.dlq.staleness_detection.actions.emit_events

Emit domain events when staleness is detected

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, staleness events are published to the event_channel for external alerting or custom handling

orchestration.dlq.staleness_detection.actions.event_channel

PGMQ channel name for staleness detection events

  • Type: String
  • Default: "task_staleness_detected"
  • Valid Range: 1-255 characters
  • System Impact: Consumers can subscribe to this channel for alerting or custom staleness handling

thresholds

Path: orchestration.dlq.staleness_detection.thresholds

Parameter | Type | Default | Description
steps_in_process_minutes | u32 | 30 | Minutes a task can have steps in process before being considered stale
task_max_lifetime_hours | u32 | 24 | Absolute maximum lifetime for any task regardless of state
waiting_for_dependencies_minutes | u32 | 60 | Minutes a task can wait for step dependencies before being considered stale
waiting_for_retry_minutes | u32 | 30 | Minutes a task can wait for step retries before being considered stale

orchestration.dlq.staleness_detection.thresholds.steps_in_process_minutes

Minutes a task can have steps in process before being considered stale

  • Type: u32
  • Default: 30
  • Valid Range: 1-1440
  • System Impact: Tasks in StepsInProcess state exceeding this age may have hung workers; flags for investigation

orchestration.dlq.staleness_detection.thresholds.task_max_lifetime_hours

Absolute maximum lifetime for any task regardless of state

  • Type: u32
  • Default: 24
  • Valid Range: 1-168
  • System Impact: Hard cap; tasks exceeding this age are considered stale even if actively processing

orchestration.dlq.staleness_detection.thresholds.waiting_for_dependencies_minutes

Minutes a task can wait for step dependencies before being considered stale

  • Type: u32
  • Default: 60
  • Valid Range: 1-1440
  • System Impact: Tasks in WaitingForDependencies state exceeding this age are flagged for DLQ consideration

orchestration.dlq.staleness_detection.thresholds.waiting_for_retry_minutes

Minutes a task can wait for step retries before being considered stale

  • Type: u32
  • Default: 30
  • Valid Range: 1-1440
  • System Impact: Tasks in WaitingForRetry state exceeding this age are flagged for DLQ consideration
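
A sketch of how a detection sweep could compare per-state ages against these thresholds. The state names follow the descriptions above; the is_stale function and the minute-based age handling are assumptions for the example.

```rust
enum TaskState { WaitingForDependencies, WaitingForRetry, StepsInProcess, Other }

/// Illustrative staleness check: true when the task exceeds its per-state
/// threshold or the absolute task_max_lifetime_hours cap.
fn is_stale(state: &TaskState, age_minutes: u32) -> bool {
    let per_state_limit = match state {
        TaskState::WaitingForDependencies => 60, // waiting_for_dependencies_minutes
        TaskState::WaitingForRetry => 30,        // waiting_for_retry_minutes
        TaskState::StepsInProcess => 30,         // steps_in_process_minutes
        TaskState::Other => u32::MAX,
    };
    let max_lifetime_minutes = 24 * 60;          // task_max_lifetime_hours
    age_minutes > per_state_limit || age_minutes > max_lifetime_minutes
}

fn main() {
    // 90 minutes waiting for dependencies exceeds the 60-minute default.
    assert!(is_stale(&TaskState::WaitingForDependencies, 90));
}
```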

event_systems

Path: orchestration.event_systems

orchestration

Path: orchestration.event_systems.orchestration

Parameter | Type | Default | Description
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
system_id | String | "orchestration-event-system" | Unique identifier for the orchestration event system instance

orchestration.event_systems.orchestration.deployment_mode

Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
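
Conceptually the three modes reduce to an enum like the sketch below; the variant comments restate the trade-offs above, and the needs_polling_loop helper is an illustrative assumption, not the engine's type or API.

```rust
/// Illustrative model of the three delivery modes.
enum DeploymentMode {
    Hybrid,          // LISTEN/NOTIFY first, polling as a fallback
    EventDrivenOnly, // lowest latency, but no fallback if LISTEN/NOTIFY drops
    PollingOnly,     // highest latency, no LISTEN/NOTIFY dependency
}

/// Whether a polling loop runs at all for a given mode.
fn needs_polling_loop(mode: &DeploymentMode) -> bool {
    matches!(mode, DeploymentMode::Hybrid | DeploymentMode::PollingOnly)
}

fn main() {
    assert!(needs_polling_loop(&DeploymentMode::Hybrid));
    assert!(!needs_polling_loop(&DeploymentMode::EventDrivenOnly));
}
```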

orchestration.event_systems.orchestration.system_id

Unique identifier for the orchestration event system instance

  • Type: String
  • Default: "orchestration-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish this event system from others

health

Path: orchestration.event_systems.orchestration.health

Parameter | Type | Default | Description
enabled | bool | true | Enable health monitoring for the orchestration event system
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the event system reports as unhealthy
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the event system reports as unhealthy
performance_monitoring_enabled | bool | true | Enable detailed performance metrics collection for event processing

orchestration.event_systems.orchestration.health.enabled

Enable health monitoring for the orchestration event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks or error tracking run for this event system

orchestration.event_systems.orchestration.health.error_rate_threshold_per_minute

Error rate per minute above which the event system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal; complements max_consecutive_errors for burst error detection

orchestration.event_systems.orchestration.health.max_consecutive_errors

Number of consecutive errors before the event system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation after sustained failures; resets on any success

orchestration.event_systems.orchestration.health.performance_monitoring_enabled

Enable detailed performance metrics collection for event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks processing latency percentiles and throughput; adds minor overhead
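
A sketch of how max_consecutive_errors and error_rate_threshold_per_minute could combine into a single health verdict. The is_healthy function is an illustrative assumption, not the monitor's real implementation.

```rust
/// Illustrative health verdict combining the two error signals.
fn is_healthy(
    consecutive_errors: u32,
    errors_in_last_minute: u32,
    max_consecutive_errors: u32,          // default 10
    error_rate_threshold_per_minute: u32, // default 20
) -> bool {
    consecutive_errors < max_consecutive_errors
        && errors_in_last_minute <= error_rate_threshold_per_minute
}

fn main() {
    // A burst of 25 errors in one minute is unhealthy even if successes keep
    // resetting the consecutive-error counter.
    assert!(!is_healthy(2, 25, 10, 20));
}
```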

processing

Path: orchestration.event_systems.orchestration.processing

Parameter | Type | Default | Description
batch_size | u32 | 20 | Number of events dequeued in a single batch read
max_concurrent_operations | u32 | 50 | Maximum number of events processed concurrently by the orchestration event system
max_retries | u32 | 3 | Maximum retry attempts for a failed event processing operation

orchestration.event_systems.orchestration.processing.batch_size

Number of events dequeued in a single batch read

  • Type: u32
  • Default: 20
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput but increase per-batch processing time

orchestration.event_systems.orchestration.processing.max_concurrent_operations

Maximum number of events processed concurrently by the orchestration event system

  • Type: u32
  • Default: 50
  • Valid Range: 1-10000
  • System Impact: Controls parallelism for task request, result, and finalization processing

orchestration.event_systems.orchestration.processing.max_retries

Maximum retry attempts for a failed event processing operation

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Events exceeding this retry count are dropped or sent to the DLQ

backoff

Path: orchestration.event_systems.orchestration.processing.backoff

Parameter | Type | Default | Description
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first event processing failure
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between event processing retries
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure

timing

Path: orchestration.event_systems.orchestration.timing

Parameter | Type | Default | Description
claim_timeout_seconds | u32 | 300 | Maximum time in seconds an event claim remains valid
fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the orchestration event system
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single event
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued message remains invisible to other consumers

orchestration.event_systems.orchestration.timing.claim_timeout_seconds

Maximum time in seconds an event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned claims from blocking event processing indefinitely

orchestration.event_systems.orchestration.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Only active in Hybrid mode when event-driven delivery fails; lower values reduce latency but increase DB load

orchestration.event_systems.orchestration.timing.health_check_interval_seconds

Interval in seconds between health check probes for the orchestration event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the event system verifies its own connectivity and responsiveness

orchestration.event_systems.orchestration.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Events exceeding this timeout are considered failed and may be retried

orchestration.event_systems.orchestration.timing.visibility_timeout_seconds

Time in seconds a dequeued message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: If processing is not completed within this window, the message becomes visible again for redelivery

task_readiness

Path: orchestration.event_systems.task_readiness

Parameter | Type | Default | Description
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’
system_id | String | "task-readiness-event-system" | Unique identifier for the task readiness event system instance

orchestration.event_systems.task_readiness.deployment_mode

Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; task readiness events trigger step processing and benefit from low-latency delivery

orchestration.event_systems.task_readiness.system_id

Unique identifier for the task readiness event system instance

  • Type: String
  • Default: "task-readiness-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish task readiness events from other event systems

health

Path: orchestration.event_systems.task_readiness.health

Parameter | Type | Default | Description
enabled | bool | true | Enable health monitoring for the task readiness event system
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the task readiness system reports as unhealthy
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the task readiness system reports as unhealthy
performance_monitoring_enabled | bool | true | Enable detailed performance metrics for task readiness event processing

orchestration.event_systems.task_readiness.health.enabled

Enable health monitoring for the task readiness event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks run for task readiness processing

orchestration.event_systems.task_readiness.health.error_rate_threshold_per_minute

Error rate per minute above which the task readiness system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal complementing max_consecutive_errors

orchestration.event_systems.task_readiness.health.max_consecutive_errors

Number of consecutive errors before the task readiness system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation; resets on any successful readiness check

orchestration.event_systems.task_readiness.health.performance_monitoring_enabled

Enable detailed performance metrics for task readiness event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks readiness check latency and throughput; useful for tuning batch_size and concurrency

processing

Path: orchestration.event_systems.task_readiness.processing

Parameter | Type | Default | Description
batch_size | u32 | 50 | Number of task readiness events dequeued in a single batch
max_concurrent_operations | u32 | 100 | Maximum number of task readiness events processed concurrently
max_retries | u32 | 3 | Maximum retry attempts for a failed task readiness event

orchestration.event_systems.task_readiness.processing.batch_size

Number of task readiness events dequeued in a single batch

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput for readiness evaluation; 50 balances latency and throughput

orchestration.event_systems.task_readiness.processing.max_concurrent_operations

Maximum number of task readiness events processed concurrently

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Higher than orchestration (100 vs 50) because readiness checks are lightweight SQL queries

orchestration.event_systems.task_readiness.processing.max_retries

Maximum retry attempts for a failed task readiness event

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Readiness events are idempotent so retries are safe; limits retry storms

backoff

Path: orchestration.event_systems.task_readiness.processing.backoff

Parameter | Type | Default | Description
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first task readiness processing failure
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay for readiness retries
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds for task readiness retries
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive readiness failure

timing

Path: orchestration.event_systems.task_readiness.timing

Parameter | Type | Default | Description
claim_timeout_seconds | u32 | 300 | Maximum time in seconds a task readiness event claim remains valid
fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles for task readiness
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the task readiness event system
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single task readiness event
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued task readiness message remains invisible to other consumers

orchestration.event_systems.task_readiness.timing.claim_timeout_seconds

Maximum time in seconds a task readiness event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned readiness claims from blocking task evaluation

orchestration.event_systems.task_readiness.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles for task readiness

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Fallback interval when LISTEN/NOTIFY is unavailable; lower values improve responsiveness

orchestration.event_systems.task_readiness.timing.health_check_interval_seconds

Interval in seconds between health check probes for the task readiness event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the task readiness system verifies its own connectivity

orchestration.event_systems.task_readiness.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single task readiness event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Readiness events exceeding this timeout are considered failed

orchestration.event_systems.task_readiness.timing.visibility_timeout_seconds

Time in seconds a dequeued task readiness message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Prevents duplicate processing of readiness events during normal operation

grpc

Path: orchestration.grpc

Parameter | Type | Default | Description
bind_address | String | "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}" | Socket address for the gRPC server
enable_health_service | bool | true | Enable the gRPC health checking service (grpc.health.v1)
enable_reflection | bool | true | Enable gRPC server reflection for service discovery
enabled | bool | true | Enable the gRPC API server alongside the REST API
keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames
keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
max_concurrent_streams | u32 | 200 | Maximum number of concurrent gRPC streams per connection
max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame
tls_enabled | bool | false | Enable TLS encryption for gRPC connections

orchestration.grpc.bind_address

Socket address for the gRPC server

  • Type: String
  • Default: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
  • Valid Range: host:port
  • System Impact: Must not conflict with the REST API bind_address; default 9190 avoids Prometheus port conflict

orchestration.grpc.enable_health_service

Enable the gRPC health checking service (grpc.health.v1)

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators

orchestration.grpc.enable_reflection

Enable gRPC server reflection for service discovery

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Allows tools like grpcurl to list and inspect services; safe to enable in development, consider disabling in production

orchestration.grpc.enabled

Enable the gRPC API server alongside the REST API

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no gRPC endpoints are available; clients must use REST

orchestration.grpc.keepalive_interval_seconds

Interval in seconds between gRPC keepalive ping frames

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Detects dead connections; lower values detect failures faster but increase network overhead

orchestration.grpc.keepalive_timeout_seconds

Time in seconds to wait for a keepalive ping acknowledgment before closing the connection

  • Type: u32
  • Default: 20
  • Valid Range: 1-300
  • System Impact: Connections that fail to acknowledge within this window are considered dead and closed

orchestration.grpc.max_concurrent_streams

Maximum number of concurrent gRPC streams per connection

  • Type: u32
  • Default: 200
  • Valid Range: 1-10000
  • System Impact: Limits multiplexed request parallelism per connection; 200 is conservative for orchestration workloads

orchestration.grpc.max_frame_size

Maximum size in bytes of a single HTTP/2 frame

  • Type: u32
  • Default: 16384
  • Valid Range: 16384-16777215
  • System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream

orchestration.grpc.tls_enabled

Enable TLS encryption for gRPC connections

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When true, tls_cert_path and tls_key_path must be provided; required for production gRPC deployments

mpsc_channels

Path: orchestration.mpsc_channels

command_processor

Path: orchestration.mpsc_channels.command_processor

Parameter | Type | Default | Description
command_buffer_size | usize | 5000 | Bounded channel capacity for the orchestration command processor

orchestration.mpsc_channels.command_processor.command_buffer_size

Bounded channel capacity for the orchestration command processor

  • Type: usize
  • Default: 5000
  • Valid Range: 100-100000
  • System Impact: Buffers incoming orchestration commands; larger values absorb traffic spikes but use more memory

event_listeners

Path: orchestration.mpsc_channels.event_listeners

Parameter | Type | Default | Description
pgmq_event_buffer_size | usize | 50000 | Bounded channel capacity for PGMQ event listener notifications

orchestration.mpsc_channels.event_listeners.pgmq_event_buffer_size

Bounded channel capacity for PGMQ event listener notifications

  • Type: usize
  • Default: 50000
  • Valid Range: 1000-1000000
  • System Impact: Large buffer (50000) absorbs high-volume PGMQ LISTEN/NOTIFY events without backpressure on PostgreSQL

event_systems

Path: orchestration.mpsc_channels.event_systems

| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 10000 | Bounded channel capacity for the orchestration event system internal channel |

orchestration.mpsc_channels.event_systems.event_channel_buffer_size

Bounded channel capacity for the orchestration event system internal channel

  • Type: usize
  • Default: 10000
  • Valid Range: 100-100000
  • System Impact: Buffers events between the event listener and event processor; larger values absorb notification bursts

web

Path: orchestration.web

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}" | Socket address for the REST API server |
| enabled | bool | true | Enable the REST API server for the orchestration service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for an HTTP request to complete before timeout |

orchestration.web.bind_address

Socket address for the REST API server

  • Type: String
  • Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
  • Valid Range: host:port
  • System Impact: Determines where the orchestration REST API listens; use 0.0.0.0 for container deployments

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8080 | Standard port; use TASKER_WEB_BIND_ADDRESS env var to override in CI |
| test | 0.0.0.0:8080 | Default port for test fixtures |
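
The `${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}` default uses shell-style "value or fallback" interpolation. Exactly where Tasker's config loader resolves this is not shown here, but the semantics are equivalent to the following sketch:

```rust
use std::env;

/// Resolve an environment variable with a literal fallback, mirroring `${VAR:-default}`.
fn resolve_bind_address() -> String {
    env::var("TASKER_WEB_BIND_ADDRESS").unwrap_or_else(|_| "0.0.0.0:8080".to_string())
}

fn main() {
    println!("binding REST API to {}", resolve_bind_address());
}
```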

orchestration.web.enabled

Enable the REST API server for the orchestration service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no HTTP endpoints are available; the service operates via messaging only

orchestration.web.request_timeout_ms

Maximum time in milliseconds for an HTTP request to complete before timeout

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections

auth

Path: orchestration.web.auth

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication |
| enabled | bool | false | Enable authentication for the REST API |
| jwt_audience | String | "tasker-api" | Expected ‘aud’ claim in JWT tokens |
| jwt_issuer | String | "tasker-core" | Expected ‘iss’ claim in JWT tokens |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if this service issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours |

orchestration.web.auth.api_key

Static API key for simple key-based authentication

  • Type: String
  • Default: ""
  • Valid Range: non-empty string or empty to disable
  • System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header

orchestration.web.auth.api_key_header

HTTP header name for API key authentication

  • Type: String
  • Default: "X-API-Key"
  • Valid Range: valid HTTP header name
  • System Impact: Clients send their API key in this header; default is X-API-Key

orchestration.web.auth.enabled

Enable authentication for the REST API

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all API endpoints are unauthenticated; enable in production with JWT or API key auth

orchestration.web.auth.jwt_audience

Expected ‘aud’ claim in JWT tokens

  • Type: String
  • Default: "tasker-api"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different audience are rejected during validation

orchestration.web.auth.jwt_issuer

Expected ‘iss’ claim in JWT tokens

  • Type: String
  • Default: "tasker-core"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different issuer are rejected during validation

orchestration.web.auth.jwt_private_key

PEM-encoded private key for signing JWT tokens (if this service issues tokens)

  • Type: String
  • Default: ""
  • Valid Range: valid PEM private key or empty
  • System Impact: Required only if the orchestration service issues its own JWT tokens; leave empty when using external identity providers

orchestration.web.auth.jwt_public_key

PEM-encoded public key for verifying JWT token signatures

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY:-}"
  • Valid Range: valid PEM public key or empty
  • System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management in production

orchestration.web.auth.jwt_public_key_path

File path to a PEM-encoded public key for JWT verification

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
  • Valid Range: valid file path or empty
  • System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file

orchestration.web.auth.jwt_token_expiry_hours

Default JWT token validity period in hours

  • Type: u32
  • Default: 24
  • Valid Range: 1-720
  • System Impact: Tokens older than this are rejected; shorter values improve security but require more frequent re-authentication
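
For reference, validating the ‘iss’ and ‘aud’ claims described above looks roughly like the sketch below using the jsonwebtoken crate; the RS256 algorithm and the Claims shape are assumptions, not Tasker's actual implementation.

```rust
use jsonwebtoken::{decode, Algorithm, DecodingKey, Validation};
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Claims {
    sub: String,
    exp: usize,
}

fn verify(token: &str, public_key_pem: &[u8]) -> Result<Claims, jsonwebtoken::errors::Error> {
    // jwt_public_key / jwt_public_key_path supply this PEM material.
    let key = DecodingKey::from_rsa_pem(public_key_pem)?;
    let mut validation = Validation::new(Algorithm::RS256);
    validation.set_issuer(&["tasker-core"]);  // jwt_issuer
    validation.set_audience(&["tasker-api"]); // jwt_audience
    Ok(decode::<Claims>(token, &key, &validation)?.claims)
}
```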

database_pools

Path: orchestration.web.database_pools

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 50 | Advisory hint for the total number of database connections across all orchestration pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle web API connection is closed |
| web_api_max_connections | u32 | 30 | Maximum number of connections the web API pool can grow to under load |
| web_api_pool_size | u32 | 20 | Target number of connections in the web API database pool |

orchestration.web.database_pools.max_total_connections_hint

Advisory hint for the total number of database connections across all orchestration pools

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint

orchestration.web.database_pools.web_api_connection_timeout_seconds

Maximum time to wait when acquiring a connection from the web API pool

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: API requests that cannot acquire a connection within this window return an error

orchestration.web.database_pools.web_api_idle_timeout_seconds

Time before an idle web API connection is closed

  • Type: u32
  • Default: 600
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the web API pool shrinks after traffic subsides

orchestration.web.database_pools.web_api_max_connections

Maximum number of connections the web API pool can grow to under load

  • Type: u32
  • Default: 30
  • Valid Range: 1-500
  • System Impact: Hard ceiling for web API database connections; prevents connection exhaustion from traffic spikes

orchestration.web.database_pools.web_api_pool_size

Target number of connections in the web API database pool

  • Type: u32
  • Default: 20
  • Valid Range: 1-200
  • System Impact: Determines how many concurrent database queries the REST API can execute
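
One plausible mapping of these pool parameters onto sqlx's PgPoolOptions, treating web_api_pool_size as the minimum and web_api_max_connections as the ceiling; the connection string is a placeholder and this is not necessarily how the engine builds its pools.

```rust
use std::time::Duration;
use sqlx::postgres::PgPoolOptions;

async fn build_web_api_pool() -> Result<sqlx::PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .min_connections(20)                      // web_api_pool_size (target)
        .max_connections(30)                      // web_api_max_connections (hard ceiling)
        .acquire_timeout(Duration::from_secs(30)) // web_api_connection_timeout_seconds
        .idle_timeout(Duration::from_secs(600))   // web_api_idle_timeout_seconds
        .connect("postgres://localhost/tasker")   // placeholder connection string
        .await
}
```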

Generated by tasker-ctl docs · Tasker Configuration System

Configuration Reference: worker

90/90 parameters documented


worker

Root-level worker parameters

Path: worker

| Parameter | Type | Default | Description |
|---|---|---|---|
| worker_id | String | "worker-default-001" | Unique identifier for this worker instance |
| worker_type | String | "general" | Worker type classification for routing and reporting |

worker.worker_id

Unique identifier for this worker instance

  • Type: String
  • Default: "worker-default-001"
  • Valid Range: non-empty string
  • System Impact: Used in logging, metrics, and step claim attribution; must be unique across all worker instances in a cluster

worker.worker_type

Worker type classification for routing and reporting

  • Type: String
  • Default: "general"
  • Valid Range: non-empty string
  • System Impact: Used to match worker capabilities with step handler requirements; ‘general’ handles all step types

circuit_breakers

Path: worker.circuit_breakers

ffi_completion_send

Path: worker.circuit_breakers.ffi_completion_send

| Parameter | Type | Default | Description |
|---|---|---|---|
| failure_threshold | u32 | 5 | Number of consecutive FFI completion send failures before the circuit breaker trips |
| recovery_timeout_seconds | u32 | 5 | Time the FFI completion breaker stays Open before probing with a test send |
| slow_send_threshold_ms | u32 | 100 | Threshold in milliseconds above which FFI completion channel sends are logged as slow |
| success_threshold | u32 | 2 | Consecutive successful sends in Half-Open required to close the breaker |

worker.circuit_breakers.ffi_completion_send.failure_threshold

Number of consecutive FFI completion send failures before the circuit breaker trips

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects the FFI completion channel from cascading failures; when tripped, sends are short-circuited

worker.circuit_breakers.ffi_completion_send.recovery_timeout_seconds

Time the FFI completion breaker stays Open before probing with a test send

  • Type: u32
  • Default: 5
  • Valid Range: 1-300
  • System Impact: Short timeout (5s) because FFI channel issues are typically transient

worker.circuit_breakers.ffi_completion_send.slow_send_threshold_ms

Threshold in milliseconds above which FFI completion channel sends are logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-10000
  • System Impact: Observability: identifies when the FFI completion channel is under pressure from slow consumers

worker.circuit_breakers.ffi_completion_send.success_threshold

Consecutive successful sends in Half-Open required to close the breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Low threshold (2) allows fast recovery since FFI send failures are usually transient
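
The four parameters above describe a conventional Closed → Open → Half-Open circuit breaker. A self-contained sketch of that state machine using the documented defaults; this is an illustration of the pattern, not Tasker's implementation.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

struct CircuitBreaker {
    state: State,
    consecutive_failures: u32,
    half_open_successes: u32,
    failure_threshold: u32,     // 5
    success_threshold: u32,     // 2
    recovery_timeout: Duration, // 5s
}

impl CircuitBreaker {
    /// Should a send be attempted right now?
    fn allow(&mut self) -> bool {
        match self.state {
            State::Closed | State::HalfOpen => true,
            State::Open { since } if since.elapsed() >= self.recovery_timeout => {
                self.state = State::HalfOpen; // probe with a test send
                true
            }
            State::Open { .. } => false, // short-circuit while Open
        }
    }

    /// Record the outcome of a send.
    fn record(&mut self, ok: bool) {
        match (self.state, ok) {
            (State::HalfOpen, true) => {
                self.half_open_successes += 1;
                if self.half_open_successes >= self.success_threshold {
                    self.state = State::Closed;
                    self.consecutive_failures = 0;
                }
            }
            (State::HalfOpen, false) => {
                // A failed probe reopens the breaker immediately.
                self.state = State::Open { since: Instant::now() };
                self.half_open_successes = 0;
            }
            (_, true) => self.consecutive_failures = 0,
            (_, false) => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = State::Open { since: Instant::now() };
                    self.half_open_successes = 0;
                }
            }
        }
    }
}
```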

event_systems

Path: worker.event_systems

worker

Path: worker.event_systems.worker

| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "worker-event-system" | Unique identifier for the worker event system instance |

worker.event_systems.worker.deployment_mode

Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
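
‘Hybrid’ pairs PostgreSQL LISTEN/NOTIFY with a polling safety net. A rough sketch of that shape using sqlx's PgListener and a Tokio interval; the channel name and the polling body are hypothetical, and the real event loop is more involved.

```rust
use std::time::Duration;
use sqlx::postgres::PgListener;

async fn run(database_url: &str) -> Result<(), sqlx::Error> {
    // Event-driven path: subscribe to notifications (channel name is hypothetical).
    let mut listener = PgListener::connect(database_url).await?;
    listener.listen("tasker_step_dispatch").await?;

    // Polling fallback: fallback_polling_interval_seconds (2s for workers).
    let mut poll = tokio::time::interval(Duration::from_secs(2));

    loop {
        tokio::select! {
            // Lowest latency: react to LISTEN/NOTIFY as soon as it arrives.
            notification = listener.recv() => {
                let n = notification?;
                let _payload = n.payload();
                // dispatch the step referenced by the payload ...
            }
            // Guaranteed progress even if a notification is missed or the listener
            // connection drops; PollingOnly behaves like this branch alone.
            _ = poll.tick() => {
                // read ready messages from PGMQ ...
            }
        }
    }
}
```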

worker.event_systems.worker.system_id

Unique identifier for the worker event system instance

  • Type: String
  • Default: "worker-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish this event system from others

health

Path: worker.event_systems.worker.health

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the worker event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the worker event system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the worker event system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for worker event processing |

worker.event_systems.worker.health.enabled

Enable health monitoring for the worker event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks or error tracking run for the worker event system

worker.event_systems.worker.health.error_rate_threshold_per_minute

Error rate per minute above which the worker event system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal complementing max_consecutive_errors

worker.event_systems.worker.health.max_consecutive_errors

Number of consecutive errors before the worker event system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation; resets on any successful event processing

worker.event_systems.worker.health.performance_monitoring_enabled

Enable detailed performance metrics for worker event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks step dispatch latency and throughput; useful for tuning concurrency settings

metadata

Path: worker.event_systems.worker.metadata

fallback_poller

Path: worker.event_systems.worker.metadata.fallback_poller

| Parameter | Type | Default | Description |
|---|---|---|---|
| age_threshold_seconds | u32 | 5 | Minimum age in seconds of a message before the fallback poller will pick it up |
| batch_size | u32 | 20 | Number of messages to dequeue in a single fallback poll |
| enabled | bool | true | Enable the fallback polling mechanism for step dispatch |
| max_age_hours | u32 | 24 | Maximum age in hours of messages the fallback poller will process |
| polling_interval_ms | u32 | 1000 | Interval in milliseconds between fallback polling cycles |
| supported_namespaces | Vec<String> | [] | List of queue namespaces the fallback poller monitors; empty means all namespaces |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a message polled by the fallback mechanism remains invisible |

in_process_events

Path: worker.event_systems.worker.metadata.in_process_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| deduplication_cache_size | usize | 10000 | Number of event IDs to cache for deduplication of in-process events |
| ffi_integration_enabled | bool | true | Enable FFI integration for in-process event delivery to Ruby/Python workers |

listener

Path: worker.event_systems.worker.metadata.listener

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_processing | bool | true | Enable batch processing of accumulated LISTEN/NOTIFY events |
| connection_timeout_seconds | u32 | 30 | Maximum time to wait when establishing the LISTEN/NOTIFY PostgreSQL connection |
| event_timeout_seconds | u32 | 60 | Maximum time to wait for a LISTEN/NOTIFY event before yielding |
| max_retry_attempts | u32 | 5 | Maximum number of listener reconnection attempts before falling back to polling |
| retry_interval_seconds | u32 | 5 | Interval in seconds between LISTEN/NOTIFY listener reconnection attempts |

processing

Path: worker.event_systems.worker.processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read by the worker |
| max_concurrent_operations | u32 | 100 | Maximum number of events processed concurrently by the worker event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed worker event processing operation |

worker.event_systems.worker.processing.batch_size

Number of events dequeued in a single batch read by the worker

  • Type: u32
  • Default: 20
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput but increase per-batch processing time

worker.event_systems.worker.processing.max_concurrent_operations

Maximum number of events processed concurrently by the worker event system

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Controls parallelism for step dispatch and completion processing

worker.event_systems.worker.processing.max_retries

Maximum retry attempts for a failed worker event processing operation

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Events exceeding this retry count are dropped or sent to the DLQ

backoff

Path: worker.event_systems.worker.processing.backoff

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first worker event processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between worker event retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
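
These four values describe standard exponential backoff with jitter. A worked sketch of the computation using the defaults; whether jitter is added on top or applied symmetrically is an assumption here.

```rust
use std::time::Duration;

/// delay(attempt) = min(initial_delay_ms * multiplier^attempt, max_delay_ms),
/// plus up to jitter_percent of that value as random jitter.
fn backoff_delay(attempt: u32) -> Duration {
    let initial_delay_ms = 100.0; // initial_delay_ms
    let multiplier: f64 = 2.0;    // multiplier
    let max_delay_ms = 10_000.0;  // max_delay_ms
    let jitter_percent = 0.1;     // jitter_percent

    let base = (initial_delay_ms * multiplier.powi(attempt as i32)).min(max_delay_ms);
    let jitter = base * jitter_percent * rand::random::<f64>();
    Duration::from_millis((base + jitter) as u64)
}

fn main() {
    // attempt 0 -> ~100ms, attempt 3 -> ~800ms, attempt 7+ -> capped near 10s
    for attempt in 0..8 {
        println!("attempt {attempt}: {:?}", backoff_delay(attempt));
    }
}
```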

timing

Path: worker.event_systems.worker.timing

| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds a worker event claim remains valid |
| fallback_polling_interval_seconds | u32 | 2 | Interval in seconds between fallback polling cycles for step dispatch |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the worker event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single worker event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued step dispatch message remains invisible to other workers |

worker.event_systems.worker.timing.claim_timeout_seconds

Maximum time in seconds a worker event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned claims from blocking step processing indefinitely

worker.event_systems.worker.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles for step dispatch

  • Type: u32
  • Default: 2
  • Valid Range: 1-60
  • System Impact: Shorter than orchestration (2s vs 5s) because workers need fast step pickup for low latency

worker.event_systems.worker.timing.health_check_interval_seconds

Interval in seconds between health check probes for the worker event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the worker event system verifies its own connectivity

worker.event_systems.worker.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single worker event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Events exceeding this timeout are considered failed and may be retried

worker.event_systems.worker.timing.visibility_timeout_seconds

Time in seconds a dequeued step dispatch message remains invisible to other workers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Prevents duplicate step execution; must be longer than typical step processing time

grpc

Path: worker.grpc

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}" | Socket address for the worker gRPC server |
| enable_health_service | bool | true | Enable the gRPC health checking service on the worker |
| enable_reflection | bool | true | Enable gRPC server reflection for the worker service |
| enabled | bool | true | Enable the gRPC API server for the worker service |
| keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames on the worker |
| keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
| max_concurrent_streams | u32 | 1000 | Maximum number of concurrent gRPC streams per connection |
| max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server |
| tls_enabled | bool | false | Enable TLS encryption for worker gRPC connections |

worker.grpc.bind_address

Socket address for the worker gRPC server

  • Type: String
  • Default: "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}"
  • Valid Range: host:port
  • System Impact: Must not conflict with the REST API or orchestration gRPC ports; default 9191

worker.grpc.enable_health_service

Enable the gRPC health checking service on the worker

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators

worker.grpc.enable_reflection

Enable gRPC server reflection for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Allows tools like grpcurl to list and inspect worker services; consider disabling in production

worker.grpc.enabled

Enable the gRPC API server for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no gRPC endpoints are available; clients must use REST

worker.grpc.keepalive_interval_seconds

Interval in seconds between gRPC keepalive ping frames on the worker

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Detects dead connections; lower values detect failures faster but increase network overhead

worker.grpc.keepalive_timeout_seconds

Time in seconds to wait for a keepalive ping acknowledgment before closing the connection

  • Type: u32
  • Default: 20
  • Valid Range: 1-300
  • System Impact: Connections that fail to acknowledge within this window are considered dead and closed

worker.grpc.max_concurrent_streams

Maximum number of concurrent gRPC streams per connection

  • Type: u32
  • Default: 1000
  • Valid Range: 1-10000
  • System Impact: Workers typically handle more concurrent streams than orchestration; default 1000 reflects this

worker.grpc.max_frame_size

Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server

  • Type: u32
  • Default: 16384
  • Valid Range: 16384-16777215
  • System Impact: Larger frames reduce framing overhead for large messages but increase per-stream memory usage

worker.grpc.tls_enabled

Enable TLS encryption for worker gRPC connections

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When true, TLS cert and key paths must be provided; required for production gRPC deployments

mpsc_channels

Path: worker.mpsc_channels

command_processor

Path: worker.mpsc_channels.command_processor

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 2000 | Bounded channel capacity for the worker command processor |

worker.mpsc_channels.command_processor.command_buffer_size

Bounded channel capacity for the worker command processor

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Buffers incoming worker commands; smaller than orchestration (2000 vs 5000) since workers process fewer command types

domain_events

Path: worker.mpsc_channels.domain_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 1000 | Bounded channel capacity for domain event system commands |
| log_dropped_events | bool | true | Log a warning when domain events are dropped due to channel saturation |
| shutdown_drain_timeout_ms | u32 | 5000 | Maximum time in milliseconds to drain pending domain events during shutdown |

worker.mpsc_channels.domain_events.command_buffer_size

Bounded channel capacity for domain event system commands

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers domain event system control commands such as publish, subscribe, and shutdown

worker.mpsc_channels.domain_events.log_dropped_events

Log a warning when domain events are dropped due to channel saturation

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Observability: helps detect when event volume exceeds channel capacity

worker.mpsc_channels.domain_events.shutdown_drain_timeout_ms

Maximum time in milliseconds to drain pending domain events during shutdown

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Ensures in-flight domain events are delivered before the worker exits; prevents event loss

event_listeners

Path: worker.mpsc_channels.event_listeners

| Parameter | Type | Default | Description |
|---|---|---|---|
| pgmq_event_buffer_size | usize | 10000 | Bounded channel capacity for PGMQ event listener notifications on the worker |

worker.mpsc_channels.event_listeners.pgmq_event_buffer_size

Bounded channel capacity for PGMQ event listener notifications on the worker

  • Type: usize
  • Default: 10000
  • Valid Range: 1000-1000000
  • System Impact: Buffers PGMQ LISTEN/NOTIFY events; smaller than orchestration (10000 vs 50000) since workers handle fewer event types

event_subscribers

Path: worker.mpsc_channels.event_subscribers

| Parameter | Type | Default | Description |
|---|---|---|---|
| completion_buffer_size | usize | 1000 | Bounded channel capacity for step completion event subscribers |
| result_buffer_size | usize | 1000 | Bounded channel capacity for step result event subscribers |

worker.mpsc_channels.event_subscribers.completion_buffer_size

Bounded channel capacity for step completion event subscribers

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step completion notifications before they are forwarded to the orchestration service

worker.mpsc_channels.event_subscribers.result_buffer_size

Bounded channel capacity for step result event subscribers

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step execution results before they are published to the result queue

event_systems

Path: worker.mpsc_channels.event_systems

| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 2000 | Bounded channel capacity for the worker event system internal channel |

worker.mpsc_channels.event_systems.event_channel_buffer_size

Bounded channel capacity for the worker event system internal channel

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Buffers events between the listener and processor; sized for worker-level throughput

ffi_dispatch

Path: worker.mpsc_channels.ffi_dispatch

| Parameter | Type | Default | Description |
|---|---|---|---|
| callback_timeout_ms | u32 | 5000 | Maximum time in milliseconds for FFI fire-and-forget domain event callbacks |
| completion_send_timeout_ms | u32 | 10000 | Maximum time in milliseconds to retry sending FFI completion results when the channel is full |
| completion_timeout_ms | u32 | 30000 | Maximum time in milliseconds to wait for an FFI step handler to complete |
| dispatch_buffer_size | usize | 1000 | Bounded channel capacity for FFI step dispatch requests |
| starvation_warning_threshold_ms | u32 | 10000 | Age in milliseconds of pending FFI events that triggers a starvation warning |

worker.mpsc_channels.ffi_dispatch.callback_timeout_ms

Maximum time in milliseconds for FFI fire-and-forget domain event callbacks

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Prevents indefinite blocking of FFI threads during domain event publishing

worker.mpsc_channels.ffi_dispatch.completion_send_timeout_ms

Maximum time in milliseconds to retry sending FFI completion results when the channel is full

  • Type: u32
  • Default: 10000
  • Valid Range: 1000-300000
  • System Impact: Uses try_send with retry loop instead of blocking send to prevent deadlocks
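
The "try_send with retry loop" behaviour called out above can be pictured like this: keep retrying while the channel is full, up to completion_send_timeout_ms, instead of blocking the FFI thread on an awaited send. The helper and its retry pause are illustrative, not the engine's code.

```rust
use std::time::{Duration, Instant};
use tokio::sync::mpsc::{error::TrySendError, Sender};

/// Try to deliver a completion result without blocking indefinitely.
async fn send_completion<T>(tx: &Sender<T>, mut msg: T) -> Result<(), T> {
    let deadline = Instant::now() + Duration::from_millis(10_000); // completion_send_timeout_ms
    loop {
        match tx.try_send(msg) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Full(m)) if Instant::now() < deadline => {
                msg = m;
                // Brief pause before retrying while the channel drains.
                tokio::time::sleep(Duration::from_millis(10)).await;
            }
            // Deadline exceeded or channel closed: give the message back to the caller.
            Err(TrySendError::Full(m)) | Err(TrySendError::Closed(m)) => return Err(m),
        }
    }
}
```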

worker.mpsc_channels.ffi_dispatch.completion_timeout_ms

Maximum time in milliseconds to wait for an FFI step handler to complete

  • Type: u32
  • Default: 30000
  • Valid Range: 1000-600000
  • System Impact: FFI handlers exceeding this timeout are considered failed; guards against hung FFI threads

worker.mpsc_channels.ffi_dispatch.dispatch_buffer_size

Bounded channel capacity for FFI step dispatch requests

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step execution requests destined for Ruby/Python FFI handlers

worker.mpsc_channels.ffi_dispatch.starvation_warning_threshold_ms

Age in milliseconds of pending FFI events that triggers a starvation warning

  • Type: u32
  • Default: 10000
  • Valid Range: 1000-300000
  • System Impact: Proactive detection of FFI channel starvation before completion_timeout_ms is reached

handler_dispatch

Path: worker.mpsc_channels.handler_dispatch

| Parameter | Type | Default | Description |
|---|---|---|---|
| completion_buffer_size | usize | 1000 | Bounded channel capacity for step handler completion notifications |
| dispatch_buffer_size | usize | 1000 | Bounded channel capacity for step handler dispatch requests |
| handler_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a step handler to complete execution |
| max_concurrent_handlers | u32 | 10 | Maximum number of step handlers executing simultaneously |

worker.mpsc_channels.handler_dispatch.completion_buffer_size

Bounded channel capacity for step handler completion notifications

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers handler completion results before they are forwarded to the result processor

worker.mpsc_channels.handler_dispatch.dispatch_buffer_size

Bounded channel capacity for step handler dispatch requests

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers incoming step execution requests before handler assignment

worker.mpsc_channels.handler_dispatch.handler_timeout_ms

Maximum time in milliseconds for a step handler to complete execution

  • Type: u32
  • Default: 30000
  • Valid Range: 1000-600000
  • System Impact: Handlers exceeding this timeout are cancelled; prevents hung handlers from consuming capacity

worker.mpsc_channels.handler_dispatch.max_concurrent_handlers

Maximum number of step handlers executing simultaneously

  • Type: u32
  • Default: 10
  • Valid Range: 1-10000
  • System Impact: Controls per-worker parallelism; bounded by the handler dispatch semaphore
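
max_concurrent_handlers and handler_timeout_ms together bound per-worker parallelism and per-handler runtime. A sketch of that pattern with a Tokio semaphore and timeout; the dispatch function and its shape are hypothetical.

```rust
use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::timeout};

async fn dispatch<F, Fut>(handlers: Arc<Semaphore>, handler: F)
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    // max_concurrent_handlers: `handlers` is created once with Semaphore::new(10);
    // acquiring a permit pauses dispatch while all 10 are busy.
    let _permit = handlers.acquire_owned().await.expect("semaphore closed");

    // handler_timeout_ms: a handler running past 30s is cancelled when its future is dropped.
    if timeout(Duration::from_millis(30_000), handler()).await.is_err() {
        eprintln!("step handler exceeded handler_timeout_ms and was cancelled");
    }
}
```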

load_shedding

Path: worker.mpsc_channels.handler_dispatch.load_shedding

| Parameter | Type | Default | Description |
|---|---|---|---|
| capacity_threshold_percent | f64 | 80.0 | Handler capacity percentage above which new step claims are refused |
| enabled | bool | true | Enable load shedding to refuse step claims when handler capacity is nearly exhausted |
| warning_threshold_percent | f64 | 70.0 | Handler capacity percentage at which warning logs are emitted |

worker.mpsc_channels.handler_dispatch.load_shedding.capacity_threshold_percent

Handler capacity percentage above which new step claims are refused

  • Type: f64
  • Default: 80.0
  • Valid Range: 0.0-100.0
  • System Impact: At 80%, the worker stops accepting new steps when 80% of max_concurrent_handlers are busy

worker.mpsc_channels.handler_dispatch.load_shedding.enabled

Enable load shedding to refuse step claims when handler capacity is nearly exhausted

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, the worker refuses new step claims above the capacity threshold to prevent overload

worker.mpsc_channels.handler_dispatch.load_shedding.warning_threshold_percent

Handler capacity percentage at which warning logs are emitted

  • Type: f64
  • Default: 70.0
  • Valid Range: 0.0-100.0
  • System Impact: Observability: alerts operators that the worker is approaching its capacity limit
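
A minimal sketch of the load-shedding decision these three parameters describe; the function name and logging are illustrative.

```rust
/// Decide whether a worker with `busy_handlers` of `max_concurrent_handlers` in use
/// should accept another step claim.
fn should_accept_step(busy_handlers: u32, max_concurrent_handlers: u32) -> bool {
    let capacity_threshold_percent = 80.0; // refuse claims at or above this utilization
    let warning_threshold_percent = 70.0;  // warn at or above this utilization

    let utilization = busy_handlers as f64 / max_concurrent_handlers as f64 * 100.0;
    if utilization >= warning_threshold_percent {
        eprintln!("handler capacity at {utilization:.0}% of max_concurrent_handlers");
    }
    utilization < capacity_threshold_percent
}

fn main() {
    assert!(should_accept_step(7, 10));  // 70%: warn, but still accept
    assert!(!should_accept_step(8, 10)); // 80%: refuse new claims
}
```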

in_process_events

Path: worker.mpsc_channels.in_process_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| broadcast_buffer_size | usize | 2000 | Bounded broadcast channel capacity for in-process domain event delivery |
| dispatch_timeout_ms | u32 | 5000 | Maximum time in milliseconds to wait when dispatching an in-process event |
| log_subscriber_errors | bool | true | Log errors when in-process event subscribers fail to receive events |

worker.mpsc_channels.in_process_events.broadcast_buffer_size

Bounded broadcast channel capacity for in-process domain event delivery

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Controls how many domain events can be buffered before slow subscribers cause backpressure

worker.mpsc_channels.in_process_events.dispatch_timeout_ms

Maximum time in milliseconds to wait when dispatching an in-process event

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Prevents event dispatch from blocking indefinitely if all subscribers are slow

worker.mpsc_channels.in_process_events.log_subscriber_errors

Log errors when in-process event subscribers fail to receive events

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Observability: helps identify slow or failing event subscribers
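
These parameters describe a Tokio-style broadcast channel, where a subscriber that falls more than broadcast_buffer_size events behind starts losing events. A sketch; the event type is a placeholder.

```rust
use tokio::sync::broadcast;

#[derive(Clone, Debug)]
struct DomainEvent; // placeholder payload

#[tokio::main]
async fn main() {
    // broadcast_buffer_size: each subscriber may lag by up to 2000 events.
    let (tx, mut rx) = broadcast::channel::<DomainEvent>(2_000);

    tokio::spawn(async move {
        loop {
            match rx.recv().await {
                Ok(_event) => { /* handle the in-process domain event */ }
                Err(broadcast::error::RecvError::Lagged(missed)) => {
                    // log_subscriber_errors: this subscriber fell behind and lost events.
                    eprintln!("subscriber lagged and missed {missed} events");
                }
                Err(broadcast::error::RecvError::Closed) => break,
            }
        }
    });

    let _ = tx.send(DomainEvent);
}
```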

orchestration_client

Path: worker.orchestration_client

| Parameter | Type | Default | Description |
|---|---|---|---|
| base_url | String | "http://localhost:8080" | Base URL of the orchestration REST API that this worker reports to |
| max_retries | u32 | 3 | Maximum retry attempts for failed orchestration API calls |
| timeout_ms | u32 | 30000 | HTTP request timeout in milliseconds for orchestration API calls |

worker.orchestration_client.base_url

Base URL of the orchestration REST API that this worker reports to

  • Type: String
  • Default: "http://localhost:8080"
  • Valid Range: valid HTTP(S) URL
  • System Impact: Workers send step completion results and health reports to this endpoint

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | http://orchestration:8080 | Container-internal DNS in Kubernetes/Docker |
| test | http://localhost:8080 | Local orchestration for testing |

Related: orchestration.web.bind_address

worker.orchestration_client.max_retries

Maximum retry attempts for failed orchestration API calls

  • Type: u32
  • Default: 3
  • Valid Range: 0-10
  • System Impact: Retries use backoff; higher values improve resilience to transient network issues

worker.orchestration_client.timeout_ms

HTTP request timeout in milliseconds for orchestration API calls

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Worker-to-orchestration calls exceeding this timeout fail and may be retried
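
Putting base_url, timeout_ms, and max_retries together, a worker's call to the orchestration API might look like the following reqwest sketch; the endpoint path and the pause between attempts are assumptions, not the actual client.

```rust
use std::time::Duration;

async fn post_step_result(body: &serde_json::Value) -> Result<(), reqwest::Error> {
    // timeout_ms: applies to every orchestration API call made by this client.
    let client = reqwest::Client::builder()
        .timeout(Duration::from_millis(30_000))
        .build()?;

    let max_retries: u32 = 3; // worker.orchestration_client.max_retries
    let mut attempt: u32 = 0;
    loop {
        // base_url is the configured orchestration endpoint; the path here is illustrative.
        let result = client
            .post("http://localhost:8080/v1/steps/results")
            .json(body)
            .send()
            .await;

        match result {
            Ok(resp) if resp.status().is_success() => return Ok(()),
            _ if attempt < max_retries => {
                attempt += 1;
                // Simple exponential pause between attempts; the real backoff policy may differ.
                tokio::time::sleep(Duration::from_millis(100 * 2u64.pow(attempt))).await;
            }
            Ok(resp) => return resp.error_for_status().map(|_| ()),
            Err(err) => return Err(err),
        }
    }
}
```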

web

Path: worker.web

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}" | Socket address for the worker REST API server |
| enabled | bool | true | Enable the REST API server for the worker service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a worker HTTP request to complete before timeout |

worker.web.bind_address

Socket address for the worker REST API server

  • Type: String
  • Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}"
  • Valid Range: host:port
  • System Impact: Must not conflict with orchestration.web.bind_address when co-located; default 8081

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8081 | Standard worker port; use TASKER_WEB_BIND_ADDRESS env var to override |
| test | 0.0.0.0:8081 | Default port offset from orchestration (8080) |

worker.web.enabled

Enable the REST API server for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no HTTP endpoints are available; the worker operates via messaging only

worker.web.request_timeout_ms

Maximum time in milliseconds for a worker HTTP request to complete before timeout

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections

auth

Path: worker.web.auth

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication on the worker API |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication on the worker API |
| enabled | bool | false | Enable authentication for the worker REST API |
| jwt_audience | String | "worker-api" | Expected ‘aud’ claim in JWT tokens for the worker API |
| jwt_issuer | String | "tasker-worker" | Expected ‘iss’ claim in JWT tokens for the worker API |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if the worker issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures on the worker API |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for worker JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours for worker API tokens |

worker.web.auth.api_key

Static API key for simple key-based authentication on the worker API

  • Type: String
  • Default: ""
  • Valid Range: non-empty string or empty to disable
  • System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header

worker.web.auth.api_key_header

HTTP header name for API key authentication on the worker API

  • Type: String
  • Default: "X-API-Key"
  • Valid Range: valid HTTP header name
  • System Impact: Clients send their API key in this header; default is X-API-Key

worker.web.auth.enabled

Enable authentication for the worker REST API

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all worker API endpoints are unauthenticated

worker.web.auth.jwt_audience

Expected ‘aud’ claim in JWT tokens for the worker API

  • Type: String
  • Default: "worker-api"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different audience are rejected during validation

worker.web.auth.jwt_issuer

Expected ‘iss’ claim in JWT tokens for the worker API

  • Type: String
  • Default: "tasker-worker"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different issuer are rejected; default ‘tasker-worker’ distinguishes worker tokens from orchestration tokens

worker.web.auth.jwt_private_key

PEM-encoded private key for signing JWT tokens (if the worker issues tokens)

  • Type: String
  • Default: ""
  • Valid Range: valid PEM private key or empty
  • System Impact: Required only if the worker service issues its own JWT tokens; typically empty

worker.web.auth.jwt_public_key

PEM-encoded public key for verifying JWT token signatures on the worker API

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY:-}"
  • Valid Range: valid PEM public key or empty
  • System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management

worker.web.auth.jwt_public_key_path

File path to a PEM-encoded public key for worker JWT verification

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
  • Valid Range: valid file path or empty
  • System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file

worker.web.auth.jwt_token_expiry_hours

Default JWT token validity period in hours for worker API tokens

  • Type: u32
  • Default: 24
  • Valid Range: 1-720
  • System Impact: Tokens older than this are rejected; shorter values improve security

database_pools

Path: worker.web.database_pools

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 25 | Advisory hint for the total number of database connections across all worker pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the worker web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle worker web API connection is closed |
| web_api_max_connections | u32 | 15 | Maximum number of connections the worker web API pool can grow to under load |
| web_api_pool_size | u32 | 10 | Target number of connections in the worker web API database pool |

worker.web.database_pools.max_total_connections_hint

Advisory hint for the total number of database connections across all worker pools

  • Type: u32
  • Default: 25
  • Valid Range: 1-1000
  • System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint

worker.web.database_pools.web_api_connection_timeout_seconds

Maximum time to wait when acquiring a connection from the worker web API pool

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: Worker API requests that cannot acquire a connection within this window return an error

worker.web.database_pools.web_api_idle_timeout_seconds

Time before an idle worker web API connection is closed

  • Type: u32
  • Default: 600
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the worker web API pool shrinks after traffic subsides

worker.web.database_pools.web_api_max_connections

Maximum number of connections the worker web API pool can grow to under load

  • Type: u32
  • Default: 15
  • Valid Range: 1-500
  • System Impact: Hard ceiling for worker web API database connections

worker.web.database_pools.web_api_pool_size

Target number of connections in the worker web API database pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-200
  • System Impact: Determines how many concurrent database queries the worker REST API can execute; smaller than orchestration

Generated by tasker-ctl docs · Tasker Configuration System

Configuration Reference: orchestration

91/91 parameters documented


orchestration

Root-level orchestration parameters

Path: orchestration

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_performance_logging | bool | true | Enable detailed performance logging for orchestration actors |
| shutdown_timeout_ms | u64 | 30000 | Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown |

orchestration.enable_performance_logging

Enable detailed performance logging for orchestration actors

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Emits timing metrics for task processing, step enqueueing, and result evaluation; disable in production if log volume is a concern

orchestration.shutdown_timeout_ms

Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown

  • Type: u64
  • Default: 30000
  • Valid Range: 1000-300000
  • System Impact: If shutdown exceeds this timeout, the process exits forcefully to avoid hanging indefinitely; 30s is conservative for most deployments

batch_processing

Path: orchestration.batch_processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| checkpoint_stall_minutes | u32 | 15 | Minutes without a checkpoint update before a batch is considered stalled |
| default_batch_size | u32 | 1000 | Default number of items in a single batch when not specified by the handler |
| enabled | bool | true | Enable the batch processing subsystem for large-scale step execution |
| max_parallel_batches | u32 | 50 | Maximum number of batch operations that can execute concurrently |

orchestration.batch_processing.checkpoint_stall_minutes

Minutes without a checkpoint update before a batch is considered stalled

  • Type: u32
  • Default: 15
  • Valid Range: 1-1440
  • System Impact: Stalled batches are flagged for investigation or automatic recovery; lower values detect issues faster

orchestration.batch_processing.default_batch_size

Default number of items in a single batch when not specified by the handler

  • Type: u32
  • Default: 1000
  • Valid Range: 1-100000
  • System Impact: Larger batches improve throughput but increase memory usage and per-batch latency

orchestration.batch_processing.enabled

Enable the batch processing subsystem for large-scale step execution

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, batch step handlers cannot be used; all steps must be processed individually

orchestration.batch_processing.max_parallel_batches

Maximum number of batch operations that can execute concurrently

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Bounds resource usage from concurrent batch processing; increase for high-throughput batch workloads

decision_points

Path: orchestration.decision_points

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_detailed_logging | bool | false | Enable verbose logging of decision point evaluation including expression results |
| enable_metrics | bool | true | Enable metrics collection for decision point evaluations |
| enabled | bool | true | Enable the decision point evaluation subsystem for conditional workflow branching |
| max_decision_depth | u32 | 20 | Maximum depth of nested decision point chains |
| max_steps_per_decision | u32 | 100 | Maximum number of steps that can be generated by a single decision point evaluation |
| warn_threshold_depth | u32 | 10 | Decision depth above which a warning is logged |
| warn_threshold_steps | u32 | 50 | Number of steps per decision above which a warning is logged |

orchestration.decision_points.enable_detailed_logging

Enable verbose logging of decision point evaluation including expression results

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: Produces high-volume logs; enable only for debugging specific decision point behavior

orchestration.decision_points.enable_metrics

Enable metrics collection for decision point evaluations

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks evaluation counts, timings, and branch selection distribution

orchestration.decision_points.enabled

Enable the decision point evaluation subsystem for conditional workflow branching

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, all decision points are skipped and conditional steps are not evaluated

orchestration.decision_points.max_decision_depth

Maximum depth of nested decision point chains

  • Type: u32
  • Default: 20
  • Valid Range: 1-100
  • System Impact: Prevents infinite recursion from circular decision point references

orchestration.decision_points.max_steps_per_decision

Maximum number of steps that can be generated by a single decision point evaluation

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Safety limit to prevent decision points from creating unbounded step graphs

orchestration.decision_points.warn_threshold_depth

Decision depth above which a warning is logged

  • Type: u32
  • Default: 10
  • Valid Range: 1-100
  • System Impact: Observability: identifies deeply nested decision chains that may indicate design issues

orchestration.decision_points.warn_threshold_steps

Number of steps per decision above which a warning is logged

  • Type: u32
  • Default: 50
  • Valid Range: 1-10000
  • System Impact: Observability: identifies decision points that generate unusually large step sets

dlq

Path: orchestration.dlq

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable the Dead Letter Queue subsystem for handling unrecoverable tasks |

orchestration.dlq.enabled

Enable the Dead Letter Queue subsystem for handling unrecoverable tasks

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, stale or failed tasks remain in their error state without DLQ routing

staleness_detection

Path: orchestration.dlq.staleness_detection

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 100 | Number of potentially stale tasks to evaluate in a single detection sweep |
| detection_interval_seconds | u32 | 300 | Interval in seconds between staleness detection sweeps |
| dry_run | bool | false | Run staleness detection in observation-only mode without taking action |
| enabled | bool | true | Enable periodic scanning for stale tasks |

orchestration.dlq.staleness_detection.batch_size

Number of potentially stale tasks to evaluate in a single detection sweep

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Larger batches process more stale tasks per sweep but increase per-sweep query cost

orchestration.dlq.staleness_detection.detection_interval_seconds

Interval in seconds between staleness detection sweeps

  • Type: u32
  • Default: 300
  • Valid Range: 30-3600
  • System Impact: Lower values detect stale tasks faster but increase database query frequency

orchestration.dlq.staleness_detection.dry_run

Run staleness detection in observation-only mode without taking action

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: Logs what would be DLQ’d without actually transitioning tasks; useful for tuning thresholds

orchestration.dlq.staleness_detection.enabled

Enable periodic scanning for stale tasks

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no automatic staleness detection runs; tasks must be manually DLQ’d

actions

Path: orchestration.dlq.staleness_detection.actions

| Parameter | Type | Default | Description |
|---|---|---|---|
| auto_move_to_dlq | bool | true | Automatically move stale tasks to the DLQ after transitioning to error |
| auto_transition_to_error | bool | true | Automatically transition stale tasks to the Error state |
| emit_events | bool | true | Emit domain events when staleness is detected |
| event_channel | String | "task_staleness_detected" | PGMQ channel name for staleness detection events |

orchestration.dlq.staleness_detection.actions.auto_move_to_dlq

Automatically move stale tasks to the DLQ after transitioning to error

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, stale tasks are routed to the DLQ; when false, they remain in Error state for manual review

orchestration.dlq.staleness_detection.actions.auto_transition_to_error

Automatically transition stale tasks to the Error state

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, stale tasks are moved to Error before DLQ routing; when false, tasks stay in their current state

orchestration.dlq.staleness_detection.actions.emit_events

Emit domain events when staleness is detected

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, staleness events are published to the event_channel for external alerting or custom handling

orchestration.dlq.staleness_detection.actions.event_channel

PGMQ channel name for staleness detection events

  • Type: String
  • Default: "task_staleness_detected"
  • Valid Range: 1-255 characters
  • System Impact: Consumers can subscribe to this channel for alerting or custom staleness handling

thresholds

Path: orchestration.dlq.staleness_detection.thresholds

| Parameter | Type | Default | Description |
|---|---|---|---|
| steps_in_process_minutes | u32 | 30 | Minutes a task can have steps in process before being considered stale |
| task_max_lifetime_hours | u32 | 24 | Absolute maximum lifetime for any task regardless of state |
| waiting_for_dependencies_minutes | u32 | 60 | Minutes a task can wait for step dependencies before being considered stale |
| waiting_for_retry_minutes | u32 | 30 | Minutes a task can wait for step retries before being considered stale |

orchestration.dlq.staleness_detection.thresholds.steps_in_process_minutes

Minutes a task can have steps in process before being considered stale

  • Type: u32
  • Default: 30
  • Valid Range: 1-1440
  • System Impact: Tasks in StepsInProcess state exceeding this age may have hung workers; flags for investigation

orchestration.dlq.staleness_detection.thresholds.task_max_lifetime_hours

Absolute maximum lifetime for any task regardless of state

  • Type: u32
  • Default: 24
  • Valid Range: 1-168
  • System Impact: Hard cap; tasks exceeding this age are considered stale even if actively processing

orchestration.dlq.staleness_detection.thresholds.waiting_for_dependencies_minutes

Minutes a task can wait for step dependencies before being considered stale

  • Type: u32
  • Default: 60
  • Valid Range: 1-1440
  • System Impact: Tasks in WaitingForDependencies state exceeding this age are flagged for DLQ consideration

orchestration.dlq.staleness_detection.thresholds.waiting_for_retry_minutes

Minutes a task can wait for step retries before being considered stale

  • Type: u32
  • Default: 30
  • Valid Range: 1-1440
  • System Impact: Tasks in WaitingForRetry state exceeding this age are flagged for DLQ consideration
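
These thresholds are per-state age limits plus a global lifetime cap. A sketch of how a detection sweep might evaluate them; the state enum mirrors the names in this section and is illustrative, not the engine's types.

```rust
use std::time::Duration;

/// Task states referenced by the thresholds above (illustrative).
enum TaskState {
    StepsInProcess,
    WaitingForDependencies,
    WaitingForRetry,
    Other,
}

/// A task is stale once it exceeds its per-state threshold, or task_max_lifetime_hours overall.
fn is_stale(state: &TaskState, time_in_state: Duration, total_age: Duration) -> bool {
    let max_lifetime = Duration::from_secs(24 * 3600); // task_max_lifetime_hours
    let state_threshold = match state {
        TaskState::StepsInProcess => Duration::from_secs(30 * 60), // steps_in_process_minutes
        TaskState::WaitingForDependencies => Duration::from_secs(60 * 60), // waiting_for_dependencies_minutes
        TaskState::WaitingForRetry => Duration::from_secs(30 * 60), // waiting_for_retry_minutes
        TaskState::Other => max_lifetime,
    };
    total_age >= max_lifetime || time_in_state >= state_threshold
}

fn main() {
    let stale = is_stale(
        &TaskState::WaitingForDependencies,
        Duration::from_secs(61 * 60),
        Duration::from_secs(2 * 3600),
    );
    assert!(stale); // 61 minutes waiting for dependencies exceeds the 60-minute threshold
}
```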

event_systems

Path: orchestration.event_systems

orchestration

Path: orchestration.event_systems.orchestration

| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "orchestration-event-system" | Unique identifier for the orchestration event system instance |

orchestration.event_systems.orchestration.deployment_mode

Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency

orchestration.event_systems.orchestration.system_id

Unique identifier for the orchestration event system instance

  • Type: String
  • Default: "orchestration-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish this event system from others

health

Path: orchestration.event_systems.orchestration.health

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the orchestration event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the event system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the event system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics collection for event processing |

orchestration.event_systems.orchestration.health.enabled

Enable health monitoring for the orchestration event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks or error tracking run for this event system

orchestration.event_systems.orchestration.health.error_rate_threshold_per_minute

Error rate per minute above which the event system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal; complements max_consecutive_errors for burst error detection

orchestration.event_systems.orchestration.health.max_consecutive_errors

Number of consecutive errors before the event system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation after sustained failures; resets on any success

orchestration.event_systems.orchestration.health.performance_monitoring_enabled

Enable detailed performance metrics collection for event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks processing latency percentiles and throughput; adds minor overhead

processing

Path: orchestration.event_systems.orchestration.processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read |
| max_concurrent_operations | u32 | 50 | Maximum number of events processed concurrently by the orchestration event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed event processing operation |

orchestration.event_systems.orchestration.processing.batch_size

Number of events dequeued in a single batch read

  • Type: u32
  • Default: 20
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput but increase per-batch processing time

orchestration.event_systems.orchestration.processing.max_concurrent_operations

Maximum number of events processed concurrently by the orchestration event system

  • Type: u32
  • Default: 50
  • Valid Range: 1-10000
  • System Impact: Controls parallelism for task request, result, and finalization processing

orchestration.event_systems.orchestration.processing.max_retries

Maximum retry attempts for a failed event processing operation

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Events exceeding this retry count are dropped or sent to the DLQ

backoff

Path: orchestration.event_systems.orchestration.processing.backoff

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first event processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between event processing retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |

timing

Path: orchestration.event_systems.orchestration.timing

| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds an event claim remains valid |
| fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the orchestration event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued message remains invisible to other consumers |

orchestration.event_systems.orchestration.timing.claim_timeout_seconds

Maximum time in seconds an event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned claims from blocking event processing indefinitely

orchestration.event_systems.orchestration.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Only active in Hybrid mode when event-driven delivery fails; lower values reduce latency but increase DB load

orchestration.event_systems.orchestration.timing.health_check_interval_seconds

Interval in seconds between health check probes for the orchestration event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the event system verifies its own connectivity and responsiveness

orchestration.event_systems.orchestration.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Events exceeding this timeout are considered failed and may be retried

orchestration.event_systems.orchestration.timing.visibility_timeout_seconds

Time in seconds a dequeued message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: If processing is not completed within this window, the message becomes visible again for redelivery

task_readiness

Path: orchestration.event_systems.task_readiness

| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "task-readiness-event-system" | Unique identifier for the task readiness event system instance |

orchestration.event_systems.task_readiness.deployment_mode

Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; task readiness events trigger step processing and benefit from low-latency delivery

orchestration.event_systems.task_readiness.system_id

Unique identifier for the task readiness event system instance

  • Type: String
  • Default: "task-readiness-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish task readiness events from other event systems

health

Path: orchestration.event_systems.task_readiness.health

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the task readiness event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the task readiness system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the task readiness system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for task readiness event processing |

orchestration.event_systems.task_readiness.health.enabled

Enable health monitoring for the task readiness event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks run for task readiness processing

orchestration.event_systems.task_readiness.health.error_rate_threshold_per_minute

Error rate per minute above which the task readiness system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal complementing max_consecutive_errors

orchestration.event_systems.task_readiness.health.max_consecutive_errors

Number of consecutive errors before the task readiness system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation; resets on any successful readiness check

orchestration.event_systems.task_readiness.health.performance_monitoring_enabled

Enable detailed performance metrics for task readiness event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks readiness check latency and throughput; useful for tuning batch_size and concurrency

processing

Path: orchestration.event_systems.task_readiness.processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 50 | Number of task readiness events dequeued in a single batch |
| max_concurrent_operations | u32 | 100 | Maximum number of task readiness events processed concurrently |
| max_retries | u32 | 3 | Maximum retry attempts for a failed task readiness event |

orchestration.event_systems.task_readiness.processing.batch_size

Number of task readiness events dequeued in a single batch

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput for readiness evaluation; 50 balances latency and throughput

orchestration.event_systems.task_readiness.processing.max_concurrent_operations

Maximum number of task readiness events processed concurrently

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Higher than orchestration (100 vs 50) because readiness checks are lightweight SQL queries

orchestration.event_systems.task_readiness.processing.max_retries

Maximum retry attempts for a failed task readiness event

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Readiness events are idempotent so retries are safe; limits retry storms

backoff

Path: orchestration.event_systems.task_readiness.processing.backoff

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first task readiness processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay for readiness retries |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds for task readiness retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive readiness failure |

timing

Path: orchestration.event_systems.task_readiness.timing

| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds a task readiness event claim remains valid |
| fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles for task readiness |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the task readiness event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single task readiness event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued task readiness message remains invisible to other consumers |

orchestration.event_systems.task_readiness.timing.claim_timeout_seconds

Maximum time in seconds a task readiness event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned readiness claims from blocking task evaluation

orchestration.event_systems.task_readiness.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles for task readiness

  • Type: u32
  • Default: 5
  • Valid Range: 1-60
  • System Impact: Fallback interval when LISTEN/NOTIFY is unavailable; lower values improve responsiveness

orchestration.event_systems.task_readiness.timing.health_check_interval_seconds

Interval in seconds between health check probes for the task readiness event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the task readiness system verifies its own connectivity

orchestration.event_systems.task_readiness.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single task readiness event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Readiness events exceeding this timeout are considered failed

orchestration.event_systems.task_readiness.timing.visibility_timeout_seconds

Time in seconds a dequeued task readiness message remains invisible to other consumers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Prevents duplicate processing of readiness events during normal operation

grpc

Path: orchestration.grpc

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}" | Socket address for the gRPC server |
| enable_health_service | bool | true | Enable the gRPC health checking service (grpc.health.v1) |
| enable_reflection | bool | true | Enable gRPC server reflection for service discovery |
| enabled | bool | true | Enable the gRPC API server alongside the REST API |
| keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames |
| keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
| max_concurrent_streams | u32 | 200 | Maximum number of concurrent gRPC streams per connection |
| max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame |
| tls_enabled | bool | false | Enable TLS encryption for gRPC connections |

orchestration.grpc.bind_address

Socket address for the gRPC server

  • Type: String
  • Default: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
  • Valid Range: host:port
  • System Impact: Must not conflict with the REST API bind_address; default 9190 avoids Prometheus port conflict

orchestration.grpc.enable_health_service

Enable the gRPC health checking service (grpc.health.v1)

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators

orchestration.grpc.enable_reflection

Enable gRPC server reflection for service discovery

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Allows tools like grpcurl to list and inspect services; safe to enable in development, consider disabling in production

orchestration.grpc.enabled

Enable the gRPC API server alongside the REST API

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no gRPC endpoints are available; clients must use REST

orchestration.grpc.keepalive_interval_seconds

Interval in seconds between gRPC keepalive ping frames

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Detects dead connections; lower values detect failures faster but increase network overhead

orchestration.grpc.keepalive_timeout_seconds

Time in seconds to wait for a keepalive ping acknowledgment before closing the connection

  • Type: u32
  • Default: 20
  • Valid Range: 1-300
  • System Impact: Connections that fail to acknowledge within this window are considered dead and closed

orchestration.grpc.max_concurrent_streams

Maximum number of concurrent gRPC streams per connection

  • Type: u32
  • Default: 200
  • Valid Range: 1-10000
  • System Impact: Limits multiplexed request parallelism per connection; 200 is conservative for orchestration workloads

orchestration.grpc.max_frame_size

Maximum size in bytes of a single HTTP/2 frame

  • Type: u32
  • Default: 16384
  • Valid Range: 16384-16777215
  • System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream

orchestration.grpc.tls_enabled

Enable TLS encryption for gRPC connections

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When true, tls_cert_path and tls_key_path must be provided; required for production gRPC deployments
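
Because tls_enabled = true also requires certificate and key paths, a production-leaning override might look like the sketch below. The [grpc] table name assumes the documented orchestration.grpc path minus the role prefix, and the certificate file locations are illustrative only:

[grpc]
enabled = true
bind_address = "0.0.0.0:9190"
tls_enabled = true
tls_cert_path = "/etc/tasker/tls/server.crt"   # required when tls_enabled = true
tls_key_path = "/etc/tasker/tls/server.key"    # required when tls_enabled = true
enable_reflection = false                      # consider disabling in production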

mpsc_channels

Path: orchestration.mpsc_channels

command_processor

Path: orchestration.mpsc_channels.command_processor

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 5000 | Bounded channel capacity for the orchestration command processor |

orchestration.mpsc_channels.command_processor.command_buffer_size

Bounded channel capacity for the orchestration command processor

  • Type: usize
  • Default: 5000
  • Valid Range: 100-100000
  • System Impact: Buffers incoming orchestration commands; larger values absorb traffic spikes but use more memory

event_listeners

Path: orchestration.mpsc_channels.event_listeners

| Parameter | Type | Default | Description |
|---|---|---|---|
| pgmq_event_buffer_size | usize | 50000 | Bounded channel capacity for PGMQ event listener notifications |

orchestration.mpsc_channels.event_listeners.pgmq_event_buffer_size

Bounded channel capacity for PGMQ event listener notifications

  • Type: usize
  • Default: 50000
  • Valid Range: 1000-1000000
  • System Impact: Large buffer (50000) absorbs high-volume PGMQ LISTEN/NOTIFY events without backpressure on PostgreSQL

event_systems

Path: orchestration.mpsc_channels.event_systems

| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 10000 | Bounded channel capacity for the orchestration event system internal channel |

orchestration.mpsc_channels.event_systems.event_channel_buffer_size

Bounded channel capacity for the orchestration event system internal channel

  • Type: usize
  • Default: 10000
  • Valid Range: 100-100000
  • System Impact: Buffers events between the event listener and event processor; larger values absorb notification bursts
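
The three orchestration channel capacities trade memory for burst absorption: commands, raw PGMQ notifications, and the listener-to-processor event channel are buffered independently. A hedged tuning sketch for spikier traffic (section names again assume the documented path minus the orchestration. prefix):

[mpsc_channels.command_processor]
command_buffer_size = 10000        # default 5000; absorbs larger command spikes

[mpsc_channels.event_listeners]
pgmq_event_buffer_size = 100000    # default 50000; LISTEN/NOTIFY burst headroom

[mpsc_channels.event_systems]
event_channel_buffer_size = 20000  # default 10000; listener-to-processor buffer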

web

Path: orchestration.web

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}" | Socket address for the REST API server |
| enabled | bool | true | Enable the REST API server for the orchestration service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for an HTTP request to complete before timeout |

orchestration.web.bind_address

Socket address for the REST API server

  • Type: String
  • Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
  • Valid Range: host:port
  • System Impact: Determines where the orchestration REST API listens; use 0.0.0.0 for container deployments

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8080 | Standard port; use TASKER_WEB_BIND_ADDRESS env var to override in CI |
| test | 0.0.0.0:8080 | Default port for test fixtures |

orchestration.web.enabled

Enable the REST API server for the orchestration service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no HTTP endpoints are available; the service operates via messaging only

orchestration.web.request_timeout_ms

Maximum time in milliseconds for an HTTP request to complete before timeout

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections

auth

Path: orchestration.web.auth

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication |
| enabled | bool | false | Enable authentication for the REST API |
| jwt_audience | String | "tasker-api" | Expected ‘aud’ claim in JWT tokens |
| jwt_issuer | String | "tasker-core" | Expected ‘iss’ claim in JWT tokens |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if this service issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours |

orchestration.web.auth.api_key

Static API key for simple key-based authentication

  • Type: String
  • Default: ""
  • Valid Range: non-empty string or empty to disable
  • System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header

orchestration.web.auth.api_key_header

HTTP header name for API key authentication

  • Type: String
  • Default: "X-API-Key"
  • Valid Range: valid HTTP header name
  • System Impact: Clients send their API key in this header; default is X-API-Key

orchestration.web.auth.enabled

Enable authentication for the REST API

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all API endpoints are unauthenticated; enable in production with JWT or API key auth

orchestration.web.auth.jwt_audience

Expected ‘aud’ claim in JWT tokens

  • Type: String
  • Default: "tasker-api"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different audience are rejected during validation

orchestration.web.auth.jwt_issuer

Expected ‘iss’ claim in JWT tokens

  • Type: String
  • Default: "tasker-core"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different issuer are rejected during validation

orchestration.web.auth.jwt_private_key

PEM-encoded private key for signing JWT tokens (if this service issues tokens)

  • Type: String
  • Default: ""
  • Valid Range: valid PEM private key or empty
  • System Impact: Required only if the orchestration service issues its own JWT tokens; leave empty when using external identity providers

orchestration.web.auth.jwt_public_key

PEM-encoded public key for verifying JWT token signatures

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY:-}"
  • Valid Range: valid PEM public key or empty
  • System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management in production

orchestration.web.auth.jwt_public_key_path

File path to a PEM-encoded public key for JWT verification

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
  • Valid Range: valid file path or empty
  • System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file

orchestration.web.auth.jwt_token_expiry_hours

Default JWT token validity period in hours

  • Type: u32
  • Default: 24
  • Valid Range: 1-720
  • System Impact: Tokens older than this are rejected; shorter values improve security but require more frequent re-authentication

database_pools

Path: orchestration.web.database_pools

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 50 | Advisory hint for the total number of database connections across all orchestration pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle web API connection is closed |
| web_api_max_connections | u32 | 30 | Maximum number of connections the web API pool can grow to under load |
| web_api_pool_size | u32 | 20 | Target number of connections in the web API database pool |

orchestration.web.database_pools.max_total_connections_hint

Advisory hint for the total number of database connections across all orchestration pools

  • Type: u32
  • Default: 50
  • Valid Range: 1-1000
  • System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint

orchestration.web.database_pools.web_api_connection_timeout_seconds

Maximum time to wait when acquiring a connection from the web API pool

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: API requests that cannot acquire a connection within this window return an error

orchestration.web.database_pools.web_api_idle_timeout_seconds

Time before an idle web API connection is closed

  • Type: u32
  • Default: 600
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the web API pool shrinks after traffic subsides

orchestration.web.database_pools.web_api_max_connections

Maximum number of connections the web API pool can grow to under load

  • Type: u32
  • Default: 30
  • Valid Range: 1-500
  • System Impact: Hard ceiling for web API database connections; prevents connection exhaustion from traffic spikes

orchestration.web.database_pools.web_api_pool_size

Target number of connections in the web API database pool

  • Type: u32
  • Default: 20
  • Valid Range: 1-200
  • System Impact: Determines how many concurrent database queries the REST API can execute
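
These pool settings are related: web_api_pool_size is the steady-state target, web_api_max_connections is the hard ceiling it can grow to, and max_total_connections_hint is an advisory figure that should cover all orchestration pools combined. A sketch keeping that ordering intact (TOML path assumed as above):

[web.database_pools]
web_api_pool_size = 20                   # steady-state connections
web_api_max_connections = 30             # burst ceiling; keep >= pool_size
web_api_connection_timeout_seconds = 30  # error out if no connection within 30 s
web_api_idle_timeout_seconds = 600       # shrink back after traffic subsides
max_total_connections_hint = 50          # advisory only; logged if exceeded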


Configuration Reference: worker

90/90 parameters documented


worker

Root-level worker parameters

Path: worker

| Parameter | Type | Default | Description |
|---|---|---|---|
| worker_id | String | "worker-default-001" | Unique identifier for this worker instance |
| worker_type | String | "general" | Worker type classification for routing and reporting |

worker.worker_id

Unique identifier for this worker instance

  • Type: String
  • Default: "worker-default-001"
  • Valid Range: non-empty string
  • System Impact: Used in logging, metrics, and step claim attribution; must be unique across all worker instances in a cluster

worker.worker_type

Worker type classification for routing and reporting

  • Type: String
  • Default: "general"
  • Valid Range: non-empty string
  • System Impact: Used to match worker capabilities with step handler requirements; ‘general’ handles all step types

circuit_breakers

Path: worker.circuit_breakers

ffi_completion_send

Path: worker.circuit_breakers.ffi_completion_send

| Parameter | Type | Default | Description |
|---|---|---|---|
| failure_threshold | u32 | 5 | Number of consecutive FFI completion send failures before the circuit breaker trips |
| recovery_timeout_seconds | u32 | 5 | Time the FFI completion breaker stays Open before probing with a test send |
| slow_send_threshold_ms | u32 | 100 | Threshold in milliseconds above which FFI completion channel sends are logged as slow |
| success_threshold | u32 | 2 | Consecutive successful sends in Half-Open required to close the breaker |

worker.circuit_breakers.ffi_completion_send.failure_threshold

Number of consecutive FFI completion send failures before the circuit breaker trips

  • Type: u32
  • Default: 5
  • Valid Range: 1-100
  • System Impact: Protects the FFI completion channel from cascading failures; when tripped, sends are short-circuited

worker.circuit_breakers.ffi_completion_send.recovery_timeout_seconds

Time the FFI completion breaker stays Open before probing with a test send

  • Type: u32
  • Default: 5
  • Valid Range: 1-300
  • System Impact: Short timeout (5s) because FFI channel issues are typically transient

worker.circuit_breakers.ffi_completion_send.slow_send_threshold_ms

Threshold in milliseconds above which FFI completion channel sends are logged as slow

  • Type: u32
  • Default: 100
  • Valid Range: 10-10000
  • System Impact: Observability: identifies when the FFI completion channel is under pressure from slow consumers

worker.circuit_breakers.ffi_completion_send.success_threshold

Consecutive successful sends in Half-Open required to close the breaker

  • Type: u32
  • Default: 2
  • Valid Range: 1-100
  • System Impact: Low threshold (2) allows fast recovery since FFI send failures are usually transient
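
These four values describe a conventional Closed → Open → Half-Open cycle: the breaker trips after failure_threshold consecutive failed sends, stays Open for recovery_timeout_seconds, then needs success_threshold clean sends in Half-Open to close again. A worker.toml sketch with the defaults spelled out (section name assumes the documented worker.circuit_breakers path minus the worker. prefix):

[circuit_breakers.ffi_completion_send]
failure_threshold = 5          # consecutive failures before tripping Open
recovery_timeout_seconds = 5   # time spent Open before a Half-Open probe
success_threshold = 2          # clean Half-Open sends required to close
slow_send_threshold_ms = 100   # sends slower than this are logged as slow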

event_systems

Path: worker.event_systems

worker

Path: worker.event_systems.worker

| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "worker-event-system" | Unique identifier for the worker event system instance |

worker.event_systems.worker.deployment_mode

Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’

  • Type: DeploymentMode
  • Default: "Hybrid"
  • Valid Range: Hybrid | EventDrivenOnly | PollingOnly
  • System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
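
Switching delivery modes is a single setting. For example, an environment where PostgreSQL LISTEN/NOTIFY is unavailable could force polling, as in this sketch (section name assumes the documented worker.event_systems.worker path minus the worker. prefix):

[event_systems.worker]
system_id = "worker-event-system"
deployment_mode = "PollingOnly"   # alternatives: "Hybrid" (default) or "EventDrivenOnly"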

worker.event_systems.worker.system_id

Unique identifier for the worker event system instance

  • Type: String
  • Default: "worker-event-system"
  • Valid Range: non-empty string
  • System Impact: Used in logging and metrics to distinguish this event system from others

health

Path: worker.event_systems.worker.health

| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the worker event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the worker event system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the worker event system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for worker event processing |

worker.event_systems.worker.health.enabled

Enable health monitoring for the worker event system

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no health checks or error tracking run for the worker event system

worker.event_systems.worker.health.error_rate_threshold_per_minute

Error rate per minute above which the worker event system reports as unhealthy

  • Type: u32
  • Default: 20
  • Valid Range: 1-10000
  • System Impact: Rate-based health signal complementing max_consecutive_errors

worker.event_systems.worker.health.max_consecutive_errors

Number of consecutive errors before the worker event system reports as unhealthy

  • Type: u32
  • Default: 10
  • Valid Range: 1-1000
  • System Impact: Triggers health status degradation; resets on any successful event processing

worker.event_systems.worker.health.performance_monitoring_enabled

Enable detailed performance metrics for worker event processing

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Tracks step dispatch latency and throughput; useful for tuning concurrency settings

metadata

Path: worker.event_systems.worker.metadata

fallback_poller

Path: worker.event_systems.worker.metadata.fallback_poller

| Parameter | Type | Default | Description |
|---|---|---|---|
| age_threshold_seconds | u32 | 5 | Minimum age in seconds of a message before the fallback poller will pick it up |
| batch_size | u32 | 20 | Number of messages to dequeue in a single fallback poll |
| enabled | bool | true | Enable the fallback polling mechanism for step dispatch |
| max_age_hours | u32 | 24 | Maximum age in hours of messages the fallback poller will process |
| polling_interval_ms | u32 | 1000 | Interval in milliseconds between fallback polling cycles |
| supported_namespaces | Vec<String> | [] | List of queue namespaces the fallback poller monitors; empty means all namespaces |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a message polled by the fallback mechanism remains invisible |

in_process_events

Path: worker.event_systems.worker.metadata.in_process_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| deduplication_cache_size | usize | 10000 | Number of event IDs to cache for deduplication of in-process events |
| ffi_integration_enabled | bool | true | Enable FFI integration for in-process event delivery to Ruby/Python workers |

listener

Path: worker.event_systems.worker.metadata.listener

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_processing | bool | true | Enable batch processing of accumulated LISTEN/NOTIFY events |
| connection_timeout_seconds | u32 | 30 | Maximum time to wait when establishing the LISTEN/NOTIFY PostgreSQL connection |
| event_timeout_seconds | u32 | 60 | Maximum time to wait for a LISTEN/NOTIFY event before yielding |
| max_retry_attempts | u32 | 5 | Maximum number of listener reconnection attempts before falling back to polling |
| retry_interval_seconds | u32 | 5 | Interval in seconds between LISTEN/NOTIFY listener reconnection attempts |

processing

Path: worker.event_systems.worker.processing

| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read by the worker |
| max_concurrent_operations | u32 | 100 | Maximum number of events processed concurrently by the worker event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed worker event processing operation |

worker.event_systems.worker.processing.batch_size

Number of events dequeued in a single batch read by the worker

  • Type: u32
  • Default: 20
  • Valid Range: 1-1000
  • System Impact: Larger batches improve throughput but increase per-batch processing time

worker.event_systems.worker.processing.max_concurrent_operations

Maximum number of events processed concurrently by the worker event system

  • Type: u32
  • Default: 100
  • Valid Range: 1-10000
  • System Impact: Controls parallelism for step dispatch and completion processing

worker.event_systems.worker.processing.max_retries

Maximum retry attempts for a failed worker event processing operation

  • Type: u32
  • Default: 3
  • Valid Range: 0-100
  • System Impact: Events exceeding this retry count are dropped or sent to the DLQ

backoff

Path: worker.event_systems.worker.processing.backoff

| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first worker event processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between worker event retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |

timing

Path: worker.event_systems.worker.timing

| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds a worker event claim remains valid |
| fallback_polling_interval_seconds | u32 | 2 | Interval in seconds between fallback polling cycles for step dispatch |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the worker event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single worker event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued step dispatch message remains invisible to other workers |

worker.event_systems.worker.timing.claim_timeout_seconds

Maximum time in seconds a worker event claim remains valid

  • Type: u32
  • Default: 300
  • Valid Range: 1-3600
  • System Impact: Prevents abandoned claims from blocking step processing indefinitely

worker.event_systems.worker.timing.fallback_polling_interval_seconds

Interval in seconds between fallback polling cycles for step dispatch

  • Type: u32
  • Default: 2
  • Valid Range: 1-60
  • System Impact: Shorter than orchestration (2s vs 5s) because workers need fast step pickup for low latency

worker.event_systems.worker.timing.health_check_interval_seconds

Interval in seconds between health check probes for the worker event system

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Controls how frequently the worker event system verifies its own connectivity

worker.event_systems.worker.timing.processing_timeout_seconds

Maximum time in seconds allowed for processing a single worker event

  • Type: u32
  • Default: 60
  • Valid Range: 1-3600
  • System Impact: Events exceeding this timeout are considered failed and may be retried

worker.event_systems.worker.timing.visibility_timeout_seconds

Time in seconds a dequeued step dispatch message remains invisible to other workers

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Prevents duplicate step execution; must be longer than typical step processing time

grpc

Path: worker.grpc

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}" | Socket address for the worker gRPC server |
| enable_health_service | bool | true | Enable the gRPC health checking service on the worker |
| enable_reflection | bool | true | Enable gRPC server reflection for the worker service |
| enabled | bool | true | Enable the gRPC API server for the worker service |
| keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames on the worker |
| keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
| max_concurrent_streams | u32 | 1000 | Maximum number of concurrent gRPC streams per connection |
| max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server |
| tls_enabled | bool | false | Enable TLS encryption for worker gRPC connections |

worker.grpc.bind_address

Socket address for the worker gRPC server

  • Type: String
  • Default: "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}"
  • Valid Range: host:port
  • System Impact: Must not conflict with the REST API or orchestration gRPC ports; default 9191

worker.grpc.enable_health_service

Enable the gRPC health checking service on the worker

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators

worker.grpc.enable_reflection

Enable gRPC server reflection for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Allows tools like grpcurl to list and inspect worker services; consider disabling in production

worker.grpc.enabled

Enable the gRPC API server for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no gRPC endpoints are available; clients must use REST

worker.grpc.keepalive_interval_seconds

Interval in seconds between gRPC keepalive ping frames on the worker

  • Type: u32
  • Default: 30
  • Valid Range: 1-3600
  • System Impact: Detects dead connections; lower values detect failures faster but increase network overhead

worker.grpc.keepalive_timeout_seconds

Time in seconds to wait for a keepalive ping acknowledgment before closing the connection

  • Type: u32
  • Default: 20
  • Valid Range: 1-300
  • System Impact: Connections that fail to acknowledge within this window are considered dead and closed

worker.grpc.max_concurrent_streams

Maximum number of concurrent gRPC streams per connection

  • Type: u32
  • Default: 1000
  • Valid Range: 1-10000
  • System Impact: Workers typically handle more concurrent streams than orchestration; default 1000 reflects this

worker.grpc.max_frame_size

Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server

  • Type: u32
  • Default: 16384
  • Valid Range: 16384-16777215
  • System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream

worker.grpc.tls_enabled

Enable TLS encryption for worker gRPC connections

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When true, TLS cert and key paths must be provided; required for production gRPC deployments

mpsc_channels

Path: worker.mpsc_channels

command_processor

Path: worker.mpsc_channels.command_processor

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 2000 | Bounded channel capacity for the worker command processor |

worker.mpsc_channels.command_processor.command_buffer_size

Bounded channel capacity for the worker command processor

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Buffers incoming worker commands; smaller than orchestration (2000 vs 5000) since workers process fewer command types

domain_events

Path: worker.mpsc_channels.domain_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 1000 | Bounded channel capacity for domain event system commands |
| log_dropped_events | bool | true | Log a warning when domain events are dropped due to channel saturation |
| shutdown_drain_timeout_ms | u32 | 5000 | Maximum time in milliseconds to drain pending domain events during shutdown |

worker.mpsc_channels.domain_events.command_buffer_size

Bounded channel capacity for domain event system commands

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers domain event system control commands such as publish, subscribe, and shutdown

worker.mpsc_channels.domain_events.log_dropped_events

Log a warning when domain events are dropped due to channel saturation

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Observability: helps detect when event volume exceeds channel capacity

worker.mpsc_channels.domain_events.shutdown_drain_timeout_ms

Maximum time in milliseconds to drain pending domain events during shutdown

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Ensures in-flight domain events are delivered before the worker exits; prevents event loss

event_listeners

Path: worker.mpsc_channels.event_listeners

| Parameter | Type | Default | Description |
|---|---|---|---|
| pgmq_event_buffer_size | usize | 10000 | Bounded channel capacity for PGMQ event listener notifications on the worker |

worker.mpsc_channels.event_listeners.pgmq_event_buffer_size

Bounded channel capacity for PGMQ event listener notifications on the worker

  • Type: usize
  • Default: 10000
  • Valid Range: 1000-1000000
  • System Impact: Buffers PGMQ LISTEN/NOTIFY events; smaller than orchestration (10000 vs 50000) since workers handle fewer event types

event_subscribers

Path: worker.mpsc_channels.event_subscribers

| Parameter | Type | Default | Description |
|---|---|---|---|
| completion_buffer_size | usize | 1000 | Bounded channel capacity for step completion event subscribers |
| result_buffer_size | usize | 1000 | Bounded channel capacity for step result event subscribers |

worker.mpsc_channels.event_subscribers.completion_buffer_size

Bounded channel capacity for step completion event subscribers

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step completion notifications before they are forwarded to the orchestration service

worker.mpsc_channels.event_subscribers.result_buffer_size

Bounded channel capacity for step result event subscribers

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step execution results before they are published to the result queue

event_systems

Path: worker.mpsc_channels.event_systems

| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 2000 | Bounded channel capacity for the worker event system internal channel |

worker.mpsc_channels.event_systems.event_channel_buffer_size

Bounded channel capacity for the worker event system internal channel

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Buffers events between the listener and processor; sized for worker-level throughput

ffi_dispatch

Path: worker.mpsc_channels.ffi_dispatch

| Parameter | Type | Default | Description |
|---|---|---|---|
| callback_timeout_ms | u32 | 5000 | Maximum time in milliseconds for FFI fire-and-forget domain event callbacks |
| completion_send_timeout_ms | u32 | 10000 | Maximum time in milliseconds to retry sending FFI completion results when the channel is full |
| completion_timeout_ms | u32 | 30000 | Maximum time in milliseconds to wait for an FFI step handler to complete |
| dispatch_buffer_size | usize | 1000 | Bounded channel capacity for FFI step dispatch requests |
| starvation_warning_threshold_ms | u32 | 10000 | Age in milliseconds of pending FFI events that triggers a starvation warning |

worker.mpsc_channels.ffi_dispatch.callback_timeout_ms

Maximum time in milliseconds for FFI fire-and-forget domain event callbacks

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Prevents indefinite blocking of FFI threads during domain event publishing

worker.mpsc_channels.ffi_dispatch.completion_send_timeout_ms

Maximum time in milliseconds to retry sending FFI completion results when the channel is full

  • Type: u32
  • Default: 10000
  • Valid Range: 1000-300000
  • System Impact: Uses try_send with a retry loop instead of a blocking send to prevent deadlocks

worker.mpsc_channels.ffi_dispatch.completion_timeout_ms

Maximum time in milliseconds to wait for an FFI step handler to complete

  • Type: u32
  • Default: 30000
  • Valid Range: 1000-600000
  • System Impact: FFI handlers exceeding this timeout are considered failed; guards against hung FFI threads

worker.mpsc_channels.ffi_dispatch.dispatch_buffer_size

Bounded channel capacity for FFI step dispatch requests

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers step execution requests destined for Ruby/Python FFI handlers

worker.mpsc_channels.ffi_dispatch.starvation_warning_threshold_ms

Age in milliseconds of pending FFI events that triggers a starvation warning

  • Type: u32
  • Default: 10000
  • Valid Range: 1000-300000
  • System Impact: Proactive detection of FFI channel starvation before completion_timeout_ms is reached
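
These timeouts are deliberately layered: the starvation warning (10 s) fires well before the completion timeout (30 s), giving operators an early signal before FFI handlers are declared failed. A sketch that keeps that ordering intact (TOML path assumed as for the other worker sections):

[mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000
callback_timeout_ms = 5000                # fire-and-forget domain event callbacks
completion_send_timeout_ms = 10000        # retry window when the channel is full
starvation_warning_threshold_ms = 10000   # warn here first...
completion_timeout_ms = 30000             # ...then declare the handler failed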

handler_dispatch

Path: worker.mpsc_channels.handler_dispatch

| Parameter | Type | Default | Description |
|---|---|---|---|
| completion_buffer_size | usize | 1000 | Bounded channel capacity for step handler completion notifications |
| dispatch_buffer_size | usize | 1000 | Bounded channel capacity for step handler dispatch requests |
| handler_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a step handler to complete execution |
| max_concurrent_handlers | u32 | 10 | Maximum number of step handlers executing simultaneously |

worker.mpsc_channels.handler_dispatch.completion_buffer_size

Bounded channel capacity for step handler completion notifications

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers handler completion results before they are forwarded to the result processor

worker.mpsc_channels.handler_dispatch.dispatch_buffer_size

Bounded channel capacity for step handler dispatch requests

  • Type: usize
  • Default: 1000
  • Valid Range: 100-50000
  • System Impact: Buffers incoming step execution requests before handler assignment

worker.mpsc_channels.handler_dispatch.handler_timeout_ms

Maximum time in milliseconds for a step handler to complete execution

  • Type: u32
  • Default: 30000
  • Valid Range: 1000-600000
  • System Impact: Handlers exceeding this timeout are cancelled; prevents hung handlers from consuming capacity

worker.mpsc_channels.handler_dispatch.max_concurrent_handlers

Maximum number of step handlers executing simultaneously

  • Type: u32
  • Default: 10
  • Valid Range: 1-10000
  • System Impact: Controls per-worker parallelism; bounded by the handler dispatch semaphore

load_shedding

Path: worker.mpsc_channels.handler_dispatch.load_shedding

| Parameter | Type | Default | Description |
|---|---|---|---|
| capacity_threshold_percent | f64 | 80.0 | Handler capacity percentage above which new step claims are refused |
| enabled | bool | true | Enable load shedding to refuse step claims when handler capacity is nearly exhausted |
| warning_threshold_percent | f64 | 70.0 | Handler capacity percentage at which warning logs are emitted |

worker.mpsc_channels.handler_dispatch.load_shedding.capacity_threshold_percent

Handler capacity percentage above which new step claims are refused

  • Type: f64
  • Default: 80.0
  • Valid Range: 0.0-100.0
  • System Impact: At 80%, the worker stops accepting new steps when 80% of max_concurrent_handlers are busy

worker.mpsc_channels.handler_dispatch.load_shedding.enabled

Enable load shedding to refuse step claims when handler capacity is nearly exhausted

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When true, the worker refuses new step claims above the capacity threshold to prevent overload

worker.mpsc_channels.handler_dispatch.load_shedding.warning_threshold_percent

Handler capacity percentage at which warning logs are emitted

  • Type: f64
  • Default: 70.0
  • Valid Range: 0.0-100.0
  • System Impact: Observability: alerts operators that the worker is approaching its capacity limit
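
With the defaults above and max_concurrent_handlers = 10, the worker starts warning once 7 handlers are busy (70%) and refuses new step claims once 8 are busy (80%). A sketch that tightens those thresholds for a latency-sensitive worker (TOML path assumed as above):

[mpsc_channels.handler_dispatch]
max_concurrent_handlers = 10

[mpsc_channels.handler_dispatch.load_shedding]
enabled = true
warning_threshold_percent = 60.0    # warn once 6 handlers are busy
capacity_threshold_percent = 70.0   # refuse new claims once 7 handlers are busy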

in_process_events

Path: worker.mpsc_channels.in_process_events

| Parameter | Type | Default | Description |
|---|---|---|---|
| broadcast_buffer_size | usize | 2000 | Bounded broadcast channel capacity for in-process domain event delivery |
| dispatch_timeout_ms | u32 | 5000 | Maximum time in milliseconds to wait when dispatching an in-process event |
| log_subscriber_errors | bool | true | Log errors when in-process event subscribers fail to receive events |

worker.mpsc_channels.in_process_events.broadcast_buffer_size

Bounded broadcast channel capacity for in-process domain event delivery

  • Type: usize
  • Default: 2000
  • Valid Range: 100-100000
  • System Impact: Controls how many domain events can be buffered before slow subscribers cause backpressure

worker.mpsc_channels.in_process_events.dispatch_timeout_ms

Maximum time in milliseconds to wait when dispatching an in-process event

  • Type: u32
  • Default: 5000
  • Valid Range: 100-60000
  • System Impact: Prevents event dispatch from blocking indefinitely if all subscribers are slow

worker.mpsc_channels.in_process_events.log_subscriber_errors

Log errors when in-process event subscribers fail to receive events

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: Observability: helps identify slow or failing event subscribers

orchestration_client

Path: worker.orchestration_client

| Parameter | Type | Default | Description |
|---|---|---|---|
| base_url | String | "http://localhost:8080" | Base URL of the orchestration REST API that this worker reports to |
| max_retries | u32 | 3 | Maximum retry attempts for failed orchestration API calls |
| timeout_ms | u32 | 30000 | HTTP request timeout in milliseconds for orchestration API calls |

worker.orchestration_client.base_url

Base URL of the orchestration REST API that this worker reports to

  • Type: String
  • Default: "http://localhost:8080"
  • Valid Range: valid HTTP(S) URL
  • System Impact: Workers send step completion results and health reports to this endpoint

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | http://orchestration:8080 | Container-internal DNS in Kubernetes/Docker |
| test | http://localhost:8080 | Local orchestration for testing |

Related: orchestration.web.bind_address

worker.orchestration_client.max_retries

Maximum retry attempts for failed orchestration API calls

  • Type: u32
  • Default: 3
  • Valid Range: 0-10
  • System Impact: Retries use backoff; higher values improve resilience to transient network issues

worker.orchestration_client.timeout_ms

HTTP request timeout in milliseconds for orchestration API calls

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Worker-to-orchestration calls exceeding this timeout fail and may be retried
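
In containerized deployments the value that usually changes is base_url, pointed at the orchestration service's internal DNS name, as the recommendation table above suggests. A worker.toml sketch (section name assumed as for the other worker tables):

[orchestration_client]
base_url = "http://orchestration:8080"   # container-internal DNS; use localhost for local dev
timeout_ms = 30000
max_retries = 3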

web

Path: worker.web

| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}" | Socket address for the worker REST API server |
| enabled | bool | true | Enable the REST API server for the worker service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a worker HTTP request to complete before timeout |

worker.web.bind_address

Socket address for the worker REST API server

  • Type: String
  • Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}"
  • Valid Range: host:port
  • System Impact: Must not conflict with orchestration.web.bind_address when co-located; default 8081

Environment Recommendations:

| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8081 | Standard worker port; use TASKER_WEB_BIND_ADDRESS env var to override |
| test | 0.0.0.0:8081 | Default port offset from orchestration (8080) |

worker.web.enabled

Enable the REST API server for the worker service

  • Type: bool
  • Default: true
  • Valid Range: true/false
  • System Impact: When false, no HTTP endpoints are available; the worker operates via messaging only

worker.web.request_timeout_ms

Maximum time in milliseconds for a worker HTTP request to complete before timeout

  • Type: u32
  • Default: 30000
  • Valid Range: 100-300000
  • System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections

auth

Path: worker.web.auth

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication on the worker API |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication on the worker API |
| enabled | bool | false | Enable authentication for the worker REST API |
| jwt_audience | String | "worker-api" | Expected ‘aud’ claim in JWT tokens for the worker API |
| jwt_issuer | String | "tasker-worker" | Expected ‘iss’ claim in JWT tokens for the worker API |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if the worker issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures on the worker API |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for worker JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours for worker API tokens |

worker.web.auth.api_key

Static API key for simple key-based authentication on the worker API

  • Type: String
  • Default: ""
  • Valid Range: non-empty string or empty to disable
  • System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header

worker.web.auth.api_key_header

HTTP header name for API key authentication on the worker API

  • Type: String
  • Default: "X-API-Key"
  • Valid Range: valid HTTP header name
  • System Impact: Clients send their API key in this header; default is X-API-Key

worker.web.auth.enabled

Enable authentication for the worker REST API

  • Type: bool
  • Default: false
  • Valid Range: true/false
  • System Impact: When false, all worker API endpoints are unauthenticated

worker.web.auth.jwt_audience

Expected ‘aud’ claim in JWT tokens for the worker API

  • Type: String
  • Default: "worker-api"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different audience are rejected during validation

worker.web.auth.jwt_issuer

Expected ‘iss’ claim in JWT tokens for the worker API

  • Type: String
  • Default: "tasker-worker"
  • Valid Range: non-empty string
  • System Impact: Tokens with a different issuer are rejected; default ‘tasker-worker’ distinguishes worker tokens from orchestration tokens

worker.web.auth.jwt_private_key

PEM-encoded private key for signing JWT tokens (if the worker issues tokens)

  • Type: String
  • Default: ""
  • Valid Range: valid PEM private key or empty
  • System Impact: Required only if the worker service issues its own JWT tokens; typically empty

worker.web.auth.jwt_public_key

PEM-encoded public key for verifying JWT token signatures on the worker API

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY:-}"
  • Valid Range: valid PEM public key or empty
  • System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management

worker.web.auth.jwt_public_key_path

File path to a PEM-encoded public key for worker JWT verification

  • Type: String
  • Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
  • Valid Range: valid file path or empty
  • System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file

worker.web.auth.jwt_token_expiry_hours

Default JWT token validity period in hours for worker API tokens

  • Type: u32
  • Default: 24
  • Valid Range: 1-720
  • System Impact: Tokens older than this are rejected; shorter values improve security

database_pools

Path: worker.web.database_pools

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 25 | Advisory hint for the total number of database connections across all worker pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the worker web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle worker web API connection is closed |
| web_api_max_connections | u32 | 15 | Maximum number of connections the worker web API pool can grow to under load |
| web_api_pool_size | u32 | 10 | Target number of connections in the worker web API database pool |

worker.web.database_pools.max_total_connections_hint

Advisory hint for the total number of database connections across all worker pools

  • Type: u32
  • Default: 25
  • Valid Range: 1-1000
  • System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint

worker.web.database_pools.web_api_connection_timeout_seconds

Maximum time to wait when acquiring a connection from the worker web API pool

  • Type: u32
  • Default: 30
  • Valid Range: 1-300
  • System Impact: Worker API requests that cannot acquire a connection within this window return an error

worker.web.database_pools.web_api_idle_timeout_seconds

Time before an idle worker web API connection is closed

  • Type: u32
  • Default: 600
  • Valid Range: 1-3600
  • System Impact: Controls how quickly the worker web API pool shrinks after traffic subsides

worker.web.database_pools.web_api_max_connections

Maximum number of connections the worker web API pool can grow to under load

  • Type: u32
  • Default: 15
  • Valid Range: 1-500
  • System Impact: Hard ceiling for worker web API database connections

worker.web.database_pools.web_api_pool_size

Target number of connections in the worker web API database pool

  • Type: u32
  • Default: 10
  • Valid Range: 1-200
  • System Impact: Determines how many concurrent database queries the worker REST API can execute; smaller than orchestration


Authentication & Authorization

API-level security for Tasker’s orchestration and worker HTTP endpoints, providing JWT bearer token and API key authentication with permission-based access control.


Architecture

                         ┌──────────────────────────────┐
Request ──►  Middleware  │  SecurityService              │
             (per-route) │  ├─ JwtAuthenticator          │
                         │  ├─ JwksKeyStore (optional)   │
                         │  └─ ApiKeyRegistry (optional) │
                         └───────────┬──────────────────┘
                                     │
                                     ▼
                           SecurityContext
                           (injected into request extensions)
                                     │
                                     ▼
                         ┌───────────────────────┐
                         │  authorize() wrapper  │
                         │  Resource + Action    │
                         └───────────┬───────────┘
                                     │
                           ┌─────────┴─────────┐
                           ▼                   ▼
                     Body parsing        403 (denied)
                         │
                         ▼
                    Handler body
                         │
                         ▼
                   200 (success)

Key Components

| Component | Location | Role |
|---|---|---|
| SecurityService | tasker-shared/src/services/security_service.rs | Unified auth backend: validates JWTs (static key or JWKS) and API keys |
| SecurityContext | tasker-shared/src/types/security.rs | Per-request identity + permissions, extracted by handlers |
| Permission enum | tasker-shared/src/types/permissions.rs | Compile-time permission vocabulary (resource:action) |
| Resource, Action | tasker-shared/src/types/resources.rs | Resource-based authorization types |
| authorize() wrapper | tasker-shared/src/web/authorize.rs | Handler wrapper for declarative permission checks |
| Auth middleware | */src/web/middleware/auth.rs | Axum middleware injecting SecurityContext |
| require_permission() | */src/web/middleware/permission.rs | Legacy per-handler permission gate (still available) |

Request Flow

  1. Middleware (conditional_auth) runs on protected routes
  2. If auth disabled → injects SecurityContext::disabled_context() (all permissions)
  3. If auth enabled → extracts Bearer token or API key from headers
  4. SecurityService validates credentials, returns SecurityContext
  5. authorize() wrapper checks permission BEFORE body deserialization → 403 if denied
  6. Body deserialization and handler execution proceed if authorized

Route Layers

Routes are split into public (never require auth) and protected (auth middleware applied):

Orchestration (port 8080):

  • Public: /health/*, /metrics, /api-docs/*
  • Protected: /v1/*, /config (opt-in)

Worker (port 8081):

  • Public: /health/*, /metrics, /api-docs/*
  • Protected: /v1/templates/*, /config (opt-in)

Quick Start

# 1. Generate RSA key pair
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys

# 2. Generate a token
cargo run --bin tasker-ctl -- auth generate-token \
  --private-key ./keys/jwt-private-key.pem \
  --permissions "tasks:create,tasks:read,tasks:list" \
  --subject my-service \
  --expiry-hours 24

# 3. Enable auth in config (orchestration.toml)
# [web.auth]
# enabled = true
# jwt_public_key_path = "./keys/jwt-public-key.pem"

# 4. Use the token
curl -H "Authorization: Bearer <token>" http://localhost:8080/v1/tasks

Documentation Index

| Document | Contents |
|---|---|
| Permissions | Permission vocabulary, route mapping, wildcards, role patterns |
| Configuration | TOML config, environment variables, deployment patterns |
| Testing | E2E test infrastructure, cargo-make tasks, writing auth tests |

Cross-References

| Document | Contents |
|---|---|
| API Security Guide | Quick start, CLI commands, error responses, observability |
| Auth Integration Guide | JWKS, Auth0, Keycloak, Okta configuration |

Design Decisions

Auth Disabled by Default

Security is opt-in (enabled = false default). Existing deployments are unaffected. When disabled, all handlers receive a SecurityContext with AuthMethod::Disabled and permissions: ["*"].

Config Endpoint Opt-In

The /config endpoint exposes runtime configuration (secrets redacted). It is controlled by a separate toggle (config_endpoint_enabled, default false). When disabled, the route is not registered (404, not 401).

Resource-Based Authorization

Permission checks happen at the route level via authorize() wrappers BEFORE body deserialization:

.route("/tasks", post(authorize(Resource::Tasks, Action::Create, create_task)))

This approach:

  • Rejects unauthorized requests before parsing request bodies
  • Provides a declarative, visible permission model at the route level
  • Is protocol-agnostic (same Resource/Action types work for REST and gRPC)
  • Documents permissions in OpenAPI via x-required-permission extensions

The legacy require_permission() function is still available for cases where permission checks need to happen inside handler logic.

Credential Priority (Client)

The tasker-client library resolves credentials in this order:

  1. Endpoint-specific token (TASKER_ORCHESTRATION_AUTH_TOKEN / TASKER_WORKER_AUTH_TOKEN)
  2. Global token (TASKER_AUTH_TOKEN)
  3. API key (TASKER_API_KEY)
  4. JWT generation from private key (if configured)

Known Limitations

  • Body-before-permission ordering for POST/PATCH endpoints — Resolved by resource-based authorization
  • No token refresh — tokens are stateless; clients must generate new tokens before expiry
  • API keys have no expiration — rotate by removing from config and redeploying

Configuration Reference

Complete configuration for Tasker authentication: server-side TOML, environment variables, and client settings.


Server-Side Configuration

Auth config lives under [web.auth] in both orchestration and worker TOML files.

Location

config/tasker/base/orchestration.toml    → [web.auth]
config/tasker/base/worker.toml           → [web.auth]
config/tasker/environments/{env}/...     → environment overrides

Configuration follows the role-based structure (see Configuration Management).

Full Reference

[web]
# Whether the /config endpoint is registered (default: false).
# When false, GET /config returns 404. When true, requires system:config_read permission.
config_endpoint_enabled = false

[web.auth]
# Master switch (default: false). When disabled, all routes are accessible without credentials.
enabled = false

# --- JWT Configuration ---

# Token issuer claim (validated against incoming tokens)
jwt_issuer = "tasker-core"

# Token audience claim (validated against incoming tokens)
jwt_audience = "tasker-api"

# Token expiry for generated tokens (via CLI)
jwt_token_expiry_hours = 24

# Verification method: "public_key" (static RSA key) or "jwks" (dynamic key rotation)
jwt_verification_method = "public_key"

# Static public key (one of these, path takes precedence):
jwt_public_key_path = "/etc/tasker/keys/jwt-public-key.pem"
jwt_public_key = ""  # Inline PEM string (use path instead for production)

# Private key (for token generation only, not needed for verification):
jwt_private_key = ""

# --- JWKS Configuration (when jwt_verification_method = "jwks") ---

# JWKS endpoint URL
jwks_url = "https://auth.example.com/.well-known/jwks.json"

# How often to refresh the key set (seconds)
jwks_refresh_interval_seconds = 3600

# --- Permission Validation ---

# JWT claim name containing the permissions array
permissions_claim = "permissions"

# Reject tokens with unrecognized permission strings
strict_validation = true

# Log unrecognized permissions even when strict_validation = false
log_unknown_permissions = true

# --- API Key Authentication ---

# Header name for API key authentication
api_key_header = "X-API-Key"

# Enable multi-key registry (default: false)
api_keys_enabled = false

# API key registry (multiple keys with individual permissions)
[[web.auth.api_keys]]
key = "sk-prod-monitoring-key"
permissions = ["tasks:read", "tasks:list", "dlq:read", "dlq:stats"]
description = "Production monitoring service"

[[web.auth.api_keys]]
key = "sk-prod-admin-key"
permissions = ["*"]
description = "Production admin"

Environment Variables

Server-Side

| Variable                   | Description                     | Overrides                    |
|----------------------------|---------------------------------|------------------------------|
| TASKER_JWT_PUBLIC_KEY_PATH | Path to RSA public key PEM file | web.auth.jwt_public_key_path |
| TASKER_JWT_PUBLIC_KEY      | Inline PEM public key           | web.auth.jwt_public_key      |

These override TOML values via the config loader’s environment interpolation.

Client-Side

| Variable                        | Priority    | Description                                |
|---------------------------------|-------------|--------------------------------------------|
| TASKER_ORCHESTRATION_AUTH_TOKEN | 1 (highest) | Bearer token for orchestration API only    |
| TASKER_WORKER_AUTH_TOKEN        | 1 (highest) | Bearer token for worker API only           |
| TASKER_AUTH_TOKEN               | 2           | Bearer token for both APIs                 |
| TASKER_API_KEY                  | 3           | API key (sent via configured header)       |
| TASKER_API_KEY_HEADER           | —           | Custom header name (default: X-API-Key)    |
| TASKER_JWT_PRIVATE_KEY_PATH     | 4 (lowest)  | Private key for on-demand token generation |

The tasker-client library checks these in priority order and uses the first available credential.
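
For example, a client that uses one bearer token for both APIs only needs the global variable set (values here are placeholders):

export TASKER_AUTH_TOKEN="<jwt-from-your-identity-provider>"

# Or, to authenticate with an API key instead (lower priority than a token):
# export TASKER_API_KEY="sk-prod-monitoring-key"
# export TASKER_API_KEY_HEADER="X-API-Key"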


Deployment Patterns

Development (Auth Disabled)

[web.auth]
enabled = false

All endpoints accessible without credentials. Default behavior.

Development (Auth Enabled, Static Key)

[web.auth]
enabled = true
jwt_verification_method = "public_key"
jwt_public_key_path = "./keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
strict_validation = false

[[web.auth.api_keys]]
key = "dev-key"
permissions = ["*"]
description = "Dev superuser key"

Production (JWKS + API Keys)

[web.auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://auth.company.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://auth.company.com/"
jwt_audience = "tasker-api"
strict_validation = true
log_unknown_permissions = true
api_keys_enabled = true
api_key_header = "X-API-Key"

[[web.auth.api_keys]]
key = "sk-monitoring-prod"
permissions = ["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]
description = "Monitoring service"

[[web.auth.api_keys]]
key = "sk-submitter-prod"
permissions = ["tasks:create", "tasks:read", "tasks:list"]
description = "Task submission service"

Production (Config Endpoint Enabled)

[web]
config_endpoint_enabled = true

[web.auth]
enabled = true
# ... auth config ...

Exposes GET /config (requires system:config_read permission). Secrets are redacted in the response.


Key Management

Generating Keys

# Generate 2048-bit RSA key pair
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys --key-size 2048

# Output:
#   keys/jwt-private-key.pem  (keep secret, used for token generation)
#   keys/jwt-public-key.pem   (distribute to servers for verification)
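
If tasker-ctl is not available (for example in a CI image), an equivalent 2048-bit RSA key pair can be produced with openssl; this is a sketch of the same output layout, not the supported path:

# Roughly equivalent key pair using openssl
mkdir -p keys
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out keys/jwt-private-key.pem
openssl rsa -in keys/jwt-private-key.pem -pubout -out keys/jwt-public-key.pem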

Key Rotation (Static Key)

  1. Generate a new key pair
  2. Update jwt_public_key_path in server config
  3. Restart servers
  4. Re-generate tokens with the new private key
  5. Old tokens become invalid immediately

Key Rotation (JWKS)

Handled automatically by the identity provider. Tasker refreshes keys on:

  • Timer interval (jwks_refresh_interval_seconds)
  • Unknown kid in incoming token (triggers immediate refresh)

Security Hardening Checklist

  • Private keys never committed to version control
  • enabled = true in production configs
  • strict_validation = true to reject unknown permissions
  • Token expiry set appropriately (1-24h recommended)
  • API keys use descriptive names for audit trails
  • config_endpoint_enabled = false unless needed (default)
  • Monitor tasker.auth.failures.total metric for anomalies
  • Use JWKS in production for automatic key rotation
  • Least-privilege: each service gets only the permissions it needs

Permissions

Permission-based access control using a resource:action vocabulary with wildcard support.


Permission Vocabulary

17 permissions organized by resource:

Tasks

| Permission         | Description            | Endpoints                    |
|--------------------|------------------------|------------------------------|
| tasks:create       | Create new tasks       | POST /v1/tasks               |
| tasks:read         | Read task details      | GET /v1/tasks/{uuid}         |
| tasks:list         | List tasks             | GET /v1/tasks                |
| tasks:cancel       | Cancel running tasks   | DELETE /v1/tasks/{uuid}      |
| tasks:context_read | Read task context data | GET /v1/tasks/{uuid}/context |

Steps

| Permission    | Description                | Endpoints |
|---------------|----------------------------|-----------|
| steps:read    | Read workflow step details | GET /v1/tasks/{uuid}/workflow_steps, GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}, GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}/audit |
| steps:resolve | Manually resolve steps     | PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid} |

Dead Letter Queue

| Permission | Description               | Endpoints |
|------------|---------------------------|-----------|
| dlq:read   | Read DLQ entries          | GET /v1/dlq, GET /v1/dlq/task/{task_uuid}, GET /v1/dlq/investigation-queue, GET /v1/dlq/staleness |
| dlq:update | Update DLQ investigations | PATCH /v1/dlq/entry/{dlq_entry_uuid} |
| dlq:stats  | View DLQ statistics       | GET /v1/dlq/stats |

Templates

| Permission         | Description         | Endpoints |
|--------------------|---------------------|-----------|
| templates:read     | Read task templates | Orchestration: GET /v1/templates, GET /v1/templates/{namespace}/{name}/{version} |
| templates:validate | Validate templates  | Worker: POST /v1/templates/{namespace}/{name}/{version}/validate |

System (Orchestration)

| Permission            | Description               | Endpoints |
|-----------------------|---------------------------|-----------|
| system:config_read    | Read system configuration | GET /config |
| system:handlers_read  | Read handler registry     | GET /v1/handlers, GET /v1/handlers/{namespace}, GET /v1/handlers/{namespace}/{name} |
| system:analytics_read | Read analytics data       | GET /v1/analytics/performance, GET /v1/analytics/bottlenecks |

Worker

| Permission            | Description               | Endpoints |
|-----------------------|---------------------------|-----------|
| worker:config_read    | Read worker configuration | Worker: GET /config |
| worker:templates_read | Read worker templates     | Worker: GET /v1/templates, GET /v1/templates/{namespace}/{name}/{version} |

Wildcards

Resource-level wildcards allow broad access within a resource domain:

| Pattern     | Matches                  |
|-------------|--------------------------|
| tasks:*     | All task permissions     |
| steps:*     | All step permissions     |
| dlq:*       | All DLQ permissions      |
| templates:* | All template permissions |
| system:*    | All system permissions   |
| worker:*    | All worker permissions   |

Note: Global wildcards (*) are NOT supported. Use explicit resource wildcards for broad access (e.g., tasks:*, system:*). This follows AWS IAM-style resource-level granularity.

Wildcard matching is implemented in permission_matches():

  • resource:* → matches if required permission’s resource component equals the prefix
  • Exact string → matches if strings are identical
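
A minimal sketch of that rule (the function signature is assumed here; the actual implementation lives in tasker-shared):

// Sketch only: resource wildcards match on the resource component, otherwise exact match.
fn permission_matches(granted: &str, required: &str) -> bool {
    if let Some(resource) = granted.strip_suffix(":*") {
        // "tasks:*" matches "tasks:read", "tasks:create", ...
        return required.split(':').next() == Some(resource);
    }
    // Exact string comparison, e.g. "tasks:read" == "tasks:read"
    granted == required
}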

Role Patterns

Common permission sets for different service roles:

Read-Only Operator

["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]

Suitable for dashboards, monitoring services, and read-only admin UIs.

Task Submitter

["tasks:create", "tasks:read", "tasks:list"]

Services that submit work to Tasker and track their submissions.

Ops Admin

["tasks:*", "steps:*", "dlq:*", "system:*"]

Full operational access including step resolution, DLQ investigation, and system observability.
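
As a sketch, an API key carrying this role in the server config (key value and description are illustrative):

[[web.auth.api_keys]]
key = "sk-prod-ops-admin"
permissions = ["tasks:*", "steps:*", "dlq:*", "system:*"]
description = "Ops admin"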

Worker Service

["worker:config_read", "worker:templates_read"]

Worker processes that need to read their configuration and available templates.

Full Access (Admin)

["tasks:*", "steps:*", "dlq:*", "templates:*", "system:*", "worker:*"]

Full access to all resources via resource wildcards. Use sparingly.


Strict Validation

When strict_validation = true (default), tokens containing permission strings not in the vocabulary are rejected with 401:

Unknown permissions: custom:action, tasks:delete

Set strict_validation = false if your identity provider includes additional scopes that are not part of Tasker’s vocabulary. Use log_unknown_permissions = true to still log unrecognized permissions for monitoring.
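
For example, to accept IdP-issued scopes outside Tasker's vocabulary while still surfacing them in logs:

[web.auth]
strict_validation = false
log_unknown_permissions = true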


Permission Check Implementation

Resource-Based Authorization

Permissions are enforced declaratively at the route level using authorize() wrappers. This ensures authorization happens before body deserialization:

// In routes.rs
use tasker_shared::web::authorize;
use tasker_shared::types::resources::{Resource, Action};

Router::new()
    .route("/tasks", post(authorize(Resource::Tasks, Action::Create, create_task)))
    .route("/tasks", get(authorize(Resource::Tasks, Action::List, list_tasks)))
    .route("/tasks/{uuid}", get(authorize(Resource::Tasks, Action::Read, get_task)))

The authorize() wrapper:

  1. Extracts SecurityContext from request extensions (set by auth middleware)
  2. If resource is public (Health/Metrics/Docs) → proceeds to handler
  3. If auth disabled (AuthMethod::Disabled) → proceeds to handler
  4. Checks has_permission(required) → if yes, proceeds; if no, returns 403

Resource → Permission Mapping

The ResourceAction type maps resource+action combinations to permissions:

| Resource  | Action        | Permission            |
|-----------|---------------|-----------------------|
| Tasks     | Create        | tasks:create          |
| Tasks     | Read          | tasks:read            |
| Tasks     | List          | tasks:list            |
| Tasks     | Cancel        | tasks:cancel          |
| Tasks     | ContextRead   | tasks:context_read    |
| Steps     | Read/List     | steps:read            |
| Steps     | Resolve       | steps:resolve         |
| Dlq       | Read/List     | dlq:read              |
| Dlq       | Update        | dlq:update            |
| Dlq       | Stats         | dlq:stats             |
| Templates | Read/List     | templates:read        |
| Templates | Validate      | templates:validate    |
| System    | ConfigRead    | system:config_read    |
| System    | HandlersRead  | system:handlers_read  |
| System    | AnalyticsRead | system:analytics_read |
| Worker    | ConfigRead    | worker:config_read    |
| Worker    | Read/List     | worker:templates_read |

Public Resources

These resources don’t require authentication:

  • Resource::Health - Health check endpoints
  • Resource::Metrics - Prometheus metrics
  • Resource::Docs - OpenAPI/Swagger documentation

Legacy Handler-Level Check (Still Available)

For cases where you need permission checks inside handler logic:

use tasker_shared::services::require_permission;
use tasker_shared::types::Permission;

fn my_handler(ctx: SecurityContext) -> Result<(), ApiError> {
    require_permission(&ctx, Permission::TasksCreate)?;
    // ... handler logic
}

Source: tasker-shared/src/web/authorize.rs, tasker-shared/src/types/resources.rs


OpenAPI Documentation

Permission Extensions

Each protected endpoint in the OpenAPI spec includes an x-required-permission extension that documents the exact permission required:

{
  "paths": {
    "/v1/tasks": {
      "post": {
        "security": [
          { "bearer_auth": [] },
          { "api_key_auth": [] }
        ],
        "x-required-permission": "tasks:create",
        ...
      }
    }
  }
}

Why Extensions Instead of OAuth2 Scopes?

OpenAPI 3.x only formally supports scopes for OAuth2 and OpenID Connect security schemes—not for HTTP Bearer or API Key authentication. Since Tasker uses JWT Bearer tokens with JWKS validation (not OAuth2 flows), we use vendor extensions (x-required-permission) to document permissions in a standards-compliant way.

This approach:

  • Is OpenAPI compliant (tools ignore unknown x- fields gracefully)
  • Doesn’t misrepresent our authentication mechanism
  • Is machine-readable for SDK generators and tooling
  • Is visible in generated documentation

Viewing Permissions in Swagger UI

Each operation’s description includes a Required Permission line:

**Required Permission:** `tasks:create`

This provides human-readable permission information directly in the Swagger UI.

Programmatic Access

To extract permission requirements from the OpenAPI spec:

import json

spec = json.load(open("orchestration-openapi.json"))
for path, methods in spec["paths"].items():
    for method, operation in methods.items():
        if "x-required-permission" in operation:
            print(f"{method.upper()} {path}: {operation['x-required-permission']}")

CLI: List Permissions

cargo run --bin tasker-ctl -- auth show-permissions

Outputs all 17 permissions with their resource grouping.

Auth Testing

E2E test infrastructure for validating authentication and permission enforcement.


Test Organization

tasker-orchestration/tests/web/auth/
├── mod.rs                  # Module declarations
├── common.rs               # AuthWebTestClient, token generators, constants
├── tasks.rs                # Task endpoint auth tests
├── workflow_steps.rs       # Step resolution auth tests
├── dlq.rs                  # DLQ endpoint auth tests
├── handlers.rs             # Handler registry auth tests
├── analytics.rs            # Analytics endpoint auth tests
├── config.rs               # Config endpoint auth tests
├── health.rs               # Health endpoint public access tests
└── api_keys.rs             # API key auth tests (full/read/tasks/none)

All tests are feature-gated: #[cfg(feature = "test-services")]


Running Auth Tests

# Run all auth E2E tests (requires database running)
cargo make test-auth-e2e    # or: cargo make tae

# Run a specific test file
cargo nextest run --features test-services \
  -E 'test(auth::tasks)' \
  --package tasker-orchestration

# Run with output
cargo nextest run --features test-services \
  -E 'test(auth::)' \
  --package tasker-orchestration \
  --no-capture

Test Infrastructure

AuthWebTestClient

A specialized HTTP client that starts an auth-enabled Axum server:

use crate::web::auth::common::AuthWebTestClient;

#[tokio::test]
async fn test_example() {
    let client = AuthWebTestClient::new().await;
    // client.base_url is http://127.0.0.1:{dynamic_port}
}

AuthWebTestClient::new() does:

  1. Loads config/tasker/generated/auth-test.toml (auth enabled, test keys)
  2. Resolves jwt-public-key-test.pem via CARGO_MANIFEST_DIR
  3. Creates SystemContext + OrchestrationCore + AppState
  4. Starts Axum on a dynamically-allocated port (127.0.0.1:0)
  5. Provides HTTP methods: get(), post_json(), patch_json(), delete()

Token Generators

use crate::web::auth::common::{generate_jwt, generate_expired_jwt, generate_jwt_wrong_issuer};

// Valid token with specific permissions
let token = generate_jwt(&["tasks:create", "tasks:read"]);

// Expired token (1 hour ago)
let token = generate_expired_jwt(&["tasks:create"]);

// Wrong issuer (won't validate)
let token = generate_jwt_wrong_issuer(&["tasks:create"]);

Token generation uses the test RSA private key (tests/fixtures/auth/jwt-private-key-test.pem) embedded as a constant.

API Key Constants

use crate::web::auth::common::{
    TEST_API_KEY_FULL_ACCESS,      // permissions: ["*"]
    TEST_API_KEY_READ_ONLY,        // permissions: tasks/steps/dlq read + system read
    TEST_API_KEY_TASKS_ONLY,       // permissions: ["tasks:*"]
    TEST_API_KEY_NO_PERMISSIONS,   // permissions: []
    INVALID_API_KEY,               // not registered
};

These match the keys configured in config/tasker/generated/auth-test.toml.


Test Configuration

config/tasker/generated/auth-test.toml

A copy of complete-test.toml with auth overrides:

[orchestration.web.auth]
enabled = true
jwt_issuer = "tasker-core-test"
jwt_audience = "tasker-api-test"
jwt_verification_method = "public_key"
jwt_public_key_path = ""  # Set via TASKER_JWT_PUBLIC_KEY_PATH at runtime
api_keys_enabled = true
strict_validation = false

[[orchestration.web.auth.api_keys]]
key = "test-api-key-full-access"
permissions = ["*"]

[[orchestration.web.auth.api_keys]]
key = "test-api-key-read-only"
permissions = ["tasks:read", "tasks:list", "steps:read", ...]

# ... more keys ...

Test Fixture Keys

tests/fixtures/auth/
├── jwt-private-key-test.pem   # RSA private key (for token generation in tests)
└── jwt-public-key-test.pem    # RSA public key (loaded by SecurityService)

These are deterministic test keys committed to the repository. They are only used in tests and have no security value.


Test Patterns

Pattern: No Credentials → 401

#[tokio::test]
async fn test_no_credentials_returns_401() {
    let client = AuthWebTestClient::new().await;
    let response = client.get("/v1/tasks").await.unwrap();
    assert_eq!(response.status(), 401);
}

Pattern: Valid JWT with Required Permission → 200

#[tokio::test]
async fn test_jwt_with_permission_succeeds() {
    let client = AuthWebTestClient::new().await;
    let token = generate_jwt(&["tasks:list"]);
    let response = client
        .get_with_token("/v1/tasks", &token)
        .await
        .unwrap();
    assert_eq!(response.status(), 200);
}

Pattern: Valid JWT Missing Permission → 403

#[tokio::test]
async fn test_jwt_without_permission_returns_403() {
    let client = AuthWebTestClient::new().await;
    let token = generate_jwt(&["tasks:read"]);  // missing tasks:create
    let body = serde_json::json!({ /* ... */ });
    let response = client
        .post_json_with_token("/v1/tasks", &body, &token)
        .await
        .unwrap();
    assert_eq!(response.status(), 403);
}

Pattern: API Key with Permissions → 200

#[tokio::test]
async fn test_api_key_full_access() {
    let client = AuthWebTestClient::new().await;
    let response = client
        .get_with_api_key("/v1/tasks", TEST_API_KEY_FULL_ACCESS)
        .await
        .unwrap();
    assert_eq!(response.status(), 200);
}

Pattern: Health Always Public

#[tokio::test]
async fn test_health_no_auth_required() {
    let client = AuthWebTestClient::new().await;
    let response = client.get("/health").await.unwrap();
    assert_eq!(response.status(), 200);
}

Test Coverage Matrix

| Scenario                             | Expected | Test File                               |
|--------------------------------------|----------|-----------------------------------------|
| No credentials on protected routes   | 401      | All files                               |
| JWT with exact permission            | 200      | tasks, dlq, handlers, analytics, config |
| JWT with resource wildcard (tasks:*) | 200      | tasks                                   |
| JWT with global wildcard (*)         | 200      | All files                               |
| JWT missing required permission      | 403      | tasks, dlq, handlers, analytics         |
| JWT wrong issuer                     | 401      | tasks                                   |
| JWT wrong audience                   | 401      | tasks                                   |
| Expired JWT                          | 401      | tasks                                   |
| Malformed JWT                        | 401      | tasks                                   |
| API key full access                  | 200      | api_keys                                |
| API key read-only                    | 200/403  | api_keys                                |
| API key tasks-only                   | 200/403  | api_keys                                |
| API key no permissions               | 403      | api_keys                                |
| Invalid API key                      | 401      | api_keys                                |
| Health endpoints without auth        | 200      | health                                  |

CI Compatibility

Auth tests are compatible with CI without special environment setup:

  • Dynamic port allocation: TcpListener::bind("127.0.0.1:0") avoids port conflicts
  • Self-configuring paths: Uses CARGO_MANIFEST_DIR to resolve fixture paths at compile time
  • No external services: Auth validation is in-process (no external JWKS/IdP needed)
  • Nextest isolation: Each test runs in its own process, preventing env var conflicts

Adding New Auth Tests

  1. Identify the endpoint and required permission (see Permissions)
  2. Add tests to the appropriate file (by resource) or create a new one
  3. Test at minimum: no credentials (401), correct permission (200), wrong permission (403)
  4. For POST/PATCH endpoints, use a valid request body so responses reflect the authorization outcome rather than a request validation error
  5. Run cargo make test-auth-e2e to verify

Backpressure Monitoring Runbook

Last Updated: 2026-02-05 Audience: Operations, SRE, On-Call Engineers Status: Active Related Docs: Backpressure Architecture | MPSC Channel Tuning


This runbook provides guidance for monitoring, alerting, and responding to backpressure events in tasker-core.

Quick Reference

Critical Metrics Dashboard

| Metric                          | Normal    | Warning   | Critical  | Action                             |
|---------------------------------|-----------|-----------|-----------|------------------------------------|
| api_circuit_breaker_state       | closed    | -         | open      | See Circuit Breaker Open           |
| messaging_circuit_breaker_state | closed    | half-open | open      | See Messaging Circuit Breaker Open |
| api_requests_rejected_total     | < 1/min   | > 5/min   | > 20/min  | See API Rejections                 |
| mpsc_channel_saturation         | < 50%     | > 70%     | > 90%     | See Channel Saturation             |
| pgmq_queue_depth                | < 50% max | > 70% max | > 90% max | See Queue Depth High               |
| worker_claim_refusals_total     | < 5/min   | > 20/min  | > 50/min  | See Claim Refusals                 |
| handler_semaphore_wait_ms_p99   | < 100ms   | > 500ms   | > 2000ms  | See Handler Wait                   |
| domain_events_dropped_total     | < 10/min  | > 50/min  | > 200/min | See Domain Events                  |

Key Metrics

API Layer Metrics

api_requests_total

  • Type: Counter
  • Labels: endpoint, status_code, method
  • Description: Total API requests received
  • Usage: Calculate request rate, error rate

api_requests_rejected_total

  • Type: Counter
  • Labels: endpoint, reason (rate_limit, circuit_breaker, validation)
  • Description: Requests rejected due to backpressure
  • Alert: > 10/min sustained

api_circuit_breaker_state

  • Type: Gauge
  • Values: 0 = closed, 1 = half-open, 2 = open
  • Description: Current circuit breaker state
  • Alert: state = 2 (open)

api_request_latency_ms

  • Type: Histogram
  • Labels: endpoint
  • Description: Request processing time
  • Alert: p99 > 5000ms

Messaging Metrics

messaging_circuit_breaker_state

  • Type: Gauge
  • Values: 0 = closed, 1 = half-open, 2 = open
  • Description: Current messaging circuit breaker state
  • Alert: state = 2 (open) — both orchestration and workers lose queue access

messaging_circuit_breaker_rejections_total

  • Type: Counter
  • Labels: operation (send, receive)
  • Description: Messaging operations rejected due to open circuit breaker
  • Alert: > 0 (any rejection indicates messaging outage)

Orchestration Metrics

orchestration_command_channel_size

  • Type: Gauge
  • Description: Current command channel buffer usage
  • Alert: > 80% of command_buffer_size

orchestration_command_channel_saturation

  • Type: Gauge (0.0 - 1.0)
  • Description: Channel saturation ratio
  • Alert: > 0.8 sustained for > 1min

pgmq_queue_depth

  • Type: Gauge
  • Labels: queue_name
  • Description: Messages in queue
  • Alert: > configured max_queue_depth * 0.8

pgmq_enqueue_failures_total

  • Type: Counter
  • Labels: queue_name, reason
  • Description: Failed enqueue operations
  • Alert: > 0 (any failure)

Worker Metrics

worker_claim_refusals_total

  • Type: Counter
  • Labels: worker_id, namespace
  • Description: Step claims refused due to capacity
  • Alert: > 10/min sustained

worker_handler_semaphore_permits_available

  • Type: Gauge
  • Labels: worker_id
  • Description: Available handler permits
  • Alert: = 0 sustained for > 30s

worker_handler_semaphore_wait_ms

  • Type: Histogram
  • Labels: worker_id
  • Description: Time waiting for semaphore permit
  • Alert: p99 > 1000ms

worker_dispatch_channel_saturation

  • Type: Gauge
  • Labels: worker_id
  • Description: Dispatch channel saturation
  • Alert: > 0.8 sustained

worker_completion_channel_saturation

  • Type: Gauge
  • Labels: worker_id
  • Description: Completion channel saturation
  • Alert: > 0.8 sustained

domain_events_dropped_total

  • Type: Counter
  • Labels: worker_id, event_type
  • Description: Domain events dropped due to channel full
  • Alert: > 50/min (informational)

Alert Configurations

Prometheus Alert Rules

groups:
  - name: tasker_backpressure
    rules:
      # API Layer
      - alert: TaskerCircuitBreakerOpen
        expr: api_circuit_breaker_state == 2
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Tasker API circuit breaker is open"
          description: "Circuit breaker {{ $labels.instance }} has been open for > 30s"
          runbook: "https://docs/operations/backpressure-monitoring.md#circuit-breaker-open"

      - alert: TaskerMessagingCircuitBreakerOpen
        expr: messaging_circuit_breaker_state == 2
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Tasker messaging circuit breaker is open"
          description: "Messaging circuit breaker has been open for > 30s — queue operations are failing"
          runbook: "https://docs/operations/backpressure-monitoring.md#messaging-circuit-breaker-open"

      - alert: TaskerAPIRejectionsHigh
        expr: rate(api_requests_rejected_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of API request rejections"
          description: "{{ $value | printf \"%.2f\" }} requests/sec being rejected"
          runbook: "https://docs/operations/backpressure-monitoring.md#api-rejections-high"

      # Orchestration Layer
      - alert: TaskerCommandChannelSaturated
        expr: orchestration_command_channel_saturation > 0.8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Orchestration command channel is saturated"
          description: "Channel saturation at {{ $value | humanizePercentage }}"
          runbook: "https://docs/operations/backpressure-monitoring.md#channel-saturation"

      - alert: TaskerPGMQQueueDepthHigh
        expr: pgmq_queue_depth / pgmq_queue_max_depth > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PGMQ queue depth is high"
          description: "Queue {{ $labels.queue_name }} at {{ $value | humanizePercentage }} of capacity"
          runbook: "https://docs/operations/backpressure-monitoring.md#pgmq-queue-depth-high"

      # Worker Layer
      - alert: TaskerWorkerClaimRefusalsHigh
        expr: rate(worker_claim_refusals_total[5m]) > 0.2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate of worker claim refusals"
          description: "Worker {{ $labels.worker_id }} refusing {{ $value | printf \"%.1f\" }} claims/sec"
          runbook: "https://docs/operations/backpressure-monitoring.md#worker-claim-refusals-high"

      - alert: TaskerHandlerWaitTimeHigh
        expr: histogram_quantile(0.99, sum by (le) (rate(worker_handler_semaphore_wait_ms_bucket[5m]))) > 2000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Handler wait time is high"
          description: "p99 handler wait time is {{ $value | printf \"%.0f\" }}ms"
          runbook: "https://docs/operations/backpressure-monitoring.md#handler-wait-time-high"

      - alert: TaskerDomainEventsDropped
        expr: rate(domain_events_dropped_total[5m]) > 1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Domain events being dropped"
          description: "{{ $value | printf \"%.1f\" }} events/sec dropped"
          runbook: "https://docs/operations/backpressure-monitoring.md#domain-events-dropped"

Incident Response Procedures

Circuit Breaker Open

Severity: Critical

Symptoms:

  • API returning 503 responses
  • api_circuit_breaker_state = 2
  • Downstream operations failing

Immediate Actions:

  1. Check database connectivity
    psql $DATABASE_URL -c "SELECT 1"
    
  2. Check PGMQ extension health
    psql $DATABASE_URL -c "SELECT * FROM pgmq.meta LIMIT 5"
    
  3. Check recent error logs
    kubectl logs -l app=tasker-orchestration --tail=100 | grep ERROR
    

Recovery:

  • Circuit breaker will automatically attempt recovery after timeout_seconds (default: 30s)
  • If database is healthy, breaker should close after success_threshold (default: 2) successful requests
  • If database is unhealthy, fix database first

Escalation:

  • If breaker remains open > 5 min after database recovery: Escalate to engineering

Messaging Circuit Breaker Open

Severity: Critical

Symptoms:

  • Orchestration cannot enqueue steps or send task finalizations
  • Workers cannot receive step messages or send results
  • messaging_circuit_breaker_state = 2
  • MessagingError::CircuitBreakerOpen in logs

Immediate Actions:

  1. Check messaging backend health
    # For PGMQ (default)
    psql $PGMQ_DATABASE_URL -c "SELECT * FROM pgmq.meta LIMIT 5"
    
    # For RabbitMQ
    rabbitmqctl status
    
  2. Check PGMQ database connectivity (may differ from main database)
    psql $PGMQ_DATABASE_URL -c "SELECT 1"
    
  3. Check recent messaging errors
    kubectl logs -l app=tasker-orchestration --tail=100 | grep -E "messaging|circuit_breaker|CircuitBreakerOpen"
    

Impact:

  • Orchestration: Task initialization stalls, step results cannot be received, task finalizations blocked
  • Workers: Step messages not received, results cannot be sent back to orchestration
  • Safety: Messages remain in queues protected by visibility timeouts; no data loss occurs
  • Health checks: Unaffected (bypass circuit breaker to detect recovery)

Recovery:

  • Circuit breaker will automatically test recovery after timeout_seconds (default: 30s)
  • On recovery, queued messages will be processed normally (visibility timeouts protect against loss)
  • If messaging backend is unhealthy, fix it first — the breaker protects against cascading timeouts

Escalation:

  • If breaker remains open > 5 min after backend recovery: Escalate to engineering
  • If both web and messaging breakers are open simultaneously: Likely database-wide issue, escalate to DBA

API Rejections High

Severity: Warning

Symptoms:

  • Clients receiving 429 or 503 responses
  • api_requests_rejected_total increasing

Diagnosis:

  1. Check rejection reason distribution
    sum by (reason) (rate(api_requests_rejected_total[5m]))
    
  2. If reason=rate_limit: Legitimate load spike or client misbehavior
  3. If reason=circuit_breaker: See Circuit Breaker Open

Actions:

  • Rate limit rejections:
    • Identify high-volume client
    • Consider increasing rate limit or contacting client
  • Circuit breaker rejections:
    • Follow circuit breaker procedure

Channel Saturation

Severity: Warning → Critical if sustained

Symptoms:

  • mpsc_channel_saturation > 0.8
  • Increased latency
  • Potential backpressure cascade

Diagnosis:

  1. Identify saturated channel
    orchestration_command_channel_saturation > 0.8
    
  2. Check upstream rate
    rate(orchestration_commands_received_total[5m])
    
  3. Check downstream processing rate
    rate(orchestration_commands_processed_total[5m])
    

Actions:

  1. Temporary: Scale up orchestration replicas
  2. Short-term: Increase channel buffer size
    [orchestration.mpsc_channels.command_processor]
    command_buffer_size = 10000  # Increase from 5000
    
  3. Long-term: Investigate why processing is slow

PGMQ Queue Depth High

Severity: Warning → Critical if approaching max

Symptoms:

  • pgmq_queue_depth growing
  • Step execution delays
  • Potential OOM if queue grows unbounded

Diagnosis:

  1. Identify growing queue
    pgmq_queue_depth{queue_name=~".*"}
    
  2. Check worker health
    sum(worker_handler_semaphore_permits_available)
    
  3. Check for stuck workers
    count(worker_claim_refusals_total) by (worker_id)
    

Actions:

  1. Scale workers: Add more worker replicas
  2. Increase handler concurrency (short-term):
    [worker.mpsc_channels.handler_dispatch]
    max_concurrent_handlers = 20  # Increase from 10
    
  3. Investigate slow handlers: Check handler execution latency

Worker Claim Refusals High

Severity: Warning

Symptoms:

  • worker_claim_refusals_total increasing
  • Workers at capacity
  • Step execution delayed

Diagnosis:

  1. Check handler permit usage
    worker_handler_semaphore_permits_available
    
  2. Check handler execution time
    histogram_quantile(0.99, sum by (le) (rate(worker_handler_execution_ms_bucket[5m])))
    

Actions:

  1. Scale workers: Add replicas
  2. Optimize handlers: If execution time is high
  3. Adjust threshold: If refusals are premature
    [worker]
    claim_capacity_threshold = 0.9  # More aggressive claiming
    

Handler Wait Time High

Severity: Warning

Symptoms:

  • handler_semaphore_wait_ms_p99 > 1000ms
  • Steps waiting for execution
  • Increased end-to-end latency

Diagnosis:

  1. Check permit utilization
    1 - (worker_handler_semaphore_permits_available / worker_handler_semaphore_permits_total)
    
  2. Check completion channel saturation
    worker_completion_channel_saturation
    

Actions:

  1. Increase permits (if CPU/memory allow):
    [worker.mpsc_channels.handler_dispatch]
    max_concurrent_handlers = 15
    
  2. Optimize handlers: Reduce execution time
  3. Scale workers: If resources constrained

Domain Events Dropped

Severity: Informational

Symptoms:

  • domain_events_dropped_total increasing
  • Downstream event consumers missing events

Diagnosis:

  1. This is expected behavior under load
  2. Check if drop rate is excessive
    rate(domain_events_dropped_total[5m]) / rate(domain_events_dispatched_total[5m])
    

Actions:

  • If < 1% dropped: Normal, no action needed
  • If > 5% dropped: Consider increasing event channel buffer
    [shared.domain_events]
    buffer_size = 20000  # Increase from 10000
    
  • Note: Domain events are non-critical. Dropping does not affect step execution.

Capacity Planning

Determining Appropriate Limits

Command Channel Size

Required buffer = (peak_requests_per_second) * (avg_processing_time_ms / 1000) * safety_factor

Example:
  peak_requests_per_second = 100
  avg_processing_time_ms = 50
  safety_factor = 2

  Required buffer = 100 * 0.05 * 2 = 10 messages
  Recommended: 5000 (500x headroom for bursts)

Handler Concurrency

Optimal concurrency = (worker_cpu_cores) * (1 + (io_wait_ratio))

Example:
  worker_cpu_cores = 4
  io_wait_ratio = 0.8 (handlers are mostly I/O bound)

  Optimal concurrency = 4 * 1.8 = 7.2
  Recommended: 8-10 permits

PGMQ Queue Depth

Max depth = (expected_processing_rate) * (max_acceptable_delay_seconds)

Example:
  expected_processing_rate = 100 steps/sec
  max_acceptable_delay = 60 seconds

  Max depth = 100 * 60 = 6000 messages
  Recommended: 10000 (headroom for bursts)

Grafana Dashboard

Import this dashboard for backpressure monitoring:

{
  "dashboard": {
    "title": "Tasker Backpressure",
    "panels": [
      {
        "title": "Circuit Breaker State",
        "type": "stat",
        "targets": [{"expr": "api_circuit_breaker_state"}]
      },
      {
        "title": "API Rejections Rate",
        "type": "graph",
        "targets": [{"expr": "rate(api_requests_rejected_total[5m])"}]
      },
      {
        "title": "Channel Saturation",
        "type": "graph",
        "targets": [
          {"expr": "orchestration_command_channel_saturation", "legendFormat": "orchestration"},
          {"expr": "worker_dispatch_channel_saturation", "legendFormat": "worker-dispatch"},
          {"expr": "worker_completion_channel_saturation", "legendFormat": "worker-completion"}
        ]
      },
      {
        "title": "PGMQ Queue Depth",
        "type": "graph",
        "targets": [{"expr": "pgmq_queue_depth", "legendFormat": "{{queue_name}}"}]
      },
      {
        "title": "Handler Wait Time (p99)",
        "type": "graph",
        "targets": [{"expr": "histogram_quantile(0.99, worker_handler_semaphore_wait_ms_bucket)"}]
      },
      {
        "title": "Worker Claim Refusals",
        "type": "graph",
        "targets": [{"expr": "rate(worker_claim_refusals_total[5m])"}]
      }
    ]
  }
}

Checkpoint Operations Guide

Last Updated: 2026-01-06 Status: Active Related: Batch Processing - Checkpoint Yielding


Overview

This guide covers operational concerns for checkpoint yielding in production environments, including monitoring, troubleshooting, and maintenance tasks.


Monitoring Checkpoints

Key Metrics

| Metric                   | Description                       | Alert Threshold                     |
|--------------------------|-----------------------------------|-------------------------------------|
| Checkpoint history size  | Length of history array           | >100 entries                        |
| Checkpoint age           | Time since last checkpoint        | >10 minutes (step-dependent)        |
| Accumulated results size | Size of accumulated_results JSONB | >1MB                                |
| Checkpoint frequency     | Checkpoints per step execution    | <1 per minute (may indicate issues) |

SQL Queries for Monitoring

Steps with large checkpoint history:

SELECT
    ws.workflow_step_uuid,
    ws.name,
    t.task_uuid,
    jsonb_array_length(ws.checkpoint->'history') as history_length,
    ws.checkpoint->>'timestamp' as last_checkpoint
FROM tasker.workflow_steps ws
JOIN tasker.tasks t ON ws.task_uuid = t.task_uuid
WHERE ws.checkpoint IS NOT NULL
  AND jsonb_array_length(ws.checkpoint->'history') > 50
ORDER BY history_length DESC
LIMIT 20;

Steps with stale checkpoints (in progress but not checkpointed recently):

SELECT
    ws.workflow_step_uuid,
    ws.name,
    ws.current_state,
    ws.checkpoint->>'timestamp' as last_checkpoint,
    NOW() - (ws.checkpoint->>'timestamp')::timestamptz as checkpoint_age
FROM tasker.workflow_steps ws
WHERE ws.current_state = 'in_progress'
  AND ws.checkpoint IS NOT NULL
  AND (ws.checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes'
ORDER BY checkpoint_age DESC;

Large accumulated results:

SELECT
    ws.workflow_step_uuid,
    ws.name,
    pg_column_size(ws.checkpoint->'accumulated_results') as results_size_bytes,
    pg_size_pretty(pg_column_size(ws.checkpoint->'accumulated_results')::bigint) as results_size
FROM tasker.workflow_steps ws
WHERE ws.checkpoint->'accumulated_results' IS NOT NULL
  AND pg_column_size(ws.checkpoint->'accumulated_results') > 100000
ORDER BY results_size_bytes DESC
LIMIT 20;

Logging

Checkpoint operations emit structured logs:

INFO checkpoint_yield_step_event step_uuid=abc-123 cursor=1000 items_processed=1000
INFO checkpoint_saved step_uuid=abc-123 history_length=5

Log fields to monitor:

  • step_uuid - Step being checkpointed
  • cursor - Current position
  • items_processed - Total items at checkpoint
  • history_length - Number of checkpoint entries

Troubleshooting

Step Not Resuming from Checkpoint

Symptoms: Step restarts from beginning instead of checkpoint position.

Checks:

  1. Verify checkpoint exists:
    SELECT checkpoint FROM tasker.workflow_steps WHERE workflow_step_uuid = 'uuid';
    
  2. Check handler uses BatchWorkerContext accessors:
    • has_checkpoint? / has_checkpoint() / hasCheckpoint()
    • checkpoint_cursor / checkpointCursor
  3. Verify handler respects checkpoint in processing loop

Checkpoint Not Persisting

Symptoms: checkpoint_yield() returns but data not in database.

Checks:

  1. Check for errors in worker logs
  2. Verify FFI bridge is healthy
  3. Check database connectivity

Excessive Checkpoint History Growth

Symptoms: Steps have hundreds or thousands of checkpoint history entries.

Causes:

  • Very long-running processes with frequent checkpoints
  • Small checkpoint intervals relative to work

Remediation:

  1. Increase checkpoint interval (process more items between checkpoints)
  2. Clear history for completed steps (see Maintenance section)
  3. Monitor with history size query above

Large Accumulated Results

Symptoms: Database bloat, slow step queries.

Causes:

  • Storing full result sets instead of summaries
  • Unbounded accumulation without size checks

Remediation:

  1. Modify handler to store summaries, not full data
  2. Use external storage for large intermediate results
  3. Add size checks before checkpoint

Maintenance Tasks

Clear Checkpoint for Completed Steps

Completed steps retain checkpoint data for debugging. To clear:

-- Clear checkpoints for completed steps older than 7 days
UPDATE tasker.workflow_steps
SET checkpoint = NULL
WHERE current_state = 'complete'
  AND checkpoint IS NOT NULL
  AND updated_at < NOW() - INTERVAL '7 days';

Truncate History Array

For steps with excessive history:

-- Keep only last 10 history entries
UPDATE tasker.workflow_steps
SET checkpoint = jsonb_set(
    checkpoint,
    '{history}',
    (SELECT jsonb_agg(elem)
     FROM (
         SELECT elem
         FROM jsonb_array_elements(checkpoint->'history') elem
         ORDER BY (elem->>'timestamp')::timestamptz DESC
         LIMIT 10
     ) sub)
)
WHERE jsonb_array_length(checkpoint->'history') > 10;

Clear Checkpoint for Manual Reset

When manually resetting a step to reprocess from scratch:

-- Clear checkpoint to force reprocessing from beginning
UPDATE tasker.workflow_steps
SET checkpoint = NULL
WHERE workflow_step_uuid = 'step-uuid-here';

Warning: Only clear checkpoints if you want the step to restart from the beginning.


Capacity Planning

Database Sizing

Checkpoint column considerations:

  • Each checkpoint: ~1-10KB typical (cursor, timestamp, metadata)
  • History array: ~100 bytes per entry
  • Accumulated results: Variable (handler-dependent)

Formula for checkpoint storage:

Storage = Active Steps × (Base Checkpoint Size + History Entries × 100 bytes + Accumulated Size)

Example: 10,000 active steps with 50-entry history and 5KB accumulated results:

10,000 × (5KB + 50 × 100B + 5KB) = 10,000 × 15KB = 150MB

Performance Impact

Checkpoint write: ~1-5ms per checkpoint (single row UPDATE)

Checkpoint read: Included in step data fetch (no additional query)

Recommendations:

  • Checkpoint every 1000-10000 items or every 1-5 minutes
  • Too frequent: Excessive database writes
  • Too infrequent: Lost progress on failure

Alerting Recommendations

Prometheus/Grafana Metrics

If exporting to Prometheus:

# Alert on stale checkpoints
- alert: StaleCheckpoint
  expr: tasker_checkpoint_age_seconds > 600
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Step checkpoint is stale"

# Alert on large history
- alert: CheckpointHistoryGrowth
  expr: tasker_checkpoint_history_size > 100
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Checkpoint history exceeding threshold"

Database-Based Alerting

For periodic SQL-based monitoring:

-- Return non-zero if any issues detected
SELECT COUNT(*)
FROM tasker.workflow_steps
WHERE (
    -- Stale in-progress checkpoints
    (current_state = 'in_progress'
     AND checkpoint IS NOT NULL
     AND (checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes')
    OR
    -- Excessive history
    (checkpoint IS NOT NULL
     AND jsonb_array_length(checkpoint->'history') > 100)
);

Connection Pool Tuning Guide

Overview

Tasker maintains two connection pools: tasker (task/step/transition operations) and pgmq (queue operations). Pool observability is provided via:

  • /health/detailed - Pool utilization in pool_utilization field
  • /metrics - Prometheus gauges tasker_db_pool_connections{pool,state}
  • Atomic counters tracking acquire latency and errors

Pool Sizing Guidelines

Formula

max_connections = (peak_concurrent_operations * avg_hold_time_ms) / 1000 + headroom
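
A worked example, reading peak_concurrent_operations as operations started per second (the interpretation assumed here):

Example:
  peak operations per second = 200
  avg_hold_time_ms = 50
  headroom = 10

  max_connections = (200 * 50) / 1000 + 10 = 10 + 10 = 20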

Rules of thumb:

  • Orchestration pool: 2-3x the number of concurrent tasks expected
  • PGMQ pool: 1-2x the number of workers × batch size
  • min_connections: 20-30% of max to avoid cold-start latency
  • Never exceed PostgreSQL’s max_connections / number_of_services

Environment Defaults

| Parameter                 | Base | Production | Development | Test |
|---------------------------|------|------------|-------------|------|
| max_connections (tasker)  | 25   | 50         | 25          | 30   |
| min_connections (tasker)  | 5    | 10         | 5           | 2    |
| max_connections (pgmq)    | 15   | 25         | 15          | 10   |
| slow_acquire_threshold_ms | 100  | 50         | 200         | 500  |

Metrics Interpretation

Utilization Thresholds

| Level     | Utilization | Action                                                       |
|-----------|-------------|--------------------------------------------------------------|
| Healthy   | < 80%       | Normal operation                                             |
| Degraded  | 80-95%      | Monitor closely, consider increasing max_connections         |
| Unhealthy | > 95%       | Pool exhaustion imminent; increase pool size or reduce load  |

Slow Acquires

The slow_acquire_threshold_ms setting controls when an acquire is classified as “slow”:

  • Production (50ms): Tight threshold for SLO-sensitive workloads
  • Development (200ms): Relaxed for local debugging with fewer resources
  • Test (500ms): Very relaxed for CI environments with contention
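
To tighten or relax the threshold per environment, override it in the corresponding environment file (see the Configuration Reference below for paths):

# config/tasker/environments/production/common.toml
[common.database.pool]
slow_acquire_threshold_ms = 50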

A high slow_acquires count relative to total_acquires (>5%) suggests:

  1. Pool is undersized for the workload
  2. Connections are held too long (long queries or transactions)
  3. Connection creation is slow (network latency to DB)

Acquire Errors

Non-zero acquire_errors indicates pool exhaustion (timeout waiting for connection). Remediation:

  1. Increase max_connections
  2. Increase acquire_timeout_seconds (masks the problem)
  3. Reduce query execution time
  4. Check for connection leaks (connections not returned to pool)

PostgreSQL Server-Side Considerations

max_connections

PostgreSQL’s max_connections is a hard limit across all clients. For cluster deployments:

pg_max_connections >= sum(service_max_pool * service_instance_count) + superuser_reserved
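
A worked example for a hypothetical cluster (instance counts and pool sizes are illustrative):

  orchestration (tasker pool): 50 × 2 instances = 100
  workers (tasker pool):       25 × 4 instances = 100
  pgmq pools:                  25 × 6 instances = 150
  superuser_reserved:          10

  pg_max_connections >= 100 + 100 + 150 + 10 = 360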

Default PostgreSQL max_connections is 100. For production:

  • Set max_connections = 500 or higher
  • Reserve 5-10 connections for superuser (superuser_reserved_connections)
  • Monitor with SELECT count(*) FROM pg_stat_activity

Connection Overhead

Each PostgreSQL connection consumes ~5-10MB RAM. Size accordingly:

  • 100 connections ~ 0.5-1GB additional RAM
  • 500 connections ~ 2.5-5GB additional RAM

Statement Timeout

The statement_timeout database variable protects against runaway queries:

  • Production: 30s (default)
  • Test: 5s (fail fast)

Alert Threshold Recommendations

| Metric                | Warning         | Critical         |
|-----------------------|-----------------|------------------|
| Pool utilization      | > 80% for 5 min | > 95% for 1 min  |
| Slow acquires / total | > 5% over 5 min | > 20% over 1 min |
| Acquire errors        | > 0 in 5 min    | > 10 in 1 min    |
| Average acquire time  | > 50ms          | > 200ms          |

Configuration Reference

Pool settings are in config/tasker/base/common.toml under [common.database.pool] and [common.pgmq_database.pool]. Environment-specific overrides are in config/tasker/environments/{env}/common.toml.

[common.database.pool]
max_connections = 25
min_connections = 5
acquire_timeout_seconds = 10
idle_timeout_seconds = 300
max_lifetime_seconds = 1800
slow_acquire_threshold_ms = 100

MPSC Channel Tuning - Operational Runbook

Last Updated: 2025-12-10 Owner: Platform Engineering Related: ADR: Bounded MPSC Channels | Circuit Breakers | Backpressure Architecture

Overview

This runbook provides operational guidance for monitoring, diagnosing, and tuning MPSC channel buffer sizes in the tasker-core system. All channels are bounded with configurable capacities to prevent unbounded memory growth.

Quick Reference

Configuration Files

| File                                                       | Purpose            | When to Edit                 |
|------------------------------------------------------------|--------------------|------------------------------|
| config/tasker/base/mpsc_channels.toml                      | Base configuration | Default values               |
| config/tasker/environments/test/mpsc_channels.toml         | Test overrides     | Test environment tuning      |
| config/tasker/environments/development/mpsc_channels.toml  | Dev overrides      | Local development tuning     |
| config/tasker/environments/production/mpsc_channels.toml   | Prod overrides     | Production capacity planning |

Key Metrics

| Metric                         | Description             | Alert Threshold              |
|--------------------------------|-------------------------|------------------------------|
| mpsc_channel_usage_percent     | Current fill percentage | > 80%                        |
| mpsc_channel_capacity          | Configured buffer size  | N/A (informational)          |
| mpsc_channel_full_events_total | Overflow events counter | > 0 (indicates backpressure) |

Default Buffer Sizes

| Channel               | Base  | Test  | Development | Production |
|-----------------------|-------|-------|-------------|------------|
| Orchestration command | 1000  | 100   | 1000        | 5000       |
| PGMQ notifications    | 10000 | 10000 | 10000       | 50000      |
| Task readiness        | 1000  | 100   | 500         | 5000       |
| Worker command        | 1000  | 1000  | 1000        | 2000       |
| Event publisher       | 5000  | 5000  | 5000        | 10000      |
| Ruby FFI              | 1000  | 1000  | 500         | 2000       |

Monitoring and Alerting

Critical: Channel Saturation

# Alert when any channel is >90% full for 5 minutes
mpsc_channel_usage_percent > 90

Action: Immediate capacity increase or identify bottleneck

Warning: Channel High Usage

# Alert when any channel is >80% full for 15 minutes
mpsc_channel_usage_percent > 80

Action: Plan capacity increase, investigate throughput

Info: Channel Overflow Events

# Alert on any overflow events
rate(mpsc_channel_full_events_total[5m]) > 0

Action: Review backpressure handling, consider capacity increase

Grafana Queries

Channel Usage by Component

max by (channel, component) (mpsc_channel_usage_percent)

Channel Capacity Configuration

max by (channel, component) (mpsc_channel_capacity)

Overflow Event Rate

rate(mpsc_channel_full_events_total[5m])

Log Patterns

Saturation Warning (80% full)

WARN mpsc_channel_saturation channel=orchestration_command usage_percent=82.5

Overflow Event (channel full)

ERROR mpsc_channel_full channel=event_publisher action=dropped

Backpressure Applied

ERROR Ruby FFI event channel full - backpressure applied

Common Issues and Solutions

Issue 1: High Channel Saturation

Symptoms:

  • mpsc_channel_usage_percent consistently > 80%
  • Slow message processing
  • Increased latency

Diagnosis:

  1. Check which channel is saturated:

    # Grep logs for saturation warnings
    grep "mpsc_channel_saturation" logs/tasker.log | tail -20
    
  2. Check metrics for specific channel:

    mpsc_channel_usage_percent{channel="orchestration_command"}
    

Solutions:

Short-term (Immediate Relief):

# Edit appropriate environment file
# Example: config/tasker/environments/production/mpsc_channels.toml

[mpsc_channels.orchestration.command_processor]
command_buffer_size = 10000  # Increase from 5000

Long-term:

  • Investigate message producer rate
  • Optimize message consumer processing
  • Consider horizontal scaling

Issue 2: PGMQ Notification Bursts

Symptoms:

  • Spike in mpsc_channel_usage_percent{channel="pgmq_notifications"}
  • During bulk task creation (1000+ tasks)
  • Temporary saturation followed by recovery

Diagnosis:

  1. Correlate with bulk task operations:

    # Check for bulk task creation in logs
    grep "Bulk task creation" logs/tasker.log
    
  2. Verify buffer size configuration:

    # Check current production configuration
    cat config/tasker/environments/production/mpsc_channels.toml | \
      grep -A 2 "event_listeners"
    

Solutions:

If production buffer < 50000:

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.orchestration.event_listeners]
pgmq_event_buffer_size = 50000  # Recommended for production

If already at 50000 and still saturating:

  • Consider notification coalescing (future feature)
  • Implement batch notification processing
  • Scale orchestration services horizontally

Issue 3: Ruby FFI Backpressure

Symptoms:

  • Errors: “Ruby FFI event channel full - backpressure applied”
  • Ruby handler slowness
  • Increased Rust-side latency

Diagnosis:

  1. Check Ruby handler processing time:

    # Add timing to Ruby handlers
    time_start = Time.now
    result = handler.execute(step)
    duration = Time.now - time_start
    logger.warn("Slow handler: #{duration}s") if duration > 1.0
    
  2. Check FFI channel saturation:

    mpsc_channel_usage_percent{channel="ruby_ffi"}
    

Solutions:

If Ruby handlers are slow:

  • Optimize Ruby handler code
  • Consider async Ruby processing
  • Profile Ruby handler performance

If FFI buffer too small:

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.shared.ffi]
ruby_event_buffer_size = 2000  # Increase from 1000

If Rust-side producing too fast:

  • Add rate limiting to Rust event production
  • Batch events before FFI crossing

Issue 4: Event Publisher Drops

Symptoms:

  • Counter increasing: mpsc_channel_full_events_total{channel="event_publisher"}
  • Log warnings: “Event channel full, dropping event”

Diagnosis:

  1. Check drop rate:

    rate(mpsc_channel_full_events_total{channel="event_publisher"}[5m])
    
  2. Identify event types being dropped:

    grep "dropping event" logs/tasker.log | awk '{print $NF}' | sort | uniq -c
    

Solutions:

If drops are rare (< 1/min):

  • Acceptable for non-critical events
  • Monitor but no action needed

If drops are frequent (> 10/min):

# config/tasker/environments/production/mpsc_channels.toml
[mpsc_channels.shared.event_publisher]
event_queue_buffer_size = 20000  # Increase from 10000

If drops are continuous:

  • Investigate event storm cause
  • Consider event sampling/filtering
  • Review event subscriber performance

Capacity Planning

Sizing Formula

General guideline:

buffer_size = (peak_message_rate_per_sec * avg_processing_time_sec) * safety_factor

Where:

  • peak_message_rate_per_sec: Expected peak throughput
  • avg_processing_time_sec: Average consumer processing time
  • safety_factor: 2-5x for bursts

Example calculation:

# Orchestration command channel
peak_rate = 500 messages/sec
processing_time = 0.01 sec (10ms)
safety_factor = 2x

buffer_size = (500 * 0.01) * 2 = 10 messages minimum
# Use 1000 for burst handling

Environment-Specific Guidelines

Test Environment:

  • Use small buffers (100-500)
  • Exposes backpressure issues early
  • Forces proper error handling

Development Environment:

  • Use moderate buffers (500-1000)
  • Balances local resource usage
  • Mimics test environment behavior

Production Environment:

  • Use large buffers (2000-50000)
  • Handles real-world burst traffic
  • Prioritizes availability over memory

When to Increase Buffer Sizes

Increase if:

  • ✅ Saturation > 80% for extended periods
  • ✅ Overflow events occur regularly
  • ✅ Latency increases during peak load
  • ✅ Known traffic increase incoming

Don’t increase if:

  • ❌ Consumer is bottleneck (fix consumer instead)
  • ❌ Saturation is brief and recovers quickly
  • ❌ Would mask underlying performance issue

Configuration Change Procedure

1. Identify Need

Review metrics and logs to determine which channel needs adjustment.

2. Calculate New Size

Use sizing formula or apply percentage increase:

new_size = current_size * (100 / (100 - target_usage_percent))

# Example: Currently 90% full, target 70%
new_size = 5000 * (100 / (100 - 70)) = 5000 * 3.33 = 16,650
# Round up: 20,000

3. Update Configuration

Important: Environment overrides MUST use full [mpsc_channels.*] prefix!

# ✅ CORRECT
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 20000

# ❌ WRONG - creates conflicting top-level key
[orchestration.command_processor]
command_buffer_size = 20000

4. Deploy

Local/Development:

# Restart service - picks up new config automatically
cargo run --package tasker-orchestration --bin tasker-server --features web-api

Production:

# Standard deployment process
# Configuration is loaded at service startup
kubectl rollout restart deployment/tasker-orchestration

5. Monitor

Watch metrics for 1-2 hours post-change:

  • Channel usage percentage should decrease
  • Overflow events should stop
  • Latency should improve

6. Document

Update this runbook with:

  • Why change was made
  • New values
  • Observed impact

Troubleshooting Checklist

□ Check metric: mpsc_channel_usage_percent for affected channel
□ Review logs for saturation warnings in last 24 hours
□ Verify configuration file has correct [mpsc_channels] prefix
□ Confirm environment variable TASKER_ENV matches intended environment
□ Check if issue correlates with specific operations (bulk tasks, etc.)
□ Verify consumer processing time hasn't increased
□ Check for resource constraints (CPU, memory)
□ Review recent code changes that might affect throughput
□ Consider if horizontal scaling is needed vs buffer increase

Emergency Response

Critical Saturation (>95%)

Immediate Actions:

  1. Increase buffer size by 2-5x in production config
  2. Deploy immediately via rolling restart
  3. Page on-call if service degradation visible

Example:

# Edit config
vim config/tasker/environments/production/mpsc_channels.toml

# Deploy
kubectl rollout restart deployment/tasker-orchestration

# Monitor
watch -n 5 'curl -s localhost:9090/api/v1/query?query=mpsc_channel_usage_percent | jq'

Service Unresponsive Due to Backpressure

Symptoms:

  • All channels showing 100% usage
  • No message processing
  • Health checks failing

Actions:

  1. Check for downstream bottleneck (database, queue service)
  2. Scale out consumer services
  3. Temporarily increase all buffer sizes
  4. Check circuit breaker states (/health/detailed endpoint) - if circuit breakers are open, address underlying database/service issues first

Note: MPSC channels and circuit breakers are complementary resilience mechanisms. Channel saturation indicates internal backpressure, while circuit breaker state indicates external service health. See Circuit Breakers for operational guidance.

Best Practices

  1. Monitor Proactively: Don’t wait for alerts - review metrics weekly
  2. Test Changes in Dev: Validate buffer changes in development first
  3. Document Rationale: Note why each production override exists
  4. Gradual Increases: Prefer 2x increases over 10x jumps
  5. Review Quarterly: Adjust defaults based on production patterns
  6. Alert on Changes: Get notified of configuration file commits


Support

  • Questions? Ask in the #platform-engineering Slack channel
  • Issues? File a ticket with the infrastructure/channels label
  • Escalation? Page on-call via PagerDuty

Cluster Testing Guide

Last Updated: 2026-01-19 Audience: Developers, QA Status: Active Related: Tooling | Idempotency and Atomicity


Overview

This guide covers multi-instance cluster testing for validating horizontal scaling, race condition detection, and concurrent processing scenarios.

Key Capabilities:

  • Run N orchestration instances with M worker instances
  • Test concurrent task creation across instances
  • Validate state consistency across cluster
  • Detect race conditions and data corruption
  • Measure performance under concurrent load

Test Infrastructure

Feature Flags

Tests are organized by infrastructure requirements using Cargo feature flags:

Feature Flag    | Infrastructure Required           | In CI?
test-db         | PostgreSQL database               | Yes
test-messaging  | DB + messaging (PGMQ/RabbitMQ)    | Yes
test-services   | DB + messaging + services running | Yes
test-cluster    | Multi-instance cluster running    | No

Hierarchy: Each flag implies the previous (test-cluster includes test-services includes test-messaging includes test-db).

Test Commands

# Unit tests (DB + messaging only)
cargo make test-rust-unit

# E2E tests (services running)
cargo make test-rust-e2e

# Cluster tests (cluster running - LOCAL ONLY)
cargo make test-rust-cluster

# All tests including cluster
cargo make test-rust-all

Test Entry Points

tests/
├── basic_tests.rs        # Always compiles
├── integration_tests.rs  # #[cfg(feature = "test-messaging")]
├── e2e_tests.rs         # #[cfg(feature = "test-services")]
└── e2e/
    └── multi_instance/   # #[cfg(feature = "test-cluster")]
        ├── mod.rs
        ├── concurrent_task_creation_test.rs
        └── consistency_test.rs

Multi-Instance Test Manager

The MultiInstanceTestManager provides high-level APIs for cluster testing.

Location

tests/common/multi_instance_test_manager.rs
tests/common/orchestration_cluster.rs

Basic Usage

#![allow(unused)]
fn main() {
use crate::common::multi_instance_test_manager::MultiInstanceTestManager;

#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_concurrent_operations() -> Result<()> {
    // Setup from environment (reads TASKER_TEST_ORCHESTRATION_URLS, etc.)
    let manager = MultiInstanceTestManager::setup_from_env().await?;

    // Wait for all instances to become healthy
    manager.wait_for_healthy(Duration::from_secs(30)).await?;

    // Create tasks concurrently across the cluster
    let requests = vec![create_task_request("namespace", "task", json!({})); 10];
    let responses = manager.create_tasks_concurrent(requests).await?;

    // Wait for completion
    let task_uuids: Vec<_> = responses.iter()
        .map(|r| uuid::Uuid::parse_str(&r.task_uuid).unwrap())
        .collect();
    let timeout = Duration::from_secs(60); // completion budget for the batch
    let completed = manager.wait_for_tasks_completion(task_uuids.clone(), timeout).await?;

    // Verify consistency across all instances
    for uuid in &task_uuids {
        manager.verify_task_consistency(*uuid).await?;
    }

    Ok(())
}
}

Key Methods

Method                                    | Description
setup_from_env()                          | Create manager from environment variables
setup(orch_count, worker_count)           | Create manager with explicit counts
wait_for_healthy(timeout)                 | Wait for all instances to be healthy
create_tasks_concurrent(requests)         | Create tasks using round-robin distribution
wait_for_task_completion(uuid, timeout)   | Wait for single task completion
wait_for_tasks_completion(uuids, timeout) | Wait for multiple tasks
verify_task_consistency(uuid)             | Verify task state across all instances
orchestration_count()                     | Number of orchestration instances
worker_count()                            | Number of worker instances
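
The sketch below combines several of these methods, assuming the same test helpers as the Basic Usage example above; the instance counts, namespace, task name, and timeouts are illustrative.

use crate::common::multi_instance_test_manager::MultiInstanceTestManager;
use std::time::Duration;

#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_explicit_cluster_setup() -> Result<()> {
    // Explicit counts instead of environment-driven setup_from_env()
    let manager = MultiInstanceTestManager::setup(2, 4).await?;
    manager.wait_for_healthy(Duration::from_secs(30)).await?;

    assert_eq!(manager.orchestration_count(), 2);
    assert_eq!(manager.worker_count(), 4);

    // Submit a single task and wait for it individually
    let mut responses = manager
        .create_tasks_concurrent(vec![create_task_request("namespace", "task", json!({}))])
        .await?;
    let task_uuid = uuid::Uuid::parse_str(&responses.remove(0).task_uuid).unwrap();
    manager
        .wait_for_task_completion(task_uuid, Duration::from_secs(60))
        .await?;

    Ok(())
}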

OrchestrationCluster

Lower-level cluster abstraction with load balancing:

#![allow(unused)]
fn main() {
use crate::common::orchestration_cluster::{OrchestrationCluster, ClusterConfig, LoadBalancingStrategy};

// Create cluster with round-robin load balancing
let config = ClusterConfig {
    orchestration_urls: vec!["http://localhost:8080", "http://localhost:8081"],
    worker_urls: vec!["http://localhost:8100", "http://localhost:8101"],
    load_balancing: LoadBalancingStrategy::RoundRobin,
    health_timeout: Duration::from_secs(5),
};
let cluster = OrchestrationCluster::new(config).await?;

// Get client using load balancing strategy
let client = cluster.get_client();

// Get all clients for parallel operations
for client in cluster.all_clients() {
    let task = client.get_task(task_uuid).await?;
}
}

Running Cluster Tests

Prerequisites

  1. PostgreSQL running with PGMQ extension
  2. Environment configured for cluster mode

Step-by-Step

# 1. Start PostgreSQL (if not already running)
cargo make docker-up

# 2. Setup cluster environment
cargo make setup-env-cluster

# 3. Start the full cluster
cargo make cluster-start-all

# 4. Verify cluster health
cargo make cluster-status

# Expected output:
# Instance Status:
# ─────────────────────────────────────────────────────────────
# INSTANCE              STATUS     PID        PORT
# ─────────────────────────────────────────────────────────────
# orchestration-1       healthy    12345      8080
# orchestration-2       healthy    12346      8081
# worker-rust-1         healthy    12347      8100
# worker-rust-2         healthy    12348      8101
# ... (more workers)

# 5. Run cluster tests
cargo make test-rust-cluster

# 6. Stop cluster when done
cargo make cluster-stop

Monitoring During Tests

# In separate terminal: Watch cluster logs
cargo make cluster-logs

# Or orchestration logs only
cargo make cluster-logs-orchestration

# Quick status check (no health probes)
cargo make cluster-status-quick

Test Scenarios

Concurrent Task Creation

Validates that tasks can be created concurrently across orchestration instances without conflicts.

File: tests/e2e/multi_instance/concurrent_task_creation_test.rs

Test: test_concurrent_task_creation_across_instances

Validates:

  1. Tasks created through different orchestration instances
  2. All tasks complete successfully
  3. State is consistent across all instances
  4. No duplicate UUIDs generated
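
A minimal sketch of the no-duplicate-UUIDs check, assuming responses came back from manager.create_tasks_concurrent(...) as in the Basic Usage example:

use std::collections::HashSet;

// Every orchestration instance should have produced a distinct task UUID.
let uuids: Vec<String> = responses.iter().map(|r| r.task_uuid.clone()).collect();
let unique: HashSet<&String> = uuids.iter().collect();
assert_eq!(
    unique.len(),
    uuids.len(),
    "duplicate task UUIDs returned across orchestration instances"
);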

Rapid Task Burst

Stress tests the system by creating many tasks in quick succession.

Test: test_rapid_task_creation_burst

Validates:

  1. System handles high task creation rate
  2. No duplicate task UUIDs
  3. All tasks created successfully

Round-Robin Distribution

Verifies tasks are distributed across instances using round-robin.

Test: test_task_creation_round_robin_distribution

Validates:

  1. Tasks distributed across instances
  2. Distribution is approximately even
  3. No single-instance bottleneck

Validation Results

The cluster testing infrastructure was validated with the following results:

Test Summary

Metric                | Result
Tests Passed          | 1645
Intermittent Failures | 3 (resource contention, not race conditions)
Tests Skipped         | 21 (domain event tests, require single-instance)
Cluster Configuration | 2x orchestration + 2x each worker type (10 total)

Key Findings

  1. No Race Conditions Detected: All concurrent operations completed without data corruption or invalid states

  2. Defense-in-Depth Validated: Four protection layers (database atomicity, state machine guards, transaction boundaries, application logic) work correctly together

  3. Recovery Mechanism Works: Tasks and steps recover correctly after simulated failures

  4. Consistent State: Task state is consistent when queried from any orchestration instance

Connection Pool Deadlock (Fixed)

Initial testing revealed intermittent failures under high parallelization:

  • Cause: Connection pool deadlock in task initialization - transactions held connections while template loading needed additional connections
  • Root Cause Fix: Moved template loading BEFORE transaction begins in task_initialization/service.rs
  • Additional Tuning: Increased pool sizes (20→30 max, 1→2 min connections)
  • Status: ✅ Fixed - all 9 cluster tests now pass in parallel

See the connection pool deadlock pattern documentation in docs/ticket-specs/ for details.

Domain Event Tests

21 tests were skipped in cluster mode (marked with #[cfg(not(feature = "test-cluster"))]):

  • Reason: Domain event tests verify in-process event delivery, incompatible with multi-process cluster
  • Status: Working as designed - these tests run in single-instance CI

Test Feature Flag Implementation

Adding the Feature Gate

Tests requiring cluster infrastructure should use the feature gate:

#![allow(unused)]
fn main() {
// At module level
#![cfg(feature = "test-cluster")]

// Or at test level
#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_cluster_specific_behavior() -> Result<()> {
    // ...
}
}

Skipping Tests in Cluster Mode

Some tests (like domain events) don’t work in cluster mode:

#![allow(unused)]
fn main() {
// Only run when NOT in cluster mode
#[tokio::test]
#[cfg(not(feature = "test-cluster"))]
async fn test_domain_event_delivery() -> Result<()> {
    // In-process event testing
}
}

Conditional Imports

#![allow(unused)]
fn main() {
// Only import cluster test utilities when needed
#[cfg(feature = "test-cluster")]
use crate::common::multi_instance_test_manager::MultiInstanceTestManager;
}

Nextest Configuration

The .config/nextest.toml configures test execution for cluster scenarios:

[profile.default]
retries = 0
leak-timeout = { period = "500ms", result = "fail" }
fail-fast = false

# Multi-instance tests can run in parallel once cluster is warmed up
[[profile.default.overrides]]
filter = 'test(multi_instance)'

[profile.ci]
# Limit parallelism to avoid database connection pool exhaustion
test-threads = 4

Cluster Warmup: Multi-instance tests can run in parallel. Connection pools now start with min_connections=2 for faster warmup. The 5-second delay built into cluster-start-all usually suffices. If you see “Failed to create task after all retries” errors immediately after startup, wait a few more seconds for pools to fully initialize.


Troubleshooting

Cluster Won’t Start

# Check for port conflicts
lsof -i :8080-8089
lsof -i :8100-8109

# Check for stale PID files
ls -la .pids/
rm -rf .pids/*.pid  # Clean up stale PIDs

# Retry start
cargo make cluster-start-all

Tests Timeout / “Failed to create task after all retries”

This typically indicates the cluster wasn’t fully warmed up:

# Check cluster health
cargo make cluster-status

# If health is green but tests fail, wait for warmup
sleep 10 && cargo make test-rust-cluster

# Check logs for errors
cargo make cluster-logs | head -100

# Restart cluster with extra warmup
cargo make cluster-stop
cargo make cluster-start-all
sleep 10
cargo make test-rust-cluster

Root cause: Connection pools start at min_connections=2 and grow on demand. The first requests after startup may timeout while pools are establishing connections.

Connection Pool Exhaustion

If tests fail with “pool timed out” errors, ensure you have the latest code with:

  • Template loading before transaction in task_initialization/service.rs
  • Pool sizes: max_connections=30, min_connections=2 in test config

If issues persist, verify pool configuration:

# Check test config
cat config/tasker/generated/orchestration-test.toml | grep -A5 "pool"

Environment Variables Not Set

# Verify environment
env | grep TASKER_TEST

# Re-source environment
source .env

# Or regenerate
cargo make setup-env-cluster

CI Considerations

Cluster tests are NOT run in CI due to GitHub Actions resource constraints:

  • Running multiple orchestration + worker instances requires more memory than free GHA runners provide
  • This is a conscious tradeoff for an open-source, pre-alpha project

Future Options (when project matures):

  • Self-hosted runners with more resources
  • Paid GHA larger runners
  • Separate manual workflow trigger for cluster tests

Workaround: Run cluster tests locally before PRs that touch concurrent processing code.


Comprehensive Lifecycle Testing Framework Guide

This guide demonstrates the complete lifecycle testing framework, showing patterns, examples, and best practices for validating task and workflow step lifecycles with integrated SQL function validation.

Table of Contents

  1. Framework Overview
  2. Core Testing Patterns
  3. Advanced Assertion Traits
  4. Template-Based Testing
  5. SQL Function Integration
  6. Example Test Executions
  7. Tracing Output Examples
  8. Best Practices
  9. Troubleshooting

Framework Overview

Architecture

The comprehensive lifecycle testing framework consists of several key components:

#![allow(unused)]
fn main() {
// Core Infrastructure
TestOrchestrator          // Wrapper around orchestration components
StepErrorSimulator        // Realistic error scenario simulation
SqlLifecycleAssertion     // SQL function validation
TestScenarioBuilder       // YAML template loading

// Advanced Patterns
TemplateTestRunner        // Parameterized error pattern testing
ErrorPattern              // Comprehensive error configuration
TaskAssertions           // Task-level validation trait
StepAssertions           // Step-level validation trait
}

Integration Strategy

Each test follows the integrated validation pattern:

  1. Exercise Lifecycle: Use orchestration framework to create scenario
  2. Capture SQL State: Call SQL functions to get current state
  3. Assert Integration: Validate SQL functions return expected values
  4. Document Relationship: Structured tracing showing cause → effect

Core Testing Patterns

Pattern 1: Basic Lifecycle Validation

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_basic_lifecycle_validation(pool: PgPool) -> Result<()> {
    tracing::info!("🧪 Testing basic lifecycle progression");

    // STEP 1: Exercise lifecycle using framework
    let orchestrator = TestOrchestrator::new(pool.clone());
    let task = orchestrator.create_simple_task("test", "basic_validation").await?;
    let step = get_first_step(&pool, task.task_uuid).await?;

    // STEP 2: Validate initial state
    pool.assert_step_ready(step.workflow_step_uuid).await?;

    // STEP 3: Execute step
    let result = orchestrator.execute_step(&step, true, 1000).await?;
    assert!(result.success);

    // STEP 4: Validate final state
    pool.assert_step_complete(step.workflow_step_uuid).await?;

    tracing::info!("✅ Basic lifecycle validation complete");
    Ok(())
}
}

Pattern 2: Error and Retry Validation

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_error_retry_validation(pool: PgPool) -> Result<()> {
    tracing::info!("🔄 Testing error and retry behavior");

    let orchestrator = TestOrchestrator::new(pool.clone());
    let task = orchestrator.create_simple_task("test", "retry_validation").await?;
    let step = get_first_step(&pool, task.task_uuid).await?;

    // STEP 1: Simulate retryable error
    StepErrorSimulator::simulate_execution_error(
        &pool,
        &step,
        1 // attempt number
    ).await?;

    // STEP 2: Validate retry behavior
    pool.assert_step_retry_behavior(
        step.workflow_step_uuid,
        1,    // expected attempts
        None, // no custom backoff
        true  // still retry eligible
    ).await?;

    tracing::info!("✅ Error retry validation complete");
    Ok(())
}
}

Pattern 3: Complex Dependency Validation

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_dependency_validation(pool: PgPool) -> Result<()> {
    tracing::info!("🔗 Testing dependency relationships");

    let orchestrator = TestOrchestrator::new(pool.clone());

    // Create diamond pattern workflow
    let task = create_diamond_workflow_task(&orchestrator).await?;
    let steps = get_task_steps(&pool, task.task_uuid).await?;

    // Execute start step
    let result = orchestrator.execute_step(&steps[0], true, 1000).await?;
    assert!(result.success);

    // Fail one branch
    StepErrorSimulator::simulate_validation_error(
        &pool,
        &steps[1],
        "dependency_test_error"
    ).await?;

    // Complete other branch
    let result = orchestrator.execute_step(&steps[2], true, 1000).await?;
    assert!(result.success);

    // Validate convergence step is blocked
    pool.assert_step_blocked(steps[3].workflow_step_uuid).await?;

    tracing::info!("✅ Dependency validation complete");
    Ok(())
}
}

Advanced Assertion Traits

TaskAssertions Trait Usage

#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::{TaskAssertions, TaskStepDistribution};

// Task completion validation
pool.assert_task_complete(task_uuid).await?;

// Task error state validation
pool.assert_task_error(task_uuid, 2).await?; // 2 error steps

// Complex step distribution validation
pool.assert_task_step_distribution(
    task_uuid,
    TaskStepDistribution {
        total_steps: 4,
        completed_steps: 2,
        failed_steps: 1,
        ready_steps: 0,
        pending_steps: 1,
        in_progress_steps: 0,
        error_steps: 1,
    }
).await?;

// Execution status validation
pool.assert_task_execution_status(
    task_uuid,
    ExecutionStatus::BlockedByFailures,
    Some(RecommendedAction::HandleFailures)
).await?;

// Completion percentage validation
pool.assert_task_completion_percentage(task_uuid, 75.0, 5.0).await?;
}

StepAssertions Trait Usage

#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::StepAssertions;

// Basic step state validations
pool.assert_step_ready(step_uuid).await?;
pool.assert_step_complete(step_uuid).await?;
pool.assert_step_blocked(step_uuid).await?;

// Retry behavior validation
pool.assert_step_retry_behavior(
    step_uuid,
    3,        // expected attempts
    Some(30), // custom backoff seconds
    false     // not retry eligible (exhausted)
).await?;

// Dependency validation
pool.assert_step_dependencies_satisfied(step_uuid, true).await?;

// State transition sequence validation
pool.assert_step_state_sequence(
    step_uuid,
    vec!["Pending".to_string(), "InProgress".to_string(), "Complete".to_string()]
).await?;

// Permanent failure validation
pool.assert_step_failed_permanently(step_uuid).await?;

// Waiting for retry with specific time
let retry_time = chrono::Utc::now() + chrono::Duration::seconds(60);
pool.assert_step_waiting(step_uuid, retry_time).await?;
}

Template-Based Testing

ErrorPattern Configuration

#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::{ErrorPattern, TemplateTestRunner};

// Simple patterns
let success_pattern = ErrorPattern::AllSuccess;
let first_fail_pattern = ErrorPattern::FirstStepFails { retryable: true };
let last_fail_pattern = ErrorPattern::LastStepFails { permanently: false };

// Advanced patterns
let targeted_pattern = ErrorPattern::MiddleStepFails {
    step_name: "process_payment".to_string(),
    attempts_before_success: 3
};

let dependency_pattern = ErrorPattern::DependencyBlockage {
    blocked_step: "finalize_order".to_string(),
    blocking_step: "validate_payment".to_string()
};

// Custom pattern with full control
let custom_pattern = ErrorPattern::Custom {
    step_configs: {
        let mut configs = HashMap::new();
        configs.insert("critical_step".to_string(), StepErrorConfig {
            error_type: StepErrorType::ExternalServiceError,
            attempts_before_success: Some(5),
            custom_backoff_seconds: Some(120),
            permanently_fails: false,
        });
        configs
    }
};
}

Template Runner Usage

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_template_patterns(pool: PgPool) -> Result<()> {
    let template_runner = TemplateTestRunner::new(pool.clone()).await?;

    // Test single pattern
    let summary = template_runner.run_template_with_errors(
        "order_fulfillment.yaml",
        ErrorPattern::FirstStepFails { retryable: true }
    ).await?;

    assert!(summary.sql_validations_passed > 0);
    assert_eq!(summary.sql_validations_failed, 0);

    // Test all patterns automatically
    let summaries = template_runner
        .run_template_with_all_patterns("linear_workflow.yaml")
        .await?;

    for summary in summaries {
        tracing::info!(
            pattern = summary.error_pattern,
            execution_time = summary.execution_time_ms,
            validations = summary.sql_validations_passed,
            "Pattern execution complete"
        );
    }

    Ok(())
}
}

SQL Function Integration

Direct SQL Function Testing

#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_direct_sql_functions(pool: PgPool) -> Result<()> {
    // Test get_step_readiness_status
    let step_status = sqlx::query!(
        "SELECT ready_for_execution, dependencies_satisfied, retry_eligible, attempts,
                backoff_request_seconds, next_retry_at
         FROM get_step_readiness_status($1)",
        step_uuid
    )
    .fetch_one(&pool)
    .await?;

    // Validate individual fields
    assert_eq!(step_status.ready_for_execution, Some(true));
    assert_eq!(step_status.dependencies_satisfied, Some(true));
    assert_eq!(step_status.retry_eligible, Some(false));
    assert_eq!(step_status.attempts, Some(0));

    // Test get_task_execution_context
    let task_context = sqlx::query!(
        "SELECT total_steps, completed_steps, failed_steps, ready_steps,
                pending_steps, in_progress_steps, error_steps,
                completion_percentage, execution_status, recommended_action,
                blocked_by_errors
         FROM get_task_execution_context($1)",
        task_uuid
    )
    .fetch_one(&pool)
    .await?;

    // Validate task aggregations
    assert!(task_context.total_steps.unwrap_or(0) > 0);
    assert_eq!(task_context.completed_steps, Some(0));
    assert_eq!(task_context.failed_steps, Some(0));

    Ok(())
}
}

Integrated SQL Validation Pattern

#![allow(unused)]
fn main() {
// The standard pattern used throughout the framework
async fn validate_integrated_sql_behavior(
    pool: &PgPool,
    task_uuid: Uuid,
    step_uuid: Uuid
) -> Result<()> {
    // STEP 1: Execute lifecycle action
    StepErrorSimulator::simulate_execution_error(pool, step, 2).await?;

    // STEP 2: Immediately validate SQL functions
    SqlLifecycleAssertion::assert_step_scenario(
        pool,
        task_uuid,
        step_uuid,
        ExpectedStepState {
            state: "Error".to_string(),
            ready_for_execution: false,
            dependencies_satisfied: true,
            retry_eligible: true,
            attempts: 2,
            next_retry_at: Some(calculate_expected_retry_time()),
            backoff_request_seconds: None,
            retry_limit: 3,
        }
    ).await?;

    // STEP 3: Document the relationship
    tracing::info!(
        lifecycle_action = "simulate_execution_error",
        sql_result = "retry_eligible=true, attempts=2",
        "✅ INTEGRATION: Lifecycle → SQL alignment verified"
    );

    Ok(())
}
}

Example Test Executions

Running Individual Tests

# Run specific test with detailed output
RUST_LOG=info cargo test --test complex_retry_scenarios \
    test_cascading_retries_with_dependencies -- --nocapture

# Run all lifecycle tests
cargo test --all-features --test '*lifecycle*' -- --nocapture

# Run with specific environment
TASKER_ENV=test cargo test --test step_retry_lifecycle_tests -- --nocapture

Running Test Suites

# Run comprehensive validation
cargo test --test sql_function_integration_validation -- --nocapture

# Run complex scenarios
cargo test --test complex_retry_scenarios -- --nocapture

# Run task finalization tests
cargo test --test task_finalization_error_scenarios -- --nocapture

Tracing Output Examples

Successful Test Execution

2025-01-15T10:30:45.123Z INFO test_cascading_retries_with_dependencies:
🧪 Testing cascading retries with diamond dependency pattern

2025-01-15T10:30:45.125Z INFO test_cascading_retries_with_dependencies:
🏗️ Creating diamond workflow: Start → BranchA/BranchB → Convergence

2025-01-15T10:30:45.145Z INFO test_cascading_retries_with_dependencies:
📋 STEP 1: Executing start step successfully
step_uuid=01JGJX7K8QMRNP4W2X3Y5Z6ABC

2025-01-15T10:30:45.167Z INFO test_cascading_retries_with_dependencies:
🔄 STEP 2: Simulating BranchA failure (attempt 1)
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
error_type="ExecutionError" retryable=true

2025-01-15T10:30:45.189Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Retry behavior matches expectations
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
attempts=1 backoff=null retry_eligible=true

2025-01-15T10:30:45.201Z INFO test_cascading_retries_with_dependencies:
🔄 STEP 3: BranchA retry attempt (attempt 2)
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF

2025-01-15T10:30:45.223Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step completed successfully
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF

2025-01-15T10:30:45.245Z INFO test_cascading_retries_with_dependencies:
❌ STEP 4: Simulating BranchB permanent failure
step_uuid=01JGJX7K8SMRNP4W2X3Y5Z6GHI
error_type="ValidationError" retryable=false

2025-01-15T10:30:45.267Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step failed permanently (not retryable)
step_uuid=01JGJX7K8SMRNP4W2X3Y5Z6GHI

2025-01-15T10:30:45.289Z INFO test_cascading_retries_with_dependencies:
🚫 STEP 5: Validating Convergence step is blocked
step_uuid=01JGJX7K8TMRNP4W2X3Y5Z6JKL

2025-01-15T10:30:45.301Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step blocked by dependencies
step_uuid=01JGJX7K8TMRNP4W2X3Y5Z6JKL

2025-01-15T10:30:45.323Z INFO test_cascading_retries_with_dependencies:
📊 TASK ASSERTION: Step distribution matches expectations
task_uuid=01JGJX7K8PMRNP4W2X3Y5Z6MNO
total=4 completed=2 failed=0 ready=0 pending=0 in_progress=0 error=2

2025-01-15T10:30:45.345Z INFO test_cascading_retries_with_dependencies:
✅ INTEGRATION: Lifecycle → SQL alignment verified
lifecycle_action="cascading_retry_with_dependency_blocking"
sql_result="blocked_by_errors=true, error_steps=2"

2025-01-15T10:30:45.356Z INFO test_cascading_retries_with_dependencies:
🧪 CASCADING RETRY TEST COMPLETE: Diamond pattern with mixed outcomes validated

Error Pattern Testing Output

2025-01-15T10:35:12.123Z INFO test_template_runner_all_patterns:
🎭 TEMPLATE DEMO: All error patterns with multiple templates

2025-01-15T10:35:12.145Z INFO test_template_runner_all_patterns:
📋 Testing template with all error patterns
template="linear_workflow.yaml"

2025-01-15T10:35:12.167Z INFO template_runner:
🎭 TEMPLATE TEST: Starting parameterized test execution
template_path="linear_workflow.yaml"
error_pattern=r#"AllSuccess"#

2025-01-15T10:35:12.234Z INFO template_runner:
🎭 TEMPLATE TEST: Execution complete
template_path="linear_workflow.yaml"
execution_time_ms=67
successful_steps=4 failed_steps=0 retried_steps=0
final_state="Complete"
validations_passed=12 validations_failed=0

2025-01-15T10:35:12.256Z INFO template_runner:
🎭 TEMPLATE TEST: Starting parameterized test execution
template_path="linear_workflow.yaml"
error_pattern=r#"FirstStepFails { retryable: true }"#

2025-01-15T10:35:12.334Z INFO template_runner:
📋 TEMPLATE: Simulated retryable error
step_name="initialize" attempt=1

2025-01-15T10:35:12.356Z INFO template_runner:
📋 TEMPLATE: Simulated retryable error
step_name="initialize" attempt=2

2025-01-15T10:35:12.423Z INFO template_runner:
🎭 TEMPLATE TEST: Execution complete
template_path="linear_workflow.yaml"
execution_time_ms=167
successful_steps=4 failed_steps=0 retried_steps=1
final_state="Complete"
validations_passed=15 validations_failed=0

2025-01-15T10:35:12.445Z INFO test_template_runner_all_patterns:
📊 Template pattern result
template="linear_workflow.yaml" pattern_index=0
pattern="AllSuccess" execution_time_ms=67
final_state="Complete" total_validations=12 success_rate="100.0%"

2025-01-15T10:35:12.467Z INFO test_template_runner_all_patterns:
📊 Template pattern result
template="linear_workflow.yaml" pattern_index=1
pattern=r#"FirstStepFails { retryable: true }"# execution_time_ms=167
final_state="Complete" total_validations=15 success_rate="100.0%"

SQL Function Validation Output

2025-01-15T10:40:30.123Z INFO test_comprehensive_sql_function_integration:
🔍 SQL INTEGRATION: Starting comprehensive validation across all scenarios

2025-01-15T10:40:30.145Z INFO test_comprehensive_sql_function_integration:
📋 SCENARIO 1: Basic lifecycle progression validation

2025-01-15T10:40:30.167Z INFO validate_initial_state:
✅ Initial state validation passed

2025-01-15T10:40:30.189Z INFO validate_step_completion:
✅ Step completion validation passed
step_uuid=01JGJX7M8QMRNP4W2X3Y5Z6PQR

2025-01-15T10:40:30.201Z INFO test_comprehensive_sql_function_integration:
✅ SCENARIO 1: Basic lifecycle validation complete
scenario="basic_lifecycle" validations=2

2025-01-15T10:40:30.223Z INFO test_comprehensive_sql_function_integration:
🔄 SCENARIO 2: Error handling and retry behavior validation

2025-01-15T10:40:30.245Z INFO validate_retry_behavior:
✅ Retry behavior validation passed
step_uuid=01JGJX7M8RMRNP4W2X3Y5Z6STU
attempts=1 backoff=Some(5) retry_eligible=true

2025-01-15T10:40:30.267Z INFO test_comprehensive_sql_function_integration:
✅ SCENARIO 2: Error and retry validation complete
scenario="error_retry" validations=1

2025-01-15T10:40:30.289Z INFO test_comprehensive_sql_function_integration:
🎯 FINAL VALIDATION: Comprehensive results summary
total_validations=25 successful_validations=25
success_rate="100.00%" scenarios_tested=6

2025-01-15T10:40:30.301Z INFO test_comprehensive_sql_function_integration:
🔍 SQL INTEGRATION VALIDATION COMPLETE: All scenarios validated successfully

Best Practices

1. Always Use Integrated Validation Pattern

#![allow(unused)]
fn main() {
// ✅ GOOD: Integrated lifecycle + SQL validation
async fn test_step_retry_behavior(pool: PgPool) -> Result<()> {
    // Exercise lifecycle
    StepErrorSimulator::simulate_execution_error(pool, step, 1).await?;

    // Immediately validate SQL functions
    pool.assert_step_retry_behavior(step_uuid, 1, None, true).await?;

    // Document relationship
    tracing::info!("✅ INTEGRATION: Retry behavior alignment verified");
    Ok(())
}

// ❌ BAD: Testing SQL functions in isolation
async fn test_sql_only(pool: PgPool) -> Result<()> {
    // Directly manipulating database state
    sqlx::query!("UPDATE steps SET attempts = 3").execute(pool).await?;

    // This doesn't prove the integration works
    let status = sqlx::query!("SELECT * FROM get_step_readiness_status($1)", uuid)
        .fetch_one(pool).await?;
    Ok(())
}
}

2. Use Structured Tracing

#![allow(unused)]
fn main() {
// ✅ GOOD: Structured tracing with context
tracing::info!(
    step_uuid = %step.workflow_step_uuid,
    attempts = expected_attempts,
    backoff = ?expected_backoff,
    retry_eligible = expected_retry_eligible,
    "✅ STEP ASSERTION: Retry behavior matches expectations"
);

// ❌ BAD: Unstructured logging
println!("Step retry test passed");
}

3. Test Multiple Scenarios

#![allow(unused)]
fn main() {
// ✅ GOOD: Comprehensive scenario coverage
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_complete_retry_scenarios(pool: PgPool) -> Result<()> {
    // Test retryable error
    test_retryable_error_scenario(&pool).await?;

    // Test non-retryable error
    test_non_retryable_error_scenario(&pool).await?;

    // Test retry exhaustion
    test_retry_exhaustion_scenario(&pool).await?;

    // Test custom backoff
    test_custom_backoff_scenario(&pool).await?;

    Ok(())
}
}

4. Validate State Transitions

#![allow(unused)]
fn main() {
// ✅ GOOD: Validate complete state transition sequence
pool.assert_step_state_sequence(
    step_uuid,
    vec![
        "Pending".to_string(),
        "InProgress".to_string(),
        "Error".to_string(),
        "WaitingForRetry".to_string(),
        "Ready".to_string(),
        "Complete".to_string()
    ]
).await?;
}

5. Use Assertion Traits for Readability

#![allow(unused)]
fn main() {
// ✅ GOOD: Clear, readable assertions
pool.assert_task_complete(task_uuid).await?;
pool.assert_step_failed_permanently(step_uuid).await?;

// ❌ BAD: Manual SQL queries everywhere
let task_status = sqlx::query!("SELECT ...").fetch_one(pool).await?;
assert_eq!(task_status.some_field, Some("Complete"));
}

Troubleshooting

Common Issues

1. Assertion Failures

Error: Task 01JGJX... assertion failed: expected Complete, found Processing

// Solution: Ensure lifecycle actions complete before asserting
tokio::time::sleep(Duration::from_millis(100)).await;
pool.assert_task_complete(task_uuid).await?;

2. SQL Function Mismatches

Error: Step 01JGJX... retry assertion failed: attempts expected 2, got Some(1)

// Solution: Verify error simulator is configured correctly
StepErrorSimulator::simulate_execution_error(pool, step, 2).await?; // 2 attempts

3. State Machine Violations

Error: Cannot transition from Complete to InProgress

// Solution: Use proper orchestration framework, not direct DB manipulation
let result = orchestrator.execute_step(step, true, 1000).await?;
// Don't: sqlx::query!("UPDATE steps SET state = 'InProgress'").execute(pool).await?;

4. Template Loading Issues

Error: Template 'nonexistent.yaml' not found

// Solution: Ensure template exists in correct directory
// templates should be in tests/fixtures/task_templates/rust/

Debugging Techniques

1. Enable Detailed Tracing

RUST_LOG=debug cargo test test_name -- --nocapture

2. Inspect SQL Function Results Directly

#![allow(unused)]
fn main() {
let step_status = sqlx::query!(
    "SELECT * FROM get_step_readiness_status($1)",
    step_uuid
)
.fetch_one(&pool)
.await?;

tracing::debug!("Step status: {:?}", step_status);
}

3. Validate Test Prerequisites

#![allow(unused)]
fn main() {
// Ensure test setup is correct
assert_eq!(steps.len(), 4, "Test requires 4 steps");
assert_eq!(task.namespace, "expected_namespace");
}

4. Use Incremental Validation

#![allow(unused)]
fn main() {
// Validate after each step
orchestrator.execute_step(&step1, true, 1000).await?;
pool.assert_step_complete(step1.workflow_step_uuid).await?;

orchestrator.execute_step(&step2, false, 1000).await?;
pool.assert_step_retry_behavior(step2.workflow_step_uuid, 1, None, true).await?;
}

Migration from Old Tests

Before (Direct Database Manipulation)

#![allow(unused)]
fn main() {
// ❌ OLD: Bypassing orchestration framework
async fn test_task_finalization_old(pool: PgPool) -> Result<()> {
    // Direct database manipulation
    sqlx::query!("UPDATE tasks SET state = 'Error'").execute(&pool).await?;
    sqlx::query!("UPDATE steps SET state = 'Error'").execute(&pool).await?;

    // Test SQL functions in isolation
    let context = get_task_execution_context(&pool, task_uuid).await?;
    assert_eq!(context.execution_status, ExecutionStatus::Error);

    Ok(())
}
}

After (Integrated Framework)

#![allow(unused)]
fn main() {
// ✅ NEW: Using integrated framework
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_task_finalization_new(pool: PgPool) -> Result<()> {
    tracing::info!("🧪 Testing task finalization with integrated approach");

    // Use orchestration framework
    let orchestrator = TestOrchestrator::new(pool.clone());
    let task = orchestrator.create_simple_task("test", "finalization").await?;
    let step = get_first_step(&pool, task.task_uuid).await?;

    // Create error state through framework
    StepErrorSimulator::simulate_validation_error(
        &pool,
        &step,
        "finalization_test_error"
    ).await?;

    // Immediately validate SQL functions
    pool.assert_step_failed_permanently(step.workflow_step_uuid).await?;
    pool.assert_task_error(task.task_uuid, 1).await?;

    tracing::info!("✅ INTEGRATION: Finalization behavior verified");
    Ok(())
}
}

This comprehensive guide demonstrates the power and flexibility of the lifecycle testing framework, providing developers with the tools needed to validate complex workflow behavior while maintaining confidence in the system’s correctness.

Decision Point E2E Tests

This document describes the E2E tests for decision point functionality and how to run them.

Test Location

tests/e2e/ruby/conditional_approval_test.rs

Design Note: Deferred Step Type (Added 2025-10-27)

A critical design refinement was introduced to handle convergence patterns in decision point workflows:

The Convergence Problem

In conditional_approval, all three possible outcomes (auto_approve, manager_approval, finance_review) converge to the same finalize_approval step. However, we cannot create finalize_approval at task initialization because:

  1. We don’t know which approval steps will be created
  2. finalize_approval needs different dependencies depending on the decision point’s choice

Solution: type: deferred

A new step type was added to handle this pattern:

- name: finalize_approval
  type: deferred  # NEW STEP TYPE!
  dependencies: [auto_approve, manager_approval, finance_review]  # All possible deps

How it works:

  1. Deferred steps list ALL possible dependencies in the template
  2. At initialization, deferred steps are excluded (they’re descendants of decision points)
  3. When a decision point creates outcome steps, the system:
    • Detects downstream deferred steps
    • Computes: declared_deps ∩ actually_created_steps = actual DAG
    • Creates deferred steps with resolved dependencies

Example:

  • When routing_decision chooses auto_approve:
    • Creates: auto_approve
    • Detects: finalize_approval is deferred with deps [auto_approve, manager_approval, finance_review]
    • Intersection: [auto_approve, manager_approval, finance_review] ∩ [auto_approve] = [auto_approve]
    • Creates: finalize_approval depending on auto_approve only

This elegantly solves convergence without requiring handlers to explicitly list convergence steps or special orchestration logic.
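
A hedged sketch of that dependency-resolution rule as a plain set intersection; the function name is illustrative, not the engine's actual API.

use std::collections::HashSet;

/// Illustrative helper: a deferred step's resolved dependencies are the
/// declared dependencies intersected with the steps the decision actually created.
fn resolve_deferred_dependencies(declared: &[&str], created: &[&str]) -> Vec<String> {
    let created: HashSet<&str> = created.iter().copied().collect();
    declared
        .iter()
        .filter(|dep| created.contains(**dep))
        .map(|dep| dep.to_string())
        .collect()
}

#[test]
fn finalize_approval_depends_only_on_created_branch() {
    let resolved = resolve_deferred_dependencies(
        &["auto_approve", "manager_approval", "finance_review"],
        &["auto_approve"],
    );
    assert_eq!(resolved, vec!["auto_approve".to_string()]);
}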

Test Coverage

The test suite validates the conditional approval workflow, which demonstrates decision point functionality with dynamic step creation based on runtime conditions (approval amount thresholds).

Test Cases

  1. test_small_amount_auto_approval() - Tests amounts < $1,000

    • Expected path: validate_request → routing_decision → auto_approve → finalize_approval
    • Verifies only 4 steps created
    • Confirms manager_approval and finance_review are NOT created
  2. test_medium_amount_manager_approval() - Tests amounts $1,000-$4,999

    • Expected path: validate_request → routing_decision → manager_approval → finalize_approval
    • Verifies only 4 steps created
    • Confirms auto_approve and finance_review are NOT created
  3. test_large_amount_dual_approval() - Tests amounts >= $5,000

    • Expected path: validate_request → routing_decision → manager_approval + finance_review → finalize_approval
    • Verifies 5 steps created
    • Confirms both parallel approval steps complete
    • Verifies auto_approve is NOT created
  4. test_decision_point_step_dependency_structure() - Validates dependency resolution

    • Verifies dynamically created steps depend on routing_decision
    • Confirms finalize_approval waits for all approval steps
    • Tests proper execution order
  5. test_boundary_conditions() - Tests exactly at $1,000 threshold

    • Verifies manager approval is used (not auto)
  6. test_boundary_large_threshold() - Tests exactly at $5,000 threshold

    • Verifies dual approval path is triggered
  7. test_very_small_amount() - Tests $0.01 amount

    • Verifies auto-approval for very small amounts

Running the Tests

Prerequisites

The tests require the full integration environment to be running. Use the Docker Compose test strategy:

# From the tasker-core directory

# 1. Stop any existing containers and clean up
docker-compose -f docker/docker-compose.test.yml down -v

# 2. Rebuild containers with latest changes
docker-compose -f docker/docker-compose.test.yml up --build -d

# 3. Wait for services to be healthy (about 10-15 seconds)
sleep 15

# 4. Run the conditional approval E2E tests
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo test --test e2e_tests e2e::ruby::conditional_approval_test -- --nocapture

# 5. Clean up after testing (optional)
docker-compose -f docker/docker-compose.test.yml down

Running Specific Tests

# Run just the small amount test
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_small_amount_auto_approval -- --nocapture

# Run just the large amount test
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_large_amount_dual_approval -- --nocapture

# Run all boundary tests
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_boundary -- --nocapture

Environment Variables

The tests use the following environment variables (set automatically in docker-compose.test.yml):

  • DATABASE_URL: PostgreSQL connection string
  • TASKER_ENV: Set to “test” for test configuration
  • TASK_TEMPLATE_PATH: Points to test fixtures directory
  • RUST_LOG: Set to “info” or “debug” for detailed logging

Test Workflow Details

Conditional Approval Workflow

The workflow implements amount-based routing:

┌─────────────────┐
│ validate_request│
│   (initial)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ routing_decision│ ◄─── DECISION POINT (type: decision)
│  (decision)     │
└────────┬────────┘
         │
         ├─────────── < $1,000 ─────────┐
         │                              │
         │                              ▼
         │                     ┌────────────────┐
         │                     │  auto_approve  │
         │                     └────────┬───────┘
         │                              │
         ├─────── $1,000-$4,999 ────────┼────┐
         │                              │    │
         │                              │    ▼
         │                              │  ┌──────────────────┐
         │                              │  │ manager_approval │
         │                              │  └────────┬─────────┘
         │                              │           │
         └──────── >= $5,000 ───────────┼───────────┼────┐
                                        │           │    │
                                        │           │    ▼
                                        │           │  ┌───────────────┐
                                        │           │  │ finance_review│
                                        │           │  └───────┬───────┘
                                        │           │          │
                                        ▼           ▼          ▼
                                     ┌─────────────────────────┐
                                     │   finalize_approval     │
                                     │    (convergence)        │
                                     └─────────────────────────┘

Decision Point Mechanism

  1. routing_decision step executes with type: decision marker
  2. Handler returns DecisionPointOutcome::CreateSteps with step names
  3. Orchestration creates those steps dynamically and adds dependencies
  4. Dynamically created steps execute like normal steps
  5. Convergence step (finalize_approval) waits for all paths

Task Template Location

The test uses the task template at:

tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml

Ruby Handler Implementation

The Ruby handlers are located at:

workers/ruby/spec/handlers/examples/conditional_approval/
├── handlers/
│   └── conditional_approval_handler.rb
└── step_handlers/
    ├── validate_request_handler.rb
    ├── routing_decision_handler.rb      ◄─── DECISION POINT HANDLER
    ├── auto_approve_handler.rb
    ├── manager_approval_handler.rb
    ├── finance_review_handler.rb
    └── finalize_approval_handler.rb

Key Implementation Detail

The routing_decision_handler.rb returns a decision point outcome:

outcome = if steps_to_create.empty?
            TaskerCore::Types::DecisionPointOutcome.no_branches
          else
            TaskerCore::Types::DecisionPointOutcome.create_steps(steps_to_create)
          end

TaskerCore::Types::StepHandlerCallResult.success(
  result: {
    # IMPORTANT: The decision point outcome MUST be in this key
    decision_point_outcome: outcome.to_h,
    route_type: route[:type],
    # ... other result fields
  }
)

Troubleshooting

Tests Fail with “Template Not Found”

Ensure the Ruby worker container has the correct template path:

docker-compose -f docker/docker-compose.test.yml logs ruby-worker
# Should show: TASK_TEMPLATE_PATH=/app/tests/fixtures/task_templates/ruby

Tests Timeout

Increase wait time in docker-compose startup:

sleep 30  # Instead of sleep 15

Database Connection Errors

Verify PostgreSQL is running and healthy:

docker-compose -f docker/docker-compose.test.yml ps
docker-compose -f docker/docker-compose.test.yml logs postgres

Step Creation Doesn’t Happen

Check orchestration logs for decision point processing:

docker-compose -f docker/docker-compose.test.yml logs orchestration | grep -i decision

Success Criteria

All tests should pass with output similar to:

test e2e::ruby::conditional_approval_test::test_small_amount_auto_approval ... ok
test e2e::ruby::conditional_approval_test::test_medium_amount_manager_approval ... ok
test e2e::ruby::conditional_approval_test::test_large_amount_dual_approval ... ok
test e2e::ruby::conditional_approval_test::test_decision_point_step_dependency_structure ... ok
test e2e::ruby::conditional_approval_test::test_boundary_conditions ... ok
test e2e::ruby::conditional_approval_test::test_boundary_large_threshold ... ok
test e2e::ruby::conditional_approval_test::test_very_small_amount ... ok

test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Next Steps

After validating Ruby workers:

  • Phase 8a: Implement Rust worker support for decision points
  • Phase 9a: Create E2E tests for Rust worker decision points

Focused Architectural and Security Audit Report

Audit Date: 2026-02-05 Auditor: Claude (Opus 4.6 / Sonnet 4.5 sub-agents) Status: Complete


Executive Summary

This audit evaluates all Tasker Core crates for alpha readiness across security, error handling, resilience, and architecture dimensions. Findings are categorized by severity (Critical/High/Medium/Low/Info) per the methodology defined in the audit specification.

Alpha Readiness Verdict

ALPHA READY with targeted fixes. No Critical vulnerabilities found. The High-severity items (dependency CVE, input validation gaps, shutdown timeouts) are straightforward fixes that can be completed in a single sprint.

Consolidated Finding Counts (All Crates)

Severity | Count | Status
Critical | 0     | None found
High     | 9     | Must fix before alpha
Medium   | 22    | Document as known limitations
Low      | 13    | Track for post-alpha

High-Severity Findings (Must Fix Before Alpha)

ID  | Finding                                     | Crate                | Fix Effort    | Remediation
S-1 | Queue name validation missing               | tasker-shared        | Small         | Queue name validation
S-2 | SQL error details exposed to clients        | tasker-shared        | Medium        | Error message sanitization
S-3 | #[allow] instead of #[expect] (systemic)    | All                  | Small (batch) | Lint compliance cleanup
P-1 | NOTIFY channel name unvalidated             | tasker-pgmq          | Small         | Queue name validation
O-1 | No actor panic recovery                     | tasker-orchestration | Medium        | Shutdown and recovery hardening
O-2 | Graceful shutdown lacks timeout             | tasker-orchestration | Small         | Shutdown and recovery hardening
W-1 | checkpoint_yield blocks FFI without timeout | tasker-worker        | Small         | FFI checkpoint timeout
X-1 | bytes v1.11.0 CVE (RUSTSEC-2026-0007)       | Workspace            | Trivial       | Dependency upgrade
P-2 | CLI migration SQL generation unescaped      | tasker-pgmq          | Small         | Queue name validation

Crate 1: tasker-shared

Overall Rating: A- (Strong foundation with targeted improvements needed)

The tasker-shared crate is the largest and most foundational crate in the workspace. It provides core types, error handling, messaging abstraction, security services, circuit breakers, configuration management, database utilities, and shared models. The crate demonstrates strong security practices overall.

Strengths

  • Zero unsafe code across the entire crate
  • Excellent cryptographic hygiene: Constant-time API key comparison via subtle::ConstantTimeEq (src/types/api_key_auth.rs:53-62), JWKS hardening with SSRF prevention (blocks private IPs, cloud metadata endpoints, requires HTTPS), algorithm allowlist enforcement (no alg: none)
  • Comprehensive input validation: JSONB validation with size/depth/key count limits (src/validation.rs), namespace validation with PostgreSQL identifier rules, XSS sanitization
  • 100% SQLx macro usage: All database queries use compile-time verified sqlx::query! macros, zero string interpolation in SQL
  • Lock-free circuit breakers: Atomic state management (AtomicU8 for state, AtomicU64 for metrics), proper memory ordering, correct state machine transitions
  • All MPSC channels bounded and config-driven: Full bounded-channel compliance
  • Exemplary config security: Environment variable allowlist with regex validation, TOML injection prevention via escape_toml_string(), fail-fast on validation errors
  • No hardcoded secrets: All sensitive values come from env vars or file paths
  • Well-organized API surface: Feature-gated modules (web-api, grpc-api), selective re-exports

Finding S-1 (HIGH): Queue Name Validation Missing

Location: tasker-shared/src/messaging/service/router.rs:96-97

Queue names are constructed via format! with unvalidated namespace input:

#![allow(unused)]
fn main() {
fn step_queue(&self, namespace: &str) -> String {
    format!("{}_{}_queue", self.worker_queue_prefix, namespace)
}
}

The MessagingError::InvalidQueueName variant exists (src/messaging/errors.rs:56) but is never raised. Neither the router nor the provider implementations (pgmq.rs:134-139, rabbitmq.rs:276-375) validate queue names before passing them to native queue APIs.

Risk: PGMQ creates PostgreSQL tables named after queues — special characters in namespace could cause SQL issues at the DDL level. RabbitMQ queue creation could fail with unexpected characters.

Recommendation: Add validate_queue_name() that enforces alphanumeric + underscore/hyphen, 1-255 chars. Call it in DefaultMessageRouter methods and/or ensure_queue().
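
A minimal sketch of such a validator, using a plain String error for brevity (wiring it into the existing MessagingError::InvalidQueueName variant is left out):

/// Illustrative queue-name check: alphanumeric plus underscore/hyphen, 1-255 chars.
fn validate_queue_name(name: &str) -> Result<(), String> {
    if name.is_empty() || name.len() > 255 {
        return Err(format!("invalid queue name length: {}", name.len()));
    }
    if !name
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
    {
        return Err(format!("queue name contains disallowed characters: {name}"));
    }
    Ok(())
}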

Finding S-2 (HIGH): SQL Error Details Exposed to Clients

Location: tasker-shared/src/errors.rs:71-74, 431-437

#![allow(unused)]
fn main() {
impl From<sqlx::Error> for TaskerError {
    fn from(err: sqlx::Error) -> Self {
        TaskerError::DatabaseError(err.to_string())
    }
}
}

sqlx::Error::to_string() can expose SQL query details, table/column names, constraint names, and potentially connection string information. These error messages may propagate to API responses.

Recommendation: Create a sanitized error mapper that logs full details internally but returns generic messages to API clients (e.g., “Database operation failed” with an internal error ID for correlation).
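
A hedged sketch of the recommended split between internal logging and the client-facing message; the correlation-ID scheme is an assumption, and TaskerError::DatabaseError is the existing variant from the snippet above.

use uuid::Uuid;

/// Illustrative mapper: log full error details internally, return only a generic
/// message plus a correlation ID to clients.
fn sanitize_database_error(err: sqlx::Error) -> TaskerError {
    let error_id = Uuid::new_v4();
    tracing::error!(error_id = %error_id, error = %err, "database operation failed");
    TaskerError::DatabaseError(format!("Database operation failed (error_id: {error_id})"))
}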

Finding S-3 (HIGH): #[allow] Used Instead of #[expect] (Lint Policy Violation)

Locations:

  • src/messaging/execution_types.rs:383 — #[allow(clippy::too_many_arguments)]
  • src/web/authorize.rs:194 — #[allow(dead_code)]
  • src/utils/serde.rs:46-47 — #[allow(dead_code)]

Project lint policy mandates #[expect(lint_name, reason = "...")] instead of #[allow]. This is a policy compliance issue.

Recommendation: Convert all #[allow] to #[expect] with documented reasons.
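
For example, the first location above could be rewritten along these lines (the function name and reason string are placeholders):

// Replacing #[allow(clippy::too_many_arguments)] at execution_types.rs:383 with an
// expectation that documents intent and warns if the lint ever stops firing.
#[expect(
    clippy::too_many_arguments,
    reason = "mirrors the wire-format field list; refactor tracked separately"
)]
fn build_step_execution_message(/* ... */) { /* ... */ }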

Finding S-4 (MEDIUM): unwrap_or_default() Violations of Tenet #11 (Fail Loudly)

Locations (20+ instances across crate):

  • src/messaging/execution_types.rs:120,186,213 — Step execution status defaults to empty string
  • src/database/sql_functions.rs:377,558 — Query results default to empty vectors
  • src/registry/task_handler_registry.rs:214,268,656,700,942 — Config schema fields default silently
  • src/proto/conversions.rs:32 — Invalid timestamps silently default to UNIX epoch

Risk: Required fields silently defaulting to empty values can mask real errors and produce incorrect behavior that’s hard to debug.

Recommendation: Audit all unwrap_or_default() usages. Replace with explicit error returns for required fields. Keep unwrap_or_default() only for truly optional fields with documented rationale.

Finding S-5 (MEDIUM): Error Context Loss in .map_err(|_| ...)

14 instances where original error context is discarded:

  • src/messaging/service/providers/rabbitmq.rs:544 — Discards parse error
  • src/messaging/service/providers/in_memory.rs:305,331,368 — 3 instances
  • src/state_machine/task_state_machine.rs:114 — Discards parse error
  • src/state_machine/actions.rs:256,372,434,842 — 4 instances discarding publisher errors
  • src/config/config_loader.rs:220,417 — 2 instances discarding env var errors
  • src/database/sql_functions.rs:1032 — Discards decode error
  • src/types/auth.rs:283 — Discards parse error

Recommendation: Include original error via .map_err(|e| SomeError::new(context, e.to_string())).
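
A small before/after sketch of the recommended pattern; the type and variant names are illustrative stand-ins for the call sites listed above.

// Before: the original parse error is discarded
let state = raw
    .parse::<TaskState>()
    .map_err(|_| StateMachineError::InvalidState(raw.to_string()))?;

// After: the original error travels with the context
let state = raw
    .parse::<TaskState>()
    .map_err(|e| StateMachineError::InvalidState(format!("{raw}: {e}")))?;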

Finding S-6 (MEDIUM): Production expect() Calls

  • src/macros.rs:65 — Panics if Tokio task spawning fails
  • src/cache/provider.rs:399,429,459,489,522 — Multiple expect("checked in should_use") calls

Risk: Panics in production code. While guarded by preconditions, they bypass error propagation.

Recommendation: Replace with Result propagation or add detailed safety comments explaining invariant guarantees.

Finding S-7 (MEDIUM): Database Pool Config Lacks Validation

Database pool configuration (PoolConfig) does not have a validate() method. Unlike circuit breaker config which validates ranges (failure_threshold > 0, timeout <= 300s), pool config relies on sqlx to reject invalid values at runtime.

Recommendation: Add validation: max_connections > 0, min_connections <= max_connections, acquire_timeout_seconds > 0.
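
A hedged sketch of the suggested check, with an illustrative field set and a plain String error:

struct PoolConfig {
    max_connections: u32,
    min_connections: u32,
    acquire_timeout_seconds: u64,
}

impl PoolConfig {
    /// Illustrative validation mirroring the circuit breaker config's range checks.
    fn validate(&self) -> Result<(), String> {
        if self.max_connections == 0 {
            return Err("max_connections must be greater than 0".into());
        }
        if self.min_connections > self.max_connections {
            return Err("min_connections must not exceed max_connections".into());
        }
        if self.acquire_timeout_seconds == 0 {
            return Err("acquire_timeout_seconds must be greater than 0".into());
        }
        Ok(())
    }
}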

Finding S-8 (MEDIUM): Individual Query Timeouts Missing

While database pools have acquire_timeout configured (src/database/pools.rs:169-170), individual sqlx::query! calls lack explicit timeout wrappers. Long-running queries rely solely on pool-level timeouts.

Recommendation: Consider PostgreSQL statement_timeout at the connection level, or add tokio::time::timeout() wrappers around critical query paths.
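
A minimal sketch of the timeout-wrapper option; the 5-second budget, query, and error handling are illustrative (the statement_timeout alternative is configured on the connection instead):

use std::time::Duration;
use tokio::time::timeout;

// Illustrative: give a critical query an explicit budget instead of relying
// only on pool-level acquire timeouts.
match timeout(Duration::from_secs(5), sqlx::query!("SELECT 1 AS one").fetch_one(&pool)).await {
    Ok(Ok(_row)) => { /* proceed with the result */ }
    Ok(Err(db_err)) => return Err(db_err.into()),
    Err(_elapsed) => return Err("query exceeded its 5s statement budget".into()),
}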

Finding S-9 (LOW): Message Size Limits Not Enforced

Messaging deserialization uses serde_json::from_slice() without explicit size limits. While PGMQ has implicit limits from PostgreSQL column sizes, a very large message could cause memory issues during deserialization.

Recommendation: Add configurable message size limits at the provider level.

Finding S-10 (LOW): File Path Exposure in Config Errors

src/services/security_service.rs:184-187 — Configuration errors include filesystem paths. Only occurs during startup (not exposed to API clients in normal operation).

Finding S-11 (LOW): Timestamp Conversion Silently Defaults to Epoch

src/proto/conversions.rs:32 — DateTime::from_timestamp().unwrap_or_default() silently converts invalid timestamps to 1970-01-01 instead of returning an error.

Finding S-12 (LOW): cargo-machete Ignore List Has 19 Entries

Cargo.toml:12-39 — Most are legitimately feature-gated or used via macros, but the list should be periodically audited to prevent dependency bloat.

Finding S-13 (LOW): Global Wildcard Permission Rejection Undocumented

src/types/permissions.rs — The permission_matches() function correctly rejects global wildcard (*) permissions but this behavior isn’t documented in user-facing comments.


Crate 2: tasker-pgmq

Overall Rating: B+ (Good with one high-priority fix needed)

The tasker-pgmq crate is a PGMQ wrapper providing PostgreSQL LISTEN/NOTIFY support for event-driven message processing. ~3,345 source lines across 9 files. No dependencies on tasker-shared (clean separation).

Strengths

  • No unsafe code across the entire crate
  • Payload uses parameterized queries: Message payloads bound via $1 parameter in NOTIFY
  • Payload size validation: Enforces pg_notify 8KB limit
  • Comprehensive thiserror error types with context preservation
  • Bounded channels: All MPSC channels bounded
  • Good test coverage: 6 integration test files covering major flows
  • Clean separation from tasker-shared: No duplication, standalone library

Finding P-1 (HIGH): SQL Injection via NOTIFY Channel Name

Location: tasker-pgmq/src/emitter.rs:122

#![allow(unused)]
fn main() {
let sql = format!("NOTIFY {}, $1", channel);
sqlx::query(&sql).bind(payload).execute(&self.pool)
}

PostgreSQL’s NOTIFY does not support parameterized channel identifiers. The channel name is interpolated directly via format!. Channel names flow from config.build_channel_name() which concatenates channels_prefix (from TOML config) with base channel names and namespace strings.

Risk: While the NOTIFY command has limited injection surface (it’s not a general SQL execution vector), malformed channel names could cause PostgreSQL errors, unexpected channel routing, or denial of service. The channels_prefix comes from config (lower risk), but namespace strings flow from queue operations.

Recommendation: Add channel name validation — allow only [a-zA-Z0-9_.]+, max 63 chars (PostgreSQL identifier limit). Apply in build_channel_name() and/or notify_channel().
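
A hedged sketch of the channel-name check; regex is already used by this crate's config module (see Finding P-7 below), and the error type here is a placeholder.

use regex::Regex;

/// Illustrative: reject NOTIFY channel names before they are interpolated into
/// the NOTIFY statement ([a-zA-Z0-9_.]+, max 63 chars per PostgreSQL identifier limits).
fn validate_channel_name(channel: &str) -> Result<(), String> {
    // Compiled per call for brevity; a real implementation would cache the regex.
    let pattern = Regex::new(r"^[A-Za-z0-9_.]{1,63}$").expect("static pattern compiles");
    if pattern.is_match(channel) {
        Ok(())
    } else {
        Err(format!("invalid NOTIFY channel name: {channel}"))
    }
}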

Finding P-2 (HIGH): CLI Migration SQL Generation with Unescaped Input

Location: tasker-pgmq/src/bin/cli.rs:179-353

User-provided regex patterns and channel prefixes are directly interpolated into SQL migration strings when generating migration files. While these are generated files that should be reviewed before application, the lack of escaping creates a risk if the generation process is automated.

Recommendation: Validate inputs against strict patterns before interpolation. Add a warning comment in generated files that they should be reviewed.

Finding P-3 (MEDIUM): unwrap_or_default() on Database Results (Tenet #11)

Location: tasker-pgmq/src/client.rs:164

.read_batch(queue_name, visibility_timeout, l).await?.unwrap_or_default()

When read_batch returns None, this silently produces an empty vector instead of failing loudly. Could mask permission errors, connection failures, or other serious issues.

Recommendation: Return explicit error on unexpected None.
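
The shape of that change, sketched generically (the error type is illustrative):

use anyhow::{anyhow, Result};

// Illustrative: treat an unexpected `None` from a batch read as an error
// rather than silently substituting an empty vector.
fn require_batch<T>(batch: Option<Vec<T>>, queue_name: &str) -> Result<Vec<T>> {
    batch.ok_or_else(|| anyhow!("read_batch returned None for queue {queue_name}"))
}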

Finding P-4 (MEDIUM): RwLock Poison Handling Masks Panics

Location: tasker-pgmq/src/listener.rs (22 instances)

self.stats.write().unwrap_or_else(|p| p.into_inner())

Silently recovers from poisoned RwLock without logging. Could propagate corrupted state from a panicked thread.

Recommendation: Log warning on poison recovery, or switch to parking_lot::RwLock (doesn’t poison).
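
A sketch of the first option, assuming the tracing crate is available (field names are illustrative):

use std::sync::RwLock;

// Illustrative: still recover from the poisoned lock, but leave evidence that
// a writer panicked while holding it instead of recovering silently.
fn record_stat(stats: &RwLock<u64>, delta: u64) {
    let mut guard = stats.write().unwrap_or_else(|poisoned| {
        tracing::warn!("listener stats RwLock was poisoned; recovering");
        poisoned.into_inner()
    });
    *guard += delta;
}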

Finding P-5 (MEDIUM): Hardcoded Pool Size

Location: tasker-pgmq/src/client.rs:41-44

let pool = sqlx::postgres::PgPoolOptions::new()
    .max_connections(20)  // Hard-coded
    .connect(database_url).await?;

Pool size should be configurable for different deployment scenarios.

Finding P-6 (MEDIUM): Missing Async Operation Timeouts

Database operations in client.rs, emitter.rs, and listener.rs lack explicit tokio::time::timeout() wrappers. Relies solely on pool-level acquire timeouts.

Finding P-7 (LOW): Error Context Loss in Regex Compilation

Location: tasker-pgmq/src/config.rs:169

Regex::new(&self.queue_naming_pattern)
    .map_err(|_| PgmqNotifyError::invalid_pattern(&self.queue_naming_pattern))

Original regex error details discarded.

Finding P-8 (LOW): #[allow] Instead of #[expect] (Lint Policy)

Location: tasker-pgmq/src/emitter.rs:299-320 — 3 instances of #[allow(dead_code)] on EmitterFactory.


Crate 3: tasker-orchestration

Overall Rating: A- (Strong security with targeted resilience improvements needed)

The tasker-orchestration crate handles core orchestration logic: actors, state machines, REST + gRPC APIs, and auth middleware. This is the largest service crate and the primary attack surface.

Strengths

  • Zero unsafe code across the entire crate
  • Excellent auth architecture: Constant-time API key comparison, JWT algorithm allowlist, JWKS SSRF prevention, auth before body parsing
  • gRPC/REST auth parity verified: All 6 gRPC task methods enforce identical permissions to REST counterparts
  • No auth bypass found: All API v1 routes wrapped in authorize(), health/metrics public by design
  • Database-level atomic claiming: FOR UPDATE SKIP LOCKED prevents concurrent state corruption
  • State transitions enforce ownership: No API endpoint allows direct state manipulation
  • Sanitized error responses: No stack traces, database errors genericized, consistent JSON format
  • Backpressure checked before resource operations: 503 with Retry-After header
  • Full bounded-channel compliance: All MPSC channels bounded and config-driven (0 unbounded channels)
  • HTTP request timeout: TimeoutLayer with configurable 30s default

Finding O-1 (HIGH): No Actor Panic Recovery

Location: tasker-orchestration/src/actors/command_processor_actor.rs:139

Actors spawn via spawn_named! but have no supervisor/restart logic. If OrchestrationCommandProcessorActor panics, the entire orchestration processing stops. Recovery requires full process restart.

Recommendation: Implement panic-catching wrapper with logged restart, or document that process-level supervision (systemd, k8s) handles this.
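
A minimal supervision sketch for the first option; the run loop, delay, and logging are placeholders rather than the crate's actual API:

use std::time::Duration;

// Illustrative supervisor: rerun a panicked actor loop after a short delay,
// and stop only when the loop exits cleanly or is cancelled.
async fn supervise<F, Fut>(run_actor: F)
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = ()> + Send + 'static,
{
    loop {
        match tokio::spawn(run_actor()).await {
            Ok(()) => break, // clean shutdown
            Err(err) if err.is_panic() => {
                tracing::error!("actor panicked; restarting in 1s");
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
            Err(_) => break, // task was cancelled
        }
    }
}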

Finding O-2 (HIGH): Graceful Shutdown Lacks Timeout

Locations:

  • tasker-orchestration/src/orchestration/bootstrap.rs:177-213
  • tasker-orchestration/src/bin/server.rs:68-82

Shutdown calls coordinator.lock().await.stop().await and orchestration_handle.stop().await with no timeout. If the event coordinator or actors hang during shutdown, the server never completes graceful shutdown.

Recommendation: Add 30-second timeout with force-kill fallback.
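
A sketch of bounding the shutdown sequence; the future passed in stands for the existing stop() calls:

use std::time::Duration;

// Illustrative: give graceful shutdown a 30-second budget, then log and fall
// through so the process can exit instead of hanging on a stuck component.
async fn shutdown_with_deadline(stop: impl std::future::Future<Output = ()>) {
    match tokio::time::timeout(Duration::from_secs(30), stop).await {
        Ok(()) => tracing::info!("graceful shutdown completed"),
        Err(_) => tracing::warn!("graceful shutdown timed out after 30s; forcing exit"),
    }
}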

Finding O-3 (HIGH): #[allow] Instead of #[expect] (Lint Policy)

21 instances of #[allow] found across the crate (most without a reason clause):

  • src/actors/traits.rs:67,81
  • src/web/extractors.rs:6
  • src/health/channel_status.rs:87
  • src/grpc/conversions.rs:42
  • And 16 more locations

Finding O-4 (MEDIUM): Request Validation Not Enforced at Handler Layer

Location: src/web/handlers/tasks.rs:47

TaskRequest has #[derive(Validate)] with constraints (name length 1-255, namespace length 1-255, priority range -100 to 100) but handlers accept Json<TaskRequest> without calling .validate(). Validation happens later at the service layer.

Impact: Oversized payloads are deserialized before rejection. Not a security vulnerability per se, but the defense-in-depth pattern would catch malformed input earlier.

Recommendation: Add .validate() at handler entry or use Valid<Json<TaskRequest>> extractor.
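
A hedged sketch of the first option, assuming the validator crate backs the #[derive(Validate)] mentioned above and an axum-style handler (the handler name and field set are abbreviated):

use axum::{http::StatusCode, Json};
use serde::Deserialize;
use validator::Validate;

// Illustrative request type mirroring the constraints described above.
#[derive(Debug, Deserialize, Validate)]
struct TaskRequest {
    #[validate(length(min = 1, max = 255))]
    name: String,
    #[validate(length(min = 1, max = 255))]
    namespace: String,
    #[validate(range(min = -100, max = 100))]
    priority: i32,
}

// Validate at the handler boundary, before any service-layer work happens.
async fn create_task(Json(req): Json<TaskRequest>) -> Result<StatusCode, StatusCode> {
    req.validate().map_err(|_| StatusCode::UNPROCESSABLE_ENTITY)?;
    // ... hand off to the task service ...
    Ok(StatusCode::ACCEPTED)
}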

Finding O-5 (MEDIUM): Actor Shutdown May Lose In-Flight Work

Location: tasker-orchestration/src/actors/registry.rs:216-259

Shutdown uses Arc::get_mut() which only works if no other references exist. If get_mut fails, stopped() is silently skipped. In-flight work may be lost.

Finding O-6 (MEDIUM): Database Query Timeouts Missing

Same pattern as tasker-shared (Finding S-8). Individual sqlx::query! calls lack explicit timeout wrappers:

  • src/services/health/service.rs:284 — health check query
  • src/orchestration/backoff_calculator.rs:232,245,290,345,368 — multiple queries

Pool-level acquire timeout (30s) provides partial mitigation.

Finding O-7 (MEDIUM): unwrap_or_default() on Config Fields

  • src/orchestration/event_systems/unified_event_coordinator.rs:89 — event system config
  • src/orchestration/bootstrap.rs:581 — namespace config
  • src/grpc/services/config.rs:96-97 — jwt_issuer and jwt_audience default to empty strings

Finding O-8 (MEDIUM): Error Context Loss

~12 instances of .map_err(|_| ...) discarding error context:

  • src/orchestration/bootstrap.rs:203 — oneshot send error
  • src/web/handlers/health.rs:53 — timeout error
  • src/web/handlers/tasks.rs:113 — UUID parse error

Finding O-9 (MEDIUM): Hardcoded Magic Numbers

  • src/services/task_service.rs:257-259 — per_page > 100 validation
  • src/orchestration/event_systems/orchestration_event_system.rs:142 — 24h max message age
  • src/services/analytics_query_service.rs:229 — 30.0s slow step threshold

Finding O-10 (LOW): gRPC Internal Error May Leak Details

Location: src/grpc/conversions.rs:152-153

tonic::Status::internal(error.to_string()) — depending on error Display implementations, could expose implementation details in gRPC error messages.

Finding O-11 (LOW): CORS Allows Any Origin

Location: src/web/mod.rs

CorsLayer::new()
    .allow_origin(tower_http::cors::Any)
    .allow_methods(tower_http::cors::Any)
    .allow_headers(tower_http::cors::Any)

Acceptable for alpha/API service, but should be configurable for production deployments.
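
A sketch of making the origin configurable while keeping today's permissive behavior as the fallback (the config plumbing is illustrative):

use axum::http::HeaderValue;
use tower_http::cors::{Any, CorsLayer};

// Illustrative: restrict CORS to a configured origin when one is provided,
// and only fall back to the current permissive behavior otherwise.
fn cors_layer(allowed_origin: Option<&str>) -> CorsLayer {
    match allowed_origin.and_then(|o| o.parse::<HeaderValue>().ok()) {
        Some(origin) => CorsLayer::new()
            .allow_origin(origin)
            .allow_methods(Any)
            .allow_headers(Any),
        None => CorsLayer::new()
            .allow_origin(Any)
            .allow_methods(Any)
            .allow_headers(Any),
    }
}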

Crate 4: tasker-worker

Overall Rating: A- (Strong FFI safety with one notable gap)

The tasker-worker crate handles handler dispatch, FFI integration, and completion processing. Despite complex FFI requirements, it achieves this with zero unsafe blocks in the crate itself.

Strengths

  • Zero unsafe code despite handling Ruby/Python FFI integration
  • All SQL queries via sqlx macros — no string interpolation
  • Handler panic containment: catch_unwind() + AssertUnwindSafe wraps all handler calls
  • Error classification preserved: Permanent/Retryable distinction maintained across FFI boundary
  • Fire-and-forget callbacks: Spawned into runtime, 5s timeout, no deadlock risk
  • FFI completion circuit breaker: Latency-based, 100ms threshold, lock-free metrics
  • All MPSC channels bounded — full bounded-channel compliance
  • No production unwrap()/expect() in core paths

Finding W-1 (HIGH): checkpoint_yield Blocks FFI Thread Without Timeout

Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs:904

let result = self.config.runtime_handle.block_on(async {
    self.handle_checkpoint_yield_async(/* ... */).await
});

Uses block_on which blocks the Ruby/Python thread while persisting checkpoint data to the database. No timeout wrapper. If the database is slow, this blocks the FFI thread indefinitely, potentially exhausting the thread pool.

Recommendation: Add tokio::time::timeout() around the block_on body (configurable, suggest 10s default).
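
A sketch of the recommendation; the function boundary, persist future, and error types are placeholders for the real checkpoint path:

use std::time::Duration;

// Illustrative: bound the blocking checkpoint persistence so a slow database
// cannot pin the Ruby/Python FFI thread indefinitely.
fn checkpoint_yield_blocking(
    runtime: &tokio::runtime::Handle,
    persist: impl std::future::Future<Output = Result<(), sqlx::Error>>,
) -> Result<(), String> {
    runtime.block_on(async {
        tokio::time::timeout(Duration::from_secs(10), persist)
            .await
            .map_err(|_| "checkpoint persistence exceeded 10s".to_string())?
            .map_err(|e| e.to_string())
    })
}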

Finding W-2 (MEDIUM): Starvation Detection is Warning-Only

Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs:772-793

check_starvation_warnings() logs warnings but doesn’t enforce any action. Also requires manual invocation by the caller — no automatic monitoring loop.

Finding W-3 (MEDIUM): FFI Thread Safety Documentation Gap

The FfiDispatchChannel uses Arc<Mutex<mpsc::Receiver>> (thread-safe) but lacks documentation about thread-safety guarantees, poll() contention behavior, and block_on safety in FFI context.

Finding W-4 (MEDIUM): #[allow] vs #[expect] (Lint Policy)

5 instances in web/middleware/mod.rs and web/middleware/request_id.rs.

Finding W-5 (MEDIUM): Missing Database Query Timeouts

Same systemic pattern as other crates. Checkpoint service and step claim queries lack explicit timeout wrappers.

Finding W-6 (LOW): unwrap_or_default() in worker/core.rs

Several instances; they appear to cover optional fields (likely legitimate) but warrant review.


Crates 5-6: tasker-client & tasker-cli

Overall Rating: A (Excellent — cleanest crates in the workspace)

These client crates demonstrate the strongest compliance across all audit dimensions. Notably, lint policy compliant (using #[expect] already). No Critical or High findings.

Strengths

  • No unsafe code in either crate
  • No hardcoded credentials — all auth from env vars or config files
  • RSA key generation validates minimum 2048-bit keys
  • Proper error context preservation in all From conversions
  • Complete transport abstraction: REST and gRPC both implement 11/11 methods
  • HTTP/gRPC timeouts configured: 30s request, 10s connect
  • Exponential backoff retry for create_task with configurable max retries
  • Lint policy compliant — uses #[expect] with reasons
  • User-facing CLI errors informative without leaking internals

Finding C-1 (MEDIUM): TLS Certificate Validation Not Explicitly Enforced

Location: tasker-client/src/api_clients/orchestration_client.rs:220

HTTP client uses reqwest::Client::builder() without explicitly setting .danger_accept_invalid_certs(false). Default is secure, but explicit enforcement prevents accidental changes.
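
A sketch of stating the secure defaults explicitly (the builder options shown are standard reqwest settings; the exact set used by the client may differ):

use std::time::Duration;

// Illustrative: state the secure TLS default explicitly so a future edit
// cannot silently relax certificate validation.
fn build_http_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .danger_accept_invalid_certs(false) // explicit, even though it is the default
        .timeout(Duration::from_secs(30))
        .connect_timeout(Duration::from_secs(10))
        .build()
}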

Finding C-2 (MEDIUM): Default URLs Use HTTP

Location: tasker-client/src/config.rs:276

Default base_url is http://localhost:8080. Credentials transmitted over HTTP are vulnerable to interception. Appropriate for local dev, but should warn when HTTP is used with authentication enabled.

Finding C-3 (MEDIUM): Retry Logic Only on create_task

Other operations (get_task, list_tasks, etc.) do not retry on transient failures. Should either extend retry logic or document the limitation.

Finding C-4 (LOW): Production expect() in Config Initialization

tasker-client/src/api_clients/orchestration_client.rs:123 — panics if config is malformed. Acceptable during startup but could return Result instead.


Crates 7-10: Language Workers (Rust, Ruby, Python, TypeScript)

Overall Rating: A- (Strong FFI engineering, no critical gaps)

All 4 language workers share common architecture via FfiDispatchChannel for poll-based event dispatch. Audited ~22,000 lines of Rust FFI code plus language wrappers.

Strengths

  • TypeScript: Comprehensive panic handling — catch_unwind on all critical FFI functions, errors converted to JSON error responses
  • Ruby/Python: Managed FFI via Magnus and PyO3 — these frameworks handle panic unwinding automatically via their exception systems
  • Error classification preserved across all FFI boundaries: Permanent/Retryable distinction maintained
  • Fire-and-forget callbacks: No deadlock risk identified
  • Starvation detection functional in all workers
  • Proper Arc usage for thread-safe shared ownership across FFI
  • TypeScript C FFI: Correct string memory management with into_raw()/from_raw() pattern and free_rust_string() for caller cleanup
  • Checkpoint support uniformly implemented across all 4 workers
  • Consistent error hierarchy across all languages

Finding LW-1 (MEDIUM): TypeScript FFI Missing Safety Documentation

Location: workers/typescript/src-rust/lib.rs:38

#![allow(clippy::missing_safety_doc)] — suppresses docs for 9 unsafe extern "C" functions. Should use #[expect] per lint policy and add # Safety sections.
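
A hedged sketch of what one documented function could look like; the real signature in the TypeScript worker's src-rust crate may differ:

/// Frees a string previously handed to the TypeScript side via into_raw().
///
/// # Safety
///
/// `ptr` must have been produced by CString::into_raw in this library and must
/// not have been freed already; passing any other pointer is undefined behavior.
#[no_mangle]
pub unsafe extern "C" fn free_rust_string(ptr: *mut std::os::raw::c_char) {
    if !ptr.is_null() {
        // Retake ownership so the CString is dropped and its buffer released.
        unsafe { drop(std::ffi::CString::from_raw(ptr)) };
    }
}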

Finding LW-2 (MEDIUM): Rust Worker #[allow(dead_code)] (Lint Policy)

Location: workers/rust/src/event_subscribers/logging_subscriber.rs:60,98,132

3 instances of #[allow(dead_code)] instead of #[expect].

Finding LW-3 (LOW): Ruby Bootstrap Uses expect() on Ruby Runtime

Location: workers/ruby/ext/tasker_core/src/bridge.rs:19-20, bootstrap.rs:29-30

Ruby::get().expect("Ruby runtime should be available") — safe in practice (guaranteed by Magnus FFI contract) but could use ? for defensive programming.

Finding LW-4 (LOW): Timeout Cleanup Requires Manual Polling

cleanup_timeouts() exists in all FFI workers but documentation doesn’t specify recommended polling frequency. Workers must call this periodically.

Finding LW-5 (LOW): Ruby Tokio Thread Pool Hardcoded to 8

Location: workers/ruby/ext/tasker_core/src/bootstrap.rs:74-79

Hardcoded .worker_threads(8) for M2/M4 Pro compatibility. Python/TypeScript use defaults. Consider making configurable.


Cross-Cutting Concerns

Dependency Audit (cargo audit)

Finding X-1 (HIGH): bytes v1.11.0 Integer Overflow (RUSTSEC-2026-0007)

Published 2026-02-03. Integer overflow in BytesMut::reserve. Fix: upgrade to bytes >= 1.11.1. This is a transitive dependency used by tokio, hyper, axum, tonic, reqwest, sqlx — deeply embedded.

Recommendation: Add to workspace Cargo.toml: bytes = "1.11.1"

Finding X-2 (LOW): rustls-pemfile Unmaintained (RUSTSEC-2025-0134)

Transitive from lapin (RabbitMQ) → amq-protocol → tcp-stream → rustls-pemfile. No action available from this project; depends on upstream lapin update.

Clippy Compliance

Zero warnings across entire workspace with --all-targets --all-features. Excellent.

Systemic: #[allow] vs #[expect] (Lint Policy)

Instances of #[allow] remain across most of the workspace (per-crate counts below are taken from the individual findings). Distribution:

  • tasker-shared: ~5 instances
  • tasker-pgmq: 3 instances
  • tasker-orchestration: 21 instances (highest)
  • tasker-worker: 5 instances
  • tasker-client/cli: 0 (compliant)
  • Language workers: ~3 instances

Recommendation: Batch fix in a single PR — mechanical replacement of #[allow] → #[expect] with reason strings.
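
The mechanical shape of the replacement; the lint and reason string are illustrative:

// Before: the suppression stays even if the lint stops firing.
#[allow(dead_code)]
struct LegacyFactory;

// After: the compiler warns if the expectation is ever unfulfilled.
#[expect(dead_code, reason = "retained for the public factory API surface")]
struct MigratedFactory;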

Systemic: Database Query Timeouts

Found across tasker-shared, tasker-orchestration, tasker-worker, and tasker-pgmq. Individual sqlx::query! calls lack explicit tokio::time::timeout() wrappers. Pool-level acquire timeouts (30s) provide partial mitigation.

Recommendation: Consider PostgreSQL statement_timeout at the connection level as a blanket fix, or add tokio::time::timeout() around critical query paths.

Systemic: unwrap_or_default() on Required Fields (Tenet #11)

Found across tasker-shared (20+ instances), tasker-orchestration (3 instances), tasker-pgmq (1 instance). Silent failures on required fields violate the Fail Loudly principle.

Recommendation: Audit all instances and replace with explicit error handling for required fields.


Appendix: Methodology

Each crate was evaluated across these dimensions:

  1. Security — Input validation, SQL safety, auth checks, unsafe blocks, crypto, secrets
  2. Error Handling — Fail Loudly (Tenet #11), context preservation, structured errors
  3. Resilience — Bounded channels, timeouts, circuit breakers, backpressure
  4. Architecture — API surface, documentation consistency, test coverage, dead code
  5. FFI-Specific (language workers) — Error classification, deadlock risk, starvation detection, memory safety

Severity definitions follow the audit specification.


Appendix: Remediation Tracking

Remediation work items for all High-severity findings:

| Work Item | Findings | Priority | Summary |
|---|---|---|---|
| Dependency upgrade | X-1 | Urgent | Upgrade bytes to fix RUSTSEC-2026-0007 CVE |
| Queue name validation | S-1, P-1, P-2 | High | Add queue name and NOTIFY channel validation |
| Lint compliance cleanup | S-3, O-3, W-4, LW-1, LW-2, P-8 | Medium | Replace #[allow] with #[expect] workspace-wide |
| Shutdown and recovery hardening | O-1, O-2 | High | Add shutdown timeout and actor panic recovery |
| FFI checkpoint timeout | W-1 | High | Add timeout to checkpoint_yield block_on |
| Error message sanitization | S-2 | High | Sanitize database error messages in API responses |

Architecture Decision Records (ADRs)

This directory contains Architecture Decision Records that document significant design decisions in Tasker Core. Each ADR captures the context, decision, and consequences of a specific architectural choice.

ADR Index

Active Decisions

| ADR | Title | Date | Status |
|---|---|---|---|
| ADR-001 | Actor-Based Orchestration Architecture | 2025-10 | Accepted |
| ADR-002 | Bounded MPSC Channels | 2025-10 | Accepted |
| ADR-003 | Processor UUID Ownership Removal | 2025-10 | Accepted |
| ADR-004 | Backoff Strategy Consolidation | 2025-10 | Accepted |
| ADR-005 | Worker Dual-Channel Event System | 2025-12 | Accepted |
| ADR-006 | Worker Actor-Service Decomposition | 2025-12 | Accepted |
| ADR-007 | FFI Over WASM for Language Workers | 2025-12 | Accepted |
| ADR-008 | Handler Composition Pattern | 2025-12 | Accepted |

Root Cause Analyses

| Document | Title | Date |
|---|---|---|
| RCA | Parallel Execution Timing Bugs | 2025-12 |

ADR Template

When creating a new ADR, use this template:

# ADR: [Title]

**Status**: [Proposed | Accepted | Deprecated | Superseded]
**Date**: YYYY-MM-DD
**Ticket**: TAS-XXX

## Context

What is the issue that we're seeing that is motivating this decision or change?

## Decision

What is the change that we're proposing and/or doing?

## Consequences

What becomes easier or more difficult to do because of this change?

### Positive

- Benefit 1
- Benefit 2

### Negative

- Trade-off 1
- Trade-off 2

### Neutral

- Side effect that is neither positive nor negative

## Alternatives Considered

What other options were considered and why were they rejected?

### Alternative 1: [Name]

Description and why it was rejected.

### Alternative 2: [Name]

Description and why it was rejected.

## References

- Related documents
- External references

When to Create an ADR

Create an ADR when:

  1. Making a significant architectural change that affects multiple components
  2. Choosing between alternatives with meaningful trade-offs
  3. Establishing a pattern that should be followed consistently
  4. Removing or deprecating an existing pattern or approach
  5. Learning from an incident (RCA format)

Don’t create an ADR for:

  • Minor implementation details
  • Bug fixes without architectural impact
  • Documentation updates
  • Routine refactoring

ADR: Actor-Based Orchestration Architecture

Status: Accepted Date: 2025-10 Ticket: TAS-46

Context

The orchestration system used a command pattern with direct service delegation, but lacked formal boundaries between commands and lifecycle components. This created:

  1. Testing Complexity: Lifecycle components tightly coupled to command processor
  2. Unclear Boundaries: No formal interface between commands and lifecycle operations
  3. Limited Supervision: No standardized lifecycle hooks for resource management
  4. Inconsistent Patterns: Each component had different initialization patterns
  5. Coupling: Command processor had direct dependencies on multiple service instances

The command processor was 1,164 lines, mixing routing, hydration, validation, and delegation.

Decision

Adopt a lightweight actor pattern with message-based interfaces:

Core Abstractions:

  1. OrchestrationActor trait with lifecycle hooks (started(), stopped())
  2. Message trait for type-safe messages with associated Response type
  3. Handler<M> trait for async message processing
  4. ActorRegistry for centralized actor management

Four Orchestration Actors:

  1. TaskRequestActor: Task initialization and request processing
  2. ResultProcessorActor: Step result processing
  3. StepEnqueuerActor: Step enqueueing coordination
  4. TaskFinalizerActor: Task finalization with atomic claiming

Implementation Approach:

  • Greenfield migration (no dual support)
  • Actors wrap existing services, not replace them
  • Arc-wrapped actors for efficient cloning across threads
  • No full actor framework (keeping it lightweight)

Consequences

Positive

  • 92% reduction in command processor complexity (1,575 LOC → 123 LOC main file)
  • Clear boundaries: Each actor handles specific message types
  • Testability: Message-based interfaces enable isolated testing
  • Consistent patterns: Established migration pattern for all actors
  • Lifecycle management: Standardized started()/stopped() hooks
  • Thread safety: Arc-wrapped actors with Send+Sync guarantees

Negative

  • Additional abstraction: One more layer between commands and services
  • Learning curve: New pattern to understand
  • Message overhead: ~100-500ns per actor call (acceptable for our use case)
  • Not a full framework: Lacks supervision trees, mailboxes, etc.

Neutral

  • Services remain unchanged; actors are thin wrappers
  • Performance impact minimal (<1μs per operation)

Alternatives Considered

Alternative 1: Full Actor Framework (Actix)

Would provide supervision, mailboxes, and advanced patterns.

Rejected: Too heavyweight for our needs. We need lifecycle hooks and message-based testing, not a full distributed actor system.

Alternative 2: Keep Direct Service Delegation

Continue with command processor calling services directly.

Rejected: Doesn’t address testing complexity, unclear boundaries, or lifecycle management needs.

Alternative 3: Trait-Based Service Abstraction

Define Service trait and implement on each lifecycle component.

Partially adopted: Combined with actor pattern. Services implement business logic; actors provide message-based coordination.

References

ADR: Bounded MPSC Channel Migration

Status: Implemented Date: 2025-10-14 Decision Makers: Engineering Team Ticket: TAS-51

Context and Problem Statement

Prior to this change, the tasker-core system had inconsistent and risky MPSC channel usage:

  1. Unbounded Channels (3 critical sites): Risk of unbounded memory growth under load

    • PGMQ notification listener: Could exhaust memory during notification bursts
    • Event publisher: Vulnerable to event storms
    • Ruby FFI handler: No backpressure across FFI boundary
  2. Configuration Disconnect (6 sites): TOML configuration existed but wasn’t used

    • Hard-coded values (100, 1000) with no rationale
    • Test/dev/prod environments used identical capacities
    • No ability to tune without code changes
  3. No Backpressure Strategy: Missing overflow handling policies

    • No monitoring of channel saturation
    • No documented behavior when channels fill
    • No metrics for operational visibility

Production Impact

  • Memory Risk: OOM possible under high database notification load (10k+ tasks enqueued)
  • Operational Pain: Cannot tune channel sizes without code deployment
  • Environment Mismatch: Test environment uses production-scale buffers, masking issues
  • Technical Debt: Wasted configuration infrastructure

Decision

Migrate to 100% bounded, configuration-driven MPSC channels with explicit backpressure handling.

Key Principles

  1. All Channels Bounded: Zero unbounded_channel() calls in production code
  2. Configuration-Driven: All capacities from TOML with environment overrides
  3. Separation of Concerns: Infrastructure (sizing) separate from business logic (retry behavior)
  4. Explicit Backpressure: Document and implement overflow policies
  5. Full Observability: Metrics for channel saturation and overflows

Solution Architecture

Configuration Structure

Created unified MPSC channel configuration in config/tasker/base/mpsc_channels.toml:

[mpsc_channels]

# Orchestration subsystem
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 1000

[mpsc_channels.orchestration.event_listeners]
pgmq_event_buffer_size = 10000  # Large for notification bursts

# Task readiness subsystem
[mpsc_channels.task_readiness.event_channel]
buffer_size = 1000
send_timeout_ms = 1000

# Worker subsystem
[mpsc_channels.worker.command_processor]
command_buffer_size = 1000

[mpsc_channels.worker.in_process_events]
broadcast_buffer_size = 1000  # Rust → Ruby FFI

# Shared/cross-cutting
[mpsc_channels.shared.event_publisher]
event_queue_buffer_size = 5000

[mpsc_channels.shared.ffi]
ruby_event_buffer_size = 1000

# Overflow policy
[mpsc_channels.overflow_policy]
log_warning_threshold = 0.8  # Warn at 80% full
drop_policy = "block"

Environment-Specific Overrides

Production (config/tasker/environments/production/mpsc_channels.toml):

  • Orchestration command: 5000 (5x base)
  • PGMQ listeners: 50000 (5x base) - handles bulk task creation bursts
  • Event publisher: 10000 (2x base)

Development (config/tasker/environments/development/mpsc_channels.toml):

  • Task readiness: 500 (0.5x base)
  • Worker FFI: 500 (0.5x base)

Test (config/tasker/environments/test/mpsc_channels.toml):

  • Orchestration command: 100 (0.1x base) - exposes backpressure issues
  • Task readiness: 100 (0.1x base)

Critical Implementation Detail

Environment override files MUST use full [mpsc_channels.*] prefix:

# ✅ CORRECT
[mpsc_channels.task_readiness.event_channel]
buffer_size = 100

# ❌ WRONG - creates top-level key that overrides correct config
[task_readiness.event_channel]
buffer_size = 100

This was discovered during implementation when environment files created conflicting top-level configuration keys.

Configuration Migration

Migrated MPSC sizing fields from event_systems.toml to mpsc_channels.toml:

Moved to mpsc_channels.toml:

  • event_systems.task_readiness.metadata.event_channel.buffer_size
  • event_systems.task_readiness.metadata.event_channel.send_timeout_ms
  • event_systems.worker.metadata.in_process_events.broadcast_buffer_size

Kept in event_systems.toml (event processing logic):

  • event_systems.task_readiness.metadata.event_channel.max_retries
  • event_systems.task_readiness.metadata.event_channel.backoff

Rationale: Separation of concerns - infrastructure sizing vs business logic behavior.

Rust Type System

Created comprehensive type system in tasker-shared/src/config/mpsc_channels.rs:

pub struct MpscChannelsConfig {
    pub orchestration: OrchestrationChannelsConfig,
    pub task_readiness: TaskReadinessChannelsConfig,
    pub worker: WorkerChannelsConfig,
    pub shared: SharedChannelsConfig,
    pub overflow_policy: OverflowPolicyConfig,
}

All channel creation sites updated to use configuration:

// Before
let (tx, rx) = mpsc::unbounded_channel();

// After
let buffer_size = config.mpsc_channels
    .orchestration.event_listeners.pgmq_event_buffer_size;
let (tx, rx) = mpsc::channel(buffer_size);

Observability

ChannelMonitor Integration:

  • Tracks channel usage in real-time
  • Logs warnings at 80% saturation
  • Exposes metrics via OpenTelemetry

Metrics Available:

  • mpsc_channel_usage_percent - Current channel fill percentage
  • mpsc_channel_capacity - Configured capacity
  • Component and channel name labels for filtering

Consequences

Positive

  1. Memory Safety: Bounded channels prevent OOM from unbounded growth
  2. Operational Flexibility: Tune channel sizes via configuration without code changes
  3. Environment Appropriateness: Test uses small buffers (exposes issues), production uses large buffers (handles load)
  4. Observability: Channel saturation visible in metrics and logs
  5. Documentation: Clear guidelines for future channel additions

Negative

  1. Backpressure Complexity: Must handle full channel conditions
  2. Configuration Overhead: More configuration files to maintain
  3. Tuning Required: May need adjustment based on production load patterns

Neutral

  1. No Performance Impact: Bounded channels with appropriate sizes perform identically to unbounded
  2. Backward Compatible: Existing deployments automatically use new defaults

Implementation Notes

Backpressure Strategies by Component

PGMQ Notification Listener:

  • Strategy: Block sender (apply backpressure)
  • Rationale: Cannot drop database notifications
  • Buffer: Large (10000 base, 50000 production) to handle bursts

Event Publisher:

  • Strategy: Drop events with metrics when full
  • Rationale: Internal events are non-critical
  • Buffer: Medium (5000 base, 10000 production)

Ruby FFI Handler:

  • Strategy: Return error to Rust (signal backpressure)
  • Rationale: Ruby must handle gracefully
  • Buffer: Standard (1000) with monitoring

Sizing Guidelines

Command Channels (orchestration, worker):

  • Base: 1000
  • Test: 100 (expose issues)
  • Production: 2000-5000 (concurrent load)

Event Channels:

  • Base: 1000
  • Production: Higher if event-driven architecture

Notification Channels:

  • Base: 10000 (burst handling)
  • Production: 50000 (bulk operations)

Validation

Testing Performed

  1. Unit Tests: Configuration loading and validation ✅
  2. Integration Tests: All tests pass with bounded channels ✅
  3. Local Verification: Service starts successfully in test environment ✅
  4. Configuration Verification: All environments load correctly ✅

Success Criteria Met

  • ✅ Zero unbounded channels in production code
  • ✅ 100% configurable channel capacities
  • ✅ Environment-specific overrides working
  • ✅ Backpressure handling implemented
  • ✅ Observability through ChannelMonitor
  • ✅ All tests passing

Future Considerations

  1. Dynamic Sizing: Consider runtime buffer adjustment based on load (not current scope)
  2. Priority Queues: Allow critical events to bypass overflow drops (evaluate based on metrics)
  3. Notification Coalescing: Reduce PGMQ notification volume during bursts (future optimization)
  4. Advanced Metrics: Percentile latencies for channel send operations

References

  • Configuration Files: config/tasker/base/mpsc_channels.toml
  • Rust Module: tasker-shared/src/config/mpsc_channels.rs
  • Related ADRs: Command Pattern, Actor Pattern

Lessons Learned

  1. Configuration Structure Matters: Environment override files must use proper prefixes
  2. Separation of Concerns: Keep infrastructure config (sizing) separate from business logic (retry behavior)
  3. Test Appropriately: Small buffers in test environment expose backpressure issues early
  4. Migration Strategy: Moving config fields requires coordinated struct updates across all files
  5. Type Safety: Rust’s type system caught many configuration mismatches during development

Decision: Approved and Implemented Review Date: 2025-10-14 Next Review: 2026-Q1 (evaluate sizing based on production metrics)

ADR: Processor UUID Ownership Removal

Status: Accepted Date: 2025-10 Ticket: TAS-54

Context

When orchestrators crash with tasks in active processing states (Initializing, EnqueuingSteps, EvaluatingResults), the processor UUID ownership enforcement prevented new orchestrators from taking over. Tasks became permanently stuck until manual intervention.

Root Cause: Three states required ownership enforcement (the original state machine pattern), but when orchestrator A crashed and orchestrator B tried to recover, the ownership check failed: B != A.

Production Impact:

  • Stuck tasks requiring manual intervention
  • Orchestrator restarts caused task processing to halt
  • 15-second gap between crash and retry, but tasks permanently blocked

Decision

Move to audit-only processor UUID tracking:

  1. Keep processor UUID in all transitions (audit trail for debugging)
  2. Remove ownership enforcement from state transitions
  3. Rely on existing state machine guards for idempotency
  4. Add configuration flag for gradual rollout

Key Insight: The original problem (race conditions) had been solved by multiple other mechanisms:

  • Atomic finalization claiming via SQL functions
  • Command pattern with stateless async processors
  • Actor pattern with 4 production-ready actors

Idempotency Without Ownership

| Actor | Idempotency Mechanism | Race Condition Protection |
|---|---|---|
| TaskRequestActor | identity_hash unique constraint | Transaction atomicity |
| ResultProcessorActor | Current state guards | State machine atomicity |
| StepEnqueuerActor | SQL function atomicity | PGMQ transactional operations |
| TaskFinalizerActor | Atomic claiming | SQL compare-and-swap |

Consequences

Positive

  • Task recovery: Tasks automatically recover after orchestrator crashes
  • Zero manual interventions: Stuck task count approaches zero
  • Audit trail preserved: Full debugging capability retained
  • Instant rollback: Configuration flag allows quick revert

Negative

  • New debugging patterns: Processor ownership changes visible in audit trail
  • Team training: Operators need to understand audit-only interpretation

Neutral

  • No database schema changes required
  • No performance impact (one fewer query per transition)

Alternatives Considered

Alternative 1: Timeout-Based Ownership Transfer

Add timeout after which ownership can be claimed by another processor.

Rejected: Adds complexity; existing idempotency guards make ownership redundant entirely.

Alternative 2: Keep Ownership Enforcement

Continue with existing ownership enforcement behavior, add manual recovery tools.

Rejected: Doesn’t address root cause; manual intervention doesn’t scale.

References

ADR: Backoff Logic Consolidation

Status: Implemented Date: 2025-10-29 Deciders: Engineering Team Ticket: TAS-57

Context

The tasker-core distributed workflow orchestration system had multiple, potentially conflicting implementations of exponential backoff logic for step retry coordination. This created several critical issues:

Problems Identified

  1. Configuration Conflicts: Three different maximum backoff values existed across the system:

    • SQL Migration (hardcoded): 30 seconds
    • Rust Code Default: 60 seconds
    • TOML Configuration: 300 seconds
  2. Race Conditions: No atomic guarantees on backoff updates when multiple orchestrators processed the same step failure simultaneously, leading to potential lost updates and inconsistent state.

  3. Implementation Divergence: Dual calculation paths (Rust BackoffCalculator vs SQL fallback) could produce different results due to:

    • Different time sources (last_attempted_at vs failure_time)
    • Hardcoded vs configurable parameters
    • Lack of timestamp synchronization
  4. Hardcoded SQL Values: The SQL migration contained non-configurable exponential backoff logic:

    -- Old hardcoded implementation
    power(2, COALESCE(attempts, 1)) * interval '1 second', interval '30 seconds'
    

Decision

We consolidated the backoff logic with the following architectural decisions:

1. Single Source of Truth: TOML Configuration

Decision: All backoff parameters originate from TOML configuration files.

Rationale:

  • Centralized configuration management
  • Environment-specific overrides (test/development/production)
  • Runtime validation and type safety
  • Clear documentation of system behavior

Implementation:

# config/tasker/base/orchestration.toml
[backoff]
default_backoff_seconds = [1, 2, 4, 8, 16, 32]
max_backoff_seconds = 60  # Standard across all environments
backoff_multiplier = 2.0
jitter_enabled = true
jitter_max_percentage = 0.1

2. Standard Maximum Backoff: 60 Seconds

Decision: Standardize on 60 seconds as the maximum backoff delay.

Rationale:

  • Balance: 60 seconds balances retry speed with system load reduction
  • Not Too Short: 30 seconds (old SQL max) insufficient for rate limiting scenarios
  • Not Too Long: 300 seconds (old TOML config) creates excessive delays in failure scenarios
  • Alignment: Matches Rust code defaults and production requirements

Impact:

  • Tasks recover faster from transient failures
  • Rate-limited APIs get adequate cooldown
  • User experience improved with reasonable retry times

3. Parameterized SQL Functions

Decision: SQL functions accept configuration parameters with sensible defaults.

Implementation:

CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
    backoff_request_seconds INTEGER,
    last_attempted_at TIMESTAMP,
    failure_time TIMESTAMP,
    attempts INTEGER,
    p_max_backoff_seconds INTEGER DEFAULT 60,
    p_backoff_multiplier NUMERIC DEFAULT 2.0
) RETURNS TIMESTAMP

Rationale:

  • Eliminates hardcoded values in SQL
  • Allows runtime configuration without schema changes
  • Maintains SQL fallback safety net
  • Defaults prevent breaking existing code

4. Atomic Backoff Updates with Row-Level Locking

Decision: Use PostgreSQL SELECT FOR UPDATE for atomic backoff updates.

Implementation:

// Rust BackoffCalculator (sketch; the concrete error type may differ)
async fn update_backoff_atomic(&self, step_uuid: &Uuid, delay_seconds: u32) -> Result<(), sqlx::Error> {
    let mut tx = self.pool.begin().await?;

    // Acquire row-level lock
    sqlx::query!("SELECT ... FROM tasker_workflow_steps WHERE ... FOR UPDATE")
        .fetch_one(&mut *tx).await?;

    // Update with lock held
    sqlx::query!("UPDATE tasker_workflow_steps SET ...")
        .execute(&mut *tx).await?;

    tx.commit().await?;
    Ok(())
}

Rationale:

  • Correctness: Prevents lost updates from concurrent orchestrators
  • Simplicity: PostgreSQL’s row-level locking is well-understood and reliable
  • Performance: Minimal overhead - locks only held during UPDATE operation
  • Idempotency: Multiple retries produce consistent results

Alternative Considered: Optimistic concurrency with version field

  • Rejected: More complex implementation, retry logic in application layer
  • Benefit of Chosen Approach: Database guarantees atomicity

5. Timing Consistency: Update last_attempted_at with backoff_request_seconds

Decision: Always update both backoff_request_seconds and last_attempted_at atomically.

Rationale:

  • SQL fallback calculation: last_attempted_at + backoff_request_seconds
  • Prevents timing window where calculation uses stale timestamp
  • Single transaction ensures consistency

Before:

// Old: Only updated backoff_request_seconds
sqlx::query!("UPDATE tasker_workflow_steps SET backoff_request_seconds = $1 ...")

After:

// New: Updates both atomically
sqlx::query!(
    "UPDATE tasker_workflow_steps
     SET backoff_request_seconds = $1,
         last_attempted_at = NOW()
     WHERE ..."
)

6. Dual-Path Strategy: Rust Primary, SQL Fallback

Decision: Maintain both Rust calculation and SQL fallback, but ensure they use same configuration.

Rationale:

  • Rust Primary: Fast, configurable, with jitter support
  • SQL Fallback: Safety net if backoff_request_seconds is NULL
  • Consistency: Both paths now use same max delay and multiplier

Path Selection Logic:

CASE
    -- Primary: Rust-calculated backoff
    WHEN backoff_request_seconds IS NOT NULL AND last_attempted_at IS NOT NULL THEN
        last_attempted_at + (backoff_request_seconds * interval '1 second')

    -- Fallback: SQL exponential with configurable params
    WHEN failure_time IS NOT NULL THEN
        failure_time + LEAST(
            power(p_backoff_multiplier, attempts) * interval '1 second',
            p_max_backoff_seconds * interval '1 second'
        )

    ELSE NULL
END

Consequences

Positive

  1. Configuration Clarity: Single max_backoff_seconds value (60s) across entire system
  2. Race Condition Prevention: Atomic updates guarantee correctness in distributed deployments
  3. Flexibility: Parameterized SQL allows future config changes without migrations
  4. Timing Consistency: Synchronized timestamp updates eliminate calculation errors
  5. Maintainability: Clear separation of concerns - Rust for calculation, SQL for fallback
  6. Test Coverage: All 518 unit tests pass, validating correctness

Negative

  1. Performance Overhead: Row-level locking adds ~1-2ms per backoff update

    • Mitigation: Negligible compared to step execution time (typically seconds)
    • Acceptable Trade-off: Correctness more important than microseconds
  2. Lock Contention Risk: High-frequency failures on same step could cause lock queuing

    • Mitigation: Exponential backoff naturally spreads out retries
    • Monitoring: Added metrics for lock contention detection
    • Real-World Impact: Minimal - failures are infrequent by design
  3. Complexity: Transaction management adds code complexity

    • Mitigation: Encapsulated in update_backoff_atomic() method
    • Benefit: Hidden behind clean interface, testable in isolation

Neutral

  1. Breaking Change: SQL function signature changed (added parameters)

    • Not an Issue: Greenfield alpha project, no production dependencies
    • Future-Proof: Default parameters maintain backward compatibility
  2. Configuration Migration: Changed max from 300s → 60s

    • Impact: Tasks retry faster, reducing user-perceived latency
    • Validation: All tests pass with new values

Validation

Testing

  1. Unit Tests: All 518 unit tests pass

    • BackoffCalculator calculation correctness
    • Jitter bounds validation
    • Max cap enforcement
  2. Database Tests: SQL function behavior validated

    • Parameterization with various max values
    • Exponential calculation matches Rust
    • Boundary conditions (attempts 0, 10, 20)
  3. Integration Tests: End-to-end flow verified

    • Worker failure → Backoff applied → Readiness respects delay
    • SQL fallback when backoff_request_seconds NULL
    • Rust and SQL calculations produce consistent results

Verification Steps Completed

  • ✅ Configuration alignment (TOML, Rust defaults)
  • ✅ SQL function rewrite with parameters
  • ✅ BackoffCalculator atomic updates implemented
  • ✅ Database reset successful with new migration
  • ✅ All unit tests passing
  • ✅ Architecture documentation updated

Implementation Notes

Files Modified

  1. Configuration:

    • config/tasker/base/orchestration.toml: max_backoff_seconds = 60
    • tasker-shared/src/config/tasker.rs: jitter_max_percentage = 0.1
  2. Database Migration:

    • migrations/20250927000000_add_waiting_for_retry_state.sql: Parameterized functions
  3. Rust Implementation:

    • tasker-orchestration/src/orchestration/backoff_calculator.rs: Atomic updates
  4. Documentation:

    • docs/task-and-step-readiness-and-execution.md: Backoff section added
    • This ADR

Migration Path

Since this is greenfield alpha:

  1. Drop and recreate test database
  2. Run migrations with updated SQL functions
  3. Rebuild sqlx cache
  4. Run full test suite

Future Production Path (when needed):

  1. Deploy parameterized SQL functions alongside old functions
  2. Update Rust code to use new atomic methods
  3. Enable in staging, monitor metrics
  4. Gradual production rollout with feature flag
  5. Remove old functions after validation period

Future Enhancements

Potential Improvements (Post-Alpha)

  1. Configuration Table: Store backoff config in database for runtime updates
  2. Metrics: OpenTelemetry metrics for backoff application and lock contention
  3. Adaptive Backoff: Adjust multiplier based on system load or error patterns
  4. Per-Namespace Policies: Different backoff configs per workflow namespace
  5. Backoff Profiles: Named profiles (aggressive, moderate, conservative)

Monitoring Recommendations

Key Metrics to Track:

  • backoff_calculation_duration_seconds: Time to calculate and apply backoff
  • backoff_lock_contention_total: Lock acquisition failures
  • backoff_sql_fallback_total: Frequency of SQL fallback usage
  • backoff_delay_applied_seconds: Histogram of actual delays

Alert Conditions:

  • SQL fallback usage > 5% (indicates Rust path failing)
  • Lock contention > threshold (indicates hot spots)
  • Backoff delays > max_backoff_seconds (configuration issue)

References


Decision Status: ✅ Implemented and Validated (2025-10-29)

ADR: Worker Dual-Channel Event System

Status: Accepted Date: 2025-12 Ticket: TAS-67

Context

The original Rust worker used a blocking .call() pattern in the event handler:

let result = handler.call(&event.payload.task_sequence_step).await;  // BLOCKS

This created effectively sequential execution even for independent steps, preventing true concurrency and causing domain event race conditions where downstream systems saw events before orchestration processed results.

Decision

Adopt a dual-channel command pattern where handler invocation is fire-and-forget, and completions flow back through a separate channel.

Architecture:

[1] WorkerEventSystem receives StepExecutionEvent
        ↓
[2] ActorCommandProcessor routes to StepExecutorActor
        ↓
[3] StepExecutorActor claims step, publishes to HANDLER DISPATCH CHANNEL
        ↓ (fire-and-forget, non-blocking)
[4] HandlerDispatchService receives from channel
        ↓
[5] Resolves handler from registry, invokes handler.call()
        ↓
[6] Handler completes, publishes to COMPLETION CHANNEL
        ↓
[7] CompletionProcessorService receives from channel
        ↓
[8] Routes to FFICompletionService → Orchestration queue

Key Design Decisions:

  1. Bounded Parallel Execution: Semaphore-bounded concurrency (configurable via TOML)
  2. Ordered Domain Events: Events fire AFTER result is committed to completion channel
  3. Comprehensive Error Handling: Panics, timeouts, handler errors all generate proper failure results
  4. Fire-and-Forget FFI Callbacks: runtime_handle.spawn() instead of block_on() prevents deadlocks (see the sketch below)
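
A minimal sketch of that fire-and-forget shape; the callback future and timeout handling are placeholders rather than the crate's actual API:

use std::time::Duration;

// Illustrative: run the FFI completion callback on the Tokio runtime instead
// of blocking the calling thread, and bound it with the 5s callback timeout.
fn dispatch_completion(
    runtime: &tokio::runtime::Handle,
    callback: impl std::future::Future<Output = ()> + Send + 'static,
) {
    runtime.spawn(async move {
        if tokio::time::timeout(Duration::from_secs(5), callback).await.is_err() {
            tracing::warn!("FFI completion callback exceeded 5s timeout");
        }
    });
}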

Consequences

Positive

  • True parallelism: Parallel handler execution with bounded concurrency
  • Eliminated race conditions: Domain events only fire after results committed
  • Comprehensive error handling: All failure modes produce proper step failures
  • Foundation for FFI: Reusable abstractions for Ruby/Python/TypeScript workers
  • Bug discovery: Parallel execution surfaced latent SQL precedence bug

Negative

  • Increased complexity: Two channels to manage instead of one
  • Debugging complexity: Tracing flow across multiple channels requires structured logging

Neutral

  • Channel saturation monitoring available via metrics
  • Configurable buffer sizes per environment

Risk Mitigations Implemented

| Risk | Mitigation |
|---|---|
| Semaphore acquisition failure | Generate failure result instead of silent exit |
| FFI polling starvation | Metrics + starvation warnings + timeout |
| Completion channel backpressure | Release permit before send |
| FFI thread runtime context | Fire-and-forget callbacks |

Alternatives Considered

Alternative 1: Thread Pool Pattern

Use dedicated thread pool for handler execution.

Rejected: Tokio already provides excellent async runtime; adding threads increases complexity without benefit.

Alternative 2: Single Channel with Priority Queue

Priority queue for completions within single channel.

Rejected: Doesn’t address the fundamental blocking issue; still couples dispatch and completion.

Alternative 3: Keep Blocking Pattern with Larger Buffer

Increase buffer size to mask sequential execution.

Rejected: Doesn’t solve concurrency; just delays the problem.

References

ADR: Worker Actor-Service Decomposition

Status: Accepted Date: 2025-12 Ticket: TAS-69

Context

The tasker-worker crate had a monolithic command processor architecture:

  • WorkerProcessor: 1,575 lines of code
  • All command handling inline
  • Difficult to test individual behaviors
  • Inconsistent with orchestration actor architecture

Decision

Transform the worker from monolithic command processor to actor-based design, mirroring the orchestration actor pattern.

Before: Monolithic Design

WorkerCore
    └── WorkerProcessor (1,575 LOC)
            └── All command handling inline

After: Actor-Based Design

WorkerCore
    └── ActorCommandProcessor (~350 LOC)
            └── WorkerActorRegistry
                    ├── StepExecutorActor → StepExecutorService
                    ├── FFICompletionActor → FFICompletionService
                    ├── TemplateCacheActor → TaskTemplateManager
                    ├── DomainEventActor → DomainEventSystem
                    └── WorkerStatusActor → WorkerStatusService

Five Actors:

| Actor | Responsibility | Messages |
|---|---|---|
| StepExecutorActor | Step execution coordination | 4 |
| FFICompletionActor | FFI completion handling | 2 |
| TemplateCacheActor | Template cache management | 2 |
| DomainEventActor | Event dispatching | 1 |
| WorkerStatusActor | Status and health | 4 |

Three Services:

| Service | Lines | Purpose |
|---|---|---|
| StepExecutorService | ~400 | Step claiming, verification, FFI invocation |
| FFICompletionService | ~200 | Result delivery to orchestration |
| WorkerStatusService | ~200 | Stats tracking, health reporting |

Consequences

Positive

  • 92% reduction in command processor complexity (1,575 LOC → 123 LOC main file)
  • Single responsibility: Each file handles one concern
  • Testability: Services testable in isolation, actors via message handlers
  • Consistency: Mirrors orchestration architecture
  • Extensibility: New actors/services follow established pattern

Negative

  • Two-phase initialization: Registry requires careful startup ordering
  • Actor shutdown ordering: Must coordinate graceful shutdown
  • Learning curve: New pattern to understand for contributors

Neutral

  • Public API unchanged (WorkerCore::new(), send_command(), stop())
  • Internal restructuring transparent to users

Gaps Identified and Fixed

| Gap | Issue | Fix |
|---|---|---|
| Domain Event Dispatch | Events not dispatched after step completion | Explicit dispatch call in actor |
| Silent Error Handling | Orchestration send errors swallowed | Explicit error propagation |
| Namespace Sharing | Registry created new manager, losing namespaces | Shared pre-initialized manager |

Alternatives Considered

Alternative 1: Service-Only Pattern

Extract services without actor layer.

Rejected: Loses message-based interfaces that enable testing and future distributed execution.

Alternative 2: Keep Monolithic with Better Organization

Refactor WorkerProcessor into methods without extraction.

Rejected: Doesn’t address testability or architectural consistency goals.

Alternative 3: Full Actor Framework (Actix)

Use production actor framework.

Rejected: Too heavyweight; we need lifecycle hooks and message-based testing, not distributed supervision.

References

ADR: FFI Over WASM for Language Workers

Status: Accepted Date: 2025-12 Ticket: TAS-100

Context

For the TypeScript worker implementation, we needed to decide between two integration approaches:

  1. FFI (Foreign Function Interface): Direct C ABI calls to compiled Rust library
  2. WASM (WebAssembly): Compile Rust to wasm32-wasi target

Ruby (Magnus) and Python (PyO3) workers already used FFI successfully.

Decision

Proceed with FFI for all language workers. Reserve WASM for future serverless handler execution.

Decision Matrix:

| Criteria | FFI | WASM |
|---|---|---|
| Pattern Consistency | Matches Ruby/Python | Requires new architecture |
| Production Readiness | Node FFI mature, Bun stabilizing | WASI networking immature |
| Implementation Speed | 2-3 weeks | 2-3 months + unknowns |
| PostgreSQL Access | Native via Rust | Needs host functions |
| Multi-threading | Full Tokio support | Single-threaded WASM |
| Async Runtime | Tokio works | Incompatible |
| Debugging | Standard tools | Limited tooling |

Score: FFI 8/10, WASM 3/10 for current requirements.

WASM Deal-Breakers:

  1. No mature PostgreSQL client for wasm32-wasi
  2. Single-threaded execution (our HandlerDispatchService relies on Tokio multi-threading)
  3. Tokio doesn’t compile to wasm32-wasi target
  4. WASI networking still experimental (Preview 2 adoption low)

Consequences

Positive

  • Pattern consistency: Single Rust codebase serves all four workers
  • Proven approach: Ruby/Python FFI already validated
  • Full feature access: PostgreSQL, PGMQ, Tokio, domain events all work
  • Standard debugging: lldb, gdb, structured logging across boundary
  • Fast implementation: Estimated 2-3 weeks for TypeScript worker

Negative

  • FFI safety concerns: Incorrect types can cause segfaults
  • Platform builds: Must distribute .dylib/.so/.dll per platform
  • Runtime compatibility: Different FFI semantics between Bun and Node

Neutral

  • Bun FFI experimental but fast-stabilizing
  • Pre-built binaries via GitHub releases address distribution

Future Vision

WASM Research: Revisit when WASI 0.3+ stabilizes with networking.

Serverless WASM Handlers:

  • Compile individual handlers to WASM (not orchestration)
  • Deploy to serverless platforms (AWS Lambda, Cloudflare Workers)
  • Cold start optimization (1ms vs 100ms)
  • Extreme scalability for compute-heavy workflows

Separation of Concerns:

  • Orchestration: Stays Rust (PostgreSQL, PGMQ, state machines)
  • Handlers: Optionally WASM (stateless compute units)

Alternatives Considered

Alternative 1: WASM with Host Functions

Implement database operations as host functions.

Rejected: Defeats the purpose - logic split between WASM and host, loses Rust benefits.

Alternative 2: Wait for WASI 0.3

Delay TypeScript worker until WASI matures.

Rejected: Timeline uncertain (6+ months); FFI works today.

Alternative 3: Spin Framework

Use Spin’s WASM abstraction layer.

Rejected: Framework lock-in; requires Spin APIs, can’t reuse Axum/Tower patterns.

References

ADR: Handler Composition Pattern

Status: Accepted Date: 2025-12 Ticket: TAS-112

Context

Cross-language step handler ergonomics research revealed an architectural inconsistency:

  • Batchable handlers: Already use composition via mixins (target pattern)
  • API handlers: Use inheritance (subclass pattern)
  • Decision handlers: Use inheritance (subclass pattern)

Current State:
✅ Batchable: class Handler(StepHandler, Batchable)  # Composition
❌ API: class Handler < APIHandler                    # Inheritance
❌ Decision: class Handler extends DecisionHandler    # Inheritance

Guiding Principle (Zen of Python): “There should be one– and preferably only one –obvious way to do it.”

Decision

Migrate all handler patterns to composition (mixins/traits), using batchable as the reference implementation.

Target Architecture:

All patterns use composition:
  Ruby:      include Base, include API, include Decision, include Batchable
  Python:    class Handler(StepHandler, API, Decision, Batchable)
  TypeScript: interface composition + mixins
  Rust:      trait composition (impl Base + API + Decision + Batchable)

Benefits:

  • Single responsibility - each mixin handles one concern
  • Flexible composition - handlers can mix capabilities as needed
  • Easier testing - can test each capability independently
  • Matches batchable pattern (already proven successful)

Example Migration:

# Old pattern (deprecated)
class MyHandler < TaskerCore::StepHandler::API
  def call(context)
    api_success(data)
  end
end

# New pattern
class MyHandler < TaskerCore::StepHandler::Base
  include TaskerCore::StepHandler::Mixins::API

  def call(context)
    api_success(data)
  end
end

Consequences

Positive

  • Consistent architecture: One pattern for all handler capabilities
  • Composable capabilities: Mix API + Decision + Batchable as needed
  • Testable in isolation: Each mixin can be tested independently
  • Matches proven pattern: Batchable already validates approach
  • Cross-language alignment: Same mental model in all languages

Negative

  • Breaking change: All existing handlers need migration
  • Learning curve: Contributors must understand mixin pattern
  • Migration effort: All examples and documentation need updates

Neutral

  • Pre-alpha status means breaking changes are acceptable
  • Migration can be phased with deprecation warnings

Ruby Result Unification

Ruby uses separate Success/Error classes while Python/TypeScript use unified result with success flag. Recommend unifying Ruby to match.

Rust Handler Traits

Rust needs ergonomic traits for API, Decision, and Batchable capabilities to match other languages:

pub trait APICapable {
    fn api_success(&self, data: Value, status: u16) -> StepExecutionResult;
    fn api_failure(&self, message: &str, status: u16) -> StepExecutionResult;
}

pub trait DecisionCapable {
    fn decision_success(&self, step_names: Vec<String>) -> StepExecutionResult;
    fn skip_branches(&self, reason: &str) -> StepExecutionResult;
}

FFI Boundary Types

Data structures crossing FFI boundaries must have identical serialization. Create explicit type mirrors in all languages:

  • DecisionPointOutcome
  • BatchProcessingOutcome
  • CursorConfig

Alternatives Considered

Alternative 1: Keep Inheritance Pattern

Continue with subclass pattern for API and Decision.

Rejected: Inconsistent with batchable; makes multi-capability handlers awkward.

Alternative 2: Migrate Batchable to Inheritance

Make batchable use inheritance to match others.

Rejected: Batchable composition is the better pattern; others should follow it.

Alternative 3: Language-Specific Patterns

Let each language use its idiomatic pattern.

Rejected: Violates cross-language consistency principle; increases cognitive load.

RCA: Parallel Execution Exposing Latent Timing Bugs

Date: 2025-12-07 Related: Worker Dual-Channel Event System Status: Resolved Impact: Flaky E2E test test_mixed_workflow_scenario


Executive Summary

During the dual-channel event system implementation (fire-and-forget handler dispatch), a previously hidden bug in the SQL function get_task_execution_context() became consistently reproducible. The bug was a logical precedence error that had always existed but was masked by sequential execution timing. Introducing true parallelism changed the probability distribution of state combinations, transforming a Heisenbug into a Bohrbug.

This document captures the root cause analysis as a reference for understanding how architectural changes to concurrency can surface latent bugs in distributed systems.


The Bug

Symptom

The test test_mixed_workflow_scenario intermittently failed with a timeout while waiting for the BlockedByFailures status, because the API kept returning HasReadySteps.

⏳ Waiting for task to fail (max 10s)...
   Task execution status: processing (processing)
   Task execution status: has_ready_steps (has_ready_steps)  ← Wrong!
   Task execution status: has_ready_steps (has_ready_steps)
   ... timeout ...

Root Cause

The SQL function get_task_execution_context() checked ready_steps > 0 BEFORE permanently_blocked_steps > 0:

-- BUGGY: Wrong precedence order
CASE
  WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps'           -- ← Checked FIRST
  WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures'
  ...
END as execution_status

When a task had BOTH permanently blocked steps AND ready steps, the function returned has_ready_steps instead of blocked_by_failures.

The Fix

Migration 20251207000000_fix_execution_status_priority.sql corrects the precedence:

-- FIXED: blocked_by_failures takes semantic priority
CASE
  WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures'  -- ← Now FIRST
  WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps'
  ...
END as execution_status

Why Did This Surface Now?

The Test Scenario

# 3 parallel steps with NO dependencies (can all run concurrently)
steps:
  - name: success_step
    retryable: false

  - name: permanent_error_step
    retryable: false          # Fails permanently

  - name: retryable_error_step
    retryable: true
    max_attempts: 2           # Fails, but becomes "ready" after backoff

Before: Blocking Handler Dispatch

The original architecture used blocking .call() in the event handler:

// workers/rust/src/event_handler.rs (before)
let result = handler.call(&step).await;  // BLOCKS until handler completes

This created effectively sequential execution even for independent steps:

Timeline (Sequential):
────────────────────────────────────────────────────────────────────

t=0ms     [success_step starts]
t=50ms    [success_step completes]
t=51ms    [permanent_error_step starts]
t=100ms   [permanent_error_step fails → PERMANENTLY BLOCKED]
t=101ms   [retryable_error_step starts]
t=150ms   [retryable_error_step fails → enters 100ms backoff]
t=151ms   ──► STATUS CHECK
              permanently_blocked_steps = 1
              ready_steps = 0 (still in backoff!)
              ──► Returns: blocked_by_failures ✓

The backoff hadn't elapsed yet because steps were processed one at a time.

After: Fire-and-Forget Handler Dispatch

The dual-channel event system introduced non-blocking dispatch via channels:

// Fire-and-forget pattern
dispatch_sender.send(DispatchHandlerMessage { step, ... }).await;
// Returns immediately - handler executes in separate task

This enables true parallel execution:

Timeline (Parallel):
────────────────────────────────────────────────────────────────────

t=0ms     [success_step starts]──────────────────►[completes t=50ms]
t=0ms     [permanent_error_step starts]──────────►[fails t=50ms → BLOCKED]
t=0ms     [retryable_error_step starts]──────────►[fails t=50ms → backoff]

t=150ms   [retryable_error_step backoff expires → becomes READY]

t=151ms   ──► STATUS CHECK
              permanently_blocked_steps = 1
              ready_steps = 1 (backoff elapsed!)
              ──► Returns: has_ready_steps ✗ (BUG!)

Probability Analysis

The “Both States” Window

The bug manifests when checking status while the task has BOTH:

  • At least one permanently blocked step
  • At least one ready step (e.g., retryable step after backoff)
Sequential Processing:
├────────────────────────────────────────────────────────────────────┤
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│ Very LOW probability of "both states" window                       │
│ Steps complete serially; backoff rarely overlaps with status check │
└────────────────────────────────────────────────────────────────────┘

Parallel Processing:
├────────────────────────────────────────────────────────────────────┤
│░░░░░░░░░░░░████████████████████████████████████████████░░░░░░░░░░░│
│            ↑                                          ↑            │
│            │ HIGH probability "both states" window    │            │
│            │ All steps complete ~simultaneously       │            │
│            │ Backoff expires while status is polled  │            │
└────────────────────────────────────────────────────────────────────┘

Quantifying the Change

MetricSequentialParallel
Step completion spread~150ms~50ms
“Both states” window duration~0ms (transient)~100ms+ (stable)
Probability of hitting bug<1%>50%
Bug classificationHeisenbugBohrbug

Bug Classification

Heisenbug → Bohrbug Transformation

PropertyBefore (Heisenbug)After (Bohrbug)
ReproducibilityIntermittent, timing-dependentConsistent, deterministic
Root causeLogical precedence errorSame
VisibilityHidden by sequential timingExposed by parallel timing
Debug difficultyExtremely hard (may never reproduce)Straightforward
Detection in CIMight pass for monthsFails consistently under load

Why This Matters

  1. The bug was always present - It existed in the SQL function since it was written
  2. Sequential execution hid it - Incidental timing prevented the problematic state
  3. Parallelization surfaced it - Not by introducing a bug, but by applying concurrency pressure
  4. This is good - Better to find in tests than production

Semantic Correctness

The Correct Mental Model

“If ANY step is permanently blocked, the task cannot make further progress toward completion, even if other steps are ready to execute.”

A task with permanent failures is blocked by failures regardless of what else might be runnable. The old code implicitly assumed:

“If work is available, we’re making progress”

This is incorrect for workflows where:

  • Convergence points require ALL branches to complete
  • Final task status depends on ALL steps succeeding
  • Partial progress doesn’t constitute overall success

State Precedence (Correct Order)

-- 1. Permanent failures block overall progress
WHEN permanently_blocked_steps > 0 THEN 'blocked_by_failures'

-- 2. Ready work can continue (but may not lead to completion)
WHEN ready_steps > 0 THEN 'has_ready_steps'

-- 3. Work in flight
WHEN in_progress_steps > 0 THEN 'processing'

-- 4. All done
WHEN completed_steps = total_steps THEN 'all_complete'

-- 5. Waiting for dependencies
ELSE 'waiting_for_dependencies'

Patterns to Watch For

1. State Combination Explosions

Sequential processing often means only one state at a time. Parallelism creates state combinations that were previously impossible:

Sequential: A → B → C (states are mutually exclusive in time)
Parallel:   A + B + C (states can coexist)

Watch for: CASE statements, if/else chains, and state machines that assume mutual exclusivity.
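
A hedged Rust sketch of the same idea: when states can coexist, the status calculation must be an explicit precedence over counts rather than a set of branches that assumes mutual exclusivity. The logic mirrors the corrected SQL, but the enum and struct names are illustrative, not the actual tasker-core types:

// Illustrative only: mirrors the corrected SQL precedence, not the engine's real types.
#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    BlockedByFailures,
    HasReadySteps,
    Processing,
    AllComplete,
    WaitingForDependencies,
}

struct StepCounts {
    permanently_blocked: u32,
    ready: u32,
    in_progress: u32,
    completed: u32,
    total: u32,
}

fn execution_status(c: &StepCounts) -> ExecutionStatus {
    // Order matters: blocked steps veto overall progress even if other steps are ready.
    if c.permanently_blocked > 0 {
        ExecutionStatus::BlockedByFailures
    } else if c.ready > 0 {
        ExecutionStatus::HasReadySteps
    } else if c.in_progress > 0 {
        ExecutionStatus::Processing
    } else if c.completed == c.total {
        ExecutionStatus::AllComplete
    } else {
        ExecutionStatus::WaitingForDependencies
    }
}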

2. Timing-Dependent Invariants

Code may accidentally depend on timing:

// Assumes step_a completes before step_b starts
if step_a.is_complete() {
    // Safe to check step_b
}

Watch for: Implicit ordering assumptions in status calculations, rollups, and aggregations.

3. Transient vs Stable States

Some states that were transient under sequential processing become stable under parallel execution:

StateSequentialParallel
“1 complete, 1 in-progress”Transient (~ms)Stable (seconds)
“blocked + ready”Nearly impossibleCommon
“multiple errors”RareFrequent

Watch for: Error handling, status rollups, and progress calculations that assumed single-state scenarios.

4. Test Timing Sensitivity

Tests written for sequential execution may have implicit timing dependencies:

// This worked when steps were sequential
wait_for_status(BlockedByFailures, timeout: 10s);

// But fails when parallel execution creates a different status first

Watch for: Tests that pass in isolation but fail under concurrent load.


Verification Strategy

After Parallelization Changes

  1. Run tests multiple times - Timing bugs may not manifest on first run
  2. Run tests under load - Concurrent test execution increases probability
  3. Add explicit state combination tests - Test scenarios that were previously impossible
  4. Review CASE/if-else precedence - Check all status calculations for correct ordering

Example: Testing State Combinations

#[tokio::test]
async fn test_blocked_with_ready_steps() {
    // Explicitly create the state combination
    let task = create_task_with_parallel_steps();

    // Force one step to permanent failure
    force_step_to_permanent_failure(&task, "step_a").await;

    // Force another step to ready (after backoff)
    force_step_to_ready_after_backoff(&task, "step_b").await;

    // Verify correct precedence
    let status = get_task_execution_status(&task).await;
    assert_eq!(status, ExecutionStatus::BlockedByFailures);
}

Conclusion

This bug exemplifies how architectural improvements to concurrency can surface latent correctness issues. The parallelization didn’t introduce a bug—it revealed one that had been hidden by incidental sequential timing.

This is a positive outcome: the bug was found in testing rather than production. The fix ensures correct semantic precedence regardless of execution timing, making the system more robust under parallel load.

Key Takeaways

  1. Parallelization is a stress test - It exposes timing-dependent bugs
  2. Sequential execution hides bugs - Incidental ordering masks logical errors
  3. State precedence matters - Review all status calculations when adding concurrency
  4. Heisenbugs become Bohrbugs - Parallel execution makes rare bugs reproducible
  5. This is good engineering - Finding bugs through architectural improvements validates the testing strategy

References

  • Migration: migrations/20251207000000_fix_execution_status_priority.sql
  • Test: tests/e2e/ruby/error_scenarios_test.rs::test_mixed_workflow_scenario
  • SQL Function: get_task_execution_context() in migrations/20251001000000_fix_permanently_blocked_detection.sql
  • Dual-Channel Event System ADR

Tasker Core Benchmarks

Last Updated: 2026-01-23 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Observability | Deployment Patterns



This directory contains documentation for all performance benchmarks in the tasker-core workspace.


Quick Reference

# E2E benchmarks (cluster mode, all tiers)
cargo make setup-env-all-cluster
cargo make cluster-start-all
set -a && source .env && set +a && cargo bench --bench e2e_latency
cargo make bench-report     # Percentile JSON
cargo make bench-analysis   # Markdown analysis
cargo make cluster-stop

# Component benchmarks (requires Docker services)
docker-compose -f docker/docker-compose.test.yml up -d
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo bench --package tasker-client --features benchmarks   # API benchmarks
cargo bench --package tasker-shared --features benchmarks   # SQL + Event benchmarks

Benchmark Categories

1. End-to-End Latency (tests/benches)

Location: tests/benches/e2e_latency.rs Documentation: e2e-benchmarks.md

Measures complete workflow execution from API call through orchestration, message queue, worker execution, result processing, and dependency resolution — across all distributed components in a 10-instance cluster.

TierBenchmarkStepsParallelismP50Target (p99)
1Linear Rust4 sequentialnone255-258ms< 500ms
1Diamond Rust4 (2 parallel)2-way200-259ms< 500ms
2Complex DAG7 (mixed)2+3-way382ms< 800ms
2Hierarchical Tree8 (4 parallel)4-way389-426ms< 800ms
2Conditional6 (4 executed)dynamic251-262ms< 500ms
3Cluster single task4 sequentialnone261ms< 500ms
3Cluster concurrent 2x4+4distributed332-384ms< 800ms
4FFI linear (Ruby/Python/TS)4 sequentialnone312-316ms< 800ms
4FFI diamond (Ruby/Python/TS)4 (2 parallel)2-way260-275ms< 800ms
5Batch 1000 rows7 (5 parallel)5-way358-368ms< 1000ms

Each step involves ~19 database operations, 2 message queue round-trips, 4+ state transitions, and dependency graph evaluation. See e2e-benchmarks.md for the detailed per-step lifecycle.

Key Characteristics:

  • FFI overhead: ~23% vs native Rust (all languages within 3ms of each other)
  • Linear patterns: highly reproducible (<2% variance between runs)
  • Parallel patterns: environment-sensitive (I/O contention affects parallelism)
  • Batch processing: 2,700-2,800 rows/second with tight P95/P50 ratios

Run Commands:

cargo make bench-e2e           # Tier 1: Rust core
cargo make bench-e2e-full      # Tier 1+2: + complexity
cargo make bench-e2e-cluster   # Tier 3: Multi-instance
cargo make bench-e2e-languages # Tier 4: FFI comparison
cargo make bench-e2e-batch     # Tier 5: Batch processing
cargo make bench-e2e-all       # All tiers

2. API Performance (tasker-client)

Location: tasker-client/benches/task_initialization.rs

Measures orchestration API response times for task creation (HTTP round-trip + DB insert + step initialization).

BenchmarkTargetCurrentStatus
Linear task init< 50ms17.7ms2.8x better
Diamond task init< 75ms20.8ms3.6x better
cargo bench --package tasker-client --features benchmarks

3. SQL Function Performance (tasker-shared)

Location: tasker-shared/benches/sql_functions.rs

Measures critical PostgreSQL function performance for orchestration polling.

FunctionTargetCurrent (5K tasks)Status
get_next_ready_tasks< 3ms1.75-2.93msPass
get_step_readiness_status< 1ms440-603usPass
get_task_execution_context< 1ms380-460usPass
DATABASE_URL="..." cargo bench --package tasker-shared --features benchmarks sql_functions

4. Event Propagation (tasker-shared)

Location: tasker-shared/benches/event_propagation.rs

Measures PostgreSQL LISTEN/NOTIFY round-trip latency for real-time coordination.

MetricTarget (p95)CurrentStatus
Notify round-trip< 10ms14.1msSlightly above, p99 < 20ms
DATABASE_URL="..." cargo bench --package tasker-shared --features benchmarks event_propagation

Performance Targets

System-Wide Goals

CategoryMetricTargetRationale
API Latencyp99< 100msUser-facing responsiveness
SQL Functionsmean< 3msOrchestration polling efficiency
Event Propagationp95< 10msReal-time coordination overhead
E2E Linear (4 steps)p99< 500msEnd-user task completion
E2E Complex (7-8 steps)p99< 800msComplex workflow completion
E2E Batch (1000 rows)p99< 1000msBulk operation completion

Scaling Targets

Dataset Sizeget_next_ready_tasksNotes
1K tasks< 2msInitial implementation
5K tasks< 3msCurrent verified
10K tasks< 5msTarget
100K tasks< 10msProduction scale

Cluster Topology (E2E Benchmarks)

ServiceInstancesPortsBuild
Orchestration28080, 8081Release
Rust Worker28100, 8101Release
Ruby Worker28200, 8201Release extension
Python Worker28300, 8301Maturin develop
TypeScript Worker28400, 8401Bun FFI

Deployment Mode: Hybrid (event-driven with polling fallback) Database: PostgreSQL (with PGMQ extension available) Messaging: RabbitMQ (via MessagingService provider abstraction; PGMQ also supported) Sample Size: 50 per benchmark


Running Benchmarks

E2E Benchmarks (Full Suite)

# 1. Setup cluster environment
cargo make setup-env-all-cluster

# 2. Start 10-instance cluster
cargo make cluster-start-all

# 3. Verify cluster health
cargo make cluster-status

# 4. Run benchmarks
set -a && source .env && set +a && cargo bench --bench e2e_latency

# 5. Generate reports
cargo make bench-report    # → target/criterion/percentile_report.json
cargo make bench-analysis  # → tmp/benchmark-results/benchmark-results.md

# 6. Stop cluster
cargo make cluster-stop

Component Benchmarks

# Start database
docker-compose -f docker/docker-compose.test.yml up -d
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"

# Run individual suites
cargo bench --package tasker-client --features benchmarks     # API
cargo bench --package tasker-shared --features benchmarks     # SQL + Events

# Run all at once
cargo bench --all-features

Baseline Comparison

# Save current performance as baseline
cargo bench --all-features -- --save-baseline main

# After changes, compare
cargo bench --all-features -- --baseline main

# View report
open target/criterion/report/index.html

Interpreting Results

Stable Metrics (Reliable for Regression Detection)

These metrics show <2% variance between runs:

  • Linear pattern P50 (sequential execution baseline)
  • FFI linear P50 (framework overhead measurement)
  • Single task in cluster (cluster overhead measurement)
  • Batch P50 (parallel I/O throughput)

Environment-Sensitive Metrics

These metrics vary 10-30% depending on system load:

  • Diamond pattern P50 (parallelism benefit depends on I/O capacity)
  • Concurrent 2x (scheduling contention varies)
  • Hierarchical tree (deep dependency chains amplify I/O latency)

Key Ratios (Always Valid)

  • FFI overhead %: ~23% for all languages (framework-dominated)
  • P95/P50 ratio: 1.01-1.12 (execution stability indicator)
  • Cluster vs single overhead: <3ms (negligible cluster tax)
  • FFI language spread: <3ms (language runtime is not the bottleneck)

Design Principles

Natural Measurement

Benchmarks measure real system behavior without artificial test harnesses:

  • API benchmarks hit actual HTTP endpoints
  • SQL benchmarks use real database with realistic data volumes
  • E2E benchmarks execute complete workflows through all distributed components

Distributed System Focus

All benchmarks account for distributed system characteristics:

  • Network latency included (HTTP, PostgreSQL, message queues)
  • Database transaction timing considered
  • Message queue delivery overhead measured
  • Worker coordination and scheduling included

Load-Based Validation

Benchmarks serve dual purpose:

  • Performance measurement: Track regressions and improvements
  • Load testing: Expose race conditions and timing bugs

E2E benchmark warmup has historically discovered critical race conditions that manual testing never revealed.

Statistical Rigor

  • 50 samples per benchmark for P50/P95 validity
  • Criterion framework with statistical regression detection
  • Multiple independent runs recommended for absolute comparisons
  • Relative metrics (ratios, overhead %) preferred over absolute milliseconds

Troubleshooting

“Services must be running”

cargo make cluster-status          # Check cluster health
cargo make cluster-start-all       # Restart cluster

Tier 3/4 benchmarks skipped

# Ensure cluster env is configured (not single-service)
cargo make setup-env-all-cluster   # Generates .env with cluster URLs

High variance between runs

  • Close resource-intensive applications (browsers, IDEs)
  • Ensure machine is plugged in (not throttling)
  • Focus on stable metrics (linear P50, FFI overhead %) for comparisons
  • Run benchmarks twice and compare for reproducibility

Benchmark takes too long

# Reduce sample size (default: 50)
cargo bench -- --sample-size 10

# Run single tier
cargo make bench-e2e               # Only Tier 1

CI Integration

# Example: .github/workflows/benchmarks.yml
name: Performance Benchmarks

on:
  pull_request:
    paths:
      - 'tasker-*/src/**'
      - 'migrations/**'

jobs:
  benchmark:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: ghcr.io/pgmq/pg18-pgmq:v1.8.1
        env:
          POSTGRES_DB: tasker_rust_test
          POSTGRES_USER: tasker
          POSTGRES_PASSWORD: tasker

    steps:
      - uses: actions/checkout@v3
      - run: cargo bench --all-features -- --save-baseline pr
      - uses: benchmark-action/github-action-benchmark@v1
        with:
          tool: 'criterion'
          output-file-path: target/criterion/report/index.html

Criterion automatically detects performance regressions with statistical comparison to baselines and alerts on >5% slowdowns.


Contributing

When adding new benchmarks:

  1. Follow naming convention: <tier>_<category>/<group>/<scenario>
  2. Include targets: Document expected performance in this README
  3. Add fixture: Create workflow template YAML in tests/fixtures/task_templates/
  4. Document shape: Update e2e-benchmarks.md with topology
  5. Consider variance: Account for distributed system characteristics
  6. Use 50 samples: Minimum for P50/P95 statistical validity

Benchmark Template

use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};
use std::time::Duration;

fn bench_my_scenario(c: &mut Criterion) {
    let mut group = c.benchmark_group("e2e_my_tier");
    group.sample_size(50);
    group.measurement_time(Duration::from_secs(30));

    group.bench_function(BenchmarkId::new("workflow", "my_scenario"), |b| {
        b.iter(|| {
            runtime.block_on(async {
                execute_benchmark_scenario(&client, namespace, handler, context, timeout).await
            })
        });
    });

    group.finish();
}
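
The template assumes a runtime, client, and execute_benchmark_scenario helper provided by the surrounding bench setup in tests/benches. A complete benchmark file also needs the standard Criterion harness wiring, roughly:

// Register the group and generate the benchmark entry point (names from the template above).
criterion_group!(benches, bench_my_scenario);
criterion_main!(benches);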

E2E Benchmark Scenarios: Workflow Shapes and Per-Step Lifecycle

Last Updated: 2026-01-23 Audience: Architects, Developers, Performance Engineers Related Docs: Benchmarks README | States & Lifecycles | Actor Pattern



What Each Benchmark Measures

Each E2E benchmark executes a complete workflow through the full distributed system: HTTP API call, task initialization, step discovery, message queue dispatch, worker execution, result submission, dependency graph re-evaluation, and task finalization.

A 4-step linear workflow at P50=257ms means the system completes 76+ database operations, 8 message queue round-trips, 16+ state machine transitions, and 4 dependency graph evaluations — all across a 10-instance distributed cluster — in approximately one quarter of a second.


Per-Step Lifecycle: What Happens for Every Step

Before examining the benchmark scenarios, it’s important to understand the work the system performs for each individual step. Every step in every benchmark goes through this complete lifecycle.

Messaging Backend: Tasker uses a MessagingService trait with provider variants for PGMQ (PostgreSQL-native, single-dependency) and RabbitMQ (high-throughput). The benchmark results documented here were captured using the RabbitMQ backend. The per-step lifecycle is identical regardless of backend — only the transport layer differs.

State Machine Transitions (per step)

Step:  Pending → Enqueued → InProgress → EnqueuedForOrchestration → Complete
Task:  StepsInProcess → EvaluatingResults → (EnqueuingSteps if more ready) → Complete

Database Operations (per step): ~19 operations

PhaseOperationsDescription
Discovery2 queriesget_next_ready_tasks + get_step_readiness_status_batch (8-CTE query)
Enqueueing4 writesFetch correlation_id, transition Pending→Enqueued (SELECT sort_key + UPDATE most_recent + INSERT transition)
Message send1 opSend step dispatch to worker queue (via MessagingService)
Worker claim1 opClaim message with visibility timeout (via MessagingService)
Worker transition3 writesTransition Enqueued→InProgress
Result submission4 writesTransition InProgress→EnqueuedForOrchestration + audit trigger INSERT + send completion to orchestration queue
Result processing4 writesFetch step state, transition →Complete, delete consumed message
Task coordination1+ queriesRe-evaluate get_step_readiness_status_batch for remaining steps
Total~19 ops

Message Queue Round-Trips (per step): 2

  1. Orchestration → Worker: Step dispatch message (task_uuid, step_uuid, handler, context)
  2. Worker → Orchestration: Completion notification (task_uuid, step_uuid, results)

Dependency Graph Evaluation (per step completion)

After each step completes, the orchestration does the following (see the sketch after this list):

  1. Queries all steps in the task for current state
  2. Evaluates dependency edges (parent steps must be Complete)
  3. Calculates retry eligibility (attempts < max_attempts, backoff expired)
  4. Identifies newly-ready steps for enqueueing
  5. Updates task state (more steps ready → EnqueuingSteps, all complete → Complete)
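
A hedged sketch of that readiness check; the types and field names are illustrative, since the real evaluation happens inside the SQL functions and orchestration actors:

// Illustrative only: the shape of the per-step readiness decision after a completion event.
#[derive(PartialEq)]
enum StepState { Pending, Enqueued, InProgress, Complete }

struct StepView {
    state: StepState,
    parents_complete: bool, // all dependency edges satisfied
    attempts: u32,
    max_attempts: u32,
    backoff_expired: bool,
}

// A step becomes ready when its dependencies are met and retry/backoff rules allow execution.
fn is_ready(s: &StepView) -> bool {
    s.state == StepState::Pending
        && s.parents_complete
        && s.attempts < s.max_attempts
        && s.backoff_expired
}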

Idempotency Guarantees

  • Message visibility timeout: MessagingService prevents duplicate processing (30s window)
  • State machine guards: Transitions validate from-state before applying
  • Atomic claiming: Workers claim via the messaging backend’s atomic read operation
  • Audit trail: Every transition creates an immutable workflow_step_transitions record

Tier 1: Core Performance (Rust Native)

Linear Rust (4 steps, sequential)

Fixture: tests/fixtures/task_templates/rust/mathematical_sequence.yaml Namespace: rust_e2e_linear | Handler: mathematical_sequence

linear_step_1 → linear_step_2 → linear_step_3 → linear_step_4
StepHandlerOperationDepends OnMath
linear_step_1LinearStep1squarenone6^2 = 36
linear_step_2LinearStep2squarestep_136^2 = 1,296
linear_step_3LinearStep3squarestep_21,296^2 = 1,679,616
linear_step_4LinearStep4squarestep_31,679,616^2

Distributed system work for this workflow:

MetricCount
State machine transitions (step)16 (4 per step)
State machine transitions (task)6 (Pending→Init→Enqueue→InProcess→Eval→Complete)
Database operations76 (19 per step)
MQ messages8 (2 per step)
Dependency evaluations4 (after each step completes)
HTTP calls (benchmark→API)1 create + ~5 polls
Sequential stages4

Why this matters: This is the purest sequential latency test. Each step must fully complete (all 19 DB operations + 2 message round-trips) before the next step can begin. The P50 of ~257ms means each step’s complete lifecycle averages ~64ms including all distributed coordination.


Diamond Rust (4 steps, 2-way parallel)

Fixture: tests/fixtures/task_templates/rust/diamond_pattern.yaml Namespace: rust_e2e_diamond | Handler: diamond_pattern

         diamond_start
           /       \
          /         \
  diamond_branch_b  diamond_branch_c    ← parallel execution
          \         /
           \       /
         diamond_end                    ← 2-way convergence
StepHandlerOperationDepends OnParallelism
diamond_startStartsquarenone-
diamond_branch_bBranchBsquarestartparallel with C
diamond_branch_cBranchCsquarestartparallel with B
diamond_endEndmultiply_and_squarebranch_b AND branch_cconvergence

Distributed system work:

MetricCount
State machine transitions (step)16
Database operations76
MQ messages8
Dependency evaluations4
Sequential stages3 (start → parallel → end)
Convergence points1 (diamond_end waits for both branches)
Dependency edge checks4 (start→B, start→C, B→end, C→end)

Why this matters: Tests the system’s ability to dispatch and execute steps concurrently. The convergence point (diamond_end) requires the orchestration to correctly evaluate that BOTH branch_b AND branch_c are Complete before enqueueing diamond_end. Under light load, this completes in 3 sequential stages vs 4 for linear (~30% faster).


Tier 2: Complexity Scaling

Complex DAG (7 steps, mixed parallelism)

Fixture: tests/fixtures/task_templates/rust/complex_dag.yaml Namespace: rust_e2e_mixed_dag | Handler: complex_dag

              dag_init
             /        \
   dag_process_left   dag_process_right     ← 2-way parallel
        /    |              |    \
       /     |              |     \
dag_validate dag_transform dag_analyze      ← mixed dependencies
       \          |          /
        \         |         /
         dag_finalize                       ← 3-way convergence
StepDepends OnType
dag_initnoneinit
dag_process_leftinitparallel branch
dag_process_rightinitparallel branch
dag_validateleft AND right2-way convergence
dag_transformleft onlylinear continuation
dag_analyzeright onlylinear continuation
dag_finalizevalidate AND transform AND analyze3-way convergence

Distributed system work:

MetricCount
State machine transitions (step)28 (7 steps x 4)
Database operations133 (7 x 19)
MQ messages14 (7 x 2)
Dependency evaluations7
Sequential stages4 (init → left/right → validate/transform/analyze → finalize)
Convergence points2 (dag_validate: 2-way, dag_finalize: 3-way)
Dependency edge checks8

Why this matters: Tests multiple convergence points with different fan-in widths. The orchestration must correctly handle that dag_validate needs 2 parents while dag_finalize needs 3. Also tests mixed patterns: some steps continue from a single parent (transform from left only) while others require multiple.


Hierarchical Tree (8 steps, 4-way convergence)

Fixture: tests/fixtures/task_templates/rust/hierarchical_tree.yaml Namespace: rust_e2e_tree | Handler: hierarchical_tree

                    tree_root
                   /         \
        tree_branch_left    tree_branch_right    ← 2-way parallel
          /       \           /        \
  tree_leaf_d  tree_leaf_e  tree_leaf_f  tree_leaf_g  ← 4-way parallel
         \          |            |          /
          \         |            |         /
           tree_final_convergence               ← 4-way convergence
LevelStepsParallelismOperation
0rootsequentialsquare
1branch_left, branch_right2-way parallelsquare
2leaf_d, leaf_e, leaf_f, leaf_g4-way parallelsquare
3final_convergence4-way convergencemultiply_all_and_square

Distributed system work:

MetricCount
State machine transitions (step)32 (8 x 4)
Database operations152 (8 x 19)
MQ messages16 (8 x 2)
Dependency evaluations8
Sequential stages4 (root → branches → leaves → convergence)
Maximum fan-out2-way (each branch → 2 leaves)
Maximum fan-in4-way (convergence waits for all 4 leaves)
Dependency edge checks9

Why this matters: Tests the widest convergence pattern — 4 parallel leaves must all complete before the final step can execute. This exercises the dependency evaluation with a large number of parent checks per step. Also tests hierarchical fan-out (root→2 branches→4 leaves).


Conditional Routing (6 steps, 4 executed)

Fixture: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml Namespace: conditional_approval_rust | Handler: approval_routing Context: {"amount": 500, "requester": "benchmark"}

validate_request
       ↓
routing_decision          ← DECISION POINT (routes based on amount)
   /      |      \
  /       |       \
auto_approve  manager_approval  finance_review
(< $1000)     ($1000-$5000)     (> $5000)
  \       |       /
   \      |      /
  finalize_approval               ← deferred convergence

With benchmark context amount=500, only the auto_approve path executes:

validate_request → routing_decision → auto_approve → finalize_approval
StepExecutedCondition
validate_requestYesalways
routing_decisionYesalways (decision point)
auto_approveYesamount < 1000
manager_approvalSkippedamount 1000-5000
finance_reviewSkippedamount > 5000
finalize_approvalYesdeferred convergence (waits for executed paths only)

Distributed system work (executed steps only):

MetricCount
State machine transitions (step)16 (4 executed x 4)
Database operations76 (4 executed x 19)
MQ messages8 (4 executed x 2)
Dependency evaluations4
Sequential stages4 (validate → decision → approve → finalize)
Skipped steps2 (manager_approval, finance_review)

Why this matters: Tests deferred convergence — the finalize_approval step depends on ALL conditional branches, but only blocks on branches that actually executed. The orchestration must correctly determine that manager_approval and finance_review were skipped (not just incomplete) and allow finalize_approval to proceed. Also tests the decision point routing pattern.


Tier 3: Cluster Performance

Single Task Linear (4 steps, round-robin across 2 orchestrators)

Same workflow as Tier 1 linear_rust, but benchmarked with round-robin across 2 orchestration instances to measure cluster coordination overhead.

Distributed system work: Same as linear_rust (76 DB ops, 8 MQ messages) plus cluster coordination overhead (shared database, message queue visibility).

Why this matters: Validates that running in cluster mode adds negligible overhead compared to single-instance. The P50 difference (261ms vs 257ms = ~4ms) represents the entire cluster coordination tax.

Concurrent Tasks 2x (2 tasks simultaneously across 2 orchestrators)

Two linear workflows submitted simultaneously, one to each orchestration instance.

Distributed system work:

MetricCount
State machine transitions44 (22 per task)
Database operations152 (76 per task)
MQ messages16 (8 per task)
Concurrent step executionsup to 2
Database connection contention2 orchestrators + 2 workers competing

Why this matters: Tests work distribution across cluster instances under concurrent load. The P50 of ~332-384ms for TWO tasks (vs ~261ms for one) shows that the second task adds only 30-50% latency, not 100% — demonstrating effective parallelism in the cluster.


Tier 4: FFI Language Comparison

Same linear and diamond patterns as Tier 1, but using FFI workers (Ruby via Magnus, Python via PyO3, TypeScript via Bun FFI) instead of native Rust handlers.

Additional per-step work for FFI:

PhaseAdditional Operations
Handler dispatchFFI bridge call (Rust → language runtime)
Context serializationJSON serialize context for foreign runtime
Result deserializationJSON deserialize results back to Rust
Circuit breaker checkshould_allow() (sync, atomic check)
Completion callbackFFI completion channel (bounded MPSC)

FFI overhead: ~23% (~60ms for 4 steps)

The overhead is framework-dominated (Rust dispatch + serialization + completion channel), not language-dominated — all three languages perform within 3ms of each other.


Tier 5: Batch Processing

CSV Products 1000 Rows (7 steps, 5-way parallel)

Fixture: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml Namespace: csv_processing_rust | Handler: csv_product_inventory_analyzer

analyze_csv                    ← reads CSV, returns BatchProcessingOutcome
    ↓
[orchestration creates 5 dynamic workers from batch template]
    ↓
process_csv_batch_001 ──┐
process_csv_batch_002 ──┤
process_csv_batch_003 ──├──→ aggregate_csv_results    ← deferred convergence
process_csv_batch_004 ──┤
process_csv_batch_005 ──┘
StepTypeRowsOperation
analyze_csvbatchableall 1000Count rows, compute batch ranges
process_csv_batch_001batch_worker1-200Compute inventory metrics
process_csv_batch_002batch_worker201-400Compute inventory metrics
process_csv_batch_003batch_worker401-600Compute inventory metrics
process_csv_batch_004batch_worker601-800Compute inventory metrics
process_csv_batch_005batch_worker801-1000Compute inventory metrics
aggregate_csv_resultsdeferred_convergenceallMerge batch results

Distributed system work:

MetricCount
State machine transitions (step)28 (7 x 4)
Database operations133 (7 x 19)
MQ messages14 (7 x 2)
Dynamic step creation5 (batch workers created at runtime)
Dependency edges (dynamic)6 (batch workers → analyze, aggregate → batch_template)
File I/O operations6 (1 analysis read + 5 batch reads of CSV)
CSV rows processed1000
Sequential stages3 (analyze → 5 parallel workers → aggregate)

Why this matters: Tests the most complex orchestration pattern — dynamic step generation. The analyze_csv step returns a BatchProcessingOutcome that tells the orchestration to create N worker steps at runtime (a sketch of that outcome type appears at the end of this section). The orchestration must:

  1. Create new step records in the database
  2. Create dependency edges dynamically
  3. Enqueue all batch workers for parallel execution
  4. Use deferred convergence for the aggregate step (waits for batch template, not specific steps)

At P50=358-368ms for 1000 rows, throughput is ~2,700 rows/second with all the distributed system overhead included.
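
For orientation, a hedged sketch of the kind of payload the analyze step hands back so the orchestration can create the dynamic workers. The field names are illustrative; the canonical BatchProcessingOutcome is defined in tasker-core:

use serde::{Deserialize, Serialize};

// Illustrative shape only; field names are assumptions, not the canonical schema.
#[derive(Debug, Serialize, Deserialize)]
struct BatchRange {
    start_row: u64, // e.g. 1
    end_row: u64,   // e.g. 200
}

#[derive(Debug, Serialize, Deserialize)]
struct BatchProcessingOutcome {
    worker_template: String,  // batch step template to instantiate, e.g. "process_csv_batch"
    batches: Vec<BatchRange>, // one dynamic worker step per entry (5 for 1000 rows at 200/batch)
}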


Summary: Operations Per Benchmark

BenchmarkStepsDB OpsMQ MsgsTransitionsConvergenceP50
Linear Rust476822none257ms
Diamond Rust4768222-way200-259ms
Complex DAG713314342+3-way382ms
Hierarchical Tree815216384-way389-426ms
Conditional4*76822deferred251-262ms
Cluster single476822none261ms
Cluster 2x81521644none332-384ms
FFI linear476822none312-316ms
FFI diamond4768222-way260-275ms
Batch 1000 rows71331434deferred358-368ms

*Conditional executes 4 of 6 defined steps (2 skipped by routing decision)


Performance per Sequential Stage

For workflows with known sequential depth, we can calculate per-stage overhead:

BenchmarkSequential StagesP50Per-Stage Avg
Linear (4 seq)4257ms64ms
Diamond (3 seq)3200ms*67ms
Complex DAG (4 seq)4382ms96ms**
Tree (4 seq)4389ms97ms**
Conditional (4 seq)4257ms64ms
Batch (3 seq)3363ms121ms***

*Diamond under light load (parallelism helping)
**Higher per-stage due to multiple steps per stage (more DB ops per evaluation cycle)
***Higher per-stage due to batch worker creation overhead + file I/O

The ~64ms per sequential stage for simple patterns represents the total distributed round-trip: orchestration discovery → MQ dispatch → worker claim → handler execute (~1ms for math operations) → MQ completion → orchestration result processing → dependency re-evaluation. The handler execution itself is negligible; the 64ms is almost entirely orchestration infrastructure.

Tasker Contrib Documentation

DocumentDescription
README.mdRepository overview, vision, and structure
DEVELOPMENT.mdLocal development and cross-repo setup

Implementation Specifications

TicketStatusDescription
TAS-126🚧 In ProgressFoundations: repo structure, vision, CLI plugin design

TAS-126 Documents

DocumentDescription
README.mdTicket summary and deliverables
foundations.mdArchitectural deep-dive and design rationale
rails.mdRails-specific implementation plan
cli-plugin-architecture.mdCLI plugin system design

Architecture

The foundations document covers:

  • Design rationale (why separate repos, why Railtie over Engine)
  • Framework integration patterns (lifecycle, events, generators)
  • Configuration architecture (three-layer model)
  • Testing architecture (unit, integration, E2E)
  • Versioning strategy

Milestones

MilestoneStatusDescription
Foundations and CLI🚧 In ProgressTAS-126: Repo structure, vision, CLI plugin design
Rails📋 Plannedtasker-contrib-rails gem, generators, event bridge
Python📋 Plannedtasker-contrib-fastapi, pytest integration
TypeScript📋 Plannedtasker-contrib-bun, Bun.serve patterns

Framework Guides

Coming soon as packages are implemented

  • Rails Integration Guide
  • FastAPI Integration Guide
  • Bun Integration Guide
  • Axum Integration Guide

Operational Guides

Coming soon

  • Helm Chart Deployment
  • Terraform Infrastructure
  • Monitoring Setup
  • Production Checklist

Example Applications

Complete example applications demonstrating Tasker patterns.

Examples

ExampleDescription
e-commerce-workflow/Order processing with payment, inventory, and shipping
etl-pipeline/Data extraction, transformation, and loading workflow
approval-system/Multi-level approval with conditional routing

Purpose

These examples demonstrate:

  • Real-world workflow patterns
  • Multi-language handler implementations
  • Testing strategies
  • Deployment configurations

Status

📋 Planned

Approval System Example

Multi-level approval workflow demonstrating:

  • Decision handlers for routing
  • Convergence patterns (all approvals required)
  • Human-in-the-loop integration
  • Timeout and escalation

Status

📋 Planned

E-Commerce Workflow Example

Order processing workflow demonstrating:

  • Diamond dependency patterns (parallel payment + inventory check)
  • External API integration (payment gateway)
  • Conditional routing (shipping method selection)
  • Error handling and retries

Status

📋 Planned

ETL Pipeline Example

Data processing workflow demonstrating:

  • Batchable handlers for large datasets
  • Checkpoint/resume for long-running processes
  • Multiple data sources
  • Transformation chains

Status

📋 Planned

Engineering Stories

A progressive-disclosure blog series that teaches Tasker concepts through real-world scenarios. Each story builds on the previous, showing how a growing engineering team adopts workflow orchestration across all four supported languages.

These stories are being rewritten for the current Tasker architecture. See the archive for the original GitBook-era versions.

StoryThemeStatus
01: E-commerce CheckoutBasic workflows, error handlingPlanned
02: Data PipelineETL patterns, resiliencePlanned
03: MicroservicesService coordinationPlanned
04: Team ScalingNamespace isolationPlanned
05: ObservabilityOpenTelemetry + domain eventsPlanned
06: Batch ProcessingBatch step patternsPlanned
07: Conditional WorkflowsDecision handlersPlanned
08: Production DebuggingDLQ investigationPlanned