Tasker
Workflow orchestration that meets your code where it lives.
Tasker is an open-source workflow orchestration engine built on PostgreSQL and PGMQ. You define workflows as task templates with ordered steps, implement handlers in Rust, Ruby, Python, or TypeScript, and the engine handles execution, retries, circuit breaking, and observability.
Your existing business logic — API calls, database operations, service integrations — becomes a distributed, event-driven, retryable workflow with minimal ceremony. No complex abstractions to learn, no framework rewrites. Just thin handler wrappers around code you already have.
- Get Started — From zero to your first workflow. Install, write a handler, define a template, submit a task, and watch it run.
- Why Tasker — An honest look at where Tasker fits in the workflow orchestration landscape, and where established tools might be a better choice.
- Architecture — How Tasker works under the hood: actors, state machines, event systems, circuit breakers, and the PostgreSQL-native execution model.
- Configuration Reference — Complete reference for all 246 configuration parameters across orchestration, workers, and shared settings.
Choose Your Language
Tasker is polyglot from the ground up. The orchestration engine is Rust; workers can be any of four languages, all sharing the same core abstractions expressed idiomatically.
| Language | Package | Install | Registry |
|---|---|---|---|
| Rust | tasker-client / tasker-worker | cargo add tasker-client tasker-worker | crates.io |
| Ruby | tasker-rb | gem install tasker-rb | rubygems.org |
| Python | tasker-py | pip install tasker-py | pypi.org |
| TypeScript | @tasker-systems/tasker | npm install @tasker-systems/tasker | npmjs.com |
Each language guide covers installation, handler patterns, testing, and production considerations:
Rust · Ruby · Python · TypeScript
Explore the Documentation
For New Users
- Core Concepts — Tasks, steps, handlers, templates, and namespaces
- Choosing Your Package — Which package do you need?
- Quick Start — Running in 5 minutes
Architecture & Design
- Architecture Overview — System design and component interaction
- Design Principles — The tenets behind Tasker’s design decisions
- Architectural Decisions — ADRs documenting key technical choices
Operational Guides
- Handler Resolution — How Tasker finds and runs your handlers
- Retry Semantics — Retry strategies, backoff, and circuit breaking
- Batch Processing — Processing work in batches
- DLQ System — Dead letter queue for failed tasks
- Observability — Metrics, tracing, and logging
Reference
- Configuration Reference — All 246 parameters documented
- Worker API Convergence — Cross-language API alignment
- FFI Safety — How polyglot workers communicate safely
Framework Integrations
- Example Apps & Integrations — Rails, FastAPI, Axum, and Bun integrations with working example projects
Engineering Stories
A progressive-disclosure blog series teaching Tasker concepts through real-world scenarios. Each story follows an engineering team as they adopt workflow orchestration, with working code examples across all four languages.
| Story | What You’ll Learn |
|---|---|
| 01: E-commerce Checkout | Basic workflows, error handling, retry patterns |
| 02: Data Pipeline Resilience | ETL orchestration, resilience under failure |
| 03: Microservices Coordination | Cross-service workflows, distributed tracing |
| 04: Team Scaling | Namespace isolation, multi-team patterns |
| 05: Observability | OpenTelemetry integration, domain events |
| 06: Batch Processing | Batch step patterns, throughput optimization |
| 07: Conditional Workflows | Decision handlers, approval flows |
| 08: Production Debugging | DLQ investigation, diagnostics tooling |
Stories are being rewritten for the current Tasker architecture. View archive →
The Project
Tasker is open-source software (MIT license) built by an engineer who has spent years designing workflow systems at multiple organizations — and finally had the opportunity to build the one that was always in his head.
It’s not venture-backed. It’s not chasing a market. It’s a labor of love built for the engineering community.
Source Repositories
| Repository | Description |
|---|---|
| tasker-core | Rust orchestration engine, polyglot workers, and CLI |
| tasker-contrib | Framework integrations and community packages |
| tasker-book | This documentation site |
Why Tasker
Last Updated: 2025-02-12 · Audience: Engineers evaluating workflow orchestration tools · Status: Early Release (0.1.x)
The Story
Tasker is a labor of love.
Over the years, I’ve built workflow systems at multiple organizations—each time encountering the same fundamental challenges: orchestrating complex, multi-step processes with proper dependency management, ensuring idempotency, handling retries gracefully, and doing all of this in a way that doesn’t require teams to rewrite their existing business logic.
Each time, I’d design parts of the solution I wished we could build—but the investment was never justifiable. General-purpose workflow infrastructure rarely makes sense for a single company to build from scratch when there are urgent product features to ship. So I’d compromise, cobble together something workable, and move on.
Tasker represents the opportunity to finally build that system properly—the one that’s been evolving in my head for years. Not as a venture-backed startup chasing a market, but as open-source software built by someone who genuinely cares about the problem space and wants to give back to the engineering community.
The Landscape
In full candor: Tasker is not solving an unsolved problem. The workflow orchestration space has mature, battle-tested options.
Apache Airflow
What it does well: Airflow is the industry standard for data pipeline orchestration. Born at Airbnb and now an Apache project with thousands of contributors, it excels at scheduled, batch-oriented workflows defined as Python DAGs. Its ecosystem of operators and integrations is unmatched—if you need to connect to a cloud service, there’s probably an Airflow provider for it.
When to choose it: You have scheduled ETL/ELT workloads, your team is Python-native, you need managed cloud options (AWS MWAA, Google Cloud Composer, Astronomer), and you value ecosystem breadth over ergonomic simplicity.
Honest comparison: Airflow’s 10+ years of production use across thousands of companies represents a level of battle-testing Tasker simply cannot match. If your primary use case is data pipeline orchestration with scheduled intervals, Airflow is likely the safer choice.
Temporal
What it does well: Temporal pioneered “durable execution”—workflows that automatically survive crashes, network failures, and infrastructure outages. It reconstructs application state transparently, letting developers write code as if failures don’t exist. The event history and replay capabilities are genuinely impressive.
When to choose it: You need long-running workflows (hours, days, or longer), your operations require true durability guarantees, you’re building microservice orchestration with complex saga patterns, or you need human-in-the-loop workflows with unbounded wait times.
Honest comparison: Temporal’s durable execution model is architecturally different from Tasker. If your workflows genuinely need to survive arbitrary failures mid-execution and resume from exact state, Temporal was purpose-built for this. Tasker provides resilience through retries and idempotent step execution, but doesn’t offer Temporal’s deterministic replay.
Prefect
What it does well: Prefect feels like “what if workflow orchestration were just Python decorators?” It emphasizes minimal boilerplate—add @flow and @task decorators to existing functions, and you have an orchestrated workflow. Prefect 3.0 embraces dynamic workflows with native Python control flow.
When to choose it: Your team is Python-native, you want the fastest path from script to production pipeline, you value simplicity and developer experience, or you’re doing ML/data science workflows where Jupyter-to-production is important.
Honest comparison: Prefect’s decorator-based ergonomics are genuinely excellent for Python-only teams. If your organization is homogeneously Python and you don’t need polyglot support, Prefect delivers a very clean experience.
Dagster
What it does well: Dagster introduced “software-defined assets” as first-class primitives—you define what data assets should exist and their dependencies, and the orchestrator figures out how to materialize them. This asset-centric model provides excellent lineage tracking and observability.
When to choose it: You’re building a data platform where understanding asset lineage is critical, you want a declarative approach focused on data products rather than task sequences, or you need strong dbt integration and data quality built into your orchestration layer.
Honest comparison: Dagster’s asset-centric philosophy is a genuinely different way of thinking about orchestration. If your mental model centers on “what data assets need to exist” rather than “what steps need to execute,” Dagster may be a better conceptual fit.
So Why Tasker?
Given this landscape, why build another workflow orchestrator?
Philosophy: Meet Teams Where They Are
Most workflow tools require you to think in their terms. Define your work as DAGs using their DSL. Adopt their scheduling model. Often, rewrite your business logic to fit their execution model.
Tasker takes a different approach: bring workflow orchestration to your existing code, rather than bringing your code to a workflow framework.
If your codebase already has reasonable SOLID characteristics—services with clear responsibilities, well-defined interfaces, operations that can be made idempotent—Tasker aims to turn that code into distributed, event-driven, retryable workflows with minimal ceremony.
This philosophy manifests in several ways:
Polyglot from the ground up. Tasker’s orchestration engine is written in Rust, but workers can be written in Ruby, Python, TypeScript, or native Rust. Each language implementation shares the same core abstractions—same handler signatures, same result factories, same patterns—expressed idiomatically for each language. This isn’t an afterthought; cross-language consistency is a core design principle.
Minimal migration burden. Your existing business logic—API calls, database operations, external service integrations—can become step handlers with thin wrappers. You don’t need to restructure your application around the orchestrator.
Framework-agnostic core. Tasker Core provides the fundamentals without framework opinions. Tasker Contrib then provides framework-specific integrations (Rails, FastAPI, Bun) for teams who want batteries-included ergonomics. Progressive disclosure: learn the core concepts first, add framework sugar when needed.
Architecture: Event-Driven with Resilience Built In
Tasker’s architecture reflects lessons learned from building distributed systems:
PostgreSQL-native by default. Everything flows through Postgres—task state, step queues (via PGMQ), event coordination (via LISTEN/NOTIFY). This isn’t because Postgres is trendy; it’s because many teams already have Postgres expertise and operational knowledge. Tasker works as a single-dependency system on PostgreSQL alone. For environments requiring higher throughput or existing RabbitMQ infrastructure, Tasker also supports RabbitMQ as an alternative messaging backend—switch with a configuration change.
Event-driven with polling fallback. Real-time responsiveness through Postgres notifications, but with polling as a reliability backstop. Events can be missed; polling ensures eventual consistency.
Defense in depth. Multiple overlapping protection layers provide robust idempotency without single-point dependency. Database-level atomicity, state machine guards, transaction boundaries, and application-level filtering each catch what others might miss.
Composition over inheritance. Handler capabilities are composed via mixins/traits, not class hierarchies. This enables selective capability inclusion, clear separation of concerns, and easier testing.
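The event-driven-with-polling-fallback idea can be sketched generically. This is an illustrative pattern only, not Tasker's implementation: a `queue.Queue` stands in for the notification channel, and `poll_fn` stands in for a catch-up query against the step queue.

```python
import queue

def consume(events, poll_fn, poll_interval=1.0, iterations=3):
    """Prefer push notifications, but poll as a reliability backstop."""
    processed = []
    for _ in range(iterations):
        try:
            # Fast path: a notification arrives and we react immediately.
            processed.append(events.get(timeout=poll_interval))
        except queue.Empty:
            # Backstop: no notification within the window, so poll explicitly;
            # missed events still reach eventual consistency this way.
            processed.extend(poll_fn())
    return processed
```

If a notification is lost, the polling pass picks the work up on the next interval, which is the eventual-consistency guarantee described above.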
Performance: Fast by Default
Tasker is built in Rust not for marketing purposes, but because workflow orchestration has real performance implications. When you’re coordinating thousands of steps across distributed workers, overhead matters.
- Complex 7-step DAG workflows complete in under 133ms with push-based notifications
- Concurrent execution via work-stealing thread pools
- Lock-free channel-based internal coordination
- Zero-copy where possible in the FFI boundaries
The Honest Assessment
Tasker excels when:
- You need polyglot worker support across Ruby, Python, TypeScript, and Rust
- Your team already has Postgres expertise and wants to avoid additional infrastructure
- You want to bring orchestration to existing business logic rather than rewriting
- You value clean, consistent APIs across languages
- Performance matters and you’re willing to trade ecosystem breadth for it
Tasker may not be the right choice when:
- You need the battle-tested maturity and ecosystem of Airflow
- Your workflows require Temporal-style durable execution with deterministic replay
- You’re an all-Python team and Prefect’s ergonomics fit perfectly
- You’re building a data platform where asset-centric thinking (Dagster) is the right model
- You need managed cloud offerings with SLAs and enterprise support
What Tasker Is (and Isn’t)
Tasker Is
- A workflow orchestration engine for step-based DAG execution with complex dependencies
- PostgreSQL-native with flexible messaging using PGMQ (default) or RabbitMQ
- Polyglot by design with first-class support for multiple languages
- Focused on developer experience for teams who want minimal intrusion
- Open source (MIT license) and built as a labor of love
Tasker Is Not
- A data orchestration platform like Dagster with asset lineage and data quality primitives
- A durable execution engine like Temporal with deterministic replay and unlimited durability
- A scheduled job runner for simple cron-style workloads (use actual cron)
- A message bus for pure pub/sub fan-out (use Kafka or dedicated brokers)
- Enterprise software with commercial support, SLAs, or managed offerings
Current State
Tasker is in early release (0.1.x). This is important context:
What this means:
- The architecture has solidified but breaking changes may still occur
- Documentation is comprehensive and continuously improving
- Early adopters are beginning to explore the system
- We follow semantic versioning and aim to communicate breaking changes clearly
Our commitment:
- Intentional about breaking changes—we weigh architectural correctness against user impact
- Migration guidance provided where practical
- Release notes document all significant changes
- Responsive to community feedback
If you’re evaluating Tasker, we encourage you to explore it, provide feedback, and help shape its direction. For production-critical workloads, evaluate whether the current stability level meets your needs, or consider the established tools above.
The Path Forward
Tasker is being built with care, not speed. The goal isn’t to capture market share or compete with well-funded companies. The goal is to create something genuinely useful—a workflow orchestration system that respects developers’ time and meets them where they are.
The codebase is open, the design decisions are documented, and contributions are welcome. This is software built by an engineer for engineers, not a product chasing metrics.
If that resonates with you, welcome. Let’s build something good together.
Related Documentation
- Tasker Core Tenets - The 10 foundational design principles
- Use Cases & Patterns - When and how to use Tasker
- Quick Start Guide - Get running in 5 minutes
- CHRONOLOGY - Development timeline and lessons learned
Contributing to Tasker
This guide covers development setup and workflow for contributing to tasker-core and tasker-contrib.
Contributing to tasker-core
Prerequisites
- Rust stable toolchain (via rustup)
- Docker Desktop (for PostgreSQL, RabbitMQ, and supporting services)
- cargo-make (`cargo install cargo-make`)
- Optional: Ruby 3.4+, Python 3.12+, Bun 1.x (for FFI worker development)
On macOS, the project includes a Brewfile that installs system dependencies (PostgreSQL 18, protobuf, LLVM, language toolchains).
Automated Setup
The fastest path is the automated setup script:
git clone https://github.com/tasker-systems/tasker-core.git
cd tasker-core
# Full setup — installs Homebrew deps, Rust tools, git hooks, worker deps
./bin/setup-dev.sh
# Or run targeted slices
./bin/setup-dev.sh --brew-only # Homebrew bundle only
./bin/setup-dev.sh --cargo-only # cargo-make, sqlx-cli, nextest
./bin/setup-dev.sh --hooks-only # Git pre-commit hook
./bin/setup-dev.sh --check # Audit what's installed
Starting Services
# Start PostgreSQL (with PGMQ), RabbitMQ, Dragonfly cache, Grafana LGTM
cargo make docker-up
Database Setup
cargo make db-setup # Run migrations
cargo make db-reset # Drop and recreate (clean slate)
Environment Configuration
Environment variables are assembled from modular files in config/dotenv/:
cargo make setup-env # Standard test mode
cargo make setup-env-split # Split database mode
cargo make setup-env-cluster # Cluster testing mode
See config/dotenv/README.md for the full file structure and assembly order.
Key Commands
| Command | Shortcut | What it does |
|---|---|---|
| `cargo make check` | `c` | Lint + format + build |
| `cargo make test` | `t` | All tests (requires services) |
| `cargo make fix` | `f` | Auto-fix issues |
| `cargo make build` | `b` | Build everything |
| `cargo make test-rust-unit` | `tu` | Unit tests (DB + messaging only) |
| `cargo make test-rust-e2e` | `te` | E2E tests (requires services) |
| `cargo make test-rust-cluster` | `tc` | Cluster tests (multi-instance) |
Always use --all-features when running cargo commands directly.
Test Tiers
Tests are organized into infrastructure levels via feature flags:
| Level | Feature flag | Requires |
|---|---|---|
| Unit | test-messaging | PostgreSQL + messaging |
| E2E | test-services | + running services |
| Cluster | test-cluster | + multi-instance cluster |
Run cargo make test-rust-unit for fast iteration. Run the full suite with cargo make test before opening a PR.
Worker Development
Tasker supports polyglot workers through FFI:
- Ruby via magnus — `workers/ruby/`
- Python via PyO3 — `workers/python/`
- TypeScript via napi-rs — `workers/typescript/`
Each worker directory has its own build and test commands. See the Worker Guides for language-specific details.
SQLx Query Cache
After modifying sqlx::query! macros or SQL schema, update the offline cache:
DATABASE_URL=postgresql://tasker:tasker@localhost:5432/tasker_rust_test \
cargo sqlx prepare --workspace -- --all-targets --all-features
git add .sqlx/
Conventions
- Branch naming: `username/ticket-id-short-description` (e.g., `jcoletaylor/tas-190-add-version-fields`)
- Commit messages: `type(scope): description` (e.g., `fix(orchestration): handle timeout in step enqueuer`)
- Git hooks: Pre-commit runs `cargo fmt` on staged Rust files. Install with `git config core.hooksPath .githooks` or via `setup-dev.sh`.
- Lint suppression: Use `#[expect(lint_name, reason = "...")]` instead of `#[allow]`
- SQLx: Never use `SQLX_OFFLINE=true` — always export `DATABASE_URL`
- Public types: Must implement `Debug`
- Channels: All MPSC channels must be bounded and configured via TOML
Pull Request Process
- Branch from `main`
- Make focused changes (one logical change per PR)
- Run `cargo make check && cargo make test`
- Update documentation if your change affects public APIs or behavior
- Open a PR against `main`
New functionality should include tests. Bug fixes should include a regression test.
Contributing to tasker-contrib
Prerequisites
- cargo-make (`cargo install cargo-make`)
- tasker-ctl binary (build from core or `cargo install tasker-ctl`)
- Docker and Docker Compose (for example app testing)
- Language toolchain for the area you’re working on (Ruby 3.3+, Python 3.12+, Bun 1.0+, or Rust stable)
Getting tasker-ctl
# Option A: Install from crates.io
cargo install tasker-ctl
# Option B: Build from local tasker-core (requires sibling checkout)
cargo make build-ctl
Adding a New Template
Each language plugin lives in {language}/tasker-cli-plugin/ and follows this structure:
{language}/tasker-cli-plugin/
+-- tasker-plugin.toml # Plugin manifest
+-- templates/
+-- step_handler/ # Template directory
| +-- template.toml # Metadata and variables
| +-- files/ # Tera template files
+-- step_handler_api/
+-- task_template/
Steps to add a template:
- Create the template directory under the appropriate plugin
- Add `template.toml` with metadata and variable definitions
- Add template files in `files/` using Tera syntax (built-in helpers: `snake_case`, `pascal_case`)
- Register the template in `tasker-plugin.toml`
- Run `cargo make test-templates` to verify generation and syntax checking
Adding a New Example App
- Create `examples/{framework}-app/` with standard project structure
- Add the app database to `examples/init-db.sql`
- Add a `cargo make test-example-{framework}` task to `Makefile.toml`
- Add the app to the `test-examples` dependencies in `Makefile.toml`
- Add CI steps to `.github/workflows/test-examples.yml`
Validation Commands
| Command | Shortcut | What it does |
|---|---|---|
| `cargo make validate` | `v` | Validate all plugin manifests |
| `cargo make test-templates` | `tt` | Generate + syntax-check all templates |
| `cargo make test-all` | `ta` | Full validation (validate + test-templates) |
| `cargo make test-examples` | `te` | Integration tests for all example apps |
Example App Infrastructure
The example apps share a Docker Compose stack in examples/:
cd examples
docker compose up -d
This starts PostgreSQL 18 (with PGMQ + app databases), Tasker orchestration (from published GHCR images), Dragonfly cache, and RabbitMQ. Wait for orchestration to be healthy before running tests:
curl -sf http://localhost:8080/health
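A small readiness loop saves ad-hoc sleeps in test scripts. This is a sketch: the probe is injected, so you can point it at the `/health` endpoint (e.g. an `urllib.request.urlopen` call checking for a 200) or anything else.

```python
import time

def wait_until_healthy(probe, timeout=30.0, interval=1.0):
    """Poll `probe()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except OSError:
            pass  # service not accepting connections yet
        time.sleep(interval)
    return False
```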
Getting Help
- Discussions for questions
- Issues for bugs or feature requests
- Both projects follow the Contributor Covenant 3.0 code of conduct
Getting Started
Tasker is a distributed workflow orchestration system that coordinates complex, multi-step processes across services and languages. It provides:
- Task Orchestration — Define workflows as directed acyclic graphs (DAGs) with dependency management
- Multi-Language Support — Write handlers in Rust, Ruby, Python, or TypeScript
- Built-in Resilience — Automatic retries, error handling, and state persistence
- Event-Driven Architecture — PGMQ and RabbitMQ messaging for real-time observability
How Tasker Works
┌─────────────────────────────────────────────────────────────────────┐
│ tasker-core (Rust) │
│ • REST + gRPC API for task submission │
│ • Workflow orchestration via lifecycle actors │
│ • Step execution and DAG dependency resolution │
│ • PostgreSQL state persistence │
│ • Event publishing (PGMQ default, RabbitMQ optional) │
└─────────────────────────────────────────────────────────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Ruby │ │ Python │ │TypeScript│
│ Workers │ │ Workers │ │ Workers │
└──────────┘ └──────────┘ └──────────┘
You define task templates (YAML DAGs) that describe what steps to run and their dependencies. You write step handlers in your preferred language that contain the business logic. Tasker’s orchestration engine executes the DAG — resolving dependencies, running independent steps in parallel, retrying failures, and persisting state.
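The dependency-resolution loop described above can be sketched in a few lines of Python. The step names below are hypothetical, and the real engine does this work in SQL and Rust; this is only a conceptual illustration.

```python
def ready_steps(dependencies, completed):
    """Steps whose upstream dependencies are all complete and that haven't run yet.

    `dependencies` maps step name -> list of upstream step names, mirroring
    the `dependencies:` lists in a task template.
    """
    return {
        step
        for step, deps in dependencies.items()
        if step not in completed and all(d in completed for d in deps)
    }

# Hypothetical three-step DAG: one validation step gates two parallel steps.
dag = {
    "validate_order": [],
    "reserve_inventory": ["validate_order"],
    "charge_payment": ["validate_order"],
}
```

Once `validate_order` completes, both downstream steps become ready at the same time, which is what allows independent steps to run in parallel.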
Understand Tasker
| Page | What you’ll learn |
|---|---|
| Core Concepts | Tasks, steps, handlers, templates, dependencies, and lifecycle states |
| Handler Types | The four handler types (Step, API, Decision, Batchable) and when to use each |
| Choosing Your Package | Which language SDK fits your project |
Ready to Build?
Once you understand the concepts, head to Build Your First Project to set up your environment and write your first workflow.
If you prefer learning by example, the Quick Start gets you running a working app in 5 minutes.
Core Concepts
This page explains the fundamental building blocks of Tasker.
Tasks
A Task is a unit of work submitted to Tasker for execution. Tasks have:
- A task template that defines the workflow structure
- An initiator identifying the source (e.g., `user:123`, `system:scheduler`)
- A context containing input data and metadata
- A state managed by a 12-state machine (see below)
{
"name": "order_fulfillment",
"initiator": "api:checkout",
"context": {
"order_id": "ORD-12345",
"customer_email": "customer@example.com"
}
}
Task Templates
A Task Template is a YAML definition of a workflow. It specifies:
- Steps to execute
- Dependencies between steps (creating a DAG)
- Handler mappings connecting steps to your code
name: order_fulfillment
namespace_name: ecommerce
version: 1.0.0
steps:
- name: validate_order
handler:
callable: OrderValidationHandler
dependencies: []
- name: reserve_inventory
handler:
callable: InventoryHandler
dependencies:
- validate_order
- name: charge_payment
handler:
callable: PaymentHandler
dependencies:
- validate_order
Steps
A Step is a single operation within a workflow. Steps:
- Execute independently once dependencies are satisfied
- Can run in parallel when they have no mutual dependencies
- Return results that downstream steps can access
- Can be retried on failure
Task Lifecycle
Tasks progress through a multi-phase lifecycle managed by the orchestration actors:
Pending → Initializing → EnqueuingSteps → StepsInProcess → EvaluatingResults → Complete
The evaluating phase may loop back to enqueue more steps as dependencies are satisfied, wait for retries, or transition to terminal states (Complete, Error, Cancelled, ResolvedManually). Tasks support cancellation from any non-terminal state and manual resolution from BlockedByFailures.
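To make the guard-rail role of the state machine concrete, here is a minimal transition table covering the happy path named above. The transition sets are inferred from the prose and are illustrative only, not the engine's actual 12-state table.

```python
# Illustrative subset of the task lifecycle; cancellation is allowed from
# any non-terminal state, and evaluation can loop back to enqueue more steps.
TRANSITIONS = {
    "Pending": {"Initializing", "Cancelled"},
    "Initializing": {"EnqueuingSteps", "Error", "Cancelled"},
    "EnqueuingSteps": {"StepsInProcess", "Cancelled"},
    "StepsInProcess": {"EvaluatingResults", "Cancelled"},
    "EvaluatingResults": {"EnqueuingSteps", "Complete", "Error", "Cancelled"},
    "Complete": set(),  # terminal
}

def transition(state, target):
    """Advance only along declared edges; anything else is rejected."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```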
Step Lifecycle
Steps follow a worker-to-orchestration handoff pattern through 10 states:
Pending → Enqueued → InProgress → EnqueuedForOrchestration → Complete
After a worker executes a step, the result is enqueued back to orchestration for processing. Steps can also transition through WaitingForRetry for automatic retry with backoff, or be cancelled, failed, or manually resolved.
For the full state machine diagrams and transition tables, see States and Lifecycles.
Step Handlers
A Step Handler is your code that executes a step’s business logic. The DSL approach declares what a handler receives — its inputs from the task context and results from upstream steps — and delegates to a service:
from tasker_core.step_handler.functional import inputs, step_handler
from app.services.types import EcommerceOrderInput
from app.services import ecommerce as svc
@step_handler("validate_cart")
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context: StepContext):
return svc.validate_cart_items(inputs.resolved_items)
The @inputs decorator extracts fields from the task context into a typed Pydantic model. The @depends_on decorator (shown below) does the same for upstream step results. Your handler function receives typed arguments instead of parsing raw JSON.
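To make the mechanics concrete, here is a simplified sketch of what an `@inputs`-style decorator does: pull the declared fields out of the raw context and hand the function a typed object. This uses a stdlib dataclass in place of Pydantic and hypothetical field names; it is not tasker_core's implementation.

```python
import functools
from dataclasses import dataclass, fields

@dataclass
class OrderInput:
    order_id: str
    customer_email: str

def inputs(model):
    """Simplified @inputs: build `model` from the task-context dict."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(context):
            names = {f.name for f in fields(model)}
            typed = model(**{k: v for k, v in context.items() if k in names})
            return fn(typed, context)
        return wrapper
    return decorator

@inputs(OrderInput)
def validate_order(inp, context):
    # The handler sees a typed object, not raw JSON.
    return {"order_id": inp.order_id, "valid": "@" in inp.customer_email}
```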
Class-based handlers (`class MyHandler(StepHandler)`) are also supported. See Class-Based Handlers.
Dependency Results
Steps can access typed results from their dependencies using @depends_on:
from tasker_core.step_handler.functional import depends_on, inputs, step_handler
from app.services.types import EcommerceOrderInput, EcommerceValidateCartResult
@step_handler("process_payment")
@depends_on(cart_result=("validate_cart", EcommerceValidateCartResult))
@inputs(EcommerceOrderInput)
def process_payment(
cart_result: EcommerceValidateCartResult,
inputs: EcommerceOrderInput,
context: StepContext,
):
return svc.process_payment(
payment_token=inputs.payment_token,
total=cart_result.total or 0.0,
)
Each @depends_on entry maps a parameter name to a ("step_name", ResultModel) tuple. Tasker resolves the upstream step’s result and deserializes it into the model, so your handler receives a typed object — not a raw dict.
Workflow Steps
A Workflow Step is a special step that delegates to another task template, creating a sub-workflow:
steps:
- name: process_line_items
handler:
callable: WorkflowHandler
dependencies:
- validate_order
Error Handling
Tasker distinguishes between error types:
| Error Type | Behavior |
|---|---|
| `PermanentError` | No retry; step fails immediately |
| `RetryableError` | Automatically retried with backoff |
from tasker_core.errors import PermanentError, RetryableError
def call(self, context):
    # `invalid_input` and `service_unavailable` stand in for your own checks
    if invalid_input:
        raise PermanentError(message="Invalid order ID format", error_code="INVALID_ID")
    if service_unavailable:
        raise RetryableError(message="Payment gateway timeout", error_code="GATEWAY_TIMEOUT")
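The retry schedule for a `RetryableError` follows the familiar capped-exponential-backoff shape. The formula below is the conventional pattern, not necessarily Tasker's exact schedule; production systems typically add jitter as well.

```python
def backoff_delay(attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))
```

With the defaults, attempts 0 through 5 wait 1, 2, 4, 8, 16, and 32 seconds, and later attempts cap at 60.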
Next Steps
- Handler Types — The four handler types and when to use each
- Your First Handler — Write your first step handler
- Your First Workflow — Create a complete workflow
Handler Types
Tasker provides four handler types that cover the most common workflow patterns. The DSL approach lets you declare what a handler receives — typed inputs and dependency results — while your business logic stays in your service layer.
Cross-Language Availability
| Handler Type | Python | Ruby | TypeScript | Rust |
|---|---|---|---|---|
| Step Handler | Yes | Yes | Yes | Yes |
| API Handler | Yes | Yes | Yes | – |
| Decision Handler | Yes | Yes | Yes | – |
| Batchable Handler | Yes | Yes | Yes | – |
Rust provides only the base Step Handler trait, composing capability traits instead. See Rust’s Handler Architecture below.
Step Handler (DSL)
The base handler type. All other types extend it.
When to use: General-purpose business logic — database operations, calculations, transformations, service calls, or anything that takes input and produces output.
Python
from tasker_core.step_handler.functional import inputs, step_handler
from app.services.types import EcommerceOrderInput
from app.services import ecommerce as svc
@step_handler("validate_cart")
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context: StepContext):
return svc.validate_cart_items(inputs.resolved_items)
Ruby
extend TaskerCore::StepHandler::Functional
ValidateCartHandler = step_handler(
'Ecommerce::StepHandlers::ValidateCartHandler',
inputs: Types::Ecommerce::OrderInput
) do |inputs:, context:|
Ecommerce::Service.validate_cart_items(cart_items: inputs.cart_items)
end
TypeScript
import { defineHandler } from '@tasker-systems/tasker';
import * as svc from '../services/ecommerce';
export const ValidateCartHandler = defineHandler(
'Ecommerce.StepHandlers.ValidateCartHandler',
{ inputs: { cartItems: 'cart_items' } },
async ({ cartItems }) => svc.validateCartItems(cartItems as CartItem[]),
);
For the class-based alternative, see Class-Based Handlers.
Generate with tasker-ctl:
tasker-ctl template generate step_handler \
--plugin tasker-contrib-python \
--param name=ProcessPayment
Available for all four languages: tasker-contrib-rails, tasker-contrib-python, tasker-contrib-typescript, tasker-contrib-rust.
See it in action: All five workflows in the example apps use step handlers. Start with the e-commerce checkout (Post 01) for the simplest example.
Next: Your First Handler walks through writing and registering a step handler end-to-end.
What the DSL Composes
The DSL builds a typed method signature from two sources:
| Decorator / Config | Source | What it provides |
|---|---|---|
| `@inputs(Model)` / `inputs:` | Task context (submitted data) | Typed input fields |
| `@depends_on(name=("step", Model))` / `depends:` | Upstream step results | Typed dependency results |
Both are injected as function parameters. Your handler receives typed objects — not raw dicts or JSON — and delegates to a service function that contains the actual business logic.
Here’s a handler that uses both:
@step_handler("create_order")
@depends_on(
cart_result=("validate_cart", EcommerceValidateCartResult),
payment_result=("process_payment", EcommerceProcessPaymentResult),
inventory_result=("update_inventory", EcommerceUpdateInventoryResult),
)
@inputs(EcommerceOrderInput)
def create_order(
cart_result: EcommerceValidateCartResult,
payment_result: EcommerceProcessPaymentResult,
inventory_result: EcommerceUpdateInventoryResult,
inputs: EcommerceOrderInput,
context: StepContext,
):
return svc.create_order(
cart=cart_result, payment=payment_result,
inventory=inventory_result, customer_email=inputs.customer_email,
)
The handler declares what it needs; Tasker resolves how to get it.
Type System by Language
Each language uses its native type system for input and result models:
| | Python | Ruby | TypeScript |
|---|---|---|---|
| Library | Pydantic BaseModel | Dry::Struct | TypeScript interfaces |
| Validation | @model_validator | validate! method | Manual type guards |
| Optional fields | field: str \| None = None | attribute :field, Types::String.optional | field?: string |
| Field aliases | @property methods | Attribute readers | Getter functions |
| Error on invalid | Raises PermanentError | Raises PermanentError | Throws PermanentError |
Python (Pydantic BaseModel):
class EcommerceOrderInput(BaseModel):
items: list[dict[str, Any]] | None = None # submitted as "items"
cart_items: list[dict[str, Any]] | None = None # or "cart_items"
customer_email: str | None = None
payment_token: str | None = None
@property
def resolved_items(self) -> list[dict[str, Any]]:
"""Accept either field name from the task context."""
return self.items or self.cart_items or []
Ruby (Dry::Struct):
module Types
module Ecommerce
class OrderInput < Types::InputStruct
attribute :cart_items, Types::Array.of(Types::Hash).optional
attribute :customer_email, Types::String.optional
attribute :payment_token, Types::String.optional
end
end
end
TypeScript (interfaces):
interface CartItem {
sku: string;
name: string;
price: number;
quantity: number;
}
Specialized Handler Patterns
API Handler
Adds HTTP client methods with built-in error classification. The APIMixin provides self.get(), self.post(), etc. with automatic retryable/permanent error detection.
When to use: Calling external APIs where you need to distinguish retryable errors (5xx, timeouts) from permanent errors (4xx).
API handlers currently use the class-based pattern with mixin composition. See Class-Based Handlers — API Handler for the full pattern.
Decision Handler
Adds workflow routing methods. decision_success() activates downstream steps by name; skip_branches() signals that no downstream steps should execute.
When to use: Conditional branching — when the next steps depend on runtime data.
from tasker_core.step_handler.functional import decision_handler
@decision_handler("order_routing")
def order_routing(context: StepContext):
order_type = context.get_input("order_type")
if order_type == "premium":
return ["validate_premium", "process_premium"]
return ["standard_processing"]
See Conditional Workflows for decision handler patterns in depth.
Batchable Handler
Adds batch processing for splitting large workloads into parallel cursor-based batches.
When to use: Processing large datasets where you want to divide work across multiple parallel workers.
Workflow pattern: Analyzer → parallel Workers → optional Aggregator.
Batchable handlers currently use the class-based pattern due to their stateful nature (cursor management, batch context). See Class-Based Handlers — Batchable Handler for the full pattern, and Batch Processing for the production guide.
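The Analyzer's job can be pictured as partitioning a keyspace into cursor ranges that the Workers process in parallel. A minimal sketch of that partitioning, for illustration only — this is not the tasker-py batch API:

```python
def plan_batches(total_rows: int, batch_size: int) -> list[dict]:
    """Split a workload of total_rows into contiguous cursor ranges."""
    batches = []
    start = 0
    while start < total_rows:
        end = min(start + batch_size, total_rows)
        # Each dict describes one worker's slice: [cursor_start, cursor_end)
        batches.append({"cursor_start": start, "cursor_end": end})
        start = end
    return batches
```

An analyzer returning something like plan_batches(1_000_000, 10_000) would fan out 100 worker batches; the optional aggregator then combines their results.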
Task Templates
All handler types are wired together using YAML task template definitions. A task template defines the DAG — which steps to run, their dependencies, and which handlers to invoke.
name: order_processing
namespace: ecommerce
version: "1.0.0"
description: "Order processing workflow"
step_templates:
- name: validate_order
description: "Validate the incoming order"
handler:
callable: ValidateOrderHandler
initialization: {}
depends_on_step_name: []
retry:
max_attempts: 3
backoff_strategy: exponential
backoff_base_seconds: 2
Generate with tasker-ctl:
tasker-ctl template generate task_template \
--plugin tasker-contrib-python \
--param name=OrderProcessing \
--param namespace=ecommerce
Task templates are language-agnostic — the same YAML structure works across all four languages. The handler.callable field maps to the handler’s registered name or class path.
For a complete walkthrough of building a multi-step workflow with templates, see Your First Workflow.
Rust’s Handler Architecture
Rust provides RustStepHandler as its single handler trait — but this is not a limitation. The Rust worker crate defines capability traits in handler_capabilities.rs that Rust handlers compose directly:
| Capability Trait | What it provides |
|---|---|
| APICapable | HTTP client methods with retryable/permanent error classification |
| DecisionCapable | Workflow routing via step activation |
| BatchableCapable | Cursor-based parallel batch processing |
A Rust handler implements RustStepHandler and adds any capability traits it needs. This is idiomatic Rust — trait composition instead of class inheritance. For a complex example that combines multiple capabilities, see diamond_decision_batch.rs in the Rust worker crate.
In fact, the Rust batch_processing module is the foundation that Python, Ruby, and TypeScript access through FFI. The specialized handler types in those languages are ergonomic wrappers around the Rust implementation — Rust developers work with the underlying traits directly.
Choosing Your Package
Tasker supports multiple languages through FFI bindings. Each language package provides the same core capabilities with idiomatic APIs.
Language Guides
| Language | Package | Guide |
|---|---|---|
| Rust | tasker-worker + tasker-client | Rust Guide |
| Ruby | tasker-rb | Ruby Guide |
| Python | tasker-py | Python Guide |
| TypeScript | @tasker-systems/tasker | TypeScript Guide |
How to Choose
Use Rust
- Need maximum performance with zero-overhead abstractions
- Want compile-time type safety and memory safety guarantees
- Are building native Tasker extensions or the orchestration system itself
- Prefer direct API access without FFI overhead
Use Ruby
- Have an existing Rails or Ruby application
- Prefer Ruby’s expressive DSL capabilities
- Value rapid development with convention-over-configuration
- Want seamless integration with Ruby ecosystem gems
Use Python
- Are building data pipelines or ML workflows
- Want async/await support with asyncio, aiohttp, etc.
- Need integration with the Python data science ecosystem
- Prefer Python’s clean syntax and type hints
Use TypeScript
- Are building Node.js or Bun applications
- Want strong typing with TypeScript’s type system
- Need to integrate with existing JavaScript ecosystems
- Prefer modern async/await patterns
Package Architecture
All language packages share the same architecture:
┌─────────────────────────────────────────────────────────────┐
│ Your Application │
├─────────────────────────────────────────────────────────────┤
│ Language Package (tasker-rb, tasker-py, @tasker-systems/tasker) │
├─────────────────────────────────────────────────────────────┤
│ FFI Layer (Magnus/PyO3/NAPI) │
├─────────────────────────────────────────────────────────────┤
│ tasker-worker (Rust core) │
├─────────────────────────────────────────────────────────────┤
│ tasker-client (API client) │
└─────────────────────────────────────────────────────────────┘
This means:
- Same core logic — All packages use the same Rust implementation
- Same features — Handler registration, client SDK, event system
- Cross-language consistency — get_input(), get_dependency_result(), etc. work the same
Quick Installation
# Rust
cargo add tasker-worker tasker-client
# Ruby
gem install tasker-rb
# Python
pip install tasker-py
# TypeScript/JavaScript
npm install @tasker-systems/tasker
See the individual language guides for detailed setup and examples.
Example Apps & Framework Integrations
Tasker Contrib provides two things for each supported language:
- CLI plugin templates — code generators for tasker-ctl that scaffold handlers, task definitions, and infrastructure configuration
- Example applications — fully working apps that demonstrate real-world workflow patterns against published Tasker packages
The Integration Pattern
All four framework integrations follow the same pattern:
- Bootstrap — tasker-ctl init creates project structure with infrastructure config
- Create — tasker-ctl template generate scaffolds handlers and task templates
- Process — Your handlers receive StepContext, execute business logic, return StepHandlerResult
- Query — Use the Tasker client SDK to submit tasks and check status
The framework integration layer is intentionally thin: it translates your framework’s idioms (Rails generators, FastAPI dependency injection, Bun middleware) into Tasker concepts without inventing new abstractions.
Available Integrations
| Framework | Language | SDK Package | CLI Plugin |
|---|---|---|---|
| Rails | Ruby | tasker-rb | tasker-contrib-rails |
| FastAPI | Python | tasker-py | tasker-contrib-python |
| Hono/Bun | TypeScript | @tasker-systems/tasker | tasker-contrib-typescript |
| Axum | Rust | tasker-worker | tasker-contrib-rust |
Each plugin provides templates for all handler types: step, API, decision, and batchable (Rust provides step handler only).
The Apps
| App | Framework | SDK Package | Source |
|---|---|---|---|
| rails-app | Rails 8 | tasker-rb | GitHub |
| fastapi-app | FastAPI | tasker-py | GitHub |
| bun-app | Hono/Bun | @tasker-systems/tasker | GitHub |
| axum-app | Axum | tasker-worker | GitHub |
Workflow Patterns
All four apps implement the same five workflow patterns, progressing from simple to complex:
| Pattern | Workflow | Handler Types Used | Story |
|---|---|---|---|
| Linear pipeline | E-commerce Order Processing | Step | Post 01 |
| Parallel DAG | Data Pipeline Analytics | Step | Post 02 |
| Diamond convergence | Microservices User Registration | Step | Post 03 |
| Namespace isolation | Customer Success + Payments Refund | Step | Post 04 |
| Cross-team coordination | Payments Compliance | Step | Post 04 |
Each workflow demonstrates a specific DAG pattern. The Engineering Stories series teaches these patterns through progressive narrative — start with Post 01 and work forward.
Shared Infrastructure
All example apps share a single docker-compose.yml that provides:
- PostgreSQL with PGMQ extensions — state persistence and default message queue
- Tasker orchestration engine — published GHCR image, handles DAG execution
- RabbitMQ — optional message broker for event publishing
- Dragonfly — Redis-compatible cache
cd examples
docker compose up -d
# Wait for orchestration to be healthy
curl -sf http://localhost:8080/health
Each app gets its own database (example_rails, example_fastapi, example_bun, example_axum) created by the shared init-db.sql script.
Running an Example
Python (FastAPI)
cd examples/fastapi-app
uv sync
uv run uvicorn app.main:app --port 8083
Ruby (Rails)
cd examples/rails-app
bundle install
bin/rails server -p 8082
TypeScript (Bun/Hono)
cd examples/bun-app
bun install
bun run dev
Rust (Axum)
cd examples/axum-app
cargo run
What to Study
Each app demonstrates the same concepts in its framework’s idioms. Comparing across languages is the fastest way to understand Tasker’s cross-language handler contract:
- Handler registration — How each framework discovers and registers step handlers
- Context access — get_input(), get_dependency_result(), and step configuration
- Error handling — PermanentError vs RetryableError patterns
- Task templates — Identical YAML DAG definitions across all four apps
- Testing — Each app has integration tests that submit tasks and verify results
Getting Started
- Quick Start — Clone an example app and run it in 5 minutes
- Using tasker-ctl — Bootstrap a project with the CLI tool
- Choosing Your Package — Which language SDK fits your project
Source Repository
All example apps live in the tasker-systems/tasker-contrib repository under examples/.
Build Your First Project
This section walks you from zero to a running Tasker workflow. Choose your path based on how you like to learn:
Quick Start
Quick Start — Two ways to get running fast:
- Path A: Clone an example app (5 min) — Docker Compose up, pick a framework, curl an endpoint
- Path B: Bootstrap with tasker-ctl (10 min) — Initialize a project, generate handlers, configure, and run
Step-by-Step Guides
If you prefer a guided walkthrough:
- Installation — Install language packages and run Tasker infrastructure
- Using tasker-ctl — Initialize projects, generate handlers from templates, manage configuration
- Your First Handler — Write a step handler in your preferred language
- Your First Workflow — Define a task template, submit a task, watch it execute
Language Guides
Comprehensive setup and API reference for each supported language:
- Ruby — Ruby workers with tasker-rb
- Python — Python workers with tasker-py
- TypeScript — TypeScript workers with @tasker-systems/tasker
- Rust — Native Rust workers with tasker-worker
What’s Next
After building your first project, see Next Steps for where to go from here — example apps, engineering stories, architecture deep-dives, and production operations.
Quick Start
Two paths to a running Tasker workflow. Pick the one that fits your style.
Path A: Clone an Example App (5 minutes)
The fastest way to see Tasker in action. The example apps provide fully working projects in all four languages, running against published packages via Docker Compose.
Prerequisites
- Docker and Docker Compose
- curl (or any HTTP client)
1. Clone tasker-contrib
git clone https://github.com/tasker-systems/tasker-contrib.git
cd tasker-contrib/examples
2. Start the infrastructure
docker compose up -d
This starts PostgreSQL (with PGMQ), the Tasker orchestration engine, RabbitMQ, and Dragonfly (cache). All services use published GHCR images — no local builds needed.
Apple Silicon: The compose file includes platform: linux/amd64 for Tasker images. Ensure “Use Rosetta” is enabled in Docker Desktop.
3. Wait for orchestration to be healthy
# Retry until healthy (up to 60 seconds on first pull)
until curl -sf http://localhost:8080/health > /dev/null; do
echo "Waiting for orchestration..."
sleep 5
done
echo "Orchestration is healthy"
4. Pick a framework and run it
Each app has its own setup instructions. For example, with Ruby (Rails):
cd rails-app
bundle install
bin/rails db:create db:migrate
bin/rails server -p 3000
Or with Python (FastAPI):
cd fastapi-app
uv sync
uv run alembic upgrade head
uv run uvicorn app.main:app --port 8000
5. Submit a task
# Submit an e-commerce order processing task
curl -X POST http://localhost:8080/api/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"name": "ecommerce_order_processing",
"namespace": "ecommerce_rb",
"version": "1.0.0",
"initiator": "quickstart",
"source_system": "cli",
"reason": "Quick-start verification",
"context": {
"cart_items": [
{"sku": "WIDGET-001", "name": "Widget", "quantity": 2, "unit_price": 29.99}
],
"customer_email": "test@example.com"
}
}'
The orchestration engine coordinates the workflow — validating the cart, processing payment, reserving inventory, creating the order, and sending confirmation — with each step handled by your chosen framework’s app.
What you just ran
Each example app implements four real-world workflow patterns:
| Pattern | Workflow | Story |
|---|---|---|
| Linear pipeline | E-commerce Order Processing | Post 01 |
| Parallel DAG | Data Pipeline Analytics | Post 02 |
| Diamond convergence | Microservices User Registration | Post 03 |
| Namespace isolation | Team Scaling (Customer Success + Payments) | Post 04 |
See the Example Apps page for full details, or read the Engineering Stories for narrative walkthroughs.
Path B: Bootstrap with tasker-ctl (10 minutes)
Build a project from scratch using Tasker’s CLI tool. This path generates handler scaffolding, task templates, and infrastructure configuration — everything you need to start writing business logic.
Prerequisites
- Docker and Docker Compose (for Tasker infrastructure)
- Rust toolchain (for installing tasker-ctl)
- Your preferred language runtime (Python, Ruby, Bun/Node, or Rust)
1. Install and initialize
cargo install tasker-ctl
mkdir my-tasker-project && cd my-tasker-project
tasker-ctl init
tasker-ctl remote update
This creates a .tasker-ctl.toml configured with the tasker-contrib remote, then fetches the community templates to your local cache.
2. See what’s available
tasker-ctl template list
Templates are organized by language and type:
| Template | Languages | Description |
|---|---|---|
| step_handler | Ruby, Python, TypeScript, Rust | Basic step handler with test |
| step_handler_api | Ruby, Python, TypeScript | HTTP API handler with client |
| step_handler_decision | Ruby, Python, TypeScript | Decision/routing handler |
| step_handler_batchable | Ruby, Python, TypeScript | Parallel batch processing handler |
| task_template | All | Task definition YAML with step DAG |
| docker_compose | Ops | Docker Compose stack for Tasker services |
| config | Ops | TOML configuration files |
Filter by language to see just your stack:
tasker-ctl template list --language python
3. Generate a handler
tasker-ctl template generate step_handler \
--language python \
--param name=ProcessOrder
This generates process_order_handler.py and tests/test_process_order_handler.py using the DSL pattern — a decorated function with typed inputs that delegates to a service. See Your First Handler for a walkthrough of the generated code.
4. Generate a task template
tasker-ctl template generate task_template \
--language python \
--param name=OrderProcessing \
--param namespace=default \
--param handler_callable=handlers.process_order_handler.ProcessOrderHandler
This generates order_processing.yaml — a task definition with one step wired to your handler. Edit it to add more steps and build a DAG.
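As a sketch of what that editing might look like, here is the generated single-step template extended with a second, dependent step — the send_confirmation step and its handler path are illustrative, but the schema follows the task template format shown above:

```yaml
name: order_processing
namespace: default
version: "1.0.0"
step_templates:
  - name: process_order
    handler:
      callable: handlers.process_order_handler.ProcessOrderHandler
    depends_on_step_name: []
  - name: send_confirmation
    handler:
      callable: handlers.send_confirmation_handler.SendConfirmationHandler
    depends_on_step_name:
      - process_order   # runs only after process_order succeeds
```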
5. Generate infrastructure
# Docker Compose stack (PostgreSQL + orchestration, optionally RabbitMQ + Dragonfly)
tasker-ctl template generate docker_compose \
--plugin tasker-contrib-ops \
--param name=myproject
# TOML configuration files (from tasker-contrib/config/tasker base configs)
tasker-ctl config generate --remote tasker-contrib \
--context orchestration --environment development --output config/orchestration.toml
tasker-ctl config generate --remote tasker-contrib \
--context worker --environment development --output config/worker.toml
6. Start infrastructure and submit
docker compose up -d
# Wait for health
until curl -sf http://localhost:8080/health > /dev/null; do sleep 5; done
# Submit a task
curl -X POST http://localhost:8080/api/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"name": "order_processing",
"namespace": "default",
"version": "1.0.0",
"initiator": "quickstart",
"source_system": "cli",
"reason": "First task",
"context": {"order_id": "ORD-001"}
}'
What you just built
Path B gives you project scaffolding — handler code, task template YAML, and infrastructure config. To wire a handler into a running worker, you’ll need to integrate it with a framework (Rails, FastAPI, Bun, or Axum) that starts a Tasker worker at boot. See the language guides for that next step, or study the example apps for complete working implementations.
Next Steps
- Your First Handler — Detailed walkthrough of handler anatomy
- Your First Workflow — Build a multi-step DAG with dependencies
- Using tasker-ctl — Full CLI reference for project scaffolding
Installation
This guide covers installing Tasker components for development.
Prerequisites
- Docker and Docker Compose V2 (for Tasker infrastructure)
- Rust toolchain (for installing tasker-ctl)
- A language runtime for your workers: Ruby 3.2+, Python 3.10+, Bun 1.0+ (or Node 18+), or Rust 1.75+
Install tasker-ctl
tasker-ctl is the CLI tool for scaffolding projects, generating handlers, and managing configuration:
cargo install tasker-ctl
Verify the installation:
tasker-ctl --version
# tasker-ctl 0.1.4
Apple Silicon note: The published Docker images on GHCR are currently x86_64 only. On Apple Silicon Macs, enable “Use Rosetta for x86_64/amd64 emulation” in Docker Desktop settings, or ensure your docker-compose.yml includes platform: linux/amd64 on Tasker service containers.
Installing Worker Packages
Install the package for your language of choice:
Ruby
gem install tasker-rb
Or add to your Gemfile:
gem 'tasker-rb', '~> 0.1.5'
Python
pip install tasker-py
Or with uv:
uv add tasker-py
TypeScript / JavaScript
bun add @tasker-systems/tasker
Or with npm:
npm install @tasker-systems/tasker
Rust
Add to your Cargo.toml:
[dependencies]
tasker-worker = "0.1.5"
tasker-client = "0.1.5"
Infrastructure with Docker Compose
Tasker requires PostgreSQL (with the PGMQ extension) and an orchestration service. The fastest way to get these running is Docker Compose.
You can generate a compose file with tasker-ctl:
tasker-ctl init
tasker-ctl remote update
tasker-ctl template generate docker_compose \
--plugin tasker-contrib-ops \
--param name=myproject
Or use the pre-configured stack from the example apps:
git clone https://github.com/tasker-systems/tasker-contrib.git
cd tasker-contrib/examples
docker compose up -d
This starts PostgreSQL (with PGMQ), the Tasker orchestration engine, RabbitMQ, and Dragonfly (cache). The orchestration API is available at http://localhost:8080.
Verify services are running
curl -sf http://localhost:8080/health
Configuration
Tasker uses environment variables and TOML configuration files. Key environment variables:
# Database connection (required)
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker"
# Orchestration API URL (for client SDKs and tasker-ctl)
export ORCHESTRATION_URL="http://localhost:8080"
# Messaging backend: "pgmq" (default, uses PostgreSQL) or "rabbitmq"
export TASKER_MESSAGING_BACKEND="pgmq"
For full configuration management, see Configuration Management or generate annotated config files with tasker-ctl config generate.
Next Steps
- Quick Start — Two paths to a running workflow
- Using tasker-ctl — Project scaffolding and template generation
- Your First Handler — Write your first step handler
Getting Started with tasker-ctl
tasker-ctl is the command-line tool for managing Tasker workflows, generating project scaffolding, and working with configuration. This guide covers the developer-facing features for bootstrapping new projects.
Installation
Install from crates.io:
cargo install tasker-ctl
Or with cargo-binstall (uses prebuilt binaries when available):
cargo binstall tasker-ctl
Initialize Your Project
Run tasker-ctl init to create a .tasker-ctl.toml configuration file in your project directory:
tasker-ctl init
This creates a .tasker-ctl.toml pre-configured with the tasker-contrib remote, which provides community templates for all supported languages and default configuration files.
To skip the tasker-contrib remote (e.g., if you only use private templates):
tasker-ctl init --no-contrib
Fetch Remote Templates
After initialization, fetch the remote templates:
tasker-ctl remote update
This clones the configured remotes to a local cache (~/.cache/tasker-ctl/remotes/). Subsequent fetches only pull changes. The cache is checked for freshness automatically and warnings are shown when it becomes stale (default: 24 hours).
Browse Templates
List all available templates:
tasker-ctl template list
Filter by language:
tasker-ctl template list --language ruby
tasker-ctl template list --language python
Get detailed information about a template:
tasker-ctl template info step_handler --language ruby
Generate Code from Scaffolding Templates
Generate a step handler from a scaffolding template:
tasker-ctl template generate step_handler \
--language ruby \
--param name=ProcessPayment \
--output ./app/handlers/
This creates handler files using the naming conventions and patterns for your chosen language. Template parameters (like name) are transformed automatically — ProcessPayment becomes process_payment for file names and ProcessPaymentHandler for class names.
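The transformation is the conventional PascalCase-to-snake_case mapping; a minimal sketch of it (illustrative, not tasker-ctl's actual implementation):

```python
import re

def to_snake_case(name: str) -> str:
    """Insert underscores before interior capitals, then lowercase.

    "ProcessPayment" -> "process_payment"
    """
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def to_class_name(name: str) -> str:
    """Append the Handler suffix: "ProcessPayment" -> "ProcessPaymentHandler"."""
    return f"{name}Handler"
```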
Generate Typed Code from Task Templates
The generate command reads your task template YAML files and produces typed code — result models and handler scaffolds — in any supported language. This keeps your handler code aligned with the schemas defined in your templates.
tasker-ctl generate <COMMAND>
Commands:
types Generate typed result models from step result_schema definitions
handler Generate handler scaffolds with typed dependency wiring
Generate Types
Generate typed result models from the result_schema defined on each step in a task template:
tasker-ctl generate types \
--template config/tasker/templates/ecommerce_order_processing.yaml \
--language typescript
This reads each step’s result_schema and produces language-idiomatic types. For TypeScript, you get Zod schemas with inferred types:
export const EcommerceValidateCartResultSchema = z.object({
free_shipping: z.boolean(),
item_count: z.number().int(),
subtotal: z.number(),
tax: z.number(),
total: z.number(),
validated_items: z.array(EcommerceValidateCartResultValidatedItemsSchema),
validation_id: z.string(),
// ...
}).passthrough();
export type EcommerceValidateCartResult = z.infer<typeof EcommerceValidateCartResultSchema>;
For Python, you get Pydantic models:
class EcommerceValidateCartResult(BaseModel):
free_shipping: bool
item_count: int
subtotal: float
total: float
validated_items: list[EcommerceValidateCartResultValidatedItems]
validation_id: str
To generate types for a single step:
tasker-ctl generate types \
--template config/tasker/templates/ecommerce_order_processing.yaml \
--language python \
--step validate_cart
Supported languages: typescript (ts), python (py), ruby (rb), rust (rs).
Generate Handlers
Generate handler scaffolds with typed dependency wiring:
tasker-ctl generate handler \
--template config/tasker/templates/ecommerce_order_processing.yaml \
--language typescript \
--step process_payment
The generator reads the step’s dependencies from the template and wires them into the handler scaffold:
export const ProcessPaymentHandler = defineHandler(
'Ecommerce::StepHandlers::ProcessPaymentHandler',
{
depends: {
validateCartResult: 'validate_cart'
},
},
async ({ validateCartResult, context }) => {
// validateCartResult: ValidateCartResult (typed)
// TODO: implement handler logic
return {
amount_charged: 0,
authorization_code: "",
currency: "",
// ...
};
}
);
The return value stub matches the step’s result_schema, so you can fill in real logic and the shape is already correct.
To generate handlers for all steps at once, omit --step. Add --with-tests to also generate test scaffolds:
tasker-ctl generate handler \
--template config/tasker/templates/ecommerce_order_processing.yaml \
--language typescript \
--with-tests
Use --output to write to a file instead of stdout:
tasker-ctl generate types \
--template config/tasker/templates/ecommerce_order_processing.yaml \
--language typescript \
--output src/services/types.ts
Generate Configuration
Generate a deployable configuration file from the base + environment configs:
# From local config directory
tasker-ctl config generate \
--context orchestration \
--environment production \
--output config/orchestration.toml
# From a remote (tasker-contrib provides default configs)
tasker-ctl config generate \
--remote tasker-contrib \
--context orchestration \
--environment development \
--output config/orchestration.toml
The config generate command merges base configuration with environment-specific overrides and strips documentation metadata, producing a clean deployment-ready TOML file.
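Conceptually the merge is a deep overlay of the environment table onto the base table. A minimal sketch of that behavior — illustrative only, not tasker-ctl's implementation (which also handles the metadata stripping):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay override onto base, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested tables
        else:
            merged[key] = value                           # environment override wins
    return merged

base = {"database": {"host": "localhost", "pool_size": 5}}
production = {"database": {"pool_size": 25}}
# deep_merge(base, production) keeps host but takes the production pool_size
```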
Manage Remotes
tasker-ctl remote list # Show configured remotes and cache status
tasker-ctl remote add my-templates URL # Add a new remote
tasker-ctl remote update # Fetch latest for all remotes
tasker-ctl remote update tasker-contrib # Fetch a specific remote
tasker-ctl remote remove my-templates # Remove a remote and its cache
Typical Workflow
A new project typically follows this sequence:
# 1. Initialize CLI configuration
tasker-ctl init
# 2. Fetch community templates
tasker-ctl remote update
# 3. Scaffold a step handler from a template
tasker-ctl template generate step_handler --language python --param name=ValidateOrder
# 4. Scaffold a task template
tasker-ctl template generate task_template --language python \
--param name=OrderProcessing \
--param namespace=default \
--param handler_callable=handlers.validate_order_handler.ValidateOrderHandler
# 5. Generate typed code from your task template
tasker-ctl generate types --template config/tasker/templates/order_processing.yaml --language python
tasker-ctl generate handler --template config/tasker/templates/order_processing.yaml --language python
# 6. Generate infrastructure (uses --plugin for ops templates)
tasker-ctl template generate docker_compose --plugin tasker-contrib-ops --param name=myproject
# 7. Generate environment-specific config (merges base + environment overrides)
tasker-ctl config generate --remote tasker-contrib \
--context worker --environment development --output config/worker.toml
Note: Language templates use --language to select the plugin (e.g., --language ruby selects tasker-contrib-rails). Ops templates use --plugin tasker-contrib-ops directly since they are language-independent.
Next Steps
- Your First Handler — Write a step handler from scratch
- Your First Workflow — Define a task template and run it
- Configuration Management — Understanding the TOML config structure
- tasker-ctl Architecture — How the CLI is built
Your First Handler
This guide walks you through writing your first step handler using the DSL.
What is a Handler?
A Step Handler is your code that executes business logic for a single workflow step. With the DSL, a handler declares what it receives — typed inputs from the task context and typed results from upstream steps — and delegates to a service function. Handlers are thin wrappers: Tasker handles sequencing, retries, and error classification; your service layer handles the business logic.
You can generate a handler from a template with tasker-ctl:
tasker-ctl template generate step_handler --language python --param name=ProcessOrder
Or write one from scratch using the patterns below.
The DSL Approach
Every handler follows the same three-layer pattern:
- Type definition — the contract (what the handler receives and returns)
- Handler declaration — the DSL wiring (which step, which inputs, which dependencies)
- Service delegation — one-line call to your business logic
Python
# app/services/types.py — the contract
class EcommerceOrderInput(BaseModel):
items: list[dict[str, Any]] | None = None # submitted as "items"
cart_items: list[dict[str, Any]] | None = None # or "cart_items"
customer_email: str | None = None
payment_token: str | None = None
@property
def resolved_items(self) -> list[dict[str, Any]]:
"""Accept either field name from the task context."""
return self.items or self.cart_items or []
# app/handlers/ecommerce.py — the handler
from tasker_core.step_handler.functional import inputs, step_handler
from app.services.types import EcommerceOrderInput
from app.services import ecommerce as svc
@step_handler("validate_cart")
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context: StepContext):
return svc.validate_cart_items(inputs.resolved_items)
The @step_handler decorator registers this function under the name validate_cart — the exact string that must appear in the handler.callable field of the task template YAML. The @inputs decorator tells Tasker to extract the task context into an EcommerceOrderInput Pydantic model. The function body is a single service call.
Ruby
# app/services/types.rb — the contract
module Types
module Ecommerce
class OrderInput < Types::InputStruct
attribute :cart_items, Types::Array.of(Types::Hash).optional
attribute :customer_email, Types::String.optional
end
end
end
# app/handlers/ecommerce/step_handlers/validate_cart_handler.rb — the handler
module Ecommerce
module StepHandlers
extend TaskerCore::StepHandler::Functional
ValidateCartHandler = step_handler(
'Ecommerce::StepHandlers::ValidateCartHandler',
inputs: Types::Ecommerce::OrderInput
) do |inputs:, context:|
Ecommerce::Service.validate_cart_items(cart_items: inputs.cart_items)
end
end
end
Ruby uses step_handler as a method that takes a block. The inputs: keyword argument receives a Dry::Struct instance with typed attributes.
TypeScript
// src/services/types.ts — the contract
export interface CartItem {
sku: string;
name: string;
price: number;
quantity: number;
}
// src/handlers/ecommerce.ts — the handler
import { defineHandler } from '@tasker-systems/tasker';
import * as svc from '../services/ecommerce';
export const ValidateCartHandler = defineHandler(
'Ecommerce.StepHandlers.ValidateCartHandler',
{ inputs: { cartItems: 'cart_items' } },
async ({ cartItems }) => svc.validateCartItems(cartItems as CartItem[]),
);
TypeScript uses defineHandler as a factory function. The inputs config maps camelCase parameter names to snake_case YAML field names.
Rust
Rust uses the RustStepHandler trait directly — this is Rust’s only handler pattern. There is no DSL equivalent, by design.
use anyhow::Result;
use async_trait::async_trait;
use serde_json::json;
use std::time::Instant;
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;
use tasker_worker_rust::{success_result, RustStepHandler};
use tasker_worker_rust::step_handlers::StepHandlerConfig;

pub struct ProcessOrderHandler {
    config: StepHandlerConfig,
}

#[async_trait]
impl RustStepHandler for ProcessOrderHandler {
    fn new(config: StepHandlerConfig) -> Self {
        Self { config }
    }

    fn name(&self) -> &str {
        "process_order"
    }

    async fn call(
        &self,
        step_data: &TaskSequenceStep,
    ) -> Result<StepExecutionResult> {
        let start = Instant::now();
        let _input_data = &step_data.task.context;
        let result_data = json!({
            "processed": true,
            "handler": "process_order"
        });
        let duration_ms = start.elapsed().as_millis() as i64;
        Ok(success_result(
            step_data.workflow_step.workflow_step_uuid,
            result_data,
            duration_ms,
            None,
        ))
    }
}
Reading the DSL
Each language’s DSL has the same three concepts:
| Concept | Python | Ruby | TypeScript |
|---|---|---|---|
| Register a handler | @step_handler("name") | step_handler('Name', ...) do | defineHandler('Name', ...) |
| Inject task inputs | @inputs(Model) | inputs: Model | { inputs: { key: 'field' } } |
| Inject dependency results | @depends_on(x=("step", Model)) | depends_on: { x: ['step', Model] } | { depends: { x: 'step' } } |
The handler function always receives context as its last parameter — a StepContext with execution metadata. Most handlers don’t need it directly, but it’s available for advanced patterns.
Registering Handlers
Handlers are resolved by matching the handler.callable field in task template YAML against the name you registered with the DSL. The mechanism is identical across all languages — the callable is an opaque key that must match between registration and template.
Each language follows its own naming convention:
| Language | Convention | Example |
|---|---|---|
| Ruby | Module::ClassName | Ecommerce::StepHandlers::ValidateCartHandler |
| Python | snake_case name | validate_cart |
| TypeScript | Namespace.ClassName | Ecommerce.StepHandlers.ValidateCartHandler |
| Rust | snake_case name | process_order |
The registry doesn’t enforce a format — any string works as long as the DSL registration and YAML callable match. Qualified names are recommended because they prevent collisions across modules and align with how each language’s class-based fallback resolver works. Short names like validate_cart are equally valid and resolve reliably through exact-match (the highest-priority resolver). See Handler Resolution for the full resolver chain architecture, fallback behavior, and conventions.
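Conceptually, resolution is a plain string-keyed lookup. Here is a minimal, hypothetical sketch of that idea in Python; it is not Tasker's actual registry code:

```python
from typing import Any, Callable

# Hypothetical registry sketch: callable strings map to handler functions.
REGISTRY: dict[str, Callable[..., Any]] = {}

def step_handler(name: str):
    """Register a function under an opaque string key (mirrors the DSL idea)."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register

@step_handler("validate_cart")
def validate_cart(inputs, context):
    return {"validated": True}

# Resolution: the YAML handler.callable value is the lookup key.
handler = REGISTRY["validate_cart"]
assert handler is validate_cart
```

The key is opaque to the lookup, which is why both `validate_cart` and `Ecommerce::StepHandlers::ValidateCartHandler` work equally well as registration names.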
Class-Based Alternative
If you prefer class inheritance, all handler types support a class-based pattern where you extend StepHandler and implement call(context). See Class-Based Handlers for the full reference.
See It in Action
The example apps implement step handlers for four real-world workflows in all four languages. Compare the same handler across Rails, FastAPI, Bun, and Axum to see how each framework’s idioms map to the Tasker contract.
Next Steps
- Your First Workflow — Connect handlers into a multi-step DAG
- Language guides: Ruby | Python | TypeScript | Rust
Your First Workflow
This guide walks you through creating a complete workflow with multiple steps. We’ll use the e-commerce order processing pattern from the example apps, demonstrating parallel execution and typed dependency injection.
This walkthrough uses Python. See the Ruby, TypeScript, and Rust guides for language-specific examples.
What is a Workflow?
A Workflow is a directed acyclic graph (DAG) of steps defined in a task template YAML file. Steps execute when their dependencies are satisfied, enabling parallel execution where possible.
Example: Order Processing
Let’s build an order processing workflow with five steps. After validating the cart, payment processing and inventory reservation happen in parallel — they’re independent operations that don’t need each other’s results. Once both succeed, we create the order record and send the confirmation:
┌──────────────┐
│ validate_cart │
└──────┬───────┘
│
┌─────────┴─────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ process │ │ update │
│ payment │ │ inventory │
└──────┬───────┘ └──────┬───────┘
│ │
└────────┬─────────┘
▼
┌──────────────┐
│ create_order │
└──────┬───────┘
│
▼
┌──────────────┐
│ send │
│ confirmation │
└──────────────┘
This is a real-world pattern: payment authorization and inventory reservation are calls to different external systems. Running them in parallel reduces total checkout time. The order record isn’t created until both succeed, and the confirmation email isn’t sent until the order exists.
Step 1: Define the Task Template
Create a YAML file defining the workflow structure. You can generate a starter template with tasker-ctl:
tasker-ctl template generate task_template \
--language python \
--param name=EcommerceOrderProcessing \
--param namespace=ecommerce \
--param handler_callable=handlers.ecommerce.ValidateCartHandler
Then extend the generated single-step template into the full DAG:
# config/tasker/templates/ecommerce_order_processing.yaml
name: ecommerce_order_processing
namespace_name: ecommerce
version: "1.0.0"
description: "E-commerce checkout: validate → (payment ‖ inventory) → order → confirm"
steps:
- name: validate_cart
description: "Validate cart items, check availability, calculate totals"
handler:
callable: validate_cart
dependencies: []
retry:
retryable: true
max_attempts: 2
backoff: exponential
- name: process_payment
description: "Authorize payment through payment gateway"
handler:
callable: process_payment
dependencies:
- validate_cart
retry:
retryable: true
max_attempts: 3
backoff: exponential
- name: update_inventory
description: "Reserve inventory for order items"
handler:
callable: update_inventory
dependencies:
- validate_cart
retry:
retryable: true
max_attempts: 3
backoff: exponential
- name: create_order
description: "Create order record from payment and inventory results"
handler:
callable: create_order
dependencies:
- process_payment
- update_inventory
retry:
retryable: true
max_attempts: 2
backoff: exponential
- name: send_confirmation
description: "Send order confirmation email to customer"
handler:
callable: send_confirmation
dependencies:
- create_order
retry:
retryable: true
max_attempts: 2
backoff: exponential
input_schema:
type: object
required:
- cart_items
- customer_email
properties:
cart_items:
type: array
items:
type: object
required: [sku, name, quantity, unit_price]
properties:
sku:
type: string
name:
type: string
quantity:
type: integer
unit_price:
type: number
customer_email:
type: string
format: email
payment_token:
type: string
Key YAML Fields
| Field | Description |
|---|---|
| name | Task template name (used in API submissions) |
| namespace_name | Logical grouping for templates and queues (max 29 chars) |
| steps | List of steps forming the execution DAG |
| handler.callable | Identifies which handler processes this step |
| dependencies | List of step names that must complete before this step runs |
| retry | Retry policy (retryable, attempts, backoff strategy) |
| input_schema | Optional JSON Schema for validating task context |
How Dependencies Create the DAG
The dependencies field defines the execution graph:
- `validate_cart` has no dependencies — it runs first
- `process_payment` and `update_inventory` both depend only on `validate_cart` — they run in parallel once validation completes
- `create_order` depends on both `process_payment` and `update_inventory` — it waits for both to complete (convergence point)
- `send_confirmation` depends on `create_order` — it runs last
Tasker resolves these dependencies automatically. You declare what depends on what, and the engine figures out what can run in parallel.
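The layering can be illustrated with Python's standard-library topological sorter. This is a sketch of the scheduling concept, not Tasker's internal implementation:

```python
from graphlib import TopologicalSorter

# Each step mapped to its YAML dependencies
deps = {
    "validate_cart": set(),
    "process_payment": {"validate_cart"},
    "update_inventory": {"validate_cart"},
    "create_order": {"process_payment", "update_inventory"},
    "send_confirmation": {"create_order"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    layer = sorted(ts.get_ready())
    print(layer)  # every step in a layer is eligible to run in parallel
    ts.done(*layer)
```

This prints four layers: `validate_cart` alone, then `process_payment` and `update_inventory` together, then `create_order`, then `send_confirmation`.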
YAML Dependencies vs Handler Dependencies
The YAML dependencies field and the handler’s @depends_on decorator serve different purposes:
- YAML `dependencies` define the DAG shape — which steps must complete before this step starts. These are proximal (direct predecessors only). `create_order` lists `process_payment` and `update_inventory` because it must wait for both.
- Handler `@depends_on` declares which step results the handler needs injected as typed parameters. These can reference any ancestor step — not just direct predecessors. Tasker makes all ancestor results available in the step context.
In the create_order handler below, notice that @depends_on references validate_cart even though the YAML only lists process_payment and update_inventory as dependencies. The handler can access validate_cart’s result because it’s a transitive ancestor — Tasker has already executed it earlier in the DAG.
Step 2: Define Your Types
Before writing handlers, define the types that describe what flows between steps. These are Pydantic models — the same types the DSL uses to inject inputs and dependency results:
# app/services/types.py
from pydantic import BaseModel
from typing import Any
class EcommerceOrderInput(BaseModel):
items: list[dict[str, Any]] | None = None # submitted as "items"
cart_items: list[dict[str, Any]] | None = None # or "cart_items"
customer_email: str | None = None
payment_token: str | None = None
@property
def resolved_items(self) -> list[dict[str, Any]]:
"""Accept either field name from the task context."""
return self.items or self.cart_items or []
class EcommerceValidateCartResult(BaseModel):
validated_items: list[dict[str, Any]] | None = None
item_count: int | None = None
subtotal: float | None = None
tax: float | None = None
total: float | None = None
class EcommerceProcessPaymentResult(BaseModel):
payment_id: str | None = None
transaction_id: str | None = None
amount_charged: float | None = None
status: str | None = None
class EcommerceUpdateInventoryResult(BaseModel):
total_items_reserved: int | None = None
inventory_log_id: str | None = None
class EcommerceCreateOrderResult(BaseModel):
order_id: str | None = None
customer_email: str | None = None
total: float | None = None
status: str | None = None
All fields are optional with None defaults. This is intentional — task context may not include every field, and upstream step results may vary. The type system provides structure and IDE autocomplete without brittle required-field failures.
Step 3: Implement Handlers
With types defined, the handlers are short — each one declares what it receives and delegates to a service function:
# app/handlers/ecommerce.py
from tasker_core.step_handler.functional import depends_on, inputs, step_handler
from tasker_core.types import StepContext
from app.services import ecommerce as svc
from app.services.types import (
EcommerceCreateOrderResult,
EcommerceOrderInput,
EcommerceProcessPaymentResult,
EcommerceUpdateInventoryResult,
EcommerceValidateCartResult,
)
@step_handler("validate_cart")
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context: StepContext):
return svc.validate_cart_items(inputs.resolved_items)
@step_handler("process_payment")
@depends_on(cart_result=("validate_cart", EcommerceValidateCartResult))
@inputs(EcommerceOrderInput)
def process_payment(
cart_result: EcommerceValidateCartResult,
inputs: EcommerceOrderInput,
context: StepContext,
):
return svc.process_payment(
payment_token=inputs.payment_token,
total=cart_result.total or 0.0,
)
@step_handler("update_inventory")
@depends_on(cart_result=("validate_cart", EcommerceValidateCartResult))
def update_inventory(cart_result: EcommerceValidateCartResult, context: StepContext):
return svc.update_inventory(cart_result.validated_items or [])
@step_handler("create_order")
@depends_on(
cart_result=("validate_cart", EcommerceValidateCartResult),
payment_result=("process_payment", EcommerceProcessPaymentResult),
inventory_result=("update_inventory", EcommerceUpdateInventoryResult),
)
@inputs(EcommerceOrderInput)
def create_order(
cart_result: EcommerceValidateCartResult,
payment_result: EcommerceProcessPaymentResult,
inventory_result: EcommerceUpdateInventoryResult,
inputs: EcommerceOrderInput,
context: StepContext,
):
return svc.create_order(
cart=cart_result, payment=payment_result,
inventory=inventory_result, customer_email=inputs.customer_email,
)
@step_handler("send_confirmation")
@depends_on(order_result=("create_order", EcommerceCreateOrderResult))
@inputs(EcommerceOrderInput)
def send_confirmation(
order_result: EcommerceCreateOrderResult,
inputs: EcommerceOrderInput,
context: StepContext,
):
return svc.send_confirmation(
order=order_result, customer_email=inputs.customer_email,
)
That’s the entire handler file — five handlers in about 50 lines. The service functions (svc.validate_cart_items, svc.process_payment, etc.) contain your actual business logic. Tasker doesn’t care what happens inside them — it cares about the handler’s typed signature and the result it returns.
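To make the division of labor concrete, here is a hypothetical sketch of one service function: plain Python, no Tasker imports, and an assumed flat 8% tax rate purely for illustration:

```python
# app/services/ecommerce.py — hypothetical sketch; your real logic goes here
from typing import Any

def validate_cart_items(items: list[dict[str, Any]]) -> dict[str, Any]:
    """Compute cart totals; shape mirrors EcommerceValidateCartResult."""
    subtotal = round(sum(i["quantity"] * i["unit_price"] for i in items), 2)
    tax = round(subtotal * 0.08, 2)  # assumed flat rate for illustration
    return {
        "validated_items": items,
        "item_count": len(items),
        "subtotal": subtotal,
        "tax": tax,
        "total": round(subtotal + tax, 2),
    }
```

Because it is an ordinary function, it can be unit-tested and reused with no Tasker infrastructure at all.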
Anatomy of a Handler with Dependencies
Look at create_order — the convergence point where three parallel branches meet:
@step_handler("create_order")
@depends_on(
cart_result=("validate_cart", EcommerceValidateCartResult), # ① upstream step + type
payment_result=("process_payment", EcommerceProcessPaymentResult), # ② another upstream step
inventory_result=("update_inventory", EcommerceUpdateInventoryResult),
)
@inputs(EcommerceOrderInput) # ③ task context
def create_order(
cart_result: EcommerceValidateCartResult, # injected as typed Pydantic model
payment_result: EcommerceProcessPaymentResult,
inventory_result: EcommerceUpdateInventoryResult,
inputs: EcommerceOrderInput, # task context as typed model
context: StepContext, # execution metadata
):
return svc.create_order(...) # ④ delegate to service
- Each `@depends_on` entry maps a parameter name to a `("step_name", ResultModel)` tuple
- Tasker resolves the upstream step’s result dict and deserializes it into the Pydantic model
- `@inputs` does the same for the task context
- The handler function receives fully typed objects and passes them to the service
Step 4: Submit a Task
Submit a task via the REST API:
curl -X POST http://localhost:8080/api/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"name": "ecommerce_order_processing",
"namespace": "ecommerce",
"version": "1.0.0",
"initiator": "api:checkout",
"source_system": "web",
"reason": "New order received",
"context": {
"customer_email": "customer@example.com",
"payment_token": "tok_test_success",
"cart_items": [
{"sku": "WIDGET-A", "name": "Widget A", "quantity": 2, "unit_price": 29.99},
{"sku": "GADGET-B", "name": "Gadget B", "quantity": 1, "unit_price": 49.99}
]
}
}'
Or with tasker-ctl:
tasker-ctl task create \
--name ecommerce_order_processing \
--namespace ecommerce \
--input '{"customer_email": "customer@example.com", "cart_items": [{"sku": "WIDGET-A", "name": "Widget A", "quantity": 2, "unit_price": 29.99}]}'
Execution Flow
When this task runs:
- validate_cart executes first (no dependencies)
- process_payment and update_inventory execute in parallel (both depend only on validate_cart)
- create_order executes after both parallel steps complete (convergence)
- send_confirmation executes after create_order completes
The total execution time is determined by the longest path through the DAG, not the sum of all steps. If payment takes 2 seconds and inventory takes 1 second, step 3 begins at the 2-second mark — the inventory result is already waiting.
Your Services, Tasker’s Orchestration
Notice what the handlers don’t contain: no tax calculations, no payment gateway logic, no inventory reservation algorithms. That business logic lives in your service layer (app/services/ecommerce.py), where it can be tested independently and reused outside of Tasker.
The handlers are thin wrappers that declare their typed signature and delegate. Tasker brings workflow orchestration to your existing codebase — it manages the DAG, sequencing, retries, and error classification. Your services do what they’ve always done.
See It in Action
The example apps implement this e-commerce workflow (and three others) in all four languages — Rails, FastAPI, Bun, and Axum. Each app is a fully working project you can clone and run with Docker Compose.
The example apps also include more complex DAG patterns:
- Data Pipeline — Three parallel extract branches, each feeding its own transform, converging at aggregation (8 steps)
- Microservices — User registration with parallel billing and preferences setup (5 steps, diamond pattern)
- Cross-Namespace — Customer success workflow that delegates to a payments namespace (namespace isolation)
Next Steps
- Language guides: Ruby | Python | TypeScript | Rust
- Architecture Overview — Understand lifecycle actors and DAG execution
- Handler Types — API, Decision, and Batchable handler patterns
Building with Ruby
This guide covers building Tasker step handlers with Ruby using the tasker-rb gem
in a Rails application.
Quick Start
Add the gem to your Gemfile:
gem 'tasker-rb'
Generate a step handler with tasker-ctl:
tasker-ctl template generate step_handler \
--language ruby \
--param name=ValidateCart \
--param module_name=Ecommerce
This creates a DSL-style handler with typed inputs that delegates to a service method.
Writing a Handler (DSL)
Every handler follows the three-layer pattern: type definition, handler declaration, service delegation.
# app/services/types.rb — the contract
module Types
module Ecommerce
class OrderInput < Types::InputStruct
attribute :cart_items, Types::Array.optional
attribute :customer_email, Types::String.optional
attribute :payment_info, Types::Hash.optional
attribute :shipping_address, Types::Hash.optional
end
end
end
# app/handlers/ecommerce/step_handlers/validate_cart_handler.rb — the handler
module Ecommerce
module StepHandlers
extend TaskerCore::StepHandler::Functional
ValidateCartHandler = step_handler(
'Ecommerce::StepHandlers::ValidateCartHandler',
inputs: Types::Ecommerce::OrderInput
) do |inputs:, context:|
Ecommerce::Service.validate_cart_items(
cart_items: inputs.cart_items,
customer_email: inputs.customer_email,
)
end
end
end
The step_handler method registers a handler and takes a block. The inputs: keyword argument receives a Dry::Struct instance with typed, optional attributes. The block body is a single service call.
Type System
Ruby handlers use Dry::Struct for both input and result types.
Input types extend Types::InputStruct — a base class where all attributes are optional and omittable, so missing keys don’t raise:
module Types
module Ecommerce
class OrderInput < Types::InputStruct
attribute :cart_items, Types::Array.optional
attribute :customer_email, Types::String.optional
attribute :payment_info, Types::Hash.optional
attribute :shipping_address, Types::Hash.optional
end
end
end
Result types extend Types::ResultStruct — similar to InputStruct but describing what a handler returns (used by downstream depends_on):
module Types
module Ecommerce
class ValidateCartResult < Types::ResultStruct
attribute :validated_items, Types::Array
attribute :item_count, Types::Integer
attribute :subtotal, Types::Float
attribute :tax, Types::Float
attribute :total, Types::Float
end
end
end
Both InputStruct and ResultStruct support string and symbol key access (e.g., result['user_id'] and result[:user_id]) and nested access via dig.
Accessing Task Context
The inputs: config extracts the full task context into a typed Dry::Struct instance. Fields are matched by name from the submitted JSON:
ValidateCartHandler = step_handler(
'Ecommerce::StepHandlers::ValidateCartHandler',
inputs: Types::Ecommerce::OrderInput
) do |inputs:, context:|
# inputs.cart_items, inputs.customer_email, etc. are typed attributes
Ecommerce::Service.validate_cart_items(cart_items: inputs.cart_items)
end
The context: keyword provides execution metadata (task UUID, step UUID, step config) but most handlers don’t need it directly.
Working with Dependencies
The depends_on: config injects typed results from upstream steps. Each entry maps a keyword argument name to a ['step_name', ResultModel] pair:
ProcessPaymentHandler = step_handler(
'Ecommerce::StepHandlers::ProcessPaymentHandler',
depends_on: { cart_total: ['validate_cart', Types::Ecommerce::ValidateCartResult] },
inputs: Types::Ecommerce::OrderInput
) do |cart_total:, inputs:, context:|
Ecommerce::Service.process_payment(
payment_info: inputs.payment_info,
total: cart_total&.total,
)
end
Handlers can reference any ancestor step in the DAG — not just direct predecessors. Here’s a convergence handler that accesses three upstream steps:
CreateOrderHandler = step_handler(
'Ecommerce::StepHandlers::CreateOrderHandler',
depends_on: {
cart_validation: ['validate_cart', Types::Ecommerce::ValidateCartResult],
payment_result: ['process_payment', Types::Ecommerce::ProcessPaymentResult],
inventory_result: ['update_inventory', Types::Ecommerce::UpdateInventoryResult],
},
inputs: Types::Ecommerce::OrderInput
) do |cart_validation:, payment_result:, inventory_result:, inputs:, context:|
Ecommerce::Service.create_order(
cart_validation: cart_validation,
payment_result: payment_result,
inventory_result: inventory_result,
customer_email: inputs.customer_email,
shipping_address: inputs.shipping_address,
)
end
Multi-Step Example: Data Pipeline
The data pipeline workflow demonstrates a parallel DAG — three independent extract branches, each feeding its own transform, converging at aggregation:
extract_sales extract_inventory extract_customers
│ │ │
▼ ▼ ▼
transform_sales transform_inventory transform_customers
│ │ │
└──────────────────┼────────────────────┘
▼
aggregate_metrics
│
▼
generate_insights
Ruby handlers follow the same concise pattern:
module DataPipeline
module StepHandlers
extend TaskerCore::StepHandler::Functional
# Extract — no dependencies, runs in parallel
ExtractSalesDataHandler = step_handler(
'DataPipeline::StepHandlers::ExtractSalesDataHandler',
inputs: Types::DataPipeline::PipelineInput
) do |inputs:, context:|
DataPipeline::Service.extract_sales_data(
source: inputs.source,
date_range_start: inputs.date_range_start,
)
end
# Transform — depends on one extract branch
TransformSalesHandler = step_handler(
'DataPipeline::StepHandlers::TransformSalesHandler',
depends_on: { sales_data: ['extract_sales_data', Types::DataPipeline::ExtractSalesResult] }
) do |sales_data:, context:|
DataPipeline::Service.transform_sales(sales_data: sales_data)
end
# Aggregate — converges three transform branches
AggregateMetricsHandler = step_handler(
'DataPipeline::StepHandlers::AggregateMetricsHandler',
depends_on: {
sales: ['transform_sales', Types::DataPipeline::TransformSalesResult],
inventory: ['transform_inventory', Types::DataPipeline::TransformInventoryResult],
customers: ['transform_customers', Types::DataPipeline::TransformCustomersResult],
}
) do |sales:, inventory:, customers:, context:|
DataPipeline::Service.aggregate_metrics(
sales: sales, inventory: inventory, customers: customers,
)
end
end
end
Error Handling
Raise typed exceptions to control retry behavior:
# Permanent error — will NOT be retried
raise TaskerCore::Errors::PermanentError.new(
'Payment declined: insufficient funds',
error_code: 'PAYMENT_DECLINED'
)
# Retryable error — will be retried per the step's retry config
raise TaskerCore::Errors::RetryableError.new(
'Payment gateway temporarily unavailable'
)
Error codes (like PAYMENT_DECLINED, EMPTY_CART) are included in the step result for observability and debugging.
Testing
DSL handlers are constants holding callable blocks — test by invoking the service functions directly or by using RSpec mocks:
RSpec.describe 'Ecommerce::StepHandlers::ValidateCartHandler' do
let(:inputs) do
Types::Ecommerce::OrderInput.new(
cart_items: [{ 'sku' => 'SKU-001', 'name' => 'Widget', 'quantity' => 2, 'unit_price' => 29.99 }],
customer_email: 'test@example.com'
)
end
it 'delegates to the service' do
expect(Ecommerce::Service).to receive(:validate_cart_items)
.with(cart_items: inputs.cart_items, customer_email: inputs.customer_email)
.and_return({ validated_items: [], total: 64.79 })
context = instance_double(TaskerCore::Types::StepContext)
result = Ecommerce::StepHandlers::ValidateCartHandler.call(inputs: inputs, context: context)
expect(result[:total]).to eq(64.79)
end
end
For handlers with dependencies, construct result models directly:
let(:cart) { Types::Ecommerce::ValidateCartResult.new(total: 64.79, validated_items: []) }
let(:payment) { Types::Ecommerce::ProcessPaymentResult.new(payment_id: 'pay_abc') }
let(:inventory) { Types::Ecommerce::UpdateInventoryResult.new(inventory_log_id: 'log_123') }
let(:inputs) { Types::Ecommerce::OrderInput.new(customer_email: 'test@example.com') }
it 'creates order from upstream data' do
expect(Ecommerce::Service).to receive(:create_order)
.and_return({ order_id: 'ORD-001' })
context = instance_double(TaskerCore::Types::StepContext)
result = Ecommerce::StepHandlers::CreateOrderHandler.call(
cart_validation: cart, payment_result: payment,
inventory_result: inventory, inputs: inputs, context: context,
)
expect(result[:order_id]).to eq('ORD-001')
end
Because handlers delegate to service methods, you can also test the services directly without any Tasker infrastructure.
Handler Variants
API Handler
Adds HTTP client methods with built-in error classification. Uses TaskerCore::StepHandler::Mixins::API with the class-based pattern. See Class-Based Handlers — API Handler.
Decision Handler
Adds workflow routing with decision_success() for activating downstream step sets. Uses TaskerCore::StepHandler::Mixins::Decision with the class-based pattern. See Conditional Workflows.
Batchable Handler
Adds batch processing with Analyzer/Worker pattern using TaskerCore::StepHandler::Mixins::Batchable. See Class-Based Handlers — Batchable Handler and Batch Processing.
Class-Based Alternative
If you prefer class inheritance, all handler types support a class-based pattern where you inherit from TaskerCore::StepHandler::Base and implement call(context). See Class-Based Handlers for the full reference.
Next Steps
- Your First Workflow — Build a multi-step DAG end-to-end
- Architecture — System design details
- Rails example app — Complete working implementation
Building with Python
This guide covers building Tasker step handlers with Python using the tasker_core
package in a FastAPI application.
Quick Start
Install the package:
pip install tasker-py
# Or with uv (recommended)
uv add tasker-py
Generate a step handler with tasker-ctl:
tasker-ctl template generate step_handler \
--language python \
--param name=ValidateCart \
--param module_name=handlers.ecommerce
This creates a DSL-style handler with typed inputs that delegates to a service function.
Writing a Handler (DSL)
Every handler follows the three-layer pattern: type definition, handler declaration, service delegation.
# app/services/types.py — the contract
from pydantic import BaseModel
from typing import Any
class EcommerceOrderInput(BaseModel):
items: list[dict[str, Any]] | None = None
cart_items: list[dict[str, Any]] | None = None
customer_email: str | None = None
payment_token: str | None = None
@property
def resolved_items(self) -> list[dict[str, Any]]:
"""Accept either field name from the task context."""
return self.items or self.cart_items or []
# app/handlers/ecommerce.py — the handler
from tasker_core.step_handler.functional import inputs, step_handler
from tasker_core.types import StepContext
from app.services.types import EcommerceOrderInput
from app.services import ecommerce as svc
@step_handler("validate_cart")
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context: StepContext):
return svc.validate_cart_items(inputs.resolved_items)
The @step_handler decorator registers this function as the handler for the validate_cart step. The @inputs decorator tells Tasker to extract the task context into a Pydantic model. The function body is a single service call.
Type System
Python handlers use Pydantic BaseModel for both input and result types. The DSL deserializes JSON into these models automatically.
Input types receive the task context:
class EcommerceOrderInput(BaseModel):
items: list[dict[str, Any]] | None = None
cart_items: list[dict[str, Any]] | None = None
payment_token: str | None = None
customer_email: str | None = None
@property
def resolved_items(self) -> list[dict[str, Any]]:
"""Accept either field name from the task context."""
return self.items or self.cart_items or []
Result types describe what a handler returns (used by downstream @depends_on):
class EcommerceValidateCartResult(BaseModel):
validated_items: list[dict[str, Any]] | None = None
item_count: int | None = None
subtotal: float | None = None
tax: float | None = None
total: float | None = None
All fields are optional with None defaults. This is intentional — task context may not include every field, and upstream results may vary. The type system provides structure and IDE autocomplete without brittle required-field failures.
Validation with @model_validator:
from pydantic import model_validator
from tasker_core.errors import PermanentError  # import path assumed
class ValidateRefundRequestInput(BaseModel):
ticket_id: str | None = None
order_ref: str | None = None
refund_amount: float | None = None
@property
def resolved_ticket_id(self) -> str | None:
return self.ticket_id or self.order_ref
@model_validator(mode='after')
def check_required_fields(self) -> 'ValidateRefundRequestInput':
if not self.resolved_ticket_id:
raise PermanentError(
message="ticket_id or order_ref is required",
error_code="MISSING_TICKET_ID",
)
return self
Accessing Task Context
The @inputs(Model) decorator extracts the full task context into a typed Pydantic model. Fields are matched by name from the submitted JSON:
@step_handler("validate_cart")
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context: StepContext):
# inputs.cart_items, inputs.customer_email, etc. are typed fields
return svc.validate_cart_items(inputs.resolved_items)
The context parameter provides execution metadata (task UUID, step UUID, step config) but most handlers don’t need it directly.
Working with Dependencies
The @depends_on decorator injects typed results from upstream steps. Each entry maps a parameter name to a ("step_name", ResultModel) tuple:
@step_handler("process_payment")
@depends_on(cart_result=("validate_cart", EcommerceValidateCartResult))
@inputs(EcommerceOrderInput)
def process_payment(
cart_result: EcommerceValidateCartResult,
inputs: EcommerceOrderInput,
context: StepContext,
):
return svc.process_payment(
payment_token=inputs.payment_token,
total=cart_result.total or 0.0,
)
Handlers can reference any ancestor step in the DAG — not just direct predecessors. Tasker makes all ancestor results available. Here’s a convergence handler that accesses three upstream steps:
@step_handler("create_order")
@depends_on(
cart_result=("validate_cart", EcommerceValidateCartResult),
payment_result=("process_payment", EcommerceProcessPaymentResult),
inventory_result=("update_inventory", EcommerceUpdateInventoryResult),
)
@inputs(EcommerceOrderInput)
def create_order(
cart_result: EcommerceValidateCartResult,
payment_result: EcommerceProcessPaymentResult,
inventory_result: EcommerceUpdateInventoryResult,
inputs: EcommerceOrderInput,
context: StepContext,
):
return svc.create_order(
cart=cart_result, payment=payment_result,
inventory=inventory_result, customer_email=inputs.customer_email,
)
Multi-Step Example: Data Pipeline
The data pipeline workflow demonstrates a parallel DAG — three independent extract branches, each feeding its own transform, converging at aggregation:
extract_sales extract_inventory extract_customers
│ │ │
▼ ▼ ▼
transform_sales transform_inventory transform_customers
│ │ │
└──────────────────┼────────────────────┘
▼
aggregate_metrics
│
▼
generate_insights
The handlers are just as concise as the e-commerce ones:
from app.services import data_pipeline as svc
from app.services.types import (
DataPipelineInput,
PipelineExtractSalesResult,
PipelineTransformSalesResult,
PipelineTransformInventoryResult,
PipelineTransformCustomersResult,
PipelineAggregateMetricsResult,
)
# Extract — no dependencies, runs in parallel
@step_handler("extract_sales_data")
@inputs(DataPipelineInput)
def extract_sales_data(inputs: DataPipelineInput, context: StepContext):
return svc.extract_sales_data(
source=inputs.source,
date_range_start=inputs.date_range_start,
date_range_end=inputs.date_range_end,
granularity=inputs.granularity,
)
# Transform — depends on one extract branch
@step_handler("transform_sales")
@depends_on(sales_data=("extract_sales_data", PipelineExtractSalesResult))
def transform_sales(sales_data: PipelineExtractSalesResult, context: StepContext):
return svc.transform_sales(sales_data=sales_data)
# Aggregate — converges three transform branches
@step_handler("aggregate_metrics")
@depends_on(
    sales_transform=("transform_sales", PipelineTransformSalesResult),
    inventory_transform=("transform_inventory", PipelineTransformInventoryResult),
    customers_transform=("transform_customers", PipelineTransformCustomersResult),
)
def aggregate_metrics(
    sales_transform: PipelineTransformSalesResult,
    inventory_transform: PipelineTransformInventoryResult,
    customers_transform: PipelineTransformCustomersResult,
    context: StepContext,
):
    return svc.aggregate_metrics(
        sales_transform=sales_transform,
        inventory_transform=inventory_transform,
        customers_transform=customers_transform,
    )
Eight handlers, eight service delegations. The pipeline DAG runs three extract steps in parallel, feeds each into a transform, then converges at aggregation and insight generation.
Error Handling
Raise PermanentError or RetryableError from your handler or service functions:
from tasker_core.errors import PermanentError, RetryableError
# Non-retryable validation failure
raise PermanentError(
message="Payment declined: insufficient funds",
error_code="PAYMENT_DECLINED",
)
# Retryable transient failure
raise RetryableError(
message="Payment gateway returned an error, will retry",
error_code="GATEWAY_ERROR",
)
Pydantic @model_validator errors are also caught and converted to PermanentError automatically — invalid input data won’t be retried.
Testing
DSL handlers are plain functions — test them by calling the function directly with mocked inputs:
from unittest.mock import MagicMock, patch
def test_validate_cart():
context = MagicMock()
# Mock the inputs that @inputs would inject
inputs = EcommerceOrderInput(
cart_items=[{"sku": "SKU-001", "name": "Widget", "quantity": 2, "unit_price": 29.99}]
)
with patch("app.services.ecommerce.validate_cart_items") as mock_svc:
mock_svc.return_value = {"validated_items": [], "total": 64.79}
result = validate_cart(inputs=inputs, context=context)
mock_svc.assert_called_once()
assert result["total"] == 64.79
For handlers with dependencies, construct the result models directly:
def test_create_order():
context = MagicMock()
cart = EcommerceValidateCartResult(total=64.79, validated_items=[])
payment = EcommerceProcessPaymentResult(payment_id="pay_abc", transaction_id="txn_xyz")
inventory = EcommerceUpdateInventoryResult(inventory_log_id="log_123")
inputs = EcommerceOrderInput(customer_email="test@example.com")
with patch("app.services.ecommerce.create_order") as mock_svc:
mock_svc.return_value = {"order_id": "ORD-001"}
result = create_order(
cart_result=cart, payment_result=payment,
inventory_result=inventory, inputs=inputs, context=context,
)
assert result["order_id"] == "ORD-001"
Because handlers delegate to service functions, you can also test the services directly without any Tasker infrastructure.
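For example, a service-level test needs no mocks and no Tasker imports at all. The validate_cart_items below is a simplified stand-in for illustration, not the example app's implementation:

```python
def validate_cart_items(cart_items):
    # Simplified stand-in service: validate and price a cart.
    if not cart_items:
        raise ValueError("Cart cannot be empty")
    subtotal = sum(i["unit_price"] * i["quantity"] for i in cart_items)
    tax = round(subtotal * 0.08, 2)
    return {"subtotal": subtotal, "tax": tax, "total": round(subtotal + tax, 2)}

def test_validate_cart_items_totals():
    result = validate_cart_items(
        [{"sku": "SKU-001", "unit_price": 25.00, "quantity": 2}]
    )
    assert result == {"subtotal": 50.0, "tax": 4.0, "total": 54.0}

test_validate_cart_items_totals()
```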
Handler Variants
API Handler
Adds HTTP client methods with built-in error classification. Currently uses the class-based pattern with APIMixin. See Class-Based Handlers — API Handler.
Decision Handler
Adds workflow routing. The DSL provides @decision_handler:
from tasker_core.step_handler.functional import decision_handler
@decision_handler("order_routing")
def order_routing(context: StepContext):
order_type = context.get_input("order_type")
if order_type == "premium":
return ["validate_premium", "process_premium"]
return ["standard_processing"]
See Conditional Workflows for decision handler patterns.
Batchable Handler
Adds batch processing for splitting large workloads. Uses the class-based pattern due to its stateful nature (cursor management, batch context). See Class-Based Handlers — Batchable Handler and Batch Processing.
Class-Based Alternative
If you prefer class inheritance, all handler types support a class-based pattern where you extend StepHandler and implement call(context). See Class-Based Handlers for the full reference.
Next Steps
- Your First Workflow — Build a multi-step DAG end-to-end
- Architecture — System design details
- FastAPI example app — Complete working implementation
Building with TypeScript
This guide covers building Tasker step handlers with TypeScript using the
@tasker-systems/tasker package in a Bun application.
Quick Start
Install the package:
bun add @tasker-systems/tasker
# Or with npm
npm install @tasker-systems/tasker
Generate a step handler with tasker-ctl:
tasker-ctl template generate step_handler \
--language typescript \
--param name=ValidateCart
This creates a DSL-style handler with typed inputs that delegates to a service function.
Writing a Handler (DSL)
Every handler follows the three-layer pattern: type definition, handler declaration, service delegation.
// src/services/types.ts — the contract
export interface CartItem {
sku: string;
name: string;
price: number;
quantity: number;
}
// src/handlers/ecommerce.ts — the handler
import { defineHandler } from '@tasker-systems/tasker';
import type { CartItem } from '../services/types';
import * as svc from '../services/ecommerce';
export const ValidateCartHandler = defineHandler(
'Ecommerce.StepHandlers.ValidateCartHandler',
{ inputs: { cartItems: 'cart_items' } },
async ({ cartItems }) => svc.validateCartItems(cartItems as CartItem[] | undefined),
);
The defineHandler factory registers a handler by name. The inputs config maps camelCase parameter names to snake_case YAML field names. The async callback receives typed arguments and delegates to a service function.
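Conceptually, the inputs config is a field-picking map. A simplified sketch of the idea (illustrative only — defineHandler's real implementation differs):

```typescript
type InputMap = Record<string, string>;

// Pick snake_case fields out of the task context and expose them to the
// callback under the camelCase names the handler declared.
function extractInputs(
  context: Record<string, unknown>,
  map: InputMap,
): Record<string, unknown> {
  const args: Record<string, unknown> = {};
  for (const [param, field] of Object.entries(map)) {
    args[param] = context[field];
  }
  return args;
}

const args = extractInputs(
  { cart_items: [{ sku: 'SKU-001' }], customer_email: 'a@example.com' },
  { cartItems: 'cart_items' },
);
console.log(args); // { cartItems: [ { sku: 'SKU-001' } ] }
```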
Type System
TypeScript handlers use standard interfaces for both input and result types. The DSL injects values from the task context and upstream results as plain objects matching these interfaces.
Input types:
export interface CartItem {
sku: string;
name: string;
price: number;
quantity: number;
}
export interface PaymentInfo {
method: string;
card_last_four?: string;
token: string;
amount: number;
}
Result types describe what a handler returns (used by downstream depends):
export interface EcommerceValidateCartResult {
[key: string]: unknown;
validated_items: CartItem[];
item_count: number;
subtotal: number;
tax: number;
total: number;
}
export interface EcommerceProcessPaymentResult {
[key: string]: unknown;
payment_id: string;
transaction_id: string;
amount_charged: number;
status: string;
}
The [key: string]: unknown index signature allows result objects to carry additional fields without type errors when accessing known fields.
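A quick illustration: with the index signature, a result object can carry fields beyond those declared, while the declared fields keep their precise types:

```typescript
interface ValidateCartResult {
  [key: string]: unknown;
  total: number;
  item_count: number;
}

// Extra fields (e.g. promo_applied) are allowed by the index signature;
// known fields like total stay strongly typed.
const result: ValidateCartResult = {
  total: 64.79,
  item_count: 2,
  promo_applied: true,
};

const doubled: number = result.total * 2;       // typed as number
const promo = result.promo_applied as boolean;  // extra field: unknown until narrowed
console.log(doubled, promo); // 129.58 true
```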
Accessing Task Context
The inputs config in defineHandler extracts fields from the task context. Each entry maps a camelCase parameter name to the snake_case field name in the submitted JSON:
export const ValidateCartHandler = defineHandler(
'Ecommerce.StepHandlers.ValidateCartHandler',
{ inputs: { cartItems: 'cart_items' } },
async ({ cartItems }) => svc.validateCartItems(cartItems as CartItem[] | undefined),
);
The callback receives cartItems directly — no need to parse raw JSON.
Working with Dependencies
The depends config injects results from upstream steps. Each entry maps a camelCase parameter name to the upstream step name:
export const ProcessPaymentHandler = defineHandler(
'Ecommerce.StepHandlers.ProcessPaymentHandler',
{
depends: { cartResult: 'validate_cart' },
inputs: { paymentInfo: 'payment_info' },
},
async ({ cartResult, paymentInfo }) =>
svc.processPayment(
cartResult as Record<string, unknown>,
paymentInfo as PaymentInfo | undefined,
),
);
Handlers can reference any ancestor step in the DAG — not just direct predecessors. Here’s a convergence handler that accesses three upstream steps plus task inputs:
export const CreateOrderHandler = defineHandler(
'Ecommerce.StepHandlers.CreateOrderHandler',
{
depends: {
cartResult: 'validate_cart',
paymentResult: 'process_payment',
inventoryResult: 'update_inventory',
},
inputs: { customerEmail: 'customer_email' },
},
async ({ cartResult, paymentResult, inventoryResult, customerEmail }) =>
svc.createOrder(
cartResult as Record<string, unknown>,
paymentResult as Record<string, unknown>,
inventoryResult as Record<string, unknown>,
customerEmail as string | undefined,
),
);
Multi-Step Example: Data Pipeline
The data pipeline workflow demonstrates a parallel DAG — three independent extract branches, each feeding its own transform, converging at aggregation:
extract_sales extract_inventory extract_customers
│ │ │
▼ ▼ ▼
transform_sales transform_inventory transform_customers
│ │ │
└──────────────────┼────────────────────┘
▼
aggregate_metrics
│
▼
generate_insights
TypeScript handlers follow the same concise pattern:
import { defineHandler } from '@tasker-systems/tasker';
import type {
PipelineExtractSalesResult,
PipelineTransformSalesResult,
PipelineTransformInventoryResult,
PipelineTransformCustomersResult,
} from '../services/types';
import * as svc from '../services/data_pipeline';
// Extract — no dependencies, runs in parallel
export const ExtractSalesDataHandler = defineHandler(
'DataPipeline.StepHandlers.ExtractSalesDataHandler',
{ inputs: { source: 'source', dateRangeStart: 'date_range_start' } },
async ({ source, dateRangeStart }) =>
svc.extractSalesData(source as string, dateRangeStart as string | undefined),
);
// Transform — depends on one extract branch
export const TransformSalesHandler = defineHandler(
'DataPipeline.StepHandlers.TransformSalesHandler',
{ depends: { salesData: 'extract_sales_data' } },
async ({ salesData }) =>
svc.transformSales(salesData as PipelineExtractSalesResult),
);
// Aggregate — converges three transform branches
export const AggregateMetricsHandler = defineHandler(
'DataPipeline.StepHandlers.AggregateMetricsHandler',
{
depends: {
salesTransform: 'transform_sales',
inventoryTransform: 'transform_inventory',
customersTransform: 'transform_customers',
},
},
async ({ salesTransform, inventoryTransform, customersTransform }) =>
svc.aggregateMetrics(
salesTransform as PipelineTransformSalesResult,
inventoryTransform as PipelineTransformInventoryResult,
customersTransform as PipelineTransformCustomersResult,
),
);
Error Handling
Throw PermanentError or RetryableError from your handler or service functions:
import { PermanentError, RetryableError } from '@tasker-systems/tasker';
// Non-retryable validation failure
throw new PermanentError('Payment declined: insufficient funds', 'PAYMENT_DECLINED');
// Retryable transient failure
throw new RetryableError('Payment gateway temporarily unavailable', 'GATEWAY_ERROR');
Testing
DSL handlers are exported constants — the handler callback is a plain async function, so the most direct tests exercise the service functions it delegates to:
import { describe, it, expect, vi } from 'vitest';
import * as svc from '../services/ecommerce';
describe('validateCartItems', () => {
  it('validates items and computes totals', async () => {
    const cartItems = [{ sku: 'SKU-001', name: 'Widget', price: 29.99, quantity: 2 }];
    const result = await svc.validateCartItems(cartItems);
    expect(result.validated_items).toHaveLength(1);
    expect(result.total).toBeGreaterThan(0);
  });
});
For handlers with dependencies, test the service functions with typed arguments:
describe('createOrder service', () => {
  it('creates an order from upstream results', async () => {
    const result = await svc.createOrder(
      { total: 64.79, validated_items: [] },
      { payment_id: 'pay_abc', transaction_id: 'txn_xyz' },
      { inventory_log_id: 'log_123' },
      'test@example.com',
    );
    expect(result.order_id).toBeDefined();
  });
});
Because handlers delegate to service functions, you can test the services directly without any Tasker infrastructure.
Handler Variants
API Handler
Adds HTTP client methods with error classification using the native fetch API. Extends ApiHandler with the class-based pattern. See Class-Based Handlers — API Handler.
Decision Handler
Adds workflow routing with decisionSuccess() for activating downstream step sets. Extends DecisionHandler with the class-based pattern. See Conditional Workflows.
Batchable Handler
Adds batch processing with Analyzer/Worker pattern using BatchableStepHandler. See Class-Based Handlers — Batchable Handler and Batch Processing.
Class-Based Alternative
If you prefer class inheritance, all handler types support a class-based pattern where you extend StepHandler and implement async call(context). See Class-Based Handlers for the full reference.
Next Steps
- Your First Workflow — Build a multi-step DAG end-to-end
- Architecture — System design details
- Bun example app — Complete working implementation
Building with Rust
This guide covers building Tasker step handlers with native Rust using the
tasker-worker and tasker-shared crates in an Axum application.
Quick Start
Add dependencies to your Cargo.toml:
[dependencies]
tasker-worker = "0.1"
tasker-shared = "0.1"
tasker-client = "0.1"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
async-trait = "0.1"
uuid = { version = "1", features = ["v4"] }
Generate a step handler with tasker-ctl:
tasker-ctl template generate step_handler \
--language rust \
--param name=ValidateCart
This creates a handler implementing the StepHandler trait.
No DSL equivalent: Rust uses the StepHandler trait directly — the trait IS the pattern. Python, Ruby, and TypeScript have DSL wrappers for ergonomics, but Rust’s trait system provides the same structure natively. For the DSL approach in other languages, see Python, Ruby, or TypeScript.
Writing a Step Handler
The Axum example app uses plain functions wrapped in a handler registry. This is the recommended pattern for Rust handlers — write business logic as standalone functions, then register them with the worker.
Standalone Functions
Each handler function takes the task context and/or dependency results and returns Result<Value, String>:
#![allow(unused)]
fn main() {
use serde_json::{json, Value};
use std::collections::HashMap;
/// Step 1: Validate cart items, calculate totals.
/// No dependencies — receives task context only.
pub fn validate_cart(context: &Value) -> Result<Value, String> {
let cart_items: Vec<CartItem> =
serde_json::from_value(context.get("cart_items").cloned().unwrap_or(json!([])))
.map_err(|e| format!("Invalid cart_items format: {}", e))?;
if cart_items.is_empty() {
return Err("Cart cannot be empty".to_string());
}
// Business logic: validate items, calculate pricing...
let mut subtotal = 0.0_f64;
for item in &cart_items {
subtotal += item.price * item.quantity as f64;
}
let tax = (subtotal * 0.08 * 100.0).round() / 100.0;
let total = ((subtotal + tax) * 100.0).round() / 100.0;
Ok(json!({
"validated_items": cart_items,
"subtotal": subtotal,
"tax": tax,
"total": total,
"item_count": cart_items.len()
}))
}
}
Functions with Dependencies
Handlers that need upstream step results receive a HashMap<String, Value>:
#![allow(unused)]
fn main() {
/// Step 2: Process payment using cart total from validate_cart.
pub fn process_payment(
context: &Value,
dependency_results: &HashMap<String, Value>,
) -> Result<Value, String> {
let token = context
.get("payment_token")
.and_then(|v| v.as_str())
.unwrap_or("tok_test_success");
let cart_result = dependency_results
.get("validate_cart")
.ok_or("Missing validate_cart dependency result")?;
let cart_total = cart_result
.get("total")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
// Business logic: process payment...
Ok(json!({
"transaction_id": format!("txn_{}", uuid::Uuid::new_v4()),
"amount_charged": cart_total,
"status": "completed"
}))
}
}
Some handlers only need dependency results (no task context):
#![allow(unused)]
fn main() {
/// Step 3: Reserve inventory based on validated cart items.
pub fn update_inventory(
dependency_results: &HashMap<String, Value>,
) -> Result<Value, String> {
let cart_result = dependency_results
.get("validate_cart")
.ok_or("Missing validate_cart dependency result")?;
let validated_items = cart_result
.get("validated_items")
.and_then(|v| v.as_array())
.ok_or("Missing validated_items in cart result")?;
// Business logic: create inventory reservations...
Ok(json!({
"total_items_reserved": validated_items.len(),
"status": "reserved"
}))
}
}
Handler Registry
Plain functions are bridged to the StepHandler trait via a FunctionHandler wrapper and registered in a StepHandlerRegistry. The registry matches the handler.callable field from task template YAML:
#![allow(unused)]
fn main() {
use std::sync::{Arc, RwLock};
use std::collections::HashMap;
use async_trait::async_trait;
use tasker_worker::worker::handlers::{StepHandler, StepHandlerRegistry};
use tasker_shared::types::base::TaskSequenceStep;
use tasker_shared::messaging::StepExecutionResult;
pub struct AxumHandlerRegistry {
handlers: RwLock<HashMap<String, Arc<dyn StepHandler>>>,
}
impl AxumHandlerRegistry {
pub fn new() -> Self {
let registry = Self { handlers: RwLock::new(HashMap::new()) };
// Register all handlers — callable names match YAML handler.callable
registry.register_fn("ecommerce_validate_cart",
Box::new(|ctx, _deps| handlers::ecommerce::validate_cart(ctx)));
registry.register_fn("ecommerce_process_payment",
Box::new(|ctx, deps| handlers::ecommerce::process_payment(ctx, deps)));
// ... more handlers
registry
}
}
}
The FunctionHandler wrapper extracts context and dependency results from the TaskSequenceStep and calls the plain function:
#![allow(unused)]
fn main() {
#[async_trait]
impl StepHandler for FunctionHandler {
async fn call(&self, step: &TaskSequenceStep) -> TaskerResult<StepExecutionResult> {
    let started_at = std::time::Instant::now();
    let context = step.task.task.context.clone()
        .unwrap_or_else(|| Value::Object(Default::default()));
    let dep_results: HashMap<String, Value> = step.dependency_results
        .iter()
        .map(|(name, result)| (name.clone(), result.result.clone()))
        .collect();
    let outcome = (self.handler_fn)(&context, &dep_results);
    let elapsed_ms = started_at.elapsed().as_millis() as u64;
    match outcome {
Ok(result) => Ok(StepExecutionResult::success(
step.workflow_step.workflow_step_uuid,
result, elapsed_ms, None,
)),
Err(err) => Ok(StepExecutionResult::failure(
step.workflow_step.workflow_step_uuid,
err, None, None, false, elapsed_ms, None,
)),
}
}
}
}
Accessing Task Context
In standalone functions, context is a &Value — use serde_json accessors:
#![allow(unused)]
fn main() {
// Read a string field with a default
let customer_email = context
.get("customer_email")
.and_then(|v| v.as_str())
.unwrap_or("unknown@example.com");
// Deserialize into a typed struct
#[derive(Debug, Deserialize)]
struct CartItem {
product_id: i64,
quantity: i64,
}
let cart_items: Vec<CartItem> =
serde_json::from_value(context.get("cart_items").cloned().unwrap_or(json!([])))
.map_err(|e| format!("Invalid cart_items: {}", e))?;
}
Accessing Dependency Results
Dependency results are a HashMap<String, Value> mapping step names to their result JSON:
#![allow(unused)]
fn main() {
// Get a single upstream result
let cart_result = dependency_results
.get("validate_cart")
.ok_or("Missing validate_cart dependency")?;
let total = cart_result.get("total").and_then(|v| v.as_f64()).unwrap_or(0.0);
// Combine results from multiple upstream steps (convergence)
let payment_result = dependency_results
.get("process_payment")
.ok_or("Missing process_payment dependency")?;
let inventory_result = dependency_results
.get("update_inventory")
.ok_or("Missing update_inventory dependency")?;
}
Error Handling
Return Err(String) from standalone functions. The FunctionHandler wrapper converts errors to StepExecutionResult::failure:
#![allow(unused)]
fn main() {
// Validation error (permanent — will not retry)
if cart_items.is_empty() {
return Err("Cart cannot be empty".to_string());
}
// Business logic error with context
return Err(format!(
"Insufficient stock for {}: requested {}, available {}",
product.name, requested, available
));
}
For finer control over retryability, use StepExecutionResult directly in a StepHandler implementation:
#![allow(unused)]
fn main() {
// Non-retryable error
Ok(StepExecutionResult::failure(
step.workflow_step.workflow_step_uuid,
"Invalid order data".to_string(),
Some("VALIDATION_ERROR".to_string()),
Some("ValidationError".to_string()),
false, // not retryable
duration_ms,
None,
))
// Retryable transient error
Ok(StepExecutionResult::failure(
step.workflow_step.workflow_step_uuid,
"Payment gateway unreachable".to_string(),
Some("GATEWAY_ERROR".to_string()),
Some("NetworkError".to_string()),
true, // retryable
duration_ms,
None,
))
}
Task Template Configuration
Generate a task template with tasker-ctl:
tasker-ctl template generate task_template \
--language rust \
--param name=EcommerceOrderProcessing \
--param namespace=ecommerce \
--param handler_callable=ecommerce_validate_cart
Rust handler callables use the snake_case names registered in the handler registry (e.g., ecommerce_validate_cart, ecommerce_process_payment).
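A step entry in the template YAML then references that registered name. A minimal illustrative fragment — only the handler.callable field is confirmed above; the surrounding field names are assumptions, so consult the task template reference for the full schema:

```yaml
steps:
  - name: validate_cart
    handler:
      callable: ecommerce_validate_cart
  - name: process_payment
    handler:
      callable: ecommerce_process_payment
    depends_on:
      - validate_cart
```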
Testing
Test standalone handler functions directly with serde_json values:
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
use super::*;
use serde_json::json;
use std::collections::HashMap;
#[test]
fn test_validate_cart_success() {
let context = json!({
"cart_items": [
{"product_id": 1, "quantity": 2},
{"product_id": 2, "quantity": 1}
]
});
let result = validate_cart(&context).unwrap();
assert!(result.get("total").unwrap().as_f64().unwrap() > 0.0);
}
#[test]
fn test_validate_cart_empty() {
let context = json!({"cart_items": []});
let result = validate_cart(&context);
assert!(result.is_err());
assert!(result.unwrap_err().contains("empty"));
}
#[test]
fn test_process_payment_with_dependency() {
let context = json!({
"payment_token": "tok_test_success"
});
let mut deps = HashMap::new();
deps.insert("validate_cart".to_string(), json!({
"total": 64.79,
"validated_items": []
}));
let result = process_payment(&context, &deps).unwrap();
assert_eq!(result["status"], "completed");
assert_eq!(result["amount_charged"], 64.79);
}
}
}
Capability Traits
Beyond the base StepHandler, the worker crate defines capability traits in handler_capabilities.rs for specialized patterns:
| Trait | What it provides |
|---|---|
| APICapable | HTTP client methods with retryable/permanent error classification |
| DecisionCapable | Workflow routing via step activation |
| BatchableCapable | Cursor-based parallel batch processing |
A Rust handler implements StepHandler and adds any capability traits it needs — this is idiomatic Rust trait composition. For a complex example combining multiple capabilities, see diamond_decision_batch.rs in the Rust worker crate.
The Rust batch_processing module is the foundation that Python, Ruby, and TypeScript access through FFI. The specialized handler types in those languages are ergonomic wrappers — Rust developers work with the underlying traits directly.
Next Steps
- Your First Workflow — Build a multi-step DAG end-to-end
- Architecture — System design details
- Axum example app — Complete working implementation
Next Steps
You’ve built your first handler and workflow. Here’s where to go from here.
See Complete Applications
The Example Apps demonstrate five real-world workflow patterns implemented in all four languages. Each app is a fully working project you can clone, run, and study.
Learn Through Stories
The Engineering Stories series teaches Tasker concepts through progressive scenarios — from a simple e-commerce checkout to multi-team namespace isolation. Each story builds on the previous one.
Go Deeper
Architecture & Design
- Architecture Overview — System design, lifecycle actors, and DAG execution
- Worker Architecture — How workers process steps across languages
- Event System — Pub/sub and observability events
Operations
- Backpressure Architecture — Monitor and tune backpressure
- Configuration Management — Database connection and pool management
- Observability — Metrics, tracing, and logging
- Auth & Security — Authentication and authorization
Reference
- Configuration Reference — All configuration options
- API Reference — REST and gRPC API documentation
- Handler Types — API, Decision, and Batchable handler patterns
CLI Tooling
- Using tasker-ctl — Initialize projects, generate scaffolding, manage remotes
- tasker-ctl Architecture — How the CLI tool is built
Contributing
- Contributing to Tasker — Development setup, workflow, and PR process for tasker-core and tasker-contrib
- GitHub Issues — Report bugs or request features
Tasker Core Architecture
This directory contains architectural reference documentation describing how Tasker Core’s components work together.
Documents
| Document | Description |
|---|---|
| Crate Architecture | Workspace structure and crate responsibilities |
| Messaging Abstraction | Provider-agnostic messaging (PGMQ, RabbitMQ) |
| Actors | Actor-based orchestration lifecycle components |
| Worker Actors | Actor pattern for worker step execution |
| Worker Event Systems | Dual-channel event architecture for workers |
| States and Lifecycles | Dual state machine architecture (Task + Step) |
| Events and Commands | Event-driven coordination patterns |
| Domain Events | Business event publishing (durable/fast/broadcast) |
| Idempotency and Atomicity | Defense-in-depth guarantees |
| Backpressure Architecture | Unified resilience and flow control |
| Circuit Breakers | Fault isolation and cascade prevention |
| Deployment Patterns | Hybrid, EventDriven, PollingOnly modes; PGMQ/RabbitMQ backends |
| Tasker CLI | CLI architecture, plugin system, template engine, output styling |
When to Read These
- Designing new features: Understand how components interact
- Debugging flow issues: Trace message paths through actors
- Understanding trade-offs: See why patterns were chosen
- Onboarding: Build mental model of the system
Related Documentation
- Principles - The “why” behind architectural decisions
- Guides - Practical “how-to” documentation
- CHRONOLOGY - Historical context for decisions
Actor-Based Architecture
Last Updated: 2025-12-04 · Audience: Architects, Developers · Status: Active · Related Docs: Documentation Hub | Worker Actor Architecture | Events and Commands | States and Lifecycles
This document provides comprehensive documentation of the actor-based architecture in tasker-core, covering the lightweight Actor pattern that formalizes the relationship between Commands and Lifecycle Components. This architecture replaces imperative delegation with message-based actor coordination.
Overview
The tasker-core system implements a lightweight Actor pattern inspired by frameworks like Actix, but designed specifically for our orchestration needs without external dependencies. The architecture provides:
- Actor Abstraction: Lifecycle components encapsulated as actors with clear lifecycle hooks
- Message-Based Communication: Type-safe message handling via the Handler<M> trait
- Central Registry: ActorRegistry for managing all orchestration actors
- Service Decomposition: Focused components following single responsibility principle
- Direct Integration: Command processor calls actors directly without wrapper layers
This architecture eliminates inconsistencies in lifecycle component initialization, provides type-safe message handling, and creates a clear separation between command processing and business logic execution.
Implementation Status
All phases implemented and production-ready: core abstractions, all 4 primary actors, message hydration, module reorganization, service decomposition, and direct actor integration.
Core Concepts
What is an Actor?
In the tasker-core context, an Actor is an encapsulated lifecycle component that:
- Manages its own state: Each actor owns its dependencies and configuration
- Processes messages: Responds to typed command messages via the Handler<M> trait
- Has lifecycle hooks: Initialization (started) and cleanup (stopped) methods
- Is isolated: Actors communicate through message passing
- Is thread-safe: All actors are Send + Sync + 'static
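The Send + Sync + 'static bound is what lets a single actor instance serve many threads. A self-contained sketch of why that matters (illustrative only — real orchestration actors receive typed messages via Handler<M>):

```rust
use std::sync::Arc;
use std::thread;

// A trivial stand-in actor: immutable state, so it is Send + Sync + 'static
// and can be shared across worker threads behind an Arc.
struct EchoActor {
    name: &'static str,
}

impl EchoActor {
    fn handle(&self, msg: &str) -> String {
        format!("{}: {}", self.name, msg)
    }
}

fn main() {
    let actor = Arc::new(EchoActor { name: "EchoActor" });
    // Four threads share the same actor instance concurrently.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let actor = Arc::clone(&actor);
            thread::spawn(move || actor.handle(&format!("msg-{i}")))
        })
        .collect();
    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```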
Why Actors?
The previous architecture had several inconsistencies:
#![allow(unused)]
fn main() {
// OLD: Inconsistent initialization patterns
pub struct TaskInitializer {
// Constructor pattern
}
pub struct TaskFinalizer {
// Builder pattern with new()
}
pub struct StepEnqueuer {
// Factory pattern with create()
}
}
The actor pattern provides consistency:
#![allow(unused)]
fn main() {
// NEW: Consistent actor pattern
impl OrchestrationActor for TaskRequestActor {
fn name(&self) -> &'static str { "TaskRequestActor" }
fn context(&self) -> &Arc<SystemContext> { &self.context }
fn started(&mut self) -> TaskerResult<()> { /* initialization */ }
fn stopped(&mut self) -> TaskerResult<()> { /* cleanup */ }
}
}
Actor vs Service
Services (underlying business logic):
- Encapsulate business logic
- Stateless operations on domain models
- Direct method invocation
- Examples: TaskFinalizer, StepEnqueuerService, OrchestrationResultProcessor
Actors (message-based coordination):
- Wrap services with message-based interface
- Manage service lifecycle
- Asynchronous message handling
- Examples: TaskRequestActor, ResultProcessorActor, StepEnqueuerActor, TaskFinalizerActor
The relationship:
#![allow(unused)]
fn main() {
pub struct TaskFinalizerActor {
context: Arc<SystemContext>,
service: TaskFinalizer, // Wraps underlying service
}
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
type Response = FinalizationResult;
async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
// Delegates to service
self.service.finalize_task(msg.task_uuid).await
}
}
}
Actor Traits
OrchestrationActor Trait
The base trait for all orchestration actors, defined in tasker-orchestration/src/actors/traits.rs:
#![allow(unused)]
fn main() {
/// Base trait for all orchestration actors
///
/// Provides lifecycle management and context access for all actors in the
/// orchestration system. All actors must implement this trait to participate
/// in the actor registry and lifecycle management.
///
/// # Lifecycle
///
/// 1. **Construction**: Actor is created by ActorRegistry
/// 2. **Initialization**: `started()` is called during registry build
/// 3. **Operation**: Actor processes messages via `Handler<M>` implementations
/// 4. **Shutdown**: `stopped()` is called during registry shutdown
pub trait OrchestrationActor: Send + Sync + 'static {
/// Returns the unique name of this actor
///
/// Used for logging, metrics, and debugging. Should be a static string
/// that clearly identifies the actor's purpose.
fn name(&self) -> &'static str;
/// Returns a reference to the system context
///
/// Provides access to database pool, configuration, and other
/// framework-level resources.
fn context(&self) -> &Arc<SystemContext>;
/// Called when the actor is started
///
/// Perform any initialization work here, such as:
/// - Setting up database connections
/// - Loading configuration
/// - Initializing caches
///
/// # Errors
///
/// Return an error if initialization fails. The actor will not be
/// registered and the system will fail to start.
fn started(&mut self) -> TaskerResult<()> {
tracing::info!(actor = %self.name(), "Actor started");
Ok(())
}
/// Called when the actor is stopped
///
/// Perform any cleanup work here, such as:
/// - Closing database connections
/// - Flushing caches
/// - Releasing resources
///
/// # Errors
///
/// Return an error if cleanup fails. Errors are logged but do not
/// prevent other actors from shutting down.
fn stopped(&mut self) -> TaskerResult<()> {
tracing::info!(actor = %self.name(), "Actor stopped");
Ok(())
}
}
}
Key Design Decisions:
- Send + Sync + 'static: Enables actors to be shared across threads
- Default lifecycle hooks: Actors only override when needed
- Context injection: All actors have access to SystemContext
- Error handling: Lifecycle failures are TaskerResult for proper error propagation
Handler<M> Trait
The message handling trait, enabling type-safe message processing:
#![allow(unused)]
fn main() {
/// Message handler trait for specific message types
///
/// Actors implement `Handler<M>` for each message type they can process.
/// This provides type-safe, asynchronous message handling with clear
/// input/output contracts.
#[async_trait]
pub trait Handler<M: Message>: OrchestrationActor {
/// The response type returned by this handler
type Response: Send;
/// Handle a message asynchronously
///
/// Process the message and return a response. This method should be
/// idempotent where possible and handle errors gracefully.
async fn handle(&self, msg: M) -> TaskerResult<Self::Response>;
}
}
Key Design Decisions:
- async_trait: All message handling is asynchronous
- Type safety: Message and Response types are checked at compile time
- Multiple implementations: Actor can implement Handler<M> for multiple message types
- Error propagation: TaskerResult ensures proper error handling
Message Trait
The marker trait for command messages:
#![allow(unused)]
fn main() {
/// Marker trait for command messages
///
/// All messages sent to actors must implement this trait. The associated
/// `Response` type defines what the handler will return.
pub trait Message: Send + 'static {
/// The response type for this message
type Response: Send;
}
}
Key Design Decisions:
- Marker trait: No methods, just type constraints
- Associated type: Response type is part of the message definition
- Send + 'static: Enables messages to cross thread boundaries
ActorRegistry
The central registry managing all orchestration actors, defined in tasker-orchestration/src/actors/registry.rs:
Purpose
The ActorRegistry serves as:
- Single Source of Truth: All actors are registered here
- Lifecycle Manager: Handles initialization and shutdown
- Dependency Injection: Provides SystemContext to all actors
- Type-Safe Access: Strongly-typed access to each actor
Structure
#![allow(unused)]
fn main() {
/// Registry managing all orchestration actors
///
/// The ActorRegistry holds Arc references to all actors in the system,
/// providing centralized access and lifecycle management.
#[derive(Clone)]
pub struct ActorRegistry {
/// System context shared by all actors
context: Arc<SystemContext>,
/// Task request actor for processing task initialization requests
pub task_request_actor: Arc<TaskRequestActor>,
/// Result processor actor for processing step execution results
pub result_processor_actor: Arc<ResultProcessorActor>,
/// Step enqueuer actor for batch processing ready tasks
pub step_enqueuer_actor: Arc<StepEnqueuerActor>,
/// Task finalizer actor for task finalization with atomic claiming
pub task_finalizer_actor: Arc<TaskFinalizerActor>,
}
}
Initialization
The build() method creates and initializes all actors:
#![allow(unused)]
fn main() {
impl ActorRegistry {
pub async fn build(context: Arc<SystemContext>) -> TaskerResult<Self> {
tracing::info!("Building ActorRegistry with actors");
// Create shared StepEnqueuerService (used by multiple actors)
let task_claim_step_enqueuer = StepEnqueuerService::new(context.clone()).await?;
let task_claim_step_enqueuer = Arc::new(task_claim_step_enqueuer);
// Create TaskRequestActor and its dependencies
let task_initializer = Arc::new(TaskInitializer::new(
context.clone(),
task_claim_step_enqueuer.clone(),
));
let task_request_processor = Arc::new(TaskRequestProcessor::new(
context.message_client.clone(),
context.task_handler_registry.clone(),
task_initializer,
TaskRequestProcessorConfig::default(),
));
let mut task_request_actor = TaskRequestActor::new(context.clone(), task_request_processor);
task_request_actor.started()?;
let task_request_actor = Arc::new(task_request_actor);
// Create ResultProcessorActor and its dependencies
let task_finalizer = TaskFinalizer::new(context.clone(), task_claim_step_enqueuer.clone());
let result_processor = Arc::new(OrchestrationResultProcessor::new(
task_finalizer,
context.clone(),
));
let mut result_processor_actor =
ResultProcessorActor::new(context.clone(), result_processor);
result_processor_actor.started()?;
let result_processor_actor = Arc::new(result_processor_actor);
// Create StepEnqueuerActor using shared StepEnqueuerService
let mut step_enqueuer_actor =
StepEnqueuerActor::new(context.clone(), task_claim_step_enqueuer.clone());
step_enqueuer_actor.started()?;
let step_enqueuer_actor = Arc::new(step_enqueuer_actor);
// Create TaskFinalizerActor using shared StepEnqueuerService
let task_finalizer = TaskFinalizer::new(context.clone(), task_claim_step_enqueuer.clone());
let mut task_finalizer_actor = TaskFinalizerActor::new(context.clone(), task_finalizer);
task_finalizer_actor.started()?;
let task_finalizer_actor = Arc::new(task_finalizer_actor);
tracing::info!("✅ ActorRegistry built successfully with 4 actors");
Ok(Self {
context,
task_request_actor,
result_processor_actor,
step_enqueuer_actor,
task_finalizer_actor,
})
}
}
}
Shutdown
The shutdown() method gracefully stops all actors:
#![allow(unused)]
fn main() {
impl ActorRegistry {
pub async fn shutdown(&mut self) {
tracing::info!("Shutting down ActorRegistry");
// Call stopped() on all actors in reverse initialization order
if let Some(actor) = Arc::get_mut(&mut self.task_finalizer_actor) {
if let Err(e) = actor.stopped() {
tracing::error!(error = %e, "Failed to stop TaskFinalizerActor");
}
}
if let Some(actor) = Arc::get_mut(&mut self.step_enqueuer_actor) {
if let Err(e) = actor.stopped() {
tracing::error!(error = %e, "Failed to stop StepEnqueuerActor");
}
}
if let Some(actor) = Arc::get_mut(&mut self.result_processor_actor) {
if let Err(e) = actor.stopped() {
tracing::error!(error = %e, "Failed to stop ResultProcessorActor");
}
}
if let Some(actor) = Arc::get_mut(&mut self.task_request_actor) {
if let Err(e) = actor.stopped() {
tracing::error!(error = %e, "Failed to stop TaskRequestActor");
}
}
tracing::info!("✅ ActorRegistry shutdown complete");
}
}
}
Implemented Actors
TaskRequestActor
Handles task initialization requests from external clients.
Location: tasker-orchestration/src/actors/task_request_actor.rs
Message: ProcessTaskRequestMessage
- Input: TaskRequestMessage with task details
- Response: Uuid of created task
Delegation: Wraps TaskRequestProcessor service
Purpose: Entry point for new workflow instances, coordinates task creation and initial step discovery.
ResultProcessorActor
Processes step execution results from workers.
Location: tasker-orchestration/src/actors/result_processor_actor.rs
Message: ProcessStepResultMessage
- Input: StepExecutionResult with execution outcome
- Response: () (unit type)
Delegation: Wraps OrchestrationResultProcessor service
Purpose: Handles step completion, coordinates task finalization when appropriate.
StepEnqueuerActor
Manages batch processing of ready tasks.
Location: tasker-orchestration/src/actors/step_enqueuer_actor.rs
Message: ProcessBatchMessage
- Input: Empty (uses system state)
- Response: StepEnqueuerServiceResult with batch stats
Delegation: Wraps StepEnqueuerService
Purpose: Discovers ready tasks and enqueues their executable steps.
TaskFinalizerActor
Handles task finalization with atomic claiming.
Location: tasker-orchestration/src/actors/task_finalizer_actor.rs
Message: FinalizeTaskMessage
- Input: task_uuid to finalize
- Response: FinalizationResult with action taken
Delegation: Wraps TaskFinalizer service (decomposed into focused components)
Purpose: Completes or fails tasks based on step execution results, prevents race conditions through atomic claiming.
Integration with Commands
Command Processor Integration
The command processor calls actors directly without intermediate wrapper layers:
#![allow(unused)]
fn main() {
// From: tasker-orchestration/src/orchestration/command_processor.rs
/// Handle task initialization using TaskRequestActor directly
async fn handle_initialize_task(
&self,
request: TaskRequestMessage,
) -> TaskerResult<TaskInitializeResult> {
// Direct actor-based task initialization
let msg = ProcessTaskRequestMessage { request };
let task_uuid = self.actors.task_request_actor.handle(msg).await?;
Ok(TaskInitializeResult::Success {
task_uuid,
message: "Task initialized successfully".to_string(),
})
}
/// Handle step result processing using ResultProcessorActor directly
async fn handle_process_step_result(
&self,
step_result: StepExecutionResult,
) -> TaskerResult<StepProcessResult> {
// Direct actor-based step result processing
let msg = ProcessStepResultMessage {
result: step_result.clone(),
};
match self.actors.result_processor_actor.handle(msg).await {
Ok(()) => Ok(StepProcessResult::Success {
message: format!(
"Step {} result processed successfully",
step_result.step_uuid
),
}),
Err(e) => Ok(StepProcessResult::Error {
message: format!("Failed to process step result: {e}"),
}),
}
}
/// Handle task finalization using TaskFinalizerActor directly
async fn handle_finalize_task(&self, task_uuid: Uuid) -> TaskerResult<TaskFinalizationResult> {
// Direct actor-based task finalization
let msg = FinalizeTaskMessage { task_uuid };
let result = self.actors.task_finalizer_actor.handle(msg).await?;
Ok(TaskFinalizationResult::Success {
task_uuid: result.task_uuid,
final_status: format!("{:?}", result.action),
completion_time: Some(chrono::Utc::now()),
})
}
}
Design Evolution: Initially planned to use lifecycle_services/ as a wrapper layer between command processor and actors. After implementing Phase 7 service decomposition, we found direct actor calls were simpler and cleaner, so we removed the intermediate layer.
Service Decomposition (Phase 7)
Large services (800-900 lines) were decomposed into focused components following single responsibility principle:
TaskFinalizer Decomposition
task_finalization/ (848 lines → 6 files)
├── mod.rs # Public API and types
├── service.rs # Main TaskFinalizer service (~200 lines)
├── completion_handler.rs # Task completion logic
├── event_publisher.rs # Lifecycle event publishing
├── execution_context_provider.rs # Context fetching
└── state_handlers.rs # State-specific handling
StepEnqueuerService Decomposition
step_enqueuer_services/ (781 lines → 4 files)
├── mod.rs # Public API
├── service.rs # Main service (~250 lines)
├── batch_processor.rs # Batch processing logic
└── state_handlers.rs # State-specific processing
ResultProcessor Decomposition
result_processing/ (889 lines → 5 files)
├── mod.rs # Public API
├── service.rs # Main processor
├── metadata_processor.rs # Metadata handling
├── error_handler.rs # Error processing
└── result_validator.rs # Result validation
Actor Lifecycle
Lifecycle Phases
┌─────────────────┐
│ Construction │ ActorRegistry::build() creates actor instances
└────────┬────────┘
│
▼
┌─────────────────┐
│ Initialization │ started() hook called on each actor
└────────┬────────┘
│
▼
┌─────────────────┐
│ Operation │ Actors process messages via `Handler<M>`::handle()
└────────┬────────┘
│
▼
┌─────────────────┐
│ Shutdown │ stopped() hook called on each actor (reverse order)
└─────────────────┘
Example Actor Implementation
#![allow(unused)]
fn main() {
use tasker_orchestration::actors::{OrchestrationActor, Handler, Message};
// Define the actor
pub struct TaskFinalizerActor {
context: Arc<SystemContext>,
service: TaskFinalizer,
}
// Implement base actor trait
impl OrchestrationActor for TaskFinalizerActor {
fn name(&self) -> &'static str {
"TaskFinalizerActor"
}
fn context(&self) -> &Arc<SystemContext> {
&self.context
}
fn started(&mut self) -> TaskerResult<()> {
tracing::info!("TaskFinalizerActor starting");
Ok(())
}
fn stopped(&mut self) -> TaskerResult<()> {
tracing::info!("TaskFinalizerActor stopping");
Ok(())
}
}
// Define message type
pub struct FinalizeTaskMessage {
pub task_uuid: Uuid,
}
impl Message for FinalizeTaskMessage {
type Response = FinalizationResult;
}
// Implement message handler
#[async_trait]
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
type Response = FinalizationResult;
async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
tracing::debug!(
actor = %self.name(),
task_uuid = %msg.task_uuid,
"Processing FinalizeTaskMessage"
);
// Delegate to service
self.service.finalize_task(msg.task_uuid).await
.map_err(|e| e.into())
}
}
}
Benefits
1. Consistency
All lifecycle components follow the same pattern:
- Uniform initialization via started()
- Uniform cleanup via stopped()
- Uniform message handling via Handler<M>
2. Type Safety
Messages and responses are type-checked at compile time:
#![allow(unused)]
fn main() {
// Compile error if message/response types don't match
impl Handler<WrongMessage> for TaskFinalizerActor {
type Response = WrongResponse; // ❌ Won't compile
// ...
}
}
3. Testability
- Clear message boundaries for mocking
- Isolated actor lifecycle for unit tests
- Type-safe message construction
4. Maintainability
- Clear separation of concerns
- Explicit message contracts
- Centralized lifecycle management
- Decomposed services (<300 lines per file)
5. Simplicity
- Direct actor calls (no wrapper layers)
- Pure routing in command processor
- Easy to trace message flow
Summary
The actor-based architecture provides a consistent, type-safe foundation for lifecycle component management in tasker-core. Key takeaways:
- Lightweight Pattern: Actors wrap decomposed services, providing message-based interface
- Lifecycle Management: Consistent initialization and shutdown via traits
- Type Safety: Compile-time verification of message contracts
- Service Decomposition: Focused components following single responsibility principle
- Direct Integration: Command processor calls actors directly without intermediate wrappers
- Production Ready: All phases complete, zero breaking changes, full test coverage
The architecture provides a solid foundation for future scalability and maintainability while maintaining the proven reliability of existing orchestration logic.
← Back to Documentation Hub
Backpressure Architecture
Last Updated: 2026-02-05 Audience: Architects, Developers, Operations Status: Active Related Docs: Worker Event Systems | MPSC Channel Guidelines
← Back to Documentation Hub
This document provides the unified backpressure strategy for tasker-core, covering all system components from API ingestion through worker execution.
Core Principle
Step idempotency is the primary constraint. Any backpressure mechanism must ensure that step claiming, business logic execution, and result persistence remain stable and consistent. The system must gracefully degrade under load without compromising workflow correctness.
System Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACKPRESSURE FLOW OVERVIEW │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ External Client │
└────────┬────────┘
│
┌────────────────▼────────────────┐
│ [1] API LAYER BACKPRESSURE │
│ • Circuit breaker (503) │
│ • System overload (503) │
│ • Request validation │
└────────────────┬────────────────┘
│
┌────────────────▼────────────────┐
│ [2] ORCHESTRATION BACKPRESSURE │
│ • Command channel (bounded) │
│ • Connection pool limits │
│ • PGMQ depth checks │
└────────────────┬────────────────┘
│
┌───────────┴───────────┐
│ PGMQ Queues │
│ • Namespace queues │
│ • Result queues │
└───────────┬───────────┘
│
┌────────────────▼────────────────┐
│ [3] WORKER BACKPRESSURE │
│ • Claim capacity check │
│ • Semaphore-bounded handlers │
│ • Completion channel bounds │
└────────────────┬────────────────┘
│
┌────────────────▼────────────────┐
│ [4] RESULT FLOW BACKPRESSURE │
│ • Completion channel bounds │
│ • Domain event drop semantics │
└─────────────────────────────────┘
Backpressure Points by Component
1. API Layer
The API layer provides backpressure through 503 responses with intelligent Retry-After headers.
Rate Limiting (429): Rate limiting is intentionally out of scope for tasker-core. This responsibility belongs to upstream infrastructure (API Gateways, NLB/ALB, service mesh). Tasker focuses on system health-based backpressure via 503 responses.
| Mechanism | Status | Behavior |
|---|---|---|
| Circuit Breaker | Implemented | Return 503 when database breaker open |
| System Overload | Planned | Return 503 when queue/channel saturation detected |
| Request Validation | Implemented | Return 400 for invalid requests |
Response Codes:
- 200 OK - Request accepted
- 400 Bad Request - Invalid request format
- 503 Service Unavailable - System overloaded (includes Retry-After header)
503 Response Triggers:
- Circuit Breaker Open: Database operations failing repeatedly
- Queue Depth High (Planned): PGMQ namespace queues approaching capacity
- Channel Saturation (Planned): Command channel buffer > 80% full
Retry-After Header Strategy:
503 Service Unavailable
Retry-After: {calculated_delay_seconds}
Calculation:
- Circuit breaker open: Use breaker timeout (default 30s)
- Queue depth high: Estimate based on processing rate
- Channel saturation: Short delay (5-10s) for buffer drain
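The calculation rules above can be sketched as a small function. The trigger enum and fallback values here are illustrative, not tasker-core's actual types:

```rust
/// Illustrative sketch of the documented Retry-After strategy.
/// Variant and field names are assumptions for this example only.
#[derive(Debug, Clone, Copy)]
pub enum OverloadTrigger {
    /// Database circuit breaker is open; retry after its timeout.
    CircuitBreakerOpen { breaker_timeout_secs: u64 },
    /// PGMQ queue depth is high; estimate drain time from throughput.
    QueueDepthHigh { backlog: u64, msgs_per_sec: u64 },
    /// Command channel buffer is saturated; short delay for buffer drain.
    ChannelSaturation,
}

pub fn retry_after_seconds(trigger: OverloadTrigger) -> u64 {
    match trigger {
        OverloadTrigger::CircuitBreakerOpen { breaker_timeout_secs } => breaker_timeout_secs,
        OverloadTrigger::QueueDepthHigh { backlog, msgs_per_sec } => {
            // Estimate how long the backlog takes to drain; floor at 5s.
            (backlog / msgs_per_sec.max(1)).max(5)
        }
        // Channel buffers drain quickly, so suggest a short delay.
        OverloadTrigger::ChannelSaturation => 5,
    }
}
```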
Configuration:
# config/tasker/base/common.toml
[common.circuit_breakers.component_configs.web]
failure_threshold = 5 # Failures before opening
success_threshold = 2 # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)
2. Orchestration Layer
The orchestration layer protects internal processing from command flooding.
| Mechanism | Status | Behavior |
|---|---|---|
| Command Channel | Implemented | Bounded MPSC with monitoring |
| Connection Pool | Implemented | Database connection limits |
| PGMQ Depth Check | Planned | Reject enqueue when queue too deep |
Command Channel Backpressure:
Command Sender → [Bounded Channel] → Command Processor
│
└── If full: Block with timeout → Reject
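The reject-when-full behavior can be sketched with std's bounded sync_channel to keep the example self-contained; the real command channel is a bounded tokio MPSC with a send timeout, and the Command variant here is illustrative:

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical command type for illustration only.
#[derive(Debug)]
pub enum Command {
    InitializeTask(String),
}

/// Sketch of bounded-channel backpressure: when the buffer is full the
/// sender gets an immediate, retriable rejection instead of blocking
/// indefinitely, which the caller can surface upstream (e.g. as a 503).
pub fn submit(tx: &SyncSender<Command>, cmd: Command) -> Result<(), String> {
    match tx.try_send(cmd) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_)) => Err("command_channel_full".into()),
        Err(TrySendError::Disconnected(_)) => Err("command_channel_closed".into()),
    }
}
```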
Configuration:
# config/tasker/base/orchestration.toml
[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000
[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000
3. Messaging Layer
The messaging layer provides the backbone between orchestration and workers. Provider-agnostic via MessageClient, supporting PGMQ (default) and RabbitMQ backends.
| Mechanism | Status | Behavior |
|---|---|---|
| Visibility Timeout | Implemented | Messages return to queue after timeout |
| Batch Size Limits | Implemented | Bounded message reads |
| Queue Depth Check | Planned | Reject enqueue when depth exceeded |
| Messaging Circuit Breaker | Implemented | Fast-fail send/receive when provider unhealthy |
Messaging Circuit Breaker:
MessageClient wraps send/receive operations with circuit breaker protection. When the messaging provider (PGMQ or RabbitMQ) fails repeatedly, the breaker opens and returns MessagingError::CircuitBreakerOpen immediately, preventing slow timeouts from cascading into orchestration and worker processing loops. Ack/nack and health check operations bypass the breaker — ack/nack failures are safe (visibility timeout handles redelivery), and health checks must work when the breaker is open to detect recovery. See Circuit Breakers for details.
Queue Depth Monitoring (Planned):
The system will work with PGMQ’s native capabilities rather than enforcing arbitrary limits. Queue depth monitoring provides visibility without hard rejection:
┌──────────────────────────────────────────────────────────────────────┐
│ QUEUE DEPTH STRATEGY │
├──────────────────────────────────────────────────────────────────────┤
│ Level │ Depth Ratio │ Action │
├──────────────────────────────────────────────────────────────────────┤
│ Normal │ < 70% │ Normal operation │
│ Warning │ 70-85% │ Log warning, emit metric │
│ Critical │ 85-95% │ API returns 503 for new tasks │
│ Overflow │ > 95% │ API rejects all writes, alert operators │
└──────────────────────────────────────────────────────────────────────┘
Note: Depth ratio = current_depth / configured_soft_limit
Soft limit is advisory, not a hard PGMQ constraint.
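The depth-ratio classification in the table can be sketched as follows; the level names mirror the table, while the function itself is illustrative:

```rust
/// The four levels from the queue depth strategy table.
#[derive(Debug, PartialEq)]
pub enum DepthLevel {
    Normal,   // < 70%
    Warning,  // 70-85%
    Critical, // 85-95%
    Overflow, // > 95%
}

/// Classify current queue depth against the configured soft limit.
/// The soft limit is advisory configuration, not a PGMQ constraint.
pub fn classify_depth(current_depth: u64, soft_limit: u64) -> DepthLevel {
    let ratio = current_depth as f64 / soft_limit.max(1) as f64;
    if ratio > 0.95 {
        DepthLevel::Overflow
    } else if ratio >= 0.85 {
        DepthLevel::Critical
    } else if ratio >= 0.70 {
        DepthLevel::Warning
    } else {
        DepthLevel::Normal
    }
}
```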
Portability Considerations:
- Queue depth semantics vary by backend (PGMQ vs RabbitMQ vs SQS)
- Configuration is backend-agnostic where possible
- Backend-specific tuning goes in backend-specific config sections
Configuration:
# config/tasker/base/common.toml
[common.queues]
default_visibility_timeout_seconds = 30
[common.queues.pgmq]
poll_interval_ms = 250
[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000
# Messaging circuit breaker
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5 # Failures before opening
success_threshold = 2 # Successes in half-open to close
# timeout_seconds inherited from default_config (30s)
4. Worker Layer
The worker layer protects handler execution from saturation.
| Mechanism | Status | Behavior |
|---|---|---|
| Semaphore-Bounded Dispatch | Implemented | Max concurrent handlers |
| Claim Capacity Check | Planned | Refuse claims when at capacity |
| Handler Timeout | Implemented | Kill stuck handlers |
| Completion Channel | Implemented | Bounded result buffer |
Handler Dispatch Flow:
Step Message
│
▼
┌─────────────────┐
│ Capacity Check │──── At capacity? ──── Leave in queue
│ (Planned) │ (visibility timeout)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Acquire Permit │
│ (Semaphore) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Execute Handler │
│ (with timeout) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Release Permit │──── BEFORE sending to completion channel
└────────┬────────┘
│
▼
┌─────────────────┐
│ Send Completion │
└─────────────────┘
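The capacity-bounded portion of this flow can be sketched with a counting semaphore. An atomic counter stands in for the tokio Semaphore used by the real worker, and the type name is illustrative:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal counting-semaphore sketch of handler dispatch bounds.
/// The key ordering point from the diagram: the permit is released
/// *before* sending to the completion channel, so a slow completion
/// path never blocks new handler dispatch.
pub struct HandlerPermits {
    available: AtomicUsize,
}

impl HandlerPermits {
    pub fn new(max_concurrent: usize) -> Self {
        Self { available: AtomicUsize::new(max_concurrent) }
    }

    /// Try to claim a permit; `false` means "at capacity" — the step
    /// message stays in the queue and the visibility timeout re-delivers it.
    pub fn try_acquire(&self) -> bool {
        self.available
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
            .is_ok()
    }

    /// Release the permit (called after handler execution, before the
    /// completion send).
    pub fn release(&self) {
        self.available.fetch_add(1, Ordering::AcqRel);
    }
}
```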
Configuration:
# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000
5. Domain Events
Domain events use fire-and-forget semantics to avoid blocking the critical path.
| Mechanism | Status | Behavior |
|---|---|---|
| Try-Send | Implemented | Non-blocking send |
| Drop on Full | Implemented | Events dropped if channel full |
| Metrics | Planned | Track dropped events |
Domain Event Flow:
Handler Complete
│
├── Result → Completion Channel (blocking, must succeed)
│
└── Domain Events → try_send() → If full: DROP with metric
│
└── Step execution NOT affected
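A minimal sketch of the try-send-and-drop semantics: std's sync_channel stands in for the real bounded channel, and the drop counter is an illustrative stand-in for the planned metric:

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Fire-and-forget domain event publishing: `try_send` never blocks
/// the handler's critical path. On a full channel the event is dropped
/// and a counter is incremented; step execution is NOT affected.
pub fn publish_event(tx: &SyncSender<String>, event: String, dropped: &mut u64) {
    if let Err(TrySendError::Full(_)) = tx.try_send(event) {
        *dropped += 1; // would emit a "dropped event" metric in practice
    }
}
```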
Segmentation of Responsibility
Orchestration System
The orchestration system must protect itself from:
- Client overload: Too many /v1/tasks requests
- Internal saturation: Command channel overflow
- Database exhaustion: Connection pool depletion
- Queue explosion: Unbounded PGMQ growth
Backpressure Response Hierarchy:
- Return 503 to client with Retry-After (fastest, cheapest)
- Block at command channel (internal buffering)
- Soft-reject at queue depth threshold (503 to new tasks)
- Circuit breaker opens (stop accepting work)
Worker System
The worker system must protect itself from:
- Handler saturation: Too many concurrent handlers
- FFI backlog: Ruby/Python handlers falling behind
- Completion overflow: Results backing up
- Step starvation: Claims outpacing processing
Backpressure Response Hierarchy:
- Refuse step claim (leave in queue, visibility timeout)
- Block at dispatch channel (internal buffering)
- Block at completion channel (handler waits)
- Circuit breaker opens (stop claiming work)
Step Idempotency Guarantees
Safe Backpressure Points
These backpressure points preserve step idempotency:
| Point | Why Safe |
|---|---|
| API 503 rejection | Task not yet created |
| Queue depth soft-limit | Step not yet enqueued |
| Step claim refusal | Message stays in queue, visibility timeout protects |
| Handler dispatch channel full | Step claimed but execution queued |
| Completion channel backpressure | Handler completed, result buffered |
Unsafe Patterns (NEVER DO)
| Pattern | Risk | Mitigation |
|---|---|---|
| Drop step after claiming | Lost work | Always send result (success or failure) |
| Timeout during handler execution | Duplicate execution on retry | Handlers MUST be idempotent |
| Drop completion result | Orchestration unaware of completion | Completion channel blocks, never drops |
| Reset step state without visibility timeout | Race with other workers | Use PGMQ visibility timeout |
Idempotency Contract
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP EXECUTION IDEMPOTENCY CONTRACT │
└─────────────────────────────────────────────────────────────────────────────┘
1. CLAIM: Atomic via pgmq_read_specific_message()
├── Only one worker can claim a message
├── Visibility timeout protects against worker crash
└── If claim fails: Message stays in queue → another worker claims
2. EXECUTE: Handler invocation (FFI boundary critical - see below)
├── Handlers SHOULD be idempotent (business logic recommendation)
├── Timeout generates FAILURE result (not drop)
├── Panic generates FAILURE result (not drop)
└── Error generates FAILURE result (not drop)
3. PERSIST: Result submission
├── Completion channel is bounded but BLOCKING
├── Result MUST reach orchestration (never dropped)
└── If send fails: Step remains "in_progress" → recovered by orchestration
4. FINALIZE: Orchestration processes result
├── State transition is atomic
├── Duplicate results handled by state guards
└── Idempotent: Same result processed twice = same outcome
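Step 4's state-guard behavior can be sketched as a pure transition function; the state names are illustrative, not tasker-core's actual enum:

```rust
/// Illustrative step states for the idempotency sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StepState {
    InProgress,
    Complete,
    Failed,
}

/// State-guarded result application: a transition only fires from the
/// expected prior state, so processing the same result twice yields
/// the same outcome (the duplicate is a no-op).
pub fn apply_result(current: StepState, success: bool) -> StepState {
    match current {
        // Only an in-progress step transitions on a result.
        StepState::InProgress => {
            if success { StepState::Complete } else { StepState::Failed }
        }
        // Terminal states absorb duplicate results unchanged.
        terminal => terminal,
    }
}
```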
FFI Boundary Idempotency Semantics
The FFI boundary (Rust → Ruby/Python handler) creates a critical demarcation for error classification:
┌─────────────────────────────────────────────────────────────────────────────┐
│ FFI BOUNDARY ERROR CLASSIFICATION │
└─────────────────────────────────────────────────────────────────────────────┘
FFI BOUNDARY
│
BEFORE FFI CROSSING │ AFTER FFI CROSSING
(System Layer) │ (Business Logic Layer)
│
┌─────────────────────┐ │ ┌─────────────────────┐
│ System errors are │ │ │ System failures │
│ RETRYABLE: │ │ │ are PERMANENT: │
│ │ │ │ │
│ • Channel timeout │ │ │ • Worker crash │
│ • Queue unavailable │ │ │ • FFI panic │
│ • Claim race lost │ │ │ • Process killed │
│ • Network partition │ │ │ │
│ • Message malformed │ │ │ We cannot know if │
│ │ │ │ business logic │
│ Step has NOT been │ │ │ executed or not. │
│ handed to handler. │ │ │ │
└─────────────────────┘ │ └─────────────────────┘
│
│ ┌─────────────────────┐
│ │ Developer errors │
│ │ are TRUSTED: │
│ │ │
│ │ • RetryableError → │
│ │ System retries │
│ │ │
│ │ • PermanentError → │
│ │ Step fails │
│ │ │
│ │ Developer knows │
│ │ their domain logic. │
│ └─────────────────────┘
Key Principles:
- Before FFI: Any system error is safe to retry because no business logic has executed.
- After FFI, system failure: If the worker crashes or FFI call fails after dispatch, we MUST treat it as permanent failure. We cannot know if the handler:
  - Never started (safe to retry)
  - Started but didn’t complete (unknown side effects)
  - Completed but didn’t return (work is done)
- After FFI, developer error: Trust the developer’s classification:
  - RetryableError: Developer explicitly signals safe to retry (e.g., temporary API unavailable)
  - PermanentError: Developer explicitly signals not retriable (e.g., invalid data, business rule violation)
Implementation Guidance:
#![allow(unused)]
fn main() {
// BEFORE FFI - system error, retryable
match dispatch_to_handler(step).await {
Err(DispatchError::ChannelFull) => StepExecutionResult::retryable("dispatch_channel_full"),
Err(DispatchError::Timeout) => StepExecutionResult::retryable("dispatch_timeout"),
Ok(ffi_handle) => {
// AFTER FFI - different rules apply
match ffi_handle.await {
// System crash after FFI = permanent (unknown state)
Err(FfiError::ProcessCrash) => StepExecutionResult::permanent("handler_crash"),
Err(FfiError::Panic) => StepExecutionResult::permanent("handler_panic"),
// Developer-returned errors = trust their classification
Ok(HandlerResult::RetryableError(msg)) => StepExecutionResult::retryable(msg),
Ok(HandlerResult::PermanentError(msg)) => StepExecutionResult::permanent(msg),
Ok(HandlerResult::Success(data)) => StepExecutionResult::success(data),
}
}
}
}
Note: We RECOMMEND handlers be idempotent but cannot REQUIRE it—business logic is developer-controlled. The system provides visibility timeout protection and duplicate result handling, but ultimate idempotency responsibility lies with handler implementations.
Backpressure Decision Tree
Use this decision tree when designing new backpressure mechanisms:
┌─────────────────────────┐
│ New Backpressure Point │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ Does this affect step │
│ execution correctness? │
└────────────┬────────────┘
│
┌─────────────┴─────────────┐
│ │
Yes No
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Can the work be │ │ Safe to drop │
│ retried safely? │ │ or timeout │
└────────┬────────┘ └─────────────────┘
│
┌─────────┴─────────┐
│ │
Yes No
│ │
▼ ▼
┌───────────┐ ┌───────────────┐
│ Use block │ │ MUST NOT DROP │
│ or reject │ │ Block until │
│ (retriable│ │ success │
│ error) │ └───────────────┘
└───────────┘
Configuration Reference
TOML Structure: Configuration files are organized as config/tasker/base/{common,worker,orchestration}.toml with environment overrides in config/tasker/environments/{test,development,production}/.
Complete Backpressure Configuration
# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/common.toml - Shared settings
# ════════════════════════════════════════════════════════════════════════════
# Circuit breaker defaults (inherited by all component breakers)
[common.circuit_breakers.default_config]
failure_threshold = 5 # Failures before opening
timeout_seconds = 30 # Time in open state before half-open
success_threshold = 2 # Successes in half-open to close
# Web/API database circuit breaker
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2
# Messaging circuit breaker - PGMQ/RabbitMQ operations
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2
# Queue configuration
[common.queues]
default_visibility_timeout_seconds = 30
[common.queues.pgmq]
poll_interval_ms = 250
[common.queues.pgmq.queue_depth_thresholds]
critical_threshold = 500
overflow_threshold = 1000
# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/orchestration.toml - Orchestration layer
# ════════════════════════════════════════════════════════════════════════════
[orchestration.mpsc_channels.command_processor]
command_buffer_size = 5000
[orchestration.mpsc_channels.pgmq_events]
pgmq_event_buffer_size = 50000
[orchestration.mpsc_channels.event_channel]
event_channel_buffer_size = 10000
# ════════════════════════════════════════════════════════════════════════════
# config/tasker/base/worker.toml - Worker layer
# ════════════════════════════════════════════════════════════════════════════
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000 # Steps waiting for handler
completion_buffer_size = 1000 # Results waiting for orchestration
max_concurrent_handlers = 10 # Semaphore permits
handler_timeout_ms = 30000 # Max handler execution time
[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000 # FFI events waiting for Ruby/Python
completion_timeout_ms = 30000 # Time to wait for FFI completion
starvation_warning_threshold_ms = 10000 # Warn if event waits this long
# Planned:
# claim_capacity_threshold = 0.8 # Refuse claims at 80% capacity
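The planned claim-capacity gate in the comment above would refuse new step claims once in-flight work reaches the configured fraction of dispatch capacity. A deterministic sketch of that check — the `should_refuse_claim` helper is hypothetical, since the setting has not shipped:

```rust
// Hypothetical sketch of the planned claim_capacity_threshold gate:
// refuse new step claims once in-flight handlers reach the configured
// fraction of dispatch capacity. Not part of the shipped configuration.
fn should_refuse_claim(in_flight: usize, capacity: usize, threshold: f64) -> bool {
    // A zero-capacity worker can never accept work.
    capacity == 0 || (in_flight as f64) / (capacity as f64) >= threshold
}
```

With `dispatch_buffer_size = 1000` and a 0.8 threshold, claims would be refused once 800 steps are in flight.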
Monitoring and Alerting
See Backpressure Monitoring Runbook for:
- Key metrics to monitor
- Alerting thresholds
- Incident response procedures
Key Metrics Summary
| Metric | Type | Alert Threshold |
|---|---|---|
| `api_requests_rejected_total` | Counter | > 10/min |
| `circuit_breaker_state` | Gauge | state = open |
| `mpsc_channel_saturation` | Gauge | > 80% |
| `pgmq_queue_depth` | Gauge | > 80% of max |
| `worker_claim_refusals_total` | Counter | > 10/min |
| `handler_semaphore_wait_time_ms` | Histogram | p99 > 1000ms |
Related Documentation
- Worker Event Systems - Dual-channel architecture
- MPSC Channel Guidelines - Channel creation guide
- MPSC Channel Tuning - Operational tuning
- Bounded MPSC Channels ADR
<- Back to Documentation Hub
Circuit Breakers
Last Updated: 2026-02-04 Audience: Architects, Operators, Developers Status: Active Related Docs: Backpressure Architecture | Observability | Operations: Backpressure Monitoring
Circuit breakers provide fault isolation and cascade prevention across tasker-core. This document covers the circuit breaker architecture, implementations, configuration, and operational monitoring.
Core Concept
Circuit breakers prevent cascading failures by failing fast when a component is unhealthy. Instead of waiting for slow or failing operations to timeout, circuit breakers detect failure patterns and immediately reject calls, giving the downstream system time to recover.
State Machine
┌─────────────────────────────────────────────────────────────────────────────┐
│ CIRCUIT BREAKER STATE MACHINE │
└─────────────────────────────────────────────────────────────────────────────┘
Success
┌─────────┐
│ │
▼ │
┌───────┐ │
───────>│CLOSED │─────┘
└───┬───┘
│
│ failure_threshold
│ consecutive failures
│
▼
┌───────┐
│ OPEN │◄─────────────────────┐
└───┬───┘ │
│ │
│ timeout_seconds │ Any failure
│ elapsed │ in half-open
│ │
▼ │
┌──────────┐ │
│HALF-OPEN │─────────────────────┘
└────┬─────┘
│
│ success_threshold
│ consecutive successes
│
▼
┌───────┐
│CLOSED │
└───────┘
States:
- Closed: Normal operation. All calls allowed. Tracks consecutive failures.
- Open: Failing fast. All calls rejected immediately. Waiting for timeout.
- Half-Open: Testing recovery. Limited calls allowed. Single failure reopens.
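The transitions above can be sketched as a small state machine. This is an illustrative model only — Tasker's real implementation is lock-free and clock-driven — so the open→half-open timeout is triggered manually here to keep the sketch deterministic:

```rust
// Minimal sketch of the closed/open/half-open state machine described above.
// Field and method names are illustrative, not Tasker's actual API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

pub struct Breaker {
    state: CircuitState,
    failure_threshold: u32,
    success_threshold: u32,
    consecutive_failures: u32,
    consecutive_successes: u32,
}

impl Breaker {
    pub fn new(failure_threshold: u32, success_threshold: u32) -> Self {
        Self {
            state: CircuitState::Closed,
            failure_threshold,
            success_threshold,
            consecutive_failures: 0,
            consecutive_successes: 0,
        }
    }

    pub fn state(&self) -> CircuitState {
        self.state
    }

    pub fn record_failure(&mut self) {
        match self.state {
            // Any failure in half-open reopens immediately.
            CircuitState::HalfOpen => self.state = CircuitState::Open,
            CircuitState::Closed => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = CircuitState::Open;
                }
            }
            CircuitState::Open => {}
        }
        self.consecutive_successes = 0;
    }

    pub fn record_success(&mut self) {
        match self.state {
            // Success in closed resets the failure streak.
            CircuitState::Closed => self.consecutive_failures = 0,
            CircuitState::HalfOpen => {
                self.consecutive_successes += 1;
                if self.consecutive_successes >= self.success_threshold {
                    self.state = CircuitState::Closed;
                    self.consecutive_failures = 0;
                }
            }
            CircuitState::Open => {}
        }
    }

    // In the real engine this fires after `timeout_seconds`; here it is
    // driven manually so the sketch stays time-free.
    pub fn timeout_elapsed(&mut self) {
        if self.state == CircuitState::Open {
            self.state = CircuitState::HalfOpen;
            self.consecutive_successes = 0;
        }
    }
}
```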
Unified Trait: CircuitBreakerBehavior
All circuit breaker implementations share a common trait defined in tasker-shared/src/resilience/behavior.rs:
pub trait CircuitBreakerBehavior: Send + Sync + Debug {
    fn name(&self) -> &str;
    fn state(&self) -> CircuitState;
    fn should_allow(&self) -> bool;
    fn record_success(&self, duration: Duration);
    fn record_failure(&self, duration: Duration);
    fn is_healthy(&self) -> bool;
    fn force_open(&self);
    fn force_closed(&self);
    fn metrics(&self) -> CircuitBreakerMetrics;
}
Each specialized breaker wraps the generic CircuitBreaker (composition pattern) and implements this trait. This means:
- Consistent state machine behavior across all breakers
- Proper half-open → closed recovery via `success_threshold`
- Lock-free atomic state management
- Domain-specific methods remain as additional methods on each type
Circuit Breaker Implementations
Tasker-core has four circuit breaker implementations, each protecting specific components.
All wrap the generic CircuitBreaker from tasker_shared::resilience:
| Circuit Breaker | Location | Purpose | Trigger Type |
|---|---|---|---|
| Web Database | tasker-orchestration | API database operations | Error-based |
| Task Readiness | tasker-orchestration | Fallback poller database checks | Error-based |
| FFI Completion | tasker-worker | Ruby/Python handler completion channel | Latency-based |
| Messaging | tasker-shared | Message queue operations (PGMQ/RabbitMQ) | Error-based |
1. Web Database Circuit Breaker
Purpose: Protects API endpoints from cascading database failures.
Scope: Independent from orchestration system’s internal operations.
Behavior:
- Opens when database queries fail repeatedly
- Returns 503 with a `Retry-After` header when open
- Fast-fail rejection with atomic state management
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.web]
failure_threshold = 5 # Consecutive failures before opening
success_threshold = 2 # Successes in half-open to fully close
# timeout_seconds inherited from default_config (30s)
Health Check Integration:
- Included in the `/health/ready` endpoint
- State reported in the `/health/detailed` response
- Metric: `api_circuit_breaker_state` (0=closed, 1=half-open, 2=open)
2. Task Readiness Circuit Breaker
Purpose: Protects fallback poller from database overload during polling cycles.
Scope: Independent from web circuit breaker, specific to task readiness queries.
Behavior:
- Opens when task readiness queries fail repeatedly
- Skips polling cycles when open (doesn’t fail-fast, just skips)
- Allows orchestration to continue processing existing work
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10 # Higher threshold for polling
timeout_seconds = 60 # Longer recovery window
success_threshold = 3 # More successes needed for confidence
Why Separate from Web?:
- Different failure patterns (polling vs request-driven)
- Different recovery semantics (skip vs reject)
- Isolation prevents web failures from stopping polling (and vice versa)
3. FFI Completion Circuit Breaker
Purpose: Protects Ruby/Python worker completion channels from backpressure.
Scope: Worker-specific, protects FFI boundary.
Behavior:
- Latency-based: Treats slow sends (>100ms) as failures
- Opens when completion channel is consistently slow
- Prevents FFI threads from blocking on saturated channels
- Drops completions when open (with metrics), allowing handler threads to continue
Configuration (config/tasker/base/worker.toml):
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5 # Slow sends before opening
recovery_timeout_seconds = 5 # Short recovery window
success_threshold = 2 # Successes to close
slow_send_threshold_ms = 100 # Latency threshold (100ms)
Why Latency-Based?:
- Slow channel sends indicate backpressure buildup
- Blocking FFI threads can cascade to Ruby/Python handler starvation
- Error-only detection misses slow-but-completing operations
- Latency detection catches degradation before total failure
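The latency-based trigger reduces to a simple classification rule: a send counts as a breaker failure when it errors *or* when it completes slower than `slow_send_threshold_ms`. A sketch with a hypothetical helper name:

```rust
use std::time::Duration;

/// Latency-based failure detection: a send feeds a failure into the breaker
/// either when it errors or when it completes too slowly — even a
/// slow-but-successful send counts against the failure threshold.
/// Illustrative helper, not Tasker's actual API.
fn is_breaker_failure(elapsed: Duration, send_ok: bool, slow_threshold: Duration) -> bool {
    !send_ok || elapsed > slow_threshold
}
```

This is why error-only detection is insufficient here: a channel under backpressure still completes sends, just slowly, and only the latency term catches that.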
Metrics:
- `ffi_completion_slow_sends_total` - Sends exceeding the latency threshold
- `ffi_completion_circuit_open_rejections_total` - Rejections due to an open circuit
4. Messaging Circuit Breaker
Purpose: Protects message queue operations from provider failures (PGMQ or RabbitMQ).
Scope: Integrated into MessageClient, shared across orchestration and worker messaging.
Behavior:
- Opens when send/receive operations fail repeatedly
- Protected operations: `send_step_message`, `receive_step_messages`, `send_step_result`, `receive_step_results`, `send_task_request`, `receive_task_requests`, `send_task_finalization`, `receive_task_finalizations`, `send_message`, `receive_messages`
- Unprotected operations (safe to fail or needed for recovery): `ack_message`, `nack_message`, `extend_visibility`, `health_check`, `ensure_queue`, queue stats
- Coordinates with visibility timeout for message safety
- Provider-agnostic: works with both PGMQ and RabbitMQ backends
Configuration (config/tasker/base/common.toml):
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5 # Failures before opening
success_threshold = 2 # Successes to close
# timeout_seconds inherited from default_config (30s)
Why ack/nack bypass the breaker?:
- Ack/nack failure causes message redelivery via visibility timeout, which is safe
- Health check must work when breaker is open to detect recovery
- Queue management is startup-only and should not be gated
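The protected/bypass split can be sketched as a routing decision in front of the breaker. The operation names come from the lists above; the routing functions themselves are hypothetical, not `MessageClient`'s real structure:

```rust
// Operations that are gated by the messaging circuit breaker
// (names taken from the protected-operations list above).
fn is_protected(op: &str) -> bool {
    matches!(
        op,
        "send_step_message" | "receive_step_messages" | "send_step_result"
            | "receive_step_results" | "send_task_request" | "receive_task_requests"
            | "send_task_finalization" | "receive_task_finalizations"
            | "send_message" | "receive_messages"
    )
}

/// Whether a call should be attempted given the breaker's current verdict.
/// Ack/nack, visibility, health, and queue-management calls always go
/// through, so redelivery and recovery detection keep working even while
/// the breaker is open.
fn allow_call(op: &str, breaker_allows: bool) -> bool {
    if is_protected(op) { breaker_allows } else { true }
}
```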
Configuration Reference
Global Settings
[common.circuit_breakers.global_settings]
metrics_collection_interval_seconds = 30 # Metrics aggregation interval
min_state_transition_interval_seconds = 5.0 # Debounce for rapid transitions
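The debounce setting can be illustrated with a small sketch: transitions arriving sooner than the minimum interval after the last accepted one are suppressed, so a flapping breaker does not spam state changes. The `debounce` helper is hypothetical; timestamps are passed in explicitly to keep it deterministic:

```rust
/// Keep only the transitions that survive a debounce window, mirroring
/// `min_state_transition_interval_seconds`. Timestamps are in seconds;
/// rapid flapping collapses to the first event in each window.
/// Illustrative helper, not Tasker's implementation.
fn debounce(timestamps: &[f64], min_interval_secs: f64) -> Vec<f64> {
    let mut kept = Vec::new();
    let mut last: Option<f64> = None;
    for &t in timestamps {
        // Accept the first event, then only events far enough from the last kept one.
        if last.map_or(true, |l| t - l >= min_interval_secs) {
            kept.push(t);
            last = Some(t);
        }
    }
    kept
}
```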
Default Configuration
Applied to any circuit breaker without explicit configuration:
[common.circuit_breakers.default_config]
failure_threshold = 5 # 1-100 range
timeout_seconds = 30 # 1-300 range
success_threshold = 2 # 1-50 range
Component-Specific Overrides
# Task readiness (polling-specific)
[common.circuit_breakers.component_configs.task_readiness]
failure_threshold = 10
success_threshold = 3
# Messaging operations (PGMQ/RabbitMQ)
[common.circuit_breakers.component_configs.messaging]
failure_threshold = 5
success_threshold = 2
# Web/API database operations
[common.circuit_breakers.component_configs.web]
failure_threshold = 5
success_threshold = 2
Note: `timeout_seconds` is inherited from `default_config` for all component circuit breakers. The `pgmq` key is accepted as an alias for `messaging` for backward compatibility.
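The inheritance rule can be sketched as a field-level merge: each component override is a set of optional fields that fall back to `default_config` when unset. The struct and field names below are illustrative, not Tasker's actual config types:

```rust
// Resolved breaker settings, after defaults have been applied.
#[derive(Clone, Copy)]
struct BreakerConfig {
    failure_threshold: u32,
    timeout_seconds: u64,
    success_threshold: u32,
}

// A component section in TOML: every field optional.
#[derive(Default)]
struct ComponentOverride {
    failure_threshold: Option<u32>,
    timeout_seconds: Option<u64>,
    success_threshold: Option<u32>,
}

// Unset component fields inherit from default_config.
fn resolve(default: BreakerConfig, component: &ComponentOverride) -> BreakerConfig {
    BreakerConfig {
        failure_threshold: component.failure_threshold.unwrap_or(default.failure_threshold),
        timeout_seconds: component.timeout_seconds.unwrap_or(default.timeout_seconds),
        success_threshold: component.success_threshold.unwrap_or(default.success_threshold),
    }
}
```

Under this model, the `web` section above (which sets only `failure_threshold` and `success_threshold`) resolves with the default 30-second timeout.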
Worker-Specific Configuration
# FFI completion (latency-based)
[worker.circuit_breakers.ffi_completion_send]
failure_threshold = 5
recovery_timeout_seconds = 5
success_threshold = 2
slow_send_threshold_ms = 100
Environment Overrides
Different environments may need different thresholds:
Test (config/tasker/environments/test/common.toml):
[common.circuit_breakers.default_config]
failure_threshold = 2 # Faster failure detection
timeout_seconds = 5 # Quick recovery for tests
success_threshold = 1
Production (config/tasker/environments/production/common.toml):
[common.circuit_breakers.default_config]
failure_threshold = 10 # More tolerance for transient failures
timeout_seconds = 60 # Longer recovery window
success_threshold = 5 # More confidence before closing
Health Endpoint Integration
Circuit breaker states are exposed through health endpoints for monitoring and Kubernetes probes.
Orchestration Health (/health/detailed)
{
  "status": "healthy",
  "checks": {
    "circuit_breakers": {
      "status": "healthy",
      "message": "Circuit breaker state: Closed",
      "duration_ms": 1,
      "last_checked": "2025-12-10T10:00:00Z"
    }
  }
}
Worker Health (/health/detailed)
{
  "status": "healthy",
  "checks": {
    "circuit_breakers": {
      "status": "healthy",
      "message": "2 circuit breakers: 2 closed, 0 open, 0 half-open. Details: ffi_completion: closed (100 calls, 2 failures); task_readiness: closed (50 calls, 0 failures)",
      "duration_ms": 0,
      "last_checked": "2025-12-10T10:00:00Z"
    }
  }
}
Health Status Mapping
| Circuit Breaker State | Health Status | Impact |
|---|---|---|
| All Closed | healthy | Normal operation |
| Any Half-Open | degraded | Testing recovery |
| Any Open | unhealthy | Failing fast |
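The table reduces to a priority rule: any open breaker makes the component unhealthy, otherwise any half-open breaker degrades it, otherwise it is healthy. A sketch with illustrative types:

```rust
#[derive(Debug, PartialEq, Eq)]
enum CircuitState { Closed, Open, HalfOpen }

#[derive(Debug, PartialEq, Eq)]
enum Health { Healthy, Degraded, Unhealthy }

// Aggregate many breaker states into one health status: Open outranks
// HalfOpen, which outranks Closed, matching the mapping table above.
fn aggregate(states: &[CircuitState]) -> Health {
    if states.iter().any(|s| *s == CircuitState::Open) {
        Health::Unhealthy
    } else if states.iter().any(|s| *s == CircuitState::HalfOpen) {
        Health::Degraded
    } else {
        Health::Healthy
    }
}
```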
Monitoring and Alerting
Key Metrics
| Metric | Type | Description |
|---|---|---|
| `api_circuit_breaker_state` | Gauge | Web breaker state (0/1/2) |
| `tasker_circuit_breaker_state` | Gauge | Per-component state |
| `api_requests_rejected_total` | Counter | Rejections due to open breaker |
| `ffi_completion_slow_sends_total` | Counter | Slow send detections |
| `ffi_completion_circuit_open_rejections_total` | Counter | FFI breaker rejections |
Prometheus Alerts
groups:
  - name: circuit_breakers
    rules:
      - alert: TaskerCircuitBreakerOpen
        expr: api_circuit_breaker_state == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker is OPEN"
          description: "Circuit breaker {{ $labels.component }} has been open for >1 minute"
      - alert: TaskerCircuitBreakerHalfOpen
        expr: api_circuit_breaker_state == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker stuck in half-open"
          description: "Circuit breaker {{ $labels.component }} in half-open state >5 minutes"
      - alert: TaskerFFISlowSendsHigh
        expr: rate(ffi_completion_slow_sends_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "FFI completion channel experiencing backpressure"
          description: "Slow sends averaging >10/second, circuit breaker may open"
Grafana Dashboard Panels
Circuit Breaker State Timeline:
Panel: Time series
Query: api_circuit_breaker_state
Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)
FFI Latency Percentiles:
Panel: Time series
Queries:
- histogram_quantile(0.50, ffi_completion_send_duration_seconds_bucket)
- histogram_quantile(0.95, ffi_completion_send_duration_seconds_bucket)
- histogram_quantile(0.99, ffi_completion_send_duration_seconds_bucket)
Thresholds: 100ms warning, 500ms critical
Operational Procedures
When Circuit Breaker Opens
Immediate Actions:
- Check database connectivity: `pg_isready -h <host> -p 5432`
- Check connection pool status via the `/health/detailed` endpoint
- Review recent error logs for root cause
- Monitor queue depth for message backlog
Recovery:
- Circuit automatically tests recovery after `timeout_seconds`
- No manual intervention needed for transient failures
- For persistent failures, fix underlying issue first
Escalation:
- If breaker stays open >5 minutes, escalate to database team
- If breaker oscillates (open/half-open/open), increase `failure_threshold`
Tuning Guidelines
Symptom: Breaker opens too frequently
- Increase `failure_threshold`
- Investigate root cause of failures
- Consider if failures are transient vs systemic
Symptom: Breaker stays open too long
- Decrease `timeout_seconds`
- Verify downstream system has recovered
- Check if `success_threshold` is too high
Symptom: FFI breaker opens unnecessarily
- Increase `slow_send_threshold_ms`
- Verify channel buffer sizes are adequate
- Check Ruby/Python handler throughput
Architecture Integration
Relationship to Backpressure
Circuit breakers are one layer of the broader backpressure strategy:
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESILIENCE LAYER STACK │
└─────────────────────────────────────────────────────────────────────────────┘
Layer 1: Circuit Breakers → Fast-fail on component failure
Layer 2: Bounded Channels → Backpressure on internal queues
Layer 3: Visibility Timeouts → Message-level retry safety
Layer 4: Semaphore Limits → Handler execution rate limiting
Layer 5: Connection Pools → Database resource management
See Backpressure Architecture for the complete strategy.
Independence Principle
Each circuit breaker operates independently:
- Web breaker can be open while task readiness breaker is closed
- FFI breaker state doesn’t affect PGMQ breaker
- Prevents single failure mode from cascading across components
- Allows targeted recovery per component
Integration Points
| Component | Circuit Breaker | Integration Point |
|---|---|---|
| `tasker-orchestration/src/web` | Web Database | API request handlers |
| `tasker-orchestration/src/orchestration/task_readiness` | Task Readiness | Fallback poller loop |
| `tasker-worker/src/worker/handlers` | FFI Completion | Completion channel sends |
| `tasker-shared/src/messaging/client.rs` | Messaging | `MessageClient` send/receive methods |
Troubleshooting
Common Issues
Issue: Web circuit breaker flapping (open → half-open → open rapidly)
Diagnosis:
- Check database query latency (slow queries can cause timeout failures)
- Review connection pool saturation
- Check if PostgreSQL is under memory pressure
Resolution:
- Increase `failure_threshold` if failures are transient
- Increase `timeout_seconds` to give more recovery time
- Fix underlying database performance issues
Issue: FFI completion circuit breaker opens during normal load
Diagnosis:
- Check Ruby/Python handler execution time
- Review completion channel buffer utilization
- Verify worker concurrency settings
Resolution:
- Increase `slow_send_threshold_ms` if handlers are legitimately slow
- Increase channel buffer size in worker config
- Reduce handler concurrency if system is overloaded
Issue: Task readiness breaker open but web API working fine
Diagnosis:
- Task readiness queries may be slower/different than API queries
- Polling may hit database at different times (e.g., during maintenance)
Resolution:
- Independent breakers are working as designed
- Check specific task readiness query performance
- Consider database index optimization for readiness queries
Source Code Reference
| Component | File |
|---|---|
| `CircuitBreakerBehavior` Trait | `tasker-shared/src/resilience/behavior.rs` |
| Generic `CircuitBreaker` | `tasker-shared/src/resilience/circuit_breaker.rs` |
| Circuit Breaker Config | `tasker-shared/src/config/circuit_breaker.rs` |
| `MessageClient` (messaging breaker) | `tasker-shared/src/messaging/client.rs` |
| `WebDatabaseCircuitBreaker` | `tasker-orchestration/src/api_common/circuit_breaker.rs` |
| Web CB Helpers | `tasker-orchestration/src/web/circuit_breaker.rs` |
| `TaskReadinessCircuitBreaker` | `tasker-orchestration/src/orchestration/task_readiness/circuit_breaker.rs` |
| `FfiCompletionCircuitBreaker` | `tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs` |
| Worker Health Integration | `tasker-worker/src/web/handlers/health.rs` |
| Circuit Breaker Types | `tasker-shared/src/types/api/worker.rs` |
Related Documentation
- Backpressure Architecture - Complete resilience strategy
- Operations: Backpressure Monitoring - Operational runbooks
- Operations: MPSC Channel Tuning - Channel capacity management
- Observability - Metrics and logging standards
- Configuration Management - TOML configuration reference
<- Back to Documentation Hub
Crate Architecture
Last Updated: 2026-01-15 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Events and Commands | Quick Start
Overview
Tasker Core is organized as a Cargo workspace with 7 member crates, each with a specific responsibility in the workflow orchestration system. This document explains the role of each crate, their inter-dependencies, and how they work together to provide a complete orchestration solution.
Design Philosophy
The crate structure follows these principles:
- Separation of Concerns: Each crate has a well-defined responsibility
- Minimal Dependencies: Crates depend on the minimum necessary dependencies
- Shared Foundation: Common types and utilities in `tasker-shared`
- Language Flexibility: Support for multiple worker implementations (Rust, Ruby, Python planned)
- Production Ready: Workers and the orchestration system can be deployed and scaled independently
Workspace Structure
tasker-core/
├── tasker-pgmq/ # PGMQ wrapper with notification support
├── tasker-shared/ # Shared types, SQL functions, utilities
├── tasker-orchestration/ # Task coordination and lifecycle management
├── tasker-worker/ # Step execution and handler integration
├── tasker-client/ # API client library (REST + gRPC transport)
├── tasker-ctl/ # CLI binary (depends on tasker-client)
└── workers/
├── ruby/ext/tasker_core/ # Ruby FFI bindings
└── rust/ # Rust native worker
Crate Dependency Graph
┌─────────────────────────────────────────────────────────┐
│ External Dependencies │
│ (sqlx, tokio, serde, pgmq, magnus, axum, etc.) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ tasker-pgmq │
│ PGMQ wrapper with PostgreSQL LISTEN/NOTIFY │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ tasker-shared │
│ Core types, SQL functions, state machines │
└─────────────────────────────────────────────────────────┘
│
┌────────────┴────────────┐
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ tasker-orchestration │ │ tasker-worker │
│ Task coordination │ │ Step execution │
│ Lifecycle management │ │ Handler integration │
│ REST API │ │ FFI support │
└──────────────────────────┘ └──────────────────────────┘
│ │
▼ │
┌──────────────────────────┐ │
│ tasker-client │ │
│ API client library │ │
│ REST + gRPC transport │ │
└──────────────────────────┘ │
│ │
▼ │
┌──────────────────────────┐ │
│ tasker-ctl │ │
│ CLI binary │ │
└──────────────────────────┘ │
│
┌────────────────────────┘
│
┌────────┴────────┐
▼ ▼
┌────────────┐ ┌────────────┐
│ workers/ │ │ workers/ │
│ ruby/ │ │ rust/ │
│ ext/ │ │ │
└────────────┘ └────────────┘
Core Crates
tasker-pgmq
Purpose: Wrapper around PostgreSQL Message Queue (PGMQ) with native PostgreSQL LISTEN/NOTIFY support
Location: tasker-pgmq/
Key Responsibilities:
- Wrap the `pgmq` crate with notification capabilities
- Provide atomic `pgmq_send_with_notify()` operations
- Handle notification channel management
- Support namespace-aware queue naming
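Namespace-aware naming means each namespace gets its own queue, so workers can subscribe only to the namespaces they serve. The naming scheme below is illustrative only — the actual format is defined by tasker-pgmq and is not guaranteed to match this sketch:

```rust
// Illustrative namespace-aware queue naming: one queue per namespace,
// derived deterministically so producers and consumers agree on the name.
// The real format lives in tasker-pgmq.
fn namespace_queue_name(namespace: &str, suffix: &str) -> String {
    format!("{namespace}_{suffix}")
}
```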
Public API:
pub struct PgmqClient { /* ... */ }

impl PgmqClient {
    // Send message with atomic notification
    pub async fn send_with_notify<T>(&self, queue: &str, msg: T) -> Result<i64>;

    // Read message with visibility timeout
    pub async fn read<T>(&self, queue: &str, vt: i32) -> Result<Option<Message<T>>>;

    // Delete processed message
    pub async fn delete(&self, queue: &str, msg_id: i64) -> Result<bool>;
}
When to Use:
- When you need reliable message queuing with PostgreSQL
- When you need atomic send + notify operations
- When building event-driven systems on PostgreSQL
Dependencies:
- `pgmq` - Core PostgreSQL message queue functionality
- `sqlx` - Database connectivity
- `tokio` - Async runtime
tasker-shared
Purpose: Foundation crate containing all shared types, utilities, and SQL function interfaces
Location: tasker-shared/
Key Responsibilities:
- Core domain models (`Task`, `WorkflowStep`, `TaskTransition`, etc.)
- State machine implementations (Task + Step)
- SQL function executor and registry
- Database utilities and migrations
- Event system traits and types
- Messaging abstraction layer: Provider-agnostic messaging with PGMQ, RabbitMQ, and InMemory backends
- Factory system for testing
- Metrics and observability primitives
Public API:
// Core Models
pub mod models {
    pub struct Task { /* ... */ }
    pub struct WorkflowStep { /* ... */ }
    pub struct TaskTransition { /* ... */ }
    pub struct WorkflowStepTransition { /* ... */ }
}

// State Machines
pub mod state_machine {
    pub struct TaskStateMachine { /* ... */ }
    pub struct StepStateMachine { /* ... */ }
    pub enum TaskState { /* 12 states */ }
    pub enum WorkflowStepState { /* 9 states */ }
}

// SQL Functions
pub mod database {
    pub struct SqlFunctionExecutor { /* ... */ }
    pub async fn get_step_readiness_status(...) -> Result<Vec<StepReadinessStatus>>;
    pub async fn get_next_ready_tasks(...) -> Result<Vec<ReadyTaskInfo>>;
}

// Event System
pub mod event_system {
    pub trait EventDrivenSystem { /* ... */ }
    pub enum DeploymentMode { Hybrid, EventDrivenOnly, PollingOnly }
}

// Messaging
pub mod messaging {
    // Provider abstraction
    pub enum MessagingProvider { Pgmq, RabbitMq, InMemory }
    pub trait MessagingService { /* send_message, receive_messages, ack_message, ... */ }
    pub trait SupportsPushNotifications { /* subscribe, subscribe_many, requires_fallback_polling */ }
    pub enum MessageNotification { Available { ... }, Message(...) }

    // Domain client
    pub struct MessageClient { /* High-level queue operations */ }

    // Message types
    pub struct SimpleStepMessage { /* ... */ }
    pub struct TaskRequestMessage { /* ... */ }
    pub struct StepExecutionResult { /* ... */ }
}
When to Use:
- Always - This is the foundation for all other crates
- When you need core domain models
- When you need state machine logic
- When you need SQL function access
- When you need testing factories
Dependencies:
- `tasker-pgmq` - Message queue operations
- `sqlx` - Database operations
- `serde` - Serialization
Why It’s Separate:
- Eliminates circular dependencies between orchestration and worker
- Provides single source of truth for domain models
- Enables independent testing of core logic
- Allows multiple implementations (orchestration vs worker) to share code
tasker-orchestration
Purpose: Task coordination, lifecycle management, and orchestration REST API
Location: tasker-orchestration/
Key Responsibilities:
- Actor-based lifecycle coordination
- Task initialization and finalization
- Step discovery and enqueueing
- Result processing from workers
- Dynamic executor pool management
- Event-driven coordination
- REST API endpoints
- Health monitoring
- Metrics collection
Public API:
// Core orchestration
pub struct OrchestrationCore { /* ... */ }

impl OrchestrationCore {
    pub async fn new() -> Result<Self>;
    pub async fn from_config(config: ConfigManager) -> Result<Self>;
}

// Actor-based coordination
pub mod actors {
    pub struct ActorRegistry { /* ... */ }
    pub struct TaskRequestActor { /* ... */ }
    pub struct ResultProcessorActor { /* ... */ }
    pub struct StepEnqueuerActor { /* ... */ }
    pub struct TaskFinalizerActor { /* ... */ }
    pub trait OrchestrationActor { /* ... */ }
    pub trait Handler<M: Message> { /* ... */ }
    pub trait Message { /* ... */ }
}

// Lifecycle services (wrapped by actors)
pub mod lifecycle {
    pub struct TaskInitializer { /* ... */ }
    pub struct StepEnqueuerService { /* ... */ }
    pub struct OrchestrationResultProcessor { /* ... */ }
    pub struct TaskFinalizer { /* ... */ }
}

// Message hydration (Phase 4)
pub mod hydration {
    pub struct StepResultHydrator { /* ... */ }
    pub struct TaskRequestHydrator { /* ... */ }
    pub struct FinalizationHydrator { /* ... */ }
}

// REST API (Axum)
pub mod web {
    // POST /v1/tasks
    pub async fn create_task(request: TaskRequest) -> Result<TaskResponse>;

    // GET /v1/tasks/{uuid}
    pub async fn get_task(uuid: Uuid) -> Result<TaskResponse>;

    // GET /health
    pub async fn health_check() -> Result<HealthResponse>;
}

// gRPC API (Tonic)
// Feature-gated behind `grpc-api`
pub mod grpc {
    pub struct GrpcServer { /* ... */ }
    pub struct GrpcState { /* wraps Arc<SharedApiServices> */ }

    pub mod services {
        pub struct TaskServiceImpl { /* 6 RPCs */ }
        pub struct StepServiceImpl { /* 4 RPCs */ }
        pub struct TemplateServiceImpl { /* 2 RPCs */ }
        pub struct HealthServiceImpl { /* 4 RPCs */ }
        pub struct AnalyticsServiceImpl { /* 2 RPCs */ }
        pub struct DlqServiceImpl { /* 6 RPCs */ }
        pub struct ConfigServiceImpl { /* 1 RPC */ }
    }

    pub mod interceptors {
        pub struct AuthInterceptor { /* Bearer token, API key */ }
    }
}

// Event systems
pub mod event_systems {
    pub struct OrchestrationEventSystem { /* ... */ }
    pub struct TaskReadinessEventSystem { /* ... */ }
}
Actor Architecture:
The orchestration crate implements a lightweight actor pattern for lifecycle component coordination:
- ActorRegistry: Manages all 4 orchestration actors with lifecycle hooks
- Message-Based Communication: Type-safe message handling via the `Handler<M>` trait
- Service Decomposition: Large services decomposed into focused components (<300 lines per file)
- Direct Integration: Command processor calls actors directly without wrapper layers
See Actor-Based Architecture for comprehensive documentation.
When to Use:
- When you need to run the orchestration server
- When you need task coordination logic
- When building custom orchestration components
- When integrating with the REST API
Dependencies:
- `tasker-shared` - Core types and SQL functions
- `tasker-pgmq` - Message queuing
- `axum` - REST API framework
- `tower-http` - HTTP middleware
Deployment: Typically deployed as a server process (tasker-server binary)
Dual-Server Architecture:
Orchestration supports both REST and gRPC APIs running simultaneously via SharedApiServices:
pub struct SharedApiServices {
    pub security_service: Option<Arc<SecurityService>>,
    pub task_service: TaskService,
    pub step_service: StepService,
    pub health_service: HealthService,
    // ... other services
}

// Both APIs share the same service instances
AppState { services: Arc<SharedApiServices>, ... }  // REST
GrpcState { services: Arc<SharedApiServices>, ... } // gRPC
Port Allocation:
- REST: 8080 (configurable)
- gRPC: 9190 (configurable)
tasker-worker
Purpose: Step execution, handler integration, and worker coordination
Location: tasker-worker/
Key Responsibilities:
- Claim steps from namespace queues
- Execute step handlers (Rust or FFI)
- Submit results to orchestration
- Template management and caching
- Event-driven step claiming
- Worker health monitoring
- FFI integration layer
Public API:
// Worker core
pub struct WorkerCore { /* ... */ }

impl WorkerCore {
    pub async fn new(config: WorkerConfig) -> Result<Self>;
    pub async fn start(&mut self) -> Result<()>;
}

// Handler execution
pub mod handlers {
    pub trait RustStepHandler {
        async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult>;
    }
}

// Template management
pub mod task_template_manager {
    pub struct TaskTemplateManager { /* ... */ }

    impl TaskTemplateManager {
        pub async fn load_templates(&mut self) -> Result<()>;
        pub fn get_template(&self, name: &str) -> Option<&TaskTemplate>;
    }
}

// Event systems
pub mod event_systems {
    pub struct WorkerEventSystem { /* ... */ }
}
When to Use:
- When you need to run a worker process
- When implementing custom step handlers
- When integrating with Ruby/Python handlers via FFI
- When building worker-specific tools
Dependencies:
- `tasker-shared` - Core types and messaging
- `tasker-pgmq` - Message queuing
- `magnus` (optional) - Ruby FFI bindings
Deployment: Deployed as worker processes, typically one per namespace or scaled horizontally
tasker-client
Purpose: Transport-agnostic API client library for REST and gRPC
Location: tasker-client/
Key Responsibilities:
- HTTP client for orchestration REST API
- gRPC client for orchestration gRPC API (feature-gated)
- Transport abstraction via unified client traits
- Configuration management and auth resolution
- Client-side request building
Public API:
#![allow(unused)]
fn main() {
// REST client
pub struct RestOrchestrationClient {
pub async fn new(base_url: &str) -> Result<Self>;
// Task, step, template, health operations
}
// gRPC client (feature-gated)
#[cfg(feature = "grpc")]
pub struct GrpcOrchestrationClient {
pub async fn connect(endpoint: &str) -> Result<Self>;
pub async fn connect_with_auth(endpoint: &str, auth: GrpcAuthConfig) -> Result<Self>;
// Same operations as REST client
}
// Transport-agnostic client
pub enum UnifiedOrchestrationClient {
Rest(Box<RestOrchestrationClient>),
Grpc(Box<GrpcOrchestrationClient>),
}
// Client trait for transport abstraction
pub trait OrchestrationClient: Send + Sync {
async fn create_task(&self, request: TaskRequest) -> Result<TaskResponse>;
async fn get_task(&self, uuid: Uuid) -> Result<TaskResponse>;
async fn list_tasks(&self, filters: TaskFilters) -> Result<Vec<TaskResponse>>;
async fn health_check(&self) -> Result<HealthResponse>;
// ... more operations
}
}
When to Use:
- When you need to interact with orchestration API from Rust
- When building integration tests
- When implementing client applications or FFI bindings
- When building UI frontends (TUI, web) that need API access
tasker-ctl
Purpose: Command-line interface for Tasker (split from tasker-client)
Location: tasker-ctl/
Key Responsibilities:
- CLI argument parsing and command dispatch (via clap)
- Task, worker, system, config, auth, and DLQ commands
- Configuration documentation generation (via askama, feature-gated)
- API key generation and management
CLI Tools:
# Task management
tasker-ctl task create --template linear_workflow
tasker-ctl task get <uuid>
tasker-ctl task list --namespace payments
# Health checks
tasker-ctl health
# Configuration docs generation
tasker-ctl docs generate
When to Use:
- When managing tasks from the command line
- When generating configuration documentation
- When performing administrative operations (auth, DLQ management)
Dependencies:
- `reqwest` - HTTP client
- `clap` - CLI argument parsing
- `serde_json` - JSON serialization
Worker Implementations
workers/ruby/ext/tasker_core
Purpose: Ruby FFI bindings enabling Ruby workers to execute Rust-orchestrated workflows
Location: workers/ruby/ext/tasker_core/
Key Responsibilities:
- Expose Rust worker functionality to Ruby via Magnus (FFI)
- Handle Ruby handler execution
- Manage Ruby <-> Rust type conversions
- Provide Ruby API for template registration
- FFI performance optimization
Ruby API:
# Worker bootstrap
result = TaskerCore::Worker::Bootstrap.start!
# Template registration (automatic)
# Ruby templates in workers/ruby/app/tasker/tasks/templates/
# Handler execution (automatic via FFI)
class MyHandler < TaskerCore::StepHandler::Base
  def call(context)
    # Step implementation
    success(result: { status: 'done' })
  end
end
When to Use:
- When you have existing Ruby handlers
- When you need Ruby-specific libraries or gems
- When migrating from Ruby-based orchestration
- When team expertise is primarily Ruby
Dependencies:
- `magnus` - Ruby FFI bindings
- `tasker-worker` - Core worker logic
- Ruby runtime
Performance Considerations:
- FFI overhead: ~5-10ms per step (measured)
- Ruby GC can impact latency
- Thread-safe FFI calls serialized by Ruby's Global VM Lock (GVL)
- Best for I/O-bound operations, not CPU-intensive
workers/rust
Purpose: Native Rust worker implementation for maximum performance
Location: workers/rust/
Key Responsibilities:
- Native Rust step handler execution
- Template definitions in Rust
- Direct integration with tasker-worker
- Maximum performance for CPU-intensive operations
Handler API:
// Define handler in Rust
pub struct MyHandler;

#[async_trait]
impl RustStepHandler for MyHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        // Step implementation
        Ok(StepExecutionResult::success_from_json(json!({"result": "done"})))
    }
}

// Register in template
pub fn register_template() -> TaskTemplate {
    TaskTemplate {
        name: "my_workflow",
        steps: vec![
            StepTemplate {
                name: "my_step",
                handler: Box::new(MyHandler),
                // ...
            }
        ],
    }
}
When to Use:
- When you need maximum performance
- For CPU-intensive operations
- When building new workflows in Rust
- When minimizing latency is critical
Dependencies:
- `tasker-worker` - Core worker logic
- `tokio` - Async runtime
Performance: Native Rust handlers have zero FFI overhead
Crate Relationships
How Crates Work Together
Task Creation Flow
Client Application
↓ [HTTP POST]
tasker-client
↓ [REST API]
tasker-orchestration::web
↓ [Task lifecycle]
tasker-orchestration::lifecycle::TaskInitializer
↓ [Uses]
tasker-shared::models::Task
tasker-shared::database::sql_functions
↓ [PostgreSQL]
Database + PGMQ
Step Execution Flow
tasker-orchestration::lifecycle::StepEnqueuer
↓ [pgmq_send_with_notify]
PGMQ namespace queue
↓ [pg_notify event]
tasker-worker::event_systems::WorkerEventSystem
↓ [Claims step]
tasker-worker::handlers::execute_handler
↓ [FFI or native]
workers/ruby or workers/rust
↓ [Returns result]
tasker-worker::orchestration_result_sender
↓ [pgmq_send_with_notify]
PGMQ orchestration_step_results queue
↓ [pg_notify event]
tasker-orchestration::lifecycle::ResultProcessor
↓ [Updates state]
tasker-shared::models::WorkflowStepTransition
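The step execution round trip above can be modeled in a few lines: the enqueuer sends a step to a namespace queue, a worker claims and executes it, and the result travels back on a results queue. This is a toy sketch using in-process channels in place of PGMQ queues and pg_notify; the `Step`/`StepResult` types are stand-ins, not Tasker's.

```rust
use std::sync::mpsc;

#[derive(Debug)]
struct Step { name: String }

#[derive(Debug)]
struct StepResult { name: String, success: bool }

// One full round trip: enqueue -> claim -> execute -> result.
fn round_trip(name: &str) -> StepResult {
    let (step_tx, step_rx) = mpsc::channel::<Step>();         // namespace queue
    let (result_tx, result_rx) = mpsc::channel::<StepResult>(); // results queue

    // Orchestration side (StepEnqueuer): enqueue a ready step.
    step_tx.send(Step { name: name.to_string() }).unwrap();

    // Worker side: claim the step, run the handler, send the result back.
    let step = step_rx.recv().unwrap();
    result_tx.send(StepResult { name: step.name, success: true }).unwrap();

    // Orchestration side (ResultProcessor): consume the result.
    result_rx.recv().unwrap()
}

fn main() {
    println!("{:?}", round_trip("charge_card"));
}
```

In the real system each arrow in the diagram is a PGMQ queue send plus a pg_notify signal, so the worker and result processor wake up without polling.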
Dependency Rationale
Why tasker-shared exists:
- Prevents circular dependencies (orchestration ↔ worker)
- Single source of truth for domain models
- Enables independent testing
- Allows SQL function reuse
Why workers are separate from tasker-worker:
- Language-specific implementations
- Independent deployment
- FFI boundary separation
- Multiple worker types supported
Why tasker-pgmq is separate:
- Reusable in other projects
- Focused responsibility
- Easy to test independently
- Can be published as separate crate
Building and Testing
Build All Crates
# Build everything with all features
cargo build --all-features
# Build specific crate
cargo build --package tasker-orchestration --all-features
# Build workspace root (minimal, mostly for integration)
cargo build
Test All Crates
# Test everything
cargo test --all-features
# Test specific crate
cargo test --package tasker-shared --all-features
# Test with database
DATABASE_URL="postgresql://..." cargo test --all-features
Feature Flags
# Root workspace features
[features]
benchmarks = [
"tasker-shared/benchmarks",
# ...
]
test-utils = [
"tasker-orchestration/test-utils",
"tasker-shared/test-utils",
"tasker-worker/test-utils",
]
Migration Notes
Root Crate Being Phased Out
The root tasker-core crate (defined in the workspace root Cargo.toml) is being phased out:
- Current: Contains minimal code, mostly workspace configuration
- Future: Will be removed entirely, replaced by individual crates
- Impact: No functional impact, internal restructuring only
- Timeline: Complete when all functionality moved to member crates
Why: Cleaner workspace structure, better separation of concerns, easier to understand
Adding New Crates
When adding a new crate to the workspace:
- Add the new crate to `[workspace.members]` in the root `Cargo.toml`
- Create the crate: `cargo new --lib tasker-new-crate`
- Add workspace dependencies to the crate's `Cargo.toml`
- Update this documentation
- Add the crate to the dependency graph above
- Document its public API
Best Practices
When to Create a New Crate
Create a new crate when:
- ✅ You have a distinct, reusable component
- ✅ You need independent versioning
- ✅ You want to reduce compile times
- ✅ You need isolation for testing
- ✅ You have language-specific implementations
Don’t create a new crate when:
- ❌ It’s tightly coupled to existing crates
- ❌ It’s only used in one place
- ❌ It would create circular dependencies
- ❌ It’s a small utility module
Dependency Management
- Use workspace dependencies: Define versions in the root `Cargo.toml`
- Minimize dependencies: Only depend on what you need
- Version consistently: Use `workspace = true` in member crates
- Document dependencies: Explain why each dependency is needed
API Design
- Stable public API: Changes should be backward compatible
- Clear documentation: Every public item needs docs
- Examples in docs: Show how to use the API
- Error handling: Use `Result` with meaningful error types
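The "meaningful error types" guideline can be sketched with std only: a small error enum with `Display` and `Error` impls so callers can match on variants and still log a readable message. The names here (`ClientError`, `get_task`) are illustrative, not the crates' actual error types, which may use a derive crate such as thiserror.

```rust
use std::fmt;

// Illustrative error enum: each variant carries the context a caller needs.
#[derive(Debug, PartialEq)]
enum ClientError {
    Timeout { after_ms: u64 },
    NotFound { uuid: String },
}

impl fmt::Display for ClientError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ClientError::Timeout { after_ms } => write!(f, "request timed out after {after_ms}ms"),
            ClientError::NotFound { uuid } => write!(f, "task {uuid} not found"),
        }
    }
}

impl std::error::Error for ClientError {}

// Hypothetical API call returning Result with the typed error.
fn get_task(uuid: &str) -> Result<String, ClientError> {
    if uuid.is_empty() {
        return Err(ClientError::NotFound { uuid: uuid.to_string() });
    }
    Ok(format!("task {uuid}"))
}

fn main() {
    match get_task("") {
        Ok(t) => println!("{t}"),
        Err(e) => println!("error: {e}"), // uses the Display impl
    }
}
```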
Related Documentation
- Actor-Based Architecture - Actor pattern implementation in tasker-orchestration
- Messaging Abstraction - Provider-agnostic messaging
- Quick Start - Get running with the crates
- Events and Commands - How crates coordinate
- States and Lifecycles - State machines in tasker-shared
- Task Readiness & Execution - SQL functions in tasker-shared
- Archive: Ruby Integration Lessons - FFI patterns
← Back to Documentation Hub
Deployment Patterns and Configuration
Last Updated: 2026-01-15 Audience: Architects, Operators Status: Active Related Docs: Documentation Hub | Quick Start | Observability | Messaging Abstraction
← Back to Documentation Hub
Overview
Tasker Core supports three deployment modes, each optimized for different operational requirements and infrastructure constraints. This guide covers deployment patterns, configuration management, and production considerations.
Key Deployment Modes:
- Hybrid Mode (Recommended) - Event-driven with polling fallback
- EventDrivenOnly Mode - Pure event-driven for lowest latency
- PollingOnly Mode - Traditional polling for restricted environments
Messaging Backend Options:
- PGMQ (Default) - PostgreSQL-based, single infrastructure dependency
- RabbitMQ - AMQP broker, higher throughput for high-volume scenarios
Messaging Backend Selection
Tasker Core supports multiple messaging backends through a provider-agnostic abstraction layer. The choice of backend affects deployment architecture and operational requirements.
Backend Comparison
| Feature | PGMQ | RabbitMQ |
|---|---|---|
| Infrastructure | PostgreSQL only | PostgreSQL + RabbitMQ |
| Delivery Model | Poll + pg_notify signals | Native push (basic_consume) |
| Fallback Polling | Required for reliability | Not needed |
| Throughput | Good | Higher |
| Latency | Low (~10-50ms) | Lowest (~5-20ms) |
| Operational Complexity | Lower | Higher |
| Message Persistence | PostgreSQL transactions | RabbitMQ durability |
PGMQ (Default)
PostgreSQL Message Queue is the default backend, ideal for:
- Simpler deployments: Single database dependency
- Transactional workflows: Messages participate in PostgreSQL transactions
- Smaller to medium scale: Excellent for most workloads
Configuration:
# Default - no additional configuration needed
TASKER_MESSAGING_BACKEND=pgmq
Deployment Mode Interaction:
- Uses `pg_notify` for real-time notifications
- Fallback polling recommended for reliability
- Hybrid mode provides the best balance
RabbitMQ
AMQP-based messaging for high-throughput scenarios:
- High-volume workloads: Better throughput characteristics
- Existing RabbitMQ infrastructure: Leverage existing investments
- Pure push delivery: No fallback polling required
Configuration:
TASKER_MESSAGING_BACKEND=rabbitmq
RABBITMQ_URL=amqp://user:password@rabbitmq:5672/%2F
Deployment Mode Interaction:
- EventDrivenOnly mode is natural fit (no fallback needed)
- Native push delivery via `basic_consume()`
- Protocol-guaranteed message delivery
Choosing a Backend
Decision Tree:
┌─────────────────┐
│ Do you need the │
│ highest possible │
│ throughput? │
└────────┬────────┘
│
┌──────────┴──────────┐
│ │
Yes No
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Do you have │ │ Use PGMQ │
│ existing │ │ (simpler ops) │
│ RabbitMQ? │ └────────────────┘
└───────┬────────┘
│
┌──────────┴──────────┐
│ │
Yes No
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Use RabbitMQ │ │ Evaluate │
└────────────────┘ │ operational │
│ tradeoffs │
└────────────────┘
Recommendation: Start with PGMQ. Migrate to RabbitMQ only when throughput requirements demand it.
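The decision tree above reduces to two questions, which can be encoded directly (the enum and function names are illustrative, not part of the Tasker API):

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Pgmq,
    RabbitMq,
    EvaluateTradeoffs,
}

// The backend decision tree, as a function of the two questions it asks.
fn choose_backend(need_max_throughput: bool, have_rabbitmq: bool) -> Backend {
    if !need_max_throughput {
        Backend::Pgmq // simpler ops: single PostgreSQL dependency
    } else if have_rabbitmq {
        Backend::RabbitMq // leverage existing infrastructure
    } else {
        Backend::EvaluateTradeoffs // weigh throughput gains vs. operational cost
    }
}

fn main() {
    // Most deployments land here: throughput is adequate on PGMQ.
    println!("{:?}", choose_backend(false, false));
}
```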
Production Deployment Strategy: Mixed Mode Architecture
Important: In production-grade Kubernetes environments, you typically run multiple orchestration containers simultaneously with different deployment modes. This is not just horizontal scaling with identical configurations; it's deploying containers with different coordination strategies to optimize for both throughput and reliability.
Recommended Production Pattern
High-Throughput + Safety Net Architecture:
# Most orchestration containers in EventDrivenOnly mode for maximum throughput
- EventDrivenOnly containers: 8-12 replicas (handles 80-90% of workload)
- PollingOnly containers: 2-3 replicas (safety net for missed events)
Why this works:
- EventDrivenOnly containers handle the bulk of work with ~10ms latency
- PollingOnly containers catch any events that might be missed during network issues or LISTEN/NOTIFY failures
- Both sets of containers coordinate through atomic SQL operations (no conflicts)
- Scale each mode independently based on throughput needs
Alternative: All-Hybrid Deployment
You can also deploy all containers in Hybrid mode and scale horizontally:
# All containers use Hybrid mode
- Hybrid containers: 10-15 replicas
This is simpler but less flexible. The mixed-mode approach lets you:
- Tune for specific workload patterns (event-heavy vs. polling-heavy)
- Adapt to infrastructure constraints (some networks better for events, others for polling)
- Optimize resource usage (EventDrivenOnly uses less CPU than Hybrid)
- Scale dimensions independently (scale up event listeners without scaling pollers)
Key Insight
The different deployment modes exist not just for config tuning, but to enable sophisticated deployment strategies where you mix coordination approaches across containers to meet production throughput and reliability requirements.
Deployment Mode Comparison
| Feature | Hybrid | EventDrivenOnly | PollingOnly |
|---|---|---|---|
| Latency | Low (event-driven primary) | Lowest (~10ms) | Higher (~100-500ms) |
| Reliability | Highest (automatic fallback) | Good (requires stable connections) | Good (no dependencies) |
| Resource Usage | Medium (listeners + pollers) | Low (listeners only) | Medium (pollers only) |
| Network Requirements | Standard PostgreSQL | Persistent connections required | Standard PostgreSQL |
| Production Recommended | ✅ Yes | ⚠️ With stable network | ⚠️ For restricted environments |
| Complexity | Medium | Low | Low |
Hybrid Mode (Recommended)
Overview
Hybrid mode combines the best of both worlds: event-driven coordination for real-time performance with polling fallback for reliability.
How it works:
- PostgreSQL LISTEN/NOTIFY provides real-time event notifications
- If event listeners fail or lag, polling automatically takes over
- System continuously monitors and switches between modes
- No manual intervention required
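The fallback decision described above reduces to a threshold check on how long ago the last event arrived. A minimal sketch (the struct and field mirror the `fallback_activation_threshold_ms` setting; the real monitor also tracks listener connection health):

```rust
// Illustrative config fragment; mirrors the TOML key, not the real struct.
struct HybridConfig {
    fallback_activation_threshold_ms: u64,
}

// Polling takes over when the listener appears silent past the threshold.
fn polling_fallback_active(cfg: &HybridConfig, ms_since_last_event: u64) -> bool {
    ms_since_last_event > cfg.fallback_activation_threshold_ms
}

fn main() {
    let cfg = HybridConfig { fallback_activation_threshold_ms: 5000 };
    // Healthy listener: events arriving well within the threshold.
    println!("{}", polling_fallback_active(&cfg, 120));  // false
    // Listener lagging or down: polling takes over until events resume.
    println!("{}", polling_fallback_active(&cfg, 9000)); // true
}
```

Once events resume, the elapsed time drops back under the threshold and polling deactivates, which is why no manual intervention is needed.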
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
[orchestration.hybrid]
# Event listener settings
enable_event_listeners = true
listener_reconnect_interval_ms = 5000
listener_health_check_interval_ms = 30000
# Polling fallback settings
enable_polling_fallback = true
polling_interval_ms = 1000
fallback_activation_threshold_ms = 5000
# Worker event settings
[orchestration.worker_events]
enable_worker_listeners = true
worker_listener_reconnect_ms = 5000
When to Use Hybrid Mode
Ideal for:
- Production deployments requiring high reliability
- Environments with occasional network instability
- Systems requiring both low latency and guaranteed delivery
- Multi-region deployments with variable network quality
Example: Production E-commerce Platform
# docker-compose.production.yml
version: '3.8'
services:
orchestration:
image: tasker-orchestration:latest
environment:
- TASKER_ENV=production
- TASKER_DEPLOYMENT_MODE=Hybrid
- DATABASE_URL=postgresql://tasker:${DB_PASSWORD}@postgres:5432/tasker_production
- RUST_LOG=info
deploy:
replicas: 3
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 5s
retries: 3
postgres:
image: postgres:16
environment:
- POSTGRES_DB=tasker_production
- POSTGRES_PASSWORD=${DB_PASSWORD}
volumes:
- postgres-data:/var/lib/postgresql/data
deploy:
resources:
limits:
cpus: '4'
memory: 8G
volumes:
postgres-data:
Monitoring Hybrid Mode
Key Metrics:
// Hybrid mode health indicators
tasker_event_listener_active{mode="hybrid"} = 1 // Listener is active
tasker_event_listener_lag_ms{mode="hybrid"} < 100 // Event lag is acceptable
tasker_polling_fallback_active{mode="hybrid"} = 0 // Not in fallback mode
tasker_mode_switches_total{mode="hybrid"} < 10/hour // Infrequent mode switching
Alert conditions:
- Event listener down for > 60 seconds
- Polling fallback active for > 5 minutes
- Mode switches > 20 per hour (indicates instability)
EventDrivenOnly Mode
Overview
EventDrivenOnly mode provides the lowest possible latency by relying entirely on PostgreSQL LISTEN/NOTIFY for coordination.
How it works:
- Orchestration and workers establish persistent PostgreSQL connections
- LISTEN on specific channels for events
- Immediate notification on queue changes
- No polling overhead or delay
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "EventDrivenOnly"
[orchestration.event_driven]
# Listener configuration
listener_reconnect_interval_ms = 2000
listener_health_check_interval_ms = 15000
max_reconnect_attempts = 10
# Event channels
channels = [
"pgmq_message_ready.orchestration",
"pgmq_message_ready.*",
"pgmq_queue_created"
]
# Connection pool for listeners
listener_pool_size = 5
connection_timeout_ms = 5000
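The channel list above mixes exact names with a trailing `.*` wildcard. A sketch of how such patterns can be matched, treating `.*` as a prefix wildcard; this is illustrative and the listener's actual matching rules may differ:

```rust
// Match a configured channel pattern against an incoming notify channel.
// A trailing ".*" matches any channel one or more segments under the prefix.
fn channel_matches(pattern: &str, channel: &str) -> bool {
    match pattern.strip_suffix(".*") {
        Some(prefix) => channel
            .strip_prefix(prefix)
            .map_or(false, |rest| rest.starts_with('.')),
        None => pattern == channel,
    }
}

fn main() {
    let patterns = [
        "pgmq_message_ready.orchestration",
        "pgmq_message_ready.*",
        "pgmq_queue_created",
    ];
    let channel = "pgmq_message_ready.payments";
    let matched = patterns.iter().any(|p| channel_matches(p, channel));
    println!("{matched}"); // true: caught by the wildcard pattern
}
```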
When to Use EventDrivenOnly Mode
Ideal for:
- High-throughput, low-latency requirements
- Stable network environments
- Development and testing environments
- Systems with reliable PostgreSQL infrastructure
Not recommended for:
- Unstable network connections
- Environments with frequent PostgreSQL failovers
- Systems requiring guaranteed operation during network issues
Example: High-Performance Payment Processing
// Worker configuration for event-driven mode
use tasker_worker::WorkerConfig;

let config = WorkerConfig {
    deployment_mode: DeploymentMode::EventDrivenOnly,
    namespaces: vec!["payments".to_string()],
    event_driven_settings: EventDrivenSettings {
        listener_reconnect_interval_ms: 2000,
        health_check_interval_ms: 15000,
        max_reconnect_attempts: 10,
    },
    ..Default::default()
};

// Start worker with event-driven mode
let worker = WorkerCore::from_config(config).await?;
worker.start().await?;
Monitoring EventDrivenOnly Mode
Critical Metrics:
// Event-driven health indicators
tasker_event_listener_active{mode="event_driven"} = 1 // Must be 1
tasker_event_notifications_received_total // Should be > 0
tasker_event_processing_duration_seconds // Should be < 0.01
tasker_listener_reconnections_total // Should be low
Alert conditions:
- Event listener inactive
- No events received for > 60 seconds (when activity expected)
- Reconnections > 5 per hour
PollingOnly Mode
Overview
PollingOnly mode provides the most reliable operation in restricted or unstable network environments by using traditional polling.
How it works:
- Orchestration and workers poll message queues at regular intervals
- No dependency on persistent connections or LISTEN/NOTIFY
- Configurable polling intervals for performance/resource trade-offs
- Automatic retry and backoff on failures
Configuration
# config/tasker/base/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
# Polling intervals
task_request_poll_interval_ms = 1000
step_result_poll_interval_ms = 500
finalization_poll_interval_ms = 2000
# Batch processing
batch_size = 10
max_messages_per_poll = 100
# Backoff on errors
error_backoff_base_ms = 1000
error_backoff_max_ms = 30000
error_backoff_multiplier = 2.0
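The error backoff settings above imply a doubling sequence capped at the maximum. A sketch of that computation (illustrative; the engine's exact rounding or jitter may differ):

```rust
// Exponential backoff: base * multiplier^errors, capped at max.
fn backoff_ms(base: u64, multiplier: f64, max: u64, consecutive_errors: u32) -> u64 {
    let raw = base as f64 * multiplier.powi(consecutive_errors as i32);
    (raw as u64).min(max)
}

fn main() {
    // With base 1000ms, multiplier 2.0, max 30000ms (the settings above):
    for errors in 0..7 {
        print!("{} ", backoff_ms(1000, 2.0, 30000, errors));
    }
    println!();
    // 1000 2000 4000 8000 16000 30000 30000
}
```

The cap matters: without it, a long outage would push the next poll attempt minutes or hours out, delaying recovery once the error clears.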
When to Use PollingOnly Mode
Ideal for:
- Restricted network environments (firewalls blocking persistent connections)
- Environments with frequent PostgreSQL connection issues
- Systems prioritizing reliability over latency
- Legacy infrastructure with limited LISTEN/NOTIFY support
Not recommended for:
- High-frequency, low-latency requirements
- Systems with strict resource constraints
- Environments where polling overhead is problematic
Example: Batch Processing System
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
# Longer intervals for batch processing
task_request_poll_interval_ms = 5000
step_result_poll_interval_ms = 2000
finalization_poll_interval_ms = 10000
# Large batches for efficiency
batch_size = 50
max_messages_per_poll = 500
Monitoring PollingOnly Mode
Key Metrics:
// Polling health indicators
tasker_polling_cycles_total // Should be increasing
tasker_polling_messages_processed_total // Should be > 0
tasker_polling_duration_seconds // Should be stable
tasker_polling_errors_total // Should be low
Alert conditions:
- Polling stopped (no cycles in last 60 seconds)
- Polling duration > 10x interval (indicates overload)
- Error rate > 5% of polling cycles
Configuration Management
Component-Based Configuration
Tasker Core uses a component-based TOML configuration system with environment-specific overrides.
Configuration Structure:
config/tasker/
├── base/ # Base configuration (all environments)
│ ├── database.toml # Database connection pool settings
│ ├── orchestration.toml # Orchestration and deployment mode
│ ├── circuit_breakers.toml # Circuit breaker thresholds
│ ├── executor_pools.toml # Executor pool sizing
│ ├── pgmq.toml # Message queue configuration
│ ├── query_cache.toml # Query caching settings
│ └── telemetry.toml # Metrics and logging
│
└── environments/ # Environment-specific overrides
├── development/
│ └── *.toml # Development overrides
├── test/
│ └── *.toml # Test overrides
└── production/
└── *.toml # Production overrides
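The precedence rule behind this layout is simple: start from the `base/` values, then let any key present in the environment directory win. Real loading parses per-component TOML files; this sketch models only the merge precedence, using string maps as stand-ins for parsed config.

```rust
use std::collections::HashMap;

// Layered config: environment overrides win over base values.
fn effective_config(
    base: &HashMap<&'static str, &'static str>,
    overrides: &HashMap<&'static str, &'static str>,
) -> HashMap<&'static str, &'static str> {
    let mut merged = base.clone();
    for (k, v) in overrides {
        merged.insert(k, v); // environment value replaces the base value
    }
    merged
}

fn main() {
    let base = HashMap::from([("deployment_mode", "Hybrid"), ("max_connections", "25")]);
    let production = HashMap::from([("max_connections", "50")]);
    let cfg = effective_config(&base, &production);
    // Untouched keys fall through from base; overridden keys come from the env.
    println!("{} / {}", cfg["deployment_mode"], cfg["max_connections"]); // Hybrid / 50
}
```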
Environment Detection
# Set environment via TASKER_ENV
export TASKER_ENV=production
# Validate configuration
cargo run --bin config-validator
# Expected output:
# ✓ Configuration loaded successfully
# ✓ Environment: production
# ✓ Deployment mode: Hybrid
# ✓ Database pool: 50 connections
# ✓ Circuit breakers: 10 configurations
Example: Production Configuration
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
max_concurrent_tasks = 1000
task_timeout_seconds = 3600
[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true
polling_interval_ms = 2000
fallback_activation_threshold_ms = 10000
[orchestration.health]
health_check_interval_ms = 30000
unhealthy_threshold = 3
recovery_threshold = 2
# config/tasker/environments/production/database.toml
[database]
max_connections = 50
min_connections = 10
connection_timeout_ms = 5000
idle_timeout_seconds = 600
max_lifetime_seconds = 1800
[database.query_cache]
enabled = true
max_size = 1000
ttl_seconds = 300
# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5
timeout_seconds = 60
half_open_timeout_seconds = 30
[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60
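The breaker settings above can be read as a small state machine: the circuit opens after `error_threshold` consecutive failures and rejects calls until the timeout elapses. A minimal model of the threshold behavior (half-open probing and the timeout transitions are omitted for brevity; this is not the engine's implementation):

```rust
#[derive(Debug, PartialEq)]
enum BreakerState {
    Closed,
    Open,
}

struct CircuitBreaker {
    error_threshold: u32,
    consecutive_errors: u32,
    state: BreakerState,
}

impl CircuitBreaker {
    fn new(error_threshold: u32) -> Self {
        Self { error_threshold, consecutive_errors: 0, state: BreakerState::Closed }
    }

    fn record_failure(&mut self) {
        self.consecutive_errors += 1;
        if self.consecutive_errors >= self.error_threshold {
            self.state = BreakerState::Open; // stop sending traffic downstream
        }
    }

    fn record_success(&mut self) {
        self.consecutive_errors = 0;
        self.state = BreakerState::Closed;
    }
}

fn main() {
    // error_threshold = 5, as in the database breaker config above.
    let mut breaker = CircuitBreaker::new(5);
    for _ in 0..5 {
        breaker.record_failure();
    }
    println!("{:?}", breaker.state); // Open
}
```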
Docker Compose Deployment
Development Setup
# docker-compose.yml
version: '3.8'
services:
postgres:
image: postgres:16
environment:
POSTGRES_USER: tasker
POSTGRES_PASSWORD: tasker
POSTGRES_DB: tasker_rust_test
ports:
- "5432:5432"
volumes:
- postgres-data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U tasker"]
interval: 5s
timeout: 5s
retries: 5
orchestration:
build:
context: .
target: orchestration
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8080:8080"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
worker:
build:
context: .
target: worker
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8081:8081"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
ruby-worker:
build:
context: ./workers/ruby
dockerfile: Dockerfile
environment:
- TASKER_ENV=development
- DATABASE_URL=postgresql://tasker:tasker@postgres:5432/tasker_rust_test
- RUST_LOG=debug
ports:
- "8082:8082"
depends_on:
postgres:
condition: service_healthy
profiles:
- server
volumes:
postgres-data:
Production Deployment
# docker-compose.production.yml
version: '3.8'
services:
postgres:
image: postgres:16
environment:
POSTGRES_USER: tasker
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
POSTGRES_DB: tasker_production
volumes:
- postgres-data:/var/lib/postgresql/data
deploy:
placement:
constraints:
- node.labels.database == true
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '2'
memory: 4G
secrets:
- db_password
orchestration:
image: tasker-orchestration:${VERSION}
environment:
- TASKER_ENV=production
- DATABASE_URL_FILE=/run/secrets/database_url
deploy:
replicas: 3
update_config:
parallelism: 1
delay: 10s
order: start-first
rollback_config:
parallelism: 0
order: stop-first
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
secrets:
- database_url
worker:
image: tasker-worker:${VERSION}
environment:
- TASKER_ENV=production
- DATABASE_URL_FILE=/run/secrets/database_url
deploy:
replicas: 5
resources:
limits:
cpus: '1'
memory: 1G
reservations:
cpus: '0.5'
memory: 512M
secrets:
- database_url
secrets:
db_password:
external: true
database_url:
external: true
volumes:
postgres-data:
driver: local
Kubernetes Deployment
Mixed-Mode Production Deployment (Recommended)
This example demonstrates the recommended production pattern: multiple orchestration deployments with different modes.
# k8s/orchestration-event-driven.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration-event-driven
namespace: tasker
labels:
app: tasker-orchestration
mode: event-driven
spec:
replicas: 10 # Majority of orchestration capacity
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
mode: event-driven
template:
metadata:
labels:
app: tasker-orchestration
mode: event-driven
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DEPLOYMENT_MODE
value: "EventDrivenOnly" # High-throughput mode
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
resources:
requests:
cpu: 500m # Lower CPU for event-driven
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# k8s/orchestration-polling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration-polling
namespace: tasker
labels:
app: tasker-orchestration
mode: polling
spec:
replicas: 3 # Safety net for missed events
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
mode: polling
template:
metadata:
labels:
app: tasker-orchestration
mode: polling
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DEPLOYMENT_MODE
value: "PollingOnly" # Reliability safety net
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
resources:
requests:
cpu: 750m # Higher CPU for polling
memory: 512Mi
limits:
cpu: 1500m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
---
# k8s/orchestration-service.yaml
apiVersion: v1
kind: Service
metadata:
name: tasker-orchestration
namespace: tasker
spec:
selector:
app: tasker-orchestration # Matches BOTH deployments
ports:
- port: 8080
targetPort: 8080
protocol: TCP
name: http
type: ClusterIP
Key points about this mixed-mode deployment:
- 10 EventDrivenOnly pods handle 80-90% of work with ~10ms latency
- 3 PollingOnly pods catch anything missed by event listeners
- Single service load balances across all 13 pods
- No conflicts - atomic SQL operations prevent duplicate processing
- Independent scaling - scale event-driven pods for throughput, polling pods for reliability
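The "no conflicts" claim can be illustrated with a compare-and-swap race: many pods attempt to claim the same finalization, but exactly one succeeds. In Tasker the same guarantee comes from atomic SQL claiming in PostgreSQL, not an in-process atomic; this is a toy model of the property.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Spawn `pods` threads racing to claim one work item; count the winners.
fn race_to_claim(pods: usize) -> usize {
    let claimed = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..pods)
        .map(|_| {
            let claimed = Arc::clone(&claimed);
            thread::spawn(move || {
                // Only the first compare_exchange from false -> true succeeds.
                claimed
                    .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
                    .is_ok()
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&won| won)
        .count()
}

fn main() {
    // 13 pods (10 event-driven + 3 polling) race; exactly one wins the claim.
    println!("{}", race_to_claim(13)); // 1
}
```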
Single-Mode Orchestration Deployment
# k8s/orchestration-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration
namespace: tasker
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: tasker-orchestration
template:
metadata:
labels:
app: tasker-orchestration
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: orchestration
image: tasker-orchestration:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
ports:
- containerPort: 8080
name: http
protocol: TCP
resources:
requests:
cpu: 1000m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
name: tasker-orchestration
namespace: tasker
spec:
selector:
app: tasker-orchestration
ports:
- port: 8080
targetPort: 8080
protocol: TCP
name: http
type: ClusterIP
Worker Deployment
# k8s/worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-worker-payments
namespace: tasker
spec:
replicas: 5
selector:
matchLabels:
app: tasker-worker
namespace: payments
template:
metadata:
labels:
app: tasker-worker
namespace: payments
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8081"
spec:
containers:
- name: worker
image: tasker-worker:1.0.0
env:
- name: TASKER_ENV
value: "production"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-secrets
key: database-url
- name: RUST_LOG
value: "info"
- name: WORKER_NAMESPACES
value: "payments"
ports:
- containerPort: 8081
name: http
protocol: TCP
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
livenessProbe:
httpGet:
path: /health
port: 8081
initialDelaySeconds: 20
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8081
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: tasker-worker-payments
namespace: tasker
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: tasker-worker-payments
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Health Monitoring
Health Check Endpoints
Orchestration Health:
# Basic health check
curl http://localhost:8080/health
# Response:
{
"status": "healthy",
"database": "connected",
"message_queue": "operational"
}
# Detailed health check
curl http://localhost:8080/health/detailed
# Response:
{
"status": "healthy",
"deployment_mode": "Hybrid",
"event_listeners": {
"active": true,
"channels": 3,
"lag_ms": 12
},
"polling": {
"active": false,
"fallback_triggered": false
},
"database": {
"status": "connected",
"pool_size": 50,
"active_connections": 23
},
"circuit_breakers": {
"database": "closed",
"message_queue": "closed"
},
"executors": {
"task_initializer": {
"active": 3,
"max": 10,
"queue_depth": 5
},
"result_processor": {
"active": 5,
"max": 10,
"queue_depth": 12
}
}
}
Worker Health:
# Worker health check
curl http://localhost:8081/health
# Response:
{
"status": "healthy",
"namespaces": ["payments", "inventory"],
"active_executions": 8,
"claimed_steps": 3
}
Kubernetes Probes
# Liveness probe - restart if unhealthy
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe - remove from load balancer if not ready
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
gRPC Health Checks
Tasker Core exposes gRPC health endpoints alongside REST for Kubernetes gRPC health probes.
Port Allocation:
| Service | REST Port | gRPC Port |
|---|---|---|
| Orchestration | 8080 | 9190 |
| Rust Worker | 8081 | 9191 |
| Ruby Worker | 8082 | 9200 |
| Python Worker | 8083 | 9300 |
| TypeScript Worker | 8085 | 9400 |
gRPC Health Endpoints:
# Using grpcurl
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckLiveness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckReadiness
grpcurl -plaintext localhost:9190 tasker.v1.HealthService/CheckDetailedHealth
Kubernetes gRPC Probes (Kubernetes 1.24+):
# gRPC liveness probe
livenessProbe:
grpc:
port: 9190
service: tasker.v1.HealthService
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
# gRPC readiness probe
readinessProbe:
grpc:
port: 9190
service: tasker.v1.HealthService
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
Configuration (config/tasker/base/orchestration.toml):
[orchestration.grpc]
enabled = true
bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
enable_reflection = true # Service discovery via grpcurl
enable_health_service = true # gRPC health checks
Scaling Patterns
Horizontal Scaling
Mixed-Mode Orchestration Scaling (Recommended)
Scale different deployment modes independently to optimize for throughput and reliability:
# Scale event-driven pods for throughput
kubectl scale deployment tasker-orchestration-event-driven --replicas=15 -n tasker
# Scale polling pods for reliability
kubectl scale deployment tasker-orchestration-polling --replicas=5 -n tasker
Scaling strategy by workload:
| Scenario | Event-Driven Pods | Polling Pods | Rationale |
|---|---|---|---|
| High throughput | 15-20 | 3-5 | Maximize event-driven capacity |
| Network unstable | 5-8 | 5-8 | Balance between modes |
| Cost optimization | 10-12 | 2-3 | Minimize polling overhead |
| Maximum reliability | 8-10 | 8-10 | Ensure complete coverage |
Single-Mode Orchestration Scaling
If using single deployment mode (Hybrid or EventDrivenOnly):
# Scale orchestration to 10 replicas (all same mode)
kubectl scale deployment tasker-orchestration --replicas=10 -n tasker
Key principles:
- Multiple orchestration instances process tasks independently
- Atomic finalization claiming prevents duplicate processing
- Load balancer distributes API requests across instances
Worker Scaling
Workers scale independently per namespace:
# Scale payment workers to 10 replicas
kubectl scale deployment tasker-worker-payments --replicas=10 -n tasker
- Each worker claims steps from namespace-specific queues
- No coordination required between workers
- Scale per namespace based on queue depth
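The "scale per namespace based on queue depth" rule can be sketched as a simple sizing function. This is illustrative only — `target_per_worker` and the replica bounds are hypothetical tuning values, not Tasker settings:

```python
import math

def desired_worker_replicas(queue_depth: int, target_per_worker: int = 100,
                            min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Derive a replica count for a namespace worker from its queue depth."""
    desired = math.ceil(queue_depth / target_per_worker)
    return max(min_replicas, min(max_replicas, desired))

print(desired_worker_replicas(0))     # floor at min_replicas -> 1
print(desired_worker_replicas(950))   # ceil(950 / 100) -> 10
print(desired_worker_replicas(9999))  # capped at max_replicas -> 50
```

Feed this from queue-depth metrics (or an HPA custom metric) rather than CPU alone, since worker load tracks backlog more closely than utilization.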
Vertical Scaling
Resource Allocation:
# High-throughput orchestration
resources:
requests:
cpu: 2000m
memory: 4Gi
limits:
cpu: 4000m
memory: 8Gi
# Standard worker
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
Auto-Scaling
HPA Configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: tasker-orchestration
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: tasker-orchestration
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: tasker_tasks_active
target:
type: AverageValue
averageValue: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 30
Production Considerations
Database Configuration
Connection Pooling:
# config/tasker/environments/production/database.toml
[database]
max_connections = 50 # Total pool size
min_connections = 10 # Minimum maintained connections
connection_timeout_ms = 5000 # Connection acquisition timeout
idle_timeout_seconds = 600 # Close idle connections after 10 minutes
max_lifetime_seconds = 1800 # Recycle connections after 30 minutes
Calculation:
Total DB Connections = (Orchestration Replicas × Pool Size) + (Worker Replicas × Pool Size)
Example: (3 × 50) + (10 × 20) = 350 connections
Ensure PostgreSQL max_connections > Total DB Connections + Buffer
Recommended: max_connections = 500 for above example
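The calculation above can be expressed directly. This is a sketch; the 40% buffer is an illustrative headroom choice (for migrations, ad-hoc queries, and monitoring tools), not a Tasker default:

```python
def total_db_connections(orch_replicas: int, orch_pool: int,
                         worker_replicas: int, worker_pool: int) -> int:
    """Total connections Tasker will hold open against PostgreSQL."""
    return orch_replicas * orch_pool + worker_replicas * worker_pool

def recommended_max_connections(total: int, buffer_pct: int = 40) -> int:
    # Integer math: add headroom on top of the pool total
    return total + total * buffer_pct // 100

total = total_db_connections(3, 50, 10, 20)
print(total)                              # 350, matching the example above
print(recommended_max_connections(total)) # 490 -> round up to e.g. 500
```

Re-run this whenever replica counts or pool sizes change; exceeding PostgreSQL's `max_connections` surfaces as pool-acquisition timeouts under load.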
Circuit Breaker Tuning
# config/tasker/environments/production/circuit_breakers.toml
[circuit_breakers.database]
enabled = true
error_threshold = 5 # Open after 5 consecutive errors
timeout_seconds = 60 # Stay open for 60 seconds
half_open_timeout_seconds = 30 # Test recovery for 30 seconds
[circuit_breakers.message_queue]
enabled = true
error_threshold = 10
timeout_seconds = 120
half_open_timeout_seconds = 60
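A minimal sketch of the consecutive-error breaker these settings tune. This is illustrative Python under stated assumptions — Tasker's actual breaker is implemented in Rust, and the half-open probe window (`half_open_timeout_seconds`) is omitted here for brevity:

```python
import time

class CircuitBreaker:
    """Consecutive-error breaker: opens at error_threshold, re-probes after timeout."""

    def __init__(self, error_threshold=5, timeout_seconds=60, clock=time.monotonic):
        self.error_threshold = error_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.consecutive_errors = 0
        self.opened_at = None  # None means closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.timeout_seconds:
            return "half_open"  # allow a trial request through
        return "open"

    def record_success(self):
        self.consecutive_errors = 0
        self.opened_at = None

    def record_error(self):
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.error_threshold:
            self.opened_at = self.clock()

    def allow_request(self):
        return self.state != "open"

breaker = CircuitBreaker(error_threshold=5, timeout_seconds=60)
for _ in range(5):
    breaker.record_error()
print(breaker.state)            # "open" after 5 consecutive errors
print(breaker.allow_request())  # False while open
```

Tune `error_threshold` higher for chatty dependencies (like the message queue above) so transient blips don't open the breaker.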
Executor Pool Sizing
# config/tasker/environments/production/executor_pools.toml
[executor_pools.task_initializer]
min_executors = 2
max_executors = 10
queue_high_watermark = 100
queue_low_watermark = 10
[executor_pools.result_processor]
min_executors = 5
max_executors = 20
queue_high_watermark = 200
queue_low_watermark = 20
[executor_pools.step_enqueuer]
min_executors = 3
max_executors = 15
queue_high_watermark = 150
queue_low_watermark = 15
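The watermark semantics can be sketched as a scaling decision: grow while the backlog sits above `queue_high_watermark`, shrink below `queue_low_watermark`, and hold steady in between, always within the min/max bounds. This is an illustrative sketch, not Tasker's actual pool-scaling algorithm:

```python
def scale_decision(queue_depth, current, min_executors, max_executors,
                   high_watermark, low_watermark):
    """Watermark-based pool sizing: grow above the high mark, shrink below the low mark."""
    if queue_depth > high_watermark and current < max_executors:
        return current + 1
    if queue_depth < low_watermark and current > min_executors:
        return current - 1
    return current

# task_initializer pool from the config above
print(scale_decision(150, 4, 2, 10, 100, 10))  # backlog above 100 -> scale to 5
print(scale_decision(5, 4, 2, 10, 100, 10))    # idle below 10 -> scale to 3
print(scale_decision(50, 4, 2, 10, 100, 10))   # between watermarks -> stay at 4
```

The gap between the two watermarks provides hysteresis, preventing the pool from oscillating when queue depth hovers near a single threshold.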
Observability Integration
Prometheus Metrics:
# Prometheus scrape config
scrape_configs:
- job_name: 'tasker-orchestration'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- tasker
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
Key Alerts:
# alerts.yaml
groups:
- name: tasker
interval: 30s
rules:
- alert: TaskerOrchestrationDown
expr: up{job="tasker-orchestration"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Tasker orchestration instance down"
- alert: TaskerHighErrorRate
expr: rate(tasker_step_errors_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate in step execution"
- alert: TaskerCircuitBreakerOpen
expr: tasker_circuit_breaker_state{state="open"} == 1
for: 2m
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.name }} is open"
- alert: TaskerDatabasePoolExhausted
expr: tasker_database_pool_active >= tasker_database_pool_max
for: 1m
labels:
severity: critical
annotations:
summary: "Database connection pool exhausted"
Migration Strategies
Migrating to Hybrid Mode
Step 1: Enable event listeners
# config/tasker/environments/production/orchestration.toml
[orchestration]
deployment_mode = "Hybrid"
[orchestration.hybrid]
enable_event_listeners = true
enable_polling_fallback = true # Keep polling enabled during migration
Step 2: Monitor event listener health
# Check metrics for event listener stability
curl http://localhost:8080/health/detailed | jq '.event_listeners'
Step 3: Gradually reduce polling frequency
# Once event listeners are stable
[orchestration.hybrid]
polling_interval_ms = 5000 # Increase from 1000ms to 5000ms
Step 4: Validate performance
- Monitor latency metrics: tasker_step_discovery_duration_seconds
- Verify no missed events: tasker_polling_messages_found_total should be near zero
Rollback Plan
If event-driven mode fails:
# Immediate rollback to PollingOnly
[orchestration]
deployment_mode = "PollingOnly"
[orchestration.polling]
task_request_poll_interval_ms = 500 # Aggressive polling
Gradual rollback:
- Increase polling frequency in Hybrid mode
- Monitor for stability
- Disable event listeners once polling is stable
- Switch to PollingOnly mode
Troubleshooting
Event Listener Issues
Problem: Event listeners not receiving notifications
Diagnosis:
-- Check PostgreSQL LISTEN/NOTIFY is working
NOTIFY pgmq_message_ready, 'test';
# Check listener status
curl http://localhost:8080/health/detailed | jq '.event_listeners'
Solutions:
- Verify PostgreSQL version supports LISTEN/NOTIFY (9.0+)
- Check firewall rules allow persistent connections
- Increase listener_reconnect_interval_ms if connections drop frequently
- Switch to Hybrid or PollingOnly mode if issues persist
Polling Performance Issues
Problem: High CPU usage from polling
Diagnosis:
# Check polling frequency and batch sizes
curl http://localhost:8080/health/detailed | jq '.polling'
Solutions:
- Increase polling intervals
- Increase batch sizes to process more messages per poll
- Switch to Hybrid or EventDrivenOnly mode for better performance
- Scale horizontally to distribute polling load
Database Connection Exhaustion
Problem: “connection pool exhausted” errors
Diagnosis:
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE datname = 'tasker_production';
-- Check max connections
SHOW max_connections;
Solutions:
- Increase max_connections in database.toml
- Increase the PostgreSQL max_connections setting
- Reduce the number of replicas
- Implement connection pooling at infrastructure level (PgBouncer)
Best Practices
Configuration Management
- Use environment-specific overrides instead of modifying base configuration
- Validate configuration with config-validator before deployment
- Version control all configuration, including environment overrides
- Use secrets management for sensitive values (passwords, keys)
Deployment Strategy
- Use mixed-mode architecture in production (EventDrivenOnly + PollingOnly)
- Deploy 80-90% of orchestration pods in EventDrivenOnly mode for throughput
- Deploy 10-20% of orchestration pods in PollingOnly mode as safety net
- Single service load balances across all pods
- Alternative: Deploy all pods in Hybrid mode for simpler configuration
- Trade-off: Less tuning flexibility, slightly higher resource usage
- Scale each mode independently based on workload characteristics
- Monitor deployment mode metrics to adjust ratios over time
- Test mixed-mode deployments in staging before production
Deployment Operations
- Always test configuration changes in staging first
- Use rolling updates with health checks to prevent downtime
- Monitor deployment mode health during and after deployments
- Keep polling capacity available even when event-driven is primary
Scaling Guidelines
- Mixed-mode orchestration: Scale EventDrivenOnly and PollingOnly deployments independently
- Scale event-driven pods based on throughput requirements
- Scale polling pods based on reliability requirements
- Single-mode orchestration: Scale based on API request rate and task initialization throughput
- Workers: Scale based on namespace-specific queue depth
- Database connections: Monitor and adjust pool sizes as replicas scale
- Use HPA for automatic scaling based on CPU/memory and custom metrics
Observability
- Enable comprehensive metrics in production
- Set up alerts for circuit breaker states, connection pool exhaustion
- Monitor deployment mode distribution in mixed-mode deployments
- Track event listener lag in EventDrivenOnly and Hybrid modes
- Monitor polling overhead to optimize resource usage
- Track step execution latency per namespace and handler
Summary
Tasker Core’s flexible deployment modes enable sophisticated production architectures:
Deployment Modes
- Hybrid Mode: Event-driven with polling fallback in a single container
- EventDrivenOnly Mode: Maximum throughput with ~10ms latency
- PollingOnly Mode: Reliable safety net with traditional polling
Recommended Production Pattern
Mixed-Mode Architecture (recommended for production at scale):
- Deploy majority of orchestration pods in EventDrivenOnly mode for high throughput
- Deploy minority of orchestration pods in PollingOnly mode as reliability safety net
- Both deployments coordinate through atomic SQL operations with no conflicts
- Scale each mode independently based on workload characteristics
Alternative: Deploy all pods in Hybrid mode for simpler configuration with automatic fallback.
The key insight: deployment modes exist not just for configuration tuning, but to enable mixing coordination strategies across containers to meet production requirements for both throughput and reliability.
← Back to Documentation Hub
Next: Observability | Benchmarks | Quick Start
Domain Events Architecture
Last Updated: 2025-12-01 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Events and Commands | Observability | States and Lifecycles
← Back to Documentation Hub
This document provides comprehensive documentation of the domain event system in tasker-core, covering event delivery modes, publisher patterns, subscriber implementations, and integration with the workflow orchestration system.
Overview
Domain Events vs System Events
The tasker-core system distinguishes between two types of events:
| Aspect | System Events | Domain Events |
|---|---|---|
| Purpose | Internal coordination | Business observability |
| Producers | Orchestration components | Step handlers during execution |
| Consumers | Event systems, command processors | External systems, analytics, audit |
| Delivery | PGMQ + LISTEN/NOTIFY | Configurable (Durable/Fast/Broadcast) |
| Semantics | At-least-once | Fire-and-forget (best effort) |
System events handle internal workflow coordination: task initialization, step enqueueing, result processing, and finalization. These are documented in Events and Commands.
Domain events enable business observability: payment processed, order fulfilled, inventory updated. Step handlers publish these events to enable external systems to react to business outcomes.
Key Design Principle: Non-Blocking Publication
Domain event publishing never fails the step. This is a fundamental design decision:
- Event publish errors are logged at warn! level
- Step execution continues regardless of publish outcome
- Business logic success is independent of event delivery
- A handler that successfully processes a payment should not fail if event publishing fails
// Event publishing is fire-and-forget
if let Err(e) = publisher.publish_event(event_name, payload, metadata).await {
    warn!(
        handler = self.handler_name(),
        event_name = event_name,
        error = %e,
        "Failed to publish domain event - step will continue"
    );
}
// Step continues regardless of publish result
Architecture
Data Flow
flowchart TB
subgraph handlers["Step Handlers"]
SH["Step Handler<br/>(Rust/Ruby)"]
end
SH -->|"publish_domain_event(name, payload)"| ER
subgraph routing["Event Routing"]
ER["EventRouter<br/>(Delivery Mode)"]
end
ER --> Durable
ER --> Fast
ER --> Broadcast
subgraph modes["Delivery Modes"]
Durable["Durable<br/>(PGMQ)"]
Fast["Fast<br/>(In-Process)"]
Broadcast["Broadcast<br/>(Both Paths)"]
end
Durable --> NEQ["Namespace<br/>Event Queue"]
Fast --> IPB["InProcessEventBus"]
Broadcast --> NEQ
Broadcast --> IPB
subgraph external["External Integration"]
NEQ --> EC["External Consumer<br/>(Polling)"]
end
subgraph internal["Internal Subscribers"]
IPB --> RS["Rust<br/>Subscribers"]
IPB --> RFF["Ruby FFI<br/>Channel"]
end
style handlers fill:#e1f5fe
style routing fill:#fff3e0
style modes fill:#f3e5f5
style external fill:#ffebee
style internal fill:#e8f5e9
Component Summary
| Component | Purpose | Location |
|---|---|---|
EventRouter | Routes events based on delivery mode | tasker-shared/src/events/domain_events/router.rs |
DomainEventPublisher | Durable PGMQ-based publishing | tasker-shared/src/events/domain_events/publisher.rs |
InProcessEventBus | Fast in-memory event dispatch | tasker-shared/src/events/domain_events/in_process_bus.rs |
EventRegistry | Pattern-based subscriber registration | tasker-shared/src/events/domain_events/registry.rs |
StepEventPublisher | Handler callback trait | tasker-shared/src/events/domain_events/step_event_publisher.rs |
GenericStepEventPublisher | Default publisher implementation | tasker-shared/src/events/domain_events/generic_publisher.rs |
Delivery Modes
Overview
The domain event system supports three delivery modes, configured per event in YAML templates:
| Mode | Durability | Latency | Use Case |
|---|---|---|---|
| Durable | High (PGMQ) | Higher (~5-10ms) | External system integration, audit trails |
| Fast | Low (memory) | Lowest (<1ms) | Internal subscribers, metrics, real-time processing |
| Broadcast | High + Low | Both paths | Events needing both internal and external delivery |
Durable Mode (PGMQ) - External Integration Boundary
Durable events define the integration boundary between Tasker and external systems. Events are published to namespace-specific PGMQ queues where external consumers can poll and process them.
Key Design Decision: Tasker does NOT consume durable events internally. PGMQ serves as a lightweight, PostgreSQL-native alternative to external message brokers (Kafka, AWS SNS/SQS, RabbitMQ). External systems or middleware proxies can:
- Poll PGMQ queues directly
- Forward events to Kafka, SNS/SQS, or other messaging systems
- Implement custom event processing pipelines
payment.processed → payments_domain_events (PGMQ queue) → External Systems
order.fulfilled → fulfillment_domain_events (PGMQ queue) → External Systems
Characteristics:
- Persisted in PostgreSQL (survives restarts)
- For external consumer integration only
- No internal Tasker polling or subscription
- Consumer acknowledgment and retry handled by external consumers
- Ordered within namespace
Implementation:
// DomainEventPublisher routes durable events to PGMQ
pub async fn publish_event(
    &self,
    event_name: &str,
    payload: Value,
    metadata: EventMetadata,
) -> TaskerResult<()> {
    let queue_name = format!("{}_domain_events", metadata.namespace);
    let message = DomainEventMessage {
        event_name: event_name.to_string(),
        payload,
        metadata,
    };
    self.message_client
        .send_message(&queue_name, &message)
        .await
}
Fast Mode (In-Process) - Internal Subscriber Pattern
Fast events are the only delivery mode with internal subscriber support. Events are dispatched immediately to in-memory subscribers within the Tasker worker process.
// InProcessEventBus provides dual-path delivery
pub struct InProcessEventBus {
    event_sender: tokio::sync::broadcast::Sender<DomainEvent>,
    ffi_event_sender: Option<mpsc::Sender<DomainEvent>>,
}
Characteristics:
- Zero persistence overhead
- Sub-millisecond latency
- Lost on service restart
- Internal to Tasker process only
- Dual-path: Rust subscribers + Ruby FFI channel
- Non-blocking broadcast semantics
Dual-Path Architecture:
InProcessEventBus
│
├──► tokio::broadcast::Sender ──► Rust Subscribers (EventRegistry)
│
└──► mpsc::Sender ──► Ruby FFI Channel ──► Ruby Event Handlers
Use Cases:
- Real-time metrics collection
- Internal logging and telemetry
- Secondary actions that are not business-critical parts of the Task -> WorkflowStep DAG hierarchy
- Examples: DataDog, Sentry, NewRelic, PagerDuty, Salesforce, Slack, Zapier
Broadcast Mode - Internal + External Delivery
Broadcast mode delivers events to both paths simultaneously: the fast in-process bus for internal subscribers AND the durable PGMQ queue for external systems. This ensures internal subscribers receive the same event shape as external consumers.
// EventRouter handles broadcast semantics
async fn route_event(&self, event: DomainEvent, mode: EventDeliveryMode) {
    match mode {
        EventDeliveryMode::Durable => {
            self.durable_publisher.publish(event).await;
        }
        EventDeliveryMode::Fast => {
            self.in_process_bus.publish(event).await;
        }
        EventDeliveryMode::Broadcast => {
            // Send to both paths concurrently
            let (durable, fast) = tokio::join!(
                self.durable_publisher.publish(event.clone()),
                self.in_process_bus.publish(event)
            );
            // Log errors but don't fail
        }
    }
}
When to Use Broadcast:
- Internal subscribers need the same event that external systems receive
- Real-time internal metrics tracking for events also exported externally
- Audit logging both internally and to external compliance systems
Important: Data published via broadcast goes to BOTH the internal process AND the public PGMQ boundary. Do not use broadcast for sensitive internal-only data (use fast for those).
Publisher Patterns
Default Publisher (GenericStepEventPublisher)
The default publisher automatically handles event publication for step handlers:
pub struct GenericStepEventPublisher {
    router: Arc<EventRouter>,
    default_delivery_mode: EventDeliveryMode,
}

impl GenericStepEventPublisher {
    /// Publish event with metadata extracted from step context
    pub async fn publish(
        &self,
        step_data: &TaskSequenceStep,
        event_name: &str,
        payload: Value,
    ) -> TaskerResult<()> {
        let metadata = EventMetadata {
            task_uuid: step_data.task.task.task_uuid,
            step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
            step_name: Some(step_data.workflow_step.name.clone()),
            namespace: step_data.task.namespace_name.clone(),
            correlation_id: step_data.task.task.correlation_id,
            fired_at: Utc::now(),
            fired_by: "generic_publisher".to_string(),
        };
        self.router.route_event(event_name, payload, metadata).await
    }
}
Custom Publishers
Custom publishers extend TaskerCore::DomainEvents::BasePublisher (Ruby) to provide specialized event handling with payload transformation, conditional publishing, and lifecycle hooks.
Real Example: PaymentEventPublisher (workers/ruby/spec/handlers/examples/domain_events/publishers/payment_event_publisher.rb):
# Custom publisher for payment-related domain events
# Demonstrates durable delivery mode with custom payload enrichment
module DomainEvents
module Publishers
class PaymentEventPublisher < TaskerCore::DomainEvents::BasePublisher
# Must match the `publisher:` field in YAML
def name
'DomainEvents::Publishers::PaymentEventPublisher'
end
# Transform step result into payment event payload
def transform_payload(step_result, event_declaration, step_context = nil)
result = step_result[:result] || {}
event_name = event_declaration[:name]
if step_result[:success] && event_name&.include?('processed')
build_success_payload(result, step_result, step_context)
elsif !step_result[:success] && event_name&.include?('failed')
build_failure_payload(result, step_result, step_context)
else
result
end
end
# Determine if this event should be published
def should_publish?(step_result, event_declaration, step_context = nil)
result = step_result[:result] || {}
event_name = event_declaration[:name]
# For success events, verify we have transaction data
if event_name&.include?('processed')
return step_result[:success] && result[:transaction_id].present?
end
# For failure events, verify we have error info
if event_name&.include?('failed')
metadata = step_result[:metadata] || {}
return !step_result[:success] && metadata[:error_code].present?
end
true # Default: always publish
end
# Add execution metrics to event metadata
def additional_metadata(step_result, event_declaration, step_context = nil)
metadata = step_result[:metadata] || {}
{
execution_time_ms: metadata[:execution_time_ms],
publisher_type: 'custom',
publisher_name: name,
payment_provider: metadata[:payment_provider]
}
end
private
def build_success_payload(result, step_result, step_context)
{
transaction_id: result[:transaction_id],
amount: result[:amount],
currency: result[:currency] || 'USD',
payment_method: result[:payment_method] || 'credit_card',
processed_at: result[:processed_at] || Time.now.iso8601,
delivery_mode: 'durable',
publisher: name
}
end
end
end
end
YAML Configuration for Custom Publisher:
steps:
- name: process_payment
publishes_events:
- name: payment.processed
condition: success
delivery_mode: durable
publisher: DomainEvents::Publishers::PaymentEventPublisher
- name: payment.failed
condition: failure
delivery_mode: durable
publisher: DomainEvents::Publishers::PaymentEventPublisher
YAML Event Declaration
Events are declared in task template YAML files using the publishes_events field at the step level:
# config/tasks/payments/credit_card_payment/1.0.0.yaml
name: credit_card_payment
namespace_name: payments
version: 1.0.0
description: Process credit card payments with validation and fraud detection
# Task-level domain events (optional)
domain_events: []
steps:
- name: process_payment
description: Process the payment transaction
handler:
callable: PaymentProcessing::StepHandler::ProcessPaymentHandler
initialization:
gateway_url: "${PAYMENT_GATEWAY_URL}"
dependencies:
- validate_payment
retry:
retryable: true
limit: 3
backoff: exponential
timeout_seconds: 120
# Step-level event declarations
publishes_events:
- name: payment.processed
description: "Payment successfully processed"
condition: success # success, failure, retryable_failure, permanent_failure, always
schema:
type: object
required: [transaction_id, amount]
properties:
transaction_id: { type: string }
amount: { type: number }
delivery_mode: broadcast # durable, fast, or broadcast
publisher: PaymentEventPublisher # optional custom publisher
- name: payment.failed
description: "Payment processing failed permanently"
condition: permanent_failure
schema:
type: object
required: [error_code, reason]
properties:
error_code: { type: string }
reason: { type: string }
delivery_mode: durable
Publication Conditions:
- success: Publish only when the step completes successfully
- failure: Publish on any step failure (backward compatible)
- retryable_failure: Publish only on retryable failures (step can be retried)
- permanent_failure: Publish only on permanent failures (exhausted retries or non-retryable)
- always: Publish regardless of step outcome
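The condition semantics can be sketched as a mapping from step outcome to the set of conditions that fire. This is illustrative; whether a failure is retryable ultimately depends on the step's retry configuration:

```python
def matching_conditions(success: bool, retryable: bool) -> set:
    """Which publication conditions fire for a given step outcome."""
    fired = {"always"}
    if success:
        fired.add("success")
    else:
        fired.add("failure")  # fires on any failure, retryable or not
        fired.add("retryable_failure" if retryable else "permanent_failure")
    return fired

print(sorted(matching_conditions(True, False)))
# ['always', 'success']
print(sorted(matching_conditions(False, True)))
# ['always', 'failure', 'retryable_failure']
print(sorted(matching_conditions(False, False)))
# ['always', 'failure', 'permanent_failure']
```

Note that `failure` fires for both retryable and permanent failures, which is why it is described above as backward compatible with templates written before the finer-grained conditions existed.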
Event Declaration Fields:
- name: Event name in dotted notation (e.g., payment.processed)
- description: Human-readable description of when this event is published
- condition: When to publish (defaults to success)
- schema: JSON Schema for validating event payloads
- delivery_mode: Delivery mode (defaults to durable)
- publisher: Optional custom publisher class name
Subscriber Patterns
Subscriber patterns apply only to fast (in-process) events. Durable events are consumed by external systems, not by internal Tasker subscribers.
Rust Subscribers (InProcessEventBus)
Rust subscribers are registered with the InProcessEventBus using the EventHandler type. Subscribers are async closures that receive DomainEvent instances.
Real Example: Logging Subscriber (workers/rust/src/event_subscribers/logging_subscriber.rs):
use std::sync::Arc;
use tasker_shared::events::registry::EventHandler;
use tracing::info;

/// Create a logging subscriber that logs all events matching a pattern
pub fn create_logging_subscriber(prefix: &str) -> EventHandler {
    let prefix = prefix.to_string();
    Arc::new(move |event| {
        let prefix = prefix.clone();
        Box::pin(async move {
            let step_name = event.metadata.step_name.as_deref().unwrap_or("unknown");
            info!(
                prefix = %prefix,
                event_name = %event.event_name,
                event_id = %event.event_id,
                task_uuid = %event.metadata.task_uuid,
                step_name = %step_name,
                namespace = %event.metadata.namespace,
                correlation_id = %event.metadata.correlation_id,
                fired_at = %event.metadata.fired_at,
                "Domain event received"
            );
            Ok(())
        })
    })
}
Real Example: Metrics Collector (workers/rust/src/event_subscribers/metrics_subscriber.rs):
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

/// Collects metrics from domain events (thread-safe)
pub struct EventMetricsCollector {
    events_received: AtomicU64,
    success_events: AtomicU64,
    failure_events: AtomicU64,
    // ... additional fields
}

impl EventMetricsCollector {
    pub fn new() -> Arc<Self> {
        Arc::new(Self {
            events_received: AtomicU64::new(0),
            success_events: AtomicU64::new(0),
            failure_events: AtomicU64::new(0),
        })
    }

    /// Create an event handler for this collector
    pub fn create_handler(self: &Arc<Self>) -> EventHandler {
        let metrics = Arc::clone(self);
        Arc::new(move |event| {
            let metrics = Arc::clone(&metrics);
            Box::pin(async move {
                metrics.events_received.fetch_add(1, Ordering::Relaxed);
                if event.payload.execution_result.success {
                    metrics.success_events.fetch_add(1, Ordering::Relaxed);
                } else {
                    metrics.failure_events.fetch_add(1, Ordering::Relaxed);
                }
                Ok(())
            })
        })
    }

    pub fn events_received(&self) -> u64 {
        self.events_received.load(Ordering::Relaxed)
    }
}
Registration with InProcessEventBus:
use tasker_worker::worker::in_process_event_bus::InProcessEventBus;

let mut bus = InProcessEventBus::new(config);

// Subscribe to all events
bus.subscribe("*", create_logging_subscriber("[ALL]")).unwrap();

// Subscribe to specific patterns
bus.subscribe("payment.*", create_logging_subscriber("[PAYMENT]")).unwrap();

// Use metrics collector
let metrics = EventMetricsCollector::new();
bus.subscribe("*", metrics.create_handler()).unwrap();
Ruby Subscribers (BaseSubscriber)
Ruby subscribers extend TaskerCore::DomainEvents::BaseSubscriber and use the class-level subscribes_to pattern declaration.
Real Example: LoggingSubscriber (workers/ruby/spec/handlers/examples/domain_events/subscribers/logging_subscriber.rb):
# Example logging subscriber for fast/in-process domain events
module DomainEvents
module Subscribers
class LoggingSubscriber < TaskerCore::DomainEvents::BaseSubscriber
# Subscribe to all events using pattern matching
subscribes_to '*'
# Handle any domain event by logging its details
def handle(event)
event_name = event[:event_name]
metadata = event[:metadata] || {}
logger.info "[LoggingSubscriber] Event: #{event_name}"
logger.info " Task: #{metadata[:task_uuid]}"
logger.info " Step: #{metadata[:step_name]}"
logger.info " Namespace: #{metadata[:namespace]}"
logger.info " Correlation: #{metadata[:correlation_id]}"
end
end
end
end
Real Example: MetricsSubscriber (workers/ruby/spec/handlers/examples/domain_events/subscribers/metrics_subscriber.rb):
# Example metrics subscriber for fast/in-process domain events
module DomainEvents
module Subscribers
class MetricsSubscriber < TaskerCore::DomainEvents::BaseSubscriber
subscribes_to '*'
class << self
attr_accessor :events_received, :success_events, :failure_events,
:events_by_namespace, :last_event_at
def reset_counters!
@mutex = Mutex.new
@events_received = 0
@success_events = 0
@failure_events = 0
@events_by_namespace = Hash.new(0)
@last_event_at = nil
end
end
reset_counters!
def handle(event)
event_name = event[:event_name]
metadata = event[:metadata] || {}
execution_result = event[:execution_result] || {}
self.class.increment(:events_received)
if execution_result[:success]
self.class.increment(:success_events)
else
self.class.increment(:failure_events)
end
namespace = metadata[:namespace] || 'unknown'
self.class.increment_hash(:events_by_namespace, namespace)
self.class.set(:last_event_at, Time.now)
end
end
end
end
Registration in Bootstrap:
# Register subscribers with the registry
registry = TaskerCore::DomainEvents::SubscriberRegistry.instance
registry.register(DomainEvents::Subscribers::LoggingSubscriber)
registry.register(DomainEvents::Subscribers::MetricsSubscriber)
registry.start_all!
# Later, query metrics
puts "Total events: #{DomainEvents::Subscribers::MetricsSubscriber.events_received}"
puts "By namespace: #{DomainEvents::Subscribers::MetricsSubscriber.events_by_namespace}"
External PGMQ Consumers (Durable Events)
Durable events are published to PGMQ queues for external consumption. Tasker does not provide internal consumers for these queues. External systems can consume events using:
- Direct PGMQ Polling: Query pgmq.q_{namespace}_domain_events tables directly
- PGMQ Client Libraries: Use pgmq client libraries in Python, Node.js, Go, etc.
- Middleware Proxies: Build adapters that forward events to Kafka, SNS/SQS, etc.
Example: External Python Consumer:
import pgmq
# Connect to PGMQ
queue = pgmq.Queue("payments_domain_events", dsn="postgresql://...")
# Poll for events
while True:
messages = queue.read(batch_size=50, vt=30)
for msg in messages:
process_event(msg.message)
queue.delete(msg.msg_id)
Configuration
Domain event system configuration is part of the worker configuration in worker.toml files.
TOML Configuration
# config/tasker/base/worker.toml
# In-process event bus configuration for fast domain event delivery
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 2000 # Channel capacity for broadcast events
log_subscriber_errors = true # Log errors from event subscribers
dispatch_timeout_ms = 5000 # Timeout for event dispatch
# Domain Event System MPSC Configuration
[worker.mpsc_channels.domain_events]
command_buffer_size = 1000 # Channel capacity for domain event commands
shutdown_drain_timeout_ms = 5000 # Time to drain events on shutdown
log_dropped_events = true # Log when events are dropped due to backpressure
# In-process event settings (part of worker event systems)
[worker.event_systems.worker.metadata.in_process_events]
ffi_integration_enabled = true # Enable Ruby/Python FFI event channel
deduplication_cache_size = 10000 # Event deduplication cache size
Environment Overrides
Test Environment (config/tasker/environments/test/worker.toml):
# In-process event bus - smaller buffers for testing
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 1000
log_subscriber_errors = true
dispatch_timeout_ms = 2000
# Domain Event System - smaller buffers for testing
[worker.mpsc_channels.domain_events]
command_buffer_size = 100
shutdown_drain_timeout_ms = 1000
log_dropped_events = true
[worker.event_systems.worker.metadata.in_process_events]
deduplication_cache_size = 1000
Production Environment (config/tasker/environments/production/worker.toml):
# In-process event bus - large buffers for production throughput
[worker.mpsc_channels.in_process_events]
broadcast_buffer_size = 5000
log_subscriber_errors = false # Reduce log noise in production
dispatch_timeout_ms = 10000
# Domain Event System - large buffers for production throughput
[worker.mpsc_channels.domain_events]
command_buffer_size = 5000
shutdown_drain_timeout_ms = 10000
log_dropped_events = false # Reduce log noise in production
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| broadcast_buffer_size | Capacity of the broadcast channel for fast events | 2000 |
| log_subscriber_errors | Whether to log subscriber errors | true |
| dispatch_timeout_ms | Timeout for event dispatch to subscribers | 5000 |
| command_buffer_size | Capacity of domain event command channel | 1000 |
| shutdown_drain_timeout_ms | Time to drain pending events on shutdown | 5000 |
| log_dropped_events | Whether to log events dropped due to backpressure | true |
| ffi_integration_enabled | Enable FFI event channel for Ruby/Python | true |
| deduplication_cache_size | Size of event deduplication cache | 10000 |
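To make the role of `deduplication_cache_size` concrete, here is a minimal sketch of a bounded deduplication cache: remember the last N event IDs and reject repeats, evicting the oldest entry when full. `DedupCache` and its API are hypothetical illustrations, not Tasker types.

```rust
use std::collections::{HashSet, VecDeque};

/// Bounded deduplication cache sketch: tracks the most recent `capacity`
/// event IDs and reports whether an incoming ID has been seen before.
pub struct DedupCache {
    seen: HashSet<u128>,   // event IDs currently tracked
    order: VecDeque<u128>, // insertion order, used for eviction
    capacity: usize,
}

impl DedupCache {
    pub fn new(capacity: usize) -> Self {
        Self { seen: HashSet::new(), order: VecDeque::new(), capacity }
    }

    /// Returns true if the event is new (and records it); false if duplicate.
    pub fn insert(&mut self, event_id: u128) -> bool {
        if self.seen.contains(&event_id) {
            return false;
        }
        if self.order.len() == self.capacity {
            // Evict the oldest entry so the cache stays bounded.
            if let Some(old) = self.order.pop_front() {
                self.seen.remove(&old);
            }
        }
        self.order.push_back(event_id);
        self.seen.insert(event_id);
        true
    }
}
```

The trade-off this illustrates: a larger cache catches duplicates over a longer window, at the cost of memory per worker.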
Integration with Step Execution
Event-Driven Domain Event Publishing
The worker uses an event-driven command pattern for step execution and domain event publishing. Nothing blocks: domain events are dispatched after successful orchestration notification, using fire-and-forget semantics.
Flow (tasker-worker/src/worker/command_processor.rs):
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────────┐
│ FFI Handler │────►│ Completion │────►│ WorkerProcessor │
│ (Ruby/Rust) │ │ Channel │ │ Command Loop │
└─────────────────┘ └──────────────────┘ └──────────┬──────────┘
│
┌───────────────────────────────────┴───────────────┐
│ │
▼ ▼
┌─────────────────────┐ ┌────────────────────┐
│ 1. Send result to │ │ 2. Dispatch domain │
│ orchestration │──── on success ─────────►│ events │
│ (PGMQ) │ │ (fire-and-forget)│
└─────────────────────┘ └────────────────────┘
Implementation:
#![allow(unused)]
fn main() {
// tasker-worker/src/worker/command_processor.rs (lines 512-525)
// Worker command processor receives step completions via FFI channel
match self.handle_send_step_result(step_result.clone()).await {
Ok(()) => {
// Dispatch domain events AFTER successful orchestration notification.
// Domain events are declarative of what HAS happened - the step is only
// truly complete once orchestration has been notified successfully.
// Fire-and-forget semantics (try_send) - never blocks the worker.
self.dispatch_domain_events(&step_result, None);
info!(
worker_id = %self.worker_id,
step_uuid = %step_result.step_uuid,
"Step completion forwarded to orchestration successfully"
);
}
Err(e) => {
// Don't dispatch domain events - orchestration wasn't notified,
// so the step isn't truly complete from the system's perspective
error!(
worker_id = %self.worker_id,
step_uuid = %step_result.step_uuid,
error = %e,
"Failed to forward step completion to orchestration"
);
}
}
}
Domain Event Dispatch (fire-and-forget):
#![allow(unused)]
fn main() {
// tasker-worker/src/worker/command_processor.rs (lines 362-432)
fn dispatch_domain_events(&mut self, step_result: &StepExecutionResult, correlation_id: Option<Uuid>) {
// Retrieve cached step context (stored when step was claimed)
let task_sequence_step = match self.step_execution_contexts.remove(&step_result.step_uuid) {
Some(ctx) => ctx,
None => return, // No context = can't build events
};
// Build events from step definition's publishes_events declarations
for event_def in &task_sequence_step.step_definition.publishes_events {
// Check publication condition before building event
if !event_def.should_publish(step_result.success) {
continue; // Skip events whose condition doesn't match
}
let event = DomainEventToPublish {
event_name: event_def.name.clone(),
delivery_mode: event_def.delivery_mode,
business_payload: step_result.result.clone(),
metadata: EventMetadata { /* ... */ },
task_sequence_step: task_sequence_step.clone(),
execution_result: step_result.clone(),
};
domain_events.push(event);
}
// Fire-and-forget dispatch - try_send never blocks
let dispatched = handle.dispatch_events(domain_events, publisher_name, correlation);
if !dispatched {
warn!(
step_uuid = %step_result.step_uuid,
"Domain event dispatch failed - channel full (events dropped)"
);
}
}
}
Key Design Decisions:
- Events only after orchestration success: Domain events are declarative of what HAS happened. If orchestration notification fails, the step isn’t truly complete from the system’s perspective.
- Fire-and-forget via try_send: Never blocks the worker command loop. If the channel is full, events are dropped and logged.
- Context caching: Step execution context is cached when the step is claimed, then retrieved for event building after completion.
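The fire-and-forget semantics can be sketched with a bounded channel and `try_send`, which never blocks the sender; when the channel is full the event is dropped and the caller is told so it can log. The types below are simplified stand-ins for Tasker's worker internals, not its real API.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

// Simplified stand-in for a domain event.
#[derive(Debug, Clone, PartialEq)]
struct DomainEvent {
    name: String,
}

/// Returns true if the event was dispatched, false if it was dropped.
fn dispatch_fire_and_forget(tx: &SyncSender<DomainEvent>, event: DomainEvent) -> bool {
    match tx.try_send(event) {
        Ok(()) => true,
        // Backpressure: channel full, drop the event (real code would log here).
        Err(TrySendError::Full(_)) => false,
        // Receiver gone, e.g. during shutdown.
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

The key property is that the sender's latency is constant regardless of subscriber speed; slow consumers cost dropped events, never a stalled worker loop.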
Correlation ID Propagation
Domain events maintain correlation IDs for end-to-end distributed tracing. The correlation ID originates from the task and flows through all step executions and domain events.
EventMetadata Structure (tasker-shared/src/events/domain_events.rs):
#![allow(unused)]
fn main() {
pub struct EventMetadata {
pub task_uuid: Uuid,
pub step_uuid: Option<Uuid>,
pub step_name: Option<String>,
pub namespace: String,
pub correlation_id: Uuid, // From task for end-to-end tracing
pub fired_at: DateTime<Utc>,
pub fired_by: String, // Publisher identifier (worker_id)
}
}
Getting Correlation ID via API:
Use the orchestration API to get the correlation ID for a task:
# Get task details including correlation_id
curl http://localhost:8080/v1/tasks/{task_uuid} | jq '.correlation_id'
# Response includes correlation_id
{
"task_uuid": "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5",
"correlation_id": "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5",
"status": "complete",
...
}
Tracing Events in PGMQ:
# Find all durable events for a correlation ID
psql $DATABASE_URL -c "
SELECT
message->>'event_name' as event,
message->'metadata'->>'step_name' as step,
message->'metadata'->>'fired_at' as fired_at
FROM pgmq.q_payments_domain_events
WHERE message->'metadata'->>'correlation_id' = '0199c3e0-ccdb-7581-87ab-3f67daeaa4a5'
ORDER BY message->'metadata'->>'fired_at';
"
Metrics and Observability
OpenTelemetry Metrics
Domain event publication emits OpenTelemetry counter metrics (tasker-shared/src/events/domain_events.rs:207-219):
#![allow(unused)]
fn main() {
// Metric emitted on every domain event publication
let counter = opentelemetry::global::meter("tasker")
.u64_counter("tasker.domain_events.published.total")
.with_description("Total number of domain events published")
.build();
counter.add(1, &[
opentelemetry::KeyValue::new("event_name", event_name.to_string()),
opentelemetry::KeyValue::new("namespace", metadata.namespace.clone()),
]);
}
Prometheus Metrics Endpoint
The orchestration service exposes Prometheus-format metrics:
# Get Prometheus metrics from orchestration service
curl http://localhost:8080/metrics
# Get Prometheus metrics from worker service
curl http://localhost:8081/metrics
OpenTelemetry Tracing
Domain event publication is instrumented with tracing spans (tasker-shared/src/events/domain_events.rs:157-161):
#![allow(unused)]
fn main() {
#[instrument(skip(self, payload, metadata), fields(
event_name = %event_name,
namespace = %metadata.namespace,
correlation_id = %metadata.correlation_id
))]
pub async fn publish_event(
&self,
event_name: &str,
payload: DomainEventPayload,
metadata: EventMetadata,
) -> Result<Uuid, DomainEventError> {
// ... implementation with debug! and info! logs including correlation_id
}
}
Grafana Query Examples
Loki Query - Domain Events by Correlation ID:
{service_name="tasker-worker"} |= "Domain event published" | json | correlation_id = "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"
Loki Query - All Domain Event Publications:
{service_name=~"tasker.*"} |= "Domain event" | json | line_format "{{.event_name}} - {{.namespace}} - {{.correlation_id}}"
Tempo Query - Trace by Correlation ID:
{resource.service.name="tasker-worker"} && {span.correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Prometheus Query - Event Publication Rate by Namespace:
sum by (namespace) (rate(tasker_domain_events_published_total[5m]))
Prometheus Query - Event Publication Rate by Event Name:
topk(10, sum by (event_name) (rate(tasker_domain_events_published_total[5m])))
Structured Log Fields
Domain event logs include structured fields for querying:
| Field | Description | Example |
|---|---|---|
| event_id | Unique event UUID (v7, time-ordered) | 0199c3e0-d123-... |
| event_name | Event name in dot notation | payment.processed |
| queue_name | Target PGMQ queue | payments_domain_events |
| task_uuid | Parent task UUID | 0199c3e0-ccdb-... |
| correlation_id | End-to-end trace correlation | 0199c3e0-ccdb-... |
| namespace | Event namespace | payments |
| message_id | PGMQ message ID | 12345 |
Best Practices
1. Choose the Right Delivery Mode
| Scenario | Recommended Mode | Rationale |
|---|---|---|
| External system integration | Durable | Reliable delivery to external consumers |
| Internal metrics/telemetry | Fast | Internal subscribers only, low latency |
| Internal + external needs | Broadcast | Same event shape to both internal and external |
| Audit trails for compliance | Durable | Persisted for external audit systems |
| Real-time internal dashboards | Fast | In-process subscribers handle immediately |
Key Decision Criteria:
- Need internal Tasker subscribers? → Use fast or broadcast
- Need external system integration? → Use durable or broadcast
- Internal-only, sensitive data? → Use fast (never reaches the PGMQ boundary)
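The decision criteria above can be captured as a small function. `DeliveryMode` and `choose_delivery_mode` are illustrative names for this sketch, not Tasker's actual API.

```rust
#[derive(Debug, PartialEq)]
enum DeliveryMode {
    Fast,      // in-process subscribers only
    Durable,   // persisted to PGMQ for external consumers
    Broadcast, // both internal and external delivery
}

/// Map the two decision questions onto a delivery mode.
fn choose_delivery_mode(internal_subscribers: bool, external_consumers: bool) -> DeliveryMode {
    match (internal_subscribers, external_consumers) {
        (true, true) => DeliveryMode::Broadcast, // same event shape to both sides
        (false, true) => DeliveryMode::Durable,  // reliable delivery to external systems
        _ => DeliveryMode::Fast,                 // internal-only; never reaches PGMQ
    }
}
```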
2. Design Event Payloads
Do:
#![allow(unused)]
fn main() {
json!({
"transaction_id": "TXN-123",
"amount": 99.99,
"currency": "USD",
"timestamp": "2025-12-01T10:00:00Z",
"idempotency_key": step_uuid
})
}
Don’t:
#![allow(unused)]
fn main() {
json!({
"data": "payment processed", // No structure
"info": full_database_record // Too much data
})
}
3. Handle Subscriber Failures Gracefully
#![allow(unused)]
fn main() {
#[async_trait]
impl EventSubscriber for MySubscriber {
async fn handle(&self, event: &DomainEvent) -> TaskerResult<()> {
// Wrap in timeout
match timeout(Duration::from_secs(5), self.process(event)).await {
Ok(result) => result,
Err(_) => {
warn!(event = %event.name, "Subscriber timeout");
Ok(()) // Don't fail the dispatch
}
}
}
}
}
4. Use Correlation IDs for Debugging
#![allow(unused)]
fn main() {
// Always include correlation ID in logs
info!(
correlation_id = %event.metadata.correlation_id,
event_name = %event.name,
namespace = %event.metadata.namespace,
"Processing domain event"
);
}
Related Documentation
- Events and Commands: events-and-commands.md - System event architecture
- Observability: observability/README.md - Metrics and monitoring
- States and Lifecycles: states-and-lifecycles.md - Task/step state machines
This domain event architecture provides a flexible, reliable foundation for business observability in the tasker-core workflow orchestration system.
Events and Commands Architecture
Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Messaging Abstraction | States and Lifecycles | Deployment Patterns
← Back to Documentation Hub
This document provides comprehensive documentation of the event-driven and command pattern architecture in tasker-core, covering the unified event system foundation, orchestration and worker implementations, and the flow of tasks and steps through the system.
Overview
The tasker-core system implements a sophisticated hybrid architecture that combines:
- Event-Driven Systems: Real-time coordination using PostgreSQL LISTEN/NOTIFY and PGMQ notifications
- Command Pattern: Async command processors using tokio mpsc channels for orchestration and worker operations
- Hybrid Deployment Modes: PollingOnly, EventDrivenOnly, and Hybrid modes with fallback polling
- Queue-Based Communication: Provider-agnostic message queues (PGMQ or RabbitMQ) for reliable step execution and result processing
This architecture eliminates polling complexity while maintaining resilience through fallback mechanisms and provides horizontal scaling capabilities with atomic operation guarantees.
Event System Foundation
EventDrivenSystem Trait
The foundation of the event architecture is defined in tasker-shared/src/event_system/event_driven.rs with the EventDrivenSystem trait:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait EventDrivenSystem: Send + Sync {
type SystemId: Send + Sync + Clone + fmt::Display + fmt::Debug;
type Event: Send + Sync + Clone + fmt::Debug;
type Config: Send + Sync + Clone;
type Statistics: EventSystemStatistics + Send + Sync + Clone;
// Core lifecycle methods
async fn start(&mut self) -> Result<(), DeploymentModeError>;
async fn stop(&mut self) -> Result<(), DeploymentModeError>;
fn is_running(&self) -> bool;
// Event processing
async fn process_event(&self, event: Self::Event) -> Result<(), DeploymentModeError>;
// Monitoring and health
async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError>;
fn statistics(&self) -> Self::Statistics;
// Configuration
fn deployment_mode(&self) -> DeploymentMode;
fn config(&self) -> &Self::Config;
}
}
Deployment Modes
The system supports three deployment modes for different operational requirements:
PollingOnly Mode
- Traditional polling-based coordination
- No event listeners or real-time notifications
- Reliable fallback for environments with networking restrictions
- Higher latency but guaranteed operation
EventDrivenOnly Mode
- Pure event-driven coordination using PostgreSQL LISTEN/NOTIFY
- Real-time response to database changes
- Lowest latency for step discovery and task coordination
- Requires reliable PostgreSQL connections
Hybrid Mode
- Primary event-driven coordination with polling fallback
- Best of both worlds: real-time when possible, reliable when needed
- Automatic fallback during connection issues
- Production-ready with resilience guarantees
Selecting a Deployment Mode
Tasker is designed for distributed deployment, with multiple orchestration core servers and worker servers running simultaneously. Separating deployment modes lets you scale event-driven-only processing nodes to meet demand while keeping polling-only nodes as a safety net, tuned to a reasonable fallback polling interval and batch size. You can also run everything in hybrid mode and control these settings on a per-instance basis.
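One way to sketch the per-instance fallback behaviour under these modes is a function from mode and listener health to a polling interval. The intervals and names here are illustrative, not Tasker configuration values.

```rust
#[derive(Debug, Clone, Copy)]
enum DeploymentMode {
    PollingOnly,
    EventDrivenOnly,
    Hybrid,
}

/// Fallback polling interval in seconds for this instance, or None when the
/// instance should not poll at all.
fn fallback_poll_interval(mode: DeploymentMode, listener_healthy: bool) -> Option<u64> {
    match mode {
        // Polling is the primary mechanism: poll frequently.
        DeploymentMode::PollingOnly => Some(5),
        // Pure event-driven: no fallback polling by design.
        DeploymentMode::EventDrivenOnly => None,
        // Hybrid with healthy listener: slow safety-net poll for missed events.
        DeploymentMode::Hybrid if listener_healthy => Some(30),
        // Hybrid with listener down: poll aggressively until events recover.
        DeploymentMode::Hybrid => Some(5),
    }
}
```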
Event Types and Sources
Queue-Level Events (Provider-Agnostic)
The system supports multiple messaging backends through MessageNotification:
#![allow(unused)]
fn main() {
pub enum MessageNotification {
/// Signal-only notification (PGMQ style)
/// Indicates a message is available but requires separate fetch
Available {
queue_name: String,
msg_id: Option<i64>,
},
/// Full message notification (RabbitMQ style)
/// Contains the complete message payload
Message(QueuedMessage<Vec<u8>>),
}
}
Event Sources by Provider:
| Provider | Notification Type | Fetch Required | Fallback Polling |
|---|---|---|---|
| PGMQ | Available | Yes (read by msg_id) | Required |
| RabbitMQ | Message | No (full payload) | Not needed |
| InMemory | Message | No | Not needed |
Common Event Types:
- Step Results: Worker completion notifications
- Task Requests: New task initialization requests
- Message Ready Events: Queue message availability notifications
- Transport: Provider-agnostic via MessagingProvider.subscribe_many()
Command Pattern Architecture
Command Processor Pattern
Both orchestration and worker systems implement the command pattern to replace complex polling-based coordinators:
Benefits:
- No Polling Loops (except intentional fallback polling): Pure tokio mpsc command processing
- Simplified Architecture: ~100 lines vs 1000+ lines of complex systems
- Race Condition Prevention: Atomic operations through proper delegation
- Observability Preservation: Maintains metrics through delegated components
Command Flow Patterns
Both systems follow consistent command processing patterns:
sequenceDiagram
participant Client
participant CommandChannel
participant Processor
participant Delegate
participant Response
Client->>CommandChannel: Send Command + ResponseChannel
CommandChannel->>Processor: Receive Command
Processor->>Delegate: Delegate to Business Logic Component
Delegate-->>Processor: Return Result
Processor->>Response: Send Result via ResponseChannel
Response-->>Client: Receive Result
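The request/response flow in the diagram can be sketched with standard-library channels: each command carries its own response channel, so the caller gets a typed result without sharing state with the processor. The real system uses tokio mpsc and oneshot channels; `Command` and `run_processor` here are simplified stand-ins.

```rust
use std::sync::mpsc;
use std::thread;

// Each command variant embeds a dedicated response channel.
enum Command {
    HealthCheck { resp: mpsc::Sender<bool> },
}

/// Command processor loop: receive, delegate, reply on the response channel.
fn run_processor(rx: mpsc::Receiver<Command>) {
    for cmd in rx {
        match cmd {
            Command::HealthCheck { resp } => {
                // Delegate to business logic, then send the result back.
                let _ = resp.send(true);
            }
        }
    }
}
```

The loop exits cleanly when all command senders are dropped, which is the shutdown path in this sketch.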
Orchestration Event Systems
OrchestrationEventSystem
Implemented in tasker-orchestration/src/orchestration/event_systems/orchestration_event_system.rs:
#![allow(unused)]
fn main() {
pub struct OrchestrationEventSystem {
system_id: String,
deployment_mode: DeploymentMode,
queue_listener: Option<OrchestrationQueueListener>,
fallback_poller: Option<OrchestrationFallbackPoller>,
context: Arc<SystemContext>,
orchestration_core: Arc<OrchestrationCore>,
command_sender: mpsc::Sender<OrchestrationCommand>,
// ... statistics and state
}
}
Orchestration Command Types
The command processor handles both full-message and signal-only notification types:
#![allow(unused)]
fn main() {
pub enum OrchestrationCommand {
// Task lifecycle
InitializeTask { request: TaskRequestMessage, resp: CommandResponder<TaskInitializeResult> },
ProcessStepResult { result: StepExecutionResult, resp: CommandResponder<StepProcessResult> },
FinalizeTask { task_uuid: Uuid, resp: CommandResponder<TaskFinalizationResult> },
// Full message processing (RabbitMQ style - MessageNotification::Message)
// Used when provider delivers complete message payload
ProcessStepResultFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<StepProcessResult> },
InitializeTaskFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<TaskInitializeResult> },
FinalizeTaskFromMessage { queue_name: String, message: QueuedMessage, resp: CommandResponder<TaskFinalizationResult> },
// Signal-only processing (PGMQ style - MessageNotification::Available)
// Used when provider sends notification that requires separate fetch
ProcessStepResultFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<StepProcessResult> },
InitializeTaskFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<TaskInitializeResult> },
FinalizeTaskFromMessageEvent { message_event: MessageReadyEvent, resp: CommandResponder<TaskFinalizationResult> },
// Task readiness (database events)
ProcessTaskReadiness { task_uuid: Uuid, namespace: String, priority: i32, ready_steps: i32, triggered_by: String, resp: CommandResponder<TaskReadinessResult> },
// System operations
GetProcessingStats { resp: CommandResponder<OrchestrationProcessingStats> },
HealthCheck { resp: CommandResponder<SystemHealth> },
Shutdown { resp: CommandResponder<()> },
}
}
Command Routing by Notification Type:
- MessageNotification::Message → *FromMessage commands (immediate processing)
- MessageNotification::Available → *FromMessageEvent commands (requires fetch)
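This routing rule can be sketched as a match over a simplified notification enum. The types below are stand-ins for the real MessageNotification and command enums, reduced to the routing decision itself.

```rust
// Simplified stand-in for the provider-agnostic notification type.
enum MessageNotification {
    // PGMQ-style signal: a message is available but must be fetched by msg_id.
    Available { queue_name: String, msg_id: Option<i64> },
    // RabbitMQ-style notification carrying the full payload.
    Message(Vec<u8>),
}

/// Which command family a notification routes to.
fn route(notification: &MessageNotification) -> &'static str {
    match notification {
        // Full payload present: process immediately.
        MessageNotification::Message(_) => "FromMessage",
        // Signal only: a separate fetch is required before processing.
        MessageNotification::Available { .. } => "FromMessageEvent",
    }
}
```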
Orchestration Queue Architecture
The orchestration system coordinates multiple queue types:
- orchestration_step_results: Step completion results from workers
- orchestration_task_requests: New task initialization requests
- orchestration_task_finalization: Task finalization notifications
- Namespace Queues: Per-namespace step queues (e.g., fulfillment_queue, inventory_queue)
TaskReadinessEventSystem
Handles database-level events for task readiness using PostgreSQL LISTEN/NOTIFY:
#![allow(unused)]
fn main() {
pub struct TaskReadinessEventSystem {
system_id: String,
deployment_mode: DeploymentMode,
listener: Option<TaskReadinessListener>,
fallback_poller: Option<TaskReadinessFallbackPoller>,
context: Arc<SystemContext>,
command_sender: mpsc::Sender<OrchestrationCommand>,
// ... configuration and statistics
}
}
PGMQ Notification Channels:
- pgmq_message_ready.orchestration: Orchestration queue messages ready (task requests, step results, finalizations)
- pgmq_message_ready.{namespace}: Worker namespace queue messages ready (e.g., payments, fulfillment, linear_workflow)
- pgmq_message_ready: Global channel for all queue messages (fallback)
- pgmq_queue_created: Queue creation notifications
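A sketch of how a listener might derive these channel names, with the global channel as fallback when no namespace is given. `message_ready_channel` is a hypothetical helper for illustration, not Tasker's API.

```rust
/// Build the LISTEN channel name for a namespace, or the global fallback.
fn message_ready_channel(namespace: Option<&str>) -> String {
    match namespace {
        // Per-namespace channel, e.g. "pgmq_message_ready.payments".
        Some(ns) => format!("pgmq_message_ready.{ns}"),
        // Global fallback channel covering all queues.
        None => "pgmq_message_ready".to_string(),
    }
}
```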
Unified Event Coordination
The UnifiedEventCoordinator demonstrates coordinated management of multiple event systems:
#![allow(unused)]
fn main() {
pub struct UnifiedEventCoordinator {
orchestration_system: OrchestrationEventSystem,
task_readiness_fallback: FallbackPoller,
deployment_mode: DeploymentMode,
health_monitor: EventSystemHealthMonitor,
// ... coordination logic
}
}
Coordination Features:
- Shared Command Channel: Both systems send commands to same orchestration processor
- Health Monitoring: Unified health checking across all event systems
- Deployment Mode Management: Synchronized mode changes
- Statistics Aggregation: Combined metrics from all systems
Worker Event Systems
WorkerEventSystem
Implemented in tasker-worker/src/worker/event_systems/worker_event_system.rs:
#![allow(unused)]
fn main() {
pub struct WorkerEventSystem {
system_id: String,
deployment_mode: DeploymentMode,
queue_listeners: HashMap<String, WorkerQueueListener>,
fallback_pollers: HashMap<String, WorkerFallbackPoller>,
context: Arc<SystemContext>,
command_sender: mpsc::Sender<WorkerCommand>,
// ... statistics and configuration
}
}
Worker Command Types
#![allow(unused)]
fn main() {
pub enum WorkerCommand {
// Step execution
ExecuteStep { message: PgmqMessage<SimpleStepMessage>, queue_name: String, resp: CommandResponder<()> },
ExecuteStepWithCorrelation { message: PgmqMessage<SimpleStepMessage>, queue_name: String, correlation_id: Uuid, resp: CommandResponder<()> },
// Result processing
SendStepResult { result: StepExecutionResult, resp: CommandResponder<()> },
ProcessStepCompletion { step_result: StepExecutionResult, correlation_id: Option<Uuid>, resp: CommandResponder<()> },
// Event integration
ExecuteStepFromMessage { queue_name: String, message: PgmqMessage, resp: CommandResponder<()> },
ExecuteStepFromEvent { message_event: MessageReadyEvent, resp: CommandResponder<()> },
// System operations
GetWorkerStatus { resp: CommandResponder<WorkerStatus> },
SetEventIntegration { enabled: bool, resp: CommandResponder<()> },
GetEventStatus { resp: CommandResponder<EventIntegrationStatus> },
RefreshTemplateCache { namespace: Option<String>, resp: CommandResponder<()> },
HealthCheck { resp: CommandResponder<WorkerHealthStatus> },
Shutdown { resp: CommandResponder<()> },
}
}
Worker Queue Architecture
Workers monitor namespace-specific queues for step execution. These custom namespace queues are dynamically configured per deployment.
Example queues:
- fulfillment_queue: All fulfillment namespace steps
- inventory_queue: All inventory namespace steps
- notifications_queue: All notification namespace steps
- payment_queue: All payment processing steps
Event Flow and System Interactions
Complete Task Execution Flow
sequenceDiagram
participant Client
participant Orchestration
participant TaskDB
participant StepQueue
participant Worker
participant ResultQueue
%% Task Initialization
Client->>Orchestration: TaskRequestMessage (via pgmq_send_with_notify)
Orchestration->>TaskDB: Create Task + Steps
%% Step Discovery and Enqueueing (Event-Driven or Fallback Polling)
Orchestration->>StepQueue: pgmq_send_with_notify(ready steps)
StepQueue-->>Worker: pg_notify('pgmq_message_ready.{namespace}')
%% Step Execution
Worker->>StepQueue: pgmq.read() to claim step
Worker->>Worker: Execute Step Handler
Worker->>ResultQueue: pgmq_send_with_notify(StepExecutionResult)
ResultQueue-->>Orchestration: pg_notify('pgmq_message_ready.orchestration')
%% Result Processing
Orchestration->>Orchestration: ProcessStepResult Command
Orchestration->>TaskDB: Update Step State
Note over Orchestration: Fallback poller discovers ready steps if events missed
%% Task Completion
Note over Orchestration: All Steps Complete
Orchestration->>Orchestration: FinalizeTask Command
Orchestration->>TaskDB: Mark Task Complete
Orchestration-->>Client: Task Completed
Event-Driven Step Discovery
sequenceDiagram
participant Worker
participant PostgreSQL
participant PgmqNotify
participant OrchestrationListener
participant StepEnqueuer
Worker->>PostgreSQL: pgmq_send_with_notify('orchestration_step_results', result)
PostgreSQL->>PostgreSQL: Atomic: pgmq.send() + pg_notify()
PostgreSQL->>PgmqNotify: NOTIFY 'pgmq_message_ready.orchestration'
PgmqNotify->>OrchestrationListener: MessageReadyEvent
OrchestrationListener->>StepEnqueuer: ProcessStepResult Command
StepEnqueuer->>PostgreSQL: Query ready steps, enqueue via pgmq_send_with_notify()
Hybrid Mode Operation
stateDiagram-v2
[*] --> EventDriven
EventDriven --> Processing : Event Received
Processing --> EventDriven : Success
Processing --> PollingFallback : Event Failed
PollingFallback --> FallbackPolling : Start Polling
FallbackPolling --> EventDriven : Connection Restored
FallbackPolling --> Processing : Poll Found Work
EventDriven --> HealthCheck : Periodic Check
HealthCheck --> EventDriven : Healthy
HealthCheck --> PollingFallback : Event Issues Detected
Queue Architecture and Message Flow
PGMQ Integration
The system uses PostgreSQL Message Queue (PGMQ) for reliable message delivery:
Queue Types and Purposes
| Queue Name | Purpose | Message Type | Processing System |
|---|---|---|---|
| orchestration_step_results | Step completion results | StepExecutionResult | Orchestration |
| orchestration_task_requests | New task requests | TaskRequestMessage | Orchestration |
| orchestration_task_finalization | Task finalization | TaskFinalizationMessage | Orchestration |
| {namespace}_queue | Namespace-specific steps | SimpleStepMessage | Workers |
Message Processing Patterns
Event-Driven Processing:
- Message arrives in PGMQ queue
- PostgreSQL triggers pg_notify with a MessageReadyEvent
- Event system receives notification
- System processes message via command pattern
- Message deleted after successful processing
Polling-Based Processing (Fallback):
- Periodic queue polling (configurable interval)
- Fetch available messages in batches
- Process messages via command pattern
- Delete processed messages
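The polling fallback steps above can be sketched against an in-memory stand-in for a queue: fetch a batch, process each message, and remove it on success. `poll_once` is illustrative; in the real system PGMQ supplies the read and delete operations.

```rust
use std::collections::VecDeque;

/// One polling pass: drain up to `batch_size` messages from the queue,
/// recording each processed message. Returns how many were handled.
fn poll_once(queue: &mut VecDeque<String>, batch_size: usize, processed: &mut Vec<String>) -> usize {
    let mut count = 0;
    // In PGMQ this would be a batched read with a visibility timeout.
    while count < batch_size {
        match queue.pop_front() {
            Some(msg) => {
                // Process, then "delete" (removal from the deque stands in for it).
                processed.push(msg);
                count += 1;
            }
            None => break, // queue drained for this pass
        }
    }
    count
}
```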
Circuit Breaker Integration
All PGMQ operations are protected by circuit breakers:
#![allow(unused)]
fn main() {
pub struct UnifiedPgmqClient {
standard_client: Box<dyn PgmqClientTrait + Send + Sync>,
protected_client: Option<ProtectedPgmqClient>,
circuit_breaker_enabled: bool,
}
}
Circuit Breaker Features:
- Automatic Protection: Failure detection and circuit opening
- Configurable Thresholds: Error rate and timeout configuration
- Seamless Fallback: Automatic switching between standard and protected clients
- Recovery Detection: Automatic circuit closing when service recovers
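A minimal circuit-breaker sketch showing the open/close behaviour described above: consecutive failures past a threshold open the circuit, and a success closes it again. The threshold semantics and names are illustrative, not Tasker's configuration.

```rust
#[derive(Debug, PartialEq)]
enum CircuitState {
    Closed, // traffic flows normally
    Open,   // requests are rejected while the dependency recovers
}

struct CircuitBreaker {
    state: CircuitState,
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32) -> Self {
        Self { state: CircuitState::Closed, consecutive_failures: 0, failure_threshold }
    }

    /// Recovery detection: a success resets the failure count and closes the circuit.
    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.state = CircuitState::Closed;
    }

    /// Failure detection: open the circuit once the threshold is reached.
    fn record_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            self.state = CircuitState::Open;
        }
    }

    fn allow_request(&self) -> bool {
        self.state == CircuitState::Closed
    }
}
```

Real implementations usually add a half-open probe state and time-based recovery; this sketch keeps only the core open/close transition.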
Statistics and Monitoring
Event System Statistics
Both orchestration and worker event systems implement comprehensive statistics:
#![allow(unused)]
fn main() {
pub trait EventSystemStatistics {
fn events_processed(&self) -> u64;
fn events_failed(&self) -> u64;
fn processing_rate(&self) -> f64; // events/second
fn average_latency_ms(&self) -> f64;
fn deployment_mode_score(&self) -> f64; // 0.0-1.0 effectiveness
fn success_rate(&self) -> f64; // derived: processed/(processed+failed)
}
}
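The derived metrics in the trait above can be made concrete with a small counter struct: success rate from processed vs failed counts, processing rate from an elapsed wall-clock window. `EventStats` is a hypothetical sketch, not the trait's real implementor.

```rust
struct EventStats {
    processed: u64,
    failed: u64,
    window_secs: f64, // elapsed measurement window in seconds
}

impl EventStats {
    /// processed / (processed + failed); 1.0 when no events have been seen yet.
    fn success_rate(&self) -> f64 {
        let total = self.processed + self.failed;
        if total == 0 {
            return 1.0;
        }
        self.processed as f64 / total as f64
    }

    /// Events per second over the measurement window.
    fn processing_rate(&self) -> f64 {
        if self.window_secs <= 0.0 {
            return 0.0;
        }
        self.processed as f64 / self.window_secs
    }
}
```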
Health Monitoring
Deployment Mode Health Status
#![allow(unused)]
fn main() {
pub enum DeploymentModeHealthStatus {
Healthy, // All systems operational
Degraded { reason: String },  // Some issues but functional
Unhealthy { reason: String }, // Significant issues
Critical { reason: String },  // System failure imminent
}
}
Health Check Integration
- Event System Health: Connection status, processing latency, error rates
- Command Processor Health: Queue backlog, processing timeout detection
- Database Health: Connection pool status, query performance
- Circuit Breaker Status: Circuit state, failure rates, recovery status
Metrics Collection
Key metrics collected across the system:
Orchestration Metrics
- Task Initialization Rate: Tasks/minute initialized
- Step Enqueueing Rate: Steps/minute enqueued to worker queues
- Result Processing Rate: Results/minute processed from workers
- Task Completion Rate: Tasks/minute completed successfully
- Error Rates: Failures by operation type and cause
Worker Metrics
- Step Execution Rate: Steps/minute executed
- Handler Performance: Execution time by handler type
- Queue Processing: Messages claimed/processed by queue
- Result Submission Rate: Results/minute sent to orchestration
- FFI Integration: Event correlation and handler communication stats
Error Handling and Resilience
Error Categories
The system handles multiple error categories with appropriate strategies:
Transient Errors
- Database Connection Issues: Circuit breaker protection + retry with exponential backoff
- Queue Processing Failures: Message retry with backoff, poison message detection
- Network Interruptions: Automatic fallback to polling mode
Permanent Errors
- Invalid Message Format: Dead letter queue for manual analysis
- Handler Execution Failures: Step failure state with retry limits
- Configuration Errors: System startup prevention with clear error messages
System Errors
- Resource Exhaustion: Graceful degradation and load shedding
- Component Crashes: Automatic restart with state recovery
- Data Corruption: Transaction rollback and consistency validation
Fallback Mechanisms
Event System Fallbacks
- Event-Driven -> Polling: Automatic fallback when event connection fails
- Real-time -> Batch: Switch to batch processing during high load
- Primary -> Secondary: Database failover support for high availability
Command Processing Fallbacks
- Async -> Sync: Degraded operation for critical operations
- Distributed -> Local: Local processing when coordination fails
- Optimistic -> Pessimistic: Conservative processing during uncertainty
Configuration Management
Event System Configuration
Event systems are configured via TOML with environment overrides:
# config/tasker/base/event_systems.toml
[orchestration_event_system]
system_id = "orchestration-events"
deployment_mode = "Hybrid"
health_monitoring_enabled = true
health_check_interval = "30s"
max_concurrent_processors = 10
processing_timeout = "100ms"
# PGMQ channels are handled by listeners, not direct postgres channels
supported_namespaces = ["orchestration"]
[orchestration_event_system.queue_listener]
enabled = true
batch_size = 50
poll_interval = "1s"
connection_timeout = "5s"
[orchestration_event_system.fallback_poller]
enabled = true
poll_interval = "5s"
batch_size = 20
max_retry_attempts = 3
[task_readiness]
enabled = true
polling_interval_seconds = 30
Runtime Configuration Changes
Certain configuration changes can be applied at runtime:
- Deployment Mode Switching: EventDrivenOnly <-> Hybrid <-> PollingOnly
- Event Integration Toggle: Enable/disable event processing
- Health Check Intervals: Adjust monitoring frequency
- Circuit Breaker Thresholds: Modify failure detection sensitivity
Integration Points
State Machine Integration
Event systems integrate tightly with the state machines documented in states-and-lifecycles.md:
- Task State Changes: Event systems react to task transitions
- Step State Changes: Step completion triggers task readiness checks
- Event Generation: State transitions generate events for system coordination
- Atomic Operations: Event processing maintains state machine consistency
Database Integration
Event systems coordinate with PostgreSQL at multiple levels:
- LISTEN/NOTIFY: Real-time notifications for database changes
- PGMQ Integration: Reliable message queues built on PostgreSQL
- Transaction Coordination: Event processing within database transactions
- SQL Functions: Database functions generate events and notifications
External System Integration
The event architecture supports integration with external systems:
- Webhook Events: HTTP callbacks for external system notifications
- Message Bus Integration: Apache Kafka, RabbitMQ, etc. for enterprise messaging
- Monitoring Integration: Prometheus, DataDog, etc. for metrics export
- API Integration: REST and GraphQL APIs for external coordination
Actor Integration
Overview
The tasker-core system implements a lightweight Actor pattern that formalizes the relationship between Commands and Lifecycle Components. This architecture provides a consistent, type-safe foundation for orchestration component management with all lifecycle operations coordinated through actors.
Status: Complete (Phases 1-7) - Production ready
For comprehensive actor documentation, see Actor-Based Architecture.
Actor Pattern Basics
The actor pattern introduces three core traits:
- OrchestrationActor: Base trait for all actors with lifecycle hooks
- Handler<M>: Message handling trait for type-safe command processing
- Message: Marker trait for command messages
#![allow(unused)]
fn main() {
// Actor definition
pub struct TaskFinalizerActor {
context: Arc<SystemContext>,
service: TaskFinalizer,
}
// Message definition
pub struct FinalizeTaskMessage {
pub task_uuid: Uuid,
}
impl Message for FinalizeTaskMessage {
type Response = FinalizationResult;
}
// Message handler
#[async_trait]
impl Handler<FinalizeTaskMessage> for TaskFinalizerActor {
type Response = FinalizationResult;
async fn handle(&self, msg: FinalizeTaskMessage) -> TaskerResult<Self::Response> {
self.service.finalize_task(msg.task_uuid).await
.map_err(|e| e.into())
}
}
}
Integration with Command Processor
The actor pattern integrates seamlessly with the command processor through direct actor calls:
#![allow(unused)]
fn main() {
// From: tasker-orchestration/src/orchestration/command_processor.rs
async fn handle_finalize_task(&self, task_uuid: Uuid) -> TaskerResult<TaskFinalizationResult> {
// Direct actor-based task finalization
let msg = FinalizeTaskMessage { task_uuid };
let result = self.actors.task_finalizer_actor.handle(msg).await?;
Ok(TaskFinalizationResult::Success {
task_uuid: result.task_uuid,
final_status: format!("{:?}", result.action),
completion_time: Some(chrono::Utc::now()),
})
}
async fn handle_process_step_result(
&self,
step_result: StepExecutionResult,
) -> TaskerResult<StepProcessResult> {
// Direct actor-based step result processing
let msg = ProcessStepResultMessage {
result: step_result.clone(),
};
match self.actors.result_processor_actor.handle(msg).await {
Ok(()) => Ok(StepProcessResult::Success {
message: format!(
"Step {} result processed successfully",
step_result.step_uuid
),
}),
Err(e) => Ok(StepProcessResult::Error {
message: format!("Failed to process step result: {e}"),
}),
}
}
}
Event → Command → Actor Flow
The complete event-to-actor flow:
┌──────────────┐
│ PGMQ Message │ Message arrives in queue
└──────┬───────┘
│
▼
┌──────────────────┐
│ Event Listener │ EventDrivenSystem processes notification
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Command Channel │ Send command to processor via tokio::mpsc
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Command Processor│ Convert command to actor message
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Actor Registry │ Route message to appropriate actor
└──────┬───────────┘
│
▼
┌──────────────────┐
│ `Handler<M>`:: │ Actor processes message
│ handle() │ Delegates to underlying service
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Response │ Return result to command processor
└──────────────────┘
ActorRegistry and Lifecycle
The ActorRegistry manages all 4 orchestration actors and integrates with the system lifecycle:
#![allow(unused)]
fn main() {
// During system startup
let context = Arc::new(SystemContext::with_pool(pool).await?);
let actors = ActorRegistry::build(context).await?; // Calls started() on all actors
// During operation
let msg = FinalizeTaskMessage { task_uuid };
let result = actors.task_finalizer_actor.handle(msg).await?;
// During shutdown
actors.shutdown().await; // Calls stopped() on all actors in reverse order
}
Current Actors:
- TaskRequestActor: Handles task initialization requests
- ResultProcessorActor: Processes step execution results
- StepEnqueuerActor: Manages batch processing of ready tasks
- TaskFinalizerActor: Handles task finalization with atomic claiming
Benefits for Event-Driven Architecture
The actor pattern enhances the event-driven architecture by providing:
- Type Safety: Compile-time verification of message contracts
- Consistency: Uniform lifecycle management across all components
- Testability: Clear message boundaries for isolated testing
- Observability: Actor-level metrics and tracing
- Evolvability: Easy to add new message handlers and actors
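The type-safety and testability benefits fall out of the trait design itself. The following is a minimal, synchronous sketch of the Message/Handler contract — the names mirror the traits above, but the actual tasker-core versions are async (via async-trait) and return TaskerResult, so treat this as an illustration of the shape, not the real API:

```rust
// A minimal, synchronous sketch of the Message/Handler contract.
trait Message {
    type Response;
}

trait Handler<M: Message> {
    fn handle(&self, msg: M) -> M::Response;
}

// A toy message and actor, purely for illustration.
struct Ping;

impl Message for Ping {
    type Response = &'static str;
}

struct EchoActor;

impl Handler<Ping> for EchoActor {
    fn handle(&self, _msg: Ping) -> &'static str {
        "pong"
    }
}

fn main() {
    // Type-safe dispatch: the compiler verifies the message/response
    // contract, so a handler cannot return the wrong response type.
    let actor = EchoActor;
    assert_eq!(actor.handle(Ping), "pong");
    println!("handled");
}
```

Because each message names its response type, tests can exercise an actor in isolation by constructing a message and asserting on the typed response — no queue or database required.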
Implementation Status
The actor integration is complete:
- Phase 1 ✅: Actor infrastructure and test harness
  - OrchestrationActor, Handler<M>, Message traits
  - ActorRegistry structure
- Phase 2-3 ✅: All 4 primary actors implemented
  - TaskRequestActor, ResultProcessorActor
  - StepEnqueuerActor, TaskFinalizerActor
- Phase 4-6 ✅: Message hydration and module reorganization
  - Hydration layer for PGMQ messages
  - Clean module organization
- Phase 7 ✅: Service decomposition
  - Large services decomposed into focused components
  - All files <300 lines following single responsibility principle
- Cleanup ✅: Direct actor integration
  - Command processor calls actors directly
  - Removed intermediate wrapper layers
  - Production-ready implementation
Service Decomposition
Large services (800-900 lines) were decomposed into focused components:
TaskFinalizer (848 → 6 files):
- service.rs: Main TaskFinalizer (~200 lines)
- completion_handler.rs: Task completion logic
- event_publisher.rs: Lifecycle event publishing
- execution_context_provider.rs: Context fetching
- state_handlers.rs: State-specific handling
StepEnqueuerService (781 → 3 files):
- service.rs: Main service (~250 lines)
- batch_processor.rs: Batch processing logic
- state_handlers.rs: State-specific processing
ResultProcessor (889 → 4 files):
- service.rs: Main processor
- metadata_processor.rs: Metadata handling
- error_handler.rs: Error processing
- result_validator.rs: Result validation
Enhanced with the actor pattern, this event and command architecture provides the foundation for scalable, reliable, and maintainable workflow orchestration in tasker-core, while remaining flexible enough to operate in diverse deployment environments.
Idempotency and Atomicity Guarantees
Last Updated: 2025-01-19 · Audience: Architects, Developers · Status: Active · Related Docs: Documentation Hub | States and Lifecycles | Events and Commands | Task Readiness & Execution
← Back to Documentation Hub
Overview
Tasker Core is designed for distributed orchestration with multiple orchestrator instances processing tasks concurrently. This document explains the defense-in-depth approach that ensures safe concurrent operation without race conditions, data corruption, or lost work.
The system provides idempotency and atomicity through four overlapping protection layers:
- Database Atomicity: PostgreSQL constraints, row locking, and compare-and-swap operations
- State Machine Guards: Current-state validation before all transitions
- Transaction Boundaries: All-or-nothing semantics for complex operations
- Application Logic: State-based filtering and idempotent patterns
These layers work together to ensure that operations can be safely retried, multiple orchestrators can process work concurrently, and crashes don’t leave the system in an inconsistent state.
Core Protection Mechanisms
Layer 1: Database Atomicity
PostgreSQL provides fundamental atomic guarantees through several mechanisms:
Unique Constraints
Purpose: Prevent duplicate creation of entities
Key Constraints:
- tasker.tasks.identity_hash (UNIQUE) - Prevents duplicate task creation from identical requests
- tasker.task_namespaces.name (UNIQUE) - Namespace name uniqueness
- tasker.named_tasks (namespace_id, name, version) (UNIQUE) - Task template uniqueness
- tasker.named_steps.system_name (UNIQUE) - Step handler uniqueness
Example Protection:
#![allow(unused)]
fn main() {
// Two orchestrators receive identical TaskRequestMessage
// Orchestrator A creates task first -> commits successfully
// Orchestrator B attempts to create -> unique constraint violation
// Result: Exactly one task created, error cleanly handled
}
See Task Initialization for details on how this protects task creation.
Row-Level Locking
Purpose: Prevent concurrent modifications to the same database row
Locking Patterns:
- FOR UPDATE - Exclusive lock, blocks concurrent transactions
  -- Used in: transition_task_state_atomic()
  SELECT * FROM tasker.tasks
  WHERE task_uuid = $1
  FOR UPDATE;
  -- Blocks until transaction commits or rolls back
- FOR UPDATE SKIP LOCKED - Lock-free work distribution
  -- Used in: get_next_ready_tasks()
  SELECT * FROM tasker.tasks
  WHERE state = ANY($1)
  FOR UPDATE SKIP LOCKED
  LIMIT $2;
  -- Each orchestrator gets different tasks, no blocking
Example Protection:
#![allow(unused)]
fn main() {
// Scenario: Two orchestrators attempt state transition on same task
// Orchestrator A: BEGIN; SELECT FOR UPDATE; UPDATE state; COMMIT;
// Orchestrator B: BEGIN; SELECT FOR UPDATE (BLOCKS until A commits)
// UPDATE fails due to state validation
// Result: Only one transition succeeds, no race condition
}
Compare-and-Swap Semantics
Purpose: Validate expected state before making changes
Pattern: All state transitions validate current state in the same transaction as the update
-- From transition_task_state_atomic()
UPDATE tasker.tasks
SET state = $new_state, updated_at = NOW()
WHERE task_uuid = $uuid
AND state = $expected_current_state -- Critical: CAS validation
RETURNING *;
Example Protection:
#![allow(unused)]
fn main() {
// Orchestrator A and B both think task is in "Pending" state
// A transitions: WHERE state = 'Pending' -> succeeds, now "Initializing"
// B transitions: WHERE state = 'Pending' -> returns 0 rows (fails gracefully)
// Result: Atomic transition, no invalid state
}
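The same compare-and-swap shape can be demonstrated in miniature with an atomic in process memory. This sketch is an analogy only — the real validation happens in SQL via `WHERE state = $expected_current_state` — and the state constants and helper name are illustrative:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// States encoded as u8 purely for this sketch.
const PENDING: u8 = 0;
const INITIALIZING: u8 = 1;

// Attempt a transition only if the current state matches `from`,
// mirroring the WHERE-clause validation in the SQL function.
fn try_transition(state: &AtomicU8, from: u8, to: u8) -> bool {
    state
        .compare_exchange(from, to, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
}

fn main() {
    let task_state = AtomicU8::new(PENDING);

    // Orchestrator A expects Pending -> wins the transition.
    assert!(try_transition(&task_state, PENDING, INITIALIZING));
    // Orchestrator B also expects Pending -> fails gracefully
    // (the analog of the UPDATE returning 0 rows).
    assert!(!try_transition(&task_state, PENDING, INITIALIZING));

    assert_eq!(task_state.load(Ordering::SeqCst), INITIALIZING);
    println!("exactly one transition succeeded");
}
```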
See SQL Function Architecture for more on database-level guarantees.
Layer 2: State Machine Guards
Purpose: Enforce valid state transitions through application-level validation
Both task and step state machines validate current state before allowing transitions. This provides protection even when database constraints alone wouldn’t catch invalid operations.
Task State Machine
Defined in tasker-shared/src/state_machine/task_state_machine.rs, the TaskStateMachine validates:
- Current state retrieval: Always fetch latest state from database
- Event applicability: Check if event is valid for current state
- Terminal state protection: Cannot transition from Complete/Error/Cancelled
- Ownership tracking: Processor UUID tracked for audit (not enforced after ownership removal)
Example Protection:
#![allow(unused)]
fn main() {
// TaskStateMachine prevents invalid transitions
let mut state_machine = TaskStateMachine::new(task, context);
// Attempt to mark complete when still processing
let result = state_machine.transition(TaskEvent::MarkComplete).await;
// Result: Error - cannot mark complete while steps are in progress
// Current state validation prevents:
// - Completing tasks with pending steps
// - Re-initializing completed tasks
// - Transitioning from terminal states
}
See States and Lifecycles for complete state machine documentation.
Workflow Step State Machine
Defined in tasker-shared/src/state_machine/step_state_machine.rs, the StepStateMachine ensures:
- Execution claiming: Only Pending/Enqueued steps can transition to InProgress
- Completion validation: Only InProgress steps can be marked complete
- Retry eligibility: Validates max_attempts and backoff timing
Example Protection:
#![allow(unused)]
fn main() {
// Worker attempts to claim already-processing step
let mut step_machine = StepStateMachine::new(step.into(), context);
match step_machine.current_state().await {
WorkflowStepState::InProgress => {
// Already being processed by another worker
return Ok(false); // Cannot claim
}
WorkflowStepState::Pending | WorkflowStepState::Enqueued => {
// Attempt atomic transition
step_machine.transition(StepEvent::Start).await?;
}
}
}
This prevents:
- Multiple workers executing the same step concurrently
- Marking steps complete that weren’t started
- Retrying steps that exceeded max_attempts
Layer 3: Transaction Boundaries
Purpose: Ensure all-or-nothing semantics for multi-step operations
Critical operations wrap multiple database changes in a single transaction, ensuring atomic completion or full rollback on failure.
Task Initialization Transaction
Task creation involves multiple dependent entities that must all succeed or all fail:
#![allow(unused)]
fn main() {
// From TaskInitializer.initialize_task()
let mut tx = pool.begin().await?;
// 1. Create or find namespace (find-or-create is idempotent)
let namespace = NamespaceResolver::resolve_namespace(&mut tx, namespace_name).await?;
// 2. Create or find named task
let named_task = NamespaceResolver::resolve_named_task(&mut tx, namespace, task_name).await?;
// 3. Create task record
let task = create_task(&mut tx, named_task.uuid, context).await?;
// 4. Create all workflow steps and edges
let (step_count, step_mapping) = WorkflowStepBuilder::create_workflow_steps(
&mut tx, task.uuid, template
).await?;
// 5. Initialize state machine
StateInitializer::initialize_task_state(&mut tx, task.uuid).await?;
// ALL OR NOTHING: Commit entire transaction
tx.commit().await?;
}
Example Protection:
#![allow(unused)]
fn main() {
// Scenario: Task creation partially fails
// - Namespace created ✓
// - Named task created ✓
// - Task record created ✓
// - Workflow steps: Cycle detected ✗ (error thrown)
// Result: tx.rollback() -> ALL changes reverted, clean failure
}
Cycle Detection Enforcement
Workflow dependencies are validated during task initialization to prevent circular references:
#![allow(unused)]
fn main() {
// From WorkflowStepBuilder::create_step_dependencies()
for dependency in &step_definition.dependencies {
let from_uuid = step_mapping[dependency];
let to_uuid = step_mapping[&step_definition.name];
// Check for self-reference
if from_uuid == to_uuid {
return Err(CycleDetected { from, to });
}
// Check for path that would create cycle
if WorkflowStepEdge::would_create_cycle(pool, from_uuid, to_uuid).await? {
return Err(CycleDetected { from, to });
}
// Safe to create edge
WorkflowStepEdge::create_with_transaction(&mut tx, edge).await?;
}
}
This prevents invalid DAG structures from ever being persisted to the database.
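The core check — adding edge `from -> to` creates a cycle iff a path already leads from `to` back to `from` — can be sketched as a depth-first search over an in-memory edge list. This is a simplified stand-in for the recursive-CTE implementation in `would_create_cycle`; the function signature here is illustrative:

```rust
use std::collections::{HashMap, HashSet};

// Adding edge `from -> to` creates a cycle iff a path already
// leads from `to` back to `from` (or the edge is a self-reference).
fn would_create_cycle(edges: &HashMap<&str, Vec<&str>>, from: &str, to: &str) -> bool {
    if from == to {
        return true; // self-reference check
    }
    // Depth-first search starting at `to`, looking for `from`.
    let mut stack = vec![to];
    let mut seen = HashSet::new();
    while let Some(node) = stack.pop() {
        if node == from {
            return true;
        }
        if seen.insert(node) {
            if let Some(nexts) = edges.get(node) {
                stack.extend(nexts.iter().copied());
            }
        }
    }
    false
}

fn main() {
    let mut edges: HashMap<&str, Vec<&str>> = HashMap::new();
    edges.insert("A", vec!["B"]); // A -> B
    edges.insert("B", vec!["C"]); // B -> C

    // C -> A would close the loop A -> B -> C -> A: rejected.
    assert!(would_create_cycle(&edges, "C", "A"));
    // A -> C is safe: no path from C back to A exists yet.
    assert!(!would_create_cycle(&edges, "A", "C"));
    println!("cycle check ok");
}
```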
Layer 4: Application Logic Patterns
Purpose: Implement idempotent patterns at the application level
Beyond database and state machine protections, application code uses several patterns to ensure safe retry and concurrent operation.
Find-or-Create Pattern
Used for entities that should be unique but may be created by multiple orchestrators:
#![allow(unused)]
fn main() {
// From NamespaceResolver
pub async fn resolve_namespace(
tx: &mut Transaction<'_, Postgres>,
name: &str,
) -> Result<TaskNamespace> {
// Try to find existing
if let Some(namespace) = TaskNamespace::find_by_name(pool, name).await? {
return Ok(namespace);
}
// Create if not found
match TaskNamespace::create_with_transaction(tx, NewTaskNamespace { name }).await {
Ok(namespace) => Ok(namespace),
Err(sqlx::Error::Database(e)) if is_unique_violation(&e) => {
// Another orchestrator created it between our find and create
// Re-query to get the one that won the race
TaskNamespace::find_by_name(pool, name).await?
.ok_or(Error::NotFound)
}
Err(e) => Err(e),
}
}
}
Why This Works:
- First attempt: Finds existing → idempotent
- Create attempt: Unique constraint prevents duplicates
- Retry after unique violation: Gets the winner → idempotent
- Result: Exactly one namespace, regardless of concurrent attempts
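The "exactly one winner" outcome can be illustrated with an in-memory stand-in, where a map's entry API plays the role of the unique constraint. The `resolve_namespace` helper and integer ids here are hypothetical, not the real resolver:

```rust
use std::collections::HashMap;

// In-memory stand-in for find-or-create: the entry API yields the same
// "exactly one winner" outcome the database unique constraint enforces.
fn resolve_namespace(namespaces: &mut HashMap<String, u32>, name: &str, candidate_id: u32) -> u32 {
    // Find existing, or create if absent; retrying is harmless.
    *namespaces.entry(name.to_string()).or_insert(candidate_id)
}

fn main() {
    let mut namespaces = HashMap::new();

    // First resolver creates the namespace...
    let first = resolve_namespace(&mut namespaces, "payments", 1);
    // ...a retried (or concurrent) attempt resolves to the same entity,
    // even though it proposed a different id.
    let second = resolve_namespace(&mut namespaces, "payments", 2);

    assert_eq!(first, second);
    assert_eq!(namespaces.len(), 1); // exactly one namespace, regardless of attempts
    println!("idempotent resolve: id {first}");
}
```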
State-Based Filtering
Operations filter by state to naturally deduplicate work:
#![allow(unused)]
fn main() {
// From StepEnqueuerService
// Only enqueue steps in specific states
let ready_steps = steps.iter()
.filter(|step| matches!(
step.state,
WorkflowStepState::Pending | WorkflowStepState::WaitingForRetry
))
.collect();
// Skip steps already:
// - Enqueued (another orchestrator handled it)
// - InProgress (worker is executing)
// - Complete (already done)
// - Error (terminal state)
}
Example Protection:
#![allow(unused)]
fn main() {
// Scenario: Orchestrator crash mid-batch
// Before crash: Enqueued steps 1-5 of 10
// After restart: Process task again
// State filtering:
// - Steps 1-5: state = Enqueued → skip
// - Steps 6-10: state = Pending → enqueue
// Result: Each step enqueued exactly once
}
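The crash-recovery behavior follows directly from the filter: re-running the enqueue pass over the same steps is safe because already-handled states fall out. A self-contained sketch, with a simplified enum and a hypothetical `enqueueable_ids` helper standing in for the real `WorkflowStepState` filtering:

```rust
// Simplified stand-in for the StepEnqueuerService state filter.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum StepState {
    Pending,
    Enqueued,
    InProgress,
    Complete,
    WaitingForRetry,
}

// Only Pending / WaitingForRetry steps are enqueueable.
fn enqueueable_ids(steps: &[(u32, StepState)]) -> Vec<u32> {
    steps
        .iter()
        .filter(|(_, state)| matches!(state, StepState::Pending | StepState::WaitingForRetry))
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    // Crash-recovery scenario: steps 1-2 were enqueued before the crash.
    let steps = vec![
        (1, StepState::Enqueued),
        (2, StepState::Enqueued),
        (3, StepState::Pending),
        (4, StepState::WaitingForRetry),
        (5, StepState::Complete),
    ];

    // Re-processing the task only picks up the unfinished work.
    assert_eq!(enqueueable_ids(&steps), vec![3, 4]);
    println!("steps to enqueue: {:?}", enqueueable_ids(&steps));
}
```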
State-Before-Queue Pattern
Ensures workers only see steps in correct state:
#![allow(unused)]
fn main() {
// 1. Commit state transition to database FIRST
step_state_machine.transition(StepEvent::Enqueue).await?;
// Step now in Enqueued state in database
// 2. THEN send PGMQ notification
pgmq_client.send_with_notify(queue_name, step_message).await?;
// Worker receives notification and:
// - Queries database for step
// - Sees state = Enqueued (committed)
// - Can safely claim and execute
}
Why Order Matters:
#![allow(unused)]
fn main() {
// Wrong order (queue-before-state):
// 1. Send PGMQ message
// 2. Worker receives immediately
// 3. Worker queries database → state still Pending
// 4. Worker might skip or fail to claim
// 5. State transition commits
// Correct order (state-before-queue):
// 1. State transition commits
// 2. Send PGMQ message
// 3. Worker receives
// 4. Worker queries → state correctly Enqueued
// 5. Worker can claim
}
See Events and Commands for event system details.
Component-by-Component Guarantees
Task Initialization Idempotency
Component: TaskRequestActor and TaskInitializer service
Operation: Creating a new task from a template
File: tasker-orchestration/src/orchestration/lifecycle/task_initialization/
Protection Mechanisms
- Identity Hash Unique Constraint
  #![allow(unused)]
  fn main() {
  // Tasks are identified by hash of (namespace, task_name, context)
  let identity_hash = calculate_identity_hash(namespace, name, context);
  NewTask {
      identity_hash, // Unique constraint prevents duplicates
      named_task_uuid,
      context,
      // ...
  }
  }
- Transaction Atomicity
- All entities created in single transaction
- Namespace, named task, task, workflow steps, edges
- Cycle detection validates DAG before committing
- Any failure rolls back everything
- Find-or-Create for Shared Entities
- Namespaces can be created by any orchestrator
- Named tasks shared across workflow instances
- Named steps reused across tasks
Concurrent Scenario
Two orchestrators receive identical TaskRequestMessage:
T0: Orchestrator A begins transaction
T1: Orchestrator B begins transaction
T2: A creates namespace "payments"
T3: B attempts to create namespace "payments"
T4: A creates task with identity_hash "abc123"
T5: B attempts to create task with identity_hash "abc123"
T6: A commits successfully ✓
T7: B attempts commit → unique constraint violation on identity_hash
T8: B transaction rolled back
Result:
- Exactly one task created
- No partial state in database
- Orchestrator B receives clear error
- Retry-safe: B can check if task exists and return it
Cycle Detection
Prevents invalid workflow definitions:
#![allow(unused)]
fn main() {
// Template defines: A depends on B, B depends on C, C depends on A
// During initialization:
// - Create steps A, B, C
// - Create edge A -> B (valid)
// - Create edge B -> C (valid)
// - Attempt edge C -> A
// - would_create_cycle() returns true
// - Error: CycleDetected
// - Transaction rolled back
// Result: Invalid workflow rejected, no partial data
}
See tasker-shared/src/models/core/workflow_step_edge.rs:236-270 for cycle detection implementation.
Step Enqueueing Idempotency
Component: StepEnqueuerActor and StepEnqueuerService
Operation: Enqueueing ready workflow steps to worker queues
File: tasker-orchestration/src/orchestration/lifecycle/step_enqueuer_services/
Multi-Layer Protection
- SQL-Level Row Locking
  -- get_next_ready_tasks() uses SKIP LOCKED
  SELECT task_uuid FROM tasker.tasks
  WHERE state = ANY($states)
  FOR UPDATE SKIP LOCKED -- Prevents concurrent claiming
  LIMIT $batch_size;
  Each orchestrator gets different tasks, no overlap
- State Machine Compare-and-Swap
  #![allow(unused)]
  fn main() {
  // Only transition if task in expected state
  state_machine.transition(TaskEvent::EnqueueSteps(uuids)).await?;
  // Fails if another orchestrator already transitioned
  }
- Step State Filtering
  #![allow(unused)]
  fn main() {
  // Only enqueue steps in specific states
  let enqueueable = steps.filter(|s| matches!(
      s.state,
      WorkflowStepState::Pending | WorkflowStepState::WaitingForRetry
  ));
  }
- State-Before-Queue Ordering
  #![allow(unused)]
  fn main() {
  // 1. Commit step state to Enqueued
  step.transition(StepEvent::Enqueue).await?;
  // 2. Send PGMQ message
  pgmq.send_with_notify(queue, message).await?;
  }
Concurrent Scenario
Two orchestrators discover the same ready steps:
T0: Orchestrator A queries get_next_ready_tasks(batch=100)
T1: Orchestrator B queries get_next_ready_tasks(batch=100)
T2: A gets tasks [1,2,3] (locked by A's transaction)
T3: B gets tasks [4,5,6] (different rows, SKIP LOCKED)
T4: A enqueues steps for tasks 1,2,3
T5: B enqueues steps for tasks 4,5,6
T6: Both commit successfully
Result: No overlap, each task processed once
Orchestrator Crash Mid-Batch:
T0: Orchestrator A gets task 1 with steps [A, B, C, D]
T1: A enqueues steps A, B to "payments_queue"
T2: A crashes before processing steps C, D
T3: Task 1 state still EnqueuingSteps
T4: Orchestrator B picks up task 1 (A's transaction rolled back)
T5: B queries steps for task 1
T6: Steps A, B have state = Enqueued → skip
T7: Steps C, D have state = Pending → enqueue
Result: Steps A, B enqueued once, C, D recovered and enqueued
Result Processing Idempotency
Component: ResultProcessorActor and OrchestrationResultProcessor
Operation: Processing step execution results from workers
File: tasker-orchestration/src/orchestration/lifecycle/result_processing/
Protection Mechanisms
- State Guard Validation
  #![allow(unused)]
  fn main() {
  // TaskCoordinator validates step state before processing result
  let current_state = step_state_machine.current_state().await?;
  match current_state {
      WorkflowStepState::InProgress => {
          // Valid: step is being processed
          step_state_machine.transition(StepEvent::Complete).await?;
      }
      WorkflowStepState::Complete => {
          // Idempotent: already processed this result
          return Ok(AlreadyComplete);
      }
      _ => {
          // Invalid state for result processing
          return Err(InvalidState);
      }
  }
  }
- Atomic State Transitions
- Step result processing uses compare-and-swap
- Task state transitions validate current state
- All updates in same transaction as state check
- Ownership Removed
- Processor UUID tracked for audit only
- Not enforced for transitions
- Any orchestrator can process results
- Enables recovery after crashes
Concurrent Scenario
Worker submits result, orchestrator crashes, retry arrives:
T0: Worker completes step A, submits result to orchestration_step_results queue
T1: Orchestrator A pulls message, begins processing
T2: A transitions step A to Complete
T3: A begins task state evaluation
T4: A crashes before deleting PGMQ message
T5: PGMQ visibility timeout expires → message reappears
T6: Orchestrator B pulls same message
T7: B queries step A state → Complete
T8: B returns early (idempotent, already processed)
T9: B deletes PGMQ message
Result: Step processed exactly once, retry is harmless
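The early-return guard that makes the retry harmless can be sketched as a small state transition function. The enum variants and `process_result` name are hypothetical simplifications; the real processor goes through the StepStateMachine and returns TaskerResult:

```rust
// Sketch of the early-return guard for duplicate result messages.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StepState {
    InProgress,
    Complete,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Processed,
    AlreadyComplete, // duplicate delivery: harmless no-op
}

fn process_result(state: &mut StepState) -> Outcome {
    match *state {
        // Idempotent: another orchestrator already handled this result.
        StepState::Complete => Outcome::AlreadyComplete,
        StepState::InProgress => {
            *state = StepState::Complete;
            Outcome::Processed
        }
    }
}

fn main() {
    let mut step = StepState::InProgress;
    // First delivery does the work...
    assert_eq!(process_result(&mut step), Outcome::Processed);
    // ...the redelivered PGMQ message is absorbed without side effects.
    assert_eq!(process_result(&mut step), Outcome::AlreadyComplete);
    assert_eq!(step, StepState::Complete);
    println!("duplicate delivery absorbed");
}
```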
Before Ownership Removal (Ownership Enforced):
// Orchestrator A owned task in EvaluatingResults state
// A crashes
// B receives retry
// B checks: task.processor_uuid != B.uuid
// Error: Ownership violation → TASK STUCK
After Ownership Removal (Ownership Audit-Only):
// Orchestrator A owned task in EvaluatingResults state
// A crashes
// B receives retry
// B checks: current task state (no ownership check)
// B processes successfully → TASK RECOVERS
See the Ownership Removal ADR for full analysis.
Task Finalization Idempotency
Component: TaskFinalizerActor and TaskFinalizer service
Operation: Finalizing task to terminal state
File: tasker-orchestration/src/orchestration/lifecycle/task_finalization/
Current Protection (Sufficient for Recovery)
- State Guard Protection
  #![allow(unused)]
  fn main() {
  // TaskFinalizer checks current task state
  let context = ExecutionContextProvider::fetch(task_uuid).await?;
  match context.should_finalize() {
      true => {
          // Transition to Complete
          task_state_machine.transition(TaskEvent::MarkComplete).await?;
      }
      false => {
          // Not ready to finalize (steps still pending)
          return Ok(NotReady);
      }
  }
  }
- Idempotent for Recovery
  #![allow(unused)]
  fn main() {
  // Scenario: Orchestrator crashes during finalization
  // - Task state already Complete → state guard returns early
  // - Task state still StepsInProcess → retry succeeds
  // Result: Recovery works, final state reached
  }
Concurrent Scenario (Not Graceful)
Two orchestrators attempt finalization simultaneously:
T0: Orchestrators A and B both receive finalization trigger
T1: A checks: all steps complete → proceed
T2: B checks: all steps complete → proceed
T3: A transitions task to Complete (succeeds)
T4: B attempts transition to Complete
T5: State guard: task already Complete
T6: B receives StateMachineError (invalid transition)
Result:
- ✓ Task finalized exactly once (correct)
- ✓ No data corruption
- ⚠️ Orchestrator B gets error (not graceful)
Future Enhancement: Atomic Finalization Claiming
Atomic claiming would make concurrent finalization graceful:
-- Proposed claim_task_for_finalization() function
UPDATE tasker.tasks
SET finalization_claimed_at = NOW(),
finalization_claimed_by = $processor_uuid
WHERE task_uuid = $uuid
AND state = 'StepsInProcess'
AND finalization_claimed_at IS NULL
RETURNING *;
With atomic finalization claiming:
T0: Orchestrators A and B both receive finalization trigger
T1: A calls claim_task_for_finalization() → succeeds
T2: B calls claim_task_for_finalization() → returns 0 rows
T3: A proceeds with finalization
T4: B returns early (silent success, already claimed)
This enhancement is deferred (implementation not yet scheduled).
SQL Function Atomicity
File: tasker-shared/src/database/sql/
Documented: Task Readiness & Execution
Atomic State Transitions
Function: transition_task_state_atomic()
Protection: Compare-and-swap with row locking
-- Atomic state transition with validation
UPDATE tasker.tasks
SET state = $new_state,
updated_at = NOW()
WHERE task_uuid = $uuid
AND state = $expected_current_state -- CAS: only if state matches
FOR UPDATE; -- Lock prevents concurrent modifications
Key Guarantees:
- Returns 0 rows if state doesn’t match → safe retry
- Row lock prevents concurrent transitions
- Processor UUID tracked for audit, not enforced
Work Distribution Without Contention
Function: get_next_ready_tasks()
Protection: Lock-free claiming via SKIP LOCKED
SELECT task_uuid, correlation_id, state
FROM tasker.tasks
WHERE state = ANY($processable_states)
AND (
state NOT IN ('WaitingForRetry') OR
last_retry_at + retry_interval < NOW()
)
ORDER BY
CASE state
WHEN 'Pending' THEN 1
WHEN 'WaitingForRetry' THEN 2
ELSE 3
END,
created_at ASC
FOR UPDATE SKIP LOCKED -- Skip locked rows, no blocking
LIMIT $batch_size;
Key Guarantees:
- Each orchestrator gets different tasks
- No blocking or contention
- Dynamic priority (Pending before WaitingForRetry)
- Prevents task starvation
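The two-level ordering (state priority, then age) can be reproduced in a few lines. This is a sketch of the ORDER BY logic only — the enum and `dispatch_order` helper are illustrative names, and the real query also applies the retry-backoff filter:

```rust
// Illustrative dispatch ordering matching the ORDER BY in
// get_next_ready_tasks(): Pending before WaitingForRetry, oldest first.
#[derive(Debug, PartialEq)]
enum TaskState {
    Pending,
    WaitingForRetry,
}

fn dispatch_order(mut tasks: Vec<(u32, TaskState, u64)>) -> Vec<u32> {
    tasks.sort_by_key(|(_, state, created_at)| {
        let priority = match state {
            TaskState::Pending => 1,
            TaskState::WaitingForRetry => 2,
        };
        // Sort by (state priority, creation time), like the SQL CASE + ORDER BY.
        (priority, *created_at)
    });
    tasks.into_iter().map(|(id, _, _)| id).collect()
}

fn main() {
    // (task_id, state, created_at tick)
    let tasks = vec![
        (1, TaskState::WaitingForRetry, 10),
        (2, TaskState::Pending, 30),
        (3, TaskState::Pending, 20),
    ];
    // Pending tasks jump ahead of retries, even ones created later.
    assert_eq!(dispatch_order(tasks), vec![3, 2, 1]);
    println!("dispatch order computed");
}
```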
Step Readiness with Dependency Validation
Function: get_step_readiness_status()
Protection: Validates dependencies in single query
WITH step_dependencies AS (
SELECT COUNT(*) as total_deps,
SUM(CASE WHEN dep_step.state = 'Complete' THEN 1 ELSE 0 END) as completed_deps
FROM tasker.workflow_step_edges e
JOIN tasker.workflow_steps dep_step ON e.from_step_uuid = dep_step.uuid
WHERE e.to_step_uuid = $step_uuid
)
SELECT
CASE
WHEN total_deps = completed_deps THEN 'Ready'
WHEN step.state = 'Error' AND step.attempts < step.max_attempts THEN 'WaitingForRetry'
ELSE 'Blocked'
END as readiness
FROM step_dependencies, tasker.workflow_steps step
WHERE step.uuid = $step_uuid;
Key Guarantees:
- Atomic dependency check
- Handles retry logic with backoff
- Prevents premature execution
Cycle Detection
Function: WorkflowStepEdge::would_create_cycle() (Rust, uses SQL)
Protection: Recursive CTE path traversal
WITH RECURSIVE step_path AS (
-- Base: Start from proposed destination
SELECT from_step_uuid, to_step_uuid, 1 as depth
FROM tasker.workflow_step_edges
WHERE from_step_uuid = $proposed_to
UNION ALL
-- Recursive: Follow edges
SELECT sp.from_step_uuid, wse.to_step_uuid, sp.depth + 1
FROM step_path sp
JOIN tasker.workflow_step_edges wse ON sp.to_step_uuid = wse.from_step_uuid
WHERE sp.depth < 100 -- Prevent infinite recursion
)
SELECT COUNT(*) as has_path
FROM step_path
WHERE to_step_uuid = $proposed_from;
Returns: True if adding edge would create cycle
Enforcement: Called by WorkflowStepBuilder during task initialization
- Self-reference check: from_uuid == to_uuid
- Path check: Would adding edge create cycle?
- Error before commit: Transaction rolled back on cycle
See tasker-orchestration/src/orchestration/lifecycle/task_initialization/workflow_step_builder.rs for enforcement.
Cross-Cutting Scenarios
Multiple Orchestrators Processing Same Task
Scenario: Load balancer distributes work to multiple orchestrators
Protection:
- Work Distribution:
  -- Each orchestrator gets different tasks via SKIP LOCKED
  Orchestrator A: Tasks [1, 2, 3]
  Orchestrator B: Tasks [4, 5, 6]
- State Transitions:
  #![allow(unused)]
  fn main() {
  // Both attempt to transition same task (shouldn't happen, but...)
  A: transition(Pending -> Initializing) → succeeds
  B: transition(Pending -> Initializing) → fails (state already changed)
  }
- Step Enqueueing:
  #![allow(unused)]
  fn main() {
  // Task in EnqueuingSteps state
  A: Processes task, enqueues steps A, B
  B: Cannot claim task (state not in processable states)
  // OR if B claims during transition:
  B: Filters steps by state → A, B already Enqueued, skips them
  }
Result: No duplicate work, clean coordination
Orchestrator Crashes and Recovers
Scenario: Orchestrator crashes mid-operation, another takes over
During Task Initialization
Before ownership removal:
T0: Orchestrator A initializes task 1
T1: Task transitions to Initializing (processor_uuid = A)
T2: A crashes
T3: Task stuck in Initializing forever (ownership blocks recovery)
After ownership removal:
T0: Orchestrator A initializes task 1
T1: Task transitions to Initializing (processor_uuid = A for audit)
T2: A crashes
T3: Orchestrator B picks up task 1
T4: B transitions Initializing -> EnqueuingSteps (succeeds, no ownership check)
T5: Task recovers automatically
During Step Enqueueing
T0: Orchestrator A enqueues steps [A, B] of task 1
T1: A crashes before committing
T2: Transaction rolls back
T3: Steps A, B remain in Pending state
T4: Orchestrator B picks up task 1
T5: B enqueues steps A, B (state still Pending)
T6: No duplicate work
During Result Processing
T0: Worker completes step A
T1: Orchestrator A receives result, transitions step to Complete
T2: A crashes before updating task state
T3: PGMQ message visibility timeout expires
T4: Orchestrator B receives same result message
T5: B queries step A → already Complete
T6: B skips processing (idempotent)
T7: B evaluates task state, continues workflow
Result: Complete recovery, no manual intervention
Retry After Transient Failure
Scenario: Database connection lost during operation
#![allow(unused)]
fn main() {
// Orchestrator attempts task initialization
let result = task_initializer.initialize(request).await;
match result {
Err(TaskInitializationError::Database(_)) => {
// Transient failure (connection lost)
// Retry same request
let retry_result = task_initializer.initialize(request).await;
// Possibilities:
// 1. Succeeds: Transaction completed before connection lost
// → identity_hash unique constraint prevents duplicate
// → Get existing task
// 2. Succeeds: Transaction rolled back
// → Create task successfully
// 3. Fails: Different error
// → Handle appropriately
}
Ok(task) => { /* Success */ }
}
}
Key Pattern: Operations are designed to be retry-safe
- Database constraints prevent duplicates
- State guards prevent invalid transitions
- Find-or-create handles concurrent creation
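The interaction of retry and the identity_hash constraint can be illustrated with an in-memory stand-in for the tasks table. This is a sketch of the find-or-create idea only; the type and method names here are hypothetical, not Tasker's API.

```rust
use std::collections::HashMap;

/// Stand-in for the tasks table: identity_hash -> task id.
struct TaskStore {
    by_identity: HashMap<String, u64>,
    next_id: u64,
}

impl TaskStore {
    fn new() -> Self {
        TaskStore { by_identity: HashMap::new(), next_id: 1 }
    }

    /// Find-or-create: a retried or concurrent request with the same
    /// identity_hash returns the existing task instead of creating a
    /// duplicate, mirroring what the unique constraint guarantees in
    /// PostgreSQL. Returns (task_id, was_created).
    fn find_or_create(&mut self, identity_hash: &str) -> (u64, bool) {
        if let Some(&id) = self.by_identity.get(identity_hash) {
            return (id, false); // existing task, no duplicate created
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_identity.insert(identity_hash.to_string(), id);
        (id, true) // newly created
    }
}

fn main() {
    let mut store = TaskStore::new();
    let (first, created) = store.find_or_create("abc123");
    let (second, created_again) = store.find_or_create("abc123"); // retry
    assert!(created && !created_again);
    assert_eq!(first, second); // same task, no duplicate
    println!("retry returned existing task {first}");
}
```

In the real system the map lookup and insert are a single `INSERT ... ON CONFLICT`-style operation, so two orchestrators racing on the same request still converge on one task.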
PGMQ Message Duplicate Delivery
Scenario: PGMQ message processed twice due to visibility timeout
#![allow(unused)]
fn main() {
// Worker completes step, sends result
pgmq.send("orchestration_step_results", result).await?;
// Orchestrator A receives message
let message = pgmq.read("orchestration_step_results").await?;
// A processes result
result_processor.process(message.payload).await?;
// A about to delete message, crashes
// Message visibility timeout expires → message reappears
// Orchestrator B receives same message
let duplicate = pgmq.read("orchestration_step_results").await?;
// B processes result
// State machine checks: step already Complete
// Returns early (idempotent)
result_processor.process(duplicate.payload).await?; // Harmless
// B deletes message
pgmq.delete(duplicate.msg_id).await?;
}
Protection:
- State guards: Check current state before processing
- Idempotent handlers: Safe to process same message multiple times
- Message deletion: Only after confirmed processing
See Events and Commands for PGMQ architecture.
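The idempotent-processing guard boils down to a state check before acting. A minimal sketch with simplified types (not the actual result processor):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum StepState { Pending, InProgress, Complete }

#[derive(Debug, PartialEq)]
enum Outcome { Processed, SkippedAlreadyComplete }

/// Apply a step result only if the step is not already terminal.
/// A duplicate delivery of the same result message hits the early
/// return and is harmless.
fn process_result(current: StepState) -> (StepState, Outcome) {
    if current == StepState::Complete {
        return (current, Outcome::SkippedAlreadyComplete);
    }
    (StepState::Complete, Outcome::Processed)
}

fn main() {
    // First delivery: result is applied.
    let (state, outcome) = process_result(StepState::InProgress);
    assert_eq!(outcome, Outcome::Processed);
    // Duplicate delivery: skipped, state unchanged.
    let (_, outcome2) = process_result(state);
    assert_eq!(outcome2, Outcome::SkippedAlreadyComplete);
    println!("duplicate delivery was a no-op");
}
```

The message is deleted only after `process_result` returns, so a crash between processing and deletion replays the message into the harmless branch.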
Multi-Instance Validation
The defense-in-depth architecture was validated through comprehensive multi-instance cluster testing. This section documents the validation results and confirms the effectiveness of the protection mechanisms.
Test Configuration
- Orchestration Instances: 2 (ports 8080, 8081)
- Worker Instances: 2 per type (Rust: 8100-8101, Ruby: 8200-8201, Python: 8300-8301, TypeScript: 8400-8401)
- Total Services: 10 concurrent instances
- Database: Shared PostgreSQL with PGMQ messaging
Validation Results
| Metric | Result |
|---|---|
| Tests Passed | 1,645 |
| Intermittent Failures | 3 (resource contention, not race conditions) |
| Tests Skipped | 21 (domain event tests, require single-instance) |
| Race Conditions Detected | 0 |
| Data Corruption Detected | 0 |
What Was Validated
1. Concurrent Task Creation
   - Tasks created through different orchestration instances
   - No duplicate tasks or UUIDs
   - All tasks complete successfully
   - State consistent across all instances
2. Work Distribution
   - `SKIP LOCKED` distributes tasks without overlap
   - Multiple workers claim different steps
   - No duplicate step processing
3. State Machine Guards
   - Invalid transitions rejected at state machine layer
   - Compare-and-swap prevents concurrent modifications
   - Terminal states protected from re-entry
4. Transaction Boundaries
   - All-or-nothing semantics maintained under load
   - No partial task initialization observed
   - Crash recovery works correctly
5. Cross-Instance Consistency
   - Task state queries return same result from any instance
   - Step state transitions visible immediately to all instances
   - No stale reads observed
Protection Layer Effectiveness
| Layer | Validation Method | Result |
|---|---|---|
| Database Atomicity | Concurrent unique constraint tests | Duplicates correctly rejected |
| State Machine Guards | Parallel transition attempts | Invalid transitions blocked |
| Transaction Boundaries | Crash injection tests | Clean rollback, no corruption |
| Application Logic | State filtering under load | Idempotent processing confirmed |
Intermittent Failures Analysis
Three tests showed intermittent failures under heavy parallelization:
- Root Cause: Database connection pool exhaustion when running 1600+ tests in parallel
- Evidence: Failures occurred only at high parallelism (>4 threads), not with serialized execution
- Classification: Resource contention, NOT race conditions
- Mitigation: Nextest configured with `test-threads = 1` for multi_instance tests
Key Finding: No race conditions were detected. All intermittent failures traced to resource limits.
Domain Event Tests
21 tests were excluded from cluster mode using #[cfg(not(feature = "test-cluster"))]:
- Reason: Domain event tests verify in-process event delivery (publish/subscribe within single process)
- Behavior in Cluster: Events published in one instance aren’t delivered to subscribers in another instance
- Status: Working as designed - these tests run correctly in single-instance CI
Stress Test Results
Rapid Task Burst Test:
- 25 tasks created in <1 second
- All tasks completed successfully
- No duplicate UUIDs
- Creation rate: ~50 tasks/second sustained
Round-Robin Distribution Test:
- Tasks distributed evenly across orchestration instances
- Load balancing working correctly
- No single-instance bottleneck
Recommendations Validated
The following architectural decisions were validated by cluster testing:
- Ownership Removal: Processor UUID as audit-only (not enforced) enables automatic recovery
- SKIP LOCKED Pattern: Effective for contention-free work distribution
- State-Before-Queue Pattern: Prevents workers from seeing uncommitted state
- Find-or-Create Pattern: Handles concurrent entity creation correctly
Future Enhancements Identified
Testing identified one P2 improvement opportunity:
Atomic Finalization Claiming
- Current: Second orchestrator gets `StateMachineError` during concurrent finalization
- Proposed: Transaction-based locking for graceful handling
- Priority: P2 (operational improvement, correctness already ensured)
Running Cluster Validation
To reproduce the validation:
# Setup cluster environment
cargo make setup-env-cluster
# Start full cluster
cargo make cluster-start-all
# Run all tests including cluster tests
cargo make test-rust-all
# Stop cluster
cargo make cluster-stop
See Cluster Testing Guide for detailed instructions.
Design Principles
Defense in Depth
The system intentionally provides multiple overlapping protection layers rather than relying on a single mechanism. This ensures:
- Resilience: If one layer fails (e.g., application bug), others prevent corruption
- Clear Semantics: Each layer has a specific purpose and failure mode
- Ease of Reasoning: Developers can understand guarantees at each level
- Graceful Degradation: System remains safe even under partial failures
Fail-Safe Defaults
When in doubt, the system errs on the side of caution:
- State transitions fail if current state doesn’t match → prevents invalid states
- Unique constraints fail creation → prevents duplicates
- Row locks block concurrent access → prevents race conditions
- Cycle detection fails initialization → prevents invalid workflows
Better to fail cleanly than to corrupt data.
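The compare-and-swap guard behind several of these defaults can be sketched in miniature. In production this is an `UPDATE ... WHERE state = $expected` that updates zero rows on conflict; the in-memory version below is illustrative only.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskState { Pending, Initializing, EnqueuingSteps }

/// Compare-and-swap: the transition applies only if the record is still
/// in the expected state. A concurrent or replayed transition fails
/// cleanly instead of clobbering newer state.
fn cas_transition(current: &mut TaskState, expected: TaskState, new: TaskState) -> bool {
    if *current == expected {
        *current = new;
        true
    } else {
        false // someone else transitioned first: fail, don't corrupt
    }
}

fn main() {
    let mut state = TaskState::Pending;
    assert!(cas_transition(&mut state, TaskState::Pending, TaskState::Initializing));
    // A second orchestrator retrying the same transition loses cleanly:
    assert!(!cas_transition(&mut state, TaskState::Pending, TaskState::Initializing));
    // Progress continues from the actual current state:
    assert!(cas_transition(&mut state, TaskState::Initializing, TaskState::EnqueuingSteps));
    assert_eq!(state, TaskState::EnqueuingSteps);
}
```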
Retry Safety
All critical operations are designed to be safely retryable:
- Idempotent: Same operation, repeated → same outcome
- State-Based: Check current state before acting
- Atomic: All-or-nothing commits
- No Side Effects: Operations don’t accumulate partial state
This enables:
- Automatic retry after transient failures
- Duplicate message handling
- Recovery after crashes
- Horizontal scaling without coordination overhead
Audit Trail Without Enforcement
Ownership Decision: Track ownership for observability, don’t enforce for correctness
#![allow(unused)]
fn main() {
// Processor UUID recorded in all transitions
pub struct TaskTransition {
pub task_uuid: Uuid,
pub from_state: TaskState,
pub to_state: TaskState,
pub processor_uuid: Uuid, // For audit and debugging
pub event: String,
pub timestamp: DateTime<Utc>,
}
// But NOT enforced in transition logic
impl TaskStateMachine {
pub async fn transition(&mut self, event: TaskEvent) -> Result<TaskState> {
// ✅ Tracks processor UUID
// ❌ Does NOT require ownership match
// Reason: Enables recovery after crashes
}
}
}
Why This Works:
- State guards provide correctness (current state validation)
- Processor UUID provides observability (who did what when)
- No ownership blocking means automatic recovery
- Full audit trail for debugging and monitoring
Implementation Checklist
When implementing new orchestration operations, ensure:
Database Layer
- Unique constraints for entities that must be singular
- `FOR UPDATE` locking for state transitions
- `FOR UPDATE SKIP LOCKED` for work distribution
- Compare-and-swap (CAS) in UPDATE WHERE clauses
- Transaction wrapping for multi-step operations
State Machine Layer
- Current state retrieval before transitions
- Event applicability validation
- Terminal state protection
- Error handling for invalid transitions
Application Layer
- Find-or-create pattern for shared entities
- State-based filtering before processing
- State-before-queue ordering for events
- Idempotent message handlers
Testing
- Concurrent operation tests (multiple orchestrators)
- Crash recovery tests (mid-operation failures)
- Retry safety tests (duplicate message handling)
- Race condition tests (timing-dependent scenarios)
Related Documentation
Core Architecture
- States and Lifecycles - Dual state machine architecture
- Events and Commands - Event-driven coordination patterns
- Actor-Based Architecture - Orchestration actor pattern
- Task Readiness & Execution - SQL functions and execution logic
Implementation Details
- Ownership Removal ADR - Processor UUID ownership removal decision
Multi-Instance Validation
- Cluster Testing Guide - Running multi-instance cluster tests
Testing
- Comprehensive Lifecycle Testing - Testing patterns including concurrent scenarios
← Back to Documentation Hub
Messaging Abstraction Architecture
Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Events and Commands | Deployment Patterns | Crate Architecture
← Back to Documentation Hub
Overview
The provider-agnostic messaging abstraction enables Tasker Core to support multiple messaging backends through a unified interface. This architecture allows switching between PGMQ (PostgreSQL Message Queue) and RabbitMQ without changes to business logic.
Key Benefits:
- Zero handler changes required: Switching providers requires only configuration changes
- Provider-specific optimizations: Each backend can leverage its native strengths
- Testability: In-memory provider for fast unit testing
- Gradual migration: Systems can transition between providers incrementally
Core Concepts
Message Delivery Models
Different messaging providers have fundamentally different delivery models:
| Provider | Native Model | Push Support | Notification Type | Fallback Needed |
|---|---|---|---|---|
| PGMQ | Poll | Yes (pg_notify) | Signal only | Yes (catch-up) |
| RabbitMQ | Push | Yes (native) | Full message | No |
| InMemory | Push | Yes | Full message | No |
PGMQ (Signal-Only):
- `pg_notify` sends a signal that a message exists
- Worker must fetch the message after receiving the signal
- Fallback polling catches missed signals
RabbitMQ (Full Message Push):
- `basic_consume()` delivers complete messages
- No separate fetch required
- Protocol guarantees delivery
Architecture Layers
┌─────────────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ (Orchestration, Workers, Event Systems) │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ Uses MessageClient
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ MessageClient │
│ Domain-level facade with queue classification │
│ Location: tasker-shared/src/messaging/client.rs │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ Delegates to
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ MessagingProvider Enum │
│ Runtime dispatch without trait objects (zero-cost abstraction) │
│ Location: tasker-shared/src/messaging/service/provider.rs │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ PGMQ │ │ RabbitMQ │ │ InMemory │
│ Provider │ │ Provider │ │ Provider │
└───────────┘ └───────────┘ └───────────┘
Core Traits and Types
MessagingService Trait
Location: tasker-shared/src/messaging/service/traits.rs
The foundational trait defining queue operations:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait MessagingService: Send + Sync {
// Queue lifecycle
async fn create_queue(&self, name: &str) -> Result<(), MessagingError>;
async fn delete_queue(&self, name: &str) -> Result<(), MessagingError>;
async fn queue_exists(&self, name: &str) -> Result<bool, MessagingError>;
async fn list_queues(&self) -> Result<Vec<String>, MessagingError>;
// Message operations
async fn send_message(&self, queue: &str, payload: &[u8]) -> Result<i64, MessagingError>;
async fn send_message_with_delay(&self, queue: &str, payload: &[u8], delay_seconds: i64) -> Result<i64, MessagingError>;
async fn receive_messages(&self, queue: &str, limit: i32, visibility_timeout: i32) -> Result<Vec<QueuedMessage<Vec<u8>>>, MessagingError>;
async fn ack_message(&self, queue: &str, msg_id: i64) -> Result<(), MessagingError>;
async fn nack_message(&self, queue: &str, msg_id: i64) -> Result<(), MessagingError>;
// Provider information
fn provider_name(&self) -> &'static str;
}
}
SupportsPushNotifications Trait
Location: tasker-shared/src/messaging/service/traits.rs
Extends MessagingService with push notification capabilities:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait SupportsPushNotifications: MessagingService {
/// Subscribe to messages on a single queue
fn subscribe(&self, queue_name: &str)
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>;
/// Subscribe to messages on multiple queues
fn subscribe_many(&self, queue_names: &[String])
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>;
/// Whether this provider requires fallback polling for reliability
fn requires_fallback_polling(&self) -> bool;
/// Suggested polling interval if fallback is needed
fn fallback_polling_interval(&self) -> Option<Duration>;
/// Whether this provider supports fetching by message ID
fn supports_fetch_by_message_id(&self) -> bool;
}
}
MessageNotification Enum
Location: tasker-shared/src/messaging/service/traits.rs
Abstracts the two notification models:
#![allow(unused)]
fn main() {
pub enum MessageNotification {
/// Signal-only notification (PGMQ style)
/// Indicates a message is available but requires separate fetch
Available {
queue_name: String,
msg_id: Option<i64>,
},
/// Full message notification (RabbitMQ style)
/// Contains the complete message payload
Message(QueuedMessage<Vec<u8>>),
}
}
Provider Implementations
PGMQ Provider
Location: tasker-shared/src/messaging/service/providers/pgmq.rs
PostgreSQL-based message queue with LISTEN/NOTIFY integration:
#![allow(unused)]
fn main() {
impl SupportsPushNotifications for PgmqMessagingService {
fn subscribe_many(&self, queue_names: &[String])
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
{
// Uses PgmqNotifyListener for pg_notify subscription
// Returns MessageNotification::Available (signal-only) for large messages
// Returns MessageNotification::Message for small messages (<7KB)
}
fn requires_fallback_polling(&self) -> bool {
true // pg_notify can miss signals during connection issues
}
fn supports_fetch_by_message_id(&self) -> bool {
true // PGMQ supports read_specific_message()
}
}
}
Characteristics:
- Uses PostgreSQL for storage and delivery
pg_notifyfor real-time notifications- Fallback polling required for reliability
- Supports visibility timeout for message claiming
RabbitMQ Provider
Location: tasker-shared/src/messaging/service/providers/rabbitmq.rs
AMQP-based message broker with native push delivery:
#![allow(unused)]
fn main() {
impl SupportsPushNotifications for RabbitMqMessagingService {
fn subscribe_many(&self, queue_names: &[String])
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
{
// Uses lapin basic_consume() for native push delivery
// Always returns MessageNotification::Message (full payload)
}
fn requires_fallback_polling(&self) -> bool {
false // AMQP protocol guarantees delivery
}
fn supports_fetch_by_message_id(&self) -> bool {
false // RabbitMQ doesn't support fetch-by-ID
}
}
}
Characteristics:
- Native push delivery via AMQP protocol
- No fallback polling needed
- Higher throughput for high-volume scenarios
- Requires separate infrastructure (RabbitMQ server)
InMemory Provider
Location: tasker-shared/src/messaging/service/providers/in_memory.rs
In-process message queue for testing:
#![allow(unused)]
fn main() {
impl SupportsPushNotifications for InMemoryMessagingService {
fn requires_fallback_polling(&self) -> bool {
false // In-memory is reliable within process
}
}
}
Use Cases:
- Unit testing without external dependencies
- Integration testing with controlled timing
- Development environments
MessagingProvider Enum
Location: tasker-shared/src/messaging/service/provider.rs
Enum dispatch pattern for runtime provider selection without trait objects:
#![allow(unused)]
fn main() {
pub enum MessagingProvider {
Pgmq(PgmqMessagingService),
RabbitMq(RabbitMqMessagingService),
InMemory(InMemoryMessagingService),
}
impl MessagingProvider {
/// Delegate all MessagingService methods to the underlying provider
pub async fn send_message(&self, queue: &str, payload: &[u8]) -> Result<i64, MessagingError> {
match self {
Self::Pgmq(p) => p.send_message(queue, payload).await,
Self::RabbitMq(p) => p.send_message(queue, payload).await,
Self::InMemory(p) => p.send_message(queue, payload).await,
}
}
/// Subscribe to push notifications
pub fn subscribe_many(&self, queue_names: &[String])
-> Result<Pin<Box<dyn Stream<Item = MessageNotification> + Send>>, MessagingError>
{
match self {
Self::Pgmq(p) => p.subscribe_many(queue_names),
Self::RabbitMq(p) => p.subscribe_many(queue_names),
Self::InMemory(p) => p.subscribe_many(queue_names),
}
}
/// Check if fallback polling is required
pub fn requires_fallback_polling(&self) -> bool {
match self {
Self::Pgmq(p) => p.requires_fallback_polling(),
Self::RabbitMq(p) => p.requires_fallback_polling(),
Self::InMemory(p) => p.requires_fallback_polling(),
}
}
}
}
Benefits:
- Zero-cost abstraction (no vtable indirection)
- Exhaustive match ensures all providers handled
- Easy to add new providers
MessageClient Facade
Location: tasker-shared/src/messaging/client.rs
Domain-level facade providing high-level queue operations:
#![allow(unused)]
fn main() {
pub struct MessageClient {
provider: Arc<MessagingProvider>,
classifier: QueueClassifier,
}
impl MessageClient {
/// Send a step message to the appropriate namespace queue
pub async fn send_step_message(
&self,
namespace: &str,
step: &SimpleStepMessage,
) -> Result<i64, MessagingError> {
let queue_name = self.classifier.step_queue_for_namespace(namespace);
let payload = serde_json::to_vec(step)?;
self.provider.send_message(&queue_name, &payload).await
}
/// Send a step result to the orchestration queue
pub async fn send_step_result(
&self,
result: &StepExecutionResult,
) -> Result<i64, MessagingError> {
let queue_name = self.classifier.orchestration_results_queue();
let payload = serde_json::to_vec(result)?;
self.provider.send_message(&queue_name, &payload).await
}
/// Access the underlying provider for advanced operations
pub fn provider(&self) -> &MessagingProvider {
&self.provider
}
}
}
Event System Integration
Provider-Agnostic Queue Listeners
Both orchestration and worker queue listeners use provider.subscribe_many():
#![allow(unused)]
fn main() {
// tasker-orchestration/src/orchestration/orchestration_queues/listener.rs
impl OrchestrationQueueListener {
pub async fn start(&mut self) -> Result<(), MessagingError> {
let queues = vec![
self.classifier.orchestration_results_queue(),
self.classifier.orchestration_requests_queue(),
self.classifier.orchestration_finalization_queue(),
];
// Provider-agnostic subscription
let mut stream = self.provider.subscribe_many(&queues)?;
// Process notifications
while let Some(notification) = stream.next().await {
match notification {
MessageNotification::Available { queue_name, msg_id } => {
// PGMQ style: send event command to fetch message
self.send_event_command(queue_name, msg_id).await;
}
MessageNotification::Message(msg) => {
// RabbitMQ style: send message command with full payload
self.send_message_command(msg).await;
}
}
}
}
}
}
Deployment Mode Selection
Event systems select the appropriate mode based on provider capabilities:
#![allow(unused)]
fn main() {
// Determine effective deployment mode for this provider
let effective_mode = deployment_mode.effective_for_provider(provider.provider_name());
match effective_mode {
DeploymentMode::EventDrivenOnly => {
// Start queue listener only (no fallback poller)
// RabbitMQ typically uses this mode
}
DeploymentMode::Hybrid => {
// Start both listener and fallback poller
// PGMQ uses this mode for reliability
}
DeploymentMode::PollingOnly => {
// Start fallback poller only
// For restricted environments
}
}
}
Command Routing
Dual Command Variants
Command processors handle both notification types:
#![allow(unused)]
fn main() {
pub enum OrchestrationCommand {
// For full message notifications (RabbitMQ)
ProcessStepResultFromMessage {
queue_name: String,
message: QueuedMessage<Vec<u8>>,
resp: CommandResponder<StepProcessResult>,
},
// For signal-only notifications (PGMQ)
ProcessStepResultFromMessageEvent {
message_event: MessageReadyEvent,
resp: CommandResponder<StepProcessResult>,
},
}
}
Routing Logic:
- `MessageNotification::Message` -> `ProcessStepResultFromMessage`
- `MessageNotification::Available` -> `ProcessStepResultFromMessageEvent`
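The routing rule can be sketched with simplified types: `Vec<u8>` stands in for `QueuedMessage`, and only the step-result commands are shown, without the response channels.

```rust
#[derive(Debug, PartialEq)]
enum MessageNotification {
    /// Signal-only notification (PGMQ style)
    Available { queue_name: String, msg_id: Option<i64> },
    /// Full message notification (RabbitMQ style)
    Message(Vec<u8>),
}

#[derive(Debug, PartialEq)]
enum Command {
    /// Full payload delivered: process it directly.
    ProcessFromMessage(Vec<u8>),
    /// Signal only: a separate fetch is needed before processing.
    ProcessFromMessageEvent { queue_name: String, msg_id: Option<i64> },
}

/// Route a notification to the matching command variant.
fn route(n: MessageNotification) -> Command {
    match n {
        MessageNotification::Message(payload) => Command::ProcessFromMessage(payload),
        MessageNotification::Available { queue_name, msg_id } => {
            Command::ProcessFromMessageEvent { queue_name, msg_id }
        }
    }
}

fn main() {
    let cmd = route(MessageNotification::Available {
        queue_name: "orchestration_step_results".into(),
        msg_id: Some(42),
    });
    assert_eq!(
        cmd,
        Command::ProcessFromMessageEvent {
            queue_name: "orchestration_step_results".into(),
            msg_id: Some(42),
        }
    );
    assert_eq!(
        route(MessageNotification::Message(vec![1, 2, 3])),
        Command::ProcessFromMessage(vec![1, 2, 3])
    );
}
```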
Type-Safe Channel Wrappers
NewType wrappers for MPSC channels prevent accidental misuse:
Orchestration Channels
Location: tasker-orchestration/src/orchestration/channels.rs
#![allow(unused)]
fn main() {
/// Strongly-typed sender for orchestration commands
#[derive(Debug, Clone)]
pub struct OrchestrationCommandSender(pub(crate) mpsc::Sender<OrchestrationCommand>);
/// Strongly-typed receiver for orchestration commands
#[derive(Debug)]
pub struct OrchestrationCommandReceiver(pub(crate) mpsc::Receiver<OrchestrationCommand>);
/// Strongly-typed sender for orchestration notifications
#[derive(Debug, Clone)]
pub struct OrchestrationNotificationSender(pub(crate) mpsc::Sender<OrchestrationNotification>);
/// Strongly-typed receiver for orchestration notifications
#[derive(Debug)]
pub struct OrchestrationNotificationReceiver(pub(crate) mpsc::Receiver<OrchestrationNotification>);
}
Worker Channels
Location: tasker-worker/src/worker/channels.rs
#![allow(unused)]
fn main() {
/// Strongly-typed sender for worker commands
#[derive(Debug, Clone)]
pub struct WorkerCommandSender(pub(crate) mpsc::Sender<WorkerCommand>);
/// Strongly-typed receiver for worker commands
#[derive(Debug)]
pub struct WorkerCommandReceiver(pub(crate) mpsc::Receiver<WorkerCommand>);
}
Channel Factory
#![allow(unused)]
fn main() {
pub struct ChannelFactory;
impl ChannelFactory {
/// Create type-safe orchestration command channel pair
pub fn orchestration_command_channel(buffer_size: usize)
-> (OrchestrationCommandSender, OrchestrationCommandReceiver)
{
let (tx, rx) = mpsc::channel(buffer_size);
(OrchestrationCommandSender(tx), OrchestrationCommandReceiver(rx))
}
}
}
Benefits:
- Compile-time prevention of channel misuse
- Self-documenting function signatures
- Zero runtime overhead (NewTypes compile away)
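The misuse-prevention property can be demonstrated with `std::sync::mpsc` in place of tokio's channels; this is a self-contained illustration, not the actual channel types.

```rust
use std::sync::mpsc;

/// NewType wrappers around otherwise-identical channel halves. Because
/// OrchestrationCommandSender and WorkerCommandSender are distinct
/// types, passing one where the other is expected is a compile error,
/// even though both wrap the same underlying sender type.
struct OrchestrationCommandSender(mpsc::Sender<String>);

#[allow(dead_code)]
struct WorkerCommandSender(mpsc::Sender<String>);

/// Only accepts the orchestration-typed sender.
fn dispatch_orchestration(tx: &OrchestrationCommandSender, cmd: &str) {
    tx.0.send(cmd.to_string()).unwrap();
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let orch_tx = OrchestrationCommandSender(tx);
    dispatch_orchestration(&orch_tx, "ProcessStepResult");
    // dispatch_orchestration(&WorkerCommandSender(..)) would not compile.
    assert_eq!(rx.recv().unwrap(), "ProcessStepResult");
}
```

Since the wrappers have no fields beyond the inner sender, they compile away entirely; the safety is purely at type-check time.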
Configuration
Provider Selection
# config/dotenv/test.env
# Valid values: pgmq (default), rabbitmq
TASKER_MESSAGING_BACKEND=pgmq
# RabbitMQ connection (only used when backend=rabbitmq)
RABBITMQ_URL=amqp://tasker:tasker@localhost:5672/%2F
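Resolving the backend setting might look like the following sketch; `resolve_backend` is a hypothetical helper for illustration, not the engine's actual config loader.

```rust
#[derive(Debug, PartialEq)]
enum Backend { Pgmq, RabbitMq }

/// Resolve TASKER_MESSAGING_BACKEND: PGMQ is the default when the
/// variable is unset, and unrecognized values are rejected rather
/// than silently falling back.
fn resolve_backend(value: Option<&str>) -> Result<Backend, String> {
    match value {
        None | Some("pgmq") => Ok(Backend::Pgmq),
        Some("rabbitmq") => Ok(Backend::RabbitMq),
        Some(other) => Err(format!("unknown messaging backend: {other}")),
    }
}

fn main() {
    // In the real system the value would come from std::env::var.
    assert_eq!(resolve_backend(None), Ok(Backend::Pgmq));
    assert_eq!(resolve_backend(Some("rabbitmq")), Ok(Backend::RabbitMq));
    assert!(resolve_backend(Some("kafka")).is_err());
}
```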
Provider-Specific Settings
# config/tasker/base/common.toml
[pgmq]
visibility_timeout_seconds = 60
max_message_size_bytes = 1048576
batch_size = 100
[rabbitmq]
prefetch_count = 100
connection_timeout_seconds = 30
heartbeat_seconds = 60
Migration Guide
Switching from PGMQ to RabbitMQ
1. Deploy RabbitMQ infrastructure
2. Update configuration:
   export TASKER_MESSAGING_BACKEND=rabbitmq
   export RABBITMQ_URL=amqp://user:pass@rabbitmq:5672/%2F
3. Restart services - no code changes required
Gradual Migration
For zero-downtime migration:
- Deploy new services with RabbitMQ configuration
- Gradually shift traffic to new services
- Monitor for any issues
- Decommission PGMQ-based services
Testing
Provider-Agnostic Tests
Most tests should use InMemoryMessagingService for speed:
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_step_execution() {
let provider = MessagingProvider::InMemory(InMemoryMessagingService::new());
let client = MessageClient::new(Arc::new(provider));
// Test with in-memory provider
client.send_step_message("payments", &step_msg).await.unwrap();
}
}
Provider-Specific Tests
For integration tests that need specific provider behavior:
#![allow(unused)]
fn main() {
#[tokio::test]
#[cfg(feature = "integration-tests")]
async fn test_pgmq_notifications() {
let provider = MessagingProvider::Pgmq(PgmqMessagingService::new(pool).await?);
// Test PGMQ-specific behavior
}
}
Best Practices
1. Use MessageClient for Application Code
#![allow(unused)]
fn main() {
// Good: Use domain-level facade
let client = context.message_client();
client.send_step_result(&result).await?;
// Avoid: Direct provider access unless necessary
let provider = context.messaging_provider();
provider.send_message("queue", &payload).await?;
}
2. Handle Both Notification Types
#![allow(unused)]
fn main() {
match notification {
MessageNotification::Available { queue_name, msg_id } => {
// Signal-only: need to fetch message
}
MessageNotification::Message(msg) => {
// Full message: can process immediately
}
}
}
3. Respect Provider Capabilities
#![allow(unused)]
fn main() {
if provider.supports_fetch_by_message_id() {
// Can use read_specific_message()
} else {
// Must use alternative approach
}
}
4. Configure Fallback Appropriately
#![allow(unused)]
fn main() {
if provider.requires_fallback_polling() {
// Start fallback poller for reliability
}
}
Related Documentation
- Events and Commands - Command pattern details
- Deployment Patterns - Deployment modes and configuration
- Worker Event Systems - Worker event architecture
- Crate Architecture - Workspace structure
← Back to Documentation Hub
Next: Events and Commands | Deployment Patterns
States and Lifecycles
Last Updated: 2025-10-10 Audience: All Status: Active Related Docs: Documentation Hub | Events and Commands | Task Readiness & Execution
← Back to Documentation Hub
This document provides comprehensive documentation of the state machine architecture in tasker-core, covering both task and workflow step lifecycles, their state transitions, and the underlying persistence mechanisms.
Overview
The tasker-core system implements a sophisticated dual-state-machine architecture:
- Task State Machine: Manages overall workflow orchestration with 12 comprehensive states
- Workflow Step State Machine: Manages individual step execution with 10 states including orchestration queuing
Both state machines work in coordination to provide atomic, auditable, and resilient workflow execution with proper event-driven communication between orchestration and worker systems.
Task State Machine Architecture
Task State Definitions
The task state machine implements 12 comprehensive states as defined in tasker-shared/src/state_machine/states.rs:
Initial States
- `Pending`: Created but not started (default initial state)
- `Initializing`: Discovering initial ready steps and setting up task context
Active Processing States
- `EnqueuingSteps`: Actively enqueuing ready steps to worker queues
- `StepsInProcess`: Steps are being processed by workers (orchestration monitoring)
- `EvaluatingResults`: Processing results from completed steps and determining next actions
Waiting States
- `WaitingForDependencies`: No ready steps, waiting for dependencies to be satisfied
- `WaitingForRetry`: Waiting for retry timeout before attempting failed steps again
- `BlockedByFailures`: Has failures that prevent progress (manual intervention may be needed)
Terminal States
- `Complete`: All steps completed successfully (terminal)
- `Error`: Task failed permanently (terminal)
- `Cancelled`: Task was cancelled (terminal)
- `ResolvedManually`: Manually resolved by operator (terminal)
Task State Properties
Each state has key properties that drive system behavior:
#![allow(unused)]
fn main() {
impl TaskState {
pub fn is_terminal(&self) -> bool // Cannot transition further
pub fn requires_ownership(&self) -> bool // Processor ownership required
pub fn is_active(&self) -> bool // Currently being processed
pub fn is_waiting(&self) -> bool // Waiting for external conditions
pub fn can_be_processed(&self) -> bool // Available for orchestration pickup
}
}
Active States (processor UUID tracked for audit): Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults
Processable States: Pending, WaitingForDependencies, WaitingForRetry
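Based on the groupings above, the property methods might be implemented roughly like this; a sketch consistent with this document, not the actual `tasker-shared` source.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskState {
    Pending, Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults,
    WaitingForDependencies, WaitingForRetry, BlockedByFailures,
    Complete, Error, Cancelled, ResolvedManually,
}

impl TaskState {
    /// Terminal states cannot transition further.
    fn is_terminal(self) -> bool {
        matches!(self, TaskState::Complete | TaskState::Error
            | TaskState::Cancelled | TaskState::ResolvedManually)
    }

    /// Active states: processor UUID is tracked for audit.
    fn is_active(self) -> bool {
        matches!(self, TaskState::Initializing | TaskState::EnqueuingSteps
            | TaskState::StepsInProcess | TaskState::EvaluatingResults)
    }

    /// Processable states: available for orchestration pickup.
    fn can_be_processed(self) -> bool {
        matches!(self, TaskState::Pending
            | TaskState::WaitingForDependencies | TaskState::WaitingForRetry)
    }
}

fn main() {
    assert!(TaskState::Complete.is_terminal());
    assert!(TaskState::EnqueuingSteps.is_active());
    assert!(TaskState::WaitingForRetry.can_be_processed());
    assert!(!TaskState::StepsInProcess.can_be_processed());
}
```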
Task Lifecycle Flow
stateDiagram-v2
[*] --> Pending
%% Initial Flow
Pending --> Initializing : Start
%% From Initializing
Initializing --> EnqueuingSteps : ReadyStepsFound(count)
Initializing --> Complete : NoStepsFound
Initializing --> WaitingForDependencies : NoDependenciesReady
%% Processing Flow
EnqueuingSteps --> StepsInProcess : StepsEnqueued(uuids)
EnqueuingSteps --> Error : EnqueueFailed(error)
StepsInProcess --> EvaluatingResults : AllStepsCompleted
StepsInProcess --> EvaluatingResults : StepCompleted(uuid)
StepsInProcess --> WaitingForRetry : StepFailed(uuid)
%% Result Evaluation
EvaluatingResults --> Complete : AllStepsSuccessful
EvaluatingResults --> EnqueuingSteps : ReadyStepsFound(count)
EvaluatingResults --> WaitingForDependencies : NoDependenciesReady
EvaluatingResults --> BlockedByFailures : PermanentFailure(error)
%% Waiting States
WaitingForDependencies --> EvaluatingResults : DependenciesReady
WaitingForRetry --> EnqueuingSteps : RetryReady
%% Problem Resolution
BlockedByFailures --> Error : GiveUp
BlockedByFailures --> ResolvedManually : ManualResolution
%% Cancellation (from any non-terminal state)
Pending --> Cancelled : Cancel
Initializing --> Cancelled : Cancel
EnqueuingSteps --> Cancelled : Cancel
StepsInProcess --> Cancelled : Cancel
EvaluatingResults --> Cancelled : Cancel
WaitingForDependencies --> Cancelled : Cancel
WaitingForRetry --> Cancelled : Cancel
BlockedByFailures --> Cancelled : Cancel
%% Legacy Support
Error --> Pending : Reset
%% Terminal States
Complete --> [*]
Error --> [*]
Cancelled --> [*]
ResolvedManually --> [*]
Task Event System
Task state transitions are driven by events defined in tasker-shared/src/state_machine/events.rs:
Lifecycle Events
- `Start`: Begin task processing
- `Cancel`: Cancel task execution
- `GiveUp`: Abandon task (BlockedByFailures -> Error)
- `ManualResolution`: Manually resolve task
Discovery Events
- `ReadyStepsFound(count)`: Ready steps discovered during initialization/evaluation
- `NoStepsFound`: No steps defined - task can complete immediately
- `NoDependenciesReady`: Dependencies not satisfied - wait required
- `DependenciesReady`: Dependencies now ready - can proceed
Processing Events
- `StepsEnqueued(Vec<Uuid>)`: Steps successfully queued for workers
- `EnqueueFailed(error)`: Failed to enqueue steps
- `StepCompleted(uuid)`: Individual step completed
- `StepFailed(uuid)`: Individual step failed
- `AllStepsCompleted`: All current batch steps finished
- `AllStepsSuccessful`: All steps completed successfully
System Events
- `PermanentFailure(error)`: Unrecoverable failure
- `RetryReady`: Retry timeout expired
- `Timeout`: Operation timeout occurred
- `ProcessorCrashed`: Processor became unavailable
Processor Ownership (Audit-Only Mode)
The task state machine tracks processor UUID for audit trail and debugging purposes, but does not enforce processor ownership. The requires_ownership() method returns false for all states.
History: The original TAS-41 design enforced processor ownership on active states (Initializing, EnqueuingSteps, StepsInProcess, EvaluatingResults), blocking different orchestrators from taking over tasks mid-processing. TAS-54 removed this enforcement because it prevented crash recovery – if an orchestrator crashed while processing a task, no other orchestrator could pick it up.
Current Behavior:
- Processor UUID is still recorded in the `tasker.task_transitions.processor_uuid` column on each transition for audit trail and debugging
- Any orchestrator instance can process any task regardless of which instance previously owned it
- Idempotency is guaranteed through three mechanisms: state machine guards (which validate legal transitions), transaction atomicity (single database transaction per state change), and atomic claiming (database constraints prevent duplicate claims)
- Crash recovery is straightforward: if an orchestrator goes down, another instance discovers and resumes its in-progress tasks without ownership conflicts
Workflow Step State Machine Architecture
Step State Definitions
The workflow step state machine implements 9 states for individual step execution:
Processing Pipeline States
- `Pending`: Initial state when step is created
- `Enqueued`: Queued for processing but not yet claimed by worker
- `InProgress`: Currently being executed by a worker
- `EnqueuedForOrchestration`: Worker completed, queued for orchestration processing
- `EnqueuedAsErrorForOrchestration`: Worker failed, queued for orchestration error processing
Waiting States
- `WaitingForRetry`: Step failed with a retryable error, waiting for the backoff period before retry
Terminal States
- `Complete`: Step completed successfully (after orchestration processing)
- `Error`: Step failed permanently (non-retryable or max retries exceeded)
- `Cancelled`: Step was cancelled
- `ResolvedManually`: Step was manually resolved by operator
State Machine Evolution
Previously, the Error state was used for both retryable and permanent failures. The introduction of WaitingForRetry created a semantic change:
- Before: `Error` = any failure (retryable or permanent)
- After: `Error` = permanent failure only, `WaitingForRetry` = retryable failure awaiting backoff
This change required updates to:
- `get_step_readiness_status()` to recognize `WaitingForRetry` as a ready-eligible state
- `get_task_execution_context()` to properly detect blocked vs recovering tasks
- Error classification logic to distinguish permanent from retryable errors
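The retryable-vs-permanent split can be sketched as a small decision function. The names here (`FailureOutcome`, `next_state_on_failure`) are hypothetical, not the actual error classification API:

```rust
/// Illustrative sketch of the Error vs WaitingForRetry split described above.
#[derive(Debug, PartialEq)]
pub enum FailureOutcome {
    WaitingForRetry, // retryable error, backoff pending
    Error,           // permanent failure
}

pub fn next_state_on_failure(retryable: bool, attempts: u32, max_retries: u32) -> FailureOutcome {
    if retryable && attempts < max_retries {
        FailureOutcome::WaitingForRetry
    } else {
        // Non-retryable errors and exhausted retries are both permanent.
        FailureOutcome::Error
    }
}
```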
Step State Properties
impl WorkflowStepState {
    pub fn is_terminal(&self) -> bool               // No further transitions
    pub fn is_error(&self) -> bool                  // In error state (may allow retry)
    pub fn is_active(&self) -> bool                 // Being processed by worker
    pub fn is_in_processing_pipeline(&self) -> bool // In execution pipeline
    pub fn is_ready_for_claiming(&self) -> bool     // Available for worker claim
    pub fn satisfies_dependencies(&self) -> bool    // Can satisfy other step dependencies
}
Step Lifecycle Flow
stateDiagram-v2
[*] --> Pending
%% Main Execution Path
Pending --> Enqueued : Enqueue
Enqueued --> InProgress : Start (worker claims)
InProgress --> EnqueuedForOrchestration : EnqueueForOrchestration(success)
EnqueuedForOrchestration --> Complete : Complete(results) [orchestration]
%% Error Handling Path
InProgress --> EnqueuedAsErrorForOrchestration : EnqueueForOrchestration(error)
EnqueuedAsErrorForOrchestration --> WaitingForRetry : WaitForRetry(error) [retryable]
EnqueuedAsErrorForOrchestration --> Error : Fail(error) [permanent/max retries]
%% Retry Path
WaitingForRetry --> Pending : Retry (after backoff)
%% Legacy Direct Path (deprecated)
InProgress --> Complete : Complete(results) [direct - legacy]
InProgress --> Error : Fail(error) [direct - legacy]
%% Legacy Backward Compatibility
Pending --> InProgress : Start [legacy]
%% Direct Failure Paths (error before worker processing)
Pending --> Error : Fail(error)
Enqueued --> Error : Fail(error)
%% Cancellation Paths
Pending --> Cancelled : Cancel
Enqueued --> Cancelled : Cancel
InProgress --> Cancelled : Cancel
EnqueuedForOrchestration --> Cancelled : Cancel
EnqueuedAsErrorForOrchestration --> Cancelled : Cancel
WaitingForRetry --> Cancelled : Cancel
Error --> Cancelled : Cancel
%% Manual Resolution (from any state)
Pending --> ResolvedManually : ResolveManually
Enqueued --> ResolvedManually : ResolveManually
InProgress --> ResolvedManually : ResolveManually
EnqueuedForOrchestration --> ResolvedManually : ResolveManually
EnqueuedAsErrorForOrchestration --> ResolvedManually : ResolveManually
WaitingForRetry --> ResolvedManually : ResolveManually
Error --> ResolvedManually : ResolveManually
%% Terminal States
Complete --> [*]
Error --> [*]
Cancelled --> [*]
ResolvedManually --> [*]
Step Event System
Step transitions are driven by StepEvent types:
Processing Events
- `Enqueue`: Queue step for worker processing
- `Start`: Begin step execution (worker claims step)
- `EnqueueForOrchestration(results)`: Worker completes, queues for orchestration
- `Complete(results)`: Mark step complete (from orchestration or legacy direct)
- `Fail(error)`: Mark step as permanently failed
- `WaitForRetry(error)`: Mark step for retry after backoff
Control Events
- `Cancel`: Cancel step execution
- `ResolveManually`: Manual operator resolution
- `Retry`: Retry step from `WaitingForRetry` or `Error` state
Step Execution Flow Integration
The step state machine integrates tightly with the task state machine:
1. Task Discovers Ready Steps: `TaskEvent::ReadyStepsFound(count)` -> Task moves to `EnqueuingSteps`
2. Steps Get Enqueued: `StepEvent::Enqueue` -> Steps move to `Enqueued` state
3. Workers Claim Steps: `StepEvent::Start` -> Steps move to `InProgress`
4. Workers Complete Steps: `StepEvent::EnqueueForOrchestration(results)` -> Steps move to `EnqueuedForOrchestration`
5. Orchestration Processes Results: `StepEvent::Complete(results)` -> Steps move to `Complete`
6. Task Evaluates Progress: `TaskEvent::StepCompleted(uuid)` -> Task moves to `EvaluatingResults`
7. Task Completes or Continues: Based on remaining steps -> Task moves to `Complete` or back to `EnqueuingSteps`
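The loop above can be sketched as a toy transition function. State names mirror the document, but the logic is a simplification of the real guard-checked state machine:

```rust
/// Toy model of the happy-path task loop described above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TaskState {
    EnqueuingSteps,
    StepsInProcess,
    EvaluatingResults,
    Complete,
}

/// Advance one stage of the loop; `remaining_steps` stands in for the
/// task's view of not-yet-complete steps.
pub fn advance(state: TaskState, remaining_steps: usize) -> TaskState {
    match state {
        TaskState::EnqueuingSteps => TaskState::StepsInProcess,
        TaskState::StepsInProcess => TaskState::EvaluatingResults,
        TaskState::EvaluatingResults if remaining_steps == 0 => TaskState::Complete,
        TaskState::EvaluatingResults => TaskState::EnqueuingSteps, // more batches to run
        TaskState::Complete => TaskState::Complete,
    }
}
```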
Guard Conditions and Validation
Both state machines implement comprehensive guard conditions in tasker-shared/src/state_machine/guards.rs:
Task Guards
TransitionGuard
- Validates all task state transitions
- Prevents invalid state combinations
- Enforces terminal state immutability
- Supports legacy transition compatibility
Processor Tracking
- Records processor UUID in transitions for audit trail (TAS-54: not enforced)
- Concurrent processing prevented by state machine guards and transaction atomicity
- Any orchestrator can resume stuck tasks after a crash
Step Guards
StepDependenciesMetGuard
- Validates all step dependencies are satisfied
- Delegates to `WorkflowStep::dependencies_met()`
- Prevents premature step execution
StepNotInProgressGuard
- Ensures step is not already being processed
- Prevents duplicate worker claims
- Validates step availability
Retry Guards
StepCanBeRetriedGuard
- Validates step is in `Error` state
- Checks retry limits and conditions
- Prevents infinite retry loops
Orchestration Guards
- `StepCanBeEnqueuedForOrchestrationGuard`: Step must be `InProgress`
- `StepCanBeCompletedFromOrchestrationGuard`: Step must be `EnqueuedForOrchestration`
- `StepCanBeFailedFromOrchestrationGuard`: Step must be `EnqueuedForOrchestration`
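Expressed as plain predicates over a simplified state enum, these guards might look like the following; the real guards in tasker-shared/src/state_machine/guards.rs also consult the database and richer context:

```rust
/// Sketch of the orchestration guard conditions listed above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum StepState {
    Pending,
    Enqueued,
    InProgress,
    EnqueuedForOrchestration,
    Complete,
    Error,
}

pub fn can_enqueue_for_orchestration(s: StepState) -> bool {
    s == StepState::InProgress
}

pub fn can_complete_from_orchestration(s: StepState) -> bool {
    s == StepState::EnqueuedForOrchestration
}

pub fn can_fail_from_orchestration(s: StepState) -> bool {
    s == StepState::EnqueuedForOrchestration
}
```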
Persistence Layer Architecture
Delegation Pattern
The persistence layer in tasker-shared/src/state_machine/persistence.rs implements a delegation pattern to the model layer:
// TaskTransitionPersistence -> TaskTransition::create() & TaskTransition::get_current()
// StepTransitionPersistence -> WorkflowStepTransition::create() & WorkflowStepTransition::get_current()
Benefits:
- No SQL duplication between state machine and models
- Atomic transaction handling in models
- Single source of truth for database operations
- Independent testability of model methods
Transition Storage
Task Transitions (tasker.task_transitions)
CREATE TABLE tasker.task_transitions (
task_transition_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
task_uuid UUID NOT NULL,
to_state VARCHAR NOT NULL,
from_state VARCHAR,
processor_uuid UUID, -- Ownership tracking
metadata JSONB,
sort_key INTEGER NOT NULL,
most_recent BOOLEAN DEFAULT false,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Step Transitions (tasker.workflow_step_transitions)
CREATE TABLE tasker.workflow_step_transitions (
workflow_step_transition_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
workflow_step_uuid UUID NOT NULL,
to_state VARCHAR NOT NULL,
from_state VARCHAR,
metadata JSONB,
sort_key INTEGER NOT NULL,
most_recent BOOLEAN DEFAULT false,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Current State Resolution
Both transition models implement efficient current state resolution:
// O(1) current state lookup using most_recent flag
TaskTransition::get_current(pool, task_uuid) -> Option<TaskTransition>
WorkflowStepTransition::get_current(pool, step_uuid) -> Option<WorkflowStepTransition>
Performance Optimization:
- `most_recent = true` flag on latest transition only
- Indexed queries: `(task_uuid, most_recent) WHERE most_recent = true`
- Atomic flag updates during transition creation
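The flag invariant can be illustrated with an in-memory model. In Tasker the clear-and-set happens inside a single database transaction, so the `Vec` here only models the invariant, not the persistence:

```rust
/// In-memory illustration of the most_recent flag invariant: appending a
/// transition clears the flag on the previous latest row and sets it on the
/// new one, so exactly one row per entity carries most_recent = true.
#[derive(Debug)]
pub struct Transition {
    pub to_state: String,
    pub sort_key: i32,
    pub most_recent: bool,
}

pub fn append_transition(log: &mut Vec<Transition>, to_state: &str) {
    if let Some(last) = log.last_mut() {
        last.most_recent = false; // clear flag on prior latest row
    }
    let sort_key = log.len() as i32;
    log.push(Transition {
        to_state: to_state.to_string(),
        sort_key,
        most_recent: true,
    });
}

/// Analogue of the O(1) lookup: find the single flagged row.
pub fn current(log: &[Transition]) -> Option<&Transition> {
    log.iter().find(|t| t.most_recent)
}
```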
Atomic Transitions with Ownership
The persistence layer exposes an atomic transition method that records processor ownership:
impl TaskTransitionPersistence {
    pub async fn transition_with_ownership(
        &self,
        task_uuid: Uuid,
        from_state: TaskState,
        to_state: TaskState,
        processor_uuid: Uuid,
        metadata: Option<Value>,
        pool: &PgPool,
    ) -> PersistenceResult<bool>
}
Atomicity Guarantees:
- Single database transaction for state change
- Processor UUID stored in dedicated column
- `most_recent` flag updated atomically
- Race condition prevention through database constraints
Action System
Both state machines execute actions after successful transitions:
Task Actions
- PublishTransitionEventAction: Publishes task state change events
- UpdateTaskCompletionAction: Updates task completion status
- ErrorStateCleanupAction: Performs error state cleanup
Step Actions
- PublishTransitionEventAction: Publishes step state change events
- UpdateStepResultsAction: Updates step results and execution data
- TriggerStepDiscoveryAction: Triggers task-level step discovery
- ErrorStateCleanupAction: Performs step error cleanup
Actions execute sequentially after transition persistence, ensuring consistency.
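Sequential action execution after a persisted transition might be sketched like this. The trait shape is illustrative; the real actions are async and receive transition context:

```rust
/// Sketch of sequential post-transition action execution. Action names
/// mirror the document; bodies here just record their invocation order.
pub trait TransitionAction {
    fn name(&self) -> &'static str;
    fn execute(&self, log: &mut Vec<String>);
}

pub struct PublishTransitionEventAction;
impl TransitionAction for PublishTransitionEventAction {
    fn name(&self) -> &'static str { "PublishTransitionEventAction" }
    fn execute(&self, log: &mut Vec<String>) { log.push(self.name().to_string()); }
}

pub struct UpdateTaskCompletionAction;
impl TransitionAction for UpdateTaskCompletionAction {
    fn name(&self) -> &'static str { "UpdateTaskCompletionAction" }
    fn execute(&self, log: &mut Vec<String>) { log.push(self.name().to_string()); }
}

/// Actions run one after another, only after the transition is persisted.
pub fn run_actions(actions: &[Box<dyn TransitionAction>], log: &mut Vec<String>) {
    for action in actions {
        action.execute(log);
    }
}
```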
State Machine Integration Points
Task <-> Step Coordination
- Step Discovery: Task initialization discovers ready steps
- Step Enqueueing: Task enqueues discovered steps to worker queues
- Progress Monitoring: Task monitors step completion via events
- Result Processing: Task processes step results and discovers next steps
- Completion Detection: Task completes when all steps are complete
Event-Driven Communication
- pg_notify: PostgreSQL notifications for real-time coordination
- Event Publishers: Publish state transition events to event system
- Event Subscribers: React to state changes across system boundaries
- Queue Integration: Provider-agnostic message queues (PGMQ or RabbitMQ) for worker communication
Worker Integration
- Step Claiming: Workers claim `Enqueued` steps from queues
- Progress Updates: Workers transition steps to `InProgress`
- Result Submission: Workers submit results via `EnqueueForOrchestration`
- Orchestration Processing: Orchestration processes results and completes steps
This state machine architecture provides the foundation for reliable, auditable, and scalable workflow orchestration in the tasker-core system.
Step Result Audit System
The step result audit system provides SOC2-compliant audit trails for workflow step execution results, enabling complete attribution tracking for compliance and debugging.
Audit Table Design
The tasker.workflow_step_result_audit table stores lightweight references with attribution data:
CREATE TABLE tasker.workflow_step_result_audit (
workflow_step_result_audit_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
workflow_step_uuid UUID NOT NULL REFERENCES tasker.workflow_steps,
workflow_step_transition_uuid UUID NOT NULL REFERENCES tasker.workflow_step_transitions,
task_uuid UUID NOT NULL REFERENCES tasker.tasks,
recorded_at TIMESTAMP NOT NULL DEFAULT NOW(),
-- Attribution (NEW data not in transitions)
worker_uuid UUID,
correlation_id UUID,
-- Extracted scalars for indexing/filtering
success BOOLEAN NOT NULL,
execution_time_ms BIGINT,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE (workflow_step_uuid, workflow_step_transition_uuid)
);
Design Principles
- No Data Duplication: Full execution results already exist in `tasker.workflow_step_transitions.metadata`. The audit table stores references only.
- Attribution Capture: The audit system captures new attribution data:
  - `worker_uuid`: Which worker instance processed the step
  - `correlation_id`: Distributed tracing identifier for request correlation
- Indexed Scalars: Success and execution time are extracted for efficient filtering without JSON parsing.
- SQL Trigger: A database trigger (`trg_step_result_audit`) guarantees audit record creation when workers persist results, ensuring SOC2 compliance.
Attribution Flow
Attribution data flows through the system via TransitionContext:
// Worker creates attribution context
let context = TransitionContext::with_worker(
    worker_uuid,
    Some(correlation_id),
);

// Context is merged into transition metadata
state_machine.transition_with_context(event, Some(context)).await?;

// The SQL trigger then extracts attribution from the metadata:
//   v_worker_uuid := (NEW.metadata->>'worker_uuid')::UUID;
//   v_correlation_id := (NEW.metadata->>'correlation_id')::UUID;
Trigger Behavior
The create_step_result_audit trigger fires on transitions to:
- `enqueued_for_orchestration`: Successful step completion
- `enqueued_as_error_for_orchestration`: Failed step completion
These states represent when workers persist execution results, creating the audit trail.
Querying Audit History
Via API
GET /v1/tasks/{task_uuid}/workflow_steps/{step_uuid}/audit
Returns audit records with full transition details via JOIN, ordered by recorded_at DESC.
Via Client
let audit_history = client.get_step_audit_history(task_uuid, step_uuid).await?;
for record in audit_history {
    println!("Worker: {:?}, Success: {}, Time: {:?}ms",
        record.worker_uuid,
        record.success,
        record.execution_time_ms
    );
}
Via Model
// Get audit history for a step with full transition details
let history = WorkflowStepResultAudit::get_audit_history(&pool, step_uuid).await?;

// Get all audit records for a task
let task_history = WorkflowStepResultAudit::get_task_audit_history(&pool, task_uuid).await?;

// Query by worker for attribution investigation
let worker_records = WorkflowStepResultAudit::get_by_worker(&pool, worker_uuid, Some(100)).await?;

// Query by correlation ID for distributed tracing
let correlated = WorkflowStepResultAudit::get_by_correlation_id(&pool, correlation_id).await?;
Indexes for Common Query Patterns
The audit table includes optimized indexes:
- `idx_audit_step_uuid`: Primary query - get audit history for a step
- `idx_audit_task_uuid`: Get all audit records for a task
- `idx_audit_recorded_at`: Time-range queries for SOC2 audit reports
- `idx_audit_worker_uuid`: Attribution investigation (partial index)
- `idx_audit_correlation_id`: Distributed tracing queries (partial index)
- `idx_audit_success`: Success/failure filtering
Historical Data
The migration includes a backfill for existing transitions. Historical records will have NULL attribution (worker_uuid, correlation_id) since that data wasn’t captured before the audit system was introduced.
Tasker CLI Architecture
Audience: Developers, Contributors Status: Active Related Docs: Crate Architecture | Configuration Management
Overview
tasker-ctl is the primary command-line interface for the Tasker orchestration system. It serves two roles:
- Operator tool — manage tasks, monitor workers, investigate DLQ entries, validate configuration, and generate documentation against running Tasker services.
- Developer tool — discover CLI plugins, inspect templates, and generate project scaffolding from community-contributed templates.
The CLI is built as a single Rust binary with no runtime dependencies beyond the Tasker services it connects to (for operator commands) or the filesystem (for plugin/template/config commands).
Module Structure
tasker-ctl/src/
├── main.rs # CLI definition (Clap derive), arg parsing, command dispatch
├── output/ # Styled terminal output (anstream/anstyle)
│ └── mod.rs
├── commands/ # Command handlers (one file per command group)
│ ├── mod.rs
│ ├── task.rs # Task CRUD, step operations, audit trail
│ ├── worker.rs # Worker listing, status, health checks
│ ├── system.rs # Cross-service health aggregation
│ ├── config.rs # Config generate, validate, explain, dump, analyze
│ ├── dlq.rs # Dead letter queue investigation
│ ├── auth.rs # JWT key generation, token creation/validation
│ ├── docs.rs # Configuration documentation generation
│ ├── plugin.rs # Plugin discovery and validation
│ ├── template.rs # Template listing, info, and code generation
│ ├── remote.rs # Remote repository management (add, remove, update, list)
│ └── init.rs # Bootstrap .tasker-ctl.toml with sensible defaults
├── cli_config/ # CLI-specific config (.tasker-ctl.toml)
│ ├── mod.rs
│ └── loader.rs
├── remotes/ # Remote git repository fetching and caching
│ ├── mod.rs
│ └── cache.rs # Git clone/fetch operations, cache directory management
├── plugins/ # Plugin discovery and registry
│ ├── mod.rs
│ ├── manifest.rs # Parse tasker-plugin.toml manifests
│ ├── discovery.rs # Filesystem scanning for plugins
│ └── registry.rs # Plugin registry with filtering
├── template_engine/ # Runtime template rendering (Tera)
│ ├── mod.rs
│ ├── metadata.rs # Parse template.toml definitions
│ ├── engine.rs # Tera wrapper with custom filters
│ ├── loader.rs # Load .tera template files
│ └── filters.rs # Case conversion filters (snake, pascal, camel, kebab)
├── docs/ # Askama compile-time templates for docs generation
│ ├── mod.rs
│ └── templates.rs
└── templates/ # Askama template files (.md.jinja, .toml.jinja, .txt.jinja)
Key Subsystems
Command Dispatch
The CLI uses Clap’s derive API to define a two-level command hierarchy (tasker-ctl <group> <action>). Each command group maps to a handler function in commands/:
Commands enum → match arm → handle_{group}_command() → API client calls or local operations
Commands that interact with Tasker services (task, worker, system, dlq) create API clients from ClientConfig. Commands that operate locally (config, docs, plugin, template) work directly with the filesystem and don’t require running services.
Client Configuration
Two separate configuration systems serve different purposes:
- `ClientConfig` (from `tasker-client`): Server URLs, transport (REST/gRPC), authentication. Loaded via profiles from `.config/tasker-client.toml` with CLI flag and environment variable overrides.
- `CliConfig` (from `cli_config/`): Plugin search paths, default language, default output directory. Loaded from `.tasker-ctl.toml` with project-local and user-global discovery.
Output Styling
The output module provides structured terminal output using anstream and anstyle:
| Function | Purpose | Stream |
|---|---|---|
| `success()` | Green check mark + message | stdout |
| `error()` | Red X + message | stderr |
| `warning()` | Yellow exclamation + message | stderr |
| `header()` | Bold text | stdout |
| `label()` | Bold name + value | stdout |
| `dim()` | Dimmed informational text | stdout |
| `hint()` | Dimmed hint with arrow prefix | stdout |
| `item()` | Bullet point item | stdout |
| `status_icon()` | Green check or red X based on boolean | stdout |
| `plain()` | Unstyled text | stdout |
| `blank()` | Empty line | stdout |
anstream auto-detects terminal capabilities and strips ANSI codes when output is piped. Commands designed for scripting (e.g., config dump, auth generate-token) write raw data to stdout so they remain safe for piping and redirection.
Clap’s built-in help rendering also uses custom styles via clap_styles() for a consistent visual appearance.
Plugin System
The plugin system enables external code to extend tasker-ctl with new templates without modifying the binary.
Discovery (plugins/discovery.rs): Scans configured paths with a three-level search strategy:
1. Check if the path root contains `tasker-plugin.toml`
2. Scan immediate subdirectories for `tasker-plugin.toml`
3. Scan `*/tasker-cli-plugin/` subdirectories (handles the `tasker-contrib` layout)
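A minimal version of this three-level scan can be written with only `std::fs`; the real plugins/discovery.rs also parses and validates the manifests it finds:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Sketch of the three-level discovery strategy: collect every directory
/// under `root` that carries a tasker-plugin.toml, at the root itself, in
/// immediate subdirectories, or in their tasker-cli-plugin/ children.
pub fn discover_plugin_dirs(root: &Path) -> Vec<PathBuf> {
    let mut found = Vec::new();
    // Level 1: the path root itself is a plugin
    if root.join("tasker-plugin.toml").is_file() {
        found.push(root.to_path_buf());
    }
    // Levels 2 and 3: immediate subdirectories and their tasker-cli-plugin/ children
    if let Ok(entries) = fs::read_dir(root) {
        for entry in entries.flatten() {
            let dir = entry.path();
            if !dir.is_dir() {
                continue;
            }
            if dir.join("tasker-plugin.toml").is_file() {
                found.push(dir.clone());
            }
            let contrib_style = dir.join("tasker-cli-plugin");
            if contrib_style.join("tasker-plugin.toml").is_file() {
                found.push(contrib_style);
            }
        }
    }
    found
}
```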
Manifest (plugins/manifest.rs): Each plugin is defined by a tasker-plugin.toml:
[metadata]
name = "rails"
version = "0.1.0"
description = "Rails integration templates"
language = "ruby"
framework = "rails"
[[templates]]
name = "step_handler"
path = "templates/step_handler"
description = "Generate a Tasker step handler"
Registry (plugins/registry.rs): Aggregates discovered plugins and provides lookup by template name, language, or framework.
Template Engine
The template engine renders plugin-provided templates using Tera (runtime evaluation).
Metadata (template_engine/metadata.rs): Each template directory contains a template.toml defining parameters and output files:
[metadata]
name = "step_handler"
description = "Generate a step handler class"
language = "ruby"
[[parameters]]
name = "name"
description = "Handler class name"
required = true
[[output_files]]
path = "{{ name | snake_case }}_handler.rb"
template = "handler.rb.tera"
Engine (template_engine/engine.rs): Wraps Tera with custom case-conversion filters registered at initialization. Renders both output file paths and template content from the same context.
Filters (template_engine/filters.rs): Custom Tera filters via the heck crate:
- `snake_case`: `ProcessPayment` → `process_payment`
- `pascal_case`: `process_payment` → `ProcessPayment`
- `camel_case`: `process_payment` → `processPayment`
- `kebab_case`: `process_payment` → `process-payment`
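For illustration, a hand-rolled `snake_case` roughly equivalent to the heck-backed filter; unlike heck, it only handles simple ASCII CamelCase input:

```rust
/// Simplified snake_case conversion: insert '_' before each interior
/// uppercase letter and lowercase it. Illustrative only; the real filters
/// delegate to the heck crate, which handles acronyms and Unicode.
pub fn snake_case(input: &str) -> String {
    let mut out = String::new();
    for (i, ch) in input.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}
```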
Remote System (TAS-270)
The remote system enables tasker-ctl to fetch plugins and configuration from git repositories, removing the need for local checkouts of community template repositories like tasker-contrib.
Cache (remotes/cache.rs): Manages local clones of remote git repos under ~/.cache/tasker-ctl/remotes/<name>/. Uses git2 for clone and fetch operations. A .tasker-last-fetch timestamp file tracks cache freshness against the configurable cache-max-age-hours threshold.
Configuration (cli_config/mod.rs): Remotes are defined in .tasker-ctl.toml:
[[remotes]]
name = "tasker-contrib"
url = "https://github.com/tasker-systems/tasker-contrib.git"
git-ref = "main"
config-path = "config/tasker/"
Integration: Remote cached paths are transparently injected into the existing plugin discovery and config generation pipelines. The --remote and --url flags on template and config commands select a specific remote or ad-hoc URL. The plugin list command auto-discovers plugins from all configured remotes.
Commands (commands/remote.rs): remote list, remote add, remote remove, remote update manage the configured remotes and their caches.
Init Command
The init command (commands/init.rs) bootstraps a new .tasker-ctl.toml in the current directory with sensible defaults:
tasker-ctl init # Creates config with tasker-contrib remote pre-configured
tasker-ctl init --no-contrib # Creates config without any remotes
The command refuses to overwrite an existing .tasker-ctl.toml to prevent accidental data loss. After creation, it prints next-step hints guiding the user toward fetching remotes and generating templates.
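The overwrite guard can be sketched with `std::fs`; the error type and message here are illustrative, not tasker-ctl's actual output:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of the init overwrite guard: only write the config when no file
/// exists at the target path, otherwise refuse with an error.
pub fn write_config_if_absent(path: &Path, contents: &str) -> io::Result<()> {
    if path.exists() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            ".tasker-ctl.toml already exists; refusing to overwrite",
        ));
    }
    fs::write(path, contents)
}
```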
Documentation Generation
Documentation generation uses a separate template system from plugins. Askama provides compile-time template verification for the built-in documentation templates (configuration reference, annotated configs, parameter explanations). These templates live in templates/ and are bound to Rust structs at compile time via the docs-gen feature flag.
The data source for documentation is DocContextBuilder from tasker-shared, which extracts _docs metadata annotations from the TOML configuration files.
Design Decisions
Two Template Engines
Askama (compile-time) and Tera (runtime) serve different needs:
- Askama renders built-in documentation templates. Compile-time verification catches template errors early, and the templates ship with the binary.
- Tera renders plugin-provided templates. Runtime evaluation is necessary because templates are discovered from the filesystem, not known at compile time.
Both are lightweight and the overlap is intentional — they solve different problems at different phases of the tool’s lifecycle.
Plugin Discovery vs. Package Manager
Plugins are discovered by scanning filesystem paths rather than through a package manager. This keeps the system simple and predictable: drop a directory with a tasker-plugin.toml into a configured path and it’s immediately available. No installation step, no version resolution, no network requests.
Cache-as-Local-Path for Remotes
Remote repos are cloned to a local cache directory, then the existing filesystem-based plugin discovery and config generation pipelines operate on the cached path. This avoids adding “remote-aware” logic throughout the codebase — the remote system’s only job is to ensure a local directory exists and is reasonably fresh. Everything downstream sees a regular directory.
Piping-Safe Output
Commands that produce data for scripting (config dump, auth generate-token, docs rendering to stdout) write raw unformatted output via println!. Styled output is reserved for interactive feedback (status messages, errors, progress). This ensures tasker-ctl config dump | jq . and tasker-ctl auth generate-token | pbcopy work correctly.
Dependency Summary
| Dependency | Purpose |
|---|---|
tasker-client | REST/gRPC API client for service communication |
tasker-shared | Shared types, config models, doc context builder |
clap | CLI argument parsing with derive macros |
anstream + anstyle | TTY-aware styled output |
tera | Runtime template rendering for plugins |
heck | Case conversion for template filters |
askama | Compile-time templates for documentation (optional, docs-gen feature) |
git2 | Git clone/fetch for remote repositories |
toml_edit | Format-preserving TOML editing for remote add/remove/init |
rsa + rand | RSA key pair generation for JWT auth |
tokio | Async runtime |
Related Documentation
- Crate Architecture — workspace-level crate overview
- Configuration Management — TOML config structure and environments
- Auth Configuration — JWT and API key setup
Worker Actor-Based Architecture
Last Updated: 2025-12-04 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Actor-Based Architecture | Events and Commands
This document describes the worker actor-based architecture in tasker-worker: how step execution and worker coordination are organized around a lightweight Actor pattern.
Overview
The tasker-worker system implements a lightweight Actor pattern that mirrors the orchestration architecture, providing:
- Actor Abstraction: Worker components encapsulated as actors with clear lifecycle hooks
- Message-Based Communication: Type-safe message handling via the `Handler<M>` trait
- Central Registry: `WorkerActorRegistry` for managing all worker actors
- Service Decomposition: Focused services following the single responsibility principle
- Lock-Free Statistics: AtomicU64 counters for hot-path performance
- Direct Integration: Command processor routes to actors without wrapper layers
This architecture provides consistency between orchestration and worker systems, enabling clearer code organization and improved maintainability.
Implementation Status
Complete: All phases implemented and production-ready
- Phase 1: Core abstractions (traits, registry, lifecycle management)
- Phase 2: Service decomposition from 1575 LOC command_processor.rs
- Phase 3: All 5 primary actors implemented
- Phase 4: Command processor refactored to pure routing (~200 LOC)
- Phase 5: Stateless service design eliminating lock contention
- Cleanup: Lock-free AtomicU64 statistics, shared event system
Current State: Production-ready actor-based worker with 5 actors managing all step execution operations.
Core Concepts
What is a Worker Actor?
In the tasker-worker context, a Worker Actor is an encapsulated step execution component that:
- Manages its own state: Each actor owns its dependencies and configuration
- Processes messages: Responds to typed command messages via the `Handler<M>` trait
- Has lifecycle hooks: Initialization (`started`) and cleanup (`stopped`) methods
- Is isolated: Actors communicate through message passing
- Is thread-safe: All actors are `Send + Sync + 'static`
Why Actors for Workers?
The previous architecture had a monolithic command processor:
// OLD: 1575 LOC monolithic command processor
pub struct WorkerProcessor {
    // All logic mixed together
    // RwLock contention on hot path
    // Two-phase initialization complexity
}
The actor pattern provides:
// NEW: Pure routing command processor (~200 LOC)
impl ActorCommandProcessor {
    async fn handle_command(&self, command: WorkerCommand) -> bool {
        match command {
            WorkerCommand::ExecuteStep { message, queue_name, resp } => {
                let msg = ExecuteStepMessage { message, queue_name };
                let result = self.actors.step_executor_actor.handle(msg).await;
                let _ = resp.send(result);
                true
            }
            // ... pure routing, no business logic
        }
    }
}
Actor vs Service
Services (underlying business logic):
- Encapsulate step execution logic
- Stateless operations on step data
- Direct method invocation
- Examples: `StepExecutorService`, `FFICompletionService`, `WorkerStatusService`
Actors (message-based coordination):
- Wrap services with message-based interface
- Manage service lifecycle
- Asynchronous message handling
- Examples: `StepExecutorActor`, `FFICompletionActor`, `WorkerStatusActor`
The relationship:
pub struct StepExecutorActor {
    context: Arc<SystemContext>,
    service: Arc<StepExecutorService>, // Wraps underlying service
}

#[async_trait]
impl Handler<ExecuteStepMessage> for StepExecutorActor {
    async fn handle(&self, msg: ExecuteStepMessage) -> TaskerResult<bool> {
        // Delegates to stateless service
        self.service.execute_step(msg.message, &msg.queue_name).await
    }
}
Worker Actor Traits
WorkerActor Trait
The base trait for all worker actors, defined in tasker-worker/src/worker/actors/traits.rs:
/// Base trait for all worker actors
///
/// Provides lifecycle management and context access for all actors in the
/// worker system. All actors must implement this trait to participate
/// in the actor registry and lifecycle management.
pub trait WorkerActor: Send + Sync + 'static {
    /// Returns the unique name of this actor
    fn name(&self) -> &'static str;

    /// Returns a reference to the system context
    fn context(&self) -> &Arc<SystemContext>;

    /// Called when the actor is started
    fn started(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor started");
        Ok(())
    }

    /// Called when the actor is stopped
    fn stopped(&mut self) -> TaskerResult<()> {
        tracing::info!(actor = %self.name(), "Actor stopped");
        Ok(())
    }
}
Handler<M> Trait
The message handling trait, enabling type-safe message processing:
/// Message handler trait for specific message types
#[async_trait]
pub trait Handler<M: Message>: WorkerActor {
    /// Handle a message asynchronously
    async fn handle(&self, msg: M) -> TaskerResult<M::Response>;
}
Message Trait
The marker trait for command messages:
/// Marker trait for command messages
pub trait Message: Send + 'static {
    /// The response type for this message
    type Response: Send;
}
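How `Message` and `Handler<M>` compose can be shown with a synchronous toy: a message type declares its `Response`, and the actor's `handle()` is statically typed to it. The real traits are async via `async_trait` and return `TaskerResult`:

```rust
/// Simplified, synchronous analogue of the Message/Handler pair.
pub trait Message {
    type Response;
}

/// A toy message whose response is a static string.
pub struct Ping;
impl Message for Ping {
    type Response = &'static str;
}

pub trait Handler<M: Message> {
    fn handle(&self, msg: M) -> M::Response;
}

/// A toy actor that answers Ping; illustrative, not a real worker actor.
pub struct EchoActor;
impl Handler<Ping> for EchoActor {
    fn handle(&self, _msg: Ping) -> &'static str {
        "pong"
    }
}
```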
WorkerActorRegistry
The central registry managing all worker actors, defined in tasker-worker/src/worker/actors/registry.rs:
Structure
/// Registry managing all worker actors
#[derive(Clone)]
pub struct WorkerActorRegistry {
    /// System context shared by all actors
    context: Arc<SystemContext>,
    /// Worker ID for this registry
    worker_id: String,
    /// Step executor actor for step execution
    pub step_executor_actor: Arc<StepExecutorActor>,
    /// FFI completion actor for handling step completions
    pub ffi_completion_actor: Arc<FFICompletionActor>,
    /// Template cache actor for template management
    pub template_cache_actor: Arc<TemplateCacheActor>,
    /// Domain event actor for event dispatching
    pub domain_event_actor: Arc<DomainEventActor>,
    /// Worker status actor for health and status
    pub worker_status_actor: Arc<WorkerStatusActor>,
}
Initialization
All dependencies required at construction time (no two-phase initialization):
#![allow(unused)]
fn main() {
impl WorkerActorRegistry {
pub async fn build(
context: Arc<SystemContext>,
worker_id: String,
task_template_manager: Arc<TaskTemplateManager>,
event_publisher: WorkerEventPublisher,
domain_event_handle: DomainEventSystemHandle,
) -> TaskerResult<Self> {
// Create actors with all dependencies upfront
let mut step_executor_actor = StepExecutorActor::new(
context.clone(),
worker_id.clone(),
task_template_manager.clone(),
event_publisher,
domain_event_handle,
);
// Call started() lifecycle hook
step_executor_actor.started()?;
// ... create other actors ...
Ok(Self {
context,
worker_id,
step_executor_actor: Arc::new(step_executor_actor),
// ...
})
}
}
}
Implemented Actors
StepExecutorActor
Handles step execution from PGMQ messages and events.
Location: tasker-worker/src/worker/actors/step_executor_actor.rs
Messages:
- `ExecuteStepMessage` - Execute step from raw data
- `ExecuteStepWithCorrelationMessage` - Execute with FFI correlation
- `ExecuteStepFromPgmqMessage` - Execute from PGMQ message
- `ExecuteStepFromEventMessage` - Execute from event notification
Delegation: Wraps StepExecutorService (stateless, no locks)
Purpose: Central coordinator for all step execution, handles claiming, handler invocation, and result construction.
FFICompletionActor
Handles step completion results from FFI handlers.
Location: tasker-worker/src/worker/actors/ffi_completion_actor.rs
Messages:
- `SendStepResultMessage` - Send result to orchestration
- `ProcessStepCompletionMessage` - Process completion with correlation
Delegation: Wraps FFICompletionService
Purpose: Forwards step execution results to orchestration queue, manages correlation for async FFI handlers.
TemplateCacheActor
Manages task template caching and refresh.
Location: tasker-worker/src/worker/actors/template_cache_actor.rs
Messages:
- `RefreshTemplateCacheMessage` - Refresh cache for namespace
Delegation: Wraps TaskTemplateManager
Purpose: Maintains handler template cache for efficient step execution.
DomainEventActor
Dispatches domain events after step completion.
Location: tasker-worker/src/worker/actors/domain_event_actor.rs
Messages:
- `DispatchDomainEventsMessage` - Dispatch events for completed step
Delegation: Wraps DomainEventSystemHandle
Purpose: Fire-and-forget domain event dispatch (never blocks step completion).
WorkerStatusActor
Provides worker health and status reporting.
Location: tasker-worker/src/worker/actors/worker_status_actor.rs
Messages:
- `GetWorkerStatusMessage` - Get current worker status
- `HealthCheckMessage` - Perform health check
- `GetEventStatusMessage` - Get event integration status
- `SetEventIntegrationMessage` - Enable/disable event integration
Features:
- Lock-free statistics via `AtomicStepExecutionStats`
- `AtomicU64` counters for `total_executed`, `total_succeeded`, `total_failed`
- Average execution time computed on read from `sum / count`
Purpose: Real-time health monitoring and statistics without lock contention.
Lock-Free Statistics
The WorkerStatusActor uses atomic counters for lock-free statistics on the hot path:
#![allow(unused)]
fn main() {
/// Lock-free step execution statistics using atomic counters
#[derive(Debug)]
pub struct AtomicStepExecutionStats {
total_executed: AtomicU64,
total_succeeded: AtomicU64,
total_failed: AtomicU64,
total_execution_time_ms: AtomicU64,
}
impl AtomicStepExecutionStats {
/// Record a successful step execution (lock-free)
#[inline]
pub fn record_success(&self, execution_time_ms: u64) {
self.total_executed.fetch_add(1, Ordering::Relaxed);
self.total_succeeded.fetch_add(1, Ordering::Relaxed);
self.total_execution_time_ms.fetch_add(execution_time_ms, Ordering::Relaxed);
}
/// Record a failed step execution (lock-free)
#[inline]
pub fn record_failure(&self) {
self.total_executed.fetch_add(1, Ordering::Relaxed);
self.total_failed.fetch_add(1, Ordering::Relaxed);
}
/// Get a snapshot of current statistics
pub fn snapshot(&self) -> StepExecutionStats {
let total_executed = self.total_executed.load(Ordering::Relaxed);
let total_time = self.total_execution_time_ms.load(Ordering::Relaxed);
let average_execution_time_ms = if total_executed > 0 {
total_time as f64 / total_executed as f64
} else {
0.0
};
StepExecutionStats {
total_executed,
total_succeeded: self.total_succeeded.load(Ordering::Relaxed),
total_failed: self.total_failed.load(Ordering::Relaxed),
average_execution_time_ms,
}
}
}
}
Benefits:
- Zero lock contention on step completion (every step calls `record_success` or `record_failure`)
- Sub-microsecond overhead per operation
- Consistent averages computed from totals
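The same technique can be shown in a standalone form: concurrent writers bump `Relaxed` counters, and the average is derived on read from the two totals. The `Stats` type below mirrors the shape of `AtomicStepExecutionStats` but is purely illustrative:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

#[derive(Default)]
struct Stats {
    total_executed: AtomicU64,
    total_time_ms: AtomicU64,
}

impl Stats {
    fn record(&self, elapsed_ms: u64) {
        // Two independent Relaxed adds: no lock and no ordering requirement,
        // because the snapshot only needs eventually-consistent totals.
        self.total_executed.fetch_add(1, Ordering::Relaxed);
        self.total_time_ms.fetch_add(elapsed_ms, Ordering::Relaxed);
    }

    fn average_ms(&self) -> f64 {
        let n = self.total_executed.load(Ordering::Relaxed);
        if n == 0 {
            return 0.0;
        }
        self.total_time_ms.load(Ordering::Relaxed) as f64 / n as f64
    }
}

fn main() {
    let stats = Arc::new(Stats::default());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let s = Arc::clone(&stats);
            thread::spawn(move || {
                for _ in 0..1000 {
                    s.record(10);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // 4000 executions at 10ms each: the derived average is exactly 10.0
    assert_eq!(stats.average_ms(), 10.0);
}
```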
Integration with Commands
ActorCommandProcessor
The command processor provides pure routing to actors:
#![allow(unused)]
fn main() {
impl ActorCommandProcessor {
async fn handle_command(&self, command: WorkerCommand) -> bool {
match command {
// Step Execution Commands -> StepExecutorActor
WorkerCommand::ExecuteStep { message, queue_name, resp } => {
let msg = ExecuteStepMessage { message, queue_name };
let result = self.actors.step_executor_actor.handle(msg).await;
let _ = resp.send(result);
true
}
// Completion Commands -> FFICompletionActor
WorkerCommand::SendStepResult { result, resp } => {
let msg = SendStepResultMessage { result };
let send_result = self.actors.ffi_completion_actor.handle(msg).await;
let _ = resp.send(send_result);
true
}
// Status Commands -> WorkerStatusActor
WorkerCommand::HealthCheck { resp } => {
let result = self.actors.worker_status_actor.handle(HealthCheckMessage).await;
let _ = resp.send(result);
true
}
WorkerCommand::Shutdown { resp } => {
let _ = resp.send(Ok(()));
false // Exit command loop
}
}
}
}
}
FFI Completion Flow
Domain events are dispatched after successful orchestration notification:
#![allow(unused)]
fn main() {
async fn handle_ffi_completion(&self, step_result: StepExecutionResult) {
// Record stats (lock-free)
if step_result.success {
self.actors.worker_status_actor
.record_success(step_result.metadata.execution_time_ms as f64).await;
} else {
self.actors.worker_status_actor.record_failure().await;
}
// Send to orchestration FIRST
let msg = SendStepResultMessage { result: step_result.clone() };
match self.actors.ffi_completion_actor.handle(msg).await {
Ok(()) => {
// Domain events dispatched AFTER successful orchestration notification
// Fire-and-forget - never blocks the worker
self.actors.step_executor_actor
.dispatch_domain_events(step_result.step_uuid, &step_result, None).await;
}
Err(e) => {
// Don't dispatch domain events - orchestration wasn't notified
tracing::error!(error = %e, "Failed to forward step completion to orchestration");
}
}
}
}
Service Decomposition
The monolithic command processor was decomposed into focused services:
StepExecutorService
services/step_execution/
├── mod.rs # Public API
├── service.rs # StepExecutorService (~250 lines)
├── step_claimer.rs # Step claiming logic
├── handler_invoker.rs # Handler invocation
└── result_builder.rs # Result construction
Key Design: Completely stateless service using &self methods. Wrapped in Arc<StepExecutorService> without any locks.
FFICompletionService
services/ffi_completion/
├── mod.rs # Public API
├── service.rs # FFICompletionService
└── result_sender.rs # Orchestration result sender
WorkerStatusService
services/worker_status/
├── mod.rs # Public API
└── service.rs # WorkerStatusService
Key Architectural Decisions
1. Stateless Services
Services use &self methods with no mutable state:
#![allow(unused)]
fn main() {
impl StepExecutorService {
pub async fn execute_step(
&self, // Immutable reference
message: PgmqMessage<SimpleStepMessage>,
queue_name: &str,
) -> TaskerResult<bool> {
// Stateless execution - no mutable state
}
}
}
Benefits:
- Zero lock contention
- Maximum concurrency per worker
- Simplified reasoning about state
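The payoff of the `&self`-only design can be sketched directly: a single `Arc<Service>` is shared across threads with no `Mutex` or `RwLock` anywhere on the hot path. The `StepExecutorService` below is a stand-in for the real type, with threads standing in for async tasks:

```rust
use std::sync::Arc;
use std::thread;

struct StepExecutorService {
    worker_id: String, // immutable configuration, set at construction
}

impl StepExecutorService {
    // &self only: safe to call concurrently from any number of threads.
    fn execute_step(&self, step_id: u64) -> bool {
        // Real code would claim the step, invoke the handler, build the result.
        step_id > 0 && !self.worker_id.is_empty()
    }
}

fn main() {
    let service = Arc::new(StepExecutorService {
        worker_id: "worker-1".into(),
    });
    // Clone the Arc into each thread; no lock is needed because the
    // service holds no mutable state.
    let handles: Vec<_> = (1..=8)
        .map(|i| {
            let svc = Arc::clone(&service);
            thread::spawn(move || svc.execute_step(i))
        })
        .collect();
    assert!(handles.into_iter().all(|h| h.join().unwrap()));
}
```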
2. Constructor-Based Dependency Injection
All dependencies required at construction time:
#![allow(unused)]
fn main() {
pub async fn new(
context: Arc<SystemContext>,
worker_id: String,
task_template_manager: Arc<TaskTemplateManager>,
event_publisher: WorkerEventPublisher, // Required
domain_event_handle: DomainEventSystemHandle, // Required
) -> TaskerResult<Self>
}
Benefits:
- Compiler enforces complete initialization
- No “partially initialized” states
- Clear dependency graph
3. Shared Event System
Event publisher and subscriber share the same WorkerEventSystem:
#![allow(unused)]
fn main() {
let shared_event_system = event_system
.unwrap_or_else(|| Arc::new(WorkerEventSystem::new()));
let event_publisher =
WorkerEventPublisher::with_event_system(worker_id.clone(), shared_event_system.clone());
// Enable subscriber with same shared system
processor.enable_event_subscriber(Some(shared_event_system)).await;
}
Benefits:
- FFI handlers reliably receive step execution events
- No isolated event systems causing silent failures
4. Graceful Degradation
Domain events never fail step completion:
#![allow(unused)]
fn main() {
// dispatch_domain_events returns () not TaskerResult<()>
// Errors logged but never propagated
pub async fn dispatch_domain_events(
&self,
step_uuid: Uuid,
result: &StepExecutionResult,
metadata: Option<HashMap<String, serde_json::Value>>,
) {
// Fire-and-forget with error logging
// Channel full? Log and continue
// Dispatch error? Log and continue
}
}
Comparison with Orchestration Actors
| Aspect | Orchestration | Worker |
|---|---|---|
| Actor Count | 4 actors | 5 actors |
| Registry | ActorRegistry | WorkerActorRegistry |
| Base Trait | OrchestrationActor | WorkerActor |
| Message Trait | Handler<M> | Handler<M> (same) |
| Service Design | Decomposed | Stateless |
| Statistics | N/A | Lock-free AtomicU64 |
| LOC Reduction | ~800 -> ~200 | 1575 -> ~200 |
Benefits
1. Consistency with Orchestration
Same patterns and traits as orchestration actors:
- Identical `Handler<M>` trait interface
- Similar registry lifecycle management
- Consistent message-based communication
2. Zero Lock Contention
- Stateless services eliminate RwLock on hot path
- AtomicU64 counters for statistics
- Maximum concurrent step execution
3. Type Safety
Messages and responses checked at compile time:
#![allow(unused)]
fn main() {
// Compile error if types don't match
impl Handler<ExecuteStepMessage> for StepExecutorActor {
async fn handle(&self, msg: ExecuteStepMessage) -> TaskerResult<bool> {
// Must return bool, not something else
}
}
}
4. Testability
- Clear message boundaries for mocking
- Isolated actor lifecycle for unit tests
- 119 unit tests, 73 E2E tests passing
5. Maintainability
- 1575 LOC -> ~200 LOC command processor
- Focused services (<300 lines per file)
- Clear separation of concerns
Detailed Analysis
For design rationale, see the Worker Decomposition ADR.
Summary
The worker actor-based architecture provides a consistent, type-safe foundation for step execution in tasker-worker. Key takeaways:
- Mirrors Orchestration: Same patterns as orchestration actors for consistency
- Lock-Free Performance: Stateless services and AtomicU64 counters
- Type Safety: Compile-time verification of message contracts
- Pure Routing: Command processor delegates without business logic
- Graceful Degradation: Domain events never fail step completion
- Production Ready: 119 unit tests, 73 E2E tests, full regression coverage
The architecture provides a solid foundation for high-throughput step execution while maintaining the proven reliability of the orchestration system.
<- Back to Documentation Hub
Worker Event Systems Architecture
Last Updated: 2026-01-15 Audience: Architects, Developers Status: Active Related Docs: Worker Actors | Events and Commands | Messaging Abstraction
This document provides comprehensive documentation of the worker event system architecture in tasker-worker, covering the dual-channel event pattern, domain event publishing, and FFI integration.
Overview
The worker event system implements a dual-channel architecture for non-blocking step execution:
- WorkerEventSystem: Receives step execution events via provider-agnostic subscriptions
- HandlerDispatchService: Fire-and-forget handler invocation with bounded concurrency
- CompletionProcessorService: Routes results back to orchestration
- DomainEventSystem: Fire-and-forget domain event publishing
Messaging Backend Support: The worker event system supports multiple messaging backends (PGMQ, RabbitMQ) through a provider-agnostic abstraction. See Messaging Abstraction for details.
This architecture enables true parallel handler execution while maintaining strict ordering guarantees for domain events.
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER EVENT FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
MessagingProvider (PGMQ or RabbitMQ)
│
│ provider.subscribe_many()
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ WorkerEventSystem │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ WorkerQueueListener │ │ WorkerFallbackPoller │ │
│ │ (provider-agnostic) │ │ (PGMQ only) │ │
│ └──────────┬───────────┘ └──────────┬───────────┘ │
│ │ │ │
│ └───────────┬───────────────┘ │
│ │ │
│ ▼ │
│ MessageNotification::Message → ExecuteStepFromMessage (RabbitMQ) │
│ MessageNotification::Available → ExecuteStepFromEvent (PGMQ) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ActorCommandProcessor │
│ │ │
│ ▼ │
│ StepExecutorActor │
│ │ │
│ │ claim step, send to dispatch channel │
│ ▼ │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────┴─────────────┐
│ │
Rust Workers FFI Workers (Ruby/Python)
│ │
▼ ▼
┌───────────────────────────────┐ ┌───────────────────────────────┐
│ HandlerDispatchService │ │ FfiDispatchChannel │
│ │ │ │
│ dispatch_receiver │ │ pending_events HashMap │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ [Semaphore] N permits │ │ poll_step_events() │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ handler.call() │ │ Ruby/Python handler │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ PostHandlerCallback │ │ complete_step_event() │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ completion_sender │ │ PostHandlerCallback │
│ │ │ │ │
└───────────────┬───────────────┘ │ ▼ │
│ │ completion_sender │
│ │ │
│ └───────────────┬───────────────┘
│ │
└───────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CompletionProcessorService │
│ │ │
│ ▼ │
│ FFICompletionService │
│ │ │
│ ▼ │
│ orchestration_step_results queue │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
Orchestration
Core Components
1. WorkerEventSystem
Location: tasker-worker/src/worker/event_systems/worker_event_system.rs
Implements the EventDrivenSystem trait for worker namespace queue processing. Supports three deployment modes with provider-agnostic message handling:
| Mode | Description | PGMQ Behavior | RabbitMQ Behavior |
|---|---|---|---|
| `PollingOnly` | Traditional polling | Poll PGMQ tables | Poll via `basic_get` |
| `EventDrivenOnly` | Pure push delivery | `pg_notify` signals | `basic_consume` push |
| `Hybrid` | Event-driven + polling | `pg_notify` + fallback | Push only (no fallback) |
Provider-Specific Behavior:
- PGMQ: Uses `MessageNotification::Available` (signal-only), requires fallback polling
- RabbitMQ: Uses `MessageNotification::Message` (full payload), no fallback needed
Key Features:
- Unified configuration via `WorkerEventSystemConfig`
- Atomic statistics with `AtomicU64` counters
- Converts `WorkerNotification` to `WorkerCommand` for processing
#![allow(unused)]
fn main() {
// Worker notification to command conversion (provider-agnostic)
match notification {
// RabbitMQ style - full message delivered
WorkerNotification::Message(msg) => {
command_sender.send(WorkerCommand::ExecuteStepFromMessage {
queue_name: msg.queue_name.clone(),
message: msg,
resp: resp_tx,
}).await;
}
// PGMQ style - signal-only, requires fetch
WorkerNotification::Event(WorkerQueueEvent::StepMessage(msg_event)) => {
command_sender.send(WorkerCommand::ExecuteStepFromEvent {
message_event: msg_event,
resp: resp_tx,
}).await;
}
// ...
}
}
2. HandlerDispatchService
Location: tasker-worker/src/worker/handlers/dispatch_service.rs
Non-blocking handler dispatch with bounded parallelism.
Architecture:
dispatch_receiver → [Semaphore] → handler.call() → [callback] → completion_sender
│ │
└─→ Bounded to N concurrent └─→ Domain events
tasks
Key Design Decisions:
- Semaphore-Bounded Concurrency: Limits concurrent handlers to prevent resource exhaustion
- Permit Release Before Send: Prevents backpressure cascade
- Post-Handler Callback: Domain events fire only after result is committed
#![allow(unused)]
fn main() {
tokio::spawn(async move {
let permit = semaphore.acquire().await?;
let result = execute_with_timeout(&registry, &msg, timeout).await;
// Release permit BEFORE sending to completion channel
drop(permit);
// Send result FIRST
sender.send(result.clone()).await?;
// Callback fires AFTER result is committed
if let Some(cb) = callback {
cb.on_handler_complete(&step, &result, &worker_id).await;
}
});
}
Error Handling:
| Scenario | Behavior |
|---|---|
| Handler timeout | StepExecutionResult::failure() with error_type=handler_timeout |
| Handler panic | Caught via catch_unwind(), failure result generated |
| Handler error | Failure result with error_type=handler_error |
| Semaphore closed | Failure result with error_type=semaphore_acquisition_failed |
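The "handler panic" row can be illustrated with std's `catch_unwind`. This is a synchronous sketch, not the dispatch service's actual code: the real path wraps an async handler, and the `StepOutcome` type and `handler_panic` error string here are assumptions for illustration:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

#[derive(Debug, PartialEq)]
enum StepOutcome {
    Success,
    Failure { error_type: &'static str },
}

fn invoke_handler<F: FnOnce() -> bool>(handler: F) -> StepOutcome {
    match catch_unwind(AssertUnwindSafe(handler)) {
        Ok(true) => StepOutcome::Success,
        Ok(false) => StepOutcome::Failure { error_type: "handler_error" },
        // A panic is converted into a failure result instead of
        // taking down the worker task.
        Err(_) => StepOutcome::Failure { error_type: "handler_panic" },
    }
}

fn main() {
    // Silence the default panic message so the caught panic is invisible.
    std::panic::set_hook(Box::new(|_| {}));
    assert_eq!(invoke_handler(|| true), StepOutcome::Success);
    assert_eq!(
        invoke_handler(|| panic!("boom")),
        StepOutcome::Failure { error_type: "handler_panic" }
    );
}
```

Either way, the claimed step always produces a result, which is the invariant the error-handling table encodes.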
Handler Resolution
Before handler execution, the dispatch service resolves the handler using a resolver chain pattern:
HandlerDefinition ResolverChain Handler
│ │ │
│ callable: "process_payment" │ │
│ method: "refund" │ │
│ resolver: null │ │
│ │ │
├───────────────────────────────────►│ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ ExplicitMappingResolver (10) │ │
│ │ can_resolve? ─► YES │ │
│ │ resolve() ─────────────────────────────────►│
│ └───────────────────────────────┘ │
│ │
│ ┌───────────────────────────────┐ │
│ │ MethodDispatchWrapper │ │
│ │ (if method != "call") │◄─────────────┤
│ └───────────────────────────────┘ │
Built-in Resolvers:
| Resolver | Priority | Function |
|---|---|---|
| `ExplicitMappingResolver` | 10 | Hash lookup of registered handlers |
| `ClassConstantResolver` | 100 | Runtime class lookup (Ruby only) |
| `ClassLookupResolver` | 100 | Runtime class lookup (Python/TypeScript only) |
Method Dispatch: When handler.method is specified and not "call", a MethodDispatchWrapper is applied to invoke the specified method instead of the default call() method.
See Handler Resolution Guide for complete documentation.
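The chain mechanics can be sketched in a few lines: resolvers are tried in ascending priority order, and the first one whose `can_resolve` succeeds supplies the handler. The trait shape and `Handler` alias below are illustrative, not the actual resolver API:

```rust
use std::collections::HashMap;

// Stand-in for a resolved handler: just a function returning a status string.
type Handler = fn() -> &'static str;

trait HandlerResolver {
    fn priority(&self) -> u32;
    fn can_resolve(&self, callable: &str) -> bool;
    fn resolve(&self, callable: &str) -> Option<Handler>;
}

struct ExplicitMappingResolver {
    registered: HashMap<String, Handler>,
}

impl HandlerResolver for ExplicitMappingResolver {
    fn priority(&self) -> u32 {
        10
    }
    fn can_resolve(&self, callable: &str) -> bool {
        self.registered.contains_key(callable)
    }
    fn resolve(&self, callable: &str) -> Option<Handler> {
        self.registered.get(callable).copied()
    }
}

fn resolve_chain(resolvers: &mut Vec<Box<dyn HandlerResolver>>, callable: &str) -> Option<Handler> {
    // Lower priority value is tried first, so explicit mappings (10)
    // win over runtime class lookup (100).
    resolvers.sort_by_key(|r| r.priority());
    resolvers
        .iter()
        .find(|r| r.can_resolve(callable))
        .and_then(|r| r.resolve(callable))
}

fn process_payment() -> &'static str {
    "processed"
}

fn main() {
    let mut registered = HashMap::new();
    registered.insert("process_payment".to_string(), process_payment as Handler);
    let mut chain: Vec<Box<dyn HandlerResolver>> =
        vec![Box::new(ExplicitMappingResolver { registered })];
    let handler = resolve_chain(&mut chain, "process_payment").expect("handler resolved");
    assert_eq!(handler(), "processed");
    assert!(resolve_chain(&mut chain, "unknown").is_none());
}
```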
3. FfiDispatchChannel
Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs
Pull-based polling interface for FFI workers (Ruby, Python). Enables language-specific handlers without complex FFI memory management.
Flow:
Rust Ruby/Python
│ │
│ dispatch(step) │
│ ──────────────────────────────► │
│ │ pending_events.insert()
│ │
│ poll_step_events() │
│ ◄────────────────────────────── │
│ │
│ │ handler.call()
│ │
│ complete_step_event(result) │
│ ◄────────────────────────────── │
│ │
│ PostHandlerCallback │
│ completion_sender.send() │
│ │
Key Features:
- Thread-safe pending events map with lock poisoning recovery
- Configurable completion timeout (default 30s)
- Starvation detection and warnings
- Fire-and-forget callbacks via `runtime_handle.spawn()`
4. CompletionProcessorService
Location: tasker-worker/src/worker/handlers/completion_processor.rs
Receives completed step results and routes to orchestration queue via FFICompletionService.
completion_receiver → CompletionProcessorService → FFICompletionService → orchestration_step_results
Note: Currently processes completions sequentially. Parallel processing is planned as a future enhancement.
5. DomainEventSystem
Location: tasker-worker/src/worker/event_systems/domain_event_system.rs
Async system for fire-and-forget domain event publishing.
Architecture:
command_processor.rs DomainEventSystem
│ │
│ try_send(command) │ spawn process_loop()
▼ ▼
mpsc::Sender<DomainEventCommand> → mpsc::Receiver
│
▼
EventRouter → PGMQ / InProcess
Key Design:
- `try_send()` never blocks - if channel is full, events are dropped with metrics
- Background task processes commands asynchronously
- Graceful shutdown drains fast events up to configurable timeout
- Three delivery modes: Durable (PGMQ), Fast (in-process), Broadcast
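The `try_send()` contract can be demonstrated standalone with a bounded channel whose receiver never drains: sends never block, and overflow is counted as dropped rather than propagated. Here std's `sync_channel` stands in for the tokio mpsc channel the domain event system actually uses:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Returns how many events were dropped because the channel was full.
fn send_events(capacity: usize, events: &[&str]) -> usize {
    let (tx, _rx) = sync_channel::<String>(capacity); // receiver never drains
    let mut dropped = 0;
    for event in events {
        match tx.try_send(event.to_string()) {
            Ok(()) => {} // event dispatched
            Err(TrySendError::Full(_)) => dropped += 1, // counted, never blocks
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    dropped
}

fn main() {
    // Capacity 2 with an idle receiver: first two sends succeed, last two drop.
    assert_eq!(send_events(2, &["created", "updated", "completed", "archived"]), 2);
}
```

In the real system the drop branch would also increment the `tasker.domain_events.dropped` metric; step completion is unaffected either way.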
Shared Event Abstractions
EventDrivenSystem Trait
Location: tasker-shared/src/event_system/event_driven.rs
Unified trait for all event-driven systems:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait EventDrivenSystem: Send + Sync {
type SystemId: Send + Sync + Clone;
type Event: Send + Sync + Clone;
type Config: Send + Sync + Clone;
type Statistics: EventSystemStatistics;
fn system_id(&self) -> Self::SystemId;
fn deployment_mode(&self) -> DeploymentMode;
fn is_running(&self) -> bool;
async fn start(&mut self) -> Result<(), DeploymentModeError>;
async fn stop(&mut self) -> Result<(), DeploymentModeError>;
async fn process_event(&self, event: Self::Event) -> Result<(), DeploymentModeError>;
async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError>;
fn statistics(&self) -> Self::Statistics;
fn config(&self) -> &Self::Config;
}
}
Deployment Modes
Location: tasker-shared/src/event_system/deployment.rs
#![allow(unused)]
fn main() {
pub enum DeploymentMode {
PollingOnly, // Traditional polling, no events
EventDrivenOnly, // Pure event-driven, no polling
Hybrid, // Event-driven with polling fallback
}
}
PostHandlerCallback Trait
Location: tasker-worker/src/worker/handlers/dispatch_service.rs
Extensibility point for post-handler actions:
#![allow(unused)]
fn main() {
#[async_trait]
pub trait PostHandlerCallback: Send + Sync + 'static {
/// Called after a handler completes
async fn on_handler_complete(
&self,
step: &TaskSequenceStep,
result: &StepExecutionResult,
worker_id: &str,
);
/// Name of this callback for logging purposes
fn name(&self) -> &str;
}
}
Implementations:
- `NoOpCallback`: Default no-operation callback
- `DomainEventCallback`: Publishes domain events to `DomainEventSystem`
Configuration
Worker Event System
# config/tasker/base/event_systems.toml
[event_systems.worker]
system_id = "worker-event-system"
deployment_mode = "Hybrid"
[event_systems.worker.metadata.listener]
retry_interval_seconds = 5
max_retry_attempts = 3
event_timeout_seconds = 60
batch_processing = true
connection_timeout_seconds = 30
[event_systems.worker.metadata.fallback_poller]
enabled = true
polling_interval_ms = 100
batch_size = 10
age_threshold_seconds = 30
max_age_hours = 24
visibility_timeout_seconds = 60
Handler Dispatch
# config/tasker/base/worker.toml
[worker.mpsc_channels.handler_dispatch]
dispatch_buffer_size = 1000
completion_buffer_size = 1000
max_concurrent_handlers = 10
handler_timeout_ms = 30000
[worker.mpsc_channels.ffi_dispatch]
dispatch_buffer_size = 1000
completion_timeout_ms = 30000
starvation_warning_threshold_ms = 10000
callback_timeout_ms = 5000
completion_send_timeout_ms = 10000
Integration with Worker Actors
The event systems integrate with the worker actor architecture:
WorkerEventSystem
│
▼
ActorCommandProcessor
│
├──► StepExecutorActor ──► dispatch_sender
│
├──► FFICompletionActor ◄── completion_receiver
│
└──► DomainEventActor ◄── PostHandlerCallback
See Worker Actors Documentation for actor details.
Event Flow Guarantees
Ordering Guarantee
Domain events fire AFTER result is committed to completion channel:
handler.call()
→ result committed to completion_sender
→ PostHandlerCallback.on_handler_complete()
→ domain events dispatched
This eliminates race conditions where downstream systems see events before orchestration processes results.
Idempotency Guarantee
State machine guards prevent duplicate execution:
- Step claimed atomically via `transition_step_state_atomic()`
- State guards reject duplicate claims
- Results are deduplicated by completion channel
Fire-and-Forget Guarantee
Domain event failures never fail step completion:
#![allow(unused)]
fn main() {
// DomainEventCallback
async fn on_handler_complete(&self, step: &TaskSequenceStep, result: &StepExecutionResult, worker_id: &str) {
// dispatch_events uses try_send() - never blocks
// If channel full, events dropped with metrics
// Step completion is NOT affected
self.handle.dispatch_events(events, publisher_name, correlation_id);
}
}
Monitoring
Key Metrics
| Metric | Description |
|---|---|
| `tasker.worker.events_processed` | Total events processed |
| `tasker.worker.events_failed` | Events that failed processing |
| `tasker.ffi.pending_events` | Pending FFI events (starvation indicator) |
| `tasker.ffi.oldest_event_age_ms` | Age of oldest pending event |
| `tasker.channel.completion.saturation` | Completion channel utilization |
| `tasker.domain_events.dispatched` | Domain events dispatched |
| `tasker.domain_events.dropped` | Domain events dropped (backpressure) |
Health Checks
#![allow(unused)]
fn main() {
async fn health_check(&self) -> Result<DeploymentModeHealthStatus, DeploymentModeError> {
if self.is_running.load(Ordering::Acquire) {
Ok(DeploymentModeHealthStatus::Healthy)
} else {
Ok(DeploymentModeHealthStatus::Critical)
}
}
}
Backpressure Handling
The worker event system implements multiple backpressure mechanisms to ensure graceful degradation under load while preserving step idempotency.
Backpressure Points
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER BACKPRESSURE FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
[1] Step Claiming
│
├── Planned: Capacity check before claiming
│ └── If at capacity: Leave message in queue (visibility timeout)
│
▼
[2] Handler Dispatch Channel (Bounded)
│
├── dispatch_buffer_size = 1000
│ └── If full: Sender blocks until space available
│
▼
[3] Semaphore-Bounded Execution
│
├── max_concurrent_handlers = 10
│ └── If permits exhausted: Task waits for permit
│
├── CRITICAL: Permit released BEFORE sending to completion channel
│ └── Prevents backpressure cascade
│
▼
[4] Completion Channel (Bounded)
│
├── completion_buffer_size = 1000
│ └── If full: Handler task blocks until space available
│
▼
[5] Domain Events (Fire-and-Forget)
│
└── try_send() semantics
└── If channel full: Events DROPPED (step execution unaffected)
Handler Dispatch Backpressure
The HandlerDispatchService uses semaphore-bounded parallelism:
#![allow(unused)]
fn main() {
// Permit acquisition blocks if all permits in use
let permit = semaphore.acquire().await?;
let result = execute_with_timeout(&registry, &msg, timeout).await;
// CRITICAL: Release permit BEFORE sending to completion channel
// This prevents backpressure cascade where full completion channel
// holds permits, starving new handler execution
drop(permit);
// Now send to completion channel (may block if full)
sender.send(result).await?;
}
Why permit release before send matters:
- If completion channel is full, handler task blocks on send
- If permit is held during block, no new handlers can start
- By releasing permit first, new handlers can start even if completions are backing up
FFI Dispatch Backpressure
The FfiDispatchChannel handles backpressure for Ruby/Python workers:
| Scenario | Behavior |
|---|---|
| Dispatch channel full | Sender blocks |
| FFI polling too slow | Starvation warning logged |
| Completion timeout | Failure result generated |
| Callback timeout | Callback fire-and-forget, logged |
Starvation Detection:
[worker.mpsc_channels.ffi_dispatch]
starvation_warning_threshold_ms = 10000 # Warn if event waits > 10s
Domain Event Drop Semantics
Domain events use try_send() and are explicitly designed to be droppable:
#![allow(unused)]
fn main() {
// Domain events fire AFTER result is committed
// They are non-critical and use fire-and-forget semantics
match event_sender.try_send(event) {
Ok(()) => { /* Event dispatched */ }
Err(TrySendError::Full(_)) => {
// Event dropped - step execution NOT affected
warn!("Domain event dropped: channel full");
metrics.increment("domain_events_dropped");
}
}
}
Why this is safe: Domain events are informational. Dropping them does not affect step execution correctness. The step result is already committed to the completion channel before domain events fire.
Step Claiming Backpressure (Planned)
Future enhancement: Workers will check capacity before claiming steps:
#![allow(unused)]
fn main() {
// Planned implementation
fn should_claim_step(&self) -> bool {
let available = self.semaphore.available_permits();
let threshold = self.config.claim_capacity_threshold; // e.g., 0.8
let max = self.config.max_concurrent_handlers;
available as f64 / max as f64 > (1.0 - threshold)
}
}
If at capacity:
- Worker does NOT acknowledge the PGMQ message
- Message returns to queue after visibility timeout
- Another worker (or same worker later) claims it
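Working the arithmetic of the planned check: the worker claims only while the free fraction of permits exceeds `1 - threshold`. With `max_concurrent_handlers = 10` and `claim_capacity_threshold = 0.8`, that means claiming while more than roughly 20% of permits are free. A standalone version of the predicate (extracted from the planned snippet above):

```rust
fn should_claim_step(available_permits: usize, max_handlers: usize, threshold: f64) -> bool {
    // Claim only while the free fraction of permits exceeds (1 - threshold).
    available_permits as f64 / max_handlers as f64 > (1.0 - threshold)
}

fn main() {
    // threshold 0.8 → free fraction must exceed ~0.2
    assert!(should_claim_step(3, 10, 0.8)); // 30% free → claim
    assert!(!should_claim_step(1, 10, 0.8)); // 10% free → refuse, leave in queue
    assert!(!should_claim_step(0, 10, 0.8)); // saturated → refuse
}
```

A refused claim simply leaves the message in the queue; the visibility timeout returns it for another attempt, so no work is lost.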
Idempotency Under Backpressure
All backpressure mechanisms preserve step idempotency:
| Backpressure Point | Idempotency Guarantee |
|---|---|
| Claim refusal | Message stays in queue, visibility timeout protects |
| Dispatch channel full | Step claimed but queued for execution |
| Semaphore wait | Step claimed, waiting for permit |
| Completion channel full | Handler completed, result buffered |
| Domain event drop | Non-critical, step result already persisted |
Critical Rule: A claimed step MUST produce a result (success or failure). Backpressure may delay but never drop step execution.
For comprehensive backpressure strategy, see Backpressure Architecture.
Best Practices
1. Choose Deployment Mode
- Production: Use `Hybrid` for reliability with event-driven performance
- Development: Use `EventDrivenOnly` for fastest iteration
- Restricted environments: Use `PollingOnly` when `pg_notify` is unavailable
2. Tune Concurrency
[worker.mpsc_channels.handler_dispatch]
max_concurrent_handlers = 10 # Start here, increase based on monitoring
Monitor:
- Semaphore wait times
- Handler execution latency
- Completion channel saturation
3. Configure Timeouts
handler_timeout_ms = 30000 # Match your slowest handler
completion_timeout_ms = 30000 # FFI completion timeout
callback_timeout_ms = 5000 # Domain event callback timeout
4. Monitor Starvation
For FFI workers, monitor pending event age:
# Ruby
metrics = Tasker.ffi_dispatch_metrics
if metrics[:oldest_pending_age_ms] > 10000
warn "FFI polling falling behind"
end
Related Documentation
- Messaging Abstraction - Provider-agnostic messaging
- Backpressure Architecture - Unified backpressure strategy
- Worker Actor-Based Architecture - Actor pattern implementation
- Events and Commands - Command pattern details
- Dual-Channel Event System ADR - Dual-channel event system decision
- FFI Callback Safety - FFI guidelines
- RCA: Parallel Execution Timing Bugs - Lessons learned
- Backpressure Monitoring Runbook - Metrics and alerting
<- Back to Documentation Hub
Tasker Core Guides
This directory contains practical how-to guides for working with Tasker Core.
Documents
| Document | Description |
|---|---|
| Quick Start | Get running in 5 minutes |
| Use Cases and Patterns | Practical workflow examples |
| Conditional Workflows | Runtime decision-making and dynamic steps |
| Batch Processing | Parallel processing with cursor-based workers |
| DLQ System | Dead letter queue investigation and resolution |
| Retry Semantics | Understanding max_attempts and retryable flags |
| Identity Strategy | Task deduplication with STRICT, CALLER_PROVIDED, ALWAYS_UNIQUE |
| Configuration Management | TOML architecture, CLI tools, runtime observability |
When to Read These
- Getting started: Begin with Quick Start
- Implementing features: Check Use Cases and Patterns
- Handling errors: See Retry Semantics and DLQ System
- Processing data: Review Batch Processing
- Deploying: Consult Configuration Management
Related Documentation
- Architecture - The “what” - system structure
- Principles - The “why” - design philosophy
- Workers - Language-specific handler development
API Security Guide
API-level security for orchestration (8080) and worker (8081) endpoints using JWT bearer tokens and API key authentication with permission-based access control.
Security is disabled by default for backward compatibility. Enable it explicitly in configuration.
See also: Auth Documentation Hub for architecture overview, Permissions for route mapping, Configuration for full reference, Testing for E2E test patterns.
Quick Start
1. Generate Keys
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys
2. Generate a Token
cargo run --bin tasker-ctl -- auth generate-token \
--private-key ./keys/jwt-private-key.pem \
--permissions "tasks:create,tasks:read,tasks:list,steps:read" \
--subject my-service \
--expiry-hours 24
3. Enable Auth in Configuration
In config/tasker/base/orchestration.toml:
[auth]
enabled = true
jwt_public_key_path = "./keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
4. Use the Token
export TASKER_AUTH_TOKEN=<generated-token>
cargo run --bin tasker-ctl -- task list
Or with curl:
curl -H "Authorization: Bearer $TASKER_AUTH_TOKEN" http://localhost:8080/v1/tasks
Permission Vocabulary
| Permission | Resource | Description |
|---|---|---|
| tasks:create | tasks | Create new tasks |
| tasks:read | tasks | Read task details |
| tasks:list | tasks | List tasks |
| tasks:cancel | tasks | Cancel running tasks |
| tasks:context_read | tasks | Read task context data |
| steps:read | steps | Read workflow step details |
| steps:resolve | steps | Manually resolve steps |
| dlq:read | dlq | Read DLQ entries |
| dlq:update | dlq | Update DLQ investigations |
| dlq:stats | dlq | View DLQ statistics |
| templates:read | templates | Read task templates |
| templates:validate | templates | Validate templates |
| system:config_read | system | Read system configuration |
| system:handlers_read | system | Read handler registry |
| system:analytics_read | system | Read analytics data |
| worker:config_read | worker | Read worker configuration |
| worker:templates_read | worker | Read worker templates |
Wildcards
- tasks:* - All task permissions
- steps:* - All step permissions
- dlq:* - All DLQ permissions
- * - All permissions (superuser)
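Wildcard resolution is a simple prefix rule over the `resource:action` format; a minimal Python sketch of the matching logic (the `permission_granted` helper is illustrative, not part of any Tasker package):

```python
def permission_granted(granted: list[str], required: str) -> bool:
    """Check whether a required permission is covered by a grant list,
    honoring `*` (superuser) and `resource:*` wildcards."""
    resource = required.split(":", 1)[0]
    return (
        "*" in granted
        or f"{resource}:*" in granted
        or required in granted
    )

# tasks:* covers every task permission; dlq:read is not covered by it
assert permission_granted(["tasks:*"], "tasks:create")
assert not permission_granted(["tasks:*"], "dlq:read")
assert permission_granted(["*"], "system:config_read")
```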
Show All Permissions
cargo run --bin tasker-ctl -- auth show-permissions
Configuration Reference
Server-Side (orchestration.toml / worker.toml)
[auth]
enabled = true
# JWT Configuration
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
jwt_token_expiry_hours = 24
# Key Configuration (one of these):
jwt_public_key_path = "./keys/jwt-public-key.pem" # File path (preferred)
jwt_public_key = "-----BEGIN RSA PUBLIC KEY-----..." # Inline PEM
# Or set env: TASKER_JWT_PUBLIC_KEY_PATH
# JWKS (for dynamic key rotation)
jwt_verification_method = "jwks" # "public_key" (default) or "jwks"
jwks_url = "https://auth.example.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
# Permission validation
permissions_claim = "permissions" # JWT claim containing permissions
strict_validation = true # Reject tokens with unknown permissions
log_unknown_permissions = true
# API Key Authentication
api_key_header = "X-API-Key"
api_keys_enabled = true
[[auth.api_keys]]
key = "sk-prod-key-1"
permissions = ["tasks:read", "tasks:list", "steps:read"]
description = "Read-only monitoring service"
[[auth.api_keys]]
key = "sk-admin-key"
permissions = ["*"]
description = "Admin key"
Client-Side (Environment Variables)
| Variable | Description |
|---|---|
| TASKER_AUTH_TOKEN | Bearer token for both APIs |
| TASKER_ORCHESTRATION_AUTH_TOKEN | Override token for orchestration only |
| TASKER_WORKER_AUTH_TOKEN | Override token for worker only |
| TASKER_API_KEY | API key (fallback if no token) |
| TASKER_API_KEY_HEADER | Custom header name (default: X-API-Key) |
Priority: endpoint-specific token > global token > API key > config file.
JWT Token Structure
{
"sub": "my-service",
"iss": "tasker-core",
"aud": "tasker-api",
"iat": 1706000000,
"exp": 1706086400,
"permissions": [
"tasks:create",
"tasks:read",
"tasks:list",
"steps:read"
],
"worker_namespaces": []
}
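Assuming the signature has already been verified, server-side validation of a payload like the one above reduces to a few claim checks; an illustrative Python sketch (`validate_claims` is not a Tasker API):

```python
import time

def validate_claims(payload: dict, issuer: str, audience: str) -> list[str]:
    """Validate iss/aud/exp on an already-verified JWT payload and return
    its permissions claim. Raises ValueError on any mismatch."""
    if payload.get("iss") != issuer:
        raise ValueError("issuer mismatch")
    if payload.get("aud") != audience:
        raise ValueError("audience mismatch")
    if payload.get("exp", 0) <= time.time():
        raise ValueError("token expired")
    perms = payload.get("permissions", [])
    if not all(isinstance(p, str) for p in perms):
        raise ValueError("permissions claim must be an array of strings")
    return perms
```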
Common Role Patterns
Read-only operator:
permissions: ["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]
Task submitter:
permissions: ["tasks:create", "tasks:read", "tasks:list"]
Ops admin:
permissions: ["tasks:*", "steps:*", "dlq:*", "system:*"]
Worker service:
permissions: ["worker:config_read", "worker:templates_read"]
Superuser:
permissions: ["*"]
Public Endpoints
These endpoints never require authentication:
- GET /health - Basic health check
- GET /health/detailed - Detailed health
- GET /metrics - Prometheus metrics
API Key Authentication
API keys are validated against a configured registry. Each key has its own set of permissions.
# Using API key
curl -H "X-API-Key: sk-prod-key-1" http://localhost:8080/v1/tasks
API keys are simpler than JWTs but have limitations:
- No expiration (rotate by removing from config)
- No claims beyond permissions
- Best for service-to-service communication with static permissions
Error Responses
401 Unauthorized (Missing/Invalid Credentials)
{
"error": "unauthorized",
"message": "Missing authentication credentials"
}
403 Forbidden (Insufficient Permissions)
{
"error": "forbidden",
"message": "Missing required permission: tasks:create"
}
Migration Guide: Disabled to Enabled
- Generate keys and distribute the public key to server config
- Generate tokens for each service/user with appropriate permissions
- Set enabled = true in the auth config
- Deploy - services without valid tokens will receive 401 responses
- Monitor the tasker.auth.failures.total metric for issues
All endpoints remain accessible without auth when enabled = false.
Observability
Structured Logs
- info on successful authentication (subject, method)
- warn on authentication failure (error details)
- warn on permission denial (subject, required permission)
Prometheus Metrics
| Metric | Type | Labels |
|---|---|---|
| tasker.auth.requests.total | Counter | method, result |
| tasker.auth.failures.total | Counter | reason |
| tasker.permission.denials.total | Counter | permission |
| tasker.auth.jwt.verification.duration | Histogram | result |
CLI Auth Commands
# Generate RSA key pair
tasker-ctl auth generate-keys [--output-dir ./keys] [--key-size 2048]
# Generate JWT token
tasker-ctl auth generate-token \
--permissions tasks:create,tasks:read \
--subject my-service \
--private-key ./keys/jwt-private-key.pem \
--expiry-hours 24
# List all permissions
tasker-ctl auth show-permissions
# Validate a token
tasker-ctl auth validate-token \
--token <JWT> \
--public-key ./keys/jwt-public-key.pem
gRPC Authentication
gRPC endpoints support the same authentication methods as REST, using gRPC metadata instead of HTTP headers.
gRPC Ports
| Service | REST Port | gRPC Port |
|---|---|---|
| Orchestration | 8080 | 9190 |
| Rust Worker | 8081 | 9191 |
Bearer Token (gRPC)
# Using grpcurl with Bearer token
grpcurl -plaintext \
-H "Authorization: Bearer $TASKER_AUTH_TOKEN" \
localhost:9190 tasker.v1.TaskService/ListTasks
API Key (gRPC)
# Using grpcurl with API key
grpcurl -plaintext \
-H "X-API-Key: sk-prod-key-1" \
localhost:9190 tasker.v1.TaskService/ListTasks
gRPC Client Configuration
#![allow(unused)]
fn main() {
use tasker_client::grpc_clients::{OrchestrationGrpcClient, GrpcAuthConfig};
// With API key
let client = OrchestrationGrpcClient::connect_with_auth(
"http://localhost:9190",
GrpcAuthConfig::ApiKey("sk-prod-key-1".to_string()),
).await?;
// With Bearer token
let client = OrchestrationGrpcClient::connect_with_auth(
"http://localhost:9190",
GrpcAuthConfig::Bearer("eyJ...".to_string()),
).await?;
}
gRPC Error Codes
| gRPC Status | HTTP Equivalent | Meaning |
|---|---|---|
| UNAUTHENTICATED | 401 | Missing or invalid credentials |
| PERMISSION_DENIED | 403 | Valid credentials but insufficient permissions |
| NOT_FOUND | 404 | Resource not found |
| UNAVAILABLE | 503 | Service unavailable |
Public gRPC Endpoints
These endpoints never require authentication:
- HealthService/CheckHealth - Basic health check
- HealthService/CheckLiveness - Kubernetes liveness probe
- HealthService/CheckReadiness - Kubernetes readiness probe
- HealthService/CheckDetailedHealth - Detailed health metrics
Security Considerations
- Key storage: Private keys should never be committed to git. Use file paths or environment variables.
- Token expiry: Set appropriate expiry times. Short-lived tokens (1-24h) are preferred.
- Least privilege: Grant only the permissions each service needs.
- Key rotation: Use JWKS for automatic key rotation in production.
- API key rotation: Remove old keys from config and redeploy.
- Audit: Monitor tasker.auth.failures.total and tasker.permission.denials.total for anomalies.
External Auth Provider Integration
Integrating Tasker’s API security with external identity providers via JWKS endpoints.
See also: Auth Documentation Hub for architecture overview, Configuration for full TOML reference.
JWKS Integration
Tasker supports JWKS (JSON Web Key Set) for dynamic public key discovery. This enables key rotation without redeploying Tasker.
Configuration
[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://your-provider.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://your-provider.com/"
jwt_audience = "tasker-api"
permissions_claim = "permissions" # or custom claim name
How It Works
- On first request, Tasker fetches the JWKS from the configured URL
- Keys are cached for the configured refresh interval
- When a token has an unknown kid (Key ID), a refresh is triggered
- RSA keys are parsed from the JWK n and e components
Auth0
Auth0 Configuration
- Create an API in the Auth0 Dashboard:
  - Name: Tasker API
  - Identifier: tasker-api (this becomes the audience)
  - Signing Algorithm: RS256
- Create permissions in the API settings matching Tasker's vocabulary: tasks:create, tasks:read, tasks:list, etc.
- Assign permissions to users/applications via Auth0 roles
Tasker Configuration for Auth0
[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://YOUR_DOMAIN.auth0.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://YOUR_DOMAIN.auth0.com/"
jwt_audience = "tasker-api"
permissions_claim = "permissions"
Token Request
curl --request POST \
--url https://YOUR_DOMAIN.auth0.com/oauth/token \
--header 'content-type: application/json' \
--data '{
"client_id": "YOUR_CLIENT_ID",
"client_secret": "YOUR_CLIENT_SECRET",
"audience": "tasker-api",
"grant_type": "client_credentials"
}'
Keycloak
Keycloak Configuration
- Create a realm and client for Tasker
- Define client roles matching Tasker permissions
- Configure the client to include roles in the permissions token claim via a protocol mapper
Tasker Configuration for Keycloak
[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://keycloak.example.com/realms/YOUR_REALM/protocol/openid-connect/certs"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://keycloak.example.com/realms/YOUR_REALM"
jwt_audience = "tasker-api"
permissions_claim = "permissions" # Configure via protocol mapper
Okta
Okta Configuration
- Create an API authorization server
- Add custom claims for permissions
- Define scopes matching Tasker permissions
Tasker Configuration for Okta
[auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://YOUR_DOMAIN.okta.com/oauth2/YOUR_AUTH_SERVER_ID/v1/keys"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://YOUR_DOMAIN.okta.com/oauth2/YOUR_AUTH_SERVER_ID"
jwt_audience = "tasker-api"
permissions_claim = "scp" # Okta uses "scp" for scopes by default
Custom JWKS Endpoint
Any provider that serves a standard JWKS endpoint works. The endpoint must return:
{
"keys": [
{
"kty": "RSA",
"kid": "key-id-1",
"use": "sig",
"alg": "RS256",
"n": "<base64url-encoded modulus>",
"e": "<base64url-encoded exponent>"
}
]
}
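The `n` and `e` members are base64url-encoded big-endian integers, so decoding them needs only the standard library; a small sketch:

```python
import base64

def b64url_to_int(value: str) -> int:
    """Decode an unpadded base64url string to a big-endian integer,
    as required for the JWK `n` (modulus) and `e` (exponent) members."""
    padded = value + "=" * (-len(value) % 4)  # restore stripped padding
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")

# The common RSA public exponent 65537 is encoded as "AQAB"
assert b64url_to_int("AQAB") == 65537
```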
Static Public Key (Development)
For development or simple deployments without a JWKS endpoint:
[auth]
enabled = true
jwt_verification_method = "public_key"
jwt_public_key_path = "/etc/tasker/keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
Generate keys with:
tasker-ctl auth generate-keys --output-dir /etc/tasker/keys
Permission Claim Mapping
If your identity provider uses a different claim name for permissions:
permissions_claim = "custom_permissions" # Default: "permissions"
The claim must be a JSON array of strings:
{
"custom_permissions": ["tasks:create", "tasks:read"]
}
Strict Validation
When strict_validation = true (default), tokens containing unknown permission strings are rejected. Set to false if your provider includes additional scopes/permissions not in Tasker’s vocabulary:
strict_validation = false
log_unknown_permissions = true # Still log unknown permissions for monitoring
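Conceptually, strict validation partitions a token's permissions against the known vocabulary; an illustrative sketch (abridged vocabulary, hypothetical helper name):

```python
# Abridged vocabulary; the real engine knows the full permission table plus wildcards
KNOWN_PERMISSIONS = {
    "tasks:create", "tasks:read", "tasks:list", "steps:read",
    "tasks:*", "steps:*", "dlq:*", "*",
}

def filter_permissions(token_perms: list[str], strict: bool = True) -> list[str]:
    """Reject tokens carrying unknown permission strings when strict;
    otherwise drop the unknowns and keep the recognized ones."""
    unknown = [p for p in token_perms if p not in KNOWN_PERMISSIONS]
    if unknown and strict:
        raise ValueError(f"unknown permissions: {unknown}")
    return [p for p in token_perms if p not in unknown]
```

This mirrors why `strict_validation = false` is useful with providers like Okta that attach extra scopes: the unrecognized entries are ignored rather than failing the whole token.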
Batch Processing in Tasker
Last Updated: 2026-01-06 · Status: Production Ready · Related: Conditional Workflows, DLQ System
Table of Contents
- Overview
- Architecture Foundations
- Core Concepts
- Checkpoint Yielding
- Workflow Pattern
- Data Structures
- Implementation Patterns
- Use Cases
- Operator Workflows
- Code Examples
- Best Practices
Overview
Batch processing in Tasker enables parallel processing of large datasets by dynamically creating worker steps at runtime. A single “batchable” step analyzes a workload and instructs orchestration to create N worker instances, each processing a subset of data using cursor-based boundaries.
Key Characteristics
Dynamic Worker Creation: Workers are created at runtime based on dataset analysis, using step definitions predefined in templates for structure but scaled according to need.
Cursor-Based Resumability: Each worker processes a specific range (cursor) and can resume from checkpoints on failure.
Deferred Convergence: Aggregation steps use intersection semantics to wait for all created workers, regardless of count.
Standard Lifecycle: Workers use existing retry, timeout, and DLQ mechanics - no special recovery system needed.
Example Flow
Task: Process 1000-row CSV file
1. analyze_csv (batchable step)
→ Counts rows: 1000
→ Calculates workers: 5 (200 rows each)
→ Returns BatchProcessingOutcome::CreateBatches
2. Orchestration creates workers dynamically:
├─ process_csv_batch_001 (rows 1-200)
├─ process_csv_batch_002 (rows 201-400)
├─ process_csv_batch_003 (rows 401-600)
├─ process_csv_batch_004 (rows 601-800)
└─ process_csv_batch_005 (rows 801-1000)
3. Workers process in parallel
4. aggregate_csv_results (deferred convergence)
→ Waits for all 5 workers (intersection semantics)
→ Aggregates results from completed workers
→ Returns combined metrics
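The arithmetic in step 1 above is ceiling division plus contiguous range assignment; a Python sketch of the idea (helper and field names mirror the example, not an exact API):

```python
import math

def cursor_configs(total_rows: int, batch_size: int) -> list[dict]:
    """Split total_rows into contiguous, 1-indexed, inclusive cursor ranges,
    one per worker, using ceiling division for the worker count."""
    worker_count = math.ceil(total_rows / batch_size)
    configs = []
    for i in range(worker_count):
        configs.append({
            "batch_id": f"{i + 1:03}",
            "start_cursor": i * batch_size + 1,
            "end_cursor": min((i + 1) * batch_size, total_rows),
        })
    return configs

# 1000 rows at batch_size 200 -> 5 workers covering 1-200 ... 801-1000
configs = cursor_configs(1000, 200)
assert len(configs) == 5
assert configs[0] == {"batch_id": "001", "start_cursor": 1, "end_cursor": 200}
assert configs[-1]["end_cursor"] == 1000
```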
Architecture Foundations
Batch processing builds on and extends three foundational Tasker patterns:
1. DAG (Directed Acyclic Graph) Workflow Orchestration
What Batch Processing Inherits:
- Worker steps are full DAG nodes with standard state machines
- Parent-child dependencies enforced via tasker_workflow_step_edges
- Cycle detection prevents circular dependencies
- Topological ordering ensures correct execution sequence
What Batch Processing Extends:
- Dynamic node creation: Template steps instantiated N times at runtime
- Edge generation: Batchable step → worker instances → convergence step
- Transactional atomicity: All workers created in single database transaction
Code Reference: tasker-orchestration/src/orchestration/lifecycle/batch_processing/service.rs:357-400
#![allow(unused)]
fn main() {
// Transaction ensures all-or-nothing worker creation
let mut tx = pool.begin().await?;
for (i, cursor_config) in cursor_configs.iter().enumerate() {
// Create worker instance from template
let worker_step = WorkflowStepCreator::create_from_template(
&mut tx,
task_uuid,
&worker_template,
&format!("{}_{:03}", worker_template_name, i + 1),
Some(batch_worker_inputs.clone()),
).await?;
// Create edge: batchable → worker
WorkflowStepEdge::create_with_transaction(
&mut tx,
batchable_step.workflow_step_uuid,
worker_step.workflow_step_uuid,
"batch_dependency",
).await?;
}
tx.commit().await?; // Atomic - all workers or none
}
2. Retryability and Lifecycle Management
What Batch Processing Inherits:
- Standard lifecycle.max_retries configuration per template
- Exponential backoff via lifecycle.backoff_multiplier
- Staleness detection using lifecycle.max_steps_in_process_minutes
- Standard state transitions (Pending → Enqueued → InProgress → Complete/Error)
What Batch Processing Extends:
- Checkpoint-based resumability: Workers checkpoint progress and resume from last cursor position
- Cursor preservation during retry: workflow_steps.results field preserved by the ResetForRetry action
- Additional staleness detection: Checkpoint timestamp tracking alongside duration-based detection
Key Simplification:
- ❌ No BatchRecoveryService - Uses standard retry + DLQ
- ❌ No duplicate timeout settings - Uses lifecycle config only
- ✅ Cursor data preserved during ResetForRetry
Configuration Example: tests/fixtures/task_templates/ruby/batch_processing_products_csv.yaml:749-752
- name: process_csv_batch
type: batch_worker
lifecycle:
max_steps_in_process_minutes: 120 # DLQ timeout
max_retries: 3 # Standard retry limit
backoff_multiplier: 2.0 # Exponential backoff
3. Deferred Convergence
What Batch Processing Inherits:
- Intersection semantics: Wait for declared dependencies ∩ actually created steps
- Template-level dependencies: Convergence step depends on worker template, not instances
- Runtime resolution: System computes effective dependencies when workers are created
What Batch Processing Extends:
- Batch aggregation pattern: Convergence steps aggregate results from N workers
- NoBatches scenario handling: Placeholder worker created when dataset too small
- Scenario detection helpers: BatchAggregationScenario::detect() for both cases
Flow Comparison:
Conditional Workflows (Decision Points):
decision_step → creates → option_a, option_b (conditional)
↓
convergence_step (depends on option_a AND option_b templates)
→ waits for whichever were created (intersection)
Batch Processing (Batchable Steps):
batchable_step → creates → worker_001, worker_002, ..., worker_N
↓
convergence_step (depends on worker template)
→ waits for ALL workers created (intersection)
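In both patterns, the effective dependency set is a plain set operation over step names: declared template dependencies resolved against the steps actually created at runtime. An illustrative Python sketch:

```python
def effective_dependencies(declared_templates: set[str],
                           created_steps: dict[str, str]) -> set[str]:
    """Resolve template-level dependencies against runtime-created steps.
    created_steps maps step instance name -> template name; the result is
    the intersection: instances whose template was declared as a dependency."""
    return {
        name for name, template in created_steps.items()
        if template in declared_templates
    }

# Convergence declares a dependency on the worker template; at runtime
# two worker instances were created from it
created = {
    "process_csv_batch_001": "process_csv_batch",
    "process_csv_batch_002": "process_csv_batch",
    "analyze_csv": "analyze_csv",
}
deps = effective_dependencies({"process_csv_batch"}, created)
assert deps == {"process_csv_batch_001", "process_csv_batch_002"}
```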
Code Reference: tasker-orchestration/src/orchestration/lifecycle/batch_processing/service.rs:600-650
#![allow(unused)]
fn main() {
// Determine and create convergence step with intersection semantics
pub async fn determine_and_create_convergence_step(
&self,
tx: &mut PgTransaction,
task_uuid: Uuid,
convergence_template: &StepDefinition,
created_workers: &[WorkflowStep],
) -> Result<Option<WorkflowStep>> {
// Create convergence step as deferred type
let convergence_step = WorkflowStepCreator::create_from_template(
tx,
task_uuid,
convergence_template,
&convergence_template.name,
None,
).await?;
// Create edges from ALL worker instances to convergence step
for worker in created_workers {
WorkflowStepEdge::create_with_transaction(
tx,
worker.workflow_step_uuid,
convergence_step.workflow_step_uuid,
"batch_convergence_dependency",
).await?;
}
Ok(Some(convergence_step))
}
}
Core Concepts
Batchable Steps
Purpose: Analyze a workload and decide whether to create batch workers.
Responsibilities:
- Examine dataset (size, complexity, business logic)
- Calculate optimal worker count based on batch size
- Generate cursor configurations defining batch boundaries
- Return BatchProcessingOutcome instructing orchestration
Returns: BatchProcessingOutcome enum with two variants:
- NoBatches: Dataset too small or empty - create placeholder worker
- CreateBatches: Create N workers with cursor configurations
Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:60-120
#![allow(unused)]
fn main() {
// Batchable handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let csv_file_path = step_data.task.context.get("csv_file_path").unwrap();
// Count rows in CSV (excluding header)
let total_rows = count_csv_rows(csv_file_path)?;
// Get batch configuration from handler initialization
let batch_size = step_data.handler_initialization
.get("batch_size").and_then(|v| v.as_u64()).unwrap_or(200);
if total_rows == 0 {
// No batches needed
let outcome = BatchProcessingOutcome::no_batches();
return Ok(success_result(
step_uuid,
json!({ "batch_processing_outcome": outcome.to_value() }),
elapsed_ms,
None,
));
}
// Calculate workers
let worker_count = (total_rows as f64 / batch_size as f64).ceil() as u32;
// Generate cursor configs
let cursor_configs = create_cursor_configs(total_rows, worker_count);
// Return CreateBatches outcome
let outcome = BatchProcessingOutcome::create_batches(
"process_csv_batch".to_string(),
worker_count,
cursor_configs,
total_rows,
);
Ok(success_result(
step_uuid,
json!({
"batch_processing_outcome": outcome.to_value(),
"worker_count": worker_count,
"total_rows": total_rows
}),
elapsed_ms,
None,
))
}
}
Batch Workers
Purpose: Process a specific subset of data defined by cursor configuration.
Responsibilities:
- Extract cursor config from workflow_step.inputs
- Check for the is_no_op flag (NoBatches placeholder scenario)
- Process items within cursor range (start_cursor to end_cursor)
- Checkpoint progress periodically for resumability
- Return processed results for aggregation
Cursor Configuration: Each worker receives BatchWorkerInputs in workflow_step.inputs:
{
"cursor": {
"batch_id": "001",
"start_cursor": 1,
"end_cursor": 200,
"batch_size": 200
},
"batch_metadata": {
"checkpoint_interval": 100,
"cursor_field": "row_number",
"failure_strategy": "fail_fast"
},
"is_no_op": false
}
Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:200-280
#![allow(unused)]
fn main() {
// Batch worker handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// Extract context using helper
let context = BatchWorkerContext::from_step_data(step_data)?;
// Check for no-op placeholder worker
if context.is_no_op() {
return Ok(success_result(
step_uuid,
json!({
"no_op": true,
"reason": "NoBatches scenario - no items to process"
}),
elapsed_ms,
None,
));
}
// Get cursor range
let start_row = context.start_position();
let end_row = context.end_position();
// Get CSV file path from dependency results
let csv_file_path = step_data
.dependency_results
.get("analyze_csv")
.and_then(|r| r.result.get("csv_file_path"))
.unwrap();
// Process CSV rows in cursor range
let mut processed_count = 0;
let mut metrics = initialize_metrics();
let file = File::open(csv_file_path)?;
let mut csv_reader = csv::ReaderBuilder::new()
.has_headers(true)
.from_reader(BufReader::new(file));
for (row_idx, result) in csv_reader.deserialize::<Product>().enumerate() {
let data_row_num = row_idx + 1; // 1-indexed after header
if data_row_num < start_row {
continue; // Skip rows before our range
}
if data_row_num > end_row {
break; // Past the end of our inclusive range
}
}
let product: Product = result?;
// Update metrics
metrics.total_inventory_value += product.price * (product.stock as f64);
*metrics.category_counts.entry(product.category.clone())
.or_insert(0) += 1;
processed_count += 1;
// Checkpoint progress periodically
if processed_count % context.checkpoint_interval() == 0 {
checkpoint_progress(step_uuid, data_row_num).await?;
}
}
// Return results for aggregation
Ok(success_result(
step_uuid,
json!({
"processed_count": processed_count,
"total_inventory_value": metrics.total_inventory_value,
"category_counts": metrics.category_counts,
"batch_id": context.batch_id(),
"start_row": start_row,
"end_row": end_row
}),
elapsed_ms,
None,
))
}
}
Convergence Steps
Purpose: Aggregate results from all batch workers using deferred intersection semantics.
Responsibilities:
- Detect scenario using BatchAggregationScenario::detect()
- Handle both NoBatches and WithBatches scenarios
- Aggregate metrics from all worker results
- Return combined results for task completion
Scenario Detection:
#![allow(unused)]
fn main() {
pub enum BatchAggregationScenario {
/// No batches created - placeholder worker used
NoBatches {
batchable_result: StepDependencyResult,
},
/// Batches created - multiple workers processed data
WithBatches {
batch_results: Vec<(String, StepDependencyResult)>,
worker_count: u32,
},
}
}
Code Reference: workers/rust/src/step_handlers/batch_processing_products_csv.rs:400-480
#![allow(unused)]
fn main() {
// Convergence handler example
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// Detect scenario using helper
let scenario = BatchAggregationScenario::detect(
&step_data.dependency_results,
"analyze_csv", // batchable step name
"process_csv_batch_", // batch worker prefix
)?;
match scenario {
BatchAggregationScenario::NoBatches { batchable_result } => {
// No workers created - get dataset size from batchable step
let total_rows = batchable_result
.result.get("total_rows")
.and_then(|v| v.as_u64())
.unwrap_or(0);
// Return zero metrics
Ok(success_result(
step_uuid,
json!({
"total_processed": total_rows,
"total_inventory_value": 0.0,
"category_counts": {},
"worker_count": 0
}),
elapsed_ms,
None,
))
}
BatchAggregationScenario::WithBatches { batch_results, worker_count } => {
// Aggregate results from all workers
let mut total_processed = 0u64;
let mut total_inventory_value = 0.0;
let mut global_category_counts = HashMap::new();
let mut max_price = 0.0;
let mut max_price_product = None;
for (step_name, result) in batch_results {
// Sum processed counts
total_processed += result.result
.get("processed_count")
.and_then(|v| v.as_u64())
.unwrap_or(0);
// Sum inventory values
total_inventory_value += result.result
.get("total_inventory_value")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
// Merge category counts
if let Some(categories) = result.result
.get("category_counts")
.and_then(|v| v.as_object()) {
for (category, count) in categories {
*global_category_counts.entry(category.clone()).or_insert(0)
+= count.as_u64().unwrap_or(0);
}
}
// Find global max price
let batch_max_price = result.result
.get("max_price")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
if batch_max_price > max_price {
max_price = batch_max_price;
max_price_product = result.result
.get("max_price_product")
.and_then(|v| v.as_str())
.map(String::from);
}
}
// Return aggregated metrics
Ok(success_result(
step_uuid,
json!({
"total_processed": total_processed,
"total_inventory_value": total_inventory_value,
"category_counts": global_category_counts,
"max_price": max_price,
"max_price_product": max_price_product,
"worker_count": worker_count
}),
elapsed_ms,
None,
))
}
}
}
}
Checkpoint Yielding
Checkpoint yielding enables handler-driven progress persistence during long-running batch operations. Handlers explicitly checkpoint their progress, persist state to the database, and yield control back to the orchestrator for re-dispatch.
Key Characteristics
Handler-Driven: Handlers decide when to checkpoint based on business logic, not configuration timers. This gives handlers full control over checkpoint frequency and timing.
Checkpoint-Persist-Then-Redispatch: Progress is atomically saved to the database before the step is re-dispatched. This ensures no progress is ever lost, even during infrastructure failures.
Step Remains In-Progress: During checkpoint yield cycles, the step stays in InProgress state. It is not released or re-enqueued through normal channels—the re-dispatch happens internally.
State Machine Integrity: Only Success or Failure results trigger state transitions. Checkpoint yields are internal handler mechanics that don’t affect the step state machine.
When to Use Checkpoint Yielding
Use checkpoint yielding when:
- Processing takes longer than your visibility timeout (prevents DLQ escalation)
- You want resumable processing after transient failures
- You need to periodically release resources (memory, connections)
- Long-running operations need progress visibility
Don’t use checkpoint yielding when:
- Batch processing completes quickly (<30 seconds)
- The overhead of checkpointing exceeds the benefit
- Operations are inherently non-resumable
API Reference
All languages provide a checkpoint_yield() method (or checkpointYield() in TypeScript) on the Batchable mixin:
Ruby
class MyBatchWorkerHandler
include Tasker::StepHandler::Batchable
def call(step_data)
context = BatchWorkerContext.from_step_data(step_data)
# Resume from checkpoint if present
start_item = context.has_checkpoint? ? context.checkpoint_cursor : 0
accumulated = context.accumulated_results || []
items = fetch_items_to_process(start_item)
items.each_with_index do |item, idx|
result = process_item(item)
accumulated << result
# Checkpoint every 1000 items
if (idx + 1) % 1000 == 0
checkpoint_yield(
cursor: start_item + idx + 1,
items_processed: accumulated.size,
accumulated_results: { processed: accumulated }
)
# Handler execution stops here and resumes on re-dispatch
end
end
# Return final success result
success_result(results: { all_processed: accumulated })
end
end
BatchWorkerContext Accessors (Ruby):
- checkpoint_cursor - Current cursor position (or nil if no checkpoint)
- accumulated_results - Previously accumulated results (or nil)
- has_checkpoint? - Returns true if checkpoint data exists
- checkpoint_items_processed - Number of items processed at checkpoint
Python
class MyBatchWorkerHandler(BatchableHandler):
def call(self, step_data: TaskSequenceStep) -> StepExecutionResult:
context = BatchWorkerContext.from_step_data(step_data)
# Resume from checkpoint if present
start_item = context.checkpoint_cursor if context.has_checkpoint() else 0
accumulated = context.accumulated_results or []
items = self.fetch_items_to_process(start_item)
for idx, item in enumerate(items):
result = self.process_item(item)
accumulated.append(result)
# Checkpoint every 1000 items
if (idx + 1) % 1000 == 0:
self.checkpoint_yield(
cursor=start_item + idx + 1,
items_processed=len(accumulated),
accumulated_results={"processed": accumulated}
)
# Handler execution stops here and resumes on re-dispatch
# Return final success result
return self.success_result(results={"all_processed": accumulated})
BatchWorkerContext Accessors (Python):
- checkpoint_cursor: int | str | dict | None - Current cursor position
- accumulated_results: dict | None - Previously accumulated results
- has_checkpoint() -> bool - Returns true if checkpoint data exists
- checkpoint_items_processed: int - Number of items processed at checkpoint
TypeScript
class MyBatchWorkerHandler extends BatchableHandler {
async call(stepData: TaskSequenceStep): Promise<StepExecutionResult> {
const context = BatchWorkerContext.fromStepData(stepData);
// Resume from checkpoint if present
const startItem = context.hasCheckpoint() ? context.checkpointCursor : 0;
const accumulated = context.accumulatedResults ?? [];
const items = await this.fetchItemsToProcess(startItem);
for (let idx = 0; idx < items.length; idx++) {
const result = await this.processItem(items[idx]);
accumulated.push(result);
// Checkpoint every 1000 items
if ((idx + 1) % 1000 === 0) {
await this.checkpointYield({
cursor: startItem + idx + 1,
itemsProcessed: accumulated.length,
accumulatedResults: { processed: accumulated }
});
// Handler execution stops here and resumes on re-dispatch
}
}
// Return final success result
return this.successResult({ results: { allProcessed: accumulated } });
}
}
BatchWorkerContext Properties (TypeScript):
- checkpointCursor: number | string | Record<string, unknown> | undefined
- accumulatedResults: Record<string, unknown> | undefined
- hasCheckpoint(): boolean
- checkpointItemsProcessed: number
Checkpoint Data Structure
Checkpoints are persisted in the checkpoint JSONB column on workflow_steps:
{
"cursor": 1000,
"items_processed": 1000,
"timestamp": "2026-01-06T12:00:00Z",
"accumulated_results": {
"processed": ["item1", "item2", "..."]
},
"history": [
{"cursor": 500, "timestamp": "2026-01-06T11:59:30Z"},
{"cursor": 1000, "timestamp": "2026-01-06T12:00:00Z"}
]
}
Fields:
- cursor - Flexible JSON value representing position (integer, string, or object)
- items_processed - Total items processed at this checkpoint
- timestamp - ISO 8601 timestamp when checkpoint was created
- accumulated_results - Optional intermediate results to preserve
- history - Array of previous checkpoint positions (appended automatically)
Checkpoint Flow
┌──────────────────────────────────────────────────────────────────┐
│ Handler calls checkpoint_yield(cursor, items_processed, ...) │
└───────────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ FFI Bridge: checkpoint_yield_step_event() │
│ Converts language-specific types to CheckpointYieldData │
└───────────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ CheckpointService::save_checkpoint() │
│ - Atomic SQL update with history append │
│ - Uses JSONB jsonb_set for history array │
└───────────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Worker re-dispatches step via internal MPSC channel │
│ - Step stays InProgress (not released) │
│ - Re-queued for immediate processing │
└───────────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Handler resumes with checkpoint data in workflow_step │
│ - BatchWorkerContext provides checkpoint accessors │
│ - Handler continues from saved cursor position │
└──────────────────────────────────────────────────────────────────┘
Failure and Recovery
Transient Failure After Checkpoint:
- Handler checkpoints at position 500
- Handler fails at position 750 (transient error)
- Step is retried (standard retry semantics)
- Handler resumes from checkpoint (position 500)
- Items 500-750 are reprocessed (idempotency required)
- Processing continues to completion
Permanent Failure:
- Handler checkpoints at position 500
- Handler encounters non-retryable error
- Step transitions to Error state
- Checkpoint data preserved for operator inspection
- Manual intervention may use checkpoint to resume later
Best Practices
Checkpoint Frequency:
- Too frequent: Overhead dominates (database writes, re-dispatch latency)
- Too infrequent: Lost progress on failure, long recovery time
- Rule of thumb: Checkpoint every 1-5 minutes of work, or every 1000-10000 items
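The rule of thumb above combines two triggers — an item-count interval and an elapsed-time budget. A minimal sketch (parameter defaults are illustrative, not engine defaults):

```typescript
// Checkpoint when either the item interval or the time budget is hit,
// whichever comes first. Defaults here are illustrative only.
function shouldCheckpoint(
  itemsSinceCheckpoint: number,
  msSinceCheckpoint: number,
  itemInterval = 1000, // checkpoint every N items
  timeBudgetMs = 60_000 // ...or every minute of work
): boolean {
  return itemsSinceCheckpoint >= itemInterval || msSinceCheckpoint >= timeBudgetMs;
}
```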
Accumulated Results:
- Keep accumulated results small (summaries, counts, IDs)
- For large result sets, write to external storage and store reference
- Unbounded accumulated results can cause performance degradation
Cursor Design:
- Use monotonic cursors (integers, timestamps) when possible
- Complex cursors (objects) are supported but harder to debug
- Cursor must uniquely identify resume position
Idempotency:
- Items between last checkpoint and failure will be reprocessed
- Ensure item processing is idempotent or use deduplication
- Consider storing processed item IDs in accumulated_results
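One way to apply the last bullet: carry processed item IDs in accumulated_results and skip replayed items on resume. A sketch under the assumption that items have stable string IDs (the function name is hypothetical, not a Tasker helper):

```typescript
// Deduplicate replayed items after a checkpoint resume by tracking
// processed IDs. Illustrative only; not part of the Tasker API.
function processWithDedup(
  items: Array<{ id: string; value: number }>,
  alreadyProcessed: string[]
): { total: number; processedIds: string[] } {
  const seen = new Set(alreadyProcessed);
  let total = 0;
  for (const item of items) {
    if (seen.has(item.id)) continue; // replayed after resume - skip
    total += item.value;
    seen.add(item.id);
  }
  return { total, processedIds: [...seen] };
}
```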
Monitoring
Checkpoint Events (logged automatically):
INFO checkpoint_yield_step_event step_uuid=abc cursor=1000 items_processed=1000
INFO checkpoint_saved step_uuid=abc history_length=2
Metrics to Monitor:
- Checkpoint frequency per step
- Average items processed between checkpoints
- Checkpoint history size (detect unbounded growth)
- Re-dispatch latency after checkpoint
Known Limitations
History Array Growth: The history array grows with each checkpoint. For very long-running processes with frequent checkpoints, this can lead to large JSONB values. Consider:
- Setting a maximum history length (future enhancement)
- Clearing history on step completion
- Using external storage for detailed history
Accumulated Results Size: No built-in size limit on accumulated_results. Handlers must self-regulate to prevent database bloat. Consider:
- Storing summaries instead of raw data
- Using external storage for large intermediate results
- Implementing size checks before checkpoint
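A size check before checkpointing can be as simple as measuring the serialized payload. The threshold and fallback below are illustrative; real handlers might write the full payload to external storage and keep only a reference:

```typescript
// Guard accumulated_results size before checkpointing. The 64 KiB
// threshold is an illustrative choice, not an engine limit.
function guardAccumulatedSize(
  accumulated: Record<string, unknown>,
  maxBytes = 64 * 1024
): Record<string, unknown> {
  const size = new TextEncoder().encode(JSON.stringify(accumulated)).length;
  if (size <= maxBytes) return accumulated;
  // Too large: replace raw data with a marker; a real handler would
  // spill to external storage and store a reference here instead.
  return { truncated: true, original_bytes: size };
}
```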
Workflow Pattern
Template Definition
Batch processing workflows use three step types in YAML templates:
name: csv_product_inventory_analyzer
namespace_name: csv_processing
version: "1.0.0"
steps:
# BATCHABLE STEP: Analyzes dataset and decides batching strategy
- name: analyze_csv
type: batchable
dependencies: []
handler:
callable: BatchProcessing::CsvAnalyzerHandler
initialization:
batch_size: 200
max_workers: 5
# BATCH WORKER TEMPLATE: Single batch processing unit
# Orchestration creates N instances from this template at runtime
- name: process_csv_batch
type: batch_worker
dependencies:
- analyze_csv
lifecycle:
max_steps_in_process_minutes: 120
max_retries: 3
backoff_multiplier: 2.0
handler:
callable: BatchProcessing::CsvBatchProcessorHandler
initialization:
operation: "inventory_analysis"
# DEFERRED CONVERGENCE STEP: Aggregates results from all workers
- name: aggregate_csv_results
type: deferred_convergence
dependencies:
- process_csv_batch # Template dependency - resolves to all instances
handler:
callable: BatchProcessing::CsvResultsAggregatorHandler
initialization:
aggregation_type: "inventory_metrics"
Runtime Execution Flow
1. Task Initialization
User creates task with context: { "csv_file_path": "/path/to/data.csv" }
↓
Task enters Initializing state
↓
Orchestration discovers ready steps: [analyze_csv]
2. Batchable Step Execution
analyze_csv step enqueued to worker queue
↓
Worker claims step, executes CsvAnalyzerHandler
↓
Handler counts rows: 1000
Handler calculates workers: 5 (200 rows each)
Handler generates cursor configs
Handler returns BatchProcessingOutcome::CreateBatches
↓
Step completes with batch_processing_outcome in results
3. Batch Worker Creation (Orchestration)
ResultProcessorActor processes analyze_csv completion
↓
Detects batch_processing_outcome in step results
↓
Sends ProcessBatchableStepMessage to BatchProcessingActor
↓
BatchProcessingService.process_batchable_step():
- Begins database transaction
- Creates 5 worker instances from process_csv_batch template:
* process_csv_batch_001 (cursor: rows 1-200)
* process_csv_batch_002 (cursor: rows 201-400)
* process_csv_batch_003 (cursor: rows 401-600)
* process_csv_batch_004 (cursor: rows 601-800)
* process_csv_batch_005 (cursor: rows 801-1000)
- Creates edges: analyze_csv → each worker
- Creates convergence step: aggregate_csv_results
- Creates edges: each worker → aggregate_csv_results
- Commits transaction (all-or-nothing)
↓
Workers enqueued to worker queue with PGMQ notifications
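The worker ranges in step 3 follow from a ceiling division of the row count by the batch size. A sketch of that arithmetic (row ranges here are inclusive on both ends, matching the "rows 1-200" notation above):

```typescript
// Partition totalItems rows into workers of batchSize each, producing
// the zero-padded batch IDs and row ranges shown in step 3.
function partition(
  totalItems: number,
  batchSize: number
): Array<{ batchId: string; start: number; end: number }> {
  const workerCount = Math.ceil(totalItems / batchSize);
  const configs: Array<{ batchId: string; start: number; end: number }> = [];
  for (let i = 0; i < workerCount; i++) {
    configs.push({
      batchId: String(i + 1).padStart(3, "0"), // "001", "002", ...
      start: i * batchSize + 1, // 1-based, inclusive
      end: Math.min((i + 1) * batchSize, totalItems), // inclusive
    });
  }
  return configs;
}
```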
4. Parallel Worker Execution
5 workers execute in parallel:
Worker 001:
- Extracts cursor: start=1, end=200
- Processes CSV rows 1-200
- Returns: processed_count=200, metrics={...}
Worker 002:
- Extracts cursor: start=201, end=400
- Processes CSV rows 201-400
- Returns: processed_count=200, metrics={...}
... (workers 003-005 similar)
All workers complete
5. Convergence Step Execution
Orchestration discovers aggregate_csv_results is ready
(all parent workers completed - intersection semantics)
↓
aggregate_csv_results enqueued to worker queue
↓
Worker claims step, executes CsvResultsAggregatorHandler
↓
Handler detects scenario: WithBatches (5 workers)
Handler aggregates results from all 5 workers:
- total_processed: 1000
- total_inventory_value: $XXX,XXX.XX
- category_counts: {electronics: 300, clothing: 250, ...}
Handler returns aggregated metrics
↓
Step completes
6. Task Completion
Orchestration detects all steps complete
↓
TaskFinalizerActor finalizes task
↓
Task state: Complete
NoBatches Scenario Flow
When dataset is too small or empty:
analyze_csv determines dataset too small (e.g., 50 rows < 200 batch_size)
↓
Returns BatchProcessingOutcome::NoBatches
↓
Orchestration creates single placeholder worker:
- process_csv_batch_001 (is_no_op: true)
- No cursor configuration needed
- Still maintains DAG structure
↓
Placeholder worker executes:
- Detects is_no_op flag
- Returns immediately with no_op: true
- No actual data processing
↓
Convergence step detects NoBatches scenario:
- Uses batchable step result directly
- Returns zero metrics or empty aggregation
Why placeholder workers?
- Maintains consistent DAG structure
- Convergence step logic handles both scenarios uniformly
- No special-case orchestration logic needed
- Standard retry/DLQ mechanics still apply
Data Structures
BatchProcessingOutcome
Location: tasker-shared/src/messaging/execution_types.rs
Purpose: Returned by batchable handlers to instruct orchestration.
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BatchProcessingOutcome {
/// No batching needed - create placeholder worker
NoBatches,
/// Create N batch workers with cursor configurations
CreateBatches {
/// Template step name (e.g., "process_csv_batch")
worker_template_name: String,
/// Number of workers to create
worker_count: u32,
/// Cursor configurations for each worker
cursor_configs: Vec<CursorConfig>,
/// Total items across all batches
total_items: u64,
},
}
impl BatchProcessingOutcome {
pub fn no_batches() -> Self {
BatchProcessingOutcome::NoBatches
}
pub fn create_batches(
worker_template_name: String,
worker_count: u32,
cursor_configs: Vec<CursorConfig>,
total_items: u64,
) -> Self {
BatchProcessingOutcome::CreateBatches {
worker_template_name,
worker_count,
cursor_configs,
total_items,
}
}
pub fn to_value(&self) -> serde_json::Value {
serde_json::to_value(self).unwrap_or(json!({}))
}
}
}
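With serde's tag = "type" and snake_case renaming, the outcome serializes as an internally tagged JSON object. A TypeScript mirror of that wire shape (the type and helper below are illustrative, not part of @tasker-systems/tasker):

```typescript
// Mirror of the JSON produced by the serde-tagged enum above.
type BatchProcessingOutcome =
  | { type: "no_batches" }
  | {
      type: "create_batches";
      worker_template_name: string;
      worker_count: number;
      cursor_configs: Array<Record<string, unknown>>;
      total_items: number;
    };

function createBatches(
  workerTemplateName: string,
  workerCount: number,
  cursorConfigs: Array<Record<string, unknown>>,
  totalItems: number
): BatchProcessingOutcome {
  return {
    type: "create_batches",
    worker_template_name: workerTemplateName,
    worker_count: workerCount,
    cursor_configs: cursorConfigs,
    total_items: totalItems,
  };
}
```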
Ruby Mirror: workers/ruby/lib/tasker_core/types/batch_processing_outcome.rb
module TaskerCore
module Types
module BatchProcessingOutcome
class NoBatches < Dry::Struct
attribute :type, Types::String.default('no_batches')
def to_h
{ 'type' => 'no_batches' }
end
def requires_batch_creation?
false
end
end
class CreateBatches < Dry::Struct
attribute :type, Types::String.default('create_batches')
attribute :worker_template_name, Types::Strict::String
attribute :worker_count, Types::Coercible::Integer.constrained(gteq: 1)
attribute :cursor_configs, Types::Array.of(Types::Hash).constrained(min_size: 1)
attribute :total_items, Types::Coercible::Integer.constrained(gteq: 0)
def to_h
{
'type' => 'create_batches',
'worker_template_name' => worker_template_name,
'worker_count' => worker_count,
'cursor_configs' => cursor_configs,
'total_items' => total_items
}
end
def requires_batch_creation?
true
end
end
class << self
def no_batches
NoBatches.new
end
def create_batches(worker_template_name:, worker_count:, cursor_configs:, total_items:)
CreateBatches.new(
worker_template_name: worker_template_name,
worker_count: worker_count,
cursor_configs: cursor_configs,
total_items: total_items
)
end
end
end
end
end
CursorConfig
Location: tasker-shared/src/messaging/execution_types.rs
Purpose: Defines batch boundaries for each worker.
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub struct CursorConfig {
/// Batch identifier (e.g., "001", "002", "003")
pub batch_id: String,
/// Starting position (inclusive) - flexible JSON value
pub start_cursor: serde_json::Value,
/// Ending position (exclusive) - flexible JSON value
pub end_cursor: serde_json::Value,
/// Number of items in this batch
pub batch_size: u32,
}
}
Design Notes:
- Cursor values use serde_json::Value for flexibility - supports integers, strings, timestamps, UUIDs, etc.
- Batch IDs are zero-padded strings for consistent ordering
- start_cursor is inclusive, end_cursor is exclusive
Example Cursor Configs:
// Numeric cursors (CSV row numbers)
{
"batch_id": "001",
"start_cursor": 1,
"end_cursor": 200,
"batch_size": 200
}
// Timestamp cursors (event processing)
{
"batch_id": "002",
"start_cursor": "2025-01-01T00:00:00Z",
"end_cursor": "2025-01-01T01:00:00Z",
"batch_size": 3600
}
// UUID cursors (database pagination)
{
"batch_id": "003",
"start_cursor": "00000000-0000-0000-0000-000000000000",
"end_cursor": "3fffffff-ffff-ffff-ffff-ffffffffffff",
"batch_size": 1000000
}
BatchWorkerInputs
Location: tasker-shared/src/models/core/batch_worker.rs
Purpose: Stored in workflow_steps.inputs for each worker instance.
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct BatchWorkerInputs {
/// Cursor configuration defining this worker's batch range
pub cursor: CursorConfig,
/// Batch processing metadata
pub batch_metadata: BatchMetadata,
/// Flag indicating if this is a placeholder worker (NoBatches scenario)
#[serde(default)]
pub is_no_op: bool,
}
impl BatchWorkerInputs {
pub fn new(
cursor_config: CursorConfig,
batch_config: &BatchConfiguration,
is_no_op: bool,
) -> Self {
Self {
cursor: cursor_config,
batch_metadata: BatchMetadata {
checkpoint_interval: batch_config.checkpoint_interval,
cursor_field: batch_config.cursor_field.clone(),
failure_strategy: batch_config.failure_strategy.clone(),
},
is_no_op,
}
}
}
}
Storage Location:
- ✅ workflow_steps.inputs (instance-specific runtime data)
- ❌ NOT in step_definition.handler.initialization (that’s the template)
BatchMetadata
Location: tasker-shared/src/models/core/batch_worker.rs
Purpose: Runtime configuration for batch processing behavior.
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct BatchMetadata {
/// Checkpoint frequency (every N items)
pub checkpoint_interval: u32,
/// Field name used for cursor tracking (e.g., "id", "row_number")
pub cursor_field: String,
/// How to handle failures during batch processing
pub failure_strategy: FailureStrategy,
}
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum FailureStrategy {
/// Fail immediately if any item fails
FailFast,
/// Continue processing remaining items, report failures at end
ContinueOnFailure,
/// Isolate failed items to separate queue
IsolateFailed,
}
}
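The ContinueOnFailure strategy can be sketched as a loop that collects failures rather than aborting on the first one. Function and result shape here are illustrative:

```typescript
// ContinueOnFailure semantics: process every item, collect per-item
// failures, and report them at the end.
function processBatchContinueOnFailure<T>(
  items: T[],
  processItem: (item: T) => void
): { processed: number; failures: Array<{ index: number; error: string }> } {
  const failures: Array<{ index: number; error: string }> = [];
  let processed = 0;
  for (let i = 0; i < items.length; i++) {
    try {
      processItem(items[i]);
      processed++;
    } catch (e) {
      failures.push({ index: i, error: String(e) });
    }
  }
  return { processed, failures };
}
```

FailFast, by contrast, would rethrow on the first caught error; IsolateFailed would route the failed items to a separate queue instead of only recording them.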
Implementation Patterns
Rust Implementation
1. Batchable Handler Pattern:
#![allow(unused)]
fn main() {
use tasker_shared::messaging::execution_types::{BatchProcessingOutcome, CursorConfig};
use serde_json::json;
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// 1. Analyze dataset
let dataset_size = analyze_dataset(step_data)?;
let batch_size = get_batch_size_from_config(step_data)?;
// 2. Check if batching needed
if dataset_size == 0 || dataset_size < batch_size {
let outcome = BatchProcessingOutcome::no_batches();
return Ok(success_result(
step_uuid,
json!({ "batch_processing_outcome": outcome.to_value() }),
elapsed_ms,
None,
));
}
// 3. Calculate worker count
let worker_count = (dataset_size as f64 / batch_size as f64).ceil() as u32;
// 4. Generate cursor configs
let cursor_configs = create_cursor_configs(dataset_size, worker_count, batch_size);
// 5. Return CreateBatches outcome
let outcome = BatchProcessingOutcome::create_batches(
"worker_template_name".to_string(),
worker_count,
cursor_configs,
dataset_size,
);
Ok(success_result(
step_uuid,
json!({
"batch_processing_outcome": outcome.to_value(),
"worker_count": worker_count,
"total_items": dataset_size
}),
elapsed_ms,
None,
))
}
fn create_cursor_configs(
total_items: u64,
worker_count: u32,
batch_size: u64,
) -> Vec<CursorConfig> {
let mut cursor_configs = Vec::new();
let items_per_worker = (total_items as f64 / worker_count as f64).ceil() as u64;
for i in 0..worker_count {
let start_position = i as u64 * items_per_worker;
let end_position = ((i + 1) as u64 * items_per_worker).min(total_items);
cursor_configs.push(CursorConfig {
batch_id: format!("{:03}", i + 1),
start_cursor: json!(start_position),
end_cursor: json!(end_position),
batch_size: (end_position - start_position) as u32,
});
}
cursor_configs
}
}
2. Batch Worker Handler Pattern:
#![allow(unused)]
fn main() {
use tasker_worker::batch_processing::BatchWorkerContext;
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// 1. Extract batch worker context using helper
let context = BatchWorkerContext::from_step_data(step_data)?;
// 2. Check for no-op placeholder worker
if context.is_no_op() {
return Ok(success_result(
step_uuid,
json!({
"no_op": true,
"reason": "NoBatches scenario",
"batch_id": context.batch_id()
}),
elapsed_ms,
None,
));
}
// 3. Extract cursor range
let start_idx = context.start_position();
let end_idx = context.end_position();
let checkpoint_interval = context.checkpoint_interval();
// 4. Process items in range
let mut processed_count = 0;
let mut results = initialize_results();
for idx in start_idx..end_idx {
// Process item
let item = get_item(idx)?;
update_results(&mut results, item);
processed_count += 1;
// 5. Checkpoint progress periodically
if processed_count % checkpoint_interval == 0 {
checkpoint_progress(step_uuid, idx).await?;
}
}
// 6. Return results for aggregation
Ok(success_result(
step_uuid,
json!({
"processed_count": processed_count,
"results": results,
"batch_id": context.batch_id(),
"start_position": start_idx,
"end_position": end_idx
}),
elapsed_ms,
None,
))
}
}
3. Convergence Handler Pattern:
#![allow(unused)]
fn main() {
use tasker_worker::batch_processing::BatchAggregationScenario;
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// 1. Detect scenario using helper
let scenario = BatchAggregationScenario::detect(
&step_data.dependency_results,
"batchable_step_name",
"batch_worker_prefix_",
)?;
// 2. Handle both scenarios
let aggregated_results = match scenario {
BatchAggregationScenario::NoBatches { batchable_result } => {
// Get dataset info from batchable step
let total_items = batchable_result
.result.get("total_items")
.and_then(|v| v.as_u64())
.unwrap_or(0);
// No batch workers ran - report totals from the batchable step directly
json!({
"total_processed": total_items,
"worker_count": 0
})
}
BatchAggregationScenario::WithBatches { batch_results, worker_count } => {
// Aggregate results from all workers
let mut total_processed = 0u64;
for (step_name, result) in batch_results {
total_processed += result.result
.get("processed_count")
.and_then(|v| v.as_u64())
.unwrap_or(0);
// Additional aggregation logic...
}
json!({
"total_processed": total_processed,
"worker_count": worker_count
})
}
};
// 3. Return aggregated results
Ok(success_result(
step_uuid,
aggregated_results,
elapsed_ms,
None,
))
}
}
Ruby Implementation
1. Batchable Handler Pattern (using Batchable base class):
module BatchProcessing
class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
def call(context)
csv_file_path = context.get_task_field('csv_file_path')
total_rows = count_csv_rows(csv_file_path)
# Get batch configuration
batch_size = context.step_config['batch_size'] || 200
max_workers = context.step_config['max_workers'] || 5
# Calculate worker count
worker_count = [(total_rows.to_f / batch_size).ceil, max_workers].min
if worker_count == 0 || total_rows == 0
# Use helper for NoBatches outcome
return no_batches_success(
reason: 'dataset_too_small',
total_rows: total_rows
)
end
# Generate cursor configs using helper
cursor_configs = generate_cursor_configs(
total_items: total_rows,
worker_count: worker_count
)
# Use helper for CreateBatches outcome
create_batches_success(
worker_template_name: 'process_csv_batch',
worker_count: worker_count,
cursor_configs: cursor_configs,
total_items: total_rows,
additional_data: {
'csv_file_path' => csv_file_path
}
)
end
private
def count_csv_rows(csv_file_path)
CSV.read(csv_file_path, headers: true).length
end
end
end
2. Batch Worker Handler Pattern (using Batchable base class):
module BatchProcessing
class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
def call(context)
# Extract batch context using helper
batch_ctx = get_batch_context(context)
# Use helper to check for no-op worker
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Get CSV file path from dependency results
csv_file_path = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
# Process CSV rows in cursor range
metrics = process_csv_batch(
csv_file_path,
batch_ctx.start_cursor,
batch_ctx.end_cursor
)
# Return results for aggregation
success(
result_data: {
'processed_count' => metrics[:processed_count],
'total_inventory_value' => metrics[:total_inventory_value],
'category_counts' => metrics[:category_counts],
'batch_id' => batch_ctx.batch_id
}
)
end
private
def process_csv_batch(csv_file_path, start_row, end_row)
metrics = {
processed_count: 0,
total_inventory_value: 0.0,
category_counts: Hash.new(0)
}
CSV.foreach(csv_file_path, headers: true).with_index(1) do |row, data_row_num|
next if data_row_num < start_row
break if data_row_num >= end_row
product = parse_product(row)
metrics[:total_inventory_value] += product.price * product.stock
metrics[:category_counts][product.category] += 1
metrics[:processed_count] += 1
end
metrics
end
end
end
3. Convergence Handler Pattern (using Batchable base class):
module BatchProcessing
class CsvResultsAggregatorHandler < TaskerCore::StepHandler::Batchable
def call(context)
# Detect scenario using helper
scenario = detect_aggregation_scenario(
context,
batchable_step_name: 'analyze_csv',
batch_worker_prefix: 'process_csv_batch_'
)
# Use helper for aggregation with custom block
aggregate_batch_worker_results(scenario) do |batch_results|
# Custom aggregation logic
total_processed = 0
total_inventory_value = 0.0
global_category_counts = Hash.new(0)
batch_results.each do |step_name, result|
total_processed += result.dig('result', 'processed_count') || 0
total_inventory_value += result.dig('result', 'total_inventory_value') || 0.0
(result.dig('result', 'category_counts') || {}).each do |category, count|
global_category_counts[category] += count
end
end
{
'total_processed' => total_processed,
'total_inventory_value' => total_inventory_value,
'category_counts' => global_category_counts,
'worker_count' => batch_results.size
}
end
end
end
end
Use Cases
1. Large Dataset Processing
Scenario: Process millions of records from a database, file, or API.
Why Batch Processing?
- Single worker would timeout
- Memory constraints prevent loading entire dataset
- Want parallelism for speed
Example: Product catalog synchronization
Batchable: Analyze product table (5 million products)
Workers: 100 workers × 50,000 products each
Convergence: Aggregate sync statistics
Result: 5M products synced in 10 minutes vs 2 hours sequential
2. Time-Based Event Processing
Scenario: Process events from a time-series database or log aggregation system.
Why Batch Processing?
- Events span long time ranges
- Want to process hourly/daily chunks in parallel
- Need resumability for long-running processing
Example: Analytics event processing
Batchable: Analyze events (30 days × 24 hours)
Workers: 720 workers (1 per hour)
Cursors: Timestamp ranges (2025-01-01T00:00 to 2025-01-01T01:00)
Convergence: Aggregate daily/monthly metrics
3. Multi-Source Data Integration
Scenario: Fetch data from multiple external APIs or services.
Why Batch Processing?
- Each source is independent
- Want parallel fetching for speed
- Some sources may fail (need retry per source)
Example: Third-party data enrichment
Batchable: Analyze customer list (partition by data provider)
Workers: 5 workers (1 per provider: Stripe, Salesforce, HubSpot, etc.)
Cursors: Provider-specific identifiers
Convergence: Merge enriched customer profiles
4. Bulk File Processing
Scenario: Process multiple files (CSVs, images, documents).
Why Batch Processing?
- Each file is independent processing unit
- Want parallelism across files
- File sizes vary (dynamic batch sizing)
Example: Image transformation pipeline
Batchable: List S3 bucket objects (1000 images)
Workers: 20 workers × 50 images each
Cursors: S3 object key ranges
Convergence: Verify all images transformed
5. Geographical Data Partitioning
Scenario: Process data partitioned by geography (regions, countries, cities).
Why Batch Processing?
- Geographic boundaries provide natural partitions
- Want parallel processing per region
- Different regions may have different data volumes
Example: Regional sales report generation
Batchable: Analyze sales data (50 US states)
Workers: 50 workers (1 per state)
Cursors: State codes (AL, AK, AZ, ...)
Convergence: National sales dashboard
Operator Workflows
Batch processing integrates seamlessly with the DLQ (Dead Letter Queue) system for operator visibility and manual intervention. This section shows how operators manage failed batch workers.
DLQ Integration Principles
From DLQ System Documentation:
- Investigation Tracking Only: DLQ tracks “why task is stuck” and “who investigated” - it doesn’t manipulate tasks
- Step-Level Resolution: Operators fix problem steps using step APIs, not task-level operations
- Three Resolution Types:
- ResetForRetry: Reset attempts, return step to pending (cursor preserved)
- ResolveManually: Skip step, mark resolved without results
- CompleteManually: Provide manual results for dependent steps
Key for Batch Processing: Cursor data in workflow_steps.results is preserved during ResetForRetry, enabling resumability without data loss.
Staleness Detection for Batch Workers
Batch workers have two staleness detection mechanisms:
1. Duration-Based (Standard):
lifecycle:
max_steps_in_process_minutes: 120 # DLQ threshold
If worker stays in InProgress state for > 120 minutes, flagged as stale.
2. Checkpoint-Based (Batch-Specific):
#![allow(unused)]
fn main() {
// Workers checkpoint progress periodically
if processed_count % checkpoint_interval == 0 {
checkpoint_progress(step_uuid, current_cursor).await?;
}
}
If last checkpoint timestamp is too old, flagged as stale even if within duration threshold.
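The two mechanisms combine into a simple predicate; a sketch with illustrative thresholds (the checkpoint-age threshold is an assumption, not a documented default):

```typescript
// Combine duration-based and checkpoint-based staleness detection.
// Thresholds are illustrative; max_steps_in_process_minutes comes from
// the step's lifecycle config.
function isStale(
  minutesInProgress: number,
  minutesSinceCheckpoint: number | null, // null = never checkpointed
  maxInProcessMinutes = 120,
  maxCheckpointAgeMinutes = 15
): boolean {
  // Duration-based: too long in InProgress overall
  if (minutesInProgress > maxInProcessMinutes) return true;
  // Checkpoint-based: progress has stalled even within the duration budget
  if (minutesSinceCheckpoint !== null && minutesSinceCheckpoint > maxCheckpointAgeMinutes)
    return true;
  return false;
}
```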
Common Operator Scenarios
Scenario 1: Transient Database Failure
Problem: 3 out of 5 batch workers failed due to database connection timeout.
Step 1: Find the stuck task in DLQ:
# Get investigation queue (prioritized by age and reason)
curl http://localhost:8080/v1/dlq/investigation-queue | jq
Step 2: Get task details and identify failed workers:
-- Get DLQ entry for the task
SELECT
dlq.dlq_entry_uuid,
dlq.task_uuid,
dlq.dlq_reason,
dlq.resolution_status,
dlq.task_snapshot->'workflow_steps' as steps
FROM tasker.tasks_dlq dlq
WHERE dlq.task_uuid = 'task-uuid-here'
AND dlq.resolution_status = 'pending';
-- Query task's workflow steps to find failed batch workers
SELECT
ws.workflow_step_uuid,
ws.name,
ws.current_state,
ws.attempts,
ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = 'task-uuid-here'
AND ws.name LIKE 'process_csv_batch_%'
AND ws.current_state = 'Error';
Result:
workflow_step_uuid | name | current_state | attempts | last_error
-------------------|------------------------|---------------|----------|------------------
uuid-worker-2 | process_csv_batch_002 | Error | 3 | DB timeout
uuid-worker-4 | process_csv_batch_004 | Error | 3 | DB timeout
uuid-worker-5 | process_csv_batch_005 | Error | 3 | DB timeout
Operator Action: Database is now healthy - reset workers for retry
# Get task UUID from DLQ entry
TASK_UUID="abc-123-task-uuid"
# Reset worker 2 (preserves cursor: rows 201-400)
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-2 \
-H "Content-Type: application/json" \
-d '{
"action_type": "reset_for_retry",
"reset_by": "operator@example.com",
"reason": "Database connection restored, resetting attempts"
}'
# Reset workers 4 and 5
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-4 \
-H "Content-Type: application/json" \
-d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Database connection restored"}'
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/uuid-worker-5 \
-H "Content-Type: application/json" \
-d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Database connection restored"}'
# Update DLQ entry to track resolution
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
-H "Content-Type: application/json" \
-d '{
"resolution_status": "manually_resolved",
"resolution_notes": "Reset 3 failed batch workers after database connection restored",
"resolved_by": "operator@example.com"
}'
Result:
- Workers 2, 4, 5 return to Pending state
- Cursor configs preserved in workflow_steps.inputs
- Retry attempt counter reset to 0
- Workers re-enqueued for execution
- DLQ entry updated with resolution metadata
Scenario 2: Bad Data in Specific Batch
Problem: Worker 3 repeatedly fails due to malformed CSV row in its range (rows 401-600).
Investigation:
-- Get worker details
SELECT
ws.name,
ws.current_state,
ws.attempts,
ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.workflow_step_uuid = 'uuid-worker-3';
Result:
name: process_csv_batch_003
current_state: Error
attempts: 3
last_error: "CSV parsing failed at row 523: invalid price format"
Operator Decision: Row 523 has known data quality issue, already fixed in source system.
Option 1: Complete Manually (provide results for this batch):
TASK_UUID="abc-123-task-uuid"
STEP_UUID="uuid-worker-3"
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/${STEP_UUID} \
-H "Content-Type: application/json" \
-d '{
"action_type": "complete_manually",
"completion_data": {
"result": {
"processed_count": 199,
"total_inventory_value": 45232.50,
"category_counts": {"electronics": 150, "clothing": 49},
"batch_id": "003",
"note": "Row 523 skipped due to data quality issue, manually verified totals"
},
"metadata": {
"manually_verified": true,
"verification_method": "manual_inspection",
"skipped_rows": [523]
}
},
"reason": "Manual completion after verifying corrected data in source system",
"completed_by": "operator@example.com"
}'
Option 2: Resolve Manually (skip this batch):
curl -X PATCH http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps/${STEP_UUID} \
-H "Content-Type: application/json" \
-d '{
"action_type": "resolve_manually",
"resolved_by": "operator@example.com",
"reason": "Non-critical batch containing known bad data, skipping 200 rows out of 1000 total"
}'
Result (Option 1):
- Worker 3 marked Complete with manual results
- Convergence step receives manual results in aggregation
- Task completes successfully with note about manual intervention
Result (Option 2):
- Worker 3 marked ResolvedManually (no results provided)
- Convergence step detects missing results, adjusts aggregation
- Task completes with reduced total (800 rows instead of 1000)
Scenario 3: Long-Running Worker Needs Checkpoint
Problem: Worker 1 processing 10,000 rows, operator notices it’s been running 90 minutes (threshold: 120 minutes).
Investigation:
-- Check last checkpoint
SELECT
ws.name,
ws.current_state,
ws.results->>'last_checkpoint_cursor' as last_checkpoint,
ws.results->>'checkpoint_timestamp' as checkpoint_time,
NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint
FROM tasker.workflow_steps ws
WHERE ws.workflow_step_uuid = 'uuid-worker-1';
Result:
name: process_large_batch_001
current_state: InProgress
last_checkpoint: 7850
checkpoint_time: 2025-01-15 11:30:00
time_since_checkpoint: 00:05:00
Operator Action: Worker is healthy and making progress (checkpointed 5 minutes ago at row 7850).
No action needed - worker will complete normally. Operator adds investigation note to DLQ entry:
DLQ_ENTRY_UUID="dlq-entry-uuid-here"
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
-H "Content-Type: application/json" \
-d '{
"metadata": {
"investigation_notes": "Worker healthy, last checkpoint at row 7850 (5 min ago), estimated 15 min remaining",
"investigator": "operator@example.com",
"timestamp": "2025-01-15T11:35:00Z",
"action_taken": "none - monitoring"
}
}'
Scenario 4: All Workers Failed - Batch Strategy Issue
Problem: All 10 workers fail with “memory exhausted” error - batch size too large.
Investigation via API:
TASK_UUID="task-uuid-here"
# Get task details including all workflow steps
curl http://localhost:8080/v1/tasks/${TASK_UUID}/workflow_steps | jq '.[] | select(.name | startswith("process_large_batch_")) | {name, current_state, attempts, last_error}'
Result: All workers show current_state: "Error" with same OOM error in last_error.
Operator Action: Cancel entire task, will re-run with smaller batch size.
DLQ_ENTRY_UUID="dlq-entry-uuid-here"
# Cancel task (cancels all workers)
curl -X DELETE http://localhost:8080/v1/tasks/${TASK_UUID}
# Update DLQ entry to track resolution
curl -X PATCH http://localhost:8080/v1/dlq/entry/${DLQ_ENTRY_UUID} \
-H "Content-Type: application/json" \
-d '{
"resolution_status": "permanently_failed",
"resolution_notes": "Batch size too large causing OOM. Cancelled task and created new task with batch_size: 5000 instead of 10000",
"resolved_by": "operator@example.com",
"metadata": {
"root_cause": "configuration_error",
"corrective_action": "reduced_batch_size",
"new_task_uuid": "new-task-uuid-here"
}
}'
# Create new task with corrected configuration
curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"namespace": "data_processing",
"template_name": "large_dataset_processor",
"context": {
"dataset_id": "dataset-123",
"batch_size": 5000,
"max_workers": 20
}
}'
DLQ Query Patterns for Batch Processing
1. Find DLQ entry for a batch processing task:
-- Get DLQ entry with task snapshot
SELECT
dlq.dlq_entry_uuid,
dlq.task_uuid,
dlq.dlq_reason,
dlq.resolution_status,
dlq.dlq_timestamp,
dlq.resolution_notes,
dlq.resolved_by,
dlq.task_snapshot->'namespace_name' as namespace,
dlq.task_snapshot->'template_name' as template,
dlq.task_snapshot->'current_state' as task_state
FROM tasker.tasks_dlq dlq
WHERE dlq.task_uuid = :task_uuid
AND dlq.resolution_status = 'pending'
ORDER BY dlq.dlq_timestamp DESC
LIMIT 1;
2. Check batch completion progress:
SELECT
COUNT(*) FILTER (WHERE ws.current_state = 'Complete') as completed_workers,
COUNT(*) FILTER (WHERE ws.current_state = 'InProgress') as in_progress_workers,
COUNT(*) FILTER (WHERE ws.current_state = 'Error') as failed_workers,
COUNT(*) FILTER (WHERE ws.current_state = 'Pending') as pending_workers,
COUNT(*) FILTER (WHERE ws.current_state = 'WaitingForRetry') as waiting_retry,
COUNT(*) as total_workers
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
AND ws.name LIKE 'process_%_batch_%';
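The state counts returned by this query translate directly into a progress summary. A minimal Ruby sketch, using hypothetical counts in place of the query result:

```ruby
# Hypothetical state counts, shaped like the batch-progress query result above
counts = {
  'Complete'        => 7,
  'InProgress'      => 2,
  'Error'           => 1,
  'Pending'         => 0,
  'WaitingForRetry' => 0
}

total = counts.values.sum
percent_complete = (100.0 * counts['Complete'] / total).round(1)

puts "#{counts['Complete']}/#{total} workers complete (#{percent_complete}%)"
```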
3. Find workers with stale checkpoints:
SELECT
ws.workflow_step_uuid,
ws.name,
ws.current_state,
ws.results->>'last_checkpoint_cursor' as checkpoint_cursor,
ws.results->>'checkpoint_timestamp' as checkpoint_time,
NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz as time_since_checkpoint,
ws.attempts,
ws.last_error
FROM tasker.workflow_steps ws
WHERE ws.task_uuid = :task_uuid
AND ws.name LIKE 'process_%_batch_%'
AND ws.current_state = 'InProgress'
AND ws.results->>'checkpoint_timestamp' IS NOT NULL
AND NOW() - (ws.results->>'checkpoint_timestamp')::timestamptz > interval '15 minutes'
ORDER BY time_since_checkpoint DESC;
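The same staleness test the query performs in SQL can be applied in application code. A minimal Ruby sketch with hypothetical timestamps (the 15-minute threshold matches the interval in the query above):

```ruby
require 'time'

# Hypothetical values standing in for checkpoint_timestamp and NOW()
checkpoint_time = Time.parse('2025-01-15T11:30:00Z')
now             = Time.parse('2025-01-15T11:50:00Z')

stale_threshold_seconds = 15 * 60
time_since_checkpoint   = now - checkpoint_time # seconds, as a Float

stale = time_since_checkpoint > stale_threshold_seconds
# 20 minutes since the last checkpoint exceeds the 15-minute threshold
```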
4. Get aggregated batch task health:
SELECT
t.task_uuid,
t.namespace_name,
t.template_name,
t.current_state as task_state,
t.execution_status,
COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as worker_count,
jsonb_object_agg(
ws.current_state,
COUNT(*)
) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as worker_states,
dlq.dlq_reason,
dlq.resolution_status
FROM tasker.tasks t
JOIN tasker.workflow_steps ws ON ws.task_uuid = t.task_uuid
LEFT JOIN tasker.tasks_dlq dlq ON dlq.task_uuid = t.task_uuid
AND dlq.resolution_status = 'pending'
WHERE t.task_uuid = :task_uuid
GROUP BY t.task_uuid, t.namespace_name, t.template_name, t.current_state, t.execution_status,
dlq.dlq_reason, dlq.resolution_status;
5. Find all batch tasks in DLQ:
-- Find tasks with batch workers that are stuck
SELECT
dlq.dlq_entry_uuid,
dlq.task_uuid,
dlq.dlq_reason,
dlq.dlq_timestamp,
t.namespace_name,
t.template_name,
t.current_state,
COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') as batch_worker_count,
COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.current_state = 'Error' AND ws.name LIKE 'process_%_batch_%') as failed_workers
FROM tasker.tasks_dlq dlq
JOIN tasker.tasks t ON t.task_uuid = dlq.task_uuid
JOIN tasker.workflow_steps ws ON ws.task_uuid = dlq.task_uuid
WHERE dlq.resolution_status = 'pending'
GROUP BY dlq.dlq_entry_uuid, dlq.task_uuid, dlq.dlq_reason, dlq.dlq_timestamp,
t.namespace_name, t.template_name, t.current_state
HAVING COUNT(DISTINCT ws.workflow_step_uuid) FILTER (WHERE ws.name LIKE 'process_%_batch_%') > 0
ORDER BY dlq.dlq_timestamp DESC;
Operator Dashboard Recommendations
For monitoring batch processing tasks, operators should have dashboards showing:
- Batch Progress:
  - Total workers vs completed workers
  - Estimated completion time based on worker velocity
  - Current throughput (items/second across all workers)
- Stale Worker Alerts:
  - Workers exceeding duration threshold
  - Workers with stale checkpoints
  - Workers with repeated failures
- Batch Health Metrics:
  - Success rate per batch
  - Average processing time per worker
  - Resource utilization (CPU, memory)
- Resolution Actions:
  - Recent operator interventions
  - Resolution action distribution (ResetForRetry vs ResolveManually)
  - Time to resolution for stale workers
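The estimated completion time above can be derived from worker velocity. A minimal sketch with hypothetical numbers:

```ruby
# Hypothetical dashboard inputs
completed_workers = 7
total_workers     = 10
elapsed_seconds   = 840.0

velocity_per_sec  = completed_workers / elapsed_seconds # workers per second
remaining_workers = total_workers - completed_workers
eta_seconds       = (remaining_workers / velocity_per_sec).round

# roughly 120 s per completed worker, so about 360 s remaining
```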
Code Examples
Complete Working Example: CSV Product Inventory
This section shows a complete end-to-end implementation processing a 1000-row CSV file in 5 parallel batches.
Rust Implementation
1. Batchable Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:60-150
#![allow(unused)]
fn main() {
pub struct CsvAnalyzerHandler;
#[async_trait]
impl StepHandler for CsvAnalyzerHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Get CSV file path from task context
let csv_file_path = step_data
.task
.context
.get("csv_file_path")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow!("Missing csv_file_path in task context"))?;
// Count total data rows (excluding header)
let file = File::open(csv_file_path)?;
let reader = BufReader::new(file);
let total_rows = reader.lines().count().saturating_sub(1) as u64;
info!("CSV Analysis: {} rows in {}", total_rows, csv_file_path);
// Get batch configuration
let handler_init = step_data.handler_initialization.as_object().unwrap();
let batch_size = handler_init
.get("batch_size")
.and_then(|v| v.as_u64())
.unwrap_or(200);
let max_workers = handler_init
.get("max_workers")
.and_then(|v| v.as_u64())
.unwrap_or(5);
// Determine if batching needed
if total_rows == 0 {
let outcome = BatchProcessingOutcome::no_batches();
let elapsed_ms = start_time.elapsed().as_millis() as u64;
return Ok(success_result(
step_uuid,
json!({
"batch_processing_outcome": outcome.to_value(),
"reason": "empty_dataset",
"total_rows": 0
}),
elapsed_ms,
None,
));
}
// Calculate worker count
let worker_count = ((total_rows as f64 / batch_size as f64).ceil() as u64)
.min(max_workers);
// Generate cursor configurations
let actual_batch_size = (total_rows as f64 / worker_count as f64).ceil() as u64;
let mut cursor_configs = Vec::new();
for i in 0..worker_count {
let start_row = (i * actual_batch_size) + 1; // 1-indexed after header
let end_row = ((i + 1) * actual_batch_size).min(total_rows) + 1;
cursor_configs.push(CursorConfig {
batch_id: format!("{:03}", i + 1),
start_cursor: json!(start_row),
end_cursor: json!(end_row),
batch_size: (end_row - start_row) as u32,
});
}
info!(
"Creating {} batch workers for {} rows (batch_size: {})",
worker_count, total_rows, actual_batch_size
);
// Return CreateBatches outcome
let outcome = BatchProcessingOutcome::create_batches(
"process_csv_batch".to_string(),
worker_count as u32,
cursor_configs,
total_rows,
);
let elapsed_ms = start_time.elapsed().as_millis() as u64;
Ok(success_result(
step_uuid,
json!({
"batch_processing_outcome": outcome.to_value(),
"worker_count": worker_count,
"total_rows": total_rows,
"csv_file_path": csv_file_path
}),
elapsed_ms,
Some(json!({
"batch_size": actual_batch_size,
"file_path": csv_file_path
})),
))
}
}
}
2. Batch Worker Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:200-350
#![allow(unused)]
fn main() {
pub struct CsvBatchProcessorHandler;
#[derive(Debug, Deserialize)]
struct Product {
id: u32,
title: String,
category: String,
price: f64,
stock: u32,
rating: f64,
}
#[async_trait]
impl StepHandler for CsvBatchProcessorHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Extract batch worker context using helper
let context = BatchWorkerContext::from_step_data(step_data)?;
// Check for no-op placeholder worker
if context.is_no_op() {
let elapsed_ms = start_time.elapsed().as_millis() as u64;
return Ok(success_result(
step_uuid,
json!({
"no_op": true,
"reason": "NoBatches scenario - no items to process",
"batch_id": context.batch_id()
}),
elapsed_ms,
None,
));
}
// Get CSV file path from dependency results
let csv_file_path = step_data
.dependency_results
.get("analyze_csv")
.and_then(|r| r.result.get("csv_file_path"))
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow!("Missing csv_file_path from analyze_csv"))?;
// Extract cursor range
let start_row = context.start_position();
let end_row = context.end_position();
info!(
"Processing batch {} (rows {}-{})",
context.batch_id(),
start_row,
end_row
);
// Initialize metrics
let mut processed_count = 0u64;
let mut total_inventory_value = 0.0;
let mut category_counts: HashMap<String, u32> = HashMap::new();
let mut max_price = 0.0;
let mut max_price_product = None;
let mut total_rating = 0.0;
// Open CSV and process rows in cursor range
let file = File::open(Path::new(csv_file_path))?;
let mut csv_reader = csv::ReaderBuilder::new()
.has_headers(true)
.from_reader(BufReader::new(file));
for (row_idx, result) in csv_reader.deserialize::<Product>().enumerate() {
let data_row_num = row_idx + 1; // 1-indexed after header
if data_row_num < start_row {
continue; // Skip rows before our range
}
if data_row_num >= end_row {
break; // Processed all our rows
}
let product: Product = result?;
// Calculate inventory metrics
let inventory_value = product.price * (product.stock as f64);
total_inventory_value += inventory_value;
*category_counts.entry(product.category.clone()).or_insert(0) += 1;
if product.price > max_price {
max_price = product.price;
max_price_product = Some(product.title.clone());
}
total_rating += product.rating;
processed_count += 1;
// Checkpoint progress periodically
if processed_count % context.checkpoint_interval() == 0 {
debug!(
"Checkpoint: batch {} processed {} items",
context.batch_id(),
processed_count
);
}
}
let average_rating = if processed_count > 0 {
total_rating / (processed_count as f64)
} else {
0.0
};
let elapsed_ms = start_time.elapsed().as_millis() as u64;
info!(
"Batch {} complete: {} items processed",
context.batch_id(),
processed_count
);
Ok(success_result(
step_uuid,
json!({
"processed_count": processed_count,
"total_inventory_value": total_inventory_value,
"category_counts": category_counts,
"max_price": max_price,
"max_price_product": max_price_product,
"average_rating": average_rating,
"batch_id": context.batch_id(),
"start_row": start_row,
"end_row": end_row
}),
elapsed_ms,
None,
))
}
}
}
3. Convergence Handler: workers/rust/src/step_handlers/batch_processing_products_csv.rs:400-520
#![allow(unused)]
fn main() {
pub struct CsvResultsAggregatorHandler;
#[async_trait]
impl StepHandler for CsvResultsAggregatorHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Detect scenario using helper
let scenario = BatchAggregationScenario::detect(
&step_data.dependency_results,
"analyze_csv",
"process_csv_batch_",
)?;
let (total_processed, total_inventory_value, category_counts, max_price, max_price_product, overall_avg_rating, worker_count) = match scenario {
BatchAggregationScenario::NoBatches { batchable_result } => {
// No batch workers - get dataset size from batchable step
let total_rows = batchable_result
.result
.get("total_rows")
.and_then(|v| v.as_u64())
.unwrap_or(0);
info!("NoBatches scenario: {} rows (no processing needed)", total_rows);
(total_rows, 0.0, HashMap::new(), 0.0, None, 0.0, 0)
}
BatchAggregationScenario::WithBatches {
batch_results,
worker_count,
} => {
info!("Aggregating results from {} batch workers", worker_count);
let mut total_processed = 0u64;
let mut total_inventory_value = 0.0;
let mut global_category_counts: HashMap<String, u64> = HashMap::new();
let mut max_price = 0.0;
let mut max_price_product = None;
let mut weighted_ratings = Vec::new();
for (step_name, result) in batch_results {
// Sum processed counts
let count = result
.result
.get("processed_count")
.and_then(|v| v.as_u64())
.unwrap_or(0);
total_processed += count;
// Sum inventory values
let value = result
.result
.get("total_inventory_value")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
total_inventory_value += value;
// Merge category counts
if let Some(categories) = result
.result
.get("category_counts")
.and_then(|v| v.as_object())
{
for (category, cat_count) in categories {
*global_category_counts
.entry(category.clone())
.or_insert(0) += cat_count.as_u64().unwrap_or(0);
}
}
// Find global max price
let batch_max_price = result
.result
.get("max_price")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
if batch_max_price > max_price {
max_price = batch_max_price;
max_price_product = result
.result
.get("max_price_product")
.and_then(|v| v.as_str())
.map(String::from);
}
// Collect ratings for weighted average
let avg_rating = result
.result
.get("average_rating")
.and_then(|v| v.as_f64())
.unwrap_or(0.0);
weighted_ratings.push((count, avg_rating));
}
// Calculate overall weighted average rating
let total_items = weighted_ratings.iter().map(|(c, _)| c).sum::<u64>();
let overall_avg_rating = if total_items > 0 {
weighted_ratings
.iter()
.map(|(count, avg)| (*count as f64) * avg)
.sum::<f64>()
/ (total_items as f64)
} else {
0.0
};
(
total_processed,
total_inventory_value,
global_category_counts,
max_price,
max_price_product,
overall_avg_rating,
worker_count,
)
}
};
let elapsed_ms = start_time.elapsed().as_millis() as u64;
info!(
"Aggregation complete: {} total items processed by {} workers",
total_processed, worker_count
);
Ok(success_result(
step_uuid,
json!({
"total_processed": total_processed,
"total_inventory_value": total_inventory_value,
"category_counts": category_counts,
"max_price": max_price,
"max_price_product": max_price_product,
"overall_average_rating": overall_avg_rating,
"worker_count": worker_count
}),
elapsed_ms,
None,
))
}
}
}
Ruby Implementation
1. Batchable Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_analyzer_handler.rb
module BatchProcessing
module StepHandlers
# CSV Analyzer - Batchable Step
class CsvAnalyzerHandler < TaskerCore::StepHandler::Batchable
def call(context)
csv_file_path = context.get_task_field('csv_file_path')
raise ArgumentError, 'Missing csv_file_path in task context' unless csv_file_path
# Count CSV rows (excluding header)
total_rows = count_csv_rows(csv_file_path)
Rails.logger.info("CSV Analysis: #{total_rows} rows in #{csv_file_path}")
# Get batch configuration from handler initialization
batch_size = context.step_config['batch_size'] || 200
max_workers = context.step_config['max_workers'] || 5
# Calculate worker count
worker_count = [(total_rows.to_f / batch_size).ceil, max_workers].min
if worker_count.zero? || total_rows.zero?
# Use helper for NoBatches outcome
return no_batches_success(
reason: 'empty_dataset',
total_rows: total_rows
)
end
# Generate cursor configs using helper
cursor_configs = generate_cursor_configs(
total_items: total_rows,
worker_count: worker_count
) do |batch_idx, start_pos, end_pos, items_in_batch|
# Adjust to 1-indexed row numbers (after header)
{
'batch_id' => format('%03d', batch_idx + 1),
'start_cursor' => start_pos + 1,
'end_cursor' => end_pos + 1,
'batch_size' => items_in_batch
}
end
Rails.logger.info("Creating #{worker_count} batch workers for #{total_rows} rows")
# Use helper for CreateBatches outcome
create_batches_success(
worker_template_name: 'process_csv_batch',
worker_count: worker_count,
cursor_configs: cursor_configs,
total_items: total_rows,
additional_data: {
'csv_file_path' => csv_file_path
}
)
end
private
def count_csv_rows(csv_file_path)
CSV.read(csv_file_path, headers: true).length
end
def step_definition_initialization
@step_definition_initialization ||= {}
end
end
end
end
2. Batch Worker Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_batch_processor_handler.rb
module BatchProcessing
module StepHandlers
# CSV Batch Processor - Batch Worker
class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
Product = Struct.new(
:id, :title, :description, :category, :price,
:discount_percentage, :rating, :stock, :brand, :sku, :weight,
keyword_init: true
)
def call(context)
# Extract batch context using helper
batch_ctx = get_batch_context(context)
# Use helper to check for no-op worker
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Get CSV file path from dependency results
csv_file_path = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
raise ArgumentError, 'Missing csv_file_path from analyze_csv' unless csv_file_path
Rails.logger.info("Processing batch #{batch_ctx.batch_id} (rows #{batch_ctx.start_cursor}-#{batch_ctx.end_cursor})")
# Process CSV rows in cursor range
metrics = process_csv_batch(
csv_file_path,
batch_ctx.start_cursor,
batch_ctx.end_cursor
)
Rails.logger.info("Batch #{batch_ctx.batch_id} complete: #{metrics[:processed_count]} items processed")
# Return results for aggregation
success(
result_data: {
'processed_count' => metrics[:processed_count],
'total_inventory_value' => metrics[:total_inventory_value],
'category_counts' => metrics[:category_counts],
'max_price' => metrics[:max_price],
'max_price_product' => metrics[:max_price_product],
'average_rating' => metrics[:average_rating],
'batch_id' => batch_ctx.batch_id,
'start_row' => batch_ctx.start_cursor,
'end_row' => batch_ctx.end_cursor
}
)
end
private
def process_csv_batch(csv_file_path, start_row, end_row)
metrics = {
processed_count: 0,
total_inventory_value: 0.0,
category_counts: Hash.new(0),
max_price: 0.0,
max_price_product: nil,
ratings: []
}
CSV.foreach(csv_file_path, headers: true).with_index(1) do |row, data_row_num|
# Skip rows before our range
next if data_row_num < start_row
# Break when we've processed all our rows
break if data_row_num >= end_row
product = parse_product(row)
# Calculate inventory metrics
inventory_value = product.price * product.stock
metrics[:total_inventory_value] += inventory_value
metrics[:category_counts][product.category] += 1
if product.price > metrics[:max_price]
metrics[:max_price] = product.price
metrics[:max_price_product] = product.title
end
metrics[:ratings] << product.rating
metrics[:processed_count] += 1
end
# Calculate average rating
metrics[:average_rating] = if metrics[:ratings].any?
metrics[:ratings].sum / metrics[:ratings].size.to_f
else
0.0
end
metrics.except(:ratings)
end
def parse_product(row)
Product.new(
id: row['id'].to_i,
title: row['title'],
description: row['description'],
category: row['category'],
price: row['price'].to_f,
discount_percentage: row['discountPercentage'].to_f,
rating: row['rating'].to_f,
stock: row['stock'].to_i,
brand: row['brand'],
sku: row['sku'],
weight: row['weight'].to_i
)
end
end
end
end
3. Convergence Handler: workers/ruby/spec/handlers/examples/batch_processing/step_handlers/csv_results_aggregator_handler.rb
module BatchProcessing
module StepHandlers
# CSV Results Aggregator - Deferred Convergence Step
class CsvResultsAggregatorHandler < TaskerCore::StepHandler::Batchable
def call(context)
# Detect scenario using helper
scenario = detect_aggregation_scenario(
context,
batchable_step_name: 'analyze_csv',
batch_worker_prefix: 'process_csv_batch_'
)
# Use helper for aggregation with custom block
aggregate_batch_worker_results(scenario) do |batch_results|
aggregate_csv_metrics(batch_results)
end
end
private
def aggregate_csv_metrics(batch_results)
total_processed = 0
total_inventory_value = 0.0
global_category_counts = Hash.new(0)
max_price = 0.0
max_price_product = nil
weighted_ratings = []
batch_results.each do |step_name, batch_result|
result = batch_result['result'] || {}
# Sum processed counts
count = result['processed_count'] || 0
total_processed += count
# Sum inventory values
total_inventory_value += result['total_inventory_value'] || 0.0
# Merge category counts
(result['category_counts'] || {}).each do |category, cat_count|
global_category_counts[category] += cat_count
end
# Find global max price
batch_max_price = result['max_price'] || 0.0
if batch_max_price > max_price
max_price = batch_max_price
max_price_product = result['max_price_product']
end
# Collect ratings for weighted average
avg_rating = result['average_rating'] || 0.0
weighted_ratings << { count: count, avg: avg_rating }
end
# Calculate overall weighted average rating
total_items = weighted_ratings.sum { |r| r[:count] }
overall_avg_rating = if total_items.positive?
weighted_ratings.sum { |r| r[:avg] * r[:count] } / total_items.to_f
else
0.0
end
Rails.logger.info("Aggregation complete: #{total_processed} total items processed by #{batch_results.size} workers")
{
'total_processed' => total_processed,
'total_inventory_value' => total_inventory_value,
'category_counts' => global_category_counts,
'max_price' => max_price,
'max_price_product' => max_price_product,
'overall_average_rating' => overall_avg_rating,
'worker_count' => batch_results.size
}
end
end
end
end
YAML Template
File: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml
---
name: csv_product_inventory_analyzer
namespace_name: csv_processing
version: "1.0.0"
description: "Process CSV product data in parallel batches"
steps:
# BATCHABLE STEP: CSV Analysis and Batch Planning
- name: analyze_csv
type: batchable
dependencies: []
handler:
callable: CsvAnalyzerHandler
initialization:
batch_size: 200
max_workers: 5
# BATCH WORKER TEMPLATE: Single CSV Batch Processing
# Orchestration creates N instances from this template
- name: process_csv_batch
type: batch_worker
dependencies:
- analyze_csv
lifecycle:
max_steps_in_process_minutes: 120
max_retries: 3
backoff_multiplier: 2.0
handler:
callable: CsvBatchProcessorHandler
initialization:
operation: "inventory_analysis"
# DEFERRED CONVERGENCE STEP: CSV Results Aggregation
- name: aggregate_csv_results
type: deferred_convergence
dependencies:
- process_csv_batch # Template dependency - resolves to all worker instances
handler:
callable: CsvResultsAggregatorHandler
initialization:
aggregation_type: "inventory_metrics"
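A task for this template is submitted through the same HTTP endpoint shown in the operational scenarios (POST /v1/tasks). A minimal Ruby sketch of building the request payload; the file path is a hypothetical example:

```ruby
require 'json'

payload = {
  'namespace'     => 'csv_processing',
  'template_name' => 'csv_product_inventory_analyzer',
  'context'       => {
    'csv_file_path' => '/data/products.csv' # hypothetical path
  }
}

request_body = JSON.generate(payload)
```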
Best Practices
1. Batch Size Calculation
Guideline: Balance parallelism with overhead.
Too Small:
- Excessive orchestration overhead
- Too many database transactions
- Diminishing returns on parallelism
Too Large:
- Workers timeout or OOM
- Long retry times on failure
- Reduced parallelism
Recommended Approach:
def calculate_optimal_batch_size(total_items, item_processing_time_ms)
# Target: Each batch takes 5-10 minutes
target_duration_ms = 7.5 * 60 * 1000
# Calculate items per batch
items_per_batch = (target_duration_ms / item_processing_time_ms).ceil
# Enforce min/max bounds
[[items_per_batch, 100].max, 10000].min
end
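Applying the recommended function (restated self-contained here, with hypothetical item timings) shows how the min/max bounds interact with the 7.5-minute target:

```ruby
def calculate_optimal_batch_size(total_items, item_processing_time_ms)
  # Target: each batch takes 5-10 minutes
  target_duration_ms = 7.5 * 60 * 1000
  items_per_batch = (target_duration_ms / item_processing_time_ms).ceil
  # Enforce min/max bounds
  [[items_per_batch, 100].max, 10_000].min
end

calculate_optimal_batch_size(50_000, 20)    # 22,500 capped to the 10,000 max
calculate_optimal_batch_size(50_000, 500)   # 900 items per ~7.5-minute batch
calculate_optimal_batch_size(50_000, 5_000) # 90, raised to the 100 minimum
```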
2. Worker Count Limits
Guideline: Limit concurrency based on system resources.
handler:
initialization:
batch_size: 200
max_workers: 10 # Prevents creating 100 workers for 20,000 items
Considerations:
- Database connection pool size
- Memory per worker
- External API rate limits
- CPU cores available
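The cap works because worker count is the minimum of the batch-derived count and max_workers, mirroring the calculation in the batchable handlers above:

```ruby
# Mirrors the worker-count calculation used by the batchable handlers
def worker_count(total_items, batch_size, max_workers)
  [(total_items.to_f / batch_size).ceil, max_workers].min
end

worker_count(20_000, 200, 10) # capped at 10 instead of 100 workers
worker_count(1_000, 200, 10)  # 5 workers, under the cap
```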
3. Cursor Design
Guideline: Use cursors that support resumability.
Good Cursor Types:
- ✅ Integer offsets: start_cursor: 1000, end_cursor: 2000
- ✅ Timestamps: start_cursor: "2025-01-01T00:00:00Z"
- ✅ Database IDs: start_cursor: uuid_a, end_cursor: uuid_b
- ✅ Composite keys: { date: "2025-01-01", partition: "US-WEST" }
Bad Cursor Types:
- ❌ Page numbers (data can shift between pages)
- ❌ Non-deterministic ordering (random, relevance scores)
- ❌ Mutable values (last_modified_at can change)
4. Checkpoint Frequency
Guideline: Balance resumability with performance.
// Checkpoint every 100 items
if processed_count % 100 == 0 {
    checkpoint_progress(step_uuid, current_cursor).await?;
}
Factors:
- Item processing time (faster = higher frequency)
- Worker failure rate (higher = more frequent checkpoints)
- Database write overhead (less frequent = better performance)
Recommended:
- Fast items (< 10ms each): Checkpoint every 1000 items
- Medium items (10-100ms each): Checkpoint every 100 items
- Slow items (> 100ms each): Checkpoint every 10 items
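The recommendations above can be encoded as a simple lookup; a Ruby sketch (thresholds taken from the tiers listed):

```ruby
# Pick a checkpoint interval from average per-item processing time,
# following the fast/medium/slow tiers above
def checkpoint_interval(avg_item_ms)
  if avg_item_ms < 10
    1_000 # fast items: checkpoint every 1000
  elsif avg_item_ms <= 100
    100   # medium items: checkpoint every 100
  else
    10    # slow items: checkpoint every 10
  end
end
```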
5. Error Handling Strategies
FailFast (default):
FailureStrategy::FailFast
- Worker fails immediately on first error
- Suitable for: Data validation, schema violations
- Cursor is preserved for retry
ContinueOnFailure:
FailureStrategy::ContinueOnFailure
- Worker processes all items, collects errors
- Suitable for: Best-effort processing, partial results acceptable
- Returns both results and error list
IsolateFailed:
FailureStrategy::IsolateFailed
- Failed items moved to separate queue
- Suitable for: Large batches with few expected failures
- Allows manual review of failed items
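Inside a handler, the ContinueOnFailure shape is roughly the following; a Ruby sketch with a hypothetical process_item:

```ruby
# Hypothetical per-item operation that fails on one input
def process_item(item)
  raise "invalid item: #{item}" if item == :bad
  item.to_s.upcase
end

items = [:a, :bad, :c]
results = []
errors  = []

items.each do |item|
  results << process_item(item)
rescue StandardError => e
  # Collect the failure and keep processing remaining items
  errors << { item: item, error: e.message }
end

# The handler returns both partial results and the collected error list
```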
6. Aggregation Patterns
Sum/Count:
let total: u64 = batch_results.iter()
    .filter_map(|(_, r)| r.result.get("count").and_then(|v| v.as_u64()))
    .sum();
Max/Min:
let max_value = batch_results.iter()
    .filter_map(|(_, r)| r.result.get("max").and_then(|v| v.as_f64()))
    .max_by(|a, b| a.total_cmp(b))
    .unwrap_or(0.0);
Weighted Average:
let total_weight: u64 = weighted_values.iter().map(|(w, _)| w).sum();
let weighted_avg = weighted_values.iter()
    .map(|(weight, value)| (*weight as f64) * value)
    .sum::<f64>() / (total_weight as f64);
Merge HashMaps:
let mut merged: HashMap<String, u64> = HashMap::new();
for (_, result) in batch_results {
    if let Some(counts) = result.get("counts").and_then(|v| v.as_object()) {
        for (key, count) in counts {
            *merged.entry(key.clone()).or_insert(0) += count.as_u64().unwrap_or(0);
        }
    }
}
7. Testing Strategies
Unit Tests: Test handler logic independently
#[test]
fn test_cursor_generation() {
    let configs = create_cursor_configs(1000, 5, 200);
    assert_eq!(configs.len(), 5);
    assert_eq!(configs[0].start_cursor, json!(0));
    assert_eq!(configs[0].end_cursor, json!(200));
}
Integration Tests: Test with small datasets
#[tokio::test]
async fn test_batch_processing_integration() {
    let task = create_task_with_csv("test_data_10_rows.csv").await;
    assert_eq!(task.current_state, TaskState::Complete);
    let steps = get_workflow_steps(task.task_uuid).await;
    let workers = steps.iter().filter(|s| s.step_type == "batch_worker").count();
    assert_eq!(workers, 1); // 10 rows = 1 worker with batch_size 200
}
E2E Tests: Test complete workflow with realistic data
#[tokio::test]
async fn test_csv_batch_processing_e2e() {
    let task = create_task_with_csv("products_1000_rows.csv").await;
    wait_for_completion(task.task_uuid, Duration::from_secs(60)).await;
    let results = get_aggregation_results(task.task_uuid).await;
    assert_eq!(results["total_processed"], 1000);
    assert_eq!(results["worker_count"], 5);
}
8. Monitoring and Observability
Metrics to Track:
- Worker creation time
- Individual worker duration
- Batch size distribution
- Retry rate per batch
- Aggregation duration
Recommended Dashboards:
-- Batch processing health
SELECT
COUNT(*) FILTER (WHERE step_type = 'batch_worker') as total_workers,
AVG(EXTRACT(EPOCH FROM (updated_at - created_at))) as avg_worker_duration_sec,
MAX(EXTRACT(EPOCH FROM (updated_at - created_at))) as max_worker_duration_sec,
COUNT(*) FILTER (WHERE current_state = 'Error') as failed_workers
FROM tasker.workflow_steps
WHERE task_uuid = :task_uuid
AND step_type = 'batch_worker';
Summary
Batch processing in Tasker provides a robust, production-ready pattern for parallel dataset processing:
Key Strengths:
- ✅ Builds on proven DAG, retry, and deferred convergence foundations
- ✅ No special recovery system needed (uses standard DLQ + retry)
- ✅ Transaction-based worker creation prevents corruption
- ✅ Cursor-based resumability enables long-running processing
- ✅ Language-agnostic design works across Rust and Ruby workers
Integration Points:
- DAG: Workers are full nodes with standard lifecycle
- Retryability: Uses lifecycle.max_retries and exponential backoff
- Deferred Convergence: Intersection semantics aggregate dynamic worker counts
- DLQ: Standard operator workflows with cursor preservation
Production Readiness:
- 908 tests passing (Ruby workers)
- Real-world CSV processing (1000 rows)
- Docker integration working
- Code review complete with recommended fixes
For More Information:
- Conditional Workflows: See docs/conditional-workflows.md
- DLQ System: See docs/dlq-system.md
- Code Examples: See workers/rust/src/step_handlers/batch_processing_*.rs
Caching Guide
This guide covers Tasker’s distributed caching system, including configuration, backend selection, circuit breaker protection, and operational considerations.
Overview
Tasker provides optional caching for:
- Task Templates: Reduces database queries when loading workflow definitions
- Analytics: Caches performance metrics and bottleneck analysis results
Caching is disabled by default and must be explicitly enabled in configuration.
Configuration
Basic Setup
[common.cache]
enabled = true
backend = "redis" # or "dragonfly" / "moka" / "memory" / "in-memory"
default_ttl_seconds = 3600 # 1 hour default
template_ttl_seconds = 3600 # 1 hour for templates
analytics_ttl_seconds = 60 # 1 minute for analytics
key_prefix = "tasker" # Namespace for cache keys
[common.cache.redis]
url = "${REDIS_URL:-redis://localhost:6379}"
max_connections = 10
connection_timeout_seconds = 5
database = 0
[common.cache.moka]
max_capacity = 10000 # Maximum entries in cache
Backend Selection
| Backend | Config Value | Use Case |
|---|---|---|
| Redis | "redis" | Multi-instance deployments (production) |
| Dragonfly | "dragonfly" | Redis-compatible with better multi-threaded performance |
| Memcached | "memcached" | Simple distributed cache (requires cache-memcached feature) |
| Moka | "moka", "memory", "in-memory" | Single-instance, development, DoS protection |
| NoOp | (enabled = false) | Disabled, always-miss |
Cache Backends
Redis (Distributed)
Redis is the recommended backend for production deployments:
- Shared state: All instances see the same cache entries
- Invalidation works: Worker bootstrap invalidations propagate to all instances
- Persistence: Survives process restarts (if Redis is configured for persistence)
[common.cache]
enabled = true
backend = "redis"
[common.cache.redis]
url = "redis://redis.internal:6379"
Dragonfly (Distributed)
Dragonfly is a Redis-compatible in-memory data store with better multi-threaded performance. It uses the same port (6379) and protocol as Redis, so no code changes are required.
- Redis compatible: Drop-in replacement for Redis
- Better performance: Multi-threaded architecture for higher throughput
- Shared state: Same distributed semantics as Redis
[common.cache]
enabled = true
backend = "dragonfly" # Uses Redis provider internally
[common.cache.redis]
url = "redis://dragonfly.internal:6379"
Note: Dragonfly is used in Tasker’s test and CI environments for improved performance. For production, either Redis or Dragonfly works.
Memcached (Distributed)
Memcached is a simple, high-performance distributed cache. It requires the cache-memcached feature flag (not enabled by default).
- Simple protocol: Lightweight key-value store
- Distributed: State is shared across instances
- No pattern deletion: Relies on TTL expiry (like Moka)
[common.cache]
enabled = true
backend = "memcached"
[common.cache.memcached]
url = "tcp://memcached.internal:11211"
connection_timeout_seconds = 5
Note: Enable with `cargo build --features cache-memcached`. Not enabled by default to reduce dependency footprint.
Moka (In-Memory)
Moka provides a high-performance in-memory cache:
- Zero network latency: All operations are in-process
- DoS protection: Rate-limits expensive operations without Redis dependency
- Single-instance only: Cache is not shared across processes
[common.cache]
enabled = true
backend = "moka"
[common.cache.moka]
max_capacity = 10000
Important: Moka is only suitable for:
- Single-instance deployments
- Development environments
- Analytics caching (where brief staleness is acceptable)
NoOp (Disabled)
When caching is disabled or a backend fails to initialize:
[common.cache]
enabled = false
The NoOp provider always returns cache misses and succeeds on writes (no-op). This is also used as a graceful fallback when Redis connection fails.
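The always-miss semantics can be sketched in a few lines (an illustrative Ruby sketch, not Tasker's actual provider):

```ruby
# NoOp cache sketch (illustrative): every read misses, every write succeeds
# without storing anything, so callers always fall through to the database.
class NoOpCache
  def get(_key)
    nil # always a cache miss
  end

  def set(_key, _value)
    true # accepted, but nothing is stored
  end

  def delete(_key)
    true # nothing to delete; still reports success
  end
end
```

Because every `get` misses, callers transparently fall back to the database, which is exactly the degraded-but-working behavior a disabled cache should have.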
Circuit Breaker Protection
The cache circuit breaker prevents repeated timeout penalties when Redis/Dragonfly is unavailable. Instead of waiting for connection timeouts on every request, the circuit breaker fails fast after detecting failures.
Configuration
[common.circuit_breakers.component_configs.cache]
failure_threshold = 5 # Open after 5 consecutive failures
timeout_seconds = 15 # Test recovery after 15 seconds
success_threshold = 2 # Close after 2 successful calls
Behavior When Circuit is Open
When the circuit breaker is open (cache unavailable):
| Operation | Behavior |
|---|---|
| `get()` | Returns `None` (cache miss) |
| `set()` | Returns `Ok(())` (no-op) |
| `delete()` | Returns `Ok(())` (no-op) |
| `health_check()` | Returns `false` (unhealthy) |
This fail-fast behavior ensures:
- Requests don’t wait for connection timeouts
- Database queries still work (cache miss → DB fallback)
- Recovery is automatic when Redis/Dragonfly becomes available
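The miss-then-DB fallback can be sketched as a read-through helper (hypothetical names; `cache` and `db` are stand-ins, not Tasker APIs):

```ruby
# Read-through with graceful degradation: a cache miss (including the
# always-miss behavior of an open circuit or the NoOp provider) falls back
# to the database. `cache` and `db` are hypothetical stand-ins.
def cached_performance_metrics(cache:, db:, hours:)
  key = "tasker:analytics:performance:#{hours}"
  cached = cache.get(key)
  return cached if cached

  result = db.call(hours)  # cache miss: query the database
  cache.set(key, result)   # best-effort write; a no-op while the circuit is open
  result
end
```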
Circuit States
| State | Description |
|---|---|
| Closed | Normal operation, all calls go through |
| Open | Failing fast, calls return fallback values |
| Half-Open | Testing recovery, limited calls allowed |
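The three configured thresholds map onto this state machine; the following is a minimal, illustrative Ruby sketch of the transitions, not Tasker's internal implementation:

```ruby
# Minimal circuit breaker sketch (illustrative only):
#   closed    --(failure_threshold consecutive failures)--> open
#   open      --(timeout_seconds elapsed)------------------> half_open (trial call)
#   half_open --(success_threshold successes)--------------> closed
class CircuitBreaker
  attr_reader :state

  def initialize(failure_threshold: 5, timeout_seconds: 15, success_threshold: 2)
    @failure_threshold = failure_threshold
    @timeout_seconds = timeout_seconds
    @success_threshold = success_threshold
    @state = :closed
    @failures = 0
    @successes = 0
  end

  # Runs the block; on failure (or while open) returns the fallback value,
  # so callers never wait on a dead backend.
  def call(fallback: nil)
    if @state == :open
      return fallback if Time.now - @opened_at < @timeout_seconds
      @state = :half_open # timeout elapsed: let one trial call through
    end
    result = yield
    record_success
    result
  rescue StandardError
    record_failure
    fallback
  end

  private

  def record_success
    if @state == :half_open
      @successes += 1
      if @successes >= @success_threshold
        @state = :closed
        @failures = 0
        @successes = 0
      end
    else
      @failures = 0 # any success while closed resets the failure streak
    end
  end

  def record_failure
    @failures += 1
    @successes = 0
    if @state == :half_open || @failures >= @failure_threshold
      @state = :open
      @opened_at = Time.now
    end
  end
end
```

A failed trial call while half-open reopens the circuit immediately, which is why recovery is tested with only a limited number of calls.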
Monitoring
Circuit state is logged at state transitions:
INFO Circuit breaker half-open (testing recovery)
INFO Circuit breaker closed (recovered)
ERROR Circuit breaker opened (failing fast)
Usage Context Constraints
Different caching use cases have different consistency requirements. Tasker enforces these constraints at runtime:
Template Caching
Constraint: Requires distributed cache (Redis) or no cache (NoOp)
Templates are cached to avoid repeated database queries when loading workflow definitions. However, workers invalidate the template cache on bootstrap when they register new handler versions.
If an in-memory cache (Moka) is used:
- Orchestration server caches templates in its local memory
- Worker boots and invalidates templates in Redis (or nowhere, if Moka)
- Orchestration server never sees the invalidation
- Stale templates are served → operational errors
Behavior with Moka: Template caching is automatically disabled with a warning:
WARN Cache provider 'moka' is not safe for template caching (in-memory cache
would drift from worker invalidations). Template caching disabled.
Analytics Caching
Constraint: Any backend allowed
Analytics data is informational and TTL-bounded. Brief staleness is acceptable, and in-memory caching provides DoS protection for expensive aggregation queries.
Behavior with Moka: Analytics caching works normally.
Cache Keys
Cache keys are prefixed with the configured key_prefix to allow multiple
Tasker deployments to share a Redis instance:
| Resource | Key Pattern |
|---|---|
| Templates | {prefix}:template:{namespace}:{name}:{version} |
| Performance Metrics | {prefix}:analytics:performance:{hours} |
| Bottleneck Analysis | {prefix}:analytics:bottlenecks:{limit}:{min_executions} |
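For illustration, the template key pattern above can be assembled like this (`template_cache_key` is a hypothetical helper, not part of the Tasker API):

```ruby
# Builds a template cache key following the documented pattern:
#   {prefix}:template:{namespace}:{name}:{version}
# Hypothetical helper for illustration only.
def template_cache_key(prefix:, namespace:, name:, version:)
  "#{prefix}:template:#{namespace}:#{name}:#{version}"
end

template_cache_key(prefix: 'tasker', namespace: 'payments',
                   name: 'process_order', version: '1.0.0')
# => "tasker:template:payments:process_order:1.0.0"
```

Giving each deployment a distinct `key_prefix` keeps these keys from colliding on a shared Redis instance.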
Operational Patterns
Multi-Instance Production
[common.cache]
enabled = true
backend = "redis"
template_ttl_seconds = 3600 # Long TTL, rely on invalidation
analytics_ttl_seconds = 60 # Short TTL for fresh data
- Templates cached for 1 hour but invalidated on worker registration
- Analytics cached briefly to reduce database load
Single-Instance / Development
[common.cache]
enabled = true
backend = "moka"
template_ttl_seconds = 300 # Shorter TTL since no invalidation
analytics_ttl_seconds = 30
- Template caching automatically disabled (Moka constraint)
- Analytics caching works, provides DoS protection
Caching Disabled
[common.cache]
enabled = false
- All cache operations are no-ops
- Every request hits the database
- Useful for debugging or when cache adds complexity without benefit
Graceful Degradation
Tasker never fails to start due to cache issues:
- Redis connection failure: Falls back to NoOp with warning
- Backend misconfiguration: Falls back to NoOp with warning
- Cache operation errors: Logged as warnings, never propagated
WARN Failed to connect to Redis, falling back to NoOp cache (graceful degradation)
The cache layer uses “best-effort” writes—failures are logged but never block request processing.
Monitoring
Cache Hit/Miss Rates
Cache operations are logged at DEBUG level:
DEBUG hours=24 "Performance metrics cache HIT"
DEBUG hours=24 "Performance metrics cache MISS, querying DB"
Provider Status
On startup, the active cache provider is logged:
INFO backend="redis" "Distributed cache provider initialized successfully"
INFO backend="moka" max_capacity=10000 "In-memory cache provider initialized"
INFO "Distributed cache disabled by configuration"
Troubleshooting
Templates Not Caching
- Check if backend is Moka—template caching is disabled with Moka
- Check for Redis connection warnings in logs
- Verify `enabled = true` in configuration
Stale Templates Being Served
- Verify all instances point to the same Redis
- Check that workers are properly invalidating on bootstrap
- Consider reducing `template_ttl_seconds`
High Cache Miss Rate
- Check Redis connectivity and latency
- Verify TTL settings aren’t too aggressive
- Check for cache key collisions (multiple deployments, same prefix)
Memory Growth with Moka
- Reduce the `max_capacity` setting
- Check TTL settings—items evict on TTL or capacity limit
- Monitor entry count if metrics are available
Conditional Workflows and Decision Points
Last Updated: 2025-10-27 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | Use Cases & Patterns | States and Lifecycles
← Back to Documentation Hub
Overview
Conditional workflows enable runtime decision-making that dynamically determines which workflow steps to execute based on business logic. Unlike static DAG workflows where all steps are predefined, conditional workflows use decision point steps to create steps on-demand based on runtime conditions.
Dynamic Workflow Decision Points provide this capability through:
- Decision Point Steps: Special step type that evaluates business logic and returns step names to create
- Deferred Steps: Step type with dynamic dependency resolution using intersection semantics
- Type-Safe Integration: Ruby and Rust helpers ensuring clean serialization between languages
Table of Contents
- When to Use Conditional Workflows
- Logical Pattern
- Architecture and Implementation
- YAML Configuration
- Simple Example: Approval Routing
- Complex Example: Multi-Tier Approval
- Ruby Implementation Guide
- Rust Implementation Guide
- Best Practices
- Limitations and Constraints
When to Use Conditional Workflows
✅ Use Conditional Workflows When
1. Business Logic Determines Execution Path
- Approval workflows with amount-based routing (small/medium/large)
- Risk-based processing (low/medium/high risk paths)
- Tiered customer service (bronze/silver/gold/platinum)
- Regulatory compliance with jurisdictional variations
2. Step Requirements Are Unknown Until Runtime
- Dynamic validation checks based on request type
- Multi-stage approvals where approval count depends on amount
- Conditional enrichment steps based on data completeness
- Parallel processing with variable worker count
3. Workflow Complexity Varies By Input
- Simple cases skip expensive steps
- Complex cases trigger additional validation
- Emergency processing bypasses normal checks
- VIP customers get expedited handling
❌ Don’t Use Conditional Workflows When
1. Static DAG is Sufficient
- All possible execution paths known at design time
- Complexity overhead not justified
- Simple if/else can be handled in handler code
2. Purely Sequential Logic
- No parallelism or branching needed
- Handler code can make decisions directly
3. Real-Time Sub-Second Decisions
- Decision overhead (~10-20ms) not acceptable
- In-memory processing required
Logical Pattern
Core Concepts
Task Initialization
↓
Regular Step(s)
↓
Decision Point Step ← Evaluates business logic
↓
[Decision Made]
↓
┌───┴───┐
↓ ↓
Path A Path B ← Steps created dynamically
↓ ↓
└───┬───┘
↓
Convergence Step ← Deferred dependencies resolve via intersection
↓
Task Complete
Decision Point Pattern
- Evaluation Phase: Decision point step executes handler
- Decision Output: Handler returns list of step names to create
- Dynamic Creation: Orchestration creates specified steps with proper dependencies
- Execution: Created steps execute like normal steps
- Convergence: Deferred steps wait for intersection of declared dependencies + created steps
Intersection Semantics for Deferred Steps
Declared Dependencies (in template):
- step_a
- step_b
- step_c
Actually Created Steps (by decision point):
Only step_a and step_c were created
Effective Dependencies (intersection):
step_a AND step_c (step_b ignored since not created)
This enables convergence steps that work regardless of which path was taken.
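The intersection computation above is simply set intersection over step names (illustrative Ruby, not the orchestrator's internal code):

```ruby
# Effective dependencies = declared dependencies ∩ actually created steps.
# Ruby's Array#& computes exactly this, preserving the declared order.
def effective_dependencies(declared, created)
  declared & created
end

declared = %w[step_a step_b step_c]
created  = %w[step_a step_c] # decision point never created step_b
effective_dependencies(declared, created)
# => ["step_a", "step_c"]
```

The deferred step then waits only on `step_a` and `step_c`; `step_b` is dropped because it was never created.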
Architecture and Implementation
Step Type: Decision Point
Decision point steps are regular steps with a special handler that returns a DecisionPointOutcome:
#![allow(unused)]
fn main() {
pub enum DecisionPointOutcome {
NoBranches, // No additional steps needed
CreateSteps { // Dynamically create these steps
step_names: Vec<String>,
},
}
}
Key Characteristics:
- Executes like a normal step
- Result includes a `decision_point_outcome` field
- Orchestration detects the outcome and creates steps
- Created steps depend on the decision point step
- Fully atomic - either all steps created or none
Step Type: Deferred
Deferred steps use intersection semantics for dependency resolution:
type: deferred # Special step type
dependencies:
- routing_decision # Must wait for decision point
- step_a # Might be created
- step_b # Might be created
- step_c # Might be created
Resolution Logic:
- Wait for decision point to complete
- Check which declared dependencies actually exist
- Wait only for intersection of declared + created
- Execute when all existing dependencies complete
Orchestration Flow
┌─────────────────────────────────────────┐
│ Step Result Processor │
│ │
│ 1. Check if result has │
│ decision_point_outcome field │
│ │
│ 2. If CreateSteps: │
│ - Validate step names exist │
│ - Create WorkflowStep records │
│ - Set dependencies │
│ - Enqueue for execution │
│ │
│ 3. If NoBranches: │
│ - Continue normally │
│ │
│ 4. Metrics and telemetry: │
│ - Track steps_created count │
│ - Log decision outcome │
│ - Warn if depth limit approached │
└─────────────────────────────────────────┘
Configuration
Decision point behavior is configured per environment:
# config/tasker/base/orchestration.toml
[orchestration.decision_points]
enabled = true
max_depth = 3 # Prevent infinite recursion
warn_threshold = 2 # Warn when nearing limit
YAML Configuration
Task Template Structure
Actual Implementation (from tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml):
---
name: approval_routing
namespace_name: conditional_approval
version: 1.0.0
description: >
Ruby implementation of conditional approval workflow demonstrating dynamic decision points.
Routes approval requests through different paths based on amount thresholds.
steps:
- name: validate_request
type: standard
dependencies: []
handler:
callable: ConditionalApproval::StepHandlers::ValidateRequestHandler
initialization: {}
- name: routing_decision
type: decision # DECISION POINT
dependencies:
- validate_request
handler:
callable: ConditionalApproval::StepHandlers::RoutingDecisionHandler
initialization: {}
- name: finalize_approval
type: deferred # DEFERRED - uses intersection semantics
dependencies:
- auto_approve # ALL possible dependencies listed
- manager_approval # System computes intersection at runtime
- finance_review
handler:
callable: ConditionalApproval::StepHandlers::FinalizeApprovalHandler
initialization: {}
# Possible dynamic branches (created by decision point)
- name: auto_approve
type: standard
dependencies:
- routing_decision
handler:
callable: ConditionalApproval::StepHandlers::AutoApproveHandler
initialization: {}
- name: manager_approval
type: standard
dependencies:
- routing_decision
handler:
callable: ConditionalApproval::StepHandlers::ManagerApprovalHandler
initialization: {}
- name: finance_review
type: standard
dependencies:
- routing_decision
handler:
callable: ConditionalApproval::StepHandlers::FinanceReviewHandler
initialization: {}
Key Points:
- `type: decision` marks the decision point step
- `type: deferred` enables intersection semantics for convergence
- ALL possible dependencies are listed in the deferred step
- Orchestration computes: declared deps ∩ actually created steps
Simple Example: Approval Routing
Business Requirement
Route approval requests based on amount:
- < $1,000: Auto-approve (no human intervention)
- $1,000 - $4,999: Manager approval required
- ≥ $5,000: Manager + Finance approval required
Template Configuration
namespace: approval_workflows
name: simple_routing
version: "1.0"
steps:
- name: validate_request
handler: validate_request
- name: routing_decision
handler: routing_decision
type: decision_point
dependencies:
- validate_request
- name: auto_approve
handler: auto_approve
dependencies:
- routing_decision
- name: manager_approval
handler: manager_approval
dependencies:
- routing_decision
- name: finance_review
handler: finance_review
dependencies:
- routing_decision
- name: finalize_approval
handler: finalize_approval
type: deferred
dependencies:
- routing_decision
- auto_approve
- manager_approval
- finance_review
Ruby Handler Implementation
Actual Implementation (from workers/ruby/spec/handlers/examples/conditional_approval/step_handlers/routing_decision_handler.rb):
# frozen_string_literal: true
module ConditionalApproval
module StepHandlers
# Routing Decision: DECISION POINT that routes approval based on amount
#
# Uses TaskerCore::StepHandler::Decision base class for clean, type-safe
# decision outcome serialization consistent with Rust expectations.
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
SMALL_AMOUNT_THRESHOLD = 1_000
LARGE_AMOUNT_THRESHOLD = 5_000
def call(context)
# Get amount from validated request
amount = context.get_task_field('amount')
raise 'Amount is required for routing decision' unless amount
# Make routing decision based on amount
route = determine_route(amount)
# Use Decision base class helper for clean outcome serialization
decision_success(
steps: route[:steps],
result_data: {
route_type: route[:type],
reasoning: route[:reasoning],
amount: amount
},
metadata: {
operation: 'routing_decision',
route_thresholds: {
small: SMALL_AMOUNT_THRESHOLD,
large: LARGE_AMOUNT_THRESHOLD
}
}
)
end
private
def determine_route(amount)
if amount < SMALL_AMOUNT_THRESHOLD
{
type: 'auto_approval',
steps: ['auto_approve'],
reasoning: "Amount $#{amount} below threshold - auto-approval"
}
elsif amount < LARGE_AMOUNT_THRESHOLD
{
type: 'manager_only',
steps: ['manager_approval'],
reasoning: "Amount $#{amount} requires manager approval"
}
else
{
type: 'dual_approval',
steps: %w[manager_approval finance_review],
reasoning: "Amount $#{amount} >= $#{LARGE_AMOUNT_THRESHOLD} - dual approval required"
}
end
end
end
end
end
Key Ruby Patterns:
- Inherit from `TaskerCore::StepHandler::Decision`, the specialized base class for decision points
- Use the helper method `decision_success(steps:, result_data:, metadata:)` for clean decision outcomes
- The helper automatically creates a `DecisionPointOutcome` and embeds it correctly
- No manual serialization needed: the base class handles Rust compatibility
- For no-branch scenarios, use `decision_no_branches(result_data:, metadata:)`
Handler Syntax: Decision handler examples use the class-based pattern. For the DSL alternative, see the examples below each class-based handler or visit Class-Based Handlers.
DSL Alternative:
extend TaskerCore::StepHandler::Functional
RoutingDecisionHandler = decision_handler(
'RoutingDecisionHandler',
inputs: [:amount]
) do |amount:, context:|
steps = if amount.to_f < 1_000
['auto_approve']
elsif amount.to_f < 5_000
['manager_approval']
else
['manager_approval', 'finance_review']
end
Decision.route(steps, route_type: determine_route_type(amount), amount: amount)
end
Execution Flow Examples
Example 1: Small Amount ($500)
1. validate_request → Complete
2. routing_decision → Complete (creates: auto_approve)
3. auto_approve → Complete
4. finalize_approval → Complete
(waits for: routing_decision ∩ {auto_approve} = auto_approve)
Total Steps Created: 4
Execution Time: ~500ms
Example 2: Medium Amount ($2,500)
1. validate_request → Complete
2. routing_decision → Complete (creates: manager_approval)
3. manager_approval → Complete
4. finalize_approval → Complete
(waits for: routing_decision ∩ {manager_approval} = manager_approval)
Total Steps Created: 4
Execution Time: ~2s (human approval delay)
Example 3: Large Amount ($10,000)
1. validate_request → Complete
2. routing_decision → Complete (creates: manager_approval, finance_review)
3. manager_approval → Complete (parallel)
3. finance_review → Complete (parallel)
4. finalize_approval → Complete
(waits for: routing_decision ∩ {manager_approval, finance_review})
Total Steps Created: 5
Execution Time: ~3s (parallel approvals)
Complex Example: Multi-Tier Approval
Business Requirement
Implement sophisticated approval routing with:
- Risk assessment step
- Tiered approval requirements
- Emergency override path
- Compliance checks based on jurisdiction
Template Configuration
namespace: approval_workflows
name: multi_tier_approval
version: "1.0"
steps:
# Phase 1: Initial validation and risk assessment
- name: validate_request
handler: validate_request
- name: assess_risk
handler: assess_risk
dependencies:
- validate_request
# Phase 2: Primary routing decision
- name: primary_routing
handler: primary_routing
type: decision_point
dependencies:
- assess_risk
# Phase 3: Conditional approval paths
- name: emergency_approval
handler: emergency_approval
dependencies:
- primary_routing
- name: standard_manager_approval
handler: standard_manager_approval
dependencies:
- primary_routing
- name: senior_manager_approval
handler: senior_manager_approval
dependencies:
- primary_routing
# Phase 4: Secondary routing for high-risk cases
- name: compliance_routing
handler: compliance_routing
type: decision_point
dependencies:
- primary_routing
- senior_manager_approval # Only if created
# Phase 5: Compliance paths
- name: legal_review
handler: legal_review
dependencies:
- compliance_routing
- name: fraud_investigation
handler: fraud_investigation
dependencies:
- compliance_routing
- name: jurisdictional_check
handler: jurisdictional_check
dependencies:
- compliance_routing
# Phase 6: Convergence
- name: finalize_approval
handler: finalize_approval
type: deferred
dependencies:
- primary_routing
- emergency_approval
- standard_manager_approval
- senior_manager_approval
- compliance_routing
- legal_review
- fraud_investigation
- jurisdictional_check
Ruby Handler: Primary Routing
class PrimaryRoutingHandler < TaskerCore::StepHandler::Decision
def call(context)
amount = context.get_task_field('amount')
risk_score = context.get_dependency_result('assess_risk')['risk_score']
is_emergency = context.get_task_field('emergency') == true
steps_to_create = if is_emergency && amount < 10_000
# Emergency override path
['emergency_approval']
elsif risk_score < 30 && amount < 5_000
# Low risk, standard approval
['standard_manager_approval']
else
# High risk or large amount - senior approval + compliance routing
['senior_manager_approval', 'compliance_routing']
end
decision_success(
steps: steps_to_create,
result_data: {
route_type: determine_route_type(is_emergency, risk_score, amount),
risk_score: risk_score,
amount: amount,
emergency: is_emergency
}
)
end
end
DSL Alternative:
extend TaskerCore::StepHandler::Functional
PrimaryRoutingHandler = decision_handler(
'PrimaryRoutingHandler',
inputs: [:amount, :emergency]
) do |amount:, emergency:, context:|
risk_score = context.get_dependency_result('assess_risk')['risk_score']
is_emergency = emergency == true
steps = if is_emergency && amount.to_f < 10_000
['emergency_approval']
elsif risk_score < 30 && amount.to_f < 5_000
['standard_manager_approval']
else
['senior_manager_approval', 'compliance_routing']
end
Decision.route(steps, route_type: determine_route_type(is_emergency, risk_score, amount))
end
Ruby Handler: Compliance Routing (Nested Decision)
class ComplianceRoutingHandler < TaskerCore::StepHandler::Decision
def call(context)
amount = context.get_task_field('amount')
risk_score = context.get_dependency_result('assess_risk')['risk_score']
jurisdiction = context.get_task_field('jurisdiction')
steps_to_create = []
# Large amounts always need legal review
steps_to_create << 'legal_review' if amount >= 50_000
# High risk triggers fraud investigation
steps_to_create << 'fraud_investigation' if risk_score >= 70
# Certain jurisdictions need special checks
steps_to_create << 'jurisdictional_check' if high_regulation_jurisdiction?(jurisdiction)
if steps_to_create.empty?
# No additional compliance steps needed
decision_no_branches(
result_data: { reason: 'no_compliance_requirements' }
)
else
decision_success(
steps: steps_to_create,
result_data: {
compliance_level: 'enhanced',
checks_required: steps_to_create
}
)
end
end
private
def high_regulation_jurisdiction?(jurisdiction)
%w[EU UK APAC].include?(jurisdiction)
end
end
DSL Alternative:
extend TaskerCore::StepHandler::Functional
ComplianceRoutingHandler = decision_handler(
'ComplianceRoutingHandler',
inputs: [:amount, :jurisdiction]
) do |amount:, jurisdiction:, context:|
risk_score = context.get_dependency_result('assess_risk')['risk_score']
steps = []
steps << 'legal_review' if amount.to_f >= 50_000
steps << 'fraud_investigation' if risk_score >= 70
steps << 'jurisdictional_check' if %w[EU UK APAC].include?(jurisdiction)
if steps.empty?
Decision.no_branches(reason: 'no_compliance_requirements')
else
Decision.route(steps, compliance_level: 'enhanced', checks_required: steps)
end
end
Execution Scenarios
Scenario 1: Emergency Low-Risk Request ($5,000)
Path: validate → assess_risk → primary_routing → emergency_approval → finalize
Steps Created: 5
Decision Points: 1 (primary_routing creates emergency_approval)
Complexity: Low
Scenario 2: Standard Medium-Risk Request ($3,000, Risk 25)
Path: validate → assess_risk → primary_routing → standard_manager_approval → finalize
Steps Created: 5
Decision Points: 1 (primary_routing creates standard_manager_approval)
Complexity: Low
Scenario 3: High-Risk Large Amount ($75,000, Risk 80, EU)
Path: validate → assess_risk → primary_routing → senior_manager_approval + compliance_routing
→ legal_review + fraud_investigation + jurisdictional_check → finalize
Steps Created: 9
Decision Points: 2 (primary_routing → compliance_routing)
Complexity: High (nested decisions)
Ruby Implementation Guide
Using the Decision Base Class
The TaskerCore::StepHandler::Decision base class provides type-safe helpers:
class MyDecisionHandler < TaskerCore::StepHandler::Decision
def call(context)
# Your business logic here
amount = context.get_task_field('amount')
if amount < 1000
# Create single step
decision_success(
steps: 'auto_approve', # Can pass string or array
result_data: { route: 'auto' }
)
elsif amount < 5000
# Create multiple steps
decision_success(
steps: ['manager_approval', 'risk_check'],
result_data: { route: 'standard' }
)
else
# No additional steps needed
decision_no_branches(
result_data: { route: 'none', reason: 'manual_review_required' }
)
end
end
end
Helper Methods
`decision_success(steps:, result_data: {}, metadata: {})`
- Creates steps dynamically
- `steps`: String or Array of step names
- `result_data`: Additional data to store in step results
- `metadata`: Observability metadata
`decision_no_branches(result_data: {}, metadata: {})`
- No additional steps created
- Workflow proceeds to next static step
`decision_with_custom_outcome(outcome:, result_data: {}, metadata: {})`
- Advanced: full control over the outcome structure
- Most handlers should use `decision_success` or `decision_no_branches`
`validate_decision_outcome!(outcome)`
- Validates custom outcome structure
- Raises error if invalid
Type Definitions
# workers/ruby/lib/tasker_core/types/decision_point_outcome.rb
module TaskerCore
module Types
module DecisionPointOutcome
# Factory methods
def self.no_branches
NoBranches.new
end
def self.create_steps(step_names)
CreateSteps.new(step_names: step_names)
end
# Serialization format (matches Rust)
class NoBranches
def to_h
{ type: 'no_branches' }
end
end
class CreateSteps
attr_reader :step_names
def initialize(step_names:)
@step_names = step_names
end
def to_h
{ type: 'create_steps', step_names: step_names }
end
end
end
end
end
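To make the wire format concrete, here is the JSON the two variants serialize to (a self-contained illustration of the expected `to_h` output, not the library code itself):

```ruby
require 'json'

# The two outcome shapes as they cross the Ruby/Rust boundary. The tag-based
# layout must match Rust's #[serde(tag = "type", rename_all = "snake_case")].
no_branches  = { type: 'no_branches' }
create_steps = { type: 'create_steps',
                 step_names: %w[manager_approval finance_review] }

JSON.generate(create_steps)
# => {"type":"create_steps","step_names":["manager_approval","finance_review"]}
```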
Rust Implementation Guide
Decision Handler Implementation
Actual Implementation (from workers/rust/src/step_handlers/conditional_approval_rust.rs):
#![allow(unused)]
fn main() {
use super::{error_result, success_result, RustStepHandler, StepHandlerConfig};
use anyhow::Result;
use async_trait::async_trait;
use chrono::Utc;
use serde_json::json;
use std::collections::HashMap;
use tasker_shared::messaging::{DecisionPointOutcome, StepExecutionResult};
use tasker_shared::types::TaskSequenceStep;
const SMALL_AMOUNT_THRESHOLD: i64 = 1000;
const LARGE_AMOUNT_THRESHOLD: i64 = 5000;
pub struct RoutingDecisionHandler {
#[allow(dead_code)]
config: StepHandlerConfig,
}
#[async_trait]
impl RustStepHandler for RoutingDecisionHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Extract amount from task context
let amount: i64 = step_data.get_context_field("amount")?;
// Business logic: determine routing
let (route_type, steps, reasoning) = if amount < SMALL_AMOUNT_THRESHOLD {
(
"auto_approval",
vec!["auto_approve"],
format!("Amount ${} under threshold", amount)
)
} else if amount < LARGE_AMOUNT_THRESHOLD {
(
"manager_only",
vec!["manager_approval"],
format!("Amount ${} requires manager approval", amount)
)
} else {
(
"dual_approval",
vec!["manager_approval", "finance_review"],
format!("Amount ${} requires dual approval", amount)
)
};
// Create decision point outcome
let outcome = DecisionPointOutcome::create_steps(
steps.iter().map(|s| s.to_string()).collect()
);
// Build result with embedded outcome
let result_data = json!({
"route_type": route_type,
"reasoning": reasoning,
"amount": amount,
"decision_point_outcome": outcome.to_value() // Embedded outcome
});
let metadata = HashMap::from([
("route_type".to_string(), json!(route_type)),
("steps_to_create".to_string(), json!(steps)),
]);
Ok(success_result(
step_uuid,
result_data,
start_time.elapsed().as_millis() as i64,
Some(metadata),
))
}
fn name(&self) -> &str {
"routing_decision"
}
fn new(config: StepHandlerConfig) -> Self {
Self { config }
}
}
}
DecisionPointOutcome Type
Type Definition (from tasker-shared/src/messaging/execution_types.rs):
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum DecisionPointOutcome {
NoBranches,
CreateSteps {
step_names: Vec<String>,
},
}
impl DecisionPointOutcome {
/// Create outcome that creates specific steps
pub fn create_steps(step_names: Vec<String>) -> Self {
Self::CreateSteps { step_names }
}
/// Create outcome with no additional steps
pub fn no_branches() -> Self {
Self::NoBranches
}
/// Convert to JSON value for embedding in StepExecutionResult
pub fn to_value(&self) -> serde_json::Value {
serde_json::to_value(self).expect("DecisionPointOutcome serialization should not fail")
}
/// Extract decision outcome from step execution result
pub fn from_step_result(result: &serde_json::Value) -> Option<Self> {
result
.as_object()?
.get("decision_point_outcome")
.and_then(|v| serde_json::from_value(v.clone()).ok())
}
}
}
Key Rust Patterns:
DecisionPointOutcome::create_steps(vec![...])- Type-safe factoryoutcome.to_value()- Serializes to JSON matching Ruby format- Embedded in result JSON as
decision_point_outcomefield - Serde handles serialization:
{ "type": "create_steps", "step_names": [...] }
Best Practices
1. Keep Decision Logic Deterministic
# ✅ Good: Deterministic decision based on input
def call(context)
amount = context.get_task_field('amount')
steps = if amount < 1000
['auto_approve']
else
['manager_approval']
end
decision_success(steps: steps)
end
# ❌ Bad: Non-deterministic (time-based, random)
def call(context)
# Decision changes based on when it runs
steps = if Time.now.hour < 9
['emergency_approval']
else
['standard_approval']
end
decision_success(steps: steps)
end
2. Validate Step Names
Ensure all step names in decision outcomes exist in template:
VALID_STEPS = %w[auto_approve manager_approval finance_review].freeze
def call(context)
steps_to_create = determine_steps(context)
# Validate step names
invalid = steps_to_create - VALID_STEPS
unless invalid.empty?
raise "Invalid step names: #{invalid.join(', ')}"
end
decision_success(steps: steps_to_create)
end
3. Use Deferred Type for Convergence
Any step that might depend on dynamically created steps should be type: deferred:
# ✅ Correct
- name: finalize
type: deferred # Uses intersection semantics
dependencies:
- routing_decision
- auto_approve
- manager_approval
# ❌ Wrong - will fail if dependencies don't all exist
- name: finalize
dependencies:
- routing_decision
- auto_approve
- manager_approval
4. Limit Decision Depth
Prevent infinite recursion:
[orchestration.decision_points]
max_depth = 3 # Maximum nesting level
warn_threshold = 2 # Warn when approaching limit
# ✅ Good: Linear decision chain (depth 1-2)
validate → routing_decision → compliance_check → finalize
# ⚠️ Be Careful: Deep nesting (depth 3)
validate → routing_1 → routing_2 → routing_3 → finalize
# ❌ Bad: Circular or unbounded nesting
routing_decision creates steps that create more routing decisions...
5. Handle No-Branch Cases
Explicitly return no_branches when no steps needed:
def call(context)
amount = context.get_task_field('amount')
if context.get_task_field('skip_approval')
# No additional steps needed
decision_no_branches(
result_data: { reason: 'approval_skipped' }
)
else
decision_success(steps: determine_steps(amount))
end
end
6. Meaningful Result Data
Include context for debugging and audit trails:
decision_success(
steps: ['manager_approval', 'finance_review'],
result_data: {
route_type: 'dual_approval',
reasoning: "Amount $#{amount} >= $5,000 threshold",
amount: amount,
thresholds_applied: {
small: 1_000,
large: 5_000
}
},
metadata: {
decision_time_ms: elapsed_ms,
steps_created_count: 2
}
)
Limitations and Constraints
Technical Limits
1. Maximum Decision Depth
- Default: 3 levels of nested decision points
- Configurable via `orchestration.decision_points.max_depth`
- Prevents infinite recursion
2. Step Names Must Exist in Template
- All step names in `CreateSteps` must be defined in the template
- Orchestration validates before creating steps
- Invalid names cause permanent failure
3. Decision Logic is Non-Retryable by Default
- Decision steps should be deterministic
- Retry disabled by default (`max_attempts: 1`)
- External API calls should be in separate steps
4. Created Steps Cannot Modify Template
- Decision points create instances of template steps
- Cannot dynamically define new step types
- All possible steps must be in template
Performance Considerations
1. Decision Overhead
   - Each decision point adds ~10-20ms of overhead
   - Includes handler execution, step creation, and dependency resolution
   - Factor this into SLA planning
2. Database Impact
   - Each created step = 1 WorkflowStep record + edges
   - Large branch counts increase database operations
   - Monitor `workflow_steps` table growth
3. Observability
   - Decision outcomes are logged with telemetry
   - Metrics track `decision_points.steps_created` and `decision_points.depth`
   - Use structured logging for audit trails
Semantic Constraints
1. Deferred Dependencies Must Include Decision Point
# ✅ Correct
- name: finalize
type: deferred
dependencies:
- routing_decision # Must list the decision point
- auto_approve
- manager_approval
# ❌ Wrong - missing decision point
- name: finalize
type: deferred
dependencies:
- auto_approve
- manager_approval
2. Decision Points Cannot Be Circular
# ❌ Not allowed - circular dependency
routing_a creates routing_b
routing_b creates routing_a
3. No Dynamic Template Modification
- Cannot add new handler types at runtime
- Cannot modify step configurations
- All possibilities must be predefined
Testing Decision Point Workflows
E2E Test Structure
Both Ruby and Rust implementations include comprehensive E2E tests covering all routing scenarios:
Test Locations:
- Ruby: `tests/e2e/ruby/conditional_approval_test.rs`
- Rust: `tests/e2e/rust/conditional_approval_rust.rs`
Test Scenarios:
1. Small Amount ($500) - Auto-approval only
   - Path: validate_request → routing_decision → auto_approve → finalize_approval
   - Expected: 4 steps created, only the auto_approve path taken
2. Medium Amount ($3,000) - Manager approval only
   - Path: validate_request → routing_decision → manager_approval → finalize_approval
   - Expected: 4 steps created, only the manager path taken
3. Large Amount ($10,000) - Dual approval
   - Path: validate_request → routing_decision → manager_approval + finance_review → finalize_approval
   - Expected: 5 steps created, both approval paths taken (in parallel)
4. API Validation - Initial step count verification
   - Expected: 2 steps at initialization (validate_request, routing_decision)
   - Reason: finalize_approval is a transitive descendant of the decision point
Running Tests
# Run all E2E tests
cargo test --test e2e_tests
# Run Ruby conditional approval tests only
cargo test --test e2e_tests e2e::ruby::conditional_approval
# Run Rust conditional approval tests only
cargo test --test e2e_tests e2e::rust::conditional_approval_rust
# Run with output for debugging
cargo test --test e2e_tests -- --nocapture
Test Fixtures
Ruby Template: tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml
Rust Template: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml
Both templates demonstrate:
- Decision point step configuration (`type: decision`)
- Deferred convergence step (`type: deferred`)
- Dynamic step dependencies
- Namespace isolation between Ruby/Rust
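Taken together, these pieces suggest a template shape like the following. This is an illustrative fragment only: `name`, `type`, and `dependencies` appear elsewhere in this guide, but the surrounding layout is an assumption, not Tasker's exact schema.

```yaml
# Illustrative fragment - layout beyond name/type/dependencies is assumed
steps:
  - name: routing_decision
    type: decision          # handler returns a decision_point_outcome
    dependencies:
      - validate_request
  - name: finalize_approval
    type: deferred          # materializes only after the decision resolves
    dependencies:
      - routing_decision    # must list the decision point itself
      - auto_approve
      - manager_approval
```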
Validation Checklist
When implementing decision point workflows, ensure:
- ✅ Decision point step has `type: decision`
- ✅ Deferred convergence step has `type: deferred`
- ✅ All possible dependencies are listed in the deferred step
- ✅ Handler embeds `decision_point_outcome` in its result
- ✅ Step names in the outcome match template definitions
- ✅ E2E tests cover all routing scenarios
- ✅ Tests validate step creation and completion
- ✅ Namespaces are isolated if multiple implementations exist
Related Documentation
- Use Cases & Patterns - More workflow examples
- States and Lifecycles - State machine details
- Task and Step Readiness - Dependency resolution logic
- Quick Start - Getting started guide
- Crate Architecture - System architecture overview
- Decision Point E2E Tests - Detailed test documentation
← Back to Documentation Hub
Configuration Management
Last Updated: 2025-10-17
Audience: Operators, Developers, Architects
Status: Active
Related Docs: Environment Configuration Comparison, Deployment Patterns
Overview
Tasker Core implements a sophisticated role-based configuration system with environment-specific overrides, runtime observability, and comprehensive validation. This document explains how to manage, validate, inspect, and deploy Tasker configurations.
Key Features
| Feature | Description | Benefit |
|---|---|---|
| Role-Based Architecture | 3 focused TOML files organized by common, orchestration, and worker | Easy to understand and maintain |
| Environment Overrides | Test, development, production-specific settings | Safe defaults with production scale-out |
| Single-File Runtime Loading | Load from pre-merged configuration files at runtime | Deployment certainty - exact config known at build time |
| Runtime Observability | /config API endpoints with secret redaction | Live inspection of deployed configurations |
| CLI Tools | Generate and validate single deployable configs | Build-time verification, deployment artifacts |
| Context-Specific Validation | Orchestration and worker-specific validation rules | Catch errors before deployment |
| Secret Redaction | 12+ sensitive key patterns automatically hidden | Safe configuration inspection |
Quick Start
Inspect Running System Configuration
# Check orchestration configuration (includes common + orchestration-specific)
curl http://localhost:8080/config | jq
# Check worker configuration (includes common + worker-specific)
curl http://localhost:8081/config | jq
# Secrets are automatically redacted for safety
Generate Deployable Configuration
# Generate production orchestration config for deployment
tasker-ctl config generate \
--context orchestration \
--environment production \
--output config/tasker/orchestration-production.toml
# This merged file is then loaded at runtime via TASKER_CONFIG_PATH
export TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml
Validate Configuration
# Validate orchestration config for production
tasker-ctl config validate \
--context orchestration \
--environment production
# Validates: type safety, ranges, required fields, business rules
Part 1: Configuration Architecture
1.1 Role-Based Structure
Tasker uses a role-based TOML architecture where configuration is split into focused files organized by role (orchestration vs worker) with single responsibility:
config/tasker/
├── base/ # Base configuration (defaults)
│ ├── common.toml # Shared: database, circuit breakers, telemetry
│ ├── orchestration.toml # Orchestration-specific settings
│ └── worker.toml # Worker-specific settings
│
├── environments/ # Environment-specific overrides
│ ├── test/
│ │ ├── common.toml # Test overrides (small values, fast execution)
│ │ ├── orchestration.toml
│ │ └── worker.toml
│ │
│ ├── development/
│ │ ├── common.toml # Development overrides (medium values, local Docker)
│ │ ├── orchestration.toml
│ │ └── worker.toml
│ │
│ └── production/
│ ├── common.toml # Production overrides (large values, scale-out)
│ ├── orchestration.toml
│ └── worker.toml
│
├── orchestration-test.toml # Generated merged configs (used at runtime via TASKER_CONFIG_PATH)
├── orchestration-production.toml # Single-file deployment artifacts
├── worker-test.toml
└── worker-production.toml
1.2 Configuration Contexts
Tasker has three configuration contexts:
| Context | Purpose | Components |
|---|---|---|
| Common | Shared across orchestration and worker | Database, circuit breakers, telemetry, backoff, system |
| Orchestration | Orchestration-specific settings | Web API, MPSC channels, event systems, shutdown |
| Worker | Worker-specific settings | Handler discovery, resource limits, health monitoring |
1.3 Environment Detection
Configuration loading uses TASKER_ENV environment variable:
# Test environment - small values for fast tests
export TASKER_ENV=test
# Development environment - medium values for local Docker
export TASKER_ENV=development
# Production environment - large values for scale-out
export TASKER_ENV=production
Detection Order:
1. `TASKER_ENV` environment variable
2. Default to "development" if not set
1.4 Runtime Configuration Loading
Production/Docker Deployment: Single-file loading via TASKER_CONFIG_PATH
Runtime systems (orchestration and worker) load configuration from pre-merged single files:
# Set path to merged configuration file
export TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml
# System loads this single file at startup
# No directory merging at runtime - configuration is fully determined at build time
Key Benefits:
- Deployment Certainty: Exact configuration known before deployment
- Simplified Debugging: Single file shows exactly what’s running
- Configuration Auditing: One file to version control and code review
- Fail Loudly: Missing or invalid config halts startup with explicit errors
Configuration Path Precedence:
The system uses a two-tier configuration loading strategy with clear precedence:
1. Primary: `TASKER_CONFIG_PATH` (explicit single file - Docker/production)
   - When set, the system loads configuration from this exact file path
   - Intended for production and Docker deployments
   - Example: `TASKER_CONFIG_PATH=/app/config/tasker/orchestration-production.toml`
   - Source logging: `"📋 Loading orchestration configuration from: /app/config/tasker/orchestration-production.toml (source: TASKER_CONFIG_PATH)"`
2. Fallback: `TASKER_CONFIG_ROOT` (convention-based - tests/development)
   - When `TASKER_CONFIG_PATH` is not set, the system looks for config by convention
   - Convention: `{TASKER_CONFIG_ROOT}/tasker/{context}-{environment}.toml`
   - Examples:
     - Orchestration: `/config/tasker/generated/orchestration-test.toml`
     - Worker: `/config/tasker/worker-production.toml`
   - Source logging: `"📋 Loading orchestration configuration from: /config/tasker/generated/orchestration-test.toml (source: TASKER_CONFIG_ROOT (convention))"`
Logging and Transparency:
The system clearly logs which approach was taken at startup:
# Explicit path approach (TASKER_CONFIG_PATH set)
INFO tasker_shared::system_context: 📋 Loading orchestration configuration from: /app/config/tasker/orchestration-production.toml (source: TASKER_CONFIG_PATH)
# Convention-based approach (TASKER_CONFIG_ROOT set)
INFO tasker_shared::system_context: Using convention-based config path: /config/tasker/generated/orchestration-test.toml (environment=test)
INFO tasker_shared::system_context: 📋 Loading orchestration configuration from: /config/tasker/generated/orchestration-test.toml (source: TASKER_CONFIG_ROOT (convention))
When to Use Each:
| Environment | Recommended Approach | Reason |
|---|---|---|
| Production | TASKER_CONFIG_PATH | Explicit, auditable, matches what’s reviewed |
| Docker | TASKER_CONFIG_PATH | Single source of truth, no ambiguity |
| Kubernetes | TASKER_CONFIG_PATH | ConfigMap contains exact file |
| Tests (nextest) | TASKER_CONFIG_ROOT | Tests span multiple contexts, convention handles both |
| Local dev | Either | Personal preference |
Error Handling:
If neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set:
ConfigurationError("Neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set.
For Docker/production: set TASKER_CONFIG_PATH to the merged config file.
For tests/development: set TASKER_CONFIG_ROOT to the config directory.")
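The two-tier precedence above can be sketched as follows. This is a hedged illustration, not Tasker's actual loader: `resolve_config_path` is a hypothetical name, and the real convention may place generated files in a subdirectory such as `generated/`.

```ruby
# Sketch of the two-tier precedence described above. `resolve_config_path`
# is a hypothetical name, not Tasker's actual API.
def resolve_config_path(context, env_vars)
  if (path = env_vars['TASKER_CONFIG_PATH'])
    [path, 'TASKER_CONFIG_PATH']                      # explicit single file wins
  elsif (root = env_vars['TASKER_CONFIG_ROOT'])
    environment = env_vars.fetch('TASKER_ENV', 'development')
    ["#{root}/tasker/#{context}-#{environment}.toml", # convention-based fallback
     'TASKER_CONFIG_ROOT (convention)']
  else
    raise 'Neither TASKER_CONFIG_PATH nor TASKER_CONFIG_ROOT is set.'
  end
end

resolve_config_path('worker', 'TASKER_CONFIG_ROOT' => '/config', 'TASKER_ENV' => 'test')
# => ["/config/tasker/worker-test.toml", "TASKER_CONFIG_ROOT (convention)"]
```

Note that the fallback branch also encodes the environment-detection default ("development" when `TASKER_ENV` is unset).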
Local Development: Directory-based loading (legacy tests only)
For legacy test compatibility, you can still use directory-based loading via the load_context_direct() method, but this is not supported for production use.
1.5 Merging Strategy
Configuration merging follows environment overrides win pattern:
# base/common.toml
[database.pool]
max_connections = 30
min_connections = 8
# environments/production/common.toml
[database.pool]
max_connections = 50
# Result: max_connections = 50, min_connections = 8 (inherited from base)
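The "environment overrides win" behavior amounts to a recursive merge. A minimal sketch (not Tasker's actual implementation, which merges at the TOML level):

```ruby
# Minimal recursive merge where environment values win over base values.
# Illustrative only - Tasker performs this merge at the TOML level.
def deep_merge(base, override)
  base.merge(override) do |_key, base_val, override_val|
    if base_val.is_a?(Hash) && override_val.is_a?(Hash)
      deep_merge(base_val, override_val) # recurse into nested tables
    else
      override_val                       # environment override wins
    end
  end
end

base = { 'database' => { 'pool' => { 'max_connections' => 30, 'min_connections' => 8 } } }
prod = { 'database' => { 'pool' => { 'max_connections' => 50 } } }
merged = deep_merge(base, prod)
# merged['database']['pool'] => { 'max_connections' => 50, 'min_connections' => 8 }
```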
Part 2: Runtime Observability
2.1 Configuration API Endpoints
Tasker provides unified configuration endpoints that return complete configuration (common + context-specific) in a single response.
Orchestration API
Endpoint: GET /config (system endpoint at root level)
Purpose: Inspect complete orchestration configuration including common settings
Example Request:
curl http://localhost:8080/config | jq
Response Structure:
{
"environment": "production",
"common": {
"database": {
"url": "***REDACTED***",
"pool": {
"max_connections": 50,
"min_connections": 15
}
},
"circuit_breakers": { "...": "..." },
"telemetry": { "...": "..." },
"system": { "...": "..." },
"backoff": { "...": "..." },
"task_templates": { "...": "..." }
},
"orchestration": {
"web": {
"bind_address": "0.0.0.0:8080",
"request_timeout_ms": 60000
},
"mpsc_channels": {
"command_buffer_size": 5000,
"pgmq_notification_buffer_size": 50000
},
"event_systems": { "...": "..." }
},
"metadata": {
"timestamp": "2025-10-17T15:30:45Z",
"source": "runtime",
"redacted_fields": [
"database.url",
"telemetry.api_key"
]
}
}
Worker API
Endpoint: GET /config (system endpoint at root level)
Purpose: Inspect complete worker configuration including common settings
Example Request:
curl http://localhost:8081/config | jq
Response Structure:
{
"environment": "production",
"common": {
"database": { "...": "..." },
"circuit_breakers": { "...": "..." },
"telemetry": { "...": "..." }
},
"worker": {
"template_path": "/app/templates",
"max_concurrent_steps": 500,
"resource_limits": {
"max_memory_mb": 4096,
"max_cpu_percent": 90
},
"web": {
"bind_address": "0.0.0.0:8081",
"request_timeout_ms": 60000
}
},
"metadata": {
"timestamp": "2025-10-17T15:30:45Z",
"source": "runtime",
"redacted_fields": [
"database.url",
"worker.auth_token"
]
}
}
2.2 Design Philosophy
Single Endpoint, Complete Configuration: Each system has one /config endpoint that returns both common and context-specific configuration in a single response.
Benefits:
- Single curl command: Get complete picture without correlation
- Easy comparison: Compare orchestration vs worker configs for compatibility
- Tooling-friendly: Automated tools can validate shared config matches
- Debugging-friendly: No mental correlation between multiple endpoints
- System endpoint: At the root level like `/health` and `/metrics` (not under `/v1/`)
2.3 Comprehensive Secret Redaction
All sensitive configuration values are automatically redacted before returning to clients.
Sensitive Key Patterns (12+ patterns, case-insensitive):
`password`, `secret`, `token`, `key`, `api_key`, `private_key`, `jwt_private_key`, `jwt_public_key`, `auth_token`, `credentials`, `database_url`, `url`
Key Features:
- Recursive Processing: Handles deeply nested objects and arrays
- Field Path Tracking: Reports which fields were redacted (e.g., `database.url`)
- Smart Skipping: Empty strings and booleans are not redacted
- Case-Insensitive: Catches `API_KEY`, `Secret_Token`, `database_PASSWORD`
- Structure Preservation: Non-sensitive data remains intact
Example:
{
"database": {
"url": "***REDACTED***",
"adapter": "postgresql",
"pool": {
"max_connections": 30
}
},
"metadata": {
"redacted_fields": ["database.url"]
}
}
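The redaction behavior described above can be sketched as a recursive walk. This is an illustration with an abbreviated pattern list and assumed names, not Tasker's actual redaction code:

```ruby
# Recursive redaction sketch with field-path tracking, as described above.
# The pattern list is abbreviated; names are illustrative.
SENSITIVE_KEYS = /password|secret|token|key|credentials|url/i

def redact(value, path = [], redacted = [])
  case value
  when Hash
    value.each_with_object({}) do |(k, v), out|
      child = path + [k.to_s]
      if v.is_a?(String) && !v.empty? && k.to_s.match?(SENSITIVE_KEYS)
        redacted << child.join('.')     # track what was hidden, e.g. "database.url"
        out[k] = '***REDACTED***'
      else
        out[k] = redact(v, child, redacted)
      end
    end
  when Array
    value.map { |v| redact(v, path, redacted) } # recurse into arrays too
  else
    value                               # booleans/numbers pass through untouched
  end
end

redacted = []
config = { 'database' => { 'url' => 'postgresql://u:p@host/db', 'adapter' => 'postgresql' } }
safe = redact(config, [], redacted)
# safe['database']['url'] => "***REDACTED***"; redacted => ["database.url"]
```

The accumulated `redacted` paths correspond to the `metadata.redacted_fields` array shown in the responses above.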
2.4 OpenAPI/Swagger Integration
All configuration endpoints are documented with OpenAPI 3.0 and Swagger UI.
Access Swagger UI:
- Orchestration: http://localhost:8080/api-docs/ui
- Worker: http://localhost:8081/api-docs/ui
OpenAPI Specification:
- Orchestration: http://localhost:8080/api-docs/openapi.json
- Worker: http://localhost:8081/api-docs/openapi.json
Part 3: CLI Tools
3.1 Generate Command
Purpose: Generate a single merged configuration file from base + environment overrides for deployment.
Command Signature:
tasker-ctl config generate \
--context <common|orchestration|worker> \
--environment <test|development|production>
Examples:
# Generate orchestration config for production
tasker-ctl config generate --context orchestration --environment production
# Generate worker config for development
tasker-ctl config generate --context worker --environment development
# Generate common config for test
tasker-ctl config generate --context common --environment test
Output Location: Automatically generated at:
config/tasker/generated/{context}-{environment}.toml
Key Features:
- Automatic Paths: No need for `--source-dir` or `--output` flags
- Metadata Headers: Generated files include a rich metadata header:

  # Generated by Tasker Configuration System
  # Context: orchestration
  # Environment: production
  # Generated At: 2025-10-17T15:30:45Z
  # Base Config: config/tasker/base/orchestration.toml
  # Environment Override: config/tasker/environments/production/orchestration.toml
  #
  # This is a merged configuration file combining base settings with
  # environment-specific overrides. Environment values take precedence.

- Automatic Validation: Validates during generation
- Smart Merging: TOML-level merging preserves structure
3.2 Validate Command
Purpose: Validate configuration files with context-specific validation rules.
Command Signature:
tasker-ctl config validate \
--context <common|orchestration|worker> \
--environment <test|development|production>
Examples:
# Validate orchestration config for production
tasker-ctl config validate --context orchestration --environment production
# Validate worker config for test
tasker-ctl config validate --context worker --environment test
Validation Features:
- Environment variable substitution (`${VAR:-default}`)
- Type checking (numeric ranges, boolean values)
- Required field validation
- Context-specific business rules
- Clear error messages
Example Output:
🔍 Validating configuration...
Context: orchestration
Environment: production
✓ Configuration loaded
✓ Validation passed
✅ Configuration is valid!
📊 Configuration Summary:
Context: orchestration
Environment: production
Database: postgresql://tasker:***@localhost/tasker_production
Web API: 0.0.0.0:8080
MPSC Channels: 5 configured
3.3 Configuration Validator Binary
For quick validation without the full CLI:
# Validate all three environments
TASKER_ENV=test cargo run --bin config-validator
TASKER_ENV=development cargo run --bin config-validator
TASKER_ENV=production cargo run --bin config-validator
Part 4: Environment-Specific Configurations
See Environment Configuration Comparison for complete details on configuration values across environments.
4.1 Scaling Pattern
Tasker follows a 1:5:50 scaling pattern across environments:
| Component | Test | Development | Production | Pattern |
|---|---|---|---|---|
| Database Connections | 10 | 25 | 50 | 1x → 2.5x → 5x |
| Concurrent Steps | 10 | 50 | 500 | 1x → 5x → 50x |
| MPSC Channel Buffers | 100-500 | 500-1000 | 2000-50000 | 1x → 5-10x → 20-100x |
| Memory Limits | 512MB | 2GB | 4GB | 1x → 4x → 8x |
4.2 Environment Philosophy
Test Environment:
- Goal: Fast execution, test isolation
- Strategy: Minimal resources, small buffers
- Example: 10 database connections, 100-500 MPSC buffers
Development Environment:
- Goal: Comfortable local Docker development
- Strategy: Medium values, realistic workflows
- Example: 25 database connections, 2GB RAM, 500-1000 MPSC buffers
- Cluster Testing: 2 orchestrators to test multi-instance coordination
Production Environment:
- Goal: High throughput, scale-out capacity
- Strategy: Large values, production resilience
- Example: 50 database connections, 4GB RAM, 2000-50000 MPSC buffers
Part 5: Deployment Workflows
5.1 Docker Deployment
Build-Time Configuration Generation:
FROM rust:1.75 as builder
WORKDIR /app
COPY . .
# Build CLI tool
RUN cargo build --release --bin tasker-ctl
# Generate production config (single merged file)
RUN ./target/release/tasker-ctl config generate \
--context orchestration \
--environment production \
--output config/tasker/orchestration-production.toml
# Build orchestration binary
RUN cargo build --release --bin tasker-orchestration
FROM rust:1.75-slim
WORKDIR /app
# Copy orchestration binary
COPY --from=builder /app/target/release/tasker-orchestration /usr/local/bin/
# Copy generated config (single file with all merged settings)
COPY --from=builder /app/config/tasker/orchestration-production.toml /app/config/orchestration.toml
# Set environment - TASKER_CONFIG_PATH is REQUIRED
ENV TASKER_CONFIG_PATH=/app/config/orchestration.toml
ENV TASKER_ENV=production
CMD ["tasker-orchestration"]
Key Changes from Phase 2:
- ✅ Single merged file generated at build time
- ✅ `TASKER_CONFIG_PATH` environment variable (required)
- ✅ No runtime merging - exact config known at build time
- ✅ Fail loudly if `TASKER_CONFIG_PATH` is not set
5.2 Kubernetes Deployment
ConfigMap Strategy with Pre-Generated Config:
# Step 1: Generate merged configuration locally
tasker-ctl config generate \
--context orchestration \
--environment production \
--output orchestration-production.toml
# Step 2: Create ConfigMap from generated file
kubectl create configmap tasker-orchestration-config \
--from-file=orchestration.toml=orchestration-production.toml
Deployment Manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
name: tasker-orchestration
spec:
replicas: 2
selector:
matchLabels:
app: tasker-orchestration
template:
metadata:
labels:
app: tasker-orchestration
spec:
containers:
- name: orchestration
image: tasker/orchestration:latest
env:
- name: TASKER_ENV
value: "production"
# REQUIRED: Path to single merged configuration file
- name: TASKER_CONFIG_PATH
value: "/config/orchestration.toml"
# DATABASE_URL should be in a separate secret
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: tasker-db-credentials
key: database-url
volumeMounts:
- name: config
mountPath: /config
readOnly: true
volumes:
- name: config
configMap:
name: tasker-orchestration-config
items:
- key: orchestration.toml
path: orchestration.toml
Key Benefits:
- ✅ Generated file reviewed before deployment
- ✅ Single source of truth for runtime configuration
- ✅ Easy to diff between environments
- ✅ ConfigMap contains exact runtime configuration
5.3 Local Development and Testing
For Tests (Legacy directory-based loading):
# Set test environment
export TASKER_ENV=test
# Tests use legacy load_context_direct() method
cargo test --all-features
For Docker Compose (Single-file loading):
# Generate test configs first
tasker-ctl config generate --context orchestration --environment test \
--output config/tasker/generated/orchestration-test.toml
tasker-ctl config generate --context worker --environment test \
--output config/tasker/generated/worker-test.toml
# Start services with generated configs
docker-compose -f docker/docker-compose.test.yml up
Docker Compose Configuration:
services:
orchestration:
environment:
# REQUIRED: Path to single merged file
TASKER_CONFIG_PATH: /app/config/tasker/generated/orchestration-test.toml
volumes:
# Mount config directory (contains generated files)
- ./config/tasker:/app/config/tasker:ro
Key Points:
- ✅ Tests use legacy directory-based loading for convenience
- ✅ Docker Compose uses single-file loading (matches production)
- ✅ Generated files should be committed to repo for reproducibility
- ✅ Both approaches work; choose based on use case
Part 6: Configuration Validation
6.1 Context-Specific Validation
Each configuration context has specific validation rules:
Common Configuration:
- Database URL format and connectivity
- Pool size ranges (1-1000 connections)
- Circuit breaker thresholds (1-100 failures)
- Timeout durations (1-3600 seconds)
Orchestration Configuration:
- Web API bind address format
- Request timeout ranges (1000-300000 ms)
- MPSC channel buffer sizes (100-100000)
- Event system configuration consistency
Worker Configuration:
- Template path existence
- Resource limit ranges (memory, CPU %)
- Handler discovery path validation
- Concurrent step limits (1-10000)
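Two of the rules above, sketched as standalone checks. The function names and error strings are illustrative; Tasker's real validator is context-aware and covers many more rules:

```ruby
# Illustrative checks mirroring two of the rules above (pool size range,
# bind address format); not Tasker's actual validator.
def validate_common(config)
  errors = []
  max = config.dig('database', 'pool', 'max_connections')
  unless max.is_a?(Integer) && (1..1000).cover?(max)
    errors << "database.pool.max_connections: expected 1-1000, got #{max.inspect}"
  end
  errors
end

def validate_orchestration(config)
  errors = []
  addr = config.dig('web', 'bind_address').to_s
  unless addr.match?(/\A[\w.\-]+:\d{1,5}\z/)  # rough host:port shape
    errors << "web.bind_address: invalid IP:port format #{addr.inspect}"
  end
  errors
end

validate_common('database' => { 'pool' => { 'max_connections' => 5000 } })
# => ["database.pool.max_connections: expected 1-1000, got 5000"]
```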
6.2 Validation Workflow
Pre-Deployment Validation:
# Validate before generating deployment artifact
tasker-ctl config validate --context orchestration --environment production
# Generate only if validation passes
tasker-ctl config generate --context orchestration --environment production
Runtime Validation:
- Configuration validated on application startup
- Invalid config prevents startup (fail-fast)
- Clear error messages for troubleshooting
6.3 Common Validation Errors
Example Error Messages:
❌ Validation Error: database.pool.max_connections
Value: 5000
Issue: Exceeds maximum allowed value (1000)
Fix: Reduce to 1000 or less
❌ Validation Error: web.bind_address
Value: "invalid:port"
Issue: Invalid IP:port format
Fix: Use format like "0.0.0.0:8080" or "127.0.0.1:3000"
Part 7: Operational Workflows
7.1 Compare Deployed Configurations
Cross-System Comparison:
# Get orchestration config
curl http://orchestration:8080/config > orch-config.json
# Get worker config
curl http://worker:8081/config > worker-config.json
# Compare common sections for compatibility
jq '.common' orch-config.json > orch-common.json
jq '.common' worker-config.json > worker-common.json
diff orch-common.json worker-common.json
Why This Matters:
- Ensures orchestration and worker share same database config
- Validates circuit breaker settings match
- Confirms telemetry endpoints aligned
7.2 Debug Configuration Issues
Step 1: Inspect Runtime Config
# Check what's actually deployed
curl http://localhost:8080/config | jq '.orchestration.web'
Step 2: Compare to Expected
# Check generated config file
cat config/tasker/generated/orchestration-production.toml
# Compare values
Step 3: Trace Configuration Source
# Check metadata for source files
curl http://localhost:8080/config | jq '.metadata'
# Metadata shows:
# - Environment (production)
# - Timestamp (when config was loaded)
# - Source (runtime)
# - Redacted fields (for transparency)
7.3 Configuration Drift Detection
Manual Comparison:
# Generate what should be deployed
tasker-ctl config generate --context orchestration --environment production
# Compare to runtime
diff config/tasker/generated/orchestration-production.toml \
<(curl -s http://localhost:8080/config | jq -r '.orchestration')
Automated Monitoring (future):
- Periodic config snapshots
- Alert on unexpected changes
- Configuration version tracking
Part 8: Best Practices
8.1 Configuration Management
DO:
✅ Use environment variables for secrets (${DATABASE_URL})
✅ Validate configs before deployment
✅ Generate single deployable artifacts for production
✅ Use /config endpoints for debugging
✅ Keep environment overrides minimal (only what changes)
✅ Document configuration changes in commit messages
DON’T:
❌ Commit production secrets to config files
❌ Mix test and production configurations
❌ Skip validation before deployment
❌ Use unbounded configuration values
❌ Override all settings in environment files
8.2 Security Best Practices
Secrets Management:
# ✅ GOOD: Use environment variable substitution
[database]
url = "${DATABASE_URL}"
# ❌ BAD: Hard-code credentials
[database]
url = "postgresql://user:password@localhost/db"
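The `${VAR:-default}` substitution used in these files can be sketched as a small expansion pass. This is a hedged illustration; Tasker's actual implementation may treat a missing variable without a default as an error rather than expanding to an empty string:

```ruby
# Sketch of ${VAR:-default} expansion. Illustrative only.
def substitute_env(text, env = ENV)
  text.gsub(/\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}/) do
    name    = Regexp.last_match(1)
    default = Regexp.last_match(2).to_s
    env.fetch(name) { default }          # env value wins; fall back to default
  end
end

substitute_env('url = "${DATABASE_URL:-postgres://localhost/dev}"', {})
# => "url = \"postgres://localhost/dev\""
```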
Production Deployment:
# ✅ GOOD: Use Kubernetes secrets
kubectl create secret generic tasker-db-url \
--from-literal=url='postgresql://...'
# ❌ BAD: Commit secrets to config files
Runtime Inspection:
- `/config` endpoint automatically redacts secrets
- Safe to use in logging and monitoring
- Field path tracking shows what was redacted
8.3 Testing Strategy
Test All Environments:
# Ensure all environments validate
for env in test development production; do
echo "Validating $env..."
tasker-ctl config validate --context orchestration --environment $env
done
Integration Testing:
# Test with generated configs
tasker-ctl config generate --context orchestration --environment test
export TASKER_CONFIG_PATH=config/tasker/generated/orchestration-test.toml
cargo test --all-features
Part 9: Troubleshooting
9.1 Common Issues
Issue: Configuration fails to load
# Check environment variable
echo $TASKER_ENV
# Check config files exist
ls -la config/tasker/base/
ls -la config/tasker/environments/$TASKER_ENV/
# Validate config
tasker-ctl config validate --context orchestration --environment $TASKER_ENV
Issue: Unexpected configuration values at runtime
# Check runtime config
curl http://localhost:8080/config | jq
# Compare to expected
cat config/tasker/generated/orchestration-$TASKER_ENV.toml
Issue: Validation errors
# Run validation with detailed output
RUST_LOG=debug tasker-ctl config validate \
--context orchestration \
--environment production
9.2 Debug Mode
Enable Configuration Debug Logging:
# Detailed config loading logs
RUST_LOG=tasker_shared::config=debug cargo run
# Shows:
# - Which files are loaded
# - Merge order
# - Environment variable substitution
# - Validation results
Part 10: Future Enhancements
10.1 Planned Features
Explain Command (Deferred):
# Get documentation for a parameter
tasker-ctl config explain --parameter database.pool.max_connections
# Shows:
# - Purpose and system impact
# - Valid range and type
# - Environment-specific recommendations
# - Related parameters
# - Example usage
Detect-Unused Command (Deferred):
# Find unused configuration parameters
tasker-ctl config detect-unused --context orchestration
# Auto-remove with backup
tasker-ctl config detect-unused --context orchestration --fix
10.2 Operational Enhancements
Configuration Versioning:
- Track configuration changes over time
- Compare configs across versions
- Rollback capability
Automated Drift Detection:
- Periodic config snapshots
- Alert on unexpected changes
- Configuration compliance checking
Configuration Templates:
- Pre-built configurations for common scenarios
- Quick-start templates for new deployments
- Best practice configurations
Related Documentation
- Environment Configuration Comparison - Detailed comparison of configuration values across environments
- Deployment Patterns - Deployment modes and strategies
- Quick Start Guide - Getting started with Tasker
Summary
Tasker’s configuration system provides:
- Role-Based Architecture: Focused TOML files with single responsibility
- Environment Scaling: 1:5:50 pattern from test → development → production
- Single-File Runtime Loading: Deploy the exact configuration known at build time via `TASKER_CONFIG_PATH`
- Runtime Observability: `/config` endpoints with comprehensive secret redaction
- CLI Tools: Generate and validate single deployable configs
- Context-Specific Validation: Catch errors before deployment
- Security First: Automatic secret redaction, environment variable substitution
Key Workflows:
- Production/Docker: Generate single-file config at build time, set `TASKER_CONFIG_PATH`, deploy
- Testing: Use legacy directory-based loading for convenience
- Debugging: Use `/config` endpoints to inspect runtime configuration
- Validation: Validate before generating deployment artifacts
Phase 3 Changes (October 2025):
- ✅ Runtime systems now require the `TASKER_CONFIG_PATH` environment variable
- ✅ Deployment certainty: exact config known at build time
- ✅ Fail loudly: missing/invalid config halts startup with explicit errors
- ✅ Generated configs committed to repo for reproducibility
← Back to Documentation Hub
Dead Letter Queue (DLQ) System Architecture
Purpose: Investigation tracking system for stuck, stale, or problematic tasks
Last Updated: 2025-11-01
Executive Summary
The DLQ (Dead Letter Queue) system is an investigation tracking system, NOT a task manipulation layer.
Key Principles:
- DLQ tracks “why task is stuck” and “who investigated”
- Resolution happens at step level via step APIs
- No task-level “requeue” - fix the problem steps instead
- Steps carry their own retry, attempt, and state lifecycles independent of DLQ
- DLQ is for audit, visibility, and investigation only
Architecture: PostgreSQL-based system with:
- `tasks_dlq` table for investigation tracking
- 3 database views for monitoring and analysis
- 6 REST endpoints for operator interaction
- Background staleness detection service
DLQ vs Step Resolution
What DLQ Does
✅ Investigation Tracking:
- Record when and why task became stuck
- Capture complete task snapshot for debugging
- Track operator investigation workflow
- Provide visibility into systemic issues
✅ Visibility and Monitoring:
- Dashboard statistics by DLQ reason
- Prioritized investigation queue for triage
- Proactive staleness monitoring (before DLQ)
- Alerting integration for high-priority entries
What DLQ Does NOT Do
❌ Task Manipulation:
- Does NOT retry failed steps
- Does NOT requeue tasks
- Does NOT modify step state
- Does NOT execute business logic
Why This Separation Matters
Steps are mutable - Operators can:
- Manually resolve failed steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- View step readiness status: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Check retry eligibility and dependency satisfaction
- Trigger next steps by completing blocked steps
DLQ is immutable audit trail - Operators should:
- Review task snapshot to understand what went wrong
- Use step endpoints to fix the underlying problem
- Update DLQ investigation status to track resolution
- Analyze DLQ patterns to prevent future occurrences
DLQ Reasons
staleness_timeout
Definition: Task exceeded state-specific staleness threshold
States:
- waiting_for_dependencies - Default 60 minutes
- waiting_for_retry - Default 30 minutes
- steps_in_process - Default 30 minutes
Template Override: Configure per-template thresholds:
lifecycle:
max_waiting_for_dependencies_minutes: 120
max_waiting_for_retry_minutes: 45
max_steps_in_process_minutes: 60
max_duration_minutes: 1440 # 24 hours
Resolution Pattern:
- Operator: GET /v1/dlq/task/{task_uuid} - Review task snapshot
- Identify stuck steps: Check current_state in snapshot
- Fix steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Task state machine automatically progresses when steps fixed
- Operator: PATCH /v1/dlq/entry/{dlq_entry_uuid} - Mark investigation resolved
Prevention: Use /v1/dlq/staleness endpoint for proactive monitoring
max_retries_exceeded
Definition: Step exhausted all retry attempts and remains in Error state
Resolution Pattern:
- Review step results: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Analyze last_failure_at and error details
- Fix underlying issue (infrastructure, data, etc.)
- Manually resolve step: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Update DLQ investigation status
dependency_cycle_detected
Definition: Circular dependency detected in workflow step graph
Resolution Pattern:
- Review task template configuration
- Identify cycle in step dependencies
- Update template to break cycle
- Manually cancel affected tasks
- Re-submit tasks with corrected template
worker_unavailable
Definition: No worker available for task’s namespace
Resolution Pattern:
- Check worker service health
- Verify namespace configuration
- Scale worker capacity if needed
- Tasks automatically progress when worker available
manual_dlq
Definition: Operator manually sent task to DLQ for investigation
Resolution Pattern: Custom per-investigation
Database Schema
tasks_dlq Table
CREATE TABLE tasker.tasks_dlq (
dlq_entry_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
task_uuid UUID NOT NULL UNIQUE, -- One pending entry per task
original_state VARCHAR(50) NOT NULL,
dlq_reason dlq_reason NOT NULL,
dlq_timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
task_snapshot JSONB, -- Complete task state for debugging
resolution_status dlq_resolution_status NOT NULL DEFAULT 'pending',
resolution_notes TEXT,
resolved_at TIMESTAMPTZ,
resolved_by VARCHAR(255),
metadata JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Unique constraint: Only one pending DLQ entry per task
CREATE UNIQUE INDEX idx_dlq_unique_pending_task
ON tasker.tasks_dlq (task_uuid)
WHERE resolution_status = 'pending';
Key Fields:
- dlq_entry_uuid - UUID v7 (time-ordered) for investigation tracking
- task_uuid - Foreign key to task (unique for pending entries)
- original_state - Task state when sent to DLQ
- task_snapshot - JSONB snapshot with debugging context
- resolution_status - Investigation workflow status
Database Views
v_dlq_dashboard
Purpose: Aggregated statistics for monitoring dashboard
Columns:
- dlq_reason - Why tasks are in DLQ
- total_entries - Count of entries
- pending, manually_resolved, permanent_failures, cancelled - Breakdown by status
- oldest_entry, newest_entry - Time range
- avg_resolution_time_minutes - Average time to resolve
Use Case: High-level DLQ health monitoring
v_dlq_investigation_queue
Purpose: Prioritized queue for operator triage
Columns:
- Task and DLQ entry UUIDs
- priority_score - Composite score (base reason priority + age factor)
- minutes_in_dlq - How long entry has been pending
- Task metadata for context
Ordering: Priority score DESC (most urgent first)
Use Case: Operator dashboard showing “what to investigate next”
v_task_staleness_monitoring
Purpose: Proactive staleness monitoring BEFORE tasks hit DLQ
Columns:
- task_uuid, namespace_name, task_name
- current_state, time_in_state_minutes
- staleness_threshold_minutes - Threshold for this state
- health_status - healthy | warning | stale
- priority - Task priority for ordering
Health Status Classification:
- healthy - < 80% of threshold
- warning - 80-99% of threshold
- stale - ≥ 100% of threshold
Use Case: Alerting at 80% threshold to prevent DLQ entries
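The classification rule above can be sketched as a simple ratio check. This is an illustrative sketch of the documented thresholds, not the view's actual SQL:

```python
def classify_health(time_in_state_minutes: float, threshold_minutes: float) -> str:
    """Classify a task's staleness health relative to its state threshold."""
    ratio = time_in_state_minutes / threshold_minutes
    if ratio >= 1.0:
        return "stale"    # at or beyond the threshold
    if ratio >= 0.8:
        return "warning"  # 80-99% of the threshold — alert here, before DLQ
    return "healthy"      # under 80%
```

For example, a task 50 minutes into a 60-minute threshold classifies as warning, which is exactly the window where proactive intervention is cheapest.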
REST API Endpoints
1. List DLQ Entries
GET /v1/dlq?resolution_status=pending&limit=50
Purpose: Browse DLQ entries with filtering
Query Parameters:
- resolution_status - Filter by status (optional)
- limit - Max entries (default: 50)
- offset - Pagination offset (default: 0)
Response: Array of DlqEntry objects
Use Case: General DLQ browsing and pagination
2. Get DLQ Entry with Task Snapshot
GET /v1/dlq/task/{task_uuid}
Purpose: Retrieve most recent DLQ entry for a task with complete snapshot
Response: DlqEntry with full task_snapshot JSONB
Task Snapshot Contains:
- Task UUID, namespace, name
- Current state and time in state
- Staleness threshold
- Task age and priority
- Template configuration
- Detection time
Use Case: Investigation starting point - “why is this task stuck?”
3. Update DLQ Investigation Status
PATCH /v1/dlq/entry/{dlq_entry_uuid}
Purpose: Track investigation workflow
Request Body:
{
"resolution_status": "manually_resolved",
"resolution_notes": "Fixed by manually completing stuck step using step API",
"resolved_by": "operator@example.com",
"metadata": {
"fixed_step_uuid": "...",
"root_cause": "database connection timeout"
}
}
Use Case: Document investigation findings and resolution
4. Get DLQ Statistics
GET /v1/dlq/stats
Purpose: Aggregated statistics for monitoring
Response: Statistics grouped by dlq_reason
Use Case: Dashboard metrics, identifying systemic issues
5. Get Investigation Queue
GET /v1/dlq/investigation-queue?limit=100
Purpose: Prioritized queue for operator triage
Response: Array of DlqInvestigationQueueEntry ordered by priority
Priority Factors:
- Base reason priority (staleness_timeout: 10, max_retries: 20, etc.)
- Age multiplier (older entries = higher priority)
Use Case: “What should I investigate next?”
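A minimal sketch of how such a composite score could be computed. The two base priorities come from the doc; the remaining base values and the age weight are assumptions for illustration, not the engine's actual formula:

```python
# Base priorities: staleness_timeout (10) and max_retries_exceeded (20) are
# documented; the other reasons and the age weight are illustrative assumptions.
BASE_PRIORITY = {
    "staleness_timeout": 10,
    "max_retries_exceeded": 20,
    "dependency_cycle_detected": 30,
    "worker_unavailable": 15,
    "manual_dlq": 5,
}

def priority_score(dlq_reason: str, minutes_in_dlq: float) -> float:
    """Composite score: base reason priority plus an age factor (older = higher)."""
    base = BASE_PRIORITY.get(dlq_reason, 10)
    age_factor = minutes_in_dlq / 60.0  # assumed: one point per hour in DLQ
    return base + age_factor
```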
6. Get Staleness Monitoring
GET /v1/dlq/staleness?limit=100
Purpose: Proactive monitoring BEFORE tasks hit DLQ
Response: Array of StalenessMonitoring with health status
Ordering: Stale first, then warning, then healthy
Use Case: Alerting and prevention
Alert Integration:
# Alert when warning count exceeds threshold
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'
Step Endpoints and Resolution Workflow
Step Endpoints
1. List Task Steps
GET /v1/tasks/{uuid}/workflow_steps
Returns: Array of steps with readiness status
Key Fields:
current_state- Step state (pending, enqueued, in_progress, complete, error)dependencies_satisfied- Can step execute?retry_eligible- Can step retry?ready_for_execution- Ready to enqueue?attempts/max_attempts- Retry trackinglast_failure_at- When step last failednext_retry_at- When step eligible for retry
Use Case: Understand task execution status
2. Get Step Details
GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
Returns: Single step with full readiness analysis
Use Case: Deep dive into specific step
3. Manually Resolve Step
PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
Purpose: Operator actions to handle stuck or failed steps
Action Types:
- ResetForRetry - Reset attempt counter and return to pending for automatic retry:
{
"action_type": "reset_for_retry",
"reset_by": "operator@example.com",
"reason": "Database connection restored, resetting attempts"
}
- ResolveManually - Mark step as manually resolved without results:
{
"action_type": "resolve_manually",
"resolved_by": "operator@example.com",
"reason": "Non-critical step, bypassing for workflow continuation"
}
- CompleteManually - Complete step with execution results for dependent steps:
{
"action_type": "complete_manually",
"completion_data": {
"result": {
"validated": true,
"score": 95
},
"metadata": {
"manually_verified": true,
"verification_method": "manual_inspection"
}
},
"reason": "Manual verification completed after infrastructure fix",
"completed_by": "operator@example.com"
}
Behavior by Action Type:
- reset_for_retry: Clears attempt counter, transitions to pending, enables automatic retry
- resolve_manually: Transitions to resolved_manually (terminal state)
- complete_manually: Transitions to complete with results available for dependent steps
Common Effects:
- Triggers task state machine re-evaluation
- Task automatically discovers next ready steps
- Task progresses when all dependencies satisfied
Use Case: Unblock stuck workflow by fixing problem step
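The three action bodies shown above differ only in their operator-attribution field. A small client-side helper can make that explicit (a sketch using the field names from the request examples above; the helper itself is not part of any Tasker SDK):

```python
def build_step_resolution_body(action_type, operator, reason, completion_data=None):
    """Build the PATCH body for /v1/tasks/{uuid}/workflow_steps/{step_uuid}."""
    if action_type == "reset_for_retry":
        return {"action_type": action_type, "reset_by": operator, "reason": reason}
    if action_type == "resolve_manually":
        return {"action_type": action_type, "resolved_by": operator, "reason": reason}
    if action_type == "complete_manually":
        return {
            "action_type": action_type,
            "completion_data": completion_data or {},
            "completed_by": operator,
            "reason": reason,
        }
    # There is deliberately no "requeue" action — fix steps, not tasks
    raise ValueError(f"unknown action_type: {action_type}")
```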
Complete Resolution Workflow
Scenario: Task Stuck in waiting_for_dependencies
1. Operator receives DLQ alert
GET /v1/dlq/investigation-queue
# Response shows task_uuid: abc-123 with high priority
2. Operator reviews task snapshot
GET /v1/dlq/task/abc-123
# Response:
{
"dlq_entry_uuid": "xyz-789",
"task_uuid": "abc-123",
"original_state": "waiting_for_dependencies",
"dlq_reason": "staleness_timeout",
"task_snapshot": {
"task_uuid": "abc-123",
"namespace": "order_processing",
"task_name": "fulfill_order",
"current_state": "error",
"time_in_state_minutes": 65,
"threshold_minutes": 60
}
}
3. Operator checks task steps
GET /v1/tasks/abc-123/workflow_steps
# Response shows:
# step_1: complete
# step_2: error (blocked, max_attempts exceeded)
# step_3: waiting_for_dependencies (blocked by step_2)
4. Operator investigates step_2 failure
GET /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
# Response shows last_failure_at and error details
# Root cause: database connection timeout
5. Operator fixes infrastructure issue
# Fix database connection pool configuration
# Verify database connectivity
6. Operator chooses resolution strategy
Option A: Reset for retry (infrastructure fixed, retry should work):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "reset_for_retry",
"reset_by": "operator@example.com",
"reason": "Database connection pool fixed, resetting attempts for automatic retry"
}
Option B: Resolve manually (bypass step entirely):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "resolve_manually",
"resolved_by": "operator@example.com",
"reason": "Non-critical validation step, bypassing"
}
Option C: Complete manually (provide results for dependent steps):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "complete_manually",
"completion_data": {
"result": {
"validation_status": "passed",
"score": 100
},
"metadata": {
"manually_verified": true
}
},
"reason": "Manual validation completed",
"completed_by": "operator@example.com"
}
7. Task state machine automatically progresses
Outcome depends on action type chosen:
If Option A (reset_for_retry):
- Step 2 → pending (attempts reset to 0)
- Automatic retry begins when dependencies satisfied
- Step 2 re-enqueued to worker
- If successful, workflow continues normally
If Option B (resolve_manually):
- Step 2 → resolved_manually (terminal state)
- Step 3 dependencies satisfied (manual resolution counts as success)
- Task transitions: error → enqueuing_steps
- Step 3 enqueued to worker
- Task resumes normal execution
If Option C (complete_manually):
- Step 2 → complete (with operator-provided results)
- Step 3 can consume results from completion_data
- Task transitions: error → enqueuing_steps
- Step 3 enqueued to worker with access to step 2 results
- Task resumes normal execution
8. Operator updates DLQ investigation
PATCH /v1/dlq/entry/xyz-789
{
"resolution_status": "manually_resolved",
"resolution_notes": "Fixed database connection pool configuration. Manually resolved step_2 to unblock workflow. Task resumed execution.",
"resolved_by": "operator@example.com",
"metadata": {
"root_cause": "database_connection_timeout",
"fixed_step_uuid": "{step_2_uuid}",
"infrastructure_fix": "increased_connection_pool_size"
}
}
Step Retry and Attempt Lifecycles
Step State Machine
States:
- pending - Initial state, awaiting dependencies
- enqueued - Sent to worker queue
- in_progress - Worker actively processing
- enqueued_for_orchestration - Result submitted, awaiting orchestration
- complete - Successfully finished
- error - Failed (may be retryable)
- cancelled - Manually cancelled
- resolved_manually - Operator intervention
Retry Logic
Configured per step in template:
retry:
retryable: true
max_attempts: 3
backoff: exponential
backoff_base_ms: 1000
max_backoff_ms: 30000
Retry Eligibility Criteria:
- retryable: true in configuration
- attempts < max_attempts
- Current state is error
- next_retry_at timestamp has passed (backoff elapsed)
Backoff Calculation:
backoff_ms = min(backoff_base_ms * (2 ^ (attempts - 1)), max_backoff_ms)
Example (base=1000ms, max=30000ms):
- Attempt 1 fails → wait 1s
- Attempt 2 fails → wait 2s
- Attempt 3 fails → wait 4s
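The backoff formula and the four eligibility criteria translate directly into code. A sketch (timestamps modeled as plain numbers for brevity):

```python
def backoff_ms(attempts: int, base_ms: int = 1000, max_ms: int = 30000) -> int:
    """Exponential backoff after a failed attempt: base * 2^(attempts - 1), capped."""
    return min(base_ms * (2 ** (attempts - 1)), max_ms)

def retry_eligible(retryable: bool, attempts: int, max_attempts: int,
                   state: str, now: float, next_retry_at: float) -> bool:
    """All four documented criteria must hold for a step to be retry-eligible."""
    return (retryable
            and attempts < max_attempts
            and state == "error"
            and now >= next_retry_at)
```

With base=1000ms and max=30000ms this reproduces the example sequence (1s, 2s, 4s) and caps at 30s from the 6th failure onward.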
SQL Function: get_step_readiness_status() calculates retry_eligible and next_retry_at
Attempt Tracking
Fields (on workflow_steps table):
- attempts - Current attempt count
- max_attempts - Configuration limit
- last_attempted_at - Timestamp of last execution
- last_failure_at - Timestamp of last failure
Workflow:
- Step enqueued → attempts++
- Step fails → Record last_failure_at, calculate next_retry_at
- Backoff elapses → Step becomes retry_eligible: true
- Orchestration discovers ready steps → Step re-enqueued
- Repeat until success or attempts >= max_attempts
Max Attempts Exceeded:
- Step remains in error state
- retry_eligible: false
- Task transitions to error state
- May trigger DLQ entry with reason max_retries_exceeded
Independence from DLQ
Key Point: Step retry logic is INDEPENDENT of DLQ
- Steps retry automatically based on configuration
- DLQ does NOT trigger retries
- DLQ does NOT modify retry counters
- DLQ is pure observation and investigation
Why This Matters:
- Retry logic is predictable and configuration-driven
- DLQ doesn’t interfere with normal workflow execution
- Operators can manually resolve to bypass retry limits
- DLQ provides visibility into retry exhaustion patterns
Staleness Detection
Background Service
Component: tasker-orchestration/src/orchestration/staleness_detector.rs
Configuration:
[staleness_detection]
enabled = true
batch_size = 100
detection_interval_seconds = 300 # 5 minutes
Operation:
- Timer triggers every 5 minutes
- Calls the detect_and_transition_stale_tasks() SQL function
- Function identifies tasks exceeding thresholds
- Creates DLQ entries for stale tasks
- Transitions tasks to error state
- Records OpenTelemetry metrics
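The shape of one detection cycle can be sketched as follows. The real logic lives in SQL (see the function breakdown below); this is only an illustration of the per-task flow, with the three operations injected as callables:

```python
def run_detection_cycle(get_stale_tasks, create_dlq_entry, transition_to_error):
    """One staleness-detection cycle (sketch — the real work happens in SQL).

    get_stale_tasks: returns tasks exceeding their thresholds
    create_dlq_entry: records an investigation entry for a task
    transition_to_error: moves the task's state machine to error
    """
    results = []
    for task in get_stale_tasks():
        create_dlq_entry(task)            # audit/investigation record first
        ok = transition_to_error(task)    # then the state transition
        results.append({
            "task_uuid": task["task_uuid"],
            "moved_to_dlq": True,
            "transition_success": ok,
        })
    return results  # mirrors the StalenessResult records described below
```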
Staleness Thresholds
Per-State Defaults (configurable):
- waiting_for_dependencies: 60 minutes
- waiting_for_retry: 30 minutes
- steps_in_process: 30 minutes
Per-Template Override:
lifecycle:
max_waiting_for_dependencies_minutes: 120
max_waiting_for_retry_minutes: 45
max_steps_in_process_minutes: 60
Precedence: Template config > Global defaults
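The precedence rule is a straightforward lookup with fallback. A sketch using the documented defaults and the lifecycle key naming from the YAML above:

```python
# Global per-state defaults from the doc
GLOBAL_DEFAULTS = {
    "waiting_for_dependencies": 60,
    "waiting_for_retry": 30,
    "steps_in_process": 30,
}

def effective_threshold_minutes(state: str, template_lifecycle: dict) -> int:
    """Template lifecycle override wins; otherwise fall back to the global default."""
    override_key = f"max_{state}_minutes"  # e.g. max_waiting_for_retry_minutes
    return template_lifecycle.get(override_key, GLOBAL_DEFAULTS[state])
```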
Staleness SQL Function
Function: detect_and_transition_stale_tasks()
Architecture:
v_task_state_analysis (base view)
│
├── get_stale_tasks_for_dlq() (discovery function)
│ │
│ └── detect_and_transition_stale_tasks() (main orchestration)
│ ├── create_dlq_entry() (DLQ creation)
│ └── transition_stale_task_to_error() (state transition)
Performance Optimization:
- Expensive joins happen ONCE in base view
- Discovery function filters stale tasks
- Main function processes results in loop
- LEFT JOIN anti-join pattern for excluding tasks with pending DLQ entries
Output: Returns StalenessResult records with:
- Task identification (UUID, namespace, name)
- State and timing information
- action_taken - What happened (enum: TransitionedToDlqAndError, MovedToDlqOnly, etc.)
- moved_to_dlq - Boolean
- transition_success - Boolean
OpenTelemetry Metrics
Metrics Exported
Counters:
- tasker.dlq.entries_created.total - DLQ entries created
- tasker.staleness.tasks_detected.total - Stale tasks detected
- tasker.staleness.tasks_transitioned_to_error.total - Tasks moved to Error
- tasker.staleness.detection_runs.total - Detection cycles
Histograms:
- tasker.staleness.detection.duration - Detection execution time (ms)
- tasker.dlq.time_in_queue - Time in DLQ before resolution (hours)
Gauges:
- tasker.dlq.pending_investigations - Current pending DLQ count
Alert Examples
Prometheus Alerting Rules:
# Alert on high pending investigations
- alert: HighPendingDLQInvestigations
expr: tasker_dlq_pending_investigations > 50
for: 15m
labels:
severity: warning
annotations:
summary: "High number of pending DLQ investigations ({{ $value }})"
# Alert on slow detection cycles
- alert: SlowStalenessDetection
expr: tasker_staleness_detection_duration > 5000
for: 5m
labels:
severity: warning
annotations:
summary: "Staleness detection taking >5s ({{ $value }}ms)"
# Alert on high stale task rate
- alert: HighStalenessRate
expr: rate(tasker_staleness_tasks_detected_total[5m]) > 10
for: 10m
labels:
severity: critical
annotations:
summary: "High rate of stale task detection ({{ $value }}/sec)"
CLI Usage Examples
The tasker-ctl tool provides commands for managing workflow steps directly from the command line.
List Workflow Steps
# List all steps for a task
tasker-ctl task steps <TASK_UUID>
# Example output:
# ✓ Found 3 workflow steps:
#
# Step: validate_input (01933d7c-...)
# State: complete
# Dependencies satisfied: true
# Ready for execution: false
# Attempts: 1/3
#
# Step: process_order (01933d7c-...)
# State: error
# Dependencies satisfied: true
# Ready for execution: false
# Attempts: 3/3
# ⚠ Retry eligible
Get Step Details
# Get detailed information about a specific step
tasker-ctl task step <TASK_UUID> <STEP_UUID>
# Example output:
# ✓ Step Details:
#
# UUID: 01933d7c-...
# Name: process_order
# State: error
# Dependencies satisfied: true
# Ready for execution: false
# Retry eligible: false
# Attempts: 3/3
# Last failure: 2025-11-02T14:23:45Z
Reset Step for Retry
When infrastructure is fixed and you want to reset attempt counter:
tasker-ctl task reset-step <TASK_UUID> <STEP_UUID> \
--reason "Database connection pool increased" \
--reset-by "ops-team@example.com"
# Example output:
# ✓ Step reset successfully!
# New state: pending
# Reason: Database connection pool increased
# Reset by: ops-team@example.com
Resolve Step Manually
When you want to bypass a non-critical step:
tasker-ctl task resolve-step <TASK_UUID> <STEP_UUID> \
--reason "Non-critical validation, bypassing" \
--resolved-by "ops-team@example.com"
# Example output:
# ✓ Step resolved manually!
# New state: resolved_manually
# Reason: Non-critical validation, bypassing
# Resolved by: ops-team@example.com
Complete Step Manually with Results
When you’ve manually performed the step’s work and need to provide results:
tasker-ctl task complete-step <TASK_UUID> <STEP_UUID> \
--result '{"validated": true, "score": 95}' \
--metadata '{"verification_method": "manual_review"}' \
--reason "Manual verification after infrastructure fix" \
--completed-by "ops-team@example.com"
# Example output:
# ✓ Step completed manually with results!
# New state: complete
# Reason: Manual verification after infrastructure fix
# Completed by: ops-team@example.com
JSON Formatting Tips:
# Use single quotes around JSON to avoid shell escaping issues
--result '{"key": "value"}'
# For complex JSON, use a heredoc or file
--result "$(cat <<'EOF'
{
"validation_status": "passed",
"checks": ["auth", "permissions", "rate_limit"],
"score": 100
}
EOF
)"
# Or read from a file
--result "$(cat result.json)"
Operational Runbooks
Runbook 1: Investigating High DLQ Count
Trigger: tasker_dlq_pending_investigations > 50
Steps:
- Check DLQ dashboard:
curl /v1/dlq/stats | jq
- Identify dominant reason:
{
"dlq_reason": "staleness_timeout",
"total_entries": 45,
"pending": 45
}
- Get investigation queue:
curl /v1/dlq/investigation-queue?limit=10 | jq
- Check staleness monitoring:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "stale")'
- Identify patterns:
- Common namespace?
- Common task template?
- Common time period?
- Take action:
- Infrastructure issue? → Fix and manually resolve affected tasks
- Template misconfiguration? → Update template thresholds
- Worker unavailable? → Scale worker capacity
- Systemic dependency issue? → Investigate upstream systems
Runbook 2: Proactive Staleness Prevention
Trigger: Regular monitoring (not incident-driven)
Steps:
- Monitor warning threshold:
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'
- Alert when warning count exceeds baseline:
if [ $warning_count -gt 10 ]; then
alert "High staleness warning count: $warning_count tasks at 80%+ threshold"
fi
- Investigate early:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "warning") | {
task_uuid,
current_state,
time_in_state_minutes,
staleness_threshold_minutes,
threshold_percentage: ((.time_in_state_minutes / .staleness_threshold_minutes) * 100)
}'
- Intervene before DLQ:
- Check task steps for blockages
- Review dependencies
- Manually resolve if appropriate
Best Practices
For Operators
✅ DO:
- Use staleness monitoring for proactive prevention
- Document investigation findings in DLQ resolution notes
- Fix root causes, not just symptoms
- Update DLQ investigation status promptly
- Use step endpoints to resolve stuck workflows
- Monitor DLQ statistics for systemic patterns
❌ DON’T:
- Don’t try to “requeue” from DLQ - fix the steps instead
- Don’t ignore warning health status - investigate early
- Don’t manually resolve steps without fixing root cause
- Don’t leave DLQ investigations in pending status indefinitely
For Developers
✅ DO:
- Configure appropriate staleness thresholds per template
- Make steps retryable with sensible backoff
- Implement idempotent step handlers
- Add defensive timeouts to prevent hanging
- Test workflows under failure scenarios
❌ DON’T:
- Don’t set thresholds too low (causes false positives)
- Don’t set thresholds too high (delays detection)
- Don’t make all steps non-retryable
- Don’t ignore DLQ patterns - they indicate design issues
- Don’t rely on DLQ for normal workflow control flow
Testing
Test Coverage
Unit Tests: SQL function testing (17 tests)
- Staleness detection logic
- DLQ entry creation
- Threshold calculation with template overrides
- View query correctness
Integration Tests: Lifecycle testing (4 tests)
- Waiting for dependencies staleness (test_dlq_lifecycle_waiting_for_dependencies_staleness)
- Steps in process staleness (test_dlq_lifecycle_steps_in_process_staleness)
- Proactive monitoring with health status progression (test_dlq_lifecycle_proactive_monitoring)
- Complete investigation workflow (test_dlq_investigation_workflow)
Metrics Tests: OpenTelemetry integration (1 test)
- Staleness detection metrics recording
- DLQ investigation metrics recording
- Pending investigations gauge query
Test Template: tests/fixtures/task_templates/rust/dlq_staleness_test.yaml
- 2-step linear workflow
- 2-minute staleness thresholds for fast test execution
- Test-only template for lifecycle validation
Performance: All 22 tests complete in 0.95s (< 1s target)
Implementation Notes
File Locations:
- Staleness detector: tasker-orchestration/src/orchestration/staleness_detector.rs
- DLQ models: tasker-shared/src/models/orchestration/dlq.rs
- SQL functions: migrations/20251122000004_add_dlq_discovery_function.sql
- Database views: migrations/20251122000003_add_dlq_views.sql
Key Design Decisions:
- Investigation tracking only - no task manipulation
- Step-level resolution via existing step endpoints
- Proactive monitoring at 80% threshold
- Template-specific threshold overrides
- Atomic DLQ entry creation with unique constraint
- Time-ordered UUID v7 for investigation tracking
Future Enhancements
Potential improvements (not currently planned):
- DLQ Patterns Analysis
  - Machine learning to identify systemic issues
  - Automated root cause suggestions
  - Pattern clustering by namespace/template
- Advanced Alerting
  - Anomaly detection on staleness rates
  - Predictive DLQ entry forecasting
  - Correlation with infrastructure metrics
- Investigation Workflow
  - Automated triage rules
  - Escalation policies
  - Integration with incident management systems
- Performance Optimization
  - Materialized views for dashboard
  - Query result caching
  - Incremental staleness detection
End of Documentation
Handler Resolution Guide
Last Updated: 2026-01-08 Audience: Developers, Architects Status: Active Related Docs: Worker Event Systems | API Convergence Matrix
← Back to Guides
Overview
Handler resolution is the process of converting a callable address (a string in your YAML template) into an executable handler instance that can process workflow steps. The resolver chain pattern provides a flexible, extensible approach that works consistently across all language workers.
This guide covers:
- The mental model for handler resolution
- The common path for task templates
- Built-in resolvers and how they work
- Method dispatch for multi-method handlers
- Writing custom resolvers
- Cross-language considerations
Mental Model
Handler resolution uses three key concepts:
handler:
callable: "PaymentProcessor" # 1. Address: WHERE to find the handler
method: "refund" # 2. Entry Point: WHICH method to invoke
resolver: "explicit_mapping" # 3. Resolution Hint: HOW to resolve
1. Address (callable)
The callable field is a logical address that identifies the handler. Think of it like a URL — it points to where the handler lives, but the format depends on your resolution strategy:
| Format | Example | Resolver |
|---|---|---|
| Short name | payment_processor | ExplicitMappingResolver |
| Class path (Ruby) | PaymentHandlers::ProcessPaymentHandler | ClassConstantResolver |
| Module path (Python) | payment_handlers.ProcessPaymentHandler | ClassLookupResolver |
| Namespace path (TS) | PaymentHandlers.ProcessPaymentHandler | ClassLookupResolver |
Convention vs. requirement: The ExplicitMappingResolver treats the callable as an opaque key — any string works as long as it exactly matches the name the handler was registered with. The format conventions above are recommendations, not constraints. Short names like validate_cart work perfectly well because DSL handlers register via explicit mapping, and exact-match always wins (priority 10 vs. 100). Qualified names like Ecommerce::StepHandlers::ValidateCartHandler are conventional in Ruby and TypeScript because they also enable the class lookup fallback resolver, but a short name registered via @step_handler("validate_cart") in Python resolves just as reliably.
2. Entry Point (method)
The method field specifies which method to invoke on the handler. This enables multi-method handlers - a single handler class that exposes multiple entry points:
# Default: calls the `call` method
handler:
callable: payment_processor
# Explicit method: calls the `refund` method instead
handler:
callable: payment_processor
method: refund
When to use method dispatch:
- Payment handlers with charge, refund, void methods
- Validation handlers with validate_input, validate_output methods
- CRUD handlers with create, read, update, delete methods
3. Resolution Hint (resolver)
The resolver field is an optional optimization that bypasses the resolver chain and goes directly to a specific resolver:
# Let the chain figure it out (default)
handler:
callable: payment_processor
# Skip directly to explicit mapping (faster, explicit)
handler:
callable: payment_processor
resolver: explicit_mapping
When to use resolver hints:
- Performance optimization for high-throughput steps
- Explicit documentation of resolution strategy
- Avoiding ambiguity when multiple resolvers could match
The Common Path
For most templates, you don’t need to think about resolution at all. The default resolution flow handles common cases automatically:
# Most common pattern - just specify the callable
steps:
- name: process_payment
handler:
callable: process_payment # Resolved by ExplicitMappingResolver
initialization:
timeout_ms: 5000
What happens under the hood:
- Worker receives step execution event
- HandlerDispatchService extracts the HandlerDefinition
- ResolverChain iterates through resolvers by priority
- ExplicitMappingResolver (priority 10) finds the registered handler
- Handler is invoked with the call() method (default)
Resolver Chain Architecture
The resolver chain is an ordered list of resolvers, each with a priority. Lower priority numbers are checked first:
┌─────────────────────────────────────────────────────────────────┐
│ ResolverChain │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ ExplicitMapping │ │ ClassConstant │ │
│ │ Priority: 10 │──│ Priority: 100 │──► ... │
│ │ │ │ │ │
│ │ "process_payment" ──►│ │ "Handlers::Payment"──► │
│ │ Handler instance │ │ constantize() │ │
│ └──────────────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Resolution Flow
HandlerDefinition
│
▼
┌──────────────────┐
│ Has resolver │──Yes──► Go directly to named resolver
│ hint? │
└────────┬─────────┘
│ No
▼
┌──────────────────┐
│ ExplicitMapping │──can_resolve?──Yes──► Return handler
│ (priority 10) │
└────────┬─────────┘
│ No
▼
┌──────────────────┐
│ ClassConstant │──can_resolve?──Yes──► Return handler
│ (priority 100) │
└────────┬─────────┘
│ No
▼
ResolutionError
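The chain logic in the diagrams above can be sketched in a few lines. This is a simplified Python illustration of the pattern (the real implementations differ per worker language); only the ExplicitMappingResolver is shown:

```python
class ResolutionError(Exception):
    """Raised when no resolver in the chain can resolve the callable."""

class ExplicitMappingResolver:
    """Priority 10 — exact-match lookup against registered handler keys."""
    priority = 10

    def __init__(self):
        self.registry = {}

    def register(self, key, handler):
        self.registry[key] = handler

    def resolve(self, callable_name):
        return self.registry.get(callable_name)  # None -> fall through to next resolver

class ResolverChain:
    """Ordered list of resolvers; lower priority numbers are checked first."""
    def __init__(self, resolvers):
        self.resolvers = sorted(resolvers, key=lambda r: r.priority)

    def resolve(self, callable_name):
        for resolver in self.resolvers:
            handler = resolver.resolve(callable_name)
            if handler is not None:
                return handler
        raise ResolutionError(f"no resolver matched {callable_name!r}")
```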
Built-in Resolvers
ExplicitMappingResolver (Priority 10)
The primary resolver for all workers. Handlers are registered with string keys at startup.
DSL Auto-Registration
DSL handlers automatically register with the ExplicitMappingResolver when the handler module is loaded — no manual registration needed:
@step_handler("ecommerce_validate_cart") # Registers as "ecommerce_validate_cart"
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context):
return svc.validate_cart_items(...)
ValidateCartHandler = step_handler(
'Ecommerce::StepHandlers::ValidateCartHandler', # Registers with this name
inputs: Types::Ecommerce::OrderInput
) do |inputs:, context:|
Ecommerce::Service.validate_cart_items(...)
end
export const ValidateCartHandler = defineHandler(
'Ecommerce.StepHandlers.ValidateCartHandler', // Registers with this name
{ inputs: { cartItems: 'cart_items' } },
async ({ cartItems }) => svc.validateCartItems(cartItems),
);
Manual Registration
For Rust (required — no DSL) or class-based handlers that need explicit registration:
// Rust registration (required — no runtime reflection)
registry.register("process_payment", Arc::new(ProcessPaymentHandler::new()));
# Ruby manual registration (optional — auto-resolver also finds derived classes)
registry.register("process_payment", ProcessPaymentHandler)
When it resolves: When the callable exactly matches a registered key.
Best for:
- Native Rust handlers (required - no runtime reflection)
- Performance-critical handlers
- Explicit, predictable resolution
Class Lookup Resolvers (Priority 100)
Dynamic language only (Ruby, Python, TypeScript). Interprets the callable as a class path and instantiates it at runtime.
Naming Note: Ruby uses ClassConstantResolver (Ruby terminology for classes). Python and TypeScript use ClassLookupResolver. The functionality is equivalent.
# Ruby: Uses Object.const_get (ClassConstantResolver)
handler:
callable: PaymentHandlers::ProcessPaymentHandler
# Python: Uses importlib (ClassLookupResolver)
handler:
callable: payment_handlers.ProcessPaymentHandler
# TypeScript: Uses dynamic import (ClassLookupResolver)
handler:
callable: PaymentHandlers.ProcessPaymentHandler
When it resolves: When the callable looks like a class/module path (contains ::, ., or starts with uppercase).
Best for:
- Convention-over-configuration setups
- Handlers that don’t need explicit registration
- Dynamic handler loading
Not available in Rust: Rust has no runtime reflection, so class lookup resolvers always return None. Use ExplicitMappingResolver instead.
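The "looks like a class path" check described above is a simple string heuristic. A sketch of the stated rule (contains ::, contains ., or starts with an uppercase letter):

```python
def looks_like_class_path(callable_name: str) -> bool:
    """Heuristic from the doc: a callable that resembles a class/module path."""
    return ("::" in callable_name          # Ruby-style namespace
            or "." in callable_name        # Python module / TS namespace path
            or callable_name[:1].isupper())  # bare class name
```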
Method Dispatch
Method dispatch allows a single handler to expose multiple entry points. This is useful for handlers that perform related operations:
Defining a Multi-Method Handler
# Ruby
class PaymentHandler < TaskerCore::StepHandler::Base
def call(context)
# Default method - standard payment processing
end
def refund(context)
# Refund-specific logic
end
def void(context)
# Void-specific logic
end
end
# Python
class PaymentHandler(StepHandler):
def call(self, context: StepContext) -> StepHandlerResult:
# Default method
pass
def refund(self, context: StepContext) -> StepHandlerResult:
# Refund-specific logic
pass
// TypeScript
class PaymentHandler extends StepHandler {
async call(context: StepContext): Promise<StepHandlerResult> {
// Default method
}
async refund(context: StepContext): Promise<StepHandlerResult> {
// Refund-specific logic
}
}
// Rust - requires explicit method routing
impl StepHandler for PaymentHandler {
    async fn call(&self, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        // Default method
    }

    async fn invoke_method(&self, method: &str, step: &TaskSequenceStep) -> Result<StepExecutionResult> {
        match method {
            "refund" => self.refund(step).await,
            "void" => self.void(step).await,
            _ => self.call(step).await,
        }
    }
}
Using Method Dispatch in Templates
steps:
- name: process_refund
handler:
callable: payment_handler
method: refund # Invokes refund() instead of call()
initialization:
reason_required: true
How Method Dispatch Works
1. The resolver chain resolves the handler from callable
2. If method is specified and is not "call", a MethodDispatchWrapper is applied
3. When invoked, the wrapper calls the specified method instead of call()
┌───────────────────┐
HandlerDefinition ──│ ResolverChain │── Handler
(method: "refund") │ │
└─────────┬─────────┘
│
▼
┌───────────────────┐
│ MethodDispatch │
│ Wrapper │
│ │
│ inner.refund() │
└───────────────────┘
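In the dynamic languages, the wrapper reduces to a small delegation object. Here is a hypothetical Python sketch (the class and method names mirror the diagram, not the actual tasker-py internals):

```python
class MethodDispatchWrapper:
    """Delegate call() to a named method on the wrapped handler."""
    def __init__(self, inner, method: str):
        self.inner = inner
        self.method = method

    def call(self, context):
        # The getattr dispatch shown in the per-language table below
        return getattr(self.inner, self.method)(context)

class PaymentHandler:
    def call(self, context):
        return "charged"
    def refund(self, context):
        return "refunded"

wrapped = MethodDispatchWrapper(PaymentHandler(), "refund")
print(wrapped.call({}))  # refunded
```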
Writing Custom Resolvers
You can extend the resolver chain with custom resolution strategies for your domain.
Rust Custom Resolver
use tasker_shared::registry::{StepHandlerResolver, ResolutionContext, ResolvedHandler};
use async_trait::async_trait;

#[derive(Debug)]
pub struct ServiceDiscoveryResolver {
    service_registry: Arc<ServiceRegistry>,
}

#[async_trait]
impl StepHandlerResolver for ServiceDiscoveryResolver {
    fn resolver_name(&self) -> &str {
        "service_discovery"
    }

    fn priority(&self) -> u32 {
        50 // Between explicit (10) and class constant (100)
    }

    fn can_resolve(&self, definition: &HandlerDefinition) -> bool {
        // Resolve callables that start with "service://"
        definition.callable.starts_with("service://")
    }

    async fn resolve(
        &self,
        definition: &HandlerDefinition,
        context: &ResolutionContext,
    ) -> Result<Arc<dyn ResolvedHandler>, ResolutionError> {
        let service_name = definition.callable.strip_prefix("service://").unwrap();
        let handler = self.service_registry.lookup(service_name).await?;
        Ok(Arc::new(StepHandlerAsResolved::new(handler)))
    }
}
Ruby Custom Resolver
module TaskerCore
module Registry
class ServiceDiscoveryResolver < BaseResolver
def resolver_name
"service_discovery"
end
def priority
50
end
def can_resolve?(definition)
definition.callable.start_with?("service://")
end
def resolve(definition, context)
service_name = definition.callable.delete_prefix("service://")
handler_class = ServiceRegistry.lookup(service_name)
handler_class.new
end
end
end
end
Python Custom Resolver
from tasker_core.registry import BaseResolver, ResolutionError
class ServiceDiscoveryResolver(BaseResolver):
def resolver_name(self) -> str:
return "service_discovery"
def priority(self) -> int:
return 50
def can_resolve(self, definition: HandlerDefinition) -> bool:
return definition.callable.startswith("service://")
async def resolve(
self, definition: HandlerDefinition, context: ResolutionContext
) -> ResolvedHandler:
service_name = definition.callable.removeprefix("service://")
handler_class = self.service_registry.lookup(service_name)
return handler_class()
TypeScript Custom Resolver
import { BaseResolver, HandlerDefinition, ResolutionContext } from './registry';
export class ServiceDiscoveryResolver extends BaseResolver {
resolverName(): string {
return 'service_discovery';
}
priority(): number {
return 50;
}
canResolve(definition: HandlerDefinition): boolean {
return definition.callable.startsWith('service://');
}
async resolve(
definition: HandlerDefinition,
context: ResolutionContext
): Promise<ResolvedHandler> {
const serviceName = definition.callable.replace('service://', '');
const HandlerClass = await this.serviceRegistry.lookup(serviceName);
return new HandlerClass();
}
}
Registering Custom Resolvers
// Rust
let mut chain = ResolverChain::new();
chain.register(Arc::new(ExplicitMappingResolver::new()));
chain.register(Arc::new(ServiceDiscoveryResolver::new(service_registry)));
chain.register(Arc::new(ClassConstantResolver::new()));
# Ruby
chain = TaskerCore::Registry::ResolverChain.new
chain.register(TaskerCore::Registry::ExplicitMappingResolver.new)
chain.register(ServiceDiscoveryResolver.new(service_registry))
chain.register(TaskerCore::Registry::ClassConstantResolver.new)
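To make the chain's behavior concrete, here is a minimal Python sketch of priority-ordered resolution. It illustrates the pattern only; the resolver classes are hypothetical and the real registry APIs differ per language:

```python
class ResolverChain:
    """Try resolvers in priority order (lower number = tried first)."""
    def __init__(self):
        self.resolvers = []

    def register(self, resolver):
        self.resolvers.append(resolver)
        self.resolvers.sort(key=lambda r: r.priority())

    def resolve(self, callable_name: str):
        for resolver in self.resolvers:
            if resolver.can_resolve(callable_name):
                return resolver.resolve(callable_name)
        raise LookupError(f"No resolver could resolve callable '{callable_name}'")

class ExplicitMapping:
    """Stand-in for ExplicitMappingResolver: exact-match lookup table."""
    def __init__(self, table):
        self.table = table
    def priority(self):
        return 10
    def can_resolve(self, name):
        return name in self.table
    def resolve(self, name):
        return self.table[name]

class DynamicFallback:
    """Stand-in for a class-lookup resolver at priority 100."""
    def priority(self):
        return 100
    def can_resolve(self, name):
        return True
    def resolve(self, name):
        return f"dynamically-loaded:{name}"

chain = ResolverChain()
chain.register(DynamicFallback())
chain.register(ExplicitMapping({"process_payment": "registered-handler"}))
print(chain.resolve("process_payment"))  # registered-handler (priority 10 wins)
print(chain.resolve("unknown_thing"))    # dynamically-loaded:unknown_thing
```

Registration order does not matter; priority determines which resolver is consulted first.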
Cross-Language Considerations
Why Rust is Different
Rust has no runtime reflection, which affects handler resolution:
| Capability | Ruby/Python/TypeScript | Rust |
|---|---|---|
| Class Lookup Resolver | ✅ Works | ❌ Always returns None |
| Method dispatch | ✅ Native (send, getattr) | ⚠️ Requires invoke_method |
| Dynamic handler loading | ✅ const_get, importlib | ❌ Must pre-register |
Best Practice for Rust:
- Always use ExplicitMappingResolver with explicit registration
- Implement invoke_method() for multi-method handlers
- Use resolver hints (resolver: explicit_mapping) for clarity
Method Dispatch by Language
| Language | Default Method | Dynamic Dispatch |
|---|---|---|
| Ruby | call | handler.public_send(method, context) |
| Python | call | getattr(handler, method)(context) |
| TypeScript | call | handler[method](context) |
| Rust | call | handler.invoke_method(method, step) |
Troubleshooting
“Handler not found” Error
Symptoms: ResolutionError: No resolver could resolve callable 'my_handler'
Causes:
- Handler not registered with ExplicitMappingResolver
- Class path typo (for ClassConstantResolver)
- Handler registered with different name than callable
Solutions:
// Verify registration
assert!(registry.is_registered("my_handler"));

// Check registered handlers
println!("{:?}", registry.list_handlers());
Method Not Found
Symptoms: MethodNotFound: Handler 'my_handler' does not respond to 'refund'
Causes:
- Method name typo in YAML template
- Method not defined on handler class
- Method is private (Ruby) or underscore-prefixed (Python)
Solutions:
# Verify method name matches exactly
handler:
callable: payment_handler
method: refund # Must match method name in handler
Resolver Hint Ignored
Symptoms: Resolution works but seems slow, or wrong resolver is used
Causes:
- Resolver hint name doesn’t match any registered resolver
- Resolver with that name returns None for this callable
Solutions:
# Use exact resolver name
handler:
callable: my_handler
resolver: explicit_mapping # Not "explicit" or "mapping"
Best Practices
1. Prefer Explicit Registration
# Good: Clear, predictable, works in all languages
handler:
callable: process_payment
# Avoid: Relies on runtime class lookup, not portable to Rust
handler:
callable: PaymentHandlers::ProcessPaymentHandler
2. Use Method Dispatch for Related Operations
# Good: Single handler, multiple entry points
steps:
- name: validate_input
handler:
callable: validator
method: validate_input
- name: validate_output
handler:
callable: validator
method: validate_output
# Avoid: Separate handlers for closely related operations
steps:
- name: validate_input
handler:
callable: input_validator
- name: validate_output
handler:
callable: output_validator
3. Document Resolution Strategy
# Good: Explicit about how resolution works
handler:
callable: payment_processor
resolver: explicit_mapping # Self-documenting
method: refund
initialization:
timeout_ms: 5000
4. Test Resolution in Isolation
#[test]
fn test_handler_resolution() {
    let chain = create_resolver_chain();
    let definition = HandlerDefinition::builder()
        .callable("process_payment")
        .build();

    assert!(chain.can_resolve(&definition));
}
Summary
| Concept | Purpose | Default |
|---|---|---|
| callable | Handler address | Required |
| method | Entry point method | "call" |
| resolver | Resolution strategy hint | Chain iteration |
| ExplicitMappingResolver | Registered handlers | Priority 10 |
| ClassConstantResolver / ClassLookupResolver | Dynamic class lookup | Priority 100 |
| MethodDispatchWrapper | Multi-method support | Applied when method != "call" |
The resolver chain provides a flexible, extensible system for handler resolution that works consistently across all language workers while respecting each language’s capabilities.
Task Identity Strategy Pattern
Last Updated: 2026-01-20 Audience: Developers, Operators Status: Active Related Docs: Documentation Hub | Idempotency and Atomicity
← Back to Documentation Hub
Overview
Task identity determines how Tasker deduplicates task creation requests. The identity strategy pattern allows named tasks to configure their deduplication behavior based on domain requirements.
When a task creation request arrives, Tasker computes an identity hash based on the configured strategy. If a task with that identity hash already exists, the request is rejected with a 409 Conflict response.
Why This Matters
Task identity is domain-specific:
| Use Case | Same Template + Same Context | Desired Behavior |
|---|---|---|
| Payment processing | Likely accidental duplicate | Deduplicate (safety) |
| Nightly batch job | Intentional repetition | Allow (operational) |
| Report generation | Could be either | Configurable |
| Event-driven triggers | Often intentional | Allow |
| Retry with same params | Intentional | Allow |
A TaskRequest with identical context might be:
- An accidental duplicate (network retry, user double-click) → should deduplicate
- An intentional repetition (scheduled job, legitimate re-run) → should allow
Identity Strategies
STRICT (Default)
identity_hash = hash(named_task_uuid, normalized_context)
Same named task + same context = same identity hash = deduplicated.
Use when:
- Accidental duplicates are a risk (payments, orders, notifications)
- Context fully describes the work to be done
- Network retries or user double-clicks should be safe
Example:
// Payment processing - same payment_id should never create duplicate tasks
TaskRequest {
    namespace: "payments".to_string(),
    name: "process_payment".to_string(),
    context: json!({
        "payment_id": "PAY-12345",
        "amount": 100.00,
        "currency": "USD"
    }),
    idempotency_key: None, // Uses STRICT strategy
    ..Default::default()
}
CALLER_PROVIDED
identity_hash = hash(named_task_uuid, idempotency_key)
Caller must provide idempotency_key. Request is rejected with 400 Bad Request if the key is missing.
Use when:
- Caller has a natural idempotency key (order_id, transaction_id, request_id)
- Caller needs control over deduplication scope
- Similar to Stripe’s Idempotency-Key pattern
Example:
// Order processing - caller controls idempotency with their order ID
TaskRequest {
    namespace: "orders".to_string(),
    name: "fulfill_order".to_string(),
    context: json!({
        "order_id": "ORD-98765",
        "items": [...]
    }),
    idempotency_key: Some("ORD-98765".to_string()), // Required for CallerProvided
    ..Default::default()
}
ALWAYS_UNIQUE
identity_hash = uuidv7()
Every request creates a new task. No deduplication.
Use when:
- Every submission should create work (notifications, events)
- Repetition is intentional (scheduled jobs, cron-like triggers)
- Context doesn’t define uniqueness
Example:
// Notification sending - every call should send a notification
TaskRequest {
    namespace: "notifications".to_string(),
    name: "send_email".to_string(),
    context: json!({
        "user_id": 123,
        "template": "welcome",
        "data": {...}
    }),
    idempotency_key: None, // ALWAYS_UNIQUE ignores this
    ..Default::default()
}
Configuration
Named Task Configuration
Set the identity strategy in your task template:
# templates/payments/process_payment.yaml
namespace: payments
name: process_payment
version: "1.0.0"
identity_strategy: strict # strict | caller_provided | always_unique
steps:
- name: validate_payment
handler: payment_validator
# ...
Per-Request Override
The idempotency_key field overrides any strategy:
// Even if named task is ALWAYS_UNIQUE, this key makes it deduplicate
TaskRequest {
    idempotency_key: Some("my-custom-key-12345".to_string()),
    // ... other fields
}
Precedence:
1. idempotency_key (if provided) → always uses the hash of the key
2. Named task's identity_strategy → applies if no key is provided
3. Default → STRICT (if no strategy is configured)
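The precedence rules amount to a short decision function. The sketch below is illustrative only: SHA-256 stands in for whatever hash the engine actually uses, and uuid4 stands in for uuidv7:

```python
import hashlib
import json
import uuid

def compute_identity(strategy, named_task_uuid, context, idempotency_key=None):
    """Sketch of identity precedence: explicit key > configured strategy > STRICT."""
    if idempotency_key is not None:
        # An explicit key always wins, regardless of the configured strategy
        return hashlib.sha256(f"{named_task_uuid}:{idempotency_key}".encode()).hexdigest()
    strategy = strategy or "strict"
    if strategy == "always_unique":
        return str(uuid.uuid4())  # engine uses uuidv7; uuid4 stands in here
    if strategy == "caller_provided":
        raise ValueError("idempotency_key is required for caller_provided")
    # STRICT: hash the named task UUID plus the normalized context
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{named_task_uuid}:{canonical}".encode()).hexdigest()

# STRICT dedupes on content; ALWAYS_UNIQUE never collides
assert compute_identity("strict", "nt-1", {"a": 1}) == compute_identity("strict", "nt-1", {"a": 1})
assert compute_identity("always_unique", "nt-1", {}) != compute_identity("always_unique", "nt-1", {})
```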
API Behavior
Successful Creation (201 Created)
{
"task_uuid": "019bddae-b818-7d82-b7c5-bd42e5db27fc",
"step_count": 4,
"message": "Task created successfully"
}
Duplicate Identity (409 Conflict)
When a task with the same identity hash exists:
{
"error": {
"code": "CONFLICT",
"message": "A task with this identity already exists. The task's identity strategy prevents duplicate creation."
}
}
Security Note: The API returns 409 Conflict rather than the existing task’s UUID. This prevents potential data leakage where attackers could probe for existing task UUIDs by submitting requests with guessed contexts.
Missing Idempotency Key (400 Bad Request)
When CallerProvided strategy requires a key:
{
"error": {
"code": "BAD_REQUEST",
"message": "idempotency_key is required when named task uses CallerProvided identity strategy"
}
}
JSON Normalization
For STRICT strategy, the context JSON is normalized before hashing:
- Key ordering: Keys are sorted alphabetically (recursively)
- Whitespace: Removed for consistency
- Semantic equivalence: {"b":2,"a":1} and {"a":1,"b":2} produce the same hash
This means these two requests produce the same identity hash:
// Request 1
context: json!({"user_id": 123, "action": "create"})

// Request 2 - same content, different key order
context: json!({"action": "create", "user_id": 123})
Note: Array order is preserved (arrays are ordered by definition).
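A minimal sketch of this normalization, using Python's json module (sort_keys sorts nested objects recursively); the engine's actual canonicalization may differ in details:

```python
import json

def canonicalize(context: dict) -> str:
    """Sorted keys (recursively) and no whitespace yield a stable string to hash."""
    return json.dumps(context, sort_keys=True, separators=(",", ":"))

a = canonicalize({"user_id": 123, "action": "create"})
b = canonicalize({"action": "create", "user_id": 123})
assert a == b
print(a)  # {"action":"create","user_id":123}
```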
Recommended Patterns
Pattern 1: Time-Bucketed Keys
For deduplication within a time window but allowing repetition across windows:
// Dedupe within same hour, allow across hours
let hour_bucket = chrono::Utc::now().format("%Y-%m-%d-%H");
let idempotency_key = format!("{}-{}-{}", job_name, customer_id, hour_bucket);

TaskRequest {
    namespace: "reports".to_string(),
    name: "generate_report".to_string(),
    context: json!({ "customer_id": 12345 }),
    idempotency_key: Some(idempotency_key),
    ..Default::default()
}
Pattern 2: Time-Aware Context
Include scheduling context directly in the request:
TaskRequest {
    namespace: "batch".to_string(),
    name: "daily_reconciliation".to_string(),
    context: json!({
        "account_id": "ACC-001",
        "run_date": "2026-01-20",  // Changes daily
        "run_window": "morning"    // Optional: finer granularity
    }),
    ..Default::default()
}
Granularity Guide
| Dedup Window | Key/Context Pattern | Use Case |
|---|---|---|
| Per-minute | {job}-{YYYY-MM-DD-HH-mm} | High-frequency event processing |
| Per-hour | {job}-{YYYY-MM-DD-HH} | Hourly reports, rate-limited APIs |
| Per-day | {job}-{YYYY-MM-DD} | Daily batch jobs, EOD processing |
| Per-week | {job}-{YYYY-Www} | Weekly aggregations |
| Per-month | {job}-{YYYY-MM} | Monthly billing cycles |
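The patterns in the table map directly to strftime formats. A hypothetical helper (the function and job names are illustrative):

```python
from datetime import datetime, timezone

def bucketed_key(job: str, customer_id: int, fmt: str) -> str:
    """Build a time-bucketed idempotency key; fmt chooses the dedup window."""
    bucket = datetime.now(timezone.utc).strftime(fmt)
    return f"{job}-{customer_id}-{bucket}"

# Per-hour window: same key within the hour, a new key the next hour
key = bucketed_key("generate_report", 12345, "%Y-%m-%d-%H")
print(key)  # e.g. generate_report-12345-2026-01-20-14
```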
Anti-Patterns
Don’t Rely on Timing
// BAD: Hoping requests are "far enough apart"
TaskRequest { context: json!({ "customer_id": 123 }) }
Don’t Use ALWAYS_UNIQUE for Critical Operations
// BAD: Creates duplicate work on network retries
// Named task with AlwaysUnique for payment processing
Do Make Identity Explicit
// GOOD: Clear what makes this task unique
TaskRequest {
    context: json!({
        "payment_id": "PAY-123", // Natural idempotency key
        "amount": 100
    }),
    ..Default::default()
}
Database Implementation
The identity strategy is enforced at the database level:
- A UNIQUE constraint on the identity_hash column prevents duplicates
- An identity_strategy column on named_tasks stores the configured strategy
- Atomic insertion with a constraint violation returns 409 Conflict
-- Identity hash has unique constraint
CREATE UNIQUE INDEX idx_tasks_identity_hash ON tasker.tasks(identity_hash);
-- Named tasks store their strategy
ALTER TABLE tasker.named_tasks
ADD COLUMN identity_strategy VARCHAR(20) DEFAULT 'strict';
Testing Considerations
When writing tests that create tasks, inject a unique identifier to avoid identity hash collisions:
// Test utility that ensures unique identity per test run
fn create_task_request(namespace: &str, name: &str, context: Value) -> TaskRequest {
    let mut ctx = context.as_object().cloned().unwrap_or_default();
    ctx.insert("_test_run_id".to_string(), json!(Uuid::now_v7().to_string()));

    TaskRequest {
        namespace: namespace.to_string(),
        name: name.to_string(),
        context: Value::Object(ctx),
        ..Default::default()
    }
}
Summary
| Strategy | Identity Hash | Deduplicates? | Key Required? |
|---|---|---|---|
| STRICT | hash(uuid, context) | Yes | No |
| CALLER_PROVIDED | hash(uuid, key) | Yes | Yes |
| ALWAYS_UNIQUE | uuidv7() | No | No |
Choose STRICT (default) unless you have a specific reason not to. It’s the safest option for preventing accidental duplicate task creation.
Quick Start Guide
Last Updated: 2025-10-10 Audience: Developers Status: Active Time to Complete: 5 minutes Related Docs: Documentation Hub | Use Cases | Crate Architecture
← Back to Documentation Hub
Get Tasker Core Running in 5 Minutes
This guide will get you from zero to running your first workflow in under 5 minutes using Docker Compose.
Prerequisites
Before starting, ensure you have:
- Docker and Docker Compose installed
- Git to clone the repository
- curl for testing (or any HTTP client)
That’s it! Docker Compose handles all the complexity.
Step 1: Clone and Start Services (2 minutes)
# Clone the repository
git clone https://github.com/tasker-systems/tasker-core
cd tasker-core
# Start PostgreSQL (includes PGMQ extension for default messaging)
docker-compose up -d postgres
# Wait for PostgreSQL to be ready (about 10 seconds)
docker-compose logs -f postgres
# Press Ctrl+C when you see "database system is ready to accept connections"
# Set the database URL and verify the connection
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
docker-compose exec postgres psql -U tasker -d tasker_rust_test -c "SELECT 1" # Verify connection
# Start orchestration server and workers
docker-compose --profile server up -d
# Verify all services are healthy
docker-compose ps
You should see:
NAME STATUS PORTS
tasker-postgres Up (healthy) 5432
tasker-orchestration Up (healthy) 0.0.0.0:8080->8080/tcp
tasker-worker Up (healthy) 0.0.0.0:8081->8081/tcp
tasker-ruby-worker Up (healthy) 0.0.0.0:8082->8082/tcp
Step 2: Verify Services (30 seconds)
Check that all services are responding:
# Check orchestration health
curl http://localhost:8080/health
# Expected response:
# {
# "status": "healthy",
# "database": "connected",
# "message_queue": "operational"
# }
# Check Rust worker health
curl http://localhost:8081/health
# Check Ruby worker health (if started)
curl http://localhost:8082/health
Step 3: Create Your First Task (1 minute)
Now let’s create a simple linear workflow with 4 steps:
# Create a task using the linear_workflow template
curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"template_name": "linear_workflow",
"namespace": "rust_e2e_linear",
"configuration": {
"test_value": "hello_world"
}
}'
Response:
{
"task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
"status": "pending",
"namespace": "rust_e2e_linear",
"created_at": "2025-10-10T12:00:00Z"
}
Save the task_uuid from the response! You’ll need it to check the task status.
Step 4: Monitor Task Execution (1 minute)
Watch your workflow execute in real-time:
# Replace {task_uuid} with your actual task UUID
TASK_UUID="01234567-89ab-cdef-0123-456789abcdef"
# Check task status
curl http://localhost:8080/v1/tasks/${TASK_UUID}
Initial Response (task just created):
{
"task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
"current_state": "initializing",
"total_steps": 4,
"completed_steps": 0,
"namespace": "rust_e2e_linear"
}
Wait a few seconds and check again:
# Check again after a few seconds
curl http://localhost:8080/v1/tasks/${TASK_UUID}
Final Response (task completed):
{
"task_uuid": "01234567-89ab-cdef-0123-456789abcdef",
"current_state": "complete",
"total_steps": 4,
"completed_steps": 4,
"namespace": "rust_e2e_linear",
"completed_at": "2025-10-10T12:00:05Z",
"duration_ms": 134
}
Congratulations! 🎉 You’ve just executed your first workflow with Tasker Core!
What Just Happened?
Let’s break down what happened in those ~100-150ms:
1. Orchestration received task creation request
↓
2. Task initialized with "linear_workflow" template
↓
3. 4 workflow steps created with dependencies:
- mathematical_add (no dependencies)
- mathematical_multiply (depends on add)
- mathematical_subtract (depends on multiply)
- mathematical_divide (depends on subtract)
↓
4. Orchestration discovered step 1 was ready
↓
5. Step 1 enqueued to "rust_e2e_linear" namespace queue
↓
6. Worker claimed and executed step 1
↓
7. Worker sent result back to orchestration
↓
8. Orchestration processed result, discovered step 2
↓
9. Steps 2, 3, 4 executed sequentially (due to dependencies)
↓
10. All steps complete → Task marked "complete"
Key Observations:
- Each step executed by autonomous workers
- Steps executed in dependency order automatically
- Complete workflow: ~130-150ms (including all coordination)
- All state changes recorded in audit trail
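The dependency ordering the orchestrator performed is essentially a topological sort over the step graph. A sketch with Python's stdlib graphlib, using the step names from the linear_workflow template above:

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on
deps = {
    "mathematical_add": set(),
    "mathematical_multiply": {"mathematical_add"},
    "mathematical_subtract": {"mathematical_multiply"},
    "mathematical_divide": {"mathematical_subtract"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)
# ['mathematical_add', 'mathematical_multiply', 'mathematical_subtract', 'mathematical_divide']
```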
View Detailed Task Information
Get complete task execution details:
# Get full task details including steps
curl http://localhost:8080/v1/tasks/${TASK_UUID}/details
Response includes:
{
"task": {
"task_uuid": "...",
"current_state": "complete",
"namespace": "rust_e2e_linear"
},
"steps": [
{
"name": "mathematical_add",
"current_state": "complete",
"result": {"value": 15},
"duration_ms": 12
},
{
"name": "mathematical_multiply",
"current_state": "complete",
"result": {"value": 30},
"duration_ms": 8
},
// ... remaining steps
],
"state_transitions": [
{
"from_state": null,
"to_state": "pending",
"timestamp": "2025-10-10T12:00:00.000Z"
},
{
"from_state": "pending",
"to_state": "initializing",
"timestamp": "2025-10-10T12:00:00.050Z"
},
// ... complete transition history
]
}
Try a More Complex Workflow
Now try the diamond workflow pattern (parallel execution):
curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"template_name": "diamond_workflow",
"namespace": "rust_e2e_diamond",
"configuration": {
"test_value": "parallel_test"
}
}'
Diamond pattern:
step_1 (root)
/ \
step_2 step_3 ← Execute in PARALLEL
\ /
step_4 (join)
Steps 2 and 3 execute simultaneously because they both depend only on step 1!
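The parallel waves can be computed the same way the orchestrator discovers ready steps: a step becomes ready when all of its dependencies are done. A graphlib sketch of the diamond:

```python
from graphlib import TopologicalSorter

deps = {
    "step_1": set(),
    "step_2": {"step_1"},
    "step_3": {"step_1"},
    "step_4": {"step_2", "step_3"},
}
ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # steps whose dependencies are all satisfied
    waves.append(ready)
    ts.done(*ready)
print(waves)  # [['step_1'], ['step_2', 'step_3'], ['step_4']]
```

The middle wave contains two steps, which is exactly what lets workers run them in parallel.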
View Logs
See what’s happening inside the services:
# Orchestration logs
docker-compose logs -f orchestration
# Worker logs
docker-compose logs -f worker
# All logs
docker-compose logs -f
Key log patterns to look for:
- Task initialized: task_uuid=... - task created
- Step enqueued: step_uuid=... - step sent to worker
- Step claimed: step_uuid=... - worker picked up step
- Step completed: step_uuid=... - step finished successfully
- Task finalized: task_uuid=... - workflow complete
Explore the API
List All Tasks
curl http://localhost:8080/v1/tasks
Get Task Execution Context
curl http://localhost:8080/v1/tasks/${TASK_UUID}/context
View Available Templates
curl http://localhost:8080/v1/templates
Check System Health
curl http://localhost:8080/health/detailed
Next Steps
1. Understand What You Just Built
Read about the architecture:
- Crate Architecture - How the workspace is organized
- Events and Commands - How orchestration and workers coordinate
- States and Lifecycles - Task and step state machines
2. See Real-World Examples
Explore practical use cases:
- Use Cases and Patterns - E-commerce, payments, ETL, microservices
- See example templates in:
tests/fixtures/task_templates/
3. Create Your Own Workflow
Option A: Rust Handler (Native Performance)
// workers/rust/src/handlers/my_handler.rs
use async_trait::async_trait;
use anyhow::Result;
use serde_json::json;
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;

pub struct MyCustomHandler;

#[async_trait]
impl RustStepHandler for MyCustomHandler {
    async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
        let input: String = step_data.get_input_or("input", "default".to_string());
        let result = process_data(&input).await?;

        Ok(StepExecutionResult::success(
            step_data.workflow_step.workflow_step_uuid,
            json!({ "output": result }),
            0,
            None,
        ))
    }

    fn name(&self) -> &'static str {
        "my_handler"
    }
}
Option B: Ruby Handler (via FFI)
class MyHandler < TaskerCore::StepHandler::Base
def call(context)
input = context.get_input('input')
result = process_data(input)
success(result: { output: result })
end
end
Option C: Ruby Handler (DSL)
extend TaskerCore::StepHandler::Functional
MyHandler = step_handler(
'MyHandler',
inputs: [:input]
) do |input:, context:|
result = process_data(input)
{ output: result }
end
Define Your Workflow Template
# tests/fixtures/task_templates/rust/my_workflow.yaml
namespace_name: my_namespace
name: my_workflow
version: "1.0.0"
steps:
- name: my_step
handler:
callable: my_handler
dependencies: []
retry:
retryable: true
max_attempts: 3
backoff: exponential
backoff_base_ms: 1000
4. Deploy to Production
Learn about deployment:
- Deployment Patterns - Hybrid, EventDriven, PollingOnly modes
- Observability - Metrics, logging, monitoring
- Benchmarks - Performance validation
5. Run Tests Locally
# Build the workspace
cargo build --all-features
# Run all tests
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo test --all-features
# Run benchmarks
cargo bench --all-features
Troubleshooting
Services Won’t Start
# Check Docker service status
docker-compose ps
# View service logs
docker-compose logs postgres
docker-compose logs orchestration
# Restart services
docker-compose restart
# Clean restart
docker-compose down
docker-compose up -d
Task Stays in “pending” or “initializing”
Possible causes:
- Template not found - check available templates: curl http://localhost:8080/v1/templates
- Worker not running - check worker status: curl http://localhost:8081/health
- Database connection issue - check logs: docker-compose logs postgres
Solution:
# Verify template exists
curl http://localhost:8080/v1/templates | jq '.[] | select(.name == "linear_workflow")'
# Restart workers
docker-compose restart worker
# Check orchestration logs for errors
docker-compose logs orchestration | grep ERROR
“Connection refused” Errors
Cause: Services not fully started yet
Solution: Wait 10-15 seconds after docker-compose up, then check health:
curl http://localhost:8080/health
PostgreSQL Connection Issues
# Verify PostgreSQL is running
docker-compose ps postgres
# Test connection
docker-compose exec postgres psql -U tasker -d tasker_rust_test -c "SELECT 1"
# View PostgreSQL logs
docker-compose logs postgres | tail -50
Cleanup
When you’re done exploring:
# Stop all services
docker-compose down
# Stop and remove volumes (cleans database)
docker-compose down -v
# Remove all Docker resources (complete cleanup)
docker-compose down -v
docker system prune -f
Summary
You’ve successfully:
- ✅ Started Tasker Core services with Docker Compose
- ✅ Created and executed a linear workflow
- ✅ Monitored task execution in real-time
- ✅ Viewed detailed task and step information
- ✅ Explored the REST API
Total time: ~5 minutes from zero to working workflow! 🚀
Getting Help
- Documentation Issues: Open an issue on GitHub
- Architecture Questions: See Crate Architecture
- Use Case Examples: See Use Cases and Patterns
- Deployment Help: See Deployment Patterns
← Back to Documentation Hub
Next: Use Cases and Patterns | Crate Architecture
Retry Semantics: max_attempts and retryable
Last Updated: 2025-10-10 Audience: Developers Status: Active Related Docs: Documentation Hub | Bug Report: Retry Eligibility Logic | States and Lifecycles
← Back to Documentation Hub
Overview
The Tasker orchestration system uses two configuration fields to control step execution and retry behavior:
- max_attempts: the maximum number of total execution attempts (including the first execution)
- retryable: whether the step can be retried after failure
Semantic Definitions
max_attempts
Definition: The maximum number of times a step can be attempted, including the first execution.
This is NOT the "number of retries" - it is the total number of attempts:
- max_attempts=0: likely a configuration error (see Edge Cases below)
- max_attempts=1: exactly one attempt (no retries after failure)
- max_attempts=3: first attempt + up to 2 retries = 3 total attempts
Implementation: SQL formula attempts < max_attempts where attempts starts at 0.
retryable
Definition: Whether a step can be retried after the first execution fails.
Important: The retryable flag does NOT affect the first execution attempt:
- First execution (attempts=0): Always eligible regardless of retryable setting
- Retry attempts (attempts>0): require retryable=true
Configuration Examples
Single Execution, No Retries
retry:
retryable: false
max_attempts: 1 # First attempt only
backoff: exponential
Behavior:
| attempts | retry_eligible | Outcome |
|---|---|---|
| 0 | ✅ true | First execution allowed |
| 1 | ❌ false | No retries (retryable=false) |
Use Case: Idempotent operations that should not retry (e.g., record creation with unique constraints)
Multiple Attempts with Retries
retry:
retryable: true
max_attempts: 3 # First attempt + 2 retries
backoff: exponential
backoff_base_ms: 1000
Behavior:
| attempts | retry_eligible | Outcome |
|---|---|---|
| 0 | ✅ true | First execution allowed |
| 1 | ✅ true | First retry allowed (1 < 3) |
| 2 | ✅ true | Second retry allowed (2 < 3) |
| 3 | ❌ false | Max attempts exhausted (3 >= 3) |
Use Case: External API calls that might have transient failures
Unlimited Retries (Not Recommended)
retry:
retryable: true
max_attempts: 999999
backoff: exponential
backoff_base_ms: 1000
max_backoff_ms: 300000 # Cap at 5 minutes
Behavior: Will retry until external intervention (task cancellation, system restart)
Use Case: Critical operations that must eventually succeed (use with caution!)
Retry Eligibility Logic
SQL Implementation
From migrations/20251006000000_fix_retry_eligibility_logic.sql:
-- retry_eligible calculation
(
COALESCE(ws.attempts, 0) = 0 -- First attempt always eligible
OR (
COALESCE(ws.retryable, true) = true -- Must be retryable for retries
AND COALESCE(ws.attempts, 0) < COALESCE(ws.max_attempts, 3)
)
) as retry_eligible
Decision Tree
Is attempts = 0?
├─ YES → retry_eligible = TRUE (first execution)
└─ NO → Is retryable = true?
├─ YES → Is attempts < max_attempts?
│ ├─ YES → retry_eligible = TRUE (retry allowed)
│ └─ NO → retry_eligible = FALSE (max attempts exhausted)
└─ NO → retry_eligible = FALSE (retries disabled)
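The decision tree maps directly to a small predicate. A Python transliteration of the SQL logic above:

```python
def retry_eligible(attempts: int, retryable: bool, max_attempts: int) -> bool:
    """First attempt is always eligible; retries require retryable and headroom."""
    if attempts == 0:
        return True
    return retryable and attempts < max_attempts

assert retry_eligible(0, False, 1)       # first execution always allowed
assert not retry_eligible(1, False, 3)   # retries disabled
assert retry_eligible(2, True, 3)        # second retry allowed
assert not retry_eligible(3, True, 3)    # max attempts exhausted
```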
Edge Cases
max_attempts=0
retry:
max_attempts: 0
Behavior: The first execution still runs (attempts = 0 is always eligible), but no retry is ever possible (attempts < 0 is never satisfied)
Status: ⚠️ Configuration error - likely unintended
Recommendation: Use max_attempts: 1 for single execution
retryable=false with max_attempts > 1
retry:
retryable: false
max_attempts: 3 # Only first attempt will execute
Behavior: First execution allowed, but no retries regardless of max_attempts
Effective Result: Same as max_attempts: 1
Recommendation: Set max_attempts: 1 when retryable: false for clarity
Historical Context
Why “max_attempts” instead of “retry_limit”?
The original field name retry_limit was semantically confusing:
Old Interpretation (incorrect):
- retry_limit=1 → "1 retry allowed" → 2 total attempts?
- retry_limit=0 → "0 retries" → 1 attempt or blocked?
New Interpretation (clear):
- max_attempts=1 → "1 total attempt" → exactly 1 execution
- max_attempts=0 → "0 attempts" → clearly invalid
Migration Timeline
- Original: retry_limit field with ambiguous semantics
- 2025-10-05: Bug discovered - retry_limit=0 blocked all execution
- 2025-10-06: Fixed SQL logic and renamed the field to max_attempts
- 2025-10-06: Added 6 SQL boundary tests for edge cases
Testing
Boundary Condition Tests
See tests/integration/sql_functions/retry_boundary_tests.rs for comprehensive coverage:
- test_max_attempts_zero_allows_first_execution - Edge case handling
- test_max_attempts_zero_blocks_after_first - Exhaustion after first execution
- test_max_attempts_one_semantics - Single execution semantics
- test_max_attempts_three_progression - Standard retry progression
- test_first_attempt_ignores_retryable_flag - First execution independence
- test_retries_require_retryable_true - Retry flag enforcement
All tests passing as of 2025-10-06.
Best Practices
For Single-Execution Steps
retry:
retryable: false
max_attempts: 1
backoff: exponential # Ignored, but required for schema
Why: Makes intent crystal clear - execute once, never retry
For Transient Failure Tolerance
retry:
retryable: true
max_attempts: 3
backoff: exponential
backoff_base_ms: 1000
max_backoff_ms: 30000
Why: Reasonable retry count with exponential backoff prevents thundering herd
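The exact backoff formula the engine applies is not shown here; the sketch below assumes the common exponential form base × 2^(attempt−1), capped at max_backoff_ms:

```python
def backoff_delay_ms(attempt, base_ms=1000, max_ms=30000):
    """Delay before the given retry attempt (attempt 1 = first retry).

    Assumes exponential doubling from backoff_base_ms, capped at
    max_backoff_ms; the engine's actual formula may differ (e.g. jitter).
    """
    return min(base_ms * 2 ** (attempt - 1), max_ms)
```

With the configuration above, retries would wait roughly 1s, then 2s, and any later attempt would be capped at 30s.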
For Critical Operations
retry:
retryable: true
max_attempts: 10
backoff: exponential
backoff_base_ms: 5000
max_backoff_ms: 300000 # 5 minutes
Why: More attempts with longer backoff for operations that must succeed
Related Documentation
- Bug Report: Retry Eligibility Logic
- State Machine Documentation
- SQL Function: get_step_readiness_status_batch
- Migration: 20251006000000_fix_retry_eligibility_logic.sql
Questions or Issues? See the test suite for comprehensive examples, or consult the bug report for historical context.
Use Cases and Patterns
Last Updated: 2025-10-10 Audience: Developers, Architects, Product Managers Status: Active Related Docs: Documentation Hub | Quick Start | Crate Architecture
← Back to Documentation Hub
Overview
This guide provides practical examples of when and how to use Tasker Core for workflow orchestration. Each use case includes architectural patterns, example workflows, and implementation guidance based on real-world scenarios.
Table of Contents
- E-Commerce Order Fulfillment
- Payment Processing Pipeline
- Data Transformation ETL
- Microservices Orchestration
- Scheduled Job Coordination
- Conditional Workflows and Decision Points
- Anti-Patterns
E-Commerce Order Fulfillment
Problem Statement
An e-commerce platform needs to coordinate multiple steps when processing orders:
- Validate order details and inventory
- Reserve inventory and process payment (parallel)
- Ship order after both payment and inventory confirmed
- Send confirmation emails
- Handle failures gracefully with retries
Why Tasker Core?
- Complex Dependencies: Steps have clear dependency relationships
- Parallel Execution: Payment and inventory can happen simultaneously
- Retry Logic: External API calls need retry with backoff
- Audit Trail: Complete history needed for compliance
- Idempotency: Steps must handle duplicate executions safely
Workflow Structure
Task: order_fulfillment_#{order_id}
Priority: Based on order value and customer tier
Namespace: fulfillment
Steps:
1. validate_order
- Handler: ValidateOrderHandler
- Dependencies: None (root step)
- Retry: retryable=true, max_attempts=3
- Validates order data, checks fraud
2. check_inventory
- Handler: InventoryCheckHandler
- Dependencies: validate_order (must complete)
- Retry: retryable=true, max_attempts=5
- Queries inventory system
3. reserve_inventory
- Handler: InventoryReservationHandler
- Dependencies: check_inventory
- Retry: retryable=true, max_attempts=3
- Reserves stock with timeout
4. process_payment
- Handler: PaymentProcessingHandler
- Dependencies: validate_order
- Retry: retryable=true, max_attempts=3
- Charges customer payment method
- **Runs in parallel with reserve_inventory**
5. ship_order
- Handler: ShippingHandler
- Dependencies: reserve_inventory AND process_payment
- Retry: retryable=false, max_attempts=1
- Creates shipping label, schedules pickup
6. send_confirmation
- Handler: EmailNotificationHandler
- Dependencies: ship_order
- Retry: retryable=true, max_attempts=10
- Sends confirmation email to customer
Implementation Pattern
Task Template (YAML configuration):
namespace: fulfillment
name: order_fulfillment
version: "1.0"
steps:
- name: validate_order
handler: validate_order
retry:
retryable: true
max_attempts: 3
backoff: exponential
backoff_base_ms: 1000
- name: check_inventory
handler: check_inventory
dependencies:
- validate_order
retry:
retryable: true
max_attempts: 5
backoff: exponential
backoff_base_ms: 2000
# ... remaining steps
Step Handler (Rust implementation):
#![allow(unused)]
fn main() {
pub struct ValidateOrderHandler;
#[async_trait]
impl RustStepHandler for ValidateOrderHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
// Extract order data from context
let order_id: String = step_data.get_input_or("order_id", String::new());
let customer_id: String = step_data.get_input_or("customer_id", String::new());
// Validate order
let order = validate_order_data(&order_id).await?;
// Check fraud detection
if check_fraud_risk(&customer_id, &order).await? {
return Ok(StepExecutionResult::failure(
step_data.workflow_step.workflow_step_uuid,
json!({"reason": "High fraud risk"}),
"fraud_detected".to_string(),
false, // not retryable
));
}
// Success - pass data to next steps
Ok(StepExecutionResult::success(
step_data.workflow_step.workflow_step_uuid,
json!({
"order_id": order_id,
"validated_at": Utc::now(),
"total_amount": order.total
}),
0,
None,
))
}
fn name(&self) -> &'static str {
"validate_order"
}
}
}
Ruby Handler Alternative:
class ProcessPaymentHandler < TaskerCore::StepHandler::Base
def call(context)
order_id = context.get_input('order_id')
amount = context.get_input('amount')
# Process payment via payment gateway
result = PaymentGateway.charge(
amount: amount,
idempotency_key: context.step_uuid
)
if result.success?
success(result: { transaction_id: result.transaction_id })
else
# Retryable failure with backoff
failure(message: result.error, error_type: 'PaymentError', retryable: true)
end
rescue PaymentGateway::NetworkError => e
# Transient error, retry
raise TaskerCore::Errors::RetryableError.new(e.message)
rescue PaymentGateway::CardDeclined => e
# Permanent failure, don't retry
raise TaskerCore::Errors::PermanentError.new(e.message, error_code: 'CARD_DECLINED')
end
end
Ruby Handler Alternative (DSL):
extend TaskerCore::StepHandler::Functional
ProcessPaymentHandler = step_handler(
'ProcessPaymentHandler',
inputs: [:order_id, :amount]
) do |order_id:, amount:, context:|
result = PaymentGateway.charge(amount: amount, idempotency_key: context.step_uuid)
if result.success?
{ transaction_id: result.transaction_id }
else
raise TaskerCore::Errors::RetryableError.new(result.error)
end
end
Key Patterns
1. Parallel Execution
- reserve_inventory and process_payment both depend only on earlier steps
- Tasker automatically executes them in parallel
- ship_order waits for both to complete
2. Idempotent Handlers
- Use step_uuid as the idempotency key for external APIs
- Check if the operation already completed before retrying
- Handle duplicate executions gracefully
3. Smart Retry Logic
- Network errors → retryable with exponential backoff
- Business logic failures → permanent, no retry
- Configure max_attempts based on criticality
4. Data Flow
- Early steps provide data to later steps via results
- Access parent results: step_data.dependency_results["validate_order"]
- Build context as the workflow progresses
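Parallel execution falls out of dependency-driven readiness: a step is runnable as soon as all of its dependencies are complete. A minimal sketch of that computation (illustrative, not engine code):

```python
def ready_steps(dependencies, completed):
    """Steps whose dependencies are all complete and that haven't run yet.

    dependencies: {step_name: [dependency_names]}
    completed:    set of finished step names
    """
    return {
        step for step, deps in dependencies.items()
        if step not in completed and all(d in completed for d in deps)
    }

# The order_fulfillment DAG from above:
deps = {
    "validate_order": [],
    "check_inventory": ["validate_order"],
    "reserve_inventory": ["check_inventory"],
    "process_payment": ["validate_order"],
    "ship_order": ["reserve_inventory", "process_payment"],
}
```

Once validate_order completes, both check_inventory and process_payment become ready and run in parallel; ship_order stays blocked until both of its dependencies finish.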
Observability
Monitor these metrics for order fulfillment:
#![allow(unused)]
fn main() {
// Track order processing stages
metrics::counter!("orders.validated").increment(1);
metrics::counter!("orders.payment_processed").increment(1);
metrics::counter!("orders.shipped").increment(1);
// Track failures by reason
metrics::counter!("orders.failed", "reason" => "fraud").increment(1);
metrics::counter!("orders.failed", "reason" => "inventory").increment(1);
// Track timing
metrics::histogram!("order.fulfillment_time_ms").record(elapsed_ms);
}
Payment Processing Pipeline
Problem Statement
A fintech platform needs to process payments with strict requirements:
- Multiple payment methods (card, bank transfer, wallet)
- Regulatory compliance and audit trails
- Automatic retry for transient failures
- Reconciliation with accounting system
- Webhook notifications to customers
Why Tasker Core?
- Compliance: Complete audit trail with state transitions
- Reliability: Automatic retry with configurable limits
- Observability: Detailed metrics for financial operations
- Idempotency: Prevent duplicate charges
- Flexibility: Support multiple payment flows
Workflow Structure
Task: payment_processing_#{payment_id}
Namespace: payments
Priority: High (financial operations)
Steps:
1. validate_payment_request
- Verify payment details
- Check account status
- Validate payment method
2. check_fraud
- Run fraud detection
- Verify transaction limits
- Check velocity rules
3. authorize_payment
- Contact payment gateway
- Reserve funds (authorization hold)
- Return authorization code
4. capture_payment (depends on authorize_payment)
- Capture authorized funds
- Handle settlement
- Generate receipt
5. record_transaction (depends on capture_payment)
- Write to accounting ledger
- Update customer balance
- Create audit records
6. send_notification (depends on record_transaction)
- Send webhook to merchant
- Send receipt to customer
- Update payment status
Implementation Highlights
Retry Strategy for Payment Gateway:
#![allow(unused)]
fn main() {
impl RustStepHandler for AuthorizePaymentHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let payment_id: String = step_data.get_input_or("payment_id", String::new());
match gateway.authorize(payment_id, &step_data.workflow_step.workflow_step_uuid).await {
Ok(auth) => {
Ok(StepExecutionResult::success_from_json(json!({
"authorization_code": auth.code,
"authorized_at": Utc::now(),
"gateway_transaction_id": auth.transaction_id
})))
}
Err(GatewayError::NetworkTimeout) => {
// Transient - retry with backoff
Ok(StepExecutionResult::retryable_failure(
"network_timeout",
json!({"retry_recommended": true})
))
}
Err(GatewayError::InsufficientFunds) => {
// Permanent - don't retry
Ok(StepExecutionResult::permanent_failure(
"insufficient_funds",
json!({"requires_manual_intervention": false})
))
}
Err(GatewayError::InvalidCard) => {
// Permanent - don't retry
Ok(StepExecutionResult::permanent_failure(
"invalid_card",
json!({"requires_manual_intervention": true})
))
}
}
}
}
}
Idempotency Pattern:
#![allow(unused)]
fn main() {
async fn capture_payment(step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let idempotency_key = step_data.workflow_step.workflow_step_uuid.to_string();
// Check if we already captured this payment
if let Some(existing) = check_existing_capture(&idempotency_key).await? {
return Ok(StepExecutionResult::success_from_json(json!({
"already_captured": true,
"transaction_id": existing.transaction_id,
"note": "Idempotent duplicate detected"
})));
}
// Proceed with capture
let result = gateway.capture(&idempotency_key).await?;
// Store idempotency record
store_capture_record(&idempotency_key, &result).await?;
Ok(StepExecutionResult::success_from_json(json!(result)))
}
}
Key Patterns
1. Two-Phase Commit
- Authorize (reserve) → Capture (settle)
- Allows cancellation between phases
- Common in payment processing
2. Audit Trail
- Every state transition recorded
- Regulatory compliance built-in
- Forensic investigation support
3. Circuit Breaking
- Protect against payment gateway failures
- Automatic backoff when gateway degraded
- Fallback to alternate gateways
Data Transformation ETL
Problem Statement
A data analytics platform needs to process data through multiple transformation stages:
- Extract data from multiple sources (APIs, databases, files)
- Transform data (clean, enrich, aggregate)
- Load to data warehouse
- Handle large datasets with partitioning
- Retry transient failures, skip corrupted data
Why Tasker Core?
- DAG Execution: Complex transformation pipelines
- Parallel Processing: Independent partitions processed concurrently
- Error Handling: Skip corrupted records, retry transient failures
- Observability: Track data quality and processing metrics
- Scheduling: Integrate with cron/scheduler for periodic runs
Workflow Structure
Task: etl_customer_data_#{date}
Namespace: data_pipeline
Steps:
1. extract_customer_profiles
- Fetch from customer database
- Partition by customer_id ranges
- Creates multiple output partitions
2. extract_transaction_history
- Fetch from transactions database
- Runs in parallel with extract_customer_profiles
- Time-based partitioning
3. enrich_customer_data (depends on extract_customer_profiles)
- Add demographic data from external API
- Process partitions in parallel
- Each partition is independent
4. join_transactions (depends on enrich_customer_data, extract_transaction_history)
- Join enriched profiles with transactions
- Aggregate metrics per customer
- Parallel processing per partition
5. load_to_warehouse (depends on join_transactions)
- Bulk load to data warehouse
- Verify data quality
- Update metadata tables
6. generate_summary_report (depends on load_to_warehouse)
- Generate processing statistics
- Send notification with summary
- Archive source files
Implementation Pattern
Partition-Based Processing:
#![allow(unused)]
fn main() {
pub struct ExtractCustomerProfilesHandler;
#[async_trait]
impl RustStepHandler for ExtractCustomerProfilesHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let date: String = step_data.get_input_or("processing_date", String::new());
// Determine partitions (e.g., by customer_id ranges)
let partitions = calculate_partitions(1000000, 100000)?; // 10 partitions
// Extract data for each partition
let mut partition_files = Vec::new();
for partition in partitions {
let filename = extract_partition(&date, partition).await?;
partition_files.push(filename);
}
// Return partition info for downstream steps
Ok(StepExecutionResult::success_from_json(json!({
"partitions": partition_files,
"total_records": 1000000,
"extracted_at": Utc::now()
})))
}
}
}
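`calculate_partitions` above is a hypothetical helper; one plausible implementation splits the record range into fixed-size, half-open partitions:

```python
def calculate_partitions(total_records, partition_size):
    """Split [0, total_records) into half-open (start, end) ranges.

    Hypothetical helper matching the call in the handler above:
    calculate_partitions(1_000_000, 100_000) yields 10 partitions.
    """
    return [
        (start, min(start + partition_size, total_records))
        for start in range(0, total_records, partition_size)
    ]
```

The last partition is truncated when the total is not an exact multiple of the partition size.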
Error Handling for Data Quality:
#![allow(unused)]
fn main() {
async fn enrich_customer_data(step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let partition_file: String = step_data.get_input_or("partition_file", String::new());
let mut processed = 0;
let mut skipped = 0;
let mut errors = Vec::new();
for record in read_partition(&partition_file).await? {
match enrich_record(record).await {
Ok(enriched) => {
write_enriched(enriched).await?;
processed += 1;
}
Err(EnrichmentError::MalformedData(e)) => {
// Skip corrupted record, continue processing
skipped += 1;
errors.push(format!("Skipped record: {}", e));
}
Err(EnrichmentError::ApiTimeout(e)) => {
// Transient failure, retry entire step
return Ok(StepExecutionResult::retryable_failure(
"api_timeout",
json!({"error": e.to_string()})
));
}
}
}
if skipped as f64 / processed as f64 > 0.1 {
// Too many skipped records
return Ok(StepExecutionResult::permanent_failure(
"data_quality_issue",
json!({
"processed": processed,
"skipped": skipped,
"error_rate": skipped as f64 / processed as f64
})
));
}
Ok(StepExecutionResult::success_from_json(json!({
"processed": processed,
"skipped": skipped,
"errors": errors
})))
}
}
Key Patterns
1. Partition-Based Parallelism
- Split large datasets into partitions
- Process partitions independently
- Aggregate results in final step
2. Graceful Degradation
- Skip corrupted individual records
- Continue processing remaining data
- Report data quality issues
3. Monitoring Data Quality
- Track record counts through pipeline
- Alert on unexpected error rates
- Validate schema at boundaries
Microservices Orchestration
Problem Statement
Coordinate operations across multiple microservices:
- User registration flow (auth, profile, notifications, analytics)
- Distributed transactions with compensation
- Service dependency management
- Timeout and circuit breaking
Why Tasker Core?
- Service Coordination: Orchestrate distributed operations
- Saga Pattern: Implement compensation for failures
- Resilience: Circuit breakers and timeouts
- Observability: End-to-end tracing with correlation IDs
- Flexibility: Handle heterogeneous service protocols
Workflow Structure (User Registration Example)
Task: user_registration_#{user_id}
Namespace: user_onboarding
Steps:
1. create_auth_account
- Call auth service to create account
- Generate user credentials
- Store authentication tokens
2. create_user_profile (depends on create_auth_account)
- Call profile service
- Initialize user preferences
- Set default settings
3. setup_notification_preferences (depends on create_user_profile)
- Call notification service
- Configure email preferences
- Set up push notifications
4. track_user_signup (depends on create_user_profile)
- Call analytics service
- Record signup event
- Runs in parallel with setup_notification_preferences
5. send_welcome_email (depends on setup_notification_preferences)
- Send welcome email
- Provide onboarding links
- Track email delivery
Compensation Steps (on failure):
- If create_user_profile fails → delete_auth_account
- If any step fails after profile → deactivate_user
Implementation Pattern (Saga with Compensation)
#![allow(unused)]
fn main() {
pub struct CreateUserProfileHandler;
#[async_trait]
impl RustStepHandler for CreateUserProfileHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let user_id: String = step_data.get_input_or("user_id", String::new());
let email: String = step_data.get_input_or("email", String::new());
// Get auth details from previous step
let auth_result = step_data.dependency_results.get("create_auth_account")
.ok_or("Missing auth result")?;
let auth_token: String = auth_result.get("auth_token")?;
// Call profile service
match profile_service.create_profile(&user_id, &email, &auth_token).await {
Ok(profile) => {
Ok(StepExecutionResult::success_from_json(json!({
"profile_id": profile.id,
"created_at": profile.created_at,
"user_id": user_id
})))
}
Err(ProfileServiceError::DuplicateEmail) => {
// Permanent failure - email already exists
// Trigger compensation
Ok(StepExecutionResult::permanent_failure_with_compensation(
"duplicate_email",
json!({"email": email}),
vec!["delete_auth_account"] // Compensation steps
))
}
Err(ProfileServiceError::ServiceUnavailable) => {
// Transient - retry
Ok(StepExecutionResult::retryable_failure(
"service_unavailable",
json!({"retry_recommended": true})
))
}
}
}
}
}
Compensation Handler:
#![allow(unused)]
fn main() {
pub struct DeleteAuthAccountHandler;
#[async_trait]
impl RustStepHandler for DeleteAuthAccountHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let user_id: String = step_data.get_input_or("user_id", String::new());
// Best-effort deletion
match auth_service.delete_account(&user_id).await {
Ok(_) => {
Ok(StepExecutionResult::success_from_json(json!({
"compensated": true,
"user_id": user_id
})))
}
Err(e) => {
// Log error but don't fail - compensation is best-effort
warn!("Compensation failed for user {}: {}", user_id, e);
Ok(StepExecutionResult::success_from_json(json!({
"compensated": false,
"error": e.to_string(),
"requires_manual_cleanup": true
})))
}
}
}
}
}
Key Patterns
1. Correlation IDs
- Pass correlation_id through all services
- Enable end-to-end tracing
- Simplify debugging distributed issues
2. Compensation (Saga Pattern)
- Define compensation steps for cleanup
- Execute on permanent failures
- Best-effort execution, log failures
3. Service Circuit Breakers
- Wrap service calls in circuit breakers
- Fail fast when services degraded
- Automatic recovery detection
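Tasker ships its own circuit breakers; the sketch below is only an illustration of the failure-threshold / recovery-window behavior described above:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: trip open after `threshold`
    consecutive failures, allow a trial call after `reset_after` seconds."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def allow(self):
        """Should the next call be attempted?"""
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # half-open: permit a trial call once the recovery window has passed
        return time.monotonic() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the breaker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()  # trip open, fail fast
```

A handler would check `allow()` before calling the gateway, record the outcome, and fall back to an alternate gateway (or a retryable failure) while the breaker is open.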
Scheduled Job Coordination
Problem Statement
Run periodic jobs with dependencies:
- Daily report generation (depends on data refresh)
- Scheduled data backups (depends on maintenance window)
- Cleanup jobs (depends on retention policies)
Why Tasker Core?
- Dependency Management: Jobs run in correct order
- Failure Handling: Automatic retry of failed jobs
- Observability: Track job execution history
- Flexibility: Dynamic scheduling based on results
Implementation Pattern
#![allow(unused)]
fn main() {
// External scheduler (cron, Kubernetes CronJob, etc.) creates tasks
pub async fn schedule_daily_reports() -> Result<Uuid> {
let client = OrchestrationClient::new("http://orchestration:8080").await?;
let task_request = TaskRequest {
template_name: "daily_reporting".to_string(),
namespace: "scheduled_jobs".to_string(),
configuration: json!({
"report_date": Utc::now().format("%Y-%m-%d").to_string(),
"report_types": ["sales", "inventory", "customer_activity"]
}),
priority: 5, // Normal priority
};
let response = client.create_task(task_request).await?;
Ok(response.task_uuid)
}
}
Conditional Workflows and Decision Points
Problem Statement
Many workflows require runtime decision-making where the execution path depends on business logic evaluated at runtime:
- Approval routing based on request amount or risk level
- Tiered processing based on customer status
- Compliance checks varying by jurisdiction
- Dynamic resource allocation based on workload
Why Use Decision Points?
Traditional Approach (Static DAG):
# Must define ALL possible paths upfront
Steps:
- validate
- route_A # Always created
- route_B # Always created
- route_C # Always created
- converge # Must handle all paths
Decision Point Approach (Dynamic DAG):
# Create ONLY the needed path at runtime
Steps:
- validate
- routing_decision # Decides which path
- route_A # Created dynamically if needed
- route_B # Created dynamically if needed
- route_C # Created dynamically if needed
- converge # Uses intersection semantics
Benefits
- Efficiency: Only execute steps actually needed
- Clarity: Workflow reflects actual business logic
- Cost Savings: Reduce API calls, processing time, and resource usage
- Flexibility: Add new paths without changing core logic
Core Pattern
Task: conditional_approval
Steps:
1. validate_request # Regular step
2. routing_decision # Decision point (type: decision_point)
→ Evaluates business logic
→ Returns: CreateSteps(['manager_approval']) or NoBranches
3. auto_approve # Might be created
4. manager_approval # Might be created
5. finance_review # Might be created
6. finalize_approval # Convergence (type: deferred)
→ Waits for intersection of dependencies
Example: Amount-Based Approval Routing
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
def call(context)
amount = context.get_task_field('amount')
# Business logic determines which steps to create
steps = if amount < 1_000
['auto_approve']
elsif amount < 5_000
['manager_approval']
else
['manager_approval', 'finance_review']
end
# Return decision outcome
decision_success(
steps: steps,
result_data: {
route_type: determine_route_type(amount),
amount: amount
}
)
end
end
DSL Alternative:
extend TaskerCore::StepHandler::Functional
RoutingDecisionHandler = decision_handler(
'RoutingDecisionHandler',
inputs: [:amount]
) do |amount:, context:|
steps = if amount.to_f < 1_000
['auto_approve']
elsif amount.to_f < 5_000
['manager_approval']
else
['manager_approval', 'finance_review']
end
Decision.route(steps, route_type: determine_route_type(amount), amount: amount)
end
Real-World Scenarios
1. E-Commerce Returns Processing
- Low-value returns: Auto-approve
- Medium-value: Manager review
- High-value or suspicious: Fraud investigation + manager review
2. Financial Risk Assessment
- Low-risk transactions: Standard processing
- Medium-risk: Additional verification
- High-risk: Manual review + compliance checks + legal review
3. Healthcare Prior Authorization
- Standard procedures: Auto-approve
- Specialized care: Medical director review
- Experimental treatments: Medical director + insurance review + compliance
4. Customer Support Escalation
- Simple issues: Tier 1 resolution
- Complex issues: Tier 2 specialist
- VIP customers: Immediate senior support + account manager notification
Key Features
Decision Point Steps:
- Special step type that returns DecisionPointOutcome
- Can return NoBranches (no additional steps) or CreateSteps (list of step names)
- Fully atomic - either all steps created or none
- Supports nested decisions (configurable depth limit)
Deferred Steps:
- Use intersection semantics for dependencies
- Wait for: (declared dependencies) ∩ (actually created steps)
- Enable convergence regardless of path taken
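The intersection semantics for deferred steps are simple set arithmetic. A sketch, using the approval workflow above (step names are from that example):

```python
def effective_dependencies(declared, created):
    """Deferred-step intersection semantics: wait only for declared
    dependencies that were actually created at runtime."""
    return declared & created

# finalize_approval declares every possible upstream path...
declared = {"auto_approve", "manager_approval", "finance_review"}
# ...but the decision point only created the mid-tier route
created = {"validate_request", "routing_decision", "manager_approval"}
```

Here `finalize_approval` waits only on `manager_approval`, so the workflow converges no matter which route the decision point chose.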
Type-Safe Implementation:
- Ruby: TaskerCore::StepHandler::Decision base class
- Rust: DecisionPointOutcome enum with serde support
- Automatic validation and serialization
Implementation
See the complete guide: Conditional Workflows and Decision Points
Covers:
- When to use conditional workflows
- YAML configuration
- Ruby and Rust implementation patterns
- Simple and complex examples
- Best practices and limitations
Anti-Patterns
❌ Don’t Use Tasker Core For
1. Simple Cron Jobs
# ❌ Anti-pattern: Single-step scheduled job
Task: send_daily_email
Steps:
- send_email # No dependencies, no retry needed
Why: Overhead not justified. Use native cron or systemd timers.
2. Real-Time Sub-Millisecond Operations
# ❌ Anti-pattern: High-frequency trading
Task: execute_trade_#{microseconds}
Steps:
- check_price # Needs <1ms latency
- execute_order
Why: Architectural overhead (~10-20ms) too high. Use in-memory queues or direct service calls.
3. Pure Fan-Out
# ❌ Anti-pattern: Simple message broadcasting
Task: broadcast_notification
Steps:
- send_to_user_1
- send_to_user_2
- send_to_user_3
# ... 1000s of independent steps
Why: Use message bus (Kafka, RabbitMQ) for pub/sub patterns. Tasker is for orchestration, not broadcasting.
4. Stateless Single Operations
# ❌ Anti-pattern: Single API call with no retry
Task: fetch_user_data
Steps:
- call_api # No dependencies, no state management needed
Why: Direct API call with client-side retry is simpler.
Pattern Selection Guide
| Characteristic | Use Tasker Core? | Alternative |
|---|---|---|
| Multiple dependent steps | ✅ Yes | N/A |
| Parallel execution needed | ✅ Yes | Thread pools for simple cases |
| Retry logic required | ✅ Yes | Client-side retry libraries |
| Audit trail needed | ✅ Yes | Append-only logs |
| Single step, no retry | ❌ No | Direct function call |
| Sub-second latency required | ❌ No | In-memory queues |
| Pure broadcast/fan-out | ❌ No | Message bus (Kafka, etc.) |
| Simple scheduled job | ❌ No | Cron, systemd timers |
Related Documentation
- Quick Start - Get your first workflow running
- Conditional Workflows - Runtime decision-making and dynamic step creation
- Crate Architecture - Understand the codebase
- Deployment Patterns - Deploy to production
- States and Lifecycles - State machine deep dive
- Events and Commands - Event-driven patterns
← Back to Documentation Hub
Worker Crates Overview
Last Updated: 2025-12-27 Audience: Developers, Architects, Operators Status: Active Related Docs: Worker Event Systems | Worker Actors
← Back to Documentation Hub
The tasker-core workspace provides four worker implementations for executing workflow step handlers. Each implementation targets different deployment scenarios and developer ecosystems while sharing the same core Rust foundation.
Quick Navigation
| Document | Description |
|---|---|
| API Convergence Matrix | Quick reference for aligned APIs across languages |
| Client Wrapper API | High-level client for submitting tasks (Ruby, Python, TypeScript) |
| Example Handlers | Side-by-side handler examples |
| Patterns and Practices | Common patterns across all workers |
| Rust Worker | Native Rust implementation |
| Ruby Worker | Ruby gem for Rails integration |
| Python Worker | Python package for data pipelines |
| TypeScript Worker | TypeScript/JS for Bun/Node.js |
Overview
Four Workers, One Foundation
All workers share the same Rust core (tasker-worker crate) for orchestration, queueing, and state management. The language-specific workers add handler execution in their respective runtimes.
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
PostgreSQL + PGMQ
│
▼
┌─────────────────────────────┐
│ Rust Core (tasker-worker) │
│ ─────────────────────────│
│ • Queue Management │
│ • State Machines │
│ • Orchestration │
│ • Actor System │
└─────────────────────────────┘
│
┌───────────────┬───────────┼───────────┬───────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌─────────────┐
│ Rust │ │ Ruby │ │ Python │ │ TypeScript │
│ Worker │ │ Worker │ │ Worker │ │ Worker │
│───────────│ │───────────│ │───────────│ │─────────────│
│ Native │ │ FFI Bridge│ │ FFI Bridge│ │ FFI Bridge │
│ Handlers │ │ + Gem │ │ + Package │ │ Bun/Node.js │
└───────────┘ └───────────┘ └───────────┘ └─────────────┘
Comparison Table
| Feature | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Performance | Native | GVL-limited | GIL-limited | V8/Bun native |
| Integration | Standalone | Rails/Rack apps | Data pipelines | Node/Bun apps |
| Handler Style | Async traits | Class-based | ABC-based | Class-based |
| Concurrency | Tokio async | Thread + FFI poll | Thread + FFI poll | Event loop + native addon |
| Deployment | Binary | Gem + Server | Package + Server | Package + Server |
| Headless Mode | N/A | Library embed | Library embed | Library embed |
| Runtimes | - | MRI | CPython | Bun (primary), Node.js |
When to Use Each
Rust Worker - Best for:
- Maximum throughput requirements
- Resource-constrained environments
- Standalone microservices
- Performance-critical handlers
Ruby Worker - Best for:
- Rails/Ruby applications
- ActiveRecord/ORM integration
- Existing Ruby codebases
- Quick prototyping with Ruby ecosystem
Python Worker - Best for:
- Data processing pipelines
- ML/AI integration
- Scientific computing workflows
- Python-native team preferences
TypeScript Worker - Best for:
- Modern JavaScript/TypeScript applications
- Full-stack Node.js teams
- High-performance Bun deployments
- React/Vue/Angular backend services
- Native addon integration via napi-rs
Deployment Modes
Server Mode
All workers can run as standalone servers:
Rust:
cargo run -p workers-rust
Ruby:
cd workers/ruby
./bin/server.rb
Python:
cd workers/python
python bin/server.py
TypeScript (Bun):
cd workers/typescript
bun run bin/server.ts
TypeScript (Node.js):
cd workers/typescript
npx tsx bin/server.ts
Headless/Embedded Mode (Ruby, Python & TypeScript)
Ruby, Python, and TypeScript workers can be embedded into existing applications without running the HTTP server. Headless mode is controlled via TOML configuration, not bootstrap parameters.
TOML Configuration (e.g., config/tasker/base/worker.toml):
[web]
enabled = false # Disables HTTP server for headless/embedded mode
Ruby (in Rails):
# config/initializers/tasker.rb
require 'tasker_core'
# Bootstrap worker (web server disabled via TOML config)
TaskerCore::Worker::Bootstrap.start!
# Register handlers
TaskerCore::Registry::HandlerRegistry.instance.register_handler(
'MyHandler',
MyHandler
)
Python (in application):
from tasker_core import bootstrap_worker, HandlerRegistry
from tasker_core.types import BootstrapConfig
# Bootstrap worker (web server disabled via TOML config)
config = BootstrapConfig(namespace="my-app")
bootstrap_worker(config)
# Register handlers
registry = HandlerRegistry.instance()
registry.register("my_handler", MyHandler)
TypeScript (in application):
import { WorkerServer } from '@tasker-systems/tasker';
// Bootstrap worker (web server disabled via TOML config)
const server = new WorkerServer();
await server.start({ namespace: 'my-app' });
// Register handlers
const handlerSystem = server.getHandlerSystem();
handlerSystem.register('my_handler', MyHandler);
Core Concepts
1. Handler Registration
All workers use a registry pattern for handler discovery:
┌─────────────────────┐
│ HandlerRegistry │
│ (Singleton) │
└─────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Handler A│ │Handler B│ │Handler C│
└─────────┘ └─────────┘ └─────────┘
2. Event Flow
Step events flow through a consistent pipeline:
1. PGMQ Queue → Event received
2. Worker claims step (atomic)
3. Handler resolved by name
4. Handler.call(context) executed
5. Result sent to completion channel
6. Orchestration receives result
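As a standalone sketch of those six stages (the registry, queue, and completion channel below are plain Python stand-ins, not tasker-core APIs):

```python
from queue import Queue

# Illustrative stand-ins for the real infrastructure:
handlers = {"process_order": lambda ctx: {"status": "ok", "order": ctx["order_id"]}}
step_queue = Queue()       # stands in for the PGMQ queue
completion_channel = []    # stands in for the worker's completion channel

step_queue.put({"step_uuid": "abc-123", "handler": "process_order",
                "context": {"order_id": 42}})

event = step_queue.get()              # 1-2. event received, step claimed
handler = handlers[event["handler"]]  # 3. handler resolved by name
result = handler(event["context"])    # 4. Handler.call(context) executed
completion_channel.append(            # 5. result sent to completion channel
    {"step_uuid": event["step_uuid"], "result": result})
# 6. orchestration consumes the completion channel and advances the workflow
```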
3. Error Classification
All workers distinguish between:
- Retryable Errors: Transient failures → Re-enqueue step
- Permanent Errors: Unrecoverable → Mark step failed
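A minimal sketch of that split, using illustrative exception names rather than actual tasker-core types:

```python
# Illustrative classification helper; the exception names and the two
# outcome strings are assumptions for this sketch, not tasker-core APIs.
class RetryableError(Exception):
    """Transient failure raised by a handler."""

def classify(exc: Exception) -> str:
    # Transient failures are re-enqueued; everything else is permanent.
    if isinstance(exc, (RetryableError, TimeoutError, ConnectionError)):
        return "re-enqueue"
    return "mark-failed"
```

For example, `classify(TimeoutError())` yields `"re-enqueue"` while `classify(ValueError())` yields `"mark-failed"`.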
4. Graceful Shutdown
All workers handle shutdown signals (SIGTERM, SIGINT):
1. Signal received
2. Stop accepting new work
3. Complete in-flight handlers
4. Flush completion channel
5. Shutdown Rust foundation
6. Exit cleanly
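The sequence can be sketched as follows; the `Worker` class here just records the stages in order and is not the actual worker implementation:

```python
import signal

class Worker:
    """Illustrative stand-in that records the shutdown stages."""
    def __init__(self):
        self.accepting = True
        self.stages = []

    def shutdown(self, signum=None, frame=None):
        self.accepting = False                  # 2. stop accepting new work
        self.stages.append("drained")           # 3. complete in-flight handlers
        self.stages.append("flushed")           # 4. flush completion channel
        self.stages.append("foundation-down")   # 5. shut down Rust foundation
        # 6. the real worker would now exit cleanly

worker = Worker()
signal.signal(signal.SIGTERM, worker.shutdown)  # 1. signal received
signal.signal(signal.SIGINT, worker.shutdown)
worker.shutdown()  # simulate a delivered signal
```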
Configuration
Environment Variables
Common across all workers:
| Variable | Description | Default |
|---|---|---|
DATABASE_URL | PostgreSQL connection string | Required |
TASKER_ENV | Environment (test/development/production) | development |
TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
TASKER_NAMESPACE | Worker namespace for queue isolation | default |
RUST_LOG | Log level (trace/debug/info/warn/error) | info |
Language-Specific
Ruby:
| Variable | Description |
|---|---|
RUBY_GC_HEAP_GROWTH_FACTOR | GC tuning for production |
Python:
| Variable | Description |
|---|---|
PYTHON_HANDLER_PATH | Path for handler auto-discovery |
Handler Types
All workers support specialized handler types:
StepHandler (Base)
Basic step execution:
from tasker_core.step_handler.functional import step_handler, inputs
@step_handler("my_handler")
@inputs(MyInputModel)
def my_handler(inputs: MyInputModel, context):
    return {"result": "done"}
See Class-Based Handlers for the inheritance-based alternative.
ApiHandler
HTTP/REST API integration with automatic error classification:
extend TaskerCore::StepHandler::Functional

FetchDataHandler = api_handler(
  'FetchDataHandler',
  base_url: 'https://api.example.com',
  inputs: [:user_id]
) do |user_id:, api:, context:|
  response = api.get("/users/#{user_id}")
  api.api_success(result: { user: response.body })
end
See Class-Based Handlers for the inheritance-based alternative.
DecisionHandler
Dynamic workflow routing:
from tasker_core.step_handler.functional import decision_handler, Decision
@decision_handler("routing_decision")
@inputs('amount')
def routing_decision(amount, context):
    if float(amount or 0) < 1000:
        return Decision.route(['auto_approve'], route_type='automatic')
    return Decision.route(['manager_approval'], route_type='manager')
See Class-Based Handlers for the inheritance-based alternative.
Batchable
Large dataset processing with separate analyzer and worker handlers:
import { defineBatchAnalyzer, defineBatchWorker, BatchConfig } from '@tasker-systems/tasker';
export const CsvAnalyzer = defineBatchAnalyzer(
'Csv.StepHandlers.CsvAnalyzerHandler',
{ inputs: { csvPath: 'csv_path' } },
async ({ csvPath }) => ({
totalItems: await countRows(csvPath as string),
batchSize: 100,
}),
);
export const CsvWorker = defineBatchWorker(
'Csv.StepHandlers.CsvWorkerHandler',
{ analyzerStep: 'analyze_csv' },
async ({ batchContext }) => ({ processed: batchContext.batchSize }),
);
See Class-Based Handlers for the inheritance-based alternative.
Quick Start
Rust
# Build and run
cd workers/rust
cargo run
# With custom configuration
TASKER_CONFIG_PATH=/path/to/config.toml cargo run
Ruby
# Install dependencies
cd workers/ruby
bundle install
bundle exec rake compile
# Run server
./bin/server.rb
Python
# Install dependencies
cd workers/python
uv sync
uv run maturin develop
# Run server
python bin/server.py
TypeScript
# Install dependencies
cd workers/typescript
bun install
bun run build:napi # Build napi-rs native addon
bun run build # Build TypeScript
# Run server (Bun)
bun run bin/server.ts
# Run server (Node.js)
npx tsx bin/server.ts
Monitoring
Health Checks
All workers expose health status:
# Python
from tasker_core import get_health_check
health = get_health_check()
# Ruby
health = TaskerCore::FFI.health_check
Metrics
Common metrics available:
| Metric | Description |
|---|---|
pending_count | Events awaiting processing |
in_flight_count | Events being processed |
completed_count | Successfully completed |
failed_count | Failed events |
starvation_detected | Processing bottleneck |
Logging
All workers use structured logging:
2025-01-15T10:30:00Z [INFO] python-worker: Processing step step_uuid=abc-123 handler=process_order
2025-01-15T10:30:01Z [INFO] python-worker: Step completed step_uuid=abc-123 success=true duration_ms=150
Architecture Deep Dive
For detailed architectural documentation:
- Worker Event Systems - Dual-channel architecture, event-driven processing
- Worker Actors - Actor-based coordination, message handling
- Events and Commands - Event definitions, command routing
See Also
- API Convergence Matrix - Quick reference tables
- Example Handlers - Side-by-side code examples
- Patterns and Practices - Common patterns
- Rust Worker - Native implementation details
- Ruby Worker - Ruby gem documentation
- Python Worker - Python package documentation
- TypeScript Worker - TypeScript/JS package documentation
API Convergence Matrix
Last Updated: 2026-01-08 Status: Active
Overview
This document provides a quick reference for the aligned APIs across Ruby, Python, TypeScript, and Rust worker implementations. All four languages share consistent patterns for handler execution, result creation, registry operations, and composition via mixins/traits.
Handler Signatures
| Language | Base Class | Signature |
|---|---|---|
| Ruby | TaskerCore::StepHandler::Base | def call(context) |
| Python | BaseStepHandler | def call(self, context: StepContext) -> StepHandlerResult |
| TypeScript | StepHandler | async call(context: StepContext): Promise<StepHandlerResult> |
| Rust | StepHandler trait | async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult |
Composition Pattern
All languages use composition via mixins/traits rather than inheritance hierarchies.
Handler Composition
| Language | Base | Mixin Syntax | Example |
|---|---|---|---|
| Ruby | StepHandler::Base | include Mixins::API | class Handler < Base; include Mixins::API |
| Python | StepHandler | Multiple inheritance | class Handler(StepHandler, APIMixin) |
| TypeScript | StepHandler | applyAPI(this) | Mixin functions applied in constructor |
| Rust | impl StepHandler | impl APICapable | Multiple trait implementations |
Available Mixins/Traits
| Capability | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| API | Mixins::API | APIMixin | applyAPI() | APICapable |
| Decision | Mixins::Decision | DecisionMixin | applyDecision() | DecisionCapable |
| Batchable | Mixins::Batchable | BatchableMixin | BatchableHandler | BatchableCapable |
StepContext Fields
The StepContext provides unified access to step execution data across Ruby, Python, and TypeScript.
| Field | Type | Description |
|---|---|---|
task_uuid | String | Unique task identifier (UUID v7) |
step_uuid | String | Unique step identifier (UUID v7) |
input_data | Dict/Hash | Input data for the step from workflow_step.inputs |
step_inputs | Dict/Hash | Alias for input_data |
step_config | Dict/Hash | Handler configuration from step_definition.handler.initialization |
dependency_results | Wrapper | Results from parent steps (DependencyResultsWrapper) |
retry_count | Integer | Current retry attempt (from workflow_step.attempts) |
max_retries | Integer | Maximum retry attempts (from workflow_step.max_attempts) |
Convenience Methods
| Method | Description |
|---|---|
get_task_field(name) | Get field from task context |
get_dependency_result(step_name) | Get result from a parent step |
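A minimal stand-in for `StepContext` illustrating both methods (the class body is a sketch, not the real tasker-core type; field names follow the tables above):

```python
class StepContext:
    """Illustrative sketch of the two convenience methods."""
    def __init__(self, task_context, dependency_results):
        self._task_context = task_context
        self._deps = dependency_results

    def get_task_field(self, name):
        # Get field from task context
        return self._task_context.get(name)

    def get_dependency_result(self, step_name):
        # Get result from a parent step
        return self._deps.get(step_name)

ctx = StepContext(
    task_context={"customer_email": "a@example.com"},
    dependency_results={"validate_cart": {"total": 64.77}},
)
```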
Ruby-Specific Accessors
| Property | Type | Description |
|---|---|---|
task | TaskWrapper | Full task wrapper with context and metadata |
workflow_step | WorkflowStepWrapper | Workflow step with execution state |
step_definition | StepDefinitionWrapper | Step definition from task template |
Result Factories
Success Results
| Language | Method | Example |
|---|---|---|
| Ruby | success(result:, metadata:) | success(result: { id: 123 }, metadata: { ms: 50 }) |
| Python | self.success(result, metadata) | self.success({"id": 123}, {"ms": 50}) |
| Rust | StepExecutionResult::success(...) | StepExecutionResult::success(result, metadata) |
Failure Results
| Language | Method | Key Parameters |
|---|---|---|
| Ruby | failure(message:, error_type:, error_code:, retryable:, metadata:) | keyword arguments |
| Python | self.failure(message, error_type, error_code, retryable, metadata) | positional/keyword |
| Rust | StepExecutionResult::failure(...) | structured fields |
Result Fields
| Field | Ruby | Python | Rust | Description |
|---|---|---|---|---|
| success | bool | bool | bool | Whether step succeeded |
| result | Hash | Dict | HashMap | Result data |
| metadata | Hash | Dict | HashMap | Additional context |
| error_message | String | str | String | Human-readable error |
| error_type | String | str | String | Error classification |
| error_code | String (optional) | str (optional) | String (optional) | Application error code |
| retryable | bool | bool | bool | Whether to retry |
Standard error_type Values
Use these standard values for consistent error classification:
| Value | Description | Retry Behavior |
|---|---|---|
PermanentError | Non-recoverable failure | Never retry |
RetryableError | Temporary failure | Will retry |
ValidationError | Input validation failed | No retry |
TimeoutError | Operation timed out | May retry |
UnexpectedError | Unexpected handler error | May retry |
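One way a handler might map caught exceptions onto these values; the helper itself is illustrative, and only the error_type strings and retry flags come from the table above:

```python
def to_failure(exc: Exception) -> dict:
    """Illustrative mapping from Python exceptions to standard error_type values."""
    if isinstance(exc, ValueError):
        return {"error_type": "ValidationError", "retryable": False}
    if isinstance(exc, TimeoutError):  # checked before OSError: TimeoutError is a subclass
        return {"error_type": "TimeoutError", "retryable": True}
    if isinstance(exc, (ConnectionError, OSError)):
        return {"error_type": "RetryableError", "retryable": True}
    return {"error_type": "UnexpectedError", "retryable": True}
```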
Registry API
| Operation | Ruby | Python | Rust |
|---|---|---|---|
| Register | register(name, klass) | register(name, klass) | register_handler(name, handler) |
| Check | is_registered(name) | is_registered(name) | is_registered(name) |
| Resolve | resolve(name) | resolve(name) | get_handler(name) |
| List | list_handlers | list_handlers() | list_handlers() |
Note: Ruby also provides original method names (register_handler, handler_available?, resolve_handler, registered_handlers) as the primary API with the above as cross-language aliases.
Resolver Chain API
Handler resolution uses a chain-of-responsibility pattern to convert callable addresses into executable handlers.
StepHandlerResolver Interface
| Method | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| Get Name | name | resolver_name() | resolverName() | resolver_name(&self) |
| Get Priority | priority | priority() | priority() | priority(&self) |
| Can Resolve? | can_resolve?(definition, config) | can_resolve(definition) | canResolve(definition) | can_resolve(&self, definition) |
| Resolve | resolve(definition, config) | resolve(definition, context) | resolve(definition, context) | resolve(&self, definition, context) |
ResolverChain Operations
| Operation | Ruby | Python | TypeScript | Rust |
|---|---|---|---|---|
| Create | ResolverChain.new | ResolverChain() | new ResolverChain() | ResolverChain::new() |
| Register | register(resolver) | register(resolver) | register(resolver) | register(resolver) |
| Resolve | resolve(definition, context) | resolve(definition, context) | resolve(definition, context) | resolve(definition, context) |
| Can Resolve? | can_resolve?(definition) | can_resolve(definition) | canResolve(definition) | can_resolve(definition) |
| List | resolvers | resolvers | resolvers | resolvers() |
Built-in Resolvers
| Resolver | Priority | Function | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|---|---|
| ExplicitMappingResolver | 10 | Hash lookup of registered handlers | ✅ | ✅ | ✅ | ✅ |
| ClassConstantResolver | 100 | Runtime class lookup (Ruby) | ❌ | ✅ | - | - |
| ClassLookupResolver | 100 | Runtime class lookup (Python/TS) | ❌ | - | ✅ | ✅ |
Note: Class lookup resolvers are not available in Rust due to lack of runtime reflection. Rust handlers must use ExplicitMappingResolver. Ruby uses ClassConstantResolver (Ruby terminology); Python and TypeScript use ClassLookupResolver (same functionality, language-appropriate naming).
HandlerDefinition Fields
| Field | Type | Description | Required |
|---|---|---|---|
callable | String | Handler address (name or class path) | Yes |
method | String | Entry point method (default: "call") | No |
resolver | String | Resolution hint to bypass chain | No |
initialization | Dict/Hash | Handler configuration | No |
Method Dispatch
Multi-method handlers expose multiple entry points through the method field:
| Language | Default Method | Dynamic Dispatch |
|---|---|---|
| Ruby | call | handler.public_send(method, context) |
| Python | call | getattr(handler, method)(context) |
| TypeScript | call | handler[method](context) |
| Rust | call | handler.invoke_method(method, step) |
Creating Multi-Method Handlers:
| Language | Signature |
|---|---|
| Ruby | Define additional methods alongside call |
| Python | Define additional methods alongside call |
| TypeScript | Define additional async methods alongside call |
| Rust | Implement invoke_method to dispatch to internal methods |
See Handler Resolution Guide for complete documentation.
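The Python row, for example, reduces to a getattr-based dispatcher; the handler class and `dispatch` helper below are illustrative:

```python
class OrderHandler:
    """Illustrative multi-method handler: `refund` sits alongside `call`."""
    def call(self, context):
        return {"action": "default"}

    def refund(self, context):
        return {"action": "refund", "order": context["order_id"]}

def dispatch(handler, definition: dict, context):
    method = definition.get("method", "call")  # `call` is the default entry point
    return getattr(handler, method)(context)   # dynamic dispatch per the table
```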
Specialized Handlers
API Handler
| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| GET | get(path, params: {}, headers: {}) | self.get(path, params={}, headers={}) | this.get(path, params?, headers?) |
| POST | post(path, data: {}, headers: {}) | self.post(path, data={}, headers={}) | this.post(path, data?, headers?) |
| PUT | put(path, data: {}, headers: {}) | self.put(path, data={}, headers={}) | this.put(path, data?, headers?) |
| DELETE | delete(path, params: {}, headers: {}) | self.delete(path, params={}, headers={}) | this.delete(path, params?, headers?) |
Decision Handler
| Language | Simple API | Result Fields |
|---|---|---|
| Ruby | decision_success(steps:, routing_context:) | decision_point_outcome: { type, step_names } |
| Python | decision_success(steps, routing_context) | decision_point_outcome: { type, step_names } |
| TypeScript | decisionSuccess(steps, routingContext?) | decision_point_outcome: { type, step_names } |
| Rust | decision_success(step_uuid, step_names, ...) | Pattern-based |
Decision Helper Methods (Cross-Language):
- decision_success(steps, routing_context) - Create dynamic steps
- skip_branches(reason, routing_context) - Skip all conditional branches
- decision_failure(message, error_type) - Decision could not be made
Batchable Handler
| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Get Context | get_batch_context(context) | get_batch_context(context) | getBatchContext(context) |
| Complete Batch | batch_worker_success(items_processed:, items_succeeded:, ...) | batch_worker_success(items_processed=, items_succeeded=, ...) | batchWorkerSuccess(outcome, metadata?) |
| Handle No-Op | handle_no_op_worker(batch_ctx) | handle_no_op_worker(batch_ctx) | handleNoOpWorker(batchCtx) |
Standard Batch Result Fields:
- processed_count / items_processed
- items_succeeded / items_failed
- start_cursor, end_cursor, batch_size, last_cursor
Cursor Indexing:
- All languages use 0-indexed cursors (start at 0, not 1)
- Ruby was updated from 1-indexed to 0-indexed for consistency
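With 0-indexed cursors, the first batch covers items [0, batch_size) and each subsequent batch starts where the previous one ended; a small sketch (the helper is illustrative):

```python
def batch_bounds(cursor: int, batch_size: int, total: int):
    """0-indexed, half-open range covered by one batch."""
    return cursor, min(cursor + batch_size, total)

items = list(range(7))
batches, cursor = [], 0            # cursors start at 0, not 1
while cursor < len(items):
    start, end = batch_bounds(cursor, 3, len(items))
    batches.append(items[start:end])
    cursor = end                   # next batch starts at the previous end
```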
Checkpoint Yielding
Checkpoint yielding enables batch workers to persist progress and yield control for re-dispatch.
| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Checkpoint | checkpoint_yield(cursor:, items_processed:, accumulated_results:) | checkpoint_yield(cursor, items_processed, accumulated_results) | checkpointYield({ cursor, itemsProcessed, accumulatedResults }) |
BatchWorkerContext Checkpoint Accessors:
| Accessor | Ruby | Python | TypeScript |
|---|---|---|---|
| Cursor | checkpoint_cursor | checkpoint_cursor | checkpointCursor |
| Accumulated Results | accumulated_results | accumulated_results | accumulatedResults |
| Has Checkpoint? | has_checkpoint? | has_checkpoint() | hasCheckpoint() |
| Items Processed | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed |
FFI Contract:
| Function | Description |
|---|---|
checkpoint_yield_step_event(event_id, data) | Persist checkpoint and re-dispatch step |
Key Invariants:
- Progress is atomically saved before re-dispatch
- Step remains InProgress during the checkpoint yield cycle
- Only Success/Failure trigger state transitions
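The resume pattern can be sketched as follows; the context dict is a stand-in for BatchWorkerContext and the loop plays the role of re-dispatch:

```python
def run_batch(ctx: dict, items: list, chunk: int = 2) -> dict:
    """One worker invocation: resume from checkpoint, then yield or finish."""
    cursor = ctx["checkpoint_cursor"] if ctx.get("has_checkpoint") else 0
    accumulated = list(ctx.get("accumulated_results", []))
    end = min(cursor + chunk, len(items))
    accumulated.extend(items[cursor:end])          # process this slice
    if end < len(items):                           # more work: checkpoint and yield
        return {"has_checkpoint": True, "checkpoint_cursor": end,
                "accumulated_results": accumulated}
    return {"done": True, "accumulated_results": accumulated}

ctx, items = {}, ["a", "b", "c", "d", "e"]
while True:                                        # stands in for re-dispatch
    ctx = run_batch(ctx, items)
    if ctx.get("done"):
        break
```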
See Batch Processing Guide - Checkpoint Yielding for full documentation.
Domain Events
Publisher Contract
| Language | Base Class | Key Method |
|---|---|---|
| Ruby | TaskerCore::DomainEvents::BasePublisher | publish(ctx) |
| Python | BasePublisher | publish(ctx) |
| TypeScript | BasePublisher | publish(ctx) |
| Rust | StepEventPublisher trait | publish(ctx) |
Publisher Lifecycle Hooks
All languages support publisher lifecycle hooks for instrumentation:
| Hook | Ruby | Python | TypeScript | Description |
|---|---|---|---|---|
| Before Publish | before_publish(ctx) | before_publish(ctx) | beforePublish(ctx) | Called before publishing |
| After Publish | after_publish(ctx, result) | after_publish(ctx, result) | afterPublish(ctx, result) | Called after successful publish |
| On Error | on_publish_error(ctx, error) | on_publish_error(ctx, error) | onPublishError(ctx, error) | Called on publish failure |
| Metadata | additional_metadata(ctx) | additional_metadata(ctx) | additionalMetadata(ctx) | Inject custom metadata |
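A sketch of the hook ordering; the publisher below simply records which hooks fire and is not the real BasePublisher:

```python
class RecordingPublisher:
    """Illustrative publisher that records hook invocations in order."""
    def __init__(self):
        self.calls = []

    def before_publish(self, ctx):
        self.calls.append("before_publish")

    def after_publish(self, ctx, result):
        self.calls.append("after_publish")

    def on_publish_error(self, ctx, error):
        self.calls.append("on_publish_error")

    def publish(self, ctx):
        self.before_publish(ctx)
        try:
            result = {"published": ctx["step_name"]}  # the actual send would go here
        except Exception as exc:
            self.on_publish_error(ctx, exc)
            raise
        self.after_publish(ctx, result)
        return result

pub = RecordingPublisher()
result = pub.publish({"step_name": "validate_cart"})
```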
StepEventContext Fields
| Field | Description |
|---|---|
task_uuid | Task identifier |
step_uuid | Step identifier |
step_name | Handler/step name |
namespace | Task namespace |
correlation_id | Tracing correlation ID |
result | Step execution result |
metadata | Additional metadata |
Subscriber Contract
| Language | Base Class | Key Methods |
|---|---|---|
| Ruby | TaskerCore::DomainEvents::BaseSubscriber | subscribes_to, handle(event) |
| Python | BaseSubscriber | subscribes_to(), handle(event) |
| TypeScript | BaseSubscriber | subscribesTo(), handle(event) |
| Rust | EventHandler closures | N/A |
Subscriber Lifecycle Hooks
All languages support subscriber lifecycle hooks:
| Hook | Ruby | Python | TypeScript | Description |
|---|---|---|---|---|
| Before Handle | before_handle(event) | before_handle(event) | beforeHandle(event) | Called before handling |
| After Handle | after_handle(event, result) | after_handle(event, result) | afterHandle(event, result) | Called after handling |
| On Error | on_handle_error(event, error) | on_handle_error(event, error) | onHandleError(event, error) | Called on handler failure |
Registries
| Language | Publisher Registry | Subscriber Registry |
|---|---|---|
| Ruby | PublisherRegistry.instance | SubscriberRegistry.instance |
| Python | PublisherRegistry.instance() | SubscriberRegistry.instance() |
| TypeScript | PublisherRegistry.getInstance() | SubscriberRegistry.getInstance() |
Migration Summary
Ruby
| Before | After |
|---|---|
def call(task, sequence, step) | def call(context) |
class Handler < API | class Handler < Base; include Mixins::API |
task.context['field'] | context.get_task_field('field') |
sequence.get_results('step') | context.get_dependency_result('step') |
| 1-indexed cursors | 0-indexed cursors |
Python
| Before | After |
|---|---|
def handle(self, task, sequence, step) | def call(self, context) |
class Handler(APIHandler) | class Handler(StepHandler, APIMixin) |
| N/A | self.success(result, metadata) |
| N/A | Publisher/Subscriber lifecycle hooks |
TypeScript
| Before | After |
|---|---|
class Handler extends APIHandler | class Handler extends StepHandler implements APICapable |
| No domain events | Complete domain events module |
| N/A | Publisher/Subscriber lifecycle hooks |
| N/A | applyAPI(this), applyDecision(this) mixins |
Rust
| Before | After |
|---|---|
| (already aligned) | (already aligned) |
| N/A | APICapable, DecisionCapable, BatchableCapable traits |
See Also
- Example Handlers - Side-by-side code examples
- Patterns and Practices - Common patterns
- Ruby Worker - Ruby implementation details
- Python Worker - Python implementation details
- TypeScript Worker - TypeScript implementation details
- Rust Worker - Rust implementation details
- Composition Over Inheritance - Why mixins over inheritance
- FFI Boundary Types - Cross-language type alignment
- Handler Resolution Guide - Custom resolver strategies
Client Wrapper API
Last Updated: 2026-02-13 Audience: Developers Status: Active Related Docs: API Convergence Matrix | Workers Overview
Each FFI worker package (Ruby, Python, TypeScript) includes a high-level client wrapper for the orchestration API. The wrappers provide keyword-argument methods with sensible defaults and return typed response objects, removing the need to construct raw request hashes or parse untyped responses.
Overview
| Ruby | Python | TypeScript | |
|---|---|---|---|
| Module | TaskerCore::Client | TaskerClient class | TaskerClient class |
| Import | require 'tasker_core' | from tasker_core import TaskerClient | import { TaskerClient } from '@tasker-systems/tasker' |
| Response Types | Dry::Struct (e.g., ClientTypes::TaskResponse) | Frozen dataclasses (e.g., TaskResponse) | Generated DTO types (e.g., ClientTaskResponse) |
| Error Handling | Falls back to raw Hash on schema mismatch | Falls back to raw dict on missing fields | Throws TaskerClientError on failure |
Ruby
Usage
require 'tasker_core'
# All methods are module_function — call directly on TaskerCore::Client
response = TaskerCore::Client.create_task(
  name: 'order_processing',
  namespace: 'ecommerce',
  context: { order_id: 123, items: [...] },
  initiator: 'my-service',      # default: 'tasker-core-ruby'
  source_system: 'my-api',      # default: 'tasker-core'
  reason: 'New order received'  # default: 'Task requested'
)
response.task_uuid # => "550e8400-..."
response.status # => "pending"
Methods
| Method | Signature | Returns |
|---|---|---|
create_task | (name:, namespace: 'default', context: {}, version: '1.0.0', initiator:, source_system:, reason:, **options) | ClientTypes::TaskResponse |
get_task | (task_uuid) | ClientTypes::TaskResponse |
list_tasks | (limit: 50, offset: 0, namespace: nil, status: nil) | ClientTypes::TaskListResponse |
cancel_task | (task_uuid) | Hash |
list_task_steps | (task_uuid) | Array<ClientTypes::StepResponse> |
get_step | (task_uuid, step_uuid) | ClientTypes::StepResponse |
get_step_audit_history | (task_uuid, step_uuid) | Array<ClientTypes::StepAuditResponse> |
health_check | () | ClientTypes::HealthResponse |
Response Types
All response types are Dry::Struct classes defined in TaskerCore::Types::ClientTypes. Access fields as method calls (e.g., response.task_uuid, list.pagination.total_count). If the API returns fields that don’t match the schema, the raw Hash is returned instead for forward-compatibility.
ActionController::Parameters
create_task automatically converts Rails ActionController::Parameters to plain hashes via deep_to_hash, so you can pass params[:context] directly from controllers.
Python
Usage
from tasker_core import TaskerClient
# Create a client with custom defaults
client = TaskerClient(
    initiator="my-service",      # default: "tasker-core-python"
    source_system="my-api",      # default: "tasker-core"
)
response = client.create_task(
    "order_processing",
    namespace="ecommerce",
    context={"order_id": 123, "items": [...]},
    reason="New order received",
)
response.task_uuid # => "550e8400-..."
response.status # => "pending"
Methods
| Method | Signature | Returns |
|---|---|---|
create_task | (name, *, namespace="default", context=None, version="1.0.0", reason="Task requested", **kwargs) | TaskResponse |
get_task | (task_uuid: str) | TaskResponse |
list_tasks | (*, limit=50, offset=0, namespace=None, status=None) | TaskListResponse |
cancel_task | (task_uuid: str) | dict[str, Any] |
list_task_steps | (task_uuid: str) | list[StepResponse] |
get_step | (task_uuid: str, step_uuid: str) | StepResponse |
get_step_audit_history | (task_uuid: str, step_uuid: str) | list[StepAuditResponse] |
health_check | () | HealthResponse |
Response Types
All response types are frozen dataclasses with from_dict(data) classmethods:
- TaskResponse: task_uuid, name, namespace, status, context, steps, etc.
- TaskListResponse: tasks: list[TaskResponse], pagination: PaginationInfo
- StepResponse: step_uuid, task_uuid, name, current_state, attempts, etc.
- StepAuditResponse: audit_uuid, step_name, to_state, success, etc.
- HealthResponse: healthy, status, timestamp
- PaginationInfo: page, per_page, total_count, total_pages, has_next, has_previous
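For example, PaginationInfo.has_next supports a simple paging loop; the stub client below mimics the list_tasks(limit=, offset=) signature and is not the real TaskerClient:

```python
class StubClient:
    """Illustrative stand-in returning dicts shaped like the response types."""
    def list_tasks(self, *, limit=50, offset=0, namespace=None, status=None):
        all_tasks = [f"task-{i}" for i in range(3)]
        return {
            "tasks": all_tasks[offset:offset + limit],
            "pagination": {"has_next": offset + limit < len(all_tasks)},
        }

client = StubClient()
tasks, offset, limit = [], 0, 2
while True:
    resp = client.list_tasks(limit=limit, offset=offset)
    tasks.extend(resp["tasks"])
    if not resp["pagination"]["has_next"]:
        break
    offset += limit
```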
Exports
All types are re-exported from tasker_core with Client prefix to avoid collisions:
from tasker_core import (
TaskerClient,
ClientTaskResponse,
ClientTaskListResponse,
ClientStepResponse,
ClientStepAuditResponse,
ClientHealthResponse,
ClientPaginationInfo,
)
TypeScript
Usage
import { FfiLayer, TaskerClient } from '@tasker-systems/tasker';
const ffiLayer = new FfiLayer();
await ffiLayer.load();
const client = new TaskerClient(ffiLayer);
const response = client.createTask({
  name: 'order_processing',
  namespace: 'ecommerce',        // default: 'default'
  context: { orderId: 123 },     // default: {}
  initiator: 'my-service',       // default: 'tasker-core-typescript'
  sourceSystem: 'my-api',        // default: 'tasker-core'
  reason: 'New order received',  // default: 'Task requested'
});
response.task_uuid; // "550e8400-..."
response.status; // "pending"
Methods
| Method | Signature | Returns |
|---|---|---|
createTask | (options: CreateTaskOptions) | ClientTaskResponse |
getTask | (taskUuid: string) | ClientTaskResponse |
listTasks | (options?: ListTasksOptions) | ClientTaskListResponse |
cancelTask | (taskUuid: string) | void |
listTaskSteps | (taskUuid: string) | ClientStepResponse[] |
getStep | (taskUuid: string, stepUuid: string) | ClientStepResponse |
getStepAuditHistory | (taskUuid: string, stepUuid: string) | ClientStepAuditResponse[] |
healthCheck | () | ClientHealthResponse |
Interfaces
interface CreateTaskOptions {
  name: string;
  namespace?: string;                // default: 'default'
  context?: Record<string, unknown>; // default: {}
  version?: string;                  // default: '1.0.0'
  initiator?: string;                // default: 'tasker-core-typescript'
  sourceSystem?: string;             // default: 'tasker-core'
  reason?: string;                   // default: 'Task requested'
  tags?: string[];
  correlationId?: string;            // default: crypto.randomUUID()
  parentCorrelationId?: string;
  idempotencyKey?: string;
  priority?: number;
}
interface ListTasksOptions {
  limit?: number;                    // default: 50
  offset?: number;                   // default: 0
  namespace?: string;
  status?: string;
}
Error Handling
All methods unwrap the raw FFI ClientResult envelope. On failure, a TaskerClientError is thrown:
import { TaskerClientError } from '@tasker-systems/tasker';
try {
  const task = client.getTask('nonexistent-uuid');
} catch (error) {
  if (error instanceof TaskerClientError) {
    console.error(error.message);
    console.error(error.recoverable); // boolean — whether retry is appropriate
  }
}
Exports
export {
  TaskerClient,
  TaskerClientError,
  type CreateTaskOptions,
  type ListTasksOptions,
} from '@tasker-systems/tasker';
Raw FFI (Advanced)
The client wrappers call through to raw FFI functions. For advanced use cases or when the wrapper doesn’t expose a needed field, the raw FFI is still available:
| Operation | Ruby | Python | TypeScript |
|---|---|---|---|
| Create task | TaskerCore::FFI.client_create_task(hash) | from tasker_core._tasker_core import client_create_task | runtime.clientCreateTask(json) |
| Get task | TaskerCore::FFI.client_get_task(uuid) | client_get_task(uuid) | runtime.clientGetTask(uuid) |
| List tasks | TaskerCore::FFI.client_list_tasks(limit, offset, ns, status) | client_list_tasks(limit, offset, ns, status) | runtime.clientListTasks(json) |
| Cancel task | TaskerCore::FFI.client_cancel_task(uuid) | client_cancel_task(uuid) | runtime.clientCancelTask(uuid) |
| List steps | TaskerCore::FFI.client_list_task_steps(uuid) | client_list_task_steps(uuid) | runtime.clientListTaskSteps(uuid) |
| Get step | TaskerCore::FFI.client_get_step(task, step) | client_get_step(task, step) | runtime.clientGetStep(task, step) |
| Audit history | TaskerCore::FFI.client_get_step_audit_history(task, step) | client_get_step_audit_history(task, step) | runtime.clientGetStepAuditHistory(task, step) |
| Health check | TaskerCore::FFI.client_health_check | client_health_check() | runtime.clientHealthCheck() |
Raw FFI returns plain Hash/dict/ClientResult — no type wrapping. TypeScript raw FFI returns a ClientResult envelope ({ success, data, error, recoverable }).
Example Handlers - Cross-Language Reference
Last Updated: 2026-02-21 Status: Active
DSL syntax: All examples below use the functional DSL pattern (recommended for new projects). For the class-based alternative, see Class-Based Handlers.
Overview
This document provides side-by-side handler examples across Python, Ruby, TypeScript, and Rust. These examples demonstrate the aligned patterns that enable consistent handler authoring across all worker implementations.
Simple Step Handler
Python
from tasker_core.step_handler.functional import step_handler, inputs
from app.services.types import EcommerceOrderInput
from app.services import ecommerce as svc
@step_handler("ecommerce_validate_cart")
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context):
    return svc.validate_cart_items(
        cart_items=inputs.cart_items,
        customer_email=inputs.customer_email,
    )
Ruby
module Ecommerce
  module StepHandlers
    extend TaskerCore::StepHandler::Functional

    ValidateCartHandler = step_handler(
      'Ecommerce::StepHandlers::ValidateCartHandler',
      inputs: Types::Ecommerce::OrderInput
    ) do |inputs:, context:|
      Ecommerce::Service.validate_cart_items(
        cart_items: inputs.cart_items,
        customer_email: inputs.customer_email,
      )
    end
  end
end
TypeScript
import { defineHandler } from '@tasker-systems/tasker';
import type { CartItem } from '../services/types';
import * as svc from '../services/ecommerce';
export const ValidateCartHandler = defineHandler(
'Ecommerce.StepHandlers.ValidateCartHandler',
{ inputs: { cartItems: 'cart_items' } },
async ({ cartItems }) =>
svc.validateCartItems(cartItems as CartItem[] | undefined),
);
Rust
use serde_json::{json, Value};

pub fn validate_cart(context: &Value) -> Result<Value, String> {
    let cart_items = context.get("cart_items")
        .and_then(|v| v.as_array())
        .ok_or("Missing cart_items in context")?;
    if cart_items.is_empty() {
        return Err("Cart cannot be empty".to_string());
    }
    // Business logic: validate items, calculate pricing...
    Ok(json!({
        "validated_items": cart_items,
        "subtotal": 59.97,
        "tax": 4.80,
        "total": 64.77
    }))
}
Handler with Dependencies
Python
from tasker_core.step_handler.functional import step_handler, depends_on, inputs
from app.services.types import (
EcommerceOrderInput,
EcommerceValidateCartResult,
EcommerceProcessPaymentResult,
EcommerceUpdateInventoryResult,
)
from app.services import ecommerce as svc
@step_handler("ecommerce_create_order")
@depends_on(
    cart_result=("validate_cart", EcommerceValidateCartResult),
    payment_result=("process_payment", EcommerceProcessPaymentResult),
    inventory_result=("update_inventory", EcommerceUpdateInventoryResult),
)
@inputs(EcommerceOrderInput)
def create_order(
    cart_result: EcommerceValidateCartResult,
    payment_result: EcommerceProcessPaymentResult,
    inventory_result: EcommerceUpdateInventoryResult,
    inputs: EcommerceOrderInput,
    context,
):
    return svc.create_order(
        cart_result=cart_result,
        payment_result=payment_result,
        inventory_result=inventory_result,
        customer_email=inputs.customer_email,
    )
Ruby
module Microservices
module StepHandlers
extend TaskerCore::StepHandler::Functional
SendWelcomeSequenceHandler = step_handler(
'Microservices::StepHandlers::SendWelcomeSequenceHandler',
depends_on: {
account_data: ['create_user_account', Types::Microservices::CreateUserResult],
billing_data: ['setup_billing_profile', Types::Microservices::SetupBillingResult],
preferences_data: ['initialize_preferences', Types::Microservices::InitPreferencesResult]
}
) do |account_data:, billing_data:, preferences_data:, context:|
Microservices::Service.send_welcome_sequence(
account_data: account_data,
billing_data: billing_data,
preferences_data: preferences_data,
)
end
end
end
TypeScript
import { defineHandler } from '@tasker-systems/tasker';
import * as svc from '../services/ecommerce';
export const CreateOrderHandler = defineHandler(
'Ecommerce.StepHandlers.CreateOrderHandler',
{
depends: {
cartResult: 'validate_cart',
paymentResult: 'process_payment',
inventoryResult: 'update_inventory',
},
inputs: { customerEmail: 'customer_email' },
},
async ({ cartResult, paymentResult, inventoryResult, customerEmail }) =>
svc.createOrder(
cartResult as Record<string, unknown>,
paymentResult as Record<string, unknown>,
inventoryResult as Record<string, unknown>,
customerEmail as string | undefined,
),
);
Rust
use serde_json::{json, Value};
use std::collections::HashMap;

pub fn create_order(
    context: &Value,
    dependency_results: &HashMap<String, Value>,
) -> Result<Value, String> {
    let cart_result = dependency_results
        .get("validate_cart")
        .ok_or("Missing validate_cart dependency")?;
    let payment_result = dependency_results
        .get("process_payment")
        .ok_or("Missing process_payment dependency")?;
    let inventory_result = dependency_results
        .get("update_inventory")
        .ok_or("Missing update_inventory dependency")?;
    let customer_email = context
        .get("customer_email")
        .and_then(|v| v.as_str())
        .unwrap_or("unknown@example.com");
    Ok(json!({
        "order_id": format!("ord_{}", uuid::Uuid::new_v4()),
        "customer_email": customer_email,
        "total": cart_result.get("total").and_then(|v| v.as_f64()).unwrap_or(0.0),
        "payment_id": payment_result.get("payment_id"),
        "inventory_log_id": inventory_result.get("inventory_log_id"),
        "status": "confirmed"
    }))
}
Decision Handler
Decision handlers route workflows dynamically by activating different step sets based on business logic. The DSL returns a Decision object — Decision.route(steps) to activate branches or Decision.skip(reason) to skip. For full details, see Conditional Workflows.
Python
from tasker_core.step_handler.functional import decision_handler, inputs, Decision
@decision_handler("routing_decision")
@inputs('amount')
def routing_decision(amount, context):
amount = float(amount or 0)
if amount < 1000:
return Decision.route(
['auto_approve'],
route_type='automatic', amount=amount,
)
elif amount < 5000:
return Decision.route(
['manager_approval'],
route_type='manager', amount=amount,
)
else:
return Decision.route(
['manager_approval', 'finance_review'],
route_type='dual_approval', amount=amount,
)
Ruby
module Orders
module StepHandlers
extend TaskerCore::StepHandler::Functional
RoutingDecisionHandler = decision_handler(
'Orders::StepHandlers::RoutingDecisionHandler',
inputs: [:amount]
) do |amount:, context:|
amount = amount.to_f
if amount < 1000
Decision.route(['auto_approve'], route_type: 'automatic', amount: amount)
elsif amount < 5000
Decision.route(['manager_approval'], route_type: 'manager', amount: amount)
else
Decision.route(
['manager_approval', 'finance_review'],
route_type: 'dual_approval', amount: amount
)
end
end
end
end
TypeScript
import { defineDecisionHandler, Decision } from '@tasker-systems/tasker';
export const RoutingDecisionHandler = defineDecisionHandler(
'Orders.StepHandlers.RoutingDecisionHandler',
{ inputs: { amount: 'amount' } },
async ({ amount }) => {
const amt = (amount as number) || 0;
if (amt < 1000) {
return Decision.route(['auto_approve'], { routeType: 'automatic', amount: amt });
} else if (amt < 5000) {
return Decision.route(['manager_approval'], { routeType: 'manager', amount: amt });
} else {
return Decision.route(
['manager_approval', 'finance_review'],
{ routeType: 'dual_approval', amount: amt },
);
}
},
);
Rust
use tasker_shared::messaging::DecisionPointOutcome;
use serde_json::{json, Value};

pub fn routing_decision(context: &Value) -> Result<Value, String> {
    let amount = context.get("amount")
        .and_then(|v| v.as_f64())
        .ok_or("Amount is required for routing decision")?;
    let (route_type, steps) = if amount < 1000.0 {
        ("automatic", vec!["auto_approve"])
    } else if amount < 5000.0 {
        ("manager", vec!["manager_approval"])
    } else {
        ("dual_approval", vec!["manager_approval", "finance_review"])
    };
    let outcome = DecisionPointOutcome::create_steps(
        steps.iter().map(|s| s.to_string()).collect()
    );
    Ok(json!({
        "route_type": route_type,
        "amount": amount,
        "decision_point_outcome": outcome.to_value()
    }))
}
Batch Processing Handler
Batch handlers use the Analyzer/Worker pattern. The analyzer returns a BatchConfig specifying total items and batch size; the orchestrator automatically generates cursor ranges and spawns workers. For full details, see Batch Processing.
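The cursor-range generation can be sketched in plain Python (illustrative only; the real generation happens inside the Rust orchestrator):

```python
def cursor_ranges(total_items: int, batch_size: int):
    """Derive the cursor ranges a batch config implies, one per worker."""
    ranges = []
    start = 0
    while start < total_items:
        size = min(batch_size, total_items - start)  # final batch may be partial
        ranges.append({"start_cursor": start, "batch_size": size})
        start += size
    return ranges

# A 250-row CSV with batch_size=100 yields three workers:
# start cursors 0, 100, 200 with sizes 100, 100, 50.
```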
Python (Analyzer + Worker)
from tasker_core.step_handler.functional import (
batch_analyzer, batch_worker, inputs, depends_on, BatchConfig,
)
@batch_analyzer("analyze_csv", worker_template="process_csv_batch")
@inputs('csv_file_path')
def analyze_csv(csv_file_path, context):
total_rows = count_csv_rows(csv_file_path)
return BatchConfig(total_items=total_rows, batch_size=100)
@batch_worker("process_csv_batch")
@depends_on(analyzer_result="analyze_csv")
def process_csv_batch(analyzer_result, batch_context, context):
records = read_csv_range(
analyzer_result['csv_file_path'],
batch_context.start_cursor,
batch_context.batch_size,
)
processed = [transform_row(row) for row in records]
return {"items_processed": len(processed), "items_succeeded": len(processed)}
Ruby (Analyzer + Worker)
module Csv
module StepHandlers
extend TaskerCore::StepHandler::Functional
AnalyzeCsvHandler = batch_analyzer(
'Csv::StepHandlers::AnalyzeCsvHandler',
worker_template: 'process_csv_batch',
inputs: [:csv_file_path]
) do |csv_file_path:, context:|
total_rows = count_csv_rows(csv_file_path)
BatchConfig.new(total_items: total_rows, batch_size: 100)
end
ProcessCsvBatchHandler = batch_worker(
'Csv::StepHandlers::ProcessCsvBatchHandler',
depends_on: { analyzer_result: ['analyze_csv'] }
) do |analyzer_result:, batch_context:, context:|
records = read_csv_range(
analyzer_result['csv_file_path'],
batch_context.start_cursor,
batch_context.batch_size
)
processed = records.map { |row| transform_row(row) }
{ items_processed: processed.size, items_succeeded: processed.size }
end
end
end
TypeScript (Analyzer + Worker)
import { defineBatchAnalyzer, defineBatchWorker } from '@tasker-systems/tasker';
export const AnalyzeCsvHandler = defineBatchAnalyzer(
'Csv.StepHandlers.AnalyzeCsvHandler',
{ workerTemplate: 'process_csv_batch', inputs: { csvFilePath: 'csv_file_path' } },
async ({ csvFilePath }) => ({
totalItems: await countCsvRows(csvFilePath as string),
batchSize: 100,
}),
);
export const ProcessCsvBatchHandler = defineBatchWorker(
'Csv.StepHandlers.ProcessCsvBatchHandler',
{ depends: { analyzerResult: 'analyze_csv' } },
async ({ analyzerResult, batchContext }) => {
const records = await readCsvRange(
(analyzerResult as Record<string, unknown>).csvFilePath as string,
batchContext?.startCursor ?? 0,
batchContext?.batchSize ?? 100,
);
return { itemsProcessed: records.length, itemsSucceeded: records.length };
},
);
Rust
use serde_json::{json, Value};

// Batch analyzers in Rust return batch configuration via the result
pub fn analyze_csv(context: &Value) -> Result<Value, String> {
    let file_path = context.get("csv_file_path")
        .and_then(|v| v.as_str())
        .ok_or("Missing csv_file_path")?;
    let total_rows = count_csv_rows(file_path)?;
    Ok(json!({
        "total_items": total_rows,
        "batch_size": 100,
        "csv_file_path": file_path
    }))
}
API Handler
API handlers add HTTP client methods with automatic error classification (429 -> retryable, 4xx -> permanent, 5xx -> retryable). The DSL provides an api object with get, post, put, delete methods and result helpers like api_success / api_failure.
Python
from tasker_core.step_handler.functional import api_handler, inputs
@api_handler("fetch_user", base_url="https://api.example.com")
@inputs('user_id')
def fetch_user(user_id, api, context):
response = api.get(f"/users/{user_id}")
return api.api_success(result={
"user_id": user_id,
"email": response["email"],
"name": response["name"],
})
Ruby
module Users
module StepHandlers
extend TaskerCore::StepHandler::Functional
FetchUserHandler = api_handler(
'Users::StepHandlers::FetchUserHandler',
base_url: 'https://api.example.com',
inputs: [:user_id]
) do |user_id:, api:, context:|
response = api.get("/users/#{user_id}")
api.api_success(result: {
user_id: user_id,
email: response.body['email'],
name: response.body['name']
})
end
end
end
TypeScript
import { defineApiHandler } from '@tasker-systems/tasker';
export const FetchUserHandler = defineApiHandler(
'Users.StepHandlers.FetchUserHandler',
{
baseUrl: 'https://api.example.com',
inputs: { userId: 'user_id' },
},
async ({ userId, api }) => {
const response = await api.get(`/users/${userId}`);
return api.apiSuccess({
userId,
email: response.email,
name: response.name,
});
},
);
Error Handling Patterns
Python (DSL — exceptions)
from tasker_core import PermanentError, RetryableError
@step_handler("validate_order")
@inputs(OrderInput)
def validate_order(inputs: OrderInput, context):
if not inputs.amount or inputs.amount <= 0:
raise PermanentError(
"Order amount must be positive",
error_code="INVALID_AMOUNT",
)
if external_service_unavailable():
raise RetryableError("External service temporarily unavailable")
return {"valid": True, "amount": inputs.amount}
Ruby (DSL — exceptions)
ValidateOrderHandler = step_handler(
'Orders::StepHandlers::ValidateOrderHandler',
inputs: Types::Orders::OrderInput
) do |inputs:, context:|
raise TaskerCore::Errors::PermanentError.new(
'Order amount must be positive',
error_code: 'INVALID_AMOUNT'
) if inputs.amount.to_f <= 0
raise TaskerCore::Errors::RetryableError.new(
'External service temporarily unavailable'
) if external_service_unavailable?
{ valid: true, amount: inputs.amount }
end
TypeScript (DSL — exceptions)
import { defineHandler, PermanentError, RetryableError } from '@tasker-systems/tasker';
export const ValidateOrderHandler = defineHandler(
'Orders.StepHandlers.ValidateOrderHandler',
{ inputs: { amount: 'amount' } },
async ({ amount }) => {
if (!amount || (amount as number) <= 0) {
throw new PermanentError('Order amount must be positive', 'INVALID_AMOUNT');
}
return { valid: true, amount };
},
);
Rust (Result type)
use serde_json::{json, Value};

pub fn validate_order(context: &Value) -> Result<Value, String> {
    let amount = context.get("amount")
        .and_then(|v| v.as_f64())
        .unwrap_or(0.0);
    if amount <= 0.0 {
        return Err("Order amount must be positive".to_string());
    }
    Ok(json!({ "valid": true, "amount": amount }))
}
For finer control over retryability in Rust, use StepExecutionResult directly in a StepHandler implementation. See Building with Rust.
See Also
- Class-Based Handlers - Class-based alternative for all languages
- API Convergence Matrix - Quick reference tables
- Patterns and Practices - Common patterns
- Building with Python - Python handler guide
- Building with Ruby - Ruby handler guide
- Building with TypeScript - TypeScript handler guide
- Building with Rust - Rust handler guide
FFI Safety Safeguards
Last Updated: 2026-02-02 · Status: Production Implementation · Applies To: Ruby (Magnus), Python (PyO3), TypeScript (napi-rs) workers
Overview
Tasker’s FFI workers embed the Rust tasker-worker runtime inside language-specific host processes (Ruby, Python, TypeScript/JavaScript). This document describes the safeguards that prevent Rust-side failures from crashing or corrupting the host process, ensuring that infrastructure unavailability, misconfiguration, and unexpected panics are surfaced as language-native errors rather than process faults.
FFI Architecture
Host Process (Ruby / Python / Node.js)
│
▼
FFI Boundary
┌─────────────────────────────────────┐
│ Language Binding Layer │
│ (Magnus / PyO3 / napi-rs) │
│ │
│ ┌─────────────────────────────┐ │
│ │ Bridge Module │ │
│ │ (bootstrap, poll, complete)│ │
│ └────────────┬────────────────┘ │
│ │ │
│ ┌────────────▼────────────────┐ │
│ │ FfiDispatchChannel │ │
│ │ (event dispatch, callbacks)│ │
│ └────────────┬────────────────┘ │
│ │ │
│ ┌────────────▼────────────────┐ │
│ │ WorkerBootstrap │ │
│ │ (runtime, DB, messaging) │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────┘
Panic Safety by Framework
Each FFI framework provides different levels of automatic panic protection:
| Framework | Panic Handling | Mechanism |
|---|---|---|
| Magnus (Ruby) | Automatic | Catches panics at FFI boundary, converts to Ruby RuntimeError |
| PyO3 (Python) | Automatic | Catches panics at #[pyfunction] boundary, converts to PanicException |
| napi-rs (TypeScript) | Automatic | Catches panics at #[napi] boundary, converts to JavaScript Error |
All three FFI frameworks now provide automatic panic safety. napi-rs (used since TAS-290) catches Rust panics at the #[napi] function boundary and converts them to JavaScript Error exceptions, matching the behavior of Magnus and PyO3. No manual catch_unwind wrappers are needed.
Error Handling at FFI Boundaries
Bootstrap Failures
When infrastructure is unavailable during worker startup, errors flow through the normal Result path rather than panicking:
| Failure Scenario | Handling | Host Process Impact |
|---|---|---|
| Database unreachable | TaskerError::DatabaseError returned | Language exception, app can retry |
| Config TOML missing | TaskerError::ConfigurationError returned | Language exception with descriptive message |
| Worker config section absent | TaskerError::ConfigurationError returned | Language exception (was previously a panic) |
| Messaging backend unavailable | TaskerError::ConfigurationError returned | Language exception |
| Tokio runtime creation fails | Logged + language error returned | Language exception |
| Port already in use | TaskerError::WorkerError returned | Language exception |
| Redis/cache unavailable | Graceful degradation to noop cache | No error - worker starts without cache |
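Because bootstrap failures surface as ordinary language exceptions, the host application can wrap startup in a retry loop. A self-contained Python sketch of that pattern (the `DatabaseError` class and the failing `bootstrap_worker` here are simulations, not the tasker_core API):

```python
import time

class DatabaseError(Exception):
    """Stand-in for the language-native error raised when the database is unreachable."""

attempts = {"n": 0}

def bootstrap_worker():
    # Simulated bootstrap: fails twice (database unreachable), then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise DatabaseError("database unreachable")
    return "running"

def bootstrap_with_retry(max_attempts=5, delay_s=0.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return bootstrap_worker()
        except DatabaseError:
            if attempt == max_attempts:
                raise
            time.sleep(delay_s)  # back off before the next attempt

status = bootstrap_with_retry()
```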
Steady-State Operation Failures
Once bootstrapped, the worker handles infrastructure failures gracefully:
| Failure Scenario | Handling | Host Process Impact |
|---|---|---|
| Database goes down during poll | Poll returns None (no events) | No impact - polling continues |
| Completion channel full | Retry loop with timeout, then logged | Step result may be lost after timeout |
| Completion channel closed | Returns false to caller | App code sees completion failure |
| Callback timeout (5s) | Logged, step completion unaffected | Domain events may be delayed |
| Messaging down during callback | Callback times out, logged | Domain events may not publish |
| Lock poisoned | Error returned to caller | Language exception |
| Worker not initialized | Error returned to caller | Language exception |
Lock Acquisition
All three workers validate lock acquisition before proceeding:
// Pattern used in all workers
let handle_guard = WORKER_SYSTEM.lock().map_err(|e| {
    error!("Failed to acquire worker system lock: {}", e);
    // Convert to language-appropriate error
})?;
A poisoned mutex (from a previous panic) produces a language exception rather than propagating the original panic.
EventRouter Availability
Post-bootstrap access to the EventRouter uses fallible error handling rather than .expect():
// Use ok_or_else instead of expect to prevent panic at FFI boundary
let event_router = worker_core.event_router().ok_or_else(|| {
    error!("EventRouter not available from WorkerCore after bootstrap");
    // Return language-appropriate error
})?;
Callback Safety
The FfiDispatchChannel uses a fire-and-forget pattern for post-completion callbacks, preventing the host process from being blocked or deadlocked by Rust-side async operations:
- Completion is sent first - the step result is delivered to the completion channel before any callback fires
- Callback is spawned separately - runs in the Tokio runtime, not the FFI caller’s thread
- Timeout protection - callbacks are bounded by a configurable timeout (default 5s)
- Callback failures are logged - they never affect step completion or the host process
FFI Thread (Ruby/Python/JS) Tokio Runtime
│ │
├──► complete(event_id, result) │
│ ├──► send result to channel │
│ └──► spawn callback ─────────┼──► callback.on_handler_complete()
│ │ (with 5s timeout)
◄──── return true ────────────────│
│ (immediate, non-blocking) │
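The sequence above can be modeled in a few lines of asyncio (a behavioral sketch, not tasker's FFI code; `on_handler_complete` and the in-memory channel are illustrative):

```python
import asyncio

completion_channel = []   # stands in for the real completion channel
callback_log = []

async def on_handler_complete(event_id, result):
    await asyncio.sleep(0.05)   # simulated slow domain-event publish
    callback_log.append(event_id)

async def complete(event_id, result, callback, timeout_s=5.0):
    # 1. Deliver the step result first.
    completion_channel.append((event_id, result))
    # 2. Spawn the callback separately; timeouts/failures are swallowed (logged
    #    in the real system) and never affect step completion.
    async def guarded():
        try:
            await asyncio.wait_for(callback(event_id, result), timeout_s)
        except Exception:
            pass
    asyncio.create_task(guarded())
    return True   # immediate, non-blocking

async def main():
    ok = await complete("evt-1", {"status": "done"}, on_handler_complete)
    # The caller already has its answer while the callback is still running.
    assert ok and not callback_log
    await asyncio.sleep(0.1)   # let the background callback drain

asyncio.run(main())
```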
See docs/development/ffi-callback-safety.md for detailed callback safety guidelines.
Backpressure Protection
Completion Channel
The completion channel uses a try-send retry loop with timeout to prevent indefinite blocking:
- Try-send avoids blocking the FFI thread
- Retry with sleep (10ms intervals) handles transient backpressure
- Timeout (configurable, default 30s) prevents permanent stalls
- Logged when backpressure delays exceed 100ms
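The same loop, sketched with Python's queue.Queue standing in for the Rust mpsc channel (timings shortened for illustration; the real defaults are 10ms intervals and a 30s timeout):

```python
import queue
import time

def try_send_with_timeout(channel, item, timeout_s=30.0, interval_s=0.01):
    """Try-send retry loop: never block the calling thread indefinitely."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            channel.put_nowait(item)   # try-send returns immediately
            return True
        except queue.Full:
            if time.monotonic() >= deadline:
                return False           # timed out; the caller logs the lost result
            time.sleep(interval_s)     # sleep briefly through transient backpressure
```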
Starvation Detection
The FfiDispatchChannel tracks event age and warns when polling falls behind:
- Events older than `starvation_warning_threshold_ms` (default 10s) trigger warnings
- `check_starvation_warnings()` can be called periodically from the host process
- `FfiDispatchMetrics` exposes pending count, oldest event age, and starvation status
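The age-based check amounts to the following (an illustrative sketch; the field names only approximate what FfiDispatchMetrics exposes):

```python
def starvation_metrics(pending, now_s, threshold_s=10.0):
    """Compute pending count, oldest event age, and starvation status."""
    ages = [now_s - e["enqueued_at"] for e in pending]
    oldest = max(ages, default=0.0)
    return {
        "pending_count": len(pending),
        "oldest_event_age_s": oldest,
        "starvation_detected": oldest > threshold_s,  # polling has fallen behind
    }
```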
Infrastructure Dependency Matrix
| Component | Bootstrap | Poll | Complete | Callback |
|---|---|---|---|---|
| Database | Required (error on failure) | Not needed | Not needed | Errors logged |
| Message Bus | Required (error on failure) | Not needed | Not needed | Errors logged |
| Config System | Required (error on failure) | Not needed | Not needed | Not needed |
| Cache (Redis) | Optional (degrades to noop) | Not needed | Not needed | Not needed |
| Tokio Runtime | Required (error on failure) | Used | Used | Used |
Worker Lifecycle Safety
Start (bootstrap_worker)
- Validates configuration, creates runtime, initializes all subsystems
- All failures return language-appropriate errors
- Already-running detection prevents double initialization
Status (get_worker_status)
- Safe when worker is not initialized (returns `running: false`)
- Lock acquisition failure returns error
Stop (stop_worker)
- Safe when worker is not running (returns success message)
- Sends shutdown signal and clears handle
- In-flight operations complete before shutdown
Graceful Shutdown (transition_to_graceful_shutdown)
- Initiates graceful shutdown allowing in-flight work to drain
- Errors during transition are logged and returned
- Requires worker to be running (error otherwise)
Adding a New FFI Worker
When implementing a new language worker:
1. Check framework panic safety: if the framework (like Magnus or PyO3) catches panics automatically, you get protection for free. If using raw C FFI, wrap all `extern "C"` functions with `catch_unwind`.
2. Use the standard bridge pattern: a global `WORKER_SYSTEM` mutex and a `BridgeHandle` struct containing the `WorkerSystemHandle`, the `FfiDispatchChannel`, and the runtime.
3. Handle all lock acquisitions: always use `.map_err()` on `.lock()` calls.
4. Avoid `.expect()` and `.unwrap()` in FFI code: use `ok_or_else()` or `map_err()` to convert failures to language-appropriate errors.
5. Use fire-and-forget callbacks: never block the FFI thread on async operations.
6. Integrate starvation detection: call `check_starvation_warnings()` periodically.
7. Expose metrics: surface `FfiDispatchMetrics` for health monitoring.
Related Documentation
- FFI Callback Safety - Detailed callback patterns and deadlock prevention
- Worker Event Systems - Dispatch and completion channel architecture
- MPSC Channel Guidelines - Channel sizing and configuration
- Worker Patterns & Practices - General worker development patterns
Worker Crates: Common Patterns and Practices
Last Updated: 2026-01-06 · Audience: Developers, Architects · Status: Active · Related Docs: Worker Event Systems | Worker Actors
This document describes the common patterns and practices shared across all four worker implementations (Rust, Ruby, Python, TypeScript). Understanding these patterns helps developers write consistent handlers regardless of the language.
Table of Contents
- Architectural Patterns
- Handler Lifecycle
- Error Handling
- Polling Architecture
- Event Bridge Pattern
- Singleton Pattern
- Observability
- Checkpoint Yielding
Architectural Patterns
Dual-Channel Architecture
All workers implement a dual-channel architecture for non-blocking step execution:
┌─────────────────────────────────────────────────────────────────┐
│ DUAL-CHANNEL PATTERN │
└─────────────────────────────────────────────────────────────────┘
PostgreSQL PGMQ
│
▼
┌───────────────────┐
│ Dispatch Channel │ ──→ Step events flow TO handlers
└───────────────────┘
│
▼
┌───────────────────┐
│ Handler Execution │ ──→ Business logic runs here
└───────────────────┘
│
▼
┌───────────────────┐
│ Completion Channel │ ──→ Results flow BACK to orchestration
└───────────────────┘
│
▼
Orchestration
Benefits:
- Fire-and-forget dispatch (non-blocking)
- Bounded concurrency via semaphores
- Results processed independently from dispatch
- Consistent pattern across all languages
Language-Specific Implementations
| Component | Rust | Ruby | Python |
|---|---|---|---|
| Dispatch Channel | mpsc::channel | poll_step_events FFI | poll_step_events FFI |
| Completion Channel | mpsc::channel | complete_step_event FFI | complete_step_event FFI |
| Concurrency Model | Tokio async tasks | Ruby threads + FFI polling | Python threads + FFI polling |
| GIL Handling | N/A | Pull-based polling | Pull-based polling |
Handler Lifecycle
Handler Registration
All implementations follow the same registration pattern:
1. Define handler (DSL declaration or class/struct)
2. Set handler name identifier
3. Register with HandlerRegistry (automatic for DSL and class-based handlers)
4. Handler ready for resolution
Both DSL handlers and class-based handlers are discovered automatically at runtime. DSL handlers auto-register their callable name when the handler module is loaded. Class-based handlers are found by the auto-resolver, which scans for any class derived from the base step handler classes. Explicit registration is only needed for edge cases. See Handler Resolution for the full resolver chain and Class-Based Handlers for the class-based pattern.
Python (DSL):
from tasker_core.step_handler.functional import step_handler, inputs
from app.services.types import OrderInput
from app.services import orders as svc
@step_handler("process_order")
@inputs(OrderInput)
def process_order(inputs: OrderInput, context):
return svc.process_order(order_id=inputs.order_id, amount=inputs.amount)
Ruby (DSL):
module Orders
module StepHandlers
extend TaskerCore::StepHandler::Functional
ProcessOrderHandler = step_handler(
'Orders::StepHandlers::ProcessOrderHandler',
inputs: Types::Orders::OrderInput
) do |inputs:, context:|
Orders::Service.process_order(order_id: inputs.order_id, amount: inputs.amount)
end
end
end
TypeScript (DSL):
import { defineHandler } from '@tasker-systems/tasker';
import * as svc from '../services/orders';
export const ProcessOrderHandler = defineHandler(
'Orders.StepHandlers.ProcessOrderHandler',
{ inputs: { orderId: 'order_id', amount: 'amount' } },
async ({ orderId, amount }) =>
svc.processOrder(orderId as string, amount as number),
);
Rust (explicit registration required — no DSL):
use tasker_worker::worker::handlers::StepHandlerRegistry;

let registry = StepHandlerRegistry::new();
registry.register_fn("process_order",
    Box::new(|ctx, _deps| handlers::orders::process_order(ctx)));
Handler Resolution Flow
1. Step event received with handler name
2. Registry.resolve(handler_name) called
3. Handler class instantiated
4. handler.call(context) invoked
5. Result returned to completion channel
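The five steps map onto a registry like this minimal sketch (illustrative only, not tasker_core's actual classes):

```python
class Registry:
    """Minimal sketch of the handler resolution flow."""
    def __init__(self):
        self._handlers = {}

    def register(self, name, handler_cls):
        self._handlers[name] = handler_cls

    def resolve(self, name):
        handler_cls = self._handlers[name]   # 2. resolve by handler name
        return handler_cls()                 # 3. instantiate

class GreetHandler:
    def call(self, context):                 # 4. invoke with context
        return {"greeting": f"hello {context['name']}"}

registry = Registry()
registry.register("greet", GreetHandler)
# 1. an event arrives naming "greet"; 5. the result flows to the completion channel
result = registry.resolve("greet").call({"name": "tasker"})
```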
Handler Context
DSL handlers receive their inputs and dependency results as typed function parameters — the DSL extracts and validates these from the raw context automatically. Handlers also receive a context parameter for accessing additional task metadata.
Class-based handlers receive a context object containing:
| Field | Description |
|---|---|
task_uuid | Unique identifier for the task |
step_uuid | Unique identifier for the step |
input_data | Task context data passed to the step |
dependency_results | Results from parent/dependency steps |
step_config | Configuration from step definition |
step_inputs | Runtime inputs from workflow_step.inputs |
retry_count | Current retry attempt number |
max_retries | Maximum retry attempts allowed |
Handler Results
All handlers return a structured result indicating success or failure. However, the APIs differ between Ruby and Python; this is a known design inconsistency that may be addressed in a future ticket.
Ruby - Uses keyword arguments and separate Success/Error types:
# Via base handler shortcuts
success(result: { key: "value" }, metadata: { duration_ms: 150 })
failure(
message: "Something went wrong",
error_type: "PermanentError",
error_code: "VALIDATION_ERROR", # Ruby has error_code field
retryable: false,
metadata: { field: "email" }
)
# Or via type factory methods
TaskerCore::Types::StepHandlerCallResult.success(result: { key: "value" })
TaskerCore::Types::StepHandlerCallResult.error(
error_type: "PermanentError",
message: "Error message",
error_code: "ERR_001"
)
Python - Uses keyword arguments and a single result type:
# Via base handler shortcuts
self.success(result={"key": "value"}, metadata={"duration_ms": 150})
self.failure(
message="Something went wrong",
error_type="ValidationError",
error_code="VALIDATION_ERROR",
retryable=False,
metadata={"field": "email"}
)
# Or via class factory methods
StepHandlerResult.success(
result={"key": "value"},
metadata={"duration_ms": 150}
)
StepHandlerResult.failure(
message="Something went wrong",
error_type="ValidationError",
error_code="VALIDATION_ERROR",
retryable=False,
metadata={"field": "email"}
)
Key Differences:
| Aspect | Ruby | Python |
|---|---|---|
| Factory method names | .success(), .error() | .success(), .failure() |
| Result type | Success / Error structs | Single StepHandlerResult class |
| Error code field | error_code (freeform) | error_code (optional) |
| Argument style | Keyword required (result:) | Keyword arguments |
Error Handling
Error Classification
All workers classify errors into two categories:
| Type | Description | Behavior |
|---|---|---|
| Retryable | Transient errors that may succeed on retry | Step re-enqueued up to max_retries |
| Permanent | Unrecoverable errors | Step marked as failed immediately |
HTTP Status Code Classification (ApiHandler)
400, 401, 403, 404, 422 → Permanent Error (client errors)
429 → Retryable Error (rate limiting)
500-599 → Retryable Error (server errors)
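The classification above, as a plain-Python sketch (assuming only error statuses reach the classifier):

```python
def classify_http_status(status: int) -> str:
    """Map an HTTP error status to a retryability class."""
    if status == 429:
        return "retryable"   # rate limiting: back off and retry
    if 400 <= status < 500:
        return "permanent"   # other client errors will not succeed on retry
    if 500 <= status < 600:
        return "retryable"   # server errors are usually transient
    raise ValueError(f"not an HTTP error status: {status}")
```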
Exception Hierarchy
Ruby:
TaskerCore::Error # Base class
├── TaskerCore::RetryableError # Transient failures
├── TaskerCore::PermanentError # Unrecoverable failures
├── TaskerCore::FFIError # FFI bridge errors
└── TaskerCore::ConfigurationError # Configuration issues
Python (two modules — FFI/bootstrap errors and execution errors):
# tasker_core.exceptions (FFI / bootstrap)
TaskerError # Base class
├── WorkerNotInitializedError # Worker not bootstrapped
├── WorkerBootstrapError # Bootstrap failed
├── WorkerAlreadyRunningError # Double initialization
├── FFIError # FFI bridge errors
└── ConversionError # Type conversion errors
# tasker_core.errors (execution — used in handlers)
TaskerError # Base class
├── RetryableError # Transient failures (retry with backoff)
│ ├── TimeoutError # Request/connection timeouts
│ ├── NetworkError # Network connectivity issues
│ ├── RateLimitError # Rate limiting (429)
│ ├── ServiceUnavailableError # Service unavailable (503)
│ └── ResourceContentionError # Lock/resource conflicts
├── PermanentError # Unrecoverable failures
│ ├── ValidationError # Input validation failures
│ ├── NotFoundError # Resource not found
│ ├── AuthenticationError # Authentication failures
│ ├── AuthorizationError # Permission denied
│ └── BusinessLogicError # Business rule violations
└── ConfigurationError # Configuration issues
Error Context Propagation
All errors should include context for debugging:
StepHandlerResult.failure(
message="Payment gateway timeout",
error_type="gateway_timeout",
retryable=True,
metadata={
"gateway": "stripe",
"request_id": "req_xyz",
"response_time_ms": 30000
}
)
Polling Architecture
Why Polling?
Ruby and Python workers use a pull-based polling model due to language runtime constraints:
Ruby: The Global VM Lock (GVL) prevents Rust from safely calling Ruby methods from Rust threads. Polling allows Ruby to control thread context.
Python: The Global Interpreter Lock (GIL) has the same limitation. Python must initiate all cross-language calls.
Polling Characteristics
| Parameter | Default Value | Description |
|---|---|---|
| Poll Interval | 10ms | Time between polls when no events |
| Max Latency | ~10ms | Time from event generation to processing start |
| Starvation Check | Every 100 polls (1 second) | Detect processing bottlenecks |
| Cleanup Interval | Every 1000 polls (10 seconds) | Clean up timed-out events |
Poll Loop Structure
while running:
# 1. Poll for event
event = poll_step_events()
if event:
# 2. Process event through handler
process_event(event)
else:
# 3. Sleep when no events
time.sleep(0.01) # 10ms
# 4. Periodic maintenance
if poll_count % 100 == 0:
check_starvation_warnings()
if poll_count % 1000 == 0:
cleanup_timeouts()
FFI Contract
Ruby and Python share the same FFI contract:
| Function | Description |
|---|---|
poll_step_events() | Get next pending event (returns None if empty) |
complete_step_event(event_id, result) | Submit handler result |
get_ffi_dispatch_metrics() | Get dispatch channel metrics |
check_starvation_warnings() | Trigger starvation logging |
cleanup_timeouts() | Clean up timed-out events |
Event Bridge Pattern
Overview
All workers implement an EventBridge (pub/sub) pattern for internal coordination:
┌─────────────────────────────────────────────────────────────────┐
│ EVENT BRIDGE PATTERN │
└─────────────────────────────────────────────────────────────────┘
Publishers EventBridge Subscribers
───────── ─────────── ───────────
HandlerRegistry ──publish──→ ──notify──→ StepExecutionSubscriber
EventPoller ──publish──→ [Events] ──notify──→ MetricsCollector
Worker ──publish──→ ──notify──→ Custom Subscribers
Standard Event Names
| Event | Description | Payload |
|---|---|---|
handler_registered | Handler added to registry | (name, handler_class) |
step_execution_received | Step event received | FfiStepEvent |
step_execution_completed | Handler finished | StepHandlerResult |
worker_started | Worker bootstrap complete | worker_id |
worker_stopped | Worker shutdown | worker_id |
Implementation Libraries
| Language | Library | Pattern |
|---|---|---|
| Ruby | dry-events | Publisher/Subscriber |
| Python | pyee | EventEmitter |
| Rust | Native channels | mpsc |
Usage Example (Python)
from tasker_core import EventBridge, EventNames
bridge = EventBridge.instance()
# Subscribe to events
def on_step_received(event):
print(f"Processing step {event.step_uuid}")
bridge.subscribe(EventNames.STEP_EXECUTION_RECEIVED, on_step_received)
# Publish events
bridge.publish(EventNames.HANDLER_REGISTERED, "my_handler", MyHandler)
Singleton Pattern
Worker State Management
All workers store global state in a thread-safe singleton:
┌─────────────────────────────────────────────────────────────────┐
│ SINGLETON WORKER STATE │
└─────────────────────────────────────────────────────────────────┘
Thread-Safe Global
│
▼
┌──────────────────┐
│ WorkerSystem │
│ ┌────────────┐ │
│ │ Mutex/Lock │ │
│ │ Inner │ │
│ │ State │ │
│ └────────────┘ │
└──────────────────┘
│
├──→ HandlerRegistry
├──→ EventBridge
├──→ EventPoller
└──→ Configuration
Singleton Classes
| Language | Singleton Implementation |
|---|---|
| Rust | OnceLock<Mutex<WorkerSystem>> |
| Ruby | Singleton module |
| Python | Class-level _instance with instance() classmethod |
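The Python variant can be sketched as a class-level _instance guarded by a lock. The names below are illustrative stand-ins, not the tasker_core implementation:

```python
import threading

class WorkerSystem:
    """Minimal sketch of the class-level singleton pattern described above."""
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def instance(cls):
        # Double-checked locking: cheap fast path, lock only on first creation
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    @classmethod
    def reset_instance(cls):
        # Test isolation hook, mirroring the reset methods described below
        with cls._lock:
            cls._instance = None

a = WorkerSystem.instance()
b = WorkerSystem.instance()
assert a is b  # same object on every call
WorkerSystem.reset_instance()
assert WorkerSystem.instance() is not a  # fresh instance after reset
```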
Reset for Testing
All singletons provide reset methods for test isolation:
# Python
HandlerRegistry.reset_instance()
EventBridge.reset_instance()
# Ruby
TaskerCore::Registry::HandlerRegistry.reset_instance!
Observability
Health Checks
All workers expose health information via FFI:
from tasker_core import get_health_check
health = get_health_check()
# Returns: HealthCheck with component statuses
Metrics
Standard metrics available from all workers:
| Metric | Description |
|---|---|
| pending_count | Events awaiting processing |
| in_flight_count | Events currently being processed |
| completed_count | Successfully completed events |
| failed_count | Failed events |
| starvation_detected | Whether events are timing out |
| starving_event_count | Events exceeding timeout threshold |
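As a sketch, a monitoring loop might evaluate these fields to decide when to alert. The Metrics dataclass here is a stand-in for the object returned by the worker's metrics calls; only the field names come from the table above:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    # Stand-in container; field names match the metrics table above
    pending_count: int
    in_flight_count: int
    completed_count: int
    failed_count: int
    starvation_detected: bool
    starving_event_count: int

def should_alert(m: Metrics, max_pending: int = 1000) -> bool:
    """Alert on detected starvation or a growing backlog."""
    return m.starvation_detected or m.pending_count > max_pending

healthy = Metrics(3, 2, 500, 1, False, 0)
starving = Metrics(3, 2, 500, 1, True, 4)
assert not should_alert(healthy)
assert should_alert(starving)
```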
Structured Logging
All workers use structured logging with consistent fields:
from tasker_core import log_info, LogContext
context = LogContext(
    correlation_id="abc-123",
    task_uuid="task-456",
    operation="process_order"
)
log_info("Processing order", context)
Specialized Handlers
All handler types — including API, Decision, and Batchable — support both DSL and class-based patterns. The DSL approach is recommended for new projects. See Example Handlers for full cross-language examples and Class-Based Handlers for the class-based alternative.
Handler Type Hierarchy
Ruby (class hierarchy / DSL factories):
TaskerCore::StepHandler::Base
├── TaskerCore::StepHandler::Api # HTTP/REST API integration
├── TaskerCore::StepHandler::Decision # Dynamic workflow decisions
└── TaskerCore::StepHandler::Batchable # Batch processing support
TaskerCore::StepHandler::Functional # DSL module
├── step_handler() # Standard step
├── decision_handler() # Decision routing
├── api_handler() # HTTP API integration
├── batch_analyzer() # Batch analysis
└── batch_worker() # Batch processing
Python (class hierarchy / DSL decorators):
StepHandler (ABC)
├── ApiHandler # HTTP/REST API integration
├── DecisionHandler # Dynamic workflow decisions
└── + Batchable # Batch processing (mixin)
Decorators (tasker_core.step_handler.functional)
├── @step_handler # Standard step
├── @decision_handler # Decision routing
├── @api_handler # HTTP API integration
├── @batch_analyzer # Batch analysis
└── @batch_worker # Batch processing
TypeScript (factory functions):
defineHandler() # Standard step
defineDecisionHandler() # Decision routing
defineApiHandler() # HTTP API integration
defineBatchAnalyzer() # Batch analysis
defineBatchWorker() # Batch processing
Rust (trait composition — no DSL):
StepHandler (trait)
+ APICapable # HTTP client methods
+ DecisionCapable # Workflow routing
+ BatchableCapable # Cursor-based batch processing
Quick DSL Examples
Decision — returns Decision.route(steps) or Decision.skip(reason):
@decision_handler("routing_decision")
@inputs('amount')
def routing_decision(amount, context):
    if float(amount or 0) < 1000:
        return Decision.route(['auto_approve'], route_type='automatic')
    return Decision.route(['manager_approval', 'finance_review'], route_type='dual')
API — receives api with HTTP methods and automatic error classification:
@api_handler("fetch_user", base_url="https://api.example.com")
@inputs('user_id')
def fetch_user(user_id, api, context):
    response = api.get(f"/users/{user_id}")
    return api.api_success(result={"user_id": user_id, "email": response["email"]})
Batch — analyzer returns BatchConfig, workers receive batch_context:
@batch_analyzer("analyze_csv", worker_template="process_csv_batch")
@inputs('csv_file_path')
def analyze_csv(csv_file_path, context):
    return BatchConfig(total_items=count_csv_rows(csv_file_path), batch_size=100)

@batch_worker("process_csv_batch")
def process_csv_batch(batch_context, context):
    records = read_csv_range(batch_context.start_cursor, batch_context.batch_size)
    return {"items_processed": len(records), "items_succeeded": len(records)}
Checkpoint Yielding
Checkpoint yielding enables batch workers to persist progress and yield control back to the orchestrator for re-dispatch. This is essential for long-running batch operations.
When to Use
- Processing takes longer than visibility timeout
- You need resumable processing after failures
- Long-running operations need progress visibility
Cross-Language API
All Batchable handlers provide checkpoint_yield() (or checkpointYield() in TypeScript):
Ruby:
class MyBatchWorker < TaskerCore::StepHandler::Batchable
  def call(context)
    batch_ctx = get_batch_context(context)
    # Resume from checkpoint if present
    start = batch_ctx.has_checkpoint? ? batch_ctx.checkpoint_cursor : 0
    items.each_with_index do |item, idx|
      process_item(item)
      # Checkpoint every 1000 items
      if (idx + 1) % 1000 == 0
        checkpoint_yield(
          cursor: start + idx + 1,
          items_processed: idx + 1,
          accumulated_results: { partial: "data" }
        )
      end
    end
    batch_worker_success(items_processed: items.size, items_succeeded: items.size)
  end
end
Python:
class MyBatchWorker(StepHandler, Batchable):
    def call(self, context):
        batch_ctx = self.get_batch_context(context)
        # Resume from checkpoint if present
        start = batch_ctx.checkpoint_cursor if batch_ctx.has_checkpoint() else 0
        for idx, item in enumerate(items):
            self.process_item(item)
            # Checkpoint every 1000 items
            if (idx + 1) % 1000 == 0:
                self.checkpoint_yield(
                    cursor=start + idx + 1,
                    items_processed=idx + 1,
                    accumulated_results={"partial": "data"}
                )
        return self.batch_worker_success(items_processed=len(items), items_succeeded=len(items))
TypeScript:
class MyBatchWorker extends BatchableHandler {
  async call(context: StepContext): Promise<StepHandlerResult> {
    const batchCtx = this.getBatchContext(context);
    // Resume from checkpoint if present
    const start = batchCtx.hasCheckpoint() ? batchCtx.checkpointCursor : 0;
    for (let idx = 0; idx < items.length; idx++) {
      await this.processItem(items[idx]);
      // Checkpoint every 1000 items
      if ((idx + 1) % 1000 === 0) {
        await this.checkpointYield({
          cursor: start + idx + 1,
          itemsProcessed: idx + 1,
          accumulatedResults: { partial: "data" }
        });
      }
    }
    return this.batchWorkerSuccess({
      itemsProcessed: items.length,
      itemsSucceeded: items.length,
      itemsFailed: 0,
      itemsSkipped: 0,
      results: [],
      errors: [],
      lastCursor: null,
    });
  }
}
BatchWorkerContext Checkpoint Accessors
All languages provide consistent accessors for checkpoint data:
| Accessor | Ruby | Python | TypeScript |
|---|---|---|---|
| Cursor position | checkpoint_cursor | checkpoint_cursor | checkpointCursor |
| Accumulated data | accumulated_results | accumulated_results | accumulatedResults |
| Has checkpoint? | has_checkpoint? | has_checkpoint() | hasCheckpoint() |
| Items processed | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed |
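The resume-and-checkpoint arithmetic shared by the examples above can be isolated into two small helpers. This is a sketch; BatchCheckpoint is a stand-in for the real context object, with fields named after the accessor table:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BatchCheckpoint:
    """Stand-in for the checkpoint accessors listed above."""
    cursor: Optional[int] = None
    items_processed: int = 0

def resume_position(cp: BatchCheckpoint, start_cursor: int = 0) -> int:
    """Resume at the saved cursor when a checkpoint exists, else at the batch start."""
    return cp.cursor if cp.cursor is not None else start_cursor

def should_checkpoint(idx: int, every: int = 1000) -> bool:
    """Checkpoint after every `every` items, matching the (idx + 1) % 1000 cadence above."""
    return (idx + 1) % every == 0

assert resume_position(BatchCheckpoint()) == 0
assert resume_position(BatchCheckpoint(cursor=3000, items_processed=3000)) == 3000
assert should_checkpoint(999) and not should_checkpoint(998)
```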
FFI Contract
| Function | Description |
|---|---|
| checkpoint_yield_step_event(event_id, data) | Persist checkpoint and re-dispatch step |
Key Invariants
- Checkpoint-Persist-Then-Redispatch: Progress saved before re-dispatch
- Step Stays InProgress: No state machine transitions during yield
- Handler-Driven: Handlers decide when to checkpoint
See Batch Processing Guide - Checkpoint Yielding for comprehensive documentation.
Best Practices
1. Keep Handlers Focused
Each handler should do one thing well:
- Validate input
- Perform single operation
- Return clear result
2. Use Error Classification
Always specify whether errors are retryable:
# Good - clear error classification
return self.failure("API rate limit", retryable=True)
# Bad - ambiguous error handling
raise Exception("API error")
3. Include Context in Errors
return StepHandlerResult.failure(
    message="Database connection failed",
    error_type="database_error",
    retryable=True,
    metadata={
        "host": "db.example.com",
        "port": 5432,
        "connection_timeout_ms": 5000
    }
)
4. Use Structured Logging
log_info("Order processed", {
    "order_id": order_id,
    "total": total,
    "items_count": len(items)
})
5. Test Handler Isolation
Reset singletons between tests:
def setup_method(self):
    HandlerRegistry.reset_instance()
    EventBridge.reset_instance()
See Also
- Worker Crates Overview - High-level introduction
- Rust Worker - Native Rust implementation
- Ruby Worker - Ruby gem documentation
- Python Worker - Python package documentation
- Worker Event Systems - Detailed architecture
- Worker Actors - Actor pattern documentation
Python Worker
Last Updated: 2026-01-01
Audience: Python Developers
Status: Active
Package: tasker_core
Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix
The Python worker provides a package-based interface for integrating tasker-core workflow execution into Python applications. It supports both standalone server deployment and headless embedding in existing codebases.
Quick Start
Installation
cd workers/python
uv sync # Install dependencies
uv run maturin develop # Build FFI extension
Running the Server
python bin/server.py
Environment Variables
| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| PYTHON_HANDLER_PATH | Path for handler auto-discovery | Not set |
| RUST_LOG | Log level (trace/debug/info/warn/error) | info |
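A sketch of resolving these variables in Python, applying the defaults from the table (DATABASE_URL has no default, so its absence is an error). The resolve_env helper is illustrative, not part of the package:

```python
import os

def resolve_env(env: dict) -> dict:
    """Resolve worker settings from an environment mapping, applying table defaults."""
    if "DATABASE_URL" not in env:
        raise RuntimeError("DATABASE_URL is required")
    return {
        "database_url": env["DATABASE_URL"],
        "tasker_env": env.get("TASKER_ENV", "development"),
        "rust_log": env.get("RUST_LOG", "info"),
    }

# In a real worker you would pass os.environ; here a literal mapping keeps it testable
settings = resolve_env({"DATABASE_URL": "postgres://localhost/tasker"})
assert settings["tasker_env"] == "development"
```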
Architecture
Server Mode
Location: workers/python/bin/server.py
The server bootstraps the Rust foundation and manages Python handler execution:
from tasker_core import (
    bootstrap_worker,
    EventBridge,
    EventPoller,
    HandlerRegistry,
    StepExecutionSubscriber,
)

# Bootstrap Rust worker foundation
result = bootstrap_worker(config)

# Start event dispatch system
event_bridge = EventBridge.instance()
event_bridge.start()

# Create step execution subscriber
handler_registry = HandlerRegistry.instance()
step_subscriber = StepExecutionSubscriber(
    event_bridge=event_bridge,
    handler_registry=handler_registry,
    worker_id="python-worker-001"
)
step_subscriber.start()

# Start event poller (10ms polling)
event_poller = EventPoller(polling_interval_ms=10)
event_poller.on_step_event(lambda e: event_bridge.publish("step_execution_received", e))
event_poller.start()

# Wait for shutdown signal
shutdown_event.wait()

# Graceful shutdown
event_poller.stop()
step_subscriber.stop()
event_bridge.stop()
stop_worker()
Headless/Embedded Mode
For embedding in existing Python applications:
from tasker_core import (
    bootstrap_worker,
    HandlerRegistry,
    EventBridge,
    EventPoller,
    StepExecutionSubscriber,
)
from tasker_core.types import BootstrapConfig

# Bootstrap worker (headless mode controlled via TOML: web.enabled = false)
config = BootstrapConfig(namespace="my-app")
bootstrap_worker(config)
# Register handlers
registry = HandlerRegistry.instance()
registry.register("process_data", ProcessDataHandler)
# Start event dispatch (required for embedded usage)
bridge = EventBridge.instance()
bridge.start()
subscriber = StepExecutionSubscriber(bridge, registry, "embedded-worker")
subscriber.start()
poller = EventPoller()
poller.on_step_event(lambda e: bridge.publish("step_execution_received", e))
poller.start()
FFI Bridge
Python communicates with the Rust foundation via FFI polling:
┌────────────────────────────────────────────────────────────────┐
│ PYTHON FFI BRIDGE │
└────────────────────────────────────────────────────────────────┘
Rust Worker System
│
│ FFI (poll_step_events)
▼
┌─────────────────────┐
│ EventPoller │
│ (daemon thread) │──→ poll every 10ms
└─────────────────────┘
│
│ publish to EventBridge
▼
┌─────────────────────┐
│ StepExecution │
│ Subscriber │──→ route to handler
└─────────────────────┘
│
│ handler.call(context)
▼
┌─────────────────────┐
│ Handler Execution │
└─────────────────────┘
│
│ FFI (complete_step_event)
▼
Rust Completion Channel
Handler Development
DSL Handlers (Recommended)
Both DSL and class-based handlers are fully supported. DSL is recommended for new projects. See Class-Based Handlers for the inheritance-based patterns.
The functional DSL provides a decorator-based approach for defining handlers:
from tasker_core.step_handler.functional import step_handler, inputs, depends_on
@step_handler("process_order")
@inputs(OrderInput)  # Pydantic BaseModel
@depends_on(validation=ValidationResult)  # optional typed dependencies
def process_order(inputs: OrderInput, validation: ValidationResult, context):
    result = process(inputs.order_id, inputs.amount)
    return {"order_id": inputs.order_id, "status": "processed", "total": result["total"]}
Specialized DSL Handlers:
from tasker_core.step_handler.functional import (
    decision_handler, api_handler, batch_analyzer, batch_worker,
    Decision, BatchConfig
)

@decision_handler("routing_decision")
@inputs('amount')
def routing_decision(amount, context):
    if float(amount or 0) < 1000:
        return Decision.route(['auto_approve'], route_type='automatic')
    return Decision.route(['manager_approval'], route_type='manager')

@api_handler("fetch_user", base_url="https://api.example.com")
@inputs('user_id')
def fetch_user(user_id, api, context):
    response = api.get(f"/users/{user_id}")
    return api.api_success(result=response.body)

@batch_analyzer("csv_analyzer")
@inputs('csv_path')
def csv_analyzer(csv_path, context):
    return BatchConfig(total_items=count_rows(csv_path), batch_size=100)

@batch_worker("csv_processor", analyzer_step="analyze_csv")
def csv_processor(batch_context, context):
    return {"processed": batch_context.batch_size}
Class-Based Handlers
The class-based patterns below remain fully supported.
Base Handler (ABC)
Location: python/tasker_core/step_handler/base.py
All handlers inherit from StepHandler:
from tasker_core import StepHandler, StepContext, StepHandlerResult

class ProcessOrderHandler(StepHandler):
    handler_name = "process_order"
    handler_version = "1.0.0"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Access input data
        order_id = context.input_data.get("order_id")
        amount = context.input_data.get("amount")
        # Business logic
        result = self.process_order(order_id, amount)
        # Return success
        return self.success(result={
            "order_id": order_id,
            "status": "processed",
            "total": result["total"]
        })
Handler Signature
def call(self, context: StepContext) -> StepHandlerResult:
    # context.task_uuid - Task identifier
    # context.step_uuid - Step identifier
    # context.input_data - Task context data
    # context.dependency_results - Results from parent steps
    # context.step_config - Handler configuration
    # context.step_inputs - Runtime inputs
    # context.retry_count - Current retry attempt
    # context.max_retries - Maximum retry attempts
Result Methods
# Success result (from base class)
return self.success(
    result={"key": "value"},
    metadata={"duration_ms": 100}
)

# Failure result (from base class)
return self.failure(
    message="Payment declined",
    error_type="payment_error",
    retryable=True,
    metadata={"card_last_four": "1234"}
)

# Or using factory methods
from tasker_core import StepHandlerResult

return StepHandlerResult.success(
    {"key": "value"},
    {"duration_ms": 100}
)

return StepHandlerResult.failure(
    message="Error",
    error_type="validation_error",
    retryable=False
)
Accessing Dependencies
def call(self, context: StepContext) -> StepHandlerResult:
    # Get result from a dependency step
    validation = context.dependency_results.get("validate_order", {})
    if validation.get("valid"):
        # Process with validated data
        return self.success(result={"processed": True})
    return self.failure("Validation failed", retryable=False)
Composition Pattern
Python handlers use composition via mixins (multiple inheritance) rather than single inheritance.
Using Mixins (Recommended for New Code)
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin, DecisionMixin

class MyHandler(StepHandler, APIMixin, DecisionMixin):
    handler_name = "my_handler"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Has both API methods (get, post, put, delete)
        # and Decision methods (decision_success, skip_branches)
        response = self.get("/api/endpoint")
        return self.decision_success(["next_step"], response)
Available Mixins
| Mixin | Location | Methods Provided |
|---|---|---|
| APIMixin | mixins/api.py | get, post, put, delete, http_client |
| DecisionMixin | mixins/decision.py | decision_success, skip_branches, decision_failure |
| BatchableMixin | (base class) | get_batch_context, handle_no_op_worker, create_cursor_configs |
Using Wrapper Classes (Backward Compatible)
The wrapper classes delegate to mixins internally:
# These are equivalent:
class MyHandler(ApiHandler):
    # Inherits API methods via APIMixin internally
    pass

class MyHandler(StepHandler, APIMixin):
    # Explicit mixin inclusion
    pass
Specialized Handlers
API Handler
Location: python/tasker_core/step_handler/api.py
For HTTP API integration with automatic error classification:
from tasker_core.step_handler import ApiHandler

class FetchUserHandler(ApiHandler):
    handler_name = "fetch_user"
    base_url = "https://api.example.com"

    def call(self, context: StepContext) -> StepHandlerResult:
        user_id = context.input_data["user_id"]
        # Automatic error classification
        response = self.get(f"/users/{user_id}")
        return self.api_success(response)
HTTP Methods:
# GET request
response = self.get("/path", params={"key": "value"}, headers={})
# POST request
response = self.post("/path", data={"key": "value"}, headers={})
# PUT request
response = self.put("/path", data={"key": "value"}, headers={})
# DELETE request
response = self.delete("/path", params={}, headers={})
ApiResponse Properties:
response.status_code # HTTP status code
response.headers # Response headers
response.body # Parsed body (dict or str)
response.ok # True if 2xx status
response.is_client_error # True if 4xx status
response.is_server_error # True if 5xx status
response.is_retryable # True if should retry (408, 429, 500-504)
response.retry_after # Retry-After header value in seconds
Error Classification:
| Status | Classification | Behavior |
|---|---|---|
| 400, 401, 403, 404, 422 | Non-retryable | Permanent failure |
| 408, 429, 500-504 | Retryable | Standard retry |
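The classification table maps onto a simple predicate, consistent with the is_retryable property above. A sketch, not the package's implementation:

```python
# Statuses the table above classifies as retryable: 408, 429, and 500-504
RETRYABLE_STATUSES = {408, 429, 500, 501, 502, 503, 504}

def is_retryable_status(status_code: int) -> bool:
    """True when the classification table marks this status as retryable."""
    return status_code in RETRYABLE_STATUSES

assert is_retryable_status(429)
assert is_retryable_status(503)
assert not is_retryable_status(404)
assert not is_retryable_status(422)
```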
Decision Handler
Location: python/tasker_core/step_handler/decision.py
For dynamic workflow routing:
from tasker_core.step_handler import DecisionHandler
from tasker_core import DecisionPointOutcome

class RoutingDecisionHandler(DecisionHandler):
    handler_name = "routing_decision"

    def call(self, context: StepContext) -> StepHandlerResult:
        amount = context.input_data.get("amount", 0)
        if amount < 1000:
            # Auto-approve small amounts
            outcome = DecisionPointOutcome.create_steps(
                ["auto_approve"],
                routing_context={"route_type": "auto"}
            )
            return self.decision_success(outcome)
        elif amount < 5000:
            # Manager approval for medium amounts
            outcome = DecisionPointOutcome.create_steps(
                ["manager_approval"],
                routing_context={"route_type": "manager"}
            )
            return self.decision_success(outcome)
        else:
            # Dual approval for large amounts
            outcome = DecisionPointOutcome.create_steps(
                ["manager_approval", "finance_review"],
                routing_context={"route_type": "dual"}
            )
            return self.decision_success(outcome)
Decision Methods:
# Create steps
outcome = DecisionPointOutcome.create_steps(
    step_names=["step1", "step2"],
    routing_context={"key": "value"}
)
return self.decision_success(outcome)

# No branches needed
outcome = DecisionPointOutcome.no_branches(reason="condition not met")
return self.decision_no_branches(outcome)
Batchable Mixin
Location: python/tasker_core/batch_processing/
For processing large datasets in chunks. Both analyzer and worker handlers implement the standard call() method:
Analyzer Handler (creates batch configurations):
from tasker_core import StepHandler, StepHandlerResult
from tasker_core.batch_processing import Batchable

class CsvAnalyzerHandler(StepHandler, Batchable):
    handler_name = "csv_analyzer"

    def call(self, context: StepContext) -> StepHandlerResult:
        """Analyze CSV and create batch worker configurations."""
        csv_path = context.input_data["csv_path"]
        row_count = count_csv_rows(csv_path)
        if row_count == 0:
            # No data to process
            return self.batch_analyzer_success(
                cursor_configs=[],
                total_items=0,
                batch_metadata={"csv_path": csv_path}
            )
        # Create cursor ranges for batch workers
        cursor_configs = self.create_cursor_ranges(
            total_items=row_count,
            batch_size=100,
            max_batches=5
        )
        return self.batch_analyzer_success(
            cursor_configs=cursor_configs,
            total_items=row_count,
            worker_template_name="process_csv_batch",
            batch_metadata={"csv_path": csv_path}
        )
Worker Handler (processes a batch):
class CsvBatchProcessorHandler(StepHandler, Batchable):
    handler_name = "csv_batch_processor"

    def call(self, context: StepContext) -> StepHandlerResult:
        """Process a batch of CSV rows."""
        # Get cursor config from step_inputs
        step_inputs = context.step_inputs or {}
        # Check for no-op placeholder batch
        if step_inputs.get("is_no_op"):
            return self.batch_worker_success(
                items_processed=0,
                items_succeeded=0,
                metadata={"no_op": True}
            )
        cursor = step_inputs.get("cursor", {})
        start_cursor = cursor.get("start_cursor", 0)
        end_cursor = cursor.get("end_cursor", 0)
        # Get CSV path from analyzer result
        analyzer_result = context.get_dependency_result("analyze_csv")
        csv_path = analyzer_result["batch_metadata"]["csv_path"]
        # Process the batch
        results = process_csv_batch(csv_path, start_cursor, end_cursor)
        return self.batch_worker_success(
            items_processed=results["count"],
            items_succeeded=results["success"],
            items_failed=results["failed"],
            results=results["data"],
            last_cursor=end_cursor
        )
Batchable Helper Methods:
# Analyzer helpers
self.create_cursor_ranges(total_items, batch_size, max_batches)
self.batch_analyzer_success(cursor_configs, total_items, worker_template_name, batch_metadata)
# Worker helpers
self.batch_worker_success(items_processed, items_succeeded, items_failed, results, last_cursor, metadata)
self.get_batch_context(context) # Returns BatchWorkerContext or None
# Aggregator helpers
self.aggregate_worker_results(worker_results) # Returns aggregated counts
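The cursor math behind create_cursor_ranges can be sketched as follows. This is an illustrative implementation, not the package's own; in particular, the treatment of max_batches (last range absorbs the remainder) is an assumption:

```python
from typing import List, Optional, Tuple

def cursor_ranges(total_items: int, batch_size: int,
                  max_batches: Optional[int] = None) -> List[Tuple[int, int]]:
    """Split [0, total_items) into (start, end) ranges of at most batch_size items."""
    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < total_items:
        if max_batches is not None and len(ranges) == max_batches:
            # Cap the batch count; the last range absorbs the remainder (assumption)
            ranges[-1] = (ranges[-1][0], total_items)
            return ranges
        end = min(start + batch_size, total_items)
        ranges.append((start, end))
        start = end
    return ranges

assert cursor_ranges(250, 100) == [(0, 100), (100, 200), (200, 250)]
assert cursor_ranges(250, 100, max_batches=2) == [(0, 100), (100, 250)]
```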
Handler Registry
Registration
Location: python/tasker_core/handler.py
from tasker_core import HandlerRegistry
registry = HandlerRegistry.instance()
# Manual registration
registry.register("process_order", ProcessOrderHandler)
# Check if registered
registry.is_registered("process_order") # True
# Resolve and instantiate
handler = registry.resolve("process_order")
result = handler.call(context)
# List all handlers
registry.list_handlers() # ["process_order", ...]
# Handler count
registry.handler_count() # 1
Auto-Discovery
# Discover handlers from a package
count = registry.discover_handlers("myapp.handlers")
print(f"Discovered {count} handlers")
Handlers are discovered by:
- Scanning the package for classes inheriting from StepHandler
- Using the handler_name class attribute for registration
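The mechanism can be sketched with importlib and inspect. This is a simplified single-module version; the stub base class and module name below are stand-ins, not the real tasker_core discovery code:

```python
import importlib
import inspect
import sys
import types

def discover_handlers(module_name: str, base_class: type) -> dict:
    """Map handler_name -> class for base_class subclasses found in a module."""
    module = importlib.import_module(module_name)
    found = {}
    for _, obj in inspect.getmembers(module, inspect.isclass):
        if issubclass(obj, base_class) and obj is not base_class:
            name = getattr(obj, "handler_name", None)
            if name:
                found[name] = obj
    return found

# Usage with a synthetic module standing in for myapp.handlers
class StepHandlerStub:  # stand-in for tasker_core.StepHandler
    pass

class ProcessOrderHandler(StepHandlerStub):
    handler_name = "process_order"

demo = types.ModuleType("demo_handlers")
demo.ProcessOrderHandler = ProcessOrderHandler
sys.modules["demo_handlers"] = demo

assert discover_handlers("demo_handlers", StepHandlerStub) == {"process_order": ProcessOrderHandler}
```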
Type System
Pydantic Models
Python types use Pydantic for validation:
from tasker_core import StepContext, StepHandlerResult, FfiStepEvent
# StepContext - validated from FFI event
context = StepContext.from_ffi_event(event, "handler_name")
context.task_uuid # UUID
context.step_uuid # UUID
context.input_data # dict
context.retry_count # int
# StepHandlerResult - structured result
result = StepHandlerResult.success({"key": "value"})
result.success # True
result.result # {"key": "value"}
result.error_message # None
Configuration Types
from tasker_core.types import BootstrapConfig, CursorConfig

# Bootstrap configuration
# Note: Headless mode is controlled via TOML config (web.enabled = false)
config = BootstrapConfig(
    namespace="my-app",
    log_level="info"
)

# Cursor configuration for batch processing
cursor = CursorConfig(
    batch_size=100,
    start_cursor=0,
    end_cursor=1000
)
Event System
EventBridge
Location: python/tasker_core/event_bridge.py
from tasker_core import EventBridge, EventNames
bridge = EventBridge.instance()
# Start the event system
bridge.start()
# Subscribe to events
def on_step_received(event):
    print(f"Processing step: {event.step_uuid}")
bridge.subscribe(EventNames.STEP_EXECUTION_RECEIVED, on_step_received)
# Publish events
bridge.publish(EventNames.HANDLER_REGISTERED, "my_handler", MyHandler)
# Stop when done
bridge.stop()
Event Names
from tasker_core import EventNames
EventNames.STEP_EXECUTION_RECEIVED # Step event received from FFI
EventNames.STEP_COMPLETION_SENT # Handler result sent to FFI
EventNames.HANDLER_REGISTERED # Handler registered
EventNames.HANDLER_ERROR # Handler execution error
EventNames.POLLER_METRICS # FFI dispatch metrics update
EventNames.POLLER_ERROR # Poller encountered an error
EventPoller
Location: python/tasker_core/event_poller.py
from tasker_core import EventPoller
poller = EventPoller(
    polling_interval_ms=10,         # Poll every 10ms
    starvation_check_interval=100,  # Check every 100 polls (~1 second)
    cleanup_interval=1000           # Clean up every 1000 polls (~10 seconds)
)
# Register callbacks
poller.on_step_event(handle_step)
poller.on_metrics(handle_metrics)
poller.on_error(handle_error)
# Start polling (daemon thread)
poller.start()
# Get metrics
metrics = poller.get_metrics()
print(f"Pending: {metrics.pending_count}")
# Stop polling
poller.stop(timeout=5.0)
Domain Events
Python has full domain event support with lifecycle hooks matching Ruby and TypeScript capabilities.
Location: python/tasker_core/domain_events.py
BasePublisher
Publishers transform step execution context into domain-specific events:
from tasker_core.domain_events import BasePublisher, StepEventContext, DomainEvent

class PaymentEventPublisher(BasePublisher):
    publisher_name = "payment_events"

    def publishes_for(self) -> list[str]:
        """Which steps trigger this publisher."""
        return ["process_payment", "refund_payment"]

    async def transform_payload(self, ctx: StepEventContext) -> dict:
        """Transform step context into domain event payload."""
        return {
            "payment_id": ctx.result.get("payment_id"),
            "amount": ctx.result.get("amount"),
            "currency": ctx.result.get("currency"),
            "status": ctx.result.get("status")
        }

    # Lifecycle hooks (optional)
    async def before_publish(self, ctx: StepEventContext) -> None:
        """Called before publishing."""
        print(f"Publishing payment event for step: {ctx.step_name}")

    async def after_publish(self, ctx: StepEventContext, event: DomainEvent) -> None:
        """Called after successful publish."""
        print(f"Published event: {event.event_name}")

    async def on_publish_error(self, ctx: StepEventContext, error: Exception) -> None:
        """Called on publish failure."""
        print(f"Failed to publish: {error}")

    async def additional_metadata(self, ctx: StepEventContext) -> dict:
        """Inject custom metadata."""
        return {"payment_processor": "stripe"}
BaseSubscriber
Subscribers react to domain events matching specific patterns:
from tasker_core.domain_events import BaseSubscriber, InProcessDomainEvent, SubscriberResult

class AuditLoggingSubscriber(BaseSubscriber):
    subscriber_name = "audit_logger"

    def subscribes_to(self) -> list[str]:
        """Which events to handle (glob patterns supported)."""
        return ["payment.*", "order.completed"]

    async def handle(self, event: InProcessDomainEvent) -> SubscriberResult:
        """Handle matching events."""
        await self.log_to_audit_trail(event)
        return SubscriberResult(success=True)

    # Lifecycle hooks (optional)
    async def before_handle(self, event: InProcessDomainEvent) -> None:
        """Called before handling."""
        print(f"Handling: {event.event_name}")

    async def after_handle(self, event: InProcessDomainEvent, result: SubscriberResult) -> None:
        """Called after handling."""
        print(f"Handled successfully: {result.success}")

    async def on_handle_error(self, event: InProcessDomainEvent, error: Exception) -> None:
        """Called on handler failure."""
        print(f"Handler error: {error}")
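The glob patterns in subscribes_to behave like fnmatch-style matching; a sketch of the dispatch check (the package's exact matching rules may differ, e.g. around case sensitivity or dot handling):

```python
from fnmatch import fnmatchcase

def matches_subscription(event_name: str, patterns: list) -> bool:
    """True when the event name matches any subscribed glob pattern."""
    return any(fnmatchcase(event_name, p) for p in patterns)

patterns = ["payment.*", "order.completed"]
assert matches_subscription("payment.refunded", patterns)
assert matches_subscription("order.completed", patterns)
assert not matches_subscription("order.created", patterns)
```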
Registries
Manage publishers and subscribers with singleton registries:
from tasker_core.domain_events import PublisherRegistry, SubscriberRegistry
# Publisher Registry
pub_registry = PublisherRegistry.instance()
pub_registry.register(PaymentEventPublisher)
pub_registry.register(OrderEventPublisher)
# Get publisher for a step
publisher = pub_registry.get_for_step("process_payment")
# Subscriber Registry
sub_registry = SubscriberRegistry.instance()
sub_registry.register(AuditLoggingSubscriber)
sub_registry.register(MetricsSubscriber)
# Start all subscribers
sub_registry.start_all()
# Stop all subscribers
sub_registry.stop_all()
Signal Handling
The Python worker handles signals for graceful shutdown:
| Signal | Behavior |
|---|---|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |
| SIGUSR1 | Report worker status |
import signal
def handle_shutdown(signum, frame):
    print("Shutting down...")
    shutdown_event.set()
signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)
Error Handling
Exception Classes
Python uses two error modules:
- tasker_core.exceptions – FFI and bootstrap errors
- tasker_core.errors – Execution errors (handler-level)
# FFI/bootstrap errors (tasker_core.exceptions)
from tasker_core import (
    TaskerError,                 # Base class
    WorkerNotInitializedError,
    WorkerBootstrapError,
    WorkerAlreadyRunningError,
    FFIError,
    ConversionError,
)

# Execution errors (tasker_core.errors)
from tasker_core.errors import (
    StepExecutionError,  # Base execution error
    RetryableError,      # Transient failures (will be retried)
    PermanentError,      # Unrecoverable failures (no retry)
)
Using Execution Errors
Use RetryableError and PermanentError for handler-level error classification:
from tasker_core.errors import RetryableError, PermanentError

def call(self, context):
    # Retryable error (transient failure)
    raise RetryableError(
        "Database connection timeout",
        error_type="database_error"
    )

    # Permanent error (unrecoverable)
    raise PermanentError(
        "Invalid input format",
        error_type="validation_error"
    )
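A handler might translate low-level exceptions into these classes at its boundary. The sketch below defines local stand-ins for the error classes so it is self-contained; the real classes live in tasker_core.errors:

```python
class StepExecutionError(Exception):
    """Local stand-in for tasker_core.errors.StepExecutionError."""
    def __init__(self, message, error_type="unknown"):
        super().__init__(message)
        self.error_type = error_type

class RetryableError(StepExecutionError):
    pass

class PermanentError(StepExecutionError):
    pass

def classify(exc: Exception) -> StepExecutionError:
    """Map low-level failures onto the retryable/permanent split at the handler boundary."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return RetryableError(str(exc), error_type="network_error")
    return PermanentError(str(exc), error_type="unexpected_error")

assert isinstance(classify(TimeoutError("connect timed out")), RetryableError)
assert isinstance(classify(ValueError("bad payload")), PermanentError)
```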
StepExecutionError remains available as the base class for custom execution errors.
Logging
Structured Logging
from tasker_core import log_info, log_error, log_warn, log_debug, LogContext
# Simple logging
log_info("Processing started")
log_error("Failed to connect")
# With context dict
log_info("Order processed", {
    "order_id": "123",
    "amount": "100.00"
})

# With LogContext model
context = LogContext(
    correlation_id="abc-123",
    task_uuid="task-456",
    operation="process_order"
)
log_info("Processing", context)
File Structure
workers/python/
├── bin/
│ └── server.py # Production server
├── python/
│ └── tasker_core/
│ ├── __init__.py # Package exports
│ ├── handler.py # Handler registry
│ ├── event_bridge.py # Event pub/sub
│ ├── event_poller.py # FFI polling
│ ├── logging.py # Structured logging
│ ├── types.py # Pydantic models
│ ├── step_handler/
│ │ ├── __init__.py
│ │ ├── base.py # Base handler ABC
│ │ ├── api.py # API handler
│ │ └── decision.py # Decision handler
│ ├── batch_processing/
│ │ └── __init__.py # Batchable mixin
│ └── step_execution_subscriber.py
├── src/ # Rust/PyO3 extension
├── tests/
│ ├── test_step_handler.py
│ ├── test_module_exports.py
│ └── handlers/examples/
├── pyproject.toml
└── uv.lock
Testing
Unit Tests
cd workers/python
uv run pytest tests/
With Coverage
uv run pytest tests/ --cov=tasker_core
Type Checking
uv run mypy python/tasker_core/
Linting
uv run ruff check python/
Example Handlers
Linear Workflow
DSL (recommended):
from tasker_core.step_handler.functional import step_handler

@step_handler("linear_step_1")
def linear_step_1(context):
    return {
        "step1_processed": True,
        "input_received": context.input_data,
        "processed_at": datetime.now().isoformat()
    }
Class-based:
class LinearStep1Handler(StepHandler):
handler_name = "linear_step_1"
def call(self, context: StepContext) -> StepHandlerResult:
return self.success(result={
"step1_processed": True,
"input_received": context.input_data,
"processed_at": datetime.now().isoformat()
})
Data Processing
DSL (recommended):
from tasker_core.step_handler.functional import step_handler, depends_on
@step_handler("transform_data")
@depends_on(fetch_data=dict)
def transform_data(fetch_data: dict, context):
transformed = [
{"id": item["id"], "value": item["raw_value"] * 2}
for item in fetch_data.get("items", [])
]
return {"items": transformed, "count": len(transformed)}
Class-based:
class TransformDataHandler(StepHandler):
handler_name = "transform_data"
def call(self, context: StepContext) -> StepHandlerResult:
# Get raw data from dependency
raw_data = context.dependency_results.get("fetch_data", {})
# Transform
transformed = [
{"id": item["id"], "value": item["raw_value"] * 2}
for item in raw_data.get("items", [])
]
return self.success(result={
"items": transformed,
"count": len(transformed)
})
Conditional Approval
DSL (recommended):
from tasker_core.step_handler.functional import decision_handler, inputs, Decision
@decision_handler("approval_router")
@inputs('amount')
def approval_router(amount, context):
amount = float(amount or 0)
if amount < 1000:
return Decision.route(['auto_approve'], route_type='automatic')
elif amount < 5000:
return Decision.route(['manager_approval'], route_type='manager')
return Decision.route(['manager_approval', 'finance_review'], route_type='dual')
Class-based:
class ApprovalRouterHandler(DecisionHandler):
handler_name = "approval_router"
THRESHOLDS = {
"auto": 1000,
"manager": 5000
}
def call(self, context: StepContext) -> StepHandlerResult:
amount = context.input_data.get("amount", 0)
if amount < self.THRESHOLDS["auto"]:
outcome = DecisionPointOutcome.create_steps(["auto_approve"])
elif amount < self.THRESHOLDS["manager"]:
outcome = DecisionPointOutcome.create_steps(["manager_approval"])
else:
outcome = DecisionPointOutcome.create_steps(
["manager_approval", "finance_review"]
)
return self.decision_success(outcome)
See Also
- Worker Crates Overview - High-level introduction
- Patterns and Practices - Common patterns
- Ruby Worker - Ruby implementation
- Worker Event Systems - Architecture details
Ruby Worker
Last Updated: 2026-01-01
Audience: Ruby Developers
Status: Active
Package: tasker_core (gem)
Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix
The Ruby worker provides a gem-based interface for integrating tasker-core workflow execution into Ruby applications. It supports both standalone server deployment and headless embedding in Rails applications.
Quick Start
Installation
cd workers/ruby
bundle install
bundle exec rake compile # Compile FFI extension
Running the Server
./bin/server.rb
Environment Variables
| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_ENV | Environment (test/development/production) | development |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| TASKER_TEMPLATE_PATH | Path to task templates | Auto-detected |
| RUBY_GC_HEAP_GROWTH_FACTOR | GC tuning for production | Ruby default |
Architecture
Server Mode
Location: workers/ruby/bin/server.rb
The server bootstraps the Rust foundation and manages Ruby handler execution:
# Bootstrap the worker system
bootstrap = TaskerCore::Worker::Bootstrap.start!
# Signal handlers for graceful shutdown
Signal.trap('TERM') { shutdown_event.set }
Signal.trap('INT') { shutdown_event.set }
# Main loop with health checks
loop do
break if shutdown_event.set?
sleep(1)
end
# Graceful shutdown
bootstrap.shutdown!
Headless/Embedded Mode
For embedding in Rails applications without an HTTP server:
# config/initializers/tasker.rb
require 'tasker_core'
# Bootstrap worker (headless mode controlled via TOML: web.enabled = false)
TaskerCore::Worker::Bootstrap.start!
# Register application handlers
TaskerCore::Registry::HandlerRegistry.instance.register_handler(
'ProcessOrderHandler',
ProcessOrderHandler
)
FFI Bridge
Ruby communicates with the Rust foundation via FFI polling:
┌────────────────────────────────────────────────────────────────┐
│ RUBY FFI BRIDGE │
└────────────────────────────────────────────────────────────────┘
Rust Worker System
│
│ FFI (poll_step_events)
▼
┌─────────────┐
│ Ruby │
│ Thread │──→ poll every 10ms
└─────────────┘
│
▼
┌─────────────┐
│ Handler │
│ Execution │──→ handler.call(context)
└─────────────┘
│
│ FFI (complete_step_event)
▼
Rust Completion Channel
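In plain Ruby terms, the bridge behaves like the loop below — a `Queue` stands in for each FFI channel (`poll_step_events` / `complete_step_event`), and the event shape is hypothetical:

```ruby
# Queues standing in for the two FFI channels (illustrative only)
events = Queue.new       # events polled from the Rust side
completions = Queue.new  # results sent back to the completion channel

# A hypothetical polled event: a step identifier plus its handler
events << { step_uuid: "step-1", handler: ->(_ctx) { { ok: true } } }

until events.empty?
  event = events.pop
  result = event[:handler].call({}) # handler.call(context)
  completions << { step_uuid: event[:step_uuid], result: result }
end
```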
Handler Development
DSL Handlers (Recommended)
Both DSL and class-based handlers are fully supported. DSL is recommended for new projects. See Class-Based Handlers for the inheritance-based patterns.
The functional DSL provides a block-based approach for defining handlers:
extend TaskerCore::StepHandler::Functional
ValidateCartHandler = step_handler(
'Ecommerce::StepHandlers::ValidateCartHandler',
inputs: Types::Ecommerce::OrderInput
) do |inputs:, context:|
Ecommerce::Service.validate_cart_items(inputs.cart_items)
end
Specialized DSL Handlers:
RoutingHandler = decision_handler(
'RoutingDecisionHandler',
inputs: [:amount]
) do |amount:, context:|
if amount.to_f < 1000
Decision.route(['auto_approve'], route_type: 'automatic')
else
Decision.route(['manager_approval'], route_type: 'manager')
end
end
FetchUserHandler = api_handler(
'Users::StepHandlers::FetchUserHandler',
base_url: 'https://api.example.com',
inputs: [:user_id]
) do |user_id:, api:, context:|
response = api.get("/users/#{user_id}")
api.api_success(result: { user: response.body })
end
CsvAnalyzer = batch_analyzer(
'Csv::StepHandlers::CsvAnalyzerHandler',
inputs: [:csv_path]
) do |csv_path:, context:|
BatchConfig.new(total_items: count_rows(csv_path), batch_size: 100)
end
CsvWorker = batch_worker(
'Csv::StepHandlers::CsvWorkerHandler',
analyzer_step: 'analyze_csv'
) do |batch_context:, context:|
{ processed: batch_context.batch_size }
end
Class-Based Handlers
The class-based patterns below remain fully supported.
Base Handler
Location: lib/tasker_core/step_handler/base.rb
All handlers inherit from TaskerCore::StepHandler::Base:
class ProcessOrderHandler < TaskerCore::StepHandler::Base
def call(context)
# Access task context via cross-language standard methods
order_id = context.get_task_field('order_id')
amount = context.get_task_field('amount')
# Business logic
result = process_order(order_id, amount)
# Return success result
success(result: {
order_id: order_id,
status: 'processed',
total: result[:total]
})
end
end
Handler Signature
def call(context)
# context - StepContext with cross-language standard fields:
# context.task_uuid - Task UUID
# context.step_uuid - Step UUID
# context.input_data - Step inputs from workflow_step.inputs
# context.step_config - Handler config from step_definition
# context.retry_count - Current retry attempt
# context.max_retries - Maximum retry attempts
# context.get_task_field('field') - Get field from task context
# context.get_dependency_result('step') - Get result from parent step
end
Result Methods
# Success result (keyword arguments required)
success(
result: { key: 'value' },
metadata: { duration_ms: 100 }
)
# Failure result
# error_type must be one of: 'PermanentError', 'RetryableError',
# 'ValidationError', 'UnexpectedError', 'StepCompletionError'
failure(
message: 'Payment declined',
error_type: 'PermanentError', # Use enum value, not freeform string
error_code: 'PAYMENT_DECLINED', # Optional freeform error code
retryable: false,
metadata: { card_last_four: '1234' }
)
Accessing Dependencies
def call(context)
# Get result from a dependency step
validation_result = context.get_dependency_result('validate_order')
if validation_result && validation_result['valid']
# Process with validated data
end
end
Composition Pattern
Ruby handlers favor composition via mixins over deep inheritance hierarchies. You can use either:
- Wrapper classes (Api, Decision, Batchable) - simpler, backward compatible
- Mixin modules (Mixins::API, Mixins::Decision, Mixins::Batchable) - explicit composition
Using Mixins (Recommended for New Code)
class MyHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::Mixins::API
include TaskerCore::StepHandler::Mixins::Decision
def call(context)
# Has both API methods (get, post, put, delete)
# And Decision methods (decision_success, decision_no_branches)
response = get('/api/endpoint')
decision_success(steps: ['next_step'], result_data: response)
end
end
Available Mixins
| Mixin | Location | Methods Provided |
|---|---|---|
| Mixins::API | mixins/api.rb | get, post, put, delete, connection |
| Mixins::Decision | mixins/decision.rb | decision_success, decision_no_branches, skip_branches |
| Mixins::Batchable | mixins/batchable.rb | get_batch_context, handle_no_op_worker, create_cursor_configs |
Using Wrapper Classes (Backward Compatible)
The wrapper classes delegate to mixins internally:
# These are equivalent:
class MyHandler < TaskerCore::StepHandler::Api
# Inherits API methods via Mixins::API
end
class MyHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::Mixins::API
# Explicit mixin inclusion
end
Specialized Handlers
API Handler
Location: lib/tasker_core/step_handler/api.rb
For HTTP API integration with automatic error classification:
class FetchUserHandler < TaskerCore::StepHandler::Api
def call(context)
user_id = context.get_task_field('user_id')
# Automatic error classification (429 → retryable, 404 → permanent)
response = connection.get("/users/#{user_id}")
process_response(response) # Raises on errors, returns response on success
# Return success result with response data
success(result: response.body)
end
# Optional: Custom connection configuration
def configure_connection
Faraday.new(base_url) do |conn|
conn.request :json
conn.response :json
conn.options.timeout = 30
end
end
end
HTTP Methods Available:
- get(path, params: {}, headers: {})
- post(path, data: {}, headers: {})
- put(path, data: {}, headers: {})
- delete(path, params: {}, headers: {})
Error Classification:
| Status | Classification | Behavior |
|---|---|---|
| 400, 401, 403, 404, 422 | Permanent | No retry |
| 429 | Retryable | Respect Retry-After |
| 500-599 | Retryable | Standard backoff |
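The table reduces to a simple case analysis. A standalone sketch (the method name is hypothetical — the gem applies this classification inside process_response):

```ruby
# Illustrative HTTP status classification (not the gem's internal API)
def classify_status(status)
  case status
  when 400, 401, 403, 404, 422 then :permanent # no retry
  when 429 then :retryable                     # honor Retry-After
  when 500..599 then :retryable                # standard backoff
  else :unknown
  end
end
```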
Decision Handler
Location: lib/tasker_core/step_handler/decision.rb
For dynamic workflow routing:
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
def call(context)
amount = context.get_task_field('amount')
if amount < 1000
# Auto-approve small amounts
decision_success(
steps: ['auto_approve'],
result_data: { route_type: 'auto', amount: amount }
)
elsif amount < 5000
# Manager approval for medium amounts
decision_success(
steps: ['manager_approval'],
result_data: { route_type: 'manager', amount: amount }
)
else
# Dual approval for large amounts
decision_success(
steps: ['manager_approval', 'finance_review'],
result_data: { route_type: 'dual', amount: amount }
)
end
end
end
Decision Methods:
- decision_success(steps:, result_data: {}) - Create steps dynamically
- decision_no_branches(result_data: {}) - Skip conditional steps
Batchable Handler
Location: lib/tasker_core/step_handler/batchable.rb
For processing large datasets in chunks:
Breaking Change: Cursors are now 0-indexed (previously 1-indexed) to match Python, TypeScript, and Rust.
class CsvBatchProcessorHandler < TaskerCore::StepHandler::Batchable
def call(context)
# Extract batch context from step inputs
batch_ctx = get_batch_context(context)
# Handle no-op placeholder batches
no_op_result = handle_no_op_worker(batch_ctx)
return no_op_result if no_op_result
# Process this batch
csv_file = context.get_dependency_result('analyze_csv')&.dig('csv_file_path')
records = read_csv_batch(csv_file, batch_ctx.start_cursor, batch_ctx.batch_size)
processed = records.map { |record| transform_record(record) }
# Return batch completion
batch_worker_success(
items_processed: processed.size,
items_succeeded: processed.size,
results: processed
)
end
end
Batch Helper Methods:
- get_batch_context(context) - Get batch boundaries from StepContext
- handle_no_op_worker(batch_ctx) - Handle placeholder batches
- batch_worker_success(items_processed:, items_succeeded:, ...) - Complete batch
- create_cursor_configs(total_items, worker_count) - Create 0-indexed cursor ranges
Cursor Indexing:
# Creates 0-indexed cursor ranges
configs = create_cursor_configs(1000, 5)
# => [
# { batch_id: '1', start_cursor: 0, end_cursor: 200 },
# { batch_id: '2', start_cursor: 200, end_cursor: 400 },
# { batch_id: '3', start_cursor: 400, end_cursor: 600 },
# { batch_id: '4', start_cursor: 600, end_cursor: 800 },
# { batch_id: '5', start_cursor: 800, end_cursor: 1000 }
# ]
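The arithmetic behind those ranges can be sketched in plain Ruby (illustrative, not the gem's implementation): each worker gets a half-open `[start_cursor, end_cursor)` slice, with the last slice clamped to the total.

```ruby
# Illustrative 0-indexed cursor partitioning
def create_cursor_configs(total_items, worker_count)
  batch_size = (total_items.to_f / worker_count).ceil
  (0...worker_count).map do |i|
    start_cursor = i * batch_size
    {
      batch_id: (i + 1).to_s,
      start_cursor: start_cursor,
      end_cursor: [start_cursor + batch_size, total_items].min
    }
  end
end

configs = create_cursor_configs(1000, 5)
# first range starts at 0; last range ends exactly at total_items
```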
Handler Registry
Registration
Location: lib/tasker_core/registry/handler_registry.rb
registry = TaskerCore::Registry::HandlerRegistry.instance
# Manual registration
registry.register_handler('ProcessOrderHandler', ProcessOrderHandler)
# Check availability
registry.handler_available?('ProcessOrderHandler') # => true
# List all handlers
registry.registered_handlers # => ["ProcessOrderHandler", ...]
Discovery Modes
- Preloaded Handlers (test environment)
  - ObjectSpace scanning for loaded handler classes
- Template-Driven Discovery
  - YAML templates define handler references
  - Handlers loaded from configured paths
Handler Search Paths
app/handlers/
lib/handlers/
handlers/
app/tasker/handlers/
lib/tasker/handlers/
spec/handlers/examples/ (test environment)
Configuration
Bootstrap Configuration
Bootstrap configuration is controlled via TOML files, not Ruby parameters:
# config/tasker/base/worker.toml
[web]
enabled = true # Set to false for headless/embedded mode
bind_address = "0.0.0.0"
port = 8080
# Ruby bootstrap is simple - config comes from TOML
TaskerCore::Worker::Bootstrap.start!
Handler Configuration
class MyHandler < TaskerCore::StepHandler::Base
def initialize(config: {})
super
@timeout = config[:timeout] || 30
@max_retries = config[:retries] || 3
end
def config_schema
{
type: 'object',
properties: {
timeout: { type: 'integer', minimum: 1, default: 30 },
retries: { type: 'integer', minimum: 0, default: 3 }
}
}
end
end
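A config_schema like this can drive default-filling before the handler runs. A standalone sketch (apply_defaults is a hypothetical helper, not part of the gem):

```ruby
# Hypothetical helper: fill missing config keys from schema defaults
def apply_defaults(schema, config)
  schema[:properties].each_with_object({}) do |(key, spec), merged|
    merged[key] = config.fetch(key, spec[:default])
  end
end

schema = {
  type: 'object',
  properties: {
    timeout: { type: 'integer', minimum: 1, default: 30 },
    retries: { type: 'integer', minimum: 0, default: 3 }
  }
}

apply_defaults(schema, { timeout: 10 })
# => { timeout: 10, retries: 3 }
```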
Signal Handling
The Ruby worker handles multiple signals:
| Signal | Behavior |
|---|---|
| SIGTERM | Graceful shutdown |
| SIGINT | Graceful shutdown (Ctrl+C) |
| SIGUSR1 | Report worker status |
| SIGUSR2 | Reload configuration |
# Status reporting
Signal.trap('USR1') do
logger.info "Worker Status: #{bootstrap.status.inspect}"
end
# Configuration reload
Signal.trap('USR2') do
bootstrap.reload_config
end
Error Handling
Exception Classes
TaskerCore::Errors::Error # Base class
├── TaskerCore::Errors::ConfigurationError # Configuration issues
├── TaskerCore::Errors::FFIError # FFI bridge errors
├── TaskerCore::Errors::ProceduralError # Base for workflow errors
│ ├── TaskerCore::Errors::RetryableError # Transient failures
│ ├── TaskerCore::Errors::PermanentError # Unrecoverable failures
│ │ ├── TaskerCore::Errors::ValidationError # Validation failures
│ │ └── TaskerCore::Errors::NotFoundError # Resource not found
│ ├── TaskerCore::Errors::TimeoutError # Timeout failures
│ └── TaskerCore::Errors::NetworkError # Network failures
└── TaskerCore::Errors::ServerError # Embedded server errors
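Because the hierarchy is rescue-friendly, callers can classify by ancestry alone — ValidationError is permanent simply because it inherits from PermanentError. A sketch with stand-in classes (not the gem's definitions):

```ruby
# Stand-in classes mirroring the documented hierarchy (illustrative only)
module Errors
  class Error < StandardError; end
  class ProceduralError < Error; end
  class RetryableError < ProceduralError; end
  class PermanentError < ProceduralError; end
  class ValidationError < PermanentError; end
end

# Ancestry decides retry behavior via case/when (===) matching
def retry_disposition(error)
  case error
  when Errors::RetryableError then :retry
  when Errors::PermanentError then :fail_fast
  else :unexpected
  end
end
```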
Raising Errors
def call(context)
# Retryable error (will be retried)
raise TaskerCore::Errors::RetryableError.new(
'Database connection timeout',
retry_after: 5,
context: { service: 'database' }
)
# Permanent error (no retry)
raise TaskerCore::Errors::PermanentError.new(
'Invalid order format',
error_code: 'INVALID_ORDER',
context: { field: 'order_id' }
)
# Validation error (permanent, with field info)
raise TaskerCore::Errors::ValidationError.new(
'Email format is invalid',
field: 'email',
error_code: 'INVALID_EMAIL'
)
end
Logging
Structured Logging (Recommended)
New code should use TaskerCore::Tracing for unified structured logging via FFI:
# Recommended: Use Tracing directly
TaskerCore::Tracing.info('Processing order', {
order_id: order.id,
amount: order.total,
customer_id: order.customer_id
})
TaskerCore::Tracing.error('Payment failed', {
error_code: 'DECLINED',
card_last_four: '1234'
})
Legacy Logger (Deprecated)
Note: TaskerCore::Logger is maintained for backward compatibility but delegates to TaskerCore::Tracing. New code should use Tracing directly.
# Legacy (still works, but deprecated)
logger = TaskerCore::Logger.instance
logger.info('Processing order', {
order_id: order.id,
amount: order.total
})
Log Levels
Controlled via RUST_LOG environment variable:
- trace - Very detailed debugging
- debug - Debugging information
- info - Normal operation
- warn - Warning conditions
- error - Error conditions
File Structure
workers/ruby/
├── bin/
│ ├── server.rb # Production server
│ └── health_check.rb # Health check script
├── ext/
│ └── tasker_core/
│ └── extconf.rb # FFI extension config
├── lib/
│ └── tasker_core/
│ ├── errors.rb # Exception classes
│ ├── handlers.rb # Handler namespace
│ ├── internal.rb # Internal modules
│ ├── logger.rb # Logging
│ ├── models.rb # Type models
│ ├── registry/
│ │ ├── handler_registry.rb
│ │ └── step_handler_resolver.rb
│ ├── step_handler/
│ │ ├── base.rb # Base handler
│ │ ├── api.rb # API handler
│ │ ├── decision.rb # Decision handler
│ │ └── batchable.rb # Batch handler
│ ├── types/ # Type definitions
│ └── version.rb
├── spec/
│ ├── handlers/examples/ # Example handlers
│ └── integration/ # Integration tests
├── Gemfile
└── tasker_core.gemspec
Testing
Unit Tests
cd workers/ruby
bundle exec rspec spec/
Integration Tests
DATABASE_URL=postgresql://... bundle exec rspec spec/integration/
E2E Tests (from project root)
DATABASE_URL=postgresql://... \
TASKER_ENV=test \
bundle exec rspec spec/handlers/
Example Handlers
Linear Workflow
DSL (recommended):
extend TaskerCore::StepHandler::Functional
LinearStep1Handler = step_handler(
'LinearWorkflow::StepHandlers::LinearStep1Handler'
) do |context:|
{
step1_processed: true,
input_received: context.context,
processed_at: Time.now.iso8601
}
end
Class-based:
# spec/handlers/examples/linear_workflow/step_handlers/linear_step_1_handler.rb
module LinearWorkflow
module StepHandlers
class LinearStep1Handler < TaskerCore::StepHandler::Base
def call(context)
input = context.context # Full task context
success(result: {
step1_processed: true,
input_received: input,
processed_at: Time.now.iso8601
})
end
end
end
end
Order Fulfillment
DSL (recommended):
ValidateOrderHandler = step_handler(
'ValidateOrderHandler'
) do |context:|
order = context.context
unless order['items']&.any?
raise TaskerCore::Errors::PermanentError.new(
'Order must have at least one item',
error_code: 'EMPTY_ORDER'
)
end
{
valid: true,
item_count: order['items'].size,
total: calculate_total(order['items'])
}
end
Class-based:
class ValidateOrderHandler < TaskerCore::StepHandler::Base
def call(context)
order = context.context # Full task context
unless order['items']&.any?
return failure(
message: 'Order must have at least one item',
error_type: 'ValidationError',
error_code: 'EMPTY_ORDER',
retryable: false
)
end
success(result: {
valid: true,
item_count: order['items'].size,
total: calculate_total(order['items'])
})
end
end
Conditional Approval
DSL (recommended):
RoutingDecisionHandler = decision_handler(
'RoutingDecisionHandler',
inputs: [:amount]
) do |amount:, context:|
if amount.to_f < 1000
Decision.route(['auto_approve'], route_type: 'automatic')
elsif amount.to_f < 5000
Decision.route(['manager_approval'], route_type: 'manager')
else
Decision.route(['manager_approval', 'finance_review'], route_type: 'dual')
end
end
Class-based:
class RoutingDecisionHandler < TaskerCore::StepHandler::Decision
THRESHOLDS = {
auto_approve: 1000,
manager_only: 5000
}.freeze
def call(context)
amount = context.get_task_field('amount').to_f
if amount < THRESHOLDS[:auto_approve]
decision_success(steps: ['auto_approve'])
elsif amount < THRESHOLDS[:manager_only]
decision_success(steps: ['manager_approval'])
else
decision_success(steps: ['manager_approval', 'finance_review'])
end
end
end
See Also
- Worker Crates Overview - High-level introduction
- Patterns and Practices - Common patterns
- Python Worker - Python implementation
- Worker Event Systems - Architecture details
Rust Worker
Last Updated: 2026-01-01
Audience: Rust Developers
Status: Active
Package: workers-rust
Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix
The Rust worker is the native, high-performance implementation for workflow step execution. It demonstrates the full capability of the tasker-worker foundation with zero FFI overhead.
Quick Start
Running the Server
cd workers/rust
cargo run
With Custom Configuration
TASKER_CONFIG_PATH=/path/to/config.toml cargo run
Environment Variables
| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | Required |
| TASKER_CONFIG_PATH | Path to TOML configuration | Auto-detected |
| RUST_LOG | Log level | info |
Architecture
Entry Point
Location: workers/rust/src/main.rs
#[tokio::main]
async fn main() -> Result<()> {
// Initialize structured logging
tasker_shared::logging::init_tracing();
// Bootstrap worker system
let bootstrap_result = bootstrap().await?;
// Start event handler (legacy path); move the handler out so the
// rest of the bootstrap result stays usable after the spawn
let mut event_handler = bootstrap_result.event_handler;
tokio::spawn(async move {
    event_handler.start().await
});
// Wait for shutdown signal
tokio::select! {
_ = tokio::signal::ctrl_c() => { /* shutdown */ }
_ = wait_for_sigterm() => { /* shutdown */ }
}
bootstrap_result.worker_handle.stop()?;
Ok(())
}
Bootstrap Process
Location: workers/rust/src/bootstrap.rs
The bootstrap process:
- Creates step handler registry with all handlers
- Sets up global event system
- Bootstraps the tasker-worker foundation
- Creates domain event publisher registry
- Spawns HandlerDispatchService for non-blocking dispatch
- Creates event handler for legacy path
#![allow(unused)]
fn main() {
pub async fn bootstrap() -> Result<RustWorkerBootstrapResult> {
// Create registry with all handlers
let registry = Arc::new(RustStepHandlerRegistry::new());
// Bootstrap worker foundation
let worker_handle = WorkerBootstrap::bootstrap_with_event_system(...).await?;
// Set up dispatch service (non-blocking path)
let dispatch_service = HandlerDispatchService::with_callback(...);
Ok(RustWorkerBootstrapResult {
worker_handle,
event_handler,
dispatch_service_handle,
})
}
}
Handler Dispatch
The Rust worker uses the HandlerDispatchService for non-blocking handler execution:
┌────────────────────────────────────────────────────────────────┐
│ RUST HANDLER DISPATCH │
└────────────────────────────────────────────────────────────────┘
PGMQ Queue
│
▼
┌─────────────┐
│ Dispatch │
│ Channel │
└─────────────┘
│
▼
┌─────────────────────────────────────────┐
│ HandlerDispatchService │
│ ┌────────────────────────────────────┐ │
│ │ Semaphore (10 permits) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ handler.call(&step_data).await │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ DomainEventCallback │ │
│ └────────────────────────────────────┘ │
└─────────────────────────────────────────┘
│
▼
┌─────────────┐
│ Completion │
│ Channel │
└─────────────┘
│
▼
Orchestration
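The permit-bounded dispatch in the diagram can be sketched without tokio: a fixed pool of worker threads stands in for the semaphore, and channels stand in for the dispatch and completion channels. Names and shapes here are illustrative, not the crate's API:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Sketch: `permits` worker threads bound concurrency the way the
// semaphore in HandlerDispatchService does (illustrative only).
fn dispatch_bounded(
    handlers: Vec<Box<dyn FnOnce() -> String + Send>>,
    permits: usize,
) -> Vec<String> {
    let (job_tx, job_rx) = mpsc::channel();
    let job_rx = Arc::new(Mutex::new(job_rx)); // shared dispatch channel
    let (done_tx, done_rx) = mpsc::channel();  // completion channel

    for h in handlers {
        job_tx.send(h).unwrap();
    }
    drop(job_tx); // close the dispatch channel so workers can finish

    let mut workers = Vec::new();
    for _ in 0..permits {
        let job_rx = Arc::clone(&job_rx);
        let done_tx = done_tx.clone();
        workers.push(thread::spawn(move || loop {
            // Hold the lock only long enough to pull one job
            let job = { job_rx.lock().unwrap().recv() };
            match job {
                Ok(handler) => done_tx.send(handler()).unwrap(),
                Err(_) => break, // channel closed: no more work
            }
        }));
    }
    drop(done_tx);
    let results: Vec<String> = done_rx.into_iter().collect();
    for w in workers {
        w.join().unwrap();
    }
    results
}
```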
Handler Development
Capability Traits
Rust uses traits for handler composition, matching the mixin pattern in Ruby/Python/TypeScript.
Location: tasker-worker/src/handler_capabilities.rs
APICapable Trait
For HTTP API integration:
#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::APICapable;
impl APICapable for MyHandler {
// Use the helper methods:
// - api_success(step_uuid, data, status, headers, execution_time_ms)
// - api_failure(step_uuid, message, status, error_type, execution_time_ms)
// - classify_status_code(status) -> ErrorClassification
}
}
DecisionCapable Trait
For dynamic workflow routing:
#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::DecisionCapable;
impl DecisionCapable for MyHandler {
// Use the helper methods:
// - decision_success(step_uuid, step_names, routing_context, execution_time_ms)
// - skip_branches(step_uuid, reason, routing_context, execution_time_ms)
// - decision_failure(step_uuid, message, error_type, execution_time_ms)
}
}
BatchableCapable Trait
For batch processing:
#![allow(unused)]
fn main() {
use tasker_worker::handler_capabilities::BatchableCapable;
impl BatchableCapable for MyHandler {
// Use the helper methods:
// - create_cursor_configs(total_items, worker_count) -> Vec<CursorConfig>
// - create_cursor_ranges(total_items, batch_size, max_batches) -> Vec<CursorConfig>
// - batch_analyzer_success(step_uuid, worker_template, configs, total_items, ...)
// - batch_worker_success(step_uuid, processed, succeeded, failed, skipped, ...)
// - no_batches_outcome(step_uuid, reason, execution_time_ms)
// - batch_failure(step_uuid, message, error_type, retryable, ...)
}
}
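The cursor helpers follow the same 0-indexed partitioning used by every worker language. A standalone sketch of the arithmetic (the `CursorConfig` shape here is assumed; the real type lives in tasker-worker and may carry more fields):

```rust
// Illustrative 0-indexed cursor partitioning (assumed CursorConfig shape)
#[derive(Debug, PartialEq)]
pub struct CursorConfig {
    pub batch_id: String,
    pub start_cursor: usize,
    pub end_cursor: usize,
}

pub fn create_cursor_configs(total_items: usize, worker_count: usize) -> Vec<CursorConfig> {
    // Ceiling division so every item lands in exactly one batch
    let batch_size = total_items.div_ceil(worker_count);
    (0..worker_count)
        .map(|i| {
            let start = i * batch_size;
            CursorConfig {
                batch_id: (i + 1).to_string(),
                start_cursor: start,
                // Half-open range, clamped so the last batch ends at total_items
                end_cursor: (start + batch_size).min(total_items),
            }
        })
        .collect()
}
```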
Composing Multiple Traits
#![allow(unused)]
fn main() {
// Implement multiple capability traits for a single handler
pub struct CompositeHandler {
config: StepHandlerConfig,
}
impl APICapable for CompositeHandler {}
impl DecisionCapable for CompositeHandler {}
#[async_trait]
impl RustStepHandler for CompositeHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Use API capability to fetch data
let response = self.call_api("/users/123").await?;
// Use Decision capability to route based on response
if response.status == 200 {
self.decision_success(step_uuid, vec!["process_user"], None, 50)
} else {
self.api_failure(step_uuid, "API failed", response.status, "api_error", 50)
}
}
}
}
Handler Trait
Location: workers/rust/src/step_handlers/mod.rs
All Rust handlers implement the RustStepHandler trait:
#![allow(unused)]
fn main() {
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;
#[async_trait]
pub trait RustStepHandler: Send + Sync {
/// Handler name for registration
fn name(&self) -> &str;
/// Execute the handler
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult>;
/// Create a new instance with configuration from YAML
fn new(config: StepHandlerConfig) -> Self where Self: Sized;
}
}
Creating a Handler
#![allow(unused)]
fn main() {
use async_trait::async_trait;
use anyhow::Result;
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;
use crate::step_handlers::{RustStepHandler, StepHandlerConfig, success_result};
use serde_json::json;
pub struct ProcessOrderHandler {
_config: StepHandlerConfig,
}
#[async_trait]
impl RustStepHandler for ProcessOrderHandler {
fn name(&self) -> &str {
"process_order"
}
fn new(config: StepHandlerConfig) -> Self {
Self { _config: config }
}
async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
let start_time = std::time::Instant::now();
let step_uuid = step_data.workflow_step.workflow_step_uuid;
// Extract input from task context
let order_id = step_data.task.context
.get("order_id")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing order_id"))?;
// Process the order
let result = process_order(order_id).await?;
// Return success using helper function
Ok(success_result(
step_uuid,
json!({
"order_id": order_id,
"status": "processed",
"total": result.total
}),
start_time.elapsed().as_millis() as i64,
None,
))
}
}
}
Handler Registration
Location: workers/rust/src/step_handlers/registry.rs
Handlers are registered in the RustStepHandlerRegistry:
#![allow(unused)]
fn main() {
pub struct RustStepHandlerRegistry {
handlers: HashMap<String, Arc<dyn RustStepHandler>>,
}
impl RustStepHandlerRegistry {
pub fn new() -> Self {
let mut registry = Self {
handlers: HashMap::new(),
};
registry.register_all_handlers();
registry
}
fn register_all_handlers(&mut self) {
let empty_config = StepHandlerConfig::empty();
// Linear workflow handlers
self.register_handler(Arc::new(LinearStep1Handler::new(empty_config.clone())));
self.register_handler(Arc::new(LinearStep2Handler::new(empty_config.clone())));
// Order fulfillment handlers
self.register_handler(Arc::new(ValidateOrderHandler::new(empty_config.clone())));
self.register_handler(Arc::new(ProcessPaymentHandler::new(empty_config.clone())));
// ... more handlers
}
fn register_handler(&mut self, handler: Arc<dyn RustStepHandler>) {
let name = handler.name().to_string();
self.handlers.insert(name, handler);
}
pub fn get_handler(&self, name: &str) -> Result<Arc<dyn RustStepHandler>, RustStepHandlerError> {
self.handlers
.get(name)
.cloned()
.ok_or_else(|| RustStepHandlerError::SystemError {
message: format!("Handler '{}' not found in registry", name),
})
}
}
}
Example Handlers
Linear Workflow
Location: workers/rust/src/step_handlers/linear_workflow.rs
Simple sequential workflow with 4 steps:
#![allow(unused)]
fn main() {
pub struct LinearStep1Handler;
#[async_trait]
impl RustStepHandler for LinearStep1Handler {
fn name(&self) -> &str {
"linear_step_1"
}
async fn call(&self, step_data: &StepExecutionData) -> Result<StepHandlerResult> {
info!("LinearStep1Handler: Processing step");
let input = step_data.input_data.clone();
let mut result = serde_json::Map::new();
result.insert("step1_processed".to_string(), json!(true));
result.insert("input_received".to_string(), input);
Ok(StepHandlerResult::success(json!(result)))
}
}
}
Diamond Workflow
Location: workers/rust/src/step_handlers/diamond_workflow.rs
Parallel branching with convergence:
┌─────┐
│Start│
└──┬──┘
│
┌────┴────┐
▼ ▼
┌───┐ ┌───┐
│ B │ │ C │
└─┬─┘ └─┬─┘
│ │
└────┬────┘
▼
┌─────┐
│ End │
└─────┘
Batch Processing
Location: workers/rust/src/step_handlers/batch_processing_products_csv.rs
Three-phase batch processing:
- Analyzer: Counts total records
- Batch Processor: Processes chunks
- Aggregator: Combines results
#![allow(unused)]
fn main() {
pub struct CsvBatchProcessorHandler;
#[async_trait]
impl RustStepHandler for CsvBatchProcessorHandler {
fn name(&self) -> &str {
"csv_batch_processor"
}
async fn call(&self, step_data: &StepExecutionData) -> Result<StepHandlerResult> {
let batch_size = step_data.step_inputs
.get("batch_size")
.and_then(|v| v.as_u64())
.unwrap_or(100) as usize;
let start_cursor = step_data.step_inputs
.get("start_cursor")
.and_then(|v| v.as_u64())
.unwrap_or(0) as usize;
// Process records in batch
let processed = process_batch(start_cursor, batch_size).await?;
Ok(StepHandlerResult::success(json!({
"processed_count": processed,
"batch_complete": true
})))
}
}
}
Error Injection (Testing)
Location: workers/rust/src/step_handlers/error_injection/
Handlers for testing retry behavior:
#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicU32, Ordering};

pub struct FailNTimesHandler {
    fail_count: u32,
    attempts: AtomicU32,
}
impl FailNTimesHandler {
    /// Create a handler that fails N times before succeeding
    pub fn new(fail_count: u32) -> Self {
        Self { fail_count, attempts: AtomicU32::new(0) }
    }
}
#[async_trait]
impl RustStepHandler for FailNTimesHandler {
async fn call(&self, _step_data: &StepExecutionData) -> Result<StepHandlerResult> {
let attempt = self.attempts.fetch_add(1, Ordering::SeqCst);
if attempt < self.fail_count {
Ok(StepHandlerResult::failure(
"Intentional failure for testing",
"test_error",
true, // retryable
))
} else {
Ok(StepHandlerResult::success(json!({"attempts": attempt + 1})))
}
}
}
}
Domain Events
Post-Execution Publishing
Handlers can publish domain events after step execution using the StepEventPublisher trait:
use async_trait::async_trait;
use serde_json::json;
use std::sync::Arc;
use tasker_shared::events::domain_events::DomainEventPublisher;
use tasker_worker::worker::step_event_publisher::{
    StepEventPublisher, StepEventContext, PublishResult
};

#[derive(Debug)]
pub struct PaymentEventPublisher {
    domain_publisher: Arc<DomainEventPublisher>,
}

impl PaymentEventPublisher {
    pub fn new(domain_publisher: Arc<DomainEventPublisher>) -> Self {
        Self { domain_publisher }
    }
}

#[async_trait]
impl StepEventPublisher for PaymentEventPublisher {
    fn name(&self) -> &str {
        "PaymentEventPublisher"
    }

    fn domain_publisher(&self) -> &Arc<DomainEventPublisher> {
        &self.domain_publisher
    }

    async fn publish(&self, ctx: &StepEventContext) -> PublishResult {
        let mut result = PublishResult::default();
        if ctx.step_succeeded() {
            let payload = json!({
                "order_id": ctx.execution_result.result["order_id"],
                "amount": ctx.execution_result.result["amount"],
            });
            // Uses the default publish_event impl from the trait
            match self.publish_event(ctx, "payment.completed", payload).await {
                Ok(event_id) => result.published.push(event_id),
                Err(e) => result.errors.push(e.to_string()),
            }
        }
        result
    }
}
Dual-Path Delivery
Events can route to different delivery paths:
| Path | Description | Use Case |
|---|---|---|
| `durable` | Published to PGMQ | External consumers, audit |
| `fast` | In-process bus | Metrics, telemetry |
Configuration
Bootstrap Configuration
pub struct WorkerBootstrapConfig {
    pub worker_id: String,
    pub enable_web_api: bool,
    pub event_driven_enabled: bool,
    pub deployment_mode_hint: Option<String>,
}

// Example configuration; remaining fields are filled from Default
let config = WorkerBootstrapConfig {
    worker_id: "rust-worker-001".to_string(),
    enable_web_api: true,
    event_driven_enabled: true,
    deployment_mode_hint: Some("Hybrid".to_string()),
    ..Default::default()
};
Dispatch Configuration
let config = HandlerDispatchConfig {
    max_concurrent_handlers: 10,
    handler_timeout: Duration::from_secs(30),
    service_id: "rust-handler-dispatch".to_string(),
    load_shedding: LoadSheddingConfig::default(),
};
Signal Handling
The Rust worker handles graceful shutdown:
// Wait for a shutdown signal
tokio::select! {
    _ = tokio::signal::ctrl_c() => {
        info!("Received Ctrl+C, initiating graceful shutdown...");
    }
    _ = wait_for_sigterm() => {
        info!("Received SIGTERM, initiating graceful shutdown...");
    }
}

// Graceful shutdown sequence
bootstrap_result.worker_handle.stop()?;
Performance
Characteristics
- Zero FFI Overhead: Native Rust handlers
- Async/Await: Non-blocking I/O with Tokio
- Bounded Concurrency: Semaphore-limited parallelism
- Memory Safety: Rust’s ownership model
Benchmarking
# Run with release optimizations
cargo run --release
# With verbose trace-level logging
RUST_LOG=trace cargo run --release
File Structure
workers/rust/
├── src/
│ ├── main.rs # Entry point
│ ├── bootstrap.rs # Worker initialization
│ ├── lib.rs # Library exports
│ ├── event_handler.rs # Event bridging (legacy)
│ ├── global_event_system.rs # Global event coordination
│ ├── step_handlers/
│ │ ├── mod.rs # Handler traits and types
│ │ ├── registry.rs # Handler registry
│ │ ├── linear_workflow.rs # Linear workflow handlers
│ │ ├── diamond_workflow.rs # Diamond workflow handlers
│ │ ├── tree_workflow.rs # Tree workflow handlers
│ │ ├── mixed_dag_workflow.rs
│ │ ├── order_fulfillment.rs
│ │ ├── batch_processing_*.rs
│ │ ├── error_injection/ # Test handlers
│ │ └── domain_event_*.rs # Event publishing
│ └── event_subscribers/
│ ├── mod.rs
│ ├── logging_subscriber.rs
│ └── metrics_subscriber.rs
├── Cargo.toml
└── tests/
Testing
Unit Tests
cargo test -p workers-rust
Integration Tests
# With database
DATABASE_URL=postgresql://... cargo test -p workers-rust --test integration
E2E Tests
# From project root
DATABASE_URL=postgresql://... cargo nextest run --package workers-rust
See Also
- Worker Crates Overview - High-level introduction
- Patterns and Practices - Common patterns
- Worker Event Systems - Architecture details
- Worker Actors - Actor pattern documentation
TypeScript Worker
Last Updated: 2026-01-01
Audience: TypeScript/JavaScript Developers
Status: Active
Package: @tasker-systems/tasker
Related Docs: Patterns and Practices | Worker Event Systems | API Convergence Matrix
← Back to Worker Crates Overview
The TypeScript worker provides a native Node-API (napi-rs) interface for integrating tasker-core workflow execution into TypeScript/JavaScript applications. It supports Bun (primary) and Node.js runtimes with native addon bindings to the Rust worker foundation.
Quick Start
Installation
cd workers/typescript
bun install # Install dependencies
bun run build:napi:release # Build napi-rs native addon
Running the Server
# With Bun (recommended for production)
bun run bin/server.ts
# With Node.js
npx tsx bin/server.ts
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | Required |
| `TASKER_ENV` | Environment (test/development/production) | `development` |
| `TASKER_CONFIG_PATH` | Path to TOML configuration | Auto-detected |
| `TASKER_TEMPLATE_PATH` | Path to task templates | Auto-detected |
| `TASKER_FFI_MODULE_PATH` | Override path to napi-rs `.node` file (optional) | Auto-detected |
| `RUST_LOG` | Log level (trace/debug/info/warn/error) | `info` |
| `PORT` | HTTP server port | `8081` |
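A worker resolves these variables with sensible fallbacks. The hypothetical helper below sketches that resolution for the defaults in the table; it is not part of the package, and `resolveWorkerEnv` is an invented name for illustration.

```typescript
// Hypothetical sketch of environment resolution, mirroring the
// defaults in the table above. Not part of @tasker-systems/tasker.
function resolveWorkerEnv(env: Record<string, string | undefined> = process.env) {
  const databaseUrl = env.DATABASE_URL;
  if (!databaseUrl) {
    // DATABASE_URL has no default; fail fast when it is missing
    throw new Error('DATABASE_URL is required');
  }
  return {
    databaseUrl,
    taskerEnv: env.TASKER_ENV ?? 'development',
    logLevel: env.RUST_LOG ?? 'info',
    port: Number(env.PORT ?? 8081),
  };
}
```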
Architecture
Server Mode
Location: workers/typescript/bin/server.ts
The server bootstraps the Rust foundation and manages TypeScript handler execution:
import { WorkerServer } from '@tasker-systems/tasker';
const server = new WorkerServer();
await server.start({ namespace: 'my-app' });
// Register handlers
const handlerSystem = server.getHandlerSystem();
handlerSystem.register('process_order', ProcessOrderHandler);
// Graceful shutdown on signal
process.on('SIGINT', () => server.shutdown());
WorkerServer internally orchestrates the FFI layer, event system (poller + subscriber), and handler registry. For details on the internal wiring, see the Worker Event Systems documentation.
Headless/Embedded Mode
For embedding in existing TypeScript applications (headless mode via TOML: web.enabled = false):
import { WorkerServer } from '@tasker-systems/tasker';
// Bootstrap worker — headless when web.enabled = false in TOML config
const server = new WorkerServer();
await server.start({ namespace: 'my-app' });
// Register handlers
const handlerSystem = server.getHandlerSystem();
handlerSystem.register('process_data', ProcessDataHandler);
// Server is now processing tasks without an HTTP server
Native Addon Bridge
TypeScript communicates with the Rust foundation via napi-rs native addons:
┌────────────────────────────────────────────────────────────────┐
│ TYPESCRIPT NAPI-RS BRIDGE │
└────────────────────────────────────────────────────────────────┘
Rust Worker System
│
│ Node-API (pollStepEvents)
▼
┌─────────────────────┐
│ EventPoller │
│ (setInterval) │──→ poll every 10ms
└─────────────────────┘
│
│ emit to EventEmitter
▼
┌─────────────────────┐
│ StepExecution │
│ Subscriber │──→ route to handler
└─────────────────────┘
│
│ handler.call(context)
▼
┌─────────────────────┐
│ Handler Execution │
└─────────────────────┘
│
│ Node-API (completeStepEvent)
▼
Rust Completion Channel
Runtime Support
| Runtime | Native Module | Status |
|---|---|---|
| Bun | napi-rs .node | Production (Primary) |
| Node.js | napi-rs .node | Production |
Handler Development
DSL Handlers (Recommended)
Both DSL and class-based handlers are fully supported. DSL is recommended for new projects. See Class-Based Handlers for the inheritance-based patterns.
The functional DSL provides a concise approach for defining handlers:
import { defineHandler } from '@tasker-systems/tasker';
export const ProcessOrderHandler = defineHandler(
'ProcessOrderHandler',
{ inputs: { orderId: 'order_id', amount: 'amount' } },
async ({ orderId, amount }) => ({
order_id: orderId,
status: 'processed',
total: Number(amount) * 1.1,
}),
);
Specialized DSL Handlers:
import {
defineDecisionHandler, defineApiHandler,
defineBatchAnalyzer, defineBatchWorker,
Decision, BatchConfig,
} from '@tasker-systems/tasker';
export const RoutingHandler = defineDecisionHandler(
'RoutingDecisionHandler',
{ inputs: { amount: 'amount' } },
async ({ amount }) => {
if (Number(amount) < 1000) {
return Decision.route(['auto_approve'], { routeType: 'automatic' });
}
return Decision.route(['manager_approval'], { routeType: 'manager' });
},
);
export const FetchUserHandler = defineApiHandler(
'Users.StepHandlers.FetchUserHandler',
{ baseUrl: 'https://api.example.com', inputs: { userId: 'user_id' } },
async ({ userId, api }) => {
const response = await api.get(`/users/${userId}`);
return api.apiSuccess({ user: response.body });
},
);
export const CsvAnalyzer = defineBatchAnalyzer(
'Csv.StepHandlers.CsvAnalyzerHandler',
{ inputs: { csvPath: 'csv_path' } },
async ({ csvPath }) => ({
totalItems: await countRows(csvPath as string),
batchSize: 100,
}),
);
export const CsvWorker = defineBatchWorker(
'Csv.StepHandlers.CsvWorkerHandler',
{ analyzerStep: 'analyze_csv' },
async ({ batchContext }) => ({ processed: batchContext.batchSize }),
);
Class-Based Handlers
The class-based patterns below remain fully supported.
Base Handler
Location: workers/typescript/src/handler/base.ts
All handlers extend StepHandler:
import { StepHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';
export class ProcessOrderHandler extends StepHandler {
static handlerName = 'process_order';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
// Access input data
const orderId = context.getInput<string>('order_id');
const amount = context.getInput<number>('amount');
// Business logic
const result = await this.processOrder(orderId, amount);
// Return success
return this.success({
order_id: orderId,
status: 'processed',
total: result.total
});
}
private async processOrder(orderId: string, amount: number) {
// Implementation
return { total: amount * 1.1 };
}
}
Handler Signature
async call(context: StepContext): Promise<StepHandlerResult>
// StepContext provides:
context.taskUuid // Task identifier
context.stepUuid // Step identifier
context.stepInputs // Runtime inputs
context.stepConfig // Handler configuration
context.dependencyResults // Results from parent steps
context.taskContext // Full task context
context.retryCount // Current retry attempt
// Type-safe accessors:
context.getInput<T>(key) // Get single input
context.getDependencyResult(stepName) // Get dependency result
context.getAllDependencyResults(name) // Get all instances (batch workers)
Result Methods
// Success result (from base class)
return this.success(
{ key: 'value' }, // result
{ duration_ms: 100 } // metadata (optional)
);
// Failure result (from base class)
return this.failure(
'Payment declined', // message
'payment_error', // errorType
true, // retryable
{ card_last_four: '1234' } // metadata (optional)
);
Error Types
import { ErrorType } from '@tasker-systems/tasker';
ErrorType.PERMANENT_ERROR // Non-retryable failures
ErrorType.RETRYABLE_ERROR // Retryable failures
ErrorType.VALIDATION_ERROR // Input validation failures
ErrorType.HANDLER_ERROR // Handler execution failures
Accessing Dependencies
async call(context: StepContext): Promise<StepHandlerResult> {
// Get result from a dependency step
const validation = context.getDependencyResult('validate_order') as {
valid: boolean;
amount: number;
} | null;
if (!validation) {
return this.failure('Missing validation result', 'dependency_error', false);
}
if (validation.valid) {
return this.success({ processed: true, amount: validation.amount });
}
return this.failure('Validation failed', 'validation_error', false);
}
Specialized Handlers
Mixin Pattern
TypeScript uses composition via mixins rather than inheritance. You can use either:
- Wrapper classes (ApiHandler, DecisionHandler) - simpler, backward compatible
- Mixin functions (applyAPI, applyDecision) - explicit composition
import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, APICapable } from '@tasker-systems/tasker';
// Using mixin pattern (recommended for new code)
class MyHandler extends StepHandler implements APICapable {
constructor() {
super();
applyAPI(this); // Adds get/post/put/delete methods
}
async call(context: StepContext): Promise<StepHandlerResult> {
const response = await this.get('/api/data');
return this.apiSuccess(response);
}
}
// Or using wrapper class (simpler, backward compatible)
import { ApiHandler } from '@tasker-systems/tasker';
class MyHandler extends ApiHandler {
async call(context: StepContext): Promise<StepHandlerResult> {
const response = await this.get('/api/data');
return this.apiSuccess(response);
}
}
API Handler
Location: workers/typescript/src/handler/api.ts
For HTTP API integration with automatic error classification:
import { ApiHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';
export class FetchUserHandler extends ApiHandler {
static handlerName = 'fetch_user';
static handlerVersion = '1.0.0';
protected baseUrl = 'https://api.example.com';
async call(context: StepContext): Promise<StepHandlerResult> {
const userId = context.getInput<string>('user_id');
// Automatic error classification
const response = await this.get(`/users/${userId}`);
if (!response.ok) {
return this.apiFailure(response);
}
return this.apiSuccess(response);
}
}
HTTP Methods:
// GET request
const response = await this.get('/path', {
params: { key: 'value' },
headers: { 'Authorization': 'Bearer token' }
});
// POST request
const response = await this.post('/path', {
body: { key: 'value' },
headers: {}
});
// PUT request
const response = await this.put('/path', { body: { key: 'value' } });
// DELETE request
const response = await this.delete('/path', { params: {} });
ApiResponse Properties:
response.statusCode // HTTP status code
response.headers // Response headers
response.body // Parsed body (object or string)
response.ok // True if 2xx status
response.isClientError // True if 4xx status
response.isServerError // True if 5xx status
response.isRetryable // True if should retry (408, 429, 500-504)
response.retryAfter // Retry-After header value in seconds
Error Classification:
| Status | Classification | Behavior |
|---|---|---|
| 400, 401, 403, 404, 422 | Non-retryable | Permanent failure |
| 408, 429, 500-504 | Retryable | Standard retry |
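This classification reduces to a small predicate over the status code. The sketch below mirrors the table's rules; it is illustrative, and `isRetryableStatus` is a hypothetical name, not the library's actual implementation.

```typescript
// Hypothetical helper mirroring the classification table above;
// not the package's actual code.
function isRetryableStatus(status: number): boolean {
  // 408 Request Timeout and 429 Too Many Requests are transient
  if (status === 408 || status === 429) return true;
  // 500-504 server errors get the standard retry treatment
  if (status >= 500 && status <= 504) return true;
  // Everything else (400, 401, 403, 404, 422, ...) is a permanent failure
  return false;
}
```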
Decision Handler
Location: workers/typescript/src/handler/decision.ts
For dynamic workflow routing:
import { DecisionHandler } from '@tasker-systems/tasker';
import type { StepContext, StepHandlerResult } from '@tasker-systems/tasker';
export class RoutingDecisionHandler extends DecisionHandler {
static handlerName = 'routing_decision';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
const amount = context.getInput<number>('amount') ?? 0;
if (amount < 1000) {
// Auto-approve small amounts
return this.decisionSuccess(['auto_approve'], {
route_type: 'auto',
amount
});
} else if (amount < 5000) {
// Manager approval for medium amounts
return this.decisionSuccess(['manager_approval'], {
route_type: 'manager',
amount
});
} else {
// Dual approval for large amounts
return this.decisionSuccess(['manager_approval', 'finance_review'], {
route_type: 'dual',
amount
});
}
}
}
Decision Methods:
// Activate specific steps
return this.decisionSuccess(
['step1', 'step2'], // steps to activate
{ route_reason: 'threshold' } // routing context
);
// No branches needed
return this.decisionNoBranches('condition not met');
BatchableStepHandler
Location: workers/typescript/src/handler/batchable.ts
For processing large datasets in chunks. Cross-language aligned with Ruby and Python implementations.
Analyzer Handler (creates batch configurations):
import { BatchableStepHandler } from '@tasker-systems/tasker';
import type { StepContext, BatchableResult } from '@tasker-systems/tasker';
export class CsvAnalyzerHandler extends BatchableStepHandler {
static handlerName = 'csv_analyzer';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<BatchableResult> {
const csvPath = context.getInput<string>('csv_path');
const rowCount = await this.countCsvRows(csvPath);
if (rowCount === 0) {
// No data to process - use cross-language standard
return this.noBatchesResult('empty_dataset', {
csv_path: csvPath,
analyzed_at: new Date().toISOString()
});
}
// Create cursor configs using Ruby-style helper
// Divides rowCount into 5 roughly equal batches
const batchConfigs = this.createCursorConfigs(rowCount, 5);
return this.batchSuccess('process_csv_batch', batchConfigs, {
csv_path: csvPath,
total_rows: rowCount,
analyzed_at: new Date().toISOString()
});
}
}
Worker Handler (processes a batch):
export class CsvBatchProcessorHandler extends BatchableStepHandler {
static handlerName = 'csv_batch_processor';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
// Cross-language standard: check for no-op worker first
const noOpResult = this.handleNoOpWorker(context);
if (noOpResult) {
return noOpResult;
}
// Get batch worker inputs from Rust
const batchInputs = this.getBatchWorkerInputs(context);
const cursor = batchInputs?.cursor;
if (!cursor) {
return this.failure('Missing batch cursor', 'batch_error', false);
}
// Process the batch
const results = await this.processCsvBatch(
cursor.start_cursor,
cursor.end_cursor
);
return this.success({
batch_id: cursor.batch_id,
rows_processed: results.count,
items_succeeded: results.success,
items_failed: results.failed
});
}
}
Aggregator Handler (combines results):
export class CsvAggregatorHandler extends StepHandler {
static handlerName = 'csv_aggregator';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
// Get all batch worker results
const workerResults = context.getAllDependencyResults('process_csv_batch') as Array<{
rows_processed: number;
items_succeeded: number;
items_failed: number;
} | null>;
// Aggregate results
let totalProcessed = 0;
let totalSucceeded = 0;
let totalFailed = 0;
for (const result of workerResults) {
if (result) {
totalProcessed += result.rows_processed ?? 0;
totalSucceeded += result.items_succeeded ?? 0;
totalFailed += result.items_failed ?? 0;
}
}
return this.success({
total_processed: totalProcessed,
total_succeeded: totalSucceeded,
total_failed: totalFailed,
worker_count: workerResults.length
});
}
}
BatchableStepHandler Methods (Cross-Language Aligned):
| Method | Ruby Equivalent | Purpose |
|---|---|---|
| `batchSuccess(template, configs, metadata)` | `batch_success` | Create batch workers |
| `noBatchesResult(reason, metadata)` | `no_batches_outcome` | Empty dataset handling |
| `createCursorConfigs(total, workers)` | `create_cursor_configs` | Divide work by worker count |
| `handleNoOpWorker(context)` | `handle_no_op_worker` | Detect no-op placeholders |
| `getBatchWorkerInputs(context)` | `get_batch_context` | Access Rust batch inputs |
| `aggregateWorkerResults(results)` | `aggregate_batch_worker_results` | Static aggregation helper |
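As a rough illustration of what `createCursorConfigs` computes, the sketch below divides a total item count into contiguous, roughly equal cursor ranges. Field names follow the worker example above, but this is a sketch under assumptions, not the package's actual helper; the real output shape may differ.

```typescript
interface CursorConfig {
  batch_id: number;
  start_cursor: number;
  end_cursor: number; // exclusive
}

// Divide `total` items into at most `workers` contiguous, roughly
// equal ranges. Illustrative sketch of cursor splitting.
function createCursorConfigsSketch(total: number, workers: number): CursorConfig[] {
  const configs: CursorConfig[] = [];
  if (total <= 0 || workers <= 0) return configs;
  const base = Math.floor(total / workers);
  const remainder = total % workers;
  let start = 0;
  for (let i = 0; i < workers && start < total; i++) {
    // Spread the remainder across the first `remainder` batches
    const size = base + (i < remainder ? 1 : 0);
    configs.push({ batch_id: i, start_cursor: start, end_cursor: start + size });
    start += size;
  }
  return configs;
}
```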
Handler Registry
Registration
Location: workers/typescript/src/handler/registry.ts
import { HandlerRegistry } from '@tasker-systems/tasker';
const registry = new HandlerRegistry();
// Manual registration
registry.register('process_order', ProcessOrderHandler);
// Check if registered
registry.isRegistered('process_order'); // true
// Resolve and instantiate
const handler = registry.resolve('process_order');
if (handler) {
const result = await handler.call(context);
}
// List all handlers
registry.listHandlers(); // ['process_order', ...]
// Handler count
registry.handlerCount(); // 1
Bulk Registration
import { registerExampleHandlers } from './handlers/examples/index.js';
// Register multiple handlers at once
registerExampleHandlers(registry);
Type System
Core Types
import type {
StepContext,
StepHandlerResult,
BatchableResult,
FfiStepEvent,
BootstrapConfig,
WorkerStatus,
} from '@tasker-systems/tasker';
// StepContext - created from FFI event
const context = StepContext.fromFfiEvent(event, 'handler_name');
context.taskUuid; // string
context.stepUuid; // string
context.stepInputs; // Record<string, unknown>
context.retryCount; // number
// StepHandlerResult - handler output
result.success; // boolean
result.result; // Record<string, unknown>
result.errorMessage; // string | undefined
result.retryable; // boolean
Configuration Types
import type { BootstrapConfig } from '@tasker-systems/tasker';
const config: BootstrapConfig = {
namespace: 'my-app',
environment: 'production',
configPath: '/path/to/config.toml'
};
Event System
EventEmitter
Location: workers/typescript/src/events/event-emitter.ts
import { EventEmitter } from '@tasker-systems/tasker';
import { StepEventNames } from '@tasker-systems/tasker';
const emitter = new EventEmitter();
// Subscribe to events
emitter.on(StepEventNames.STEP_EXECUTION_RECEIVED, (payload) => {
console.log(`Processing step: ${payload.event.step_uuid}`);
});
emitter.on(StepEventNames.STEP_EXECUTION_COMPLETED, (payload) => {
console.log(`Step completed: ${payload.stepUuid}`);
});
// Emit events
emitter.emit(StepEventNames.STEP_EXECUTION_RECEIVED, {
event: ffiStepEvent
});
Event Names
import { StepEventNames } from '@tasker-systems/tasker';
StepEventNames.STEP_EXECUTION_RECEIVED // Step event received from FFI
StepEventNames.STEP_EXECUTION_STARTED // Handler execution started
StepEventNames.STEP_EXECUTION_COMPLETED // Handler execution completed
StepEventNames.STEP_EXECUTION_FAILED // Handler execution failed
StepEventNames.STEP_COMPLETION_SENT // Result sent to FFI
EventPoller
Location: workers/typescript/src/events/event-poller.ts
import { EventPoller } from '@tasker-systems/tasker';
const poller = new EventPoller(runtime, emitter, {
  pollingIntervalMs: 10,        // Poll every 10ms
  starvationCheckInterval: 100, // In poll cycles: 100 polls × 10ms ≈ every 1 second
  cleanupInterval: 1000         // In poll cycles: 1000 polls × 10ms ≈ every 10 seconds
});
// Start polling
poller.start();
// Get metrics
const metrics = poller.getMetrics();
console.log(`Pending: ${metrics.pendingCount}`);
// Stop polling
poller.stop();
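Under the hood this is an interval-driven drain loop. The minimal, self-contained sketch below shows the pattern only; it is not the actual EventPoller. The `tick()` method is exposed so the loop can be driven deterministically in tests.

```typescript
type PollFn<T> = () => T[];

// Minimal poll-loop sketch: drain events from `poll` every
// `intervalMs` and hand each one to `onEvent`. Illustrative only.
class SimplePoller<T> {
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(
    private poll: PollFn<T>,
    private onEvent: (event: T) => void,
    private intervalMs = 10,
  ) {}

  start(): void {
    if (this.timer) return; // already running
    this.timer = setInterval(() => this.tick(), this.intervalMs);
  }

  // One drain cycle; called by the interval, or directly in tests
  tick(): void {
    for (const event of this.poll()) this.onEvent(event);
  }

  stop(): void {
    if (this.timer) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```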
Domain Events
TypeScript has full domain event support, matching Ruby and Python capabilities. The domain events module provides BasePublisher, BaseSubscriber, and registries for custom event handling.
Location: workers/typescript/src/handler/domain-events.ts
BasePublisher
Publishers transform step execution context into domain-specific events:
import { BasePublisher, StepEventContext, DomainEvent } from '@tasker-systems/tasker';
export class PaymentEventPublisher extends BasePublisher {
static publisherName = 'payment_events';
// Required: which steps trigger this publisher
publishesFor(): string[] {
return ['process_payment', 'refund_payment'];
}
// Transform step context into domain event
async transformPayload(ctx: StepEventContext): Promise<Record<string, unknown>> {
return {
payment_id: ctx.result?.payment_id,
amount: ctx.result?.amount,
currency: ctx.result?.currency,
status: ctx.result?.status
};
}
// Lifecycle hooks (optional)
async beforePublish(ctx: StepEventContext): Promise<void> {
console.log(`Publishing payment event for step: ${ctx.stepName}`);
}
async afterPublish(ctx: StepEventContext, event: DomainEvent): Promise<void> {
console.log(`Published event: ${event.eventName}`);
}
async onPublishError(ctx: StepEventContext, error: Error): Promise<void> {
console.error(`Failed to publish: ${error.message}`);
}
// Inject custom metadata
async additionalMetadata(ctx: StepEventContext): Promise<Record<string, unknown>> {
return { payment_processor: 'stripe' };
}
}
BaseSubscriber
Subscribers react to domain events matching specific patterns:
import { BaseSubscriber, InProcessDomainEvent, SubscriberResult } from '@tasker-systems/tasker';
export class AuditLoggingSubscriber extends BaseSubscriber {
static subscriberName = 'audit_logger';
// Which events to handle (glob patterns supported)
subscribesTo(): string[] {
return ['payment.*', 'order.completed'];
}
// Handle matching events
async handle(event: InProcessDomainEvent): Promise<SubscriberResult> {
await this.logToAuditTrail(event);
return { success: true };
}
// Lifecycle hooks (optional)
async beforeHandle(event: InProcessDomainEvent): Promise<void> {
console.log(`Handling: ${event.eventName}`);
}
async afterHandle(event: InProcessDomainEvent, result: SubscriberResult): Promise<void> {
console.log(`Handled successfully: ${result.success}`);
}
async onHandleError(event: InProcessDomainEvent, error: Error): Promise<void> {
console.error(`Handler error: ${error.message}`);
}
}
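A pattern like `payment.*` can be matched with a small single-segment wildcard helper. The sketch below illustrates one plausible semantics, where `*` matches exactly one dotted segment; the registry's actual matching rules may differ.

```typescript
// Match dotted event names against patterns where `*` matches one
// segment. Sketch only; the actual registry semantics may differ.
function matchesPattern(eventName: string, pattern: string): boolean {
  const nameParts = eventName.split('.');
  const patternParts = pattern.split('.');
  if (nameParts.length !== patternParts.length) return false;
  return patternParts.every((part, i) => part === '*' || part === nameParts[i]);
}
```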
Registries
Manage publishers and subscribers with singleton registries:
import { PublisherRegistry, SubscriberRegistry } from '@tasker-systems/tasker';
// Publisher Registry
const pubRegistry = PublisherRegistry.getInstance();
pubRegistry.register(PaymentEventPublisher);
pubRegistry.register(OrderEventPublisher);
pubRegistry.freeze(); // Prevent further registrations
// Get publisher for a step
const publisher = pubRegistry.getForStep('process_payment');
// Subscriber Registry
const subRegistry = SubscriberRegistry.getInstance();
subRegistry.register(AuditLoggingSubscriber);
subRegistry.register(MetricsSubscriber);
// Start all subscribers
subRegistry.startAll();
// Stop all subscribers
subRegistry.stopAll();
FFI Integration
Domain events integrate with the Rust FFI layer for cross-language event flow:
import { createFfiPollAdapter, InProcessDomainEventPoller } from '@tasker-systems/tasker';
// Create poller connected to Rust broadcast channel
const poller = new InProcessDomainEventPoller();
// Set the FFI poll function
poller.setPollFunction(createFfiPollAdapter(runtime));
// Start polling for events
poller.start((event) => {
// Route to appropriate subscriber
const subscribers = subRegistry.getMatchingSubscribers(event.eventName);
for (const sub of subscribers) {
sub.handle(event);
}
});
Signal Handling
The TypeScript worker handles signals for graceful shutdown:
| Signal | Behavior |
|---|---|
| `SIGTERM` | Graceful shutdown |
| `SIGINT` | Graceful shutdown (Ctrl+C) |
import { ShutdownController } from '@tasker-systems/tasker';
const shutdown = new ShutdownController();
// Register signal handlers
shutdown.registerSignalHandlers();
// Wait for shutdown signal
await shutdown.waitForShutdown();
// Or check if shutdown requested
if (shutdown.isShutdownRequested()) {
// Begin cleanup
}
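The controller pattern reduces to a promise that resolves once on the first signal. The minimal sketch below shows that shape; it is illustrative and not the library's ShutdownController.

```typescript
// Minimal shutdown-latch sketch (illustrative, not the library's
// ShutdownController): resolves a shared promise on the first request.
class ShutdownLatch {
  private requested = false;
  private resolveDone!: () => void;
  readonly done: Promise<void>;

  constructor() {
    this.done = new Promise((resolve) => { this.resolveDone = resolve; });
  }

  request(): void {
    if (this.requested) return; // idempotent: only the first call matters
    this.requested = true;
    this.resolveDone();
  }

  isRequested(): boolean {
    return this.requested;
  }

  registerSignalHandlers(): void {
    process.once('SIGTERM', () => this.request());
    process.once('SIGINT', () => this.request());
  }
}
```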
Error Handling
Using Failure Results
async call(context: StepContext): Promise<StepHandlerResult> {
try {
const result = await this.processData(context);
return this.success(result);
} catch (error) {
if (error instanceof NetworkError) {
// Retryable error
return this.failure(
error.message,
ErrorType.RETRYABLE_ERROR,
true,
{ endpoint: error.endpoint }
);
}
// Non-retryable error
return this.failure(
error instanceof Error ? error.message : 'Unknown error',
ErrorType.HANDLER_ERROR,
false
);
}
}
Logging
Structured Logging
import { logInfo, logError, logWarn, logDebug } from '@tasker-systems/tasker';
// Simple logging
logInfo('Processing started', { component: 'handler' });
logError('Failed to connect', { component: 'database' });
// With additional context
logInfo('Order processed', {
component: 'handler',
order_id: '123',
amount: '100.00'
});
Pino Integration
The worker uses pino for structured logging:
import pino from 'pino';
const logger = pino({
name: 'my-handler',
level: process.env.RUST_LOG ?? 'info'
});
logger.info({ orderId: '123' }, 'Processing order');
File Structure
workers/typescript/
├── bin/
│ └── server.ts # Production server
├── src/
│ ├── index.ts # Package exports
│ ├── bootstrap/
│ │ └── bootstrap.ts # Worker initialization
│ ├── events/
│ │ ├── event-emitter.ts # Event pub/sub
│ │ ├── event-poller.ts # Native module polling
│ │ └── event-system.ts # Combined event system
│ ├── ffi/
│ │ ├── index.ts # Native module loader
│ │ ├── ffi-layer.ts # FFI abstraction layer
│ │ └── types.ts # FFI types
│ ├── handler/
│ │ ├── base.ts # Base handler class
│ │ ├── api.ts # API handler
│ │ ├── decision.ts # Decision handler
│ │ ├── batchable.ts # Batchable handler
│ │ ├── domain-events.ts # Domain events module
│ │ ├── registry.ts # Handler registry
│ │ └── mixins/ # Mixin modules
│ │ ├── index.ts # Mixin exports
│ │ ├── api.ts # APIMixin, applyAPI
│ │ └── decision.ts # DecisionMixin, applyDecision
│ ├── server/
│ │ ├── worker-server.ts # Server implementation
│ │ └── types.ts # Server types
│ ├── subscriber/
│ │ └── step-execution-subscriber.ts
│ └── types/
│ ├── step-context.ts # Step context
│ └── step-handler-result.ts
├── tests/
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── handlers/examples/ # Example handlers
├── src-rust/ # Rust FFI extension
├── package.json
├── tsconfig.json
└── biome.json # Linting config
Testing
Unit Tests
cd workers/typescript
bun test # Run all tests
bun test tests/unit/ # Run unit tests only
Integration Tests
bun test tests/integration/ # Run integration tests
With Coverage
bun test --coverage
Linting
bun run check # Biome lint + format check
bun run check:fix # Auto-fix issues
Type Checking
bunx tsc --noEmit # Type check without emit
Example Handlers
Linear Workflow
DSL (recommended):
import { defineHandler } from '@tasker-systems/tasker';
export const DoubleHandler = defineHandler(
'double_value',
{ inputs: { value: 'value' } },
async ({ value }) => ({ result: Number(value ?? 0) * 2, operation: 'double' }),
);
export const AddHandler = defineHandler(
  'add_constant',
  { dependsOn: { doubleValue: 'double_value' } },
  async ({ doubleValue }) => ({
    result: ((doubleValue as { result: number } | null)?.result ?? 0) + 10,
    operation: 'add',
  }),
);
Class-based:
export class DoubleHandler extends StepHandler {
static handlerName = 'double_value';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
const value = context.getInput<number>('value') ?? 0;
return this.success({
result: value * 2,
operation: 'double'
});
}
}
export class AddHandler extends StepHandler {
static handlerName = 'add_constant';
static handlerVersion = '1.0.0';
async call(context: StepContext): Promise<StepHandlerResult> {
const prev = context.getDependencyResult('double_value') as { result: number } | null;
const value = prev?.result ?? 0;
return this.success({
result: value + 10,
operation: 'add'
});
}
}
Diamond Workflow (Parallel Branches)
DSL (recommended):
export const DiamondStart = defineHandler(
'diamond_start',
{ inputs: { value: 'value' } },
async ({ value }) => ({ squared: Number(value ?? 0) ** 2 }),
);
export const BranchB = defineHandler(
'branch_b',
{ dependsOn: { start: 'diamond_start' } },
async ({ start }) => ({ result: (start as { squared: number }).squared + 25 }),
);
export const BranchC = defineHandler(
'branch_c',
{ dependsOn: { start: 'diamond_start' } },
async ({ start }) => ({ result: (start as { squared: number }).squared * 2 }),
);
export const DiamondEnd = defineHandler(
'diamond_end',
{ dependsOn: { branchB: 'branch_b', branchC: 'branch_c' } },
async ({ branchB, branchC }) => ({
final: ((branchB as { result: number }).result + (branchC as { result: number }).result) / 2,
}),
);
Class-based:
export class DiamondStartHandler extends StepHandler {
static handlerName = 'diamond_start';
async call(context: StepContext): Promise<StepHandlerResult> {
const input = context.getInput<number>('value') ?? 0;
return this.success({ squared: input * input });
}
}
export class BranchBHandler extends StepHandler {
static handlerName = 'branch_b';
async call(context: StepContext): Promise<StepHandlerResult> {
const start = context.getDependencyResult('diamond_start') as { squared: number };
return this.success({ result: start.squared + 25 });
}
}
export class BranchCHandler extends StepHandler {
static handlerName = 'branch_c';
async call(context: StepContext): Promise<StepHandlerResult> {
const start = context.getDependencyResult('diamond_start') as { squared: number };
return this.success({ result: start.squared * 2 });
}
}
export class DiamondEndHandler extends StepHandler {
static handlerName = 'diamond_end';
async call(context: StepContext): Promise<StepHandlerResult> {
const branchB = context.getDependencyResult('branch_b') as { result: number };
const branchC = context.getDependencyResult('branch_c') as { result: number };
return this.success({
final: (branchB.result + branchC.result) / 2
});
}
}
Error Handling
export class RetryableErrorHandler extends StepHandler {
static handlerName = 'retryable_error';
async call(context: StepContext): Promise<StepHandlerResult> {
// Simulate a retryable error (e.g., network timeout)
return this.failure(
'Connection timeout - will be retried',
ErrorType.RETRYABLE_ERROR,
true,
{ attempt: context.retryCount }
);
}
}
export class PermanentErrorHandler extends StepHandler {
static handlerName = 'permanent_error';
async call(context: StepContext): Promise<StepHandlerResult> {
// Simulate a permanent error (e.g., validation failure)
return this.failure(
'Invalid input - no retry allowed',
ErrorType.PERMANENT_ERROR,
false
);
}
}
Docker Deployment
Dockerfile
FROM oven/bun:1.1.38 AS runtime
WORKDIR /app
# Copy built artifacts
COPY workers/typescript/dist/ ./dist/
COPY workers/typescript/package.json ./
COPY workers/typescript/*.node ./
# Install production dependencies
RUN bun install --production
# Set environment
ENV PORT=8081
EXPOSE 8081
CMD ["bun", "run", "dist/bin/server.js"]
The napi-rs native addon (.node file) is auto-detected by the module loader — no TASKER_FFI_MODULE_PATH needed. The .node file is platform-specific (e.g., tasker_ts.darwin-arm64.node).
Docker Compose
typescript-worker:
build:
context: .
dockerfile: docker/build/typescript-worker.Dockerfile
environment:
DATABASE_URL: postgresql://tasker:tasker@postgres:5432/tasker
TASKER_ENV: production
TASKER_TEMPLATE_PATH: /app/templates
PORT: 8081
ports:
- "8084:8081"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
interval: 10s
timeout: 5s
retries: 3
See Also
- Worker Crates Overview - High-level introduction
- Patterns and Practices - Common patterns
- Python Worker - Python implementation
- Ruby Worker - Ruby implementation
- Worker Event Systems - Architecture details
Observability Documentation
Last Updated: 2025-12-01 Audience: Operators, Developers Status: Active Related Docs: Documentation Hub | Benchmarks | Deployment Patterns | Domain Events
← Back to Documentation Hub
This directory contains documentation for monitoring, metrics, logging, and performance measurement in tasker-core.
Quick Navigation
📊 Performance & Benchmarking → ../benchmarks/
All benchmark documentation has been consolidated in the docs/benchmarks/ directory.
See: Benchmark README for:
- API performance benchmarks
- SQL function benchmarks
- Event propagation benchmarks
- End-to-end latency benchmarks
- Benchmark quick reference
- Performance targets and CI integration
Migration Note: The following files remain in this directory for historical context but are superseded by the consolidated benchmarks documentation:
- benchmark-implementation-decision.md - Decision rationale (archived)
- benchmark-quick-reference.md - Superseded by ../benchmarks/README.md
- benchmark-strategy-summary.md - Consolidated into benchmark-specific docs
- benchmarking-guide.md - SQL benchmarks moved to ../benchmarks/sql-benchmarks.md
- phase-5.4-distributed-benchmarks-plan.md - Implementation complete
Observability Categories
1. Metrics (metrics-*.md)
Purpose: System health, performance counters, and operational metrics
Documentation:
- metrics-reference.md - Complete metrics catalog
- metrics-verification.md - Verification procedures
- VERIFICATION_RESULTS.md - Test results and validation
Key Metrics Tracked:
- Task lifecycle events (created, started, completed, failed)
- Step execution metrics (claimed, executed, retried)
- Database operation performance (query times, cache hit rates)
- Worker health (active workers, queue depths, claim rates)
- System resource usage (memory, connections, threads)
Export Targets:
- OpenTelemetry (planned)
- Prometheus (supported)
- CloudWatch (planned)
- Datadog (planned)
Quick Reference:
// Example: Recording a metric using typed factory functions
use tasker_shared::metrics::orchestration::{task_requests_total, task_initialization_duration, active_tasks};
use opentelemetry::KeyValue;

task_requests_total().add(1, &[KeyValue::new("namespace", "payments")]);
task_initialization_duration().record(elapsed_ms, &[KeyValue::new("task_type", "order")]);
active_tasks().set(worker_count, &[KeyValue::new("state", "in_progress")]);
2. Logging (logging-standards.md)
Purpose: Structured logging for debugging, audit trails, and operational visibility
Documentation:
- logging-standards.md - Logging standards and best practices
Log Levels:
- ERROR: Critical failures requiring immediate attention
- WARN: Degraded operation or retry scenarios
- INFO: Significant lifecycle events and state transitions
- DEBUG: Detailed execution flow for troubleshooting
- TRACE: Exhaustive detail for deep debugging
Structured Fields:
info!(
    task_uuid = %task_uuid,
    correlation_id = %correlation_id,
    step_name = %step_name,
    elapsed_ms = elapsed.as_millis(),
    "Step execution completed successfully"
);
Key Standards:
- Use structured logging (not string interpolation)
- Include correlation IDs for distributed tracing
- Log state transitions at INFO level
- Include timing information for performance analysis
- Sanitize sensitive data (credentials, PII)
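Sanitization is easiest to enforce with a small helper applied before values reach the logger. A minimal sketch — the `redact` helper below is illustrative, not part of tasker-core:

```rust
/// Illustrative helper (not part of tasker-core): keep the last four
/// characters of a secret and mask the rest. ASCII-safe sketch only.
fn redact(value: &str) -> String {
    let keep = 4.min(value.len());
    let masked = "*".repeat(value.len() - keep);
    format!("{}{}", masked, &value[value.len() - keep..])
}

fn main() {
    let api_key = "sk_live_abcd1234";
    // Log the redacted form, never the raw credential.
    println!("api_key = {}", redact(api_key)); // api_key = ************1234
}
```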
3. Tracing and OpenTelemetry
Purpose: Distributed request tracing across services
Status: ✅ Active
Documentation:
- opentelemetry-improvements.md - Telemetry enhancements
Current Features:
- Distributed trace propagation via correlation IDs (UUIDv7)
- Span creation for major operations:
- API request handling
- Step execution (claim → execute → submit)
- Orchestration coordination
- Domain event publishing
- Message queue operations
- Two-phase FFI telemetry initialization (safe for Ruby/Python workers)
- Integration with Grafana LGTM stack (Prometheus, Tempo)
- Domain event metrics (/metrics/events endpoint)
Two-Phase FFI Initialization:
- Phase 1: Console-only logging (safe during FFI bridge setup)
- Phase 2: Full OpenTelemetry (after FFI established)
Example:
#[tracing::instrument(
    name = "publish_domain_event",
    skip(self, payload),
    fields(
        event_name = %event_name,
        namespace = %metadata.namespace,
        correlation_id = %metadata.correlation_id,
        delivery_mode = ?delivery_mode
    )
)]
async fn publish_event(&self, event_name: &str, ...) -> Result<()> {
    // Implementation
}
4. Health Checks
Purpose: Service health monitoring for orchestration, availability, and alerting
Endpoints:
- GET /health - Overall service health
- GET /health/ready - Readiness for traffic (K8s readiness probe)
- GET /health/live - Liveness check (K8s liveness probe)
Health Indicators:
- Database connection pool status
- Message queue connectivity
- Worker availability
- Circuit breaker states
- Resource utilization (memory, connections)
Response Format:
{
"status": "healthy",
"checks": {
"database": {
"status": "healthy",
"connections_active": 5,
"connections_idle": 15,
"connections_max": 20
},
"message_queue": {
"status": "healthy",
"queues_monitored": 3
},
"circuit_breakers": {
"status": "healthy",
"open_breakers": 0
}
},
"uptime_seconds": 3600
}
Observability Architecture
Component-Level Instrumentation
┌─────────────────────────────────────────────────────────────┐
│ Observability Stack │
├─────────────────────────────────────────────────────────────┤
│ │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Metrics  │  │   Logs   │  │  Traces  │  │  Health  │  │
│  │ Counters │  │Structured│  │  Spans   │  │  Checks  │  │
│  │Histograms│  │   JSON   │  │   Tags   │  │   HTTP   │  │
│  │  Gauges  │  │  Fields  │  │ (Active) │  │  Probes  │  │
│ └─────┬────┘ └─────┬────┘ └─────┬────┘ └─────┬────┘ │
│ │ │ │ │ │
└────────┼─────────────┼─────────────┼─────────────┼────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│Prometheus │ │ Loki / │ │ Jaeger / │ │ K8s │
│ OTLP │ │CloudWatch │ │ Tempo │ │ Probes │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
Instrumentation Points
Orchestration:
- Task lifecycle transitions
- Step discovery and enqueueing
- Result processing
- Finalization operations
- Database query performance
Worker:
- Step claiming
- Handler execution
- Result submission
- FFI call overhead (Ruby workers)
- Event propagation latency
Database:
- Query execution times
- Connection pool metrics
- Transaction commit latency
- Buffer cache hit ratio
Message Queue:
- Message send/receive latency
- Queue depth
- Notification propagation time
- Message processing errors
Performance Monitoring
Key Performance Indicators (KPIs)
| Metric | Target | Alert Threshold | Notes |
|---|---|---|---|
| API Response Time (p99) | < 100ms | > 200ms | User-facing latency |
| SQL Function Time (mean) | < 3ms | > 5ms | Orchestration efficiency |
| Event Propagation (p95) | < 10ms | > 20ms | Real-time coordination |
| E2E Task Completion (p99) | < 500ms | > 1000ms | End-user experience |
| Worker Claim Success Rate | > 95% | < 90% | Resource contention |
| Database Connection Pool | < 80% | > 90% | Resource exhaustion |
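To make the percentile targets concrete, here is a minimal sketch of computing a p99 from collected latency samples and checking it against the alert threshold from the table above. This uses the nearest-rank method and hypothetical sample data; it is not an API from tasker-core:

```rust
/// Nearest-rank percentile: sorts the samples and returns the value at the
/// given percentile (0.0..=100.0). Assumes a non-empty slice.
fn percentile(samples: &mut [f64], pct: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((pct / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

fn main() {
    // Hypothetical API response times in milliseconds.
    let mut api_ms: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    let p99 = percentile(&mut api_ms, 99.0);
    // Alert threshold for API p99 from the KPI table above.
    let alert = p99 > 200.0;
    println!("p99 = {p99}ms, alerting = {alert}"); // p99 = 99ms, alerting = false
}
```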
Monitoring Dashboards
Recommended Dashboard Panels:
- Task Throughput
  - Tasks created/min
  - Tasks completed/min
  - Tasks failed/min
  - Active tasks count
- Step Execution
  - Steps enqueued/min
  - Steps completed/min
  - Average step execution time
  - Step retry rate
- System Health
  - Worker health status
  - Database connection pool utilization
  - Circuit breaker status
  - API response times (p50, p95, p99)
- Error Rates
  - Task failures by namespace
  - Step failures by handler
  - Database errors
  - Message queue errors
Correlation and Debugging
Correlation ID Propagation
Every request generates a UUIDv7 correlation ID that flows through:
- API request → Task creation
- Task → Step enqueueing
- Step → Worker execution
- Worker → Result submission
- Result → Orchestration processing
Tracing a Request:
# Find correlation ID from task creation
curl http://localhost:8080/v1/tasks/{task_uuid} | jq .correlation_id
# Search logs across all services
docker logs orchestration 2>&1 | grep {correlation_id}
docker logs worker 2>&1 | grep {correlation_id}
# Query database for full timeline
psql $DATABASE_URL -c "
SELECT
created_at,
from_state,
to_state,
metadata->>'duration_ms' as duration
FROM tasker.task_transitions
WHERE metadata->>'correlation_id' = '{correlation_id}'
ORDER BY created_at;
"
Debug Logging
Enable debug logging for detailed execution flow:
# Docker Compose
RUST_LOG=debug docker-compose up
# Local development
RUST_LOG=tasker_worker=debug,tasker_orchestration=debug cargo run
# Specific modules
RUST_LOG=tasker_worker::worker::command_processor=trace cargo test
Best Practices
1. Structured Logging
✅ Do:
info!(
    task_uuid = %task.uuid,
    namespace = %task.namespace,
    elapsed_ms = elapsed.as_millis(),
    "Task completed successfully"
);
❌ Don’t:
info!("Task {} in namespace {} completed in {}ms",
    task.uuid, task.namespace, elapsed.as_millis());
2. Metric Naming
Use the typed factory functions from tasker_shared::metrics:
use tasker_shared::metrics::orchestration::*;
use opentelemetry::KeyValue;

task_requests_total().add(1, &[KeyValue::new("namespace", "payments")]);
task_completions_total().add(1, &[KeyValue::new("namespace", "payments")]);
task_failures_total().add(1, &[KeyValue::new("error_type", "timeout")]);
task_initialization_duration().record(elapsed_ms, &[]);
3. Performance Measurement
Measure at operation boundaries:
use std::time::Instant;
use tasker_shared::metrics::orchestration::task_initialization_duration;

let start = Instant::now();
let result = operation().await; // keep the Result so success can be logged below
let elapsed = start.elapsed();
task_initialization_duration().record(elapsed.as_millis() as f64, &[]);
info!(
    operation = "operation_name",
    elapsed_ms = elapsed.as_millis(),
    success = result.is_ok(),
    "Operation completed"
);
4. Error Context
Include rich context in errors:
error!(
    task_uuid = %task_uuid,
    step_uuid = %step_uuid,
    error = %err,
    retry_count = retry_count,
    "Step execution failed, will retry"
);
Tools and Integration
Development Tools
Metrics Visualization:
# Prometheus (if configured)
open http://localhost:9090
# Grafana (if configured)
open http://localhost:3000
Log Aggregation:
# Docker Compose logs
docker-compose -f docker/docker-compose.test.yml logs -f
# Specific service
docker-compose -f docker/docker-compose.test.yml logs -f orchestration
# JSON parsing
docker-compose logs orchestration | jq 'select(.level == "ERROR")'
Production Tools (Planned)
- Metrics: Prometheus + Grafana / DataDog / CloudWatch
- Logs: Loki / CloudWatch Logs / Splunk
- Traces: Jaeger / Tempo / Honeycomb
- Alerts: AlertManager / PagerDuty / Opsgenie
Related Documentation
- Benchmarks: ../benchmarks/README.md
- SQL Functions: ../task-and-step-readiness-and-execution.md
File Organization
Current Files
Active:
- metrics-reference.md - Complete metrics catalog
- metrics-verification.md - Verification procedures
- logging-standards.md - Logging best practices
- opentelemetry-improvements.md - Telemetry enhancements
- VERIFICATION_RESULTS.md - Test results
Archived (superseded by docs/benchmarks/):
- benchmark-implementation-decision.md
- benchmark-quick-reference.md
- benchmark-strategy-summary.md
- benchmarking-guide.md
- phase-5.4-distributed-benchmarks-plan.md
Recommended Cleanup
Move benchmark files to docs/archive/ or delete:
# Option 1: Archive
mkdir -p docs/archive/benchmarks
mv docs/observability/benchmark-*.md docs/archive/benchmarks/
mv docs/observability/phase-5.4-*.md docs/archive/benchmarks/
# Option 2: Delete (information consolidated)
rm docs/observability/benchmark-*.md
rm docs/observability/phase-5.4-*.md
Contributing
When adding observability instrumentation:
- Follow standards: Use structured logging and consistent metric naming
- Include context: Add correlation IDs and relevant metadata
- Document metrics: Update metrics-reference.md with new metrics
- Test instrumentation: Verify metrics and logs in development
- Consider performance: Avoid expensive operations in hot paths
Benchmark Audit & Profiling Plan
Created: 2025-10-09 Status: 📋 Planning Purpose: Audit existing benchmarks, establish profiling tooling, baseline before Actor/Services refactor
Executive Summary
Before refactoring tasker-orchestration/src/orchestration/lifecycle/ to Actor/Services pattern, we need to:
- Audit Benchmarks: Review which benchmarks are implemented vs placeholders
- Clean Up: Remove or complete placeholder benchmarks
- Establish Profiling: Set up flamegraph/samply tooling
- Baseline Profiles: Capture performance profiles for comparison post-refactor
Current Status: We have working SQL and E2E benchmarks but several placeholder component benchmarks that need decisions.
Benchmark Inventory
✅ Working & Complete Benchmarks
1. SQL Function Benchmarks
- Location: tasker-shared/benches/sql_functions.rs
- Status: ✅ Complete, Compiles, Well-documented
- Coverage:
  - get_next_ready_tasks() (4 batch sizes)
  - get_step_readiness_status() (5 diverse samples)
  - transition_task_state_atomic() (5 samples)
  - get_task_execution_context() (5 samples)
  - get_step_transitive_dependencies() (10 samples)
- Documentation: docs/observability/benchmarking-guide.md
- Run Command: cargo bench --package tasker-shared --features benchmarks
2. Event Propagation Benchmarks
- Location: tasker-shared/benches/event_propagation.rs
- Status: ✅ Complete, Compiles
- Coverage: PostgreSQL LISTEN/NOTIFY event propagation
- Run Command: cargo bench --package tasker-shared --features benchmarks event_propagation
3. Task Initialization Benchmarks
- Location: tasker-client/benches/task_initialization.rs
- Status: ✅ Complete, Compiles
- Coverage: API task creation latency
- Run Command:
  export SQLX_OFFLINE=true
  cargo bench --package tasker-client --features benchmarks task_initialization
4. End-to-End Workflow Latency Benchmarks
- Location: tests/benches/e2e_latency.rs
- Status: ✅ Complete, Compiles
- Coverage: Complete workflow execution (API → Result)
  - Linear workflow (Ruby FFI)
  - Diamond workflow (Ruby FFI)
  - Linear workflow (Rust native)
  - Diamond workflow (Rust native)
- Prerequisites: Docker Compose services running
- Run Command:
  export SQLX_OFFLINE=true
  cargo bench --bench e2e_latency
⚠️ Placeholder Benchmarks (Need Decision)
5. Orchestration Benchmarks
- Location: tasker-orchestration/benches/
- Files:
  - orchestration_benchmarks.rs - Empty placeholder
  - step_enqueueing.rs - Placeholder with documentation
- Status: Not implemented
- Documented Intent: Measure orchestration coordination latency
- Challenges:
- Requires triggering orchestration cycle without full execution
- Need step discovery measurement isolation
- Queue publishing and notification overhead breakdown
6. Worker Benchmarks
- Location: tasker-worker/benches/
- Files:
  - worker_benchmarks.rs - Empty placeholder
  - worker_execution.rs - Placeholder with documentation
  - handler_overhead.rs - Placeholder with documentation
- Status: Not implemented
- Documented Intent:
- Worker processing cycle (claim, execute, submit)
- Framework overhead vs pure handler execution
- Ruby FFI overhead measurement
- Challenges:
- Need pre-enqueued steps in test queues
- Noop handler implementations for baseline
- Breakdown metrics for each phase
Recommendations
Option 1: Keep Placeholders for Future Work ✅ RECOMMENDED
Rationale:
- Phase 5.4 distributed benchmarks are documented but complex to implement
- E2E benchmarks (e2e_latency.rs) already provide full workflow metrics
- SQL benchmarks provide component-level detail
- Actor/Services refactor is more urgent than distributed component benchmarks
Action:
- Keep placeholder files with clear “NOT IMPLEMENTED” status
- Update comments to reference this audit document
- Future ticket (post-refactor) can implement if needed
Option 2: Remove Placeholders
Rationale:
- Reduce confusion about benchmark status
- E2E benchmarks already cover end-to-end latency
- SQL benchmarks cover database hot paths
Action:
- Delete placeholder bench files
- Document decision in this file
- Can recreate later if specific component isolation needed
Option 3: Implement Placeholders Now
Rationale:
- Complete benchmark suite before refactor
- Better baseline data for Actor/Services comparison
Concerns:
- 2-3 days implementation effort
- Delays Actor/Services refactor
- May need re-implementation post-refactor anyway
Decision: Option 1 (Keep Placeholders, Document Status)
We have sufficient benchmarking coverage:
- ✅ SQL functions (hot path queries)
- ✅ E2E workflows (user-facing latency)
- ✅ Event propagation (LISTEN/NOTIFY)
- ✅ Task initialization (API latency)
What’s Missing:
- Component-level orchestration breakdown (not critical for refactor)
- Worker cycle breakdown (available via OpenTelemetry traces)
- Framework overhead measurement (nice-to-have, not blocking)
Action Items:
- Update placeholder comments with “Status: Planned for future implementation”
- Reference this document for implementation guidance
- Move forward with profiling and refactor
Profiling Tooling Setup
Goals
- Identify Inefficiencies: Find hot spots in lifecycle code
- Establish Baseline: Profile before Actor/Services refactor
- Compare Post-Refactor: Validate performance impact of refactor
- Continuous Profiling: Enable ongoing performance analysis
Tool Selection
Primary: samply (macOS-friendly)
- GitHub: https://github.com/mstange/samply
- Advantages:
- Native macOS support (uses Instruments)
- Interactive web UI for flamegraphs
- Low overhead
- Works with release builds
- Use Case: Development profiling on macOS
Secondary: flamegraph (CI/production)
- GitHub: https://github.com/flamegraph-rs/flamegraph
- Advantages:
- Linux support (perf-based)
- SVG output for CI artifacts
- Well-established in Rust ecosystem
- Use Case: CI profiling, Linux production analysis
Tertiary: cargo-flamegraph (Convenience)
- Cargo Plugin: Wraps flamegraph-rs
- Advantages:
- Single command profiling
- Automatic symbol resolution
- Use Case: Quick local profiling
Installation
macOS Setup (samply)
# Install samply
cargo install samply
# macOS requires SIP adjustment for sampling (one-time setup)
# https://github.com/mstange/samply#macos-permissions
# Verify installation
samply --version
Linux Setup (flamegraph)
# Install prerequisites (Ubuntu/Debian)
sudo apt-get install linux-tools-common linux-tools-generic
# Install flamegraph
cargo install flamegraph
# Allow perf without sudo (optional)
echo 'kernel.perf_event_paranoid=-1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Verify installation
flamegraph --version
Cross-Platform (cargo-flamegraph)
# Install cargo-flamegraph
cargo install cargo-flamegraph
# Verify installation
cargo flamegraph --version
Profiling Workflows
1. Profile E2E Benchmark (Recommended for Baseline)
Captures the entire workflow execution including orchestration lifecycle:
# macOS
samply record cargo bench --bench e2e_latency -- --profile-time=60
# Linux
cargo flamegraph --bench e2e_latency -- --profile-time=60
# Output: Interactive flamegraph showing hot paths
What to Look For:
- Time spent in lifecycle/ modules (task_initializer, step_enqueuer, result_processor, etc.)
- Serialization/deserialization overhead
- Lock contention (should be minimal with our architecture)
2. Profile SQL Benchmarks
Isolates database performance:
# Profile just SQL function benchmarks
samply record cargo bench --package tasker-shared --features benchmarks sql_functions
# Output: Shows PostgreSQL function overhead
What to Look For:
- Time in sqlx query execution
- Connection pool overhead
- Query planning time (shouldn’t be visible if using prepared statements)
3. Profile Integration Tests (Realistic Workload)
Profile actual test execution for realistic patterns:
# Profile a specific integration test
samply record cargo test --test e2e_tests e2e::rust::simple_integration_tests::test_linear_workflow
# Profile all integration tests (longer run)
samply record cargo test --test e2e_tests --all-features
What to Look For:
- Initialization overhead
- Test setup time vs actual execution time
- Repeated patterns across tests
4. Profile Specific Lifecycle Components
Isolate specific modules for deep analysis:
# Example: Profile only result processing
samply record cargo test --package tasker-orchestration --test lifecycle_integration_tests \
test_result_processing_updates_task_state --all-features -- --nocapture
# Or profile a unit test for a specific function
samply record cargo test --package tasker-orchestration \
result_processor::tests::test_process_step_result_success --all-features
Baseline Profiling Plan
Phase 1: Capture Pre-Refactor Baselines (Day 1)
Goal: Establish performance baseline of current lifecycle code before Actor/Services refactor
# 1. Clean build
cargo clean
cargo build --release --all-features
# 2. Profile E2E benchmarks (primary baseline)
samply record --output=baseline-e2e-pre-refactor.json \
cargo bench --bench e2e_latency
# 3. Profile SQL benchmarks
samply record --output=baseline-sql-pre-refactor.json \
cargo bench --package tasker-shared --features benchmarks
# 4. Profile specific lifecycle operations
samply record --output=baseline-task-init-pre-refactor.json \
cargo test --package tasker-orchestration \
lifecycle::task_initializer::tests --all-features
samply record --output=baseline-step-enqueue-pre-refactor.json \
cargo test --package tasker-orchestration \
lifecycle::step_enqueuer::tests --all-features
samply record --output=baseline-result-processor-pre-refactor.json \
cargo test --package tasker-orchestration \
lifecycle::result_processor::tests --all-features
Deliverables (completed; profiles removed, superseded by cluster benchmarks):
- Baseline profile files in profiles/pre-refactor/ (removed)
- Performance baselines now in docs/benchmarks/README.md
Phase 2: Identify Optimization Opportunities (Day 1)
Goal: Document current performance characteristics to preserve in refactor
Analysis Checklist:
- ✅ Time spent in each lifecycle module (task_initializer, step_enqueuer, etc.)
- ✅ Database query time breakdown
- ✅ Serialization overhead (JSON, MessagePack)
- ✅ Lock contention points (if any)
- ✅ Unnecessary allocations or clones
- ✅ Recursive call depth
Document Findings:
Performance baselines are now documented in docs/benchmarks/README.md.
The original lifecycle-performance-baseline.md was removed — its measurements had
data quality issues and the refactor it targeted is complete.
Phase 3: Post-Refactor Validation (After Refactor)
Goal: Validate Actor/Services refactor maintains or improves performance
# Re-run same profiling commands after refactor
samply record --output=baseline-e2e-post-refactor.json \
cargo bench --bench e2e_latency
# Compare baselines
# (samply doesn't have built-in diff, use manual comparison)
Success Criteria:
- E2E latency: Within 10% of baseline (preferably faster)
- SQL latency: Unchanged (no regression from refactor)
- Lifecycle operation time: Within 20% of baseline
- No new hot paths or contention points
Regression Signals:
- E2E latency >20% slower
- New allocations/clones in hot paths
- Increased lock contention
- Message passing overhead >5% of total time
Profiling Best Practices
1. Use Release Builds
# Always profile release builds (--release flag)
cargo build --release --all-features
samply record cargo bench --bench e2e_latency
Rationale: Debug builds have 10-100x overhead that masks real performance issues
2. Run Multiple Times
# Run 3 times, compare consistency
for i in {1..3}; do
samply record --output=profile-$i.json cargo bench --bench e2e_latency
done
Rationale: Catch warm-up effects, JIT compilation, cache behavior
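One simple way to judge run-to-run consistency is the coefficient of variation across the collected timings. This is an illustrative sketch with hypothetical numbers, not tooling shipped with tasker-core:

```rust
/// Coefficient of variation (stddev / mean) across benchmark runs.
/// A small value (e.g. < 0.05) suggests the runs are consistent enough to
/// treat as a stable baseline; a large value means more warm-up or isolation
/// work is needed. Assumes a non-empty, non-zero-mean sample.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}

fn main() {
    // Hypothetical e2e_latency medians (ms) from three profiling runs.
    let runs = [412.0, 405.0, 419.0];
    println!("CV = {:.4}", coefficient_of_variation(&runs));
}
```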
3. Isolate Interference
# Close other applications
# Disable background processes (Spotlight, backups)
# Use consistent hardware (don't profile on battery power)
4. Focus on Hot Paths
80/20 Rule: 80% of time is spent in 20% of code
Priority Order:
- Top 5 functions by time (>5% each)
- Recursive calls (can amplify overhead)
- Locks and synchronization (contention multiplies)
- Allocations in loops (O(n) becomes visible)
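As an illustration of the "top functions by time" check, the sketch below aggregates per-function sample counts (as you might export them from a flamegraph) and flags anything above the 5% threshold. The function names and counts are hypothetical:

```rust
use std::collections::HashMap;

/// Aggregate per-function sample counts and return functions whose share of
/// total samples exceeds `threshold` (a fraction, e.g. 0.05 for 5%),
/// sorted by share descending.
fn hot_paths(samples: &[(&str, u64)], threshold: f64) -> Vec<(String, f64)> {
    let mut totals: HashMap<&str, u64> = HashMap::new();
    for &(name, count) in samples {
        *totals.entry(name).or_insert(0) += count;
    }
    let grand: u64 = totals.values().sum();
    let mut hot: Vec<(String, f64)> = totals
        .into_iter()
        .map(|(name, count)| (name.to_string(), count as f64 / grand as f64))
        .filter(|(_, share)| *share > threshold)
        .collect();
    hot.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    hot
}

fn main() {
    // Hypothetical sample counts; function names are illustrative only.
    let samples = [
        ("step_enqueuer::enqueue_ready_steps", 400),
        ("sqlx::query", 250),
        ("serde_json::to_value", 80),
    ];
    for (name, share) in hot_paths(&samples, 0.05) {
        println!("{name}: {:.1}%", share * 100.0);
    }
}
```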
5. Benchmark-Driven Profiling
Always profile realistic workloads:
- ✅ E2E benchmarks (represents user experience)
- ✅ Integration tests (real workflow patterns)
- ❌ Unit tests (too isolated, not representative)
Flamegraph Interpretation
Reading Flamegraphs
┌─────────────────────────────────────────────┐ ← Total Program Time (100%)
│ │
│ ┌────────────────┐ ┌─────────────────┐ │
│ │ Database Ops │ │ Serialization │ │ ← High-level Operations (60%)
│ │ (30%) │ │ (30%) │ │
│ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌───────────┐ │ │
│ │ │ SQL Exec │ │ │ │ JSON Ser │ │ │ ← Leaf Operations (25%)
│ │ │ (25%) │ │ │ │ (20%) │ │ │
│ └──┴──────────┴──┘ └──┴───────────┴─┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Business Logic (20%) │ │ ← Remaining Time
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
Width = Time spent in function (including children)
Height = Call stack depth
Color = Function group (can be customized)
Key Patterns
1. Wide Flat Bars = Hot Path
┌───────────────────────────────────────┐
│ step_enqueuer::enqueue_ready_steps() │ ← 40% of total time
└───────────────────────────────────────┘
Action: Optimize this function
2. Deep Call Stack = Recursion/Abstractions
┌─────────────────────────┐
│ process_dependencies() │
│ ┌─────────────────────┐│
│ │ resolve_deps() ││
│ │ ┌─────────────────┐││
│ │ │ check_ready() │││
│ │ └─────────────────┘││
│ └─────────────────────┘│
└─────────────────────────┘
Action: Consider flattening or caching
3. Many Narrow Bars = Fragmentation
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│A│B│C│D│E│F│G│H│I│J│K│L│M│ ← Many small functions
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
Action: Not necessarily bad (may be inlining), but check if overhead-heavy
Integration with CI
GitHub Actions Workflow (Future Enhancement)
# .github/workflows/profile-benchmarks.yml
name: Profile Benchmarks
on:
pull_request:
paths:
- 'tasker-orchestration/src/orchestration/lifecycle/**'
- 'tasker-shared/src/**'
jobs:
profile:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install flamegraph
run: cargo install flamegraph
- name: Profile benchmarks
run: |
cargo flamegraph --bench e2e_latency -- --profile-time=60 -o flamegraph.svg
- name: Upload flamegraph
uses: actions/upload-artifact@v3
with:
name: flamegraph
path: flamegraph.svg
- name: Compare with baseline
run: |
# TODO: Implement baseline comparison
# Download previous flamegraph, compare hot paths
Documentation Structure
Created Documents
- This Document: docs/observability/benchmark-audit-and-profiling-plan.md
  - Benchmark inventory
  - Profiling tooling setup
  - Baseline capture plan
- Existing: docs/observability/benchmarking-guide.md
  - SQL benchmark documentation
  - Running instructions
  - Performance expectations
- docs/observability/lifecycle-performance-baseline.md (removed; superseded by docs/benchmarks/README.md)
Next Steps
Before Actor/Services Refactor
- ✅ Audit Complete: Documented benchmark status
- ⏳ Install Profiling Tools:
  cargo install samply      # macOS
  cargo install flamegraph  # Linux
- ⏳ Capture Baselines (1 day):
  - Run profiling plan Phase 1
  - Generate flamegraphs
  - Document hot paths
- ✅ Baseline Document: Superseded by docs/benchmarks/README.md
During Actor/Services Refactor
- Incremental Profiling: Profile after each major component conversion
- Compare Baselines: Ensure no performance regressions
- Document Changes: Note architectural changes affecting performance
After Actor/Services Refactor
- Full Re-Profile: Run profiling plan Phase 3
- Comparison Analysis: Document performance changes
- Update Documentation: Reflect new architecture
- Benchmark Updates: Update benchmarks if Actor/Services changes measurement approach
Summary
Current State:
- ✅ SQL benchmarks working
- ✅ E2E benchmarks working
- ✅ Event propagation benchmarks working
- ✅ Task initialization benchmarks working
- ⚠️ Component benchmarks are placeholders (OK for now)
Decision:
- Keep placeholder benchmarks for future work
- Move forward with profiling and baseline capture
- Sufficient coverage to validate Actor/Services refactor
Action Plan:
- Install profiling tools (samply/flamegraph)
- Capture pre-refactor baselines (1 day)
- Document current hot paths
- Proceed with Actor/Services refactor
- Validate post-refactor performance
Success Criteria:
- Baseline profiles captured
- Hot paths documented
- Post-refactor validation plan established
- No performance regressions from refactor
Benchmark Implementation Decision: Event-Driven + E2E Focus
Date: 2025-10-08 Decision: Focus on event propagation and E2E benchmarks; infer worker metrics from traces
Context
Original Phase 5.4 plan included 7 benchmark categories:
- ✅ API Task Creation
- 🚧 Worker Processing Cycle
- ✅ Event Propagation
- 🚧 Step Enqueueing
- 🚧 Handler Overhead
- ✅ SQL Functions
- ✅ E2E Latency
Architectural Challenge: Worker Benchmarking
Problem: Direct worker benchmarking doesn’t match production reality
In a distributed system with multiple workers:
- ❌ Can’t predict which worker will claim which step
- ❌ Can’t control step distribution across workers
- ❌ Artificial scenarios required to direct specific steps to specific workers
- ❌ API queries would need to know which worker to query (unknowable in advance)
Example:
Task with 10 steps across 3 workers:
- Worker A might claim steps 1, 3, 7
- Worker B might claim steps 2, 5, 6, 9
- Worker C might claim steps 4, 8, 10
Which worker do you benchmark? How do you ensure consistent measurement?
Decision: Focus on Observable Metrics
✅ What We WILL Measure Directly
1. Event Propagation (tasker-shared/benches/event_propagation.rs)
Status: ✅ IMPLEMENTED
Measures: PostgreSQL LISTEN/NOTIFY round-trip latency
Approach:
// Set up a listener on the test channel
listener.listen("pgmq_message_ready.benchmark_event_test").await;

// Send a message with notify
let send_time = Instant::now();
sqlx::query("SELECT pgmq_send_with_notify(...)").execute(&pool).await;

// Measure until the listener receives the notification
let received_at = listener.recv().await;
let latency = received_at.duration_since(send_time);
Why it works:
- Observable from outside the system
- Deterministic measurement (single listener, single sender)
- Matches production behavior (real LISTEN/NOTIFY path)
- Critical for worker responsiveness
Expected Performance: < 5ms p50, < 10ms p95
2. End-to-End Latency (tests/benches/e2e_latency.rs)
Status: ✅ IMPLEMENTED
Measures: Complete workflow execution (API → Task Complete)
Approach:
// Start timing before submission so API processing is included
let start = Instant::now();
// Create the task
let response = client.create_task(request).await.unwrap();
let task_uuid = response.task_uuid;
// Poll for completion
loop {
    let task = client.get_task(task_uuid).await.unwrap();
    if task.execution_status == "AllComplete" {
        return start.elapsed();
    }
    tokio::time::sleep(Duration::from_millis(50)).await;
}
Why it works:
- Measures user experience (submit → result)
- Naturally includes ALL system overhead:
- API processing
- Database writes
- Message queue latency
- Worker claim/execute/submit (embedded in total time)
- Event propagation
- Orchestration coordination
- No need to know which workers executed which steps
- Reflects real production behavior
Expected Performance:
- Linear (3 steps): < 500ms p99
- Diamond (4 steps): < 800ms p99
📊 What We WILL Infer from Traces
Worker-Level Breakdown via OpenTelemetry
Instead of direct benchmarking, use existing OpenTelemetry instrumentation:
# Query traces by correlation_id from E2E benchmark
curl "http://localhost:16686/api/traces?service=tasker-worker&tags=correlation_id:abc-123"
# Extract span timings:
{
  "spans": [
    {"operationName": "step_claim", "duration": "15ms"},
    {"operationName": "execute_handler", "duration": "42ms"},
    {"operationName": "submit_result", "duration": "23ms"}
  ]
}
(execute_handler is the business logic; the other spans are framework overhead)
Advantages:
- ✅ Works across distributed workers (correlation ID links everything)
- ✅ Captures real production behavior (actual task execution)
- ✅ Breaks down by step type (different handlers have different timing)
- ✅ Shows which worker processed each step
- ✅ Already instrumented (Phase 3.3 work)
Metrics Available:
- step_claim_duration: Time to claim step from queue
- handler_execution_duration: Time to execute handler logic
- result_submission_duration: Time to submit result back
- ffi_overhead: Rust vs Ruby handler comparison
🚧 Benchmarks NOT Implemented (By Design)
Worker Processing Cycle (tasker-worker/benches/worker_execution.rs)
Status: 🚧 Skeleton only (placeholder)
Why not implemented:
- Requires artificial pre-arrangement of which worker claims which step
- Doesn’t match production (multiple workers competing for steps)
- Metrics available via OpenTelemetry traces instead
Alternative: Query traces for step_claim → execute_handler → submit_result span timing
Step Enqueueing (tasker-orchestration/benches/step_enqueueing.rs)
Status: 🚧 Skeleton only (placeholder)
Why not implemented:
- Difficult to trigger orchestration step discovery without full execution
- Result naturally embedded in E2E latency measurement
- Coordination overhead visible in E2E timing
Alternative: E2E benchmark includes step enqueueing naturally
Handler Overhead (tasker-worker/benches/handler_overhead.rs)
Status: 🚧 Skeleton only (placeholder)
Why not implemented:
- FFI overhead varies by handler type (can’t benchmark in isolation)
- Real overhead visible in E2E benchmark + traces
- Rust vs Ruby comparison available via trace analysis
Alternative: Compare handler_execution_duration spans for Rust vs Ruby handlers in traces
Implementation Summary
✅ Complete Benchmarks (4/7)
| Benchmark | Status | Measures | Run Command |
|---|---|---|---|
| SQL Functions | ✅ Complete | PostgreSQL function performance | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks sql_functions |
| Task Initialization | ✅ Complete | API task creation latency | cargo bench -p tasker-client --features benchmarks |
| Event Propagation | ✅ Complete | LISTEN/NOTIFY round-trip | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks event_propagation |
| E2E Latency | ✅ Complete | Complete workflow execution | cargo bench --test e2e_latency |
🚧 Placeholder Benchmarks (3/7)
| Benchmark | Status | Alternative Measurement |
|---|---|---|
| Worker Execution | 🚧 Placeholder | OpenTelemetry traces (correlation ID) |
| Step Enqueueing | 🚧 Placeholder | Embedded in E2E latency |
| Handler Overhead | 🚧 Placeholder | OpenTelemetry span comparison (Rust vs Ruby) |
Advantages of This Approach
1. Matches Production Reality
- E2E benchmark reflects actual user experience
- No artificial worker pre-arrangement required
- Measures real distributed system behavior
2. Complete Coverage
- E2E latency includes ALL components naturally
- OpenTelemetry provides worker-level breakdown
- Event propagation measures critical notification path
3. Lower Maintenance
- Fewer benchmarks to maintain
- No complex setup for worker isolation
- Traces provide flexible analysis
4. Better Insights
- Correlation IDs link entire workflow across services
- Can analyze timing for ANY task in production
- Breakdown available on-demand via trace queries
How to Use This System
Running Performance Analysis
Step 1: Run E2E benchmark
cargo bench --test e2e_latency
Step 2: Extract correlation_id from benchmark output
Created task: abc-123-def-456 (correlation_id: xyz-789)
Step 3: Query traces for breakdown
# Jaeger UI or API
curl "http://localhost:16686/api/traces?tags=correlation_id:xyz-789"
Step 4: Analyze span timing
{
  "spans": [
    {"service": "orchestration", "operation": "create_task", "duration": "18ms"},
    {"service": "orchestration", "operation": "enqueue_steps", "duration": "12ms"},
    {"service": "worker", "operation": "step_claim", "duration": "15ms"},
    {"service": "worker", "operation": "execute_handler", "duration": "42ms"},
    {"service": "worker", "operation": "submit_result", "duration": "23ms"},
    {"service": "orchestration", "operation": "process_result", "duration": "8ms"}
  ]
}
Total E2E: ~118ms (matches benchmark)
Worker overhead: 15ms + 23ms = 38ms (claim + submit, excluding business logic)
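The overhead arithmetic above can be automated once spans are exported; a minimal sketch, assuming span names matching the trace example:

```rust
/// Sum the framework-overhead spans (claim + submit) from a list of
/// (operation, duration_ms) pairs, leaving out handler execution.
/// Span names are illustrative, taken from the trace example above.
fn worker_overhead_ms(spans: &[(&str, u64)]) -> u64 {
    spans
        .iter()
        .filter(|(op, _)| *op == "step_claim" || *op == "submit_result")
        .map(|(_, d)| *d)
        .sum()
}
```

For the example spans (15ms claim, 42ms handler, 23ms submit) this yields the 38ms of worker overhead quoted above.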
Recommendations
Completion Criteria
✅ Complete with 4 working benchmarks:
- SQL Functions
- Task Initialization
- Event Propagation
- E2E Latency
📋 Document that worker-level metrics come from OpenTelemetry
For Future Enhancement
If direct worker benchmarking becomes necessary:
- Use single-worker mode Docker Compose configuration
- Pre-create tasks with known step assignments
- Query specific worker API for deterministic steps
- Document as synthetic benchmark (not matching production)
For Production Monitoring
Use OpenTelemetry for ongoing performance analysis:
- Set up trace retention (7-30 days)
- Create Grafana dashboards for span timing
- Alert on p95 latency increases
- Analyze slow workflows via correlation ID
Conclusion
Decision: Focus on event propagation and E2E latency benchmarks, use OpenTelemetry traces for worker-level breakdown.
Rationale: Matches production reality, provides complete coverage, lower maintenance, better insights.
Status: ✅ 4/4 practical benchmarks implemented and working
Benchmark Quick Reference Guide
Last Updated: 2025-10-08
Quick commands for running all benchmarks in the distributed benchmarking suite.
Prerequisites
# Start all Docker services
docker-compose -f docker/docker-compose.test.yml up -d
# Verify services are healthy
curl http://localhost:8080/health # Orchestration
curl http://localhost:8081/health # Rust Worker
curl http://localhost:8082/health # Ruby Worker (optional)
# Set database URL (for SQL benchmarks)
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
Individual Benchmarks
✅ Implemented Benchmarks
# 1. API Task Creation (COMPLETE - 17.7-20.8ms)
cargo bench --package tasker-client --features benchmarks
# 2. SQL Function Performance (COMPLETE - 380µs-2.93ms)
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo bench --package tasker-shared --features benchmarks sql_functions
🚧 Placeholder Benchmarks
# 3. Event Propagation (placeholder)
cargo bench --package tasker-shared --features benchmarks event_propagation
# 4. Worker Execution (placeholder)
cargo bench --package tasker-worker --features benchmarks worker_execution
# 5. Handler Overhead (placeholder)
cargo bench --package tasker-worker --features benchmarks handler_overhead
# 6. Step Enqueueing (placeholder)
cargo bench --package tasker-orchestration --features benchmarks step_enqueueing
# 7. End-to-End Latency (placeholder)
cargo bench --test e2e_latency
Run All Benchmarks
# Run ALL benchmarks (implemented + placeholders)
cargo bench --all-features
# Run only SQL benchmarks
cargo bench --package tasker-shared --features benchmarks
# Run only worker benchmarks
cargo bench --package tasker-worker --features benchmarks
Benchmark Categories
| Category | Package | Benchmark Name | Status | Run Command |
|---|---|---|---|---|
| API | tasker-client | task_initialization | ✅ Complete | cargo bench -p tasker-client --features benchmarks |
| SQL | tasker-shared | sql_functions | ✅ Complete | DATABASE_URL=... cargo bench -p tasker-shared --features benchmarks sql_functions |
| Events | tasker-shared | event_propagation | 🚧 Placeholder | cargo bench -p tasker-shared --features benchmarks event_propagation |
| Worker | tasker-worker | worker_execution | 🚧 Placeholder | cargo bench -p tasker-worker --features benchmarks worker_execution |
| Worker | tasker-worker | handler_overhead | 🚧 Placeholder | cargo bench -p tasker-worker --features benchmarks handler_overhead |
| Orchestration | tasker-orchestration | step_enqueueing | 🚧 Placeholder | cargo bench -p tasker-orchestration --features benchmarks |
| E2E | tests | e2e_latency | 🚧 Placeholder | cargo bench --test e2e_latency |
Benchmark Output Locations
# Criterion HTML reports
open target/criterion/report/index.html
# Individual benchmark data
ls target/criterion/
# Proposed: Structured logs (not yet implemented)
# tmp/benchmarks/YYYY-MM-DD-benchmark-name.log
Common Options
# Save baseline for comparison
cargo bench --features benchmarks -- --save-baseline main
# Compare to baseline
cargo bench --features benchmarks -- --baseline main
# Verbose output
cargo bench --features benchmarks -- --verbose
# Run specific benchmark
cargo bench --package tasker-client --features benchmarks task_creation_api
# Skip health checks (CI mode)
TASKER_TEST_SKIP_HEALTH_CHECK=true cargo bench --features benchmarks
Troubleshooting
“Services must be running”
# Start Docker services
docker-compose -f docker/docker-compose.test.yml up -d
# Check service health
curl http://localhost:8080/health
“DATABASE_URL must be set”
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
“Task template not found”
# Ensure worker services are running (they register templates)
docker-compose -f docker/docker-compose.test.yml ps
# Check registered templates
curl -s http://localhost:8080/v1/handlers | jq
Compilation errors
# Clean and rebuild
cargo clean
cargo build --all-features
Performance Targets
| Benchmark | Metric | Target | Current | Status |
|---|---|---|---|---|
| Task Init (linear) | mean | < 50ms | 17.7ms | ✅ 3x better |
| Task Init (diamond) | mean | < 75ms | 20.8ms | ✅ 3.6x better |
| SQL Task Discovery | mean | < 3ms | 1.75-2.93ms | ✅ Pass |
| SQL Step Readiness | mean | < 1ms | 440-603µs | ✅ Pass |
| Worker Total Overhead | mean | < 60ms | TBD | 🚧 |
| Event Notify (p95) | p95 | < 10ms | TBD | 🚧 |
| Step Enqueue (3 steps) | mean | < 50ms | TBD | 🚧 |
| E2E Complete (3 steps) | p99 | < 500ms | TBD | 🚧 |
Documentation
- Full Strategy: benchmark-strategy-summary.md
- Implementation Plan: phase-5.4-distributed-benchmarks-plan.md
- SQL Benchmark Guide: benchmarking-guide.md
Distributed Benchmarking Strategy
Status: 🎯 Framework Complete | Implementation In Progress
Last Updated: 2025-10-08
Overview
Complete benchmarking infrastructure for measuring distributed system performance across all components.
Benchmark Suite Structure
✅ Implemented
1. API Task Creation (tasker-client/benches/task_initialization.rs)
Status: ✅ COMPLETE - Fully implemented and tested
Measures:
- HTTP request → task initialized latency
- Task record creation in PostgreSQL
- Initial step discovery from template
- Response generation and serialization
Results (2025-10-08):
Linear (3 steps): 17.7ms (Target: < 50ms) ✅ 3x better than target
Diamond (4 steps): 20.8ms (Target: < 75ms) ✅ 3.6x better than target
Run Command:
cargo bench --package tasker-client --features benchmarks
2. SQL Function Performance (tasker-shared/benches/sql_functions.rs)
Status: ✅ COMPLETE - Fully implemented (Phase 5.2)
Measures:
- 6 critical PostgreSQL function benchmarks
- Intelligent stratified sampling (5-10 diverse samples per function)
- EXPLAIN ANALYZE query plan analysis (run once per function)
Results (from Phase 5.2):
Task discovery: 1.75-2.93ms (O(1) scaling!)
Step readiness: 440-603µs (37% variance captured)
State transitions: ~380µs (±5% variance)
Task execution context: 448-559µs
Step dependencies: 332-343µs
Query plan buffer hit: 100% (all functions)
Run Command:
DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test" \
cargo bench --package tasker-shared --features benchmarks sql_functions
🚧 Placeholders (Ready for Implementation)
3. Worker Processing Cycle (tasker-worker/benches/worker_execution.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- Claim: PGMQ read + atomic claim
- Execute: Handler execution (framework overhead)
- Submit: Result serialization + HTTP submit
- Total: Complete worker cycle
Targets:
- Claim: < 20ms
- Execute (noop): < 10ms
- Submit: < 30ms
- Total overhead: < 60ms
Implementation Requirements:
- Pre-enqueued steps in namespace queues
- Worker client with breakdown metrics
- Multiple handler types (noop, calculation, database)
- Accurate timestamp collection for each phase
Run Command (when implemented):
cargo bench --package tasker-worker --features benchmarks worker_execution
4. Event Propagation (tasker-shared/benches/event_propagation.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- PostgreSQL LISTEN/NOTIFY latency
- PGMQ pgmq_send_with_notify overhead
- Event system framework overhead
Targets:
- p50: < 5ms
- p95: < 10ms
- p99: < 20ms
Implementation Requirements:
- PostgreSQL LISTEN connection setup
- PGMQ notification channel configuration
- Concurrent listener with timestamp correlation
- Accurate cross-thread time measurement
Run Command (when implemented):
cargo bench --package tasker-shared --features benchmarks event_propagation
5. Step Enqueueing (tasker-orchestration/benches/step_enqueueing.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- Ready step discovery (SQL query time)
- Queue publishing (PGMQ write time)
- Notification overhead (LISTEN/NOTIFY)
- Total orchestration coordination
Targets:
- 3-step workflow: < 50ms
- 10-step workflow: < 100ms
- 50-step workflow: < 500ms
Implementation Requirements:
- Pre-created tasks with dependency chains
- Orchestration client with result processing trigger
- Queue polling to detect enqueued steps
- Breakdown metrics (discovery, publish, notify)
Challenge: Triggering step discovery without full workflow execution
Run Command (when implemented):
cargo bench --package tasker-orchestration --features benchmarks step_enqueueing
6. Handler Overhead (tasker-worker/benches/handler_overhead.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- Pure Rust handler (baseline - direct call)
- Rust handler via framework (dispatch overhead)
- Ruby handler via FFI (FFI boundary cost)
Targets:
- Pure Rust: < 1µs (baseline)
- Via Framework: < 1ms
- Ruby FFI: < 5ms
Implementation Requirements:
- Noop handler implementations (Rust + Ruby)
- Direct function call benchmarks
- Framework dispatch overhead measurement
- FFI bridge overhead measurement
Run Command (when implemented):
cargo bench --package tasker-worker --features benchmarks handler_overhead
7. End-to-End Latency (tests/benches/e2e_latency.rs)
Status: 🚧 Skeleton created - needs implementation
Measures:
- Complete workflow execution (API → Task Complete)
- All system components (API, DB, Queue, Worker, Events)
- Real network overhead
- Different workflow patterns
Targets:
- Linear (3 steps): < 500ms p99
- Diamond (4 steps): < 800ms p99
- Tree (7 steps): < 1500ms p99
Implementation Requirements:
- All Docker Compose services running
- Orchestration client for task creation
- Polling mechanism for completion detection
- Multiple workflow templates
- Timeout handling for stuck workflows
Special Considerations:
- SLOW by design: Measures real workflow execution (seconds)
- Fewer samples (sample_size=10 vs 50 default)
- Higher variance expected (network + system state)
- Focus on regression detection, not absolute numbers
Run Command (when implemented):
# Requires all Docker services running
docker-compose -f docker/docker-compose.test.yml up -d
cargo bench --test e2e_latency
Benchmark Output Logging Strategy
Current State
Implemented:
- Criterion default output (terminal + HTML reports)
- Custom health check banners in benchmarks
- EXPLAIN ANALYZE output in SQL benchmarks
- Inline result commentary
Location: Results saved to target/criterion/
Proposed Consistent Structure
1. Standard Output Format
All benchmarks should follow this pattern:
═══════════════════════════════════════════════════════════════════════════════
🔍 VERIFYING PREREQUISITES
═══════════════════════════════════════════════════════════════════════════════
✅ All prerequisites met
═══════════════════════════════════════════════════════════════════════════════
Benchmarking <category>/<test_name>
...
<category>/<test_name> time: [X.XX ms Y.YY ms Z.ZZ ms]
═══════════════════════════════════════════════════════════════════════════════
📊 BENCHMARK RESULTS: <CATEGORY NAME>
═══════════════════════════════════════════════════════════════════════════════
Performance Summary:
• Test 1: X.XX ms (Target: < YY ms) ✅ Status
• Test 2: X.XX ms (Target: < YY ms) ⚠️ Status
Key Findings:
• Finding 1
• Finding 2
═══════════════════════════════════════════════════════════════════════════════
2. Structured Log Files
Proposal: Create tmp/benchmarks/ directory with dated output:
tmp/benchmarks/
├── 2025-10-08-task-initialization.log
├── 2025-10-08-sql-functions.log
├── 2025-10-08-worker-execution.log
├── ...
└── latest/
├── task-initialization.log -> ../2025-10-08-task-initialization.log
└── summary.md
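A helper for the proposed naming scheme might look like this (the tmp/benchmarks/ layout is the proposal's, not an existing convention):

```rust
/// Build the dated log path from the proposed tmp/benchmarks/ layout,
/// e.g. log_path("2025-10-08", "task_initialization") gives
/// "tmp/benchmarks/2025-10-08-task-initialization.log".
fn log_path(date: &str, benchmark: &str) -> String {
    format!("tmp/benchmarks/{}-{}.log", date, benchmark.replace('_', "-"))
}
```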
Log Format (example):
# Benchmark Run: task_initialization
Date: 2025-10-08 14:23:45 UTC
Commit: abc123def456
Environment: Docker Compose Test
## Prerequisites
- [x] Orchestration service healthy (http://localhost:8080)
- [x] Worker service healthy (http://localhost:8081)
## Results
### Linear Workflow (3 steps)
- Mean: 17.748 ms
- Std Dev: 0.624 ms
- Min: 17.081 ms
- Max: 18.507 ms
- Target: < 50 ms
- Status: ✅ PASS (3.0x better than target)
- Outliers: 2/20 (10%)
### Diamond Workflow (4 steps)
- Mean: 20.805 ms
- Std Dev: 0.741 ms
- Min: 19.949 ms
- Max: 21.633 ms
- Target: < 75 ms
- Status: ✅ PASS (3.6x better than target)
- Outliers: 0/20 (0%)
## Summary
✅ All tests passed
🎯 Average performance: 3.3x better than targets
3. Baseline Comparison Format
For tracking performance over time:
# Performance Baseline Comparison
Baseline: main branch (2025-10-01)
Current: feature/benchmarks (2025-10-08)
| Benchmark | Baseline | Current | Change | Status |
|-----------|----------|---------|--------|--------|
| task_init/linear | 18.2ms | 17.7ms | -2.7% | ✅ Improved |
| task_init/diamond | 21.1ms | 20.8ms | -1.4% | ✅ Improved |
| sql/task_discovery | 2.91ms | 2.93ms | +0.7% | ✅ Stable |
4. CI Integration Format
For GitHub Actions / CI output:
{
"benchmark_suite": "task_initialization",
"timestamp": "2025-10-08T14:23:45Z",
"commit": "abc123def456",
"results": [
{
"name": "linear_3_steps",
"mean_ms": 17.748,
"std_dev_ms": 0.624,
"target_ms": 50,
"status": "pass",
"performance_ratio": 3.0
}
],
"summary": {
"total_tests": 2,
"passed": 2,
"failed": 0,
"warnings": 0
}
}
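The summary block could be derived mechanically from per-test results; a sketch, assuming a simple mean-under-target pass rule:

```rust
/// One benchmark result as it would appear in the CI JSON above.
struct BenchResult {
    mean_ms: f64,
    target_ms: f64,
}

struct Summary {
    total: usize,
    passed: usize,
    failed: usize,
}

/// A test passes when its mean comes in under its target; the "warnings"
/// bucket from the JSON would need an extra near-target threshold.
fn summarize(results: &[BenchResult]) -> Summary {
    let passed = results.iter().filter(|r| r.mean_ms < r.target_ms).count();
    Summary {
        total: results.len(),
        passed,
        failed: results.len() - passed,
    }
}
```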
Running All Benchmarks
Quick Reference
# 1. Start Docker services
docker-compose -f docker/docker-compose.test.yml up -d
# 2. Run individual benchmarks
cargo bench --package tasker-client --features benchmarks # Task initialization
cargo bench --package tasker-shared --features benchmarks # SQL + Events
cargo bench --package tasker-worker --features benchmarks # Worker + Handlers
cargo bench --package tasker-orchestration --features benchmarks # Step enqueueing
cargo bench --test e2e_latency # End-to-end
# 3. Run ALL benchmarks (when all implemented)
cargo bench --all-features
Environment Variables
# Required for SQL benchmarks
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
# Optional: Skip health checks (CI)
export TASKER_TEST_SKIP_HEALTH_CHECK="true"
# Optional: Custom service URLs
export TASKER_TEST_ORCHESTRATION_URL="http://localhost:9080"
export TASKER_TEST_WORKER_URL="http://localhost:9081"
Performance Targets Summary
| Category | Component | Metric | Target | Status |
|---|---|---|---|---|
| API | Task Creation (3 steps) | p99 | < 50ms | ✅ 17.7ms |
| API | Task Creation (4 steps) | p99 | < 75ms | ✅ 20.8ms |
| SQL | Task Discovery | mean | < 3ms | ✅ 1.75-2.93ms |
| SQL | Step Readiness | mean | < 1ms | ✅ 440-603µs |
| Worker | Total Overhead | mean | < 60ms | 🚧 TBD |
| Worker | FFI Overhead | mean | < 5ms | 🚧 TBD |
| Events | Notify Latency | p95 | < 10ms | 🚧 TBD |
| Orchestration | Step Enqueueing (3 steps) | mean | < 50ms | 🚧 TBD |
| E2E | Complete Workflow (3 steps) | p99 | < 500ms | 🚧 TBD |
Next Steps
Immediate (Current Session)
- ✅ Create all benchmark skeletons
- 🎯 Design consistent logging structure
- Decide on implementation priorities
Short Term
- Implement worker execution benchmark
- Implement event propagation benchmark
- Create benchmark output logging utilities
Medium Term
- Implement step enqueueing benchmark
- Implement handler overhead benchmark
- Implement E2E latency benchmark
Long Term
- CI integration with baseline tracking
- Performance regression detection
- Automated benchmark reports
- Historical performance trending
Documentation
- Full Plan: phase-5.4-distributed-benchmarks-plan.md
- SQL Benchmarks: benchmarking-guide.md
SQL Function Benchmarking Guide
Created: 2025-10-08
Status: ✅ Complete
Location: tasker-shared/benches/sql_functions.rs
Overview
The SQL function benchmark suite measures performance of critical database operations that form the hot paths in the Tasker orchestration system. These benchmarks provide:
- Baseline Performance Metrics: Establish expected performance ranges
- Regression Detection: Identify performance degradations in code changes
- Optimization Guidance: Use EXPLAIN ANALYZE output to guide index/query improvements
- Capacity Planning: Understand scaling characteristics with data volume
Quick Start
Prerequisites
# 1. Ensure PostgreSQL is running
pg_isready
# 2. Set up test database
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo sqlx migrate run
# 3. Populate with test data - REQUIRED for representative benchmarks
cargo test --all-features
Important: The benchmarks use intelligent sampling to test diverse task/step complexities. Running integration tests first ensures the database contains various workflow patterns (linear, diamond, parallel) for representative benchmarking.
Running Benchmarks
# Run all SQL benchmarks
cargo bench --package tasker-shared --features benchmarks
# Run specific benchmark group
cargo bench --package tasker-shared --features benchmarks get_next_ready_tasks
# Run with baseline comparison
cargo bench --package tasker-shared --features benchmarks -- --save-baseline main
# ... make changes ...
cargo bench --package tasker-shared --features benchmarks -- --baseline main
Sampling Strategy
The benchmarks use intelligent sampling to ensure representative results:
Task Sampling
- Samples 5 diverse tasks from different named_task_uuid types
- Maintains deterministic ordering (same UUIDs in same order each run)
- Provides consistent results while capturing complexity variance
Step Sampling
- Samples 10 diverse steps from different tasks
- Selects up to 2 steps per task for variety
- Captures different DAG depths and dependency patterns
- Helps identify performance variance in recursive queries
Benefits
- Representativeness: No bias from single task/step selection
- Consistency: Same samples = comparable baseline comparisons
- Variance Detection: Criterion can measure performance across complexities
- Real-world Accuracy: Reflects actual production workload diversity
Example Output:
step_readiness_status/calculate_readiness/0 2.345 ms
step_readiness_status/calculate_readiness/1 1.234 ms (simple linear task)
step_readiness_status/calculate_readiness/2 5.678 ms (complex diamond DAG)
step_readiness_status/calculate_readiness/3 3.456 ms
step_readiness_status/calculate_readiness/4 2.789 ms
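The step-sampling rule (up to 2 steps per task, deterministic order) can be sketched in-process; the real benchmarks do this selection against the database, and the (task, step) integer pairs here stand in for the actual UUID columns:

```rust
use std::collections::HashMap;

/// Take up to `per_task` steps per task, preserving input order so the
/// same rows always produce the same sample (needed for stable
/// baseline comparisons), stopping once `limit` steps are collected.
fn sample_steps(rows: &[(u32, u32)], per_task: usize, limit: usize) -> Vec<(u32, u32)> {
    let mut per_task_counts: HashMap<u32, usize> = HashMap::new();
    let mut sample = Vec::new();
    for &(task, step) in rows {
        let count = per_task_counts.entry(task).or_insert(0);
        if *count < per_task {
            *count += 1;
            sample.push((task, step));
            if sample.len() == limit {
                break;
            }
        }
    }
    sample
}
```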
Benchmark Categories
1. Task Discovery (get_next_ready_tasks)
What it measures: Time to discover ready tasks for orchestration
Hot path: Orchestration coordinator → Task discovery
Test parameters:
- Batch size: 1, 10, 50, 100 tasks
- Measures function overhead even with empty database
Expected performance:
- Empty DB: < 5ms for any batch size (function overhead)
- With data: Should scale linearly, not exponentially
Optimization targets:
- Index on task state
- Index on namespace for filtering
- Efficient processor ownership checks
Example output:
get_next_ready_tasks/batch_size/1
time: [2.1234 ms 2.1567 ms 2.1845 ms]
get_next_ready_tasks/batch_size/10
time: [2.2156 ms 2.2489 ms 2.2756 ms]
get_next_ready_tasks/batch_size/50
time: [2.5123 ms 2.5456 ms 2.5789 ms]
get_next_ready_tasks/batch_size/100
time: [3.0234 ms 3.0567 ms 3.0890 ms]
Analysis: Near-constant time across batch sizes indicates efficient query planning.
2. Step Readiness (get_step_readiness_status)
What it measures: Time to calculate if a step is ready to execute
Hot path: Step enqueuer → Dependency resolution
Dependencies: Requires test data (tasks with steps)
Expected performance:
- Simple linear: < 10ms
- Diamond pattern: < 20ms
- Complex DAG: < 50ms
Optimization targets:
- Parent step completion checks
- Dependency graph traversal
- Retry backoff calculations
Graceful degradation:
⚠️ Skipping step_readiness_status benchmark - no test data found
Run integration tests first to populate test data
3. State Transitions (transition_task_state_atomic)
What it measures: Time for atomic state transitions with processor ownership
Hot path: All orchestration operations (initialization, enqueuing, finalization)
Expected performance:
- Successful transition: < 15ms
- Failed transition (wrong state): < 10ms (faster path)
- Contention scenario: < 50ms with backoff
Optimization targets:
- Atomic compare-and-swap efficiency
- Index on task_uuid + processor_uuid
- Transition history table size
4. Task Execution Context (get_task_execution_context)
What it measures: Time to retrieve comprehensive task orchestration status
Hot path: Orchestration coordinator → Status checking
Dependencies: Requires test data (tasks in database)
Expected performance:
- Simple tasks: < 10ms
- Complex tasks: < 25ms
- With many steps: < 50ms
Optimization targets:
- Step aggregation queries
- State calculation efficiency
- Join optimization for step counts
5. Transitive Dependencies (get_step_transitive_dependencies)
What it measures: Time to resolve complete dependency tree for a step
Hot path: Worker → Step execution preparation (once per step lifecycle)
Dependencies: Requires test data (steps with dependencies)
Expected performance:
- Linear dependencies: < 5ms
- Diamond pattern: < 10ms
- Complex DAG (10+ levels): < 25ms
Optimization targets:
- Recursive CTE performance
- Index on step dependencies
- Materialized dependency graphs (future)
Why it matters: Called once per step on worker side when populating step data. While not in orchestration hot path, it affects worker step initialization latency. Recursive CTEs can be expensive with deep dependency trees.
6. EXPLAIN ANALYZE (explain_analyze)
What it measures: Query execution plans, not just timing
How it works: Runs EXPLAIN ANALYZE once per function (no repeated iterations since query plans don’t change between executions)
Functions analyzed:
- get_next_ready_tasks(): Task discovery query plans
- get_task_execution_context(): Task status aggregation plans
- get_step_transitive_dependencies(): Recursive CTE dependency traversal plans
Purpose: Identify optimization opportunities:
- Sequential scans (need indexes)
- Nested loop performance
- Buffer hit ratios
- Index usage patterns
- Recursive CTE efficiency
Automatic Query Plan Logging: Captures each query plan once and analyzes, printing:
- ⏱️ Execution Time: Actual query execution duration
- 📋 Planning Time: Time spent planning the query
- 📦 Node Type: Primary operation type (Aggregate, Index Scan, etc.)
- 💰 Total Cost: PostgreSQL’s cost estimate
- ⚠️ Sequential Scan Warning: Alerts for potential missing indexes
- 📊 Buffer Hit Ratio: Cache efficiency (higher is better)
Example output:
════════════════════════════════════════════════════════════════════════════════
📊 QUERY PLAN ANALYSIS
════════════════════════════════════════════════════════════════════════════════
🔍 Function: get_next_ready_tasks
────────────────────────────────────────────────────────────────────────────────
⏱️ Execution Time: 2.345 ms
📋 Planning Time: 0.123 ms
📦 Node Type: Aggregate
💰 Total Cost: 45.67
📊 Buffer Hit Ratio: 98.5% (197/200 blocks)
────────────────────────────────────────────────────────────────────────────────
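The buffer hit ratio printed above is derived from EXPLAIN's shared-buffer counters; as a sketch:

```rust
/// Shared-buffer hit ratio as printed in the analysis above: blocks
/// served from cache divided by all blocks touched (hit + read).
fn buffer_hit_ratio(hit_blocks: u64, read_blocks: u64) -> f64 {
    let total = hit_blocks + read_blocks;
    if total == 0 {
        return 0.0;
    }
    hit_blocks as f64 / total as f64
}
```

197 hits out of 200 blocks gives 0.985, i.e. the 98.5% shown above.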
Saving Full Plans:
# Save complete JSON plans to target/query_plan_*.json
SAVE_QUERY_PLANS=1 cargo bench --package tasker-shared --features benchmarks
Red flags to investigate:
- “Seq Scan” on large tables → Add index
- “Nested Loop” with high iteration count → Optimize join strategy
- “Sort” operations on large datasets → Add index for ORDER BY
- Low buffer hit ratio (< 90%) → Increase shared_buffers or investigate I/O
Interpreting Results
Criterion Statistics
Criterion provides comprehensive statistics for each benchmark:
get_next_ready_tasks/batch_size/10
time: [2.2156 ms 2.2489 ms 2.2756 ms]
change: [-1.5% +0.2% +1.9%] (p = 0.31 > 0.05)
No change in performance detected.
Found 3 outliers among 50 measurements (6.00%)
2 (4.00%) high mild
1 (2.00%) high severe
Key metrics:
- [2.2156 ms 2.2489 ms 2.2756 ms]: Lower bound, mean, upper bound (95% confidence)
- change: Comparison to baseline (if available)
- p-value: Statistical significance (p < 0.05 = significant)
- Outliers: Measurements far from median (cache effects, GC, etc.)
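Reading the change report can be mechanized; a sketch that mirrors Criterion's p < 0.05 significance rule (the label strings are ours, not Criterion's exact wording):

```rust
/// Classify a Criterion change report: a change only counts when
/// p < 0.05, and direction follows the sign of the mean change.
fn classify_change(mean_change_pct: f64, p_value: f64) -> &'static str {
    if p_value >= 0.05 {
        "no change detected"
    } else if mean_change_pct < 0.0 {
        "improved"
    } else {
        "regressed"
    }
}
```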
Performance Expectations
Based on Phase 3 metrics verification (26 tasks executed):
| Metric | Expected | Warning | Critical |
|---|---|---|---|
| Task initialization | < 50ms | 50-100ms | > 100ms |
| Step readiness | < 20ms | 20-50ms | > 50ms |
| State transition | < 15ms | 15-30ms | > 30ms |
| Finalization claim | < 10ms | 10-25ms | > 25ms |
Note: These are function-level times, not end-to-end latencies.
Using Benchmarks for Optimization
Workflow
1. Establish Baseline
   cargo bench --package tasker-shared --features benchmarks -- --save-baseline main
2. Make Changes (e.g., add index, optimize query)
3. Compare
   cargo bench --package tasker-shared --features benchmarks -- --baseline main
4. Review Output
   get_next_ready_tasks/batch_size/100
       time:   [2.0123 ms 2.0456 ms 2.0789 ms]
       change: [-34.5% -32.1% -29.7%] (p = 0.00 < 0.05)
       Performance has improved.
5. Analyze EXPLAIN Plans (if the improvement isn't clear)
Common Optimization Patterns
Pattern 1: Missing Index
Symptom: Exponential scaling with data volume
EXPLAIN shows: Seq Scan on tasks
Solution:
CREATE INDEX idx_tasks_state ON tasker.tasks(current_state)
WHERE complete = false;
Pattern 2: Inefficient Join
Symptom: High latency with complex DAGs
EXPLAIN shows: Nested Loop with high row counts
Solution: Use CTE or adjust join strategy
WITH parent_status AS (
SELECT ... -- Pre-compute parent completions
)
SELECT ... FROM tasker.workflow_steps s
JOIN parent_status ps ON ...
Pattern 3: Large Transaction History
Symptom: State transition slowing over time
EXPLAIN shows: Large scan of task_transitions
Solution: Partition by date or archive old transitions
CREATE TABLE tasker.task_transitions_archive (LIKE tasker.task_transitions);
-- Move old data periodically
Integration with Metrics
The benchmark results should correlate with production metrics:
From metrics-reference.md:
- tasker_task_initialization_duration_milliseconds → Benchmark: task discovery + initialization
- tasker_step_result_processing_duration_milliseconds → Benchmark: step readiness + state transitions
- tasker_task_finalization_duration_milliseconds → Benchmark: finalization claiming
Validation approach:
- Run benchmarks: Get ~2ms for task discovery
- Check metrics: tasker_task_initialization_duration P95 = ~45ms
- Calculate overhead: 45ms - 2ms = 43ms (business logic + framework)
This helps identify where optimization efforts should focus:
- If benchmark is slow → Optimize SQL/indexes
- If benchmark is fast but metrics slow → Optimize Rust code
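The overhead arithmetic above is simple enough to codify when sanity-checking dashboards. The sketch below is illustrative only — `framework_overhead_ms` is a hypothetical helper, not part of tasker-core, and the 45ms/2ms figures are the ones from the validation example:

```rust
// Split an observed production latency into the SQL portion (measured by the
// Criterion benchmarks) and the remaining framework/business-logic overhead.
// Hypothetical helper for illustration; not part of tasker-core.
fn framework_overhead_ms(metric_p95_ms: f64, benchmark_mean_ms: f64) -> f64 {
    // Clamp at zero: a metric faster than the benchmark just means noise.
    (metric_p95_ms - benchmark_mean_ms).max(0.0)
}

fn main() {
    // From the validation example: 45ms observed P95, ~2ms benchmarked SQL time
    let overhead = framework_overhead_ms(45.0, 2.0);
    println!("framework overhead: {overhead} ms"); // 43 ms of non-SQL work
}
```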
Continuous Integration
Recommended CI Workflow
# .github/workflows/benchmarks.yml
name: Performance Benchmarks
on:
pull_request:
branches: [main]
jobs:
benchmark:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:15
env:
POSTGRES_PASSWORD: tasker
options: >-
--health-cmd pg_isready
--health-interval 10s
steps:
- uses: actions/checkout@v3
- uses: dtolnay/rust-toolchain@stable
- name: Run migrations
run: cargo sqlx migrate run
env:
DATABASE_URL: postgresql://postgres:tasker@localhost/test
- name: Run benchmarks
run: cargo bench --package tasker-shared --features benchmarks
- name: Check for regressions
run: |
# Parse Criterion output and fail if P95 > threshold
# This is left as an exercise for CI implementation
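One way to fill in that exercise: Criterion writes per-benchmark estimates (in nanoseconds) under target/criterion/, and a small comparison can gate the PR. The sketch below assumes the mean values have already been extracted from that output; the helper name and the 10% tolerance are illustrative, not an existing tasker-core script:

```rust
// Minimal regression gate: fail when the current mean exceeds the baseline
// mean by more than an allowed percentage. Inputs are nanoseconds, matching
// Criterion's reported units; wiring up the output parsing is left to the CI job.
fn regression_detected(baseline_ns: f64, current_ns: f64, allowed_regression_pct: f64) -> bool {
    current_ns > baseline_ns * (1.0 + allowed_regression_pct / 100.0)
}

fn main() {
    // Baseline mean 2.2489 ms; a 2.6 ms run breaches a 10% tolerance
    assert!(regression_detected(2_248_900.0, 2_600_000.0, 10.0));
    // A 2.3 ms run stays within tolerance
    assert!(!regression_detected(2_248_900.0, 2_300_000.0, 10.0));
    println!("regression gate ok");
}
```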
Future Enhancements
Phase 5.3: Data Generation (Deferred)
The current benchmarks work with existing test data. Future work could add:
- Realistic Data Generation
  - Create 100/1,000/10,000 task datasets
  - Various DAG complexities (linear, diamond, tree)
  - State distribution (60% complete, 20% in-progress, etc.)
- Contention Testing
  - Multiple processors competing for same tasks
  - Race condition scenarios
  - Deadlock detection
- Long-Running Benchmarks
  - Memory leak detection
  - Connection pool exhaustion
  - Query plan cache effects
Troubleshooting
Benchmark fails with “DATABASE_URL must be set”
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
All benchmarks show “no test data found”
# Run integration tests to populate database
cargo test --all-features
# Or run specific test suite
cargo test --package tasker-shared --all-features
Benchmarks are inconsistent/noisy
- Close other applications
- Ensure PostgreSQL isn’t under load
- Run benchmarks multiple times
- Increase sample_size in benchmark code
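For the last point, Criterion exposes these knobs on its builder. A configuration fragment as a sketch — the group and function names are placeholders, not tasker-core's actual benchmark code:

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder benchmark body; the group configuration is the point here.
fn bench_ready_tasks(c: &mut Criterion) {
    c.bench_function("get_next_ready_tasks/batch_size/10", |b| b.iter(|| { /* ... */ }));
}

criterion_group! {
    name = benches;
    // More samples and a longer measurement window reduce noise at the
    // cost of runtime; sample_size must be at least 10.
    config = Criterion::default()
        .sample_size(200)
        .measurement_time(Duration::from_secs(20))
        .warm_up_time(Duration::from_secs(5));
    targets = bench_ready_tasks
}
criterion_main!(benches);
```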
Results don’t match production metrics
- Production has different data volumes
- Network latency in production
- Different PostgreSQL version/configuration
- Connection pool overhead in production
References
- Criterion Documentation: https://bheisler.github.io/criterion.rs/book/
- PostgreSQL EXPLAIN: https://www.postgresql.org/docs/current/sql-explain.html
- Phase 3 Metrics: docs/observability/metrics-reference.md
- Verification Results: docs/observability/VERIFICATION_RESULTS.md
Sign-Off
Phase 5.2 Status: ✅ COMPLETE
Benchmarks Implemented:
- ✅ get_next_ready_tasks() - 4 batch sizes
- ✅ get_step_readiness_status() - with graceful skip
- ✅ transition_task_state_atomic() - atomic operations
- ✅ get_task_execution_context() - orchestration status retrieval
- ✅ get_step_transitive_dependencies() - recursive dependency traversal
- ✅ EXPLAIN ANALYZE - query plan capture with automatic analysis
Documentation Complete:
- ✅ Quick start guide
- ✅ Interpretation guidance
- ✅ Optimization patterns
- ✅ Integration with metrics
- ✅ CI recommendations
Next Steps: Run benchmarks with real data and establish baseline performance targets.
Tasker-Core Logging Standards
Version: 1.0 Last Updated: 2025-10-07 Status: Active Related: Observability Standardization
Table of Contents
- Philosophy
- Log Levels
- Structured Fields
- Message Style
- Instrument Macro
- Error Handling
- Examples
- Enforcement
Philosophy
Principles:
- Production-First: Logs must be parseable, searchable, and professional
- Correlation-Driven: All operations include correlation_id for distributed tracing
- Structured: Fields over string interpolation for aggregation and querying
- Concise: Clear, actionable messages without noise
- Consistent: Predictable patterns across all code
Anti-Patterns to Avoid:
- ❌ Emojis (🚀✅❌) - Breaks log parsers, unprofessional
- ❌ All-caps prefixes (“BOOTSTRAP:”, “CORE:”) - Redundant with module paths
- ❌ Ticket references (“JIRA-123”, “PROJ-40”) - Internal, meaningless externally
- ❌ String interpolation - Use structured fields instead
- ❌ Verbose messages - Be concise, let fields provide detail
Log Levels
ERROR - Unrecoverable Failures
When to Use:
- Database connection permanently lost
- Critical system component failure
- Unrecoverable state machine violation
- Data corruption detected
- Message queue unavailable
Characteristics:
- Requires immediate human intervention
- Service degradation or outage
- Cannot automatically recover
- Should trigger alerts/pages
Example:
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Failed to claim task for finalization: database unavailable"
);
WARN - Degraded Operation
When to Use:
- Retryable failures after exhausting retries
- Circuit breaker opened (degraded mode)
- Fallback behavior activated
- Rate limiting engaged
- Configuration issues (non-fatal)
- Unexpected but handled conditions
Characteristics:
- Service continues but degraded
- Automatic recovery possible
- Should be monitored for patterns
- May indicate upstream problems
Example:
warn!(
    correlation_id = %correlation_id,
    step_uuid = %step_uuid,
    retry_count = attempts,
    max_retries = max_attempts,
    next_retry_at = ?next_retry,
    "Step execution failed after max retries, will not retry further"
);
INFO - Lifecycle Events
When to Use:
- System startup/shutdown
- Task created/completed/failed
- Step enqueued/completed
- State transitions (task/step)
- Configuration loaded
- Significant business events
Characteristics:
- Normal operation milestones
- Useful for understanding flow
- Production-ready verbosity
- Default log level in production
Example:
info!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    steps_enqueued = count,
    duration_ms = elapsed.as_millis(),
    "Task initialization complete"
);
DEBUG - Detailed Diagnostics
When to Use:
- Discovery query results
- Queue depth checks
- Dependency analysis details
- Configuration value dumps
- State machine transition details
- Detailed operation flow
Characteristics:
- Troubleshooting information
- Not shown in production (usually)
- Safe to be verbose
- Helps understand “why”
Example:
debug!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    viable_steps = steps.len(),
    pending_steps = pending.len(),
    blocked_steps = blocked.len(),
    "Step readiness analysis complete"
);
TRACE - Very Verbose
When to Use:
- Function entry/exit in hot paths
- Loop iteration details
- Deep parameter inspection
- Performance profiling hooks
Characteristics:
- Extremely verbose
- Usually disabled even in dev
- Performance impact acceptable
- Use sparingly
Example:
trace!(
    correlation_id = %correlation_id,
    iteration = i,
    "Polling loop iteration"
);
Structured Fields
Required Fields (Context-Dependent)
Always Include:
correlation_id = %correlation_id,  // ALWAYS when available
When Task Context Available:
correlation_id = %correlation_id,
task_uuid = %task_uuid,
namespace = %namespace,
When Step Context Available:
correlation_id = %correlation_id,
task_uuid = %task_uuid,
step_uuid = %step_uuid,
namespace = %namespace,
For Operations:
correlation_id = %correlation_id,
// ... entity IDs ...
operation = "step_enqueue",         // Operation identifier
duration_ms = elapsed.as_millis(),  // Timing for operations
For Errors:
correlation_id = %correlation_id,
// ... entity IDs ...
error = %e,                      // Error Display
error_type = %type_name::<E>(),  // Optional: Error type
Field Ordering (MANDATORY)
Standard Order:
- correlation_id (always first)
- Entity IDs (task_uuid, step_uuid, namespace)
- Operation/Action (operation, state, status)
- Measurements (duration_ms, count, size)
- Error Info (error, error_type, context)
- Other Context (additional fields)
Example:
info!(
    // 1. Correlation ID (ALWAYS FIRST)
    correlation_id = %correlation_id,

    // 2. Entity IDs
    task_uuid = %task_uuid,
    step_uuid = %step_uuid,
    namespace = %namespace,

    // 3. Operation
    operation = "step_transition",
    from_state = %old_state,
    to_state = %new_state,

    // 4. Measurements
    duration_ms = elapsed.as_millis(),

    // 5. No errors (success case)

    // 6. Other context
    processor_id = %processor_uuid,

    "Step state transition complete"
);
Field Formatting
Use Display Formatting (%):
// ✅ CORRECT: Let tracing handle formatting
correlation_id = %correlation_id,
task_uuid = %task_uuid,
error = %e,
Avoid Manual Conversion:
// ❌ WRONG: Manual to_string()
task_uuid = task_uuid.to_string(),

// ❌ WRONG: Debug formatting for production types
task_uuid = ?task_uuid,  // Use ? only for Debug types
Field Naming:
// ✅ Standard names
duration_ms   // Not elapsed_ms, time_ms
error         // Not err, error_message
step_uuid     // Not workflow_step_uuid (be consistent)
retry_count   // Not attempts, retries
max_retries   // Not max_attempts
Message Style
Guidelines
DO:
- ✅ Be concise and actionable
- ✅ Use present tense for states: “Step enqueued”
- ✅ Use past tense for events: “Task completed”
- ✅ Start with the subject: “Task completed” not “Successfully completed task”
- ✅ Focus on WHAT happened (fields show HOW)
DON’T:
- ❌ Use emojis: “🚀 Starting…” → “Starting orchestration system”
- ❌ Use all-caps prefixes: “BOOTSTRAP: Starting…” → “Starting orchestration bootstrap”
- ❌ Include ticket numbers: “PROJ-40: Processing…” → “Processing command”
- ❌ Be redundant: “Successfully enqueued step successfully” → “Step enqueued”
- ❌ Include technical jargon: “Atomic CAS transition succeeded” → “State transition complete”
- ❌ Be verbose: Keep messages under 10 words ideally
Before/After Examples
Lifecycle Events:
// ❌ BEFORE
info!("🚀 BOOTSTRAP: Starting unified orchestration system bootstrap");

// ✅ AFTER
info!("Starting orchestration system bootstrap");
Operation Completion:
// ❌ BEFORE
info!("✅ STEP_ENQUEUER: Successfully marked step {} as enqueued", step_uuid);

// ✅ AFTER
info!(
    correlation_id = %correlation_id,
    step_uuid = %step_uuid,
    "Step marked as enqueued"
);
Error Handling:
// ❌ BEFORE
error!("❌ ORCHESTRATION_LOOP: Failed to process task {}: {}", task_uuid, e);

// ✅ AFTER
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Task processing failed"
);
Shutdown:
// ❌ BEFORE
info!("🛑 Shutdown signal received, initiating graceful shutdown...");

// ✅ AFTER
info!("Shutdown signal received, initiating graceful shutdown");
Instrument Macro
When to Use
Use #[instrument] for:
- Function-level spans in hot paths
- Automatic correlation ID tracking
- Operations that should appear in traces
- Functions with significant duration
Benefits:
- Automatic span creation
- Automatic timing
- Better OpenTelemetry integration (Phase 2)
- Cleaner code
Example
use tracing::instrument;

#[instrument(skip(self), fields(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    namespace = %namespace
))]
pub async fn process_task(
    &self,
    correlation_id: Uuid,
    task_uuid: Uuid,
    namespace: String,
) -> Result<TaskResult> {
    // Span automatically created with fields above
    info!("Starting task processing");

    // ... implementation ...

    info!(
        duration_ms = start.elapsed().as_millis(),
        "Task processing complete"
    );

    Ok(result)
}
Skip Parameters
Always skip:
- self (redundant)
- Large structures (use specific fields instead)
- Sensitive data (passwords, tokens, PII)
#[instrument(
    skip(self, context),  // Skip large context
    fields(
        correlation_id = %correlation_id,
        task_uuid = %context.task_uuid,  // Extract specific fields
    )
)]
Error Handling
Error Context
Always include:
error!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,                      // Error Display (user-friendly)
    error_type = %type_name::<E>(),  // Optional: For classification
    "Operation failed"
);
Error Propagation
// ✅ Log and return for caller to handle
debug!(
    correlation_id = %correlation_id,
    task_uuid = %task_uuid,
    error = %e,
    "Step discovery query failed, will retry"
);
return Err(e);

// ❌ Don't log at every level (causes noise)
// Instead: Log once at appropriate level where handled
Error Classification
match result {
    Err(e) if e.is_retryable() => {
        warn!(
            correlation_id = %correlation_id,
            error = %e,
            retry_count = attempts,
            "Operation failed, will retry"
        );
    }
    Err(e) => {
        error!(
            correlation_id = %correlation_id,
            error = %e,
            "Operation failed permanently"
        );
    }
    Ok(result) => {
        info!(
            correlation_id = %correlation_id,
            duration_ms = elapsed.as_millis(),
            "Operation completed successfully"
        );
    }
}
Examples
Complete Examples by Scenario
Task Initialization
#[instrument(skip(self), fields(
    correlation_id = %task_request.correlation_id,
    task_name = %task_request.name,
    namespace = %task_request.namespace
))]
pub async fn create_task_from_request(
    &self,
    task_request: TaskRequest,
) -> Result<TaskInitializationResult> {
    let correlation_id = task_request.correlation_id;
    let start = Instant::now();

    info!("Starting task initialization");

    // Create task
    let task = self.create_task(&task_request).await?;
    debug!(
        task_uuid = %task.task_uuid,
        template_uuid = %task.named_task_uuid,
        "Task created in database"
    );

    // Discover steps
    let steps = self.discover_initial_steps(task.task_uuid).await?;

    info!(
        correlation_id = %correlation_id,
        task_uuid = %task.task_uuid,
        step_count = steps.len(),
        duration_ms = start.elapsed().as_millis(),
        "Task initialization complete"
    );

    Ok(TaskInitializationResult {
        task_uuid: task.task_uuid,
        step_count: steps.len(),
    })
}
Step Enqueueing
pub async fn enqueue_step(
    &self,
    correlation_id: Uuid,
    task_uuid: Uuid,
    step: &ViableStep,
) -> Result<()> {
    let start = Instant::now();

    debug!(
        correlation_id = %correlation_id,
        task_uuid = %task_uuid,
        step_uuid = %step.step_uuid,
        step_name = %step.name,
        queue = %step.queue_name,
        "Enqueueing step"
    );

    let message = self.create_message(correlation_id, task_uuid, step)?;
    self.pgmq_client
        .send(&step.queue_name, &message)
        .await?;

    info!(
        correlation_id = %correlation_id,
        task_uuid = %task_uuid,
        step_uuid = %step.step_uuid,
        queue = %step.queue_name,
        duration_ms = start.elapsed().as_millis(),
        "Step enqueued"
    );

    Ok(())
}
Error Handling
match self.process_step_result(result).await {
    Ok(()) => {
        info!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            duration_ms = elapsed.as_millis(),
            "Step result processed"
        );
    }
    Err(e) if e.is_retryable() => {
        warn!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            error = %e,
            retry_count = result.attempts,
            "Step result processing failed, will retry"
        );
        return Err(e);
    }
    Err(e) => {
        error!(
            correlation_id = %result.correlation_id,
            task_uuid = %result.task_uuid,
            step_uuid = %result.step_uuid,
            error = %e,
            "Step result processing failed permanently"
        );
        return Err(e);
    }
}
Bootstrap/Shutdown
pub async fn bootstrap() -> Result<OrchestrationSystemHandle> {
    info!("Starting orchestration system bootstrap");

    let config = ConfigManager::load()?;
    debug!(environment = %config.environment, "Configuration loaded");

    let context = SystemContext::from_config(config).await?;
    info!(processor_uuid = %context.processor_uuid, "System context initialized");

    let core = OrchestrationCore::new(context).await?;
    info!("Orchestration core initialized");

    // ... more initialization ...

    info!(
        processor_uuid = %core.processor_uuid,
        namespaces = ?core.supported_namespaces,
        "Orchestration system bootstrap complete"
    );

    Ok(handle)
}

pub async fn shutdown(&mut self) -> Result<()> {
    info!("Initiating graceful shutdown");

    if let Some(coordinator) = &self.event_coordinator {
        coordinator.stop().await?;
        debug!("Event coordinator stopped");
    }

    info!("Orchestration system shutdown complete");
    Ok(())
}
Enforcement
Code Review Checklist
Before merging, verify:
- No emojis in log messages
- No all-caps component prefixes
- No ticket references in runtime logs
- correlation_id present in all task/step operations
- Structured fields follow standard ordering
- Messages are concise and actionable
- Appropriate log levels used
- Error context is complete
CI Checks
Recommended lints (future):
# Check for emojis
! grep -r '[🔧✅🚀❌⚠️📊🔍🎉🛡️⏱️📝🏗️🎯🔄💡📦🧪🌉🔌⏳🛑]' src/
# Check for all-caps prefixes
! grep -rE '(info|debug|warn|error)!\(".*[A-Z_]{3,}:' src/
# Check for ticket references in logs (allow in comments)
! grep -rE '(info|debug|warn|error)!.*[A-Z]+-[0-9]+' src/
Pre-commit Hook
Add to .git/hooks/pre-commit:
#!/bin/bash
./scripts/audit-logging.sh --check || {
echo "❌ Logging standards violation detected"
echo "Run: ./scripts/audit-logging.sh for details"
exit 1
}
Migration Guide
For Existing Code
- Remove emojis: Use find/replace
- Remove all-caps prefixes: Simple cleanup
- Add correlation_id: Extract from context
- Reorder fields: correlation_id first
- Shorten messages: Remove redundancy
- Verify log levels: Lifecycle = INFO, diagnostics = DEBUG
For New Code
- Always include correlation_id when context available
- Use #[instrument] for significant functions
- Follow field ordering: correlation_id, IDs, operation, measurements, errors
- Keep messages concise: Under 10 words
- Choose appropriate level: ERROR (fatal), WARN (degraded), INFO (lifecycle), DEBUG (diagnostic)
FAQ
Q: Should I use info! or debug! for step enqueueing?
A: info! - It’s a significant lifecycle event even if frequent.
Q: When should I add duration_ms?
A: For any operation that:
- Calls external systems (DB, queue)
- Is in the hot path
- Takes >10ms typically
- Needs performance monitoring
Q: Can I use emojis in error messages?
A: No. Never use emojis in any log message. They break parsers and are unprofessional.
Q: Should correlation_id really always be first?
A: Yes. This enables easy correlation across all logs. It’s the #1 most important field for distributed tracing.
Q: What about ticket references in module docs?
A: Acceptable in module-level documentation for architectural context. Remove from runtime logs and inline comments.
Q: Can I include stack traces in logs?
A: Use error = %e which includes the error chain. Only add explicit backtrace for truly exceptional cases.
References
Document End
This is a living document. Propose changes via PR with rationale.
OpenTelemetry Metrics Reference
Status: ✅ Complete Export Interval: 60 seconds OTLP Endpoint: http://localhost:4317 Grafana UI: http://localhost:3000
This document provides a complete reference for all OpenTelemetry metrics instrumented in the Tasker orchestration system.
Table of Contents
- Overview
- Configuration
- Orchestration Metrics
- Worker Metrics
- Resilience Metrics
- Database Metrics
- Messaging Metrics
- Example Queries
- Dashboard Recommendations
Overview
The Tasker system exports 47+ OpenTelemetry metrics across 5 domains:
| Domain | Metrics | Description |
|---|---|---|
| Orchestration | 11 | Task lifecycle, step coordination, finalization |
| Worker | 10 | Step execution, claiming, result submission |
| Resilience | 8+ | Circuit breakers, MPSC channels |
| Database | 7 | SQL query performance, connection pools |
| Messaging | 11 | PGMQ queue operations, message processing |
All metrics include correlation_id labels for distributed tracing correlation with Tempo traces.
Histogram Metric Naming
The OpenTelemetry Prometheus exporter automatically appends _milliseconds to histogram metric names when the unit is specified as ms. This provides clarity in Prometheus queries.
Pattern: metric_name → metric_name_milliseconds_{bucket,sum,count}
Example:
- Code: tasker.step.execution.duration with unit “ms”
- Prometheus: tasker_step_execution_duration_milliseconds_*
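The _bucket series feed histogram_quantile(), which locates the cumulative bucket containing the target rank and linearly interpolates inside it. A simplified sketch of that rule (cumulative counts, sorted le bounds, +Inf last — not Prometheus's actual implementation):

```rust
/// Simplified sketch of PromQL `histogram_quantile()`: find the cumulative
/// bucket containing rank q * total, then linearly interpolate inside it.
/// Assumes sorted `le` upper bounds ending with +Inf and 0 < q <= 1.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> f64 {
    let total = buckets.last().map(|b| b.1).unwrap_or(0.0);
    let rank = q * total;
    let (mut prev_le, mut prev_count) = (0.0_f64, 0.0_f64);
    for &(le, count) in buckets {
        if count >= rank {
            if le.is_infinite() {
                // Quantile falls in the +Inf bucket: return the highest
                // finite upper bound, as Prometheus does.
                return prev_le;
            }
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count);
        }
        prev_le = le;
        prev_count = count;
    }
    prev_le
}

fn main() {
    // Cumulative buckets: 50 observations <= 10ms, 90 <= 50ms, 100 total
    let buckets = [(10.0, 50.0), (50.0, 90.0), (f64::INFINITY, 100.0)];
    assert_eq!(histogram_quantile(0.5, &buckets), 10.0); // P50 = 10ms
    assert_eq!(histogram_quantile(0.9, &buckets), 50.0); // P90 = 50ms
    println!("quantile sketch ok");
}
```

Because the result is interpolated from bucket boundaries, quantile accuracy depends on how well the bucket layout matches the latency distribution.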
Query Patterns: Instant vs Rate-Based
Instant/Recent Data Queries - Use these when:
- Testing with burst/batch task execution
- Viewing data from recent runs (last few minutes)
- Data points are sparse or clustered together
- You want simple averages without time windows
Rate-Based Queries - Use these when:
- Continuous production monitoring
- Data flows steadily over time
- Calculating per-second rates
- Building alerting rules
Why the difference matters: The rate() function calculates per-second change rates over a time window. It requires data points spread across that window. If you run 26 tasks in quick succession, all data points cluster at one timestamp, and rate() returns no data because there’s no rate change to calculate.
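That behavior is easy to reproduce with a toy model of rate(); this sketch deliberately ignores counter resets and the boundary extrapolation the real function performs:

```rust
/// Toy model of PromQL `rate()`: per-second increase computed from the first
/// and last counter samples inside the lookback window. Ignores counter
/// resets and extrapolation, which the real function handles.
fn rate(samples: &[(f64, f64)], window_start: f64, window_end: f64) -> Option<f64> {
    let in_window: Vec<(f64, f64)> = samples
        .iter()
        .copied()
        .filter(|(t, _)| *t >= window_start && *t <= window_end)
        .collect();
    // Fewer than two samples in the window: no rate can be computed,
    // which is exactly why burst runs show "no data" in rate() queries.
    if in_window.len() < 2 {
        return None;
    }
    let (t0, v0) = in_window[0];
    let (t1, v1) = *in_window.last().unwrap();
    Some((v1 - v0) / (t1 - t0))
}

fn main() {
    // Steady traffic: counter went 0 -> 30 over 60s => 0.5/s
    assert_eq!(rate(&[(0.0, 0.0), (60.0, 30.0)], 0.0, 60.0), Some(0.5));
    // 26 tasks executed in one burst: a single clustered data point => no data
    assert_eq!(rate(&[(60.0, 26.0)], 0.0, 60.0), None);
    println!("rate model ok");
}
```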
Configuration
Enable OpenTelemetry
File: config/tasker/environments/development/telemetry.toml
[telemetry]
enabled = true
service_name = "tasker-core-dev"
sample_rate = 1.0
[telemetry.opentelemetry]
enabled = true # Must be true to export metrics
Verify in Logs
# Should see:
# opentelemetry_enabled=true
# NOT: Metrics collection disabled (TELEMETRY_ENABLED=false)
Orchestration Metrics
Module: tasker-shared/src/metrics/orchestration.rs
Instrumentation: tasker-orchestration/src/orchestration/lifecycle/*.rs
Counters
tasker.tasks.requests.total
Description: Total number of task creation requests received Type: Counter (u64) Labels:
- correlation_id: Request correlation ID
- task_type: Task name (e.g., “mathematical_sequence”)
- namespace: Task namespace (e.g., “rust_e2e_linear”)
Instrumented In: task_initializer.rs:start_task_initialization()
Example Query:
# Total task requests
tasker_tasks_requests_total
# By namespace
sum by (namespace) (tasker_tasks_requests_total)
# Specific correlation_id
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Output: (To be verified)
tasker.tasks.completions.total
Description: Total number of tasks that completed successfully Type: Counter (u64) Labels:
correlation_id: Request correlation ID
Instrumented In: task_finalizer.rs:finalize_task() (FinalizationAction::Completed)
Example Query:
# Total completions
tasker_tasks_completions_total
# Completion rate over 5 minutes
rate(tasker_tasks_completions_total[5m])
Expected Output: (To be verified)
tasker.tasks.failures.total
Description: Total number of tasks that failed Type: Counter (u64) Labels:
correlation_id: Request correlation ID
Instrumented In: task_finalizer.rs:finalize_task() (FinalizationAction::Failed)
Example Query:
# Total failures
tasker_tasks_failures_total
# Error rate over 5 minutes
rate(tasker_tasks_failures_total[5m])
Expected Output: (To be verified)
tasker.steps.enqueued.total
Description: Total number of steps enqueued to worker queues Type: Counter (u64) Labels:
- correlation_id: Request correlation ID
- namespace: Task namespace
- step_name: Name of the enqueued step
Instrumented In: step_enqueuer.rs:enqueue_steps()
Example Query:
# Total steps enqueued
tasker_steps_enqueued_total
# By step name
sum by (step_name) (tasker_steps_enqueued_total)
# For specific task
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Output: (To be verified)
tasker.step_results.processed.total
Description: Total number of step results processed from workers Type: Counter (u64) Labels:
- correlation_id: Request correlation ID
- result_type: “success”, “error”, “timeout”, “cancelled”, “skipped”
Instrumented In: result_processor.rs:process_step_result()
Example Query:
# Total results processed
tasker_step_results_processed_total
# By result type
sum by (result_type) (tasker_step_results_processed_total)
# Success rate
rate(tasker_step_results_processed_total{result_type="success"}[5m])
Expected Output: (To be verified)
Histograms
tasker.task.initialization.duration
Description: Task initialization duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
- tasker_task_initialization_duration_milliseconds_bucket
- tasker_task_initialization_duration_milliseconds_sum
- tasker_task_initialization_duration_milliseconds_count
Labels:
- correlation_id: Request correlation ID
- task_type: Task name
Instrumented In: task_initializer.rs:start_task_initialization()
Example Queries:
Instant/Recent Data (works immediately after task execution):
# Simple average initialization time
tasker_task_initialization_duration_milliseconds_sum /
tasker_task_initialization_duration_milliseconds_count
# P95 latency
histogram_quantile(0.95, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
# P99 latency
histogram_quantile(0.99, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
Rate-Based (for continuous monitoring, requires data spread over time):
# Average initialization time over 5 minutes
rate(tasker_task_initialization_duration_milliseconds_sum[5m]) /
rate(tasker_task_initialization_duration_milliseconds_count[5m])
# P95 latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))
Expected Output: ✅ Verified - Returns millisecond values
tasker.task.finalization.duration
Description: Task finalization duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
- tasker_task_finalization_duration_milliseconds_bucket
- tasker_task_finalization_duration_milliseconds_sum
- tasker_task_finalization_duration_milliseconds_count
Labels:
- correlation_id: Request correlation ID
- final_state: “complete”, “error”, “cancelled”
Instrumented In: task_finalizer.rs:finalize_task()
Example Queries:
Instant/Recent Data:
# Simple average finalization time
tasker_task_finalization_duration_milliseconds_sum /
tasker_task_finalization_duration_milliseconds_count
# P95 by final state
histogram_quantile(0.95,
sum by (final_state, le) (
tasker_task_finalization_duration_milliseconds_bucket
)
)
Rate-Based:
# Average finalization time over 5 minutes
rate(tasker_task_finalization_duration_milliseconds_sum[5m]) /
rate(tasker_task_finalization_duration_milliseconds_count[5m])
# P95 by final state over 5 minutes
histogram_quantile(0.95,
sum by (final_state, le) (
rate(tasker_task_finalization_duration_milliseconds_bucket[5m])
)
)
Expected Output: ✅ Verified - Returns millisecond values
tasker.step_result.processing.duration
Description: Step result processing duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
- tasker_step_result_processing_duration_milliseconds_bucket
- tasker_step_result_processing_duration_milliseconds_sum
- tasker_step_result_processing_duration_milliseconds_count
Labels:
- correlation_id: Request correlation ID
- result_type: “success”, “error”, “timeout”, “cancelled”, “skipped”
Instrumented In: result_processor.rs:process_step_result()
Example Queries:
Instant/Recent Data:
# Simple average result processing time
tasker_step_result_processing_duration_milliseconds_sum /
tasker_step_result_processing_duration_milliseconds_count
# P50, P95, P99 latencies
histogram_quantile(0.50, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
histogram_quantile(0.95, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
histogram_quantile(0.99, sum by (le) (tasker_step_result_processing_duration_milliseconds_bucket))
Rate-Based:
# Average result processing time over 5 minutes
rate(tasker_step_result_processing_duration_milliseconds_sum[5m]) /
rate(tasker_step_result_processing_duration_milliseconds_count[5m])
# P95 latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_step_result_processing_duration_milliseconds_bucket[5m])))
Expected Output: ✅ Verified - Returns millisecond values
Gauges
tasker.tasks.active
Description: Number of tasks currently being processed Type: Gauge (u64) Labels:
state: Current task state
Status: Planned (not yet instrumented)
tasker.steps.ready
Description: Number of steps ready for execution Type: Gauge (u64) Labels:
namespace: Worker namespace
Status: Planned (not yet instrumented)
Worker Metrics
Module: tasker-shared/src/metrics/worker.rs
Instrumentation: tasker-worker/src/worker/*.rs
Counters
tasker.steps.executions.total
Description: Total number of step executions attempted Type: Counter (u64) Labels:
correlation_id: Request correlation ID
Instrumented In: command_processor.rs:handle_execute_step()
Example Query:
# Total step executions
tasker_steps_executions_total
# Execution rate
rate(tasker_steps_executions_total[5m])
# For specific task
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Output: (To be verified)
tasker.steps.successes.total
Description: Total number of step executions that completed successfully Type: Counter (u64) Labels:
- correlation_id: Request correlation ID
- namespace: Worker namespace
Instrumented In: command_processor.rs:handle_execute_step() (success path)
Example Query:
# Total successes
tasker_steps_successes_total
# Success rate
rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])
# By namespace
sum by (namespace) (tasker_steps_successes_total)
Expected Output: (To be verified)
tasker.steps.failures.total
Description: Total number of step executions that failed Type: Counter (u64) Labels:
- correlation_id: Request correlation ID
- namespace: Worker namespace (or “unknown” for early failures)
- error_type: “claim_failed”, “database_error”, “step_not_found”, “message_deletion_failed”
Instrumented In: command_processor.rs:handle_execute_step() (error paths)
Example Query:
# Total failures
tasker_steps_failures_total
# Failure rate
rate(tasker_steps_failures_total[5m]) / rate(tasker_steps_executions_total[5m])
# By error type
sum by (error_type) (tasker_steps_failures_total)
# Error distribution
topk(5, sum by (error_type) (tasker_steps_failures_total))
Expected Output: (To be verified)
tasker.steps.claimed.total
Description: Total number of steps claimed from queues Type: Counter (u64) Labels:
- namespace: Worker namespace
- claim_method: “event”, “poll”
Instrumented In: step_claim.rs:try_claim_step()
Example Query:
# Total claims
tasker_steps_claimed_total
# By claim method
sum by (claim_method) (tasker_steps_claimed_total)
# Claim rate
rate(tasker_steps_claimed_total[5m])
Expected Output: (To be verified)
tasker.steps.results_submitted.total
Description: Total number of step results submitted to orchestration Type: Counter (u64) Labels:
- correlation_id: Request correlation ID
- result_type: “completion”
Instrumented In: orchestration_result_sender.rs:send_completion()
Example Query:
# Total submissions
tasker_steps_results_submitted_total
# Submission rate
rate(tasker_steps_results_submitted_total[5m])
# For specific task
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Output: (To be verified)
Histograms
tasker.step.execution.duration
Description: Step execution duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
- tasker_step_execution_duration_milliseconds_bucket
- tasker_step_execution_duration_milliseconds_sum
- tasker_step_execution_duration_milliseconds_count
Labels:
- correlation_id: Request correlation ID
- namespace: Worker namespace
- result: “success”, “error”
Instrumented In: command_processor.rs:handle_execute_step()
Example Queries:
Instant/Recent Data:
# Simple average execution time
tasker_step_execution_duration_milliseconds_sum /
tasker_step_execution_duration_milliseconds_count
# P95 latency by namespace
histogram_quantile(0.95,
sum by (namespace, le) (
tasker_step_execution_duration_milliseconds_bucket
)
)
# P99 latency
histogram_quantile(0.99, sum by (le) (tasker_step_execution_duration_milliseconds_bucket))
Rate-Based:
# Average execution time over 5 minutes
rate(tasker_step_execution_duration_milliseconds_sum[5m]) /
rate(tasker_step_execution_duration_milliseconds_count[5m])
# P95 latency by namespace over 5 minutes
histogram_quantile(0.95,
sum by (namespace, le) (
rate(tasker_step_execution_duration_milliseconds_bucket[5m])
)
)
Expected Output: ✅ Verified - Returns millisecond values
tasker.step.claim.duration
Description: Step claiming duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
- tasker_step_claim_duration_milliseconds_bucket
- tasker_step_claim_duration_milliseconds_sum
- tasker_step_claim_duration_milliseconds_count
Labels:
- namespace: Worker namespace
- claim_method: “event”, “poll”
Instrumented In: step_claim.rs:try_claim_step()
Example Queries:
Instant/Recent Data:
# Simple average claim time
tasker_step_claim_duration_milliseconds_sum /
tasker_step_claim_duration_milliseconds_count
# Compare event vs poll claiming (P95)
histogram_quantile(0.95,
sum by (claim_method, le) (
tasker_step_claim_duration_milliseconds_bucket
)
)
Rate-Based:
# Average claim time over 5 minutes
rate(tasker_step_claim_duration_milliseconds_sum[5m]) /
rate(tasker_step_claim_duration_milliseconds_count[5m])
# P95 by claim method over 5 minutes
histogram_quantile(0.95,
sum by (claim_method, le) (
rate(tasker_step_claim_duration_milliseconds_bucket[5m])
)
)
Expected Output: ✅ Verified - Returns millisecond values
tasker.step_result.submission.duration
Description: Step result submission duration in milliseconds Type: Histogram (f64) Unit: ms Prometheus Metric Names:
- tasker_step_result_submission_duration_milliseconds_bucket
- tasker_step_result_submission_duration_milliseconds_sum
- tasker_step_result_submission_duration_milliseconds_count
Labels:
- correlation_id: Request correlation ID
- result_type: “completion”
Instrumented In: orchestration_result_sender.rs:send_completion()
Example Queries:
Instant/Recent Data:
# Simple average submission time
tasker_step_result_submission_duration_milliseconds_sum /
tasker_step_result_submission_duration_milliseconds_count
# P95 submission latency
histogram_quantile(0.95, sum by (le) (tasker_step_result_submission_duration_milliseconds_bucket))
Rate-Based:
# Average submission time over 5 minutes
rate(tasker_step_result_submission_duration_milliseconds_sum[5m]) /
rate(tasker_step_result_submission_duration_milliseconds_count[5m])
# P95 submission latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(tasker_step_result_submission_duration_milliseconds_bucket[5m])))
Expected Output: ✅ Verified - Returns millisecond values
Gauges
tasker.steps.active_executions
Description: Number of steps currently being executed Type: Gauge (u64) Labels:
- namespace: Worker namespace
- handler_type: “rust”, “ruby”
Status: Defined but not actively instrumented (gauge tracking removed during implementation)
tasker.queue.depth
Description: Current queue depth per namespace Type: Gauge (u64) Labels:
namespace: Worker namespace
Status: Planned (not yet instrumented)
Resilience Metrics
Module: tasker-shared/src/metrics/worker.rs, tasker-orchestration/src/web/circuit_breaker.rs
Instrumentation: Circuit breakers, MPSC channels
Related Docs: Circuit Breakers | Backpressure Architecture
Circuit Breaker Metrics
Circuit breakers provide fault isolation and cascade prevention. These metrics track breaker state transitions and related operations.
api_circuit_breaker_state
Description: Current state of the web API database circuit breaker Type: Gauge (i64) Values: 0=Closed, 1=Half-Open, 2=Open Labels: None
Instrumented In: tasker-orchestration/src/web/circuit_breaker.rs
Example Queries:
# Current state
api_circuit_breaker_state
# Alert when open
api_circuit_breaker_state == 2
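The “alert when open” query above can be wired into a Prometheus alerting rule. A minimal sketch, assuming standard Prometheus rule files — the group name, alert name, and `for` duration are illustrative, not part of Tasker's shipped configuration:

```yaml
groups:
  - name: tasker-circuit-breakers   # illustrative group name
    rules:
      - alert: ApiCircuitBreakerOpen
        # 2 = Open per the state encoding above (0=Closed, 1=Half-Open, 2=Open)
        expr: api_circuit_breaker_state == 2
        for: 1m                     # assumed debounce window
        labels:
          severity: critical
        annotations:
          summary: "Web API database circuit breaker is open"
```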
tasker_circuit_breaker_state
Description: Per-component circuit breaker state Type: Gauge (i64) Values: 0=Closed, 1=Half-Open, 2=Open Labels:
component: Circuit breaker name (e.g., “ffi_completion”, “task_readiness”, “pgmq”)
Instrumented In: Various circuit breaker implementations
Example Queries:
# All circuit breaker states
tasker_circuit_breaker_state
# Check specific component
tasker_circuit_breaker_state{component="ffi_completion"}
# Count open breakers
count(tasker_circuit_breaker_state == 2)
api_requests_rejected_total
Description: Total API requests rejected due to open circuit breaker Type: Counter (u64) Labels:
endpoint: The rejected endpoint path
Instrumented In: tasker-orchestration/src/web/circuit_breaker.rs
Example Queries:
# Total rejections
api_requests_rejected_total
# Rejection rate
rate(api_requests_rejected_total[5m])
# By endpoint
sum by (endpoint) (api_requests_rejected_total)
ffi_completion_slow_sends_total
Description: FFI completion channel sends exceeding latency threshold (100ms default) Type: Counter (u64) Labels: None
Instrumented In: tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs
Example Queries:
# Total slow sends
ffi_completion_slow_sends_total
# Slow send rate (alerts at >10/sec)
rate(ffi_completion_slow_sends_total[5m]) > 10
Alert Threshold: Warning when rate exceeds 10/second for 2 minutes
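The stated threshold translates directly into an alerting rule. A sketch — the alert name and severity label are assumptions:

```yaml
- alert: FfiCompletionSlowSends
  # Warning when slow-send rate exceeds 10/second, sustained for 2 minutes
  expr: rate(ffi_completion_slow_sends_total[5m]) > 10
  for: 2m
  labels:
    severity: warning
```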
ffi_completion_circuit_rejections_total
Description: FFI completion operations rejected due to open circuit breaker Type: Counter (u64) Labels: None
Instrumented In: tasker-worker/src/worker/handlers/ffi_completion_circuit_breaker.rs
Example Queries:
# Total rejections
ffi_completion_circuit_rejections_total
# Rejection rate
rate(ffi_completion_circuit_rejections_total[5m])
MPSC Channel Metrics
Bounded MPSC channels provide backpressure control. These metrics track channel utilization and overflow events.
mpsc_channel_usage_percent
Description: Current fill percentage of a bounded MPSC channel Type: Gauge (f64) Labels:
- channel: Channel name (e.g., “orchestration_command”, “pgmq_notifications”)
- component: Owning component
Instrumented In: Channel monitor integration points
Example Queries:
# All channel usage
mpsc_channel_usage_percent
# High usage channels
mpsc_channel_usage_percent > 80
# By component
max by (component) (mpsc_channel_usage_percent)
Alert Thresholds:
- Warning: > 80% for 15 minutes
- Critical: > 90% for 5 minutes
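These thresholds map to a pair of alerting rules. A sketch under the same assumptions (rule names are illustrative):

```yaml
- alert: MpscChannelUsageHigh
  expr: mpsc_channel_usage_percent > 80
  for: 15m
  labels:
    severity: warning
- alert: MpscChannelUsageCritical
  expr: mpsc_channel_usage_percent > 90
  for: 5m
  labels:
    severity: critical
```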
mpsc_channel_capacity
Description: Configured buffer capacity for an MPSC channel Type: Gauge (u64) Labels:
- channel: Channel name
- component: Owning component
Instrumented In: Channel monitor initialization
Example Queries:
# All channel capacities
mpsc_channel_capacity
# Compare usage to capacity
mpsc_channel_usage_percent / 100 * mpsc_channel_capacity
mpsc_channel_full_events_total
Description: Count of channel overflow events (backpressure applied) Type: Counter (u64) Labels:
- channel: Channel name
- component: Owning component
Instrumented In: Channel send operations with backpressure handling
Example Queries:
# Total overflow events
mpsc_channel_full_events_total
# Overflow rate
rate(mpsc_channel_full_events_total[5m])
# By channel
sum by (channel) (mpsc_channel_full_events_total)
Alert Threshold: Any overflow events indicate backpressure is occurring
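Since any overflow signals backpressure, the corresponding rule can fire on any recent increase. A sketch (alert name is illustrative):

```yaml
- alert: MpscChannelBackpressure
  # Counters only increase; any growth in the window means a channel filled
  expr: increase(mpsc_channel_full_events_total[5m]) > 0
  labels:
    severity: warning
```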
Resilience Dashboard Panels
Circuit Breaker State Timeline:
# Panel: Time series with state mapping
api_circuit_breaker_state
# Value mappings: 0=Closed (green), 1=Half-Open (yellow), 2=Open (red)
FFI Completion Health:
# Panel: Multi-stat showing slow sends and rejections
rate(ffi_completion_slow_sends_total[5m])
rate(ffi_completion_circuit_rejections_total[5m])
Channel Saturation Overview:
# Panel: Gauge showing max channel usage
max(mpsc_channel_usage_percent)
# Thresholds: Green < 70%, Yellow < 90%, Red >= 90%
Backpressure Events:
# Panel: Time series of overflow rate
rate(mpsc_channel_full_events_total[5m])
Database Metrics
Module: tasker-shared/src/metrics/database.rs
Status: ⚠️ Defined but not yet instrumented
Planned Metrics
- tasker.sql.queries.total - Counter
- tasker.sql.query.duration - Histogram
- tasker.db.pool.connections_active - Gauge
- tasker.db.pool.connections_idle - Gauge
- tasker.db.pool.wait_duration - Histogram
- tasker.db.transactions.total - Counter
- tasker.db.transaction.duration - Histogram
Messaging Metrics
Module: tasker-shared/src/metrics/messaging.rs
Status: ⚠️ Defined but not yet instrumented
Planned Metrics
- tasker.queue.messages_sent.total - Counter
- tasker.queue.messages_received.total - Counter
- tasker.queue.messages_deleted.total - Counter
- tasker.queue.message_send.duration - Histogram
- tasker.queue.message_receive.duration - Histogram
- tasker.queue.depth - Gauge
- tasker.queue.age_seconds - Gauge
- tasker.queue.visibility_timeouts.total - Counter
- tasker.queue.errors.total - Counter
- tasker.queue.retry_attempts.total - Counter
Note: Circuit breaker metrics (including queue-related circuit breakers) are documented in the Resilience Metrics section.
Example Queries
Task Execution Flow
Complete task execution for a specific correlation_id:
# 1. Task creation
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 2. Steps enqueued
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 3. Steps executed
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 4. Steps succeeded
tasker_steps_successes_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 5. Results submitted
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 6. Results processed
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
# 7. Task completed
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected Flow: 1 → N → N → N → N → N → 1 (where N = number of steps)
Performance Analysis
Task initialization latency percentiles:
Instant/Recent Data:
# P50 (median)
histogram_quantile(0.50, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
# P95
histogram_quantile(0.95, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
# P99
histogram_quantile(0.99, sum by (le) (tasker_task_initialization_duration_milliseconds_bucket))
Rate-Based (continuous monitoring):
# P50 (median)
histogram_quantile(0.50, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))
# P95
histogram_quantile(0.95, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))
# P99
histogram_quantile(0.99, sum by (le) (rate(tasker_task_initialization_duration_milliseconds_bucket[5m])))
Step execution latency by namespace:
Instant/Recent Data:
histogram_quantile(0.95,
sum by (namespace, le) (
tasker_step_execution_duration_milliseconds_bucket
)
)
Rate-Based:
histogram_quantile(0.95,
sum by (namespace, le) (
rate(tasker_step_execution_duration_milliseconds_bucket[5m])
)
)
End-to-end task duration (from request to completion):
This requires combining initialization + step execution + finalization durations. Use the simple average approach for instant data:
# Average task initialization
tasker_task_initialization_duration_milliseconds_sum /
tasker_task_initialization_duration_milliseconds_count
# Average step execution
tasker_step_execution_duration_milliseconds_sum /
tasker_step_execution_duration_milliseconds_count
# Average finalization
tasker_task_finalization_duration_milliseconds_sum /
tasker_task_finalization_duration_milliseconds_count
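If these three stage averages are needed on dashboards, Prometheus recording rules can precompute them as named series. A hedged sketch — the `tasker:` rule-name prefix and group name are naming-convention assumptions, not from the Tasker docs:

```yaml
groups:
  - name: tasker-stage-durations   # illustrative group name
    rules:
      - record: tasker:task_initialization_avg_ms
        expr: >
          tasker_task_initialization_duration_milliseconds_sum
          / tasker_task_initialization_duration_milliseconds_count
      - record: tasker:step_execution_avg_ms
        expr: >
          tasker_step_execution_duration_milliseconds_sum
          / tasker_step_execution_duration_milliseconds_count
      - record: tasker:task_finalization_avg_ms
        expr: >
          tasker_task_finalization_duration_milliseconds_sum
          / tasker_task_finalization_duration_milliseconds_count
```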
Error Rate Monitoring
Overall step failure rate:
rate(tasker_steps_failures_total[5m]) /
rate(tasker_steps_executions_total[5m])
Error distribution by type:
topk(5, sum by (error_type) (tasker_steps_failures_total))
Task failure rate:
rate(tasker_tasks_failures_total[5m]) /
(rate(tasker_tasks_completions_total[5m]) + rate(tasker_tasks_failures_total[5m]))
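To alert on this ratio, wrap it in a rule with a tolerance threshold. A sketch — the 5% threshold and 10-minute window are assumptions to tune per workload:

```yaml
- alert: TaskFailureRateHigh
  expr: >
    rate(tasker_tasks_failures_total[5m])
      / (rate(tasker_tasks_completions_total[5m])
         + rate(tasker_tasks_failures_total[5m]))
    > 0.05
  for: 10m
  labels:
    severity: warning
```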
Throughput Monitoring
Task request rate:
rate(tasker_tasks_requests_total[1m])
rate(tasker_tasks_requests_total[5m])
rate(tasker_tasks_requests_total[15m])
Step execution throughput:
sum(rate(tasker_steps_executions_total[5m]))
Step completion rate (successes + failures):
sum(rate(tasker_steps_successes_total[5m])) +
sum(rate(tasker_steps_failures_total[5m]))
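These throughput expressions are natural candidates for recording rules, so dashboards don't recompute them on every refresh. A sketch — the rule names follow the common level:metric:operation convention and are not from the Tasker docs:

```yaml
- record: tasker:steps_executed:rate5m
  expr: sum(rate(tasker_steps_executions_total[5m]))
- record: tasker:steps_completed:rate5m
  expr: >
    sum(rate(tasker_steps_successes_total[5m]))
    + sum(rate(tasker_steps_failures_total[5m]))
```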
Dashboard Recommendations
Task Execution Overview Dashboard
Panels:
- Task Request Rate
  - Query: rate(tasker_tasks_requests_total[5m])
  - Visualization: Time series graph
- Task Completion Rate
  - Query: rate(tasker_tasks_completions_total[5m])
  - Visualization: Time series graph
- Task Success/Failure Ratio
  - Queries: two series
    - Completions: rate(tasker_tasks_completions_total[5m])
    - Failures: rate(tasker_tasks_failures_total[5m])
  - Visualization: Stacked area chart
- Task Initialization Latency (P95)
  - Query: histogram_quantile(0.95, rate(tasker_task_initialization_duration_bucket[5m]))
  - Visualization: Time series graph
- Steps Enqueued vs Executed
  - Queries: two series
    - Enqueued: rate(tasker_steps_enqueued_total[5m])
    - Executed: rate(tasker_steps_executions_total[5m])
  - Visualization: Time series graph
Worker Performance Dashboard
Panels:
- Step Execution Throughput by Namespace
  - Query: sum by (namespace) (rate(tasker_steps_executions_total[5m]))
  - Visualization: Time series graph (multi-series)
- Step Success Rate
  - Query: rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])
  - Visualization: Gauge (0-1 scale)
- Step Execution Latency Percentiles
  - Queries: three series
    - P50: histogram_quantile(0.50, rate(tasker_step_execution_duration_bucket[5m]))
    - P95: histogram_quantile(0.95, rate(tasker_step_execution_duration_bucket[5m]))
    - P99: histogram_quantile(0.99, rate(tasker_step_execution_duration_bucket[5m]))
  - Visualization: Time series graph
- Step Claiming Performance (Event vs Poll)
  - Query: histogram_quantile(0.95, sum by (claim_method, le) (rate(tasker_step_claim_duration_bucket[5m])))
  - Visualization: Time series graph
- Error Distribution by Type
  - Query: sum by (error_type) (rate(tasker_steps_failures_total[5m]))
  - Visualization: Pie chart or bar chart
System Health Dashboard
Panels:
- Overall Task Success Rate
  - Query: rate(tasker_tasks_completions_total[5m]) / (rate(tasker_tasks_completions_total[5m]) + rate(tasker_tasks_failures_total[5m]))
  - Visualization: Stat panel with thresholds (green > 0.95, yellow > 0.90, red < 0.90)
- Step Failure Rate
  - Query: rate(tasker_steps_failures_total[5m]) / rate(tasker_steps_executions_total[5m])
  - Visualization: Stat panel with thresholds
- Average Task End-to-End Duration
  - Query: combination of initialization, execution, and finalization durations
  - Visualization: Time series graph
- Result Processing Latency
  - Query: rate(tasker_step_result_processing_duration_sum[5m]) / rate(tasker_step_result_processing_duration_count[5m])
  - Visualization: Time series graph
- Active Operations
  - Query: currently not instrumented (gauges removed)
  - Status: Planned future enhancement
Verification Checklist
Use this checklist to verify metrics are working correctly:
Prerequisites
- telemetry.opentelemetry.enabled = true in development config
- Services restarted after config change
- Logs show opentelemetry_enabled=true
- Grafana LGTM container running on ports 3000, 4317
Basic Verification
- At least one task created via CLI
- Correlation ID captured from task creation
- Trace visible in Grafana Tempo for correlation ID
Orchestration Metrics
- tasker_tasks_requests_total returns non-zero
- tasker_steps_enqueued_total returns expected step count
- tasker_step_results_processed_total returns expected result count
- tasker_tasks_completions_total increments on success
- tasker_task_initialization_duration_bucket has histogram data
Worker Metrics
- tasker_steps_executions_total returns non-zero
- tasker_steps_successes_total matches successful steps
- tasker_steps_claimed_total returns expected claims
- tasker_steps_results_submitted_total matches result submissions
- tasker_step_execution_duration_bucket has histogram data
Resilience Metrics
- api_circuit_breaker_state returns 0 (Closed) during normal operation
- /health/detailed endpoint shows circuit breaker states
- mpsc_channel_usage_percent returns values < 80% (no saturation)
- mpsc_channel_full_events_total is 0 or very low (no backpressure)
- FFI workers: ffi_completion_slow_sends_total is near zero
Correlation
- All metrics filterable by correlation_id
- Correlation ID in metrics cross-references the corresponding trace in Tempo
- Complete execution flow visible from request to completion
Troubleshooting
No Metrics Appearing
Check 1: OpenTelemetry enabled
grep "opentelemetry_enabled" tmp/*.log
# Should show: opentelemetry_enabled=true
Check 2: OTLP endpoint accessible
curl -v http://localhost:4317 2>&1 | grep Connected
# Should show: Connected to localhost (127.0.0.1) port 4317
Check 3: Grafana LGTM running
curl -s http://localhost:3000/api/health | jq
# Should return healthy status
Check 4: Wait for the export interval (60 seconds)
Metrics are batched and exported every 60 seconds. Wait at least 1 minute after task execution before querying.
Metrics Missing Labels
If correlation_id or other labels are missing, check:
- Logs for correlation_id field presence
- Metric instrumentation includes KeyValue::new() calls
- Labels match between metric definition and usage
Histogram Buckets Empty
If histogram queries return no data:
- Verify histogram is initialized: check logs for metric initialization
- Ensure duration values are non-zero and reasonable
- Check that record() is called, not add(), for histograms
Next Steps
Phase 3.4 (Future)
- Instrument database metrics (7 metrics)
- Instrument messaging metrics (11 metrics)
- Add gauge tracking for active operations
- Implement queue depth monitoring
Production Readiness
- Create alert rules for error rates
- Set up automated dashboards
- Configure metric retention policies
- Add metric aggregation for long-term storage
Last Updated: 2025-12-10
Test Task: mathematical_sequence (correlation_id: 0199c3e0-ccdb-7581-87ab-3f67daeaa4a5)
Status: All orchestration and worker metrics verified and producing data ✅
Recent Updates:
- 2025-12-10: Added Resilience Metrics section (circuit breakers, MPSC channels)
- 2025-10-08: Initial metrics verification completed
Metrics Verification Guide
Purpose: Verify that documented metrics queries work with actual system data
Test Task: mathematical_sequence
Correlation ID: 0199c3e0-ccdb-7581-87ab-3f67daeaa4a5
Task ID: 0199c3e0-ccea-70f0-b6ae-3086b2f68280
Trace ID: d640f82572e231322edba0a5ef6e1405
How to Use This Guide
- Open Grafana at http://localhost:3000
- Navigate to Explore (compass icon in sidebar)
- Select Prometheus as the data source
- Copy each query below into the query editor
- Record the actual output
- Mark ✅ if query works, ❌ if it fails, or ⚠️ if partial data
Orchestration Metrics Verification
1. Task Requests Counter
Metric: tasker.tasks.requests.total
Query 1: Basic counter
tasker_tasks_requests_total
Expected: At least 1 (for our test task) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Filtered by correlation_id
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Exactly 1 Actual Result: _____________ Labels Present: [ ] correlation_id [ ] task_type [ ] namespace Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: Sum by namespace
sum by (namespace) (tasker_tasks_requests_total)
Expected: 1 for namespace “rust_e2e_linear” Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
2. Task Completions Counter
Metric: tasker.tasks.completions.total
Query 1: Basic counter
tasker_tasks_completions_total
Expected: At least 1 (if task completed successfully) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Filtered by correlation_id
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: Completion rate over 5 minutes
rate(tasker_tasks_completions_total[5m])
Expected: Some positive rate value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
3. Steps Enqueued Counter
Metric: tasker.steps.enqueued.total
Query 1: Total steps enqueued for our task
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Number of steps in mathematical_sequence workflow (likely 3-4 steps) Actual Result: _____________ Step Names Visible: [ ] Yes [ ] No Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Sum by step name
sum by (step_name) (tasker_steps_enqueued_total)
Expected: Breakdown by step name Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
4. Step Results Processed Counter
Metric: tasker.step_results.processed.total
Query 1: Results processed for our task
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Same as steps enqueued Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Sum by result type
sum by (result_type) (tasker_step_results_processed_total)
Expected: Breakdown showing “success” results Actual Result: _____________ Result Types Visible: [ ] success [ ] error [ ] timeout [ ] cancelled [ ] skipped Status: [ ] ✅ [ ] ❌ [ ] ⚠️
5. Task Initialization Duration Histogram
Metric: tasker.task.initialization.duration
Query 1: Check if histogram has data
tasker_task_initialization_duration_count
Expected: At least 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average initialization time
rate(tasker_task_initialization_duration_sum[5m]) /
rate(tasker_task_initialization_duration_count[5m])
Expected: Some millisecond value (probably < 100ms) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 latency
histogram_quantile(0.95, rate(tasker_task_initialization_duration_bucket[5m]))
Expected: P95 millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 4: P99 latency
histogram_quantile(0.99, rate(tasker_task_initialization_duration_bucket[5m]))
Expected: P99 millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
6. Task Finalization Duration Histogram
Metric: tasker.task.finalization.duration
Query 1: Check count
tasker_task_finalization_duration_count
Expected: At least 1 Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average finalization time
rate(tasker_task_finalization_duration_sum[5m]) /
rate(tasker_task_finalization_duration_count[5m])
Expected: Some millisecond value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 by final_state
histogram_quantile(0.95,
sum by (final_state, le) (
rate(tasker_task_finalization_duration_bucket[5m])
)
)
Expected: P95 value grouped by final_state (likely “complete”) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
7. Step Result Processing Duration Histogram
Metric: tasker.step_result.processing.duration
Query 1: Check count
tasker_step_result_processing_duration_count
Expected: Number of steps processed Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average processing time
rate(tasker_step_result_processing_duration_sum[5m]) /
rate(tasker_step_result_processing_duration_count[5m])
Expected: Millisecond value for result processing Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Worker Metrics Verification
8. Step Executions Counter
Metric: tasker.steps.executions.total
Query 1: Total executions
tasker_steps_executions_total
Expected: Number of steps in workflow Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: For specific task
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Number of steps executed Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: Execution rate
rate(tasker_steps_executions_total[5m])
Expected: Positive rate Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
9. Step Successes Counter
Metric: tasker.steps.successes.total
Query 1: Total successes
tasker_steps_successes_total
Expected: Should equal executions if all succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: By namespace
sum by (namespace) (tasker_steps_successes_total)
Expected: Successes grouped by namespace Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: Success rate
rate(tasker_steps_successes_total[5m]) / rate(tasker_steps_executions_total[5m])
Expected: ~1.0 (100%) if all steps succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
10. Step Failures Counter
Metric: tasker.steps.failures.total
Query 1: Total failures
tasker_steps_failures_total
Expected: 0 if all steps succeeded Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: By error type
sum by (error_type) (tasker_steps_failures_total)
Expected: No results if no failures, or breakdown by error type Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
11. Steps Claimed Counter
Metric: tasker.steps.claimed.total
Query 1: Total claims
tasker_steps_claimed_total
Expected: Number of steps claimed (should match executions) Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: By claim method
sum by (claim_method) (tasker_steps_claimed_total)
Expected: Breakdown by “event” or “poll” Actual Result: _____________ Claim Methods Visible: [ ] event [ ] poll Status: [ ] ✅ [ ] ❌ [ ] ⚠️
12. Step Results Submitted Counter
Metric: tasker.steps.results_submitted.total
Query 1: Total submissions
tasker_steps_results_submitted_total
Expected: Number of steps that submitted results Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: For specific task
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Expected: Number of step results submitted Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
13. Step Execution Duration Histogram
Metric: tasker.step.execution.duration
Query 1: Check count
tasker_step_execution_duration_count
Expected: Number of step executions Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average execution time
rate(tasker_step_execution_duration_sum[5m]) /
rate(tasker_step_execution_duration_count[5m])
Expected: Average milliseconds per step Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 latency by namespace
histogram_quantile(0.95,
sum by (namespace, le) (
rate(tasker_step_execution_duration_bucket[5m])
)
)
Expected: P95 latency for rust_e2e_linear namespace Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 4: P99 latency
histogram_quantile(0.99, rate(tasker_step_execution_duration_bucket[5m]))
Expected: P99 latency value Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
14. Step Claim Duration Histogram
Metric: tasker.step.claim.duration
Query 1: Check count
tasker_step_claim_duration_count
Expected: Number of claims Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average claim time
rate(tasker_step_claim_duration_sum[5m]) /
rate(tasker_step_claim_duration_count[5m])
Expected: Average milliseconds to claim Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 by claim method
histogram_quantile(0.95,
sum by (claim_method, le) (
rate(tasker_step_claim_duration_bucket[5m])
)
)
Expected: P95 claim latency by method Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
15. Step Result Submission Duration Histogram
Metric: tasker.step_result.submission.duration
Query 1: Check count
tasker_step_result_submission_duration_count
Expected: Number of result submissions Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 2: Average submission time
rate(tasker_step_result_submission_duration_sum[5m]) /
rate(tasker_step_result_submission_duration_count[5m])
Expected: Average milliseconds to submit Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Query 3: P95 submission latency
histogram_quantile(0.95, rate(tasker_step_result_submission_duration_bucket[5m]))
Expected: P95 submission latency Actual Result: _____________ Status: [ ] ✅ [ ] ❌ [ ] ⚠️
Complete Execution Flow Verification
Purpose: Verify the full task lifecycle is visible in metrics
Query: Complete flow for correlation_id
# Run each query and record the value
# 1. Task created
tasker_tasks_requests_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 2. Steps enqueued
tasker_steps_enqueued_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 3. Steps claimed
tasker_steps_claimed_total
Result: _____________
# 4. Steps executed
tasker_steps_executions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 5. Steps succeeded
tasker_steps_successes_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 6. Results submitted
tasker_steps_results_submitted_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 7. Results processed
tasker_step_results_processed_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
# 8. Task completed
tasker_tasks_completions_total{correlation_id="0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"}
Result: _____________
Expected Pattern: 1 → N → N → N → N → N → N → 1 Actual Pattern: _____________ → _____ → _____ → _____ → _____ → _____ → _____ → _____
Analysis:
- Do the numbers make sense for your workflow? [ ] Yes [ ] No
- Are any steps missing? _____________
- Do counts match expectations? [ ] Yes [ ] No
Issues Found
Document any issues discovered during verification:
Issue 1
Query: _____________ Expected: _____________ Actual: _____________ Problem: _____________ Fix Required: [ ] Yes [ ] No
Issue 2
Query: _____________ Expected: _____________ Actual: _____________ Problem: _____________ Fix Required: [ ] Yes [ ] No
Summary
Total Queries Tested: _____________ Successful: _____________ ✅ Failed: _____________ ❌ Partial: _____________ ⚠️
Overall Status: [ ] All Working [ ] Some Issues [ ] Major Problems
Ready for Production: [ ] Yes [ ] No [ ] Needs Work
Verification Date: _____________ Verified By: _____________ Grafana Version: _____________ OpenTelemetry Version: 0.26
OpenTelemetry Improvements
Last Updated: 2025-12-01 Audience: Developers, Operators Status: Active Related Docs: Observability Hub | Metrics Reference | Domain Events
← Back to Observability Hub
This document describes the OpenTelemetry improvements for the domain event system, including two-phase FFI telemetry initialization, domain event metrics, and enhanced observability for the distributed event system.
Overview
These telemetry improvements support the domain event system while addressing FFI-specific challenges:
| Improvement | Purpose | Impact |
|---|---|---|
| Two-Phase FFI Telemetry | Safe telemetry in FFI workers | No segfaults during Ruby/Python bridging |
| Domain Event Metrics | Event system observability | Real-time monitoring of event publication |
| Correlation ID Propagation | End-to-end tracing | Events traceable across distributed system |
| Worker Metrics Endpoint | Domain event statistics | /metrics/events for monitoring dashboards |
Two-Phase FFI Telemetry Initialization
The Problem
When Rust workers operate with Ruby FFI bindings, OpenTelemetry’s global tracer/meter providers can cause issues:
- Thread Safety: Ruby’s GVL (Global VM Lock) conflicts with OpenTelemetry’s internal threading
- Signal Handling: OpenTelemetry’s OTLP exporter may interfere with Ruby signal handling
- Segfaults: Premature initialization can cause crashes during FFI boundary crossings
The Solution: Two-Phase Initialization
flowchart LR
subgraph Phase1["Phase 1 (FFI-Safe)"]
A[Console logger]
B[Tracing init]
C[No OTLP export]
D[No global state]
end
subgraph Phase2["Phase 2 (Full OTel)"]
E[OTLP exporter]
F[Metrics export]
G[Full tracing]
H[Global tracer]
end
Phase1 -->|"After FFI bridge<br/>established"| Phase2
Worker Bootstrap Sequence:
1. Load the Rust worker library
2. Initialize Phase 1 (console-only logging)
3. Execute FFI bridge setup (Ruby/Python)
4. Initialize Phase 2 (full OpenTelemetry)
Implementation
Phase 1: Console-Only Initialization (FFI-Safe):
#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs — `init_console_only()`
/// Initialize console-only logging (FFI-safe, no Tokio runtime required)
///
/// This function sets up structured console logging without OpenTelemetry,
/// making it safe to call from FFI initialization contexts where no Tokio
/// runtime exists yet.
pub fn init_console_only() {
TRACING_INITIALIZED.get_or_init(|| {
let environment = get_environment();
let log_level = get_log_level(&environment);
// Determine if we're in a TTY for ANSI color support
let use_ansi = IsTerminal::is_terminal(&std::io::stdout());
// Create base console layer
let console_layer = fmt::layer()
.with_target(true)
.with_thread_ids(true)
.with_level(true)
.with_ansi(use_ansi)
.with_filter(EnvFilter::new(&log_level));
// Build subscriber with console layer only (no telemetry)
let subscriber = tracing_subscriber::registry().with(console_layer);
if subscriber.try_init().is_err() {
tracing::debug!(
"Global tracing subscriber already initialized"
);
} else {
tracing::info!(
environment = %environment,
opentelemetry_enabled = false,
context = "ffi_initialization",
"Console-only logging initialized (FFI-safe mode)"
);
}
// Initialize basic metrics (no OpenTelemetry exporters)
metrics::init_metrics();
metrics::orchestration::init();
metrics::worker::init();
metrics::database::init();
metrics::messaging::init();
});
}
}
Phase 2: Full OpenTelemetry Initialization:
#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs — `init_tracing()`
/// Initialize tracing with console output and optional OpenTelemetry
///
/// When OpenTelemetry is enabled (via TELEMETRY_ENABLED=true), it also
/// configures distributed tracing with OTLP exporter.
///
/// **IMPORTANT**: When telemetry is enabled, this function MUST be called from
/// a Tokio runtime context because the batch exporter requires async I/O.
pub fn init_tracing() {
TRACING_INITIALIZED.get_or_init(|| {
let environment = get_environment();
let log_level = get_log_level(&environment);
let telemetry_config = TelemetryConfig::default();
// Determine if we're in a TTY for ANSI color support
let use_ansi = IsTerminal::is_terminal(&std::io::stdout());
// Create base console layer
let console_layer = fmt::layer()
.with_target(true)
.with_thread_ids(true)
.with_level(true)
.with_ansi(use_ansi)
.with_filter(EnvFilter::new(&log_level));
// Build subscriber with optional OpenTelemetry layer
let subscriber = tracing_subscriber::registry().with(console_layer);
if telemetry_config.enabled {
// Initialize OpenTelemetry tracer and logger
match (init_opentelemetry_tracer(&telemetry_config),
init_opentelemetry_logger(&telemetry_config)) {
(Ok(tracer_provider), Ok(logger_provider)) => {
// Add trace layer
let tracer = tracer_provider.tracer("tasker-core");
let telemetry_layer = OpenTelemetryLayer::new(tracer);
// Add log layer (bridge tracing logs -> OTEL logs)
let log_layer = OpenTelemetryTracingBridge::new(&logger_provider);
let subscriber = subscriber.with(telemetry_layer).with(log_layer);
if subscriber.try_init().is_ok() {
tracing::info!(
environment = %environment,
opentelemetry_enabled = true,
logs_enabled = true,
otlp_endpoint = %telemetry_config.otlp_endpoint,
service_name = %telemetry_config.service_name,
"Console logging with OpenTelemetry initialized"
);
}
}
// ... error handling with fallback to console-only
}
}
});
}
}
Worker Bootstrap Integration:
#![allow(unused)]
fn main() {
// workers/rust/src/bootstrap.rs — `bootstrap()`
pub async fn bootstrap() -> Result<(WorkerSystemHandle, RustEventHandler)> {
info!("📋 Creating native Rust step handler registry...");
let registry = Arc::new(RustStepHandlerRegistry::new());
// Get global event system for connecting to worker events
info!("🔗 Setting up event system connection...");
let event_system = get_global_event_system();
// Bootstrap the worker using tasker-worker foundation
info!("🏗️ Bootstrapping worker with tasker-worker foundation...");
let worker_handle =
WorkerBootstrap::bootstrap_with_event_system(Some(event_system.clone())).await?;
// Create step event publisher registry with domain event publisher
info!("🔔 Setting up step event publisher registry...");
let domain_event_publisher = {
let worker_core = worker_handle.worker_core.lock().await;
worker_core.domain_event_publisher()
};
// Dual-Path: Create in-process event bus for fast event delivery
info!("⚡ Creating in-process event bus for fast domain events...");
let in_process_bus = Arc::new(RwLock::new(InProcessEventBus::new(
InProcessEventBusConfig::default(),
)));
// Dual-Path: Create event router for dual-path delivery
info!("🔀 Creating event router for dual-path delivery...");
let event_router = Arc::new(RwLock::new(EventRouter::new(
domain_event_publisher.clone(),
in_process_bus.clone(),
)));
// Create registry with EventRouter for dual-path delivery
let mut step_event_registry =
StepEventPublisherRegistry::with_event_router(
domain_event_publisher.clone(),
event_router
);
// ... event handler construction and registry installation elided ...
Ok((worker_handle, event_handler))
}
}
Configuration
Telemetry is configured exclusively via environment variables. This is intentional because logging must be initialized before the TOML config loader runs (to log any config loading errors).
# Enable OpenTelemetry
export TELEMETRY_ENABLED=true
# OTLP endpoint (default: http://localhost:4317)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
# Service identification
export OTEL_SERVICE_NAME=tasker-orchestration
export OTEL_SERVICE_VERSION=0.1.0
# Deployment environment (falls back to TASKER_ENV, then "development")
export DEPLOYMENT_ENVIRONMENT=production
# Sampling rate (0.0 to 1.0, default: 1.0 = 100%)
export OTEL_TRACES_SAMPLER_ARG=1.0
The TelemetryConfig::default() implementation in tasker-shared/src/logging.rs
reads all values from environment variables at initialization time.
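For illustration, the env-var-driven defaults described above can be sketched in Python (variable names and fallbacks are taken from this document; the real implementation is `TelemetryConfig::default()` in Rust, and the dict field names here are illustrative):

```python
import os

def telemetry_config() -> dict:
    """Sketch of the env-var-driven telemetry defaults described above."""
    return {
        # TELEMETRY_ENABLED must be exactly "true" (case-insensitive) to enable export
        "enabled": os.environ.get("TELEMETRY_ENABLED", "").lower() == "true",
        "otlp_endpoint": os.environ.get(
            "OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"
        ),
        "service_name": os.environ.get("OTEL_SERVICE_NAME", "tasker-orchestration"),
        # DEPLOYMENT_ENVIRONMENT falls back to TASKER_ENV, then "development"
        "environment": os.environ.get("DEPLOYMENT_ENVIRONMENT")
        or os.environ.get("TASKER_ENV", "development"),
        # Sampling rate: 0.0 to 1.0, default 1.0 (100%)
        "sample_rate": float(os.environ.get("OTEL_TRACES_SAMPLER_ARG", "1.0")),
    }
```

Because the values are read at initialization time, changing them requires a process restart.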
Domain Event Metrics
New Metrics
Domain event observability metrics:
| Metric | Type | Description |
|---|---|---|
tasker.domain_events.published.total | Counter | Total events published |
router.durable_routed | Counter | Events sent via durable path (PGMQ) |
router.fast_routed | Counter | Events sent via fast path (in-process) |
router.broadcast_routed | Counter | Events broadcast to both paths |
Implementation
Domain event metrics are emitted inline during publication:
#![allow(unused)]
fn main() {
// tasker-shared/src/events/domain_events.rs — metric emission in `publish_event()`
// Emit OpenTelemetry metric
let counter = opentelemetry::global::meter("tasker")
.u64_counter("tasker.domain_events.published.total")
.with_description("Total number of domain events published")
.build();
counter.add(
1,
&[
opentelemetry::KeyValue::new("event_name", event_name.to_string()),
opentelemetry::KeyValue::new("namespace", metadata.namespace.clone()),
],
);
}
Event routing statistics are tracked in the EventRouterStats and InProcessEventBusStats structures:
#![allow(unused)]
fn main() {
// tasker-shared/src/metrics/worker.rs — `EventRouterStats`
/// Statistics for the event router
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
pub struct EventRouterStats {
/// Total events routed through the router
pub total_routed: u64,
/// Events sent via durable path (PGMQ)
pub durable_routed: u64,
/// Events sent via fast path (in-process)
pub fast_routed: u64,
/// Events broadcast to both paths
pub broadcast_routed: u64,
/// Fast delivery errors in broadcast mode (non-fatal, logged for monitoring)
pub fast_delivery_errors: u64,
/// Failed routing attempts (durable failures only)
pub routing_errors: u64,
}
// tasker-shared/src/metrics/worker.rs — `InProcessEventBusStats`
/// Statistics for the in-process event bus
#[derive(Debug, Clone, Default, serde::Serialize, serde::Deserialize)]
pub struct InProcessEventBusStats {
/// Total events dispatched through the bus
pub total_events_dispatched: u64,
/// Total events dispatched to Rust handlers
pub rust_handler_dispatches: u64,
/// Total events dispatched to FFI channel
pub ffi_channel_dispatches: u64,
}
}
Prometheus Queries
Event publication rate by namespace:
sum by (namespace) (rate(tasker_domain_events_published_total[5m]))
Event failure rate:
rate(tasker_domain_events_failed_total[5m]) /
rate(tasker_domain_events_published_total[5m])
Publication latency (P95):
histogram_quantile(0.95,
sum by (le) (rate(tasker_domain_events_publish_duration_milliseconds_bucket[5m]))
)
Latency by delivery mode:
histogram_quantile(0.95,
sum by (delivery_mode, le) (
rate(tasker_domain_events_publish_duration_milliseconds_bucket[5m])
)
)
Worker Metrics Endpoint
/metrics/events Endpoint
The worker exposes domain event statistics through a dedicated metrics endpoint:
Request:
curl http://localhost:8081/metrics/events
Response:
{
"router": {
"total_routed": 42,
"durable_routed": 10,
"fast_routed": 30,
"broadcast_routed": 2,
"fast_delivery_errors": 0,
"routing_errors": 0
},
"in_process_bus": {
"total_events_dispatched": 32,
"rust_handler_dispatches": 20,
"ffi_channel_dispatches": 12
},
"captured_at": "2025-12-01T10:30:00Z",
"worker_id": "worker-01234567"
}
Implementation
#![allow(unused)]
fn main() {
// tasker-worker/src/web/handlers/metrics.rs — `domain_event_stats()`
/// Domain event statistics endpoint: GET /metrics/events
///
/// Returns statistics about domain event routing and delivery paths.
/// Used for monitoring event publishing and by E2E tests to verify
/// events were published through the expected delivery paths.
///
/// # Response
///
/// Returns statistics for:
/// - **Router stats**: durable_routed, fast_routed, broadcast_routed counts
/// - **In-process bus stats**: handler dispatches, FFI channel dispatches
pub async fn domain_event_stats(
State(state): State<Arc<WorkerWebState>>,
) -> Json<DomainEventStats> {
debug!("Serving domain event statistics");
// Use cached event components - does not lock worker core
let stats = state.domain_event_stats().await;
Json(stats)
}
}
The DomainEventStats structure is defined in tasker-shared/src/types/web.rs:
#![allow(unused)]
fn main() {
// tasker-shared/src/types/web.rs — `DomainEventStats`
#[derive(Debug, Clone, Serialize, Deserialize, Default)]
pub struct DomainEventStats {
/// Event router statistics
pub router: EventRouterStats,
/// In-process event bus statistics
pub in_process_bus: InProcessEventBusStats,
/// Timestamp when stats were captured
pub captured_at: DateTime<Utc>,
/// Worker ID for correlation
pub worker_id: String,
}
}
Correlation ID Propagation
End-to-End Tracing
Domain events maintain correlation IDs for distributed tracing:
flowchart LR
subgraph TaskCreation["Task Creation"]
A[correlation_id<br/>UUIDv7]
end
subgraph StepExecution["Step Execution"]
B[correlation_id<br/>propagated]
end
subgraph DomainEvent["Domain Event"]
C[correlation_id<br/>in metadata]
end
TaskCreation --> StepExecution --> DomainEvent
subgraph TraceContext["Trace Context"]
D[task_uuid]
E[step_uuid]
F[step_name]
G[namespace]
H[correlation_id]
end
Tracing Integration
The DomainEventPublisher::publish_event method uses #[instrument] for automatic span creation:
#![allow(unused)]
fn main() {
// tasker-shared/src/events/domain_events.rs — `publish_event()`
#[instrument(skip(self, payload, metadata), fields(
event_name = %event_name,
namespace = %metadata.namespace,
correlation_id = %metadata.correlation_id
))]
pub async fn publish_event(
&self,
event_name: &str,
payload: DomainEventPayload,
metadata: EventMetadata,
) -> Result<Uuid, DomainEventError> {
let event_id = Uuid::now_v7();
let queue_name = format!("{}_domain_events", metadata.namespace);
debug!(
event_id = %event_id,
event_name = %event_name,
queue_name = %queue_name,
task_uuid = %metadata.task_uuid,
correlation_id = %metadata.correlation_id,
"Publishing domain event"
);
// Create and serialize domain event
let event = DomainEvent {
event_id,
event_name: event_name.to_string(),
event_version: "1.0".to_string(),
payload,
metadata: metadata.clone(),
};
// Serialize the event for the queue payload
// (assumes DomainEventError: From<serde_json::Error>)
let event_json = serde_json::to_value(&event)?;
// Publish to PGMQ
let message_id = self.message_client
    .send_json_message(&queue_name, &event_json)
    .await?;
info!(
event_id = %event_id,
message_id = message_id,
correlation_id = %metadata.correlation_id,
"Domain event published successfully"
);
Ok(event_id)
}
}
Querying by Correlation ID
Find all events for a task:
# In Grafana/Tempo
correlation_id = "0199c3e0-ccdb-7581-87ab-3f67daeaa4a5"
In PostgreSQL (PGMQ queues):
SELECT
message->>'event_name' as event,
message->'metadata'->>'step_name' as step,
message->'metadata'->>'fired_at' as fired_at
FROM pgmq.q_payments_domain_events
WHERE message->'metadata'->>'correlation_id' = '0199c3e0-ccdb-7581-87ab-3f67daeaa4a5'
ORDER BY message->'metadata'->>'fired_at';
Span Hierarchy
Domain Event Spans
Domain event spans:
Task Execution (root span)
├── Step Execution
│ ├── Handler Call
│ │ └── Business Logic
│ └── publish_domain_event ◄── NEW
│ ├── route_event
│ │ ├── publish_durable (if durable/broadcast)
│ │ └── publish_fast (if fast/broadcast)
│ └── record_metrics
└── Result Submission
Span Attributes
| Span | Attributes |
|---|---|
publish_domain_event | event_name, namespace, correlation_id, delivery_mode |
route_event | delivery_mode, target_queue (if durable) |
publish_durable | queue_name, message_size |
publish_fast | subscriber_count |
Troubleshooting
Console-Only Mode (No OTLP Export)
Symptom: Logs show “Console-only logging initialized (FFI-safe mode)” but no OpenTelemetry traces
Cause: init_console_only() was called but init_tracing() was never called, or TELEMETRY_ENABLED=false
Fix:
- Check initialization logs: grep -E "(Console-only|OpenTelemetry)" logs/worker.log
- Verify TELEMETRY_ENABLED=true is set: grep "opentelemetry_enabled" logs/worker.log
Domain Event Metrics Missing
Symptom: /metrics/events returns zeros for all stats
Cause: Events not being published or the event router/bus not tracking statistics
Fix:
- Verify events are being published: grep "Domain event published successfully" logs/worker.log
- Check event router initialization: grep "event router" logs/worker.log
- Verify the in-process event bus is configured: grep "in-process event bus" logs/worker.log
Correlation ID Not Propagating
Symptom: Events have different correlation IDs than parent task
Cause: EventMetadata not constructed with task’s correlation_id
Fix: Verify EventMetadata is constructed with the correct correlation_id from the task:
#![allow(unused)]
fn main() {
// When constructing EventMetadata, always use the task's correlation_id
let metadata = EventMetadata {
task_uuid: step_data.task.task.task_uuid,
step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
step_name: Some(step_data.workflow_step.name.clone()),
namespace: step_data.task.namespace_name.clone(),
correlation_id: step_data.task.task.correlation_id, // Must use task's ID
fired_at: chrono::Utc::now(),
fired_by: handler_name.to_string(),
};
}
Best Practices
1. Always Use Two-Phase Init for FFI Workers
#![allow(unused)]
fn main() {
// Correct: Two-phase initialization pattern
// Phase 1: During FFI initialization (Magnus, PyO3, WASM)
tasker_shared::logging::init_console_only();
// Phase 2: After runtime creation
let runtime = tokio::runtime::Runtime::new()?;
runtime.block_on(async {
tasker_shared::logging::init_tracing();
});
// Incorrect: Calling init_tracing() during FFI initialization
// before Tokio runtime exists (may cause issues with OTLP exporter)
}
2. Include Correlation ID in All Events
#![allow(unused)]
fn main() {
// Always propagate correlation_id from the task
let metadata = EventMetadata {
task_uuid: step_data.task.task.task_uuid,
step_uuid: Some(step_data.workflow_step.workflow_step_uuid),
step_name: Some(step_data.workflow_step.name.clone()),
namespace: step_data.task.namespace_name.clone(),
correlation_id: step_data.task.task.correlation_id, // Critical!
fired_at: chrono::Utc::now(),
fired_by: handler_name.to_string(),
};
}
3. Use Structured Logging with Correlation Context
#![allow(unused)]
fn main() {
// All logs should include correlation_id for trace correlation
info!(
event_id = %event_id,
event_name = %event_name,
correlation_id = %metadata.correlation_id,
namespace = %metadata.namespace,
"Domain event published successfully"
);
}
Related Documentation
- Metrics Reference: metrics-reference.md - Complete metrics catalog
- Domain Events: ../domain-events.md - Event system architecture
- Logging Standards: logging-standards.md - Structured logging best practices
This telemetry architecture provides robust observability for domain events while ensuring safe operation with FFI-based language bindings.
Tasker Core Principles
This directory contains the core principles and design philosophy that guide Tasker Core development. These principles are not arbitrary rules but hard-won lessons extracted from implementation experience, root cause analyses, and architectural decisions.
Core Documents
| Document | Description |
|---|---|
| Tasker Core Tenets | The 11 foundational principles that drive all architecture and design decisions |
| Defense in Depth | Multi-layered protection model for idempotency and data integrity |
| Fail Loudly | Why errors beat silent defaults, and phantom data breaks trust |
| Cross-Language Consistency | The “one API” philosophy for Rust, Ruby, Python, and TypeScript workers |
| Composition Over Inheritance | Mixin-based handler composition pattern |
| Intentional AI Partnership | Collaborative approach to AI integration |
Influences
| Document | Description |
|---|---|
| Twelve-Factor App Alignment | How the 12-factor methodology shapes our architecture, with codebase examples and honest gap assessment |
| Zen of Python (PEP-20) | Tim Peters’ guiding principles — referenced as inspiration |
How These Principles Were Derived
These principles emerged from:
- Root Cause Analyses: Ownership removal revealed that “redundant protection with harmful side effects” is worse than minimal, well-understood protection
- Cross-Language Development: Handler harmonization established patterns for consistent APIs across four languages
- Architectural Migrations: Actor pattern refactoring proved the pattern’s effectiveness
- Production Incidents: Real bugs in parallel execution (Heisenbugs becoming Bohrbugs) shaped defensive design
- Protocol Trust Analysis: gRPC client refactoring exposed how silent defaults create phantom data that breaks consumer trust
When to Consult These Documents
- Design decisions: Read Tasker Core Tenets before proposing architecture changes
- Adding protections: Consult Defense in Depth to understand existing layers
- Error handling: Review Fail Loudly before adding defaults or fallbacks
- Worker development: Review Cross-Language Consistency for API alignment
- Handler patterns: Study Composition Over Inheritance for proper structure
Related Documentation
- Architecture Decisions: docs/decisions/ for specific ADRs
- Historical Context: docs/CHRONOLOGY.md for development timeline
- Implementation Details: docs/ticket-specs/ for original specifications
Composition Over Inheritance
Last Updated: 2026-01-01 This document describes Tasker Core’s approach to handler composition using mixins and traits rather than class hierarchies.
The Core Principle
Not: class Handler < API
But: class Handler < Base; include API, include Decision, include Batchable
Handlers gain capabilities by mixing in modules, not by inheriting from specialized base classes.
Why Composition?
The Problem with Inheritance
Deep inheritance hierarchies create problems:
# BAD: Inheritance-based capabilities
class APIDecisionBatchableHandler < APIDecisionHandler < APIHandler < BaseHandler
# Which methods came from where?
# How do I override just one behavior?
# What if I need Batchable but not API?
end
| Problem | Description |
|---|---|
| Diamond problem | Multiple paths to same ancestor |
| Tight coupling | Can’t change base without affecting all children |
| Inflexible | Can’t pick-and-choose capabilities |
| Hard to test | Must test entire hierarchy |
| Opaque behavior | Method origin unclear |
The Composition Solution
Mixins provide selective capabilities:
# GOOD: Composition-based capabilities
class MyHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::APICapable
include TaskerCore::StepHandler::DecisionCapable
def call(context)
# Has API methods (get, post, put, delete)
# Has Decision methods (decision_success, decision_no_branches)
# Does NOT have Batchable methods (didn't include it)
end
end
| Benefit | Description |
|---|---|
| Selective inclusion | Only the capabilities you need |
| Clear origin | Module name indicates where methods come from |
| Independent testing | Test each mixin in isolation |
| Flexible combination | Any combination of capabilities |
| Flat structure | No deep hierarchies to navigate |
The Discovery
Analysis of Batchable handlers revealed they already used the composition pattern:
# Batchable was the TARGET architecture all along
class BatchHandler < Base
include BatchableCapable # Already doing it right!
def call(context)
batch_ctx = get_batch_context(context)
# ...process batch...
batch_worker_complete(processed_count: count, result_data: data)
end
end
The cross-language handler harmonization recommended migrating API and Decision handlers to match this pattern.
Capability Modules
Available Capabilities
| Capability | Module (Ruby) | Methods Provided |
|---|---|---|
| API | APICapable | get, post, put, delete |
| Decision | DecisionCapable | decision_success, decision_no_branches |
| Batchable | BatchableCapable | get_batch_context, batch_worker_complete, handle_no_op_worker |
Rust Traits
#![allow(unused)]
fn main() {
// Rust uses traits for the same pattern
pub trait APICapable {
async fn get(&self, path: &str, params: Option<Params>) -> Response;
async fn post(&self, path: &str, data: Option<Value>) -> Response;
async fn put(&self, path: &str, data: Option<Value>) -> Response;
async fn delete(&self, path: &str, params: Option<Params>) -> Response;
}
pub trait DecisionCapable {
fn decision_success(&self, steps: Vec<String>, result: Value) -> StepExecutionResult;
fn decision_no_branches(&self, result: Value) -> StepExecutionResult;
}
pub trait BatchableCapable {
fn get_batch_context(&self, context: &StepContext) -> BatchContext;
fn batch_worker_complete(&self, count: usize, data: Value) -> StepExecutionResult;
}
}
Python Mixins
# Python uses multiple inheritance (mixins)
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin, DecisionMixin
class MyHandler(StepHandler, APIMixin, DecisionMixin):
def call(self, context: StepContext) -> StepHandlerResult:
# Has both API and Decision methods
response = self.get("/api/endpoint")
return self.decision_success(["next_step"], response)
TypeScript Mixins
// TypeScript uses mixin functions applied in constructor
import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, applyDecision, APICapable, DecisionCapable } from '@tasker-systems/tasker';
class MyHandler extends StepHandler implements APICapable, DecisionCapable {
constructor() {
super();
applyAPI(this); // Adds get/post/put/delete methods
applyDecision(this); // Adds decisionSuccess/decisionNoBranches methods
}
async call(context: StepContext): Promise<StepHandlerResult> {
// Has both API and Decision methods
const response = await this.get('/api/endpoint');
return this.decisionSuccess(['next_step'], response.body);
}
}
Separation of Concerns
What Orchestration Owns
The orchestration layer handles:
- Domain event publishing (after results committed)
- Decision point step creation (from DecisionPointOutcome)
- Batch worker creation (from BatchProcessingOutcome)
- State machine transitions
What Workers Own
Workers handle:
- Decision logic (returns DecisionPointOutcome)
- Batch analysis (returns BatchProcessingOutcome)
- Handler execution (returns StepHandlerResult)
- Custom publishers/subscribers (fast path events)
The Boundary
┌─────────────────────────────────────────────────────────────────┐
│ Worker Space │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Handler.call(context) ││
│ │ - Executes business logic ││
│ │ - Uses API/Decision/Batchable capabilities ││
│ │ - Returns StepHandlerResult with outcome ││
│ └─────────────────────────────────────────────────────────────┘│
│ ↓ Result (with outcome) │
├─────────────────────────────────────────────────────────────────┤
│ Orchestration Space │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Process result ││
│ │ - Commit state transition ││
│ │ - If DecisionPointOutcome: create decision steps ││
│ │ - If BatchProcessingOutcome: create batch workers ││
│ │ - Publish domain events ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
FFI Boundary Types
Outcomes crossing the FFI boundary need explicit types:
DecisionPointOutcome
#![allow(unused)]
fn main() {
// Rust definition
pub enum DecisionPointOutcome {
ActivateSteps { step_names: Vec<String> },
NoBranches,
}
// Serialized (all languages)
{
"type": "ActivateSteps",
"step_names": ["branch_a", "branch_b"]
}
}
BatchProcessingOutcome
#![allow(unused)]
fn main() {
// Rust definition
pub enum BatchProcessingOutcome {
Continue { cursor: CursorConfig },
Complete,
NoOp,
}
// Serialized (all languages)
{
"type": "Continue",
"cursor": {
"position": "offset_123",
"batch_size": 100
}
}
}
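Both outcome enums serialize with a `"type"` discriminant, so a non-Rust worker only needs to emit the matching JSON shape. A minimal sketch, assuming the wire shapes shown above (the helper names here are hypothetical, not part of any Tasker package):

```python
import json

def activate_steps(step_names):
    """Hypothetical helper producing the DecisionPointOutcome wire shape."""
    return {"type": "ActivateSteps", "step_names": list(step_names)}

def continue_batch(position, batch_size):
    """Hypothetical helper producing the BatchProcessingOutcome wire shape."""
    return {"type": "Continue",
            "cursor": {"position": position, "batch_size": batch_size}}

def parse_outcome(raw: str) -> dict:
    """Round-trip through JSON; the "type" field is the discriminant."""
    outcome = json.loads(raw)
    assert "type" in outcome, "FFI outcome payloads must carry a type tag"
    return outcome
```

Keeping the discriminant in a single well-known field is what lets the orchestration layer dispatch on outcomes regardless of which worker language produced them.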
Migration Path
Cross-Language Migration Examples
Ruby
Before (inheritance):
class MyAPIHandler < TaskerCore::APIHandler
def call(context)
# ...
end
end
After (composition):
class MyAPIHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::Mixins::API
def call(context)
# Same implementation, different structure
end
end
Python
Before (inheritance):
class MyAPIHandler(APIHandler):
def call(self, context):
# ...
After (composition):
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin
class MyAPIHandler(StepHandler, APIMixin):
def call(self, context):
# Same implementation, different structure
TypeScript
Before (inheritance):
class MyAPIHandler extends APIHandler {
async call(context: StepContext): Promise<StepHandlerResult> {
// ...
}
}
After (composition):
import { StepHandler } from '@tasker-systems/tasker';
import { applyAPI, APICapable } from '@tasker-systems/tasker';
class MyAPIHandler extends StepHandler implements APICapable {
constructor() {
super();
applyAPI(this);
}
async call(context: StepContext): Promise<StepHandlerResult> {
// Same implementation, different structure
}
}
Rust
Rust already used the composition pattern via traits:
#![allow(unused)]
fn main() {
// Rust has always used traits (composition)
impl StepHandler for MyHandler { ... }
impl APICapable for MyHandler { ... }
impl DecisionCapable for MyHandler { ... }
}
Breaking Changes Implemented
The migration to composition involved breaking changes:
- Base class changes across all languages
- Module/mixin includes required
- Ruby cursor indexing changed from 1-indexed to 0-indexed
All breaking changes were accumulated and released together.
Anti-Patterns
Don’t: Inherit from Multiple Specialized Classes
# BAD: Ruby doesn't support multiple inheritance like this
class MyHandler < APIHandler, DecisionHandler # Syntax error!
Don’t: Reimplement Mixin Methods
# BAD: Overriding mixin methods defeats the purpose
class MyHandler < Base
include APICapable
def get(path, params: {})
# Custom implementation - now you own this forever
end
end
Don’t: Mix Concerns
# BAD: Handler doing orchestration's job
class MyHandler < Base
include DecisionCapable
def call(context)
# Don't create steps directly!
create_workflow_step("next_step") # Orchestration does this!
# Do return the outcome
decision_success(steps: ["next_step"], result_data: {})
end
end
Testing Composition
Test Mixins in Isolation
# Test the mixin itself
RSpec.describe TaskerCore::StepHandler::APICapable do
let(:handler) { Class.new { include TaskerCore::StepHandler::APICapable }.new }
it "provides get method" do
expect(handler).to respond_to(:get)
end
end
Test Handler with Stubs
# Test handler behavior, stub mixin methods
RSpec.describe MyHandler do
let(:handler) { MyHandler.new }
it "calls API and makes decision" do
allow(handler).to receive(:get).and_return({ status: 200 })
result = handler.call(context)
expect(result.decision_point_outcome.type).to eq("ActivateSteps")
end
end
Related Documentation
- Tasker Core Tenets - Tenet #3: Composition Over Inheritance
- Cross-Language Consistency - How composition works across languages
- Patterns and Practices - Handler patterns
Cross-Language Consistency
This document describes Tasker Core’s philosophy for maintaining consistent APIs across Rust, Ruby, Python, and TypeScript workers while respecting each language’s idioms.
The Core Philosophy
“There should be one–and preferably only one–obvious way to do it.” – The Zen of Python
When a developer learns one Tasker worker language, they should understand all of them at the conceptual level. The specific syntax changes; the patterns remain constant.
Consistency Without Uniformity
What We Align
Developer-facing touchpoints that affect daily work:
| Touchpoint | Why Align |
|---|---|
| Handler signatures | Developers switch languages within projects |
| Result factories | Error handling should feel familiar |
| Registry APIs | Service configuration is cross-cutting |
| Context access patterns | Data access is the core operation |
| Specialized handlers | API, Decision, Batchable are reusable patterns |
What We Don’t Force
Language idioms that feel natural in their ecosystem:
| Ruby | Python | TypeScript | Rust |
|---|---|---|---|
Blocks, yield | Decorators, context managers | Generics, interfaces | Traits, associated types |
Symbols (:name) | Type hints | async/await | Pattern matching |
| Duck typing | ABC, Protocol | Union types | Enums, Result<T,E> |
The Aligned APIs
Handler Signatures
All handlers receive context, return results:
# Ruby
class MyHandler < TaskerCore::StepHandler::Base
def call(context)
success(result: { data: "value" })
end
end
# Python
class MyHandler(BaseStepHandler):
def call(self, context: StepContext) -> StepHandlerResult:
return self.success({"data": "value"})
// TypeScript
class MyHandler extends BaseStepHandler {
async call(context: StepContext): Promise<StepHandlerResult> {
return this.success({ data: "value" });
}
}
#![allow(unused)]
fn main() {
// Rust
impl StepHandler for MyHandler {
async fn call(&self, step_data: &TaskSequenceStep) -> StepExecutionResult {
StepExecutionResult::success(json!({"data": "value"}), None)
}
}
}
Result Factories
Success and failure patterns are identical:
| Operation | Pattern |
|---|---|
| Success | success(result_data, metadata?) |
| Failure | failure(message, error_type, error_code?, retryable?, metadata?) |
The factory methods hide implementation details (wrapper classes, enum variants) behind a consistent interface.
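As a sketch of that interface, here are the two factories in plain Python. The success shape matches the serialization guarantee shown later in this document; the failure payload's field layout is an assumption for illustration, not the exact wire format:

```python
def success(result_data, metadata=None):
    """Sketch of the cross-language success factory."""
    return {"success": True, "result": result_data, "metadata": metadata or {}}

def failure(message, error_type, error_code=None, retryable=False, metadata=None):
    """Sketch of the cross-language failure factory (field layout illustrative)."""
    return {
        "success": False,
        "error": {
            "message": message,
            "error_type": error_type,
            "error_code": error_code,
            "retryable": retryable,
        },
        "metadata": metadata or {},
    }
```

Whatever a language wraps these in internally (Ruby value objects, Rust enum variants), the caller-facing arguments stay in this order.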
Registry Operations
All registries support the same core operations:
| Operation | Description |
|---|---|
register(name, handler) | Register a handler by name |
is_registered(name) | Check if handler exists |
resolve(name) | Get handler instance |
list_handlers() | List all registered handlers |
Context Access Patterns
The StepContext provides unified access to execution data:
Core Fields (All Languages)
| Field | Type | Description |
|---|---|---|
| task_uuid | String | Unique task identifier |
| step_uuid | String | Unique step identifier |
| input_data | Dict/Hash | Input data for the step |
| step_config | Dict/Hash | Handler configuration |
| dependency_results | Wrapper | Results from parent steps |
| retry_count | Integer | Current retry attempt |
| max_retries | Integer | Maximum retry attempts |
Convenience Methods
| Method | Description |
|---|---|
| get_task_field(name) | Get field from task context |
| get_dependency_result(step_name) | Get result from a parent step |
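Put together, the context surface can be sketched like this in Python; the constructor and helper bodies are assumptions based on the tables above, not the shipped tasker-py types:

```python
class StepContext:
    """Illustrative sketch of the unified context fields and helpers."""

    def __init__(self, task_uuid, step_uuid, input_data, step_config,
                 dependency_results, retry_count=0, max_retries=3):
        self.task_uuid = task_uuid
        self.step_uuid = step_uuid
        self.input_data = input_data
        self.step_config = step_config
        self.dependency_results = dependency_results
        self.retry_count = retry_count
        self.max_retries = max_retries

    def get_task_field(self, name):
        # Simplification for this sketch: read from the step's input data.
        return self.input_data.get(name)

    def get_dependency_result(self, step_name):
        # Returns None when no parent step produced a result under this name.
        return self.dependency_results.get(step_name)

ctx = StepContext(
    task_uuid="t-1", step_uuid="s-1",
    input_data={"order_id": 42}, step_config={},
    dependency_results={"validate_order": {"valid": True}},
)
order_id = ctx.get_task_field("order_id")
upstream = ctx.get_dependency_result("validate_order")
```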
Specialized Handler Patterns
API Handler
HTTP operations available in all languages:
| Method | Pattern |
|---|---|
| GET | get(path, params?, headers?) |
| POST | post(path, data?, headers?) |
| PUT | put(path, data?, headers?) |
| DELETE | delete(path, params?, headers?) |
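A minimal sketch of that surface in Python, assuming a pluggable transport; the class and transport signature are illustrative, not the shipped client:

```python
class ApiHandler:
    """Hypothetical API handler exposing the uniform HTTP surface.

    `transport` stands in for a real HTTP client so the sketch stays
    self-contained and testable without network access.
    """

    def __init__(self, base_url, transport):
        self.base_url = base_url
        self.transport = transport

    def get(self, path, params=None, headers=None):
        return self.transport("GET", self.base_url + path, params=params, headers=headers)

    def post(self, path, data=None, headers=None):
        return self.transport("POST", self.base_url + path, data=data, headers=headers)

    def put(self, path, data=None, headers=None):
        return self.transport("PUT", self.base_url + path, data=data, headers=headers)

    def delete(self, path, params=None, headers=None):
        return self.transport("DELETE", self.base_url + path, params=params, headers=headers)

calls = []
def fake_transport(method, url, **kwargs):
    # Record the call and return a canned response for the sketch.
    calls.append((method, url))
    return {"status": 200}

api = ApiHandler("https://example.test", fake_transport)
resp = api.get("/orders/42")
```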
Decision Handler
Conditional workflow branching:
# Ruby
decision_success(steps: ["branch_a", "branch_b"], result_data: { routing: "criteria" })
decision_no_branches(result_data: { reason: "no action needed" })
# Python
self.decision_success(["branch_a", "branch_b"], {"routing": "criteria"})
self.decision_no_branches({"reason": "no action needed"})
Batchable Handler
Cursor-based batch processing:
| Operation | Description |
|---|---|
| get_batch_context(context) | Extract batch metadata |
| batch_worker_complete(count, data) | Signal batch completion |
| handle_no_op_worker(batch_ctx) | Handle empty batch |
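The cursor loop behind these operations can be sketched in Python; fetch_page and the cursor shape are assumptions for illustration, not the shipped batch API:

```python
def process_in_batches(fetch_page, handle_item, page_size=100):
    """Cursor-based batch processing sketch.

    `fetch_page(cursor, limit)` returns (items, next_cursor); next_cursor is
    None once the data set is exhausted. Returns the processed count.
    """
    cursor, processed = None, 0
    while True:
        items, cursor = fetch_page(cursor, page_size)
        if not items:
            return processed  # empty batch: the no-op worker path
        for item in items:
            handle_item(item)
        processed += len(items)
        if cursor is None:
            return processed  # final page: signal completion with the count

# Drive the loop with an in-memory "table" of 7 rows, paged 3 at a time.
data = list(range(7))

def fetch_page(cursor, limit):
    start = cursor or 0
    chunk = data[start:start + limit]
    nxt = start + limit if start + limit < len(data) else None
    return chunk, nxt

seen = []
total = process_in_batches(fetch_page, seen.append, page_size=3)
```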
FFI Boundary Types
When data crosses the FFI boundary (Rust <-> Ruby/Python/TypeScript), types must serialize identically:
Required Explicit Types
| Type | Purpose |
|---|---|
| DecisionPointOutcome | Decision handler results |
| BatchProcessingOutcome | Batch handler results |
| CursorConfig | Batch cursor configuration |
| StepHandlerResult | All handler results |
Serialization Guarantee
The same JSON representation must work across all languages:
{
"success": true,
"result": { "data": "value" },
"metadata": { "timing_ms": 50 }
}
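A quick Python check of that guarantee; the is_valid_result validator below is a hypothetical illustration of the envelope keys, not part of the library:

```python
import json

# The canonical wire shape from above; every language binding must
# serialize to and deserialize from this same JSON structure.
payload = {
    "success": True,
    "result": {"data": "value"},
    "metadata": {"timing_ms": 50},
}

def is_valid_result(doc: dict) -> bool:
    """Hypothetical validator for the cross-language result envelope."""
    return isinstance(doc.get("success"), bool) and "result" in doc and "metadata" in doc

wire = json.dumps(payload, sort_keys=True)  # deterministic key order
round_tripped = json.loads(wire)
```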
Why This Matters
Developer Productivity
When switching from a Ruby handler to a Python handler:
- No relearning core concepts
- Same mental model applies
- Documentation transfers
Code Review Consistency
Reviewers can evaluate handlers in any language:
- Pattern violations are obvious
- Best practices are universal
- Anti-patterns are recognizable
Documentation Efficiency
One set of conceptual docs serves all languages:
- Language-specific pages show syntax only
- Core patterns documented once
- Examples parallel across languages
The Pre-Alpha Advantage
In pre-alpha, we can make breaking changes to achieve consistency:
| Change Type | Example |
|---|---|
| Method renames | handle() → call() |
| Signature changes | (task, step) → (context) |
| Return type unification | Separate Success/Error → unified result |
These changes would be costly post-release but are cheap now.
Migration Path
When APIs diverge, we follow this pattern:
- Non-Breaking First: Add aliases, helpers, new modules
- Deprecation Period: Mark old APIs deprecated (warnings in logs)
- Breaking Release: Remove old APIs, document migration
Example timeline:
Phase 1: Python migration (non-breaking + breaking)
Phase 2: Ruby migration (non-breaking + breaking)
Phase 3: Rust alignment (already aligned)
Phase 4: TypeScript alignment (new implementation)
Phase 5: Breaking changes release (all languages together)
Anti-Patterns
Don’t: Force Identical Syntax
# BAD: Ruby-style symbols in Python
def call(context) -> Hash[:success => true] # Not Python!
Don’t: Ignore Language Idioms
# BAD: Python-style type hints in Ruby
def call(context: StepContext) -> StepHandlerResult # Not Ruby!
Don’t: Duplicate Orchestration Logic
# BAD: Worker creating decision steps
def call(context)
# Don't do orchestration's job!
create_decision_steps(...) # Orchestration handles this
end
Related Documentation
- Tasker Core Tenets - Tenet #4: Cross-Language Consistency
- API Convergence Matrix - Detailed API reference
- Patterns and Practices - Common patterns
- Example Handlers - Side-by-side code examples
Defense in Depth
This document describes Tasker Core’s multi-layered protection model for idempotency and data integrity.
The Four Protection Layers
Tasker Core uses four independent protection layers. Each layer catches what others might miss, and no single layer bears full responsibility for data integrity.
┌─────────────────────────────────────────────────────────────────┐
│ Layer 4: Application Logic │
│ (State-based deduplication) │
├─────────────────────────────────────────────────────────────────┤
│ Layer 3: Transaction Boundaries │
│ (All-or-nothing semantics) │
├─────────────────────────────────────────────────────────────────┤
│ Layer 2: State Machine Guards │
│ (Current state validation) │
├─────────────────────────────────────────────────────────────────┤
│ Layer 1: Database Atomicity │
│ (Unique constraints, row locks, CAS) │
└─────────────────────────────────────────────────────────────────┘
Layer 1: Database Atomicity
The foundation layer using PostgreSQL’s transactional guarantees.
Mechanisms
| Mechanism | Purpose | Example |
|---|---|---|
| Unique constraints | Prevent duplicate records | One active task per (namespace, external_id) |
| Row-level locking | Prevent concurrent modification | SELECT ... FOR UPDATE on task claim |
| Compare-and-swap | Atomic state transitions | UPDATE ... WHERE state = $expected |
| Advisory locks | Distributed coordination | Template cache invalidation |
Atomic Claiming Pattern
-- Only one processor can claim a task
UPDATE tasks
SET state = 'in_progress',
processor_uuid = $1,
claimed_at = NOW()
WHERE id = $2
AND state = 'pending' -- CAS: only if still pending
RETURNING *;
If two processors try to claim the same task:
- First: succeeds, task transitions to in_progress
- Second: fails (0 rows affected), no state change
Why This Works
PostgreSQL takes a row lock for the UPDATE and evaluates the WHERE state = 'pending' check against the current, locked row version before applying SET state = 'in_progress', so the check and the write happen atomically. There is no window in which both processors see state = 'pending' and both succeed.
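The same first-wins semantics can be sketched outside SQL. In the Python sketch below a lock stands in for PostgreSQL's row-level atomicity; in production the database provides this guarantee, not application code:

```python
import threading

tasks = {"task-1": {"state": "pending", "processor_uuid": None}}
lock = threading.Lock()  # stand-in for PostgreSQL's row lock

def try_claim(task_id, processor_uuid):
    """Compare-and-swap: claim the task only if it is still pending."""
    with lock:
        task = tasks[task_id]
        if task["state"] != "pending":    # the WHERE state = 'pending' check
            return False                  # 0 rows affected
        task["state"] = "in_progress"     # the SET clause
        task["processor_uuid"] = processor_uuid
        return True

first = try_claim("task-1", "proc-a")    # succeeds
second = try_claim("task-1", "proc-b")   # fails cleanly, no state change
```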
Layer 2: State Machine Guards
State machine validation before any transition is attempted.
Implementation
impl TaskStateMachine {
    pub fn can_transition(&self, from: TaskState, to: TaskState) -> bool {
        VALID_TRANSITIONS.contains(&(from, to))
    }

    pub fn transition(&mut self, to: TaskState) -> Result<(), StateError> {
        if !self.can_transition(self.current, to) {
            return Err(StateError::InvalidTransition { from: self.current, to });
        }
        // Proceed with transition
        self.current = to;
        Ok(())
    }
}
Valid Transitions Matrix
The state machine explicitly defines which transitions are valid:
Pending → Initializing → EnqueuingSteps → StepsInProcess
                                               ↓
Complete ← EvaluatingResults ← (step completions)
    ↓
Error (from any state)
Invalid transitions are rejected before reaching the database.
Why This Works
Application-level guards prevent obviously invalid operations from even attempting database changes. This reduces database load and provides better error messages.
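As a sketch, the guard reduces to a set membership check plus the always-valid Error transition. The transition subset below mirrors the diagram above; state names as plain strings are a simplification:

```python
# Illustrative subset of the valid transition table.
VALID_TRANSITIONS = {
    ("Pending", "Initializing"),
    ("Initializing", "EnqueuingSteps"),
    ("EnqueuingSteps", "StepsInProcess"),
    ("StepsInProcess", "EvaluatingResults"),
    ("EvaluatingResults", "Complete"),
}

def can_transition(current, target):
    # Error is reachable from any state; everything else must be listed.
    return target == "Error" or (current, target) in VALID_TRANSITIONS

ok = can_transition("Pending", "Initializing")   # valid next step
bad = can_transition("Pending", "Complete")      # skips the lifecycle
```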
Layer 3: Transaction Boundaries
All-or-nothing semantics for multi-step operations.
Example: Step Enqueueing
async fn enqueue_steps(task_id: TaskId, steps: Vec<Step>) -> Result<()> {
    let mut tx = pool.begin().await?;

    // Insert all steps
    for step in steps {
        insert_step(&mut tx, task_id, &step).await?;
    }

    // Update task state
    update_task_state(&mut tx, task_id, TaskState::StepsInProcess).await?;

    // Atomic commit - all or nothing
    tx.commit().await?;
    Ok(())
}
If step insertion fails:
- Transaction rolls back
- Task state unchanged
- No partial steps created
Why This Works
PostgreSQL transactions ensure that either all changes commit or none do. There’s no intermediate state where some steps exist but task state is wrong.
Layer 4: Application-Level Filtering
State-based deduplication in application logic.
Example: Result Processing
async fn process_result(step_id: StepId, result: HandlerResult) -> Result<()> {
    let step = get_step(step_id).await?;

    // Filter: only process if step is in_progress
    if step.state != StepState::InProgress {
        log::info!("Ignoring result for step {} in state {:?}", step_id, step.state);
        return Ok(()); // Idempotent: already processed
    }

    // Proceed with result processing
    apply_result(step, result).await
}
Why This Works
Even if the same result arrives multiple times (network retries, duplicate messages), only the first processing takes effect. Subsequent attempts see the step already transitioned and exit cleanly.
The Discovery: Ownership Was Harmful
What We Learned
Analysis of processor UUID “ownership” enforcement revealed:
// OLD: Ownership enforcement (REMOVED)
fn can_process(&self, processor_uuid: Uuid) -> bool {
    self.owner_uuid == processor_uuid // BLOCKED recovery!
}

// NEW: Ownership tracking only (for audit)
fn process(&self, processor_uuid: Uuid) -> Result<()> {
    self.record_processor(processor_uuid); // Track, don't enforce
    // ... proceed with processing
}
Why Ownership Enforcement Was Removed
| Scenario | With Enforcement | Without Enforcement |
|---|---|---|
| Normal operation | Works | Works |
| Orchestrator crash & restart | BLOCKED - new UUID | Automatic recovery |
| Duplicate message | Rejected | Layer 1 (CAS) rejects |
| Race condition | Rejected | Layer 1 (CAS) rejects |
The four protection layers already prevent corruption. Ownership added:
- Zero additional safety (layers 1-4 sufficient)
- Recovery blocking (crashed tasks stuck forever)
- Operational complexity (manual intervention needed)
The Verdict
“Processor UUID ownership was redundant protection with harmful side effects.”
When two actors receive identical messages:
- First: Succeeds atomically (Layer 1 CAS)
- Second: Fails cleanly (Layer 1 CAS)
- No partial state, no corruption
- No ownership check needed
Designing New Protections
When adding protection mechanisms, evaluate against this checklist:
Before Adding Protection
- Which layer does this belong to? (Database, state machine, transaction, application)
- What does it protect against? (Be specific: race condition, duplicate, corruption)
- Do existing layers already cover this? (Usually yes)
- What failure modes does it introduce? (Blocked recovery, increased latency)
- Can the system recover if this protection itself fails?
The Minimal Set Principle
Find the minimal set of protections that prevents corruption. Additional layers that prevent recovery are worse than none.
A system that:
- Has fewer protections
- Recovers automatically from crashes
- Handles duplicates idempotently
Is better than a system that:
- Has more protections
- Requires manual intervention after crashes
- Is “theoretically more secure”
Relationship to Fail Loudly
Defense in Depth and Fail Loudly are complementary principles:
| Defense in Depth | Fail Loudly |
|---|---|
| Multiple layers prevent corruption | Errors surface problems immediately |
| Redundancy catches edge cases | Transparency enables diagnosis |
| Protection happens before damage | Visibility happens at detection |
Both reject the same anti-pattern: silent failures.
- Defense in Depth rejects: silent corruption (data changed without protection)
- Fail Loudly rejects: silent defaults (missing data hidden with fabricated values)
Together they ensure: if something goes wrong, we know about it—either protection prevents it, or an error surfaces it.
Related Documentation
- Tasker Core Tenets - Tenet #1: Defense in Depth, Tenet #11: Fail Loudly
- Fail Loudly - Errors as first-class citizens
- Idempotency and Atomicity - Implementation details
- States and Lifecycles - State machine specifications
- Ownership Removal ADR - Processor UUID ownership removal decision
Fail Loudly
This document describes Tasker Core’s philosophy on error handling: errors are first-class citizens, not inconveniences to hide.
The Core Principle
A system that lies is worse than one that fails.
When data is missing, malformed, or unexpected, the correct response is an explicit error—not a fabricated default that makes the problem invisible.
The Problem: Phantom Data
“Phantom data” is data that:
- Looks valid to consumers
- Passes type checks and validation
- Contains no actual information from the source
- Was fabricated by defensive code trying to be “helpful”
Example: The Silent Default
// WRONG: Silent default hides protocol violation
fn get_pool_utilization(response: Response) -> PoolUtilization {
    response.pool_utilization.unwrap_or_else(|| PoolUtilization {
        active_connections: 0,
        idle_connections: 0,
        max_connections: 0,
        utilization_percent: 0.0, // Looks like "no load"
    })
}
A monitoring system receiving this response sees:
- utilization_percent: 0.0 — “Great, the system is idle!”
- Reality: The server never sent pool data. The system might be at 100% load.
The consumer cannot distinguish “server reported 0%” from “server sent nothing.”
The Trust Equation
Silent default
→ Consumer receives valid-looking data
→ Consumer makes decisions based on phantom values
→ Phantom bugs manifest in production
→ Debugging nightmare: "But the data looked correct!"
vs.
Explicit error
→ Consumer receives clear failure
→ Consumer handles error appropriately
→ Problem visible immediately
→ Fix applied at source
The Solution: Explicit Errors
Pattern: Required Fields Return Errors
// RIGHT: Explicit error on missing required data
fn get_pool_utilization(response: Response) -> Result<PoolUtilization, ClientError> {
    response.pool_utilization.ok_or_else(|| {
        ClientError::invalid_response(
            "Response.pool_utilization",
            "Server omitted required pool utilization data",
        )
    })
}
Now the consumer:
- Knows data is missing
- Can retry, alert, or degrade gracefully
- Never operates on phantom values
Pattern: Distinguish Required vs Optional
Not all fields should fail on absence. The distinction matters:
| Field Type | Missing Means | Response |
|---|---|---|
| Required | Protocol violation, server bug | Return error |
| Optional | Legitimately absent, feature not configured | Return None |
// Required: Server MUST send health checks
let checks = response.checks.ok_or_else(||
    ClientError::invalid_response("checks", "missing")
)?;

// Optional: Distributed cache may not be configured
let cache = response.distributed_cache; // Option<T> preserved
Pattern: Propagate, Don’t Swallow
Errors should flow up, not disappear:
// WRONG: Error swallowed, default returned
fn convert_response(r: Response) -> DomainType {
    let info = r.info.unwrap_or_default(); // Error hidden
    // ...
}

// RIGHT: Error propagated to caller
fn convert_response(r: Response) -> Result<DomainType, ClientError> {
    let info = r.info.ok_or_else(||
        ClientError::invalid_response("info", "missing")
    )?; // Error visible
    // ...
}
When Defaults Are Acceptable
Not every unwrap_or_default() is wrong. Defaults are acceptable when:
1. The field is explicitly optional in the domain model

   // Optional metadata that may legitimately be absent
   let metadata: Option<Value> = response.metadata;

2. The default is semantically meaningful

   // Empty tags list is valid—means “no tags”
   let tags = response.tags.unwrap_or_default(); // Vec<String>

3. Absence cannot be confused with a valid value

   // description being None vs "" are distinguishable
   let description: Option<String> = response.description;
Red Flags to Watch For
When reviewing code, these patterns indicate potential phantom data:
1. unwrap_or_default() on Numeric Types
// RED FLAG: 0 looks like a valid measurement
let active_connections = pool.active.unwrap_or_default();
2. unwrap_or_else(|| ...) with Fabricated Values
// RED FLAG: "unknown" looks like real status
let status = check.status.unwrap_or_else(|| "unknown".to_string());
3. Default Structs for Missing Nested Data
// RED FLAG: Entire section fabricated
let config = response.config.unwrap_or_else(default_config);
4. Silent Fallbacks in Health Checks
// RED FLAG: Health check that never fails is useless
let health = check_health().unwrap_or(HealthStatus::Ok);
Implementation Checklist
When implementing new conversions or response handling:
- Is this field required by the protocol/API contract?
- If missing, would a default be indistinguishable from a valid value?
- Could a consumer make incorrect decisions based on a default?
- Is the error message actionable? (includes field name, explains what’s wrong)
- Is the error type appropriate? (InvalidResponse for protocol violations)
The Discovery
What We Found
During gRPC client implementation, analysis revealed pervasive patterns like:
// Found throughout conversions.rs
let checks = response.checks.unwrap_or_else(|| ReadinessChecks {
    web_database: HealthCheck { status: "unknown".into(), ... },
    orchestration_database: HealthCheck { status: "unknown".into(), ... },
    // ... more fabricated checks
});
A client calling get_readiness() would receive what looked like a valid response with “unknown” status for all checks—when in reality, the server sent nothing.
The Refactoring
All required-field patterns were changed to explicit errors:
// After refactoring
let checks = response.checks.ok_or_else(|| {
    ClientError::invalid_response(
        "ReadinessResponse.checks",
        "Readiness response missing required health checks",
    )
})?;
Now a malformed server response immediately fails with:
Error: Invalid response: ReadinessResponse.checks - Readiness response missing required health checks
The problem is visible. The fix can be applied. Trust is preserved.
Related Principles
- Tenet #11: Fail Loudly in Tasker Core Tenets
- Meta-Principle #6: Errors Over Defaults
- Defense in Depth — fail loudly is a form of protection; silent defaults are a form of hiding
Summary
| Don’t | Do |
|---|---|
| Hide missing data with defaults | Return explicit errors |
| Make consumers guess if data is real | Distinguish required vs optional |
| Fabricate “unknown” status values | Error: “status unavailable” |
| Swallow errors in conversions | Propagate with ? operator |
| Treat all fields as optional | Model optionality in types |
The golden rule: If you can’t tell the difference between “server sent 0” and “server sent nothing,” you have a phantom data problem.
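The golden rule can be sketched in a few lines of Python; the field and function names are illustrative:

```python
def pool_utilization(response: dict) -> float:
    """Fail loudly: missing required data is an error, not a fabricated 0.0."""
    if "utilization_percent" not in response:
        raise ValueError("response missing required field 'utilization_percent'")
    return response["utilization_percent"]

# Server sent a real measurement: the value is trustworthy.
value = pool_utilization({"utilization_percent": 37.5})

# Server sent nothing: the problem surfaces immediately instead of
# masquerading as an idle system.
try:
    pool_utilization({})
    failed_loudly = False
except ValueError:
    failed_loudly = True
```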
Intentional AI Partnership
A philosophy of rigorous collaboration for the age of AI-assisted engineering
The Growing Divide
There is a phrase gaining traction in software engineering circles: “Nice AI slop.”
It’s dismissive. It’s reductive. And it’s not entirely wrong.
The critique is valid: AI tools have made it possible to generate enormous volumes of code without understanding what that code does, why it’s structured the way it is, or how to maintain it when something breaks at 2 AM. Engineers who would never have shipped code they couldn’t explain are now approving pull requests they couldn’t debug. Project leads are drowning in contributions from well-meaning developers who “vibe-coded” their way into maintenance nightmares.
For those of us who have spent years—decades—in the craft of software engineering, who have sat with codebases through their full lifecycles, who have felt the weight of technical decisions made five years ago landing on our shoulders today, this is frustrating. The hard-won discipline of our profession seems to be eroding in favor of velocity.
And yet.
The response to “AI slop” cannot be rejection of AI as a partner in engineering work. That path leads to irrelevance. The question is not whether to work with AI, but how—with what principles, what practices, what commitments to quality and accountability.
This document is an attempt to articulate those principles. Not as abstract ideals, but as a working philosophy grounded in practice: building real systems, shipping real code, maintaining real accountability.
The Core Insight: Amplification, Not Replacement
AI does not create the problems we’re seeing. It amplifies them.
Teams that already had weak ownership practices now produce more poorly-understood code, faster. Organizations where “move fast and break things” meant “ship it and let someone else figure it out” now ship more of it. Engineers who never quite understood the systems they worked on can now generate more code they don’t understand.
But the inverse is also true.
Teams with strong engineering discipline—clear specifications, rigorous review, genuine ownership—can leverage AI to operate at a higher level of abstraction while maintaining (or even improving) quality. The acceleration becomes an advantage, not a liability.
This is the same dynamic that exists in any collaboration. A junior engineer paired with a senior engineer who doesn’t mentor becomes a junior engineer who writes more code without learning. A junior engineer paired with a senior engineer who invests in their growth becomes a stronger engineer, faster.
AI partnership follows the same pattern. The quality of the outcome depends on the quality of the collaboration practices surrounding it.
The discipline required for effective AI partnership is not new. It is the discipline that should characterize all engineering collaboration. AI simply makes the presence or absence of that discipline more visible, more consequential, and more urgent.
Principles of Intentional Partnership
1. Specification Before Implementation
The most effective AI collaboration begins long before code is written.
When you ask an AI to “build a feature,” you get code. When you work with an AI to understand the problem, research the landscape, evaluate approaches, and specify the solution—then implement—you get software.
This is not an AI-specific insight. It’s foundational engineering practice. But AI makes the cost of skipping specification deceptively low: you can generate code instantly, so why spend time on design? The answer is the same as it’s always been: because code without design is not software, it’s typing.
In practice:
- Begin with exploration: What problem are we solving? What does the current system look like? What will be different when this work is complete?
- Research with tools: Use AI capabilities to understand the codebase, explore patterns in the ecosystem, review prior art. Ground the work in reality, not assumptions.
- Develop evaluation criteria before evaluating solutions. Know what “good” looks like before you start judging options.
- Document the approach, not just the code. Specifications are artifacts of understanding.
2. Phased Delivery with Validation Gates
Large work should be decomposed into phases, and each phase should have clear acceptance criteria.
This principle exists because humans have limited working memory. It’s true for individual engineers, it’s true for teams, and it’s true for AI systems. Complex work exceeds the capacity of any single context—human or machine—to hold it all at once.
Phased delivery is how we manage this limitation. Each phase is small enough to understand completely, validate thoroughly, and commit to confidently. The boundaries between phases are synchronization points where understanding is verified.
In practice:
- Identify what can be parallelized versus what must be sequential. Not all work is equally dependent.
- Determine which aspects require careful attention versus which can be resolved at implementation time. Not all decisions are equally consequential.
- Each phase should be independently validatable: tests pass, acceptance criteria met, code reviewed.
- Phase documentation should include code samples for critical paths. Show, don’t just tell.
3. Validation as a First-Class Concern
Testing is not a phase that happens after implementation. It is a design constraint that shapes implementation.
AI can generate tests as easily as it generates code. This makes it tempting to treat testing as an afterthought: write the code, then generate tests to cover it. This inverts the value proposition of testing entirely.
Tests are specifications. They encode expectations about behavior. When tests are written first—or at least designed first—they constrain the implementation toward correctness. When tests are generated after the fact, they merely document whatever the implementation happens to do, bugs included.
In practice:
- Define acceptance criteria before implementation begins.
- Include edge cases, boundary conditions, and non-happy-path scenarios in specifications.
- End-to-end testing validates that the system works, not just that individual units work.
- Review tests with the same rigor as implementation code. Tests can have bugs too.
4. Human Accountability as the Final Gate
This is the principle that separates intentional partnership from “AI slop.”
The human engineer is ultimately responsible for code that ships. Not symbolically responsible—actually responsible. Responsible for understanding what the code does, why it’s structured the way it is, what trade-offs were made, and how to maintain it.
This is not about low trust in AI. It’s about the nature of accountability.
If you cannot explain why a particular approach was chosen, you should not approve it. If you cannot articulate the trade-offs embedded in a design decision, you should not sign off on it. If you cannot defend a choice—or at least explain why the choice wasn’t worth extensive deliberation—then you are not in a position to take responsibility for it.
This standard applies to all code, regardless of its origin. Human-written code that the approving engineer doesn’t understand is no better than AI-written code they don’t understand. The source is irrelevant; the accountability is what matters.
In practice:
- Review is not approval. Approval requires understanding.
- The bikeshedding threshold is a valid concept: knowing why something isn’t worth debating is also knowledge. But you must actually know this, not assume it.
- Code review agents and architectural validators are useful, but they augment human judgment rather than replacing it.
- If you wouldn’t ship code you wrote yourself without understanding it, don’t ship AI-written code without understanding it either.
5. Documentation as Extended Cognition
Documentation is not an artifact of completed work. It is a tool that enables work to continue.
Every engineer who joins a project faces the same challenge: building sufficient context to contribute effectively. Every AI session faces the same challenge: starting fresh without memory of prior work. Good documentation serves both.
This is the insight that makes documentation investment worthwhile: it extends cognition across time and across minds. The context you build today, documented well, becomes instantly available to future collaborators—human or AI.
In practice:
- Structure documentation for efficient context loading. Navigation guides, trigger patterns, clear hierarchies.
- Capture the “why” alongside the “what.” Decisions without rationale are trivia.
- Principles, architecture, guides, reference—different documents serve different needs at different times.
- Documentation that serves future AI sessions also serves future human engineers. The requirements are the same: limited working memory, need for efficient orientation.
6. Toolchain Alignment
Some development environments are better suited to intentional partnership than others.
The ideal toolchain provides fast feedback loops, enforces correctness constraints, and makes architectural decisions explicit. The compiler, the type system, the test framework—these become additional collaborators in the process, catching errors early and forcing clarity about intent.
Languages and tools that defer decisions to runtime, that allow implicit behavior, that prioritize flexibility over explicitness, make intentional partnership harder. Not impossible—but harder. The burden of verification shifts more heavily to the human.
In practice:
- Strong type systems document intent in ways that survive across sessions and collaborators.
- Compilers that enforce correctness (memory safety, exhaustive matching) catch the classes of errors most likely to slip through in high-velocity development.
- Explicit architectural patterns—actor models, channel semantics, clear ownership boundaries—force intentional design rather than emergent mess.
- The goal is not language advocacy but recognition: your toolchain affects your collaboration quality.
A Concrete Example: Building Tasker
These principles are not theoretical. They emerged from—and continue to guide—the development of Tasker, a workflow orchestration system built in Rust.
Why Rust?
Rust is not chosen as a recommendation but as an illustration of what makes a toolchain powerful for intentional partnership.
The Rust compiler forces agreement on memory ownership. You cannot be vague about who owns data and when it’s released; the borrow checker requires explicitness. This means architectural decisions about data flow must be made consciously rather than accidentally.
Exhaustive pattern matching means you cannot forget to handle a case. Every enum variant must be addressed. This is particularly valuable when working with AI: generated code that handles only the happy path fails to compile rather than failing silently in production.
The type system documents intent in ways that persist across context windows. When an AI session resumes work on a Rust codebase, the types communicate constraints that would otherwise need to be re-established through conversation.
Tokio channels, MPSC patterns, actor boundaries—these require intentional design. You cannot stumble into an actor architecture; you must choose it and implement it explicitly. This aligns well with specification-driven development.
None of this makes Rust uniquely suitable or necessary. It makes Rust an example of the properties that matter: explicitness, enforcement, feedback loops that catch errors early.
The Spec-Driven Workflow
Every significant piece of Tasker work follows a pattern:
1. Problem exploration: What are we trying to accomplish? What’s the current state? What will success look like?
2. Grounded research: Use AI capabilities to understand the codebase, explore ecosystem patterns, review tooling options. Generate a situated view of how the problem exists within the actual system.
3. Approach analysis: Develop criteria for evaluating solutions. Generate multiple approaches. Evaluate against criteria. Select and refine.
4. Phased planning: Break work into milestones with validation gates. Identify dependencies, parallelization opportunities, risk areas. Determine what needs careful specification versus what can be resolved during implementation.
5. Phase documentation: Each phase gets its own specification in a dedicated directory. Includes acceptance criteria, code samples for critical paths, and explicit validation requirements.
6. Implementation with validation: Work proceeds phase by phase. Tests are written. Code is reviewed. Each phase is complete before the next begins.
7. Human accountability gate: The human partner reviews not just for correctness but for understanding. Can they defend the choices? Do they know why alternatives were rejected? Are they prepared to maintain this code?
This workflow produces comprehensive documentation as a side effect of doing the work. The docs/ticket-specs/ directories in Tasker contain detailed specifications that serve both as implementation guides and as institutional memory. Future engineers—and future AI sessions—can understand not just what was built but why.
The Tenets as Guardrails
Tasker’s development is guided by a set of core tenets, derived from experience. Several are directly relevant to intentional partnership:
State Machine Rigor: All state transitions are atomic, audited, and validated. This principle emerged from debugging distributed systems failures; it also provides clear contracts for AI-generated code to satisfy.
Defense in Depth: Multiple overlapping protection layers rather than single points of failure. In collaboration terms: review, testing, type checking, and runtime validation each catch what others might miss.
Composition Over Inheritance: Capabilities are composed via mixins, not class hierarchies. This produces code that’s easier to understand in isolation—crucial when any given context (human or AI) can only hold part of the system at once.
These tenets emerged from building software over many years. They apply to AI partnership because they apply to engineering generally. AI is a collaborator; good engineering principles govern collaboration.
The Organizational Dimension
Intentional AI partnership is not just an individual practice. It’s an organizational capability.
What Changes
When AI acceleration is available to everyone, the differentiator becomes the quality of surrounding practices:
- Specification quality determines whether AI generates useful code or plausible-looking nonsense.
- Review rigor determines whether errors are caught before or after deployment.
- Testing discipline determines whether systems are verifiably correct or coincidentally working.
- Documentation investment determines whether institutional knowledge accumulates or evaporates.
Organizations that were already strong in these areas will find AI amplifies their strength. Organizations that were weak will find AI amplifies their weakness—faster.
The Accountability Question
The hardest organizational challenge is accountability.
When an engineer can generate a month’s worth of code in a day, traditional review processes break down. You cannot carefully review a thousand lines of code per hour. Something has to give.
The answer is not “skip review” or “automate review entirely.” The answer is to change what gets reviewed.
In intentional partnership, the specification is the primary artifact. The specification is reviewed carefully: Does this approach make sense? Does it align with architectural principles? Does it handle edge cases? Does it integrate with existing systems?
The implementation—whether AI-generated or human-written—is validated against the specification. Tests verify behavior. Type systems verify contracts. Review confirms that the implementation matches the spec.
This shifts review from “read every line of code” to “verify that implementation matches intent.” It’s a different skill, but it’s learnable. And it scales in ways that line-by-line review does not.
Building the Capability
Organizations building intentional AI partnership should focus on:
- Specification practices: Invest in training engineers to write clear, complete specifications. This skill was always valuable; it’s now critical.
- Review culture: Shift review culture from gatekeeping to verification. The question is not “would I have written it this way?” but “does this correctly implement the specification?”
- Testing infrastructure: Fast, comprehensive test suites become even more valuable when implementation velocity increases. Invest accordingly.
- Documentation standards: Establish expectations for documentation quality. Make documentation a first-class deliverable, not an afterthought.
- Toolchain alignment: Choose languages, frameworks, and tools that provide fast feedback and enforce correctness. The compiler is a collaborator.
The Call to Action: What Becomes Possible
There is another dimension to this conversation that deserves attention.
We have focused on rigor, accountability, and the discipline required to avoid producing “slop.” This framing is necessary but insufficient. It treats AI partnership primarily as a risk to be managed rather than an opportunity to be seized.
Consider what has changed.
For decades, software engineers have carried mental backlogs of things we would build if we had the time. Ideas that were sound, architecturally feasible, genuinely useful—but the time-to-execute made them impractical. Side projects abandoned. Features deprioritized. Entire systems that existed only as sketches in notebooks because the implementation cost was prohibitive.
That calculus has shifted.
AI partnership, applied rigorously, compresses implementation timelines in ways that make previously infeasible work feasible. The system you would have built “someday” can be prototyped in a weekend. The refactoring you’ve been putting off for years can be specified, planned, and executed in weeks. The tooling you wished existed can be created rather than merely wished for.
This is not about moving faster for its own sake. It’s about what becomes creatively possible when the friction of implementation is reduced.
Tasker exists because of this shift. A workflow orchestration system supporting four languages, with comprehensive documentation, rigorous testing, and production-grade architecture—built as a labor of love alongside a demanding day job. Ten years ago, this project would have remained an idea. Five years ago, perhaps a half-finished prototype. Today, it’s real software approaching production readiness.
And Tasker is not unique. Across the industry, engineers are building things that would not have existed otherwise. Not “AI-generated slop,” but genuine contributions to the craft—systems built with care, designed with intention, maintained with accountability.
This is what’s at stake when we talk about intentional partnership.
When we approach AI collaboration carelessly, we produce code we don’t understand and can’t maintain. We waste the capability on work that creates more problems than it solves. We give ammunition to critics who argue that AI makes engineering worse.
When we approach AI collaboration with rigor, clarity, and commitment to excellence, we unlock creative possibilities that were previously out of reach. We build things that matter. We expand what a single engineer, or a small team, can accomplish.
We do not treat ourselves with respect when we squander this capability on careless work; our time, our creativity, and our professional aspirations deserve better. And we do not treat the partnership with respect when we use it without intention.
The opportunity before us is unprecedented. The discipline required to seize it is not new—it’s the discipline of good engineering, applied to a new context.
Let’s not waste it.
Conclusion: Craft Persists
The critique of “AI slop” is fundamentally a critique of craft—or its absence.
Craft is the accumulated wisdom of how to do something well. In software engineering, craft includes knowing when to abstract and when to be concrete, when to optimize and when to leave well enough alone, when to document and when the code is the documentation. Craft is what separates software that works from software that lasts.
AI does not possess craft. AI possesses capability—vast capability—but capability without wisdom is dangerous. This is true of humans as well; we just notice it less because human capability is more limited.
Intentional AI partnership is the practice of combining AI capability with human craft. The AI brings speed, breadth of knowledge, tireless pattern matching. The human brings judgment, accountability, and the accumulated wisdom of the profession.
Neither is sufficient alone. Together, working with discipline and intention, they can build software that is not just functional but maintainable, not just shipped but understood, not just code but craft.
The divide between “AI slop” and intentional partnership is not about the tools. It’s about us—whether we bring the same standards to AI collaboration that we would (or should) bring to any engineering work.
The tools are new. The standards are not. Let’s hold ourselves to them.
This document is part of the Tasker Core project principles. It reflects one approach to AI-assisted engineering; your mileage may vary. The principles here emerged from practice and continue to evolve with it.
Tasker Core Tenets
These 11 tenets guide all architectural and design decisions in Tasker Core. Each emerged from real implementation experience, root cause analyses, or architectural migrations.
The 11 Tenets
1. Defense in Depth
Multiple overlapping protection layers provide robust idempotency without single-point dependency.
Protection comes from four independent layers:
- Database-level atomicity: Unique constraints, row locking, compare-and-swap
- State machine guards: Current state validation before transitions
- Transaction boundaries: All-or-nothing semantics
- Application-level filtering: State-based deduplication
Each layer catches what others might miss. No single layer is responsible for all protection.
Origin: Processor UUID ownership was removed when analysis proved it provided redundant protection with harmful side effects (blocking recovery after crashes).
Lesson: Find the minimal set of protections that prevents corruption. Additional layers that prevent recovery are worse than none.
2. Event-Driven with Polling Fallback
Real-time responsiveness via PostgreSQL LISTEN/NOTIFY, with polling as a reliability backstop.
The system supports three deployment modes:
- EventDrivenOnly: Lowest latency, relies on pg_notify
- PollingOnly: Traditional polling, higher latency but simple
- Hybrid (recommended): Event-driven primary, polling fallback
Events can be missed (network issues, connection drops). Polling ensures eventual consistency.
Origin: Event-driven task claiming was added for low-latency response while preserving reliability guarantees.
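The hybrid mode can be sketched with nothing more than a bounded wait: block on the event channel, and treat a timeout as the polling tick. This is a minimal std-only Rust sketch; the real system uses PostgreSQL LISTEN/NOTIFY and Tokio, and the channel and function names here are illustrative.

```rust
use std::sync::mpsc;
use std::time::Duration;

// Illustrative std-only model: the event channel stands in for pg_notify,
// and the recv timeout stands in for the polling interval.
fn next_wakeup(events: &mpsc::Receiver<String>, poll_interval: Duration) -> String {
    match events.recv_timeout(poll_interval) {
        Ok(event) => format!("event:{event}"), // low-latency event-driven path
        Err(mpsc::RecvTimeoutError::Timeout) => "poll".to_string(), // reliability backstop
        Err(mpsc::RecvTimeoutError::Disconnected) => "shutdown".to_string(),
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("task_ready".to_string()).unwrap();
    assert_eq!(next_wakeup(&rx, Duration::from_millis(10)), "event:task_ready");
    // No event pending: the poll timer fires, guaranteeing eventual progress.
    assert_eq!(next_wakeup(&rx, Duration::from_millis(10)), "poll");
    println!("ok");
}
```

Either branch leads to the same claim logic, which is why missed notifications degrade latency rather than correctness.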
3. Composition Over Inheritance
Mixins and traits for handler capabilities, not class hierarchies.
Handler capabilities are composed via mixins:
Not: class Handler < API
But: class Handler < Base; include API; include Decision; include Batchable; end
This pattern enables:
- Selective capability inclusion
- Clear separation of concerns
- Easier testing of individual capabilities
- No diamond inheritance problems
Origin: Analysis of cross-language handler harmonization revealed Batchable handlers already used composition. This was identified as the target architecture for all handlers.
See also: Composition Over Inheritance
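In Rust, the same composition idea is expressed with traits rather than mixins. A minimal sketch, assuming illustrative trait and handler names (these are not Tasker’s actual APIs):

```rust
// Default trait methods play the role Ruby mixins play: each trait is one
// capability, and a handler opts into exactly the capabilities it needs.
trait ApiCapable {
    fn http_get(&self, path: &str) -> String { format!("GET {path}") }
}
trait DecisionCapable {
    fn decide(&self, input: i32) -> &'static str {
        if input > 0 { "proceed" } else { "halt" }
    }
}

struct OrderHandler;
impl ApiCapable for OrderHandler {}
impl DecisionCapable for OrderHandler {}

fn main() {
    let h = OrderHandler;
    assert_eq!(h.http_get("/orders/1"), "GET /orders/1");
    assert_eq!(h.decide(5), "proceed");
    println!("ok");
}
```

Each capability can be tested in isolation, mirroring the benefits listed above, and there is no diamond problem because traits carry no state.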
4. Cross-Language Consistency
Unified developer-facing APIs across Rust, Ruby, Python, and TypeScript.
Consistent touchpoints include:
- Handler signatures: `call(context)` pattern
- Result factories: `success(data)` / `failure(error, retry_on)`
- Registry APIs: `register_handler(name, handler)`
- Specialized patterns: API, Decision, Batchable
Each language expresses these idiomatically while maintaining conceptual consistency.
Origin: Cross-language API alignment established the “one obvious way” philosophy.
See also: Cross-Language Consistency
5. Actor-Based Decomposition
Lightweight actors for lifecycle management and clear boundaries.
Orchestration uses four core actors:
- TaskRequestActor: Task initialization
- ResultProcessorActor: Step result processing
- StepEnqueuerActor: Batch step enqueueing
- TaskFinalizerActor: Task completion
Worker uses five specialized actors:
- StepExecutorActor: Step execution coordination
- FFICompletionActor: FFI completion handling
- TemplateCacheActor: Template cache management
- DomainEventActor: Event dispatching
- WorkerStatusActor: Status and health
Each actor handles specific message types, enabling testability and clear ownership.
Origin: Actor pattern refactoring reduced monolithic processors from 1,575 LOC to ~150 LOC focused files.
6. State Machine Rigor
Dual state machines (Task + Step) for atomic transitions with full audit trails.
Task states (12 total; primary path): Pending → Initializing → EnqueuingSteps → StepsInProcess → EvaluatingResults → Complete/Error
Step states (8 total; primary path): Pending → Enqueued → InProgress → Complete/Error
All transitions are:
- Atomic (compare-and-swap at database level)
- Audited (full history in transitions table)
- Validated (state guards prevent invalid transitions)
Origin: Enhanced state machines with richer task states were introduced for better workflow visibility.
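A state guard of this kind can be sketched as a pure function over the state enum; in production the same check is enforced with an atomic compare-and-swap at the database level. The function below is illustrative, not Tasker’s implementation, and uses the step states listed above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum StepState { Pending, Enqueued, InProgress, Complete, Error }

// Guard: only legal edges of the step state machine succeed; everything
// else is rejected before any write happens.
fn transition(current: StepState, next: StepState) -> Result<StepState, String> {
    use StepState::*;
    match (current, next) {
        (Pending, Enqueued) | (Enqueued, InProgress)
        | (InProgress, Complete) | (InProgress, Error) => Ok(next),
        _ => Err(format!("invalid transition {current:?} -> {next:?}")),
    }
}

fn main() {
    assert_eq!(transition(StepState::Pending, StepState::Enqueued), Ok(StepState::Enqueued));
    // Skipping states is rejected, which is what produces a clean audit trail.
    assert!(transition(StepState::Pending, StepState::Complete).is_err());
    println!("ok");
}
```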
7. Audit Before Enforce
Track for observability, don’t block for “ownership.”
Processor UUID is tracked in every transition for:
- Debugging (which instance processed which step)
- Audit trails (full history of processing)
- Metrics (load distribution analysis)
But not enforced for:
- Ownership claims (blocks recovery)
- Permission checks (redundant with state guards)
Origin: Ownership enforcement removal proved that audit trails provide value without enforcement costs.
Key insight: When two actors receive identical messages, first succeeds atomically, second fails cleanly - no partial state, no corruption.
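The two-actors scenario can be modeled with a compare-and-swap: the first claim succeeds atomically, the second observes the changed state and fails cleanly. A std-only sketch; in Tasker the equivalent CAS happens in a PostgreSQL atomic function, not in process memory.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const PENDING: u8 = 0;
const IN_PROGRESS: u8 = 1;

// Exactly one caller can move the state from PENDING to IN_PROGRESS.
fn try_claim(state: &AtomicU8) -> bool {
    state
        .compare_exchange(PENDING, IN_PROGRESS, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
}

fn main() {
    let state = AtomicU8::new(PENDING);
    assert!(try_claim(&state));  // first actor claims the step atomically
    assert!(!try_claim(&state)); // second fails cleanly: no partial state
    println!("ok");
}
```

Because the loser simply gets a clean failure, no separate ownership enforcement is needed on top of the CAS.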
8. Early Release Iteration
Refine architecture intentionally while adoption is limited.
In the 0.1.x series:
- Breaking changes may still occur when architecture is fundamentally unsound
- We follow semantic versioning and provide migration guidance where practical
- Architectural correctness remains a priority, balanced with user impact
- We communicate breaking changes clearly in release notes
This approach enables:
- Continued refinement of core patterns
- Learning from real-world feedback
- Building trust through predictable releases
Origin: All major refactoring efforts made breaking changes that improved architecture fundamentally. As the project matures, we balance this with user expectations.
9. PostgreSQL as Foundation
Database-level guarantees with flexible messaging (PGMQ default, RabbitMQ optional).
PostgreSQL provides:
- State storage: Task and step state with transactional guarantees
- Advisory locks: Distributed coordination primitives
- Atomic functions: State transitions in single round-trip
- Row-level locking: Prevents concurrent modification
Messaging is provider-agnostic:
- PGMQ (default): Message queue built on PostgreSQL—single-dependency deployment
- RabbitMQ (optional): For high-throughput or existing broker infrastructure
The database is not just storage—it’s the coordination layer. Message delivery is pluggable.
Origin: Core architecture decision - PostgreSQL’s transactional guarantees eliminate entire classes of distributed systems problems. The messaging abstraction was added for deployment flexibility.
10. Bounded Resources
All channels bounded, backpressure everywhere.
Every MPSC channel is:
- Bounded: Fixed capacity, no unbounded memory growth
- Configurable: Sizes set via TOML configuration
- Monitored: Backpressure metrics exposed
Semaphores limit concurrent handler execution. Circuit breakers protect downstream services.
Origin: Bounded MPSC channels were mandated after analysis of unbounded channel risks.
Rule: Never use unbounded_channel(). Always configure bounds via TOML.
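The standard library’s `sync_channel` shows the same principle in miniature. Tasker’s actual channels are Tokio bounded MPSC channels configured via TOML; this sketch only illustrates backpressure surfacing as an explicit error rather than unbounded memory growth.

```rust
use std::sync::mpsc;

fn main() {
    // Bounded channel with capacity 2: memory use cannot grow without limit.
    let (tx, rx) = mpsc::sync_channel::<u32>(2);
    tx.try_send(1).unwrap();
    tx.try_send(2).unwrap();
    // Channel full: backpressure surfaces as an error the producer must handle
    // (slow down, shed load, or report) instead of silently buffering forever.
    assert!(tx.try_send(3).is_err());
    // Draining one message frees capacity for the producer again.
    assert_eq!(rx.recv().unwrap(), 1);
    assert!(tx.try_send(3).is_ok());
    println!("ok");
}
```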
11. Fail Loudly
A system that lies is worse than one that fails. Errors are first-class citizens, not inconveniences to hide.
When data is missing, malformed, or unexpected:
- Return errors, not fabricated defaults
- Propagate failures up the call stack
- Make problems visible immediately, not downstream
- Trust nothing that hasn’t been validated
Silent defaults create phantom data—values that look valid but represent nothing real. A monitoring system that receives 0% utilization cannot distinguish “system is idle” from “data was missing.”
What this means in practice:
| Scenario | Wrong Approach | Right Approach |
|---|---|---|
| gRPC response missing field | Return default value | Return InvalidResponse error |
| Config section absent | Use empty/zero defaults | Fail with clear message |
| Health check data missing | Fabricate “unknown” status | Error: “health data unavailable” |
| Optional vs Required | Treat all as optional | Distinguish explicitly in types |
The trust equation:
A client that returns fabricated data
= A client that lies to you
= Worse than a client that fails loudly
= Debugging phantom bugs in production
Origin: gRPC client refactoring revealed pervasive unwrap_or_default() patterns that silently fabricated response data. Analysis showed consumers could receive “valid-looking” responses containing entirely phantom data, breaking the trust contract between client and caller.
Key insight: When a gRPC server omits required fields, that’s a protocol violation—not an opportunity to be “helpful” with defaults. The server is broken; pretending otherwise delays the fix and misleads operators.
Rule: Never use unwrap_or_default() or unwrap_or_else(|| fabricated_value) for required fields. Use ok_or_else(|| ClientError::invalid_response(...)) instead.
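The difference between the two approaches is small in code and large in consequence. A sketch with a hypothetical `Response` type, mirroring the `unwrap_or_default()` versus `ok_or_else(...)` rule above:

```rust
// Hypothetical response type: `utilization` is required by contract but
// arrives as an Option because the wire format can omit it.
struct Response { utilization: Option<f64> }

// Wrong: the silent default fabricates phantom data. 0.0 looks like "idle"
// and is indistinguishable from a genuinely idle system.
fn utilization_wrong(r: &Response) -> f64 {
    r.utilization.unwrap_or_default()
}

// Right: a missing required field is an error the caller must see.
fn utilization_right(r: &Response) -> Result<f64, String> {
    r.utilization.ok_or_else(|| "invalid response: utilization missing".to_string())
}

fn main() {
    let missing = Response { utilization: None };
    assert_eq!(utilization_wrong(&missing), 0.0); // phantom data
    assert!(utilization_right(&missing).is_err()); // fails loudly instead
    println!("ok");
}
```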
Meta-Principles
These overarching themes emerge from the tenets:
- Simplicity Over Elegance: The minimal protection set that prevents corruption beats layered defense that prevents recovery
- Observation-Driven Design: Let real behavior (parallel execution, edge cases) guide architecture
- Explicit Over Implicit: Make boundaries, layers, and decisions visible in documentation and code
- Consistency Without Uniformity: Align APIs while preserving language idioms
- Separation of Concerns: Orchestration handles state and coordination; workers handle execution and domain events
- Errors Over Defaults: When in doubt, fail with a clear error rather than proceeding with fabricated data
Applying These Tenets
When making design decisions:
- Check against tenets: Does this violate any of the 11 tenets?
- Find the precedent: Has a similar decision been made before? (See ticket-specs)
- Document the trade-off: What are you gaining and giving up?
- Consider recovery: If this fails, how does the system recover?
When reviewing code:
- Bounded resources: Are all channels bounded? All concurrency limited?
- State machine compliance: Do transitions use atomic database operations?
- Language consistency: Does the API align with other language workers?
- Composition pattern: Are capabilities mixed in rather than inherited?
- Fail loudly: Are missing/invalid data handled with errors, not silent defaults?
Twelve-Factor App Alignment
The Twelve-Factor App methodology, authored by Adam Wiggins and contributors at Heroku, has been a foundational influence on Tasker Core’s systems design. These principles were not adopted as a checklist but absorbed over years of building production systems. Some factors are deeply embedded in the architecture; others remain aspirational or partially realized.
This document maps each factor to where it shows up in the codebase, where we fall short, and what contributors should keep in mind. It is meant as practical guidance, not a compliance scorecard.
I. Codebase
One codebase tracked in revision control, many deploys.
Tasker Core is a single Git monorepo containing all deployable services: orchestration server, workers (Rust, Ruby, Python, TypeScript), CLI, and shared libraries.
Where this lives:
- Root `Cargo.toml` defines the workspace with all crate members
- Environment-specific Docker Compose files produce different deploys from the same source: `docker/docker-compose.prod.yml`, `docker/docker-compose.dev.yml`, `docker/docker-compose.test.yml`, `docker/docker-compose.ci.yml`
- Feature flags (`web-api`, `grpc-api`, `test-services`, `test-cluster`) control build variations without code branches
Gaps: The monorepo means all crates share a single version today (v0.1.0). As the project matures toward independent crate publishing, version coordination and release management will need more tooling.
II. Dependencies
Explicitly declare and isolate dependencies.
Rust’s Cargo ecosystem makes this natural. All dependencies are declared in Cargo.toml with workspace-level management and pinned in Cargo.lock.
Where this lives:
- Root `Cargo.toml` `[workspace.dependencies]` section — single source of truth for shared dependency versions
- `Cargo.lock` committed to the repository for reproducible builds
- Multi-stage Docker builds (`docker/build/orchestration.prod.Dockerfile`) use `cargo-chef` for cached, reproducible dependency resolution
- No runtime dependency fetching — everything resolved at build time
Gaps: FFI workers each bring their own dependency ecosystem (Python’s uv/pyproject.toml, Ruby’s Bundler/Gemfile, TypeScript’s bun/package.json). These are well-declared but not unified — contributors working across languages need to manage multiple lock files.
III. Config
Store config in the environment.
This is one of the strongest alignments. All runtime configuration flows through environment variables, with TOML files providing structured defaults that reference those variables.
Where this lives:
- `config/dotenv/` — environment-specific `.env` files (`base.env`, `test.env`, `orchestration.env`)
- `config/tasker/base/*.toml` — role-based defaults with `${ENV_VAR:-default}` interpolation
- `config/tasker/environments/{test,development,production}/` — environment overrides
- `docker/.env.prod.template` — production variable template
- `tasker-shared/src/config/` — config loading with environment variable resolution
- No secrets in source: `DATABASE_URL`, `POSTGRES_PASSWORD`, JWT keys all via environment
For contributors: Never hard-code connection strings, credentials, or deployment-specific values. Use environment variables with sensible defaults in the TOML layer. The configuration structure is role-based (orchestration/worker/common), not component-based — see CLAUDE.md for details.
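The `${ENV_VAR:-default}` interpolation can be illustrated with a deliberately simplified resolver that handles only a value consisting of a single placeholder. The real loader in `tasker-shared/src/config/` is more general; the variable name below is chosen to be unset.

```rust
use std::env;

// Simplified sketch: resolve a value that is exactly one `${NAME:-default}`
// placeholder, falling back to the default when the variable is unset.
fn resolve(template: &str) -> String {
    match template.strip_prefix("${").and_then(|s| s.strip_suffix('}')) {
        Some(inner) => {
            let (name, default) = inner.split_once(":-").unwrap_or((inner, ""));
            env::var(name).unwrap_or_else(|_| default.to_string())
        }
        None => template.to_string(), // plain values pass through untouched
    }
}

fn main() {
    // Assumes this variable is unset, so the default branch is taken.
    assert_eq!(resolve("${TASKER_DOC_EXAMPLE_UNSET:-0.0.0.0:8080}"), "0.0.0.0:8080");
    assert_eq!(resolve("literal"), "literal");
    println!("ok");
}
```

This is why deployment-specific values never need to appear in the TOML layer: the structure ships with the code, and the environment supplies the values.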
IV. Backing Services
Treat backing services as attached resources.
Backing services are abstracted behind trait interfaces and swappable via configuration alone.
Where this lives:
- Database: PostgreSQL connection via `DATABASE_URL`, pool settings in `config/tasker/base/common.toml` under `[common.database.pool]`
- Messaging: PGMQ or RabbitMQ selected via `TASKER_MESSAGING_BACKEND` environment variable — same code paths, different drivers
- Cache: Redis, Moka (in-process), or disabled entirely via `[common.cache]` configuration
- Observability: OpenTelemetry with pluggable backends (Honeycomb, Jaeger, Grafana Tempo) via `OTEL_EXPORTER_OTLP_ENDPOINT`
- Circuit breakers protect against backing service failures: `[common.circuit_breakers.component_configs]`
For contributors: When adding a new backing service dependency, ensure it can be configured via environment variables and that the system degrades gracefully when it’s unavailable. Follow the messaging abstraction pattern — trait-based interfaces, not concrete types.
V. Build, Release, Run
Strictly separate build and run stages.
The Docker build pipeline enforces this cleanly with multi-stage builds.
Where this lives:
- Build: `docker/build/orchestration.prod.Dockerfile` — `cargo-chef` dependency caching, `cargo build --release --all-features --locked`, binary stripping
- Release: Tagged Docker images with only runtime dependencies (no build tools), non-root user (`tasker:999`), read-only config mounts
- Run: `docker/scripts/orchestration-entrypoint.sh` — environment validation, database availability check, migrations, then `exec` into the application binary
- Deployment modes control startup behavior: `standard`, `migrate-only`, `no-migrate`, `safe`, `emergency`
Gaps: Local development doesn’t enforce the same separation — developers run cargo run directly, which conflates build and run. This is fine for development ergonomics but worth noting as a difference from the production path.
VI. Processes
Execute the app as one or more stateless processes.
All persistent state lives in PostgreSQL. Processes can be killed and restarted at any time without data loss.
Where this lives:
- Orchestration server: stateless HTTP/gRPC service backed by `tasker.tasks` and `tasker.steps` tables
- Workers: claim steps from message queues, execute handlers, write results back — no in-memory state across requests
- Message queue visibility timeouts (`visibility_timeout_seconds` in worker config) ensure unacknowledged messages are reclaimed by other workers
- Docker Compose `replicas` setting scales workers horizontally
For contributors: Never store workflow state in memory across requests. If you need coordination state, it belongs in PostgreSQL. In-memory caches (Moka) are optimization layers, not sources of truth — the system must function correctly without them.
VII. Port Binding
Export services via port binding.
Each service is self-contained and binds its own ports.
Where this lives:
- REST: `config/tasker/base/orchestration.toml` — `[orchestration.web] bind_address = "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"`
- gRPC: `[orchestration.grpc] bind_address = "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"`
- Worker REST/gRPC on separate ports (8081/9191)
- Health endpoints on both transports for load balancer integration
- Docker exposes ports via environment-configurable mappings
VIII. Concurrency
Scale out via the process model.
The system scales horizontally by adding worker processes and vertically by tuning concurrency settings.
Where this lives:
- Horizontal: `docker/docker-compose.prod.yml` — `replicas: ${WORKER_REPLICAS:-2}`, each worker is independent
- Vertical: `config/tasker/base/orchestration.toml` — `max_concurrent_operations`, `batch_size` per event system
- Worker handler parallelism: `[worker.mpsc_channels.handler_dispatch] max_concurrent_handlers = 10`
- Load shedding: `[worker.mpsc_channels.handler_dispatch.load_shedding] capacity_threshold_percent = 80.0`
Gaps: The actor pattern within a single process is more vertical than horizontal — actors share a Tokio runtime and scale via async concurrency, not OS processes. This is a pragmatic choice for Rust’s async model but means single-process scaling has limits that multiple processes solve.
IX. Disposability
Maximize robustness with fast startup and graceful shutdown.
This factor gets significant attention due to the distributed nature of task orchestration.
Where this lives:
- Graceful shutdown: Signal handlers (SIGTERM, SIGINT) in `tasker-orchestration/src/bin/server.rs` and `tasker-worker/src/bin/` — actors drain in-flight work, OpenTelemetry flushes spans, connections close cleanly
- Fast startup: Compiled binary, pooled database connections, environment-driven config (no service discovery delays)
- Crash recovery: PGMQ visibility timeouts requeue unacknowledged messages; steps claimed by a crashed worker reappear for others after `visibility_timeout_seconds`
- Entrypoint: `docker/scripts/orchestration-entrypoint.sh` uses `exec` to replace shell with app process (proper PID 1 signal handling)
- Health checks: Docker `start_period` allows grace time before liveness probes begin
For contributors: When adding new async subsystems, ensure they participate in the shutdown sequence. Bounded channels and drain timeouts (shutdown_drain_timeout_ms) prevent shutdown from hanging indefinitely.
X. Dev/Prod Parity
Keep development, staging, and production as similar as possible.
The same code, same migrations, and same config structure run everywhere — only values change.
Where this lives:
- `config/tasker/base/` provides defaults; `config/tasker/environments/` overrides per-environment — structure is identical
- `migrations/` directory contains SQL migrations shared across all environments
- Docker images use the same base (`debian:bullseye-slim`) and runtime user (`tasker:999`)
- Structured logging format (tracing crate) is consistent; only verbosity changes (`RUST_LOG`)
- E2E tests (`--features test-services`) exercise the same code paths as production
Gaps: Development uses cargo run with debug builds while production uses release-optimized Docker images. The observability stack (Grafana LGTM) is available in docker-compose.dev.yml but most local development happens without it. These are standard trade-offs, but contributors should periodically test against the full Docker stack to catch environment-specific issues.
XI. Logs
Treat logs as event streams.
All logging goes to stdout/stderr. No file-based logging is built into the application.
Where this lives:
- `tasker-shared/src/logging.rs` — tracing subscriber writes to stdout, JSON format in production, ANSI colors in development (TTY-detected)
- OpenTelemetry integration exports structured traces via `OTEL_EXPORTER_OTLP_ENDPOINT`
- Correlation IDs (`correlation_id`) propagate through tasks, steps, actors, and message queues for distributed tracing
- `docker-compose.dev.yml` includes Loki for log aggregation and Grafana for visualization
- Entrypoint scripts log to stdout/stderr with role-prefixed format
For contributors: Use the tracing crate’s `#[instrument]` macro and structured fields (`tracing::info!(task_id = %id, "processing")`) rather than string interpolation. Never write to log files directly.
XII. Admin Processes
Run admin/management tasks as one-off processes.
The CLI and deployment scripts serve this role.
Where this lives:
- `tasker-ctl/` — task management (`create`, `list`, `cancel`), DLQ investigation (`dlq list`, `dlq recover`), system health, auth token management
- `docker/scripts/orchestration-entrypoint.sh` — `DEPLOYMENT_MODE=migrate-only` runs migrations and exits without starting the server
- `config-validator` binary validates TOML configuration as a one-off check
- Database migrations run as a distinct phase before application startup, with retry logic and timeout protection
Gaps: Some administrative operations (cache invalidation, circuit breaker reset) are only available through the REST/gRPC API, not the CLI. As the CLI matures, these should become first-class admin commands.
Using This as a Contributor
These factors are not rules to enforce mechanically. They’re a lens for evaluating design decisions:
- Adding a new service dependency? Factor IV says treat it as an attached resource — configure via environment, degrade gracefully without it.
- Storing state? Factor VI says processes are stateless — put it in PostgreSQL, not in memory.
- Adding configuration? Factor III says environment variables — use the existing TOML-with-env-var-interpolation pattern.
- Writing logs? Factor XI says event streams — stdout, structured fields, correlation IDs.
- Building deployment artifacts? Factor V says separate build/release/run — don’t bake configuration into images.
When a factor conflicts with practical needs, document the trade-off. The goal is not purity but awareness.
Attribution
The Twelve-Factor App methodology was created by Adam Wiggins with contributions from many others, originally published at 12factor.net. It is made available under the MIT License and has influenced how a generation of developers think about building software-as-a-service applications. Its influence on this project is gratefully acknowledged.
PEP: 20
Title: The Zen of Python
Author: Tim Peters <tim.peters@gmail.com>
Status: Active
Type: Informational
Created: 19-Aug-2004
Post-History: 22-Aug-2004
Abstract
Long time Pythoneer Tim Peters succinctly channels the BDFL’s guiding principles for Python’s design into 20 aphorisms, only 19 of which have been written down.
The Zen of Python
.. code-block:: text
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Easter Egg
.. code-block:: pycon
import this
References
Originally posted to comp.lang.python/python-list@python.org under a
thread called "The Way of Python" <https://groups.google.com/d/msg/comp.lang.python/B_VxeTBClM0/L8W9KlsiriUJ>__
Copyright
This document has been placed in the public domain.
Generative Workflows, Deterministic Execution
Status: Vision & Planning Created: 2026-02-14 Revised: 2026-02-23 Author: Pete (with Claude collaboration)
Overview
This document set describes a vision for extending Tasker Core’s workflow orchestration capabilities toward generative workflows — workflows whose components can be progressively generated from increasingly abstract descriptions, while preserving the deterministic execution guarantees that make Tasker trustworthy.
The core insight: Tasker’s execution properties (idempotency, transactional step creation, state machine consistency, DAG-ordered dependencies) create a foundation strong enough to support non-deterministic planning. An LLM or developer tool reasons about what should happen; the system guarantees how it executes. These same properties make Tasker valuable infrastructure for autonomous agents — not as an agent framework, but as deterministic execution infrastructure that agents use as clients.
The path from here to there is incremental. Each phase delivers independent value — better developer tooling, composable handler primitives, LLM-assisted authoring, agent-accessible infrastructure — while building toward a system where any client (human, LLM, or agent) can compose validated workflow segments from Rust-enforced action grammar primitives.
Documents
| Document | Purpose |
|---|---|
| 01 - Vision | The philosophical and architectural “why” — how deterministic execution enables generative workflows, the action grammar concept, agent integration, and the two-tier trust model |
| 02 - Technical Approach | Problem statement, solution space analysis (including agent-accessible infrastructure), recommendation, and phase overview |
| 03 - Phase 0: Foundation | Templates as generative contracts — TAS-280, MCP server, shared tooling crate, and the groundwork for everything that follows |
| 04 - Phase 1: Action Grammars | Rust-native composable action primitives, common patterns, and the unified composition model |
| 05 - Agent Orchestration | How agents use Tasker as deterministic infrastructure — context decomposition, research workflows, and the shared tooling foundation |
| 06 - Phase 2: Planning Interface | LLM-backed planning steps and workflow fragment generation from grammar compositions |
| 07 - Phase 3: Recursive Planning | Multi-phase adaptive workflows — step-level recursion and task-level delegation |
| 08 - Complexity Management | Observability, operator experience, LLM context management, agent lineage, and avoiding complication |
Reading Order
For understanding the vision: start with 01, then 08 for the complexity framing.
For the agent story: 01 introduces the concept, 05 gives the full treatment.
For the pragmatic path: 03 (Phase 0) describes the immediately actionable work that establishes the foundation.
For technical depth: 02 gives the full analysis, then dive into whichever phase is most relevant.
For implementation: 03 (Phase 0) is the starting point — it delivers independent value and informs every subsequent phase.
Key Principles
- The step remains the atom. Dynamic planning introduces new topology, not new mechanics.
- Action grammars are the vocabulary. Composable, Rust-enforced primitives that any client can compose without breaking.
- One composition model. All handler composition uses the same action(resource) grammar. Common patterns are named, well-tested composition specifications. Dynamic compositions use the same grammar and validation pipeline.
- The single-mutation boundary is the safety invariant. A valid composition has at most one external mutation (Persist, Emit), and it appears after all fallible preparatory work. Primitives are compile-time verified Rust; compositions are validated at assembly time against structural invariants.
- Two trust tiers. Developer-authored handlers are the developer’s responsibility. System-invoked action grammars are the system’s responsibility, with the strongest guarantees the language can provide.
- Agents are clients, not components. Tasker provides deterministic infrastructure; agents provide reasoning. The API is the same regardless of who submits the task.
- Validation is the trust boundary. Every workflow fragment is validated before execution.
- The template is the safety contract. Planning fills in the middle; the frame is fixed.
- Each phase is independently valuable. TAS-280 improves developer experience. Grammar-composed handlers improve workflow authoring. Agent integration is a cross-cutting capability that enriches each phase.
Relationship to Existing Work
This initiative builds directly on:
- TAS-280: Typed handler generation from task templates — the first step toward templates as generative contracts
- TAS-294: Functional DSL for handler registration — the developer-facing composition pattern
- TAS-53: Dynamic workflow decision points (conditional workflows) — the execution foundation
- TAS-112: Cross-language ergonomics analysis (handler patterns)
- Intentional AI Partnership: The philosophical foundation for human-AI collaboration in the system
These documents describe a vision. Implementation will be tracked through Linear tickets as each phase progresses from research to prototyping to delivery.
Generative Workflows, Deterministic Execution
A Vision for Composable, LLM-Integrated Workflow Orchestration in Tasker
The Generative Foundation
Tasker Core is a workflow orchestration system built on a set of properties that, individually, are well-understood in distributed systems engineering: statelessness, horizontal scalability, event-driven processing, deterministic execution, idempotency. What makes Tasker interesting is not any single property but the composition of all of them — and what that composition makes possible.
The workflow step is the atomic unit of the system. Each step has a lifecycle modeled as a state machine, with execution paths, retry and backoff semantics, and transactional state transitions. Because the step is idempotent, it can be retried safely. Because its state machine is consistent, its behavior is predictable. Because it executes within a DAG of declared dependencies, the broader workflow has known ordering guarantees. Because all of this is backed by PostgreSQL transactions, the system never enters a half-built state.
These properties were designed to make Tasker reliable. But reliability, it turns out, is also what makes a system composable in ways that weren’t necessarily planned for. When you can trust that a step will do exactly what it says — succeed, fail, or retry — you can assemble steps into novel configurations without losing the safety properties of the whole. Reliable parts enable unreliable planners.
This is the generative insight: the determinism of the execution layer creates space for non-deterministic planning. And the stronger the execution guarantees, the more freedom the planning layer has.
What Already Exists
Tasker’s conditional workflow architecture already proves the core mechanism. Three capabilities, working together, establish the precedent for everything this vision describes:
Decision Point Steps evaluate business logic at runtime and return the names of downstream steps that should be created. The full DAG does not need to exist at task initialization — steps downstream of a decision point are held unrealized until the decision is made.
Dynamic Step Creation materializes new steps transactionally. When a decision point fires, the orchestration layer creates the specified workflow steps, their edges, and their queue entries in a single database transaction. The graph grows atomically — either the entire downstream segment is created, or none of it is.
Deferred Convergence with intersection semantics allows the system to reconverge after dynamic branching without knowing in advance which branches were taken. A convergence step declares all possible upstream dependencies; at runtime, it waits only for the intersection of declared dependencies and actually-created steps. This means convergence works correctly regardless of which path the decision point chose.
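The intersection rule above is simple to state precisely. A minimal sketch, with illustrative names rather than Tasker's real API, of how a convergence step's effective dependency set can be derived from its declared dependencies and the steps a decision point actually created:

```rust
use std::collections::HashSet;

/// Deferred convergence with intersection semantics (sketch).
/// A convergence step declares every *possible* upstream dependency;
/// at runtime it waits only on the intersection of that declaration
/// with the steps that were actually materialized.
fn effective_dependencies<'a>(
    declared: &HashSet<&'a str>,
    created: &HashSet<&'a str>,
) -> HashSet<&'a str> {
    declared.intersection(created).copied().collect()
}

fn main() {
    // The template declares both branches of a decision point...
    let declared: HashSet<&str> = ["approve_order", "reject_order"].into_iter().collect();
    // ...but at runtime only the "approve" branch was created.
    let created: HashSet<&str> = ["approve_order", "notify_warehouse"].into_iter().collect();

    // The convergence step waits only for the branch that exists.
    let waits_on = effective_dependencies(&declared, &created);
    let expected: HashSet<&str> = ["approve_order"].into_iter().collect();
    assert_eq!(waits_on, expected);
    println!("convergence waits on: {:?}", waits_on);
}
```

Because the result is an intersection, the same convergence step behaves correctly whichever branch the decision point chose, including when extra dynamically created steps exist that it never declared.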
Batch processing extends this further — a single batchable step can spawn N worker instances at runtime, all created transactionally, all converging through the same intersection semantics. The graph is not just branching dynamically; it is scaling dynamically.
These are not theoretical capabilities. They are implemented, tested, and approaching production readiness.
Action Grammars: A Type System for Workflow Actions
The vision for dynamic workflows begins with a vocabulary of composable, type-safe action primitives — not a static catalog of pre-built handlers, but a grammar from which handlers can be composed.
Consider how handlers actually work. An HTTP-calling handler does several things: it constructs a request, manages authentication, makes the call, handles errors, extracts relevant data from the response, and shapes its output for downstream consumption. A validation handler reads input, applies rules, partitions results into valid and invalid sets, and shapes its output. These handlers share lower-level actions — acquiring data, transforming shapes, asserting conditions, emitting results — even though their higher-level purposes differ.
An action grammar is a formalization of these lower-level actions as composable primitives with declared input and output contracts. The term “grammar” is borrowed from compositional pattern work in the Storyteller project, but the application here is grounded in distributed systems concerns rather than narrative ones.
The fundamental unit is action(resource) — a verb applied to a noun. Acquire(HttpEndpoint). Transform(JsonPayload). Validate(Schema). Persist(DatabaseRow). Every action implies directionality — a “from” and a “to” — and every action operates on a typed resource. When actions compose, the chain is always action(resource) → action(resource), where the output resource of one action becomes the input resource of the next.
An action grammar primitive:
- Has a single concern. Acquire external data. Transform a data shape. Assert an invariant. Gate on a condition. Emit a notification. Persist a mutation.
- Declares its input and output contracts. As Rust structs and traits enforced at compile time. Primitive A’s output type must be compatible with primitive B’s input type for the composition to compile.
- Preserves execution properties. Each primitive is idempotent, retryable, and side-effect bounded. A composition of primitives inherits these properties because the composition rules enforce them.
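How "declares its input and output contracts" might look in practice can be sketched with Rust associated types: a composition only compiles when adjacent contracts chain. All type and primitive names below are illustrative, not Tasker's actual grammar:

```rust
// Illustrative resource types: the "nouns" that actions operate on.
struct HttpPayload(String);
struct ParsedRecord { id: u64 }
struct ValidatedRecord { id: u64 }

/// An action(resource) primitive with declared input/output contracts.
trait Action {
    type Input;
    type Output;
    fn run(&self, input: Self::Input) -> Result<Self::Output, String>;
}

struct Transform; // e.g. Transform(JsonPayload)
impl Action for Transform {
    type Input = HttpPayload;
    type Output = ParsedRecord;
    fn run(&self, input: HttpPayload) -> Result<ParsedRecord, String> {
        input.0.trim().parse::<u64>()
            .map(|id| ParsedRecord { id })
            .map_err(|e| e.to_string())
    }
}

struct Validate; // e.g. Validate(Schema)
impl Action for Validate {
    type Input = ParsedRecord;
    type Output = ValidatedRecord;
    fn run(&self, input: ParsedRecord) -> Result<ValidatedRecord, String> {
        if input.id > 0 { Ok(ValidatedRecord { id: input.id }) } else { Err("zero id".into()) }
    }
}

/// Chain two actions. This compiles only when A's output type
/// equals B's input type -- the contract-compatibility rule.
fn chain<A: Action, B: Action<Input = A::Output>>(
    a: &A, b: &B, input: A::Input,
) -> Result<B::Output, String> {
    b.run(a.run(input)?)
}

fn main() {
    let out = chain(&Transform, &Validate, HttpPayload("42".into())).unwrap();
    assert_eq!(out.id, 42);
}
```

Swapping the order of `Transform` and `Validate` in this sketch would be a type error, which is the compile-time flavor of the guarantee; dynamically assembled compositions would get the equivalent check at assembly time.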
The single-mutation boundary. The grammar distinguishes between non-mutating actions (reads, transforms, validations, calculations) and mutating actions (creates, updates, deletes against external systems — a database, an API, a message queue). A valid composition may contain an arbitrary number of non-mutating actions but has at most one external mutation, and that mutation is the composition’s commitment point. Everything before the mutation is preparatory, idempotent, and safely retryable. The mutation itself is the boundary where the composition has its external effect. Nothing after the mutation should be fallible in a way that would trigger re-execution of the mutation.
This structural rule is what makes compositions safe — not whether they were assembled in advance or at runtime. If the grammar enforces the single-mutation boundary and the contracts chain correctly, the composition is safe. The primitives are Rust-native and compile-time verified. The composition rules are checkable at assembly time. The safety comes from the shape of valid compositions, not from when they were assembled.
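The single-mutation boundary is exactly the kind of structural invariant that can be checked mechanically at assembly time. A minimal sketch, taking the stricter reading that the one mutation must be the final action (the document only requires that nothing fallible follow it); the action kinds are illustrative:

```rust
/// Assembly-time check of the single-mutation boundary (sketch).
#[derive(Debug, PartialEq)]
enum ActionKind { Acquire, Transform, Validate, Persist, Emit }

/// Mutating actions have external effects; everything else is preparatory.
fn is_mutating(kind: &ActionKind) -> bool {
    matches!(kind, ActionKind::Persist | ActionKind::Emit)
}

/// A composition is valid when it has at most one mutation,
/// and that mutation follows all preparatory work.
fn validate_composition(actions: &[ActionKind]) -> Result<(), String> {
    let mutations: Vec<usize> = actions.iter().enumerate()
        .filter(|(_, a)| is_mutating(a))
        .map(|(i, _)| i)
        .collect();
    match mutations.as_slice() {
        [] => Ok(()), // purely preparatory compositions are fine
        [i] if *i == actions.len() - 1 => Ok(()), // one mutation, placed last
        [_] => Err("mutation must follow all fallible preparatory work".into()),
        _ => Err("at most one external mutation per composition".into()),
    }
}

fn main() {
    use ActionKind::*;
    assert!(validate_composition(&[Acquire, Transform, Validate, Persist]).is_ok());
    assert!(validate_composition(&[Acquire, Persist, Transform]).is_err());
    assert!(validate_composition(&[Persist, Emit]).is_err());
}
```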
The “grammar” metaphor is deliberate and important. A grammar defines what sentences are expressible in a language. A dictionary is a useful collection of known sentences, but it is not the language — the grammar is. In the same way, a library of common handler patterns is a useful collection of known compositions, but it is not the full vocabulary of what can be expressed. The grammar is.
A handler, then, is a composition of action(resource) primitives — assembled to fit a specific problem’s requirements, validated against the grammar’s composition rules (contracts chain, single-mutation boundary, configuration well-formed), and executed with full lifecycle guarantees. A composition like Acquire(HttpSource) → Transform(SchemaExtract) → Validate(ContractCheck) → Persist(DatabaseRow) is validated once at assembly time and then executes as a single step with a single lifecycle.
Common patterns — http_request, transform_and_validate, fan_out_aggregate — are named, documented, well-tested composition specifications that any client can reference by name. They are a convenience library: recipes that make common operations easy to use. But at runtime, they resolve to the same grammar composition as any other handler. There is one composition model, one validation path, one execution path. The grammar’s vocabulary is open, not enumerated.
Two Trust Tiers
This architecture creates a natural separation between two levels of trust:
Developer-authored handlers are written in whatever language the developer’s application uses — Python, Ruby, TypeScript, Rust — registered through the DSL or class-based patterns, and executed through the polyglot FFI worker infrastructure. These handlers contain business logic that the developer owns. The system routes to them, retries them, manages their lifecycle, but the correctness of what’s inside is the developer’s responsibility. This is the model Tasker has today, and it remains the primary developer experience.
System-invoked action grammars are the primitives and compositions that the system executes on behalf of planners and agents. These are implemented in Rust because they are the system’s responsibility. When a planner says “acquire data from this endpoint, validate against this schema, transform into this shape, persist the result,” the code that executes needs to be as close to provably correct as possible. Rust provides this: each primitive’s implementation is compiled with full type safety. Composition rules — contract compatibility, single-mutation boundary, configuration validity — are enforced at assembly time before any execution occurs. The safety comes from the grammar’s structural invariants: the primitives are correct because they’re compiled Rust, and the compositions are correct because the grammar rules prevent invalid assembly.
The action grammar tier is where certainty lives precisely because the planner is probabilistic. The stronger the floor beneath the planner, the more freedom the planner has to compose novel workflows. A planner that can only compose from a vocabulary it cannot break is fundamentally different from a planner that can compose from a vocabulary that might fail at runtime.
Developer-authored handlers and system-invoked action grammars coexist in the same workflow. A task template can reference both — static steps backed by application-specific handlers and dynamically planned steps backed by grammar compositions. The orchestration layer treats both as steps; the difference is in the provenance and the trust model.
Agent Integration: Deterministic Infrastructure for Autonomous Clients
The properties that make Tasker valuable for human-designed workflows — parallel execution with convergence, conditional branching, batch fan-out/fan-in, result aggregation, bounded resource consumption — are the same properties that autonomous agents need when they cannot hold everything in context at once.
The most powerful capability of agent systems is not that they reason — it’s that they delegate. An effective agent decomposes problems: spinning up sub-agents for research, parallelizing analysis across multiple sources, aggregating findings, and making decisions with the benefit of structured, converged information. This decomposition needs infrastructure — and workflow orchestration is exactly that infrastructure.
Tasker is not an agent framework. It does not manage agent state, control agent behavior, or coordinate agent-to-agent communication. Tasker is deterministic infrastructure that agents use as clients. An agent submitting a task to Tasker is indistinguishable from a human developer or an application submitting a task. The same API, the same guarantees, the same resource controls, the same observability.
The critical insight is about context decomposition. When an agent is asked to design a workflow or make a complex decision, it doesn’t have to “get it right” in a single reasoning pass. It can create a Tasker task — a research workflow — that fans out investigation across multiple dimensions, converges the findings, and delivers structured results that the agent uses as input to its actual decision. The parent_correlation_id field traces the lineage from research task to design decision to workflow execution, providing full observability of the agent’s reasoning chain through Tasker’s standard telemetry.
This pattern requires no new orchestration machinery. Task creation is an existing API. parent_correlation_id is an existing field. Fan-out/fan-in is existing behavior. Convergence semantics are existing behavior. What needs to be built are the patterns and templates that make this easy — research task templates, analysis step handlers, convergence-to-decision patterns — not new orchestration features.
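Tracing an agent's reasoning chain through `parent_correlation_id` reduces to walking a parent chain. A minimal sketch with an invented `Task` stand-in (the map is keyed by correlation ID; none of this is Tasker's real data model):

```rust
use std::collections::HashMap;

/// Stand-in for a task record; only the lineage field matters here.
struct Task { parent_correlation_id: Option<&'static str> }

/// Walk parent_correlation_id links from a task back to the root,
/// returning the chain in child-to-ancestor order.
fn lineage<'a>(tasks: &'a HashMap<&'a str, Task>, mut id: &'a str) -> Vec<&'a str> {
    let mut chain = vec![id];
    while let Some(parent) = tasks.get(id).and_then(|t| t.parent_correlation_id) {
        chain.push(parent);
        id = parent;
    }
    chain
}

fn main() {
    let mut tasks = HashMap::new();
    // research task -> design decision -> workflow execution
    tasks.insert("research", Task { parent_correlation_id: None });
    tasks.insert("design", Task { parent_correlation_id: Some("research") });
    tasks.insert("execute", Task { parent_correlation_id: Some("design") });

    assert_eq!(lineage(&tasks, "execute"), vec!["execute", "design", "research"]);
}
```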
The MCP server (Phase 0) becomes the natural integration point for agents interacting with Tasker at design time: inspecting templates, querying action grammar vocabulary, validating compositions, and understanding schema contracts. Task submission — the runtime interaction — goes through the standard Tasker API.
See Agent Orchestration for the full treatment of this capability.
The From-Here-to-There Path
This vision does not require building everything before any of it is useful. There is a natural sequence of work where each step delivers real value to the Tasker ecosystem while building toward the full generative capability:
Step 1: Templates as Generative Contracts (TAS-280). Extend the TaskTemplate with result_schema on step definitions. Build tasker-ctl generate to produce typed handler code, result models, and test scaffolds from templates. This has immediate utility for developer quality of life — typed dependency injection, IDE autocomplete, compile-time or lint-time shape checking. But it also establishes the first data contracts in the system and makes the template a source of code generation rather than just structural description.
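To make the payoff of typed generation concrete, here is a hypothetical sketch of the kind of code a tool like `tasker-ctl generate` could emit from a step's `result_schema` (the struct, fields, and handler are invented for illustration; this is not TAS-280's actual output):

```rust
/// Hypothetical result type generated from a step's result_schema.
/// Downstream handlers depend on this shape at compile time instead
/// of digging through untyped JSON.
#[derive(Debug, PartialEq)]
struct FetchOrderResult {
    order_id: u64,
    total_cents: i64,
}

/// A downstream handler receiving a typed dependency: shape mismatches
/// (a renamed field, a changed type) surface as compile errors.
fn charge_payment(dep: &FetchOrderResult) -> Result<i64, String> {
    if dep.total_cents <= 0 {
        return Err(format!("nothing to charge for order {}", dep.order_id));
    }
    Ok(dep.total_cents)
}

fn main() {
    let dep = FetchOrderResult { order_id: 1, total_cents: 500 };
    assert_eq!(charge_payment(&dep).unwrap(), 500);
}
```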
Step 2: MCP Server and Shared Tooling. Build an MCP server that works with an LLM to help developers (and agents) create well-structured templates, handler code, and test code from natural language descriptions. Extract the shared logic between tasker-ctl and tasker-mcp into a tasker-tooling crate — template validation, schema inspection, codegen, handler resolution checking — so both interfaces consume the same capabilities. This is the first point where an LLM touches the Tasker workflow lifecycle, and the first integration point for agent clients.
Step 3: Action Grammars. With data contracts established and MCP server experience revealing which workflow patterns recur, build the action(resource) grammar primitives in Rust. Establish the composition framework with contract validation and single-mutation-boundary enforcement. Build a library of common patterns as named, documented composition specifications. Any client — human, LLM, or agent — can compose handlers from the grammar vocabulary, validated against the same structural rules.
Step 4: LLM Planning and Recursive Workflows. With the vocabulary established and the MCP server experience informing prompt and context design, introduce planning steps that generate validated workflow fragments from runtime context, composed from action grammar primitives. Extend to recursive planning where each phase’s plan is informed by accumulated results from prior phases. Agent-initiated planning — where the agent creates tasks containing planning steps — composes naturally with this infrastructure.
Each step depends on insights from the previous one. TAS-280 reveals what data contracts look like in practice. The MCP server reveals what the LLM needs to generate correct workflow components. Both inform the action grammar design. The action grammars provide the vocabulary the planning interface requires. Agent integration is a cross-cutting capability that becomes richer at each step.
What This Is Not
Precision matters when describing what AI integration means within an engineering system. This vision is specifically not the following:
It is not an agent orchestration framework. Tasker does not manage agent state, does not decide when agents should act, and does not coordinate agent-to-agent communication. Agents are external clients that use Tasker’s execution guarantees for their own purposes. The system provides infrastructure, not agency. The distinction is fundamental: an agent framework would require abandoning the determinism that makes Tasker trustworthy; agent-accessible infrastructure preserves it.
It is not code generation. The LLM does not write handlers. In the developer-assistance context (MCP server), it generates handler scaffolds that developers review and extend. In the runtime planning context, it composes and parameterizes action(resource) grammar primitives. There is no hot-loading of generated code, no runtime compilation, no eval. The security boundary is composition of verified primitives, not execution of generated code.
It is not unconstrained. Planning steps operate within explicit bounds: maximum graph depth, maximum step count, maximum cost budget, required convergence points. Agent-created tasks are subject to the same resource controls as any task. The orchestration layer validates every workflow fragment before materializing it. Invalid plans are rejected, not executed. Agents that exceed delegation budgets fail cleanly.
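The bounds above are mechanical checks that can run before any step is materialized. A minimal sketch, with invented limit values and field names, of rejecting a plan fragment that exceeds its budgets:

```rust
/// Illustrative bounds a planning step operates within.
struct PlanLimits { max_depth: usize, max_steps: usize, max_cost_cents: u64 }

/// Illustrative summary of a generated workflow fragment.
struct PlanFragment {
    depth: usize,
    step_count: usize,
    estimated_cost_cents: u64,
    has_convergence: bool,
}

/// Invalid plans are rejected, not executed.
fn validate_plan(plan: &PlanFragment, limits: &PlanLimits) -> Result<(), String> {
    if plan.depth > limits.max_depth {
        return Err("exceeds maximum graph depth".into());
    }
    if plan.step_count > limits.max_steps {
        return Err("exceeds maximum step count".into());
    }
    if plan.estimated_cost_cents > limits.max_cost_cents {
        return Err("exceeds cost budget".into());
    }
    if !plan.has_convergence {
        return Err("missing required convergence point".into());
    }
    Ok(())
}

fn main() {
    let limits = PlanLimits { max_depth: 3, max_steps: 10, max_cost_cents: 1_000 };
    let ok = PlanFragment { depth: 2, step_count: 5, estimated_cost_cents: 100, has_convergence: true };
    let too_deep = PlanFragment { depth: 7, ..ok };
    assert!(validate_plan(&ok, &limits).is_ok());
    assert!(validate_plan(&too_deep, &limits).is_err());
}
```

Wait — `PlanFragment` does not derive `Clone`/`Copy`, so the struct-update syntax above moves fields; since the remaining fields are all `Copy` primitives this compiles, and `ok` stays usable because only `Copy` fields are read.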
It is not opaque. Every planning decision is captured — the LLM’s reasoning, the generated workflow fragment, the validation result, the materialized steps. Agent-created task chains are traceable through parent_correlation_id. The observability guarantees that apply to every Tasker step apply equally to dynamically planned steps and agent-created steps. The provenance of every step — including which grammar composition was used — is traceable to the decision that created it.
The Intentional Partnership
This vision is a natural expression of the Intentional AI Partnership philosophy that guides Tasker’s development.
The core insight of that philosophy — that AI amplifies existing engineering practices rather than replacing them — applies directly here. Tasker’s execution guarantees (determinism, idempotency, transactional consistency, observability) are the engineering practices being amplified. The LLM adds flexibility and reasoning capability to workflow planning without undermining the properties that make the system trustworthy.
The principle of specification before implementation maps to the planning step’s contract: the LLM produces a specification (the workflow fragment) that the system validates before implementing (materializing and executing). The principle of human accountability as the final gate maps to the gate primitive — an action(resource) that allows human approval at any decision point. The principle of validation as a first-class concern maps to the fragment validation pipeline that sits between planning and execution.
The two-tier trust model embodies the partnership concretely. Human developers write business logic in their language of choice with their domain expertise. The system provides a grammar of reliable, composable, type-safe primitives that LLMs and agents can assemble. Neither is asked to do what the other does better. The boundaries are clean, the contracts are explicit, and the guarantees are preserved.
Agent integration extends this partnership model. The agent brings reasoning and context decomposition — the ability to recognize that a problem needs investigation before design, and to structure that investigation effectively. Tasker brings deterministic execution — the guarantee that every research step completes reliably, every convergence is transactionally sound, and every result is observable. The agent delegates what it cannot do well (reliable parallel execution with convergence). Tasker delegates what it cannot do at all (reasoning about what to investigate and how to interpret findings).
Where This Leads
The immediate practical implication is a system where problems can be described at the level of what needs to happen rather than exactly how it should be orchestrated. Today this means a developer describes a workflow and an MCP server generates the template and handler code. Tomorrow this means a planning step receives runtime context and generates a workflow fragment from composable grammar primitives. Eventually this means agents that can investigate, plan, and execute complex workflows — using Tasker’s deterministic infrastructure for every phase of that process.
The longer-term implication is a type system for workflow actions — an extensible grammar of action(resource) primitives where new capabilities are compositions of existing primitives, where correctness is enforced by the grammar’s structural invariants (contract compatibility, single-mutation boundary), and where any client — human, LLM, or agent — has a vocabulary rich enough to express complex workflows yet constrained enough that every composition is guaranteed to execute correctly.
The foundation is built. The machinery works. The extension is natural — and the path there delivers value at every step.
This document is part of the Tasker Core generative workflow initiative. It describes a vision that will be realized through phased implementation, beginning with templates as generative contracts and progressing through action grammars, agent integration, planning interfaces, and recursive planning capabilities.
Technical Approach: Action Grammars, Handler Composition, and Agent Integration
Problem Statement, Solution Consideration, Recommendation, and Phases of Implementation
Problem Statement
Tasker Core provides deterministic workflow orchestration with conditional branching, batch processing, and convergence semantics. These capabilities are powerful but require workflows to be fully designed at development time: every step must have a registered handler, every decision path must be anticipated in a template, every handler must be implemented in application code.
This creates four constraints that limit the system's expressiveness:
Constraint 1: Decision Logic is Static
Decision point handlers evaluate business rules and return step names from a pre-declared template. The decision space is bounded by what was anticipated when the template was authored. Unanticipated scenarios — novel data shapes, unforeseen combinations of conditions, problems that require multi-step reasoning to decompose — cannot be handled without authoring new templates and deploying new code.
Constraint 2: Handlers are Application-Specific
Every step in a workflow requires a handler implemented in application code. There is no shared vocabulary of common operations (HTTP requests, data transformations, schema validations, fan-out patterns) that can be composed without writing new code. This means that even workflows consisting entirely of common operations — fetch data, validate, transform, store — require custom handler implementations.
Constraint 3: Workflow Topology is Fixed at Template Time
Task templates define the complete set of possible steps. While conditional workflows defer step creation to runtime, the universe of creatable steps is fixed. There is no mechanism for generating workflow topology that wasn’t anticipated in the template — no way to say “use whatever steps are needed to solve this problem.”
Constraint 4: Complex Decisions Require Complete Context
When a planning entity — whether an LLM planning step or a human architect — needs to design a workflow for a complex problem, it must reason about the entire problem in a single pass. There is no mechanism for a planner to investigate before deciding — to fan out research, gather information from multiple sources, and converge findings before committing to a workflow design. The planner must hold everything in context at once, which limits the complexity of problems that can be addressed dynamically.
The Opportunity
These constraints are not bugs. They are consequences of a system designed for reliability and predictability. But they also represent a ceiling on what the system can express.
If we could introduce a vocabulary of composable, type-safe action primitives — and a planner capable of composing them into workflow topologies at runtime — while preserving all of Tasker’s execution guarantees, we would unlock a new class of workflows: ones where the goal is specified and the path is determined at runtime, composed from primitives the system guarantees are correct.
If we could additionally provide agents with the ability to use Tasker’s own infrastructure for investigation and context decomposition — creating research workflows whose results inform design decisions — we would address the fundamental limitation of single-pass planning: the requirement that the planner already knows enough to plan well.
Solution Consideration
Approach A: LLM as External Advisor
Description: An LLM operates outside Tasker, generating complete task templates (YAML) that are then submitted through normal task creation flows.
Strengths:
- No changes to Tasker’s internals
- Clean separation between planning and execution
- Template validation catches errors before execution
Weaknesses:
- No iterative planning — the entire workflow must be known upfront
- No access to intermediate results for adaptive planning
- Requires full template + handler deployment before execution
- Loses the dynamic branching capabilities Tasker already has
Assessment: This is already possible today and is valuable as a developer productivity tool. It does not address the core opportunity of runtime adaptive planning. Phase 0’s MCP server pursues this path for its immediate developer experience value.
Approach B: LLM as Decision Handler (Current Architecture)
Description: An LLM serves as the business logic behind a decision point handler, choosing from pre-declared step names in an existing template.
Strengths:
- Works within current architecture with no modifications
- LLM adds reasoning to existing decision points
- All execution guarantees preserved
Weaknesses:
- Decision space still bounded by template-declared steps
- LLM cannot generate novel workflow topology
- Handlers must still be application-specific code
- No composition of generic capabilities
Assessment: Valuable and achievable immediately. Should be demonstrated in tasker-contrib as a pattern. But it is an incremental improvement, not a paradigm extension.
Approach C: LLM as Workflow Planner with Action Grammar (Recommended)
Description: Introduce a layer of composable, Rust-native action grammar primitives with compile-time enforced data contracts. All handler composition — whether referencing a common pattern or assembled dynamically — uses the same action(resource) grammar model, validated through the same pipeline at assembly time. Extend the decision point mechanism to support workflow fragment generation — a planning step backed by an LLM that composes workflow fragments from the grammar’s vocabulary. The orchestration layer validates and materializes these fragments using existing transactional infrastructure. The single-mutation boundary (at most one external mutation after all fallible preparatory work) is the central safety invariant for all compositions.
Strengths:
- Builds on proven conditional workflow machinery
- Preserves all execution guarantees (transactionality, idempotency, observability)
- No code generation or hot-loading — only composition of verified primitives
- Primitives are compile-time verified Rust; compositions are validated at assembly time against structural invariants (contract compatibility, single-mutation boundary)
- LLM composes from a vocabulary it cannot break — the type system prevents invalid compositions
- One composition model — common patterns and dynamic compositions use the same grammar and validation pipeline
- Enables recursive planning through nested planning steps
- Action grammars and common patterns are independently valuable even without LLM integration
- Polyglot developers consume action grammars through FFI without needing to understand Rust
Weaknesses:
- Requires new infrastructure: action grammar primitives, data contracts, composition framework, fragment validation, capability schemas
- Action grammar design requires careful research into which primitives are fundamental
- LLM planning quality depends on capability schema design
- Assembly-time validation is strong but not compile-time — requires careful design of the validation pipeline
- New observability requirements for dynamically generated workflows
- FFI boundary between Rust grammar layer and polyglot developer layer needs careful design
Assessment: This is the approach that fully realizes the opportunity while building on Tasker’s existing architecture. The action grammar layer provides strong guarantees through compile-time verified primitives and assembly-time validated compositions. Common patterns provide named, well-tested composition specifications for frequently used combinations, while dynamic composition enables open-ended use of the same grammar. The phased path (TAS-280 → MCP server → action grammars → planning) allows each component to inform the next.
Approach D: Full Agent Framework
Description: Build a general-purpose AI agent framework within Tasker, with persistent memory, autonomous decision-making, and dynamic tool selection.
Strengths:
- Maximum flexibility
- Aligns with industry “agent” narrative
Weaknesses:
- Undermines Tasker’s core value proposition (determinism, predictability)
- Enormous complexity with unclear boundaries
- Agent failure modes are poorly understood
- Observability becomes intractable
- Fundamentally different system with different guarantees
Assessment: This is a different product. Tasker’s strength is deterministic orchestration; an agent framework would require abandoning the properties that make the system trustworthy. Explicitly rejected. However, this rejection is specific: what’s rejected is Tasker becoming an agent. What’s embraced is Approach E.
Approach E: Agent-Accessible Deterministic Infrastructure (Recommended alongside C)
Description: Position Tasker as deterministic infrastructure that agents use as external clients. Agents submit tasks through the standard API, create research workflows to decompose complex decisions, and use Tasker’s convergence semantics to aggregate findings before committing to workflow designs. The MCP server provides agents with design-time access to template inspection, grammar vocabulary, and composition validation. Agent-created task hierarchies are traced through parent_correlation_id.
Strengths:
- Requires no new orchestration machinery — agents use existing APIs and task lifecycle
- Preserves all deterministic execution guarantees (agents are clients, not components)
- Addresses the “single-pass planning” limitation: agents can investigate before deciding
- `parent_correlation_id` provides full observability of agent reasoning chains
- Same resource controls (budgets, timeouts, max steps) apply to agent-created tasks
- Composes naturally with Approach C: agents can create tasks that contain planning steps, compose handlers from grammar primitives, and reference common patterns
- The MCP server (Phase 0) is the natural agent integration point — no additional infrastructure needed
- Bounded delegation: agents can create related tasks with explicit resource limits
Weaknesses:
- Agents must understand Tasker’s task/template model to create effective research workflows
- Research workflow patterns and templates need to be designed and provided
- Agent coordination logic lives outside Tasker, which means agent-side failures are not Tasker-observable
- Patterns for “agent creates a task, waits for results, makes a decision” need standardization
Assessment: This approach complements Approach C by addressing the context decomposition problem that Approach C’s planning interface cannot solve alone. A planning step within a workflow can compose from action grammar primitives, but it cannot investigate unknowns — it can only reason about the context it receives. An agent operating as an external client can investigate, using Tasker’s own infrastructure to structure that investigation, and then provide richer context to planning steps or design complete workflows from its findings.
Recommendation
Proceed with Approach C (Action Grammar with Handler Composition) and Approach E (Agent-Accessible Infrastructure), implemented through four phases with a pragmatic “from here to there” path where each phase delivers independent value. Approach E is not a separate phase — it is a cross-cutting capability that becomes richer at each phase.
The key architectural decisions within this approach:
- Action grammars as the compositional foundation. Rust-native primitives (acquire, transform, validate, gate, emit, decide, fan-out, aggregate) with compile-time enforced input/output contracts. The type system guarantees that primitives are correct — if it compiles, the data flow is sound.
- One composition model. All handler composition uses the same `action(resource)` grammar. Common patterns are named, documented, well-tested composition specifications that resolve to standard grammar compositions at execution time — they are a documentation and convenience layer, not a separate runtime concept. Dynamic compositions use the same grammar and validation pipeline. Both execute with identical lifecycle guarantees.
- Workflow fragments as the planning output. The LLM planner returns a structured workflow fragment (steps, dependencies, grammar compositions, input mappings) that the orchestration layer validates against the grammar’s structural invariants before materialization. Fragments can reference common patterns, dynamic compositions, or both.
- Validation as the trust boundary. The orchestration layer validates every fragment before materializing it: primitives are compile-time verified Rust; compositions are validated at assembly time against structural invariants (contract compatibility, single-mutation boundary); the DAG is acyclic; input schemas match handler contracts; resource bounds are respected. Invalid fragments are rejected with diagnostic information. The single-mutation boundary — at most one external mutation (Persist, Emit) appearing after all fallible preparatory work — is the central safety invariant.
- Two-tier trust model. Developer-authored handlers (polyglot, FFI, developer-owned) coexist with system-invoked action grammars (Rust-native, compile-time enforced, LLM-composable) in the same workflow. The orchestration layer treats both as steps.
- Recursive planning through nested planning steps. Planning steps can appear at any point in a workflow, including downstream of other planned segments. Each planning step receives accumulated context from prior steps, enabling multi-phase workflows where each phase’s plan is informed by previous results.
- Agents as external clients. Agents interact with Tasker through the standard API (task submission) and the MCP server (design-time inspection and validation). Task-level delegation through `parent_correlation_id` provides agent reasoning chain traceability. Research workflow patterns provide reusable templates for agent-driven investigation.
- Shared tooling foundation. The `tasker-tooling` crate provides the shared logic consumed by both `tasker-ctl` (CLI interface) and `tasker-mcp` (MCP server / agent interface), ensuring consistent behavior across human and machine interaction surfaces.
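As a sketch of how the single-mutation boundary could be enforced at assembly time, the check below classifies primitives as mutating or non-mutating and rejects compositions that violate the invariant. The enum variants and function names are illustrative assumptions, not Tasker's actual API, and the rule is simplified to "the mutation, if present, must be the final step":

```rust
/// Hypothetical primitive taxonomy; only `Persist` and `Emit` mutate
/// external state. Names mirror the grammar vocabulary, not a real API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Primitive {
    Acquire,
    Transform,
    Validate,
    Gate,
    Decide,
    Persist, // mutating
    Emit,    // mutating
}

impl Primitive {
    fn is_mutating(self) -> bool {
        matches!(self, Primitive::Persist | Primitive::Emit)
    }
}

/// Accept a composition only if it contains at most one mutating
/// primitive and that primitive is the final step (a simplification
/// of "nothing fallible after the mutation").
fn check_single_mutation(steps: &[Primitive]) -> Result<(), String> {
    let mutations: Vec<usize> = steps
        .iter()
        .enumerate()
        .filter(|(_, p)| p.is_mutating())
        .map(|(i, _)| i)
        .collect();
    match mutations.as_slice() {
        [] => Ok(()),
        [i] if *i == steps.len() - 1 => Ok(()),
        [_] => Err("mutation must be the final step".into()),
        _ => Err("at most one external mutation per composition".into()),
    }
}

fn main() {
    use Primitive::*;
    assert!(check_single_mutation(&[Acquire, Transform, Validate, Persist]).is_ok());
    assert!(check_single_mutation(&[Acquire, Persist, Transform]).is_err());
    assert!(check_single_mutation(&[Persist, Emit]).is_err());
    println!("single-mutation boundary checks passed");
}
```

A structural rule like this is cheap to evaluate on any fragment, regardless of whether it came from a common pattern or an LLM planner.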
Phases of Implementation
Phase 0: Foundation — Templates as Generative Contracts
Goal: Establish the tooling foundation through typed code generation (TAS-280), an MCP server for LLM-assisted workflow authoring, and the shared tooling crate.
Deliverables:
- `result_schema` on TaskTemplate step definitions with typed code generation in all four languages
- MCP server for template validation, handler resolution checking, and template/handler generation
- `tasker-tooling` crate extracting shared logic between `tasker-ctl` and `tasker-mcp`
- Patterns and insights that inform action grammar design
Validation Gate: `tasker-ctl generate` produces typed handler scaffolds. MCP server generates valid templates from natural language descriptions. Schema compatibility between connected steps is validated. Agent clients can use the MCP server for template inspection and validation.
Dependencies: None — builds on existing TaskTemplate infrastructure and TAS-280.
Phase 1: Action Grammars and Handler Composition
Goal: Establish the vocabulary of composable, type-safe action primitives and the unified composition model for building handlers from them.
Deliverables:
- Action grammar primitive framework in Rust with compile-time data contracts
- Core primitives: acquire, transform, validate, gate, emit, decide, fan-out, aggregate
- Common patterns: named, documented, well-tested composition specifications for frequently used combinations
- Composition validation framework enforcing structural invariants (contract compatibility, single-mutation boundary)
- Capability schema format derived from grammar composition
- FFI surface for polyglot developer consumption
- Grammar worker deployment configuration
Validation Gate: Handlers composed from grammar primitives execute through the grammar worker. Common patterns and dynamic compositions pass the same assembly-time validation and execute correctly. Primitives are compile-time verified Rust; compositions are validated at assembly time against structural invariants. Polyglot developers can use grammar-composed handlers through FFI.
Dependencies: Phase 0 informs grammar design through observed patterns.
Phase 2: Planning Interface
Goal: Enable LLM-backed planning steps that generate workflow fragments composed from action grammar primitives.
Deliverables:
- Workflow fragment schema (steps, dependencies, grammar compositions, input mappings)
- Fragment validation pipeline (structural invariant checking, DAG validation, resource bound checking)
- Planning step handler type
- LLM integration adapter (configurable model selection, prompt construction from capability schema)
- Fragment materialization service
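To make the fragment shape concrete, here is one possible form a planner's output could take. Every field name below is an assumption for illustration; the actual fragment schema is itself a Phase 2 deliverable.

```yaml
# Illustrative sketch only — not a finalized schema.
fragment:
  steps:
    - name: fetch_records
      composition:
        - acquire: { source: http, url_from: inputs.records_url }
        - transform: { extract: "$.data.records" }
        - validate: { schema_ref: record_schema_v2 }
    - name: store_records
      dependencies: [fetch_records]
      composition:
        - persist: { destination: warehouse.records, operation: create }
  input_mappings:
    records_url: task.context.records_url
```

Note that the sketch keeps the single mutation (`persist`) as the final action of the final step, which is the shape the validation pipeline would accept.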
Validation Gate: An LLM-backed planning step generates a valid workflow fragment from a problem description, that fragment is validated (including grammar composition structural invariant checking) and executed through standard Tasker infrastructure. Agent-created tasks containing planning steps execute correctly with full traceability through parent_correlation_id.
Dependencies: Phase 1 (action grammar provides the vocabulary that fragments reference).
Phase 3: Recursive Planning and Adaptive Workflows
Goal: Enable multi-phase workflows where each phase’s plan is informed by previous results, and agent-driven task-level delegation for investigation workflows.
Deliverables:
- Nested planning step support
- Context accumulation patterns (with typed data contracts aiding summarization)
- Planning depth and breadth controls
- Cost tracking and budgeting
- Adaptive convergence patterns
- Research workflow templates and patterns for agent-driven investigation
- Agent delegation patterns with bounded resource allocation
Validation Gate: A multi-phase workflow can plan, execute, observe results, and re-plan for subsequent phases. Agent-created research tasks converge and deliver structured results for downstream planning. Resource bounds are enforced and the system terminates gracefully when bounds are exceeded.
Dependencies: Phase 1 + Phase 2. Phase 3 should be informed by operational experience with Phase 2.
Risk Assessment
| Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|
| Action grammar primitives too coarse or too fine | High | Medium | Phase 0’s MCP server and TAS-280 experience reveal actual patterns. Start with coarse primitives; refine based on observed composition needs. |
| LLM generates invalid workflow fragments | Medium | High | Fragment validation pipeline rejects invalid plans before execution. Grammar type system prevents invalid compositions. This is expected behavior, not a failure mode. |
| Composition validation insufficient for edge cases | Medium | Medium | Assembly-time validation catches structural mismatches. The single-mutation boundary provides a strong structural invariant. Operational experience identifies additional validation rules. Frequently used compositions become common patterns with additional testing. |
| Common patterns insufficient for real workflows | High | Medium | Common patterns are extensible. Dynamic composition from grammar primitives fills gaps without code changes. Organization-specific compositions supplement the standard set. Developer-authored handlers coexist with grammar-composed handlers in the same workflow. |
| Rust-to-polyglot FFI boundary too complex | Medium | Medium | Phase 0’s MCP server and code generation establish the pattern. Polyglot developers consume through configuration and their language’s DSL, not raw FFI. |
| LLM planning quality too low for useful workflows | High | Low-Medium | Capability schema design (derived from grammar composition) is the primary lever. MCP server experience from Phase 0 informs prompt engineering. |
| Recursive planning creates runaway graphs | High | Medium | Resource bounds enforced at the orchestration level: max depth, max total steps, cost budgets. Planning steps that exceed bounds fail cleanly with diagnostic information. |
| Agent-created task chains become unmanageable | Medium | Medium | Resource bounds on agent-created tasks. parent_correlation_id provides chain traceability. Research workflow templates provide bounded, well-designed patterns. |
| Data contract evolution breaks existing compositions | Medium | Medium | Schema versioning strategy. Grammar primitives are additive; existing compositions remain valid as new primitives are added. |
| Shared tooling crate creates coupling | Low | Medium | tasker-tooling exposes stable interfaces. CLI and MCP server consume through the interface, not implementation details. Premature extraction is the real risk — extract when the API surface stabilizes. |
Timeline Considerations
Phase 0 (Foundation) is immediately actionable. TAS-280 is already specified. The MCP server can begin prototyping alongside it. The tasker-tooling extraction should follow once TAS-280 stabilizes the codegen and validation APIs. All deliver independent value and require no orchestration runtime changes.
Phase 1 (Action Grammars) is the core architectural investment. It should begin as Phase 0’s patterns inform the grammar design. This is where the most design research is needed — establishing the primitives, the composition model, and the structural invariants (especially the single-mutation boundary) that all compositions must satisfy.
Phase 2 (Planning Interface) requires Phase 1 and is the heart of the generative vision. MCP server experience from Phase 0 significantly de-risks the LLM integration design. Agent integration through the MCP server and standard API is available from Phase 0 onward and becomes more capable as each phase adds vocabulary.
Phase 3 (Recursive Planning) requires Phase 2 and represents the full realization of the vision. It is the most speculative phase and should be informed by operational experience with Phase 2. Agent-driven research workflow patterns can be developed in parallel with Phase 2, informed by Phase 0 and Phase 1 experience.
WASM sandboxing is a valuable capability that can be pursued as a parallel effort at any point. It complements grammar-composed handlers by providing execution isolation for compositions, but is not a prerequisite for any phase of this initiative.
Each phase is elaborated in its own document with detailed research areas, design questions, prototyping goals, and validation criteria. The agent integration capability is described in a dedicated cross-cutting document rather than a phase-specific document, as it composes with all phases.
Phase 0: Foundation — Templates as Generative Contracts
Developer tooling that delivers immediate value while building toward generative workflow capabilities
Phase Summary
Phase 0 establishes the foundation for generative workflows by extending Tasker’s existing tooling in two directions: typed code generation from task templates (TAS-280) and an MCP server for LLM-assisted workflow authoring. Neither capability requires changes to Tasker’s orchestration runtime. Both deliver immediate developer experience improvements. Together, they transform the TaskTemplate from a structural description into a generative contract — a machine-readable specification from which code, tests, validation, and eventually workflow fragments can be produced.
This phase is where we learn what patterns developers actually need, what the LLM gets right and wrong when generating workflow components, and what data contracts look like in practice. These lessons directly inform the design of action grammars in Phase 1.
Research Areas
1. Result Schemas and Typed Code Generation (TAS-280)
Question: How do we make the TaskTemplate a source of typed, validated code generation?
Context: TAS-280 introduces an optional result_schema on step definitions — a JSON Schema describing what each step produces. The orchestrator stores whatever JSON the handler returns; the schema is metadata for tooling, not runtime enforcement. From this schema, tasker-ctl generate produces typed handler scaffolds, result models, and test scaffolds in all four supported languages.
What this establishes for the vision:
The result_schema is the first instance of a data contract in Tasker. It declares the shape of what flows between steps. Today, this contract is advisory — it drives code generation and developer experience. In Phase 1, these same data contracts become the compile-time enforced input/output specifications for action grammar primitives.
```yaml
steps:
  - name: validate_order
    handler:
      callable: handlers.validate_order
    result_schema:
      type: object
      required: [validated, order_total, item_count]
      properties:
        validated: { type: boolean }
        order_total: { type: number }
        item_count: { type: integer }
  - name: charge_payment
    dependencies: [validate_order]
    handler:
      callable: handlers.charge_payment
    result_schema:
      type: object
      required: [charge_id, amount_charged]
      properties:
        charge_id: { type: string }
        amount_charged: { type: number }
```
From this, `tasker-ctl` generates typed handler code where dependency results are deserialized into language-specific models — Pydantic `BaseModel` in Python, `Dry::Struct` in Ruby, TypeScript interfaces, Rust `#[derive(Deserialize)]` structs. The handler author gets IDE autocomplete, type checking, and compile-time or lint-time guarantees that their code matches the workflow’s data flow.
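As a rough illustration of the Rust side of that generation, the structs below mirror the two `result_schema` declarations above. They are hand-written sketches rather than actual generator output; a real generator would presumably also emit serde `Deserialize` derives and schema-aware validation helpers.

```rust
// Sketch of types a generator could emit from the result_schema
// declarations above. Shown without serde so the example stands alone;
// real output would likely add #[derive(serde::Deserialize)].

#[derive(Debug, PartialEq)]
struct ValidateOrderResult {
    validated: bool,
    order_total: f64,
    item_count: i64,
}

#[derive(Debug, PartialEq)]
struct ChargePaymentResult {
    charge_id: String,
    amount_charged: f64,
}

fn main() {
    // A downstream handler receives the upstream result as a typed value
    // instead of raw JSON, so a misspelled field is a compile error.
    let upstream = ValidateOrderResult {
        validated: true,
        order_total: 42.5,
        item_count: 3,
    };
    assert!(upstream.validated);
    assert_eq!(upstream.item_count, 3);
    println!("typed result: {:?}", upstream);
}
```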
Research questions:
- What is the minimal `result_schema` that produces useful typed code? (Full JSON Schema is expressive but verbose; a constrained subset may be more practical.)
- How should schema evolution work? When a step’s output shape changes, what tooling helps propagate that change to downstream handlers?
- Should `tasker-ctl` validate schema compatibility between connected steps? (Step A produces shape X; step B declares it depends on A — does B’s expected input match A’s declared output?)
- What patterns emerge from real-world schema usage that inform action grammar data contract design?
2. MCP Server for Template and Handler Authoring
Question: How can an LLM assist developers in creating correct, well-structured workflows within Tasker’s existing framework?
Research approach:
An MCP server exposes Tasker’s template structure, handler patterns, and validation rules as tools that an LLM can use during a developer’s authoring session. The developer describes what they’re building; the LLM generates templates, handler code, and tests using Tasker’s actual patterns and conventions.
Proposed MCP server capabilities:
| Capability | Description |
|---|---|
| Template generation | Given a natural language description of a workflow, generate a well-structured TaskTemplate YAML with step definitions, dependencies, and handler references |
| Handler scaffolding | Given a template and step name, generate handler code in the developer’s chosen language using the DSL patterns from the appropriate framework integration |
| Test scaffolding | Generate test code that exercises the handler with realistic inputs and asserts expected output shapes |
| Template validation | Check a developer-authored template for structural correctness — valid step references, acyclic dependencies, handler callable format matching language conventions |
| Handler resolution check | Verify that handler callables in a template match registered handler names in the codebase, catching registration mismatches before runtime |
| Request generation | From a template’s input_schema, generate example curl commands, tasker-ctl invocations, or language-specific TaskerClient calls for submitting task requests |
What this establishes for the vision:
The MCP server is a prototype of the LLM integration pattern that Phase 2 formalizes. The MCP server helps an LLM generate developer-space workflow components (templates, handlers, tests) from descriptions. Phase 2’s planning interface helps an LLM generate system-space workflow fragments (steps, configurations, dependencies) from runtime context. The lessons from MCP server development — prompt engineering, structured output quality, validation feedback loops — transfer directly to planning step design.
The MCP server also creates a natural feedback loop: as developers use LLM-assisted authoring, we observe which patterns the LLM recommends, which mistakes it makes, and which workflow shapes recur across use cases. These observations inform the action grammar design — the recurring patterns become candidates for grammar primitives.
Research questions:
- What prompt patterns produce the best template quality? (Few-shot examples from `tasker-contrib`? Schema-constrained output? Iterative refinement with validation feedback?)
- How much of the TaskTemplate JSON Schema does the LLM need in its context to generate valid templates?
- What is the right boundary between MCP server validation and `tasker-ctl` validation? (The MCP server should catch obvious errors during authoring; `tasker-ctl` provides definitive validation.)
- Should the MCP server be aware of the codebase’s existing handlers, or generate templates against the catalog of possible handlers?
3. TaskTemplate as Generative Input
Question: What makes the TaskTemplate a suitable foundation for progressive levels of code generation?
Context: The TaskTemplate already serves multiple roles:
- Structural description — defines steps, dependencies, and handler references
- Input validation — `input_schema` (JSON Schema) validates task request payloads
- Configuration carrier — `step_inputs` parameterize handler behavior
- Queue routing — `namespace_name` determines which workers process the task’s steps
With result_schema (TAS-280), the template gains a fourth dimension: output contracts that describe what flows between steps. This makes the template a complete description of a workflow’s data flow — what goes in, what each step produces, how data moves between steps, and what comes out.
The generative progression:
| Level | Input | Output | Phase |
|---|---|---|---|
| Type generation | Template + result_schema | Typed models, handler scaffolds, tests | Phase 0 (TAS-280) |
| Authoring assistance | Natural language description | Complete template + handler code + tests | Phase 0 (MCP server) |
| Action grammar composition | Template + catalog schema | Composed handlers from grammar primitives | Phase 1 |
| Workflow fragment generation | Runtime context + capability schema | Validated workflow fragments | Phase 2 |
| Adaptive planning | Accumulated results + capability schema | Multi-phase workflow plans | Phase 3 |
Each level builds on the previous. The template’s structure, schemas, and data contracts are the common thread — the machine-readable specification that each level of generation reads from and produces into.
Research questions:
- Does the current TaskTemplate schema need extension to support generative use cases, or is `result_schema` sufficient for Phase 0?
- Should template metadata include hints for generation (e.g., “this step typically handles authentication,” “this step is a good candidate for batching”)?
- How should the relationship between `input_schema`, `result_schema`, and `step_inputs` be documented for LLM consumption?
Prototyping Goals
Prototype 1: Typed Code Generation (TAS-280)
Objective: Deliver tasker-ctl generate with typed handler scaffolding from result_schema.
Success criteria:
- `result_schema` parsed in TaskTemplate step definitions
- `tasker-ctl generate types` produces language-specific models (Python Pydantic, Ruby Dry::Struct, TypeScript interface, Rust struct) from schemas
- `tasker-ctl generate handler` produces DSL handler scaffolds with typed dependency injection
- Generated code compiles/type-checks in all four languages
- Test scaffolds reference expected output shapes
Prototype 2: MCP Server — Template Validation
Objective: An MCP server that validates existing templates and checks handler resolution.
Success criteria:
- MCP server exposes template validation as a tool
- Structural validation catches dependency cycles, missing handler references, invalid step configurations
- Handler resolution checking validates callable strings against codebase patterns
- Validation feedback is actionable by both developers and LLMs
Prototype 3: MCP Server — Template Generation
Objective: An MCP server that generates templates and handler code from natural language descriptions.
Success criteria:
- Given a workflow description, the MCP server generates a structurally valid TaskTemplate
- Generated templates pass the validation from Prototype 2
- Generated handler code uses the correct DSL patterns for the target language
- The LLM produces valid templates > 80% of the time without human correction
Validation Criteria for Phase Completion
- `result_schema` supported in TaskTemplate step definitions with typed code generation in all four languages
- MCP server operational with template validation and handler resolution checking
- MCP server generates structurally valid templates from natural language descriptions
- At least 3 end-to-end examples: description → template → generated handlers → passing tests
- Documented patterns from MCP server usage that inform action grammar design (recurring workflow shapes, common handler compositions, typical data flow patterns)
- Schema compatibility checking between connected steps (step A’s output matches step B’s expected input)
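A minimal version of that step-to-step compatibility check can be sketched as a required-field comparison; a real implementation would also compare declared types and nested shapes. All names below are illustrative, not Tasker's API.

```rust
use std::collections::HashSet;

/// Does the producing step declare every field the consuming step
/// requires? Reduced to field names only; type comparison omitted.
fn compatible(producer_fields: &[&str], consumer_required: &[&str]) -> Result<(), Vec<String>> {
    let produced: HashSet<&str> = producer_fields.iter().copied().collect();
    let missing: Vec<String> = consumer_required
        .iter()
        .copied()
        .filter(|f| !produced.contains(f))
        .map(|f| f.to_string())
        .collect();
    if missing.is_empty() { Ok(()) } else { Err(missing) }
}

fn main() {
    // validate_order declares these output fields; charge_payment
    // requires a subset of them, so the connection validates.
    let produced = ["validated", "order_total", "item_count"];
    assert!(compatible(&produced, &["order_total"]).is_ok());
    assert_eq!(
        compatible(&produced, &["order_total", "currency"]).unwrap_err(),
        vec!["currency".to_string()]
    );
    println!("schema compatibility checks passed");
}
```

Returning the missing fields (rather than a bare boolean) keeps the diagnostic actionable for both developers and LLM feedback loops.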
Relationship to Other Phases
- Phase 1 is informed by this phase: patterns observed through MCP server usage and code generation reveal which action grammar primitives are needed and what data contracts look like in practice.
- Phase 2 builds on this phase: the MCP server’s LLM integration patterns (prompt engineering, validation feedback, structured output) transfer directly to planning step design.
- Phase 3 is independent of this phase but benefits from the data contract foundation established here.
- This phase is independently valuable regardless of whether subsequent phases are implemented. TAS-280 and the MCP server improve developer experience for all Tasker users.
This phase is the most immediately actionable. TAS-280 is already specified and ready for implementation. The MCP server can begin prototyping as soon as TAS-280 establishes the template extension patterns.
Phase 1: Action Grammars
Rust-native composable action(resource) primitives as the vocabulary of generative workflow planning
Phase Summary
The action grammar is a framework of composable, Rust-native action(resource) primitives with compile-time enforced data contracts. Each primitive is a verb applied to a typed resource — Acquire(HttpEndpoint), Transform(JsonPayload), Validate(Schema), Persist(DatabaseRow) — with declared input and output types that the Rust compiler verifies.
The fundamental unit of composition is the handler: a chain of action(resource) primitives assembled to perform a business-meaningful operation. All handlers — whether referenced by name from a common patterns library or composed on the fly by an LLM planner — are validated and executed through the same pipeline. There is one composition model, one validation path, one execution path.
The grammar’s central safety invariant is the single-mutation boundary. A valid composition may contain an arbitrary number of non-mutating actions (reads, transforms, validations, calculations) but has at most one external mutation (a create, update, or delete against a database, API, or other external system). Everything before the mutation is preparatory, idempotent, and safely retryable. The mutation is the commitment point. Nothing after it should be fallible in a way that triggers re-execution of the mutation. This structural rule — enforced at assembly time — is what makes compositions safe regardless of how they were assembled.
This phase delivers independent value. Even without LLM integration, composable handlers reduce the boilerplate of workflow authoring, provide stronger correctness guarantees than hand-written handlers, and enable a richer template ecosystem in tasker-contrib. The action grammar layer also provides the vocabulary that Phase 2’s LLM planner and agent clients compose from — a vocabulary that is open, safe, and never artificially limited.
Research Areas
1. Action Grammar Primitives
Question: What are the fundamental verbs of workflow action, and how do they differ from handlers?
The distinction: A primitive is a single action(resource) with compile-time enforced input/output contracts. A handler composes primitives into a business-meaningful operation. The primitive is the atom; the handler is the molecule.
Research approach:
- Audit existing Tasker example workflows and extract recurring low-level actions
- Analyze Phase 0 MCP server usage patterns to identify what compositions developers actually request
- Survey workflow primitives in competing systems (Airflow operators, Temporal activities, Prefect tasks, Step Functions integrations) at a lower granularity than their handler-level abstractions
- Categorize by action type: non-mutating (reads, transforms, validations, control flow) vs. mutating (external state changes)
Proposed primitive taxonomy:
Non-mutating primitives (idempotent, retryable, no external side effects):
| Primitive | Concern | Input Contract | Output Contract |
|---|---|---|---|
| `Acquire` | Fetch data from an external source | Source descriptor (URL, query, config) | Raw acquired data + metadata (status, timing) |
| `Transform` | Reshape data from one form to another | Input data + transformation spec | Transformed data conforming to target shape |
| `Validate` | Assert invariants on data | Data + validation rules | Partitioned results (valid set, invalid set, diagnostics) |
| `Gate` | Block execution pending a condition | Gate condition + notification config | Approval/rejection + gating metadata |
| `Decide` | Evaluate conditions and select a path | Decision context + routing rules | Selected path identifier + reasoning |
| `FanOut` | Decompose work into parallel units | Data source + partitioning strategy | Partition descriptors for parallel execution |
| `Aggregate` | Converge and reduce parallel results | Collection of results + reduction strategy | Reduced result conforming to output shape |
Mutating primitives (external side effects — at most one per composition):
| Primitive | Concern | Input Contract | Output Contract |
|---|---|---|---|
| `Persist` | Write state to an external system (database, API, message queue) | Data + destination descriptor + operation (create/update/delete) | Confirmation + metadata (id, version, timestamp) |
| `Emit` | Send a notification or event to an external channel | Data + channel descriptor (webhook, email, queue) | Delivery confirmation + metadata |
The separation between non-mutating and mutating primitives is the grammar’s most important structural distinction. Non-mutating primitives can appear in any quantity and any order. Mutating primitives are the composition’s commitment point — the single-mutation boundary.
How primitives differ from existing handlers:
Today, an http_request handler is monolithic — it handles URL construction, authentication, request execution, error handling, response extraction, and output shaping in a single implementation. As a grammar composition, the same operation would be:
Acquire(HttpSource { url, method, auth, headers })
→ Transform(ExtractFields { path: "$.data.records" })
→ Validate(JsonSchema { schema: record_schema_v2 })
Each step in the composition has typed inputs and outputs. The Acquire primitive’s output type matches the Transform primitive’s input type. The compiler verifies this. If someone changes the Acquire output shape, compositions that depend on the old shape fail to compile — not at runtime, but at build time.
Open questions:
- Is `Transform` one primitive or a family? (Field extraction vs. full reshaping vs. type coercion may warrant separate primitives with different type constraints.)
- Should `Acquire` and `Emit` be symmetric primitives, or does the I/O direction warrant different type signatures?
- How granular should error handling be? Per-primitive error types vs. a unified error model that compositions inherit?
- Should there be a `Cache` primitive for memoizing expensive acquisitions?
2. Data Contracts as Compositional Glue
Question: How do Rust’s type system features enforce correctness of grammar compositions?
Research approach:
- Design trait bounds that express “primitive A’s output is compatible with primitive B’s input”
- Evaluate generic associated types, trait objects, and enum dispatch for composition flexibility
- Prototype compositions and validate that the compiler catches invalid ones
- Design the `dyn Pluggable` boundary for plugin extensibility and runtime composition
The type-level composition model:
use std::fmt::Debug;
use serde::{de::DeserializeOwned, Serialize};

/// Every grammar primitive declares its input and output types
trait ActionPrimitive {
    type Input: ActionData;
    type Output: ActionData;
    type Error: ActionError;

    fn execute(&self, input: Self::Input) -> Result<Self::Output, Self::Error>;
}

/// Data contracts: marker trait for types that can flow between primitives
trait ActionData: Serialize + DeserializeOwned + Send + Sync + Debug {}

/// A composition is valid when Output of A matches Input of B
struct Compose<A, B>
where
    A: ActionPrimitive,
    B: ActionPrimitive<Input = A::Output>,
{
    first: A,
    second: B,
}
The key insight: B: ActionPrimitive<Input = A::Output> is a compile-time constraint. If A produces HttpResponse and B expects ValidatedRecords, the composition fails to compile. No runtime surprises.
The runtime composition boundary:
All handler compositions — whether referencing common patterns by name or assembled dynamically — pass through a runtime composition boundary where contracts are validated through JSON Schema matching. This is the dyn Pluggable boundary: a trait object interface that allows runtime dispatch while still requiring data contracts to be declared:
/// All composed handlers declare their contracts and dispatch dynamically
trait PluggablePrimitive: Send + Sync {
    fn input_schema(&self) -> &JsonSchema;
    fn output_schema(&self) -> &JsonSchema;
    fn is_mutating(&self) -> bool; // Single-mutation boundary enforcement

    fn execute(&self, input: serde_json::Value) -> Result<serde_json::Value, ActionError>;
}
At the composition boundary, contracts are validated at assembly time through JSON Schema matching. The grammar primitives themselves are compiled Rust with full type safety. The composition layer validates that contracts chain correctly, the single-mutation boundary is respected, and configurations are well-formed — all before any execution occurs. Organization-specific primitives, user-defined transformations, and integration-specific adapters all go through this same boundary.
Open questions:
- How do we handle optional fields in data contracts? (A transform that adds fields to its input — the output is a superset of the input type.)
- Should compositions be linear (A → B → C) or support branching (A → B + C → D)?
- How do we express that a primitive preserves certain fields while transforming others? (Partial type transformations are hard to express statically.)
- What is the right balance between static composition (maximum safety, less flexibility) and dynamic dispatch (more flexibility, weaker guarantees)?
3. Composition Rules
Question: How do primitives combine into handlers while preserving idempotency, single responsibility, and retryability?
Research approach:
- Define which compositions are valid from an execution-guarantee perspective (not just type compatibility)
- Formalize the single-mutation boundary as a structural invariant
- Design the mixin/layering approach for building handlers from primitives
- Validate that composed handlers maintain the step contract (idempotent, retryable, side-effect bounded)
The single-mutation boundary — the central safety invariant:
A valid composition follows this structural pattern:
[non-mutating actions]* → [mutation]? → [non-failing actions]*
- Before the mutation: An arbitrary chain of `Acquire`, `Transform`, `Validate`, `Decide`, and `Aggregate` primitives. All non-mutating, all idempotent, all safely retryable. If the step fails anywhere in this phase, the entire composition can be retried from the beginning with no side effects.
- The mutation (at most one): A single `Persist` or `Emit` that commits the composition’s external effect. This is the commitment point. Tasker’s existing step state machine tracks whether the mutation has occurred.
- After the mutation: Only actions that cannot fail in a way that would trigger re-execution of the mutation. Typically: metadata recording, confirmation formatting, non-critical logging.
This rule is what makes compositions safe. It is checkable at assembly time — the grammar knows which primitives are mutating (is_mutating() == true) and can enforce that at most one appears, and that it appears in the correct position. A composition that violates the single-mutation boundary is rejected before execution.
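As a rough sketch of that assembly-time check (the dict shape and function name are illustrative, not Tasker's actual specification format): walk the primitive sequence once, record the mutation position, and reject a second mutation or any fallible work after the first one.

```python
def check_single_mutation_boundary(primitives):
    """Validate the [non-mutating]* -> [mutation]? -> [non-failing]* pattern."""
    mutation_index = None
    for i, p in enumerate(primitives):
        if p["is_mutating"]:
            if mutation_index is not None:
                return (False, f"second mutation at position {i}")
            mutation_index = i
        elif mutation_index is not None and p.get("can_fail", True):
            # After the commitment point, only non-failing actions are allowed
            return (False, f"fallible primitive after mutation at position {i}")
    return (True, "ok")


composition = [
    {"type": "Acquire", "is_mutating": False},
    {"type": "Validate", "is_mutating": False},
    {"type": "Persist", "is_mutating": True},
    {"type": "Emit", "is_mutating": True},  # violates the boundary
]
ok, reason = check_single_mutation_boundary(composition)
assert not ok and "second mutation" in reason
```

A zero-mutation (pure-read) composition passes this check, which is why the open question below about distinguishing intentional pure-read compositions from accidentally omitted mutations matters.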
Composition properties preserved by this structure:
| Property | How the Grammar Enforces It |
|---|---|
| Idempotency | Non-mutating phase is inherently idempotent. The mutation primitive declares its own idempotency strategy (idempotency keys, conditional execution). Nothing after the mutation can trigger retry. |
| Retryability | Re-execution from the beginning replays only idempotent actions until the mutation boundary. The step state machine prevents re-execution of the mutation itself. |
| Single responsibility | A composition’s responsibility is the business action it represents. One mutation = one externally visible effect. |
| Side-effect boundary | The mutation is the composition’s only external side effect. It is explicit, bounded, and singular. |
The mixin approach:
Handlers are not just linear chains of primitives. They are layered compositions where cross-cutting concerns (error handling, retry logic, observability, caching) are applied as mixins that wrap the core primitive chain:
Handler = WithRetry(
WithObservability(
WithErrorMapping(
Acquire → Transform → Validate
)
)
)
Each mixin layer is itself a primitive (or primitive wrapper) with typed input/output contracts. The mixin transforms the error type, adds metadata to the output, or wraps the execution with retry logic — all type-checked at compile time.
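One way the layering could look at runtime, sketched in Python rather than the typed Rust layer: each mixin wraps an inner execute callable and returns a callable with the same contract. `with_retry` and its backoff policy are invented for illustration, not Tasker's API:

```python
import time

def with_retry(execute, max_attempts=3, base_delay=0.01):
    """Wrap a primitive chain with retry; safe only before the mutation boundary."""
    def wrapped(value):
        last_error = None
        for attempt in range(max_attempts):
            try:
                return execute(value)
            except Exception as e:  # a real mixin would retry only retryable errors
                last_error = e
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        raise last_error
    return wrapped

# A flaky core chain that succeeds on the third attempt
calls = {"n": 0}
def flaky(value):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return value.upper()

handler = with_retry(flaky)  # Handler = WithRetry(core chain)
assert handler("acquire->transform") == "ACQUIRE->TRANSFORM"
assert calls["n"] == 3
```

Because the wrapped chain is non-mutating, replaying it on retry is safe — which is exactly why the grammar restricts retry mixins to the pre-mutation phase.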
Open questions:
- Should the composition framework enforce a maximum depth? (Deep compositions may have unclear failure modes.)
- How should partial failure work? (If Transform succeeds but Validate fails, what is the composition’s state?)
- Should compositions support checkpointing? (Resume from the last successful primitive rather than restarting the entire composition.)
- How do we test compositions? (Unit test each primitive independently; integration test the composition. But what about the mixin layers?)
4. Handler Composition and Validation
Question: How are handlers assembled from grammar primitives, and what validation ensures they are safe to execute?
All handler composition — whether referencing a common pattern by name or assembled dynamically by an LLM or agent — goes through the same pipeline: specification, validation, execution. There is one composition model.
Research approach:
- Design a specification format for handler compositions (primitives, configuration, data mappings)
- Build a validation pipeline that checks composition correctness including the single-mutation boundary
- Prototype handler execution through the grammar worker infrastructure
- Evaluate performance characteristics of the validation pipeline
The handler composition specification:
A handler is described as a composition of action(resource) grammar primitives with configuration and data mappings:
{
"primitives": [
{
"type": "Acquire",
"variant": "HttpSource",
"config": {
"url": "https://api.example.com/v2/search",
"method": "POST",
"auth": { "type": "bearer", "token_source": "env:API_KEY" }
}
},
{
"type": "Transform",
"variant": "FieldExtract",
"config": {
"source_path": "$.response.results",
"target_shape": "array<SearchResult>"
},
"input_mapping": {
"data": "$.previous.acquired_data"
}
},
{
"type": "Validate",
"variant": "SchemaCheck",
"config": {
"schema_ref": "search_result_v1",
"on_invalid": "partition"
},
"input_mapping": {
"data": "$.previous.transformed_data"
}
}
],
"mixins": ["WithRetry", "WithObservability"]
}
Common patterns — like http_request — are named composition specifications that resolve to this same format. Referencing "pattern": "http_request" with parameters is syntactic sugar for the fully-specified composition. At execution time, everything is a validated composition of grammar primitives.
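A minimal sketch of that resolution step, under assumed names (the registry contents, `params` key, and parameter-application rule are illustrative): expand the pattern reference into the full composition specification before handing it to the standard validation pipeline.

```python
# Hypothetical registry: each named pattern is a stored composition spec
PATTERN_REGISTRY = {
    "http_request": {
        "primitives": [
            {"type": "Acquire", "variant": "HttpSource",
             "config": {"url": None, "method": "GET"}},
            {"type": "Transform", "variant": "ResponseExtract", "config": {}},
            {"type": "Validate", "variant": "StatusCodeCheck",
             "config": {"expected_status": [200]}},
        ],
        "mixins": ["WithRetry", "WithObservability"],
    }
}

def resolve_pattern(spec):
    """Expand a {"pattern": ..., "params": ...} reference into a composition."""
    if "pattern" not in spec:
        return spec  # already a fully-specified composition
    template = PATTERN_REGISTRY[spec["pattern"]]
    resolved = {
        "primitives": [dict(p, config=dict(p["config"]))
                       for p in template["primitives"]],
        "mixins": list(template["mixins"]),
    }
    # Apply parameters to the primitives whose config declares them
    for key, value in spec.get("params", {}).items():
        for primitive in resolved["primitives"]:
            if key in primitive["config"]:
                primitive["config"][key] = value
    return resolved

composition = resolve_pattern(
    {"pattern": "http_request",
     "params": {"url": "https://api.example.com/records", "method": "POST"}}
)
assert composition["primitives"][0]["config"]["url"] == "https://api.example.com/records"
assert composition["primitives"][0]["config"]["method"] == "POST"
assert composition["mixins"] == ["WithRetry", "WithObservability"]
```

After resolution there is nothing pattern-specific left: the output is an ordinary composition specification, validated like any other.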
N-intersecting logical actions:
Compositions are not limited to linear chains. A composition can express intersecting actions — primitives that share data flows and coordination points without being strictly sequential:
- Acquire from multiple sources, then Transform the merged result
- Validate before and after a Transform
- Transform into multiple shapes for different downstream consumers
- Acquire and Validate in parallel, then Gate on the combined result
The composition specification supports these patterns through explicit input mappings — each primitive declares where its input comes from, which may be the output of any prior primitive in the composition (not just the immediately preceding one). The validation pipeline verifies that all input mappings resolve to available data with compatible shapes.
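A sketch of the mapping-resolution part of that check, assuming a `$.steps.<name>.…` path convention for referencing named primitive outputs (the convention and field names are illustrative, not Tasker's actual path syntax): every mapping must name a primitive that appears earlier in the composition.

```python
def check_input_mappings(primitives):
    """Each input_mapping must reference a prior primitive by name."""
    seen = set()
    errors = []
    for p in primitives:
        for field, path in p.get("input_mapping", {}).items():
            # e.g. "$.steps.acquire_a.output" -> source step name "acquire_a"
            parts = path.lstrip("$.").split(".")
            if parts[0] == "steps" and parts[1] not in seen:
                errors.append(f"{p['name']}.{field} references unknown "
                              f"or later step '{parts[1]}'")
        seen.add(p["name"])
    return errors

composition = [
    {"name": "acquire_a"},
    {"name": "acquire_b"},
    {"name": "merge",  # non-linear: consumes two earlier outputs
     "input_mapping": {"left": "$.steps.acquire_a.output",
                       "right": "$.steps.acquire_b.output"}},
    {"name": "broken",
     "input_mapping": {"data": "$.steps.synthesize.output"}},  # not yet defined
]
errors = check_input_mappings(composition)
assert errors == ["broken.data references unknown or later step 'synthesize'"]
```

Shape compatibility (does the referenced output's schema match the consumer's input schema?) would be a second pass layered on top of this resolution step.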
The validation pipeline:
Every composition passes through the same validation before execution:
- Primitive existence check: Every referenced primitive type and variant exists in the grammar
- Configuration validation: Each primitive’s config matches its declared configuration schema
- Input mapping resolution: Every `input_mapping` path resolves to a primitive output in the composition
- Contract compatibility: Output schema of the source primitive is compatible with input schema of the consuming primitive
- Single-mutation boundary: At most one mutating primitive (`Persist`, `Emit`) appears in the composition, and it appears after all fallible preparatory work
- Mixin applicability: Declared mixins are compatible with the composition’s primitive chain
This pipeline runs at assembly time — before any execution occurs. An invalid composition is rejected with diagnostic information. The grammar primitives themselves are compiled Rust with full type safety; the validation pipeline ensures that compositions of those primitives respect the structural invariants that make them safe.
Capability schema derivation:
Because handlers are compositions of typed primitives, their capability schemas can be derived from the composition rather than hand-authored. The input contract is the first primitive’s input type, parameterized by the handler’s configuration. The output contract is the last primitive’s output type (or the mutation primitive’s confirmation type). The error modes are the union of each primitive’s error types. This derivation is mechanistic and always accurate — the capability schema cannot drift from the implementation because it is generated from the same type definitions.
Any client (LLM planner, agent, MCP tool) can inspect what a handler composition will accept and produce before it executes.
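The derivation can be sketched mechanically (field names and record shape are illustrative): input contract from the first primitive, output contract from the mutation primitive if one exists (otherwise the last primitive), and error modes as the deduplicated union across primitives.

```python
def derive_capability_schema(primitives):
    """Derive a handler's capability schema from its composition."""
    mutation = next((p for p in primitives if p.get("is_mutating")), None)
    output_source = mutation or primitives[-1]
    error_modes = []
    for p in primitives:
        for mode in p.get("error_modes", []):
            entry = dict(mode, source=p["type"])  # record which primitive can fail
            if entry not in error_modes:
                error_modes.append(entry)
    return {
        "input_contract": primitives[0]["input_contract"],
        "output_contract": output_source["output_contract"],
        "error_modes": error_modes,
    }

schema = derive_capability_schema([
    {"type": "Acquire", "input_contract": "HttpRequestInput",
     "output_contract": "HttpResponse",
     "error_modes": [{"type": "timeout", "retryable": True}]},
    {"type": "Transform", "input_contract": "HttpResponse",
     "output_contract": "HttpRequestOutput",
     "error_modes": [{"type": "extraction_failed", "retryable": False}]},
])
assert schema["input_contract"] == "HttpRequestInput"
assert schema["output_contract"] == "HttpRequestOutput"
assert [e["type"] for e in schema["error_modes"]] == ["timeout", "extraction_failed"]
```

Because the schema is computed from the composition rather than written alongside it, regenerating it after any composition change keeps it from drifting.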
Open questions:
- Should composition specifications support conditional primitives? (If condition X, include Validate; otherwise skip.)
- How should specifications express branching compositions? (Multiple output paths from a single primitive.)
- Should the MCP server / `tasker-ctl` provide a `compose` command that helps build handler specifications interactively?
- How should error modes be derived for compositions with non-linear data flows?
- How should compositions with zero mutations (pure-read handlers) be distinguished from compositions where the mutation was accidentally omitted?
5. Common Patterns Library
Question: What does the library of named, documented composition patterns look like, and how does it serve developers and planners?
The common patterns library is a collection of named, well-tested, documented handler composition specifications. It is a documentation and convenience layer, not a distinct runtime concept. Every pattern resolves to a standard grammar composition at execution time.
Research approach:
- Identify recurring composition patterns from Phase 0 MCP server usage and existing Tasker workflows
- Design a pattern specification format that is both human-readable and machine-consumable
- Validate that patterns are discoverable by LLM planners through capability schemas
Pattern specification:
Each pattern is a named, parameterized composition specification:
name: http_request
description: >
Makes an HTTP request to an external service with configurable authentication,
error handling, and response extraction. Composed from Acquire + Transform + Validate
primitives with retry and observability mixins.
composition:
- Acquire(HttpSource)
- Transform(ResponseExtract)
- Validate(StatusCodeCheck)
mixins: [WithRetry, WithObservability, WithTimeout]
parameters:
url: { type: string, required: true }
method: { type: enum, values: [GET, POST, PUT, PATCH, DELETE], default: GET }
headers: { type: map, default: {} }
body: { type: object, required_when: "method in [POST, PUT, PATCH]" }
auth: { type: AuthConfig, default: none }
response_extract: { type: string, description: "JSONPath for response extraction" }
expected_status: { type: array, default: [200] }
timeout_ms: { type: integer, default: 5000 }
retry: { type: RetryConfig, default: { max_attempts: 3, backoff: exponential } }
# Derived from composition types — not hand-authored
input_contract: HttpRequestInput
output_contract: HttpRequestOutput
error_modes:
- { type: timeout, retryable: true, source: Acquire }
- { type: unexpected_status, retryable: false, source: Validate }
- { type: extraction_failed, retryable: false, source: Transform }
When a template references "pattern": "http_request" with parameters, the system resolves the pattern to a composition specification, applies the parameters, and validates through the standard pipeline. The pattern is a shorthand, not a different execution path.
Proposed initial patterns:
| Pattern | Composition | Key Parameters |
|---|---|---|
| http_request | Acquire(Http) → Transform(Extract) → Validate(Status) | URL, method, headers, body, auth, extraction path |
| transform | Transform(Reshape) | Input mapping, output schema, transformation rules |
| validate | Validate(Schema) | JSON Schema, error strategy (fail/flag/filter) |
| fan_out | FanOut(Partition) | Data source, partition strategy, max concurrency |
| aggregate | Aggregate(Reduce) | Reduction strategy, failure threshold, output schema |
| gate | Gate(Approval) → Emit(Notification) | Notification config, approval criteria, timeout |
| notify | Emit(Channel) | Channel type (webhook/email/slack), template, recipients |
| decide | Decide(Rules) | Decision logic config, possible outcomes, routing rules |
| persist | Validate(PreCheck) → Persist(Target) | Target system, operation, idempotency key |
Open questions:
- Should patterns support versioning? (Upgrading a composition without breaking existing templates.)
- How should organization-specific patterns be registered alongside standard ones?
- Should a pattern’s composition be directly inspectable by clients? (Useful for LLMs that want to understand what a pattern does before using it, and for agents building modified compositions from a pattern as a starting point.)
6. FFI Surface for Polyglot Consumption
Question: How do Python, Ruby, and TypeScript developers use Rust-implemented action grammars?
Research approach:
- Design the boundary between Rust grammar layer and polyglot developer layer
- Evaluate whether polyglot developers interact with grammar primitives directly or only through composed handlers
- Prototype FFI bindings for grammar handler invocation
Design principle: Polyglot developers should not need to understand Rust. They interact with action grammars through three paths:
Path 1: Configuration-driven pattern usage. A developer references a common pattern in their task template YAML. The handler executes in the Rust grammar worker. No language-specific code needed:
steps:
- name: fetch_records
handler:
pattern: http_request
config:
url: "https://api.example.com/records"
method: GET
response_extract: "$.data"
Path 2: Language-specific DSL wrappers. For developers who want to use grammar primitives alongside their own business logic in the same step, thin FFI wrappers expose primitives through each language’s DSL:
@step_handler("enrich_and_validate")
@depends_on(records="fetch_records")
def enrich_and_validate(records, context):
# Use grammar primitives through FFI wrapper
validated = context.grammar.validate(records, schema="record_v2")
enriched = context.grammar.http_request(
url="https://enrichment.example.com/v2/enrich",
method="POST",
body={"records": validated.valid_records}
)
return {"enriched": enriched, "invalid": validated.invalid_records}
Path 3: Composition specification. For agents, LLM planners, and any client that prefers structured data over code, handler compositions are specified as JSON and executed through the grammar worker infrastructure without any language-specific code. This is the primary path for dynamic composition.
The FFI wrapper handles serialization/deserialization across the language boundary. The Rust grammar layer executes the primitive with full type checking. The developer gets the safety of the grammar without leaving their language.
Open questions:
- Should FFI wrappers expose individual primitives or only composed handlers?
- What is the serialization overhead of crossing the FFI boundary for each primitive call? (May need batching or composition-level FFI rather than primitive-level.)
- How should errors from Rust grammar execution be translated to language-idiomatic exceptions?
- Should there be a “local grammar worker” mode where handlers execute in-process rather than through queue dispatch?
Prototyping Goals
Prototype 1: Primitive Framework and Basic Compositions
Objective: Implement the ActionPrimitive trait system, Acquire, Transform, and Validate primitives, and demonstrate composition with compile-time type checking.
Success criteria:
- Primitive trait with associated Input/Output types compiles and works
- Valid compositions compile; invalid compositions (type mismatch) fail to compile with clear error messages
- A simple composition (Acquire → Transform → Validate) executes correctly with test data
- Mixin wrappers (WithRetry, WithObservability) compose with core primitives
Prototype 2: Composition Validation and the Single-Mutation Boundary
Objective: Demonstrate the validation pipeline, including single-mutation boundary enforcement, contract checking, and diagnostic output for invalid compositions.
Success criteria:
- Validation catches incompatible primitive chains (output/input mismatch)
- Validation catches invalid configurations (missing required fields, wrong types)
- Validation rejects compositions with multiple mutating primitives
- Validation rejects compositions where fallible actions follow the mutation
- Invalid compositions produce actionable diagnostic messages
- Valid compositions execute through the grammar worker with correct lifecycle
Prototype 3: Common Pattern Resolution and Execution
Objective: Build the http_request common pattern and demonstrate that named pattern references resolve to standard compositions.
Success criteria:
- `http_request` pattern assembled from Acquire(Http) → Transform(Extract) → Validate(Status)
- Pattern reference with parameters resolves to a composition specification
- Resolved composition passes the standard validation pipeline
- Handler executes correctly through the grammar worker
- Capability schema derived from composition matches expected output
Prototype 4: Capability Schema Generation
Objective: Generate capability schemas from handler compositions and validate they enable LLM planning.
Success criteria:
- Capability schemas derived automatically from composition specifications
- Claude can generate valid handler compositions when provided with capability schemas and grammar rules
- Generated compositions pass the validation pipeline including single-mutation boundary check
- Schemas include composition information (what primitives are involved, what error modes are possible)
Prototype 5: FFI Surface
Objective: Validate that polyglot developers can use grammar handlers through FFI wrappers.
Success criteria:
- Python, Ruby, and TypeScript can invoke grammar primitives through FFI wrappers
- Serialization/deserialization across language boundary is correct
- Error propagation works (Rust errors become language-idiomatic exceptions)
- Performance overhead of FFI boundary is acceptable
Validation Criteria for Phase Completion
- Action grammar primitive framework implemented in Rust with compile-time data contracts
- At least 7 primitives implemented (recommend: Acquire, Transform, Validate, Gate, Decide, FanOut, Aggregate) plus at least 2 mutating primitives (recommend: Persist, Emit)
- Composition validation pipeline operational, enforcing contract compatibility, single-mutation boundary, and configuration validity
- At least 5 common patterns documented and resolvable (recommend: http_request, transform, validate, gate, notify)
- At least 3 dynamically-composed handlers demonstrated — including at least one non-linear (branching/multi-source) composition
- Capability schemas derived automatically from composition specifications
- A grammar worker deployment exists that validates and executes handler compositions
- At least 3 example workflows authored using only grammar-composed handlers (no custom code)
- FFI wrappers available for at least Python and TypeScript
- Inter-step data flow works correctly with grammar-composed handlers in both static and conditional workflows
- Documentation in `tasker-contrib` covering grammar primitives, handler composition, common patterns, and extension patterns
Relationship to Other Phases
- Phase 0 informs this phase: patterns from MCP server usage and TAS-280 code generation reveal which primitives and compositions are needed.
- Phase 2 depends on this phase: the planning interface generates workflow fragments that reference grammar compositions.
- Phase 3 uses this phase: recursive planning composes grammar primitives across multiple planning phases.
- Agent integration uses this phase: agents composing research workflows can assemble handler compositions for investigation steps directly from the grammar vocabulary.
- This phase is independently valuable regardless of whether subsequent phases are implemented.
This document will be updated as Phase 0 progresses and reveals design insights, and as prototyping reveals which composition patterns work well in practice.
Agent Orchestration: Deterministic Infrastructure for Autonomous Clients
How agents use Tasker’s execution guarantees for investigation, planning, and coordinated action
Overview
Tasker is not an agent orchestration framework. Tasker is deterministic workflow infrastructure that agents use as clients.
This distinction is not merely semantic. An agent orchestration framework manages agent state, controls agent behavior, coordinates agent-to-agent communication, and is responsible for the correctness of agent decisions. Such a framework would require abandoning the properties that make Tasker trustworthy: determinism, predictability, transactional guarantees, and clean observability. That path is explicitly rejected.
What this document describes is the inverse: agents as external clients that leverage Tasker’s existing capabilities for their own purposes. An agent submitting a task to Tasker is architecturally indistinguishable from a human developer or an application submitting a task. The same API. The same guarantees. The same resource controls. The same observability. Tasker does not know or care that the client is an agent — it provides infrastructure and the agent provides intent.
This capability is not a separate phase of the generative workflow initiative. It is a cross-cutting concern that composes with every phase. Phase 0’s MCP server provides the agent’s design-time interface. Phase 1’s action grammars provide the compositional vocabulary. Phase 2’s planning interface provides the runtime planning mechanism. Phase 3’s recursive planning provides the multi-phase coordination pattern. Agent integration enriches and is enriched by each of these capabilities.
The Context Decomposition Problem
The most powerful capability of agent systems is not reasoning — it’s delegation. When a human expert approaches a complex problem, they don’t try to hold everything in their head at once. They decompose: research this aspect, analyze that data source, compare these options, synthesize the findings, then decide. Each activity produces intermediate results that inform the next. The expert’s working memory is bounded, but their problem-solving is not, because they structure their investigation to manage that bound.
LLMs face the same constraint, amplified. A context window is finite. Reasoning quality degrades as context length grows. Holding every fact needed for a complex decision simultaneously is often impossible and always suboptimal. The most effective agent architectures recognize this and delegate: spinning up sub-agents for research, parallelizing analysis across multiple sources, aggregating findings, and making decisions with the benefit of structured, converged information.
This delegation needs infrastructure. Specifically, it needs:
- Parallel execution with convergence: investigate multiple dimensions simultaneously, then bring the findings together
- Transactional guarantees: each investigation step either completes fully or fails cleanly — no half-finished research polluting the decision context
- Bounded resource consumption: investigation cannot run forever or consume unlimited resources
- Observability: every step of the investigation is traceable, timing is recorded, results are inspectable
- Retry semantics: transient failures in investigation (API timeouts, rate limits) are handled automatically
These are Tasker’s core capabilities. They exist today, are tested, and are approaching production readiness. The agent doesn’t need a new system for investigation — it needs access to the system that already provides these guarantees.
The Agent-Client Pattern
How Agents Interact with Tasker
An agent interacts with Tasker through two surfaces:
Design time: MCP server. The agent uses MCP tools to inspect available templates, query the action grammar vocabulary, validate proposed compositions, understand schema contracts, and generate template structures. This is the same MCP server that human developers use through their IDE — the agent simply uses it programmatically. The tasker-tooling crate (see Phase 0) provides the shared logic that powers both the CLI and MCP interfaces, ensuring consistent behavior regardless of who the client is.
Runtime: Tasker API. The agent creates tasks through the standard Tasker API — the same endpoint that applications use. A task creation request includes the template reference, input data, and optionally a parent_correlation_id linking it to a parent task or decision context. The agent receives the task UUID and can poll for completion or subscribe to completion events.
The parent_correlation_id Chain
The parent_correlation_id is an existing Tasker field designed for tracing second-order task dependencies. In the agent context, it becomes the thread that connects an agent’s entire reasoning chain:
Agent receives problem description
│
├── Creates research task (parent_correlation_id: agent_session_123)
│ ├── Step: query_api_source_a (fan-out)
│ ├── Step: query_api_source_b (fan-out)
│ ├── Step: analyze_documentation (fan-out)
│ └── Step: converge_findings (convergence)
│
├── Receives converged research results
│
├── Creates design task (parent_correlation_id: agent_session_123)
│ ├── Step: plan_workflow (planning step, Phase 2)
│ ├── Step: validate_design (grammar handler)
│ └── Step: converge_design (convergence)
│
├── Receives validated workflow design
│
└── Creates execution task (parent_correlation_id: agent_session_123)
├── Step: ... (the actual workflow)
└── Step: ...
Every task in this chain shares the same parent_correlation_id. Standard Tasker telemetry — step timing, results, failures, retries — is emitted for every step. An operator can trace the entire agent reasoning chain through a single correlation ID query, seeing what the agent investigated, what it designed, and how the design executed.
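A sketch of what that single-key trace enables on the client side (the record shape is invented for illustration, not Tasker's API response format): given task records carrying `parent_correlation_id`, the whole reasoning chain reassembles from one value.

```python
def reasoning_chain(tasks, correlation_id):
    """Return one agent session's task chain, in creation order."""
    chain = [t for t in tasks if t["parent_correlation_id"] == correlation_id]
    return sorted(chain, key=lambda t: t["created_at"])

# Illustrative records an operator or agent might fetch from the API
tasks = [
    {"uuid": "t1", "name": "research", "parent_correlation_id": "agent_session_123",
     "created_at": 1},
    {"uuid": "t9", "name": "unrelated", "parent_correlation_id": "other_session",
     "created_at": 2},
    {"uuid": "t2", "name": "design", "parent_correlation_id": "agent_session_123",
     "created_at": 3},
    {"uuid": "t3", "name": "execution", "parent_correlation_id": "agent_session_123",
     "created_at": 4},
]
chain = reasoning_chain(tasks, "agent_session_123")
assert [t["name"] for t in chain] == ["research", "design", "execution"]
```

The same grouping applied to step-level telemetry yields the per-step timings, retries, and results for the whole chain.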
Bounded Delegation
An agent creating tasks is subject to the same resource controls as any Tasker client:
- Task-level resource bounds: max steps, max depth, timeout, cost budget — set per template or per task creation request
- Rate limiting: standard API rate limits prevent an agent from overwhelming the system
- Template constraints: the templates available to the agent define the shape of what it can create — an agent cannot create arbitrary topologies unless using templates designed for dynamic composition
- Per-task isolation: each task is independent. A failure in one agent-created task does not affect others.
The agent cannot create unbounded task chains. Each task has its own budget. The total resource consumption of an agent’s reasoning chain is the sum of its individual task budgets, each of which is independently bounded.
Research Workflow Patterns
The primary use case for agent-as-client is the research workflow: a task whose purpose is to gather, analyze, and synthesize information that the agent needs to make a decision. Research workflows are not the agent’s final product — they are a preliminary phase that produces the context the agent uses for its actual work.
Pattern 1: Parallel Investigation
The simplest research pattern fans out investigation across multiple sources and converges the findings:
name: agent_parallel_investigation
namespace_name: agent_research
version: 1.0.0
description: "Fan-out investigation across multiple sources with convergence"
input_schema:
  type: object
  required: [investigation_queries, convergence_strategy]
  properties:
    investigation_queries:
      type: array
      items:
        type: object
        properties:
          source: { type: string }
          query: { type: string }
          expected_shape: { type: object }
    convergence_strategy:
      type: string
      enum: [merge, summarize, compare]
steps:
  - name: investigate
    type: batchable
    handler:
      grammar: http_request  # Common pattern, or dynamic composition per source
    dependencies: []
    batch:
      strategy: per_item
      source_path: "$.investigation_queries"
      max_concurrency: 10
  - name: converge_findings
    type: deferred
    handler:
      grammar: aggregate
      config:
        strategy: "${task.context.convergence_strategy}"
    dependencies:
      - investigate
The agent provides the investigation queries and convergence strategy. Tasker handles the parallel execution, retry semantics, and convergence. The agent receives a single converged result — a structured synthesis of all investigation threads.
Pattern 2: Staged Investigation
When the second phase of research depends on the first phase’s findings:
name: agent_staged_investigation
namespace_name: agent_research
version: 1.0.0
description: "Multi-stage investigation where each stage informs the next"
steps:
  - name: initial_reconnaissance
    type: batchable
    handler:
      grammar: http_request
    dependencies: []
    batch:
      strategy: per_item
      source_path: "$.recon_queries"
  - name: analyze_recon
    type: standard
    handler:
      callable: agent_research.analyze_findings
    dependencies:
      - initial_reconnaissance
  - name: deep_investigation
    type: decision_point
    handler:
      callable: agent_research.plan_deep_dive
    dependencies:
      - analyze_recon
    # Decision handler examines recon results and creates
    # targeted follow-up investigation steps
  - name: synthesize
    type: deferred
    handler:
      grammar: aggregate
      config:
        strategy: summarize
    dependencies:
      - deep_investigation
This pattern uses Tasker’s existing decision point mechanism — the plan_deep_dive handler examines reconnaissance results and creates targeted follow-up steps. The convergence step waits for whatever investigation steps were created.
Pattern 3: Investigation with Dynamic Compositions
When the investigation requires operations that don’t map to common patterns, the agent can compose handlers dynamically from grammar primitives:
steps:
  - name: custom_analysis
    type: standard
    handler:
      composition:
        primitives:
          - type: Acquire
            variant: HttpSource
            config:
              url: "${step_inputs.analysis_endpoint}"
              method: POST
              body: "${step_inputs.analysis_payload}"
          - type: Transform
            variant: FieldExtract
            config:
              source_path: "$.analysis.findings"
          - type: Validate
            variant: SchemaCheck
            config:
              schema_ref: "analysis_result_v1"
          - type: Transform
            variant: Reshape
            config:
              target_shape: "investigation_summary"
        mixins: [WithRetry, WithObservability]
    dependencies:
      - initial_data_gathering
The composition is assembled from grammar primitives at task creation time, validated against the grammar’s structural invariants (including the single-mutation boundary), and executed with the same guarantees as any common pattern. The agent gets exactly the analysis pipeline it needs without registering a new pattern.
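To make the single-mutation boundary concrete, here is a hedged sketch of what a composition validator might check. The primitive names, the assumption that only `Emit` mutates, and the `input`/`output` contract fields are all illustrative, not Tasker's actual validation code:

```python
# Sketch of composition validation under two assumed rules:
# (1) at most one mutating primitive per composition, and it must come last;
# (2) each primitive's output contract must match the next primitive's input.
# The classification of "Emit" as the only mutating type is an assumption.

MUTATING_TYPES = {"Emit"}

def validate_composition(primitives: list[dict]) -> list[str]:
    errors = []
    mutating = [i for i, p in enumerate(primitives) if p["type"] in MUTATING_TYPES]
    if len(mutating) > 1:
        errors.append("single-mutation boundary violated: multiple mutating primitives")
    elif mutating and mutating[0] != len(primitives) - 1:
        errors.append("mutating primitive must terminate the composition")
    # pairwise contract compatibility check
    for a, b in zip(primitives, primitives[1:]):
        if a["output"] != b["input"]:
            errors.append(f"contract mismatch: {a['type']} -> {b['type']}")
    return errors

pipeline = [
    {"type": "Acquire", "input": "none", "output": "json"},
    {"type": "Transform", "input": "json", "output": "record"},
    {"type": "Validate", "input": "record", "output": "record"},
]
assert validate_composition(pipeline) == []  # read-only chain, compatible contracts
```

An invalid composition (say, two `Emit` primitives) would be rejected here, before any step executes, which is the point the surrounding text makes about assembly-time validation.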
Pattern 4: Agent-Created Planning Tasks
When the agent wants to use Tasker’s LLM planning capabilities (Phase 2) for part of its investigation:
name: agent_planned_investigation
namespace_name: agent_research
version: 1.0.0
description: "Agent-initiated task with LLM planning for investigation strategy"
steps:
  - name: gather_context
    type: standard
    handler:
      callable: agent_research.prepare_context
    dependencies: []
  - name: plan_investigation
    type: planning
    handler:
      grammar: planning_step
      config:
        model: claude-sonnet-4-5-20250929
        capability_schema: standard_v1
        max_fragment_steps: 15
        max_fragment_depth: 2
        planning_prompt: |
          Given the gathered context, plan an investigation workflow
          that identifies the information needed to make a decision about
          the original problem. Use http_request for data gathering,
          validate for quality checking, and aggregate for synthesis.
        context_from:
          - gather_context
    dependencies:
      - gather_context
  - name: synthesize_for_agent
    type: deferred
    handler:
      callable: agent_research.package_results
    dependencies:
      - plan_investigation
This composes the agent-client pattern with Phase 2’s planning interface: the agent creates a task that contains a planning step, and the planning step generates the investigation workflow dynamically. The agent gets the benefit of both its own high-level reasoning (what to investigate) and the LLM planner’s tactical reasoning (how to structure the investigation).
The Shared Tooling Foundation
The tasker-tooling Crate
The capabilities that an agent needs at design time — template inspection, schema validation, grammar vocabulary queries, composition validation, code generation — are the same capabilities that tasker-ctl provides to human developers through the CLI. Building these capabilities twice (once for CLI, once for MCP) creates maintenance burden and risks behavioral divergence.
The tasker-tooling crate extracts the shared logic into a library consumed by both interfaces:
tasker-tooling (library crate)
├── template_parser — Parse and validate TaskTemplate YAML
├── schema_inspector — Inspect and compare result_schema contracts
├── codegen_engine — Generate typed handler scaffolds across languages
├── handler_resolver — Validate handler callable → registration mapping
├── grammar_vocabulary — Query available primitives, common patterns, composition rules
├── composition_validator — Validate grammar compositions against structural invariants
└── capability_schema — Derive and query capability schemas from compositions
tasker-ctl (binary crate)
├── CLI argument parsing
├── Terminal output formatting
└── Calls tasker-tooling functions
tasker-mcp (binary crate)
├── MCP protocol handling
├── Tool registration and dispatch
└── Calls tasker-tooling functions
Extraction timing: The tasker-tooling extraction should follow once TAS-280 stabilizes the codegen and validation APIs. Extracting too early risks premature abstraction if the API surface is still shifting. However, the mental model of “these are the same capabilities with different front-ends” should inform the tasker-ctl design now so the extraction is straightforward later.
MCP Server as Agent Interface
The MCP server exposes tasker-tooling capabilities as MCP tools. For agent integration, the key tools are:
| Tool | Purpose | Agent Use Case |
|---|---|---|
| template_inspect | Return template structure, schemas, dependencies | Agent understanding available workflows |
| template_validate | Validate a template for structural correctness | Agent verifying its generated templates |
| schema_compare | Check compatibility between step output and input schemas | Agent ensuring data flow correctness |
| grammar_vocabulary | List available primitives, their contracts, composition rules | Agent composing handlers from grammar primitives |
| composition_validate | Validate a grammar composition against structural invariants | Agent checking compositions before task submission |
| capability_query | Query common patterns and primitives by capability (what they do) rather than name | Agent discovering relevant patterns for a problem |
| template_generate | Generate a template from structured description | Agent creating investigation templates |
These tools are available to any MCP client — IDE extensions for human developers, agent frameworks, or standalone LLM sessions. The tooling layer doesn’t distinguish between clients; it provides capabilities and lets the client use them as appropriate.
Observability for Agent Workflows
Tracing Agent Reasoning Chains
The parent_correlation_id field connects all tasks in an agent’s reasoning chain. Standard Tasker telemetry — step timing, results, failures, retries — is emitted for every step in every task. No special agent-aware telemetry is needed.
What is needed is the ability to query and visualize across the chain:
- “Show me all tasks with parent_correlation_id = X” — the complete reasoning chain
- “What was the total resource consumption across this chain?” — budget accounting
- “Which investigation step took the longest?” — performance analysis
- “What did the convergence step produce?” — investigation results
These are standard queries against existing telemetry, filtered by parent_correlation_id. The correlation field is the only mechanism needed; the rest is query design.
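As an illustration, all four chain-level queries reduce to filters and aggregates over telemetry records keyed by the correlation field. The record shape below is hypothetical, not Tasker's telemetry schema:

```python
# Illustrative telemetry records; field names are assumptions, not Tasker's schema.
records = [
    {"task_id": "t1", "parent_correlation_id": "X", "step": "investigate", "ms": 420, "cost": 0.12},
    {"task_id": "t1", "parent_correlation_id": "X", "step": "converge_findings", "ms": 80, "cost": 0.01},
    {"task_id": "t2", "parent_correlation_id": "X", "step": "investigate", "ms": 950, "cost": 0.30},
    {"task_id": "t9", "parent_correlation_id": "Y", "step": "investigate", "ms": 100, "cost": 0.05},
]

chain = [r for r in records if r["parent_correlation_id"] == "X"]  # the complete chain
total_cost = sum(r["cost"] for r in chain)                         # budget accounting
slowest = max(chain, key=lambda r: r["ms"])                        # performance analysis
print(sorted({r["task_id"] for r in chain}), round(total_cost, 2), slowest["step"])
# ['t1', 't2'] 0.43 investigate
```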
What Tasker Observes vs. What the Agent Observes
There is a clear boundary between what Tasker can observe and what it cannot:
Tasker observes:
- Every step’s execution: timing, inputs, outputs, retries, failures
- Task lifecycle: creation, progress, completion
- Resource consumption: steps created, time elapsed, cost incurred
- Correlation: parent_correlation_id linking related tasks
Tasker does not observe:
- The agent’s internal reasoning between tasks
- Why the agent chose to create a particular investigation task
- The agent’s interpretation of results before creating the next task
- Agent-side failures (network errors calling the API, agent crashes)
This boundary is clean and intentional. Tasker provides infrastructure telemetry. The agent’s own reasoning and decision-making are the agent framework’s responsibility to observe and debug. The handoff point — task creation and result retrieval — is fully observable on both sides.
Recommended Agent-Side Practices
While Tasker cannot enforce these, agents interacting with Tasker should:
- Include descriptive metadata in task context explaining why this investigation is being conducted
- Use consistent parent_correlation_id values across a reasoning chain
- Set appropriate resource bounds on investigation tasks (don’t use defaults for research workflows)
- Log their interpretation of investigation results before creating follow-up tasks
- Handle task failures gracefully — a failed investigation task is information, not necessarily a fatal error
What This Is Not
It is not agent-to-agent communication. Tasker does not route messages between agents, maintain agent registries, or coordinate agent handoffs. If agents need to communicate, they do so through their own infrastructure. Tasker provides task results, not messaging.
It is not persistent agent memory. Tasker does not maintain state across agent sessions. Each task is independent. If an agent needs to remember what it learned from a previous investigation, it maintains that state externally and provides it as task context in subsequent requests.
It is not autonomous scheduling. Tasker does not decide when an agent should investigate or what it should investigate. The agent makes those decisions. Tasker executes what the agent requests, with the guarantees the agent needs.
It is not a “swarm.” The vision is not a swarm of agents coordinating through Tasker. The vision is a thoughtful agent — or a small number of purpose-specific agents — using Tasker’s deterministic infrastructure to structure their investigation and execution. The value is not in scale of agents but in the quality of the infrastructure supporting each agent’s reasoning.
Relationship to Existing Phases
| Phase | Agent Integration Point |
|---|---|
| Phase 0: Foundation | MCP server provides design-time interface. tasker-tooling crate powers both human and agent interactions. Agent can generate templates and validate schemas. |
| Phase 1: Action Grammars | Agent can compose handlers from grammar primitives for investigation steps. Grammar vocabulary and common patterns are queryable through MCP tools. |
| Phase 2: Planning Interface | Agent can create tasks containing planning steps, combining its high-level reasoning with LLM planning’s tactical composition. |
| Phase 3: Recursive Planning | Agent-created investigation tasks complement recursive planning — the agent handles task-level decomposition while planning steps handle step-level composition. |
| Complexity Management | Agent task chains use existing observability through parent_correlation_id. No new telemetry infrastructure needed. |
Agent integration is not “Phase N.” It is a capability that emerges from the properties Tasker already has and becomes richer as each phase adds vocabulary and tooling.
Future Considerations
Research Workflow Template Library
As agents use Tasker for investigation, common patterns will emerge. A library of research workflow templates in tasker-contrib — parallel investigation, staged analysis, decision-point-driven deep dives — would reduce the overhead for agents building investigation workflows. These templates would use common grammar patterns and support dynamic composition for custom analysis steps.
Agent SDK / Client Library
While agents can interact with Tasker through the standard API, a thin client library that encapsulates common patterns (create investigation task, wait for results, parse converged findings) would reduce integration friction. This is a convenience layer, not a new abstraction — it wraps existing API calls in agent-friendly patterns.
Feedback Loops
Over time, patterns in how agents use investigation results to inform workflow design could feed back into the grammar’s common patterns. If agents consistently compose similar handlers for research, those compositions become candidates for named common patterns with additional testing and documentation. If investigation workflows consistently follow certain patterns, those patterns become template library entries. The system learns from its agent clients’ behavior, not through agent-internal mechanisms, but through observable patterns in task creation and execution.
This document describes a cross-cutting capability, not a phase. Agent integration composes with all phases of the generative workflow initiative and requires no changes to Tasker’s orchestration runtime. It is enabled by existing infrastructure and enriched by each subsequent phase.
Phase 2: Planning Interface
LLM-backed planning steps and workflow fragment generation from action grammar primitives
Phase Summary
The planning interface introduces a new step handler type — the planning step — that uses an LLM to generate workflow fragments composed from action grammar primitives. The orchestration layer validates these fragments against the grammar’s structural invariants (contract compatibility, single-mutation boundary) and materializes them through the existing transactional step creation infrastructure.
This is the phase where generative planning meets deterministic execution. The contract is: given context and a capability vocabulary, produce a valid plan. The system validates the plan’s grammar compositions against structural invariants, checks the DAG structure, and executes with full transactional guarantees. The LLM reasons; the system guarantees.
Planning steps can appear in any task — including tasks created by agent clients (see Agent Orchestration). An agent that creates a research task containing a planning step gets the benefit of both its own high-level reasoning (what to investigate) and the LLM planner’s tactical composition (how to structure the investigation). The planning step doesn’t know or care that its parent task was agent-created; it generates and validates fragments the same way in all contexts.
Phase 0’s MCP server experience directly informs this phase — the same prompt engineering patterns, validation feedback loops, and structured output strategies that work for developer-time authoring apply to runtime planning.
Research Areas
1. Workflow Fragment Schema
Question: What is the structural representation of a workflow fragment that a planning step produces?
Research approach:
- Start from the existing DecisionPointOutcome::CreateSteps { step_names } and extend it to carry full step specifications including grammar compositions
- Evaluate what minimum information is needed to materialize a step (common pattern reference, dynamic grammar composition, or application callable, plus inputs, dependencies, configuration)
- Design for validation: the schema should make it structurally impossible to express certain invalid states
Proposed fragment structure:
{
  "fragment_version": "1.0",
  "planning_context": {
    "goal": "Process and enrich customer records from CSV upload",
    "reasoning": "The dataset contains 5000 records requiring validation, API enrichment, and categorization...",
    "estimated_steps": 7,
    "estimated_depth": 3
  },
  "steps": [
    {
      "name": "validate_records",
      "handler": {
        "grammar": "validate",
        "config": {
          "schema": { "$ref": "customer_record_v2" },
          "on_invalid": "flag"
        }
      },
      "dependencies": [],
      "inputs": {
        "data": "${task.context.uploaded_records}"
      }
    },
    {
      "name": "enrich_valid",
      "handler": {
        "grammar": "http_request",
        "config": {
          "url": "https://enrichment-api.example.com/v2/enrich",
          "method": "POST",
          "body": { "records": "${steps.validate_records.result.valid_records}" }
        }
      },
      "dependencies": ["validate_records"]
    },
    {
      "name": "categorize",
      "handler": {
        "composition": {
          "primitives": [
            {
              "type": "Transform",
              "variant": "Categorize",
              "config": {
                "categories": ["premium", "standard", "review"],
                "rules": { "$ref": "categorization_rules_v1" }
              }
            },
            {
              "type": "Validate",
              "variant": "CategoryRules",
              "config": {
                "schema_ref": "categorized_record_v1"
              },
              "input_mapping": {
                "data": "$.previous.categorized_data"
              }
            }
          ],
          "mixins": ["WithObservability"]
        }
      },
      "dependencies": ["enrich_valid"]
    },
    {
      "name": "converge_results",
      "handler": {
        "grammar": "aggregate",
        "config": {
          "strategy": "merge",
          "output_key": "final_results"
        }
      },
      "dependencies": ["categorize"],
      "step_type": "deferred"
    }
  ],
  "convergence": "converge_results",
  "resource_bounds": {
    "max_downstream_steps": 15,
    "max_downstream_depth": 2
  }
}
Note the three ways a fragment can reference handlers: by common pattern name ("grammar": "validate") for named grammar compositions, by direct composition ("composition": {...}) for dynamically assembled grammar compositions, or by application-specific callable ("callable": "...") for developer-authored handlers. The first two are the same composition model — a common pattern is simply a named, well-tested composition specification. Both are validated at assembly time against structural invariants (contract compatibility, single-mutation boundary). Callable references are validated against the handler resolver.
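A sketch of how a resolver might classify the three reference forms. The keys (`grammar`, `composition`, `callable`) mirror the fragment examples in this document; the function itself is illustrative, not Tasker's handler resolver:

```python
# Classify a fragment handler reference into one of the three forms
# described above. Purely illustrative dispatch logic.
def handler_kind(handler: dict) -> str:
    if "grammar" in handler:
        return "common_pattern"        # named, well-tested composition
    if "composition" in handler:
        return "dynamic_composition"   # assembled from primitives, validated at assembly time
    if "callable" in handler:
        return "application_callable"  # developer-authored, checked against the handler resolver
    raise ValueError("handler must reference a pattern, composition, or callable")

assert handler_kind({"grammar": "validate"}) == "common_pattern"
assert handler_kind({"composition": {"primitives": []}}) == "dynamic_composition"
assert handler_kind({"callable": "agent_research.analyze_findings"}) == "application_callable"
```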
Open questions:
- How should fragments reference the planning step’s own results? The planning step creates downstream steps, but those steps need to reference data that existed before planning.
- Should fragments support conditional sub-paths, or should every conditional branch require a nested planning step?
- How do we handle fragments that reference both grammar compositions and application-specific handlers in the same fragment?
- Should the LLM planner prefer common patterns when available and compose dynamically only for novel operations?
2. Fragment Validation Pipeline
Question: What validations must pass before a fragment is materialized?
Research approach:
- Enumerate all failure modes of an invalid fragment
- Design a validation pipeline that catches errors early with actionable diagnostics
- Leverage grammar structural invariants for composition validation
Proposed validation stages:
| Stage | Validates | Failure Mode |
|---|---|---|
| Schema validation | Fragment structure conforms to fragment schema | Malformed fragment — parsing error |
| Pattern reference check | Every common pattern reference exists in the registered patterns | Unknown pattern — planner hallucinated a capability |
| Composition structural validation | All grammar compositions have compatible input/output schemas across primitives and respect the single-mutation boundary | Contract mismatch or safety violation — planner composed incompatible or unsafe primitives |
| Configuration validation | Handler config matches the handler’s configuration JSON Schema | Invalid config — wrong types, missing required fields |
| DAG validation | Dependencies form a valid acyclic graph | Cycle detected, orphan steps, unreachable convergence |
| Input reference resolution | All ${step_reference} paths resolve to steps in the fragment or existing task context | Dangling reference — step references nonexistent upstream |
| Data contract compatibility | Output contracts of upstream steps match input contracts of downstream steps | Shape mismatch — data flow is inconsistent |
| Resource bound check | Total steps, depth, fan-out factor within configured limits | Plan exceeds bounds — too large, too deep, too expensive |
| Convergence validation | Deferred steps have valid intersection semantics with fragment steps | Convergence cannot resolve — no path to terminal state |
The composition structural validation stage uses the same grammar contract metadata that the Rust compiler uses for primitive verification, applied at assembly time to LLM-generated composition specifications. Primitives are compile-time verified Rust; compositions are validated at assembly time against structural invariants (contract compatibility, single-mutation boundary). An invalid composition is rejected at planning validation, not at step execution.
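As a sketch, the DAG-validation stage alone might look like this. The fragment field names follow the proposed schema above; the algorithm choices (DFS coloring for cycle detection, backward reachability from the convergence step) are assumptions about one reasonable implementation, not Tasker's:

```python
# Detect cycles, dangling dependency references, and steps unreachable
# from the declared convergence step. Illustrative only.
def validate_dag(steps: list[dict], convergence: str) -> list[str]:
    names = {s["name"] for s in steps}
    deps = {s["name"]: s.get("dependencies", []) for s in steps}
    errors = [f"dangling reference: {d}"
              for ds in deps.values() for d in ds if d not in names]

    # cycle detection via DFS coloring
    WHITE, GRAY, BLACK = 0, 1, 2
    color = dict.fromkeys(names, WHITE)
    def visit(n):
        color[n] = GRAY
        for d in deps[n]:
            if d in names:
                if color[d] == GRAY:
                    errors.append(f"cycle through {d}")
                elif color[d] == WHITE:
                    visit(d)
        color[n] = BLACK
    for n in names:
        if color[n] == WHITE:
            visit(n)

    # every step must be an ancestor of (or equal to) the convergence step
    reachable, frontier = set(), [convergence]
    while frontier:
        n = frontier.pop()
        if n in names and n not in reachable:
            reachable.add(n)
            frontier.extend(deps[n])
    errors += [f"unreachable from convergence: {n}" for n in names - reachable]
    return errors

fragment = [
    {"name": "validate_records", "dependencies": []},
    {"name": "enrich_valid", "dependencies": ["validate_records"]},
    {"name": "converge_results", "dependencies": ["enrich_valid"]},
]
assert validate_dag(fragment, "converge_results") == []
```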
Open questions:
- Should validation be a single pass or multi-pass? (Early termination vs. collecting all errors)
- Should the planner receive validation feedback and be able to revise? (Planning → validate → revise loop)
- What diagnostic information should be stored when a fragment fails validation? (For observability and planner improvement)
- Should there be a “simulation” mode that validates and reports without materializing?
- How many planning attempts should be allowed before failing? (One shot? Up to 3 with validation feedback?)
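One of these open questions, the plan → validate → revise loop with a bounded attempt count, can be sketched generically. Nothing below is Tasker API; `plan` stands in for the LLM call and `validate` for a validation pipeline:

```python
# Bounded planning loop: the planner receives the previous attempt's
# diagnostics as feedback and has a fixed number of attempts.
def plan_with_revision(plan, validate, max_attempts=3):
    feedback = []
    for attempt in range(1, max_attempts + 1):
        fragment = plan(feedback)
        errors = validate(fragment)
        if not errors:
            return fragment, attempt
        feedback = errors  # feed diagnostics back into the next attempt
    raise RuntimeError(f"planning failed after {max_attempts} attempts: {feedback}")

# Toy planner that only produces a valid fragment once it has seen feedback.
def toy_plan(feedback):
    return {"steps": ["a", "b"]} if feedback else {"steps": []}

fragment, attempts = plan_with_revision(
    toy_plan, lambda f: [] if f["steps"] else ["empty fragment"]
)
assert attempts == 2  # first attempt rejected, second succeeded with feedback
```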
3. LLM Integration Adapter
Question: How should the planning step interface with LLM APIs?
Research approach:
- Build on Phase 0 MCP server experience for prompt engineering and structured output patterns
- Design an adapter pattern that supports multiple LLM providers
- Evaluate context window management strategies for grammar capability schemas
Provider abstraction: The planning step should not be coupled to a specific LLM API. An adapter interface should support at minimum Claude (Anthropic API) and OpenAI-compatible endpoints, with the ability to add providers.
Prompt construction: The planning prompt has several components, informed by MCP server experience:
- System context: “You are a workflow planner. Generate a workflow fragment using the available action grammar primitives and common patterns.”
- Capability schema: Machine-readable descriptions of available primitives, common patterns, and composition rules (derived from grammar types in Phase 1)
- Task context: The problem description, input data schema, and any accumulated results
- Planning constraints: Resource bounds, required convergence points, any domain-specific rules
- Output format: The fragment schema with examples showing common pattern references, dynamic compositions, and mixed fragments
- Validation feedback: If retrying, the validation errors from the previous attempt
Context window management: Capability schemas derived from grammar compositions can be large. Strategies (validated through Phase 0 MCP server experience):
- Tiered descriptions: primitive names + one-line descriptions always included; full type signatures included for primitives the LLM selects
- Category-based inclusion: data processing problems get full Transform/Validate schemas; API integration problems get full Acquire/Emit schemas
- Composition rules included as a concise reference — the planner needs to know the structural invariants (especially the single-mutation boundary) without seeing every possible combination
- Few-shot examples demonstrating common patterns and dynamic compositions
Structured output: Use function calling / tool use APIs where available to constrain the LLM’s output to valid fragment structures. Fall back to JSON mode with post-hoc validation where function calling isn’t supported.
Open questions:
- Should the planning step cache successful plans for similar problem descriptions?
- How should model selection work? (Configurable per planning step? Global default with overrides?)
- What telemetry should be emitted from the LLM call? (Token counts, latency, planning reasoning)
- Should the planner have access to the grammar’s type signatures, or only the derived capability schemas?
4. Fragment Materialization
Question: How does a validated fragment become real workflow steps?
Research approach:
- Study the existing ResultProcessingService path for decision point outcomes
- Determine what modifications are needed to support full step specifications with grammar compositions
- Validate transactional guarantees are preserved with richer creation payloads
The current flow for decision points is:
- Decision handler returns DecisionPointOutcome::CreateSteps { step_names }
- ResultProcessingService validates step names exist in the template
- Steps are created from template definitions in a single transaction
- Edges are created connecting the decision step to new steps
- New steps are enqueued
For planning steps, the flow extends to:
- Planning handler returns a validated workflow fragment
- Fragment materialization service creates steps from fragment specifications (not template)
- Steps are created with grammar compositions (common pattern references or dynamic compositions), configurations, and inputs in a single transaction
- Edges are created from the fragment’s dependency declarations
- New steps are enqueued for the appropriate namespace (grammar workers for grammar-composed steps and application workers for app-specific handlers)
The key difference: instead of looking up step definitions in a template, the materialization service uses the fragment’s step specifications directly. Grammar-composed steps route to grammar workers; application-specific steps route to their registered namespace.
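The all-or-nothing property can be illustrated with an in-memory stand-in for single-transaction step creation. This is not Tasker's materialization service, just a sketch of the rollback semantics the text describes:

```python
# In-memory stand-in for the single PostgreSQL transaction: steps and edges
# are created together, and any failure rolls the whole batch back.
class FakeDb:
    def __init__(self):
        self.steps, self.edges = [], []

    def materialize(self, fragment):
        steps_snapshot, edges_snapshot = list(self.steps), list(self.edges)
        try:
            for spec in fragment["steps"]:
                self.steps.append(spec["name"])
                for dep in spec.get("dependencies", []):
                    if dep not in self.steps:
                        raise ValueError(f"unknown dependency {dep}")
                    self.edges.append((dep, spec["name"]))
        except Exception:
            self.steps, self.edges = steps_snapshot, edges_snapshot  # rollback
            raise

db = FakeDb()
db.materialize({"steps": [{"name": "a"}, {"name": "b", "dependencies": ["a"]}]})
assert db.edges == [("a", "b")]
try:
    db.materialize({"steps": [{"name": "c", "dependencies": ["missing"]}]})
except ValueError:
    pass
assert db.steps == ["a", "b"]  # partial creation was rolled back
```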
Open questions:
- Should fragment materialization be a separate service or an extension of ResultProcessingService?
- How should the planning step’s own task template relate to the materialized fragment? Is it a “meta-template” that declares the planning step and convergence, with the middle filled in dynamically?
- What happens if materialization fails after partial creation? (Transaction should handle this, but worth explicit validation)
5. The Planning Step Template Pattern
Question: What does a task template look like when it includes planning steps?
Proposed template pattern:
name: adaptive_data_processing
namespace_name: dynamic_planning
version: 1.0.0
description: Process data with LLM-planned workflow
grammar_patterns: standard_v1
steps:
  - name: ingest_data
    type: standard
    handler:
      callable: DataIngestionHandler  # Application-specific
    dependencies: []
  - name: plan_processing
    type: planning  # New step type
    handler:
      grammar: planning_step
      config:
        model: claude-sonnet-4-5-20250929
        capability_schema: standard_v1
        max_fragment_steps: 20
        max_fragment_depth: 3
        allow_dynamic_composition: true  # Permit dynamic grammar composition beyond common patterns
        planning_prompt: |
          Given the ingested data characteristics, plan a processing
          workflow that validates, enriches, and categorizes the records.
          Use common patterns where available. Compose dynamically from
          grammar primitives for operations that don't map to existing patterns.
        context_from:
          - ingest_data
    dependencies:
      - ingest_data
  - name: finalize
    type: deferred
    handler:
      callable: FinalizationHandler  # Application-specific
    dependencies:
      - plan_processing  # Intersection semantics with planned steps
Key insight: The template defines the frame — what happens before planning and what happens after convergence. The middle is filled in by the planner using grammar compositions — common patterns for well-tested operations, dynamic compositions for novel operations. This preserves the template’s role as a structural contract while enabling dynamic topology. Developer-authored handlers (DataIngestionHandler, FinalizationHandler) coexist with grammar-composed planned steps in the same workflow.
The allow_dynamic_composition flag gives template authors control over whether the planner can compose novel handlers from primitives or must limit itself to common patterns. This is a safety lever: templates used in high-trust environments can enable dynamic composition for maximum flexibility, while templates used in more constrained environments can restrict to well-tested common patterns. In both cases, compositions are validated against the same structural invariants (contract compatibility, single-mutation boundary).
Prototyping Goals
Prototype 1: Fragment Schema and Validation
Objective: Define the fragment schema and implement the validation pipeline, including grammar composition structural invariant checking, independent of LLM integration.
Success criteria:
- Fragment schema defined with JSON Schema, supporting common pattern references, dynamic compositions, and callable handler references
- Validation pipeline rejects all identified invalid fragment patterns
- Composition validation catches incompatible primitive chains and single-mutation boundary violations
- Validation produces actionable diagnostic messages
- Valid fragments can be materialized into workflow steps in a test environment
Prototype 2: LLM-Generated Fragments
Objective: Validate that an LLM can generate valid workflow fragments from grammar capability schemas, including dynamic compositions.
Success criteria:
- Claude generates valid fragments for at least 3 distinct problem types
- Generated fragments pass the validation pipeline including structural invariant checking
- Fragments use common patterns, dynamic compositions, or both appropriately
- Planning prompt engineering (informed by Phase 0 MCP server experience) produces consistent results
Prototype 3: End-to-End Planning and Execution
Objective: Execute a complete workflow with an LLM planning step.
Success criteria:
- Task is created with a planning step in its template
- Planning step calls LLM, receives fragment, passes validation
- Fragment is materialized as workflow steps (grammar-composed and/or application-specific)
- Planned steps execute through appropriate workers
- Convergence step receives results from planned steps
- Full workflow observable through standard Tasker telemetry
- Agent-created tasks containing planning steps execute correctly with parent_correlation_id traceability
Validation Criteria for Phase Completion
- Workflow fragment schema defined and documented, supporting common pattern references, dynamic compositions, and application callables
- Fragment validation pipeline implemented with all stages including composition structural invariant checking
- Planning step handler type implemented in at least Rust
- LLM integration adapter supports at least one provider (recommend: Anthropic API)
- Fragment materialization extends existing step creation with full transactional guarantees
- At least 3 end-to-end workflows demonstrated with LLM planning, including at least one using dynamic composition
- Validation failure modes tested and documented with diagnostic output
- Planning step telemetry includes LLM call metrics, fragment structure, and validation results
Relationship to Other Phases
- Phase 0 informs this phase: MCP server experience with LLM integration, prompt engineering, and validation feedback transfers directly.
- Phase 1 is a prerequisite: planning steps generate fragments that reference grammar compositions — both common patterns and dynamic compositions.
- Phase 3 extends this phase: recursive planning is nested planning steps within planned fragments.
- Agent orchestration composes with this phase: agents can create tasks containing planning steps, combining agent-level reasoning with LLM planning-level composition.
This document will be updated as Phase 1 progresses and reveals design insights that inform planning interface design.
Phase 3: Recursive Planning and Adaptive Workflows
Multi-phase workflows where each phase’s plan is informed by previous results — at both step level and task level
Phase Summary
Recursive planning enables workflows where the path forward is not just unknown at task creation time — it is unknowable until intermediate results are observed. A planning step generates a workflow fragment, that fragment executes, and a subsequent planning step uses the accumulated results to plan the next phase. This is not iteration in a loop; it is phased problem-solving where each phase operates with strictly more information than the one before.
This capability operates at two distinct levels:
Step-level recursion occurs within a single task. Planning steps generate workflow fragments, those fragments execute, and subsequent planning steps within the same task plan the next phase based on accumulated results. The task template defines the frame (what happens before and after planning); the planners fill in the middle. All steps share a single task lifecycle, budget, and convergence structure.
Task-level delegation occurs across tasks. An external client — typically an agent (see Agent Orchestration) — creates a task, receives its results, reasons about them externally, and creates follow-up tasks based on that reasoning. The parent_correlation_id field traces the chain across tasks. Each task has its own lifecycle and budget, but the reasoning chain is observable as a unit.
These two levels are complementary, not competing. Step-level recursion handles multi-phase execution within a bounded problem. Task-level delegation handles multi-phase investigation and decision-making where the reasoning between phases is too complex or context-dependent for an in-workflow planning step. An agent might create a research task (task-level delegation), and that research task might contain planning steps that adapt to intermediate findings (step-level recursion). The levels compose naturally.
This is the full realization of the vision: a system that can approach a problem the way a thoughtful engineer would — reconnaissance first, then analysis, then execution, then validation — with each phase adapting to what was learned. Whether the adaptation happens within a task (step-level) or across tasks (task-level), the execution guarantees are identical.
Research Areas
1. Context Accumulation and Propagation
Question: How do results from earlier phases flow into subsequent planning steps?
Research approach:
- Study current `dependency_results` propagation patterns in conditional and batch workflows
- Design a context accumulation strategy that provides sufficient information for planning without overwhelming the LLM’s context window
- Leverage typed data contracts from action grammars to improve summarization accuracy
- Distinguish context flow patterns for step-level (within-task) and task-level (cross-task) recursion
The context problem:
In a two-phase workflow — plan → execute → plan → execute — the second planning step needs to know:
- What the original problem was (task context)
- What the first planning step decided and why (planning metadata)
- What the first phase’s execution produced (step results)
- What went wrong, if anything (failure information from retried or failed steps)
For a three-phase workflow, the third planner needs all of the above for both prior phases. Context accumulates linearly with phases. With large step results (API responses, processed datasets), the accumulated context can exceed LLM context windows.
Step-level vs. task-level context flow:
For step-level recursion, context flows through Tasker’s standard dependency_results mechanism — each planning step has access to its declared upstream steps’ results. The context is internal to the task and fully managed by the orchestration layer.
For task-level delegation, context flows through the agent. The agent receives task results through the API, processes or summarizes them in its own reasoning, and provides relevant context as input when creating the next task. Tasker does not manage cross-task context propagation — the agent is responsible for deciding what context the next task needs.
This separation is intentional. Step-level context is bounded by the task’s scope and managed by the system. Task-level context is unbounded in principle but managed by the agent, which can apply its own judgment about what’s relevant. The agent’s ability to filter, prioritize, and restructure context between tasks is a feature, not a limitation.
The typed data contract advantage:
Because grammar-backed steps have declared output contracts (from Phase 1), the context accumulation layer knows the shape of each step’s result without inspecting the data. This enables more intelligent summarization — the system can summarize a step’s results according to its output contract’s type structure, preserving key fields and eliding bulk data, rather than attempting generic JSON summarization. This applies to both step-level accumulation (system-managed) and task-level accumulation (agent-managed, but aided by schema metadata).
Proposed context accumulation patterns:
| Pattern | Level | Description | When to Use |
|---|---|---|---|
| Full propagation | Step | All prior results passed to planner | Small results, shallow recursion (2-3 phases) |
| Contract-guided summarization | Step | Results summarized according to their output contract types, preserving structure | Medium results, typed step outputs |
| LLM-generated summary | Step | Dedicated summarization step before next planning step | Large results, deep recursion |
| Selective propagation | Step | Planner declares which upstream results it needs; only those are passed | When the planner can predict its own information needs |
| Hierarchical propagation | Step | Each planning step receives only its immediate predecessor’s summary | Deep recursion with clear phase boundaries |
| Agent-mediated propagation | Task | Agent receives full results, applies its own reasoning and filtering, provides curated context to next task | Cross-task reasoning where judgment about relevance is needed |
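To make contract-guided summarization concrete, the sketch below keeps scalar fields and elides bulk arrays down to a count plus a small sample. The contract format (a minimal JSON-Schema-like dict) and the helper name are illustrative assumptions, not part of any Tasker package.

```python
# Sketch: contract-guided summarization of a step result.
# The "contract" is a minimal JSON-Schema-like dict; field names
# and structure are illustrative, not a Tasker API.

def summarize(result: dict, contract: dict, sample_size: int = 2) -> dict:
    """Keep scalar fields; replace bulk arrays with a count plus a sample."""
    summary = {}
    for field, spec in contract.get("properties", {}).items():
        if field not in result:
            continue
        value = result[field]
        if spec.get("type") == "array" and isinstance(value, list):
            summary[field] = {"count": len(value), "sample": value[:sample_size]}
        else:
            summary[field] = value
    return summary

contract = {
    "properties": {
        "source": {"type": "string"},
        "row_count": {"type": "integer"},
        "rows": {"type": "array"},
    }
}
result = {"source": "orders_api", "row_count": 4,
          "rows": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]}

summary = summarize(result, contract)
```

Because the contract declares `rows` as an array, the summarizer can elide it structurally rather than guessing from the data, which is the advantage typed output contracts provide here.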
Open questions:
- Should context accumulation be explicit (planner declares what it needs) or implicit (system provides everything available)?
- How should we handle the case where a planner needs raw data from two phases ago, not just the summary?
- Should accumulated context be stored as a separate artifact from step results? (e.g., a `planning_context` field on the task that grows across phases)
- What is the practical depth limit before context quality degrades? (Likely 3-5 phases based on LLM context window constraints)
- For task-level delegation, should Tasker provide schema metadata in task results to help agents understand result structure?
2. Planning Depth and Breadth Controls
Question: How do we prevent recursive planning from generating arbitrarily large or deep workflows?
Research approach:
- Define resource consumption model for recursive planning (LLM calls, steps created, wall-clock time, external API calls)
- Design hierarchical budgets that constrain planning at each level
- Study termination guarantees in recursive systems
- Define resource controls for task-level delegation (per-task budgets, not cross-task budgets)
Budget hierarchy (step-level recursion):
```text
Task-level budget (set by template author or operator):
├── max_total_planning_phases: 5
├── max_total_steps: 100
├── max_total_llm_calls: 10
├── max_wall_clock_time: 30m
├── max_cost_budget: $5.00
│
├── Phase 1 planning step budget (subset of task budget):
│   ├── max_fragment_steps: 20
│   ├── max_fragment_depth: 3
│   └── remaining_budget: inherited from task level, minus consumed
│
└── Phase 2 planning step budget (further subset):
    ├── max_fragment_steps: min(20, remaining)
    ├── max_fragment_depth: 3
    └── remaining_budget: further reduced
```
Resource controls for task-level delegation:
For task-level delegation, each task has its own independent budget. Tasker does not enforce cross-task budget hierarchies — an agent creating three investigation tasks gets three independent task budgets. The agent is responsible for its own delegation budgets:
- Deciding how many investigation tasks to create
- Setting appropriate resource bounds on each task
- Managing total resource consumption across its reasoning chain
Tasker provides the per-task controls; the agent provides the chain-level controls. This is consistent with the agent-as-client model: Tasker provides infrastructure, not agency.
That said, the parent_correlation_id enables observability across the chain. An operator can query total resource consumption for all tasks sharing a correlation ID, even though Tasker doesn’t enforce it. This allows after-the-fact analysis of agent resource usage without requiring real-time cross-task budget management.
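The after-the-fact chain analysis could look like the sketch below: per-task resource records grouped by correlation ID. The record shape is a hypothetical stand-in; in practice this would be a query against Tasker’s task tables.

```python
# Sketch: after-the-fact resource aggregation across an agent's task chain.
# The record shape is hypothetical; in practice this is a query over task tables.
from collections import defaultdict

tasks = [
    {"task_uuid": "t1", "parent_correlation_id": "chain-a", "llm_calls": 3, "steps": 14},
    {"task_uuid": "t2", "parent_correlation_id": "chain-a", "llm_calls": 1, "steps": 6},
    {"task_uuid": "t3", "parent_correlation_id": "chain-b", "llm_calls": 2, "steps": 9},
]

# Group per-task consumption by chain without enforcing any cross-task limit.
usage = defaultdict(lambda: {"tasks": 0, "llm_calls": 0, "steps": 0})
for t in tasks:
    chain = usage[t["parent_correlation_id"]]
    chain["tasks"] += 1
    chain["llm_calls"] += t["llm_calls"]
    chain["steps"] += t["steps"]
```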
Termination guarantees (step-level):
- Each planning step’s fragment is bounded (`max_fragment_steps`, `max_fragment_depth`)
- Task-level bounds cap total growth across all phases
- Budget decrements are tracked in the task’s metadata (JSONB)
- A planning step that exceeds its budget fails with a diagnostic error
- A planning step that requests resources exceeding the remaining budget is rejected at validation
The “infinite planner” problem: A planning step could, in theory, always create another planning step as part of its fragment — infinite recursion. Prevention:
- `max_total_planning_phases` caps the number of planning steps in a task’s lifetime
- Each planning step’s `max_fragment_depth` limits nesting
- Budget exhaustion provides a natural termination condition
Open questions:
- Should budgets be expressed in abstract units or concrete metrics? (Abstract: “complexity points”; Concrete: “LLM tokens + step count + wall time”)
- How should the system behave when a budget is nearly exhausted? (Inform the planner of remaining budget so it can plan conservatively?)
- Should there be an “emergency convergence” mechanism that forces a workflow to converge when budgets are low?
- How do we account for retry costs? (A step that fails and retries 3 times consumes budget for each attempt)
- Should Tasker provide an optional cross-task budget mechanism for agent chains, or is per-task sufficient?
3. Adaptive Convergence
Question: How should convergence work when the topology is determined across multiple planning phases?
Research approach:
- Extend current deferred/intersection convergence semantics to multi-phase contexts
- Design patterns for convergence steps that may not know all upstream steps at creation time
- Evaluate whether new convergence semantics are needed or existing mechanisms suffice
Challenge: In a single-phase conditional workflow, the convergence step declares all possible dependencies and uses intersection semantics. In a multi-phase workflow, the convergence step at the task template level doesn’t know what steps will exist in the middle — those are planned dynamically.
Proposed approach: Convergence declarations on planning steps.
The task template declares that a planning step’s output should converge to a specific convergence step. The planning step includes convergence information in its fragment. The orchestration layer ensures that all terminal steps in the fragment connect to the declared convergence point.
```yaml
# In the task template:
steps:
  - name: plan_phase_1
    type: planning
    dependencies: [ingest]
    convergence_target: finalize   # Fragment must converge here
  - name: finalize
    type: deferred
    dependencies: [plan_phase_1]
    handler:
      callable: FinalizationHandler
```
The planning step’s fragment must declare a convergence point. The orchestration layer validates that the fragment’s terminal steps connect to the declared convergence_target. If the fragment includes a nested planning step, that step’s convergence target is the outer convergence — recursive convergence resolution.
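A validator for this rule might, as a sketch, treat the fragment as an adjacency list and check that every terminal step (one no other fragment step depends on) declares the convergence target as its downstream consumer. The fragment shape and `converges_to` field are illustrative assumptions.

```python
# Sketch: validate that every terminal step of a planned fragment
# connects to the declared convergence_target. Names are illustrative.

def terminal_steps(fragment: dict) -> set:
    """Steps inside the fragment that no other fragment step depends on."""
    names = {s["name"] for s in fragment["steps"]}
    depended_on = set()
    for step in fragment["steps"]:
        depended_on |= set(step.get("dependencies", [])) & names
    return names - depended_on

def validate_convergence(fragment: dict, convergence_target: str) -> list:
    """Return diagnostics for terminal steps missing the target."""
    converges_to = fragment.get("converges_to", {})
    errors = []
    for name in sorted(terminal_steps(fragment)):
        if converges_to.get(name) != convergence_target:
            errors.append(f"terminal step '{name}' does not converge to '{convergence_target}'")
    return errors

fragment = {
    "steps": [
        {"name": "fetch", "dependencies": []},
        {"name": "clean", "dependencies": ["fetch"]},
    ],
    "converges_to": {"clean": "finalize"},
}
errors = validate_convergence(fragment, "finalize")
```

Producing a named diagnostic per dangling terminal step matters here: the validation feedback is what the planner receives on retry, so it must be actionable.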
Open questions:
- Can a planning step declare its own convergence step, or must convergence always be declared in the task template?
- How should convergence interact with partial failure? (If 3 of 5 planned steps complete but 2 fail permanently, does convergence fire with partial results?)
- Should the convergence step know how many upstream steps were planned? (Useful for aggregation, but creates coupling between planner and convergence handler)
4. Multi-Phase Workflow Patterns
Question: What are the common patterns for multi-phase adaptive workflows?
Research approach:
- Identify canonical problem types that benefit from multi-phase planning
- Design reference architectures for each pattern
- Validate with concrete use cases
- Distinguish patterns that work best as step-level recursion from those that benefit from task-level delegation
Proposed canonical patterns:
Pattern 1: Reconnaissance → Execution (Step-Level)
ingest → plan_recon → [recon steps] → plan_execution → [execution steps] → finalize
The reconnaissance phase gathers information (API calls, data profiling, schema inspection). The execution phase acts on what was learned. Two planning steps within a single task, each informed by the prior phase’s results. Typed data contracts from the recon phase’s grammar-composed steps ensure the execution planner receives well-structured context.
Best for: Problems where the reconnaissance is bounded and the context can flow through the task’s step-level accumulation.
Pattern 2: Iterative Refinement (Step-Level)
ingest → plan_v1 → [execute] → evaluate → plan_v2 → [execute] → evaluate → converge
Each phase attempts a solution. An evaluation step (which could be LLM-backed) assesses the results. If the results are insufficient, another planning phase refines the approach. Budget controls limit iterations.
Best for: Optimization problems where each iteration is structurally similar and the evaluation criteria are known upfront.
Pattern 3: Map → Analyze → Reduce (Step-Level)
ingest → plan_map → [fan_out batch processing] → plan_analyze → [analysis steps] → reduce
The map phase parallelizes work across a dataset using the FanOut grammar primitive. The analysis phase examines aggregated results. The reduce phase produces final output. Planning enables each phase to adapt to the data’s characteristics.
Best for: Data processing pipelines where the shape of analysis depends on the data.
Pattern 4: Progressive Disclosure (Step-Level)
```text
ingest → plan_triage → [triage steps] → route_by_complexity →
  simple:  plan_simple → [quick steps] → converge
  complex: plan_deep → [thorough steps] → converge
```
Initial triage determines problem complexity. Simple problems get lightweight plans. Complex problems get thorough plans. The planning steps at each level are scoped to the problem’s complexity.
Best for: Heterogeneous problem sets where different inputs require different treatment.
Pattern 5: Investigation → Design → Execute (Task-Level)
```text
Agent creates research_task → [investigation steps, possibly with step-level planning]
Agent receives research results, reasons about them
Agent creates design_task → [workflow design, possibly with planning steps]
Agent receives design, reviews or adjusts
Agent creates execution_task → [the actual workflow]
```
The agent uses task-level delegation for the high-level phases (investigate, design, execute) because its reasoning between phases requires context that cannot be captured in a planning step’s prompt. Within each task, step-level planning may be used for tactical adaptation.
Best for: Problems where the reasoning between phases is complex, context-dependent, or requires external judgment.
Pattern 6: Parallel Investigation with Synthesis (Task-Level)
```text
Agent creates investigation_task_A (source 1)
Agent creates investigation_task_B (source 2)
Agent creates investigation_task_C (source 3)
Agent waits for all three, synthesizes findings
Agent creates design_or_action_task based on synthesis
```
The agent fans out investigation across multiple independent research tasks, synthesizes the results externally, and acts on the synthesis. This is similar to Pattern 1 but with the fan-out happening at the task level rather than the step level — useful when the investigation dimensions are truly independent and the synthesis requires agent-level reasoning.
Best for: Complex decisions requiring information from multiple independent domains.
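Pattern 6’s delegation loop could be sketched as below. `StubClient` stands in for whatever Tasker client API the agent uses — the method names are hypothetical — and the agent’s reasoning is reduced to a trivial synthesis step.

```python
# Sketch: parallel investigation with synthesis (Pattern 6).
# `StubClient` is a stand-in for a real Tasker client; method names are
# hypothetical, and "agent reasoning" is reduced to a trivial synthesis.
import uuid

class StubClient:
    def __init__(self):
        self.tasks = {}

    def create_task(self, template: str, context: dict, correlation_id: str) -> str:
        task_id = str(uuid.uuid4())
        # Pretend each task completes immediately with a structured finding.
        self.tasks[task_id] = {
            "correlation_id": correlation_id,
            "result": {"finding": f"summary of {context.get('source', 'synthesis')}"},
        }
        return task_id

    def get_result(self, task_id: str) -> dict:
        return self.tasks[task_id]["result"]

client = StubClient()
chain_id = "chain-" + str(uuid.uuid4())

# Fan out three independent investigation tasks sharing one correlation ID.
task_ids = [
    client.create_task("investigation", {"source": s}, chain_id)
    for s in ("source_1", "source_2", "source_3")
]

# Agent-side synthesis: curate the findings into context for the next task.
findings = [client.get_result(t)["finding"] for t in task_ids]
action_task = client.create_task("design_or_action", {"synthesis": findings}, chain_id)
```

The fan-out and synthesis both live on the agent side; Tasker’s contribution is the per-task execution guarantees and the shared correlation ID that keeps the chain observable.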
Open questions:
- Are there patterns that require capabilities beyond what Phases 0-2 provide?
- Should pattern selection itself be LLM-assisted? (Meta-planning: “given this problem, which multi-phase pattern is most appropriate?”)
- How do we document and catalog these patterns for users?
- Should there be a library of research workflow templates in `tasker-contrib` for the task-level patterns?
5. Failure Recovery in Multi-Phase Workflows
Question: How does the system recover when a phase fails in the middle of a multi-phase workflow?
Research approach:
- Map existing retry/backoff semantics to multi-phase contexts
- Design recovery strategies for planning step failures vs. execution step failures
- Evaluate checkpoint and resume semantics across planning phases
- Distinguish recovery for step-level failures (system-managed) from task-level failures (agent-managed)
Failure categories (step-level):
| Failure | Scope | Recovery Strategy |
|---|---|---|
| Planning step LLM call fails | Single step | Standard retry with backoff. LLM calls are idempotent (stateless). |
| Planning step generates invalid fragment | Single step | Retry with validation feedback to LLM. Limited attempts before permanent failure. |
| Execution step in planned fragment fails | Single step | Standard step retry semantics. Same as any Tasker step. |
| Entire planned phase fails (all steps) | Phase | Phase-level retry: re-plan from the last successful phase boundary. |
| Budget exhausted mid-phase | Phase | Graceful degradation: force convergence with partial results + diagnostic. |
| Recursive planning exceeds depth | Task | Hard stop. Planning step fails with “depth exceeded” error. Task may converge with partial results or fail entirely depending on template design. |
Failure categories (task-level):
| Failure | Scope | Recovery Strategy |
|---|---|---|
| Investigation task fails | Single task | Agent receives failure notification. Agent decides whether to retry (create new task) or proceed without that investigation. |
| Agent-side failure between tasks | Agent | Not Tasker-observable. The task chain simply stops. parent_correlation_id shows the last completed task. |
| Multiple investigation tasks fail | Chain | Agent manages multi-failure scenarios. May create a fallback investigation task, proceed with partial information, or escalate to human review. |
The key distinction: step-level failure recovery is system-managed — Tasker’s orchestration layer handles retries, budget checks, and convergence. Task-level failure recovery is agent-managed — the agent decides how to respond to failed tasks. This matches the agent-as-client model: Tasker guarantees individual task execution; the agent guarantees reasoning chain coherence.
Phase-level retry is the novel step-level pattern. If an entire planned phase fails, the system could re-invoke the planning step with the accumulated context plus failure information. The planner can then generate a revised fragment that accounts for what failed. This requires:
- Clear phase boundaries (which the planning step + convergence pattern provides)
- Failure context propagation to the planner
- Budget accounting that doesn’t penalize retried phases excessively
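The failure-context payload for a phase-level retry could, as a sketch, be assembled from the failed phase’s step records plus the context accumulated so far. The field names below are hypothetical, not a Tasker schema.

```python
# Sketch: assemble re-planning context after a planned phase fails.
# Field names are hypothetical, not a Tasker schema.

def replan_context(phase_steps: list, prior_context: dict) -> dict:
    """Combine accumulated context with failure diagnostics for the re-planner."""
    failed = [s for s in phase_steps if s["state"] == "failed"]
    succeeded = [s for s in phase_steps if s["state"] == "complete"]
    return {
        "prior_context": prior_context,
        "succeeded_steps": [s["name"] for s in succeeded],
        "failures": [
            {"step": s["name"], "error": s["error"], "attempts": s["attempts"]}
            for s in failed
        ],
        "instruction": "Generate a revised fragment that avoids the failure modes above.",
    }

phase_steps = [
    {"name": "fetch_orders", "state": "complete", "error": None, "attempts": 1},
    {"name": "enrich", "state": "failed", "error": "upstream API 429", "attempts": 3},
]
ctx = replan_context(phase_steps, {"problem": "monthly reconciliation"})
```

Carrying attempt counts and error messages, not just a failed/succeeded flag, is what lets the re-planner distinguish "this dependency is down" from "this approach is wrong".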
Open questions:
- Should phase-level retry be automatic or require human approval (gate step)?
- How many phase-level retries are reasonable? (Likely 1-2 before requiring human intervention)
- Should the revised plan have access to the failed plan for comparison? (Useful for “don’t try the same thing again”)
- For task-level delegation, should Tasker provide a callback/webhook mechanism so agents learn about task completion without polling?
Prototyping Goals
Prototype 1: Two-Phase Adaptive Workflow (Step-Level)
Objective: Implement the Reconnaissance → Execution pattern with two planning steps within a single task.
Success criteria:
- First planning step gathers information and produces results
- Second planning step receives first phase’s results (with typed context from grammar-backed steps) and plans execution
- Execution phase completes using grammar-composed handlers
- Context accumulation works correctly across phases
- Budget tracking decrements across phases
Prototype 2: Planning Depth Controls
Objective: Validate that recursive planning terminates correctly under all conditions.
Success criteria:
- Planning that exceeds `max_total_planning_phases` fails gracefully
- Budget exhaustion produces clean termination with diagnostic
- Partially completed phases converge with available results
- No infinite planning loops possible
Prototype 3: Phase-Level Failure Recovery
Objective: Demonstrate recovery from phase failure through re-planning.
Success criteria:
- A planned phase with a failing step triggers re-planning
- The re-planner receives failure context and generates a revised fragment
- The revised fragment avoids the failure mode of the original
- Budget accounting is correct across retries
Prototype 4: Agent-Driven Investigation Chain (Task-Level)
Objective: Demonstrate an agent creating a research task, receiving results, and creating a follow-up task based on findings.
Success criteria:
- Agent creates an investigation task with appropriate resource bounds
- Investigation task executes with fan-out, convergence, and structured results
- Agent receives results through the API
- Agent creates a follow-up task with curated context from investigation results
- `parent_correlation_id` traces the entire chain
- Total resource consumption across the chain is observable
Validation Criteria for Phase Completion
- Multi-phase workflow with at least 2 planning steps executes end-to-end (step-level)
- Context accumulation provides sufficient information for subsequent planning steps, with typed data contracts improving summarization accuracy
- Budget hierarchy enforced: task-level, phase-level, and per-fragment limits
- Recursive planning terminates under all conditions (depth limits, budget exhaustion)
- At least 2 step-level canonical patterns (of the 4 proposed) demonstrated with real use cases
- At least 1 task-level delegation pattern demonstrated end-to-end
- Phase-level failure recovery demonstrated (re-planning after phase failure)
- Convergence works correctly with dynamically determined upstream steps
- Full observability across all planning phases and across agent task chains (see Complexity Management document)
Relationship to Other Phases
- Phase 0 provides data contract patterns that inform context accumulation design.
- Phase 1 is foundational: grammar-composed handlers execute at every phase, and typed output contracts improve cross-phase context quality.
- Phase 2 is a direct prerequisite: recursive planning is nested planning steps.
- Agent orchestration composes with this phase: task-level delegation provides the cross-task investigation pattern that complements within-task step-level recursion.
This is the most speculative phase of the initiative. Many design questions will be informed by operational experience with Phase 2. This document should be substantially revised after Phase 2 validation.
Managing Complexity in Dynamic Workflow Planning
Complexity grounded in simplicity, for humans, LLMs, agents, and observability systems — and how structural invariants reduce runtime surprises
The Distinction: Complexity vs. Complication
Dynamic workflow planning is inherently complex. A system where an LLM generates workflow topology at runtime, where handlers can be composed dynamically from grammar primitives, where agents create investigation task chains, where planning can recur across multiple phases — this introduces combinatorial possibilities that static workflows do not have.
Complexity is not the concern. Complication is.
Complexity arises from the genuine richness of a problem space. A workflow that adapts to its inputs, that routes through different processing paths based on data characteristics, that plans in phases informed by intermediate results — this is complex because the problem is complex. The system’s behavior reflects the problem’s structure.
Complication arises from incidental difficulty — difficulty that exists because of how the system was built rather than because of what it does. Configuration that requires understanding internal implementation details. Failure modes that are opaque. Observability that buries the signal in noise. Abstractions that leak. Mental models that don’t match system behavior.
The goal of this document is to ensure that dynamic workflow planning introduces complexity proportional to the problems it solves while ruthlessly eliminating complication. The same step — with its state machine, its idempotency, its lifecycle — remains the atomic unit. The same execution guarantees hold. The same observability contracts apply. What’s new is the topology, not the mechanics.
Four Audiences, Four Models
Dynamic workflow planning has four distinct audiences, each with different mental models and different needs. Complexity management must serve all of them.
1. Human Operators
Operators need to understand what the system is doing, why it’s doing it, and what to do when something goes wrong. For dynamic workflows, this means:
- What was planned and why. Every planning step’s reasoning, the fragment it generated, and the validation result must be inspectable.
- What is executing now. The current state of all active steps, including dynamically created ones (whether referencing common patterns or dynamically composed), must be visible through the same interfaces used for static workflows.
- What went wrong. Failures in dynamic workflows must be diagnosable with the same tools used for static workflows, with additional context about the planning decisions that led to the failed step.
- What is the agent doing. For agent-created task chains, the `parent_correlation_id` lineage must be traceable, showing the progression from investigation through design to execution.
The complication trap for operators: If dynamically planned workflows appear fundamentally different from static workflows in the observability UI — if they require a different mental model, different tooling, different diagnostic procedures — then we have introduced complication. The operator should see steps, dependencies, states, and results. The fact that some steps were created by a planner rather than a template, or composed dynamically rather than referencing a common pattern, should be visible but not disruptive.
2. LLMs as Planners
The LLM planner needs to understand what capabilities are available, what the problem context is, and what constraints apply. For effective planning, this means:
- Clear capability descriptions. The action grammar primitives, common patterns, and composition rules must be described in terms the LLM can reason about — not implementation details, but semantic capabilities, input/output contracts, and composition patterns. Because capability schemas are derived from grammar composition types (not hand-authored), they are always accurate.
- Bounded context. The information provided to the planner must be sufficient for good decisions but not so voluminous that it degrades reasoning quality. This is a context window management problem with direct impact on planning quality.
- Structured feedback. When a plan is invalid, the validation diagnostics must be actionable by the LLM in a retry attempt. “Handler ‘foo’ not found” is useful. “Validation failed” is not.
The complication trap for LLMs: If the capability schema is too granular (every parameter of every handler), the LLM drowns in detail. If it’s too abstract (“handlers exist”), the LLM can’t plan concretely. If the planning prompt requires understanding of Tasker internals (queue namespaces, transaction boundaries, PGMQ message formats), we’ve leaked implementation into the planning layer. The LLM should plan in terms of what needs to happen, not how Tasker works.
3. Agents as Clients
Agents need to understand what Tasker can do for them, how to structure their investigation and workflow creation, and how to monitor progress and handle failures. For effective agent integration, this means:
- Discoverable capabilities. The MCP server exposes what templates exist, what action grammar primitives are available, what schemas look like, and what resource limits apply. The agent should be able to explore the system’s capabilities without prior knowledge.
- Structured results. Task completion delivers results in a form the agent can reason about — typed data contracts mean the agent knows the shape of what it’s getting back.
- Clear failure modes. When a task or step fails, the agent receives structured diagnostic information sufficient to decide on recovery (retry, revise, escalate).
- Lineage tracking. The agent’s chain of tasks (linked by `parent_correlation_id`) is traceable, enabling the agent — and human operators overseeing the agent — to understand the full investigation arc.
The complication trap for agents: If the agent needs to understand Tasker’s internal architecture to use it effectively, the abstraction has leaked. The agent should think in terms of “I need to investigate these three things in parallel and then combine the results” — not in terms of PGMQ queues, step state machines, or worker namespaces. The MCP server and task API should present the right level of abstraction.
4. Observability Systems
Observability infrastructure needs to ingest, correlate, and present the telemetry from dynamic workflows without special-casing. For coherent observability, this means:
- Consistent telemetry shape. Dynamically created steps emit the same metrics, logs, and traces as statically defined steps. Virtual handler steps emit the same telemetry as catalog handler steps. No parallel telemetry pipeline for “dynamic” or “virtual” steps.
- Planning provenance. Additional metadata connects each step to the planning decision that created it, enabling drill-down from “what happened” to “why this was planned.”
- Virtual handler provenance. Steps executed by virtual handlers include the composition specification in their metadata, enabling drill-down from “what failed” to “what composition was attempted.”
- Task lineage. For agent-created task chains, `parent_correlation_id` enables correlation across tasks, showing the full arc from investigation to execution.
- Aggregation across dynamic topologies. When a task template produces workflows with different shapes (because the planner chose different paths), observability must support comparison and aggregation across these variations.
The complication trap for observability: If dynamic workflows generate telemetry that existing dashboards and alerts can’t consume, operators fall back to log grep. If planning provenance is stored in a different system than step telemetry, correlation requires manual effort. If each unique workflow topology gets its own metric namespace, aggregation becomes impossible.
Design Principles for Complexity Management
Principle 1: The Step Remains the Atom
Every capability in dynamic workflow planning is expressed through steps. A planning step is a step. A grammar-composed handler step is a step. A convergence step is a step. Each has the same lifecycle, the same state machine, the same observability contract.
The action grammar layer adds compositional depth within a step — a handler may be composed from Acquire → Transform → Validate primitives — but from the orchestration layer’s perspective, it is still a single step with a single lifecycle. Whether the handler referenced a common pattern or was composed dynamically is an implementation detail of the handler, not a new structural concept in the workflow.
Implication: No new top-level concepts. No “planning phase” object separate from steps. No “fragment execution” lifecycle separate from step execution. No “grammar composition” lifecycle visible to the orchestrator. No “agent task” type separate from regular tasks. The DAG is the DAG, whether its topology was determined by a template, a planner, or an agent.
What this means practically:
- The workflow visualization shows steps and edges, regardless of how they were created
- Step-level alerts (timeout, failure, retry) work identically for all grammar-composed steps
- The DLQ system processes planned steps the same way as static steps
- Performance metrics (step latency, throughput) aggregate across planned and static steps regardless of composition origin
Principle 2: Provenance is Metadata, Not Structure
The fact that a step was created by a planning step rather than a template, or composed dynamically rather than referencing a common pattern, is important context. But it should be captured as metadata on the step, not as a structural difference in how the step exists in the system.
Implication: Add provenance fields to the step record, not a parallel provenance system.
Proposed provenance metadata (stored in workflow_steps or related JSONB):
| Field | Type | Description |
|---|---|---|
| `created_by` | enum | `template` / `decision_point` / `planning_step` / `batch_spawn` |
| `planning_step_uuid` | uuid? | The planning step that created this step (if applicable) |
| `fragment_id` | string? | Identifier of the workflow fragment this step belongs to |
| `planning_phase` | integer? | Which planning phase (1, 2, 3…) this step was created in |
| `planning_reasoning` | text? | The planner's reasoning for including this step |
| `handler_type` | enum | `application` / `grammar` |
| `composition_source` | string? | Whether a grammar handler referenced a common pattern name or was composed dynamically (metadata, not a type distinction) |
| `composition_spec` | jsonb? | The grammar composition specification (if applicable) |
This metadata enriches observability without changing the step’s identity. Dashboards can filter by created_by to show only planned steps, or by handler_type to distinguish developer-authored from grammar-composed steps. Traces can include planning_step_uuid for drill-down. But the default view — the one operators see every day — shows steps as steps.
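As a concrete illustration, a provenance record for a planner-created, grammar-composed step might look like the following. This is a sketch: field names follow the table above, but the values and the helper function are hypothetical, not Tasker's actual API.

```python
# Hypothetical provenance record for a step created by a planning step.
# Field names follow the proposed metadata table; values are illustrative.
provenance = {
    "created_by": "planning_step",        # template / decision_point / planning_step / batch_spawn
    "planning_step_uuid": "00000000-0000-0000-0000-000000000001",
    "fragment_id": "fragment-enrich-01",
    "planning_phase": 2,
    "planning_reasoning": "Enrichment requires a validated fetch before transform.",
    "handler_type": "grammar",            # application / grammar
    "composition_source": "dynamic",      # common pattern name, or dynamically composed
    "composition_spec": {
        "primitives": ["acquire.http", "transform.map", "validate.schema"],
    },
}

# Dashboards can filter on these fields without any new structure:
def is_planned(step_provenance: dict) -> bool:
    """A step is 'planned' if something other than the template created it."""
    return step_provenance.get("created_by") in {"planning_step", "decision_point"}

print(is_planned(provenance))  # True
```

The point of the sketch is that provenance is plain data attached to the step record: filtering "planned steps only" is a field check, not a join against a parallel system.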
Principle 3: Progressive Disclosure
Not everyone needs to see everything. The default view should show what’s essential: steps, states, results, timing. Planning details, composition specifications, and agent lineage should be available on drill-down, not in the primary display.
Layer 1: Workflow Overview (same as static workflows)
- Task status, step states, dependency graph, timing
- Steps created by planners are visually annotated but not structurally different
- Dynamically composed steps are visually annotated but appear as normal steps
- Summary: “This task had 2 planning phases, created 14 steps total (12 grammar, 2 application)”
Layer 2: Planning Details (drill into a planning step)
- The planning prompt (what the LLM was asked)
- The capability schema provided (what the LLM could use)
- The generated fragment (what the LLM planned)
- The validation result (accepted/rejected, with diagnostics)
- Token usage, latency, model identifier
Layer 3: Fragment Analysis (drill into the fragment)
- The fragment’s DAG structure
- Each step’s handler configuration (common pattern reference, dynamic composition, or application callable)
- Input mappings and data flow
- Comparison with alternative fragments (if retry occurred)
Layer 4: Composition Details (drill into a grammar-composed step)
- The composition specification (which primitives, in what order, with what configuration)
- Whether the composition referenced a common pattern or was dynamically assembled
- Structural invariant validation results (contract compatibility, single-mutation boundary)
- Per-primitive execution details (timing, intermediate results if captured)
Layer 5: Execution Details (same as static workflows)
- Individual step execution: inputs, outputs, timing, retries
- Standard Tasker observability for each step
Layer 6: Agent Lineage (for agent-created task chains)
- `parent_correlation_id` chain visualization
- Task-level progression: research → analysis → workflow → execution
- Aggregate resource consumption across the delegation chain
- Agent decision points (implicit, inferred from task creation patterns)
Principle 4: Bounded Blast Radius
Dynamic planning introduces new classes of failure: a planning step generates a fragment that, while valid, produces poor results. A grammar composition passes structural validation but behaves unexpectedly at runtime. An agent creates a long chain of research tasks without converging on a design.
These failures are bounded by design:
- Each planning step’s fragment has resource limits (max steps, max depth)
- Task-level budgets cap total resource consumption across all phases
- Grammar compositions are validated against structural invariants before execution
- Convergence points are declared in the template (the frame), not by the planner
- Agent delegation chains have depth and budget limits
- The worst case is a task (or task chain) that consumes its budget without producing useful results — disappointing but not dangerous
Implication: Budget consumption should be a first-class metric. Operators should see: “This task has used 47 of 100 step budget, 3 of 5 planning phases, $2.30 of $5.00 cost budget.” For agent delegation chains: “This chain has 3 tasks across 2 delegation levels, consuming $12.40 of $50.00 aggregate budget.”
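The budget accounting described above can be sketched as a small value object. The field names and limits here are illustrative assumptions, not Tasker's actual configuration surface:

```python
from dataclasses import dataclass

@dataclass
class TaskBudget:
    """Tracks consumption against the bounds declared in the task template."""
    max_steps: int
    max_phases: int
    max_cost: float
    steps_used: int = 0
    phases_used: int = 0
    cost_used: float = 0.0

    def summary(self) -> str:
        # The operator-facing string from the text above.
        return (f"{self.steps_used} of {self.max_steps} step budget, "
                f"{self.phases_used} of {self.max_phases} planning phases, "
                f"${self.cost_used:.2f} of ${self.max_cost:.2f} cost budget")

    def exhausted(self) -> bool:
        # Any dimension hitting its cap stops further planning.
        return (self.steps_used >= self.max_steps
                or self.phases_used >= self.max_phases
                or self.cost_used >= self.max_cost)

budget = TaskBudget(max_steps=100, max_phases=5, max_cost=5.00,
                    steps_used=47, phases_used=3, cost_used=2.30)
print(budget.summary())    # 47 of 100 step budget, 3 of 5 planning phases, $2.30 of $5.00 cost budget
print(budget.exhausted())  # False
```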
Principle 5: The Template is the Safety Contract
Even with dynamic planning and agent integration, the task template is the safety contract between the workflow author and the system. The template declares:
- What happens before planning (static steps)
- Where planning occurs (planning steps with constraints)
- What happens after planning (convergence and finalization)
- What the resource bounds are
- Whether dynamic composition beyond common patterns is allowed
The planner fills in the middle. It cannot modify the frame. It cannot bypass convergence. It cannot exceed its bounds. The template author retains control of the workflow’s structure; they delegate the topology of specific segments.
For agent-created tasks, the template still provides the safety contract. The agent selects which template to use (or constructs a task with planning steps), but the template’s constraints apply regardless of who submitted the task.
Implication: Template review is the primary code review artifact for dynamic workflows. If the template’s constraints are correct, the system’s behavior is bounded regardless of what the planner or agent does.
Observability Architecture for Dynamic Workflows
Telemetry Extensions
Standard telemetry (emitted by all steps, unchanged):
- Step lifecycle events (created, enqueued, claimed, executing, completed/failed)
- Step execution metrics (latency, retry count, handler name)
- Task lifecycle events (created, in_progress, completed/failed)
- Queue metrics (depth, claim rate, processing time)
Planning telemetry (emitted by planning steps, in addition to standard):
- LLM call metrics (model, token count input/output, latency, cost)
- Fragment generation metrics (steps planned, depth, handler distribution by type)
- Validation metrics (pass/fail, rejection reasons, retry count)
- Budget consumption (steps used / remaining, phases used / remaining, cost used / remaining)
Composition telemetry (emitted by grammar-composed steps, in addition to standard):
- Composition specification (primitives, configuration)
- Whether the composition referenced a common pattern or was dynamically assembled
- Structural invariant validation result (pass, with any warnings)
- Per-primitive timing (if captured — useful for identifying bottleneck primitives)
Provenance telemetry (emitted when dynamic steps are created):
- Step creation source (planning_step_uuid, fragment_id)
- Step’s position in fragment DAG (depth, breadth)
- Planning phase identifier
- Handler type (grammar, application)
Agent lineage telemetry (emitted for tasks with parent_correlation_id):
- Delegation depth (how many levels deep in the chain)
- Aggregate step count and cost across the chain
- Task creation timing (latency between parent task completion and child task creation)
Correlation Strategy
All telemetry for a dynamic workflow is correlated through existing mechanisms:
- Task UUID: Groups all steps (planned and static) in a single workflow
- Trace ID: Spans the entire task lifecycle, including planning
- Planning Step UUID: Links planned steps to their planner (via provenance metadata)
- Fragment ID: Groups steps from a single planning decision
- parent_correlation_id: Links tasks in an agent delegation chain
No new correlation mechanism is needed. The existing task → step hierarchy, extended with provenance metadata and the existing parent_correlation_id, supports all drill-down patterns.
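For example, reconstructing an agent delegation chain needs nothing beyond the existing `parent_correlation_id` field. A sketch, with an in-memory dict standing in for a standard task query:

```python
# Sketch: walk an agent delegation chain using only parent_correlation_id.
# `tasks` stands in for an ordinary task lookup; no separate lineage system.
tasks = {
    "t-research":  {"parent_correlation_id": None},
    "t-analysis":  {"parent_correlation_id": "t-research"},
    "t-execution": {"parent_correlation_id": "t-analysis"},
}

def lineage(task_id: str) -> list[str]:
    """Return the delegation chain from the root task down to the given task."""
    chain = []
    current = task_id
    while current is not None:
        chain.append(current)
        current = tasks[current]["parent_correlation_id"]
    return list(reversed(chain))

print(lineage("t-execution"))  # ['t-research', 't-analysis', 't-execution']
```

In production this walk would be a recursive query against the task table, but the data model is the same: lineage is an ordinary field traversal.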
Dashboarding Patterns
Task-Level Dashboard (extends existing):
- Task completion rate, segmented by planning phase count
- Average steps per task (static vs. dynamic, grammar vs. application)
- Budget utilization distribution (histogram)
- Planning success rate (fragments generated vs. fragments validated)
Planning-Specific Dashboard (new):
- LLM call volume and latency
- Most frequently used common patterns in planned fragments
- Most frequently dynamically composed handler patterns
- Fragment validation failure rate by rejection reason
- Cost per planning phase, averaged across tasks
- Planning depth distribution (how many phases do tasks actually use?)
Grammar Composition Dashboard (new):
- Handler usage distribution (common patterns vs. dynamic compositions)
- Handler execution success rate by handler type (grammar vs. application)
- Performance by handler (latency distribution per common pattern)
- Dynamic composition pattern frequency (candidates for named common patterns)
- Configuration pattern analysis (what configurations are most common?)
Agent Activity Dashboard (new):
- Tasks created by agents (volume, success rate)
- Delegation chain depth distribution
- Agent-to-decision latency (how long from first research task to final workflow)
- Aggregate resource consumption by agent delegation chain
- Research task convergence quality (how useful are research results to subsequent decisions?)
Alerting Patterns
| Alert | Trigger | Action |
|---|---|---|
| Planning step timeout | LLM call exceeds configured timeout | Retry or fail step based on retry policy |
| Fragment validation failure rate | > 30% of planning steps produce invalid fragments | Review capability schema, planning prompts |
| Dynamic composition runtime failure spike | Dynamically composed handlers failing at higher rate than common patterns | Review structural validation coverage, common failure compositions |
| Budget consumption anomaly | Task consuming budget > 2σ from mean | Investigate planning decisions, consider tighter bounds |
| Common pattern error spike | Handler failure rate > threshold | Investigate handler configuration patterns |
| Planning depth anomaly | Tasks consistently reaching max phases without converging | Review problem descriptions, planning prompts, or increase bounds |
| Agent delegation depth anomaly | Agent chains consistently reaching max depth | Review agent decomposition patterns, consider wider investigation templates |
LLM Context Management
The Context Window as a Design Constraint
The LLM planner’s effectiveness is directly proportional to the quality of information in its context window. Too little information and the planner makes poor decisions. Too much and the planner loses focus or hits token limits. This is a design constraint, not a runtime problem — the system must be designed to provide the right information in the right format.
Context Composition
The planning prompt is composed from these sources, in priority order:
- Problem description (from task context): What needs to be accomplished. Always included in full.
- Accumulated results (from prior phases): What has been learned. Summarized if large.
- Capability schema (from grammar primitives and common patterns): What the planner can use, including common patterns and composition rules. Potentially large; strategy required.
- Planning constraints (from template): Resource bounds, required convergence. Always included.
- Failure context (if retrying): What went wrong. Included on retry.
- Examples (from prompt engineering): Few-shot demonstrations. Carefully curated.
Capability Schema Compression
The full capability schema for all common patterns plus the grammar’s composition rules may exceed practical context window budgets. Strategies:
Tiered description:
- Tier 1: Handler name + one-line description, primitive name + one-line description (always included)
- Tier 2: Input/output schemas (included for handlers the planner selects)
- Tier 3: Full configuration reference (included on request or for complex handlers)
- Composition rules (including single-mutation boundary): always included at Tier 1 (the rules are compact); specific primitive schemas at Tier 2
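The tiered description strategy above can be sketched as a renderer that emits more detail as the tier increases. The tier structure and entry fields are illustrative assumptions:

```python
def render_capability(entry: dict, tier: int) -> str:
    """Render a capability at increasing levels of detail.
    Tier 1: name + one-liner; Tier 2: adds I/O schemas; Tier 3: adds full config."""
    lines = [f"{entry['name']}: {entry['summary']}"]
    if tier >= 2:
        lines.append(f"  input: {entry['input_schema']}")
        lines.append(f"  output: {entry['output_schema']}")
    if tier >= 3:
        lines.append(f"  config: {entry['config_reference']}")
    return "\n".join(lines)

# Hypothetical capability entry for the http_request common pattern.
http_request = {
    "name": "http_request",
    "summary": "Perform an HTTP call with retry classification",
    "input_schema": "{url, method, headers?}",
    "output_schema": "{status, body}",
    "config_reference": "{timeout_ms, retry_policy, auth_ref}",
}

print(render_capability(http_request, tier=1))  # one line: name + summary
```

Tier 1 for every capability stays cheap enough to always include; Tiers 2 and 3 are added only for the handlers the planner actually selects.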
Category-based inclusion:
- Include full schemas only for primitive categories relevant to the problem type
- Data processing problems get full `transform`, `validate`, `fan_out` schemas
- API integration problems get full `http_request`, `auth` schemas
- Control flow problems get `decide`, `gate` schemas
Empirical calibration:
- Measure planning quality as a function of schema detail
- Find the minimum schema detail that produces valid fragments > 90% of the time
- This will vary by LLM model; calibrate per supported model
Result Summarization
Between planning phases, accumulated results must be compressed. The summarization strategy depends on result size:
| Result Size | Strategy | Example |
|---|---|---|
| < 1KB | Include verbatim | Status codes, counts, small payloads |
| 1KB - 10KB | Structured summary | Key fields extracted, schema preserved |
| 10KB - 100KB | LLM-generated summary | Dedicated summarization step before next planning step |
| > 100KB | Reference with metadata | Object store reference + schema + size + sample |
The summarization strategy should be configurable per planning step, with sensible defaults.
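The size thresholds in the table map directly onto a dispatch function. A sketch using the table's default boundaries (the strategy names are illustrative):

```python
def summarization_strategy(result_bytes: int) -> str:
    """Pick a strategy for carrying a prior phase's result into the next
    planning context, using the default size thresholds from the table."""
    if result_bytes < 1_024:
        return "verbatim"            # include the result as-is
    if result_bytes < 10_240:
        return "structured_summary"  # extract key fields, preserve schema
    if result_bytes < 102_400:
        return "llm_summary"         # dedicated summarization step before next planning step
    return "reference"               # object store ref + schema + size + sample

print(summarization_strategy(500))     # verbatim
print(summarization_strategy(50_000))  # llm_summary
```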
Operator Experience Design
Mental Model
The operator’s mental model for dynamic workflows should be:
“This workflow has a frame (the template) and fill (the planned steps). The frame is static and reviewed like any template. The fill is dynamic and generated by a planner. Fill steps use grammar compositions — some reference common patterns, others are composed dynamically from primitives. I monitor everything through the same tools, with drill-down into planning decisions and composition details when I need it.”
For agent-created workflows:
“An agent created a chain of tasks — first research, then execution. Each task is a normal Tasker task. I can see the chain through the parent_correlation_id lineage. Each task has its own frame and fill.”
These are the only new concepts operators need to learn. Everything else — step states, retries, convergence, DLQ — works the same way.
Investigation Workflow
When a dynamic workflow fails, the operator’s investigation follows this path:
1. What failed? → Standard step failure view. Same as static workflows.
2. Was this step planned or static? → Provenance metadata. One field check.
3. If planned: what kind of handler? → `handler_type` metadata. Grammar or application.
4. If grammar: what was the composition? → `composition_spec` metadata. See which primitives were used and whether it referenced a common pattern or was dynamically composed.
5. What was the planning decision? → Drill into planning step. See fragment, reasoning, validation.
6. Was the plan reasonable? → Evaluate fragment structure, handler selection, configuration.
7. If the plan was bad: why? → Examine planning prompt, context, LLM response. Identify whether the issue is schema quality, context quality, or model quality.
8. If the plan was good but execution failed: why? → Standard step debugging. Inputs, outputs, error messages, retry history.
9. Is this part of an agent chain? → Check `parent_correlation_id`. Trace lineage to understand the broader investigation arc.
Steps 1-4 add approximately 15 seconds to investigation time. Steps 5-7 are new but only needed when the failure is planning-related. Steps 8-9 are unchanged or trivial lookups.
Runbooks
Dynamic workflows should ship with runbook extensions that cover:
- “A planning step is in DLQ” — how to investigate and resolve
- “A grammar-composed step failed” — how to examine the composition and identify the failing primitive
- “A task is consuming its budget without completing” — how to diagnose and intervene
- “Fragment validation failures are spiking” — how to diagnose capability schema issues
- “An LLM provider is returning errors” — how to fail over or degrade gracefully
- “An agent delegation chain is growing without converging” — how to investigate and intervene
Avoiding Complication: Anti-Patterns
These are specific patterns that introduce complication without corresponding complexity. The system design should prevent them.
| Anti-Pattern | Why It’s Complication | Prevention |
|---|---|---|
| Different observability for dynamic vs. static steps | Operators must maintain two mental models | All steps emit identical telemetry; provenance is metadata |
| Different observability for common patterns vs. dynamic compositions | Operators must learn new debugging tools | All grammar compositions emit identical step telemetry; composition source is drill-down metadata |
| Planning logic embedded in handler configuration | Handler behavior becomes unpredictable | Handlers are deterministic; planning is a separate step type |
| Fragment schema coupled to Tasker internals | LLM must understand orchestration mechanics | Fragment schema expresses intent; materialization is the system’s job |
| Budget controls scattered across configuration | No single place to understand resource limits | Budget hierarchy in task template, visible and auditable |
| Context accumulation that silently drops information | Planning quality degrades mysteriously | Explicit summarization steps with configurable strategies |
| Capability schema that describes implementation | LLM reasons about wrong abstractions | Capability schema describes what, never how. Derived from grammar types, not hand-authored |
| Action grammar internals exposed to operators | Operators must understand Rust trait composition | Grammar compositions are opaque to the operator; they see “http_request handler” not “Acquire → Transform → Validate” |
| Runtime validation duplicating compile-time checks | Wasted cycles and confusing error messages | Primitive correctness is verified at compile time; composition correctness at assembly time; runtime validates only fragment references |
| Agent-specific task types or APIs | Agents appear as a special class of client | All tasks are identical; agents use the same API as any client |
| Agent delegation tracking in a separate system | Lineage requires cross-system correlation | parent_correlation_id is a standard task field; lineage queries use standard task queries |
| Grammar composition details exposed in workflow visualization by default | Operators see implementation details they don’t need | Composition details are available on drill-down, not in the default step view |
Summary: The Complexity Budget
Every system has a complexity budget — the amount of complexity humans can manage before the system becomes opaque. Dynamic workflow planning spends from this budget. The question is whether we get proportional value.
What we spend:
- One new step type (planning step)
- One new concept (workflow fragments)
- One new compositional layer (action grammar primitives — but invisible to operators)
- One new metadata layer (planning and composition provenance)
- One new resource dimension (planning budgets)
- One new trust distinction (developer-authored handlers vs. system-invoked grammar compositions)
- One new client pattern (agents as task-creating clients — but using existing APIs)
What we get:
- Workflows that adapt to their inputs
- Multi-phase problem solving with accumulated context
- Composition of generic capabilities without custom code, with compile-time verified primitives and assembly-time validated compositions
- Dynamic compositions that can be constructed for any problem without registering new patterns
- Agents that can structure their own investigation using Tasker’s execution guarantees
- Gradual automation of workflow design — from developer tooling (Phase 0) through agent-driven workflows
- A type system for workflow actions that makes the vocabulary extensible without sacrificing safety
What we protect:
- The step as the atomic unit (unchanged)
- Execution guarantees (unchanged)
- Observability patterns (extended, not replaced)
- Operator investigation workflows (extended, not replaced)
- Template as safety contract (strengthened, not weakened)
- API uniformity (agents use the same APIs as any client)
What we actively reduce:
- Runtime type errors in handler compositions (primitives verified at compile time, compositions validated at assembly time)
- Capability schema drift from implementation (schemas derived from types, not hand-maintained)
- Configuration-driven failure modes (grammar compositions are verified before they can be referenced)
- Agent investigation overhead (structured research workflows replace ad hoc manual investigation)
The complexity budget is balanced when the new capabilities justify the new concepts, and the existing foundations are preserved. This document’s purpose is to ensure we stay within budget.
This document applies to all phases of the generative workflow initiative and the agent integration patterns. It should be reviewed and updated as each phase is implemented and operational experience reveals new complexity management needs.
Tasker Core Reference
This directory contains technical reference documentation with precise specifications and implementation details.
Documents
| Document | Description |
|---|---|
| StepContext API | Cross-language API reference for step handlers |
| Table Management | Database table structure and management |
| Task and Step Readiness | SQL functions and execution logic |
| sccache Configuration | Build caching setup |
| Library Deployment Patterns | Library distribution strategies |
| FFI Telemetry Pattern | Cross-language telemetry integration |
When to Read These
- Need exact behavior: Consult these for precise specifications
- Debugging edge cases: Check implementation details
- Database operations: See Table Management and SQL functions
- Build optimization: Review sccache Configuration
Related Documentation
- Architecture - System structure and patterns
- Guides - Practical how-to documentation
- Development - Developer tooling and patterns
Class-Based Handlers
The class-based pattern is fully supported and will continue to work in all future versions. For new projects, we recommend the DSL approach — it produces shorter handlers with typed signatures that make the data flow explicit. This page documents the class-based alternative.
When to Use Class-Based Handlers
- Existing codebases with class hierarchies that benefit from inheritance
- Complex handler lifecycle requirements (custom initialization, shared state across calls)
- API handlers that need the `APIMixin` HTTP client methods
- Batchable handlers with complex aggregation logic
Step Handler
The base handler type. All other types extend it.
Python
```python
from tasker_core import StepContext, StepHandler, StepHandlerResult

class ProcessOrderHandler(StepHandler):
    handler_name = "process_order"
    handler_version = "1.0.0"

    def call(self, context: StepContext) -> StepHandlerResult:
        # Access input data from the task context
        input_data = context.input_data

        # Access results from upstream dependency steps
        prev_result = context.get_dependency_result("previous_step_name")

        result = {
            "processed": True,
            "handler": "process_order",
        }
        return StepHandlerResult.success(result=result)
```
Ruby
```ruby
require 'tasker_core'

module Handlers
  class ProcessOrderHandler < TaskerCore::StepHandler::Base
    def call(context)
      # Access input data from the task context
      input = context.input_data

      # Access results from upstream dependency steps
      # prev_result = context.get_dependency_result('previous_step_name')

      result = {
        processed: true,
        handler: 'process_order'
      }
      success(result: result)
    rescue StandardError => e
      failure(
        message: e.message,
        error_type: 'RetryableError',
        retryable: true,
        metadata: { handler: 'process_order' }
      )
    end
  end
end
```
TypeScript
```typescript
import {
  StepHandler,
  type StepContext,
  type StepHandlerResult,
  ErrorType,
} from '@tasker-systems/tasker';

export class ProcessOrderHandler extends StepHandler {
  static handlerName = 'process_order';
  static handlerVersion = '1.0.0';

  async call(context: StepContext): Promise<StepHandlerResult> {
    try {
      // Access input data from the task context
      const inputData = context.inputData;

      // Access results from upstream dependency steps
      // const prevResult = context.getDependencyResult('previous_step_name');

      const result = {
        processed: true,
        handler: 'process_order',
      };
      return this.success(result);
    } catch (error) {
      return this.failure(
        error instanceof Error ? error.message : String(error),
        ErrorType.RETRYABLE_ERROR,
        true,
      );
    }
  }
}
```
Rust
Rust uses the RustStepHandler trait directly — this is Rust’s only handler pattern (no DSL equivalent, by design).
```rust
use anyhow::Result;
use async_trait::async_trait;
use serde_json::json;
use std::time::Instant;
use tasker_shared::messaging::StepExecutionResult;
use tasker_shared::types::TaskSequenceStep;
use tasker_worker_rust::{success_result, RustStepHandler};
use tasker_worker_rust::step_handlers::StepHandlerConfig;

pub struct ProcessOrderHandler {
    config: StepHandlerConfig,
}

#[async_trait]
impl RustStepHandler for ProcessOrderHandler {
    fn new(config: StepHandlerConfig) -> Self {
        Self { config }
    }

    fn name(&self) -> &str {
        "process_order"
    }

    async fn call(
        &self,
        step_data: &TaskSequenceStep,
    ) -> Result<StepExecutionResult> {
        let start = Instant::now();

        // Access input data from the task context
        let _input_data = &step_data.task.context;

        // Access dependency results from upstream steps
        // let _dep_results = &step_data.dependency_results;

        let result_data = json!({
            "processed": true,
            "handler": "process_order"
        });

        let duration_ms = start.elapsed().as_millis() as i64;
        Ok(success_result(
            step_data.workflow_step.workflow_step_uuid,
            result_data,
            duration_ms,
            None,
        ))
    }
}
```
Context Access Patterns
| Concept | Python | Ruby | TypeScript | Rust |
|---|---|---|---|---|
| Input data | context.input_data | context.input_data | context.inputData | step_data.task.context |
| Dependency result | context.get_dependency_result("step") | context.get_dependency_result('step') | context.getDependencyResult('step') | step_data.dependency_results |
| Success | StepHandlerResult.success(result=data) | success(result: data) | this.success(data) | Ok(success_result(...)) |
| Failure | StepHandlerResult.failure(...) | failure(message:, ...) | this.failure(msg, type, retryable) | Ok(error_result(...)) |
API Handler
The APIMixin adds HTTP client methods with built-in error classification. It provides self.get(), self.post(), self.put(), self.patch(), self.delete() methods that return an ApiResponse wrapper, plus self.api_success() and self.api_failure() helpers that automatically classify HTTP errors as retryable or permanent.
When to use: Calling external APIs where you need to distinguish retryable errors (5xx, timeouts) from permanent errors (4xx).
```python
import httpx
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import APIMixin

class FetchOrderHandler(APIMixin, StepHandler):
    handler_name = "fetch_order"
    base_url = "https://api.example.com"
    default_timeout = 30.0

    def call(self, context):
        order_id = context.input_data["order_id"]
        try:
            response = self.get(f"/orders/{order_id}")
        except httpx.ConnectError as e:
            return self.connection_error(e, "fetching order")
        except httpx.TimeoutException as e:
            return self.timeout_error(e, "fetching order")

        if response.ok:
            return self.api_success(response)
        return self.api_failure(response)
```
The APIMixin provides:
| Method | Purpose |
|---|---|
| `self.get()`, `self.post()`, etc. | HTTP methods returning `ApiResponse` |
| `self.api_success(response)` | Success result with response metadata |
| `self.api_failure(response)` | Failure with automatic error classification (4xx = permanent, 5xx/429 = retryable) |
| `self.connection_error(exc)` | Retryable failure for connection errors |
| `self.timeout_error(exc)` | Retryable failure for timeouts |
Ruby and TypeScript provide equivalent API handler mixins with the same error classification pattern.
Decision Handler
The DecisionMixin adds workflow routing methods: `self.decision_success()` activates downstream steps, and `self.skip_branches()` signals a successful outcome when no downstream steps should execute.
When to use: Conditional branching — when the next steps depend on runtime data.
```python
from tasker_core.step_handler import StepHandler
from tasker_core.step_handler.mixins import DecisionMixin

class OrderRoutingHandler(DecisionMixin, StepHandler):
    handler_name = "order_routing"

    def call(self, context):
        order_type = context.input_data.get("order_type")

        if order_type == "premium":
            return self.decision_success(
                ["validate_premium", "process_premium"],
                routing_context={"order_type": order_type},
            )
        elif order_type == "review_required":
            return self.decision_success(
                ["manual_review", "approval_gate"],
                routing_context={"order_type": order_type},
            )
        else:
            return self.decision_success(["standard_processing"])
```
The DecisionMixin provides:
| Method | Purpose |
|---|---|
| `self.decision_success(steps, routing_context)` | Activate downstream steps by name |
| `self.skip_branches(reason)` | Successful outcome with no follow-up steps |
| `self.decision_failure(message)` | Decision could not be made (usually not retryable) |
See Conditional Workflows for decision handler patterns in depth.
Batchable Handler
The Batchable mixin adds batch processing methods for splitting large workloads into parallel cursor-based batches.
When to use: Processing large datasets where you want to divide work across multiple parallel workers.
Workflow pattern:
- Analyzer step — determines total work and creates cursor configs that divide it into batches
- Worker steps — Tasker spawns parallel workers, each processing one batch
- Aggregator step — (optional) combines results from all workers
Analyzer
```python
from tasker_core.step_handler import StepHandler
from tasker_core.batch_processing import Batchable

class CsvAnalyzerHandler(StepHandler, Batchable):
    handler_name = "analyze_csv"

    def call(self, context):
        total_rows = int(context.input_data.get("total_rows", 10000))

        outcome = self.create_batch_outcome(
            total_items=total_rows,
            batch_size=100,
        )
        return self.batch_analyzer_success(outcome)
```
Worker
```python
class CsvBatchProcessorHandler(StepHandler, Batchable):
    handler_name = "process_csv_batch"

    def call(self, context):
        batch_context = self.get_batch_worker_context(context)
        cursor = batch_context.cursor

        # Process rows in the assigned range
        rows_processed = cursor.end_cursor - cursor.start_cursor

        return self.batch_worker_success(
            batch_context,
            result={"rows_processed": rows_processed},
        )
```
Aggregator
```python
from tasker_core.batch_processing import Batchable, BatchAggregationScenario

class CsvResultsAggregatorHandler(StepHandler, Batchable):
    handler_name = "aggregate_csv_results"

    def call(self, context):
        scenario = BatchAggregationScenario.detect(
            context.dependency_results,
            "analyze_csv",
            "process_csv_batch_",
        )
        if scenario.is_no_batches:
            return self.success({"total_rows": 0, "skipped": True})

        total = sum(
            r.get("rows_processed", 0)
            for r in scenario.batch_results.values()
        )
        return self.success({
            "total_rows": total,
            "worker_count": scenario.worker_count,
        })
```
The Batchable mixin provides:
| Method | Role | Purpose |
|---|---|---|
| `self.create_batch_outcome(total_items, batch_size)` | Analyzer | Create cursor ranges dividing work into batches |
| `self.batch_analyzer_success(outcome)` | Analyzer | Return batch config for worker spawning |
| `self.get_batch_worker_context(context)` | Worker | Extract cursor and batch metadata |
| `self.batch_worker_success(batch_context, result)` | Worker | Return per-batch results |
| `BatchAggregationScenario.detect(...)` | Aggregator | Detect whether batches ran and collect results |
See Batch Processing for the full analyzer/worker/aggregator pattern with production examples.
Registering Class-Based Handlers
Handlers are resolved by matching the handler.callable field in task template YAML. The callable format varies by language:
| Language | Format | Example |
|---|---|---|
| Ruby | Module::ClassName | Handlers::ProcessOrderHandler |
| Python | module.file.ClassName | handlers.process_order_handler.ProcessOrderHandler |
| TypeScript | ClassName | ProcessOrderHandler |
| Rust | function_name | process_order |
See Handler Resolution for the full resolver chain and how callables are matched to handler implementations.
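To illustrate how a dotted-path callable like `module.file.ClassName` can be resolved in Python, here is a minimal sketch using `importlib`. It uses a stdlib class as a stand-in for a handler class; the `resolve_callable` function is illustrative, not Tasker's actual resolver.

```python
import importlib

def resolve_callable(callable_path: str):
    """Split a dotted path into module path and class name, then import
    the module and look up the class. Mirrors the `module.file.ClassName`
    format: everything before the last dot is the module, the final
    segment is the class name."""
    module_path, class_name = callable_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Using a stdlib class as a stand-in for a handler class:
decoder_cls = resolve_callable("json.decoder.JSONDecoder")
print(decoder_cls.__name__)  # JSONDecoder
```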
FFI Boundary Types Reference
Cross-language type harmonization for Rust, Ruby, Python, and TypeScript boundaries.
This document defines the canonical FFI boundary types that cross the boundary between the Rust orchestration layer and the Ruby, Python, and TypeScript worker implementations. These types are critical for correct serialization and deserialization between languages.
Overview
The tasker-core system uses FFI (Foreign Function Interface) to integrate Rust orchestration
with Ruby, Python, and TypeScript step handlers. Data crosses this boundary via framework-native
mechanisms: Magnus (Ruby), PyO3 (Python), and napi-rs (TypeScript). For complex types like
StepExecutionResult, all three workers use serde-based deserialization — the language side
builds a dict/hash/object with snake_case keys matching Rust serde field names, then Rust
deserializes via serde_magnus::deserialize(), depythonize(), or serde_json::from_value()
respectively. These types must remain consistent across all four languages.
Source of Truth: Rust types in tasker-shared/src/messaging/execution_types.rs and
tasker-shared/src/models/core/batch_worker.rs.
Type Mapping
| Rust Type | Python Type | TypeScript Type |
|---|---|---|
| `CursorConfig` | `RustCursorConfig` | `RustCursorConfig` |
| `BatchProcessingOutcome` | `BatchProcessingOutcome` | `BatchProcessingOutcome` |
| `BatchWorkerInputs` | `RustBatchWorkerInputs` | `RustBatchWorkerInputs` |
| `BatchMetadata` | `BatchMetadata` | `BatchMetadata` |
| `FailureStrategy` | `FailureStrategy` | `FailureStrategy` |
CursorConfig
Cursor configuration for a single batch’s position and range.
Flexible Cursor Types
Unlike simple integer cursors, RustCursorConfig supports flexible cursor values:
- Integer for record IDs: `123`
- String for timestamps: `"2025-11-01T00:00:00Z"`
- Object for composite keys: `{"page": 1, "offset": 0}`
This enables cursor-based pagination across diverse data sources.
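For the common integer-cursor case, the division of work into half-open cursor ranges can be sketched as a plain function producing dicts in the CursorConfig wire format. This is a standalone illustration of the range math, not the packages' `create_batch_outcome` implementation.

```python
def build_cursor_configs(total_items: int, batch_size: int) -> list[dict]:
    """Divide [0, total_items) into half-open cursor ranges matching
    the CursorConfig wire format (integer cursors)."""
    configs = []
    start = 0
    batch_num = 1
    while start < total_items:
        end = min(start + batch_size, total_items)
        configs.append({
            "batch_id": f"batch_{batch_num:03d}",
            "start_cursor": start,
            "end_cursor": end,
            "batch_size": end - start,
        })
        start = end
        batch_num += 1
    return configs

configs = build_cursor_configs(total_items=2500, batch_size=1000)
# Three ranges: [0, 1000), [1000, 2000), [2000, 2500)
```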
Rust Definition
#![allow(unused)]
fn main() {
// tasker-shared/src/messaging/execution_types.rs
pub struct CursorConfig {
pub batch_id: String,
pub start_cursor: serde_json::Value, // Flexible type
pub end_cursor: serde_json::Value, // Flexible type
pub batch_size: u32,
}
}
TypeScript Definition
// workers/typescript/src/types/batch.ts
export interface RustCursorConfig {
batch_id: string;
start_cursor: unknown; // Flexible: number | string | object
end_cursor: unknown;
batch_size: number;
}
Python Definition
# workers/python/python/tasker_core/types.py
class RustCursorConfig(BaseModel):
batch_id: str
start_cursor: Any # Flexible: int | str | dict
end_cursor: Any
batch_size: int
JSON Wire Format
{
"batch_id": "batch_001",
"start_cursor": 0,
"end_cursor": 1000,
"batch_size": 1000
}
BatchProcessingOutcome
Discriminated union representing the outcome of a batchable step.
Rust Definition
#![allow(unused)]
fn main() {
// tasker-shared/src/messaging/execution_types.rs
#[serde(tag = "type", rename_all = "snake_case")]
pub enum BatchProcessingOutcome {
NoBatches,
CreateBatches {
worker_template_name: String,
worker_count: u32,
cursor_configs: Vec<CursorConfig>,
total_items: u64,
},
}
}
TypeScript Definition
// workers/typescript/src/types/batch.ts
export interface NoBatchesOutcome {
type: 'no_batches';
}
export interface CreateBatchesOutcome {
type: 'create_batches';
worker_template_name: string;
worker_count: number;
cursor_configs: RustCursorConfig[];
total_items: number;
}
export type BatchProcessingOutcome = NoBatchesOutcome | CreateBatchesOutcome;
Python Definition
# workers/python/python/tasker_core/types.py
class NoBatchesOutcome(BaseModel):
type: str = "no_batches"
class CreateBatchesOutcome(BaseModel):
type: str = "create_batches"
worker_template_name: str
worker_count: int
cursor_configs: list[RustCursorConfig]
total_items: int
BatchProcessingOutcome = NoBatchesOutcome | CreateBatchesOutcome
JSON Wire Formats
NoBatches:
{
"type": "no_batches"
}
CreateBatches:
{
"type": "create_batches",
"worker_template_name": "batch_worker_template",
"worker_count": 5,
"cursor_configs": [
{ "batch_id": "001", "start_cursor": 0, "end_cursor": 1000, "batch_size": 1000 },
{ "batch_id": "002", "start_cursor": 1000, "end_cursor": 2000, "batch_size": 1000 }
],
"total_items": 5000
}
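Because the union is tagged by `type`, a consumer can dispatch on that field after deserializing the JSON. A minimal sketch of that dispatch, using only the stdlib `json` module (the handler function itself is illustrative, independent of the worker packages):

```python
import json

def handle_outcome(payload: str) -> str:
    """Dispatch on the serde `type` tag of a BatchProcessingOutcome."""
    outcome = json.loads(payload)
    if outcome["type"] == "no_batches":
        return "nothing to do"
    if outcome["type"] == "create_batches":
        return f"spawn {outcome['worker_count']} x {outcome['worker_template_name']}"
    raise ValueError(f"unknown outcome type: {outcome['type']}")

print(handle_outcome('{"type": "no_batches"}'))  # nothing to do
```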
BatchWorkerInputs
Initialization inputs for batch worker instances, stored in workflow_steps.inputs.
Rust Definition
#![allow(unused)]
fn main() {
// tasker-shared/src/models/core/batch_worker.rs
pub struct BatchWorkerInputs {
pub cursor: CursorConfig,
pub batch_metadata: BatchMetadata,
pub is_no_op: bool,
}
pub struct BatchMetadata {
// checkpoint_interval removed - handlers decide when to checkpoint
pub cursor_field: String,
pub failure_strategy: FailureStrategy,
}
pub enum FailureStrategy {
ContinueOnFailure,
FailFast,
Isolate,
}
}
TypeScript Definition
// workers/typescript/src/types/batch.ts
export type FailureStrategy = 'continue_on_failure' | 'fail_fast' | 'isolate';
export interface BatchMetadata {
// checkpoint_interval removed - handlers decide when to checkpoint
cursor_field: string;
failure_strategy: FailureStrategy;
}
export interface RustBatchWorkerInputs {
cursor: RustCursorConfig;
batch_metadata: BatchMetadata;
is_no_op: boolean;
}
Python Definition
# workers/python/python/tasker_core/types.py
class FailureStrategy(str, Enum):
CONTINUE_ON_FAILURE = "continue_on_failure"
FAIL_FAST = "fail_fast"
ISOLATE = "isolate"
class BatchMetadata(BaseModel):
# checkpoint_interval removed - handlers decide when to checkpoint
cursor_field: str
failure_strategy: FailureStrategy
class RustBatchWorkerInputs(BaseModel):
cursor: RustCursorConfig
batch_metadata: BatchMetadata
is_no_op: bool
JSON Wire Format
{
"cursor": {
"batch_id": "batch_001",
"start_cursor": 0,
"end_cursor": 1000,
"batch_size": 1000
},
"batch_metadata": {
"cursor_field": "id",
"failure_strategy": "continue_on_failure"
},
"is_no_op": false
}
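A worker receiving this payload can sanity-check it before constructing typed models. The following validator is a hedged sketch of such a check, using only field names from the wire format above; it is not part of the worker packages.

```python
# Allowed values come from the FailureStrategy wire encoding above.
ALLOWED_STRATEGIES = {"continue_on_failure", "fail_fast", "isolate"}

def validate_batch_worker_inputs(inputs: dict) -> None:
    """Raise ValueError if the payload deviates from the wire format."""
    for key in ("cursor", "batch_metadata", "is_no_op"):
        if key not in inputs:
            raise ValueError(f"missing field: {key}")
    strategy = inputs["batch_metadata"].get("failure_strategy")
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unknown failure_strategy: {strategy}")
    if not isinstance(inputs["is_no_op"], bool):
        raise ValueError("is_no_op must be a boolean")

payload = {
    "cursor": {"batch_id": "batch_001", "start_cursor": 0,
               "end_cursor": 1000, "batch_size": 1000},
    "batch_metadata": {"cursor_field": "id",
                       "failure_strategy": "continue_on_failure"},
    "is_no_op": False,
}
validate_batch_worker_inputs(payload)  # passes silently
```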
BatchAggregationResult
Standardized result from aggregating multiple batch worker results.
Cross-Language Standard
All three languages produce identical aggregation results:
| Field | Type | Description |
|---|---|---|
| `total_processed` | int | Items processed across all batches |
| `total_succeeded` | int | Items that succeeded |
| `total_failed` | int | Items that failed |
| `total_skipped` | int | Items that were skipped |
| `batch_count` | int | Number of batch workers that ran |
| `success_rate` | float | Success rate (0.0 to 1.0) |
| `errors` | array | Collected errors (limited to 100) |
| `error_count` | int | Total error count |
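The contract in the table can be sketched as a plain function over worker result dicts. This is a standalone illustration of the field semantics, assuming each worker reports `processed`/`succeeded`/`failed`/`skipped` counts and an `errors` list; it is not the packages' actual `aggregate_batch_results` implementation.

```python
def aggregate(worker_results: list[dict]) -> dict:
    """Combine per-batch worker results into the standard summary shape."""
    total_processed = sum(r.get("processed", 0) for r in worker_results)
    total_succeeded = sum(r.get("succeeded", 0) for r in worker_results)
    total_failed = sum(r.get("failed", 0) for r in worker_results)
    total_skipped = sum(r.get("skipped", 0) for r in worker_results)
    errors = [e for r in worker_results for e in r.get("errors", [])]
    return {
        "total_processed": total_processed,
        "total_succeeded": total_succeeded,
        "total_failed": total_failed,
        "total_skipped": total_skipped,
        "batch_count": len(worker_results),
        "success_rate": (total_succeeded / total_processed) if total_processed else 0.0,
        "errors": errors[:100],  # collected errors are capped at 100
        "error_count": len(errors),
    }

summary = aggregate([
    {"processed": 100, "succeeded": 98, "failed": 2, "errors": ["row 7", "row 42"]},
    {"processed": 50, "succeeded": 50},
])
# batch_count=2, total_processed=150, error_count=2
```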
Usage Examples
TypeScript:
import { aggregateBatchResults } from 'tasker-core';
const workerResults = Object.values(context.previousResults)
.filter(r => r?.batch_worker);
const summary = aggregateBatchResults(workerResults);
return this.success(summary);
Python:
from tasker_core.types import aggregate_batch_results
worker_results = [
context.get_dependency_result(f"worker_{i}")
for i in range(batch_count)
]
summary = aggregate_batch_results(worker_results)
return self.success(summary.model_dump())
Factory Functions
Creating BatchProcessingOutcome
TypeScript:
import { noBatches, createBatches, RustCursorConfig } from 'tasker-core';
// No batches needed
const outcome1 = noBatches();
// Create batch workers
const configs: RustCursorConfig[] = [
{ batch_id: '001', start_cursor: 0, end_cursor: 1000, batch_size: 1000 },
{ batch_id: '002', start_cursor: 1000, end_cursor: 2000, batch_size: 1000 },
];
const outcome2 = createBatches('process_batch', 2, configs, 2000);
Python:
from tasker_core.types import no_batches, create_batches, RustCursorConfig
# No batches needed
outcome1 = no_batches()
# Create batch workers
configs = [
RustCursorConfig(batch_id="001", start_cursor=0, end_cursor=1000, batch_size=1000),
RustCursorConfig(batch_id="002", start_cursor=1000, end_cursor=2000, batch_size=1000),
]
outcome2 = create_batches("process_batch", 2, configs, 2000)
Type Guards (TypeScript)
import {
BatchProcessingOutcome,
isNoBatches,
isCreateBatches
} from 'tasker-core';
function handleOutcome(outcome: BatchProcessingOutcome): void {
if (isNoBatches(outcome)) {
console.log('No batches needed');
return;
}
if (isCreateBatches(outcome)) {
console.log(`Creating ${outcome.worker_count} workers`);
console.log(`Total items: ${outcome.total_items}`);
}
}
Migration Notes
From Legacy Types
If migrating from older batch processing types:
- `CursorConfig` → `RustCursorConfig`: The new type adds a `batch_id` field and uses flexible cursor types (`unknown`/`Any`) instead of fixed `number`/`int`.
- Inline `batch_processing_outcome` → `BatchProcessingOutcome`: Use the discriminated union type with factory functions instead of building JSON manually.
- Manual aggregation → `aggregateBatchResults`: Use the standardized aggregation function for consistent cross-language behavior.
Backwards Compatibility
The legacy CursorConfig type (with number/int cursors) is preserved for simple
use cases. Use RustCursorConfig when:
- Working with Rust orchestration inputs
- Needing flexible cursor types (timestamps, UUIDs, composites)
- Building `BatchProcessingOutcome` structures
Related Documentation
FFI Telemetry Initialization Pattern
Overview
This document describes the two-phase telemetry initialization pattern for Foreign Function Interface (FFI) integrations where Rust code is called from languages that don’t have a Tokio runtime during initialization (Ruby, Python, WASM).
The Problem
OpenTelemetry batch exporter requires a Tokio runtime context for async I/O operations:
#![allow(unused)]
fn main() {
// This PANICS if called outside a Tokio runtime
let tracer_provider = SdkTracerProvider::builder()
.with_batch_exporter(exporter) // ❌ Requires Tokio runtime
.with_resource(resource)
.with_sampler(sampler)
.build();
}
FFI Initialization Timeline:
1. Language Runtime Loads Extension (Ruby, Python, WASM)
↓ No Tokio runtime exists yet
2. Extension Init Function Called (Magnus init, PyO3 init, etc.)
↓ Logging needed for debugging, but no async runtime
3. Later: Create Tokio Runtime
↓ Now safe to initialize telemetry
4. Bootstrap Worker System
The Solution: Two-Phase Initialization
Phase 1: Console-Only Logging (FFI-Safe)
During language extension initialization, use console-only logging that requires no Tokio runtime:
#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs
pub fn init_console_only() {
// Initialize console logging without OpenTelemetry
// Safe to call from any thread, no async runtime required
}
}
When to use:
- During Magnus initialization (Ruby)
- During PyO3 initialization (Python)
- During WASM module initialization
- Any context where no Tokio runtime exists
Phase 2: Full Telemetry (Tokio Context)
After creating the Tokio runtime, initialize full telemetry including OpenTelemetry:
#![allow(unused)]
fn main() {
// Create Tokio runtime
let runtime = tokio::runtime::Runtime::new()?;
// Initialize telemetry in runtime context
runtime.block_on(async {
tasker_shared::logging::init_tracing();
});
}
When to use:
- After creating Tokio runtime in bootstrap
- Inside `runtime.block_on()` context
- When async I/O is available
Implementation Guide
Ruby FFI (Magnus)
File Structure:
- `workers/ruby/ext/tasker_core/src/ffi_logging.rs` - Phase 1
- `workers/ruby/ext/tasker_core/src/bootstrap.rs` - Phase 2
Phase 1: Magnus Initialization
#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/ffi_logging.rs
pub fn init_ffi_logger() -> Result<(), Box<dyn std::error::Error>> {
// Check if telemetry is enabled
let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
.map(|v| v.to_lowercase() == "true")
.unwrap_or(false);
if telemetry_enabled {
// Phase 1: Defer telemetry init to runtime context
println!("Telemetry enabled - deferring logging init to runtime context");
} else {
// Phase 1: Safe to initialize console-only logging
tasker_shared::logging::init_console_only();
tasker_shared::log_ffi!(
info,
"FFI console logging initialized (no telemetry)",
component: "ffi_boundary"
);
}
Ok(())
}
}
Phase 2: After Runtime Creation
#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/bootstrap.rs
pub fn bootstrap_worker() -> Result<Value, Error> {
// Create tokio runtime
let runtime = tokio::runtime::Runtime::new()?;
// Phase 2: Initialize telemetry in Tokio runtime context
runtime.block_on(async {
tasker_shared::logging::init_tracing();
});
// Continue with bootstrap...
let system_context = runtime.block_on(async {
SystemContext::new_for_worker().await
})?;
// ... rest of bootstrap
}
}
Python FFI (PyO3)
Phase 1: PyO3 Module Initialization
#![allow(unused)]
fn main() {
// workers/python/src/lib.rs
#[pymodule]
fn tasker_core(py: Python, m: &PyModule) -> PyResult<()> {
// Check if telemetry is enabled
let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
.map(|v| v.to_lowercase() == "true")
.unwrap_or(false);
if telemetry_enabled {
println!("Telemetry enabled - deferring logging init to runtime context");
} else {
tasker_shared::logging::init_console_only();
}
// Register Python functions...
m.add_function(wrap_pyfunction!(bootstrap_worker, m)?)?;
Ok(())
}
}
Phase 2: After Runtime Creation
#![allow(unused)]
fn main() {
// workers/python/src/bootstrap.rs
#[pyfunction]
pub fn bootstrap_worker() -> PyResult<String> {
// Create tokio runtime
let runtime = tokio::runtime::Runtime::new()
.map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(
format!("Failed to create runtime: {}", e)
))?;
// Phase 2: Initialize telemetry in Tokio runtime context
runtime.block_on(async {
tasker_shared::logging::init_tracing();
});
// Continue with bootstrap...
let system_context = runtime.block_on(async {
SystemContext::new_for_worker().await
})?;
// ... rest of bootstrap
}
}
WASM FFI
Note: WASM worker support is planned but not yet implemented. The pattern below shows the intended initialization approach.
Phase 1: WASM Module Initialization
#![allow(unused)]
fn main() {
// workers/wasm/src/lib.rs
#[wasm_bindgen(start)]
pub fn init_wasm() {
// Check if telemetry is enabled (from JS environment)
let telemetry_enabled = js_sys::Reflect::get(
&js_sys::global(),
&"TELEMETRY_ENABLED".into()
).ok()
.and_then(|v| v.as_bool())
.unwrap_or(false);
if telemetry_enabled {
web_sys::console::log_1(&"Telemetry enabled - deferring logging init to runtime context".into());
} else {
tasker_shared::logging::init_console_only();
}
}
}
Phase 2: After Runtime Creation
#![allow(unused)]
fn main() {
// workers/wasm/src/bootstrap.rs
#[wasm_bindgen]
pub async fn bootstrap_worker() -> Result<JsValue, JsValue> {
// In WASM, we're already in an async context
// Initialize telemetry directly
tasker_shared::logging::init_tracing();
// Continue with bootstrap...
let system_context = SystemContext::new_for_worker().await
.map_err(|e| JsValue::from_str(&format!("Bootstrap failed: {}", e)))?;
// ... rest of bootstrap
}
}
Docker Configuration
Enable telemetry in docker-compose with appropriate comments:
# docker/docker-compose.test.yml
ruby-worker:
environment:
# Two-phase FFI telemetry initialization pattern
# Phase 1: Magnus init skips telemetry (no runtime)
# Phase 2: bootstrap_worker() initializes telemetry in Tokio context
TELEMETRY_ENABLED: "true"
OTEL_EXPORTER_OTLP_ENDPOINT: http://observability:4317
OTEL_SERVICE_NAME: tasker-ruby-worker
OTEL_SERVICE_VERSION: "0.1.0"
Verification
Expected Log Sequence
Ruby Worker with Telemetry Enabled:
1. Magnus init:
Telemetry enabled - deferring logging init to runtime context
2. After runtime creation:
Console logging with OpenTelemetry initialized
environment=test
opentelemetry_enabled=true
otlp_endpoint=http://observability:4317
service_name=tasker-ruby-worker
3. OpenTelemetry components:
Global meter provider is set
OpenTelemetry Prometheus text exporter initialized
Ruby Worker with Telemetry Disabled:
1. Magnus init:
Console-only logging initialized (FFI-safe mode)
environment=test
opentelemetry_enabled=false
context=ffi_initialization
2. After runtime creation:
(No additional initialization - already complete)
Health Check
All workers should be healthy with telemetry enabled:
$ curl http://localhost:8082/health
{"status":"healthy","timestamp":"...","worker_id":"worker-..."}
Grafana Verification
With all services running with telemetry:
- Access Grafana: http://localhost:3000 (admin/admin)
- Navigate to Explore → Tempo
- Query by service: `tasker-ruby-worker`
- Verify traces appear with correlation IDs
Key Principles
1. Separation of Concerns
- Infrastructure Decision (Tokio runtime availability): Handled by init functions
- Business Logic (when to log): Handled by application code
- Clean separation prevents runtime panics
2. Fail-Safe Defaults
- Always provide console logging at minimum
- Telemetry is enhancement, not requirement
- Graceful degradation if telemetry unavailable
3. Explicit Over Implicit
- Clear phase separation in code
- Documented at each call site
- Easy to understand initialization flow
4. Language-Agnostic Pattern
- Same pattern works for Ruby, Python, WASM
- Consistent across all FFI bindings
- Single source of truth in tasker-shared
Troubleshooting
“no reactor running” Panic
Symptom:
thread 'main' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime'
Cause:
Calling init_tracing() when TELEMETRY_ENABLED=true outside a Tokio runtime context.
Solution: Use two-phase pattern:
#![allow(unused)]
fn main() {
// Phase 1: Skip telemetry init
if telemetry_enabled {
println!("Deferring telemetry init...");
} else {
init_console_only();
}
// Phase 2: Initialize in runtime
runtime.block_on(async {
init_tracing();
});
}
Telemetry Not Appearing
Symptom:
No traces in Grafana/Tempo despite TELEMETRY_ENABLED=true.
Check:
- Verify environment variable is set: `TELEMETRY_ENABLED=true`
- Check logs for initialization message
- Verify OTLP endpoint is reachable
- Check observability stack is healthy
Debug:
# Check worker logs
docker logs docker-ruby-worker-1 | grep -E "telemetry|OpenTelemetry"
# Check observability stack
curl http://localhost:4317 # Should connect to OTLP gRPC
# Check Grafana Tempo
curl http://localhost:3200/api/status/buildinfo
Performance Considerations
Minimal Overhead
- Phase 1: Simple console initialization, <1ms
- Phase 2: Batch exporter initialization, <10ms
- Total overhead: <15ms during startup
- Zero runtime overhead after initialization
Memory Usage
- Console-only: ~100KB (tracing subscriber)
- With telemetry: ~500KB (includes OTLP client buffers)
- Acceptable for all deployment scenarios
Future Enhancements
Lazy Telemetry Upgrade
Future optimization could upgrade console-only subscriber to include telemetry without restart:
#![allow(unused)]
fn main() {
// Not yet implemented - requires tracing layer hot-swapping
pub fn upgrade_to_telemetry() -> TaskerResult<()> {
// Would require custom subscriber implementation
// to support layer addition after initialization
}
}
Per-Worker Telemetry Control
Could extend pattern to support per-worker telemetry configuration:
#![allow(unused)]
fn main() {
// Not yet implemented
pub fn init_with_config(config: TelemetryConfig) -> TaskerResult<()> {
// Would allow fine-grained control per worker
}
}
Phase 1.5: Worker Span Instrumentation with Trace Context Propagation
Implemented: 2025-11-24
Status: ✅ Production Ready - Validated end-to-end with Ruby workers
The Challenge
After implementing two-phase telemetry initialization (Phase 1), we discovered a gap: while OpenTelemetry infrastructure was working, worker step execution spans lacked correlation attributes needed for distributed tracing.
The Problem:
- ✅ Orchestration spans had correlation_id, task_uuid, step_uuid
- ✅ Worker infrastructure spans existed (read_messages, reserve_capacity)
- ❌ Worker step execution spans were missing these attributes
Root Cause: Ruby workers use an async dual-event-system architecture where:
- Rust worker fires FFI event to Ruby (via EventPoller polling every 10ms)
- Ruby processes event asynchronously
- Ruby returns completion via FFI
The async boundary made traditional span scope maintenance impossible.
The Solution: Trace ID Propagation Pattern
Instead of trying to maintain span scope across the async FFI boundary, we propagate trace context as opaque strings:
Rust: Extract trace_id/span_id → Add to FFI event payload →
Ruby: Treat as opaque strings → Propagate through processing → Include in completion →
Rust: Create linked span using returned trace_id/span_id
Key Insight: Ruby doesn’t need to understand OpenTelemetry - it just passes through trace IDs like it already does with correlation_id.
Implementation: Rust Side (Phase 1.5a)
File: tasker-worker/src/worker/command_processor.rs
Step 1: Create instrumented span with all required attributes
#![allow(unused)]
fn main() {
use tracing::{span, event, Level, Instrument};
pub async fn handle_execute_step(&self, step_message: SimpleStepMessage) -> TaskerResult<()> {
// Fetch step details to get step_name and namespace
let task_sequence_step = self.fetch_task_sequence_step(&step_message).await?;
// Create span with all 5 required attributes
let step_span = span!(
Level::INFO,
"worker.step_execution",
correlation_id = %step_message.correlation_id,
task_uuid = %step_message.task_uuid,
step_uuid = %step_message.step_uuid,
step_name = %task_sequence_step.workflow_step.name,
namespace = %task_sequence_step.task.namespace_name
);
let execution_result = async {
event!(Level::INFO, "step.execution_started");
// Extract trace context for FFI propagation
let trace_id = Some(step_message.correlation_id.to_string());
let span_id = Some(format!("span-{}", step_message.step_uuid));
// Fire FFI event with trace context
let result = self.event_publisher
.fire_step_execution_event_with_trace(
&task_sequence_step,
trace_id,
span_id,
)
.await?;
event!(Level::INFO, "step.execution_completed");
Ok(result)
}
.instrument(step_span) // Wrap async block with span
.await;
execution_result
}
}
Key Points:
- All 5 attributes present: `correlation_id`, `task_uuid`, `step_uuid`, `step_name`, `namespace`
- Event markers: `step.execution_started`, `step.execution_completed`
- `.instrument(span)` pattern for async code
- Trace context extracted and passed to FFI
Implementation: Data Structures
File: tasker-shared/src/types/base.rs
Add trace context fields to FFI event structures:
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionEvent {
pub event_id: Uuid,
pub task_uuid: Uuid,
pub step_uuid: Uuid,
pub task_sequence_step: TaskSequenceStep,
pub correlation_id: Uuid,
// Trace context propagation
#[serde(skip_serializing_if = "Option::is_none")]
pub trace_id: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub span_id: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionCompletionEvent {
pub event_id: Uuid,
pub task_uuid: Uuid,
pub step_uuid: Uuid,
pub success: bool,
pub result: Option<serde_json::Value>,
// Trace context from Ruby
#[serde(skip_serializing_if = "Option::is_none")]
pub trace_id: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub span_id: Option<String>,
}
}
Design Notes:
- Fields are optional for backward compatibility
- `skip_serializing_if` prevents empty fields in JSON
- Treated as opaque strings (no OpenTelemetry types)
Implementation: Ruby Side Propagation
File: workers/ruby/lib/tasker_core/event_bridge.rb
Propagate trace context like correlation_id:
def wrap_step_execution_event(event_data)
wrapped = {
event_id: event_data[:event_id],
task_uuid: event_data[:task_uuid],
step_uuid: event_data[:step_uuid],
task_sequence_step: TaskerCore::Models::TaskSequenceStepWrapper.new(event_data[:task_sequence_step])
}
# Expose correlation_id at top level for easy access
wrapped[:correlation_id] = event_data[:correlation_id] if event_data[:correlation_id]
wrapped[:parent_correlation_id] = event_data[:parent_correlation_id] if event_data[:parent_correlation_id]
# Expose trace_id and span_id for distributed tracing
wrapped[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
wrapped[:span_id] = event_data[:span_id] if event_data[:span_id]
wrapped
end
File: workers/ruby/lib/tasker_core/subscriber.rb
Include trace context in completion:
def publish_step_completion(event_data:, success:, result: nil, error_message: nil, metadata: nil)
completion_payload = {
event_id: event_data[:event_id],
task_uuid: event_data[:task_uuid],
step_uuid: event_data[:step_uuid],
success: success,
result: result,
metadata: metadata,
error_message: error_message
}
# Propagate trace context back to Rust
completion_payload[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
completion_payload[:span_id] = event_data[:span_id] if event_data[:span_id]
TaskerCore::Worker::EventBridge.instance.publish_step_completion(completion_payload)
end
Key Points:
- Ruby treats trace_id and span_id as opaque strings
- No OpenTelemetry dependency in Ruby
- Simple pass-through pattern like correlation_id
- Works with existing dual-event-system architecture
Implementation: Completion Span (Rust)
File: tasker-worker/src/worker/event_subscriber.rs
Create linked span when receiving Ruby completion:
#![allow(unused)]
fn main() {
pub fn handle_completion(&self, completion: StepExecutionCompletionEvent) -> TaskerResult<()> {
// Create linked span using trace context from Ruby
let completion_span = if let (Some(trace_id), Some(span_id)) =
(&completion.trace_id, &completion.span_id) {
span!(
Level::INFO,
"worker.step_completion_received",
trace_id = %trace_id,
span_id = %span_id,
event_id = %completion.event_id,
task_uuid = %completion.task_uuid,
step_uuid = %completion.step_uuid,
success = completion.success
)
} else {
// Fallback span without trace context
span!(
Level::INFO,
"worker.step_completion_received",
event_id = %completion.event_id,
task_uuid = %completion.task_uuid,
step_uuid = %completion.step_uuid,
success = completion.success
)
};
let _guard = completion_span.enter();
event!(Level::INFO, "step.ruby_execution_completed",
success = completion.success,
duration_ms = completion.metadata.execution_time_ms
);
// Continue with normal completion processing...
Ok(())
}
}
Key Points:
- Uses returned trace_id/span_id to create linked span
- Graceful fallback if trace context not available
- Event: `step.ruby_execution_completed`
Validation Results (2025-11-24)
Test Task:
- Correlation ID: `88f21229-4085-4d53-8f52-2fde0b7228e2`
- Task UUID: `019ab6f9-7a27-7d16-b298-1ea41b327373`
- 4 steps executed successfully
Log Evidence:
worker.step_execution{
correlation_id=88f21229-4085-4d53-8f52-2fde0b7228e2
task_uuid=019ab6f9-7a27-7d16-b298-1ea41b327373
step_uuid=019ab6f9-7a2a-7873-a5d1-93234ae46003
step_name=linear_step_1
namespace=linear_workflow
}: step.execution_started
Step execution event with trace context fired successfully to FFI handlers
trace_id=Some("88f21229-4085-4d53-8f52-2fde0b7228e2")
span_id=Some("span-019ab6f9-7a2a-7873-a5d1-93234ae46003")
worker.step_completion_received{...}: step.ruby_execution_completed
Tempo Query Results:
- By `correlation_id`: 9 traces (5 orchestration + 4 worker)
- By `task_uuid`: 13 traces (complete task lifecycle)
- ✅ All attributes indexed and queryable
- ✅ Spans exported to Tempo successfully
Complete Trace Flow
For each step execution:
┌─────────────────────────────────────────────────────┐
│ Rust Worker (command_processor.rs) │
│ 1. Create worker.step_execution span │
│ - correlation_id, task_uuid, step_uuid │
│ - step_name, namespace │
│ 2. Emit step.execution_started event │
│ 3. Extract trace_id and span_id from span │
│ 4. Add to StepExecutionEvent │
│ 5. Fire FFI event with trace context │
│ 6. Emit step.execution_completed event │
└─────────────────┬───────────────────────────────────┘
│
│ Async FFI boundary (EventPoller polling)
▼
┌─────────────────────────────────────────────────────┐
│ Ruby EventBridge & Subscriber │
│ 1. Receive event with trace_id/span_id │
│ 2. Propagate as opaque strings │
│ 3. Execute Ruby handler (business logic) │
│ 4. Include trace_id/span_id in completion │
└─────────────────┬───────────────────────────────────┘
│
│ Completion via FFI
▼
┌─────────────────────────────────────────────────────┐
│ Rust Worker (event_subscriber.rs) │
│ 1. Receive StepExecutionCompletionEvent │
│ 2. Extract trace_id and span_id │
│ 3. Create worker.step_completion_received span │
│ 4. Emit step.ruby_execution_completed event │
└─────────────────────────────────────────────────────┘
Benefits of This Pattern
- No Breaking Changes: Optional fields, backward compatible
- Ruby Simplicity: No OpenTelemetry dependency, opaque string propagation
- Trace Continuity: Same trace_id flows Rust → Ruby → Rust
- Query-Friendly: Tempo queries show complete execution flow
- Extensible: Pattern works for Python, WASM, any FFI language
- Performance: Zero overhead in Ruby (just string passing)
Pattern for Python Workers
The exact same pattern applies to Python workers:
Python Side (PyO3):
# workers/python/tasker_core/event_bridge.py
def wrap_step_execution_event(event_data):
wrapped = {
'event_id': event_data['event_id'],
'task_uuid': event_data['task_uuid'],
'step_uuid': event_data['step_uuid'],
# ... other fields
}
# Propagate trace context as opaque strings
if 'trace_id' in event_data:
wrapped['trace_id'] = event_data['trace_id']
if 'span_id' in event_data:
wrapped['span_id'] = event_data['span_id']
return wrapped
Key Insight: Any FFI language can use this pattern - they just need to pass through trace_id and span_id as strings.
Performance Characteristics
- Rust overhead: ~50-100 microseconds per span creation
- FFI overhead: ~10-50 microseconds for extra string fields
- Ruby overhead: Zero (just string passing, no OpenTelemetry)
- Total overhead: <200 microseconds per step execution
- Network: Spans batched and exported asynchronously
Troubleshooting
Symptom: Spans missing trace_id/span_id in Tempo
Check:
- Verify Rust logs show “Step execution event with trace context fired successfully”
- Check Ruby logs don’t have errors in EventBridge
- Verify completion events include trace_id/span_id
- Query Tempo by task_uuid to see if spans exist
Debug:
# Check Rust worker logs for trace context
docker logs docker-ruby-worker-1 | grep -E "(trace_id|span_id)"
# Query Tempo by task_uuid
curl "http://localhost:3200/api/search?tags=task_uuid=<UUID>"
# Check span export metrics
curl "http://localhost:9090/metrics" | grep otel
Future Enhancements
OpenTelemetry W3C Trace Context: Currently using correlation_id as trace_id placeholder. Future enhancement:
#![allow(unused)]
fn main() {
use opentelemetry::trace::TraceContextExt;
// Extract real OpenTelemetry trace context
let cx = tracing::Span::current().context();
let span_context = cx.span().span_context();
let trace_id = span_context.trace_id().to_string();
let span_id = span_context.span_id().to_string();
}
Span Linking:
Use OpenTelemetry’s Link API for explicit parent-child relationships:
#![allow(unused)]
fn main() {
use opentelemetry::trace::{Link, SpanContext, TraceId, SpanId};
// Create linked span
let parent_context = SpanContext::new(
TraceId::from_hex(&trace_id)?,
SpanId::from_hex(&span_id)?,
TraceFlags::default(),
false,
TraceState::default(),
);
let span = span!(
Level::INFO,
"worker.step_completion_received",
links = vec![Link::new(parent_context, Vec::new())]
);
}
References
- OpenTelemetry Rust: https://github.com/open-telemetry/opentelemetry-rust
- Grafana LGTM Stack: https://grafana.com/oss/lgtm-stack/
- W3C Trace Context: https://www.w3.org/TR/trace-context/
Related Documentation
- `tasker-shared/src/logging.rs` - Core logging implementation
- `workers/rust/README.md` - Event-driven FFI architecture
- `docs/batch-processing.md` - Distributed tracing integration
- `docker/docker-compose.test.yml` - Observability stack configuration
Status: ✅ Production Ready - Two-phase initialization and Phase 1.5 worker span instrumentation patterns implemented and validated with Ruby FFI. Ready for Python and WASM implementations.
Library Deployment Patterns
This document describes the library deployment patterns feature, which enables applications to consume worker observability data (health, metrics, templates, configuration) either via the HTTP API or directly through FFI, without running a web server.
Overview
Previously, applications needed to run the worker’s HTTP server to access observability data. This created deployment overhead for applications that only needed programmatic access to health checks, metrics, or template information.
The library deployment patterns feature:
- Extracts observability logic into reusable services - Business logic moved from HTTP handlers to service classes
- Exposes services via FFI - Same functionality available without HTTP overhead
- Provides Ruby wrapper layer - Type-safe Ruby interface with dry-struct types
- Makes HTTP server optional - Services always available, web server is opt-in
Architecture
Service Layer
Four services encapsulate observability logic:
tasker-worker/src/worker/services/
├── health/ # HealthService - health checks
├── metrics/ # MetricsService - metrics collection
├── template_query/ # TemplateQueryService - template operations
└── config_query/ # ConfigQueryService - configuration queries
Each service:
- Contains all business logic previously in HTTP handlers
- Is independent of HTTP transport
- Can be accessed via web handlers OR FFI
- Returns typed response structures
Service Access Patterns
┌─────────────────────────────────────────┐
│ WorkerWebState │
│ ┌────────────────────────────────────┐ │
│ │ Service Instances │ │
│ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │HealthServ.│ │MetricsService │ │ │
│ │ └────────────┘ └────────────────┘ │ │
│ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │TemplQuery │ │ConfigQuery │ │ │
│ │ └────────────┘ └────────────────┘ │ │
│ └────────────────────────────────────┘ │
└──────────────┬───────────────┬──────────┘
│ │
┌───────────────────────┴───┐ ┌─────┴──────────────────────┐
│ HTTP Handlers │ │ FFI Layer │
│ (web/handlers/*.rs) │ │ (observability_ffi.rs) │
└───────────────────────────┘ └────────────────────────────┘
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ HTTP Clients │ │ Ruby/Python │
│ curl, etc. │ │ Applications │
└───────────────┘ └───────────────┘
Usage
Ruby FFI Access
The TaskerCore::Observability module provides type-safe access to all services:
# Health checks
health = TaskerCore::Observability.health_basic
puts health.status # => "healthy"
puts health.worker_id # => "worker-abc123"
# Kubernetes-style probes
if TaskerCore::Observability.ready?
puts "Worker ready to receive requests"
end
if TaskerCore::Observability.alive?
puts "Worker is alive"
end
# Detailed health information
detailed = TaskerCore::Observability.health_detailed
detailed.checks.each do |name, check|
puts "#{name}: #{check.status} (#{check.duration_ms}ms)"
end
Metrics Access
# Domain event statistics
events = TaskerCore::Observability.event_stats
puts "Events routed: #{events.router.total_routed}"
puts "FFI dispatches: #{events.in_process_bus.ffi_channel_dispatches}"
# Prometheus format (for custom scrapers)
prometheus_text = TaskerCore::Observability.prometheus_metrics
Template Operations
# List templates (JSON string)
templates_json = TaskerCore::Observability.templates_list
# Validate a template
validation = TaskerCore::Observability.template_validate(
namespace: "payments",
name: "process_payment",
version: "v1"
)
if validation.valid
puts "Template valid with #{validation.handler_count} handlers"
else
validation.issues.each { |issue| puts "Issue: #{issue}" }
end
# Cache management
stats = TaskerCore::Observability.cache_stats
puts "Cache hits: #{stats.hits}, misses: #{stats.misses}"
TaskerCore::Observability.cache_clear # Clear all cached templates
Configuration Access
# Get runtime configuration (secrets redacted)
config = TaskerCore::Observability.config
puts "Environment: #{config.environment}"
puts "Redacted fields: #{config.metadata.redacted_fields.join(', ')}"
# Quick environment check
env = TaskerCore::Observability.environment
puts "Running in: #{env}" # => "production"
Configuration
HTTP Server Toggle
The HTTP server is now optional. Services are always created, but the HTTP server only starts if enabled:
# config/tasker/base/worker.toml
[worker.web]
enabled = true # Set to false to disable HTTP server
bind_address = "0.0.0.0:8081"
request_timeout_ms = 30000
When enabled = false:
- WorkerWebState is still created (services available)
- HTTP server does NOT start
- All services accessible via FFI only
- Reduces resource usage (no HTTP listener, no connections)
Deployment Modes
| Mode | HTTP Server | FFI Services | Use Case |
|---|---|---|---|
| Full | ✅ | ✅ | Standard deployment with monitoring |
| Library | ❌ | ✅ | Embedded in application, no external access |
| Headless | ❌ | ✅ | Container with external health checks disabled |
Type Definitions
The Ruby wrapper uses dry-struct types for structured access:
Health Types
TaskerCore::Observability::Types::BasicHealth
- status: String
- worker_id: String
- timestamp: String
TaskerCore::Observability::Types::DetailedHealth
- status: String
- timestamp: String
- worker_id: String
- checks: Hash[String, HealthCheck]
- system_info: WorkerSystemInfo
TaskerCore::Observability::Types::HealthCheck
- status: String
- message: String?
- duration_ms: Integer
- last_checked: String
Metrics Types
TaskerCore::Observability::Types::DomainEventStats
- router: EventRouterStats
- in_process_bus: InProcessEventBusStats
- captured_at: String
- worker_id: String
TaskerCore::Observability::Types::EventRouterStats
- total_routed: Integer
- durable_routed: Integer
- fast_routed: Integer
- broadcast_routed: Integer
- fast_delivery_errors: Integer
- routing_errors: Integer
Template Types
TaskerCore::Observability::Types::CacheStats
- total_entries: Integer
- hits: Integer
- misses: Integer
- evictions: Integer
- last_maintenance: String?
TaskerCore::Observability::Types::TemplateValidation
- valid: Boolean
- namespace: String
- name: String
- version: String
- handler_count: Integer
- issues: Array[String]
- handler_metadata: Hash?
Config Types
TaskerCore::Observability::Types::RuntimeConfig
- environment: String
- common: Hash
- worker: Hash
- metadata: ConfigMetadata
TaskerCore::Observability::Types::ConfigMetadata
- timestamp: String
- source: String
- redacted_fields: Array[String]
Error Handling
FFI methods raise RuntimeError on failures:
begin
health = TaskerCore::Observability.health_basic
rescue RuntimeError => e
if e.message.include?("Worker system not running")
# Worker not bootstrapped yet
elsif e.message.include?("Web state not available")
# Services not initialized
end
end
Template Operation Errors
Template operations raise RuntimeError for missing templates or namespaces:
begin
result = TaskerCore::Observability.template_get(
namespace: "unknown",
name: "missing",
version: "1.0.0"
)
rescue RuntimeError => e
puts "Template not found: #{e.message}"
end
# template_refresh handles errors gracefully, returning a result struct
result = TaskerCore::Observability.template_refresh(
namespace: "unknown",
name: "missing",
version: "1.0.0"
)
puts result.success # => false
puts result.message # => error description
Convenience Methods
The ready? and alive? methods handle errors gracefully:
# These never raise - they return false on any error
TaskerCore::Observability.ready? # => true/false
TaskerCore::Observability.alive? # => true/false
Note: alive? checks for status == "alive" (from liveness probe), while ready? checks for status == "healthy" (from readiness probe).
Best Practices
- Use type-safe methods when possible - Methods returning dry-struct types provide better validation
- Handle errors gracefully - FFI can fail if worker not bootstrapped
- Consider caching - For high-frequency health checks, cache results briefly
- Use ready?/alive? helpers - They handle exceptions and return boolean
- Prefer FFI for internal use - Less overhead than HTTP for same-process access
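The caching recommendation above can be sketched as a small TTL wrapper around any probe call. This is an illustrative, language-agnostic sketch (shown in Python); the probe function and names are hypothetical stand-ins for whatever health call your application makes:

```python
import time

class TTLCache:
    """Cache a zero-argument probe for ttl_seconds to avoid hammering FFI or HTTP."""
    def __init__(self, probe, ttl_seconds=5.0, clock=time.monotonic):
        self.probe = probe
        self.ttl = ttl_seconds
        self.clock = clock
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = self.clock()
        if now >= self._expires_at:
            self._value = self.probe()  # only re-probe after the TTL elapses
            self._expires_at = now + self.ttl
        return self._value

# Hypothetical probe standing in for a real readiness call
calls = {"n": 0}
def ready_probe():
    calls["n"] += 1
    return True

ready = TTLCache(ready_probe, ttl_seconds=5.0)
assert ready.get() is True
assert ready.get() is True  # served from cache, probe not called again
assert calls["n"] == 1
```

A few seconds of staleness is usually acceptable for readiness checks and avoids a probe per request.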
Migration Guide
From HTTP to FFI
Before (HTTP):
response = Faraday.get("http://localhost:8081/health")
health = JSON.parse(response.body)
After (FFI):
health = TaskerCore::Observability.health_basic
Disabling HTTP Server
1. Update configuration:

   [worker.web]
   enabled = false

2. Update health check scripts to use FFI:

   # health_check.rb
   require 'tasker_core'
   exit(TaskerCore::Observability.ready? ? 0 : 1)

3. Update monitoring to scrape via FFI:

   metrics = TaskerCore::Observability.prometheus_metrics
   # Send to Prometheus pushgateway or custom aggregator
API Reference
Health Methods
| Method | Returns | Description |
|---|---|---|
health_basic | Types::BasicHealth | Basic health status |
health_live | Types::BasicHealth | Liveness probe (status: “alive”) |
health_ready | Types::DetailedHealth | Readiness probe with all checks |
health_detailed | Types::DetailedHealth | Full health information |
ready? | Boolean | True if status == “healthy” |
alive? | Boolean | True if status == “alive” |
Metrics Methods
| Method | Returns | Description |
|---|---|---|
metrics_worker | String (JSON) | Worker metrics as JSON |
event_stats | Types::DomainEventStats | Domain event statistics |
prometheus_metrics | String | Prometheus text format |
Template Methods
| Method | Returns | Description |
|---|---|---|
templates_list(include_cache_stats: false) | String (JSON) | List all templates |
template_get(namespace:, name:, version:) | String (JSON) | Get specific template (raises on error) |
template_validate(namespace:, name:, version:) | Types::TemplateValidation | Validate template (raises on error) |
cache_stats | Types::CacheStats | Cache statistics |
cache_clear | Types::CacheOperationResult | Clear template cache |
template_refresh(namespace:, name:, version:) | Types::CacheOperationResult | Refresh specific template |
Config Methods
| Method | Returns | Description |
|---|---|---|
config | Types::RuntimeConfig | Full config (secrets redacted) |
environment | String | Current environment name |
Related Documentation
- Configuration Management - Full configuration reference
- Deployment Patterns - General deployment options
- Observability - Metrics and monitoring
- FFI Telemetry Pattern - FFI logging integration
sccache Configuration Documentation
Overview
This document records our sccache configuration for future reference. Sccache is currently disabled due to GitHub Actions cache service issues, but we plan to re-enable it once the service is stable.
Current Status
🚫 DISABLED - Temporarily disabled due to GitHub Actions cache service issues:
sccache: error: Server startup failed: cache storage failed to read: Unexpected (permanent) at read => <h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p>
Planned Configuration
Environment Variables (setup-env action)
RUSTC_WRAPPER=sccache
SCCACHE_GHA_ENABLED=true
SCCACHE_CACHE_SIZE=2G # For Docker builds
GitHub Actions Integration
Workflows Using sccache
- code-quality.yml - Build caching for clippy and rustfmt
- test-unit.yml - Build caching for unit tests
- test-integration.yml - Build caching for integration tests
Action Configuration
- uses: mozilla-actions/sccache-action@v0.0.4
Expected Benefits
- 50%+ faster builds through compilation caching
- Reduced CI costs by avoiding redundant compilation
- Better developer experience with faster feedback loops
Performance Targets
- Build cache hit rate: Target > 80%
- Compilation time reduction: 50%+ on cache hits
- Total CI time: Reduce by 10-20 minutes per run
Local Development Setup
For local development when sccache is working:
# Install sccache
cargo binstall sccache -y
# Set environment variables
export RUSTC_WRAPPER=sccache
export SCCACHE_GHA_ENABLED=true
# Check stats
sccache --show-stats
# Clear cache if needed
sccache --zero-stats
Re-enabling Steps
When GitHub Actions cache service is stable:
1. Re-enable in workflows:
   - Uncomment mozilla-actions/sccache-action@v0.0.4 in workflows
   - Restore sccache environment variables in the setup-env action
2. Test with a minimal workflow first:
   - Start with code-quality.yml
   - Monitor for cache service issues
   - Gradually enable in other workflows
3. Monitor performance:
   - Track build times before/after
   - Monitor cache hit rates
   - Watch for any new cache service errors
Configuration Locations
Files containing sccache configuration
- .github/actions/setup-env/action.yml - Environment variables
- .github/workflows/code-quality.yml - Action usage
- .github/workflows/test-unit.yml - Action usage
- .github/workflows/test-integration.yml - Action usage
- docs/sccache-configuration.md - This documentation
Docker Integration
For Docker builds, pass sccache variables as build args:
build-args: |
SCCACHE_GHA_ENABLED=true
RUSTC_WRAPPER=sccache
SCCACHE_CACHE_SIZE=2G
Troubleshooting
Common Issues
- Cache service unavailable: Wait for GitHub to restore service
- Cache misses: Check RUSTC_WRAPPER is set correctly
- Permission errors: Ensure sccache action has proper permissions
Monitoring
- Check sccache --show-stats for cache effectiveness
- Monitor CI run times for performance improvements
- Watch the GitHub status page for cache service updates
StepContext API Reference
StepContext is the primary data access object for step handlers across all languages in the Tasker worker ecosystem. It provides a consistent interface for accessing task inputs, dependency results, configuration, and checkpoint data.
Overview
Every step handler receives a StepContext (or TaskSequenceStep in Rust) that contains:
- Task context - Input data for the workflow (JSONB from task.context)
- Dependency results - Results from upstream DAG steps
- Step configuration - Handler-specific settings from the template
- Checkpoint data - Batch processing state for resumability
- Retry information - Current attempt count and max retries
Cross-Language API Reference
Core Data Access
| Operation | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Get task input | get_input::<T>("key")? | get_input("key") | get_input("key") | getInput("key") |
| Get input with default | get_input_or("key", default) | get_input_or("key", default) | get_input_or("key", default) | getInputOr("key", default) |
| Get config value | get_config::<T>("key")? | get_config("key") | get_config("key") | getConfig("key") |
| Get dependency result | get_dependency_result_column_value::<T>("step")? | get_dependency_result("step") | get_dependency_result("step") | getDependencyResult("step") |
| Get nested dependency field | get_dependency_field::<T>("step", &["path"])? | get_dependency_field("step", *path) | get_dependency_field("step", *path) | getDependencyField("step", ...path) |
Retry Helpers
| Operation | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Check if retry | is_retry() | is_retry? | is_retry() | isRetry() |
| Check if last retry | is_last_retry() | is_last_retry? | is_last_retry() | isLastRetry() |
| Get retry count | retry_count() | retry_count | retry_count | retryCount |
| Get max retries | max_retries() | max_retries | max_retries | maxRetries |
Checkpoint Access
| Operation | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Get raw checkpoint | checkpoint() | checkpoint | checkpoint | checkpoint |
| Get cursor | checkpoint_cursor::<T>() | checkpoint_cursor | checkpoint_cursor | checkpointCursor |
| Get items processed | checkpoint_items_processed() | checkpoint_items_processed | checkpoint_items_processed | checkpointItemsProcessed |
| Get accumulated results | accumulated_results::<T>() | accumulated_results | accumulated_results | accumulatedResults |
| Check has checkpoint | has_checkpoint() | has_checkpoint? | has_checkpoint() | hasCheckpoint() |
Standard Fields
| Field | Rust | Ruby | Python | TypeScript |
|---|---|---|---|---|
| Task UUID | task.task.task_uuid | task_uuid | task_uuid | taskUuid |
| Step UUID | workflow_step.workflow_step_uuid | step_uuid | step_uuid | stepUuid |
| Correlation ID | task.task.correlation_id | task.correlation_id | correlation_id | correlationId |
| Input data (raw) | task.task.context | input_data / context | input_data | inputData |
| Step config (raw) | step_definition.handler.initialization | step_config | step_config | stepConfig |
Usage Examples
Rust
use tasker_shared::types::base::TaskSequenceStep;

async fn call(&self, step_data: &TaskSequenceStep) -> Result<StepExecutionResult> {
    // Get task input
    let order_id: String = step_data.get_input("order_id")?;
    let batch_size: i32 = step_data.get_input_or("batch_size", 100);

    // Get config
    let api_url: String = step_data.get_config("api_url")?;

    // Get dependency result
    let validation_result: ValidationResult = step_data.get_dependency_result_column_value("validate")?;

    // Extract nested field from dependency
    let item_count: i32 = step_data.get_dependency_field("process", &["stats", "count"])?;

    // Check retry status
    if step_data.is_retry() {
        println!("Retry attempt {}", step_data.retry_count());
    }

    // Resume from checkpoint
    let cursor: Option<i64> = step_data.checkpoint_cursor();
    let start_from = cursor.unwrap_or(0);

    // ... handler logic ...
}
Ruby
def call(context)
# Get task input
order_id = context.get_input('order_id')
batch_size = context.get_input_or('batch_size', 100)
# Get config
api_url = context.get_config('api_url')
# Get dependency result
validation_result = context.get_dependency_result('validate')
# Extract nested field from dependency
item_count = context.get_dependency_field('process', 'stats', 'count')
# Check retry status
if context.is_retry?
logger.info("Retry attempt #{context.retry_count}")
end
# Resume from checkpoint
start_from = context.checkpoint_cursor || 0
# ... handler logic ...
end
Python
def call(self, context: StepContext) -> StepHandlerResult:
# Get task input
order_id = context.get_input("order_id")
batch_size = context.get_input_or("batch_size", 100)
# Get config
api_url = context.get_config("api_url")
# Get dependency result
validation_result = context.get_dependency_result("validate")
# Extract nested field from dependency
item_count = context.get_dependency_field("process", "stats", "count")
# Check retry status
if context.is_retry():
print(f"Retry attempt {context.retry_count}")
# Resume from checkpoint
start_from = context.checkpoint_cursor or 0
# ... handler logic ...
TypeScript
async call(context: StepContext): Promise<StepHandlerResult> {
// Get task input
const orderId = context.getInput<string>('order_id');
const batchSize = context.getInputOr('batch_size', 100);
// Get config
const apiUrl = context.getConfig<string>('api_url');
// Get dependency result
const validationResult = context.getDependencyResult('validate');
// Extract nested field from dependency
const itemCount = context.getDependencyField('process', 'stats', 'count');
// Check retry status
if (context.isRetry()) {
console.log(`Retry attempt ${context.retryCount}`);
}
// Resume from checkpoint
const startFrom = context.checkpointCursor ?? 0;
// ... handler logic ...
}
Checkpoint Usage Guide
Checkpoints enable resumable batch processing. When a handler processes large datasets, it can save progress via checkpoints and resume from where it left off on retry.
Checkpoint Fields
- cursor - Position marker (can be int, string, or object)
- items_processed - Count of items completed
- accumulated_results - Running totals or aggregated state
Reading Checkpoints
# Python example
def call(self, context: StepContext) -> StepHandlerResult:
# Check if resuming from checkpoint
if context.has_checkpoint():
cursor = context.checkpoint_cursor
items_done = context.checkpoint_items_processed
totals = context.accumulated_results or {}
print(f"Resuming from cursor {cursor}, {items_done} items done")
else:
cursor = 0
items_done = 0
totals = {}
# Process from cursor position...
Writing Checkpoints
Checkpoints are written by including checkpoint data in the handler result metadata. See the batch processing documentation for details on the checkpoint yield pattern.
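The shape of such a result can be sketched in Python. The field names mirror the checkpoint fields listed above, but the envelope layout and helper name here are illustrative, not the exact wire format — see the batch processing documentation for the real pattern:

```python
def make_checkpoint_result(cursor, items_processed, accumulated, done):
    """Build a handler result that carries checkpoint state in metadata.

    The metadata.checkpoint keys match the checkpoint fields documented
    above (cursor, items_processed, accumulated_results); the surrounding
    envelope is a hypothetical sketch.
    """
    return {
        "status": "success" if done else "in_progress",
        "result": accumulated if done else None,
        "metadata": {
            "checkpoint": {
                "cursor": cursor,
                "items_processed": items_processed,
                "accumulated_results": accumulated,
            }
        },
    }

# Partial progress: 500 items done, more batches remaining
partial = make_checkpoint_result(cursor=500, items_processed=500,
                                 accumulated={"total": 12345}, done=False)
assert partial["status"] == "in_progress"
assert partial["metadata"]["checkpoint"]["cursor"] == 500
```

On retry, the engine hands the persisted checkpoint back to the handler via the accessors shown in the reading example.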
Notes
- All accessor methods handle missing data gracefully (return None/null or use defaults)
- Dependency results are automatically unwrapped from the {"result": value} envelope
- Type conversion is handled automatically where supported (Rust, TypeScript generics)
- Checkpoint data is persisted atomically by the CheckpointService
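The envelope unwrapping mentioned in the notes can be illustrated with a few lines of Python. This mirrors what the dependency accessors do internally, as a hedged sketch rather than the actual implementation:

```python
def unwrap_dependency_result(raw):
    """Unwrap the {"result": value} envelope stored for each dependency result.

    Illustrative sketch: a dict whose only key is "result" is treated as the
    envelope; anything else is returned unchanged.
    """
    if isinstance(raw, dict) and set(raw.keys()) == {"result"}:
        return raw["result"]
    return raw  # already unwrapped, or a bare value

assert unwrap_dependency_result({"result": {"valid": True}}) == {"valid": True}
assert unwrap_dependency_result(42) == 42
```

This is why `get_dependency_result("validate")` hands handlers the value directly, without the envelope.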
Table Management and Growth Strategies
Last Updated: 2026-01-10 Status: Active Recommendation
Problem Statement
In high-throughput workflow orchestration systems, the core task tables (tasks, workflow_steps, task_transitions, workflow_step_transitions) can grow to millions of rows over time. Without proper management, this growth can lead to:
> Note: All tables reside in the tasker schema with simplified names (e.g., tasks instead of tasker_tasks). With search_path = tasker, public, queries use unqualified table names.
- Query Performance Degradation: Even with proper indexes, very large tables require more I/O operations
- Maintenance Overhead: VACUUM, ANALYZE, and index maintenance become increasingly expensive
- Backup/Recovery Challenges: Larger tables increase backup windows and recovery times
- Storage Costs: Historical data that’s rarely accessed still consumes storage resources
Existing Performance Mitigations
The tasker-core system employs several strategies to maintain query performance even with large tables:
1. Strategic Indexing
Covering Indexes for Hot Paths
The most critical indexes use PostgreSQL’s INCLUDE clause to create covering indexes that satisfy queries without table lookups:
Active Task Processing (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Covering index for active task queries with priority sorting
CREATE INDEX IF NOT EXISTS idx_tasks_active_with_priority_covering
ON tasks (complete, priority, task_uuid)
INCLUDE (named_task_uuid, requested_at)
WHERE complete = false;
Impact: Task discovery queries can be satisfied entirely from the index without accessing the main table.
Step Readiness Processing (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Covering index for step readiness queries
CREATE INDEX IF NOT EXISTS idx_workflow_steps_ready_covering
ON workflow_steps (task_uuid, processed, in_process)
INCLUDE (workflow_step_uuid, attempts, max_attempts, retryable)
WHERE processed = false;
-- Covering index for task-based step grouping
CREATE INDEX IF NOT EXISTS idx_workflow_steps_task_covering
ON workflow_steps (task_uuid)
INCLUDE (workflow_step_uuid, processed, in_process, attempts, max_attempts);
Impact: Step dependency resolution and retry logic queries avoid heap lookups.
Transitive Dependency Optimization (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Covering index for transitive dependency traversal
CREATE INDEX IF NOT EXISTS idx_workflow_steps_transitive_deps
ON workflow_steps (workflow_step_uuid, named_step_uuid)
INCLUDE (task_uuid, results, processed);
Impact: DAG traversal operations can read all needed columns from the index.
State Transition Lookups (Partial Indexes)
Current State Resolution (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Fast current state resolution (only indexes most_recent = true)
CREATE INDEX IF NOT EXISTS idx_task_transitions_state_lookup
ON task_transitions (task_uuid, to_state, most_recent)
WHERE most_recent = true;
CREATE INDEX IF NOT EXISTS idx_workflow_step_transitions_state_lookup
ON workflow_step_transitions (workflow_step_uuid, to_state, most_recent)
WHERE most_recent = true;
Impact: State lookups index only current state, not full audit history. Reduces index size by >90%.
Correlation and Tracing Indexes
Distributed Tracing Support (migrations/tasker/20251007000000_add_correlation_ids.sql):
-- Primary correlation ID lookups
CREATE INDEX IF NOT EXISTS idx_tasks_correlation_id
ON tasks(correlation_id);
-- Hierarchical workflow traversal (parent-child relationships)
CREATE INDEX IF NOT EXISTS idx_tasks_correlation_hierarchy
ON tasks(parent_correlation_id, correlation_id)
WHERE parent_correlation_id IS NOT NULL;
Impact: Enables efficient distributed tracing and workflow hierarchy queries.
Processor Ownership and Monitoring
Processor Tracking (migrations/tasker/20250912000000_tas41_richer_task_states.sql):
-- Index for processor ownership queries (audit trail only, enforcement removed)
CREATE INDEX IF NOT EXISTS idx_task_transitions_processor
ON task_transitions(processor_uuid)
WHERE processor_uuid IS NOT NULL;
-- Index for timeout monitoring using JSONB metadata
CREATE INDEX IF NOT EXISTS idx_task_transitions_timeout
ON task_transitions((transition_metadata->>'timeout_at'))
WHERE most_recent = true;
Impact: Enables processor-level debugging and timeout monitoring. Processor ownership enforcement was removed but the audit trail is preserved.
Dependency Graph Navigation
Step Edges for DAG Operations (migrations/tasker/20250810140000_uuid_v7_initial_schema.sql):
-- Parent-to-child navigation for dependency resolution
CREATE INDEX IF NOT EXISTS idx_workflow_step_edges_from_step
ON workflow_step_edges (from_step_uuid);
-- Child-to-parent navigation for completion propagation
CREATE INDEX IF NOT EXISTS idx_workflow_step_edges_to_step
ON workflow_step_edges (to_step_uuid);
Impact: Bidirectional DAG traversal for readiness checks and completion propagation.
2. Partial Indexes
Many indexes use WHERE clauses to index only active/relevant rows:
-- Only index tasks that are actively being processed
WHERE current_state IN ('pending', 'initializing', 'steps_in_process')
-- Only index the current state transition
WHERE most_recent = true
This significantly reduces index size and maintenance overhead while keeping lookups fast.
3. SQL Function Optimizations
Complex orchestration queries are implemented as PostgreSQL functions that leverage:
- Lateral Joins: For efficient correlated subqueries
- CTEs with Materialization: For complex dependency analysis
- Targeted Filtering: Early elimination of irrelevant rows using index scans
Example from get_next_ready_tasks():
-- First filter to active tasks with priority sorting (uses index)
WITH prioritized_tasks AS (
SELECT task_uuid, priority
FROM tasks
WHERE current_state IN ('pending', 'steps_in_process')
ORDER BY priority DESC, created_at ASC
LIMIT $1 * 2 -- Get more candidates than needed for filtering
)
-- Then apply complex staleness/readiness checks only on candidates
...
4. Staleness Exclusion
The system automatically excludes stale tasks from active processing queues:
- Tasks stuck in waiting_for_dependencies > 60 minutes
- Tasks stuck in waiting_for_retry > 30 minutes
- Tasks with lifecycle timeouts exceeded
This prevents the active query set from growing indefinitely, even if old tasks aren’t archived.
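The staleness rule above amounts to a per-state age threshold. A minimal sketch of that predicate, using the thresholds quoted in the text (these are configurable in the real system, and the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the text; configurable in the real system
STALE_AFTER = {
    "waiting_for_dependencies": timedelta(minutes=60),
    "waiting_for_retry": timedelta(minutes=30),
}

def is_stale(state, entered_state_at, now=None):
    """True if a task has sat in a waiting state longer than its threshold."""
    now = now or datetime.now(timezone.utc)
    limit = STALE_AFTER.get(state)
    return limit is not None and (now - entered_state_at) > limit

now = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
assert is_stale("waiting_for_retry", now - timedelta(minutes=45), now)
assert not is_stale("waiting_for_dependencies", now - timedelta(minutes=10), now)
assert not is_stale("steps_in_process", now - timedelta(hours=5), now)  # no threshold
```

In production this filtering happens inside the SQL functions, so stale tasks simply never enter the candidate set.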
Archive-and-Delete Strategy (Considered, Not Implemented)
What We Considered
We initially designed an archive-and-delete strategy:
Architecture:
- Mirror tables: tasker.archived_tasks, tasker.archived_workflow_steps, tasker.archived_task_transitions, tasker.archived_workflow_step_transitions
- Background service running every 24 hours
- Batch processing: 1000 tasks per run
- Transactional archival: INSERT into archive tables → DELETE from main tables
- Retention policies: Configurable per task state (completed, error, cancelled)
Implementation Details:
// Archive tasks in terminal states older than retention period
pub async fn archive_completed_tasks(
    pool: &PgPool,
    retention_days: i32,
    batch_size: i32,
) -> Result<ArchiveStats> {
    // 1. INSERT INTO archived_tasks SELECT * FROM tasks WHERE ...
    // 2. INSERT INTO archived_workflow_steps SELECT * WHERE task_uuid IN (...)
    // 3. INSERT INTO archived_task_transitions SELECT * WHERE task_uuid IN (...)
    // 4. DELETE FROM workflow_step_transitions WHERE ...
    // 5. DELETE FROM task_transitions WHERE ...
    // 6. DELETE FROM workflow_steps WHERE ...
    // 7. DELETE FROM tasks WHERE ...
}
Why We Decided Against It
After implementation and analysis, we identified critical performance issues:
1. Write Amplification
Every archived task results in:
- 2× writes per row: INSERT into archive table + original row still exists until DELETE
- 1× delete per row: DELETE from main table triggers index updates
- Cascade costs: Foreign key relationships require multiple DELETE operations in sequence
For a system processing 100,000 tasks/day with 30-day retention:
- Daily archival: ~100,000 tasks × 2 write operations = 200,000 write I/Os
- Plus associated workflow_steps (typically 5-10 per task): 500,000-1,000,000 additional writes
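The figures above can be checked with a quick back-of-envelope calculation (numbers taken from the example; the 5-10 steps-per-task range is the stated assumption):

```python
tasks_per_day = 100_000
write_ops_per_task = 2  # INSERT into archive table + DELETE from main table
daily_task_write_ios = tasks_per_day * write_ops_per_task
assert daily_task_write_ios == 200_000

# Plus associated workflow_steps, typically 5-10 per task
steps_low = tasks_per_day * 5
steps_high = tasks_per_day * 10
assert (steps_low, steps_high) == (500_000, 1_000_000)
```

And none of this counts the index-maintenance I/O discussed next, which scales with the same row counts.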
2. Index Maintenance Overhead
PostgreSQL must maintain indexes during both INSERT and DELETE operations:
During INSERT to archive tables:
- Build index entries for all archive table indexes
- Update statistics for query planner
During DELETE from main tables:
- Mark deleted tuples in main table indexes
- Update free space maps
- Trigger VACUUM requirements
Result: Periodic severe degradation (2-5 seconds) during archival runs, even with batch processing.
3. Lock Contention
Large DELETE operations require:
- Row-level locks on deleted rows
- Table-level locks during index updates
- Lock escalation risk with large batch sizes
This creates a “stop-the-world” effect where active task processing is blocked during archival.
4. VACUUM Pressure
Frequent large DELETEs create dead tuples that require aggressive VACUUMing:
- Increases I/O load during off-hours
- Can’t be fully eliminated even with proper tuning
- Competes with active workload for resources
5. The “Garbage Collector” Anti-Pattern
The archive-and-delete strategy essentially implements a manual garbage collector:
- Periodic runs with performance impact
- Tuning trade-offs (frequency vs. batch size vs. impact)
- Operational complexity (monitoring, alerting, recovery)
Recommended Solution: PostgreSQL Native Partitioning
Overview
PostgreSQL’s native table partitioning with pg_partman provides zero-runtime-cost table management:
Key Advantages:
- No write amplification: Data stays in place, partitions are logical divisions
- No DELETE operations: Old partitions are DETACHed and dropped as units
- Instant partition drops: Dropping a partition is O(1), not O(rows)
- Transparent to application: Queries work identically on partitioned tables
- Battle-tested: Used by pgmq (our queue infrastructure) and thousands of production systems
How It Works
-- 1. Create partitioned parent table (in tasker schema)
CREATE TABLE tasker.tasks (
task_uuid UUID NOT NULL,
created_at TIMESTAMP NOT NULL,
-- ... other columns
) PARTITION BY RANGE (created_at);
-- 2. pg_partman automatically creates child partitions
-- tasker.tasks_p2025_01 (Jan 2025)
-- tasker.tasks_p2025_02 (Feb 2025)
-- tasker.tasks_p2025_03 (Mar 2025)
-- ... etc
-- 3. Queries transparently use appropriate partitions
SELECT * FROM tasks WHERE task_uuid = $1;
-- → PostgreSQL automatically queries correct partition
-- 4. Dropping old partitions is instant
ALTER TABLE tasker.tasks DETACH PARTITION tasker.tasks_p2024_12;
DROP TABLE tasker.tasks_p2024_12; -- Instant, no row-by-row deletion
Performance Characteristics
| Operation | Archive-and-Delete | Native Partitioning |
|---|---|---|
| Write path | INSERT + DELETE (2× I/O) | INSERT only (1× I/O) |
| Index maintenance | On INSERT + DELETE | On INSERT only |
| Lock contention | Row locks during DELETE | No locks for drops |
| VACUUM pressure | High (dead tuples) | None (partition drops) |
| Old data removal | O(rows) per deletion | O(1) partition detach |
| Query performance | Scans entire table | Partition pruning |
| Runtime impact | Periodic degradation | Zero |
Implementation with pg_partman
Installation
CREATE EXTENSION pg_partman;
Setup for tasks
-- 1. Create partitioned table structure
-- (Include all existing columns and indexes)
-- 2. Initialize pg_partman for monthly partitions
SELECT partman.create_parent(
p_parent_table := 'tasker.tasks',
p_control := 'created_at',
p_type := 'native',
p_interval := 'monthly',
p_premake := 3 -- Pre-create 3 future months
);
-- 3. Configure retention (keep 90 days)
UPDATE partman.part_config
SET retention = '90 days',
retention_keep_table = false -- Drop old partitions entirely
WHERE parent_table = 'tasker.tasks';
-- 4. Enable automatic maintenance
SELECT partman.run_maintenance(p_parent_table := 'tasker.tasks');
Automation
Add to cron or pg_cron:
-- Run maintenance every hour
SELECT cron.schedule('partman-maintenance', '0 * * * *',
$$SELECT partman.run_maintenance()$$
);
This automatically:
- Creates new partitions before they’re needed
- Detaches and drops partitions older than retention period
- Updates partition constraints for query optimization
Real-World Example: pgmq
The pgmq message queue system (which tasker-core uses for orchestration) implements partitioned queues for high-throughput scenarios:
Reference: pgmq Partitioned Queues
pgmq’s Rationale (from their docs):
“For very high-throughput queues, you may want to partition the queue table by time. This allows you to drop old partitions instead of deleting rows, which is much faster and doesn’t cause table bloat.”
pgmq’s Approach:
-- pgmq uses pg_partman for message queues
SELECT pgmq.create_partitioned(
queue_name := 'high_throughput_queue',
partition_interval := '1 day',
retention_interval := '7 days'
);
Benefits They Report:
- 10× faster old message cleanup vs. DELETE
- Zero bloat from message deletion
- Consistent performance even at millions of messages per day
Applying to Tasker: Our use case is nearly identical to pgmq:
- High-throughput append-heavy workload
- Time-series data (created_at is natural partition key)
- Need to retain recent data, drop old data
- Performance-critical read path
If pgmq chose partitioning over archive-and-delete for these reasons, we should too.
Migration Path
Phase 1: Analysis (Current State)
Before implementing partitioning:
- Analyze Current Growth Rate:
SELECT
pg_size_pretty(pg_total_relation_size('tasker.tasks')) as total_size,
count(*) as row_count,
min(created_at) as oldest_task,
max(created_at) as newest_task,
count(*) / EXTRACT(day FROM (max(created_at) - min(created_at))) as avg_tasks_per_day
FROM tasks;
- Determine Partition Strategy:
- Daily partitions: For > 1M tasks/day
- Weekly partitions: For 100K-1M tasks/day
- Monthly partitions: For < 100K tasks/day
- Plan Retention Period:
- Legal/compliance requirements
- Analytics/reporting needs
- Typical task investigation window
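The interval thresholds above can be captured in a small helper for sanity-checking the analysis query's output. This is an illustrative sketch; the function name and the exact cutoffs simply mirror the guidance in this list.

```rust
/// Map observed task volume to a pg_partman partition interval.
/// Thresholds mirror the guidance above; the function name is illustrative.
fn choose_partition_interval(avg_tasks_per_day: u64) -> &'static str {
    if avg_tasks_per_day > 1_000_000 {
        "daily" // > 1M tasks/day
    } else if avg_tasks_per_day >= 100_000 {
        "weekly" // 100K-1M tasks/day
    } else {
        "monthly" // < 100K tasks/day
    }
}

fn main() {
    // 250K tasks/day falls in the 100K-1M band
    assert_eq!(choose_partition_interval(250_000), "weekly");
    assert_eq!(choose_partition_interval(2_000_000), "daily");
    assert_eq!(choose_partition_interval(5_000), "monthly");
    println!("partition strategy checks passed");
}
```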
Phase 2: Implementation
- Create Partitioned Tables (requires downtime or blue-green deployment)
- Migrate Existing Data using `partman.partition_data_proc()`
- Update Application (no code changes needed if using same table names)
- Configure Automation (pg_cron for maintenance)
Phase 3: Monitoring
Track partition management effectiveness:
-- Check partition sizes
SELECT
schemaname || '.' || tablename as partition_name,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname = 'tasker' AND tablename LIKE 'tasks_p%'
ORDER BY tablename;
-- Verify partition pruning is working
EXPLAIN SELECT * FROM tasks
WHERE created_at > NOW() - INTERVAL '7 days';
-- Should show scans only on the partition(s) covering the last 7 days
-- (e.g., tasker.tasks_p2025_11), not the full table
Decision Summary
Decision: Use PostgreSQL native partitioning with pg_partman for table growth management.
Rationale:
- Zero runtime performance impact vs. periodic degradation with archive-and-delete
- Operationally simpler (set-and-forget vs. monitoring archive jobs)
- Battle-tested solution used by pgmq and thousands of production systems
- Aligns with PostgreSQL best practices and community recommendations
Not Recommended: Archive-and-delete strategy due to write amplification, lock contention, and periodic performance degradation.
See Also
- States and Lifecycles - Task and step state management
- Task and Step Readiness - SQL function optimizations
- Observability README - Monitoring table growth and query performance
Task and Step Readiness and Execution
Last Updated: 2026-01-10 Audience: Developers, Architects Status: Active Related Docs: Documentation Hub | States and Lifecycles | Events and Commands
This document provides comprehensive documentation of the SQL functions and database logic that drives task and step readiness analysis, dependency resolution, and execution coordination in the tasker-core system.
Overview
The tasker-core system relies heavily on sophisticated PostgreSQL functions to perform complex workflow orchestration operations at the database level. This approach provides significant performance benefits through set-based operations, atomic transactions, and reduced network round trips while maintaining data consistency.
The SQL function system supports several critical categories of operations:
- Step Readiness Analysis: Complex dependency resolution and backoff calculations
- DAG Operations: Cycle detection, depth calculation, and parallel execution discovery
- State Management: Atomic state transitions with processor ownership tracking
- Analytics and Monitoring: Performance metrics and system health analysis
- Task Execution Context: Comprehensive execution metadata and results management
SQL Function Architecture
Function Categories
The SQL functions are organized into logical categories as defined in
tasker-shared/src/database/sql_functions.rs:
1. Step Readiness Analysis
- `get_step_readiness_status(task_uuid, step_uuids[])`: Comprehensive dependency analysis
- `calculate_backoff_delay(attempts, base_delay)`: Exponential backoff calculation
- `check_step_dependencies(step_uuid)`: Parent completion validation
- `get_ready_steps(task_uuid)`: Parallel execution candidate discovery
2. DAG Operations
- `detect_cycle(from_step_uuid, to_step_uuid)`: Cycle detection using recursive CTEs
- `calculate_dependency_levels(task_uuid)`: Topological depth calculation
- `calculate_step_depth(step_uuid)`: Individual step depth analysis
- `get_step_transitive_dependencies(step_uuid)`: Full dependency tree traversal
3. State Management
- `transition_task_state_atomic(task_uuid, from_state, to_state, processor_uuid)`: Atomic state transitions with ownership
- `get_current_task_state(task_uuid)`: Current task state resolution
- `finalize_task_completion(task_uuid)`: Task completion orchestration
4. Analytics and Monitoring
- `get_analytics_metrics(since_timestamp)`: Comprehensive system analytics
- `get_system_health_counts()`: System-wide health and performance metrics
- `get_slowest_steps(limit)`: Performance optimization analysis
- `get_slowest_tasks(limit)`: Task performance analysis
5. Task Discovery and Execution
- `get_next_ready_task()`: Single task discovery for orchestration
- `get_next_ready_tasks(limit)`: Batch task discovery for scaling
- `get_task_ready_info(task_uuid)`: Detailed task readiness information
- `get_task_execution_context(task_uuid)`: Complete execution metadata
Database Schema Foundation
Core Tables
The SQL functions operate on a comprehensive schema designed for UUID v7
performance and scalability. All tables reside in the tasker schema
with simplified names. With search_path = tasker, public, queries use unqualified
table names.
Primary Tables
- `tasks`: Main workflow instances with UUID v7 primary keys
- `workflow_steps`: Individual workflow steps with dependency relationships
- `task_transitions`: Task state change audit trail with processor tracking
- `workflow_step_transitions`: Step state change audit trail
Registry Tables
- `task_namespaces`: Workflow namespace definitions
- `named_tasks`: Task type templates and metadata
- `named_steps`: Step type definitions and handlers
- `workflow_step_edges`: Step dependency relationships (DAG structure)
Richer Task State Enhancements
The richer task states migration (migrations/tasker/20251209000000_tas41_richer_task_states.sql) enhanced the
schema with:
Task State Management:
-- 12 comprehensive task states
ALTER TABLE task_transitions
ADD CONSTRAINT chk_task_transitions_to_state
CHECK (to_state IN (
'pending', 'initializing', 'enqueuing_steps', 'steps_in_process',
'evaluating_results', 'waiting_for_dependencies', 'waiting_for_retry',
'blocked_by_failures', 'complete', 'error', 'cancelled', 'resolved_manually'
));
Processor Ownership Tracking:
ALTER TABLE task_transitions
ADD COLUMN processor_uuid UUID,
ADD COLUMN transition_metadata JSONB DEFAULT '{}';
Atomic State Transitions:
CREATE OR REPLACE FUNCTION transition_task_state_atomic(
p_task_uuid UUID,
p_from_state VARCHAR,
p_to_state VARCHAR,
p_processor_uuid UUID,
p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN
Step Readiness Analysis
Recent Enhancements
WaitingForRetry State Support (Migration 20250927000000)
The step readiness system was enhanced to support the new WaitingForRetry state, which distinguishes retryable failures from permanent errors:
Key Changes:
- Helper Functions: Added `calculate_step_next_retry_time()` and `evaluate_step_state_readiness()` for consistent backoff logic
- State Recognition: Updated readiness evaluation to treat `waiting_for_retry` as a ready-eligible state alongside `pending`
- Backoff Calculation: Centralized exponential backoff logic with configurable backoff periods
- Performance Optimization: Introduced task-scoped CTEs to eliminate table scans for batch operations
Semantic Impact:
- Before: the `error` state included both retryable and permanent failures
- After: `error` = permanent failures only; `waiting_for_retry` = awaiting backoff before retry
Backoff Logic Consolidation (October 2025)
The backoff calculation system was consolidated to eliminate configuration conflicts and race conditions:
Key Changes:
- Configuration Alignment: Single source of truth (TOML config) with `max_backoff_seconds = 60`
- Parameterized SQL Functions: `calculate_step_next_retry_time()` accepts a configurable max delay and multiplier
- Atomic Updates: Row-level locking prevents concurrent backoff update conflicts
- Timing Consistency: `last_attempted_at` is updated atomically with `backoff_request_seconds`
Issues Resolved:
- Configuration Conflicts: Eliminated three conflicting max values (30s SQL, 60s code, 300s TOML)
- Race Conditions: Added SELECT FOR UPDATE locking in BackoffCalculator
- Hardcoded Values: Removed hardcoded 30-second cap and power(2, attempts) in SQL
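For intuition, the consolidated rule can be mirrored in a few lines of Rust: an explicit per-step backoff wins, otherwise the delay is multiplier^attempts capped at the configured maximum. This is an illustrative sketch (the defaults here just echo the 60s / 2.0 values described above), not the engine's implementation.

```rust
/// Illustrative mirror of the consolidated backoff rule: a step-configured
/// backoff takes precedence; otherwise use multiplier^attempts, capped at the
/// configured maximum (60s max and 2.0 multiplier match the defaults above).
fn backoff_seconds(
    backoff_request_seconds: Option<u32>,
    attempts: u32,
    max_backoff_seconds: u32,
    multiplier: f64,
) -> u32 {
    match backoff_request_seconds {
        Some(custom) => custom, // step-configured backoff wins
        None => {
            let exponential = multiplier.powi(attempts as i32);
            (exponential as u32).min(max_backoff_seconds)
        }
    }
}

fn main() {
    assert_eq!(backoff_seconds(None, 3, 60, 2.0), 8); // 2^3 = 8s
    assert_eq!(backoff_seconds(None, 10, 60, 2.0), 60); // 1024s capped at 60s
    assert_eq!(backoff_seconds(Some(15), 10, 60, 2.0), 15); // custom wins
    println!("backoff checks passed");
}
```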
Helper Functions Enhanced:
- `calculate_step_next_retry_time()`: Now parameterized with configuration values

  CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
      backoff_request_seconds INTEGER,
      last_attempted_at TIMESTAMP,
      failure_time TIMESTAMP,
      attempts INTEGER,
      p_max_backoff_seconds INTEGER DEFAULT 60,
      p_backoff_multiplier NUMERIC DEFAULT 2.0
  ) RETURNS TIMESTAMP

  - Respects custom backoff periods from step configuration (primary path)
  - Falls back to exponential backoff with configurable parameters
  - Defaults aligned with TOML config (60s max, 2.0 multiplier)
  - Used consistently across all readiness evaluation
- `set_step_backoff_atomic()`: New atomic update function

  CREATE OR REPLACE FUNCTION set_step_backoff_atomic(
      p_step_uuid UUID,
      p_backoff_seconds INTEGER
  ) RETURNS BOOLEAN

  - Provides a transactional guarantee for concurrent updates
  - Updates both `backoff_request_seconds` and `last_attempted_at`
  - Ensures timing consistency with SQL calculations
- `evaluate_step_state_readiness()`: Determines whether a step is ready for execution

  CREATE OR REPLACE FUNCTION evaluate_step_state_readiness(
      current_state TEXT,
      processed BOOLEAN,
      in_process BOOLEAN,
      dependencies_satisfied BOOLEAN,
      retry_eligible BOOLEAN,
      retryable BOOLEAN,
      next_retry_time TIMESTAMP
  ) RETURNS BOOLEAN

  - Recognizes both `pending` and `waiting_for_retry` as ready-eligible states
  - Validates that the backoff period has expired before allowing retry
  - Ensures dependencies are satisfied and retry limits are not exceeded
Step Readiness Status
The get_step_readiness_status function provides comprehensive analysis of step
execution eligibility:
CREATE OR REPLACE FUNCTION get_step_readiness_status(
task_uuid UUID,
step_uuids UUID[] DEFAULT NULL
) RETURNS TABLE(
workflow_step_uuid UUID,
task_uuid UUID,
named_step_uuid UUID,
name VARCHAR,
current_state VARCHAR,
dependencies_satisfied BOOLEAN,
retry_eligible BOOLEAN,
ready_for_execution BOOLEAN,
last_failure_at TIMESTAMP,
next_retry_at TIMESTAMP,
total_parents INTEGER,
completed_parents INTEGER,
attempts INTEGER,
retry_limit INTEGER,
backoff_request_seconds INTEGER,
last_attempted_at TIMESTAMP
)
Key Analysis Features
Dependency Satisfaction:
- Validates all parent steps are in `complete` or `resolved_manually` states
- Handles complex DAG structures with multiple dependency paths
- Supports conditional dependencies based on parent results
Retry Logic:
- Exponential backoff calculation: `2^attempts` seconds (max 60, configurable)
- Custom backoff periods from step configuration
- Retry limit enforcement to prevent infinite loops
- Failure tracking with temporal analysis
Execution Readiness:
- State validation (must be `pending` or `error`)
- Dependency satisfaction confirmation
- Retry eligibility assessment
- Backoff period expiration checking
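The readiness criteria above combine with a simple AND. The sketch below mirrors that combination in Rust; the struct and field names are assumptions for illustration, not the engine's actual types.

```rust
use std::time::{Duration, SystemTime};

/// Illustrative mirror of the readiness criteria above; the struct and field
/// names are assumptions for this sketch, not the engine's actual types.
struct StepSnapshot {
    current_state: &'static str,
    dependencies_satisfied: bool,
    retry_eligible: bool,
    next_retry_at: Option<SystemTime>, // None = no backoff window pending
}

fn ready_for_execution(step: &StepSnapshot, now: SystemTime) -> bool {
    // State validation: must be pending or a retryable error
    let state_ok = matches!(step.current_state, "pending" | "error");
    // Backoff expiration: a future next_retry_at blocks execution
    let backoff_expired = step.next_retry_at.map_or(true, |t| t <= now);
    state_ok && step.dependencies_satisfied && step.retry_eligible && backoff_expired
}

fn main() {
    let now = SystemTime::now();
    let ready = StepSnapshot {
        current_state: "pending",
        dependencies_satisfied: true,
        retry_eligible: true,
        next_retry_at: None,
    };
    assert!(ready_for_execution(&ready, now));

    let backing_off = StepSnapshot {
        current_state: "error",
        dependencies_satisfied: true,
        retry_eligible: true,
        next_retry_at: Some(now + Duration::from_secs(30)),
    };
    assert!(!ready_for_execution(&backing_off, now));
    println!("readiness checks passed");
}
```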
Step Readiness Implementation
The Rust integration provides type-safe access to step readiness analysis:
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct StepReadinessStatus {
pub workflow_step_uuid: Uuid,
pub task_uuid: Uuid,
pub named_step_uuid: Uuid,
pub name: String,
pub current_state: String,
pub dependencies_satisfied: bool,
pub retry_eligible: bool,
pub ready_for_execution: bool,
pub last_failure_at: Option<NaiveDateTime>,
pub next_retry_at: Option<NaiveDateTime>,
pub total_parents: i32,
pub completed_parents: i32,
pub attempts: i32,
pub retry_limit: i32,
pub backoff_request_seconds: Option<i32>,
pub last_attempted_at: Option<NaiveDateTime>,
}
impl StepReadinessStatus {
pub fn can_execute_now(&self) -> bool {
self.ready_for_execution
}
pub fn blocking_reason(&self) -> Option<&'static str> {
if !self.dependencies_satisfied {
return Some("dependencies_not_satisfied");
}
if !self.retry_eligible {
return Some("retry_not_eligible");
}
Some("invalid_state")
}
/// Check if this step is ready for execution.
pub fn is_ready(&self) -> bool {
self.ready_for_execution
}
/// Check if this step is blocked by dependencies.
pub fn is_blocked(&self) -> bool {
!self.dependencies_satisfied
}
/// Check if this step can be retried.
pub fn can_retry(&self) -> bool {
self.retry_eligible
}
}
// Note: Backoff computation is handled server-side by the SQL function
// `calculate_step_next_retry_time()`, not as a Rust method on this struct.
// The `backoff_request_seconds` field contains the raw value from the SQL function,
// and `next_retry_at` contains the pre-computed next retry timestamp.
}
DAG Operations and Dependency Resolution
Dependency Level Calculation
The calculate_dependency_levels function uses recursive CTEs to perform
topological analysis of the workflow DAG:
CREATE OR REPLACE FUNCTION calculate_dependency_levels(input_task_uuid UUID)
RETURNS TABLE(workflow_step_uuid UUID, dependency_level INTEGER)
LANGUAGE plpgsql STABLE AS $$
BEGIN
RETURN QUERY
WITH RECURSIVE dependency_levels AS (
-- Base case: Find root nodes (steps with no dependencies)
SELECT
ws.workflow_step_uuid,
0 as level
FROM workflow_steps ws
WHERE ws.task_uuid = input_task_uuid
AND NOT EXISTS (
SELECT 1
FROM workflow_step_edges wse
WHERE wse.to_step_uuid = ws.workflow_step_uuid
)
UNION ALL
-- Recursive case: Find children of current level nodes
SELECT
wse.to_step_uuid as workflow_step_uuid,
dl.level + 1 as level
FROM dependency_levels dl
JOIN workflow_step_edges wse ON wse.from_step_uuid = dl.workflow_step_uuid
JOIN workflow_steps ws ON ws.workflow_step_uuid = wse.to_step_uuid
WHERE ws.task_uuid = input_task_uuid
)
SELECT
dl.workflow_step_uuid,
MAX(dl.level) as dependency_level -- Use MAX to handle multiple paths
FROM dependency_levels dl
GROUP BY dl.workflow_step_uuid
ORDER BY dependency_level, workflow_step_uuid;
END;
$$;
Dependency Level Benefits
Parallel Execution Planning:
- Steps at the same dependency level can execute in parallel
- Enables optimal resource utilization across workers
- Supports batch enqueueing for scalability
Execution Ordering:
- Level 0: Root steps (no dependencies) - can start immediately
- Level N: Steps requiring completion of level N-1 steps
- Topological ordering ensures dependency satisfaction
Performance Optimization:
- Single query provides complete dependency analysis
- Avoids N+1 query problems in dependency resolution
- Enables batch processing optimizations
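The MAX-over-paths levelling that the recursive CTE performs can be reproduced in a small in-memory sketch, which makes the diamond-DAG behavior easy to see. This assumes an acyclic graph and uses string names in place of step UUIDs.

```rust
use std::collections::HashMap;

/// Compute dependency levels the way the recursive CTE does: roots are level 0,
/// and each node takes the MAX level over all paths from a root (handles diamonds).
/// Assumes an acyclic graph; string names stand in for step UUIDs.
fn dependency_levels(nodes: &[&str], edges: &[(&str, &str)]) -> HashMap<String, usize> {
    fn level<'a>(
        node: &'a str,
        parents: &HashMap<&'a str, Vec<&'a str>>,
        memo: &mut HashMap<String, usize>,
    ) -> usize {
        if let Some(&l) = memo.get(node) {
            return l;
        }
        // Clone the parent list so no borrow of `parents` is held across recursion
        let parent_list = parents.get(node).cloned().unwrap_or_default();
        let l = parent_list
            .iter()
            .map(|&p| level(p, parents, memo) + 1)
            .max()
            .unwrap_or(0); // no parents → root → level 0
        memo.insert(node.to_string(), l);
        l
    }

    let mut parents: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(from, to) in edges {
        parents.entry(to).or_default().push(from);
    }
    let mut memo = HashMap::new();
    for &node in nodes {
        level(node, &parents, &mut memo);
    }
    memo
}

fn main() {
    // Diamond DAG: a → b, a → c, b → d, c → d
    let levels = dependency_levels(
        &["a", "b", "c", "d"],
        &[("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
    );
    assert_eq!(levels["a"], 0); // root
    assert_eq!(levels["b"], 1);
    assert_eq!(levels["d"], 2); // MAX over both paths
    println!("dependency level checks passed");
}
```

Steps at the same level can be enqueued together, which is exactly how the SQL result supports batch enqueueing.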
Transitive Dependencies
The get_step_transitive_dependencies function provides complete ancestor analysis:
CREATE OR REPLACE FUNCTION get_step_transitive_dependencies(step_uuid UUID)
RETURNS TABLE(
step_name VARCHAR,
step_uuid UUID,
task_uuid UUID,
distance INTEGER,
processed BOOLEAN,
results JSONB
)
This enables step handlers to access results from any ancestor step:
#![allow(unused)]
fn main() {
impl SqlFunctionExecutor {
pub async fn get_step_dependency_results_map(
&self,
step_uuid: Uuid,
) -> Result<HashMap<String, StepExecutionResult>, sqlx::Error> {
let dependencies = self.get_step_transitive_dependencies(step_uuid).await?;
Ok(dependencies
.into_iter()
.filter_map(|dep| {
if dep.processed && dep.results.is_some() {
let results: StepExecutionResult = dep.results.unwrap().into();
Some((dep.step_name, results))
} else {
None
}
})
.collect())
}
}
}
Task Execution Context
Recent Enhancements
Permanently Blocked Detection Fix (Migration 20251001000000)
The get_task_execution_context function was enhanced to correctly identify tasks blocked by permanent errors:
Problem: The function only checked attempts >= retry_limit to detect permanently blocked steps, missing cases where workers marked errors as non-retryable (e.g., missing handlers, configuration errors).
Solution: Updated permanently_blocked_steps calculation to check both conditions:
COUNT(CASE WHEN sd.current_state = 'error'
AND (sd.attempts >= retry_limit OR sd.retry_eligible = false) THEN 1 END)
Impact:
- execution_status: Now correctly returns `blocked_by_failures` instead of `waiting_for_dependencies` for tasks with non-retryable errors
- recommended_action: Returns `handle_failures` instead of `wait_for_dependencies`
- health_status: Returns `blocked` instead of `recovering` when appropriate
This fix ensures the orchestration system properly identifies when manual intervention is needed versus when a task is simply waiting for retry backoff.
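The corrected two-condition check is easy to state as a predicate; the sketch below mirrors it in Rust (argument names are illustrative assumptions).

```rust
/// Mirror of the corrected check: an errored step is permanently blocked when it
/// has exhausted its retries OR a worker marked the error non-retryable.
/// Illustrative sketch; argument names are assumptions.
fn is_permanently_blocked(
    current_state: &str,
    attempts: u32,
    retry_limit: u32,
    retry_eligible: bool,
) -> bool {
    current_state == "error" && (attempts >= retry_limit || !retry_eligible)
}

fn main() {
    // Retries exhausted
    assert!(is_permanently_blocked("error", 3, 3, true));
    // Worker marked the error non-retryable (e.g., missing handler)
    assert!(is_permanently_blocked("error", 1, 3, false));
    // Still retryable: neither condition holds
    assert!(!is_permanently_blocked("error", 1, 3, true));
    // Non-error states are never permanently blocked by this rule
    assert!(!is_permanently_blocked("pending", 5, 3, false));
    println!("blocked-detection checks passed");
}
```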
Task Discovery and Orchestration
Task Readiness Discovery
The system provides multiple functions for task discovery based on orchestration needs:
Single Task Discovery
CREATE OR REPLACE FUNCTION get_next_ready_task()
RETURNS TABLE(
task_uuid UUID,
task_name VARCHAR,
priority INTEGER,
namespace_name VARCHAR,
ready_steps_count BIGINT,
computed_priority NUMERIC,
current_state VARCHAR
)
Batch Task Discovery
CREATE OR REPLACE FUNCTION get_next_ready_tasks(limit_count INTEGER)
RETURNS TABLE(
task_uuid UUID,
task_name VARCHAR,
priority INTEGER,
namespace_name VARCHAR,
ready_steps_count BIGINT,
computed_priority NUMERIC,
current_state VARCHAR
)
Task Ready Information
The ReadyTaskInfo structure provides comprehensive task metadata for
orchestration decisions:
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct ReadyTaskInfo {
pub task_uuid: Uuid,
pub task_name: String,
pub priority: i32,
pub namespace_name: String,
pub ready_steps_count: i64,
pub computed_priority: Option<BigDecimal>,
pub current_state: String,
}
}
Priority Calculation:
- Base priority from task configuration
- Dynamic priority adjustment based on age, retry attempts
- Namespace-based priority modifiers
- SLA-based priority escalation
Ready Steps Count:
- Real-time count of steps eligible for execution
- Used for batch size optimization
- Influences orchestration scheduling decisions
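One way to picture the dynamic priority adjustment is as a weighted sum over age and retries. The weights below (`age_factor`, `retry_factor`) are assumptions for illustration, not Tasker's actual constants.

```rust
/// Illustrative computed-priority sketch following the factors listed above:
/// base priority escalated by waiting time and retry attempts.
/// The age_factor and retry_factor weights are assumptions, not Tasker's constants.
fn computed_priority(
    base_priority: f64,
    hours_waiting: f64,
    retry_count: u32,
    age_factor: f64,
    retry_factor: f64,
) -> f64 {
    base_priority + age_factor * hours_waiting + retry_factor * retry_count as f64
}

fn main() {
    // An old task outranks a fresh one with the same base priority,
    // which is what prevents starvation through age escalation.
    let fresh = computed_priority(10.0, 0.0, 0, 0.5, 1.0);
    let aged = computed_priority(10.0, 6.0, 1, 0.5, 1.0);
    assert_eq!(fresh, 10.0);
    assert_eq!(aged, 14.0); // 10 + 0.5*6 + 1*1
    assert!(aged > fresh);
    println!("priority checks passed");
}
```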
State Management and Atomic Transitions
Atomic State Transitions
The enhanced state machine provides atomic transitions with processor ownership:
CREATE OR REPLACE FUNCTION transition_task_state_atomic(
p_task_uuid UUID,
p_from_state VARCHAR,
p_to_state VARCHAR,
p_processor_uuid UUID,
p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN AS $$
DECLARE
v_sort_key INTEGER;
v_transitioned BOOLEAN := FALSE;
BEGIN
-- Get next sort key
SELECT COALESCE(MAX(sort_key), 0) + 1 INTO v_sort_key
FROM task_transitions
WHERE task_uuid = p_task_uuid;
-- Atomically transition only if in expected state
WITH current_state AS (
SELECT to_state, processor_uuid
FROM task_transitions
WHERE task_uuid = p_task_uuid
AND most_recent = true
FOR UPDATE
),
ownership_check AS (
SELECT
CASE
-- States requiring ownership
WHEN cs.to_state IN ('initializing', 'enqueuing_steps',
'steps_in_process', 'evaluating_results')
THEN cs.processor_uuid = p_processor_uuid OR cs.processor_uuid IS NULL
-- Other states don't require ownership
ELSE true
END as can_transition
FROM current_state cs
WHERE cs.to_state = p_from_state
),
do_update AS (
UPDATE task_transitions
SET most_recent = false
WHERE task_uuid = p_task_uuid
AND most_recent = true
AND EXISTS (SELECT 1 FROM ownership_check WHERE can_transition)
RETURNING task_uuid
)
INSERT INTO task_transitions (
task_uuid, from_state, to_state,
processor_uuid, transition_metadata,
sort_key, most_recent, created_at, updated_at
)
SELECT
p_task_uuid, p_from_state, p_to_state,
p_processor_uuid, p_metadata,
v_sort_key, true, NOW(), NOW()
WHERE EXISTS (SELECT 1 FROM do_update);
-- FOUND reflects whether the INSERT above affected any rows
v_transitioned := FOUND;
RETURN v_transitioned;
END;
$$ LANGUAGE plpgsql;
Key Features
Atomic Operation:
- Single transaction with row-level locking
- Compare-and-swap semantics prevent race conditions
- Returns boolean indicating success/failure
Ownership Validation:
- Processor ownership required for active states
- Prevents concurrent processing by multiple orchestrators
- Supports ownership claiming for unowned tasks
State Consistency:
- Validates the current state matches the expected `from_state`
- Maintains an audit trail with complete transition history
- Updates `most_recent` flags atomically
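The ownership rule embedded in the function's CASE expression can be isolated as a small predicate, sketched below with integer IDs standing in for processor UUIDs; this is an illustrative mirror, not the engine's code.

```rust
/// Illustrative mirror of the ownership rule in transition_task_state_atomic():
/// active processing states require the requester to own the task (or the task
/// to be unowned); other states can be transitioned by any processor.
/// Integer IDs stand in for processor UUIDs.
fn can_transition(current_state: &str, current_owner: Option<u64>, requester: u64) -> bool {
    let requires_ownership = matches!(
        current_state,
        "initializing" | "enqueuing_steps" | "steps_in_process" | "evaluating_results"
    );
    // Ownership match, claiming an unowned task, or a state with no ownership rule
    !requires_ownership || current_owner.map_or(true, |owner| owner == requester)
}

fn main() {
    // Another processor owns the active task: transition refused
    assert!(!can_transition("steps_in_process", Some(1), 2));
    // The owner may proceed
    assert!(can_transition("steps_in_process", Some(2), 2));
    // An unowned active task can be claimed
    assert!(can_transition("steps_in_process", None, 2));
    // Non-active states don't require ownership
    assert!(can_transition("complete", Some(1), 2));
    println!("ownership checks passed");
}
```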
Current State Resolution
Fast current state lookups are provided through optimized queries:
#![allow(unused)]
fn main() {
impl SqlFunctionExecutor {
pub async fn get_current_task_state(&self, task_uuid: Uuid)
-> Result<TaskState, sqlx::Error> {
let state_str = sqlx::query_scalar!(
r#"SELECT get_current_task_state($1) as "state""#,
task_uuid
)
.fetch_optional(&self.pool)
.await?
.ok_or_else(|| sqlx::Error::RowNotFound)?;
match state_str {
Some(state) => TaskState::try_from(state.as_str())
.map_err(|_| sqlx::Error::Decode("Invalid task state".into())),
None => Err(sqlx::Error::RowNotFound),
}
}
}
}
Analytics and System Health
System Health Monitoring
The get_system_health_counts function provides comprehensive system visibility:
CREATE OR REPLACE FUNCTION get_system_health_counts()
RETURNS TABLE(
pending_tasks BIGINT,
initializing_tasks BIGINT,
enqueuing_steps_tasks BIGINT,
steps_in_process_tasks BIGINT,
evaluating_results_tasks BIGINT,
waiting_for_dependencies_tasks BIGINT,
waiting_for_retry_tasks BIGINT,
blocked_by_failures_tasks BIGINT,
complete_tasks BIGINT,
error_tasks BIGINT,
cancelled_tasks BIGINT,
resolved_manually_tasks BIGINT,
total_tasks BIGINT,
-- step counts...
) AS $$
Health Score Calculation
The Rust implementation provides derived health metrics:
#![allow(unused)]
fn main() {
impl SystemHealthCounts {
pub fn health_score(&self) -> f64 {
if self.total_tasks == 0 {
return 1.0;
}
let success_rate = self.complete_tasks as f64 / self.total_tasks as f64;
let error_rate = self.error_tasks as f64 / self.total_tasks as f64;
let connection_health = 1.0 -
(self.active_connections as f64 / self.max_connections as f64).min(1.0);
// Weighted combination: 50% success rate, 30% error rate, 20% connection health
(success_rate * 0.5) + ((1.0 - error_rate) * 0.3) + (connection_health * 0.2)
}
pub fn is_under_heavy_load(&self) -> bool {
let connection_pressure =
self.active_connections as f64 / self.max_connections as f64;
let error_rate = if self.total_tasks > 0 {
self.error_tasks as f64 / self.total_tasks as f64
} else {
0.0
};
connection_pressure > 0.8 || error_rate > 0.2
}
}
}
Analytics Metrics
The get_analytics_metrics function provides comprehensive performance analysis:
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct AnalyticsMetrics {
pub active_tasks_count: i64,
pub total_namespaces_count: i64,
pub unique_task_types_count: i64,
pub system_health_score: BigDecimal,
pub task_throughput: i64,
pub completion_count: i64,
pub error_count: i64,
pub completion_rate: BigDecimal,
pub error_rate: BigDecimal,
pub avg_task_duration: BigDecimal,
pub avg_step_duration: BigDecimal,
pub step_throughput: i64,
pub analysis_period_start: DateTime<Utc>,
pub calculated_at: DateTime<Utc>,
}
}
Performance Optimization Analysis
Slowest Steps Analysis
The system provides performance optimization guidance through detailed analysis:
CREATE OR REPLACE FUNCTION get_slowest_steps(
since_timestamp TIMESTAMP WITH TIME ZONE,
limit_count INTEGER,
namespace_filter VARCHAR,
task_name_filter VARCHAR,
version_filter VARCHAR
) RETURNS TABLE(
named_step_uuid UUID,
step_name VARCHAR,
avg_duration_seconds NUMERIC,
max_duration_seconds NUMERIC,
min_duration_seconds NUMERIC,
execution_count INTEGER,
error_count INTEGER,
error_rate NUMERIC,
last_executed_at TIMESTAMP WITH TIME ZONE
)
Slowest Tasks Analysis
Similar analysis is available at the task level:
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct SlowestTaskAnalysis {
pub named_task_uuid: Uuid,
pub task_name: String,
pub avg_duration_seconds: f64,
pub max_duration_seconds: f64,
pub min_duration_seconds: f64,
pub execution_count: i32,
pub avg_step_count: f64,
pub error_count: i32,
pub error_rate: f64,
pub last_executed_at: Option<DateTime<Utc>>,
}
}
Critical Problem-Solving SQL Functions
PGMQ Message Race Condition Prevention
Problem: Multiple Workers Claiming Same Message
When multiple workers simultaneously try to process steps from the same queue, PGMQ’s standard
pgmq.read() function randomly selects messages, potentially causing workers to miss messages
they were specifically notified about. This creates inefficiency and potential race conditions.
Solution: pgmq_read_specific_message()
CREATE OR REPLACE FUNCTION pgmq_read_specific_message(
queue_name text,
target_msg_id bigint,
vt_seconds integer DEFAULT 30
) RETURNS TABLE (
msg_id bigint,
read_ct integer,
enqueued_at timestamp with time zone,
vt timestamp with time zone,
message jsonb
) AS $$
Key Problem-Solving Logic:
- Atomic Claim with Visibility Timeout: Uses an UPDATE…RETURNING pattern to atomically:
  - Check whether the message is available (`vt <= now()`)
  - Set a new visibility timeout, preventing other workers from claiming it
  - Increment the read count for monitoring retry attempts
  - Return message data only if successfully claimed
- Race Condition Prevention: The `WHERE vt <= now()` clause ensures only one worker can claim a message. If two workers try simultaneously, only one UPDATE succeeds.
- Graceful Failure Handling: Returns an empty result set if the message is:
  - Already claimed by another worker (`vt > now()`)
  - Non-existent (deleted or never existed)
  - Archived (moved to the archive table)
- Security: Validates the queue name to prevent SQL injection in dynamic query construction.
Real-World Impact: Eliminates “message not found” errors when workers are notified about specific messages but can’t retrieve them due to random selection in standard read.
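The claim semantics can be sketched in-memory: a message is claimable only when its visibility timeout has passed, and a successful claim extends the timeout and bumps the read count, so a second concurrent claim gets nothing. This is an illustrative model of the behavior, not the SQL function itself.

```rust
use std::time::{Duration, SystemTime};

/// In-memory sketch of the claim semantics behind pgmq_read_specific_message():
/// claimable only when the visibility timeout has passed; claiming extends the
/// timeout and increments the read count, so a concurrent second claim fails.
struct QueuedMessage {
    vt: SystemTime, // message becomes visible again at this instant
    read_ct: u32,
}

fn try_claim(msg: &mut QueuedMessage, now: SystemTime, vt_seconds: u64) -> Option<u32> {
    if msg.vt <= now {
        // In SQL this is the atomic UPDATE ... WHERE vt <= now() RETURNING ...
        msg.vt = now + Duration::from_secs(vt_seconds);
        msg.read_ct += 1;
        Some(msg.read_ct)
    } else {
        None // already claimed by another worker (vt > now)
    }
}

fn main() {
    let now = SystemTime::now();
    let mut msg = QueuedMessage { vt: now, read_ct: 0 };

    // First worker claims the message and hides it for 30 seconds
    assert_eq!(try_claim(&mut msg, now, 30), Some(1));
    // A second worker arriving immediately afterwards gets nothing
    assert_eq!(try_claim(&mut msg, now, 30), None);
    println!("claim checks passed");
}
```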
Task State Ownership and Atomic Transitions
Problem: Concurrent Orchestrators Processing Same Task
In distributed deployments, multiple orchestrator instances might try to process the same task simultaneously, leading to duplicate work, inconsistent state, and race conditions.
Solution: transition_task_state_atomic()
CREATE OR REPLACE FUNCTION transition_task_state_atomic(
p_task_uuid UUID,
p_from_state VARCHAR,
p_to_state VARCHAR,
p_processor_uuid UUID,
p_metadata JSONB DEFAULT '{}'
) RETURNS BOOLEAN AS $$
Key Problem-Solving Logic:
- Compare-and-Swap Pattern:
  - Reads the current state with a `FOR UPDATE` lock
  - Only transitions if the current state matches the expected `from_state`
  - Returns false if the state has changed, allowing the caller to retry with fresh state
- Processor Ownership Enforcement:

  CASE
      WHEN cs.to_state IN ('initializing', 'enqueuing_steps',
                           'steps_in_process', 'evaluating_results')
      THEN cs.processor_uuid = p_processor_uuid OR cs.processor_uuid IS NULL
      ELSE true
  END

  - Active processing states require an ownership match
  - Allows claiming unowned tasks (NULL processor_uuid)
  - Terminal states (complete, error) don't require ownership
- Audit Trail Preservation:
  - Updates the previous transition's `most_recent = false`
  - Inserts the new transition with `most_recent = true`
  - Maintains complete history with sort_key ordering
- Atomic Success/Failure: Returns a boolean indicating whether the transition succeeded, enabling callers to handle contention gracefully.
Real-World Impact: Enables safe distributed orchestration where multiple instances can operate without conflicts, automatically distributing work through ownership claiming.
Batch Task Discovery with Priority
Problem: Efficient Work Distribution Across Orchestrators
Orchestrators need to discover ready tasks efficiently without creating hotspots or missing tasks, while respecting priority and avoiding claimed tasks.
Solution: get_next_ready_tasks()
CREATE OR REPLACE FUNCTION get_next_ready_tasks(p_limit INTEGER DEFAULT 5)
RETURNS TABLE(
task_uuid UUID,
task_name TEXT,
priority INTEGER,
namespace_name TEXT,
ready_steps_count BIGINT,
computed_priority NUMERIC,
current_state VARCHAR
)
Key Problem-Solving Logic:
- Ready Step Discovery:

  WITH ready_steps AS (
      SELECT task_uuid, COUNT(*) as ready_count
      FROM workflow_steps
      WHERE current_state IN ('pending', 'error')
        AND [dependency checks]
      GROUP BY task_uuid
  )

  - Pre-aggregates ready steps per task for efficiency
  - Considers both new steps and retryable errors
- State-Based Filtering:
  - Only returns tasks in states that need processing
  - Excludes terminal states (complete, cancelled)
  - Includes waiting states that might have become ready
- Priority Computation:

  computed_priority = base_priority + (age_factor * hours_waiting) + (retry_factor * retry_count)

  - Dynamic priority based on age and retry attempts
  - Prevents task starvation through age escalation
- Batch Efficiency:
  - Returns multiple tasks in a single query
  - Reduces database round trips
  - Enables parallel processing across orchestrators
Real-World Impact: Enables efficient work distribution where each orchestrator can claim a batch of tasks, reducing contention and improving throughput.
Complex Dependency Resolution
Problem: Determining Step Execution Readiness
Workflow steps have complex dependencies involving parent completion, retry logic, backoff timing, and state validation. Determining which steps are ready for execution requires sophisticated dependency analysis that must handle:
- Multiple parent dependencies with conditional logic
- Exponential backoff after failures
- Retry limits and attempt tracking
- State consistency across distributed workers
Solution: get_step_readiness_status()
CREATE OR REPLACE FUNCTION get_step_readiness_status(
input_task_uuid UUID,
step_uuids UUID[] DEFAULT NULL
) RETURNS TABLE(
workflow_step_uuid UUID,
task_uuid UUID,
named_step_uuid UUID,
name VARCHAR,
current_state VARCHAR,
dependencies_satisfied BOOLEAN,
retry_eligible BOOLEAN,
ready_for_execution BOOLEAN,
-- ... additional metadata
)
Key Problem-Solving Logic:
- Dependency Satisfaction Analysis:

  WITH parent_completion AS (
      SELECT
          edge.to_step_uuid,
          COUNT(*) as total_parents,
          COUNT(CASE WHEN parent.current_state = 'complete' THEN 1 END) as completed_parents
      FROM workflow_step_edges edge
      JOIN workflow_steps parent ON parent.workflow_step_uuid = edge.from_step_uuid
      WHERE parent.task_uuid = input_task_uuid
      GROUP BY edge.to_step_uuid
  )

  - Counts total vs. completed parent dependencies
  - Handles conditional dependencies based on parent results
  - Supports complex DAG structures with multiple paths
- Retry Eligibility Assessment:

  retry_eligible = (
      current_state = 'error'
      AND attempts < retry_limit
      AND (last_attempted_at IS NULL
           OR last_attempted_at + backoff_interval <= NOW())
  )

  - Enforces retry limits to prevent infinite loops
  - Calculates exponential backoff: `2^attempts` seconds (max 60, configurable)
  - Respects custom backoff periods from step configuration
  - Considers temporal constraints for retry timing
- State Validation:

  ready_for_execution = (
      current_state IN ('pending', 'error')
      AND dependencies_satisfied
      AND retry_eligible
  )

  - Only pending or retryable error steps can execute
  - Requires all dependencies satisfied
  - Must pass retry eligibility checks
  - Prevents execution of steps in terminal states
- Backoff Calculation:

  next_retry_at = CASE
      WHEN current_state = 'error' AND attempts > 0
      THEN last_attempted_at + INTERVAL '1 second' *
           COALESCE(backoff_request_seconds, LEAST(POW(2, attempts), 60))
      ELSE NULL
  END

  - Custom backoff from step configuration takes precedence
  - Default exponential backoff with a maximum cap
  - Temporal precision for scheduling retry attempts
Real-World Impact: Enables complex workflow orchestration with sophisticated dependency management, retry logic, and backoff handling, supporting enterprise-grade reliability patterns while maintaining high performance through set-based operations.
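The readiness predicates can also be mirrored in ordinary code. Below is a minimal Rust sketch of the retry-eligibility and backoff rules; the struct, fields, and method names are illustrative, not part of the Tasker crates (the authoritative logic lives in the SQL function):

```rust
// Illustrative mirror of the SQL readiness predicate; types and names
// are hypothetical, not the tasker-shared API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StepState { Pending, Error, Complete }

struct Step {
    state: StepState,
    attempts: u32,
    retry_limit: u32,
    // Seconds since the last attempt; None if never attempted.
    seconds_since_last_attempt: Option<u64>,
    // Optional per-step override of the default exponential backoff.
    backoff_request_seconds: Option<u64>,
}

impl Step {
    /// Default backoff: 2^attempts seconds, capped at 60 (as in the SQL).
    fn backoff_seconds(&self) -> u64 {
        self.backoff_request_seconds
            .unwrap_or_else(|| 2u64.saturating_pow(self.attempts).min(60))
    }

    fn retry_eligible(&self) -> bool {
        self.state == StepState::Error
            && self.attempts < self.retry_limit
            && self
                .seconds_since_last_attempt
                .map_or(true, |s| s >= self.backoff_seconds())
    }

    fn ready_for_execution(&self, dependencies_satisfied: bool) -> bool {
        match self.state {
            StepState::Pending => dependencies_satisfied,
            StepState::Error => dependencies_satisfied && self.retry_eligible(),
            _ => false, // terminal states never execute
        }
    }
}

fn main() {
    let step = Step {
        state: StepState::Error,
        attempts: 3,
        retry_limit: 5,
        seconds_since_last_attempt: Some(10),
        backoff_request_seconds: None,
    };
    // 2^3 = 8s of backoff has elapsed (10s since last attempt), so the step is ready.
    println!("backoff={}s ready={}", step.backoff_seconds(), step.ready_for_execution(true));
}
```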
Integration with Event and State Systems
PostgreSQL LISTEN/NOTIFY Integration
The SQL functions integrate with the event-driven architecture through PostgreSQL notifications.
PGMQ Wrapper Functions for Atomic Operations
The system uses wrapper functions that combine PGMQ message sending with PostgreSQL notifications atomically:
```sql
-- Atomic wrapper that sends message AND notification
CREATE OR REPLACE FUNCTION pgmq_send_with_notify(
    queue_name TEXT,
    message JSONB,
    delay_seconds INTEGER DEFAULT 0
) RETURNS BIGINT AS $$
DECLARE
    msg_id BIGINT;
    namespace_name TEXT;
    event_payload TEXT;
    namespace_channel TEXT;
    global_channel TEXT := 'pgmq_message_ready';
BEGIN
    -- Send message using PGMQ's native function
    SELECT pgmq.send(queue_name, message, delay_seconds) INTO msg_id;

    -- Extract namespace from queue name using robust helper
    namespace_name := extract_queue_namespace(queue_name);

    -- Build namespace-specific channel name
    namespace_channel := 'pgmq_message_ready.' || namespace_name;

    -- Build event payload
    event_payload := json_build_object(
        'event_type', 'message_ready',
        'msg_id', msg_id,
        'queue_name', queue_name,
        'namespace', namespace_name,
        'ready_at', NOW()::timestamptz,
        'delay_seconds', delay_seconds
    )::text;

    -- Send notifications in same transaction
    PERFORM pg_notify(namespace_channel, event_payload);

    -- Also send to global channel if different
    IF namespace_channel != global_channel THEN
        PERFORM pg_notify(global_channel, event_payload);
    END IF;

    RETURN msg_id;
END;
$$ LANGUAGE plpgsql;
```
Namespace Extraction Helper
```sql
-- Robust namespace extraction helper function
CREATE OR REPLACE FUNCTION extract_queue_namespace(queue_name TEXT)
RETURNS TEXT AS $$
BEGIN
    -- Handle orchestration queues
    IF queue_name ~ '^orchestration' THEN
        RETURN 'orchestration';
    END IF;

    -- Handle worker queues: worker_namespace_queue -> namespace
    IF queue_name ~ '^worker_.*_queue$' THEN
        RETURN COALESCE(
            (regexp_match(queue_name, '^worker_(.+?)_queue$'))[1],
            'worker'
        );
    END IF;

    -- Handle standard namespace_queue pattern
    IF queue_name ~ '^[a-zA-Z][a-zA-Z0-9_]*_queue$' THEN
        RETURN COALESCE(
            (regexp_match(queue_name, '^([a-zA-Z][a-zA-Z0-9_]*)_queue$'))[1],
            'default'
        );
    END IF;

    -- Fallback for any other pattern
    RETURN 'default';
END;
$$ LANGUAGE plpgsql;
```
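For reference, the same routing rules can be expressed without regular expressions. A std-only Rust sketch that mirrors the helper's branches (illustrative; the authoritative implementation is the SQL above):

```rust
// Std-only mirror of extract_queue_namespace. Illustrative only; the
// production implementation is the PL/pgSQL function.
fn extract_queue_namespace(queue_name: &str) -> &str {
    // Orchestration queues all route to the 'orchestration' namespace.
    if queue_name.starts_with("orchestration") {
        return "orchestration";
    }
    // worker_<namespace>_queue -> <namespace>
    if let Some(inner) = queue_name
        .strip_prefix("worker_")
        .and_then(|s| s.strip_suffix("_queue"))
    {
        if !inner.is_empty() {
            return inner;
        }
    }
    // Standard <namespace>_queue pattern.
    if let Some(ns) = queue_name.strip_suffix("_queue") {
        if ns.chars().next().map_or(false, |c| c.is_ascii_alphabetic()) {
            return ns;
        }
    }
    // Fallback for any other pattern.
    "default"
}

fn main() {
    assert_eq!(extract_queue_namespace("orchestration_step_results"), "orchestration");
    assert_eq!(extract_queue_namespace("worker_payments_queue"), "payments");
    assert_eq!(extract_queue_namespace("fulfillment_queue"), "fulfillment");
    assert_eq!(extract_queue_namespace("adhoc"), "default");
    println!("namespace routing ok");
}
```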
Fallback Polling for Task Readiness
Instead of database triggers for task readiness notifications, the system uses a fallback polling mechanism to ensure no ready tasks are missed:
FallbackPoller Configuration:
- Default polling interval: 30 seconds
- Runs `StepEnqueuerService::process_batch()` periodically
- Catches tasks that may have been missed by the primary PGMQ notification system
- Configurable enable/disable via TOML configuration
Key Benefits:
- Resilience: Ensures no tasks are permanently stuck if notifications fail
- Simplicity: No complex database triggers or state tracking required
- Observability: Clear metrics on fallback discovery vs. event-driven discovery
- Safety Net: Primary event-driven system + fallback polling provides redundancy
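The pattern itself is small: a timer that periodically re-runs batch discovery regardless of whether notifications arrived. A minimal Rust sketch (trait and struct names are illustrative stand-ins, not the actual service types, which are async and tokio-driven):

```rust
use std::time::Duration;

// Stand-in for the batch-discovery service the poller re-runs.
trait StepEnqueuer {
    fn process_batch(&mut self) -> usize; // returns steps enqueued this tick
}

struct FallbackPoller {
    interval: Duration, // 30s by default in Tasker
    enabled: bool,      // toggled via TOML configuration
}

impl FallbackPoller {
    /// Run `ticks` polling iterations (a real poller loops until shutdown).
    fn run<E: StepEnqueuer>(&self, enqueuer: &mut E, ticks: u32) -> usize {
        let mut total = 0;
        if !self.enabled {
            return total;
        }
        for _ in 0..ticks {
            std::thread::sleep(self.interval);
            // Safety net: re-run discovery even if no notification arrived.
            total += enqueuer.process_batch();
        }
        total
    }
}

struct CountingEnqueuer { calls: usize }
impl StepEnqueuer for CountingEnqueuer {
    fn process_batch(&mut self) -> usize {
        self.calls += 1;
        0 // nothing was missed by the event-driven path this tick
    }
}

fn main() {
    let poller = FallbackPoller { interval: Duration::from_millis(1), enabled: true };
    let mut enqueuer = CountingEnqueuer { calls: 0 };
    poller.run(&mut enqueuer, 3);
    // Three ticks elapsed, so discovery ran three times.
    println!("discovery runs: {}", enqueuer.calls);
}
```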
PGMQ Message Queue Integration
SQL functions coordinate with PGMQ for reliable message processing:
Queue Management Functions
```sql
-- Ensure queue exists with proper configuration
CREATE OR REPLACE FUNCTION ensure_task_queue(queue_name VARCHAR)
RETURNS BOOLEAN AS $$
BEGIN
    -- Create queue if it doesn't exist
    PERFORM pgmq.create_queue(queue_name);

    -- Ensure headers column exists (pgmq-rs compatibility)
    PERFORM pgmq_ensure_headers_column(queue_name);

    RETURN TRUE;
END;
$$ LANGUAGE plpgsql;
```
Message Processing Support
```sql
-- Get queue statistics for monitoring
-- (parameter prefixed p_ to avoid ambiguity with the returned queue_name column)
CREATE OR REPLACE FUNCTION get_queue_statistics(p_queue_name VARCHAR)
RETURNS TABLE(
    queue_name VARCHAR,
    queue_length BIGINT,
    oldest_msg_age_seconds INTEGER,
    newest_msg_age_seconds INTEGER
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        p_queue_name,
        pgmq.queue_length(p_queue_name),
        EXTRACT(EPOCH FROM (NOW() - MIN(enqueued_at)))::INTEGER,
        EXTRACT(EPOCH FROM (NOW() - MAX(enqueued_at)))::INTEGER
    FROM pgmq.messages(p_queue_name);
END;
$$ LANGUAGE plpgsql;
```
Transaction Safety
All SQL functions are designed with transaction safety in mind:
Atomic Operations:
- State transitions use row-level locking (`FOR UPDATE`)
- Compare-and-swap patterns prevent race conditions
- Rollback safety for partial failures
Consistency Guarantees:
- Foreign key constraints maintained across all operations
- Check constraints validate state transitions
- Audit trails preserved for debugging and compliance
Performance Optimization:
- Efficient indexes for common query patterns
- Materialized views for expensive analytics queries
- Connection pooling for high concurrency
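The compare-and-swap guard can be illustrated outside SQL as well. A sketch of the state-guarded transition, using an in-memory map as a stand-in for an `UPDATE ... WHERE current_state = expected` statement (names are illustrative):

```rust
use std::collections::HashMap;

// In-memory stand-in for the compare-and-swap transition guard the SQL
// functions use. Illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskState { Pending, InProgress, Complete }

struct TaskStore {
    states: HashMap<u64, TaskState>,
}

impl TaskStore {
    /// Analogous to:
    ///   UPDATE tasks SET state = $new WHERE task_uuid = $id AND state = $expected
    /// Returns true only if the expected state still held (rows affected = 1).
    fn compare_and_transition(&mut self, id: u64, expected: TaskState, new: TaskState) -> bool {
        match self.states.get_mut(&id) {
            Some(state) if *state == expected => {
                *state = new;
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let mut store = TaskStore { states: HashMap::from([(1, TaskState::Pending)]) };
    // First orchestrator claims the task: the guard holds.
    assert!(store.compare_and_transition(1, TaskState::Pending, TaskState::InProgress));
    // A racing orchestrator with a stale view loses: the guard fails, no double-claim.
    assert!(!store.compare_and_transition(1, TaskState::Pending, TaskState::InProgress));
    println!("CAS guard prevented the double-claim");
}
```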
Usage Patterns and Best Practices
Rust Integration Patterns
The SqlFunctionExecutor provides type-safe access to all SQL functions:
```rust
use tasker_shared::database::sql_functions::{SqlFunctionExecutor, FunctionRegistry};

// Direct executor usage
let executor = SqlFunctionExecutor::new(pool);
let ready_steps = executor.get_ready_steps(task_uuid).await?;

// Registry pattern for organized access
let registry = FunctionRegistry::new(pool);
let analytics = registry.analytics().get_analytics_metrics(None).await?;
let health = registry.system_health().get_system_health_counts().await?;
```
Batch Processing Optimization
For high-throughput scenarios, the system supports efficient batch operations:
```rust
// Batch step readiness analysis
let task_uuids = vec![task1_uuid, task2_uuid, task3_uuid];
let batch_readiness = executor.get_step_readiness_status_batch(task_uuids).await?;

// Batch task discovery
let ready_tasks = executor.get_next_ready_tasks(50).await?;
```
Error Handling Best Practices
SQL function errors are properly propagated through the type system:
```rust
match executor.get_current_task_state(task_uuid).await {
    Ok(state) => {
        // Process state
    }
    Err(sqlx::Error::RowNotFound) => {
        // Handle missing task
    }
    Err(e) => {
        // Handle other database errors
    }
}
```
Tasker Configuration Documentation Index
Coverage: 246/246 parameters documented (100%)
Common Configuration
- backoff (`common.backoff`) — 5 params (5 documented)
- cache (`common.cache`) — 10 params (10 documented)
  - moka (`common.cache.moka`) — 1 param
  - redis (`common.cache.redis`) — 4 params
- circuit_breakers (`common.circuit_breakers`) — 13 params (13 documented)
  - component_configs (`common.circuit_breakers.component_configs`) — 8 params
  - default_config (`common.circuit_breakers.default_config`) — 3 params
  - global_settings (`common.circuit_breakers.global_settings`) — 2 params
- database (`common.database`) — 7 params (7 documented)
  - pool (`common.database.pool`) — 6 params
- execution (`common.execution`) — 2 params (2 documented)
- mpsc_channels (`common.mpsc_channels`) — 4 params (4 documented)
  - event_publisher (`common.mpsc_channels.event_publisher`) — 1 param
  - ffi (`common.mpsc_channels.ffi`) — 1 param
  - overflow_policy (`common.mpsc_channels.overflow_policy`) — 2 params
- pgmq_database (`common.pgmq_database`) — 8 params (8 documented)
  - pool (`common.pgmq_database.pool`) — 6 params
- queues (`common.queues`) — 14 params (14 documented)
  - orchestration_queues (`common.queues.orchestration_queues`) — 3 params
  - pgmq (`common.queues.pgmq`) — 3 params
  - rabbitmq (`common.queues.rabbitmq`) — 3 params
- system (`common.system`) — 1 param (1 documented)
- task_templates (`common.task_templates`) — 1 param (1 documented)
Orchestration Configuration
- orchestration (`orchestration`) — 2 params (2 documented)
- batch_processing (`orchestration.batch_processing`) — 4 params (4 documented)
- decision_points (`orchestration.decision_points`) — 7 params (7 documented)
- dlq (`orchestration.dlq`) — 13 params (13 documented)
  - staleness_detection (`orchestration.dlq.staleness_detection`) — 12 params
- event_systems (`orchestration.event_systems`) — 36 params (36 documented)
  - orchestration (`orchestration.event_systems.orchestration`) — 18 params
  - task_readiness (`orchestration.event_systems.task_readiness`) — 18 params
- grpc (`orchestration.grpc`) — 9 params (9 documented)
- mpsc_channels (`orchestration.mpsc_channels`) — 3 params (3 documented)
  - command_processor (`orchestration.mpsc_channels.command_processor`) — 1 param
  - event_listeners (`orchestration.mpsc_channels.event_listeners`) — 1 param
  - event_systems (`orchestration.mpsc_channels.event_systems`) — 1 param
- web (`orchestration.web`) — 17 params (17 documented)
  - auth (`orchestration.web.auth`) — 9 params
  - database_pools (`orchestration.web.database_pools`) — 5 params
Worker Configuration
- worker (`worker`) — 2 params (2 documented)
- circuit_breakers (`worker.circuit_breakers`) — 4 params (4 documented)
  - ffi_completion_send (`worker.circuit_breakers.ffi_completion_send`) — 4 params
- event_systems (`worker.event_systems`) — 32 params (32 documented)
  - worker (`worker.event_systems.worker`) — 32 params
- grpc (`worker.grpc`) — 9 params (9 documented)
- mpsc_channels (`worker.mpsc_channels`) — 23 params (23 documented)
  - command_processor (`worker.mpsc_channels.command_processor`) — 1 param
  - domain_events (`worker.mpsc_channels.domain_events`) — 3 params
  - event_listeners (`worker.mpsc_channels.event_listeners`) — 1 param
  - event_subscribers (`worker.mpsc_channels.event_subscribers`) — 2 params
  - event_systems (`worker.mpsc_channels.event_systems`) — 1 param
  - ffi_dispatch (`worker.mpsc_channels.ffi_dispatch`) — 5 params
  - handler_dispatch (`worker.mpsc_channels.handler_dispatch`) — 7 params
  - in_process_events (`worker.mpsc_channels.in_process_events`) — 3 params
- orchestration_client (`worker.orchestration_client`) — 3 params (3 documented)
- web (`worker.web`) — 17 params (17 documented)
  - auth (`worker.web.auth`) — 9 params
  - database_pools (`worker.web.database_pools`) — 5 params
Generated by tasker-ctl docs — Tasker Configuration System
Architectural Decision Records
Auto-generated ADR summary. Do not edit manually.
Regenerate with `cargo make generate-adr-summary`.
This page summarizes the Architectural Decision Records (ADRs) from the Tasker project. Each ADR documents a significant design decision, its context, and consequences.
| # | Title | Status | Summary |
|---|---|---|---|
| 1 | Actor-Based Orchestration Architecture | Accepted | The decision was made to switch from a command pattern with direct service delegation to a lightweight actor-based architecture due to testing complexity and unclear boundaries between components. … |
| 2 | Bounded MPSC Channel Migration | Accepted | The decision was made to migrate the tasker-core system from its current inconsistent and risky usage of Multi-Producer Single-Consumer (MPSC) channels to a uniform implementation that uses only bo… |
| 3 | Processor UUID Ownership Removal | Accepted | The architectural decision was made to remove processor UUID ownership enforcement in the state transitions of orchestrators, as it prevented new orchestrators from taking over tasks when an old on… |
| 4 | Backoff Logic Consolidation | Accepted | To address multiple conflicting implementations of exponential backoff logic in the tasker-core system, which caused significant issues, we decided to consolidate the backoff logic into a single, u… |
| 5 | Worker Dual-Channel Event System | Accepted | The architectural decision was made to replace the original blocking .call() pattern in Rust workers with a dual-channel command pattern to enable true concurrency and prevent race conditions, en… |
| 6 | Worker Actor-Service Decomposition | Accepted | The decision was made to decompose the monolithic WorkerProcessor in the tasker-worker crate into an actor-based design due to its difficulty in testing and inconsistency with existing architectu… |
| 7 | FFI Over WASM for Language Workers | Accepted | The team decided to proceed with FFI for the TypeScript worker implementation, matching the successful Ruby and Python worker approaches, due to its pattern consistency and production readiness. WA… |
| 8 | Handler Composition Pattern | u | The architectural decision was made to migrate all handler patterns in the system to use composition via mixins, aligning with the Batchable handlers’ implementation, to adhere to the principle of … |
Summaries generated with Ollama (`qwen2.5:14b`). Set `SKIP_LLM=true` for deterministic summaries.
Generated by generate-adr-summary.sh from tasker-core ADR files
Configuration Operational Guide
Auto-generated operational tuning guide. Do not edit manually.
Regenerate with `cargo make generate-config-guide`.
This guide provides operational tuning advice for the most important Tasker configuration parameters. For the complete parameter reference, see the Configuration Reference.
Tasker uses a context-based configuration architecture:
- Common — shared across all contexts (database, queues, resilience, caching)
- Orchestration — orchestration-specific (gRPC, web, event systems, DLQ, batch processing)
- Worker — worker-specific (event systems, FFI dispatch, circuit breakers)
Common Configuration
Operational Tuning Guide for Tasker Common Configuration
This guide provides tuning recommendations for the most critical parameters within the CommonConfig section of Tasker, a Rust-based workflow orchestration platform. Adjustments to these settings are crucial for optimizing performance and resilience in development, staging, and production environments.
Key Parameters Overview
- Database Connection Pooling: Controls connection pool size.
- Message Queue (PGMQ) Configuration: Sets buffer sizes and concurrency limits.
- Circuit Breakers: Manages error tolerance and recovery mechanisms.
- MPSC Channels Buffer Size: Determines the buffer size for communication channels.
Database Connection Pool Configuration
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
database.pool_size | Maximum number of database connections in pool. | Increase if high contention or slow queries are observed; decrease to conserve resources. | Dev/Test: 5-10, Staging: 20-30, Production: 50+ |
database.max_idle_connections | Number of idle connections kept open. | Adjust based on workload stability and connection overhead. | Dev/Test: 2, Staging: 5, Production: 10 |
Message Queue (PGMQ) Configuration
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
queues.buffer_size | Maximum queue buffer size in messages. | Increase to handle bursts; decrease if latency is critical. | Dev/Test: 50, Staging: 100, Production: 200-300 |
queues.max_concurrency | Concurrent message processing limit. | Adjust based on available system resources and workload spikes. | Dev/Test: 2, Staging: 5, Production: 10 |
Circuit Breakers Configuration
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
circuit_breakers.max_failures | Number of failures before circuit trips. | Increase for more resilient systems; decrease to quickly fail over. | Dev/Test: 2, Staging: 5, Production: 10-15 |
circuit_breakers.reset_timeout | Time before the circuit breaker resets. | Longer timeouts are more conservative and reduce false positives. | Dev/Test: 30s, Staging: 60s, Production: 120s |
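The `max_failures` and `reset_timeout` pair drives a standard three-state breaker lifecycle (Closed, Open, Half-Open). A minimal Rust sketch of that lifecycle, illustrative only; Tasker's actual breakers also track timing and per-component configuration:

```rust
// Minimal three-state circuit breaker sketch. Thresholds correspond to
// the max_failures / success-threshold parameters; timing is elided.
#[derive(Debug, PartialEq)]
enum BreakerState { Closed, Open, HalfOpen }

struct CircuitBreaker {
    state: BreakerState,
    failures: u32,
    successes: u32,
    max_failures: u32,      // failures before tripping to Open
    success_threshold: u32, // successes in Half-Open needed to close
}

impl CircuitBreaker {
    fn record_failure(&mut self) {
        self.failures += 1;
        self.successes = 0;
        if self.failures >= self.max_failures {
            self.state = BreakerState::Open; // trip: fail fast from now on
        }
    }

    /// Called once reset_timeout elapses while Open.
    fn on_reset_timeout(&mut self) {
        if self.state == BreakerState::Open {
            self.state = BreakerState::HalfOpen; // probe with limited traffic
        }
    }

    fn record_success(&mut self) {
        if self.state == BreakerState::HalfOpen {
            self.successes += 1;
            if self.successes >= self.success_threshold {
                self.state = BreakerState::Closed;
                self.failures = 0;
            }
        } else {
            self.failures = 0;
        }
    }
}

fn main() {
    let mut cb = CircuitBreaker {
        state: BreakerState::Closed, failures: 0, successes: 0,
        max_failures: 2, success_threshold: 2,
    };
    cb.record_failure();
    cb.record_failure();    // trips at max_failures = 2
    assert_eq!(cb.state, BreakerState::Open);
    cb.on_reset_timeout();  // reset_timeout elapsed: probe
    cb.record_success();
    cb.record_success();    // success_threshold = 2 closes it again
    assert_eq!(cb.state, BreakerState::Closed);
    println!("breaker recovered");
}
```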
MPSC Channels Buffer Size Configuration
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
mpsc_channels.buffer_size | Number of messages buffer can hold. | Increase to handle high message throughput; decrease for faster message delivery. | Dev/Test: 10, Staging: 25, Production: 50-75 |
These settings should be adjusted based on observed system performance metrics and operational needs across different deployment environments. Proper monitoring of these parameters is essential for maintaining optimal task execution efficiency and reliability in Tasker.
Orchestration Configuration
Operational Tuning Guide for Tasker’s Orchestration Configuration
This guide provides insights into key parameters of the Orchestration configuration in Tasker to optimize performance and resource utilization. Adjust these settings based on your deployment environment: development/test (small), staging (medium), or production (large).
| Parameter | Description | Adjustment Criteria | Small (dev/test) | Medium (staging) | Large (production) |
|---|---|---|---|---|---|
shutdown_timeout_ms | Maximum time to wait for orchestration subsystems to stop during graceful shutdown. | Increase if subsystems take longer to shut down | 30,000 ms (30 sec) | 60,000 ms (1 min) | 90,000 ms (1.5 min) |
grpc.bind_address | IP address and port for gRPC server to listen on. | Not typically changed | 0.0.0.0:9090 | 0.0.0.0:9090 | 0.0.0.0:9090 |
grpc.tls_enabled | Enables TLS for gRPC connections (use if security is required). | Enable in production | false | true | true |
grpc.keepalive_interval_seconds | Interval between keep-alive pings to maintain HTTP/2 connection. | Increase for more stable, less chatty connections | 30 seconds | 60 seconds | 180 seconds (3 min) |
grpc.max_concurrent_streams | Maximum number of concurrent gRPC streams per connection. | Increase with higher traffic | 200 | 500 | 1,000 |
grpc.enable_reflection | Enables the gRPC reflection service for easier discovery and debugging (recommended for development). | Disable in production | true | false | false |
Detailed Parameter Descriptions
shutdown_timeout_ms
- What it controls: Specifies the maximum time to wait during a graceful shutdown of orchestration subsystems before forceful termination.
- Adjustment Criteria:
- Increase if subsystems tend to take longer to shut down, particularly in large deployments with many active tasks or connections.
- Decrease if quick startup/shutdown is prioritized over thorough cleanup.
- Recommended Values:
- Small (dev/test): 30 seconds
- Medium (staging): 1 minute
- Large (production): 1.5 minutes
grpc.bind_address
- What it controls: IP address and port for the gRPC server to listen on.
- Adjustment Criteria: Typically set to `0.0.0.0` to bind to all available network interfaces, useful in cloud environments where external IPs are dynamic.
grpc.tls_enabled
- What it controls: Enables Transport Layer Security (TLS) for secure gRPC connections.
- Adjustment Criteria:
- Enable if the application is handling sensitive data or exposed publicly over untrusted networks.
- Disable only in internal testing environments where security risks are minimal.
grpc.keepalive_interval_seconds
- What it controls: Interval between keep-alive pings to maintain HTTP/2 connections, ensuring no idle timeouts occur.
- Adjustment Criteria:
- Increase for more stable, less chatty connections (useful across unreliable networks).
- Decrease in high-frequency environments where quick response times are critical.
grpc.max_concurrent_streams
- What it controls: The maximum number of concurrent gRPC streams allowed per connection.
- Adjustment Criteria:
- Increase if your application expects a high volume of concurrent operations or connections, but be mindful of resource constraints.
grpc.enable_reflection
- What it controls: Enables the gRPC reflection service for easier discovery and debugging.
- Adjustment Criteria:
- Enable during development to facilitate introspection and tool integration.
- Disable in production to minimize potential security risks.
By carefully tuning these parameters, you can optimize Tasker’s performance and reliability across various deployment scenarios.
Worker Configuration
Operational Tuning Guide for Tasker Worker Configuration
The following guide provides instructions on tuning the key parameters of the WorkerConfig structure in the Tasker workflow orchestration platform. This section focuses on optimizing the performance and reliability of workers by adjusting critical settings.
Key Parameters Overview
- Circuit Breakers (`circuit_breakers`)
- Event Systems (`event_systems`)
- MPSC Channels (`mpsc_channels`)
- Orchestration Client (`orchestration_client`) (optional)
- Web API Configuration (`web`) (optional)
- gRPC API Configuration (`grpc`) (optional)
Circuit Breakers
The circuit_breakers configuration is crucial for managing worker stability by preventing overload and ensuring quick recovery from issues.
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
| failure_threshold | Number of slow/failed sends before the circuit breaks. | Increase if frequent false positives; decrease to be more sensitive to failures. | Small: 5, Medium: 10, Large: 20 |
| recovery_timeout_seconds | Time (in seconds) for which a broken circuit stays open before attempting recovery. | Decrease to accelerate recovery time; increase to avoid premature reopening. | Small: 5, Medium: 10, Large: 15 |
| success_threshold | Number of successful fast sends required in the half-open state to close the circuit again. | Increase if false negatives are common; decrease for faster recovery attempts. | Small: 2, Medium: 3, Large: 4 |
| slow_send_threshold_ms | Latency above which a send is considered slow and contributes to breaking the circuit. | Decrease to be more sensitive to latency issues; increase to avoid unnecessary circuit breaks. | Small: 100, Medium: 200, Large: 300 |
Event Systems
The event_systems configuration determines how workers handle event-driven operations.
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
| worker | Configuration for the worker-specific event system. | Adjust based on specific needs of each deployment environment to ensure optimal event handling. | Small: Default, Medium: Customized, Large: Highly optimized |
MPSC Channels
The mpsc_channels configuration is essential for managing message passing between different components.
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
| batch_size | Number of messages to process in a single batch. | Increase or decrease based on performance tuning and resource availability. | Small: 10, Medium: 50, Large: 100 |
| max_buffer_length | Maximum length of the buffer queue before backpressure starts. | Adjust according to expected message volume and system capacity. | Small: 20, Medium: 100, Large: 300 |
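The `max_buffer_length` setting matters because Tasker's channels are bounded (see ADR 2): once a buffer is full, sends exert backpressure instead of growing without limit. A std-only Rust sketch of that behavior using a bounded `sync_channel`:

```rust
// Demonstrates bounded-channel backpressure: once the buffer is full,
// a non-blocking send fails instead of allocating unboundedly.
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    let (tx, rx) = sync_channel::<u32>(2); // buffer of 2 messages

    assert!(tx.try_send(1).is_ok());
    assert!(tx.try_send(2).is_ok());
    // Buffer full: the send is rejected, signaling backpressure upstream.
    assert!(matches!(tx.try_send(3), Err(TrySendError::Full(3))));

    // Draining the receiver frees capacity, and sends succeed again.
    assert_eq!(rx.recv().unwrap(), 1);
    assert!(tx.try_send(3).is_ok());
    println!("backpressure demo ok");
}
```

A blocking `send` would instead park the producer until capacity frees up; both behaviors keep memory bounded, which is the point of the limit.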
Orchestration Client
The orchestration_client configures how workers connect to the orchestration API.
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
| connection_timeout | Maximum time to wait for a connection attempt before timing out. | Increase in environments with high network latency; decrease in highly responsive setups. | Small: 10s, Medium: 20s, Large: 30s |
| retry_attempts | Number of retry attempts upon initial failure to connect. | Adjust based on the reliability and availability of the orchestration service. | Small: 3, Medium: 5, Large: 7 |
Web API Configuration
The web configuration sets parameters for the worker’s web-based interface.
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
| listen_address | IP address and port on which to listen. | Adjust based on deployment specifics such as network topology and security constraints. | Small: Localhost, Medium: Internal Network, Large: Public Internet |
gRPC API Configuration
The grpc configuration is used for setting up the worker’s gRPC-based interface.
| Parameter | Description | Adjustment Guidance | Recommended Values |
|---|---|---|---|
| max_concurrent_streams | Maximum number of concurrent streams allowed. | Adjust based on expected load and available system resources. | Small: 10, Medium: 50, Large: 200 |
Conclusion
By fine-tuning these parameters, operators can significantly enhance the performance, reliability, and responsiveness of Tasker workers across different deployment environments. Careful monitoring and iterative adjustments are key to achieving optimal results.
Operational guidance generated with Ollama (`qwen2.5:14b`). Set `SKIP_LLM=true` for deterministic output.
Generated by generate-config-guide.sh from tasker-core configuration source
Configuration Reference: common
65/65 parameters documented
backoff
Path: common.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
backoff_multiplier | f64 | 2.0 | Multiplier applied to the previous delay for exponential backoff calculations |
default_backoff_seconds | Vec<u32> | [1, 5, 15, 30, 60] | Sequence of backoff delays in seconds for successive retry attempts |
jitter_enabled | bool | true | Add random jitter to backoff delays to prevent thundering herd on retry |
jitter_max_percentage | f64 | 0.15 | Maximum jitter as a fraction of the computed backoff delay |
max_backoff_seconds | u32 | 3600 | Hard upper limit on any single backoff delay |
common.backoff.backoff_multiplier
Multiplier applied to the previous delay for exponential backoff calculations
- Type: `f64`
- Default: `2.0`
- Valid Range: 1.0-10.0
- System Impact: Controls how aggressively delays grow; 2.0 means each delay is double the previous
common.backoff.default_backoff_seconds
Sequence of backoff delays in seconds for successive retry attempts
- Type: `Vec<u32>`
- Default: `[1, 5, 15, 30, 60]`
- Valid Range: non-empty array of positive integers
- System Impact: Defines the retry cadence; after exhausting the array, the last value is reused up to max_backoff_seconds
common.backoff.jitter_enabled
Add random jitter to backoff delays to prevent thundering herd on retry
- Type: `bool`
- Default: `true`
- Valid Range: true/false
- System Impact: When true, backoff delays are randomized within jitter_max_percentage to spread retries across time
common.backoff.jitter_max_percentage
Maximum jitter as a fraction of the computed backoff delay
- Type: `f64`
- Default: `0.15`
- Valid Range: 0.0-1.0
- System Impact: A value of 0.15 means delays vary by up to +/-15% of the base delay
common.backoff.max_backoff_seconds
Hard upper limit on any single backoff delay
- Type: `u32`
- Default: `3600`
- Valid Range: 1-3600
- System Impact: Caps exponential backoff growth to prevent excessively long delays between retries
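Assuming the TOML tables mirror the documented paths (the exact file layout in a given deployment may differ), a `common.backoff` section with these defaults would look like:

```toml
# Sketch of the backoff section using the documented defaults.
[common.backoff]
backoff_multiplier = 2.0
default_backoff_seconds = [1, 5, 15, 30, 60]
jitter_enabled = true
jitter_max_percentage = 0.15
max_backoff_seconds = 3600
```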
cache
Path: common.cache
| Parameter | Type | Default | Description |
|---|---|---|---|
analytics_ttl_seconds | u32 | 60 | Time-to-live in seconds for cached analytics and metrics data |
backend | String | "redis" | Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process) |
default_ttl_seconds | u32 | 3600 | Default time-to-live in seconds for cached entries |
enabled | bool | false | Enable the distributed cache layer for template and analytics data |
template_ttl_seconds | u32 | 3600 | Time-to-live in seconds for cached task template definitions |
common.cache.analytics_ttl_seconds
Time-to-live in seconds for cached analytics and metrics data
- Type: `u32`
- Default: `60`
- Valid Range: 1-3600
- System Impact: Analytics data is write-heavy and changes frequently; short TTL (60s) keeps metrics current
common.cache.backend
Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)
- Type: `String`
- Default: `"redis"`
- Valid Range: redis | moka
- System Impact: Redis is required for multi-instance deployments to avoid stale data; moka is suitable for single-instance or DoS protection
common.cache.default_ttl_seconds
Default time-to-live in seconds for cached entries
- Type: `u32`
- Default: `3600`
- Valid Range: 1-86400
- System Impact: Controls how long cached data remains valid before being re-fetched from the database
common.cache.enabled
Enable the distributed cache layer for template and analytics data
- Type: `bool`
- Default: `false`
- Valid Range: true/false
- System Impact: When false, all cache reads fall through to direct database queries; no cache dependency required
common.cache.template_ttl_seconds
Time-to-live in seconds for cached task template definitions
- Type: `u32`
- Default: `3600`
- Valid Range: 1-86400
- System Impact: Template changes take up to this long to propagate; shorter values increase DB load, longer values improve performance
moka
Path: common.cache.moka
| Parameter | Type | Default | Description |
|---|---|---|---|
max_capacity | u64 | 10000 | Maximum number of entries the in-process Moka cache can hold |
common.cache.moka.max_capacity
Maximum number of entries the in-process Moka cache can hold
- Type: `u64`
- Default: `10000`
- Valid Range: 1-1000000
- System Impact: Bounds memory usage; least-recently-used entries are evicted when capacity is reached
redis
Path: common.cache.redis
| Parameter | Type | Default | Description |
|---|---|---|---|
connection_timeout_seconds | u32 | 5 | Maximum time to wait when establishing a new Redis connection |
database | u32 | 0 | Redis database number (0-15) |
max_connections | u32 | 10 | Maximum number of connections in the Redis connection pool |
url | String | "${REDIS_URL:-redis://localhost:6379}" | Redis connection URL |
common.cache.redis.connection_timeout_seconds
Maximum time to wait when establishing a new Redis connection
- Type: `u32`
- Default: `5`
- Valid Range: 1-60
- System Impact: Connections that cannot be established within this timeout fail; cache falls back to database
common.cache.redis.database
Redis database number (0-15)
- Type: `u32`
- Default: `0`
- Valid Range: 0-15
- System Impact: Isolates Tasker cache keys from other applications sharing the same Redis instance
common.cache.redis.max_connections
Maximum number of connections in the Redis connection pool
- Type: `u32`
- Default: `10`
- Valid Range: 1-500
- System Impact: Bounds concurrent Redis operations; increase for high cache throughput workloads
common.cache.redis.url
Redis connection URL
- Type: `String`
- Default: `"${REDIS_URL:-redis://localhost:6379}"`
- Valid Range: valid Redis URI
- System Impact: Must be reachable when cache is enabled with redis backend
circuit_breakers
Path: common.circuit_breakers
component_configs
Path: common.circuit_breakers.component_configs
cache
Path: common.circuit_breakers.component_configs.cache
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the cache circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the cache breaker |
common.circuit_breakers.component_configs.cache.failure_threshold
Failures before the cache circuit breaker trips to Open
- Type: `u32`
- Default: `5`
- Valid Range: 1-100
- System Impact: Protects Redis/Dragonfly operations; when tripped, cache reads fall through to database
common.circuit_breakers.component_configs.cache.success_threshold
Successes in Half-Open required to close the cache breaker
- Type: `u32`
- Default: `2`
- Valid Range: 1-100
- System Impact: Low threshold (2) for fast recovery since cache failures gracefully degrade to database
messaging
Path: common.circuit_breakers.component_configs.messaging
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the messaging circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the messaging breaker |
common.circuit_breakers.component_configs.messaging.failure_threshold
Failures before the messaging circuit breaker trips to Open
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects the messaging layer (PGMQ or RabbitMQ); when tripped, queue send/receive operations are short-circuited
common.circuit_breakers.component_configs.messaging.success_threshold
Successes in Half-Open required to close the messaging breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Lower threshold (2) allows faster recovery since messaging failures are typically transient
task_readiness
Path: common.circuit_breakers.component_configs.task_readiness
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 10 | Failures before the task readiness circuit breaker trips to Open |
success_threshold | u32 | 3 | Successes in Half-Open required to close the task readiness breaker |
common.circuit_breakers.component_configs.task_readiness.failure_threshold
Failures before the task readiness circuit breaker trips to Open
- Type: u32
- Default: 10
- Valid Range: 1-100
- System Impact: Higher than default (10 vs 5) because task readiness queries are frequent and transient failures are expected
common.circuit_breakers.component_configs.task_readiness.success_threshold
Successes in Half-Open required to close the task readiness breaker
- Type: u32
- Default: 3
- Valid Range: 1-100
- System Impact: Slightly higher than default (3) for extra confidence before resuming readiness queries
web
Path: common.circuit_breakers.component_configs.web
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the web/API database circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the web database breaker |
common.circuit_breakers.component_configs.web.failure_threshold
Failures before the web/API database circuit breaker trips to Open
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects API database operations; when tripped, API requests receive fast 503 errors instead of waiting for timeouts
common.circuit_breakers.component_configs.web.success_threshold
Successes in Half-Open required to close the web database breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Standard threshold (2) provides confidence in recovery before restoring full API traffic
default_config
Path: common.circuit_breakers.default_config
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Number of consecutive failures before a circuit breaker trips to the Open state |
success_threshold | u32 | 2 | Number of consecutive successes in Half-Open state required to close the circuit breaker |
timeout_seconds | u32 | 30 | Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests |
common.circuit_breakers.default_config.failure_threshold
Number of consecutive failures before a circuit breaker trips to the Open state
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Lower values make the breaker more sensitive; higher values tolerate more transient failures before tripping
common.circuit_breakers.default_config.success_threshold
Number of consecutive successes in Half-Open state required to close the circuit breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Higher values require more proof of recovery before restoring full traffic
common.circuit_breakers.default_config.timeout_seconds
Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: Controls recovery speed; shorter timeouts attempt recovery sooner but risk repeated failures
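The three defaults above drive the standard breaker lifecycle: Closed trips to Open after `failure_threshold` consecutive failures, Open admits probe traffic as Half-Open after `timeout_seconds`, and Half-Open closes after `success_threshold` consecutive successes. A minimal self-contained sketch of that lifecycle (the `Breaker` type and its methods are illustrative, not Tasker's internal API):

```rust
use std::time::{Duration, Instant};

// Teaching sketch of the Closed -> Open -> Half-Open lifecycle; not
// Tasker's internal implementation.
#[derive(Debug, PartialEq)]
enum State { Closed, Open, HalfOpen }

struct Breaker {
    state: State,
    failures: u32,
    successes: u32,
    failure_threshold: u32, // default 5
    success_threshold: u32, // default 2
    timeout: Duration,      // default 30s
    opened_at: Option<Instant>,
}

impl Breaker {
    fn new(failure_threshold: u32, success_threshold: u32, timeout: Duration) -> Self {
        Breaker { state: State::Closed, failures: 0, successes: 0,
                  failure_threshold, success_threshold, timeout, opened_at: None }
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        // Any failure in Half-Open reopens immediately; in Closed the
        // breaker trips once the threshold is reached.
        if self.failures >= self.failure_threshold || self.state == State::HalfOpen {
            self.state = State::Open;
            self.opened_at = Some(Instant::now());
            self.successes = 0;
        }
    }

    fn record_success(&mut self) {
        if self.state == State::HalfOpen {
            self.successes += 1;
            if self.successes >= self.success_threshold {
                self.state = State::Closed;
                self.failures = 0;
            }
        }
    }

    fn allow_request(&mut self) -> bool {
        if self.state == State::Open {
            if self.opened_at.map_or(false, |t| t.elapsed() >= self.timeout) {
                self.state = State::HalfOpen; // timeout elapsed: allow probes
            }
        }
        self.state != State::Open
    }
}

fn main() {
    let mut b = Breaker::new(5, 2, Duration::from_secs(30));
    for _ in 0..5 { b.record_failure(); }
    println!("state after 5 failures: {:?}", b.state); // Open
    println!("request allowed: {}", b.allow_request()); // false (fast-fail)
}
```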
global_settings
Path: common.circuit_breakers.global_settings
| Parameter | Type | Default | Description |
|---|---|---|---|
metrics_collection_interval_seconds | u32 | 30 | Interval in seconds between circuit breaker metrics collection sweeps |
min_state_transition_interval_seconds | f64 | 5.0 | Minimum time in seconds between circuit breaker state transitions |
common.circuit_breakers.global_settings.metrics_collection_interval_seconds
Interval in seconds between circuit breaker metrics collection sweeps
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently circuit breaker state, failure counts, and transition counts are collected for observability
common.circuit_breakers.global_settings.min_state_transition_interval_seconds
Minimum time in seconds between circuit breaker state transitions
- Type: f64
- Default: 5.0
- Valid Range: 0.0-60.0
- System Impact: Prevents rapid oscillation between Open and Closed states during intermittent failures
database
Path: common.database
| Parameter | Type | Default | Description |
|---|---|---|---|
url | String | "${DATABASE_URL:-postgresql://localhost/tasker}" | PostgreSQL connection URL for the primary database |
common.database.url
PostgreSQL connection URL for the primary database
- Type: String
- Default: "${DATABASE_URL:-postgresql://localhost/tasker}"
- Valid Range: valid PostgreSQL connection URI
- System Impact: All task, step, and workflow state is stored here; must be reachable at startup
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| development | postgresql://localhost/tasker | Local default, no auth |
| production | ${DATABASE_URL} | Always use env var injection for secrets rotation |
| test | postgresql://tasker:tasker@localhost:5432/tasker_rust_test | Isolated test database with known credentials |
Related: common.database.pool.max_connections, common.pgmq_database.url
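The `${VAR:-default}` interpolation shown in the default value can be written directly in the configuration file. A hypothetical fragment, assuming a TOML layout keyed by the documented path (the actual file format and section names may differ):

```toml
# Hypothetical layout: section and key names follow the documented
# path common.database.url; the real file structure may differ.
[common.database]
url = "${DATABASE_URL:-postgresql://localhost/tasker}"

[common.database.pool]
max_connections = 25
```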
pool
Path: common.database.pool
| Parameter | Type | Default | Description |
|---|---|---|---|
acquire_timeout_seconds | u32 | 10 | Maximum time to wait when acquiring a connection from the pool |
idle_timeout_seconds | u32 | 300 | Time before an idle connection is closed and removed from the pool |
max_connections | u32 | 25 | Maximum number of concurrent database connections in the pool |
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a connection before it is closed and replaced |
min_connections | u32 | 5 | Minimum number of idle connections maintained in the pool |
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which connection acquisition is logged as slow |
common.database.pool.acquire_timeout_seconds
Maximum time to wait when acquiring a connection from the pool
- Type: u32
- Default: 10
- Valid Range: 1-300
- System Impact: Queries fail with a timeout error if no connection is available within this window
common.database.pool.idle_timeout_seconds
Time before an idle connection is closed and removed from the pool
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Controls how quickly the pool shrinks back to min_connections after load drops
common.database.pool.max_connections
Maximum number of concurrent database connections in the pool
- Type: u32
- Default: 25
- Valid Range: 1-1000
- System Impact: Controls database connection concurrency; too few causes query queuing under load, too many risks DB resource exhaustion
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| development | 10-25 | Small pool for local development |
| production | 30-50 | Scale based on worker count and concurrent task volume |
| test | 10-30 | Moderate pool; cluster tests may run 10 services sharing the same DB |
Related: common.database.pool.min_connections, common.database.pool.acquire_timeout_seconds
common.database.pool.max_lifetime_seconds
Maximum total lifetime of a connection before it is closed and replaced
- Type: u32
- Default: 1800
- Valid Range: 60-86400
- System Impact: Prevents connection drift from server-side config changes or memory leaks in long-lived connections
common.database.pool.min_connections
Minimum number of idle connections maintained in the pool
- Type: u32
- Default: 5
- Valid Range: 0-100
- System Impact: Keeps connections warm to avoid cold-start latency on first queries after idle periods
common.database.pool.slow_acquire_threshold_ms
Threshold in milliseconds above which connection acquisition is logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-60000
- System Impact: Observability: slow acquire warnings indicate pool pressure or network issues
execution
Path: common.execution
| Parameter | Type | Default | Description |
|---|---|---|---|
environment | String | "development" | Runtime environment identifier used for configuration context selection and logging |
step_enqueue_batch_size | u32 | 50 | Number of steps to enqueue in a single batch during task initialization |
common.execution.environment
Runtime environment identifier used for configuration context selection and logging
- Type: String
- Default: "development"
- Valid Range: test | development | production
- System Impact: Affects log levels, default tuning, and environment-specific behavior throughout the system
common.execution.step_enqueue_batch_size
Number of steps to enqueue in a single batch during task initialization
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Controls step enqueueing throughput; larger batches reduce round trips but increase per-batch latency
mpsc_channels
Path: common.mpsc_channels
event_publisher
Path: common.mpsc_channels.event_publisher
| Parameter | Type | Default | Description |
|---|---|---|---|
event_queue_buffer_size | usize | 5000 | Bounded channel capacity for the event publisher MPSC channel |
common.mpsc_channels.event_publisher.event_queue_buffer_size
Bounded channel capacity for the event publisher MPSC channel
- Type: usize
- Default: 5000
- Valid Range: 100-100000
- System Impact: Controls backpressure for domain event publishing; smaller buffers apply backpressure sooner
ffi
Path: common.mpsc_channels.ffi
| Parameter | Type | Default | Description |
|---|---|---|---|
ruby_event_buffer_size | usize | 1000 | Bounded channel capacity for Ruby FFI event delivery |
common.mpsc_channels.ffi.ruby_event_buffer_size
Bounded channel capacity for Ruby FFI event delivery
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers events between the Rust runtime and Ruby FFI layer; overflow triggers backpressure on the dispatch side
overflow_policy
Path: common.mpsc_channels.overflow_policy
| Parameter | Type | Default | Description |
|---|---|---|---|
log_warning_threshold | f64 | 0.8 | Channel saturation fraction at which warning logs are emitted |
common.mpsc_channels.overflow_policy.log_warning_threshold
Channel saturation fraction at which warning logs are emitted
- Type: f64
- Default: 0.8
- Valid Range: 0.0-1.0
- System Impact: A value of 0.8 means warnings fire when any channel reaches 80% capacity
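As a concrete illustration, a saturation check against the 0.8 default might look like the following sketch (the function name is illustrative):

```rust
// Sketch of the documented saturation warning: a warning fires when a
// channel's queued messages reach the threshold fraction (default 0.8)
// of its bounded capacity.
fn should_warn(queued: usize, capacity: usize, threshold: f64) -> bool {
    capacity > 0 && (queued as f64 / capacity as f64) >= threshold
}

fn main() {
    // event_publisher buffer: capacity 5000, threshold 0.8
    println!("{}", should_warn(3999, 5000, 0.8)); // false (just under 80%)
    println!("{}", should_warn(4000, 5000, 0.8)); // true (80% reached)
}
```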
metrics
Path: common.mpsc_channels.overflow_policy.metrics
| Parameter | Type | Default | Description |
|---|---|---|---|
saturation_check_interval_seconds | u32 | 30 | Interval in seconds between channel saturation metric samples |
common.mpsc_channels.overflow_policy.metrics.saturation_check_interval_seconds
Interval in seconds between channel saturation metric samples
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Lower intervals give finer-grained capacity visibility but add sampling overhead
pgmq_database
Path: common.pgmq_database
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable PGMQ messaging subsystem |
url | String | "${PGMQ_DATABASE_URL:-}" | PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database |
common.pgmq_database.enabled
Enable PGMQ messaging subsystem
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, PGMQ queue operations are disabled; only useful if using RabbitMQ as the sole messaging backend
common.pgmq_database.url
PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database
- Type: String
- Default: "${PGMQ_DATABASE_URL:-}"
- Valid Range: valid PostgreSQL connection URI or empty string
- System Impact: Separating PGMQ to its own database isolates messaging I/O from task state queries, reducing contention under heavy load
Related: common.database.url, common.pgmq_database.enabled
pool
Path: common.pgmq_database.pool
| Parameter | Type | Default | Description |
|---|---|---|---|
acquire_timeout_seconds | u32 | 5 | Maximum time to wait when acquiring a connection from the PGMQ pool |
idle_timeout_seconds | u32 | 300 | Time before an idle PGMQ connection is closed and removed from the pool |
max_connections | u32 | 15 | Maximum number of concurrent connections in the PGMQ database pool |
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a PGMQ database connection before replacement |
min_connections | u32 | 3 | Minimum idle connections maintained in the PGMQ database pool |
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which PGMQ pool acquisition is logged as slow |
common.pgmq_database.pool.acquire_timeout_seconds
Maximum time to wait when acquiring a connection from the PGMQ pool
- Type: u32
- Default: 5
- Valid Range: 1-300
- System Impact: Queue operations fail with timeout if no PGMQ connection is available within this window
common.pgmq_database.pool.idle_timeout_seconds
Time before an idle PGMQ connection is closed and removed from the pool
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Controls how quickly the PGMQ pool shrinks after messaging load drops
common.pgmq_database.pool.max_connections
Maximum number of concurrent connections in the PGMQ database pool
- Type: u32
- Default: 15
- Valid Range: 1-500
- System Impact: Separate from the main database pool; size according to messaging throughput requirements
common.pgmq_database.pool.max_lifetime_seconds
Maximum total lifetime of a PGMQ database connection before replacement
- Type: u32
- Default: 1800
- Valid Range: 60-86400
- System Impact: Prevents connection drift in long-running PGMQ connections
common.pgmq_database.pool.min_connections
Minimum idle connections maintained in the PGMQ database pool
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Keeps PGMQ connections warm to avoid cold-start latency on queue operations
common.pgmq_database.pool.slow_acquire_threshold_ms
Threshold in milliseconds above which PGMQ pool acquisition is logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-60000
- System Impact: Observability: slow PGMQ acquire warnings indicate messaging pool pressure
queues
Path: common.queues
| Parameter | Type | Default | Description |
|---|---|---|---|
backend | String | "${TASKER_MESSAGING_BACKEND:-pgmq}" | Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker) |
default_visibility_timeout_seconds | u32 | 30 | Default time a dequeued message remains invisible to other consumers |
naming_pattern | String | "{namespace}_{name}_queue" | Template pattern for constructing queue names from namespace and name |
orchestration_namespace | String | "orchestration" | Namespace prefix for orchestration queue names |
worker_namespace | String | "worker" | Namespace prefix for worker queue names |
common.queues.backend
Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)
- Type: String
- Default: "${TASKER_MESSAGING_BACKEND:-pgmq}"
- Valid Range: pgmq | rabbitmq
- System Impact: Determines the entire message transport layer; pgmq requires only PostgreSQL, rabbitmq requires a separate AMQP broker
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | pgmq or rabbitmq | pgmq for simplicity, rabbitmq for high-throughput push semantics |
| test | pgmq | Single-dependency setup, simpler CI |
Related: common.queues.pgmq, common.queues.rabbitmq
common.queues.default_visibility_timeout_seconds
Default time a dequeued message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: If a consumer fails to process a message within this window, the message becomes visible again for retry
common.queues.naming_pattern
Template pattern for constructing queue names from namespace and name
- Type: String
- Default: "{namespace}_{name}_queue"
- Valid Range: string containing {namespace} and {name} placeholders
- System Impact: Determines the actual PGMQ/RabbitMQ queue names; changing this after deployment requires manual queue migration
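The pattern is plain placeholder substitution. A sketch of the expansion (the function and the `payments` task name are hypothetical, used only for illustration):

```rust
// Expands the documented naming pattern by substituting the
// {namespace} and {name} placeholders.
fn queue_name(pattern: &str, namespace: &str, name: &str) -> String {
    pattern
        .replace("{namespace}", namespace)
        .replace("{name}", name)
}

fn main() {
    let pattern = "{namespace}_{name}_queue";
    // worker_payments_queue
    println!("{}", queue_name(pattern, "worker", "payments"));
    // orchestration_step_results_queue
    println!("{}", queue_name(pattern, "orchestration", "step_results"));
}
```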
common.queues.orchestration_namespace
Namespace prefix for orchestration queue names
- Type: String
- Default: "orchestration"
- Valid Range: non-empty string
- System Impact: Used in queue naming pattern to isolate orchestration queues from worker queues
common.queues.worker_namespace
Namespace prefix for worker queue names
- Type: String
- Default: "worker"
- Valid Range: non-empty string
- System Impact: Used in queue naming pattern to isolate worker queues from orchestration queues
orchestration_queues
Path: common.queues.orchestration_queues
| Parameter | Type | Default | Description |
|---|---|---|---|
step_results | String | "orchestration_step_results" | Queue name for step execution results returned by workers |
task_finalizations | String | "orchestration_task_finalizations" | Queue name for task finalization messages |
task_requests | String | "orchestration_task_requests" | Queue name for incoming task execution requests |
common.queues.orchestration_queues.step_results
Queue name for step execution results returned by workers
- Type: String
- Default: "orchestration_step_results"
- Valid Range: valid queue name
- System Impact: Workers publish step completion results here for the orchestration result processor
common.queues.orchestration_queues.task_finalizations
Queue name for task finalization messages
- Type: String
- Default: "orchestration_task_finalizations"
- Valid Range: valid queue name
- System Impact: Tasks ready for completion evaluation are enqueued here
common.queues.orchestration_queues.task_requests
Queue name for incoming task execution requests
- Type: String
- Default: "orchestration_task_requests"
- Valid Range: valid queue name
- System Impact: The orchestration system reads new task requests from this queue
pgmq
Path: common.queues.pgmq
| Parameter | Type | Default | Description |
|---|---|---|---|
poll_interval_ms | u32 | 500 | Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive |
common.queues.pgmq.poll_interval_ms
Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive
- Type: u32
- Default: 500
- Valid Range: 10-10000
- System Impact: Lower values reduce message latency in polling mode but increase database load; in Hybrid mode this is the fallback interval
queue_depth_thresholds
Path: common.queues.pgmq.queue_depth_thresholds
| Parameter | Type | Default | Description |
|---|---|---|---|
critical_threshold | i64 | 5000 | Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions |
overflow_threshold | i64 | 10000 | Queue depth indicating an emergency condition requiring manual intervention |
common.queues.pgmq.queue_depth_thresholds.critical_threshold
Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions
- Type: i64
- Default: 5000
- Valid Range: 1+
- System Impact: Backpressure mechanism: rejects new work to allow the system to drain existing messages
common.queues.pgmq.queue_depth_thresholds.overflow_threshold
Queue depth indicating an emergency condition requiring manual intervention
- Type: i64
- Default: 10000
- Valid Range: 1+
- System Impact: Highest severity threshold; triggers error-level logging and metrics for operational alerting
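The two thresholds define escalating backpressure tiers: at or above `critical_threshold` the API rejects new submissions with 503s, and at or above `overflow_threshold` the condition is treated as an emergency. A sketch of the classification (the type and function names are illustrative, not Tasker's API):

```rust
// Illustrative classification of queue depth against the documented
// defaults (critical 5000, overflow 10000).
#[derive(Debug, PartialEq)]
enum QueuePressure { Normal, Critical, Overflow }

fn classify(depth: i64, critical: i64, overflow: i64) -> QueuePressure {
    if depth >= overflow {
        QueuePressure::Overflow // emergency: error-level logs and alerts
    } else if depth >= critical {
        QueuePressure::Critical // API returns 503 for new submissions
    } else {
        QueuePressure::Normal
    }
}

fn main() {
    println!("{:?}", classify(100, 5000, 10000));   // Normal
    println!("{:?}", classify(6000, 5000, 10000));  // Critical
    println!("{:?}", classify(12000, 5000, 10000)); // Overflow
}
```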
rabbitmq
Path: common.queues.rabbitmq
| Parameter | Type | Default | Description |
|---|---|---|---|
heartbeat_seconds | u16 | 30 | AMQP heartbeat interval for connection liveness detection |
prefetch_count | u16 | 100 | Number of unacknowledged messages RabbitMQ will deliver before waiting for acks |
url | String | "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}" | AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’ |
common.queues.rabbitmq.heartbeat_seconds
AMQP heartbeat interval for connection liveness detection
- Type: u16
- Default: 30
- Valid Range: 0-3600
- System Impact: Detects dead connections; 0 disables heartbeats (not recommended in production)
common.queues.rabbitmq.prefetch_count
Number of unacknowledged messages RabbitMQ will deliver before waiting for acks
- Type: u16
- Default: 100
- Valid Range: 1-65535
- System Impact: Controls consumer throughput vs. memory usage; higher values increase throughput but buffer more messages in-process
common.queues.rabbitmq.url
AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’
- Type: String
- Default: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
- Valid Range: valid AMQP URI
- System Impact: Only used when queues.backend = ‘rabbitmq’; must be reachable at startup
system
Path: common.system
| Parameter | Type | Default | Description |
|---|---|---|---|
default_dependent_system | String | "default" | Default system name assigned to tasks that do not specify a dependent system |
common.system.default_dependent_system
Default system name assigned to tasks that do not specify a dependent system
- Type: String
- Default: "default"
- Valid Range: non-empty string
- System Impact: Groups tasks for routing and reporting; most single-system deployments can leave this as default
task_templates
Path: common.task_templates
| Parameter | Type | Default | Description |
|---|---|---|---|
search_paths | Vec<String> | ["config/tasks/**/*.{yml,yaml}"] | Glob patterns for discovering task template YAML files |
common.task_templates.search_paths
Glob patterns for discovering task template YAML files
- Type: Vec<String>
- Default: ["config/tasks/**/*.{yml,yaml}"]
- Valid Range: valid glob patterns
- System Impact: Templates matching these patterns are loaded at startup for task definition discovery
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: common
65/65 parameters documented
backoff
Path: common.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
backoff_multiplier | f64 | 2.0 | Multiplier applied to the previous delay for exponential backoff calculations |
default_backoff_seconds | Vec<u32> | [1, 5, 15, 30, 60] | Sequence of backoff delays in seconds for successive retry attempts |
jitter_enabled | bool | true | Add random jitter to backoff delays to prevent thundering herd on retry |
jitter_max_percentage | f64 | 0.15 | Maximum jitter as a fraction of the computed backoff delay |
max_backoff_seconds | u32 | 3600 | Hard upper limit on any single backoff delay |
common.backoff.backoff_multiplier
Multiplier applied to the previous delay for exponential backoff calculations
- Type: f64
- Default: 2.0
- Valid Range: 1.0-10.0
- System Impact: Controls how aggressively delays grow; 2.0 means each delay is double the previous
common.backoff.default_backoff_seconds
Sequence of backoff delays in seconds for successive retry attempts
- Type: Vec<u32>
- Default: [1, 5, 15, 30, 60]
- Valid Range: non-empty array of positive integers
- System Impact: Defines the retry cadence; after exhausting the array, the last value is reused up to max_backoff_seconds
common.backoff.jitter_enabled
Add random jitter to backoff delays to prevent thundering herd on retry
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, backoff delays are randomized within jitter_max_percentage to spread retries across time
common.backoff.jitter_max_percentage
Maximum jitter as a fraction of the computed backoff delay
- Type: f64
- Default: 0.15
- Valid Range: 0.0-1.0
- System Impact: A value of 0.15 means delays vary by up to +/-15% of the base delay
common.backoff.max_backoff_seconds
Hard upper limit on any single backoff delay
- Type: u32
- Default: 3600
- Valid Range: 1-3600
- System Impact: Caps exponential backoff growth to prevent excessively long delays between retries
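A sketch of how the backoff parameters above could combine (illustrative only, not the engine's actual code; jitter here takes a caller-supplied draw in [-1.0, 1.0] instead of a real RNG so the example stays deterministic):

```rust
// Sketch combining common.backoff.default_backoff_seconds,
// max_backoff_seconds, and jitter_max_percentage.
fn base_delay(attempt: usize, delays: &[u32], max_backoff: u32) -> u32 {
    // Walk the configured sequence; after it is exhausted, the last
    // value is reused, capped by max_backoff_seconds.
    let last = *delays.last().unwrap_or(&1);
    let base = *delays.get(attempt).unwrap_or(&last);
    base.min(max_backoff)
}

fn with_jitter(delay: u32, draw: f64, max_percentage: f64) -> f64 {
    // `draw` in [-1.0, 1.0] stands in for a random sample; the spread
    // is bounded by jitter_max_percentage (default 0.15, i.e. +/-15%).
    delay as f64 + delay as f64 * max_percentage * draw
}

fn main() {
    let delays = [1, 5, 15, 30, 60];
    println!("attempt 2 -> {}s", base_delay(2, &delays, 3600)); // 15
    println!("attempt 9 -> {}s", base_delay(9, &delays, 3600)); // 60 (last value reused)
    println!("jitter hi -> {:.0}s", with_jitter(60, 1.0, 0.15)); // 69 (upper bound)
}
```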
cache
Path: common.cache
| Parameter | Type | Default | Description |
|---|---|---|---|
analytics_ttl_seconds | u32 | 60 | Time-to-live in seconds for cached analytics and metrics data |
backend | String | "redis" | Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process) |
default_ttl_seconds | u32 | 3600 | Default time-to-live in seconds for cached entries |
enabled | bool | false | Enable the distributed cache layer for template and analytics data |
template_ttl_seconds | u32 | 3600 | Time-to-live in seconds for cached task template definitions |
common.cache.analytics_ttl_seconds
Time-to-live in seconds for cached analytics and metrics data
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Analytics data is write-heavy and changes frequently; short TTL (60s) keeps metrics current
common.cache.backend
Cache backend implementation: ‘redis’ (distributed) or ‘moka’ (in-process)
- Type: String
- Default: "redis"
- Valid Range: redis | moka
- System Impact: Redis is required for multi-instance deployments to avoid stale data; moka is suitable for single-instance or DoS protection
common.cache.default_ttl_seconds
Default time-to-live in seconds for cached entries
- Type: u32
- Default: 3600
- Valid Range: 1-86400
- System Impact: Controls how long cached data remains valid before being re-fetched from the database
common.cache.enabled
Enable the distributed cache layer for template and analytics data
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all cache reads fall through to direct database queries; no cache dependency required
common.cache.template_ttl_seconds
Time-to-live in seconds for cached task template definitions
- Type: u32
- Default: 3600
- Valid Range: 1-86400
- System Impact: Template changes take up to this long to propagate; shorter values increase DB load, longer values improve performance
moka
Path: common.cache.moka
| Parameter | Type | Default | Description |
|---|---|---|---|
max_capacity | u64 | 10000 | Maximum number of entries the in-process Moka cache can hold |
common.cache.moka.max_capacity
Maximum number of entries the in-process Moka cache can hold
- Type: u64
- Default: 10000
- Valid Range: 1-1000000
- System Impact: Bounds memory usage; least-recently-used entries are evicted when capacity is reached
redis
Path: common.cache.redis
| Parameter | Type | Default | Description |
|---|---|---|---|
connection_timeout_seconds | u32 | 5 | Maximum time to wait when establishing a new Redis connection |
database | u32 | 0 | Redis database number (0-15) |
max_connections | u32 | 10 | Maximum number of connections in the Redis connection pool |
url | String | "${REDIS_URL:-redis://localhost:6379}" | Redis connection URL |
common.cache.redis.connection_timeout_seconds
Maximum time to wait when establishing a new Redis connection
- Type:
u32 - Default:
5 - Valid Range: 1-60
- System Impact: Connections that cannot be established within this timeout fail; cache falls back to database
common.cache.redis.database
Redis database number (0-15)
- Type:
u32 - Default:
0 - Valid Range: 0-15
- System Impact: Isolates Tasker cache keys from other applications sharing the same Redis instance
common.cache.redis.max_connections
Maximum number of connections in the Redis connection pool
- Type:
u32 - Default:
10 - Valid Range: 1-500
- System Impact: Bounds concurrent Redis operations; increase for high cache throughput workloads
common.cache.redis.url
Redis connection URL
- Type:
String - Default:
"${REDIS_URL:-redis://localhost:6379}" - Valid Range: valid Redis URI
- System Impact: Must be reachable when cache is enabled with redis backend
circuit_breakers
Path: common.circuit_breakers
component_configs
Path: common.circuit_breakers.component_configs
cache
Path: common.circuit_breakers.component_configs.cache
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the cache circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the cache breaker |
common.circuit_breakers.component_configs.cache.failure_threshold
Failures before the cache circuit breaker trips to Open
- Type:
u32 - Default:
5 - Valid Range: 1-100
- System Impact: Protects Redis/Dragonfly operations; when tripped, cache reads fall through to database
common.circuit_breakers.component_configs.cache.success_threshold
Successes in Half-Open required to close the cache breaker
- Type:
u32 - Default:
2 - Valid Range: 1-100
- System Impact: Low threshold (2) for fast recovery since cache failures gracefully degrade to database
messaging
Path: common.circuit_breakers.component_configs.messaging
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the messaging circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the messaging breaker |
common.circuit_breakers.component_configs.messaging.failure_threshold
Failures before the messaging circuit breaker trips to Open
- Type:
u32 - Default:
5 - Valid Range: 1-100
- System Impact: Protects the messaging layer (PGMQ or RabbitMQ); when tripped, queue send/receive operations are short-circuited
common.circuit_breakers.component_configs.messaging.success_threshold
Successes in Half-Open required to close the messaging breaker
- Type:
u32 - Default:
2 - Valid Range: 1-100
- System Impact: Lower threshold (2) allows faster recovery since messaging failures are typically transient
task_readiness
Path: common.circuit_breakers.component_configs.task_readiness
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 10 | Failures before the task readiness circuit breaker trips to Open |
success_threshold | u32 | 3 | Successes in Half-Open required to close the task readiness breaker |
common.circuit_breakers.component_configs.task_readiness.failure_threshold
Failures before the task readiness circuit breaker trips to Open
- Type:
u32 - Default:
10 - Valid Range: 1-100
- System Impact: Higher than default (10 vs 5) because task readiness queries are frequent and transient failures are expected
common.circuit_breakers.component_configs.task_readiness.success_threshold
Successes in Half-Open required to close the task readiness breaker
- Type:
u32 - Default:
3 - Valid Range: 1-100
- System Impact: Slightly higher than default (3) for extra confidence before resuming readiness queries
web
Path: common.circuit_breakers.component_configs.web
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Failures before the web/API database circuit breaker trips to Open |
success_threshold | u32 | 2 | Successes in Half-Open required to close the web database breaker |
common.circuit_breakers.component_configs.web.failure_threshold
Failures before the web/API database circuit breaker trips to Open
- Type:
u32 - Default:
5 - Valid Range: 1-100
- System Impact: Protects API database operations; when tripped, API requests receive fast 503 errors instead of waiting for timeouts
common.circuit_breakers.component_configs.web.success_threshold
Successes in Half-Open required to close the web database breaker
- Type:
u32 - Default:
2 - Valid Range: 1-100
- System Impact: Standard threshold (2) provides confidence in recovery before restoring full API traffic
default_config
Path: common.circuit_breakers.default_config
| Parameter | Type | Default | Description |
|---|---|---|---|
failure_threshold | u32 | 5 | Number of consecutive failures before a circuit breaker trips to the Open state |
success_threshold | u32 | 2 | Number of consecutive successes in Half-Open state required to close the circuit breaker |
timeout_seconds | u32 | 30 | Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests |
common.circuit_breakers.default_config.failure_threshold
Number of consecutive failures before a circuit breaker trips to the Open state
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Lower values make the breaker more sensitive; higher values tolerate more transient failures before tripping
common.circuit_breakers.default_config.success_threshold
Number of consecutive successes in Half-Open state required to close the circuit breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Higher values require more proof of recovery before restoring full traffic
common.circuit_breakers.default_config.timeout_seconds
Duration in seconds a circuit breaker stays Open before transitioning to Half-Open for probe requests
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: Controls recovery speed; shorter timeouts attempt recovery sooner but risk repeated failures
global_settings
Path: common.circuit_breakers.global_settings
| Parameter | Type | Default | Description |
|---|---|---|---|
metrics_collection_interval_seconds | u32 | 30 | Interval in seconds between circuit breaker metrics collection sweeps |
min_state_transition_interval_seconds | f64 | 5.0 | Minimum time in seconds between circuit breaker state transitions |
common.circuit_breakers.global_settings.metrics_collection_interval_seconds
Interval in seconds between circuit breaker metrics collection sweeps
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently circuit breaker state, failure counts, and transition counts are collected for observability
common.circuit_breakers.global_settings.min_state_transition_interval_seconds
Minimum time in seconds between circuit breaker state transitions
- Type: f64
- Default: 5.0
- Valid Range: 0.0-60.0
- System Impact: Prevents rapid oscillation between Open and Closed states during intermittent failures
database
Path: common.database
| Parameter | Type | Default | Description |
|---|---|---|---|
url | String | "${DATABASE_URL:-postgresql://localhost/tasker}" | PostgreSQL connection URL for the primary database |
common.database.url
PostgreSQL connection URL for the primary database
- Type: String
- Default: "${DATABASE_URL:-postgresql://localhost/tasker}"
- Valid Range: valid PostgreSQL connection URI
- System Impact: All task, step, and workflow state is stored here; must be reachable at startup
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| development | postgresql://localhost/tasker | Local default, no auth |
| production | ${DATABASE_URL} | Always use env var injection for secrets rotation |
| test | postgresql://tasker:tasker@localhost:5432/tasker_rust_test | Isolated test database with known credentials |
Related: common.database.pool.max_connections, common.pgmq_database.url
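Putting the recommendations above together, a production database section might look like the following sketch. The nesting is assumed to mirror the dotted paths in this reference — verify against your generated configuration file:

```yaml
common:
  database:
    url: "${DATABASE_URL:-postgresql://localhost/tasker}"  # env var injection for secrets rotation
    pool:
      max_connections: 40          # production guidance: 30-50, scaled to worker count
      min_connections: 5           # kept warm to avoid cold-start latency
      acquire_timeout_seconds: 10
```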
pool
Path: common.database.pool
| Parameter | Type | Default | Description |
|---|---|---|---|
acquire_timeout_seconds | u32 | 10 | Maximum time to wait when acquiring a connection from the pool |
idle_timeout_seconds | u32 | 300 | Time before an idle connection is closed and removed from the pool |
max_connections | u32 | 25 | Maximum number of concurrent database connections in the pool |
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a connection before it is closed and replaced |
min_connections | u32 | 5 | Minimum number of idle connections maintained in the pool |
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which connection acquisition is logged as slow |
common.database.pool.acquire_timeout_seconds
Maximum time to wait when acquiring a connection from the pool
- Type: u32
- Default: 10
- Valid Range: 1-300
- System Impact: Queries fail with a timeout error if no connection is available within this window
common.database.pool.idle_timeout_seconds
Time before an idle connection is closed and removed from the pool
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Controls how quickly the pool shrinks back to min_connections after load drops
common.database.pool.max_connections
Maximum number of concurrent database connections in the pool
- Type: u32
- Default: 25
- Valid Range: 1-1000
- System Impact: Controls database connection concurrency; too few causes query queuing under load, too many risks DB resource exhaustion
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| development | 10-25 | Small pool for local development |
| production | 30-50 | Scale based on worker count and concurrent task volume |
| test | 10-30 | Moderate pool; cluster tests may run 10 services sharing the same DB |
Related: common.database.pool.min_connections, common.database.pool.acquire_timeout_seconds
common.database.pool.max_lifetime_seconds
Maximum total lifetime of a connection before it is closed and replaced
- Type: u32
- Default: 1800
- Valid Range: 60-86400
- System Impact: Prevents connection drift from server-side config changes or memory leaks in long-lived connections
common.database.pool.min_connections
Minimum number of idle connections maintained in the pool
- Type: u32
- Default: 5
- Valid Range: 0-100
- System Impact: Keeps connections warm to avoid cold-start latency on first queries after idle periods
common.database.pool.slow_acquire_threshold_ms
Threshold in milliseconds above which connection acquisition is logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-60000
- System Impact: Observability: slow acquire warnings indicate pool pressure or network issues
execution
Path: common.execution
| Parameter | Type | Default | Description |
|---|---|---|---|
environment | String | "development" | Runtime environment identifier used for configuration context selection and logging |
step_enqueue_batch_size | u32 | 50 | Number of steps to enqueue in a single batch during task initialization |
common.execution.environment
Runtime environment identifier used for configuration context selection and logging
- Type: String
- Default: "development"
- Valid Range: test | development | production
- System Impact: Affects log levels, default tuning, and environment-specific behavior throughout the system
common.execution.step_enqueue_batch_size
Number of steps to enqueue in a single batch during task initialization
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Controls step enqueueing throughput; larger batches reduce round trips but increase per-batch latency
mpsc_channels
Path: common.mpsc_channels
event_publisher
Path: common.mpsc_channels.event_publisher
| Parameter | Type | Default | Description |
|---|---|---|---|
event_queue_buffer_size | usize | 5000 | Bounded channel capacity for the event publisher MPSC channel |
common.mpsc_channels.event_publisher.event_queue_buffer_size
Bounded channel capacity for the event publisher MPSC channel
- Type: usize
- Default: 5000
- Valid Range: 100-100000
- System Impact: Controls backpressure for domain event publishing; smaller buffers apply backpressure sooner
ffi
Path: common.mpsc_channels.ffi
| Parameter | Type | Default | Description |
|---|---|---|---|
ruby_event_buffer_size | usize | 1000 | Bounded channel capacity for Ruby FFI event delivery |
common.mpsc_channels.ffi.ruby_event_buffer_size
Bounded channel capacity for Ruby FFI event delivery
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers events between the Rust runtime and Ruby FFI layer; overflow triggers backpressure on the dispatch side
overflow_policy
Path: common.mpsc_channels.overflow_policy
| Parameter | Type | Default | Description |
|---|---|---|---|
log_warning_threshold | f64 | 0.8 | Channel saturation fraction at which warning logs are emitted |
common.mpsc_channels.overflow_policy.log_warning_threshold
Channel saturation fraction at which warning logs are emitted
- Type: f64
- Default: 0.8
- Valid Range: 0.0-1.0
- System Impact: A value of 0.8 means warnings fire when any channel reaches 80% capacity
metrics
Path: common.mpsc_channels.overflow_policy.metrics
| Parameter | Type | Default | Description |
|---|---|---|---|
saturation_check_interval_seconds | u32 | 30 | Interval in seconds between channel saturation metric samples |
common.mpsc_channels.overflow_policy.metrics.saturation_check_interval_seconds
Interval in seconds between channel saturation metric samples
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Lower intervals give finer-grained capacity visibility but add sampling overhead
pgmq_database
Path: common.pgmq_database
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable PGMQ messaging subsystem |
url | String | "${PGMQ_DATABASE_URL:-}" | PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database |
common.pgmq_database.enabled
Enable PGMQ messaging subsystem
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, PGMQ queue operations are disabled; only useful if using RabbitMQ as the sole messaging backend
common.pgmq_database.url
PostgreSQL connection URL for a dedicated PGMQ database; when empty, PGMQ shares the primary database
- Type: String
- Default: "${PGMQ_DATABASE_URL:-}"
- Valid Range: valid PostgreSQL connection URI or empty string
- System Impact: Separating PGMQ to its own database isolates messaging I/O from task state queries, reducing contention under heavy load
Related: common.database.url, common.pgmq_database.enabled
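To isolate messaging I/O from task-state queries under heavy load, a dedicated PGMQ database can be configured as sketched below. The key layout is assumed from the dotted paths; the empty default means PGMQ shares the primary database:

```yaml
common:
  pgmq_database:
    enabled: true
    url: "${PGMQ_DATABASE_URL:-}"   # empty -> share common.database.url
    pool:
      max_connections: 15           # sized for messaging throughput, separate from the main pool
```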
pool
Path: common.pgmq_database.pool
| Parameter | Type | Default | Description |
|---|---|---|---|
acquire_timeout_seconds | u32 | 5 | Maximum time to wait when acquiring a connection from the PGMQ pool |
idle_timeout_seconds | u32 | 300 | Time before an idle PGMQ connection is closed and removed from the pool |
max_connections | u32 | 15 | Maximum number of concurrent connections in the PGMQ database pool |
max_lifetime_seconds | u32 | 1800 | Maximum total lifetime of a PGMQ database connection before replacement |
min_connections | u32 | 3 | Minimum idle connections maintained in the PGMQ database pool |
slow_acquire_threshold_ms | u32 | 100 | Threshold in milliseconds above which PGMQ pool acquisition is logged as slow |
common.pgmq_database.pool.acquire_timeout_seconds
Maximum time to wait when acquiring a connection from the PGMQ pool
- Type: u32
- Default: 5
- Valid Range: 1-300
- System Impact: Queue operations fail with timeout if no PGMQ connection is available within this window
common.pgmq_database.pool.idle_timeout_seconds
Time before an idle PGMQ connection is closed and removed from the pool
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Controls how quickly the PGMQ pool shrinks after messaging load drops
common.pgmq_database.pool.max_connections
Maximum number of concurrent connections in the PGMQ database pool
- Type: u32
- Default: 15
- Valid Range: 1-500
- System Impact: Separate from the main database pool; size according to messaging throughput requirements
common.pgmq_database.pool.max_lifetime_seconds
Maximum total lifetime of a PGMQ database connection before replacement
- Type: u32
- Default: 1800
- Valid Range: 60-86400
- System Impact: Prevents connection drift in long-running PGMQ connections
common.pgmq_database.pool.min_connections
Minimum idle connections maintained in the PGMQ database pool
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Keeps PGMQ connections warm to avoid cold-start latency on queue operations
common.pgmq_database.pool.slow_acquire_threshold_ms
Threshold in milliseconds above which PGMQ pool acquisition is logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-60000
- System Impact: Observability: slow PGMQ acquire warnings indicate messaging pool pressure
queues
Path: common.queues
| Parameter | Type | Default | Description |
|---|---|---|---|
backend | String | "${TASKER_MESSAGING_BACKEND:-pgmq}" | Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker) |
default_visibility_timeout_seconds | u32 | 30 | Default time a dequeued message remains invisible to other consumers |
naming_pattern | String | "{namespace}_{name}_queue" | Template pattern for constructing queue names from namespace and name |
orchestration_namespace | String | "orchestration" | Namespace prefix for orchestration queue names |
worker_namespace | String | "worker" | Namespace prefix for worker queue names |
common.queues.backend
Messaging backend: ‘pgmq’ (PostgreSQL-based, LISTEN/NOTIFY) or ‘rabbitmq’ (AMQP broker)
- Type: String
- Default: "${TASKER_MESSAGING_BACKEND:-pgmq}"
- Valid Range: pgmq | rabbitmq
- System Impact: Determines the entire message transport layer; pgmq requires only PostgreSQL, rabbitmq requires a separate AMQP broker
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | pgmq or rabbitmq | pgmq for simplicity, rabbitmq for high-throughput push semantics |
| test | pgmq | Single-dependency setup, simpler CI |
Related: common.queues.pgmq, common.queues.rabbitmq
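A sketch of selecting the backend — the pgmq default needs only PostgreSQL, while rabbitmq also requires the AMQP connection URL documented below (nesting assumed from the dotted paths in this reference):

```yaml
common:
  queues:
    backend: "${TASKER_MESSAGING_BACKEND:-pgmq}"  # or "rabbitmq" for an AMQP broker
    rabbitmq:
      url: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"  # only read when backend = rabbitmq
```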
common.queues.default_visibility_timeout_seconds
Default time a dequeued message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: If a consumer fails to process a message within this window, the message becomes visible again for retry
common.queues.naming_pattern
Template pattern for constructing queue names from namespace and name
- Type: String
- Default: "{namespace}_{name}_queue"
- Valid Range: string containing {namespace} and {name} placeholders
- System Impact: Determines the actual PGMQ/RabbitMQ queue names; changing this after deployment requires manual queue migration
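The expansion itself is plain placeholder substitution. A minimal illustration of the pattern semantics — this is not Tasker's internal code, just how the template resolves:

```python
def queue_name(pattern: str, namespace: str, name: str) -> str:
    """Expand a Tasker-style queue naming pattern into a concrete queue name."""
    return pattern.replace("{namespace}", namespace).replace("{name}", name)

# With the defaults, a worker queue for a hypothetical "payments" handler becomes:
print(queue_name("{namespace}_{name}_queue", "worker", "payments"))
# -> worker_payments_queue
```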
common.queues.orchestration_namespace
Namespace prefix for orchestration queue names
- Type: String
- Default: "orchestration"
- Valid Range: non-empty string
- System Impact: Used in queue naming pattern to isolate orchestration queues from worker queues
common.queues.worker_namespace
Namespace prefix for worker queue names
- Type: String
- Default: "worker"
- Valid Range: non-empty string
- System Impact: Used in queue naming pattern to isolate worker queues from orchestration queues
orchestration_queues
Path: common.queues.orchestration_queues
| Parameter | Type | Default | Description |
|---|---|---|---|
step_results | String | "orchestration_step_results" | Queue name for step execution results returned by workers |
task_finalizations | String | "orchestration_task_finalizations" | Queue name for task finalization messages |
task_requests | String | "orchestration_task_requests" | Queue name for incoming task execution requests |
common.queues.orchestration_queues.step_results
Queue name for step execution results returned by workers
- Type: String
- Default: "orchestration_step_results"
- Valid Range: valid queue name
- System Impact: Workers publish step completion results here for the orchestration result processor
common.queues.orchestration_queues.task_finalizations
Queue name for task finalization messages
- Type: String
- Default: "orchestration_task_finalizations"
- Valid Range: valid queue name
- System Impact: Tasks ready for completion evaluation are enqueued here
common.queues.orchestration_queues.task_requests
Queue name for incoming task execution requests
- Type: String
- Default: "orchestration_task_requests"
- Valid Range: valid queue name
- System Impact: The orchestration system reads new task requests from this queue
pgmq
Path: common.queues.pgmq
| Parameter | Type | Default | Description |
|---|---|---|---|
poll_interval_ms | u32 | 500 | Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive |
common.queues.pgmq.poll_interval_ms
Interval in milliseconds between PGMQ polling cycles when no LISTEN/NOTIFY events arrive
- Type: u32
- Default: 500
- Valid Range: 10-10000
- System Impact: Lower values reduce message latency in polling mode but increase database load; in Hybrid mode this is the fallback interval
queue_depth_thresholds
Path: common.queues.pgmq.queue_depth_thresholds
| Parameter | Type | Default | Description |
|---|---|---|---|
critical_threshold | i64 | 5000 | Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions |
overflow_threshold | i64 | 10000 | Queue depth indicating an emergency condition requiring manual intervention |
common.queues.pgmq.queue_depth_thresholds.critical_threshold
Queue depth at which the API returns HTTP 503 Service Unavailable for new task submissions
- Type: i64
- Default: 5000
- Valid Range: 1+
- System Impact: Backpressure mechanism: rejects new work to allow the system to drain existing messages
common.queues.pgmq.queue_depth_thresholds.overflow_threshold
Queue depth indicating an emergency condition requiring manual intervention
- Type: i64
- Default: 10000
- Valid Range: 1+
- System Impact: Highest severity threshold; triggers error-level logging and metrics for operational alerting
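Together the two thresholds form a tiered backpressure policy. A hypothetical sketch of that decision — the function name and the boundary behavior at exactly the threshold are illustrative assumptions, not Tasker's API:

```python
def admission_decision(queue_depth: int,
                       critical_threshold: int = 5000,
                       overflow_threshold: int = 10000) -> str:
    """Tiered backpressure on queue depth, mirroring the defaults above."""
    if queue_depth >= overflow_threshold:
        return "overflow"      # emergency: error-level logs and alerting
    if queue_depth >= critical_threshold:
        return "reject_503"    # API sheds new task submissions to drain the backlog
    return "accept"

print(admission_decision(4000))   # accept
print(admission_decision(6000))   # reject_503
print(admission_decision(12000))  # overflow
```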
rabbitmq
Path: common.queues.rabbitmq
| Parameter | Type | Default | Description |
|---|---|---|---|
heartbeat_seconds | u16 | 30 | AMQP heartbeat interval for connection liveness detection |
prefetch_count | u16 | 100 | Number of unacknowledged messages RabbitMQ will deliver before waiting for acks |
url | String | "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}" | AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’ |
common.queues.rabbitmq.heartbeat_seconds
AMQP heartbeat interval for connection liveness detection
- Type: u16
- Default: 30
- Valid Range: 0-3600
- System Impact: Detects dead connections; 0 disables heartbeats (not recommended in production)
common.queues.rabbitmq.prefetch_count
Number of unacknowledged messages RabbitMQ will deliver before waiting for acks
- Type: u16
- Default: 100
- Valid Range: 1-65535
- System Impact: Controls consumer throughput vs. memory usage; higher values increase throughput but buffer more messages in-process
common.queues.rabbitmq.url
AMQP connection URL for RabbitMQ; %2F is the URL-encoded default vhost ‘/’
- Type: String
- Default: "${RABBITMQ_URL:-amqp://guest:guest@localhost:5672/%2F}"
- Valid Range: valid AMQP URI
- System Impact: Only used when queues.backend = ‘rabbitmq’; must be reachable at startup
system
Path: common.system
| Parameter | Type | Default | Description |
|---|---|---|---|
default_dependent_system | String | "default" | Default system name assigned to tasks that do not specify a dependent system |
common.system.default_dependent_system
Default system name assigned to tasks that do not specify a dependent system
- Type: String
- Default: "default"
- Valid Range: non-empty string
- System Impact: Groups tasks for routing and reporting; most single-system deployments can leave this as default
task_templates
Path: common.task_templates
| Parameter | Type | Default | Description |
|---|---|---|---|
search_paths | Vec<String> | ["config/tasks/**/*.{yml,yaml}"] | Glob patterns for discovering task template YAML files |
common.task_templates.search_paths
Glob patterns for discovering task template YAML files
- Type: Vec<String>
- Default: ["config/tasks/**/*.{yml,yaml}"]
- Valid Range: valid glob patterns
- System Impact: Templates matching these patterns are loaded at startup for task definition discovery
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: orchestration
91/91 parameters documented
orchestration
Root-level orchestration parameters
Path: orchestration
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_performance_logging | bool | true | Enable detailed performance logging for orchestration actors |
shutdown_timeout_ms | u64 | 30000 | Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown |
orchestration.enable_performance_logging
Enable detailed performance logging for orchestration actors
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Emits timing metrics for task processing, step enqueueing, and result evaluation; disable in production if log volume is a concern
orchestration.shutdown_timeout_ms
Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown
- Type: u64
- Default: 30000
- Valid Range: 1000-300000
- System Impact: If shutdown exceeds this timeout, the process exits forcefully to avoid hanging indefinitely; 30s is conservative for most deployments
batch_processing
Path: orchestration.batch_processing
| Parameter | Type | Default | Description |
|---|---|---|---|
checkpoint_stall_minutes | u32 | 15 | Minutes without a checkpoint update before a batch is considered stalled |
default_batch_size | u32 | 1000 | Default number of items in a single batch when not specified by the handler |
enabled | bool | true | Enable the batch processing subsystem for large-scale step execution |
max_parallel_batches | u32 | 50 | Maximum number of batch operations that can execute concurrently |
orchestration.batch_processing.checkpoint_stall_minutes
Minutes without a checkpoint update before a batch is considered stalled
- Type: u32
- Default: 15
- Valid Range: 1-1440
- System Impact: Stalled batches are flagged for investigation or automatic recovery; lower values detect issues faster
orchestration.batch_processing.default_batch_size
Default number of items in a single batch when not specified by the handler
- Type: u32
- Default: 1000
- Valid Range: 1-100000
- System Impact: Larger batches improve throughput but increase memory usage and per-batch latency
orchestration.batch_processing.enabled
Enable the batch processing subsystem for large-scale step execution
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, batch step handlers cannot be used; all steps must be processed individually
orchestration.batch_processing.max_parallel_batches
Maximum number of batch operations that can execute concurrently
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Bounds resource usage from concurrent batch processing; increase for high-throughput batch workloads
decision_points
Path: orchestration.decision_points
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_detailed_logging | bool | false | Enable verbose logging of decision point evaluation including expression results |
enable_metrics | bool | true | Enable metrics collection for decision point evaluations |
enabled | bool | true | Enable the decision point evaluation subsystem for conditional workflow branching |
max_decision_depth | u32 | 20 | Maximum depth of nested decision point chains |
max_steps_per_decision | u32 | 100 | Maximum number of steps that can be generated by a single decision point evaluation |
warn_threshold_depth | u32 | 10 | Decision depth above which a warning is logged |
warn_threshold_steps | u32 | 50 | Number of steps per decision above which a warning is logged |
orchestration.decision_points.enable_detailed_logging
Enable verbose logging of decision point evaluation including expression results
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: Produces high-volume logs; enable only for debugging specific decision point behavior
orchestration.decision_points.enable_metrics
Enable metrics collection for decision point evaluations
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks evaluation counts, timings, and branch selection distribution
orchestration.decision_points.enabled
Enable the decision point evaluation subsystem for conditional workflow branching
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, all decision points are skipped and conditional steps are not evaluated
orchestration.decision_points.max_decision_depth
Maximum depth of nested decision point chains
- Type: u32
- Default: 20
- Valid Range: 1-100
- System Impact: Prevents infinite recursion from circular decision point references
orchestration.decision_points.max_steps_per_decision
Maximum number of steps that can be generated by a single decision point evaluation
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Safety limit to prevent decision points from creating unbounded step graphs
orchestration.decision_points.warn_threshold_depth
Decision depth above which a warning is logged
- Type: u32
- Default: 10
- Valid Range: 1-100
- System Impact: Observability: identifies deeply nested decision chains that may indicate design issues
orchestration.decision_points.warn_threshold_steps
Number of steps per decision above which a warning is logged
- Type: u32
- Default: 50
- Valid Range: 1-10000
- System Impact: Observability: identifies decision points that generate unusually large step sets
dlq
Path: orchestration.dlq
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable the Dead Letter Queue subsystem for handling unrecoverable tasks |
orchestration.dlq.enabled
Enable the Dead Letter Queue subsystem for handling unrecoverable tasks
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, stale or failed tasks remain in their error state without DLQ routing
staleness_detection
Path: orchestration.dlq.staleness_detection
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 100 | Number of potentially stale tasks to evaluate in a single detection sweep |
detection_interval_seconds | u32 | 300 | Interval in seconds between staleness detection sweeps |
dry_run | bool | false | Run staleness detection in observation-only mode without taking action |
enabled | bool | true | Enable periodic scanning for stale tasks |
orchestration.dlq.staleness_detection.batch_size
Number of potentially stale tasks to evaluate in a single detection sweep
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Larger batches process more stale tasks per sweep but increase per-sweep query cost
orchestration.dlq.staleness_detection.detection_interval_seconds
Interval in seconds between staleness detection sweeps
- Type: u32
- Default: 300
- Valid Range: 30-3600
- System Impact: Lower values detect stale tasks faster but increase database query frequency
orchestration.dlq.staleness_detection.dry_run
Run staleness detection in observation-only mode without taking action
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: Logs what would be DLQ’d without actually transitioning tasks; useful for tuning thresholds
orchestration.dlq.staleness_detection.enabled
Enable periodic scanning for stale tasks
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no automatic staleness detection runs; tasks must be manually DLQ’d
actions
Path: orchestration.dlq.staleness_detection.actions
| Parameter | Type | Default | Description |
|---|---|---|---|
auto_move_to_dlq | bool | true | Automatically move stale tasks to the DLQ after transitioning to error |
auto_transition_to_error | bool | true | Automatically transition stale tasks to the Error state |
emit_events | bool | true | Emit domain events when staleness is detected |
event_channel | String | "task_staleness_detected" | PGMQ channel name for staleness detection events |
orchestration.dlq.staleness_detection.actions.auto_move_to_dlq
Automatically move stale tasks to the DLQ after transitioning to error
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, stale tasks are routed to the DLQ; when false, they remain in Error state for manual review
orchestration.dlq.staleness_detection.actions.auto_transition_to_error
Automatically transition stale tasks to the Error state
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, stale tasks are moved to Error before DLQ routing; when false, tasks stay in their current state
orchestration.dlq.staleness_detection.actions.emit_events
Emit domain events when staleness is detected
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, staleness events are published to the event_channel for external alerting or custom handling
orchestration.dlq.staleness_detection.actions.event_channel
PGMQ channel name for staleness detection events
- Type: String
- Default: "task_staleness_detected"
- Valid Range: 1-255 characters
- System Impact: Consumers can subscribe to this channel for alerting or custom staleness handling
thresholds
Path: orchestration.dlq.staleness_detection.thresholds
| Parameter | Type | Default | Description |
|---|---|---|---|
steps_in_process_minutes | u32 | 30 | Minutes a task can have steps in process before being considered stale |
task_max_lifetime_hours | u32 | 24 | Absolute maximum lifetime for any task regardless of state |
waiting_for_dependencies_minutes | u32 | 60 | Minutes a task can wait for step dependencies before being considered stale |
waiting_for_retry_minutes | u32 | 30 | Minutes a task can wait for step retries before being considered stale |
orchestration.dlq.staleness_detection.thresholds.steps_in_process_minutes
Minutes a task can have steps in process before being considered stale
- Type: u32
- Default: 30
- Valid Range: 1-1440
- System Impact: Tasks in StepsInProcess state exceeding this age may have hung workers; flags for investigation
orchestration.dlq.staleness_detection.thresholds.task_max_lifetime_hours
Absolute maximum lifetime for any task regardless of state
- Type: u32
- Default: 24
- Valid Range: 1-168
- System Impact: Hard cap; tasks exceeding this age are considered stale even if actively processing
orchestration.dlq.staleness_detection.thresholds.waiting_for_dependencies_minutes
Minutes a task can wait for step dependencies before being considered stale
- Type: u32
- Default: 60
- Valid Range: 1-1440
- System Impact: Tasks in WaitingForDependencies state exceeding this age are flagged for DLQ consideration
orchestration.dlq.staleness_detection.thresholds.waiting_for_retry_minutes
Minutes a task can wait for step retries before being considered stale
- Type: u32
- Default: 30
- Valid Range: 1-1440
- System Impact: Tasks in WaitingForRetry state exceeding this age are flagged for DLQ consideration
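Combined, the staleness pipeline reads as: scan on an interval, compare against per-state thresholds, transition to Error, route to the DLQ, and emit an event. A sketch of the full block — the nesting is assumed from the dotted paths, and `dry_run: true` is shown deliberately as a safe starting point for tuning:

```yaml
orchestration:
  dlq:
    enabled: true
    staleness_detection:
      enabled: true
      detection_interval_seconds: 300
      dry_run: true                      # observe first; enforce after thresholds are tuned
      thresholds:
        waiting_for_dependencies_minutes: 60
        waiting_for_retry_minutes: 30
        task_max_lifetime_hours: 24
      actions:
        auto_transition_to_error: true
        auto_move_to_dlq: true
        emit_events: true
```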
event_systems
Path: orchestration.event_systems
orchestration
Path: orchestration.event_systems.orchestration
| Parameter | Type | Default | Description |
|---|---|---|---|
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
system_id | String | "orchestration-event-system" | Unique identifier for the orchestration event system instance |
orchestration.event_systems.orchestration.deployment_mode
Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
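A minimal event-system fragment using the recommended mode (key layout assumed from the dotted paths):

```yaml
orchestration:
  event_systems:
    orchestration:
      deployment_mode: "Hybrid"   # LISTEN/NOTIFY with polling fallback
      system_id: "orchestration-event-system"
```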
orchestration.event_systems.orchestration.system_id
Unique identifier for the orchestration event system instance
- Type: String
- Default: "orchestration-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish this event system from others
health
Path: orchestration.event_systems.orchestration.health
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable health monitoring for the orchestration event system |
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the event system reports as unhealthy |
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the event system reports as unhealthy |
performance_monitoring_enabled | bool | true | Enable detailed performance metrics collection for event processing |
orchestration.event_systems.orchestration.health.enabled
Enable health monitoring for the orchestration event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks or error tracking run for this event system
orchestration.event_systems.orchestration.health.error_rate_threshold_per_minute
Error rate per minute above which the event system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal; complements max_consecutive_errors for burst error detection
orchestration.event_systems.orchestration.health.max_consecutive_errors
Number of consecutive errors before the event system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation after sustained failures; resets on any success
orchestration.event_systems.orchestration.health.performance_monitoring_enabled
Enable detailed performance metrics collection for event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks processing latency percentiles and throughput; adds minor overhead
processing
Path: orchestration.event_systems.orchestration.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read |
| max_concurrent_operations | u32 | 50 | Maximum number of events processed concurrently by the orchestration event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed event processing operation |
orchestration.event_systems.orchestration.processing.batch_size
Number of events dequeued in a single batch read
- Type: u32
- Default: 20
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput but increase per-batch processing time
orchestration.event_systems.orchestration.processing.max_concurrent_operations
Maximum number of events processed concurrently by the orchestration event system
- Type: u32
- Default: 50
- Valid Range: 1-10000
- System Impact: Controls parallelism for task request, result, and finalization processing
orchestration.event_systems.orchestration.processing.max_retries
Maximum retry attempts for a failed event processing operation
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Events exceeding this retry count are dropped or sent to the DLQ
backoff
Path: orchestration.event_systems.orchestration.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first event processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between event processing retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
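These four values describe a standard exponential-backoff-with-jitter schedule. A minimal sketch of how a retry delay could be derived from them (the exact formula is an assumption based on the parameter names, not taken from Tasker's source):

```python
import random

def backoff_delay_ms(attempt: int, initial_delay_ms: int = 100,
                     multiplier: float = 2.0, max_delay_ms: int = 10000,
                     jitter_percent: float = 0.1) -> float:
    """Delay before retry `attempt` (1-based), capped at max_delay_ms,
    with up to jitter_percent of additional random jitter."""
    base = min(initial_delay_ms * multiplier ** (attempt - 1), max_delay_ms)
    jitter = base * jitter_percent * random.random()
    return base + jitter

# With the defaults above, attempts back off 100, 200, 400, ... ms
# until the 10000 ms cap, plus up to 10% jitter on each delay.
```

The jitter term spreads out retries from many events that failed at the same moment, so they do not all hammer the queue again simultaneously.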
timing
Path: orchestration.event_systems.orchestration.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds an event claim remains valid |
| fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the orchestration event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued message remains invisible to other consumers |
orchestration.event_systems.orchestration.timing.claim_timeout_seconds
Maximum time in seconds an event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned claims from blocking event processing indefinitely
orchestration.event_systems.orchestration.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Only active in Hybrid mode when event-driven delivery fails; lower values reduce latency but increase DB load
orchestration.event_systems.orchestration.timing.health_check_interval_seconds
Interval in seconds between health check probes for the orchestration event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the event system verifies its own connectivity and responsiveness
orchestration.event_systems.orchestration.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Events exceeding this timeout are considered failed and may be retried
orchestration.event_systems.orchestration.timing.visibility_timeout_seconds
Time in seconds a dequeued message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: If processing is not completed within this window, the message becomes visible again for redelivery
task_readiness
Path: orchestration.event_systems.task_readiness
| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "task-readiness-event-system" | Unique identifier for the task readiness event system instance |
orchestration.event_systems.task_readiness.deployment_mode
Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; task readiness events trigger step processing and benefit from low-latency delivery
orchestration.event_systems.task_readiness.system_id
Unique identifier for the task readiness event system instance
- Type: String
- Default: "task-readiness-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish task readiness events from other event systems
health
Path: orchestration.event_systems.task_readiness.health
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the task readiness event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the task readiness system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the task readiness system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for task readiness event processing |
orchestration.event_systems.task_readiness.health.enabled
Enable health monitoring for the task readiness event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks run for task readiness processing
orchestration.event_systems.task_readiness.health.error_rate_threshold_per_minute
Error rate per minute above which the task readiness system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal complementing max_consecutive_errors
orchestration.event_systems.task_readiness.health.max_consecutive_errors
Number of consecutive errors before the task readiness system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation; resets on any successful readiness check
orchestration.event_systems.task_readiness.health.performance_monitoring_enabled
Enable detailed performance metrics for task readiness event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks readiness check latency and throughput; useful for tuning batch_size and concurrency
processing
Path: orchestration.event_systems.task_readiness.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 50 | Number of task readiness events dequeued in a single batch |
| max_concurrent_operations | u32 | 100 | Maximum number of task readiness events processed concurrently |
| max_retries | u32 | 3 | Maximum retry attempts for a failed task readiness event |
orchestration.event_systems.task_readiness.processing.batch_size
Number of task readiness events dequeued in a single batch
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput for readiness evaluation; 50 balances latency and throughput
orchestration.event_systems.task_readiness.processing.max_concurrent_operations
Maximum number of task readiness events processed concurrently
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Higher than orchestration (100 vs 50) because readiness checks are lightweight SQL queries
orchestration.event_systems.task_readiness.processing.max_retries
Maximum retry attempts for a failed task readiness event
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Readiness events are idempotent so retries are safe; limits retry storms
backoff
Path: orchestration.event_systems.task_readiness.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first task readiness processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay for readiness retries |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds for task readiness retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive readiness failure |
timing
Path: orchestration.event_systems.task_readiness.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds a task readiness event claim remains valid |
| fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles for task readiness |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the task readiness event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single task readiness event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued task readiness message remains invisible to other consumers |
orchestration.event_systems.task_readiness.timing.claim_timeout_seconds
Maximum time in seconds a task readiness event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned readiness claims from blocking task evaluation
orchestration.event_systems.task_readiness.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles for task readiness
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Fallback interval when LISTEN/NOTIFY is unavailable; lower values improve responsiveness
orchestration.event_systems.task_readiness.timing.health_check_interval_seconds
Interval in seconds between health check probes for the task readiness event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the task readiness system verifies its own connectivity
orchestration.event_systems.task_readiness.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single task readiness event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Readiness events exceeding this timeout are considered failed
orchestration.event_systems.task_readiness.timing.visibility_timeout_seconds
Time in seconds a dequeued task readiness message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Prevents duplicate processing of readiness events during normal operation
grpc
Path: orchestration.grpc
| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}" | Socket address for the gRPC server |
| enable_health_service | bool | true | Enable the gRPC health checking service (grpc.health.v1) |
| enable_reflection | bool | true | Enable gRPC server reflection for service discovery |
| enabled | bool | true | Enable the gRPC API server alongside the REST API |
| keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames |
| keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
| max_concurrent_streams | u32 | 200 | Maximum number of concurrent gRPC streams per connection |
| max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame |
| tls_enabled | bool | false | Enable TLS encryption for gRPC connections |
orchestration.grpc.bind_address
Socket address for the gRPC server
- Type: String
- Default: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
- Valid Range: host:port
- System Impact: Must not conflict with the REST API bind_address; default 9190 avoids Prometheus port conflict
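Defaults like this use shell-style `${NAME:-default}` interpolation. A sketch of how such a value might resolve (the helper is hypothetical; Tasker's actual expansion rules, e.g. its treatment of set-but-empty variables, may differ):

```python
import os
import re

# Matches shell-style ${NAME:-default} placeholders.
_PATTERN = re.compile(r"\$\{(\w+):-([^}]*)\}")

def resolve(value: str) -> str:
    """Expand ${NAME:-default}: use the env var if set and non-empty,
    otherwise fall back to the inline default (shell `:-` semantics)."""
    return _PATTERN.sub(lambda m: os.environ.get(m.group(1)) or m.group(2),
                        value)

resolve("${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}")
# → "0.0.0.0:9190" when the variable is unset
```

This is why the documented default is the literal placeholder string: the effective value is decided at load time from the environment.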
orchestration.grpc.enable_health_service
Enable the gRPC health checking service (grpc.health.v1)
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators
orchestration.grpc.enable_reflection
Enable gRPC server reflection for service discovery
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Allows tools like grpcurl to list and inspect services; safe to enable in development, consider disabling in production
orchestration.grpc.enabled
Enable the gRPC API server alongside the REST API
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no gRPC endpoints are available; clients must use REST
orchestration.grpc.keepalive_interval_seconds
Interval in seconds between gRPC keepalive ping frames
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Detects dead connections; lower values detect failures faster but increase network overhead
orchestration.grpc.keepalive_timeout_seconds
Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
- Type: u32
- Default: 20
- Valid Range: 1-300
- System Impact: Connections that fail to acknowledge within this window are considered dead and closed
orchestration.grpc.max_concurrent_streams
Maximum number of concurrent gRPC streams per connection
- Type: u32
- Default: 200
- Valid Range: 1-10000
- System Impact: Limits multiplexed request parallelism per connection; 200 is conservative for orchestration workloads
orchestration.grpc.max_frame_size
Maximum size in bytes of a single HTTP/2 frame
- Type: u32
- Default: 16384
- Valid Range: 16384-16777215
- System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream
orchestration.grpc.tls_enabled
Enable TLS encryption for gRPC connections
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When true, tls_cert_path and tls_key_path must be provided; required for production gRPC deployments
mpsc_channels
Path: orchestration.mpsc_channels
command_processor
Path: orchestration.mpsc_channels.command_processor
| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 5000 | Bounded channel capacity for the orchestration command processor |
orchestration.mpsc_channels.command_processor.command_buffer_size
Bounded channel capacity for the orchestration command processor
- Type: usize
- Default: 5000
- Valid Range: 100-100000
- System Impact: Buffers incoming orchestration commands; larger values absorb traffic spikes but use more memory
event_listeners
Path: orchestration.mpsc_channels.event_listeners
| Parameter | Type | Default | Description |
|---|---|---|---|
| pgmq_event_buffer_size | usize | 50000 | Bounded channel capacity for PGMQ event listener notifications |
orchestration.mpsc_channels.event_listeners.pgmq_event_buffer_size
Bounded channel capacity for PGMQ event listener notifications
- Type: usize
- Default: 50000
- Valid Range: 1000-1000000
- System Impact: Large buffer (50000) absorbs high-volume PGMQ LISTEN/NOTIFY events without backpressure on PostgreSQL
event_systems
Path: orchestration.mpsc_channels.event_systems
| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 10000 | Bounded channel capacity for the orchestration event system internal channel |
orchestration.mpsc_channels.event_systems.event_channel_buffer_size
Bounded channel capacity for the orchestration event system internal channel
- Type: usize
- Default: 10000
- Valid Range: 100-100000
- System Impact: Buffers events between the event listener and event processor; larger values absorb notification bursts
web
Path: orchestration.web
| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}" | Socket address for the REST API server |
| enabled | bool | true | Enable the REST API server for the orchestration service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for an HTTP request to complete before timeout |
orchestration.web.bind_address
Socket address for the REST API server
- Type: String
- Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
- Valid Range: host:port
- System Impact: Determines where the orchestration REST API listens; use 0.0.0.0 for container deployments
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8080 | Standard port; use TASKER_WEB_BIND_ADDRESS env var to override in CI |
| test | 0.0.0.0:8080 | Default port for test fixtures |
orchestration.web.enabled
Enable the REST API server for the orchestration service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no HTTP endpoints are available; the service operates via messaging only
orchestration.web.request_timeout_ms
Maximum time in milliseconds for an HTTP request to complete before timeout
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections
auth
Path: orchestration.web.auth
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication |
| enabled | bool | false | Enable authentication for the REST API |
| jwt_audience | String | "tasker-api" | Expected ‘aud’ claim in JWT tokens |
| jwt_issuer | String | "tasker-core" | Expected ‘iss’ claim in JWT tokens |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if this service issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours |
orchestration.web.auth.api_key
Static API key for simple key-based authentication
- Type: String
- Default: ""
- Valid Range: non-empty string or empty to disable
- System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header
orchestration.web.auth.api_key_header
HTTP header name for API key authentication
- Type: String
- Default: "X-API-Key"
- Valid Range: valid HTTP header name
- System Impact: Clients send their API key in this header; default is X-API-Key
orchestration.web.auth.enabled
Enable authentication for the REST API
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all API endpoints are unauthenticated; enable in production with JWT or API key auth
orchestration.web.auth.jwt_audience
Expected ‘aud’ claim in JWT tokens
- Type: String
- Default: "tasker-api"
- Valid Range: non-empty string
- System Impact: Tokens with a different audience are rejected during validation
orchestration.web.auth.jwt_issuer
Expected ‘iss’ claim in JWT tokens
- Type: String
- Default: "tasker-core"
- Valid Range: non-empty string
- System Impact: Tokens with a different issuer are rejected during validation
orchestration.web.auth.jwt_private_key
PEM-encoded private key for signing JWT tokens (if this service issues tokens)
- Type: String
- Default: ""
- Valid Range: valid PEM private key or empty
- System Impact: Required only if the orchestration service issues its own JWT tokens; leave empty when using external identity providers
orchestration.web.auth.jwt_public_key
PEM-encoded public key for verifying JWT token signatures
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY:-}"
- Valid Range: valid PEM public key or empty
- System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management in production
orchestration.web.auth.jwt_public_key_path
File path to a PEM-encoded public key for JWT verification
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
- Valid Range: valid file path or empty
- System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file
orchestration.web.auth.jwt_token_expiry_hours
Default JWT token validity period in hours
- Type: u32
- Default: 24
- Valid Range: 1-720
- System Impact: Tokens older than this are rejected; shorter values improve security but require more frequent re-authentication
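Together, jwt_issuer, jwt_audience, and jwt_token_expiry_hours amount to validating the registered `iss`, `aud`, and `exp` claims. A stdlib-only sketch of those three checks — signature verification against jwt_public_key is deliberately omitted here, and a real deployment would use a proper JWT library that verifies the signature first:

```python
import base64
import json
import time

def check_claims(token: str, issuer: str = "tasker-core",
                 audience: str = "tasker-api") -> bool:
    """Check iss/aud/exp on a JWT payload. NOTE: this sketch does not
    verify the signature, so it must not be used for real auth."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return (claims.get("iss") == issuer
            and claims.get("aud") == audience
            and claims.get("exp", 0) > time.time())
```

Tokens failing any of these checks are rejected exactly as the jwt_issuer and jwt_audience descriptions above state.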
database_pools
Path: orchestration.web.database_pools
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 50 | Advisory hint for the total number of database connections across all orchestration pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle web API connection is closed |
| web_api_max_connections | u32 | 30 | Maximum number of connections the web API pool can grow to under load |
| web_api_pool_size | u32 | 20 | Target number of connections in the web API database pool |
orchestration.web.database_pools.max_total_connections_hint
Advisory hint for the total number of database connections across all orchestration pools
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint
orchestration.web.database_pools.web_api_connection_timeout_seconds
Maximum time to wait when acquiring a connection from the web API pool
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: API requests that cannot acquire a connection within this window return an error
orchestration.web.database_pools.web_api_idle_timeout_seconds
Time before an idle web API connection is closed
- Type: u32
- Default: 600
- Valid Range: 1-3600
- System Impact: Controls how quickly the web API pool shrinks after traffic subsides
orchestration.web.database_pools.web_api_max_connections
Maximum number of connections the web API pool can grow to under load
- Type: u32
- Default: 30
- Valid Range: 1-500
- System Impact: Hard ceiling for web API database connections; prevents connection exhaustion from traffic spikes
orchestration.web.database_pools.web_api_pool_size
Target number of connections in the web API database pool
- Type: u32
- Default: 20
- Valid Range: 1-200
- System Impact: Determines how many concurrent database queries the REST API can execute
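These pool settings carry implicit ordering constraints: web_api_pool_size should not exceed web_api_max_connections, and max_total_connections_hint is advisory rather than enforced. A hypothetical sanity check (the function name and warning text are illustrative, not part of Tasker):

```python
def check_pools(pool_size: int = 20, max_connections: int = 30,
                total_hint: int = 50) -> list[str]:
    """Return warnings for inconsistent web API pool settings.
    Parameter defaults mirror the documented values above."""
    warnings = []
    if pool_size > max_connections:
        warnings.append("web_api_pool_size exceeds web_api_max_connections")
    if max_connections > total_hint:
        warnings.append("web API pool alone can exceed "
                        "max_total_connections_hint")
    return warnings

check_pools()  # → [] with the documented defaults
```

Since the hint is only logged rather than enforced, a check like this is worth running at deploy time to catch configurations that would silently oversubscribe PostgreSQL.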
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: worker
90/90 parameters documented
worker
Root-level worker parameters
Path: worker
| Parameter | Type | Default | Description |
|---|---|---|---|
| worker_id | String | "worker-default-001" | Unique identifier for this worker instance |
| worker_type | String | "general" | Worker type classification for routing and reporting |
worker.worker_id
Unique identifier for this worker instance
- Type: String
- Default: "worker-default-001"
- Valid Range: non-empty string
- System Impact: Used in logging, metrics, and step claim attribution; must be unique across all worker instances in a cluster
worker.worker_type
Worker type classification for routing and reporting
- Type: String
- Default: "general"
- Valid Range: non-empty string
- System Impact: Used to match worker capabilities with step handler requirements; ‘general’ handles all step types
circuit_breakers
Path: worker.circuit_breakers
ffi_completion_send
Path: worker.circuit_breakers.ffi_completion_send
| Parameter | Type | Default | Description |
|---|---|---|---|
| failure_threshold | u32 | 5 | Number of consecutive FFI completion send failures before the circuit breaker trips |
| recovery_timeout_seconds | u32 | 5 | Time the FFI completion breaker stays Open before probing with a test send |
| slow_send_threshold_ms | u32 | 100 | Threshold in milliseconds above which FFI completion channel sends are logged as slow |
| success_threshold | u32 | 2 | Consecutive successful sends in Half-Open required to close the breaker |
worker.circuit_breakers.ffi_completion_send.failure_threshold
Number of consecutive FFI completion send failures before the circuit breaker trips
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects the FFI completion channel from cascading failures; when tripped, sends are short-circuited
worker.circuit_breakers.ffi_completion_send.recovery_timeout_seconds
Time the FFI completion breaker stays Open before probing with a test send
- Type: u32
- Default: 5
- Valid Range: 1-300
- System Impact: Short timeout (5s) because FFI channel issues are typically transient
worker.circuit_breakers.ffi_completion_send.slow_send_threshold_ms
Threshold in milliseconds above which FFI completion channel sends are logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-10000
- System Impact: Observability signal that identifies when the FFI completion channel is under pressure from slow consumers
worker.circuit_breakers.ffi_completion_send.success_threshold
Consecutive successful sends in Half-Open required to close the breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Low threshold (2) allows fast recovery since FFI send failures are usually transient
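The four parameters above describe a conventional Closed → Open → Half-Open circuit breaker. A minimal sketch of that state machine under these settings (illustrative only, not Tasker's implementation; slow-send logging is omitted):

```python
import time

class CircuitBreaker:
    """Closed -> Open after `failure_threshold` consecutive failures;
    Open -> Half-Open after `recovery_timeout` seconds; Half-Open ->
    Closed after `success_threshold` consecutive successes."""

    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: float = 5.0,
                 success_threshold: int = 2):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Should the next send be attempted? Open sends are short-circuited
        until the recovery timeout elapses, then one probe is allowed."""
        if (self.state == "open"
                and time.monotonic() - self.opened_at >= self.recovery_timeout):
            self.state = "half_open"
            self.successes = 0
        return self.state != "open"

    def record(self, ok: bool) -> None:
        """Feed the result of a send back into the breaker."""
        if ok:
            self.failures = 0
            if self.state == "half_open":
                self.successes += 1
                if self.successes >= self.success_threshold:
                    self.state = "closed"
        else:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
```

The short default recovery timeout (5s) and low success threshold (2) match the descriptions above: FFI channel failures are expected to be transient, so the breaker re-closes quickly once sends succeed again.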
event_systems
Path: worker.event_systems
worker
Path: worker.event_systems.worker
| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "worker-event-system" | Unique identifier for the worker event system instance |
worker.event_systems.worker.deployment_mode
Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
worker.event_systems.worker.system_id
Unique identifier for the worker event system instance
- Type: String
- Default: "worker-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish this event system from others
health
Path: worker.event_systems.worker.health
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the worker event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the worker event system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the worker event system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for worker event processing |
worker.event_systems.worker.health.enabled
Enable health monitoring for the worker event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks or error tracking run for the worker event system
worker.event_systems.worker.health.error_rate_threshold_per_minute
Error rate per minute above which the worker event system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal complementing max_consecutive_errors
worker.event_systems.worker.health.max_consecutive_errors
Number of consecutive errors before the worker event system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation; resets on any successful event processing
worker.event_systems.worker.health.performance_monitoring_enabled
Enable detailed performance metrics for worker event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks step dispatch latency and throughput; useful for tuning concurrency settings
metadata
Path: worker.event_systems.worker.metadata
fallback_poller
Path: worker.event_systems.worker.metadata.fallback_poller
| Parameter | Type | Default | Description |
|---|---|---|---|
| age_threshold_seconds | u32 | 5 | Minimum age in seconds of a message before the fallback poller will pick it up |
| batch_size | u32 | 20 | Number of messages to dequeue in a single fallback poll |
| enabled | bool | true | Enable the fallback polling mechanism for step dispatch |
| max_age_hours | u32 | 24 | Maximum age in hours of messages the fallback poller will process |
| polling_interval_ms | u32 | 1000 | Interval in milliseconds between fallback polling cycles |
| supported_namespaces | Vec<String> | [] | List of queue namespaces the fallback poller monitors; empty means all namespaces |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a message polled by the fallback mechanism remains invisible |
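Under these settings, whether the fallback poller would pick up a message reduces to an age-window check: old enough that the event-driven path has plausibly missed it, but not older than the retention ceiling. A sketch of that condition as inferred from the parameter descriptions (the actual predicate in Tasker may differ):

```python
def eligible_for_fallback(message_age_s: float,
                          age_threshold_s: int = 5,
                          max_age_h: int = 24) -> bool:
    """Would the fallback poller pick up a message of this age?
    Inferred from age_threshold_seconds and max_age_hours."""
    return age_threshold_s <= message_age_s <= max_age_h * 3600
```

The lower bound keeps the poller from racing the LISTEN/NOTIFY path on fresh messages; the upper bound stops it from endlessly reprocessing abandoned ones.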
in_process_events
Path: worker.event_systems.worker.metadata.in_process_events
| Parameter | Type | Default | Description |
|---|---|---|---|
| deduplication_cache_size | usize | 10000 | Number of event IDs to cache for deduplication of in-process events |
| ffi_integration_enabled | bool | true | Enable FFI integration for in-process event delivery to Ruby/Python workers |
listener
Path: worker.event_systems.worker.metadata.listener
| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_processing | bool | true | Enable batch processing of accumulated LISTEN/NOTIFY events |
| connection_timeout_seconds | u32 | 30 | Maximum time to wait when establishing the LISTEN/NOTIFY PostgreSQL connection |
| event_timeout_seconds | u32 | 60 | Maximum time to wait for a LISTEN/NOTIFY event before yielding |
| max_retry_attempts | u32 | 5 | Maximum number of listener reconnection attempts before falling back to polling |
| retry_interval_seconds | u32 | 5 | Interval in seconds between LISTEN/NOTIFY listener reconnection attempts |
processing
Path: worker.event_systems.worker.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read by the worker |
| max_concurrent_operations | u32 | 100 | Maximum number of events processed concurrently by the worker event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed worker event processing operation |
worker.event_systems.worker.processing.batch_size
Number of events dequeued in a single batch read by the worker
- Type: u32
- Default: 20
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput but increase per-batch processing time
worker.event_systems.worker.processing.max_concurrent_operations
Maximum number of events processed concurrently by the worker event system
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Controls parallelism for step dispatch and completion processing
worker.event_systems.worker.processing.max_retries
Maximum retry attempts for a failed worker event processing operation
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Events exceeding this retry count are dropped or sent to the DLQ
backoff
Path: worker.event_systems.worker.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first worker event processing failure |
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between worker event retries |
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
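Taken together, these four parameters describe a standard exponential backoff with jitter. The sketch below shows how the default values interact (a hypothetical helper for illustration; the engine's internal implementation may differ, e.g. in how the random jitter point is chosen):

```rust
/// Computes the backoff window for a given failure count from the
/// worker.event_systems.worker.processing.backoff parameters.
/// Returns (base_delay_ms, base_delay_ms + max_jitter_ms).
fn backoff_delay_ms(
    attempt: u32,         // 0-based count of consecutive failures
    initial_delay_ms: u64,
    multiplier: f64,
    max_delay_ms: u64,
    jitter_fraction: f64, // jitter_percent, e.g. 0.1
) -> (u64, u64) {
    // Exponential growth: initial * multiplier^attempt, capped at max_delay_ms.
    let raw = initial_delay_ms as f64 * multiplier.powi(attempt as i32);
    let capped = raw.min(max_delay_ms as f64) as u64;
    // Jitter widens the delay by up to jitter_fraction of the computed value;
    // a real implementation picks a random point inside this window.
    let jitter_span = (capped as f64 * jitter_fraction) as u64;
    (capped, capped + jitter_span)
}

fn main() {
    // With the defaults (100 ms initial, x2.0, 10 000 ms cap, 10% jitter),
    // delays grow 100, 200, 400, ... until the cap takes over.
    for attempt in [0, 1, 2, 7] {
        let (base, max_with_jitter) =
            backoff_delay_ms(attempt, 100, 2.0, 10_000, 0.1);
        println!("attempt {attempt}: {base}..={max_with_jitter} ms");
    }
}
```

Note how attempt 7 would compute 12 800 ms but is capped at `max_delay_ms` (10 000 ms) before jitter is applied.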
timing
Path: worker.event_systems.worker.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
claim_timeout_seconds | u32 | 300 | Maximum time in seconds a worker event claim remains valid |
fallback_polling_interval_seconds | u32 | 2 | Interval in seconds between fallback polling cycles for step dispatch |
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the worker event system |
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single worker event |
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued step dispatch message remains invisible to other workers |
worker.event_systems.worker.timing.claim_timeout_seconds
Maximum time in seconds a worker event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned claims from blocking step processing indefinitely
worker.event_systems.worker.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles for step dispatch
- Type: u32
- Default: 2
- Valid Range: 1-60
- System Impact: Shorter than orchestration (2s vs 5s) because workers need fast step pickup for low latency
worker.event_systems.worker.timing.health_check_interval_seconds
Interval in seconds between health check probes for the worker event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the worker event system verifies its own connectivity
worker.event_systems.worker.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single worker event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Events exceeding this timeout are considered failed and may be retried
worker.event_systems.worker.timing.visibility_timeout_seconds
Time in seconds a dequeued step dispatch message remains invisible to other workers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Prevents duplicate step execution; must be longer than typical step processing time
grpc
Path: worker.grpc
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}" | Socket address for the worker gRPC server |
enable_health_service | bool | true | Enable the gRPC health checking service on the worker |
enable_reflection | bool | true | Enable gRPC server reflection for the worker service |
enabled | bool | true | Enable the gRPC API server for the worker service |
keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames on the worker |
keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
max_concurrent_streams | u32 | 1000 | Maximum number of concurrent gRPC streams per connection |
max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server |
tls_enabled | bool | false | Enable TLS encryption for worker gRPC connections |
worker.grpc.bind_address
Socket address for the worker gRPC server
- Type: String
- Default: "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}"
- Valid Range: host:port
- System Impact: Must not conflict with the REST API or orchestration gRPC ports; default 9191
worker.grpc.enable_health_service
Enable the gRPC health checking service on the worker
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators
worker.grpc.enable_reflection
Enable gRPC server reflection for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Allows tools like grpcurl to list and inspect worker services; consider disabling in production
worker.grpc.enabled
Enable the gRPC API server for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no gRPC endpoints are available; clients must use REST
worker.grpc.keepalive_interval_seconds
Interval in seconds between gRPC keepalive ping frames on the worker
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Detects dead connections; lower values detect failures faster but increase network overhead
worker.grpc.keepalive_timeout_seconds
Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
- Type: u32
- Default: 20
- Valid Range: 1-300
- System Impact: Connections that fail to acknowledge within this window are considered dead and closed
worker.grpc.max_concurrent_streams
Maximum number of concurrent gRPC streams per connection
- Type: u32
- Default: 1000
- Valid Range: 1-10000
- System Impact: Workers typically handle more concurrent streams than orchestration; default 1000 reflects this
worker.grpc.max_frame_size
Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server
- Type: u32
- Default: 16384
- Valid Range: 16384-16777215
- System Impact: Larger frames reduce framing overhead for large messages but increase per-stream memory usage
worker.grpc.tls_enabled
Enable TLS encryption for worker gRPC connections
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When true, TLS cert and key paths must be provided; required for production gRPC deployments
mpsc_channels
Path: worker.mpsc_channels
command_processor
Path: worker.mpsc_channels.command_processor
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 2000 | Bounded channel capacity for the worker command processor |
worker.mpsc_channels.command_processor.command_buffer_size
Bounded channel capacity for the worker command processor
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Buffers incoming worker commands; smaller than orchestration (2000 vs 5000) since workers process fewer command types
domain_events
Path: worker.mpsc_channels.domain_events
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 1000 | Bounded channel capacity for domain event system commands |
log_dropped_events | bool | true | Log a warning when domain events are dropped due to channel saturation |
shutdown_drain_timeout_ms | u32 | 5000 | Maximum time in milliseconds to drain pending domain events during shutdown |
worker.mpsc_channels.domain_events.command_buffer_size
Bounded channel capacity for domain event system commands
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers domain event system control commands such as publish, subscribe, and shutdown
worker.mpsc_channels.domain_events.log_dropped_events
Log a warning when domain events are dropped due to channel saturation
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Observability: helps detect when event volume exceeds channel capacity
worker.mpsc_channels.domain_events.shutdown_drain_timeout_ms
Maximum time in milliseconds to drain pending domain events during shutdown
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Ensures in-flight domain events are delivered before the worker exits; prevents event loss
event_listeners
Path: worker.mpsc_channels.event_listeners
| Parameter | Type | Default | Description |
|---|---|---|---|
pgmq_event_buffer_size | usize | 10000 | Bounded channel capacity for PGMQ event listener notifications on the worker |
worker.mpsc_channels.event_listeners.pgmq_event_buffer_size
Bounded channel capacity for PGMQ event listener notifications on the worker
- Type: usize
- Default: 10000
- Valid Range: 1000-1000000
- System Impact: Buffers PGMQ LISTEN/NOTIFY events; smaller than orchestration (10000 vs 50000) since workers handle fewer event types
event_subscribers
Path: worker.mpsc_channels.event_subscribers
| Parameter | Type | Default | Description |
|---|---|---|---|
completion_buffer_size | usize | 1000 | Bounded channel capacity for step completion event subscribers |
result_buffer_size | usize | 1000 | Bounded channel capacity for step result event subscribers |
worker.mpsc_channels.event_subscribers.completion_buffer_size
Bounded channel capacity for step completion event subscribers
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step completion notifications before they are forwarded to the orchestration service
worker.mpsc_channels.event_subscribers.result_buffer_size
Bounded channel capacity for step result event subscribers
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step execution results before they are published to the result queue
event_systems
Path: worker.mpsc_channels.event_systems
| Parameter | Type | Default | Description |
|---|---|---|---|
event_channel_buffer_size | usize | 2000 | Bounded channel capacity for the worker event system internal channel |
worker.mpsc_channels.event_systems.event_channel_buffer_size
Bounded channel capacity for the worker event system internal channel
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Buffers events between the listener and processor; sized for worker-level throughput
ffi_dispatch
Path: worker.mpsc_channels.ffi_dispatch
| Parameter | Type | Default | Description |
|---|---|---|---|
callback_timeout_ms | u32 | 5000 | Maximum time in milliseconds for FFI fire-and-forget domain event callbacks |
completion_send_timeout_ms | u32 | 10000 | Maximum time in milliseconds to retry sending FFI completion results when the channel is full |
completion_timeout_ms | u32 | 30000 | Maximum time in milliseconds to wait for an FFI step handler to complete |
dispatch_buffer_size | usize | 1000 | Bounded channel capacity for FFI step dispatch requests |
starvation_warning_threshold_ms | u32 | 10000 | Age in milliseconds of pending FFI events that triggers a starvation warning |
worker.mpsc_channels.ffi_dispatch.callback_timeout_ms
Maximum time in milliseconds for FFI fire-and-forget domain event callbacks
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Prevents indefinite blocking of FFI threads during domain event publishing
worker.mpsc_channels.ffi_dispatch.completion_send_timeout_ms
Maximum time in milliseconds to retry sending FFI completion results when the channel is full
- Type: u32
- Default: 10000
- Valid Range: 1000-300000
- System Impact: Uses try_send with retry loop instead of blocking send to prevent deadlocks
worker.mpsc_channels.ffi_dispatch.completion_timeout_ms
Maximum time in milliseconds to wait for an FFI step handler to complete
- Type: u32
- Default: 30000
- Valid Range: 1000-600000
- System Impact: FFI handlers exceeding this timeout are considered failed; guards against hung FFI threads
worker.mpsc_channels.ffi_dispatch.dispatch_buffer_size
Bounded channel capacity for FFI step dispatch requests
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step execution requests destined for Ruby/Python FFI handlers
worker.mpsc_channels.ffi_dispatch.starvation_warning_threshold_ms
Age in milliseconds of pending FFI events that triggers a starvation warning
- Type: u32
- Default: 10000
- Valid Range: 1000-300000
- System Impact: Proactive detection of FFI channel starvation before completion_timeout_ms is reached
handler_dispatch
Path: worker.mpsc_channels.handler_dispatch
| Parameter | Type | Default | Description |
|---|---|---|---|
completion_buffer_size | usize | 1000 | Bounded channel capacity for step handler completion notifications |
dispatch_buffer_size | usize | 1000 | Bounded channel capacity for step handler dispatch requests |
handler_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a step handler to complete execution |
max_concurrent_handlers | u32 | 10 | Maximum number of step handlers executing simultaneously |
worker.mpsc_channels.handler_dispatch.completion_buffer_size
Bounded channel capacity for step handler completion notifications
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers handler completion results before they are forwarded to the result processor
worker.mpsc_channels.handler_dispatch.dispatch_buffer_size
Bounded channel capacity for step handler dispatch requests
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers incoming step execution requests before handler assignment
worker.mpsc_channels.handler_dispatch.handler_timeout_ms
Maximum time in milliseconds for a step handler to complete execution
- Type: u32
- Default: 30000
- Valid Range: 1000-600000
- System Impact: Handlers exceeding this timeout are cancelled; prevents hung handlers from consuming capacity
worker.mpsc_channels.handler_dispatch.max_concurrent_handlers
Maximum number of step handlers executing simultaneously
- Type: u32
- Default: 10
- Valid Range: 1-10000
- System Impact: Controls per-worker parallelism; bounded by the handler dispatch semaphore
load_shedding
Path: worker.mpsc_channels.handler_dispatch.load_shedding
| Parameter | Type | Default | Description |
|---|---|---|---|
capacity_threshold_percent | f64 | 80.0 | Handler capacity percentage above which new step claims are refused |
enabled | bool | true | Enable load shedding to refuse step claims when handler capacity is nearly exhausted |
warning_threshold_percent | f64 | 70.0 | Handler capacity percentage at which warning logs are emitted |
worker.mpsc_channels.handler_dispatch.load_shedding.capacity_threshold_percent
Handler capacity percentage above which new step claims are refused
- Type: f64
- Default: 80.0
- Valid Range: 0.0-100.0
- System Impact: At 80%, the worker stops accepting new steps when 80% of max_concurrent_handlers are busy
worker.mpsc_channels.handler_dispatch.load_shedding.enabled
Enable load shedding to refuse step claims when handler capacity is nearly exhausted
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, the worker refuses new step claims above the capacity threshold to prevent overload
worker.mpsc_channels.handler_dispatch.load_shedding.warning_threshold_percent
Handler capacity percentage at which warning logs are emitted
- Type: f64
- Default: 70.0
- Valid Range: 0.0-100.0
- System Impact: Observability: alerts operators that the worker is approaching its capacity limit
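The two thresholds work against `max_concurrent_handlers` utilization. The following sketch shows the decision logic they imply (a hypothetical helper for illustration, not the engine's actual code):

```rust
/// Claim decision implied by worker.mpsc_channels.handler_dispatch.load_shedding.
#[derive(Debug, PartialEq)]
enum ClaimDecision {
    Accept,
    AcceptWithWarning, // at or above warning_threshold_percent: log a warning
    Refuse,            // at or above capacity_threshold_percent: shed load
}

fn claim_decision(
    busy_handlers: u32,
    max_concurrent_handlers: u32,
    warning_threshold_percent: f64,
    capacity_threshold_percent: f64,
) -> ClaimDecision {
    // Utilization as a percentage of the handler dispatch capacity.
    let utilization =
        busy_handlers as f64 / max_concurrent_handlers as f64 * 100.0;
    if utilization >= capacity_threshold_percent {
        ClaimDecision::Refuse
    } else if utilization >= warning_threshold_percent {
        ClaimDecision::AcceptWithWarning
    } else {
        ClaimDecision::Accept
    }
}

fn main() {
    // With the defaults (max 10 handlers, warn at 70%, refuse at 80%):
    // 6 busy -> Accept, 7 busy -> AcceptWithWarning, 8 busy -> Refuse.
    for busy in [6, 7, 8] {
        println!("{busy} busy: {:?}", claim_decision(busy, 10, 70.0, 80.0));
    }
}
```

With the default `max_concurrent_handlers` of 10, the worker starts warning at 7 busy handlers and refuses new claims at 8.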
in_process_events
Path: worker.mpsc_channels.in_process_events
| Parameter | Type | Default | Description |
|---|---|---|---|
broadcast_buffer_size | usize | 2000 | Bounded broadcast channel capacity for in-process domain event delivery |
dispatch_timeout_ms | u32 | 5000 | Maximum time in milliseconds to wait when dispatching an in-process event |
log_subscriber_errors | bool | true | Log errors when in-process event subscribers fail to receive events |
worker.mpsc_channels.in_process_events.broadcast_buffer_size
Bounded broadcast channel capacity for in-process domain event delivery
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Controls how many domain events can be buffered before slow subscribers cause backpressure
worker.mpsc_channels.in_process_events.dispatch_timeout_ms
Maximum time in milliseconds to wait when dispatching an in-process event
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Prevents event dispatch from blocking indefinitely if all subscribers are slow
worker.mpsc_channels.in_process_events.log_subscriber_errors
Log errors when in-process event subscribers fail to receive events
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Observability: helps identify slow or failing event subscribers
orchestration_client
Path: worker.orchestration_client
| Parameter | Type | Default | Description |
|---|---|---|---|
base_url | String | "http://localhost:8080" | Base URL of the orchestration REST API that this worker reports to |
max_retries | u32 | 3 | Maximum retry attempts for failed orchestration API calls |
timeout_ms | u32 | 30000 | HTTP request timeout in milliseconds for orchestration API calls |
worker.orchestration_client.base_url
Base URL of the orchestration REST API that this worker reports to
- Type: String
- Default: "http://localhost:8080"
- Valid Range: valid HTTP(S) URL
- System Impact: Workers send step completion results and health reports to this endpoint
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | http://orchestration:8080 | Container-internal DNS in Kubernetes/Docker |
| test | http://localhost:8080 | Local orchestration for testing |
Related: orchestration.web.bind_address
worker.orchestration_client.max_retries
Maximum retry attempts for failed orchestration API calls
- Type: u32
- Default: 3
- Valid Range: 0-10
- System Impact: Retries use backoff; higher values improve resilience to transient network issues
worker.orchestration_client.timeout_ms
HTTP request timeout in milliseconds for orchestration API calls
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Worker-to-orchestration calls exceeding this timeout fail and may be retried
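Putting the three client parameters together, a production worker's configuration block might look like this (a sketch assuming a TOML configuration file; adjust the format and values to your deployment):

```toml
# Hypothetical worker config fragment for the orchestration client.
[worker.orchestration_client]
base_url = "http://orchestration:8080"  # container-internal DNS in production
max_retries = 3                          # retries use backoff
timeout_ms = 30000
```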
web
Path: worker.web
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}" | Socket address for the worker REST API server |
enabled | bool | true | Enable the REST API server for the worker service |
request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a worker HTTP request to complete before timeout |
worker.web.bind_address
Socket address for the worker REST API server
- Type: String
- Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}"
- Valid Range: host:port
- System Impact: Must not conflict with orchestration.web.bind_address when co-located; default 8081
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8081 | Standard worker port; use TASKER_WEB_BIND_ADDRESS env var to override |
| test | 0.0.0.0:8081 | Default port offset from orchestration (8080) |
worker.web.enabled
Enable the REST API server for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no HTTP endpoints are available; the worker operates via messaging only
worker.web.request_timeout_ms
Maximum time in milliseconds for a worker HTTP request to complete before timeout
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections
auth
Path: worker.web.auth
| Parameter | Type | Default | Description |
|---|---|---|---|
api_key | String | "" | Static API key for simple key-based authentication on the worker API |
api_key_header | String | "X-API-Key" | HTTP header name for API key authentication on the worker API |
enabled | bool | false | Enable authentication for the worker REST API |
jwt_audience | String | "worker-api" | Expected ‘aud’ claim in JWT tokens for the worker API |
jwt_issuer | String | "tasker-worker" | Expected ‘iss’ claim in JWT tokens for the worker API |
jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if the worker issues tokens) |
jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures on the worker API |
jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for worker JWT verification |
jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours for worker API tokens |
worker.web.auth.api_key
Static API key for simple key-based authentication on the worker API
- Type: String
- Default: ""
- Valid Range: non-empty string or empty to disable
- System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header
worker.web.auth.api_key_header
HTTP header name for API key authentication on the worker API
- Type: String
- Default: "X-API-Key"
- Valid Range: valid HTTP header name
- System Impact: Clients send their API key in this header; default is X-API-Key
worker.web.auth.enabled
Enable authentication for the worker REST API
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all worker API endpoints are unauthenticated
worker.web.auth.jwt_audience
Expected ‘aud’ claim in JWT tokens for the worker API
- Type: String
- Default: "worker-api"
- Valid Range: non-empty string
- System Impact: Tokens with a different audience are rejected during validation
worker.web.auth.jwt_issuer
Expected ‘iss’ claim in JWT tokens for the worker API
- Type: String
- Default: "tasker-worker"
- Valid Range: non-empty string
- System Impact: Tokens with a different issuer are rejected; default ‘tasker-worker’ distinguishes worker tokens from orchestration tokens
worker.web.auth.jwt_private_key
PEM-encoded private key for signing JWT tokens (if the worker issues tokens)
- Type: String
- Default: ""
- Valid Range: valid PEM private key or empty
- System Impact: Required only if the worker service issues its own JWT tokens; typically empty
worker.web.auth.jwt_public_key
PEM-encoded public key for verifying JWT token signatures on the worker API
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY:-}"
- Valid Range: valid PEM public key or empty
- System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management
worker.web.auth.jwt_public_key_path
File path to a PEM-encoded public key for worker JWT verification
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
- Valid Range: valid file path or empty
- System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file
worker.web.auth.jwt_token_expiry_hours
Default JWT token validity period in hours for worker API tokens
- Type: u32
- Default: 24
- Valid Range: 1-720
- System Impact: Tokens older than this are rejected; shorter values improve security
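As a concrete example, enabling JWT verification on the worker API could look like the following (a sketch assuming a TOML configuration file; the key path is hypothetical and shown only to illustrate file-based key management):

```toml
# Hypothetical fragment enabling JWT auth on the worker REST API.
# Keys mirror the worker.web.auth parameters documented above.
[worker.web.auth]
enabled = true
jwt_issuer = "tasker-worker"
jwt_audience = "worker-api"
# Prefer the file-based key so it can be rotated by replacing the file:
jwt_public_key_path = "/etc/tasker/keys/jwt-public.pem"  # hypothetical path
jwt_token_expiry_hours = 24
```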
database_pools
Path: worker.web.database_pools
| Parameter | Type | Default | Description |
|---|---|---|---|
max_total_connections_hint | u32 | 25 | Advisory hint for the total number of database connections across all worker pools |
web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the worker web API pool |
web_api_idle_timeout_seconds | u32 | 600 | Time before an idle worker web API connection is closed |
web_api_max_connections | u32 | 15 | Maximum number of connections the worker web API pool can grow to under load |
web_api_pool_size | u32 | 10 | Target number of connections in the worker web API database pool |
worker.web.database_pools.max_total_connections_hint
Advisory hint for the total number of database connections across all worker pools
- Type: u32
- Default: 25
- Valid Range: 1-1000
- System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint
worker.web.database_pools.web_api_connection_timeout_seconds
Maximum time to wait when acquiring a connection from the worker web API pool
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: Worker API requests that cannot acquire a connection within this window return an error
worker.web.database_pools.web_api_idle_timeout_seconds
Time before an idle worker web API connection is closed
- Type: u32
- Default: 600
- Valid Range: 1-3600
- System Impact: Controls how quickly the worker web API pool shrinks after traffic subsides
worker.web.database_pools.web_api_max_connections
Maximum number of connections the worker web API pool can grow to under load
- Type: u32
- Default: 15
- Valid Range: 1-500
- System Impact: Hard ceiling for worker web API database connections
worker.web.database_pools.web_api_pool_size
Target number of connections in the worker web API database pool
- Type: u32
- Default: 10
- Valid Range: 1-200
- System Impact: Determines how many concurrent database queries the worker REST API can execute; smaller than orchestration
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: orchestration
91/91 parameters documented
orchestration
Root-level orchestration parameters
Path: orchestration
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_performance_logging | bool | true | Enable detailed performance logging for orchestration actors |
shutdown_timeout_ms | u64 | 30000 | Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown |
orchestration.enable_performance_logging
Enable detailed performance logging for orchestration actors
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Emits timing metrics for task processing, step enqueueing, and result evaluation; disable in production if log volume is a concern
orchestration.shutdown_timeout_ms
Maximum time in milliseconds to wait for orchestration subsystems to stop during graceful shutdown
- Type: u64
- Default: 30000
- Valid Range: 1000-300000
- System Impact: If shutdown exceeds this timeout, the process exits forcefully to avoid hanging indefinitely; 30s is conservative for most deployments
batch_processing
Path: orchestration.batch_processing
| Parameter | Type | Default | Description |
|---|---|---|---|
checkpoint_stall_minutes | u32 | 15 | Minutes without a checkpoint update before a batch is considered stalled |
default_batch_size | u32 | 1000 | Default number of items in a single batch when not specified by the handler |
enabled | bool | true | Enable the batch processing subsystem for large-scale step execution |
max_parallel_batches | u32 | 50 | Maximum number of batch operations that can execute concurrently |
orchestration.batch_processing.checkpoint_stall_minutes
Minutes without a checkpoint update before a batch is considered stalled
- Type: u32
- Default: 15
- Valid Range: 1-1440
- System Impact: Stalled batches are flagged for investigation or automatic recovery; lower values detect issues faster
orchestration.batch_processing.default_batch_size
Default number of items in a single batch when not specified by the handler
- Type: u32
- Default: 1000
- Valid Range: 1-100000
- System Impact: Larger batches improve throughput but increase memory usage and per-batch latency
orchestration.batch_processing.enabled
Enable the batch processing subsystem for large-scale step execution
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, batch step handlers cannot be used; all steps must be processed individually
orchestration.batch_processing.max_parallel_batches
Maximum number of batch operations that can execute concurrently
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Bounds resource usage from concurrent batch processing; increase for high-throughput batch workloads
decision_points
Path: orchestration.decision_points
| Parameter | Type | Default | Description |
|---|---|---|---|
enable_detailed_logging | bool | false | Enable verbose logging of decision point evaluation including expression results |
enable_metrics | bool | true | Enable metrics collection for decision point evaluations |
enabled | bool | true | Enable the decision point evaluation subsystem for conditional workflow branching |
max_decision_depth | u32 | 20 | Maximum depth of nested decision point chains |
max_steps_per_decision | u32 | 100 | Maximum number of steps that can be generated by a single decision point evaluation |
warn_threshold_depth | u32 | 10 | Decision depth above which a warning is logged |
warn_threshold_steps | u32 | 50 | Number of steps per decision above which a warning is logged |
orchestration.decision_points.enable_detailed_logging
Enable verbose logging of decision point evaluation including expression results
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: Produces high-volume logs; enable only for debugging specific decision point behavior
orchestration.decision_points.enable_metrics
Enable metrics collection for decision point evaluations
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks evaluation counts, timings, and branch selection distribution
orchestration.decision_points.enabled
Enable the decision point evaluation subsystem for conditional workflow branching
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, all decision points are skipped and conditional steps are not evaluated
orchestration.decision_points.max_decision_depth
Maximum depth of nested decision point chains
- Type: u32
- Default: 20
- Valid Range: 1-100
- System Impact: Prevents infinite recursion from circular decision point references
orchestration.decision_points.max_steps_per_decision
Maximum number of steps that can be generated by a single decision point evaluation
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Safety limit to prevent decision points from creating unbounded step graphs
orchestration.decision_points.warn_threshold_depth
Decision depth above which a warning is logged
- Type: u32
- Default: 10
- Valid Range: 1-100
- System Impact: Observability: identifies deeply nested decision chains that may indicate design issues
orchestration.decision_points.warn_threshold_steps
Number of steps per decision above which a warning is logged
- Type: u32
- Default: 50
- Valid Range: 1-10000
- System Impact: Observability: identifies decision points that generate unusually large step sets
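The depth and step limits act as paired guards: a hard cap that fails the evaluation and a soft threshold that only warns. A minimal sketch of that pattern (hypothetical; it illustrates the limit semantics, not the engine's evaluator):

```rust
/// Guard implied by orchestration.decision_points.max_decision_depth
/// and warn_threshold_depth: warn past the soft threshold, fail past the cap.
fn check_decision_depth(
    depth: u32,
    max_decision_depth: u32,
    warn_threshold_depth: u32,
) -> Result<(), String> {
    if depth > max_decision_depth {
        // Hard limit: prevents infinite recursion from circular references.
        return Err(format!(
            "decision depth {depth} exceeds max {max_decision_depth}"
        ));
    }
    if depth > warn_threshold_depth {
        // Soft limit: evaluation continues, but operators are alerted.
        eprintln!(
            "warning: decision depth {depth} above threshold {warn_threshold_depth}"
        );
    }
    Ok(())
}

fn main() {
    // With the defaults (max 20, warn at 10): depth 15 warns, depth 21 fails.
    assert!(check_decision_depth(15, 20, 10).is_ok());
    assert!(check_decision_depth(21, 20, 10).is_err());
}
```

The same two-tier shape applies to `max_steps_per_decision` and `warn_threshold_steps`.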
dlq
Path: orchestration.dlq
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable the Dead Letter Queue subsystem for handling unrecoverable tasks |
orchestration.dlq.enabled
Enable the Dead Letter Queue subsystem for handling unrecoverable tasks
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, stale or failed tasks remain in their error state without DLQ routing
staleness_detection
Path: orchestration.dlq.staleness_detection
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 100 | Number of potentially stale tasks to evaluate in a single detection sweep |
detection_interval_seconds | u32 | 300 | Interval in seconds between staleness detection sweeps |
dry_run | bool | false | Run staleness detection in observation-only mode without taking action |
enabled | bool | true | Enable periodic scanning for stale tasks |
orchestration.dlq.staleness_detection.batch_size
Number of potentially stale tasks to evaluate in a single detection sweep
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Larger batches process more stale tasks per sweep but increase per-sweep query cost
orchestration.dlq.staleness_detection.detection_interval_seconds
Interval in seconds between staleness detection sweeps
- Type: u32
- Default: 300
- Valid Range: 30-3600
- System Impact: Lower values detect stale tasks faster but increase database query frequency
orchestration.dlq.staleness_detection.dry_run
Run staleness detection in observation-only mode without taking action
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: Logs what would be DLQ’d without actually transitioning tasks; useful for tuning thresholds
orchestration.dlq.staleness_detection.enabled
Enable periodic scanning for stale tasks
- Type:
bool - Default:
true - Valid Range: true/false
- System Impact: When false, no automatic staleness detection runs; tasks must be manually DLQ’d
actions
Path: orchestration.dlq.staleness_detection.actions
| Parameter | Type | Default | Description |
|---|---|---|---|
| auto_move_to_dlq | bool | true | Automatically move stale tasks to the DLQ after transitioning to error |
| auto_transition_to_error | bool | true | Automatically transition stale tasks to the Error state |
| emit_events | bool | true | Emit domain events when staleness is detected |
| event_channel | String | "task_staleness_detected" | PGMQ channel name for staleness detection events |
orchestration.dlq.staleness_detection.actions.auto_move_to_dlq
Automatically move stale tasks to the DLQ after transitioning to error
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, stale tasks are routed to the DLQ; when false, they remain in Error state for manual review
orchestration.dlq.staleness_detection.actions.auto_transition_to_error
Automatically transition stale tasks to the Error state
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, stale tasks are moved to Error before DLQ routing; when false, tasks stay in their current state
orchestration.dlq.staleness_detection.actions.emit_events
Emit domain events when staleness is detected
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, staleness events are published to the event_channel for external alerting or custom handling
orchestration.dlq.staleness_detection.actions.event_channel
PGMQ channel name for staleness detection events
- Type: String
- Default: "task_staleness_detected"
- Valid Range: 1-255 characters
- System Impact: Consumers can subscribe to this channel for alerting or custom staleness handling
thresholds
Path: orchestration.dlq.staleness_detection.thresholds
| Parameter | Type | Default | Description |
|---|---|---|---|
| steps_in_process_minutes | u32 | 30 | Minutes a task can have steps in process before being considered stale |
| task_max_lifetime_hours | u32 | 24 | Absolute maximum lifetime for any task regardless of state |
| waiting_for_dependencies_minutes | u32 | 60 | Minutes a task can wait for step dependencies before being considered stale |
| waiting_for_retry_minutes | u32 | 30 | Minutes a task can wait for step retries before being considered stale |
orchestration.dlq.staleness_detection.thresholds.steps_in_process_minutes
Minutes a task can have steps in process before being considered stale
- Type: u32
- Default: 30
- Valid Range: 1-1440
- System Impact: Tasks in StepsInProcess state exceeding this age may have hung workers; flags for investigation
orchestration.dlq.staleness_detection.thresholds.task_max_lifetime_hours
Absolute maximum lifetime for any task regardless of state
- Type: u32
- Default: 24
- Valid Range: 1-168
- System Impact: Hard cap; tasks exceeding this age are considered stale even if actively processing
orchestration.dlq.staleness_detection.thresholds.waiting_for_dependencies_minutes
Minutes a task can wait for step dependencies before being considered stale
- Type: u32
- Default: 60
- Valid Range: 1-1440
- System Impact: Tasks in WaitingForDependencies state exceeding this age are flagged for DLQ consideration
orchestration.dlq.staleness_detection.thresholds.waiting_for_retry_minutes
Minutes a task can wait for step retries before being considered stale
- Type: u32
- Default: 30
- Valid Range: 1-1440
- System Impact: Tasks in WaitingForRetry state exceeding this age are flagged for DLQ consideration
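To make the interplay of these thresholds concrete, here is a minimal sketch of how a staleness sweep might classify a task. This is an illustration of the documented semantics, not Tasker's implementation; the state names and defaults are taken from the tables above.

```python
from datetime import timedelta

# Per-state thresholds from this section, at their defaults.
THRESHOLDS = {
    "StepsInProcess": timedelta(minutes=30),
    "WaitingForDependencies": timedelta(minutes=60),
    "WaitingForRetry": timedelta(minutes=30),
}
# task_max_lifetime_hours: a hard cap that applies regardless of state.
TASK_MAX_LIFETIME = timedelta(hours=24)

def is_stale(state: str, time_in_state: timedelta, task_age: timedelta) -> bool:
    """A task is stale if it exceeds the absolute lifetime cap, or if it has
    sat in a monitored state longer than that state's threshold."""
    if task_age > TASK_MAX_LIFETIME:
        return True
    limit = THRESHOLDS.get(state)
    return limit is not None and time_in_state > limit
```

With the defaults, a task that has been in WaitingForRetry for 45 minutes is flagged, while the same wait in WaitingForDependencies (threshold 60 minutes) is not.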
event_systems
Path: orchestration.event_systems
orchestration
Path: orchestration.event_systems.orchestration
| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "orchestration-event-system" | Unique identifier for the orchestration event system instance |
orchestration.event_systems.orchestration.deployment_mode
Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
orchestration.event_systems.orchestration.system_id
Unique identifier for the orchestration event system instance
- Type: String
- Default: "orchestration-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish this event system from others
health
Path: orchestration.event_systems.orchestration.health
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the orchestration event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the event system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the event system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics collection for event processing |
orchestration.event_systems.orchestration.health.enabled
Enable health monitoring for the orchestration event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks or error tracking run for this event system
orchestration.event_systems.orchestration.health.error_rate_threshold_per_minute
Error rate per minute above which the event system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal; complements max_consecutive_errors for burst error detection
orchestration.event_systems.orchestration.health.max_consecutive_errors
Number of consecutive errors before the event system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation after sustained failures; resets on any success
orchestration.event_systems.orchestration.health.performance_monitoring_enabled
Enable detailed performance metrics collection for event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks processing latency percentiles and throughput; adds minor overhead
processing
Path: orchestration.event_systems.orchestration.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 20 | Number of events dequeued in a single batch read |
| max_concurrent_operations | u32 | 50 | Maximum number of events processed concurrently by the orchestration event system |
| max_retries | u32 | 3 | Maximum retry attempts for a failed event processing operation |
orchestration.event_systems.orchestration.processing.batch_size
Number of events dequeued in a single batch read
- Type: u32
- Default: 20
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput but increase per-batch processing time
orchestration.event_systems.orchestration.processing.max_concurrent_operations
Maximum number of events processed concurrently by the orchestration event system
- Type: u32
- Default: 50
- Valid Range: 1-10000
- System Impact: Controls parallelism for task request, result, and finalization processing
orchestration.event_systems.orchestration.processing.max_retries
Maximum retry attempts for a failed event processing operation
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Events exceeding this retry count are dropped or sent to the DLQ
backoff
Path: orchestration.event_systems.orchestration.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first event processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between event processing retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
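These four parameters combine in the usual exponential-backoff-with-jitter fashion. The exact formula below is an assumption for illustration rather than a transcription of Tasker's source, but it shows how the defaults interact:

```python
import random

def backoff_delay_ms(attempt: int,
                     initial_delay_ms: int = 100,
                     multiplier: float = 2.0,
                     max_delay_ms: int = 10_000,
                     jitter_percent: float = 0.1) -> float:
    """Delay before retry number `attempt` (1-based): exponential growth
    capped at max_delay_ms, plus up to jitter_percent of random jitter
    to spread out retries from many failing events."""
    base = min(initial_delay_ms * multiplier ** (attempt - 1), max_delay_ms)
    jitter = base * jitter_percent * random.random()
    return base + jitter
```

With the defaults, successive retries wait roughly 100, 200, 400, 800 ms and so on, reaching the 10-second cap around the eighth attempt; the jitter term desynchronizes retry storms.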
timing
Path: orchestration.event_systems.orchestration.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds an event claim remains valid |
| fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the orchestration event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued message remains invisible to other consumers |
orchestration.event_systems.orchestration.timing.claim_timeout_seconds
Maximum time in seconds an event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned claims from blocking event processing indefinitely
orchestration.event_systems.orchestration.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles when LISTEN/NOTIFY is unavailable
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Only active in Hybrid mode when event-driven delivery fails; lower values reduce latency but increase DB load
orchestration.event_systems.orchestration.timing.health_check_interval_seconds
Interval in seconds between health check probes for the orchestration event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the event system verifies its own connectivity and responsiveness
orchestration.event_systems.orchestration.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Events exceeding this timeout are considered failed and may be retried
orchestration.event_systems.orchestration.timing.visibility_timeout_seconds
Time in seconds a dequeued message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: If processing is not completed within this window, the message becomes visible again for redelivery
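The visibility-timeout behavior can be illustrated with a toy in-memory queue. This is a deliberate simplification of PGMQ's actual read semantics, but it captures the contract described above: a read hides a message for the timeout window, and an unacknowledged message reappears afterward.

```python
class ToyQueue:
    """Messages read with a visibility timeout reappear for redelivery
    unless deleted (acknowledged) before the timeout elapses."""
    def __init__(self):
        self._messages = {}   # msg_id -> (payload, invisible_until)
        self._next_id = 0

    def send(self, payload):
        self._messages[self._next_id] = (payload, 0.0)
        self._next_id += 1

    def read(self, visibility_timeout_s: float, now: float):
        # Return the first visible message and hide it for the timeout window.
        for msg_id, (payload, invisible_until) in self._messages.items():
            if now >= invisible_until:
                self._messages[msg_id] = (payload, now + visibility_timeout_s)
                return msg_id, payload
        return None

    def delete(self, msg_id):
        # Acknowledge: the message will never be redelivered.
        self._messages.pop(msg_id, None)
```

A consumer that reads but never deletes sees the same message again once the 30-second default window passes, which is why visibility_timeout_seconds should comfortably exceed typical processing time.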
task_readiness
Path: orchestration.event_systems.task_readiness
| Parameter | Type | Default | Description |
|---|---|---|---|
| deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’ |
| system_id | String | "task-readiness-event-system" | Unique identifier for the task readiness event system instance |
orchestration.event_systems.task_readiness.deployment_mode
Event delivery mode for task readiness: ‘Hybrid’, ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; task readiness events trigger step processing and benefit from low-latency delivery
orchestration.event_systems.task_readiness.system_id
Unique identifier for the task readiness event system instance
- Type: String
- Default: "task-readiness-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish task readiness events from other event systems
health
Path: orchestration.event_systems.task_readiness.health
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable health monitoring for the task readiness event system |
| error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the task readiness system reports as unhealthy |
| max_consecutive_errors | u32 | 10 | Number of consecutive errors before the task readiness system reports as unhealthy |
| performance_monitoring_enabled | bool | true | Enable detailed performance metrics for task readiness event processing |
orchestration.event_systems.task_readiness.health.enabled
Enable health monitoring for the task readiness event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks run for task readiness processing
orchestration.event_systems.task_readiness.health.error_rate_threshold_per_minute
Error rate per minute above which the task readiness system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal complementing max_consecutive_errors
orchestration.event_systems.task_readiness.health.max_consecutive_errors
Number of consecutive errors before the task readiness system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation; resets on any successful readiness check
orchestration.event_systems.task_readiness.health.performance_monitoring_enabled
Enable detailed performance metrics for task readiness event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks readiness check latency and throughput; useful for tuning batch_size and concurrency
processing
Path: orchestration.event_systems.task_readiness.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
| batch_size | u32 | 50 | Number of task readiness events dequeued in a single batch |
| max_concurrent_operations | u32 | 100 | Maximum number of task readiness events processed concurrently |
| max_retries | u32 | 3 | Maximum retry attempts for a failed task readiness event |
orchestration.event_systems.task_readiness.processing.batch_size
Number of task readiness events dequeued in a single batch
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput for readiness evaluation; 50 balances latency and throughput
orchestration.event_systems.task_readiness.processing.max_concurrent_operations
Maximum number of task readiness events processed concurrently
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Higher than orchestration (100 vs 50) because readiness checks are lightweight SQL queries
orchestration.event_systems.task_readiness.processing.max_retries
Maximum retry attempts for a failed task readiness event
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Readiness events are idempotent so retries are safe; limits retry storms
backoff
Path: orchestration.event_systems.task_readiness.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
| initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first task readiness processing failure |
| jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay for readiness retries |
| max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds for task readiness retries |
| multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive readiness failure |
timing
Path: orchestration.event_systems.task_readiness.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
| claim_timeout_seconds | u32 | 300 | Maximum time in seconds a task readiness event claim remains valid |
| fallback_polling_interval_seconds | u32 | 5 | Interval in seconds between fallback polling cycles for task readiness |
| health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the task readiness event system |
| processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single task readiness event |
| visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued task readiness message remains invisible to other consumers |
orchestration.event_systems.task_readiness.timing.claim_timeout_seconds
Maximum time in seconds a task readiness event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned readiness claims from blocking task evaluation
orchestration.event_systems.task_readiness.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles for task readiness
- Type: u32
- Default: 5
- Valid Range: 1-60
- System Impact: Fallback interval when LISTEN/NOTIFY is unavailable; lower values improve responsiveness
orchestration.event_systems.task_readiness.timing.health_check_interval_seconds
Interval in seconds between health check probes for the task readiness event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the task readiness system verifies its own connectivity
orchestration.event_systems.task_readiness.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single task readiness event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Readiness events exceeding this timeout are considered failed
orchestration.event_systems.task_readiness.timing.visibility_timeout_seconds
Time in seconds a dequeued task readiness message remains invisible to other consumers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Prevents duplicate processing of readiness events during normal operation
grpc
Path: orchestration.grpc
| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}" | Socket address for the gRPC server |
| enable_health_service | bool | true | Enable the gRPC health checking service (grpc.health.v1) |
| enable_reflection | bool | true | Enable gRPC server reflection for service discovery |
| enabled | bool | true | Enable the gRPC API server alongside the REST API |
| keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames |
| keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
| max_concurrent_streams | u32 | 200 | Maximum number of concurrent gRPC streams per connection |
| max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame |
| tls_enabled | bool | false | Enable TLS encryption for gRPC connections |
orchestration.grpc.bind_address
Socket address for the gRPC server
- Type: String
- Default: "${TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS:-0.0.0.0:9190}"
- Valid Range: host:port
- System Impact: Must not conflict with the REST API bind_address; default 9190 avoids Prometheus port conflict
orchestration.grpc.enable_health_service
Enable the gRPC health checking service (grpc.health.v1)
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators
orchestration.grpc.enable_reflection
Enable gRPC server reflection for service discovery
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Allows tools like grpcurl to list and inspect services; safe to enable in development, consider disabling in production
orchestration.grpc.enabled
Enable the gRPC API server alongside the REST API
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no gRPC endpoints are available; clients must use REST
orchestration.grpc.keepalive_interval_seconds
Interval in seconds between gRPC keepalive ping frames
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Detects dead connections; lower values detect failures faster but increase network overhead
orchestration.grpc.keepalive_timeout_seconds
Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
- Type: u32
- Default: 20
- Valid Range: 1-300
- System Impact: Connections that fail to acknowledge within this window are considered dead and closed
orchestration.grpc.max_concurrent_streams
Maximum number of concurrent gRPC streams per connection
- Type: u32
- Default: 200
- Valid Range: 1-10000
- System Impact: Limits multiplexed request parallelism per connection; 200 is conservative for orchestration workloads
orchestration.grpc.max_frame_size
Maximum size in bytes of a single HTTP/2 frame
- Type: u32
- Default: 16384
- Valid Range: 16384-16777215
- System Impact: Larger frames reduce framing overhead for large messages but increase memory per stream
orchestration.grpc.tls_enabled
Enable TLS encryption for gRPC connections
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When true, tls_cert_path and tls_key_path must be provided; required for production gRPC deployments
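Defaults written as `${VAR:-fallback}` (such as bind_address above) use shell-style parameter expansion: the fallback after `:-` applies when the environment variable is unset or empty. How Tasker resolves these at load time is an implementation detail, but the lookup mirrors standard shell semantics, sketched here in Python:

```python
import os

def env_or_default(name, default):
    """Mirror shell ${NAME:-default}: fall back when NAME is unset or empty."""
    value = os.environ.get(name, "")
    return value if value else default

# With the variable unset, the documented default applies.
os.environ.pop("TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS", None)
bind = env_or_default("TASKER_ORCHESTRATION_GRPC_BIND_ADDRESS", "0.0.0.0:9190")
```

Setting the environment variable (for example to `127.0.0.1:9999` in CI) overrides the baked-in default without editing the config file.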
mpsc_channels
Path: orchestration.mpsc_channels
command_processor
Path: orchestration.mpsc_channels.command_processor
| Parameter | Type | Default | Description |
|---|---|---|---|
| command_buffer_size | usize | 5000 | Bounded channel capacity for the orchestration command processor |
orchestration.mpsc_channels.command_processor.command_buffer_size
Bounded channel capacity for the orchestration command processor
- Type: usize
- Default: 5000
- Valid Range: 100-100000
- System Impact: Buffers incoming orchestration commands; larger values absorb traffic spikes but use more memory
event_listeners
Path: orchestration.mpsc_channels.event_listeners
| Parameter | Type | Default | Description |
|---|---|---|---|
| pgmq_event_buffer_size | usize | 50000 | Bounded channel capacity for PGMQ event listener notifications |
orchestration.mpsc_channels.event_listeners.pgmq_event_buffer_size
Bounded channel capacity for PGMQ event listener notifications
- Type: usize
- Default: 50000
- Valid Range: 1000-1000000
- System Impact: Large buffer (50000) absorbs high-volume PGMQ LISTEN/NOTIFY events without backpressure on PostgreSQL
event_systems
Path: orchestration.mpsc_channels.event_systems
| Parameter | Type | Default | Description |
|---|---|---|---|
| event_channel_buffer_size | usize | 10000 | Bounded channel capacity for the orchestration event system internal channel |
orchestration.mpsc_channels.event_systems.event_channel_buffer_size
Bounded channel capacity for the orchestration event system internal channel
- Type: usize
- Default: 10000
- Valid Range: 100-100000
- System Impact: Buffers events between the event listener and event processor; larger values absorb notification bursts
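All of these channels are bounded: once the buffer fills, producers experience backpressure instead of unbounded memory growth. A minimal illustration of that tradeoff, using Python's bounded queue as a stand-in for the Rust mpsc channels (capacity shrunk to 3 for the example):

```python
import queue

# A tiny bounded buffer standing in for e.g. command_buffer_size = 5000.
buf = queue.Queue(maxsize=3)

# A burst up to capacity is absorbed without blocking.
for i in range(3):
    buf.put_nowait(i)

# Beyond capacity, the producer sees backpressure (here: an exception;
# a blocking put would stall instead).
try:
    buf.put_nowait(99)
    overflowed = False
except queue.Full:
    overflowed = True
```

Sizing is therefore a memory-versus-burst-tolerance tradeoff: the 50000-slot PGMQ listener buffer rides out notification storms, while the 5000-slot command buffer keeps memory bounded for the heavier command payloads.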
web
Path: orchestration.web
| Parameter | Type | Default | Description |
|---|---|---|---|
| bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}" | Socket address for the REST API server |
| enabled | bool | true | Enable the REST API server for the orchestration service |
| request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for an HTTP request to complete before timeout |
orchestration.web.bind_address
Socket address for the REST API server
- Type: String
- Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8080}"
- Valid Range: host:port
- System Impact: Determines where the orchestration REST API listens; use 0.0.0.0 for container deployments
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8080 | Standard port; use TASKER_WEB_BIND_ADDRESS env var to override in CI |
| test | 0.0.0.0:8080 | Default port for test fixtures |
orchestration.web.enabled
Enable the REST API server for the orchestration service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no HTTP endpoints are available; the service operates via messaging only
orchestration.web.request_timeout_ms
Maximum time in milliseconds for an HTTP request to complete before timeout
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections
auth
Path: orchestration.web.auth
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | String | "" | Static API key for simple key-based authentication |
| api_key_header | String | "X-API-Key" | HTTP header name for API key authentication |
| enabled | bool | false | Enable authentication for the REST API |
| jwt_audience | String | "tasker-api" | Expected ‘aud’ claim in JWT tokens |
| jwt_issuer | String | "tasker-core" | Expected ‘iss’ claim in JWT tokens |
| jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if this service issues tokens) |
| jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures |
| jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for JWT verification |
| jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours |
orchestration.web.auth.api_key
Static API key for simple key-based authentication
- Type: String
- Default: ""
- Valid Range: non-empty string or empty to disable
- System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header
orchestration.web.auth.api_key_header
HTTP header name for API key authentication
- Type: String
- Default: "X-API-Key"
- Valid Range: valid HTTP header name
- System Impact: Clients send their API key in this header; default is X-API-Key
orchestration.web.auth.enabled
Enable authentication for the REST API
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all API endpoints are unauthenticated; enable in production with JWT or API key auth
orchestration.web.auth.jwt_audience
Expected ‘aud’ claim in JWT tokens
- Type: String
- Default: "tasker-api"
- Valid Range: non-empty string
- System Impact: Tokens with a different audience are rejected during validation
orchestration.web.auth.jwt_issuer
Expected ‘iss’ claim in JWT tokens
- Type: String
- Default: "tasker-core"
- Valid Range: non-empty string
- System Impact: Tokens with a different issuer are rejected during validation
orchestration.web.auth.jwt_private_key
PEM-encoded private key for signing JWT tokens (if this service issues tokens)
- Type: String
- Default: ""
- Valid Range: valid PEM private key or empty
- System Impact: Required only if the orchestration service issues its own JWT tokens; leave empty when using external identity providers
orchestration.web.auth.jwt_public_key
PEM-encoded public key for verifying JWT token signatures
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY:-}"
- Valid Range: valid PEM public key or empty
- System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management in production
orchestration.web.auth.jwt_public_key_path
File path to a PEM-encoded public key for JWT verification
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
- Valid Range: valid file path or empty
- System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file
orchestration.web.auth.jwt_token_expiry_hours
Default JWT token validity period in hours
- Type: u32
- Default: 24
- Valid Range: 1-720
- System Impact: Tokens older than this are rejected; shorter values improve security but require more frequent re-authentication
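The jwt_issuer, jwt_audience, and expiry settings imply claim checks along the following lines. This is an illustrative sketch, not Tasker's validation code; in a real deployment the token's signature would be verified against jwt_public_key before any claim is trusted.

```python
import time

def validate_claims(claims, issuer="tasker-core", audience="tasker-api", now=None):
    """Accept a decoded JWT payload only when iss and aud match the
    configured values and the exp timestamp is still in the future."""
    now = time.time() if now is None else now
    return (claims.get("iss") == issuer
            and claims.get("aud") == audience
            and claims.get("exp", 0) > now)
```

A token minted by a different issuer, addressed to a different audience, or past its expiry is rejected regardless of whether its signature is valid.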
database_pools
Path: orchestration.web.database_pools
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_total_connections_hint | u32 | 50 | Advisory hint for the total number of database connections across all orchestration pools |
| web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the web API pool |
| web_api_idle_timeout_seconds | u32 | 600 | Time before an idle web API connection is closed |
| web_api_max_connections | u32 | 30 | Maximum number of connections the web API pool can grow to under load |
| web_api_pool_size | u32 | 20 | Target number of connections in the web API database pool |
orchestration.web.database_pools.max_total_connections_hint
Advisory hint for the total number of database connections across all orchestration pools
- Type: u32
- Default: 50
- Valid Range: 1-1000
- System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint
orchestration.web.database_pools.web_api_connection_timeout_seconds
Maximum time to wait when acquiring a connection from the web API pool
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: API requests that cannot acquire a connection within this window return an error
orchestration.web.database_pools.web_api_idle_timeout_seconds
Time before an idle web API connection is closed
- Type: u32
- Default: 600
- Valid Range: 1-3600
- System Impact: Controls how quickly the web API pool shrinks after traffic subsides
orchestration.web.database_pools.web_api_max_connections
Maximum number of connections the web API pool can grow to under load
- Type: u32
- Default: 30
- Valid Range: 1-500
- System Impact: Hard ceiling for web API database connections; prevents connection exhaustion from traffic spikes
orchestration.web.database_pools.web_api_pool_size
Target number of connections in the web API database pool
- Type: u32
- Default: 20
- Valid Range: 1-200
- System Impact: Determines how many concurrent database queries the REST API can execute
Generated by tasker-ctl docs — Tasker Configuration System
Configuration Reference: worker
90/90 parameters documented
worker
Root-level worker parameters
Path: worker
| Parameter | Type | Default | Description |
|---|---|---|---|
| worker_id | String | "worker-default-001" | Unique identifier for this worker instance |
| worker_type | String | "general" | Worker type classification for routing and reporting |
worker.worker_id
Unique identifier for this worker instance
- Type: String
- Default: "worker-default-001"
- Valid Range: non-empty string
- System Impact: Used in logging, metrics, and step claim attribution; must be unique across all worker instances in a cluster
worker.worker_type
Worker type classification for routing and reporting
- Type: String
- Default: "general"
- Valid Range: non-empty string
- System Impact: Used to match worker capabilities with step handler requirements; ‘general’ handles all step types
circuit_breakers
Path: worker.circuit_breakers
ffi_completion_send
Path: worker.circuit_breakers.ffi_completion_send
| Parameter | Type | Default | Description |
|---|---|---|---|
| failure_threshold | u32 | 5 | Number of consecutive FFI completion send failures before the circuit breaker trips |
| recovery_timeout_seconds | u32 | 5 | Time the FFI completion breaker stays Open before probing with a test send |
| slow_send_threshold_ms | u32 | 100 | Threshold in milliseconds above which FFI completion channel sends are logged as slow |
| success_threshold | u32 | 2 | Consecutive successful sends in Half-Open required to close the breaker |
worker.circuit_breakers.ffi_completion_send.failure_threshold
Number of consecutive FFI completion send failures before the circuit breaker trips
- Type: u32
- Default: 5
- Valid Range: 1-100
- System Impact: Protects the FFI completion channel from cascading failures; when tripped, sends are short-circuited
worker.circuit_breakers.ffi_completion_send.recovery_timeout_seconds
Time the FFI completion breaker stays Open before probing with a test send
- Type: u32
- Default: 5
- Valid Range: 1-300
- System Impact: Short timeout (5s) because FFI channel issues are typically transient
worker.circuit_breakers.ffi_completion_send.slow_send_threshold_ms
Threshold in milliseconds above which FFI completion channel sends are logged as slow
- Type: u32
- Default: 100
- Valid Range: 10-10000
- System Impact: Observability: identifies when the FFI completion channel is under pressure from slow consumers
worker.circuit_breakers.ffi_completion_send.success_threshold
Consecutive successful sends in Half-Open required to close the breaker
- Type: u32
- Default: 2
- Valid Range: 1-100
- System Impact: Low threshold (2) allows fast recovery since FFI send failures are usually transient
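These parameters map onto the classic Closed, Open, and Half-Open breaker states. A compact sketch of those semantics, not Tasker's implementation, using the defaults documented above:

```python
class CircuitBreaker:
    """Closed: sends pass through.  Open: sends short-circuit until
    recovery_timeout elapses.  Half-Open: probe sends are allowed;
    success_threshold consecutive successes close the breaker,
    and any failure reopens it."""
    def __init__(self, failure_threshold=5, success_threshold=2,
                 recovery_timeout_s=5):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        # After the recovery timeout, transition Open -> Half-Open to probe.
        if self.state == "open" and now - self.opened_at >= self.recovery_timeout_s:
            self.state = "half_open"
            self.successes = 0
        return self.state != "open"

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
        self.failures = 0

    def record_failure(self, now: float):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
            self.failures = 0
```

With the defaults, five consecutive failed sends trip the breaker, sends are rejected for five seconds, and two successful probes restore normal operation.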
event_systems
Path: worker.event_systems
worker
Path: worker.event_systems.worker
| Parameter | Type | Default | Description |
|---|---|---|---|
deployment_mode | DeploymentMode | "Hybrid" | Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’ |
system_id | String | "worker-event-system" | Unique identifier for the worker event system instance |
worker.event_systems.worker.deployment_mode
Event delivery mode: ‘Hybrid’ (LISTEN/NOTIFY + polling fallback), ‘EventDrivenOnly’, or ‘PollingOnly’
- Type: DeploymentMode
- Default: "Hybrid"
- Valid Range: Hybrid | EventDrivenOnly | PollingOnly
- System Impact: Hybrid is recommended; EventDrivenOnly has lowest latency but no fallback; PollingOnly has highest latency but no LISTEN/NOTIFY dependency
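A minimal sketch of selecting the recommended mode, again assuming the TOML layout implied by the dotted paths (key names unverified):

```toml
# Hypothetical TOML layout inferred from the dotted parameter paths above.
[worker.event_systems.worker]
deployment_mode = "Hybrid"          # LISTEN/NOTIFY with polling fallback
system_id = "worker-event-system"   # distinguishes this instance in logs/metrics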
worker.event_systems.worker.system_id
Unique identifier for the worker event system instance
- Type: String
- Default: "worker-event-system"
- Valid Range: non-empty string
- System Impact: Used in logging and metrics to distinguish this event system from others
health
Path: worker.event_systems.worker.health
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable health monitoring for the worker event system |
error_rate_threshold_per_minute | u32 | 20 | Error rate per minute above which the worker event system reports as unhealthy |
max_consecutive_errors | u32 | 10 | Number of consecutive errors before the worker event system reports as unhealthy |
performance_monitoring_enabled | bool | true | Enable detailed performance metrics for worker event processing |
worker.event_systems.worker.health.enabled
Enable health monitoring for the worker event system
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no health checks or error tracking run for the worker event system
worker.event_systems.worker.health.error_rate_threshold_per_minute
Error rate per minute above which the worker event system reports as unhealthy
- Type: u32
- Default: 20
- Valid Range: 1-10000
- System Impact: Rate-based health signal complementing max_consecutive_errors
worker.event_systems.worker.health.max_consecutive_errors
Number of consecutive errors before the worker event system reports as unhealthy
- Type: u32
- Default: 10
- Valid Range: 1-1000
- System Impact: Triggers health status degradation; resets on any successful event processing
worker.event_systems.worker.health.performance_monitoring_enabled
Enable detailed performance metrics for worker event processing
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Tracks step dispatch latency and throughput; useful for tuning concurrency settings
metadata
Path: worker.event_systems.worker.metadata
fallback_poller
Path: worker.event_systems.worker.metadata.fallback_poller
| Parameter | Type | Default | Description |
|---|---|---|---|
age_threshold_seconds | u32 | 5 | Minimum age in seconds of a message before the fallback poller will pick it up |
batch_size | u32 | 20 | Number of messages to dequeue in a single fallback poll |
enabled | bool | true | Enable the fallback polling mechanism for step dispatch |
max_age_hours | u32 | 24 | Maximum age in hours of messages the fallback poller will process |
polling_interval_ms | u32 | 1000 | Interval in milliseconds between fallback polling cycles |
supported_namespaces | Vec<String> | [] | List of queue namespaces the fallback poller monitors; empty means all namespaces |
visibility_timeout_seconds | u32 | 30 | Time in seconds a message polled by the fallback mechanism remains invisible |
in_process_events
Path: worker.event_systems.worker.metadata.in_process_events
| Parameter | Type | Default | Description |
|---|---|---|---|
deduplication_cache_size | usize | 10000 | Number of event IDs to cache for deduplication of in-process events |
ffi_integration_enabled | bool | true | Enable FFI integration for in-process event delivery to Ruby/Python workers |
listener
Path: worker.event_systems.worker.metadata.listener
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_processing | bool | true | Enable batch processing of accumulated LISTEN/NOTIFY events |
connection_timeout_seconds | u32 | 30 | Maximum time to wait when establishing the LISTEN/NOTIFY PostgreSQL connection |
event_timeout_seconds | u32 | 60 | Maximum time to wait for a LISTEN/NOTIFY event before yielding |
max_retry_attempts | u32 | 5 | Maximum number of listener reconnection attempts before falling back to polling |
retry_interval_seconds | u32 | 5 | Interval in seconds between LISTEN/NOTIFY listener reconnection attempts |
processing
Path: worker.event_systems.worker.processing
| Parameter | Type | Default | Description |
|---|---|---|---|
batch_size | u32 | 20 | Number of events dequeued in a single batch read by the worker |
max_concurrent_operations | u32 | 100 | Maximum number of events processed concurrently by the worker event system |
max_retries | u32 | 3 | Maximum retry attempts for a failed worker event processing operation |
worker.event_systems.worker.processing.batch_size
Number of events dequeued in a single batch read by the worker
- Type: u32
- Default: 20
- Valid Range: 1-1000
- System Impact: Larger batches improve throughput but increase per-batch processing time
worker.event_systems.worker.processing.max_concurrent_operations
Maximum number of events processed concurrently by the worker event system
- Type: u32
- Default: 100
- Valid Range: 1-10000
- System Impact: Controls parallelism for step dispatch and completion processing
worker.event_systems.worker.processing.max_retries
Maximum retry attempts for a failed worker event processing operation
- Type: u32
- Default: 3
- Valid Range: 0-100
- System Impact: Events exceeding this retry count are dropped or sent to the DLQ
backoff
Path: worker.event_systems.worker.processing.backoff
| Parameter | Type | Default | Description |
|---|---|---|---|
initial_delay_ms | u64 | 100 | Initial backoff delay in milliseconds after first worker event processing failure |
jitter_percent | f64 | 0.1 | Maximum jitter as a fraction of the computed backoff delay |
max_delay_ms | u64 | 10000 | Maximum backoff delay in milliseconds between worker event retries |
multiplier | f64 | 2.0 | Multiplier applied to the backoff delay after each consecutive failure |
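These four parameters compose as standard exponential backoff with jitter. A minimal Python sketch of the likely computation, under the assumption that jitter is added as a random fraction of the computed delay (the exact jitter placement is an assumption, not stated by this reference):

```python
import random

def backoff_delay_ms(attempt, initial_delay_ms=100, multiplier=2.0,
                     max_delay_ms=10000, jitter_percent=0.1,
                     rng=random.random):
    """Delay before retry `attempt` (0-based), mirroring the table defaults."""
    base = min(initial_delay_ms * multiplier ** attempt, max_delay_ms)
    # jitter_percent is the maximum jitter as a fraction of the computed delay
    return base + base * jitter_percent * rng()
```

With the defaults, retries back off 100ms, 200ms, 400ms, and so on, capping at 10s, with up to 10% random jitter to avoid synchronized retry storms.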
timing
Path: worker.event_systems.worker.timing
| Parameter | Type | Default | Description |
|---|---|---|---|
claim_timeout_seconds | u32 | 300 | Maximum time in seconds a worker event claim remains valid |
fallback_polling_interval_seconds | u32 | 2 | Interval in seconds between fallback polling cycles for step dispatch |
health_check_interval_seconds | u32 | 30 | Interval in seconds between health check probes for the worker event system |
processing_timeout_seconds | u32 | 60 | Maximum time in seconds allowed for processing a single worker event |
visibility_timeout_seconds | u32 | 30 | Time in seconds a dequeued step dispatch message remains invisible to other workers |
worker.event_systems.worker.timing.claim_timeout_seconds
Maximum time in seconds a worker event claim remains valid
- Type: u32
- Default: 300
- Valid Range: 1-3600
- System Impact: Prevents abandoned claims from blocking step processing indefinitely
worker.event_systems.worker.timing.fallback_polling_interval_seconds
Interval in seconds between fallback polling cycles for step dispatch
- Type: u32
- Default: 2
- Valid Range: 1-60
- System Impact: Shorter than orchestration (2s vs 5s) because workers need fast step pickup for low latency
worker.event_systems.worker.timing.health_check_interval_seconds
Interval in seconds between health check probes for the worker event system
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Controls how frequently the worker event system verifies its own connectivity
worker.event_systems.worker.timing.processing_timeout_seconds
Maximum time in seconds allowed for processing a single worker event
- Type: u32
- Default: 60
- Valid Range: 1-3600
- System Impact: Events exceeding this timeout are considered failed and may be retried
worker.event_systems.worker.timing.visibility_timeout_seconds
Time in seconds a dequeued step dispatch message remains invisible to other workers
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Prevents duplicate step execution; must be longer than typical step processing time
grpc
Path: worker.grpc
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}" | Socket address for the worker gRPC server |
enable_health_service | bool | true | Enable the gRPC health checking service on the worker |
enable_reflection | bool | true | Enable gRPC server reflection for the worker service |
enabled | bool | true | Enable the gRPC API server for the worker service |
keepalive_interval_seconds | u32 | 30 | Interval in seconds between gRPC keepalive ping frames on the worker |
keepalive_timeout_seconds | u32 | 20 | Time in seconds to wait for a keepalive ping acknowledgment before closing the connection |
max_concurrent_streams | u32 | 1000 | Maximum number of concurrent gRPC streams per connection |
max_frame_size | u32 | 16384 | Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server |
tls_enabled | bool | false | Enable TLS encryption for worker gRPC connections |
worker.grpc.bind_address
Socket address for the worker gRPC server
- Type: String
- Default: "${TASKER_WORKER_GRPC_BIND_ADDRESS:-0.0.0.0:9191}"
- Valid Range: host:port
- System Impact: Must not conflict with the REST API or orchestration gRPC ports; default 9191
worker.grpc.enable_health_service
Enable the gRPC health checking service on the worker
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Required for gRPC-native health checks used by load balancers and container orchestrators
worker.grpc.enable_reflection
Enable gRPC server reflection for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Allows tools like grpcurl to list and inspect worker services; consider disabling in production
worker.grpc.enabled
Enable the gRPC API server for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no gRPC endpoints are available; clients must use REST
worker.grpc.keepalive_interval_seconds
Interval in seconds between gRPC keepalive ping frames on the worker
- Type: u32
- Default: 30
- Valid Range: 1-3600
- System Impact: Detects dead connections; lower values detect failures faster but increase network overhead
worker.grpc.keepalive_timeout_seconds
Time in seconds to wait for a keepalive ping acknowledgment before closing the connection
- Type: u32
- Default: 20
- Valid Range: 1-300
- System Impact: Connections that fail to acknowledge within this window are considered dead and closed
worker.grpc.max_concurrent_streams
Maximum number of concurrent gRPC streams per connection
- Type: u32
- Default: 1000
- Valid Range: 1-10000
- System Impact: Workers typically handle more concurrent streams than orchestration; default 1000 reflects this
worker.grpc.max_frame_size
Maximum size in bytes of a single HTTP/2 frame for the worker gRPC server
- Type: u32
- Default: 16384
- Valid Range: 16384-16777215
- System Impact: Larger frames reduce framing overhead for large messages but increase memory per-stream
worker.grpc.tls_enabled
Enable TLS encryption for worker gRPC connections
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When true, TLS cert and key paths must be provided; required for production gRPC deployments
mpsc_channels
Path: worker.mpsc_channels
command_processor
Path: worker.mpsc_channels.command_processor
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 2000 | Bounded channel capacity for the worker command processor |
worker.mpsc_channels.command_processor.command_buffer_size
Bounded channel capacity for the worker command processor
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Buffers incoming worker commands; smaller than orchestration (2000 vs 5000) since workers process fewer command types
domain_events
Path: worker.mpsc_channels.domain_events
| Parameter | Type | Default | Description |
|---|---|---|---|
command_buffer_size | usize | 1000 | Bounded channel capacity for domain event system commands |
log_dropped_events | bool | true | Log a warning when domain events are dropped due to channel saturation |
shutdown_drain_timeout_ms | u32 | 5000 | Maximum time in milliseconds to drain pending domain events during shutdown |
worker.mpsc_channels.domain_events.command_buffer_size
Bounded channel capacity for domain event system commands
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers domain event system control commands such as publish, subscribe, and shutdown
worker.mpsc_channels.domain_events.log_dropped_events
Log a warning when domain events are dropped due to channel saturation
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Observability: helps detect when event volume exceeds channel capacity
worker.mpsc_channels.domain_events.shutdown_drain_timeout_ms
Maximum time in milliseconds to drain pending domain events during shutdown
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Ensures in-flight domain events are delivered before the worker exits; prevents event loss
event_listeners
Path: worker.mpsc_channels.event_listeners
| Parameter | Type | Default | Description |
|---|---|---|---|
pgmq_event_buffer_size | usize | 10000 | Bounded channel capacity for PGMQ event listener notifications on the worker |
worker.mpsc_channels.event_listeners.pgmq_event_buffer_size
Bounded channel capacity for PGMQ event listener notifications on the worker
- Type: usize
- Default: 10000
- Valid Range: 1000-1000000
- System Impact: Buffers PGMQ LISTEN/NOTIFY events; smaller than orchestration (10000 vs 50000) since workers handle fewer event types
event_subscribers
Path: worker.mpsc_channels.event_subscribers
| Parameter | Type | Default | Description |
|---|---|---|---|
completion_buffer_size | usize | 1000 | Bounded channel capacity for step completion event subscribers |
result_buffer_size | usize | 1000 | Bounded channel capacity for step result event subscribers |
worker.mpsc_channels.event_subscribers.completion_buffer_size
Bounded channel capacity for step completion event subscribers
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step completion notifications before they are forwarded to the orchestration service
worker.mpsc_channels.event_subscribers.result_buffer_size
Bounded channel capacity for step result event subscribers
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step execution results before they are published to the result queue
event_systems
Path: worker.mpsc_channels.event_systems
| Parameter | Type | Default | Description |
|---|---|---|---|
event_channel_buffer_size | usize | 2000 | Bounded channel capacity for the worker event system internal channel |
worker.mpsc_channels.event_systems.event_channel_buffer_size
Bounded channel capacity for the worker event system internal channel
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Buffers events between the listener and processor; sized for worker-level throughput
ffi_dispatch
Path: worker.mpsc_channels.ffi_dispatch
| Parameter | Type | Default | Description |
|---|---|---|---|
callback_timeout_ms | u32 | 5000 | Maximum time in milliseconds for FFI fire-and-forget domain event callbacks |
completion_send_timeout_ms | u32 | 10000 | Maximum time in milliseconds to retry sending FFI completion results when the channel is full |
completion_timeout_ms | u32 | 30000 | Maximum time in milliseconds to wait for an FFI step handler to complete |
dispatch_buffer_size | usize | 1000 | Bounded channel capacity for FFI step dispatch requests |
starvation_warning_threshold_ms | u32 | 10000 | Age in milliseconds of pending FFI events that triggers a starvation warning |
worker.mpsc_channels.ffi_dispatch.callback_timeout_ms
Maximum time in milliseconds for FFI fire-and-forget domain event callbacks
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Prevents indefinite blocking of FFI threads during domain event publishing
worker.mpsc_channels.ffi_dispatch.completion_send_timeout_ms
Maximum time in milliseconds to retry sending FFI completion results when the channel is full
- Type: u32
- Default: 10000
- Valid Range: 1000-300000
- System Impact: Uses try_send with retry loop instead of blocking send to prevent deadlocks
worker.mpsc_channels.ffi_dispatch.completion_timeout_ms
Maximum time in milliseconds to wait for an FFI step handler to complete
- Type: u32
- Default: 30000
- Valid Range: 1000-600000
- System Impact: FFI handlers exceeding this timeout are considered failed; guards against hung FFI threads
worker.mpsc_channels.ffi_dispatch.dispatch_buffer_size
Bounded channel capacity for FFI step dispatch requests
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers step execution requests destined for Ruby/Python FFI handlers
worker.mpsc_channels.ffi_dispatch.starvation_warning_threshold_ms
Age in milliseconds of pending FFI events that triggers a starvation warning
- Type: u32
- Default: 10000
- Valid Range: 1000-300000
- System Impact: Proactive detection of FFI channel starvation before completion_timeout_ms is reached
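The completion-send path is described as a try_send retry loop rather than a blocking send. As an illustrative model only (the real implementation is Rust-side; names here are invented), the semantics of completion_send_timeout_ms look roughly like:

```python
import queue
import time

def send_with_retry(q, item, timeout_ms=10000, poll_ms=10,
                    clock=time.monotonic, sleep=time.sleep):
    """Non-blocking puts with a deadline: a full channel never deadlocks
    the sender. Returns False when the timeout elapses without success."""
    deadline = clock() + timeout_ms / 1000.0
    while True:
        try:
            q.put_nowait(item)          # the try_send analogue
            return True
        except queue.Full:
            if clock() >= deadline:
                return False            # result dropped after the timeout
            sleep(poll_ms / 1000.0)     # back off briefly, then retry
```

A False return presumably feeds the ffi_completion_send circuit breaker's failure count, though that linkage is an assumption.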
handler_dispatch
Path: worker.mpsc_channels.handler_dispatch
| Parameter | Type | Default | Description |
|---|---|---|---|
completion_buffer_size | usize | 1000 | Bounded channel capacity for step handler completion notifications |
dispatch_buffer_size | usize | 1000 | Bounded channel capacity for step handler dispatch requests |
handler_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a step handler to complete execution |
max_concurrent_handlers | u32 | 10 | Maximum number of step handlers executing simultaneously |
worker.mpsc_channels.handler_dispatch.completion_buffer_size
Bounded channel capacity for step handler completion notifications
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers handler completion results before they are forwarded to the result processor
worker.mpsc_channels.handler_dispatch.dispatch_buffer_size
Bounded channel capacity for step handler dispatch requests
- Type: usize
- Default: 1000
- Valid Range: 100-50000
- System Impact: Buffers incoming step execution requests before handler assignment
worker.mpsc_channels.handler_dispatch.handler_timeout_ms
Maximum time in milliseconds for a step handler to complete execution
- Type: u32
- Default: 30000
- Valid Range: 1000-600000
- System Impact: Handlers exceeding this timeout are cancelled; prevents hung handlers from consuming capacity
worker.mpsc_channels.handler_dispatch.max_concurrent_handlers
Maximum number of step handlers executing simultaneously
- Type: u32
- Default: 10
- Valid Range: 1-10000
- System Impact: Controls per-worker parallelism; bounded by the handler dispatch semaphore
load_shedding
Path: worker.mpsc_channels.handler_dispatch.load_shedding
| Parameter | Type | Default | Description |
|---|---|---|---|
capacity_threshold_percent | f64 | 80.0 | Handler capacity percentage above which new step claims are refused |
enabled | bool | true | Enable load shedding to refuse step claims when handler capacity is nearly exhausted |
warning_threshold_percent | f64 | 70.0 | Handler capacity percentage at which warning logs are emitted |
worker.mpsc_channels.handler_dispatch.load_shedding.capacity_threshold_percent
Handler capacity percentage above which new step claims are refused
- Type: f64
- Default: 80.0
- Valid Range: 0.0-100.0
- System Impact: At 80%, the worker stops accepting new steps when 80% of max_concurrent_handlers are busy
worker.mpsc_channels.handler_dispatch.load_shedding.enabled
Enable load shedding to refuse step claims when handler capacity is nearly exhausted
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When true, the worker refuses new step claims above the capacity threshold to prevent overload
worker.mpsc_channels.handler_dispatch.load_shedding.warning_threshold_percent
Handler capacity percentage at which warning logs are emitted
- Type: f64
- Default: 70.0
- Valid Range: 0.0-100.0
- System Impact: Observability: alerts operators that the worker is approaching its capacity limit
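With the defaults (max_concurrent_handlers = 10, thresholds 70%/80%), the claim decision reduces to simple arithmetic. A sketch, assuming the thresholds are inclusive (whether the exact boundary refuses is an assumption):

```python
def claim_decision(active_handlers, max_concurrent_handlers=10,
                   capacity_threshold_percent=80.0,
                   warning_threshold_percent=70.0):
    """Decide whether a worker should claim another step."""
    used_pct = active_handlers / max_concurrent_handlers * 100.0
    if used_pct >= capacity_threshold_percent:
        return "refuse"            # load shedding kicks in
    if used_pct >= warning_threshold_percent:
        return "accept_and_warn"   # operators see a capacity warning
    return "accept"
```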
in_process_events
Path: worker.mpsc_channels.in_process_events
| Parameter | Type | Default | Description |
|---|---|---|---|
broadcast_buffer_size | usize | 2000 | Bounded broadcast channel capacity for in-process domain event delivery |
dispatch_timeout_ms | u32 | 5000 | Maximum time in milliseconds to wait when dispatching an in-process event |
log_subscriber_errors | bool | true | Log errors when in-process event subscribers fail to receive events |
worker.mpsc_channels.in_process_events.broadcast_buffer_size
Bounded broadcast channel capacity for in-process domain event delivery
- Type: usize
- Default: 2000
- Valid Range: 100-100000
- System Impact: Controls how many domain events can be buffered before slow subscribers cause backpressure
worker.mpsc_channels.in_process_events.dispatch_timeout_ms
Maximum time in milliseconds to wait when dispatching an in-process event
- Type: u32
- Default: 5000
- Valid Range: 100-60000
- System Impact: Prevents event dispatch from blocking indefinitely if all subscribers are slow
worker.mpsc_channels.in_process_events.log_subscriber_errors
Log errors when in-process event subscribers fail to receive events
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: Observability: helps identify slow or failing event subscribers
orchestration_client
Path: worker.orchestration_client
| Parameter | Type | Default | Description |
|---|---|---|---|
base_url | String | "http://localhost:8080" | Base URL of the orchestration REST API that this worker reports to |
max_retries | u32 | 3 | Maximum retry attempts for failed orchestration API calls |
timeout_ms | u32 | 30000 | HTTP request timeout in milliseconds for orchestration API calls |
worker.orchestration_client.base_url
Base URL of the orchestration REST API that this worker reports to
- Type: String
- Default: "http://localhost:8080"
- Valid Range: valid HTTP(S) URL
- System Impact: Workers send step completion results and health reports to this endpoint
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | http://orchestration:8080 | Container-internal DNS in Kubernetes/Docker |
| test | http://localhost:8080 | Local orchestration for testing |
Related: orchestration.web.bind_address
worker.orchestration_client.max_retries
Maximum retry attempts for failed orchestration API calls
- Type: u32
- Default: 3
- Valid Range: 0-10
- System Impact: Retries use backoff; higher values improve resilience to transient network issues
worker.orchestration_client.timeout_ms
HTTP request timeout in milliseconds for orchestration API calls
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Worker-to-orchestration calls exceeding this timeout fail and may be retried
web
Path: worker.web
| Parameter | Type | Default | Description |
|---|---|---|---|
bind_address | String | "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}" | Socket address for the worker REST API server |
enabled | bool | true | Enable the REST API server for the worker service |
request_timeout_ms | u32 | 30000 | Maximum time in milliseconds for a worker HTTP request to complete before timeout |
worker.web.bind_address
Socket address for the worker REST API server
- Type: String
- Default: "${TASKER_WEB_BIND_ADDRESS:-0.0.0.0:8081}"
- Valid Range: host:port
- System Impact: Must not conflict with orchestration.web.bind_address when co-located; default 8081
Environment Recommendations:
| Environment | Value | Rationale |
|---|---|---|
| production | 0.0.0.0:8081 | Standard worker port; use TASKER_WEB_BIND_ADDRESS env var to override |
| test | 0.0.0.0:8081 | Default port offset from orchestration (8080) |
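The `${VAR:-default}` defaults used throughout this reference follow shell parameter-expansion syntax. A small Python illustration of how such a placeholder resolves (the actual expansion happens inside Tasker's config loader; this is only a model):

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}")

def resolve(value, env=None):
    """Expand ${VAR:-default}: use the env var when set and non-empty,
    otherwise the literal default after ':-'."""
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(
        lambda m: env.get(m.group(1)) or m.group(2), value)
```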
worker.web.enabled
Enable the REST API server for the worker service
- Type: bool
- Default: true
- Valid Range: true/false
- System Impact: When false, no HTTP endpoints are available; the worker operates via messaging only
worker.web.request_timeout_ms
Maximum time in milliseconds for a worker HTTP request to complete before timeout
- Type: u32
- Default: 30000
- Valid Range: 100-300000
- System Impact: Requests exceeding this timeout return HTTP 408; protects against slow client connections
auth
Path: worker.web.auth
| Parameter | Type | Default | Description |
|---|---|---|---|
api_key | String | "" | Static API key for simple key-based authentication on the worker API |
api_key_header | String | "X-API-Key" | HTTP header name for API key authentication on the worker API |
enabled | bool | false | Enable authentication for the worker REST API |
jwt_audience | String | "worker-api" | Expected ‘aud’ claim in JWT tokens for the worker API |
jwt_issuer | String | "tasker-worker" | Expected ‘iss’ claim in JWT tokens for the worker API |
jwt_private_key | String | "" | PEM-encoded private key for signing JWT tokens (if the worker issues tokens) |
jwt_public_key | String | "${TASKER_JWT_PUBLIC_KEY:-}" | PEM-encoded public key for verifying JWT token signatures on the worker API |
jwt_public_key_path | String | "${TASKER_JWT_PUBLIC_KEY_PATH:-}" | File path to a PEM-encoded public key for worker JWT verification |
jwt_token_expiry_hours | u32 | 24 | Default JWT token validity period in hours for worker API tokens |
worker.web.auth.api_key
Static API key for simple key-based authentication on the worker API
- Type: String
- Default: ""
- Valid Range: non-empty string or empty to disable
- System Impact: When non-empty and auth is enabled, clients can authenticate by sending this key in the api_key_header
worker.web.auth.api_key_header
HTTP header name for API key authentication on the worker API
- Type: String
- Default: "X-API-Key"
- Valid Range: valid HTTP header name
- System Impact: Clients send their API key in this header; default is X-API-Key
worker.web.auth.enabled
Enable authentication for the worker REST API
- Type: bool
- Default: false
- Valid Range: true/false
- System Impact: When false, all worker API endpoints are unauthenticated
worker.web.auth.jwt_audience
Expected ‘aud’ claim in JWT tokens for the worker API
- Type: String
- Default: "worker-api"
- Valid Range: non-empty string
- System Impact: Tokens with a different audience are rejected during validation
worker.web.auth.jwt_issuer
Expected ‘iss’ claim in JWT tokens for the worker API
- Type: String
- Default: "tasker-worker"
- Valid Range: non-empty string
- System Impact: Tokens with a different issuer are rejected; default ‘tasker-worker’ distinguishes worker tokens from orchestration tokens
worker.web.auth.jwt_private_key
PEM-encoded private key for signing JWT tokens (if the worker issues tokens)
- Type: String
- Default: ""
- Valid Range: valid PEM private key or empty
- System Impact: Required only if the worker service issues its own JWT tokens; typically empty
worker.web.auth.jwt_public_key
PEM-encoded public key for verifying JWT token signatures on the worker API
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY:-}"
- Valid Range: valid PEM public key or empty
- System Impact: Required for JWT validation; prefer jwt_public_key_path for file-based key management
worker.web.auth.jwt_public_key_path
File path to a PEM-encoded public key for worker JWT verification
- Type: String
- Default: "${TASKER_JWT_PUBLIC_KEY_PATH:-}"
- Valid Range: valid file path or empty
- System Impact: Alternative to inline jwt_public_key; supports key rotation by replacing the file
worker.web.auth.jwt_token_expiry_hours
Default JWT token validity period in hours for worker API tokens
- Type: u32
- Default: 24
- Valid Range: 1-720
- System Impact: Tokens older than this are rejected; shorter values improve security
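Putting the JWT-related settings together, a hedged sketch of a production-leaning auth block, assuming the TOML layout implied by the dotted parameter paths (key names, file path, and expiry value are illustrative, not prescribed):

```toml
# Hypothetical TOML layout inferred from the dotted parameter paths above.
[worker.web.auth]
enabled = true
jwt_issuer = "tasker-worker"
jwt_audience = "worker-api"
jwt_public_key_path = "/etc/tasker/keys/jwt-public.pem"  # file-based key enables rotation
jwt_token_expiry_hours = 8   # shorter than the 24h default for tighter security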
database_pools
Path: worker.web.database_pools
| Parameter | Type | Default | Description |
|---|---|---|---|
max_total_connections_hint | u32 | 25 | Advisory hint for the total number of database connections across all worker pools |
web_api_connection_timeout_seconds | u32 | 30 | Maximum time to wait when acquiring a connection from the worker web API pool |
web_api_idle_timeout_seconds | u32 | 600 | Time before an idle worker web API connection is closed |
web_api_max_connections | u32 | 15 | Maximum number of connections the worker web API pool can grow to under load |
web_api_pool_size | u32 | 10 | Target number of connections in the worker web API database pool |
worker.web.database_pools.max_total_connections_hint
Advisory hint for the total number of database connections across all worker pools
- Type: u32
- Default: 25
- Valid Range: 1-1000
- System Impact: Used for capacity planning; not enforced but logged if actual connections exceed this hint
worker.web.database_pools.web_api_connection_timeout_seconds
Maximum time to wait when acquiring a connection from the worker web API pool
- Type: u32
- Default: 30
- Valid Range: 1-300
- System Impact: Worker API requests that cannot acquire a connection within this window return an error
worker.web.database_pools.web_api_idle_timeout_seconds
Time before an idle worker web API connection is closed
- Type: u32
- Default: 600
- Valid Range: 1-3600
- System Impact: Controls how quickly the worker web API pool shrinks after traffic subsides
worker.web.database_pools.web_api_max_connections
Maximum number of connections the worker web API pool can grow to under load
- Type: u32
- Default: 15
- Valid Range: 1-500
- System Impact: Hard ceiling for worker web API database connections
worker.web.database_pools.web_api_pool_size
Target number of connections in the worker web API database pool
- Type: u32
- Default: 10
- Valid Range: 1-200
- System Impact: Determines how many concurrent database queries the worker REST API can execute; smaller than orchestration
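Since max_total_connections_hint is advisory rather than enforced, the check it implies is just a comparison against the sum of actual pool ceilings. A sketch (function name is illustrative):

```python
def check_connection_hint(pool_max_connections, hint=25):
    """Compare total pool ceilings to the advisory hint; log, never enforce."""
    total = sum(pool_max_connections.values())
    return {"total": total, "hint": hint, "exceeds_hint": total > hint}
```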
Generated by tasker-ctl docs — Tasker Configuration System
Crate Dependency Graph
Auto-generated from Cargo.toml workspace analysis. Do not edit manually.
Regenerate with:
cargo make generate-crate-deps
This diagram shows the inter-crate dependency structure of the tasker-core workspace. Arrows point from dependent to dependency (A → B means “A depends on B”).
```mermaid
graph TD
    subgraph core["Core Libraries"]
        tasker_pgmq["tasker-pgmq"]
        tasker_shared["tasker-shared"]
    end
    subgraph services["Services"]
        tasker_client["tasker-client"]
        tasker_ctl["tasker-ctl"]
        tasker_orchestration["tasker-orchestration"]
        tasker_worker["tasker-worker"]
    end
    subgraph workers["FFI Workers"]
        tasker_py["tasker-py"]
        tasker_rb["tasker-rb"]
        tasker_worker_rust["tasker-worker-rust"]
        tasker_ts["tasker-ts"]
    end
    tasker_client --> tasker_shared
    tasker_ctl --> tasker_client
    tasker_ctl --> tasker_shared
    tasker_orchestration --> tasker_pgmq
    tasker_orchestration --> tasker_shared
    tasker_worker --> tasker_pgmq
    tasker_worker --> tasker_client
    tasker_worker --> tasker_shared
    tasker_py --> tasker_shared
    tasker_py --> tasker_worker
    tasker_rb --> tasker_shared
    tasker_rb --> tasker_worker
    tasker_worker_rust --> tasker_shared
    tasker_worker_rust --> tasker_worker
    tasker_ts --> tasker_shared
    tasker_ts --> tasker_worker
    classDef coreLib fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    classDef service fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef worker fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    class tasker_pgmq,tasker_shared coreLib
    class tasker_client,tasker_ctl,tasker_orchestration,tasker_worker service
    class tasker_py,tasker_rb,tasker_worker_rust,tasker_ts worker
```
Workspace Crates
| Crate | Category | Dependencies |
|---|---|---|
| tasker-pgmq | Core Library | (none) |
| tasker-client | Service | tasker-shared |
| tasker-ctl | Service | tasker-client, tasker-shared |
| tasker-orchestration | Service | tasker-pgmq, tasker-shared |
| tasker-shared | Core Library | (none) |
| tasker-worker | Service | tasker-pgmq, tasker-client, tasker-shared |
| tasker-py | FFI Worker | tasker-shared, tasker-worker |
| tasker-rb | FFI Worker | tasker-shared, tasker-worker |
| tasker-worker-rust | FFI Worker | tasker-core, tasker-shared, tasker-worker |
| tasker-ts | FFI Worker | tasker-shared, tasker-worker |
Generated by generate-crate-deps.sh from tasker-core workspace analysis
Database Schema
Auto-generated from SQL migration analysis. Do not edit manually.
Regenerate with:
cargo make generate-db-schema
The Tasker database uses PostgreSQL with the tasker schema. All tables use UUID v7
primary keys for time-ordered identifiers. The schema supports PostgreSQL 17 (via
pg_uuidv7 extension) and PostgreSQL 18+ (native uuidv7() function).
Entity Relationship Diagram
erDiagram
named_steps {
uuid named_step_uuid PK
varchar name
varchar description
timestamp created_at
timestamp updated_at
}
named_tasks {
uuid named_task_uuid PK
uuid task_namespace_uuid FK
varchar name
varchar description
varchar version
jsonb configuration
timestamp created_at
timestamp updated_at
}
named_tasks_named_steps {
uuid ntns_uuid PK
uuid named_task_uuid FK
uuid named_step_uuid FK
boolean default_retryable
integer default_max_attempts
timestamp created_at
timestamp updated_at
}
workflow_step_edges {
uuid workflow_step_edge_uuid PK
uuid from_step_uuid FK
uuid to_step_uuid FK
varchar name
timestamp created_at
timestamp updated_at
}
workflow_steps {
uuid workflow_step_uuid PK
uuid task_uuid FK
uuid named_step_uuid FK
boolean retryable
integer max_attempts
boolean in_process
boolean processed
timestamp processed_at
integer attempts
timestamp last_attempted_at
integer backoff_request_seconds
jsonb inputs
jsonb results
timestamp created_at
timestamp updated_at
integer priority
jsonb checkpoint
}
task_namespaces {
uuid task_namespace_uuid PK
varchar name
varchar description
timestamp created_at
timestamp updated_at
}
task_transitions {
uuid task_transition_uuid PK
uuid task_uuid FK
varchar to_state
varchar from_state
jsonb metadata
integer sort_key
boolean most_recent
timestamp created_at
timestamp updated_at
uuid processor_uuid FK
jsonb transition_metadata
}
tasks {
uuid task_uuid PK
uuid named_task_uuid FK
boolean complete
timestamp requested_at
timestamp completed_at
varchar initiator
varchar source_system
varchar reason
jsonb tags
jsonb context
varchar identity_hash
timestamp created_at
timestamp updated_at
integer priority
uuid correlation_id
uuid parent_correlation_id
}
tasks_dlq {
uuid dlq_entry_uuid PK
uuid task_uuid FK
varchar original_state
enum dlq_reason
timestamp dlq_timestamp
enum resolution_status
timestamp resolution_timestamp
text resolution_notes
varchar resolved_by
jsonb task_snapshot
jsonb metadata
timestamp created_at
timestamp updated_at
}
workflow_step_result_audit {
uuid workflow_step_result_audit_uuid PK
uuid workflow_step_uuid FK
uuid workflow_step_transition_uuid FK
uuid task_uuid FK
timestamp recorded_at
uuid worker_uuid FK
uuid correlation_id
boolean success
bigint execution_time_ms
timestamp created_at
timestamp updated_at
}
workflow_step_transitions {
uuid workflow_step_transition_uuid PK
uuid workflow_step_uuid FK
varchar to_state
varchar from_state
jsonb metadata
integer sort_key
boolean most_recent
timestamp created_at
timestamp updated_at
}
named_steps ||--o{ named_tasks_named_steps : "named_step_uuid"
named_tasks ||--o{ named_tasks_named_steps : "named_task_uuid"
task_namespaces ||--o{ named_tasks : "task_namespace_uuid"
tasks ||--o{ task_transitions : "task_uuid"
tasks ||--o{ tasks_dlq : "task_uuid"
named_tasks ||--o{ tasks : "named_task_uuid"
workflow_steps ||--o{ workflow_step_edges : "from_step_uuid"
workflow_steps ||--o{ workflow_step_edges : "to_step_uuid"
workflow_step_transitions ||--o{ workflow_step_result_audit : "workflow_step_transition_uuid"
tasks ||--o{ workflow_step_result_audit : "task_uuid"
workflow_steps ||--o{ workflow_step_result_audit : "workflow_step_uuid"
workflow_steps ||--o{ workflow_step_transitions : "workflow_step_uuid"
named_steps ||--o{ workflow_steps : "named_step_uuid"
tasks ||--o{ workflow_steps : "task_uuid"
Tables
| Table | Description |
|---|---|
| task_namespaces | Multi-tenant namespace isolation for task definitions |
| named_tasks | Reusable task templates with versioned configuration |
| named_steps | Reusable step definitions referenced by task templates |
| named_tasks_named_steps | Join table linking task templates to their step definitions |
| tasks | Task instances created from templates with execution context |
| workflow_steps | Individual step instances within a task execution |
| workflow_step_edges | Directed graph of step dependencies (DAG edges) |
| task_transitions | Event-sourced state change history for tasks (12-state machine) |
| workflow_step_transitions | Event-sourced state change history for steps (10-state machine) |
| workflow_step_result_audit | Lightweight audit trail for SOC2 compliance |
| tasks_dlq | Dead Letter Queue for stuck task investigation and resolution |
Foreign Key Relationships
| Source Table | Column | Target Table | Target Column |
|---|---|---|---|
| named_tasks_named_steps | named_step_uuid | named_steps | named_step_uuid |
| named_tasks_named_steps | named_task_uuid | named_tasks | named_task_uuid |
| named_tasks | task_namespace_uuid | task_namespaces | task_namespace_uuid |
| task_transitions | task_uuid | tasks | task_uuid |
| tasks_dlq | task_uuid | tasks | task_uuid |
| tasks | named_task_uuid | named_tasks | named_task_uuid |
| workflow_step_edges | from_step_uuid | workflow_steps | workflow_step_uuid |
| workflow_step_edges | to_step_uuid | workflow_steps | workflow_step_uuid |
| workflow_step_result_audit | workflow_step_transition_uuid | workflow_step_transitions | workflow_step_transition_uuid |
| workflow_step_result_audit | task_uuid | tasks | task_uuid |
| workflow_step_result_audit | workflow_step_uuid | workflow_steps | workflow_step_uuid |
| workflow_step_transitions | workflow_step_uuid | workflow_steps | workflow_step_uuid |
| workflow_steps | named_step_uuid | named_steps | named_step_uuid |
| workflow_steps | task_uuid | tasks | task_uuid |
Generated by generate-db-schema.sh from tasker-core SQL migration analysis
Error Troubleshooting Guide
Auto-generated troubleshooting guide. Do not edit manually.
Regenerate with:
cargo make generate-error-guide
This guide provides diagnosis and resolution steps for errors in the Tasker workflow orchestration system. Errors are organized by subsystem, from high-level system errors to specific execution and infrastructure errors.
Error hierarchy: Most specialized errors convert upward —
ExecutionError → OrchestrationError → TaskerError. When troubleshooting,
start with the most specific error type and work outward.
TaskerError
Troubleshooting Guide for Tasker Errors
When encountering errors in Tasker, identify the root cause before taking corrective action. The table below covers the most significant TaskerError variants along with their causes, diagnosis methods, and resolutions.
| Variant | Cause | Resolution |
|---|---|---|
| DatabaseError | Issues connecting to or querying the database. This could be due to network issues, incorrect credentials, or database server unavailability. | Check the database logs for connection errors or timeouts; verify that the Tasker configuration has correct database credentials and URL; ensure the database server is up and running. |
| StateTransitionError | Problems transitioning states in a workflow state machine due to invalid transitions or missing data. | Review the state machine definition for proper transition rules and check if all required input parameters are present before attempting state changes. |
| OrchestrationError | Issues related to orchestration logic such as incorrect workflow definitions, task failures, or unexpected behavior during execution. | Examine the workflow definitions for errors; review logs of individual tasks for detailed error messages; ensure that all necessary steps and dependencies are correctly configured in workflows. |
| EventError | Problems with event handling, including missing events, duplicate events, or timing issues leading to lost events. | Verify the event logging mechanisms to check if events are being captured as expected; validate the workflow’s state-machine configuration for proper event trigger definitions. |
| ValidationError | Occurs when input data fails validation checks against predefined schemas or rules. This is typically due to incorrect data formats or missing required fields. | Review the validation schema and ensure that all provided inputs conform to these requirements; correct any discrepancies in the input data before retrying execution. |
| ConfigurationError | Issues arise from misconfigured settings within Tasker, such as incorrect worker configurations, event subscriptions, etc. | Check Tasker’s configuration files for accuracy and completeness; refer to documentation for recommended best practices on configuring different components of Tasker. |
| InvalidConfiguration | Occurs when the system encounters a configuration file or environment variable that is malformed or contains unsupported values. | Validate all configuration settings according to documented guidelines; correct any errors in the configurations before attempting re-execution. |
| FFIError | Errors related to foreign function interface (FFI) calls, often due to incompatibilities between Tasker and external systems it interacts with. | Review the details of the error message for clues on incompatible system versions or required setup changes; ensure all necessary dependencies are correctly installed and configured. |
| MessagingError | Problems communicating through messaging services such as queues or brokers used within workflows. This could be due to network issues, broker unavailability, etc. | Verify that the messaging service is up and accessible from Tasker’s environment; check configuration details for proper endpoints and authentication credentials. |
| CacheError | Errors related to caching mechanisms within Tasker where data retrieval or storage fails due to connectivity issues with cache servers or corrupted cache entries. | Investigate logs of cache management components to find any errors in communication or failures; flush and reconfigure the cache system if necessary, ensuring all configurations are up-to-date. |
| WorkerError | Errors originating from worker processes that may include code runtime issues like panics, crashes, or incorrect task execution. | Review the stack traces and logs for the specific worker to identify causes of failure; ensure correct implementation of tasks according to Tasker’s guidelines; restart affected workers if necessary. |
This guide serves as a starting point for troubleshooting common errors in Tasker. For more detailed information on each error variant, refer to the full documentation.
OrchestrationError
Troubleshooting Guide for Tasker Orchestration Errors
This guide provides steps to diagnose and resolve common errors encountered in the Tasker workflow orchestration platform.
| Variant | Cause | Resolution |
|---|---|---|
| DatabaseError | Issues with database operations such as connection failures or query errors. | Check database logs for any signs of operational issues, ensure that all necessary connections are established, and verify that queries match expected schema changes. |
| InvalidTaskState | Attempting to perform an operation on a task when it is in a state not compatible with the action being taken. | Review the current state of the affected task via the Tasker API or database, then transition the task to one of the valid states before retrying. Ensure that all workflow transitions are correctly defined and followed. |
| WorkflowStepNotFound | Reference to a step within a workflow that does not exist in the database. | Verify that the step UUID provided is correct and exists in the current workflow definition. Correct any errors in the workflow configuration or recreate the missing step entry. |
| StepStateMachineNotFound | The state machine responsible for managing transitions of a particular workflow step cannot be found. | Confirm the existence and correctness of the state machine configuration associated with the specific step. If it is supposed to exist, ensure that the registry or database has been properly updated. |
| StateVerificationFailed | A verification process on the state of a workflow step failed due to an invalid condition or unexpected outcome. | Inspect the reason provided for failure and cross-reference it against expected states transitions in your workflow logic. Correct any discrepancies between expected and actual states, then attempt re-verification. |
| DelegationFailed | The task execution cannot be delegated properly due to issues with worker frameworks such as Rust, Ruby, Python, or TypeScript. | Ensure that all necessary workers are running and accessible over the network. Check framework-specific configurations for any required environment variables or dependencies. If using a specific worker framework (e.g., Rust), confirm its proper installation and configuration in Tasker settings. |
| TaskExecutionFailed | A task execution encountered an error preventing it from completing successfully. | Examine logs from both Tasker and the relevant worker to identify the root cause of failure. Address any issues noted, such as missing dependencies or incorrect configurations. Retry the failed task once the problem is resolved. |
Understanding these critical error conditions can help you efficiently debug and maintain robust workflows in your Tasker environment.
StepExecutionError
Troubleshooting Guide for StepExecutionErrors in Tasker
StepExecutionError encapsulates the various errors that can occur while executing steps within workflows. Diagnosing and addressing these errors efficiently keeps complex state-machine-driven processes running smoothly.
| Variant | Likely Cause | Resolution |
|---|---|---|
| Permanent | A critical issue preventing task completion, often due to configuration problems or external service failures. | Review the message and error_code in logs for detailed information. Adjust configurations or fix dependencies based on error specifics. |
| Retryable | Temporal issues such as transient network errors or resource contention that might resolve with time. | Check the provided context, if available, for additional insights into the nature of the problem. Consider adjusting retry policies and backoff strategies. |
| Timeout | Task execution exceeds specified duration due to delays in processing or external service response times. | Examine task logs and adjust timeout durations based on observed performance metrics. Investigate slow-running tasks or services causing delays. |
| NetworkError | Issues with connectivity or service availability affecting communication between Tasker and its dependencies. | Validate network configurations, inspect status_code for HTTP errors, and ensure that dependent services are up and responsive. Reconfigure if necessary. |
This guide provides a structured approach to diagnosing and resolving issues categorized under StepExecutionError in the Tasker workflow orchestration system.
RegistryError
Troubleshooting Guide for RegistryError in Tasker
When working with the Rust-based Tasker workflow orchestration platform, encountering a RegistryError indicates issues with registering or managing handlers within your workflows. This guide provides troubleshooting steps for resolving the most common types of these errors.
| Variant | Likely Cause | Resolution |
|---|---|---|
| NotFound | Handler is referenced but not registered in the system. Common when trying to invoke a handler that hasn’t been added to the registry or has been deleted. | Ensure all handlers are correctly registered and available at runtime. Double-check your workflow configuration for any references to non-existent handlers. If using dynamic registrations, confirm that registration requests are being processed successfully. |
| Conflict | Occurs when attempting to register a handler with an existing key, violating the unique constraint or due to incompatible updates. Can happen during parallel registration attempts or conflicting updates. | Review the logs and workflow configurations for any duplicate registration attempts or concurrent modifications of the same handler. Ensure that registration requests are serialized where necessary to prevent conflicts. Adjust your workflow logic if needed to handle concurrency correctly. |
| ValidationError | Handler fails validation checks, such as missing required fields, incorrect data types, or failing custom validations defined in Tasker’s configuration. | Examine the error details for specific reasons why the handler failed validation. Correct any issues in the handler class or configuration files that are causing the failure. If custom validation rules are involved, ensure they match expected criteria and do not impose unnecessary restrictions. |
| ThreadSafetyError | Handler operation violates thread safety principles of Tasker, such as attempting to modify state from an invalid context. Can happen during concurrent access issues or misuse of async operations. | Review the logs and operational details around the time of failure for indications of concurrency issues. Ensure all modifications are made in threads or contexts that comply with Tasker’s threading model guidelines. Modify offending code sections to ensure thread-safe practices, using proper synchronization mechanisms where necessary. |
By addressing these common causes and following the outlined resolutions, you should be able to mitigate RegistryError instances effectively within your workflows on Tasker.
EventError
Troubleshooting Guide for Tasker Event Errors
When working with Tasker workflows and encountering issues related to event handling, the following guide can help diagnose and resolve common errors quickly. This guide focuses on the most critical error variants in the EventError enum.
| Variant | Cause | Resolution |
|---|---|---|
| PublishingFailed | Occurs when a task fails to publish an event, typically due to network issues or incorrect configuration parameters. | - Check Tasker logs for specific errors related to the failed event type. - Ensure all required fields and configurations are correctly set in the workflow definition. - Verify that the destination service is reachable and running. |
| SerializationError | Happens when there’s a problem serializing data into an expected format, often due to mismatched types or missing required fields during serialization attempts. | - Review the application logs for detailed error messages regarding the failed event type. - Validate input data against expected schemas or formats before attempting to serialize it. - Adjust the workflow configuration if necessary to match the serialization requirements. |
| FfiBridgeError | This error occurs when there is an issue with the Foreign Function Interface (FFI) bridge, such as incorrect bindings, version mismatches, or library incompatibilities. | - Inspect Tasker logs for detailed information about the FFI error. - Ensure that all dependencies and libraries used by the workflow are correctly installed and compatible with each other. - Review and update FFI bindings if necessary to match changes in external systems or libraries. |
By addressing these common issues, you can enhance the reliability of your Tasker workflows and ensure smoother operation of event-driven processes.
StateError
Troubleshooting Guide for State Management Errors in Tasker
When orchestrating workflows with Tasker, you may encounter state management errors. This guide provides quick fixes and diagnostics for the critical variants of the StateError enum.
| Variant | Cause | Resolution |
|---|---|---|
| InvalidTransition | Attempting an unsupported transition between two states for a specific entity type and UUID. | Verify the workflow definitions or business logic to ensure that transitions align with valid state changes. Update the workflow accordingly if necessary. Check logs for entity_type, entity_uuid, from_state, and to_state to validate the sequence of events leading up to this error. |
| StateNotFound | Requested operation on a non-existent entity type or UUID. | Confirm that the entity exists in the system prior to performing operations on it. Ensure all required entities are correctly created and initialized before proceeding with state changes. Use logs to trace whether an attempt was made to transition states for a non-existing entity. |
| DatabaseError | An unexpected issue occurred within the database layer, such as connection issues or queries failing. | Examine database logs and connection details to identify any issues like timeouts, disconnections, or query errors. Implement retries or enhanced error handling in Tasker’s codebase to mitigate transient database errors. Consider scaling resources if persistent performance problems are identified. |
| ConcurrentModification | Another process modified the entity while an operation was being executed, causing a conflict. | Use optimistic concurrency control mechanisms such as versioning to prevent concurrent modifications. Ensure transactions encapsulate all operations on an entity during updates and employ locking strategies in high-concurrency environments. Monitor system metrics for spikes indicating excessive contention or transaction failures. |
By addressing these specific error variants effectively, you can maintain robust state management within Tasker workflows and ensure smooth orchestration of tasks across polyglot workers.
DiscoveryError
Troubleshooting Guide for Discovery Errors in Tasker
This guide provides steps to diagnose and resolve the most critical errors encountered during task discovery processes within Tasker.
| Variant | Cause | Resolution |
|---|---|---|
| DatabaseError | Issues with database connectivity or operations, such as connection timeouts, query failures. | Check the database logs for any related exceptions or warnings. Ensure that the database is running and accessible from the server handling Tasker tasks. Verify if there are network issues preventing communication between services. |
| SqlFunctionError | SQL functions used in queries fail due to invalid usage or missing dependencies. | Review the function definitions and ensure they are correctly implemented and referenced in the query. Check for any syntax errors or logical mistakes that could be causing the issue. Validate that all required database objects (tables, views) exist and have the correct schema. |
| DependencyCycle | A circular reference is detected among tasks which prevents proper task sequencing. | Examine the workflow configuration to identify the cycle. Modify task dependencies to eliminate the circular relationship by ensuring each task has a clear starting point and no loops. Review the cycle_steps for specific task IDs involved in the loop and update their relationships accordingly. |
| ConfigurationError | Incorrect or missing configurations such as task templates or step definitions which are required for task execution. | Verify the existence of referenced entities like tasks, steps, etc., within the configuration files. Ensure that all necessary configurations are correctly defined without typos or inconsistencies. If the issue is due to a template not being found, recreate it based on existing examples or documentation provided by Tasker. |
These resolutions should help address common issues encountered with DiscoveryErrors in Tasker, ensuring smooth operation and efficient task processing within the platform.
ExecutionError
When troubleshooting ExecutionError in Tasker, focus on the specific variant to diagnose and resolve issues. Each error provides insight into why a step execution failed, which can range from invalid states and concurrency control issues to timeouts and retries failing.
| Variant | Cause | Resolution |
|---|---|---|
| StepExecutionFailed | The execution of a workflow step encountered an unexpected issue or failure that couldn’t be automatically resolved. | Check the logs for detailed error messages related to the step UUID, including any system or application-specific errors indicated by reason and error_code. Adjust the workflow logic or configuration as needed based on these insights. |
| InvalidStepState | The state of a step was not in the expected condition when an action was attempted (e.g., trying to transition from “running” to “failed” without completing). | Review the sequence of events leading up to this error by examining logs and state transitions for the given step UUID. Ensure that all prerequisites for transitioning states are met correctly according to the workflow design before attempting another execution. |
| StateTransitionError | An unexpected issue occurred during a state transition within the execution process (e.g., database constraint violations, missing dependencies). | Examine logs surrounding the affected state transition and look for any reported reasons or constraints causing issues. Ensure that all necessary transitions are correctly specified in the workflow definitions and that there are no external factors preventing successful state changes. |
| ConcurrencyError | A concurrency control mechanism failed to manage parallel executions properly (e.g., locks, semaphores). | Investigate how concurrent tasks interact based on logs and metrics for step UUIDs involved. Ensure proper synchronization and backoff strategies are in place to handle high load or conflicting operations efficiently. Adjust configurations as necessary to prevent contention issues. |
| NoResultReturned | A workflow execution did not yield any output result despite completing all steps, indicating a failure to properly conclude an operation. | Confirm that each step is configured correctly to produce expected results and that there are no missing return values causing the error. Review and validate the logic of your workflows to ensure proper handling across all branches and edge cases. |
| ExecutionTimeout | A workflow or individual step did not complete within a set time limit, leading to automatic termination. | Increase timeout durations if tasks consistently exceed their allotted time due to high processing demands. Analyze steps for potential inefficiencies causing delays and optimize code accordingly. Ensure that timeouts are appropriately defined based on realistic execution scenarios. |
| RetryLimitExceeded | The number of allowed retries for a step or workflow has been reached without successful completion, indicating persistent issues. | Review the root causes of failures leading to retry attempts by analyzing logs and state histories. Adjust retry policies or error handling strategies as needed to address recurring issues more effectively, including increasing max attempt limits if appropriate. |
| BatchCreationFailed | An issue occurred during batch creation for ZeroMQ publishing, often related to network communication problems. | Check the status of your messaging system and ensure that all components are running smoothly without disruptions. Verify configuration settings such as message queue size and error handling mechanisms, and correct any misconfigurations or network issues impeding batch creation processes. |
These steps should help pinpoint and address common issues encountered in Tasker’s workflow execution environment.
Troubleshooting guidance generated with Ollama (qwen2.5:14b). Set SKIP_LLM=true for deterministic output.
Generated by generate-error-guide.sh from tasker-core error definitions
State Machine Diagrams
Auto-generated from Rust source analysis. Do not edit manually.
Regenerate with:
cargo make generate-state-machines
Tasker uses two state machines to manage the lifecycle of tasks and workflow steps.
Both are implemented in tasker-shared/src/state_machine/.
Task State Machine
The task state machine manages the overall lifecycle of a task through 12 states.
Tasks progress from Pending through initialization, step enqueuing, processing,
and evaluation phases, with support for dependency waiting, retry, and manual resolution.
stateDiagram-v2
[*] --> Pending
Complete --> [*]
Error --> [*]
Cancelled --> [*]
ResolvedManually --> [*]
Pending --> Initializing : Start
Initializing --> EnqueuingSteps : ReadyStepsFound
Initializing --> Complete : NoStepsFound
Initializing --> WaitingForDependencies : NoDependenciesReady
EnqueuingSteps --> StepsInProcess : StepsEnqueued
EnqueuingSteps --> Error : EnqueueFailed
StepsInProcess --> EvaluatingResults : AllStepsCompleted
StepsInProcess --> EvaluatingResults : StepCompleted
StepsInProcess --> WaitingForRetry : StepFailed
EvaluatingResults --> Complete : AllStepsSuccessful
EvaluatingResults --> EnqueuingSteps : ReadyStepsFound
EvaluatingResults --> WaitingForDependencies : NoDependenciesReady
EvaluatingResults --> BlockedByFailures : PermanentFailure
WaitingForDependencies --> EvaluatingResults : DependenciesReady
WaitingForRetry --> EnqueuingSteps : RetryReady
BlockedByFailures --> Error : GiveUp
BlockedByFailures --> ResolvedManually : ManualResolution
Pending --> Cancelled : Cancel [guard]
Initializing --> Cancelled : Cancel [guard]
EnqueuingSteps --> Cancelled : Cancel [guard]
StepsInProcess --> Cancelled : Cancel [guard]
EvaluatingResults --> Cancelled : Cancel [guard]
WaitingForDependencies --> Cancelled : Cancel [guard]
WaitingForRetry --> Cancelled : Cancel [guard]
BlockedByFailures --> Cancelled : Cancel [guard]
StepsInProcess --> Complete : Complete
StepsInProcess --> Error : Fail
Error --> Pending : Reset
note right of ResolvedManually : From any state via ResolveManually
Task State Transitions
| From State | Event | To State | Notes |
|---|---|---|---|
| Pending | Start | Initializing | |
| Initializing | ReadyStepsFound | EnqueuingSteps | |
| Initializing | NoStepsFound | Complete | |
| Initializing | NoDependenciesReady | WaitingForDependencies | |
| EnqueuingSteps | StepsEnqueued | StepsInProcess | |
| EnqueuingSteps | EnqueueFailed | Error | |
| StepsInProcess | AllStepsCompleted | EvaluatingResults | |
| StepsInProcess | StepCompleted | EvaluatingResults | |
| StepsInProcess | StepFailed | WaitingForRetry | |
| EvaluatingResults | AllStepsSuccessful | Complete | |
| EvaluatingResults | ReadyStepsFound | EnqueuingSteps | |
| EvaluatingResults | NoDependenciesReady | WaitingForDependencies | |
| EvaluatingResults | PermanentFailure | BlockedByFailures | |
| WaitingForDependencies | DependenciesReady | EvaluatingResults | |
| WaitingForRetry | RetryReady | EnqueuingSteps | |
| BlockedByFailures | GiveUp | Error | |
| BlockedByFailures | ManualResolution | ResolvedManually | |
| (any non-terminal) | Cancel | Cancelled | Guard: !state.is_terminal() |
| StepsInProcess | Complete | Complete | |
| StepsInProcess | Fail | Error | |
| Error | Reset | Pending | |
| (any non-terminal) | ResolveManually | ResolvedManually | |
Workflow Step State Machine
The workflow step state machine manages individual step execution through 10 states. Steps follow a worker-to-orchestration handoff pattern: workers execute steps and enqueue results for orchestration processing.
```mermaid
stateDiagram-v2
    [*] --> Pending
    Complete --> [*]
    Error --> [*]
    Cancelled --> [*]
    ResolvedManually --> [*]

    Pending --> Enqueued : Enqueue
    Enqueued --> InProgress : Start
    Pending --> InProgress : Start
    InProgress --> EnqueuedForOrchestration : EnqueueForOrchestration
    InProgress --> EnqueuedAsErrorForOrchestration : EnqueueAsErrorForOrchestration
    EnqueuedForOrchestration --> Complete : Complete
    EnqueuedForOrchestration --> Error : Fail
    EnqueuedAsErrorForOrchestration --> Error : Fail
    EnqueuedAsErrorForOrchestration --> Complete : Complete
    InProgress --> Complete : Complete
    InProgress --> Error : Fail
    Pending --> Error : Fail
    Enqueued --> Error : Fail
    Pending --> Cancelled : Cancel
    Enqueued --> Cancelled : Cancel
    InProgress --> Cancelled : Cancel
    EnqueuedForOrchestration --> Cancelled : Cancel
    EnqueuedAsErrorForOrchestration --> Cancelled : Cancel
    Error --> Cancelled : Cancel
    Error --> Pending : Retry
    Error --> Pending : ResetForRetry
    InProgress --> WaitingForRetry : WaitForRetry
    Enqueued --> WaitingForRetry : WaitForRetry
    Pending --> WaitingForRetry : WaitForRetry
    EnqueuedAsErrorForOrchestration --> WaitingForRetry : WaitForRetry
    WaitingForRetry --> Pending : Retry
    WaitingForRetry --> Cancelled : Cancel

    note right of Complete : From any state via CompleteManually
    note right of ResolvedManually : From any state via ResolveManually
```
Workflow Step State Transitions
| From State | Event | To State |
|---|---|---|
| Pending | Enqueue | Enqueued |
| Enqueued | Start | InProgress |
| Pending | Start | InProgress |
| InProgress | EnqueueForOrchestration | EnqueuedForOrchestration |
| InProgress | EnqueueAsErrorForOrchestration | EnqueuedAsErrorForOrchestration |
| EnqueuedForOrchestration | Complete | Complete |
| EnqueuedForOrchestration | Fail | Error |
| EnqueuedAsErrorForOrchestration | Fail | Error |
| EnqueuedAsErrorForOrchestration | Complete | Complete |
| InProgress | Complete | Complete |
| InProgress | Fail | Error |
| Pending | Fail | Error |
| Enqueued | Fail | Error |
| Pending | Cancel | Cancelled |
| Enqueued | Cancel | Cancelled |
| InProgress | Cancel | Cancelled |
| EnqueuedForOrchestration | Cancel | Cancelled |
| EnqueuedAsErrorForOrchestration | Cancel | Cancelled |
| Error | Cancel | Cancelled |
| Error | Retry | Pending |
| InProgress | WaitForRetry | WaitingForRetry |
| Enqueued | WaitForRetry | WaitingForRetry |
| Pending | WaitForRetry | WaitingForRetry |
| EnqueuedAsErrorForOrchestration | WaitForRetry | WaitingForRetry |
| WaitingForRetry | Retry | Pending |
| WaitingForRetry | Cancel | Cancelled |
| (any state) | CompleteManually | Complete |
| (any state) | ResolveManually | ResolvedManually |
| Error | ResetForRetry | Pending |
Generated by generate-state-machines.sh from tasker-core Rust source analysis
Authentication & Authorization
API-level security for Tasker’s orchestration and worker HTTP endpoints, providing JWT bearer token and API key authentication with permission-based access control.
Architecture
```
                       ┌──────────────────────────────┐
Request ──► Middleware │ SecurityService              │
            (per-route)│ ├─ JwtAuthenticator          │
                       │ ├─ JwksKeyStore (optional)   │
                       │ └─ ApiKeyRegistry (optional) │
                       └───────────┬──────────────────┘
                                   │
                                   ▼
                           SecurityContext
                (injected into request extensions)
                                   │
                                   ▼
                       ┌───────────────────────┐
                       │  authorize() wrapper  │
                       │  Resource + Action    │
                       └───────────┬───────────┘
                                   │
                         ┌─────────┴─────────┐
                         ▼                   ▼
                    Body parsing        403 (denied)
                         │
                         ▼
                    Handler body
                         │
                         ▼
                    200 (success)
```
Key Components
| Component | Location | Role |
|---|---|---|
| `SecurityService` | `tasker-shared/src/services/security_service.rs` | Unified auth backend: validates JWTs (static key or JWKS) and API keys |
| `SecurityContext` | `tasker-shared/src/types/security.rs` | Per-request identity + permissions, extracted by handlers |
| `Permission` enum | `tasker-shared/src/types/permissions.rs` | Compile-time permission vocabulary (`resource:action`) |
| `Resource`, `Action` | `tasker-shared/src/types/resources.rs` | Resource-based authorization types |
| `authorize()` wrapper | `tasker-shared/src/web/authorize.rs` | Handler wrapper for declarative permission checks |
| Auth middleware | `*/src/web/middleware/auth.rs` | Axum middleware injecting `SecurityContext` |
| `require_permission()` | `*/src/web/middleware/permission.rs` | Legacy per-handler permission gate (still available) |
Request Flow
1. Middleware (`conditional_auth`) runs on protected routes
2. If auth disabled → injects `SecurityContext::disabled_context()` (bypasses permission checks)
3. If auth enabled → extracts Bearer token or API key from headers
4. `SecurityService` validates credentials, returns `SecurityContext`
5. `authorize()` wrapper checks permission BEFORE body deserialization → 403 if denied
6. Body deserialization and handler execution proceed if authorized
Route Layers
Routes are split into public (never require auth) and protected (auth middleware applied):
Orchestration (port 8080):
- Public: `/health/*`, `/metrics`, `/api-docs/*`
- Protected: `/v1/*`, `/config` (opt-in)
Worker (port 8081):
- Public: `/health/*`, `/metrics`, `/api-docs/*`
- Protected: `/v1/templates/*`, `/config` (opt-in)
Quick Start
```shell
# 1. Generate RSA key pair
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys

# 2. Generate a token
cargo run --bin tasker-ctl -- auth generate-token \
  --private-key ./keys/jwt-private-key.pem \
  --permissions "tasks:create,tasks:read,tasks:list" \
  --subject my-service \
  --expiry-hours 24

# 3. Enable auth in config (orchestration.toml)
# [web.auth]
# enabled = true
# jwt_public_key_path = "./keys/jwt-public-key.pem"

# 4. Use the token
curl -H "Authorization: Bearer <token>" http://localhost:8080/v1/tasks
```
Documentation Index
| Document | Contents |
|---|---|
| Permissions | Permission vocabulary, route mapping, wildcards, role patterns |
| Configuration | TOML config, environment variables, deployment patterns |
| Testing | E2E test infrastructure, cargo-make tasks, writing auth tests |
Cross-References
| Document | Contents |
|---|---|
| API Security Guide | Quick start, CLI commands, error responses, observability |
| Auth Integration Guide | JWKS, Auth0, Keycloak, Okta configuration |
Design Decisions
Auth Disabled by Default
Security is opt-in (enabled = false default). Existing deployments are unaffected. When disabled, all handlers receive a SecurityContext with AuthMethod::Disabled and an empty permissions list (permissions: []). All permission checks still pass because has_permission() short-circuits and returns true for AuthMethod::Disabled without inspecting the permissions list.
Config Endpoint Opt-In
The /config endpoint exposes runtime configuration (secrets redacted). It is controlled by a separate toggle (config_endpoint_enabled, default false). When disabled, the route is not registered (404, not 401).
Resource-Based Authorization
Permission checks happen at the route level via authorize() wrappers BEFORE body deserialization:
```rust
.route("/tasks", post(authorize(Resource::Tasks, Action::Create, create_task)))
```
This approach:
- Rejects unauthorized requests before parsing request bodies
- Provides a declarative, visible permission model at the route level
- Is protocol-agnostic (the same `Resource`/`Action` types work for REST and gRPC)
- Documents permissions in OpenAPI via `x-required-permission` extensions
The legacy require_permission() function is still available for cases where permission checks need to happen inside handler logic.
Credential Priority (Client)
The tasker-client library resolves credentials in this order:
1. Endpoint-specific token (`TASKER_ORCHESTRATION_AUTH_TOKEN` / `TASKER_WORKER_AUTH_TOKEN`)
2. Global token (`TASKER_AUTH_TOKEN`)
3. API key (`TASKER_API_KEY`)
4. JWT generation from private key (if configured)
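The priority order can be sketched as a simple fall-through. Here `resolve_credential` and its `lookup` parameter are hypothetical names standing in for the client's internal logic and `std::env::var`; the real tasker-client code may differ in detail.

```rust
// Illustrative sketch of the documented client credential priority.
#[derive(Debug, PartialEq)]
enum Credential {
    Bearer(String),
    ApiKey(String),
}

fn resolve_credential(lookup: impl Fn(&str) -> Option<String>) -> Option<Credential> {
    // 1. Endpoint-specific token (highest priority)
    if let Some(t) = lookup("TASKER_ORCHESTRATION_AUTH_TOKEN") {
        return Some(Credential::Bearer(t));
    }
    // 2. Global token
    if let Some(t) = lookup("TASKER_AUTH_TOKEN") {
        return Some(Credential::Bearer(t));
    }
    // 3. API key (sent via the configured header)
    if let Some(k) = lookup("TASKER_API_KEY") {
        return Some(Credential::ApiKey(k));
    }
    // 4. (Lowest) generate a JWT from a configured private key; elided here.
    None
}

fn main() {
    // Only an API key and a global token are set; the token outranks the key.
    let lookup = |name: &str| match name {
        "TASKER_AUTH_TOKEN" => Some("global-token".to_string()),
        "TASKER_API_KEY" => Some("sk-example".to_string()),
        _ => None,
    };
    assert_eq!(resolve_credential(lookup), Some(Credential::Bearer("global-token".into())));
}
```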
Known Limitations
- Body-before-permission ordering for POST/PATCH endpoints — resolved by resource-based authorization
- No token refresh — tokens are stateless; clients must generate new tokens before expiry
- API keys have no expiration — rotate by removing from config and redeploying
Configuration Reference
Complete configuration for Tasker authentication: server-side TOML, environment variables, and client settings.
Server-Side Configuration
Auth config lives under a role-prefixed [web.auth] section in both orchestration and worker TOML files.
Location
- `config/tasker/base/orchestration.toml` → `[orchestration.web.auth]`
- `config/tasker/base/worker.toml` → `[worker.web.auth]`
- `config/tasker/environments/{env}/...` → environment overrides
Configuration follows the role-based structure (see Configuration Management).
Full Reference
The example below uses orchestration as the role prefix. For worker configuration, replace orchestration with worker throughout.
```toml
[orchestration.web]
# Whether the /config endpoint is registered (default: false).
# When false, GET /config returns 404. When true, requires system:config_read permission.
config_endpoint_enabled = false

[orchestration.web.auth]
# Master switch (default: false). When disabled, all routes are accessible without credentials.
enabled = false

# --- JWT Configuration ---
# Token issuer claim (validated against incoming tokens)
jwt_issuer = "tasker-core"
# Token audience claim (validated against incoming tokens)
jwt_audience = "tasker-api"
# Token expiry for generated tokens (via CLI)
jwt_token_expiry_hours = 24
# Verification method: "public_key" (static RSA key) or "jwks" (dynamic key rotation)
jwt_verification_method = "public_key"
# Static public key (one of these; path takes precedence):
jwt_public_key_path = "/etc/tasker/keys/jwt-public-key.pem"
jwt_public_key = "" # Inline PEM string (use path instead for production)
# Private key (for token generation only, not needed for verification):
jwt_private_key = ""

# --- JWKS Configuration (when jwt_verification_method = "jwks") ---
# JWKS endpoint URL
jwks_url = "https://auth.example.com/.well-known/jwks.json"
# How often to refresh the key set (seconds)
jwks_refresh_interval_seconds = 3600
# Maximum staleness (seconds) for JWKS cache when a refresh fails.
# If the cache is within this window past its refresh interval, the stale
# cache is used with a warning. 0 = no stale cache fallback.
jwks_max_stale_seconds = 300
# Allow HTTP (non-TLS) JWKS URLs. Only enable for local testing.
jwks_url_allow_http = false
# Allowed JWT signing algorithms. Tokens signed with other algorithms
# are rejected. Default: ["RS256"]
jwt_allowed_algorithms = ["RS256"]

# --- Permission Validation ---
# JWT claim name containing the permissions array
permissions_claim = "permissions"
# Reject tokens with unrecognized permission strings
strict_validation = true
# Log unrecognized permissions even when strict_validation = false
log_unknown_permissions = true

# --- API Key Authentication ---
# Header name for API key authentication
api_key_header = "X-API-Key"
# Enable multi-key registry (default: false)
api_keys_enabled = false

# API key registry (multiple keys with individual permissions)
[[orchestration.web.auth.api_keys]]
key = "sk-prod-monitoring-key"
permissions = ["tasks:read", "tasks:list", "dlq:read", "dlq:stats"]
description = "Production monitoring service"

[[orchestration.web.auth.api_keys]]
key = "sk-prod-admin-key"
permissions = ["*"]
description = "Production admin"
```
Environment Variables
Server-Side
| Variable | Description | Overrides |
|---|---|---|
| `TASKER_JWT_PUBLIC_KEY_PATH` | Path to RSA public key PEM file | `web.auth.jwt_public_key_path` |
| `TASKER_JWT_PUBLIC_KEY` | Inline PEM public key | `web.auth.jwt_public_key` |
These override TOML values via the config loader’s environment interpolation.
Client-Side
| Variable | Priority | Description |
|---|---|---|
| `TASKER_ORCHESTRATION_AUTH_TOKEN` | 1 (highest) | Bearer token for orchestration API only |
| `TASKER_WORKER_AUTH_TOKEN` | 1 (highest) | Bearer token for worker API only |
| `TASKER_AUTH_TOKEN` | 2 | Bearer token for both APIs |
| `TASKER_API_KEY` | 3 | API key (sent via configured header) |
| `TASKER_API_KEY_HEADER` | — | Custom header name (default: `X-API-Key`) |
| `TASKER_JWT_PRIVATE_KEY_PATH` | 4 (lowest) | Private key for on-demand token generation |
The tasker-client library checks these in priority order and uses the first available credential.
Deployment Patterns
Development (Auth Disabled)
```toml
# In orchestration.toml:
[orchestration.web.auth]
enabled = false

# In worker.toml:
[worker.web.auth]
enabled = false
```
All endpoints accessible without credentials. Default behavior.
Development (Auth Enabled, Static Key)
```toml
# In orchestration.toml:
[orchestration.web.auth]
enabled = true
jwt_verification_method = "public_key"
jwt_public_key_path = "./keys/jwt-public-key.pem"
jwt_issuer = "tasker-core"
jwt_audience = "tasker-api"
strict_validation = false

[[orchestration.web.auth.api_keys]]
key = "dev-key"
permissions = ["*"]
description = "Dev superuser key"
```
Production (JWKS + API Keys)
```toml
# In orchestration.toml:
[orchestration.web.auth]
enabled = true
jwt_verification_method = "jwks"
jwks_url = "https://auth.company.com/.well-known/jwks.json"
jwks_refresh_interval_seconds = 3600
jwt_issuer = "https://auth.company.com/"
jwt_audience = "tasker-api"
strict_validation = true
log_unknown_permissions = true
api_keys_enabled = true
api_key_header = "X-API-Key"

[[orchestration.web.auth.api_keys]]
key = "sk-monitoring-prod"
permissions = ["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]
description = "Monitoring service"

[[orchestration.web.auth.api_keys]]
key = "sk-submitter-prod"
permissions = ["tasks:create", "tasks:read", "tasks:list"]
description = "Task submission service"
```
Production (Config Endpoint Enabled)
```toml
# In orchestration.toml:
[orchestration.web]
config_endpoint_enabled = true

[orchestration.web.auth]
enabled = true
# ... auth config ...
```
Exposes GET /config (requires system:config_read permission). Secrets are redacted in the response.
Key Management
Generating Keys
```shell
# Generate 2048-bit RSA key pair
cargo run --bin tasker-ctl -- auth generate-keys --output-dir ./keys --key-size 2048

# Output:
#   keys/jwt-private-key.pem  (keep secret, used for token generation)
#   keys/jwt-public-key.pem   (distribute to servers for verification)
```
Key Rotation (Static Key)
1. Generate a new key pair
2. Update `jwt_public_key_path` in server config
3. Restart servers
4. Re-generate tokens with the new private key
5. Old tokens become invalid immediately
Key Rotation (JWKS)
Handled automatically by the identity provider. Tasker refreshes keys on:
- Timer interval (`jwks_refresh_interval_seconds`)
- Unknown `kid` in an incoming token (triggers immediate refresh)
Security Hardening Checklist
- Private keys never committed to version control
- `enabled = true` in production configs
- `strict_validation = true` to reject unknown permissions
- Token expiry set appropriately (1-24h recommended)
- API keys use descriptive names for audit trails
- `config_endpoint_enabled = false` unless needed (default)
- Monitor the `tasker.auth.failures.total` metric for anomalies
- Use JWKS in production for automatic key rotation
- Least-privilege: each service gets only the permissions it needs
Related
- API Security Guide — Quick start, CLI commands, error responses
- Auth Integration Guide — Auth0, Keycloak, Okta, JWKS setup
- Permissions — Full permission vocabulary and route mapping
Permissions
Permission-based access control using a resource:action vocabulary with wildcard support.
Permission Vocabulary
17 permissions organized by resource:
Tasks
| Permission | Description | Endpoints |
|---|---|---|
| `tasks:create` | Create new tasks | `POST /v1/tasks` |
| `tasks:read` | Read task details | `GET /v1/tasks/{uuid}` |
| `tasks:list` | List tasks | `GET /v1/tasks` |
| `tasks:cancel` | Cancel running tasks | `DELETE /v1/tasks/{uuid}` |
| `tasks:context_read` | Read task context data | `GET /v1/tasks/{uuid}/context` |
Steps
| Permission | Description | Endpoints |
|---|---|---|
| `steps:read` | Read workflow step details | `GET /v1/tasks/{uuid}/workflow_steps`, `GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}`, `GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}/audit` |
| `steps:resolve` | Manually resolve steps | `PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}` |
Dead Letter Queue
| Permission | Description | Endpoints |
|---|---|---|
| `dlq:read` | Read DLQ entries | `GET /v1/dlq`, `GET /v1/dlq/task/{task_uuid}`, `GET /v1/dlq/investigation-queue`, `GET /v1/dlq/staleness` |
| `dlq:update` | Update DLQ investigations | `PATCH /v1/dlq/entry/{dlq_entry_uuid}` |
| `dlq:stats` | View DLQ statistics | `GET /v1/dlq/stats` |
Templates
| Permission | Description | Endpoints |
|---|---|---|
| `templates:read` | Read task templates | Orchestration: `GET /v1/templates`, `GET /v1/templates/{namespace}/{name}/{version}` |
| `templates:validate` | Validate templates | Worker: `POST /v1/templates/{namespace}/{name}/{version}/validate` |
System (Orchestration)
| Permission | Description | Endpoints |
|---|---|---|
| `system:config_read` | Read system configuration | `GET /config` |
| `system:handlers_read` | Read handler registry | `GET /v1/templates`, `GET /v1/templates/{namespace}/{name}/{version}` |
| `system:analytics_read` | Read analytics data | `GET /v1/analytics/performance`, `GET /v1/analytics/bottlenecks` |
Worker
| Permission | Description | Endpoints |
|---|---|---|
| `worker:config_read` | Read worker configuration | Worker: `GET /config` |
| `worker:templates_read` | Read worker templates | Worker: `GET /v1/templates`, `GET /v1/templates/{namespace}/{name}/{version}` |
Wildcards
Resource-level wildcards allow broad access within a resource domain:
| Pattern | Matches |
|---|---|
| `tasks:*` | All task permissions |
| `steps:*` | All step permissions |
| `dlq:*` | All DLQ permissions |
| `templates:*` | All template permissions |
| `system:*` | All system permissions |
| `worker:*` | All worker permissions |
Note: Global wildcards (*) are NOT supported. Use explicit resource wildcards for broad access (e.g., tasks:*, system:*). This follows AWS IAM-style resource-level granularity.
Wildcard matching is implemented in `permission_matches()`:

- `resource:*` → matches if the required permission's resource component equals the prefix
- Exact string → matches if the strings are identical
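A minimal sketch of this matching rule follows. The function name comes from the docs above, but the body is an assumption for illustration, not the actual tasker-shared code.

```rust
// Hypothetical sketch of the wildcard matching described above.
fn permission_matches(granted: &str, required: &str) -> bool {
    if let Some(resource) = granted.strip_suffix(":*") {
        // Resource wildcard: matches any action on the same resource.
        required.split(':').next() == Some(resource)
    } else {
        // Otherwise, require an exact string match.
        granted == required
    }
}

fn main() {
    assert!(permission_matches("tasks:*", "tasks:create"));
    assert!(permission_matches("tasks:read", "tasks:read"));
    assert!(!permission_matches("tasks:*", "dlq:read"));
    // Global wildcards are not supported; "*" only matches itself literally.
    assert!(!permission_matches("*", "tasks:read"));
    println!("matching ok");
}
```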
Role Patterns
Common permission sets for different service roles:
Read-Only Operator
["tasks:read", "tasks:list", "steps:read", "dlq:read", "dlq:stats"]
Suitable for dashboards, monitoring services, and read-only admin UIs.
Task Submitter
["tasks:create", "tasks:read", "tasks:list"]
Services that submit work to Tasker and track their submissions.
Ops Admin
["tasks:*", "steps:*", "dlq:*", "system:*"]
Full operational access including step resolution, DLQ investigation, and system observability.
Worker Service
["worker:config_read", "worker:templates_read"]
Worker processes that need to read their configuration and available templates.
Full Access (Admin)
["tasks:*", "steps:*", "dlq:*", "templates:*", "system:*", "worker:*"]
Full access to all resources via resource wildcards. Use sparingly.
Strict Validation
When strict_validation = true (default), tokens containing permission strings not in the vocabulary are rejected with 401:
Unknown permissions: custom:action, tasks:delete
Set strict_validation = false if your identity provider includes additional scopes that are not part of Tasker’s vocabulary. Use log_unknown_permissions = true to still log unrecognized permissions for monitoring.
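The strict/lenient behavior can be sketched as follows; `validate_permissions` and the inline vocabulary list are hypothetical stand-ins for the real `Permission` enum handling (which would also recognize wildcard grants).

```rust
// Hedged sketch of strict vs lenient permission validation.
fn validate_permissions(
    token_perms: &[&str],
    vocabulary: &[&str],
    strict: bool,
) -> Result<Vec<String>, String> {
    let mut accepted = Vec::new();
    let mut unknown: Vec<&str> = Vec::new();
    for p in token_perms {
        if vocabulary.contains(p) {
            accepted.push(p.to_string());
        } else {
            unknown.push(*p);
        }
    }
    if strict && !unknown.is_empty() {
        // strict_validation = true: the whole token is rejected (401).
        return Err(format!("Unknown permissions: {}", unknown.join(", ")));
    }
    // strict_validation = false: unknown scopes are dropped (and can be logged).
    Ok(accepted)
}

fn main() {
    let vocab = ["tasks:create", "tasks:read", "tasks:list"];
    assert!(validate_permissions(&["tasks:create", "custom:action"], &vocab, true).is_err());
    let lenient = validate_permissions(&["tasks:create", "custom:action"], &vocab, false).unwrap();
    assert_eq!(lenient, vec!["tasks:create".to_string()]);
}
```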
Permission Check Implementation
Resource-Based Authorization
Permissions are enforced declaratively at the route level using authorize() wrappers. This ensures authorization happens before body deserialization:
```rust
// In routes.rs
use tasker_shared::web::authorize;
use tasker_shared::types::resources::{Resource, Action};

Router::new()
    .route("/tasks", post(authorize(Resource::Tasks, Action::Create, create_task)))
    .route("/tasks", get(authorize(Resource::Tasks, Action::List, list_tasks)))
    .route("/tasks/{uuid}", get(authorize(Resource::Tasks, Action::Read, get_task)))
```
The authorize() wrapper:
1. Extracts `SecurityContext` from request extensions (set by auth middleware)
2. If the resource is public (Health/Metrics/Docs) → proceeds to handler
3. If auth is disabled (`AuthMethod::Disabled`) → proceeds to handler
4. Checks `has_permission(required)` → if yes, proceeds; if no, returns 403
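The steps above can be sketched as a simple decision function. The types here are illustrative stand-ins (not tasker-shared's actual definitions), and wildcard matching is omitted for brevity.

```rust
// Simplified model of the authorize() wrapper's decision order.
#[derive(Debug, PartialEq)]
enum AuthMethod { Disabled, Jwt, ApiKey }

struct SecurityContext {
    method: AuthMethod,
    permissions: Vec<String>,
}

impl SecurityContext {
    fn has_permission(&self, required: &str) -> bool {
        // When auth is disabled, every check short-circuits to true.
        self.method == AuthMethod::Disabled
            || self.permissions.iter().any(|p| p.as_str() == required)
    }
}

#[derive(Debug, PartialEq)]
enum Decision { Proceed, Forbidden }

fn authorize_decision(ctx: &SecurityContext, public_resource: bool, required: &str) -> Decision {
    if public_resource || ctx.has_permission(required) {
        Decision::Proceed // runs BEFORE body deserialization
    } else {
        Decision::Forbidden // 403
    }
}

fn main() {
    let ctx = SecurityContext { method: AuthMethod::Jwt, permissions: vec!["tasks:read".into()] };
    assert_eq!(authorize_decision(&ctx, false, "tasks:create"), Decision::Forbidden);
    // Disabled auth passes every check despite an empty permissions list.
    let off = SecurityContext { method: AuthMethod::Disabled, permissions: vec![] };
    assert_eq!(authorize_decision(&off, false, "tasks:create"), Decision::Proceed);
}
```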
Resource → Permission Mapping
The ResourceAction type maps resource+action combinations to permissions:
| Resource | Action | Permission |
|---|---|---|
| Tasks | Create | tasks:create |
| Tasks | Read | tasks:read |
| Tasks | List | tasks:list |
| Tasks | Cancel | tasks:cancel |
| Tasks | ContextRead | tasks:context_read |
| Steps | Read/List | steps:read |
| Steps | Resolve | steps:resolve |
| Dlq | Read/List | dlq:read |
| Dlq | Update | dlq:update |
| Dlq | Stats | dlq:stats |
| Templates | Read/List | templates:read |
| Templates | Validate | templates:validate |
| System | ConfigRead | system:config_read |
| System | HandlersRead | system:handlers_read |
| System | AnalyticsRead | system:analytics_read |
| Worker | ConfigRead | worker:config_read |
| Worker | Read/List | worker:templates_read |
Public Resources
These resources don’t require authentication:
- `Resource::Health` — health check endpoints
- `Resource::Metrics` — Prometheus metrics
- `Resource::Docs` — OpenAPI/Swagger documentation
Legacy Handler-Level Check (Still Available)
For cases where you need permission checks inside handler logic:
```rust
use tasker_shared::services::require_permission;
use tasker_shared::types::Permission;

fn my_handler(ctx: SecurityContext) -> Result<(), ApiError> {
    require_permission(&ctx, Permission::TasksCreate)?;
    // ... handler logic
}
```
Source: tasker-shared/src/web/authorize.rs, tasker-shared/src/types/resources.rs
OpenAPI Documentation
Permission Extensions
Each protected endpoint in the OpenAPI spec includes an x-required-permission extension that documents the exact permission required:
```json
{
  "paths": {
    "/v1/tasks": {
      "post": {
        "security": [
          { "bearer_auth": [] },
          { "api_key_auth": [] }
        ],
        "x-required-permission": "tasks:create",
        ...
      }
    }
  }
}
```
Why Extensions Instead of OAuth2 Scopes?
OpenAPI 3.x only formally supports scopes for OAuth2 and OpenID Connect security schemes—not for HTTP Bearer or API Key authentication. Since Tasker uses JWT Bearer tokens with JWKS validation (not OAuth2 flows), we use vendor extensions (x-required-permission) to document permissions in a standards-compliant way.
This approach:
- Is OpenAPI compliant (tools ignore unknown `x-` fields gracefully)
- Doesn't misrepresent our authentication mechanism
- Is machine-readable for SDK generators and tooling
- Is visible in generated documentation
Viewing Permissions in Swagger UI
Each operation’s description includes a Required Permission line:
**Required Permission:** `tasks:create`
This provides human-readable permission information directly in the Swagger UI.
Programmatic Access
To extract permission requirements from the OpenAPI spec:
```python
import json

spec = json.load(open("orchestration-openapi.json"))
for path, methods in spec["paths"].items():
    for method, operation in methods.items():
        if "x-required-permission" in operation:
            print(f"{method.upper()} {path}: {operation['x-required-permission']}")
```
CLI: List Permissions
cargo run --bin tasker-ctl -- auth show-permissions
Outputs all 17 permissions with their resource grouping.
Auth Testing
E2E test infrastructure for validating authentication and permission enforcement.
Test Organization
```
tasker-orchestration/tests/web/auth/
├── mod.rs             # Module declarations
├── common.rs          # AuthWebTestClient, token generators, constants
├── tasks.rs           # Task endpoint auth tests
├── workflow_steps.rs  # Step resolution auth tests
├── dlq.rs             # DLQ endpoint auth tests
├── handlers.rs        # Handler registry auth tests
├── analytics.rs       # Analytics endpoint auth tests
├── config.rs          # Config endpoint auth tests
├── health.rs          # Health endpoint public access tests
└── api_keys.rs        # API key auth tests (full/read/tasks/none)
```
All tests are feature-gated: #[cfg(feature = "test-services")]
Running Auth Tests
```shell
# Run all auth E2E tests (requires database running)
cargo make test-auth-e2e   # or: cargo make tae

# Run a specific test file
cargo nextest run --features test-services \
  -E 'test(auth::tasks)' \
  --package tasker-orchestration

# Run with output
cargo nextest run --features test-services \
  -E 'test(auth::)' \
  --package tasker-orchestration \
  --nocapture
```
Test Infrastructure
AuthTestServer and AuthWebTestClient
Tests use a two-part setup: AuthTestServer starts an auth-enabled Axum server, and AuthWebTestClient provides HTTP methods to interact with it:
```rust
use crate::web::auth_test_helpers::*;

#[tokio::test]
async fn test_example() {
    let server = AuthTestServer::start()
        .await
        .expect("Failed to start auth test server");
    let mut client = AuthWebTestClient::for_server(&server);

    // Use client for requests...
    let response = client.get("/v1/tasks").await.expect("request failed");

    server.shutdown().await.expect("shutdown failed");
}
```
`AuthTestServer::start()` does:

1. Allocates a dynamic port (`127.0.0.1:0`)
2. Sets `TASKER_CONFIG_PATH` and `TASKER_JWT_PUBLIC_KEY_PATH`
3. Creates `SystemContext` + `OrchestrationCore` + `AppState`
4. Starts Axum with auth middleware active
AuthWebTestClient::for_server(&server) creates an HTTP client configured with the server’s base URL. It supports auth modes via builder methods: with_jwt(), with_api_key(), and without_auth().
Token Generators
```rust
use crate::web::auth::common::{generate_jwt, generate_expired_jwt, generate_jwt_wrong_issuer};

// Valid token with specific permissions
let token = generate_jwt(&["tasks:create", "tasks:read"]);

// Expired token (1 hour ago)
let token = generate_expired_jwt(&["tasks:create"]);

// Wrong issuer (won't validate)
let token = generate_jwt_wrong_issuer(&["tasks:create"]);
```
Token generation uses the test RSA private key (tests/fixtures/auth/jwt-private-key-test.pem) embedded as a constant.
API Key Constants
```rust
use crate::web::auth::common::{
    TEST_API_KEY_FULL,       // permissions: ["*"]
    TEST_API_KEY_READ_ONLY,  // permissions: tasks/steps/dlq read + system read
    TEST_API_KEY_TASKS_ONLY, // permissions: ["tasks:*"]
    TEST_API_KEY_NO_PERMS,   // permissions: []
    INVALID_API_KEY,         // not registered
};
```
These match the keys configured in config/tasker/generated/auth-test.toml.
Test Configuration
`config/tasker/generated/auth-test.toml`
A copy of complete-test.toml with auth overrides:
```toml
[orchestration.web.auth]
enabled = true
jwt_issuer = "tasker-core-test"
jwt_audience = "tasker-api-test"
jwt_verification_method = "public_key"
jwt_public_key_path = "" # Set via TASKER_JWT_PUBLIC_KEY_PATH at runtime
api_keys_enabled = true
strict_validation = false

[[orchestration.web.auth.api_keys]]
key = "test-api-key-full-access"
permissions = ["*"]

[[orchestration.web.auth.api_keys]]
key = "test-api-key-read-only"
permissions = ["tasks:read", "tasks:list", "steps:read", ...]

# ... more keys ...
```
Test Fixture Keys
```
tests/fixtures/auth/
├── jwt-private-key-test.pem  # RSA private key (for token generation in tests)
└── jwt-public-key-test.pem   # RSA public key (loaded by SecurityService)
```
These are deterministic test keys committed to the repository. They are only used in tests and have no security value.
Test Patterns
Pattern: No Credentials → 401
```rust
#[tokio::test]
async fn test_no_credentials_returns_401() {
    let server = AuthTestServer::start().await.expect("Failed to start");
    let mut client = AuthWebTestClient::for_server(&server);
    client.without_auth();

    let response = client.get("/v1/tasks").await.expect("request failed");
    assert_eq!(response.status(), StatusCode::UNAUTHORIZED);

    server.shutdown().await.expect("shutdown failed");
}
```
Pattern: Valid JWT with Required Permission → 200
```rust
#[tokio::test]
async fn test_jwt_with_permission_succeeds() {
    let server = AuthTestServer::start().await.expect("Failed to start");
    let mut client = AuthWebTestClient::for_server(&server);
    let token = generate_jwt(&["tasks:list"]);
    client.with_jwt(&token);

    let response = client.get("/v1/tasks").await.expect("request failed");
    assert_eq!(response.status(), StatusCode::OK);

    server.shutdown().await.expect("shutdown failed");
}
```
Pattern: Valid JWT Missing Permission → 403
```rust
#[tokio::test]
async fn test_jwt_without_permission_returns_403() {
    let server = AuthTestServer::start().await.expect("Failed to start");
    let mut client = AuthWebTestClient::for_server(&server);
    let token = generate_jwt(&["tasks:read"]); // missing tasks:create
    client.with_jwt(&token);

    let body = serde_json::json!({ /* ... */ });
    let response = client.post_json("/v1/tasks", &body).await.expect("request failed");
    assert_eq!(response.status(), StatusCode::FORBIDDEN);

    server.shutdown().await.expect("shutdown failed");
}
```
Pattern: API Key with Permissions → 200
```rust
#[tokio::test]
async fn test_api_key_full_access() {
    let server = AuthTestServer::start().await.expect("Failed to start");
    let mut client = AuthWebTestClient::for_server(&server);
    client.with_api_key(TEST_API_KEY_FULL);

    let response = client.get("/v1/tasks").await.expect("request failed");
    assert_eq!(response.status(), StatusCode::OK);

    server.shutdown().await.expect("shutdown failed");
}
```
Pattern: Health Always Public
```rust
#[tokio::test]
async fn test_health_no_auth_required() {
    let server = AuthTestServer::start().await.expect("Failed to start");
    let mut client = AuthWebTestClient::for_server(&server);
    client.without_auth();

    let response = client.get("/health").await.expect("request failed");
    assert_eq!(response.status(), StatusCode::OK);

    server.shutdown().await.expect("shutdown failed");
}
```
Test Coverage Matrix
| Scenario | Expected | Test File |
|---|---|---|
| No credentials on protected routes | 401 | All files |
| JWT with exact permission | 200 | tasks, dlq, handlers, analytics, config |
| JWT with resource wildcard (`tasks:*`) | 200 | tasks |
| JWT with global wildcard (`*`) | 200 | All files |
| JWT missing required permission | 403 | tasks, dlq, handlers, analytics |
| JWT wrong issuer | 401 | tasks |
| JWT wrong audience | 401 | tasks |
| Expired JWT | 401 | tasks |
| Malformed JWT | 401 | tasks |
| API key full access | 200 | api_keys |
| API key read-only | 200/403 | api_keys |
| API key tasks-only | 200/403 | api_keys |
| API key no permissions | 403 | api_keys |
| Invalid API key | 401 | api_keys |
| Health endpoints without auth | 200 | health |
CI Compatibility
Auth tests are compatible with CI without special environment setup:
- Dynamic port allocation: `TcpListener::bind("127.0.0.1:0")` avoids port conflicts
- Self-configuring paths: uses `CARGO_MANIFEST_DIR` to resolve fixture paths at compile time
- No external services: auth validation is in-process (no external JWKS/IdP needed)
- Nextest isolation: Each test runs in its own process, preventing env var conflicts
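The dynamic-port trick is plain standard library; binding to port 0 lets the OS pick a free port, which each test then reads back from the listener. A minimal sketch:

```rust
use std::net::TcpListener;

fn main() {
    // Bind to port 0: the OS assigns any free port, avoiding CI conflicts.
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind failed");
    let addr = listener.local_addr().expect("no local addr");
    // The OS-assigned port is nonzero.
    assert_ne!(addr.port(), 0);
    println!("listening on {addr}");
}
```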
Adding New Auth Tests
- Identify the endpoint and required permission (see Permissions)
- Add tests to the appropriate file (by resource) or create a new one
- Test at minimum: no credentials (401), correct permission (200), wrong permission (403)
- For POST/PATCH endpoints, use a valid request body (deserialization runs before permission check)
- Run `cargo make test-auth-e2e` to verify
Related
- Permissions — Full permission vocabulary and endpoint mapping
- Configuration — Auth config reference
- `config/tasker/generated/auth-test.toml` — test auth configuration
Cluster Testing Guide
**Last Updated:** 2026-01-19 · **Audience:** Developers, QA · **Status:** Active · **Related:** Tooling | Idempotency and Atomicity
Overview
This guide covers multi-instance cluster testing for validating horizontal scaling, race condition detection, and concurrent processing scenarios.
Key Capabilities:
- Run N orchestration instances with M worker instances
- Test concurrent task creation across instances
- Validate state consistency across cluster
- Detect race conditions and data corruption
- Measure performance under concurrent load
Test Infrastructure
Feature Flags
Tests are organized by infrastructure requirements using Cargo feature flags:
| Feature Flag | Infrastructure Required | In CI? |
|---|---|---|
| `test-db` | PostgreSQL database | Yes |
| `test-messaging` | DB + messaging (PGMQ/RabbitMQ) | Yes |
| `test-services` | DB + messaging + services running | Yes |
| `test-cluster` | Multi-instance cluster running | No |
Hierarchy: Each flag implies the previous (`test-cluster` includes `test-services` includes `test-messaging` includes `test-db`).
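Expressed as Cargo feature definitions, the implication chain might look like this. This is a hypothetical sketch; the crate's actual `Cargo.toml` may differ.

```toml
# Hypothetical Cargo.toml feature chain: enabling one flag
# transitively enables everything below it.
[features]
test-db = []
test-messaging = ["test-db"]
test-services = ["test-messaging"]
test-cluster = ["test-services"]
```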
Test Commands
```shell
# Unit tests (DB + messaging only)
cargo make test-rust-unit

# E2E tests (services running)
cargo make test-rust-e2e

# Cluster tests (cluster running - LOCAL ONLY)
cargo make test-rust-cluster

# All tests including cluster
cargo make test-rust-all
```
Test Entry Points
tests/
├── basic_tests.rs # Always compiles
├── integration_tests.rs # #[cfg(feature = "test-messaging")]
├── e2e_tests.rs # #[cfg(feature = "test-services")]
└── e2e/
└── multi_instance/ # #[cfg(feature = "test-cluster")]
├── mod.rs
├── concurrent_task_creation_test.rs
└── consistency_test.rs
Multi-Instance Test Manager
The MultiInstanceTestManager provides high-level APIs for cluster testing.
Location
tests/common/multi_instance_test_manager.rs
tests/common/orchestration_cluster.rs
Basic Usage
#![allow(unused)]
fn main() {
use crate::common::multi_instance_test_manager::MultiInstanceTestManager;
use std::time::Duration;
#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_concurrent_operations() -> Result<()> {
// Setup from environment (reads TASKER_TEST_ORCHESTRATION_URLS, etc.)
let manager = MultiInstanceTestManager::setup_from_env().await?;
// Wait for all instances to become healthy
manager.wait_for_healthy(Duration::from_secs(30)).await?;
// Create tasks concurrently across the cluster
let requests = vec![create_task_request("namespace", "task", json!({})); 10];
let responses = manager.create_tasks_concurrent(requests).await?;
// Wait for completion
let task_uuids: Vec<_> = responses.iter()
.map(|r| uuid::Uuid::parse_str(&r.task_uuid).unwrap())
.collect();
let timeout = Duration::from_secs(60);
let completed = manager.wait_for_tasks_completion(task_uuids.clone(), timeout).await?;
// Verify consistency across all instances
for uuid in &task_uuids {
manager.verify_task_consistency(*uuid).await?;
}
Ok(())
}
}
Key Methods
| Method | Description |
|---|---|
setup_from_env() | Create manager from environment variables |
setup(orch_count, worker_count) | Create manager with explicit counts |
wait_for_healthy(timeout) | Wait for all instances to be healthy |
create_tasks_concurrent(requests) | Create tasks using round-robin distribution |
wait_for_task_completion(uuid, timeout) | Wait for single task completion |
wait_for_tasks_completion(uuids, timeout) | Wait for multiple tasks |
verify_task_consistency(uuid) | Verify task state across all instances |
orchestration_count() | Number of orchestration instances |
worker_count() | Number of worker instances |
OrchestrationCluster
Lower-level cluster abstraction with load balancing:
#![allow(unused)]
fn main() {
use crate::common::orchestration_cluster::{OrchestrationCluster, ClusterConfig, LoadBalancingStrategy};
// Create cluster with round-robin load balancing
let config = ClusterConfig {
orchestration_urls: vec!["http://localhost:8080".to_string(), "http://localhost:8081".to_string()],
worker_urls: vec!["http://localhost:8100".to_string(), "http://localhost:8101".to_string()],
load_balancing: LoadBalancingStrategy::RoundRobin,
health_timeout: Duration::from_secs(5),
};
let cluster = OrchestrationCluster::new(config).await?;
// Get client using load balancing strategy
let client = cluster.get_client();
// Get all clients for parallel operations
for client in cluster.all_clients() {
let task = client.get_task(task_uuid).await?;
}
}
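Round-robin selection, as used by `get_client()`, can be sketched with an atomic counter over the instance list — a hypothetical illustration, not the actual `OrchestrationCluster` internals:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal round-robin selector over a fixed list of instance URLs.
struct RoundRobin {
    urls: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobin {
    fn new(urls: Vec<String>) -> Self {
        Self { urls, next: AtomicUsize::new(0) }
    }

    /// Hands out URLs in order, wrapping around the list.
    fn pick(&self) -> &str {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.urls.len();
        &self.urls[i]
    }
}

fn main() {
    let rr = RoundRobin::new(vec![
        "http://localhost:8080".to_string(),
        "http://localhost:8081".to_string(),
    ]);
    // Consecutive picks alternate between the two instances.
    assert_eq!(rr.pick(), "http://localhost:8080");
    assert_eq!(rr.pick(), "http://localhost:8081");
    assert_eq!(rr.pick(), "http://localhost:8080");
}
```

The atomic counter makes the selector safe to share across concurrent test tasks without a lock.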
Running Cluster Tests
Prerequisites
- PostgreSQL running with PGMQ extension
- Environment configured for cluster mode
Step-by-Step
# 1. Start PostgreSQL (if not already running)
cargo make docker-up
# 2. Setup cluster environment
cargo make setup-env-cluster
# 3. Start the full cluster
cargo make cluster-start-all
# 4. Verify cluster health
cargo make cluster-status
# Expected output:
# Instance Status:
# ─────────────────────────────────────────────────────────────
# INSTANCE STATUS PID PORT
# ─────────────────────────────────────────────────────────────
# orchestration-1 healthy 12345 8080
# orchestration-2 healthy 12346 8081
# worker-rust-1 healthy 12347 8100
# worker-rust-2 healthy 12348 8101
# ... (more workers)
# 5. Run cluster tests
cargo make test-rust-cluster
# 6. Stop cluster when done
cargo make cluster-stop
Monitoring During Tests
# In separate terminal: Watch cluster logs
cargo make cluster-logs
# Or orchestration logs only
cargo make cluster-logs-orchestration
# Quick status check (no health probes)
cargo make cluster-status-quick
Test Scenarios
Concurrent Task Creation
Validates that tasks can be created concurrently across orchestration instances without conflicts.
File: tests/e2e/multi_instance/concurrent_task_creation_test.rs
Test: test_concurrent_task_creation_across_instances
Validates:
- Tasks created through different orchestration instances
- All tasks complete successfully
- State is consistent across all instances
- No duplicate UUIDs generated
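The no-duplicate-UUIDs check reduces to comparing set cardinality against batch size; a minimal sketch of that assertion (helper name hypothetical):

```rust
use std::collections::HashSet;

/// True when every UUID string in the batch is distinct.
fn all_unique(uuids: &[String]) -> bool {
    let set: HashSet<&String> = uuids.iter().collect();
    set.len() == uuids.len()
}

fn main() {
    let batch = vec!["01JA".to_string(), "01JB".to_string(), "01JC".to_string()];
    assert!(all_unique(&batch));

    // A duplicate collapses the set, shrinking its cardinality.
    let dup = vec!["01JA".to_string(), "01JA".to_string()];
    assert!(!all_unique(&dup));
}
```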
Rapid Task Burst
Stress tests the system by creating many tasks in quick succession.
Test: test_rapid_task_creation_burst
Validates:
- System handles high task creation rate
- No duplicate task UUIDs
- All tasks created successfully
Round-Robin Distribution
Verifies tasks are distributed across instances using round-robin.
Test: test_task_creation_round_robin_distribution
Validates:
- Tasks distributed across instances
- Distribution is approximately even
- No single-instance bottleneck
Validation Results
The cluster testing infrastructure was validated with the following results:
Test Summary
| Metric | Result |
|---|---|
| Tests Passed | 1645 |
| Intermittent Failures | 3 (resource contention, not race conditions) |
| Tests Skipped | 21 (domain event tests, require single-instance) |
| Cluster Configuration | 2x orchestration + 2x each worker type (10 total) |
Key Findings
- No Race Conditions Detected: All concurrent operations completed without data corruption or invalid states
- Defense-in-Depth Validated: Four protection layers (database atomicity, state machine guards, transaction boundaries, application logic) work correctly together
- Recovery Mechanism Works: Tasks and steps recover correctly after simulated failures
- Consistent State: Task state is consistent when queried from any orchestration instance
Connection Pool Deadlock (Fixed)
Initial testing revealed intermittent failures under high parallelization:
- Cause: Connection pool deadlock in task initialization - transactions held connections while template loading needed additional connections
- Root Cause Fix: Moved template loading BEFORE transaction begins in `task_initialization/service.rs`
- Additional Tuning: Increased pool sizes (20→30 max, 1→2 min connections)
- Status: ✅ Fixed - all 9 cluster tests now pass in parallel
See the connection pool deadlock pattern documentation in docs/ticket-specs/ for details.
Domain Event Tests
21 tests were skipped in cluster mode (marked with #[cfg(not(feature = "test-cluster"))]):
- Reason: Domain event tests verify in-process event delivery, incompatible with multi-process cluster
- Status: Working as designed - these tests run in single-instance CI
Test Feature Flag Implementation
Adding the Feature Gate
Tests requiring cluster infrastructure should use the feature gate:
#![allow(unused)]
fn main() {
// At module level
#![cfg(feature = "test-cluster")]
// Or at test level
#[tokio::test]
#[cfg(feature = "test-cluster")]
async fn test_cluster_specific_behavior() -> Result<()> {
// ...
}
}
Skipping Tests in Cluster Mode
Some tests (like domain events) don’t work in cluster mode:
#![allow(unused)]
fn main() {
// Only run when NOT in cluster mode
#[tokio::test]
#[cfg(not(feature = "test-cluster"))]
async fn test_domain_event_delivery() -> Result<()> {
// In-process event testing
}
}
Conditional Imports
#![allow(unused)]
fn main() {
// Only import cluster test utilities when needed
#[cfg(feature = "test-cluster")]
use crate::common::multi_instance_test_manager::MultiInstanceTestManager;
}
Nextest Configuration
The .config/nextest.toml configures test execution for cluster scenarios:
[profile.default]
retries = 0
leak-timeout = { period = "500ms", result = "fail" }
fail-fast = false
# Multi-instance tests can run in parallel once cluster is warmed up
[[profile.default.overrides]]
filter = 'test(multi_instance)'
[profile.ci]
# Limit parallelism to avoid database connection pool exhaustion
test-threads = 4
Cluster Warmup: Multi-instance tests can run in parallel. Connection pools now start with min_connections=2 for faster warmup. The 5-second delay built into cluster-start-all usually suffices. If you see “Failed to create task after all retries” errors immediately after startup, wait a few more seconds for pools to fully initialize.
Troubleshooting
Cluster Won’t Start
# Check for port conflicts
lsof -i :8080-8089
lsof -i :8100-8109
# Check for stale PID files
ls -la .pids/
rm -rf .pids/*.pid # Clean up stale PIDs
# Retry start
cargo make cluster-start-all
Tests Timeout / “Failed to create task after all retries”
This typically indicates the cluster wasn’t fully warmed up:
# Check cluster health
cargo make cluster-status
# If health is green but tests fail, wait for warmup
sleep 10 && cargo make test-rust-cluster
# Check logs for errors
cargo make cluster-logs | head -100
# Restart cluster with extra warmup
cargo make cluster-stop
cargo make cluster-start-all
sleep 10
cargo make test-rust-cluster
Root cause: Connection pools start at min_connections=2 and grow on demand. The first requests after startup may timeout while pools are establishing connections.
Connection Pool Exhaustion
If tests fail with “pool timed out” errors, ensure you have the latest code with:
- Template loading before transaction in `task_initialization/service.rs`
- Pool sizes: `max_connections=30`, `min_connections=2` in test config
If issues persist, verify pool configuration:
# Check test config
cat config/tasker/generated/orchestration-test.toml | grep -A5 "pool"
Environment Variables Not Set
# Verify environment
env | grep TASKER_TEST
# Re-source environment
source .env
# Or regenerate
cargo make setup-env-cluster
CI Considerations
Cluster tests are NOT run in CI due to GitHub Actions resource constraints:
- Running multiple orchestration + worker instances requires more memory than free GHA runners provide
- This is a conscious tradeoff for an open-source, pre-alpha project
Future Options (when project matures):
- Self-hosted runners with more resources
- Paid GHA larger runners
- Separate manual workflow trigger for cluster tests
Workaround: Run cluster tests locally before PRs that touch concurrent processing code.
Related Documentation
- Tooling - Cluster deployment tasks
- Idempotency and Atomicity - Protection mechanisms
Comprehensive Lifecycle Testing Framework Guide
This guide demonstrates the complete lifecycle testing framework, showing patterns, examples, and best practices for validating task and workflow step lifecycles with integrated SQL function validation.
Table of Contents
- Framework Overview
- Core Testing Patterns
- Advanced Assertion Traits
- Template-Based Testing
- SQL Function Integration
- Example Test Executions
- Tracing Output Examples
- Best Practices
- Troubleshooting
Framework Overview
Architecture
The comprehensive lifecycle testing framework consists of several key components:
#![allow(unused)]
fn main() {
// Core Infrastructure
TestOrchestrator // Wrapper around orchestration components
StepErrorSimulator // Realistic error scenario simulation
SqlLifecycleAssertion // SQL function validation
TestScenarioBuilder // YAML template loading
// Advanced Patterns
TemplateTestRunner // Parameterized error pattern testing
ErrorPattern // Comprehensive error configuration
TaskAssertions // Task-level validation trait
StepAssertions // Step-level validation trait
}
Integration Strategy
Each test follows the integrated validation pattern:
- Exercise Lifecycle: Use orchestration framework to create scenario
- Capture SQL State: Call SQL functions to get current state
- Assert Integration: Validate SQL functions return expected values
- Document Relationship: Structured tracing showing cause → effect
Core Testing Patterns
Pattern 1: Basic Lifecycle Validation
#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_basic_lifecycle_validation(pool: PgPool) -> Result<()> {
tracing::info!("🧪 Testing basic lifecycle progression");
// STEP 1: Exercise lifecycle using framework
let orchestrator = TestOrchestrator::new(pool.clone());
let task = orchestrator.create_simple_task("test", "basic_validation").await?;
let step = get_first_step(&pool, task.task_uuid).await?;
// STEP 2: Validate initial state
pool.assert_step_ready(step.workflow_step_uuid).await?;
// STEP 3: Execute step
let result = orchestrator.execute_step(&step, true, 1000).await?;
assert!(result.success);
// STEP 4: Validate final state
pool.assert_step_complete(step.workflow_step_uuid).await?;
tracing::info!("✅ Basic lifecycle validation complete");
Ok(())
}
}
Pattern 2: Error and Retry Validation
#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_error_retry_validation(pool: PgPool) -> Result<()> {
tracing::info!("🔄 Testing error and retry behavior");
let orchestrator = TestOrchestrator::new(pool.clone());
let task = orchestrator.create_simple_task("test", "retry_validation").await?;
let step = get_first_step(&pool, task.task_uuid).await?;
// STEP 1: Simulate retryable error
StepErrorSimulator::simulate_execution_error(
&pool,
&step,
1 // attempt number
).await?;
// STEP 2: Validate retry behavior
pool.assert_step_retry_behavior(
step.workflow_step_uuid,
1, // expected attempts
None, // no custom backoff
true // still retry eligible
).await?;
tracing::info!("✅ Error retry validation complete");
Ok(())
}
}
Pattern 3: Complex Dependency Validation
#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_dependency_validation(pool: PgPool) -> Result<()> {
tracing::info!("🔗 Testing dependency relationships");
let orchestrator = TestOrchestrator::new(pool.clone());
// Create diamond pattern workflow
let task = create_diamond_workflow_task(&orchestrator).await?;
let steps = get_task_steps(&pool, task.task_uuid).await?;
// Execute start step
let result = orchestrator.execute_step(&steps[0], true, 1000).await?;
assert!(result.success);
// Fail one branch
StepErrorSimulator::simulate_validation_error(
&pool,
&steps[1],
"dependency_test_error"
).await?;
// Complete other branch
let result = orchestrator.execute_step(&steps[2], true, 1000).await?;
assert!(result.success);
// Validate convergence step is blocked
pool.assert_step_blocked(steps[3].workflow_step_uuid).await?;
tracing::info!("✅ Dependency validation complete");
Ok(())
}
}
Advanced Assertion Traits
TaskAssertions Trait Usage
#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::{TaskAssertions, TaskStepDistribution};
// Task completion validation
pool.assert_task_complete(task_uuid).await?;
// Task error state validation
pool.assert_task_error(task_uuid, 2).await?; // 2 error steps
// Complex step distribution validation
pool.assert_task_step_distribution(
task_uuid,
TaskStepDistribution {
total_steps: 4,
completed_steps: 2,
failed_steps: 1,
ready_steps: 0,
pending_steps: 1,
in_progress_steps: 0,
error_steps: 1,
}
).await?;
// Execution status validation
pool.assert_task_execution_status(
task_uuid,
ExecutionStatus::BlockedByFailures,
Some(RecommendedAction::HandleFailures)
).await?;
// Completion percentage validation
pool.assert_task_completion_percentage(task_uuid, 75.0, 5.0).await?;
}
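The third argument to `assert_task_completion_percentage` is a tolerance; the comparison is presumably an absolute-difference check along these lines (a sketch, not the trait's actual implementation):

```rust
/// Passes when the observed percentage is within `tolerance` points of `expected`.
fn within_tolerance(actual: f64, expected: f64, tolerance: f64) -> bool {
    (actual - expected).abs() <= tolerance
}

fn main() {
    // 75.0 with tolerance 5.0 accepts anything in [70.0, 80.0].
    assert!(within_tolerance(72.5, 75.0, 5.0));
    assert!(!within_tolerance(81.0, 75.0, 5.0));
}
```

A tolerance matters here because completion percentage is computed from step counts and may not land on an exact value for every workflow shape.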
StepAssertions Trait Usage
#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::StepAssertions;
// Basic step state validations
pool.assert_step_ready(step_uuid).await?;
pool.assert_step_complete(step_uuid).await?;
pool.assert_step_blocked(step_uuid).await?;
// Retry behavior validation
pool.assert_step_retry_behavior(
step_uuid,
3, // expected attempts
Some(30), // custom backoff seconds
false // not retry eligible (exhausted)
).await?;
// Dependency validation
pool.assert_step_dependencies_satisfied(step_uuid, true).await?;
// State transition sequence validation
pool.assert_step_state_sequence(
step_uuid,
vec!["Pending".to_string(), "InProgress".to_string(), "Complete".to_string()]
).await?;
// Permanent failure validation
pool.assert_step_failed_permanently(step_uuid).await?;
// Waiting for retry with specific time
let retry_time = chrono::Utc::now() + chrono::Duration::seconds(60);
pool.assert_step_waiting(step_uuid, retry_time).await?;
}
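The retry timing these assertions inspect follows the usual shape: a step-level `backoff_request_seconds` override wins, otherwise a default exponential backoff applies. A hypothetical model for illustration — the real formula lives in the SQL functions and may use different constants:

```rust
/// Hypothetical backoff model: a step-level override takes precedence,
/// otherwise exponential backoff (base 5s, doubling per attempt, capped at 300s).
fn backoff_seconds(attempts: u32, custom_backoff: Option<u64>) -> u64 {
    if let Some(secs) = custom_backoff {
        return secs; // mirrors backoff_request_seconds taking precedence
    }
    let base = 5u64;
    let cap = 300u64;
    let exp = attempts.saturating_sub(1).min(16); // bound the shift
    base.saturating_mul(1u64 << exp).min(cap)
}

fn main() {
    assert_eq!(backoff_seconds(1, None), 5);
    assert_eq!(backoff_seconds(2, None), 10);
    assert_eq!(backoff_seconds(10, None), 300); // capped
    assert_eq!(backoff_seconds(3, Some(30)), 30); // custom override
}
```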
Template-Based Testing
ErrorPattern Configuration
#![allow(unused)]
fn main() {
use common::lifecycle_test_helpers::{ErrorPattern, TemplateTestRunner};
// Simple patterns
let success_pattern = ErrorPattern::AllSuccess;
let first_fail_pattern = ErrorPattern::FirstStepFails { retryable: true };
let last_fail_pattern = ErrorPattern::LastStepFails { permanently: false };
// Advanced patterns
let targeted_pattern = ErrorPattern::MiddleStepFails {
step_name: "process_payment".to_string(),
attempts_before_success: 3
};
let dependency_pattern = ErrorPattern::DependencyBlockage {
blocked_step: "finalize_order".to_string(),
blocking_step: "validate_payment".to_string()
};
// Custom pattern with full control
let custom_pattern = ErrorPattern::Custom {
step_configs: {
let mut configs = HashMap::new();
configs.insert("critical_step".to_string(), StepErrorConfig {
error_type: StepErrorType::ExternalServiceError,
attempts_before_success: Some(5),
custom_backoff_seconds: Some(120),
permanently_fails: false,
});
configs
}
};
}
Template Runner Usage
#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_template_patterns(pool: PgPool) -> Result<()> {
let template_runner = TemplateTestRunner::new(pool.clone()).await?;
// Test single pattern
let summary = template_runner.run_template_with_errors(
"order_fulfillment.yaml",
ErrorPattern::FirstStepFails { retryable: true }
).await?;
assert!(summary.sql_validations_passed > 0);
assert_eq!(summary.sql_validations_failed, 0);
// Test all patterns automatically
let summaries = template_runner
.run_template_with_all_patterns("linear_workflow.yaml")
.await?;
for summary in summaries {
tracing::info!(
pattern = summary.error_pattern,
execution_time = summary.execution_time_ms,
validations = summary.sql_validations_passed,
"Pattern execution complete"
);
}
Ok(())
}
}
SQL Function Integration
Direct SQL Function Testing
#![allow(unused)]
fn main() {
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_direct_sql_functions(pool: PgPool) -> Result<()> {
// Test get_step_readiness_status
let step_status = sqlx::query!(
"SELECT ready_for_execution, dependencies_satisfied, retry_eligible, attempts,
backoff_request_seconds, next_retry_at
FROM get_step_readiness_status($1)",
step_uuid
)
.fetch_one(&pool)
.await?;
// Validate individual fields
assert_eq!(step_status.ready_for_execution, Some(true));
assert_eq!(step_status.dependencies_satisfied, Some(true));
assert_eq!(step_status.retry_eligible, Some(false));
assert_eq!(step_status.attempts, Some(0));
// Test get_task_execution_context
let task_context = sqlx::query!(
"SELECT total_steps, completed_steps, failed_steps, ready_steps,
pending_steps, in_progress_steps, error_steps,
completion_percentage, execution_status, recommended_action,
blocked_by_errors
FROM get_task_execution_context($1)",
task_uuid
)
.fetch_one(&pool)
.await?;
// Validate task aggregations
assert!(task_context.total_steps.unwrap_or(0) > 0);
assert_eq!(task_context.completed_steps, Some(0));
assert_eq!(task_context.failed_steps, Some(0));
Ok(())
}
}
Integrated SQL Validation Pattern
#![allow(unused)]
fn main() {
// The standard pattern used throughout the framework
async fn validate_integrated_sql_behavior(
pool: &PgPool,
task_uuid: Uuid,
step: &WorkflowStep // step row fetched by the caller (e.g. via get_first_step)
) -> Result<()> {
let step_uuid = step.workflow_step_uuid;
// STEP 1: Execute lifecycle action
StepErrorSimulator::simulate_execution_error(pool, step, 2).await?;
// STEP 2: Immediately validate SQL functions
SqlLifecycleAssertion::assert_step_scenario(
pool,
task_uuid,
step_uuid,
ExpectedStepState {
state: "Error".to_string(),
ready_for_execution: false,
dependencies_satisfied: true,
retry_eligible: true,
attempts: 2,
next_retry_at: Some(calculate_expected_retry_time()),
backoff_request_seconds: None,
retry_limit: 3,
}
).await?;
// STEP 3: Document the relationship
tracing::info!(
lifecycle_action = "simulate_execution_error",
sql_result = "retry_eligible=true, attempts=2",
"✅ INTEGRATION: Lifecycle → SQL alignment verified"
);
Ok(())
}
}
Example Test Executions
Running Individual Tests
# Run specific test with detailed output
RUST_LOG=info cargo test --test complex_retry_scenarios \
test_cascading_retries_with_dependencies -- --nocapture
# Run all lifecycle tests
cargo test --all-features --test '*lifecycle*' -- --nocapture
# Run with specific environment
TASKER_ENV=test cargo test --test step_retry_lifecycle_tests -- --nocapture
Running Test Suites
# Run comprehensive validation
cargo test --test sql_function_integration_validation -- --nocapture
# Run complex scenarios
cargo test --test complex_retry_scenarios -- --nocapture
# Run task finalization tests
cargo test --test task_finalization_error_scenarios -- --nocapture
Tracing Output Examples
Successful Test Execution
2025-01-15T10:30:45.123Z INFO test_cascading_retries_with_dependencies:
🧪 Testing cascading retries with diamond dependency pattern
2025-01-15T10:30:45.125Z INFO test_cascading_retries_with_dependencies:
🏗️ Creating diamond workflow: Start → BranchA/BranchB → Convergence
2025-01-15T10:30:45.145Z INFO test_cascading_retries_with_dependencies:
📋 STEP 1: Executing start step successfully
step_uuid=01JGJX7K8QMRNP4W2X3Y5Z6ABC
2025-01-15T10:30:45.167Z INFO test_cascading_retries_with_dependencies:
🔄 STEP 2: Simulating BranchA failure (attempt 1)
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
error_type="ExecutionError" retryable=true
2025-01-15T10:30:45.189Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Retry behavior matches expectations
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
attempts=1 backoff=null retry_eligible=true
2025-01-15T10:30:45.201Z INFO test_cascading_retries_with_dependencies:
🔄 STEP 3: BranchA retry attempt (attempt 2)
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
2025-01-15T10:30:45.223Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step completed successfully
step_uuid=01JGJX7K8RMRNP4W2X3Y5Z6DEF
2025-01-15T10:30:45.245Z INFO test_cascading_retries_with_dependencies:
❌ STEP 4: Simulating BranchB permanent failure
step_uuid=01JGJX7K8SMRNP4W2X3Y5Z6GHI
error_type="ValidationError" retryable=false
2025-01-15T10:30:45.267Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step failed permanently (not retryable)
step_uuid=01JGJX7K8SMRNP4W2X3Y5Z6GHI
2025-01-15T10:30:45.289Z INFO test_cascading_retries_with_dependencies:
🚫 STEP 5: Validating Convergence step is blocked
step_uuid=01JGJX7K8TMRNP4W2X3Y5Z6JKL
2025-01-15T10:30:45.301Z INFO test_cascading_retries_with_dependencies:
✅ STEP ASSERTION: Step blocked by dependencies
step_uuid=01JGJX7K8TMRNP4W2X3Y5Z6JKL
2025-01-15T10:30:45.323Z INFO test_cascading_retries_with_dependencies:
📊 TASK ASSERTION: Step distribution matches expectations
task_uuid=01JGJX7K8PMRNP4W2X3Y5Z6MNO
total=4 completed=2 failed=0 ready=0 pending=0 in_progress=0 error=2
2025-01-15T10:30:45.345Z INFO test_cascading_retries_with_dependencies:
✅ INTEGRATION: Lifecycle → SQL alignment verified
lifecycle_action="cascading_retry_with_dependency_blocking"
sql_result="blocked_by_errors=true, error_steps=2"
2025-01-15T10:30:45.356Z INFO test_cascading_retries_with_dependencies:
🧪 CASCADING RETRY TEST COMPLETE: Diamond pattern with mixed outcomes validated
Error Pattern Testing Output
2025-01-15T10:35:12.123Z INFO test_template_runner_all_patterns:
🎭 TEMPLATE DEMO: All error patterns with multiple templates
2025-01-15T10:35:12.145Z INFO test_template_runner_all_patterns:
📋 Testing template with all error patterns
template="linear_workflow.yaml"
2025-01-15T10:35:12.167Z INFO template_runner:
🎭 TEMPLATE TEST: Starting parameterized test execution
template_path="linear_workflow.yaml"
error_pattern=r#"AllSuccess"#
2025-01-15T10:35:12.234Z INFO template_runner:
🎭 TEMPLATE TEST: Execution complete
template_path="linear_workflow.yaml"
execution_time_ms=67
successful_steps=4 failed_steps=0 retried_steps=0
final_state="Complete"
validations_passed=12 validations_failed=0
2025-01-15T10:35:12.256Z INFO template_runner:
🎭 TEMPLATE TEST: Starting parameterized test execution
template_path="linear_workflow.yaml"
error_pattern=r#"FirstStepFails { retryable: true }"#
2025-01-15T10:35:12.334Z INFO template_runner:
📋 TEMPLATE: Simulated retryable error
step_name="initialize" attempt=1
2025-01-15T10:35:12.356Z INFO template_runner:
📋 TEMPLATE: Simulated retryable error
step_name="initialize" attempt=2
2025-01-15T10:35:12.423Z INFO template_runner:
🎭 TEMPLATE TEST: Execution complete
template_path="linear_workflow.yaml"
execution_time_ms=167
successful_steps=4 failed_steps=0 retried_steps=1
final_state="Complete"
validations_passed=15 validations_failed=0
2025-01-15T10:35:12.445Z INFO test_template_runner_all_patterns:
📊 Template pattern result
template="linear_workflow.yaml" pattern_index=0
pattern="AllSuccess" execution_time_ms=67
final_state="Complete" total_validations=12 success_rate="100.0%"
2025-01-15T10:35:12.467Z INFO test_template_runner_all_patterns:
📊 Template pattern result
template="linear_workflow.yaml" pattern_index=1
pattern=r#"FirstStepFails { retryable: true }"# execution_time_ms=167
final_state="Complete" total_validations=15 success_rate="100.0%"
SQL Function Validation Output
2025-01-15T10:40:30.123Z INFO test_comprehensive_sql_function_integration:
🔍 SQL INTEGRATION: Starting comprehensive validation across all scenarios
2025-01-15T10:40:30.145Z INFO test_comprehensive_sql_function_integration:
📋 SCENARIO 1: Basic lifecycle progression validation
2025-01-15T10:40:30.167Z INFO validate_initial_state:
✅ Initial state validation passed
2025-01-15T10:40:30.189Z INFO validate_step_completion:
✅ Step completion validation passed
step_uuid=01JGJX7M8QMRNP4W2X3Y5Z6PQR
2025-01-15T10:40:30.201Z INFO test_comprehensive_sql_function_integration:
✅ SCENARIO 1: Basic lifecycle validation complete
scenario="basic_lifecycle" validations=2
2025-01-15T10:40:30.223Z INFO test_comprehensive_sql_function_integration:
🔄 SCENARIO 2: Error handling and retry behavior validation
2025-01-15T10:40:30.245Z INFO validate_retry_behavior:
✅ Retry behavior validation passed
step_uuid=01JGJX7M8RMRNP4W2X3Y5Z6STU
attempts=1 backoff=Some(5) retry_eligible=true
2025-01-15T10:40:30.267Z INFO test_comprehensive_sql_function_integration:
✅ SCENARIO 2: Error and retry validation complete
scenario="error_retry" validations=1
2025-01-15T10:40:30.289Z INFO test_comprehensive_sql_function_integration:
🎯 FINAL VALIDATION: Comprehensive results summary
total_validations=25 successful_validations=25
success_rate="100.00%" scenarios_tested=6
2025-01-15T10:40:30.301Z INFO test_comprehensive_sql_function_integration:
🔍 SQL INTEGRATION VALIDATION COMPLETE: All scenarios validated successfully
Best Practices
1. Always Use Integrated Validation Pattern
#![allow(unused)]
fn main() {
// ✅ GOOD: Integrated lifecycle + SQL validation
async fn test_step_retry_behavior(pool: PgPool) -> Result<()> {
// Exercise lifecycle
StepErrorSimulator::simulate_execution_error(pool, step, 1).await?;
// Immediately validate SQL functions
pool.assert_step_retry_behavior(step_uuid, 1, None, true).await?;
// Document relationship
tracing::info!("✅ INTEGRATION: Retry behavior alignment verified");
Ok(())
}
// ❌ BAD: Testing SQL functions in isolation
async fn test_sql_only(pool: PgPool) -> Result<()> {
// Directly manipulating database state
sqlx::query!("UPDATE steps SET attempts = 3").execute(pool).await?;
// This doesn't prove the integration works
let status = sqlx::query!("SELECT * FROM get_step_readiness_status($1)", uuid)
.fetch_one(pool).await?;
Ok(())
}
}
2. Use Structured Tracing
#![allow(unused)]
fn main() {
// ✅ GOOD: Structured tracing with context
tracing::info!(
step_uuid = %step.workflow_step_uuid,
attempts = expected_attempts,
backoff = ?expected_backoff,
retry_eligible = expected_retry_eligible,
"✅ STEP ASSERTION: Retry behavior matches expectations"
);
// ❌ BAD: Unstructured logging
println!("Step retry test passed");
}
3. Test Multiple Scenarios
#![allow(unused)]
fn main() {
// ✅ GOOD: Comprehensive scenario coverage
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_complete_retry_scenarios(pool: PgPool) -> Result<()> {
// Test retryable error
test_retryable_error_scenario(&pool).await?;
// Test non-retryable error
test_non_retryable_error_scenario(&pool).await?;
// Test retry exhaustion
test_retry_exhaustion_scenario(&pool).await?;
// Test custom backoff
test_custom_backoff_scenario(&pool).await?;
Ok(())
}
}
4. Validate State Transitions
#![allow(unused)]
fn main() {
// ✅ GOOD: Validate complete state transition sequence
pool.assert_step_state_sequence(
step_uuid,
vec![
"Pending".to_string(),
"InProgress".to_string(),
"Error".to_string(),
"WaitingForRetry".to_string(),
"Ready".to_string(),
"Complete".to_string()
]
).await?;
}
5. Use Assertion Traits for Readability
#![allow(unused)]
fn main() {
// ✅ GOOD: Clear, readable assertions
pool.assert_task_complete(task_uuid).await?;
pool.assert_step_failed_permanently(step_uuid).await?;
// ❌ BAD: Manual SQL queries everywhere
let task_status = sqlx::query!("SELECT ...").fetch_one(pool).await?;
assert_eq!(task_status.some_field, Some("Complete"));
}
Troubleshooting
Common Issues
1. Assertion Failures
Error: Task 01JGJX... assertion failed: expected Complete, found Processing
// Solution: Ensure lifecycle actions complete before asserting
tokio::time::sleep(Duration::from_millis(100)).await;
pool.assert_task_complete(task_uuid).await?;
2. SQL Function Mismatches
Error: Step 01JGJX... retry assertion failed: attempts expected 2, got Some(1)
// Solution: Verify error simulator is configured correctly
StepErrorSimulator::simulate_execution_error(pool, step, 2).await?; // 2 attempts
3. State Machine Violations
Error: Cannot transition from Complete to InProgress
// Solution: Use proper orchestration framework, not direct DB manipulation
let result = orchestrator.execute_step(step, true, 1000).await?;
// Don't: sqlx::query!("UPDATE steps SET state = 'InProgress'").execute(pool).await?;
4. Template Loading Issues
Error: Template 'nonexistent.yaml' not found
// Solution: Ensure template exists in correct directory
// templates should be in tests/fixtures/task_templates/rust/
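Fixture resolution typically anchors on the crate root rather than the current working directory, which is why templates load regardless of where the tests are invoked from. A sketch of such a helper (name hypothetical):

```rust
use std::env;
use std::path::PathBuf;

/// Resolve a fixture template relative to the crate root, so tests pass
/// regardless of the directory they are invoked from.
fn template_path(name: &str) -> PathBuf {
    // CARGO_MANIFEST_DIR is set by cargo; fall back to "." elsewhere.
    let root = env::var("CARGO_MANIFEST_DIR").unwrap_or_else(|_| ".".into());
    PathBuf::from(root)
        .join("tests/fixtures/task_templates/rust")
        .join(name)
}

fn main() {
    println!("{}", template_path("linear_workflow.yaml").display());
}
```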
Debugging Techniques
1. Enable Detailed Tracing
RUST_LOG=debug cargo test test_name -- --nocapture
2. Inspect SQL Function Results Directly
#![allow(unused)]
fn main() {
let step_status = sqlx::query!(
"SELECT * FROM get_step_readiness_status($1)",
step_uuid
)
.fetch_one(&pool)
.await?;
tracing::debug!("Step status: {:?}", step_status);
}
3. Validate Test Prerequisites
#![allow(unused)]
fn main() {
// Ensure test setup is correct
assert_eq!(steps.len(), 4, "Test requires 4 steps");
assert_eq!(task.namespace, "expected_namespace");
}
4. Use Incremental Validation
#![allow(unused)]
fn main() {
// Validate after each step
orchestrator.execute_step(&step1, true, 1000).await?;
pool.assert_step_complete(step1.workflow_step_uuid).await?;
orchestrator.execute_step(&step2, false, 1000).await?;
pool.assert_step_retry_behavior(step2.workflow_step_uuid, 1, None, true).await?;
}
Migration from Old Tests
Before (Direct Database Manipulation)
#![allow(unused)]
fn main() {
// ❌ OLD: Bypassing orchestration framework
async fn test_task_finalization_old(pool: PgPool) -> Result<()> {
// Direct database manipulation
sqlx::query!("UPDATE tasks SET state = 'Error'").execute(&pool).await?;
sqlx::query!("UPDATE steps SET state = 'Error'").execute(&pool).await?;
// Test SQL functions in isolation
let context = get_task_execution_context(&pool, task_uuid).await?;
assert_eq!(context.execution_status, ExecutionStatus::Error);
Ok(())
}
}
After (Integrated Framework)
// ✅ NEW: Using integrated framework
#[sqlx::test(migrator = "tasker_core::test_helpers::MIGRATOR")]
async fn test_task_finalization_new(pool: PgPool) -> Result<()> {
    tracing::info!("🧪 Testing task finalization with integrated approach");
    // Use orchestration framework
    let orchestrator = TestOrchestrator::new(pool.clone());
    let task = orchestrator.create_simple_task("test", "finalization").await?;
    let step = get_first_step(&pool, task.task_uuid).await?;
    // Create error state through framework
    StepErrorSimulator::simulate_validation_error(
        &pool,
        &step,
        "finalization_test_error"
    ).await?;
    // Immediately validate SQL functions
    pool.assert_step_failed_permanently(step.workflow_step_uuid).await?;
    pool.assert_task_error(task.task_uuid, 1).await?;
    tracing::info!("✅ INTEGRATION: Finalization behavior verified");
    Ok(())
}
This guide has shown how the lifecycle testing framework validates complex workflow behavior while maintaining confidence in the system's correctness.
Decision Point E2E Tests
This document describes the E2E tests for decision point functionality and how to run them.
Test Location
tests/e2e/ruby/conditional_approval_test.rs
Design Note: Deferred Step Type (Added 2025-10-27)
A critical design refinement was introduced to handle convergence patterns in decision point workflows:
The Convergence Problem
In conditional_approval, all three possible outcomes (auto_approve, manager_approval, finance_review) converge to the same finalize_approval step. However, we cannot create finalize_approval at task initialization because:
- We don’t know which approval steps will be created
- finalize_approval needs different dependencies depending on the decision point’s choice
Solution: type: deferred
A new step type was added to handle this pattern:
- name: finalize_approval
type: deferred # NEW STEP TYPE!
dependencies: [auto_approve, manager_approval, finance_review] # All possible deps
How it works:
- Deferred steps list ALL possible dependencies in the template
- At initialization, deferred steps are excluded (they’re descendants of decision points)
- When a decision point creates outcome steps, the system:
  - Detects downstream deferred steps
  - Computes declared_deps ∩ actually_created_steps = actual DAG
  - Creates deferred steps with resolved dependencies
Example:
- When routing_decision chooses auto_approve:
  - Creates: auto_approve
  - Detects: finalize_approval is deferred with deps [auto_approve, manager_approval, finance_review]
  - Intersection: [auto_approve, manager_approval, finance_review] ∩ [auto_approve] = [auto_approve]
  - Creates: finalize_approval depending on auto_approve only
This elegantly solves convergence without requiring handlers to explicitly list convergence steps or special orchestration logic.
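The dependency resolution described above is a plain set intersection. A minimal sketch of the computation (the function name and signature are illustrative, not the engine's actual internals):

```rust
use std::collections::BTreeSet;

/// Resolve a deferred step's dependencies: keep only the declared
/// dependencies that the decision point actually created.
fn resolve_deferred_deps(declared: &[&str], created: &[&str]) -> Vec<String> {
    let created: BTreeSet<&str> = created.iter().copied().collect();
    declared
        .iter()
        .filter(|dep| created.contains(*dep))
        .map(|dep| dep.to_string())
        .collect()
}

fn main() {
    let declared = ["auto_approve", "manager_approval", "finance_review"];
    // The decision point chose the small-amount path:
    let created = ["auto_approve"];
    let resolved = resolve_deferred_deps(&declared, &created);
    assert_eq!(resolved, vec!["auto_approve"]);
    println!("finalize_approval depends on: {:?}", resolved);
}
```

Iterating over `declared` (rather than `created`) preserves the template's declared ordering in the resolved dependency list.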
Test Coverage
The test suite validates the conditional approval workflow, which demonstrates decision point functionality with dynamic step creation based on runtime conditions (approval amount thresholds).
Test Cases
- test_small_amount_auto_approval() - Tests amounts < $1,000
  - Expected path: validate_request → routing_decision → auto_approve → finalize_approval
  - Verifies only 4 steps created
  - Confirms manager_approval and finance_review are NOT created
- test_medium_amount_manager_approval() - Tests amounts $1,000-$4,999
  - Expected path: validate_request → routing_decision → manager_approval → finalize_approval
  - Verifies only 4 steps created
  - Confirms auto_approve and finance_review are NOT created
- test_large_amount_dual_approval() - Tests amounts >= $5,000
  - Expected path: validate_request → routing_decision → manager_approval + finance_review → finalize_approval
  - Verifies 5 steps created
  - Confirms both parallel approval steps complete
  - Verifies auto_approve is NOT created
- test_decision_point_step_dependency_structure() - Validates dependency resolution
  - Verifies dynamically created steps depend on routing_decision
  - Confirms finalize_approval waits for all approval steps
  - Tests proper execution order
- test_boundary_conditions() - Tests exactly at $1,000 threshold
  - Verifies manager approval is used (not auto)
- test_boundary_large_threshold() - Tests exactly at $5,000 threshold
  - Verifies dual approval path is triggered
- test_very_small_amount() - Tests $0.01 amount
  - Verifies auto-approval for very small amounts
Running the Tests
Prerequisites
The tests require the full integration environment to be running. Use the Docker Compose test strategy:
# From the tasker-core directory
# 1. Stop any existing containers and clean up
docker-compose -f docker/docker-compose.test.yml down -v
# 2. Rebuild containers with latest changes
docker-compose -f docker/docker-compose.test.yml up --build -d
# 3. Wait for services to be healthy (about 10-15 seconds)
sleep 15
# 4. Run the conditional approval E2E tests
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo test --test e2e_tests e2e::ruby::conditional_approval_test -- --nocapture
# 5. Clean up after testing (optional)
docker-compose -f docker/docker-compose.test.yml down
Running Specific Tests
# Run just the small amount test
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_small_amount_auto_approval -- --nocapture
# Run just the large amount test
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_large_amount_dual_approval -- --nocapture
# Run all boundary tests
cargo test --test e2e_tests e2e::ruby::conditional_approval_test::test_boundary -- --nocapture
Environment Variables
The tests use the following environment variables (set automatically in docker-compose.test.yml):
- DATABASE_URL: PostgreSQL connection string
- TASKER_ENV: Set to "test" for test configuration
- TASK_TEMPLATE_PATH: Points to test fixtures directory
- RUST_LOG: Set to "info" or "debug" for detailed logging
Test Workflow Details
Conditional Approval Workflow
The workflow implements amount-based routing:
┌─────────────────┐
│ validate_request│
│ (initial) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ routing_decision│ ◄─── DECISION POINT (type: decision)
│ (decision) │
└────────┬────────┘
│
├─────────── < $1,000 ─────────┐
│ │
│ ▼
│ ┌────────────────┐
│ │ auto_approve │
│ └────────┬───────┘
│ │
├─────── $1,000-$4,999 ────────┼────┐
│ │ │
│ │ ▼
│ │ ┌──────────────────┐
│ │ │ manager_approval │
│ │ └────────┬─────────┘
│ │ │
└──────── >= $5,000 ───────────┼───────────┼────┐
│ │ │
│ │ ▼
│ │ ┌───────────────┐
│ │ │ finance_review│
│ │ └───────┬───────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────┐
│ finalize_approval │
│ (convergence) │
└─────────────────────────┘
Decision Point Mechanism
- routing_decision step executes with type: decision marker
- Handler returns DecisionPointOutcome::CreateSteps with step names
- Orchestration creates those steps dynamically and adds dependencies
- Dynamically created steps execute like normal steps
- Convergence step (finalize_approval) waits for all paths
Task Template Location
The test uses the task template at:
tests/fixtures/task_templates/ruby/conditional_approval_handler.yaml
Ruby Handler Implementation
The Ruby handlers are located at:
workers/ruby/spec/handlers/examples/conditional_approval/
├── handlers/
│ └── conditional_approval_handler.rb
└── step_handlers/
├── validate_request_handler.rb
├── routing_decision_handler.rb ◄─── DECISION POINT HANDLER
├── auto_approve_handler.rb
├── manager_approval_handler.rb
├── finance_review_handler.rb
└── finalize_approval_handler.rb
Key Implementation Detail
The routing_decision_handler.rb returns a decision point outcome:
outcome = if steps_to_create.empty?
TaskerCore::Types::DecisionPointOutcome.no_branches
else
TaskerCore::Types::DecisionPointOutcome.create_steps(steps_to_create)
end
TaskerCore::Types::StepHandlerCallResult.success(
result: {
# IMPORTANT: The decision point outcome MUST be in this key
decision_point_outcome: outcome.to_h,
route_type: route[:type],
# ... other result fields
}
)
Troubleshooting
Tests Fail with “Template Not Found”
Ensure the Ruby worker container has the correct template path:
docker-compose -f docker/docker-compose.test.yml logs ruby-worker
# Should show: TASK_TEMPLATE_PATH=/app/tests/fixtures/task_templates/ruby
Tests Timeout
Increase wait time in docker-compose startup:
sleep 30 # Instead of sleep 15
Database Connection Errors
Verify PostgreSQL is running and healthy:
docker-compose -f docker/docker-compose.test.yml ps
docker-compose -f docker/docker-compose.test.yml logs postgres
Step Creation Doesn’t Happen
Check orchestration logs for decision point processing:
docker-compose -f docker/docker-compose.test.yml logs orchestration | grep -i decision
Success Criteria
All tests should pass with output similar to:
test e2e::ruby::conditional_approval_test::test_small_amount_auto_approval ... ok
test e2e::ruby::conditional_approval_test::test_medium_amount_manager_approval ... ok
test e2e::ruby::conditional_approval_test::test_large_amount_dual_approval ... ok
test e2e::ruby::conditional_approval_test::test_decision_point_step_dependency_structure ... ok
test e2e::ruby::conditional_approval_test::test_boundary_conditions ... ok
test e2e::ruby::conditional_approval_test::test_boundary_large_threshold ... ok
test e2e::ruby::conditional_approval_test::test_very_small_amount ... ok
test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Next Steps
After validating Ruby workers:
- Phase 8a: Implement Rust worker support for decision points
- Phase 9a: Create E2E tests for Rust worker decision points
Focused Architectural and Security Audit Report
Audit Date: 2026-02-05
Auditor: Claude (Opus 4.6 / Sonnet 4.5 sub-agents)
Status: Complete
Executive Summary
This audit evaluates all Tasker Core crates for alpha readiness across security, error handling, resilience, and architecture dimensions. Findings are categorized by severity (Critical/High/Medium/Low/Info) per the methodology defined in the audit specification.
Alpha Readiness Verdict
ALPHA READY with targeted fixes. No Critical vulnerabilities found. The High-severity items (dependency CVE, input validation gaps, shutdown timeouts) are straightforward fixes that can be completed in a single sprint.
Consolidated Finding Counts (All Crates)
| Severity | Count | Status |
|---|---|---|
| Critical | 0 | None found |
| High | 9 | Must fix before alpha |
| Medium | 22 | Document as known limitations |
| Low | 13 | Track for post-alpha |
High-Severity Findings (Must Fix Before Alpha)
| ID | Finding | Crate | Fix Effort | Remediation |
|---|---|---|---|---|
| S-1 | Queue name validation missing | tasker-shared | Small | Queue name validation |
| S-2 | SQL error details exposed to clients | tasker-shared | Medium | Error message sanitization |
| S-3 | #[allow] → #[expect] (systemic) | All | Small (batch) | Lint compliance cleanup |
| P-1 | NOTIFY channel name unvalidated | tasker-pgmq | Small | Queue name validation |
| O-1 | No actor panic recovery | tasker-orchestration | Medium | Shutdown and recovery hardening |
| O-2 | Graceful shutdown lacks timeout | tasker-orchestration | Small | Shutdown and recovery hardening |
| W-1 | checkpoint_yield blocks FFI without timeout | tasker-worker | Small | FFI checkpoint timeout |
| X-1 | bytes v1.11.0 CVE (RUSTSEC-2026-0007) | Workspace | Trivial | Dependency upgrade |
| P-2 | CLI migration SQL generation unescaped | tasker-pgmq | Small | Queue name validation |
Crate 1: tasker-shared
Overall Rating: A- (Strong foundation with targeted improvements needed)
The tasker-shared crate is the largest and most foundational crate in the workspace. It provides core types, error handling, messaging abstraction, security services, circuit breakers, configuration management, database utilities, and shared models. The crate demonstrates strong security practices overall.
Strengths
- Zero unsafe code across the entire crate
- Excellent cryptographic hygiene: Constant-time API key comparison via subtle::ConstantTimeEq (src/types/api_key_auth.rs:53-62), JWKS hardening with SSRF prevention (blocks private IPs, cloud metadata endpoints, requires HTTPS), algorithm allowlist enforcement (no alg: none)
- Comprehensive input validation: JSONB validation with size/depth/key count limits (src/validation.rs), namespace validation with PostgreSQL identifier rules, XSS sanitization
- 100% SQLx macro usage: All database queries use compile-time verified sqlx::query! macros, zero string interpolation in SQL
- Lock-free circuit breakers: Atomic state management (AtomicU8 for state, AtomicU64 for metrics), proper memory ordering, correct state machine transitions
- All MPSC channels bounded and config-driven: Full bounded-channel compliance
- Exemplary config security: Environment variable allowlist with regex validation, TOML injection prevention via escape_toml_string(), fail-fast on validation errors
- No hardcoded secrets: All sensitive values come from env vars or file paths
- Well-organized API surface: Feature-gated modules (web-api, grpc-api), selective re-exports
Finding S-1 (HIGH): Queue Name Validation Missing
Location: tasker-shared/src/messaging/service/router.rs:96-97
Queue names are constructed via format! with unvalidated namespace input:
fn step_queue(&self, namespace: &str) -> String {
    format!("{}_{}_queue", self.worker_queue_prefix, namespace)
}
The MessagingError::InvalidQueueName variant exists (src/messaging/errors.rs:56) but is never raised. Neither the router nor the provider implementations (pgmq.rs:134-139, rabbitmq.rs:276-375) validate queue names before passing them to native queue APIs.
Risk: PGMQ creates PostgreSQL tables named after queues — special characters in namespace could cause SQL issues at the DDL level. RabbitMQ queue creation could fail with unexpected characters.
Recommendation: Add validate_queue_name() that enforces alphanumeric + underscore/hyphen, 1-255 chars. Call it in DefaultMessageRouter methods and/or ensure_queue().
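A std-only sketch of such a guard, under the assumption that a plain character allowlist suffices; the hypothetical validate_queue_name here returns String errors, whereas the real fix would presumably raise the existing MessagingError::InvalidQueueName variant:

```rust
/// Hypothetical queue-name validator: alphanumeric plus underscore/hyphen,
/// 1-255 characters, per the recommendation above.
fn validate_queue_name(name: &str) -> Result<(), String> {
    if name.is_empty() || name.len() > 255 {
        return Err(format!("queue name length must be 1-255, got {}", name.len()));
    }
    if let Some(bad) = name
        .chars()
        .find(|c| !(c.is_ascii_alphanumeric() || *c == '_' || *c == '-'))
    {
        return Err(format!("invalid character {:?} in queue name {:?}", bad, name));
    }
    Ok(())
}

fn main() {
    assert!(validate_queue_name("worker_payments_queue").is_ok());
    assert!(validate_queue_name("bad;DROP TABLE--").is_err());
    assert!(validate_queue_name("").is_err());
}
```

Calling this in DefaultMessageRouter methods and in ensure_queue() covers both construction sites mentioned in the finding.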
Finding S-2 (HIGH): SQL Error Details Exposed to Clients
Location: tasker-shared/src/errors.rs:71-74, 431-437
impl From<sqlx::Error> for TaskerError {
    fn from(err: sqlx::Error) -> Self {
        TaskerError::DatabaseError(err.to_string())
    }
}
sqlx::Error::to_string() can expose SQL query details, table/column names, constraint names, and potentially connection string information. These error messages may propagate to API responses.
Recommendation: Create a sanitized error mapper that logs full details internally but returns generic messages to API clients (e.g., “Database operation failed” with an internal error ID for correlation).
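The split between internal and external error text can be sketched with plain strings (the function and its correlation-ID scheme are hypothetical, not the crate's API):

```rust
/// Hypothetical sanitizer: keep the raw database error for internal
/// logging, return only a generic message plus a correlation ID to clients.
fn sanitize_db_error(raw: &str, error_id: u64) -> (String, String) {
    // Internal log line keeps full detail for operators.
    let internal = format!("[error_id={error_id}] database error: {raw}");
    // Client-facing message leaks nothing about schema or constraints.
    let external = format!("Database operation failed (error ID {error_id})");
    (internal, external)
}

fn main() {
    let raw = "duplicate key value violates unique constraint \"tasks_pkey\"";
    let (internal, external) = sanitize_db_error(raw, 42);
    assert!(internal.contains("tasks_pkey")); // full detail stays internal
    assert!(!external.contains("tasks_pkey")); // nothing leaks to the client
    println!("{external}");
}
```

The shared error ID lets an operator correlate a client report with the detailed internal log entry.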
Finding S-3 (HIGH): #[allow] Used Instead of #[expect] (Lint Policy Violation)
Locations:
- src/messaging/execution_types.rs:383 — #[allow(clippy::too_many_arguments)]
- src/web/authorize.rs:194 — #[allow(dead_code)]
- src/utils/serde.rs:46-47 — #[allow(dead_code)]
Project lint policy mandates #[expect(lint_name, reason = "...")] instead of #[allow]. This is a policy compliance issue.
Recommendation: Convert all #[allow] to #[expect] with documented reasons.
Finding S-4 (MEDIUM): unwrap_or_default() Violations of Tenet #11 (Fail Loudly)
Locations (20+ instances across crate):
- src/messaging/execution_types.rs:120,186,213 — Step execution status defaults to empty string
- src/database/sql_functions.rs:377,558 — Query results default to empty vectors
- src/registry/task_handler_registry.rs:214,268,656,700,942 — Config schema fields default silently
- src/proto/conversions.rs:32 — Invalid timestamps silently default to UNIX epoch
Risk: Required fields silently defaulting to empty values can mask real errors and produce incorrect behavior that’s hard to debug.
Recommendation: Audit all unwrap_or_default() usages. Replace with explicit error returns for required fields. Keep unwrap_or_default() only for truly optional fields with documented rationale.
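The difference between the two patterns can be seen in a small sketch (the row type and field are hypothetical stand-ins for the query results named above):

```rust
#[derive(Debug)]
struct StepRow {
    execution_status: Option<String>,
}

/// ❌ Masks a missing required field: "" looks like a real (empty) status.
fn status_defaulting(row: &StepRow) -> String {
    row.execution_status.clone().unwrap_or_default()
}

/// ✅ Fails loudly when a required field is absent (Tenet #11).
fn status_required(row: &StepRow) -> Result<String, String> {
    row.execution_status
        .clone()
        .ok_or_else(|| "execution_status missing for step row".to_string())
}

fn main() {
    let row = StepRow { execution_status: None };
    assert_eq!(status_defaulting(&row), ""); // silent empty value
    assert!(status_required(&row).is_err()); // explicit error instead
}
```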
Finding S-5 (MEDIUM): Error Context Loss in .map_err(|_| ...)
14 instances where original error context is discarded:
- src/messaging/service/providers/rabbitmq.rs:544 — Discards parse error
- src/messaging/service/providers/in_memory.rs:305,331,368 — 3 instances
- src/state_machine/task_state_machine.rs:114 — Discards parse error
- src/state_machine/actions.rs:256,372,434,842 — 4 instances discarding publisher errors
- src/config/config_loader.rs:220,417 — 2 instances discarding env var errors
- src/database/sql_functions.rs:1032 — Discards decode error
- src/types/auth.rs:283 — Discards parse error
Recommendation: Include original error via .map_err(|e| SomeError::new(context, e.to_string())).
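A sketch of the context-preserving form, using a hypothetical error type and a stdlib parse error as the cause:

```rust
#[derive(Debug)]
enum ConfigError {
    /// Carries both the context (the bad value) and the underlying cause.
    InvalidPort { value: String, cause: String },
}

fn parse_port(raw: &str) -> Result<u16, ConfigError> {
    // ✅ Preserve the original error instead of .map_err(|_| ...)
    raw.parse::<u16>().map_err(|e| ConfigError::InvalidPort {
        value: raw.to_string(),
        cause: e.to_string(),
    })
}

fn main() {
    let err = parse_port("not-a-port").unwrap_err();
    let msg = format!("{err:?}");
    // The underlying parse error survives into the message.
    assert!(msg.contains("invalid digit"));
    assert!(msg.contains("not-a-port"));
}
```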
Finding S-6 (MEDIUM): Production expect() Calls
- src/macros.rs:65 — Panics if Tokio task spawning fails
- src/cache/provider.rs:399,429,459,489,522 — Multiple expect("checked in should_use") calls
Risk: Panics in production code. While guarded by preconditions, they bypass error propagation.
Recommendation: Replace with Result propagation or add detailed safety comments explaining invariant guarantees.
Finding S-7 (MEDIUM): Database Pool Config Lacks Validation
Database pool configuration (PoolConfig) does not have a validate() method. Unlike circuit breaker config which validates ranges (failure_threshold > 0, timeout <= 300s), pool config relies on sqlx to reject invalid values at runtime.
Recommendation: Add validation: max_connections > 0, min_connections <= max_connections, acquire_timeout_seconds > 0.
Finding S-8 (MEDIUM): Individual Query Timeouts Missing
While database pools have acquire_timeout configured (src/database/pools.rs:169-170), individual sqlx::query! calls lack explicit timeout wrappers. Long-running queries rely solely on pool-level timeouts.
Recommendation: Consider PostgreSQL statement_timeout at the connection level, or add tokio::time::timeout() wrappers around critical query paths.
Finding S-9 (LOW): Message Size Limits Not Enforced
Messaging deserialization uses serde_json::from_slice() without explicit size limits. While PGMQ has implicit limits from PostgreSQL column sizes, a very large message could cause memory issues during deserialization.
Recommendation: Add configurable message size limits at the provider level.
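Such a limit is a length check applied before deserialization; a std-only sketch (the function name is illustrative):

```rust
/// Hypothetical size guard applied before JSON deserialization.
fn check_message_size(payload: &[u8], max_bytes: usize) -> Result<&[u8], String> {
    if payload.len() > max_bytes {
        return Err(format!(
            "message of {} bytes exceeds limit of {} bytes",
            payload.len(),
            max_bytes
        ));
    }
    Ok(payload)
}

fn main() {
    let small = br#"{"ok":true}"#;
    assert!(check_message_size(small, 1024).is_ok());
    let big = vec![b'x'; 2048];
    assert!(check_message_size(&big, 1024).is_err());
}
```

Checking the raw byte length is cheap and rejects oversized payloads before any allocation-heavy parsing begins.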
Finding S-10 (LOW): File Path Exposure in Config Errors
src/services/security_service.rs:184-187 — Configuration errors include filesystem paths. Only occurs during startup (not exposed to API clients in normal operation).
Finding S-11 (LOW): Timestamp Conversion Silently Defaults to Epoch
src/proto/conversions.rs:32 — DateTime::from_timestamp().unwrap_or_default() silently converts invalid timestamps to 1970-01-01 instead of returning an error.
Finding S-12 (LOW): cargo-machete Ignore List Has 19 Entries
Cargo.toml:12-39 — Most are legitimately feature-gated or used via macros, but the list should be periodically audited to prevent dependency bloat.
Finding S-13 (LOW): Global Wildcard Permission Rejection Undocumented
src/types/permissions.rs — The permission_matches() function correctly rejects global wildcard (*) permissions but this behavior isn’t documented in user-facing comments.
Crate 2: tasker-pgmq
Overall Rating: B+ (Good with one high-priority fix needed)
The tasker-pgmq crate is a PGMQ wrapper providing PostgreSQL LISTEN/NOTIFY support for event-driven message processing. ~3,345 source lines across 9 files. No dependencies on tasker-shared (clean separation).
Strengths
- No unsafe code across the entire crate
- Payload uses parameterized queries: Message payloads bound via
$1parameter in NOTIFY - Payload size validation: Enforces pg_notify 8KB limit
- Comprehensive thiserror error types with context preservation
- Bounded channels: All MPSC channels bounded
- Good test coverage: 6 integration test files covering major flows
- Clean separation from tasker-shared: No duplication, standalone library
Finding P-1 (HIGH): SQL Injection via NOTIFY Channel Name
Location: tasker-pgmq/src/emitter.rs:122
let sql = format!("NOTIFY {}, $1", channel);
sqlx::query(&sql).bind(payload).execute(&self.pool)
PostgreSQL’s NOTIFY does not support parameterized channel identifiers. The channel name is interpolated directly via format!. Channel names flow from config.build_channel_name() which concatenates channels_prefix (from TOML config) with base channel names and namespace strings.
Risk: While the NOTIFY command has limited injection surface (it’s not a general SQL execution vector), malformed channel names could cause PostgreSQL errors, unexpected channel routing, or denial of service. The channels_prefix comes from config (lower risk), but namespace strings flow from queue operations.
Recommendation: Add channel name validation — allow only [a-zA-Z0-9_.]+, max 63 chars (PostgreSQL identifier limit). Apply in build_channel_name() and/or notify_channel().
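A std-only sketch of that validator, assuming the allowlist and 63-byte limit from the recommendation (the function name is hypothetical; the real fix would live in build_channel_name() or notify_channel()):

```rust
/// Hypothetical NOTIFY channel validator: [a-zA-Z0-9_.]+ and at most
/// 63 bytes (PostgreSQL identifier limit), per the recommendation above.
fn validate_channel_name(name: &str) -> Result<(), String> {
    if name.is_empty() || name.len() > 63 {
        return Err(format!("channel name length must be 1-63, got {}", name.len()));
    }
    if !name
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '.')
    {
        return Err(format!("channel name {name:?} contains invalid characters"));
    }
    Ok(())
}

fn main() {
    assert!(validate_channel_name("tasker.step_results").is_ok());
    assert!(validate_channel_name("bad channel; --").is_err());
    assert!(validate_channel_name("").is_err());
}
```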
Finding P-2 (HIGH): CLI Migration SQL Generation with Unescaped Input
Location: tasker-pgmq/src/bin/cli.rs:179-353
User-provided regex patterns and channel prefixes are directly interpolated into SQL migration strings when generating migration files. While these are generated files that should be reviewed before application, the lack of escaping creates a risk if the generation process is automated.
Recommendation: Validate inputs against strict patterns before interpolation. Add a warning comment in generated files that they should be reviewed.
Finding P-3 (MEDIUM): unwrap_or_default() on Database Results (Tenet #11)
Location: tasker-pgmq/src/client.rs:164
.read_batch(queue_name, visibility_timeout, l).await?.unwrap_or_default()
When read_batch returns None, this silently produces an empty vector instead of failing loudly. Could mask permission errors, connection failures, or other serious issues.
Recommendation: Return explicit error on unexpected None.
Finding P-4 (MEDIUM): RwLock Poison Handling Masks Panics
Location: tasker-pgmq/src/listener.rs (22 instances)
self.stats.write().unwrap_or_else(|p| p.into_inner())
Silently recovers from poisoned RwLock without logging. Could propagate corrupted state from a panicked thread.
Recommendation: Log warning on poison recovery, or switch to parking_lot::RwLock (doesn’t poison).
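Because std's RwLock poisoning is involved, the recovery-with-logging pattern can be shown self-contained (eprintln! stands in for the crate's tracing calls):

```rust
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    let stats = Arc::new(RwLock::new(0u64));

    // A thread panics while holding the write lock, poisoning it.
    let poisoner = Arc::clone(&stats);
    let _ = thread::spawn(move || {
        let _guard = poisoner.write().unwrap();
        panic!("handler crashed mid-update");
    })
    .join();

    // ✅ Recover, but surface the poison event instead of hiding it.
    let value = match stats.read() {
        Ok(guard) => *guard,
        Err(poisoned) => {
            eprintln!("WARN: stats lock poisoned by a panicked thread; recovering");
            *poisoned.into_inner()
        }
    };
    assert_eq!(value, 0);
}
```

The silent unwrap_or_else(|p| p.into_inner()) takes the Err arm without the log line, which is exactly the visibility the finding asks for.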
Finding P-5 (MEDIUM): Hardcoded Pool Size
Location: tasker-pgmq/src/client.rs:41-44
let pool = sqlx::postgres::PgPoolOptions::new()
    .max_connections(20) // Hard-coded
    .connect(database_url).await?;
Pool size should be configurable for different deployment scenarios.
Finding P-6 (MEDIUM): Missing Async Operation Timeouts
Database operations in client.rs, emitter.rs, and listener.rs lack explicit tokio::time::timeout() wrappers. Relies solely on pool-level acquire timeouts.
Finding P-7 (LOW): Error Context Loss in Regex Compilation
Location: tasker-pgmq/src/config.rs:169
Regex::new(&self.queue_naming_pattern)
    .map_err(|_| PgmqNotifyError::invalid_pattern(&self.queue_naming_pattern))
Original regex error details discarded.
Finding P-8 (LOW): #[allow] Instead of #[expect] (Lint Policy)
Location: tasker-pgmq/src/emitter.rs:299-320 — 3 instances of #[allow(dead_code)] on EmitterFactory.
Crate 3: tasker-orchestration
Overall Rating: A- (Strong security with targeted resilience improvements needed)
The tasker-orchestration crate handles core orchestration logic: actors, state machines, REST + gRPC APIs, and auth middleware. This is the largest service crate and the primary attack surface.
Strengths
- Zero unsafe code across the entire crate
- Excellent auth architecture: Constant-time API key comparison, JWT algorithm allowlist, JWKS SSRF prevention, auth before body parsing
- gRPC/REST auth parity verified: All 6 gRPC task methods enforce identical permissions to REST counterparts
- No auth bypass found: All API v1 routes wrapped in authorize(), health/metrics public by design
- Database-level atomic claiming: FOR UPDATE SKIP LOCKED prevents concurrent state corruption
- State transitions enforce ownership: No API endpoint allows direct state manipulation
- Sanitized error responses: No stack traces, database errors genericized, consistent JSON format
- Backpressure checked before resource operations: 503 with Retry-After header
- Full bounded-channel compliance: All MPSC channels bounded and config-driven (0 unbounded channels)
- HTTP request timeout: TimeoutLayer with configurable 30s default
Finding O-1 (HIGH): No Actor Panic Recovery
Location: tasker-orchestration/src/actors/command_processor_actor.rs:139
Actors spawn via spawn_named! but have no supervisor/restart logic. If OrchestrationCommandProcessorActor panics, the entire orchestration processing stops. Recovery requires full process restart.
Recommendation: Implement panic-catching wrapper with logged restart, or document that process-level supervision (systemd, k8s) handles this.
Finding O-2 (HIGH): Graceful Shutdown Lacks Timeout
Locations:
tasker-orchestration/src/orchestration/bootstrap.rs:177-213tasker-orchestration/src/bin/server.rs:68-82
Shutdown calls coordinator.lock().await.stop().await and orchestration_handle.stop().await with no timeout. If the event coordinator or actors hang during shutdown, the server never completes graceful shutdown.
Recommendation: Add 30-second timeout with force-kill fallback.
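In the async codebase this would be tokio::time::timeout wrapped around the stop() futures; the same timeout-with-fallback pattern in std-library form, as an illustrative analogue rather than the actual shutdown code:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Wait for a shutdown-complete signal, but never longer than the deadline.
fn wait_for_shutdown(done: mpsc::Receiver<()>, deadline: Duration) -> Result<(), String> {
    done.recv_timeout(deadline)
        .map_err(|_| "graceful shutdown timed out; forcing exit".to_string())
}

fn main() {
    // Simulate a component that hangs during shutdown and never signals.
    let (_tx, rx) = mpsc::channel::<()>();
    let result = wait_for_shutdown(rx, Duration::from_millis(50));
    assert!(result.is_err()); // caller falls through to the force-kill path

    // A component that shuts down promptly completes within the deadline.
    let (tx, rx) = mpsc::channel::<()>();
    thread::spawn(move || tx.send(()).unwrap());
    assert!(wait_for_shutdown(rx, Duration::from_secs(1)).is_ok());
}
```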
Finding O-3 (HIGH): #[allow] Instead of #[expect] (Lint Policy)
21 instances of #[allow] found across the crate (most without reason = clause):
- src/actors/traits.rs:67,81
- src/web/extractors.rs:6
- src/health/channel_status.rs:87
- src/grpc/conversions.rs:42
- And 16 more locations
Finding O-4 (MEDIUM): Request Validation Not Enforced at Handler Layer
Location: src/web/handlers/tasks.rs:47
TaskRequest has #[derive(Validate)] with constraints (name length 1-255, namespace length 1-255, priority range -100 to 100) but handlers accept Json<TaskRequest> without calling .validate(). Validation happens later at the service layer.
Impact: Oversized payloads are deserialized before rejection. Not a security vulnerability per se, but the defense-in-depth pattern would catch malformed input earlier.
Recommendation: Add .validate() at handler entry or use Valid<Json<TaskRequest>> extractor.
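The constraints the finding describes can be mirrored in a std-only sketch (the struct and validate function are hypothetical stand-ins for the derive(Validate) machinery):

```rust
/// Hypothetical mirror of the TaskRequest constraints described above:
/// name/namespace length 1-255, priority in -100..=100.
struct TaskRequest {
    name: String,
    namespace: String,
    priority: i32,
}

fn validate(req: &TaskRequest) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();
    if req.name.is_empty() || req.name.len() > 255 {
        errors.push("name length must be 1-255".to_string());
    }
    if req.namespace.is_empty() || req.namespace.len() > 255 {
        errors.push("namespace length must be 1-255".to_string());
    }
    if !(-100..=100).contains(&req.priority) {
        errors.push("priority must be in -100..=100".to_string());
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}

fn main() {
    let bad = TaskRequest {
        name: String::new(),
        namespace: "ns".into(),
        priority: 500,
    };
    let errors = validate(&bad).unwrap_err();
    assert_eq!(errors.len(), 2); // rejected at the handler boundary
}
```

Running this check at handler entry rejects malformed requests before any service-layer work happens.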
Finding O-5 (MEDIUM): Actor Shutdown May Lose In-Flight Work
Location: tasker-orchestration/src/actors/registry.rs:216-259
Shutdown uses Arc::get_mut() which only works if no other references exist. If get_mut fails, stopped() is silently skipped. In-flight work may be lost.
Finding O-6 (MEDIUM): Database Query Timeouts Missing
Same pattern as tasker-shared (Finding S-8). Individual sqlx::query! calls lack explicit timeout wrappers:
- src/services/health/service.rs:284 — health check query
- src/orchestration/backoff_calculator.rs:232,245,290,345,368 — multiple queries
Pool-level acquire timeout (30s) provides partial mitigation.
Finding O-7 (MEDIUM): unwrap_or_default() on Config Fields
- src/orchestration/event_systems/unified_event_coordinator.rs:89 — event system config
- src/orchestration/bootstrap.rs:581 — namespace config
- src/grpc/services/config.rs:96-97 — jwt_issuer and jwt_audience default to empty strings
Finding O-8 (MEDIUM): Error Context Loss
~12 instances of .map_err(|_| ...) discarding error context:
- src/orchestration/bootstrap.rs:203 — oneshot send error
- src/web/handlers/health.rs:53 — timeout error
- src/web/handlers/tasks.rs:113 — UUID parse error
Finding O-9 (MEDIUM): Hardcoded Magic Numbers
- src/services/task_service.rs:257-259 — per_page > 100 validation
- src/orchestration/event_systems/orchestration_event_system.rs:142 — 24h max message age
- src/services/analytics_query_service.rs:229 — 30.0s slow step threshold
Finding O-10 (LOW): gRPC Internal Error May Leak Details
Location: src/grpc/conversions.rs:152-153
tonic::Status::internal(error.to_string()) — depending on error Display implementations, could expose implementation details in gRPC error messages.
Finding O-11 (LOW): CORS Allows Any Origin
Location: src/web/mod.rs
CorsLayer::new()
    .allow_origin(tower_http::cors::Any)
    .allow_methods(tower_http::cors::Any)
    .allow_headers(tower_http::cors::Any)
Acceptable for alpha/API service, but should be configurable for production deployments.
Crate 4: tasker-worker
Overall Rating: A- (Strong FFI safety with one notable gap)
The tasker-worker crate handles handler dispatch, FFI integration, and completion processing. Despite complex FFI requirements, it achieves this with zero unsafe blocks in the crate itself.
Strengths
- Zero unsafe code despite handling Ruby/Python FFI integration
- All SQL queries via sqlx macros — no string interpolation
- Handler panic containment: catch_unwind() + AssertUnwindSafe wraps all handler calls
- Error classification preserved: Permanent/Retryable distinction maintained across FFI boundary
- Fire-and-forget callbacks: Spawned into runtime, 5s timeout, no deadlock risk
- FFI completion circuit breaker: Latency-based, 100ms threshold, lock-free metrics
- All MPSC channels bounded — full bounded-channel compliance
- No production unwrap()/expect() in core paths
Finding W-1 (HIGH): checkpoint_yield Blocks FFI Thread Without Timeout
Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs:904
let result = self.config.runtime_handle.block_on(async {
    self.handle_checkpoint_yield_async(/* ... */).await
});
Uses block_on which blocks the Ruby/Python thread while persisting checkpoint data to the database. No timeout wrapper. If the database is slow, this blocks the FFI thread indefinitely, potentially exhausting the thread pool.
Recommendation: Add tokio::time::timeout() around the block_on body (configurable, suggest 10s default).
Finding W-2 (MEDIUM): Starvation Detection is Warning-Only
Location: tasker-worker/src/worker/handlers/ffi_dispatch_channel.rs:772-793
check_starvation_warnings() logs warnings but doesn’t enforce any action. Also requires manual invocation by the caller — no automatic monitoring loop.
Finding W-3 (MEDIUM): FFI Thread Safety Documentation Gap
The FfiDispatchChannel uses Arc<Mutex<mpsc::Receiver>> (thread-safe) but lacks documentation about thread-safety guarantees, poll() contention behavior, and block_on safety in FFI context.
Finding W-4 (MEDIUM): #[allow] vs #[expect] (Lint Policy)
5 instances in web/middleware/mod.rs and web/middleware/request_id.rs.
Finding W-5 (MEDIUM): Missing Database Query Timeouts
Same systemic pattern as other crates. Checkpoint service and step claim queries lack explicit timeout wrappers.
Finding W-6 (LOW): unwrap_or_default() in worker/core.rs
Several instances, appear to be for optional fields (likely legitimate), but warrants review.
Crates 5-6: tasker-client & tasker-cli
Overall Rating: A (Excellent — cleanest crates in the workspace)
These client crates demonstrate the strongest compliance across all audit dimensions. Notably, lint policy compliant (using #[expect] already). No Critical or High findings.
Strengths
- No unsafe code in either crate
- No hardcoded credentials — all auth from env vars or config files
- RSA key generation validates minimum 2048-bit keys
- Proper error context preservation in all From conversions
- Complete transport abstraction: REST and gRPC both implement 11/11 methods
- HTTP/gRPC timeouts configured: 30s request, 10s connect
- Exponential backoff retry for create_task with configurable max retries
- Lint policy compliant — uses #[expect] with reasons
- User-facing CLI errors informative without leaking internals
Finding C-1 (MEDIUM): TLS Certificate Validation Not Explicitly Enforced
Location: tasker-client/src/api_clients/orchestration_client.rs:220
HTTP client uses reqwest::Client::builder() without explicitly setting .danger_accept_invalid_certs(false). Default is secure, but explicit enforcement prevents accidental changes.
Finding C-2 (MEDIUM): Default URLs Use HTTP
Location: tasker-client/src/config.rs:276
Default base_url is http://localhost:8080. Credentials transmitted over HTTP are vulnerable to interception. Appropriate for local dev, but should warn when HTTP is used with authentication enabled.
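A sketch of the suggested warning, as a hypothetical startup check; the localhost exemption is an assumption based on the finding's note that HTTP is appropriate for local dev:

```rust
/// Hypothetical startup check: warn when credentials would travel over
/// plaintext HTTP, exempting localhost for local development.
fn warn_on_insecure_transport(base_url: &str, auth_enabled: bool) -> Option<String> {
    let is_http = base_url.starts_with("http://");
    let is_local = base_url.starts_with("http://localhost")
        || base_url.starts_with("http://127.0.0.1");
    if is_http && auth_enabled && !is_local {
        Some(format!(
            "WARN: sending credentials over plaintext HTTP to {base_url}; use https://"
        ))
    } else {
        None
    }
}

fn main() {
    assert!(warn_on_insecure_transport("http://localhost:8080", true).is_none());
    assert!(warn_on_insecure_transport("http://api.example.com", true).is_some());
    assert!(warn_on_insecure_transport("https://api.example.com", true).is_none());
}
```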
Finding C-3 (MEDIUM): Retry Logic Only on create_task
Other operations (get_task, list_tasks, etc.) do not retry on transient failures. Should either extend retry logic or document the limitation.
Finding C-4 (LOW): Production expect() in Config Initialization
tasker-client/src/api_clients/orchestration_client.rs:123 — panics if config is malformed. Acceptable during startup but could return Result instead.
Crates 7-10: Language Workers (Rust, Ruby, Python, TypeScript)
Overall Rating: A- (Strong FFI engineering, no critical gaps)
All 4 language workers share common architecture via FfiDispatchChannel for poll-based event dispatch. Audited ~22,000 lines of Rust FFI code plus language wrappers.
Strengths
- TypeScript: Comprehensive panic handling — `catch_unwind` on all critical FFI functions, errors converted to JSON error responses
- Ruby/Python: Managed FFI via Magnus and PyO3 — these frameworks handle panic unwinding automatically via their exception systems
- Error classification preserved across all FFI boundaries: Permanent/Retryable distinction maintained
- Fire-and-forget callbacks: No deadlock risk identified
- Starvation detection functional in all workers
- Proper Arc usage for thread-safe shared ownership across FFI
- TypeScript C FFI: Correct string memory management with `into_raw()`/`from_raw()` pattern and `free_rust_string()` for caller cleanup
- Checkpoint support uniformly implemented across all 4 workers
- Consistent error hierarchy across all languages
Finding LW-1 (MEDIUM): TypeScript FFI Missing Safety Documentation
Location: workers/typescript/src-rust/lib.rs:38
`#![allow(clippy::missing_safety_doc)]` suppresses the safety-documentation lint for 9 `unsafe extern "C"` functions. Should use `#[expect]` per lint policy and add `# Safety` sections.
Finding LW-2 (MEDIUM): Rust Worker #[allow(dead_code)] (Lint Policy)
Location: workers/rust/src/event_subscribers/logging_subscriber.rs:60,98,132
3 instances of #[allow(dead_code)] instead of #[expect].
Finding LW-3 (LOW): Ruby Bootstrap Uses expect() on Ruby Runtime
Location: workers/ruby/ext/tasker_core/src/bridge.rs:19-20, bootstrap.rs:29-30
Ruby::get().expect("Ruby runtime should be available") — safe in practice (guaranteed by Magnus FFI contract) but could use ? for defensive programming.
Finding LW-4 (LOW): Timeout Cleanup Requires Manual Polling
cleanup_timeouts() exists in all FFI workers but documentation doesn’t specify recommended polling frequency. Workers must call this periodically.
Finding LW-5 (LOW): Ruby Tokio Thread Pool Hardcoded to 8
Location: workers/ruby/ext/tasker_core/src/bootstrap.rs:74-79
Hardcoded .worker_threads(8) for M2/M4 Pro compatibility. Python/TypeScript use defaults. Consider making configurable.
Cross-Cutting Concerns
Dependency Audit (cargo audit)
Finding X-1 (HIGH): bytes v1.11.0 Integer Overflow (RUSTSEC-2026-0007)
Published 2026-02-03. Integer overflow in BytesMut::reserve. Fix: upgrade to bytes >= 1.11.1. This is a transitive dependency used by tokio, hyper, axum, tonic, reqwest, sqlx — deeply embedded.
Recommendation: Add to workspace Cargo.toml: bytes = "1.11.1"
Finding X-2 (LOW): rustls-pemfile Unmaintained (RUSTSEC-2025-0134)
Transitive from lapin (RabbitMQ) → amq-protocol → tcp-stream → rustls-pemfile. No action available from this project; depends on upstream lapin update.
Clippy Compliance
Zero warnings across entire workspace with --all-targets --all-features. Excellent.
Systemic: #[allow] vs #[expect] (Lint Policy)
27 instances of #[allow] found across all crates. Distribution:
- tasker-shared: ~5 instances
- tasker-pgmq: 3 instances
- tasker-orchestration: 21 instances (highest)
- tasker-worker: 5 instances
- tasker-client/cli: 0 (compliant)
- Language workers: ~3 instances
Recommendation: Batch fix in a single PR — mechanical replacement of #[allow] → #[expect] with reason strings.
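The mechanical shape of that replacement, shown on a hypothetical helper (the function names are illustrative):

```rust
// Before: silently suppresses the lint even if it later stops firing.
#[allow(dead_code)]
fn legacy_helper() -> u32 {
    1
}

// After: `#[expect]` (stable since Rust 1.81) emits an
// `unfulfilled_lint_expectations` warning when the suppressed lint no
// longer fires, so stale suppressions surface automatically.
#[expect(dead_code, reason = "retained for the v2 migration path")]
fn migration_helper() -> u32 {
    2
}
```

The `reason` string satisfies the lint policy's requirement that every suppression be justified in place.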
Systemic: Database Query Timeouts
Found across tasker-shared, tasker-orchestration, tasker-worker, and tasker-pgmq. Individual sqlx::query! calls lack explicit tokio::time::timeout() wrappers. Pool-level acquire timeouts (30s) provide partial mitigation.
Recommendation: Consider PostgreSQL statement_timeout at the connection level as a blanket fix, or add tokio::time::timeout() around critical query paths.
Systemic: unwrap_or_default() on Required Fields (Tenet #11)
Found across tasker-shared (20+ instances), tasker-orchestration (3 instances), tasker-pgmq (1 instance). Silent failures on required fields violate the Fail Loudly principle.
Recommendation: Audit all instances and replace with explicit error handling for required fields.
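A minimal sketch of the distinction, using a hypothetical payload type (field names are illustrative, not the actual structs):

```rust
/// Hypothetical payload: `task_uuid` is required, `note` is optional.
struct StepPayload {
    task_uuid: Option<String>,
    note: Option<String>,
}

/// Silent failure: a missing required field becomes "" and propagates
/// garbage downstream — the pattern flagged by this finding.
fn task_uuid_silent(p: &StepPayload) -> String {
    p.task_uuid.clone().unwrap_or_default()
}

/// Fail loudly: a missing required field becomes an explicit error at
/// the boundary where it can still be diagnosed.
fn task_uuid_checked(p: &StepPayload) -> Result<String, String> {
    p.task_uuid
        .clone()
        .ok_or_else(|| "missing required field: task_uuid".to_string())
}

/// `unwrap_or_default()` remains appropriate for genuinely optional fields.
fn note_or_empty(p: &StepPayload) -> String {
    p.note.clone().unwrap_or_default()
}
```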
Appendix: Methodology
Each crate was evaluated across these dimensions:
- Security — Input validation, SQL safety, auth checks, unsafe blocks, crypto, secrets
- Error Handling — Fail Loudly (Tenet #11), context preservation, structured errors
- Resilience — Bounded channels, timeouts, circuit breakers, backpressure
- Architecture — API surface, documentation consistency, test coverage, dead code
- FFI-Specific (language workers) — Error classification, deadlock risk, starvation detection, memory safety
Severity definitions follow the audit specification.
Appendix: Remediation Tracking
Remediation work items for all High-severity findings:
| Work Item | Findings | Priority | Summary |
|---|---|---|---|
| Dependency upgrade | X-1 | Urgent | Upgrade bytes to fix RUSTSEC-2026-0007 CVE |
| Queue name validation | S-1, P-1, P-2 | High | Add queue name and NOTIFY channel validation |
| Lint compliance cleanup | S-3, O-3, W-4, LW-1, LW-2, P-8 | Medium | Replace #[allow] with #[expect] workspace-wide |
| Shutdown and recovery hardening | O-1, O-2 | High | Add shutdown timeout and actor panic recovery |
| FFI checkpoint timeout | W-1 | High | Add timeout to checkpoint_yield block_on |
| Error message sanitization | S-2 | High | Sanitize database error messages in API responses |
Architecture Decision Records (ADRs)
This directory contains Architecture Decision Records that document significant design decisions in Tasker Core. Each ADR captures the context, decision, and consequences of a specific architectural choice.
ADR Index
Active Decisions
| ADR | Title | Date | Status |
|---|---|---|---|
| ADR-001 | Actor-Based Orchestration Architecture | 2025-10 | Accepted |
| ADR-002 | Bounded MPSC Channels | 2025-10 | Accepted |
| ADR-003 | Processor UUID Ownership Removal | 2025-10 | Accepted |
| ADR-004 | Backoff Strategy Consolidation | 2025-10 | Accepted |
| ADR-005 | Worker Dual-Channel Event System | 2025-12 | Accepted |
| ADR-006 | Worker Actor-Service Decomposition | 2025-12 | Accepted |
| ADR-007 | FFI Over WASM for Language Workers | 2025-12 | Accepted |
| ADR-008 | Handler Composition Pattern | 2025-12 | Accepted |
Root Cause Analyses
| Document | Title | Date |
|---|---|---|
| RCA | Parallel Execution Timing Bugs | 2025-12 |
ADR Template
When creating a new ADR, use this template:
# ADR: [Title]
**Status**: [Proposed | Accepted | Deprecated | Superseded]
**Date**: YYYY-MM-DD
**Ticket**: TAS-XXX
## Context
What is the issue that we're seeing that is motivating this decision or change?
## Decision
What is the change that we're proposing and/or doing?
## Consequences
What becomes easier or more difficult to do because of this change?
### Positive
- Benefit 1
- Benefit 2
### Negative
- Trade-off 1
- Trade-off 2
### Neutral
- Side effect that is neither positive nor negative
## Alternatives Considered
What other options were considered and why were they rejected?
### Alternative 1: [Name]
Description and why it was rejected.
### Alternative 2: [Name]
Description and why it was rejected.
## References
- Related documents
- External references
When to Create an ADR
Create an ADR when:
- Making a significant architectural change that affects multiple components
- Choosing between alternatives with meaningful trade-offs
- Establishing a pattern that should be followed consistently
- Removing or deprecating an existing pattern or approach
- Learning from an incident (RCA format)
Don’t create an ADR for:
- Minor implementation details
- Bug fixes without architectural impact
- Documentation updates
- Routine refactoring
Related Documentation
- Tasker Core Tenets - Core design principles
ADR: Actor-Based Orchestration Architecture
Status: Accepted Date: 2025-10 Ticket: TAS-46
Context
The orchestration system used a command pattern with direct service delegation, but lacked formal boundaries between commands and lifecycle components. This created:
- Testing Complexity: Lifecycle components tightly coupled to command processor
- Unclear Boundaries: No formal interface between commands and lifecycle operations
- Limited Supervision: No standardized lifecycle hooks for resource management
- Inconsistent Patterns: Each component had different initialization patterns
- Coupling: Command processor had direct dependencies on multiple service instances
The command processor was 1,164 lines, mixing routing, hydration, validation, and delegation.
Decision
Adopt a lightweight actor pattern with message-based interfaces:
Core Abstractions:
- `OrchestrationActor` trait with lifecycle hooks (`started()`, `stopped()`)
- `Message` trait for type-safe messages with associated `Response` type
- `Handler<M>` trait for async message processing
- `ActorRegistry` for centralized actor management
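A simplified, synchronous sketch of these abstractions (the real traits are async; the signatures below are assumptions made for illustration):

```rust
/// Type-safe message with an associated response type.
trait Message {
    type Response;
}

/// Handler for a specific message type.
trait Handler<M: Message> {
    fn handle(&self, msg: M) -> M::Response;
}

/// A message and its response, as one actor might define them.
struct FinalizeTask {
    task_id: u64,
}

impl Message for FinalizeTask {
    type Response = Result<u64, String>;
}

/// Actor wrapping existing service logic behind a message interface.
struct TaskFinalizerActor;

impl Handler<FinalizeTask> for TaskFinalizerActor {
    fn handle(&self, msg: FinalizeTask) -> Result<u64, String> {
        // Thin wrapper: the existing finalization service would be
        // delegated to here.
        Ok(msg.task_id)
    }
}
```

Because each actor is addressed only through typed messages, tests can exercise it in isolation without constructing the command processor.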
Four Orchestration Actors:
- TaskRequestActor: Task initialization and request processing
- ResultProcessorActor: Step result processing
- StepEnqueuerActor: Step enqueueing coordination
- TaskFinalizerActor: Task finalization with atomic claiming
Implementation Approach:
- Greenfield migration (no dual support)
- Actors wrap existing services, not replace them
- Arc-wrapped actors for efficient cloning across threads
- No full actor framework (keeping it lightweight)
Consequences
Positive
- 92% reduction in command processor complexity (1,575 LOC → 123 LOC main file)
- Clear boundaries: Each actor handles specific message types
- Testability: Message-based interfaces enable isolated testing
- Consistent patterns: Established migration pattern for all actors
- Lifecycle management: Standardized `started()`/`stopped()` hooks
- Thread safety: Arc-wrapped actors with `Send` + `Sync` guarantees
Negative
- Additional abstraction: One more layer between commands and services
- Learning curve: New pattern to understand
- Message overhead: ~100-500ns per actor call (acceptable for our use case)
- Not a full framework: Lacks supervision trees, mailboxes, etc.
Neutral
- Services remain unchanged; actors are thin wrappers
- Performance impact minimal (<1μs per operation)
Alternatives Considered
Alternative 1: Full Actor Framework (Actix)
Would provide supervision, mailboxes, and advanced patterns.
Rejected: Too heavyweight for our needs. We need lifecycle hooks and message-based testing, not a full distributed actor system.
Alternative 2: Keep Direct Service Delegation
Continue with command processor calling services directly.
Rejected: Doesn’t address testing complexity, unclear boundaries, or lifecycle management needs.
Alternative 3: Trait-Based Service Abstraction
Define Service trait and implement on each lifecycle component.
Partially adopted: Combined with actor pattern. Services implement business logic; actors provide message-based coordination.
References
- See the actor pattern implementation in `tasker-orchestration/`
- Actors Architecture - Actor pattern documentation
- Events and Commands - Integration context
ADR: Bounded MPSC Channel Migration
Status: Implemented Date: 2025-10-14 Decision Makers: Engineering Team Ticket: TAS-51
Context and Problem Statement
Prior to this change, the tasker-core system had inconsistent and risky MPSC channel usage:
- Unbounded Channels (3 critical sites): Risk of unbounded memory growth under load
  - PGMQ notification listener: Could exhaust memory during notification bursts
  - Event publisher: Vulnerable to event storms
  - Ruby FFI handler: No backpressure across FFI boundary
- Configuration Disconnect (6 sites): TOML configuration existed but wasn't used
  - Hard-coded values (100, 1000) with no rationale
  - Test/dev/prod environments used identical capacities
  - No ability to tune without code changes
- No Backpressure Strategy: Missing overflow handling policies
  - No monitoring of channel saturation
  - No documented behavior when channels fill
  - No metrics for operational visibility
Production Impact
- Memory Risk: OOM possible under high database notification load (10k+ tasks enqueued)
- Operational Pain: Cannot tune channel sizes without code deployment
- Environment Mismatch: Test environment uses production-scale buffers, masking issues
- Technical Debt: Wasted configuration infrastructure
Decision
Migrate to 100% bounded, configuration-driven MPSC channels with explicit backpressure handling.
Key Principles
- All Channels Bounded: Zero `unbounded_channel()` calls in production code
- Configuration-Driven: All capacities from TOML with environment overrides
- Separation of Concerns: Infrastructure (sizing) separate from business logic (retry behavior)
- Explicit Backpressure: Document and implement overflow policies
- Full Observability: Metrics for channel saturation and overflows
Solution Architecture
Configuration Structure
Created unified MPSC channel configuration in config/tasker/base/mpsc_channels.toml:
```toml
[mpsc_channels]

# Orchestration subsystem
[mpsc_channels.orchestration.command_processor]
command_buffer_size = 1000

[mpsc_channels.orchestration.event_listeners]
pgmq_event_buffer_size = 10000  # Large for notification bursts

# Task readiness subsystem
[mpsc_channels.task_readiness.event_channel]
buffer_size = 1000
send_timeout_ms = 1000

# Worker subsystem
[mpsc_channels.worker.command_processor]
command_buffer_size = 1000

[mpsc_channels.worker.in_process_events]
broadcast_buffer_size = 1000  # Rust → Ruby FFI

# Shared/cross-cutting
[mpsc_channels.shared.event_publisher]
event_queue_buffer_size = 5000

[mpsc_channels.shared.ffi]
ruby_event_buffer_size = 1000

# Overflow policy
[mpsc_channels.overflow_policy]
log_warning_threshold = 0.8  # Warn at 80% full
drop_policy = "block"
```
Environment-Specific Overrides
Production (config/tasker/environments/production/mpsc_channels.toml):
- Orchestration command: 5000 (5x base)
- PGMQ listeners: 50000 (5x base) - handles bulk task creation bursts
- Event publisher: 10000 (2x base)
Development (config/tasker/environments/development/mpsc_channels.toml):
- Task readiness: 500 (0.5x base)
- Worker FFI: 500 (0.5x base)
Test (config/tasker/environments/test/mpsc_channels.toml):
- Orchestration command: 100 (0.1x base) - exposes backpressure issues
- Task readiness: 100 (0.1x base)
Critical Implementation Detail
Environment override files MUST use full [mpsc_channels.*] prefix:
```toml
# ✅ CORRECT
[mpsc_channels.task_readiness.event_channel]
buffer_size = 100

# ❌ WRONG - creates top-level key that overrides correct config
[task_readiness.event_channel]
buffer_size = 100
```
This was discovered during implementation when environment files created conflicting top-level configuration keys.
Configuration Migration
Migrated MPSC sizing fields from event_systems.toml to mpsc_channels.toml:
Moved to mpsc_channels.toml:
- `event_systems.task_readiness.metadata.event_channel.buffer_size`
- `event_systems.task_readiness.metadata.event_channel.send_timeout_ms`
- `event_systems.worker.metadata.in_process_events.broadcast_buffer_size`
Kept in event_systems.toml (event processing logic):
- `event_systems.task_readiness.metadata.event_channel.max_retries`
- `event_systems.task_readiness.metadata.event_channel.backoff`
Rationale: Separation of concerns - infrastructure sizing vs business logic behavior.
Rust Type System
Created comprehensive type system in tasker-shared/src/config/mpsc_channels.rs:
```rust
pub struct MpscChannelsConfig {
    pub orchestration: OrchestrationChannelsConfig,
    pub task_readiness: TaskReadinessChannelsConfig,
    pub worker: WorkerChannelsConfig,
    pub shared: SharedChannelsConfig,
    pub overflow_policy: OverflowPolicyConfig,
}
```
All channel creation sites updated to use configuration:
```rust
// Before
let (tx, rx) = mpsc::unbounded_channel();

// After
let buffer_size = config.mpsc_channels
    .orchestration.event_listeners.pgmq_event_buffer_size;
let (tx, rx) = mpsc::channel(buffer_size);
```
Observability
ChannelMonitor Integration:
- Tracks channel usage in real-time
- Logs warnings at 80% saturation
- Exposes metrics via OpenTelemetry
Metrics Available:
- `mpsc_channel_usage_percent` - Current channel fill percentage
- `mpsc_channel_capacity` - Configured capacity
- Component and channel name labels for filtering
Consequences
Positive
- Memory Safety: Bounded channels prevent OOM from unbounded growth
- Operational Flexibility: Tune channel sizes via configuration without code changes
- Environment Appropriateness: Test uses small buffers (exposes issues), production uses large buffers (handles load)
- Observability: Channel saturation visible in metrics and logs
- Documentation: Clear guidelines for future channel additions
Negative
- Backpressure Complexity: Must handle full channel conditions
- Configuration Overhead: More configuration files to maintain
- Tuning Required: May need adjustment based on production load patterns
Neutral
- No Performance Impact: Bounded channels with appropriate sizes perform identically to unbounded
- Backward Compatible: Existing deployments automatically use new defaults
Implementation Notes
Backpressure Strategies by Component
PGMQ Notification Listener:
- Strategy: Block sender (apply backpressure)
- Rationale: Cannot drop database notifications
- Buffer: Large (10000 base, 50000 production) to handle bursts
Event Publisher:
- Strategy: Drop events with metrics when full
- Rationale: Internal events are non-critical
- Buffer: Medium (5000 base, 10000 production)
Ruby FFI Handler:
- Strategy: Return error to Rust (signal backpressure)
- Rationale: Ruby must handle gracefully
- Buffer: Standard (1000) with monitoring
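The "drop events with metrics" policy can be sketched with a bounded standard-library channel; the real system uses tokio mpsc and OpenTelemetry counters, so this is only the shape of the policy, with plain counts standing in for metrics:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Publish events into a bounded channel with a drop-on-full policy.
/// Returns (delivered, dropped) counts.
fn publish_events(events: &[u32], capacity: usize) -> (usize, usize) {
    // Bounded channel; the receiver is deliberately not drained here,
    // so the buffer fills to `capacity`.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let (mut delivered, mut dropped) = (0, 0);
    for &e in events {
        match tx.try_send(e) {
            Ok(()) => delivered += 1,
            // Channel full: drop the event and count it rather than
            // blocking the publisher (the Event Publisher strategy).
            Err(TrySendError::Full(_)) => dropped += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    (delivered, dropped)
}
```

The PGMQ listener strategy is the opposite choice: a blocking `send` that applies backpressure to the producer instead of dropping.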
Sizing Guidelines
Command Channels (orchestration, worker):
- Base: 1000
- Test: 100 (expose issues)
- Production: 2000-5000 (concurrent load)
Event Channels:
- Base: 1000
- Production: Higher if event-driven architecture
Notification Channels:
- Base: 10000 (burst handling)
- Production: 50000 (bulk operations)
Validation
Testing Performed
- Unit Tests: Configuration loading and validation ✅
- Integration Tests: All tests pass with bounded channels ✅
- Local Verification: Service starts successfully in test environment ✅
- Configuration Verification: All environments load correctly ✅
Success Criteria Met
- ✅ Zero unbounded channels in production code
- ✅ 100% configurable channel capacities
- ✅ Environment-specific overrides working
- ✅ Backpressure handling implemented
- ✅ Observability through ChannelMonitor
- ✅ All tests passing
Future Considerations
- Dynamic Sizing: Consider runtime buffer adjustment based on load (not current scope)
- Priority Queues: Allow critical events to bypass overflow drops (evaluate based on metrics)
- Notification Coalescing: Reduce PGMQ notification volume during bursts (future optimization)
- Advanced Metrics: Percentile latencies for channel send operations
References
- Configuration Files: `config/tasker/base/mpsc_channels.toml`
- Rust Module: `tasker-shared/src/config/mpsc_channels.rs`
- Related ADRs: Command Pattern, Actor Pattern
Lessons Learned
- Configuration Structure Matters: Environment override files must use proper prefixes
- Separation of Concerns: Keep infrastructure config (sizing) separate from business logic (retry behavior)
- Test Appropriately: Small buffers in test environment expose backpressure issues early
- Migration Strategy: Moving config fields requires coordinated struct updates across all files
- Type Safety: Rust’s type system caught many configuration mismatches during development
Decision: Approved and Implemented Review Date: 2025-10-14 Next Review: 2026-Q1 (evaluate sizing based on production metrics)
ADR: Processor UUID Ownership Removal
Status: Accepted Date: 2025-10 Ticket: TAS-54
Context
When orchestrators crash with tasks in active processing states (Initializing, EnqueuingSteps, EvaluatingResults), the processor UUID ownership enforcement prevented new orchestrators from taking over. Tasks became permanently stuck until manual intervention.
Root Cause: Three states required ownership enforcement (the original state machine pattern), but when orchestrator A crashed and orchestrator B tried to recover, the ownership check failed: B != A.
Production Impact:
- Stuck tasks requiring manual intervention
- Orchestrator restarts caused task processing to halt
- 15-second gap between crash and retry, but tasks permanently blocked
Decision
Move to audit-only processor UUID tracking:
- Keep processor UUID in all transitions (audit trail for debugging)
- Remove ownership enforcement from state transitions
- Rely on existing state machine guards for idempotency
- Add configuration flag for gradual rollout
Key Insight: The original problem (race conditions) had been solved by multiple other mechanisms:
- Atomic finalization claiming via SQL functions
- Command pattern with stateless async processors
- Actor pattern with 4 production-ready actors
Idempotency Without Ownership
| Actor | Idempotency Mechanism | Race Condition Protection |
|---|---|---|
| TaskRequestActor | identity_hash unique constraint | Transaction atomicity |
| ResultProcessorActor | Current state guards | State machine atomicity |
| StepEnqueuerActor | SQL function atomicity | PGMQ transactional operations |
| TaskFinalizerActor | Atomic claiming | SQL compare-and-swap |
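The atomic-claiming row can be illustrated in miniature. The real mechanism is an SQL compare-and-swap; modeled here with an in-process atomic to show why exactly one orchestrator wins the claim:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of atomic finalization claiming; 0 means unclaimed.
struct FinalizationClaim {
    claimed_by: AtomicU64,
}

impl FinalizationClaim {
    fn new() -> Self {
        Self { claimed_by: AtomicU64::new(0) }
    }

    /// Compare-and-swap: returns true only for the single caller that
    /// wins the claim, so finalization runs exactly once even with
    /// concurrent orchestrators and no ownership enforcement.
    fn try_claim(&self, processor_id: u64) -> bool {
        self.claimed_by
            .compare_exchange(0, processor_id, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```

Because the claim itself is atomic, a crashed orchestrator's UUID never has to be "released" — any surviving processor can attempt the claim, which is what makes the audit-only UUID tracking safe.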
Consequences
Positive
- Task recovery: Tasks automatically recover after orchestrator crashes
- Zero manual interventions: Stuck task count approaches zero
- Audit trail preserved: Full debugging capability retained
- Instant rollback: Configuration flag allows quick revert
Negative
- New debugging patterns: Processor ownership changes visible in audit trail
- Team training: Operators need to understand audit-only interpretation
Neutral
- No database schema changes required
- No performance impact (one fewer query per transition)
Alternatives Considered
Alternative 1: Timeout-Based Ownership Transfer
Add timeout after which ownership can be claimed by another processor.
Rejected: Adds complexity; existing idempotency guards make ownership redundant entirely.
Alternative 2: Keep Ownership Enforcement
Continue with existing ownership enforcement behavior, add manual recovery tools.
Rejected: Doesn’t address root cause; manual intervention doesn’t scale.
References
- Defense in Depth - Multi-layer protection philosophy
- Idempotency and Atomicity - Defense layer documentation
ADR: Backoff Logic Consolidation
Status: Implemented Date: 2025-10-29 Deciders: Engineering Team Ticket: TAS-57
Context
The tasker-core distributed workflow orchestration system had multiple, potentially conflicting implementations of exponential backoff logic for step retry coordination. This created several critical issues:
Problems Identified
- Configuration Conflicts: Three different maximum backoff values existed across the system:
  - SQL Migration (hardcoded): 30 seconds
  - Rust Code Default: 60 seconds
  - TOML Configuration: 300 seconds
- Race Conditions: No atomic guarantees on backoff updates when multiple orchestrators processed the same step failure simultaneously, leading to potential lost updates and inconsistent state.
- Implementation Divergence: Dual calculation paths (Rust BackoffCalculator vs SQL fallback) could produce different results due to:
  - Different time sources (`last_attempted_at` vs `failure_time`)
  - Hardcoded vs configurable parameters
  - Lack of timestamp synchronization
- Hardcoded SQL Values: The SQL migration contained non-configurable exponential backoff logic:

```sql
-- Old hardcoded implementation
LEAST(
    power(2, COALESCE(attempts, 1)) * interval '1 second',
    interval '30 seconds'
)
```
Decision
We consolidated the backoff logic with the following architectural decisions:
1. Single Source of Truth: TOML Configuration
Decision: All backoff parameters originate from TOML configuration files.
Rationale:
- Centralized configuration management
- Environment-specific overrides (test/development/production)
- Runtime validation and type safety
- Clear documentation of system behavior
Implementation:
```toml
# config/tasker/base/orchestration.toml
[backoff]
default_backoff_seconds = [1, 2, 4, 8, 16, 32]
max_backoff_seconds = 60  # Standard across all environments
backoff_multiplier = 2.0
jitter_enabled = true
jitter_max_percentage = 0.1
```
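The configured schedule can be sketched as a pure function (jitter omitted for determinism): with multiplier 2.0 and a 60-second cap, attempts 0-5 reproduce the configured 1, 2, 4, 8, 16, 32 sequence, and every later attempt saturates at 60.

```rust
/// Exponential delay capped at `max_backoff_seconds`.
/// Sketch of the configured policy; jitter is left out here.
fn backoff_delay_seconds(attempt: u32, multiplier: f64, max_backoff_seconds: u64) -> u64 {
    let raw = multiplier.powi(attempt as i32); // multiplier^attempt
    (raw as u64).min(max_backoff_seconds)
}
```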
2. Standard Maximum Backoff: 60 Seconds
Decision: Standardize on 60 seconds as the maximum backoff delay.
Rationale:
- Balance: 60 seconds balances retry speed with system load reduction
- Not Too Short: 30 seconds (old SQL max) insufficient for rate limiting scenarios
- Not Too Long: 300 seconds (old TOML config) creates excessive delays in failure scenarios
- Alignment: Matches Rust code defaults and production requirements
Impact:
- Tasks recover faster from transient failures
- Rate-limited APIs get adequate cooldown
- User experience improved with reasonable retry times
3. Parameterized SQL Functions
Decision: SQL functions accept configuration parameters with sensible defaults.
Implementation:
```sql
CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
    backoff_request_seconds INTEGER,
    last_attempted_at TIMESTAMP,
    failure_time TIMESTAMP,
    attempts INTEGER,
    p_max_backoff_seconds INTEGER DEFAULT 60,
    p_backoff_multiplier NUMERIC DEFAULT 2.0
) RETURNS TIMESTAMP
```
Rationale:
- Eliminates hardcoded values in SQL
- Allows runtime configuration without schema changes
- Maintains SQL fallback safety net
- Defaults prevent breaking existing code
4. Atomic Backoff Updates with Row-Level Locking
Decision: Use PostgreSQL SELECT FOR UPDATE for atomic backoff updates.
Implementation:
```rust
// Rust BackoffCalculator (sketch)
async fn update_backoff_atomic(
    &self,
    step_uuid: &Uuid,
    delay_seconds: u32,
) -> Result<(), sqlx::Error> {
    let mut tx = self.pool.begin().await?;
    // Acquire row-level lock
    sqlx::query!("SELECT ... FROM tasker_workflow_steps WHERE ... FOR UPDATE")
        .fetch_one(&mut *tx)
        .await?;
    // Update with lock held
    sqlx::query!("UPDATE tasker_workflow_steps SET ...")
        .execute(&mut *tx)
        .await?;
    tx.commit().await?;
    Ok(())
}
```
Rationale:
- Correctness: Prevents lost updates from concurrent orchestrators
- Simplicity: PostgreSQL’s row-level locking is well-understood and reliable
- Performance: Minimal overhead - locks only held during UPDATE operation
- Idempotency: Multiple retries produce consistent results
Alternative Considered: Optimistic concurrency with version field
- Rejected: More complex implementation, retry logic in application layer
- Benefit of Chosen Approach: Database guarantees atomicity
5. Timing Consistency: Update last_attempted_at with backoff_request_seconds
Decision: Always update both backoff_request_seconds and last_attempted_at atomically.
Rationale:
- SQL fallback calculation: `last_attempted_at + backoff_request_seconds`
- Prevents timing window where calculation uses stale timestamp
- Single transaction ensures consistency
Before:
```rust
// Old: only updated backoff_request_seconds
sqlx::query!("UPDATE tasker_workflow_steps SET backoff_request_seconds = $1 ...")
```
After:
```rust
// New: updates both atomically
sqlx::query!(
    "UPDATE tasker_workflow_steps
     SET backoff_request_seconds = $1,
         last_attempted_at = NOW()
     WHERE ..."
)
```
6. Dual-Path Strategy: Rust Primary, SQL Fallback
Decision: Maintain both Rust calculation and SQL fallback, but ensure they use same configuration.
Rationale:
- Rust Primary: Fast, configurable, with jitter support
- SQL Fallback: Safety net if `backoff_request_seconds` is NULL
- Consistency: Both paths now use same max delay and multiplier
Path Selection Logic:
```sql
CASE
    -- Primary: Rust-calculated backoff
    WHEN backoff_request_seconds IS NOT NULL AND last_attempted_at IS NOT NULL THEN
        last_attempted_at + (backoff_request_seconds * interval '1 second')
    -- Fallback: SQL exponential with configurable params
    WHEN failure_time IS NOT NULL THEN
        failure_time + LEAST(
            power(p_backoff_multiplier, attempts) * interval '1 second',
            p_max_backoff_seconds * interval '1 second'
        )
    ELSE NULL
END
```
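A Rust mirror of this path selection, using epoch seconds in place of TIMESTAMPs (illustrative only; the real function operates on database timestamps):

```rust
/// Mirrors the SQL CASE above: prefer the Rust-calculated backoff stored
/// on the step row, fall back to a capped exponential from failure_time.
fn next_retry_time(
    backoff_request_seconds: Option<u64>,
    last_attempted_at: Option<u64>,
    failure_time: Option<u64>,
    attempts: u32,
    max_backoff_seconds: u64,
    multiplier: f64,
) -> Option<u64> {
    match (backoff_request_seconds, last_attempted_at) {
        // Primary: Rust-calculated backoff
        (Some(backoff), Some(last)) => Some(last + backoff),
        // Fallback: SQL-style exponential with a configurable cap
        _ => failure_time.map(|t| {
            let exp = multiplier.powi(attempts as i32) as u64;
            t + exp.min(max_backoff_seconds)
        }),
    }
}
```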
Consequences
Positive
- Configuration Clarity: Single max_backoff_seconds value (60s) across entire system
- Race Condition Prevention: Atomic updates guarantee correctness in distributed deployments
- Flexibility: Parameterized SQL allows future config changes without migrations
- Timing Consistency: Synchronized timestamp updates eliminate calculation errors
- Maintainability: Clear separation of concerns - Rust for calculation, SQL for fallback
- Test Coverage: All 518 unit tests pass, validating correctness
Negative
- Performance Overhead: Row-level locking adds ~1-2ms per backoff update
  - Mitigation: Negligible compared to step execution time (typically seconds)
  - Acceptable Trade-off: Correctness more important than microseconds
- Lock Contention Risk: High-frequency failures on same step could cause lock queuing
  - Mitigation: Exponential backoff naturally spreads out retries
  - Monitoring: Added metrics for lock contention detection
  - Real-World Impact: Minimal - failures are infrequent by design
- Complexity: Transaction management adds code complexity
  - Mitigation: Encapsulated in `update_backoff_atomic()` method
  - Benefit: Hidden behind clean interface, testable in isolation
Neutral
- Breaking Change: SQL function signature changed (added parameters)
  - Not an Issue: Greenfield alpha project, no production dependencies
  - Future-Proof: Default parameters maintain backward compatibility
- Configuration Migration: Changed max from 300s → 60s
  - Impact: Tasks retry faster, reducing user-perceived latency
  - Validation: All tests pass with new values
Validation
Testing
- Unit Tests: All 518 unit tests pass
  - BackoffCalculator calculation correctness
  - Jitter bounds validation
  - Max cap enforcement
- Database Tests: SQL function behavior validated
  - Parameterization with various max values
  - Exponential calculation matches Rust
  - Boundary conditions (attempts 0, 10, 20)
- Integration Tests: End-to-end flow verified
  - Worker failure → Backoff applied → Readiness respects delay
  - SQL fallback when `backoff_request_seconds` NULL
  - Rust and SQL calculations produce consistent results
Verification Steps Completed
- ✅ Configuration alignment (TOML, Rust defaults)
- ✅ SQL function rewrite with parameters
- ✅ BackoffCalculator atomic updates implemented
- ✅ Database reset successful with new migration
- ✅ All unit tests passing
- ✅ Architecture documentation updated
Implementation Notes
Files Modified
- Configuration:
  - `config/tasker/base/orchestration.toml`: `max_backoff_seconds = 60`
  - `tasker-shared/src/config/tasker.rs`: `jitter_max_percentage = 0.1`
- Database Migration:
  - `migrations/20250927000000_add_waiting_for_retry_state.sql`: Parameterized functions
- Rust Implementation:
  - `tasker-orchestration/src/orchestration/backoff_calculator.rs`: Atomic updates
- Documentation:
  - `docs/task-and-step-readiness-and-execution.md`: Backoff section added
  - This ADR
Migration Path
Since this is greenfield alpha:
- Drop and recreate test database
- Run migrations with updated SQL functions
- Rebuild sqlx cache
- Run full test suite
Future Production Path (when needed):
- Deploy parameterized SQL functions alongside old functions
- Update Rust code to use new atomic methods
- Enable in staging, monitor metrics
- Gradual production rollout with feature flag
- Remove old functions after validation period
Future Enhancements
Potential Improvements (Post-Alpha)
- Configuration Table: Store backoff config in database for runtime updates
- Metrics: OpenTelemetry metrics for backoff application and lock contention
- Adaptive Backoff: Adjust multiplier based on system load or error patterns
- Per-Namespace Policies: Different backoff configs per workflow namespace
- Backoff Profiles: Named profiles (aggressive, moderate, conservative)
Monitoring Recommendations
Key Metrics to Track:
- `backoff_calculation_duration_seconds`: Time to calculate and apply backoff
- `backoff_lock_contention_total`: Lock acquisition failures
- `backoff_sql_fallback_total`: Frequency of SQL fallback usage
- `backoff_delay_applied_seconds`: Histogram of actual delays
Alert Conditions:
- SQL fallback usage > 5% (indicates Rust path failing)
- Lock contention > threshold (indicates hot spots)
- Backoff delays > max_backoff_seconds (configuration issue)
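The first alert condition can be expressed as a pure predicate over the two counters named above; the function name in this sketch is hypothetical:

```rust
// Sketch of the "SQL fallback > 5%" alert: fire when the SQL fallback path
// handles more than 5% of backoff calculations, indicating the Rust path
// is failing. Counter names follow the metrics listed above.
fn sql_fallback_alert(fallback_total: u64, calculations_total: u64) -> bool {
    calculations_total > 0
        && (fallback_total as f64) / (calculations_total as f64) > 0.05
}

fn main() {
    assert!(!sql_fallback_alert(4, 100)); // 4% of calculations: healthy
    assert!(sql_fallback_alert(6, 100));  // 6%: Rust path likely failing
    assert!(!sql_fallback_alert(0, 0));   // no data, no alert
}
```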
References
- Task and Step Readiness Documentation
- States and Lifecycles Documentation
- BackoffCalculator Implementation
- SQL Migration 20250927000000
Related ADRs
- Ownership Removal - Concurrent access patterns
Decision Status: ✅ Implemented and Validated (2025-10-29)
ADR: Worker Dual-Channel Event System
Status: Accepted Date: 2025-12 Ticket: TAS-67
Context
The original Rust worker used a blocking .call() pattern in the event handler:
let result = handler.call(&event.payload.task_sequence_step).await; // BLOCKS
This created effectively sequential execution even for independent steps, preventing true concurrency and causing domain event race conditions where downstream systems saw events before orchestration processed results.
Decision
Adopt a dual-channel command pattern where handler invocation is fire-and-forget, and completions flow back through a separate channel.
Architecture:
[1] WorkerEventSystem receives StepExecutionEvent
↓
[2] ActorCommandProcessor routes to StepExecutorActor
↓
[3] StepExecutorActor claims step, publishes to HANDLER DISPATCH CHANNEL
↓ (fire-and-forget, non-blocking)
[4] HandlerDispatchService receives from channel
↓
[5] Resolves handler from registry, invokes handler.call()
↓
[6] Handler completes, publishes to COMPLETION CHANNEL
↓
[7] CompletionProcessorService receives from channel
↓
[8] Routes to FFICompletionService → Orchestration queue
Key Design Decisions:
- Bounded Parallel Execution: Semaphore-bounded concurrency (configurable via TOML)
- Ordered Domain Events: Events fire AFTER result is committed to completion channel
- Comprehensive Error Handling: Panics, timeouts, handler errors all generate proper failure results
- Fire-and-Forget FFI Callbacks: runtime_handle.spawn() instead of block_on() prevents deadlocks
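The dispatch/completion flow above can be sketched with std mpsc channels. The real implementation uses bounded tokio channels with semaphore-bounded concurrency; the types and the `run_pipeline` helper here are illustrative stand-ins:

```rust
use std::sync::mpsc;
use std::thread;

// Minimal sketch of the dual-channel pattern: dispatch is fire-and-forget,
// completions flow back on a separate channel.
struct StepJob { step_id: u64 }
struct StepCompletion { step_id: u64, success: bool }

fn run_pipeline(step_ids: Vec<u64>) -> Vec<(u64, bool)> {
    let (dispatch_tx, dispatch_rx) = mpsc::channel::<StepJob>();
    let (completion_tx, completion_rx) = mpsc::channel::<StepCompletion>();

    // HandlerDispatchService analogue: consumes dispatched jobs, invokes the
    // handler, and publishes the result on the separate completion channel.
    let dispatcher = thread::spawn(move || {
        for job in dispatch_rx {
            let success = job.step_id % 2 == 0; // stand-in for handler.call()
            completion_tx
                .send(StepCompletion { step_id: job.step_id, success })
                .expect("completion channel closed");
        }
        // completion_tx drops here, closing the completion channel.
    });

    // StepExecutorActor analogue: publishes without blocking on results.
    for step_id in step_ids {
        dispatch_tx.send(StepJob { step_id }).expect("dispatch channel closed");
    }
    drop(dispatch_tx); // signal no more work

    // CompletionProcessorService analogue: drains completions as they arrive.
    let results = completion_rx.iter().map(|c| (c.step_id, c.success)).collect();
    dispatcher.join().unwrap();
    results
}

fn main() {
    println!("{:?}", run_pipeline(vec![0, 1, 2, 3]));
}
```

Because dispatch never waits on a result, independent steps can run as soon as capacity allows, which is exactly what the blocking `.call()` pattern prevented.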
Consequences
Positive
- True parallelism: Parallel handler execution with bounded concurrency
- Eliminated race conditions: Domain events only fire after results committed
- Comprehensive error handling: All failure modes produce proper step failures
- Foundation for FFI: Reusable abstractions for Ruby/Python/TypeScript workers
- Bug discovery: Parallel execution surfaced latent SQL precedence bug
Negative
- Increased complexity: Two channels to manage instead of one
- Debugging complexity: Tracing flow across multiple channels requires structured logging
Neutral
- Channel saturation monitoring available via metrics
- Configurable buffer sizes per environment
Risk Mitigations Implemented
| Risk | Mitigation |
|---|---|
| Semaphore acquisition failure | Generate failure result instead of silent exit |
| FFI polling starvation | Metrics + starvation warnings + timeout |
| Completion channel backpressure | Release permit before send |
| FFI thread runtime context | Fire-and-forget callbacks |
Alternatives Considered
Alternative 1: Thread Pool Pattern
Use dedicated thread pool for handler execution.
Rejected: Tokio already provides excellent async runtime; adding threads increases complexity without benefit.
Alternative 2: Single Channel with Priority Queue
Priority queue for completions within single channel.
Rejected: Doesn’t address the fundamental blocking issue; still couples dispatch and completion.
Alternative 3: Keep Blocking Pattern with Larger Buffer
Increase buffer size to mask sequential execution.
Rejected: Doesn’t solve concurrency; just delays the problem.
References
- Worker Event Systems - Architecture documentation
- RCA: Parallel Execution Timing Bugs - Bug discovered during implementation
- FFI Callback Safety - FFI patterns established
ADR: Worker Actor-Service Decomposition
Status: Accepted Date: 2025-12 Ticket: TAS-69
Context
The tasker-worker crate had a monolithic command processor architecture:
- WorkerProcessor: 1,575 lines of code
- All command handling inline
- Difficult to test individual behaviors
- Inconsistent with orchestration actor architecture
Decision
Transform the worker from monolithic command processor to actor-based design, mirroring the orchestration actor pattern.
Before: Monolithic Design
WorkerCore
└── WorkerProcessor (1,575 LOC)
└── All command handling inline
After: Actor-Based Design
WorkerCore
└── ActorCommandProcessor (~350 LOC)
└── WorkerActorRegistry
├── StepExecutorActor → StepExecutorService
├── FFICompletionActor → FFICompletionService
├── TemplateCacheActor → TaskTemplateManager
├── DomainEventActor → DomainEventSystem
└── WorkerStatusActor → WorkerStatusService
Five Actors:
| Actor | Responsibility | Messages |
|---|---|---|
| StepExecutorActor | Step execution coordination | 4 |
| FFICompletionActor | FFI completion handling | 2 |
| TemplateCacheActor | Template cache management | 2 |
| DomainEventActor | Event dispatching | 1 |
| WorkerStatusActor | Status and health | 4 |
Three Services:
| Service | Lines | Purpose |
|---|---|---|
| StepExecutorService | ~400 | Step claiming, verification, FFI invocation |
| FFICompletionService | ~200 | Result delivery to orchestration |
| WorkerStatusService | ~200 | Stats tracking, health reporting |
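The actor shape described above can be reduced to a message enum, a thin actor loop, and the service it wraps. The sketch below models a simplified WorkerStatusActor; the message set and names are illustrative, and std threads stand in for the real async runtime:

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative message enum for a status actor (the real actor has four
// messages; this sketch keeps three).
enum StatusMessage {
    RecordSuccess,
    RecordFailure,
    Query(mpsc::Sender<(u64, u64)>),
}

// Service layer: plain state and logic, testable without the actor.
#[derive(Default)]
struct WorkerStatusService {
    succeeded: u64,
    failed: u64,
}

// Actor layer: owns the service, processes messages sequentially.
fn spawn_status_actor() -> mpsc::Sender<StatusMessage> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut service = WorkerStatusService::default();
        for msg in rx {
            match msg {
                StatusMessage::RecordSuccess => service.succeeded += 1,
                StatusMessage::RecordFailure => service.failed += 1,
                StatusMessage::Query(reply) => {
                    let _ = reply.send((service.succeeded, service.failed));
                }
            }
        }
    });
    tx
}

fn main() {
    let actor = spawn_status_actor();
    actor.send(StatusMessage::RecordSuccess).unwrap();
    actor.send(StatusMessage::RecordFailure).unwrap();
    actor.send(StatusMessage::RecordSuccess).unwrap();
    let (reply_tx, reply_rx) = mpsc::channel();
    actor.send(StatusMessage::Query(reply_tx)).unwrap();
    println!("{:?}", reply_rx.recv().unwrap());
}
```

The split is what makes the decomposition testable: the service can be exercised directly, while the actor is exercised purely through its messages.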
Consequences
Positive
- 92% reduction in command processor complexity (1,575 LOC → 123 LOC main file)
- Single responsibility: Each file handles one concern
- Testability: Services testable in isolation, actors via message handlers
- Consistency: Mirrors orchestration architecture
- Extensibility: New actors/services follow established pattern
Negative
- Two-phase initialization: Registry requires careful startup ordering
- Actor shutdown ordering: Must coordinate graceful shutdown
- Learning curve: New pattern to understand for contributors
Neutral
- Public API unchanged (WorkerCore::new(), send_command(), stop())
- Internal restructuring transparent to users
Gaps Identified and Fixed
| Gap | Issue | Fix |
|---|---|---|
| Domain Event Dispatch | Events not dispatched after step completion | Explicit dispatch call in actor |
| Silent Error Handling | Orchestration send errors swallowed | Explicit error propagation |
| Namespace Sharing | Registry created new manager, losing namespaces | Shared pre-initialized manager |
Alternatives Considered
Alternative 1: Service-Only Pattern
Extract services without actor layer.
Rejected: Loses message-based interfaces that enable testing and future distributed execution.
Alternative 2: Keep Monolithic with Better Organization
Refactor WorkerProcessor into methods without extraction.
Rejected: Doesn’t address testability or architectural consistency goals.
Alternative 3: Full Actor Framework (Actix)
Use production actor framework.
Rejected: Too heavyweight; we need lifecycle hooks and message-based testing, not distributed supervision.
References
- Worker Actors - Architecture documentation
- Actor Pattern - Orchestration actor precedent
ADR: FFI Over WASM for Language Workers
Status: Accepted Date: 2025-12 Ticket: TAS-100
Context
For the TypeScript worker implementation, we needed to decide between two integration approaches:
- FFI (Foreign Function Interface): Direct C ABI calls to compiled Rust library
- WASM (WebAssembly): Compile Rust to wasm32-wasi target
Ruby (Magnus) and Python (PyO3) workers already used FFI successfully.
Decision
Proceed with FFI for all language workers. Reserve WASM for future serverless handler execution.
Decision Matrix:
| Criteria | FFI | WASM |
|---|---|---|
| Pattern Consistency | Matches Ruby/Python | Requires new architecture |
| Production Readiness | Node FFI mature, Bun stabilizing | WASI networking immature |
| Implementation Speed | 2-3 weeks | 2-3 months + unknowns |
| PostgreSQL Access | Native via Rust | Needs host functions |
| Multi-threading | Full Tokio support | Single-threaded WASM |
| Async Runtime | Tokio works | Incompatible |
| Debugging | Standard tools | Limited tooling |
Score: FFI 8/10, WASM 3/10 for current requirements.
WASM Deal-Breakers:
- No mature PostgreSQL client for wasm32-wasi
- Single-threaded execution (our HandlerDispatchService relies on Tokio multi-threading)
- Tokio doesn’t compile to the wasm32-wasi target
- WASI networking still experimental (Preview 2 adoption low)
Consequences
Positive
- Pattern consistency: Single Rust codebase serves all four workers
- Proven approach: Ruby/Python FFI already validated
- Full feature access: PostgreSQL, PGMQ, Tokio, domain events all work
- Standard debugging: lldb, gdb, structured logging across boundary
- Fast implementation: Estimated 2-3 weeks for TypeScript worker
Negative
- FFI safety concerns: Incorrect types can cause segfaults
- Platform builds: Must distribute .dylib/.so/.dll per platform
- Runtime compatibility: Different FFI semantics between Bun and Node
Neutral
- Bun FFI experimental but fast-stabilizing
- Pre-built binaries via GitHub releases address distribution
Future Vision
WASM Research: Revisit when WASI 0.3+ stabilizes with networking.
Serverless WASM Handlers:
- Compile individual handlers to WASM (not orchestration)
- Deploy to serverless platforms (AWS Lambda, Cloudflare Workers)
- Cold start optimization (1ms vs 100ms)
- Extreme scalability for compute-heavy workflows
Separation of Concerns:
- Orchestration: Stays Rust (PostgreSQL, PGMQ, state machines)
- Handlers: Optionally WASM (stateless compute units)
Alternatives Considered
Alternative 1: WASM with Host Functions
Implement database operations as host functions.
Rejected: Defeats the purpose - logic split between WASM and host, loses Rust benefits.
Alternative 2: Wait for WASI 0.3
Delay TypeScript worker until WASI matures.
Rejected: Timeline uncertain (6+ months); FFI works today.
Alternative 3: Spin Framework
Use Spin’s WASM abstraction layer.
Rejected: Framework lock-in; requires Spin APIs, can’t reuse Axum/Tower patterns.
References
- Cross-Language Consistency - API philosophy
- Workers Documentation - Language-specific implementation guides
ADR: Handler Composition Pattern
Status: Accepted Date: 2025-12 Ticket: TAS-112
Context
Cross-language step handler ergonomics research revealed an architectural inconsistency:
- Batchable handlers: Already use composition via mixins (target pattern)
- API handlers: Use inheritance (subclass pattern)
- Decision handlers: Use inheritance (subclass pattern)
Current State:
✅ Batchable: class Handler(StepHandler, Batchable) # Composition
❌ API: class Handler < APIHandler # Inheritance
❌ Decision: class Handler extends DecisionHandler # Inheritance
Guiding Principle (Zen of Python): “There should be one– and preferably only one –obvious way to do it.”
Decision
Migrate all handler patterns to composition (mixins/traits), using batchable as the reference implementation.
Target Architecture:
All patterns use composition:
Ruby: include Base, include API, include Decision, include Batchable
Python: class Handler(StepHandler, API, Decision, Batchable)
TypeScript: interface composition + mixins
Rust: trait composition (impl Base + API + Decision + Batchable)
Benefits:
- Single responsibility - each mixin handles one concern
- Flexible composition - handlers can mix capabilities as needed
- Easier testing - can test each capability independently
- Matches batchable pattern (already proven successful)
Example Migration:
# Old pattern (deprecated)
class MyHandler < TaskerCore::StepHandler::API
def call(context)
api_success(data)
end
end
# New pattern
class MyHandler < TaskerCore::StepHandler::Base
include TaskerCore::StepHandler::Mixins::API
def call(context)
api_success(data)
end
end
Consequences
Positive
- Consistent architecture: One pattern for all handler capabilities
- Composable capabilities: Mix API + Decision + Batchable as needed
- Testable in isolation: Each mixin can be tested independently
- Matches proven pattern: Batchable already validates approach
- Cross-language alignment: Same mental model in all languages
Negative
- Breaking change: All existing handlers need migration
- Learning curve: Contributors must understand mixin pattern
- Migration effort: All examples and documentation need updates
Neutral
- Pre-alpha status means breaking changes are acceptable
- Migration can be phased with deprecation warnings
Related Decisions
Ruby Result Unification
Ruby uses separate Success/Error classes while Python/TypeScript use unified result with success flag. Recommend unifying Ruby to match.
Rust Handler Traits
Rust needs ergonomic traits for API, Decision, and Batchable capabilities to match other languages:
pub trait APICapable {
    fn api_success(&self, data: Value, status: u16) -> StepExecutionResult;
    fn api_failure(&self, message: &str, status: u16) -> StepExecutionResult;
}

pub trait DecisionCapable {
    fn decision_success(&self, step_names: Vec<String>) -> StepExecutionResult;
    fn skip_branches(&self, reason: &str) -> StepExecutionResult;
}
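Composing such capabilities on a single handler could then look like the following self-contained sketch. The `StepResult` type, default method bodies, and handler name are simplified stand-ins, not the real Tasker types:

```rust
// Simplified stand-in result type; the real traits return StepExecutionResult.
#[derive(Debug, PartialEq)]
enum StepResult {
    Success(String),
}

// Capability traits with default implementations, so a handler opts in with
// an empty `impl` block: the Rust analogue of a mixin.
trait ApiCapable {
    fn api_success(&self, data: &str) -> StepResult {
        StepResult::Success(format!("api:{data}"))
    }
}

trait DecisionCapable {
    fn decision_success(&self, step: &str) -> StepResult {
        StepResult::Success(format!("decide:{step}"))
    }
}

// A handler mixes in exactly the capabilities it needs.
struct PaymentHandler;
impl ApiCapable for PaymentHandler {}
impl DecisionCapable for PaymentHandler {}

fn main() {
    let handler = PaymentHandler;
    println!("{:?}", handler.api_success("ok"));
    println!("{:?}", handler.decision_success("refund"));
}
```

A handler needing only one capability simply omits the other `impl` block, mirroring `include`/mixin composition in Ruby, Python, and TypeScript.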
FFI Boundary Types
Data structures crossing FFI boundaries must have identical serialization. Create explicit type mirrors in all languages:
- DecisionPointOutcome
- BatchProcessingOutcome
- CursorConfig
Alternatives Considered
Alternative 1: Keep Inheritance Pattern
Continue with subclass pattern for API and Decision.
Rejected: Inconsistent with batchable; makes multi-capability handlers awkward.
Alternative 2: Migrate Batchable to Inheritance
Make batchable use inheritance to match others.
Rejected: Batchable composition is the better pattern; others should follow it.
Alternative 3: Language-Specific Patterns
Let each language use its idiomatic pattern.
Rejected: Violates cross-language consistency principle; increases cognitive load.
References
- Composition Over Inheritance - Principle documentation
- Cross-Language Consistency - API philosophy
- API Convergence Matrix - Cross-language API reference
napi-rs Research Spike: Findings
Branch: research/napi-rs-ffi-spike
Date: 2026-02-16
Status: Complete — GO recommendation
Executive Summary
napi-rs is a viable replacement for the koffi + C FFI approach in the TypeScript worker. It eliminates the entire class of TAS-283 “trailing input” bugs by removing JSON serialization and C string marshalling from the FFI boundary. The spike successfully:
- Built a .node module with 14 exported functions
- Loaded and ran in Bun without issues
- Passed clientCreateTask with a native JS object (no trailing input)
- Auto-generated correct TypeScript definitions with proper camelCase conversion
- Introduced zero workspace dependency conflicts
Recommendation: Create a formal ticket to replace koffi with napi-rs.
Detailed Findings
1. Bun Compatibility: CONFIRMED
The napi-rs .node module loads directly in Bun via require():
const lib = require("./tasker-ts-napi.darwin-arm64.node");
lib.getVersion(); // "0.1.3"
lib.healthCheck(); // true
Bun has native support for N-API modules. No polyfills or compatibility layers needed.
2. TAS-283 Bug Elimination: CONFIRMED
The critical test — clientCreateTask() with a complex nested object — works without any serialization:
lib.clientCreateTask({
name: "ecommerce_order_processing",
namespace: "ecommerce_ts",
version: "0.1.0",
context: {
order_id: "test-napi-123",
customer_email: "test@napi-spike.com",
items: [{ sku: "WIDGET-1", qty: 2, price: 29.99 }],
// ... complex nested object
},
initiator: "napi-rs-spike-test",
sourceSystem: "test-spike",
reason: "Validating napi-rs eliminates trailing input bug",
});
With orchestration running, the request completed the full round-trip:
- clientCreateTask({...}) → 404 “Task template not found” (expected — template doesn’t exist)
- clientListTasks({ limit: 5 }) → Returns 489 tasks with full pagination and typed objects
- clientHealthCheck() → { success: true, data: { healthy: true } }
No “trailing input” error anywhere. The JS object crosses into Rust as a native #[napi(object)] struct — no JSON, no C strings, no trailing bytes.
3. Type Generation: EXCELLENT
napi-rs auto-generates index.d.ts with:
- snake_case → camelCase: Automatic field name conversion (worker_id → workerId)
- Option&lt;T&gt; → T | undefined: Proper nullable types
- HashMap&lt;String, T&gt; → Record&lt;string, T&gt;: Correct map types
- Vec&lt;T&gt; → Array&lt;T&gt;: Correct array types
- serde_json::Value → any: JS-native any type
- Rust doc comments → JSDoc comments: Documentation preserved
Sample generated types:
export interface NapiStepEvent {
eventId: string
taskUuid: string
stepUuid: string
task: NapiTaskInfo
workflowStep: NapiWorkflowStep
stepDefinition: NapiStepDefinition
dependencyResults: Record<string, NapiDependencyResult>
}
export declare function pollStepEvents(): NapiStepEvent | null
export declare function completeStepEvent(eventId: string, result: NapiStepResult): boolean
4. Dependency Analysis
Added Dependencies
| Crate | Version | Purpose | Conflicts |
|---|---|---|---|
| napi | 2.16.17 | Core N-API bindings | None |
| napi-derive | 2.16.13 | Proc macros for #[napi] | None |
| napi-build | 2.3.1 | Build script helper | None |
| napi-sys | 2.4.0 | Raw N-API FFI bindings | None |
| convert_case | 0.6.0 | snake→camelCase conversion | None |
| ctor | 0.2.9 | Module init registration | None |
Total new transitive dependencies: ~6 crates. No conflicts with existing workspace dependencies.
Removed Dependencies (vs koffi approach)
The napi-rs approach eliminates the need for:
- The koffi npm package (JavaScript side)
- Manual free_rust_string() calls
- The JSON {success, error} envelope pattern
- The serde_json::Deserializer::from_str workaround for trailing bytes
5. Build Complexity
| Aspect | koffi (current) | napi-rs (spike) |
|---|---|---|
| Crate type | cdylib | cdylib |
| Build command | cargo build --release | npx napi build --release --platform |
| Output | .dylib/.so | .node (per-platform) |
| Platform naming | Manual | Automatic (darwin-arm64, linux-x64, etc.) |
| TypeScript types | Manual ts-rs + export_bindings test | Auto-generated index.d.ts |
| npm packaging | Manual binary distribution | napi-rs handles platform packages |
napi-rs’s platform-aware build system is actually simpler for npm distribution.
6. Performance Characteristics
| Aspect | koffi (current) | napi-rs (spike) |
|---|---|---|
| Call overhead | C FFI + JSON ser/de | N-API native object conversion |
| Memory | Manual free_rust_string() | Automatic (V8/Bun GC) |
| String handling | C strings (null-terminated) | N-API strings (length-prefixed) |
| Object passing | JSON serialize → C string → JSON parse | Direct field-by-field conversion |
N-API object conversion is faster than JSON serialization for structured data, though both are fast enough that the difference is unlikely to be measurable in practice. The real win is correctness, not performance.
7. Code Comparison
Before (koffi): Creating a task
// TypeScript side
const requestJson = JSON.stringify(taskRequest);
const resultPtr = lib.client_create_task(requestJson);
const resultStr = resultPtr.readString();
lib.free_rust_string(resultPtr);
const result = JSON.parse(resultStr);
if (!result.success) throw new Error(result.error);
return result.data;
// Rust side
pub extern "C" fn client_create_task(request_json: *const c_char) -> *mut c_char {
    let c_str = unsafe { CStr::from_ptr(request_json) };
    let json_str = c_str.to_str().unwrap();
    // ↑ BUG: koffi may include trailing bytes in the C string buffer
    let mut deserializer = serde_json::Deserializer::from_str(json_str);
    // ↑ WORKAROUND: still fails (TAS-283)
    // ... serialize result to JSON, convert to C string, return pointer
}
After (napi-rs): Creating a task
// TypeScript side
const result = lib.clientCreateTask({
name: "order_processing",
namespace: "ecommerce",
version: "0.1.0",
context: { order_id: "123" },
initiator: "user",
sourceSystem: "web",
reason: "New order",
});
// result is a typed NapiClientResult — no JSON.parse, no free_rust_string
// Rust side
#[napi]
pub fn client_create_task(request: NapiTaskRequest) -> Result<NapiClientResult> {
    // request fields are already native Rust types — no JSON parsing
    let task_request = TaskRequest {
        name: request.name,
        context: request.context, // serde_json::Value from JS object directly
        // ...
    };
    // Return typed object — no JSON serialization, no C string allocation
}
8. napi-rs as Single FFI Foundation
The Current Multi-Runtime Architecture
The existing TypeScript worker (workers/typescript/) has a multi-layer runtime abstraction:
TypeScript Public API (WorkerServer, StepHandler, TaskerClient)
└── FfiLayer (src/ffi/ffi-layer.ts) — runtime detection + dispatch
├── NodeRuntime (src/ffi/node-runtime.ts) — koffi, used by Bun AND Node.js
└── DenoRuntime (src/ffi/deno-runtime.ts) — Deno.dlopen
Runtime detection (src/ffi/runtime.ts) inspects globals at startup:
- 'Bun' in globalThis → Bun
- 'Deno' in globalThis → Deno
- process.versions.node → Node.js
Both Bun and Node.js use NodeRuntime (koffi via Node-API). Deno uses its own DenoRuntime with Deno.dlopen. The FfiLayer class abstracts this, discovering the correct runtime and loading the appropriate adapter.
Why napi-rs Should Replace the Entire Layer
napi-rs targets N-API — the stable, ABI-compatible native addon interface. N-API is supported by:
| Runtime | N-API Support | Status |
|---|---|---|
| Bun | Native (bun:ffi + N-API) | Primary runtime, tested in spike |
| Node.js | Native (since v8.0) | N-API was designed for Node.js |
| Deno | Via --unstable-node-api flag, or via npm: specifiers | Deno 2.x has improved N-API compat |
This means a single .node binary serves all three runtimes. The current architecture’s runtime introspection, NodeRuntime/DenoRuntime split, and koffi dependency all become unnecessary.
Proposed Simplified Architecture
TypeScript Public API (WorkerServer, StepHandler, TaskerClient)
└── Direct require() of .node module — no abstraction layer needed
What gets deleted:
- src/ffi/runtime.ts — Runtime detection (no longer needed)
- src/ffi/ffi-layer.ts — Runtime dispatch abstraction (no longer needed)
- src/ffi/node-runtime.ts — koffi wrapper (~250 lines of manual FFI function definitions)
- src/ffi/deno-runtime.ts — Deno.dlopen wrapper (~250 lines)
- src/ffi/shims.d.ts — Deno type shims
- deno.json — Deno-specific configuration
- koffi from optionalDependencies
- All free_rust_string() calls in the TypeScript codebase
- All JSON envelope parsing ({success, error} unwrapping)
- The ts-rs dev-dependency and export_bindings test (napi-rs generates types automatically)
What stays unchanged:
- src/index.ts — Public API exports
- src/worker-server.ts — WorkerServer class
- src/handlers/ — StepHandler base class, handler registry
- src/client/ — TaskerClient (rewired to call napi-rs directly)
- src/events/ — Event system
What gets simplified:
- src/ffi/index.ts — Thin re-export of the .node module’s auto-generated types
- Loading: const native = require('./tasker-ts-napi.&lt;platform&gt;.node'), or napi-rs’s built-in loader
Deno Compatibility Assessment
Current state: Deno support via DenoRuntime uses Deno.dlopen with the same .dylib/.so as koffi. This is a completely separate code path from Bun/Node.
With napi-rs: Deno’s N-API support has matured significantly:
- Deno 2.x supports N-API natively via --unstable-node-api or when importing npm: packages
- The @napi-rs/cli toolchain generates .node files that Deno can load
- However, Deno’s N-API is still marked unstable for direct .node loading
Recommendation: Drop the dedicated DenoRuntime adapter. Deno users can:
- Use Deno’s npm: specifier to import @tasker-systems/tasker (N-API works transparently)
- Use the --unstable-node-api flag for direct .node loading
- The current DenoRuntime with Deno.dlopen has the same C FFI problems as koffi anyway
This is a net simplification — one code path instead of two, no runtime introspection, no conditional imports.
Type Generation Consolidation
Currently, TypeScript types are generated via a three-step process:
1. Rust DTOs in src-rust/dto.rs with #[cfg_attr(test, derive(TS))]
2. cargo test export_bindings --package tasker-ts generates .ts files into src/ffi/generated/
3. src/ffi/types.ts manually re-exports with API-friendly names
With napi-rs, this entire pipeline is replaced:
- #[napi(object)] structs in Rust are the single source of truth
- npx napi build auto-generates index.d.ts with all types
- No manual re-export step, no separate generated/ directory
The auto-generated types also get proper camelCase conversion for free, matching JavaScript conventions without any manual #[serde(rename)] annotations.
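The renaming rule itself is simple. The real implementation uses the convert_case crate; this sketch is just an illustration of the snake_case → camelCase behavior:

```rust
// Sketch of the snake_case → camelCase field renaming napi-rs applies when
// generating TypeScript definitions (convert_case does the real work).
fn to_camel_case(snake: &str) -> String {
    let mut out = String::with_capacity(snake.len());
    let mut upper_next = false;
    for ch in snake.chars() {
        if ch == '_' {
            // Underscore consumed; uppercase the next character instead.
            upper_next = true;
        } else if upper_next {
            out.extend(ch.to_uppercase());
            upper_next = false;
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    println!("{}", to_camel_case("worker_id"));  // workerId
    println!("{}", to_camel_case("task_uuid"));  // taskUuid
    println!("{}", to_camel_case("name"));       // name
}
```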
9. CI and Release Pipeline Impact
Current Artifact Flow
build-ffi-libraries.yml (matrix: linux-x64, darwin-arm64)
├── Docker build → libtasker_ts-x86_64-unknown-linux-gnu.so
└── Native build → libtasker_ts-aarch64-apple-darwin.dylib
↓
release.yml: publish-typescript job
├── Download artifacts → workers/typescript/native/
├── bun install && bun run build
└── npm publish @tasker-systems/tasker
Key files:
- .github/workflows/build-ffi-libraries.yml — Cross-platform matrix builds
- .github/workflows/release.yml (lines 419-497) — npm publish job
- scripts/ffi-build/build-typescript.sh — cargo build -p tasker-ts --release
- scripts/release/publish-typescript.sh — Version check + npm publish
- docker/build/ffi-builder.Dockerfile — Linux build container
What Changes with napi-rs
| Component | Current (koffi) | After (napi-rs) | Notes |
|---|---|---|---|
| Rust crate | tasker-ts (cdylib → .so/.dylib) | tasker-ts-napi (cdylib → .node) | Same crate type, different output |
| Build command | cargo build -p tasker-ts --release | npx napi build --release --platform | napi CLI handles platform naming |
| Output naming | Manual: libtasker_ts-linux-x64.so | Automatic: tasker-ts-napi.linux-x64-gnu.node | napi-rs convention |
| Bundle location | native/libtasker_ts-*.{so,dylib} | tasker-ts-napi.*.node at package root | napi-rs standard layout |
| Platform detection | src/ffi/ffi-layer.ts + BUNDLED_LIBRARIES map | napi-rs built-in loadBinding() | Eliminates manual path resolution |
| npm dependency | koffi (optionalDependency) | @napi-rs/cli (devDependency only) | koffi removed from production |
| Type generation | ts-rs + cargo test export_bindings | Automatic during npx napi build | One fewer build step |
Workflow Changes Required
build-ffi-libraries.yml:
# Build script change
- cargo build -p tasker-ts --release --locked
+ cd workers/typescript-napi && npx napi build --release --platform --target $TARGET
The matrix (linux-x64, darwin-arm64) stays the same. Output artifacts change from .so/.dylib to .node.
release.yml publish-typescript job:
# Bundle step — same pattern, different file names
- mkdir -p workers/typescript/native
- cp ffi-artifacts/typescript/libtasker_ts-x86_64-unknown-linux-gnu.so \
- workers/typescript/native/libtasker_ts-linux-x64.so
- cp ffi-artifacts/typescript/libtasker_ts-aarch64-apple-darwin.dylib \
- workers/typescript/native/libtasker_ts-darwin-arm64.dylib
+ # napi-rs .node files go at package root (loader expects them there)
+ cp ffi-artifacts/typescript/*.node workers/typescript/
No changes to OIDC, npm environment, or publish command — still npm publish of a single @tasker-systems/tasker package.
test-typescript-framework.yml:
- Remove Node.js and Deno FFI test steps (single runtime path)
- Simplify to bun test (one command, one runtime)
- Client API tests unchanged
build-workers.yml TypeScript job:
- cargo make build-ffi # cargo build -p tasker-ts
+ cd workers/typescript-napi && npx napi build --platform # debug build for tests
Docker production build (typescript-worker.prod.Dockerfile):
- cargo build -p tasker-ts --release --locked
- ENV TASKER_FFI_LIBRARY_PATH=/app/lib/libtasker_ts.so
+ cd workers/typescript-napi && npx napi build --release --platform
+ # .node file discovered automatically by napi-rs loader, no env var needed
npm Distribution: Single Package with Bundled Binaries
napi-rs supports two distribution models:
1. Platform packages (separate @org/pkg-linux-x64-gnu, etc., as optionalDependencies)
2. A single package with .node files bundled alongside index.js
We use approach 2 — the same strategy as our current native/ directory approach, keeping everything in @tasker-systems/tasker. This avoids the significant overhead of platform packages, each of which would require its own unique OIDC trusted publishing setup ((org, repo, workflow, environment) tuple) in GitHub Actions and npm.
The napi-rs auto-generated index.js loader already supports this natively via a dual resolution strategy:
// Generated by napi-rs — checks local file FIRST, falls back to platform package
case 'darwin':
switch (arch) {
case 'arm64':
localFileExisted = existsSync(join(__dirname, 'tasker-ts-napi.darwin-arm64.node'))
if (localFileExisted) {
nativeBinding = require('./tasker-ts-napi.darwin-arm64.node') // ← bundled
} else {
nativeBinding = require('@tasker-systems/tasker-darwin-arm64') // ← never used
}
Since the .node files are co-located in the package directory, the loader finds them locally and never attempts the optional dependency fallback. This is functionally identical to our current native/ directory strategy:
# Current (koffi) # After (napi-rs)
@tasker-systems/tasker @tasker-systems/tasker
├── dist/ ├── dist/
├── native/ ├── tasker-ts-napi.linux-x64-gnu.node
│ ├── libtasker_ts-linux-x64.so ├── tasker-ts-napi.darwin-arm64.node
│ └── libtasker_ts-darwin-arm64.dylib ├── index.js (auto-generated loader)
└── package.json ├── index.d.ts (auto-generated types)
└── package.json
What changes vs current approach:
- The native/ directory goes away — .node files live at package root (napi-rs convention)
- Platform resolution moves from our hand-written FfiLayer.discoverLibraryPath() to napi-rs’s generated index.js
- The TASKER_FFI_LIBRARY_PATH environment variable override is no longer needed (the napi-rs loader handles it)
- Same OIDC setup, same single npm publish, same release.yml — no new packages to configure
Release artifact flow stays parallel to what we have:
build-ffi-libraries.yml (matrix: linux-x64, darwin-arm64)
├── Docker: npx napi build --release --platform --target x86_64-unknown-linux-gnu
│ → tasker-ts-napi.linux-x64-gnu.node
└── Native: npx napi build --release --platform --target aarch64-apple-darwin
→ tasker-ts-napi.darwin-arm64.node
↓
release.yml: publish-typescript job
├── Download artifacts → cp *.node workers/typescript/
├── bun install && bun run build
└── npm publish @tasker-systems/tasker (single package, same OIDC)
Files Changed or Removed
| File | Change | Reason |
|---|---|---|
| scripts/ffi-build/build-typescript.sh | Update | cargo build → npx napi build |
| cargo-make/scripts/ci-restore-typescript-artifacts.sh | Simplify | .node files are self-contained (no lib prefix, no extension mapping) |
| workers/typescript/deno.json | Delete | No dedicated Deno adapter |
| test-typescript-framework.yml | Simplify | Remove multi-runtime test matrix (Node/Deno steps), keep Bun |
| docker/build/typescript-worker.prod.Dockerfile | Simplify | Remove TASKER_FFI_LIBRARY_PATH env var; napi-rs loader handles resolution |
10. Migration Path
The migration is a direct replacement, not incremental. The koffi FFI layer is broken (TAS-283) and the public TypeScript API (WorkerServer, StepHandler, TaskerClient) doesn’t change — only the internal FFI plumbing.
Phase 1: Replace FFI crate (Rust side)
- Rename/replace workers/typescript/src-rust/ with the napi-rs implementation
- Update Cargo.toml: remove the cdylib C FFI, add napi dependencies
- Port all functions from C FFI signatures to #[napi] functions
- Delete conversions.rs (JSON conversion helpers) — no longer needed
- Delete dto.rs — replaced by #[napi(object)] structs that auto-generate TypeScript types
Phase 2: Simplify TypeScript layer
- Delete src/ffi/runtime.ts, ffi-layer.ts, node-runtime.ts, deno-runtime.ts
- Delete the src/ffi/generated/ directory and ts-rs binding generation
- Add the napi-rs module loader (one line: const native = require('./index.node'), or use the @napi-rs/cli generated loader)
- Rewire WorkerServer, TaskerClient, and the event system to call napi-rs functions directly
- Remove all JSON.parse/JSON.stringify at the FFI boundary
- Remove all free_rust_string() calls
- Remove koffi from optionalDependencies
Phase 3: Update CI and release
- Update `build-ffi-libraries.yml` to use `npx napi build`
- Update `release.yml` to use napi-rs platform package publishing
- Simplify `test-typescript-framework.yml` to single-runtime tests
- Update Docker builds
Phase 4: Cleanup
- Remove the `workers/typescript-napi/` spike directory
- Update documentation
11. Risks & Mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| napi-rs version churn | Low | Pin napi v2, mature ecosystem (SWC, Rollup, Parcel use it) |
| Bun N-API compatibility gaps | Low | Tested in spike, Bun team actively maintains N-API |
| Build complexity for CI | Low | npx napi build handles platform detection automatically |
| Deno N-API gaps | Low | Deno 2.x N-API is stable for npm packages; dedicated adapter was more fragile |
| Platform package publishing | Low | Well-documented napi-rs workflow; used by major projects |
12. What This Spike Did NOT Test
- Multi-platform builds (only tested darwin-arm64)
- napi-rs platform package publishing (`npx napi prepublish`)
- Long-running event loop (poll/complete cycle under load)
- Concurrent access patterns (multiple JS threads)
- Memory leak detection under sustained use
- Deno loading the `.node` module via the `npm:` specifier
These should be tested during formal implementation.
Files Created
workers/typescript-napi/
├── Cargo.toml # napi + workspace deps
├── build.rs # napi-build setup
├── package.json # @napi-rs/cli tooling
├── src/
│ ├── lib.rs # Module entry (get_version, health_check)
│ ├── bridge.rs # Worker lifecycle + poll/complete (14 napi object types)
│ ├── client_ffi.rs # Client API (clientCreateTask — THE bug test)
│ └── error.rs # Error types → JS exceptions
├── test-spike.ts # Bun test script
├── index.d.ts # Auto-generated TypeScript definitions
├── tasker-ts-napi.darwin-arm64.node # Built binary
└── RESEARCH.md # This document
Conclusion
GO: napi-rs should replace koffi for the TypeScript FFI layer. It brings TypeScript to parity with Ruby (magnus) and Python (pyo3):
| Aspect | Before (koffi) | After (napi-rs) | Ruby (magnus) | Python (pyo3) |
|---|---|---|---|---|
| Type conversion | Manual JSON | Native objects | serde_magnus | pythonize |
| Memory mgmt | Manual free | Automatic (GC) | Automatic (GC) | Automatic (GC) |
| Error handling | JSON envelope | JS exceptions | Ruby exceptions | Python exceptions |
| String bugs | TAS-283 | Eliminated | None | None |
| Type generation | Manual ts-rs | Auto index.d.ts | N/A | N/A |
| Runtime adapters | 2 (koffi + Deno.dlopen) | 1 (N-API) | 1 (magnus) | 1 (pyo3) |
| Runtime detection | Required (3-way branch) | Not needed | Not needed | Not needed |
| Code to maintain | ~500 lines FFI wrappers | ~0 lines (auto-generated) | ~0 lines | ~0 lines |
The migration is a direct replacement — no dual-support phase needed. The public TypeScript API (WorkerServer, StepHandler, TaskerClient) is unchanged; only the FFI plumbing underneath is swapped. The koffi layer is broken (TAS-283), so there’s no value in keeping it around.
RCA: Parallel Execution Exposing Latent Timing Bugs
Date: 2025-12-07
Related: Worker Dual-Channel Event System
Status: Resolved
Impact: Flaky E2E test test_mixed_workflow_scenario
Executive Summary
During the dual-channel event system implementation (fire-and-forget handler dispatch), a previously hidden bug in the SQL function get_task_execution_context() became consistently reproducible. The bug was a logical precedence error that had always existed but was masked by sequential execution timing. Introducing true parallelism changed the probability distribution of state combinations, transforming a Heisenbug into a Bohrbug.
This document captures the root cause analysis as a reference for understanding how architectural changes to concurrency can surface latent bugs in distributed systems.
The Bug
Symptom
Test test_mixed_workflow_scenario intermittently failed with timeout waiting for BlockedByFailures status, while the API returned HasReadySteps.
⏳ Waiting for task to fail (max 10s)...
Task execution status: processing (processing)
Task execution status: has_ready_steps (has_ready_steps) ← Wrong!
Task execution status: has_ready_steps (has_ready_steps)
... timeout ...
Root Cause
The SQL function get_task_execution_context() checked ready_steps > 0 BEFORE permanently_blocked_steps > 0:
-- BUGGY: Wrong precedence order
CASE
WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps' -- ← Checked FIRST
WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures'
...
END as execution_status
When a task had BOTH permanently blocked steps AND ready steps, the function returned has_ready_steps instead of blocked_by_failures.
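The precedence error is easy to reproduce outside SQL. A minimal Rust sketch (hypothetical function, mirroring the CASE logic above) shows that when both counters are non-zero, whichever branch is checked first wins:

```rust
// Hypothetical port of the buggy SQL CASE: ready_steps checked first.
fn execution_status_buggy(ready_steps: i64, permanently_blocked_steps: i64) -> &'static str {
    if ready_steps > 0 {
        "has_ready_steps" // checked first — masks blocked_by_failures
    } else if permanently_blocked_steps > 0 {
        "blocked_by_failures"
    } else {
        "waiting_for_dependencies"
    }
}

fn main() {
    // One permanently blocked step AND one ready step (the parallel-timing case):
    assert_eq!(execution_status_buggy(1, 1), "has_ready_steps"); // semantically wrong
    // Without a ready step, the correct status is returned:
    assert_eq!(execution_status_buggy(0, 1), "blocked_by_failures");
}
```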
The Fix
Migration 20251207000000_fix_execution_status_priority.sql corrects the precedence:
-- FIXED: blocked_by_failures takes semantic priority
CASE
WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures' -- ← Now FIRST
WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps'
...
END as execution_status
Why Did This Surface Now?
The Test Scenario
# 3 parallel steps with NO dependencies (can all run concurrently)
steps:
- name: success_step
retryable: false
- name: permanent_error_step
retryable: false # Fails permanently
- name: retryable_error_step
retryable: true
max_attempts: 2 # Fails, but becomes "ready" after backoff
Before: Blocking Handler Dispatch
The original architecture used blocking .call() in the event handler:
// workers/rust/src/event_handler.rs (before)
let result = handler.call(&step).await; // BLOCKS until handler completes
This created effectively sequential execution even for independent steps:
Timeline (Sequential):
────────────────────────────────────────────────────────────────────
t=0ms [success_step starts]
t=50ms [success_step completes]
t=51ms [permanent_error_step starts]
t=100ms [permanent_error_step fails → PERMANENTLY BLOCKED]
t=101ms [retryable_error_step starts]
t=150ms [retryable_error_step fails → enters 100ms backoff]
t=151ms ──► STATUS CHECK
permanently_blocked_steps = 1
ready_steps = 0 (still in backoff!)
──► Returns: blocked_by_failures ✓
The backoff hadn't elapsed yet because steps were processed one at a time.
After: Fire-and-Forget Handler Dispatch
The dual-channel event system introduced non-blocking dispatch via channels:
// Fire-and-forget pattern
dispatch_sender.send(DispatchHandlerMessage { step, ... }).await;
// Returns immediately - handler executes in separate task
This enables true parallel execution:
Timeline (Parallel):
────────────────────────────────────────────────────────────────────
t=0ms [success_step starts]──────────────────►[completes t=50ms]
t=0ms [permanent_error_step starts]──────────►[fails t=50ms → BLOCKED]
t=0ms [retryable_error_step starts]──────────►[fails t=50ms → backoff]
t=150ms [retryable_error_step backoff expires → becomes READY]
t=151ms ──► STATUS CHECK
permanently_blocked_steps = 1
ready_steps = 1 (backoff elapsed!)
──► Returns: has_ready_steps ✗ (BUG!)
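The timing shift can be checked numerically. A small sketch using the illustrative timings from the diagrams above (50ms steps, 100ms backoff):

```rust
// Illustrative timings from the timelines above: 50ms per step, 100ms backoff.
const STEP_MS: u64 = 50;
const BACKOFF_MS: u64 = 100;

// Sequential: the retryable step starts only after the two earlier steps finish,
// so it fails at t=150ms and becomes ready at t=250ms.
fn sequential_ready_at() -> u64 {
    3 * STEP_MS + BACKOFF_MS
}

// Parallel: all three steps start at t=0, so the retryable step fails at t=50ms
// and becomes ready at t=150ms.
fn parallel_ready_at() -> u64 {
    STEP_MS + BACKOFF_MS
}

fn main() {
    let check_at = 151; // the status check from the timelines above
    // Sequential: backoff has not elapsed → ready_steps = 0 → blocked_by_failures
    assert!(check_at < sequential_ready_at());
    // Parallel: backoff has elapsed → ready_steps = 1 → the "both states" window
    assert!(check_at >= parallel_ready_at());
}
```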
Probability Analysis
The “Both States” Window
The bug manifests when checking status while the task has BOTH:
- At least one permanently blocked step
- At least one ready step (e.g., retryable step after backoff)
Sequential Processing:
├────────────────────────────────────────────────────────────────────┤
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│ Very LOW probability of "both states" window │
│ Steps complete serially; backoff rarely overlaps with status check │
└────────────────────────────────────────────────────────────────────┘
Parallel Processing:
├────────────────────────────────────────────────────────────────────┤
│░░░░░░░░░░░░████████████████████████████████████████████░░░░░░░░░░░│
│ ↑ ↑ │
│ │ HIGH probability "both states" window │ │
│ │ All steps complete ~simultaneously │ │
│ │ Backoff expires while status is polled │ │
└────────────────────────────────────────────────────────────────────┘
Quantifying the Change
| Metric | Sequential | Parallel |
|---|---|---|
| Step completion spread | ~150ms | ~50ms |
| “Both states” window duration | ~0ms (transient) | ~100ms+ (stable) |
| Probability of hitting bug | <1% | >50% |
| Bug classification | Heisenbug | Bohrbug |
Bug Classification
Heisenbug → Bohrbug Transformation
| Property | Before (Heisenbug) | After (Bohrbug) |
|---|---|---|
| Reproducibility | Intermittent, timing-dependent | Consistent, deterministic |
| Root cause | Logical precedence error | Same |
| Visibility | Hidden by sequential timing | Exposed by parallel timing |
| Debug difficulty | Extremely hard (may never reproduce) | Straightforward |
| Detection in CI | Might pass for months | Fails consistently under load |
Why This Matters
- The bug was always present - It existed in the SQL function since it was written
- Sequential execution hid it - Incidental timing prevented the problematic state
- Parallelization surfaced it - Not by introducing a bug, but by applying concurrency pressure
- This is good - Better to find in tests than production
Semantic Correctness
The Correct Mental Model
“If ANY step is permanently blocked, the task cannot make further progress toward completion, even if other steps are ready to execute.”
A task with permanent failures is blocked by failures regardless of what else might be runnable. The old code implicitly assumed:
“If work is available, we’re making progress”
This is incorrect for workflows where:
- Convergence points require ALL branches to complete
- Final task status depends on ALL steps succeeding
- Partial progress doesn’t constitute overall success
State Precedence (Correct Order)
-- 1. Permanent failures block overall progress
WHEN permanently_blocked_steps > 0 THEN 'blocked_by_failures'
-- 2. Ready work can continue (but may not lead to completion)
WHEN ready_steps > 0 THEN 'has_ready_steps'
-- 3. Work in flight
WHEN in_progress_steps > 0 THEN 'processing'
-- 4. All done
WHEN completed_steps = total_steps THEN 'all_complete'
-- 5. Waiting for dependencies
ELSE 'waiting_for_dependencies'
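The corrected precedence maps naturally onto an ordered conditional. A hedged Rust sketch — the type and field names are illustrative, not the actual tasker-shared API:

```rust
// Illustrative counters; the real values come from the aggregated step query.
struct StepCounts {
    permanently_blocked: i64,
    ready: i64,
    in_progress: i64,
    completed: i64,
    total: i64,
}

fn execution_status(c: &StepCounts) -> &'static str {
    if c.permanently_blocked > 0 {
        "blocked_by_failures" // 1. permanent failures block overall progress
    } else if c.ready > 0 {
        "has_ready_steps" // 2. ready work can continue (but may not lead to completion)
    } else if c.in_progress > 0 {
        "processing" // 3. work in flight
    } else if c.completed == c.total {
        "all_complete" // 4. all done
    } else {
        "waiting_for_dependencies" // 5. waiting for dependencies
    }
}

fn main() {
    let blocked_and_ready = StepCounts {
        permanently_blocked: 1, ready: 1, in_progress: 0, completed: 1, total: 3,
    };
    // The "both states" combination now resolves to blocked_by_failures.
    assert_eq!(execution_status(&blocked_and_ready), "blocked_by_failures");
}
```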
Patterns to Watch For
1. State Combination Explosions
Sequential processing often means only one state at a time. Parallelism creates state combinations that were previously impossible:
Sequential: A → B → C (states are mutually exclusive in time)
Parallel: A + B + C (states can coexist)
Watch for: CASE statements, if/else chains, and state machines that assume mutual exclusivity.
2. Timing-Dependent Invariants
Code may accidentally depend on timing:
// Assumes step_a completes before step_b starts
if step_a.is_complete() {
    // Safe to check step_b
}
Watch for: Implicit ordering assumptions in status calculations, rollups, and aggregations.
3. Transient vs Stable States
Some states were transient under sequential processing but become stable under parallel:
| State | Sequential | Parallel |
|---|---|---|
| “1 complete, 1 in-progress” | Transient (~ms) | Stable (seconds) |
| “blocked + ready” | Nearly impossible | Common |
| “multiple errors” | Rare | Frequent |
Watch for: Error handling, status rollups, and progress calculations that assumed single-state scenarios.
4. Test Timing Sensitivity
Tests written for sequential execution may have implicit timing dependencies:
// This worked when steps were sequential
wait_for_status(BlockedByFailures, timeout: 10s);
// But fails when parallel execution creates a different status first
Watch for: Tests that pass in isolation but fail under concurrent load.
Verification Strategy
After Parallelization Changes
- Run tests multiple times - Timing bugs may not manifest on first run
- Run tests under load - Concurrent test execution increases probability
- Add explicit state combination tests - Test scenarios that were previously impossible
- Review CASE/if-else precedence - Check all status calculations for correct ordering
Example: Testing State Combinations
#[tokio::test]
async fn test_blocked_with_ready_steps() {
    // Explicitly create the state combination
    let task = create_task_with_parallel_steps();

    // Force one step to permanent failure
    force_step_to_permanent_failure(&task, "step_a").await;

    // Force another step to ready (after backoff)
    force_step_to_ready_after_backoff(&task, "step_b").await;

    // Verify correct precedence
    let status = get_task_execution_status(&task).await;
    assert_eq!(status, ExecutionStatus::BlockedByFailures);
}
Conclusion
This bug exemplifies how architectural improvements to concurrency can surface latent correctness issues. The parallelization didn’t introduce a bug—it revealed one that had been hidden by incidental sequential timing.
This is a positive outcome: the bug was found in testing rather than production. The fix ensures correct semantic precedence regardless of execution timing, making the system more robust under parallel load.
Key Takeaways
- Parallelization is a stress test - It exposes timing-dependent bugs
- Sequential execution hides bugs - Incidental ordering masks logical errors
- State precedence matters - Review all status calculations when adding concurrency
- Heisenbugs become Bohrbugs - Parallel execution makes rare bugs reproducible
- This is good engineering - Finding bugs through architectural improvements validates the testing strategy
References
- Migration: `migrations/20251207000000_fix_execution_status_priority.sql`
- Test: `tests/e2e/ruby/error_scenarios_test.rs::test_mixed_workflow_scenario`
- SQL Function: `get_task_execution_context()` in `migrations/20251001000000_fix_permanently_blocked_detection.sql`
- Dual-Channel Event System ADR
Tasker Core Benchmarks
Last Updated: 2026-01-23 Audience: Architects, Developers Status: Active Related Docs: Documentation Hub | Observability | Deployment Patterns
<- Back to Documentation Hub
This directory contains documentation for all performance benchmarks in the tasker-core workspace.
Quick Reference
# E2E benchmarks (cluster mode, all tiers)
cargo make setup-env-all-cluster
cargo make cluster-start-all
set -a && source .env && set +a && cargo bench --bench e2e_latency
cargo make bench-report # Percentile JSON
cargo make bench-analysis # Markdown analysis
cargo make cluster-stop
# Component benchmarks (requires Docker services)
docker-compose -f docker/docker-compose.test.yml up -d
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
cargo bench --package tasker-client --features benchmarks # API benchmarks
cargo bench --package tasker-shared --features benchmarks # SQL + Event benchmarks
Benchmark Categories
1. End-to-End Latency (tests/benches)
Location: tests/benches/e2e_latency.rs
Documentation: e2e-benchmarks.md
Measures complete workflow execution from API call through orchestration, message queue, worker execution, result processing, and dependency resolution — across all distributed components in a 10-instance cluster.
| Tier | Benchmark | Steps | Parallelism | P50 | Target (p99) |
|---|---|---|---|---|---|
| 1 | Linear Rust | 4 sequential | none | 255-258ms | < 500ms |
| 1 | Diamond Rust | 4 (2 parallel) | 2-way | 200-259ms | < 500ms |
| 2 | Complex DAG | 7 (mixed) | 2+3-way | 382ms | < 800ms |
| 2 | Hierarchical Tree | 8 (4 parallel) | 4-way | 389-426ms | < 800ms |
| 2 | Conditional | 5 (3 executed) | dynamic | 251-262ms | < 500ms |
| 3 | Cluster single task | 4 sequential | none | 261ms | < 500ms |
| 3 | Cluster concurrent 2x | 4+4 | distributed | 332-384ms | < 800ms |
| 4 | FFI linear (Ruby/Python/TS) | 4 sequential | none | 312-316ms | < 800ms |
| 4 | FFI diamond (Ruby/Python/TS) | 4 (2 parallel) | 2-way | 260-275ms | < 800ms |
| 5 | Batch 1000 rows | 7 (5 parallel) | 5-way | 358-368ms | < 1000ms |
Each step involves ~19 database operations, 2 message queue round-trips, 4+ state transitions, and dependency graph evaluation. See e2e-benchmarks.md for the detailed per-step lifecycle.
Key Characteristics:
- FFI overhead: ~23% vs native Rust (all languages within 3ms of each other)
- Linear patterns: highly reproducible (<2% variance between runs)
- Parallel patterns: environment-sensitive (I/O contention affects parallelism)
- Batch processing: 2,700-2,800 rows/second with tight P95/P50 ratios
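The batch throughput figure follows directly from the Tier 5 numbers: ~1000 rows completing at a P50 in the 358-368ms range. A quick sanity check using the midpoint:

```rust
// Throughput = rows processed / elapsed seconds.
fn batch_throughput(rows: f64, p50_seconds: f64) -> f64 {
    rows / p50_seconds
}

fn main() {
    // 1000 rows at P50 ≈ 363ms (midpoint of the 358-368ms range above)
    let t = batch_throughput(1000.0, 0.363);
    // ~2,755 rows/second — consistent with the 2,700-2,800 figure
    assert!(t > 2700.0 && t < 2800.0);
}
```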
Run Commands:
cargo make bench-e2e # Tier 1: Rust core
cargo make bench-e2e-full # Tier 1+2: + complexity
cargo make bench-e2e-cluster # Tier 3: Multi-instance
cargo make bench-e2e-languages # Tier 4: FFI comparison
cargo make bench-e2e-batch # Tier 5: Batch processing
cargo make bench-e2e-all # All tiers
2. API Performance (tasker-client)
Location: tasker-client/benches/task_initialization.rs
Measures orchestration API response times for task creation (HTTP round-trip + DB insert + step initialization).
| Benchmark | Target | Current | Status |
|---|---|---|---|
| Linear task init | < 50ms | 17.7ms | 2.8x better |
| Diamond task init | < 75ms | 20.8ms | 3.6x better |
cargo bench --package tasker-client --features benchmarks
3. SQL Function Performance (tasker-shared)
Location: tasker-shared/benches/sql_functions.rs
Measures critical PostgreSQL function performance for orchestration polling.
| Function | Target | Current (5K tasks) | Status |
|---|---|---|---|
| get_next_ready_tasks | < 3ms | 1.75-2.93ms | Pass |
| get_step_readiness_status | < 1ms | 440-603us | Pass |
| get_task_execution_context | < 1ms | 380-460us | Pass |
DATABASE_URL="..." cargo bench --package tasker-shared --features benchmarks sql_functions
4. Event Propagation (tasker-shared)
Location: tasker-shared/benches/event_propagation.rs
Measures PostgreSQL LISTEN/NOTIFY round-trip latency for real-time coordination.
| Metric | Target (p95) | Current | Status |
|---|---|---|---|
| Notify round-trip | < 10ms | 14.1ms | Slightly above, p99 < 20ms |
DATABASE_URL="..." cargo bench --package tasker-shared --features benchmarks event_propagation
Performance Targets
System-Wide Goals
| Category | Metric | Target | Rationale |
|---|---|---|---|
| API Latency | p99 | < 100ms | User-facing responsiveness |
| SQL Functions | mean | < 3ms | Orchestration polling efficiency |
| Event Propagation | p95 | < 10ms | Real-time coordination overhead |
| E2E Linear (4 steps) | p99 | < 500ms | End-user task completion |
| E2E Complex (7-8 steps) | p99 | < 800ms | Complex workflow completion |
| E2E Batch (1000 rows) | p99 | < 1000ms | Bulk operation completion |
Scaling Targets
| Dataset Size | get_next_ready_tasks | Notes |
|---|---|---|
| 1K tasks | < 2ms | Initial implementation |
| 5K tasks | < 3ms | Current verified |
| 10K tasks | < 5ms | Target |
| 100K tasks | < 10ms | Production scale |
Cluster Topology (E2E Benchmarks)
| Service | Instances | Ports | Build |
|---|---|---|---|
| Orchestration | 2 | 8080, 8081 | Release |
| Rust Worker | 2 | 8100, 8101 | Release |
| Ruby Worker | 2 | 8200, 8201 | Release extension |
| Python Worker | 2 | 8300, 8301 | Maturin develop |
| TypeScript Worker | 2 | 8400, 8401 | Bun FFI |
Deployment Mode: Hybrid (event-driven with polling fallback)
Database: PostgreSQL (with PGMQ extension available)
Messaging: RabbitMQ (via MessagingService provider abstraction; PGMQ also supported)
Sample Size: 50 per benchmark
Running Benchmarks
E2E Benchmarks (Full Suite)
# 1. Setup cluster environment
cargo make setup-env-all-cluster
# 2. Start 10-instance cluster
cargo make cluster-start-all
# 3. Verify cluster health
cargo make cluster-status
# 4. Run benchmarks
set -a && source .env && set +a && cargo bench --bench e2e_latency
# 5. Generate reports
cargo make bench-report # → target/criterion/percentile_report.json
cargo make bench-analysis # → tmp/benchmark-results/benchmark-results.md
# 6. Stop cluster
cargo make cluster-stop
Component Benchmarks
# Start database
docker-compose -f docker/docker-compose.test.yml up -d
export DATABASE_URL="postgresql://tasker:tasker@localhost:5432/tasker_rust_test"
# Run individual suites
cargo bench --package tasker-client --features benchmarks # API
cargo bench --package tasker-shared --features benchmarks # SQL + Events
# Run all at once
cargo bench --all-features
Baseline Comparison
# Save current performance as baseline
cargo bench --all-features -- --save-baseline main
# After changes, compare
cargo bench --all-features -- --baseline main
# View report
open target/criterion/report/index.html
Interpreting Results
Stable Metrics (Reliable for Regression Detection)
These metrics show <2% variance between runs:
- Linear pattern P50 (sequential execution baseline)
- FFI linear P50 (framework overhead measurement)
- Single task in cluster (cluster overhead measurement)
- Batch P50 (parallel I/O throughput)
Environment-Sensitive Metrics
These metrics vary 10-30% depending on system load:
- Diamond pattern P50 (parallelism benefit depends on I/O capacity)
- Concurrent 2x (scheduling contention varies)
- Hierarchical tree (deep dependency chains amplify I/O latency)
Key Ratios (Always Valid)
- FFI overhead %: ~23% for all languages (framework-dominated)
- P95/P50 ratio: 1.01-1.12 (execution stability indicator)
- Cluster vs single overhead: <3ms (negligible cluster tax)
- FFI language spread: <3ms (language runtime is not the bottleneck)
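These ratios can be rederived from the tier table above. For example, the FFI overhead percentage, computed from the Tier 1 and Tier 4 linear P50 midpoints:

```rust
// Relative overhead of a measured value vs a baseline, in percent.
fn overhead_pct(measured_ms: f64, baseline_ms: f64) -> f64 {
    (measured_ms / baseline_ms - 1.0) * 100.0
}

fn main() {
    let ffi_linear_p50 = 314.0; // ms, midpoint of the 312-316ms FFI linear range
    let rust_linear_p50 = 256.5; // ms, midpoint of the 255-258ms Rust linear range
    let pct = overhead_pct(ffi_linear_p50, rust_linear_p50);
    // ~22-23%, matching the documented FFI overhead figure
    assert!(pct > 20.0 && pct < 25.0);
}
```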
Design Principles
Natural Measurement
Benchmarks measure real system behavior without artificial test harnesses:
- API benchmarks hit actual HTTP endpoints
- SQL benchmarks use real database with realistic data volumes
- E2E benchmarks execute complete workflows through all distributed components
Distributed System Focus
All benchmarks account for distributed system characteristics:
- Network latency included (HTTP, PostgreSQL, message queues)
- Database transaction timing considered
- Message queue delivery overhead measured
- Worker coordination and scheduling included
Load-Based Validation
Benchmarks serve dual purpose:
- Performance measurement: Track regressions and improvements
- Load testing: Expose race conditions and timing bugs
E2E benchmark warmup has historically discovered critical race conditions that manual testing never revealed.
Statistical Rigor
- 50 samples per benchmark for P50/P95 validity
- Criterion framework with statistical regression detection
- Multiple independent runs recommended for absolute comparisons
- Relative metrics (ratios, overhead %) preferred over absolute milliseconds
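For reference, a minimal percentile computation over a 50-sample run. This uses the simple nearest-rank method; Criterion's internal estimator differs, so treat this as a sanity-check sketch only:

```rust
// Nearest-rank percentile: sort the samples, then take index ceil(p/100 * n) - 1.
fn percentile(samples: &mut [f64], p: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = samples.len();
    let rank = ((p / 100.0) * n as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(n - 1)]
}

fn main() {
    // 50 synthetic latency samples (ms): 251, 252, ..., 300
    let samples: Vec<f64> = (1..=50).map(|i| 250.0 + i as f64).collect();
    let p50 = percentile(&mut samples.clone(), 50.0); // 25th sorted value
    let p95 = percentile(&mut samples.clone(), 95.0); // 48th sorted value
    assert_eq!(p50, 275.0);
    assert_eq!(p95, 298.0);
}
```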
Troubleshooting
“Services must be running”
cargo make cluster-status # Check cluster health
cargo make cluster-start-all # Restart cluster
Tier 3/4 benchmarks skipped
# Ensure cluster env is configured (not single-service)
cargo make setup-env-all-cluster # Generates .env with cluster URLs
High variance between runs
- Close resource-intensive applications (browsers, IDEs)
- Ensure machine is plugged in (not throttling)
- Focus on stable metrics (linear P50, FFI overhead %) for comparisons
- Run benchmarks twice and compare for reproducibility
Benchmark takes too long
# Reduce sample size (default: 50)
cargo bench -- --sample-size 10
# Run single tier
cargo make bench-e2e # Only Tier 1
CI Integration
# Example: .github/workflows/benchmarks.yml
name: Performance Benchmarks
on:
pull_request:
paths:
- 'tasker-*/src/**'
- 'migrations/**'
jobs:
benchmark:
runs-on: ubuntu-latest
services:
postgres:
image: ghcr.io/pgmq/pg18-pgmq:v1.8.1
env:
POSTGRES_DB: tasker_rust_test
POSTGRES_USER: tasker
POSTGRES_PASSWORD: tasker
steps:
- uses: actions/checkout@v3
- run: cargo bench --all-features -- --save-baseline pr
- uses: benchmark-action/github-action-benchmark@v1
with:
tool: 'criterion'
output-file-path: target/criterion/report/index.html
Criterion automatically detects performance regressions with statistical comparison to baselines and alerts on >5% slowdowns.
Contributing
When adding new benchmarks:
- Follow naming convention: `<tier>_<category>/<group>/<scenario>`
- Include targets: Document expected performance in this README
- Add fixture: Create workflow template YAML in `tests/fixtures/task_templates/`
- Document shape: Update e2e-benchmarks.md with topology
- Consider variance: Account for distributed system characteristics
- Use 50 samples: Minimum for P50/P95 statistical validity
Benchmark Template
use criterion::{criterion_group, criterion_main, Criterion, BenchmarkId};
use std::time::Duration;

fn bench_my_scenario(c: &mut Criterion) {
    let mut group = c.benchmark_group("e2e_my_tier");
    group.sample_size(50);
    group.measurement_time(Duration::from_secs(30));
    group.bench_function(BenchmarkId::new("workflow", "my_scenario"), |b| {
        b.iter(|| {
            runtime.block_on(async {
                execute_benchmark_scenario(&client, namespace, handler, context, timeout).await
            })
        });
    });
    group.finish();
}
E2E Benchmark Scenarios: Workflow Shapes and Per-Step Lifecycle
Last Updated: 2026-01-23 Audience: Architects, Developers, Performance Engineers Related Docs: Benchmarks README | States & Lifecycles | Actor Pattern
<- Back to Benchmarks
What Each Benchmark Measures
Each E2E benchmark executes a complete workflow through the full distributed system: HTTP API call, task initialization, step discovery, message queue dispatch, worker execution, result submission, dependency graph re-evaluation, and task finalization.
A 4-step linear workflow at P50=257ms means the system completes 76+ database operations, 8 message queue round-trips, 16+ state machine transitions, and 4 dependency graph evaluations — all across a 10-instance distributed cluster — in approximately one quarter of a second.
Per-Step Lifecycle: What Happens for Every Step
Before examining the benchmark scenarios, it’s important to understand the work the system performs for each individual step. Every step in every benchmark goes through this complete lifecycle.
Messaging Backend: Tasker uses a MessagingService trait with provider variants for
PGMQ (PostgreSQL-native, single-dependency) and RabbitMQ (high-throughput). The benchmark
results documented here were captured using the RabbitMQ backend. The per-step lifecycle
is identical regardless of backend — only the transport layer differs.
State Machine Transitions (per step)
Step: Pending → Enqueued → InProgress → EnqueuedForOrchestration → Complete
Task: StepsInProcess → EvaluatingResults → (EnqueuingSteps if more ready) → Complete
Database Operations (per step): ~19 operations
| Phase | Operations | Description |
|---|---|---|
| Discovery | 2 queries | get_next_ready_tasks + get_step_readiness_status_batch (8-CTE query) |
| Enqueueing | 4 writes | Fetch correlation_id, transition Pending→Enqueued (SELECT sort_key + UPDATE most_recent + INSERT transition) |
| Message send | 1 op | Send step dispatch to worker queue (via MessagingService) |
| Worker claim | 1 op | Claim message with visibility timeout (via MessagingService) |
| Worker transition | 3 writes | Transition Enqueued→InProgress |
| Result submission | 4 writes | Transition InProgress→EnqueuedForOrchestration + audit trigger INSERT + send completion to orchestration queue |
| Result processing | 4 writes | Fetch step state, transition →Complete, delete consumed message |
| Task coordination | 1+ queries | Re-evaluate get_step_readiness_status_batch for remaining steps |
| Total | ~19 ops | |
Message Queue Round-Trips (per step): 2
- Orchestration → Worker: Step dispatch message (task_uuid, step_uuid, handler, context)
- Worker → Orchestration: Completion notification (task_uuid, step_uuid, results)
Dependency Graph Evaluation (per step completion)
After each step completes, the orchestration:
- Queries all steps in the task for current state
- Evaluates dependency edges (parent steps must be Complete)
- Calculates retry eligibility (attempts < max_attempts, backoff expired)
- Identifies newly-ready steps for enqueueing
- Updates task state (more steps ready → EnqueuingSteps, all complete → Complete)
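The evaluation above can be sketched in miniature. This is a hedged illustration with hypothetical types — the real implementation is the SQL readiness query described earlier, not in-memory Rust:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum StepState {
    Pending,
    InProgress,
    Complete,
}

// edges: step -> its parent steps. A pending step becomes ready
// when every parent step is Complete.
fn newly_ready<'a>(
    states: &HashMap<&'a str, StepState>,
    edges: &HashMap<&'a str, Vec<&'a str>>,
) -> Vec<&'a str> {
    let mut ready = Vec::new();
    for (&step, &state) in states {
        if state != StepState::Pending {
            continue;
        }
        let parents_done = edges
            .get(step)
            .map(|parents| parents.iter().all(|p| states[p] == StepState::Complete))
            .unwrap_or(true); // no recorded parents: ready immediately
        if parents_done {
            ready.push(step);
        }
    }
    ready
}

fn main() {
    // Diamond: start -> {b, c} -> end
    let states: HashMap<&str, StepState> = [
        ("start", StepState::Complete),
        ("b", StepState::Complete),
        ("c", StepState::InProgress),
        ("end", StepState::Pending),
    ]
    .into_iter()
    .collect();
    let edges: HashMap<&str, Vec<&str>> = [
        ("b", vec!["start"]),
        ("c", vec!["start"]),
        ("end", vec!["b", "c"]),
    ]
    .into_iter()
    .collect();

    // end is NOT ready: c has not completed yet (the convergence guard)
    assert!(newly_ready(&states, &edges).is_empty());
}
```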
Idempotency Guarantees
- Message visibility timeout: MessagingService prevents duplicate processing (30s window)
- State machine guards: Transitions validate from-state before applying
- Atomic claiming: Workers claim via the messaging backend’s atomic read operation
- Audit trail: Every transition creates an immutable `workflow_step_transitions` record
Tier 1: Core Performance (Rust Native)
Linear Rust (4 steps, sequential)
Fixture: tests/fixtures/task_templates/rust/mathematical_sequence.yaml
Namespace: rust_e2e_linear | Handler: mathematical_sequence
linear_step_1 → linear_step_2 → linear_step_3 → linear_step_4
| Step | Handler | Operation | Depends On | Math |
|---|---|---|---|---|
| linear_step_1 | LinearStep1 | square | none | 6^2 = 36 |
| linear_step_2 | LinearStep2 | square | step_1 | 36^2 = 1,296 |
| linear_step_3 | LinearStep3 | square | step_2 | 1,296^2 = 1,679,616 |
| linear_step_4 | LinearStep4 | square | step_3 | 1,679,616^2 |
Distributed system work for this workflow:
| Metric | Count |
|---|---|
| State machine transitions (step) | 16 (4 per step) |
| State machine transitions (task) | 6 (Pending→Init→Enqueue→InProcess→Eval→Complete) |
| Database operations | 76 (19 per step) |
| MQ messages | 8 (2 per step) |
| Dependency evaluations | 4 (after each step completes) |
| HTTP calls (benchmark→API) | 1 create + ~5 polls |
| Sequential stages | 4 |
Why this matters: This is the purest sequential latency test. Each step must fully complete (all 19 DB operations + 2 message round-trips) before the next step can begin. The P50 of ~257ms means each step’s complete lifecycle averages ~64ms including all distributed coordination.
Diamond Rust (4 steps, 2-way parallel)
Fixture: tests/fixtures/task_templates/rust/diamond_pattern.yaml
Namespace: rust_e2e_diamond | Handler: diamond_pattern
diamond_start
/ \
/ \
diamond_branch_b diamond_branch_c ← parallel execution
\ /
\ /
diamond_end ← 2-way convergence
| Step | Handler | Operation | Depends On | Parallelism |
|---|---|---|---|---|
| diamond_start | Start | square | none | - |
| diamond_branch_b | BranchB | square | start | parallel with C |
| diamond_branch_c | BranchC | square | start | parallel with B |
| diamond_end | End | multiply_and_square | branch_b AND branch_c | convergence |
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions (step) | 16 |
| Database operations | 76 |
| MQ messages | 8 |
| Dependency evaluations | 4 |
| Sequential stages | 3 (start → parallel → end) |
| Convergence points | 1 (diamond_end waits for both branches) |
| Dependency edge checks | 4 (start→B, start→C, B→end, C→end) |
Why this matters: Tests the system’s ability to dispatch and execute steps concurrently. The convergence point (diamond_end) requires the orchestration to correctly evaluate that BOTH branch_b AND branch_c are Complete before enqueueing diamond_end. Under light load, this completes in 3 sequential stages vs 4 for linear (~30% faster).
Tier 2: Complexity Scaling
Complex DAG (7 steps, mixed parallelism)
Fixture: tests/fixtures/task_templates/rust/complex_dag.yaml
Namespace: rust_e2e_mixed_dag | Handler: complex_dag
dag_init
/ \
dag_process_left dag_process_right ← 2-way parallel
/ | | \
/ | | \
dag_validate dag_transform dag_analyze ← mixed dependencies
\ | /
\ | /
dag_finalize ← 3-way convergence
| Step | Depends On | Type |
|---|---|---|
| dag_init | none | init |
| dag_process_left | init | parallel branch |
| dag_process_right | init | parallel branch |
| dag_validate | left AND right | 2-way convergence |
| dag_transform | left only | linear continuation |
| dag_analyze | right only | linear continuation |
| dag_finalize | validate AND transform AND analyze | 3-way convergence |
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions (step) | 28 (7 steps x 4) |
| Database operations | 133 (7 x 19) |
| MQ messages | 14 (7 x 2) |
| Dependency evaluations | 7 |
| Sequential stages | 4 (init → left/right → validate/transform/analyze → finalize) |
| Convergence points | 2 (dag_validate: 2-way, dag_finalize: 3-way) |
| Dependency edge checks | 8 |
Why this matters: Tests multiple convergence points with different fan-in widths. The orchestration must correctly handle that dag_validate needs 2 parents while dag_finalize needs 3. Also tests mixed patterns: some steps continue from a single parent (transform from left only) while others require multiple.
Hierarchical Tree (8 steps, 4-way convergence)
Fixture: tests/fixtures/task_templates/rust/hierarchical_tree.yaml
Namespace: rust_e2e_tree | Handler: hierarchical_tree
                          tree_root
                         /         \
         tree_branch_left           tree_branch_right       ← 2-way parallel
            /       \                  /        \
  tree_leaf_d   tree_leaf_e   tree_leaf_f   tree_leaf_g     ← 4-way parallel
            \        |             |         /
             \       |             |        /
                tree_final_convergence                      ← 4-way convergence
| Level | Steps | Parallelism | Operation |
|---|---|---|---|
| 0 | root | sequential | square |
| 1 | branch_left, branch_right | 2-way parallel | square |
| 2 | leaf_d, leaf_e, leaf_f, leaf_g | 4-way parallel | square |
| 3 | final_convergence | 4-way convergence | multiply_all_and_square |
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions (step) | 32 (8 x 4) |
| Database operations | 152 (8 x 19) |
| MQ messages | 16 (8 x 2) |
| Dependency evaluations | 8 |
| Sequential stages | 4 (root → branches → leaves → convergence) |
| Maximum fan-out | 2-way (each branch → 2 leaves) |
| Maximum fan-in | 4-way (convergence waits for all 4 leaves) |
| Dependency edge checks | 9 |
Why this matters: Tests the widest convergence pattern — 4 parallel leaves must all complete before the final step can execute. This exercises the dependency evaluation with a large number of parent checks per step. Also tests hierarchical fan-out (root→2 branches→4 leaves).
Conditional Routing (5 steps, 3 executed)
Fixture: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml
Namespace: conditional_approval_rust | Handler: approval_routing
Context: {"amount": 500, "requester": "benchmark"}
             validate_request
                    ↓
             routing_decision                 ← DECISION POINT (routes based on amount)
            /        |        \
           /         |         \
  auto_approve  manager_approval  finance_review
   (< $1000)     ($1000-$5000)     (> $5000)
           \         |         /
            \        |        /
             finalize_approval                ← deferred convergence
With benchmark context amount=500, only the auto_approve path executes:
validate_request → routing_decision → auto_approve → finalize_approval
| Step | Executed | Condition |
|---|---|---|
| validate_request | Yes | always |
| routing_decision | Yes | always (decision point) |
| auto_approve | Yes | amount < 1000 |
| manager_approval | Skipped | amount 1000-5000 |
| finance_review | Skipped | amount > 5000 |
| finalize_approval | Yes | deferred convergence (waits for executed paths only) |
Distributed system work (executed steps only):
| Metric | Count |
|---|---|
| State machine transitions (step) | 16 (4 executed x 4) |
| Database operations | 76 (4 executed x 19) |
| MQ messages | 8 (4 executed x 2) |
| Dependency evaluations | 4 |
| Sequential stages | 4 (validate → decision → approve → finalize) |
| Skipped steps | 2 (manager_approval, finance_review) |
Why this matters: Tests deferred convergence — the finalize_approval step depends on ALL conditional branches, but only blocks on branches that actually executed. The orchestration must correctly determine that manager_approval and finance_review were skipped (not just incomplete) and allow finalize_approval to proceed. Also tests the decision point routing pattern.
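The skipped-versus-incomplete distinction can be sketched concisely. This is an illustrative model, not Tasker internals: a deferred-convergence step may proceed once every parent is either complete or was skipped by the routing decision — skipped counts as resolved, not as pending.

```python
# Illustrative sketch of deferred convergence: a parent skipped by routing
# is a resolved state, so it does not block the convergence step.
RESOLVED = {"complete", "skipped"}

def convergence_ready(parents: list[str], states: dict[str, str]) -> bool:
    return all(states[p] in RESOLVED for p in parents)
```

With amount=500, manager_approval and finance_review end up skipped, so finalize_approval waits only on auto_approve.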
Tier 3: Cluster Performance
Single Task Linear (4 steps, round-robin across 2 orchestrators)
Same workflow as Tier 1 linear_rust, but benchmarked with round-robin across 2 orchestration instances to measure cluster coordination overhead.
Distributed system work: Same as linear_rust (76 DB ops, 8 MQ messages) plus cluster coordination overhead (shared database, message queue visibility).
Why this matters: Validates that running in cluster mode adds negligible overhead compared to single-instance. The P50 difference (261ms vs 257ms = ~4ms) represents the entire cluster coordination tax.
Concurrent Tasks 2x (2 tasks simultaneously across 2 orchestrators)
Two linear workflows submitted simultaneously, one to each orchestration instance.
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions | 44 (22 per task) |
| Database operations | 152 (76 per task) |
| MQ messages | 16 (8 per task) |
| Concurrent step executions | up to 2 |
| Database connection contention | 2 orchestrators + 2 workers competing |
Why this matters: Tests work distribution across cluster instances under concurrent load. The P50 of ~332-384ms for TWO tasks (vs ~261ms for one) shows that the second task adds only 30-50% latency, not 100% — demonstrating effective parallelism in the cluster.
Tier 4: FFI Language Comparison
Same linear and diamond patterns as Tier 1, but using FFI workers (Ruby via Magnus, Python via PyO3, TypeScript via Bun FFI) instead of native Rust handlers.
Additional per-step work for FFI:
| Phase | Additional Operations |
|---|---|
| Handler dispatch | FFI bridge call (Rust → language runtime) |
| Context serialization | JSON serialize context for foreign runtime |
| Result deserialization | JSON deserialize results back to Rust |
| Circuit breaker check | should_allow() (sync, atomic check) |
| Completion callback | FFI completion channel (bounded MPSC) |
FFI overhead: ~23% (~60ms for 4 steps)
The overhead is framework-dominated (Rust dispatch + serialization + completion channel), not language-dominated — all three languages perform within 3ms of each other.
Tier 5: Batch Processing
CSV Products 1000 Rows (7 steps, 5-way parallel)
Fixture: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml
Namespace: csv_processing_rust | Handler: csv_product_inventory_analyzer
analyze_csv                ← reads CSV, returns BatchProcessingOutcome
    ↓
[orchestration creates 5 dynamic workers from batch template]
    ↓
process_csv_batch_001 ──┐
process_csv_batch_002 ──┤
process_csv_batch_003 ──├──→ aggregate_csv_results   ← deferred convergence
process_csv_batch_004 ──┤
process_csv_batch_005 ──┘
| Step | Type | Rows | Operation |
|---|---|---|---|
| analyze_csv | batchable | all 1000 | Count rows, compute batch ranges |
| process_csv_batch_001 | batch_worker | 1-200 | Compute inventory metrics |
| process_csv_batch_002 | batch_worker | 201-400 | Compute inventory metrics |
| process_csv_batch_003 | batch_worker | 401-600 | Compute inventory metrics |
| process_csv_batch_004 | batch_worker | 601-800 | Compute inventory metrics |
| process_csv_batch_005 | batch_worker | 801-1000 | Compute inventory metrics |
| aggregate_csv_results | deferred_convergence | all | Merge batch results |
Distributed system work:
| Metric | Count |
|---|---|
| State machine transitions (step) | 28 (7 x 4) |
| Database operations | 133 (7 x 19) |
| MQ messages | 14 (7 x 2) |
| Dynamic step creation | 5 (batch workers created at runtime) |
| Dependency edges (dynamic) | 6 (batch workers → analyze, aggregate → batch_template) |
| File I/O operations | 6 (1 analysis read + 5 batch reads of CSV) |
| CSV rows processed | 1000 |
| Sequential stages | 3 (analyze → 5 parallel workers → aggregate) |
Why this matters: Tests the most complex orchestration pattern — dynamic step generation. The analyze_csv step returns a BatchProcessingOutcome that tells the orchestration to create N worker steps at runtime. The orchestration must:
- Create new step records in the database
- Create dependency edges dynamically
- Enqueue all batch workers for parallel execution
- Use deferred convergence for the aggregate step (waits for batch template, not specific steps)
At P50=358-368ms for 1000 rows, throughput is ~2,700 rows/second with all the distributed system overhead included.
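The batch-range arithmetic the analyze step performs can be sketched as follows. This is an illustrative helper (not Tasker's actual code) showing how 1000 rows split into the five 1-indexed ranges listed in the table above:

```python
# Illustrative sketch: split 1..total_rows into contiguous, 1-indexed
# ranges, one per batch worker, distributing any remainder to the front.
def batch_ranges(total_rows: int, num_batches: int) -> list[tuple[int, int]]:
    size, rem = divmod(total_rows, num_batches)
    ranges, start = [], 1
    for i in range(num_batches):
        end = start + size - 1 + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end + 1
    return ranges
```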
Summary: Operations Per Benchmark
| Benchmark | Steps | DB Ops | MQ Msgs | Transitions | Convergence | P50 |
|---|---|---|---|---|---|---|
| Linear Rust | 4 | 76 | 8 | 22 | none | 257ms |
| Diamond Rust | 4 | 76 | 8 | 22 | 2-way | 200-259ms |
| Complex DAG | 7 | 133 | 14 | 34 | 2+3-way | 382ms |
| Hierarchical Tree | 8 | 152 | 16 | 38 | 4-way | 389-426ms |
| Conditional | 4* | 76 | 8 | 22 | deferred | 251-262ms |
| Cluster single | 4 | 76 | 8 | 22 | none | 261ms |
| Cluster 2x | 8 | 152 | 16 | 44 | none | 332-384ms |
| FFI linear | 4 | 76 | 8 | 22 | none | 312-316ms |
| FFI diamond | 4 | 76 | 8 | 22 | 2-way | 260-275ms |
| Batch 1000 rows | 7 | 133 | 14 | 34 | deferred | 358-368ms |
*Conditional executes 4 of 5 defined steps (2 skipped by routing decision)
Performance per Sequential Stage
For workflows with known sequential depth, we can calculate per-stage overhead:
| Benchmark | Sequential Stages | P50 | Per-Stage Avg |
|---|---|---|---|
| Linear (4 seq) | 4 | 257ms | 64ms |
| Diamond (3 seq) | 3 | 200ms* | 67ms |
| Complex DAG (4 seq) | 4 | 382ms | 96ms** |
| Tree (4 seq) | 4 | 389ms | 97ms** |
| Conditional (4 seq) | 4 | 257ms | 64ms |
| Batch (3 seq) | 3 | 363ms | 121ms*** |
*Diamond under light load (parallelism helping)
**Higher per-stage due to multiple steps per stage (more DB ops per evaluation cycle)
***Higher per-stage due to batch worker creation overhead + file I/O
The ~64ms per sequential stage for simple patterns represents the total distributed round-trip: orchestration discovery → MQ dispatch → worker claim → handler execute (~1ms for math operations) → MQ completion → orchestration result processing → dependency re-evaluation. The handler execution itself is negligible; the 64ms is almost entirely orchestration infrastructure.
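The per-stage column is straightforward arithmetic over the table above — P50 divided by sequential depth:

```python
# Reproducing the per-stage arithmetic from the table above (P50 ms, stages).
p50 = {"linear": (257, 4), "diamond": (200, 3), "complex_dag": (382, 4),
       "tree": (389, 4), "conditional": (257, 4), "batch": (363, 3)}
per_stage = {name: round(ms / stages) for name, (ms, stages) in p50.items()}
```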
Engineering Stories
A progressive-disclosure blog series that teaches Tasker concepts through real-world scenarios. Each story builds on the previous, showing how a growing engineering team adopts workflow orchestration across all four supported languages.
These stories are being rewritten for the current Tasker architecture. See the archive for the original GitBook-era versions.
| Story | Theme | Status |
|---|---|---|
| 01: E-commerce Checkout | Linear pipelines, error handling, retry | Published |
| 02: Data Pipeline | DAG workflows, parallel execution, ETL | Published |
| 03: Microservices Coordination | Diamond pattern, service coordination | Published |
| 04: Team Scaling | Namespace isolation, cross-team workflows | Published |
| 05: Observability | OpenTelemetry + domain events | Planned |
| 06: Batch Processing | Batch step patterns | Planned |
| 07: Conditional Workflows | Decision handlers | Planned |
| 08: Production Debugging | DLQ investigation | Planned |
Every published story (01-04) has working implementations in all four frameworks. See the Example Apps page for how to run them.
Reliable E-commerce Checkout with Tasker
How workflow orchestration turns a fragile checkout pipeline into a resilient, observable process.
Handler examples use Python DSL syntax. See Class-Based Handlers for the class-based alternative. Full implementations in all four languages are linked at the bottom.
The Problem
Your checkout flow works — most of the time. A customer adds items to their cart, enters payment details, and clicks “Place Order.” Behind the scenes, your application validates the cart, charges the payment gateway, reserves inventory, creates the order record, and fires off a confirmation email. Five steps, all wired together in a single controller action.
Then a payment gateway times out mid-checkout. Your code has already validated the cart but hasn’t reserved inventory yet. The customer sees an error, retries, and now you have a double charge to sort out. Your on-call engineer spends the evening tracing logs across services trying to figure out which step failed and whether the customer was actually charged.
This is the reliability problem that workflow orchestration solves. Instead of wiring steps together in application code, you declare them as a workflow template and let the orchestrator handle sequencing, retries, and error classification.
The Fragile Approach
Most checkout implementations start as a procedural chain in a controller:
def process_order(cart, payment, customer):
    validated = validate_cart(cart)
    charge = process_payment(payment, validated.total)
    inventory = reserve_inventory(validated.items)
    order = create_order(customer, validated, charge, inventory)
    send_confirmation(customer.email, order)
    return order
Every step assumes the previous one succeeded. There’s no retry logic, no distinction between “the payment gateway is temporarily down” (retry) and “the card was declined” (don’t retry), and no way to resume from the middle if something fails partway through.
The Tasker Approach
With Tasker, you break the checkout into a task template that defines steps and their dependencies, and step handlers that implement the business logic. The orchestrator takes care of sequencing, retry with backoff, and error classification.
Task Template (YAML)
The workflow definition lives in a YAML file. Each step declares which handler runs it, what it depends on, and how retries should work:
name: ecommerce_order_processing
namespace_name: ecommerce
version: 1.0.0
description: "Complete e-commerce order processing: validate -> payment -> inventory -> order -> confirmation"
steps:
  - name: validate_cart
    description: "Validate cart items, check availability, calculate totals"
    handler:
      callable: Ecommerce::StepHandlers::ValidateCartHandler
    dependencies: []
    retry:
      retryable: true
      max_attempts: 2
      backoff: exponential
      backoff_base_ms: 100
  - name: process_payment
    description: "Process customer payment using payment service"
    handler:
      callable: Ecommerce::StepHandlers::ProcessPaymentHandler
    dependencies:
      - validate_cart
    retry:
      retryable: true
      max_attempts: 2
      backoff: exponential
      backoff_base_ms: 100
  - name: update_inventory
    description: "Reserve inventory for order items"
    handler:
      callable: Ecommerce::StepHandlers::UpdateInventoryHandler
    dependencies:
      - process_payment
    retry:
      retryable: true
      max_attempts: 2
      backoff: exponential
  - name: create_order
    description: "Create order record with customer, payment, and inventory details"
    handler:
      callable: Ecommerce::StepHandlers::CreateOrderHandler
    dependencies:
      - update_inventory
    retry:
      retryable: true
      max_attempts: 2
      backoff: exponential
  - name: send_confirmation
    description: "Send order confirmation email to customer"
    handler:
      callable: Ecommerce::StepHandlers::SendConfirmationHandler
    dependencies:
      - create_order
    retry:
      retryable: true
      max_attempts: 2
      backoff: exponential
The dependencies field creates a linear pipeline: validate_cart -> process_payment -> update_inventory -> create_order -> send_confirmation. Tasker executes them in order, passing each step’s results to its dependents.
Full template: ecommerce_order_processing.yaml
Step Handlers
Each handler is a thin DSL wrapper: it declares a typed input model, then delegates to a service function. The orchestrator handles sequencing, retries, and error classification.
ValidateCartHandler — Input Validation and Pricing
Type definition (the contract):
# app/services/types.py
from typing import Any

from pydantic import BaseModel


class EcommerceOrderInput(BaseModel):
    cart_items: list[dict[str, Any]] | None = None
    customer_email: str | None = None
    payment_token: str | None = None
Handler (DSL declaration + service delegation):
# app/handlers/ecommerce.py
from tasker_core.step_handler.functional import step_handler, inputs
from app.services.types import EcommerceOrderInput
from app.services import ecommerce as svc


@step_handler("ecommerce_validate_cart")
@inputs(EcommerceOrderInput)
def validate_cart(inputs: EcommerceOrderInput, context):
    return svc.validate_cart_items(
        cart_items=inputs.cart_items,
        customer_email=inputs.customer_email,
    )
The @inputs decorator extracts fields from the task’s submitted context and validates them against the Pydantic model. Invalid data raises a permanent error — there’s no point retrying a request with an empty cart. The service function (svc.validate_cart_items) contains the business logic: price calculations, tax, shipping thresholds.
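Conceptually, the decorator does something like the following before the handler ever runs. This is a minimal stand-in sketch, not Tasker code — the real decorator validates against the Pydantic model above:

```python
# Minimal stand-in for what @inputs does conceptually: pull the declared
# fields from the submitted task context and fail fast on clearly bad input.
def extract_inputs(context: dict, required: list[str]) -> dict:
    missing = [f for f in required if not context.get(f)]
    if missing:
        # In a real handler this surfaces as a permanent error:
        # retrying a request with an empty cart cannot succeed.
        raise ValueError(f"missing or empty fields: {missing}")
    return {f: context[f] for f in required}
```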
ProcessPaymentHandler — Dependency Access and Error Classification
The payment handler demonstrates two critical patterns: reading results from an upstream step via @depends_on, and classifying errors so the orchestrator knows whether to retry.
from tasker_core.step_handler.functional import step_handler, depends_on, inputs
from tasker_core import PermanentError, RetryableError
from app.services.types import EcommerceOrderInput, EcommerceValidateCartResult
from app.services import ecommerce as svc


@step_handler("ecommerce_process_payment")
@depends_on(cart_result=("validate_cart", EcommerceValidateCartResult))
@inputs(EcommerceOrderInput)
def process_payment(
    cart_result: EcommerceValidateCartResult,
    inputs: EcommerceOrderInput,
    context,
):
    return svc.process_payment(
        cart_result=cart_result,
        payment_token=inputs.payment_token,
    )
The @depends_on decorator declares that this handler needs the result from validate_cart, typed as EcommerceValidateCartResult. The orchestrator injects the validated, typed result directly into the function signature — no manual parsing or get_dependency_result() calls.
The service function classifies errors:
- PermanentError (declined card, invalid data): The orchestrator marks the step as failed and stops. No retry will fix a declined card.
- RetryableError (gateway timeout, network blip): The orchestrator retries with exponential backoff up to max_attempts.
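The classification itself lives in the service function. A sketch of the pattern, with the error classes stubbed so the example is self-contained (in a real handler they come from tasker_core, and the gateway-response shape here is hypothetical):

```python
# Sketch of error classification inside a payment service function.
# Stub exception classes stand in for tasker_core.PermanentError /
# tasker_core.RetryableError; the response fields are illustrative.
class PermanentError(Exception): ...
class RetryableError(Exception): ...

def charge_card(gateway_response: dict) -> dict:
    status = gateway_response.get("status")
    if status == "declined":
        # No retry will fix a declined card.
        raise PermanentError("card declined")
    if status in ("timeout", "unavailable"):
        # Transient gateway trouble: let the orchestrator retry with backoff.
        raise RetryableError(f"gateway {status}")
    return {"charge_id": gateway_response["charge_id"], "status": "charged"}
```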
Creating a Task
Your application code submits work to Tasker by creating a task. The orchestrator picks it up and runs the step handlers in dependency order.
Ruby (Rails Controller)
class OrdersController < ApplicationController
  def create
    order = Order.create!(
      customer_email: order_params[:customer_email],
      items: order_params[:cart_items],
      status: 'pending'
    )

    task = TaskerCore::Client.create_task(
      name: 'ecommerce_order_processing',
      namespace: 'ecommerce',
      context: {
        customer_email: order_params[:customer_email],
        cart_items: order_params[:cart_items],
        payment_token: order_params[:payment_token],
        shipping_address: order_params[:shipping_address],
        domain_record_id: order.id
      }
    )

    order.update!(task_uuid: task['id'], status: 'processing')
    render json: { id: order.id, status: 'processing', task_uuid: order.task_uuid }, status: :created
  end
end
TypeScript (Bun/Hono Route)
ordersRoute.post('/', async (c) => {
  const { customer_email, items, payment_info } = await c.req.json();

  const [order] = await db.insert(orders).values({
    customerEmail: customer_email, items, status: 'pending',
  }).returning();

  const ffiLayer = new FfiLayer();
  await ffiLayer.load();
  const client = new TaskerClient(ffiLayer);

  const task = client.createTask({
    name: 'ecommerce_order_processing',
    context: { order_id: order.id, customer_email, cart_items: items, payment_info },
    initiator: 'bun-app',
    reason: `Process order #${order.id}`,
    idempotencyKey: `order-${order.id}`,
  });

  await db.update(orders).set({ taskUuid: task.task_uuid, status: 'processing' })
    .where(eq(orders.id, order.id));

  return c.json({ id: order.id, status: 'processing', task_uuid: task.task_uuid }, 201);
});
Both implementations follow the same pattern: create a domain record, submit the workflow to Tasker with the relevant context, and store the task UUID for status tracking.
Full implementations: Rails controller | Bun/Hono route
Key Concepts
- Linear dependencies: Each step declares what it depends on. The orchestrator guarantees execution order without you writing sequencing logic.
- Typed inputs via DSL: @inputs extracts fields from the task context into a validated Pydantic model. @depends_on injects upstream step results as typed parameters. No manual parsing needed.
- Permanent vs. retryable errors: Handlers classify failures so the orchestrator can retry transient issues (gateway timeouts) while immediately failing on business errors (declined cards).
- Task creation via FFI client: Your application submits work through a client that communicates with the Rust orchestration core. The same workflow template runs regardless of which language your handlers are written in.
Full Implementations
The complete e-commerce checkout workflow is implemented in all four supported languages:
| Language | Handlers | Template | Route/Controller |
|---|---|---|---|
| Ruby (Rails) | handlers/ecommerce/ | ecommerce_order_processing.yaml | orders_controller.rb |
| TypeScript (Bun/Hono) | handlers/ecommerce.ts | ecommerce_order_processing.yaml | routes/orders.ts |
| Python (FastAPI) | handlers/ecommerce.py | ecommerce_order_processing.yaml | routers/orders.py |
| Rust (Axum) | handlers/ecommerce.rs | ecommerce_order_processing.yaml | routes/orders.rs |
What’s Next
A linear pipeline works well for checkout, but real systems have steps that can run in parallel. In Post 02: Data Pipeline Resilience, we’ll build an analytics ETL workflow where three data sources are extracted concurrently, transformed independently, and then aggregated — demonstrating Tasker’s DAG execution engine and how parallel steps dramatically reduce pipeline runtime.
See this pattern implemented in all four frameworks on the Example Apps page.
Resilient Data Pipelines with Tasker
How DAG workflows and parallel execution turn brittle ETL scripts into observable, self-healing pipelines.
Handler examples use Python DSL syntax. See Class-Based Handlers for the class-based alternative. Full implementations in all four languages are linked at the bottom.
The Problem
Your analytics pipeline runs nightly. It pulls sales data from your database, inventory snapshots from the warehouse system, and customer records from the CRM. Then it transforms each dataset, aggregates everything into a unified view, and generates business insights. Eight steps, chained together in a cron job.
When the warehouse API returns a 503 at 2 AM, the entire pipeline fails. Your data team discovers the gap the next morning when dashboards show stale numbers. They re-run the whole pipeline manually, even though the sales and customer extracts completed successfully the first time. The warehouse API is back up now, but you’ve lost hours of freshness and burned compute re-extracting data you already had.
The root issue isn’t the API failure — transient errors happen. The issue is that your pipeline treats independent data sources as a sequential chain, so one failure poisons everything downstream.
The Fragile Approach
A typical ETL pipeline chains everything sequentially:
def run_pipeline(config):
    sales = extract_sales(config.source)             # 1. blocks on completion
    inventory = extract_inventory(config.warehouse)  # 2. waits for sales (why?)
    customers = extract_customers(config.crm)        # 3. waits for inventory (why?)
    sales_t = transform_sales(sales)
    inventory_t = transform_inventory(inventory)
    customers_t = transform_customers(customers)
    metrics = aggregate(sales_t, inventory_t, customers_t)
    return generate_insights(metrics)
The three extract steps have no data dependency on each other, yet they run sequentially because the code is sequential. If extract #2 fails, extract #3 never starts. And there’s no retry — a single transient failure aborts the whole run.
The Tasker Approach
Tasker models this pipeline as a DAG (directed acyclic graph). Steps that don’t depend on each other run in parallel automatically. Steps that need upstream results wait only for their specific dependencies.
Task Template (YAML)
name: analytics_pipeline
namespace_name: data_pipeline
version: 1.0.0
description: "Analytics ETL pipeline with parallel extraction and aggregation"
steps:
  # EXTRACT PHASE — 3 parallel steps (no dependencies)
  - name: extract_sales_data
    description: "Extract sales records from database"
    handler:
      callable: extract_sales_data
    dependencies: []
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      initial_delay: 2
      max_delay: 30
  - name: extract_inventory_data
    description: "Extract inventory records from warehouse system"
    handler:
      callable: extract_inventory_data
    dependencies: []
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      initial_delay: 2
      max_delay: 30
  - name: extract_customer_data
    description: "Extract customer records from CRM"
    handler:
      callable: extract_customer_data
    dependencies: []
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      initial_delay: 2
      max_delay: 30
  # TRANSFORM PHASE — each depends only on its own extract
  - name: transform_sales
    handler:
      callable: transform_sales
    dependencies:
      - extract_sales_data
    retry:
      retryable: true
      max_attempts: 2
  - name: transform_inventory
    handler:
      callable: transform_inventory
    dependencies:
      - extract_inventory_data
    retry:
      retryable: true
      max_attempts: 2
  - name: transform_customers
    handler:
      callable: transform_customers
    dependencies:
      - extract_customer_data
    retry:
      retryable: true
      max_attempts: 2
  # AGGREGATE PHASE — waits for ALL 3 transforms (DAG convergence)
  - name: aggregate_metrics
    handler:
      callable: aggregate_metrics
    dependencies:
      - transform_sales
      - transform_inventory
      - transform_customers
    retry:
      retryable: true
      max_attempts: 2
  # INSIGHTS PHASE — depends on aggregation
  - name: generate_insights
    handler:
      callable: generate_insights
    dependencies:
      - aggregate_metrics
    retry:
      retryable: true
      max_attempts: 2
The DAG structure is visible in the dependencies field:
extract_sales ──→ transform_sales ──────┐
extract_inventory → transform_inventory ─┼─→ aggregate_metrics → generate_insights
extract_customer ─→ transform_customers ─┘
All three extract steps have dependencies: [], so Tasker runs them concurrently. Each transform depends only on its own extract, so transforms also run in parallel (once their extract completes). The aggregate step waits for all three transforms — this is the convergence point where parallel branches rejoin.
Full template: data_pipeline_analytics_pipeline.yaml
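The parallelism payoff can be made concrete: a step's earliest sequential stage is one more than its deepest dependency, so the eight steps collapse to four stages. A short illustrative sketch (assumed model, not Tasker internals) over this template's dependency map:

```python
# Illustrative sketch: compute each step's sequential stage as one more
# than its deepest parent. Root steps (no dependencies) are stage 0.
DEPS = {
    "extract_sales_data": [], "extract_inventory_data": [], "extract_customer_data": [],
    "transform_sales": ["extract_sales_data"],
    "transform_inventory": ["extract_inventory_data"],
    "transform_customers": ["extract_customer_data"],
    "aggregate_metrics": ["transform_sales", "transform_inventory", "transform_customers"],
    "generate_insights": ["aggregate_metrics"],
}

def stage(step: str) -> int:
    parents = DEPS[step]
    return 0 if not parents else 1 + max(stage(p) for p in parents)
```

Eight steps, but only four stages of wall-clock latency — the three extracts (and then the three transforms) pay their cost once, in parallel.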
Step Handlers
ExtractSalesDataHandler — Parallel Root Step
Extract steps are “root” steps with no dependencies. They run immediately when the task starts, concurrently with other root steps.
from tasker_core.step_handler.functional import step_handler, inputs
from app.services.types import DataPipelineInput
from app.services import data_pipeline as svc


@step_handler("extract_sales_data")
@inputs(DataPipelineInput)
def extract_sales_data(inputs: DataPipelineInput, context):
    return svc.extract_sales_data(
        source=inputs.source,
        date_range_start=inputs.date_range_start,
        date_range_end=inputs.date_range_end,
        granularity=inputs.granularity,
    )
The important detail: this handler uses only @inputs (no @depends_on). It reads from the task’s initial input — no upstream step results needed. The orchestrator knows it can run this step immediately, in parallel with the other two extract steps. The service function contains the actual data extraction logic.
AggregateMetricsHandler — Multi-Dependency Convergence
The aggregate step is the convergence point. It depends on all three transform steps and pulls typed results from each one via @depends_on.
from tasker_core.step_handler.functional import step_handler, depends_on
from app.services.types import (
    PipelineTransformSalesResult,
    PipelineTransformInventoryResult,
    PipelineTransformCustomersResult,
)
from app.services import data_pipeline as svc


@step_handler("aggregate_metrics")
@depends_on(
    sales_transform=("transform_sales", PipelineTransformSalesResult),
    inventory_transform=("transform_inventory", PipelineTransformInventoryResult),
    customers_transform=("transform_customers", PipelineTransformCustomersResult),
)
def aggregate_metrics(
    sales_transform: PipelineTransformSalesResult,
    inventory_transform: PipelineTransformInventoryResult,
    customers_transform: PipelineTransformCustomersResult,
    context,
):
    return svc.aggregate_metrics(
        sales_transform=sales_transform,
        inventory_transform=inventory_transform,
        customers_transform=customers_transform,
    )
Three @depends_on entries compose the function signature — each injects a typed result from an upstream transform step. The orchestrator guarantees all three have completed successfully before this step runs. If any transform failed (after exhausting its retries), this step never executes. The service function contains the cross-source aggregation logic.
Creating a Task
Submitting the pipeline follows the same pattern as any Tasker workflow:
from tasker_core import TaskerClient

client = TaskerClient()
task = client.create_task(
    name="analytics_pipeline",
    namespace="data_pipeline",
    context={
        "source": "production",
        "date_range_start": "2026-01-01",
        "date_range_end": "2026-01-31",
        "granularity": "daily",
    },
)
Key Concepts
- Parallel steps via empty dependencies: Steps with dependencies: [] are root steps that run concurrently. No threading code, no async coordination — the orchestrator handles it.
- DAG convergence: A step that depends on multiple upstream steps waits for all of them. The aggregate_metrics step converges three parallel branches into one.
- Typed dependency injection: @depends_on injects typed upstream results directly into the function signature. The handler receives validated data — no manual get_dependency_result() calls or key lookups.
- Retry with backoff: Each step configures its own retry policy. The extract steps use 3 attempts with exponential backoff because external systems have transient failures. Transform steps use 2 attempts because they’re CPU-bound and unlikely to benefit from retrying.
Full Implementations
The complete analytics pipeline is implemented in all four supported languages:
| Language | Handlers | Template |
|---|---|---|
| Ruby (Rails) | handlers/data_pipeline/ | data_pipeline_analytics_pipeline.yaml |
| TypeScript (Bun/Hono) | handlers/data-pipeline.ts | data_pipeline_analytics_pipeline.yaml |
| Python (FastAPI) | handlers/data_pipeline.py | data_pipeline_analytics_pipeline.yaml |
| Rust (Axum) | handlers/data_pipeline.rs | data_pipeline_analytics_pipeline.yaml |
What’s Next
Parallel extraction is powerful, but real-world workflows often have a diamond pattern — a step that fans out to parallel branches that must converge before continuing. In Post 03: Microservices Coordination, we’ll build a user registration workflow where account creation fans out to billing and preferences setup in parallel, then converges for the welcome sequence — demonstrating how Tasker replaces custom circuit breakers with declarative dependency management.
See this pattern implemented in all four frameworks on the Example Apps page.
Microservices Coordination with Tasker
How the diamond dependency pattern replaces custom circuit breakers and service coordination glue.
Handler examples use Ruby DSL syntax. See Class-Based Handlers for the class-based alternative. Full implementations in all four languages are linked at the bottom.
The Problem
Your user registration flow touches four services: the user service (account creation), the billing service (payment profile), the preferences service (notification settings), and the notification service (welcome emails). Each service has its own API, its own failure modes, and its own retry characteristics.
You started with a sequential chain: create user, then billing, then preferences, then welcome email. It works, but it’s slow — billing and preferences don’t depend on each other, yet one waits for the other. When the billing service has a bad deploy and starts returning 500s, your entire registration pipeline backs up. You add a circuit breaker for billing, then another for preferences, then retry logic, then timeout handling. Now your “simple” registration flow is 400 lines of coordination code that’s harder to reason about than the business logic it orchestrates.
The coordination logic isn’t the value your team delivers. The value is in the business rules — how you create accounts, what billing tiers you support, which notification channels you enable. The wiring between services should be declarative.
The Fragile Approach
A typical multi-service registration handler accumulates coordination concerns:
def register_user(user_info):
    account = user_service.create(user_info)                    # must complete first
    billing = retry(attempts=3, backoff="exponential",          # hand-rolled retry
                    fn=lambda: billing_service.setup(account.id, user_info.plan))
    preferences = retry(attempts=3, backoff="exponential",      # hand-rolled retry
                        fn=lambda: preferences_service.init(account.id))
    wait_all(billing, preferences)                              # custom fan-out/fan-in
    notifications.send_welcome(account, billing, preferences)   # depends on both
    user_service.activate(account.id)                           # final step
Each retry() call is hand-rolled. The wait_all() is custom concurrency code. Error handling for partial failures (billing succeeded but preferences didn’t) requires manual cleanup logic. And this pattern gets duplicated across every multi-service workflow in your codebase.
The Tasker Approach
Tasker models this as a diamond dependency pattern: one step fans out to parallel branches that converge before the next step runs. The template declares the shape; the orchestrator handles concurrency, retries, and convergence.
Task Template (YAML)
name: user_registration
namespace_name: microservices
version: 1.0.0
description: "User registration workflow with microservices coordination"
steps:
  # Step 1: Create user account (must complete before anything else)
  - name: create_user_account
    description: "Create user account in user service with idempotency"
    handler:
      callable: Microservices::StepHandlers::CreateUserAccountHandler
    dependencies: []
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      initial_delay: 2
      max_delay: 30
  # Steps 2-3: Run in PARALLEL (both depend only on create_user_account)
  - name: setup_billing_profile
    description: "Setup billing profile in billing service"
    handler:
      callable: Microservices::StepHandlers::SetupBillingProfileHandler
    dependencies:
      - create_user_account
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      initial_delay: 2
      max_delay: 30
  - name: initialize_preferences
    description: "Initialize user preferences in preferences service"
    handler:
      callable: Microservices::StepHandlers::InitializePreferencesHandler
    dependencies:
      - create_user_account
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      initial_delay: 2
      max_delay: 30
  # Step 4: CONVERGENCE — waits for both billing AND preferences
  - name: send_welcome_sequence
    description: "Send welcome emails via notification service"
    handler:
      callable: Microservices::StepHandlers::SendWelcomeSequenceHandler
    dependencies:
      - setup_billing_profile
      - initialize_preferences
    retry:
      retryable: true
      max_attempts: 2
      backoff: exponential
      initial_delay: 2
      max_delay: 20
  # Step 5: Final status update
  - name: update_user_status
    description: "Update user status to active in user service"
    handler:
      callable: Microservices::StepHandlers::UpdateUserStatusHandler
    dependencies:
      - send_welcome_sequence
    retry:
      retryable: true
      max_attempts: 2
      backoff: exponential
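The retry blocks in this template translate into bounded waits between attempts. Assuming a plain doubling curve (the engine's exact formula, such as added jitter, is not specified here), the schedule implied by initial_delay and max_delay looks like:

```python
def backoff_schedule(initial_delay, max_delay, max_attempts):
    """Delay before each retry under an assumed simple doubling curve;
    this just illustrates the initial_delay / max_delay knobs above."""
    return [min(initial_delay * 2 ** n, max_delay)
            for n in range(max_attempts - 1)]  # the first attempt has no delay

# create_user_account: 3 attempts, initial_delay 2, max_delay 30
print(backoff_schedule(2, 30, 3))   # [2, 4]
# a longer policy shows the max_delay cap kicking in
print(backoff_schedule(2, 30, 6))   # [2, 4, 8, 16, 30]
```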
The diamond pattern emerges from the dependency declarations:
create_user_account
├──→ setup_billing_profile ──────┐
└──→ initialize_preferences ─────┴──→ send_welcome_sequence → update_user_status
Steps 2 and 3 both depend on step 1, so they run in parallel once account creation completes. Step 4 depends on both steps 2 and 3, so it waits for the slower of the two — this is the convergence point. No custom concurrency code needed.
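The scheduling rule behind this is simple to state: a step becomes ready once every one of its declared dependencies has completed. A minimal Python sketch of that rule (not the engine's actual PostgreSQL-backed implementation), using the template's step names:

```python
def ready_steps(steps, completed):
    """A step is ready once every declared dependency has completed."""
    return [name for name, deps in steps.items()
            if name not in completed and all(d in completed for d in deps)]

steps = {
    "create_user_account": [],
    "setup_billing_profile": ["create_user_account"],
    "initialize_preferences": ["create_user_account"],
    "send_welcome_sequence": ["setup_billing_profile", "initialize_preferences"],
    "update_user_status": ["send_welcome_sequence"],
}

print(ready_steps(steps, set()))
# ['create_user_account']
print(ready_steps(steps, {"create_user_account"}))
# ['setup_billing_profile', 'initialize_preferences']  (the parallel branches)
print(ready_steps(steps, {"create_user_account", "setup_billing_profile"}))
# ['initialize_preferences']  (convergence still waits on the slower branch)
```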
Full template: microservices_user_registration.yaml
Step Handlers
CreateUserAccountHandler — Idempotent Account Creation
The first step creates the user account. Since it’s the entry point, the DSL declares a typed input model that handles validation.
Type definition (the contract):
# app/services/types.rb
module Types
  module Microservices
    class CreateUserAccountInput < Types::InputStruct
      attribute :email, Types::String
      attribute :name, Types::String.optional
      attribute :plan, Types::String.optional
      attribute :marketing_consent, Types::Bool.optional
    end
  end
end
Handler (DSL declaration + service delegation):
# app/handlers/microservices/step_handlers/create_user_account_handler.rb
module Microservices
  module StepHandlers
    extend TaskerCore::StepHandler::Functional

    CreateUserAccountHandler = step_handler(
      'Microservices::StepHandlers::CreateUserAccountHandler',
      inputs: Types::Microservices::CreateUserAccountInput
    ) do |inputs:, context:|
      Microservices::Service.create_user_account(input: inputs)
    end
  end
end
The inputs: config extracts fields from the task context and validates them against the Dry::Struct type. Input validation (required email, format checks, blocked domains) lives in the service function — the handler stays thin. Validation failures raise PermanentError since bad input can’t be fixed by retrying.
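The retryable/permanent split generalizes: transport-level failures are worth retrying, while validation failures are not. A language-neutral sketch of that classification decision (class and function names here are illustrative, not the tasker-py API):

```python
class PermanentError(Exception):
    """A failure retrying cannot fix: bad input, policy violations."""

class RetryableError(Exception):
    """A transient failure: timeouts, 5xx responses, lock contention."""

def should_retry(exc, attempt, max_attempts):
    """Retry only transient failures, and only while attempts remain."""
    if isinstance(exc, PermanentError):
        return False
    return attempt < max_attempts

print(should_retry(PermanentError("bad email"), 1, 3))  # False
print(should_retry(RetryableError("timeout"), 1, 3))    # True
print(should_retry(RetryableError("timeout"), 3, 3))    # False: attempts exhausted
```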
SendWelcomeSequenceHandler — Multi-Dependency Convergence
The welcome sequence handler demonstrates the convergence pattern — the diamond’s bottom vertex. Three depends_on entries compose the function signature with typed results from all upstream branches.
module Microservices
  module StepHandlers
    extend TaskerCore::StepHandler::Functional

    SendWelcomeSequenceHandler = step_handler(
      'Microservices::StepHandlers::SendWelcomeSequenceHandler',
      depends_on: {
        account_data: ['create_user_account', Types::Microservices::CreateUserResult],
        billing_data: ['setup_billing_profile', Types::Microservices::SetupBillingResult],
        preferences_data: ['initialize_preferences', Types::Microservices::InitPreferencesResult]
      }
    ) do |account_data:, billing_data:, preferences_data:, context:|
      Microservices::Service.send_welcome_sequence(
        account_data: account_data,
        billing_data: billing_data,
        preferences_data: preferences_data
      )
    end
  end
end
The depends_on: hash declares three upstream step results, each typed with a Dry::Struct result model. The orchestrator guarantees that all three have completed successfully before this handler runs. The service function composes the welcome content — adapting based on the billing profile (trial status) and preferences (notification channels), data that was gathered in parallel.
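The guarantee can be pictured as keyword-argument composition: each depends_on entry becomes one named, typed parameter of the handler. A Python analogue (the types and fields below are illustrative stand-ins for the Dry::Struct result models):

```python
from dataclasses import dataclass

@dataclass
class CreateUserResult:
    user_id: str
    plan: str

@dataclass
class SetupBillingResult:
    trial: bool

@dataclass
class InitPreferencesResult:
    channels: list

def send_welcome_sequence(account_data, billing_data, preferences_data):
    """Compose a personalized welcome from all three upstream results."""
    greeting = f"Welcome, {account_data.user_id} ({account_data.plan})"
    if billing_data.trial:
        greeting += " -- your trial is active"
    return {"message": greeting, "channels": preferences_data.channels}

result = send_welcome_sequence(
    CreateUserResult("u_123", "pro"),
    SetupBillingResult(trial=True),
    InitPreferencesResult(channels=["email"]),
)
print(result["message"])  # Welcome, u_123 (pro) -- your trial is active
```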
The parallel steps that feed into this convergence point use the same pattern:
# Runs in parallel with InitializePreferencesHandler (both depend on create_user_account)
SetupBillingProfileHandler = step_handler(
  'Microservices::StepHandlers::SetupBillingProfileHandler',
  depends_on: { account_data: ['create_user_account', Types::Microservices::CreateUserResult] }
) do |account_data:, context:|
  Microservices::Service.setup_billing_profile(account_data: account_data)
end
Creating a Task
task = TaskerCore::Client.create_task(
  name: 'user_registration',
  namespace: 'microservices',
  context: {
    user_info: {
      email: 'new.user@example.com',
      name: 'Jane Developer',
      plan: 'pro',
      source: 'signup_form'
    }
  }
)
Key Concepts
- Diamond dependency pattern: One step fans out to parallel branches that converge before the workflow continues. Declare it with dependencies — no concurrency primitives needed.
- Typed convergence: The depends_on: hash in the DSL composes the convergence handler’s signature from three typed upstream results. No manual get_dependency_result() calls or nil checks.
- Service coordination without custom circuit breakers: Each step’s retry policy acts as a per-service circuit breaker. If billing fails 3 times, that step fails permanently — but preferences continues independently. No shared circuit breaker state to manage.
- Dependency-driven personalization: The convergence handler’s service function composes data from all upstream branches to create personalized outputs (welcome messages tailored to plan, trial status, and notification preferences).
Full Implementations
The complete user registration workflow is implemented in all four supported languages:
| Language | Handlers | Template |
|---|---|---|
| Ruby (Rails) | handlers/microservices/ | microservices_user_registration.yaml |
| TypeScript (Bun/Hono) | handlers/microservices.ts | microservices_user_registration.yaml |
| Python (FastAPI) | handlers/microservices.py | microservices_user_registration.yaml |
| Rust (Axum) | handlers/microservices.rs | microservices_user_registration.yaml |
What’s Next
Workflows within a single team are manageable, but what happens when multiple teams define workflows with overlapping names? In Post 04: Team Scaling with Namespaces, we’ll see how Tasker’s namespace system lets teams like Customer Success and Payments each own a process_refund workflow without naming conflicts — and how cross-namespace coordination enables clean team boundaries.
See this pattern implemented in all four frameworks on the Example Apps page.
Team Scaling with Namespaces
How namespace isolation lets multiple teams own workflows with the same name without stepping on each other.
Handler examples use TypeScript DSL syntax (Customer Success) and Rust (Payments). See Class-Based Handlers for the class-based alternative. Full implementations in all four languages are linked at the bottom.
The Problem
Your company has grown. The Customer Success team handles refund requests through a multi-step approval workflow. The Payments team processes refunds directly through the payment gateway. Both teams call their workflow process_refund — because that’s what it does.
Without namespace isolation, you have a naming collision. One team renames their workflow to cs_process_refund or payments_process_refund, which leads to inconsistent naming conventions, confusion about ownership, and a growing pile of team-prefixed workflow names that nobody wants to maintain. Worse, when the Customer Success team’s approval workflow needs to trigger the Payments team’s gateway refund, the coupling between teams becomes explicit and brittle.
This is the team scaling problem. As your organization grows from one team with a few workflows to multiple teams with overlapping domain concepts, you need a way to isolate ownership while still enabling coordination.
The Fragile Approach
Without namespaces, teams resort to naming conventions:
# Customer Success team
workflow: cs_process_refund_v2_with_approval
# Payments team
workflow: payments_direct_refund_v3
# Which one does "process a refund" mean?
# Depends on who you ask.
Cross-team coordination requires hard-coded references to the other team’s workflow name. When the Payments team renames their workflow, the Customer Success team’s code breaks. There’s no formal boundary between teams — just convention and hope.
The Tasker Approach
Tasker solves this with namespaces. Each team owns a namespace, and workflow names are scoped to that namespace. Both teams can have a workflow called process_refund — the fully qualified names are customer_success.process_refund and payments.process_refund.
Two Templates, Same Name, Different Namespaces
Customer Success: process_refund
The Customer Success team’s refund workflow includes approval steps and ticket management:
name: process_refund
namespace_name: customer_success
version: 1.0.0
description: "Process customer service refunds with approval workflow"
steps:
  - name: validate_refund_request
    description: "Validate customer refund request details"
    handler:
      callable: CustomerSuccess::StepHandlers::ValidateRefundRequestHandler
    dependencies: []
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
  - name: check_refund_policy
    description: "Verify request complies with refund policies"
    handler:
      callable: CustomerSuccess::StepHandlers::CheckRefundPolicyHandler
    dependencies:
      - validate_refund_request
  - name: get_manager_approval
    description: "Route to manager for approval if needed"
    handler:
      callable: CustomerSuccess::StepHandlers::GetManagerApprovalHandler
    dependencies:
      - check_refund_policy
  - name: execute_refund_workflow
    description: "Call payments team refund workflow (cross-namespace)"
    handler:
      callable: CustomerSuccess::StepHandlers::ExecuteRefundWorkflowHandler
    dependencies:
      - get_manager_approval
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
      initial_delay: 5
      max_delay: 60
  - name: update_ticket_status
    description: "Update customer support ticket"
    handler:
      callable: CustomerSuccess::StepHandlers::UpdateTicketStatusHandler
    dependencies:
      - execute_refund_workflow
Payments: process_refund
The Payments team’s refund workflow is direct gateway integration — no approval needed:
name: process_refund
namespace_name: payments
version: 1.0.0
description: "Process payment gateway refunds with direct API integration"
steps:
  - name: validate_payment_eligibility
    description: "Check if payment can be refunded via gateway"
    handler:
      callable: team_scaling_payments_validate_eligibility
    dependencies: []
    retry:
      retryable: true
      max_attempts: 3
      backoff: exponential
  - name: process_gateway_refund
    description: "Execute refund through payment processor"
    handler:
      callable: team_scaling_payments_process_gateway_refund
    dependencies:
      - validate_payment_eligibility
    retry:
      retryable: true
      max_attempts: 2
      backoff: exponential
      initial_delay: 5
      max_delay: 60
  - name: update_payment_records
    description: "Update internal payment status and history"
    handler:
      callable: team_scaling_payments_update_records
    dependencies:
      - process_gateway_refund
  - name: notify_customer
    description: "Send refund confirmation to customer"
    handler:
      callable: team_scaling_payments_notify_customer
    dependencies:
      - update_payment_records
    retry:
      retryable: true
      max_attempts: 5
      backoff: exponential
Both templates use name: process_refund. The namespace_name field is what makes them distinct. When a task is created, the fully qualified identifier is namespace.name — so customer_success.process_refund and payments.process_refund coexist without conflict.
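The coexistence guarantee amounts to keying templates on the (namespace, name) pair rather than the bare name. A toy sketch of namespace-scoped registration (illustrative only, not Tasker internals):

```python
registry = {}

def register_template(namespace, name, template):
    """Templates are keyed by (namespace, name), so identical names
    in different namespaces never collide."""
    registry[(namespace, name)] = template

def fully_qualified(namespace, name):
    return f"{namespace}.{name}"

register_template("customer_success", "process_refund", {"steps": 5})
register_template("payments", "process_refund", {"steps": 4})

print(len(registry))                                     # 2: no collision
print(fully_qualified("payments", "process_refund"))     # payments.process_refund
print(registry[("customer_success", "process_refund")])  # {'steps': 5}
```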
Full templates: customer_success_process_refund.yaml | payments_process_refund.yaml
Step Handlers
The namespace boundary extends to handler implementations — and even to language choice. The Customer Success team uses TypeScript (Bun/Hono), their existing stack. The Payments team chose Rust (Axum) for their handlers because gateway latency is critical and they need predictable sub-millisecond overhead on every refund validation. Both languages connect to the same Tasker orchestration core via FFI.
Customer Success: ValidateRefundRequestHandler (TypeScript)
The Customer Success handler validates from the customer’s perspective — ticket IDs, refund reasons, order history. The inputs config maps camelCase parameter names to the snake_case fields in the task context:
import { defineHandler, PermanentError } from '@tasker-systems/tasker';
import * as svc from '../services/customer-success';

export const ValidateRefundRequestHandler = defineHandler(
  'CustomerSuccess.StepHandlers.ValidateRefundRequestHandler',
  {
    inputs: {
      ticketId: 'ticket_id',
      customerId: 'customer_id',
      refundAmount: 'refund_amount',
      refundReason: 'refund_reason',
    },
  },
  async ({ ticketId, customerId, refundAmount, refundReason }) => {
    const missingFields: string[] = [];
    if (!ticketId) missingFields.push('ticket_id');
    if (!customerId) missingFields.push('customer_id');
    if (!refundAmount) missingFields.push('refund_amount');
    if (missingFields.length > 0) {
      throw new PermanentError(
        `Missing required fields: ${missingFields.join(', ')}`,
      );
    }
    return svc.validateRefundRequest({
      ticketId: ticketId as string,
      customerId: customerId as string,
      refundAmount: refundAmount as number,
      refundReason: (refundReason as string) || 'customer_request',
    });
  },
);
The defineHandler factory registers the handler by name. The inputs config extracts fields from the task context — ticketId reads from ticket_id in the submitted JSON. Validation failures throw PermanentError since bad input can’t be fixed by retrying.
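The same mapping idea, reduced to its essence in Python (extract_inputs is a hypothetical helper for illustration, not part of any Tasker SDK):

```python
def extract_inputs(context, mapping):
    """Map handler parameter names to snake_case task-context fields,
    mirroring the inputs config above."""
    return {param: context.get(field) for param, field in mapping.items()}

ctx = {"ticket_id": "TICKET-1234", "customer_id": "cust_abc123",
       "refund_amount": 49.99}
inputs = extract_inputs(ctx, {
    "ticketId": "ticket_id",
    "customerId": "customer_id",
    "refundAmount": "refund_amount",
    "refundReason": "refund_reason",  # absent in ctx: maps to None
})
print(inputs["ticketId"])      # TICKET-1234
print(inputs["refundReason"])  # None, so the handler can apply its default
```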
Payments: ValidatePaymentEligibilityHandler (Rust)
The Payments team needs low-latency gateway validation — checking payment method support, refund windows, and remaining balance with zero GC overhead. Their handlers are plain Rust functions that receive the task context as serde_json::Value:
use serde_json::{json, Value};

pub fn validate_payment_eligibility(context: &Value) -> Result<Value, String> {
    let payment_id = context
        .get("payment_id")
        .and_then(|v| v.as_str())
        .ok_or("Missing payment_id in context")?;
    let refund_amount = context
        .get("refund_amount")
        .and_then(|v| v.as_f64())
        .ok_or("Missing or invalid refund_amount")?;
    let payment_method = context
        .get("payment_method")
        .and_then(|v| v.as_str())
        .unwrap_or("credit_card");

    if refund_amount <= 0.0 {
        return Err("Refund amount must be positive".to_string());
    }

    // Check if payment method supports refunds
    let refund_supported = match payment_method {
        "credit_card" | "debit_card" | "bank_transfer" => true,
        "gift_card" => refund_amount <= 500.0,
        "crypto" => false,
        _ => true,
    };

    if !refund_supported {
        return Err(format!(
            "Payment method '{}' does not support automated refunds",
            payment_method
        ));
    }

    let now = chrono::Utc::now().to_rfc3339();
    Ok(json!({
        "payment_validated": true,
        "payment_id": payment_id,
        "refund_amount": refund_amount,
        "payment_method": payment_method,
        "eligibility_status": "eligible",
        "validation_timestamp": now,
        "namespace": "payments"
    }))
}
The Rust handler uses pattern matching for payment method validation — a natural fit for the exhaustive checking that payment processing requires. The Result&lt;Value, String&gt; return type maps directly to Tasker’s success/error model: Ok becomes a step success, while Err(...) surfaces as a step failure, and the retry policy in the YAML template controls whether the orchestrator retries it.
Notice how both handlers include namespace in their result. This makes it unambiguous which team’s workflow produced a given result, even when viewing task data across the system. And critically, the Customer Success team’s TypeScript handlers and the Payments team’s Rust handlers participate in the same workflow ecosystem — the orchestrator doesn’t care what language a handler is written in.
Full implementations: Rails customer_success handlers | Axum payments handlers
Cross-Namespace Coordination
The Customer Success workflow’s execute_refund_workflow step demonstrates cross-namespace coordination. After getting manager approval, it creates a task in the payments namespace (the Ruby class-based variant of the handler is shown here):
# Inside CustomerSuccess::StepHandlers::ExecuteRefundWorkflowHandler
def call(context)
  approval = context.get_dependency_result('get_manager_approval')
  validation = context.get_dependency_result('validate_refund_request')

  # Create a task in the payments namespace
  payment_task = TaskerCore::Client.create_task(
    name: 'process_refund',
    namespace: 'payments', # <-- cross-namespace call
    context: {
      payment_id: validation['payment_id'],
      refund_amount: validation['refund_amount'],
      refund_reason: 'customer_request',
      correlation_id: context.task_id # link back to CS workflow
    }
  )

  TaskerCore::Types::StepHandlerCallResult.success(
    result: {
      payment_task_id: payment_task['id'],
      status: 'delegated_to_payments',
      correlation_id: context.task_id
    }
  )
end
The Customer Success team doesn’t need to know how the Payments team processes refunds internally. They just create a task in the payments namespace with the required inputs. If the Payments team refactors their workflow (adds steps, changes retry policies), the Customer Success workflow is unaffected.
Creating Tasks in Each Namespace
Customer Success refund (triggered by a support agent):
task = TaskerCore::Client.create_task(
  name: 'process_refund',
  namespace: 'customer_success',
  context: {
    ticket_id: 'TICKET-1234',
    customer_id: 'cust_abc123',
    refund_amount: 49.99,
    refund_reason: 'defective',
    customer_email: 'customer@example.com'
  }
)
Payments refund (triggered by fraud detection or internal tooling):
task = TaskerCore::Client.create_task(
  name: 'process_refund',
  namespace: 'payments',
  context: {
    payment_id: 'pay_xyz789',
    refund_amount: 49.99,
    refund_reason: 'fraud',
    customer_email: 'customer@example.com'
  }
)
Same workflow name, different namespaces, different input schemas, different step sequences. Each team owns their namespace independently.
Key Concepts
- Namespace isolation: Workflow names are scoped to namespaces. customer_success.process_refund and payments.process_refund are distinct workflows that coexist without conflict.
- Same name, different implementations: Both teams use the natural name process_refund for their workflow. No team-prefix naming conventions needed.
- Cross-namespace coordination: One team’s step handler can create tasks in another team’s namespace. The boundary is clean — just a namespace and name, plus the required inputs.
- Team ownership: Each namespace has clear ownership. The Payments team can refactor their process_refund workflow without breaking the Customer Success team, as long as the input schema remains compatible.
- Polyglot handlers: Namespace isolation extends to language choice. The Customer Success team writes handlers in TypeScript; the Payments team chose Rust for low-latency gateway operations. The orchestration core doesn’t care — both connect via the same FFI interface.
Full Implementations
The namespace isolation pattern is demonstrated across all four supported languages:
| Language | Customer Success Handlers | Payments Handlers | Templates |
|---|---|---|---|
| Ruby (Rails) | handlers/customer_success/ | handlers/payments/ | customer_success_process_refund.yaml, payments_process_refund.yaml |
| TypeScript (Bun/Hono) | handlers/customer-success.ts | handlers/payments.ts | Same YAML structure |
| Python (FastAPI) | handlers/customer_success.py | handlers/payments.py | Same YAML structure |
| Rust (Axum) | handlers/customer_success.rs | handlers/payments.rs | Same YAML structure |
What’s Next
With namespaces, your teams can scale independently. But as your workflow count grows, you need visibility into what’s happening across all those namespaces. The next posts in this series will explore observability (OpenTelemetry integration and domain events), batch processing patterns, and production debugging workflows.
See this pattern implemented in all four frameworks on the Example Apps page.