E2E Benchmark Scenarios: Workflow Shapes and Per-Step Lifecycle

Last Updated: 2026-01-23 Audience: Architects, Developers, Performance Engineers Related Docs: Benchmarks README | States & Lifecycles | Actor Pattern

← Back to Benchmarks


What Each Benchmark Measures

Each E2E benchmark executes a complete workflow through the full distributed system: HTTP API call, task initialization, step discovery, message queue dispatch, worker execution, result submission, dependency graph re-evaluation, and task finalization.

A 4-step linear workflow at P50=257ms means the system completes 76+ database operations, 8 message queue round-trips, 16+ state machine transitions, and 4 dependency graph evaluations — all across a 10-instance distributed cluster — in approximately one quarter of a second.
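
Those headline numbers are just the per-step costs (detailed in the next section) multiplied out; a quick back-of-the-envelope sketch:

```rust
// Back-of-the-envelope accounting for the 4-step linear workflow,
// using the per-step costs described in the next section.
const STEPS: u64 = 4;
const DB_OPS_PER_STEP: u64 = 19;     // ~19 database operations per step
const MQ_MSGS_PER_STEP: u64 = 2;     // dispatch + completion
const TRANSITIONS_PER_STEP: u64 = 4; // Pending→Enqueued→InProgress→EnqueuedForOrchestration→Complete

fn main() {
    println!("DB operations:   {}", STEPS * DB_OPS_PER_STEP);      // 76
    println!("MQ round-trips:  {}", STEPS * MQ_MSGS_PER_STEP);     // 8
    println!("Step transitions: {}", STEPS * TRANSITIONS_PER_STEP); // 16 (plus 6 task-level)
}
```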


Per-Step Lifecycle: What Happens for Every Step

Before examining the benchmark scenarios, it’s important to understand the work the system performs for each individual step. Every step in every benchmark goes through this complete lifecycle.

Messaging Backend: Tasker uses a MessagingService trait with provider variants for PGMQ (PostgreSQL-native, single-dependency) and RabbitMQ (high-throughput). The benchmark results documented here were captured using the RabbitMQ backend. The per-step lifecycle is identical regardless of backend — only the transport layer differs.
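
The trait itself lives in the Tasker codebase; as a rough sketch of the shape described here (the method and type names below are illustrative assumptions, not the real API):

```rust
use std::time::Duration;

/// Hypothetical sketch of the pluggable messaging abstraction.
pub struct QueueMessage {
    pub id: u64,
    pub payload: String, // JSON-encoded dispatch or completion message
}

pub trait MessagingService: Send + Sync {
    /// Send a message to a named queue (step dispatch or completion).
    fn send(&self, queue: &str, payload: &str) -> Result<(), String>;

    /// Atomically claim a message, hiding it from other consumers for the
    /// duration of the visibility timeout (30s in these benchmarks).
    fn claim(&self, queue: &str, visibility: Duration) -> Result<Option<QueueMessage>, String>;

    /// Delete a consumed message once processing succeeds.
    fn delete(&self, queue: &str, message_id: u64) -> Result<(), String>;
}

/// Provider variants: same lifecycle, different transport.
pub struct PgmqBackend;     // PostgreSQL-native, single dependency
pub struct RabbitMqBackend; // high-throughput AMQP
```

Keeping send, claim-with-visibility-timeout, and delete behind one trait is what lets the benchmark numbers apply to either backend, with only the transport layer differing.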

State Machine Transitions (per step)

Step:  Pending → Enqueued → InProgress → EnqueuedForOrchestration → Complete
Task:  StepsInProcess → EvaluatingResults → (EnqueuingSteps if more ready) → Complete
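
A minimal sketch of the two state machines as Rust enums, together with the kind of from-state guard described under Idempotency Guarantees below (the enum and function shapes are assumptions; the real types live in the Tasker codebase):

```rust
/// Step states, in the order traversed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StepState {
    Pending,
    Enqueued,
    InProgress,
    EnqueuedForOrchestration,
    Complete,
}

/// Task states traversed while steps execute.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    StepsInProcess,
    EvaluatingResults,
    EnqueuingSteps,
    Complete,
}

/// Guarded transition: validate the from-state before applying, so a
/// duplicate or out-of-order message cannot re-apply a transition.
pub fn transition_step(
    current: StepState,
    from: StepState,
    to: StepState,
) -> Result<StepState, String> {
    if current != from {
        return Err(format!("invalid transition: expected {from:?}, found {current:?}"));
    }
    Ok(to)
}
```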

Database Operations (per step): ~19 operations

| Phase | Operations | Description |
|-------|------------|-------------|
| Discovery | 2 queries | get_next_ready_tasks + get_step_readiness_status_batch (8-CTE query) |
| Enqueueing | 4 writes | Fetch correlation_id, transition Pending→Enqueued (SELECT sort_key + UPDATE most_recent + INSERT transition) |
| Message send | 1 op | Send step dispatch to worker queue (via MessagingService) |
| Worker claim | 1 op | Claim message with visibility timeout (via MessagingService) |
| Worker transition | 3 writes | Transition Enqueued→InProgress |
| Result submission | 4 writes | Transition InProgress→EnqueuedForOrchestration + audit trigger INSERT + send completion to orchestration queue |
| Result processing | 4 writes | Fetch step state, transition →Complete, delete consumed message |
| Task coordination | 1+ queries | Re-evaluate get_step_readiness_status_batch for remaining steps |
| Total | ~19 ops | |

Message Queue Round-Trips (per step): 2

  1. Orchestration → Worker: Step dispatch message (task_uuid, step_uuid, handler, context)
  2. Worker → Orchestration: Completion notification (task_uuid, step_uuid, results)
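
Both payloads carry the fields listed above; a hedged sketch of their shape (struct names and exact field types are assumptions):

```rust
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Orchestration → Worker: step dispatch.
#[derive(Serialize, Deserialize)]
pub struct StepDispatch {
    pub task_uuid: Uuid,
    pub step_uuid: Uuid,
    pub handler: String,            // e.g. "mathematical_sequence"
    pub context: serde_json::Value, // handler input, JSON-serialized
}

/// Worker → Orchestration: completion notification.
#[derive(Serialize, Deserialize)]
pub struct StepCompletion {
    pub task_uuid: Uuid,
    pub step_uuid: Uuid,
    pub results: serde_json::Value,
}
```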

Dependency Graph Evaluation (per step completion)

After each step completes, the orchestration:

  1. Queries all steps in the task for current state
  2. Evaluates dependency edges (parent steps must be Complete)
  3. Calculates retry eligibility (attempts < max_attempts, backoff expired)
  4. Identifies newly-ready steps for enqueueing
  5. Updates task state (more steps ready → EnqueuingSteps, all complete → Complete)
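
A simplified sketch of that readiness check (types and function names are hypothetical; in the real system this work is done by the 8-CTE get_step_readiness_status_batch query plus orchestration logic):

```rust
use std::time::SystemTime;

/// Minimal per-step view used for readiness evaluation (hypothetical shape).
pub struct StepStatus {
    pub complete: bool,              // true once the step reached Complete
    pub pending: bool,               // true while the step is still Pending
    pub parents: Vec<usize>,         // indices of parent steps in the task
    pub attempts: u32,
    pub max_attempts: u32,
    pub backoff_until: Option<SystemTime>,
}

/// A step is newly ready when it is still Pending, every parent is Complete,
/// it has retry budget left, and any backoff window has expired.
pub fn ready_steps(steps: &[StepStatus], now: SystemTime) -> Vec<usize> {
    steps
        .iter()
        .enumerate()
        .filter(|(_, s)| s.pending)
        .filter(|(_, s)| s.parents.iter().all(|&p| steps[p].complete))
        .filter(|(_, s)| s.attempts < s.max_attempts)
        .filter(|(_, s)| s.backoff_until.map_or(true, |t| t <= now))
        .map(|(i, _)| i)
        .collect()
}
```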

Idempotency Guarantees

  • Message visibility timeout: MessagingService prevents duplicate processing (30s window)
  • State machine guards: Transitions validate from-state before applying
  • Atomic claiming: Workers claim via the messaging backend’s atomic read operation
  • Audit trail: Every transition creates an immutable workflow_step_transitions record

Tier 1: Core Performance (Rust Native)

Linear Rust (4 steps, sequential)

Fixture: tests/fixtures/task_templates/rust/mathematical_sequence.yaml Namespace: rust_e2e_linear | Handler: mathematical_sequence

linear_step_1 → linear_step_2 → linear_step_3 → linear_step_4

| Step | Handler | Operation | Depends On | Math |
|------|---------|-----------|------------|------|
| linear_step_1 | LinearStep1 | square | none | 6^2 = 36 |
| linear_step_2 | LinearStep2 | square | step_1 | 36^2 = 1,296 |
| linear_step_3 | LinearStep3 | square | step_2 | 1,296^2 = 1,679,616 |
| linear_step_4 | LinearStep4 | square | step_3 | 1,679,616^2 |
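
The Math column simply chains squaring: each handler squares its parent's output, starting from the initial value of 6 shown in the table. A quick illustrative check:

```rust
// Reproduces the Math column: each step squares the previous step's result,
// starting from the initial value of 6.
fn main() {
    let mut value: u64 = 6;
    for step in 1..=4 {
        value *= value;
        println!("linear_step_{step}: {value}");
    }
    // linear_step_1: 36
    // linear_step_2: 1296
    // linear_step_3: 1679616
    // linear_step_4: 2821109907456
}
```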

Distributed system work for this workflow:

| Metric | Count |
|--------|-------|
| State machine transitions (step) | 16 (4 per step) |
| State machine transitions (task) | 6 (Pending→Init→Enqueue→InProcess→Eval→Complete) |
| Database operations | 76 (19 per step) |
| MQ messages | 8 (2 per step) |
| Dependency evaluations | 4 (after each step completes) |
| HTTP calls (benchmark→API) | 1 create + ~5 polls |
| Sequential stages | 4 |

Why this matters: This is the purest sequential latency test. Each step must fully complete (all 19 DB operations + 2 message round-trips) before the next step can begin. The P50 of ~257ms means each step’s complete lifecycle averages ~64ms including all distributed coordination.


Diamond Rust (4 steps, 2-way parallel)

Fixture: tests/fixtures/task_templates/rust/diamond_pattern.yaml Namespace: rust_e2e_diamond | Handler: diamond_pattern

         diamond_start
           /       \
          /         \
  diamond_branch_b  diamond_branch_c    ← parallel execution
          \         /
           \       /
         diamond_end                    ← 2-way convergence

| Step | Handler | Operation | Depends On | Parallelism |
|------|---------|-----------|------------|-------------|
| diamond_start | Start | square | none | - |
| diamond_branch_b | BranchB | square | start | parallel with C |
| diamond_branch_c | BranchC | square | start | parallel with B |
| diamond_end | End | multiply_and_square | branch_b AND branch_c | convergence |

Distributed system work:

| Metric | Count |
|--------|-------|
| State machine transitions (step) | 16 |
| Database operations | 76 |
| MQ messages | 8 |
| Dependency evaluations | 4 |
| Sequential stages | 3 (start → parallel → end) |
| Convergence points | 1 (diamond_end waits for both branches) |
| Dependency edge checks | 4 (start→B, start→C, B→end, C→end) |

Why this matters: Tests the system’s ability to dispatch and execute steps concurrently. The convergence point (diamond_end) requires the orchestration to correctly evaluate that BOTH branch_b AND branch_c are Complete before enqueueing diamond_end. Under light load, this completes in 3 sequential stages vs 4 for linear (~200ms vs ~257ms at P50, roughly 20% faster).


Tier 2: Complexity Scaling

Complex DAG (7 steps, mixed parallelism)

Fixture: tests/fixtures/task_templates/rust/complex_dag.yaml Namespace: rust_e2e_mixed_dag | Handler: complex_dag

              dag_init
             /        \
   dag_process_left   dag_process_right     ← 2-way parallel
        /    |              |    \
       /     |              |     \
dag_validate dag_transform dag_analyze      ← mixed dependencies
       \          |          /
        \         |         /
         dag_finalize                       ← 3-way convergence

| Step | Depends On | Type |
|------|------------|------|
| dag_init | none | init |
| dag_process_left | init | parallel branch |
| dag_process_right | init | parallel branch |
| dag_validate | left AND right | 2-way convergence |
| dag_transform | left only | linear continuation |
| dag_analyze | right only | linear continuation |
| dag_finalize | validate AND transform AND analyze | 3-way convergence |

Distributed system work:

| Metric | Count |
|--------|-------|
| State machine transitions (step) | 28 (7 steps x 4) |
| Database operations | 133 (7 x 19) |
| MQ messages | 14 (7 x 2) |
| Dependency evaluations | 7 |
| Sequential stages | 4 (init → left/right → validate/transform/analyze → finalize) |
| Convergence points | 2 (dag_validate: 2-way, dag_finalize: 3-way) |
| Dependency edge checks | 8 |

Why this matters: Tests multiple convergence points with different fan-in widths. The orchestration must correctly handle that dag_validate needs 2 parents while dag_finalize needs 3. Also tests mixed patterns: some steps continue from a single parent (transform from left only) while others require multiple.


Hierarchical Tree (8 steps, 4-way convergence)

Fixture: tests/fixtures/task_templates/rust/hierarchical_tree.yaml Namespace: rust_e2e_tree | Handler: hierarchical_tree

                    tree_root
                   /         \
        tree_branch_left    tree_branch_right    ← 2-way parallel
          /       \           /        \
  tree_leaf_d  tree_leaf_e  tree_leaf_f  tree_leaf_g  ← 4-way parallel
         \          |            |          /
          \         |            |         /
           tree_final_convergence               ← 4-way convergence

| Level | Steps | Parallelism | Operation |
|-------|-------|-------------|-----------|
| 0 | root | sequential | square |
| 1 | branch_left, branch_right | 2-way parallel | square |
| 2 | leaf_d, leaf_e, leaf_f, leaf_g | 4-way parallel | square |
| 3 | final_convergence | 4-way convergence | multiply_all_and_square |

Distributed system work:

| Metric | Count |
|--------|-------|
| State machine transitions (step) | 32 (8 x 4) |
| Database operations | 152 (8 x 19) |
| MQ messages | 16 (8 x 2) |
| Dependency evaluations | 8 |
| Sequential stages | 4 (root → branches → leaves → convergence) |
| Maximum fan-out | 2-way (each branch → 2 leaves) |
| Maximum fan-in | 4-way (convergence waits for all 4 leaves) |
| Dependency edge checks | 9 |

Why this matters: Tests the widest convergence pattern — 4 parallel leaves must all complete before the final step can execute. This exercises the dependency evaluation with a large number of parent checks per step. Also tests hierarchical fan-out (root→2 branches→4 leaves).


Conditional Routing (5 steps, 3 executed)

Fixture: tests/fixtures/task_templates/rust/conditional_approval_rust.yaml Namespace: conditional_approval_rust | Handler: approval_routing Context: {"amount": 500, "requester": "benchmark"}

validate_request
       ↓
routing_decision          ← DECISION POINT (routes based on amount)
   /      |      \
  /       |       \
auto_approve  manager_approval  finance_review
(< $1000)     ($1000-$5000)     (> $5000)
  \       |       /
   \      |      /
  finalize_approval               ← deferred convergence

With benchmark context amount=500, only the auto_approve path executes:

validate_request → routing_decision → auto_approve → finalize_approval

| Step | Executed | Condition |
|------|----------|-----------|
| validate_request | Yes | always |
| routing_decision | Yes | always (decision point) |
| auto_approve | Yes | amount < 1000 |
| manager_approval | Skipped | amount 1000-5000 |
| finance_review | Skipped | amount > 5000 |
| finalize_approval | Yes | deferred convergence (waits for executed paths only) |

Distributed system work (executed steps only):

| Metric | Count |
|--------|-------|
| State machine transitions (step) | 16 (4 executed x 4) |
| Database operations | 76 (4 executed x 19) |
| MQ messages | 8 (4 executed x 2) |
| Dependency evaluations | 4 |
| Sequential stages | 4 (validate → decision → approve → finalize) |
| Skipped steps | 2 (manager_approval, finance_review) |

Why this matters: Tests deferred convergence — the finalize_approval step depends on ALL conditional branches, but only blocks on branches that actually executed. The orchestration must correctly determine that manager_approval and finance_review were skipped (not just incomplete) and allow finalize_approval to proceed. Also tests the decision point routing pattern.
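
A hedged sketch of what the routing_decision logic amounts to (thresholds taken from the diagram; the enum, function name, and exact boundary handling are illustrative, not the fixture's actual handler):

```rust
/// Branches the decision point can route to (per the diagram above).
#[derive(Debug, PartialEq)]
pub enum ApprovalRoute {
    AutoApprove,     // amount < $1,000
    ManagerApproval, // $1,000 - $5,000
    FinanceReview,   // > $5,000
}

/// Pick exactly one branch; the other conditional branches are marked
/// skipped, and finalize_approval's deferred convergence waits only on
/// the branch that actually executed.
pub fn route(amount: u64) -> ApprovalRoute {
    match amount {
        a if a < 1_000 => ApprovalRoute::AutoApprove,
        a if a <= 5_000 => ApprovalRoute::ManagerApproval,
        _ => ApprovalRoute::FinanceReview,
    }
}

#[test]
fn benchmark_context_routes_to_auto_approve() {
    // The benchmark context uses amount = 500.
    assert_eq!(route(500), ApprovalRoute::AutoApprove);
}
```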


Tier 3: Cluster Performance

Single Task Linear (4 steps, round-robin across 2 orchestrators)

Same workflow as Tier 1 linear_rust, but benchmarked with round-robin across 2 orchestration instances to measure cluster coordination overhead.

Distributed system work: Same as linear_rust (76 DB ops, 8 MQ messages) plus cluster coordination overhead (shared database, message queue visibility).

Why this matters: Validates that running in cluster mode adds negligible overhead compared to single-instance. The P50 difference (261ms vs 257ms = ~4ms) represents the entire cluster coordination tax.

Concurrent Tasks 2x (2 tasks simultaneously across 2 orchestrators)

Two linear workflows submitted simultaneously, one to each orchestration instance.

Distributed system work:

| Metric | Count |
|--------|-------|
| State machine transitions | 44 (22 per task) |
| Database operations | 152 (76 per task) |
| MQ messages | 16 (8 per task) |
| Concurrent step executions | up to 2 |
| Database connection contention | 2 orchestrators + 2 workers competing |

Why this matters: Tests work distribution across cluster instances under concurrent load. The P50 of ~332-384ms for TWO tasks (vs ~261ms for one) shows that the second task adds only 30-50% latency, not 100% — demonstrating effective parallelism in the cluster.


Tier 4: FFI Language Comparison

Same linear and diamond patterns as Tier 1, but using FFI workers (Ruby via Magnus, Python via PyO3, TypeScript via Bun FFI) instead of native Rust handlers.

Additional per-step work for FFI:

| Phase | Additional Operations |
|-------|-----------------------|
| Handler dispatch | FFI bridge call (Rust → language runtime) |
| Context serialization | JSON serialize context for foreign runtime |
| Result deserialization | JSON deserialize results back to Rust |
| Circuit breaker check | should_allow() (sync, atomic check) |
| Completion callback | FFI completion channel (bounded MPSC) |

FFI overhead: ~23% (~60ms for 4 steps)

The overhead is framework-dominated (Rust dispatch + serialization + completion channel), not language-dominated — all three languages perform within 3ms of each other.
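
A rough sketch of where that framework overhead sits on the Rust side of the bridge (entirely illustrative: `ffi_invoke` stands in for the real Magnus/PyO3/Bun call, and the real path also goes through the bounded completion channel):

```rust
use serde_json::Value;

/// Illustrative FFI dispatch path for one step, mirroring the table above:
/// circuit-breaker check, JSON serialization across the boundary, a
/// foreign-runtime call, and JSON deserialization of the results.
pub fn dispatch_ffi_step(
    handler: &str,
    context: &Value,
    breaker_allows: impl Fn() -> bool,         // should_allow(): sync, atomic check
    ffi_invoke: impl Fn(&str, &str) -> String, // hypothetical bridge call
) -> Result<Value, String> {
    if !breaker_allows() {
        return Err("circuit breaker open".into());
    }
    // Serialize the step context for the foreign runtime.
    let payload = serde_json::to_string(context).map_err(|e| e.to_string())?;
    // Cross the FFI boundary (Rust → Ruby/Python/TypeScript runtime).
    let raw_results = ffi_invoke(handler, &payload);
    // Deserialize results back into Rust before submitting them.
    serde_json::from_str(&raw_results).map_err(|e| e.to_string())
}
```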


Tier 5: Batch Processing

CSV Products 1000 Rows (7 steps, 5-way parallel)

Fixture: tests/fixtures/task_templates/rust/batch_processing_products_csv.yaml Namespace: csv_processing_rust | Handler: csv_product_inventory_analyzer

analyze_csv                    ← reads CSV, returns BatchProcessingOutcome
    ↓
[orchestration creates 5 dynamic workers from batch template]
    ↓
process_csv_batch_001 ──┐
process_csv_batch_002 ──┤
process_csv_batch_003 ──├──→ aggregate_csv_results    ← deferred convergence
process_csv_batch_004 ──┤
process_csv_batch_005 ──┘

| Step | Type | Rows | Operation |
|------|------|------|-----------|
| analyze_csv | batchable | all 1000 | Count rows, compute batch ranges |
| process_csv_batch_001 | batch_worker | 1-200 | Compute inventory metrics |
| process_csv_batch_002 | batch_worker | 201-400 | Compute inventory metrics |
| process_csv_batch_003 | batch_worker | 401-600 | Compute inventory metrics |
| process_csv_batch_004 | batch_worker | 601-800 | Compute inventory metrics |
| process_csv_batch_005 | batch_worker | 801-1000 | Compute inventory metrics |
| aggregate_csv_results | deferred_convergence | all | Merge batch results |

Distributed system work:

| Metric | Count |
|--------|-------|
| State machine transitions (step) | 28 (7 x 4) |
| Database operations | 133 (7 x 19) |
| MQ messages | 14 (7 x 2) |
| Dynamic step creation | 5 (batch workers created at runtime) |
| Dependency edges (dynamic) | 6 (batch workers → analyze, aggregate → batch_template) |
| File I/O operations | 6 (1 analysis read + 5 batch reads of CSV) |
| CSV rows processed | 1000 |
| Sequential stages | 3 (analyze → 5 parallel workers → aggregate) |

Why this matters: Tests the most complex orchestration pattern — dynamic step generation. The analyze_csv step returns a BatchProcessingOutcome that tells the orchestration to create N worker steps at runtime. The orchestration must:

  1. Create new step records in the database
  2. Create dependency edges dynamically
  3. Enqueue all batch workers for parallel execution
  4. Use deferred convergence for the aggregate step (waits for batch template, not specific steps)

At P50=358-368ms for 1000 rows, throughput is ~2,700 rows/second with all the distributed system overhead included.
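
A simplified sketch of the batch-range computation analyze_csv performs (assuming the even 200-row chunks shown in the table; the real BatchProcessingOutcome shape may differ):

```rust
/// Hypothetical shape of one range handed back to the orchestration so it
/// can create the batch worker steps dynamically.
pub struct BatchRange {
    pub start_row: usize, // 1-based, inclusive
    pub end_row: usize,   // inclusive
}

/// Split `total_rows` into `batches` contiguous ranges
/// (1000 rows / 5 batches -> 1-200, 201-400, ..., 801-1000).
pub fn batch_ranges(total_rows: usize, batches: usize) -> Vec<BatchRange> {
    let size = (total_rows + batches - 1) / batches;
    (0..batches)
        .map(|i| BatchRange {
            start_row: i * size + 1,
            end_row: ((i + 1) * size).min(total_rows),
        })
        .collect()
}
```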


Summary: Operations Per Benchmark

| Benchmark | Steps | DB Ops | MQ Msgs | Transitions | Convergence | P50 |
|-----------|-------|--------|---------|-------------|-------------|-----|
| Linear Rust | 4 | 76 | 8 | 22 | none | 257ms |
| Diamond Rust | 4 | 76 | 8 | 22 | 2-way | 200-259ms |
| Complex DAG | 7 | 133 | 14 | 34 | 2+3-way | 382ms |
| Hierarchical Tree | 8 | 152 | 16 | 38 | 4-way | 389-426ms |
| Conditional | 4* | 76 | 8 | 22 | deferred | 251-262ms |
| Cluster single | 4 | 76 | 8 | 22 | none | 261ms |
| Cluster 2x | 8 | 152 | 16 | 44 | none | 332-384ms |
| FFI linear | 4 | 76 | 8 | 22 | none | 312-316ms |
| FFI diamond | 4 | 76 | 8 | 22 | 2-way | 260-275ms |
| Batch 1000 rows | 7 | 133 | 14 | 34 | deferred | 358-368ms |

*Conditional executes 4 of 5 defined steps (2 skipped by routing decision)


Performance per Sequential Stage

For workflows with known sequential depth, we can calculate per-stage overhead:

| Benchmark | Sequential Stages | P50 | Per-Stage Avg |
|-----------|-------------------|-----|---------------|
| Linear (4 seq) | 4 | 257ms | 64ms |
| Diamond (3 seq) | 3 | 200ms* | 67ms |
| Complex DAG (4 seq) | 4 | 382ms | 96ms** |
| Tree (4 seq) | 4 | 389ms | 97ms** |
| Conditional (4 seq) | 4 | 257ms | 64ms |
| Batch (3 seq) | 3 | 363ms | 121ms*** |

*Diamond under light load (parallelism helping)

**Higher per-stage due to multiple steps per stage (more DB ops per evaluation cycle)

***Higher per-stage due to batch worker creation overhead + file I/O

The ~64ms per sequential stage for simple patterns represents the total distributed round-trip: orchestration discovery → MQ dispatch → worker claim → handler execute (~1ms for math operations) → MQ completion → orchestration result processing → dependency re-evaluation. The handler execution itself is negligible; the 64ms is almost entirely orchestration infrastructure.