RCA: Parallel Execution Exposing Latent Timing Bugs
Date: 2025-12-07
Related: Worker Dual-Channel Event System
Status: Resolved
Impact: Flaky E2E test test_mixed_workflow_scenario
Executive Summary
During the dual-channel event system implementation (fire-and-forget handler dispatch), a previously hidden bug in the SQL function get_task_execution_context() became consistently reproducible. The bug was a logical precedence error that had always existed but was masked by sequential execution timing. Introducing true parallelism changed the probability distribution of state combinations, transforming a Heisenbug into a Bohrbug.
This document captures the root cause analysis as a reference for understanding how architectural changes to concurrency can surface latent bugs in distributed systems.
The Bug
Symptom
Test test_mixed_workflow_scenario intermittently failed with timeout waiting for BlockedByFailures status, while the API returned HasReadySteps.
⏳ Waiting for task to fail (max 10s)...
Task execution status: processing (processing)
Task execution status: has_ready_steps (has_ready_steps) ← Wrong!
Task execution status: has_ready_steps (has_ready_steps)
... timeout ...
Root Cause
The SQL function get_task_execution_context() checked ready_steps > 0 BEFORE permanently_blocked_steps > 0:
-- BUGGY: Wrong precedence order
CASE
WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps' -- ← Checked FIRST
WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures'
...
END as execution_status
When a task had BOTH permanently blocked steps AND ready steps, the function returned has_ready_steps instead of blocked_by_failures.
The Fix
Migration 20251207000000_fix_execution_status_priority.sql corrects the precedence:
-- FIXED: blocked_by_failures takes semantic priority
CASE
WHEN COALESCE(ast.permanently_blocked_steps, 0) > 0 THEN 'blocked_by_failures' -- ← Now FIRST
WHEN COALESCE(ast.ready_steps, 0) > 0 THEN 'has_ready_steps'
...
END as execution_status
Why Did This Surface Now?
The Test Scenario
# 3 parallel steps with NO dependencies (can all run concurrently)
steps:
- name: success_step
retryable: false
- name: permanent_error_step
retryable: false # Fails permanently
- name: retryable_error_step
retryable: true
max_attempts: 2 # Fails, but becomes "ready" after backoff
Before: Blocking Handler Dispatch
The original architecture used blocking .call() in the event handler:
#![allow(unused)]
fn main() {
// workers/rust/src/event_handler.rs (before)
let result = handler.call(&step).await; // BLOCKS until handler completes
}
This created effectively sequential execution even for independent steps:
Timeline (Sequential):
────────────────────────────────────────────────────────────────────
t=0ms [success_step starts]
t=50ms [success_step completes]
t=51ms [permanent_error_step starts]
t=100ms [permanent_error_step fails → PERMANENTLY BLOCKED]
t=101ms [retryable_error_step starts]
t=150ms [retryable_error_step fails → enters 100ms backoff]
t=151ms ──► STATUS CHECK
permanently_blocked_steps = 1
ready_steps = 0 (still in backoff!)
──► Returns: blocked_by_failures ✓
The backoff hadn't elapsed yet because steps were processed one at a time.
After: Fire-and-Forget Handler Dispatch
The dual-channel event system introduced non-blocking dispatch via channels:
#![allow(unused)]
fn main() {
// Fire-and-forget pattern
dispatch_sender.send(DispatchHandlerMessage { step, ... }).await;
// Returns immediately - handler executes in separate task
}
This enables true parallel execution:
Timeline (Parallel):
────────────────────────────────────────────────────────────────────
t=0ms [success_step starts]──────────────────►[completes t=50ms]
t=0ms [permanent_error_step starts]──────────►[fails t=50ms → BLOCKED]
t=0ms [retryable_error_step starts]──────────►[fails t=50ms → backoff]
t=150ms [retryable_error_step backoff expires → becomes READY]
t=151ms ──► STATUS CHECK
permanently_blocked_steps = 1
ready_steps = 1 (backoff elapsed!)
──► Returns: has_ready_steps ✗ (BUG!)
Probability Analysis
The “Both States” Window
The bug manifests when checking status while the task has BOTH:
- At least one permanently blocked step
- At least one ready step (e.g., retryable step after backoff)
Sequential Processing:
├────────────────────────────────────────────────────────────────────┤
│▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓│
│ Very LOW probability of "both states" window │
│ Steps complete serially; backoff rarely overlaps with status check │
└────────────────────────────────────────────────────────────────────┘
Parallel Processing:
├────────────────────────────────────────────────────────────────────┤
│░░░░░░░░░░░░████████████████████████████████████████████░░░░░░░░░░░│
│ ↑ ↑ │
│ │ HIGH probability "both states" window │ │
│ │ All steps complete ~simultaneously │ │
│ │ Backoff expires while status is polled │ │
└────────────────────────────────────────────────────────────────────┘
Quantifying the Change
| Metric | Sequential | Parallel |
|---|---|---|
| Step completion spread | ~150ms | ~50ms |
| “Both states” window duration | ~0ms (transient) | ~100ms+ (stable) |
| Probability of hitting bug | <1% | >50% |
| Bug classification | Heisenbug | Bohrbug |
Bug Classification
Heisenbug → Bohrbug Transformation
| Property | Before (Heisenbug) | After (Bohrbug) |
|---|---|---|
| Reproducibility | Intermittent, timing-dependent | Consistent, deterministic |
| Root cause | Logical precedence error | Same |
| Visibility | Hidden by sequential timing | Exposed by parallel timing |
| Debug difficulty | Extremely hard (may never reproduce) | Straightforward |
| Detection in CI | Might pass for months | Fails consistently under load |
Why This Matters
- The bug was always present - It existed in the SQL function since it was written
- Sequential execution hid it - Incidental timing prevented the problematic state
- Parallelization surfaced it - Not by introducing a bug, but by applying concurrency pressure
- This is good - Better to find in tests than production
Semantic Correctness
The Correct Mental Model
“If ANY step is permanently blocked, the task cannot make further progress toward completion, even if other steps are ready to execute.”
A task with permanent failures is blocked by failures regardless of what else might be runnable. The old code implicitly assumed:
“If work is available, we’re making progress”
This is incorrect for workflows where:
- Convergence points require ALL branches to complete
- Final task status depends on ALL steps succeeding
- Partial progress doesn’t constitute overall success
State Precedence (Correct Order)
-- 1. Permanent failures block overall progress
WHEN permanently_blocked_steps > 0 THEN 'blocked_by_failures'
-- 2. Ready work can continue (but may not lead to completion)
WHEN ready_steps > 0 THEN 'has_ready_steps'
-- 3. Work in flight
WHEN in_progress_steps > 0 THEN 'processing'
-- 4. All done
WHEN completed_steps = total_steps THEN 'all_complete'
-- 5. Waiting for dependencies
ELSE 'waiting_for_dependencies'
Patterns to Watch For
1. State Combination Explosions
Sequential processing often means only one state at a time. Parallelism creates state combinations that were previously impossible:
Sequential: A → B → C (states are mutually exclusive in time)
Parallel: A + B + C (states can coexist)
Watch for: CASE statements, if/else chains, and state machines that assume mutual exclusivity.
2. Timing-Dependent Invariants
Code may accidentally depend on timing:
#![allow(unused)]
fn main() {
// Assumes step_a completes before step_b starts
if step_a.is_complete() {
// Safe to check step_b
}
}
Watch for: Implicit ordering assumptions in status calculations, rollups, and aggregations.
3. Transient vs Stable States
Some states were transient under sequential processing but become stable under parallel:
| State | Sequential | Parallel |
|---|---|---|
| “1 complete, 1 in-progress” | Transient (~ms) | Stable (seconds) |
| “blocked + ready” | Nearly impossible | Common |
| “multiple errors” | Rare | Frequent |
Watch for: Error handling, status rollups, and progress calculations that assumed single-state scenarios.
4. Test Timing Sensitivity
Tests written for sequential execution may have implicit timing dependencies:
#![allow(unused)]
fn main() {
// This worked when steps were sequential
wait_for_status(BlockedByFailures, timeout: 10s);
// But fails when parallel execution creates a different status first
}
Watch for: Tests that pass in isolation but fail under concurrent load.
Verification Strategy
After Parallelization Changes
- Run tests multiple times - Timing bugs may not manifest on first run
- Run tests under load - Concurrent test execution increases probability
- Add explicit state combination tests - Test scenarios that were previously impossible
- Review CASE/if-else precedence - Check all status calculations for correct ordering
Example: Testing State Combinations
#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_blocked_with_ready_steps() {
// Explicitly create the state combination
let task = create_task_with_parallel_steps();
// Force one step to permanent failure
force_step_to_permanent_failure(&task, "step_a").await;
// Force another step to ready (after backoff)
force_step_to_ready_after_backoff(&task, "step_b").await;
// Verify correct precedence
let status = get_task_execution_status(&task).await;
assert_eq!(status, ExecutionStatus::BlockedByFailures);
}
}
Conclusion
This bug exemplifies how architectural improvements to concurrency can surface latent correctness issues. The parallelization didn’t introduce a bug—it revealed one that had been hidden by incidental sequential timing.
This is a positive outcome: the bug was found in testing rather than production. The fix ensures correct semantic precedence regardless of execution timing, making the system more robust under parallel load.
Key Takeaways
- Parallelization is a stress test - It exposes timing-dependent bugs
- Sequential execution hides bugs - Incidental ordering masks logical errors
- State precedence matters - Review all status calculations when adding concurrency
- Heisenbugs become Bohrbugs - Parallel execution makes rare bugs reproducible
- This is good engineering - Finding bugs through architectural improvements validates the testing strategy
References
- Migration:
migrations/20251207000000_fix_execution_status_priority.sql - Test:
tests/e2e/ruby/error_scenarios_test.rs::test_mixed_workflow_scenario - SQL Function:
get_task_execution_context()inmigrations/20251001000000_fix_permanently_blocked_detection.sql - Dual-Channel Event System ADR