ADR: Backoff Logic Consolidation

Status: Implemented Date: 2025-10-29 Deciders: Engineering Team Ticket: TAS-57

Context

The tasker-core distributed workflow orchestration system had multiple, potentially conflicting implementations of exponential backoff logic for step retry coordination. This created several critical issues:

Problems Identified

Configuration Conflicts: Three different maximum backoff values existed across the system:
- SQL Migration (hardcoded): 30 seconds
- Rust Code Default: 60 seconds
- TOML Configuration: 300 seconds
Race Conditions: No atomic guarantees on backoff updates when multiple orchestrators processed the same step failure simultaneously, leading to potential lost updates and inconsistent state.
Implementation Divergence: Dual calculation paths (Rust BackoffCalculator vs SQL fallback) could produce different results due to:
- Different time sources (last_attempted_at vs failure_time)
- Hardcoded vs configurable parameters
- Lack of timestamp synchronization

Hardcoded SQL Values: The SQL migration contained non-configurable exponential backoff logic:

-- Old hardcoded implementation
power(2, COALESCE(attempts, 1)) * interval '1 second', interval '30 seconds'

Decision

We consolidated the backoff logic with the following architectural decisions:

1. Single Source of Truth: TOML Configuration

Decision: All backoff parameters originate from TOML configuration files.

Rationale:

Centralized configuration management
Environment-specific overrides (test/development/production)
Runtime validation and type safety
Clear documentation of system behavior

Implementation:

# config/tasker/base/orchestration.toml
[backoff]
default_backoff_seconds = [1, 2, 4, 8, 16, 32]
max_backoff_seconds = 60  # Standard across all environments
backoff_multiplier = 2.0
jitter_enabled = true
jitter_max_percentage = 0.1

2. Standard Maximum Backoff: 60 Seconds

Decision: Standardize on 60 seconds as the maximum backoff delay.

Rationale:

Balance: 60 seconds balances retry speed with system load reduction
Not Too Short: 30 seconds (old SQL max) insufficient for rate limiting scenarios
Not Too Long: 300 seconds (old TOML config) creates excessive delays in failure scenarios
Alignment: Matches Rust code defaults and production requirements

Impact:

Tasks recover faster from transient failures
Rate-limited APIs get adequate cooldown
User experience improved with reasonable retry times

3. Parameterized SQL Functions

Decision: SQL functions accept configuration parameters with sensible defaults.

Implementation:

CREATE OR REPLACE FUNCTION calculate_step_next_retry_time(
    backoff_request_seconds INTEGER,
    last_attempted_at TIMESTAMP,
    failure_time TIMESTAMP,
    attempts INTEGER,
    p_max_backoff_seconds INTEGER DEFAULT 60,
    p_backoff_multiplier NUMERIC DEFAULT 2.0
) RETURNS TIMESTAMP

Rationale:

Eliminates hardcoded values in SQL
Allows runtime configuration without schema changes
Maintains SQL fallback safety net
Defaults prevent breaking existing code

4. Atomic Backoff Updates with Row-Level Locking

Decision: Use PostgreSQL SELECT FOR UPDATE for atomic backoff updates.

Implementation:

#![allow(unused)]
fn main() {
// Rust BackoffCalculator
async fn update_backoff_atomic(&self, step_uuid: &Uuid, delay_seconds: u32) {
    let mut tx = self.pool.begin().await?;

    // Acquire row-level lock
    sqlx::query!("SELECT ... FROM tasker_workflow_steps WHERE ... FOR UPDATE")
        .fetch_one(&mut *tx).await?;

    // Update with lock held
    sqlx::query!("UPDATE tasker_workflow_steps SET ...")
        .execute(&mut *tx).await?;

    tx.commit().await?;
}
}

Rationale:

Correctness: Prevents lost updates from concurrent orchestrators
Simplicity: PostgreSQL’s row-level locking is well-understood and reliable
Performance: Minimal overhead - locks only held during UPDATE operation
Idempotency: Multiple retries produce consistent results

Alternative Considered: Optimistic concurrency with version field

Rejected: More complex implementation, retry logic in application layer
Benefit of Chosen Approach: Database guarantees atomicity

5. Timing Consistency: Update last_attempted_at with backoff_request_seconds

Decision: Always update both backoff_request_seconds and last_attempted_at atomically.

Rationale:

SQL fallback calculation: last_attempted_at + backoff_request_seconds
Prevents timing window where calculation uses stale timestamp
Single transaction ensures consistency

Before:

#![allow(unused)]
fn main() {
// Old: Only updated backoff_request_seconds
sqlx::query!("UPDATE tasker_workflow_steps SET backoff_request_seconds = $1 ...")
}

After:

#![allow(unused)]
fn main() {
// New: Updates both atomically
sqlx::query!(
    "UPDATE tasker_workflow_steps
     SET backoff_request_seconds = $1,
         last_attempted_at = NOW()
     WHERE ..."
)
}

6. Dual-Path Strategy: Rust Primary, SQL Fallback

Decision: Maintain both Rust calculation and SQL fallback, but ensure they use same configuration.

Rationale:

Rust Primary: Fast, configurable, with jitter support
SQL Fallback: Safety net if backoff_request_seconds is NULL
Consistency: Both paths now use same max delay and multiplier

Path Selection Logic:

CASE
    -- Primary: Rust-calculated backoff
    WHEN backoff_request_seconds IS NOT NULL AND last_attempted_at IS NOT NULL THEN
        last_attempted_at + (backoff_request_seconds * interval '1 second')

    -- Fallback: SQL exponential with configurable params
    WHEN failure_time IS NOT NULL THEN
        failure_time + LEAST(
            power(p_backoff_multiplier, attempts) * interval '1 second',
            p_max_backoff_seconds * interval '1 second'
        )

    ELSE NULL
END

Consequences

Positive

Configuration Clarity: Single max_backoff_seconds value (60s) across entire system
Race Condition Prevention: Atomic updates guarantee correctness in distributed deployments
Flexibility: Parameterized SQL allows future config changes without migrations
Timing Consistency: Synchronized timestamp updates eliminate calculation errors
Maintainability: Clear separation of concerns - Rust for calculation, SQL for fallback
Test Coverage: All 518 unit tests pass, validating correctness

Negative

Performance Overhead: Row-level locking adds ~1-2ms per backoff update
- Mitigation: Negligible compared to step execution time (typically seconds)
- Acceptable Trade-off: Correctness more important than microseconds
Lock Contention Risk: High-frequency failures on same step could cause lock queuing
- Mitigation: Exponential backoff naturally spreads out retries
- Monitoring: Added metrics for lock contention detection
- Real-World Impact: Minimal - failures are infrequent by design
Complexity: Transaction management adds code complexity
- Mitigation: Encapsulated in update_backoff_atomic() method
- Benefit: Hidden behind clean interface, testable in isolation

Neutral

Breaking Change: SQL function signature changed (added parameters)
- Not an Issue: Greenfield alpha project, no production dependencies
- Future-Proof: Default parameters maintain backward compatibility
Configuration Migration: Changed max from 300s → 60s
- Impact: Tasks retry faster, reducing user-perceived latency
- Validation: All tests pass with new values

Validation

Testing

Unit Tests: All 518 unit tests pass
- BackoffCalculator calculation correctness
- Jitter bounds validation
- Max cap enforcement
Database Tests: SQL function behavior validated
- Parameterization with various max values
- Exponential calculation matches Rust
- Boundary conditions (attempts 0, 10, 20)
Integration Tests: End-to-end flow verified
- Worker failure → Backoff applied → Readiness respects delay
- SQL fallback when backoff_request_seconds NULL
- Rust and SQL calculations produce consistent results

Verification Steps Completed

✅ Configuration alignment (TOML, Rust defaults) ✅ SQL function rewrite with parameters ✅ BackoffCalculator atomic updates implemented ✅ Database reset successful with new migration ✅ All unit tests passing ✅ Architecture documentation updated

Implementation Notes

Files Modified

Configuration:
- config/tasker/base/orchestration.toml: max_backoff_seconds = 60
- tasker-shared/src/config/tasker.rs: jitter_max_percentage = 0.1
Database Migration:
- migrations/20250927000000_add_waiting_for_retry_state.sql: Parameterized functions
Rust Implementation:
- tasker-orchestration/src/orchestration/backoff_calculator.rs: Atomic updates
Documentation:
- docs/task-and-step-readiness-and-execution.md: Backoff section added
- This ADR

Migration Path

Since this is greenfield alpha:

Drop and recreate test database
Run migrations with updated SQL functions
Rebuild sqlx cache
Run full test suite

Future Production Path (when needed):

Deploy parameterized SQL functions alongside old functions
Update Rust code to use new atomic methods
Enable in staging, monitor metrics
Gradual production rollout with feature flag
Remove old functions after validation period

Future Enhancements

Potential Improvements (Post-Alpha)

Configuration Table: Store backoff config in database for runtime updates
Metrics: OpenTelemetry metrics for backoff application and lock contention
Adaptive Backoff: Adjust multiplier based on system load or error patterns
Per-Namespace Policies: Different backoff configs per workflow namespace
Backoff Profiles: Named profiles (aggressive, moderate, conservative)

Monitoring Recommendations

Key Metrics to Track:

backoff_calculation_duration_seconds: Time to calculate and apply backoff
backoff_lock_contention_total: Lock acquisition failures
backoff_sql_fallback_total: Frequency of SQL fallback usage
backoff_delay_applied_seconds: Histogram of actual delays

Alert Conditions:

SQL fallback usage > 5% (indicates Rust path failing)
Lock contention > threshold (indicates hot spots)
Backoff delays > max_backoff_seconds (configuration issue)

References

Ownership Removal - Concurrent access patterns

Decision Status: ✅ Implemented and Validated (2025-10-29)

Keyboard shortcuts

Tasker Documentation