Dead Letter Queue (DLQ) System Architecture
Purpose: Investigation tracking system for stuck, stale, or problematic tasks
Last Updated: 2025-11-01
Executive Summary
The DLQ (Dead Letter Queue) system is an investigation tracking system, NOT a task manipulation layer.
Key Principles:
- DLQ tracks “why task is stuck” and “who investigated”
- Resolution happens at step level via step APIs
- No task-level “requeue” - fix the problem steps instead
- Steps carry their own retry, attempt, and state lifecycles independent of DLQ
- DLQ is for audit, visibility, and investigation only
Architecture: PostgreSQL-based system with:
- tasks_dlq table for investigation tracking
- 3 database views for monitoring and analysis
- 6 REST endpoints for operator interaction
- Background staleness detection service
DLQ vs Step Resolution
What DLQ Does
✅ Investigation Tracking:
- Record when and why task became stuck
- Capture complete task snapshot for debugging
- Track operator investigation workflow
- Provide visibility into systemic issues
✅ Visibility and Monitoring:
- Dashboard statistics by DLQ reason
- Prioritized investigation queue for triage
- Proactive staleness monitoring (before DLQ)
- Alerting integration for high-priority entries
What DLQ Does NOT Do
❌ Task Manipulation:
- Does NOT retry failed steps
- Does NOT requeue tasks
- Does NOT modify step state
- Does NOT execute business logic
Why This Separation Matters
Steps are mutable - Operators can:
- Manually resolve failed steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- View step readiness status: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Check retry eligibility and dependency satisfaction
- Trigger next steps by completing blocked steps
DLQ is immutable audit trail - Operators should:
- Review task snapshot to understand what went wrong
- Use step endpoints to fix the underlying problem
- Update DLQ investigation status to track resolution
- Analyze DLQ patterns to prevent future occurrences
DLQ Reasons
staleness_timeout
Definition: Task exceeded state-specific staleness threshold
States:
- waiting_for_dependencies - Default 60 minutes
- waiting_for_retry - Default 30 minutes
- steps_in_process - Default 30 minutes
Template Override: Configure per-template thresholds:
lifecycle:
  max_waiting_for_dependencies_minutes: 120
  max_waiting_for_retry_minutes: 45
  max_steps_in_process_minutes: 60
  max_duration_minutes: 1440 # 24 hours
Resolution Pattern:
- Operator: GET /v1/dlq/task/{task_uuid} - Review task snapshot
- Identify stuck steps: Check current_state in snapshot
- Fix steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Task state machine automatically progresses when steps are fixed
- Operator: PATCH /v1/dlq/entry/{dlq_entry_uuid} - Mark investigation resolved
Prevention: Use /v1/dlq/staleness endpoint for proactive monitoring
max_retries_exceeded
Definition: Step exhausted all retry attempts and remains in Error state
Resolution Pattern:
- Review step results: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Analyze last_failure_at and error details
- Fix underlying issue (infrastructure, data, etc.)
- Manually resolve step: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
- Update DLQ investigation status
dependency_cycle_detected
Definition: Circular dependency detected in workflow step graph
Resolution Pattern:
- Review task template configuration
- Identify cycle in step dependencies
- Update template to break cycle
- Manually cancel affected tasks
- Re-submit tasks with corrected template
worker_unavailable
Definition: No worker available for task’s namespace
Resolution Pattern:
- Check worker service health
- Verify namespace configuration
- Scale worker capacity if needed
- Tasks automatically progress when worker available
manual_dlq
Definition: Operator manually sent task to DLQ for investigation
Resolution Pattern: Custom per-investigation
Database Schema
tasks_dlq Table
CREATE TABLE tasker.tasks_dlq (
dlq_entry_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
task_uuid UUID NOT NULL UNIQUE, -- One pending entry per task
original_state VARCHAR(50) NOT NULL,
dlq_reason dlq_reason NOT NULL,
dlq_timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
task_snapshot JSONB, -- Complete task state for debugging
resolution_status dlq_resolution_status NOT NULL DEFAULT 'pending',
resolution_notes TEXT,
resolved_at TIMESTAMPTZ,
resolved_by VARCHAR(255),
metadata JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Unique constraint: Only one pending DLQ entry per task
CREATE UNIQUE INDEX idx_dlq_unique_pending_task
ON tasker.tasks_dlq (task_uuid)
WHERE resolution_status = 'pending';
Key Fields:
- dlq_entry_uuid - UUID v7 (time-ordered) for investigation tracking
- task_uuid - Foreign key to task (unique for pending entries)
- original_state - Task state when sent to DLQ
- task_snapshot - JSONB snapshot with debugging context
- resolution_status - Investigation workflow status
Database Views
v_dlq_dashboard
Purpose: Aggregated statistics for monitoring dashboard
Columns:
- dlq_reason - Why tasks are in DLQ
- total_entries - Count of entries
- pending, manually_resolved, permanent_failures, cancelled - Breakdown by status
- oldest_entry, newest_entry - Time range
- avg_resolution_time_minutes - Average time to resolve
Use Case: High-level DLQ health monitoring
v_dlq_investigation_queue
Purpose: Prioritized queue for operator triage
Columns:
- Task and DLQ entry UUIDs
- priority_score - Composite score (base reason priority + age factor)
- minutes_in_dlq - How long entry has been pending
- Task metadata for context
Ordering: Priority score DESC (most urgent first)
Use Case: Operator dashboard showing “what to investigate next”
v_task_staleness_monitoring
Purpose: Proactive staleness monitoring BEFORE tasks hit DLQ
Columns:
- task_uuid, namespace_name, task_name
- current_state, time_in_state_minutes
- staleness_threshold_minutes - Threshold for this state
- health_status - healthy | warning | stale
- priority - Task priority for ordering
Health Status Classification:
- healthy - < 80% of threshold
- warning - 80-99% of threshold
- stale - ≥ 100% of threshold
Use Case: Alerting at 80% threshold to prevent DLQ entries
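The health bands above can be expressed as a small function. This is an illustrative Python sketch only; in the real system the classification is computed by the v_task_staleness_monitoring view:

```python
def classify_health(time_in_state_minutes: float, threshold_minutes: float) -> str:
    """Classify a task against its staleness threshold.

    Mirrors the documented bands: healthy < 80%, warning 80-99%,
    stale >= 100% of the threshold.
    """
    pct = (time_in_state_minutes / threshold_minutes) * 100
    if pct >= 100:
        return "stale"
    if pct >= 80:
        return "warning"
    return "healthy"
```

Alerting at the warning band gives operators roughly 20% of the threshold window to intervene before a DLQ entry is created.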
REST API Endpoints
1. List DLQ Entries
GET /v1/dlq?resolution_status=pending&limit=50
Purpose: Browse DLQ entries with filtering
Query Parameters:
- resolution_status - Filter by status (optional)
- limit - Max entries (default: 50)
- offset - Pagination offset (default: 0)
Response: Array of DlqEntry objects
Use Case: General DLQ browsing and pagination
2. Get DLQ Entry with Task Snapshot
GET /v1/dlq/task/{task_uuid}
Purpose: Retrieve most recent DLQ entry for a task with complete snapshot
Response: DlqEntry with full task_snapshot JSONB
Task Snapshot Contains:
- Task UUID, namespace, name
- Current state and time in state
- Staleness threshold
- Task age and priority
- Template configuration
- Detection time
Use Case: Investigation starting point - “why is this task stuck?”
3. Update DLQ Investigation Status
PATCH /v1/dlq/entry/{dlq_entry_uuid}
Purpose: Track investigation workflow
Request Body:
{
"resolution_status": "manually_resolved",
"resolution_notes": "Fixed by manually completing stuck step using step API",
"resolved_by": "operator@example.com",
"metadata": {
"fixed_step_uuid": "...",
"root_cause": "database connection timeout"
}
}
Use Case: Document investigation findings and resolution
4. Get DLQ Statistics
GET /v1/dlq/stats
Purpose: Aggregated statistics for monitoring
Response: Statistics grouped by dlq_reason
Use Case: Dashboard metrics, identifying systemic issues
5. Get Investigation Queue
GET /v1/dlq/investigation-queue?limit=100
Purpose: Prioritized queue for operator triage
Response: Array of DlqInvestigationQueueEntry ordered by priority
Priority Factors:
- Base reason priority (staleness_timeout: 10, max_retries: 20, etc.)
- Age multiplier (older entries = higher priority)
Use Case: “What should I investigate next?”
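A rough sketch of how these factors might combine. The exact formula lives in the v_dlq_investigation_queue view and is not specified here; the age_weight value below is a made-up illustration, and only the two base priorities mentioned above are taken from the document:

```python
# Base reason priorities mentioned above; other reasons are assumed to default to 0.
BASE_PRIORITY = {
    "staleness_timeout": 10,
    "max_retries_exceeded": 20,
}

def priority_score(dlq_reason: str, minutes_in_dlq: float,
                   age_weight: float = 0.1) -> float:
    """Combine base reason priority with an age factor (older = higher)."""
    return BASE_PRIORITY.get(dlq_reason, 0) + minutes_in_dlq * age_weight
```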
6. Get Staleness Monitoring
GET /v1/dlq/staleness?limit=100
Purpose: Proactive monitoring BEFORE tasks hit DLQ
Response: Array of StalenessMonitoring with health status
Ordering: Stale first, then warning, then healthy
Use Case: Alerting and prevention
Alert Integration:
# Alert when warning count exceeds threshold
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'
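The same warning count can be computed in a small script, assuming the endpoint returns a JSON array of objects each carrying a health_status field (shape assumed from the view description above):

```python
def count_by_status(entries: list[dict], status: str) -> int:
    """Count staleness-monitoring entries with the given health_status."""
    return sum(1 for e in entries if e.get("health_status") == status)

# Example payload shape (assumed, for illustration)
sample = [
    {"task_uuid": "a", "health_status": "warning"},
    {"task_uuid": "b", "health_status": "healthy"},
    {"task_uuid": "c", "health_status": "warning"},
]
print(count_by_status(sample, "warning"))  # 2
```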
Step Endpoints and Resolution Workflow
Step Endpoints
1. List Task Steps
GET /v1/tasks/{uuid}/workflow_steps
Returns: Array of steps with readiness status
Key Fields:
- current_state - Step state (pending, enqueued, in_progress, complete, error)
- dependencies_satisfied - Can step execute?
- retry_eligible - Can step retry?
- ready_for_execution - Ready to enqueue?
- attempts / max_attempts - Retry tracking
- last_failure_at - When step last failed
- next_retry_at - When step eligible for retry
Use Case: Understand task execution status
2. Get Step Details
GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
Returns: Single step with full readiness analysis
Use Case: Deep dive into specific step
3. Manually Resolve Step
PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
Purpose: Operator actions to handle stuck or failed steps
Action Types:
- ResetForRetry - Reset attempt counter and return to pending for automatic retry:
{
"action_type": "reset_for_retry",
"reset_by": "operator@example.com",
"reason": "Database connection restored, resetting attempts"
}
- ResolveManually - Mark step as manually resolved without results:
{
"action_type": "resolve_manually",
"resolved_by": "operator@example.com",
"reason": "Non-critical step, bypassing for workflow continuation"
}
- CompleteManually - Complete step with execution results for dependent steps:
{
"action_type": "complete_manually",
"completion_data": {
"result": {
"validated": true,
"score": 95
},
"metadata": {
"manually_verified": true,
"verification_method": "manual_inspection"
}
},
"reason": "Manual verification completed after infrastructure fix",
"completed_by": "operator@example.com"
}
Behavior by Action Type:
- reset_for_retry: Clears attempt counter, transitions to pending, enables automatic retry
- resolve_manually: Transitions to resolved_manually (terminal state)
- complete_manually: Transitions to complete with results available for dependent steps
Common Effects:
- Triggers task state machine re-evaluation
- Task automatically discovers next ready steps
- Task progresses when all dependencies satisfied
Use Case: Unblock stuck workflow by fixing problem step
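The documented action outcomes can be summarized as a lookup table. This is a sketch of the behavior described above, not the server implementation:

```python
# Resulting step state per action type, as documented above.
ACTION_RESULT_STATE = {
    "reset_for_retry": "pending",             # attempts cleared, automatic retry resumes
    "resolve_manually": "resolved_manually",  # terminal state, counts as success for dependents
    "complete_manually": "complete",          # results made available to dependent steps
}

def resulting_state(action_type: str) -> str:
    """Map an operator action to the step state it produces."""
    try:
        return ACTION_RESULT_STATE[action_type]
    except KeyError:
        raise ValueError(f"unknown action_type: {action_type}")
```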
Complete Resolution Workflow
Scenario: Task Stuck in waiting_for_dependencies
1. Operator receives DLQ alert
GET /v1/dlq/investigation-queue
# Response shows task_uuid: abc-123 with high priority
2. Operator reviews task snapshot
GET /v1/dlq/task/abc-123
# Response:
{
"dlq_entry_uuid": "xyz-789",
"task_uuid": "abc-123",
"original_state": "waiting_for_dependencies",
"dlq_reason": "staleness_timeout",
"task_snapshot": {
"task_uuid": "abc-123",
"namespace": "order_processing",
"task_name": "fulfill_order",
"current_state": "error",
"time_in_state_minutes": 65,
"threshold_minutes": 60
}
}
3. Operator checks task steps
GET /v1/tasks/abc-123/workflow_steps
# Response shows:
# step_1: complete
# step_2: error (blocked, max_attempts exceeded)
# step_3: waiting_for_dependencies (blocked by step_2)
4. Operator investigates step_2 failure
GET /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
# Response shows last_failure_at and error details
# Root cause: database connection timeout
5. Operator fixes infrastructure issue
# Fix database connection pool configuration
# Verify database connectivity
6. Operator chooses resolution strategy
Option A: Reset for retry (infrastructure fixed, retry should work):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "reset_for_retry",
"reset_by": "operator@example.com",
"reason": "Database connection pool fixed, resetting attempts for automatic retry"
}
Option B: Resolve manually (bypass step entirely):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "resolve_manually",
"resolved_by": "operator@example.com",
"reason": "Non-critical validation step, bypassing"
}
Option C: Complete manually (provide results for dependent steps):
PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
"action_type": "complete_manually",
"completion_data": {
"result": {
"validation_status": "passed",
"score": 100
},
"metadata": {
"manually_verified": true
}
},
"reason": "Manual validation completed",
"completed_by": "operator@example.com"
}
7. Task state machine automatically progresses
Outcome depends on action type chosen:
If Option A (reset_for_retry):
- Step 2 → pending (attempts reset to 0)
- Automatic retry begins when dependencies satisfied
- Step 2 re-enqueued to worker
- If successful, workflow continues normally
If Option B (resolve_manually):
- Step 2 → resolved_manually (terminal state)
- Step 3 dependencies satisfied (manual resolution counts as success)
- Task transitions: error → enqueuing_steps
- Step 3 enqueued to worker
- Task resumes normal execution
If Option C (complete_manually):
- Step 2 → complete (with operator-provided results)
- Step 3 can consume results from completion_data
- Task transitions: error → enqueuing_steps
- Step 3 enqueued to worker with access to step 2 results
- Task resumes normal execution
8. Operator updates DLQ investigation
PATCH /v1/dlq/entry/xyz-789
{
"resolution_status": "manually_resolved",
"resolution_notes": "Fixed database connection pool configuration. Manually resolved step_2 to unblock workflow. Task resumed execution.",
"resolved_by": "operator@example.com",
"metadata": {
"root_cause": "database_connection_timeout",
"fixed_step_uuid": "{step_2_uuid}",
"infrastructure_fix": "increased_connection_pool_size"
}
}
Step Retry and Attempt Lifecycles
Step State Machine
States:
- pending - Initial state, awaiting dependencies
- enqueued - Sent to worker queue
- in_progress - Worker actively processing
- enqueued_for_orchestration - Result submitted, awaiting orchestration
- complete - Successfully finished
- error - Failed (may be retryable)
- cancelled - Manually cancelled
- resolved_manually - Operator intervention
Retry Logic
Configured per step in template:
retry:
  retryable: true
  max_attempts: 3
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 30000
Retry Eligibility Criteria:
- retryable: true in configuration
- attempts < max_attempts
- Current state is error
- next_retry_at timestamp has passed (backoff elapsed)
Backoff Calculation:
backoff_ms = min(backoff_base_ms * (2 ^ (attempts - 1)), max_backoff_ms)
Example (base=1000ms, max=30000ms):
- Attempt 1 fails → wait 1s
- Attempt 2 fails → wait 2s
- Attempt 3 fails → wait 4s
SQL Function: get_step_readiness_status() calculates retry_eligible and next_retry_at
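The eligibility criteria and backoff formula above can be sketched as follows. This is illustrative Python; the authoritative logic is the get_step_readiness_status() SQL function:

```python
from datetime import datetime, timezone

def backoff_ms(attempts: int, base_ms: int = 1000, max_ms: int = 30000) -> int:
    """Exponential backoff per the formula above: base * 2^(attempts - 1), capped."""
    return min(base_ms * 2 ** (attempts - 1), max_ms)

def retry_eligible(retryable: bool, attempts: int, max_attempts: int,
                   current_state: str, next_retry_at: datetime,
                   now: datetime) -> bool:
    """Apply the four eligibility criteria listed above."""
    return (retryable
            and attempts < max_attempts
            and current_state == "error"
            and now >= next_retry_at)
```

With base=1000ms this reproduces the example sequence: 1s, 2s, 4s after attempts 1-3, capped at 30s.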
Attempt Tracking
Fields (on workflow_steps table):
- attempts - Current attempt count
- max_attempts - Configuration limit
- last_attempted_at - Timestamp of last execution
- last_failure_at - Timestamp of last failure
Workflow:
- Step enqueued → attempts++
- Step fails → Record last_failure_at, calculate next_retry_at
- Backoff elapses → Step becomes retry_eligible: true
- Orchestration discovers ready steps → Step re-enqueued
- Repeat until success or attempts >= max_attempts
Max Attempts Exceeded:
- Step remains in error state
- retry_eligible: false
- Task transitions to error state
- May trigger DLQ entry with reason max_retries_exceeded
Independence from DLQ
Key Point: Step retry logic is INDEPENDENT of DLQ
- Steps retry automatically based on configuration
- DLQ does NOT trigger retries
- DLQ does NOT modify retry counters
- DLQ is pure observation and investigation
Why This Matters:
- Retry logic is predictable and configuration-driven
- DLQ doesn’t interfere with normal workflow execution
- Operators can manually resolve to bypass retry limits
- DLQ provides visibility into retry exhaustion patterns
Staleness Detection
Background Service
Component: tasker-orchestration/src/orchestration/staleness_detector.rs
Configuration:
[staleness_detection]
enabled = true
batch_size = 100
detection_interval_seconds = 300 # 5 minutes
Operation:
- Timer triggers every 5 minutes
- Calls detect_and_transition_stale_tasks() SQL function
- Function identifies tasks exceeding thresholds
- Creates DLQ entries for stale tasks
- Transitions tasks to error state
- Records OpenTelemetry metrics
Staleness Thresholds
Per-State Defaults (configurable):
- waiting_for_dependencies: 60 minutes
- waiting_for_retry: 30 minutes
- steps_in_process: 30 minutes
Per-Template Override:
lifecycle:
  max_waiting_for_dependencies_minutes: 120
  max_waiting_for_retry_minutes: 45
  max_steps_in_process_minutes: 60
Precedence: Template config > Global defaults
Staleness SQL Function
Function: detect_and_transition_stale_tasks()
Architecture:
v_task_state_analysis (base view)
│
├── get_stale_tasks_for_dlq() (discovery function)
│ │
│ └── detect_and_transition_stale_tasks() (main orchestration)
│ ├── create_dlq_entry() (DLQ creation)
│ └── transition_stale_task_to_error() (state transition)
Performance Optimization:
- Expensive joins happen ONCE in base view
- Discovery function filters stale tasks
- Main function processes results in loop
- LEFT JOIN anti-join pattern for excluding tasks with pending DLQ entries
Output: Returns StalenessResult records with:
- Task identification (UUID, namespace, name)
- State and timing information
- action_taken - What happened (enum: TransitionedToDlqAndError, MovedToDlqOnly, etc.)
- moved_to_dlq - Boolean
- transition_success - Boolean
OpenTelemetry Metrics
Metrics Exported
Counters:
- tasker.dlq.entries_created.total - DLQ entries created
- tasker.staleness.tasks_detected.total - Stale tasks detected
- tasker.staleness.tasks_transitioned_to_error.total - Tasks moved to error state
- tasker.staleness.detection_runs.total - Detection cycles
Histograms:
- tasker.staleness.detection.duration - Detection execution time (ms)
- tasker.dlq.time_in_queue - Time in DLQ before resolution (hours)
Gauges:
- tasker.dlq.pending_investigations - Current pending DLQ count
Alert Examples
Prometheus Alerting Rules:
# Alert on high pending investigations
- alert: HighPendingDLQInvestigations
expr: tasker_dlq_pending_investigations > 50
for: 15m
labels:
severity: warning
annotations:
summary: "High number of pending DLQ investigations ({{ $value }})"
# Alert on slow detection cycles
- alert: SlowStalenessDetection
expr: tasker_staleness_detection_duration > 5000
for: 5m
labels:
severity: warning
annotations:
summary: "Staleness detection taking >5s ({{ $value }}ms)"
# Alert on high stale task rate
- alert: HighStalenessRate
expr: rate(tasker_staleness_tasks_detected_total[5m]) > 10
for: 10m
labels:
severity: critical
annotations:
summary: "High rate of stale task detection ({{ $value }}/sec)"
CLI Usage Examples
The tasker-ctl tool provides commands for managing workflow steps directly from the command line.
List Workflow Steps
# List all steps for a task
tasker-ctl task steps <TASK_UUID>
# Example output:
# ✓ Found 3 workflow steps:
#
# Step: validate_input (01933d7c-...)
# State: complete
# Dependencies satisfied: true
# Ready for execution: false
# Attempts: 1/3
#
# Step: process_order (01933d7c-...)
# State: error
# Dependencies satisfied: true
# Ready for execution: false
# Attempts: 3/3
# ⚠ Not retry eligible (max attempts exceeded)
Get Step Details
# Get detailed information about a specific step
tasker-ctl task step <TASK_UUID> <STEP_UUID>
# Example output:
# ✓ Step Details:
#
# UUID: 01933d7c-...
# Name: process_order
# State: error
# Dependencies satisfied: true
# Ready for execution: false
# Retry eligible: false
# Attempts: 3/3
# Last failure: 2025-11-02T14:23:45Z
Reset Step for Retry
When infrastructure is fixed and you want to reset attempt counter:
tasker-ctl task reset-step <TASK_UUID> <STEP_UUID> \
--reason "Database connection pool increased" \
--reset-by "ops-team@example.com"
# Example output:
# ✓ Step reset successfully!
# New state: pending
# Reason: Database connection pool increased
# Reset by: ops-team@example.com
Resolve Step Manually
When you want to bypass a non-critical step:
tasker-ctl task resolve-step <TASK_UUID> <STEP_UUID> \
--reason "Non-critical validation, bypassing" \
--resolved-by "ops-team@example.com"
# Example output:
# ✓ Step resolved manually!
# New state: resolved_manually
# Reason: Non-critical validation, bypassing
# Resolved by: ops-team@example.com
Complete Step Manually with Results
When you’ve manually performed the step’s work and need to provide results:
tasker-ctl task complete-step <TASK_UUID> <STEP_UUID> \
--result '{"validated": true, "score": 95}' \
--metadata '{"verification_method": "manual_review"}' \
--reason "Manual verification after infrastructure fix" \
--completed-by "ops-team@example.com"
# Example output:
# ✓ Step completed manually with results!
# New state: complete
# Reason: Manual verification after infrastructure fix
# Completed by: ops-team@example.com
JSON Formatting Tips:
# Use single quotes around JSON to avoid shell escaping issues
--result '{"key": "value"}'
# For complex JSON, use a heredoc or file
--result "$(cat <<'EOF'
{
"validation_status": "passed",
"checks": ["auth", "permissions", "rate_limit"],
"score": 100
}
EOF
)"
# Or read from a file
--result "$(cat result.json)"
Operational Runbooks
Runbook 1: Investigating High DLQ Count
Trigger: tasker_dlq_pending_investigations > 50
Steps:
- Check DLQ dashboard:
curl /v1/dlq/stats | jq
- Identify dominant reason:
{
"dlq_reason": "staleness_timeout",
"total_entries": 45,
"pending": 45
}
- Get investigation queue:
curl /v1/dlq/investigation-queue?limit=10 | jq
- Check staleness monitoring:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "stale")'
- Identify patterns:
- Common namespace?
- Common task template?
- Common time period?
- Take action:
- Infrastructure issue? → Fix and manually resolve affected tasks
- Template misconfiguration? → Update template thresholds
- Worker unavailable? → Scale worker capacity
- Systemic dependency issue? → Investigate upstream systems
Runbook 2: Proactive Staleness Prevention
Trigger: Regular monitoring (not incident-driven)
Steps:
- Monitor warning threshold:
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'
- Alert when warning count exceeds baseline:
if [ $warning_count -gt 10 ]; then
alert "High staleness warning count: $warning_count tasks at 80%+ threshold"
fi
- Investigate early:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "warning") | {
task_uuid,
current_state,
time_in_state_minutes,
staleness_threshold_minutes,
threshold_percentage: ((.time_in_state_minutes / .staleness_threshold_minutes) * 100)
}'
- Intervene before DLQ:
- Check task steps for blockages
- Review dependencies
- Manually resolve if appropriate
Best Practices
For Operators
✅ DO:
- Use staleness monitoring for proactive prevention
- Document investigation findings in DLQ resolution notes
- Fix root causes, not just symptoms
- Update DLQ investigation status promptly
- Use step endpoints to resolve stuck workflows
- Monitor DLQ statistics for systemic patterns
❌ DON’T:
- Don’t try to “requeue” from DLQ - fix the steps instead
- Don’t ignore warning health status - investigate early
- Don’t manually resolve steps without fixing root cause
- Don’t leave DLQ investigations in pending status indefinitely
For Developers
✅ DO:
- Configure appropriate staleness thresholds per template
- Make steps retryable with sensible backoff
- Implement idempotent step handlers
- Add defensive timeouts to prevent hanging
- Test workflows under failure scenarios
❌ DON’T:
- Don’t set thresholds too low (causes false positives)
- Don’t set thresholds too high (delays detection)
- Don’t make all steps non-retryable
- Don’t ignore DLQ patterns - they indicate design issues
- Don’t rely on DLQ for normal workflow control flow
Testing
Test Coverage
Unit Tests: SQL function testing (17 tests)
- Staleness detection logic
- DLQ entry creation
- Threshold calculation with template overrides
- View query correctness
Integration Tests: Lifecycle testing (4 tests)
- Waiting for dependencies staleness (test_dlq_lifecycle_waiting_for_dependencies_staleness)
- Steps in process staleness (test_dlq_lifecycle_steps_in_process_staleness)
- Proactive monitoring with health status progression (test_dlq_lifecycle_proactive_monitoring)
- Complete investigation workflow (test_dlq_investigation_workflow)
Metrics Tests: OpenTelemetry integration (1 test)
- Staleness detection metrics recording
- DLQ investigation metrics recording
- Pending investigations gauge query
Test Template: tests/fixtures/task_templates/rust/dlq_staleness_test.yaml
- 2-step linear workflow
- 2-minute staleness thresholds for fast test execution
- Test-only template for lifecycle validation
Performance: All 22 tests complete in 0.95s (< 1s target)
Implementation Notes
File Locations:
- Staleness detector: tasker-orchestration/src/orchestration/staleness_detector.rs
- DLQ models: tasker-shared/src/models/orchestration/dlq.rs
- SQL functions: migrations/20251122000004_add_dlq_discovery_function.sql
- Database views: migrations/20251122000003_add_dlq_views.sql
Key Design Decisions:
- Investigation tracking only - no task manipulation
- Step-level resolution via existing step endpoints
- Proactive monitoring at 80% threshold
- Template-specific threshold overrides
- Atomic DLQ entry creation with unique constraint
- Time-ordered UUID v7 for investigation tracking
Future Enhancements
Potential improvements (not currently planned):
- DLQ Patterns Analysis
  - Machine learning to identify systemic issues
  - Automated root cause suggestions
  - Pattern clustering by namespace/template
- Advanced Alerting
  - Anomaly detection on staleness rates
  - Predictive DLQ entry forecasting
  - Correlation with infrastructure metrics
- Investigation Workflow
  - Automated triage rules
  - Escalation policies
  - Integration with incident management systems
- Performance Optimization
  - Materialized views for dashboard
  - Query result caching
  - Incremental staleness detection
End of Documentation