
Dead Letter Queue (DLQ) System Architecture

Purpose: Investigation tracking system for stuck, stale, or problematic tasks

Last Updated: 2025-11-01


Executive Summary

The DLQ (Dead Letter Queue) system is an investigation tracking system, NOT a task manipulation layer.

Key Principles:

  • DLQ tracks “why a task is stuck” and “who investigated it”
  • Resolution happens at step level via step APIs
  • No task-level “requeue” - fix the problem steps instead
  • Steps carry their own retry, attempt, and state lifecycles independent of DLQ
  • DLQ is for audit, visibility, and investigation only

Architecture: PostgreSQL-based system with:

  • tasks_dlq table for investigation tracking
  • 3 database views for monitoring and analysis
  • 6 REST endpoints for operator interaction
  • Background staleness detection service

DLQ vs Step Resolution

What DLQ Does

Investigation Tracking:

  • Record when and why task became stuck
  • Capture complete task snapshot for debugging
  • Track operator investigation workflow
  • Provide visibility into systemic issues

Visibility and Monitoring:

  • Dashboard statistics by DLQ reason
  • Prioritized investigation queue for triage
  • Proactive staleness monitoring (before DLQ)
  • Alerting integration for high-priority entries

What DLQ Does NOT Do

Task Manipulation:

  • Does NOT retry failed steps
  • Does NOT requeue tasks
  • Does NOT modify step state
  • Does NOT execute business logic

Why This Separation Matters

Steps are mutable - Operators can:

  • Manually resolve failed steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  • View step readiness status: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  • Check retry eligibility and dependency satisfaction
  • Trigger next steps by completing blocked steps

DLQ is immutable audit trail - Operators should:

  • Review task snapshot to understand what went wrong
  • Use step endpoints to fix the underlying problem
  • Update DLQ investigation status to track resolution
  • Analyze DLQ patterns to prevent future occurrences
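
Putting the two halves together, a minimal investigation cycle looks like this (sketch only; the host is omitted as in the other examples here, and UUIDs/payload values are placeholders):

# 1. Understand why the task is stuck (DLQ is read-only context)
curl -s "/v1/dlq/task/$TASK_UUID" | jq '{dlq_reason, task_snapshot}'

# 2. Fix the actual problem at the step level
curl -s -X PATCH "/v1/tasks/$TASK_UUID/workflow_steps/$STEP_UUID" \
  -H 'Content-Type: application/json' \
  -d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Infrastructure fixed"}'

# 3. Close the loop on the investigation record
curl -s -X PATCH "/v1/dlq/entry/$DLQ_ENTRY_UUID" \
  -H 'Content-Type: application/json' \
  -d '{"resolution_status": "manually_resolved", "resolved_by": "operator@example.com"}'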

DLQ Reasons

staleness_timeout

Definition: Task exceeded state-specific staleness threshold

States:

  • waiting_for_dependencies - Default 60 minutes
  • waiting_for_retry - Default 30 minutes
  • steps_in_process - Default 30 minutes

Template Override: Configure per-template thresholds:

lifecycle:
  max_waiting_for_dependencies_minutes: 120
  max_waiting_for_retry_minutes: 45
  max_steps_in_process_minutes: 60
  max_duration_minutes: 1440  # 24 hours

Resolution Pattern:

  1. Operator: GET /v1/dlq/task/{task_uuid} - Review task snapshot
  2. Identify stuck steps: Check current_state in snapshot
  3. Fix steps: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  4. Task state machine automatically progresses once the steps are fixed
  5. Operator: PATCH /v1/dlq/entry/{dlq_entry_uuid} - Mark investigation resolved

Prevention: Use /v1/dlq/staleness endpoint for proactive monitoring

max_retries_exceeded

Definition: Step exhausted all retry attempts and remains in Error state

Resolution Pattern:

  1. Review step results: GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  2. Analyze last_failure_at and error details
  3. Fix underlying issue (infrastructure, data, etc.)
  4. Manually resolve step: PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}
  5. Update DLQ investigation status
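
As a concrete sketch of this pattern (UUID placeholders; endpoints as documented above):

# inspect the exhausted step
curl -s "/v1/tasks/$TASK_UUID/workflow_steps/$STEP_UUID" \
  | jq '{current_state, attempts, max_attempts, last_failure_at, retry_eligible}'

# after the root cause is fixed, reset the attempt counter for automatic retry
curl -s -X PATCH "/v1/tasks/$TASK_UUID/workflow_steps/$STEP_UUID" \
  -H 'Content-Type: application/json' \
  -d '{"action_type": "reset_for_retry", "reset_by": "operator@example.com", "reason": "Root cause fixed"}'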

dependency_cycle_detected

Definition: Circular dependency detected in workflow step graph

Resolution Pattern:

  1. Review task template configuration
  2. Identify cycle in step dependencies
  3. Update template to break cycle
  4. Manually cancel affected tasks
  5. Re-submit tasks with corrected template

worker_unavailable

Definition: No worker available for task’s namespace

Resolution Pattern:

  1. Check worker service health
  2. Verify namespace configuration
  3. Scale worker capacity if needed
  4. Tasks automatically progress when a worker becomes available

manual_dlq

Definition: Operator manually sent task to DLQ for investigation

Resolution Pattern: Custom per-investigation


Database Schema

tasks_dlq Table

CREATE TABLE tasker.tasks_dlq (
    dlq_entry_uuid UUID PRIMARY KEY DEFAULT uuid_generate_v7(),
    task_uuid UUID NOT NULL,  -- One pending entry per task (enforced by partial index below)
    original_state VARCHAR(50) NOT NULL,
    dlq_reason dlq_reason NOT NULL,
    dlq_timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    task_snapshot JSONB,  -- Complete task state for debugging
    resolution_status dlq_resolution_status NOT NULL DEFAULT 'pending',
    resolution_notes TEXT,
    resolved_at TIMESTAMPTZ,
    resolved_by VARCHAR(255),
    metadata JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Unique constraint: Only one pending DLQ entry per task
CREATE UNIQUE INDEX idx_dlq_unique_pending_task
    ON tasker.tasks_dlq (task_uuid)
    WHERE resolution_status = 'pending';
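
The partial index gives “one pending entry per task” semantics: a task can accumulate many resolved entries over its lifetime, but never two pending ones. A quick psql demonstration (assumes direct database access; the UUID is a placeholder):

psql "$DATABASE_URL" <<'SQL'
-- first pending entry for a task succeeds
INSERT INTO tasker.tasks_dlq (task_uuid, original_state, dlq_reason)
VALUES ('00000000-0000-7000-8000-000000000001', 'waiting_for_dependencies', 'staleness_timeout');

-- a second pending entry for the same task raises unique_violation
-- against idx_dlq_unique_pending_task
INSERT INTO tasker.tasks_dlq (task_uuid, original_state, dlq_reason)
VALUES ('00000000-0000-7000-8000-000000000001', 'waiting_for_dependencies', 'staleness_timeout');
SQL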

Key Fields:

  • dlq_entry_uuid - UUID v7 (time-ordered) for investigation tracking
  • task_uuid - Foreign key to task (unique for pending entries)
  • original_state - Task state when sent to DLQ
  • task_snapshot - JSONB snapshot with debugging context
  • resolution_status - Investigation workflow status

Database Views

v_dlq_dashboard

Purpose: Aggregated statistics for monitoring dashboard

Columns:

  • dlq_reason - Why tasks are in DLQ
  • total_entries - Count of entries
  • pending, manually_resolved, permanent_failures, cancelled - Breakdown by status
  • oldest_entry, newest_entry - Time range
  • avg_resolution_time_minutes - Average time to resolve

Use Case: High-level DLQ health monitoring

v_dlq_investigation_queue

Purpose: Prioritized queue for operator triage

Columns:

  • Task and DLQ entry UUIDs
  • priority_score - Composite score (base reason priority + age factor)
  • minutes_in_dlq - How long entry has been pending
  • Task metadata for context

Ordering: Priority score DESC (most urgent first)

Use Case: Operator dashboard showing “what to investigate next”
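
The view can also be queried directly when building dashboards (psql sketch; this assumes the views live in the tasker schema alongside tasks_dlq):

psql "$DATABASE_URL" -c "
  SELECT task_uuid, priority_score, minutes_in_dlq
    FROM tasker.v_dlq_investigation_queue
   ORDER BY priority_score DESC
   LIMIT 10;"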

v_task_staleness_monitoring

Purpose: Proactive staleness monitoring BEFORE tasks hit DLQ

Columns:

  • task_uuid, namespace_name, task_name
  • current_state, time_in_state_minutes
  • staleness_threshold_minutes - Threshold for this state
  • health_status - healthy | warning | stale
  • priority - Task priority for ordering

Health Status Classification:

  • healthy - < 80% of threshold
  • warning - 80-99% of threshold
  • stale - ≥ 100% of threshold

Use Case: Alerting at 80% threshold to prevent DLQ entries
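
A direct query over the view for everything past the warning line (same schema assumption as above):

psql "$DATABASE_URL" -c "
  SELECT task_uuid, current_state, time_in_state_minutes, staleness_threshold_minutes
    FROM tasker.v_task_staleness_monitoring
   WHERE health_status <> 'healthy'
   ORDER BY health_status, priority DESC;"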


REST API Endpoints

1. List DLQ Entries

GET /v1/dlq?resolution_status=pending&limit=50

Purpose: Browse DLQ entries with filtering

Query Parameters:

  • resolution_status - Filter by status (optional)
  • limit - Max entries (default: 50)
  • offset - Pagination offset (default: 0)

Response: Array of DlqEntry objects

Use Case: General DLQ browsing and pagination
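
For example, walking all pending entries a page at a time (shell sketch; jq detects the final, empty page):

offset=0
while :; do
  page=$(curl -s "/v1/dlq?resolution_status=pending&limit=50&offset=$offset")
  count=$(echo "$page" | jq length)
  [ "$count" -eq 0 ] && break
  echo "$page" | jq -r '.[].dlq_entry_uuid'
  offset=$((offset + 50))
done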


2. Get DLQ Entry with Task Snapshot

GET /v1/dlq/task/{task_uuid}

Purpose: Retrieve most recent DLQ entry for a task with complete snapshot

Response: DlqEntry with full task_snapshot JSONB

Task Snapshot Contains:

  • Task UUID, namespace, name
  • Current state and time in state
  • Staleness threshold
  • Task age and priority
  • Template configuration
  • Detection time

Use Case: Investigation starting point - “why is this task stuck?”


3. Update DLQ Investigation Status

PATCH /v1/dlq/entry/{dlq_entry_uuid}

Purpose: Track investigation workflow

Request Body:

{
  "resolution_status": "manually_resolved",
  "resolution_notes": "Fixed by manually completing stuck step using step API",
  "resolved_by": "operator@example.com",
  "metadata": {
    "fixed_step_uuid": "...",
    "root_cause": "database connection timeout"
  }
}

Use Case: Document investigation findings and resolution


4. Get DLQ Statistics

GET /v1/dlq/stats

Purpose: Aggregated statistics for monitoring

Response: Statistics grouped by dlq_reason

Use Case: Dashboard metrics, identifying systemic issues


5. Get Investigation Queue

GET /v1/dlq/investigation-queue?limit=100

Purpose: Prioritized queue for operator triage

Response: Array of DlqInvestigationQueueEntry ordered by priority

Priority Factors:

  • Base reason priority (staleness_timeout: 10, max_retries: 20, etc.)
  • Age multiplier (older entries = higher priority)

Use Case: “What should I investigate next?”


6. Get Staleness Monitoring

GET /v1/dlq/staleness?limit=100

Purpose: Proactive monitoring BEFORE tasks hit DLQ

Response: Array of StalenessMonitoring with health status

Ordering: Stale first, then warning, then healthy

Use Case: Alerting and prevention

Alert Integration:

# Alert when warning count exceeds threshold
curl /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length'

Step Endpoints and Resolution Workflow

Step Endpoints

1. List Task Steps

GET /v1/tasks/{uuid}/workflow_steps

Returns: Array of steps with readiness status

Key Fields:

  • current_state - Step state (pending, enqueued, in_progress, complete, error)
  • dependencies_satisfied - Can step execute?
  • retry_eligible - Can step retry?
  • ready_for_execution - Ready to enqueue?
  • attempts / max_attempts - Retry tracking
  • last_failure_at - When step last failed
  • next_retry_at - When step eligible for retry

Use Case: Understand task execution status
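
A quick triage filter over these fields (jq sketch; field names as listed above, UUID is a placeholder):

# surface steps that are failing or not yet runnable
curl -s "/v1/tasks/$TASK_UUID/workflow_steps" \
  | jq '.[] | select(.current_state == "error" or .ready_for_execution == false)
        | {current_state, dependencies_satisfied, retry_eligible, attempts, max_attempts, next_retry_at}'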


2. Get Step Details

GET /v1/tasks/{uuid}/workflow_steps/{step_uuid}

Returns: Single step with full readiness analysis

Use Case: Deep dive into specific step


3. Manually Resolve Step

PATCH /v1/tasks/{uuid}/workflow_steps/{step_uuid}

Purpose: Operator actions to handle stuck or failed steps

Action Types:

  1. ResetForRetry - Reset attempt counter and return to pending for automatic retry:
{
  "action_type": "reset_for_retry",
  "reset_by": "operator@example.com",
  "reason": "Database connection restored, resetting attempts"
}
  2. ResolveManually - Mark step as manually resolved without results:
{
  "action_type": "resolve_manually",
  "resolved_by": "operator@example.com",
  "reason": "Non-critical step, bypassing for workflow continuation"
}
  3. CompleteManually - Complete step with execution results for dependent steps:
{
  "action_type": "complete_manually",
  "completion_data": {
    "result": {
      "validated": true,
      "score": 95
    },
    "metadata": {
      "manually_verified": true,
      "verification_method": "manual_inspection"
    }
  },
  "reason": "Manual verification completed after infrastructure fix",
  "completed_by": "operator@example.com"
}

Behavior by Action Type:

  • reset_for_retry: Clears attempt counter, transitions to pending, enables automatic retry
  • resolve_manually: Transitions to resolved_manually (terminal state)
  • complete_manually: Transitions to complete with results available for dependent steps

Common Effects:

  • Triggers task state machine re-evaluation
  • Task automatically discovers next ready steps
  • Task progresses when all dependencies satisfied

Use Case: Unblock stuck workflow by fixing problem step


Complete Resolution Workflow

Scenario: Task Stuck in waiting_for_dependencies

1. Operator receives DLQ alert

GET /v1/dlq/investigation-queue
# Response shows task_uuid: abc-123 with high priority

2. Operator reviews task snapshot

GET /v1/dlq/task/abc-123
# Response:
{
  "dlq_entry_uuid": "xyz-789",
  "task_uuid": "abc-123",
  "original_state": "waiting_for_dependencies",
  "dlq_reason": "staleness_timeout",
  "task_snapshot": {
    "task_uuid": "abc-123",
    "namespace": "order_processing",
    "task_name": "fulfill_order",
    "current_state": "error",
    "time_in_state_minutes": 65,
    "threshold_minutes": 60
  }
}

3. Operator checks task steps

GET /v1/tasks/abc-123/workflow_steps
# Response shows:
# step_1: complete
# step_2: error (blocked, max_attempts exceeded)
# step_3: pending (awaiting dependencies, blocked by step_2)

4. Operator investigates step_2 failure

GET /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
# Response shows last_failure_at and error details
# Root cause: database connection timeout

5. Operator fixes infrastructure issue

# Fix database connection pool configuration
# Verify database connectivity

6. Operator chooses resolution strategy

Option A: Reset for retry (infrastructure fixed, retry should work):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "reset_for_retry",
  "reset_by": "operator@example.com",
  "reason": "Database connection pool fixed, resetting attempts for automatic retry"
}

Option B: Resolve manually (bypass step entirely):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "resolve_manually",
  "resolved_by": "operator@example.com",
  "reason": "Non-critical validation step, bypassing"
}

Option C: Complete manually (provide results for dependent steps):

PATCH /v1/tasks/abc-123/workflow_steps/{step_2_uuid}
{
  "action_type": "complete_manually",
  "completion_data": {
    "result": {
      "validation_status": "passed",
      "score": 100
    },
    "metadata": {
      "manually_verified": true
    }
  },
  "reason": "Manual validation completed",
  "completed_by": "operator@example.com"
}

7. Task state machine automatically progresses

Outcome depends on action type chosen:

If Option A (reset_for_retry):

  • Step 2 → pending (attempts reset to 0)
  • Automatic retry begins when dependencies satisfied
  • Step 2 re-enqueued to worker
  • If successful, workflow continues normally

If Option B (resolve_manually):

  • Step 2 → resolved_manually (terminal state)
  • Step 3 dependencies satisfied (manual resolution counts as success)
  • Task transitions: error → enqueuing_steps
  • Step 3 enqueued to worker
  • Task resumes normal execution

If Option C (complete_manually):

  • Step 2 → complete (with operator-provided results)
  • Step 3 can consume results from completion_data
  • Task transitions: error → enqueuing_steps
  • Step 3 enqueued to worker with access to step 2 results
  • Task resumes normal execution

8. Operator updates DLQ investigation

PATCH /v1/dlq/entry/xyz-789
{
  "resolution_status": "manually_resolved",
  "resolution_notes": "Fixed database connection pool configuration. Manually resolved step_2 to unblock workflow. Task resumed execution.",
  "resolved_by": "operator@example.com",
  "metadata": {
    "root_cause": "database_connection_timeout",
    "fixed_step_uuid": "{step_2_uuid}",
    "infrastructure_fix": "increased_connection_pool_size"
  }
}

Step Retry and Attempt Lifecycles

Step State Machine

States:

  • pending - Initial state, awaiting dependencies
  • enqueued - Sent to worker queue
  • in_progress - Worker actively processing
  • enqueued_for_orchestration - Result submitted, awaiting orchestration
  • complete - Successfully finished
  • error - Failed (may be retryable)
  • cancelled - Manually cancelled
  • resolved_manually - Operator intervention

Retry Logic

Configured per step in template:

retry:
  retryable: true
  max_attempts: 3
  backoff: exponential
  backoff_base_ms: 1000
  max_backoff_ms: 30000

Retry Eligibility Criteria:

  1. retryable: true in configuration
  2. attempts < max_attempts
  3. Current state is error
  4. next_retry_at timestamp has passed (backoff elapsed)
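
All four criteria surface in the step readiness payload, so eligibility can be checked directly (sketch; UUIDs are placeholders):

curl -s "/v1/tasks/$TASK_UUID/workflow_steps/$STEP_UUID" \
  | jq '{current_state, retry_eligible, attempts, max_attempts, next_retry_at}'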

Backoff Calculation:

backoff_ms = min(backoff_base_ms * (2 ^ (attempts - 1)), max_backoff_ms)

Example (base=1000ms, max=30000ms):

  • Attempt 1 fails → wait 1s
  • Attempt 2 fails → wait 2s
  • Attempt 3 fails → wait 4s

SQL Function: get_step_readiness_status() calculates retry_eligible and next_retry_at
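
The same calculation in shell form, handy for sanity-checking template values (a sketch of the documented formula, not the production implementation):

# backoff_ms <base_ms> <max_ms> <attempts>
backoff_ms() {
  local base=$1 max=$2 attempts=$3
  local ms=$(( base * (1 << (attempts - 1)) ))  # base * 2^(attempts-1)
  (( ms > max )) && ms=$max
  echo "$ms"
}

backoff_ms 1000 30000 1   # 1000  (attempt 1 fails -> wait 1s)
backoff_ms 1000 30000 3   # 4000  (attempt 3 fails -> wait 4s)
backoff_ms 1000 30000 10  # 30000 (capped at max_backoff_ms)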

Attempt Tracking

Fields (on workflow_steps table):

  • attempts - Current attempt count
  • max_attempts - Configuration limit
  • last_attempted_at - Timestamp of last execution
  • last_failure_at - Timestamp of last failure

Workflow:

  1. Step enqueued → attempts++
  2. Step fails → Record last_failure_at, calculate next_retry_at
  3. Backoff elapses → Step becomes retry_eligible: true
  4. Orchestration discovers ready steps → Step re-enqueued
  5. Repeat until success or attempts >= max_attempts

Max Attempts Exceeded:

  • Step remains in error state
  • retry_eligible: false
  • Task transitions to error state
  • May trigger DLQ entry with reason max_retries_exceeded

Independence from DLQ

Key Point: Step retry logic is INDEPENDENT of DLQ

  • Steps retry automatically based on configuration
  • DLQ does NOT trigger retries
  • DLQ does NOT modify retry counters
  • DLQ is pure observation and investigation

Why This Matters:

  • Retry logic is predictable and configuration-driven
  • DLQ doesn’t interfere with normal workflow execution
  • Operators can manually resolve to bypass retry limits
  • DLQ provides visibility into retry exhaustion patterns

Staleness Detection

Background Service

Component: tasker-orchestration/src/orchestration/staleness_detector.rs

Configuration:

[staleness_detection]
enabled = true
batch_size = 100
detection_interval_seconds = 300  # 5 minutes

Operation:

  1. Timer triggers every 5 minutes
  2. Calls detect_and_transition_stale_tasks() SQL function
  3. Function identifies tasks exceeding thresholds
  4. Creates DLQ entries for stale tasks
  5. Transitions tasks to error state
  6. Records OpenTelemetry metrics
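
The same function can be invoked by hand when testing threshold changes (sketch; check the migration for the exact signature, which may take parameters such as a batch size):

# run one staleness detection cycle outside the background timer
psql "$DATABASE_URL" -c "SELECT * FROM detect_and_transition_stale_tasks();"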

Staleness Thresholds

Per-State Defaults (configurable):

  • waiting_for_dependencies: 60 minutes
  • waiting_for_retry: 30 minutes
  • steps_in_process: 30 minutes

Per-Template Override:

lifecycle:
  max_waiting_for_dependencies_minutes: 120
  max_waiting_for_retry_minutes: 45
  max_steps_in_process_minutes: 60

Precedence: Template config > Global defaults

Staleness SQL Function

Function: detect_and_transition_stale_tasks()

Architecture:

v_task_state_analysis (base view)
    │
    ├── get_stale_tasks_for_dlq() (discovery function)
    │       │
    │       └── detect_and_transition_stale_tasks() (main orchestration)
    │               ├── create_dlq_entry() (DLQ creation)
    │               └── transition_stale_task_to_error() (state transition)

Performance Optimization:

  • Expensive joins happen ONCE in base view
  • Discovery function filters stale tasks
  • Main function processes results in loop
  • LEFT JOIN anti-join pattern for excluding tasks with pending DLQ entries
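
The anti-join has this general shape (illustrative SQL only; the real query lives in the migration listed under Implementation Notes):

psql "$DATABASE_URL" <<'SQL'
-- candidate stale tasks, excluding any with a pending DLQ entry
SELECT t.*
  FROM v_task_state_analysis t
  LEFT JOIN tasker.tasks_dlq d
         ON d.task_uuid = t.task_uuid
        AND d.resolution_status = 'pending'
 WHERE d.dlq_entry_uuid IS NULL;
SQL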

Output: Returns StalenessResult records with:

  • Task identification (UUID, namespace, name)
  • State and timing information
  • action_taken - What happened (enum: TransitionedToDlqAndError, MovedToDlqOnly, etc.)
  • moved_to_dlq - Boolean
  • transition_success - Boolean

OpenTelemetry Metrics

Metrics Exported

Counters:

  • tasker.dlq.entries_created.total - DLQ entries created
  • tasker.staleness.tasks_detected.total - Stale tasks detected
  • tasker.staleness.tasks_transitioned_to_error.total - Tasks moved to Error
  • tasker.staleness.detection_runs.total - Detection cycles

Histograms:

  • tasker.staleness.detection.duration - Detection execution time (ms)
  • tasker.dlq.time_in_queue - Time in DLQ before resolution (hours)

Gauges:

  • tasker.dlq.pending_investigations - Current pending DLQ count

Alert Examples

Prometheus Alerting Rules:

# Alert on high pending investigations
- alert: HighPendingDLQInvestigations
  expr: tasker_dlq_pending_investigations > 50
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High number of pending DLQ investigations ({{ $value }})"

# Alert on slow detection cycles (duration is a histogram; alert on the 5m average)
- alert: SlowStalenessDetection
  expr: (rate(tasker_staleness_detection_duration_sum[5m]) / rate(tasker_staleness_detection_duration_count[5m])) > 5000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Staleness detection taking >5s ({{ $value }}ms)"

# Alert on high stale task rate
- alert: HighStalenessRate
  expr: rate(tasker_staleness_tasks_detected_total[5m]) > 10
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High rate of stale task detection ({{ $value }}/sec)"

CLI Usage Examples

The tasker-ctl tool provides commands for managing workflow steps directly from the command line.

List Workflow Steps

# List all steps for a task
tasker-ctl task steps <TASK_UUID>

# Example output:
# ✓ Found 3 workflow steps:
#
#   Step: validate_input (01933d7c-...)
#     State: complete
#     Dependencies satisfied: true
#     Ready for execution: false
#     Attempts: 1/3
#
#   Step: process_order (01933d7c-...)
#     State: error
#     Dependencies satisfied: true
#     Ready for execution: false
#     Attempts: 3/3
#     ⚠ Not retry eligible (max attempts exhausted)

Get Step Details

# Get detailed information about a specific step
tasker-ctl task step <TASK_UUID> <STEP_UUID>

# Example output:
# ✓ Step Details:
#
#   UUID: 01933d7c-...
#   Name: process_order
#   State: error
#   Dependencies satisfied: true
#   Ready for execution: false
#   Retry eligible: false
#   Attempts: 3/3
#   Last failure: 2025-11-02T14:23:45Z

Reset Step for Retry

When infrastructure is fixed and you want to reset attempt counter:

tasker-ctl task reset-step <TASK_UUID> <STEP_UUID> \
  --reason "Database connection pool increased" \
  --reset-by "ops-team@example.com"

# Example output:
# ✓ Step reset successfully!
#   New state: pending
#   Reason: Database connection pool increased
#   Reset by: ops-team@example.com

Resolve Step Manually

When you want to bypass a non-critical step:

tasker-ctl task resolve-step <TASK_UUID> <STEP_UUID> \
  --reason "Non-critical validation, bypassing" \
  --resolved-by "ops-team@example.com"

# Example output:
# ✓ Step resolved manually!
#   New state: resolved_manually
#   Reason: Non-critical validation, bypassing
#   Resolved by: ops-team@example.com

Complete Step Manually with Results

When you’ve manually performed the step’s work and need to provide results:

tasker-ctl task complete-step <TASK_UUID> <STEP_UUID> \
  --result '{"validated": true, "score": 95}' \
  --metadata '{"verification_method": "manual_review"}' \
  --reason "Manual verification after infrastructure fix" \
  --completed-by "ops-team@example.com"

# Example output:
# ✓ Step completed manually with results!
#   New state: complete
#   Reason: Manual verification after infrastructure fix
#   Completed by: ops-team@example.com

JSON Formatting Tips:

# Use single quotes around JSON to avoid shell escaping issues
--result '{"key": "value"}'

# For complex JSON, use a heredoc or file
--result "$(cat <<'EOF'
{
  "validation_status": "passed",
  "checks": ["auth", "permissions", "rate_limit"],
  "score": 100
}
EOF
)"

# Or read from a file
--result "$(cat result.json)"

Operational Runbooks

Runbook 1: Investigating High DLQ Count

Trigger: tasker_dlq_pending_investigations > 50

Steps:

  1. Check DLQ dashboard:
curl /v1/dlq/stats | jq
  2. Identify dominant reason:
{
  "dlq_reason": "staleness_timeout",
  "total_entries": 45,
  "pending": 45
}
  3. Get investigation queue:
curl /v1/dlq/investigation-queue?limit=10 | jq
  4. Check staleness monitoring:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "stale")'
  5. Identify patterns:
  • Common namespace?
  • Common task template?
  • Common time period?
  6. Take action:
  • Infrastructure issue? → Fix and manually resolve affected tasks
  • Template misconfiguration? → Update template thresholds
  • Worker unavailable? → Scale worker capacity
  • Systemic dependency issue? → Investigate upstream systems

Runbook 2: Proactive Staleness Prevention

Trigger: Regular monitoring (not incident-driven)

Steps:

  1. Monitor warning threshold:
warning_count=$(curl -s /v1/dlq/staleness | jq '[.[] | select(.health_status == "warning")] | length')
  2. Alert when warning count exceeds baseline:
if [ "$warning_count" -gt 10 ]; then
  alert "High staleness warning count: $warning_count tasks at 80%+ threshold"
fi
  3. Investigate early:
curl /v1/dlq/staleness | jq '.[] | select(.health_status == "warning") | {
  task_uuid,
  current_state,
  time_in_state_minutes,
  staleness_threshold_minutes,
  threshold_percentage: ((.time_in_state_minutes / .staleness_threshold_minutes) * 100)
}'
  4. Intervene before DLQ:
  • Check task steps for blockages
  • Review dependencies
  • Manually resolve if appropriate

Best Practices

For Operators

DO:

  • Use staleness monitoring for proactive prevention
  • Document investigation findings in DLQ resolution notes
  • Fix root causes, not just symptoms
  • Update DLQ investigation status promptly
  • Use step endpoints to resolve stuck workflows
  • Monitor DLQ statistics for systemic patterns

DON’T:

  • Don’t try to “requeue” from DLQ - fix the steps instead
  • Don’t ignore warning health status - investigate early
  • Don’t manually resolve steps without fixing root cause
  • Don’t leave DLQ investigations in pending status indefinitely

For Developers

DO:

  • Configure appropriate staleness thresholds per template
  • Make steps retryable with sensible backoff
  • Implement idempotent step handlers
  • Add defensive timeouts to prevent hanging
  • Test workflows under failure scenarios

DON’T:

  • Don’t set thresholds too low (causes false positives)
  • Don’t set thresholds too high (delays detection)
  • Don’t make all steps non-retryable
  • Don’t ignore DLQ patterns - they indicate design issues
  • Don’t rely on DLQ for normal workflow control flow

Testing

Test Coverage

Unit Tests: SQL function testing (17 tests)

  • Staleness detection logic
  • DLQ entry creation
  • Threshold calculation with template overrides
  • View query correctness

Integration Tests: Lifecycle testing (4 tests)

  • Waiting for dependencies staleness (test_dlq_lifecycle_waiting_for_dependencies_staleness)
  • Steps in process staleness (test_dlq_lifecycle_steps_in_process_staleness)
  • Proactive monitoring with health status progression (test_dlq_lifecycle_proactive_monitoring)
  • Complete investigation workflow (test_dlq_investigation_workflow)

Metrics Tests: OpenTelemetry integration (1 test)

  • Staleness detection metrics recording
  • DLQ investigation metrics recording
  • Pending investigations gauge query

Test Template: tests/fixtures/task_templates/rust/dlq_staleness_test.yaml

  • 2-step linear workflow
  • 2-minute staleness thresholds for fast test execution
  • Test-only template for lifecycle validation

Performance: All 22 tests complete in 0.95s (< 1s target)


Implementation Notes

File Locations:

  • Staleness detector: tasker-orchestration/src/orchestration/staleness_detector.rs
  • DLQ models: tasker-shared/src/models/orchestration/dlq.rs
  • SQL functions: migrations/20251122000004_add_dlq_discovery_function.sql
  • Database views: migrations/20251122000003_add_dlq_views.sql

Key Design Decisions:

  • Investigation tracking only - no task manipulation
  • Step-level resolution via existing step endpoints
  • Proactive monitoring at 80% threshold
  • Template-specific threshold overrides
  • Atomic DLQ entry creation with unique constraint
  • Time-ordered UUID v7 for investigation tracking

Future Enhancements

Potential improvements (not currently planned):

  1. DLQ Patterns Analysis

    • Machine learning to identify systemic issues
    • Automated root cause suggestions
    • Pattern clustering by namespace/template
  2. Advanced Alerting

    • Anomaly detection on staleness rates
    • Predictive DLQ entry forecasting
    • Correlation with infrastructure metrics
  3. Investigation Workflow

    • Automated triage rules
    • Escalation policies
    • Integration with incident management systems
  4. Performance Optimization

    • Materialized views for dashboard
    • Query result caching
    • Incremental staleness detection

End of Documentation