Keyboard shortcuts

Press ← or β†’ to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Observability Documentation

Last Updated: 2025-12-01 Audience: Operators, Developers Status: Active Related Docs: Documentation Hub | Benchmarks | Deployment Patterns | Domain Events

← Back to Documentation Hub


This directory contains documentation for monitoring, metrics, logging, and performance measurement in tasker-core.


Quick Navigation

πŸ“Š Performance & Benchmarking β†’ ../benchmarks/

All benchmark documentation has been consolidated in the docs/benchmarks/ directory.

See: Benchmark README for:

  • API performance benchmarks
  • SQL function benchmarks
  • Event propagation benchmarks
  • End-to-end latency benchmarks
  • Benchmark quick reference
  • Performance targets and CI integration

Migration Note: The following files remain in this directory for historical context but are superseded by the consolidated benchmarks documentation:

  • benchmark-implementation-decision.md - Decision rationale (archived)
  • benchmark-quick-reference.md - Superseded by ../benchmarks/README.md
  • benchmark-strategy-summary.md - Consolidated into benchmark-specific docs
  • benchmarking-guide.md - SQL benchmarks moved to ../benchmarks/sql-benchmarks.md
  • phase-5.4-distributed-benchmarks-plan.md - Implementation complete

Observability Categories

1. Metrics (metrics-*.md)

Purpose: System health, performance counters, and operational metrics

Documentation:

Key Metrics Tracked:

  • Task lifecycle events (created, started, completed, failed)
  • Step execution metrics (claimed, executed, retried)
  • Database operation performance (query times, cache hit rates)
  • Worker health (active workers, queue depths, claim rates)
  • System resource usage (memory, connections, threads)

Export Targets:

  • OpenTelemetry (planned)
  • Prometheus (supported)
  • CloudWatch (planned)
  • Datadog (planned)

Quick Reference:

#![allow(unused)]
fn main() {
// Example: Recording a metric
metrics::counter!("tasker.tasks.created").increment(1);
metrics::histogram!("tasker.step.execution_time_ms").record(elapsed_ms);
metrics::gauge!("tasker.workers.active").set(worker_count as f64);
}

2. Logging (logging-standards.md)

Purpose: Structured logging for debugging, audit trails, and operational visibility

Documentation:

Log Levels:

  • ERROR: Critical failures requiring immediate attention
  • WARN: Degraded operation or retry scenarios
  • INFO: Significant lifecycle events and state transitions
  • DEBUG: Detailed execution flow for troubleshooting
  • TRACE: Exhaustive detail for deep debugging

Structured Fields:

#![allow(unused)]
fn main() {
info!(
    task_uuid = %task_uuid,
    correlation_id = %correlation_id,
    step_name = %step_name,
    elapsed_ms = elapsed.as_millis(),
    "Step execution completed successfully"
);
}

Key Standards:

  • Use structured logging (not string interpolation)
  • Include correlation IDs for distributed tracing
  • Log state transitions at INFO level
  • Include timing information for performance analysis
  • Sanitize sensitive data (credentials, PII)

3. Tracing and OpenTelemetry

Purpose: Distributed request tracing across services

Status: βœ… Active

Documentation:

Current Features:

  • Distributed trace propagation via correlation IDs (UUIDv7)
  • Span creation for major operations:
    • API request handling
    • Step execution (claim β†’ execute β†’ submit)
    • Orchestration coordination
    • Domain event publishing
    • Message queue operations
  • Two-phase FFI telemetry initialization (safe for Ruby/Python workers)
  • Integration with Grafana LGTM stack (Prometheus, Tempo)
  • Domain event metrics (/metrics/events endpoint)

Two-Phase FFI Initialization:

  • Phase 1: Console-only logging (safe during FFI bridge setup)
  • Phase 2: Full OpenTelemetry (after FFI established)

Example:

#![allow(unused)]
fn main() {
#[tracing::instrument(
    name = "publish_domain_event",
    skip(self, payload),
    fields(
        event_name = %event_name,
        namespace = %metadata.namespace,
        correlation_id = %metadata.correlation_id,
        delivery_mode = ?delivery_mode
    )
)]
async fn publish_event(&self, event_name: &str, ...) -> Result<()> {
    // Implementation
}
}

4. Health Checks

Purpose: Service health monitoring for orchestration, availability, and alerting

Endpoints:

  • GET /health - Overall service health
  • GET /health/ready - Readiness for traffic (K8s readiness probe)
  • GET /health/live - Liveness check (K8s liveness probe)

Health Indicators:

  • Database connection pool status
  • Message queue connectivity
  • Worker availability
  • Circuit breaker states
  • Resource utilization (memory, connections)

Response Format:

{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "healthy",
      "connections_active": 5,
      "connections_idle": 15,
      "connections_max": 20
    },
    "message_queue": {
      "status": "healthy",
      "queues_monitored": 3
    },
    "circuit_breakers": {
      "status": "healthy",
      "open_breakers": 0
    }
  },
  "uptime_seconds": 3600
}

Observability Architecture

Component-Level Instrumentation

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Observability Stack                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Metrics β”‚  β”‚   Logs   β”‚  β”‚  Traces  β”‚  β”‚  Health  β”‚  β”‚
β”‚  β”‚ (Countersβ”‚  β”‚(Structured)β”‚  β”‚(Planned)β”‚  β”‚  Checks  β”‚  β”‚
β”‚  β”‚Histogramsβ”‚  β”‚   JSON   β”‚  β”‚  Spans   β”‚  β”‚   HTTP   β”‚  β”‚
β”‚  β”‚  Gauges) β”‚  β”‚   Fields β”‚  β”‚   Tags   β”‚  β”‚  Probes  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β”‚
β”‚        β”‚             β”‚             β”‚             β”‚        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚             β”‚             β”‚             β”‚
         β–Ό             β–Ό             β–Ό             β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚Prometheus β”‚ β”‚  Loki /   β”‚ β”‚  Jaeger / β”‚ β”‚    K8s    β”‚
  β”‚   OTLP    β”‚ β”‚CloudWatch β”‚ β”‚   Tempo   β”‚ β”‚  Probes   β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Instrumentation Points

Orchestration:

  • Task lifecycle transitions
  • Step discovery and enqueueing
  • Result processing
  • Finalization operations
  • Database query performance

Worker:

  • Step claiming
  • Handler execution
  • Result submission
  • FFI call overhead (Ruby workers)
  • Event propagation latency

Database:

  • Query execution times
  • Connection pool metrics
  • Transaction commit latency
  • Buffer cache hit ratio

Message Queue:

  • Message send/receive latency
  • Queue depth
  • Notification propagation time
  • Message processing errors

Performance Monitoring

Key Performance Indicators (KPIs)

MetricTargetAlert ThresholdNotes
API Response Time (p99)< 100ms> 200msUser-facing latency
SQL Function Time (mean)< 3ms> 5msOrchestration efficiency
Event Propagation (p95)< 10ms> 20msReal-time coordination
E2E Task Completion (p99)< 500ms> 1000msEnd-user experience
Worker Claim Success Rate> 95%< 90%Resource contention
Database Connection Pool< 80%> 90%Resource exhaustion

Monitoring Dashboards

Recommended Dashboard Panels:

  1. Task Throughput

    • Tasks created/min
    • Tasks completed/min
    • Tasks failed/min
    • Active tasks count
  2. Step Execution

    • Steps enqueued/min
    • Steps completed/min
    • Average step execution time
    • Step retry rate
  3. System Health

    • Worker health status
    • Database connection pool utilization
    • Circuit breaker status
    • API response times (p50, p95, p99)
  4. Error Rates

    • Task failures by namespace
    • Step failures by handler
    • Database errors
    • Message queue errors

Correlation and Debugging

Correlation ID Propagation

Every request generates a UUIDv7 correlation ID that flows through:

  1. API request β†’ Task creation
  2. Task β†’ Step enqueueing
  3. Step β†’ Worker execution
  4. Worker β†’ Result submission
  5. Result β†’ Orchestration processing

Tracing a Request:

# Find correlation ID from task creation
curl http://localhost:8080/v1/tasks/{task_uuid} | jq .correlation_id

# Search logs across all services
docker logs orchestration 2>&1 | grep {correlation_id}
docker logs worker 2>&1 | grep {correlation_id}

# Query database for full timeline
psql $DATABASE_URL -c "
  SELECT
    created_at,
    from_state,
    to_state,
    metadata->>'duration_ms' as duration
  FROM tasker.task_transitions
  WHERE metadata->>'correlation_id' = '{correlation_id}'
  ORDER BY created_at;
"

Debug Logging

Enable debug logging for detailed execution flow:

# Docker Compose
RUST_LOG=debug docker-compose up

# Local development
RUST_LOG=tasker_worker=debug,tasker_orchestration=debug cargo run

# Specific modules
RUST_LOG=tasker_worker::worker::command_processor=trace cargo test

Best Practices

1. Structured Logging

βœ… Do:

#![allow(unused)]
fn main() {
info!(
    task_uuid = %task.uuid,
    namespace = %task.namespace,
    elapsed_ms = elapsed.as_millis(),
    "Task completed successfully"
);
}

❌ Don’t:

#![allow(unused)]
fn main() {
info!("Task {} in namespace {} completed in {}ms",
    task.uuid, task.namespace, elapsed.as_millis());
}

2. Metric Naming

Use consistent, hierarchical naming:

#![allow(unused)]
fn main() {
metrics::counter!("tasker.tasks.created").increment(1);
metrics::counter!("tasker.tasks.completed").increment(1);
metrics::counter!("tasker.tasks.failed").increment(1);
metrics::histogram!("tasker.step.execution_time_ms").record(elapsed);
}

3. Performance Measurement

Measure at operation boundaries:

#![allow(unused)]
fn main() {
let start = Instant::now();
let result = operation().await?;
let elapsed = start.elapsed();

metrics::histogram!("tasker.operation.duration_ms")
    .record(elapsed.as_millis() as f64);

info!(
    operation = "operation_name",
    elapsed_ms = elapsed.as_millis(),
    success = result.is_ok(),
    "Operation completed"
);
}

4. Error Context

Include rich context in errors:

#![allow(unused)]
fn main() {
error!(
    task_uuid = %task_uuid,
    step_uuid = %step_uuid,
    error = %err,
    retry_count = retry_count,
    "Step execution failed, will retry"
);
}

Tools and Integration

Development Tools

Metrics Visualization:

# Prometheus (if configured)
open http://localhost:9090

# Grafana (if configured)
open http://localhost:3000

Log Aggregation:

# Docker Compose logs
docker-compose -f docker/docker-compose.test.yml logs -f

# Specific service
docker-compose -f docker/docker-compose.test.yml logs -f orchestration

# JSON parsing
docker-compose logs orchestration | jq 'select(.level == "ERROR")'

Production Tools (Planned)

  • Metrics: Prometheus + Grafana / DataDog / CloudWatch
  • Logs: Loki / CloudWatch Logs / Splunk
  • Traces: Jaeger / Tempo / Honeycomb
  • Alerts: AlertManager / PagerDuty / Opsgenie


File Organization

Current Files

Active:

  • metrics-reference.md - Complete metrics catalog
  • metrics-verification.md - Verification procedures
  • logging-standards.md - Logging best practices
  • opentelemetry-improvements.md - Telemetry enhancements
  • VERIFICATION_RESULTS.md - Test results

Archived (superseded by docs/benchmarks/):

  • benchmark-implementation-decision.md
  • benchmark-quick-reference.md
  • benchmark-strategy-summary.md
  • benchmarking-guide.md
  • phase-5.4-distributed-benchmarks-plan.md

Move benchmark files to docs/archive/ or delete:

# Option 1: Archive
mkdir -p docs/archive/benchmarks
mv docs/observability/benchmark-*.md docs/archive/benchmarks/
mv docs/observability/phase-5.4-*.md docs/archive/benchmarks/

# Option 2: Delete (information consolidated)
rm docs/observability/benchmark-*.md
rm docs/observability/phase-5.4-*.md

Contributing

When adding observability instrumentation:

  1. Follow standards: Use structured logging and consistent metric naming
  2. Include context: Add correlation IDs and relevant metadata
  3. Document metrics: Update metrics-reference.md with new metrics
  4. Test instrumentation: Verify metrics and logs in development
  5. Consider performance: Avoid expensive operations in hot paths