Telemetry & Observability

Overview

Tasker includes comprehensive telemetry capabilities to provide insights into task execution, workflow steps, and overall system performance. The telemetry system leverages OpenTelemetry standards and a unified event architecture to ensure compatibility with a wide range of observability tools and platforms.

OpenTelemetry Architecture: Metrics vs Spans

Tasker follows OpenTelemetry best practices by using both metrics and spans for different purposes:

Spans 🔍 (Primary Focus)

  • Purpose: Individual trace records with detailed context for debugging and analysis

  • Use Cases:

    • "Why did task #12345 take 30 seconds?"

    • "What was the exact execution path for this failed workflow?"

    • "Which step in the order process is the bottleneck?"

  • Benefits:

    • Complete request context and timing

    • Parent-child relationships show workflow hierarchy

    • Error propagation with full stack traces

    • Rich attributes for detailed analysis

  • Implementation: TelemetrySubscriber creates hierarchical spans for tasks and steps

Metrics 📊 (Derived or Separate)

  • Purpose: Aggregated numerical data for dashboards, alerts, and SLIs/SLOs

  • Use Cases:

    • "How many tasks completed in the last hour?"

    • "What's the 95th percentile task duration?"

    • "Alert if error rate exceeds 5%"

  • Benefits:

    • Very efficient storage and querying

    • Perfect for real-time dashboards

    • Lightweight for high-volume scenarios

  • Implementation: Can be derived from span data or collected separately

Key Features

  • Unified Event System - Single Events::Publisher with consistent event publishing patterns

  • Standardized Event Payloads - EventPayloadBuilder ensures consistent telemetry data structure

  • Production-Ready OpenTelemetry Integration - Full instrumentation stack with safety mechanisms

  • Hierarchical Span Creation - Proper parent-child relationships for complex workflows

  • Automatic Step Error Persistence - Complete error data capture with atomic transactions

  • Memory-Safe Operation - Database connection pooling and leak prevention

  • Comprehensive Event Lifecycle Tracking - Task, step, workflow, and orchestration events

  • Sensitive Data Filtering - Automatic security and privacy protection

  • Developer-Friendly API - Clean EventPublisher concern for easy event publishing

  • Custom Event Subscribers - Generator and BaseSubscriber for creating integrations

  • Event Discovery System - Complete event catalog with documentation and examples

Architecture

Tasker's telemetry is built on a unified event system with these main components:

  1. Events::Publisher - Centralized event publishing using dry-events with OpenTelemetry integration

  2. EventPublisher Concern - Clean interface providing publish_event(), publish_step_event(), etc.

  3. EventPayloadBuilder - Standardized payload creation for consistent telemetry data

  4. TelemetrySubscriber - Converts events to OpenTelemetry spans (spans only, no metrics)

  5. Event Catalog - Complete event discovery and documentation system

  6. BaseSubscriber - Foundation for creating custom event subscribers

  7. Subscriber Generator - Tool for creating custom integrations with external services

  8. Configuration - OpenTelemetry setup with production-ready safety mechanisms

Event Flow

Two Complementary Observability Systems

Tasker provides two distinct but complementary observability systems designed for different use cases:

πŸ” TelemetrySubscriber (Event-Driven Spans)

  • Purpose: Detailed tracing and debugging with OpenTelemetry spans

  • Trigger: Automatic via event subscription (no manual instrumentation needed)

  • Use Cases:

    • "Why did task #12345 fail?"

    • "What's the execution path through this workflow?"

    • "Which step is causing the bottleneck?"

  • Data: Rich contextual information, hierarchical relationships, error details

  • Storage: OpenTelemetry backends (Jaeger, Zipkin, Honeycomb)

  • Performance: Optimized for detailed context, not high-volume aggregation

📊 MetricsBackend (Native Metrics Collection)

  • Purpose: High-performance aggregated metrics for dashboards and alerting

  • Trigger: Direct collection during workflow execution (no events)

  • Use Cases:

    • "How many tasks completed in the last hour?"

    • "What's the 95th percentile execution time?"

    • "Alert if error rate exceeds 5%"

  • Data: Numerical counters, gauges, histograms with labels

  • Storage: Prometheus, JSON, CSV exports

  • Performance: Optimized for high-volume, low-latency operations

Why Two Systems?

  • Performance: Native metrics avoid event publishing overhead for high-frequency operations

  • Reliability: Metrics collection continues even if the event system has issues

  • Flexibility: Choose the appropriate storage backend for each use case

  • Scalability: Each system is optimized for its specific workload

✅ TelemetrySubscriber - Spans Only (Event-Driven)

✅ MetricsBackend - Native Collection (Direct)

✅ Clean Architecture - Complementary Systems

See METRICS.md for comprehensive details on the native metrics system including cache strategies, Kubernetes integration, and production deployment patterns.

Configuration

Tasker Configuration

Configure Tasker's telemetry in config/initializers/tasker.rb:
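The original configuration snippet is not preserved in this page; the following is an illustrative sketch, assuming a `Tasker.configuration` block with a telemetry section (exact option names may differ in your Tasker version):

```ruby
# config/initializers/tasker.rb
Tasker.configuration do |config|
  config.telemetry do |telemetry|
    # Enable span creation via TelemetrySubscriber
    telemetry.enabled = true
    # Service name reported on spans (option names illustrative)
    telemetry.service_name = 'my-app-tasker'
    # Keys scrubbed from event payloads before export (sensitive data filtering)
    telemetry.filter_parameters = [:password, :api_key, :credit_card]
  end
end
```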

OpenTelemetry Configuration

Configure OpenTelemetry with production-ready settings in config/initializers/opentelemetry.rb:
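A typical production setup uses the OTLP exporter behind a batch span processor; the endpoint and service name below are placeholders for your environment:

```ruby
# config/initializers/opentelemetry.rb
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my-app'
  # BatchSpanProcessor keeps span export off the request path
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        endpoint: ENV.fetch('OTEL_EXPORTER_OTLP_ENDPOINT', 'http://localhost:4318/v1/traces')
      )
    )
  )
  # Install all available instrumentation; Faraday may fail to install
  # in some setups, which is expected (see Log Monitoring below)
  c.use_all
end
```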

Custom Telemetry Integrations

Beyond OpenTelemetry, Tasker's event system enables easy integration with any observability or monitoring service:

Creating Custom Metrics Subscribers

Tasker now provides specialized tooling for metrics collection:
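The generator invocation below is illustrative - check `rails generate --help` in your application for the exact generator name and flags:

```bash
rails generate tasker:subscriber metrics --metrics
```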

The --metrics flag creates a specialized subscriber with:

  • Built-in helper methods for extracting timing, error, and performance metrics

  • Automatic tag generation for categorization

  • Examples for StatsD, DataDog, Prometheus, and other metrics systems

  • Safe numeric value extraction with defaults

  • Production-ready patterns for operational monitoring

Creating Custom Subscribers

For non-metrics integrations, use the regular subscriber generator:
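An illustrative invocation (the subscriber name and event list are examples, and the flag syntax may differ by version):

```bash
rails generate tasker:subscriber pager_duty --events task.failed step.failed
```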

Example Custom Integrations

Metrics Collection (Using Helper Methods):
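A sketch of a metrics subscriber built on BaseSubscriber; the helper method names (`extract_duration`, `extract_tags`) and the StatsD client are assumptions for illustration:

```ruby
class MetricsSubscriber < Tasker::Events::Subscribers::BaseSubscriber
  subscribe_to 'task.completed', 'step.completed'

  def handle_task_completed(event)
    # Helper methods extract timing and categorization tags from the payload
    duration = extract_duration(event)
    StatsD.histogram('tasker.task.duration', duration, tags: extract_tags(event))
  end

  def handle_step_completed(event)
    StatsD.increment('tasker.step.completed', tags: extract_tags(event))
  end
end
```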

Error Tracking (Sentry):
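A sketch assuming the sentry-ruby gem and the payload keys shown elsewhere in this guide:

```ruby
class SentrySubscriber < Tasker::Events::Subscribers::BaseSubscriber
  subscribe_to 'task.failed', 'step.failed'

  def handle_task_failed(event)
    Sentry.capture_message(
      "Task #{event[:task_id]} failed",
      level: :error,
      extra: { task_name: event[:task_name], error: event[:error_message] }
    )
  end

  def handle_step_failed(event)
    Sentry.capture_message(
      "Step #{event[:step_name]} failed in task #{event[:task_id]}",
      level: :error,
      extra: { error: event[:error_message] }
    )
  end
end
```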

Metrics Helper Methods

BaseSubscriber now includes specialized helper methods for extracting common metrics data:
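The gem's actual helper names are not reproduced here; the following self-contained sketch shows the *kind* of helpers meant - safe numeric extraction with a default, and automatic tag generation - with method names and payload keys that are assumptions, not the gem's API:

```ruby
# Illustrative helpers in the spirit of BaseSubscriber's metrics support
module MetricsHelpers
  # Safe numeric extraction: returns the default for missing or
  # non-numeric values instead of raising
  def extract_numeric(payload, key, default = 0.0)
    Float(payload[key])
  rescue TypeError, ArgumentError
    default
  end

  # Build categorization tags from common payload fields
  def extract_tags(payload)
    %i[task_name step_name status].filter_map do |key|
      "#{key}:#{payload[key]}" if payload[key]
    end
  end
end

class Demo
  include MetricsHelpers
end

demo = Demo.new
demo.extract_numeric({ duration: '2.5' }, :duration)      # => 2.5
demo.extract_numeric({}, :duration)                       # => 0.0
demo.extract_tags(task_name: 'order', status: 'complete')
# => ["task_name:order", "status:complete"]
```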

These helpers standardize metrics extraction and ensure consistency across different subscriber implementations.

For complete documentation on creating custom subscribers and integration examples, see EVENT_SYSTEM.md.

Integration with OpenTelemetry

Tasker's unified event system automatically integrates with OpenTelemetry through the enhanced TelemetrySubscriber. For each task:

  1. Root Task Span: A root span (tasker.task.execution) is created when the task starts and stored for the entire task lifecycle

  2. Child Step Spans: Child spans (tasker.step.execution) are created for each step with proper parent-child relationships to the root task span

  3. Hierarchical Context: All spans maintain proper parent-child relationships, ensuring full traceability in Jaeger/Zipkin

  4. Event Annotations: Each span includes relevant events (task.started, step.completed, etc.) with comprehensive attributes

  5. Error Propagation: Error status and messages are properly propagated through the span hierarchy

  6. Performance Metrics: Execution duration and attempt tracking are captured at both task and step levels

Span Hierarchy Example

Key Improvements

  • Proper Hierarchical Context: All step spans are now properly parented to their task span

  • Consistent Span Names: Standardized span names (tasker.task.execution, tasker.step.execution) make filtering and querying easier

  • Rich Event Annotations: Spans include relevant lifecycle events as annotations for detailed timeline visibility

  • Error Context Preservation: Failed steps maintain full error context while still being linked to their parent task

  • Task ID Propagation: All spans include the task_id for easy correlation across the entire workflow

Best Practices

1. Single Responsibility for Telemetry Components - keep each subscriber focused on one concern (spans, metrics, or alerting)

2. Avoid Duplication Between Systems - don't re-emit the same data as both span attributes and separately published metrics

3. Use Spans for Debugging, Metrics for Operations - route "why did this fail?" questions to traces, and dashboard/SLO questions to metrics

4. Production Sampling Strategy - sample high-volume traces (for example, keep all errors but only a fraction of successes) to control backend cost

Event Payload Standardization

The EventPayloadBuilder ensures all events have consistent, comprehensive payloads:

Step Event Payloads
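Field names below illustrate the shape of a step completion payload; the exact keys are defined by EventPayloadBuilder and may differ:

```ruby
# Illustrative step.completed payload
step_payload = {
  task_id: 12_345,
  step_id: 67_890,
  step_name: 'fetch_cart',
  event_type: 'completed',
  execution_duration: 1.24, # seconds
  attempt_number: 1
}
```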

Task Event Payloads
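Likewise for task-level events - the keys shown are illustrative of what EventPayloadBuilder produces, not an exact schema:

```ruby
# Illustrative task.completed payload
task_payload = {
  task_id: 12_345,
  task_name: 'order_process',
  event_type: 'completed',
  total_steps: 3,
  completed_steps: 3,
  total_duration: 4.72 # seconds
}
```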

Developing with Telemetry

Using EventPublisher Concern

When implementing custom task handlers, events are automatically published around your business logic:
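A sketch of a step handler; the base class name and `process` signature follow the pattern described in this guide, but check your Tasker version for exact details:

```ruby
class FetchCartStepHandler < Tasker::StepHandler::Base
  # The framework-owned handle() publishes step.started and
  # step.completed (or step.failed) around this method - you only
  # implement process() with your business logic.
  def process(task, _sequence, _step)
    cart_id = task.context['cart_id']
    { cart: Cart.find(cart_id).as_json }
  end
end
```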

For API Step Handlers

API step handlers follow the same automatic event publishing pattern:
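A sketch of an API step handler; `connection` is the HTTP client configured from your step template settings, and the `process_results` signature shown is an assumption:

```ruby
class FetchShipmentStepHandler < Tasker::StepHandler::Api
  def process(task, _sequence, _step)
    # connection is preconfigured from the step template's URL/settings;
    # event publishing still happens automatically in handle()
    connection.get("/shipments/#{task.context['shipment_id']}")
  end

  # Optional: customize what gets stored in step.results
  def process_results(step, response, _initial_results)
    step.results = { status: response.status, shipment: JSON.parse(response.body) }
  end
end
```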

Key Architecture Points:

  • ✅ Implement process() for all step handlers, base or API - this is where your business logic lives

  • ✅ In API step handlers, use connection - an HTTP client configured by your step template settings

  • ✅ Optionally override process_results() to customize how return values are stored in step.results

  • ⚠️ Never override handle() - it's framework-only code that publishes events and coordinates execution

Manual Event Publishing (Advanced Use Cases)

For special cases where you need additional custom events, you can still manually publish them:
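A sketch using the `publish_event()` method from the EventPublisher concern; the event name and payload here are hypothetical:

```ruby
class PaymentStepHandler < Tasker::StepHandler::Base
  def process(task, _sequence, _step)
    result = charge_payment(task.context)

    # Custom domain event on top of the automatic lifecycle events
    publish_event('payment.processed',
                  task_id: task.task_id,
                  amount: result[:amount])
    result
  end
end
```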

Error Handling and Observability

Tasker automatically captures comprehensive error information:
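On failure, error data is persisted atomically with the step record and reflected in the span's error status. The persisted shape is roughly as follows (field names illustrative, not the exact schema):

```ruby
# Roughly what ends up in step.results after a failed attempt
error_results = {
  error_class: 'Faraday::TimeoutError',
  error_message: 'execution expired',
  backtrace: ["app/steps/fetch_cart.rb:12:in `process'"],
  attempt_number: 3
}
```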

Log Monitoring

Look for these log patterns:

  • Instrumentation: OpenTelemetry::Instrumentation::* was successfully installed

  • Instrumentation: OpenTelemetry::Instrumentation::Faraday failed to install (expected)

  • Event publishing errors: Error publishing event *:

Summary

The key insight is that spans can provide much of the same information as metrics, but the two serve different purposes:

  • Spans: Rich individual trace records for debugging specific issues

  • Metrics: Aggregated operational data for dashboards and alerting

Recommended approach:

  1. Use TelemetrySubscriber for comprehensive spans (captures everything for debugging)

  2. Create separate MetricsSubscriber for operational metrics (lightweight data for dashboards)

  3. Use single-responsibility subscribers for clean separation of concerns

  4. Consider deriving metrics from spans in high-maturity setups instead of separate collection

This gives you the best of both worlds: detailed debugging capability through spans and efficient operational monitoring through metrics.
