Telemetry & Observability
Overview
Tasker includes comprehensive telemetry capabilities to provide insights into task execution, workflow steps, and overall system performance. The telemetry system leverages OpenTelemetry standards and a unified event architecture to ensure compatibility with a wide range of observability tools and platforms.
OpenTelemetry Architecture: Metrics vs Spans
Tasker follows OpenTelemetry best practices by using both metrics and spans for different purposes:
Spans (Primary Focus)
Purpose: Individual trace records with detailed context for debugging and analysis
Use Cases:
"Why did task #12345 take 30 seconds?"
"What was the exact execution path for this failed workflow?"
"Which step in the order process is the bottleneck?"
Benefits:
Complete request context and timing
Parent-child relationships show workflow hierarchy
Error propagation with full stack traces
Rich attributes for detailed analysis
Implementation:
TelemetrySubscriber creates hierarchical spans for tasks and steps
Metrics (Derived or Separate)
Purpose: Aggregated numerical data for dashboards, alerts, and SLIs/SLOs
Use Cases:
"How many tasks completed in the last hour?"
"What's the 95th percentile task duration?"
"Alert if error rate exceeds 5%"
Benefits:
Very efficient storage and querying
Perfect for real-time dashboards
Lightweight for high-volume scenarios
Implementation: Can be derived from span data or collected separately
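As a rough sketch of the "derived from span data" option, the snippet below aggregates a list of finished span records into a count, an error rate, and a 95th percentile duration. The `SpanRecord` struct and both method names are assumptions made for illustration; they are not part of Tasker or the OpenTelemetry SDK.

```ruby
# Hypothetical finished-span record; not an OpenTelemetry or Tasker type.
SpanRecord = Struct.new(:name, :duration_ms, :status, keyword_init: true)

# Nearest-rank percentile over a list of numbers; nil for an empty list.
def percentile(values, pct)
  return nil if values.empty?

  sorted = values.sort
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[[rank, 1].max - 1]
end

# Collapse individual span records into dashboard-style aggregates.
def derive_metrics(spans)
  total = spans.length
  errors = spans.count { |span| span.status == :error }
  {
    span_count: total,
    error_rate: total.zero? ? 0.0 : errors.to_f / total,
    p95_duration_ms: percentile(spans.map(&:duration_ms), 95)
  }
end

spans = [
  SpanRecord.new(name: 'tasker.task.execution', duration_ms: 120, status: :ok),
  SpanRecord.new(name: 'tasker.task.execution', duration_ms: 250, status: :ok),
  SpanRecord.new(name: 'tasker.task.execution', duration_ms: 90, status: :ok),
  SpanRecord.new(name: 'tasker.task.execution', duration_ms: 30_000, status: :error)
]
puts derive_metrics(spans).inspect
```

Deriving metrics this way keeps a single source of truth, but it only works if every span is retained; with sampling enabled, separately collected metrics stay accurate while span-derived ones become estimates.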
Recommended Strategy
Key Features
Unified Event System - Single Events::Publisher with consistent event publishing patterns
Standardized Event Payloads - EventPayloadBuilder ensures consistent telemetry data structure
Production-Ready OpenTelemetry Integration - Full instrumentation stack with safety mechanisms
Hierarchical Span Creation - Proper parent-child relationships for complex workflows
Automatic Step Error Persistence - Complete error data capture with atomic transactions
Memory-Safe Operation - Database connection pooling and leak prevention
Comprehensive Event Lifecycle Tracking - Task, step, workflow, and orchestration events
Sensitive Data Filtering - Automatic security and privacy protection
Developer-Friendly API - Clean EventPublisher concern for easy event publishing
Custom Event Subscribers - Generator and BaseSubscriber for creating integrations
Event Discovery System - Complete event catalog with documentation and examples
Architecture
Tasker's telemetry is built on a unified event system with these main components:
Events::Publisher - Centralized event publishing using dry-events with OpenTelemetry integration
EventPublisher Concern - Clean interface providing publish_event(), publish_step_event(), etc.
EventPayloadBuilder - Standardized payload creation for consistent telemetry data
TelemetrySubscriber - Converts events to OpenTelemetry spans (spans only, no metrics)
Event Catalog - Complete event discovery and documentation system
BaseSubscriber - Foundation for creating custom event subscribers
Subscriber Generator - Tool for creating custom integrations with external services
Configuration - OpenTelemetry setup with production-ready safety mechanisms
Event Flow
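The flow between the components listed above (a publisher emits a named event with a payload, and each subscriber reacts to it) can be sketched in plain Ruby. Tasker's Events::Publisher is built on dry-events; the `SketchPublisher` class here is a hypothetical stand-in that only illustrates the publish/subscribe shape, and the step name is invented.

```ruby
# Minimal in-process event bus. Tasker itself uses dry-events; this plain-Ruby
# stand-in only illustrates the publish -> subscribe flow.
class SketchPublisher
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register a block to be called whenever event_name is published.
  def subscribe(event_name, &handler)
    @subscribers[event_name] << handler
    self
  end

  # Deliver the payload to every handler registered for event_name.
  def publish(event_name, payload = {})
    # fetch avoids creating empty entries for events nobody subscribed to
    @subscribers.fetch(event_name, []).each { |handler| handler.call(payload) }
    payload
  end
end

publisher = SketchPublisher.new
recorded = []

# A telemetry-style subscriber would turn each event into a span; here it just records it.
publisher.subscribe('step.completed') do |payload|
  recorded << "span for step #{payload[:step_name]}"
end

publisher.publish('step.completed', { step_name: 'validate_order' })
puts recorded.inspect
```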
Two Complementary Observability Systems
Tasker provides two distinct but complementary observability systems designed for different use cases:
TelemetrySubscriber (Event-Driven Spans)
Purpose: Detailed tracing and debugging with OpenTelemetry spans
Trigger: Automatic via event subscription (no manual instrumentation needed)
Use Cases:
"Why did task #12345 fail?"
"What's the execution path through this workflow?"
"Which step is causing the bottleneck?"
Data: Rich contextual information, hierarchical relationships, error details
Storage: OpenTelemetry backends (Jaeger, Zipkin, Honeycomb)
Performance: Optimized for detailed context, not high-volume aggregation
MetricsBackend (Native Metrics Collection)
Purpose: High-performance aggregated metrics for dashboards and alerting
Trigger: Direct collection during workflow execution (no events)
Use Cases:
"How many tasks completed in the last hour?"
"What's the 95th percentile execution time?"
"Alert if error rate exceeds 5%"
Data: Numerical counters, gauges, histograms with labels
Storage: Prometheus, JSON, CSV exports
Performance: Optimized for high-volume, low-latency operations
Why Two Systems?
Performance: Native metrics avoid event publishing overhead for high-frequency operations
Reliability: Metrics collection continues even if the event system has issues
Flexibility: Choose the appropriate storage backend for each use case
Scalability: Each system is optimized for its specific workload
✅ TelemetrySubscriber - Spans Only (Event-Driven)
✅ MetricsBackend - Native Collection (Direct)
✅ Clean Architecture - Complementary Systems
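To make the "direct collection" idea concrete, here is a minimal sketch of a thread-safe, labelled counter that is incremented inline, with no event published. The `SketchMetrics` class, its method names, and the metric name are assumptions for illustration; this is not the MetricsBackend API.

```ruby
# Hypothetical in-memory metrics store; not Tasker's MetricsBackend API.
class SketchMetrics
  def initialize
    @mutex = Mutex.new
    @counters = Hash.new(0)
  end

  # Increment a labelled counter directly, without going through an event bus.
  def increment(name, labels = {}, amount = 1)
    key = key_for(name, labels)
    @mutex.synchronize { @counters[key] += amount }
  end

  # Current value of a labelled counter (0 if it was never incremented).
  def value(name, labels = {})
    key = key_for(name, labels)
    @mutex.synchronize { @counters[key] }
  end

  private

  # Normalize label order so the same label set always maps to the same key.
  def key_for(name, labels)
    [name, labels.to_a.sort_by { |label, _| label.to_s }.to_h]
  end
end

metrics = SketchMetrics.new
metrics.increment('tasks_completed_total', { status: 'ok' })
metrics.increment('tasks_completed_total', { status: 'ok' })
puts metrics.value('tasks_completed_total', { status: 'ok' })
```

The hot path is a mutex-guarded hash update, which is why this style suits high-frequency counters better than building and dispatching an event per increment.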
See METRICS.md for comprehensive details on the native metrics system including cache strategies, Kubernetes integration, and production deployment patterns.
Configuration
Tasker Configuration
Configure Tasker's telemetry in config/initializers/tasker.rb:
OpenTelemetry Configuration
Configure OpenTelemetry with production-ready settings in config/initializers/opentelemetry.rb:
Custom Telemetry Integrations
Beyond OpenTelemetry, Tasker's event system enables easy integration with any observability or monitoring service:
Creating Custom Metrics Subscribers
Tasker now provides specialized tooling for metrics collection:
The --metrics flag creates a specialized subscriber with:
Built-in helper methods for extracting timing, error, and performance metrics
Automatic tag generation for categorization
Examples for StatsD, DataDog, Prometheus, and other metrics systems
Safe numeric value extraction with defaults
Production-ready patterns for operational monitoring
Creating Custom Subscribers
For non-metrics integrations, use the regular subscriber generator:
Example Custom Integrations
Metrics Collection (Using Helper Methods):
Error Tracking (Sentry):
Metrics Helper Methods
BaseSubscriber now includes specialized helper methods for extracting common metrics data:
These helpers standardize metrics extraction and ensure consistency across different subscriber implementations.
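The kind of helper described here (safe numeric extraction with defaults, plus tag generation) might look roughly like the following. The module name, method names, and payload keys are hypothetical and chosen for illustration; consult BaseSubscriber itself for the actual helper names and signatures.

```ruby
# Hypothetical helpers; names and payload keys are illustrative,
# not the BaseSubscriber API.
module SketchMetricHelpers
  module_function

  # Read a numeric value from a payload with symbol or string keys,
  # falling back to a default when it is missing or not numeric.
  def safe_number(payload, key, default = 0.0)
    raw = payload.fetch(key.to_sym) { payload.fetch(key.to_s, nil) }
    Float(raw)
  rescue ArgumentError, TypeError
    default
  end

  # Build "key:value" tags from whichever of the given keys are present.
  def metric_tags(payload, keys)
    keys.filter_map do |key|
      value = payload[key.to_sym] || payload[key.to_s]
      "#{key}:#{value}" unless value.nil?
    end
  end
end

event = { task_name: 'order_process', execution_duration: '2.5' }
puts SketchMetricHelpers.safe_number(event, :execution_duration)
puts SketchMetricHelpers.metric_tags(event, %i[task_name step_name]).inspect
```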
For complete documentation on creating custom subscribers and integration examples, see EVENT_SYSTEM.md.
Integration with OpenTelemetry
Tasker's unified event system automatically integrates with OpenTelemetry through the enhanced TelemetrySubscriber. For each task:
Root Task Span: A root span (tasker.task.execution) is created when the task starts and stored for the entire task lifecycle
Child Step Spans: Child spans (tasker.step.execution) are created for each step with proper parent-child relationships to the root task span
Hierarchical Context: All spans maintain proper parent-child relationships, ensuring full traceability in Jaeger/Zipkin
Event Annotations: Each span includes relevant events (task.started, step.completed, etc.) with comprehensive attributes
Error Propagation: Error status and messages are properly propagated through the span hierarchy
Performance Metrics: Execution duration and attempt tracking are captured at both task and step levels
Span Hierarchy Example
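A small model of the hierarchy described above: one root tasker.task.execution span, with a tasker.step.execution child per step, each carrying the task_id. The `SketchSpan` class and the step names are hypothetical and exist only to show the shape of the tree; real spans are created by the OpenTelemetry SDK.

```ruby
# Hypothetical span node used only for illustration; real spans come from
# the OpenTelemetry SDK.
class SketchSpan
  attr_reader :name, :attributes, :children, :events

  def initialize(name, attributes = {})
    @name = name
    @attributes = attributes
    @children = []
    @events = []
  end

  # Attach a lifecycle event (for example task.started) to this span.
  def add_event(event_name)
    @events << event_name
    self
  end

  # Create a child span that inherits the task_id for correlation.
  def child(name, attributes = {})
    span = SketchSpan.new(name, { task_id: @attributes[:task_id] }.merge(attributes))
    @children << span
    span
  end

  # Render the span tree as indented lines, one span per line.
  def to_lines(depth = 0)
    rendered = attributes.map { |key, value| "#{key}=#{value}" }.join(' ')
    line = "#{'  ' * depth}#{name} #{rendered}"
    [line] + children.flat_map { |span| span.to_lines(depth + 1) }
  end
end

task = SketchSpan.new('tasker.task.execution', { task_id: 12_345 })
task.add_event('task.started')
task.child('tasker.step.execution', { step_name: 'validate_order' }).add_event('step.completed')
task.child('tasker.step.execution', { step_name: 'charge_payment' }).add_event('step.completed')
puts task.to_lines
```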
Key Improvements
Proper Hierarchical Context: All step spans are now properly parented to their task span
Consistent Span Names: Standardized span names (tasker.task.execution, tasker.step.execution) make filtering and querying easier
Rich Event Annotations: Spans include relevant lifecycle events as annotations for detailed timeline visibility
Error Context Preservation: Failed steps maintain full error context while still being linked to their parent task
Task ID Propagation: All spans include the task_id for easy correlation across the entire workflow
Best Practices
1. Single Responsibility for Telemetry Components
2. Avoid Duplication Between Systems
3. Use Spans for Debugging, Metrics for Operations
4. Production Sampling Strategy
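For the sampling strategy, one common approach is to keep a fixed fraction of traces and to decide from the trace ID, so that all spans belonging to one task are kept or dropped together. The sketch below illustrates that idea only; `SketchRatioSampler` is a hypothetical class, not the OpenTelemetry SDK's sampler, and in a real deployment the SDK's own sampler configuration should be used.

```ruby
# Illustrative ratio sampler keyed on the trace ID; not the OpenTelemetry
# SDK implementation.
class SketchRatioSampler
  MAX_ID = 2**64

  def initialize(ratio)
    raise ArgumentError, 'ratio must be between 0.0 and 1.0' unless (0.0..1.0).cover?(ratio)

    @threshold = (ratio * MAX_ID).to_i
  end

  # trace_id_hex is a 32-character hex trace ID. The decision depends only on
  # the ID, so every span in the same trace gets the same answer.
  def sample?(trace_id_hex)
    trace_id_hex[-16..].to_i(16) < @threshold
  end
end

sampler = SketchRatioSampler.new(0.1)
puts sampler.sample?('0af7651916cd43dd8448eb211c80319c')
```

Metrics are unaffected by this choice when they are collected separately, which is one reason to keep counters and histograms out of the span pipeline.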
Event Payload Standardization
The EventPayloadBuilder ensures all events have consistent, comprehensive payloads:
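A minimal sketch of what standardized payload building could look like: every step event gets the same core keys, and caller-supplied extras cannot overwrite them. The module name and all key names below are assumptions for illustration and do not describe EventPayloadBuilder's actual schema.

```ruby
require 'time'

# Hypothetical builder; the key names are illustrative, not Tasker's actual
# payload schema.
module SketchPayloadBuilder
  module_function

  # Build a step event payload with a fixed set of core keys.
  def build_step_payload(task_id:, step_id:, step_name:, event_type:, extra: {}, now: Time.now.utc)
    core = {
      task_id: task_id,
      step_id: step_id,
      step_name: step_name,
      event_type: event_type,
      timestamp: now.iso8601
    }
    # Core keys are merged last so extras can add fields but never replace them.
    extra.merge(core)
  end
end

payload = SketchPayloadBuilder.build_step_payload(
  task_id: 12_345,
  step_id: 1,
  step_name: 'validate_order',
  event_type: :completed,
  extra: { attempt_number: 1 }
)
puts payload.inspect
```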
Step Event Payloads
Task Event Payloads
Developing with Telemetry
Using EventPublisher Concern
When implementing custom task handlers, events are automatically published around your business logic:
For API Step Handlers
API step handlers follow the same automatic event publishing pattern:
Key Architecture Points:
✅ Implement process() for all step handlers, base or API
✅ Implement process() for API step handlers and use the connection, which is a client configured by your step template settings
✅ Optionally override process_results() to customize how return values are stored in step.results
⚠️ Never override handle() - it's framework-only code that publishes events and coordinates execution
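The division of labor in these points follows the template method pattern: a framework-owned handle() wraps a user-supplied process() and publishes events around it. The sketch below shows that pattern in isolation. `SketchStepHandler` is a hypothetical base class, the event names are illustrative, and steps are modelled as plain hashes; none of this is Tasker's actual handler implementation.

```ruby
# Hypothetical base class illustrating the pattern; not Tasker's actual
# step handler.
class SketchStepHandler
  attr_reader :published

  def initialize
    @published = []
  end

  # Framework-owned entry point: publishes events around process().
  # Subclasses should not override this.
  def handle(task, step)
    publish('step.started', step)
    result = process(task, step)
    process_results(step, result)
    publish('step.completed', step)
    result
  rescue StandardError => e
    publish('step.failed', step.merge(error: e.message))
    raise
  end

  # Business logic hook: subclasses implement this.
  def process(_task, _step)
    raise NotImplementedError, "#{self.class} must implement process"
  end

  # Optional hook: controls how the return value is stored on the step.
  def process_results(step, result)
    step[:results] = result
  end

  private

  # Stand-in for real event publishing; records events in memory.
  def publish(event_name, payload)
    @published << [event_name, payload]
  end
end

# A handler only supplies business logic; event publishing comes for free.
class ValidateOrderHandler < SketchStepHandler
  def process(task, _step)
    { valid: task[:order_id].to_i.positive? }
  end
end

handler = ValidateOrderHandler.new
step = { name: 'validate_order' }
handler.handle({ order_id: 7 }, step)
puts handler.published.map(&:first).inspect
```

Because handle() owns the event publishing and error path, overriding it in a subclass would silently drop that telemetry, which is the reason for the warning above.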
Alternative: Manual Event Publishing (Advanced Use Cases)
For special cases where you need additional custom events, you can still manually publish them:
Error Handling and Observability
Tasker automatically captures comprehensive error information:
Log Monitoring
Look for these log patterns:
Instrumentation: OpenTelemetry::Instrumentation::* was successfully installed
Instrumentation: OpenTelemetry::Instrumentation::Faraday failed to install (expected)
Event publishing errors: Error publishing event * :
Summary
Spans can provide much of the same information as metrics, but the two serve different purposes:
Spans: Rich individual trace records for debugging specific issues
Metrics: Aggregated operational data for dashboards and alerting
Recommended approach:
Use TelemetrySubscriber for comprehensive spans (captures everything for debugging)
Create a separate MetricsSubscriber for operational metrics (lightweight data for dashboards)
Use single-responsibility subscribers for clean separation of concerns
Consider deriving metrics from spans in high-maturity setups instead of separate collection
This gives you the best of both worlds: detailed debugging capability through spans and efficient operational monitoring through metrics.