Metrics & Performance
Overview
Tasker's metrics system provides production-ready observability for workflow orchestration through a hybrid architecture that combines high-performance in-memory collection with persistent, distributed storage. This system is designed to be cache-store agnostic, infrastructure-independent, and production-scalable.
Architecture Philosophy
Distinct Systems: Event Subscribers vs Metrics Collector
Tasker employs two complementary but distinct observability systems with clear separation of concerns:
🔍 TelemetrySubscriber (Event-Driven Spans)
Purpose: Detailed tracing and debugging with OpenTelemetry spans
Use Cases:
"Why did task #12345 fail?"
"What's the execution path through this workflow?"
"Which step is causing the bottleneck?"
Data: Rich contextual information, hierarchical relationships, error details
Storage: OpenTelemetry backends (Jaeger, Zipkin, Honeycomb)
Performance: Optimized for detailed context, not high-volume aggregation
📊 MetricsBackend (Native Metrics Collection)
Purpose: High-performance aggregated metrics for dashboards and alerting
Use Cases:
"How many tasks completed in the last hour?"
"What's the 95th percentile execution time?"
"Alert if error rate exceeds 5%"
Data: Numerical counters, gauges, histograms with labels
Storage: Prometheus, JSON, CSV exports
Performance: Optimized for high-volume, low-latency operations
Why Two Systems?
Benefits of Separation:
Performance: Native metrics avoid event publishing overhead for high-frequency operations
Reliability: Metrics collection continues even if event system has issues
Flexibility: Choose appropriate storage backend for each use case
Scalability: Each system optimized for its specific workload
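To make this separation concrete, the sketch below contrasts the two styles of instrumentation. The span half uses the standard OpenTelemetry Ruby API; the metric half is an illustrative call shape only (see the MetricsBackend sketch in the next section), not Tasker's actual metrics API.

```ruby
require 'opentelemetry/sdk'

# Span (TelemetrySubscriber side): rich, per-task context for debugging
# questions like "why did task #12345 fail?".
tracer = OpenTelemetry.tracer_provider.tracer('tasker')
tracer.in_span('tasker.task.execute', attributes: { 'task.id' => '12345' }) do |span|
  span.add_event('step.completed', attributes: { 'step.name' => 'fetch_order' })
end

# Metric (MetricsBackend side): a cheap, labeled counter suited to dashboards
# and alerting. Illustrative only -- method names here are assumptions.
MetricsBackend.instance.increment_counter('tasker.tasks.completed', labels: { status: 'success' })
```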
Core Components
1. MetricsBackend (Thread-Safe Collection)
The core metrics collection engine uses Concurrent::Hash for thread-safe operations:
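The original listing is not reproduced here, but the pattern is straightforward. The following is a minimal sketch under the assumptions that the backend is a process-wide singleton and that metrics are keyed by name plus label set; class and method names are illustrative, not Tasker's exact implementation.

```ruby
require 'concurrent'
require 'singleton'

# Minimal sketch of a thread-safe, in-memory metrics store.
class MetricsBackend
  include Singleton

  def initialize
    # Concurrent::Hash keeps reads and writes safe across worker threads.
    @metrics       = Concurrent::Hash.new
    @creation_lock = Mutex.new
  end

  # Increment a counter identified by metric name + label set.
  def increment_counter(name, labels: {}, by: 1)
    key = metric_key(name, labels)
    counter = @metrics[key]
    # Guard first-time creation so two threads don't race on the same key.
    counter ||= @creation_lock.synchronize { @metrics[key] ||= Concurrent::AtomicFixnum.new(0) }
    counter.increment(by)
  end

  # Set a gauge to an absolute value (last write wins).
  def set_gauge(name, value, labels: {})
    @metrics[metric_key(name, labels)] = value
  end

  # Snapshot consumed by the sync and export layers.
  def snapshot
    @metrics.each_with_object({}) do |(key, value), out|
      out[key] = value.is_a?(Concurrent::AtomicFixnum) ? value.value : value
    end
  end

  private

  def metric_key(name, labels)
    [name, labels.sort].freeze
  end
end
```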
2. Cache-Agnostic Sync Strategies
Adaptive synchronization based on Rails.cache capabilities:
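A minimal sketch of the idea: inspect which store Rails.cache is actually backed by and pick the richest coordination strategy it can support. The strategy names are illustrative.

```ruby
# Choose a sync strategy from the configured Rails.cache store.
module MetricsSync
  def self.strategy
    case Rails.cache.class.name
    when 'ActiveSupport::Cache::RedisCacheStore'
      :distributed_atomic   # atomic ops + distributed locking + TTL inspection
    when 'ActiveSupport::Cache::MemCacheStore'
      :distributed_basic    # atomic increments, but no locking or TTL inspection
    else
      :local_only           # :file_store / :memory_store -- no cross-container coordination
    end
  end
end
```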
3. Export Coordination System
TTL-aware export scheduling with distributed coordination:
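The sketch below shows the coordination half of this idea: one container acquires a short-lived, self-expiring lock through Rails.cache before exporting, so concurrent export jobs don't collide. Key names and timings are placeholders, and the TTL-extension logic is omitted.

```ruby
# Minimal export-coordination sketch built on standard Rails.cache options.
class ExportCoordinator
  LOCK_KEY = 'tasker:metrics:export_lock'
  LOCK_TTL = 5.minutes

  # Runs the block only if this process wins the lock; returns true on export.
  def with_export_lock(job_id)
    # unless_exist: true gives "first writer wins" semantics on the Redis and
    # Memcached stores, with automatic expiry as a safety net.
    acquired = Rails.cache.write(LOCK_KEY, job_id, unless_exist: true, expires_in: LOCK_TTL)
    return false unless acquired

    begin
      yield
      true
    ensure
      Rails.cache.delete(LOCK_KEY)
    end
  end
end
```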
Cache Strategy Benefits and Drawbacks
Redis Cache Store (:redis_cache_store)
✅ Benefits
Full Coordination: Atomic operations enable perfect cross-container synchronization
Distributed Locking: Prevents concurrent export conflicts
TTL Inspection: Can detect and extend cache expiration times
High Performance: Optimized for concurrent access patterns
Persistence: Survives container restarts and deployments
⚠️ Considerations
Infrastructure Dependency: Requires Redis cluster for production
Network Latency: Sync operations involve network calls
Redis Memory: Additional memory usage for metrics storage
Complexity: More moving parts in infrastructure stack
Best For: Production environments with existing Redis infrastructure, multi-container deployments requiring coordination.
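For reference, a typical Rails configuration for this store (connection values are placeholders):

```ruby
# config/environments/production.rb
config.cache_store = :redis_cache_store, {
  url: ENV.fetch('REDIS_URL', 'redis://redis.internal:6379/1'),
  connect_timeout: 1,      # fail fast if Redis is unreachable
  read_timeout: 0.2,
  write_timeout: 0.2,
  reconnect_attempts: 1
}
```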
Memcached (:mem_cache_store)
✅ Benefits
Distributed Storage: Shared cache across multiple containers
Atomic Operations: Basic increment/decrement support
Mature Technology: Well-understood operational characteristics
Memory Efficiency: Optimized for memory usage patterns
⚠️ Considerations
Limited Locking: No distributed locking support
No TTL Inspection: Cannot detect or extend expiration times
Basic Coordination: Read-modify-write patterns only
Data Loss Risk: No persistence across restarts
Best For: Environments with existing Memcached infrastructure, scenarios where basic distributed storage is sufficient.
File Store (:file_store)
✅ Benefits
No Infrastructure: Works without external dependencies
Persistence: Survives container restarts
Simple Deployment: Easy to set up and maintain
Local Performance: No network latency for operations
⚠️ Considerations
No Distribution: Cannot coordinate across containers
File I/O Overhead: Disk operations slower than memory
Concurrency Limits: File locking limitations
Storage Management: Requires disk space management
Best For: Single-container deployments, development environments, scenarios where external dependencies are not allowed.
Memory Store (:memory_store)
✅ Benefits
Highest Performance: Pure in-memory operations
No Dependencies: No external infrastructure required
Simple Configuration: Works out of the box
Low Latency: Fastest possible cache operations
⚠️ Considerations
Single Process Only: No cross-container coordination
Data Loss: Lost on container restart
Memory Limits: Bounded by container memory
No Persistence: Cannot survive deployments
Best For: Development environments, single-container applications, testing scenarios.
Kubernetes Integration
CronJob-Based Export Architecture
Tasker follows cloud-native patterns by providing export capabilities while delegating scheduling to Kubernetes:
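The scheduling itself lives in a Kubernetes CronJob manifest; the application only needs to expose an idempotent export entry point for that CronJob to invoke (for example, `bundle exec rake tasker:export_metrics` every 5 minutes). The rake task below is a sketch of that entry point; the task name is hypothetical, and it reuses the ExportCoordinator sketch from earlier.

```ruby
# lib/tasks/tasker_metrics.rake -- illustrative entry point for a CronJob.
namespace :tasker do
  desc 'Export aggregated metrics (intended to be invoked by a Kubernetes CronJob)'
  task export_metrics: :environment do
    coordinator = ExportCoordinator.new
    exported = coordinator.with_export_lock(ENV.fetch('HOSTNAME', 'export-job')) do
      # ...render the current metrics snapshot and push it to the configured
      # destination here (Prometheus, JSON, CSV, etc.)...
    end
    puts exported ? 'metrics export completed' : 'another pod holds the export lock; skipping'
  end
end
```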
Cache Synchronization
Monitoring and Alerting
Configuration
Basic Configuration
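As a sketch only: the option names below are assumptions made for this example, not Tasker's documented settings; consult the initializer generated by the engine for the real configuration surface.

```ruby
# config/initializers/tasker.rb -- illustrative; option names are hypothetical.
Tasker.configuration do |config|
  config.telemetry do |telemetry|
    telemetry.metrics_enabled = true           # hypothetical toggle
    telemetry.metrics_format  = 'prometheus'   # hypothetical: prometheus | json | csv
    telemetry.sync_interval   = 30.seconds     # hypothetical cache-sync cadence
  end
end
```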
Advanced Configuration
Environment-Specific Configuration
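Because the sync strategy follows whatever Rails.cache is configured to use, the cache store is usually the main thing that varies per environment. A minimal sketch using standard Rails options:

```ruby
# config/environments/development.rb -- single process, no coordination needed
config.cache_store = :memory_store, { size: 64.megabytes }

# config/environments/production.rb -- distributed coordination via Redis
config.cache_store = :redis_cache_store, { url: ENV['REDIS_URL'] }
```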
Dependencies and Assumptions
Required Dependencies
Optional Dependencies
Infrastructure Assumptions
Cache Store
Rails.cache properly configured
Cache store supports basic read/write operations
TTL (time-to-live) support for automatic cleanup
Background Jobs
ActiveJob backend configured (Sidekiq, Resque, DelayedJob, etc.)
Job queues operational and monitored
Retry mechanisms available for failed jobs
Container Environment
Container orchestration platform (Kubernetes, Docker Swarm, etc.)
CronJob scheduling capabilities
Resource limits and monitoring
Monitoring Infrastructure
Prometheus server (for Prometheus export)
Grafana or similar for visualization
Alerting system for operational issues
Performance Expectations
In-Memory Operations
Latency: < 1ms for counter/gauge operations
Throughput: > 10,000 operations/second per container
Memory: ~1MB per 10,000 unique metric combinations
Concurrency: Scales linearly with CPU cores
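These figures depend heavily on hardware and thread counts; a rough way to sanity-check the counter throughput locally is a micro-benchmark against the MetricsBackend sketch from earlier (numbers produced are environment-specific, not guaranteed):

```ruby
require 'benchmark'

# Time 100,000 counter increments spread across 4 threads.
backend = MetricsBackend.instance
elapsed = Benchmark.realtime do
  4.times.map do
    Thread.new { 25_000.times { backend.increment_counter('bench.ops', labels: { worker: 'local' }) } }
  end.each(&:join)
end
puts format('%.0f increments/sec', 100_000 / elapsed)
```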
Cache Synchronization
Frequency: Every 30 seconds (configurable)
Latency: < 100ms for Redis, < 500ms for Memcached
Batch Size: 1,000 metrics per sync operation
Network: Minimal bandwidth usage with compression
Export Operations
Frequency: Every 5 minutes (configurable via CronJob)
Duration: < 30 seconds for 10,000 metrics
Resource Usage: < 256MB memory, < 200m CPU
Network: Compressed payload reduces bandwidth by 70%
Interoperability
Prometheus Integration
Native Prometheus Format
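For orientation, the Prometheus text exposition format is a flat list of `name{labels} value` lines. A minimal renderer over the snapshot sketch from earlier might look like the following; the metric name in the sample output is illustrative, not one of Tasker's actual metric names.

```ruby
# Render a {[name, labels] => value} snapshot into Prometheus exposition format.
def to_prometheus(snapshot)
  snapshot.map do |(name, labels), value|
    metric    = name.to_s.tr('.', '_')
    label_str = labels.map { |k, v| %(#{k}="#{v}") }.join(',')
    label_str.empty? ? "#{metric} #{value}" : "#{metric}{#{label_str}} #{value}"
  end.join("\n")
end

# Example output line: tasker_tasks_completed{status="success"} 42
```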
Grafana Dashboards
OpenTelemetry Compatibility
The native metrics system is designed to complement the OpenTelemetry spans produced by TelemetrySubscriber. For OpenTelemetry metrics integration, use the existing TelemetrySubscriber, which automatically publishes to OpenTelemetry backends.
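For the spans side, the standard OpenTelemetry Ruby SDK setup applies (assuming the opentelemetry-sdk and opentelemetry-exporter-otlp gems are installed); a minimal initializer:

```ruby
# config/initializers/opentelemetry.rb -- standard OpenTelemetry Ruby SDK setup.
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'tasker'
  c.use_all   # enable any installed instrumentation gems
  # The OTLP endpoint is read from OTEL_EXPORTER_OTLP_ENDPOINT by default.
end
```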
Custom Export Formats
DataDog Integration
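DataDog can be fed either through its OTLP intake or, for the native metrics, by forwarding a snapshot to a local DogStatsD agent. The sketch below uses the dogstatsd-ruby gem; the snapshot shape matches the MetricsBackend sketch from earlier, and the namespace and tag choices are assumptions.

```ruby
require 'datadog/statsd'  # dogstatsd-ruby gem

# Illustrative custom exporter: forward a metrics snapshot to a DogStatsD agent.
def export_to_datadog(snapshot, host: 'localhost', port: 8125)
  statsd = Datadog::Statsd.new(host, port, namespace: 'tasker')
  snapshot.each do |(name, labels), value|
    statsd.count(name, value, tags: labels.map { |k, v| "#{k}:#{v}" })
  end
ensure
  statsd&.close  # flush buffered metrics before returning
end
```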
Production Deployment Patterns
High-Availability Setup
Resource Management
Security Considerations
Troubleshooting
Common Issues
Cache Sync Failures
Export Job Failures
Missing Metrics
Debugging Commands
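A few safe starting points from a Rails console, using only standard Rails and ActiveJob APIs (key names are placeholders):

```ruby
Rails.cache.class.name                      # which cache store is actually configured?
Rails.cache.write('tasker:metrics:ping', 1, expires_in: 60)
Rails.cache.read('tasker:metrics:ping')     # => 1 if the cache round-trips correctly
ActiveJob::Base.queue_adapter               # which background-job backend is wired up?
```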
Performance Monitoring
This comprehensive metrics system provides production-ready observability while maintaining the flexibility to adapt to different infrastructure environments and scaling requirements.