Metrics & Performance

Overview

Tasker's metrics system provides production-ready observability for workflow orchestration through a hybrid architecture that combines high-performance in-memory collection with persistent, distributed storage. This system is designed to be cache-store agnostic, infrastructure-independent, and production-scalable.

Architecture Philosophy

Distinct Systems: Event Subscribers vs Metrics Collector

Tasker employs two complementary but distinct observability systems with clear separation of concerns:

🔍 TelemetrySubscriber (Event-Driven Spans)

  • Purpose: Detailed tracing and debugging with OpenTelemetry spans

  • Use Cases:

    • "Why did task #12345 fail?"

    • "What's the execution path through this workflow?"

    • "Which step is causing the bottleneck?"

  • Data: Rich contextual information, hierarchical relationships, error details

  • Storage: OpenTelemetry backends (Jaeger, Zipkin, Honeycomb)

  • Performance: Optimized for detailed context, not high-volume aggregation
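
For a concrete sense of what this system produces, here is a minimal sketch using the OpenTelemetry Ruby SDK; the span and attribute names are illustrative examples, not Tasker's actual schema:

```ruby
require 'opentelemetry/sdk'

# Illustrative only: span and attribute names are hypothetical,
# not Tasker's actual naming scheme.
tracer = OpenTelemetry.tracer_provider.tracer('tasker')

tracer.in_span('tasker.task.execute', attributes: { 'task.id' => 12_345 }) do
  # Nested spans capture the hierarchical execution path through a workflow;
  # in_span records and re-raises exceptions, preserving error details.
  tracer.in_span('tasker.step.execute', attributes: { 'step.name' => 'fetch_order' }) do
    # ... step handler logic ...
  end
end
```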

📊 MetricsBackend (Native Metrics Collection)

  • Purpose: High-performance aggregated metrics for dashboards and alerting

  • Use Cases:

    • "How many tasks completed in the last hour?"

    • "What's the 95th percentile execution time?"

    • "Alert if error rate exceeds 5%"

  • Data: Numerical counters, gauges, histograms with labels

  • Storage: Prometheus, JSON, CSV exports

  • Performance: Optimized for high-volume, low-latency operations

Why Two Systems?

Benefits of Separation:

  • Performance: Native metrics avoid event publishing overhead for high-frequency operations

  • Reliability: Metrics collection continues even if event system has issues

  • Flexibility: Choose appropriate storage backend for each use case

  • Scalability: Each system optimized for its specific workload

Core Components

1. MetricsBackend (Thread-Safe Collection)

The core metrics collection engine using Concurrent::Hash for thread-safe operations:
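
Since the real implementation ships in the gem, the following is only a minimal sketch of the pattern, with hypothetical class and method names:

```ruby
require 'concurrent'

# Minimal sketch of the collection pattern; class, method, and key names
# are hypothetical. See the gem source for the actual interface.
class MetricsBackendSketch
  def initialize
    # Concurrent::Hash provides thread-safe access; AtomicFixnum values
    # guarantee increments are never lost under contention.
    @counters = Concurrent::Hash.new { |hash, key| hash[key] = Concurrent::AtomicFixnum.new(0) }
  end

  def increment_counter(name, labels = {})
    @counters[[name, labels.sort]].increment
  end

  # Point-in-time snapshot for sync/export, decoupled from hot-path writes.
  def snapshot
    @counters.transform_values(&:value)
  end
end

backend = MetricsBackendSketch.new
backend.increment_counter('task_completed_total', status: 'success')
```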

2. Cache-Agnostic Sync Strategies

Adaptive synchronization based on Rails.cache capabilities:
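
One way to picture the capability detection (the strategy names here are hypothetical, not the gem's actual classification):

```ruby
# Hypothetical sketch: derive a sync strategy from the configured store.
def sync_strategy
  case Rails.cache
  when ActiveSupport::Cache::RedisCacheStore
    :distributed_atomic   # atomic ops, distributed locks, TTL inspection
  when ActiveSupport::Cache::MemCacheStore
    :distributed_basic    # atomic increment/decrement, no locking
  when ActiveSupport::Cache::FileStore
    :local_persistent     # survives restarts, single-node only
  else
    :local_only           # e.g. MemoryStore: fastest, process-local
  end
end
```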

3. Export Coordination System

TTL-aware export scheduling with distributed coordination:
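
The coordination primitive can be as simple as a best-effort lock built from `Rails.cache.write` with `unless_exist`; the key and method names below are a hypothetical sketch:

```ruby
require 'socket'

EXPORT_LOCK_KEY = 'tasker:metrics:export_lock' # hypothetical key name

def attempt_export
  # unless_exist turns the write into a best-effort distributed lock:
  # exactly one container wins the slot until the TTL expires.
  acquired = Rails.cache.write(EXPORT_LOCK_KEY, Socket.gethostname,
                               unless_exist: true, expires_in: 5.minutes)
  return unless acquired # another container holds the lock; skip this cycle

  begin
    export_metrics! # the winner serializes and ships the aggregated metrics
  ensure
    Rails.cache.delete(EXPORT_LOCK_KEY)
  end
end
```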

Cache Strategy Benefits and Drawbacks

Redis Cache Store (:redis_cache_store)

✅ Benefits

  • Full Coordination: Atomic operations enable reliable cross-container synchronization

  • Distributed Locking: Prevents concurrent export conflicts

  • TTL Inspection: Can detect and extend cache expiration times

  • High Performance: Optimized for concurrent access patterns

  • Persistence: Survives container restarts and deployments

⚠️ Considerations

  • Infrastructure Dependency: Requires Redis cluster for production

  • Network Latency: Sync operations involve network calls

  • Redis Memory: Additional memory usage for metrics storage

  • Complexity: More moving parts in infrastructure stack

Best For: Production environments with existing Redis infrastructure, multi-container deployments requiring coordination.
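
With the Redis store, `Rails.cache.increment` maps to a single atomic Redis INCRBY, so concurrent containers can fold counts into one shared key without read-modify-write races (key name illustrative):

```ruby
# Each container folds its local delta into the shared key atomically;
# no read-modify-write race is possible.
Rails.cache.increment('tasker:metrics:task_completed_total', 5, expires_in: 1.hour)
```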

Memcached (:mem_cache_store)

✅ Benefits

  • Distributed Storage: Shared cache across multiple containers

  • Atomic Operations: Basic increment/decrement support

  • Mature Technology: Well-understood operational characteristics

  • Memory Efficiency: Slab allocation keeps memory usage predictable under load

⚠️ Considerations

  • Limited Locking: No distributed locking support

  • No TTL Inspection: Cannot detect or extend expiration times

  • Basic Coordination: Read-modify-write patterns only

  • Data Loss Risk: No persistence across restarts

Best For: Environments with existing Memcached infrastructure, scenarios where basic distributed storage is sufficient.

File Store (:file_store)

✅ Benefits

  • No Infrastructure: Works without external dependencies

  • Persistence: Survives container restarts

  • Simple Deployment: Easy to set up and maintain

  • Local Performance: No network latency for operations

⚠️ Considerations

  • No Distribution: Cannot coordinate across containers

  • File I/O Overhead: Disk operations slower than memory

  • Concurrency Limits: File locking limitations

  • Storage Management: Requires disk space management

Best For: Single-container deployments, development environments, scenarios where external dependencies are not allowed.

Memory Store (:memory_store)

✅ Benefits

  • Highest Performance: Pure in-memory operations

  • No Dependencies: No external infrastructure required

  • Simple Configuration: Works out of the box

  • Low Latency: Fastest possible cache operations

⚠️ Considerations

  • Single Process Only: No cross-container coordination

  • Data Loss: Lost on container restart

  • Memory Limits: Bounded by container memory

  • No Persistence: Cannot survive deployments

Best For: Development environments, single-container applications, testing scenarios.
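
Whichever trade-off you choose, the selection is a standard Rails cache-store setting; the URLs and paths below are placeholders:

```ruby
# config/environments/production.rb (values are placeholders)
config.cache_store = :redis_cache_store, { url: ENV.fetch('REDIS_URL') }

# Alternatives, matching the trade-offs above:
# config.cache_store = :mem_cache_store, 'memcached.internal:11211'
# config.cache_store = :file_store, Rails.root.join('tmp/metrics_cache')
# config.cache_store = :memory_store, { size: 64.megabytes }
```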

Kubernetes Integration

CronJob-Based Export Architecture

Tasker follows cloud-native patterns by providing export capabilities while delegating scheduling to Kubernetes:
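
A minimal manifest along these lines would do the job; the image, schedule, and rake task name are placeholders, not shipped defaults:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tasker-metrics-export
spec:
  schedule: "*/5 * * * *"        # matches the 5-minute export cadence below
  concurrencyPolicy: Forbid      # never overlap two export runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: export
              image: your-registry/your-app:latest   # placeholder
              command: ["bundle", "exec", "rake", "tasker:export_metrics"]  # hypothetical task name
              resources:
                limits:
                  memory: 256Mi  # aligns with the expectations listed below
                  cpu: 200m
```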

Cache Synchronization

Monitoring and Alerting

Configuration

Basic Configuration
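
The exact configuration keys depend on your Tasker version, so treat this initializer as a representative sketch rather than a verbatim recipe:

```ruby
# config/initializers/tasker.rb
# Representative sketch; verify key names against the gem's configuration docs.
Tasker.configuration do |config|
  config.telemetry do |telemetry|
    telemetry.enabled = true                 # native metrics collection on
    telemetry.metrics_format = 'prometheus'  # or 'json' / 'csv' exports
  end
end
```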

Advanced Configuration

Environment-Specific Configuration

Dependencies and Assumptions

Required Dependencies

Optional Dependencies
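
Collection itself relies on concurrent-ruby, which already ships as an ActiveSupport dependency; the optional integrations are ordinary gems:

```ruby
# Gemfile — include only what your deployment actually uses
gem 'redis'              # :redis_cache_store coordination
gem 'dalli'              # :mem_cache_store support
gem 'opentelemetry-sdk'  # spans via TelemetrySubscriber
gem 'dogstatsd-ruby'     # DataDog export (see Interoperability)
```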

Infrastructure Assumptions

Cache Store

  • Rails.cache properly configured

  • Cache store supports basic read/write operations

  • TTL (time-to-live) support for automatic cleanup

Background Jobs

  • ActiveJob backend configured (Sidekiq, Resque, DelayedJob, etc.)

  • Job queues operational and monitored

  • Retry mechanisms available for failed jobs

Container Environment

  • Container orchestration platform (Kubernetes, Docker Swarm, etc.)

  • CronJob scheduling capabilities

  • Resource limits and monitoring

Monitoring Infrastructure

  • Prometheus server (for Prometheus export)

  • Grafana or similar for visualization

  • Alerting system for operational issues

Performance Expectations

In-Memory Operations

  • Latency: < 1ms for counter/gauge operations

  • Throughput: > 10,000 operations/second per container

  • Memory: ~1MB per 10,000 unique metric combinations

  • Concurrency: Scales linearly with CPU cores

Cache Synchronization

  • Frequency: Every 30 seconds (configurable)

  • Latency: < 100ms for Redis, < 500ms for Memcached

  • Batch Size: 1,000 metrics per sync operation

  • Network: Minimal bandwidth usage with compression

Export Operations

  • Frequency: Every 5 minutes (configurable via CronJob)

  • Duration: < 30 seconds for 10,000 metrics

  • Resource Usage: < 256MB memory, < 200m CPU

  • Network: Compressed payload reduces bandwidth by 70%

Interoperability

Prometheus Integration

Native Prometheus Format
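
The exporter emits standard Prometheus exposition text; a representative sample follows, with illustrative metric names:

```
# HELP tasker_task_completed_total Total tasks completed
# TYPE tasker_task_completed_total counter
tasker_task_completed_total{status="success"} 1042

# HELP tasker_step_duration_seconds Step execution time
# TYPE tasker_step_duration_seconds histogram
tasker_step_duration_seconds_bucket{le="0.5"} 890
tasker_step_duration_seconds_bucket{le="+Inf"} 1042
tasker_step_duration_seconds_sum 312.7
tasker_step_duration_seconds_count 1042
```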

Grafana Dashboards
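
Dashboards can then answer the questions from the overview directly; for instance, the 95th-percentile execution time is a single PromQL query (metric name illustrative):

```
histogram_quantile(0.95, sum(rate(tasker_step_duration_seconds_bucket[5m])) by (le))
```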

OpenTelemetry Compatibility

The native metrics system is designed to complement the OpenTelemetry spans emitted by TelemetrySubscriber. For OpenTelemetry metrics integration, use the existing TelemetrySubscriber, which automatically publishes to OpenTelemetry backends.

Custom Export Formats

DataDog Integration
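
A custom DataDog exporter can be a thin bridge from a metrics snapshot to DogStatsD. The sketch below uses the dogstatsd-ruby client with a hypothetical snapshot shape:

```ruby
require 'datadog/statsd'

statsd = Datadog::Statsd.new('localhost', 8125)

# Hypothetical snapshot shape: { metric_name => value }
snapshot = { 'task_completed_total' => 1042, 'active_workers' => 8 }

snapshot.each do |name, value|
  statsd.gauge("tasker.#{name}", value, tags: ["env:#{Rails.env}"])
end
statsd.flush(sync: true) # dogstatsd-ruby v5 buffers; flush before the job exits
```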

Production Deployment Patterns

High-Availability Setup

Resource Management

Security Considerations

Troubleshooting

Common Issues

Cache Sync Failures

Export Job Failures

Missing Metrics

Debugging Commands
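
A few Rails-console checks cover the most common failure modes (key names are illustrative):

```ruby
# Which store is actually configured? Environment mismatches are the most
# common cause of "works locally, no metrics in production".
Rails.cache.class

# Can we round-trip a value? nil on read means sync writes are being dropped.
Rails.cache.write('tasker:debug:ping', Time.now.to_i, expires_in: 60)
Rails.cache.read('tasker:debug:ping')

# Are atomic operations available for coordination?
Rails.cache.respond_to?(:increment)
```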

Performance Monitoring

This comprehensive metrics system provides production-ready observability while maintaining the flexibility to adapt to different infrastructure environments and scaling requirements.
