FFI Telemetry Initialization Pattern

Overview

This document describes the two-phase telemetry initialization pattern for Foreign Function Interface (FFI) integrations where Rust code is called from languages that don’t have a Tokio runtime during initialization (Ruby, Python, WASM).

The Problem

OpenTelemetry batch exporter requires a Tokio runtime context for async I/O operations:

#![allow(unused)]
fn main() {
// This PANICS if called outside a Tokio runtime
let tracer_provider = SdkTracerProvider::builder()
    .with_batch_exporter(exporter)  // ❌ Requires Tokio runtime
    .with_resource(resource)
    .with_sampler(sampler)
    .build();
}

FFI Initialization Timeline:

1. Language Runtime Loads Extension (Ruby, Python, WASM)
   ↓ No Tokio runtime exists yet
2. Extension Init Function Called (Magnus init, PyO3 init, etc.)
   ↓ Logging needed for debugging, but no async runtime
3. Later: Create Tokio Runtime
   ↓ Now safe to initialize telemetry
4. Bootstrap Worker System

The Solution: Two-Phase Initialization

Phase 1: Console-Only Logging (FFI-Safe)

During language extension initialization, use console-only logging that requires no Tokio runtime:

#![allow(unused)]
fn main() {
// tasker-shared/src/logging.rs
pub fn init_console_only() {
    // Initialize console logging without OpenTelemetry
    // Safe to call from any thread, no async runtime required
}
}

When to use:

During Magnus initialization (Ruby)
During PyO3 initialization (Python)
During WASM module initialization
Any context where no Tokio runtime exists

Phase 2: Full Telemetry (Tokio Context)

After creating the Tokio runtime, initialize full telemetry including OpenTelemetry:

#![allow(unused)]
fn main() {
// Create Tokio runtime
let runtime = tokio::runtime::Runtime::new()?;

// Initialize telemetry in runtime context
runtime.block_on(async {
    tasker_shared::logging::init_tracing();
});
}

When to use:

After creating Tokio runtime in bootstrap
Inside runtime.block_on() context
When async I/O is available

Implementation Guide

Ruby FFI (Magnus)

File Structure:

workers/ruby/ext/tasker_core/src/ffi_logging.rs - Phase 1
workers/ruby/ext/tasker_core/src/bootstrap.rs - Phase 2

Phase 1: Magnus Initialization

#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/ffi_logging.rs

pub fn init_ffi_logger() -> Result<(), Box<dyn std::error::Error>> {
    // Check if telemetry is enabled
    let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
        .map(|v| v.to_lowercase() == "true")
        .unwrap_or(false);

    if telemetry_enabled {
        // Phase 1: Defer telemetry init to runtime context
        println!("Telemetry enabled - deferring logging init to runtime context");
    } else {
        // Phase 1: Safe to initialize console-only logging
        tasker_shared::logging::init_console_only();
        tasker_shared::log_ffi!(
            info,
            "FFI console logging initialized (no telemetry)",
            component: "ffi_boundary"
        );
    }

    Ok(())
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/ruby/ext/tasker_core/src/bootstrap.rs

pub fn bootstrap_worker() -> Result<Value, Error> {
    // Create tokio runtime
    let runtime = tokio::runtime::Runtime::new()?;

    // Phase 2: Initialize telemetry in Tokio runtime context
    runtime.block_on(async {
        tasker_shared::logging::init_tracing();
    });

    // Continue with bootstrap...
    let system_context = runtime.block_on(async {
        SystemContext::new_for_worker().await
    })?;

    // ... rest of bootstrap
}
}

Python FFI (PyO3)

Phase 1: PyO3 Module Initialization

#![allow(unused)]
fn main() {
// workers/python/src/lib.rs

#[pymodule]
fn tasker_core(py: Python, m: &PyModule) -> PyResult<()> {
    // Check if telemetry is enabled
    let telemetry_enabled = std::env::var("TELEMETRY_ENABLED")
        .map(|v| v.to_lowercase() == "true")
        .unwrap_or(false);

    if telemetry_enabled {
        println!("Telemetry enabled - deferring logging init to runtime context");
    } else {
        tasker_shared::logging::init_console_only();
    }

    // Register Python functions...
    m.add_function(wrap_pyfunction!(bootstrap_worker, m)?)?;
    Ok(())
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/python/src/bootstrap.rs

#[pyfunction]
pub fn bootstrap_worker() -> PyResult<String> {
    // Create tokio runtime
    let runtime = tokio::runtime::Runtime::new()
        .map_err(|e| PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(
            format!("Failed to create runtime: {}", e)
        ))?;

    // Phase 2: Initialize telemetry in Tokio runtime context
    runtime.block_on(async {
        tasker_shared::logging::init_tracing();
    });

    // Continue with bootstrap...
    let system_context = runtime.block_on(async {
        SystemContext::new_for_worker().await
    })?;

    // ... rest of bootstrap
}
}

WASM FFI

Phase 1: WASM Module Initialization

#![allow(unused)]
fn main() {
// workers/wasm/src/lib.rs

#[wasm_bindgen(start)]
pub fn init_wasm() {
    // Check if telemetry is enabled (from JS environment)
    let telemetry_enabled = js_sys::Reflect::get(
        &js_sys::global(),
        &"TELEMETRY_ENABLED".into()
    ).ok()
    .and_then(|v| v.as_bool())
    .unwrap_or(false);

    if telemetry_enabled {
        web_sys::console::log_1(&"Telemetry enabled - deferring logging init to runtime context".into());
    } else {
        tasker_shared::logging::init_console_only();
    }
}
}

Phase 2: After Runtime Creation

#![allow(unused)]
fn main() {
// workers/wasm/src/bootstrap.rs

#[wasm_bindgen]
pub async fn bootstrap_worker() -> Result<JsValue, JsValue> {
    // In WASM, we're already in an async context
    // Initialize telemetry directly
    tasker_shared::logging::init_tracing();

    // Continue with bootstrap...
    let system_context = SystemContext::new_for_worker().await
        .map_err(|e| JsValue::from_str(&format!("Bootstrap failed: {}", e)))?;

    // ... rest of bootstrap
}
}

Docker Configuration

Enable telemetry in docker-compose with appropriate comments:

# docker/docker-compose.test.yml

ruby-worker:
  environment:
    # Two-phase FFI telemetry initialization pattern
    # Phase 1: Magnus init skips telemetry (no runtime)
    # Phase 2: bootstrap_worker() initializes telemetry in Tokio context
    TELEMETRY_ENABLED: "true"
    OTEL_EXPORTER_OTLP_ENDPOINT: http://observability:4317
    OTEL_SERVICE_NAME: tasker-ruby-worker
    OTEL_SERVICE_VERSION: "0.1.0"

Verification

Expected Log Sequence

Ruby Worker with Telemetry Enabled:

1. Magnus init:
Telemetry enabled - deferring logging init to runtime context

2. After runtime creation:
Console logging with OpenTelemetry initialized
  environment=test
  opentelemetry_enabled=true
  otlp_endpoint=http://observability:4317
  service_name=tasker-ruby-worker

3. OpenTelemetry components:
Global meter provider is set
OpenTelemetry Prometheus text exporter initialized

Ruby Worker with Telemetry Disabled:

1. Magnus init:
Console-only logging initialized (FFI-safe mode)
  environment=test
  opentelemetry_enabled=false
  context=ffi_initialization

2. After runtime creation:
(No additional initialization - already complete)

Health Check

All workers should be healthy with telemetry enabled:

$ curl http://localhost:8082/health
{"status":"healthy","timestamp":"...","worker_id":"worker-..."}

Grafana Verification

With all services running with telemetry:

Access Grafana: http://localhost:3000 (admin/admin)
Navigate to Explore → Tempo
Query by service: tasker-ruby-worker
Verify traces appear with correlation IDs

Key Principles

1. Separation of Concerns

Infrastructure Decision (Tokio runtime availability): Handled by init functions
Business Logic (when to log): Handled by application code
Clean separation prevents runtime panics

2. Fail-Safe Defaults

Always provide console logging at minimum
Telemetry is enhancement, not requirement
Graceful degradation if telemetry unavailable

3. Explicit Over Implicit

Clear phase separation in code
Documented at each call site
Easy to understand initialization flow

4. Language-Agnostic Pattern

Same pattern works for Ruby, Python, WASM
Consistent across all FFI bindings
Single source of truth in tasker-shared

Troubleshooting

“no reactor running” Panic

Symptom:

thread 'main' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime'

Cause: Calling init_tracing() when TELEMETRY_ENABLED=true outside a Tokio runtime context.

Solution: Use two-phase pattern:

#![allow(unused)]
fn main() {
// Phase 1: Skip telemetry init
if telemetry_enabled {
    println!("Deferring telemetry init...");
} else {
    init_console_only();
}

// Phase 2: Initialize in runtime
runtime.block_on(async {
    init_tracing();
});
}

Telemetry Not Appearing

Symptom: No traces in Grafana/Tempo despite TELEMETRY_ENABLED=true.

Check:

Verify environment variable is set: TELEMETRY_ENABLED=true
Check logs for initialization message
Verify OTLP endpoint is reachable
Check observability stack is healthy

Debug:

# Check worker logs
docker logs docker-ruby-worker-1 | grep -E "telemetry|OpenTelemetry"

# Check observability stack
curl http://localhost:4317  # Should connect to OTLP gRPC

# Check Grafana Tempo
curl http://localhost:3200/api/status/buildinfo

Performance Considerations

Minimal Overhead

Phase 1: Simple console initialization, <1ms
Phase 2: Batch exporter initialization, <10ms
Total overhead: <15ms during startup
Zero runtime overhead after initialization

Memory Usage

Console-only: ~100KB (tracing subscriber)
With telemetry: ~500KB (includes OTLP client buffers)
Acceptable for all deployment scenarios

Future Enhancements

Lazy Telemetry Upgrade

Future optimization could upgrade console-only subscriber to include telemetry without restart:

#![allow(unused)]
fn main() {
// Not yet implemented - requires tracing layer hot-swapping
pub fn upgrade_to_telemetry() -> TaskerResult<()> {
    // Would require custom subscriber implementation
    // to support layer addition after initialization
}
}

Per-Worker Telemetry Control

Could extend pattern to support per-worker telemetry configuration:

#![allow(unused)]
fn main() {
// Not yet implemented
pub fn init_with_config(config: TelemetryConfig) -> TaskerResult<()> {
    // Would allow fine-grained control per worker
}
}

Phase 1.5: Worker Span Instrumentation with Trace Context Propagation

Implemented: 2025-11-24 Status: ✅ Production Ready - Validated end-to-end with Ruby workers

The Challenge

After implementing two-phase telemetry initialization (Phase 1), we discovered a gap: while OpenTelemetry infrastructure was working, worker step execution spans lacked correlation attributes needed for distributed tracing.

The Problem:

✅ Orchestration spans had correlation_id, task_uuid, step_uuid
✅ Worker infrastructure spans existed (read_messages, reserve_capacity)
❌ Worker step execution spans were missing these attributes

Root Cause: Ruby workers use an async dual-event-system architecture where:

Rust worker fires FFI event to Ruby (via EventPoller polling every 10ms)
Ruby processes event asynchronously
Ruby returns completion via FFI

The async boundary made traditional span scope maintenance impossible.

The Solution: Trace ID Propagation Pattern

Instead of trying to maintain span scope across the async FFI boundary, we propagate trace context as opaque strings:

Rust: Extract trace_id/span_id → Add to FFI event payload →
Ruby: Treat as opaque strings → Propagate through processing → Include in completion →
Rust: Create linked span using returned trace_id/span_id

Key Insight: Ruby doesn’t need to understand OpenTelemetry - it just passes through trace IDs like it already does with correlation_id.

Implementation: Rust Side (Phase 1.5a)

File: tasker-worker/src/worker/command_processor.rs

Step 1: Create instrumented span with all required attributes

#![allow(unused)]
fn main() {
use tracing::{span, event, Level, Instrument};

pub async fn handle_execute_step(&self, step_message: SimpleStepMessage) -> TaskerResult<()> {
    // Fetch step details to get step_name and namespace
    let task_sequence_step = self.fetch_task_sequence_step(&step_message).await?;

    // Create span with all 5 required attributes
    let step_span = span!(
        Level::INFO,
        "worker.step_execution",
        correlation_id = %step_message.correlation_id,
        task_uuid = %step_message.task_uuid,
        step_uuid = %step_message.step_uuid,
        step_name = %task_sequence_step.workflow_step.name,
        namespace = %task_sequence_step.task.namespace_name
    );

    let execution_result = async {
        event!(Level::INFO, "step.execution_started");

        // Extract trace context for FFI propagation
        let trace_id = Some(step_message.correlation_id.to_string());
        let span_id = Some(format!("span-{}", step_message.step_uuid));

        // Fire FFI event with trace context
        let result = self.event_publisher
            .fire_step_execution_event_with_trace(
                &task_sequence_step,
                trace_id,
                span_id,
            )
            .await?;

        event!(Level::INFO, "step.execution_completed");
        Ok(result)
    }
    .instrument(step_span)  // Wrap async block with span
    .await;

    execution_result
}
}

Key Points:

All 5 attributes present: correlation_id, task_uuid, step_uuid, step_name, namespace
Event markers: step.execution_started, step.execution_completed
.instrument(span) pattern for async code
Trace context extracted and passed to FFI

Implementation: Data Structures

File: tasker-shared/src/types/base.rs

Add trace context fields to FFI event structures:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionEvent {
    pub event_id: Uuid,
    pub task_uuid: Uuid,
    pub step_uuid: Uuid,
    pub task_sequence_step: TaskSequenceStep,
    pub correlation_id: Uuid,

    // Trace context propagation
    #[serde(skip_serializing_if = "Option::is_none")]
    pub trace_id: Option<String>,

    #[serde(skip_serializing_if = "Option::is_none")]
    pub span_id: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepExecutionCompletionEvent {
    pub event_id: Uuid,
    pub task_uuid: Uuid,
    pub step_uuid: Uuid,
    pub success: bool,
    pub result: Option<serde_json::Value>,

    // Trace context from Ruby
    #[serde(skip_serializing_if = "Option::is_none")]
    pub trace_id: Option<String>,

    #[serde(skip_serializing_if = "Option::is_none")]
    pub span_id: Option<String>,
}
}

Design Notes:

Fields are optional for backward compatibility
skip_serializing_if prevents empty fields in JSON
Treated as opaque strings (no OpenTelemetry types)

Implementation: Ruby Side Propagation

File: workers/ruby/lib/tasker_core/event_bridge.rb

Propagate trace context like correlation_id:

def wrap_step_execution_event(event_data)
  wrapped = {
    event_id: event_data[:event_id],
    task_uuid: event_data[:task_uuid],
    step_uuid: event_data[:step_uuid],
    task_sequence_step: TaskerCore::Models::TaskSequenceStepWrapper.new(event_data[:task_sequence_step])
  }

  # Expose correlation_id at top level for easy access
  wrapped[:correlation_id] = event_data[:correlation_id] if event_data[:correlation_id]
  wrapped[:parent_correlation_id] = event_data[:parent_correlation_id] if event_data[:parent_correlation_id]

  # Expose trace_id and span_id for distributed tracing
  wrapped[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
  wrapped[:span_id] = event_data[:span_id] if event_data[:span_id]

  wrapped
end

File: workers/ruby/lib/tasker_core/subscriber.rb

Include trace context in completion:

def publish_step_completion(event_data:, success:, result: nil, error_message: nil, metadata: nil)
  completion_payload = {
    event_id: event_data[:event_id],
    task_uuid: event_data[:task_uuid],
    step_uuid: event_data[:step_uuid],
    success: success,
    result: result,
    metadata: metadata,
    error_message: error_message
  }

  # Propagate trace context back to Rust
  completion_payload[:trace_id] = event_data[:trace_id] if event_data[:trace_id]
  completion_payload[:span_id] = event_data[:span_id] if event_data[:span_id]

  TaskerCore::Worker::EventBridge.instance.publish_step_completion(completion_payload)
end

Key Points:

Ruby treats trace_id and span_id as opaque strings
No OpenTelemetry dependency in Ruby
Simple pass-through pattern like correlation_id
Works with existing dual-event-system architecture

Implementation: Completion Span (Rust)

File: tasker-worker/src/worker/event_subscriber.rs

Create linked span when receiving Ruby completion:

#![allow(unused)]
fn main() {
pub fn handle_completion(&self, completion: StepExecutionCompletionEvent) -> TaskerResult<()> {
    // Create linked span using trace context from Ruby
    let completion_span = if let (Some(trace_id), Some(span_id)) =
        (&completion.trace_id, &completion.span_id) {
        span!(
            Level::INFO,
            "worker.step_completion_received",
            trace_id = %trace_id,
            span_id = %span_id,
            event_id = %completion.event_id,
            task_uuid = %completion.task_uuid,
            step_uuid = %completion.step_uuid,
            success = completion.success
        )
    } else {
        // Fallback span without trace context
        span!(
            Level::INFO,
            "worker.step_completion_received",
            event_id = %completion.event_id,
            task_uuid = %completion.task_uuid,
            step_uuid = %completion.step_uuid,
            success = completion.success
        )
    };

    let _guard = completion_span.enter();

    event!(Level::INFO, "step.ruby_execution_completed",
        success = completion.success,
        duration_ms = completion.metadata.execution_time_ms
    );

    // Continue with normal completion processing...
    Ok(())
}
}

Key Points:

Uses returned trace_id/span_id to create linked span
Graceful fallback if trace context not available
Event: step.ruby_execution_completed

Validation Results (2025-11-24)

Test Task:

Correlation ID: 88f21229-4085-4d53-8f52-2fde0b7228e2
Task UUID: 019ab6f9-7a27-7d16-b298-1ea41b327373
4 steps executed successfully

Log Evidence:

worker.step_execution{
  correlation_id=88f21229-4085-4d53-8f52-2fde0b7228e2
  task_uuid=019ab6f9-7a27-7d16-b298-1ea41b327373
  step_uuid=019ab6f9-7a2a-7873-a5d1-93234ae46003
  step_name=linear_step_1
  namespace=linear_workflow
}: step.execution_started

Step execution event with trace context fired successfully to FFI handlers
  trace_id=Some("88f21229-4085-4d53-8f52-2fde0b7228e2")
  span_id=Some("span-019ab6f9-7a2a-7873-a5d1-93234ae46003")

worker.step_completion_received{...}: step.ruby_execution_completed

Tempo Query Results:

By correlation_id: 9 traces (5 orchestration + 4 worker)
By task_uuid: 13 traces (complete task lifecycle)
✅ All attributes indexed and queryable
✅ Spans exported to Tempo successfully

Complete Trace Flow

For each step execution:

┌─────────────────────────────────────────────────────┐
│ Rust Worker (command_processor.rs)                 │
│ 1. Create worker.step_execution span               │
│    - correlation_id, task_uuid, step_uuid          │
│    - step_name, namespace                          │
│ 2. Emit step.execution_started event               │
│ 3. Extract trace_id and span_id from span          │
│ 4. Add to StepExecutionEvent                       │
│ 5. Fire FFI event with trace context               │
│ 6. Emit step.execution_completed event             │
└─────────────────┬───────────────────────────────────┘
                  │
                  │ Async FFI boundary (EventPoller polling)
                  ▼
┌─────────────────────────────────────────────────────┐
│ Ruby EventBridge & Subscriber                       │
│ 1. Receive event with trace_id/span_id            │
│ 2. Propagate as opaque strings                     │
│ 3. Execute Ruby handler (business logic)           │
│ 4. Include trace_id/span_id in completion          │
└─────────────────┬───────────────────────────────────┘
                  │
                  │ Completion via FFI
                  ▼
┌─────────────────────────────────────────────────────┐
│ Rust Worker (event_subscriber.rs)                  │
│ 1. Receive StepExecutionCompletionEvent            │
│ 2. Extract trace_id and span_id                    │
│ 3. Create worker.step_completion_received span     │
│ 4. Emit step.ruby_execution_completed event        │
└─────────────────────────────────────────────────────┘

Benefits of This Pattern

No Breaking Changes: Optional fields, backward compatible
Ruby Simplicity: No OpenTelemetry dependency, opaque string propagation
Trace Continuity: Same trace_id flows Rust → Ruby → Rust
Query-Friendly: Tempo queries show complete execution flow
Extensible: Pattern works for Python, WASM, any FFI language
Performance: Zero overhead in Ruby (just string passing)

Pattern for Python Workers

The exact same pattern applies to Python workers:

Python Side (PyO3):

# workers/python/tasker_core/event_bridge.py

def wrap_step_execution_event(event_data):
    wrapped = {
        'event_id': event_data['event_id'],
        'task_uuid': event_data['task_uuid'],
        'step_uuid': event_data['step_uuid'],
        # ... other fields
    }

    # Propagate trace context as opaque strings
    if 'trace_id' in event_data:
        wrapped['trace_id'] = event_data['trace_id']
    if 'span_id' in event_data:
        wrapped['span_id'] = event_data['span_id']

    return wrapped

Key Insight: Any FFI language can use this pattern - they just need to pass through trace_id and span_id as strings.

Performance Characteristics

Rust overhead: ~50-100 microseconds per span creation
FFI overhead: ~10-50 microseconds for extra string fields
Ruby overhead: Zero (just string passing, no OpenTelemetry)
Total overhead: <200 microseconds per step execution
Network: Spans batched and exported asynchronously

Troubleshooting

Symptom: Spans missing trace_id/span_id in Tempo

Check:

Verify Rust logs show “Step execution event with trace context fired successfully”
Check Ruby logs don’t have errors in EventBridge
Verify completion events include trace_id/span_id
Query Tempo by task_uuid to see if spans exist

Debug:

# Check Rust worker logs for trace context
docker logs docker-ruby-worker-1 | grep -E "(trace_id|span_id)"

# Query Tempo by task_uuid
curl "http://localhost:3200/api/search?tags=task_uuid=<UUID>"

# Check span export metrics
curl "http://localhost:9090/metrics" | grep otel

Future Enhancements

OpenTelemetry W3C Trace Context: Currently using correlation_id as trace_id placeholder. Future enhancement:

#![allow(unused)]
fn main() {
use opentelemetry::trace::TraceContextExt;

// Extract real OpenTelemetry trace context
let cx = tracing::Span::current().context();
let span_context = cx.span().span_context();
let trace_id = span_context.trace_id().to_string();
let span_id = span_context.span_id().to_string();
}

Span Linking: Use OpenTelemetry’s Link API for explicit parent-child relationships:

#![allow(unused)]
fn main() {
use opentelemetry::trace::{Link, SpanContext, TraceId, SpanId};

// Create linked span
let parent_context = SpanContext::new(
    TraceId::from_hex(&trace_id)?,
    SpanId::from_hex(&span_id)?,
    TraceFlags::default(),
    false,
    TraceState::default(),
);

let span = span!(
    Level::INFO,
    "worker.step_completion_received",
    links = vec![Link::new(parent_context, Vec::new())]
);
}

References

OpenTelemetry Rust: https://github.com/open-telemetry/opentelemetry-rust
Grafana LGTM Stack: https://grafana.com/oss/lgtm-stack/
W3C Trace Context: https://www.w3.org/TR/trace-context/

tasker-shared/src/logging.rs - Core logging implementation
workers/rust/README.md - Event-driven FFI architecture
docs/batch-processing.md - Distributed tracing integration
docker/docker-compose.test.yml - Observability stack configuration

Status: ✅ Production Ready - Two-phase initialization and Phase 1.5 worker span instrumentation patterns implemented and validated with Ruby FFI. Ready for Python and WASM implementations.

Keyboard shortcuts

Tasker Documentation