FFI Safety Safeguards
Last Updated: 2026-02-02 Status: Production Implementation Applies To: Ruby (Magnus), Python (PyO3), TypeScript (C FFI) workers
Overview
Tasker’s FFI workers embed the Rust tasker-worker runtime inside language-specific host processes (Ruby, Python, TypeScript/JavaScript). This document describes the safeguards that prevent Rust-side failures from crashing or corrupting the host process, ensuring that infrastructure unavailability, misconfiguration, and unexpected panics are surfaced as language-native errors rather than process faults.
FFI Architecture
Host Process (Ruby / Python / Node.js)
│
▼
FFI Boundary
┌─────────────────────────────────────┐
│ Language Binding Layer │
│ (Magnus / PyO3 / extern "C") │
│ │
│ ┌─────────────────────────────┐ │
│ │ Bridge Module │ │
│ │ (bootstrap, poll, complete)│ │
│ └────────────┬────────────────┘ │
│ │ │
│ ┌────────────▼────────────────┐ │
│ │ FfiDispatchChannel │ │
│ │ (event dispatch, callbacks)│ │
│ └────────────┬────────────────┘ │
│ │ │
│ ┌────────────▼────────────────┐ │
│ │ WorkerBootstrap │ │
│ │ (runtime, DB, messaging) │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────┘
Panic Safety by Framework
Each FFI framework provides different levels of automatic panic protection:
| Framework | Panic Handling | Mechanism |
|---|---|---|
| Magnus (Ruby) | Automatic | Catches panics at FFI boundary, converts to Ruby RuntimeError |
| PyO3 (Python) | Automatic | Catches panics at #[pyfunction] boundary, converts to PanicException |
| C FFI (TypeScript) | Manual | Requires explicit std::panic::catch_unwind wrappers |
TypeScript C FFI: Explicit Panic Guards
Because the TypeScript worker uses raw extern "C" functions (for compatibility with Node.js, Bun, and Deno FFI), panics unwinding through this boundary would be undefined behavior. All extern "C" functions that call into bridge internals are wrapped with catch_unwind:
#![allow(unused)]
fn main() {
// workers/typescript/src-rust/lib.rs
#[no_mangle]
pub unsafe extern "C" fn bootstrap_worker(config_json: *const c_char) -> *mut c_char {
// ... parse config_json ...
let result = panic::catch_unwind(AssertUnwindSafe(|| {
bridge::bootstrap_worker_internal(config_str)
}));
match result {
Ok(Ok(json)) => /* return JSON */,
Ok(Err(e)) => json_error(&format!("Bootstrap failed: {}", e)),
Err(panic_info) => {
// Extract panic message, log it, return JSON error
json_error(&msg)
}
}
}
}
Protected functions: bootstrap_worker, stop_worker, get_worker_status, transition_to_graceful_shutdown, poll_step_events, poll_in_process_events, complete_step_event, checkpoint_yield_step_event, get_ffi_dispatch_metrics, check_starvation_warnings, cleanup_timeouts.
Error Handling at FFI Boundaries
Bootstrap Failures
When infrastructure is unavailable during worker startup, errors flow through the normal Result path rather than panicking:
| Failure Scenario | Handling | Host Process Impact |
|---|---|---|
| Database unreachable | TaskerError::DatabaseError returned | Language exception, app can retry |
| Config TOML missing | TaskerError::ConfigurationError returned | Language exception with descriptive message |
| Worker config section absent | TaskerError::ConfigurationError returned | Language exception (was previously a panic) |
| Messaging backend unavailable | TaskerError::ConfigurationError returned | Language exception |
| Tokio runtime creation fails | Logged + language error returned | Language exception |
| Port already in use | TaskerError::WorkerError returned | Language exception |
| Redis/cache unavailable | Graceful degradation to noop cache | No error - worker starts without cache |
Steady-State Operation Failures
Once bootstrapped, the worker handles infrastructure failures gracefully:
| Failure Scenario | Handling | Host Process Impact |
|---|---|---|
| Database goes down during poll | Poll returns None (no events) | No impact - polling continues |
| Completion channel full | Retry loop with timeout, then logged | Step result may be lost after timeout |
| Completion channel closed | Returns false to caller | App code sees completion failure |
| Callback timeout (5s) | Logged, step completion unaffected | Domain events may be delayed |
| Messaging down during callback | Callback times out, logged | Domain events may not publish |
| Lock poisoned | Error returned to caller | Language exception |
| Worker not initialized | Error returned to caller | Language exception |
Lock Acquisition
All three workers validate lock acquisition before proceeding:
#![allow(unused)]
fn main() {
// Pattern used in all workers
let handle_guard = WORKER_SYSTEM.lock().map_err(|e| {
error!("Failed to acquire worker system lock: {}", e);
// Convert to language-appropriate error
})?;
}
A poisoned mutex (from a previous panic) produces a language exception rather than propagating the original panic.
EventRouter Availability
Post-bootstrap access to the EventRouter uses fallible error handling rather than .expect():
#![allow(unused)]
fn main() {
// Use ok_or_else instead of expect to prevent panic at FFI boundary
let event_router = worker_core.event_router().ok_or_else(|| {
error!("EventRouter not available from WorkerCore after bootstrap");
// Return language-appropriate error
})?;
}
Callback Safety
The FfiDispatchChannel uses a fire-and-forget pattern for post-completion callbacks, preventing the host process from being blocked or deadlocked by Rust-side async operations:
- Completion is sent first - the step result is delivered to the completion channel before any callback fires
- Callback is spawned separately - runs in the Tokio runtime, not the FFI caller’s thread
- Timeout protection - callbacks are bounded by a configurable timeout (default 5s)
- Callback failures are logged - they never affect step completion or the host process
FFI Thread (Ruby/Python/JS) Tokio Runtime
│ │
├──► complete(event_id, result) │
│ ├──► send result to channel │
│ └──► spawn callback ─────────┼──► callback.on_handler_complete()
│ │ (with 5s timeout)
◄──── return true ────────────────│
│ (immediate, non-blocking) │
See docs/development/ffi-callback-safety.md for detailed callback safety guidelines.
Backpressure Protection
Completion Channel
The completion channel uses a try-send retry loop with timeout to prevent indefinite blocking:
- Try-send avoids blocking the FFI thread
- Retry with sleep (10ms intervals) handles transient backpressure
- Timeout (configurable, default 30s) prevents permanent stalls
- Logged when backpressure delays exceed 100ms
Starvation Detection
The FfiDispatchChannel tracks event age and warns when polling falls behind:
- Events older than
starvation_warning_threshold_ms(default 10s) trigger warnings check_starvation_warnings()can be called periodically from the host processFfiDispatchMetricsexposes pending count, oldest event age, and starvation status
Infrastructure Dependency Matrix
| Component | Bootstrap | Poll | Complete | Callback |
|---|---|---|---|---|
| Database | Required (error on failure) | Not needed | Not needed | Errors logged |
| Message Bus | Required (error on failure) | Not needed | Not needed | Errors logged |
| Config System | Required (error on failure) | Not needed | Not needed | Not needed |
| Cache (Redis) | Optional (degrades to noop) | Not needed | Not needed | Not needed |
| Tokio Runtime | Required (error on failure) | Used | Used | Used |
Worker Lifecycle Safety
Start (bootstrap_worker)
- Validates configuration, creates runtime, initializes all subsystems
- All failures return language-appropriate errors
- Already-running detection prevents double initialization
Status (get_worker_status)
- Safe when worker is not initialized (returns
running: false) - Safe when worker is running (queries internal state)
- Lock acquisition failure returns error
Stop (stop_worker)
- Safe when worker is not running (returns success message)
- Sends shutdown signal and clears handle
- In-flight operations complete before shutdown
Graceful Shutdown (transition_to_graceful_shutdown)
- Initiates graceful shutdown allowing in-flight work to drain
- Errors during transition are logged and returned
- Requires worker to be running (error otherwise)
Adding a New FFI Worker
When implementing a new language worker:
-
Check framework panic safety - if the framework (like Magnus/PyO3) catches panics automatically, you get protection for free. If using raw C FFI, wrap all
extern "C"functions withcatch_unwind. -
Use the standard bridge pattern - global
WORKER_SYSTEMmutex,BridgeHandlestruct containingWorkerSystemHandle+FfiDispatchChannel+ runtime. -
Handle all lock acquisitions - always use
.map_err()on.lock()calls. -
Avoid
.expect()and.unwrap()in FFI code - useok_or_else()ormap_err()to convert to language-appropriate errors. -
Use fire-and-forget callbacks - never block the FFI thread on async operations.
-
Integrate starvation detection - call
check_starvation_warnings()periodically. -
Expose metrics - expose
FfiDispatchMetricsfor health monitoring.
Related Documentation
- FFI Callback Safety - Detailed callback patterns and deadlock prevention
- Worker Event Systems - Dispatch and completion channel architecture
- MPSC Channel Guidelines - Channel sizing and configuration
- Worker Patterns & Practices - General worker development patterns
- Memory Management - FFI memory management across languages