Checkpoint Operations Guide
Last Updated: 2026-01-06 Status: Active Related: Batch Processing - Checkpoint Yielding
Overview
This guide covers operational concerns for checkpoint yielding in production environments, including monitoring, troubleshooting, and maintenance tasks.
Monitoring Checkpoints
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Checkpoint history size | Length of history array | >100 entries |
| Checkpoint age | Time since last checkpoint | >10 minutes (step-dependent) |
| Accumulated results size | Size of accumulated_results JSONB | >1MB |
| Checkpoint frequency | Checkpoints per step execution | <1 per minute (may indicate a stalled or non-checkpointing handler) |
SQL Queries for Monitoring
Steps with large checkpoint history:
SELECT
ws.workflow_step_uuid,
ws.name,
t.task_uuid,
jsonb_array_length(ws.checkpoint->'history') as history_length,
ws.checkpoint->>'timestamp' as last_checkpoint
FROM tasker.workflow_steps ws
JOIN tasker.tasks t ON ws.task_uuid = t.task_uuid
WHERE ws.checkpoint IS NOT NULL
AND jsonb_array_length(ws.checkpoint->'history') > 50
ORDER BY history_length DESC
LIMIT 20;
Steps with stale checkpoints (in progress but not checkpointed recently):
SELECT
ws.workflow_step_uuid,
ws.name,
ws.current_state,
ws.checkpoint->>'timestamp' as last_checkpoint,
NOW() - (ws.checkpoint->>'timestamp')::timestamptz as checkpoint_age
FROM tasker.workflow_steps ws
WHERE ws.current_state = 'in_progress'
AND ws.checkpoint IS NOT NULL
AND (ws.checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes'
ORDER BY checkpoint_age DESC;
Large accumulated results:
SELECT
ws.workflow_step_uuid,
ws.name,
pg_column_size(ws.checkpoint->'accumulated_results') as results_size_bytes,
pg_size_pretty(pg_column_size(ws.checkpoint->'accumulated_results')::bigint) as results_size
FROM tasker.workflow_steps ws
WHERE ws.checkpoint->'accumulated_results' IS NOT NULL
AND pg_column_size(ws.checkpoint->'accumulated_results') > 100000
ORDER BY results_size_bytes DESC
LIMIT 20;
Logging
Checkpoint operations emit structured logs:
INFO checkpoint_yield_step_event step_uuid=abc-123 cursor=1000 items_processed=1000
INFO checkpoint_saved step_uuid=abc-123 history_length=5
Log fields to monitor:
- step_uuid - Step being checkpointed
- cursor - Current position
- items_processed - Total items at checkpoint
- history_length - Number of checkpoint entries
Troubleshooting
Step Not Resuming from Checkpoint
Symptoms: Step restarts from beginning instead of checkpoint position.
Checks:
- Verify the checkpoint exists:
  SELECT checkpoint FROM tasker.workflow_steps WHERE workflow_step_uuid = 'uuid';
- Check that the handler uses the BatchWorkerContext accessors: has_checkpoint? / has_checkpoint() / hasCheckpoint() and checkpoint_cursor / checkpointCursor
- Verify that the handler respects the checkpoint in its processing loop (see the sketch below)
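For the last check, here is a minimal sketch of a checkpoint-aware loop. The accessor names (has_checkpoint(), checkpoint_cursor) and checkpoint_yield() are the ones named in this guide; the shape of the context object, the exact signatures, and handle_item() are assumptions made for illustration:

```python
CHECKPOINT_INTERVAL = 1_000  # items between checkpoints (tuning value; see Performance Impact)

def handle_item(item):
    """Placeholder for the real per-item work."""
    ...

def process_batch(context, items):
    # Resume from the saved cursor when a checkpoint exists; otherwise start at 0.
    # checkpoint_cursor is treated as a property here; it may be a method in your SDK.
    cursor = context.checkpoint_cursor if context.has_checkpoint() else 0

    for index in range(cursor, len(items)):
        handle_item(items[index])

        # Periodically persist progress so a retried step resumes from here
        # instead of restarting from the beginning. Parameter names mirror the
        # checkpoint_yield_step_event log fields; the exact signature is assumed.
        if (index + 1) % CHECKPOINT_INTERVAL == 0:
            context.checkpoint_yield(cursor=index + 1, items_processed=index + 1)
```

A handler that ignores has_checkpoint() and always starts at index 0 will show exactly the "restarts from the beginning" symptom described above.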
Checkpoint Not Persisting
Symptoms: checkpoint_yield() returns successfully, but no checkpoint data appears in the database.
Checks:
- Check for errors in worker logs
- Verify FFI bridge is healthy
- Check database connectivity
Excessive Checkpoint History Growth
Symptoms: Steps have hundreds or thousands of checkpoint history entries.
Causes:
- Very long-running processes with frequent checkpoints
- Small checkpoint intervals relative to work
Remediation:
- Increase checkpoint interval (process more items between checkpoints)
- Clear history for completed steps (see Maintenance section)
- Monitor with history size query above
Large Accumulated Results
Symptoms: Database bloat, slow step queries.
Causes:
- Storing full result sets instead of summaries
- Unbounded accumulation without size checks
Remediation:
- Modify handler to store summaries, not full data
- Use external storage for large intermediate results
- Add size checks before checkpointing (see the sketch below)
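A hedged sketch of the last two remediations, assuming the handler decides what goes into accumulated_results; the summary fields, the helper name, and the 100 KB budget (chosen to mirror the monitoring threshold above) are illustrative:

```python
import json

MAX_ACCUMULATED_BYTES = 100_000  # mirrors the 100 KB monitoring threshold above

def summarize_for_checkpoint(existing_summary, new_rows):
    """Accumulate a compact summary instead of raw rows (illustrative fields)."""
    summary = {
        "row_count": existing_summary.get("row_count", 0) + len(new_rows),
        "last_row_id": new_rows[-1]["id"] if new_rows else existing_summary.get("last_row_id"),
    }

    # Guard the payload size before checkpointing. If raw intermediate data is
    # truly needed, write it to external storage and keep only a reference here.
    payload_bytes = len(json.dumps(summary).encode("utf-8"))
    if payload_bytes > MAX_ACCUMULATED_BYTES:
        raise ValueError(
            f"accumulated_results payload is {payload_bytes} bytes; "
            f"budget is {MAX_ACCUMULATED_BYTES}"
        )

    return summary
```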
Maintenance Tasks
Clear Checkpoint for Completed Steps
Completed steps retain checkpoint data for debugging. To clear:
-- Clear checkpoints for completed steps older than 7 days
UPDATE tasker.workflow_steps
SET checkpoint = NULL
WHERE current_state = 'complete'
AND checkpoint IS NOT NULL
AND updated_at < NOW() - INTERVAL '7 days';
Truncate History Array
For steps with excessive history:
-- Keep only the 10 most recent history entries (preserving chronological order)
UPDATE tasker.workflow_steps
SET checkpoint = jsonb_set(
    checkpoint,
    '{history}',
    (SELECT jsonb_agg(elem ORDER BY (elem->>'timestamp')::timestamptz)
     FROM (
         SELECT elem
         FROM jsonb_array_elements(checkpoint->'history') elem
         ORDER BY (elem->>'timestamp')::timestamptz DESC
         LIMIT 10
     ) sub)
)
WHERE jsonb_array_length(checkpoint->'history') > 10;
Clear Checkpoint for Manual Reset
When manually resetting a step to reprocess from scratch:
-- Clear checkpoint to force reprocessing from beginning
UPDATE tasker.workflow_steps
SET checkpoint = NULL
WHERE workflow_step_uuid = 'step-uuid-here';
Warning: Only clear checkpoints if you want the step to restart from the beginning.
Capacity Planning
Database Sizing
Checkpoint column considerations:
- Each checkpoint: ~1-10KB typical (cursor, timestamp, metadata)
- History array: ~100 bytes per entry
- Accumulated results: Variable (handler-dependent)
Formula for checkpoint storage:
Storage = Active Steps × (Base Checkpoint Size + History Entries × 100 bytes + Accumulated Size)
Example: 10,000 active steps with 50-entry history and 5KB accumulated results:
10,000 × (5KB + 50 × 100B + 5KB) = 10,000 × 15KB = 150MB
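The same arithmetic as a quick sanity check; the function below simply encodes the formula above with the example values, so it is an estimate, not a measurement:

```python
def estimate_checkpoint_storage_bytes(active_steps, base_checkpoint_bytes,
                                      history_entries, accumulated_bytes):
    """Rough storage estimate per the formula above (illustrative only)."""
    per_step = base_checkpoint_bytes + history_entries * 100 + accumulated_bytes
    return active_steps * per_step

# Worked example above: 10,000 steps, 5 KB base, 50 history entries, 5 KB accumulated.
total = estimate_checkpoint_storage_bytes(10_000, 5_000, 50, 5_000)
print(f"{total / 1_000_000:.0f} MB")  # -> 150 MB
```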
Performance Impact
- Checkpoint write: ~1-5ms per checkpoint (single-row UPDATE)
- Checkpoint read: included in the step data fetch (no additional query)
Recommendations:
- Checkpoint every 1,000-10,000 items or every 1-5 minutes (a combined count/time trigger is sketched below)
- Too frequent: Excessive database writes
- Too infrequent: Lost progress on failure
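As a rough illustration of combining both guidelines, a handler might gate checkpoint_yield() on whichever budget is hit first. The constants and helper below are assumptions for this sketch, not SDK features:

```python
import time

CHECKPOINT_EVERY_N_ITEMS = 5_000   # within the 1,000-10,000 item guidance
CHECKPOINT_EVERY_SECONDS = 120     # within the 1-5 minute guidance

def should_checkpoint(items_since_last, last_checkpoint_monotonic):
    """True when either the item-count or the elapsed-time budget is reached."""
    return (
        items_since_last >= CHECKPOINT_EVERY_N_ITEMS
        or time.monotonic() - last_checkpoint_monotonic >= CHECKPOINT_EVERY_SECONDS
    )
```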
Alerting Recommendations
Prometheus/Grafana Metrics
If exporting to Prometheus:
# Alert on stale checkpoints
- alert: StaleCheckpoint
  expr: tasker_checkpoint_age_seconds > 600
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Step checkpoint is stale"

# Alert on large history
- alert: CheckpointHistoryGrowth
  expr: tasker_checkpoint_history_size > 100
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Checkpoint history exceeding threshold"
Database-Based Alerting
For periodic SQL-based monitoring:
-- Return non-zero if any issues detected
SELECT COUNT(*)
FROM tasker.workflow_steps
WHERE (
-- Stale in-progress checkpoints
(current_state = 'in_progress'
AND checkpoint IS NOT NULL
AND (checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes')
OR
-- Excessive history
(checkpoint IS NOT NULL
AND jsonb_array_length(checkpoint->'history') > 100)
);
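For cron-style scheduling, the query can be wrapped in a small script that exits non-zero when issues are found. The DSN environment variable and the psycopg2 dependency are assumptions of this sketch:

```python
#!/usr/bin/env python3
"""Cron-friendly check: exits non-zero when the query above finds issues."""
import os
import sys

import psycopg2

ALERT_QUERY = """
SELECT COUNT(*)
FROM tasker.workflow_steps
WHERE (current_state = 'in_progress'
       AND checkpoint IS NOT NULL
       AND (checkpoint->>'timestamp')::timestamptz < NOW() - INTERVAL '10 minutes')
   OR (checkpoint IS NOT NULL
       AND jsonb_array_length(checkpoint->'history') > 100)
"""

def main():
    # TASKER_DATABASE_URL is an assumed environment variable, not a Tasker convention.
    with psycopg2.connect(os.environ["TASKER_DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(ALERT_QUERY)
            issue_count = cur.fetchone()[0]
    if issue_count:
        print(f"checkpoint alert: {issue_count} step(s) need attention", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()
```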
See Also
- Batch Processing Guide - Full checkpoint yielding documentation
- DLQ System - Dead letter queue for failed steps
- Retry Semantics - Retry behavior with checkpoints