Skip to main content

Summary

Turbo pipelines provide at-least-once delivery. Every record produced by a source is guaranteed to be written to every downstream sink at least once. Records may be written more than once in specific, well-defined failure scenarios described below. Turbo does not provide exactly-once delivery. If a record is written exactly once is important for your use case, your sink must be idempotent (see Designing idempotent sinks).

Why duplicates can occur

Turbo’s at-least-once guarantee follows directly from the order in which it commits work:
  1. A sink writes records to its destination first.
  2. Only after every sink in the pipeline confirms its write does the source commit its read position (for example, the Kafka consumer offset).
If the pipeline crashes between those two steps — after a sink has written but before the source has committed — the source will replay the same records from its last committed position when the pipeline restarts. The sink will then write those records again. This is intentional. The alternative — committing the source offset before sinks have flushed — would risk losing data on a crash, which is almost always worse than writing a record twice.

The checkpointing protocol

Turbo coordinates this guarantee using a checkpoint coordinator that periodically emits checkpoint markers through the pipeline. Each marker carries a monotonically increasing integer called an epoch. Every checkpoint runs in two phases:

Phase 1: Marker phase

  1. The coordinator broadcasts a marker for epoch N to every source.
  2. Each source records its current read position (for example, Kafka topic/partition offsets) in memory. It does not commit the position yet.
  3. The source emits the marker downstream, after the last record it read before receiving the marker.
  4. Transforms forward the marker to their downstream operators.
  5. Each sink, upon receiving the marker, flushes and commits every record that arrived before the marker. Once the destination acknowledges the write, the sink sends an ACK for epoch N back to the coordinator.
  6. The coordinator waits until it has received an ACK from every sink. Once it has, it finalizes epoch N.

Phase 2: Finalizer phase

  1. The coordinator broadcasts a finalizer message for epoch N to every source.
  2. Each source commits the read position it recorded in step 2 (for example, by committing the Kafka consumer offset).
  3. The source emits the finalizer downstream. Transforms forward it; sinks ignore it.
The coordinator then immediately starts the next checkpoint with epoch N+1.

What this guarantees

  • No data loss: A source never commits its read position until every sink has durably written the records that came before that checkpoint marker.
  • At-least-once: If the pipeline crashes after step 5 (sink has flushed) but before step 8 (source has committed), the source replays from its previous committed position on restart, and the sink writes the same records a second time.

Where duplicates can occur

Duplicates are confined to the records that fell between the last committed source position and the point of failure. Concretely:
  • A crash during or just after a sink flush can cause records inside the in-flight epoch to be re-emitted on restart.
  • A network blip that causes the sink to retry an already-written batch can also produce duplicates inside a single epoch.
Duplicates never span an arbitrary amount of history — only the records in the current uncommitted epoch.

Designing idempotent sinks

Because duplicates are possible, design your destination to tolerate them. Turbo’s built-in sinks already do this where they can:
  • Postgres sink — Set primary_key on the sink. Writes become INSERT ... ON CONFLICT (<pk>) DO UPDATE, so re-writing the same row replaces it instead of inserting a duplicate. See the Postgres sink for details.
  • ClickHouse sink — Use a ReplacingMergeTree table engine keyed on the same primary_key you set on the sink, so duplicate rows collapse on merge. See the ClickHouse sink.
  • Postgres aggregation sink — The landing table uses a composite primary key (primary_key + _gs_op), so re-delivered records within the same checkpoint epoch are deduplicated via upsert. Records re-delivered across epoch boundaries are not deduplicated. See Postgres aggregation sink.
  • Kafka sink — Kafka has no native upsert. Treat the downstream consumer as the place to deduplicate (for example, by keying messages on a stable primary key and using a compacted topic, or by deduplicating on consumption).
  • Webhook / HTTP handler / SQS / Pub/Sub / S3 sinks — These have no built-in deduplication. Your endpoint or downstream consumer must be idempotent. Common patterns: deduplicate by a primary key inside the message body, or reject events with an already-seen idempotency key.
If you write a custom HTTP endpoint or webhook receiver, assume Turbo will deliver the same payload to it more than once. Either deduplicate on a stable key (typically id from the upstream record), or make the side effect itself idempotent (for example, UPSERT instead of INSERT, or PUT to a deterministic path instead of POST).

Checkpoint cadence

Checkpoints fire on a fixed interval managed by the engine. You cannot configure the interval per-pipeline. A faster interval reduces the window in which duplicates can occur on a crash but adds coordination overhead; a slower interval is more efficient but widens the duplicate window. The default cadence is tuned to keep that window in the low-seconds range under normal load.

What checkpointing does not do

  • It does not snapshot intermediate transform state. Turbo’s checkpoints coordinate source offset commits and sink flushes; they don’t persist in-memory operator state to disk. The engine does not need to — the source replay on restart re-derives that state by re-running the transforms.
  • It does not guarantee ordering across a restart. After a crash, the source replays from its last committed offset. Records that had already been written by the sink before the crash will appear again, interleaved with new records.
  • It does not make non-idempotent sinks safe. If your destination cannot tolerate seeing the same row twice, no setting on the Turbo side will change that — fix it at the destination.

Quick checklist for engineers

  • Pick a stable primary_key for every record and propagate it through every transform.
  • Set primary_key on every sink that supports it (Postgres, ClickHouse, MySQL, Postgres aggregation).
  • For destinations Turbo cannot deduplicate against (webhooks, Kafka, SQS, Pub/Sub, S3), confirm the consumer or endpoint deduplicates on that same key.
  • Do not use sink writes to trigger non-idempotent side effects (sending an email, charging a card, posting to a chat) unless you guard the side effect with your own idempotency key.