Transforms sit between sources and sinks in your pipeline, processing data as it flows through. They allow you to:
  • Filter and project data with SQL
  • Enrich data by calling external HTTP APIs
  • Execute custom JavaScript/TypeScript logic with WebAssembly
  • Create dynamic lookup tables
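
For example, a minimal pipeline reads from a dataset source, filters it with a single SQL transform, and writes the result to a sink. This is a sketch only; the dataset name, filter, and sink settings are illustrative:

sources:
  raw_events:
    type: dataset
    dataset_name: ethereum.logs
    version: 1.0.0

transforms:
  filtered_events:
    type: sql
    primary_key: log_index
    sql: |
      SELECT * FROM raw_events
      WHERE address = lower('0x...')

sinks:
  postgres_sink:
    type: postgres
    from: filtered_events
    # connection details omitted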

Transform Types

  • SQL Transform: use familiar SQL for filtering, projections, and column transformations
  • Dynamic Tables: create updatable lookup tables for filtering
  • HTTP Handlers: call external APIs to enrich your data
  • WebAssembly Scripts: execute custom JavaScript/TypeScript code

Transform Chaining

You can chain multiple transforms together, with each transform receiving the output of the previous one:
sources:
  raw_events:
    type: dataset
    dataset_name: ethereum.logs
    version: 1.0.0

transforms:
  # First transform: filter to specific contract
  filtered_events:
    type: sql
    sql: |
      SELECT * FROM raw_events
      WHERE address = lower('0x...')

  # Second transform: enrich with external data
  enriched_events:
    type: handler
    from: filtered_events
    url: https://api.example.com/enrich
    primary_key: log_index

  # Third transform: final formatting
  final_events:
    type: sql
    from: enriched_events
    sql: |
      SELECT
        transaction_hash,
        enriched_data,
        block_timestamp
      FROM enriched_events

sinks:
  postgres_sink:
    type: postgres
    from: final_events
    # ...

The from field

By default, a transform receives data from the source. Use the from field to receive data from another transform:
transforms:
  transform_1:
    type: sql
    sql: SELECT * FROM my_source # Implicitly uses the source

  transform_2:
    type: sql
    from: transform_1 # Explicitly uses transform_1's output
    sql: SELECT * FROM transform_1

Primary Keys

Most transforms require a primary_key field that specifies which column uniquely identifies each row:
transforms:
  my_transform:
    type: sql
    primary_key: id # or transaction_hash, log_index, etc.
    sql: SELECT id, data FROM source
The primary key is used for:
  • Upsert operations in sinks
  • Deduplication
  • Ordering guarantees
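
For example, if a transform's primary key is transaction_hash, a sink that upserts (such as Postgres) uses that key, so a re-emitted row with the same hash replaces the earlier one instead of creating a duplicate. A minimal sketch (table and column names are illustrative):

transforms:
  deduped_transactions:
    type: sql
    primary_key: transaction_hash
    sql: SELECT transaction_hash, block_timestamp FROM source

sinks:
  postgres_sink:
    type: postgres
    from: deduped_transactions
    # upserts on transaction_hash
    # ...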

Transform Naming

Like sources, transforms are referenced by the name you give them:
transforms:
  step_1:
    type: sql
    sql: SELECT * FROM source

  step_2:
    type: sql
    from: step_1 # Reference by name
    sql: SELECT * FROM step_1
Choose descriptive names that indicate what the transform does (e.g., filtered_transfers, enriched_events).

Performance Considerations

  • SQL transforms are highly optimized using Apache DataFusion:
    - Projections (selecting specific columns) are very efficient
    - Filters are pushed down to reduce data movement
    - Note: joins and aggregations are currently disabled in streaming mode
  • HTTP handlers add latency due to external API calls:
    - Use batching when possible to reduce the number of API calls
    - Consider caching frequently accessed data
    - Set appropriate timeouts for the external service
  • WASM transforms execute in a sandboxed environment:
    - TypeScript is transpiled to JavaScript at runtime
    - Keep scripts simple for best performance
    - Complex calculations are fine, but avoid heavy I/O
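
As an illustration of the efficient SQL pattern (project only the columns you need and filter as early as possible so the filter can be pushed down), here is a minimal sketch; the columns and address are illustrative:

transforms:
  slim_transfers:
    type: sql
    primary_key: log_index
    sql: |
      SELECT log_index, transaction_hash, block_timestamp
      FROM raw_events
      WHERE address = lower('0x...')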

Data Flow

Understanding how data flows through transforms:
Source → Transform 1 → Transform 2 → Transform 3 → Sink
   │          │            │             │          │
   └─ RecordBatch ──────────────────────────────────┘
Data is passed between operators as RecordBatches (a columnar data format), which enables:
  • Efficient memory usage
  • Fast serialization/deserialization
  • Vectorized processing

Special Column: _gs_op

All data includes a special _gs_op column that tracks the operation type:
  • i - Insert (new record)
  • u - Update (modified record)
  • d - Delete (removed record)
You can use this in SQL transforms:
transforms:
  inserts_only:
    type: sql
    sql: |
      SELECT * FROM source
      WHERE _gs_op = 'i'
The _gs_op column is automatically maintained by Turbo Pipelines and should be preserved in your transforms if you need upsert semantics in your sink.
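
In practice, that means keeping _gs_op in any projection that feeds an upserting sink. A minimal sketch (column names are illustrative):

transforms:
  projected_events:
    type: sql
    primary_key: log_index
    sql: |
      SELECT log_index, transaction_hash, _gs_op
      FROM source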

Best Practices

1. Start with SQL: use SQL transforms for filtering and basic transformations whenever possible; they're the most performant option.
2. Keep transforms focused: each transform should do one thing well. Chain multiple simple transforms rather than creating one complex transform.
3. Test locally: use the validate command to test your transform logic before deploying.
4. Monitor performance: use logs and metrics to identify slow transforms and optimize accordingly.