What are Transforms?

Transforms sit between sources and sinks in your pipeline, processing data as it flows through. They allow you to:
  • Filter and project data with SQL
  • Enrich data by calling external HTTP APIs
  • Execute custom JavaScript/TypeScript logic with WebAssembly
  • Create dynamic lookup tables
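
For example, a minimal pipeline can route a dataset source through a single SQL transform and into a Postgres sink. This is a sketch only; the dataset name, filter, and sink settings are placeholders:
sources:
  my_source:
    type: dataset
    dataset_name: ethereum.logs
    version: 1.0.0

transforms:
  filtered_logs:
    type: sql
    primary_key: log_index
    sql: |
      SELECT transaction_hash, log_index, data
      FROM my_source
      WHERE address = lower('0x...')

sinks:
  postgres_sink:
    type: postgres
    from: filtered_logs
    # ...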

Transform Types

Transforms come in several types: SQL transforms (type: sql) for filtering and projecting data, handler transforms (type: handler) that enrich rows by calling external HTTP APIs, and WASM transforms that run custom JavaScript/TypeScript logic.

Transform Chaining

You can chain multiple transforms together, with each transform receiving the output of the previous one:
sources:
  raw_events:
    type: dataset
    dataset_name: ethereum.logs
    version: 1.0.0

transforms:
  # First transform: filter to specific contract
  filtered_events:
    type: sql
    sql: |
      SELECT * FROM raw_events
      WHERE address = lower('0x...')

  # Second transform: enrich with external data
  enriched_events:
    type: handler
    from: filtered_events
    url: https://api.example.com/enrich
    primary_key: log_index

  # Third transform: final formatting
  final_events:
    type: sql
    from: enriched_events
    sql: |
      SELECT
        transaction_hash,
        enriched_data,
        block_timestamp
      FROM enriched_events

sinks:
  postgres_sink:
    type: postgres
    from: final_events
    # ...

The from Field

By default, a transform receives data from the source. Use the from field to receive data from another transform:
transforms:
  transform_1:
    type: sql
    sql: SELECT * FROM my_source  # Implicitly uses the source

  transform_2:
    type: sql
    from: transform_1  # Explicitly uses transform_1's output
    sql: SELECT * FROM transform_1

Primary Keys

Most transforms require a primary_key field that specifies the column uniquely identifying each row:
transforms:
  my_transform:
    type: sql
    primary_key: id  # or transaction_hash, log_index, etc.
    sql: SELECT id, data FROM source
The primary key is used for:
  • Upsert operations in sinks
  • Deduplication
  • Ordering guarantees
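
If no single column uniquely identifies a row, one option is to derive a key in the transform itself and declare it as the primary key. A sketch, assuming the source exposes transaction_hash and log_index columns:
transforms:
  keyed_events:
    type: sql
    primary_key: event_id
    sql: |
      -- derive a unique key by combining two columns
      SELECT
        transaction_hash || '-' || CAST(log_index AS VARCHAR) AS event_id,
        transaction_hash,
        log_index,
        data
      FROM source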

Transform Naming

Like sources, transforms are referenced by the name you give them:
transforms:
  step_1:
    type: sql
    sql: SELECT * FROM source

  step_2:
    type: sql
    from: step_1  # Reference by name
    sql: SELECT * FROM step_1
Choose descriptive names that indicate what the transform does (e.g., filtered_transfers, enriched_events).

Performance Considerations

SQL transforms:
  • Highly optimized using Apache DataFusion
  • Projections (selecting specific columns) are very efficient
  • Filters are pushed down to reduce data movement (see the sketch after this list)
  • Note: Joins and aggregations are currently disabled in streaming mode

HTTP handlers:
  • External API calls add latency to the pipeline
  • Use batching when possible to reduce API calls
  • Consider caching frequently accessed data
  • Set appropriate timeouts for the external service

WASM transforms:
  • Scripts execute in a sandboxed environment
  • TypeScript is transpiled to JavaScript at runtime
  • Keep scripts simple for best performance
  • Complex calculations are fine, but avoid heavy I/O
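
To benefit from projection and filter pushdown, select only the columns you need and filter as early in the chain as possible. A minimal sketch, using placeholder column names and a placeholder contract address:
transforms:
  lean_transfers:
    type: sql
    primary_key: log_index
    sql: |
      -- project only the columns the sink needs, filter early
      SELECT transaction_hash, log_index, block_timestamp
      FROM source
      WHERE address = lower('0x...')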

Data Flow

Understanding how data flows through transforms:
Source → Transform 1 → Transform 2 → Transform 3 → Sink
   │          │            │             │          │
   └─ RecordBatch ──────────────────────────────────┘
Data is passed between operators as RecordBatches (a columnar data format), which enables:
  • Efficient memory usage
  • Fast serialization/deserialization
  • Vectorized processing

Special Column: _gs_op

All data includes a special _gs_op column that tracks the operation type:
  • i - Insert (new record)
  • u - Update (modified record)
  • d - Delete (removed record)
You can use this in SQL transforms:
transforms:
  inserts_only:
    type: sql
    sql: |
      SELECT * FROM source
      WHERE _gs_op = 'i'
The _gs_op column is automatically maintained by Turbo Pipelines and should be preserved in your transforms if you need upsert semantics in your sink.
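
When a transform projects specific columns, one way to keep the operation type is to select _gs_op explicitly alongside the other columns, as in this sketch based on the chaining example above:
transforms:
  final_events:
    type: sql
    from: enriched_events
    sql: |
      SELECT
        transaction_hash,
        enriched_data,
        block_timestamp,
        _gs_op  -- preserve the operation type for upserts in the sink
      FROM enriched_events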

Best Practices

1. Start with SQL: Use SQL transforms for filtering and basic transformations whenever possible - they’re the most performant.

2. Keep transforms focused: Each transform should do one thing well. Chain multiple simple transforms rather than creating one complex transform.

3. Test locally: Use the validate command to test your transform logic before deploying.

4. Monitor performance: Use logs and metrics to identify slow transforms and optimize accordingly.
