Transforms sit between sources and sinks in your pipeline, processing data as it flows through. They allow you to:
  • Filter and project data with SQL
  • Enrich data by calling external HTTP APIs
  • Execute custom JavaScript/TypeScript logic with WebAssembly
  • Create dynamic lookup tables

Transform Types

  • SQL Transform: use familiar SQL for filtering, projections, and column transformations
  • Dynamic Tables: create updatable lookup tables for filtering
  • HTTP Handlers: call external APIs to enrich your data
  • WebAssembly Scripts: execute custom JavaScript/TypeScript code

Transform Chaining

You can chain multiple transforms together, with each transform receiving the output of the previous one:
sources:
  raw_events:
    type: dataset
    dataset_name: ethereum.logs

transforms:
  # First transform: filter to a specific contract
  filtered_events:
    type: sql
    primary_key: log_index
    sql: |
      SELECT * FROM raw_events
      WHERE address = lower('0x...')

  # Second transform: enrich with external data
  enriched_events:
    type: handler
    from: filtered_events
    url: https://api.example.com/enrich
    primary_key: log_index

  # Third transform: final formatting
  final_events:
    type: sql
    primary_key: log_index
    sql: |
      SELECT
        transaction_hash,
        enriched_data,
        block_timestamp
      FROM enriched_events

sinks:
  postgres_sink:
    type: postgres
    from: final_events
    # ...

Referencing Upstream Data

How a transform declares its input depends on the transform type:
  • sql transforms reference the upstream source or transform by name in the SQL FROM clause. They do not accept a top-level from field.
  • handler and script transforms use a top-level from field to name the upstream source or transform.
  • dynamic_table transforms are populated either by an inline sql query that reads from the pipeline, or externally via writes to the backing table; they do not take a from field (see the sketch after the example below).
transforms:
  # SQL: upstream is named in the SQL FROM clause
  filtered:
    type: sql
    primary_key: id
    sql: SELECT * FROM my_source

  # SQL chained on top of another transform — again, referenced in FROM
  projected:
    type: sql
    primary_key: id
    sql: SELECT id, data FROM filtered

  # Handler / script: upstream is named in the top-level `from` field
  enriched:
    type: handler
    from: projected
    url: https://api.example.com/enrich
    primary_key: id
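The example above covers the sql and handler cases. dynamic_table has no from field to show; the following is a rough sketch of the inline-sql form described above, where the transform name and column names are hypothetical and additional fields may apply:
transforms:
  # Dynamic table: keyed by `column` rather than `primary_key`, and populated
  # here by an inline SQL query that reads from the pipeline
  allowed_addresses:
    type: dynamic_table
    column: address
    sql: SELECT address FROM my_source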

Primary Keys

sql, handler, and script transforms require a primary_key field that names the column uniquely identifying each row. (dynamic_table does not take primary_key — it uses column instead to name its key column.)
transforms:
  my_transform:
    type: sql
    primary_key: id # or transaction_hash, log_index, etc.
    sql: SELECT id, data FROM source
The primary key is used for:
  • Upsert operations in sinks
  • Deduplication
  • Ordering guarantees

Transform Naming

Like sources, transforms are referenced by the name you give them:
transforms:
  step_1:
    type: sql
    primary_key: id
    sql: SELECT * FROM source

  step_2:
    type: sql
    primary_key: id
    sql: SELECT * FROM step_1 # Reference the upstream transform by name
Choose descriptive names that indicate what the transform does (e.g., filtered_transfers, enriched_events).

Performance Considerations

SQL transforms
  • Highly optimized using Apache DataFusion
  • Projections (selecting specific columns) are very efficient
  • Filters are pushed down to reduce data movement
  • Joins, aggregations, and window functions are not supported in streaming mode; use dynamic tables for lookup-style joins

HTTP handlers
  • Add latency due to external API calls
  • Send multiple rows per request (one_row_per_request: false) when the endpoint supports it, to reduce per-row overhead; see the sketch after this list
  • Consider caching frequently accessed data on the handler side
  • Set appropriate timeouts on the external service

WebAssembly scripts
  • Execute in a sandboxed environment
  • TypeScript is transpiled to JavaScript at runtime
  • Keep scripts simple for best performance
  • Complex calculations are fine, but avoid heavy I/O
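To illustrate the batching point above, here is a minimal handler sketch. It reuses only fields shown elsewhere on this page; the transform name and upstream are placeholders, and it assumes one_row_per_request is set on the handler transform itself:
transforms:
  enriched_batched:
    type: handler
    from: filtered_events
    url: https://api.example.com/enrich
    primary_key: log_index
    # Assumed placement: send multiple rows per request instead of one at a
    # time, provided the endpoint accepts batched payloads
    one_row_per_request: false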

Data Flow

Understanding how data flows through transforms:
Source → Transform 1 → Transform 2 → Transform 3 → Sink
   │          │            │             │          │
   └─ RecordBatch ──────────────────────────────────┘
Data is passed between operators as RecordBatches (a columnar data format), which enables:
  • Efficient memory usage
  • Fast serialization/deserialization
  • Vectorized processing

Special Column: _gs_op

All data includes a special _gs_op column that tracks the operation type:
  • i - Insert (new record)
  • u - Update (modified record)
  • d - Delete (removed record)
You can use this in SQL transforms:
transforms:
  inserts_only:
    type: sql
    primary_key: id
    sql: |
      SELECT * FROM source
      WHERE _gs_op = 'i'
The _gs_op column is automatically maintained by Turbo Pipelines and should be preserved in your transforms if you need upsert semantics in your sink.
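For example, when a SQL transform projects specific columns, one straightforward way to preserve _gs_op is to select it explicitly so the sink can still distinguish inserts, updates, and deletes. A minimal sketch, assuming an upstream source with id and data columns:
transforms:
  projected_with_op:
    type: sql
    primary_key: id
    sql: |
      -- Keep _gs_op alongside the projected columns so upsert/delete
      -- semantics survive into the sink
      SELECT id, data, _gs_op FROM my_source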

Best Practices

1. Start with SQL
   Use SQL transforms for filtering and basic transformations whenever possible; they’re the most performant.

2. Keep transforms focused
   Each transform should do one thing well. Chain multiple simple transforms rather than creating one complex transform.

3. Validate before deploying
   Run goldsky turbo validate <file> to check your pipeline config before deploying.

4. Monitor performance
   Use logs and metrics to identify slow transforms and optimize accordingly.