The TypeScript transform lets you execute custom TypeScript (or JavaScript) code on each record in your pipeline. This is useful for:
Complex data transformations not supported by SQL
Custom business logic with type safety
Data parsing and formatting
Conditional transformations based on complex rules
Expanding one input row into many output rows, or filtering rows out
Code runs in a sandboxed WebAssembly environment. TypeScript is transpiled to ES2020 JavaScript (via SWC) when the pipeline is built, then executed inside a QuickJS interpreter shipped as WebAssembly (Extism’s js-pdk).
TypeScript transforms are significantly slower than SQL transforms. Use SQL to
filter and project first, and only pass the minimum data required into a
TypeScript transform.
One of typescript, ts, javascript, or js. TypeScript values are
transpiled to JavaScript at pipeline build time; plain JavaScript is passed
through unchanged.
Mapping of output field name to Arrow type. Required whenever invoke
returns a shape different from the input — any field you return that isn’t
declared here will be dropped. If omitted, the output schema is inherited
from the input. Supported types: string (alias for utf8), int8,
int16, int32, int64, uint8, uint16, uint32, uint64, float16,
float32, float64, boolean, binary, date32, date64, timestamp,
time32, time64, duration, interval, null.
Your TypeScript code. Must define a top-level invoke(data) function that
receives a single record and returns one of: the transformed record object,
null to filter the record out, or an array of record objects to expand one
input row into many output rows (empty array emits nothing).
Number of sandboxed script instances that process rows in parallel. Higher
values improve throughput on CPU-bound scripts at the cost of more memory.
Use 1 if your script relies on rows being processed in order.
Minimum number of rows to accumulate before invoking the script. Smaller
upstream batches are combined until this threshold is reached, which reduces
per-call overhead on high-volume streams with tiny batches. 0 disables
accumulation and processes each batch immediately.
Your script must define a top-level invoke function that:
Accepts a single data parameter (a plain JS object representing one row)
Returns one of:
A record object — the transformed row
null — filter this row out of the output
An array of record objects — expand this input row into many output rows (return [] to emit nothing, include null entries in the array to skip specific rows)
Can return a different shape than the input when using the schema configuration
The _gs_op column (insert/update/delete marker) is automatically copied from the input row to each output row, so your script does not need to set it.
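A minimal sketch of a conforming invoke function (the email field is illustrative, not part of any required schema):

```typescript
function invoke(data: Record<string, any>) {
  // Returning null filters the row out of the output.
  if (!data.email) {
    return null;
  }
  // Return the transformed row; other fields pass through unchanged.
  return { ...data, email: String(data.email).trim().toLowerCase() };
}
```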
If you omit the schema field, the output schema is inherited from the input. Any new fields you add in invoke will be dropped because they aren’t in the schema. Declare a schema whenever your output differs from the input:
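```typescript
// A sketch: the output shape differs from the input, so every returned
// field must be declared in the schema, e.g. id as int64, full_name as
// string, name_length as int32 (names and types are illustrative).
function invoke(data: Record<string, any>) {
  const fullName = `${data.first_name} ${data.last_name}`.trim();
  return {
    id: data.id,                  // carried over from the input
    full_name: fullName,          // new field
    name_length: fullName.length, // new field
  };
}
```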
The data parameter is a plain JavaScript object with your record’s fields. Values use native JS types — strings stay strings, integers/floats become numbers, booleans stay booleans, list columns become arrays, struct columns become nested objects.
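For example, assuming a row with scalar, list, and struct columns (field names are illustrative):

```typescript
function invoke(data: Record<string, any>) {
  // Scalar columns arrive as native JS values.
  const name: string = data.name;       // utf8 column -> string
  const total: number = data.total;     // int/float column -> number
  const active: boolean = data.active;  // boolean column -> boolean
  // List columns become arrays; struct columns become nested objects.
  const firstTag: string | undefined = data.tags?.[0];
  const city: string | undefined = data.address?.city;
  // Same shape in, same shape out: the inherited schema still applies.
  return { ...data, name: name.trim() };
}
```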
Return an array of objects from invoke to emit multiple output rows for a single input row. This is useful for unpacking nested arrays or cross-joining with a lookup list. Returning [] drops the row entirely; null entries inside the array are skipped.
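A sketch that unpacks a nested items list into one output row per element (field names are illustrative; because the output shape differs from the input, the new fields must be declared in the schema):

```typescript
function invoke(data: Record<string, any>) {
  const items: any[] = data.items ?? [];
  // Mapping an empty list yields [], which drops this input row entirely.
  return items.map((item) => ({
    order_id: data.order_id, // repeated on every output row
    sku: item.sku,
    quantity: item.quantity,
  }));
}
```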
If your script throws an unhandled error, the pipeline will retry processing
that record. Use try/catch to handle errors gracefully and flag problematic
records for later review.
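For example, a defensive-parsing sketch that keeps bad records and flags them instead of throwing (field names are illustrative; the flag fields must be declared in the schema):

```typescript
function invoke(data: Record<string, any>) {
  try {
    const payload = JSON.parse(data.raw_payload);
    return { ...data, user_id: payload.user_id, parse_error: null };
  } catch (err) {
    // Keep the row but mark it for later review instead of failing
    // the record and triggering retries.
    return { ...data, user_id: null, parse_error: String(err) };
  }
}
```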
Controls how many rows are accumulated before invoke is called. Smaller upstream batches are combined until the threshold is reached, which reduces per-call overhead.
Default: 0 (disabled — each upstream batch is processed immediately)
Higher values: Better throughput on high-volume streams with tiny batches
Trade-off: Higher values increase end-to-end latency as rows wait to accumulate
CPU-bound, high-throughput scripts: increase parallelism to process slices concurrently
Memory-constrained environments: reduce parallelism to limit concurrent instances
Low-latency requirements: keep batch_size at 0 to process each batch immediately
Order-sensitive processing: use parallelism: 1 to ensure sequential processing
Start with the defaults and adjust based on observed performance. Monitor memory usage when increasing parallelism, and monitor latency when increasing batch_size.
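As an illustration of the order-sensitive case, here is a sketch that suppresses consecutive duplicate readings per device. It assumes module-level state persists across invoke calls within a single script instance, so it is only deterministic with parallelism: 1 (field names are illustrative):

```typescript
// Module-level state survives across invoke calls within one script
// instance, so this only behaves reliably with parallelism: 1.
const lastReading: Record<string, number> = {};

function invoke(data: Record<string, any>) {
  if (lastReading[data.device_id] === data.reading) {
    return null; // drop a consecutive duplicate reading
  }
  lastReading[data.device_id] = data.reading;
  return data;
}
```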
TypeScript is transpiled to JavaScript once when the pipeline starts (no per-record transpile cost)
Each row is executed inside a QuickJS interpreter, which is significantly slower than native SQL transforms
Every record is evaluated individually
Keep scripts simple and avoid expensive per-row work (regex compilation, JSON.parse on huge blobs, allocating large temporary objects, etc.)
Memory Usage
Scripts run in a sandboxed environment with limited memory
Avoid creating large data structures
Process records one at a time; don't accumulate state
Clean up temporary variables
Higher parallelism values increase memory usage proportionally
Optimization Tips
Pre-define types and interfaces outside the function
Use built-in methods (Array.map, filter) instead of manual loops
Avoid nested loops and recursive functions
Cache frequently accessed values in variables
Use parallelism and batch_size to tune throughput for your workload
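A sketch applying several of these tips, with the regex and lookup map hoisted to module scope so they are built once per script instance rather than once per row (field names are illustrative):

```typescript
// Hoisted out of invoke: compiled once per script instance, not per row.
const EMAIL_RE = /^[^@\s]+@[^@\s]+\.[^@\s]+$/;
const REGION_BY_PREFIX: Record<string, string> = { "1": "na", "44": "eu" };

function invoke(data: Record<string, any>) {
  const email: string = data.email ?? "";
  return {
    ...data,
    email_valid: EMAIL_RE.test(email),                       // cheap per-row check
    region: REGION_BY_PREFIX[data.phone_prefix] ?? "other",  // cached lookup
  };
}
```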
console.log output goes to the pipeline’s stderr log and is not visible in your sink. To inspect intermediate values in the data you actually ship, add debug fields to the returned record:
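```typescript
// A sketch: surface intermediate values as extra fields (the debug field
// names are illustrative; declare them in the schema so they reach the sink).
function invoke(data: Record<string, any>) {
  const normalized = String(data.name ?? "").trim().toLowerCase();
  return {
    ...data,
    name: normalized,
    debug_original_name: data.name, // inspect the pre-transform value downstream
  };
}
```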