We recently released v3 of pipeline configurations, which uses a more intuitive
and user-friendly format to define and configure pipelines using a yaml file.
For backward compatibility, we will continue to support the previous v2
format, which is why you will find references to both formats in the yaml files
presented across the documentation. Feel free to use whichever is more
comfortable for you, but we encourage you to start migrating to the v3 format.
Overview
A Mirror Pipeline defines the flow of data from sources -> transforms -> sinks. It is configured in a yaml file which adheres to Goldsky’s pipeline schema.
The core logic of the pipeline is defined in the sources, transforms and sinks attributes.
- sources represent the origin of data coming into the pipeline.
- transforms represent data transformation/filter logic to be applied to a source and/or another transform in the pipeline.
- sinks represent the destination of source and/or transform data going out of the pipeline.
Each source and transform has a unique name which can be referenced by other transforms and/or sinks, determining the dataflow within the pipeline.
While the pipeline is configured in yaml, goldsky pipeline CLI commands are used to take actions on the pipeline such as start, stop, get, delete, monitor, etc.
Below is an example pipeline configuration which sources from the base.logs Goldsky dataset, filters the data using sql and sinks to a postgresql table:
base-logs.yaml
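A minimal sketch of what such a configuration could look like in the v3 format. The attribute names inside each block (type, dataset_name, version, sql, primary_key, from, table, schema, secret_name) are illustrative assumptions and may differ from the exact pipeline schema:

```yaml
name: base-logs-pipeline
resource_size: s
sources:
  base.logs:                    # key matches the dataset name (the usual convention)
    type: dataset               # assumed attribute names below
    dataset_name: base.logs
    version: 1.0.0
transforms:
  filtered_logs:
    sql: >
      SELECT id, block_number, transaction_hash, address, data
      FROM base.logs
      WHERE address = LOWER('0x4200000000000000000000000000000000000006')
    primary_key: id
sinks:
  postgres_logs:
    type: postgres
    from: filtered_logs         # reference the transform by its name
    table: base_weth_logs
    schema: public
    secret_name: MY_POSTGRES_SECRET
```

Note how the sink's from attribute wires the dataflow described earlier: the base.logs source feeds the filtered_logs transform, which in turn feeds the postgres sink.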
Keys in the v3 format for sources, transforms and sinks are user-provided values. In the above example, the source reference name base.logs matches the actual dataset name. This is the convention that you’ll typically see across examples and autogenerated configurations. However, you can use a custom name as the key.
Development workflow
Similar to the software development workflow of edit -> compile -> run, there’s an implicit iterative workflow of configure -> apply -> monitor for developing pipelines.
- configure: Create/edit the configuration yaml file.
- apply: Apply the configuration, aka run the pipeline.
- monitor: Monitor how the pipeline behaves. This will help create insights that’ll generate ideas for the first step.
Understanding Pipeline Runtime Lifecycle
The status attribute represents the desired status of the pipeline and is provided by the user. Applicable values are:
- ACTIVE means the user wants to start the pipeline.
- INACTIVE means the user wants to stop the pipeline.
- PAUSED means the user wants to save the progress made by the pipeline so far and stop it.
An ACTIVE pipeline has a runtime status as well. Runtime represents the execution of the pipeline. Applicable runtime status values are:
- STARTING means the pipeline is being set up.
- RUNNING means the pipeline has been set up and is processing records.
- FAILING means the pipeline has encountered errors that prevent it from running successfully.
- TERMINATED means the pipeline has failed and the execution has been terminated.
Successful pipeline lifecycle
In this scenario the pipeline is successfully set up and processing data without encountering any issues. We consider the pipeline to be in a healthy state, which translates into the following statuses:
- Desired status in the pipeline configuration is ACTIVE
- Runtime status goes from STARTING to RUNNING
We reuse the base-logs.yaml configuration from above and start the pipeline by running goldsky pipeline apply base-logs.yaml --status ACTIVE or goldsky pipeline start base-logs.yaml. The desired status of the pipeline is now ACTIVE. We can confirm this using goldsky pipeline list:
We can then watch the runtime status via the goldsky pipeline monitor base-logs-pipeline command:

The pipeline starts out in the STARTING status and becomes RUNNING as it starts processing data successfully into our Postgres sink.
This pipeline will start processing the historical data of the source dataset, reach its edge and continue streaming data in real time until we either stop it or it encounters error(s) that interrupt its execution.
Unsuccessful pipeline lifecycle
Let’s now consider the scenario where the pipeline encounters errors during its lifetime and ends up failing. There can be a multitude of reasons for a pipeline to encounter errors, such as:
- secrets not being correctly configured
- sink availability issues
- policy rules on the sink preventing the pipeline from writing records
- resource size incompatibility
- and many more
Depending on the nature of the error, the pipeline may recover on its own and return to a RUNNING runtime status. A pipeline can, however, be in an ACTIVE desired status but a TERMINATED runtime status in scenarios that lead to terminal failure.
Let’s see an example where we’ll use the same configuration as above but set a secret_name that does not exist.
bad-base-logs.yaml
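Sketched below, this is the same configuration as the earlier base-logs.yaml sketch, except the sink points at a secret that was never created (attribute names remain illustrative assumptions):

```yaml
# Everything else is identical to base-logs.yaml; only the sink differs
sinks:
  postgres_logs:
    type: postgres
    from: filtered_logs
    table: base_weth_logs
    schema: public
    secret_name: SECRET_THAT_DOES_NOT_EXIST   # misconfigured: this secret does not exist
```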
We apply this configuration with goldsky pipeline apply bad-base-logs.yaml.
Monitoring the pipeline with goldsky pipeline monitor bad-base-logs-pipeline, we see:

The desired status of the pipeline remains ACTIVE even though the pipeline runtime status is TERMINATED.
Runtime visibility
Pipeline runtime visibility is an important part of the pipeline development workflow. Mirror pipelines expose:
- Runtime status and error messages
- Logs emitted by the pipeline
- Metrics on Records received, which counts all the records the pipeline has received from its source(s), and Records written, which counts all the records the pipeline has written to its sink(s)
- Email notifications
- Pipeline dashboard at https://app.goldsky.com/dashboard/pipelines/stream/<pipeline_name>/<version>
- The goldsky pipeline monitor <name_or_path_to_config_file> CLI command
Email notifications
If a pipeline fails terminally, the project members will be notified via email.
Error handling
There are two broad categories of errors.
Pipeline configuration schema error
This means the schema of the pipeline configuration is not valid. These errors are usually caught before pipeline execution. Some possible scenarios:
- a required attribute is missing
- transform SQL has syntax errors
- pipeline name is invalid
Pipeline runtime error
This means the pipeline encounters errors during execution. Some possible scenarios:
- credentials stored in the secret are incorrect or do not have the needed access privileges
- sink availability issues
- a poison-pill record that breaks the business logic in the transforms
- resource_size limitation
Resource sizing
resource_size represents the compute (vCPUs and RAM) available to the pipeline. There are several options for pipeline sizes: s, m, l, xl, xxl. This attribute influences pricing as well (see the sketch after the list below).
Resource sizing depends on a few different factors such as:
- number of sources, transforms, and sinks
- expected amount of data to be processed
- whether transform sql involves joining multiple sources and/or transforms
A few guidelines:
- A small resource size is usually enough for most use cases: it can handle a full backfill of small chain datasets and write at speeds of up to 300K records per second. For pipelines using subgraphs as a source, it can reliably handle up to 8 subgraphs.
- Larger resource sizes are usually needed when backfilling large chains or when doing large JOINs (example: a JOIN between the accounts and transactions datasets in Solana)
- It’s recommended to always follow a defensive approach: start small and scale up if needed.
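A small sketch of that defensive approach in the configuration file, assuming the same top-level attribute names as the earlier example:

```yaml
# Start small and scale up only if monitoring shows the pipeline falling behind
name: base-logs-pipeline
resource_size: s      # bump to m, l, xl or xxl for large backfills or heavy JOINs
```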
Snapshots
A pipeline snapshot captures a point-in-time state of a RUNNING pipeline, allowing users to resume from it in the future.
It can be useful in various scenarios:
- evolving your RUNNING pipeline (eg: adding a new source or sink) without losing progress made so far
- recovering from newly introduced bugs, where the user fixes the bug and resumes from an earlier snapshot to reprocess data
When are snapshots taken?
- When updating a RUNNING pipeline, a snapshot is created before applying the update. This is to ensure that there’s an up-to-date snapshot in case the update introduces issues.
- When pausing a pipeline.
- Automatically at regular intervals. For RUNNING pipelines in a healthy state, automatic snapshots are taken every 4 hours to ensure minimal data loss in case of errors.
- Users can request snapshot creation via the following CLI commands: goldsky pipeline snapshot create <name_or_path_to_config>, goldsky pipeline apply <name_or_path_to_config> --from-snapshot new, or goldsky pipeline apply <name_or_path_to_config> --save-progress true (CLI version < 11.0.0).
- Users can list all snapshots in a pipeline via the following CLI command: goldsky pipeline snapshot list <name_or_path_to_config>
How long does it take to create a snapshot?
The amount of time it takes for a snapshot to be created depends largely on two factors: first, the amount of state accumulated during pipeline execution; second, how fast records are being processed end-to-end in the pipeline. In the case of a long-running snapshot that was triggered as part of an update to the pipeline, any future updates are blocked until the snapshot is completed. Users do have an option to cancel the update request. There is also a scenario where the pipeline was healthy at the time the snapshot started but became unhealthy later, preventing snapshot creation. Here, the pipeline will attempt to recover but may need user intervention that involves restarting from the last successful snapshot.
Scenarios and Snapshot Behavior
Happy Scenario:
- Suppose a pipeline is at 50% progress, and an automatic snapshot is taken.
- The pipeline then progresses to 60% and is in a healthy state. If you pause the pipeline at this point, a new snapshot is taken.
- You can later start the pipeline from the 60% snapshot, ensuring continuity from the last known healthy state.
Unhappy Scenario:
- The pipeline reaches 50% progress, and an automatic snapshot is taken.
- It then progresses to 60% but enters a bad state. Attempting to pause the pipeline in this state will fail.
- If you restart the pipeline, it will resume from the last successful snapshot at 50%; no snapshot was created at 60%.