A pipeline moves data through the flow `sources -> transforms -> sinks`. It is configured in a YAML file which adheres to Goldsky's pipeline schema. The core logic of the pipeline is defined in the `sources`, `transforms`, and `sinks` attributes.
- `sources` represent the origin of the data into the pipeline.
- `transforms` represent data transformation/filter logic to be applied to a source and/or another transform in the pipeline.
- `sinks` represent the destination for the source and/or transform data out of the pipeline.

Each `source` and `transform` has a unique name which is referenceable in other `transform` and/or `sink` definitions, determining the dataflow within the pipeline.
While the pipeline is configured in YAML, `goldsky pipeline` CLI commands are used to take actions on the pipeline, such as `start`, `stop`, `get`, `delete`, `monitor`, etc.
Below is an example pipeline configuration which sources from the `base.logs` Goldsky dataset, filters the data using SQL, and sinks to a PostgreSQL table:
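The original example configuration is not reproduced here, so the following is a minimal sketch of its general shape. Attribute names such as `type`, `dataset_name`, `from`, and `table`, as well as the SQL filter, table, and secret values, are illustrative assumptions; consult Goldsky's pipeline schema for the exact fields.

```yaml
# Sketch of a pipeline configuration (attribute names are illustrative).
name: base-logs-pipeline
sources:
  base.logs:                 # key matches the dataset name by convention
    type: dataset            # assumed attribute
    dataset_name: base.logs  # assumed attribute
transforms:
  filtered_logs:             # unique name, referenceable downstream
    sql: >                   # assumed attribute: SQL filter over the source
      SELECT * FROM base.logs WHERE topics IS NOT NULL
    primary_key: id          # assumed attribute
sinks:
  postgres_sink:
    type: postgres           # assumed attribute
    from: filtered_logs      # references the transform by its unique name
    table: base_logs         # assumed attribute
    secret_name: MY_POSTGRES_SECRET
```

Note how the sink refers to the transform by name: this is the dataflow mechanism described above.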
In the example, the source key `base.logs` matches the actual dataset name. This is the convention that you'll typically see across examples and autogenerated configurations. However, you can use a custom name as the key.

Much like the `edit -> compile -> run` loop in software development, there's an implicit iterative workflow of `configure -> apply -> monitor` for developing pipelines.
- `configure`: Create/edit the configuration YAML file.
- `apply`: Apply the configuration, i.e. run the pipeline.
- `monitor`: Monitor how the pipeline behaves. This will help create insights that'll generate ideas for the first step.

The `status` attribute represents the desired status of the pipeline and is provided by the user. Applicable values are:

- `ACTIVE` means the user wants to start the pipeline.
- `INACTIVE` means the user wants to stop the pipeline.
- `PAUSED` means the user wants to save the progress made by the pipeline so far and stop it.

An `ACTIVE` pipeline has a runtime status as well. Runtime status represents the execution of the pipeline. Applicable runtime status values are:
- `STARTING` means the pipeline is being set up.
- `RUNNING` means the pipeline has been set up and is processing records.
- `FAILING` means the pipeline has encountered errors that prevent it from running successfully.
- `TERMINATED` means the pipeline has failed and its execution has been terminated.

When the `status` in the pipeline configuration is `ACTIVE`, the pipeline's runtime status moves from `STARTING` to `RUNNING`. To start the pipeline, run `goldsky pipeline apply base-logs.yaml --status ACTIVE` or `goldsky pipeline start base-logs.yaml`.
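Equivalently, the desired status can be declared in the configuration file itself; the placement below is an illustrative sketch rather than a definitive layout.

```yaml
# Sketch: declaring the desired status in the pipeline configuration.
# Exact attribute placement follows Goldsky's pipeline schema.
name: base-logs-pipeline
status: ACTIVE
```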
The pipeline's desired status is now `ACTIVE`. We can confirm this using `goldsky pipeline list`:
Next, let's watch the pipeline with the `goldsky pipeline monitor base-logs-pipeline` command: the pipeline begins in the `STARTING` status and becomes `RUNNING` as it starts processing data successfully into our Postgres sink.
This pipeline will start processing the historical data of the source dataset, reach its edge, and continue streaming data in real time until we either stop it or it encounters errors that interrupt its execution.
At this point, the pipeline is in the `RUNNING` runtime status.
A pipeline can be in an `ACTIVE` desired status but a `TERMINATED` runtime status in scenarios that lead to terminal failure.
Let's see an example where we use the same configuration as above but set a `secret_name` that does not exist.
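For illustration, the broken configuration differs from the working one only in the sink's secret; the attribute layout below is a sketch with assumed names (`postgres_sink`, `from`, `type` are illustrative).

```yaml
# Sketch: same pipeline, but the sink references a nonexistent secret.
sinks:
  postgres_sink:
    type: postgres       # assumed attribute
    from: filtered_logs  # assumed transform name
    secret_name: A_SECRET_THAT_DOES_NOT_EXIST
```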
We apply it with `goldsky pipeline apply bad-base-logs.yaml`.
Monitoring it with `goldsky pipeline monitor bad-base-logs-pipeline`, we see that the desired status remains `ACTIVE` even though the pipeline runtime status is `TERMINATED`.
Two useful metrics are `Records received`, which counts all the records the pipeline has received from its source(s), and `Records written`, which counts all the records the pipeline has written to its sink(s). These are available in the pipeline dashboard at `https://app.goldsky.com/dashboard/pipelines/stream/<pipeline_name>/<version>` and via the `goldsky pipeline monitor <name_or_path_to_config_file>` CLI command.

`resource_size` represents the compute (vCPUs and RAM) available to the pipeline. There are several options for pipeline sizes: `s, m, l, xl, xxl`. This attribute influences pricing as well.
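As a sketch, `resource_size` is set as an attribute of the pipeline configuration; the placement shown here is illustrative.

```yaml
# Sketch: requesting a small resource size for the pipeline.
name: base-logs-pipeline
resource_size: s
```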
Resource sizing depends on a few different factors. The small resource size is usually enough in most use cases: it can handle a full backfill of small chain datasets and write at speeds of up to 300K records per second. For pipelines using subgraphs as a source, it can reliably handle up to 8 subgraphs.

A snapshot saves the progress made by a `RUNNING` pipeline, allowing users to resume from it in the future.
It can be useful in various scenarios:

- Updating a `RUNNING` pipeline (e.g. adding a new source or sink) without losing the progress made so far.
- When updating a `RUNNING` pipeline, a snapshot is created before applying the update. This ensures that there's an up-to-date snapshot in case the update introduces issues.
- For `RUNNING` pipelines in a healthy state, automatic snapshots are taken every 4 hours to ensure minimal data loss in case of errors.

The relevant CLI commands are:

- `goldsky pipeline snapshot create <name_or_path_to_config>`
- `goldsky pipeline apply <name_or_path_to_config> --from-snapshot new`
- `goldsky pipeline apply <name_or_path_to_config> --save-progress true` (CLI version < `11.0.0`)
- `goldsky pipeline snapshot list <name_or_path_to_config>`