Run a pipeline
An explanation of the lifecycle of Mirror pipelines and how to operate them
Overview
Mirror pipelines streamline the process of consuming blockchain data and sinking it into your data warehouse. This ensures efficient, real-time data integration for robust analytics and insights.
In the previous article, Create a pipeline, we explored the different methods for deploying pipelines using the Web Pipeline Builder and the CLI. Here, we focus on the lifecycle management of Mirror pipelines, detailing the key operations needed to manage and change their state. This includes initializing, maintaining, and scaling pipelines to ensure optimal performance and data integrity.
Pipeline lifecycle
Mirror Pipelines go through a series of stages throughout their lifecycle.
The actual state of a pipeline is determined by two sets of statuses:
- Desired Status: this is the status we want our pipeline to be in. It can take one of three values: ACTIVE, INACTIVE and PAUSED
  - You can see this status by running goldsky pipeline list
- Execution Status: this is an internal status related to the execution of the pipeline. It can take the values: STARTING, RUNNING, FAILING and TERMINATED
  - You can see this status by running goldsky pipeline monitor <your_pipeline>
Let’s see how these states play out in successful and unsuccessful scenarios.
Successful pipeline lifecycle
In this scenario we look at the “happy path”, where a pipeline is successfully deployed without encountering any issues. We consider the pipeline to be in a healthy state, which translates into the following statuses:
- Desired Status is ACTIVE
- Execution Status goes from STARTING to RUNNING
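The two status sets and the healthy-state rule described above can be modelled in a few lines of Python. This is only an illustrative sketch: PipelineState and is_healthy are hypothetical names chosen for this example, not part of the Goldsky CLI or API.

```python
from dataclasses import dataclass
from enum import Enum


class DesiredStatus(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    PAUSED = "PAUSED"


class ExecutionStatus(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    FAILING = "FAILING"
    TERMINATED = "TERMINATED"


@dataclass
class PipelineState:
    """Hypothetical container pairing the two statuses of one pipeline."""

    desired: DesiredStatus
    execution: ExecutionStatus

    def is_healthy(self) -> bool:
        # Happy path from the text: desired status is ACTIVE and the
        # execution status is STARTING or RUNNING.
        return self.desired is DesiredStatus.ACTIVE and self.execution in (
            ExecutionStatus.STARTING,
            ExecutionStatus.RUNNING,
        )
```

For instance, PipelineState(DesiredStatus.ACTIVE, ExecutionStatus.RUNNING).is_healthy() is true, while any pipeline whose execution status is FAILING or TERMINATED is not healthy under this rule.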
Let’s look at a simple example below where we define a pipeline that consumes Logs from Base chain and streams them into a Postgres database:
Notice how, as part of the definition of this pipeline, we set the status to ACTIVE, meaning that when we tell Goldsky to deploy the pipeline it will try to get it into a running state.
Let’s deploy it using the command goldsky pipeline apply base-logs.yaml:
At this point we have a deployed pipeline in ACTIVE desired status. We can confirm this using goldsky pipeline list:
Because it’s in ACTIVE desired status and we just deployed it with the apply command, Mirror will go ahead and kick off its deployment.
We can then check the runtime status of this pipeline using the goldsky pipeline monitor command:
We can see how the pipeline starts in STARTING status and becomes RUNNING as it starts streaming data successfully into our Postgres database.
This pipeline will start processing the historical data of the source dataset, reach its edge and continue streaming data in real time until we either stop it or it encounters an issue that prompts it to stop.
Unsuccessful pipeline lifecycle
Let’s now consider the scenario where the pipeline encounters issues during its lifetime and ends up failing. We consider such a pipeline to be in an unhealthy (bad) state.
There can be multiple reasons why a pipeline might fail, such as secrets not being correctly configured, the sink suddenly becoming unavailable (e.g. the database goes down), policy rules on the sink preventing the pipeline from writing records, etc.
These failing events can occur during the lifetime of a running pipeline or right from the start, preventing the pipeline from getting into a RUNNING status:
As you can appreciate from the diagrams above, the pipelines go into FAILING and TERMINATED runtime statuses. Mirror then decides to set the desired status as INACTIVE, which completely shuts down the pipeline and its execution.
Let’s see an example of the second case, whereby we try to deploy a pipeline but it immediately fails. To do that we’ll take the same definition as before but we’ll replace the secretName with a faulty one:
Let’s deploy it using the command goldsky pipeline apply base-logs.yaml.
Deployment looks good in principle, just like with the previous example. This is because the definition is semantically correct. If we now monitor the pipeline we see that at runtime things aren’t that great:
Mirror has found a critical error with the pipeline and decided to terminate it immediately. It also sets its desired status to INACTIVE. We can confirm this using goldsky pipeline list:
In this example, Mirror could terminate the pipeline with certainty because it found a critical error. In some cases, however, immediate termination is not desirable, as some errors are transient and resolve by themselves after some time. For instance, a database might be down for a minute. When this happens, Mirror won’t kill the pipeline; instead, it will try to bring the pipeline back to a healthy status for a period of 6 hours. If the issue isn’t resolved by then, the pipeline is set to TERMINATED runtime status and INACTIVE desired status to avoid incurring infrastructure costs.
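The failure handling described above can be summarised as a small decision function. This is a sketch of the documented behaviour, not Mirror's actual implementation: handle_failure and its parameters are hypothetical, and reporting a recovering pipeline as ACTIVE/FAILING is an assumption made for illustration.

```python
from datetime import timedelta
from typing import Tuple

# Recovery window quoted in the text above.
RECOVERY_WINDOW = timedelta(hours=6)


def handle_failure(is_critical: bool, time_failing: timedelta) -> Tuple[str, str]:
    """Return (desired_status, execution_status) after a failure is observed."""
    if is_critical:
        # e.g. a faulty secret: retrying cannot help, so terminate at once.
        return ("INACTIVE", "TERMINATED")
    if time_failing < RECOVERY_WINDOW:
        # Transient error: keep trying to recover (assumed to show as FAILING).
        return ("ACTIVE", "FAILING")
    # Still unhealthy once the window has elapsed: shut down to avoid costs.
    return ("INACTIVE", "TERMINATED")
```

Under this model a database that is down for a minute leaves the pipeline in the recovery branch, while the faulty-secret example goes straight to the first branch.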
Pipeline Failed Alerts
If a pipeline fails, the project members will be informed via an email containing instructions on how to fix the pipeline.
You can configure this notification in the Notifications section of your project.
Snapshots
Now that we have seen, at a high level, the different statuses of a pipeline in successful and unsuccessful scenarios, let’s turn our attention to Snapshots.
Snapshots are crucial for managing the state of your Mirror pipelines. They capture the current status of your pipeline, allowing you to resume operations smoothly from a known point.
Snapshots are one of the fundamental tools that Mirror provides to ensure data integrity and to give you control over the economics and resources consumed by your pipelines. Here’s how snapshots function within the lifecycle of a Mirror pipeline:
When are snapshots taken?
- Automatic Snapshots: Mirror takes snapshots automatically for you based on the activity of your pipelines:
  - Paused Status: A snapshot is automatically taken when a pipeline is set to PAUSED. This ensures that you can resume from the exact state before pausing.
  - During Updates: If an update is being made to a RUNNING pipeline, a snapshot is taken to preserve the current state before the update.
  - Regular Intervals: For running pipelines in a healthy state, automatic snapshots are taken every 4 hours to ensure minimal data loss in case of interruptions.
- Manual Snapshots:
  - Users can manually pause a pipeline, which triggers a snapshot, using the command goldsky pipeline pause <your_pipeline>. However, this is only possible if the pipeline is in a healthy state. If the pipeline is in a bad state, the pause operation won’t work.
Scenarios and Snapshot Behavior
Happy Scenario:
- Suppose a pipeline is at 50% progress, and an automatic snapshot is taken.
- The pipeline then progresses to 60% and is in a healthy state. If you pause the pipeline at this point, a new snapshot is taken.
- You can later restart the pipeline from the 60% snapshot, ensuring continuity from the last known healthy state.
Bad Scenario:
- Suppose the pipeline reaches 50%, and an automatic snapshot is taken.
- It then progresses to 60% but enters a bad state. Attempting to pause the pipeline in this state will fail.
- If you restart the pipeline, it will resume from the last successful snapshot at 50%, as the state at 60% is not considered valid.
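Both scenarios can be replayed with a toy model. The sketch below is purely illustrative: SnapshotModel and its methods are hypothetical, progress is tracked as a plain percentage, and a snapshot is assumed to succeed only while the pipeline is healthy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SnapshotModel:
    progress: int = 0          # percent of the source processed so far
    healthy: bool = True
    snapshots: List[int] = field(default_factory=list)

    def advance(self, to: int, healthy: bool = True) -> None:
        self.progress = to
        self.healthy = healthy

    def snapshot(self) -> bool:
        # A snapshot only succeeds while the pipeline is healthy.
        if not self.healthy:
            return False
        self.snapshots.append(self.progress)
        return True

    def restart(self) -> int:
        # Resume from the last successful snapshot, or from zero if none exists.
        self.progress = self.snapshots[-1] if self.snapshots else 0
        self.healthy = True
        return self.progress


# Happy scenario: automatic snapshot at 50%, then a pause snapshot at 60%.
happy = SnapshotModel()
happy.advance(50)
happy.snapshot()
happy.advance(60)
happy.snapshot()
happy_resume = happy.restart()   # resumes at 60

# Bad scenario: automatic snapshot at 50%, then a bad state at 60%.
bad = SnapshotModel()
bad.advance(50)
bad.snapshot()
bad.advance(60, healthy=False)
pause_snapshot_ok = bad.snapshot()   # False: no snapshot in a bad state
bad_resume = bad.restart()           # resumes at 50
```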
Resuming from Snapshots
When you stop and restart a pipeline, it will, by default, resume from the last known successful snapshot. This mechanism ensures that even in the event of interruptions or errors, the pipeline can be brought back online with minimal disruption, resuming data processing from a stable point.
Snapshot on Inactivity
Before making a pipeline inactive, an attempt is made to take a snapshot. This final snapshot ensures that when the pipeline is reactivated, it can start from the most recent snapshot, providing a smooth transition back to operation.
Operating pipelines
Now that we are familiar with the different states of a pipeline and the importance of Snapshots in the context of data recovery, let’s look at the operations we can perform on pipelines to influence their lifecycle.
Deploying a pipeline
There are two main ways by which you can deploy a pipeline: in the web app or by using the CLI.
If you prefer to deploy pipelines using a web interface instead, check the Pipeline Builder.
apply command + pipeline configuration
The goldsky pipeline apply command expects the yaml file to include additional attributes pertaining to the configuration (such as desired state and pipeline name) and a definition attribute containing the actual triple of sources, transforms and sinks.
See the following example:
goldsky pipeline apply
Pausing a pipeline
There are 3 ways by which you can pause a pipeline:
1. pause command
If you pause a pipeline using the command goldsky pipeline pause <name>, Mirror will attempt to take a snapshot before pausing the pipeline. The snapshot is successfully taken only if the pipeline is in a healthy state. After the attempted snapshot, Mirror will set the pipeline to PAUSED desired status and TERMINATED runtime status.
Example:
2. stop command
You can stop a pipeline using the command goldsky pipeline stop <name>. Unlike the pause command, stopping a pipeline doesn’t try to take a snapshot. Mirror will directly set the pipeline to INACTIVE desired status and TERMINATED runtime status.
Example:
3. apply command + INACTIVE or PAUSED status
We can replicate the behaviour of the pause and stop commands by updating the desired status of the pipeline using pipeline apply and setting it as INACTIVE or PAUSED.
Following up with our previous example, we could stop our deployed pipeline doing this:
Which method you use to stop or pause a pipeline ultimately comes down to whether you want to take a snapshot. Stopping is preferable when we have encountered recent data issues and want to make sure the pipeline restarts from an earlier, healthy snapshot. Pausing is preferable when there are no data quality issues and we want to take a snapshot before shutting the pipeline down, so that we can resume it at a later point without losing any work already done.
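The difference between the pause and stop commands described in this section can be captured as a small lookup table. This is a sketch for illustration only: Effect, EFFECTS and choose_command are hypothetical names, and the table simply restates the statuses given above.

```python
from typing import NamedTuple


class Effect(NamedTuple):
    desired_status: str
    runtime_status: str
    attempts_snapshot: bool


# Effects of each command as described in this section.
EFFECTS = {
    "pause": Effect("PAUSED", "TERMINATED", attempts_snapshot=True),
    "stop": Effect("INACTIVE", "TERMINATED", attempts_snapshot=False),
}


def choose_command(want_snapshot: bool) -> str:
    """Pick the command whose snapshot behaviour matches what we want."""
    for command, effect in EFFECTS.items():
        if effect.attempts_snapshot == want_snapshot:
            return command
    raise ValueError("no command matches the requested snapshot behaviour")
```

Calling choose_command(True) selects pause and choose_command(False) selects stop. The equivalent apply-based approach would set the desired status found in the matching Effect.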
Restarting a pipeline
There are two ways to restart an already deployed pipeline:
1. start command
As in: goldsky pipeline start <name>
Example:
This command will open up a monitor for your pipeline after deploying.
2. apply command + ACTIVE status
Just as you can stop a pipeline by changing its status to INACTIVE, you can also restart it by setting it to ACTIVE.
Following up with our previous example, we could restart our stopped pipeline by doing this:
Unlike the start command, this method won’t open up the monitor automatically.
Applying updates to a pipeline
As we have seen in the previous sections, Mirror allows you to update the status of a pipeline directly in your configuration files using the apply command. This is really powerful, as all the operations we have explored so far represent a status change, meaning that deploying, pausing/stopping and restarting pipelines can all be done by updating the pipeline status.
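To make the idea of "every operation is a status change" concrete, here is a minimal sketch that names the operation implied by moving from one desired status to another. It only illustrates the reasoning; operation_for is a hypothetical helper, and Mirror's real reconciliation logic is not documented here.

```python
from typing import Optional

VALID_STATUSES = {"ACTIVE", "INACTIVE", "PAUSED"}


def operation_for(current: Optional[str], requested: str) -> str:
    """Name the lifecycle operation implied by a desired-status change.

    `current` is None when the pipeline has not been deployed yet.
    """
    if requested not in VALID_STATUSES:
        raise ValueError(f"unknown desired status: {requested}")
    if current is None:
        return "deploy"
    if current == requested:
        return "no status change"
    # Any other transition is determined by the status we are moving to.
    return {"ACTIVE": "restart", "PAUSED": "pause", "INACTIVE": "stop"}[requested]
```

For example, applying a configuration with status PAUSED to a pipeline that is currently ACTIVE corresponds to a pause, and applying ACTIVE to an INACTIVE pipeline corresponds to a restart.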
The apply command can also be used to change other attributes of the pipeline.
See the following example:
In this example we are changing the pipeline description, prompting a restart of the pipeline from its latest successful snapshot, and telling Mirror not to take a snapshot before pausing. This is a common configuration to apply when you have found issues with your pipeline and would like to restart from the last healthy checkpoint.
For a more complete reference on the configuration attributes you can apply, check this reference.
Deleting a pipeline
Finally, the last operation you might want to perform is deleting your pipelines. Although inactive pipelines don’t consume any resources (and thus do not incur a billing cost on your side), it’s always nice to keep your project clean and remove pipelines which you aren’t going to use any longer.
You can delete pipelines with the command goldsky pipeline delete:
In-flight requests
Sometimes you might find that you are not able to perform a specific action on your pipeline because an in-flight request is currently being processed. This means that a previous operation on your pipeline hasn’t finished yet and needs to be either processed or discarded before you can apply your new operation. A common scenario is a pipeline that is busy taking a snapshot.
Consider the following example where we recently paused a pipeline (thus triggering a snapshot) and we immediately try to delete it:
Let’s look at which request is still being processed:
We can see that the snapshot is still taking place. Since we want to delete the pipeline we can go ahead and stop this snapshot creation:
We can now successfully remove the pipeline:
As you saw in this example, Mirror provides you with commands to see the current in-flight requests in your pipeline and decide whether you want to discard them or wait for them to be processed.
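The gating behaviour walked through above can be sketched as a toy model in which a pending request blocks further operations until it is discarded. Everything here is hypothetical (PipelineOps, InFlightError and the method names); the sketch does not reproduce the actual CLI commands or their output.

```python
from typing import Optional


class InFlightError(RuntimeError):
    """Raised when an operation is attempted while another is still pending."""


class PipelineOps:
    def __init__(self) -> None:
        self.in_flight: Optional[str] = None   # name of the pending request
        self.deleted = False

    def pause(self) -> None:
        self._require_idle()
        # Pausing triggers a snapshot, which stays in flight until it
        # finishes or is discarded.
        self.in_flight = "snapshot"

    def delete(self) -> None:
        self._require_idle()
        self.deleted = True

    def discard_in_flight(self) -> None:
        self.in_flight = None

    def _require_idle(self) -> None:
        if self.in_flight is not None:
            raise InFlightError(f"in-flight request pending: {self.in_flight}")


pipeline = PipelineOps()
pipeline.pause()                  # snapshot is now in flight
try:
    pipeline.delete()             # rejected while the snapshot is pending
    delete_blocked = False
except InFlightError:
    delete_blocked = True
pipeline.discard_in_flight()      # stop the snapshot creation
pipeline.delete()                 # now succeeds
```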
Conclusion
That’s a wrap! In this article we have learnt the multiple states defining the lifecycle of Mirror pipelines and what operations we can perform on them to ensure data integrity and cost optimizations. We have paid special attention to snapshots as they provide the basis for data recovery.
Mirror pipelines provide a robust framework for real-time data processing and integration. Getting familiar with how to operate them will help you realize the full potential of Mirror, empowering your organization to make data-driven decisions with confidence.