Overview

Mirror pipelines streamline the process of consuming blockchain data and sinking it into your data warehouse. This ensures efficient, real-time data integration for robust analytics and insights.

In the previous article, Create a pipeline, we explored the different ways to deploy pipelines using the Web Pipeline Builder and the CLI. Here, we focus on the lifecycle management of Mirror pipelines, detailing the key operations needed to manage and change their state. This includes initializing, maintaining, and scaling pipelines to ensure optimal performance and data integrity.

Pipeline lifecycle

Mirror Pipelines go through a series of stages throughout their lifecycle.

The actual state of a pipeline is determined by two sets of statuses:

  • Desired Status: this is the status we want our pipeline to be in. It can take one of three values: ACTIVE, INACTIVE or PAUSED

    • You can see this status by running goldsky pipeline list
  • Execution Status: this is an internal status related to the execution of the pipeline. It can take the values: STARTING, RUNNING, FAILING and TERMINATED

    • You can see this status by running goldsky pipeline monitor <your_pipeline>
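To make the two sets of statuses concrete, here is a minimal Python sketch that models them as enums. The class names and the is_healthy helper are hypothetical (they are not part of the Goldsky CLI or any SDK), and the health rule is an assumption based on the happy path described in this article: desired status ACTIVE with execution status STARTING or RUNNING.

```python
from enum import Enum


class DesiredStatus(Enum):
    """What we ask Mirror to do with the pipeline (shown by `goldsky pipeline list`)."""
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    PAUSED = "PAUSED"


class ExecutionStatus(Enum):
    """What the pipeline is actually doing (shown by `goldsky pipeline monitor`)."""
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    FAILING = "FAILING"
    TERMINATED = "TERMINATED"


def is_healthy(desired: DesiredStatus, execution: ExecutionStatus) -> bool:
    """Hypothetical helper: ACTIVE plus STARTING or RUNNING counts as healthy."""
    return desired is DesiredStatus.ACTIVE and execution in {
        ExecutionStatus.STARTING,
        ExecutionStatus.RUNNING,
    }


print(is_healthy(DesiredStatus.ACTIVE, ExecutionStatus.RUNNING))       # True
print(is_healthy(DesiredStatus.INACTIVE, ExecutionStatus.TERMINATED))  # False
```

Keeping the two statuses as separate types mirrors the fact that one is something we request and the other is something Mirror reports.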

Let’s see how these states play out in successful and unsuccessful scenarios.

Successful pipeline lifecycle

In this scenario we look at the “happy path” where a pipeline is successfully deployed without encountering any issues. We consider the pipeline to be in a healthy state, which translates into the following statuses:

  • Desired Status is ACTIVE
  • Execution Status goes from STARTING to RUNNING

Let’s look at a simple example below where we define a pipeline that consumes Logs from Base chain and streams them into a Postgres database:

base-logs.yaml
name: base-logs-pipeline
version: 1
status: ACTIVE
resource_size: s
apiVersion: 3
sources:
  base.logs:
    dataset_name: base.logs
    version: 1.0.0
    type: dataset
    description: Enriched logs for events emitted from contracts. Contains the
      contract address, data, topics, decoded event and metadata for blocks and
      transactions.
    display_name: Logs
transforms: {}
sinks:
  postgres_base_logs:
    type: postgres
    table: base_logs
    schema: public
    secret_name: GOLDSKY_SECRET
    description: "Postgres sink for: base.logs"
    from: base.logs

Notice how as part of the definition of this pipeline we set the status as ACTIVE, meaning that when we inform Goldsky to deploy the pipeline it will try to get it into a running state.

Let’s deploy it using the command goldsky pipeline apply base-logs.yaml:

❯ goldsky pipeline apply base-logs.yaml

◇  Successfully validated config file

◇  Successfully applied config to pipeline: base-logs-pipeline

To monitor the status of your pipeline:

Using the CLI: `goldsky pipeline monitor base-logs-pipeline`
Using the dashboard: https://app.goldsky.com/dashboard/pipelines/stream/base-logs-pipeline/1

At this point we have a deployed pipeline in ACTIVE desired status. We can confirm this using goldsky pipeline list:

❯ goldsky pipeline list
✔ Listing pipelines
────────────────────────────────────────
│ Name                          │ Version │ Status │ Resource │
│                               │         │        │ Size     │
│───────────────────────────────────────
│ base-logs-pipeline            │ 1       │ ACTIVE │ s        │
────────────────────────────────────────

Because the pipeline is in ACTIVE desired status and we just deployed it with the apply command, Mirror will go ahead and kick off its deployment. We can then check the runtime status of this pipeline using the goldsky pipeline monitor command.

The monitor shows the pipeline starting in STARTING status and becoming RUNNING as it begins streaming data successfully into our Postgres database. The pipeline will process the historical data of the source dataset, reach its edge and continue streaming data in real time until we either stop it or it encounters an issue that prompts it to stop.

Unsuccessful pipeline lifecycle

Let’s now consider the scenario where the pipeline encounters issues during its lifetime and ends up failing. We consider such a pipeline to be in an unhealthy (bad) state.

There can be multiple reasons why a pipeline might fail, such as secrets not being correctly configured, the sink suddenly becoming unavailable (e.g. the database goes down), or policy rules on the sink preventing the pipeline from writing records.

These failures can occur during the lifetime of a running pipeline or right from the start, preventing the pipeline from ever reaching RUNNING status.

In both cases, the pipeline goes into the FAILING and TERMINATED runtime statuses. Mirror then sets the desired status to INACTIVE, which completely shuts down the pipeline and its execution.

Let’s see an example of the second case, where we try to deploy a pipeline but it immediately fails. To do that we’ll take the same definition as before but replace the secret_name with a faulty one:

bad-base-logs.yaml
name: bad-base-logs-pipeline
version: 1
status: ACTIVE
resource_size: s
apiVersion: 3
sources:
  base.logs:
    dataset_name: base.logs
    version: 1.0.0
    type: dataset
    description: Enriched logs for events emitted from contracts. Contains the
      contract address, data, topics, decoded event and metadata for blocks and
      transactions.
    display_name: Logs
transforms: {}
sinks:
  postgres_base_logs:
    type: postgres
    table: base_logs
    schema: public
    secret_name: BAD_SECRET
    description: "Postgres sink for: base.logs"
    from: base.logs

Let’s deploy it using the command goldsky pipeline apply bad-base-logs.yaml.

❯ goldsky pipeline apply bad-base-logs.yaml

◇  Successfully validated config file

◇  Successfully applied config to pipeline: bad-base-logs-pipeline

To monitor the status of your pipeline:

Using the CLI: `goldsky pipeline monitor bad-base-logs-pipeline`
Using the dashboard: https://app.goldsky.com/dashboard/pipelines/stream/bad-base-logs-pipeline/1

Deployment looks good in principle, just like in the previous example. This is because the definition is semantically correct. If we now monitor the pipeline, however, we see that things aren’t that great at runtime.

Mirror has found a critical error with the pipeline and decided to terminate it immediately. It also sets its desired status to INACTIVE. We can confirm this using goldsky pipeline list:

❯ goldsky pipeline list
✔ Listing pipelines
─────────────────────────────────────────
│ Name                          │ Version │ Status   │ Resource │
│                               │         │          │ Size     │
─────────────────────────────────────────
│ bad-base-logs-pipeline        │ 1       │ INACTIVE │ s        │
─────────────────────────────────────────

In this example, Mirror could terminate the pipeline with certainty because it found a critical error. In other cases, immediate termination is not desirable, because some errors are transient and resolve by themselves after some time. For instance, a database might be down for a minute. When this happens, Mirror won’t kill the pipeline; instead, it will try to bring the pipeline back to a healthy status for a period of 6 hours. If the issue isn’t resolved by then, the pipeline is set to TERMINATED and INACTIVE status to avoid running up infrastructure costs.
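The failure handling described above can be summarised as a small decision function. This is only a sketch of the rules as stated in this article, not Mirror’s actual implementation: the function name and its parameters are hypothetical, and two details are assumptions, namely that a pipeline being retried keeps ACTIVE desired status with FAILING execution status, and that the 6-hour limit is inclusive.

```python
from datetime import timedelta
from typing import Tuple

RECOVERY_WINDOW = timedelta(hours=6)  # recovery period stated in the text


def resolve_failure(is_critical: bool, failing_for: timedelta) -> Tuple[str, str]:
    """Return (desired_status, execution_status) for a pipeline that hit an error."""
    # Critical errors (e.g. a bad secret) end the pipeline straight away;
    # transient errors only do so once the recovery window has elapsed.
    if is_critical or failing_for >= RECOVERY_WINDOW:
        return ("INACTIVE", "TERMINATED")
    # Assumption: while Mirror keeps retrying, the pipeline stays ACTIVE/FAILING.
    return ("ACTIVE", "FAILING")


print(resolve_failure(True, timedelta(0)))           # ('INACTIVE', 'TERMINATED')
print(resolve_failure(False, timedelta(minutes=1)))  # ('ACTIVE', 'FAILING')
print(resolve_failure(False, timedelta(hours=7)))    # ('INACTIVE', 'TERMINATED')
```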

Pipeline Failed Alerts

If a pipeline fails, the project members are notified via an email containing instructions on how to fix the pipeline.

You can configure this notification in the Notifications section of your project.

Snapshots

Now that we have seen at a high level the different statuses of a pipeline in successful and unsuccessful scenarios, let’s turn our attention to Snapshots.

Snapshots are crucial for managing the state of your Mirror pipelines. They capture the current status of your pipeline, allowing you to resume operations smoothly from a known point.

Snapshots are one of the fundamental tools that Mirror provides to ensure data integrity and have control over the economics and resources consumed by your pipelines. Here’s how snapshots function within the lifecycle of a Mirror pipeline:

When are snapshots taken?

  1. Automatic Snapshots - Mirror takes snapshots automatically for you based on the activity of your pipelines:

    • Paused Status: A snapshot is automatically taken when a pipeline is set to PAUSED. This ensures that you can resume from the exact state before pausing.
    • During Updates: If an update is being made to a RUNNING pipeline, a snapshot is taken to preserve the current state before the update.
    • Regular Intervals: For running pipelines in a healthy state, automatic snapshots are taken every 4 hours to ensure minimal data loss in case of interruptions.
  2. Manual Snapshots:

    • Users can manually pause a pipeline, which triggers a snapshot, using the command goldsky pipeline pause <your_pipeline>. However, this is only possible if the pipeline is in a healthy state. If the pipeline is in a bad state, the pause operation won’t work.
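The snapshot triggers listed above can be written down as a single predicate. This is a simplified sketch: the event names ("pause", "update", "interval") and the function itself are hypothetical, and it assumes that no snapshot is taken at all while the pipeline is unhealthy.

```python
from datetime import timedelta

SNAPSHOT_INTERVAL = timedelta(hours=4)  # regular interval stated in the text


def snapshot_expected(
    event: str,
    healthy: bool,
    since_last_snapshot: timedelta = timedelta(0),
) -> bool:
    """Return True if, under the rules above, we would expect a snapshot."""
    if not healthy:
        # Assumption: snapshots only succeed for pipelines in a healthy state.
        return False
    if event in {"pause", "update"}:
        # Pausing, or updating a running pipeline, snapshots the current state.
        return True
    if event == "interval":
        # Periodic check for healthy, running pipelines.
        return since_last_snapshot >= SNAPSHOT_INTERVAL
    return False


print(snapshot_expected("pause", healthy=True))                          # True
print(snapshot_expected("pause", healthy=False))                         # False
print(snapshot_expected("interval", True, timedelta(hours=4)))           # True
print(snapshot_expected("interval", True, timedelta(hours=1)))           # False
```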

Scenarios and Snapshot Behavior

Happy Scenario:

  • Suppose a pipeline is at 50% progress, and an automatic snapshot is taken.
  • The pipeline then progresses to 60% and is in a healthy state. If you pause the pipeline at this point, a new snapshot is taken.
  • You can later restart the pipeline from the 60% snapshot, ensuring continuity from the last known healthy state.

Bad Scenario:

  • Suppose the pipeline reaches 50% and an automatic snapshot is taken.
  • It then progresses to 60% but enters a bad state. Attempting to pause the pipeline in this state will fail.
  • If you restart the pipeline, it will resume from the last successful snapshot at 50%, as the state at 60% is not considered valid.
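Both scenarios can be replayed with a toy model. The PipelineProgress class below is purely illustrative (a hypothetical name, not a Goldsky API) and assumes that a snapshot attempt simply fails when the pipeline is in a bad state, so the resume point is always the last snapshot that succeeded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PipelineProgress:
    progress: int = 0            # percentage of the dataset processed
    healthy: bool = True
    snapshots: List[int] = field(default_factory=list)

    def take_snapshot(self) -> bool:
        """Record the current progress; fails if the pipeline is unhealthy."""
        if not self.healthy:
            return False
        self.snapshots.append(self.progress)
        return True

    def resume_point(self) -> Optional[int]:
        """Progress we would restart from: the last successful snapshot."""
        return self.snapshots[-1] if self.snapshots else None


# Happy scenario: automatic snapshot at 50%, then a pause at 60% while healthy.
happy = PipelineProgress(progress=50)
happy.take_snapshot()
happy.progress = 60
happy.take_snapshot()
print(happy.resume_point())  # 60

# Bad scenario: automatic snapshot at 50%, then the pipeline breaks at 60%.
bad = PipelineProgress(progress=50)
bad.take_snapshot()
bad.progress = 60
bad.healthy = False
print(bad.take_snapshot())   # False
print(bad.resume_point())    # 50
```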

Resuming from Snapshots

When you stop and restart a pipeline, it will, by default, resume from the last known successful snapshot. This mechanism ensures that even in the event of interruptions or errors, the pipeline can be brought back online with minimal disruption, resuming data processing from a stable point.

Snapshot on Inactivity

Before making a pipeline inactive, an attempt is made to take a snapshot. This final snapshot ensures that when the pipeline is reactivated, it can start from the most recent snapshot, providing a smooth transition back to operation.

Operating pipelines

Now that we are familiar with the different states of a pipeline and the importance of Snapshots in the context of data recovery, let’s look at the operations we can perform on pipelines to influence their lifecycle.

Deploying a pipeline

There are two main ways by which you can deploy a pipeline: in the web app or by using the CLI.

If you prefer to deploy pipelines using a web interface instead, check the Pipeline Builder.

apply command + pipeline configuration

The goldsky pipeline apply command expects the YAML file to include additional attributes pertaining to the configuration (such as the desired status and pipeline name) along with the definition itself: the actual triplet of sources, transforms and sinks.

See the following example:

base-logs.yaml
name: base-logs-pipeline
version: 1
status: ACTIVE
resource_size: s
apiVersion: 3
sources:
  base.logs:
    dataset_name: base.logs
    version: 1.0.0
    type: dataset
    description: Enriched logs for events emitted from contracts. Contains the
      contract address, data, topics, decoded event and metadata for blocks and
      transactions.
    display_name: Logs
transforms: {}
sinks:
  postgres_base_logs:
    type: postgres
    table: base_logs
    schema: public
    secret_name: GOLDSKY_SECRET
    description: "Postgres sink for: base.logs"
    from: base.logs

Pausing a pipeline

There are 3 ways by which you can pause a pipeline:

1. pause command

If you pause a pipeline using the command goldsky pipeline pause <name>, Mirror will attempt to take a snapshot before pausing the pipeline. The snapshot is successfully taken only if the pipeline is in a healthy state. After the attempted snapshot, Mirror will set the pipeline to PAUSED desired status and TERMINATED runtime status.

Example:

> goldsky pipeline pause base-logs-pipeline
◇  Successfully paused pipeline: base-logs-pipeline
Pipeline paused and progress saved. You can restart it with "goldsky pipeline start base-logs-pipeline".

2. stop command

You can stop a pipeline using the command goldsky pipeline stop <name>. Unlike the pause command, stopping a pipeline doesn’t try to take a snapshot. Mirror will directly set the pipeline to INACTIVE desired status and TERMINATED runtime status.

Example:

> goldsky pipeline stop base-logs-pipeline

◇  Pipeline stopped. You can restart it with "goldsky pipeline start base-logs-pipeline".

3. apply command + INACTIVE or PAUSED status

We can replicate the behaviour of the pause and stop commands by updating the desired status of the pipeline using pipeline apply and setting it as INACTIVE or PAUSED.

Following up with our previous example, we could stop our deployed pipeline by doing this:

base-logs.yaml
name: base-logs-pipeline
status: INACTIVE

> goldsky pipeline apply base-logs.yaml

◇  Successfully validated config file

◇  Successfully applied config to pipeline: base-logs-pipeline

Which method you use to stop or pause a pipeline ultimately comes down to whether you want to take a snapshot. In some cases, stopping a pipeline is preferable: if we have encountered recent data issues, skipping the snapshot ensures that we restart the pipeline from an earlier, healthy snapshot. In other cases, there are no data quality issues, so we pause the pipeline to take a snapshot and resume it at a later point without losing any work already done.
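If you script these operations, the choice boils down to one flag. The sketch below builds the argument list for the pause and stop commands shown in this section; the halt_command helper is hypothetical and only assembles the command, leaving the execution (for example with subprocess.run) to you.

```python
from typing import List


def halt_command(pipeline_name: str, take_snapshot: bool) -> List[str]:
    """Build the CLI invocation: `pause` attempts a snapshot, `stop` does not."""
    action = "pause" if take_snapshot else "stop"
    return ["goldsky", "pipeline", action, pipeline_name]


print(" ".join(halt_command("base-logs-pipeline", take_snapshot=True)))
# goldsky pipeline pause base-logs-pipeline
print(" ".join(halt_command("base-logs-pipeline", take_snapshot=False)))
# goldsky pipeline stop base-logs-pipeline
```

Returning a list rather than a single string keeps the pipeline name as its own argument, which avoids quoting issues if the result is passed to a process runner.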

Restarting a pipeline

There are two ways to restart an already deployed pipeline:

1. start command

As in: goldsky pipeline start <name>

Example:

goldsky pipeline start base-logs-pipeline

◇  Successfully started pipeline: base-logs-pipeline

Pipeline started. It's safe to exit now (press Ctrl-C). Or you can keep this terminal open to monitor the pipeline progress, it'll take a moment.

✔ Validating request
✔ Fetching pipeline
✔ Validating pipeline status
✔ Fetching runtime details
──────────────────────────────────────────────────────
│ Timestamp   │ Status     │ Total records received │ Total records written │ Errors │
──────────────────────────────────────────────────────
│ 02:54:44 PM │ STARTING   │                      0 │                     0 │ []     │                                  
──────────────────────────────────────────────────────

This command will open up a monitor for your pipeline after deploying.

2. apply command + ACTIVE status

Just as you can stop a pipeline by changing its status to INACTIVE, you can also restart it by setting it to ACTIVE.

Following up with our previous example, we could restart our stopped pipeline by doing this:

base-logs.yaml
name: base-logs-pipeline
status: ACTIVE

> goldsky pipeline apply base-logs.yaml

◇  Successfully validated config file

◇  Successfully applied config to pipeline: base-logs-pipeline

To monitor the status of your pipeline:

Using the CLI: `goldsky pipeline monitor base-logs-pipeline`
Using the dashboard: https://app.goldsky.com/dashboard/pipelines/stream/base-logs-pipeline/9

Unlike the start command, this method won’t open up the monitor automatically.

Applying updates to a pipeline

As we have seen in the previous sections, Mirror allows you to update the status of a pipeline directly in your configuration files using the apply command. This is really powerful, as all the operations we have explored so far represent a status change, meaning that deploying, pausing/stopping and restarting pipelines can all be done by updating the pipeline status.

The apply command can also be used to change other attributes of the pipeline.

See the following example:

base-logs.yaml
name: base-logs-pipeline
description: a new description for my pipeline
restart: true
use_latest_snapshot: true
save_progress: false

> goldsky pipeline apply base-logs.yaml

◇  Successfully validated config file

◇  Successfully applied config to pipeline: base-logs-pipeline

In this example we are changing the pipeline description, prompting a restart of the pipeline from its latest successful snapshot, and telling Mirror not to take a new snapshot before restarting. This is a common configuration to apply when you have found issues with your pipeline and would like to restart from the last healthy checkpoint.
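If you generate these small update files from a script, a helper along the following lines may be handy. It is a sketch only: render_update is a hypothetical name, it handles just flat string, boolean and integer attributes, and it does no YAML quoting, so values containing special characters would need a real YAML library. The attribute names are the ones used in the examples of this article.

```python
from typing import Union

Value = Union[str, bool, int]
VALID_STATUSES = {"ACTIVE", "INACTIVE", "PAUSED"}  # desired statuses from this article


def render_update(name: str, **attributes: Value) -> str:
    """Render a minimal config for `goldsky pipeline apply` as YAML text."""
    status = attributes.get("status")
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown desired status: {status!r}")
    lines = [f"name: {name}"]
    for key, value in attributes.items():
        # YAML booleans are lowercase, unlike Python's True/False.
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"


print(render_update(
    "base-logs-pipeline",
    restart=True,
    use_latest_snapshot=True,
    save_progress=False,
))
# name: base-logs-pipeline
# restart: true
# use_latest_snapshot: true
# save_progress: false
```

The returned text can be written to a file such as base-logs.yaml and applied with goldsky pipeline apply base-logs.yaml.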

For a more complete reference on the configuration attributes you can apply check this reference.

Deleting a pipeline

Finally, the last operation you might want to perform is deleting your pipelines. Although inactive pipelines don’t consume any resources (and thus incur no billing cost), it’s always nice to keep your project clean and remove pipelines you aren’t going to use any longer. You can delete pipelines with the command goldsky pipeline delete:

> goldsky pipeline delete base-logs-pipeline

✔ Deleted pipeline with name: base-logs-pipeline

In-flight requests

Sometimes you might find that you cannot perform a specific action on your pipeline because an in-flight request is currently being processed. This means that a previous operation on your pipeline hasn’t finished yet and needs to be either processed or discarded before you can apply your new operation. A common scenario is that your pipeline is busy taking a snapshot.

Consider the following example where we recently paused a pipeline (thus triggering a snapshot) and we immediately try to delete it:

> goldsky pipeline delete base-logs-pipeline
✖ Cannot process request, found existing request in-flight.

* To monitor run 'goldsky pipeline monitor base-logs-pipeline --update-request'
* To cancel run 'goldsky pipeline cancel-update base-logs-pipeline'

Let’s look at which request is still being processed:

> goldsky pipeline monitor base-logs-pipeline --update-request

◇  Monitoring update progress

◇  You may cancel the update request by running goldsky pipeline cancel-update base-logs-pipeline

Snapshot creation in progress: ■■■■■■■■■■■■■                            33%

We can see that the snapshot is still taking place. Since we want to delete the pipeline we can go ahead and stop this snapshot creation:

> goldsky pipeline cancel-update base-logs-pipeline

◇  Successfully cancelled the in-flight update request for pipeline base-logs-pipeline

We can now successfully remove the pipeline:

> goldsky pipeline delete base-logs-pipeline

✔ Deleted pipeline with name: base-logs-pipeline

As you saw in this example, Mirror provides you with commands to see the current in-flight requests in your pipeline and decide whether you want to discard them or wait for them to be processed.
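The same flow can be automated. The sketch below is an assumption-heavy illustration rather than an official client: delete_pipeline and FakeCli are hypothetical names, the CLI is injected as a callable so the example runs without Goldsky installed, and the in-flight condition is detected by matching the error text shown above. It also treats any output without that text as success, which a real script should not rely on.

```python
from typing import Callable, List

IN_FLIGHT_MARKER = "found existing request in-flight"  # error text shown above

Runner = Callable[[List[str]], str]  # takes an argv list, returns the CLI output


def delete_pipeline(name: str, run: Runner, cancel_in_flight: bool = False) -> bool:
    """Try to delete; optionally cancel an in-flight request and retry once."""
    output = run(["goldsky", "pipeline", "delete", name])
    if IN_FLIGHT_MARKER not in output:
        return True
    if not cancel_in_flight:
        return False  # caller may prefer to wait for the request to finish
    run(["goldsky", "pipeline", "cancel-update", name])
    retry = run(["goldsky", "pipeline", "delete", name])
    return IN_FLIGHT_MARKER not in retry


class FakeCli:
    """Stand-in for the real CLI that starts with a snapshot in progress."""

    def __init__(self) -> None:
        self.in_flight = True
        self.calls: List[str] = []

    def __call__(self, argv: List[str]) -> str:
        self.calls.append(argv[2])
        if argv[2] == "cancel-update":
            self.in_flight = False
            return "Successfully cancelled the in-flight update request"
        if self.in_flight:
            return "Cannot process request, found existing request in-flight."
        return f"Deleted pipeline with name: {argv[3]}"


cli = FakeCli()
print(delete_pipeline("base-logs-pipeline", cli))                         # False
print(delete_pipeline("base-logs-pipeline", cli, cancel_in_flight=True))  # True
```

Making cancellation opt-in reflects the choice described above: sometimes it is better to let the snapshot finish than to discard it.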

Conclusion

That’s a wrap! In this article we have learnt the multiple states defining the lifecycle of Mirror pipelines and what operations we can perform on them to ensure data integrity and cost optimizations. We have paid special attention to snapshots as they provide the basis for data recovery.

Mirror pipelines provide a robust framework for real-time data processing and integration. Getting familiar with how to operate them will help you realize the full potential of Mirror, empowering your organization to make data-driven decisions with confidence.
