# Custom alerts

> Create custom Slack and email alerts on any pipeline metric using the Goldsky Grafana dashboard

<Note>
  **Enterprise only.** Custom alerting requires the Editor role on your project's Grafana workspace, which is available to enterprise customers. If you would like access, contact [support@goldsky.com](mailto:support@goldsky.com) or reach out to your account manager.
</Note>

## Overview

Every enterprise project gets a dedicated Grafana workspace with the full set of Turbo pipeline metrics (Kafka lag, block lag, checkpoint duration, sink flush latency, and more). With the **Editor** role enabled, you can create your own alert rules on any of these metrics and route notifications to Slack, email, or any other Grafana-supported contact point.

This page walks through the end-to-end setup.

## Prerequisites

Before you start, make sure:

1. **Editor access is enabled for your account.** Contact [support@goldsky.com](mailto:support@goldsky.com) to confirm.
2. **You have a notification destination ready.** For Slack, create an [Incoming Webhook](https://api.slack.com/messaging/webhooks) in the target workspace and copy the webhook URL. For email, gather the recipient addresses you want to notify.

## Open the Grafana workspace

1. Sign in to the [Goldsky dashboard](https://app.goldsky.com/dashboard/pipelines).
2. Navigate to any Turbo pipeline.
3. Open the **Metrics** tab.
4. Click **Advanced metrics** in the top-right corner.

This opens your project's Grafana workspace in a new tab, pre-authenticated and scoped to your project. All of your pipelines' metrics are available through the `goldsky-prometheus` datasource.

## Create a contact point

Contact points tell Grafana where to send notifications when an alert fires. Create them once and reuse them across many alert rules.

### Slack

1. In the Grafana sidebar, go to **Alerting** → **Contact points**.
2. Click **+ Add contact point**.
3. Fill in the form:
   * **Name**: a descriptive label such as `slack-pipeline-alerts`.
   * **Integration**: select **Slack**.
   * **Webhook URL**: paste the Incoming Webhook URL you copied from your Slack workspace.
4. Click **Test** to send a test message. Confirm it arrives in the expected Slack channel.
5. Click **Save contact point**.

### Email

1. Go to **Alerting** → **Contact points** → **+ Add contact point**.
2. Fill in the form:
   * **Name**: e.g. `email-oncall`.
   * **Integration**: select **Email**.
   * **Addresses**: enter one or more recipients **separated by semicolons**, e.g. `ops@example.com;oncall@example.com`.
3. Click **Test**, then **Save contact point**.

### Other supported integrations

The Goldsky workspace runs open-source Grafana, so every built-in Grafana Alerting integration is available when you create a contact point:

<Columns cols={3}>
  <div>
    * Alertmanager
    * AWS SNS
    * Cisco Webex Teams
    * DingDing
    * Discord
    * Email
    * Google Chat
  </div>

  <div>
    * Kafka REST Proxy
    * LINE
    * Microsoft Teams
    * MQTT
    * Opsgenie
    * PagerDuty
    * Pushover
  </div>

  <div>
    * Sensu Go
    * Slack
    * Telegram
    * Threema Gateway
    * VictorOps
    * Webhook
    * WeCom
  </div>
</Columns>

See the [Grafana contact points reference](https://grafana.com/docs/grafana/latest/alerting/configure-notifications/manage-contact-points/) for the full list of fields each integration requires.

<Tip>
  You can create multiple contact points and pick different ones per alert rule — for example, a low-severity Slack channel for warnings and a PagerDuty or email group for critical alerts.
</Tip>

## Create an alert rule

Alert rules define the condition that should trigger a notification.

1. Go to **Alerting** → **Alert rules**.
2. Click **+ New alert rule**.
3. Under **Define query and alert condition**:
   * **Datasource**: select `goldsky-prometheus`.
   * **Query**: write a PromQL expression for the metric you want to alert on. See [common alert queries](#common-alert-queries) below for starting points.
4. Under **Set alert evaluation behavior**:
   * **Evaluate every**: `1m` is a reasonable default.
   * **For**: how long the condition must continuously hold before the alert fires. `5m` is a good starting point to avoid flapping on transient spikes.
5. Under **Configure labels and notifications**:
   * Select the contact point you created in [Create a contact point](#create-a-contact-point).
   * Add labels such as `severity=warning` or `severity=critical` to help with routing and filtering later.
6. Give the rule a descriptive name and click **Save rule and exit**.

The rule will begin evaluating immediately. You can see its current state — `Normal`, `Pending`, or `Firing` — on the **Alert rules** page.

## Common alert queries

The table below lists the five alerts we recommend setting up. The suggested conditions are starting points; tune the thresholds for your workload.

| Alert on                                     | Suggested condition           |
| -------------------------------------------- | ----------------------------- |
| Pipeline falling behind (block lag, seconds) | above `60` for `10m`          |
| Kafka consumer lag growing (messages)        | above your baseline for `15m` |
| Checkpoint failures (count over `10m`)       | above `0` for `5m`            |
| Sink flush latency spike (P95, ms)           | above `2000` for `10m`        |
| Pipeline not producing output (rows/s)       | below `1` for `10m`           |

### Starter PromQL queries

Expand an alert to see the full PromQL expression. Paste it into the **Query** field when you [create the alert rule](#create-an-alert-rule).

<AccordionGroup>
  <Accordion title="Block lag">
    ```promql
    max by (service_instance_id) (streamling_block_lag_max_seconds)
    ```

    Alert when end-to-end block lag exceeds a business-acceptable threshold. `60`–`120` seconds is a common choice for real-time pipelines.
  </Accordion>

  <Accordion title="Kafka consumer lag">
    ```promql
    max by (service_instance_id) (streamling_kafka_consumer_messages_lag)
    ```

    The right threshold depends on each pipeline's steady-state volume. Establish a baseline for each pipeline first, then alert on multiples of it.
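
    One way to express "a multiple of the baseline" is to divide the current lag by its own trailing average and alert on the ratio. This is a sketch, not a recommended setting: the `1d` baseline window is an assumption to tune per pipeline.

    ```promql
    # Current lag per pipeline, divided by its trailing one-day average.
    # A result of 3 means the lag is three times the baseline.
    max by (service_instance_id) (streamling_kafka_consumer_messages_lag)
      /
    max by (service_instance_id) (avg_over_time(streamling_kafka_consumer_messages_lag[1d]))
    ```

    If the baseline is zero, the ratio is infinite or undefined, so this form works best for pipelines with a steady non-zero lag.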
  </Accordion>

  <Accordion title="Checkpoint failures">
    ```promql
    sum by (service_instance_id) (increase(streamling_checkpoint_epochs_failed_total[10m]))
    ```

    Any non-zero value is a critical signal — the pipeline isn't durably saving its position.
  </Accordion>

  <Accordion title="Sink flush latency P95 (ms)">
    ```promql
    histogram_quantile(0.95, sum by (service_instance_id, id, le) (rate(streamling_checkpoint_sink_flush_milliseconds_bucket[5m])))
    ```

    Useful for catching database slowdowns before they cause pipeline lag. The threshold is in milliseconds. Grouping by `id` (the sink's reference name) gives you a separate series per sink, so you can see which one is slow. Drop `id` from the grouping if you prefer a single alert per pipeline.
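
    As a sketch, the per-pipeline variant below removes `id` from the grouping and leaves the rest of the query unchanged. `le` stays in the grouping because `histogram_quantile` needs the bucket boundaries.

    ```promql
    # P95 sink flush latency (ms), aggregated across all sinks of each pipeline.
    histogram_quantile(0.95, sum by (service_instance_id, le) (rate(streamling_checkpoint_sink_flush_milliseconds_bucket[5m])))
    ```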
  </Accordion>

  <Accordion title="Pipeline not producing output">
    ```promql
    sum by (service_instance_id) (rate(streamling_output_rows_total{topology_node_type="sink"}[5m]))
    ```

    Alert when this rate drops below your threshold, which indicates that a pipeline that should be emitting data has gone silent.
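
    If a stalled pipeline stops reporting the metric altogether, the query returns no series rather than a low value. Whether that happens for your pipelines is an assumption worth checking in **Explore**. A common PromQL idiom, sketched here for a single pipeline with the placeholder name `my-pipeline`, falls back to `0` when nothing is returned:

    ```promql
    # Output rate of one pipeline's sinks, or 0 if no matching series exist.
    sum(rate(streamling_output_rows_total{topology_node_type="sink", service_instance_id=~".*-my-pipeline"}[5m]))
      or vector(0)
    ```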
  </Accordion>
</AccordionGroup>

<Info>
  Each pipeline is identified by its `service_instance_id` label, which has the form `{project_id}-{pipeline_name}`. Use Grafana's **Explore** view with the `goldsky-prometheus` datasource to browse all available metrics and labels for your project. Scope any query to a single pipeline with `{service_instance_id=~".*-my-pipeline"}`.
</Info>
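
For example, the block lag query scoped to a single pipeline might look like the sketch below. The pipeline name `my-pipeline` is a placeholder; substitute your own.

```promql
# Block lag for one pipeline only; ".*-" matches the {project_id}- prefix.
max by (service_instance_id) (
  streamling_block_lag_max_seconds{service_instance_id=~".*-my-pipeline"}
)
```

Because the pattern must match the whole label value, it also matches any other pipeline whose name ends in `-my-pipeline`.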

## Recommended starter alerts

Of the five alerts above, three cover most production incidents. Create these first:

* **Checkpoint failures** — critical; catches lost state on every pipeline.
* **Block lag** — catches pipelines falling behind the chain tip.
* **Sink flush latency** — catches database slowdowns before they cascade into lag.

For context on what each metric means, see the [health dashboard guide](/turbo-pipelines/health-dashboard).

## Troubleshooting

<AccordionGroup>
  <Accordion title="I can't see an 'Alerting' section in the Grafana sidebar">
    Alerting requires the Editor role. Contact [support@goldsky.com](mailto:support@goldsky.com) to request Editor access for your project, and reload the Grafana tab once access is confirmed.
  </Accordion>

  <Accordion title="Test notification works, but alerts never fire">
    Check that your alert rule's query returns data in Grafana's **Explore** view — if the query returns no series, the rule will stay in `Normal` forever. Also confirm the threshold direction (above vs below) matches what the metric actually does when the condition you care about occurs.
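
    A quick sanity check in **Explore**, sketched here with the block lag metric as an example, is to count the series the metric currently exposes per pipeline. An empty result means the rule has nothing to evaluate.

    ```promql
    # Number of block lag series currently reported by each pipeline.
    count by (service_instance_id) (streamling_block_lag_max_seconds)
    ```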
  </Accordion>

  <Accordion title="Alerts fire too often on transient spikes">
    Increase the **For** duration so the condition must hold longer before firing. `5m` or `10m` is usually enough to filter out brief spikes while still catching real incidents.
  </Accordion>
</AccordionGroup>
