Overview

Write pipeline output to S3-compatible object storage as Apache Parquet files. Works with AWS S3, Cloudflare R2, MinIO, and any other service exposing the S3 API. Supports optional Hive-style partitioning for downstream analytics engines (Athena, DuckDB, Spark, Trino). Each upstream batch produces one Parquet object per unique partition key combination (or one object total, if no partitioning is configured). Objects are named batch_<uuidv7>.parquet so file names sort lexicographically by creation time.
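The time-ordered naming falls out of the UUIDv7 layout: the top 48 bits are a millisecond Unix timestamp, so hex-encoded IDs (and therefore object keys) sort lexicographically by creation time. A minimal Python sketch of that property — this is illustrative only, not the sink's actual ID generator:

```python
import os

def uuid7_hex(ts_ms: int) -> str:
    """Build a UUIDv7-style string: 48-bit ms timestamp, version 7,
    RFC 4122 variant, remaining 74 bits random."""
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
    value = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0x2 << 62) | rand_b
    h = f"{value:032x}"
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

# A batch written one second later always sorts after the earlier one,
# because the timestamp occupies the most significant hex digits.
earlier = f"batch_{uuid7_hex(1711200000000)}.parquet"
later = f"batch_{uuid7_hex(1711200001000)}.parquet"
assert earlier < later
```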

Configuration

sinks:
  my_s3_sink:
    type: s3_sink
    from: my_transform
    bucket: my-bucket
    secret_name: MY_S3_SECRET
    prefix: my-prefix # Optional
When secret_name is set, access_key_id, secret_access_key, and region come from the secret and do not need to be specified in the pipeline YAML.

Using direct credentials

Hardcoding credentials in pipeline definitions is not recommended for production use. Use secret_name with Goldsky secrets instead.
sinks:
  my_s3_sink:
    type: s3_sink
    from: my_transform
    bucket: my-bucket
    region: us-east-1
    access_key_id: <your-access-key>
    secret_access_key: <your-secret-key>
    prefix: my-prefix # Optional

Parameters

type
string
required
Must be s3_sink.
from
string
required
The transform or source to read data from.
bucket
string
required
Name of the destination S3 bucket.
secret_name
string
Name of the Goldsky secret that supplies accessKeyId, secretAccessKey, and region. See Secret format. Either secret_name or the full set of access_key_id, secret_access_key, and region is required.
access_key_id
string
Access key ID for authentication. Not required if using secret_name.
secret_access_key
string
Secret access key for authentication. Not required if using secret_name.
region
string
AWS region, or auto for S3-compatible services that ignore region (e.g., Cloudflare R2). Not required if using secret_name.
endpoint
string
S3-compatible endpoint URL (e.g., https://s3.amazonaws.com, https://<account-id>.r2.cloudflarestorage.com, http://localhost:9000 for MinIO). Optional — defaults to the AWS S3 endpoint for the configured region.
prefix
string
Key prefix applied to every object written to the bucket. A trailing / is stripped if present. The prefix is combined with any partition segment to form the final object key.
session_token
string
AWS session token for temporary STS credentials (assumed roles, federated access, AWS SSO). Only needed alongside temporary credentials.
partition_columns
string
Comma-separated list of column names to Hive-partition the output by (e.g., dt,chain_id). Column names must be alphanumeric plus underscore, must exist in the input schema, and must not repeat. Omit or leave empty to write unpartitioned output.
allow_http
boolean
Allow plain HTTP endpoints. Defaults to false. Automatically set to true when endpoint starts with http:// (useful for local MinIO). Do not enable against public services.
max_concurrent_partition_uploads
integer
Maximum number of partition uploads issued in parallel per batch when partition_columns is set. Defaults to 16. Ignored for unpartitioned output.

Secret format

When using secret_name, create a Goldsky secret of type s3 with the following JSON structure:
{
  "accessKeyId": "your-access-key-id",
  "secretAccessKey": "your-secret-access-key",
  "region": "us-east-1"
}
Create the secret using the Goldsky CLI:
goldsky secret create MY_S3_SECRET --type s3 --value '{"accessKeyId": "...", "secretAccessKey": "...", "region": "us-east-1"}'
The S3 sink works with any S3-compatible storage service, including AWS S3, MinIO, and Cloudflare R2. Set region: auto in the secret (or on the sink) for services that don’t use AWS regions.
The session_token for temporary credentials is set on the sink itself (or via the STREAMLING__PLUGIN__S3_SINK__SESSION_TOKEN environment variable), not in the secret JSON.

File layout

Objects are written using the following key template:
<prefix>/<partition_segment>/batch_<uuidv7>.parquet
  • <prefix> is the configured prefix (trailing / trimmed); omitted if not set.
  • <partition_segment> is the Hive-style segment (e.g., dt=2025-03-24/chain_id=1); omitted if partition_columns is not set.
  • The filename portion is always batch_<uuidv7>.parquet. UUIDv7 is time-ordered, so listing the bucket returns objects roughly in write order.
  • Null values in partition columns are written as __HIVE_DEFAULT_PARTITION__, matching Hive/Spark conventions. Special characters in partition values are URL-encoded.
Example with prefix: eth/transfers and partition_columns: dt:
eth/transfers/dt=2025-03-24/batch_01930c7f-....parquet
eth/transfers/dt=2025-03-25/batch_01930c80-....parquet
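The key template can be modeled as simple string assembly — prefix with trailing / trimmed, one col=value segment per partition column (nulls as the Hive default partition, other values URL-encoded), then the filename. A hedged sketch; the helper name is hypothetical and this is not the sink's actual code:

```python
from urllib.parse import quote

HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"

def object_key(prefix, partition_cols, row, filename):
    """Assemble <prefix>/<partition_segment>/<filename> per the template."""
    parts = []
    if prefix:
        parts.append(prefix.rstrip("/"))  # trailing / trimmed
    for col in partition_cols:
        val = row.get(col)
        # Nulls follow the Hive/Spark convention; other values are URL-encoded.
        encoded = HIVE_NULL if val is None else quote(str(val), safe="")
        parts.append(f"{col}={encoded}")
    parts.append(filename)
    return "/".join(parts)

key = object_key("eth/transfers/", ["dt", "chain_id"],
                 {"dt": "2025-03-24", "chain_id": 1},
                 "batch_01930c7f.parquet")
assert key == "eth/transfers/dt=2025-03-24/chain_id=1/batch_01930c7f.parquet"
```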

Flush behavior

The S3 sink writes one Parquet file per upstream record batch:
  • Each non-empty record batch delivered to the sink is serialized to Parquet and uploaded immediately.
  • When partition_columns is set, the batch is split by partition key first, producing one file per distinct key. Splits are uploaded in parallel (up to max_concurrent_partition_uploads).
  • Transient S3 errors (throttling, 5xx, connection resets) are retried with exponential backoff (100 ms initial, 30 s cap, up to 10 attempts). Permanent errors (auth, not-found, permission denied) fail the pipeline without retry.
  • There is no time- or size-based rotation inside the sink itself. File size is determined by the upstream batch size and the row distribution across partition keys.
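The documented retry policy (100 ms initial delay, doubling, 30 s cap, up to 10 attempts) yields the delay schedule below. An illustrative sketch only; the real sink may also apply jitter:

```python
def backoff_delays(initial_ms=100, cap_ms=30_000, attempts=10):
    """Yield the retry delay (in ms) before each attempt: exponential
    growth from initial_ms, capped at cap_ms."""
    delay = initial_ms
    for _ in range(attempts):
        yield delay
        delay = min(delay * 2, cap_ms)

delays = list(backoff_delays())
# [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 30000]
```

Total worst-case retry time is therefore a little over 80 seconds before the pipeline gives up on a transient error.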

IAM permissions

The credentials used by the sink need, at minimum, s3:PutObject on the target bucket and key prefix:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/my-prefix/*"
    }
  ]
}

Example

Stream ERC-20 transfers to S3, partitioned by chain and date:
name: erc20-to-s3
resource_size: s

sources:
  transfers:
    type: dataset
    dataset_name: ethereum.erc20_transfers
    version: 1.2.0
    start_at: latest

transforms:
  enriched:
    type: sql
    primary_key: id
    sql: |
      SELECT
        *,
        CAST(block_timestamp AS DATE) AS dt,
        1 AS chain_id
      FROM transfers

sinks:
  s3_archive:
    type: s3_sink
    from: enriched
    bucket: my-bucket
    prefix: eth/erc20_transfers
    partition_columns: dt,chain_id
    secret_name: MY_S3_SECRET

Best practices

A dt partition (date) gives downstream engines the best chance to prune scans. Add a second partition column like chain_id only when query patterns consistently filter on it — excessive partitioning produces many tiny files and hurts read performance.
File names are automatically batch_<uuidv7>.parquet. UUIDv7 encodes the creation timestamp in the high bits, so an aws s3 ls in lexicographic order returns files in roughly the order they were written.
Secrets scope credentials to your project and rotate cleanly without requiring a pipeline redeploy. Plaintext credentials in pipeline YAML are echoed into logs and config history.
Cloudflare R2, MinIO, and many other S3-compatible services don’t use AWS regions. Set region: auto (either on the sink or inside the secret JSON) and set endpoint to the service’s URL.