Overview
Write pipeline output to S3-compatible object storage as Apache Parquet files. Works with AWS S3, Cloudflare R2, MinIO, and any other service exposing the S3 API. Supports optional Hive-style partitioning for downstream analytics engines (Athena, DuckDB, Spark, Trino). Each upstream batch produces one Parquet object per unique partition key combination (or one object total, if no partitioning is configured). Objects are named `batch_<uuidv7>.parquet` so file names sort lexicographically by creation time.
Configuration
Using Goldsky secrets (recommended)
When `secret_name` is set, `access_key_id`, `secret_access_key`, and `region` come from the secret and do not need to be specified in the pipeline YAML.
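A sketch of a sink entry using a secret. The surrounding `sinks:` block shape and the sink, transform, bucket, and secret names are illustrative assumptions; the parameter names match the reference below:

```yaml
sinks:
  transfers_s3:
    type: s3_sink
    from: my_transform
    bucket: my-data-bucket
    secret_name: MY_S3_SECRET
    prefix: eth/transfers
    partition_columns: dt,chain_id
```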
Using direct credentials
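The same sink sketched with inline credentials (placeholder values; the surrounding YAML shape is an assumption). Prefer secrets in real pipelines:

```yaml
sinks:
  transfers_s3:
    type: s3_sink
    from: my_transform
    bucket: my-data-bucket
    access_key_id: <ACCESS_KEY_ID>
    secret_access_key: <SECRET_ACCESS_KEY>
    region: us-east-1
```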
Parameters

- `type`: Must be `s3_sink`.
- `from`: The transform or source to read data from.
- `bucket`: Name of the destination S3 bucket.
- `secret_name`: Name of the Goldsky secret that supplies `accessKeyId`, `secretAccessKey`, and `region`. See Secret format. Either `secret_name` or the full set of `access_key_id`, `secret_access_key`, and `region` is required.
- `access_key_id`: Access key ID for authentication. Not required if using `secret_name`.
- `secret_access_key`: Secret access key for authentication. Not required if using `secret_name`.
- `region`: AWS region, or `auto` for S3-compatible services that ignore region (e.g., Cloudflare R2). Not required if using `secret_name`.
- `endpoint`: S3-compatible endpoint URL (e.g., `https://s3.amazonaws.com`, `https://<account-id>.r2.cloudflarestorage.com`, `http://localhost:9000` for MinIO). Optional; defaults to the AWS S3 endpoint for the configured region.
- `prefix`: Key prefix applied to every object written to the bucket. A trailing `/` is stripped if present. Combined with any partition segment to form the final key.
- `session_token`: AWS session token for temporary STS credentials (assumed roles, federated access, AWS SSO). Only needed alongside temporary credentials.
- `partition_columns`: Comma-separated list of column names to Hive-partition the output by (e.g., `dt,chain_id`). Column names must be alphanumeric plus underscore, must exist in the input schema, and must not repeat. Omit or leave empty to write unpartitioned output.
- `allow_http`: Allow plain HTTP endpoints. Defaults to `false`. Automatically set to `true` when `endpoint` starts with `http://` (useful for local MinIO). Do not enable against public services.
- `max_concurrent_partition_uploads`: Maximum number of partition uploads issued in parallel per batch when `partition_columns` is set. Defaults to `16`. Ignored for unpartitioned output.

Secret format
When using `secret_name`, create a Goldsky secret of type `s3` with the following JSON structure:
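A sketch of the secret JSON. The field names come from the parameter descriptions above; the values are placeholders:

```json
{
  "accessKeyId": "<ACCESS_KEY_ID>",
  "secretAccessKey": "<SECRET_ACCESS_KEY>",
  "region": "us-east-1"
}
```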
The `session_token` for temporary credentials is set on the sink itself (or via the `STREAMLING__PLUGIN__S3_SINK__SESSION_TOKEN` environment variable), not in the secret JSON.

File layout
Objects are written using the key template `<prefix>/<partition_segment>/batch_<uuidv7>.parquet`:

- `<prefix>` is the configured `prefix` (trailing `/` trimmed); omitted if not set.
- `<partition_segment>` is the Hive-style segment (e.g., `dt=2025-03-24/chain_id=1`); omitted if `partition_columns` is not set.
- The filename portion is always `batch_<uuidv7>.parquet`. UUIDv7 is time-ordered, so listing the bucket returns objects roughly in write order.
- Null values in partition columns are written as `__HIVE_DEFAULT_PARTITION__`, matching Hive/Spark conventions. Special characters in partition values are URL-encoded.
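The null and URL-encoding rules above can be sketched in Python. `partition_segment` is an illustrative helper, not part of the sink:

```python
from urllib.parse import quote

def partition_segment(columns, values):
    """Build a Hive-style partition segment like 'dt=2025-03-24/chain_id=1'."""
    parts = []
    for col, val in zip(columns, values):
        if val is None:
            # Hive/Spark convention for null partition values
            encoded = "__HIVE_DEFAULT_PARTITION__"
        else:
            # URL-encode special characters in the value
            encoded = quote(str(val), safe="")
        parts.append(f"{col}={encoded}")
    return "/".join(parts)

partition_segment(["dt", "chain_id"], ["2025-03-24", 1])
# → 'dt=2025-03-24/chain_id=1'
```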
For example, with `prefix: eth/transfers` and `partition_columns: dt`, a full object key looks like `eth/transfers/dt=2025-03-24/batch_<uuidv7>.parquet`.
Flush behavior
The S3 sink writes one Parquet file per upstream record batch:

- Each non-empty record batch delivered to the sink is serialized to Parquet and uploaded immediately.
- When `partition_columns` is set, the batch is split by partition key first, producing one file per distinct key. Splits are uploaded in parallel (up to `max_concurrent_partition_uploads`).
- Transient S3 errors (throttling, 5xx, connection resets) are retried with exponential backoff (100 ms initial, 30 s cap, up to 10 attempts). Permanent errors (auth, not-found, permission denied) fail the pipeline without retry.
- There is no time- or size-based rotation inside the sink itself. File size is determined by the upstream batch size and the row distribution across partition keys.
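The retry schedule (100 ms initial, doubling, 30 s cap, 10 attempts) works out as in this sketch; whether the sink also applies jitter is not specified here:

```python
def backoff_delays(initial=0.1, cap=30.0, attempts=10):
    """Yield the exponential backoff delay (seconds) before each retry."""
    delay = initial
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= 2

list(backoff_delays())
# → [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 30.0]
```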
IAM permissions
The credentials used by the sink need, at minimum, `s3:PutObject` on the target bucket and key prefix:
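A minimal policy sketch granting only that action; substitute your bucket name and prefix:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<bucket>/<prefix>/*"
    }
  ]
}
```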
Example
Stream ERC-20 transfers to S3, partitioned by chain and date (`partition_columns: dt,chain_id`).

Best practices
Partition by date for most workloads
A
dt partition (date) gives downstream engines the best chance to prune
scans. Add a second partition column like chain_id only when query
patterns consistently filter on it — excessive partitioning produces many
tiny files and hurts read performance.
Use UUIDv7 filenames for ordered listing
File names are automatically
batch_<uuidv7>.parquet. UUIDv7 encodes the
creation timestamp in the high bits, so an aws s3 ls in lexicographic
order returns files in roughly the order they were written.
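As a sketch of why that works: per RFC 9562, the first 48 bits of a UUIDv7 are milliseconds since the Unix epoch, so a file's creation time can be recovered from its name. `uuid7_timestamp` is an illustrative helper, not part of the sink:

```python
import datetime
import uuid

def uuid7_timestamp(u: uuid.UUID) -> datetime.datetime:
    """Recover the creation time from a UUIDv7 (top 48 bits = Unix ms)."""
    ms = u.int >> 80  # shift off the low 80 bits, keeping the timestamp
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)

# Example UUIDv7 from RFC 9562
uuid7_timestamp(uuid.UUID("017f22e2-79b0-7cc3-98c4-dc0c0c07398f"))
# → datetime.datetime(2022, 2, 22, 19, 22, 22, tzinfo=datetime.timezone.utc)
```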
Use secrets, not plaintext credentials
Secrets scope credentials to your project and rotate cleanly without
requiring a pipeline redeploy. Plaintext credentials in pipeline YAML are
echoed into logs and config history.
For S3-compatible services, set region: auto
Cloudflare R2, MinIO, and many other S3-compatible services don’t use AWS
regions. Set
region: auto (either on the sink or inside the secret JSON)
and set endpoint to the service’s URL.