# S3

> Write data to S3-compatible object storage as Parquet files

## Overview

Write pipeline output to S3-compatible object storage as Apache Parquet files. Works with AWS S3, Cloudflare R2, MinIO, and any other service exposing the S3 API. Supports optional Hive-style partitioning for downstream analytics engines (Athena, DuckDB, Spark, Trino).

Each upstream batch produces one Parquet object per unique partition key combination (or one object total, if no partitioning is configured). Objects are named `batch_<uuidv7>.parquet` so file names sort lexicographically by creation time.

## Configuration

### Using Goldsky secrets (recommended)

```yaml
sinks:
  my_s3_sink:
    type: s3_sink
    from: my_transform
    bucket: my-bucket
    secret_name: MY_S3_SECRET
    prefix: my-prefix # Optional
```

When `secret_name` is set, `access_key_id`, `secret_access_key`, and `region` come from the secret and do not need to be specified in the pipeline YAML.

### Using direct credentials

Hardcoding credentials in pipeline definitions is not recommended for production use. Use `secret_name` with Goldsky secrets instead.

```yaml
sinks:
  my_s3_sink:
    type: s3_sink
    from: my_transform
    bucket: my-bucket
    region: us-east-1
    access_key_id: <your-access-key-id>
    secret_access_key: <your-secret-access-key>
    prefix: my-prefix # Optional
```

## Parameters

* `type`: Must be `s3_sink`.
* `from`: The transform or source to read data from.
* `bucket`: Name of the destination S3 bucket.
* `secret_name`: Name of the Goldsky secret that supplies `accessKeyId`, `secretAccessKey`, and `region`. See [Secret format](#secret-format). Either `secret_name` or the full set of `access_key_id`, `secret_access_key`, and `region` is required.
* `access_key_id`: Access key ID for authentication. Not required if using `secret_name`.
* `secret_access_key`: Secret access key for authentication. Not required if using `secret_name`.
* `region`: AWS region, or `auto` for S3-compatible services that ignore region (e.g., Cloudflare R2). Not required if using `secret_name`.
* `endpoint`: S3-compatible endpoint URL (e.g., `https://s3.amazonaws.com`, `https://<account-id>.r2.cloudflarestorage.com`, `http://localhost:9000` for MinIO). Optional; defaults to the AWS S3 endpoint for the configured region.
* `prefix`: Key prefix applied to every object written to the bucket. A trailing `/` is stripped if present. Combined with any partition segment to form the final key.
* `session_token`: AWS session token for temporary STS credentials (assumed roles, federated access, AWS SSO). Only needed alongside temporary credentials.
* `partition_columns`: Comma-separated list of column names to Hive-partition the output by (e.g., `dt,chain_id`). Column names must be alphanumeric plus underscore, must exist in the input schema, and must not repeat. Omit or leave empty to write unpartitioned output.
* Plain HTTP flag: Allow plain HTTP endpoints. Defaults to `false`. Automatically set to `true` when `endpoint` starts with `http://` (useful for local MinIO). Do not enable against public services.
* `max_concurrent_partition_uploads`: Maximum number of partition uploads issued in parallel per batch when `partition_columns` is set. Defaults to `16`. Ignored for unpartitioned output.

## Secret format

When using `secret_name`, create a Goldsky secret of type `s3` with the following JSON structure:

```json
{
  "accessKeyId": "your-access-key-id",
  "secretAccessKey": "your-secret-access-key",
  "region": "us-east-1"
}
```

Create the secret using the Goldsky CLI:

```bash
goldsky secret create MY_S3_SECRET --type s3 --value '{"accessKeyId": "...", "secretAccessKey": "...", "region": "us-east-1"}'
```

The S3 sink works with any S3-compatible storage service, including AWS S3, MinIO, and Cloudflare R2. Set `region: auto` in the secret (or on the sink) for services that don't use AWS regions.

The `session_token` for temporary credentials is set on the sink itself (or via the `STREAMLING__PLUGIN__S3_SINK__SESSION_TOKEN` environment variable), not in the secret JSON.

## File layout

Objects are written using the following key template:

```
<prefix>/<partition>/batch_<uuidv7>.parquet
```

* `<prefix>` is the configured `prefix` (trailing `/` trimmed); omitted if not set.
* `<partition>` is the Hive-style segment (e.g., `dt=2025-03-24/chain_id=1`); omitted if `partition_columns` is not set.
* The filename portion is always `batch_<uuidv7>.parquet`. UUIDv7 is time-ordered, so listing the bucket returns objects roughly in write order.
* Null values in partition columns are written as `__HIVE_DEFAULT_PARTITION__`, matching Hive/Spark conventions. Special characters in partition values are URL-encoded.

Example with `prefix: eth/transfers` and `partition_columns: dt`:

```
eth/transfers/dt=2025-03-24/batch_01930c7f-....parquet
eth/transfers/dt=2025-03-25/batch_01930c80-....parquet
```

## Flush behavior

The S3 sink writes one Parquet file per upstream record batch:

* Each non-empty record batch delivered to the sink is serialized to Parquet and uploaded immediately.
* When `partition_columns` is set, the batch is split by partition key first, producing one file per distinct key. Splits are uploaded in parallel (up to `max_concurrent_partition_uploads`).
* Transient S3 errors (throttling, 5xx, connection resets) are retried with exponential backoff (100 ms initial, 30 s cap, up to 10 attempts). Permanent errors (auth, not-found, permission denied) fail the pipeline without retry.
* There is no time- or size-based rotation inside the sink itself. File size is determined by the upstream batch size and the row distribution across partition keys.

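
To make the partitioned flush path concrete, here is a minimal sketch of a sink that splits each batch by `dt` and `chain_id` and lowers the upload parallelism. The value `8` is only illustrative (the documented default is `16`), and the sink, bucket, prefix, and secret names are placeholders reused from the examples on this page.

```yaml
sinks:
  my_s3_sink:
    type: s3_sink
    from: my_transform
    bucket: my-bucket
    secret_name: MY_S3_SECRET
    prefix: eth/transfers
    # Each batch is split by (dt, chain_id); one Parquet file is written per distinct pair.
    partition_columns: dt,chain_id
    # Illustrative value, not a recommendation: cap parallel partition uploads per batch.
    max_concurrent_partition_uploads: 8
```

Under this configuration, a batch holding rows for two dates on a single chain would yield two objects, and no more than eight uploads would be in flight at a time for any one batch.
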
## IAM permissions

The credentials used by the sink need, at minimum, `s3:PutObject` on the target bucket and key prefix:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/my-prefix/*"
    }
  ]
}
```

## Example

Stream ERC-20 transfers to S3, partitioned by date and chain:

```yaml
name: erc20-to-s3
resource_size: s

sources:
  transfers:
    type: dataset
    dataset_name: ethereum.erc20_transfers
    version: 1.2.0
    start_at: latest

transforms:
  enriched:
    type: sql
    primary_key: id
    sql: |
      SELECT
        *,
        CAST(block_timestamp AS DATE) AS dt,
        1 AS chain_id
      FROM transfers

sinks:
  s3_archive:
    type: s3_sink
    from: enriched
    bucket: my-bucket
    prefix: eth/erc20_transfers
    partition_columns: dt,chain_id
    secret_name: MY_S3_SECRET
```

## Best practices

* A `dt` partition (date) gives downstream engines the best chance to prune scans. Add a second partition column like `chain_id` only when query patterns consistently filter on it; excessive partitioning produces many tiny files and hurts read performance.
* File names are automatically `batch_<uuidv7>.parquet`. UUIDv7 encodes the creation timestamp in the high bits, so an `aws s3 ls` in lexicographic order returns files in roughly the order they were written.
* Secrets scope credentials to your project and rotate cleanly without requiring a pipeline redeploy. Plaintext credentials in pipeline YAML are echoed into logs and config history.
* Cloudflare R2, MinIO, and many other S3-compatible services don't use AWS regions. Set `region: auto` (either on the sink or inside the secret JSON) and set `endpoint` to the service's URL.
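
As a rough sketch of the last point, a sink aimed at a local MinIO instance might look like the following. The secret name `MY_MINIO_SECRET` is hypothetical, and its JSON is assumed to carry `"region": "auto"`; the endpoint is the MinIO URL given as an example in the Parameters section.

```yaml
sinks:
  my_minio_sink:
    type: s3_sink
    from: my_transform
    bucket: my-bucket
    # Hypothetical secret; assumed to contain "region": "auto" alongside the access keys.
    secret_name: MY_MINIO_SECRET
    # Local MinIO endpoint taken from the Parameters section.
    endpoint: http://localhost:9000
    prefix: my-prefix # Optional
```

Because the endpoint starts with `http://`, plain HTTP is allowed automatically, as described under Parameters. For a hosted service such as Cloudflare R2, the same shape should apply with the service's HTTPS URL in `endpoint`.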