# S3

> Write data to S3-compatible object storage as Parquet files

## Overview

Write pipeline output to S3-compatible object storage as Apache Parquet files. Works with AWS S3, Cloudflare R2, MinIO, and any other service exposing the S3 API. Supports optional Hive-style partitioning for downstream analytics engines (Athena, DuckDB, Spark, Trino).

Each upstream batch produces one Parquet object per unique partition key combination (or one object total, if no partitioning is configured). Objects are named `batch_<uuidv7>.parquet` so file names sort lexicographically by creation time.
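As a rough illustration of the object count described above, the sketch below groups an in-memory batch by its partition key and reports how many objects that batch would produce. The `split_by_partition` helper and the list-of-dicts row representation are assumptions made for illustration; they are not part of the sink.

```python
from collections import defaultdict


def split_by_partition(rows, partition_columns):
    """Group rows by their partition key tuple (hypothetical helper).

    With no partition columns, every row falls under the single key ().
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_columns)
        groups[key].append(row)
    return dict(groups)


batch = [
    {"id": "a", "dt": "2025-03-24", "chain_id": 1},
    {"id": "b", "dt": "2025-03-24", "chain_id": 1},
    {"id": "c", "dt": "2025-03-25", "chain_id": 1},
]

partitioned = split_by_partition(batch, ["dt", "chain_id"])
unpartitioned = split_by_partition(batch, [])

print(len(partitioned))    # one object per distinct (dt, chain_id) pair
print(len(unpartitioned))  # a single object for the whole batch
```

Here the three rows span two distinct `(dt, chain_id)` pairs, so a partitioned sink would write two objects for this batch and an unpartitioned sink would write one.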

## Configuration

### Using Goldsky secrets (recommended)

```yaml theme={null}
sinks:
  my_s3_sink:
    type: s3_sink
    from: my_transform
    bucket: my-bucket
    secret_name: MY_S3_SECRET
    prefix: my-prefix # Optional
```

When `secret_name` is set, `access_key_id`, `secret_access_key`, and `region` come from the secret and do not need to be specified in the pipeline YAML.

### Using direct credentials

<Warning>
  Hardcoding credentials in pipeline definitions is not recommended for production use. Use `secret_name` with Goldsky secrets instead.
</Warning>

```yaml theme={null}
sinks:
  my_s3_sink:
    type: s3_sink
    from: my_transform
    bucket: my-bucket
    region: us-east-1
    access_key_id: <your-access-key>
    secret_access_key: <your-secret-key>
    prefix: my-prefix # Optional
```

## Parameters

<ParamField path="type" type="string" required>
  Must be `s3_sink`.
</ParamField>

<ParamField path="from" type="string" required>
  The transform or source to read data from.
</ParamField>

<ParamField path="bucket" type="string" required>
  Name of the destination S3 bucket.
</ParamField>

<ParamField path="secret_name" type="string">
  Name of the Goldsky secret that supplies `accessKeyId`, `secretAccessKey`, and `region`. See [Secret format](#secret-format). Either `secret_name` or the full set of `access_key_id`, `secret_access_key`, and `region` is required.
</ParamField>

<ParamField path="access_key_id" type="string">
  Access key ID for authentication. Not required if using `secret_name`.
</ParamField>

<ParamField path="secret_access_key" type="string">
  Secret access key for authentication. Not required if using `secret_name`.
</ParamField>

<ParamField path="region" type="string">
  AWS region, or `auto` for S3-compatible services that ignore region (e.g., Cloudflare R2). Not required if using `secret_name`.
</ParamField>

<ParamField path="endpoint" type="string">
  S3-compatible endpoint URL (e.g., `https://s3.amazonaws.com`, `https://<account-id>.r2.cloudflarestorage.com`, `http://localhost:9000` for MinIO). Optional — defaults to the AWS S3 endpoint for the configured region.
</ParamField>

<ParamField path="prefix" type="string">
  Key prefix applied to every object written to the bucket. A trailing `/` is stripped if present. Combined with any partition segment to form the final key.
</ParamField>

<ParamField path="session_token" type="string">
  AWS session token for temporary STS credentials (assumed roles, federated access, AWS SSO). Only needed alongside temporary credentials.
</ParamField>

<ParamField path="partition_columns" type="string">
  Comma-separated list of column names to Hive-partition the output by (e.g., `dt,chain_id`). Each name may contain only letters, digits, and underscores, must exist in the input schema, and must not appear more than once. Omit or leave empty to write unpartitioned output.
</ParamField>

<ParamField path="allow_http" type="boolean">
  Allow plain HTTP endpoints. Defaults to `false`. Automatically set to `true` when `endpoint` starts with `http://` (useful for local MinIO). Do not enable against public services.
</ParamField>

<ParamField path="max_concurrent_partition_uploads" type="integer">
  Maximum number of partition uploads issued in parallel per batch when `partition_columns` is set. Defaults to `16`. Ignored for unpartitioned output.
</ParamField>
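The rules for `partition_columns` can be checked locally before deploying a pipeline. The following is a minimal sketch, not the sink's own validation: the `parse_partition_columns` helper is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits and that whitespace around names is not significant.

```python
import re

# Assumption: "alphanumeric plus underscore" means ASCII letters, digits, "_".
_COLUMN_NAME = re.compile(r"[A-Za-z0-9_]+")


def parse_partition_columns(spec, schema_columns):
    """Validate a comma-separated partition_columns value (hypothetical helper)."""
    if not spec or not spec.strip():
        return []  # omitted or empty means unpartitioned output
    # Assumption: surrounding whitespace is ignored.
    names = [name.strip() for name in spec.split(",")]
    seen = set()
    for name in names:
        if not _COLUMN_NAME.fullmatch(name):
            raise ValueError(f"invalid column name: {name!r}")
        if name not in schema_columns:
            raise ValueError(f"column not in input schema: {name!r}")
        if name in seen:
            raise ValueError(f"duplicate partition column: {name!r}")
        seen.add(name)
    return names


print(parse_partition_columns("dt,chain_id", {"id", "dt", "chain_id"}))
```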

## Secret format

When using `secret_name`, create a Goldsky secret of type `s3` with the following JSON structure:

```json theme={null}
{
  "accessKeyId": "your-access-key-id",
  "secretAccessKey": "your-secret-access-key",
  "region": "us-east-1"
}
```

Create the secret using the Goldsky CLI:

```bash theme={null}
goldsky secret create MY_S3_SECRET --type s3 --value '{"accessKeyId": "...", "secretAccessKey": "...", "region": "us-east-1"}'
```
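Hand-quoting JSON inside a shell command is easy to get wrong. One option is to generate the command line programmatically; the sketch below does this with Python's standard `json` and `shlex` modules. The `secret_create_command` helper is hypothetical, and the command it emits is the same `goldsky secret create` invocation shown above.

```python
import json
import shlex


def secret_create_command(name, access_key_id, secret_access_key, region):
    """Return a shell-safe command line for creating an s3-type secret."""
    payload = json.dumps({
        "accessKeyId": access_key_id,
        "secretAccessKey": secret_access_key,
        "region": region,
    })
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(
        ["goldsky", "secret", "create", name, "--type", "s3", "--value", payload]
    )


print(secret_create_command("MY_S3_SECRET", "...", "...", "us-east-1"))
```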

<Tip>
  The S3 sink works with any S3-compatible storage service, including AWS S3,
  MinIO, and Cloudflare R2. Set `region: auto` in the secret (or on the sink)
  for services that don't use AWS regions.
</Tip>

<Note>
  The `session_token` for temporary credentials is set on the sink itself (or via the `STREAMLING__PLUGIN__S3_SINK__SESSION_TOKEN` environment variable), not in the secret JSON.
</Note>

## File layout

Objects are written using the following key template:

```
<prefix>/<partition_segment>/batch_<uuidv7>.parquet
```

* `<prefix>` is the configured `prefix` (trailing `/` trimmed); omitted if not set.
* `<partition_segment>` is the Hive-style segment (e.g., `dt=2025-03-24/chain_id=1`); omitted if `partition_columns` is not set.
* The filename portion is always `batch_<uuidv7>.parquet`. UUIDv7 is time-ordered, so listing the keys under a single prefix or partition returns objects roughly in write order.
* Null values in partition columns are written as `__HIVE_DEFAULT_PARTITION__`, matching Hive/Spark conventions. Special characters in partition values are URL-encoded.

Example with `prefix: eth/transfers` and `partition_columns: dt`:

```
eth/transfers/dt=2025-03-24/batch_01930c7f-....parquet
eth/transfers/dt=2025-03-25/batch_01930c80-....parquet
```
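The key template can be reproduced with a few lines of Python, which is handy when predicting where objects will land. This is a sketch under stated assumptions rather than the sink's implementation: `partition_segment` and `object_key` are hypothetical helpers, the caller supplies the UUIDv7 string, and the exact set of characters the sink URL-encodes is assumed to be everything outside the unreserved set.

```python
from urllib.parse import quote

HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"


def partition_segment(partition_columns, row):
    """Build the Hive-style segment, e.g. dt=2025-03-24/chain_id=1."""
    parts = []
    for col in partition_columns:
        value = row.get(col)
        # Assumption: percent-encode everything except unreserved characters.
        encoded = HIVE_NULL if value is None else quote(str(value), safe="")
        parts.append(f"{col}={encoded}")
    return "/".join(parts)


def object_key(prefix, partition_columns, row, batch_id):
    """Join prefix, partition segment, and file name, skipping empty parts."""
    segments = [
        (prefix or "").removesuffix("/"),  # a trailing "/" is trimmed
        partition_segment(partition_columns, row),
        f"batch_{batch_id}.parquet",
    ]
    return "/".join(s for s in segments if s)


print(object_key("eth/transfers/", ["dt"], {"dt": "2025-03-24"}, "<uuidv7>"))
print(object_key("", ["dt", "chain_id"], {"dt": None, "chain_id": 1}, "<uuidv7>"))
```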

## Flush behavior

The S3 sink flushes on every upstream record batch; it does not buffer rows across batches:

* Each non-empty record batch delivered to the sink is serialized to Parquet and uploaded immediately.
* When `partition_columns` is set, the batch is split by partition key first, producing one file per distinct key. Splits are uploaded in parallel (up to `max_concurrent_partition_uploads`).
* Transient S3 errors (throttling, 5xx, connection resets) are retried with exponential backoff (100 ms initial, 30 s cap, up to 10 attempts). Permanent errors (auth, not-found, permission denied) fail the pipeline without retry.
* There is no time- or size-based rotation inside the sink itself. File size is determined by the upstream batch size and the row distribution across partition keys.
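To get a feel for how long a batch can stay in retry, the sketch below computes a backoff schedule from the stated parameters (100 ms initial, 30 s cap, up to 10 attempts). The growth factor and jitter are not documented here, so the code assumes the delay doubles after each failure, applies no jitter, and counts the first try as one of the 10 attempts. The `backoff_delays` helper is hypothetical.

```python
def backoff_delays(initial=0.1, cap=30.0, attempts=10, factor=2.0):
    """Delays in seconds before each retry (hypothetical schedule).

    Assumptions: the delay doubles after every failure, there is no jitter,
    and `attempts` includes the first try, leaving attempts - 1 retries.
    """
    return [min(cap, initial * factor ** i) for i in range(attempts - 1)]


delays = backoff_delays()
print(delays)       # one delay per retry
print(sum(delays))  # total time spent waiting before the pipeline fails
```

Under these assumptions the delays never reach the 30 s cap within 10 attempts, and a batch that keeps hitting transient errors waits roughly 51 seconds in total before the pipeline fails.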

## IAM permissions

The credentials used by the sink need, at minimum, `s3:PutObject` on the target bucket and key prefix:

```json theme={null}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/my-prefix/*"
    }
  ]
}
```
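When several sinks write to different buckets or prefixes, the policy can be generated instead of copied by hand. The following sketch builds the same document as above for a given bucket and prefix; the `put_object_policy` helper is hypothetical, and it assumes that an empty prefix should grant `s3:PutObject` on the whole bucket.

```python
import json


def put_object_policy(bucket, prefix=""):
    """Return a minimal IAM policy allowing s3:PutObject under a prefix."""
    prefix = prefix.removesuffix("/")
    # Assumption: with no prefix, the sink may write anywhere in the bucket.
    if prefix:
        resource = f"arn:aws:s3:::{bucket}/{prefix}/*"
    else:
        resource = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:PutObject"], "Resource": resource}
        ],
    }


print(json.dumps(put_object_policy("my-bucket", "my-prefix"), indent=2))
```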

## Example

Stream ERC-20 transfers to S3, partitioned by date and chain:

```yaml theme={null}
name: erc20-to-s3
resource_size: s

sources:
  transfers:
    type: dataset
    dataset_name: ethereum.erc20_transfers
    version: 1.2.0
    start_at: latest

transforms:
  enriched:
    type: sql
    primary_key: id
    sql: |
      SELECT
        *,
        CAST(block_timestamp AS DATE) AS dt,
        1 AS chain_id
      FROM transfers

sinks:
  s3_archive:
    type: s3_sink
    from: enriched
    bucket: my-bucket
    prefix: eth/erc20_transfers
    partition_columns: dt,chain_id
    secret_name: MY_S3_SECRET
```

## Best practices

<AccordionGroup>
  <Accordion title="Partition by date for most workloads">
    A `dt` (date) partition lets downstream engines prune scans to the dates a
    query filters on. Add a second partition column like `chain_id` only when
    query patterns consistently filter on it. Excessive partitioning produces
    many tiny files and hurts read performance.
  </Accordion>

  <Accordion title="Use UUIDv7 filenames for ordered listing">
    File names are always `batch_<uuidv7>.parquet`; no configuration is needed.
    UUIDv7 encodes the creation timestamp in its most significant bits, so a
    lexicographically sorted listing (for example, `aws s3 ls` on a partition
    prefix) returns files in roughly the order they were written.
  </Accordion>

  <Accordion title="Use secrets, not plaintext credentials">
    Secrets scope credentials to your project and rotate cleanly without
    requiring a pipeline redeploy. Plaintext credentials in pipeline YAML are
    echoed into logs and config history.
  </Accordion>

  <Accordion title="For S3-compatible services, set region: auto">
    Cloudflare R2, MinIO, and many other S3-compatible services don't use AWS
    regions. Set `region: auto` (either on the sink or inside the secret JSON)
    and set `endpoint` to the service's URL.
  </Accordion>
</AccordionGroup>
