Overview

Publish processed data back to Kafka topics with Avro or JSON serialization.

Configuration

sinks:
  my_kafka_sink:
    type: kafka
    from: my_transform
    topic: my_output_topic
    topic_partitions: 10
    data_format: avro

Parameters

  • type (string, required): Must be kafka
  • from (string, required): The transform or source to read data from
  • topic (string, required): Kafka topic name to publish to
  • topic_partitions (number): Number of partitions for the topic (created if the topic doesn't exist)
  • data_format (string, default: avro): Serialization format. Supported values: avro or json.
  • schema_registry_url (string): URL of the Schema Registry. Required for avro format; optional for json.
  • parallelism (number, default: 1): Number of parallel Kafka producers for increased throughput. Messages are routed to producers by key hash to preserve per-key ordering.
  • batch_size (number, default: 10000): Number of messages to batch before sending (maps to Kafka's batch.num.messages).
  • batch_flush_interval (number, default: 100): Maximum time in milliseconds to wait before flushing a batch (maps to Kafka's linger.ms).
  • message_max_bytes (number, default: 1000000): Maximum size in bytes for a Kafka request (maps to Kafka's message.max.bytes).

Features

  • Multiple Formats: Choose between Avro (binary) or JSON serialization
  • Auto Schema Registration: Schemas are automatically registered with Schema Registry (Avro only)
  • Avro Encoding: Efficient binary serialization with schema evolution support
  • JSON Encoding: Human-readable format without Schema Registry dependency
  • Operation Headers: _gs_op column is included as a message header (dbz.op)

Partitioning

Goldsky uses Kafka’s default partitioning strategy based on message key hashes. The message key is constructed from the primary key column(s) of your data. Key behavior:
  • Key format: Primary key values joined with _ (e.g., enriched_transaction_v2_0x6a7b...789d_1)
  • Partitioner: Kafka’s DefaultPartitioner (murmur2 hash)
  • Partition assignment: murmur2(keyBytes) % numPartitions
Implications for increasing partitions:
  • Records with the same key always go to the same partition, ensuring ordering per key
  • Increasing partitions will cause key redistribution — existing keys may map to different partitions
  • Global ordering is not guaranteed; only per-key ordering is maintained
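The assignment above can be sketched in a few lines. This is a port of the murmur2 hash used by Kafka's Java client plus the positive-mask modulo step; the key construction mirrors the "_"-joined primary key format described above. The function names are illustrative, not part of Goldsky's actual implementation.

```python
def murmur2(data: bytes) -> int:
    """32-bit murmur2 hash, as implemented in the Apache Kafka Java client."""
    length = len(data)
    seed = 0x9747B28C
    m = 0x5BD1E995
    r = 24
    mask = 0xFFFFFFFF  # emulate 32-bit integer overflow

    h = (seed ^ length) & mask
    # Process the input four bytes at a time (little-endian).
    for i in range(0, length - (length % 4), 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    # Handle the remaining 1-3 bytes (fall-through, as in the Java switch).
    tail = length & ~3
    rem = length % 4
    if rem == 3:
        h ^= data[tail + 2] << 16
    if rem >= 2:
        h ^= data[tail + 1] << 8
    if rem >= 1:
        h ^= data[tail]
        h = (h * m) & mask
    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h

def partition_for(primary_key_values: list, num_partitions: int) -> int:
    """Join primary key values with '_' and apply murmur2(key) % numPartitions."""
    key = "_".join(str(v) for v in primary_key_values).encode("utf-8")
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions  # mask forces a positive hash
```

Running `partition_for(["enriched_transaction_v2", "0xabc", 1], 12)` with a different partition count makes the redistribution point concrete: the same key can land on a different partition once `num_partitions` changes.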

Examples

Avro format (with Schema Registry)

sinks:
  kafka_output:
    type: kafka
    from: enriched_events
    topic: processed.events
    topic_partitions: 10
    data_format: avro
    schema_registry_url: http://schema-registry:8081

JSON format (no Schema Registry required)

sinks:
  kafka_output:
    type: kafka
    from: enriched_events
    topic: processed.events
    topic_partitions: 10
    data_format: json

High-throughput configuration

sinks:
  kafka_output:
    type: kafka
    from: high_volume_events
    topic: events.stream
    topic_partitions: 16
    data_format: avro
    schema_registry_url: http://schema-registry:8081
    parallelism: 4
    batch_size: 5000
    batch_flush_interval: 50
    message_max_bytes: 2000000

Performance tuning

If your Kafka sink is not keeping up with pipeline throughput, tuning parallelism is the most effective first step. Each unit of parallelism runs a separate Kafka producer, allowing concurrent writes to the target cluster.

When to tune

Check the Kafka flush sink p95 metric in your pipeline’s advanced metrics. If flush latency is consistently high, your sink is likely a bottleneck and increasing parallelism can help.

Choosing a parallelism value

Set parallelism to a value that evenly divides the number of Kafka partitions in the target topic. This ensures producers are balanced across partitions and avoids uneven load distribution. For example, with a 12-partition topic, good parallelism values are 2, 3, 4, or 6. Avoid values like 5 or 7 that don’t divide evenly into 12.
Make sure your Kafka cluster can sustain the additional connections before increasing parallelism. Each parallel producer opens its own connection to the cluster.
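The divisibility check above is simple enough to script. A hypothetical helper (not part of any Goldsky tooling) that lists parallelism values dividing the partition count evenly:

```python
def balanced_parallelism_options(num_partitions: int, max_parallelism: int = 0) -> list[int]:
    """Return parallelism values that divide num_partitions evenly,
    so producers spread uniformly across partitions."""
    limit = max_parallelism if max_parallelism > 0 else num_partitions
    return [p for p in range(1, limit + 1) if num_partitions % p == 0]

# For a 12-partition topic: [1, 2, 3, 4, 6, 12]
```

For the 12-partition example above, this yields 1, 2, 3, 4, 6, and 12, and correctly excludes values like 5 and 7.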

Tuning batching

After adjusting parallelism, you can further optimize throughput by tuning batch settings:
  • batch_size: Increase if you have high message volume and want fewer, larger writes. Larger batches improve throughput but add a small amount of latency.
  • batch_flush_interval: Decrease for lower latency at the cost of smaller batches. Increase if you want to accumulate more messages before flushing.
  • message_max_bytes: Increase if you see errors related to message size limits, especially with large payloads.
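To make the mapping concrete, the three batch settings correspond one-to-one to Kafka producer properties, as noted in the parameter descriptions. A small illustrative sketch (the helper name and structure are assumptions, not the sink's actual implementation):

```python
def producer_config(batch_size: int = 10000,
                    batch_flush_interval_ms: int = 100,
                    message_max_bytes: int = 1000000) -> dict:
    """Map the sink's batching parameters to the Kafka producer
    properties they correspond to."""
    return {
        "batch.num.messages": batch_size,        # messages accumulated per batch
        "linger.ms": batch_flush_interval_ms,    # max wait before flushing a batch
        "message.max.bytes": message_max_bytes,  # max size in bytes of a request
    }
```

With the sink defaults, this is equivalent to a producer configured with `batch.num.messages=10000`, `linger.ms=100`, and `message.max.bytes=1000000`.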

Example: tuning a 12-partition topic

sinks:
  kafka_output:
    type: kafka
    from: high_volume_events
    topic: events.stream
    topic_partitions: 12
    data_format: avro
    schema_registry_url: http://schema-registry:8081
    parallelism: 4       # Divides evenly into 12 partitions
    batch_size: 5000
    batch_flush_interval: 50