Data Quality at Goldsky
Mirror Indexing
Goldsky Mirror datasets are populated through various indexers, which write the data into a data stream.
The data stream is then accessible directly by users through Mirror pipelines. Internally, we copy the data to a data lake which is then used to power various features and also used for Data QA.
Data quality is managed during ingestion and through periodic checks.
The quality of emitted data is managed through various database guarantees, depending on the destination of the data.
Ingestion-level Consistency
Chain Continuity
When first ingesting a block, we check for a continuous block hash chain. If the chain is not valid (i.e., the parent hash does not match the hash we have stored for the preceding block number), we issue deletes and updates into our dataset and walk backwards until we reach a consistent chain again.
All deletes and updates are propagated through to downstream sinks. This means if you have a Mirror pipeline writing chain data into a database, and that chain goes through a reorg or a rollback, all the changes will automatically propagate to your database as well.
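The sketch below illustrates one way such a walk-back can be implemented; the function and parameter names are hypothetical and not Goldsky's internal API, and the lookups for stored and canonical hashes are assumed to be provided by the caller.

```python
from typing import Callable, Optional

def find_rollback_point(
    stored_hash_at: Callable[[int], Optional[str]],      # hash we previously ingested for a block number
    canonical_parent_hash_at: Callable[[int], str],       # parent hash of the node's current block at that number
    new_block_number: int,
    new_block_parent_hash: str,
) -> int:
    """Walk backwards from a newly ingested block until the stored chain and the
    node's canonical chain agree again. Returns the first block number that must
    be deleted and re-emitted (everything at or above it is stale)."""
    parent_hash = new_block_parent_hash
    number = new_block_number - 1
    while number >= 0:
        if stored_hash_at(number) == parent_hash:
            return number + 1  # chain is continuous up to `number`; rewrite from number + 1
        # Mismatch: the stored block at `number` was orphaned by a reorg.
        # Take the canonical chain's parent hash at this height and keep walking back.
        parent_hash = canonical_parent_hash_at(number)
        number -= 1
    return 0  # worst case: rewrite from genesis
```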
Write Guarantees
During ingestion, we ensure we have the full set of data for a block before emitting it into the various datasets. When emitting, we acquire full consistency acknowledgement from our various data sinks before marking the block as ingested.
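As a rough sketch of that flow (the `Sink` interface and dataset names here are illustrative assumptions, not the actual internal interface), a block is only reported as ingested once its data is complete and every sink has acknowledged the write:

```python
from typing import Protocol

class Sink(Protocol):
    def write(self, datasets: dict[str, list[dict]]) -> bool:
        """Returns True only once the sink has durably accepted the write."""

REQUIRED_DATASETS = {"blocks", "transactions", "logs"}  # illustrative set

def emit_block(datasets: dict[str, list[dict]], sinks: list[Sink]) -> bool:
    """Emit a block only when its data is complete, and report it as ingested
    only after every sink has acknowledged the write."""
    if not REQUIRED_DATASETS.issubset(datasets):
        raise ValueError("incomplete block data; refusing to emit")
    acks = [sink.write(datasets) for sink in sinks]
    return all(acks)  # the caller marks the block as ingested only if this is True
```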
Schema Strictness
Our datasets follow strict typed schemas, so writes that do not fit those schemas fail outright instead of being stored.
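A minimal sketch of this kind of schema gate, assuming a simple column-to-type mapping; the columns and types shown are illustrative, not Goldsky's actual schema:

```python
# Illustrative column set; a real log schema has more fields.
LOG_SCHEMA: dict[str, type] = {
    "block_number": int,
    "log_index": int,
    "address": str,
    "data": str,
}

def validate_row(row: dict, schema: dict[str, type]) -> None:
    """Reject any write that does not match the declared schema."""
    missing = schema.keys() - row.keys()
    if missing:
        raise TypeError(f"missing columns: {sorted(missing)}")
    for column, expected in schema.items():
        if not isinstance(row[column], expected):
            raise TypeError(
                f"{column}: expected {expected.__name__}, got {type(row[column]).__name__}"
            )
```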
Dataset Validation Checks
In rare cases, RPC nodes can give us invalid data that may be missed during ingestion checks. For every dataset, we run checks on a daily basis and repair the data if any issues are seen.
These checks validate:
- Missing blocks (EVM) - we record the minimum and maximum block numbers for each date and look for gaps in the data.
- Missing transactions (EVM) - we count unique transaction hashes per block and compare it with the transaction_count for the block.
- Missing logs (EVM) - we compare the maximum log index per block with the number of logs in the block.
This framework will allow us to proactively address data quality issues in a structured and efficient manner. Much like unit tests in a software codebase, these checks will help prevent future regressions. Once a check is implemented for one chain, it can be seamlessly applied across others, ensuring consistency and scalability.
Destination-level Consistency
To prevent missing data when writing, Mirror pipelines are built with an at-least-once delivery guarantee.
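At-least-once delivery means a row may occasionally be delivered more than once; the usual way a destination absorbs this is by keying writes on a primary key so a replayed row simply overwrites itself. A minimal sketch, using an in-memory table and assumed key columns for illustration:

```python
def upsert_rows(
    table: dict[tuple, dict],
    rows: list[dict],
    key_columns: tuple[str, ...] = ("block_number", "log_index"),  # illustrative primary key
) -> None:
    """Keying every write on a primary key makes replays idempotent:
    a duplicate overwrites the existing row instead of creating a second one."""
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        table[key] = row

# Replaying the same batch leaves the table in the same final state.
table: dict[tuple, dict] = {}
batch = [{"block_number": 1, "log_index": 0, "address": "0xabc", "data": "0x"}]
upsert_rows(table, batch)
upsert_rows(table, batch)  # duplicate delivery; still exactly one row
```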
Snapshots
We perform automatic fault tolerance every minute, with snapshot recovery every 4 hours. When a pipeline is updated or is forced to terminate, a snapshot is persisted and used for the next incarnation of the pipeline. This allows for continuity of the data being sent.
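Conceptually, a snapshot is just the persisted position the pipeline has safely reached; the next incarnation loads it and continues from there. A minimal sketch, assuming a local JSON file as the snapshot store (the path and cursor fields are illustrative):

```python
import json
import os

CHECKPOINT_PATH = "pipeline_snapshot.json"  # illustrative location

def save_snapshot(cursor: dict) -> None:
    """Persist the pipeline's current position so the next incarnation can resume."""
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump(cursor, f)

def load_snapshot() -> dict:
    """On (re)start, resume from the last persisted snapshot, or from the beginning if none exists."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"last_emitted_block": -1}
```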
Database Consistency
For every row of data the pipeline sends, we require full acknowledgement from the database before moving on to the next set of data. If the write is not acknowledged, the snapshot will not record that data as sent; if the pipeline restarts or errors, the snapshot is therefore pessimistic and will err on the side of resending data rather than missing data.
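The sketch below shows the effect of gating the cursor on acknowledgements; the `sink.write` call is a hypothetical stand-in for a blocking database write that only returns once the destination confirms it:

```python
def send_with_acks(batches: list[tuple[int, list[dict]]], sink, cursor: dict) -> None:
    """Advance the snapshot cursor only after the database acknowledges each batch.
    If the pipeline dies mid-write, the cursor still points at the unacknowledged
    batch, so that batch is resent on restart: duplicates are possible, gaps are not."""
    for batch_id, rows in batches:
        acked = sink.write(rows)  # hypothetical: blocks until the database confirms the write
        if not acked:
            break  # do not advance; this batch will be retried after restart
        cursor["last_acked_batch"] = batch_id
```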
Sink Downtime Handling
If a write into a sink (database or channel) errors for whatever reason, the pipeline will automatically restart just that batch for that sink. If it continues to error, the pipeline will restart the writers. Finally, if all fails for a prolonged period of time, the pipeline will fail, and when the user restarts it, it will resume from the last saved snapshot.
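That escalation (retry the batch, then rebuild the writer, then fail and later resume from the snapshot) can be sketched roughly as follows; the retry counts, backoff, and `sink.recreate()` call are illustrative assumptions rather than the actual configuration:

```python
import time

def write_with_escalation(sink, batch, batch_retries: int = 3, writer_restarts: int = 2) -> None:
    """Escalating recovery: retry the failing batch, then recreate the writer,
    and finally give up so the pipeline fails and later resumes from its last snapshot."""
    for restart in range(writer_restarts + 1):
        for attempt in range(batch_retries):
            try:
                sink.write(batch)
                return
            except Exception:
                time.sleep(2 ** attempt)  # back off before retrying the same batch
        sink = sink.recreate()  # hypothetical: rebuild the writer/connection and try again
    raise RuntimeError("sink unavailable; failing pipeline so it can resume from the last snapshot")
```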