Mirror Indexing
Goldsky Mirror datasets are populated by various indexers, which write the data into a data stream. That stream is accessible directly to users through Mirror pipelines. Internally, we also copy the data into a data lake, which powers various features and is used for Data QA. Data quality is managed during ingestion as well as through periodic checks; the quality of emitted data is managed through various database guarantees, depending on the destination of the data.

Ingestion-level Consistency
Chain Continuity
When first ingesting a block, we check for a continuous block hash chain. If the chain is not valid (i.e. the block's parent hash does not match the hash we have stored for the preceding block number), we issue deletes and updates into our dataset and walk backwards until we reach a consistent chain again. All deletes and updates are propagated to downstream sinks, so if you have a Mirror pipeline writing chain data into a database and that chain goes through a reorg or a rollback, all the changes automatically propagate to your database as well.
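The walk-back can be sketched as follows. This is an illustrative sketch only, assuming hypothetical Block, store, and fetch_block helpers; it is not Goldsky's internal implementation.

```python
from dataclasses import dataclass

@dataclass
class Block:
    number: int
    hash: str
    parent_hash: str

def ingest_block(new_block: Block, store, fetch_block) -> None:
    """Ingest new_block, rolling back any stored blocks invalidated by a reorg."""
    height = new_block.number - 1
    parent = store.get(height)           # what we previously ingested at that height
    child = new_block
    replacements = []

    # Walk backwards while the hash chain is broken: the child's parent_hash
    # must match the hash we already stored for the preceding block number.
    while parent is not None and child.parent_hash != parent.hash:
        canonical = fetch_block(height)  # re-fetch the now-canonical block
        replacements.append(canonical)
        child = canonical
        height -= 1
        parent = store.get(height)

    # Emit deletes/updates for the orphaned range (oldest first), then the new block.
    # These changes flow through to downstream sinks, e.g. a Mirror-fed database.
    for block in reversed(replacements):
        store.replace(block)
    store.append(new_block)
```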
Write Guarantees
During ingestion, we ensure we have the full set of data for a block before emitting it into the various datasets. When emitting, we acquire a full consistency acknowledgement from each of our data sinks before marking the block as ingested.
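Conceptually, the guarantee behaves like the sketch below; the sink objects, their write method, and the mark_ingested callback are assumed names for illustration, not a real Goldsky API.

```python
def emit_block(block, sinks, mark_ingested) -> None:
    """Emit a fully assembled block; mark it ingested only after every sink acks."""
    # A block is only emitted once its full set of data (header, transactions,
    # logs, ...) has been assembled.
    if not block.is_complete():
        raise ValueError(f"block {block.number} is not fully assembled yet")

    # Write to every sink and collect consistency acknowledgements.
    acks = [sink.write(block) for sink in sinks]

    # Only when all sinks acknowledge a consistent write is the block marked as
    # ingested; otherwise it will be retried on a later pass.
    if all(ack.ok for ack in acks):
        mark_ingested(block.number)
    else:
        raise RuntimeError(f"block {block.number} was not acknowledged by all sinks")
```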
Schema Strictness
Our datasets follow strict typed schemas, so writes that do not fit those schemas fail completely.
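As a toy illustration of this behavior (the schema and validation code below are assumptions for illustration, not the actual dataset schema or implementation), a batch containing even one value of the wrong type is rejected as a whole:

```python
# Illustrative strict schema: column name -> expected Python type.
SCHEMA = {"block_number": int, "hash": str, "transaction_count": int}

def strict_write(rows, schema=SCHEMA):
    """Validate every row against the schema; reject the entire batch on any mismatch."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise ValueError(f"row {i}: columns {sorted(row)} do not match the schema")
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise TypeError(f"row {i}: {column!r} must be of type {expected.__name__}")
    return rows  # only reached if every row conforms; the real write would happen here

strict_write([{"block_number": 17, "hash": "0xabc", "transaction_count": 3}])      # ok
# strict_write([{"block_number": "17", "hash": "0xabc", "transaction_count": 3}])  # raises TypeError
```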
Dataset Validation Checks
In rare cases, RPC nodes can give us invalid data that may be missed during ingestion checks. For every dataset, we run checks on a daily basis and repair the data if any issues are found. These checks (sketched in code after the list) validate:
- Missing blocks (EVM) - we record the minimum and maximum block numbers for each date and look for gaps in the data.
- Missing transactions (EVM) - we count unique transaction hashes per block and compare the count with the block's transaction_count.
- Missing logs (EVM) - we compare the maximum log index per block with the number of logs in that block.
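The three checks can be sketched over an in-memory batch of blocks as below; the field names used (number, hash, transaction_count, transactions, logs, log_index) are assumptions for illustration, not the exact dataset columns.

```python
def find_missing_blocks(block_numbers):
    """Gap check: compare observed block numbers against the full min..max range for the date."""
    lo, hi = min(block_numbers), max(block_numbers)
    return sorted(set(range(lo, hi + 1)) - set(block_numbers))

def has_missing_transactions(block) -> bool:
    """Count unique transaction hashes and compare with the block's transaction_count."""
    unique_hashes = {tx["hash"] for tx in block["transactions"]}
    return len(unique_hashes) != block["transaction_count"]

def has_missing_logs(block) -> bool:
    """Compare the maximum log index in the block with the number of logs in the block."""
    logs = block["logs"]
    if not logs:
        return False
    # Log indices are 0-based within the block, so max index + 1 should equal the count.
    return max(log["log_index"] for log in logs) + 1 != len(logs)

def daily_check(blocks):
    """Return the block numbers in a day's partition that need to be re-ingested."""
    to_repair = set(find_missing_blocks([b["number"] for b in blocks]))
    for b in blocks:
        if has_missing_transactions(b) or has_missing_logs(b):
            to_repair.add(b["number"])
    return sorted(to_repair)
```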