From Noise to Signal
A single Ethereum block contains hundreds of transactions, each potentially generating multiple log events from multiple smart contracts. A validator publishing a block does not curate this data for human consumption — it is a raw, binary-encoded record of state transitions. The distance between that raw record and an actionable insight like "this wallet is exhibiting structuring behaviour" or "this cohort's retention rate dropped fifteen percent this week" is vast, and it is the data pipeline's job to bridge it.
The Ludopoly Analytics pipeline is organised into six sequential stages. Each stage performs a well-defined transformation, passes its output to the next stage, and is independently scalable. If the volume of incoming events from a particular chain spikes, only the relevant stage needs additional resources — the rest of the pipeline continues operating at its current capacity.
Event Capture
The first stage deploys specialised event listeners for each supported blockchain network. These listeners connect to node infrastructure through WebSocket and JSON-RPC interfaces, subscribing to new block notifications, transaction receipts, and smart-contract event logs. Each listener is configured with chain-specific parameters — most critically, the finality threshold that determines how many block confirmations must pass before an event is considered canonical and forwarded downstream.
Events that arrive within the finality window are buffered, not discarded. This prevents data loss during transient chain reorganisations while ensuring that no downstream module processes an event that might later be invalidated.
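The buffering behaviour described above can be sketched in a few lines. This is a minimal illustration, not the platform's actual listener code: the class name, the `on_new_head`/`on_reorg` hooks, and the event shape are all hypothetical stand-ins for whatever the real listeners use.

```python
from collections import defaultdict

class FinalityBuffer:
    """Hold events until their block has enough confirmations.

    Events inside the finality window are buffered, not discarded;
    a chain reorganisation simply evicts the affected blocks before
    anything is forwarded downstream.
    """

    def __init__(self, confirmations: int):
        self.confirmations = confirmations
        self.pending = defaultdict(list)  # block number -> buffered events

    def add(self, block_number: int, event: dict) -> None:
        self.pending[block_number].append(event)

    def on_new_head(self, head_number: int) -> list:
        """Release every event whose block is now past the threshold."""
        released = []
        final_blocks = sorted(
            b for b in self.pending
            if head_number - b >= self.confirmations
        )
        for block in final_blocks:
            released.extend(self.pending.pop(block))
        return released

    def on_reorg(self, first_invalid_block: int) -> None:
        """Drop buffered events from blocks replaced by a reorg."""
        for block in [b for b in self.pending if b >= first_invalid_block]:
            del self.pending[block]
```

With a threshold of two confirmations, an event captured at block 10 is only released once the head reaches block 12; a reorg announced before that point silently removes it from the buffer.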
ABI Resolution
Raw smart-contract events are encoded as binary data conforming to the Ethereum ABI specification. To interpret them — to know that a particular event represents a token transfer rather than a governance vote — the pipeline must resolve the contract's ABI. The second stage employs a three-tier fallback strategy. It first queries the platform's internal ABI registry, which contains the ABIs of all registered projects. If no match is found, it queries block explorer APIs. As a last resort, it consults 4-byte function signature databases, which provide partial decoding based on the function selector alone. This tiered approach achieves high resolution rates even for contracts that have not been explicitly registered with the platform.
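The three-tier fallback reduces to a simple priority chain: try each source in order and take the first hit. The sketch below assumes each tier is a callable returning a decoded ABI or `None`; the function names and the shape of the return value are illustrative, not the platform's real interfaces.

```python
from typing import Callable, Optional

# A resolver tier takes a contract address and returns an ABI, or None on a miss.
Resolver = Callable[[str], Optional[dict]]

def make_tiered_resolver(registry: Resolver,
                         explorer: Resolver,
                         four_byte_db: Resolver) -> Resolver:
    """Chain the three tiers in priority order: internal registry first,
    block explorer APIs second, 4-byte signature databases last."""
    tiers = [registry, explorer, four_byte_db]

    def resolve(contract_address: str) -> Optional[dict]:
        for tier in tiers:
            abi = tier(contract_address)
            if abi is not None:
                return abi
        return None  # unresolvable: the event stays undecoded

    return resolve
```

In practice the 4-byte tier yields only a partial decoding (the function selector identifies the call, but argument names are lost), which is why it sits last in the chain.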
Processing and Distribution
The third stage subjects resolved events to a six-step inner pipeline: validation (rejecting malformed or inconsistent events), enrichment (attaching contextual metadata such as token prices and contract labels), normalisation (converting chain-specific formats into a unified schema), classification (categorising events by type — transfer, swap, stake, mint, governance), labelling (applying protocol-specific tags), and indexing (writing structured events to search-optimised storage).
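The six-step inner pipeline is a straightforward function chain in which any step may reject the event. A minimal sketch, with two hypothetical steps standing in for the six real ones:

```python
from typing import Callable, Optional

# Each step transforms the event dict, or returns None to reject it.
Step = Callable[[dict], Optional[dict]]

def run_inner_pipeline(event: dict, steps: list[Step]) -> Optional[dict]:
    """Apply each step in order; a step returning None drops the event."""
    for step in steps:
        event = step(event)
        if event is None:
            return None
    return event

# Illustrative stand-ins for the validation and classification steps.
def validate(ev: dict) -> Optional[dict]:
    return ev if "tx_hash" in ev else None  # reject malformed events

def classify(ev: dict) -> dict:
    return {**ev, "category": "transfer"}   # categorise by event type
```

The real steps (enrichment, normalisation, labelling, indexing) slot into the same list, which keeps the ordering explicit and each step independently testable.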
The fourth stage distributes processed events through Apache Kafka's topic-based publish-subscribe mechanism. Each service module — AML, ZK-KYC, dApp analytics, AI risk engine — subscribes to the topics relevant to its function through independent consumer groups. This decoupling ensures that a slow consumer does not block other modules and that each module can scale its consumption rate independently.
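The consumer-group semantics described above can be illustrated with a toy in-process broker; this is not Kafka itself, just a sketch of the delivery contract: every group sees each message once, and consumers within a group share the stream.

```python
from collections import defaultdict
from itertools import cycle

class MiniBroker:
    """Toy topic-based pub/sub with Kafka-style consumer groups."""

    def __init__(self):
        # topic -> group id -> list of handler callables
        self.subs = defaultdict(lambda: defaultdict(list))
        # (topic, group) -> round-robin iterator over the group's handlers
        self.rr = {}

    def subscribe(self, topic: str, group: str, handler) -> None:
        self.subs[topic][group].append(handler)
        # Rebuild the round-robin iterator so the new consumer is included.
        self.rr[(topic, group)] = cycle(self.subs[topic][group])

    def publish(self, topic: str, message) -> None:
        # Each group receives the message once, delivered to one of
        # its consumers in round-robin order.
        for group in self.subs[topic]:
            next(self.rr[(topic, group)])(message)
```

A slow handler in one group delays only that group's next delivery, which mirrors why the real modules consume through independent consumer groups.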
Storage and Aggregation
The fifth stage routes events to the appropriate database engine. Relational data flows to PostgreSQL. Document-oriented analytics results go to MongoDB. Hot cache data lives in Redis. Transaction relationship graphs are stored in Neo4j. Time-series metrics accumulate in InfluxDB. Each database is selected for the workload it handles most efficiently, and the pipeline's storage router abstracts the routing logic from both upstream producers and downstream consumers.
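The storage router's job is a lookup from workload kind to sink, so that neither producers nor consumers know which engine sits behind a write. A minimal sketch, with list-backed sinks standing in for the real database clients:

```python
class StorageRouter:
    """Route each processed event to the sink matching its workload,
    hiding the choice of engine from producers and consumers."""

    def __init__(self):
        self.sinks = {}  # workload kind -> write callable

    def register(self, workload: str, sink) -> None:
        self.sinks[workload] = sink

    def write(self, event: dict) -> None:
        # The event carries a workload tag; the router dispatches on it.
        self.sinks[event["workload"]](event)
```

In the real pipeline the registered sinks would wrap PostgreSQL, MongoDB, Redis, Neo4j, and InfluxDB clients respectively; swapping an engine means re-registering one sink, with no change to the code on either side of the router.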
The sixth and final stage runs batch aggregation processes on hourly and daily schedules. These jobs compute cohort metrics, update risk score summaries, generate statistical reports, and prepare the data snapshots that power historical trend visualisations in the dashboard. Batch jobs complement — rather than replace — the real-time processing in earlier stages, ensuring that both instant alerts and long-range analytical views are available simultaneously.
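An hourly rollup of the kind these batch jobs perform can be sketched as a bucketing pass over raw events; the field names here are hypothetical, and a production job would of course read from and write to the stores above rather than operate on in-memory lists.

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_rollup(events: list[dict]) -> Counter:
    """Aggregate raw events into (hour, event type) -> count buckets,
    the shape a dashboard trend query would consume."""
    buckets = Counter()
    for ev in events:
        # Truncate the event timestamp to the top of its UTC hour.
        hour = datetime.fromtimestamp(ev["ts"], tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0)
        buckets[(hour.isoformat(), ev["type"])] += 1
    return buckets
```

Because the rollup is a pure function of the raw events, a batch run can always be replayed over a historical window without disturbing the real-time stages.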
The pipeline is designed for exactly-once processing semantics through Kafka's transactional producer and consumer offset management. Events are neither lost nor duplicated, even during infrastructure failures or scaling operations.
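The consumer side of that guarantee rests on committing the offset atomically with the processing result, so that a redelivered message is recognised and skipped. A deliberately simplified sketch of that idempotence check (the class and its interface are illustrative, not Kafka's API):

```python
class ExactlyOnceConsumer:
    """Skip any message at or below the last committed offset, so a
    redelivery after a crash or rebalance is a no-op."""

    def __init__(self, handler):
        self.handler = handler
        self.committed = {}  # partition -> last processed offset

    def consume(self, partition: int, offset: int, message) -> bool:
        if offset <= self.committed.get(partition, -1):
            return False  # duplicate delivery: already processed
        self.handler(message)
        # In the real pipeline this commit happens in the same
        # transaction as the handler's side effects.
        self.committed[partition] = offset
        return True
```

The sketch elides the crucial part that Kafka transactions provide: the handler's write and the offset commit succeed or fail together, which is what closes the gap between at-least-once delivery and exactly-once processing.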