Digital Asset Data Lakehouse. The concept based on a blockchain research center
Raul Cristian Bag
TL;DR
The paper addresses the challenge of managing large-scale blockchain and digital asset data by proposing a cloud-native data lakehouse architecture built on open-source tools. It integrates data ingestion from multiple blockchain networks and exchanges with modular microservices, leveraging Ceph-based S3 storage, Apache Parquet, Apache Spark, Airflow, and Kubernetes to enable near real-time analytics. Key contributions include a detailed architectural blueprint, emphasis on open-source components, scalability, cost reduction, and support for data-driven research and ML workflows, with potential impact on reproducibility in blockchain analytics. The work highlights the practical significance of robust data management for the digital economy and outlines avenues for future ML operations and data-engineering integration in academic research.
Abstract
In the rapidly evolving landscape of digital assets and blockchain technologies, the need for robust, scalable, and secure data management platforms has never been greater. This paper introduces a novel software architecture designed to meet these demands by leveraging the strengths of cloud-native technologies and modular microservice-based architectures to facilitate efficient data management, storage, and access across different stakeholders. We detail the architectural design, including its components and their interactions, and discuss how it addresses common challenges in managing blockchain data and digital assets, such as scalability, data siloing, and security vulnerabilities. We demonstrate the capabilities of the platform by applying it to multiple real-life scenarios, notably delivering data in near real time to scientists to support their research. Our results indicate that the proposed architecture not only enhances the efficiency and scalability of distributed data management but also opens new avenues for innovation in research reproducibility. This work lays the groundwork for future research and development in machine learning operations systems, offering a scalable and secure framework for the burgeoning digital economy.
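
As an illustration of the storage layer summarized above (Apache Parquet files kept in S3-compatible object storage), the following minimal sketch shows one way ingested block records could be mapped to partitioned object keys. It is not taken from the paper: the `BlockRecord` type, the `raw/blocks` prefix, and the `network=`/`date=` partition layout are all assumptions chosen for illustration, and the sketch only computes keys without writing any data.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BlockRecord:
    """Hypothetical minimal block record; the field names are assumptions."""
    network: str    # e.g. "ethereum" (illustrative label)
    height: int     # block height on that network
    timestamp: int  # block time as Unix seconds (UTC)


def partition_key(record: BlockRecord, prefix: str = "raw/blocks") -> str:
    """Build a Hive-style partitioned object key for one block record.

    The layout (network=/date=) is an assumed convention, not the paper's.
    """
    day = datetime.fromtimestamp(record.timestamp, tz=timezone.utc).date()
    return (
        f"{prefix}/network={record.network}/date={day.isoformat()}/"
        f"block-{record.height:012d}.parquet"
    )


def group_by_partition(records):
    """Group records by their partition directory (the key minus the file name)."""
    groups = defaultdict(list)
    for record in records:
        directory = partition_key(record).rsplit("/", 1)[0]
        groups[directory].append(record)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        BlockRecord("ethereum", 100, 0),
        BlockRecord("ethereum", 101, 12),
        BlockRecord("bitcoin", 5, 86_400),
    ]
    for directory, items in group_by_partition(sample).items():
        print(directory, [r.height for r in items])
```

Partitioning by network and day in this way lets a query engine such as Apache Spark skip irrelevant directories when a query filters on those columns; the actual layout used by the platform may differ.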
