Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
Hyunbyung Park, Sukyung Lee, Gyoungjin Gim, Yungi Kim, Dahyun Kim, Chanjun Park
TL;DR
Dataverse tackles the challenge of scalable, customizable data processing for LLM development by delivering an open-source ETL pipeline with a block-based interface and decorator-based customization. It combines native data operations, Spark-based distributed processing, and AWS EMR integration to enable end-to-end data preparation from diverse sources. Key contributions include a modular architecture (ETL core, configuration management, registry, utilities, API), a low-friction pathway to add custom processors, and notebook-friendly debugging support. The approach aims to accelerate LLM data readiness and foster community contributions by providing a shareable, extensible platform.
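To illustrate what a registry with decorator-based customization and a block-based interface might look like, here is a minimal sketch in plain Python. It is not the Dataverse API: `REGISTRY`, `register_block`, `run_pipeline`, and the two example processors are hypothetical names chosen for illustration, and records are simple dictionaries rather than Spark datasets.

```python
from typing import Callable, Dict, List

# Hypothetical types: a record is a dict with a "text" field,
# and a block is a function from a list of records to a list of records.
Record = Dict[str, str]
Block = Callable[[List[Record]], List[Record]]

# Hypothetical registry mapping block names to processing functions.
REGISTRY: Dict[str, Block] = {}


def register_block(name: str) -> Callable[[Block], Block]:
    """Decorator that adds a processing function to the registry under `name`."""
    def decorator(func: Block) -> Block:
        if name in REGISTRY:
            raise ValueError(f"block already registered: {name}")
        REGISTRY[name] = func
        return func
    return decorator


@register_block("strip_whitespace")
def strip_whitespace(records: List[Record]) -> List[Record]:
    # Transform step: trim leading and trailing whitespace from each text.
    return [{**r, "text": r["text"].strip()} for r in records]


@register_block("exact_dedup")
def exact_dedup(records: List[Record]) -> List[Record]:
    # Transform step: keep only the first occurrence of each distinct text.
    seen = set()
    unique = []
    for r in records:
        if r["text"] not in seen:
            seen.add(r["text"])
            unique.append(r)
    return unique


def run_pipeline(block_names: List[str], records: List[Record]) -> List[Record]:
    """Apply the named blocks in order, feeding each block's output to the next."""
    for name in block_names:
        records = REGISTRY[name](records)
    return records


if __name__ == "__main__":
    data = [{"text": " hello "}, {"text": "hello"}, {"text": "world"}]
    print(run_pipeline(["strip_whitespace", "exact_dedup"], data))
```

In this sketch, adding a custom processor only requires writing a function and decorating it; the pipeline itself is an ordered list of block names, so blocks can be reordered or swapped without changing their code.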
Abstract
To address the challenges of processing data at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for large language models (LLMs) with a user-friendly design at its core. Its block-based interface makes it easy to add custom processors, so users can readily and efficiently build their own ETL pipelines. We hope that Dataverse will serve as a vital tool for LLM development, and we open source the entire library to welcome community contributions. Additionally, we provide a concise, two-minute video demonstration of the system, illustrating its capabilities and implementation.
