Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models

Hyunbyung Park; Sukyung Lee; Gyoungjin Gim; Yungi Kim; Dahyun Kim; Chanjun Park

TL;DR

Dataverse tackles the challenge of scalable, customizable data processing for LLM development by delivering an open-source ETL pipeline with a block-based interface and decorator-based customization. It combines native data operations, Spark-based distributed processing, and AWS EMR integration to enable end-to-end data preparation from diverse sources. Key contributions include a modular architecture (ETL core, configuration management, registry, utilities, API), a low-friction pathway to add custom processors, and notebook-friendly debugging support. The approach aims to accelerate LLM data readiness and foster community contributions by providing a shareable, extensible platform.

Abstract

To address the challenges of data processing at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for large language models (LLMs) with a user-friendly design at its core. A block-based interface makes it easy to add custom processors, so users can readily and efficiently build their own ETL pipelines with Dataverse. We hope that Dataverse will serve as a vital tool for LLM development, and we open-source the entire library to welcome community contributions. Additionally, we provide a concise, two-minute video demonstration of the system, illustrating its capabilities and implementation.
