Table of Contents
Fetching ...

Declarative Data Pipeline for Large Scale ML Services

Yunzhao Yang, Runhui Wang, Xuanqing Liu, Adit Krishnan, Yefan Tao, Yuqian Deng, Kuangyou Yao, Peiyuan Sun, Henrik Johnson, Aditi sinha, Davor Golac, Gerald Friedland, Usman Shakeel, Daryl Cooke, Joe Sullivan, Madhusudhanan Chandrasekaran, Chris Kong

TL;DR

Declarative Data Pipeline (DDP) introduces an in-memory, pipe-based architecture integrated with Apache Spark to unify ML workflows with large-scale data processing. By replacing networked microservices with memory-bound Pipes and applying a declarative DAG-driven execution model, DDP achieves strong improvements in development efficiency, scalability, and throughput while maintaining visibility through built-in metrics and visualization. Empirical results include a 50% reduction in development cycles, 500x scalability gains, and 10x throughput improvements in enterprise-scale projects, plus a web-scale language-detection experiment showing up to $180 imes$ speedups over non-distributed baselines and $5.7 imes$ over Ray. The framework emphasizes data-centric governance, modularity, and cross-platform compatibility, though it trades some automatic optimization for determinism and portability, with future work targeting real-time streaming and LLM hosting within the DDP workflow.

Abstract

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high communication overhead and coordination complexity. We present a "Declarative Data Pipeline" (DDP) architecture that addresses these challenges while processing billions of records efficiently. Our modular framework seamlessly integrates machine learning within Apache Spark using logical computation units called Pipes, departing from traditional microservice approaches. By establishing clear component boundaries and standardized interfaces, we achieve modularity and optimization without sacrificing maintainability. Enterprise case studies demonstrate substantial improvements: 50% better development efficiency, collaboration efforts compressed from weeks to days, 500x scalability improvement, and 10x throughput gains.

Declarative Data Pipeline for Large Scale ML Services

TL;DR

Declarative Data Pipeline (DDP) introduces an in-memory, pipe-based architecture integrated with Apache Spark to unify ML workflows with large-scale data processing. By replacing networked microservices with memory-bound Pipes and applying a declarative DAG-driven execution model, DDP achieves strong improvements in development efficiency, scalability, and throughput while maintaining visibility through built-in metrics and visualization. Empirical results include a 50% reduction in development cycles, 500x scalability gains, and 10x throughput improvements in enterprise-scale projects, plus a web-scale language-detection experiment showing up to speedups over non-distributed baselines and over Ray. The framework emphasizes data-centric governance, modularity, and cross-platform compatibility, though it trades some automatic optimization for determinism and portability, with future work targeting real-time streaming and LLM hosting within the DDP workflow.

Abstract

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high communication overhead and coordination complexity. We present a "Declarative Data Pipeline" (DDP) architecture that addresses these challenges while processing billions of records efficiently. Our modular framework seamlessly integrates machine learning within Apache Spark using logical computation units called Pipes, departing from traditional microservice approaches. By establishing clear component boundaries and standardized interfaces, we achieve modularity and optimization without sacrificing maintainability. Enterprise case studies demonstrate substantial improvements: 50% better development efficiency, collaboration efforts compressed from weeks to days, 500x scalability improvement, and 10x throughput gains.

Paper Structure

This paper contains 33 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: This represents the full development cycle using DDP for a product leveraging SQL, traditional model and LLM, from design, development, runtime to monitoring.
  • Figure 2: Dataset declarations serve as "anchor" in our pipeline architecture, specifying data attributes like location, schema, and encryption settings. These declarations form interfaces between pipe components, enabling modular development and independent data processing units.
  • Figure 3: Workflow visualization for an ML data pipeline definition with preprocessing, feature generalization, and model prediction steps. We add the pipe execution order as the prefix to each pipe name such as [0] for the first one. We use purple block with info tag to show the metrics information for each pipe, such as the model_latency for the ModelPredictionTransformer pipe. We use different colors to notate data with different locations: orange for AWS S3, yellow for memory, dotted orange outline for caching in memory, blue for Iceberg table. We also notate the execution progress with different stages: green for completed steps, yellow for in-progress steps, and white for steps not started.
  • Figure 4: We implemented the above data processing stages with DDP in the web-Scale language detection experiment to study the gains over conventional implementations.
  • Figure 5: Scalability Evaluation over 2.1 M documents from the CC-NET corpus. The smallest number of CPUs for our DDP framework was 4 (single worker instance).