Declarative Data Pipeline for Large Scale ML Services
Yunzhao Yang, Runhui Wang, Xuanqing Liu, Adit Krishnan, Yefan Tao, Yuqian Deng, Kuangyou Yao, Peiyuan Sun, Henrik Johnson, Aditi sinha, Davor Golac, Gerald Friedland, Usman Shakeel, Daryl Cooke, Joe Sullivan, Madhusudhanan Chandrasekaran, Chris Kong
TL;DR
Declarative Data Pipeline (DDP) introduces an in-memory, pipe-based architecture integrated with Apache Spark to unify ML workflows with large-scale data processing. By replacing networked microservices with memory-bound Pipes and applying a declarative DAG-driven execution model, DDP achieves strong improvements in development efficiency, scalability, and throughput while maintaining visibility through built-in metrics and visualization. Empirical results include a 50% reduction in development cycles, 500x scalability gains, and 10x throughput improvements in enterprise-scale projects, plus a web-scale language-detection experiment showing up to $180 imes$ speedups over non-distributed baselines and $5.7 imes$ over Ray. The framework emphasizes data-centric governance, modularity, and cross-platform compatibility, though it trades some automatic optimization for determinism and portability, with future work targeting real-time streaming and LLM hosting within the DDP workflow.
Abstract
Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high communication overhead and coordination complexity. We present a "Declarative Data Pipeline" (DDP) architecture that addresses these challenges while processing billions of records efficiently. Our modular framework seamlessly integrates machine learning within Apache Spark using logical computation units called Pipes, departing from traditional microservice approaches. By establishing clear component boundaries and standardized interfaces, we achieve modularity and optimization without sacrificing maintainability. Enterprise case studies demonstrate substantial improvements: 50% better development efficiency, collaboration efforts compressed from weeks to days, 500x scalability improvement, and 10x throughput gains.
