Dflow, a Python framework for constructing cloud-native AI-for-Science workflows
Xinzijian Liu, Yanbo Han, Zhuoyuan Li, Jiahao Fan, Chengqian Zhang, Jinzhe Zeng, Yifan Shan, Yannan Yuan, Wei-Hong Xu, Yun-Pei Liu, Yuzhi Zhang, Tongqi Wen, Darrin M. York, Zhicheng Zhong, Hang Zheng, Jun Cheng, Linfeng Zhang, Han Wang
TL;DR
The paper presents Dflow, an open-source Python framework for building cloud-native AI-for-Science workflows that unify containerized task execution, Argo-based scheduling, and HPC integration. It introduces core concepts such as Operations (OPs), Steps, and DAGs, with support for container templates and Python-native OPs, strict type checking, slices, fault tolerance, and resubmission. The architecture enables execution across cloud clusters and HPC systems via DPDispatcher and wlm-operator, plus a debug mode that runs locally without containers. The authors demonstrate Dflow through several domain-specific applications (FPOP, APEX, Rid-kit, DeePKS flow, VSW, Dflow-galaxy), illustrating reusable OPs, scalable parallelism, and flexible resource management. This framework reduces coupling between algorithm design and implementation, improves observability and reproducibility, and promotes collaboration by enabling reusable workflow components.
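To make the OP, type-checking, and slice concepts concrete, the following is a minimal sketch in plain Python. It does not use Dflow's actual API: `ToyOP`, `run_sliced`, and `scale_energy` are hypothetical names introduced only for illustration, and a local thread pool stands in for the containerized, Argo-scheduled execution that the paper describes. The sketch assumes that an OP can be modeled as a function with declared input and output types, and that a slice amounts to running the same OP once per element of a list in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyOP:
    """Hypothetical stand-in for an OP: a function plus declared I/O types."""

    def __init__(self, func, input_sign, output_sign):
        self.func = func
        self.input_sign = input_sign    # e.g. {"energy": float}
        self.output_sign = output_sign  # e.g. {"scaled": float}

    @staticmethod
    def _check(values, sign, kind):
        # Strict check: names must match exactly and each value must have the declared type.
        if set(values) != set(sign):
            raise TypeError(
                f"{kind} names {sorted(values)} do not match signature {sorted(sign)}"
            )
        for name, expected in sign.items():
            if not isinstance(values[name], expected):
                raise TypeError(
                    f"{kind} '{name}' should be {expected.__name__}, "
                    f"got {type(values[name]).__name__}"
                )

    def run(self, **inputs):
        # Validate inputs before execution and outputs after it.
        self._check(inputs, self.input_sign, "input")
        outputs = self.func(**inputs)
        self._check(outputs, self.output_sign, "output")
        return outputs


def run_sliced(op, sliced_name, items, max_workers=4, **shared):
    """Run `op` once per item in parallel; results keep the order of `items`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(op.run, **{sliced_name: item}, **shared) for item in items
        ]
        return [future.result() for future in futures]


def scale_energy(energy, factor):
    # Toy computation standing in for a real scientific task.
    return {"scaled": energy * factor}


scale_op = ToyOP(
    scale_energy,
    input_sign={"energy": float, "factor": float},
    output_sign={"scaled": float},
)

# One OP applied to three inputs in parallel, as a slice would do.
results = run_sliced(scale_op, "energy", [1.0, 2.0, 3.0], factor=2.0)
scaled = [result["scaled"] for result in results]
```

Because the OP carries its own signature and knows nothing about where it runs, the same `scale_op` could in principle be reused in another workflow or executed by a different backend, which is the decoupling the framework aims for. Passing an integer where a float is declared, for example `scale_op.run(energy=1, factor=2.0)`, raises a `TypeError` in this sketch.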
Abstract
In the AI-for-Science era, scientific computing scenarios such as concurrent learning and high-throughput computing demand a new generation of infrastructure that supports scalable computing resources and automated workflow management on both cloud platforms and high-performance supercomputers. Here we introduce Dflow, an open-source Python toolkit designed for scientists to construct workflows with simple programming interfaces. It enables complex process control and task scheduling across a distributed, heterogeneous infrastructure, leveraging containers and Kubernetes for flexibility. Dflow is highly observable and can scale to thousands of concurrent nodes per workflow, enhancing the efficiency of complex scientific computing tasks. The basic unit in Dflow, known as an Operation (OP), is reusable and independent of the underlying infrastructure or context. Dozens of workflow projects have been developed based on Dflow, spanning a wide range of scientific applications. We anticipate that the reusability of Dflow and its components will encourage more scientists to publish their workflows and OP components. These components, in turn, can be adapted and reused in various contexts, fostering greater collaboration and innovation in the scientific community.
