FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline
Jingwei Xu, Junbin Kang, Mingkai Dong, Mingyu Liu, Lu Zhang, Shaohong Guo, Ziyan Qiu, Mingzhen You, Ziyi Tian, Anqi Yu, Tianhong Ding, Xinwei Hu, Haibo Chen
TL;DR
FalconFS tackles metadata path-resolution bottlenecks in large-scale deep learning (DL) pipelines with a stateless-client design that moves path resolution from clients to servers. It combines hybrid metadata indexing with lazy namespace replication to achieve one-hop path resolution, raises server concurrency with concurrent request merging, and uses a VFS shortcut for easy deployment. Compared with CephFS and Lustre, FalconFS delivers up to 5.72$\times$ throughput for small-file reads and writes and up to 12.81$\times$ throughput for DL model training. It has run for one year in the production environment of Huawei's autonomous driving system with 10,000 NPUs and has been open-sourced.
Abstract
Client-side metadata caching has long been considered an effective method for accelerating metadata operations in distributed file systems (DFSs). However, we have found that in deep learning pipelines, client-side state (e.g., caching) is not only ineffective but also consumes valuable memory resources. We thus propose FalconFS, a DFS optimized for deep learning pipelines with a stateless-client architecture. Specifically, instead of performing client-side path resolution and caching, FalconFS efficiently resolves paths on the server side using hybrid metadata indexing and lazy namespace replication. FalconFS also boosts server concurrency with concurrent request merging and provides easy deployment with a VFS shortcut. Evaluations against CephFS and Lustre show that FalconFS achieves up to 5.72$\times$ throughput for small-file read/write and up to 12.81$\times$ throughput for deep learning model training. FalconFS has been running in the production environment of Huawei's autonomous driving system with 10,000 NPUs for one year and has been open-sourced.
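To make the contrast between client-side and server-side path resolution concrete, the following toy sketch counts the requests a client would send in each case. It is an illustration under stated assumptions, not FalconFS's implementation: the `NAMESPACE` dictionary, the function names, and the example path are all hypothetical, the client is assumed to start with an empty cache, and the sketch does not model hybrid metadata indexing, lazy namespace replication, or request merging.

```python
from typing import Dict, List

# Hypothetical toy namespace: maps a directory path to the names it contains.
NAMESPACE: Dict[str, List[str]] = {
    "/": ["data"],
    "/data": ["train"],
    "/data/train": ["img_0001.jpg"],
}


def split_path(path: str) -> List[str]:
    """Split an absolute path into its non-empty components."""
    return [c for c in path.split("/") if c]


def walk(path: str) -> int:
    """Walk the toy namespace; return the number of components checked."""
    current = "/"
    steps = 0
    for comp in split_path(path):
        if comp not in NAMESPACE.get(current, []):
            raise FileNotFoundError(path)
        current = current.rstrip("/") + "/" + comp
        steps += 1
    return steps


def client_side_resolve(path: str) -> int:
    """Assumed model: a client with an empty cache issues one lookup
    request per path component. Returns the number of requests sent."""
    return walk(path)


def server_side_resolve(path: str) -> int:
    """Assumed model: a stateless client sends the full path in a single
    request and the server walks its own copy of the namespace.
    Returns the number of requests sent."""
    walk(path)  # the walk happens on the server, not over the network
    return 1


if __name__ == "__main__":
    p = "/data/train/img_0001.jpg"
    print("client-side requests:", client_side_resolve(p))
    print("server-side requests:", server_side_resolve(p))
```

In this simplified model the request count for client-side resolution grows with path depth, whereas the stateless client always sends a single request and keeps no per-path state in memory.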
