Redox: Improving I/O Efficiency of Model Training Through File Redirection
Yuhao Li, Xuanhua Shi, Yunfei Zhao, Yongluan Zhou, Yusheng Hua, Xuehai Qian
TL;DR
Redox tackles the I/O bottleneck in large-scale model training by exploiting file redirection, a property of training workloads: the data returned for a requested file can be data from a different file. It introduces a batched, single-load-per-chunk strategy with local and distributed protocols, and a mapping between virtual and physical chunks to maximize the data consumed per I/O operation. The system includes a remote prefetch mechanism to further hide latency across nodes, and the paper analyzes the effect of redirection on randomness to show that training efficiency remains intact. Empirical results across LibriSpeech, ImageNet-1k, and ImageNet-21k show up to 4.57x end-to-end speedups over PyTorch, with convergence preserved and robust performance across memory and hardware configurations.
Abstract
This paper proposes Redox, a training data management system designed to achieve high I/O efficiency. The key insight is a new observation, file redirection: in model training, when the training data in one file is requested, the system has the flexibility to return the data of another file. Based on this property, Redox starts from a bold design principle: chunks of data files are always read from disk in batch, and once a chunk is loaded, all files in it are consumed without being loaded again. We propose efficient local and distributed file read protocols based on this principle that both minimize wasted data reads and enable opportunistic prefetching from remote nodes. Moreover, we analyze the impact of file redirection on randomness and show that it has little effect on training efficiency. Experimental results indicate that Redox significantly accelerates data fetching in training, achieving up to a 4.57x improvement in end-to-end training compared to PyTorch.
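To make the redirection idea concrete, the following is a minimal, single-node sketch in Python. It is not the Redox protocol itself: the class name `RedirectingLoader`, the in-memory `chunks` dictionary standing in for on-disk chunks, and the policy of picking a random buffered file on a miss are all assumptions made for illustration. It shows only the principle stated in the abstract: a chunk is loaded as a whole, and every file in it is consumed before any further chunk is read.

```python
import random


class RedirectingLoader:
    """Toy model of redirected reads (hypothetical, not the paper's code)."""

    def __init__(self, chunks, seed=0):
        # chunks: dict mapping chunk id -> list of file names stored in that chunk
        self.chunks = chunks
        self.file_to_chunk = {f: c for c, files in chunks.items() for f in files}
        self.rng = random.Random(seed)
        self.buffer = []            # files loaded into memory but not yet consumed
        self.unread = set(chunks)   # chunks not yet loaded in this epoch
        self.chunk_loads = 0        # number of simulated disk reads

    def _load_chunk(self, chunk_id):
        # Stand-in for one large sequential read of a whole chunk.
        self.unread.remove(chunk_id)
        self.buffer.extend(self.chunks[chunk_id])
        self.chunk_loads += 1

    def read(self, requested):
        """Return an unconsumed file; it may differ from `requested`."""
        if not self.buffer:
            # Only touch the disk when everything already loaded was consumed.
            home = self.file_to_chunk[requested]
            self._load_chunk(home if home in self.unread else min(self.unread))
        if requested in self.buffer:
            self.buffer.remove(requested)
            return requested
        # Redirection: serve a different, already loaded file instead.
        return self.buffer.pop(self.rng.randrange(len(self.buffer)))


# One epoch: request every file once, in a shuffled order.
chunks = {i: [f"file_{i}_{j}" for j in range(4)] for i in range(3)}
requests = [f for files in chunks.values() for f in files]
random.Random(1).shuffle(requests)

loader = RedirectingLoader(chunks)
returned = [loader.read(f) for f in requests]
```

In this sketch, an epoch that requests each file once returns every file exactly once, and `loader.chunk_loads` equals the number of chunks, regardless of the request order. The order in which files are returned differs from the order requested, which is the effect on randomness that the abstract says is analyzed.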
