Redox: Improving I/O Efficiency of Model Training Through File Redirection

Yuhao Li, Xuanhua Shi, Yunfei Zhao, Yongluan Zhou, Yusheng Hua, Xuehai Qian

TL;DR

Redox tackles the I/O bottleneck in large-scale model training by exploiting a unique file redirection capability: the data returned for a requested file can be data from a different file. It introduces a batched, single-load-per-chunk strategy with local and distributed protocols, and a mapping between virtual and physical chunks to maximize data consumption per I/O operation. The system includes a remote prefetch mechanism to further hide latency across nodes, while carefully analyzing randomness to ensure training efficiency remains intact. Empirical results across LibriSpeech, ImageNet-1k, and ImageNet-21k show up to 4.57x end-to-end speedups over PyTorch, with convergence preserved and robust performance across memory and hardware configurations.

Abstract

This paper proposes Redox, a training data management system designed to achieve high I/O efficiency. The key insight is a new observation, file redirection: during model training, when the data in one file is requested, the system is free to return the data of another file. Building on this property, Redox adopts a bold design principle: chunks of data files are always read from disk in batch, and once a chunk is loaded, all files in it are consumed without being loaded again. Based on this principle, we propose efficient local and distributed file read protocols that both minimize wasted data reads and enable opportunistic prefetching from remote nodes. Moreover, we analyze the impact of file redirection on randomness and show that it has little effect on training efficiency. Experimental results indicate that Redox significantly accelerates data fetching in training, achieving up to a 4.57x improvement in end-to-end training compared to PyTorch.
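The redirection idea in the abstract can be illustrated with a toy, in-memory sketch. This is not Redox's actual protocol: the class `RedirectingLoader`, its `get` method, the `chunk_reads` counter, and the helper `run_epoch` are hypothetical names introduced only for illustration, and the policy for choosing the next chunk is an assumption. The sketch only shows the stated principle: a chunk is read once, every file in it is consumed before another chunk is read, and a request for one file may be answered with a different file.

```python
import random
from collections import deque


class RedirectingLoader:
    """Toy model of chunked reads with file redirection (hypothetical API)."""

    def __init__(self, chunks):
        # chunks: list of lists of file names; each inner list models one on-disk chunk.
        self.chunks = chunks
        self.chunk_of = {f: i for i, chunk in enumerate(chunks) for f in chunk}
        self.buffer = deque()   # files loaded from disk but not yet consumed
        self.loaded = set()     # ids of chunks already read in this epoch
        self.chunk_reads = 0    # number of simulated chunk-sized disk reads

    def _load_chunk(self, cid):
        # Simulate one batched read that brings in every file of the chunk.
        self.chunk_reads += 1
        self.loaded.add(cid)
        self.buffer.extend(self.chunks[cid])

    def get(self, requested):
        """Return some not-yet-consumed file, not necessarily `requested`."""
        if not self.buffer:
            cid = self.chunk_of[requested]
            if cid in self.loaded:
                # Assumed policy: fall back to the first chunk not read yet.
                cid = next(i for i in range(len(self.chunks)) if i not in self.loaded)
            self._load_chunk(cid)
        # Redirection: serve whatever is already in memory.
        return self.buffer.popleft()


def run_epoch(chunks, seed=0):
    """Request every file once in shuffled order; return served files and read count."""
    order = [f for chunk in chunks for f in chunk]
    random.Random(seed).shuffle(order)
    loader = RedirectingLoader(chunks)
    served = [loader.get(f) for f in order]
    return served, loader.chunk_reads
```

In this sketch, an epoch over `k` chunks issues exactly `k` chunk reads regardless of the request order, and every file is served exactly once. The served order is sequential within a chunk, so the sketch does not model the randomness considerations that the paper analyzes.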

Paper Structure

This paper contains 24 sections, 7 figures, 7 tables, 6 algorithms.

Figures (7)

  • Figure 1: File Redirection: Key Insights of Redox Protocols
  • Figure 2: Overall Performance - LibriSpeech
  • Figure 3: Overall Performance - ImageNet-21k
  • Figure 4: Overall Performance - ImageNet-1k
  • Figure 5: Remote VC Memory Usage Analysis
  • ...and 2 more figures