BLoad: Enhancing Neural Network Training with Efficient Sequential Data Handling

Raphael Ruschel, A. S. M. Iftekhar, B. S. Manjunath, Suya You

TL;DR

The paper tackles efficient training on variable-length sequences in distributed data-parallel (DDP) settings by introducing BLoad, a block-based padding scheme that builds fixed-length blocks of size $T_{\max}$ from shorter sequences and uses a start-index table to track per-sequence boundaries within a DDP workflow. The approach reduces padding by more than $100\times$ without deleting any frames, improving both training time and recall on the Action Genome dataset while avoiding the deadlock risk that standard DDP faces with variable-length data. Experiments compare naive padding, sampling, and the proposed block padding, showing a substantial reduction in wasted computation and favorable performance when temporal structure is preserved. The method applies to multiple modalities (video, audio, text) and is publicly available on GitHub.

Abstract

The increasing complexity of modern deep neural network models and the expanding sizes of datasets necessitate the development of optimized and scalable training methods. In this white paper, we address the challenge of efficiently training neural network models on sequences of varying sizes. We propose a novel training scheme that enables efficient distributed data-parallel training on sequences of different sizes with minimal overhead. Using this scheme, we reduced the amount of padding by more than $100\times$ without deleting a single frame, improving both training time and Recall in our experiments.

Paper Structure (5 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Sample dataset with 8 videos of varying length - Each $V_i$ represents an individual video, and each yellow square represents a frame.
  • Figure 2: Deadlock situation when each GPU receives sequences of different lengths. In this situation, after the third iteration, GPU 1 will not have any gradients to report, causing GPU 2 to wait without any error message.
  • Figure 3: Naive padding solution - Every sequence in the dataset is padded to match the length of the longest sequence, generally by adding $0$'s or repeating the last entry of the sequence.
  • Figure 4: Sampling solution, where each sequence is trimmed to a smaller size, usually the average sequence length in the dataset. In this approach, one sequence might be broken into several smaller portions, which prevents training models with long temporal support.
  • Figure 5: Our proposed padding approach - BLoad (as in block load) - aims to construct sequences of size $T_{\max}$ using shorter sequences as building blocks.
  • ...and 2 more figures
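
The difference between the naive padding of Figure 3 and the block construction of Figure 5 can be illustrated with a small sketch. This is not the authors' implementation: the greedy first-fit-decreasing packing order, the function names `naive_padding` and `block_pack`, and the example lengths are all assumptions made for illustration. The only ideas taken from the text are that blocks have a fixed size $T_{\max}$ and that a table of start indices records where each sequence begins inside its block.

```python
from typing import List, Optional, Tuple


def naive_padding(lengths: List[int]) -> int:
    """Frames of padding needed when every sequence is padded to the longest one."""
    t_max = max(lengths)
    return sum(t_max - n for n in lengths)


def block_pack(
    lengths: List[int], t_max: Optional[int] = None
) -> Tuple[List[List[Tuple[int, int]]], int]:
    """Pack sequences into blocks of size t_max (hypothetical greedy strategy).

    Returns (blocks, padding), where each block is a list of
    (sequence_id, start_index) pairs, i.e. a start-index table for that block,
    and padding is the number of unused frame slots over all blocks.
    """
    if t_max is None:
        t_max = max(lengths)

    # Assumption: first-fit decreasing. The source does not specify the order.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    blocks: List[List[Tuple[int, int]]] = []
    used: List[int] = []  # frames already occupied in each block

    for seq_id in order:
        n = lengths[seq_id]
        if n > t_max:
            raise ValueError(f"sequence {seq_id} is longer than t_max")
        for b in range(len(blocks)):
            if used[b] + n <= t_max:
                # The sequence starts right after the frames already in the block.
                blocks[b].append((seq_id, used[b]))
                used[b] += n
                break
        else:
            # No existing block has room: open a new one.
            blocks.append([(seq_id, 0)])
            used.append(n)

    padding = sum(t_max - u for u in used)
    return blocks, padding


if __name__ == "__main__":
    # Hypothetical lengths for 8 videos, loosely mirroring Figure 1.
    lengths = [3, 8, 5, 2, 6, 4, 7, 1]
    blocks, padding = block_pack(lengths)
    print("naive padding:", naive_padding(lengths))
    print("block padding:", padding)
    for block in blocks:
        print(block)
```

For these made-up lengths, naive padding adds 28 frames while the packed blocks need only 4, and no sequence is split or trimmed. The numbers are illustrative only and say nothing about the reduction reported in the paper. Because every block has the same length, each GPU receives the same number of time steps per iteration, which is the property that sidesteps the deadlock described in Figure 2.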