Training and Inference Efficiency of Encoder-Decoder Speech Models

Piotr Żelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

TL;DR

The paper addresses two inefficiencies in encoder-decoder speech models: padding in training mini-batches and the autoregressive decoding bottleneck at inference. It introduces synchronized 2D bucketing with a batch-size optimizer, a concurrent bucketing data pipeline, and an architecture variant (Canary-1B-Flash) that shifts parameters from the decoder to the encoder, reducing decoder load while increasing encoder capacity. Key results are up to 4x fewer GPUs for the same wall time, up to 2x shorter training wall time on the original resources, and about a 3x inference speedup, all without sacrificing accuracy. The authors release open-source training code and the Canary-1B-Flash model, while noting limitations related to dataset size and language support, as well as ethical considerations.

Abstract

The attention encoder-decoder architecture is the backbone of several recent top-performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask whether we are training these speech models efficiently, and what we can do to improve. We argue that a major, if not the most severe, detrimental factor for training efficiency is the sampling strategy for sequential data. We show that negligence in mini-batch sampling leads to more than 50% of computation being spent on padding. To that end, we study, profile, and optimize Canary-1B training to show gradual improvement in GPU utilization, leading to an up to 5x increase in average batch sizes versus the original training settings. This in turn allows us to train an equivalent model using 4x fewer GPUs in the same wall time, or to use the original resources and train it in 2x shorter wall time. Finally, we observe that the major inference bottleneck lies in the autoregressive decoder steps. We find that adjusting the model architecture to transfer model parameters from the decoder to the encoder results in a 3x inference speedup, as measured by the inverse real-time factor (RTFx), while preserving the accuracy and the compute requirements for convergence. The training code and models will be available as open-source.
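
The padding claim in the abstract can be illustrated with a small sketch. It measures the fraction of a padded batch tensor that is padding when sequences are batched in random order, and again when they are first sorted by length (a crude stand-in for bucketing). The length distribution, batch size, and function name are assumptions made for illustration; they are not the paper's data or code.

```python
import random


def padding_fraction(lengths, batch_size):
    """Fraction of elements that are padding when every batch is
    padded to the length of its longest sequence."""
    padded_total = 0
    real_total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padded_total += max(batch) * len(batch)
        real_total += sum(batch)
    return 1.0 - real_total / padded_total


rng = random.Random(0)
# Hypothetical skewed sequence lengths (in frames), NOT the paper's data.
lengths = [int(rng.lognormvariate(5.0, 0.8)) + 1 for _ in range(10_000)]

# Random order: long and short sequences end up in the same batch.
random_order = padding_fraction(lengths, batch_size=32)
# Sorted order: neighbours have similar lengths, so little padding is needed.
sorted_order = padding_fraction(sorted(lengths), batch_size=32)

print(f"padding with random batches: {random_order:.1%}")
print(f"padding with sorted batches: {sorted_order:.1%}")
```

Sorting the whole dataset removes sampling randomness, which is why practical pipelines group sequences into buckets of similar length and sample within them rather than sorting globally.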

Paper Structure

This paper contains 7 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Visualization of a randomly sampled mini-batch representing variable length input speech and output transcription data as 3D activation tensors with shape (batch, length, hidden_dim). The sequence lengths were sampled from our training data distribution and the hidden dimension was set to 8 for readability. Grey elements indicate padding elements. There are two axes of padding, one in each tensor, limiting the efficiency of both encoder and decoder modules.
  • Figure 2: Memory usage profile of Canary-1B training on an RTX 6000 Ada 48GB GPU using 1D dynamic bucketing with the equal batch duration heuristic. Each peak denotes the point right after training loss computation for a single training step. The memory usage for the majority of training steps is well below the maximum, showing room for efficiency improvement.
  • Figure 3: Output token rate distribution on a 100k sample of Canary-1B-Flash training data. Utterances are grouped into duration bins in 2 s increments. Short utterances have significantly more transcript tokens per second, partially due to a fixed-length prompt fed to the decoder. This highlights the need for careful data filtering and tuning of 2D bucketing settings.
  • Figure 4: Canary-1B training efficiency comparison across four training schemes. Scheme A is the Canary-1B baseline. Scheme B adds tokens-per-second (TPS) filtering, freeing up GPU memory. Scheme C replaces the batch duration heuristic with OOMptimizer for batch size estimation. Scheme D adds 2D bucketing to further reduce the number of padding tokens. The efficiency gains directly translate to quicker validation WER convergence. The horizontal axes for all metrics except WER show the first 100k training steps.
  • Figure 5: Comparison of Canary-1B-Flash convergence speed with the fully optimized 2D bucketing scheme (orange) vs. a fixed batch size of 768 (blue), both on 32 GPUs.
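
The TPS filtering and 2D bucketing ideas named in the captions for Figures 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, bin edges, and TPS threshold are hypothetical, and one shared set of token bins is used for all duration bins for simplicity. Each example is a `(duration_s, num_tokens)` pair.

```python
from bisect import bisect_left
from collections import defaultdict


def tps_ok(duration_s, num_tokens, max_tps):
    """Keep an utterance only if its tokens-per-second rate is at most max_tps."""
    return num_tokens / duration_s <= max_tps


def assign_bucket(duration_s, num_tokens, duration_edges, token_edges):
    """Return (i, j), the first duration bin and token bin whose upper
    edge fits the example, or None if it exceeds the largest edge."""
    i = bisect_left(duration_edges, duration_s)
    j = bisect_left(token_edges, num_tokens)
    if i == len(duration_edges) or j == len(token_edges):
        return None
    return i, j


def group_into_buckets(examples, duration_edges, token_edges, max_tps):
    """Filter by TPS, then group examples by their 2D bucket index."""
    buckets = defaultdict(list)
    for duration_s, num_tokens in examples:
        if not tps_ok(duration_s, num_tokens, max_tps):
            continue  # outlier: too many tokens for its duration
        key = assign_bucket(duration_s, num_tokens, duration_edges, token_edges)
        if key is not None:
            buckets[key].append((duration_s, num_tokens))
    return dict(buckets)


# Hypothetical examples and settings, chosen only for illustration.
examples = [(1.5, 12), (1.8, 40), (9.0, 35), (9.5, 90), (2.0, 200)]
buckets = group_into_buckets(
    examples,
    duration_edges=[2.0, 10.0],  # upper bounds in seconds
    token_edges=[16, 64, 128],   # upper bounds in tokens
    max_tps=25.0,
)
```

Because every example in a bucket is bounded on both the input and output axes, padding is limited in both the encoder and decoder tensors, and a separate batch size could then be chosen per bucket.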