Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation

Nithin Rao Koluguri; Travis Bartley; Hainan Xu; Oleksii Hrinchuk; Jagadeesh Balam; Boris Ginsburg; Georg Kucsko

TL;DR

This work investigates training end-to-end speech recognition and translation models on longer utterances with complete punctuation and capitalization, using the FastConformer architecture to support sequences up to 60 seconds. The authors introduce sentence-level PnC training and a Hybrid TDT-CTC loss to enable efficient, accurate decoding, and demonstrate a 25% relative WER improvement on Earnings-21/22 and a 15% relative BLEU gain on MuST-C. They show that extending context up to about 40 seconds yields most benefits, with diminishing returns beyond that, and that end-to-end models with PnC outperform cascaded approaches. The results, open-sourced weights and code, and insights on long-context learning advance practical long-form ASR and translation, with implications for other large seq2seq tasks.

Abstract

This paper presents a new method for training sequence-to-sequence models for speech recognition and translation tasks. Instead of the traditional approach of training models on short segments containing only lowercase text or sentences with partial punctuation and capitalization (PnC), we propose training on longer utterances that include complete sentences with proper punctuation and capitalization. We achieve this by using the FastConformer architecture, which allows training 1-billion-parameter models with sequences up to 60 seconds long with full attention. However, while training with PnC enhances the overall performance, we observed that accuracy plateaus when training on sequences longer than 40 seconds across various evaluation settings. Our proposed method significantly improves punctuation and capitalization accuracy, showing a 25% relative word error rate (WER) improvement on the Earnings-21 and Earnings-22 benchmarks. Additionally, training on longer audio segments increases the overall model accuracy across speech recognition and translation benchmarks. The model weights and training code are open-sourced through NVIDIA NeMo.

Paper Structure (20 sections, 1 equation, 2 figures, 8 tables)

Introduction
Background
FastConformer
CTC, Transducer, TDT and their Hybrid
Hybrid TDT-CTC models
Method
Sentence-Level Training with Punctuation and Capitalization (PnC)
Training with Longer Context
Datasets
Speech Recognition
Training
Evaluation
Speech Translation
Experiments & Results
Speech Recognition
...and 5 more sections

Figures (2)

Figure 1: Hybrid-TDT-CTC Model. Variables $v$ and $d$ represent the vocabulary and durations supported by the TDT model. The final loss of the model is computed as a linear interpolation of TDT loss and CTC loss.
Figure 2: Method of concatenating segments with partial punctuation and capitalization from the LibriSpeech-PC set (Meister et al., 2023) to form complete sentence-level segments.
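
The two captions above describe simple operations that can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the `Segment` type, the function names, the `ctc_weight` default, the rule that a sentence ends at `.`, `?` or `!`, and the 60-second cap are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List

# Assumption: a segment closes a sentence when its text ends with one of these.
SENTENCE_END = (".", "?", "!")


@dataclass
class Segment:
    text: str
    duration: float  # seconds


def hybrid_loss(tdt_loss: float, ctc_loss: float, ctc_weight: float = 0.3) -> float:
    """Linear interpolation of the TDT and CTC losses (Figure 1).

    The weight value 0.3 is an illustrative default, not taken from the paper.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return (1.0 - ctc_weight) * tdt_loss + ctc_weight * ctc_loss


def merge_into_sentences(segments: List[Segment],
                         max_duration: float = 60.0) -> List[Segment]:
    """Concatenate consecutive partial segments into sentence-level ones (Figure 2).

    A merged segment is emitted when the running text ends a sentence, or
    earlier if adding the next segment would exceed max_duration.
    """
    merged: List[Segment] = []
    texts: List[str] = []
    total = 0.0

    def flush() -> None:
        nonlocal texts, total
        if texts:
            merged.append(Segment(" ".join(texts), total))
        texts, total = [], 0.0

    for seg in segments:
        # Respect the duration cap before appending the next piece.
        if texts and total + seg.duration > max_duration:
            flush()
        texts.append(seg.text.strip())
        total += seg.duration
        # A sentence-final mark closes the current merged segment.
        if seg.text.rstrip().endswith(SENTENCE_END):
            flush()
    flush()  # keep any trailing partial sentence
    return merged


parts = [
    Segment("He opened the door,", 4.0),
    Segment("and walked in.", 3.0),
    Segment("Nobody was home.", 2.5),
]
sentences = merge_into_sentences(parts)
```

Here `sentences` holds two segments, the first joining the two partial pieces into one complete sentence. A real pipeline would also need to concatenate the corresponding audio and handle abbreviations or quotations that end in a period, which this sketch ignores.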