Lightweight Audio Segmentation for Long-form Speech Translation

Jaesong Lee; Soyoon Kim; Hanbyul Kim; Joon Son Chung

TL;DR

This work tackles long-form speech translation by introducing a lightweight frame-level audio segmentation model. The model is pre-trained with ASR-with-punctuation to learn sentence boundaries and is tuned at inference time to fit different ST systems. Experiments on MuST-C En-De and En-Ja show the approach improves BLEU and reduces model size compared to prior segmentation methods, with a 27.3M-parameter model outperforming larger baselines. The work demonstrates the importance of aligning segmentation with the downstream ST model for practical, on-device, streaming translation.

Abstract

Speech segmentation is an essential part of speech translation (ST) systems in real-world scenarios. Since most ST models are designed to process short speech segments, long-form audio must be partitioned into shorter segments before translation. Recently, data-driven approaches to speech segmentation have been developed. Although these approaches improve overall translation quality, a performance gap remains because the segmentation models are mismatched with the ST systems they feed. In addition, prior work relies on large self-supervised speech models, which consume significant computational resources. In this work, we propose a segmentation model that achieves better speech translation quality with a small model size. We propose an ASR-with-punctuation task as an effective pre-training strategy for the segmentation model. We also show that properly integrating the speech segmentation model into the underlying ST system is critical to improving overall translation quality at inference time.

Paper Structure (9 sections, 1 equation, 3 figures, 3 tables)

Introduction
Architecture
Inference
Pre-training via ASR-with-punctuation
Integration to speech translation system
Experiments
Results
Evaluation of ASR punctuation prediction
Conclusion

Figures (3)

Figure 1: (a) Oracle segmentation and its corresponding reference text. (b) Prediction of the segmentation model without pre-training, and its corresponding ASR results. (c) Prediction of the segmentation model with pre-training, and its corresponding ASR results. ASR errors are colored red. See the section "Pre-training via ASR-with-punctuation" for details.
Figure 2: Segmentation and corresponding ASR results with two different maxlen configurations. Note that the two results are inferred from the same segmentation model. See the section "Integration to speech translation system" for details.
Figure 3: En-De BLEU scores for various maxlen values.
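
The captions for Figures 2 and 3 note that one segmentation model can yield different segmentations depending on maxlen. The following is a minimal sketch of how that can happen, not the paper's algorithm. It assumes the model emits a per-frame boundary probability, that maxlen is a cap on segment length measured in frames, and that a forced split is placed at the most probable boundary inside the window. The function name `segment`, the `threshold` parameter, and the probability values are all hypothetical.

```python
from typing import List, Tuple


def segment(probs: List[float], maxlen: int, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Split frames [0, len(probs)) into half-open (start, end) segments.

    probs[i] is an assumed per-frame probability that a segment boundary
    falls right after frame i. No segment is longer than maxlen frames.
    """
    if maxlen < 1:
        raise ValueError("maxlen must be at least 1")

    segments: List[Tuple[int, int]] = []
    n = len(probs)
    start = 0
    while start < n:
        limit = min(start + maxlen, n)  # exclusive cap on this segment's end
        end = None
        # Prefer the first confident boundary inside the allowed window.
        for i in range(start, limit):
            if probs[i] >= threshold:
                end = i + 1
                break
        if end is None:
            if limit == n:
                # The remaining audio already fits within maxlen.
                end = n
            else:
                # No confident boundary: force a split at the most
                # probable frame inside the window.
                best = max(range(start, limit), key=lambda i: probs[i])
                end = best + 1
        segments.append((start, end))
        start = end
    return segments


# Hypothetical boundary probabilities for eight frames.
probs = [0.1, 0.2, 0.9, 0.1, 0.3, 0.1, 0.1, 0.8]
long_cfg = segment(probs, maxlen=8)   # [(0, 3), (3, 8)]
short_cfg = segment(probs, maxlen=3)  # [(0, 3), (3, 5), (5, 8)]
```

With the larger cap, cuts fall only at confident boundaries. With the smaller cap, the same probabilities force an extra split at a weaker boundary, which is one way a maxlen setting could change what the downstream ST model receives.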
