Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation

Peidong Wang; Jian Xue; Jinyu Li; Junkun Chen; Aswin Shanmugam Subramanian

TL;DR

The paper studies how to use source-language information, called soft language identification (LID), in language-agnostic many-to-one end-to-end speech translation (ST). It introduces a Linear Input Network (LIN), a $D \times D$ linear layer applied to the input features. The LIN is initialized as the identity matrix and trained on language-specific data while all other parameters stay fixed, so a specified language can be improved without sacrificing overall multilingual performance. Experiments with neural transducer-based ST (LAMASSU-LIN variants) show targeted gains for Japanese (JA) and German (DE) with limited degradation for other languages, and 99% traffic scenarios demonstrate favorable trade-offs. The approach provides a simple, reversible mechanism to leverage user-provided LID in multilingual ST and could extend to attention-based encoder-decoder (AED) models and broader multilingual translation settings.

Abstract

Language-agnostic many-to-one end-to-end speech translation models can convert audio signals from different source languages into text in a target language. These models do not need source language identification, which improves user experience. In some cases, the input language can be given or estimated. Our goal is to use this additional language information while preserving the quality of the other languages. We accomplish this by introducing a simple and effective linear input network. The linear input network is initialized as an identity matrix, which ensures that the model can perform as well as, or better than, the original model. Experimental results show that the proposed method can successfully enhance the specified language, while keeping the language-agnostic ability of the many-to-one ST models.
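The identity initialization described above can be illustrated with a minimal sketch. It assumes the linear input network is a plain $D \times D$ matrix applied to each feature frame; the helper names (`identity`, `apply_lin`) and the toy dimension are illustrative choices, not taken from the paper. At initialization the layer passes features through unchanged, which is why the adapted model starts out equivalent to the original one.

```python
from typing import List

Matrix = List[List[float]]


def identity(d: int) -> Matrix:
    """Build a d x d identity matrix (the assumed LIN starting point)."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def apply_lin(weight: Matrix, frame: List[float]) -> List[float]:
    """Apply the linear input network to one feature frame: y = W x."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weight]


D = 4                              # toy feature dimension (hypothetical)
W = identity(D)                    # LIN weights before any language-specific training
frame = [0.5, -1.0, 2.0, 0.0]      # one made-up input feature frame
out = apply_lin(W, frame)

# With identity weights the features reach the frozen ST model unchanged.
print(out == frame)
```

In this reading, language-specific training would update only `W`, while the rest of the ST model stays fixed.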

Paper Structure (13 sections, 1 figure, 2 tables)


Introduction
Method Description
Language-Agnostic Many-to-One E2E ST
Neural Transducer for Streaming Many-to-One E2E ST
Linear Input Network for Many-to-One ST
Experimental Setup
Dataset
Models
Implementation Details
Evaluation Results
The Impact of More Languages
LAMASSU-LIN
Concluding Remarks

Figures (1)

Figure 1: Illustrations of a many-to-one ST model and the proposed soft LID method using a linear input network for input language 2. For each input language, we use a different LIN layer.
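The per-language arrangement in the caption can be sketched as a lookup from a language tag to its own LIN matrix. This is only an illustration under stated assumptions: the class name `SoftLidFrontEnd`, the language tags, and the choice to pass features through unchanged when no language is given are all hypothetical and are not specified by the source.

```python
from typing import Dict, List, Optional

Matrix = List[List[float]]


def identity(d: int) -> Matrix:
    """Build a d x d identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def apply_lin(weight: Matrix, frame: List[float]) -> List[float]:
    """Apply a LIN matrix to one feature frame: y = W x."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weight]


class SoftLidFrontEnd:
    """Hypothetical front end that keeps one LIN matrix per input language."""

    def __init__(self, dim: int, languages: List[str]) -> None:
        # Every language starts from the identity matrix.
        self.lins: Dict[str, Matrix] = {lang: identity(dim) for lang in languages}

    def __call__(self, frame: List[float], lang: Optional[str] = None) -> List[float]:
        weight = self.lins.get(lang) if lang is not None else None
        if weight is None:
            # Assumption: with no (or an unknown) language tag, features are
            # passed through unchanged, i.e. the language-agnostic path.
            return list(frame)
        return apply_lin(weight, frame)


front = SoftLidFrontEnd(dim=2, languages=["ja", "de"])
front.lins["ja"] = [[2.0, 0.0], [0.0, 1.0]]   # made-up "trained" weights for illustration

with_ja = front([1.0, 3.0], lang="ja")   # uses the JA-specific LIN
with_de = front([1.0, 3.0], lang="de")   # DE LIN is still the identity
no_lid = front([1.0, 3.0])               # no language given
```

Keeping each LIN as a separate small matrix means one language's adaptation can be added or dropped without touching the shared model or the other languages.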
