A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding

Gaëlle Laperrière; Sahar Ghannay; Bassam Jabaian; Yannick Estève

TL;DR

This work addresses SLU in multilingual, low-resource settings by tackling the semantic forgetting that occurs when the SAMU-XLSR encoder is fine-tuned on a downstream task. It introduces a dual-task learning framework that jointly optimizes SAMU-XLSR semantic enrichment and SLU performance with the loss $\mathrm{loss} = \mathrm{loss}(\text{SAMU-XLSR}) + \lambda \, \mathrm{loss}(\text{SLU})$, where $\lambda$ is tuned in $[1, 20]$. Training both objectives together reduces the need to re-train separate components and lowers the parameter count. Empirically, the method achieves state-of-the-art CER on MEDIA and PortMEDIA (≈17.9% and ≈24.1%) and reaches a top result of ≈29.1% CER on TARIC-SLU when close and distant language data are used together, demonstrating effective cross-lingual transfer and semantic preservation. Overall, the dual-task approach enhances multilinguality without sacrificing semantic specificity, enabling better SLU performance for unseen low-resource languages with fewer parameters and training steps.
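The combined objective above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cosine_distance` and `dual_task_loss` are hypothetical, the cosine distance is only an assumed stand-in for the SAMU-XLSR semantic loss, and the range check simply mirrors the $[1, 20]$ interval quoted in the summary.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors.

    Assumption: used here only as a stand-in for the SAMU-XLSR semantic
    loss between a speech embedding and a text embedding.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dual_task_loss(samu_loss, slu_loss, lam):
    """Combine both objectives: loss(SAMU-XLSR) + lambda * loss(SLU).

    Assumption: lambda is restricted to [1, 20], the tuning range
    reported in the summary; rejecting other values is a choice made
    for this sketch only.
    """
    if not 1 <= lam <= 20:
        raise ValueError("lambda is expected to lie in [1, 20]")
    return samu_loss + lam * slu_loss


# Toy values: a semantic loss from two small vectors and a made-up SLU loss.
semantic = cosine_distance([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
total = dual_task_loss(semantic, slu_loss=0.5, lam=10)
```

A larger $\lambda$ shifts the balance toward the SLU objective, while $\lambda = 1$ weights the two terms equally.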

Abstract

Self-Supervised Learning (SSL) is widely used to efficiently represent speech for Spoken Language Understanding (SLU), gradually replacing conventional approaches. Meanwhile, textual SSL models have been proposed to encode language-agnostic semantics. The SAMU-XLSR framework used this semantic information to enrich multilingual speech representations. A recent study investigated in-domain semantic enrichment of SAMU-XLSR by specializing it on downstream transcriptions, leading to state-of-the-art results on a challenging SLU task. This study focuses on the loss of multilingual performance and the lack of task-specific semantic training induced by such specialization in close languages without any SLU involvement. We also consider the loss of SAMU-XLSR's initial cross-lingual abilities caused by a separate SLU fine-tuning. Therefore, this paper proposes a dual task learning approach to improve SAMU-XLSR semantic enrichment while considering distant languages for multilingual and language portability experiments.

Paper Structure (16 sections, 3 figures, 4 tables)

This paper contains 16 sections, 3 figures, 4 tables.

Introduction
SLU tasks in different languages
The French MEDIA dataset
The Italian PortMEDIA dataset
The Tunisian TARIC-SLU dataset
Evaluation Metrics
SAMU-XLSR
SLU fine-tuning
Dual task learning
Experimental results
Task-oriented semantic enrichment
Language portability
Close languages
Distant languages
Conclusion
...and 1 more section

Figures (3)

Figure 1: Training and specialization process of SAMU-XLSR.
Figure 2: Fine-tuning process of SAMU-XLSR for an SLU task.
Figure 3: Training process of the SLU and SAMU-XLSR modules combined in a dual architecture.
