A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding
Gaëlle Laperrière, Sahar Ghannay, Bassam Jabaian, Yannick Estève
TL;DR
This work addresses Spoken Language Understanding (SLU) in multilingual, low-resource contexts by tackling the semantic forgetting that occurs when the SAMU-XLSR encoder is fine-tuned on a downstream task. It introduces a dual-task learning framework that jointly optimizes SAMU-XLSR semantic enrichment and SLU performance with the combined loss $\mathcal{L} = \mathcal{L}_{\text{SAMU-XLSR}} + \lambda\,\mathcal{L}_{\text{SLU}}$, where $\lambda$ is tuned in $[1, 20]$. This reduces the need to re-train separate components and lowers the parameter count. Empirically, the method achieves state-of-the-art CER on MEDIA and PortMEDIA (≈17.9% and ≈24.1%), and its best result on TARIC-SLU (≈29.1% CER) is obtained when close and distant language data are used together, demonstrating effective cross-lingual transfer and semantic preservation. Overall, the dual-task approach enhances multilinguality without sacrificing semantic specificity, enabling better SLU performance for unseen low-resource languages with fewer parameters and training steps.
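The combined objective above can be illustrated with a minimal Python sketch. It only shows how the two loss terms are weighted and how a sweep over $λ$ in $[1,20]$ might be set up. The function names `dual_task_loss` and `sweep_lambda` and the numeric loss values are hypothetical and not taken from the paper; in an actual training loop the two terms would be computed by the model on each batch.

```python
from typing import Iterable, List, Tuple


def dual_task_loss(samu_loss: float, slu_loss: float, lam: float) -> float:
    """Return loss(SAMU-XLSR) + lam * loss(SLU).

    samu_loss and slu_loss are assumed to be scalar loss values that were
    already computed elsewhere (hypothetical inputs for this sketch).
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return samu_loss + lam * slu_loss


def sweep_lambda(
    samu_loss: float, slu_loss: float, lambdas: Iterable[float]
) -> List[Tuple[float, float]]:
    """Pair each candidate weight with the resulting combined loss."""
    return [(lam, dual_task_loss(samu_loss, slu_loss, lam)) for lam in lambdas]


# Made-up loss values, for illustration only.
for lam, total in sweep_lambda(samu_loss=0.25, slu_loss=0.5, lambdas=[1, 10, 20]):
    print(f"lambda={lam}: combined loss={total}")
```

A larger $λ$ gives the SLU term more influence relative to the semantic enrichment term, which is why the weight is treated as a tunable hyperparameter.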
Abstract
Self-Supervised Learning (SSL) is widely used to efficiently represent speech for Spoken Language Understanding (SLU), gradually replacing conventional approaches. Meanwhile, textual SSL models have been proposed to encode language-agnostic semantics. The SAMU-XLSR framework employs this semantic information to enrich multilingual speech representations. A recent study investigated in-domain semantic enrichment of SAMU-XLSR by specializing it on downstream transcriptions, leading to state-of-the-art results on a challenging SLU task. The present study focuses on the loss of multilingual performance and the lack of specific-semantics training induced by such specialization in close languages without any SLU implication. We also consider the loss of SAMU-XLSR's initial cross-lingual abilities caused by a separate SLU fine-tuning. This paper therefore proposes a dual task learning approach to improve SAMU-XLSR semantic enrichment while considering distant languages for multilingual and language portability experiments.
