Semantic enrichment towards efficient speech representations
Gaëlle Laperrière, Ha Nguyen, Sahar Ghannay, Bassam Jabaian, Yannick Estève
TL;DR
The paper tackles semantic extraction in end-to-end SLU under data scarcity by enriching SSL-based speech representations with language-agnostic semantics through SAMU-XLSR, which aligns XLS-R frame embeddings with LaBSE sentence embeddings. It investigates in-domain specialization using small transcribed datasets, analyzes layer-wise encoder capacity, and assesses cross-lingual and cross-domain portability using MEDIA and Italian PortMEDIA. The authors demonstrate that specialized SAMU-XLSR variants, particularly IT⊕FR and a 17-layer frozen configuration, can match or approach fully fine-tuned baselines at a fraction of the compute, achieving a new state-of-the-art Concept Error Rate (CER) of 25.1% on Italian PortMEDIA. The results reveal strong semantic transfer at the sentence level but also expose cross-domain limitations: specialization on Italian CommonVoice suffers from domain mismatch, whereas the close-domain French PortMEDIA yields meaningful gains. Overall, semantic enrichment with LaBSE alignment enables practical SLU improvements with reduced data and compute, while offering insights into layer-wise semantics and portability.
Abstract
Over the past few years, self-supervised learned speech representations have emerged as fruitful replacements for conventional surface representations when solving Spoken Language Understanding (SLU) tasks. Simultaneously, multilingual models trained on massive textual data were introduced to encode language-agnostic semantics. Recently, the SAMU-XLSR approach introduced a way to take advantage of such textual models to enrich multilingual speech representations with language-agnostic semantics. Aiming for better semantic extraction on a challenging Spoken Language Understanding task while keeping computation costs in mind, this study investigates a specific in-domain semantic enrichment of the SAMU-XLSR model by specializing it on a small amount of transcribed data from the downstream task. In addition, we show the benefits of using same-domain French and Italian benchmarks for low-resource language portability and explore the cross-domain capacities of the enriched SAMU-XLSR.
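The idea of aligning frame-level speech embeddings with a sentence-level text embedding can be illustrated with a minimal sketch. This is not the authors' implementation: the mean pooling, the cosine-distance loss, and the function names (mean_pool, cosine_similarity, alignment_loss) are assumptions chosen for illustration, and the toy vectors stand in for real XLS-R frames and LaBSE embeddings.

```python
import math

def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector.

    Assumption: simple mean pooling; the actual pooling used by
    SAMU-XLSR is not described in this summary.
    """
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(frames, text_embedding):
    """Hypothetical alignment objective: 1 - cos(pooled speech, text).

    The loss approaches 0 when the pooled speech vector points in the
    same direction as the text sentence embedding.
    """
    return 1.0 - cosine_similarity(mean_pool(frames), text_embedding)

# Toy example: two 2-dimensional "frames" and two candidate "sentence" vectors.
toy_frames = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [1.0, 1.0]      # same direction as the pooled frames
orthogonal_text = [1.0, -1.0]  # orthogonal to the pooled frames

loss_aligned = alignment_loss(toy_frames, aligned_text)
loss_orthogonal = alignment_loss(toy_frames, orthogonal_text)
```

In this toy setting, minimizing such a loss during training would pull the pooled speech representation toward the text embedding of the corresponding transcription, which is the intuition behind enriching speech representations with language-agnostic semantics.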
