Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

Jianfeng He, Julian Salazar, Kaisheng Yao, Haoqi Li, Jinglun Cai

TL;DR

This work tackles the high cost of obtaining speech-semantics pairs for end-to-end SLU by proposing zero-shot learning from speech-text and text-semantics data. The proposed CMSST framework combines text-similarity filtering, multi-view clustering-based sample selection (MCSS), and cross-modal selective learning via a Cross-Modal SelectiveNet (CMSN) to address domain mismatch, sample imbalance, and label noise. Two benchmarks, VoxPopuli2SLUE and MiniPS2SLURP, enable evaluation under matched and found-speech conditions, with CMSST achieving competitive or superior accuracy while using far fewer speech-text pairs and significantly less training time. The results demonstrate data-efficient cross-domain SLU and provide a practical pathway for adapting SLU systems across evolving domains.

Abstract

End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot E2E SLU by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model learned on text-semantics corpora. However, this method requires the domains of the speech-text and text-semantics data to match, and they often do not because the two corpora are collected separately. Furthermore, using the entire collected speech-text corpus from arbitrary domains leads to imbalance and noise issues. To address these, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also introduce two benchmarks for zero-shot E2E SLU, covering matched and found speech (mismatched) settings. Experiments show that CMSST improves performance in both settings, with significantly reduced sample sizes and training time. Our code and data are released at https://github.com/amazon-science/zero-shot-E2E-slu.
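
The data flow described in the abstract (filter speech-text pairs by text similarity to the target domain, then pseudolabel the surviving transcripts with an NLU model) can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the function names, the token-level Jaccard similarity, the `min_sim` threshold, and the toy keyword-based NLU are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed stand-in for the text-similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_and_pseudolabel(
    speech_text_pairs: Sequence[Tuple[str, str]],  # (audio id, transcript)
    target_texts: Sequence[str],                   # texts from the target domain
    nlu: Callable[[str], str],                     # hypothetical text-to-semantics model
    min_sim: float = 0.3,                          # hypothetical similarity threshold
) -> List[Tuple[str, str, str]]:
    """Keep pairs whose transcript resembles some target text, then pseudolabel them."""
    selected = []
    for audio_id, transcript in speech_text_pairs:
        best = max((jaccard(transcript, t) for t in target_texts), default=0.0)
        if best >= min_sim:
            # The pseudolabel comes from text only; no speech-semantics pair is needed.
            selected.append((audio_id, transcript, nlu(transcript)))
    return selected


def toy_nlu(text: str) -> str:
    """Toy keyword classifier standing in for a trained NLU model."""
    return "alarm_set" if "alarm" in text else "other"


pairs = [
    ("utt1", "set an alarm for seven"),
    ("utt2", "the parliament voted today"),
]
targets = ["set an alarm for six", "play some music"]
pseudolabeled = select_and_pseudolabel(pairs, targets, toy_nlu)
```

In this toy run only `utt1` survives the filter, because the second transcript shares no tokens with the target-domain texts. An SLU model would then be trained on the audio of the selected pairs with the pseudolabels as targets.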

Paper Structure (31 sections, 8 equations, 7 figures, 12 tables, 1 algorithm)

This paper contains 31 sections, 8 equations, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Diagram of using all speech-text pairs, detailed in Sec. \ref{sec:intro}. The legend in (b) is also applicable to (a). (b) Diagram of the CMSST framework (described in Sec. \ref{sec:framework}). Speech and text pairs in $D^{A \shortrightarrow T}$ are selected by first using a text-similarity-based selection method and then a Multi-view Clustering-based Sample Selection (MCSS) algorithm. The SLU model $\tilde{\Theta}^{A \shortrightarrow L}$ is trained on the resulting speech-text pairs $\tilde{D}^{A \shortrightarrow T}$, with pseudolabels from an NLU model $\Theta^{T\shortrightarrow L, t}$. This NLU model is trained from target domain text-to-semantics pairs $D^{T\shortrightarrow L, t}$. To deal with label noise from the NLU model, CMSST uses a Cross-Modal SelectiveNet (CMSN) to train our SLU model $\tilde{\Theta}^{A \shortrightarrow L}$.
  • Figure 2: MCSS diagram (detailed in Sec. \ref{sec:sample_selection}). We use superscripts ${T}$, ${A}$, and ${L}$ to denote the text, speech, and semantic modalities, respectively. Blue denotes the target domain $t$, while pink denotes the external domain $\epsilon$. Hence, the blue boxes depict $D^{T \rightarrow L, t}$ data, while the blue-pink boxes represent $D^{A \rightarrow T}$ data.
  • Figure 3: Diagram of the workflow for CMSN (described in Sec. \ref{sec:cmsn}), where the green or purple arrows each denote a pair of text and speech. $\rho$ is a selective score described in Eq. (\ref{eq:selective_score}), where a larger $\rho$ indicates projected representations that are more similar.
  • Figure 4: Ablation study on the effectiveness of multi-view sample selection and selective training on $\tilde{\Theta}^{A \shortrightarrow L}$. The pseudolabels are from the BERT-based $\Theta^{T \shortrightarrow L, t}$. The sizes $\|D^{A \shortrightarrow T, t}\|$ and $\|D^{A \shortrightarrow T, \epsilon}\|$ are listed in square brackets for each configuration. The selection size $N$ is 12.6k and 5.5k for the two datasets, respectively.
  • Figure 5: Entity F1 Scores and Acc. on the found speech MiniPS2SLURP dataset, where all groups have the same $\|D^{A \shortrightarrow T, t}\|=21597$ and $\|D^{A \shortrightarrow T, \epsilon}\|=13400$.
  • ...and 2 more figures
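
The two selection ideas named in the Figure 2 and Figure 3 captions can be illustrated with a small sketch. It rests on assumptions not stated in the captions: cluster assignments are taken as given, balancing is done by simple round-robin over clusters, and the selective score is approximated by cosine similarity between already-projected text and speech vectors. The actual MCSS algorithm and the score in Eq. (\ref{eq:selective_score}) may differ; all function names here are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def balanced_select(cluster_ids: Sequence[Hashable], n: int) -> List[int]:
    """Pick up to n sample indices, cycling over clusters so no cluster dominates."""
    buckets: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        buckets[cid].append(idx)
    queues = list(buckets.values())  # clusters in order of first appearance

    picked: List[int] = []
    depth = 0
    while len(picked) < n:
        progressed = False
        for queue in queues:
            if depth < len(queue) and len(picked) < n:
                picked.append(queue[depth])
                progressed = True
        if not progressed:  # every cluster is exhausted
            break
        depth += 1
    return picked


def cosine_score(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed proxy for a selective score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# One large cluster "a" and two small clusters "b" and "c".
clusters = ["a", "a", "a", "a", "b", "c"]
chosen = balanced_select(clusters, 4)

# A higher score means the projected text and speech vectors agree more.
aligned = cosine_score([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_score([1.0, 0.0], [0.0, 1.0])
```

Round-robin picking takes one sample from each cluster before returning to the large cluster, so the small clusters are represented even when `n` is small. A training loop could then down-weight or skip pairs whose score falls below a chosen threshold, which is one plausible way to limit the effect of noisy pseudolabels.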