Cascaded Cross-Modal Transformer for Audio-Textual Classification

Nicolae-Catalin Ristea, Andrei Anghel, Radu Tudor Ionescu

TL;DR

The study tackles limited-data speech classification by building audio-textual representations through automatic speech recognition (ASR) and neural machine translation (NMT), then fusing the resulting text features with audio features using a cascaded cross-modal transformer (CCMT). CCMT first combines the language-specific text modalities and then integrates them with audio, so the classifier can draw on both multilingual semantic cues and acoustic cues. Results on ComParE RSC, Speech Commands v2, and HarperValleyBank surpass previously reported results, and CCMT won the Requests Sub-Challenge (RSC) on its private test set, underscoring the value of multilingual text representations for speech understanding. The code is open source, which supports replication and reuse of the audio-text fusion approach on other speech tasks.

Abstract

Speech classification tasks often require powerful language understanding models to grasp useful features, which becomes problematic when limited training data is available. To attain superior classification performance, we propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models and translating the transcripts into different languages via pretrained translation models. We thus obtain an audio-textual (multimodal) representation for each data sample. Subsequently, we combine language-specific Bidirectional Encoder Representations from Transformers (BERT) with Wav2Vec2.0 audio features via a novel cascaded cross-modal transformer (CCMT). Our model is based on two cascaded transformer blocks. The first one combines text-specific features from distinct languages, while the second one combines acoustic features with multilingual features previously learned by the first transformer block. We employed our system in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge. CCMT was declared the winning solution, obtaining an unweighted average recall (UAR) of 65.41% and 85.87% for complaint and request detection, respectively. Moreover, we applied our framework on the Speech Commands v2 and HarperValleyBank dialog data sets, surpassing previous studies reporting results on these benchmarks. Our code is freely available for download at: https://github.com/ristea/ccmt.
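
The unweighted average recall (UAR) reported above is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The following is a minimal sketch of that metric in plain Python; the function name and the toy labels are illustrative assumptions and are not taken from the authors' code.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls (hypothetical helper, not the authors' code)."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: four "yes" samples and one "no" sample.
y_true = ["yes", "yes", "yes", "yes", "no"]
y_pred = ["yes", "yes", "yes", "no", "no"]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, accuracy)
```

In this toy case the recall is 3/4 for "yes" and 1/1 for "no", so the UAR is 0.875 while plain accuracy is 0.8. The gap shows why UAR is often preferred when classes are imbalanced.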

Paper Structure

This paper contains 26 sections, 3 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Our multimodal pipeline for speech classification. Audio is the original input modality, from which we extract tokens with the Wav2Vec2.0 network [Baevski-NeurIPS-2020], which processes audio data in the time domain. To generate the first extra (text) modality, we initially employ an ASR model to transcribe each audio sample into text. We assume that the spoken language is known a priori, and, in the depicted example, the audio is in French [Fr]. The French text is given as input to the CamemBERT language model [Martin-ACL-2020]. The next step is to generate the second text modality, which relies on the FLAN model [Chung-ARXIV-2022] to translate the French transcripts into English. For the English language modality [En], the text is processed by the BERT model [Devlin-NAACL-2019]. The audio tokens returned by Wav2Vec2.0 and the text tokens produced by CamemBERT and BERT are further fed into our cascaded cross-modal transformer (CCMT). The final class token provided by CCMT is fed into the classification head. Our framework operates similarly if the original spoken language is English. The frozen models are marked with a snowflake.
  • Figure 2: As input, the CCMT architecture receives tokens obtained from CamemBERT, BERT, and Wav2Vec2.0. To maintain the positional information of each modality, we introduce separate positional embeddings. The tokens are processed by two cascaded cross-attention transformer blocks. The first block combines the French and English text modalities, and the resulting tokens are combined with the audio modality. The final class token is passed to the MLP classification head to make the final predictions.
  • Figure 3: A t-SNE visualization of the CCMT embedding space for the ComParE RSC development set. On the left-hand side, the data points are labeled according to the request classes. On the right-hand side, the labels represent the complaint classes. Best viewed in color.
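
The two-stage fusion described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes single-head attention, random stand-ins for learned weights and positional embeddings, a shared token dimension across modalities, the class token at index 0, and a particular choice of which modality supplies the queries in each block. All function names (`cross_attention`, `ccmt_forward`, `make_weights`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` tokens attend to `context` tokens."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return queries + softmax(scores) @ v  # residual connection


def make_weights(rng, dim):
    """Random stand-ins for the learned query, key and value projections."""
    return [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3)]


def ccmt_forward(fr_tokens, en_tokens, audio_tokens, rng):
    """Two cascaded cross-attention blocks (illustrative assumption only)."""
    dim = fr_tokens.shape[-1]
    # Separate positional embeddings per modality (random here, learned in practice).
    fr = fr_tokens + rng.normal(scale=0.02, size=fr_tokens.shape)
    en = en_tokens + rng.normal(scale=0.02, size=en_tokens.shape)
    audio = audio_tokens + rng.normal(scale=0.02, size=audio_tokens.shape)
    # Block 1: combine the two text modalities (assumed: French queries, English context).
    text = cross_attention(fr, en, *make_weights(rng, dim))
    # Block 2: combine the fused text tokens with audio (assumed: text queries, audio context).
    fused = cross_attention(text, audio, *make_weights(rng, dim))
    return fused[0]  # class token, assumed to sit at index 0


rng = np.random.default_rng(0)
dim = 16
fr = rng.normal(size=(5, dim))      # stand-in for CamemBERT tokens
en = rng.normal(size=(7, dim))      # stand-in for BERT tokens
audio = rng.normal(size=(11, dim))  # stand-in for Wav2Vec2.0 tokens
cls = ccmt_forward(fr, en, audio, rng)
print(cls.shape)  # (16,)
```

The returned class token would then go to an MLP classification head. A real model would also need multi-head attention, normalization and feed-forward layers, and projections that map each encoder's output to a common dimension.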