Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Kun Wei; Bei Li; Hang Lv; Quan Lu; Ning Jiang; Lei Xie

TL;DR

This work addresses the difficulty of exploiting long-range context in conversational ASR without the error propagation that comes from conditioning on text-based history. It introduces a cross-modal conditional variational autoencoder (CVAE) framework that combines a Conformer encoder with a cross-modal extractor and a CVAE module. The CVAE module produces role and topical conversational representations, which are injected into the decoder through either an attention-based or a linear conditioning method. The approach leverages pre-trained speech and text models (e.g., HuBERT, data2vec, RoBERTa) and trains with multi-task objectives (token-level, modal-level, CTC) to learn robust cross-modal context, achieving relative CER reductions of up to 8.8% on HKUST and 23% on MagicData-RAMC. The results show that a short history works best for the cross-modal input and that combining cross-modal and CVAE features gives the best performance, demonstrating the practical potential of long-context, noise-resistant conversational ASR with cross-modal context modeling.
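For reference, a relative CER reduction is conventionally measured against the baseline error rate, as written below. The symbol names are illustrative and are not taken from the paper; this summary reports only the relative figures, not the underlying absolute CER values.

```latex
\[
\Delta_{\text{rel}}
  = \frac{\mathrm{CER}_{\text{baseline}} - \mathrm{CER}_{\text{proposed}}}
         {\mathrm{CER}_{\text{baseline}}} \times 100\%
\]
```

Here the baseline is the standard Conformer model named in the abstract.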

Abstract

Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversation-level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
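The modal-level mask input mentioned above can be illustrated with a minimal sketch. It assumes that speech and text embeddings share one dimension, that masking replaces every position of the chosen modality with a single mask vector, and that the two sequences are concatenated before the cross-modal encoder. The function name `modal_level_mask`, the parameter `p_mask_speech`, and the zero mask vector are hypothetical and are not taken from the paper.

```python
import random

def modal_level_mask(speech_emb, text_emb, mask_vec, rng, p_mask_speech=0.5):
    """Mask one whole modality, chosen at random, and concatenate both.

    speech_emb, text_emb: lists of equal-width embedding vectors.
    mask_vec: the vector that stands in for a masked position (assumed).
    Returns the fused sequence and the name of the masked modality.
    """
    if rng.random() < p_mask_speech:
        # Hide all speech frames; the text sequence is left untouched.
        speech = [list(mask_vec) for _ in speech_emb]
        text = [list(v) for v in text_emb]
        masked = "speech"
    else:
        # Hide all text tokens; the speech sequence is left untouched.
        speech = [list(v) for v in speech_emb]
        text = [list(mask_vec) for _ in text_emb]
        masked = "text"
    return speech + text, masked

# Toy inputs: 6 speech frames and 3 text tokens, each of width 4.
rng = random.Random(0)
speech_emb = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
text_emb = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
mask_vec = [0.0] * 4
fused, masked = modal_level_mask(speech_emb, text_emb, mask_vec, rng)
```

Training with one modality hidden is one plausible way for an extractor to learn to work from speech alone, which would be consistent with the abstract's claim of avoiding explicit error propagation from recognized text.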

Paper Structure (45 sections, 23 equations, 5 figures, 7 tables)

Introduction
Cross-modal CVAE Based Conversational Speech Recognition
Input Representation
Conformer Encoder
Cross-modal Extractor
CVAE Based Conversational ASR
Conditional Decoder
Attention Condition
Linear Condition
Training Objectives
The Cross-modal Extractor
Speech Pretrained Model
Language Pretrained Model
Cross-Modal Encoder
Training Objectives of The Cross-modal Extractor
...and 30 more sections

Figures (5)

Figure 1: An example of a conversation, where $X_k$ and $Y_k$ represent the speech and text of the current sentence $k$, respectively.
Figure 2: The framework of the CVAE-based conversational ASR. In this figure, X represents the speech input. The CVAE module comprises a target text encoder and two Latent Variational Modules (LVM). During training, the output from the Postnet is sent to the decoder; during decoding, the output of the Prenet is used instead. For training purposes, $\textbf{V}^{p}_\text{role}, \textbf{V}^{p}_\text{topical}$ are employed, while $\textbf{V}_\text{role}, \textbf{V}_\text{topical}$ are used for decoding. In this figure, $\textbf{V}_{con}$ represents $\textbf{V}_{context}$. The two text encoders in the CVAE module share model parameters. The cross-modal extractors in the CVAE module and the CRM module also share model parameters.
Figure 3: Framework of the cross-modal extractor. Either the speech or text modality will be randomly masked. mask represents the masked token. The black and blue lines in the model represent the training and inference paths, respectively.
Figure 4: Different decoding strategies: $\textbf{V}_{context}=\textbf{S}_{context}$ in CRM, $\textbf{V}_{context}=(\textbf{V}_{role},\textbf{V}_{topical})$ in CVAE, and $\textbf{V}_{context}=(\textbf{V}_{role},\textbf{V}_{topical}, \textbf{S}_{context})$ in CRM+CVAE.
Figure 5: CER vs. conversation history (number of sentences).
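The Figure 2 caption describes latent variational modules that feed the decoder from one path during training and from another during decoding. A generic CVAE-style sketch of that switch follows. It assumes diagonal Gaussian latents, the standard reparameterization trick, and a KL term between posterior and prior. Mapping "posterior" to the Postnet path and "prior" to the Prenet path is an assumption, and all function names are hypothetical. This summary does not specify the paper's actual distributions or losses.

```python
import math
import random

def sample_latent(mu, logvar, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total

def latent_for_decoder(prior, posterior, training, rng):
    """Use the posterior during training and the prior during decoding."""
    mu, logvar = posterior if training else prior
    return sample_latent(mu, logvar, rng)

# Toy 2-dimensional latent, e.g. a stand-in for a role or topical variable.
rng = random.Random(0)
prior = ([0.0, 0.0], [0.0, 0.0])          # (mean, log-variance)
posterior = ([1.0, -1.0], [-2.0, -2.0])   # (mean, log-variance)
z_train = latent_for_decoder(prior, posterior, training=True, rng=rng)
z_decode = latent_for_decoder(prior, posterior, training=False, rng=rng)
kl = kl_diag_gaussians(posterior[0], posterior[1], prior[0], prior[1])
```

In a setup like this, the KL term pulls the prior toward the posterior, so the latent available at decoding time, when the target text is unknown, stays close to what the decoder saw during training.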
