Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition

Lei Liu, Li Liu, Haizhou Li

TL;DR

The paper tackles automatic cued speech recognition (ACSR), which requires fusing lip-reading and hand-cue streams over long sequences, where prior fusion methods struggle with global dependencies and efficiency. It introduces EcoCued, a computation- and parameter-efficient multi-modal fusion transformer built on Token-Importance-Aware Attention (TIAA), which uses a Token Utilization Rate (TUR) to select important tokens. TIAA combines modality-specific attention over all tokens with modality-shared cross-modal attention over the selected tokens, and a Convolution-based Aggregation (ConAgg) module merges the two attention flows. The approach reduces self-attention complexity from $O(T^2)$ to $O(T)$ and lowers the parameter count substantially (e.g., from 54.9M to 6.6M) while achieving state-of-the-art results on Mandarin Chinese, French, and British CS datasets, with CER and WER improvements in both single-cuer and multi-cuer settings. The method also shows faster inference and effective cross-modal interaction, supported by ablations and analyses, which suggests potential for large-scale multi-modal pre-training in ACSR.

Abstract

Cued Speech (CS) is a pure visual coding method used by hearing-impaired people that combines lip reading with several specific hand shapes to make the spoken language visible. Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text, which can help hearing-impaired people to communicate effectively. The visual information of CS contains lip reading and hand cueing, thus the fusion of them plays an important role in ACSR. However, most previous fusion methods struggle to capture the global dependency present in long sequence inputs of multi-modal CS data. As a result, these methods generally fail to learn the effective cross-modal relationships that contribute to the fusion. Recently, attention-based transformers have been a prevalent idea for capturing the global dependency over the long sequence in multi-modal fusion, but existing multi-modal fusion transformers suffer from both poor recognition accuracy and inefficient computation for the ACSR task. To address these problems, we develop a novel computation and parameter efficient multi-modal fusion transformer by proposing a novel Token-Importance-Aware Attention mechanism (TIAA), where a token utilization rate (TUR) is formulated to select the important tokens from the multi-modal streams. More precisely, TIAA firstly models the modality-specific fine-grained temporal dependencies over all tokens of each modality, and then learns the efficient cross-modal interaction for the modality-shared coarse-grained temporal dependencies over the important tokens of different modalities. Besides, a light-weight gated hidden projection is designed to control the feature flows of TIAA. The resulting model, named Economical Cued Speech Fusion Transformer (EcoCued), achieves state-of-the-art performance on all existing CS datasets, compared with existing transformer-based fusion methods and ACSR fusion methods.
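
The two-stage idea described in the abstract (fine-grained attention within each modality, then coarse-grained attention over a few important tokens from both modalities) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the TUR formula is not given here, so the L2 norm of each token stands in as a hypothetical importance score; attention uses identity projections and a single head; and the names `two_stage_fusion`, `select_top_k`, and `score_entries` are invented for this example. The gated hidden projection and ConAgg are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Single-head self-attention with identity projections (a simplification)."""
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def select_top_k(chunks, k):
    """Keep the k highest-scoring tokens of each chunk, in temporal order.
    ASSUMPTION: the token L2 norm is a placeholder for the paper's TUR score."""
    scores = np.linalg.norm(chunks, axis=-1)                      # (C, S)
    idx = np.sort(np.argsort(-scores, axis=-1)[:, :k], axis=-1)   # (C, k)
    return np.take_along_axis(chunks, idx[..., None], axis=1)     # (C, k, d)

def two_stage_fusion(lip, hand, num_chunks, k):
    T, d = lip.shape
    assert hand.shape == (T, d) and T % num_chunks == 0
    S = T // num_chunks                                           # chunk length
    lip_c = lip.reshape(num_chunks, S, d)
    hand_c = hand.reshape(num_chunks, S, d)
    # Stage 1: modality-specific attention inside each chunk.
    lip_local = attend(lip_c).reshape(T, d)
    hand_local = attend(hand_c).reshape(T, d)
    # Stage 2: modality-shared attention over the selected tokens of both streams.
    picked = np.concatenate(
        [select_top_k(lip_c, k), select_top_k(hand_c, k)], axis=1)  # (C, 2k, d)
    shared = attend(picked.reshape(-1, d))                        # (2kC, d)
    return lip_local, hand_local, shared

def score_entries(T, num_chunks, k):
    """Count pairwise attention scores: (local, shared, full per-modality)."""
    S = T // num_chunks
    return 2 * num_chunks * S * S, (2 * k * num_chunks) ** 2, 2 * T * T

rng = np.random.default_rng(0)
lip = rng.standard_normal((64, 16))
hand = rng.standard_normal((64, 16))
lip_local, hand_local, shared = two_stage_fusion(lip, hand, num_chunks=8, k=2)
```

With `T = 64`, `num_chunks = 8`, and `k = 2`, `score_entries` counts 1024 local and 1024 shared scores against 8192 for full self-attention on both streams. In this simplified sketch the shared term still grows quadratically with the number of chunks, so it illustrates the token-selection idea only and does not by itself reproduce the $O(T)$ complexity reported for the paper's method.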

Paper Structure (19 sections, 1 theorem, 15 equations, 12 figures, 8 tables)

This paper contains 19 sections, 1 theorem, 15 equations, 12 figures, 8 tables.

Key Result

Theorem 1

For any $Q, K, V \in \mathbb{R}^{T \times d}$ and $W_i^q, W_i^k, W_i^v \in \mathbb{R}^{d \times d}$, for any column vector $w \in \mathbb{R}^T$ of matrix $V W_i^v$, there exists a low-rank matrix $\tilde{S} \in \mathbb{R}^{T \times T}$ satisfying: where $\operatorname{rank}(\tilde{S})=\Theta(\log (T))$.
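
The inequality itself does not appear in the statement as given here. The wording closely mirrors the low-rank self-attention theorem of Linformer (cited in this paper as wang2020linformer), where the bound has the shape sketched below. This is an assumption about the intended form, not the paper's exact statement: $S$ is assumed to denote the softmax attention matrix of head $i$, and $\epsilon > 0$ a small error tolerance.

```latex
% Assumed form, modeled on the Linformer low-rank theorem.
% S is an assumed symbol for the attention matrix of head i.
S = \operatorname{softmax}\!\left( \frac{Q W_i^q \left( K W_i^k \right)^{\top}}{\sqrt{d}} \right),
\qquad
\Pr\!\left( \left\| \tilde{S} w - S w \right\| < \epsilon \left\| S w \right\| \right) > 1 - o(1)
```

Read this way, the theorem says that applying the low-rank $\tilde{S}$ to any value column $w$ stays within relative error $\epsilon$ of applying the full $S$ with high probability, using only $\Theta(\log(T))$ rank.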

Figures (12)

  • Figure 1: The Mandarin Chinese CS system (image from liu2019pilot). Combined with lip reading, five hand positions (mouth, chin, throat, side, cheek) are defined to encode Chinese vowels and eight hand shapes to encode Chinese consonants.
  • Figure 2: Multi-modal fusion comparison between previous transformers (left) and the proposed method (right). $T$ is the input sequence length, $C$ is the number of chunks into which the input sequence is segmented, and $k$ is the number of important tokens selected in each chunk. (a) Previous transformers introduce extra computation and parameters for cross-modal interaction, requiring quadratic complexity (red links) and projection layers. (b) Our method uses a parameter-free cross-modal interaction with linear computational complexity (green links). Here $k=2$ is shown as the simplest case for visualisation.
  • Figure 3: Phoneme-level recognition accuracy on Chinese CS dataset with respect to parameters. Comparison with ACSR methods: LSTM papadimitriou2021fully, JLF wang2021cross, and CMML liu2023cross; Comparison with Transformer models: vanilla Multi-Head Self-Attention (MHSA) vaswani2017attention, FLASH hua2022transformer, Linformer wang2020linformer, Performer choromanski2020rethinking, and Cosformer qin2021cosformer. RegNet Radosavovic_2020_CVPR is the front-end backbone for all methods.
  • Figure 4: Illustration of the EcoCued approach. First, pre-trained extraction models (dlib dlib and mediapipe mediapipe) are used to capture the ROIs of the lip and hand from the videos. Then a shared front-end Radosavovic_2020_CVPR is used to extract frame-wise features for lip motions and hand shapes, and a linear layer is used to extract features of hand positions. To reduce the complexity of self-attention, TUR is introduced to select important tokens from each modality. The proposed TIAA mechanism first calculates the modality-specific attention to capture the local fine-grained dependencies within each chunk of the sequence for each modality. Then, TIAA fuses the important tokens of different modalities and calculates the modality-shared coarse-grained dependencies over the fused tokens. Finally, a convolution aggregation (i.e., ConAgg) module is used to aggregate the modality-specific and modality-shared attention flows along the spatial dimension. In addition, a gated hidden projection is introduced to control the information flow from the input to the output projections of TIAA.
  • Figure 5: Spectrum analysis of the self-attention matrix in the transformer liu2023cross with top-128 largest eigenvalues. We can see the original MHSA formulation obtains a low-rank attention matrix for the ACSR task, which motivates us to focus on the most important tokens in the CS sequences.
  • ...and 7 more figures
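
A spectrum analysis of the kind described in the Figure 5 caption can be sketched as follows. This is a hedged illustration only: it uses random queries and keys rather than a trained ACSR model, it uses singular values as a stand-in for the eigenvalues mentioned in the caption, and the function names `attention_matrix` and `cumulative_spectrum` are invented for this example.

```python
import numpy as np

def attention_matrix(q, k):
    """Row-stochastic softmax attention matrix for one head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def cumulative_spectrum(mat, top=128):
    """Fraction of total singular-value mass held by the `top` largest values."""
    s = np.linalg.svd(mat, compute_uv=False)   # returned in descending order
    return np.cumsum(s[:top]) / s.sum()

rng = np.random.default_rng(0)
T, d = 256, 16                                 # illustrative sizes, not the paper's
q = rng.standard_normal((T, d))
k = rng.standard_normal((T, d))
A = attention_matrix(q, k)
curve = cumulative_spectrum(A, top=128)
```

A curve that approaches 1 well before index 128 indicates that a small number of components carries most of the attention matrix, which is the low-rank behaviour the caption points to as motivation for focusing on the most important tokens.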

Theorems & Definitions (3)

  • Theorem 1
  • Definition 1
  • Definition 2