Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition

Alexandros Haliassos, Rodrigo Mira, Stavros Petridis

TL;DR

USR 2.0 addresses the training bottleneck and brittleness of USR by introducing CTC-driven teacher forcing, which feeds greedily decoded CTC pseudo-labels into the attention decoder to produce aligned attention-based pseudo-labels (PLs) in a single forward pass. A mixed sampling strategy mitigates exposure bias by preserving some autoregressive decoding during training, which narrows the train–test gap. The method couples the robustness of CTC with the expressiveness of attention, yielding about 2× faster training and improved robustness to long sequences, noise, and unseen domains, while achieving state-of-the-art results on LRS3, LRS2, and WildVSR with a single unified model. In practice, this means fewer training resources, better out-of-distribution generalization, and scalable semi-supervised learning across ASR, VSR, and AVSR. The approach may also generalize beyond speech to other sequence-to-sequence problems with monotonic input-output structure.

Abstract

Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously, benefiting from the robustness of CTC and the expressiveness of attention without costly beam search. We further propose mixed sampling to mitigate the exposure bias of the decoder relying solely on CTC inputs. The resulting method, USR 2.0, halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing USR and modality-specific self-supervised baselines.
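
The first step of CTC-driven teacher forcing described above can be illustrated with a minimal sketch: greedy CTC decoding (per-frame argmax, merge repeats, drop blanks) followed by a right shift to build the decoder inputs. This is not the authors' implementation. The blank index, the start-of-sequence id, the function names, and the exact shifting convention (including how any end-of-sequence token is handled) are assumptions made for illustration only.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol
SOS = 1    # hypothetical start-of-sequence token id


def ctc_greedy_collapse(frame_logits: Sequence[Sequence[float]],
                        blank: int = BLANK) -> List[int]:
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks."""
    best = [max(range(len(f)), key=lambda k: f[k]) for f in frame_logits]
    collapsed: List[int] = []
    prev = None
    for tok in best:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != blank:
            collapsed.append(tok)
        prev = tok
    return collapsed


def teacher_forcing_inputs(ctc_pl: List[int], sos: int = SOS) -> List[int]:
    """Shift the collapsed CTC pseudo-label right by one position.

    Assumed convention: the decoder reads [sos, y_1, ..., y_{U-1}] and emits
    one prediction per position, so the attention targets have the same
    length U as the CTC pseudo-label.
    """
    if not ctc_pl:
        return []
    return [sos] + ctc_pl[:-1]


# Toy example: 8 frames over a 4-symbol vocabulary (values are made up).
frames = [[1.0 if k == i else 0.0 for k in range(4)]
          for i in [0, 2, 2, 0, 2, 3, 3, 0]]
ctc_pl = ctc_greedy_collapse(frames)
decoder_in = teacher_forcing_inputs(ctc_pl)
```

In this toy case the collapsed pseudo-label is `[2, 2, 3]` and the decoder input is `[1, 2, 2]`. Because all decoder inputs are known up front, a single parallel decoder pass can produce every attention target, which is where the speed-up over autoregressive decoding comes from.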

Paper Structure

This paper contains 80 sections, 17 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: CTC vs. attention-based decoding (greedy). Left: AVSR word error rate (WER) of USR on in- (LRS3) and out-of-distribution (VoxCeleb2, automatically transcribed) samples. CTC decoding is notably more robust to domain shift and long sequences, while autoregressive attention-based decoding performs best in-distribution. Right: Decoding speed on an H200 GPU. CTC is $\sim$40$\times$ faster than autoregressive decoding. While teacher forcing could significantly speed up attention-based decoding, it typically relies on ground-truth tokens unavailable during pseudo-labelling.
  • Figure 2: Pseudo-labelling in USR and USR 2.0. In USR (left), the teacher generates CTC and attention-based pseudo-labels (PLs) from unmasked audiovisual inputs: CTC PLs are generated in parallel, while attention-based PLs require autoregressive decoding. The student predicts each target type independently, leading to decoupled supervision. USR 2.0 introduces two modes for tighter integration. In CTC-driven mode (centre), attention-based PLs are generated by feeding collapsed CTC PLs into the decoder via teacher forcing, avoiding autoregression. The student decoder predicts both types of PLs. In AR mode (right), the teacher operates autoregressively as in USR, and the student’s CTC branch predicts both CTC and attention-based PLs. USR 2.0 alternates between modes at each iteration: CTC-driven mode improves efficiency and robustness, while AR mode mitigates exposure bias. Note: $B$ refers to batch size, and typically, $U_{\text{CTC}}, U_{\text{AR}} \ll L$ (not to scale in the figure).
  • Figure 3: Robustness to long utterances. (a) Greedy decoding: USR 2.0 maintains robustness to longer input lengths, significantly outperforming other models. (b) Beam search ($\text{beam size} = 30$, joint CTC-attention decoding) improves USR robustness but still lags behind USR 2.0. (c) Increasing beam size reduces the WER gap between USR and USR 2.0, but at a significant computational and memory cost. The size of the markers corresponds to the relative memory cost with batched beam search.
  • Figure 3: Performance on OOD datasets: LibriSpeech (LibriS), WildVSR, and AVSpeech (AVS). We report WER (%) under greedy decoding.
  • Figure 4: AVSR WER (%) vs. AR mode sampling probability on in-distribution (LRS3) and out-of-distribution (AVSpeech) samples.
  • ...and 4 more figures
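
The mode alternation described in the Figure 2 and Figure 4 captions can be sketched as a per-iteration random choice between CTC-driven mode and AR mode. This is a rough illustration under stated assumptions, not the authors' training loop: the function name, the mode labels, and the use of an independent Bernoulli draw per iteration are hypothetical, and the captions above do not state the sampling probability that USR 2.0 actually uses.

```python
import random

# Which student branch is trained on both PL types in each mode,
# as stated in the Figure 2 caption. The key and value names are hypothetical.
BRANCH_WITH_BOTH_TARGETS = {
    "ctc_driven": "decoder",   # student decoder predicts CTC and attention PLs
    "ar": "ctc_branch",        # student CTC branch predicts CTC and attention PLs
}


def sample_mode(p_ar: float, rng: random.Random) -> str:
    """Pick the pseudo-labelling mode for one training iteration.

    p_ar is the AR mode sampling probability (the x-axis of Figure 4).
    Assumption: modes are drawn independently at every iteration.
    """
    if not 0.0 <= p_ar <= 1.0:
        raise ValueError("p_ar must lie in [0, 1]")
    return "ar" if rng.random() < p_ar else "ctc_driven"


# Usage with a placeholder probability (0.5 is NOT a value from the paper).
rng = random.Random(0)
for step in range(3):
    mode = sample_mode(0.5, rng)
    branch = BRANCH_WITH_BOTH_TARGETS[mode]
    # A real loop would now run the teacher in `mode` and compute the
    # student losses, with `branch` receiving both target types.
```

Setting `p_ar` to 0 recovers pure CTC-driven training, and setting it to 1 recovers autoregressive pseudo-labelling at every step, so the two ends of this range bracket the trade-off between efficiency and exposure bias that mixed sampling is meant to balance.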