Online Register for Dual-Mode Self-Supervised Speech Models: Mitigating The Lack of Future Context

Keita Goto; Takashi Maekaku; Jin Sakuma; Jinchuan Tian; Yusuke Shinohara; Shinji Watanabe

TL;DR

This work proposes online registers, learnable tokens appended to each chunk in online mode that act as virtual placeholders for unseen future frames, enabling the model to compensate for missing context without introducing additional latency.

Abstract

Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers, learnable tokens appended to each chunk in online mode. These tokens act as virtual placeholders for unseen future frames, enabling the model to compensate for missing context without introducing additional latency. Furthermore, we introduce a future prediction loss that explicitly guides the registers to capture predictive cues, thereby enriching their ability to retain future information. Experiments on LibriSpeech and out-of-domain benchmarks demonstrate that online registers consistently reduce the performance gap between offline and online modes, especially in low-latency settings, achieving a 3.4% relative improvement on LibriSpeech with 160 ms chunks.

Paper Structure

This paper contains 8 sections, 11 equations, 2 figures, and 4 tables.

Introduction
Related Work
Proposed Method
Experiments
Experimental Settings
Main Results
Analysis
Conclusion

Figures (2)

Figure 1: Overview of our proposed pre-training framework with online registers. As an example, we illustrate the case where the feature length is 4 frames, the chunk size is 2, and the number of online registers per chunk is 1. The dual-mode Transformer encoder processes offline input with full-context attention and online input with chunk-wise attention, where online registers are appended. The model is trained to predict quantized targets for masked frames (dotted line boxes), with an additional future prediction loss encouraging the online registers to store future context.
Figure 2: Attention mask design for the online mode. As an example, we illustrate the case where the feature length is 6 frames, the chunk size is 2, the look-ahead size is 1, and the number of online registers per chunk is 1. During attention computation, the model attends only to the past and current chunks, the look-ahead, and the online registers, while all other attention weights (white boxes) are filled with $-\infty$.
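
The mask described in the Figure 2 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `online_attention_mask` and its parameters are hypothetical. The captions do not specify the token layout or which registers are visible to which queries, so the sketch assumes that all frames come first and each chunk's registers follow at the end of the sequence, and that a query sees the registers of its own chunk and of earlier chunks.

```python
import numpy as np

def online_attention_mask(num_frames, chunk_size, lookahead, num_registers):
    """Additive attention mask: 0.0 where attention is allowed, -inf elsewhere.

    Assumed layout (not given in the source): [frames..., registers...],
    with the registers grouped by the chunk they belong to.
    """
    num_chunks = -(-num_frames // chunk_size)  # ceiling division

    # Chunk index of every token (frames first, then registers).
    frame_chunk = np.arange(num_frames) // chunk_size
    reg_chunk = np.repeat(np.arange(num_chunks), num_registers)
    token_chunk = np.concatenate([frame_chunk, reg_chunk])
    is_register = np.concatenate(
        [np.zeros(num_frames, dtype=bool), np.ones(reg_chunk.size, dtype=bool)]
    )

    # A query in chunk c sees frames up to the end of chunk c plus the look-ahead.
    frame_limit = np.minimum((token_chunk + 1) * chunk_size + lookahead, num_frames)
    positions = np.arange(token_chunk.size)
    frame_visible = (~is_register)[None, :] & (positions[None, :] < frame_limit[:, None])

    # Assumption: registers of the current and past chunks are visible.
    reg_visible = is_register[None, :] & (token_chunk[None, :] <= token_chunk[:, None])

    allowed = frame_visible | reg_visible
    return np.where(allowed, 0.0, -np.inf)

# Setting from the Figure 2 caption: 6 frames, chunk size 2, look-ahead 1, 1 register per chunk.
mask = online_attention_mask(num_frames=6, chunk_size=2, lookahead=1, num_registers=1)
```

With this layout the mask has shape 9 x 9 (6 frames plus 3 registers). It would be added to the attention scores before the softmax, so that blocked positions receive zero weight. An interleaved layout, with registers placed directly after each chunk, would give the same visibility pattern up to a permutation of rows and columns.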
