M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses

Yufeng Yang, Desh Raj, Ju Lin, Niko Moritz, Junteng Jia, Gil Keren, Egor Lakomkin, Yiteng Huang, Jacob Donley, Jay Mahadeokar, Ozlem Kalinli

TL;DR

M-BEST-RQ is a multi-channel speech foundation model tailored for smart glasses. It achieves array-geometry invariance by passing the microphone signals through fixed beamformers, which map variable microphone configurations to a fixed directional representation, and it learns a task-agnostic encoder via multi-channel BEST-RQ pretraining. A single encoder can then support several downstream tasks across devices: conversational ASR (C-ASR), spherical active source localization (S-ASL), and wearer voice activity detection (W-VAD). The model matches or exceeds supervised baselines with far less labeled data; on C-ASR, using only 8 hours of labeled speech, it outperforms a supervised baseline trained on 2000 hours. It also demonstrates strong cross-device performance on S-ASL and W-VAD, underscoring the practicality of a wearable-focused foundation model. This work advances multi-channel SSL for wearables and suggests avenues for more efficient, streaming, and device-agnostic speech systems.
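
To illustrate how a bank of fixed beamformers can turn arrays with different microphone counts into a representation of fixed shape, here is a minimal NumPy sketch. It is not the paper's beamformer design, which this summary does not specify. It assumes far-field delay-and-sum weights, and the helper names (`unit_directions`, `delay_and_sum_weights`, `beamform`), the six horizontal look directions, and the random microphone positions are all hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second (assumed)


def unit_directions(num_dirs):
    """Hypothetical helper: unit vectors evenly spaced in the horizontal plane, shape (D, 3)."""
    azimuth = 2 * np.pi * np.arange(num_dirs) / num_dirs
    return np.stack([np.cos(azimuth), np.sin(azimuth), np.zeros(num_dirs)], axis=1)


def delay_and_sum_weights(mic_xyz, directions, freqs):
    """Far-field delay-and-sum weights, shape (D, F, M).

    mic_xyz: (M, 3) microphone positions in metres.
    directions: (D, 3) unit vectors pointing toward each look direction.
    freqs: (F,) frequencies in Hz.
    """
    # A plane wave from direction u reaches a mic at position p earlier by p.u / c.
    advance = directions @ mic_xyz.T / SPEED_OF_SOUND                  # (D, M)
    steering = np.exp(2j * np.pi * freqs[None, :, None] * advance[:, None, :])
    return steering / mic_xyz.shape[0]


def beamform(stft, weights):
    """Apply y = w^H x per frequency. stft: (M, F, T) -> output: (D, F, T)."""
    return np.einsum("dfm,mft->dft", weights.conj(), stft)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    freqs = np.linspace(0.0, 8000.0, 5)
    directions = unit_directions(6)
    for num_mics in (7, 4):  # e.g. a 7-channel and a 4-channel device
        mics = rng.normal(scale=0.05, size=(num_mics, 3))  # made-up geometry
        stft = rng.normal(size=(num_mics, 5, 10)) + 0j
        out = beamform(stft, delay_and_sum_weights(mics, directions, freqs))
        print(num_mics, "mics ->", out.shape)
```

Only the weights depend on the device geometry. The output always has one channel per look direction, so the encoder that consumes it never sees the microphone count.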

Abstract

The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline that is trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach.
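
As a rough illustration of the self-supervised targets behind BEST-RQ-style pretraining, the sketch below implements a frozen random-projection quantizer in NumPy. Each feature frame is projected with a fixed random matrix and assigned the index of the nearest entry in a fixed random codebook; during pretraining the encoder would be trained to predict these indices for masked frames. How M-BEST-RQ builds the multi-channel input to the quantizer is not described in this summary, so the feature layout, dimensions, codebook size, and function names here are assumptions for illustration only.

```python
import numpy as np


def make_quantizer(feat_dim, code_dim=16, codebook_size=8192, seed=0):
    """Create a frozen random projection and a frozen unit-norm random codebook."""
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(feat_dim, code_dim))
    codebook = rng.normal(size=(codebook_size, code_dim))
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
    return projection, codebook


def quantize(features, projection, codebook):
    """Map features of shape (T, feat_dim) to integer targets of shape (T,)."""
    projected = features @ projection
    projected /= np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8
    # For unit-norm vectors, the nearest codebook entry in L2 distance
    # is the one with the largest dot product.
    return np.argmax(projected @ codebook.T, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed input: 50 frames of 80-dim features (e.g. stacked beamformed features).
    frames = rng.normal(size=(50, 80))
    projection, codebook = make_quantizer(feat_dim=80, codebook_size=32)
    targets = quantize(frames, projection, codebook)
    print(targets.shape, targets.dtype)
```

Neither the projection nor the codebook is trained, so the targets stay fixed throughout pretraining and need no labels.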

Paper Structure

This paper contains 16 sections, 4 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: System architecture of M-BEST-RQ and downstream tasks.
  • Figure 2: Microphone positions of the two devices: (a) Aria glasses, which provide 7-channel input, and (b) EasyCom AR glasses, which provide 6-channel input. For (b), we used only the first 4 microphones, which are on the device.