AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition

Ju Lin, Niko Moritz, Yiteng Huang, Ruiming Xie, Ming Sun, Christian Fuegen, Frank Seide

TL;DR

The paper addresses robust directional speech recognition on wearables whose microphone-array geometry changes between hardware iterations. It proposes AGADIR, which combines Non-Linearly Constrained Minimum Variance beamforming, a convolutional front-end, and a streaming RNN-T with Serialized Output Training to disambiguate speakers and suppress cross-talk in real time. Training on multiple geometries reduces word error rate (WER) by up to 28% relative on seen geometries and generalizes to unseen devices, while the geometry-agnostic variant remains competitive with seen devices; dropping a microphone remains challenging. The work offers a practical approach for deploying multi-channel ASR across successive hardware iterations, reducing data collection needs and enabling shared development across prototypes.

Abstract

Wearable devices like smart glasses are approaching the compute capability to seamlessly generate real-time closed captions for live conversations. We build on our recently introduced directional Automatic Speech Recognition (ASR) for smart glasses that have microphone arrays, which fuses multi-channel ASR with serialized output training, for wearer/conversation-partner disambiguation as well as suppression of cross-talk speech from non-target directions and noise. When ASR work is part of a broader system-development process, one may be faced with changes to microphone geometries as system development progresses. This paper aims to make multi-channel ASR insensitive to limited variations of microphone-array geometry. We show that a model trained on multiple similar geometries is largely agnostic and generalizes well to new geometries, as long as they are not too different. Furthermore, training the model this way improves accuracy for seen geometries by 15 to 28% relative. Lastly, we refine the beamforming by a novel Non-Linearly Constrained Minimum Variance criterion.
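
As a reading aid for the "15 to 28% relative" figure, the sketch below shows how a relative WER reduction is conventionally computed. The function name and the WER values are hypothetical and chosen only to illustrate the arithmetic; they are not results reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper).
baseline = 10.0  # WER in percent for a single-geometry model
improved = 7.2   # WER in percent for a multi-geometry model

reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.0%} relative")
```

With these assumed numbers, an absolute drop of 2.8 WER points corresponds to a 28% relative reduction, which is why relative and absolute improvements should not be compared directly.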

Paper Structure (13 sections, 3 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Proposed array-geometry agnostic directional speech recognition architecture.
  • Figure 2: Beam patterns at 1000 Hz for Aria glasses in 4 directions.
  • Figure 3: Microphone locations on Project Aria glasses (Somasundaram et al., 2023).