Multi-Channel Replay Speech Detection using Acoustic Maps

Michael Neri, Tuomas Virtanen

TL;DR

This paper tackles replay attacks in automatic speaker verification by introducing acoustic maps, a spatial feature derived from beamforming that discriminates genuine speech from loudspeaker replay using directional energy patterns. A compact CNN (approximately $6{,}000$ parameters) processes the 3-D acoustic-map tensor $\boldsymbol{M} \in \mathbb{R}^{K \times A \times E}$, with $K$ frequency bands, $A=91$ azimuth angles, and $E=41$ elevation angles, and outputs a replay/non-replay decision. Evaluations on the ReMASC dataset show competitive performance with a favorable model size, though environment-independent generalization remains challenging due to fixed band definitions and non-adaptive spatial processing. The work demonstrates that physically interpretable spatial features can be effective for replay detection, with future improvements aimed at learning adaptive frequency-band selectors to better capture directivity cues across environments.

Abstract

Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.

Paper Structure

This paper contains 11 sections, 8 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Spatial distribution of acoustic maps from a delay-and-sum beamformer across azimuth and elevation angles for two utterances from the ReMASC dataset, recorded using device $\mathrm{D}3$ with $6$ microphones arranged in a hexagonal shape. The top row corresponds to the genuine sample 1264222.wav, and the bottom row to the replay sample 1380536001.wav from the same indoor environment and positions. Each column represents a distinct frequency band: Low ($100$–$500$ Hz), Mid ($500$–$3000$ Hz), High ($3000$–$8000$ Hz), and Super-High ($8000$–$22050$ Hz). The color scale indicates normalized acoustic intensity.
  • Figure 2: Microphone-wise performance in both generalization scenarios.
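
The band-wise delay-and-sum maps described in Figure 1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The band edges come from the Figure 1 caption and the $91 \times 41$ grid from the TL;DR. Everything else is an assumption made for illustration: the array radius, the azimuth and elevation ranges, the speed of sound, the per-band peak normalization, the 44.1 kHz sampling rate (inferred from the 22050 Hz upper band edge), and all function names, including the hypothetical `simulate_plane_wave` helper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

# Frequency bands (Hz) as listed in the Figure 1 caption.
BANDS = [(100, 500), (500, 3000), (3000, 8000), (8000, 22050)]


def hexagonal_array(radius=0.05):
    """Six microphones on a circle in the x-y plane (hypothetical geometry)."""
    angles = np.arange(6) * np.pi / 3
    return np.stack(
        [radius * np.cos(angles), radius * np.sin(angles), np.zeros(6)], axis=1
    )


def direction_grid(n_az=91, n_el=41):
    """Unit vectors pointing from the array toward each (azimuth, elevation)."""
    az = np.linspace(-np.pi, np.pi, n_az)          # assumed azimuth range
    el = np.linspace(-np.pi / 2, np.pi / 2, n_el)  # assumed elevation range
    az_g, el_g = np.meshgrid(az, el, indexing="ij")
    return np.stack(
        [np.cos(el_g) * np.cos(az_g), np.cos(el_g) * np.sin(az_g), np.sin(el_g)],
        axis=-1,
    )  # shape (A, E, 3)


def acoustic_map(x, fs, mics, bands=BANDS, n_az=91, n_el=41):
    """Delay-and-sum energy per band and direction.

    x has shape (n_mics, n_samples); the result has shape (K, A, E),
    with each band scaled so that its peak equals 1.
    """
    spectra = np.fft.rfft(x, axis=1)
    freqs = np.fft.rfftfreq(x.shape[1], d=1.0 / fs)
    # A far-field wavefront from direction u reaches microphone p
    # earlier by (p . u) / c than it reaches the array origin.
    advance = direction_grid(n_az, n_el) @ mics.T / SPEED_OF_SOUND  # (A, E, M)
    maps = np.zeros((len(bands), n_az, n_el))
    for k, (lo, hi) in enumerate(bands):
        sel = (freqs >= lo) & (freqs < hi)
        beam = np.zeros((n_az, n_el, int(sel.sum())), dtype=complex)
        for m in range(mics.shape[0]):
            # Undo the per-microphone time advance with a phase shift, then sum.
            phase = np.exp(-2j * np.pi * advance[:, :, m, None] * freqs[sel])
            beam += phase * spectra[m, sel]
        energy = np.sum(np.abs(beam / mics.shape[0]) ** 2, axis=-1)
        peak = energy.max()
        maps[k] = energy / peak if peak > 0 else energy
    return maps


def simulate_plane_wave(signal, fs, mics, az, el):
    """Toy far-field source: apply each microphone's time advance as a phase shift."""
    u = np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    advance = mics @ u / SPEED_OF_SOUND
    shifted = np.fft.rfft(signal) * np.exp(2j * np.pi * advance[:, None] * freqs)
    return np.fft.irfft(shifted, n=signal.size, axis=1)
```

Calling `acoustic_map(x, 44100, hexagonal_array())` on a six-channel recording `x` returns a tensor of shape `(4, 91, 41)`, matching the $K \times A \times E$ layout with $K=4$. Because the toy array is planar, it cannot distinguish a source above the array plane from its mirror image below it; a real device geometry may behave differently.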