Multi-Channel Replay Speech Detection using Acoustic Maps
Michael Neri, Tuomas Virtanen
TL;DR
This paper tackles replay attacks on automatic speaker verification by introducing acoustic maps, a spatial feature derived from beamforming that separates genuine speech from loudspeaker replay through their directional energy patterns. A compact CNN (about $6{,}000$ parameters) processes the 3-D acoustic-map tensor $\boldsymbol{M} \in \mathbb{R}^{K \times A \times E}$, with $K$ frequency bands, $A=91$ azimuth points, and $E=41$ elevation points, and outputs a replay/non-replay decision. Evaluations on the ReMASC dataset show competitive performance at a small model size, though environment-independent generalization remains challenging because the band definitions are fixed and the spatial processing is non-adaptive. The work demonstrates that physically interpretable spatial features can be effective for replay detection; future improvements aim at learning adaptive frequency-band selectors that better capture directivity cues across environments.
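To give a feel for how small a roughly 6k-parameter CNN is, the sketch below counts the trainable parameters of a purely hypothetical stack of 3x3 convolutions followed by global average pooling and a single output unit. The layer widths and the choice $K=4$ are assumptions made for illustration only; they are not the architecture reported in the paper.

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of a k x k 2-D convolution."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


K = 4  # hypothetical number of frequency bands (input channels)

# Hypothetical layout: four 3x3 conv layers, global average pooling
# over the (A, E) grid, then one logit for replay vs. genuine.
layer_params = [
    conv2d_params(K, 8, 3),
    conv2d_params(8, 16, 3),
    conv2d_params(16, 16, 3),
    conv2d_params(16, 16, 3),
    linear_params(16, 1),
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

Because global pooling removes the spatial axes before the classifier, the count in this sketch depends on $K$ but not on $A$ or $E$, which is one way a model operating on a $91 \times 41$ grid can stay this small.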
Abstract
Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.
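As a minimal sketch of the idea, an acoustic map can be computed with frequency-domain delay-and-sum beamforming: for every look direction on the azimuth/elevation grid, the microphone spectra are phase-aligned, averaged, and their energy is summed within each frequency band. The function name `acoustic_map`, the far-field assumption, the microphone geometry, the band edges, and the angular ranges of the grid are all assumptions made here for illustration; only the grid sizes (91 by 41) come from the text.

```python
import numpy as np


def acoustic_map(x, mic_pos, fs, bands, az_deg, el_deg, c=343.0):
    """Delay-and-sum acoustic map (illustrative sketch, far-field model).

    x       : (n_mics, n_samples) time-domain frame
    mic_pos : (n_mics, 3) microphone positions in metres
    bands   : list of (f_lo, f_hi) band edges in Hz
    returns : (K, A, E) array of band-wise steered energy
    """
    n_mics, n_samples = x.shape
    X = np.fft.rfft(x, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)

    az = np.deg2rad(np.asarray(az_deg, dtype=float))[:, None]  # (A, 1)
    el = np.deg2rad(np.asarray(el_deg, dtype=float))[None, :]  # (1, E)
    # Unit vectors pointing toward each look direction, shape (A, E, 3).
    u = np.stack(
        [np.cos(el) * np.cos(az),
         np.cos(el) * np.sin(az),
         np.sin(el) * np.ones_like(az)],
        axis=-1,
    )
    # Time advance of each microphone relative to the origin, (A, E, n_mics).
    delays = u @ np.asarray(mic_pos, dtype=float).T / c

    M = np.zeros((len(bands), az.shape[0], el.shape[1]))
    for k, (f_lo, f_hi) in enumerate(bands):
        idx = (freqs >= f_lo) & (freqs < f_hi)
        f_band = freqs[idx]
        Y = np.zeros(M.shape[1:] + (f_band.size,), dtype=complex)
        for m in range(n_mics):
            # Undo the per-microphone phase shift for every direction.
            steer = np.exp(-2j * np.pi * delays[:, :, m, None] * f_band)
            Y += steer * X[m, idx]
        Y /= n_mics
        M[k] = np.sum(np.abs(Y) ** 2, axis=-1)
    return M


# Usage with placeholder data: 4 microphones, one 2048-sample frame of noise.
rng = np.random.default_rng(0)
mics = np.array([[0.05, 0, 0], [0, 0.05, 0], [0, 0, 0.05], [0, 0, 0]])
frame = rng.standard_normal((4, 2048))
bands = [(300, 1000), (1000, 2000), (2000, 4000)]  # assumed band edges
az_grid = np.linspace(-180, 180, 91)               # assumed angular range
el_grid = np.linspace(-80, 80, 41)                 # assumed angular range
M = acoustic_map(frame, mics, 16000, bands, az_grid, el_grid)
print(M.shape)  # (3, 91, 41)
```

With real recordings, a directional source should produce a peak in the map around its direction of arrival, and the shape of that peak across bands is the kind of cue the classifier can exploit. Fixed band edges as used here correspond to the non-adaptive setting.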
