SPEAR: Receiver-to-Receiver Acoustic Neural Warping Field

Yuhang He; Shitong Xu; Jia-Xing Zhong; Sangyun Shin; Niki Trigoni; Andrew Markham

TL;DR

The paper tackles predicting spatial acoustic effects in a 3D enclosed space without relying on explicit room acoustics or source poses. It introduces SPEAR, a receiver-to-receiver neural warping field learned from paired receiver recordings, where a warping field $\mathcal{W}_{p_r \rightarrow p_t}$ maps the Fourier-domain audio from a reference receiver to a target position. A Transformer-based architecture with a learnable 3D grid and three guiding principles—Globality, Order Awareness, and Audio-Content Agnostic—achieves accurate warping across synthetic, photo-realistic, and real-world scenes, outperforming source-to-receiver baselines. The work demonstrates data-efficient training for spatial audio prediction and highlights potential applications in robotics tasks requiring spatial audio understanding, while acknowledging limitations such as dense sampling requirements and a horizontal-plane constraint.
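As a rough illustration of what warping Fourier-domain audio from a reference receiver to a target position can look like, the sketch below builds a toy field and applies it. It is not the SPEAR network: it assumes the field acts by element-wise multiplication on the spectrum, uses circular convolution as a stand-in for room propagation, and computes the field from known toy impulse responses, whereas SPEAR learns it from paired recordings. All names (`dft`, `warp`, `h_ref`, `h_tgt`, `field`) are hypothetical.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (an FFT library would be used in practice)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_convolve(s, h):
    """Toy propagation model: circular convolution of a source with an impulse response."""
    n = len(s)
    return [sum(s[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

def warp(ref_audio, warp_field):
    """Warp reference audio to the target position (assumed: element-wise product in frequency)."""
    X_ref = dft(ref_audio)
    warped_spectrum = [w * x for w, x in zip(warp_field, X_ref)]
    return [v.real for v in idft(warped_spectrum)]

# Toy impulse responses from the single source to each receiver (made-up values).
h_ref = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
h_tgt = [0.0, 0.8, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0]

# Toy field: ratio of the two transfer functions (h_ref has no spectral zeros here).
field = [ht / hr for ht, hr in zip(dft(h_tgt), dft(h_ref))]

# Recordings of one source at both receivers, then warp reference -> target.
source = [1.0, -0.5, 0.25, 0.0, 0.75, -1.0, 0.5, 0.0]
x_ref = circular_convolve(source, h_ref)
x_tgt = circular_convolve(source, h_tgt)
warped = warp(x_ref, field)
print("max error:", max(abs(a - b) for a, b in zip(warped, x_tgt)))
```

Because the toy field is built from the impulse responses alone, the same `field` can be reused for a different source signal, which is one way to read the Audio-Content Agnostic principle.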

Abstract

We present SPEAR, a continuous receiver-to-receiver acoustic neural warping field for predicting spatial acoustic effects in a 3D acoustic space with a single stationary audio source. Unlike traditional source-to-receiver modelling methods, which require prior knowledge of the space's acoustic properties to rigorously model audio propagation from source to receiver, we propose to warp the spatial acoustic effects from a reference receiver position to a target receiver position, so that the warped audio accommodates all spatial acoustic effects belonging to the target position. SPEAR can be trained on data that is much more readily accessible: we simply ask two robots to independently record spatial audio at different positions. We further prove theoretically that the warping field universally exists if and only if one audio source is present. Three physical principles guide the SPEAR network design, making the learned warping field physically meaningful. We demonstrate the superiority of SPEAR on synthetic, photo-realistic, and real-world datasets, showing its strong potential for various downstream robotic tasks.
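A short sketch of why a single source is enough for such a field to exist, assuming linear time-invariant propagation. The symbols $S(f)$ (source spectrum), $H_{p}(f)$ (transfer function from the source to position $p$), and $X_{p}(f)$ (spectrum recorded at position $p$) are our own illustrative notation and are not quoted from the paper:

$$
X_{p_r}(f) = H_{p_r}(f)\,S(f), \qquad X_{p_t}(f) = H_{p_t}(f)\,S(f)
\;\;\Longrightarrow\;\;
\mathcal{W}_{p_r \rightarrow p_t}(f) = \frac{X_{p_t}(f)}{X_{p_r}(f)} = \frac{H_{p_t}(f)}{H_{p_r}(f)}, \qquad H_{p_r}(f) \neq 0 .
$$

The source spectrum cancels, so under these assumptions the field depends only on the two receiver positions. With two sources, $X_{p}(f) = H^{(1)}_{p}(f)\,S_1(f) + H^{(2)}_{p}(f)\,S_2(f)$, and the same ratio generally still depends on $S_1$ and $S_2$, which suggests why a content-independent field need not exist in that case.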

Paper Structure (22 sections, 7 equations, 11 figures, 5 tables)

Introduction
Related Work
Receiver-to-Receiver Acoustic Neural Warping Field
Problem Formulation
Mathematical Backend of Receiver-to-Receiver Neural Warping Field
LTI Receiver-to-Receiver Warping Field Physical Principle
Position-Sensitivity and Irregularity of Receiver-to-Receiver Neural Warping Field
SPEAR Neural Network Introduction
Experiment
Experiment Configuration
Experimental Result
Ablations
Conclusion and Limitation Discussions
Receiver-to-Receiver Warping Field Existence Discussion
Discussion on Acoustic Neural Warping Field Visualization
...and 7 more sections

Figures (11)

Figure 1: SPEAR motivation. A stationary audio source emits audio in a 3D space. Requiring neither the source position nor the acoustic properties of the space, SPEAR only needs two microphones that actively and independently record the spatial audio at discrete positions. During training, SPEAR takes a pair of receiver positions as input and outputs a warping field that warps the audio recorded at the reference position to the target position. Minimizing the discrepancy between the warped audio and the recorded audio forces SPEAR to acoustically characterise the 3D space from a receiver-to-receiver perspective. The learned SPEAR can predict spatial acoustic effects at arbitrary positions.
Figure 2: Two challenges in SPEAR learning: position-sensitivity and irregularity. Position-sensitivity is shown by the much lower structural similarity index (SSIM) between two neighboring-step warping fields than between the two RGB images (sub-fig. C). Irregularity is shown by both the frequency-domain visualization of the warping field (real part) and a sample entropy score that is much higher than that of a regular sine wave (and just half that of a random waveform) (sub-fig. D).
Figure 3: SPEAR network visualization.
Figure 4: Learned warping field visualization on the synthetic dataset (A) and the real-world dataset (B).
Figure 5: Ablation study on noise interference (A) and reference-target receiver distance (B), and prediction in the frequency domain and the time domain.
...and 6 more figures
