Audio Enhancement from Multiple Crowdsourced Recordings: A Simple and Effective Baseline
Shiran Aziz, Yossi Adi, Shmuel Peleg
TL;DR
The paper tackles crowdsourced audio enhancement: an event is captured by multiple uncoordinated devices, and the noise at each device is local and uncorrelated with the noise at the others. It introduces a training-free baseline that aligns the inputs, filters outliers independently in each time-frequency (TF) cell using the median across recordings, and averages the remaining complex STFT values to obtain the enhanced spectrogram $G(t,f)$, from which the enhanced signal is reconstructed. Recording $Y_i$ is treated as an outlier in cell $(t,f)$ when $|Y_i(t,f)| > \lambda_1 C(t,f)$ or $|Y_i(t,f)| < \lambda_2 C(t,f)$, with fixed thresholds $\lambda_1=1.15$ and $\lambda_2=0.01$; a neighborhood relaxation with $\gamma=1.1$ is also applied, and inversion uses the mean phase. Evaluations on synthetic mixtures and real-world recordings show that it outperforms baselines and remains robust under packet loss, establishing a practical baseline for future, more sophisticated approaches to crowdsourced audio enhancement.
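The per-cell filtering and averaging step summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `median_filtered_average` is hypothetical, $C(t,f)$ is assumed to be the median magnitude across recordings in each TF cell, and the inputs are assumed to be already aligned and transformed to STFTs of equal shape. The neighborhood relaxation with $\gamma$ and the phase handling at inversion are left out because this summary does not specify them, and the fallback for cells where every recording is rejected is likewise an assumption.

```python
import numpy as np


def median_filtered_average(Y, lam1=1.15, lam2=0.01):
    """Average complex STFTs per TF cell after median-based outlier rejection.

    Y: complex array of shape (n_recordings, n_frames, n_bins), assumed to
       hold time-aligned STFTs of the same event.
    Returns a complex array G of shape (n_frames, n_bins).
    """
    mag = np.abs(Y)
    # Assumption: the reference C(t, f) is the median magnitude over recordings.
    C = np.median(mag, axis=0)

    # A recording is kept in a cell unless its magnitude is above lam1 * C
    # or below lam2 * C.
    keep = (mag <= lam1 * C) & (mag >= lam2 * C)
    count = keep.sum(axis=0)

    # Sum only the surviving complex values in each cell.
    summed = np.where(keep, Y, 0).sum(axis=0)

    # Assumption: if every recording is rejected in a cell (possible when the
    # number of recordings is even), fall back to the plain mean.
    return np.where(count > 0, summed / np.maximum(count, 1), Y.mean(axis=0))
```

With five identical recordings and one cell of one recording replaced by a loud burst, the burst is rejected in that cell and the output there matches the clean value, whereas a plain mean would be pulled toward the burst.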
Abstract
With the popularity of cellular phones, events are often recorded by multiple devices from different locations and shared on social media, so several different recordings can be found for many events. Such recordings are usually noisy, and the noise at each device is local and unrelated to the noise at the others. This case of multiple microphones at unknown locations, each capturing local, uncorrelated noise, has rarely been treated in the literature. In this work we propose a simple and effective crowdsourced audio enhancement method that removes local noise from each input audio signal; averaging the cleaned signals then gives an improved recording of the event. We demonstrate the effectiveness of our method on synthetic audio signals as well as real-world recordings. This simple approach can serve as a baseline for crowdsourced audio enhancement, against which more sophisticated methods, which we hope the research community will develop, can be compared.
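As a short worked illustration of why averaging helps when the noise is uncorrelated across devices, consider an idealized model that is an assumption here rather than something stated in the abstract: $N$ perfectly aligned recordings $y_i = s + n_i$, where the noises $n_i$ are zero-mean, mutually uncorrelated, and share the same variance $\sigma^2$. Then

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = s + \frac{1}{N}\sum_{i=1}^{N} n_i, \qquad \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} n_i\right) = \frac{1}{N^2}\sum_{i=1}^{N}\sigma^2 = \frac{\sigma^2}{N}.$$

Under these assumptions the noise power drops by a factor of $N$, an SNR gain of $10\log_{10} N$ dB (about 6 dB for $N=4$). Real recordings differ in gain and noise level, so this is only an upper-level intuition for the averaging step.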
