Audio Enhancement from Multiple Crowdsourced Recordings: A Simple and Effective Baseline

Shiran Aziz, Yossi Adi, Shmuel Peleg

TL;DR

The paper tackles crowdsourced audio enhancement for events captured by multiple uncoordinated devices, where the noise at each device is local and uncorrelated with the noise at the others. It introduces a training-free baseline that aligns the inputs, denoises in the time-frequency (TF) domain by filtering outliers in each TF cell relative to the median, and averages the remaining complex STFT values to obtain the enhanced spectrogram $G(t,f)$, from which the enhanced signal is reconstructed. A value is filtered out when $|Y_i(t,f)| > \lambda_1 C(t,f)$ or $|Y_i(t,f)| < \lambda_2 C(t,f)$, with fixed thresholds $\lambda_1=1.15$ and $\lambda_2=0.01$ and a neighborhood relaxation with $\gamma=1.1$; the inversion uses the mean phase. Evaluations on synthetic mixtures and real-world recordings show that it outperforms the compared baselines and remains robust under packet loss, establishing a practical baseline for future, more sophisticated approaches to crowdsourced audio enhancement.
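The per-cell filtering described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes that $C(t,f)$ is the median magnitude across clips, that the clips are already aligned and transformed to STFTs, and it leaves out the neighborhood relaxation ($\gamma$), whose exact rule is not given in this summary. The names `enhance_cell` and `enhance`, the use of `None` for cells a clip does not cover, and the fallback when every value is filtered out are all hypothetical.

```python
from statistics import median

LAMBDA_1 = 1.15  # upper threshold factor quoted in the summary
LAMBDA_2 = 0.01  # lower threshold factor quoted in the summary


def enhance_cell(values, lam1=LAMBDA_1, lam2=LAMBDA_2):
    """Combine the complex STFT values that several clips give for one TF cell."""
    mags = [abs(v) for v in values]
    c = median(mags)  # assumption: C(t,f) is the median magnitude
    # Keep values that are neither too loud (> lam1*C) nor too quiet (< lam2*C).
    kept = [v for v, m in zip(values, mags) if lam2 * c <= m <= lam1 * c]
    if not kept:
        # Hypothetical fallback (not from the paper): if everything was
        # rejected, fall back to the plain mean of all values.
        kept = list(values)
    # Averaging complex values averages magnitude and phase together.
    return sum(kept) / len(kept)


def enhance(specs):
    """Enhance aligned spectrograms given as specs[clip][t][f].

    A cell that a clip does not cover is marked with None (an assumption
    of this sketch); cells covered by no clip are set to 0.
    """
    n_t, n_f = len(specs[0]), len(specs[0][0])
    out = []
    for t in range(n_t):
        row = []
        for f in range(n_f):
            values = [s[t][f] for s in specs if s[t][f] is not None]
            row.append(enhance_cell(values) if values else 0j)
        out.append(row)
    return out
```

With five clips whose values in one cell are `[1, 1.1, 0.9, 5, 0.001]`, the median magnitude is 1, so the loud value 5 and the near-silent value 0.001 are dropped and the remaining three are averaged. An inverse STFT of the result would then give the enhanced waveform.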

Abstract

With the popularity of cellular phones, events are often recorded by multiple devices from different locations and shared on social media. Several different recordings can be found for many events. Such recordings are usually noisy, where the noise for each device is local and unrelated to the others. This case of multiple microphones at unknown locations, capturing local, uncorrelated noise, has rarely been treated in the literature. In this work we propose a simple and effective crowdsourced audio enhancement method to remove local noises from each input audio signal. Then, averaging all cleaned source signals gives an improved audio of the event. We demonstrate the effectiveness of our method using synthetic audio signals, together with real-world recordings. This simple approach can set a new baseline for crowdsourced audio enhancement, on which we hope the research community will build more sophisticated methods.
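A short worked calculation may help show why averaging is effective when the noise is uncorrelated across devices. It is a textbook idealization rather than a result from the paper: the symbols $y_i$, $s$, $n_i$, $\sigma^2$, and $N$ are introduced here for illustration, and the model assumes perfectly aligned recordings with equal gain and equal-power, zero-mean noise.

```latex
\begin{align*}
y_i(t) &= s(t) + n_i(t), \qquad
  \mathbb{E}[n_i(t)] = 0, \quad
  \mathbb{E}[n_i(t)\, n_j(t)] = \sigma^2 \delta_{ij}, \\
\bar{y}(t) &= \frac{1}{N} \sum_{i=1}^{N} y_i(t)
  = s(t) + \frac{1}{N} \sum_{i=1}^{N} n_i(t), \\
\operatorname{Var}\!\left[ \frac{1}{N} \sum_{i=1}^{N} n_i(t) \right]
  &= \frac{1}{N^2} \sum_{i=1}^{N} \sigma^2 = \frac{\sigma^2}{N}, \\
\mathrm{SNR}_{\bar{y}} &= \mathrm{SNR}_{y_i} + 10 \log_{10} N \ \text{dB}.
\end{align*}
```

Under these assumptions, averaging $N = 5$ recordings improves the SNR by $10 \log_{10} 5 \approx 7$ dB. Real crowdsourced recordings violate the equal-power assumption, since a loud local noise in one clip can dominate the mean, which is the situation that removing local noise before averaging is meant to address.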

Paper Structure (10 sections, 1 equation, 5 figures, 1 table, 1 algorithm)

Figures (5)

  • Figure 2: Clips after temporal alignment. For each time period there may be different clips covering this period. In this example we have 5 input clips, where some periods are covered by 1, 2, 3, or 4 simultaneous clips.
  • Figure 3: The audio enhancement process: For each TF cell in the spectrogram of overlapping clips we examine the amplitudes in each clip, compute the median amplitude, and remove values whose distance from the median exceeds a threshold. Averaging the complex values from the remaining clips gives the value of the TF cell in the enhanced spectrogram. In this figure we have 5 overlapping clips, and of the 5 amplitudes in the examined TF cell the highest and lowest amplitudes are discarded as outliers.
  • Figure 4: Combining 5 synthetic noisy audio signals: average SI-SNR of the enhanced signal as a function of the SNR of the input signals, with a $95\%$ confidence interval over 100 experiments. Methods compared: (i) Mean: the mean of all signals. (ii) Median: the mean magnitude is replaced with the median magnitude in each TF cell. (iii) Max Elimination [MaxElimination:2017]: the maximal magnitude in each TF cell is removed. (iv) fastFCA: the contribution of each source is modeled as a zero-mean complex Gaussian distribution. (v) Our Crowdsourced Enhancement, which consistently gives the best results.
  • Figure 5: Average SI-SNR of the enhanced signal when combining 3, 5, and 10 synthetic noisy audio signals. The source signal is music, and the noise is speech. Max Elimination [MaxElimination:2017], the best baseline, is compared with our Crowdsourced Enhancement. As expected, the benefit of our method over the baseline increases as more noisy signals are combined.
  • Figure 6: Same as Fig. \ref{fig:synthetic}, but with simulated packet loss, where each noisy input signal also has a randomly placed one second of silence. Max Elimination, the best baseline under additive noise, fails in this case.
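The simpler baselines and the SI-SNR metric named in the captions above can be sketched as follows. This is a rough sketch under stated assumptions, not the evaluation code used in the paper: `si_snr` follows the commonly used zero-mean, projection-based definition of scale-invariant SNR, the Median baseline is read as "median magnitude with the phase of the mean", and all function names (`mean_cell`, `median_cell`, `max_elimination_cell`, `add_packet_loss`) are hypothetical. fastFCA is not covered.

```python
import cmath
import math
from statistics import median


def si_snr(estimate, reference, eps=1e-12):
    """Scale-invariant SNR in dB between two equal-length sample lists."""
    n = len(reference)
    est_mean, ref_mean = sum(estimate) / n, sum(reference) / n
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    noise_energy = sum((e - t) ** 2 for e, t in zip(est, target)) + eps
    target_energy = sum(t * t for t in target)
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def mean_cell(values):
    """Mean baseline: plain average of the complex values in one TF cell."""
    return sum(values) / len(values)


def median_cell(values):
    """Median baseline (assumed reading): median magnitude, phase of the mean."""
    return cmath.rect(median(abs(v) for v in values), cmath.phase(mean_cell(values)))


def max_elimination_cell(values):
    """Max Elimination baseline: drop the largest magnitude, average the rest."""
    if len(values) < 2:
        return values[0]
    drop = max(range(len(values)), key=lambda i: abs(values[i]))
    kept = values[:drop] + values[drop + 1:]
    return sum(kept) / len(kept)


def add_packet_loss(signal, sample_rate, rng, seconds=1.0):
    """Zero out one randomly placed segment, imitating the packet-loss test."""
    gap = min(len(signal), int(seconds * sample_rate))
    start = rng.randrange(len(signal) - gap + 1)
    return signal[:start] + [0.0] * gap + signal[start + gap:]
```

One way to see the packet-loss behavior: for a cell where four clips agree and the fifth is silent, such as `[1, 1, 1, 1, 0]`, `max_elimination_cell` removes one of the good values and keeps the zero, so the average is pulled toward silence. A filter that also rejects values far below the median does not have this problem.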