Robust Active Speaker Detection in Noisy Environments

Siva Sai Nagender Vasireddy, Chenxu Zhang, Xiaohu Guo, Yapeng Tian

TL;DR

The paper tackles robust active speaker detection (rASD) under environmental noise. Its framework uses audio-visual speech separation as guidance to learn noise-free audio features, and it optimizes separation and ASD jointly in an end-to-end setup. A dynamic weighted loss handles inherent speech noise, and a real-world noise audio dataset (RNA) is introduced to study the impact of noise. Across AVA-ActiveSpeaker and RNA-augmented scenarios, the approach improves ASD robustness over multiple baselines and generalizes to different ASD architectures. The authors plan to release the code, models, and data (including RNA) to support broader evaluation.

Abstract

This paper addresses the issue of active speaker detection (ASD) in noisy environments and formulates a robust active speaker detection (rASD) problem. Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance. To overcome this, we propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features. These features are then utilized in an ASD model, and both tasks are jointly optimized in an end-to-end framework. Our proposed framework mitigates residual noise and audio quality reduction issues that can occur in a naive cascaded two-stage framework that directly uses separated speech for ASD, and enables the two tasks to be optimized simultaneously. To further enhance the robustness of the audio features and handle inherent speech noises, we propose a dynamic weighted loss approach to train the speech separator. We also collected a real-world noise audio dataset to facilitate investigations. Experiments demonstrate that non-speech audio noises significantly impact ASD models, and our proposed approach improves ASD performance in noisy environments. The framework is general and can be applied to different ASD approaches to improve their robustness. Our code, models, and data will be released.
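
To make the noisy setting concrete, here is a minimal sketch of how a non-speech recording could be mixed into a speech track at a chosen signal-to-noise ratio. This is a common augmentation recipe, not a description of the authors' pipeline: the function name `mix_at_snr`, the `snr_db` parameter, and the use of plain Python lists as waveforms are all assumptions made for illustration.

```python
import math
from typing import List


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Add `noise` to `speech` after scaling it to a target SNR in dB.

    Hypothetical helper for illustration; the paper's actual mixing
    procedure is not specified in this summary.
    """
    if not speech or len(speech) != len(noise):
        raise ValueError("speech and noise must be non-empty and equally long")

    # Mean power of each signal.
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return list(speech)  # silent noise: nothing to mix in

    # SNR_dB = 10 * log10(P_speech / P_noise_scaled), solved for the gain.
    target_noise_power = p_speech / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy waveforms: at 0 dB the scaled noise carries the same power as the speech.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=0.0)
```

Lower `snr_db` values give a louder noise track, which is the regime where, according to the abstract, ASD models degrade most.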

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, and 12 tables.

Figures (5)

  • Figure 1: Given a video with both audio and visual tracks, we develop a robust deep audio-visual analysis model that can detect active speakers even in a noisy environment.
  • Figure 2: The proposed robust active speaker detection framework. Within the framework, we use an audio-visual speech separator to guide the learning of noise-free speech features for active speaker detection. The framework includes a nonlinear transformation $g(\cdot)$ to bridge the features between the separator and the detector. In addition, a dynamic weighting mechanism generates dynamic weights for the separation loss, which helps handle inherent speech noises. The framework is general and can be applied to improve the robustness of any existing audio-visual active speaker detector.
  • Figure 3: Visual results of different methods under audio noises. From top to bottom: (1) GT: ground-truth active speakers; (2) Baseline: results detected by TalkNet with noisy training; (3) Ours: results detected by our framework with TalkNet. For the four examples, from left to right, four different non-speech sounds (aircraft, smoke alarms, a pig, and an emergency vehicle) are added, respectively. Our framework can defend against audio noises and accurately detect active speakers.
  • Figure 4: Real-world visual examples, both of which are from video recordings without any additional non-speech sounds mixed in. The first example has strong cafeteria noise, while the second contains background music sounds. Our approach, TalkNet+Ours, can be applied to handle these real-world examples, as shown in the second row. In contrast, our baseline approach, TalkNet with noisy training, failed to perform well, as depicted in the first row.
  • Figure 5: Progression of mean training sample weight generated by $\Phi_W$ (Section 3.4 of the main paper) during training. The graphs clearly illustrate that the weights of clean speech samples are higher than the weights of noisy (inherent) speech samples during the initial epochs ($\approx 2000$ steps per epoch) and the weights converge to 1 in the later stages of training. This allows the speech separator $\Phi_{SS}$ to mainly learn from clean speech samples in the beginning and generalize to all speech samples in the later stages of training.
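
The dynamic weighting described in Figures 2 and 5 can be illustrated with a small sketch of a per-sample weighted separation loss combined with an ASD loss. This is only a sketch under stated assumptions: in the paper the weights come from the learned module $\Phi_W$, whereas here they are supplied by hand, and the mean normalization, the additive combination, and the `sep_scale` factor are hypothetical choices rather than the authors' formulation.

```python
from typing import Sequence


def weighted_separation_loss(
    sample_losses: Sequence[float], sample_weights: Sequence[float]
) -> float:
    """Mean of per-sample separation losses, each scaled by its weight.

    Assumption: weights are given directly; the paper generates them
    with a learned weighting module instead.
    """
    if not sample_losses or len(sample_losses) != len(sample_weights):
        raise ValueError("losses and weights must be non-empty and equally long")
    total = sum(w * l for w, l in zip(sample_weights, sample_losses))
    return total / len(sample_losses)


def joint_loss(
    asd_loss: float,
    sample_losses: Sequence[float],
    sample_weights: Sequence[float],
    sep_scale: float = 1.0,  # hypothetical trade-off factor, not from the paper
) -> float:
    """ASD loss plus the scaled, weighted separation loss."""
    return asd_loss + sep_scale * weighted_separation_loss(sample_losses, sample_weights)


# Two samples: the first is clean speech, the second has inherent noise.
sep_losses = [2.0, 4.0]

# Early training: the noisy sample is down-weighted.
early = joint_loss(0.5, sep_losses, [1.0, 0.5])

# Late training: weights have converged to 1, so every sample counts equally.
late = joint_loss(0.5, sep_losses, [1.0, 1.0])
```

With hand-set weights like these, the noisy sample contributes less to the separator's gradient early on, which mirrors the behaviour Figure 5 reports for the learned weights.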