STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking

Yidi Li, Hong Liu, Bing Yang

TL;DR

This work tackles robust speaker tracking in cluttered environments by fusing audio and visual cues through an end-to-end network, STNet. It introduces a visual-guided acoustic measurement to align audio cues with visual observations, a cross-modal attention module to model inter-modal interactions, and a quality-aware mechanism for reliable multi-speaker tracking. Empirical results on AV16.3 and CAV3D show state-of-the-art performance in both single- and multi-target scenarios, with improved 2D localization and 3D trajectory accuracy. The approach demonstrates the practical potential of deep audio-visual fusion for real-world speaker tracking tasks in noisy and occluded settings.

Abstract

Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform; its accuracy and robustness can be improved by multi-modal fusion. Recently, several fusion methods have been proposed to model the correlation across modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals has not been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model. We design a visual-guided acoustic measurement method that fuses heterogeneous cues in a unified localization space, using visual observations and a camera model to construct an enhanced acoustic map. For feature fusion, a cross-modal attention module jointly models multi-modal contexts and interactions, so that correlated information between audio and visual features is exchanged within the fusion model. Moreover, the STNet-based tracker is extended to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.
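
As a rough illustration of the cross-modal attention idea mentioned in the abstract, the sketch below lets tokens from one modality attend over tokens from the other using plain scaled dot-product attention. It is a minimal single-head example without learned projections, not the authors' implementation; the function names, the toy token values, and the choice of visual tokens as queries are all assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (illustrative, no learned weights).

    queries: tokens of one modality, shape [Nq][d]
    keys, values: tokens of the other modality, shapes [Nk][d] and [Nk][dv]
    Returns the attended features [Nq][dv] and the attention weights [Nq][Nk].
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    dv = len(values[0])
    attended = [
        [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(dv)]
        for w in weights
    ]
    return attended, weights

# Toy example (hypothetical values): 2 visual tokens attend over 3 audio tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(visual_tokens, audio_tokens, audio_tokens)
```

Each row of `attn` is a probability distribution over the audio tokens, so `fused` holds, for every visual token, a weighted mix of audio features. A multi-head variant would typically repeat this with several learned projections of the queries, keys, and values and concatenate the results.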

Paper Structure (21 sections, 17 equations, 11 figures, 6 tables, 2 algorithms)

Figures (11)

  • Figure 1: Network architecture of the proposed STNet. In the audio-visual processing stage, audio cues are derived by visual-guided acoustic measurement. Audio and visual features are extracted by the audio CNN and the Siamese-like visual CNN, and then interact and are fused in a cross-modal attention module. A quality-aware module is used to construct an update/reset strategy for multi-speaker tracking. The speaker position is estimated by the prediction head.
  • Figure 2: (a) 3D sampling points at five depths, entire target trajectory and current ground-truth (GT). (b) Current image frame and GCF maps at three depths. The green cross indicates the speaker's position. Yellow (blue) indicates a higher (lower) probability of source presence.
  • Figure 3: The structure of the Siamese-based visual network with multi-scale fusion. The network weights of search region and template branch are shared.
  • Figure 4: The structure of GCFNet. The bottom line shows the binary pseudo-label generated on the image frame with GT using Gaussian distribution.
  • Figure 5: The structure of the cross-modal attention module. The audio and visual features interact and are fused in a multi-head cross-attention mechanism.
  • ...and 6 more figures
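
The Figure 2 caption refers to 3D sampling points at several depths and acoustic maps aligned with the image frame. One plausible way to obtain such points, sketched below, is to back-project a regular pixel grid through a pinhole camera model at each depth, so that an acoustic measure evaluated at a 3D point can be placed at its pixel. This is only a sketch under stated assumptions: the pinhole model, the intrinsics (`FX`, `FY`, `CX`, `CY`), the image size, the grid step, and the depth values are hypothetical and not taken from the paper.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) to a 3D point in the camera frame at the given depth."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)

def project(point, fx, fy, cx, cy):
    """Map a 3D camera-frame point back to pixel coordinates (pinhole model)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def sampling_points(width, height, step, depths, fx, fy, cx, cy):
    """Return {depth: [3D points]} for a regular pixel grid at each depth."""
    grid = [(u, v) for v in range(0, height, step) for u in range(0, width, step)]
    return {d: [backproject(u, v, d, fx, fy, cx, cy) for (u, v) in grid]
            for d in depths}

# Hypothetical intrinsics, image size, grid step, and depths (illustration only).
FX = FY = 400.0
CX, CY = 180.0, 144.0
points = sampling_points(360, 288, 40, [1.0, 2.0, 3.0], FX, FY, CX, CY)
```

Because every 3D point is tied to a known pixel, one 2D map per depth can be assembled by evaluating an acoustic measure at those points and writing each value at its grid location. `project` is included so the round trip from pixel to 3D and back can be checked.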