GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection

Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall

TL;DR

GateFusion addresses active speaker detection with a Hierarchical Gated Fusion Decoder (HiGate), which progressively injects cross-modal context between the audio and visual streams at multiple Transformer layers. It builds on strong pretrained encoders (AV-HuBERT for video, Whisper for audio) and adds two auxiliary losses: Masked Alignment Loss (MAL), which aligns unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP), which suppresses video-only false positives. The method sets new state-of-the-art results on Ego4D-ASD, UniTalk, and WASD, is competitive on AVA-ActiveSpeaker, and generalizes well out of domain. Ablations show that HiGate, MAL, and OPP contribute complementary gains, and that fusing at four layers (1, 4, 7, and 10) gives the best trade-off between accuracy and efficiency, which suggests that hierarchical gated fusion may also suit other multimodal tasks.

Abstract

Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.

Paper Structure

This paper contains 24 sections, 9 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Overview of the proposed GateFusion architecture featuring HiGate. Subfigures (a)-(c) depict typical late fusion strategies, where audio-visual features are extracted independently and fused only at the final stage: (a) Summation Fusion, (b) Concatenation Fusion, and (c) Late Fusion Decoder (e.g., cross-attention) after unimodal encoding. In contrast, our method (d) HiGate performs hierarchical cross-modal fusion by progressively injecting contextual signals from one modality into the other across multiple encoder layers. The degree of fusion is adaptively controlled by learnable gates, enabling fine-grained and robust audio-visual integration. For clarity, auxiliary unimodal classifiers used for computing the auxiliary losses are omitted from the illustration.
  • Figure 2: Illustration of the gating mechanism. The initial output $f_p$ from the primary modality is progressively integrated with hidden states $h_c^{l}$ from the context modality at multiple layers, with each fusion step modulated by a learnable gate. This process iteratively refines $f_p$ into the final enriched representation $\tilde{f}_p$.
  • Figure 3: Ablations on (a) fusion stage and (b) number of fusion layers on Ego4D-ASD, showing the impact of fusion policies. Orange bars mark our chosen configuration (layers 1, 4, 7, 10).
  • Figure 4: Comparison of mAP scores for different decoders: our HiGate, CrossAtten, Concat, and Sum decoders.
  • Figure 5: Ablations on model hyperparameters. (a) We visualize the trade-off between memory cost (VRAM, blue solid line) and inference throughput (FPS, orange dashed line) as the number of fusion layers increases. The gray band highlights our selected configuration ($N=4$). (b) Performance across different decoder widths.
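
The gating step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the paper's implementation. Only the symbols $f_p$, $h_c^{l}$, and $\tilde{f}_p$, the use of learnable gates conditioned on both modalities, and the choice of four fusion layers come from the text above. The sigmoid gate, the residual update form, and the names `gated_inject`, `hierarchical_fuse`, `W_proj`, `W_gate`, and `b_gate` are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_inject(f_p, h_c, W_proj, W_gate, b_gate):
    """One fusion step (assumed form): f_p <- f_p + gate * (W_proj @ h_c).

    The gate sees the concatenation [f_p; h_c], so it is conditioned on
    both modalities. The residual form is an assumption for illustration.
    """
    gate_logits = matvec(W_gate, f_p + h_c)  # list concat gives [f_p; h_c]
    gate = [sigmoid(z + b) for z, b in zip(gate_logits, b_gate)]
    update = matvec(W_proj, h_c)
    return [p + g * u for p, g, u in zip(f_p, gate, update)]


def hierarchical_fuse(f_p, context_states, params):
    """Refine f_p with one gated injection per selected context layer."""
    for h_c, (W_proj, W_gate, b_gate) in zip(context_states, params):
        f_p = gated_inject(f_p, h_c, W_proj, W_gate, b_gate)
    return f_p  # plays the role of the enriched representation


def random_matrix(rng, rows, cols, scale=0.5):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


# Toy setup: feature width 4 and four fusion points, mirroring N = 4.
rng = random.Random(0)
d, n_layers = 4, 4
f_p = [rng.uniform(-1, 1) for _ in range(d)]
context_states = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_layers)]


def make_params(gate_bias):
    """Build per-layer (W_proj, W_gate, b_gate) with a shared gate bias."""
    return [
        (random_matrix(rng, d, d), random_matrix(rng, d, 2 * d), [gate_bias] * d)
        for _ in range(n_layers)
    ]


closed = hierarchical_fuse(f_p, context_states, make_params(-50.0))
fused = hierarchical_fuse(f_p, context_states, make_params(0.0))
```

With a strongly negative gate bias the gates stay near zero, so `closed` matches the input `f_p` to within floating-point tolerance. This illustrates how a gate can suppress a context signal that does not help the primary modality, while `fused` shows the same loop with partially open gates.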