Audiovisual SlowFast Networks for Video Recognition

Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer

TL;DR

AVSlowFast introduces a unified audiovisual backbone that couples Slow and Fast visual streams with a dedicated Audio pathway, enabling hierarchical audio–visual fusion and synchronization. A key innovation is DropPathway, which regularizes joint training by randomly dropping the Audio pathway during training to balance the different learning dynamics of the two modalities, complemented by audiovisual synchronization as an auxiliary objective. The approach yields state-of-the-art results across six action recognition and detection datasets, with modest computational overhead, and also transfers to self-supervised audiovisual feature learning. Together, these contributions establish a multi-level framework for integrated audiovisual video understanding.

Abstract

We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features. Code will be made available at: https://github.com/facebookresearch/SlowFast.
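
The DropPathway idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_with_drop_pathway`, the parameter `drop_prob`, and the use of element-wise addition as the fusion operation are assumptions made for illustration only. The abstract states only that the Audio pathway is randomly dropped during training.

```python
import random


def fuse_with_drop_pathway(visual, audio, drop_prob, rng, training=True):
    """Fuse audio into visual features, randomly skipping audio in training.

    Hypothetical helper: `drop_prob` and the additive fusion are
    illustrative assumptions, not details taken from the paper.
    """
    if training and rng.random() < drop_prob:
        # Audio pathway dropped for this iteration: visual features pass
        # through unchanged, so the visual pathways keep training without
        # relying on audio.
        return list(visual)
    # Audio pathway kept: fuse by element-wise addition (assumed fusion op).
    return [v + a for v, a in zip(visual, audio)]


rng = random.Random(0)
visual = [1.0, 2.0, 3.0]
audio = [0.5, 0.5, 0.5]

always_dropped = fuse_with_drop_pathway(visual, audio, drop_prob=1.0, rng=rng)
never_dropped = fuse_with_drop_pathway(visual, audio, drop_prob=0.0, rng=rng)
at_inference = fuse_with_drop_pathway(
    visual, audio, drop_prob=1.0, rng=rng, training=False
)
```

In this sketch the drop decision is made once per call (that is, per training iteration), and audio is always used when `training=False`, which is one plausible reading of a regularizer that is applied only during training.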

Paper Structure

This paper contains 47 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Audiovisual SlowFast Networks have Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation.
  • Figure 2: Fusion connections for AVSlowFast. Left: A$\rightarrow$F$\rightarrow$S enforces strong temporal alignment between audio and RGB frames, as audio is fused into the Fast pathway, which has fine temporal resolution. Center: A$\rightarrow$FS has a higher tolerance for temporal misalignment, as audio is fused into the temporally downsampled output of SlowFast fusion. Right: Audiovisual Nonlocal fuses through a Nonlocal block (Wang et al., 2018), such that audio features are used to select the visual features that audio deems important.
  • Figure 3: Training procedure on Kinetics for Audio-only (red) vs. SlowFast (green) networks. We show the top-1 training error (dashed) and validation error (solid). The curves show single-crop errors; the video accuracy is 24.8% vs. 75.6%. The audio network converges after around 3$\times$ fewer iterations than the visual network.
  • Figure A.1: AVA per-class average precision. AVSlowFast (27.8 mAP) vs. its SlowFast counterpart (26.3 mAP). The highlighted categories are the 5 highest absolute increases (bold) and the 5 highest relative increases (orange) over SlowFast. Best viewed in color with zoom.