Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer
TL;DR
AVSlowFast introduces a unified audiovisual backbone that couples Slow and Fast visual pathways with a dedicated Audio pathway, fusing audio and visual features at multiple layers. A key innovation is DropPathway, which regularizes joint training by randomly dropping the Audio pathway, countering the different learning dynamics of the audio and visual modalities; hierarchical audiovisual synchronization is used to learn joint audiovisual features. The approach yields state-of-the-art results on six video action classification and detection datasets with modest computational overhead, and it generalizes to learning self-supervised audiovisual features. Together, these contributions provide a multi-level framework for integrated audiovisual video understanding.
Abstract
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features. Code will be made available at: https://github.com/facebookresearch/SlowFast.
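The DropPathway idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `drop_prob` parameter, and the element-wise-sum fusion are assumptions made purely for illustration, and features are represented as plain Python lists rather than tensors. The sketch only shows the control flow: during training, the audio features are discarded with some probability, so the fused representation falls back to the visual features alone.

```python
import random


def drop_pathway(audio_features, drop_prob, training, rng=random):
    """Return None (audio dropped) with probability drop_prob while training.

    drop_prob is a hypothetical hyperparameter name; outside of training
    the audio features are always kept.
    """
    if training and rng.random() < drop_prob:
        return None
    return audio_features


def fuse(visual_features, audio_features):
    """Hypothetical fusion step: element-wise sum when audio is present.

    When the Audio pathway has been dropped (None), the visual features
    pass through unchanged.
    """
    if audio_features is None:
        return list(visual_features)
    return [v + a for v, a in zip(visual_features, audio_features)]


def forward(visual_features, audio_features, drop_prob=0.5, training=True, rng=random):
    """One illustrative forward step combining the two helpers above."""
    kept_audio = drop_pathway(audio_features, drop_prob, training, rng)
    return fuse(visual_features, kept_audio)
```

With `training=False`, `forward` always fuses both modalities, which reflects the assumption that the Audio pathway is only dropped during training, as the abstract states.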
