SlowFast Networks for Video Recognition
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
TL;DR
The paper introduces SlowFast, a two-pathway architecture for video recognition that treats time asymmetrically: a Slow pathway operates at a low frame rate to capture spatial semantics, while a Fast pathway operates at a high frame rate to capture rapidly changing motion. The Fast pathway is deliberately lightweight (low channel capacity) and is fused into the Slow pathway through lateral connections, so motion is modeled at fine temporal resolution at little extra computational cost. Experiments on Kinetics-400/600, Charades, and AVA report state-of-the-art results, and ablations indicate that the Slow and Fast streams are complementary while comparing several fusion strategies and input modalities. The models are trained end to end without optical flow, and the code is publicly released.
Abstract
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast
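To make the asymmetry between the two pathways concrete, the following is a minimal sketch of how one clip might be split into a sparsely sampled Slow input and a densely sampled Fast input, with the Fast pathway given a fraction of the Slow pathway's channels. The function name, the parameter names (`slow_stride`, `speed_ratio`, `channel_ratio`), and the numeric values are hypothetical and chosen only for illustration; they are not stated in this section, and the sketch covers input sampling and channel widths only, not the convolutional layers or the lateral connections.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PathwayInputs:
    slow_indices: List[int]   # frame indices fed to the Slow pathway
    fast_indices: List[int]   # frame indices fed to the Fast pathway
    slow_channels: int        # channel width of a Slow-pathway layer
    fast_channels: int        # channel width of the matching Fast-pathway layer


def split_pathways(num_frames: int, slow_stride: int, speed_ratio: int,
                   slow_channels: int, channel_ratio: int) -> PathwayInputs:
    """Illustrative split of a clip into Slow and Fast pathway inputs.

    All parameters are assumptions for this sketch:
      slow_stride   - temporal stride of the Slow pathway (low frame rate)
      speed_ratio   - how many times denser the Fast pathway samples frames
      channel_ratio - how many times fewer channels the Fast pathway uses
    """
    if slow_stride % speed_ratio != 0:
        raise ValueError("slow_stride must be divisible by speed_ratio")

    fast_stride = slow_stride // speed_ratio

    # Slow pathway: few frames, sampled sparsely in time.
    slow_indices = list(range(0, num_frames, slow_stride))
    # Fast pathway: many frames, sampled densely in time.
    fast_indices = list(range(0, num_frames, fast_stride))

    # Fast pathway stays lightweight by using far fewer channels.
    fast_channels = max(1, slow_channels // channel_ratio)

    return PathwayInputs(slow_indices, fast_indices, slow_channels, fast_channels)


# Example with illustrative values: a 64-frame clip.
inputs = split_pathways(num_frames=64, slow_stride=16, speed_ratio=8,
                        slow_channels=64, channel_ratio=8)
print(len(inputs.slow_indices), len(inputs.fast_indices), inputs.fast_channels)
```

Under these assumed values the Fast pathway sees eight times as many frames as the Slow pathway but carries one eighth of the channels, which is the sense in which it can be high in temporal resolution yet lightweight.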
