SlowFast Networks for Video Recognition

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He

TL;DR

The paper introduces SlowFast, a two-pathway architecture for video recognition that treats time asymmetrically by pairing a Slow pathway for spatial semantics at low frame rate with a Fast pathway for fast-motion details at high frame rate. The Fast pathway is deliberately lightweight (low channel capacity) and fused to the Slow pathway via lateral connections, enabling accurate motion modeling without heavy computation. Extensive experiments on Kinetics-400/600, Charades, and AVA demonstrate state-of-the-art results, with ablations confirming the complementary benefits of Slow and Fast streams and the effectiveness of various fusion strategies and input modalities. The work provides practical insights into efficient video modeling and releases code for end-to-end training without relying on optical flow.

Abstract

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast
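As a rough illustration of the two frame rates described in the abstract, the sketch below picks frame indices for each pathway from the same raw clip. The function name `sample_pathway_indices`, the clip length of 64 frames, the Slow stride of 16, and the frame-rate ratio of 8 are all assumptions chosen for the example, not values stated in this summary, and the official repository may sample frames differently.

```python
def sample_pathway_indices(num_frames, slow_stride, alpha):
    """Return (slow_indices, fast_indices) for a clip of `num_frames` raw frames.

    slow_stride: temporal stride of the Slow pathway (illustrative assumption).
    alpha: how many times denser the Fast pathway samples frames than the
           Slow pathway (illustrative assumption).
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be divisible by alpha")
    fast_stride = slow_stride // alpha
    # Slow pathway: sparse sampling, i.e. a low frame rate.
    slow = list(range(0, num_frames, slow_stride))
    # Fast pathway: alpha-times denser sampling of the same clip.
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


# Hypothetical clip: 64 raw frames, Slow stride 16, alpha = 8.
slow_idx, fast_idx = sample_pathway_indices(num_frames=64, slow_stride=16, alpha=8)
print(len(slow_idx), len(fast_idx))  # 4 32
```

With these assumed settings the Fast pathway sees eight times as many frames as the Slow pathway, and every Slow frame is also a Fast frame.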

Paper Structure

This paper contains 40 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: A SlowFast network has a low frame rate, low temporal resolution Slow pathway and a high frame rate, $\alpha\times$ higher temporal resolution Fast pathway. The Fast pathway is lightweight because it uses only a fraction ($\beta$, e.g., 1/8) of the channels. Lateral connections fuse the two pathways.
  • Figure 2: Accuracy/complexity tradeoff on Kinetics-400 for the SlowFast (green) vs. Slow-only (blue) architectures. SlowFast is consistently better than its Slow-only counterpart in all cases (green arrows). SlowFast provides higher accuracy and lower cost than temporally heavy Slow-only models (e.g., the red arrow). Complexity is reported for a single $256^2$ view, and accuracies are obtained by 30-view testing.
  • Figure 3: Per-category AP on AVA: a Slow-only baseline (19.0 mAP) vs. its SlowFast counterpart (24.2 mAP). The highlighted categories are those with the 5 highest absolute increases (black) or the 5 highest relative increases among categories with Slow-only AP $> 1.0$ (orange). Categories are sorted by number of examples. Note that the SlowFast instantiation in this ablation is not our best-performing model.
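The relationship in the Figure 1 caption can be sketched as simple shape bookkeeping: if the Slow pathway processes T frames with C channels, the Fast pathway processes $\alpha$T frames with $\beta$C channels. The helper name `pathway_shapes` and the concrete values (T = 4, C = 64, $\alpha$ = 8) are hypothetical; only $\beta$ = 1/8 is given as an example in the caption.

```python
from fractions import Fraction


def pathway_shapes(slow_frames, slow_channels, alpha, beta):
    """Return ((T, C), (alpha * T, beta * C)) for the Slow and Fast pathways.

    slow_frames (T) and slow_channels (C) are illustrative inputs;
    alpha is the temporal ratio and beta the channel fraction.
    """
    # Use exact rational arithmetic so beta = 1/8 does not introduce rounding.
    fast_channels = Fraction(slow_channels) * Fraction(beta)
    if fast_channels.denominator != 1:
        raise ValueError("beta * slow_channels must be an integer")
    slow_shape = (slow_frames, slow_channels)
    fast_shape = (alpha * slow_frames, int(fast_channels))
    return slow_shape, fast_shape


# Hypothetical stage: T = 4 frames, C = 64 channels, alpha = 8, beta = 1/8.
slow_shape, fast_shape = pathway_shapes(
    slow_frames=4, slow_channels=64, alpha=8, beta=Fraction(1, 8)
)
print(slow_shape, fast_shape)  # (4, 64) (32, 8)
```

Under these assumptions the Fast pathway has eight times the temporal resolution but one eighth of the channels, which is the sense in which the caption calls it lightweight.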