Beyond Short Snippets: Deep Networks for Video Classification

Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici

TL;DR

The paper addresses video classification by enabling CNN-based models to capture long-range temporal structure. It introduces two architectures: feature pooling networks that aggregate frame-level CNN features over time, and deep LSTMs that model the ordered sequence of frame activations, both sharing parameters across time. By incorporating explicit motion through optical flow and leveraging pretraining on ImageNet, the methods achieve state-of-the-art results on Sports-1M and UCF-101, with notable gains from longer temporal context and motion information. These findings demonstrate the value of processing entire videos rather than short clips and suggest directions for deeper integration of temporal reasoning into CNN-based video models.

Abstract

Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first method explores various convolutional temporal feature pooling architectures, examining the design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 dataset with (88.6% vs. 88.0%) and without (82.6% vs. 72.8%) additional optical flow information.
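
To make the first method concrete, the sketch below max-pools per-frame feature vectors over time and classifies the pooled vector. It is a minimal NumPy illustration, not the authors' implementation: the random array stands in for real CNN features, and the dimensions, the single linear classifier, and the names `temporal_max_pool` and `classify` are assumptions introduced here.

```python
import numpy as np

def temporal_max_pool(frame_features):
    """Element-wise max over the time axis: (T, D) -> (D,)."""
    return frame_features.max(axis=0)

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(frame_features, W, b):
    """Pool frame features over time, then apply a linear softmax classifier."""
    pooled = temporal_max_pool(frame_features)
    return softmax(W @ pooled + b)

# Hypothetical sizes: T frames, D-dimensional features, K classes.
rng = np.random.default_rng(0)
T, D, K = 30, 8, 5
features = rng.normal(size=(T, D))   # stand-in for per-frame CNN outputs
W = rng.normal(size=(K, D))          # stand-in classifier weights
b = np.zeros(K)

probs = classify(features, W, b)     # one class distribution per video
```

Because the maximum is taken independently per feature dimension, the pooled vector does not depend on frame order. The second method, which models the video as an ordered sequence, differs on exactly this point.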

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of our approach.
  • Figure 2: Different Feature-Pooling Architectures: The stacked convolutional layers are denoted by "C". Blue, green, yellow and orange rectangles represent max-pooling, time-domain convolutional, fully-connected and softmax layers respectively.
  • Figure 3: Each LSTM cell remembers a single floating point value $c_t$. This value may be diminished or erased through a multiplicative interaction with the forget gate $f_t$, or additively modified by the current input $x_t$ multiplied by the activation of the input gate $i_t$. The output gate $o_t$ controls the emission of $h_t$, the stored memory $c_t$ transformed by the hyperbolic tangent nonlinearity. Image duplicated with permission from Alex Graves.
  • Figure 4: Deep Video LSTM takes as input the output of the final CNN layer at each consecutive video frame. CNN outputs are processed forward through time and upwards through five layers of stacked LSTMs. A softmax layer predicts the class at each time step. The parameters of the convolutional networks (pink) and softmax classifier (orange) are shared across time steps.
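
The gating described in the Figure 3 caption and the stacking described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a hedged illustration only. It uses the common LSTM formulation without peephole connections, which may differ from the exact equations in the paper. The weight layout, the hidden size, the random inputs, and the names `lstm_step` and `deep_lstm_predict` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, X + H); b has shape (4H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate i_t
    f = sigmoid(z[H:2 * H])        # forget gate f_t
    o = sigmoid(z[2 * H:3 * H])    # output gate o_t
    g = np.tanh(z[3 * H:4 * H])    # candidate update from the current input
    c = f * c_prev + i * g         # memory: scaled by f_t, additively updated
    h = o * np.tanh(c)             # emitted value h_t
    return h, c

def deep_lstm_predict(frames, layers, W_out, b_out):
    """frames: (T, D); layers: list of (W, b). Returns (T, K) class probabilities."""
    H = W_out.shape[1]
    state = [(np.zeros(H), np.zeros(H)) for _ in layers]
    outputs = []
    for x in frames:                         # forward through time
        inp = x
        for k, (W, b) in enumerate(layers):  # upward through the stack
            h, c = lstm_step(inp, state[k][0], state[k][1], W, b)
            state[k] = (h, c)
            inp = h
        # The same softmax weights are reused at every time step.
        outputs.append(softmax(W_out @ inp + b_out))
    return np.stack(outputs)

# Hypothetical sizes: five stacked layers as in Figure 4, the rest arbitrary.
rng = np.random.default_rng(0)
T, D, H, K, L = 6, 8, 4, 3, 5
frames = rng.normal(size=(T, D))             # stand-in for per-frame CNN outputs
layers = []
for k in range(L):
    in_dim = D if k == 0 else H
    layers.append((0.5 * rng.normal(size=(4 * H, in_dim + H)), np.zeros(4 * H)))
W_out = 0.5 * rng.normal(size=(K, H))
b_out = np.zeros(K)

probs = deep_lstm_predict(frames, layers, W_out, b_out)  # one prediction per frame
```

Each layer carries its own `(h, c)` state from one frame to the next, so the prediction at a given time step depends on the frames seen so far and on their order.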