Towards Good Practices for Very Deep Two-Stream ConvNets

Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao

TL;DR

The paper addresses the limited gains that deep learning has shown for video action recognition by building very deep two-stream ConvNets on GoogLeNet and VGG-16. To combat overfitting on small action datasets, it proposes four training practices: pre-training for both the spatial and temporal streams, smaller learning rates, more data augmentation, and a high dropout ratio. It also extends Caffe with a multi-GPU implementation that yields substantial training speedups. The approach reaches $91.4\%$ accuracy on UCF101, which the paper reports as state-of-the-art. The result suggests that depth, when paired with disciplined training and efficient computation, substantially improves video action recognition.

Abstract

Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement from deep convolutional networks is not so evident. We argue that there are two reasons that could explain this result. First, the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with the very deep models in the image domain (e.g. VGGNet, GoogLeNet), and therefore their modeling capacity is constrained by their depth. Second, and probably more importantly, the training datasets for action recognition are extremely small compared with the ImageNet dataset, and thus it is easy to over-fit on the training data. To address these issues, this report presents very deep two-stream ConvNets for action recognition, obtained by adapting recent very deep architectures to the video domain. However, this extension is not easy, as action recognition datasets are quite small. We design several good practices for training very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, and (iv) a high dropout ratio. Meanwhile, we extend the Caffe toolbox into a multi-GPU implementation with high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the UCF101 dataset, where they achieve a recognition accuracy of $91.4\%$.
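
The abstract lists pre-training for the temporal net as a practice but does not say how weights learned on 3-channel RGB images can initialize a network that takes stacked optical-flow fields as input. One common way to bridge that gap, shown here as an illustrative sketch and not as the authors' confirmed procedure, is to average each first-layer filter over its RGB channels and replicate the mean across every flow channel. The function name `adapt_first_layer`, the 20-channel input (assumed to be 10 flow frames with x and y components), and the toy 2x2 filter are all assumptions made for illustration.

```python
def adapt_first_layer(rgb_filters, num_flow_channels):
    """Convert RGB first-layer filters into filters for a stacked-flow input.

    rgb_filters: list of filters; each filter is a list of channel planes,
                 and each plane is a list of rows of floats.
    Returns a list of filters with `num_flow_channels` planes each, where
    every plane is the element-wise mean of the original channel planes.
    """
    adapted = []
    for filt in rgb_filters:
        n_in = len(filt)                      # 3 for RGB
        height = len(filt[0])
        width = len(filt[0][0])
        # Element-wise mean over the input channels.
        mean_plane = [
            [sum(filt[c][i][j] for c in range(n_in)) / n_in
             for j in range(width)]
            for i in range(height)
        ]
        # Copy the mean plane once per flow channel so that each copy
        # can be updated independently during fine-tuning.
        adapted.append([[row[:] for row in mean_plane]
                        for _ in range(num_flow_channels)])
    return adapted


# Hypothetical toy input: one 2x2 filter with three (RGB) channel planes.
rgb = [[
    [[1.0, 2.0], [3.0, 4.0]],   # R
    [[3.0, 4.0], [5.0, 6.0]],   # G
    [[5.0, 6.0], [7.0, 8.0]],   # B
]]

# Assumed temporal-net input: 20 stacked flow channels.
flow_filters = adapt_first_layer(rgb, num_flow_channels=20)
```

Averaging keeps the overall response scale of each pre-trained filter roughly comparable when it is applied to the new input, which is why this kind of initialization is often preferred over random weights when the target dataset is small.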

Paper Structure

This paper contains 7 sections, 3 tables.