Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification
Yunlong Bian, Chuang Gan, Xiao Liu, Fu Li, Xiang Long, Yandong Li, Heng Qi, Jie Zhou, Shilei Wen, Yuanqing Lin
TL;DR
The paper tackles scalable large-scale video classification by adopting a DevNet-style pipeline that first learns multimodal feature representations (visual RGB and flow, plus audio) and then applies off-the-shelf temporal models to these features. It introduces four temporal modeling approaches—Multi-group Shifting Attention Network, Temporal Xception Network, Multi-stream Sequence Model, and Fast-Forward Sequence Model—demonstrating that they outperform traditional temporal pooling and are complementary, with ensemble methods achieving state-of-the-art results on the Kinetics validation set. The best single model achieves 77.7% top-1 and 93.2% top-5 accuracy, while an ensemble of models reaches 81.5% top-1 and 95.6% top-5, signaling strong gains for large-scale video understanding. The work highlights the viability of off-the-shelf temporal modeling on learned multimodal features for high-performance, scalable video recognition and provides pathways for releasing code and models.
Abstract
This paper describes our solution for the video recognition task of ActivityNet Kinetics challenge that ranked the 1st place. Most of existing state-of-the-art video recognition approaches are in favor of an end-to-end pipeline. One exception is the framework of DevNet. The merit of DevNet is that they first use the video data to learn a network (i.e. fine-tuning or training from scratch). Instead of directly using the end-to-end classification scores (e.g. softmax scores), they extract the features from the learned network and then fed them into the off-the-shelf machine learning models to conduct video classification. However, the effectiveness of this line work has long-term been ignored and underestimated. In this submission, we extensively use this strategy. Particularly, we investigate four temporal modeling approaches using the learned features: Multi-group Shifting Attention Network, Temporal Xception Network, Multi-stream sequence Model and Fast-Forward Sequence Model. Experiment results on the challenging Kinetics dataset demonstrate that our proposed temporal modeling approaches can significantly improve existing approaches in the large-scale video recognition tasks. Most remarkably, our best single Multi-group Shifting Attention Network can achieve 77.7% in term of top-1 accuracy and 93.2% in term of top-5 accuracy on the validation set.
