No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

Yingjie Zhai, Wenshuo Li, Yehui Tang, Xinghao Chen, Yunhe Wang

TL;DR

The paper targets efficient mobile video understanding by removing the separate temporal dimension from the backbone. It introduces SqueezeTime, a lightweight 2D CNN backbone that squeezes the time axis into the channel dimension and uses Channel-Time Learning (CTL) blocks, built from Temporal Focus Convolution (TFC) and Inter-temporal Object Interaction (IOI), to recover temporal dynamics. The model achieves a favorable accuracy-throughput balance, notably +1.2% Top-1 accuracy on K400 and up to ~80% higher GPU throughput than prior methods, while keeping CPU latency low on mobile devices. These results suggest that time-to-channel squeezing combined with CTL-based temporal modeling offers a practical path to high-quality mobile video analysis. The approach performs strongly across recognition and detection benchmarks, outperforming several mobile-focused baselines while offering significant efficiency advantages for edge deployment.

Abstract

Current architectures for video understanding are mainly built on 3D convolutional blocks or on 2D convolutions with additional operations for temporal modeling. However, these methods all treat the temporal axis as a separate dimension of the video sequence, which requires large computation and memory budgets and thus limits their use on mobile devices. In this paper, we propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding. To enhance the temporal modeling capability of the proposed network, we design a Channel-Time Learning (CTL) Block to capture the temporal dynamics of the sequence. This module has two complementary branches: one learns temporal importance, and the other, which can restore temporal position information, enhances inter-temporal object modeling. The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding. Extensive experiments on various video recognition and action detection benchmarks, i.e., Kinetics400, Kinetics600, HMDB51, AVA2.1 and THUMOS14, demonstrate the superiority of our model. For example, SqueezeTime achieves $+1.2\%$ accuracy and $+80\%$ GPU throughput on Kinetics400 compared with prior methods. Code is publicly available at https://github.com/xinghaochen/SqueezeTime and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SqueezeTime.

Paper Structure

This paper contains 11 sections, 6 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: (a) Performance comparison of multiple methods for mobile video recognition on the K400, K600 and HMDB51 datasets. We report Top-1 accuracy ($\%$), GPU speed (throughput, videos / s), and CPU speed (videos / s). Note that CPU speed is reported as 1 / latency (ms) for better visualization. (b) Latency comparison of multiple models on a mobile device.
  • Figure 2: Feature and kernel illustration of different video models, i.e., (a) 3D CNN-based models, (b) 2D CNN with temporal modeling operations, and (c) the proposed squeeze time mechanism.
  • Figure 3: Pipeline of the proposed SqueezeTime. The input video clip is first reshaped by squeezing the temporal dimension into channels and is then fed into the rest of the network. The network contains four main stages, each consisting of a stack of CTL Blocks, which are designed to extract and restore hidden temporal representations.
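
The reshaping step described in Figure 3 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `squeeze_time_into_channel`, the `(B, C, T, H, W)` input layout, the time-major channel ordering, and the clip size used below are all assumptions chosen for illustration. The point is only that, after the squeeze, the clip is an ordinary 4D tensor that a 2D CNN can consume, with every frame's color planes stored as channels.

```python
import numpy as np


def squeeze_time_into_channel(clip: np.ndarray) -> np.ndarray:
    """Fold the temporal axis of a (B, C, T, H, W) clip into the channel axis.

    Returns an array of shape (B, T * C, H, W). The channel ordering is
    time-major (channel index = frame * C + color), which is an assumption
    of this sketch and not taken from the paper.
    """
    if clip.ndim != 5:
        raise ValueError(f"expected a 5D (B, C, T, H, W) array, got {clip.ndim}D")
    b, c, t, h, w = clip.shape
    # Move time in front of channels, then merge the two axes into one.
    return clip.transpose(0, 2, 1, 3, 4).reshape(b, t * c, h, w)


# Hypothetical clip: batch of 2, RGB, 16 frames, 8x8 pixels.
clip = np.arange(2 * 3 * 16 * 8 * 8, dtype=np.float32).reshape(2, 3, 16, 8, 8)
squeezed = squeeze_time_into_channel(clip)
print(squeezed.shape)  # (2, 48, 8, 8)
```

With this ordering, channel `5 * 3 + 1` of the squeezed tensor holds the green plane of frame 5. No pixel data is lost by the squeeze; what disappears is the explicit time axis, which is why the CTL Blocks are needed to recover temporal order and dynamics from the channels.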