View while Moving: Efficient Video Recognition in Long-untrimmed Videos

Ye Tian, Mengyu Yang, Lanshan Zhang, Zhizhen Zhang, Yang Liu, Xiaohui Xie, Xirong Que, Wendong Wang

TL;DR

This work proposes a novel recognition paradigm of "View while Moving" for efficient long-untrimmed video recognition that outperforms state-of-the-art methods in terms of accuracy as well as efficiency, yielding new efficiency and accuracy trade-offs for video spatiotemporal modeling.

Abstract

Recent adaptive methods for efficient video recognition mostly follow the two-stage paradigm of "preview-then-recognition" and have achieved great success on multiple video benchmarks. However, this two-stage paradigm visits the raw frames twice during inference, first at coarse and then at fine granularity, and the two visits cannot be parallelized. In addition, the spatiotemporal features captured in the first stage cannot be reused in the second because their granularity differs, which hinders efficiency and computational optimization. To this end, inspired by human cognition, we propose a novel recognition paradigm of "View while Moving" for efficient long-untrimmed video recognition. In contrast to the two-stage paradigm, our paradigm needs to access the raw frames only once. The two phases of coarse-grained sampling and fine-grained recognition are combined into unified spatiotemporal modeling, showing great performance. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism to efficiently capture and reason about the unit-level and video-level temporal semantics in long-untrimmed videos, respectively. Extensive experiments on both long-untrimmed and short-trimmed videos demonstrate that our approach outperforms state-of-the-art methods in both accuracy and efficiency, yielding new efficiency and accuracy trade-offs for video spatiotemporal modeling.

Paper Structure (20 sections, 11 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Overall architecture of ViMo. The whole architecture is illustrated in (a), comprising $N$ locators for unit localization and observation, a multi-unit integration module for video-level semantic reasoning, and a classifier based on a fully connected network for category mapping. (b) and (c) give linear and nonlinear options for the temporal network of the locators. (d), (e) and (f) present several variants of the multi-unit integration module, including Pool, Forward, and Transformer. Detailed explanations can be found in the Method section. Best viewed in color.
  • Figure 2: Comparisons with the state-of-the-art methods on ActivityNet. Our method is implemented with the maximum moving times $m \in \{3, 4, 5, 6\}$. Our paradigm achieves competitive performance in terms of mAP as well as GFLOPs.
  • Figure 3: Visualization of the random, uniform and adaptive local sampling results by the second locator of our ViMo. The grey frames indicate that the generator stops sampling and no video frames are observed. Please zoom in for more details.
  • Figure 4: Structure of Locator.
  • Figure 5: Centralized Training and Decentralized Execution of Locators.
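
The Figure 1 caption describes a pipeline in which $N$ locators each produce a unit-level representation, a multi-unit integration module combines them, and a fully connected classifier maps the result to categories. The following is a minimal sketch of that data flow for the "Pool" variant only. It assumes that "Pool" means element-wise averaging across units, which the caption does not specify; the function names, feature sizes, and weights are hypothetical and are not taken from the paper.

```python
from typing import List


def pool_units(unit_features: List[List[float]]) -> List[float]:
    """Average N unit-level feature vectors into one video-level vector.

    Assumption: the "Pool" integration variant is a plain mean over units.
    """
    if not unit_features:
        raise ValueError("need at least one unit feature")
    n_units = len(unit_features)
    dim = len(unit_features[0])
    return [sum(f[d] for f in unit_features) / n_units for d in range(dim)]


def classify(video_feature: List[float],
             weights: List[List[float]],
             bias: List[float]) -> List[float]:
    """Single fully connected layer: one logit per category."""
    return [
        sum(w * x for w, x in zip(row, video_feature)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical example: N = 3 locators, each emitting a 2-dim unit feature.
units = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
video = pool_units(units)
logits = classify(video, weights=[[1.0, 0.0], [0.0, 2.0]], bias=[0.0, 0.5])
print(video, logits)
```

The Forward and Transformer variants named in the caption would replace `pool_units` with a learned module; they are not sketched here because the caption gives no detail on their structure.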