Efficient One-stage Video Object Detection by Exploiting Temporal Consistency

Guanxiong Sun; Yang Hua; Guosheng Hu; Neil Robertson

TL;DR

This work targets efficient one-stage video object detection (VOD) by exploiting temporal consistency between frames to reduce computation. It identifies two bottlenecks: the cost of attention modules, which grows quadratically when the number of queries $N_q$ is large, and the heavy cost of running detection heads on high-resolution, low-level feature maps. To address them, it introduces a Location Prior Network (LPN) that filters out background regions and a Size Prior Network (SPN) that skips unnecessary low-level feature computations on specific frames, giving faster inference with minimal accuracy loss. The approach is demonstrated on FCOS, CenterNet, and YOLOX and evaluated on ImageNet VID, where it shows favourable speed-accuracy trade-offs and compatibility across detectors. The code is publicly available.
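
To make the two bottlenecks concrete, the following back-of-the-envelope sketch counts spatial positions per feature level and estimates attention cost. It is an illustration only, not the paper's analysis: the strides (8 to 128), the 512×512 input, and the helper names `level_sizes` and `attention_cost` are assumptions chosen for the example.

```python
import math


def level_sizes(height, width, strides=(8, 16, 32, 64, 128)):
    """Spatial positions per feature level: ceil(H / s) * ceil(W / s).

    The default strides are an assumption (a common FPN layout).
    """
    return {s: math.ceil(height / s) * math.ceil(width / s) for s in strides}


def attention_cost(n_q, n_k, channels):
    """Rough multiply-accumulate count of one attention layer.

    Computing the query-key similarities costs n_q * n_k * channels, and
    applying the resulting weights to the values costs the same again.
    Linear projections are ignored to keep the estimate simple.
    """
    return 2 * n_q * n_k * channels


sizes = level_sizes(512, 512)
n_q_all = sum(sizes.values())          # every position on every level is a query
share_lowest = sizes[8] / n_q_all      # fraction of positions on the stride-8 level

# If queries and keys are both halved, the estimated cost drops to a quarter.
full = attention_cost(n_q_all, n_q_all, channels=256)
half = attention_cost(n_q_all // 2, n_q_all // 2, channels=256)
```

Under these assumptions the stride-8 level holds 4096 of the 5456 positions, which is why both the query count and the detection-head cost are dominated by the lowest feature level.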

Abstract

Recently, one-stage detectors have achieved competitive accuracy and faster speed compared with traditional two-stage detectors on image data. However, in the field of video object detection (VOD), most existing VOD methods are still based on two-stage detectors. Moreover, directly adapting existing VOD methods to one-stage detectors introduces unaffordable computational costs. In this paper, we first analyse the computational bottlenecks of using one-stage detectors for VOD. Based on the analysis, we present a simple yet efficient framework to address the computational bottlenecks and achieve efficient one-stage VOD by exploiting the temporal consistency in video frames. Specifically, our method consists of a location-prior network to filter out background regions and a size-prior network to skip unnecessary computations on low-level feature maps for specific frames. We test our method on various modern one-stage detectors and conduct extensive experiments on the ImageNet VID dataset. Excellent experimental results demonstrate the superior effectiveness, efficiency, and compatibility of our method. The code is available at https://github.com/guanxiongsun/vfe.pytorch.
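
As a rough illustration of filtering out background regions with a location prior, the sketch below marks the feature-map cells covered by bounding boxes propagated from the previous frame and applies an aggregation step to those cells only. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `margin` parameter, and the pass-through of unselected cells are all hypothetical.

```python
def foreground_indices(prev_boxes, feat_h, feat_w, stride, margin=0):
    """Flat indices of feature-map cells covered by any previous-frame box.

    prev_boxes: iterable of (x1, y1, x2, y2) in image pixels.
    margin: extra cells added around each box (hypothetical knob that
            leaves room for object motion between frames).
    """
    selected = set()
    for x1, y1, x2, y2 in prev_boxes:
        c0 = max(int(x1 // stride) - margin, 0)
        r0 = max(int(y1 // stride) - margin, 0)
        c1 = min(int(x2 // stride) + margin, feat_w - 1)
        r1 = min(int(y2 // stride) + margin, feat_h - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                selected.add(r * feat_w + c)
    return sorted(selected)


def aggregate_foreground(features, indices, aggregate):
    """Apply `aggregate` to the selected features only.

    Unselected (background) features are copied through unchanged, which is
    an assumption of this sketch. `aggregate` stands in for an attention
    module and must return one output per input feature.
    """
    out = list(features)
    enhanced = aggregate([features[i] for i in indices])
    for i, value in zip(indices, enhanced):
        out[i] = value
    return out


# One box on an 8x8 feature map with stride 16.
idx = foreground_indices([(16, 16, 47, 47)], feat_h=8, feat_w=8, stride=16)
result = aggregate_foreground(list(range(64)), idx, lambda xs: [x * 10 for x in xs])
```

In this toy case only 4 of the 64 cells are passed to the aggregation step, so the number of queries seen by an attention module would shrink accordingly.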

Paper Structure (30 sections, 2 equations, 3 figures, 7 tables)

Introduction
Related Work
One-stage Detectors.
Video Object Detection (VOD).
Analysis of the Computational Bottlenecks in Attention-based One-stage VOD
Preliminary Knowledge
General Architecture of Modern One-stage Detectors.
Complexity of the Attention Module.
Naive Adaptation of Attention-based One-stage VOD
Bottleneck 1: Drastically Increased $N_q$
Bottleneck 2: Detection Heads on Low Feature Levels
Methodology
Location Prior Network
Foreground Region Selection.
Partial Feature Aggregation.
...and 15 more sections

Figures (3)

Figure 1: General architecture of modern one-stage detectors, where H, W, and s are the height, width, and stride (down-sampling ratio) of feature maps, respectively. C3, C4 and C5 denote the output feature maps of the backbone. P3, P4, P5, etc. denote the feature levels in the neck, e.g., FPN. The decoupled detection heads, which usually contain a classification branch and a regression branch, are shared through all feature levels. Best viewed in colour.
Figure 2: (a) shows the process of directly conducting attention-based feature aggregation, where the purple rounded rectangle denotes the attention module. In (a), the input of the attention module is all pixels on the current frame and reference frames. (b) shows the pipeline of using location prior network for feature aggregation, where the red bounding box denotes the propagated bounding boxes from the previous frame. In (b), the input of the attention is foreground pixels on the current frame and the reference frames. Best viewed in colour.
Figure 3: (a) shows a normal detection process on multi-level feature maps, where all feature levels are passed to the detection heads. (b) shows the detection process guided by the size prior network. The pink box denotes the feature level from which the bounding boxes of the previous frame were generated. In the current frame, computations on feature levels outside the pink box are skipped, as indicated by the transparent boxes and dotted lines. Best viewed in colour.
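
The level skipping described in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the paper's method: the names `levels_to_run` and `detect`, the `refresh` flag that forces full detection on some frames, and the fallback to all levels when the previous frame produced no boxes are assumptions made for the example.

```python
def levels_to_run(prev_detection_levels, all_levels, refresh=False):
    """Choose the feature levels whose detection head runs on this frame.

    prev_detection_levels: levels that produced boxes on the previous frame.
    refresh: hypothetical flag that forces detection on every level.
    """
    if refresh or not prev_detection_levels:
        # Assumed fallback: with no usable prior, run every level.
        return list(all_levels)
    active = set(prev_detection_levels)
    return [level for level in all_levels if level in active]


def detect(features_by_level, head, prev_detection_levels, refresh=False):
    """Run the shared head only on the selected levels; skip the rest."""
    levels = levels_to_run(prev_detection_levels, list(features_by_level), refresh)
    return {level: head(features_by_level[level]) for level in levels}


# Toy usage: the "features" are numbers and the "head" doubles them.
features = {"P3": 1, "P4": 2, "P5": 3}
skipped = detect(features, lambda f: f * 2, prev_detection_levels=["P5"])
full = detect(features, lambda f: f * 2, prev_detection_levels=["P5"], refresh=True)
```

Here `skipped` contains an output for `P5` only, while `full` contains outputs for all three levels. Because the lowest level has the most positions, skipping it when the previous frame's boxes came from higher levels is where the saving would come from.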
