Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection

Khurram Azeem Hashmi, Talha Uddin Sheikh, Didier Stricker, Muhammad Zeshan Afzal

TL;DR

This work introduces FAIM, a mask-guided spatio-temporal feature aggregation framework for Video Object Detection, addressing background noise in proposal-based aggregation by learning and leveraging instance mask features. Central components include IFEM for instance-mask feature extraction and TICAM for temporal fusion of mask and classification cues, built atop a YOLOX-based detector with a lightweight FPSM to prune candidates. Empirical results on ImageNet VID (87.9% mAP at 33 FPS on a 2080Ti) and various benchmarks (EPIC KITCHENS-55, OVIS, MOT) demonstrate strong speed-accuracy gains and method-agnostic improvements when integrating FAIM modules into other VOD pipelines. The findings underscore the practical impact of instance-mask guidance for robust, real-time video understanding and point to future work in unifying VOD with MOT and video instance segmentation.

Abstract

The primary challenge in Video Object Detection (VOD) is effectively exploiting temporal information to enhance object representations. Traditional strategies, such as aggregating region proposals, often suffer from feature variance due to the inclusion of background information. We introduce a novel instance mask-based feature aggregation approach, significantly refining this process and deepening the understanding of object dynamics across video frames. We present FAIM, a new VOD method that enhances temporal Feature Aggregation by leveraging Instance Mask features. In particular, we propose the lightweight Instance Feature Extraction Module (IFEM) to learn instance mask features and the Temporal Instance Classification Aggregation Module (TICAM) to aggregate instance mask and classification features across video frames. Using YOLOX as a base detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on a single 2080Ti GPU, setting a new benchmark for the speed-accuracy trade-off. Additional experiments on multiple datasets validate that our approach is robust, method-agnostic, and effective in multi-object tracking, demonstrating its broader applicability to video understanding tasks.

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Evolution of exploiting temporal information in video object detection. (a) Box-level post-processing to refine detections. (b) Feature aggregation across entire video frames. (c) Temporal feature aggregation guided by region-location priors from each frame. (d) Our instance mask-based aggregation refines the focus to instance boundaries, reducing background noise and improving feature aggregation.
  • Figure 2: Proposal-based feature aggregation (blue) compared with our instance mask-based feature aggregation (red) for the class Bear. Leveraging instance mask-level information significantly reduces variance among Bear proposals within and across videos.
  • Figure 3: Speed-accuracy trade-off. FAIM outperforms prior state-of-the-art methods on the ImageNet VID benchmark. Except for QueryProp, MAMBA, and Liu et al., all results are reported on a 2080Ti GPU. * denotes results with post-processing.
  • Figure 4: Overview of the FAIM framework. Randomly sampled frames from a video are input into YOLOX [yolox_arxiv2021] for initial feature extraction and prediction using multi-scale features (P3-P5). The IFEM processes video object features to produce instance mask features (Eq. \ref{eq:ifem}), while the FPSM filters the features for object classification. IFEM's instance mask features and FPSM's refined predictions are combined to predict instance masks, which are optimized against pseudo-ground truth masks. The learned instance mask features and classification features are then fed into the TICAM for final classification. Inference: Components in green are excluded during inference. However, IFEM continues to provide high-quality instance mask features, enhancing feature aggregation in TICAM for robust predictions.
  • Figure 5: t-SNE of proposal features from YOLOV [YOLOV_AAAI2023] and FAIM on the ImageNet VID dataset. Feature confusion in YOLOV is marked with magenta circles ($\color{magenta}\bigcirc$), and corrections in FAIM with green circles ($\color{Green}\bigcirc$). The blue bounding box shows the area used for feature aggregation in YOLOV, while FAIM uses the area inside the red mask. YOLOV confuses features between Snake and Lizard (highlighted with $\color{magenta}\bigcirc$), showing higher intra-class and lower inter-class variance due to background inclusion. FAIM's instance mask-based feature aggregation reduces this variance, forming clearer clusters. Similar improvements are seen with Watercraft and Whale. Best viewed on a screen.
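
The intuition behind Figures 2 and 5 (box-level pooling mixes in background, mask-level pooling does not) can be illustrated with a toy example. The sketch below is not the FAIM implementation and does not reproduce IFEM or TICAM. It is a minimal, standard-library-only illustration under assumed conditions: a single-channel 8x8 feature map, an object whose feature value is constant across frames, and a background whose statistics change from frame to frame. All function names (`box_pool`, `mask_pool`, `make_frame`, `pooled_descriptors`) are hypothetical.

```python
import random
from statistics import pvariance


def box_pool(feat, box):
    """Average a feature map over a box (x0, y0, x1, y1); background pixels are included."""
    x0, y0, x1, y1 = box
    vals = [feat[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)


def mask_pool(feat, mask):
    """Average a feature map over the pixels where the binary mask is True."""
    vals = [feat[y][x] for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    return sum(vals) / len(vals)


def make_frame(rng, height=8, width=8, object_value=1.0):
    """Toy frame (assumption): constant object features on a frame-dependent noisy background."""
    bg_mean = rng.uniform(-1.0, 1.0)  # background statistics change per frame
    feat = [[rng.gauss(bg_mean, 0.5) for _ in range(width)] for _ in range(height)]
    # The object covers rows 2..5 and columns 3..4, i.e. only part of the box used below.
    mask = [[2 <= y < 6 and 3 <= x < 5 for x in range(width)] for y in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                feat[y][x] = object_value
    return feat, mask


def pooled_descriptors(num_frames=5, seed=0, box=(2, 1, 6, 7)):
    """Pool the same object in every frame, once with the box and once with the mask."""
    rng = random.Random(seed)
    box_desc, mask_desc = [], []
    for _ in range(num_frames):
        feat, mask = make_frame(rng)
        box_desc.append(box_pool(feat, box))
        mask_desc.append(mask_pool(feat, mask))
    return box_desc, mask_desc


box_desc, mask_desc = pooled_descriptors()
print("box-pooled variance across frames :", pvariance(box_desc))
print("mask-pooled variance across frames:", pvariance(mask_desc))
```

In this toy setting the mask-pooled descriptor stays the same from frame to frame, while the box-pooled descriptor drifts with the background. This is the kind of variance reduction the captions attribute to mask-guided aggregation, although real features are learned and far less idealized.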