Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection

Khurram Azeem Hashmi; Talha Uddin Sheikh; Didier Stricker; Muhammad Zeshan Afzal

TL;DR

This work introduces FAIM, a mask-guided spatio-temporal feature aggregation framework for Video Object Detection (VOD). It addresses the background noise that proposal-based aggregation introduces by learning and leveraging instance mask features. Its central components are IFEM, for instance-mask feature extraction, and TICAM, for temporal fusion of mask and classification cues; both are built atop a YOLOX-based detector with a lightweight FPSM to prune candidates. FAIM reaches 87.9% mAP at 33 FPS on a 2080Ti on ImageNet VID, and results on further benchmarks (EPIC KITCHENS-55, OVIS, MOT) demonstrate strong speed-accuracy gains and method-agnostic improvements when FAIM modules are integrated into other VOD pipelines. The findings underscore the practical impact of instance-mask guidance for robust, real-time video understanding and point to future work on unifying VOD with MOT and video instance segmentation.

Abstract

The primary challenge in Video Object Detection (VOD) is effectively exploiting temporal information to enhance object representations. Traditional strategies, such as aggregating region proposals, often suffer from feature variance due to the inclusion of background information. We introduce a novel instance mask-based feature aggregation approach that refines this process and better captures object dynamics across video frames. We present FAIM, a new VOD method that enhances temporal Feature Aggregation by leveraging Instance Mask features. In particular, we propose the lightweight Instance Feature Extraction Module (IFEM) to learn instance mask features and the Temporal Instance Classification Aggregation Module (TICAM) to aggregate instance mask and classification features across video frames. Using YOLOX as a base detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on a single 2080Ti GPU, setting a new benchmark for the speed-accuracy trade-off. Additional experiments on multiple datasets validate that our approach is robust, method-agnostic, and effective in multi-object tracking, demonstrating its broader applicability to video understanding tasks.
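The motivation stated in the abstract, that box-level features mix in background while mask-level features do not, can be illustrated with a toy NumPy sketch. This is not the paper's IFEM or TICAM: the function names `box_pool`, `mask_pool`, and `aggregate`, the toy feature map, and the cosine-similarity softmax weighting are all assumptions made purely for illustration.

```python
import numpy as np

def box_pool(feat, box):
    """Average every feature vector inside a box (x0, y0, x1, y1), background included."""
    x0, y0, x1, y1 = box
    return feat[:, y0:y1, x0:x1].mean(axis=(1, 2))

def mask_pool(feat, mask, eps=1e-6):
    """Average only the feature vectors where the binary instance mask is 1."""
    w = mask.astype(feat.dtype)
    return (feat * w).sum(axis=(1, 2)) / (w.sum() + eps)

def aggregate(query, refs, eps=1e-6):
    """Softmax-weighted sum of reference features, weighted by cosine similarity to the query.

    A generic attention-style aggregation, assumed here for illustration only.
    """
    refs = np.stack(refs)
    q = query / (np.linalg.norm(query) + eps)
    r = refs / (np.linalg.norm(refs, axis=1, keepdims=True) + eps)
    sim = r @ q
    w = np.exp(sim - sim.max())
    w /= w.sum()
    return w @ refs

# Toy feature map with 2 channels: channel 0 fires on the object, channel 1 on background.
feat = np.zeros((2, 8, 8))
feat[1] = 1.0                      # background everywhere
mask = np.zeros((8, 8), dtype=bool)
mask[2:4, 2:4] = True              # a 2x2 object
feat[0][mask] = 1.0
feat[1][mask] = 0.0

box = (1, 1, 5, 5)                 # a loose 4x4 box around the 2x2 object
box_feat = box_pool(feat, box)     # [0.25, 0.75]: dominated by background
mask_feat = mask_pool(feat, mask)  # approximately [1, 0]: object only

# Aggregating identical reference features returns the same feature.
fused = aggregate(mask_feat, [mask_feat, mask_feat])
```

In this toy setting the box-pooled vector is three-quarters background, so the same object under a differently sized box would yield a different feature, while the mask-pooled vector stays close to the pure object response.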
