MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection
Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson
TL;DR
MAMBA is a memory-bank-based framework for video object detection. It addresses the inefficiencies of traditional memory structures (sliding windows and memory queues) with two operations: light-weight key-set construction and fine-grained feature-wise updating. Its generalized enhancement operation (GEO) unifies pixel-level and instance-level feature augmentation through attention over a large memory, enabling recursive multi-level fusion. On ImageNet VID, MAMBA reports state-of-the-art speed-accuracy trade-offs, reaching 84.6% mAP at 110.3 ms per frame (about 9.1 FPS) with ResNet-101 when both pixel- and instance-level memories are used, with further gains from stronger backbones and augmentations. The approach is modular and adaptable to different detectors, offering practical improvements for real-time video object detection.
Abstract
State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, to enhance the current frame using attention mechanisms. However, we argue that these memory structures are neither efficient nor sufficient because of two implied operations: (1) concatenating all features in memory for enhancement, leading to a heavy computational cost; (2) frame-wise memory updating, preventing the memory from capturing more temporal information. In this paper, we propose a multi-level aggregation architecture via memory bank called MAMBA. Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) light-weight key-set construction, which can significantly reduce the computational cost; (2) a fine-grained feature-wise updating strategy, which enables our method to utilize knowledge from the whole video. To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) to aggregate multi-level features in a unified manner. We conduct extensive evaluations on the challenging ImageNet VID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy. More remarkably, MAMBA achieves 83.7% and 84.6% mAP at 12.6 and 9.1 FPS, respectively, with ResNet-101. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
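
To make the two memory-bank operations and the attention-based enhancement more concrete, the following NumPy sketch illustrates the general idea. It is not the released implementation. The class name MemoryBank, the functions key_set, update and enhance, the use of uniform random sampling to build the key set, the random-replacement write policy, and the plain residual dot-product attention are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

rng = np.random.default_rng(0)


class MemoryBank:
    """Fixed-capacity store of feature vectors (hypothetical helper)."""

    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.features = np.empty((0, dim), dtype=np.float32)

    def key_set(self, size):
        """Light-weight key set: attend to a small sample, not the whole memory.

        Uniform random sampling is an assumption of this sketch.
        """
        n = len(self.features)
        if n <= size:
            return self.features
        idx = rng.choice(n, size=size, replace=False)
        return self.features[idx]

    def update(self, new_features, num_write):
        """Feature-wise update: write individual vectors, not whole frames.

        Random selection and random replacement are assumptions of this sketch.
        """
        k = min(num_write, len(new_features), self.capacity)
        chosen = rng.choice(len(new_features), size=k, replace=False)
        picked = new_features[chosen]
        # Fill free slots first.
        n_append = min(self.capacity - len(self.features), k)
        if n_append > 0:
            self.features = np.concatenate([self.features, picked[:n_append]])
        # Once the bank is full, overwrite randomly chosen slots.
        rest = picked[n_append:]
        if len(rest) > 0:
            slots = rng.choice(self.capacity, size=len(rest), replace=False)
            self.features[slots] = rest


def enhance(queries, keys):
    """Residual scaled dot-product attention of current features over the keys."""
    if len(keys) == 0:
        return queries
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return queries + weights @ keys


# Toy video: 30 frames, each with 100 feature vectors of dimension 64.
bank = MemoryBank(capacity=2000, dim=64)
for _ in range(30):
    frame_features = rng.standard_normal((100, 64)).astype(np.float32)
    enhanced = enhance(frame_features, bank.key_set(size=200))
    bank.update(enhanced, num_write=50)
```

In this sketch the attention cost per frame scales with the key-set size (200 here) rather than with the full memory size, and because only part of each frame is written, the bank can hold features from many more frames than a frame-wise queue of the same capacity. The same enhance function could be applied to either pixel-level or proposal-level features, which is the sense in which a single operation serves both levels.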
