MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection

Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson

TL;DR

MAMBA introduces a memory-bank-based framework for video object detection that addresses the inefficiencies of sliding-window and memory-queue structures through light-weight key-set construction and fine-grained feature-wise updating. Its generalized enhancement operation (GEO) unifies pixel-level and instance-level feature enhancement through attention over a large memory, enabling recursive multi-level fusion. Empirical results on ImageNet VID show state-of-the-art speed-accuracy trade-offs, with 84.6% mAP at 110.3 ms per frame (ResNet-101) when both pixel- and instance-level memories are used, and further gains with stronger backbones and augmentations. The approach is modular and adaptable to different detectors, offering practical speed and accuracy improvements for video object detection.

Abstract

State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, to enhance the current frame using attention mechanisms. However, we argue that these memory structures are neither efficient nor sufficient because of two implied operations: (1) concatenating all features in memory for enhancement, which leads to a heavy computational cost; (2) frame-wise memory updating, which prevents the memory from capturing more temporal information. In this paper, we propose a multi-level aggregation architecture via memory bank called MAMBA. Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) light-weight key-set construction, which significantly reduces the computational cost; (2) a fine-grained feature-wise updating strategy, which enables our method to utilize knowledge from the whole video. To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) to aggregate multi-level features in a unified manner. We conduct extensive evaluations on the challenging ImageNet VID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy. More remarkably, MAMBA achieves an mAP of 83.7%/84.6% at 12.6/9.1 FPS with ResNet-101. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
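The two memory-bank operations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `MemoryBank`, the function `enhance`, and all parameters are hypothetical. Random sampling for the key set, random slot replacement for updating, and a residual dot-product attention step standing in for GEO are all assumptions, since the abstract does not specify these details.

```python
import math
import random


class MemoryBank:
    """Toy memory bank that stores individual feature vectors, not whole frames."""

    def __init__(self, capacity, key_set_size, seed=0):
        self.capacity = capacity          # maximum number of stored feature vectors
        self.key_set_size = key_set_size  # how many vectors one enhancement attends over
        self.features = []
        self.rng = random.Random(seed)

    def key_set(self):
        # Light-weight key set (assumed: uniform random sampling), so attention
        # cost depends on key_set_size rather than on the full memory size.
        k = min(self.key_set_size, len(self.features))
        return self.rng.sample(self.features, k)

    def update(self, new_features):
        # Feature-wise updating (assumed: random slot replacement). Each incoming
        # vector fills a free slot or overwrites one random slot, so vectors from
        # early frames can survive instead of being evicted frame by frame.
        for feature in new_features:
            if len(self.features) < self.capacity:
                self.features.append(list(feature))
            else:
                slot = self.rng.randrange(self.capacity)
                self.features[slot] = list(feature)


def enhance(query, keys):
    """Residual scaled dot-product attention of one query vector over a key set."""
    if not keys:
        return list(query)  # empty memory: nothing to aggregate
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    aggregated = [
        sum(w * key[i] for w, key in zip(weights, keys)) / total
        for i in range(len(query))
    ]
    return [q + a for q, a in zip(query, aggregated)]


# Usage: five frames with three 2-D feature vectors each.
bank = MemoryBank(capacity=8, key_set_size=4)
for t in range(5):
    frame_features = [[float(t), float(i)] for i in range(3)]
    enhanced = [enhance(f, bank.key_set()) for f in frame_features]
    bank.update(frame_features)
```

In this sketch the attention cost per query is bounded by `key_set_size`, while `capacity` controls how much of the video the memory can retain, which is the separation the abstract attributes to the memory bank.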

Paper Structure (12 sections, 6 equations, 2 figures, 9 tables, 1 algorithm)

Figures (2)

  • Figure 1: Comparisons of the memory construction process in three memory structures. (a) Sliding window stores raw features of neighbour frames. (b) Memory queue stores features of the enhanced frames. One enhanced frame contains the temporal information of its previous frames. As a result, the number of visible frames is enlarged by temporal connections. (c) The proposed memory bank contains two novel operations: light-weight key-set construction and fine-grained feature-wise updating, which help enlarge the number of visible frames to the length of the whole video. Best viewed in color.
  • Figure 2: (a) Overview of our framework. Given an input frame $I_t$, first, $I_t$ is passed through the backbone network. Second, the extracted feature maps are enhanced by the pixel-level memory bank. Third, the Region Proposal Network (RPN) is used to extract proposals from the enhanced feature maps. Finally, the proposals are further enhanced by the instance-level memory bank, and the enhanced proposals are used to compute the detection loss. (b) Illustration of the enhancement process of the memory bank. The input can be either feature maps or proposals. Best viewed in color.
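
The order of operations described in the Figure 2(a) caption can be written out as a short sketch. Only the ordering of the stages comes from the caption. The function `detect_frame` and every callable passed to it are hypothetical stand-ins, and the toy stages below merely record when they run.

```python
def detect_frame(frame, backbone, rpn, head, enhance_pixels, enhance_instances):
    """Run one frame through the stages in the order given by the caption."""
    feature_map = backbone(frame)              # 1. extract feature maps
    feature_map = enhance_pixels(feature_map)  # 2. pixel-level memory bank
    proposals = rpn(feature_map)               # 3. proposals from the enhanced maps
    proposals = enhance_instances(proposals)   # 4. instance-level memory bank
    return head(proposals)                     # 5. detection head (loss at training time)


# Toy stand-ins: each stage logs its name and passes its input through unchanged.
calls = []


def stage(name):
    def run(x):
        calls.append(name)
        return x
    return run


result = detect_frame(
    "frame_t",
    backbone=stage("backbone"),
    rpn=stage("rpn"),
    head=stage("head"),
    enhance_pixels=stage("pixel_memory"),
    enhance_instances=stage("instance_memory"),
)
print(calls)
```

Because both enhancement steps are plain callables here, the same enhancement routine could in principle be passed for feature maps and for proposals, mirroring the caption's note that the memory bank input can be either.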