Leveraging Bottom-Up and Top-Down Attention for Few-Shot Object Detection

Xianyu Chen, Ming Jiang, Qi Zhao

TL;DR

An attentive few-shot object detection network (AttFDNet) is proposed that takes advantage of both top-down and bottom-up attention and addresses specific challenges in few-shot object detection by introducing two novel loss terms and a hybrid few-shot learning strategy.

Abstract

Few-shot object detection aims to detect objects from only a few annotated examples, and it remains a challenging, underexplored research problem. Recent studies have shown the effectiveness of self-learned top-down attention mechanisms in object detection and other vision tasks. Top-down attention, however, is less effective at improving the performance of few-shot detectors: because training data are insufficient, object detectors cannot effectively generate attention maps for few-shot examples. To improve the performance and interpretability of few-shot object detectors, we propose an attentive few-shot object detection network (AttFDNet) that takes advantage of both top-down and bottom-up attention. Being task-agnostic, the bottom-up attention serves as a prior that helps detect and localize naturally salient objects. We further address specific challenges in few-shot object detection by introducing two novel loss terms and a hybrid few-shot learning strategy. Experimental results and visualizations demonstrate the complementary nature of the two types of attention and their roles in few-shot object detection. Code is available at https://github.com/chenxy99/AttFDNet.

Paper Structure

This paper contains 14 sections, 8 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: In few-shot object detection, due to insufficient supervision, top-down attention learned from object annotations may fail to focus on objects of interest. (a) Input image with the ground-truth bounding box. (b) The top-down attention map. (c) The bottom-up attention map. (d) Detection result of the proposed method. The example demonstrates the complementary nature of top-down and bottom-up attention: saliency provides extra information that compensates for what top-down attention misses. This characteristic is discussed in the qualitative analysis.
  • Figure 2: The network architecture of the proposed attentive few-shot object detector. First, a saliency model generates the bottom-up attention for a given image. The image is then sent to the backbone, where the generated bottom-up attention and the top-down attention provide guidance on the spatial feature maps. Finally, the six prediction heads output the detection results, i.e., the location and category of each object. The backbone of the network is highlighted in blue, and the six prediction heads are highlighted in yellow.
  • Figure 3: A unique challenge of few-shot object detection is that not all bounding boxes in the novel images are annotated for training. The green bounding boxes indicate few-shot annotations and the red bounding boxes represent unannotated objects. With conventional training methods, such incomplete annotations would cause performance degradation.
  • Figure 4: Parameter initialization for the novel object detector. The figure shows the full procedure for initializing the novel object detector from the base object detector. Green blocks represent the parameters of the prediction heads in the base object detector, and red blocks represent the parameters of the prediction heads in the novel object detector. The parameters for novel bounding boxes can be copied directly from the corresponding learned parameters of the base object detector. The parameters of the fully-connected layer for the novel categories can be initialized with the imprinting method [hang:2017:imprinting, xianyu:2020:did].
  • Figure 5: Qualitative results of $2$-shot object detection. Detected objects are annotated in green.
  • ...and 2 more figures
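
The Figure 4 caption mentions initializing the fully-connected layer for novel categories with an imprinting method. As a rough illustration only, the sketch below assumes this refers to weight imprinting in its common form: the classifier weight for a novel category is set to the L2-normalized mean of the L2-normalized feature embeddings of its few annotated examples. The function names (`l2_normalize`, `imprint_weight`, `init_novel_classifier`) and the dictionary layout are hypothetical and are not taken from the AttFDNet code; the paper's actual procedure may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def imprint_weight(embeddings):
    """Assumed imprinting rule: normalized mean of normalized embeddings.

    `embeddings` is a list of equal-length feature vectors, one per
    few-shot example of a single novel category.
    """
    if not embeddings:
        raise ValueError("need at least one embedding per novel category")
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def init_novel_classifier(base_weights, novel_embeddings):
    """Hypothetical helper: keep base-category weights, imprint novel ones.

    base_weights:     {category_name: weight_vector} from the base detector
    novel_embeddings: {category_name: [embedding, ...]} from few-shot examples
    """
    weights = {name: list(w) for name, w in base_weights.items()}
    for name, embs in novel_embeddings.items():
        weights[name] = imprint_weight(embs)
    return weights


if __name__ == "__main__":
    base = {"dog": [1.0, 0.0]}
    novel = {"bird": [[3.0, 4.0], [6.0, 8.0]]}  # a 2-shot novel category
    print(init_novel_classifier(base, novel))
```

Normalizing each embedding before averaging keeps a single large-magnitude example from dominating the imprinted weight, which matters when only a few shots are available.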