Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models

Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng

TL;DR

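This work shows that frozen foundation models can serve as versatile feature enhancers for object detectors, even though they are not pre-trained for object detection. It transfers their high-level image understanding to the detector in two ways: the class token supplies a compact scene context to the detector's decoder, and the patch tokens enrich the features in the detector's encoder with semantic details.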

Abstract

Recent vision foundation models can extract universal representations and show impressive abilities in various tasks. However, their application to object detection is largely overlooked, especially without fine-tuning them. In this work, we show that frozen foundation models can be a versatile feature enhancer, even though they are not pre-trained for object detection. Specifically, we explore directly transferring the high-level image understanding of foundation models to detectors in the following two ways. First, the class token in foundation models provides an in-depth understanding of the complex scene, which facilitates decoding object queries in the detector's decoder by providing a compact context. Additionally, the patch tokens in foundation models can enrich the features in the detector's encoder by providing semantic details. Utilizing frozen foundation models as plug-and-play modules rather than the commonly used backbone can significantly enhance the detector's performance while preventing the problems caused by the architecture discrepancy between the detector's backbone and the foundation model. With such a novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models, respectively, on the COCO validation set after training for 12 epochs with R50 as the detector's backbone.
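The two transfer paths described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `fuse_patch_tokens` and `self_attention_with_image_query` are hypothetical, nearest-neighbour resizing is an assumption, the attention is single-head without learned projections, and the class token is assumed to be already projected to the query dimension.

```python
import numpy as np


def fuse_patch_tokens(backbone_feat, patch_tokens, grid_hw):
    """Concatenate frozen patch tokens with a backbone feature map.

    backbone_feat: (C_b, H, W) feature map from the detector's backbone.
    patch_tokens:  (N, C_f) patch tokens, with N == gh * gw.
    grid_hw:       (gh, gw) layout of the patch-token grid.
    Returns an array of shape (C_b + C_f, H, W).
    """
    gh, gw = grid_hw
    _, h, w = backbone_feat.shape
    # Reshape the token sequence into a 2D map: (N, C_f) -> (C_f, gh, gw).
    token_map = patch_tokens.reshape(gh, gw, -1).transpose(2, 0, 1)
    # Nearest-neighbour resize to the backbone resolution (assumed here).
    rows = np.arange(h) * gh // h
    cols = np.arange(w) * gw // w
    token_map = token_map[:, rows][:, :, cols]
    # Channel-wise concatenation; the encoder would consume this fused map.
    return np.concatenate([backbone_feat, token_map], axis=0)


def self_attention_with_image_query(object_queries, image_query):
    """Self-attention in which the class token joins the object queries.

    object_queries: (Q, D); image_query: (D,).
    Returns the updated object queries, shape (Q, D). Returning only the
    object-query rows is a choice made for this sketch.
    """
    tokens = np.concatenate([object_queries, image_query[None, :]], axis=0)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights @ tokens)[:-1]
```

In this sketch the frozen model contributes only inputs (`patch_tokens`, `image_query`); nothing in either function would require gradients to flow back into the foundation model, which is the sense in which it acts as a plug-and-play enhancer rather than a backbone.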

Paper Structure

This paper contains 19 sections, 6 figures, 15 tables.

Figures (6)

  • Figure 1: An in-depth understanding of the image provides useful information for detecting objects. (a) With rich context, the relation between object parts and the whole object can be clarified. (b) Some objects with severe occlusion or unusual appearance can be discovered through co-occurrence or interaction with other objects. (c) Similar objects can be distinguished by salient features. The red and green boxes represent incorrect and correct predictions, respectively.
  • Figure 2: The overview of Frozen-DETR. Instead of serving as a backbone, we exploit the frozen foundation model from the following two aspects: First, the patch tokens are reshaped to a 2D feature map and are concatenated with feature maps from the backbone before the encoder. After feature fusion, the patch tokens are discarded. Second, the image query representing the whole image, i.e., the class token from the foundation model, interacts with object queries in the self-attention layer of each decoding stage. Using the frozen foundation model as a feature enhancer makes the detector inherit the strong ability to understand high-level semantics.
  • Figure 3: Different implementations to extract image queries for sub-images. (a) Forwarding each sub-image individually to the model and selecting the class token as the image query. (b) Using the mean features of the patch tokens as the image queries for sub-images. (c) Using the replicated class tokens as the image queries for sub-images but these class tokens are constrained by attention masks.
  • Figure 4: Predictions and feature maps from DINO [zhang2022dino] and Frozen-DETR (CLIP only).
  • Figure 5: Different types of usage of pre-trained vision foundation models. (a) ViTDet [vitdet] fully fine-tunes the whole foundation model. (b) ViT-Adapter [vit-adapter] injects task priors into foundation models through adapters. Both the foundation model and the adapters are fine-tuned on the downstream tasks. (c) Some works [vasconcelos2022proper, lin2022could] explore using frozen foundation models as the backbone, which needs a heavy neck and a heavy head to ensure that there are enough tunable parameters. (d) Our Frozen-DETR utilizes foundation models as a plug-and-play module, in which the foundation model is not trainable and the image size is much smaller than the one in the detector.
  • ...and 1 more figure
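Option (b) in Figure 3, using the mean of the patch tokens as the image query for each sub-image, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the function name `sub_image_queries`, the even `splits x splits` partition, and the row-major token order are all hypothetical choices.

```python
import numpy as np


def sub_image_queries(patch_tokens, grid_hw, splits):
    """Mean-pool patch tokens inside each sub-image.

    patch_tokens: (N, C) patch tokens in row-major order, N == gh * gw.
    grid_hw:      (gh, gw) layout of the patch-token grid.
    splits:       number of sub-images per side (assumed even partition).
    Returns (splits * splits, C): one image query per sub-image,
    ordered row by row.
    """
    gh, gw = grid_hw
    if gh % splits or gw % splits:
        raise ValueError("token grid must divide evenly into sub-images")
    c = patch_tokens.shape[1]
    # Axes: (sub-image row, row inside it, sub-image col, col inside it, C).
    grid = patch_tokens.reshape(splits, gh // splits, splits, gw // splits, c)
    # Average over the within-sub-image axes to get one vector per sub-image.
    return grid.mean(axis=(1, 3)).reshape(splits * splits, c)
```

Unlike option (a), this needs only one forward pass of the foundation model over the whole image, because every sub-image query is derived from the same set of patch tokens.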