Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection

Xiangyu Gao, Yu Dai, Benliu Qiu, Lanxiao Wang, Heqian Qiu, Hongliang Li

TL;DR

This work tackles open-vocabulary object detection (OVOD) with a backbone that benefits from both base training data and pre-trained vision-language representations. It introduces VMCNet, a two-branch backbone with a trainable CNN and a frozen CLIP ViT, connected through a VMC module that injects ViT information into multi-scale CNN features. Within this module, a Modulating Information Generation block and multiple Feature Modulation blocks fuse intermediate ViT representations (e.g., $V^{(1)}$, $V^{(5)}$, $V^{(7)}$) into the CNN features. On OV-COCO, the method reaches AP$_{50}^{\mathrm{novel}}$ of 44.3 (ViT-B/16) and 48.5 (ViT-L/14), and it also improves mAP$_{r}$ on OV-LVIS. The results indicate that unidirectionally injecting frozen ViT knowledge into a trainable CNN backbone can substantially improve novel-category detection with modest parameter overhead, offering a practical way to combine large-scale pre-training with task-specific labeled data in dense prediction tasks.

Abstract

Owing to large-scale image-text contrastive training, pre-trained vision-language models (VLMs) such as CLIP show superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to use pre-trained VLMs to obtain generalized representations. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, its frozen backbone does not benefit from the labeled data, which could otherwise strengthen the representation for detection. Therefore, we propose a novel two-branch backbone network, named ViT-Feature-Modulated Multi-Scale Convolutional Network (VMCNet), which consists of a trainable convolutional branch, a frozen pre-trained ViT branch, and a VMC module. The trainable CNN branch can be optimized with labeled data, while the frozen pre-trained ViT branch retains the representation ability derived from large-scale pre-training. The proposed VMC module then modulates the multi-scale CNN features with the representations from the ViT branch. With this mixed structure, the detector is more likely to discover objects of novel categories. Evaluated on two popular benchmarks, our method boosts detection performance on novel categories and outperforms state-of-the-art methods. On OV-COCO, the proposed method achieves 44.3 AP$_{50}^{\mathrm{novel}}$ with ViT-B/16 and 48.5 AP$_{50}^{\mathrm{novel}}$ with ViT-L/14. On OV-LVIS, VMCNet with ViT-B/16 and ViT-L/14 reaches 27.8 and 38.4 mAP$_{r}$, respectively.
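
The two-branch data flow described in the abstract can be illustrated with a small NumPy sketch. It is only a toy under stated assumptions, not the authors' implementation: the frozen ViT is replaced by a fixed random patch projection, the CNN by average pooling plus a channel-mixing matrix, and the modulation by a FiLM-style per-location scale and shift, which is a swapped-in technique because the exact design of the VMC module is not given here. The sketch also uses a single ViT token grid rather than several intermediate layers. All function and parameter names are hypothetical.

```python
import numpy as np

def frozen_vit_tokens(image, patch=16, dim=32):
    """Stand-in for the frozen ViT branch: (C, H, W) -> (H/patch, W/patch, dim)."""
    c, H, W = image.shape
    h, w = H // patch, W // patch
    # A fixed-seed projection plays the role of frozen pre-trained weights.
    proj = np.random.default_rng(42).standard_normal((c * patch * patch, dim))
    patches = image.reshape(c, h, patch, w, patch).transpose(1, 3, 0, 2, 4)
    return patches.reshape(h, w, -1) @ proj / np.sqrt(proj.shape[0])

def cnn_multiscale(image, conv_weights, strides=(8, 16, 32)):
    """Stand-in for the trainable CNN branch: one (C_out, H/s, W/s) map per stride."""
    c, H, W = image.shape
    feats = []
    for s, weight in zip(strides, conv_weights):
        pooled = image.reshape(c, H // s, s, W // s, s).mean(axis=(2, 4))
        feats.append(np.einsum("oc,chw->ohw", weight, pooled))
    return feats

def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of an (h, w, d) token grid."""
    h, w, _ = grid.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return grid[rows][:, cols]

def modulate(feat, vit_grid, fc_gamma, fc_beta):
    """Assumed FiLM-style modulation: feat * (1 + gamma) + beta."""
    _, h, w = feat.shape
    v = resize_nearest(vit_grid, h, w)                 # (h, w, d)
    gamma = np.tanh(v @ fc_gamma).transpose(2, 0, 1)   # (C_out, h, w)
    beta = (v @ fc_beta).transpose(2, 0, 1)            # (C_out, h, w)
    return feat * (1.0 + gamma) + beta

def two_branch_forward(image, params):
    """ViT information flows into the CNN features, never the other way."""
    vit_grid = frozen_vit_tokens(image)                # frozen branch
    feats = cnn_multiscale(image, params["conv"])      # trainable branch
    return [modulate(f, vit_grid, g, b)
            for f, g, b in zip(feats, params["gamma"], params["beta"])]

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 64, 64))
params = {
    "conv": [rng.standard_normal((8, 3)) for _ in range(3)],
    "gamma": [0.1 * rng.standard_normal((32, 8)) for _ in range(3)],
    "beta": [0.1 * rng.standard_normal((32, 8)) for _ in range(3)],
}
outs = two_branch_forward(image, params)
print([o.shape for o in outs])
```

In this toy version only `params` would be updated during training, while the projection inside `frozen_vit_tokens` stays fixed. Setting the `gamma` and `beta` weights to zero returns the plain CNN features, which makes the one-way nature of the injection easy to check.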

Paper Structure

This paper contains 18 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Different backbone paradigms for open-vocabulary object detection. A "snowflake" sign marks a part that is frozen during base training, while a "fire" sign marks a trainable part. (a) F-ViT (Wu et al., ICLR 2024) uses a frozen CLIP ViT to extract image features and does not use the base training data to strengthen the representation for detection. (b) This paradigm uses an extra trainable convolutional neural network, which is optimized with base training data for detection. However, the representation ability of the CLIP ViT is not exploited. (c) Our design uses a two-branch architecture in which the representations from the frozen CLIP ViT modulate the features from the trainable CNN. Thus, the final representations for detection benefit from both the pre-trained model and the base training data.
  • Figure 2: The overall architecture of the proposed VMCNet. The flatten operation is omitted for clarity. Modules marked with a snowflake are frozen; the others are optimized during training. (a) The convolutional branch extracts multi-scale features from the input image. (b) The pre-trained transformer branch provides its intermediate features. (c) The VMC module merges the outputs of the two branches to generate the final multi-scale features.
  • Figure 3: Structure of the modulating information generation module. FC denotes a fully connected layer.
  • Figure 4: Structure of the feature modulation (FM) block. The processing in the first block is shown as an example.
  • Figure 5: Alternative strategies for placing FM blocks in the VMC module.
  • ...and 1 more figure