Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection
Xiangyu Gao, Yu Dai, Benliu Qiu, Lanxiao Wang, Heqian Qiu, Hongliang Li
TL;DR
This work tackles open-vocabulary object detection (OVOD) with a backbone that benefits from both labeled base-category training data and pre-trained vision-language representations. It introduces VMCNet, a two-branch backbone that pairs a trainable CNN with a frozen CLIP ViT, connected through a ViT-Feature-Modulated (VMC) module that injects ViT information into multi-scale CNN features. Its Modulating Information Generation component and multiple Feature Modulation blocks fuse intermediate ViT representations (e.g., $V^{(1)}$, $V^{(5)}$, $V^{(7)}$) into the CNN features. On OV-COCO, VMCNet reaches $AP_{50}^{novel}$ of $44.3$ (ViT-B/16) and $48.5$ (ViT-L/14); on OV-LVIS, it reaches $mAP_r$ of $27.8$ (ViT-B/16) and $38.4$ (ViT-L/14). The results indicate that unidirectionally injecting frozen ViT knowledge into a trainable CNN backbone can substantially improve novel-category detection with modest parameter overhead, offering a practical way to combine large-scale pre-training with task-specific labeled data in dense prediction tasks.
Abstract
Owing to large-scale image-text contrastive training, pre-trained vision-language models (VLMs) such as CLIP show superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to utilize pre-trained VLMs to obtain generalized representations. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, the frozen backbone cannot benefit from labeled data to strengthen its representation for detection. Therefore, we propose a novel two-branch backbone network, named \textbf{V}iT-Feature-\textbf{M}odulated Multi-Scale \textbf{C}onvolutional Network (VMCNet), which consists of a trainable convolutional branch, a frozen pre-trained ViT branch, and a VMC module. The trainable CNN branch can be optimized with labeled data, while the frozen pre-trained ViT branch retains the representation ability derived from large-scale pre-training. The proposed VMC module then modulates the multi-scale CNN features with the representations from the ViT branch. With this mixed structure, the detector is more likely to discover objects of novel categories. Evaluated on two popular benchmarks, our method boosts detection performance on novel categories and outperforms state-of-the-art methods. On OV-COCO, the proposed method achieves 44.3 AP$_{50}^{\mathrm{novel}}$ with ViT-B/16 and 48.5 AP$_{50}^{\mathrm{novel}}$ with ViT-L/14. On OV-LVIS, VMCNet with ViT-B/16 and ViT-L/14 reaches 27.8 and 38.4 mAP$_{r}$, respectively.
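
The one-way data flow described above (frozen ViT features modulating multi-scale CNN features) can be illustrated with a small toy sketch. This is not the VMC module itself: the abstract does not specify the modulation operator, so the multiplicative gating used here is purely an assumption, and the names `resize_nearest`, `modulate`, and `modulate_pyramid` are hypothetical. Feature maps are single-channel nested lists rather than tensors to keep the example dependency-free.

```python
from typing import List

Grid = List[List[float]]  # single-channel feature map: rows x cols


def resize_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour resize so the ViT map matches a given CNN scale."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def modulate(cnn_feat: Grid, vit_feat: Grid, gain: float = 1.0) -> Grid:
    """ASSUMED operator: scale each CNN activation by (1 + gain * vit)."""
    h, w = len(cnn_feat), len(cnn_feat[0])
    vit_resized = resize_nearest(vit_feat, h, w)
    return [
        [cnn_feat[r][c] * (1.0 + gain * vit_resized[r][c]) for c in range(w)]
        for r in range(h)
    ]


def modulate_pyramid(cnn_pyramid: List[Grid], vit_feat: Grid,
                     gain: float = 1.0) -> List[Grid]:
    """Apply the same ViT map to every CNN scale (ViT -> CNN only)."""
    return [modulate(level, vit_feat, gain) for level in cnn_pyramid]


# Toy inputs: a 2x2 stand-in for a frozen ViT token grid and two CNN scales.
vit_feat = [[0.0, 1.0], [0.5, -0.5]]
cnn_pyramid = [
    [[1.0] * 4 for _ in range(4)],  # higher-resolution level
    [[2.0] * 2 for _ in range(2)],  # lower-resolution level
]
out = modulate_pyramid(cnn_pyramid, vit_feat)
```

In this sketch the ViT map is only read, never updated, which mirrors the frozen-branch idea; in a real detector only the CNN branch and the modulation parameters would receive gradients.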
