F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

TL;DR

F-VLM shows that a frozen vision-language model (VLM) can underpin open-vocabulary object detection when paired with a simple, trainable detector head. It classifies each region against text embeddings and, at inference time, fuses the frozen VLM's region scores with the detector's scores using a geometric mean. The method improves on the previous state of the art for LVIS novel categories by +6.5 mask AP while delivering significant training speed-up and compute savings. It is also very competitive on the COCO open-vocabulary benchmark and in cross-dataset transfer detection. Because only the detector head is finetuned, the approach needs neither knowledge distillation nor detection-tailored pretraining.

Abstract

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on the novel categories of the LVIS open-vocabulary detection benchmark. We also demonstrate very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection, along with significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home
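
The inference-time combination described above can be illustrated with a minimal sketch. It assumes the region classifier is a softmax over cosine similarities between a region embedding and per-category text embeddings, and that fusion is a weighted geometric mean of the detector and VLM scores. The function names (`cosine_softmax`, `fuse_scores`), the `temperature` and `weight` parameters, and all numeric values are hypothetical and chosen for illustration only; they are not taken from the paper or its released code.

```python
import math


def cosine_softmax(region_emb, text_embs, temperature=0.1):
    """Softmax over cosine similarities between one region embedding
    and each category's text embedding (illustrative region classifier)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    sims = [
        sum(a * b for a, b in zip(region_emb, t)) / (norm(region_emb) * norm(t))
        for t in text_embs
    ]
    logits = [s / temperature for s in sims]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(det_scores, vlm_scores, weight):
    """Weighted geometric mean per category: det^(1 - weight) * vlm^weight.
    `weight` is an assumed hyper-parameter in [0, 1]."""
    return [d ** (1.0 - weight) * v ** weight
            for d, v in zip(det_scores, vlm_scores)]


# Toy example: two categories with 2-D embeddings (made-up numbers).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
region_emb = [0.9, 0.1]
vlm_scores = cosine_softmax(region_emb, text_embs)
det_scores = [0.6, 0.4]
fused = fuse_scores(det_scores, vlm_scores, weight=0.5)
print(fused)
```

One property of this choice: a geometric mean stays low whenever either input is low. With equal weighting, scores of 0.9 and 0.01 fuse to about 0.095, whereas their arithmetic mean is 0.455, so a category that one of the two scorers strongly rejects is ranked much lower under geometric fusion.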

Paper Structure (38 sections, 5 equations, 7 figures, 12 tables)

This paper contains 38 sections, 5 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: We explore the potential of frozen VLM (e.g., CLIP) features for open-vocabulary detection. The feature grouping reveals rich semantic and locality-sensitive information where object boundaries are nicely delineated (col. 2, see the Appendix for more details). The same frozen features can classify groundtruth regions well without finetuning (col. 3). Therefore, we propose to build an open-vocabulary detector on top of a frozen VLM (col. 4) without the need for knowledge distillation, detection-tailored pretraining, or weakly supervised learning. F-VLM significantly reduces training complexity and compute requirements, and achieves state-of-the-art performance at the system level.
  • Figure 2: F-VLM architecture. We present both training and inference time architectures of F-VLM, where the VLM pooling layer and detection score combination are the differences.
  • Figure 3: F-VLM open-vocabulary and transfer detections. 1-2nd col.: Open-vocabulary detection on LVIS. We only show the novel categories for clarity. 3-4th col.: Transfer detection on Objects365. 5-6th col.: Transfer detection on Ego4D. Novel categories detected: fedora, martini, pennant, football helmet (LVIS); camel, slide, goldfish (Objects365); exit sign, recycle bin, window, soy sauce, wooden basket, cereal, bag of cookies, instant noodle, salad dressing, ketchup (Ego4D).
  • Figure 4: Hyper-parameter sweep on score fusion parameters. We observe that geometric means (right) are significantly better than arithmetic means (left). All results are based on a trained F-VLM R50 model.
  • Figure 5: Understanding the frozen VLM feature clusters. Salient objects and object parts emerge naturally from the clustering of frozen VLM features.
  • ...and 2 more figures