F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
TL;DR
F-VLM shows that a frozen Vision-Language Model (VLM) can serve as the backbone for open-vocabulary object detection when paired with a simple, trainable detector head. It performs region-level open-vocabulary recognition with a text-embedding region classifier and combines the VLM-derived scores with the detector's scores through a geometric mean. On LVIS novel categories it improves mask AP by 6.5 over the previous state of the art while requiring substantially less training compute. It also reports competitive results on the COCO open-vocabulary benchmark and on cross-dataset transfer detection. Because it needs neither knowledge distillation nor detection-tailored pretraining, the approach offers a scalable path to open-vocabulary detection that reuses the knowledge already encoded in frozen VLM backbones.
Abstract
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on the novel categories of the LVIS open-vocabulary detection benchmark. We also demonstrate very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection, along with significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home
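The inference-time combination of detector and VLM outputs can be illustrated with a minimal NumPy sketch. It is not the released implementation. It assumes that region and text embeddings are already available as arrays, that region classification is a softmax over cosine similarities, and that fusion is a weighted geometric mean. The function names, the `temperature` value, and the `weight` parameter are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def region_scores(region_emb, text_emb, temperature=0.01):
    """Classify regions against category text embeddings.

    region_emb: (num_regions, dim) array of region embeddings.
    text_emb:   (num_classes, dim) array of category text embeddings.
    temperature is an assumed value, not taken from the paper.
    Returns a (num_regions, num_classes) array of probabilities.
    """
    # L2-normalize so the dot product is a cosine similarity.
    r = region_emb / np.linalg.norm(region_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return softmax(r @ t.T / temperature, axis=-1)


def fuse_scores(det_scores, vlm_scores, weight=0.5):
    """Weighted geometric mean of detector and VLM scores.

    weight in [0, 1] is a hypothetical knob: 0 keeps only the
    detector scores, 1 keeps only the VLM scores.
    """
    return det_scores ** (1.0 - weight) * vlm_scores ** weight


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(3, 8))       # 3 categories (stand-in data)
    det_region_emb = rng.normal(size=(4, 8))  # detector-head embeddings
    vlm_region_emb = rng.normal(size=(4, 8))  # frozen-VLM pooled embeddings

    det = region_scores(det_region_emb, text_emb)
    vlm = region_scores(vlm_region_emb, text_emb)
    fused = fuse_scores(det, vlm, weight=0.5)
    print(fused.argmax(axis=-1))  # predicted category index per region
```

A geometric mean only gives a category a high fused score when both the detector and the VLM assign it a high score, which is one reason to prefer it over an arithmetic mean in this sketch.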
