Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng
TL;DR
DenseVLM addresses foreground bias in open-vocabulary dense prediction by leveraging a powerful pre-trained VLM (P-VLM) to retrieve region-category semantics and by decoupling region-language alignment into foreground (Thing) and background (Stuff) components. It introduces an end-to-end framework that employs a frozen P-VLM for category retrieval, region denoising, and a decoupled loss based on KL-divergence between region-text and region-feature distributions, enabling unbiased region alignment. On the COCO and ADE20K benchmarks, DenseVLM consistently outperforms prior VLM-based methods, improving recognition of both object-centric and background regions, and it scales effectively with data and backbone variations. The approach yields state-of-the-art results for open-vocabulary dense tasks and opens practical avenues for scalable, label-efficient dense perception in diverse applications.
Abstract
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant "foreground bias", where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is available at https://github.com/HVision-NKU/DenseVLM.
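
As a rough illustration of the retrieve-then-decouple idea described above, the following minimal sketch computes a KL-based alignment loss for a single region. It is not the released implementation: the function names, the top-1 retrieval rule, the temperature, and the choice to restrict the loss to the retrieved category's group (Thing or Stuff) are all assumptions made for this example, and the similarity scores are made-up numbers standing in for region-text similarities from a frozen pre-trained VLM (teacher) and the model being trained (student).

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def decoupled_alignment_loss(teacher_scores, student_scores, is_thing):
    """Hypothetical per-region loss (an assumption, not the paper's exact form).

    1. Retrieve the region's category as the teacher's top-1 score.
    2. Keep only categories in the same group (Thing or Stuff) as the
       retrieved category, so the other group cannot interfere.
    3. Compare teacher and student distributions over that group with KL.
    """
    retrieved = max(range(len(teacher_scores)), key=teacher_scores.__getitem__)
    keep = [i for i, flag in enumerate(is_thing) if flag == is_thing[retrieved]]
    p = softmax([teacher_scores[i] for i in keep])
    q = softmax([student_scores[i] for i in keep])
    return retrieved, kl_divergence(p, q)


# Toy vocabulary: two Thing categories and two Stuff categories.
categories = ["cat", "dog", "sky", "grass"]
is_thing = [True, True, False, False]

# Illustrative scores for one background region.
teacher = [0.1, 0.2, 2.0, 1.5]  # frozen VLM favors "sky"
student = [1.8, 0.3, 1.9, 1.0]  # student is distracted by "cat"

retrieved, loss = decoupled_alignment_loss(teacher, student, is_thing)
print(categories[retrieved], loss)
```

In this toy setup the region is retrieved as a Stuff category, so the student's spurious score for "cat" does not enter the loss; only the distribution over "sky" and "grass" is aligned with the teacher.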
