Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng

TL;DR

DenseVLM addresses foreground bias in open-vocabulary dense prediction by leveraging a powerful pre-trained VLM (P-VLM) to retrieve region-category semantics and by decoupling region-language alignment into foreground (Thing) and background (Stuff) components. It introduces an end-to-end framework that employs a frozen P-VLM for category retrieval, region denoising, and a decoupled loss based on KL-divergence between region-text and region-feature distributions, enabling unbiased region alignment. Across COCO and ADE20K benchmarks, DenseVLM consistently outperforms prior VLM-based methods, improving both object-centric and background region recognition, and it scales effectively with data and backbone variations. The approach yields state-of-the-art results for open-vocabulary dense tasks and opens practical avenues for scalable, label-efficient dense perception in diverse applications.

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant "foreground bias", where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is available at https://github.com/HVision-NKU/DenseVLM.
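
To make the retrieve-then-decouple idea more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes region features have already been pooled for both a frozen teacher VLM and a trainable student, and that each category is tagged as Thing or Stuff. The function names (`decoupled_kl_loss`, `masked_softmax`), the `is_thing` mask, and the temperature value are all hypothetical choices made for illustration; the exact loss used in the paper may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def masked_softmax(logits, mask):
    """Softmax over the entries where mask is True; masked-out entries get 0."""
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

def decoupled_kl_loss(student_regions, teacher_regions, text_embeds,
                      is_thing, temperature=0.01):
    """Hypothetical sketch of a Thing/Stuff-decoupled alignment loss.

    student_regions, teacher_regions: (N, D) pooled region features.
    text_embeds: (C, D) category text embeddings.
    is_thing: (C,) boolean, True for foreground (Thing) categories.
    temperature: assumed value, not taken from the paper.
    """
    text = l2_normalize(text_embeds)
    t_logits = l2_normalize(teacher_regions) @ text.T / temperature
    s_logits = l2_normalize(student_regions) @ text.T / temperature

    # Step 1: the frozen teacher retrieves one category per unlabeled region.
    retrieved = t_logits.argmax(axis=-1)

    # Step 2: decouple. Each region only competes against categories of the
    # same type (Thing or Stuff) as its retrieved category.
    same_type = is_thing[None, :] == is_thing[retrieved][:, None]

    # Step 3: KL(teacher || student) over the restricted category set.
    p = masked_softmax(t_logits, same_type)
    q = masked_softmax(s_logits, same_type)
    eps = 1e-12
    kl = np.where(same_type, p * (np.log(p + eps) - np.log(q + eps)), 0.0)
    return kl.sum(axis=-1).mean(), retrieved

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(6, 16))
    things = np.array([True, True, True, False, False, False])
    teacher = rng.normal(size=(4, 16))
    student = rng.normal(size=(4, 16))
    loss, cats = decoupled_kl_loss(student, teacher, text, things)
    print("loss:", loss, "retrieved categories:", cats)
```

Restricting the softmax to same-type categories is one simple way to keep background regions from being pulled toward foreground text embeddings, which is the interference the abstract describes.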

Paper Structure

This paper contains 18 sections, 5 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Illustration of foreground bias. Previous methods [sun2023eva, zhong2022regionclip, wu2023clipself] often produce similar foreground predictions for background regions; our approach effectively alleviates this issue.
  • Figure 2: Comparison of different VLMs. Unlike existing methods using (a) image-text contrastive learning [CLIP], (b) region-text contrastive learning [zhong2022regionclip], or (c) self-distillation [wu2023clipself], our method leverages powerful model representations for region-language alignment.
  • Figure 3: Mask accuracy comparison across categories in COCO dataset. Our method achieves notable improvements, especially in addressing foreground bias. The foreground categories are shown in black, and the background categories are highlighted in red.
  • Figure 4: Comparing the alignment effect of our DenseVLM with other methods through visualizations of cosine similarity maps between visual features and text embeddings.
  • Figure 5: Zero-shot comparisons of models pre-trained on datasets with three different scales. We select three training sets from the SA-1B dataset [kirillov2023segment]: 100K, 1.1M, and 5.5M seen samples and perform the zero-shot evaluation on the COCO and ADE20K benchmarks.
  • ...and 5 more figures