
MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

Yuanbing Zhu, Bingke Zhu, Yingying Chen, Yunfang Niu, Ming Tang, Jinqiao Wang

Abstract

Pretrained vision-language models (VLMs), e.g., CLIP, are increasingly used to bridge the gap between open- and closed-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images (e.g., $224\times224$), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but this introduces significant computational overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone. It uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and captures local-global correspondences across patches by interacting with multi-resolution features. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme that aggregates multi-grained semantics from multi-resolution CLIP features into object queries. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary image segmentation benchmarks, establishing new standards for open-vocabulary image segmentation.
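The sliding-window slicing described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`window_starts`, `slice_boxes`, `crop`), the default stride of 224, and the choice to shift the last window back to the image border instead of padding are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def window_starts(length: int, window: int, stride: int) -> List[int]:
    """Start offsets of windows that together cover [0, length)."""
    if length < window:
        raise ValueError("image side is smaller than the window")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        # Assumption: shift the final window back so it ends at the border,
        # rather than padding the image.
        starts.append(length - window)
    return starts


def slice_boxes(height: int, width: int,
                window: int = 224, stride: int = 224) -> List[Box]:
    """Uniform window boxes, each matching the encoder input size."""
    return [(top, left, top + window, left + window)
            for top in window_starts(height, window, stride)
            for left in window_starts(width, window, stride)]


def crop(image: Sequence[Sequence], box: Box) -> List[list]:
    """Cut one slice out of an image stored as a list of pixel rows."""
    top, left, bottom, right = box
    return [list(row[left:right]) for row in image[top:bottom]]


if __name__ == "__main__":
    for box in slice_boxes(448, 448):
        print(box)
```

Under these assumptions, a $448\times448$ input yields four non-overlapping $224\times224$ slices; sides that are not multiples of the window size produce one extra, overlapping window per axis.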

Paper Structure (13 sections, 9 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Comparison between other training frameworks and MROVSeg. Previous methods (a) adopt an additional image backbone to provide mask features, and their mask prediction is class-unaware. Our method (b) provides multi-resolution CLIP features for both mask decoding and mask classification, and the whole framework is class-aware.
  • Figure 2: The overall pipeline of MROVSeg. For a high-resolution input image, its downsampled image and its sliding-window slices are fed into the CLIP visual encoder to extract multi-resolution CLIP features. The Multi-Res Adapter adapts these features for the mask decoder and the attention mask decoder. The generated attention masks are employed to aggregate semantics from the multi-resolution CLIP features.
  • Figure 3: Multi-Res Adapter. The slice features from CLIP layer 0, $\{\mathbf{P}_i^0\}_{i=1}^{S}$, are concatenated with learnable queries and fed to ViT Blocks. The slice features from various CLIP layers are first adapted by the MRF module to restore spatial geometry and capture long-range global contexts, and are then injected into the intermediate ViT Blocks. The final output visual tokens and projected queries are used for downstream mask prediction and classification.
  • Figure 4: Effect of decoupled attention decoding for multi-grained semantics. With single attention mask decoding, the spatial cues are overwhelmed by background noise (b). Our decoupled attention mask decoding effectively splits the global and local semantics, producing relatively clean global (c) and local (d) attention masks.
  • Figure 5: Multi-grained Masked Attention. Object [CLS] tokens $\mathbf{X}_{\texttt{prop}}$ perform cross-attention with low- and high-resolution CLIP features $\mathbf{X}_{\texttt{LR}}$ and $\mathbf{X}_{\texttt{HR}}$ using decoupled attention masks.
  • ...and 4 more figures
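
The decoupled masked cross-attention summarized in the Figure 4 and Figure 5 captions can be illustrated with a small pure-Python sketch. It is only a sketch under stated assumptions: there are no learned projections, the function names are hypothetical, and summing the global and local outputs is an assumed fusion rule that the captions do not specify.

```python
import math
from typing import List

Matrix = List[List[float]]


def masked_cross_attention(queries: Matrix, features: Matrix,
                           mask: List[List[bool]]) -> Matrix:
    """Each query attends only to feature tokens where its mask is True."""
    dim = len(queries[0])
    outputs = []
    for query, keep in zip(queries, mask):
        scores = [
            sum(a * b for a, b in zip(query, feat)) / math.sqrt(dim)
            if allowed else -math.inf
            for feat, allowed in zip(features, keep)
        ]
        top = max(scores)
        if top == -math.inf:
            # Fully masked query: return zeros instead of dividing by zero.
            outputs.append([0.0] * dim)
            continue
        weights = [math.exp(s - top) for s in scores]  # masked tokens get 0.0
        total = sum(weights)
        outputs.append([
            sum(w * feat[d] for w, feat in zip(weights, features)) / total
            for d in range(dim)
        ])
    return outputs


def multi_grained_masked_attention(x_prop: Matrix, x_lr: Matrix, x_hr: Matrix,
                                   mask_global: List[List[bool]],
                                   mask_local: List[List[bool]]) -> Matrix:
    """Global mask gates low-res tokens; local mask gates high-res tokens."""
    global_out = masked_cross_attention(x_prop, x_lr, mask_global)
    local_out = masked_cross_attention(x_prop, x_hr, mask_local)
    # Assumption: fuse the two granularities by element-wise addition.
    return [[g + l for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(global_out, local_out)]
```

Keeping two separate masks lets each object token draw on coarse context from $\mathbf{X}_{\texttt{LR}}$ and fine detail from $\mathbf{X}_{\texttt{HR}}$ independently, which is the decoupling that the Figure 4 caption contrasts with single attention mask decoding.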