Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, Humphrey Shi

TL;DR

Open-vocabulary segmentation relies on cross-modal CLIP embeddings, but existing methods either freeze CLIP-V or fine-tune only the vision encoder, limiting transfer and region sensitivity. This work proposes MAFT+, a collaborative vision-text representation fine-tuning framework that jointly optimizes CLIP-V and CLIP-T via Representation Compensation (RC) and Content-Dependent Transfer (CDT). RC preserves zero-shot capability by aligning the fine-tuned CLIP-V representation with the original one using multi-grid pooling, while CDT conditions text embeddings on image content with Transformer layers for parameter-efficient text adaptation. The approach achieves state-of-the-art results on semantic and panoptic OVS benchmarks, demonstrating improved cross-modal alignment and transfer without sacrificing zero-shot performance. The method uses a MaskFormer-based Proposal Generator and shows strong improvements across multiple datasets, highlighting the practical impact of joint vision-text optimization in open-vocabulary settings.
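
To make the multi-grid pooling idea behind RC concrete, the following is a minimal sketch, not the authors' implementation. It average-pools a frozen and a fine-tuned feature map into grids of several sizes and measures how far the pooled vectors drift apart. The function names, the grid sizes (1, 2, 4), and the mean-squared-difference penalty are all assumptions made for illustration; the summary above does not specify them. Plain Python lists are used to keep the sketch dependency-free, whereas a real implementation would use a tensor library.

```python
def avg_pool_grid(fmap, grid):
    """Average-pool an H x W x C feature map (nested lists) into grid*grid vectors."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    if grid > h or grid > w:
        raise ValueError("grid must not exceed the feature map size")
    pooled = []
    for gi in range(grid):
        r0, r1 = gi * h // grid, (gi + 1) * h // grid
        for gj in range(grid):
            c0, c1 = gj * w // grid, (gj + 1) * w // grid
            cell = [fmap[r][col] for r in range(r0, r1) for col in range(c0, c1)]
            pooled.append([sum(v[k] for v in cell) / len(cell) for k in range(c)])
    return pooled


def multi_grid_alignment_loss(f_frozen, f_tuned, grids=(1, 2, 4)):
    """Mean squared difference between pooled features over several grid sizes.

    The grid sizes and the squared-difference penalty are illustrative
    assumptions, not values taken from the paper.
    """
    total, count = 0.0, 0
    for g in grids:
        pooled_frozen = avg_pool_grid(f_frozen, g)
        pooled_tuned = avg_pool_grid(f_tuned, g)
        for va, vb in zip(pooled_frozen, pooled_tuned):
            for x, y in zip(va, vb):
                total += (x - y) ** 2
                count += 1
    return total / count


if __name__ == "__main__":
    # Toy 4 x 4 feature maps with 2 channels (hypothetical data).
    frozen = [[[float(r + c), float(r * c)] for c in range(4)] for r in range(4)]
    tuned = [[[x + 1.0 for x in px] for px in row] for row in frozen]
    print(multi_grid_alignment_loss(frozen, tuned))
```

Under these assumptions, the loss is zero when the fine-tuned features match the frozen ones at every grid scale and grows as they diverge, which is the sense in which the original representation acts as compensation during fine-tuning.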

Abstract

Pre-trained vision-language models, e.g., CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either freezing CLIP during training to unilaterally maintain its zero-shot capability, or fine-tuning the CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. Based on this, we propose the Content-Dependent Transfer to adaptively enhance each text embedding by interacting with the input image, which presents a parameter-efficient way to optimize the text representation. We additionally introduce a Representation Compensation strategy, reviewing the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimizing mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU on A-847, A-150, PC-459, PC-59 and PAS-20, respectively. Furthermore, in a panoptic setting on ADE20K, we achieve 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code will be available at https://github.com/jiaosiyu1999/MAFT-Plus.git.
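
As a rough illustration of what "enhancing each text embedding by interacting with the input image" could look like, the sketch below applies a single-head cross-attention step in which each text embedding queries the image tokens and the attended image context is added back as a residual. This is a simplification under stated assumptions: it has no learned projections, no feed-forward layer, and no normalization, all of which a Transformer layer would include, and the function name `content_dependent_text` is hypothetical rather than taken from the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def content_dependent_text(text_emb, image_tokens):
    """Condition K text embeddings (K x C) on N image tokens (N x C).

    Each text embedding attends over the image tokens with scaled
    dot-product attention; the attended context is added as a residual.
    Learned query/key/value projections are omitted for brevity (assumption).
    """
    dim = len(text_emb[0])
    scale = 1.0 / math.sqrt(dim)
    conditioned = []
    for t in text_emb:
        scores = [scale * sum(a * b for a, b in zip(t, v)) for v in image_tokens]
        weights = softmax(scores)
        context = [
            sum(w * v[k] for w, v in zip(weights, image_tokens))
            for k in range(dim)
        ]
        conditioned.append([a + c for a, c in zip(t, context)])
    return conditioned


if __name__ == "__main__":
    # Two class-name embeddings and three image tokens (hypothetical data).
    text = [[1.0, 0.0], [0.0, 1.0]]
    image = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
    for row in content_dependent_text(text, image):
        print(row)
```

The point of the sketch is that the same class name yields a different embedding for each image, so the text side adapts to image content while the number of trainable parameters stays small compared with fine-tuning the whole text encoder.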

Paper Structure (17 sections, 6 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Different learning frameworks for open-vocabulary segmentation, from the perspective of whether to freeze CLIP. (a) The "frozen CLIP" paradigm (OVSeg, ZSSeg, FreeSeg, FC-CLIP). (b) Fine-tuning CLIP-V (MAFT). (c) Our MAFT+ framework, which optimizes both CLIP-V and CLIP-T.
  • Figure 2: Overview of MAFT+. We use CLIP-V as the backbone to extract image features. A Proposal Generator is trained to generate mask proposals. The Representation Compensation strategy reviews the vision representation to preserve the zero-shot capability of CLIP (red part); the Content-Dependent Transfer conditions the text embeddings on the input image and optimizes the text representation in a parameter-efficient way (blue part).
  • Figure 3: Details of Representation Compensation.
  • Figure 4: Details of Content-Dependent Transfer.
  • Figure 5: Comparisons between CLIP-T tuning strategies.
  • ...and 5 more figures