CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li

TL;DR

This work introduces CaR (CLIP as RNN), a training-free framework that retains the full vocabulary of a pre-trained vision-language model (CLIP) for open-vocabulary segmentation. It employs a recurrent architecture with a fixed-weight, two-stage segmenter that iteratively refines text queries and mask proposals, using gradient-based CAMs and CLIP-based similarity with visual prompts to progressively improve segmentation quality. The approach achieves state-of-the-art zero-shot semantic and referring segmentation across multiple datasets (e.g., Pascal VOC, COCO Object, Pascal Context) and extends to video, outperforming both prior training-free methods and baselines fine-tuned on millions of data samples. CaR demonstrates the potential of frozen VLMs for dense prediction by combining recurrence, background query strategies, and post-processing (CRF/SAM), all without additional training. Because the vocabulary is left intact, the framework can segment diverse concepts, brands, and referring expressions, with practical implications for scalable, annotation-efficient vision systems.

Abstract

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples, and sets the new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.

Paper Structure

This paper contains 27 sections, 10 equations, 6 figures, 10 tables, and 2 algorithms.

Figures (6)

  • Figure 1: Our method, CaR, can fully inherit the vast vocabulary space of CLIP by directly using features from a pre-trained VLM (i.e., CLIP) without any fine-tuning. Although the scene in the image is simple, state-of-the-art methods fine-tuned on segmentation datasets [liu2023groundingovseg] fail to segment and recognize Pepsi and Coca Cola correctly.
  • Figure 2: The overall framework of our method, CaR. (a), (b): given an image, the user provides a set of text queries that they are interested in segmenting. This initial set, denoted by $h_0$, may refer to concepts that do not exist in the image, e.g., Barcelona and Arsenal. In the $t$-th time step, the frozen segmenter evaluates the degree of alignment between each mask and text query from the previous time step, $h_{t-1}$, and then low-confidence queries are eliminated by the function $\sigma$. (c) depicts the detailed architecture of our two-stage segmenter. It consists of a mask proposal generator $f(\cdot, \cdot)$ and a mask classifier $g(\cdot, \cdot)$ that assesses the alignment of each mask-text pair.
  • Figure 3: Examples of visual prompts given a mask on the man wearing the jersey of Manchester United.
  • Figure D: Comparison of different post-processors on randomly selected images from PASCAL VOC.
  • Figure E: Comparison of different post-processors on randomly selected images from COCO Object.
  • ...and 1 more figure
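
The recurrence described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: `car_recurrence`, `toy_propose`, `toy_score`, the `threshold` and `max_steps` parameters, and the stop-when-unchanged rule are all hypothetical names and assumptions introduced here. In the actual method the mask proposal generator $f$ and the mask classifier $g$ are built on a frozen CLIP (gradient-based CAMs and visual prompts); here they are replaced by toy lookups so that only the filtering loop over $h_{t-1} \rightarrow h_t$ is shown.

```python
from typing import Any, Callable, Dict, List, Tuple

Mask = List[List[float]]  # toy stand-in for an H x W soft mask


def car_recurrence(
    image: Any,
    h0: List[str],
    propose_masks: Callable[[Any, List[str]], Dict[str, Mask]],                   # plays the role of f
    score_pairs: Callable[[Any, Dict[str, Mask], List[str]], Dict[str, float]],  # plays the role of g
    threshold: float = 0.5,  # assumption: sigma is modelled as a simple score cut-off
    max_steps: int = 10,     # assumption: safety cap on the number of time steps
) -> Tuple[List[str], Dict[str, Mask]]:
    """Re-run a fixed two-stage segmenter until the query set stops changing."""
    queries = list(h0)
    masks: Dict[str, Mask] = {}
    for _ in range(max_steps):
        if not queries:
            return [], {}                                # every query was filtered out
        masks = propose_masks(image, queries)            # stage 1: one mask per query
        scores = score_pairs(image, masks, queries)      # stage 2: mask-text alignment
        kept = [q for q in queries if scores[q] >= threshold]  # sigma: drop low-confidence queries
        if kept == queries:                              # h_t == h_{t-1}: nothing left to remove
            break
        queries = kept
    # Only return masks that belong to surviving queries.
    return queries, {q: masks[q] for q in queries if q in masks}


# Toy stand-ins (hypothetical): the "image" is a dict of made-up presence scores.
def toy_propose(image: Dict[str, float], queries: List[str]) -> Dict[str, Mask]:
    return {q: [[image.get(q, 0.0)]] for q in queries}


def toy_score(image: Dict[str, float], masks: Dict[str, Mask], queries: List[str]) -> Dict[str, float]:
    return {q: masks[q][0][0] for q in queries}


toy_image = {"Manchester United": 0.9}
final_queries, final_masks = car_recurrence(
    toy_image, ["Manchester United", "Barcelona", "Arsenal"], toy_propose, toy_score
)
print(final_queries)
```

In this toy run the queries for absent concepts (Barcelona, Arsenal) are removed in the first step and the loop stops at the second step because the query set no longer changes. The segmenter functions are passed in as arguments to mirror the caption's point that the recurrent unit has fixed weights and is simply re-applied with a smaller query set at each step.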