CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Shuyang Sun; Runjia Li; Philip Torr; Xiuye Gu; Siyang Li

TL;DR

This work introduces CaR (CLIP as RNN), a training-free framework that retains the full vocabulary of a pre-trained vision-language model (CLIP) for open-vocabulary segmentation. It employs a recurrent architecture with a fixed-weight, two-stage segmenter that iteratively refines text queries and mask proposals, using gradient-based CAMs and CLIP-based similarity with visual prompts to progressively improve segmentation quality. The approach yields state-of-the-art zero-shot semantic and referring segmentation across multiple datasets (e.g., VOC, COCO, Pascal Context) and extends to video and referring tasks, outperforming strong fine-tuned baselines and prior zero-shot methods. CaR demonstrates the potential of leveraging frozen VLMs for dense prediction by combining recurrence, background query strategies, and post-processing (CRF/SAM) to achieve robust open-vocabulary segmentation without additional training. This open-vocabulary framework broadens segmentation capabilities to diverse concepts, brands, and expressions, with practical implications for scalable, annotation-efficient vision systems.

Abstract

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples, and sets the new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.

Paper Structure (27 sections, 10 equations, 6 figures, 10 tables, 2 algorithms)

Introduction
Related Work
CLIP as Recurrent Neural Networks
A Recap on Recurrent Neural Networks
Overview
The Two-stage Segmenter
Post-Processing
Experiments
Zero-shot Semantic Segmentation
Ablation Studies
Referring Segmentation
Conclusion
More Experimental Results
Quantitative Analysis on Vocabulary Space
Evaluation without Background
...and 12 more sections

Figures (6)

Figure 1: Our method CaR fully inherits the vast vocabulary space of CLIP by directly using features from a pre-trained VLM, i.e., CLIP, without any fine-tuning. Although the scene in the image is simple, state-of-the-art methods fine-tuned on segmentation datasets [liu2023groundingovseg] fail to segment and recognize Pepsi and Coca Cola correctly.
Figure 2: The overall framework of our method CaR. (a), (b): given an image, the user provides a set of text queries that they are interested in segmenting. This initial set, denoted by $h_0$, may refer to concepts that do not exist in the image, e.g., Barcelona and Arsenal. In the $t$-th time step, the frozen segmenter evaluates the degree of alignment between each mask and text query from the previous time step, $h_{t-1}$, and then low-confidence queries are eliminated by the function $\sigma$. (c) depicts the detailed architecture of our two-stage segmenter. It consists of a mask proposal generator $f(\cdot, \cdot)$ and a mask classifier $g(\cdot, \cdot)$ that assesses the alignment of each mask-text pair.
Figure 3: Examples of visual prompts given a mask on the man wearing the jersey of Manchester United.
Figure D: Comparison of different post-processors on randomly selected images from PASCAL VOC.
Figure E: Comparison of different post-processors on randomly selected images from COCO Object.
...and 1 more figures
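
The recurrence described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `car_recurrence` and `toy_segmenter`, the `threshold` and `max_steps` parameters, and the label-map "image" are all assumptions made for illustration. The toy segmenter stands in for the frozen two-stage segmenter (mask proposal $f$ followed by mask classifier $g$) and does not call CLIP; the list comprehension plays the role of the filtering function $\sigma$.

```python
from typing import Callable, Dict, List, Tuple

import numpy as np

Mask = np.ndarray
# Hypothetical interface: (image, queries) -> (mask per query, score per query).
Segmenter = Callable[[np.ndarray, List[str]], Tuple[Dict[str, Mask], Dict[str, float]]]


def car_recurrence(
    image: np.ndarray,
    queries: List[str],
    segmenter: Segmenter,
    threshold: float = 0.5,   # assumed confidence cut-off used by the filter
    max_steps: int = 10,      # assumed safety cap on the number of time steps
) -> Tuple[List[str], Dict[str, Mask]]:
    """Repeatedly apply a fixed segmenter and drop low-confidence text queries."""
    h = list(queries)  # h_0: the initial set of text queries
    masks: Dict[str, Mask] = {}
    for _ in range(max_steps):
        if not h:
            return [], {}
        # The same frozen segmenter is reused at every time step.
        masks, scores = segmenter(image, h)
        # sigma: keep only the queries whose mask-text alignment is high enough.
        h_next = [q for q in h if scores[q] >= threshold]
        if h_next == h:  # the query set stopped changing, so stop iterating
            break
        h = h_next
    return h, {q: masks[q] for q in h}


def toy_segmenter(image: np.ndarray, queries: List[str]):
    """Toy stand-in for the segmenter: looks up a label id per query."""
    label_of = {"Manchester United": 1}  # hypothetical concept -> label id
    masks, scores = {}, {}
    for q in queries:
        mask = image == label_of.get(q, -1)   # "mask proposal" for this query
        masks[q] = mask
        scores[q] = 1.0 if mask.any() else 0.0  # "alignment score" for the pair
    return masks, scores


image = np.array([[1, 1], [0, 0]])  # toy label map instead of real pixels
kept, final_masks = car_recurrence(
    image, ["Manchester United", "Barcelona", "Arsenal"], toy_segmenter
)
print(kept)
```

In this toy run the queries for concepts absent from the image are filtered out in the first step, and the loop stops once the surviving query set no longer changes.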
