CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li
TL;DR
This work introduces CaR (CLIP as RNN), a training-free framework for open-vocabulary segmentation that retains the full vocabulary of a pre-trained vision-language model (CLIP). Its recurrent unit is a fixed-weight, two-stage segmenter that iteratively refines text queries and mask proposals, using gradient-based CAMs and CLIP-based similarity with visual prompts to improve segmentation quality at each step. CaR achieves state-of-the-art zero-shot semantic and referring segmentation on multiple datasets (e.g., VOC, COCO, Pascal Context), outperforming both prior zero-shot methods and strong fine-tuned baselines, and it extends to video. By combining recurrence, background query strategies, and post-processing (CRF/SAM), it shows that a frozen VLM can be used for dense prediction without additional training. Because the vocabulary is left intact, segmentation extends to diverse concepts, brands, and expressions, which matters in practice for scalable, annotation-efficient vision systems.
Abstract
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive to annotate, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without any training effort. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only its training-free counterparts but also those fine-tuned on millions of data samples, and sets new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
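The recurrent filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the frozen two-stage segmenter is assumed to be available as two hypothetical callables, `propose` (one mask per active text query) and `score` (agreement between each mask and its query), and the threshold value and stopping rule are illustrative choices. The toy stand-ins below replace CLIP with a fixed affinity table so that the loop runs on its own.

```python
from typing import Any, Callable, Dict, List, Tuple


def recurrent_filter(
    image: Any,
    queries: List[str],
    propose: Callable[[Any, List[str]], Dict[str, Any]],
    score: Callable[[Any, Dict[str, Any]], Dict[str, float]],
    threshold: float,
    max_steps: int = 10,
) -> Tuple[List[str], Dict[str, Any]]:
    """Repeat propose -> score -> filter until the query set stops changing."""
    active = list(queries)
    masks: Dict[str, Any] = {}
    for _ in range(max_steps):
        if not active:
            break
        masks = propose(image, active)   # stage 1: one mask per active query
        scores = score(image, masks)     # stage 2: mask/text agreement
        kept = [q for q in active if scores[q] >= threshold]
        if kept == active:               # fixed point: nothing was filtered out
            break
        active = kept                    # surviving queries feed the next step
    return active, {q: masks[q] for q in active if q in masks}


# Toy stand-ins (assumptions, not CLIP): the "image" is a table of query
# affinities, and a "mask" is the share of the image a query wins against
# the other active queries.
AFFINITY = {"cat": 0.6, "sofa": 0.3, "plane": 0.05, "boat": 0.05}


def toy_propose(image: Dict[str, float], active: List[str]) -> Dict[str, float]:
    total = sum(image[q] for q in active)
    return {q: image[q] / total for q in active}


def toy_score(image: Dict[str, float], masks: Dict[str, float]) -> Dict[str, float]:
    return dict(masks)  # in this toy, the score is simply the mask's share


kept, masks = recurrent_filter(
    AFFINITY, list(AFFINITY), toy_propose, toy_score, threshold=0.2
)
print(kept)
```

In this toy run, "plane" and "boat" are dropped in the first step, and the second step recomputes the masks for "cat" and "sofa" without those distractors, which is the intuition behind feeding the filtered queries back into the same fixed-weight segmenter.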
