A Survey on Training-free Open-Vocabulary Semantic Segmentation

Naomi Kombol, Ivan Martinović, Siniša Šegvić

TL;DR

This paper tackles open-vocabulary semantic segmentation without additional training by repurposing vision-language and visual foundation models. It categorizes over 30 training-free approaches into purely CLIP-based methods, those that incorporate auxiliary visual foundation models, and generative-model–driven strategies, detailing how each uses masked pooling, intermediate features, and cross-modal prototypes. Key findings show that SAM- or DINO-assisted CLIP methods often achieve state-of-the-art results, while purely CLIP-based methods are approaching that performance, with room for improvements in background handling and inference efficiency. The work serves as a practical, comprehensive onboarding resource and highlights concrete directions for future work.

Abstract

Semantic segmentation is one of the most fundamental tasks in image understanding, with a long history of research and, consequently, a myriad of different approaches. Traditional methods train models from scratch, requiring vast amounts of computational resources and training data. With the move to open-vocabulary semantic segmentation, which asks models to classify beyond learned categories, collecting large quantities of finely annotated data would be prohibitively expensive. Researchers have instead turned to training-free methods that leverage existing models built for tasks where data is more easily acquired. Specifically, this survey covers the history, nuances, development of ideas, and state of the art in training-free open-vocabulary semantic segmentation that leverages existing multi-modal classification models. We first give preliminaries on the task definition, followed by an overview of popular model archetypes, and then spotlight over 30 approaches split into broader research branches: purely CLIP-based methods, those leveraging auxiliary visual foundation models, and those relying on generative methods. Subsequently, we discuss the limitations and potential problems of current research and suggest some underexplored ideas for future study. We believe this survey will serve as a good onboarding read for new researchers and spark increased interest in the area.

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of Training-free Open-Vocabulary Semantic Segmentation
  • Figure 2: Standard ViT encoder block that first normalizes inputs and then performs inter-token mixing through multi-head attention. The remixed tokens are residually connected to the inputs, the sum is normalized, projected through a Multi-Layer Perceptron (MLP), and again residually connected. Image is from [vit].
  • Figure 3: Modified ViT encoder block with the MLP and residual connection removed. Image is from [vit], modified.
  • Figure 4: Qualitative comparison on the Pascal VOC21 dataset [voc] of some standout methods: MaskCLIP, SCLIP, and ProxyCLIP. GT denotes ground truth.
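
The captions of Figures 2 and 3 describe two encoder-block variants in words. The sketch below is one possible reading of those captions in plain Python, not the survey's or any library's implementation. It makes several simplifying assumptions: a single attention head stands in for multi-head attention, layer normalization has no learned scale or shift, ReLU stands in for the usual GELU, and all names (standard_block, modified_block, the params keys) are hypothetical. The modified block is read here as dropping both the MLP branch and the residual connection, so it returns the attention output directly.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(x, p):
    """Single-head self-attention (assumption: one head instead of multi-head)."""
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(matmul(weights, v), p["wo"])

def mlp(x, p):
    """Two-layer MLP; ReLU is a stand-in for GELU."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, p["w1"])]
    return matmul(hidden, p["w2"])

def standard_block(x, p):
    """Figure 2 reading: norm -> attention -> residual -> norm -> MLP -> residual."""
    x = add(x, attention(layer_norm(x), p))
    return add(x, mlp(layer_norm(x), p))

def modified_block(x, p):
    """Figure 3 reading: norm -> attention only; no residual connection, no MLP."""
    return attention(layer_norm(x), p)

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Toy inputs: 2 tokens of width 3, identity weights (illustrative values only).
tokens = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]
params = {name: identity(3) for name in ("wq", "wk", "wv", "wo", "w1", "w2")}
std_out = standard_block(tokens, params)
mod_out = modified_block(tokens, params)
```

With identity value and output projections, each row of mod_out is a weighted average of the normalized tokens, which makes it easy to see that the modified block outputs only the attention-mixed features and carries no copy of the block's own input.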