ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin

TL;DR

This work tackles open-vocabulary semantic segmentation by adapting CLIP for dense prediction in a training-free regime. It modifies the architecture of the ViT image encoder, combines middle-layer attention maps with those of the final layer, and enriches the input representations through Image Engineering (data augmentation) and LLM-generated auxiliary texts. Empirically, ITACLIP delivers state-of-the-art results across five benchmarks (COCO-Stuff, COCO-Object, Pascal Context, Pascal VOC, Cityscapes) without pixel-level supervision and remains robust under ablations. The approach offers a practical path to high-quality dense predictions in open-vocabulary settings and may generalize to other vision tasks that benefit from refined input representations and text-guided grounding.

Abstract

Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.

Paper Structure

This paper contains 19 sections, 11 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Qualitative comparison of training-free semantic segmentation methods. We compare ITACLIP with SCLIP (Wang et al., 2023) and NACLIP (Hajimiri et al., 2024) using images from the COCO-Stuff (Caesar et al., 2018) dataset. Additional visualizations are included in the Appendix.
  • Figure 2: Overview of ITACLIP. Our method integrates image, text, and architectural enhancements to produce a more accurate segmentation map. We apply various data augmentation techniques, then process both the original and augmented images through a modified image encoder to obtain image embeddings. We also utilize an LLM to generate auxiliary texts (e.g., definitions or synonyms) for each original class name. The $\lambda$ and $\alpha$ symbols denote the image engineering and auxiliary text coefficients used in weighted summations, respectively.
  • Figure 3: Visualization of attention maps from various layers for a selected patch. The red rectangle indicates the position of the randomly selected patch. Note that we use CLIP-ViT-B/16 as our visual backbone, with Layer 12 serving as the final layer.
  • Figure 4: Procedure for generating auxiliary texts for a given class name.
  • Figure 5: Qualitative comparison of training-free semantic segmentation methods. We compare ITACLIP with SCLIP (Wang et al., 2023) and NACLIP (Hajimiri et al., 2024) using images from the Pascal VOC (Everingham et al., 2010), Pascal Context (Mottaghi et al., 2014), and COCO-Object (Lin et al., 2014) datasets. ITACLIP consistently outperforms the other approaches. GT denotes the ground truth of the image.
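
The weighted summations mentioned in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `fuse_and_segment`, the convex-combination form `w * a + (1 - w) * b`, the averaging over augmented views, and the point at which the two summations are applied are all assumptions made for illustration. Only the roles of $\lambda$ (image engineering) and $\alpha$ (auxiliary texts) come from the caption. The sketch also assumes that features from augmented images have already been mapped back to the patch layout of the original image.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_and_segment(orig_feats, aug_feats, class_emb, aux_emb, lam, alpha):
    """Hypothetical fusion of image and text cues into per-patch labels.

    orig_feats: (N, D) patch embeddings of the original image.
    aug_feats:  list of (N, D) patch embeddings of augmented images,
                assumed to be aligned with the original patch grid.
    class_emb:  (C, D) text embeddings of the original class names.
    aux_emb:    (C, D) text embeddings of the auxiliary texts
                (e.g., definitions or synonyms), one per class.
    lam:        image engineering coefficient in [0, 1] (assumed form).
    alpha:      auxiliary text coefficient in [0, 1] (assumed form).
    """
    # Image side: weighted sum of original and averaged augmented features.
    aug_mean = np.mean(np.stack(aug_feats, axis=0), axis=0)
    img = l2_normalize(lam * orig_feats + (1.0 - lam) * aug_mean)

    # Text side: cosine-similarity logits against both sets of texts.
    logits_class = img @ l2_normalize(class_emb).T  # (N, C)
    logits_aux = img @ l2_normalize(aux_emb).T      # (N, C)

    # Weighted sum of class-name logits and auxiliary-text logits.
    logits = alpha * logits_class + (1.0 - alpha) * logits_aux

    # Each patch is assigned the class with the highest fused score.
    return logits, logits.argmax(axis=-1)
```

With `lam = 1` and `alpha = 1` the sketch reduces to plain cosine-similarity matching between the original patch embeddings and the class-name embeddings, so the two coefficients control how much the augmented views and the auxiliary texts contribute.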