Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

Mehrdad Noori, David Osowiechi, Gustavo Adolfo Vargas Hakim, Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers

TL;DR

This work tackles the lack of test-time adaptation for open-vocabulary semantic segmentation by introducing MLMP, a plug-and-play framework that enhances VLM-based segmentation through multi-level, uncertainty-aware feature fusion and multi-prompt entropy minimization. By updating only the vision encoder's LayerNorm parameters and leveraging multiple text templates, MLMP robustly adapts to distribution shifts without requiring labels or source data. A theoretical proposition demonstrates that averaging gradients across templates yields an unbiased descent with variance diminishing as $1/T$, supporting the loss-level ensemble approach. The authors provide a comprehensive OVSS-TTA benchmark with 87 test scenarios across nine datasets, showing that MLMP consistently outperforms classification-based TTA baselines and other adaptation methods, including in single-sample and real-rendered shifts, highlighting practical impact for robust, language-aware segmentation.
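The $1/T$ claim can be illustrated with a standard averaging argument. The block below is a sketch under an assumption that is not taken from the source: the per-template gradients $g_t$ are treated as independent, identically distributed draws with mean $g$ and covariance $\Sigma$. The paper's proposition may rest on different or weaker conditions.

```latex
% Assumption (not from the source): g_1, ..., g_T are i.i.d. with
% mean g and covariance \Sigma, one gradient per text template.
\bar{g} = \frac{1}{T}\sum_{t=1}^{T} g_t, \qquad
\mathbb{E}[\bar{g}] = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[g_t] = g, \qquad
\operatorname{Cov}(\bar{g}) = \frac{1}{T^{2}}\sum_{t=1}^{T}\operatorname{Cov}(g_t) = \frac{\Sigma}{T}
```

Under this assumption the averaged gradient keeps the same expected direction as any single-template gradient, while its covariance shrinks in proportion to $1/T$.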

Abstract

Recently, test-time adaptation (TTA) has attracted wide interest in the context of vision-language models (VLMs) for image classification. However, to the best of our knowledge, the problem has been completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS token and local pixel-wise levels. Our approach can be used as a plug-and-play module for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, nine segmentation datasets, 15 common synthetic corruptions, and additional real and rendered domain shifts, with a total of 87 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines. Code and data are available at https://github.com/dosowiechi/MLMP.
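The multi-prompt entropy objective described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `multi_prompt_entropy`, the `scale` parameter, and the toy vectors are hypothetical, and the sketch covers only the loss value (no gradient step and no intermediate-layer features).

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_prompt_entropy(token_feats, text_embeds_per_template, scale=1.0):
    """Mean prediction entropy over tokens, averaged over templates.

    token_feats: list of N feature vectors (e.g. the CLS token and patch tokens).
    text_embeds_per_template: list of T lists, each holding K class embeddings
        produced with one prompt template.
    scale: hypothetical logit scale (an assumption of this sketch).
    """
    per_template = []
    for class_embeds in text_embeds_per_template:
        total = 0.0
        for feat in token_feats:
            logits = [scale * dot(feat, c) for c in class_embeds]
            total += entropy(softmax(logits))
        per_template.append(total / len(token_feats))
    # Loss-level ensemble: average the per-template losses.
    return sum(per_template) / len(per_template)

# Toy example: two tokens, two templates, two classes.
tokens = [[1.0, 0.0], [0.2, 0.9]]
templates = [
    [[1.0, 0.0], [0.0, 1.0]],   # class embeddings under template 1
    [[0.9, 0.1], [0.1, 0.9]],   # class embeddings under template 2
]
print(multi_prompt_entropy(tokens, templates, scale=5.0))
```

In an actual adaptation loop this scalar would be minimized with respect to the adapted parameters; the sketch only shows how the per-template losses are combined.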

Paper Structure

This paper contains 29 sections, 15 equations, 16 figures, 24 tables.

Figures (16)

  • Figure 1: Motivation. (a) Left: Mean $\pm$ std entropy across seven text templates for the CLS token and the spatial tokens of the final and intermediate vision layers. Even the final-layer spatial tokens exhibit higher entropy and variability than CLS, and this sensitivity grows further in intermediate layers (numbers show % std increase relative to CLS). These patterns highlight pronounced prompt-induced uncertainty at multiple depths and motivate both multi-level and multi-prompt adaptation. (b) Right: mIoU of the baseline vs. MLMP on clean and corrupted data, showing consistent absolute improvements and underscoring the effectiveness of our joint adaptation strategies. Here, V20 denotes the Pascal VOC 20 dataset, and V20-C represents the average performance over its 15 synthetic corruption types. The variance in (a) is computed across all samples and all corruptions.
  • Figure 2: Overview of our MLMP method. In the Adaptation Phase, the model is adapted by leveraging multiple prompt templates alongside various intermediate feature layers, as well as the global feature. During the Evaluation Phase, the model computes weights based on the entropy of the intermediate features to perform a weighted averaging. These averaged features, combined with the different templates, are then used to generate the final segmentation map.
  • Figure 3: Mean and standard deviation of layer-wise confidence weights of MLMP across datasets. The fusion mechanism adaptively emphasizes more reliable layers based on input conditions.
  • Figure 4: mIoU performance of our method for different numbers of templates.
  • Figure 5: mIoU performance for prompt-integration strategies (Text, Params, Loss) on clean and corrupted data.
  • ...and 11 more figures
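The entropy-based layer weighting mentioned in the captions of Figures 2 and 3 can be sketched as follows. The exact weighting rule is not given in this listing, so the sketch assumes a softmax over negative entropies, which gives lower-entropy (more confident) layers larger weights. The function name `fuse_layers` and the `temperature` parameter are hypothetical.

```python
import math

def softmax(values):
    # Numerically stable softmax over a list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_feats, layer_entropies, temperature=1.0):
    """Weighted average of per-layer feature vectors.

    layer_feats: list of L feature vectors, one per vision-encoder layer.
    layer_entropies: list of L prediction entropies, one per layer.
    Assumption of this sketch: weights = softmax(-entropy / temperature).
    """
    weights = softmax([-h / temperature for h in layer_entropies])
    dim = len(layer_feats[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, layer_feats))
        for d in range(dim)
    ]
    return weights, fused

# Toy example: the first layer is more confident (lower entropy).
weights, fused = fuse_layers(
    layer_feats=[[1.0, 0.0], [0.0, 1.0]],
    layer_entropies=[0.1, 2.0],
)
print(weights, fused)
```

With this rule the weights always sum to one, and layers with equal entropy receive equal weight.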

Theorems & Definitions (1)

  • proof