Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

Mehrdad Noori; David Osowiechi; Gustavo Adolfo Vargas Hakim; Ali Bahri; Moslem Yazdanpanah; Sahar Dastani; Farzad Beizaee; Ismail Ben Ayed; Christian Desrosiers

TL;DR

This work addresses the lack of test-time adaptation (TTA) methods for open-vocabulary semantic segmentation (OVSS) by introducing MLMP, a plug-and-play framework that improves segmentation based on vision-language models (VLMs) through multi-level, uncertainty-aware feature fusion and multi-prompt entropy minimization. MLMP updates only the vision encoder's LayerNorm parameters and uses multiple text templates, so it adapts to distribution shifts without labels or source data. A theoretical proposition shows that averaging gradients across templates yields an unbiased descent direction whose variance decreases as $1/T$, where $T$ is the number of templates, which supports the loss-level ensemble approach. The authors also provide an OVSS-TTA benchmark with 87 test scenarios across nine datasets. On this benchmark, MLMP consistently outperforms classification-based TTA baselines and other adaptation methods, including in the single-sample setting and under real and rendered domain shifts.
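
As a hedged illustration of why a $1/T$ rate is plausible, suppose (an assumption made here, not a condition taken from the paper) that the per-template gradients $g_1, \dots, g_T$ for $T$ templates are uncorrelated, each with mean $g$ and covariance $\Sigma$. Their average $\bar{g}$ then satisfies:

$$
\bar{g} = \frac{1}{T}\sum_{t=1}^{T} g_t, \qquad \mathbb{E}[\bar{g}] = g, \qquad \operatorname{Cov}(\bar{g}) = \frac{1}{T^2}\sum_{t=1}^{T}\operatorname{Cov}(g_t) = \frac{\Sigma}{T}.
$$

This is only the textbook case of averaging uncorrelated estimates; the paper's proposition may be stated under different conditions.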

Abstract

Recently, test-time adaptation (TTA) has attracted wide interest in the context of vision-language models (VLMs) for image classification. However, to the best of our knowledge, the problem is completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS token and local pixel-wise levels. Our approach can be used as a plug-and-play module for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, nine segmentation datasets, 15 common synthetic corruptions, and additional real and rendered domain shifts, with a total of 87 distinct test scenarios, establishing a standardized testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines. Code and data are available at https://github.com/dosowiechi/MLMP.
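
The multi-prompt entropy objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `entropy`, `multi_prompt_entropy`) and the nested-list layout of the logits are hypothetical, each "location" stands in for either the global CLS token or one pixel, and the multi-level feature fusion and the LayerNorm update are omitted entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def multi_prompt_entropy(logits_per_template):
    """Average prediction entropy over locations, then over templates.

    logits_per_template[t][i][k] is the (hypothetical) image-text similarity
    logit for class k at location i when class names are embedded with
    text template t.
    """
    template_losses = []
    for logits in logits_per_template:
        per_location = [entropy(softmax(row)) for row in logits]
        template_losses.append(sum(per_location) / len(per_location))
    # Loss-level ensemble: average the per-template losses rather than
    # averaging the text embeddings before computing a single loss.
    return sum(template_losses) / len(template_losses)


# Toy input: 2 templates, 2 locations, 3 classes.
uniform = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]] for _ in range(2)]
confident = [[[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]] for _ in range(2)]
loss_uniform = multi_prompt_entropy(uniform)
loss_confident = multi_prompt_entropy(confident)
```

In this toy input, uniform logits give the maximum entropy for three classes, while confident logits give a much smaller loss; minimizing such a loss at test time pushes predictions toward confident assignments without using labels.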
