Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation

Ulindu De Silva, Didula Samaraweera, Sasini Wanigathunga, Kavindu Kariyawasam, Kanchana Ranasinghe, Muzammal Naseer, Ranga Rodrigo

TL;DR

Seg-TTO addresses the domain-shift problem in zero-shot open-vocabulary semantic segmentation (OVSS) by introducing test-time optimization tailored to dense, multi-concept predictions. At inference, it jointly adapts visual and textual representations using a segmentation-specific self-supervised objective that combines entropy minimization with pseudo-label cross-entropy, and it uses PCGrad to resolve conflicts between the two gradients. The framework augments category text with LLM-generated attributes and performs locality-preserving visual aggregation, enabling per-concept separation within a single image. Empirically, Seg-TTO delivers state-of-the-art gains across 22 datasets in the MESS benchmark, including substantial improvements over strong baselines on specialized domains, while remaining plug-and-play with existing OVSS models. The approach offers a practical path to improved domain generalization in OVSS; the authors plan to release code and models publicly.
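To make the objective described above concrete, the following is a minimal sketch, not the authors' implementation. It computes the gradient of an entropy term and of a pseudo-label cross-entropy term with respect to the class logits of a single pixel, then merges them with the standard PCGrad rule: when two gradients conflict (negative dot product), each is projected onto the normal plane of the other before summing. All function names are hypothetical, the pseudo-label is simply passed in as an argument, and the summary does not say how the two terms are weighted, so they are summed with equal weight here as an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_grad(logits):
    """Gradient of the Shannon entropy H(p) with respect to the logits.

    With p = softmax(z): dH/dz_k = -p_k * (log p_k + H).
    """
    p = softmax(logits)
    h = -sum(pk * math.log(pk) for pk in p)
    return [-pk * (math.log(pk) + h) for pk in p]


def pseudo_label_ce_grad(logits, pseudo_label):
    """Gradient of cross-entropy against a hard pseudo-label: p - onehot."""
    p = softmax(logits)
    return [pk - (1.0 if k == pseudo_label else 0.0) for k, pk in enumerate(p)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g, other):
    """PCGrad step: if g conflicts with other, remove g's component along other."""
    d = dot(g, other)
    if d >= 0:
        return list(g)  # no conflict, keep g unchanged
    scale = d / dot(other, other)
    return [gi - scale * oi for gi, oi in zip(g, other)]


def pcgrad_combine(g_a, g_b):
    """Project each gradient against the ORIGINAL other gradient, then sum."""
    a = project_conflict(g_a, g_b)
    b = project_conflict(g_b, g_a)
    return [x + y for x, y in zip(a, b)]


# Hypothetical pixel with three candidate classes and pseudo-label 0.
logits = [2.0, 0.5, -1.0]
update = pcgrad_combine(entropy_grad(logits), pseudo_label_ce_grad(logits, 0))

# Hand-checkable conflict case: the dot product of the inputs is -1.
print(pcgrad_combine([1.0, 0.0], [-1.0, 1.0]))  # [0.5, 1.5]
```

In a real OVSS model these gradients would be taken with respect to the adapted parameters (for example, learnable text embeddings) via automatic differentiation rather than with respect to raw logits; the projection step itself is unchanged.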

Abstract

We present Seg-TTO, a novel framework for zero-shot, open-vocabulary semantic segmentation (OVSS), designed to excel in specialized domain tasks. While current open-vocabulary approaches show impressive performance on standard segmentation benchmarks under zero-shot settings, they fall short of supervised counterparts on highly domain-specific datasets. We focus on segmentation-specific test-time optimization to address this gap. Segmentation requires an understanding of multiple concepts within a single image while retaining the locality and spatial structure of representations. We propose a novel self-supervised objective that adheres to these requirements and use it to align the model parameters with input images at test time. In the textual modality, we learn multiple embeddings for each category to capture diverse concepts within an image, while in the visual modality, we calculate pixel-level losses followed by embedding aggregation operations designed to preserve spatial structure. Our resulting framework, termed Seg-TTO, is a plug-and-play module. We integrate Seg-TTO with three state-of-the-art OVSS approaches and evaluate across 22 challenging OVSS tasks covering a range of specialized domains. Seg-TTO demonstrates clear performance improvements (up to a 27% mIoU increase on some datasets), establishing a new state of the art. Our code and models will be released publicly.

Paper Structure (21 sections, 11 equations, 8 figures, 15 tables)

This paper contains 21 sections, 11 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Our Seg-TTO (row 4) improves on the state-of-the-art baseline CAT-Seg [catseg] (row 3) by segmenting missed regions as well as correcting incorrectly assigned labels. We attribute these improvements to the visual & textual augmentations and the novel segmentation-specific test-time optimization used in our Seg-TTO.
  • Figure 2: Overview of Seg-TTO. (a) Our image embedding updating framework consists of filtering out confident image patches, followed by updating the original image embedding. (b) Our test-time optimization framework consists of updating prompts based on the most confident crops using backpropagation, followed by the addition of attributes for generalization.
  • Figure 3: Qualitative Evaluation: Our proposed Seg-TTO outperforms state-of-the-art CAT-Seg [catseg] across diverse specialized-domain OVSS tasks, as illustrated. We highlight the highly technical nature of some specialized domain category names (e.g., mediastinum under X-Ray). Our category attributes allow models to better understand such objects.
  • Figure 4: Illustration of improved attribute generation for FoodSeg103 [wu2021large] dataset images. (a) The original image. (b) Ground truth segmentation map. (c) Baseline attribute generation method [menon2022visual], which included general and irrelevant features such as "feathered body" and "wings" for "chicken duck." (d) Our approach with dataset-specific descriptions (e.g., "photo of food"), resulting in more relevant attributes like "roasted or grilled texture" and "golden brown or cooked color."
  • Figure 5: Qualitative Evaluation: We illustrate both success and failure cases of our proposed Seg-TTO. We highlight how Seg-TTO is still better than the baseline even in failure cases.
  • ...and 3 more figures