Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation

Reza Qorbani, Gianluca Villani, Theodoros Panagiotakopoulos, Marc Botet Colomer, Linus Härenstam-Nielsen, Mattia Segu, Pier Luigi Dovesi, Jussi Karlgren, Daniel Cremers, Federico Tombari, Matteo Poggi

TL;DR

SemLA tackles domain shift in open-vocabulary semantic segmentation by building a library of LoRA adapters indexed in CLIP space and retrieving a targeted subset at test time. By composing a fused, ad-hoc model from the most relevant adapters, SemLA adapts without training and without access to source data, and it stays explainable because each adapter's contribution can be tracked. The approach is validated on a 20-domain benchmark derived from 10 datasets, showing consistent gains over zero-shot and naive merging methods and competitive results relative to oracle adapters, with backbone-agnostic applicability demonstrated on CAT-Seg and SED. This work enables scalable, privacy-preserving domain adaptation for open-vocabulary segmentation and highlights CLIP as an effective domain navigator for adapter selection and fusion.

Abstract

Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world applications. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. SemLA leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on a 20-domain benchmark built over 10 standard datasets demonstrate SemLA's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
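The retrieve-and-fuse idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Euclidean distance, the top-k cutoff, the softmax-over-negative-distance weighting with a temperature, and the representation of each adapter as a flat list of parameters are all assumptions made for illustration. The abstract only states that adapters are indexed with CLIP embeddings and merged according to proximity to the target domain.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_and_weight(query, library, k=2, temperature=1.0):
    """Pick the k adapters whose index embedding is closest to the query.

    `library` maps a hypothetical adapter name to its index embedding.
    Weights are a softmax over negative distances (an assumption), so
    closer adapters contribute more and the weights sum to 1.
    """
    nearest = sorted((euclidean(query, emb), name) for name, emb in library.items())[:k]
    logits = [-dist / temperature for dist, _ in nearest]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return {name: e / total for (_, name), e in zip(nearest, exps)}

def fuse(adapters, weights):
    """Weighted sum of the selected adapters' flattened parameters (toy fusion rule)."""
    dim = len(next(iter(adapters.values())))
    fused = [0.0] * dim
    for name, w in weights.items():
        for i, value in enumerate(adapters[name]):
            fused[i] += w * value
    return fused

# Toy data: 2-D stand-ins for CLIP embeddings and adapter parameters.
library = {"fog": [1.0, 0.0], "night": [0.0, 1.0], "indoor": [-1.0, -1.0]}
adapters = {"fog": [2.0, 0.0], "night": [0.0, 4.0], "indoor": [9.0, 9.0]}
query = [0.5, 0.5]  # embedding of the test image

weights = retrieve_and_weight(query, library, k=2)
fused = fuse(adapters, weights)
```

In this toy setup the query is equidistant from the "fog" and "night" entries, so both receive the same weight and the distant "indoor" adapter is not retrieved. The returned weight dictionary is also what one would inspect to report per-adapter contributions.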

Paper Structure

This paper contains 45 sections, 13 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Overview of SemLA. At test time, SemLA uses CLIP as a domain navigator to retrieve and fuse relevant adapters, producing a LoRA tailored to the target domain.
  • Figure 2: Construction and Expansion of the LoRA Adapter Library. Each LoRA adapter is created by fine-tuning on a specific dataset and subsequently added to the library. The library index for each adapter is represented by the CLIP centroid of its training data.
  • Figure 3: Adapter contribution heatmap. Rows represent individual test datasets, and columns correspond to specific LoRA adapters. The color intensity of each cell indicates the frequency and weight of selection (with values below 0.1 omitted). The diagonal is empty due to the leave-one-out strategy.
  • Figure 4: Adapter weight distribution for MUSES-Fog-Night. The fused adapter combines knowledge from foggy and night-time conditions by weighting relevant adapters. Adapters with a weight lower than 5% are not included.
  • Figure 5: CLIP-guidance effectiveness for LoRA selection on ACDC. Each point represents an image-adapter combination, with adapters separated by color. x-axis: distance from the corresponding image embedding to the adapter embedding. y-axis: improvement in mIoU when using the adapter relative to the zero-shot base network. The linear regression curve (dashed line) indicates that embedding similarity correlates with higher mIoU. We show the full adapter library, excluding those trained on ACDC.
  • ...and 6 more figures
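
The Figure 2 caption says each adapter is indexed by the CLIP centroid of its training data, and Figures 3 and 5 mention excluding adapters trained on the test dataset (leave-one-out). A minimal sketch of that bookkeeping follows. The dataset names, the 2-D toy embeddings, and the helper names are hypothetical, and the exclusion rule is an interpretation of the captions rather than a confirmed detail of the method.

```python
def centroid(embeddings):
    """Mean of a list of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def build_index(domain_embeddings):
    """Map each adapter name to the centroid of its training-image embeddings."""
    return {name: centroid(embs) for name, embs in domain_embeddings.items()}

def leave_one_out(index, target):
    """Drop the adapter trained on the target dataset before retrieval (assumed protocol)."""
    return {name: emb for name, emb in index.items() if name != target}

# Toy stand-ins for per-image CLIP embeddings of two hypothetical training sets.
domain_embeddings = {
    "acdc_fog": [[0.0, 1.0], [1.0, 0.0]],
    "acdc_night": [[2.0, 2.0], [4.0, 4.0]],
}

index = build_index(domain_embeddings)
candidates = leave_one_out(index, "acdc_fog")
```

Adding a new adapter to such a library only requires computing one more centroid, which matches the caption's description of the library being expanded one dataset at a time.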