Training-Free Label Space Alignment for Universal Domain Adaptation

Dujin Lee, Sojung An, Jungmyung Wi, Kuniaki Saito, Donghyun Kim

TL;DR

A training-free label-space alignment method for UniDA that aligns label spaces instead of visual spaces by filtering and refining noisy labels between the domains, and constructs a universal classifier that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts.

Abstract

Universal domain adaptation (UniDA) transfers knowledge from a labeled source domain to an unlabeled target domain, where label spaces may differ and the target domain may contain private classes. Previous UniDA methods primarily focused on visual space alignment but often struggled with visual ambiguities due to content differences, which limited their robustness and generalizability. To overcome this, we introduce a novel approach that leverages the strong zero-shot capabilities of recent vision-language foundation models (VLMs) like CLIP, concentrating solely on label space alignment to enhance adaptation stability. CLIP can generate task-specific classifiers based only on label names. However, adapting CLIP to UniDA is challenging because the label space is not fully known in advance. In this study, we first utilize generative vision-language models to identify unknown categories in the target domain. Noise and semantic ambiguities in the discovered labels, such as labels similar to source labels (e.g., synonyms, hypernyms, hyponyms), complicate label alignment. To address this, we propose a training-free label-space alignment method for UniDA (TLSA). Our method aligns label spaces instead of visual spaces by filtering and refining noisy labels between the domains. We then construct a universal classifier that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts. The results show that the proposed method considerably outperforms existing UniDA techniques across key DomainBed benchmarks, delivering an average improvement of +7.9% in H-score and +6.1% in H$^3$-score. Furthermore, incorporating self-training adds a further +1.6% to both the H- and H$^3$-scores.
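The abstract reports gains in H-score and H$^3$-score without defining them. The sketch below shows how these metrics are commonly computed in the UniDA literature, under the assumption that the paper follows the usual definitions: the H-score as the harmonic mean of accuracy on shared (known) classes and accuracy on target-private (unknown) samples, and the H$^3$-score as the harmonic mean of those two accuracies and a clustering quality term (typically NMI on target-private samples). The function and argument names are illustrative, not taken from the paper.

```python
def harmonic_mean(values):
    """Harmonic mean of positive numbers; 0.0 if any value is zero or negative."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def h_score(acc_known, acc_unknown):
    """Assumed definition: harmonic mean of known-class and unknown-class accuracy."""
    return harmonic_mean([acc_known, acc_unknown])


def h3_score(acc_known, acc_unknown, nmi):
    """Assumed definition: harmonic mean of both accuracies and a clustering score (NMI)."""
    return harmonic_mean([acc_known, acc_unknown, nmi])


if __name__ == "__main__":
    # A model that is strong on shared classes but weak at rejecting unknowns
    # is penalized: the harmonic mean sits closer to the weaker of the two.
    print(round(h_score(0.9, 0.5), 4))
    print(round(h3_score(0.9, 0.5, 0.7), 4))
```

Because the harmonic mean is dominated by its smallest term, a method cannot score well by excelling on shared classes alone; it also has to detect (and, for H$^3$, cluster) target-private samples.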

Paper Structure

This paper contains 28 sections, 9 equations, 14 figures, 26 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparison between CLIP Prompt, a zero-shot classifier enhanced with corresponding label text prompts, and previous state-of-the-art (SOTA) methods. Simply adding target-private classes or removing source-private classes leads to substantial gains in H-score, achieved entirely without additional training.
  • Figure 2: Examples of semantic ambiguity in generative VLM outputs: (a) noisy response, (b) hyponym prediction, and (c) synonym prediction.
  • Figure 3: Comparison between prior UniDA methods and our TLSA+ST. Left: Visual space alignment approaches (li2021domain, chang2022unified, lai2023memory) rely on cluster matching, which becomes ambiguous under domain shift. Right: Our TLSA+ST discovers labels from target images and aligns them with source labels via adaptive thresholding based on the embedding-space distances between source labels. Since TLSA+ST uses pretrained CLIP's joint embedding space, it is robust against domain shift.
  • Figure 4: Overview of TLSA. We first gather candidate target-private labels from a generative VLM, then refine them to train a universal classifier. (a) Synonym label alignment (Sec. \ref{sec:step1}) removes source label synonyms via WordNet; (b) Semantic label alignment (Sec. \ref{sec:instance}) re-scores image-label pairs in the embedding space and decides shared vs. target-private; (c) Frequency-based noisy candidate filtering (Sec. \ref{sec:refine}) prunes noisy low-support candidates using frequency banks; and (d) Self-training (Sec. \ref{sec:self-training}) applies a teacher-student scheme on class-balanced top-k confident samples.
  • Figure 5: Illustration of semantic label alignment. In this step, we establish the relationship between discovered labels and source label predictions within the prediction set $\mathcal{C}$, as defined by Eq. \ref{eq:thres4}, for a given image $x_i$. (a) When source labels are present in $\mathcal{C}$, the input image is classified as belonging to one of the source classes. (b) When no source labels exist in $\mathcal{C}$, the input image is categorized as a target-private class. Subsequently, we maintain a frequency count of classes in the frequency bank to enable noisy candidate filtering.
  • ...and 9 more figures
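
The decision rule in Figure 5 and the frequency-based filtering in Figure 4(c) can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the prediction set $\mathcal{C}$ is defined by an equation not reproduced here, so the code assumes a simple stand-in rule (keep every label whose score is within a fixed `margin` of the top score). The scores are toy numbers standing in for CLIP image-label similarities, and `margin`, `min_support`, and all function names are hypothetical.

```python
from collections import Counter


def prediction_set(scores, margin):
    """Assumed stand-in for C: labels scoring within `margin` of the best label."""
    best = max(scores.values())
    return {label for label, s in scores.items() if s >= best - margin}


def align_image(scores, source_labels, margin=0.05):
    """Classify one image as shared or target-private from its label scores."""
    candidates = prediction_set(scores, margin)
    shared = candidates & source_labels
    if shared:
        # Case (a): a source label is in C, so assign the best-scoring source class.
        return "shared", max(shared, key=scores.get)
    # Case (b): no source label in C, so assign the best discovered label.
    return "private", max(candidates, key=scores.get)


def filter_candidates(per_image_scores, source_labels, margin=0.05, min_support=2):
    """Count target-private assignments in a frequency bank and drop rare labels."""
    bank = Counter()
    for scores in per_image_scores:
        kind, label = align_image(scores, source_labels, margin)
        if kind == "private":
            bank[label] += 1
    kept = {label for label, count in bank.items() if count >= min_support}
    return kept, bank


# Toy example: "puppy" is a hyponym of the source label "dog"; "blurry" is noise.
SOURCE = {"dog", "cat"}
IMAGES = [
    {"dog": 0.31, "cat": 0.20, "puppy": 0.33, "zebra": 0.05},
    {"dog": 0.10, "cat": 0.12, "zebra": 0.40, "horse": 0.30},
    {"dog": 0.08, "cat": 0.09, "zebra": 0.38, "stripes": 0.20},
    {"dog": 0.10, "cat": 0.11, "blurry": 0.30, "zebra": 0.12},
]

KEPT, BANK = filter_candidates(IMAGES, SOURCE)
```

In this toy run the first image is assigned to the source class "dog" even though "puppy" scores highest, because "dog" falls inside the prediction set. The two zebra images give "zebra" enough support to survive filtering, while the single "blurry" assignment is pruned as a low-support candidate.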