WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization

Jiawei Ma, Yulei Niu, Shiyuan Huang, Guangxing Han, Shih-Fu Chang

TL;DR

WIDIn tackles single-source domain generalization by extracting domain-invariant visual features through fine-grained image-language alignment. For each image, the method estimates a language embedding $\mathbf{t}_x$ from a worded image representation and uses its difference from the class-name embedding $\mathbf{t}_c$ to identify and subtract domain-specific information from the raw visual embedding $\mathbf{x}$, yielding $\mathbf{x}_e = \mathbf{x} - k(\mathbf{t}_x - \mathbf{t}_c)$, which feeds a small disentangler and classifier. It supports both CLIP-style joint vision-language spaces and uni-modal encoders by training in two stages and discarding the language model at test time to keep inference lightweight. Empirically, WIDIn generalizes well on three benchmarks (CUB-Painting, DomainNetMini, Office-Home), and extensive ablations confirm the importance of image wording and instance-level alignment. The approach offers a practical route to distilling language-model cues into robust, domain-agnostic representations, with potential extensions to long-tail recognition and downstream tasks such as detection and generation.
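The subtraction in the formula above can be illustrated with a minimal sketch. It assumes that all three embeddings are plain vectors of the same dimension in one shared space and that $k$ is a scalar weight; the function name, the default `k=1.0`, and the toy numbers are hypothetical and chosen only for illustration.

```python
def estimate_domain_invariant(x, t_x, t_c, k=1.0):
    """Return x_e = x - k * (t_x - t_c), computed element-wise.

    x   : raw visual embedding of one image
    t_x : language embedding aligned with that image (the "worded" image)
    t_c : language embedding of the class name
    k   : weight on the difference (assumed scalar; 1.0 is an arbitrary default)
    """
    if not (len(x) == len(t_x) == len(t_c)):
        raise ValueError("all embeddings must share one dimension")
    return [xi - k * (txi - tci) for xi, txi, tci in zip(x, t_x, t_c)]


# Toy 3-d example with made-up numbers: the last coordinate plays the role
# of a domain-specific direction that t_x captures and t_c does not.
x = [2.0, 1.0, 4.0]
t_x = [1.0, 1.0, 3.0]
t_c = [1.0, 1.0, 0.0]
x_e = estimate_domain_invariant(x, t_x, t_c)
print(x_e)
```

In this toy case only the third coordinate differs between `t_x` and `t_c`, so only that coordinate of `x` is changed; setting `k=0.0` returns `x` unchanged.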

Abstract

Language has been useful in extending the vision encoder to data from diverse distributions without empirical discovery in training domains. However, as the image description is mostly at a coarse-grained level and ignores visual details, the resulting embeddings are still ineffective in overcoming the complexity of domains at inference time. We present WIDIn (Wording Images for Domain-Invariant representation), a self-supervision framework that disentangles discriminative visual representations by leveraging only data from a single domain, without any test prior. Specifically, for each image, we first estimate the language embedding with fine-grained alignment, which can then be used to adaptively identify and remove the domain-specific counterpart from the raw visual embedding. WIDIn can be applied both to pretrained vision-language models such as CLIP and to separately trained uni-modal models such as MoCo and BERT. Experimental studies on three domain generalization datasets demonstrate the effectiveness of our approach.

Paper Structure

This paper contains 28 sections, 3 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Given a vision-language embedding space, an image and its description can be aligned. (a) However, the alignment between an image and its class description is coarse-grained: the same text can be aligned with a whole set of images. (b) For each image, we propose to find the language embedding with fine-grained alignment. Its difference from the language embedding of the class name indicates domain-specific information in the instance (e.g., twilight & air background) and facilitates domain-invariant visual representation learning.
  • Figure 2: (a) For each image, we find a language embedding $\mathbf{t}_x$ that is both aligned with its raw visual embedding $\mathbf{x}$ at a fine-grained level and close to the language embedding of the class name $\mathbf{t}_c$, under alignment supervision at the instance level ($\mathcal{L}_{ia}$) and the class level ($\mathcal{L}_{ca}$). The domain-invariant visual embedding $\mathbf{x}_e$ can then be estimated by subtracting the difference $\mathbf{t}_x-\mathbf{t}_c$ from $\mathbf{x}$ and used as ground truth to supervise the feature disentangler ($\mathcal{L}_{feat}$). (b) For the network architecture, we use the worded image token representation <V> for feature extraction. At training time, the image and language encoders are kept fixed. At test time, only the modules in the dashed box are used.
  • Figure 3: Visualization of the embeddings of two examples; embedding types are color-coded.
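
The supervision signals named in the Figure 2 caption can be sketched as follows. The caption does not give the loss formulas, so every form below is an assumption made for illustration: cosine distance stands in for both alignment terms ($\mathcal{L}_{ia}$ and $\mathcal{L}_{ca}$), mean squared error stands in for $\mathcal{L}_{feat}$, and all function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return 1.0 - dot / (norm_a * norm_b)


def wording_losses(t_x, x, t_c):
    """Assumed stand-ins for the two alignment terms used to find t_x."""
    loss_ia = cosine_distance(t_x, x)    # instance level: t_x should match the image
    loss_ca = cosine_distance(t_x, t_c)  # class level: t_x should stay near the class name
    return loss_ia, loss_ca


def feature_loss(predicted, x_e):
    """Assumed stand-in for L_feat: mean squared error against the target x_e."""
    return sum((p - t) ** 2 for p, t in zip(predicted, x_e)) / len(x_e)


# Toy 2-d example with made-up numbers.
loss_ia, loss_ca = wording_losses([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
loss_feat = feature_loss([1.0, 2.0], [1.0, 0.0])
print(loss_ia, loss_ca, loss_feat)
```

In this sketch the first two terms would drive the search for $\mathbf{t}_x$ while the encoders stay fixed, and the third would train the feature disentangler against the estimated $\mathbf{x}_e$; how the terms are actually defined and weighted is not stated in the captions.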