Disentangling Masked Autoencoders for Unsupervised Domain Generalization

An Zhang; Han Wang; Xiang Wang; Tat-Seng Chua

TL;DR

This work tackles unsupervised domain generalization (UDG): learning domain-invariant semantic representations without class labels. It introduces DisMAE, an asymmetric dual-branch Masked AutoEncoder that disentangles semantics from variations under the model $x = g(\mathbf{s}_x, \mathbf{v}_x)$, using a semantic encoder, a lightweight variation encoder, a transformer decoder, and a domain-label-enhanced invariance classifier. Training combines a reconstruction loss with an adaptive contrastive loss and, for DG, cross-entropy supervision, which promotes domain-invariant semantics while enabling variation-driven augmentation. Empirically, DisMAE achieves state-of-the-art or competitive results on DomainNet (UDG) and on VLCS and PACS (DG), supporting the disentanglement and invariance principles as a basis for out-of-distribution generalization. The approach offers a principled way to leverage large-scale unlabeled multi-domain data for generalization to unseen domains.
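As a rough illustration of how a reconstruction loss and an adaptive contrastive loss might be combined, the NumPy sketch below uses a masked-patch mean squared error and an InfoNCE-style loss whose negatives are rescaled by per-pair weights. This is not the paper's formulation: the function names, the weight matrix `weights`, the trade-off `lam`, and the `temperature` value are all assumptions made for illustration. In DisMAE the adaptive weights are generated by the domain classifier; here they are simply passed in.

```python
import numpy as np

def reconstruction_loss(pred, target, mask):
    """MAE-style loss: mean squared error averaged over masked patches only.

    pred, target: (batch, patches, dim); mask: (batch, patches), 1 = masked patch.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / mask.sum())

def weighted_info_nce(anchor, positive, weights, temperature=0.2):
    """InfoNCE where negative pair (i, j) is scaled by weights[i, j].

    Assumption: weights lie in [0, 1]; how they are derived from a domain
    classifier is not specified here.
    """
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature            # diagonal entries are positives
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = np.array(weights, dtype=float)        # copy so the input is untouched
    np.fill_diagonal(w, 1.0)                  # positives always count fully
    denom = (exp * w).sum(axis=1)
    return float(-np.log(np.diag(exp) / denom).mean())

def total_loss(rec, con, lam=1.0):
    """Hypothetical combined objective: reconstruction + lam * contrastive."""
    return rec + lam * con
```

With all weights set to 1 this reduces to ordinary InfoNCE; lowering a weight reduces the penalty contributed by that negative pair, which is one simple way a per-pair adaptive weight could act on the loss.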

Abstract

Domain Generalization (DG), designed to enhance out-of-distribution (OOD) generalization, centers on learning invariance against domain shifts using sufficient supervision signals. Yet the scarcity of such labeled data has led to the rise of unsupervised domain generalization (UDG), a more important yet challenging task in which models are trained across diverse domains in an unsupervised manner and then tested on unseen domains. UDG is rapidly gaining attention but is still far from well studied. To close this research gap, we propose a novel learning framework designed for UDG, termed the Disentangled Masked AutoEncoder (DisMAE), which aims to discover disentangled representations that faithfully reveal the intrinsic features and superficial variations without access to class labels. At its core is the distillation of domain-invariant semantic features, which cannot be distinguished by a domain classifier, while filtering out the domain-specific variations (for example, color schemes and texture patterns) that are unstable and redundant. Notably, DisMAE co-trains an asymmetric dual-branch architecture with semantic and lightweight variation encoders, offering dynamic data manipulation and representation-level augmentation capabilities. Extensive experiments on four benchmark datasets (DomainNet, PACS, VLCS, and Colored MNIST) with both DG and UDG tasks demonstrate that DisMAE achieves competitive OOD performance compared with state-of-the-art DG and UDG baselines, which sheds light on a promising research direction: improving generalization ability with large-scale unlabeled data.

Paper Structure (23 sections, 9 equations, 7 figures, 11 tables, 1 algorithm)

Introduction
Preliminary
Methodology
DisMAE
Implementation of Two Principles
Experiments
Overall Performance Comparison (RQ1)
Evaluations on UDG
Evaluations on DG
Discussion about Two Principles (RQ2)
Study on DisMAE (RQ3)
Related Work
Conclusion
Algorithm
Discussion About Differences
...and 8 more sections

Figures (7)

Figure 1: Illustrative reconstructed images generated by DisMAE. Rows 1 and 3 present inputs retaining either semantic or variation attributes, sourced from ColorMNIST and three distinct domains of DomainNet for a comprehensive comparison. Rows 2 and 4 display images crafted by augmenting either original or alternate image variation representations in feature space. The evident integration of colors, textures, and backgrounds in these reconstructed images highlights the disentangling capability of DisMAE. More examples can be found in the appendix.
Figure 2: t-SNE visualization of MAE representations on DomainNet. Points are color-coded based on their domain labels. MAE demonstrates inconsistent feature spaces across different domains, stemming from capturing entangled features of both semantics and variations.
Figure 3: Framework of our proposed DisMAE. DisMAE develops an asymmetric dual-branch architecture, with the upper main branch distilling the domain-invariant semantics, along with the lightweight branch extracting the domain-specific variations. A domain classifier is integrated to quantify the degree of domain-specific information embedded within the semantic encoder, thereby monitoring its acquisition of domain-invariant knowledge. Note that the domain classifier is updated while freezing backbones and is only used for generating adaptive weights.
Figure 4: t-SNE visualization of DisMAE semantic and variation representations on DomainNet, with points color-coded by their domain labels. Semantic panel: DisMAE semantic representations are mixed and interspersed regardless of domain labels, showcasing the ability of the semantic encoder to discern domain-invariant features. Variation panel: samples from different domains reside on distinct manifolds, highlighting the variation encoder's capability to extract domain-specific features.
Figure 5: Study of invariance principle. Illustration of prediction scores $p(x_i \in \mathcal{I}_{\text{Sketch}}|\mathbf{s}_i^0)$ estimated by domain classifier throughout training. The dashed line represents the Oracle score, illustrating a random guess w.r.t. domain category.
...and 2 more figures
