Disentangling Masked Autoencoders for Unsupervised Domain Generalization
An Zhang, Han Wang, Xiang Wang, Tat-Seng Chua
TL;DR
This work tackles unsupervised domain generalization by learning domain-invariant semantic representations without class labels. It introduces DisMAE, an asymmetric dual-branch Masked AutoEncoder that disentangles semantics from variations under the model $x = g(\mathbf{s}_x, \mathbf{v}_x)$, using a semantic encoder, a lightweight variation encoder, a transformer decoder, and a domain-label–enhanced invariance classifier. Training combines a reconstruction loss with an adaptive contrastive loss and, in the DG setting, cross-entropy supervision, which promotes domain-invariant semantics while enabling variation-driven augmentation. Empirically, DisMAE achieves state-of-the-art or competitive results on DomainNet (UDG) and on VLCS and PACS (DG), supporting the disentanglement and invariance principles as a route to robust out-of-distribution generalization. The approach offers a principled way to leverage large-scale unlabeled multi-domain data for generalization to unseen domains.
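One way to read the training recipe above is as a weighted sum of the named terms. The weights $\lambda_1$ and $\lambda_2$ and the indicator notation are assumptions introduced here for illustration; the summary states only which losses are combined, not how they are weighted.

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_1\, \mathcal{L}_{\text{con}} + \mathbb{1}[\text{DG}]\; \lambda_2\, \mathcal{L}_{\text{ce}}$$

Here $\mathcal{L}_{\text{rec}}$ is the reconstruction loss, $\mathcal{L}_{\text{con}}$ the adaptive contrastive loss, and $\mathcal{L}_{\text{ce}}$ the cross-entropy term, which is active only when class labels are available (the DG setting).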
Abstract
Domain Generalization (DG), designed to enhance out-of-distribution (OOD) generalization, centers on learning invariance against domain shifts using sufficient supervision signals. Yet the scarcity of such labeled data has led to the rise of unsupervised domain generalization (UDG), a more important yet more challenging task in which models are trained across diverse domains in an unsupervised manner and then tested on unseen domains. UDG is quickly gaining attention but remains far from well studied. To close this research gap, we propose a novel learning framework for UDG, termed the Disentangled Masked AutoEncoder (DisMAE), which aims to discover disentangled representations that faithfully reveal the intrinsic features and superficial variations without access to class labels. At its core is the distillation of domain-invariant semantic features, which cannot be distinguished by a domain classifier, while filtering out domain-specific variations (for example, color schemes and texture patterns) that are unstable and redundant. Notably, DisMAE co-trains an asymmetric dual-branch architecture with a semantic encoder and a lightweight variation encoder, offering dynamic data manipulation and representation-level augmentation capabilities. Extensive experiments on four benchmark datasets (i.e., DomainNet, PACS, VLCS, Colored MNIST) on both DG and UDG tasks demonstrate that DisMAE achieves competitive OOD performance compared with state-of-the-art DG and UDG baselines, shedding light on a promising line of research on improving generalization ability with large-scale unlabeled data.
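The idea of recombining semantic and variation codes can be illustrated with a deliberately tiny sketch. This is not DisMAE's architecture: the "encoders" and "decoder" below are hypothetical hand-written functions (a mean-centred pattern as the semantic code, a single scalar offset as the variation code), chosen only so that the decomposition $x = g(\mathbf{s}_x, \mathbf{v}_x)$ and a variation swap can be checked by hand. All function names and sample values are assumptions made for this example.

```python
from statistics import fmean


def encode_semantics(x):
    """Toy semantic code s_x: the pattern with its mean offset removed."""
    m = fmean(x)
    return [xi - m for xi in x]


def encode_variation(x):
    """Toy variation code v_x: one scalar offset (the 'lightweight' branch)."""
    return fmean(x)


def decode(s, v):
    """Toy decoder g(s, v): add the offset back onto the pattern."""
    return [si + v for si in s]


def swap_variation(x_a, x_b):
    """Representation-level augmentation: semantics of x_a, variation of x_b."""
    return decode(encode_semantics(x_a), encode_variation(x_b))


# Hypothetical samples from two domains that differ only by a global offset.
x_photo = [1.0, 3.0, 2.0]
x_sketch = [11.0, 10.0, 12.0]

# Reconstruction: decoding a sample's own codes recovers the sample.
recon = decode(encode_semantics(x_photo), encode_variation(x_photo))

# Augmentation: the pattern of x_photo rendered with the offset of x_sketch.
augmented = swap_variation(x_photo, x_sketch)
```

In this toy setting the augmented sample keeps the semantic code of `x_photo` exactly while taking on the offset of `x_sketch`, which mirrors the intent of variation-driven augmentation. In a learned model the two encoders and the decoder would be neural networks, and the separation would be encouraged by the training losses rather than guaranteed by construction.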
