Consistency Regularization for Domain Generalization with Logit Attribution Matching

Han Gao, Kaican Li, Weiyan Xie, Zhi Lin, Yongxiang Huang, Luning Wang, Caleb Chen Cao, Nevin L. Zhang

TL;DR

This work addresses domain generalization (DG) under distribution shift by using semantic sharing (SS) pairs, created through data augmentation, to regularize predictions. It develops a theory based on a causal latent decomposition (CLD) model with causal factors $X^{\mathrm{c}}$ and non-causal factors $X^{\mathrm{n}}$, and proves that causally invariant predictions that minimize the source-domain loss also minimize the out-of-distribution loss when the target domain's support over $X^{\mathrm{c}}$ is contained in the source domain's. The authors introduce Logit Attribution Matching (LAM), a consistency regularization (CR) method that uses labeled SS pairs to regularize the logit contributions of the target class, and report that it outperforms ERM+DA and other CR/DG baselines on five benchmarks across varied architectures. The work highlights the practical value of labeled SS pairs for robust DG and provides public code and data to support reproducibility.

Abstract

Domain generalization (DG) is about training models that generalize well under domain shift. Previous research on DG has been conducted mostly in single-source or multi-source settings. In this paper, we consider a third, lesser-known setting where a training domain is endowed with a collection of pairs of examples that share the same semantic information. Such semantic sharing (SS) pairs can be created via data augmentation and then utilized for consistency regularization (CR). We present a theory showing CR is conducive to DG and propose a novel CR method called Logit Attribution Matching (LAM). We conduct experiments on five DG benchmarks and four pretrained models with SS pairs created by both generic and targeted data augmentation methods. LAM outperforms representative single/multi-source DG methods and various CR methods that leverage SS pairs. The code and data of this project are available at https://github.com/Gaohan123/LAM
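
To make the idea of matching target-class logit contributions across an SS pair more concrete, here is a minimal sketch. It is not the authors' implementation (the repository linked above is the reference for that). It assumes a bias-free linear classification head on top of a feature vector, so that the logit of class `label` decomposes into per-feature contributions `weights[label][i] * features[i]`. The squared-distance penalty, the weight `lam`, and all function names are illustrative assumptions.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax of the labelled class (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def logits_from(features, weights):
    """Bias-free linear head: one logit per class (one weight row per class)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def logit_attribution(features, weights, label):
    """Per-feature contributions to the logit of class `label`."""
    return [w * f for w, f in zip(weights[label], features)]

def lam_style_loss(feat_x, feat_aug, weights, label, lam=1.0):
    """Cross-entropy on both examples of a labeled SS pair, plus a penalty
    on the mismatch between their labelled-class logit attributions."""
    erm = (cross_entropy(logits_from(feat_x, weights), label)
           + cross_entropy(logits_from(feat_aug, weights), label))
    a = logit_attribution(feat_x, weights, label)
    b = logit_attribution(feat_aug, weights, label)
    penalty = sum((p - q) ** 2 for p, q in zip(a, b))
    return erm + lam * penalty, penalty

# Toy SS pair: the two feature vectors differ only in the third feature.
weights = [[1.0, 0.0, 0.5],    # class 0
           [0.0, 1.0, -0.5]]   # class 1
feat_x = [2.0, 0.1, 1.0]
feat_aug = [2.0, 0.1, -1.0]
total, penalty = lam_style_loss(feat_x, feat_aug, weights, label=0)
```

In this toy pair the penalty is driven entirely by the third feature, the only one whose contribution to the class-0 logit changes between the two examples. Minimizing it would push the model to stop relying on that feature for class 0.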

Paper Structure (26 sections, 1 theorem, 14 equations, 12 figures, 6 tables)

This paper contains 26 sections, 1 theorem, 14 equations, 12 figures, 6 tables.

Key Result

Theorem 1

(Conditions for Optimal DG) Let $\hat{P}_{\theta}$ be a prediction model for a CLD family such that different ${x}^\mathrm{c}$ almost always generate different ${x}$, and let $P^\mathrm{s}$ and $P^\mathrm{t}$ be a source and a target domain (from the family) such that $\mathop{\mathrm{supp}}\nolimits[P^\mathrm{t}({X}^\mathrm{c})] \subseteq \mathop{\mathrm{supp}}\nolimits[P^\mathrm{s}({X}^\mathrm{c})]$. [...] Then, the prediction model $\hat{P}_{\theta}$ also minimizes the out-of-distribution (OOD) cross-entropy loss [...]
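
As a rough illustration of the support condition and of causal invariance, the toy sketch below treats the causal factor $x^\mathrm{c}$ and the non-causal factor $x^\mathrm{n}$ as discrete values that the model can read directly. The two domains, the two models, and all function names are made up for illustration and do not come from the paper.

```python
def causal_support(domain):
    """Set of causal-factor values occurring in a domain of (x_c, x_n) pairs."""
    return {x_c for x_c, _ in domain}

def is_causally_invariant(predict, x_c_values, x_n_values):
    """True if the prediction never changes when only x_n varies."""
    return all(
        len({predict(x_c, x_n) for x_n in x_n_values}) == 1
        for x_c in x_c_values
    )

def accuracy(predict, domain):
    """Fraction of correct predictions; the true label equals x_c here."""
    return sum(predict(x_c, x_n) == x_c for x_c, x_n in domain) / len(domain)

# Hypothetical domains: x_c is the animal, x_n is the background.
source = [("giraffe", "savanna"), ("dog", "park"), ("dog", "street")]
target = [("giraffe", "street"), ("dog", "savanna")]

def invariant_model(x_c, x_n):
    return x_c  # uses the causal factor only

def shortcut_model(x_c, x_n):
    return "giraffe" if x_n == "savanna" else "dog"  # uses the background only

animals = causal_support(source) | causal_support(target)
backgrounds = {x_n for _, x_n in source + target}

# Support condition: every target animal also appears in the source domain.
support_ok = causal_support(target) <= causal_support(source)

for model in (invariant_model, shortcut_model):
    print(model.__name__,
          is_causally_invariant(model, animals, backgrounds),
          accuracy(model, source),
          accuracy(model, target))
```

Both toy models fit the source domain perfectly, but only the causally invariant one keeps its accuracy on the target domain, where the animal-background pairing has changed.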

Figures (12)

  • Figure 1: A semantic sharing (SS) pair involves an original training example and a transformed version of it obtained by data augmentation (DA). The examples in the first two pairs share the same semantic information for the "giraffe" class, and the examples in the last pair share the same semantic information for the "dog" class. The augmented example in (a) is created manually via Copy-Paste (Gao et al., 2023), the one in (b) is created using a DA method called RandAugment (Cubuk et al., 2020), and the one in (c) is created using Stable Diffusion (Rombach et al., 2022); see the appendix for more details.
  • Figure 2: Causal latent decomposition (CLD) model. The input ${X}$ of a training example is generated from two latent variables ${X}^\mathrm{c}$ and ${X}^\mathrm{n}$, which may be statistically correlated due to confounders or direct mechanisms between them. The ground-truth label $Y$ is generated from ${X}^\mathrm{c}$ only. The mechanisms that generate ${X}$ and $Y$ are assumed to be invariant across domains. The corresponding conditional distributions are denoted as $P^*({X}|{X}^\mathrm{c}, {X}^\mathrm{n})$ and $P^*(Y|{X}^\mathrm{c})$. The joint distribution $P({X}^\mathrm{n}, {X}^\mathrm{c})$ of the two latent variables may change across domains. We assume ${X}^\mathrm{c}$ always $d$-separates $Y$ from the other variables.
  • Figure 3: An illustration of conditions for optimal DG under the CLD model. Training examples ${x}$ are sampled from the latent space, $\mathcal{X}^\mathrm{c} \times \mathcal{X}^\mathrm{n}$, which we depict as a 2-D box. A prediction model is causally invariant if it makes the same prediction for examples sampled from the same "vertical line" in the latent space. If such a model also minimizes the cross-entropy loss of a source domain, then it makes optimal predictions on all examples $\tilde{{x}}$ sampled from $\mathop{\mathrm{supp}}\nolimits[P^\mathrm{s}({X}^\mathrm{c})]\times \mathcal{X}^\mathrm{n}$ (the inner rectangle), not only those from $\mathop{\mathrm{supp}}\nolimits[P^\mathrm{s}({X}^\mathrm{c}, {X}^\mathrm{n})]$. This enables optimal generalization to any target domain $P^\mathrm{t}$ such that $\mathop{\mathrm{supp}}\nolimits[P^\mathrm{t}({X}^\mathrm{c})] \subseteq \mathop{\mathrm{supp}}\nolimits[P^\mathrm{s}({X}^\mathrm{c})]$.
  • Figure 4: Grad-CAM saliency maps for the top predicted class by models trained on ImageNet-9 using various methods. The model trained with LAM focuses better on the foreground objects.
  • Figure 5: SS pairs created via Copy-Paste (same-y) DA for iWildCam. This DA method pastes the animal onto an animal-free image sampled from a location where the same animal species has been observed.
  • ...and 7 more figures
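
The SS pairs described in the captions above each combine an original example with an augmented version that keeps the same label. As a minimal sketch, assuming a horizontal flip as a stand-in augmentation (the pairs in the paper come from Copy-Paste, RandAugment, and Stable Diffusion, none of which is reproduced here), a labeled SS pair could be assembled as follows. The `make_ss_pair` helper and the tiny nested-list image are hypothetical.

```python
def horizontal_flip(image):
    """Mirror a 2-D image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def make_ss_pair(image, label, augment):
    """Return (original, augmented, label); both examples share the label."""
    return image, augment(image), label

# A 2x3 toy "image" standing in for a real training example.
image = [[0, 1, 2],
         [3, 4, 5]]
original, augmented, label = make_ss_pair(image, "giraffe", horizontal_flip)
```

Any label-preserving transformation can be passed as `augment`; the pair is only useful for consistency regularization if the transformation really leaves the semantic content unchanged.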

Theorems & Definitions (1)

  • Theorem 1