Grounding Stylistic Domain Generalization with Quantitative Domain Shift Measures and Synthetic Scene Images

Yiran Luo; Joshua Feinglass; Tejas Gokhale; Kuan-Cheng Lee; Chitta Baral; Yezhou Yang

TL;DR

The paper addresses the fragility of domain generalization (DG) under stylistic shifts by introducing two quantitative measures, ICV (Intra-Class Variation) and IDD, both based on the Jensen-Shannon divergence $D_\mathrm{JS}$, to characterize stylistic domain shifts. It proposes SuperMarioDomains (SMD), a synthetic, consistently labeled multi-domain dataset, on which a precursor model is trained. The SMOS method then uses this precursor model to ground subsequent DG training via a Jensen-Shannon divergence penalty, achieving state-of-the-art results across five DG benchmarks, with large gains on abstract-styled domains and on-par or slightly better performance on photo-realistic domains. The work suggests that grounding DG with stylistically diverse, class-consistent synthetic data reduces distributional divergence across domains, which supports more robust generalization across visual styles.
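For reference, the standard textbook definition of the Jensen-Shannon divergence between two discrete distributions $P$ and $Q$ is given below. This is the general definition only; how the paper builds ICV and IDD on top of $D_\mathrm{JS}$ (for example, which image statistics play the role of $P$ and $Q$) is not stated in this summary and is not assumed here.

$$
D_\mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\, D_\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\, D_\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q),
\qquad D_\mathrm{KL}(P \,\|\, M) = \sum_x P(x) \log \frac{P(x)}{M(x)}
$$

Unlike the KL divergence, $D_\mathrm{JS}$ is symmetric in $P$ and $Q$ and bounded: it is $0$ when $P = Q$ and at most $\log 2$ (with the natural logarithm) when the two distributions have disjoint support.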

Abstract

Domain Generalization (DG) is a challenging task in machine learning that requires a model to cope with shifts across domains by extracting domain-invariant features. DG performance is typically evaluated through image classification on domains of differing image styles. However, current methodology lacks a quantitative understanding of stylistic domain shifts and relies on vast amounts of pre-training data, such as ImageNet1K, which is predominantly photo-realistic in style and has weakly supervised class labels. Such a data-driven practice could result in spurious correlations and inflated performance on DG benchmarks. In this paper, we introduce a new DG paradigm to address these risks. We first introduce two new quantitative measures, ICV and IDD, which describe domain shifts in terms of the consistency of classes within one domain and the similarity between two stylistic domains, respectively. We then present SuperMarioDomains (SMD), a novel synthetic multi-domain dataset sampled from video game scenes, whose classes are more consistent than those of ImageNet1K and whose styles are sufficiently dissimilar from it. Finally, we present our DG method, SMOS, which first uses SMD to train a precursor model and then uses that model to ground training on a DG benchmark. We observe that SMOS contributes to state-of-the-art performance across five DG benchmarks, with large improvements on abstract domains and on-par or slightly improved performance on photo-realistic domains. Our qualitative analysis suggests that these improvements can be attributed to reduced distributional divergence between originally distant domains. Our data are available at https://github.com/fpsluozi/SMD-SMOS .

Paper Structure (7 sections, 9 equations, 5 figures, 5 tables)

Introduction
Related Works
Preliminaries
Analyses on Domain Shift of Stylistic Domains and Pre-training Data
Methodology
Experiments and Results
Conclusions

Figures (5)

Figure 1: Top: We define two quantitative measures, ICV and IDD, to describe stylistic domain shifts in image datasets for Domain Generalization (DG). We find that the vast ImageNet1K, commonly used for pre-training DG models, has inconsistent class labels and is already similar in style to the photo-realistic domains found in multiple benchmarks. Therefore, we compile a novel synthetic dataset, SuperMarioDomains (SMD), as referential stylistic domains with consistent scene class labels and sufficient dissimilarity from existing domains. Bottom: We present our DG approach, SMOS, which leverages the unique domain shifts in our new SMD dataset. We first train a Precursor Model using SMD and cross entropy $Loss_\mathrm{CE}$. We then use the trained Precursor Model to ground the training of the DG model on training domains from the benchmark, optimizing the empirical loss of both cross entropy $Loss_\mathrm{CE}$ and the Jensen-Shannon Divergence $D_\mathrm{JS}$ between the Precursor Model and the DG Model.
Figure 2: Intra-Class Variation (ICV) for each domain in featured datasets. A low ICV indicates that the classes are more consistent in terms of colors, as in NES of SMD, Sketch of PACS, and Quickdraw of DomainNet. Meanwhile, the classes in ImageNet1K, which is commonly used for pre-training in DG, appear to be as inconsistent as those in photo- or art-styled domains, e.g. Photo and Art of PACS, LabelMe of VLCS, Art and RealWorld of OfficeHome, as well as Real of DomainNet. Average of 3 trials.
Figure 4: A qualitative overview of our SuperMarioDomains (SMD) dataset, consisting of video frames from actual game footage categorized into 4 distinctive scene classes and 4 image style domains. Columns from left to right: The four image domains, named after the console hardware on which each game runs - $\mathtt{NES}$, $\mathtt{SNES}$, $\mathtt{N64}$, and $\mathtt{Wii}$. Rows from top to bottom: The four classes of in-game scenes - Overworld, Underground, Aquatic, and Castle. These synthetic image styles of SMD are drastically different from those in existing DG benchmarks, such as realistic photographs, pencil sketches, or oil paintings.
Figure 5: The pipeline of our SMOS method. The feature extraction backbones $f$ and $f_{\mathrm{S}}$ have an identical structure. $f$ is initialized with ImageNet1K pre-trained weights. Left: We first train the Precursor Model $f_{\mathrm{S}}\circ g_{\mathrm{S}}$ to learn scene style shifts with SMD. Right: We then perform DG training with training domains from a DG benchmark (e.g. PACS), tuning the DG model $f \circ g$ while being grounded to the SMD-trained Precursor $f_{\mathrm{S}}$ by optimizing $\mathcal{L}_{\mathrm{JS}}$.
Figure 6: Test vs. training domain IDDs resulting from different DG methods when targeting the Sketch domain of PACS.
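The captions of Figures 1 and 5 describe the second SMOS stage as minimizing a cross-entropy loss together with a Jensen-Shannon term that ties the DG model to the frozen, SMD-trained Precursor. The following is a minimal, dependency-free sketch of what such a combined per-sample objective could look like. Several details are assumptions made for illustration and are not taken from the paper: that backbone features from $f$ and $f_{\mathrm{S}}$ are converted to distributions with a softmax before computing $D_\mathrm{JS}$, that the two terms are combined as a weighted sum, and the names `grounded_loss` and `js_weight`.

```python
import math


def softmax(values):
    """Turn a vector of real scores into a probability distribution."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def grounded_loss(class_logits, label, dg_features, precursor_features,
                  js_weight=1.0):
    """Hypothetical per-sample objective: cross entropy plus a weighted JS term.

    class_logits       -- output of the DG classifier head (g applied to f)
    dg_features        -- backbone features from the DG model (f)
    precursor_features -- features from the frozen Precursor backbone (f_S)
    js_weight          -- assumed trade-off coefficient, not from the paper
    """
    cross_entropy = -math.log(softmax(class_logits)[label])
    grounding = js_divergence(softmax(dg_features), softmax(precursor_features))
    return cross_entropy + js_weight * grounding


if __name__ == "__main__":
    # Toy numbers only: a 4-class head and 3-dimensional features.
    loss = grounded_loss([2.0, 0.5, 0.1, -1.0], 0,
                         [0.3, 1.2, -0.4], [0.1, 1.0, -0.2])
    print(f"toy grounded loss: {loss:.4f}")
```

When the DG features match the Precursor features exactly, the grounding term vanishes and the objective reduces to plain cross entropy; the further the two feature distributions drift apart, the larger the penalty, up to `js_weight * ln 2`.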
