TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction

Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi

TL;DR

This work tackles single-source domain generalization by moving away from global feature learning toward robust local concepts. It introduces TIDE, a training framework that enforces concept-level saliency alignment and domain-invariant local concept representations, built atop a diffusion- and language-model–driven annotation pipeline that generates per-class concept maps. A test-time correction mechanism uses concept signatures to iteratively refine predictions, enhancing both accuracy and interpretability. Across four benchmarks, TIDE achieves substantial improvements over state-of-the-art methods, highlighting the practical impact of local-concept learning for domain generalization and model explainability.

Abstract

We consider the problem of single-source domain generalization. Existing methods typically rely on extensive augmentations to synthetically cover diverse domains during training. However, they struggle with semantic shifts (e.g., background and viewpoint changes), as they often learn global features instead of local concepts that tend to be domain invariant. To address this gap, we propose an approach that compels models to leverage such local concepts during prediction. Given that no suitable dataset with per-class concepts and localization maps exists, we first develop a novel pipeline to generate annotations by exploiting the rich features of diffusion and large language models. Our next innovation is TIDE, a novel training scheme with a concept saliency alignment loss that ensures the model focuses on the right per-concept regions and a local concept contrastive loss that promotes learning domain-invariant concept representations. This not only yields a robust model but also one whose predictions can be visually interpreted using the predicted concept saliency maps. Given these maps at test time, our final contribution is a new correction algorithm that uses the corresponding local concept representations to iteratively refine the prediction until it aligns with prototypical concept representations that we store at the end of model training. We evaluate our approach extensively on four standard DG benchmark datasets and substantially outperform the current state-of-the-art (12% improvement on average) while also demonstrating that our predictions can be visually interpreted.
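The abstract describes test-time correction only at a high level, so the following is a minimal sketch of one possible reading, not the paper's algorithm. It assumes each image yields one feature vector per concept, that prototypes are stored per class and per concept, and that "alignment" means cosine similarity above a threshold. The names `correct_prediction`, `concept_feats`, `prototypes`, and `threshold` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correct_prediction(class_scores, concept_feats, prototypes, threshold=0.8):
    """Hypothetical correction loop (an assumption, not the paper's method).

    Walk candidate classes from most to least confident and accept the
    first one whose stored concept prototypes all match the observed
    local concept features.
    """
    ranked = sorted(class_scores, key=class_scores.get, reverse=True)
    for label in ranked:
        sims = [
            cosine(concept_feats[name], proto)
            for name, proto in prototypes[label].items()
            if name in concept_feats
        ]
        # Accept only if every available concept aligns with its prototype.
        if sims and min(sims) >= threshold:
            return label
    # No candidate aligned: fall back to the original top prediction.
    return ranked[0]


# Toy example: the classifier prefers "bird", but the observed "beak"
# feature does not match the stored bird prototype, while "ear" matches
# the stored dog prototype.
scores = {"bird": 0.6, "dog": 0.4}
feats = {"ear": [1.0, 0.0], "beak": [0.0, 1.0]}
protos = {"bird": {"beak": [1.0, 0.0]}, "dog": {"ear": [1.0, 0.1]}}
corrected = correct_prediction(scores, feats, protos)
```

In this toy setup the prediction moves from "bird" to "dog" because only the dog prototype aligns with the observed concept features. The actual refinement step, similarity measure, and stopping rule in TIDE may differ.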

Paper Structure

This paper contains 21 sections, 6 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Samples from the VLCS (left) and PACS (right) datasets across domain shifts, for the bird and person classes. The first row displays GradCAM maps [selvaraju2017grad] for ABA's class predictions. We observe that the model attention of ABA [cheng2023adversarial] falters across domain shifts. The second and third rows display the concept-specific GradCAM maps from TIDE. We posit that accurate concept learning and localization facilitate DG.
  • Figure 2: The TIDE pipeline: Left—Training on a single domain with cross-entropy losses for class ($\mathcal{L}_{\text{c}}$) and concept labels ($\mathcal{L}_{\text{k}}$), alongside Concept Saliency Alignment ($\mathcal{L}_{\text{CSA}}$) and Local Concept Contrastive losses ($\mathcal{L}_{\text{LCC}}$). Right—Test-time correction strategy applied in TIDE.
  • Figure 3: The first column displays the image generated from the given prompt, while the subsequent three columns show the cross-attention maps corresponding to each concept in the prompt.
  • Figure 4: The first row presents the prompt, the corresponding synthesized image, and the attention maps for the concepts ear and mouth. Below, we demonstrate that, using diffusion feature correspondences, these concept saliency maps from a single exemplar can be automatically transferred to dog images across domains.
  • Figure 5: t-SNE visualizations demonstrating the impact of $\mathcal{L}_{\text{LCC}}$. Each column represents a test domain (Sketch, Cartoon, Painting), with the top row showing t-SNE plots without $\mathcal{L}_{\text{LCC}}$ applied and the bottom row with it. Please zoom in for optimal viewing.
  • ...and 6 more figures