Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

Jiayi Guo, Junhao Zhao, Chaoqun Du, Yulin Wang, Chunjiang Ge, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang

TL;DR

The paper tackles diffusion-driven test-time adaptation (TTA), where target data are mapped into the synthetic domain of a diffusion model. This mapping introduces a misalignment between the source model and the synthetic data that degrades performance. The paper introduces Synthetic-Domain Alignment (SDA), a framework that aligns both the source model and the target data to the same synthetic domain using a Mix-of-Diffusion (MoD) approach: a conditional diffusion model generates labeled synthetic data for fine-tuning the source model, while an unconditional diffusion model aligns these samples to the test-time synthetic domain before the model is updated. SDA converts cross-domain TTA into an in-domain prediction task by ensuring that the adapted model operates within the same synthetic distribution as the adapted target data, and it ensembles predictions from the original source model and the synthetic-domain model for inference. Empirically, SDA outperforms existing diffusion-driven TTA methods across image classification benchmarks (e.g., ImageNet-C, ImageNet-W, CIFAR-10-C) and extends to semantic segmentation and multimodal LLMs such as LLaVA, demonstrating improved domain alignment, reduced data-stream sensitivity, and strong scalability. The work also provides extensive ablations and visual analyses, which show that both conditional data generation and unconditional data alignment are necessary for robust performance.
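The ensembling step mentioned above can be illustrated with a minimal sketch. The summary does not state the combination rule or which input each model receives, so averaging softmax probabilities is an assumption here, and the names `softmax` and `ensemble_predict` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(source_logits, synthetic_logits):
    """Average the class probabilities of two models (assumed rule).

    source_logits:    logits from the original source model
    synthetic_logits: logits from the synthetic-domain model
    Returns (predicted class index, averaged probabilities).
    """
    p_src = softmax(source_logits)
    p_syn = softmax(synthetic_logits)
    probs = [(a + b) / 2.0 for a, b in zip(p_src, p_syn)]
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs


if __name__ == "__main__":
    pred, probs = ensemble_predict([2.0, 0.0, 0.0], [0.0, 3.0, 0.0])
    print(pred, [round(p, 3) for p in probs])
```

Averaging probabilities rather than raw logits keeps the two models on a comparable scale; other rules (for example, weighting by confidence) would fit the same description.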

Abstract

Test-time adaptation (TTA) aims to improve the performance of source-domain pre-trained models on previously unseen, shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. The recently proposed diffusion-driven TTA methods mitigate this by adapting model inputs instead of weights, where an unconditional diffusion model, trained on the source domain, transforms target-domain data into a synthetic domain that is expected to approximate the source domain. However, in this paper, we reveal that although the synthetic data in diffusion-driven TTA seems indistinguishable from the source data, it is unaligned with, or even markedly different from, the latter for deep networks. To address this issue, we propose a Synthetic-Domain Alignment (SDA) framework. Our key insight is to fine-tune the source model with synthetic data to ensure better alignment. Specifically, we first employ a conditional diffusion model to generate labeled samples, creating a synthetic dataset. Subsequently, we use the aforementioned unconditional diffusion model to add noise to and denoise each sample before fine-tuning. This Mix of Diffusion (MoD) process mitigates the potential domain misalignment between the conditional and unconditional models. Extensive experiments across classifiers, segmenters, and multimodal large language models (MLLMs, e.g., LLaVA) demonstrate that SDA achieves superior domain alignment and consistently outperforms existing diffusion-driven TTA methods. Our code is available at https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment.
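The two-step MoD data pipeline described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the noising step uses the standard DDPM forward process with a linear beta schedule (the paper's actual schedule, timestep, and sampler are not given in this excerpt), and `conditional_generate` and `unconditional_denoise` are hypothetical stand-ins for the two diffusion models.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t steps (linear schedule, assumed)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t, rng):
    """DDPM forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def conditional_generate(label, dim, rng):
    """Hypothetical stand-in for a class-conditional diffusion sampler."""
    return [float(label) + 0.1 * rng.gauss(0.0, 1.0) for _ in range(dim)]


def unconditional_denoise(xt, t):
    """Hypothetical stand-in for the unconditional model's reverse process.

    It only undoes the sqrt(a_bar) scaling; a real model would run
    iterative denoising from step t back to step 0.
    """
    a = alpha_bar(t)
    return [v / math.sqrt(a) for v in xt]


def mix_of_diffusion(labels, dim=4, t=100, seed=0):
    """Build a labeled synthetic dataset: generate, add noise, then denoise."""
    rng = random.Random(seed)
    dataset = []
    for y in labels:
        x_cond = conditional_generate(y, dim, rng)  # step 1: labeled sample
        x_t = add_noise(x_cond, t, rng)             # step 2a: add noise
        x_syn = unconditional_denoise(x_t, t)       # step 2b: denoise
        dataset.append((x_syn, y))                  # label is carried through unchanged
    return dataset


if __name__ == "__main__":
    for x, y in mix_of_diffusion([0, 1, 2]):
        print(y, [round(v, 2) for v in x])
```

The resulting `(sample, label)` pairs would then be used to fine-tune the source model. The point of the second step is that every training sample passes through the same unconditional model that later adapts the target data at test time.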

Paper Structure

This paper contains 18 sections, 8 equations, 7 figures, 17 tables.

Figures (7)

  • Figure 1: Comparison of different test-time adaptation (TTA) frameworks. (a) Traditional TTA methods continuously adapt source model weights to fit target data batches. However, their performance is sensitive to the amount and order of target data streams, e.g., adapting the model with batches containing data from only a single category can lead to overfitting. (b) Diffusion-driven TTA methods project the target data back to the synthetic domain of diffusion models, which still leaves a domain misalignment with the source domain. (c) We propose the Synthetic-Domain Alignment (SDA) framework for TTA, which simultaneously aligns the domains of the source model and target data with the same synthetic domain for superior performance.
  • Figure 2: Enhanced domain alignment with our framework. Prior diffusion-driven TTA methods struggle with the domain misalignment between the source model and synthetic data, which we resolve by aligning the source model to the synthetic domain.
  • Figure 3: (a) Illustration of diffusion-driven data adaptation on source data and (b) adapted images across different timesteps. The results are obtained using DDA, with no noticeable visual degradation observed in the adapted images.
  • Figure 4: Overview of the Synthetic-Domain Alignment (SDA) framework. SDA is a novel TTA framework aligning both the domains of the source model and the target data with the synthetic domain. SDA involves three phases: (left): a source-domain model pretraining phase, where the source model is trained on source data prior to TTA; (middle): a source-to-synthetic model adaptation phase, where the source model is adapted to a synthetic-domain model using synthetic data generated via a Mix of Diffusion (MoD) technique; and (right): a target-to-synthetic data adaptation phase, where target data is adapted into synthetic data using an unconditional diffusion model. Finally, the adapted synthetic data is fed into the synthetic-domain model for test-time inference.
  • Figure 5: Grad-CAM visualization comparison. The first row shows activation maps for source and target images tested with the source model. The second row displays activation maps for diffusion synthetic images tested with the source model (DDA) and our synthetic-domain model (SDA). SDA aligns closely with the source model’s response to source images.
  • ...and 2 more figures