DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization

Yunlong Tang; Yuxuan Wan; Lei Qi; Xin Geng


TL;DR

DPStyler tackles Source-Free Domain Generalization (SFDG) in the joint vision-language space of a large pre-trained model (e.g., CLIP) with two key components: a Style Generation Module that refreshes style prompts every epoch (via Random or StyleMix) and a Style Removal Module (Style-SE Net) that suppresses style information in encoder outputs using a domain-uncertainty loss. To reduce the resulting sensitivity to text prompts, it employs a Model Ensemble across multiple initial templates during training and inference. The method optimizes a joint objective $L_{total}=L_U+L_C$ with ArcFace-based classification while keeping the CLIP encoders frozen, and it reports state-of-the-art results on PACS, VLCS, OfficeHome, and DomainNet with fewer training resources than PromptStyler. The experiments further indicate that refreshing styles helps and that removing style information yields more domain-invariant features, with robust performance under both stylized and non-stylized shifts. Overall, the approach offers a practical, one-stage solution for SFDG that combines prompt-driven style augmentation with explicit style removal.
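For reference, the joint objective can be written out as below. The first line is the objective stated above; the second is the standard ArcFace loss, given here only as an assumed form for $L_C$. The symbols $s$ (scale), $m$ (angular margin), $\theta_j$ (angle between the feature and the weight vector of class $j$), and $y$ (ground-truth class) are not defined in this summary, and the exact form of $L_U$ is not given here either.

```latex
\begin{align}
L_{total} &= L_U + L_C, \\
% Assumed: standard ArcFace form; s, m, \theta_j, y are not specified in this summary.
L_C &= -\log \frac{e^{s\cos(\theta_{y}+m)}}{e^{s\cos(\theta_{y}+m)} + \sum_{j \neq y} e^{s\cos\theta_{j}}}.
\end{align}
```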

Abstract

Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Research in SFDG primarily builds upon the existing knowledge of large-scale vision-language models and uses the pre-trained model's joint vision-language space to simulate style transfer across domains, thus eliminating the dependency on source domain images. However, two questions leave room for improvement: how to efficiently simulate rich and diverse styles using text prompts, and how to extract domain-invariant information useful for classification from encoder features that contain both semantic and style information. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, which generates style word vectors using random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.
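The per-epoch style refresh described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian sampling with `std=0.02`, the convex-combination mixing rule, and all function names (`random_styles`, `style_mix`, `refresh_styles`) are hypothetical choices, since this summary only says that style word vectors come from "random sampling or style mixing".

```python
import random


def random_styles(num_styles, dim, std=0.02, rng=None):
    """Draw fresh style word vectors from a zero-mean Gaussian.

    Assumption: the sampling distribution and std are illustrative only.
    """
    rng = rng or random.Random()
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_styles)]


def style_mix(styles, rng=None):
    """Build new styles as convex combinations of two existing styles.

    Assumption: this mixing rule is a stand-in for the paper's StyleMix.
    """
    rng = rng or random.Random()
    mixed = []
    for _ in range(len(styles)):
        a, b = rng.sample(styles, 2)   # pick two distinct existing styles
        lam = rng.random()             # mixing weight in [0, 1)
        mixed.append([lam * x + (1.0 - lam) * y for x, y in zip(a, b)])
    return mixed


def refresh_styles(styles, method, rng=None, std=0.02):
    """Replace every style vector; intended to be called once per epoch."""
    if method == "random":
        return random_styles(len(styles), len(styles[0]), std, rng)
    if method == "stylemix":
        return style_mix(styles, rng)
    raise ValueError(f"unknown refresh method: {method}")


if __name__ == "__main__":
    rng = random.Random(0)
    styles = random_styles(num_styles=4, dim=8, rng=rng)
    for epoch in range(3):
        # All styles are refreshed at the start of each epoch.
        styles = refresh_styles(styles, "stylemix", rng=rng)
        print(epoch, len(styles), len(styles[0]))
```

In a real pipeline the refreshed vectors would be inserted into text prompts and passed through the frozen text encoder; that step is omitted here.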


Paper Structure

This paper contains 17 sections, 7 equations, 10 figures, 13 tables, and 1 algorithm.

Introduction
Related Work
Domain Generalization
Vision-Language Models
Source-free Domain Generalization
Method
Style Generation Module
Style Removal Module
Model Ensemble
Model Training and Inference
Experiments
Evaluation Datasets
Implementation Details
Evaluations
Ablation Study
...and 2 more sections

Figures (10)

Figure 1: Zero-shot domain classification using CLIP. The four vertices of a color region represent the probabilities that the corresponding image belongs to each of four domains, obtained through CLIP's zero-shot capabilities. Image features come from CLIP's image encoder, while text features, which represent the domain information, are obtained by passing descriptive text (e.g., "a picture with a $\boldsymbol{\mathit{S}}$-like style," where $\boldsymbol{\mathit{S}}$ is the style, such as "cartoon") through the text encoder. Final probabilities are computed from the similarity between image and text features. The results show that the image features are domain-specific.
Figure 2: The training strategies of PromptStyler (Cho et al., 2023) and DPStyler (Ours). PromptStyler requires two-stage training and fixes the styles in the second stage. In contrast, ours requires only one-stage training and dynamically updates the styles during training.
Figure 3: The training and inference processes of DPStyler. A Style Generation Module dynamically refreshes styles during training using one of two style-refresh methods. A style remover trained with a domain uncertainty loss removes domain-specific information and learns domain-invariant features. Model ensemble is used at inference: among the class scores produced by the models for the different templates, the class with the maximum score is selected as the prediction.
Figure 4: Illustration of the domain uncertainty loss, which constrains the style remover to remove domain-specific information.
Figure 5: The evaluation of training and inference resources on VLCS including GPU memory usage, training time, model parameter count, and inference speed. Stage 1 represents the training of style word vectors by PromptStyler, and stage 2 denotes the training of the classifier. The symbol '*' denotes reproduced results. 'w/ ME' stands for using model ensemble.
...and 5 more figures
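The ensemble rule in the Figure 3 caption (take the class with the maximum score across the per-template models) can be sketched as follows. This is a minimal illustration, assuming each template's model outputs a list of class scores on a comparable scale; the function name `ensemble_predict` and the input layout are hypothetical.

```python
def ensemble_predict(scores_per_template):
    """Return the class index whose score is highest across all templates.

    scores_per_template: list of per-template score lists, one score per class.
    Assumption: scores from different templates are directly comparable.
    """
    if not scores_per_template:
        raise ValueError("need scores from at least one template")
    num_classes = len(scores_per_template[0])
    best_class, best_score = None, float("-inf")
    for scores in scores_per_template:
        if len(scores) != num_classes:
            raise ValueError("all templates must score the same classes")
        for cls, score in enumerate(scores):
            # Keep the single largest score seen over every template and class.
            if score > best_score:
                best_class, best_score = cls, score
    return best_class


if __name__ == "__main__":
    # Two templates, three classes: the global maximum (0.8) belongs to class 2.
    print(ensemble_predict([[0.1, 0.7, 0.2], [0.05, 0.15, 0.8]]))
```

Taking the global maximum differs from averaging the scores: a single confident template decides the prediction, which is what the caption describes.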
