StylePrompter: Enhancing Domain Generalization with Test-Time Style Priors

Jiao Zhang, Jian Xu, Xu-Yao Zhang, Cheng-Lin Liu

TL;DR

This work tackles domain generalization under real-world distribution shifts by injecting test-time style priors into the language prompts of a frozen vision-language model. It introduces StylePrompter, a lightweight module that extracts a style embedding from each test image and prepends it to the class name, giving a prompt of the form "SP CLASS" (style prior followed by the category word). Two prompter designs (Basic and Gaussian) are proposed, together with a style-regularization scheme that supports generalization to open, unseen domains. The training objective combines an open-domain discrimination loss, a style-embedding regularization loss, and a CLIP-style classification loss; inference proceeds offline, without updating model parameters. Across PACS, VLCS, Office-Home, and DomainNet, the method is reported to achieve state-of-the-art average DG performance, and ablations confirm the contributions of the style priors and the regularization. Overall, the approach shows that dynamic, language-side style cues can substantially improve the cross-domain robustness of vision-language systems.
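The three-part objective named above can be written schematically as follows. This is only a sketch under stated assumptions: the notation ($f_I$, $f_T$, $s(x)$, $w_c$, $\tau$), the weights $\lambda_{\mathrm{dis}}$ and $\lambda_{\mathrm{reg}}$, and the simple weighted sum are all introduced here for illustration, since this summary does not give the exact form of any term. The classification term follows the standard CLIP-style softmax over cosine similarities between the image feature and each class's text feature.

```latex
% Assumed notation (not taken from the paper):
%   f_I, f_T : frozen image and text encoders
%   s(x)     : style embedding produced by the style prompter for image x
%   w_c      : token embedding(s) of the class word c
%   \tau     : softmax temperature
\begin{align}
  p(y = c \mid x) &=
    \frac{\exp\!\big(\cos\big(f_I(x),\, f_T([s(x);\, w_c])\big)/\tau\big)}
         {\sum_{j} \exp\!\big(\cos\big(f_I(x),\, f_T([s(x);\, w_j])\big)/\tau\big)} \\
  \mathcal{L}_{\mathrm{cls}} &= -\log p(y \mid x) \\
  \mathcal{L} &= \mathcal{L}_{\mathrm{cls}}
    + \lambda_{\mathrm{dis}}\, \mathcal{L}_{\mathrm{dis}}
    + \lambda_{\mathrm{reg}}\, \mathcal{L}_{\mathrm{reg}}
\end{align}
```

Here $\mathcal{L}_{\mathrm{dis}}$ stands for the open-domain discrimination loss and $\mathcal{L}_{\mathrm{reg}}$ for the style-embedding regularization loss; their definitions are left abstract because they are not stated in this section.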

Abstract

In real-world applications, the sample distribution at the inference stage often differs from the one at the training stage, causing performance degradation of trained deep models. Research on domain generalization (DG) aims to develop robust algorithms that improve generalization performance in unseen domains by training on a few domains. However, a domain-agnostic vision model trained on a limited number of domains with traditional domain generalization methods cannot guarantee its effectiveness on unseen domains. The introduction of language can break the closed cognition space of the vision model, providing additional semantic information that cannot be inferred from vision-only datasets. In this paper, we propose to overcome this challenge in previous DG methods by introducing a style prompt in the language modality to adapt the trained model dynamically. In particular, we train a style prompter to extract the style information of the current image into an embedding in the token embedding space and place it in front of the candidate category words as prior knowledge to prompt the model. Our open space partition of the style token embedding space and the hand-crafted style regularization enable the trained style prompter to handle data from unknown domains effectively. Extensive experiments verify the effectiveness of our method and demonstrate state-of-the-art performance on multiple public datasets. Code will be available after the acceptance of this paper.
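To make the prompting mechanism described in the abstract concrete, here is a minimal sketch in plain Python, assuming toy dimensions. The names `toy_style_prompter`, `make_token_table`, and `build_prompts` are hypothetical and do not come from the paper; the "prompter" is a fixed average-pooling stand-in rather than a trained network. The sketch shows only the data flow: one style vector per image, placed in front of the token embeddings of every candidate class.

```python
import random

EMBED_DIM = 4  # toy token-embedding size (assumption; real models use far more)


def make_token_table(words, dim=EMBED_DIM, seed=0):
    """Hypothetical frozen token-embedding table: one random vector per word."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for w in words}


def toy_style_prompter(image_feature, dim=EMBED_DIM):
    """Stand-in for the trained style prompter (assumption, not the paper's model).

    Average-pools the image feature into `dim` values so that the result
    lives in the same space as the token embeddings.
    """
    chunk = len(image_feature) // dim
    return [sum(image_feature[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(dim)]


def build_prompts(style_embedding, class_names, token_table):
    """Place the style embedding in front of each class's word embeddings."""
    prompts = {}
    for name in class_names:
        word_embeddings = [token_table[w] for w in name.split()]
        prompts[name] = [style_embedding] + word_embeddings
    return prompts


class_names = ["dog", "guitar", "giraffe"]
token_table = make_token_table(class_names)
image_feature = [0.1 * i for i in range(8)]   # toy image feature
style = toy_style_prompter(image_feature)     # recomputed for every test image
prompts = build_prompts(style, class_names, token_table)
```

Each entry of `prompts` would then be passed through the frozen text encoder. Because the style vector is computed from the current image, the prompt changes per input while the encoders stay fixed.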

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Recognizing objects from unseen domains is a challenging task for vision models. In this paper, we enhance the model's domain generalization ability by providing test-time style priors as prompts in the language modality.
  • Figure 2: Left: the framework of our method, in which we freeze the image and text encoders of a pre-trained vision-language model and train a lightweight style prompter (orange module). Two designs are shown, the basic and the Gaussian style prompter; "SP", "BSP", and "GSP" stand for "style priors", "basic style priors", and "Gaussian style priors", respectively. Contextualized style regularization is designed to enhance generalization and to help integrate the learned style embeddings with category words. Right: the form of the style priors, which are dynamically updated based on the input image.
  • Figure 3: The common cross-entropy loss learns a partition of the whole embedding space based on the training domains and therefore cannot deal with unseen domains. The proposed domain discrimination loss shapes the embedding space through a contrastive mechanism, leaving room for unknown style embeddings.
  • Figure 4: Feature similarities between test images from unseen domains and texts generated with different style words. Compared to the unmatched style words, the learned styles and the manually defined matching ones have higher similarities with the test image, indicating that our style prompter can correctly extract the style information of unknown images.
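The comparison described for Figure 4 amounts to ranking text features by their cosine similarity to an image feature. The sketch below illustrates that computation only. The feature vectors are made-up toy values chosen for illustration, not results from the paper, and the labels in `text_features` are hypothetical placeholders for text features produced with a learned style, a manually matched style word, and an unmatched style word.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy feature vectors (assumption: invented purely to show the computation).
image_feature = [0.9, 0.1, 0.3]
text_features = {
    "learned style": [0.8, 0.2, 0.3],
    "matching style word": [0.7, 0.1, 0.4],
    "unmatched style word": [-0.2, 0.9, 0.1],
}

similarities = {
    label: cosine_similarity(image_feature, feature)
    for label, feature in text_features.items()
}

# Labels sorted from most to least similar to the image feature.
ranked = sorted(similarities, key=similarities.get, reverse=True)
```

With real encoders, `image_feature` and each entry of `text_features` would come from the frozen image and text encoders, and the same ranking step would apply.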