SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting

Chenyu Zhang, Kunlun Xu, Zichen Liu, Yuxin Peng, Jiahuan Zhou

TL;DR

Vision-language models such as CLIP degrade under domain shift, which motivates transductive test-time adaptation (TTA) that exploits batch-wide information. SCAP introduces supportive clique-based attribute prompting: it learns fine-grained attribute prompts from both the visual and textual modalities within each test batch, aggregates them per sample for prediction, and uses a retention mechanism to update the prompts as new data arrives. The approach reports state-of-the-art results on out-of-distribution (OOD) and cross-domain benchmarks, suggesting that modeling cross-sample relationships and retaining learned attributes improves generalization in batch-based TTA settings. Overall, SCAP offers a batch-aware, multimodal framework for TTA.

Abstract

Vision-language models (VLMs) encounter considerable challenges when adapting to domain shifts stemming from changes in data distribution. Test-time adaptation (TTA) has emerged as a promising approach to enhance VLM performance under such conditions. In practice, test data often arrives in batches, leading to increasing interest in the transductive TTA setting. However, existing TTA methods primarily focus on individual test samples, overlooking crucial cross-sample correlations within a batch. While recent ViT-based TTA methods have introduced batch-level adaptation, they remain suboptimal for VLMs due to inadequate integration of the text modality. To address these limitations, we propose a novel transductive TTA framework, Supportive Clique-based Attribute Prompting (SCAP), which effectively combines visual and textual information to enhance adaptation by generating fine-grained attribute prompts across test batches. SCAP first forms supportive cliques of test samples in an unsupervised manner based on visual similarity and learns an attribute prompt for each clique, capturing shared attributes critical for adaptation. For each test sample, SCAP aggregates attribute prompts from its associated cliques, providing enriched contextual information. To ensure adaptability over time, we incorporate a retention module that dynamically updates attribute prompts and their associated attributes as new data arrives. Comprehensive experiments across multiple benchmarks demonstrate that SCAP outperforms existing state-of-the-art methods, significantly advancing VLM generalization under domain shifts. Our code is available at https://github.com/zhoujiahuan1991/CVPR2025-SCAP.
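
The abstract states that supportive cliques are formed in an unsupervised manner from visual similarity, but it does not give the exact procedure. The sketch below is one plausible reading, not the authors' implementation: it assumes cosine similarity between image features, a hypothetical similarity `threshold`, and a greedy per-seed clique construction. The function name `mine_supportive_cliques` is illustrative only.

```python
import numpy as np

def mine_supportive_cliques(features, threshold=0.8):
    """Greedily build one clique per seed image (assumed procedure).

    features:  (N, D) array of image embeddings for one test batch.
    threshold: hypothetical cosine-similarity cutoff for graph edges.
    Returns a list of N cliques, each a sorted list of sample indices.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T
    adj = sim >= threshold
    np.fill_diagonal(adj, False)

    cliques = []
    for seed in range(len(feats)):
        clique = [seed]
        # Visit neighbours from most to least similar to the seed.
        for j in np.argsort(-sim[seed]):
            if j == seed:
                continue
            if not adj[seed, j]:
                break  # all remaining candidates are below the threshold
            # Keep j only if it is connected to every current member.
            if all(adj[j, k] for k in clique):
                clique.append(int(j))
        cliques.append(sorted(clique))
    return cliques

# Toy batch: two pairs of near-duplicate directions in 2-D.
toy = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
toy_cliques = mine_supportive_cliques(toy, threshold=0.9)
```

In this toy batch, samples 0 and 1 end up in one clique and samples 2 and 3 in another. A sample with no sufficiently similar neighbour keeps a singleton clique, so every image still receives at least one attribute prompt under this assumption.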

Paper Structure

This paper contains 17 sections, 19 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of our proposed SCAP with existing prompt learning-based TTA methods. Current TTA methods learn from individual instances, whereas our method uses cross-sample visual similarity within a batch to construct supportive cliques and extract attributes from them. We learn attributes from both modalities based on the cliques. For each image, SCAP jointly utilizes the attribute prompts from its associated cliques, leading to more effective and accurate prompting.
  • Figure 2: Overview of our proposed SCAP. SCAP first mines the supportive cliques for all images in parallel. Based on the cliques, it then learns the corresponding attribute prompts from both modalities. The learned visual attribute prompts and text attribute prompts are retained separately to accumulate knowledge of the test domains. For each instance, we conduct inference by jointly utilizing all the associated attribute prompts and the retained knowledge to generate the final prediction.
  • Figure 3: Results on the Cross-Domain Benchmark. Comparison of SCAP with state-of-the-art methods. All methods are evaluated using CLIP-ViT-B/16 as the backbone.
  • Figure 4: Left: the influence of batch size on SCAP performance. Right: the average maximum clique size generated per batch under different batch sizes.
  • Figure 5: Ablation study of the influence of different hyperparameters on ImageNet-A.
  • ...and 1 more figures
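
The Figure 2 caption says that each instance is predicted by jointly using the attribute prompts of all its associated cliques. The captions do not say how those prompts are combined, so the following is only a minimal sketch under an assumed rule: a plain mean over the prompts of every clique that contains the sample. The names `aggregate_prompts`, `clique_prompts`, and `num_samples` are hypothetical, and the retained knowledge mentioned in the caption is not modeled here.

```python
import numpy as np

def aggregate_prompts(cliques, clique_prompts, num_samples):
    """Average the prompts of all cliques containing each sample.

    cliques:        list of lists of sample indices, one list per clique.
    clique_prompts: (C, D) array, one prompt vector per clique.
    num_samples:    number of samples in the batch.
    Returns an (num_samples, D) array; samples in no clique get zeros.
    """
    dim = clique_prompts.shape[1]
    summed = np.zeros((num_samples, dim))
    counts = np.zeros(num_samples)
    for clique, prompt in zip(cliques, clique_prompts):
        for i in clique:
            summed[i] += prompt
            counts[i] += 1
    # Avoid division by zero for samples that belong to no clique.
    counts = np.maximum(counts, 1)
    return summed / counts[:, None]

# Sample 1 belongs to both cliques, so it receives the mean of both prompts.
demo_cliques = [[0, 1], [1, 2]]
demo_prompts = np.array([[1.0, 0.0], [0.0, 1.0]])
per_sample = aggregate_prompts(demo_cliques, demo_prompts, num_samples=4)
```

A weighted combination, for example by similarity to the clique seed, would fit the same interface. The uniform mean is chosen here only because it is the simplest rule consistent with the caption.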