Space Rotation with Basis Transformation for Training-free Test-Time Adaptation

Chenhao Ding, Xinyuan Gao, Songlin Dong, Yuhang He, Qiang Wang, Xiang Song, Alex Kot, Yihong Gong

TL;DR

This work tackles test-time adaptation under distribution shift for vision-language models by addressing the rigidity of the original CLIP feature space. It introduces Space Rotation with Basis Transformation (SOBA), which builds an orthogonal basis from a covariance-informed PCA and rotates the feature space to yield clearer inter-class separation, enabling better inference without any training. A dynamic queue of pseudo-labeled samples guides basis construction, while a transformed prototype-based classifier complements the CLIP predictions, with results showing state-of-the-art performance and improved efficiency over training-based and other training-free TTA methods. The approach is simple to implement, accelerates inference, and demonstrates robust generalization across ImageNet-based OOD data and diverse cross-dataset tasks, highlighting the practical impact of feature-space redesign for test-time adaptation.
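As a rough illustration of the basis construction summarized above, the following pure-Python sketch builds an orthonormal basis from the covariance of a few stored features and re-expresses each feature in that basis. It is a generic PCA-style change of basis under stated assumptions, not the paper's exact procedure: the function names (`covariance`, `pca_basis`, `to_new_basis`), the toy 3-dimensional features, and the choice to keep only the top `k` directions are all hypothetical. Orthogonal iteration with Gram-Schmidt is used only to stay dependency-free; in practice a library eigensolver such as `numpy.linalg.eigh` would be the usual choice.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def covariance(features):
    """Sample covariance matrix (d x d) of a list of d-dimensional features."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    centered = [[f[j] - mean[j] for j in range(d)] for f in features]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def gram_schmidt(vectors):
    """Orthonormalize vectors in order (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])
    return basis


def pca_basis(features, k, iters=500, seed=0):
    """Top-k principal directions of the features via orthogonal iteration.

    Assumption: the covariance has at least k nonzero eigenvalues.
    """
    cov = covariance(features)
    d = len(cov)
    rng = random.Random(seed)
    basis = gram_schmidt([[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)])
    for _ in range(iters):
        # Multiply every basis vector by the covariance, then re-orthonormalize.
        basis = gram_schmidt([[dot(row, b) for row in cov] for b in basis])
    return basis


def to_new_basis(x, basis):
    """Coordinates of x with respect to the orthonormal basis vectors."""
    return [dot(x, b) for b in basis]


# Toy stand-ins for queued image features from two pseudo-classes (made-up values).
features = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, -0.1],   # pseudo-class 0
    [0.0, 1.0, 0.1], [0.1, 0.9, -0.1], [-0.1, 1.1, 0.0],  # pseudo-class 1
]
basis = pca_basis(features, k=2)
rotated = [to_new_basis(f, basis) for f in features]
```

In this toy setup the first basis vector aligns with the direction that separates the two groups, so the first rotated coordinate alone distinguishes them. Whether the paper truncates the basis or keeps all directions is not stated in this summary.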

Abstract

With the growing use of vision-language models (VLMs) in downstream tasks, VLM-based test-time adaptation methods have attracted increasing attention for their ability to handle distribution shift at test time. Although prior approaches have made progress, they typically either demand substantial computational resources or are constrained by the limitations of the original feature space, which makes them less effective for test-time adaptation. To address these challenges, we propose a training-free feature space rotation with basis transformation for test-time adaptation. By leveraging the inherent distinctions among classes, we reconstruct the original feature space and map it to a new representation, thereby sharpening class differences and providing more effective guidance for the model during testing. Additionally, to better capture relevant information from the various classes, we maintain a dynamic queue that stores representative samples. Experimental results across multiple benchmarks demonstrate that our method outperforms state-of-the-art techniques in both performance and efficiency.

Paper Structure

This paper contains 18 sections, 13 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: (a) Feature confusion in the original CLIP space. The original CLIP feature space contains classes that are easily confused with one another. Training-free methods cannot adjust the feature space, which limits their applicability. (b) Feature space reconstructed through transformation. We use new basis vectors (such as ${\textit{b}}_{1}$ and ${\textit{b}}_{2}$ in panel (b)) to transform the feature space into a new space. In this space, we can resolve the confusion present in the original CLIP space and overcome the limitation that training-free methods cannot adjust the feature space. (c) Performance comparison on the OOD benchmark. Our method surpasses state-of-the-art methods on almost all datasets.
  • Figure 2: An overview of our method. We maintain a dynamic queue of representative samples, selected by the minimum entropy of CLIP's predictions. From these stored samples, we construct a basis transformation that rotates the feature space. As testing progresses, the queue and the resulting mapping are continuously updated, so the decision boundaries obtained through reconstruction become more refined and accurate. Finally, the prediction derived from the dynamic queue is combined with the zero-shot CLIP prediction to produce the final inference.
  • Figure 3: Subfigure (a) shows a comparison with other classifiers, where our SOBA achieves the best performance. Subfigure (b) presents a study of different dynamic queue lengths. Subfigure (c) presents a study of the impact of the hyperparameter $\alpha$. All experiments in the figure use ViT-B/16 and are conducted on ImageNet.
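
The queue-and-fusion pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `EntropyQueue`, the function `fuse`, the per-class capacity, and the additive use of a weight `alpha` are all hypothetical. The source mentions a hyperparameter $\alpha$, but this excerpt does not define its role, so the weighting below is only one plausible reading.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class EntropyQueue:
    """Per-class queue that keeps the `capacity` lowest-entropy samples seen so far."""

    def __init__(self, num_classes, capacity):
        self.capacity = capacity
        # Maps pseudo-label -> list of (entropy, feature), sorted by entropy.
        self.slots = {c: [] for c in range(num_classes)}

    def update(self, feature, clip_logits):
        """Store the sample under its pseudo-label, evicting high-entropy entries."""
        probs = softmax(clip_logits)
        pseudo_label = max(range(len(probs)), key=probs.__getitem__)
        slot = self.slots[pseudo_label]
        slot.append((entropy(probs), feature))
        slot.sort(key=lambda item: item[0])
        del slot[self.capacity:]  # drop everything beyond the capacity
        return pseudo_label


def fuse(clip_logits, queue_logits, alpha):
    """Combine zero-shot CLIP logits with queue-based logits (assumed additive)."""
    return [c + alpha * q for c, q in zip(clip_logits, queue_logits)]


# Toy usage with made-up one-dimensional features and two classes.
queue = EntropyQueue(num_classes=2, capacity=2)
queue.update([1.0], [5.0, 0.0])  # very confident, low entropy
queue.update([2.0], [1.0, 0.0])  # least confident, evicted once the slot is full
queue.update([3.0], [3.0, 0.0])  # moderately confident
final_logits = fuse([1.0, 2.0], [0.5, -1.0], alpha=2.0)
```

Keeping the queue sorted by entropy means a newly arriving confident sample displaces the least confident stored one, which matches the caption's description of selecting representative samples by minimum entropy.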