Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation

Jizhou Han, Chenhao Ding, SongLin Dong, Yuhang He, Xinyuan Gao, Yihong Gong

Abstract

Vision-language models (VLMs) such as CLIP generalize well but struggle with distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate strictly within CLIP's original feature space, relying on high-confidence samples while overlooking the potential of low-confidence ones. We propose MS-TTA, a training-free approach that enhances feature representations beyond CLIP's space using a single-step k-nearest-neighbors (kNN) Mean-Shift. By refining all test samples, MS-TTA improves feature compactness and class separability, leading to more stable adaptation. Additionally, a cache of refined embeddings further enhances inference by providing Mean-Shift-enhanced logits. Extensive evaluations on out-of-distribution (OOD) and cross-dataset benchmarks demonstrate that MS-TTA consistently outperforms state-of-the-art training-free TTA methods, achieving robust adaptation without requiring additional training.
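To make the idea of a single-step kNN Mean-Shift concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes cosine similarity on L2-normalized embeddings and a convex blend between each feature and the mean of its neighbors; the function name `knn_mean_shift` and the parameters `k` and `alpha` are hypothetical choices made for illustration.

```python
import numpy as np

def knn_mean_shift(features, k=5, alpha=0.5):
    """One mean-shift step: pull each feature toward the mean of its k nearest neighbours.

    `k` and `alpha` are illustrative hyperparameters, not values from the paper.
    Requires k < number of samples.
    """
    # L2-normalise so that dot products are cosine similarities.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T
    # Exclude each sample from its own neighbour set.
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k most similar samples for every row.
    nn_idx = np.argsort(-sim, axis=1)[:, :k]
    # Mean of the neighbours' embeddings, shape (n, d).
    neighbour_mean = x[nn_idx].mean(axis=1)
    # Blend the original feature with the neighbour mean, then re-normalise.
    shifted = (1.0 - alpha) * x + alpha * neighbour_mean
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)
```

Under these assumptions every sample is refined, regardless of its prediction confidence, which is the property the abstract emphasizes. Blending rather than fully replacing the feature is a design choice of this sketch; setting `alpha=1.0` would replace each feature with its neighbor mean.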

Paper Structure

This paper contains 37 sections, 14 equations, 3 figures, 13 tables, 1 algorithm.

Figures (3)

  • Figure 1: Illustration of the difference between previous approaches and the proposed Mean-Shift Guided Test-Time Adaptation.
  • Figure 2: Overview of MS-TTA. We first extract initial embeddings using the CLIP visual encoder and refine them via a mean-shift operator with k-nearest neighbors (kNN), generating mean-shifted embeddings. These refined embeddings are dynamically stored in a key-value cache. During inference, CLIP predictions are combined with mean-shift-enhanced predictions, using the cache to refine logits and improve classification accuracy and robustness to distribution shifts.
  • Figure 3: t-SNE visualizations of feature embeddings on the Flowers102 dataset. (a)-(b): Comparison of global embedding distributions from CLIP (a) and our method (b). (c)-(d): A focused view of 10 randomly selected classes, showing that our mean-shifted embeddings (d) reduce intra-class variance and enlarge inter-class margins compared to CLIP (c). (e)-(f): A close-up view of class 16 and class 33, where our method (f) achieves clearer separation and sharper decision boundaries than CLIP (e).
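The inference step described in the Figure 2 caption, where CLIP predictions are combined with cache-based predictions, can be sketched as follows. This section does not give the paper's exact formula, so the sketch borrows the exponential affinity commonly used in Tip-Adapter-style key-value caches. The function name `cache_enhanced_logits`, the parameters `beta`, `lam`, and `scale`, and the use of one-hot pseudo-labels as cache values are all assumptions made for illustration.

```python
import numpy as np

def cache_enhanced_logits(query, text_weights, cache_keys, cache_values,
                          beta=5.0, lam=1.0, scale=100.0):
    """Combine zero-shot CLIP logits with logits read from a key-value cache.

    query:        (n, d) L2-normalised (mean-shifted) image embeddings
    text_weights: (c, d) L2-normalised class text embeddings
    cache_keys:   (m, d) L2-normalised refined embeddings stored in the cache
    cache_values: (m, c) one-hot pseudo-labels for the cached embeddings
    beta, lam, scale are hypothetical hyperparameters, not values from the paper.
    """
    # Zero-shot CLIP prediction: scaled cosine similarity to each class text embedding.
    clip_logits = scale * query @ text_weights.T
    # Affinity between queries and cached keys (assumed Tip-Adapter-style form).
    affinity = np.exp(-beta * (1.0 - query @ cache_keys.T))
    # Each cached entry votes for its stored label, weighted by affinity.
    cache_logits = affinity @ cache_values
    return clip_logits + lam * cache_logits
```

In this sketch `lam` controls how strongly the cache adjusts the zero-shot prediction; with `lam=0.0` the function reduces to plain CLIP logits.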