A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models

Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan

TL;DR

The paper tackles calibration weaknesses in test-time prompt tuning (TPT) for vision-language models by proposing angular diversity (A-TPT), which maximizes the minimum pairwise angular distance among class-text features on the unit hypersphere. The objective augments the standard TPT loss with an angular-diversity regularizer weighted by a fixed $\lambda$. The regularizer is computed from the minimum angular distances $AD$, obtained from the cosine-similarity matrix $\text{Cos}=\hat{\mathbf{E}}\hat{\mathbf{E}}^T$ of the normalized class-text features $\hat{\mathbf{E}}$ via $\theta_{ij}=\arccos(\text{Cos}_{ij})$. Across diverse datasets and backbones (e.g., CLIP ViT-B/16 and RN50), A-TPT yields lower expected calibration error (ECE) while maintaining or improving accuracy, outperforming prior methods that used L2 dispersion or orthogonality constraints, and showing robustness to natural distribution shifts and medical data. The approach leverages concepts related to the Tammes problem to ensure uniform angular coverage on the hypersphere, delivering more reliable uncertainty estimates and practical calibration improvements for VLMs in zero-shot and test-time adaptation scenarios. Code will be released publicly.
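The angular quantities named in the summary can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names are hypothetical, and averaging the per-class minimum angles into a single score is an assumption, since the summary does not say how the minimum angular distances are reduced or with which sign they enter the loss.

```python
import numpy as np

def min_pairwise_angles(features):
    """For each class feature, the angle (radians) to its nearest other class feature."""
    E = np.asarray(features, dtype=np.float64)
    # Project each row onto the unit hypersphere.
    E_hat = E / np.linalg.norm(E, axis=1, keepdims=True)
    # Cos = E_hat E_hat^T; clip guards against rounding slightly outside [-1, 1].
    cos = np.clip(E_hat @ E_hat.T, -1.0, 1.0)
    theta = np.arccos(cos)
    # Ignore the zero angle of each feature with itself.
    np.fill_diagonal(theta, np.inf)
    return theta.min(axis=1)

def angular_diversity(features):
    """Assumed reduction: mean of the per-class minimum angles."""
    return float(min_pairwise_angles(features).mean())

# Three mutually orthogonal features: every nearest-neighbour angle is pi/2.
well_spread = np.eye(3)
# Two collinear features plus one orthogonal: the collinear pair has angle 0.
crowded = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])

print(angular_diversity(well_spread))
print(angular_diversity(crowded))
```

A larger score means the closest pair of class features is further apart, which is the property the regularizer is described as encouraging.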

Abstract

Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, a lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. These methods may not always achieve optimal angular separation between class-wise textual features, and so overlook the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. Through extensive experiments with various backbones on different datasets, we show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
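For readers unfamiliar with the calibration metric, the following is a minimal sketch of expected calibration error (ECE) in its common equal-width binning form. The function name, the choice of 10 bins, and the toy inputs are assumptions for illustration; the binning protocol used in the paper is not stated here.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    conf = np.asarray(confidences, dtype=np.float64)
    hit = np.asarray(correct, dtype=np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 falls in no bin.
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(hit[in_bin].mean() - conf[in_bin].mean())
            # in_bin.mean() is the fraction of samples that landed in this bin.
            ece += in_bin.mean() * gap
    return float(ece)

# Confidence 0.75 with 3 of 4 predictions correct: perfectly calibrated.
print(expected_calibration_error([0.75] * 4, [1, 1, 1, 0]))
# Confidence 0.95 with every prediction wrong: badly overconfident.
print(expected_calibration_error([0.95] * 4, [0, 0, 0, 0]))
```

Lower values indicate that predicted confidence tracks empirical accuracy more closely.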

Paper Structure

This paper contains 26 sections, 6 equations, 13 figures, 15 tables.

Figures (13)

  • Figure 1: Comparison of calibration performance (ECE) with C-TPT [yoon2024c] and O-TPT [sharifdeen2025tpt] on fine-grained classification datasets with the CLIP ViT-B/16 backbone. Ours (lower ECE) shows improved prompt calibration.
  • Figure 2: Comparison of numerical optimization (A-TPT (Ours)) with angular optimization (O-TPT [sharifdeen2025tpt]) and ATFD optimization (C-TPT [yoon2024c]).
  • Figure 3: t-SNE visualization of class-wise embedded textual features with the CLIP RN50 model on the fine-grained classification dataset [fei2004learning] for (a) hard prompts and (b) tuned prompts. In both subfigures, each unique color represents a distinct prompt in the prompt visualization (left) and a distinct class in the class visualization (right). The legends belong to the prompt visualization (left) for both subfigures.
  • Figure 4: Comparison of mean cosine similarity changes for both categories with the CLIP ViT-B/16 backbone. Where O-TPT fails, our A-TPT offers consistent cosine similarity values and achieves the greatest minimum pairwise angular distance among text features for all the data points (see the supplementary material for details).
  • Figure 5: Reliability diagrams for the CLIP ViT-B/16 backbone (see the supplementary material for additional reliability diagrams).
  • ...and 8 more figures