O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models

Ashshak Sharifdeen, Muhammad Akhtar Munir, Sanoojan Baliah, Salman Khan, Muhammad Haris Khan

TL;DR

This work tackles poor calibration in test-time prompt tuning for vision-language models by enforcing angular separation among class text features. It introduces O-TPT, an orthogonality-constrained objective that drives $EE^T$ toward the identity, improving calibration without sacrificing accuracy. Across diverse backbones and datasets, O-TPT consistently outperforms state-of-the-art approaches like C-TPT and TPT, even surpassing zero-shot calibration on several tasks. The results suggest that leveraging angular properties of textual features yields practically significant gains for reliable uncertainty estimates in VLMs.
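A minimal sketch of what driving $EE^T$ toward the identity can look like in code. It assumes $E$ stacks one L2-normalized text feature per class as rows, and it measures the deviation with a squared Frobenius norm. The function name `orthogonality_penalty`, the choice of norm, and the toy inputs are illustrative assumptions; this summary does not give the paper's exact loss or its weighting.

```python
import numpy as np


def orthogonality_penalty(text_features: np.ndarray) -> float:
    """Squared Frobenius norm of (E E^T - I) for row-normalized features.

    Hypothetical helper for illustration only; the norm and the
    normalization step are assumptions, not the paper's stated objective.
    """
    # Normalize each class feature so that E E^T holds pairwise cosine similarities.
    norms = np.linalg.norm(text_features, axis=1, keepdims=True)
    E = text_features / norms
    gram = E @ E.T
    identity = np.eye(E.shape[0])
    # Zero exactly when all class features are mutually orthogonal.
    return float(np.sum((gram - identity) ** 2))


# Toy inputs: 3 classes with 8-dimensional features.
orthogonal = np.eye(3, 8)      # rows are mutually orthogonal unit vectors
collapsed = np.ones((3, 8))    # all rows point in the same direction

print(orthogonality_penalty(orthogonal))
print(orthogonality_penalty(collapsed))
```

The penalty vanishes for mutually orthogonal class features and grows as the features align, which is the angular-separation behavior the TL;DR describes.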

Abstract

Test-time prompt tuning for vision-language models (VLMs) is gaining attention because of its ability to learn from unlabeled data without fine-tuning. Although test-time prompt tuning methods for VLMs can boost accuracy, the resulting models tend to be poorly calibrated, which casts doubt on their reliability and trustworthiness. Calibration of test-time prompt tuning in VLMs therefore deserves more attention. To this end, we propose O-TPT, an approach that introduces orthogonality constraints on the textual features corresponding to the learnable prompts in order to calibrate test-time prompt tuning in VLMs. In doing so, we make the following contributions. First, we uncover new insights into the suboptimal calibration performance of existing methods that rely on textual feature dispersion. Second, we show that imposing a simple orthogonalization of textual features is a more effective way to obtain textual dispersion. We conduct extensive experiments on various datasets with different backbones and baselines. The results indicate that our method consistently outperforms the prior state of the art, significantly reducing the overall average calibration error. Our method also surpasses zero-shot calibration performance on fine-grained classification tasks.
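For readers unfamiliar with the calibration error mentioned above, the following sketch computes the standard Expected Calibration Error (ECE), the metric named in the Figure 1 caption. It uses the common equal-width binning definition; the function name, the default of 10 bins, and the toy inputs are assumptions for illustration and may differ from the paper's evaluation setup.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    Hypothetical helper; the bin count is an assumed default.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Samples whose top-class confidence falls in the half-open bin (lo, hi].
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            # Weight each bin's gap by the fraction of samples it contains.
            ece += in_bin.mean() * gap
    return float(ece)


# Toy case: the model is 95% confident on every sample but right on 3 of 4.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0]))
```

In the toy case the single occupied bin has accuracy 0.75 and mean confidence 0.95, so the overconfidence gap of about 0.2 is the ECE.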

Paper Structure

This paper contains 17 sections, 2 equations, 11 figures, 18 tables.

Figures (11)

  • Figure 1: Comparison of calibration performance (ECE) with C-TPT (yoon2024c) and Robust-adapt-Penalty-CTPT (murugesanrobust). Lower ECE indicates better calibration.
  • Figure 2: Probability density functions of intra-text feature cosine similarities.
  • Figure 3: Comparison of angular optimization (ours) and ATFD optimization (C-TPT; yoon2024c).
  • Figure 4: Comparison of mean cosine similarity changes on a fine-grained dataset (nilsback2008automated) with the CLIP B/16 backbone. Our orthogonal constraint yields consistent cosine similarity values among text features for all data points.
  • Figure 5: Comparison of accuracy and Expected Calibration Error (ECE) across methods and categories, based on the cosine similarity of TPT (shu2022test) text features.
  • ...and 6 more figures