Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
Qitong Wang, Tang Li, Kien X. Nguyen, Xi Peng
TL;DR
The paper studies prediction rationality in Vision-Language Models (VLMs) by introducing two metrics, Prediction Trustworthiness (PT) and Inference Reliability (IR), together with a heatmap-based validity measure, Relevant Mass Accuracy $\text{RMA}(H, M)$, which assesses whether a model's explanation heatmap focuses on the target object. It compares the Zero-Shot baseline with mainstream fine-tuning methods (Linear-Probing, Finetune Like CLIP Pretrain, and Fine-tuning) across multiple VLMs (e.g., CLIP, ALBEF, BLIP) and datasets, and reports a consistent trade-off: fine-tuning often reduces PT (more correct predictions rest on invalid evidence) but increases IR (predictions are more often correct when the model attends to valid evidence). These findings remain stable under distribution shifts such as ImageNet-C. The results challenge the notion that fine-tuning universally improves downstream behavior: accuracy can rise while the rationality of predictions deteriorates. The work has practical implications for deploying VLMs in safety-critical domains and motivates fine-tuning methods that improve accuracy and prediction rationality together.
Abstract
Vision-Language Models (VLMs), such as CLIP, have already seen widespread application. Researchers are actively fine-tuning VLMs for safety-critical domains. In these domains, prediction rationality is crucial: a prediction should be correct and based on valid evidence. Yet, for VLMs, the impact of fine-tuning on prediction rationality has seldom been investigated. To study this problem, we proposed two new metrics, Prediction Trustworthiness and Inference Reliability. We conducted extensive experiments across various settings and observed two notable phenomena. On the one hand, we found that widely adopted fine-tuning methods led to more correct predictions based on invalid evidence. This potentially undermines the trustworthiness of correct predictions from fine-tuned VLMs. On the other hand, having identified valid evidence of target objects, fine-tuned VLMs were more likely to make correct predictions. Moreover, these findings are consistent under distributional shifts and across various experimental settings. We hope our research offers fresh insights into VLM fine-tuning.
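To make the two metrics concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that PT is the fraction of correct predictions that rest on valid evidence, that IR is the fraction of valid-evidence predictions that are correct (our reading of the abstract), and that RMA is the share of heatmap mass falling inside the target-object mask. The function names, the clipping of negative heatmap values, and the validity threshold of 0.5 used below are hypothetical choices made for illustration.

```python
import numpy as np


def relevant_mass_accuracy(heatmap, mask):
    """Share of non-negative heatmap mass that lies inside the object mask."""
    # Assumption: negative relevance is clipped to zero before normalizing.
    h = np.clip(np.asarray(heatmap, dtype=float), 0.0, None)
    m = np.asarray(mask, dtype=bool)
    total = h.sum()
    return float(h[m].sum() / total) if total > 0 else 0.0


def rationality_metrics(correct, valid):
    """Return (PT, IR) from per-sample boolean flags.

    Assumed reading of the abstract:
      PT = P(valid evidence | correct prediction)
      IR = P(correct prediction | valid evidence)
    """
    correct = np.asarray(correct, dtype=bool)
    valid = np.asarray(valid, dtype=bool)
    both = np.logical_and(correct, valid).sum()
    pt = both / correct.sum() if correct.any() else float("nan")
    ir = both / valid.sum() if valid.any() else float("nan")
    return float(pt), float(ir)
```

In use, one would compute the RMA for each sample, mark the evidence as valid when the RMA exceeds a chosen threshold (for example `rma >= 0.5`, a hypothetical value), and pass the resulting flags to `rationality_metrics` together with the per-sample correctness flags. Under this reading, a fine-tuned model can gain accuracy while PT drops, because the additional correct predictions may come from samples whose heatmaps miss the target object.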
