Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality

Qitong Wang, Tang Li, Kien X. Nguyen, Xi Peng

TL;DR

The paper studies prediction rationality in Vision-Language Models (VLMs): whether a prediction is both correct and based on valid evidence. It introduces two metrics, Prediction Trustworthiness (PT) and Inference Reliability (IR), together with a heatmap-based validity measure, Relevant Mass Accuracy $\text{RMA}(H, M)$, which assesses whether model explanations focus on the target object. It compares a Zero-Shot baseline against mainstream fine-tuning methods (Linear-Probing, Finetune Like CLIP Pretrain, and Fine-tuning) across multiple VLMs (e.g., CLIP, ALBEF, BLIP) and datasets, and reports a consistent trade-off: fine-tuning often reduces PT (more correct predictions rest on invalid evidence) but increases IR (when the model attends to valid evidence, it is more likely to predict correctly). These findings remain stable under distribution shifts such as ImageNet-C. The results challenge the assumption that fine-tuning uniformly improves downstream behavior: accuracy can rise while the rationality of predictions deteriorates. The work has practical implications for deploying VLMs in safety-critical domains and motivates fine-tuning methods that improve accuracy and prediction rationality together.
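
To make the heatmap-based validity measure concrete, the following is a minimal sketch of Relevant Mass Accuracy as it is commonly defined: the share of total explanation mass that falls inside the ground-truth region. It assumes that $H$ is a 2D explanation heatmap and $M$ is a binary mask of the target object (for example, a filled bounding box); the summary does not spell out these symbols, so this reading is an assumption. Clipping negative values and returning 0 for an all-zero heatmap are illustrative choices, not details taken from the paper.

```python
import numpy as np


def relevant_mass_accuracy(heatmap, mask):
    """Fraction of positive heatmap mass that lies inside the mask.

    heatmap: 2D array of explanation scores (assumed meaning of H).
    mask:    2D binary array marking the target object (assumed meaning of M).
    """
    h = np.asarray(heatmap, dtype=float)
    m = np.asarray(mask, dtype=bool)
    if h.shape != m.shape:
        raise ValueError("heatmap and mask must have the same shape")

    # Assumption: only positive relevance counts as evidence.
    h = np.clip(h, 0.0, None)

    total = h.sum()
    if total == 0.0:
        # Assumption: an empty explanation carries no valid evidence.
        return 0.0
    return float(h[m].sum() / total)


# Toy example: three of the four units of mass lie on the object (top row).
toy_heatmap = [[3.0, 0.0],
               [1.0, 0.0]]
toy_mask = [[1, 1],
            [0, 0]]
print(relevant_mass_accuracy(toy_heatmap, toy_mask))  # 0.75
```

A score near 1 means the explanation concentrates on the target object, while a score near 0 means most of the mass falls on the background. How such a score is turned into a valid or invalid label (for instance, by a threshold) is not stated in this summary and is left open here.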

Abstract

Vision-Language Models (VLMs), such as CLIP, have already seen widespread applications. Researchers actively engage in further fine-tuning VLMs in safety-critical domains. In these domains, prediction rationality is crucial: the prediction should be correct and based on valid evidence. Yet, for VLMs, the impact of fine-tuning on prediction rationality is seldom investigated. To study this problem, we proposed two new metrics called Prediction Trustworthiness and Inference Reliability. We conducted extensive experiments on various settings and observed some interesting phenomena. On the one hand, we found that the widely adopted fine-tuning methods led to more correct predictions based on invalid evidence. This potentially undermines the trustworthiness of correct predictions from fine-tuned VLMs. On the other hand, having identified valid evidence of target objects, fine-tuned VLMs were more likely to make correct predictions. Moreover, the findings are also consistent under distributional shifts and across various experimental settings. We hope our research offers fresh insights to VLM fine-tuning.
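
The two metrics can be illustrated with a short sketch built on the four quadrants of accuracy and rationale (RR, RW, WR, WW). It assumes that the first letter refers to the prediction (Right/Wrong) and the second to the evidence, that PT is the share of correct predictions that rest on valid evidence, and that IR is the share of valid-evidence samples that are predicted correctly. These readings follow the descriptions above, but the exact formulas are an assumption, and the function name `rationality_metrics` is hypothetical.

```python
import numpy as np


def rationality_metrics(correct, valid):
    """Count the four quadrants and derive PT and IR (assumed definitions).

    correct: per-sample booleans, True if the prediction is right.
    valid:   per-sample booleans, True if the explanation is judged valid.
    """
    c = np.asarray(correct, dtype=bool)
    v = np.asarray(valid, dtype=bool)
    if c.shape != v.shape:
        raise ValueError("correct and valid must have the same shape")

    rr = int(np.sum(c & v))    # right prediction, right evidence
    rw = int(np.sum(c & ~v))   # right prediction, wrong evidence
    wr = int(np.sum(~c & v))   # wrong prediction, right evidence
    ww = int(np.sum(~c & ~v))  # wrong prediction, wrong evidence

    # Assumed: PT = RR / (RR + RW), IR = RR / (RR + WR).
    pt = rr / (rr + rw) if (rr + rw) > 0 else float("nan")
    ir = rr / (rr + wr) if (rr + wr) > 0 else float("nan")
    return {"RR": rr, "RW": rw, "WR": wr, "WW": ww, "PT": pt, "IR": ir}


# Toy example with five samples.
toy_correct = [True, True, True, False, False]
toy_valid = [True, False, False, True, False]
print(rationality_metrics(toy_correct, toy_valid))
```

Under these assumed definitions, a fine-tuned model can gain accuracy mostly through the RW quadrant, which lowers PT, while still raising IR if fewer valid-evidence samples end up misclassified. This matches the pattern the abstract describes.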

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Both (a) and (b) have low responses to the background, while (a) pays more attention to the whole body of the bird and (b) pays more attention to the discriminative feature of the bird (head). Compared with the IoU score between (a) and (b), the difference between them is negligible. Moreover, both achieve correct predictions. The input is from the CUB-200-2011 dataset. "GT" abbreviates "Ground Truth" and "Explan" abbreviates "Explanation".
  • Figure 2: Overview of the four quadrants (RR, RW, WR, WW) of Accuracy and Rationale that are utilized to evaluate prediction rationality.
  • Figure 3: Visualization comparisons among different methods. Compared with zero-shot (ZS), current mainstream fine-tuning methods (LP, FLCP, and FT) for VLMs tend to show enhanced responses in background pixels that are irrelevant to predictions. We select samples for which all four methods make correct predictions, and we display bounding box annotations indicating the positions of the predicted target.
  • Figure 4: Experimental results on out-of-distribution data. Our discoveries remain consistent across various types and magnitudes of distributional shifts. The x-axis in all figures represents the strength of corruption, where a strength of 0 indicates the results of different methods on the original ImageNet validation data. Due to space constraints, we only show results with CLIP-ViT-B/32 and four types of corruption in the main paper. For more results, please refer to our supplementary material.
  • Figure 5: Additional results of testing on out-of-distribution data with CLIP-ViT-B/32 model.
  • ...and 2 more figures