Beyond Accuracy: Ensuring Correct Predictions With Correct Rationales

Tang Li, Mengmeng Ma, Xi Peng

TL;DR

This work proposes a rationale-informed optimization method to guide the model in disentangling and localizing visual evidence for each rationale, without requiring manual annotations, and significantly improves the model's rationale correctness.

Abstract

Large pretrained foundation models demonstrate exceptional performance and, in some high-stakes applications, even surpass human experts. However, most of these models are currently evaluated primarily on prediction accuracy, overlooking the validity of the rationales behind their accurate predictions. For the safe deployment of foundation models, there is a pressing need to ensure double-correct predictions, i.e., correct predictions backed by correct rationales. To achieve this, we propose a two-phase scheme: First, we curate a new dataset that offers structured rationales for visual recognition tasks. Second, we propose a rationale-informed optimization method to guide the model in disentangling and localizing visual evidence for each rationale, without requiring manual annotations. Extensive experiments and ablation studies demonstrate that our model outperforms state-of-the-art models by up to 10.1% in prediction accuracy across a wide range of tasks. Furthermore, our method significantly improves the model's rationale correctness, improving localization by 7.5% and disentanglement by 36.5%. Our dataset, source code, and pretrained weights are available at https://github.com/deep-real/DCP

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Unsafe prediction examples. Correct prediction, incorrect rationale: CLIP identifies a red light, but wrongly based on red balloons. Incorrect prediction, correct rationale: GPT-4V incorrectly predicts a closed door, yet based on plausible visual evidence.
  • Figure 2: Our structured rationales capture the major attributes and their sub-attributes that lead to the recognition of objects. Our dataset offers over 4,000 unique rationales covering all 1,000 categories from ImageNet (Deng et al., 2009).
  • Figure 3: Multi-head Self Attention (MSA) accumulated mean-ablation study. Based on the paper's MSA decomposition equation (eq:msa_decompose), we replace the direct effects of MSAs up to a specific layer with their mean values calculated across the ImageNet (Deng et al., 2009) validation set. Most of the performance gains can be attributed to the final layers of the ViT.
  • Figure 4: Qualitative results of rationale disentanglement and localization. The visual evidence that the CLIP model (Radford et al., 2021) produces for a rationale typically highlights the entire object, lacking precise localization. In contrast, our model can correctly localize rationales, thereby enhancing trust in its predictions.
  • Figure 5: Qualitative results of zero-shot text-to-image retrieval on MSCOCO (Lin et al., 2014). The task is to retrieve the top-5 images for a given rationale. The CLIP results reveal a significant entanglement of rationales with specific categories, such as "long neck" with giraffes and "wings" with airliners. In contrast, our model treats rationales independently from categories, thus offering diverse retrieval results. For example, "long neck" retrieves birds, giraffes, deer, and bottles.
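
The accumulated mean-ablation described in the Figure 3 caption can be illustrated with a small toy sketch. It is not the authors' implementation. It assumes that an image representation can be written as a sum of per-layer direct contributions (here MSA terms only, with MLP and residual terms omitted for brevity), and the names `layer_means` and `ablated_representation` are hypothetical helpers introduced only for this example.

```python
from typing import List

Vector = List[float]


def layer_means(contribs: List[List[Vector]]) -> List[Vector]:
    """Average each layer's contribution over all images.

    contribs[i][l] is the (assumed) direct contribution of layer l
    to the representation of image i.
    """
    n_images = len(contribs)
    n_layers = len(contribs[0])
    dim = len(contribs[0][0])
    means = []
    for layer in range(n_layers):
        mean = [
            sum(contribs[i][layer][d] for i in range(n_images)) / n_images
            for d in range(dim)
        ]
        means.append(mean)
    return means


def ablated_representation(
    image_contribs: List[Vector], means: List[Vector], upto: int
) -> Vector:
    """Sum the per-layer contributions of one image, replacing the
    contributions of layers 0 .. upto-1 with their dataset means."""
    dim = len(means[0])
    out = [0.0] * dim
    for layer, contrib in enumerate(image_contribs):
        source = means[layer] if layer < upto else contrib
        for d in range(dim):
            out[d] += source[d]
    return out


# Toy data: 2 images, 2 layers, 2-dimensional contributions.
toy = [
    [[1.0, 0.0], [0.0, 2.0]],  # image 0
    [[3.0, 0.0], [0.0, 4.0]],  # image 1
]
toy_means = layer_means(toy)
# Ablate progressively more layers for image 0.
sweep = [ablated_representation(toy[0], toy_means, k) for k in range(3)]
```

Sweeping `upto` from 0 to the number of layers and measuring accuracy at each step gives a curve of the kind the caption describes: if accuracy stays roughly flat until the last few layers are ablated, most of the performance is attributable to those final layers.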