Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

TL;DR

This work introduces Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs, and demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings.

Abstract

Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
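
The data flow described in the abstract (an auxiliary prompt is generated, appended to the task input, and the downstream MLLM's stated confidence is read back) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `Answer: X, Confidence: NN` reply format, and the two callables `cgp_generator` and `task_mllm` are all hypothetical stand-ins chosen for illustration. The only detail taken from the paper is that confidence is reported as a score out of 100 and converted to a decimal.

```python
import re
from typing import Callable, List, Optional, Tuple


def build_task_prompt(question: str, options: List[str], instruction: str, cgp: str) -> str:
    """Append the generated auxiliary prompt to the question, options, and instruction."""
    option_block = "\n".join(options)
    return f"{question}\n{option_block}\n{instruction}\n{cgp}"


def parse_response(text: str) -> Tuple[Optional[str], Optional[float]]:
    """Extract an answer letter and a confidence in [0, 1] from a model reply.

    Assumes a hypothetical 'Answer: X, Confidence: NN' reply format.
    Returns (None, None) when the reply does not match.
    """
    match = re.search(r"Answer:\s*([A-D]).*?Confidence:\s*(\d{1,3})", text, re.S)
    if match is None:
        return None, None
    score = min(int(match.group(2)), 100)  # clamp scores above 100
    return match.group(1), score / 100.0   # score out of 100 -> decimal


def run_example(
    cgp_generator: Callable[[str, List[str], str], str],
    task_mllm: Callable[[str, object], str],
    question: str,
    options: List[str],
    instruction: str,
    image: object,
) -> Tuple[Optional[str], Optional[float]]:
    """Generate the auxiliary prompt, query the downstream model, parse its reply."""
    cgp = cgp_generator(question, options, instruction)  # text-only input
    prompt = build_task_prompt(question, options, instruction, cgp)
    reply = task_mllm(prompt, image)  # the image goes only to the downstream model
    return parse_response(reply)
```

In this sketch the generator never sees the image, matching the description of a lightweight text LLM guiding a multimodal downstream model; the reward and the RL update are deliberately left out because the abstract does not specify them.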

Paper Structure

This paper contains 21 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the Prompt4Trust Framework for Medical Visual Question Answering. (A) Training pipeline. The Calibration Guidance Prompt (CGP) Generator receives the textual elements of a medical visual question, namely the question, multiple-choice options, and an instruction. The CGP Generator then produces the CGP. The CGP is appended to the question, multiple-choice options, and the downstream task instruction and passed, along with the associated medical image, to the Downstream Task MLLM. The Downstream Task MLLM produces an answer and a confidence score, reported out of 100; $\hat{p}$ is defined by converting this score to a decimal. The answer and confidence are compared to the ground-truth answer to compute a reward, which is used to optimize the CGP Generator via the GRPO reinforcement learning objective. (B) Inference on a sample from the PMC-VQA dataset zhang_pmc-vqa_2024. At inference time, Prompt4Trust follows steps – , resulting in the ✓ MLLM Response with CGP, which produces the correct answer and calibrated confidence. The Generated CGP text was abbreviated for illustration purposes. For comparison, the ✗ MLLM Response without CGP is incorrect and overconfident, illustrating the effectiveness of the Prompt4Trust framework.
  • Figure 2: Calibration curve illustrating the relationship between confidence and average accuracy. Perfect calibration is shown by the dashed line. Prompt4Trust demonstrates better calibration, particularly in the high-confidence region (e.g., $\text{confidence}\geq 0.85$), where trust is most critical for supporting medical decision-making in medical imaging tasks.
  • Figure 3: The calibration curve for Qwen2.5-VL-7B-Instruct bai2025qwen25vl as the Downstream Task MLLM in the generalizability experiment. Prompt4Trust demonstrates better calibration, particularly in the high-confidence region (e.g., $\text{confidence}\geq 0.85$), where trust is most critical for supporting medical decision-making.
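
A calibration curve of the kind shown in Figures 2 and 3 relates stated confidence to average accuracy. The Python sketch below shows one common way to compute such a curve with equal-width confidence bins, plus the accuracy within a high-confidence region. The binning scheme, the bin count, and the function names are assumptions made for illustration and may differ from what the authors used; only the 0.85 threshold is taken from the captions.

```python
from typing import List, Optional, Sequence, Tuple


def calibration_curve(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty equal-width bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((p, c))

    curve = []
    for members in bins:
        if not members:
            continue  # skip empty bins rather than reporting a fake accuracy
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        curve.append((mean_conf, accuracy, len(members)))
    return curve


def high_confidence_accuracy(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float = 0.85,
) -> Optional[float]:
    """Accuracy over predictions whose confidence is at least `threshold`."""
    selected = [c for p, c in zip(confidences, correct) if p >= threshold]
    if not selected:
        return None  # no prediction reached the threshold
    return sum(1 for c in selected if c) / len(selected)
```

For a perfectly calibrated model, each bin's accuracy equals its mean confidence, which corresponds to the dashed diagonal in the figures. Points below the diagonal indicate overconfidence.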