Drawing the Line: Enhancing Trustworthiness of MLLMs Through the Power of Refusal

Yuhao Wang, Zhiyuan Zhu, Heyang Liu, Yusheng Liao, Hongcheng Liu, Yanfeng Wang, Yu Wang

TL;DR

This work tackles the trustworthiness of multimodal LLMs by enabling principled refusals when information is insufficient. It introduces InBoL, which defines intrinsic and extrinsic information boundaries, builds a data-generation pipeline for IDK instruction and preference data, and deploys IDK-IT and CA-DPO training to improve refusal accuracy without sacrificing helpfulness. A user-centric evaluation framework with accuracy (Acc), refusal rate (RefR), and a model-agnostic trustworthiness score shows substantial gains in reliability, including strong performance on both in-domain and out-of-domain benchmarks. The approach demonstrates a practical pathway to safer, more trustworthy MLLMs and suggests directions for interpretable refusals and explanations. Overall, InBoL advances robust refusal capabilities as a core component of trustworthy multimodal AI systems.

Abstract

Multimodal large language models (MLLMs) excel at multimodal perception and understanding, yet their tendency to generate hallucinated or inaccurate responses undermines their trustworthiness. Existing methods have largely overlooked the importance of refusal responses as a means of enhancing MLLM reliability. To bridge this gap, we present the Information Boundary-aware Learning Framework (InBoL), a novel approach that empowers MLLMs to refuse to answer user queries when encountering insufficient information. To the best of our knowledge, InBoL is the first framework that systematically defines the conditions under which refusal is appropriate for MLLMs using the concept of information boundaries proposed in our paper. This framework introduces a comprehensive data generation pipeline and tailored training strategies to improve the model's ability to deliver appropriate refusal responses. To evaluate the trustworthiness of MLLMs, we further propose a user-centric alignment goal along with corresponding metrics. Experimental results demonstrate a significant improvement in refusal accuracy without noticeably compromising the model's helpfulness, establishing InBoL as a pivotal advancement in building more trustworthy MLLMs.

Paper Structure

This paper contains 41 sections, 7 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Information Boundaries of MLLMs. (a) Questions are categorized into three types based on intrinsic and extrinsic information boundaries. For Type 1 questions, which fall within the intrinsic boundary, the model is expected to provide helpful responses. For Type 2 questions, which require information unknown to the model, the model should refuse to answer. For Type 3 questions, where the provided image lacks sufficient information, the model should also respond with a refusal. (b) The intrinsic and extrinsic boundaries are illustrated, highlighting the model's varying confidence in answering queries across different regions.
  • Figure 2: The Pipeline of Data Construction: Given a VQA dataset, we design a pipeline to collect different types of samples within and beyond the information boundaries. First, we estimate the confidence for each sample to determine the model's intrinsic information boundary. Next, we generate questions that lie beyond the extrinsic boundary, followed by quality filtering. Finally, all data is formatted into a standardized structure, including correct, incorrect, and refusal responses, each accompanied by their corresponding confidence scores.
  • Figure 3: Construction of 'IDK' instruction and preference data: The restructured data is categorized into 'Known,' 'Mixed,' and 'Unknown' based on confidence thresholds ($\delta_{k}$ and $\delta_{uk}$). 'IDK' instruction generation includes correct responses for known questions, refusal responses for unknown questions, and the exclusion of mixed data. Preference data samples are constructed by pairing questions with correct, incorrect, and refusal responses, based on the confidence classification of each question.
  • Figure 4: Refusal rate and accuracy of models across different confidence levels. (a) Refusal Rate by Confidence: The model exhibits dynamic refusal behavior, with higher refusal rates for lower confidence levels and a tendency to answer directly for high-confidence questions. This indicates the model's awareness of its intrinsic information boundary. (b) Answered Accuracy by Confidence: The accuracy of the IDK-IT and CA-DPO models surpasses that of the original model, demonstrating that training methods focused on intrinsic boundary recognition improve the model's ability to provide accurate responses when choosing to answer.
  • Figure 5: Predefined refusal template
  • ...and 7 more figures
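
The categorization step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the threshold values, the inclusive comparisons at the boundaries, the `Sample` fields, the helper names, and the placeholder refusal string are all assumptions made for the example. Preference-pair construction is left out because the caption does not state the pairing rule.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical threshold values: the caption names delta_k and delta_uk
# but does not give numbers, so these are placeholders.
DELTA_K = 0.75
DELTA_UK = 0.25

# Placeholder refusal text; the paper's predefined template (Figure 5)
# is not reproduced here.
REFUSAL = "Sorry, I cannot answer this question."


@dataclass
class Sample:
    question: str
    correct: str       # a correct response for the question
    confidence: float  # estimated model confidence in [0, 1] (assumed scale)


def categorize(confidence: float,
               delta_k: float = DELTA_K,
               delta_uk: float = DELTA_UK) -> str:
    """Map a confidence score to 'known', 'mixed', or 'unknown'.

    Whether the boundaries are inclusive is an assumption.
    """
    if confidence >= delta_k:
        return "known"
    if confidence <= delta_uk:
        return "unknown"
    return "mixed"


def build_idk_instructions(samples: List[Sample]) -> List[Dict[str, str]]:
    """Build instruction data as the caption describes: correct responses
    for known questions, refusals for unknown ones, mixed data excluded."""
    data = []
    for s in samples:
        category = categorize(s.confidence)
        if category == "known":
            data.append({"question": s.question, "response": s.correct})
        elif category == "unknown":
            data.append({"question": s.question, "response": REFUSAL})
        # 'mixed' samples are skipped on purpose
    return data


if __name__ == "__main__":
    demo = [
        Sample("What color is the car?", "Red", 0.9),
        Sample("How many birds are visible?", "Three", 0.5),
        Sample("Who painted this mural?", "Unknown artist", 0.1),
    ]
    for row in build_idk_instructions(demo):
        print(row)
```

With these placeholder thresholds, the first demo question keeps its correct answer, the second is dropped as mixed, and the third is mapped to the refusal string.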