Judge Before Answer: Can MLLM Discern the False Premise in Question?

Jidong Li, Lingyong Fang, Haodong Zhao, Sufeng Duan, Gongshen Liu

TL;DR

This work addresses false premise recognition in multimodal LLMs by introducing JBA, a fully automated benchmark with a three-level taxonomy (perceptual, cognitive, reasoning) and thirteen subtypes, built from Visual Genome data. It also proposes JBA-GRPO, a reinforcement learning framework with a reasoning reward to improve explicit refutation of false premises. Experiments show existing MLLMs struggle on JBA, while models trained with JBA-GRPO achieve significant improvements in false-premise detection, demonstrating the method's effectiveness. The dataset and framework offer a scalable, rigorous path for enhancing reliability in multimodal reasoning.

Abstract

Multimodal large language models (MLLMs) have witnessed astonishing advancements in recent years. Despite these successes, MLLMs remain vulnerable to false premise problems. However, existing benchmarks targeting this issue are limited in scope: they often lack fine-grained categorization, exhibit insufficient coverage, and thus fail to provide a rigorous evaluation of the ability of models to recognize false premises. To bridge this gap, we introduce a fully automated pipeline for constructing a comprehensive benchmark of false premise questions. Our method systematically categorizes the premises into three main types and thirteen subtypes according to the abilities required to identify the premises, resulting in the JBA dataset. Results show that current MLLMs still struggle with false premise recognition. Building upon this benchmark, we further propose a recognition enhancement framework tailored to strengthen the robustness of MLLMs in detecting false premises. Extensive experiments demonstrate that models trained with our framework achieve significant improvements in false premise recognition.

Paper Structure

This paper contains 16 sections, 10 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The fully automated pipeline for constructing our false premise dataset, consisting of three stages: Visual Premise Extraction, Premise-Aware Captioning, and Target Question Generation. First, an MLLM extracts a premise of a specific type from an input image. Next, it generates a concise caption that must include the extracted premise. Finally, an LLM produces a false premise question or a true premise question by embedding either a false premise obtained through replacement or the correct premise into declarative form.
  • Figure 2: JBA Dataset Category Distribution
  • Figure 3: An example of the data used in JBA-GRPO training.
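
The three stages named in the Figure 1 caption can be read as a simple data flow. The sketch below is only an illustration of that flow under stated assumptions, not the authors' implementation: the function `build_question`, the `PremiseQuestion` record, and the four callables standing in for the MLLM and LLM calls are all hypothetical names, and the literal substring check on the caption is a simplification of "the caption must include the extracted premise".

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PremiseQuestion:
    """One generated example (hypothetical record layout)."""
    true_premise: str       # premise extracted from the image
    embedded_premise: str   # premise actually embedded in the question
    caption: str
    question: str
    is_false_premise: bool


def build_question(
    image: Any,
    premise_type: str,
    extract_premise: Callable[[Any, str], str],       # stage 1: stands in for an MLLM call
    caption_with_premise: Callable[[Any, str], str],  # stage 2: stands in for an MLLM call
    replace_premise: Callable[[str], str],            # produces a false premise by replacement
    write_question: Callable[[str, str], str],        # stage 3: stands in for an LLM call
    make_false: bool,
) -> PremiseQuestion:
    # Stage 1: Visual Premise Extraction.
    premise = extract_premise(image, premise_type)

    # Stage 2: Premise-Aware Captioning. The caption must contain the premise;
    # a literal substring check is an assumption made for this sketch.
    caption = caption_with_premise(image, premise)
    if premise not in caption:
        raise ValueError("caption does not include the extracted premise")

    # Stage 3: Target Question Generation. Embed either the replaced (false)
    # premise or the original (true) premise into the question.
    embedded = replace_premise(premise) if make_false else premise
    question = write_question(caption, embedded)

    return PremiseQuestion(premise, embedded, caption, question, make_false)
```

With stub callables in place of real model calls, the same function yields either a false premise question or its true premise counterpart, depending only on `make_false`.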