FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Rui Xiao, Sanghwan Kim, Yongqin Xian, Zeynep Akata, Stephan Alaniz

Abstract

Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.

Paper Structure (36 sections, 22 equations, 25 figures, 17 tables)

Figures (25)

  • Figure 1: We compare the performance of InternVL3.5-14B wang2025internvl35 (Baseline) with that of the same model fine-tuned with FINER-Tuning under negative queries at seven different granularity levels.
  • Figure 2: Data construction pipeline for FINER benchmarks. For FINER-DOCCI, we extract the positive scene graph (SG) from DOCCI onoe2024docci captions, while for FINER-CompreCap, the SG is provided by CompreCap lu2025comprecap. From the positive SG, we generate the negative SG using Qwen3-14B yang2025qwen3 as the negatives generator for FINER-CompreCap and Gemini-2.0-Flash team2023gemini for FINER-DOCCI. Finally, a rule-based query construction pipeline builds multiple-choice questions. In practice, choices are shuffled in both benchmarks.
  • Figure 3: Training data generation pipeline for FINER-Tuning. (1) We adopt long captions from Pixmo deitke2025molmo and extract diverse phrases with PHI-4-14B abdin2024phi. (2) We then prompt the same LLM to modify and generate negative phrases. (3) We construct both positive and negative query-answer tuples via template-based composition or LLM generation.
  • Figure 4: $\text{Acc}_\text{paired}$ versus the number of objects, attributes, and relations. Top: FINER-CompreCap; Bottom: FINER-DOCCI. Dashed arrows show the gain from FINER-Tuning.
  • Figure 5: Qualitative examples of FINER-CompreCap MCQs for each category together with MLLM answers.
  • ...and 20 more figures
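
The Figure 2 caption notes that a rule-based step builds multiple-choice questions and that the choices are shuffled. The sketch below illustrates only that shuffling step under stated assumptions; it is not the authors' pipeline. The names `MCQ`, `build_mcq`, and `format_mcq`, the question wording, and the example answer options are all hypothetical.

```python
import random
from dataclasses import dataclass

LETTERS = "ABCDEFGH"  # assumption: at most eight options per question


@dataclass
class MCQ:
    question: str
    choices: list[str]  # answer options in shuffled order
    answer: str         # letter of the correct option


def build_mcq(question: str, correct: str, distractors: list[str],
              rng: random.Random) -> MCQ:
    """Shuffle the options and record which letter holds the correct one."""
    options = [correct, *distractors]
    if len(options) > len(LETTERS):
        raise ValueError("too many options for the available letters")
    rng.shuffle(options)
    return MCQ(question, options, LETTERS[options.index(correct)])


def format_mcq(mcq: MCQ) -> str:
    """Render the question followed by one lettered line per option."""
    lines = [mcq.question]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(mcq.choices)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example: the query mixes a present element (a bench)
    # with a mismatched attribute (the color of the car).
    mcq = build_mcq(
        question="What is parked next to the wooden bench?",
        correct="A blue car",
        distractors=["A red car", "A red bicycle", "A blue truck"],
        rng=random.Random(0),  # fixed seed so the order is reproducible
    )
    print(format_mcq(mcq))
    print("Correct option:", mcq.answer)
```

Passing an explicit `random.Random` instance keeps the shuffle reproducible, which matters when the correct letter has to stay aligned with the stored answer key across runs.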