FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Rui Xiao; Sanghwan Kim; Yongqin Xian; Zeynep Akata; Stephan Alaniz

Abstract

Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
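
For reference, the standard DPO objective is reproduced below as commonly stated in the literature. This is a sketch under assumptions: the abstract does not say whether FINER-Tuning uses the objective unmodified, and the symbols are ours, with $x$ an image-query pair, $y_w$ the preferred answer, $y_l$ the rejected answer, $\pi_{\text{ref}}$ the frozen reference model, and $\beta$ a temperature hyperparameter.

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Minimizing this loss raises the likelihood of the preferred answer relative to the rejected one, measured against the reference model.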

Paper Structure (36 sections, 22 equations, 25 figures, 17 tables)

Introduction
FINER Benchmarks
Question Construction Pipeline
Scene Graph Extraction
Negatives Generation
Evaluation Setting
Training with FINER (FINER-Tuning)
Experiments
Experimental Setup
Results on FINER benchmarks
Results on other hallucination benchmarks
Results on general capabilities
Qualitative Results
Ablation Studies
Related Works
...and 21 more sections

Figures (25)

Figure 1: We compare the performance of InternVL3.5-14B wang2025internvl35 (Baseline) with that of the same model fine-tuned by FINER-Tuning, under negative queries at seven different granularity levels.
Figure 2: Data construction pipeline for FINER benchmarks. For FINER-DOCCI, we extract the positive scene graph (SG) from DOCCI onoe2024docci captions, while for FINER-CompreCap, the SG is provided by CompreCap lu2025comprecap. From the positive SG, we generate the negative SG using Qwen3-14B yang2025qwen3 as negatives generator for FINER-CompreCap and Gemini-2.0-Flash team2023gemini for FINER-DOCCI. Finally, a rule-based query construction pipeline builds multiple choice questions. In practice, choices are shuffled in both benchmarks.
Figure 3: Training data generation pipeline for FINER-Tuning. (1) We adopt long captions from Pixmo deitke2025molmo and extract diverse phrases with PHI-4-14B abdin2024phi. (2) We then prompt the same LLM to modify these phrases and generate negative phrases. (3) We construct both positive and negative query-answer tuples via template-based composition or LLM generation.
Figure 4: $\text{Acc}_\text{paired}$ versus the number of objects, attributes, and relations. Top: FINER-CompreCap; Bottom: FINER-DOCCI. Dashed arrows show the gain from FINER-Tuning.
Figure 5: Qualitative examples of FINER-CompreCap MCQs for each category together with MLLM answers.
...and 20 more figures
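
The caption of Figure 2 notes that a rule-based pipeline builds multiple-choice questions and that the choices are shuffled. A minimal sketch of that shuffling step is given below. The function name `build_mcq`, the option lettering, the dictionary layout, and the example strings are all hypothetical, and the actual question templates and option contents are not specified in this summary.

```python
import random


def build_mcq(question, correct, distractors, seed=0):
    """Assemble one multiple-choice question with shuffled options.

    Hypothetical helper: the layout of the returned dict is an
    assumption made for illustration, not the benchmark's format.
    """
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    choices = [correct] + list(distractors)
    rng.shuffle(choices)

    letters = "ABCDEFGH"
    if len(choices) > len(letters):
        raise ValueError("too many choices for the available letters")

    # Map each shuffled choice to an option letter.
    options = {letters[i]: text for i, text in enumerate(choices)}
    # Record which letter ended up holding the correct choice.
    answer = next(k for k, v in options.items() if v == correct)
    return {"question": question, "options": options, "answer": answer}


mcq = build_mcq(
    "Which description matches the image?",  # illustrative wording only
    correct="a red car next to a tree",
    distractors=["a blue car next to a tree", "a red car next to a bench"],
)
```

Tracking the answer letter after shuffling, rather than before, keeps the label consistent with the option order that the model actually sees.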
