Draft and Refine with Visual Experts

Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani

TL;DR

Draft and Refine (DnR) tackles the problem of ungrounded reasoning in LVLMs by introducing a query-conditioned visual-utilization metric that quantifies reliance on relevant image regions. It decomposes questions into explicit visual queries, builds a query-conditioned relevance map, and measures utilization via adaptive Top-$k$/Bottom-$k$ perturbations, enabling selective expert rendering from a pool of visual cues. The framework uses rendering to incorporate external visual experts and introduces a lightweight learned selector to scale expert invocation, achieving improvements across VQA, captioning, reasoning, and knowledge-based tasks while reducing hallucinations. Empirical results show stronger gains for weaker baselines and robust correlations between utilization and accuracy, suggesting that explicit visual grounding guided by $U_q$ offers a principled path toward interpretable, evidence-driven multimodal reasoning.

Abstract

While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems. Code is available at https://github.com/EavnJeong/Draft-and-Refine-with-Visual-Experts.

Paper Structure

This paper contains 22 sections, 9 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the Draft-and-Refine (DnR) framework. Given an image $x$ and a question $q$, the LVLM first generates an initial draft answer $\hat{y}$ ①. The question is decomposed by $f_{\mathrm{LLM}}$ into a query set $Q=\{q_i\}$, and each query is grounded by $f_g$ to produce spatial relevance maps, aggregated into $r(x\mid q)$ ②. Gumbel-$k$ sampling masks Top-$k$ and Bottom-$k$ regions for perturbation, and a semantic encoder $g(\cdot)$ measures similarity shifts between $\hat{y}$ and perturbed predictions $\tilde{y}_\tau$ to compute the utilization score $U_q^{\mathrm{base}}$ ③. Expert models (e.g., CLIP, SAM, OCR) render structured visual evidence onto the image, producing refined outputs with updated utilization $U_q^{(j)}$. The expert with the largest gain $U_q^{(j)} - U_q^{\mathrm{base}}$ is selected for refinement ④.
  • Figure 2: Illustration of the query-conditioned relevance map. For the same image (top row), different questions lead to distinct relevance regions aligned with the extracted query terms. Conversely, for the same question (bottom row), the relevance map varies with the image content, localizing evidence that matches the queried concept.
  • Figure 3: Question-conditioned utilization computation. Given a question $q$ and image $x$, the relevance map $r(x \mid q)$ guides Gumbel Top-$k$/Bottom-$k$ masking over a ratio $\rho$ of the image. Masked inputs $\tau(x)$ are fed into the LVLM to obtain perturbed predictions $\tilde{y}_\tau$, compared with the original $\hat{y}$ via a semantic encoder $g(\cdot)$, and aggregated with the adaptive factor $\alpha$ to compute the final utilization score $U_q(x)$.
  • Figure 4: Comparison of rendering strategies across different experts. Each column corresponds to an expert, and each row represents a rendering style.
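
The masking and selection steps described in Figures 1 and 3 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names (`gumbel_top_k`, `sample_mask`, `similarity_shift`, `select_expert`) are hypothetical, the use of log-relevance as per-patch logits and of cosine distance as the similarity shift are assumptions, and the aggregation of the shifts with the adaptive factor $\alpha$ into $U_q(x)$ is left out because its form is not given in this section.

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Sample k distinct indices, favouring large logits (Gumbel top-k trick)."""
    keys = []
    for idx, logit in enumerate(logits):
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        keys.append((logit + gumbel, idx))
    keys.sort(reverse=True)
    return [idx for _, idx in keys[:k]]


def sample_mask(relevance, rho, rng, top=True):
    """Choose which patches to mask: a ratio rho of all patches, biased
    toward high relevance (top=True) or low relevance (top=False)."""
    k = max(1, round(rho * len(relevance)))
    # Assumption: log-relevance acts as the logit of each patch.
    logits = [math.log(max(r, 1e-12)) for r in relevance]
    if not top:
        logits = [-x for x in logits]  # Bottom-k: flip the preference
    return sorted(gumbel_top_k(logits, k, rng))


def similarity_shift(emb_draft, emb_perturbed):
    """Assumed shift measure: 1 - cosine similarity between the encoder
    outputs g(y_hat) and g(y_tilde), passed in as plain vectors."""
    dot = sum(a * b for a, b in zip(emb_draft, emb_perturbed))
    norm_draft = math.sqrt(sum(a * a for a in emb_draft))
    norm_pert = math.sqrt(sum(b * b for b in emb_perturbed))
    return 1.0 - dot / (norm_draft * norm_pert)


def select_expert(u_base, u_by_expert):
    """Return the expert with the largest gain U_q^(j) - U_q^base."""
    gains = {name: u - u_base for name, u in u_by_expert.items()}
    best = max(gains, key=gains.get)
    return best, gains[best]


rng = random.Random(0)
relevance = [0.05, 0.05, 0.45, 0.45]  # toy 4-patch relevance map r(x|q)
top_mask = sample_mask(relevance, rho=0.5, rng=rng, top=True)
bottom_mask = sample_mask(relevance, rho=0.5, rng=rng, top=False)

# Toy utilization scores; real values would come from re-querying the LVLM.
best, gain = select_expert(0.40, {"CLIP": 0.45, "SAM": 0.60, "OCR": 0.30})
```

With these toy scores, `select_expert` picks SAM, whose utilization exceeds the base score by about 0.20. Because the masks are sampled rather than taken by a hard Top-$k$ cut, repeated calls to `sample_mask` can return different patch sets for the same relevance map.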