VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim

TL;DR

This work proposes VisDoT, a framework that grounds visual reasoning in human-like perception of charts and separates each question into perception and logic sub-questions, achieving state-of-the-art chart understanding and interpretable visual reasoning.

Abstract

Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
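
The perception-logic separation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `SubQuestion` type, the `solve_sequentially` helper, the toy chart values, and the `toy_answer` stub (which stands in for a fine-tuned LVLM) are all hypothetical names and data introduced only for illustration. The sketch shows one plausible reading of the idea: perception sub-questions are answered first, and logic sub-questions then operate on those answers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SubQuestion:
    kind: str          # "perception" or "logic" (hypothetical labels)
    text: str
    target: str = ""   # chart element a perception sub-question refers to


Trace = List[Tuple[SubQuestion, float]]


def solve_sequentially(
    subs: List[SubQuestion],
    answer_fn: Callable[[SubQuestion, Trace], float],
) -> Trace:
    """Answer perception sub-questions first, then logic ones.

    Each answer is appended to a running trace so that later
    sub-questions can use earlier results as context.
    """
    # sorted() is stable, so order within each kind is preserved.
    ordered = sorted(subs, key=lambda s: 0 if s.kind == "perception" else 1)
    trace: Trace = []
    for sq in ordered:
        trace.append((sq, answer_fn(sq, trace)))
    return trace


# Toy chart data (assumed): bar heights keyed by category label.
chart = {"2019": 40.0, "2020": 55.0}


def toy_answer(sq: SubQuestion, trace: Trace) -> float:
    """Stub standing in for a model call; reads the toy chart directly."""
    if sq.kind == "perception":
        return chart[sq.target]
    # Logic step: absolute difference of the two perceived values.
    values = [ans for s, ans in trace if s.kind == "perception"]
    return abs(values[1] - values[0])


subs = [
    SubQuestion("logic", "What is the difference between the two bars?"),
    SubQuestion("perception", "What is the bar height for 2019?", "2019"),
    SubQuestion("perception", "What is the bar height for 2020?", "2020"),
]

trace = solve_sequentially(subs, toy_answer)
final_answer = trace[-1][1]
```

The intermediate entries in `trace` are what makes the reasoning inspectable: each perceived value is recorded before any arithmetic is applied to it.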

Paper Structure (50 sections, 2 equations, 21 figures, 15 tables)

Figures (21)

  • Figure 1: Comparison of visual reasoning between existing LVLMs and VisDoT. While existing LVLMs fail to perceive spatial structure, VisDoT leverages decomposition-of-thought (DoT) to accurately infer answers through sequential visual analysis.
  • Figure 2: An overview of our framework.
  • Figure 3: Perception-following QA examples for the four task types in Table \ref{tab:vis-tasks}. Each question is decomposed into perception and logic sub-questions using the DoT prompt, enabling structured and interpretable chart reasoning.
  • Figure 4: The model is guided to decompose a complex visual question into perception and logic sub-questions (Question Decomposition) and generate intermediate reasoning steps sequentially (Problem Solving), enabling structured and interpretable visual inference.
  • Figure 5: Short answer case 1
  • ...and 16 more figures