Do Vision-Language Models Really Understand Visual Language?

Yifan Hou, Buse Giledereli, Yilei Tu, Mrinmaya Sachan

TL;DR

This work interrogates whether large vision-language models truly understand visual diagrams or simply rely on background knowledge as shortcuts. By building a two-part test suite with synthetic and real diagrams, the authors dissect entities versus relations and distinguish knowledge-reliant versus knowledge-free questions. Across multiple LVLMs, results show robust entity recognition but weak relational understanding, with real-diagram performance partly aided by background knowledge, not genuine diagram parsing. Quantitative and qualitative analyses reveal knowledge grounding improves relation recognition but does not yield true relational reasoning, challenging claims of diagram comprehension and highlighting the need for evaluation frameworks that separate perception, symbolic reasoning, and knowledge reliance. The findings call for caution when interpreting diagram-based reasoning benchmarks and have practical implications for safe, reliable diagram understanding in real-world tasks.

Abstract

Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across domains to evaluate the recognition and reasoning abilities of models. Our evaluation of LVLMs shows that while they can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.
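The comparison described in the abstract (entity versus relation questions, with and without usable background knowledge) can be illustrated with a small accuracy breakdown. The sketch below is not the authors' evaluation code: the record fields `target`, `knowledge`, and `correct`, the function name `accuracy_by_group`, and the toy records are all hypothetical and serve only to show how such a split might be computed.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(target, knowledge): accuracy} for a list of graded answers.

    Each record is a dict with hypothetical keys:
      target    -- "entity" or "relation" (what the question asks about)
      knowledge -- "free" or "required" (whether background knowledge helps)
      correct   -- bool, whether the model's answer was graded correct
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = (record["target"], record["knowledge"])
        totals[key] += 1
        correct[key] += int(record["correct"])
    return {key: correct[key] / totals[key] for key in totals}


# Toy, made-up records used only to exercise the function.
# They are NOT results from the paper.
toy_records = [
    {"target": "entity", "knowledge": "free", "correct": True},
    {"target": "entity", "knowledge": "free", "correct": True},
    {"target": "relation", "knowledge": "free", "correct": False},
    {"target": "relation", "knowledge": "free", "correct": True},
    {"target": "relation", "knowledge": "required", "correct": True},
    {"target": "relation", "knowledge": "required", "correct": True},
]

scores = accuracy_by_group(toy_records)

# A large positive gap would suggest that relation accuracy depends on
# background knowledge rather than on reading the diagram itself.
shortcut_gap = scores[("relation", "required")] - scores[("relation", "free")]
print(scores)
print("relation accuracy gap (knowledge-required minus knowledge-free):", shortcut_gap)
```

Reporting the gap alongside the per-group accuracies keeps the knowledge effect visible, where a single aggregate diagram-understanding score would hide it.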

Paper Structure (133 sections, 34 figures, 11 tables)

Figures (34)

  • Figure 1: The responses of GPT-4o to two diagram-related questions reveal a notable pattern. The model struggles to correctly answer the relation question in the simple synthetic diagram, yet it successfully understands the relationship in a complex real diagram. We demonstrate that this pattern occurs consistently (see the results tables for relation questions on synthetic diagrams and for real diagrams).
  • Figure 2: Performance of LVLMs (CoT) on questions about real diagrams of varying complexity (i.e., the number of entities in the diagram, $|\mathcal{V}|$). Results show that models always answer entity questions well but cannot handle relation questions when the diagram is complex.
  • Figure 3: The model response on the example diagram and its variants. Results suggest that the model relies on background knowledge as a shortcut rather than accurately recognizing and reasoning about relations.
  • Figure 4: Representative diagrams from the 6 domains.
  • Figure 5: Accuracies of LVLMs on $Q_S(V \mid \text{KF}, \text{NR})$ and $Q_S(V \mid \text{KF}, \text{NC})$ with entities located in different positions (top row, center row, bottom row, left column, center column, and right column).
  • ...and 29 more figures