Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui

TL;DR

The stage at which linearly separable representations form appears to vary with the type of visual information, and the delayed emergence of edge representations may help explain why large vision-language models struggle with relational understanding, such as interpreting edge directions.

Abstract

Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representations of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.
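To make the notion of "linearly encoded" concrete, the following is a minimal sketch of a linear probe on synthetic feature vectors. It is not the authors' probing setup: the ridge least-squares classifier, the toy data, and the names `fit_linear_probe` and `probe_accuracy` are all assumptions chosen for illustration. It contrasts a label that a linear probe can read out with one that is present in the features but not linearly separable.

```python
import numpy as np

def fit_linear_probe(X, y, l2=1e-3):
    """Fit a ridge-regularized least-squares linear classifier (hypothetical probe)."""
    n_classes = int(y.max()) + 1
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    Y = np.eye(n_classes)[y]                       # one-hot targets
    return np.linalg.solve(Xb.T @ Xb + l2 * np.eye(Xb.shape[1]), Xb.T @ Y)

def probe_accuracy(W, X, y):
    """Accuracy of the linear read-out W on features X with labels y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))

rng = np.random.default_rng(0)
n, d, n_train = 2000, 16, 1500  # toy sizes, not taken from the paper

# Case A: the label shifts the features along one direction (linearly encoded).
y_lin = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
X_lin = rng.normal(size=(n, d)) + np.outer(2 * y_lin - 1, direction)

# Case B: the label is the XOR of two coordinate signs, so the information
# is in the features but no single hyperplane separates the classes.
X_xor = rng.normal(size=(n, d))
y_xor = ((X_xor[:, 0] > 0) ^ (X_xor[:, 1] > 0)).astype(int)

W_lin = fit_linear_probe(X_lin[:n_train], y_lin[:n_train])
W_xor = fit_linear_probe(X_xor[:n_train], y_xor[:n_train])
acc_lin = probe_accuracy(W_lin, X_lin[n_train:], y_lin[n_train:])
acc_xor = probe_accuracy(W_xor, X_xor[n_train:], y_xor[n_train:])
print(f"linearly encoded: {acc_lin:.2f}, XOR-encoded: {acc_xor:.2f}")
```

In this toy setting the probe should score well on Case A and stay near chance on Case B. That is the kind of gap a probing study uses to tell "linearly encoded" apart from "present but not linearly separable".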

Paper Structure (50 sections, 8 equations, 29 figures, 8 tables)

Figures (29)

  • Figure 1: Overview of this study. We analyze internal representations in LVLMs using probing on a synthetic diagram dataset. We find that node information (e.g., node color) and global information (e.g., node count) are linearly encoded in a single image patch within the vision encoder, whereas edge information (e.g., edge color) is linearly encoded in a single text token within the language model.
  • Figure 2: Examples of synthetic diagrams. Each diagram contains five nodes, and we control evaluation aspects such as node color, shape, and edge connectivity. We provide two variants: $\mathcal{D}_{\mathrm{rand}}$, which uses random node layouts (left part), and $\mathcal{D}_{\mathrm{fix}}$, which uses fixed layouts (right part).
  • Figure 3: Layer-wise maximum accuracy in the vision encoder of Qwen3-VL 8B. The x-axis denotes the relative layer position (0 is the input layer and 1 is the final layer), and the y-axis denotes accuracy. Aspects sharing the same threshold are drawn with the same line style, which also matches the style of the corresponding black threshold lines.
  • Figure 4: Position-wise accuracy in the vision encoder of Qwen3-VL 8B. Each heatmap shows accuracy by patch position for a specific layer and aspect. The node layout of the evaluation diagrams is the same as that of the diagrams in the right part of Figure 2.
  • Figure 5: $\mathrm{MaxAcc}_{l}$ per layer in the language model of Qwen3-VL 8B. The x-axis denotes the layer position (0 is the input layer and 1 is the final layer), and the y-axis denotes accuracy. Aspects sharing the same threshold are drawn with the same line style, which also matches the style of the corresponding black threshold lines.
  • ...and 24 more figures
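The captions for Figures 3 to 5 refer to a layer-wise maximum accuracy ($\mathrm{MaxAcc}_{l}$) alongside position-wise accuracy. One plausible reading, stated here as an assumption because the definition is not given in this excerpt, is that $\mathrm{MaxAcc}_{l}$ is the best probe accuracy over all positions (image patches or tokens) at layer $l$. The sketch below illustrates that reading on a made-up accuracy grid; the numbers and the threshold are hypothetical.

```python
import numpy as np

# Hypothetical grid: acc[l, p] is the probe accuracy at layer l, position p.
acc = np.array([
    [0.20, 0.25, 0.22],
    [0.40, 0.90, 0.35],
    [0.55, 0.95, 0.60],
])

# Assumed definition: MaxAcc_l is the maximum accuracy over positions at layer l.
max_acc_per_layer = acc.max(axis=1)
best_position = acc.argmax(axis=1)

# Relative layer position, with 0 as the input layer and 1 as the final layer.
relative_layer = np.arange(acc.shape[0]) / (acc.shape[0] - 1)

# First layer whose MaxAcc reaches a hypothetical threshold.
# Note: np.argmax returns 0 when no layer qualifies, so check .any() first.
threshold = 0.8
above = max_acc_per_layer >= threshold
first_layer_above = int(np.argmax(above)) if above.any() else None
print(max_acc_per_layer, best_position, first_layer_above)
```

Under this reading, a layer-wise curve like those in Figures 3 and 5 summarizes each layer by its single most informative position, while the heatmaps in Figure 4 show the full position-wise grid.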