Seeing Graphs Like Humans: Benchmarking Computational Measures and MLLMs for Similarity Assessment

Seokweon Jung; Jeongmin Rhee; Seoyoung Doh; Hyeon Jeon; Ghulam Jilani Quadri; Jinwook Seo

TL;DR

The results demonstrate that MLLMs, particularly GPT-5, significantly outperform traditional measures in aligning with human graph similarity perception and provide interpretable rationales for their decisions, whereas Claude Sonnet 4.5 shows the best computational efficiency.

Abstract

Comparing graphs to identify similarities is a fundamental task in visual analytics of graph data. To support this, visual analytics systems frequently employ quantitative computational measures to provide automated guidance. However, it remains unclear how well these measures align with subjective human visual perception. Misaligned measures can produce recommendations that conflict with analysts' intuitive judgments, causing confusion rather than reducing cognitive load. Multimodal Large Language Models (MLLMs), capable of visually interpreting graphs and explaining their reasoning in natural language, have emerged as a potential alternative to address this challenge. This paper bridges the gap between human and machine assessment of graph similarity through three interconnected experiments using a dataset of 1,881 node-link diagrams. Experiment 1 collects relative similarity judgments and rationales from 32 human participants, revealing that participants reach consensus on graph similarity and prioritize global shapes and edge densities over exact topological details. Experiment 2 benchmarks 16 computational measures against these human judgments, identifying Portrait divergence as the best-performing metric, though with only moderate alignment. Experiment 3 evaluates the potential of three state-of-the-art MLLMs (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5) as perceptual proxies. The results demonstrate that MLLMs, particularly GPT-5, significantly outperform traditional measures in aligning with human graph similarity perception and provide interpretable rationales for their decisions, whereas Claude Sonnet 4.5 shows the best computational efficiency. Our findings suggest that MLLMs hold significant promise not only as effective, explainable proxies for human perception but also as intelligent guides that can uncover subtle nuances that might be overlooked by human analysts in visual analytics systems.
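To make the idea of benchmarking a measure against relative similarity judgments concrete, the following is a minimal sketch of one plausible scoring scheme. It assumes each trial is a triplet (a query graph and two targets, A and B), that a computational measure yields a distance from the query to each target, and that alignment is scored as the fraction of trials where the measure picks the same target as the human. The abstract does not state how alignment was actually computed, so the names `measure_choice` and `agreement_rate` and the example values are hypothetical.

```python
def measure_choice(dist_query_a, dist_query_b):
    """Return the target ('A' or 'B') that the measure places closer to the query."""
    return "A" if dist_query_a <= dist_query_b else "B"


def agreement_rate(triplets):
    """Fraction of triplets where the measure's pick matches the human pick.

    Each triplet is (dist_query_a, dist_query_b, human_choice),
    with human_choice being 'A' or 'B'.
    """
    if not triplets:
        raise ValueError("at least one triplet is required")
    hits = sum(
        measure_choice(da, db) == human for da, db, human in triplets
    )
    return hits / len(triplets)


# Hypothetical distances and human picks, for illustration only.
example = [
    (0.12, 0.40, "A"),  # measure picks A, human picks A: match
    (0.55, 0.30, "B"),  # measure picks B, human picks B: match
    (0.20, 0.25, "B"),  # measure picks A, human picks B: mismatch
    (0.70, 0.10, "B"),  # measure picks B, human picks B: match
]
print(agreement_rate(example))  # 0.75
```

An agreement rate near 0.5 would indicate chance-level alignment for a two-alternative choice, which is why 0.5 is the natural baseline in this setup.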

Paper Structure (62 sections, 9 figures, 4 tables)

Introduction
Related Work
Computational Graph Comparison
Measure-based Comparison
Human Visual Comparison
Quantifying Visual Perception
MLLMs for Graph Visualization
Methodology
Study Design
Target Graph Specification
Stimuli Generation
Data Sources
Graph Size (Size)
Edge Density (Density)
Visualization Layout (Layout)
...and 47 more sections

Figures (9)

Figure 1: An overview of our research methodology, structured into three interconnected experiments designed to investigate graph comparison capabilities in humans and machines. Experiment 1 assesses human competence in graph similarity assessment through indirect similarity measurements derived from visual perception (RQ1). Experiment 2 computes pairwise similarities using 16 distinct computational graph similarity measures and compares them with human decisions to determine the alignment between humans and machines (RQ2). Experiment 3 evaluates the relative graph similarity assessment capabilities of MLLMs, analyzing both their alignment with human perception and the interpretability of their reasoning (RQ3). Our results demonstrate that MLLMs exhibit a higher alignment with human judgment than computational measures, qualifying them as superior perceptual proxies. Furthermore, by providing interpretable decision rationales, they serve as a more effective method for assisting human analysts in graph comparison tasks.
Figure 2: Four synthetic graph generation algorithms are used for stimuli generation in this study. A) The GNM algorithm randomly places $M$ edges among $N$ nodes. B) The BBA algorithm produces a power-law degree distribution by connecting each new node to existing nodes with high degree [barabasi1999emergence]. C) The NWS algorithm creates a ring over the nodes and connects each node to its $k$ nearest neighbors [newman1999renormalization]. D) The SBM algorithm partitions the nodes into blocks of arbitrary sizes and places edges between pairs of nodes with probabilities determined by their block memberships [holland1983stochastic].
Figure 3: Node-link diagrams of a real-world graph drawn with the three graph layout algorithms used in this study. A) Force-directed layout (Fruchterman-Reingold) creates an aesthetically pleasing layout with uniform edge lengths by simulating a physical system where nodes act as repelling charged particles and edges act as attracting springs [forcedirected]. B) Circular layout positions all nodes equidistantly along the circumference of a circle, providing a structured view that highlights edge density and connectivity patterns across the graph [circular]. C) Multidimensional scaling layout (UMAP) projects the topological structure into a 2D space with the UMAP algorithm, effectively preserving both local neighborhood relationships and the global structural organization [umap].
Figure 4: The system employed in Experiment 1 to collect human graph similarity judgment data. For each question, three node-link diagrams are presented. Participants answer each question in the following order. 1) The participant selects the target graph that seems more similar to the central query graph. 2) Next, they choose their decision criteria from the listed options and explain them. 3) They then indicate their confidence in their choice. 4) The entire process must be completed within one minute. If additional clarification on the criteria is needed, the participant can press the Help button to review the explanation.
Figure 5: Accuracy distribution of participants in Experiment 1. The red dashed line represents the random chance level (0.5). Each dot represents an individual participant. Based on a one-sample t-test ($H_0\colon \mu = 0.5$) for each participant, blue dots denote those who exhibited accuracy levels significantly above chance (28 out of 32, $p < .05$), while gray dots represent those whose performance was statistically indistinguishable from chance. These results demonstrate a robust human capacity to visually distinguish graph similarities.
...and 4 more figures
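
As a rough illustration of two building blocks named in the captions above, the GNM generator (Figure 2A) and the circular layout (Figure 3B) can be sketched in plain Python. The function names `gnm_edges` and `circular_layout` and the `seed` and `radius` parameters are assumptions made for this sketch; the paper's own stimuli pipeline is not shown on this page and may differ.

```python
import math
import random
from itertools import combinations


def gnm_edges(n, m, seed=None):
    """Draw m distinct undirected edges uniformly at random among n nodes."""
    rng = random.Random(seed)
    possible = list(combinations(range(n), 2))  # all n*(n-1)/2 node pairs
    if m > len(possible):
        raise ValueError("m exceeds the number of possible edges")
    return set(rng.sample(possible, m))


def circular_layout(n, radius=1.0):
    """Place n nodes equidistantly along a circle centered at the origin."""
    return {
        i: (
            radius * math.cos(2 * math.pi * i / n),
            radius * math.sin(2 * math.pi * i / n),
        )
        for i in range(n)
    }


# Example: a 10-node graph with 15 random edges, drawn on a unit circle.
edges = gnm_edges(10, 15, seed=0)
positions = circular_layout(10)
segments = [(positions[u], positions[v]) for u, v in edges]
```

Each entry of `segments` is a pair of 2D endpoints, which is all a plotting library would need to render the node-link diagram.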
