Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark

Evan M. Williams; Kathleen M. Carley

TL;DR

This work defines zero-shot Visual Network Analysis (VNA) and introduces a public benchmark for evaluating multimodal models on visual graph reasoning. It assesses GPT-4 (via API) and LLaVa on five tasks spanning degree centrality, structural balance, and component counting, using synthetic, high-resolution graph images. GPT-4 substantially outperforms LLaVa, but both models struggle with basic VNA tasks: GPT-4 reaches about 67% accuracy on isolate counting, and performance on structural balance is near chance. By releasing the data and ground-truth labels, the paper provides a baseline and a resource for future research on multimodal reasoning for graph analytics.

Abstract

We evaluate the zero-shot ability of GPT-4 and LLaVa to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs. We test these Vision Language Models (VLMs) on five tasks related to three foundational network science concepts: identifying nodes of maximal degree on a rendered graph, identifying whether signed triads are balanced or unbalanced, and counting components. The tasks are designed to be easy for a human who understands the underlying graph-theoretic concepts, and all can be solved by counting the appropriate elements in the graphs. We find that while GPT-4 consistently outperforms LLaVa, both models struggle with every visual network analysis task we propose. We publicly release the first benchmark for the evaluation of VLMs on foundational VNA tasks.
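To make the "solved by counting" claim concrete, the following is a minimal sketch of how ground-truth answers for the maximum degree and component-counting tasks could be computed from an edge list. It is not the authors' generation or labeling code; the function names and the toy graph are hypothetical, and the graph is assumed to be simple and undirected with at least one edge.

```python
from collections import defaultdict


def max_degree_nodes(edges):
    """Return the sorted list of nodes tied for the highest degree.

    Assumes a simple undirected graph with at least one edge.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    top = max(degree.values())
    return sorted(node for node, d in degree.items() if d == top)


def count_components(nodes, edges):
    """Count connected components (isolates included) with an iterative DFS."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        # Each unseen start node opens a new component.
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


# Hypothetical toy graph: a star around "A", a separate pair, and one isolate.
toy_nodes = ["A", "B", "C", "D", "E", "F", "G"]
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("E", "F")]

top_nodes = max_degree_nodes(toy_edges)
n_components = count_components(toy_nodes, toy_edges)
```

In this toy graph, node "A" has the highest degree and there are three components: the star, the pair, and the isolate "G". A human reading the rendered image would reach the same answers by counting edges per node and counting disconnected groups.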

Paper Structure (18 sections, 3 figures, 3 tables)

This paper contains 18 sections, 3 figures, 3 tables.

Introduction
Related Works
Methods
Maximum Degree Tasks
Structural Balance Task
Component Tasks
Data
Maximum Degree Graph Generation
Structural Balance Graph Generation
Component Graph Generation
Results
Maximum Degree Task Results
Structural Balance Task Results
Component Task Results
Discussion and Limitations
...and 3 more sections

Figures (3)

Figure 1: Degree Task Graph Examples with letter (left) and numeric (right) node IDs.
Figure 2: Triadic Balance Examples. The top row contains a sample of balanced triads; the bottom row contains a sample of unbalanced triads. 'b' denotes the number of like (blue) relationships in each group.
Figure 3: Components Example Graphs. Read from left to right and top to bottom, these graphs contain 4, 5, 6, and 7 components, respectively, and 0, 1, 2, and 3 isolates, respectively.
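
The quantities named in the Figure 2 and Figure 3 captions can be expressed as short checks. The sketch below assumes the classical definition of structural balance, under which a triad is balanced when the product of its three edge signs is positive, that is, when it has an even number of negative edges. Here `b` plays the role of the caption's count of like (blue) edges; the function names are hypothetical and this is not the benchmark's own labeling code.

```python
def is_balanced_from_b(b):
    """Balance of a signed triad given b, the number of positive (like) edges.

    Assumes classical structural balance: the triad is balanced when the
    number of negative edges, 3 - b, is even.
    """
    if b not in (0, 1, 2, 3):
        raise ValueError("a triad has exactly three edges, so b must be 0..3")
    return (3 - b) % 2 == 0


def is_balanced(signs):
    """Same rule from explicit edge signs (+1 for like, -1 for dislike)."""
    product = 1
    for sign in signs:
        product *= sign
    return product > 0


def count_isolates(nodes, edges):
    """Count nodes that appear in no edge (assumes no self-loops)."""
    touched = {node for edge in edges for node in edge}
    return sum(1 for node in nodes if node not in touched)


# Balance label for every possible value of b.
balance_by_b = {b: is_balanced_from_b(b) for b in range(4)}

# Hypothetical graph with one connected pair and two isolates.
n_isolates = count_isolates(["A", "B", "C", "D"], [("A", "B")])
```

Under this assumption, triads with one or three like edges are balanced and triads with zero or two like edges are unbalanced. Every isolate is also its own component, which is why the component and isolate counts in Figure 3 rise together.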
