Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark
Evan M. Williams, Kathleen M. Carley
TL;DR
This work defines zero-shot Visual Network Analysis (VNA) and introduces a public benchmark for evaluating multimodal models on visual graph reasoning. It assesses GPT-4 (via API) and LLaVa on five tasks spanning degree centrality, structural balance, and component counting, using synthetic, high-resolution graph images. GPT-4 substantially outperforms LLaVa, but both models struggle with basic VNA tasks: reported results include roughly 67% accuracy for GPT-4 on isolate counting and near-random performance on structural balance, underscoring how difficult visual graph reasoning remains in zero-shot settings. By releasing the data and ground-truth labels, the paper provides a baseline and a resource for future research on multimodal reasoning for graph analytics.
Abstract
We evaluate the zero-shot ability of GPT-4 and LLaVa to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs. We test these Vision Language Models (VLMs) on five tasks related to three foundational network science concepts: identifying nodes of maximal degree on a rendered graph, identifying whether signed triads are balanced or unbalanced, and counting components. The tasks are designed to be easy for a human who understands the underlying graph-theoretic concepts, and all of them can be solved by counting the appropriate elements in a graph. We find that while GPT-4 consistently outperforms LLaVa, both models struggle with every visual network analysis task we propose. We publicly release the first benchmark for the evaluation of VLMs on foundational VNA tasks.
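To make the three concepts concrete, the following minimal sketch shows how ground-truth answers of this kind could be computed from an edge list by counting. It is an illustration only, not the benchmark's own labeling code: the function names, the input format, and the toy graph are all assumptions introduced here. The balance check uses the standard structural balance rule that a signed triad is balanced when the product of its three edge signs is positive.

```python
from collections import defaultdict


def max_degree_nodes(nodes, edges):
    """Return the nodes with the highest degree (hypothetical helper)."""
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    top = max(degree.values())
    return sorted(n for n, d in degree.items() if d == top)


def is_balanced_triad(signs):
    """A signed triad is balanced when the product of its edge signs is positive.

    `signs` holds three values, each +1 (positive tie) or -1 (negative tie).
    """
    product = 1
    for s in signs:
        product *= s
    return product > 0


def count_components(nodes, edges):
    """Return (number of connected components, number of isolates)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = 0
    isolates = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        if not adjacency[start]:
            isolates += 1  # a node with no edges is its own component
        # Depth-first traversal marks every node reachable from `start`.
        stack = [start]
        seen.add(start)
        while stack:
            current = stack.pop()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components, isolates


# Toy graph (assumed for illustration): one 4-node component and two isolates.
toy_nodes = [1, 2, 3, 4, 5, 6]
toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3)]

top_nodes = max_degree_nodes(toy_nodes, toy_edges)
n_components, n_isolates = count_components(toy_nodes, toy_edges)
```

On the toy graph, node 1 has the highest degree, and the two edgeless nodes each count as both an isolate and a separate component. Each answer follows from counting edges, signs, or reachable nodes, which is the sense in which the tasks are simple for a human who knows the definitions.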
