Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark

Evan M. Williams, Kathleen M. Carley

TL;DR

This work defines zero-shot Visual Network Analysis (VNA) and introduces a public benchmark for evaluating multimodal models on visual graph reasoning. It evaluates GPT-4 (via the API) and LLaVa on five tasks spanning degree centrality, structural balance, and component counting, using synthetic, high-resolution graph images. GPT-4 substantially outperforms LLaVa, but both models struggle with basic VNA tasks: GPT-4 reaches roughly 67% accuracy on isolate counting, and performance on structural balance is close to chance. By releasing the data and ground-truth labels, the paper provides a baseline and a resource for future work on multimodal reasoning for graph analytics.

Abstract

We evaluate the zero-shot ability of GPT-4 and LLaVa to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs. We evaluate these Vision Language Models (VLMs) on five tasks related to three foundational network science concepts: identifying nodes of maximal degree on a rendered graph, identifying whether signed triads are balanced or unbalanced, and counting components. The tasks are structured to be easy for a human who understands the underlying graph-theoretic concepts, and all of them can be solved by counting the appropriate elements in the graphs. We find that while GPT-4 consistently outperforms LLaVa, both models struggle with every visual network analysis task we propose. We publicly release the first benchmark for the evaluation of VLMs on foundational VNA tasks.
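
To make the "counting" framing concrete, the following is a minimal sketch of how ground-truth answers for two of the three concepts could be computed from an edge list. It is an illustration under assumptions, not the authors' benchmark code: the function names `max_degree_nodes` and `is_balanced_triad`, the edge-list input, and the +1/-1 sign encoding are all hypothetical. The balance rule used is the standard one from structural balance theory, where a triad is balanced when the product of its three edge signs is positive.

```python
from collections import Counter


def max_degree_nodes(edges):
    """Return the sorted nodes of highest degree in an undirected simple graph.

    `edges` is an iterable of (u, v) pairs; this input format is an
    assumption made for illustration, not the benchmark's own format.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    if not degree:
        return []
    top = max(degree.values())
    return sorted(node for node, d in degree.items() if d == top)


def is_balanced_triad(signs):
    """Return True if a signed triad is balanced.

    `signs` holds the three edge signs, +1 (like) or -1 (dislike).
    This encoding is hypothetical. A triad is balanced when the product
    of its signs is positive, i.e. it has an even number of negative edges.
    """
    if len(signs) != 3 or any(s not in (1, -1) for s in signs):
        raise ValueError("expected exactly three signs, each +1 or -1")
    return signs[0] * signs[1] * signs[2] > 0


if __name__ == "__main__":
    edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
    print(max_degree_nodes(edges))         # ['A']
    print(is_balanced_triad([1, -1, -1]))  # True
    print(is_balanced_triad([1, 1, -1]))   # False
```

Both functions reduce to counting: edge endpoints per node in the first case, negative edges per triad in the second.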

Paper Structure (18 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Degree Task Graph Examples with letter (left) and numeric (right) node IDs.
  • Figure 2: Triadic Balance Examples. The top row contains a sample of balanced triads; the bottom row contains a sample of unbalanced triads. 'b' denotes the number of like (blue) relationships in each group.
  • Figure 3: Components Example Graphs. Read from left to right and top to bottom, these graphs contain 4, 5, 6, and 7 components, respectively, and 0, 1, 2, and 3 isolates, respectively.
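
The component and isolate counts described for Figure 3 can be reproduced with a short graph traversal. The sketch below is a hedged illustration, not the authors' code: the function name `count_components_and_isolates` and the node-list plus edge-list input are assumptions, and the graph is assumed to be undirected and simple (no self-loops). Note that an isolate is itself a component of size one, so it contributes to both counts.

```python
def count_components_and_isolates(nodes, edges):
    """Count connected components and isolated nodes in an undirected graph.

    `nodes` is a list of node IDs and `edges` a list of (u, v) pairs whose
    endpoints all appear in `nodes`. Both are hypothetical input formats
    chosen for illustration.
    """
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        # Each unseen node starts a new component; explore it depth-first.
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)

    # An isolate is a node with no incident edges.
    isolates = sum(1 for node in nodes if not adjacency[node])
    return components, isolates


if __name__ == "__main__":
    nodes = [1, 2, 3, 4, 5, 6, 7]
    edges = [(1, 2), (2, 3), (4, 5)]
    print(count_components_and_isolates(nodes, edges))  # (4, 2)
```

In the usage example, the components are {1, 2, 3}, {4, 5}, {6}, and {7}, of which the last two are isolates.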