Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners

Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca

TL;DR

This paper examines the robustness and generalization of LLM-based graph reasoners under variations in graph representations. It introduces a principled decomposition of serializations into node labeling, edge encoding, and syntax, and evaluates both fine-tuned (G1) and non-fine-tuned models on the Erdős benchmark plus a spectral task suite. Key findings: larger, non-fine-tuned models are more robust to serialization variations; fine-tuning improves node-label invariance mainly through better reasoning rather than true invariance, and can increase sensitivity to encoding and formatting; generalization to unseen spectral tasks remains inconsistent. The work highlights the need for invariance-aware training and careful benchmark design to build reliable LLM-based graph-reasoning systems, with implications for both model development and evaluation protocols.

Abstract

While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well as generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
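
To make the three factors concrete, the following is a minimal sketch of how one graph can yield several different text serializations. It is illustrative only and is not the paper's benchmark code: the function names (`relabel`, `reorder`, `serialize`, `canonical`), the two encodings, and the `arrow` separator are all assumptions introduced here.

```python
import random


def relabel(edges, perm):
    """Node labeling: rename every node through a permutation (hypothetical helper)."""
    return [(perm[u], perm[v]) for u, v in edges]


def reorder(edges, seed):
    """Shuffle the order in which edges are listed; the graph itself is unchanged."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def serialize(edges, encoding="edge_list", arrow="-"):
    """Render an undirected graph as text.

    `encoding` stands in for the edge-encoding factor and `arrow` for the
    syntax factor; both option sets are assumptions made for this sketch.
    """
    if encoding == "edge_list":
        return "\n".join(f"{u} {arrow} {v}" for u, v in edges)
    if encoding == "adjacency":
        nbrs = {}
        for u, v in edges:
            nbrs.setdefault(u, []).append(v)
            nbrs.setdefault(v, []).append(u)
        return "\n".join(
            f"{u}: {' '.join(map(str, ns))}" for u, ns in sorted(nbrs.items())
        )
    raise ValueError(f"unknown encoding: {encoding}")


def canonical(edges):
    """Order- and orientation-independent view of an undirected edge set."""
    return frozenset(frozenset(e) for e in edges)


edges = [(0, 1), (1, 2), (2, 3)]
perm = {0: 2, 1: 0, 2: 3, 3: 1}

variants = {
    "original": serialize(edges),
    "relabeled": serialize(relabel(edges, perm)),
    "reordered": serialize(reorder(edges, seed=0)),
    "adjacency": serialize(edges, encoding="adjacency"),
    "other syntax": serialize(edges, arrow="--"),
}

for name, text in variants.items():
    print(f"[{name}]\n{text}\n")
```

All five strings describe the same path graph up to isomorphism, so an invariant reasoner would give the same answer to a label-independent question (for example, the number of edges) on each of them. A sequence model sees five different token sequences, which is the source of the sensitivity studied here.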

Paper Structure

This paper contains 33 sections, 20 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Node labeling, edge encoding, and syntax.
  • Figure 2: Average per-example output spreads normalized by per-task answer range (top), resp. accuracy (bottom).
  • Figure 3: Ablation of (a) structure, (b) reordering, and (c) replicating undirected edges.
  • Figure 4: Average model accuracy by encoding.
  • Figure 5: Accuracy difference between G1 and Qwen evaluated at temperature $0.06$ (as reported by Guo et al., 2025) and the deterministic version at temperature $0$ used in our analysis.
  • ...and 11 more figures