
Instruction set for the representation of graphs

Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez

TL;DR

Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.

Abstract

We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest among the shortest strings across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
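
The string-space distance used in the evaluation is the ordinary Levenshtein distance, so it can be illustrated without reference to the virtual machine. The following is a minimal sketch of the textbook dynamic-programming algorithm with unit costs, not the authors' implementation. The function name `levenshtein` is an assumption, and the example strings are arbitrary strings over the nine-character alphabet, not encodings of specific benchmark graphs.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (substitution, insertion, deletion)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] = distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


# Illustrative strings over the alphabet {N,n,P,p,V,v,C,c,W} (hypothetical inputs).
print(levenshtein("VvNV", "VvNV"))  # 0: identical strings
print(levenshtein("VvNV", "VvNv"))  # 1: one substitution
print(levenshtein("VvNV", "VNV"))   # 1: one deletion
```

Because every string over the alphabet decodes to a valid graph, any pair of strings is a legitimate input to this distance.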

Paper Structure

This paper contains 39 sections, 7 equations, 3 figures, 2 tables, 2 algorithms.

Figures (3)

  • Figure 1: Aggregated correlation between graph edit distance (GED) and Levenshtein distance across all five benchmark datasets. Each cell at integer coordinates $(i, j)$ shows the count of graph pairs with $\text{GED} = i$ and $\text{Lev} = j$ (log scale; light = few pairs, dark = many pairs); white cells contain no observed pairs. Dashed grey line: identity ($\text{Lev} = \text{GED}$). Solid red line: ordinary least-squares (OLS) regression. (a) Canonical encoding ($n = 3,424,764$ pairs, $\rho = 0.700$, $\beta = 0.79$). (b) Greedy-min encoding ($n = 3,424,764$ pairs, $\rho = 0.665$, $\beta = 0.78$). (c) Greedy-rnd($v_0$) encoding ($n = 3,424,764$ pairs, $\rho = 0.590$, $\beta = 0.82$). Reported statistics: $\rho$ denotes Spearman's rank correlation coefficient, measuring monotonic association between the two distance measures. $\beta$ denotes the OLS regression slope; $\beta = 1$ would indicate that Levenshtein and GED operate on the same scale, while $\beta < 1$ indicates that Levenshtein distances grow more slowly than GED.
  • Figure 2: Empirical time complexity of IsalGraph encoding methods on random graphs (Barabási--Albert $m \in \{1,2\}$ and Erdős--Rényi $p \in \{0.3, 0.5\}$). Horizontal axis: number of nodes $n$; vertical axis: encoding time in seconds (log scale). Markers show the median across graph instances; error bars denote the interquartile range. Dashed lines are polynomial fits $T = c \cdot n^{\alpha}$ via OLS on log--log data. Greedy-rnd($v_0$): $\alpha = 3.1$, $R^2 = 0.989$. Greedy-min: $\alpha = 4.5$, $R^2 = 0.989$. Canonical: $\alpha = 9.0$, $R^2 = 0.979$. Greedy methods exhibit polynomial scaling ($\alpha \approx 3$--$5$), while the canonical method scales super-polynomially ($\alpha \approx 9$) on random graphs and becomes infeasible beyond $n \approx 12$.
  • Figure 3: Neighbourhood topology of the house graph $G_0$ (5 nodes, 6 edges) under two distance metrics. Centre column: base graph $G_0$ with its canonical IsalGraph encoding (colour-coded by instruction type). Top rows: 4 representative 1-GED neighbours (single edge edit), with Levenshtein distances $\mathrm{Lev} \in [1,\, 5]$ to the encoding of $G_0$. Bottom rows: 4 representative 1-Levenshtein neighbours (single character substitution, insertion, or deletion in the instruction string), with GED values $\mathrm{GED} \in [1,\, 2]$. Dashed red edges indicate structural differences from $G_0$. Horizontal heatmaps below each graph render the IsalGraph instruction string with per-character colouring (alphabet $\Sigma = \{N,n,P,p,V,v,C,c,W\}$). The asymmetry between 1-GED and 1-Levenshtein neighbourhoods illustrates that graph-space proximity does not imply string-space proximity, and vice versa.
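
The scaling exponents in Figure 2 come from fitting $T = c \cdot n^{\alpha}$ by OLS on log--log data, that is, regressing $\log T$ on $\log n$ so that the slope is $\alpha$ and the intercept is $\log c$. The sketch below shows one way such a fit could be computed with the standard library only. The function name `fit_power_law` and the timing values are assumptions made up for illustration; they are not the paper's code or measurements.

```python
import math


def fit_power_law(ns, times):
    """Fit T = c * n**alpha by OLS on (log n, log T); return (c, alpha, r2)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in times]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Closed-form simple linear regression: slope = Sxy / Sxx.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    log_c = mean_y - alpha * mean_x
    # Coefficient of determination, computed in log-log space.
    ss_res = sum((y - (log_c + alpha * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(log_c), alpha, r2


# Synthetic timings that follow an exact cubic law (hypothetical data).
ns = [4, 6, 8, 10, 12]
times = [2e-6 * n ** 3 for n in ns]
c, alpha, r2 = fit_power_law(ns, times)
print(f"c={c:.3g}, alpha={alpha:.3f}, R^2={r2:.4f}")
```

On the synthetic cubic data the fitted exponent comes out at approximately 3. The same slope formula, applied to untransformed (GED, Levenshtein) pairs, yields an OLS slope of the kind reported as $\beta$ in Figure 1.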

Theorems & Definitions (11)

  • Definition 2.1: Interpreter state
  • Remark 2.2
  • Example 2.3: Decoding VvNV
  • Definition 2.4: Sorted displacement pairs
  • Remark 2.5: Reachability precondition
  • Remark 2.6: String length decomposition
  • Definition 2.7: Canonical string
  • Conjecture 2.8: Canonical string as complete graph invariant
  • Remark 2.9: Relation to graph isomorphism
  • Definition 2.10: Levenshtein distance on IsalGraph strings
  • ...and 1 more