SCHENO: Measuring Schema vs. Noise in Graphs

Justus Isaiah Hibshman, Adnan Hoq, Tim Weninger

TL;DR

SCHENO introduces a principled, goal-agnostic metric for decomposing graphs into a schema (pattern) and noise, balancing symmetry-driven structure with random chaos. It formalizes a two-stage generative process in which a schema graph is drawn from a symmetry-rich distribution and noise is added via an Erdős-Rényi-like process, with a log-score guiding optimization. The authors derive a principled method to set the noise probability p, compare SCHENO against several graph-mining models, and demonstrate that SCHENO-guided decompositions can uncover diverse, meaningful patterns across synthetic and real-world data. This framework provides a general tool for pattern discovery in graphs and suggests directions for more powerful, scalable algorithms beyond traditional tasks.

Abstract

Real-world data is typically a noisy manifestation of a core pattern (schema), and the purpose of data mining algorithms is to uncover that pattern, thereby splitting (i.e., decomposing) the data into schema and noise. We introduce SCHENO, a principled evaluation metric for the goodness of a schema-noise decomposition of a graph. SCHENO captures how schematic the schema is, how noisy the noise is, and how well the combination of the two represents the original graph data. We visually demonstrate what this metric prioritizes in small graphs, then show that if SCHENO is used as the fitness function for a simple optimization strategy, we can uncover a wide variety of patterns. Finally, we evaluate several well-known graph mining algorithms with this metric; we find that although they produce patterns, those patterns are not always the best representation of the input data.

Paper Structure (26 sections, 15 equations, 14 figures, 2 tables, 1 algorithm)

Figures (14)

  • Figure 1: Example of the Automorphism Orbit and Stabilizer of a Set of Edges: On the right, each distinct set of edges in the automorphism orbit is shown in a unique color and labeled with a letter. The multi-edges on the right are shown solely to indicate that the single edges on the left participate in multiple edge sets in the orbit. Note that sets such as $\{(1, 5), (6, 7)\}$ are not part of the example orbit. At the bottom, we can see the two automorphisms in the stabilizer listed; note that the only difference between the two automorphisms is that they swap the positions of nodes $3$ and $4$; everything else is constrained by the stabilized edge set.
  • Figure 2: Example of Schema Distribution: The probability that the schema distribution assigns to a graph is proportional to the amount of symmetry (structure) in the graph.
  • Figure 3: Scoring (Schema, Noise) Decompositions: Given a graph $G$, this figure shows five decompositions of $G$ into Hypothesis Graph (i.e., Schema) and Noise. The value $p$ is the noise probability -- i.e., the probability that any edge or non-edge in $G$ used to be a non-edge or edge, respectively. To learn how $p$ is calculated as a function of $n$, see Section \ref{sec:objective}. Black edges are present in both $G$ and $H_i$ but not in $N_i$. Blue dashed edges are the added edges -- present in $H_i$ and $N_i$ but not in $G$. Red dashed edges are the deleted edges -- present in $G$ and $N_i$ but not in $H_i$. In all cases, $G = H_i \oplus N_i$. The values below the graphs are proportional to the scores. The second option gets the highest score; with just one added edge, it manages to greatly increase the amount of symmetry and make the noise equivalent to all other edges in the graph. The third option has the most symmetry and the largest set of equivalent noise arrangements, but it incurs a heavy cost for making so many edits and thus gets a very low score. Finally, note that in the last two candidates, the symmetry gain in the graphs is the same (2 automorphisms) but in one case the noise is more probable.
  • Figure 4: Performance of Graph Isomorphism Network (GIN): This figure shows how well a GIN decomposes various graphs. The $y$-axis shows how much more likely SCHENO says GIN's decomposition is than the trivial decomposition (Schema = Graph, Noise = $\emptyset$). The $x$-axis expresses how many edges are in the schema as a fraction of the number of edges in the original graph; we show schemas ranging from 0 edges to $2|E|$ edges. The "Random" decompositions are obtained by randomly adding and removing as many edges as the GIN added and removed from the original graph when the GIN obtained its schema. Note that GIN and Random always converge when the schema size is zero because there is only a single 0-edge schema (the empty graph).
  • Figure 5: Transformations of Small Graphs -- The results shown are mostly meant to be intuitive, but some (b and f) took us by surprise, and we leave them here. In the case of (f), our instinct to see human figures blinded us to the actual highly-symmetric structures nearby, structures that SCHENO GA finds.
  • ...and 9 more figures
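
The decomposition convention in the Figure 3 caption, $G = H_i \oplus N_i$ with a per-pair noise probability $p$, can be illustrated with a small Python sketch. It is not the SCHENO score itself: it only computes the noise set as a symmetric difference, the log-probability of that one noise set under independent flips, and a brute-force automorphism count as a rough symmetry measure. The function names, the restriction to undirected graphs without self-loops, and the value p = 0.1 are assumptions made for illustration; the paper derives $p$ from $n$ and also accounts for equivalent noise arrangements, which this sketch omits.

```python
from itertools import permutations
from math import log


def norm(edges):
    """Store undirected edges as sorted tuples so (u, v) == (v, u)."""
    return {tuple(sorted(e)) for e in edges}


def decompose(g_edges, h_edges):
    """Return the noise set N such that G = H xor N (symmetric difference)."""
    return norm(g_edges) ^ norm(h_edges)


def noise_log_prob(n, noise, p):
    """Log-probability of one specific noise set, assuming each of the
    n*(n-1)/2 node pairs is flipped independently with probability p."""
    pairs = n * (n - 1) // 2
    k = len(noise)
    return k * log(p) + (pairs - k) * log(1 - p)


def count_automorphisms(n, edges):
    """Brute-force count of node permutations that preserve the edge set.
    Only practical for very small n (n! permutations are checked)."""
    edges = norm(edges)
    count = 0
    for perm in permutations(range(n)):
        mapped = {tuple(sorted((perm[u], perm[v]))) for u, v in edges}
        if mapped == edges:
            count += 1
    return count


# Hypothetical example: G is a 4-node path, the candidate schema H is a 4-cycle.
G = [(0, 1), (1, 2), (2, 3)]
H = [(0, 1), (1, 2), (2, 3), (0, 3)]

N = decompose(G, H)                 # the single added edge (0, 3)
print(N)
print(count_automorphisms(4, G))    # 2: identity and the end-to-end flip
print(count_automorphisms(4, H))    # 8: the symmetries of a square
print(noise_log_prob(4, N, 0.1))    # assumed p, not the paper's derived value
```

In this toy case, one added edge raises the automorphism count from 2 to 8 at the cost of a single flipped pair, which mirrors the trade-off between symmetry gained and edits made that the Figure 3 caption describes.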