A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management

Ashwin Ganesan

Abstract

Entity resolution -- identifying database records that refer to the same real-world entity -- is naturally modelled on bipartite graphs connecting entity nodes to their attribute values. Applying a message-passing neural network (MPNN) with all available extensions (reverse message passing, port numbering, ego IDs) incurs unnecessary overhead, since different entity resolution tasks have fundamentally different complexity. For a given matching criterion, what is the cheapest MPNN architecture that provably works? We answer this with a four-theorem separation theory on typed entity-attribute graphs. We introduce co-reference predicates $\mathrm{Dup}_r$ (two same-type entities share at least $r$ attribute values) and the $\ell$-cycle predicate $\mathrm{Cyc}_\ell$ for settings with entity-entity edges. For each predicate we prove tight bounds -- constructing graph pairs provably indistinguishable by every MPNN lacking the required adaptation, and exhibiting explicit minimal-depth MPNNs that compute the predicate on all inputs. The central finding is a sharp complexity gap between detecting any shared attribute and detecting multiple shared attributes. The former is purely local, requiring only reverse message passing in two layers. The latter demands cross-attribute identity correlation -- verifying that the same entity appears at several attributes of the target -- a fundamentally non-local requirement needing ego IDs and four layers, even on acyclic bipartite graphs. A similar necessity holds for cycle detection. Together, these results yield a minimal-architecture principle: practitioners can select the cheapest sufficient adaptation set, with a guarantee that no simpler architecture works. Computational validation confirms every prediction.
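The co-reference predicate described above can be made concrete with a small, non-neural reference implementation. This is a hedged sketch, not code from the paper: the graph encoding (each entity as a set of typed (edge-type, attribute-value) pairs, plus an entity-type map) and all names are illustrative assumptions.

```python
# Illustrative ground-truth check for the Dup_r co-reference predicate:
# Dup_r(u) holds iff some OTHER entity v of the same type as u shares
# at least r attribute values with u. Encoding is an assumption, not
# taken from the paper.

def dup_r(entities, types, u, r):
    """True iff another same-type entity shares >= r
    (edge_type, attribute_value) pairs with entity u."""
    for v, attrs in entities.items():
        if v == u or types[v] != types[u]:
            continue
        if len(entities[u] & attrs) >= r:
            return True
    return False

# Hypothetical example data.
entities = {
    "e1": {("hasEmail", "smith@x.com"), ("hasPhone", "555-0101")},
    "e2": {("hasEmail", "smith@x.com"), ("hasPhone", "555-0101")},
    "e3": {("hasEmail", "jones@y.com")},
}
types = {"e1": "Person", "e2": "Person", "e3": "Person"}

print(dup_r(entities, types, "e1", 1))  # True: e2 shares an email
print(dup_r(entities, types, "e1", 2))  # True: e2 shares two values
print(dup_r(entities, types, "e3", 1))  # False: nothing shared
```

The sharp gap the paper proves is between computing $\mathrm{Dup}_1$ and $\mathrm{Dup}_r$ for $r \ge 2$ *within the MPNN framework*; the direct check above is only the target semantics, not a message-passing computation.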


Paper Structure

This paper contains 20 sections, 5 theorems, 55 equations, 6 figures, and 4 tables.

Key Result

Theorem 3

On simple typed entity-attribute graphs, reverse message passing is necessary and sufficient for computing $\mathrm{Dup}(u)$ in the MPNN framework.
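The sufficiency direction of Theorem 3 can be illustrated with a two-round scheme that uses the reverse edge direction. This is a hedged, non-neural sketch of the kind of computation the theorem says a two-layer MPNN with reverse message passing can realise; the graph encoding (directed edges entity → attribute) and all names are assumptions for illustration.

```python
# Two-round sketch for Dup(u): "entity u shares at least one attribute
# value with another entity of the same type". Round 1 uses the forward
# direction (entity -> attribute); round 2 uses the REVERSE direction
# (attribute -> entity), which is exactly the adaptation Theorem 3 requires.
from collections import Counter

def dup_two_rounds(edges, types):
    """edges: list of (entity, attribute) pairs; types: entity -> type.
    Returns a dict {entity: Dup flag}."""
    # Round 1 (forward): each attribute node aggregates the multiset of
    # types of its incident entities.
    incident = {}
    for e, a in edges:
        incident.setdefault(a, Counter())[types[e]] += 1
    # Round 2 (reverse): each attribute sends its type-multiset back; an
    # entity flags Dup if some neighbouring attribute saw >= 2 entities
    # of that entity's own type (itself plus at least one other).
    dup = {e: False for e in types}
    for e, a in edges:
        if incident[a][types[e]] >= 2:
            dup[e] = True
    return dup

# Hypothetical example: e1 and e2 share smith@x.com, e3 shares nothing.
edges = [("e1", "smith@x.com"), ("e2", "smith@x.com"), ("e3", "jones@y.com")]
types = {"e1": "Person", "e2": "Person", "e3": "Person"}
print(dup_two_rounds(edges, types))  # e1, e2 -> True; e3 -> False
```

Without reverse message passing, attributes can receive in round 1 but cannot report back in round 2, which is the intuition behind the necessity half of the theorem.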

Figures (6)

  • Figure 1: A typed entity-attribute graph. Entities $u$ and $v$ share a common attribute node smith@x.com via edge of type hasEmail.
  • Figure 2: Separation graphs for the necessity proof of Theorem \ref{thm:K21-simple}.
  • Figure 3: Separation graphs for the necessity proof of Theorem \ref{thm:K21-multigraph}.
  • Figure 4: The simple typed entity-attribute graphs $G_1$ (left) and $G_2$ (right) from Example \ref{ex:K22-indistinguishable}.
  • Figure 5: A port numbering for the simple typed entity-attribute graphs $G_1$ and $G_2$ from Example \ref{ex:K22-indistinguishable}. Red numbers near the source of an edge give $p_{\mathrm{out}}$; blue numbers near the destination of an edge give $p_{\mathrm{in}}$. Both port numberings are valid: each entity assigns distinct outgoing ports $\{1,2\}$ to its two edges, and each attribute receives distinct incoming ports $\{1,2\}$ from its two predecessors. Under these port numberings, every MPNN without ego IDs assigns $u$ the identical embedding in both graphs.
  • ...and 1 more figure

Theorems & Definitions (25)

  • Definition 1
  • Remark 2: Bounded-cardinality assumption
  • Theorem 3
  • Proof of part (a): necessity
  • Proof of part (b): sufficiency
  • Remark 4: Depth minimality
  • Theorem 5
  • Proof of part (a): necessity
  • Proof of part (b): sufficiency
  • Remark 6: Tightness
  • ...and 15 more