Heterogeneous Graph Structure Learning through the Lens of Data-generating Processes

Keyue Jiang, Bohan Tang, Xiaowen Dong, Laura Toni

TL;DR

The paper introduces HGSL, a principled framework to infer multi-type graph structures from node signals by modeling a data-generating process with hidden Markov networks for heterogeneous graphs (H2MN). It formulates HGSL as a MAP problem over the edge tensor and DGP parameters, and proposes an alternating optimization algorithm with a low-rank relation-embedding parameterization to jointly learn graph structure and relation embeddings. The approach unifies node-type and edge-type learning via a finite set of local potentials and a linear emission model, and is shown to excel in edge-type identification and edge-weight recovery on synthetic and real datasets, with robust behavior under varying homophily and feature-dimension overlap. The work highlights the role of a data-generating perspective in guiding graph-structure learning for heterogeneous graphs and points to future integration with neural generative models and downstream tasks such as node classification and link prediction.

Abstract

Inferring graph structure from observed data is a key task in graph machine learning, as it captures the intrinsic relationships between data entities. While significant advances have been made in learning the structure of homogeneous graphs, many real-world graphs exhibit heterogeneous patterns in which nodes and edges have multiple types. This paper addresses this gap by introducing the first approach for heterogeneous graph structure learning (HGSL). To this end, we first propose a novel statistical model for the data-generating process (DGP) of heterogeneous graph data, namely hidden Markov networks for heterogeneous graphs (H2MN). We then formalize HGSL as a maximum a posteriori (MAP) estimation problem parameterized by this DGP and derive an alternating optimization method to solve it, together with a theoretical justification of the optimization conditions. Finally, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate that the proposed method excels at learning the structure of heterogeneous graphs in terms of edge-type identification and edge-weight recovery.

Paper Structure

This paper contains 50 sections, 3 theorems, 86 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Lemma A.1

Linear Gaussian Model [ghahramani2001introduction]. Given two vectorized random variables, ${\bm{b}}$ following a marginal Gaussian distribution, and ${\bm{a}}$ following a Gaussian distribution conditioned on ${\bm{b}}$, which has the following form, where ${\bm{T}}$ is the linear transformation matrix, and $\Omega_1$ and $\Omega_2$ are two precision matrices. The random variable ${\bm{a}}$ has a mar
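
The lemma statement is cut off here, but the textbook linear Gaussian result it appears to refer to can be checked numerically. The sketch below assumes a zero-offset model, $b \sim \mathcal{N}(\mu_b, \Omega_1^{-1})$ and $a \mid b \sim \mathcal{N}(Tb, \Omega_2^{-1})$, for which the standard marginal is $a \sim \mathcal{N}(T\mu_b,\ \Omega_2^{-1} + T\Omega_1^{-1}T^\top)$. The assignment of $\Omega_1$ to $b$ and $\Omega_2$ to the conditional, the zero offset, and all function names and numeric values are assumptions made for illustration; the paper's exact statement may differ.

```python
import numpy as np


def linear_gaussian_marginal(mu_b, omega1, T, omega2):
    """Closed-form marginal of a, assuming (hypothetically)
    b ~ N(mu_b, omega1^{-1}) and a | b ~ N(T b, omega2^{-1})."""
    cov_b = np.linalg.inv(omega1)          # covariance of b from its precision
    cov_a_given_b = np.linalg.inv(omega2)  # conditional covariance of a given b
    mean_a = T @ mu_b
    cov_a = cov_a_given_b + T @ cov_b @ T.T
    return mean_a, cov_a


def sample_a(mu_b, omega1, T, omega2, n, rng):
    """Draw n samples of a by first sampling b, then adding emission noise."""
    cov_b = np.linalg.inv(omega1)
    cov_a_given_b = np.linalg.inv(omega2)
    b = rng.multivariate_normal(mu_b, cov_b, size=n)            # shape (n, d_b)
    noise = rng.multivariate_normal(
        np.zeros(T.shape[0]), cov_a_given_b, size=n
    )                                                           # shape (n, d_a)
    return b @ T.T + noise


# Illustrative values only (not from the paper).
rng = np.random.default_rng(0)
mu_b = np.array([1.0, -2.0])
omega1 = np.array([[2.0, 0.5], [0.5, 1.0]])
T = np.array([[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]])
omega2 = np.diag([4.0, 2.0, 1.0])

mean_a, cov_a = linear_gaussian_marginal(mu_b, omega1, T, omega2)
samples = sample_a(mu_b, omega1, T, omega2, n=200_000, rng=rng)
emp_mean = samples.mean(axis=0)
emp_cov = np.cov(samples, rowvar=False)

# Largest gaps between the empirical moments and the closed form.
print(np.abs(emp_mean - mean_a).max(), np.abs(emp_cov - cov_a).max())
```

With a large sample, both printed gaps should be small. This only compares the first two moments against the closed form and does not prove the lemma.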

Figures (5)

  • Figure 1: The graphical models for a) HMN and b) our H2MN. The shadowed variable is observable.
  • Figure 2: The Pearson correlation test between relaxed homophily ratio and AUC.
  • Figure 3: The SDOR and model performance.
  • Figure 4: The different types of connections recovered by the HGSL algorithm.
  • Figure 5: Visualization of the generalized smoothness.

Theorems & Definitions (3)

  • Lemma A.1
  • Lemma A.2
  • Lemma B.1