SynHING: Synthetic Heterogeneous Information Network Generation for Graph Learning and Explanation
Ming-Yi Hong, Yi-Hsiang Huang, Shao-En Lin, You-Chen Teng, Chih-Yu Wang, Che Lin
TL;DR
SynHING tackles the scarcity of diverse, ground-truth explanations for heterogeneous information networks (HINs) with a motif-driven, bottom-up framework for synthetic HIN generation. Its pipeline consists of major motif generation, base subgraph construction, intra-/inter-cluster merges, node feature generation, and post-pruning, and it produces scalable graphs that preserve the reference graph's properties while exposing ground-truth explanations. These explanations can be used to evaluate explainers for heterogeneous graph neural networks (HGNNs), and the generated graphs support pretraining transfer experiments, with controllable cluster exclusion via intra-/inter-cluster probabilities and features driven by a signal-to-noise ratio (SNR). Experimental results on IMDB, ACM, and DBLP show positive transfer from synthetic to real graphs and highlight the role of motifs in explainability and learning, offering a practical tool for robust evaluation and pretraining in heterogeneous graph learning.
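To make the idea of SNR-driven features concrete, the following is a minimal sketch rather than SynHING's actual feature generator. It assumes that each cluster has a unit-power signal vector and that Gaussian noise is added at a variance chosen to hit a target SNR in decibels. The function name `generate_features`, its parameters, and the choice of ±1 signal entries are all hypothetical.

```python
import math
import random


def generate_features(labels, dim, snr_db, seed=0):
    """Return one feature vector per node: cluster signal plus Gaussian noise.

    Assumption (not from the paper): SNR = signal power / noise power,
    expressed in dB, with a per-entry signal power of 1.
    """
    rng = random.Random(seed)
    clusters = sorted(set(labels))
    # One signal vector per cluster; entries are +1 or -1, so power per entry is 1.
    centers = {c: [rng.choice((-1.0, 1.0)) for _ in range(dim)] for c in clusters}
    # With unit signal power, noise variance = 10 ** (-snr_db / 10).
    noise_std = math.sqrt(10 ** (-snr_db / 10))
    return [[s + rng.gauss(0.0, noise_std) for s in centers[y]] for y in labels]


if __name__ == "__main__":
    feats = generate_features([0, 0, 1, 1], dim=4, snr_db=10.0)
    for row in feats:
        print([round(x, 2) for x in row])
```

Under this assumption, lowering `snr_db` makes nodes of different clusters harder to separate from features alone, so a model has to rely more on graph structure.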
Abstract
Graph Neural Networks (GNNs) excel at capturing graph structure in diverse domains, including community analysis and recommendation systems. As interpreting GNNs becomes increasingly important, so does the demand for robust baselines and large graph datasets, particularly for Heterogeneous Information Networks (HINs). To address this, we introduce SynHING, a novel framework for Synthetic Heterogeneous Information Network Generation aimed at enhancing graph learning and explanation. SynHING systematically identifies major motifs in a target HIN and employs a bottom-up generation process with intra-cluster and inter-cluster merge modules. This process, supplemented by post-pruning techniques, ensures that the synthetic HIN closely mirrors the original graph's structural and statistical properties. Crucially, SynHING provides ground-truth motifs for evaluating GNN explainer models, setting a new standard for explainable, synthetic HIN generation and contributing to the advancement of interpretable machine learning in complex networks.
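The bottom-up process described above can be illustrated with a heavily simplified sketch. It is not the SynHING algorithm: here each base subgraph is a copy of one motif, and "merging" is approximated by adding a single schema-respecting edge between two base subgraphs with probability `p_intra` (same cluster) or `p_inter` (different clusters). Post-pruning is omitted. All names (`make_base_subgraph`, `generate`, the node-naming scheme) are hypothetical.

```python
import random


def make_base_subgraph(cluster, index, motif_edges):
    """Copy the motif, giving every node a name unique to this base subgraph."""
    prefix = f"c{cluster}_s{index}_"
    return [(prefix + u, prefix + v) for u, v in motif_edges]


def generate(num_clusters, subgraphs_per_cluster, motif_edges,
             p_intra, p_inter, seed=0):
    """Return (all edges, ground-truth motif edges) of a toy synthetic HIN."""
    rng = random.Random(seed)
    edges = []
    subgraphs = []  # (cluster id, node-name prefix) for each base subgraph
    for c in range(num_clusters):
        for s in range(subgraphs_per_cluster):
            edges.extend(make_base_subgraph(c, s, motif_edges))
            subgraphs.append((c, f"c{c}_s{s}_"))
    ground_truth = set(edges)  # motif edges serve as ground-truth explanations

    # Simplified merge step: link pairs of base subgraphs with one edge whose
    # endpoint types match an edge type of the motif, so the schema is kept.
    for i in range(len(subgraphs)):
        for j in range(i + 1, len(subgraphs)):
            (ci, prefix_i), (cj, prefix_j) = subgraphs[i], subgraphs[j]
            p = p_intra if ci == cj else p_inter
            if rng.random() < p:
                u_type, v_type = rng.choice(motif_edges)
                edges.append((prefix_i + u_type, prefix_j + v_type))
    return edges, ground_truth


if __name__ == "__main__":
    motif = [("movie", "actor"), ("movie", "director")]  # node types as names
    all_edges, truth = generate(2, 3, motif, p_intra=0.8, p_inter=0.1)
    print(len(all_edges), "edges,", len(truth), "ground-truth motif edges")
```

In this sketch, a high `p_intra` combined with a low `p_inter` yields densely connected clusters that are weakly linked to one another, while the motif edges remain known by construction.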
