HeTGB: A Comprehensive Benchmark for Heterophilic Text-Attributed Graphs
Shujie Li, Yuxia Wu, Chuan Shi, Yuan Fang
TL;DR
HeTGB tackles the lack of benchmarks for heterophilic text-attributed graphs by introducing five real-world datasets enriched with textual descriptions. It enables evaluation of GNN-based, PLM-based, and co-training methods on node classification, highlighting the complementary strengths of structural and semantic signals. The results show that heterophily-aware GNNs and fine-tuned PLMs (especially when integrated via co-training) can effectively leverage text and structure, though efficiency remains a key consideration. By releasing HeTGB with baselines, the authors aim to accelerate research and practical development in heterophilic TAG learning.
Abstract
Graph neural networks (GNNs) have demonstrated success in modeling relational data, primarily under the assumption of homophily. However, many real-world graphs exhibit heterophily, where linked nodes belong to different categories or possess diverse attributes. Additionally, nodes in many domains are associated with textual descriptions, forming heterophilic text-attributed graphs (TAGs). Despite their significance, heterophilic TAGs remain underexplored due to the lack of comprehensive benchmarks. To address this gap, we introduce the Heterophilic Text-attributed Graph Benchmark (HeTGB), a novel benchmark comprising five real-world heterophilic graph datasets from diverse domains, with nodes enriched by extensive textual descriptions. HeTGB enables systematic evaluation of GNNs, pre-trained language models (PLMs), and co-training methods on the node classification task. Through extensive benchmarking experiments, we showcase the utility of text attributes in heterophilic graphs, analyze the challenges posed by heterophilic TAGs and the limitations of existing models, and provide insights into the interplay between graph structures and textual attributes. We have publicly released HeTGB with baseline implementations to facilitate further research in this field.
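To make the notion of heterophily concrete, the sketch below computes the edge homophily ratio, that is, the fraction of edges whose two endpoints share a label. This is one common way to quantify homophily and is shown only as an illustration; the abstract does not state which measure HeTGB uses. The function name `edge_homophily` and the toy graph are hypothetical and do not come from the benchmark.

```python
from typing import Hashable, Iterable, Mapping, Tuple


def edge_homophily(
    edges: Iterable[Tuple[Hashable, Hashable]],
    labels: Mapping[Hashable, Hashable],
) -> float:
    """Return the fraction of edges that connect two nodes with the same label.

    Values near 1 indicate a homophilic graph; values near 0 indicate a
    heterophilic one. This is an illustrative measure, not necessarily the
    one used by HeTGB.
    """
    edge_list = list(edges)
    if not edge_list:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edge_list if labels[u] == labels[v])
    return same / len(edge_list)


# Toy graph (hypothetical data): four nodes, two classes, four edges.
toy_labels = {0: "A", 1: "B", 2: "A", 3: "B"}
toy_edges = [(0, 1), (1, 2), (2, 3), (0, 2)]

# Only the edge (0, 2) joins two nodes of the same class.
toy_ratio = edge_homophily(toy_edges, toy_labels)
print(f"edge homophily ratio: {toy_ratio:.2f}")
```

A low ratio such as the one in this toy graph means that most neighbors carry a different label, which is the setting where GNNs built on the homophily assumption tend to struggle.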
