SaVe-TAG: LLM-based Interpolation for Long-Tailed Text-Attributed Graphs
Leyao Wang, Yu Wang, Bo Ni, Yuying Zhao, Hanyu Wang, Yao Ma, Tyler Derr
TL;DR
This work tackles long-tailed node classification on text-attributed graphs by introducing SaVe-TAG, a semantic-aware Vicinal Risk Minimization (VRM) framework that uses LLM-based text interpolation to create boundary-enriching, manifold-preserving samples for minority classes. A confidence-based edge assignment mechanism leverages graph topology to filter noisy synthetic nodes, reducing error propagation during GNN training. The authors provide theoretical insights supporting on-manifold generation and boundary-focused VRM, and demonstrate strong empirical gains over embedding-based augmentation and prior baselines across multiple datasets, with the SaVe-TAG_S variant yielding notable improvements in balance and stability. The approach offers a practical, semantically rich augmentation strategy for long-tailed graph learning, and its code is publicly available for reproducibility.
Abstract
Real-world graph data often follows long-tailed distributions, making it difficult for Graph Neural Networks (GNNs) to generalize well across both head and tail classes. Recent advances in Vicinal Risk Minimization (VRM) have shown promise in mitigating class imbalance with numeric interpolation; however, existing approaches largely rely on embedding-space arithmetic, which fails to capture the rich semantics inherent in text-attributed graphs. In this work, we propose SaVe-TAG (Semantic-aware Vicinal Risk Minimization for Long-Tailed Text-Attributed Graphs), a novel VRM framework that leverages Large Language Models (LLMs) to perform text-level interpolation, generating on-manifold, boundary-enriching synthetic samples for minority classes. To mitigate the risk of noisy generation, we introduce a confidence-based edge assignment mechanism that uses graph topology as a natural filter to ensure structural consistency. We provide theoretical justification for our method and conduct extensive experiments on benchmark datasets, showing that our approach consistently outperforms both numeric interpolation and prior long-tailed node classification baselines. Our results highlight the importance of integrating semantic and structural signals for balanced and effective learning on text-attributed graphs. The source code is publicly available at: https://github.com/LWang-Laura/SaVe-TAG.
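To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows (a) numeric interpolation in embedding space, (b) a prompt that could drive text-level interpolation, and (c) one simple way to filter candidate edges by confidence. All function names, the prompt wording, and the `k` and `threshold` parameters are hypothetical and chosen only for illustration; the actual prompts, confidence function, and edge-selection rule are those defined in the paper and its repository.

```python
def interpolate_embeddings(x_i, x_j, lam):
    """Numeric (embedding-space) interpolation: lam * x_i + (1 - lam) * x_j."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]


def build_interpolation_prompt(text_i, text_j, label):
    """Hypothetical prompt template asking an LLM to blend two same-class texts.

    The wording here is an assumption, not the prompt used by SaVe-TAG.
    """
    return (
        f"Both of the following texts belong to the class '{label}'.\n"
        f"Text A: {text_i}\n"
        f"Text B: {text_j}\n"
        "Write a new text of the same class that blends the content of A and B."
    )


def assign_edges(confidences, k, threshold):
    """Keep at most k candidate neighbours whose confidence is >= threshold.

    confidences: dict mapping candidate node id -> score in [0, 1].
    A synthetic node with no confident candidate receives no edges, so it
    cannot pass messages to other nodes during GNN training.
    The top-k-above-threshold rule is an illustrative assumption.
    """
    kept = [(node, score) for node, score in confidences.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [node for node, _ in kept[:k]]


if __name__ == "__main__":
    # Embedding-space midpoint of two minority-class samples.
    print(interpolate_embeddings([0.0, 2.0], [4.0, 6.0], lam=0.5))
    # Text-level alternative: the prompt would be sent to an LLM of choice.
    print(build_interpolation_prompt("Paper on GNN pooling.", "Paper on graph sampling.", "ML"))
    # Only sufficiently confident neighbours are attached to the synthetic node.
    print(assign_edges({"a": 0.9, "b": 0.2, "c": 0.7}, k=1, threshold=0.5))
```

In this sketch the numeric path operates on vectors only, while the text path hands the interpolation to a language model and re-embeds its output afterwards; the edge filter is what lets the graph structure discard low-confidence synthetic nodes.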
