SaVe-TAG: LLM-based Interpolation for Long-Tailed Text-Attributed Graphs

Leyao Wang, Yu Wang, Bo Ni, Yuying Zhao, Hanyu Wang, Yao Ma, Tyler Derr

TL;DR

This work tackles long-tailed node classification on text-attributed graphs by introducing SaVe-TAG, a semantic-aware VRM framework that uses LLM-based text interpolation to create boundary-enriching, manifold-preserving samples for minority classes. A confidence-based edge assignment mechanism leverages graph topology to filter noisy synthetic nodes, reducing error propagation during GNN training. The authors provide theoretical insights ensuring on-manifold generation and boundary-focused VRM, and demonstrate strong empirical gains over embedding-based augmentation and prior baselines across multiple datasets, with SaVe-TAG_S yielding notable improvements in balance and stability. The approach offers a practical, semantically rich augmentation strategy for long-tailed graph learning and comes with public code for reproducibility.

Abstract

Real-world graph data often follows long-tailed distributions, making it difficult for Graph Neural Networks (GNNs) to generalize well across both head and tail classes. Recent advances in Vicinal Risk Minimization (VRM) have shown promise in mitigating class imbalance with numeric interpolation; however, existing approaches largely rely on embedding-space arithmetic, which fails to capture the rich semantics inherent in text-attributed graphs. In this work, we propose our method, SaVe-TAG (Semantic-aware Vicinal Risk Minimization for Long-Tailed Text-Attributed Graphs), a novel VRM framework that leverages Large Language Models (LLMs) to perform text-level interpolation, generating on-manifold, boundary-enriching synthetic samples for minority classes. To mitigate the risk of noisy generation, we introduce a confidence-based edge assignment mechanism that uses graph topology as a natural filter to ensure structural consistency. We provide theoretical justification for our method and conduct extensive experiments on benchmark datasets, showing that our approach consistently outperforms both numeric interpolation and prior long-tailed node classification baselines. Our results highlight the importance of integrating semantic and structural signals for balanced and effective learning on text-attributed graphs. The source code is publicly available at: https://github.com/LWang-Laura/SaVe-TAG.

Paper Structure

This paper contains 64 sections, 7 theorems, 28 equations, 9 figures, 10 tables.

Key Result

Theorem 3.1

If $\mathcal{M}_c$ is non-convex, there exist $t_1, t_2$ with $\phi(t_1), \phi(t_2) \in \mathcal{M}_c$ and $\lambda \in (0,1)$ such that $x_\lambda := \lambda\,\phi(t_1) + (1-\lambda)\,\phi(t_2) \notin \mathcal{M}_c$ (Zhang et al., 2018; Guo et al., 2019; Baena et al., 2022).
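
The statement can be checked on a toy case. The sketch below is not from the paper: it assumes the unit circle in $\mathbb{R}^2$ as a stand-in for a non-convex $\mathcal{M}_c$, treats two points on it as $\phi(t_1)$ and $\phi(t_2)$, and uses hypothetical helper names (`on_unit_circle`, `interpolate`). With $\lambda = 0.5$ the interpolated point falls strictly inside the circle, so it is off the manifold.

```python
import math

def on_unit_circle(p, tol=1e-9):
    """Return True if the 2-D point p lies on the unit circle (our toy manifold)."""
    return abs(math.hypot(*p) - 1.0) <= tol

def interpolate(p, q, lam):
    """Numeric interpolation: lam * p + (1 - lam) * q, coordinate by coordinate."""
    return tuple(lam * a + (1.0 - lam) * b for a, b in zip(p, q))

# Two points that both lie on the toy manifold (assumed embeddings phi(t_1), phi(t_2)).
p1 = (1.0, 0.0)
p2 = (0.0, 1.0)

# Midpoint of the chord between them.
x_half = interpolate(p1, p2, 0.5)
norm_half = math.hypot(*x_half)

print(on_unit_circle(p1), on_unit_circle(p2))  # both endpoints are on the circle
print(on_unit_circle(x_half), norm_half)       # the midpoint is not
```

The midpoint has norm $\sqrt{0.5}$, which is strictly less than 1, so it lies on the chord rather than on the circle. Real embedding manifolds are far higher-dimensional, but the same geometric argument applies.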

Figures (9)

  • Figure 2: An overview of SaVe-TAG. Given an input graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{T})$, we perform LLM-based interpolation on identified vicinal twins to synthesize boundary-enriching samples $\hat{t}$ that minimize vicinal risk. A confidence function is then pre-trained on the original graph and later used to assign edges to the synthetic nodes via a top-$k$ selection strategy. When the resulting augmented graph $\mathcal{G}' = (\mathcal{V}', \mathcal{E}', \mathcal{T}')$ is processed by a GNN, such edge assignment incorporates well-aligned nodes into their vicinity while isolating noisy samples.
  • Figure 3: Node classification performance (F1) under various implementations of SaVe-TAG$_S$. Embed uses Llama3.2-1B embeddings for interpolation. SimCSE and SBERT are compared as alternative embedding baselines. Our LLM-based method with Llama3.2-1B consistently outperforms all the embedding-based baselines.
  • Figure 4: UMAP projection of the original and interpolated samples across various benchmarks. The original data boundary is outlined; numeric samples mostly stay within it, while LLM-generated samples extend beyond it.
  • Figure 5: Boundary Coverage Rate (BCR) and Boundary Proximity Score (BPS) for numeric (blue) vs. LLM (yellow) interpolation in Interp+Orig and Interp-Only settings across six datasets; the red dashed line marks the original baseline.
  • Figure 6: Average In-Class Rate (ICR) of augmented samples classified by an MLP trained on balanced data. Numerical interpolation achieves near-perfect class consistency (100% ICR), while LLM-based samples yield lower ICRs, reflecting increased proximity to decision boundaries.
  • ...and 4 more figures
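
The edge-assignment step described in the Figure 2 caption can be sketched roughly as follows. This is one assumed reading, not the authors' implementation: it supposes that the pre-trained confidence function has already produced a score for each candidate (synthetic node, original node) pair, and that the top-$k$ selection runs globally over all candidate pairs. The function name `assign_edges_top_k`, the node labels, and the score values are all hypothetical.

```python
def assign_edges_top_k(scores, k):
    """Keep the k highest-confidence candidate edges.

    scores: dict mapping (synthetic_node, original_node) -> confidence score.
    Ties are broken by the pair itself so the result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:k]]

# Hypothetical confidence scores: "s1" is a well-aligned synthetic node,
# "s2" is a noisy one that the confidence function scores poorly everywhere.
scores = {
    ("s1", "a"): 0.92,
    ("s1", "b"): 0.85,
    ("s2", "a"): 0.10,
    ("s2", "c"): 0.05,
}

edges = assign_edges_top_k(scores, k=2)
connected = {syn for syn, _ in edges}

print(edges)      # edges kept for the augmented graph
print(connected)  # synthetic nodes that received at least one edge
```

Under this reading, a synthetic node with uniformly low confidence receives no edges and stays isolated, so a message-passing GNN never propagates its features to other nodes. The paper may instead select the top-$k$ per node or per class; the sketch only shows the general filtering idea.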

Theorems & Definitions (9)

  • Theorem 3.1: Off-Manifold Numeric Interpolation
  • Theorem 3.2: Manifold-Preserving Class-Consistent Generation
  • Definition 3.3: The Minimum Margin
  • Definition 3.4: Boundary Coverage Rate (BCR)
  • Theorem 3.5: Margin Lower Bound
  • Theorem 3.6: On-manifold vicinal risk
  • Theorem 3.7: Boundary coverage $\Rightarrow$ lower vicinal risk
  • Theorem 3.8: Confidence $\Rightarrow$ pulling
  • Theorem 3.9: No confidence $\Rightarrow$ isolation