Hierarchical Compression of Text-Rich Graphs via Large Language Models

Shichang Zhang; Da Zheng; Jiani Zhang; Qi Zhu; Xiang Song; Soji Adeshina; Christos Faloutsos; George Karypis; Yizhou Sun

TL;DR

HiCom integrates large language models with text-rich graphs by hierarchically compressing a node's neighborhood into concise summary vectors, using an LLM-based compressor with soft prompts adapted from AutoCompressor. The framework builds an $L$-level neighborhood hierarchy with fanouts $[n_1, \dots, n_L]$, produces $k$-dimensional summaries at each level, and uses a final predictor that combines the root summary $s_0$ with the target node text $x_0$ to predict the label $y$. On MAPLE and Amazon product graphs, HiCom-OPT outperforms both GNN-based backbones and vanilla LM approaches, with an average improvement of about $3.48\%$ on nodes in dense regions. Hierarchical compression also improves efficiency, since its cost scales as $O(\sum_{i=1}^{t} n_i^2)$ rather than $O(n^2)$. These findings demonstrate the practical potential of end-to-end, PEFT-enabled LLMs for text-rich graphs and point toward scalable applications in domains like e-commerce and scholarly networks.
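The efficiency claim can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that $n_i$ denotes the length of the $i$-th piece of text processed separately and that $n = \sum_i n_i$ is the length of the concatenated input, and it counts pairwise token interactions while ignoring constants and the extra summary vectors. The function names are hypothetical and do not come from the paper.

```python
def full_attention_cost(piece_lengths):
    """Cost proxy for attending over all pieces concatenated: n^2 with n = sum(n_i)."""
    n = sum(piece_lengths)
    return n * n


def piecewise_attention_cost(piece_lengths):
    """Cost proxy for processing each piece separately: sum of n_i^2."""
    return sum(n_i * n_i for n_i in piece_lengths)


# Assumed toy setting: 8 pieces of 512 tokens each.
pieces = [512] * 8
full = full_attention_cost(pieces)            # 4096^2
piecewise = piecewise_attention_cost(pieces)  # 8 * 512^2
print(full, piecewise, full / piecewise)      # 16777216 2097152 8.0
```

With $t$ equal-length pieces the ratio is exactly $t$, which is why splitting a large neighborhood into smaller pieces pays off under quadratic attention.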

Abstract

Text-rich graphs, prevalent in data mining contexts like e-commerce and academic graphs, consist of nodes with textual features linked by various relations. Traditional graph machine learning models, such as Graph Neural Networks (GNNs), excel at encoding graph structural information but have limited capability in handling rich text on graph nodes. Large Language Models (LLMs), noted for their superior text understanding abilities, offer a solution for processing the text in graphs. However, they face integration challenges because of their limited ability to encode graph structures and their computational cost when dealing with extensive text in large neighborhoods of interconnected nodes. This paper introduces "Hierarchical Compression" (HiCom), a novel method to align the capabilities of LLMs with the structure of text-rich graphs. HiCom processes text in a node's neighborhood in a structured manner by organizing the extensive textual information into a more manageable hierarchy and compressing node text step by step. As a result, HiCom not only preserves the contextual richness of the text but also addresses the computational challenges of LLMs, advancing the integration of the text processing power of LLMs with the structural complexities of text-rich graphs. Empirical results show that HiCom can outperform both GNNs and LLM backbones for node classification on e-commerce and citation graphs. HiCom is especially effective for nodes from a dense region in a graph, where it achieves a 3.48% average performance improvement on five datasets while being more efficient than LLM backbones.
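The two steps described above, organizing a neighborhood into a hierarchy and compressing it level by level, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_hierarchy`, `summarize`, and `count_compressor` are hypothetical names, the way neighbor text and lower-level summaries are combined is an assumption, and the stand-in compressor simply counts nodes where HiCom would use an LLM to produce summary vectors.

```python
import random


def build_hierarchy(adj, root, fanouts, rng):
    """Sample a neighborhood tree: a node at depth i keeps up to fanouts[i] neighbors."""
    def expand(node, depth):
        if depth == len(fanouts):
            return {"node": node, "children": []}
        neighbors = adj.get(node, [])
        sampled = rng.sample(neighbors, min(fanouts[depth], len(neighbors)))
        # Note: in this sketch a neighbor may reappear deeper in the tree.
        return {"node": node, "children": [expand(c, depth + 1) for c in sampled]}
    return expand(root, 0)


def summarize(tree, text, compressor):
    """Bottom-up pass: summarize each child's subtree, then compress them together."""
    if not tree["children"]:
        return None  # leaves have no neighborhood to summarize
    items = [(text[c["node"]], summarize(c, text, compressor)) for c in tree["children"]]
    return compressor(items)


def count_compressor(items):
    """Stand-in for an LLM compressor: counts how many nodes were summarized."""
    return sum(1 + (summary or 0) for _, summary in items)


# Toy complete graph on 6 nodes, so every node has 5 neighbors to sample from.
nodes = list(range(6))
adj = {u: [v for v in nodes if v != u] for u in nodes}
text = {u: f"description of product {u}" for u in nodes}

tree = build_hierarchy(adj, root=0, fanouts=[3, 2], rng=random.Random(0))
s_0 = summarize(tree, text, count_compressor)
print(s_0)  # 9 = 3 first-level neighbors + 3 * 2 second-level neighbors
```

In this sketch the root summary `s_0` covers only the sampled neighborhood; a predictor would then receive the target node's own text `text[0]` together with `s_0`.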

Paper Structure (45 sections, 5 figures, 7 tables, 3 algorithms)

Introduction
Related Work
Learning on Text-Rich Graphs
Transformer Models for Long Inputs
Foundation LLMs and Graphs
Notations and Preliminaries
Soft Prompts
AutoCompressor (Chevalier et al., 2023)
Method
The HiCom Framework
The Workflow
The Compressor
The Predictor
Computational Complexity
Techniques for Efficiency and Effectiveness
...and 30 more sections

Figures (5)

Figure 1: Category classification of two water bottles (in the middle) from the Amazon product-co-viewing graph. Their categories (in red) are not clear solely from the product descriptions (in green boxes), but will more likely be correctly classified through the neighborhood context.
Figure 2: An illustration of the hierarchical compression framework with LLMs. The left part corresponds to the hierarchy construction step for a target node $v_0$, with fanouts = [3,2] indicating the budget of nodes to sample in each level. The lower right part shows how the neighborhood context is compressed to a summary vector $s_0$ following the hierarchy. The upper right part shows the final prediction is made with the target node text $x_0$ as input and the summary vector $s_0$.
Figure 3: HiCom always gains. Relative performance improvement for HiCom-OPT over the second-best method on nodes in dense regions vs. all regions.
Figure 4: HiCom wins. Method performance on the Geology dataset with the training set in different sizes.
Figure 5: Method performance on the Sports dataset with the training set in different sizes.
