Augmenting Textual Generation via Topology Aware Retrieval

Yu Wang, Nedim Lipka, Ruiyi Zhang, Alexa Siu, Yuying Zhao, Bo Ni, Xin Wang, Ryan Rossi, Tyler Derr

TL;DR

This work tackles the problem of LLM hallucinations and limited input knowledge by introducing Topology-aware Retrieval-Augmented Generation (Topo-RAG), which guides retrieval using topology encoded in proximity-based and role-based relations. It demonstrates that additional, topologically similar texts can meaningfully improve generated content, and that textual similarity correlates with topological similarity across multiple domains. The framework precomputes topology embeddings to enable fast retrieval and shows strong gains on traditional text-generation metrics as well as task-oriented evaluations like node classification and link prediction. The results highlight the practical value of incorporating graph topology into RAG to improve factual grounding and writing quality in diverse text-attributed networks.

Abstract

Despite the impressive advancements of Large Language Models (LLMs) in generating text, they are often limited by the knowledge contained in the input and prone to producing inaccurate or hallucinated content. To tackle these issues, Retrieval-augmented Generation (RAG) is employed as an effective strategy to enhance the available knowledge base and anchor the responses in reality by pulling additional texts from external databases. In real-world applications, texts are often linked through entities within a graph, such as citations in academic papers or comments in social networks. This paper exploits these topological relationships to guide the retrieval process in RAG. Specifically, we explore two kinds of topological connections: proximity-based, focusing on closely connected nodes, and role-based, which looks at nodes sharing similar subgraph structures. Our empirical research confirms their relevance to text relationships, leading us to develop a Topology-aware Retrieval-augmented Generation framework. This framework includes a retrieval module that selects texts based on their topological relationships and an aggregation module that integrates these texts into prompts to stimulate LLMs for text generation. We have curated established text-attributed networks and conducted comprehensive experiments to validate the effectiveness of this framework, demonstrating its potential to enhance RAG with topological awareness.
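The retrieval and aggregation modules described above can be illustrated with a minimal sketch. It is not the paper's implementation: the proximity score here is an assumed stand-in (a sum of the first `k` powers of the row-normalized adjacency matrix), and the function names `proximity_scores`, `retrieve`, and `build_prompt` as well as the prompt wording are hypothetical.

```python
import numpy as np

def proximity_scores(adj, k=2):
    """Assumed proximity score: sum of the first k powers of the
    row-normalized adjacency matrix (a simple k-step random-walk diffusion)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
    trans = adj / deg
    scores = np.zeros_like(trans)
    power = np.eye(len(adj))
    for _ in range(k):
        power = power @ trans
        scores += power
    return scores

def retrieve(adj, target, top_n=2, k=2):
    """Return up to top_n nodes with the highest positive proximity to target."""
    scores = proximity_scores(adj, k)[target].copy()
    scores[target] = -np.inf  # never retrieve the target itself
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order[:top_n] if scores[i] > 0]

def build_prompt(texts, neighbors, start_words):
    """Hypothetical aggregation step: prepend retrieved texts to the prompt."""
    context = "\n".join(f"- {texts[i]}" for i in neighbors)
    return f"Related texts:\n{context}\n\nContinue this text: {start_words}"

# Toy path graph 0 - 1 - 2 - 3 with made-up node texts.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
texts = ["paper A", "paper B", "paper C", "paper D"]
neighbors = retrieve(adj, target=0, top_n=2, k=2)
print(build_prompt(texts, neighbors, "In this paper we"))
```

Because the scores depend only on the graph, they could be computed once and cached, in the spirit of the precomputed topology embeddings mentioned in the TL;DR.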

Paper Structure (31 sections, 7 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Topology-aware Retrieval for Text Generation. (a): People write paper abstracts by referring to other papers cited in the related work; (b): Employees with the same local subgraph structures hold the same titles/responsibilities.
  • Figure 2: (a)-(b): Cora citation network where nodes are papers and edges are reference relations. Papers that are topologically closer to paper 0 have higher textual similarity to its text; (c)-(d): Enron-email network where nodes represent employees and edges denote their email communications. For each row in the two heatmaps, employees in diagonal entries share higher textual similarity and lower role-based topological distance than the ones belonging to off-diagonal entries in the same row. This indicates that employees with the same roles share a higher textual similarity and lower topological distance than employees with different roles.
  • Figure 3: Comparing text generation between the "Addition" scenario, where we include additional texts based on their proximity-based topological similarity to the target node, and the "None" scenario, where we include no additional texts and generate only from the target node's partially observed starting words.
  • Figure 4: Correlation between Proximity-based Topological Similarity and Textual Similarity. (a): in Cora, as the topological distance between two paper nodes decreases, their textual similarity increases. (b): the Pearson correlations across different datasets are all positive and increase as the number of diffusion layers $k$ in Eq. \ref{eq-diffusion} increases.
  • Figure 5: Correlation between Role-based Topological Distance and Textual Similarity on the Enron-Email dataset, where role-based topological distance is calculated as the L2-distance between the embeddings of two employees obtained from GraphWave, while textual similarity is calculated as the average cosine similarity of textual embeddings of either the sent (a) or received (b) emails between two employees.
  • ...and 3 more figures
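
The correlation analysis described in the Figure 4 and Figure 5 captions can be sketched as follows, assuming node embeddings are already available. The toy vectors below are invented for illustration and are not GraphWave outputs or results from the paper; the helper names are hypothetical.

```python
import numpy as np

def pairwise_l2(emb):
    """L2 distance between every pair of rows (role-based distance)."""
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pairwise_cosine(emb):
    """Cosine similarity between every pair of rows (textual similarity)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def pearson_upper(a, b):
    """Pearson correlation over the distinct node pairs (upper triangle)."""
    iu = np.triu_indices(len(a), k=1)
    return float(np.corrcoef(a[iu], b[iu])[0, 1])

# Made-up embeddings: nodes 0-1 share one role, nodes 2-3 another.
role_emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
text_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

r = pearson_upper(pairwise_l2(role_emb), pairwise_cosine(text_emb))
print(f"Pearson r between role distance and textual similarity: {r:.3f}")
```

On this toy input the correlation is negative: pairs that are closer in role-embedding space have more similar text embeddings, which is the direction of the relationship the captions describe.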