Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models

Haoyu Wang, Peihao Wang, Mufei Li, Shikun Liu, Siqi Miao, Zhangyang Wang, Pan Li

TL;DR

Graph-KV treats the KV-caches of text segments as condensed representations and lets graph structure govern how they interact, which reduces positional bias and exploits structural inductive biases.

Abstract

Modern large language models (LLMs) are inherently auto-regressive, requiring inputs to be serialized into flat sequences regardless of their structural dependencies. This serialization hinders the model's ability to leverage structural inductive biases, especially in tasks such as retrieval-augmented generation (RAG) and reasoning on data with native graph structures, where inter-segment dependencies are crucial. We introduce Graph-KV, which has the potential to overcome this limitation. Graph-KV leverages the KV-cache of text segments as condensed representations and governs their interaction through structural inductive biases. In this framework, 'target' segments selectively attend only to the KV-caches of their designated 'source' segments, rather than all preceding segments in a serialized sequence. This approach induces a graph-structured block mask, sparsifying attention and enabling a message-passing-like step within the LLM. Furthermore, strategically allocated positional encodings for source and target segments reduce positional bias and context window consumption. We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) Arxiv-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings. Code and the Graph-KV data are publicly available.
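
To make the "graph-structured block mask" concrete, the following is a minimal sketch, not the authors' implementation. It assumes segments are laid out back to back in one token buffer, that each segment attends causally to itself, and that each target segment additionally attends to every token of its source segments. The function names, the `(source, target)` edge format, and the token layout are all hypothetical choices made for illustration.

```python
def graph_block_mask(seg_lens, edges):
    """Build a token-level attention mask from a segment graph.

    mask[i][j] is True if token i may attend to token j.
    seg_lens: token count of each segment (hypothetical input format).
    edges: iterable of (source, target) segment-index pairs (assumed format).
    """
    # Start offset of each segment in the concatenated token buffer.
    starts = [0]
    for n in seg_lens:
        starts.append(starts[-1] + n)
    total = starts[-1]
    mask = [[False] * total for _ in range(total)]

    # Every segment is encoded independently: causal attention within itself.
    for s, n in enumerate(seg_lens):
        for i in range(n):
            for j in range(i + 1):
                mask[starts[s] + i][starts[s] + j] = True

    # A target segment additionally attends to all tokens of its sources.
    for src, tgt in edges:
        for i in range(starts[tgt], starts[tgt + 1]):
            for j in range(starts[src], starts[src + 1]):
                mask[i][j] = True
    return mask


def allowed_pairs(mask):
    """Count the (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Three 2-token segments; segments 1 and 2 are sources of target segment 0.
mask = graph_block_mask([2, 2, 2], edges=[(1, 0), (2, 0)])
sparse_pairs = allowed_pairs(mask)
full_causal_pairs = 6 * 7 // 2  # causal attention over all 6 tokens
```

In this toy case the two source segments never attend to each other, so the mask permits fewer pairs than full causal attention over the serialized sequence, which is the sparsification the abstract refers to.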

Paper Structure

This paper contains 19 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: When processing data with inherent structure (bottom-left), modern LLMs encounter three challenges due to serialized input reading (top row): (1) positional bias, where different serialization orders lead to varied outputs zheng2023judging; (2) quadratic computational complexity from full attention between all document pairs; and (3) rapid context window consumption, as position indices quickly exceed limits. The bottom-right subfigure illustrates Graph-KV. Text chunks are independently encoded into KV caches; Graph-KV then places each target chunk after the KV caches of its source chunks to update the target's KV cache. Notably, all source texts share the same positional encoding (PE) range, while all target texts share another, whose position indices immediately follow those of the source texts. This approach reduces PE and context window usage. At query time, the query attends to both the source chunks and the target chunks to perform decoding.
  • Figure 2: PE-sharing mechanism in Graph-KV. As shown on the right side, source docs share one PE range, while targets share another. Attending Doc.1 to the KVs of its sources (Doc.2 and Doc.3) is functionally equivalent to the left side: reading Doc.2 followed by Doc.1, and Doc.3 followed by Doc.1, then merging the resulting representations of Doc.1.
  • Figure 3: Graph-KV modeling for RAG.
  • Figure 4: The reasoning settings in RAG tasks. The direct inference task requires identifying evidence chunks (from NarrativeQA kovcisky2018narrativeqa). The other tasks include multi-hop reasoning (comparison, bridge, and compositional; from 2Wiki ho2020constructing and HotpotQA yang2018hotpotqa) and long-document understanding (from LongBench-v2 bai2024longbench). In these tasks, there exist implicit temporal or logical dependencies among the retrieved chunks.
  • Figure 5: An example from the Arxiv-QA task. One must first locate the central paper’s introduction of the low row-weight generator matrix, and then compare the described methods with the content across all provided references (e.g., Theorems $1$ and $3$ in the ground-truth reference paper) to arrive at the correct answer.
  • ...and 3 more figures
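
The PE-sharing scheme described in the captions of Figures 1 and 2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: it assumes the shared source range is sized by the longest source segment and that the shared target range starts at the first index after it. The function name `shared_positions` and its return format are hypothetical.

```python
def shared_positions(source_lens, target_lens):
    """Assign position indices so that all sources share one range
    and all targets share the range that immediately follows it.

    Assumption: each range is as wide as the longest segment in its group.
    Returns per-segment position lists and the number of indices consumed.
    """
    src_span = max(source_lens, default=0)
    tgt_span = max(target_lens, default=0)
    # Every source segment restarts at position 0.
    sources = [list(range(n)) for n in source_lens]
    # Every target segment starts right after the shared source range.
    targets = [list(range(src_span, src_span + n)) for n in target_lens]
    return sources, targets, src_span + tgt_span


source_lens, target_lens = [3, 2], [4]
sources, targets, used = shared_positions(source_lens, target_lens)

# Plain serialization would consume one index per token instead.
serialized = sum(source_lens) + sum(target_lens)
```

Under these assumptions the position budget grows with the longest segment in each group rather than with the total number of tokens, which is why the captions describe reduced PE and context window usage.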