SLIDE: Sliding Localized Information for Document Extraction

Divyansh Singh, Manuel Nunez Martinez, Bonnie J. Dorr, Sonja Schmer Galunder

TL;DR

The paper tackles the problem of constructing accurate knowledge graphs from long documents, especially in low-resource languages, by addressing LLM context-length limitations. It introduces SLIDE, a sliding window-based contextual chunking method that generates localized context from overlapping neighboring chunks, enabling efficient processing of long texts without relying on full-document embeddings. Empirical results show substantial improvements in entity and relationship extraction for both English and Afrikaans, as well as enhancements in question-answering metrics like Comprehensiveness, Diversity, and Empowerment, demonstrating SLIDE's effectiveness in multilingual and resource-constrained settings. While SLIDE increases computational overhead due to overlapping processing, it offers a practical, scalable approach to robust knowledge graph construction in GraphRAG systems and beyond.

Abstract

Constructing accurate knowledge graphs from long texts and low-resource languages is challenging, as large language models (LLMs) experience degraded performance with longer input chunks. This problem is amplified in low-resource settings where data scarcity hinders accurate entity and relationship extraction. Contextual retrieval methods, while improving retrieval accuracy, struggle with long documents. They truncate critical information in texts exceeding maximum context lengths of LLMs, significantly limiting knowledge graph construction. We introduce SLIDE (Sliding Localized Information for Document Extraction), a chunking method that processes long documents by generating local context through overlapping windows. SLIDE ensures that essential contextual information is retained, enhancing knowledge graph extraction from documents exceeding LLM context limits. It significantly improves GraphRAG performance, achieving a 24% increase in entity extraction and a 39% improvement in relationship extraction for English. For Afrikaans, a low-resource language, SLIDE achieves a 49% increase in entity extraction and an 82% improvement in relationship extraction. Furthermore, it improves upon state-of-the-art in question-answering metrics such as comprehensiveness, diversity and empowerment, demonstrating its effectiveness in multilingual and resource-constrained settings.

Paper Structure

This paper contains 14 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparison of knowledge graph extraction from a set of fiction book pages used as chunks. (Top) Using GraphRAG without SLIDE produces a knowledge graph with fewer nodes (representing entities) and edges (representing relationships). (Bottom) Using SLIDE results in a richer knowledge graph.
  • Figure 2: How context is generated for each chunk using a sliding window approach to form a contextual chunk. Each chunk, along with its neighboring chunks, is fed to an LLM to generate context.
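
The sliding window step described in Figure 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `neighbor_window`, `build_contextual_chunks`, and `stub_generate_context` are hypothetical, and so is the `radius` parameter (the number of neighbors taken on each side). The sketch also assumes that the generated context is prepended to the chunk, and it replaces the LLM call with a stub so that it runs offline.

```python
from typing import Callable, List, Sequence


def neighbor_window(chunks: Sequence[str], index: int, radius: int) -> List[str]:
    """Return the chunk at `index` plus up to `radius` neighbors on each side."""
    start = max(0, index - radius)
    end = min(len(chunks), index + radius + 1)
    return list(chunks[start:end])


def build_contextual_chunks(
    chunks: Sequence[str],
    generate_context: Callable[[str, List[str]], str],
    radius: int = 1,
) -> List[str]:
    """Attach locally generated context to every chunk.

    `generate_context` stands in for an LLM call that receives the target
    chunk and its window. Prepending the result is an assumption of this sketch.
    """
    contextual = []
    for i, chunk in enumerate(chunks):
        window = neighbor_window(chunks, i, radius)
        context = generate_context(chunk, window)
        contextual.append(f"{context}\n\n{chunk}")
    return contextual


def stub_generate_context(chunk: str, window: List[str]) -> str:
    """Placeholder for the LLM: reports only how many chunks were in the window."""
    return f"[context from {len(window)} chunk(s)]"


if __name__ == "__main__":
    pages = ["page one", "page two", "page three", "page four"]
    for item in build_contextual_chunks(pages, stub_generate_context, radius=1):
        print(item)
        print("---")
```

Because each window covers only a fixed number of neighbors, the input to the context generator stays bounded regardless of document length. The overlap between adjacent windows is also what produces the extra computational overhead noted in the TL;DR.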