Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures

Mihir Gupte, Paolo Giusto, Ramesh S

TL;DR

This work tackles the challenge of efficiently representing retrieved knowledge for hierarchical data when using Retrieval-Augmented Generation (RAG) with LLMs. It introduces a bottom-up approach that generates implicit, aggregated summaries at each level of a tree and stores them in a knowledge base, so that RAG retrieves these summaries instead of raw data. The method achieves comparable or better performance on token-level QA metrics while substantially reducing the number of documents stored in the vector database (over 3.5x fewer in the experiments). The findings suggest that high-quality implicit knowledge can be a scalable alternative to raw data for complex structured domains, with potential extensions to graphs and other hierarchies in future work.

Abstract

Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model's in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
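The bottom-up idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `summarize` stand-in (which only collapses whitespace and truncates, where a real pipeline would presumably call an LLM), and the `summarize_tree` helper are all hypothetical names introduced here for illustration. The sketch walks a file tree in post-order, summarizes each file, builds each folder summary from its children's summaries, and emits one document per node for a knowledge base.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A file (no children) or a folder (with children). Hypothetical type."""
    name: str
    content: str = ""  # raw text, only meaningful for files
    children: List["Node"] = field(default_factory=list)


def summarize(text: str, limit: int = 120) -> str:
    """Placeholder summarizer (assumption): collapse whitespace and truncate.

    A real system would replace this with an LLM call.
    """
    return " ".join(text.split())[:limit]


def summarize_tree(node: Node, docs: List[Dict[str, str]], path: str = "") -> str:
    """Summarize `node` bottom-up, appending one document per node to `docs`.

    Children are processed before their parent, so a folder's summary is
    aggregated from the summaries of everything beneath it.
    """
    full_path = f"{path}/{node.name}" if path else node.name
    if node.children:
        # Folder: aggregate the already-computed child summaries.
        parts = [summarize_tree(child, docs, full_path) for child in node.children]
        summary = summarize(" | ".join(parts))
    else:
        # File: summarize the raw content directly.
        summary = summarize(node.content)
    docs.append({"path": full_path, "summary": summary})
    return summary


if __name__ == "__main__":
    repo = Node("repo", children=[
        Node("src", children=[
            Node("a.py", "def add(x, y): return x + y"),
            Node("b.py", "def sub(x, y): return x - y"),
        ]),
        Node("README.md", "Demo project"),
    ])
    knowledge_base: List[Dict[str, str]] = []
    summarize_tree(repo, knowledge_base)
    for doc in knowledge_base:
        print(doc["path"], "->", doc["summary"])
```

In this sketch the knowledge base holds exactly one entry per file or folder, whereas indexing raw code typically splits each file into several chunks. That difference is one plausible reading of where the reported reduction in stored documents comes from.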

Paper Structure

This paper contains 23 sections, 3 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Workflow for generating Implicit Knowledge from a given set of hierarchical files
  • Figure 2: Workflow for generating Implicit Knowledge from a given set of hierarchical files
  • Figure 3: Performance of baseline & proposed methods on File-level and Folder-level questions