Optimizing Multi-Hop Document Retrieval Through Intermediate Representations

Jiaen Lin, Jingyu Liu, Yingbo Liu

TL;DR

This work addresses the inefficiency of multi-hop retrieval in retrieval-augmented generation by uncovering a three-stage layer-wise reasoning pattern in LLMs and proposing Layer-wise RAG (L-RAG) that exploits middle-layer representations for next-hop retrieval. By training a Contriever-based representation retriever to align with intermediate signals, L-RAG reduces reliance on repeated LLM calls while maintaining high retrieval and task performance. The approach is validated on MuSiQue, HotpotQA, and 2WikiMultiHopQA, showing competitive accuracy with significantly lower inference overhead compared to multi-step RAG baselines. These findings suggest that intermediate representations can effectively bridge the gap between reasoning capability and inference efficiency in knowledge-intensive tasks.

Abstract

Retrieval-augmented generation (RAG) encounters challenges when addressing complex queries, particularly multi-hop questions. While several methods tackle multi-hop queries by iteratively generating internal queries and retrieving external documents, these approaches are computationally expensive. In this paper, we identify a three-stage information processing pattern in LLMs during layer-by-layer reasoning, consisting of extraction, processing, and subsequent extraction steps. This observation suggests that the representations in intermediate layers contain richer information than those in other layers. Building on this insight, we propose Layer-wise RAG (L-RAG). Unlike prior methods that focus on generating new internal queries, L-RAG leverages intermediate representations from the middle layers, which capture next-hop information, to retrieve external knowledge. L-RAG achieves performance comparable to multi-step approaches while maintaining inference overhead similar to that of standard RAG. Experimental results show that L-RAG outperforms existing RAG methods on open-domain multi-hop question-answering datasets, including MuSiQue, HotpotQA, and 2WikiMultiHopQA. The code is available at https://github.com/Olive-2019/L-RAG.

Paper Structure

This paper contains 22 sections, 9 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Comparison of Multi-step RAG and Layer-wise RAG. The left panel illustrates Multi-step RAG, which involves multiple rounds of reasoning to generate retrieval queries. In contrast, the right panel shows Layer-wise RAG, which requires only a single round of reasoning to produce intermediate representations for retrieval.
  • Figure 2: An overview and example of the L-RAG framework, which consists of three components: a traditional retriever, an LLM for generating intermediate representations, and the representation retriever. The area above the dashed line shows the framework, while the area below shows a worked example. First, the traditional retriever (e.g., BM25, Contriever) retrieves relevant chunks from the corpus as the first-hop document. The LLM (e.g., LLaMA2-7B) then uses the query and first-hop context to generate an intermediate representation. The representation retriever, a modified Contriever trained by the authors, uses this intermediate representation to retrieve the higher-hop document. Finally, the LLM generates the final answer using the complete context.
  • Figure 3: Weight matrix transformation divergence (TD) and information processing in LLM reasoning. The left panel illustrates the SVD, where the weight matrix is decomposed into three operators, including scaling. The singular value distribution represents the scaling magnitude of the weight matrix, which is quantified using TD. The right panel shows how the value of TD reflects the information processing stages in LLM reasoning.
  • Figure 4: Transformation Divergence $\mathrm{TD}(\bm{W}_v)$ of the weight matrix across various models.
  • Figure 5: Logit Lens analysis conducted at multiple layers within different LLMs. The dashed lines represent the probability of intermediate answers, while the solid lines indicate the probability of final answers.
  • ...and 5 more figures
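
The control flow described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the corpus, the vectors, the word-overlap scorer standing in for BM25/Contriever, the hard-coded per-layer hidden states standing in for an LLM forward pass, and all function names are hypothetical. Choosing the layer at index `len(layers) // 2` is also an assumption made only for this sketch.

```python
import math

# Hypothetical toy corpus: placeholder texts with made-up dense vectors.
CORPUS = [
    {"id": "A", "text": "tower alpha was designed by engineer beta", "vec": [1.0, 0.0, 0.0]},
    {"id": "B", "text": "engineer beta was born in city gamma", "vec": [0.0, 1.0, 0.0]},
    {"id": "C", "text": "city delta hosts a festival", "vec": [0.0, 0.0, 1.0]},
]


def first_hop_retrieve(query, corpus, k=1):
    """Stand-in for the traditional retriever: rank chunks by word overlap."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda d: len(q_words & set(d["text"].split())),
        reverse=True,
    )
    return ranked[:k]


def toy_forward_pass(query, context):
    """Stand-in for ONE LLM forward pass over query + first-hop context.

    Returns one hidden vector per layer. The values are hard-coded for
    illustration; a real model would compute them.
    """
    return [
        [1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0],
        [0.1, 0.8, 0.3],
        [0.0, 0.2, 0.9],
    ]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def representation_retrieve(rep, corpus, exclude_ids, k=1):
    """Stand-in for the representation retriever: cosine search over chunks."""
    candidates = [d for d in corpus if d["id"] not in exclude_ids]
    ranked = sorted(candidates, key=lambda d: cosine(rep, d["vec"]), reverse=True)
    return ranked[:k]


def l_rag_sketch(query, corpus):
    first_hop = first_hop_retrieve(query, corpus)
    layers = toy_forward_pass(query, first_hop)
    middle_rep = layers[len(layers) // 2]  # assumed choice of "middle" layer
    seen = {d["id"] for d in first_hop}
    next_hop = representation_retrieve(middle_rep, corpus, seen)
    # The final answer-generation call over this context is omitted here.
    return first_hop + next_hop


context = l_rag_sketch("where was the designer of tower alpha born", CORPUS)
print([d["id"] for d in context])  # ['A', 'B']
```

The point of the sketch is the shape of the loop: the second-hop lookup reuses a vector from the single forward pass instead of asking the model to write a new query, which is where the caption of Figure 1 locates the difference from multi-step RAG.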

Theorems & Definitions (2)

  • Definition 3.1: Transformation Direction
  • Definition 3.2: Transformation Divergence
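
For orientation, the decomposition mentioned in the Figure 3 caption is the standard singular value decomposition, written here in its thin form. The symbols $U$, $\Sigma$, $V$, $\sigma_i$, and $r$ are generic notation assumed for this note and may differ from the paper's. The formal statements of Definitions 3.1 and 3.2 are not reproduced in this listing, so TD itself is not written out.

```latex
\bm{W}_v = U \Sigma V^{\top},
\qquad
\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r),
\qquad
\sigma_1 \ge \cdots \ge \sigma_r \ge 0
```

Here $U$ and $V$ have orthonormal columns, so they preserve lengths, while $\Sigma$ rescales each direction by its singular value $\sigma_i$. The singular value distribution is what the Figure 3 caption calls the scaling magnitude of the weight matrix.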