Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

Yimin Tang, Yurong Xu, Ning Yan, Masood Mortazavi

TL;DR

This work introduces Inner Loop Memory Augmented Tree Retrieval (ILM-TR), a novel approach that issues inner-loop queries based not only on the query question itself but also on intermediate findings. It offers improvements over traditional retrieval-augmented LLMs, particularly in long context tests such as Multi-Needle In A Haystack and BABILong.

Abstract

The computational complexity of Transformers scales quadratically with input size, which limits the input context window size of large language models (LLMs) in both training and inference. Meanwhile, retrieval-augmented generation (RAG) based models can better handle longer contexts by using a retrieval system to filter out unnecessary information. However, most RAG methods only perform retrieval based on the initial query, which may not work well with complex questions that require deeper reasoning. We introduce a novel approach, Inner Loop Memory Augmented Tree Retrieval (ILM-TR), involving inner-loop queries, based not only on the query question itself but also on intermediate findings. At inference time, our model retrieves information from the RAG system, integrating data from lengthy documents at various levels of abstraction. Based on the information retrieved, the LLM generates text that is stored in an area named Short-Term Memory (STM), which is then used to formulate the next query. This retrieval process is repeated until the text in STM converges. Our experiments demonstrate that retrieval with STM offers improvements over traditional retrieval-augmented LLMs, particularly in long context tests such as Multi-Needle In A Haystack (M-NIAH) and BABILong.
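The inner loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `inner_loop_query`, `toy_retrieve`, and `toy_answer` are hypothetical, the next query is assumed to be the user's query concatenated with the STM text, convergence is assumed to mean that the STM text is unchanged between two iterations, and the `max_iters` cap is an added safeguard. The toy retriever and answer model stand in for a real RAG system and LLM.

```python
import re
from typing import Callable, List

def inner_loop_query(
    query: str,
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str], str], str],
    max_iters: int = 8,
) -> str:
    """Repeat retrieval until the short-term memory (STM) text stops changing."""
    stm = ""  # STM starts empty and is overwritten at each iteration
    for _ in range(max_iters):
        # Assumption: the next query is the user's query plus the current STM.
        retrieval_query = query if not stm else f"{query}\n{stm}"
        retrieved = retrieve(retrieval_query)
        new_stm = answer(query, retrieved, stm)
        if new_stm == stm:  # assumed convergence test: exact text match
            break
        stm = new_stm
    return stm

# --- Toy stand-ins for the retriever and the answer model (hypothetical) ---
DOCS = [
    "the key is in the blue box",
    "the blue box is in the garage",
    "the cat sleeps on the sofa",
]
STOPWORDS = {"the", "is", "in", "on", "where"}

def toy_retrieve(text: str, k: int = 2) -> List[str]:
    """Return up to k documents sharing at least one content word with text."""
    words = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
    scored = [(len(words & set(doc.split())), doc) for doc in DOCS]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: -pair[0])  # stable sort keeps DOCS order on ties
    return [doc for _, doc in hits[:k]]

def toy_answer(query: str, retrieved: List[str], stm: str) -> str:
    """Merge newly retrieved facts into the facts already held in STM."""
    facts = [fact for fact in stm.split("; ") if fact]
    for doc in retrieved:
        if doc not in facts:
            facts.append(doc)
    return "; ".join(facts)

result = inner_loop_query("where is the key", toy_retrieve, toy_answer)
single_pass = inner_loop_query("where is the key", toy_retrieve, toy_answer, max_iters=1)
```

In this toy run, a single retrieval pass only finds the sentence about the key, because the initial query shares no content words with the sentence about the garage. Once the first finding is in STM, the second query mentions the blue box, so the second hop becomes retrievable, and the loop stops when a further pass adds nothing new.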

Paper Structure

This paper contains 16 sections, 4 figures.

Figures (4)

  • Figure 1: An Overview of the ILM-TR Method. Raw Data consists of tokens from the user, which could include conversation history, novels, or any other content the user wants the LLM to process. User's Query refers to the tokens provided by the user, such as questions or task descriptions. Retriever can be any retrieval method, such as sentence-based RAG or tree-structured RAG. Retrieved Info is the result produced by the retriever. Short-Term Memory is a storage area for a limited number of tokens, which is overwritten at each iteration of the inner-loop query. Answer Model (LLM) processes information from the retriever, the previous short-term memory, and the user's query. The purple circles represent the order of steps in the inner-loop query.
  • Figure 2: The ILM-TR Retriever: The orange square represents the surprising information, the grey color represents the original text, and other colors represent the summary information. The summary model will extract information from the provided tokens. During the tree-building process, the summary information will be grouped using a clustering algorithm, and then each group will be summarized together to generate a higher-level summary and surprising information. In the query process, all squares in the tree will be stored in a table, and the best fit will be returned based on vector distance from the query text.
  • Figure 3: M-NIAH test: no keywords found (score 1, red), one keyword found (score 3, orange), two keywords found (score 7, yellow), and all three keywords found (score 10, green). Token lengths range from 150k to 500k. Depth percent represents the average positions of the inserted sentences within the long text, where 0% indicates the beginning of the text and 100% indicates the end.
  • Figure 4: BABILong test: Due to hardware limitations and the size of the LLM, we only tested 10 test cases for each test setting. In this study, we evaluated tasks qa1 to qa5 with token lengths ranging from 0k to 128k. We encourage readers to refer to kuratov2024babilong for further details.
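The query step described in the Figure 2 caption (every square of the tree stored in one table, with the best fit returned by vector distance from the query text) can be sketched as follows. This is an illustrative sketch only: the `Node` class, the `flatten`, `embed`, `cosine_distance`, and `best_fit` helpers, the bag-of-words embedding, and the choice of cosine distance are all assumptions made for the example, since the caption does not specify the embedding model or the distance metric.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    text: str
    kind: str  # hypothetical label: "original", "summary", or "surprising"
    children: List["Node"] = field(default_factory=list)

def flatten(root: Node) -> List[Node]:
    """Store every node of the tree, at every level, in one flat table."""
    table, stack = [], [root]
    while stack:
        node = stack.pop()
        table.append(node)
        stack.extend(node.children)
    return table

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def best_fit(table: List[Node], query: str) -> Node:
    """Return the table entry whose vector is closest to the query vector."""
    q = embed(query)
    return min(table, key=lambda node: cosine_distance(embed(node.text), q))

# A two-level toy tree: one summary node over two original-text leaves.
root = Node(
    "a key and a box are stored at home",
    "summary",
    [
        Node("the key is in the blue box", "original"),
        Node("the blue box is in the garage", "original"),
    ],
)
table = flatten(root)
detail_hit = best_fit(table, "where is the garage")
summary_hit = best_fit(table, "what is stored at home")
```

Because summaries and original text share one table, a detail-oriented query can land on a leaf while a broader query can land on a higher-level summary, which is one way to read the caption's retrieval across levels of abstraction.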