QuickLLaMA: Query-aware Inference Acceleration for Large Language Models

Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia

TL;DR

QLLM addresses long-context reasoning by offering a training-free, query-aware inference mechanism that selectively retrieves memory blocks relevant to the current query and input tokens. It constructs a memory-augmented KV cache from four token streams (Global, Query, Context, Local) with memory blocks offloaded to CPU and a GPU cache to maintain efficiency. On long-context benchmarks (Long-Bench and ∞-Bench), Needle-in-a-Haystack, and BABILong, QLLM yields substantial improvements over sliding-window baselines and InfLLM, including the ability to process up to 1024K tokens while preserving accuracy. The method enables fast, scalable long-sequence reasoning for LLMs such as LLaMA3-8B and Mistral-7B without extra training, supporting improved QA, retrieval, and summarization over very long documents.
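The query-aware retrieval step described above can be illustrated with a small sketch. This is not the authors' implementation: the scoring rule (mean dot-product similarity of a block's keys to the current tokens plus a weighted similarity to the query tokens), the `query_weight` parameter, and all function names are assumptions chosen for illustration only.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_similarity(keys: Sequence[Vector], probes: Sequence[Vector]) -> float:
    """Average dot product over every (key, probe) pair; 0.0 if either side is empty."""
    if not keys or not probes:
        return 0.0
    total = sum(dot(k, p) for k in keys for p in probes)
    return total / (len(keys) * len(probes))


def block_score(block_keys: Sequence[Vector],
                query_vecs: Sequence[Vector],
                current_vecs: Sequence[Vector],
                query_weight: float = 1.0) -> float:
    """Hypothetical relevance score: similarity to the current tokens
    plus a weighted similarity to the user's query tokens (assumed form)."""
    return (mean_similarity(block_keys, current_vecs)
            + query_weight * mean_similarity(block_keys, query_vecs))


def select_blocks(blocks: Sequence[Sequence[Vector]],
                  query_vecs: Sequence[Vector],
                  current_vecs: Sequence[Vector],
                  top_k: int) -> List[int]:
    """Return the indices of the top_k highest-scoring blocks, in document order."""
    scores = [block_score(b, query_vecs, current_vecs) for b in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Three toy memory blocks with 2-dimensional key vectors.
blocks = [
    [(1.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (0.0, 1.0)],
    [(0.5, 0.5)],
]
picked_for_query_a = select_blocks(blocks, [(0.0, 1.0)], [(0.0, 1.0)], top_k=2)
picked_for_query_b = select_blocks(blocks, [(1.0, 0.0)], [(0.0, 0.0)], top_k=1)
```

In this toy setup, changing the query vectors changes which blocks are selected, which is the behavior the query-aware lookup is meant to provide.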

Abstract

The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet they still struggle to capture long-distance dependencies within sequences, which limits deep semantic understanding. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It requires no extra training and can be seamlessly integrated with any LLM. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer questions about it. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the $\infty$-Bench. On the Needle-in-a-Haystack and BABILong tasks, Q-LLM improved upon the current SOTA by 7.0% and 6.1%, respectively. Our code can be found at https://github.com/dvlab-research/Q-LLM.

Paper Structure (27 sections, 7 equations, 12 figures, 7 tables)

This paper contains 27 sections, 7 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Examples of our QuickLLaMA-8B (1) reading a long context containing 100K tokens, (2) reading our paper, which has not been seen in the pretraining dataset, (3) retrieving a value from long key-value pairs, and (4) retrieving in the Needle-in-a-Haystack task. More examples and comparisons with the SOTAs are provided in \ref{sec:a-examples}.
  • Figure 2: This is an example from the $\infty$-Bench. Three questions were posed about the same long book: (1) Which among Annalisa, Seb, Peyton, and Gannonmarie is not Mrs. Bronwyn's child? (2) What's the name of the Bronwyns' summer home? (3) Who among Mrs. Bronwyn, Mrs. Deandra, Rosemarie, and Cael is the final to perish? We present the score heatmap of the first 50 memory blocks. The methods shown are (a) InfLLM, whose results are the same for all three queries, and (b-d) QLLM, whose results are query-aware.
  • Figure 3: The illustration of our QLLM framework. The input from the memory context is partitioned into memory blocks, which are searched by Query-aware Context Lookup for query-related blocks. The current key-value cache comprises global tokens, query tokens, query-related blocks, and local tokens. Together, these form a new context window that, along with current tokens, is fed into the LLM.
  • Figure 4: An example from Long-Bench. Global tokens include the system prompt and task description. Query tokens represent the user's query. Context tokens are the context stored in the context memory, from which query-related tokens are retrieved. Local tokens are the tokens nearest to the current token.
  • Figure 5: The comparison of performance in the Needle-in-a-Haystack task. The horizontal axis represents the document's length (the haystack), ranging from 1K to 128K tokens, whereas the vertical axis specifies the location of a brief sentence (the needle) within the document. A red cell indicates the language model's inability to recall the needle's information, while a green cell denotes successful recall by the model.
  • ...and 7 more figures
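
The context-window layout described in the Figure 3 and Figure 4 captions (global tokens, query tokens, query-related blocks, local tokens) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the fixed `block_size`, the function names, and the use of plain token lists in place of a real key-value cache are all hypothetical.

```python
from typing import List, Sequence


def partition_into_blocks(tokens: Sequence, block_size: int) -> List[list]:
    """Split the memory context into consecutive fixed-size blocks
    (the last block may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]


def assemble_window(global_tokens: Sequence,
                    query_tokens: Sequence,
                    blocks: Sequence[Sequence],
                    selected: Sequence[int],
                    local_tokens: Sequence) -> list:
    """Concatenate the four streams in the order given in the caption:
    global, query, retrieved context blocks (kept in document order), local."""
    window = list(global_tokens) + list(query_tokens)
    for idx in sorted(selected):
        window.extend(blocks[idx])
    window.extend(local_tokens)
    return window


# Toy example: integers stand in for context tokens, strings for the other streams.
context = list(range(10))
blocks = partition_into_blocks(context, block_size=4)
window = assemble_window(["<sys>"], ["<q>"], blocks, selected=[2, 0],
                         local_tokens=["<l1>", "<l2>"])
```

Here the indices in `selected` stand in for the output of the query-aware lookup; only those blocks enter the window, so the window size stays bounded regardless of the total context length.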