QuickLLaMA: Query-aware Inference Acceleration for Large Language Models
Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia
TL;DR
Q-LLM addresses long-context reasoning with a training-free, query-aware inference mechanism that selectively retrieves the memory blocks relevant to the current query and input tokens. It constructs a memory-augmented KV cache from four token streams (Global, Query, Context, Local); memory blocks are offloaded to the CPU, and a GPU cache keeps retrieval efficient. On long-context benchmarks (LongBench and ∞-Bench), Needle-in-a-Haystack, and BABILong, Q-LLM yields substantial improvements over sliding-window baselines and InfLLM, and it can process up to 1024K tokens while preserving accuracy. The method enables fast, scalable long-sequence reasoning for LLMs such as LLaMA3-8B and Mistral-7B without extra training, supporting improved QA, retrieval, and summarization over very long documents.
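The query-aware retrieval summarized above can be illustrated with a minimal sketch. It is not the released implementation: the function names (`select_blocks`, `assemble_window`), the use of a mean key vector as the block representative, and the dot-product relevance score are all assumptions chosen for illustration. The sketch splits the memory tokens into fixed-size blocks, scores each block against a query vector, keeps the top-scoring blocks, and concatenates the Global, Query, Context, and Local streams into one fixed-size attention window.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean, used here as a stand-in block representative."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_blocks(memory_keys: Sequence[Vector], query_vec: Vector,
                  block_size: int, top_k: int) -> List[int]:
    """Return the indices of the top_k most query-relevant memory blocks.

    Relevance is an assumed score: dot(mean key of block, query vector).
    Indices are returned in ascending order so the selected context keeps
    its original document order.
    """
    blocks = [memory_keys[i:i + block_size]
              for i in range(0, len(memory_keys), block_size)]
    scores = [dot(mean_vector(block), query_vec) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def assemble_window(global_ids: Sequence[int], query_ids: Sequence[int],
                    memory_ids: Sequence[int], local_ids: Sequence[int],
                    chosen_blocks: Sequence[int], block_size: int) -> List[int]:
    """Concatenate the four streams: Global, Query, Context, Local."""
    context_ids: List[int] = []
    for b in chosen_blocks:
        context_ids.extend(memory_ids[b * block_size:(b + 1) * block_size])
    return list(global_ids) + list(query_ids) + context_ids + list(local_ids)


if __name__ == "__main__":
    # Eight memory tokens with toy 2-d keys, grouped into four blocks of two.
    keys = [[0.0, 1.0], [0.0, 1.0],
            [1.0, 0.0], [1.0, 0.0],
            [0.5, 0.5], [0.5, 0.5],
            [-1.0, 0.0], [-1.0, 0.0]]
    chosen = select_blocks(keys, query_vec=[1.0, 0.0], block_size=2, top_k=2)
    window = assemble_window(global_ids=[0, 1], query_ids=[2, 3],
                             memory_ids=list(range(4, 12)),
                             local_ids=[12, 13],
                             chosen_blocks=chosen, block_size=2)
    print(chosen, window)
```

Because only `top_k` blocks are admitted, the window length stays constant regardless of how long the memory grows, which is the property that lets the unselected blocks stay offloaded.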
Abstract
The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet they still struggle to capture long-distance dependencies within sequences, which limits deep semantic understanding. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences in a manner akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It requires no extra training and can be seamlessly integrated with any LLM. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer questions about it. On widely recognized benchmarks, Q-LLM improves on the current state of the art by 7.17% with LLaMA3, and by 3.26% with Mistral on $\infty$-Bench. On the Needle-in-a-Haystack and BABILong tasks, Q-LLM improves upon the current SOTA by 7.0% and 6.1%, respectively. Our code is available at https://github.com/dvlab-research/Q-LLM.
