Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing

Zi Yang, Nan Hua

TL;DR

This work tackles long-context processing in transformers by introducing eviction-based memory management (LRA/LFA) and the Attendre layer, which retrieves K/V entries using evicted queries so that a query can also use context that arrives after it (bidirectional attention). The approach reduces memory requirements while maintaining or improving performance on TriviaQA, as demonstrated with PaLM 2-S and FLAN-T5 XXL in context-length extension tasks. Key contributions include formalizing memory eviction policies, detailing the Attendre wait-to-attend mechanism, and validating its effectiveness in encoder-decoder settings with an explicit encoder-output memory. The results show efficiency gains with competitive accuracy, suggesting practical benefits for deploying long-context transformers without extensive retraining or an oversized memory.

Abstract

As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to efficiently and affordably process possibly arbitrarily long sequences. One effective approach is to use a FIFO memory to store the keys and values of an attention sublayer from past chunks so that subsequent queries can attend to them. However, this approach requires a large memory and/or must take the specific LM architecture into consideration. Moreover, because of the causal relationship between the key-values in the prior context and the queries at present, this approach cannot be extended to bidirectional attention, such as in an encoder-decoder or PrefixLM decoder-only architecture. In this paper, we propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures. We also propose the Attendre layer, a wait-to-attend mechanism that retrieves from the key-value memory (K/V memory) using queries evicted from the query memory (Q memory). As a first step, we evaluate this method in the context length extension setup using the TriviaQA reading comprehension task, and show the effectiveness of the approach.
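To make the contrast between a FIFO memory and an eviction policy concrete, here is a minimal sketch of a fixed-capacity memory with a pluggable policy. It is not the paper's implementation. It assumes that LRA and LFA mean "least recently attended" and "least frequently attended" (by analogy with LRU and LFU), and the class `EvictingMemory`, its methods, and the tie-breaking rule are hypothetical names and choices made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class EvictingMemory:
    """Fixed-capacity store with a pluggable eviction policy (illustrative only)."""
    capacity: int
    policy: str = "lfa"  # "fifo", "lra", or "lfa" (expansions assumed, see above)
    step: int = 0
    # key -> [inserted_at, last_attended_at, attend_count]
    entries: dict = field(default_factory=dict)

    def insert(self, key):
        """Insert a key; return the evicted key, or None if there was room."""
        self.step += 1
        evicted = None
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Evict the entry with the smallest policy score.
            evicted = min(self.entries, key=self._score)
            del self.entries[evicted]
        self.entries[key] = [self.step, self.step, 0]
        return evicted

    def attend(self, key):
        """Record that a query attended to this key."""
        self.step += 1
        stats = self.entries[key]
        stats[1] = self.step
        stats[2] += 1

    def _score(self, key):
        inserted, last_attended, count = self.entries[key]
        if self.policy == "fifo":
            return inserted          # oldest insertion goes first
        if self.policy == "lra":
            return last_attended     # least recently attended goes first
        return (count, inserted)     # "lfa": fewest attends, ties broken by age
```

With capacity 2, inserting `a` and `b`, attending to `a`, and then inserting `c` evicts `b` under the assumed LFA policy, whereas FIFO would evict `a` even though it was just used.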

Paper Structure

This paper contains 10 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Attendre layer with Q and K/V memory storages. (a) The memory module caches the linearly transformed Q/K/V and in turn prepares time-shifted counterparts for dot-product attention. (b) ① Insert the query chunk $\mathbf{q}_{T_i}$ into the Q memory. ② Insert the K/V chunks $\mathbf{k}_{T_i}$ and $\mathbf{v}_{T_i}$ into the K/V memory. ③ Obtain the evicted query chunk $\mathbf{q}_{T_j}$. ④ Use the evicted query chunk $\mathbf{q}_{T_j}$ to retrieve from the K/V memory. ⑤ Obtain the top K/Vs $\mathbf{k}_{R_j}$ and $\mathbf{v}_{R_j}$.
  • Figure 2: Time-shifted Transformer stack with a wait-to-attend layer inside the self-attention layer of each Transformer layer. The output of each layer is shifted relative to its input by the size of the Q memory, $N$. The final output of a Transformer stack consisting of $L$ layers is therefore shifted relative to the original input by $NL$. We post-pad the input by $NL$ to "drain" the $L$ Q memory storages and trim the $NL$ padding positions that precede the output sequence.
  • Figure 3: Encoder-decoder architecture with the additional encoder output memory $\mathbf{e}$ to collect the outputs from the encoder. Each decoder layer instead uses the K/Vs from the encoder output memory $\mathbf{e}$ to compute the cross attention.
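
The five steps in the Figure 1 caption and the pad-and-trim procedure in the Figure 2 caption can be sketched for a single layer as follows. This is a toy illustration under stated assumptions, not the authors' code: one vector stands in for a whole chunk, the Q memory is a plain FIFO of size $N$, the K/V memory is unbounded, retrieval is an exact top-k by scaled dot product, and padding positions are not written to the K/V memory. The names `AttendreSketch`, `run_layer`, `q_capacity`, and `top_k` are hypothetical. With one layer ($L = 1$) the shift is $N$ rather than $NL$.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class AttendreSketch:
    """Toy single-head wait-to-attend layer; one vector stands in for a chunk."""

    def __init__(self, q_capacity, top_k):
        self.q_capacity = q_capacity  # N, the Q memory size
        self.top_k = top_k
        self.q_memory = deque()       # FIFO Q memory
        self.kv_memory = []           # K/V memory, unbounded here for brevity

    def step(self, q, k=None, v=None):
        self.q_memory.append(q)                  # (1) insert the query
        if k is not None:
            self.kv_memory.append((k, v))        # (2) insert the key/value
        if len(self.q_memory) <= self.q_capacity:
            return None                          # Q memory not full: nothing evicted
        q_old = self.q_memory.popleft()          # (3) obtain the evicted query
        scale = math.sqrt(len(q_old))
        scored = [(dot(q_old, key) / scale, val) for key, val in self.kv_memory]
        scored.sort(key=lambda item: item[0], reverse=True)  # (4) retrieve
        top = scored[: self.top_k]               # (5) top K/Vs
        # Softmax attention over the retrieved entries only.
        peak = top[0][0]
        weights = [math.exp(score - peak) for score, _ in top]
        total = sum(weights)
        dim_v = len(top[0][1])
        return [
            sum(w * val[d] for w, (_, val) in zip(weights, top)) / total
            for d in range(dim_v)
        ]


def run_layer(layer, qs, ks, vs):
    """Post-pad by N steps to drain the Q memory, then trim the first N outputs."""
    n = layer.q_capacity
    pad_q = [0.0] * len(qs[0])
    outputs = []
    for i in range(len(qs) + n):
        if i < len(qs):
            outputs.append(layer.step(qs[i], ks[i], vs[i]))
        else:
            outputs.append(layer.step(pad_q))    # padding: query only (assumption)
    return outputs[n:]
```

Because a query is only used once it is evicted, the K/V memory already holds entries that arrived after it. For example, with `q_capacity=1` and `top_k=1`, queries `[[1, 0], [0, 1]]`, keys `[[0, 1], [1, 0]]`, and values `[[1.0], [2.0]]`, the output for position 0 is the value stored at position 1, which a causal FIFO K/V memory alone could not provide.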