ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmed, Yang Liu

TL;DR

The paper tackles the long-context bottleneck in Transformer LLMs by introducing Extend at Test-Time (ETT), which extends context length during inference with constant memory and linear computation. ETT achieves this by fine-tuning on overlapping chunks of the input context, memorizing the sequence in the model parameters and resetting them afterward. Empirical results on GPT-Large and Phi-2 show up to a $32\times$ extension (from 1k to 32k tokens) with gains of up to roughly 30% on LongBench, and selective fine-tuning, especially updating the FFN keys, yields strong improvements with fewer trainable parameters. Phi-2 with ETT also competes with much larger 8B models on several long-context tasks, illustrating a practical, memory-efficient path to scaling LLMs to longer sequences without external memory or task-specific memorization.
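
The constant-memory, linear-compute claim can be motivated with a back-of-the-envelope count. This is only a sketch under stated assumptions, not the paper's own analysis: the symbols $n$ (context length), $c$ (chunk length), $o$ (overlap), and $s = c - o$ (stride) are introduced here for illustration, and it assumes one fine-tuning pass per chunk with self-attention cost quadratic in the chunk length.

$$
K = \left\lceil \frac{n - c}{s} \right\rceil + 1, \qquad
\text{compute} \approx K \cdot O(c^2) = O\!\left(\frac{n\,c^2}{s}\right), \qquad
\text{peak memory} \approx O(c^2).
$$

For fixed $c$ and $s$, the compute grows linearly in $n$ and the peak memory does not depend on $n$, whereas attending over the full sequence at once would cost $O(n^2)$ in both. As a hypothetical instance, $n = 32768$, $c = 1024$, and $o = 0$ give $K = 32$ chunks.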

Abstract

The computation and memory overhead of Transformer-based language models increases quadratically with sequence length. This quadratic cost poses challenges when employing LLMs to process long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs with constant memory requirements and linear computation overhead. ETT enables the extension of the context length at test time by efficiently fine-tuning the model's parameters on the input context, chunked into small overlapping subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 by up to 32 times, from 1k to 32k tokens. This results in up to a 30 percent improvement in the model's accuracy. We also study how context can be stored in an LLM's weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the model's accuracy.
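
The chunking step mentioned in the abstract can be illustrated with a minimal sketch. It only shows how a long token sequence might be split into overlapping subsequences; the function name `overlapping_chunks` and the parameters `chunk_len` and `overlap` are hypothetical, the values used are arbitrary, and the test-time fine-tuning loop itself (and the subsequent parameter reset) is deliberately left out.

```python
from typing import List


def overlapping_chunks(tokens: List[int], chunk_len: int, overlap: int) -> List[List[int]]:
    """Split `tokens` into windows of at most `chunk_len` tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper for
    illustration only; not taken from the paper's implementation.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    if not tokens:
        return []

    stride = chunk_len - overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_len])
        # Stop once the current window reaches the end of the sequence.
        if start + chunk_len >= len(tokens):
            break
        start += stride
    return chunks


# Toy example: 10 "token ids", windows of 4 tokens sharing 2 tokens.
context = list(range(10))
chunks = overlapping_chunks(context, chunk_len=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

In a test-time setup along the lines the abstract describes, each such chunk would serve as one short training sequence, so the model never has to process more than `chunk_len` tokens at once.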

Paper Structure

This paper contains 10 sections, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Average LongBench score (%), HBM memory consumption (GB), and FLOPs (G) under different truncation sizes. ETT extends the context window of Phi-2 and GPT-Large by up to 16× and 32×, respectively. Performance improves with longer context lengths while maintaining constant memory usage and only linear growth in computation.
  • Figure 2: ETT's LongBench score as a function of the fraction of deep $\text{FFN}_{\text{Up}}$ layers fine-tuned. We can store the long input in the parameters of the top 80% of $\text{FFN}_{\text{Up}}$ layers without significant performance degradation.