Writing in the Margins: Better Inference Pattern for Long Context Retrieval

Melisa Russak, Umar Jamil, Christopher Bryant, Kiran Kamble, Axel Magnuson, Mateusz Russak, Waseem AlShikh

TL;DR

WiM introduces a chunked prefill inference pattern that generates and aggregates margins from long-context segments to guide final predictions without any fine-tuning. By adding marginal extractive notes per segment and reintegrating them at the end, WiM substantially improves performance on long-context reasoning and aggregation tasks while maintaining compatibility with off-the-shelf transformer models. The design supports interactive retrieval with streaming margins, enabling real-time explainability, early exit, and potential human-in-the-loop interventions. Implemented in the Hugging Face Transformers ecosystem, WiM demonstrates meaningful gains across multi-hop QA, aggregation benchmarks, and various context-length scenarios, offering a practical, scalable approach to extending context without retraining.

Abstract

In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information ("margins") that guide the model towards specific tasks. This method adds only marginal computational overhead while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and more than a 30.0% increase in the F1-score for aggregation tasks (CWE). Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing and pinpoints the integration of relevant information into the final response. We release our implementation of WiM using the Hugging Face Transformers library at https://github.com/writer/writing-in-the-margins.
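
The segment-wise control flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the `write_margin` and `is_relevant` callables, and the toy keyword-matching stubs are all hypothetical stand-ins for LLM calls, and the growing `seen` list only mimics the role of a prefilled key-value cache. The final prompt places the kept margins after the context and before the task, following the Figure 3 caption.

```python
from typing import Callable, List, Tuple


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Split the context into fixed-size segments (the last one may be shorter)."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def writing_in_the_margins(
    context_tokens: List[str],
    query: str,
    segment_len: int,
    write_margin: Callable[[List[str], List[str], str], str],
    is_relevant: Callable[[str, str], bool],
) -> Tuple[List[str], str]:
    """Process the context segment by segment and collect relevant margins."""
    seen: List[str] = []   # hypothetical stand-in for the prefilled KV cache
    kept: List[str] = []   # margins classified as relevant
    for segment in split_into_segments(context_tokens, segment_len):
        seen.extend(segment)                          # "prefill" this chunk
        margin = write_margin(seen, segment, query)   # stand-in for an LLM call
        if is_relevant(margin, query):                # stand-in for the classifier
            kept.append(margin)
    # Final step: full context, then kept margins, then the task itself.
    final_prompt = "\n".join([" ".join(seen), *kept, query])
    return kept, final_prompt


# Toy stubs (assumptions for illustration only): a "margin" is the list of
# words in the current segment that also occur in the query.
def toy_write_margin(seen: List[str], segment: List[str], query: str) -> str:
    query_words = set(query.lower().split())
    return " ".join(w for w in segment if w.lower() in query_words)


def toy_is_relevant(margin: str, query: str) -> bool:
    return bool(margin)


context = "alpha beta gamma paris delta epsilon france zeta".split()
margins, prompt = writing_in_the_margins(
    context, "paris france", 3, toy_write_margin, toy_is_relevant
)
print(margins)
```

In a real system the two callables would be prompts sent to the same model that holds the cache, so the sketch captures only the ordering of steps, not the cost profile.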

Paper Structure (35 sections, 2 equations, 9 figures, 5 tables, 2 algorithms)

Figures (9)

  • Figure 1: Writing in the Margins inference pattern. Prefilling the KV cache by segments allows the model both to process the context segment by segment and to generate intermediate extractive summaries, which can improve the final prediction.
  • Figure 2: Chunked Prefill. Example of how the attention mask is set across different chunks during prefill iterations (first chunk at the top, second chunk at the bottom). Each new chunk needs to retain causality while attending to all previous chunks. Chunked prefill is mathematically equivalent to prefill without chunking.
  • Figure 3: Design Comparison. Three inference designs for managing long context windows: (Top Left) Long Context LLM (LLM): This design feeds all context, without segmentation, directly to the model. (Top Right) Retrieval-Augmented Generation (RAG): Segments are selected based on a retrieval method (e.g., cosine similarity between vector representations of the query and the segment). All selected segments, along with the task instruction, are then concatenated and fed to a model. (Bottom) Writing in the Margins (WiM): The context is divided and processed segment by segment. At each step, the model is prompted to produce auxiliary information from each segment. This information is then classified and, if deemed positive, it is incorporated into the final step before the task description.
  • Figure 4: WiM interactive retrieval design. On the right, the document view displays the progress of processed segments, which can also be labeled based on the relevance identified by the LLM classifier. On the left, the chat view includes a progress bar that reflects the processing of segments. Here, users can interact with the streamed margins by giving a thumbs up or down, and these interactions are considered in the final response. Each margin corresponds to a specific document segment.
  • Figure 5: Sequence packing. Sequence packing places multiple unrelated documents in the same sequence. By adjusting the attention mask, we can avoid cross-contamination between them. This reduces training time by cutting the number of padding tokens. A similar technique can also be used to run inference on multiple prompts within the same sequence.
  • ...and 4 more figures
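
The attention-mask layouts described in the captions for Figure 2 (chunked prefill) and Figure 5 (sequence packing) can be sketched in a few lines. This is a hedged illustration using plain Boolean lists, where `True` means "query position may attend to key position"; all function names are hypothetical, and the check covers only the mask pattern, not a numerical comparison of attention outputs.

```python
from typing import List

Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """Standard causal mask: query q may attend to keys 0..q."""
    return [[k <= q for k in range(n)] for q in range(n)]


def chunked_prefill_masks(n: int, chunk_len: int) -> List[Mask]:
    """One mask per prefill iteration.

    Rows are the queries of the current chunk; columns are all keys cached so
    far, i.e. every previous chunk plus the current one.
    """
    masks = []
    for start in range(0, n, chunk_len):
        end = min(start + chunk_len, n)
        masks.append([[k <= q for k in range(end)] for q in range(start, end)])
    return masks


def stack_chunks(masks: List[Mask], n: int) -> Mask:
    """Pad each chunk row with False up to length n and stack all rows."""
    return [row + [False] * (n - len(row)) for mask in masks for row in mask]


def packed_mask(doc_lens: List[int]) -> Mask:
    """Causal mask for packed documents: no attention across document borders."""
    doc_id = [d for d, length in enumerate(doc_lens) for _ in range(length)]
    n = len(doc_id)
    return [[k <= q and doc_id[k] == doc_id[q] for k in range(n)] for q in range(n)]


# Stacking the per-chunk masks reproduces the unchunked causal mask.
same_pattern = stack_chunks(chunked_prefill_masks(7, 3), 7) == causal_mask(7)
print(same_pattern)

# Two packed documents of lengths 2 and 3: position 2 starts the second
# document, so it cannot attend to position 1 in the first.
packed = packed_mask([2, 3])
print(packed[2][1], packed[3][2])
```

The first check shows that the rows produced chunk by chunk are exactly the rows of the ordinary causal mask, which is the pattern-level counterpart of the equivalence stated in the Figure 2 caption. The packed mask is the same causal mask restricted to a block-diagonal shape.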