Doc-to-LoRA: Learning to Instantly Internalize Contexts

Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, Robert Tjarko Lange

TL;DR

Doc-to-LoRA (D2L) introduces a hypernetwork that learns to approximate context distillation (CD) in a single forward pass, generating context-specific LoRA adapters for a target LLM from a given context to instantly internalize knowledge. Leveraging a Perceiver-based hypernetwork and a chunking mechanism, D2L handles long contexts beyond the native window of the base model while keeping LoRA outputs fixed in shape. Empirically, D2L outperforms traditional CD under limited compute across QA benchmarks and supports zero-shot long-context generalization, sub-second internalization, and even cross-modal visual information transfer from a VLM to an LLM. While the meta-training cost is high and currently tailored to a specific base LLM and LoRA parameterization, the approach promises rapid, personalized adaptation and frequent knowledge updates with reduced inference latency and memory overhead.

Abstract

Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.
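
To make the idea of mapping a context to a fixed-shape adapter concrete, here is a minimal toy sketch. It is not the authors' implementation: the class `ToyHypernet`, the mean-pooling encoder, the random linear projections, and the rule of averaging per-chunk LoRA factors are all assumptions chosen for illustration (the paper uses a Perceiver-based hypernetwork whose details are not given in this summary). The sketch only shows that contexts of different lengths can yield LoRA factors of identical shape, which are then added to a frozen weight as a low-rank update.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Gaussian random matrix; stands in for learned or embedded values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def reshape(flat, rows, cols):
    """Reshape a flat list into a rows x cols matrix."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

class ToyHypernet:
    """Hypothetical stand-in for a hypernetwork (assumption, not the paper's model)."""

    def __init__(self, d_embed, d_model, rank, seed=0):
        rng = random.Random(seed)
        self.d_model, self.rank = d_model, rank
        # Random projections play the role of trained hypernetwork weights.
        self.proj_a = rand_matrix(d_embed, rank * d_model, rng)
        self.proj_b = rand_matrix(d_embed, d_model * rank, rng)

    def chunk_to_lora(self, chunk):
        """Mean-pool one chunk of token embeddings and project it to (A, B)."""
        d_embed = len(chunk[0])
        pooled = [[sum(tok[i] for tok in chunk) / len(chunk) for i in range(d_embed)]]
        a = reshape(matmul(pooled, self.proj_a)[0], self.rank, self.d_model)
        b = reshape(matmul(pooled, self.proj_b)[0], self.d_model, self.rank)
        return a, b

    def context_to_lora(self, context, chunk_size):
        """Split the context into chunks and average the per-chunk factors.

        Averaging is an illustrative aggregation rule (assumption); it keeps
        the output shape independent of the number of chunks.
        """
        chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
        factors = [self.chunk_to_lora(c) for c in chunks]
        n = len(factors)
        a = [[sum(f[0][r][c] for f in factors) / n for c in range(self.d_model)]
             for r in range(self.rank)]
        b = [[sum(f[1][r][c] for f in factors) / n for c in range(self.rank)]
             for r in range(self.d_model)]
        return a, b

def apply_lora(weight, a, b):
    """Return W + B @ A, the standard LoRA low-rank update of a frozen weight."""
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]

rng = random.Random(1)
d_embed, d_model, rank = 8, 6, 2
hypernet = ToyHypernet(d_embed, d_model, rank)

# Two synthetic "contexts" of very different lengths (rows are token embeddings).
short_ctx = rand_matrix(5, d_embed, rng, scale=1.0)
long_ctx = rand_matrix(50, d_embed, rng, scale=1.0)

a_short, b_short = hypernet.context_to_lora(short_ctx, chunk_size=16)
a_long, b_long = hypernet.context_to_lora(long_ctx, chunk_size=16)

base_weight = rand_matrix(d_model, d_model, rng)
adapted_weight = apply_lora(base_weight, a_long, b_long)

# Both adapters have the same shape regardless of context length.
print(len(a_short), len(a_short[0]), len(a_long), len(a_long[0]))
```

Once the adapter is applied, later queries run against `adapted_weight` without the original context in the prompt, which is the source of the latency and KV-cache savings the abstract describes. In this toy version the projections are random, so the adapter carries no useful knowledge; in D2L the hypernetwork is meta-trained so that the adapted model approximates the context-conditioned one.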

Paper Structure

This paper contains 32 sections, 7 equations, 13 figures, 20 tables.

Figures (13)

  • Figure 1: An overview of D2L training (left) and downstream performance (right). D2L learns to efficiently internalize information, outperforming traditional CD while significantly reducing latency and memory consumption across question-answering benchmarks under limited query budgets.
  • Figure 2: NIAH retrieval performance (top) and additional memory needed for inference (bottom).
  • Figure 3: QA performance on SQuAD compared to the used context length ratio (left), update latency (middle), and additional memory needed for model updates (right). LLMLingua-2 compresses the input with [$10\%,20\%,40\%,60\%,80\%,90\%$] compression rates from right to left (gray dots).
  • Figure 4: Long document QA performance. LLMLingua-2 compresses the input with [$20\%,40\%,60\%,80\%,90\%$] compression rates from right to left (gray dots).
  • Figure 5: Training data length distribution. D2L takes contexts, while the base model takes both queries and responses to compute the loss during meta-training. The total count is the number of unique context-query-response triplets. The original contexts, before query generation, contain roughly $900$M tokens.
  • ...and 8 more figures