Chunk-Distilled Language Modeling

Yanhong Li, Karen Livescu, Jiawei Zhou

TL;DR

Chunk-Distilled Language Modeling (CD-LM) is a training-free approach that interleaves multi-token text chunks retrieved from a datastore with standard autoregressive LM generation, addressing both the inefficiency of token-level decoding and the difficulty of updating knowledge in large language models. The framework formalizes chunk generation with latent switch variables, supported by a trie-based chunk datastore and vector-space context matching, enabling adaptive knowledge injection from parametric, self-memory, or expert sources. Empirical results across language modeling, code, medical, and legal domains show substantial perplexity improvements and significant inference speedups, with factual knowledge injection boosting grounding and diversity. By avoiding retraining and supporting flexible chunk sources, CD-LM offers a practical path to domain adaptation and privacy-aware knowledge augmentation in real-world applications.

Abstract

We introduce Chunk-Distilled Language Modeling (CD-LM), an approach to text generation that addresses two challenges in current large language models (LLMs): the inefficiency of token-level generation, and the difficulty of adapting to new data and knowledge. Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step. Our retrieval framework enables flexible construction of model- or domain-specific datastores, either leveraging the internal knowledge of existing models, or incorporating expert insights from human-annotated corpora. This adaptability allows for enhanced control over the language model's distribution without necessitating additional training. We present the CD-LM formulation along with performance metrics demonstrating its ability to improve language model performance and efficiency across a diverse set of downstream tasks. Code and data will be made publicly available.
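
The interleaving of retrieved chunks with token-level decoding can be illustrated with a minimal sketch. It is not the authors' implementation: a flat list per entry token stands in for the trie-based datastore, a fixed cosine-similarity threshold stands in for the chunk acceptance decision, and every name here (`ChunkStore`, `generate`, `toy_lm_next`, `toy_embed`, `threshold`) is hypothetical. The point is only to show how one decoding step can emit several tokens when the current context matches a stored one.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ChunkStore:
    """Toy datastore: entry token -> list of (context vector, chunk tokens).

    Assumption: a flat list replaces the trie to keep the sketch short.
    """

    def __init__(self):
        self.entries = defaultdict(list)

    def add(self, entry_token, context_vec, chunk):
        self.entries[entry_token].append((context_vec, list(chunk)))

    def lookup(self, entry_token, context_vec):
        """Return (similarity, chunk) for the best match, or (0.0, [])."""
        best = (0.0, [])
        for vec, chunk in self.entries.get(entry_token, []):
            sim = cosine(vec, context_vec)
            if sim > best[0]:
                best = (sim, chunk)
        return best


def generate(prompt, lm_next, embed, store, max_len=12, threshold=0.9):
    """Interleave chunk retrieval with token-by-token generation.

    Returns the token list and the number of decoding steps that emitted text.
    """
    tokens = list(prompt)
    steps = 0
    while len(tokens) < max_len:
        sim, chunk = store.lookup(tokens[-1], embed(tokens))
        if chunk and sim >= threshold:
            tokens.extend(chunk)      # accept the chunk: several tokens, one step
        else:
            nxt = lm_next(tokens)     # fall back to the base LM: one token
            if nxt is None:
                break
            tokens.append(nxt)
        steps += 1
    return tokens[:max_len], steps


# Hypothetical stand-ins for a real LM and a real context encoder.
SENTENCE = "the answer is 42 according to Douglas Adams".split()


def toy_lm_next(tokens):
    return SENTENCE[len(tokens)] if len(tokens) < len(SENTENCE) else None


def toy_embed(tokens):
    return [1.0, 0.0] if "answer" in tokens else [0.0, 1.0]


store = ChunkStore()
store.add("to", [1.0, 0.0], ["Douglas", "Adams"])
output, steps = generate(["the"], toy_lm_next, toy_embed, store)
print(output, steps)
```

In this toy run the seven tokens after the prompt are produced in six decoding steps, because "Douglas Adams" is emitted as a single retrieved chunk once the entry token "to" appears in a matching context.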

Paper Structure (53 sections, 24 equations, 14 figures, 27 tables)

Figures (14)

  • Figure 1: LLMs may generate sequences with repeated chunks spanning continuous tokens conveying key information in similar contexts. Examples are generated from Llama-2-7b-chat.
  • Figure 2: LLM token probabilities for the sentence: "The answer to life, the universe, and everything is 42, according to Douglas Adams' The Hitchhiker's Guide to the Galaxy." These models bind token sequences such as "Douglas Adams'" and "The Hitchhiker's Guide to the Galaxy" into chunks with plateaus of high probability.
  • Figure 3: Overview of CD-LM. Colored text spans are generated together by chunk retrieval, interleaved with LM.
  • Figure 4: A graphical model illustration of the probabilistic model of CD-LM. The token nodes $x_n$ are observed, and the chunk acceptance variables $z_n$ are latent, governing how many tokens are generated at each step.
  • Figure 5: Comparison between KCD-LM and kNN-LM on PPL, along with datastore sizes controlled by chunk extraction threshold $\gamma$.
  • ...and 9 more figures