Training LLMs over Neurally Compressed Text

Brian Lester; Jaehoon Lee; Alex Alemi; Jeffrey Pennington; Adam Roberts; Jascha Sohl-Dickstein; Noah Constant

TL;DR

This work investigates training large language models directly over neurally compressed text, addressing the problem that strong compression such as Arithmetic Coding (AC) produces output that is hard for a model to learn. The authors introduce Equal-Info Windows, which segment text into independently compressed blocks of equal bit length, enabling a downstream model (M2) to learn over compressed representations and achieving substantial token-level compression (around 5× for AC-based methods) while improving compute efficiency. They systematically compare ArithmeticCoding, StaticAC, EqualInfoAC, and GZip approaches against byte-level and SentencePiece baselines, showing that EqualInfoAC is learnable, outperforms the byte baseline, and approaches subword tokenizers at scale, albeit with stability trade-offs. The findings highlight both the potential and the limitations of neural tokenizers for LLM training, and outline practical guidance and open directions for designing learnable, highly discriminative neural compression schemes that reduce sequence length and inference latency. Overall, the work shows that training over neurally compressed text is promising for efficiency gains and longer-context modeling, laying a foundation for future research in neural tokenization and compression-aware pretraining.

Abstract

In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text naïvely compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, and reduce latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers.
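The segmentation idea behind Equal-Info Windows can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a smoothed unigram byte model as a stand-in for M1 (the function `train_unigram` is hypothetical), and it uses the ideal code length `-log2(p)` of each byte as a proxy for what an arithmetic coder would spend. The greedy byte-by-byte fill follows the description in the Figure 2 caption. A real coder would also have to emit exactly N bits per window, which this sketch does not model.

```python
import math
from collections import Counter


def train_unigram(corpus: bytes) -> dict[int, float]:
    """Hypothetical stand-in for M1: add-one-smoothed byte frequencies."""
    counts = Counter(corpus)
    total = len(corpus) + 256  # one pseudo-count per possible byte value
    return {b: (counts.get(b, 0) + 1) / total for b in range(256)}


def equal_info_windows(
    text: bytes, probs: dict[int, float], bit_budget: int = 16
) -> list[bytes]:
    """Greedily split `text` into windows whose ideal cost fits `bit_budget`.

    Each window starts from scratch (no state carries over), mirroring the
    reset of the model and coder at every window boundary.
    """
    windows = []
    start = 0
    while start < len(text):
        bits_used = 0.0
        end = start
        while end < len(text):
            cost = -math.log2(probs[text[end]])  # ideal bits for this byte
            # Stop before exceeding the budget, but always take at least one
            # byte so the loop makes progress (an assumption of this sketch).
            if bits_used + cost > bit_budget and end > start:
                break
            bits_used += cost
            end += 1
        windows.append(text[start:end])
        start = end
    return windows


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog " * 3
    model = train_unigram(sample)
    for window in equal_info_windows(sample, model, bit_budget=16)[:5]:
        print(window)
```

Because every window is bounded by the same bit budget, frequent byte sequences yield windows covering more bytes and rare ones yield windows covering fewer, which is where the variable bytes-per-token rate comes from.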

Paper Structure (61 sections, 2 equations, 17 figures, 17 tables)

Introduction
Motivation and Background
Advantages of Training over Neurally Compressed Text
Efficiency
Longer Context
Distribution of Compute
Challenges of Training over Compressed Text
Learnability
Numerical Stability
Multi-Model Inference
Compression
Arithmetic Coding
Related Work
Methods
Training Data
...and 46 more sections

Figures (17)

Figure 1: An overview of our approach for training an LLM (M2) over neurally compressed text. First, M1 is trained as a standard byte-level language model: given a leftward context, M1 assigns a probability to each possible following byte. Next, corpus text is compressed into a bitstream using M1 as a compressor. Specifically, the probabilities that M1 assigns at each text position are fed into a compression algorithm like Arithmetic Coding that supports using dynamic symbol probabilities. Finally, this bitstream is chunked into tokens (e.g., 8-bit chunks), and M2 is trained as a language model over compressed text.
Figure 2: Under "Equal-Info Windows", text is encoded into a series of N-bit windows. To determine each successive window, the remaining text is encoded byte-by-byte via Arithmetic Coding until no more bytes can be added without exceeding the target bit threshold, here $16$ bits. Both M1 and the AC algorithm are reset at each step, so no information persists across windows.
Figure 3: Models trained over compressed text are compared against baseline models in terms of bits/byte ($\downarrow$) and inference FLOPs/byte ($\downarrow$). The ArithmeticCoding and StaticAC settings are essentially unlearnable, with models failing to outperform naïve baselines (dashed lines) that assign equal probability to all tokens. EqualInfoAC and GZip outperform naïve baselines and show improvement with scale. EqualInfoAC is the strongest of the compression-based methods and outperforms the Bytes baseline at all sizes. While SentencePiece performs the best, the gap between EqualInfoAC and SentencePiece narrows with scale. See the appendix (label app:graph-values) for the exact values used in this and other graphs.
Figure 4: Comparing models in terms of bits/byte ($\downarrow$) and bytes/step ($\uparrow$). As decoder steps can be a practical bottleneck for system latency, a model with higher FLOPs/byte or worse bits/byte may be preferred in order to achieve shorter sequence lengths. The dashed line is an example Pareto frontier, showing how a practitioner might value the trade-off between bits/byte and bytes/step. Our $2$ billion parameter EqualInfoAC model is on this frontier.
Figure 5: Performance of EqualInfoAC across various window sizes, $b \in \{16, 32, 64, 128\}$. When evaluating bits/byte (left) to control for compression ratio, we see an unintuitive trend where for most model sizes $b=16$ is best but $b=128$ is second-best. This is due to the higher compression rate achieved by longer Equal-Info Windows. When evaluating tokens/byte (right), a monotonic trend emerges, showing that shorter windows are easier to learn.
...and 12 more figures
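Two steps mentioned in the captions above lend themselves to a short illustration: chunking a compressed bitstream into fixed-width tokens (Figure 1) and expressing a token-level loss in bits/byte, including the naïve baseline that assigns equal probability to all tokens (Figure 3). The sketch below is an assumption-laden illustration, not the paper's code: the function names are hypothetical, the bitstream is represented as a string of '0' and '1' characters, and zero-padding the final chunk is a choice made here for simplicity.

```python
def chunk_bits(bitstream: str, bits_per_token: int = 8) -> list[int]:
    """Split a '0'/'1' string into fixed-width integer tokens.

    The tail is right-padded with zeros to a full token (an assumption of
    this sketch; the source does not specify how the last chunk is handled).
    """
    pad = (-len(bitstream)) % bits_per_token
    padded = bitstream + "0" * pad
    return [
        int(padded[i:i + bits_per_token], 2)
        for i in range(0, len(padded), bits_per_token)
    ]


def bits_per_byte(loss_bits_per_token: float, num_tokens: int, num_bytes: int) -> float:
    """Convert an average per-token loss (in bits) to bits per original byte."""
    return loss_bits_per_token * num_tokens / num_bytes


def uniform_baseline_bits_per_byte(
    bits_per_token: int, num_tokens: int, num_bytes: int
) -> float:
    """Bits/byte of a model that is uniform over all 2**bits_per_token tokens.

    Such a model spends exactly `bits_per_token` bits on every token.
    """
    return bits_per_byte(float(bits_per_token), num_tokens, num_bytes)


if __name__ == "__main__":
    tokens = chunk_bits("0000000111111111101", bits_per_token=8)
    print(tokens)
    # 3 tokens covering 12 original bytes under a uniform 8-bit token model.
    print(uniform_baseline_bits_per_byte(8, num_tokens=len(tokens), num_bytes=12))
```

Normalizing by original bytes rather than by tokens is what makes models with different tokenizers comparable: a model that predicts each token poorly can still score well in bits/byte if each token covers many bytes.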
