Training LLMs over Neurally Compressed Text
Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
TL;DR
This work investigates training large language models directly over neurally compressed text, addressing the challenge that strong compression such as Arithmetic Coding (AC) produces outputs that are hard to learn from. The authors introduce Equal-Info Windows, which segment text into independently compressed blocks of equal bit length, so that a downstream LLM (M2) can learn over the compressed representation. This yields substantial token-level compression (around 5× for AC-based methods) and improves compute efficiency. They systematically compare AC, StaticAC, EqualInfoAC, and GZip against byte-level and SentencePiece baselines, showing that models trained over EqualInfoAC text learn effectively, outperform byte-level baselines, and approach subword tokenizers at scale, albeit with stability trade-offs. The findings highlight both the potential and the limitations of neural tokenizers for LLM training, and the paper offers practical guidance and open directions for designing learnable, high-compression tokenizers that reduce sequence length and inference latency.
Abstract
In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text naïvely compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, and reduce latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers.
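To make the Equal-Info Windows idea concrete, here is a minimal sketch of the segmentation step only. It is an illustration under strong assumptions, not the authors' implementation: a smoothed unigram byte model stands in for the neural compressor, the ideal code length (the sum of -log2 p over a window) stands in for real arithmetic-coder output, and the names `unigram_bits`, `equal_info_windows`, and `bits_per_window` are hypothetical. Each window is closed just before its bit budget would be exceeded; a real implementation would also reset the coder at each boundary and pad every window to exactly the same bit length.

```python
import math
from collections import Counter


def unigram_bits(corpus: bytes) -> dict[int, float]:
    """Cost of each byte value in bits, -log2 p(byte), with add-one smoothing.

    Assumption: a unigram model replaces the neural compressor for illustration.
    """
    counts = Counter(corpus)
    total = len(corpus) + 256  # add-one smoothing over all 256 byte values
    return {b: -math.log2((counts.get(b, 0) + 1) / total) for b in range(256)}


def equal_info_windows(
    text: bytes, cost: dict[int, float], bits_per_window: float = 16.0
) -> list[bytes]:
    """Greedily split text into windows whose ideal code length fits the budget.

    A window always holds at least one byte, so a single very unlikely byte
    may exceed the budget on its own.
    """
    windows: list[bytes] = []
    start = 0    # index where the current window begins
    used = 0.0   # bits consumed by the current window so far
    for i, b in enumerate(text):
        c = cost[b]
        # Close the current window if adding this byte would overflow it.
        if i > start and used + c > bits_per_window:
            windows.append(text[start:i])
            start, used = i, 0.0
        used += c
    if start < len(text):
        windows.append(text[start:])
    return windows


if __name__ == "__main__":
    corpus = b"the quick brown fox jumps over the lazy dog " * 20
    cost = unigram_bits(corpus)
    for w in equal_info_windows(b"the lazy dog jumps over the quick brown fox", cost):
        print(w, round(sum(cost[b] for b in w), 2))
```

Because every window carries roughly the same amount of information, predictable text yields long windows and surprising text yields short ones, while each window can still be mapped to a fixed-size token.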
