Proxy Compression for Language Modeling
Lin Zheng, Xinyu Li, Qian Liu, Xiachong Feng, Lingpeng Kong
TL;DR
Proxy compression tackles the dependence of language models on fixed external tokenizers by training on a mixture of raw-byte and proxy-compressed inputs, while inference operates purely on raw bytes. The approach uses sentinel markers and a Bernoulli sampling scheme to blend the two representations, enabling end-to-end byte-level inference without architectural changes. Across tokenizer-based and neural proxies, the method yields strong cross-representation transfer: gains grow with model scale, performance under fixed compute budgets approaches tokenizer baselines, and proxy-trained models surpass raw-byte baselines under comparable compute. Structured, stable proxies (tokenizer-based and neural) transfer well, whereas generic gzip proxies fail, and the learned cross-representation alignment also improves robustness to formatting perturbations. Overall, proxy compression offers substantial training efficiency and scalable transfer to raw-byte inference, advancing practical byte-level language modeling for code, with potential applicability beyond code domains.
Abstract
Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor that typically operates over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and on compressed views generated by external compressors; in the process, the model learns to internally align compressed sequences with raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs, which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines under fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer-based approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.
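To make the mixed-representation training described above concrete, the following is a minimal sketch of how one training example might be encoded. It is an illustration under stated assumptions, not the authors' implementation: the sentinel ids, the vocabulary layout, the per-example mixing granularity, the parameter name `p_proxy`, and the toy compressor `toy_proxy_compress` are all hypothetical. In practice the proxy would be a tokenizer or a neural compressor.

```python
import random
from typing import Callable, List

# Hypothetical vocabulary layout (an assumption, not from the paper):
# ids 0..255 are raw byte values, followed by two sentinel ids,
# followed by the id range reserved for proxy-compressed symbols.
RAW_SENTINEL = 256
PROXY_SENTINEL = 257
PROXY_OFFSET = 258


def toy_proxy_compress(data: bytes) -> List[int]:
    """Stand-in lossless compressor: packs each byte pair into one symbol.

    A real proxy would be a tokenizer or a neural compressor; this toy
    version only exists so the sketch is runnable.
    """
    ids = [data[i] * 256 + data[i + 1] for i in range(0, len(data) - 1, 2)]
    if len(data) % 2 == 1:
        # A trailing unpaired byte gets its own symbol range.
        ids.append(65536 + data[-1])
    return ids


def encode_training_example(
    text: str,
    p_proxy: float,
    rng: random.Random,
    compress: Callable[[bytes], List[int]] = toy_proxy_compress,
) -> List[int]:
    """Draw a Bernoulli(p_proxy) sample to pick the view for this example."""
    data = text.encode("utf-8")
    if rng.random() < p_proxy:
        # Compressed view, announced by its sentinel marker.
        return [PROXY_SENTINEL] + [PROXY_OFFSET + s for s in compress(data)]
    # Raw-byte view, announced by its sentinel marker.
    return [RAW_SENTINEL] + list(data)


def encode_for_inference(text: str) -> List[int]:
    """At inference only the raw-byte view is used; the proxy is discarded."""
    return [RAW_SENTINEL] + list(text.encode("utf-8"))
```

With a high `p_proxy`, most training sequences are short compressed views, which is where the efficiency gain would come from, while the occasional raw-byte examples give the model the chance to align the two formats. Setting `p_proxy=0.0` recovers a pure byte-level baseline in this sketch.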
