Proxy Compression for Language Modeling

Lin Zheng, Xinyu Li, Qian Liu, Xiachong Feng, Lingpeng Kong

TL;DR

Proxy compression reduces a language model's dependence on a fixed external tokenizer by training on a mixture of raw-byte and proxy-compressed inputs while running inference purely on raw bytes. The approach uses sentinel markers and a Bernoulli sampling scheme to blend the two representations, enabling end-to-end byte-level inference without architectural changes. With both tokenizer-based and neural proxies, the method yields strong cross-representation transfer: gains grow with model scale, and under a fixed compute budget proxy-trained models approach tokenizer baselines and surpass raw-byte baselines. Structured, stable proxies (tokenizer-based and neural) transfer well, whereas a generic gzip proxy fails; the learned cross-representation alignment also improves robustness to formatting perturbations. Overall, proxy compression offers substantial training efficiency and scalable transfer to raw-byte inference, advancing practical byte-level language modeling for code, with potential applicability beyond code.

Abstract

Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, one language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.
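
The mixed-representation input preparation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentinel IDs, the vocabulary layout, the per-document Bernoulli draw, the parameter name `p_comp`, and the `toy_compress` stand-in are all hypothetical choices made for the example.

```python
import random

# Hypothetical vocabulary layout: ids 0..255 are raw bytes, followed by four
# sentinel ids, followed by the compressed-symbol ids.
RAW_OPEN, RAW_CLOSE, COMP_OPEN, COMP_CLOSE = 256, 257, 258, 259
COMP_OFFSET = 260


def toy_compress(data: bytes) -> list[int]:
    """Toy stand-in for a proxy compressor (an assumption, not the paper's).

    Packs each pair of bytes into one symbol, roughly halving the length.
    A trailing unpaired byte is mapped to 65536 + byte.
    """
    symbols = [(data[i] << 8) | data[i + 1] for i in range(0, len(data) - 1, 2)]
    if len(data) % 2 == 1:
        symbols.append(65536 + data[-1])
    return symbols


def encode_example(data: bytes, p_comp: float, rng: random.Random) -> list[int]:
    """Pick one representation per document with a Bernoulli(p_comp) draw."""
    if rng.random() < p_comp:
        body = [COMP_OFFSET + s for s in toy_compress(data)]
        return [COMP_OPEN, *body, COMP_CLOSE]
    return [RAW_OPEN, *data, RAW_CLOSE]


def pack(docs: list[bytes], p_comp: float, seed: int = 0) -> list[int]:
    """Concatenate encoded documents into one training stream."""
    rng = random.Random(seed)
    stream: list[int] = []
    for doc in docs:
        stream.extend(encode_example(doc, p_comp, rng))
    return stream


def raw_prompt(prompt: bytes) -> list[int]:
    """Inference-time input: raw bytes only, with no compressor involved."""
    return [RAW_OPEN, *prompt]
```

Setting `p_comp=0.9` would mirror the 90% compressed share mentioned in the Figure 1 caption, while `raw_prompt` reflects the fact that the compressor is dropped at inference. A single next-symbol prediction model would then be trained on the output of `pack`.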

Paper Structure (77 sections, 12 equations, 25 figures, 13 tables)


Figures (25)

  • Figure 1: Overview of proxy compression for language modeling. During training, we prepare mixed-representation inputs by combining compressed sequences with raw UTF-8 bytes, which are packed together to train a single language model with next-symbol prediction over both representations. Different representations are associated with special sentinels, such as $\mathtt{\langle comp\rangle}$, $\mathtt{\langle /comp\rangle}$ for compressed data and $\mathtt{\langle raw\rangle}$ and $\mathtt{\langle /raw\rangle}$ for raw data. At inference time, the proxy compressor is discarded entirely, and the model operates solely on raw bytes. By training primarily on compressed data (e.g., 90% of training data in this work), this approach captures the training efficiency benefits of compressed data without hard-wiring the compressor into the model's interface.
  • Figure 2: Model performance (Pass@1) on MBPP-Plus across model scales. Bars show absolute performance (left axis); lines show the performance gap ($\Delta$) relative to the tokenizer baseline (right axis). While byte-level models exhibit a persistent or widening gap, proxy-based models progressively close the gap as the model scale increases.
  • Figure 3: Pass@1 performance on HumanEval-Plus for 14B models under different input representations, compared as a function of training FLOPs (left) and amount of training data (right).
  • Figure 4: Compressor stability analysis under input perturbation: we apply random 10% character deletion to 80K samples and measure normalized Levenshtein distance between compressed outputs before and after perturbation.
  • Figure 5: HumanEval-Plus pass@1 of 1.5B gzip-proxy models with different mixing ratios $r$ as a function of training data.
  • ...and 20 more figures
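
The stability measurement described in the Figure 4 caption (random character deletion, then normalized Levenshtein distance between the compressed outputs) can be sketched as below. This is an illustrative sketch under assumptions: the caption does not state how the distance is normalized, so dividing by the longer sequence length is a guess, and the helper names are hypothetical. Any compressor can be passed in as a callable, for example `zlib.compress` as a DEFLATE-based stand-in for gzip.

```python
import random
import zlib


def levenshtein(a, b) -> int:
    """Edit distance between two sequences, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


def normalized_distance(a, b) -> float:
    """Assumed normalization: divide by the longer length, giving [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def delete_chars(text: str, rate: float, rng: random.Random) -> str:
    """Drop each character independently with probability `rate`."""
    return "".join(ch for ch in text if rng.random() >= rate)


def stability(compress, text: str, rate: float = 0.1, seed: int = 0) -> float:
    """Distance between compressed outputs before and after perturbation."""
    perturbed = delete_chars(text, rate, random.Random(seed))
    return normalized_distance(compress(text), compress(perturbed))


def deflate(text: str) -> bytes:
    """DEFLATE byte stream, used here as a stand-in for a gzip proxy."""
    return zlib.compress(text.encode("utf-8"))


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n" * 4
    print("deflate:", stability(deflate, sample))
    print("whitespace split:", stability(str.split, sample))
```

A lower score means the compressed sequence changes less when the input is perturbed. Averaging `stability` over many samples would give a per-compressor summary comparable in spirit to the figure.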