ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

Chunyuan Deng; Sanket Lokegaonkar; Colin Lockard; Besnik Fetahu; Nasser Zalmout; Xian Li

TL;DR

ByteFlow Net is a new hierarchical architecture that removes tokenizers entirely and lets models learn their own segmentation of raw byte streams into semantically meaningful units, opening a path toward more adaptive and information-grounded language models.

Abstract

Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-$K$ selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
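
The Top-$K$ idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a per-byte score is already available (the numbers below are hypothetical stand-ins for coding-rate scores, not values from the paper), and the function name `select_boundaries` is an illustrative assumption rather than the authors' implementation. The point it shows is only that always keeping exactly $K$ positions gives a fixed output length, whatever the input looks like.

```python
from typing import List, Sequence


def select_boundaries(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in sequence order.

    Hypothetical helper: always returning exactly k positions is what
    keeps the downsampled length fixed, independent of the input content.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Keep the top k and restore left-to-right order.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical per-byte scores for an 8-byte input (assumed values).
    scores = [0.9, 0.1, 0.2, 0.8, 0.1, 0.3, 0.7, 0.2]
    boundaries = select_boundaries(scores, k=3)
    print(boundaries)  # [0, 3, 6]
```

Because the number of selected positions depends only on `k` and not on the scores, a downstream module that consumes the selected positions sees the same sequence length for every input, which is the property the abstract refers to as a static computation graph.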

Paper Structure

This paper contains 58 sections, 20 equations, 4 figures, and 6 tables.

Introduction
Contributions.
Related Work
Tokenizer-free Architecture.
Tokenization in Language Modeling.
ByteFlow Net
Overview.
Local Encoder: Fast Processing over Byte-level Representations
Transformer Blocks with Sliding Window Attention.
Canon Layer.
Why SWA + Canon Layer for Token Mixing.
Downsampling: Coding-Rate Chunking
Lossy Coding Rate in Representation Space.
Streaming Decision.
Why Not Global Threshold?
...and 43 more sections

Figures (4)

Figure 1: Architecture of ByteFlow Net. (a) ByteFlow Net’s chunking strategy is primarily driven by the coding rate $R$ of latent representations: the model is encouraged to select token boundaries that form pooled subsequences which best compress the original input. (b) Since byte-level sequences are roughly $4\times$ longer than subword-token sequences, directly applying $O(n^2 d)$ softmax attention becomes prohibitively expensive. To address this, we adopt sliding-window attention (SWA) combined with Canon layers (Allen-Zhu, 2025), enabling efficient and low-cost token mixing. (c) The hierarchical architecture allocates the majority of FLOPs to high-level information (a deep and wide global transformer), while using lightweight local encoders/decoders (shallow and narrow) to quickly process low-level information.
Figure 2: Scaling Trends across Architectures. Validation BPB loss (lower is better) for different architectures at two model scales: 600M (left) and 1.3B (right). ByteFlow Net achieves better performance as the model and data recipe scale up.
Figure 3: Chunking Strategy Impact on Latent Representation Manifolds. Each point is a contextualized byte representation after the local encoder (after 1B training bytes), projected to 2D by t-SNE. We visualize 10 FineWeb-Edu validation segments, each $\sim$1500 bytes (15k points total); colors denote segments. Poor chunking (random, neural boundaries) fragments the original clustering, whereas coding-rate chunking preserves it.
Figure 4: Case Study of Character-Level Coding Rate Scores. This figure illustrates how ByteFlow Net assigns an information-theoretic "importance" score to each character in an example sentence. The model has learned to assign a higher coding rate to characters that are more semantically significant, such as the initial letters of words and key entities. Conversely, it assigns lower rates to more predictable characters within words. This demonstrates the model's ability to dynamically identify information-rich points in the byte stream to guide its chunking and resource allocation.
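
The cost argument in Figure 1(b) can be made concrete with a little arithmetic. The sketch below counts only the dominant query-key term and ignores constants; the model width, sequence lengths, and window size are hypothetical values chosen for illustration, not settings reported in the paper.

```python
def full_attention_cost(n: int, d: int) -> int:
    """Dominant cost of full softmax attention over n positions: n^2 * d."""
    return n * n * d


def sliding_window_cost(n: int, d: int, window: int) -> int:
    """Each position attends to at most `window` positions: n * window * d."""
    return n * min(window, n) * d


if __name__ == "__main__":
    d = 512                  # hypothetical model width
    n_tokens = 2048          # hypothetical subword sequence length
    n_bytes = 4 * n_tokens   # bytes are roughly 4x longer (Figure 1b)
    window = 256             # hypothetical sliding-window size

    # Quadratic scaling: a 4x longer sequence costs 16x more with full attention.
    print(full_attention_cost(n_bytes, d) // full_attention_cost(n_tokens, d))  # 16

    # Sliding-window attention is linear in n, so the saving is n / window.
    print(full_attention_cost(n_bytes, d) // sliding_window_cost(n_bytes, d, window))  # 32
```

Under these assumed numbers, moving from subword tokens to bytes multiplies the full-attention cost by 16, while a sliding window reduces the byte-level cost by a factor of $n / w$, which is why a windowed local encoder stays cheap even on the longer sequence.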
