Byte Latent Transformer: Patches Scale Better Than Tokens

Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer

TL;DR

BLT addresses the tokenization bottleneck by operating directly on raw bytes and dynamically grouping them into patches according to data complexity. The model uses a three-module architecture (Local Encoder, Latent Global Transformer, Local Decoder) and an entropy-driven patching strategy, achieving parity with token-based LLMs at scales up to 8B parameters and 4T training bytes while offering up to 50% inference FLOP savings. The work demonstrates that a tokenizer-free approach can deliver robust handling of long-tail and noisy data, with improvements in orthographic and phonological tasks, and enables simultaneous growth of model and patch size within fixed inference budgets. It also provides a comprehensive scaling study, ablations, and practical insights for deploying patch-based byte modeling at scale.

Abstract

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP-controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve because long patches are selected dynamically when data is predictable, and we also observe qualitative improvements in reasoning and long-tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
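
The entropy-based segmentation described in the abstract can be illustrated with a minimal Python sketch. It assumes the per-byte next-byte entropies are already available from some small byte-level model, and it starts a new patch wherever the entropy exceeds a global threshold. The function names, the example entropy values, and the threshold are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in bits of a single next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def patch_starts(entropies: Sequence[float], theta_g: float) -> List[int]:
    """Indices at which a new patch begins.

    Position 0 always starts a patch; any later position i starts a new
    patch when its entropy H(x_i) exceeds the global threshold theta_g.
    """
    starts = [0] if len(entropies) > 0 else []
    for i in range(1, len(entropies)):
        if entropies[i] > theta_g:
            starts.append(i)
    return starts


def split_into_patches(
    data: bytes, entropies: Sequence[float], theta_g: float
) -> List[bytes]:
    """Group bytes into variable-length patches using the boundaries above."""
    if len(data) != len(entropies):
        raise ValueError("need exactly one entropy value per byte")
    bounds = patch_starts(entropies, theta_g) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Made-up entropies: high at the start of a name, low once it is predictable.
    example_entropies = [3.0, 2.5, 0.4, 0.3, 0.2, 0.1]
    print(split_into_patches(b"George", example_entropies, theta_g=2.0))
```

With these illustrative values, the high-entropy first byte forms its own patch and the remaining predictable bytes are grouped into one longer patch, which is the behavior the abstract attributes to predictable data.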

Paper Structure

This paper contains 52 sections, 11 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Scaling trends for fixed-inference-FLOP models (fully) trained with varying training budgets. In token-based models, a fixed inference budget determines the model size. In contrast, the BLT architecture provides a new scaling axis, allowing simultaneous increases in model and patch size while keeping the same training and inference budget. BLT patch-size (ps) 6 and 8 models quickly overtake the scaling trends of BPE Llama 2 and 3. Moving to the larger inference budget makes the larger patch-size 8 model more desirable sooner. Both the BPE compute-optimal point and the crossover point are indicated with vertical lines.
  • Figure 2: BLT comprises three modules: a lightweight Local Encoder that encodes input bytes into patch representations, a computationally expensive Latent Transformer over patch representations, and a lightweight Local Decoder to decode the next patch of bytes. BLT incorporates byte $n$-gram embeddings and a cross-attention mechanism to maximize information flow between the Latent Transformer and the byte-level modules (Figure 5). Unlike fixed-vocabulary tokenization, BLT dynamically groups bytes into patches, preserving access to the byte-level information.
  • Figure 3: Patching schemes group bytes in different ways, each leading to a different number of resulting patches. Since each patch is processed using a large transformer step, the number of patches directly determines the bulk of the compute expended in terms of FLOPs. These schemes group bytes into patches by (a) striding every four bytes as in MegaByte (Yu et al., 2023), (b) tokenizing with Byte-Pair Encoding (BPE), in this case the Llama-3 (Dubey et al., 2024) tokenizer, (c & d) entropy-based patching as in this work, (e) patching on space-bytes (Slagle, 2024), and (f) patching on entropy using a small CNN byte-level model with 2-byte context.
  • Figure 4: This figure plots the entropy $H(x_i)$ of each byte in "Daenerys Targeryen is in Game of Thrones, a fantasy epic by George R.R. Martin." with spaces shown as underscores. Patches end when $H(x_i)$ exceeds the global threshold $\theta_g$, shown as a red horizontal line. The starts of new patches are shown with vertical gray lines. For example, the entropies of "G" and "e" in "George R.R. Martin" exceed $\theta_g$, so "G" starts a single-byte patch and "e" starts a larger patch that extends to the end of the named entity, because the entropy $H(x_i)$ stays low and no additional patches are created.
  • Figure 5: The local encoder uses a cross-attention block with patch representations as queries and byte representations as keys/values to encode byte representations into patch representations. The local decoder uses a similar block but with the roles reversed, i.e., byte representations are now the queries and patch representations are the keys/values. Here we use Cross-Attn $k=2$.
  • ...and 4 more figures
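
The Figure 3 caption notes that the number of patches determines most of the compute, since each patch costs one step of the large transformer. As a rough illustration, the Python sketch below compares two of the simpler schemes named there, fixed striding and patching on spaces, by counting the patches each one produces. The function names are hypothetical, and the space rule is a simplification (a new patch begins after every ASCII space byte) rather than the exact rule used by any of the cited systems.

```python
from typing import List


def stride_patches(data: bytes, stride: int = 4) -> List[bytes]:
    """Fixed-size patching: cut the byte sequence every `stride` bytes."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return [data[i:i + stride] for i in range(0, len(data), stride)]


def space_patches(data: bytes) -> List[bytes]:
    """Simplified space patching: a patch ends right after each space byte."""
    patches: List[bytes] = []
    start = 0
    for i, byte in enumerate(data):
        if byte == 0x20:  # ASCII space closes the current patch
            patches.append(data[start:i + 1])
            start = i + 1
    if start < len(data):  # trailing bytes after the last space
        patches.append(data[start:])
    return patches


def average_patch_size(patches: List[bytes]) -> float:
    """Mean number of bytes per patch (0.0 for an empty list)."""
    total = sum(len(p) for p in patches)
    return total / len(patches) if patches else 0.0


if __name__ == "__main__":
    text = b"a fantasy epic by George R.R. Martin."
    for name, patches in [
        ("stride-4", stride_patches(text, 4)),
        ("space", space_patches(text)),
    ]:
        # Fewer patches means fewer steps of the large transformer.
        print(name, len(patches), round(average_patch_size(patches), 2))
```

Under the assumption that the large transformer dominates cost, a scheme that yields fewer patches (a larger average patch size) for the same bytes spends proportionally fewer large-model steps on them.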