Bolmo: Byteifying the Next Generation of Language Models

Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann

TL;DR

Bolmo introduces the first open, competitive byte-level language models at 1B and 7B by byteifying an existing subword LM, addressing limitations of fixed subword vocabularies and character understanding. It delivers a purpose-built LTLM architecture with a local encoder/decoder, a non-causal boundary predictor, and a two-stage training protocol that first exactly distills the subword model and then trains end-to-end on byte-level data. Bolmo achieves strong results across diverse benchmarks, often surpassing prior byte-level LMs and approaching or matching the source subword LM on many tasks, with added benefits in character understanding and flexible, faster inference at higher compression factors. The work further shows that byteified models can leverage post-training techniques from the subword ecosystem, enabling zero-cost enhancement via Task Arithmetic and offering a practical path for widespread adoption of byte-level LMs. Collectively, Bolmo demonstrates that byte-level LMs can be competitive with subword-level models across a broad set of applications and opens directions for continued research in boundary learning, compression, and cross-ecosystem transfer.

Abstract

We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization (such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary) while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.

Paper Structure

This paper contains 61 sections, 11 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The Bolmo architecture. Tokenization & Embedding $\mathcal{T}$ transforms the input text into one representation per byte. The representations are contextualized with the local encoder $\mathcal{E}$ consisting of mLSTM blocks. The boundary predictor $\mathcal{B}$ decides where to place patch boundaries using one byte of future context. The representations are then Pooled, passed through the global model $\mathcal{M}$ consisting of Transformer layers, and Depooled. Finally, the local decoder $\mathcal{D}$ consisting of another mLSTM stack contextualizes the depooled byte representations and the LMHead transforms them into next-byte predictions, alongside deciding where to place the next patch boundary.
  • Figure 2: Subword-level LMs non-causally set boundaries over the prefill using the external subword tokenizer, then implicitly predict boundaries alongside the text content during decoding (left). Prior byte-level LTLMs causally set boundaries with a light-weight boundary predictor during both prefill and decoding (middle). We restore the expressivity of subword-level LM boundaries by non-causally predicting boundaries for the prefill, then predicting whether a boundary occurs alongside the next byte during decoding (right).
  • Figure 3: The task performance vs. efficiency Pareto frontier of (i) the source subword-level LM with tokenizer transfer to SuperBPE to achieve higher compression in bytes per patch and (ii) Bolmo models with adapted boundary prediction to achieve higher compression (see the section on increased compression). The subword-level LM breaks off the frontier as the cost of the softmax starts to dominate for larger vocabulary sizes; byte-level LMs take over the frontier at that point, as seen in the optimal region around the top-left corner.
  • Figure 4: Byteified models can be post-trained by leveraging an existing (subword-level) post-trained Olmo 3 checkpoint; shown is the performance on IFEval of the base Olmo 3 model ($\theta_{\text{PT}}$), the base Bolmo ($\theta_{\text{Bolmo}}$), a post-trained Olmo 3 checkpoint ($\theta_{\text{IT}}$), and the result of merging the post-trained checkpoint into Bolmo.
  • Figure 5: Boundary supervision by predicting the subword patch start or patch end using a causal or non-causal boundary predictor. Shown are the average task performance (left), the cosine distance of the local encoder representations to the target subword representations (middle), and the percentage of bytes where the predicted boundary differs from the true subword boundary (right) after Stage 1 training. Causal boundary predictors can achieve either accurate boundaries or accurate representations, but not both; non-causal boundary predictors enable both.
  • ...and 4 more figures
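
The merge described in the Figure 4 caption can be illustrated with the generic Task Arithmetic recipe: add the post-training difference $\theta_{\text{IT}} - \theta_{\text{PT}}$ to $\theta_{\text{Bolmo}}$. What follows is a minimal sketch under stated assumptions, not the paper's implementation. The function name, the parameter names (`global.w`, `local_encoder.w`), the `scale` factor, and the rule that only parameters present in all three checkpoints are shifted are all hypothetical choices made for illustration; the text above does not specify them.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def task_arithmetic_merge(
    theta_bolmo: Params,
    theta_pt: Params,
    theta_it: Params,
    scale: float = 1.0,  # assumption: optional scaling of the task vector
) -> Params:
    """Add the task vector (theta_it - theta_pt) to theta_bolmo.

    Assumption: only parameters found in all three checkpoints are shifted;
    parameters that exist only in the byte-level model are copied unchanged.
    """
    merged: Params = {}
    for name, weights in theta_bolmo.items():
        if name in theta_pt and name in theta_it:
            base, tuned = theta_pt[name], theta_it[name]
            if not (len(weights) == len(base) == len(tuned)):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            merged[name] = [
                w + scale * (t - b) for w, b, t in zip(weights, base, tuned)
            ]
        else:
            merged[name] = list(weights)
    return merged


# Toy usage with made-up numbers (not real model weights).
theta_pt = {"global.w": [1.0, 2.0]}
theta_it = {"global.w": [1.5, 1.0]}
theta_bolmo = {"global.w": [1.25, 2.5], "local_encoder.w": [0.5]}

merged = task_arithmetic_merge(theta_bolmo, theta_pt, theta_it)
print(merged)
```

In this toy case the shared parameter moves by the same offset that post-training applied to the subword model, while the byte-level-only parameter is left untouched. No further training is involved, which is the sense in which such a merge is cheap.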