Bolmo: Byteifying the Next Generation of Language Models
Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann
TL;DR
Bolmo introduces the first open, competitive byte-level language models at the 1B and 7B parameter scales by byteifying an existing subword LM, addressing the limitations of fixed subword vocabularies and weak character understanding. It delivers a purpose-built LTLM architecture with a local encoder/decoder, a non-causal boundary predictor, and a two-stage training protocol that first exactly distills the subword model and then trains end-to-end on byte-level data. Bolmo achieves strong results across diverse benchmarks, often surpassing prior byte-level LMs and approaching or matching the source subword LM on many tasks, with added benefits in character understanding and flexible, faster inference at higher compression factors. The work further shows that byteified models can reuse post-training techniques from the subword ecosystem, enabling zero-cost enhancement via Task Arithmetic and offering a practical path to widespread adoption of byte-level LMs. Collectively, Bolmo demonstrates that byte-level LMs can be competitive with subword-level models across a broad set of applications, and it opens directions for continued research in boundary learning, compression, and cross-ecosystem transfer.
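As a rough illustration of the Task Arithmetic idea mentioned above, the following minimal sketch adds the difference between a post-trained and a base subword checkpoint to the parameters that a byteified model shares with them. The function name, the parameter names, and the list-of-floats weight format are all hypothetical, and this is a generic sketch of task arithmetic under those assumptions, not necessarily the exact procedure used for Bolmo.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def apply_task_vector(
    byte_model: Weights,
    subword_base: Weights,
    subword_post: Weights,
    scale: float = 1.0,
) -> Weights:
    """Return a copy of byte_model with the subword task vector added.

    The task vector is (subword_post - subword_base). It is applied only
    to parameters present in all three checkpoints; byte-specific
    parameters (hypothetical example: a local encoder) stay unchanged.
    """
    merged = {name: list(values) for name, values in byte_model.items()}
    for name, base_values in subword_base.items():
        if name in byte_model and name in subword_post:
            merged[name] = [
                w + scale * (p - b)
                for w, b, p in zip(byte_model[name], base_values, subword_post[name])
            ]
    return merged


# Toy checkpoints with hypothetical parameter names.
byte_model = {"layer0.w": [1.0, 2.0], "local_encoder.w": [0.5]}
subword_base = {"layer0.w": [1.0, 1.0], "embed.w": [9.0]}
subword_post = {"layer0.w": [1.5, 0.0], "embed.w": [9.5]}

merged = apply_task_vector(byte_model, subword_base, subword_post)
```

In this toy setup only `layer0.w` is shared, so it is the only parameter that receives the update; the subword embedding has no counterpart in the byte-level model and is ignored.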
Abstract
We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization (such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary) while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
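To make the contrast between byte-level and subword-level inputs concrete, here is a small sketch that encodes a string as UTF-8 byte IDs and computes a compression ratio in bytes per token for a given segmentation. The helper names and the example segmentation are hypothetical and chosen only for illustration; they do not come from any real tokenizer or from the Bolmo codebase.

```python
from typing import List


def to_byte_ids(text: str) -> List[int]:
    """Encode text as a sequence of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def bytes_per_token(pieces: List[str]) -> float:
    """Average number of UTF-8 bytes covered by each token-like piece."""
    total_bytes = sum(len(piece.encode("utf-8")) for piece in pieces)
    return total_bytes / len(pieces)


text = "naïve café"

# Byte-level view: every byte is visible to the model, so the vocabulary
# needs at most 256 byte values regardless of language or spelling.
byte_ids = to_byte_ids(text)

# Hypothetical subword segmentation of the same text (illustrative only).
pieces = ["na", "ïve", " caf", "é"]
assert "".join(pieces) == text

print(len(byte_ids))            # 12 bytes for 10 characters
print(bytes_per_token(pieces))  # 3.0 bytes per piece
```

A higher bytes-per-token figure means fewer steps through the expensive part of the model for the same text, which is the sense in which higher compression ratios can speed up inference.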
