A Family of LLMs Liberated from Static Vocabularies

Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum

Abstract

Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.
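
As a rough illustration of why word-level aggregation reduces the number of sequence positions, the following minimal sketch splits a text into word chunks and compares the byte count (the positions seen by the encoder and decoder) with the word count (the positions seen by the backbone). The whitespace-based rule and the names `split_into_words` and `position_counts` are assumptions made for this example; they are not the splitting rule or code used for the released models.

```python
import re


def split_into_words(text: str) -> list[bytes]:
    """Split text into word chunks and return each chunk as UTF-8 bytes.

    Hypothetical rule (an assumption, not the paper's splitting rule):
    a chunk is a run of non-whitespace characters plus any whitespace
    that follows it; leading whitespace forms its own chunk.
    """
    chunks = re.findall(r"\S+\s*|\s+", text)
    return [chunk.encode("utf-8") for chunk in chunks]


def position_counts(text: str) -> dict[str, float]:
    """Compare byte-level and word-level sequence lengths for a text."""
    words = split_into_words(text)
    n_bytes = sum(len(word) for word in words)
    n_words = len(words)
    return {
        "byte_positions": n_bytes,   # positions handled by encoder/decoder
        "word_positions": n_words,   # positions handled by the backbone
        "bytes_per_word": n_bytes / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    for sample in ["Tokenizer-free models read bytes.", "grüße dich"]:
        print(sample, position_counts(sample))
```

Because the chunks concatenate back to the original byte string, no vocabulary lookup is involved at any point; a misspelled or unseen word simply yields a different byte sequence within a single word position.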

Paper Structure (63 sections, 6 figures, 29 tables)

Figures (6)

  • Figure 1: (Left) The HAT architecture has three components: an encoder, backbone, and decoder, each implemented as a transformer. A full overview can be found in Figure \ref{fig:full-architecture}, while the encoder and decoder are detailed in Figures \ref{fig:encoder-cross-attn} and \ref{fig:decoder-cross-attn}, respectively. (Right) Average performance and compression for Llama-3.1-8B-TFree-HAT on benchmarks detailed in §\ref{sec:performance}.
  • Figure 2: Overview of our model architecture. The encoder and decoder are detailed in Figures \ref{fig:encoder-cross-attn} and \ref{fig:decoder-cross-attn}, respectively. The encoder processes the input text, producing word embeddings $\bm{w}_k$, which are then processed by the backbone to produce next-word predictions $\hat{\bm{w}}_{k+1}$. The decoder uses these predictions along with the encoder's byte-level outputs $\bm{b}$ to generate byte-level logits.
  • Figure 3: Visualization of the encoder and decoder of the HAT model.
  • Figure 4: Model Quality and Compression for our pre-trained T-Free model in comparison with Llama-3.1-8B.
  • Figure 5: Model Quality and Compression for our SFT T-Free model in comparison with Llama-3.1-Tulu-3-8B-SFT.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 1: General tokenizer
  • Definition 2: Splitting rule