SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Kevin Slagle

TL;DR

SpaceByte is a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling: it outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures.

Abstract

Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model, but with extra larger transformer blocks inserted in the middle of the layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures.

Paper Structure

This paper contains 16 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: An overview of the SpaceByte architecture. The embedding, local transformer blocks, and de-embedding (i.e. a layer norm and linear) are the standard Transformer decoder layers. SpaceByte modifies the standard transformer by applying "global" transformer blocks only after certain bytes, such as space characters. The intuition is that the first character of a word is typically the hardest to predict; thus this positioning of the global blocks should make the best use of the global blocks (which use a larger model dimension).
  • Figure 2: Examples of patch boundaries from datasets that we study. Spacelike bytes are underlined and colored blue. Patch boundaries are drawn above the text. Each patch ends after a spacelike byte that is not preceded by another spacelike byte. Consequently, each patch begins with zero or more spacelike bytes, followed by one or more non-spacelike bytes, and ends with a single spacelike byte. The global blocks predict the first character of each patch. The downward arrow (↓) denotes a newline byte. The left and right quotation characters, (“) and (”) in the PG-19 example, are each encoded using three bytes in UTF-8. The first of the three bytes is spacelike, while the latter two bytes are UTF-8 continuation bytes, which are not spacelike and are each denoted using a bullet point (•) above.
  • Figure 3: Pareto frontier of the cross-entropy bits-per-byte vs FLOPs-per-byte during inference (details in the paper's appendix on FLOPs) for each model architecture trained using $10^{18}$ (connected by thin lines) or $10^{19}$ (thick lines) FLOPs on different datasets (on a log-log scale). Each dot describes a model with a different number of layers and/or model dimension. Lower and to the left is better. SpaceByte (red) outperforms all other byte-level architectures across the entire Pareto frontier for all datasets. SpaceByte roughly matches the performance of the subword Transformer using SentencePiece tokens, and outperforms the subword Transformer using GPT2 tokens.
  • Figure 4: The Pareto frontier models from Figure 3, where we plot the bits-per-byte vs the number of bytes used for training divided by the number of non-embedding parameters (defined in the paper's table of parameters).
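
The patch-boundary rule in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation. The function names `is_spacelike` and `split_patches` are hypothetical. The definition of "spacelike" used here (any byte that is not an ASCII letter, an ASCII digit, or a UTF-8 continuation byte) is an assumption that is consistent with the caption but not stated in it. How the very first byte of a text is treated is likewise a guess.

```python
def is_spacelike(b: int) -> bool:
    """Assumed definition: a byte is spacelike unless it is an ASCII
    letter or digit, or a UTF-8 continuation byte (0x80..0xBF)."""
    if 0x80 <= b <= 0xBF:  # UTF-8 continuation byte
        return False
    if b < 0x80 and chr(b).isalnum():  # ASCII letter or digit
        return False
    return True


def split_patches(data: bytes) -> list[bytes]:
    """Split bytes into patches: a patch ends after a spacelike byte
    that is not preceded by another spacelike byte."""
    patches = []
    start = 0
    prev_spacelike = False  # assumption: nothing precedes the first byte
    for i, b in enumerate(data):
        cur = is_spacelike(b)
        if cur and not prev_spacelike:
            patches.append(data[start:i + 1])
            start = i + 1
        prev_spacelike = cur
    if start < len(data):  # trailing bytes with no closing spacelike byte
        patches.append(data[start:])
    return patches


if __name__ == "__main__":
    print(split_patches(b"the cat, sat"))
```

In this sketch, `b"the cat, sat"` splits into `b"the "`, `b"cat,"`, and `b" sat"`. The space after the comma does not open a new boundary because it follows another spacelike byte, so it becomes the leading byte of the next patch, matching the "zero or more spacelike bytes, then one or more non-spacelike bytes, then a single spacelike byte" pattern described in the caption.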