You Can Learn Tokenization End-to-End with Reinforcement Learning

Sam Dauncey, Roger Wattenhofer

TL;DR

This work observes that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of score function estimates for learning token boundaries sufficiently to make them practicable, and demonstrates that the resulting method outperforms previously proposed straight-through estimates, both qualitatively and quantitatively, at the $100$ million parameter scale.

Abstract

Tokenization is a hardcoded compression step that remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards increasingly end-to-end architectures. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture, using heuristics to draw token boundaries, and has also attempted to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees because they directly optimize the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of these score function estimates sufficiently to make them practicable. We demonstrate that the resulting method outperforms previously proposed straight-through estimates, both qualitatively and quantitatively, at the $100$ million parameter scale.
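
To make the idea of a time-discounted score function estimate concrete, the following is a minimal sketch and not the authors' implementation. It assumes each byte position $t$ has an independent Bernoulli boundary decision $a_t$ with probability $\sigma(z_t)$, that a per-position loss is available, and that a decision at position $t$ only affects losses at positions $k \ge t$. The names `discounted_returns`, `score_function_grad`, `logits`, `losses`, and `gamma` are hypothetical and chosen for illustration only.

```python
import numpy as np


def discounted_returns(values, gamma):
    """Return G_t = sum_{k >= t} gamma**(k - t) * values[k], computed right to left."""
    returns = np.zeros(len(values), dtype=float)
    running = 0.0
    for t in range(len(values) - 1, -1, -1):
        running = values[t] + gamma * running
        returns[t] = running
    return returns


def score_function_grad(logits, actions, losses, gamma=0.9):
    """Score function (REINFORCE-style) estimate of the gradient of the
    expected discounted loss with respect to the boundary logits.

    Assumption: actions[t] ~ Bernoulli(sigmoid(logits[t])), sampled independently.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    # d/d logit of log Bernoulli(a; sigmoid(logit)) equals a - p.
    score = actions - probs
    # Each decision is credited only with the discounted losses that follow it.
    cost_to_go = discounted_returns(losses, gamma)
    return score * cost_to_go


# Usage: three positions, boundaries sampled at positions 0 and 2.
logits = np.zeros(3)
actions = np.array([1.0, 0.0, 1.0])
losses = np.array([1.0, 1.0, 1.0])
grad = score_function_grad(logits, actions, losses, gamma=0.5)
```

With `gamma = 1` this reduces to the undiscounted loss-to-go weighting. Choosing `gamma < 1` down-weights losses far from the decision, which typically lowers the variance of the estimate at the cost of some bias. A practical version would usually also subtract a baseline from `cost_to_go`.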

Paper Structure (31 sections, 17 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: [Left] An example of the autoregressive U-net architecture nawrot2022hourglass, computed with the values $x_{0:7} = \texttt{<b>Learn t}$ and $a_{0:7} = 11001010$. Blocks with rounded edges represent transformer blocks, and arrows represent the flow of representations. [Right] The stochastic computation graph schulman2015gradient of a forward pass of our method, with deterministic nodes in squares, stochastic nodes in circles, and distributions in gray.
  • Figure 2: [Left] Token boundaries learned by our 147M-parameter model on a held-out sample of the FineWeb dataset. [Right] Token boundaries learned by our 90M-parameter model on a held-out sample of the CodeParrot dataset. Red and blue characters indicate high and low values of $\pi_{\theta}(a)$ respectively at the corresponding bytes.
  • Figure 3: Token boundaries learned by our 147M-parameter model on a held-out sample of the FineWeb dataset. Red and blue characters indicate high and low values of $\pi_{\theta}(a)$ respectively at the corresponding bytes.
  • Figure 4: Token boundaries learned by a 147M-parameter model using the straight-through estimator of nawrot2023dynamicpooling, on held-out samples of the FineWeb dataset. Instead of the probabilities, here we plot the soft boundaries at the output of the Gumbel-Sigmoid. Red and blue characters indicate high and low values of $\hat{b}_t$ respectively at the corresponding bytes.
  • Figure 5: Token boundaries learned by a 147M-parameter model using the straight-through estimator of Hwang et al. hwang2025hnet, on held-out samples of the FineWeb dataset. Instead of the probabilities, here we plot the hard token boundaries. Red and blue characters indicate high and low values of $\hat{b}_t$ respectively at the corresponding bytes.
  • ...and 5 more figures