You Can Learn Tokenization End-to-End with Reinforcement Learning

Sam Dauncey; Roger Wattenhofer

TL;DR

This work shows that token boundaries can be learned with score function estimates, observes that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of these estimates sufficiently to make them practicable, and demonstrates that the resulting method outperforms previously proposed straight-through estimates, both qualitatively and quantitatively, at the $100$ million parameter scale.

Abstract

Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the $100$ million parameter scale.
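
To make the idea of a time-discounted score function estimate concrete, here is a minimal sketch. It is not the paper's implementation: it assumes each boundary decision a_t is drawn from a Bernoulli distribution with probability sigmoid(logit_t), that a per-position loss is available, and that a decision at position t is credited only with losses at positions k >= t, weighted by gamma^(k - t). The function names and the default gamma are hypothetical, and the further variance-reduction steps listed in the paper outline (early exit relative rewards, batch-relative advantages) are not modelled.

```python
import math


def discounted_returns(losses, gamma):
    """Return G_t = sum over k >= t of gamma**(k - t) * losses[k]."""
    returns = [0.0] * len(losses)
    running = 0.0
    # Accumulate from the last position backwards so each step is O(1).
    for t in reversed(range(len(losses))):
        running = losses[t] + gamma * running
        returns[t] = running
    return returns


def score_function_grad(logits, actions, losses, gamma=0.9):
    """Single-sample score function estimate of the gradient of the
    expected discounted loss with respect to the boundary logits.

    Assumption (not from the source): a_t ~ Bernoulli(sigmoid(logit_t)),
    for which d/d(logit_t) log pi(a_t) = a_t - sigmoid(logit_t).
    """
    returns = discounted_returns(losses, gamma)
    grads = []
    for logit, action, ret in zip(logits, actions, returns):
        prob = 1.0 / (1.0 + math.exp(-logit))
        grads.append(ret * (action - prob))
    return grads


if __name__ == "__main__":
    # Hypothetical toy inputs: three byte positions, unit loss at each.
    print(score_function_grad([0.0, 0.0, 0.0], [1, 0, 1], [1.0, 1.0, 1.0], gamma=0.5))
```

Choosing gamma < 1 shrinks the contribution of distant losses to each decision's return, which is the usual way time discounting trades a small bias for lower variance in such estimators.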

Paper Structure (31 sections, 17 equations, 10 figures, 2 tables)

Introduction
Theory & Method
Desiderata for designing an end-to-end tokenization method
Autoregressive U-net architecture and setup
Score function estimation for tokenization
Reducing the variance of the score function estimate
Early exit relative rewards
Time discounting
Batch-relative advantages
Defining the token boundary function
Downsample Rate Targeting
Full loss formula
Related Work
Experiments
Learned tokenization strategies for Natural Language
...and 16 more sections

Figures (10)

Figure 1: [Left] An example of the autoregressive U-net architecture nawrot2022hourglass, computed with the values $x_{0:7} = \texttt{<b>Learn t} \;$ and $\; a_{0:7} = 11001010$. Blocks with rounded edges represent transformer blocks; arrows represent the flow of representations. [Right] The stochastic computation graph schulman2015gradient of a forward pass of our method, with deterministic nodes in squares, stochastic nodes in circles, and distributions in gray.
Figure 2: [Left] Token boundaries learned by our 147M-parameter model on a held-out sample of the FineWeb dataset. [Right] Token boundaries learned by our 90M-parameter model on a held-out sample of the CodeParrot dataset. Red and blue characters indicate high or low values of $\pi_{\theta}(a)$ respectively at the corresponding bytes.
Figure 3: Token boundaries learned by our 147M-parameter model on a held-out sample of the FineWeb dataset. Red and blue characters indicate high or low values of $\pi_{\theta}(a)$ respectively at the corresponding bytes.
Figure 4: Token boundaries learned by a 147M-parameter model using the straight-through estimator of nawrot2023dynamicpooling, on held-out samples of the FineWeb dataset. Instead of the probabilities, here we plot the soft boundaries at the output of the Gumbel-Sigmoid. Red and blue characters indicate high or low values of $\hat{b}_t$ respectively at the corresponding bytes.
Figure 5: Token boundaries learned by a 147M-parameter model using the straight-through estimator of Hwang et al. hwang2025hnet, on held-out samples of the FineWeb dataset. Instead of the probabilities, here we plot the hard token boundaries. Red and blue characters indicate high or low values of $\hat{b}_t$ respectively at the corresponding bytes.
...and 5 more figures
