
Small Vocabularies, Big Gains: Pretraining and Tokenization in Time Series Models

Alexis Roger, Gwen Legate, Kashif Rasul, Yuriy Nevmyvaka, Irina Rish

TL;DR

The paper studies how tokenizer design and pretraining affect time-series forecasting with discrete representations. The authors systematically vary the scaling function (mean, min-max, or normal) and the binning function (uniform, normal, or exponential), and compare pretrained against randomly initialized models. They establish a power-law relationship between vocabulary size and the theoretical tokenization bound, and show that tokenizer configuration largely dictates representational capacity, while pretraining improves optimization and alignment. Pretrained models benefit most in small-vocabulary regimes when paired with a well-designed tokenizer (notably normal scaling with uniform binning, or mean scaling with normal binning), whereas misaligned tokenization can negate the gains from pretraining. These findings give practical guidance for designing tokenizers and applying transfer learning to discrete representations of continuous signals, and are especially relevant to multi-modal time-series forecasting, where a shared vocabulary is advantageous.

Abstract

Tokenization and transfer learning are two critical components in building state-of-the-art time series foundation models for forecasting. In this work, we systematically study the effect of tokenizer design, specifically scaling and quantization strategies, on model performance, alongside the impact of pretraining versus random initialization. We show that tokenizer configuration primarily governs the representational capacity and stability of the model, while transfer learning influences optimization efficiency and alignment. Using a combination of empirical training experiments and theoretical analyses, we demonstrate that pretrained models consistently leverage well-designed tokenizers more effectively, particularly at smaller vocabulary sizes. Conversely, misaligned tokenization can diminish or even invert the benefits of pretraining. These findings highlight the importance of careful tokenization in time series modeling and suggest that combining small, efficient vocabularies with pretrained weights is especially advantageous in multi-modal forecasting settings, where the overall vocabulary must be shared across modalities. Our results provide concrete guidance for designing tokenizers and leveraging transfer learning in discrete representation learning for continuous signals.
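To make the scaling-then-quantization pipeline concrete, here is a minimal sketch of one tokenizer configuration: mean scaling followed by uniform binning. It is an illustration under stated assumptions, not the authors' implementation. The helper names, the clipping range `LOW`/`HIGH`, and the choice to scale by the mean absolute value are all hypothetical choices made for this example.

```python
import numpy as np

# Assumed clipping range in scaled units (hypothetical, not from the paper).
LOW, HIGH = -5.0, 5.0

def mean_scale(context):
    """Scale factor: mean absolute value of the context (assumed convention)."""
    scale = float(np.mean(np.abs(context)))
    return scale if scale > 0 else 1.0  # avoid dividing by zero

def uniform_edges(vocab_size):
    """vocab_size + 1 edges that split [LOW, HIGH] into equal-width bins."""
    return np.linspace(LOW, HIGH, vocab_size + 1)

def tokenize(series, vocab_size, scale):
    """Map real values to integer token ids in [0, vocab_size - 1]."""
    edges = uniform_edges(vocab_size)
    scaled = np.clip(np.asarray(series, dtype=float) / scale, LOW, HIGH)
    # Only interior edges are passed, so ids run from 0 to vocab_size - 1.
    return np.digitize(scaled, edges[1:-1])

def detokenize(tokens, vocab_size, scale):
    """Map token ids back to real values using bin centers."""
    edges = uniform_edges(vocab_size)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens] * scale

series = np.array([12.0, 15.0, 11.0, 18.0, 14.0])
scale = mean_scale(series)
tokens = tokenize(series, vocab_size=512, scale=scale)
recovered = detokenize(tokens, vocab_size=512, scale=scale)
```

For values inside the clipping range, the round-trip error is at most half a bin width times the scale factor, so a smaller vocabulary directly coarsens what the model can represent. Swapping in a different scaling or binning function changes only `mean_scale` or `uniform_edges`.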

Paper Structure

This paper contains 24 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The theoretical lower bound on performance for each combination of scaling and quantization functions follows a power law in vocabulary size.
  • Figure 2: MASE score of the models at the end of training as a function of vocabulary size. For reference, the red star marks the Chronos T5 Large model, the closest to this training regime.
  • Figure 3: Loss and MASE across training for the different tokenization schemes, using the Qwen 3 600M model with a vocabulary size of 512. The vertical line on the right marks the end of training. The bottom line shows the performance of the Chronos T5 Large model, the closest to this training regime, for reference.
  • Figure 4: Histograms showing the token space utilization of the different tokenization strategies across multiple vocabulary sizes.
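
The quantities behind Figures 1 and 4 can be illustrated with a small experiment. The sketch below is not the paper's theoretical bound and uses no data from the paper. It assumes synthetic standard-normal samples, uniform binning over a hypothetical range of [-4, 4], and mean absolute reconstruction error as the measure. It then fits a line in log-log space to estimate a power-law exponent, and records the fraction of token ids that are actually used.

```python
import numpy as np

def error_and_utilization(x, vocab_size, low=-4.0, high=4.0):
    """Mean absolute quantization error and fraction of token ids in use."""
    x = np.clip(x, low, high)
    edges = np.linspace(low, high, vocab_size + 1)
    tokens = np.digitize(x, edges[1:-1])      # ids in [0, vocab_size - 1]
    centers = (edges[:-1] + edges[1:]) / 2    # reconstruction value per id
    mae = float(np.mean(np.abs(x - centers[tokens])))
    used = np.unique(tokens).size / vocab_size
    return mae, used

# Synthetic data, stands in for an already-scaled series (assumption).
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

vocab_sizes = np.array([16, 32, 64, 128, 256, 512])
results = [error_and_utilization(x, v) for v in vocab_sizes]
errors = np.array([r[0] for r in results])
utilization = np.array([r[1] for r in results])

# A power law error = c * V**slope is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(vocab_sizes), np.log(errors), 1)
print(f"fitted exponent: {slope:.2f}")
print("utilization per vocabulary size:", np.round(utilization, 3))
```

In this toy setting, doubling the vocabulary roughly halves the bin width, so the fitted exponent comes out near -1. Utilization tends to drop at larger vocabularies because bins in the tails of the distribution receive no samples, which is the kind of effect a token-utilization histogram makes visible.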