xVal: A Continuous Numerical Tokenization for Scientific Language Models

Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

TL;DR

This work addresses the difficulty of handling numerically dense scientific data with LLMs by introducing xVal, a continuous number encoding that represents every number with a single token scaled along a learnable embedding axis, paired with a dedicated number head that recovers numeric values. The method is token-efficient, needs only a minimal vocabulary, and keeps the map from input numbers to outputs continuous end to end, which improves out-of-distribution (OoD) interpolation and computational efficiency relative to traditional text-based encodings. In experiments on ERA5 temperature forecasting and REBOUND planetary motion simulations, xVal shows superior OoD performance and robust prediction of the final timesteps; the experiments also highlight trade-offs when the number of [NUM] embeddings is increased. The work positions xVal as a principled way to inject numerical continuity into transformer models, with potential for differentiable losses and broader scientific applicability in future work.
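The encoding summarized above can be illustrated with a short sketch. This is not the authors' implementation: the regular expression, the toy tokenizer, the random embedding table, and the names `split_numbers` and `embed` are all assumptions made for illustration. The sketch only shows the core idea, that every number is replaced by one [NUM] token whose embedding vector is multiplied by the numeric value; the number head that reads values back out is not modeled here.

```python
import re
import numpy as np

NUM_TOKEN = "[NUM]"
# Assumed number pattern: signed integers and decimals such as "-3.5" or "42".
NUMBER_RE = re.compile(r"[-+]?\d+(?:\.\d+)?")

def split_numbers(text):
    """Replace each number with [NUM]; return the new text and the values."""
    values = [float(m) for m in NUMBER_RE.findall(text)]
    return NUMBER_RE.sub(NUM_TOKEN, text), values

def embed(tokens, values, table):
    """Look up token embeddings, scaling each [NUM] vector by its value."""
    numbers = iter(values)
    rows = []
    for tok in tokens:
        scale = next(numbers) if tok == NUM_TOKEN else 1.0
        rows.append(scale * table[tok])
    return np.stack(rows)

text, values = split_numbers("temp: 0.5, prev: -0.25")
# Toy tokenizer (assumption): [NUM], words, and single punctuation marks.
tokens = re.findall(r"\[NUM\]|\w+|[^\w\s]", text)
rng = np.random.default_rng(0)
# Random vectors stand in for a learned embedding table.
table = {tok: rng.normal(size=8) for tok in set(tokens)}
embedded = embed(tokens, values, table)
```

Both numbers share the single `[NUM]` vocabulary entry, so the vocabulary does not grow with the range of values, and a small change in an input value produces a proportionally small change in its embedding.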

Abstract

Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically-dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially-modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.

Paper Structure (34 sections, 1 equation, 14 figures, 13 tables)

Figures (14)

  • Figure 1: A simplified example illustrating the xVal number encoding and the modified number inference paradigm. On the left, xVal is contrasted with the P1000 text-based numerical encoding scheme. On the right, we illustrate how numbers are addressed within the decoder.
  • Figure 2: The coefficients of each [NUM] embedding vector in Equation \ref{eq:xvalhdr} are shown for an example of xVal with $k=1$, which spans three orders of magnitude. This illustrates how the default xVal embedding ($i=0$), which is primarily sensitive to values of $\mathcal{O}(1)$, can be supplemented with additional [NUM] embeddings sensitive to $\mathcal{O}(10)$, i.e. $i=-1$, and $\mathcal{O}(10^{-1})$, i.e. $i=1$. This paradigm can be extended to wider dynamic ranges.
  • Figure 3: Performance of the encoding schemes in predicting the temperature of the next timestep for each reporting station in the ERA5 dataset. Mean Squared Error (MSE) values are reported in Table \ref{tab:era5_nrmse}.
  • Figure 4: A failure mode of a text-based encoding scheme (left). Because of the distribution of the numbers in the training set (center and right), numbers close to $\pm 1$ (marked by the black arrows) are misclassified as 100E-3, i.e. 0.1, which combines the most common digit and the most common exponent in the dataset.
  • Figure 5: Performance of the encoding schemes predicting the 10 final timesteps of each planet for two simulated orbits. The prediction is not autoregressive: all 10 timesteps are predicted at the same time.
  • ...and 9 more figures