xVal: A Continuous Numerical Tokenization for Scientific Language Models
Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho
TL;DR
This work addresses the difficulty LLMs have with numerically dense scientific data by introducing xVal, a continuous number encoding that represents each number as a single token along a learnable embedding axis, paired with a dedicated number head that recovers numeric values. The method is token-efficient, needs only a minimal vocabulary, and keeps the mapping from input numbers to outputs continuous end to end, which improves out-of-distribution (OoD) interpolation and computational efficiency relative to traditional text-based encodings. In experiments on ERA5 temperature forecasting and REBOUND planetary motion simulations, xVal demonstrates superior OoD performance and robust rollout of final timesteps, while highlighting trade-offs when increasing the number of NUM embeddings. The work positions xVal as a principled way to inject numerical continuity into transformer models, with potential for differentiable losses and broader applicability in future scientific research and workflows.
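To make the encoding described above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that numbers are found with a simple regular expression, that every number is replaced by one shared `[NUM]` placeholder, and that the placeholder's embedding is multiplied by the number's value. The names `NUMBER`, `xval_tokenize`, `embed`, and `number_head`, the toy embedding table, and the linear form of the number head are all hypothetical choices made for illustration. Any rescaling of values before embedding is left out.

```python
import re

# Hypothetical pattern for decimal literals; a real parser may differ.
# The lookbehind skips digits glued to a word, such as the "5" in "ERA5".
NUMBER = re.compile(r"(?<![\w.])[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def xval_tokenize(text):
    """Replace each number with a [NUM] placeholder and return the values separately."""
    values = [float(m.group()) for m in NUMBER.finditer(text)]
    masked = NUMBER.sub("[NUM]", text)
    return masked, values


def embed(tokens, values, table):
    """Look up each token's embedding; scale the shared [NUM] vector by the number's value."""
    remaining = iter(values)
    embedded = []
    for tok in tokens:
        vec = table[tok]
        if tok == "[NUM]":
            x = next(remaining)
            vec = [x * v for v in vec]  # one embedding axis, scaled continuously
        embedded.append(vec)
    return embedded


def number_head(hidden, weights, bias=0.0):
    """Assumed linear readout that maps a hidden state back to a scalar value."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


if __name__ == "__main__":
    masked, values = xval_tokenize("temp 0.5 pressure -0.25")
    # Toy 2-dimensional embedding table, for illustration only.
    table = {"temp": [1.0, 0.0], "pressure": [0.0, 1.0], "[NUM]": [2.0, 4.0]}
    print(masked, values)
    print(embed(masked.split(), values, table))
```

Because the value enters only as a multiplier on a single vector, nearby numbers receive nearby embeddings, and the vocabulary needs just one extra token regardless of how many distinct numbers appear in the data.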
Abstract
Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically-dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially-modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.
