Neural Sabermetrics with World Model: Play-by-play Predictive Modeling with Large Language Model

Young Jin Ahn, Yiyang Du, Zheyuan Zhang, Haisen Kang

TL;DR

The paper reframes baseball analytics from descriptive sabermetrics to predictive, generative modeling by treating game events as a long autoregressive sequence and training a single LLM as a play-by-play world model. Using continuous pretraining on over a decade of MLB data, the model unifies pitch-level dynamics and batter decisions, achieving competitive or superior performance on pitch-type and swing-prediction tasks even under postseason distribution shifts. Key contributions include a unified event representation, a scalable training approach with sliding-window sequences, and demonstrated generalization beyond regular-season data. The work suggests world models as a scalable framework for predictive sports analytics with extensibility to new data modalities and prediction objectives.

Abstract

Classical sabermetrics has profoundly shaped baseball analytics by summarizing long histories of play into compact statistics. While these metrics are invaluable for valuation and retrospective analysis, they do not define a generative model of how baseball games unfold pitch by pitch, leaving most existing approaches limited to single-step prediction or post-hoc analysis. In this work, we present Neural Sabermetrics with World Model, a Large Language Model (LLM)-based play-by-play world model for baseball. We cast baseball games as long autoregressive sequences of events and continuously pretrain a single LLM on more than ten years of Major League Baseball (MLB) tracking data, comprising over seven million pitch sequences and approximately three billion tokens. The resulting model can predict multiple aspects of game evolution within a unified framework. We evaluate our model on both in-distribution regular-season data and out-of-distribution postseason games and compare against strong neural baselines from prior work. Despite using a single backbone model, our approach outperforms existing baselines, correctly predicting (1) approximately 64% of next pitches within a plate appearance and (2) 78% of batter swing decisions, suggesting that LLMs can serve as effective world models for sports.
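
To make the idea of casting a game as an autoregressive text sequence concrete, the following is a minimal sketch, not the authors' pipeline. The `PitchEvent` fields, the `key=value` line format, and the `window` and `stride` parameters are all assumptions made for illustration; the paper's actual schema and tokenization are not described in this section.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class PitchEvent:
    # Hypothetical fields; the real tracking schema is not given in the text.
    inning: int
    balls: int
    strikes: int
    pitch_type: str  # e.g. "FF" (four-seam fastball), "SL" (slider)
    swing: bool


def serialize(event: PitchEvent) -> str:
    """Render one pitch as a single line of text (assumed format)."""
    swing = "yes" if event.swing else "no"
    return (
        f"inning={event.inning} count={event.balls}-{event.strikes} "
        f"pitch={event.pitch_type} swing={swing}"
    )


def sliding_windows(lines: List[str], window: int, stride: int) -> Iterator[str]:
    """Yield overlapping chunks of consecutive event lines as training text."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not lines:
        return
    last_start = max(len(lines) - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        # Add a final window so the end of the game is always covered.
        starts.append(last_start)
    for start in starts:
        yield "\n".join(lines[start:start + window])
```

Each yielded string could then be tokenized and used as one next-token-prediction example, so that every event is predicted from the events that precede it in the window.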

Paper Structure (21 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Demonstration of the data. All tracking data from the game event are serialized in sequential texts.
  • Figure 2: Overview of the training framework. We use 11 years' worth of regular-season MLB data (3 billion tokens, 7 million pitch sequences) for continuous pretraining and game event prediction.
  • Figure 3: Left: prediction errors accumulate over longer sequences. Right: the increased uncertainty and strategic variability introduced by pitchers with more diverse repertoires.
  • Figure 4: The model exhibits a strong bias toward predicting the most frequent pitch types.
  • Figure 5: Four-seam fastballs and sliders dominate the error distribution, while rarer pitch types incur substantially fewer errors.
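
The diagnostics described in Figures 4 and 5 can be tabulated from paired true and predicted pitch labels. The sketch below is one hedged way to do so and is not the paper's evaluation code; all function names are hypothetical. It compares accuracy against an always-predict-the-most-frequent-pitch baseline, measures how predictions are distributed across pitch types, and counts errors by true pitch type. Raw error counts scale with how often a pitch is thrown, so frequent pitch types can dominate such a count even when their per-pitch error rate is low.

```python
from collections import Counter
from typing import Dict, List


def _check(y_true: List[str], y_pred: List[str]) -> None:
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of pitches whose predicted type matches the true type."""
    _check(y_true, y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true: List[str]) -> float:
    """Accuracy obtained by always predicting the most frequent true label."""
    if not y_true:
        raise ValueError("need at least one label")
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def prediction_share(y_pred: List[str]) -> Dict[str, float]:
    """Share of predictions assigned to each pitch type (bias check)."""
    if not y_pred:
        raise ValueError("need at least one prediction")
    counts = Counter(y_pred)
    return {label: n / len(y_pred) for label, n in counts.items()}


def errors_by_true_type(y_true: List[str], y_pred: List[str]) -> Counter:
    """Number of misclassified pitches, grouped by the true pitch type."""
    _check(y_true, y_pred)
    return Counter(t for t, p in zip(y_true, y_pred) if t != p)
```

Comparing `prediction_share` with the empirical label frequencies gives a quick check for the frequency bias noted in Figure 4, and dividing each error count by the number of pitches of that type yields per-type error rates.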