Batch Universal Prediction

Marco Bondaschi; Michael Gastpar

TL;DR

This work introduces batch regret, a batch-wise analogue of the regret used in universal prediction, as a tool for evaluating LLMs: training data consists of $n$ independent batches of length $\ell$, and the predictor is scored on a fresh batch of length $\ell$. For binary memoryless sources, the paper derives precise asymptotic bounds on the batch regret of an add-$\beta$ predictor with $\beta\in[\tfrac{1}{2},1]$, with the interior regime scaling as $\tfrac{1}{2}\log\left(1+\tfrac{1}{n}\right)$ and the endpoint cases scaling as $\beta\log\left(1+\tfrac{1}{n}\right)$. For first-order binary Markov sources, the batch regret separates into an initial-distribution component and a transition component, each decaying like $\Theta(1/n)$, with explicit upper and lower bounds; the predictor uses the first coordinates (and potentially all coordinates) of the batches to estimate the initial state and the transitions. The results give concrete benchmarks for evaluating batch-trained models from a universal prediction perspective and clarify the asymptotic trade-offs among predictor designs.
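As a rough illustration of the memoryless setting described above, the sketch below fits an add-$\beta$ estimate from $n$ training batches and scores a fresh batch by its log-loss. It is only one plausible reading of the setup: the function names, the pooling of counts across batches, and the fixed (non-sequential) prediction on the test batch are assumptions, not the paper's stated construction.

```python
import math

def add_beta_estimate(batches, beta=0.5):
    """Add-beta estimate of P(symbol = 1) from counts pooled over all batches.

    Assumption: training symbols are pooled across batches, which is natural
    for a memoryless source but is not taken from the paper.
    """
    ones = sum(sum(batch) for batch in batches)
    total = sum(len(batch) for batch in batches)
    # Binary alphabet: beta pseudo-counts are added to each of the two symbols.
    return (ones + beta) / (total + 2 * beta)

def log_loss_bits(test_batch, p_one):
    """Log-loss (in bits) of a fixed Bernoulli(p_one) predictor on test_batch."""
    return -sum(math.log2(p_one if x == 1 else 1.0 - p_one) for x in test_batch)

# Hypothetical toy data: n = 3 training batches of length 4.
training_batches = [[0, 1, 1, 0], [1, 1, 0, 1], [0, 0, 1, 1]]
fresh_batch = [1, 0, 1, 1]

p_hat = add_beta_estimate(training_batches, beta=0.5)
loss = log_loss_bits(fresh_batch, p_hat)
print(f"estimated P(1) = {p_hat:.4f}, log-loss on fresh batch = {loss:.4f} bits")
```

Comparing this loss with the loss of the true source on the same fresh batch, averaged over sources and data, is the kind of quantity a regret analysis would bound; the exact definition of batch regret is given in the paper itself.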

Abstract

Large language models (LLMs) have recently gained much popularity due to their surprising ability to generate human-like English sentences. LLMs are essentially predictors, estimating the probability of a sequence of words given the past. Therefore, it is natural to evaluate their performance from a universal prediction perspective. To do so fairly, we introduce the notion of batch regret as a modification of the classical average regret, and we study its asymptotic value for add-constant predictors in the case of memoryless sources and first-order Markov sources.
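For the first-order Markov case, an add-constant predictor can be sketched as follows: the initial distribution is estimated from the first symbol of each training batch, and the transition probabilities from consecutive symbol pairs within the batches. This is a minimal sketch under assumptions. The names, the use of the same $\beta$ for both estimates, and the use of all within-batch transitions are illustrative choices and may differ from the predictor analyzed in the paper.

```python
from itertools import product

def fit_markov_add_beta(batches, beta=0.5):
    """Add-beta estimates for a binary first-order Markov source.

    Returns (p_init_one, p_one_given), where p_one_given[s] estimates
    P(next = 1 | current = s). Assumes every batch has length >= 1.
    """
    # Initial distribution: only the first coordinate of each batch is used.
    first_ones = sum(batch[0] for batch in batches)
    p_init_one = (first_ones + beta) / (len(batches) + 2 * beta)

    # Transition counts: counts[s][t] = number of s -> t transitions observed.
    counts = [[0, 0], [0, 0]]
    for batch in batches:
        for prev, cur in zip(batch, batch[1:]):
            counts[prev][cur] += 1
    p_one_given = [
        (counts[s][1] + beta) / (counts[s][0] + counts[s][1] + 2 * beta)
        for s in (0, 1)
    ]
    return p_init_one, p_one_given

def batch_probability(batch, p_init_one, p_one_given):
    """Probability the fitted Markov predictor assigns to a whole batch."""
    prob = p_init_one if batch[0] == 1 else 1.0 - p_init_one
    for prev, cur in zip(batch, batch[1:]):
        p1 = p_one_given[prev]
        prob *= p1 if cur == 1 else 1.0 - p1
    return prob

# Hypothetical toy data: n = 3 training batches of length 4.
training_batches = [[0, 1, 1, 0], [1, 1, 0, 1], [0, 0, 1, 1]]
p_init_one, p_one_given = fit_markov_add_beta(training_batches, beta=0.5)

# Sanity check: the assigned probabilities over all length-3 batches sum to 1.
total_mass = sum(
    batch_probability(list(b), p_init_one, p_one_given)
    for b in product((0, 1), repeat=3)
)
print(p_init_one, p_one_given, total_mass)
```

Splitting the estimate into an initial-state part and a transition part mirrors the decomposition of the batch regret into initial-distribution and transition components described in the summary.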

Paper Structure (7 sections, 5 theorems, 61 equations)

Introduction
Related work
Overview
Memoryless sources
First-order Markov sources
Analysis of terms (A) and (B) for the proof of Theorem 1
Proof of Theorem 2

Key Result

Theorem 1

Let $\delta < \frac{1}{2}$ and $\Theta = [\delta, 1-\delta]$. Let be the class of distributions under consideration. Then,

Theorems & Definitions (6)

Definition 1
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
