Universal Batch Learning Under The Misspecification Setting

Shlomi Vituri; Meir Feder

TL;DR

This work tackles universal batch learning under misspecification with log-loss, where data are generated from a distribution set $\Phi$ larger than the hypothesis class $\Theta$. It derives a closed-form min-max regret $R^*_N(\Theta,\Phi) = \max_{\pi(\phi)} \left[ I(Y_N;\Phi|Y^{N-1}) - E_{\pi} \left\{ D_{c,N}(P_\phi\|\Theta) \right\} \right]$ and introduces a mixture prior $\pi(\phi)$ that induces a capacity-like universal predictor $Q_{\pi}$. The authors develop an Arimoto-Blahut extension to numerically evaluate the regret and demonstrate the theory on Bernoulli/multinomial settings, including bounds and extensions to combined batch-online and supervised batch learning. A key finding is that the regret is governed by the richness of the hypothesis set $\Theta$ rather than the full generating set $\Phi$, and that the mixture concentrates mass near $\Theta$, providing a principled approach for robust universal learning under misspecification. The framework lays groundwork for practical computation of capacity-like priors and worst-case regrets in agnostic data-generation scenarios, with potential impact on robust universal predictors.

Abstract

In this paper we consider the problem of universal batch learning in a misspecification setting with log-loss. In this setting the hypothesis class is a set of models $\Theta$. However, the data are generated by an unknown distribution that may not belong to this set but comes from a larger set of models $\Phi \supset \Theta$. Given a training sample, a universal learner is asked to predict a probability distribution for the next outcome, and a log-loss is incurred. The performance of the universal learner is measured by the regret relative to the best hypothesis matching the data, chosen from $\Theta$. Using the minimax theorem and information-theoretic tools, we derive the optimal universal learner, a mixture over the set of data-generating distributions, and obtain a closed-form expression for the min-max regret. We show that this regret can be viewed as a constrained version of the conditional capacity between the data and the set of its generating distributions. We present tight bounds for this min-max regret, implying that the complexity of the problem is dominated by the richness of the hypothesis class $\Theta$ and not by the set of data-generating distributions $\Phi$. We develop an extension of the Arimoto-Blahut algorithm for numerical evaluation of the regret and its capacity-achieving prior distribution. We demonstrate our results for the case where the observations come from a $K$-parameter multinomial distribution while the hypothesis class $\Theta$ is only a subset of this family of distributions.
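For orientation, the following is a minimal sketch of the classical Arimoto-Blahut iteration for the capacity of a discrete channel. It is not the extension developed in the paper, which additionally handles the misspecification term. The function name `blahut_arimoto` and the row-stochastic matrix `W` (rows indexed by candidate generating distributions, columns by outcomes) are assumptions made for illustration.

```python
import math

def blahut_arimoto(W, max_iters=500, tol=1e-9):
    """Classical Arimoto-Blahut iteration (illustrative, not the paper's extension).

    W[i][j] is the probability of outcome j under input (model) i.
    Returns (capacity estimate in nats, capacity-achieving prior estimate).
    """
    m, n = len(W), len(W[0])
    p = [1.0 / m] * m  # start from the uniform prior
    lower = 0.0
    for _ in range(max_iters):
        # Output distribution induced by the current prior.
        q = [sum(p[i] * W[i][j] for i in range(m)) for j in range(n)]
        # d[i] = KL divergence between row i and the induced output distribution.
        d = [
            sum(W[i][j] * math.log(W[i][j] / q[j]) for j in range(n) if W[i][j] > 0)
            for i in range(m)
        ]
        lower = sum(p[i] * d[i] for i in range(m))  # mutual information at p
        upper = max(d)                              # upper bound on the capacity
        if upper - lower < tol:
            break
        # Multiplicative update: reweight each input by exp(d[i]) and renormalize.
        z = sum(p[i] * math.exp(d[i]) for i in range(m))
        p = [p[i] * math.exp(d[i]) / z for i in range(m)]
    return lower, p

# Usage: a binary symmetric channel plus one redundant, uninformative input.
capacity, prior = blahut_arimoto([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(capacity, prior)
```

In this usage example the third input carries no information, so the iteration is expected to move its prior mass toward zero while the capacity estimate approaches that of the binary symmetric channel.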

Paper Structure (19 sections, 8 theorems, 73 equations, 3 figures, 1 table, 2 algorithms)

Introduction
Problem Statement
Learning under the misspecification setting
Basic definitions
Misspecified Batch Learning Analysis
Min-Max Regret Derivation
Bounding the Min-Max Regret
Arimoto Blahut Algorithm Extension
Algorithm Development
Numerical Results
Stochastic vs. Misspecified Settings Example
Add-$\beta$ Factor Analysis
Misspecified Universal Batch Learning Extensions
Combined Batch and Online Learning
Min-Max Regret Derivation
...and 4 more sections

Key Result

Theorem 1

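The min-max regret of the problem defined in Section II-A is given by:

$$R^*_N(\Theta,\Phi) = \max_{\pi(\phi)} \left[ I(Y_N;\Phi|Y^{N-1}) - E_{\pi} \left\{ D_{c,N}(P_\phi\|\Theta) \right\} \right]$$

and the universal distribution for a given $\pi(\phi)$ is given by the conditional mixture:

$$Q_{\pi}(y_N|y^{N-1}) = \frac{\int_{\Phi} \pi(\phi) P_\phi(y^N)\, d\phi}{\int_{\Phi} \pi(\phi) P_\phi(y^{N-1})\, d\phi}$$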
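To make the quantities in this result concrete, here is a minimal numerical sketch for an i.i.d. Bernoulli source. It rests on assumptions not stated on this page: $\Phi$ is a finite grid of Bernoulli parameters, $\Theta$ is the interval $[lo, hi]$, and for i.i.d. models the projection term $D_{c,N}(P_\phi\|\Theta)$ reduces to $\min_{\theta \in \Theta} D(P_\phi\|P_\theta)$. All function names are hypothetical. For a fixed prior, the prior-weighted regret of the mixture predictor equals the bracketed objective and is a lower bound on the min-max regret, while the worst-case regret over the grid is an upper bound.

```python
import math

def kl_bern(p, q):
    """D(Bern(p) || Bern(q)) in nats, using the convention 0 * log 0 = 0."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total

def mixture_predictor(prior, grid, n):
    """Q_pi(y_N = 1 | k ones among the n training samples), for k = 0..n."""
    q = []
    for k in range(n + 1):
        like = [w * phi**k * (1.0 - phi) ** (n - k) for w, phi in zip(prior, grid)]
        q.append(sum(l * phi for l, phi in zip(like, grid)) / sum(like))
    return q

def regret(q, phi, n, lo, hi):
    """Expected log-loss regret of predictor q when the source is Bern(phi)."""
    excess = sum(
        math.comb(n, k) * phi**k * (1.0 - phi) ** (n - k) * kl_bern(phi, q[k])
        for k in range(n + 1)
    )
    # Assumed i.i.d. projection: the closest theta in [lo, hi] is phi clipped.
    projection = kl_bern(phi, min(max(phi, lo), hi))
    return excess - projection

def regret_bounds(prior, grid, n, lo, hi):
    """Return (prior-weighted regret, worst-case regret over the grid)."""
    q = mixture_predictor(prior, grid, n)
    r = [regret(q, phi, n, lo, hi) for phi in grid]
    return sum(w * x for w, x in zip(prior, r)), max(r)

# Usage: uniform prior on an interior grid, N - 1 = 9 training samples.
grid = [(i + 0.5) / 20 for i in range(20)]
prior = [1.0 / len(grid)] * len(grid)
lower, upper = regret_bounds(prior, grid, n=9, lo=0.4, hi=0.6)
print(lower, upper)
```

A prior for which the two returned values nearly coincide is close to the maximizing prior on this grid, which is the property a numerical search over $\pi(\phi)$ would aim for.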

Figures (3)

Figure 1: Theorem 2
Figure 2: Stochastic vs. Misspecified Capacity Achieving Prior distribution for settings: (a), (b) and (c), where $N=10^3$.
Figure 3: Stochastic vs. Misspecified $\beta$ bias factor for settings: (a), (b) and (c), where $N=10^2$.

Theorems & Definitions (15)

Theorem 1
proof
Lemma 1
proof
Theorem 2
proof
Theorem 3
proof
Theorem 4
Theorem 5
...and 5 more
