Distribution-free Deviation Bounds and The Role of Domain Knowledge in Learning via Model Selection with Cross-validation Risk Estimation

Diego Marcondes, Cláudia Peixoto

TL;DR

This work addresses the theoretical properties of learning via model selection with cross-validation risk estimation in a distribution-free VC framework. It introduces Learning Spaces to encode domain knowledge as lattice-structured families of candidate models and proves that the selected model $\hat{\mathcal{M}}$ converges to a target model $\mathcal{M}^{\star}$, with estimation errors decaying under both bounded and unbounded loss settings. The main contributions are distribution-free deviation bounds tied to VC dimension, the concept of maximum discrimination error $\epsilon^{\star}$, and practical guidance on designing Learning Spaces to boost generalization, supported by worst-case and linear-regression examples. The results offer a principled route to harness domain knowledge for improved generalization and provide algorithmic and computational insights for implementing model-selection-based learning in practice.

Abstract

Cross-validation techniques for risk estimation and model selection are widely used in statistics and machine learning. However, the theoretical properties of learning via model selection with cross-validation risk estimation remain poorly understood despite its widespread use. In this context, this paper presents learning via model selection with cross-validation risk estimation as a general systematic learning framework within classical statistical learning theory and establishes distribution-free deviation bounds in terms of VC dimension, giving detailed proofs of the results and considering both bounded and unbounded loss functions. In particular, we investigate how the generalization of learning via model selection may be increased by modeling the collection of candidate models. We define the Learning Spaces as a class of candidate models in which the partial order by inclusion reflects the models' complexities, and we formalize a way of defining them based on domain knowledge. We illustrate this modeling in a worst-case scenario of learning a classifier with finite domain and in a typical scenario of linear regression. Through theoretical insights and concrete examples, we aim to provide guidance on selecting the family of candidate models based on domain knowledge to increase generalization.

Paper Structure

This paper contains 41 sections, 30 theorems, 221 equations, 14 figures, 1 table.

Key Result

Proposition 3.2

Assume the loss function is bounded. Fix a hypotheses space $\mathcal{H}$ with $d_{VC}(\mathcal{H}) < \infty$. There exist sequences $\{B^{I}_{N,\epsilon}: N \geq 1\}$ and $\{B^{II}_{N,\epsilon}: N \geq 1\}$ of positive real-valued increasing functions with domain $\mathbb{Z}_{+}$ satisfying [...] for all $\epsilon > 0$ and $k \in \mathbb{Z}_{+}$ fixed, such that [...]. Furthermore, the following holds: [...]

Figures (14)

  • Figure 1: Types II, III, and IV estimation errors when learning on $\hat{\mathcal{M}}$, in which $\hat{h}_{\hat{\mathcal{M}}} \equiv \hat{h}_{\hat{\mathcal{M}}}^{\mathbb{A}}$. These errors are formally defined in Section \ref{SecErrors}.
  • Figure 2: Decomposition of $\mathcal{H}$ by a $\mathbb{C}(\mathcal{H})$. Some models are omitted for better visualization; the full $\mathbb{C}(\mathcal{H})$ should cover $\mathcal{H}$.
  • Figure 3: The systematic frameworks for learning hypotheses via model selection. (a) A sample of size $N+M$ is split into two parts: one of size $N$, used to estimate $\hat{\mathcal{M}}$ by minimizing $\hat{L}$ on $\mathbb{C}(\mathcal{H})$, and another of size $M$, used to learn a hypothesis on $\hat{\mathcal{M}}$ by minimizing the empirical risk. (b) The whole sample of size $N$ is used both to estimate $\hat{\mathcal{M}}$ by minimizing $\hat{L}$ on $\mathbb{C}(\mathcal{H})$ and to estimate hypotheses on $\hat{\mathcal{M}}$ via ERM.
  • Figure 4: The risks of the equivalence classes (cf. \ref{equiv_class}) of $\mathbb{C}(\mathcal{H})$ in ascending order. The MDE $\epsilon^{\star}$ is the difference between the risk of the target class $\mathcal{M}^{\star}$ and that of the second best, $\mathcal{M}_{2}$. The colored intervals represent a distance of $\epsilon^{\star}/2$ from the out-of-sample risk of each model, and the colored estimated risks $\hat{L}$ illustrate a case in which the estimated risk is within $\epsilon^{\star}/2$ of the out-of-sample risk for all models. The class $\mathcal{M}_{1}$ has the same risk as $\mathcal{M}^{\star}$, but has a smaller estimated risk and, by the definition of $\mathcal{M}^{\star}$, a greater VC dimension. Note from the representation that, if one can estimate $\hat{L}$ within a margin of error of $\epsilon^{\star}/2$, then $\hat{\mathcal{M}}$ will be a model with the same risk as $\mathcal{M}^{\star}$, in this case $\mathcal{M}_{1}$ (cf. Proposition \ref{proposition_principal}).
  • Figure 5: The set $\Pi$ of all partitions of $\mathcal{X} = \{1,2,3,4\}$. The tables present the hypotheses in selected models $\mathcal{M}_{\pi_{1}}, \mathcal{M}_{\pi_{2}}$.
  • ...and 9 more figures
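
The split-sample framework in Figure 3(a) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the candidate family is a chain of polynomial regression models indexed by degree, the loss is the squared loss, $\hat{L}$ is a single train/validation split estimate, and the helper names (`erm_poly`, `risk`, `select_model`) are hypothetical.

```python
import numpy as np

def erm_poly(x, y, degree):
    """ERM under squared loss on the model of polynomials of degree <= `degree`."""
    X = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def risk(coef, x, y):
    """Empirical squared-loss risk of the polynomial with coefficients `coef`."""
    X = np.vander(x, len(coef), increasing=True)
    return float(np.mean((X @ coef - y) ** 2))

def select_model(x, y, degrees, train_frac=0.5):
    """Pick the candidate model minimising a validation estimate of the risk.

    Assumption: a single train/validation split stands in for the
    cross-validation risk estimator L-hat.
    """
    n_train = int(train_frac * len(x))
    x_tr, y_tr = x[:n_train], y[:n_train]
    x_va, y_va = x[n_train:], y[n_train:]
    est = {d: risk(erm_poly(x_tr, y_tr, d), x_va, y_va) for d in degrees}
    # Ties in estimated risk are broken in favour of the smaller degree.
    best = min(degrees, key=lambda d: (est[d], d))
    return best, est

# Synthetic data (assumed): a quadratic signal with Gaussian noise.
rng = np.random.default_rng(0)
N, M = 200, 200
x = rng.uniform(-1.0, 1.0, N + M)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.1, N + M)

degrees = list(range(0, 6))                              # candidate models, nested by inclusion
d_hat, est_risks = select_model(x[:N], y[:N], degrees)   # first N points: model selection
h_hat = erm_poly(x[N:], y[N:], d_hat)                    # remaining M points: ERM on the selected model
```

Framework (b) would instead reuse the same $N$ points for both steps; in this sketch that amounts to calling `erm_poly` on `x[:N], y[:N]` after selection.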

Theorems & Definitions (64)

  • Remark 3.1
  • Proposition 3.2
  • Remark 3.3
  • Remark 3.4
  • Proposition 4.1
  • Remark 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Theorem 4.6
  • ...and 54 more