Compression, Generalization and Learning

Marco C. Campi, Simone Garatti

TL;DR

This work develops a distribution-free theory of compression schemes centered on the probability that a compressed representation changes when a new observation arrives. By introducing the cardinality $k=|\mathsf{c}(\boldsymbol{z}_1,...,\boldsymbol{z}_N)|$ and the random variable $\boldsymbol{\phi}_N$ (the probability of change), the authors prove a sharp upper bound $\mathbb{P}\{\boldsymbol{\phi}_N>\varepsilon_{\boldsymbol{k}}\} \le \delta$ under a preference property, with $\varepsilon_k$ defined as the unique solution to $\Psi_{k,\delta}(\alpha)=1$. With additional non-associativity and non-concentrated-mass assumptions, they obtain two-sided bounds $\underline{\varepsilon}_{\boldsymbol{k}} \le \boldsymbol{\phi}_N \le \overline{\varepsilon}_{\boldsymbol{k}}$ that hold with probability at least $1-\delta$; crucially, $|\mathsf{c}(\boldsymbol{z}_1,...,\boldsymbol{z}_N)|/N$ becomes a strongly consistent estimator of $\boldsymbol{\phi}_N$. The theory is then instantiated in learning contexts where a reconstruction function exists, showing that risk can be bounded directly from the augmented compression rather than via incremental risk estimates, yielding unprecedentedly sharp finite-sample bounds. Applications to SVM, SVR, GEM and related schemes demonstrate the practical utility, and the paper provides MATLAB code for computing the key bounds, situating the results within and extending the scenario-approach literature for data-driven decision making.

Abstract

A compression function is a map that slims down an observational set into a subset of reduced size, while preserving its informational content. In multiple applications, the event that one new observation makes the compressed set change is interpreted as meaning that this observation brings in extra information and, in learning theory, this corresponds to misclassification or misprediction. In this paper, we lay the foundations of a new theory that allows one to keep control of the probability of change of compression (which maps into the statistical "risk" in learning applications). Under suitable conditions, the cardinality of the compressed set is shown to be a consistent estimator of the probability of change of compression (without any upper limit on the size of the compressed set); moreover, unprecedentedly tight finite-sample bounds to evaluate the probability of change of compression are obtained under a generally applicable condition of preference. All results are usable in a fully agnostic setup, i.e., without requiring any a priori knowledge of the probability distribution of the observations. Not only do these results offer valid support for developing trust in observation-driven methodologies, they also play a fundamental role in learning techniques as a tool for hyper-parameter tuning.

Paper Structure

This paper contains 25 sections, 15 theorems, 132 equations, 7 figures.

Key Result

Lemma 3

A compression function $\mathsf{c}$ satisfies the preference property if and only if $\mathsf{c}(V) = \mathsf{c}(U)$ for all multisets $U,V$ such that $\mathsf{c}(U) \subseteq V \subseteq U$. $\star$
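The condition in Lemma 3 can be checked by brute force on small multisets. The sketch below is only an illustration under stated assumptions: the two toy compression functions (`compress_extremes`, which keeps the minimum and maximum of a multiset of reals, and `compress_median`, which keeps the lower median) and all helper names are hypothetical and are not taken from the paper. Passing the check on a few multisets $U$ does not prove the preference property, which quantifies over all $U, V$; failing it on one $U$ does disprove it.

```python
from collections import Counter
from itertools import combinations


def compress_extremes(points):
    """Toy compression (assumption): keep one copy of the min and of the max."""
    if not points:
        return []
    lo, hi = min(points), max(points)
    return [lo] if lo == hi else [lo, hi]


def compress_median(points):
    """Toy compression (assumption): keep the lower median only."""
    if not points:
        return []
    s = sorted(points)
    return [s[(len(s) - 1) // 2]]


def lemma3_condition_holds(compress, u):
    """Check c(V) == c(U) for every multiset V with c(U) <= V <= U (fixed U)."""
    cu = Counter(compress(u))
    for r in range(len(u) + 1):
        for idx in combinations(range(len(u)), r):
            v = [u[i] for i in idx]
            # Skip V that does not contain c(U) as a sub-multiset.
            if cu - Counter(v):
                continue
            if Counter(compress(v)) != cu:
                return False
    return True


def changes_compression(compress, u, z):
    """True if adding the new observation z changes the compressed multiset."""
    return Counter(compress(list(u) + [z])) != Counter(compress(u))


if __name__ == "__main__":
    print(lemma3_condition_holds(compress_extremes, [3, 1, 4, 1, 5]))
    print(lemma3_condition_holds(compress_median, [1, 2, 3]))
    print(changes_compression(compress_extremes, [1, 5], 0))
```

For `compress_median`, the multiset $U = \{1, 2, 3\}$ gives $\mathsf{c}(U) = \{2\}$, while $V = \{1, 2\}$ satisfies $\mathsf{c}(U) \subseteq V \subseteq U$ yet compresses to $\{1\}$, so the condition fails. The helper `changes_compression` mirrors the event whose probability the paper studies: a new observation that alters the compressed multiset.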

Figures (7)

  • Figure 1: (a) Function $\Psi_{k,\delta}(\alpha)$: it starts below $\delta$ when $\alpha \to 0$ and tends to $+\infty$ when $\alpha \to 1$; (b) Function $\tilde{\Psi}_{k,\delta}(\alpha)$: it tends to $+\infty$ as $\alpha \to 1$ or $\alpha \to -\infty$ and takes a value below $1$ at some point in $(-\infty, 1)$.
  • Figure 2: Curve $\varepsilon_k$ against the value of $k$ for $N = 2000$ and various values of $\delta$ ($10^{-3}$, $10^{-6}$, and $10^{-9}$). As established in Theorem \ref{th:compression_1}, for preferent compression functions this curve sets an upper bound (valid with confidence $1-\delta$) on the probability of change of compression as a function of the cardinality $k$ of the compressed multiset.
  • Figure 3: Region delimited by $\underline{\varepsilon}_k$ and $\overline{\varepsilon}_k$ for $N = 2000$ and various values of $\delta$ ($10^{-3}$, $10^{-6}$, and $10^{-9}$). Under the assumptions of Theorem \ref{th:compression_2}, this region contains with confidence $1-\delta$ the probability of change of compression as a function of the cardinality of the compressed multiset.
  • Figure 4: Graph of $\overline{\varepsilon}_k$, and $\underline{\varepsilon}_k$ as functions of $k$ for $\delta = 10^{-6}$ and $N= 2000$, $4000$, and $8000$.
  • Figure 5: Convex hull of points in $\mathbb{R}^{3}$.
  • ...and 2 more figures

Theorems & Definitions (22)

  • Definition 1: probability of change of compression
  • Lemma 3
  • Theorem 4
  • Theorem 7
  • Proposition 8
  • Theorem 10
  • Definition 11: reconstruction function
  • Lemma 13
  • Remark 14
  • Lemma 15
  • ...and 12 more