A short note on learning discrete distributions

Clément L. Canonne

TL;DR

The note analyzes the sample complexity of learning a discrete distribution over a known domain of size $k$ under several distance measures. It gives concise proofs based on the empirical distribution and on concentration arguments, showing that learning in total variation and Hellinger distances requires $n=\Theta\big(\frac{k+\log(1/\delta)}{\varepsilon^2}\big)$ samples, while KL divergence admits the optimal $n=\Theta\big(\frac{k+\log(1/\delta)}{\varepsilon}\big)$ with the empirical estimator, and Kolmogorov, $\ell_{\infty}$, and $\ell_2$ distances admit $n=\Theta\big(\frac{\log(1/\delta)}{\varepsilon^2}\big)$, independent of $k$. The proofs rely on standard concentration inequalities (McDiarmid, Chernoff, DKW) and on recent concentration bounds for the KL divergence of the empirical distribution. Overall, the note establishes these folklore sample-complexity bounds with simple, self-contained proofs and highlights how the optimal rate depends on the chosen distance measure.
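To make the objects in this summary concrete, here is a minimal Python sketch of the empirical estimator and of the distances mentioned above. It is an illustration, not code from the note: the function names are hypothetical, the domain is assumed to be $\{0,\dots,k-1\}$, and the Hellinger distance is assumed to be normalized as $\frac{1}{\sqrt{2}}\lVert\sqrt{p}-\sqrt{q}\rVert_2$.

```python
from collections import Counter
from itertools import accumulate
from math import sqrt


def empirical_distribution(samples, k):
    """Empirical frequencies of samples drawn from {0, ..., k-1} (assumed domain)."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[i] / n for i in range(k)]


def tv_distance(p, q):
    """Total variation distance: half the l1 distance between the two vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance, assuming the 1/sqrt(2) normalization."""
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))


def l2_distance(p, q):
    """Euclidean distance between the probability vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def linf_distance(p, q):
    """Largest pointwise gap between the probability vectors."""
    return max(abs(a - b) for a, b in zip(p, q))


def kolmogorov_distance(p, q):
    """Largest gap between the two cumulative distribution functions."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Usage: four samples over a domain of size 3, compared with the uniform distribution.
p_hat = empirical_distribution([0, 0, 1, 2], k=3)
uniform = [1 / 3] * 3
tv_error = tv_distance(p_hat, uniform)
```

In this usage example the learner's output is `p_hat`, and `tv_error` is the quantity the bounds above ask to be at most $\varepsilon$ with probability at least $1-\delta$.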

Abstract

The goal of this short note is to provide simple proofs for the "folklore facts" on the sample complexity of learning a discrete probability distribution over a known domain of size $k$ to accuracy $\varepsilon$ in various distances, with error probability $\delta$.

Paper Structure

This paper contains 6 sections, 8 theorems, 13 equations.

Key Result

Theorem 1

$\Phi(\operatorname{d}_{\rm TV},k,\varepsilon,\delta) = \Theta\left( \frac{k+\log(1/\delta)}{\varepsilon^2} \right)$.
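As a rough illustration of how the rate in Theorem 1 scales, the sketch below evaluates $c\,(k+\log(1/\delta))/\varepsilon^2$ and measures the total variation error of the empirical estimator on simulated samples. The constant `c` is a hypothetical placeholder (the $\Theta(\cdot)$ statement does not specify one), and the function names are illustrative rather than taken from the note.

```python
import math
import random


def sample_size_tv(k, eps, delta, c=1.0):
    """Evaluate c * (k + log(1/delta)) / eps^2; c is an assumed, unspecified constant."""
    return math.ceil(c * (k + math.log(1 / delta)) / eps ** 2)


def empirical_tv_error(p, n, rng):
    """Draw n i.i.d. samples from p and return the TV error of the empirical estimator."""
    k = len(p)
    draws = rng.choices(range(k), weights=p, k=n)
    counts = [0] * k
    for x in draws:
        counts[x] += 1
    return 0.5 * sum(abs(cnt / n - pi) for cnt, pi in zip(counts, p))


# Usage: domain size 10, accuracy 0.1, failure probability 0.05.
n = sample_size_tv(k=10, eps=0.1, delta=0.05)
err = empirical_tv_error([0.1] * 10, n, random.Random(0))
```

Halving `eps` multiplies the returned sample size by about four, while `delta` enters only through the additive `log(1/delta)` term.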

Theorems & Definitions (15)

  • Theorem 1
  • proof : First proof
  • proof : Second proof -- the "fun" one
  • Theorem 2
  • Proposition 1: Easy bound
  • proof
  • Proposition 2: More involved bound
  • proof
  • proof : Proof of the theorem on learning in Hellinger distance
  • Theorem 4: Agrawal:19
  • ...and 5 more