On Semi-supervised Estimation of Discrete Distributions under f-divergences
Hasan Sabri Melihcan Erol, Lizhong Zheng
TL;DR
The paper addresses semi-supervised estimation of the joint distribution $p_{XY}$ from $m$ labeled and $n$ unlabeled samples in the minimax framework. It shows that composing univariate minimax estimators attains the minimax risk with the optimal first-order constant under $l^p_p$ losses for $1 \le p \le 2$, and extends this result to a broad family of $f$-divergences, including KL, $\chi^2$, Squared Hellinger, and Le Cam. The authors derive explicit rates and constants, such as $R^p_{m,n} = |\mathcal X|^{1-p/2} C_p m^{-p/2}$ and $R^f_{m,n} = |\mathcal X| C_f / m$, and prove minimax optimality of the composition estimators in the semi-supervised setting. These results give rigorous guarantees for discrete pmf estimation when unlabeled data are abundant and labeling is costly, across multiple divergence criteria.
Abstract
We study the problem of estimating the joint probability mass function (pmf) over two random variables. In particular, the estimation is based on the observation of $m$ samples containing both variables and $n$ samples missing one fixed variable. We adopt the minimax framework with $l^p_p$ loss functions. Recent work established that combinations of univariate minimax estimators achieve the minimax risk with the optimal first-order constant for $p \ge 2$ in the regime $m = o(n)$, but questions remained for $p \le 2$ and for various $f$-divergences. In our study, we affirm that these composite estimators are indeed minimax optimal for $l^p_p$ loss functions in the range $1 \le p \le 2$, including the critical $l_1$ loss. Additionally, we establish their optimality for a suite of $f$-divergences, such as the KL, $\chi^2$, Squared Hellinger, and Le Cam divergences.
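To make the setup concrete, the following is a minimal sketch of a composition estimator. It makes several assumptions that the abstract does not fix: the missing variable is taken to be $Y$, so the $n$ incomplete samples contain only $X$; the marginal of $X$ is estimated from all $m + n$ samples and the conditional of $Y$ given $X$ from the $m$ complete samples; and a simple add-constant estimator with a hypothetical smoothing parameter `beta` stands in for the univariate minimax estimators, whose exact form is not given here. The function names are illustrative only.

```python
import numpy as np

def add_constant_pmf(counts, beta=0.5):
    """Add-constant estimate along the last axis (a stand-in for a
    univariate minimax estimator; beta is an assumed smoothing value)."""
    counts = np.asarray(counts, dtype=float)
    k = counts.shape[-1]
    total = counts.sum(axis=-1, keepdims=True)
    return (counts + beta) / (total + beta * k)

def composition_estimate(labeled, unlabeled_x, kx, ky, beta=0.5):
    """Estimate the joint pmf as p_X(x) * p_{Y|X}(y | x).

    labeled     : sequence of (x, y) pairs, the m complete samples
    unlabeled_x : sequence of x values, the n samples missing y (assumed)
    kx, ky      : alphabet sizes of X and Y
    """
    labeled = np.asarray(labeled, dtype=int).reshape(-1, 2)
    unlabeled_x = np.asarray(unlabeled_x, dtype=int)

    # Marginal of X from all m + n observations of X.
    all_x = np.concatenate([labeled[:, 0], unlabeled_x])
    p_x = add_constant_pmf(np.bincount(all_x, minlength=kx), beta)

    # Conditional of Y given X from the m complete samples only.
    joint_counts = np.zeros((kx, ky))
    np.add.at(joint_counts, (labeled[:, 0], labeled[:, 1]), 1)
    p_y_given_x = add_constant_pmf(joint_counts, beta)  # each row sums to 1

    return p_x[:, None] * p_y_given_x

def lp_p_loss(p, q, power=1.0):
    """l_p^p loss: sum over all cells of |p - q| ** power."""
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q)) ** power))

# Tiny example: m = 3 complete samples, n = 4 samples with only X.
estimate = composition_estimate(
    labeled=[(0, 0), (0, 1), (1, 1)], unlabeled_x=[0, 1, 1, 1], kx=2, ky=2
)
```

With `beta > 0` every cell of the estimate is strictly positive, which keeps divergences such as KL finite when the estimate appears in the denominator. That is one practical reason to smooth rather than use raw empirical frequencies in a sketch like this.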
