Efficient Optimal PAC Learning
Mikael Møller Høgsgaard
TL;DR
This work analyzes the computational cost of optimal PAC learners in the realizable setting for hypothesis classes of finite VC-dimension $d$. It introduces an Efficient Optimal PAC Learner that combines a randomized AdaBoost-based subsampling scheme with ERM as a subroutine. The learner achieves the optimal generalization bound $\mathcal{L}_{\mathcal{D}_c}(\hat{A})=O\big((d+\ln(1/\delta))/m\big)$ while attaining near-linear training time and logarithmic inference cost in $m$. Relative to prior optimal learners, which rely on deterministic subsampling or bagging, the proposed method reduces inference complexity and gives a more refined cost structure through AdaBoostSample-based voting over a carefully structured subsampling matrix $\mathcal{S}$. The analysis uses uniform convergence and margin-based arguments to guarantee PAC optimality in the distribution-free setting, offering a scalable route to practical PAC learning when ERM costs dominate.
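For reference, the guarantee above can be written in the standard PAC form. This is a sketch under stated assumptions: the sample symbol $\mathbf{S}$, the explicit quantifier, and the comparison with plain ERM are standard notation and background, not taken from the summary itself.

$$
\Pr_{\mathbf{S}\sim\mathcal{D}_c^{m}}\left[\mathcal{L}_{\mathcal{D}_c}\big(\hat{A}(\mathbf{S})\big)\le C\cdot\frac{d+\ln(1/\delta)}{m}\right]\ge 1-\delta
\qquad\text{for a universal constant } C>0 .
$$

By contrast, the classic realizable-case bound for an arbitrary empirical risk minimizer is $O\big((d\ln(m/d)+\ln(1/\delta))/m\big)$, so "optimal" here refers to removing the $\ln(m/d)$ factor.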
Abstract
Recent advances in the binary classification setting by Hanneke [2016b] and Larsen [2023] have resulted in optimal PAC learners. These learners leverage, respectively, a clever deterministic subsampling scheme and the classic heuristic of bagging [Breiman, 1996]. Both optimal PAC learners use the natural algorithm of empirical risk minimization as a subroutine. Consequently, their computational cost is tied to that of the empirical risk minimizer. In this work, we seek to provide an alternative perspective on the computational cost imposed by this link. To this end, we show the existence of an optimal PAC learner that offers a different trade-off in terms of the computational cost induced by the empirical risk minimizer.
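To make the link between a voting learner and its ERM subroutine concrete, the following is a minimal sketch of bagging with an ERM subroutine, in the spirit of the bagging-based learner mentioned above. It is not the algorithm of this paper and does not implement the AdaBoost-based subsampling. The hypothesis class (one-dimensional thresholds, VC-dimension 1), the function names `erm_threshold`, `bagged_voters`, and `predict`, and the choice of 11 bags are all illustrative assumptions.

```python
import random


def erm_threshold(sample):
    """ERM for thresholds h_t(x) = 1 if x >= t else 0, realizable case.

    Returning the smallest positively labeled point gives zero training
    error whenever the labels come from some threshold.
    """
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")


def bagged_voters(sample, num_bags, rng):
    """Call ERM once per bootstrap resample; return one threshold per voter."""
    m = len(sample)
    return [
        erm_threshold([rng.choice(sample) for _ in range(m)])
        for _ in range(num_bags)
    ]


def predict(thresholds, x):
    """Majority vote: inference evaluates every voter once."""
    votes = sum(1 for t in thresholds if x >= t)
    return 1 if 2 * votes > len(thresholds) else 0


# Hypothetical data: uniform points labeled by the threshold 0.5.
rng = random.Random(0)
true_t = 0.5
sample = [(x, int(x >= true_t)) for x in (rng.random() for _ in range(200))]
voters = bagged_voters(sample, num_bags=11, rng=rng)
```

In this sketch, training costs one ERM call per bag and inference costs one hypothesis evaluation per voter, which is the kind of dependence on the ERM subroutine that the abstract refers to.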
