Limitations of Using Identical Distributions for Training and Testing When Learning Boolean Functions
Jordi Pérez-Guijarro
TL;DR
The paper investigates whether training on the same distribution as the test distribution is always optimal when the learner can be optimally adapted to the training distribution, focusing on learning Boolean functions. It builds a PAC-like framework with efficient evaluation, introducing sufficiently informative training distributions and the complexity classes HeurBPP/samp(D) and HeurP/poly. The main result shows that, assuming one-way functions exist, there are concept classes for which training on the test distribution is not optimal, even with optimal learner adaptation, and this result extends to efficiently samplable distributions. However, for regular concept classes under the uniform distribution, these counterexamples disappear, recovering the conventional view that matching the training distribution to the test distribution is optimal in that setting.
Abstract
When the distributions of the training and test data do not coincide, the problem of understanding generalization becomes considerably more complex, prompting a variety of questions. Prior work has shown that, for some fixed learning methods, there are scenarios where training on a distribution different from the test distribution improves generalization. However, these results do not account for the possibility of choosing, for each training distribution, the optimal learning algorithm, leaving open whether the observed benefits stem from the mismatch itself or from suboptimality of the learner. In this work, we address this question in full generality: we study whether it is always optimal for the training distribution to be identical to the test distribution when the learner is allowed to be optimally adapted to the training distribution. Surprisingly, assuming the existence of one-way functions, we find that the answer is no: matching distributions is not always optimal. Nonetheless, we also show that when certain regularities are imposed on the target functions, the standard conclusion is recovered in the case of the uniform distribution.
