Corruptions of Supervised Learning Problems: Typology and Mitigations
Laura Iacovissi, Nan Lu, Robert C. Williamson
TL;DR
This work develops a unified, information-theoretic framework for corruption in supervised learning. Markov kernels model changes to the underlying data distributions, and the framework also accounts for the resulting changes to the loss and the model class. The paper introduces an exhaustive, kernel-based taxonomy of one-step and multi-step corruptions and proves data processing equality results that relate Bayes risks in the clean and corrupted problems. The authors extend classical loss-correction methods with a generalized corruption-corrected learning (GCL) framework based on Bayesian inverses of kernels, clarifying when traditional loss corrections suffice and when more sophisticated posterior-aware corrections are necessary. They also discuss limitations, such as assumptions about distributions and the need for full access to the corruption kernels, and outline future directions for non-Markovian and non-probabilistic corruptions. Overall, the paper provides a foundation for analyzing corruption in learning tasks and for designing principled mitigations beyond standard loss corrections.
Abstract
Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature predominantly focuses on specific settings and learning scenarios, lacking a unified view of corruption modeling and mitigation. In this work, we develop a general theory of corruption, which incorporates all modifications to a supervised learning problem, including changes in model class and loss. Focusing on changes to the underlying probability distributions via Markov kernels, our approach leads to three novel opportunities. First, it enables the construction of a novel, provably exhaustive corruption framework, distinguishing among different corruption types. This serves to unify existing models and establish a consistent nomenclature. Second, it facilitates a systematic analysis of the consequences of corruption for learning tasks, by comparing Bayes risks in the clean and corrupted scenarios. Notably, while label corruptions affect only the loss function, attribute corruptions additionally influence the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the need to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements, so that it encompasses more corruption types. We provide such a paradigm as well as loss correction formulas in the attribute and joint corruption cases.
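For readers unfamiliar with the classical loss correction that the abstract takes as its starting point, the following is a minimal sketch of the standard backward correction for class-conditional label noise. It is not the generalized framework proposed in the paper. The kernel `T`, the prediction `probs`, and both function names are hypothetical values chosen for illustration, and the sketch assumes the kernel is known and invertible.

```python
import numpy as np

# Hypothetical 3-class label-noise kernel (an assumption, not from the paper):
# T[i, j] = P(observed label = j | clean label = i). Each row sums to 1.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

def cross_entropy_per_label(probs):
    """Loss incurred for each possible label, given predicted class probabilities."""
    return -np.log(probs)

def backward_corrected_loss(probs, kernel):
    """Corrected loss vector T^{-1} l: entry j is charged when the observed label is j."""
    return np.linalg.solve(kernel, cross_entropy_per_label(probs))

# Hypothetical model prediction for a single input.
probs = np.array([0.6, 0.3, 0.1])

clean_loss = cross_entropy_per_label(probs)
corrected = backward_corrected_loss(probs, T)

# Averaging the corrected loss over the noisy label, for each clean label i,
# gives sum_j T[i, j] * corrected[j], which recovers the clean loss.
expected_under_noise = T @ corrected
```

Here `expected_under_noise` matches `clean_loss` up to floating-point error, which is the sense in which the corrected loss is unbiased under label noise. Individual entries of `corrected` may be negative, a known property of this kind of correction.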
