Corruptions of Supervised Learning Problems: Typology and Mitigations

Laura Iacovissi, Nan Lu, Robert C. Williamson

TL;DR

This work develops a unified, information-theoretic framework for corruptions in supervised learning using Markov kernels to model changes to data distributions, loss, and model class. It introduces an exhaustive, kernel-based taxonomy of one-step and multi-step corruptions, and proves data-processing-equality results that relate Bayes risks between clean and corrupted problems. The authors extend classical loss-correction methods with a generalized corruption-corrected learning (gcl) framework based on Bayesian inverses of kernels, clarifying when traditional loss corrections suffice and when more sophisticated posterior-aware corrections are necessary. They also discuss limitations, such as assumptions about distributions and full access to corruption kernels, and outline future directions for non-Markovian and non-probabilistic corruptions. Overall, the paper provides a principled foundation for analyzing corruption in learning tasks and for designing mitigations that go beyond standard loss corrections.

Abstract

Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature predominantly focuses on specific settings and learning scenarios, lacking a unified view of corruption modelization and mitigation. In this work, we develop a general theory of corruption, which incorporates all modifications to a supervised learning problem, including changes in model class and loss. Focusing on changes to the underlying probability distributions via Markov kernels, our approach leads to three novel opportunities. First, it enables the construction of a novel, provably exhaustive corruption framework, distinguishing among different corruption types. This serves to unify existing models and establish a consistent nomenclature. Second, it facilitates a systematic analysis of corruption's consequences on learning tasks, by comparing Bayes risks in the clean and corrupted scenarios. Notably, while label corruptions affect only the loss function, attribute corruptions additionally influence the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the necessity to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements to encompass more corruption types. We provide such a paradigm as well as loss correction formulas in the attribute and joint corruption cases.
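
The classical loss correction for label corruption that the abstract refers to can be illustrated on a finite label space, where a Markov kernel on labels is a row-stochastic matrix. The following is a minimal sketch of the well-known backward correction, not the paper's generalized framework. The matrix `T`, the helper names, and the example prediction are all hypothetical, and the kernel is assumed to be known and invertible.

```python
import numpy as np

# Hypothetical 3-class label-noise kernel (an assumption for illustration):
# T[y, y_tilde] = P(noisy label = y_tilde | clean label = y). Rows sum to 1.
T = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])


def cross_entropy_per_label(probs):
    """Return a vector whose entry y is the loss incurred if y were the label."""
    return -np.log(probs)


def backward_corrected_loss(loss_vec, kernel):
    """Solve kernel @ corrected = loss_vec.

    Averaging `corrected` over the noisy label, given a clean label y,
    then recovers the clean loss at y.
    """
    return np.linalg.solve(kernel, loss_vec)


# Hypothetical model prediction for a single instance.
probs = np.array([0.6, 0.3, 0.1])
clean_loss = cross_entropy_per_label(probs)
corrected_loss = backward_corrected_loss(clean_loss, T)

# Expectation of the corrected loss over noisy labels, one entry per clean label.
expected_under_noise = T @ corrected_loss
```

In this sketch, `expected_under_noise` matches `clean_loss` up to floating-point error, which is the unbiasedness property that backward correction is designed to provide. It relies on full access to the corruption kernel, one of the limitations the TL;DR mentions.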

Paper Structure

This paper contains 71 sections, 25 theorems, 118 equations, 3 figures, 5 tables.

Key Result

Proposition 8

A supervised learning problem $\mathcal{L} = (\ell , \mathcal{H}, P = \pi_Y \times E)$ on the measurable space $(X \times Y, \mathcal{X} \times \mathcal{Y})$ can be equivalently expressed [...] We refer to an $\mathcal{L} = (\ell , \mathcal{H}, P = \pi_Y \times E)$ as generative, while an $\mathcal{L} = (\ell , \mathcal{H}, P = \pi_{X} \times F)$ as discrim...
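
The two factorizations named in Proposition 8 can be checked numerically on finite spaces. The sketch below assumes that $E$ is a kernel from $Y$ to $X$ and $F$ a kernel from $X$ to $Y$, which is the usual reading of $P = \pi_Y \times E$ and $P = \pi_X \times F$. The joint table `P` is hypothetical, and all variable names are illustrative.

```python
import numpy as np

# Hypothetical joint distribution on finite X (rows, 3 values) and Y (columns, 2 values).
P = np.array([
    [0.10, 0.20],
    [0.25, 0.05],
    [0.15, 0.25],
])

# Generative view: label prior pi_Y and kernel E from Y to X, with E[y, x] = P(x | y).
pi_Y = P.sum(axis=0)
E = (P / pi_Y).T

# Discriminative view: instance marginal pi_X and kernel F from X to Y, with F[x, y] = P(y | x).
pi_X = P.sum(axis=1)
F = P / pi_X[:, None]

# Both factorizations rebuild the same joint distribution.
P_from_generative = (pi_Y[:, None] * E).T
P_from_discriminative = pi_X[:, None] * F
```

Each row of `E` and `F` is a probability distribution, and both reconstructions agree with `P`, so in this finite setting the two views describe the same learning problem.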

Figures (3)

  • Figure 1: Hierarchy of partial corruption types. The partial corruption types are hierarchically organized based on their dependence on the instance $X$ and label $Y$ space, as depicted through a tree structure. At the root of the tree lies the most general form of corruption, where the domain and image spaces are the joint one $X \times Y$, i.e., $D(\kappa)=I(\kappa)=X \times Y$. The arrows signify that a child node has its domain or image constant w.r.t. exactly one of the variables in its parent. Therefore, the children nodes can be expressed as subcases of their parent, but the parents generally cannot be expressed by only one of their children. The partial corruption types that cannot be combined with others are shown in dotted boxes. Note that corner cases involving independence from all variables or identity kernels are excluded from this analysis.
  • Figure 2: Feasible combinations of partial corruptions. Joint corruptions, i.e. of type $\kappa \colon X \times Y \rightsquigarrow X \times Y$, are obtained by combining two compatible partial corruptions in Figure 1. The tree structure is induced by that of the partial corruption types. Notice that we can only combine a partial corruption with $I(\tau) = X$ with another such that $I(\lambda) = Y$, following \ref{prop:fact}. Therefore, the arrows signify that both $\tau$ and $\lambda$ in a child node inherit their domains from the parent node with either $\tau$ or $\lambda$ constant w.r.t. exactly one of their domain variables.
  • Figure 3: Possible non-degenerate relations among three probability spaces. Arrows represent a non-trivial Markov kernel $\kappa\colon X_i \times Y_i \rightsquigarrow X_j \times Y_j$.
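
The arrows in Figure 3 are Markov kernels between probability spaces. On finite spaces, chaining two such arrows amounts to multiplying row-stochastic matrices. The sketch below is an illustration under that assumption only: each state stands for a flattened pair $(x, y)$, and the kernels `k12`, `k23` and the distribution `p1` are hypothetical.

```python
import numpy as np


def push_forward(dist, kernel):
    """Apply a finite Markov kernel (row-stochastic matrix) to a distribution."""
    return dist @ kernel


def compose(first, second):
    """Chain two finite kernels: apply `first`, then `second`."""
    return first @ second


# Hypothetical distribution on space 1 and kernels 1 -> 2 and 2 -> 3.
p1 = np.array([0.5, 0.3, 0.2])
k12 = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
])
k23 = np.array([
    [0.7, 0.3, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
])

# Corrupting in two steps agrees with corrupting once through the composed kernel.
p3_stepwise = push_forward(push_forward(p1, k12), k23)
p3_direct = push_forward(p1, compose(k12, k23))
```

The composed matrix is again row-stochastic, so a multi-step corruption of this kind can be treated as a single kernel between the first and the last space.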

Theorems & Definitions (50)

  • Definition 1 [klenke2007probability]
  • Example 2
  • Remark 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Proposition 8
  • Definition 9
  • Definition 10
  • ...and 40 more