Distribution Learning with Valid Outputs Beyond the Worst-Case

Nick Rittler, Kamalika Chaudhuri

TL;DR

This work shows that when the data distribution lies in the model class and the log-loss is minimized, the number of samples required to ensure validity has a weak dependence on the validity requirement.

Abstract

Generative models at times produce "invalid" outputs, such as images with generation artifacts and unnatural sounds. Validity-constrained distribution learning attempts to address this problem by requiring that the learned distribution have a provably small fraction of its mass in invalid parts of space, something that standard loss minimization does not always ensure. To this end, a learner in this model can guide the learning via "validity queries", which allow it to ascertain the validity of individual examples. Prior work on this problem takes a worst-case stance, showing that proper learning requires an exponential number of validity queries, and demonstrating an improper algorithm which, while providing guarantees in a wide range of settings, makes an atypical polynomial number of validity queries. In this work, we take a first step towards characterizing regimes where guaranteeing validity is easier than in the worst case. We show that when the data distribution lies in the model class and the log-loss is minimized, the number of samples required to ensure validity has a weak dependence on the validity requirement. Additionally, we show that when the validity region belongs to a VC-class, a limited number of validity queries is often sufficient.
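
To make the notions of "invalid mass" and "validity query" concrete, the following is a minimal sketch, not the paper's algorithm. It estimates the fraction of a model's mass that falls in the invalid region by drawing samples from the model and issuing one validity query per sample, then adds a standard one-sided Hoeffding slack to obtain a high-probability upper bound. The names `estimate_invalid_mass`, `sample_from_model`, and `is_valid`, as well as the toy Gaussian model and the validity region $[-2, 2]$, are assumptions made purely for illustration.

```python
import math
import random


def estimate_invalid_mass(sample_from_model, is_valid, num_queries, delta=0.05):
    """Estimate the model's invalid mass from `num_queries` validity queries.

    Returns (estimate, upper), where `upper` bounds the true invalid mass
    with probability at least 1 - delta by Hoeffding's inequality.
    All names here are hypothetical, chosen for this illustration only.
    """
    # Each call to is_valid plays the role of one validity query.
    invalid = sum(1 for _ in range(num_queries) if not is_valid(sample_from_model()))
    estimate = invalid / num_queries
    # One-sided Hoeffding slack for the mean of n i.i.d. {0, 1} variables.
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * num_queries))
    return estimate, min(1.0, estimate + slack)


# Toy setting (an assumption): the model is a standard Gaussian and
# an output is valid exactly when it lies in [-2, 2].
rng = random.Random(0)
estimate, upper = estimate_invalid_mass(
    sample_from_model=lambda: rng.gauss(0.0, 1.0),
    is_valid=lambda x: abs(x) <= 2.0,
    num_queries=5000,
)
print(f"estimated invalid mass: {estimate:.4f}, upper bound: {upper:.4f}")
```

This check only audits a distribution that has already been learned. The validity-constrained setting described above is stronger: the learner must output a distribution whose invalid mass is provably small, and the number of samples and queries needed to do so is what the results in this paper quantify.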

Paper Structure

This paper contains 26 sections, 19 theorems, 67 equations, 3 algorithms.

Key Result

Lemma 4

Fix $0 < \epsilon, \delta < 1$ arbitrarily, and let $P, q \in \mathcal{P}$ be distributions with densities with respect to $\lambda$. Then if $d_{TV}(q, P) \geq \epsilon$, and $S \sim P^n$ for $n \geq \Omega(\log(1/\delta)/\epsilon^2)$, it holds with probability $\geq 1-\delta$ that

Theorems & Definitions (31)

  • Lemma 4
  • Lemma 5
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • Theorem 4
  • Lemma 1
  • proof
  • Lemma 2
  • ...and 21 more