Revisiting Classifier Two-Sample Tests

David Lopez-Paz, Maxime Oquab

TL;DR

The paper revisits two-sample testing by training binary classifiers to distinguish samples from two distributions. It formalizes Classifier Two-Sample Tests (C2ST), derives their null and alternative distributions, analyzes their testing power, and highlights their interpretability. In its experiments, C2STs outperform several state-of-the-art tests in multidimensional settings, and prove useful both for evaluating generative models (GANs) and for causal discovery, where conditional GANs (CGANs) are used to infer the causal direction. The work thus provides a practical, interpretable, and scalable framework for comparing distributions, with applications in model evaluation and causal inference.

Abstract

The goal of two-sample tests is to assess whether two samples, $S_P \sim P^n$ and $S_Q \sim Q^m$, are drawn from the same distribution. Perhaps intriguingly, one relatively unexplored method to build two-sample tests is the use of binary classifiers. In particular, construct a dataset by pairing the $n$ examples in $S_P$ with a positive label, and by pairing the $m$ examples in $S_Q$ with a negative label. If the null hypothesis "$P = Q$" is true, then the classification accuracy of a binary classifier on a held-out subset of this dataset should remain near chance-level. As we will show, such Classifier Two-Sample Tests (C2ST) learn a suitable representation of the data on the fly, return test statistics in interpretable units, have a simple null distribution, and their predictive uncertainty allows one to interpret where $P$ and $Q$ differ. The goal of this paper is to establish the properties, performance, and uses of C2ST. First, we analyze their main theoretical properties. Second, we compare their performance against a variety of state-of-the-art alternatives. Third, we propose their use to evaluate the sample quality of generative models with intractable likelihoods, such as Generative Adversarial Networks (GANs). Fourth, we showcase the novel application of GANs together with C2ST for causal discovery.
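
The procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes equal sample sizes ($n = m$), so that chance-level accuracy is $1/2$, and it uses the normal approximation $\mathcal{N}(1/2, 1/(4 n_\text{te}))$ for the held-out accuracy under the null hypothesis. The k-nearest-neighbour classifier, the half/half split, and the names `knn_predict` and `c2st` are choices made for this example only.

```python
import math
import random
from statistics import NormalDist


def knn_predict(train, point, k):
    """Majority vote among the k training examples closest to `point`."""
    nearest = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def c2st(sample_p, sample_q, seed=0):
    """Return (held-out accuracy, approximate p-value) for H0: P = Q.

    Assumes len(sample_p) == len(sample_q), so chance accuracy is 1/2.
    """
    rng = random.Random(seed)
    # Pair S_P with label 1 and S_Q with label 0, then shuffle.
    data = [(x, 1) for x in sample_p] + [(x, 0) for x in sample_q]
    rng.shuffle(data)
    # Illustrative choice: half for training, half held out for testing.
    half = len(data) // 2
    train, test = data[:half], data[half:]
    # Illustrative choice of k: roughly sqrt(n_train), forced odd to avoid ties.
    k = max(1, int(math.sqrt(len(train))))
    if k % 2 == 0:
        k += 1
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    accuracy = correct / len(test)
    # Under H0 the accuracy is approximately N(1/2, 1 / (4 * n_te)).
    z = (accuracy - 0.5) * 2.0 * math.sqrt(len(test))
    p_value = 1.0 - NormalDist().cdf(z)
    return accuracy, p_value


# Synthetic usage: two samples from the same Gaussian, and one shifted sample.
_rng = random.Random(1)
same_a = [(_rng.gauss(0, 1), _rng.gauss(0, 1)) for _ in range(200)]
same_b = [(_rng.gauss(0, 1), _rng.gauss(0, 1)) for _ in range(200)]
shifted = [(_rng.gauss(2, 1), _rng.gauss(2, 1)) for _ in range(200)]

acc_null, p_null = c2st(same_a, same_b)   # P = Q: accuracy should sit near 0.5
acc_alt, p_alt = c2st(same_a, shifted)    # P != Q: accuracy well above 0.5
```

The returned accuracy is the test statistic "in interpretable units" mentioned above: it reads directly as the fraction of held-out points whose source distribution the classifier identified correctly.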

Paper Structure

This paper contains 20 sections, 1 theorem, 9 equations, 2 figures, 6 tables.

Key Result

Theorem 1

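Given the conditions described in the previous paragraph, the approximate power of the C2ST statistic (the classification accuracy on the held-out set) is $\Phi\left( \frac{\epsilon\sqrt{n_\text{te}}-\Phi^{-1}(1-\alpha)/2}{\sqrt{\frac{1}{4}-\epsilon^2}}\right)$, where $\Phi$ is the standard normal CDF, $\alpha$ is the significance level, and $n_\text{te}$ is the number of held-out test examples.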
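
As a worked illustration, the power expression in Theorem 1 can be evaluated numerically. The sketch below assumes that $\epsilon \in (0, 1/2)$ denotes how far the classifier's expected held-out accuracy lies above chance (the excerpt does not restate the theorem's conditions, so this reading is an assumption). The function names `c2st_power` and `required_test_size` are hypothetical helpers for this example; the second one inverts the formula algebraically to find the smallest $n_\text{te}$ that reaches a target power.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def c2st_power(epsilon, n_te, alpha=0.05):
    """Approximate power from Theorem 1 for 0 <= epsilon < 1/2."""
    if not 0.0 <= epsilon < 0.5:
        raise ValueError("epsilon must lie in [0, 0.5)")
    numerator = epsilon * math.sqrt(n_te) - _STD_NORMAL.inv_cdf(1 - alpha) / 2
    denominator = math.sqrt(0.25 - epsilon ** 2)
    return _STD_NORMAL.cdf(numerator / denominator)


def required_test_size(epsilon, power=0.8, alpha=0.05):
    """Smallest n_te whose approximate power reaches `power`.

    Solves epsilon * sqrt(n_te) >= inv_cdf(power) * sqrt(1/4 - epsilon^2)
                                   + inv_cdf(1 - alpha) / 2
    which is valid when the right-hand side is positive.
    """
    if not 0.0 < epsilon < 0.5:
        raise ValueError("epsilon must lie in (0, 0.5)")
    rhs = (
        _STD_NORMAL.inv_cdf(power) * math.sqrt(0.25 - epsilon ** 2)
        + _STD_NORMAL.inv_cdf(1 - alpha) / 2
    )
    if rhs <= 0:
        raise ValueError("target power too low for this closed-form inversion")
    return math.ceil((rhs / epsilon) ** 2)


# Example: a classifier expected to reach 60% accuracy (epsilon = 0.1).
power_small = c2st_power(0.1, n_te=100)
power_large = c2st_power(0.1, n_te=400)
n_needed = required_test_size(0.1, power=0.8)
```

One sanity check on the formula: at $\epsilon = 0$ the expression reduces to $\Phi(-\Phi^{-1}(1-\alpha)) = \alpha$, so under the null the rejection rate equals the significance level, as expected.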

Figures (2)

  • Figure 1: Results (type-I and type-II errors) of our synthetic two-sample test experiments.
  • Figure 2: Interpretability of C2ST. The color map corresponds to the value of $p(l=1|z)$.

Theorems & Definitions (4)

  • Theorem 1
  • Remark 1
  • Remark 2
  • Remark 3: How good is my GAN? Is it overfitting?