A Unified Data Representation Learning for Non-parametric Two-sample Testing

Xunye Tian, Liuhua Peng, Zhijian Zhou, Mingming Gong, Arthur Gretton, Feng Liu

TL;DR

The paper tackles non-parametric two-sample testing by leveraging unlabelled data (samples without their group indices) to learn representations that preserve Type-I error control while boosting test power. It proposes RL-TST, a two-phase framework that first learns inherent, manifold-aware representations on the full dataset and then trains discriminative representations on the training set, enabling stronger test statistics via classifier-based or deep-kernel approaches combined with permutation testing. Empirically, RL-TST variants consistently outperform strong baselines such as C2ST, C2ST-L, MMD-D, and MMD-FUSE on HDGM benchmarks as well as MNIST and ImageNet–Fake setups, while controlling Type-I error. The work also analyzes why standard semi-supervised learning methods often fail in testing contexts and argues that the two-phase RL-TST approach better exploits unlabelled data for discriminative testing, offering a scalable direction for future research on high-dimensional two-sample problems.

Abstract

Learning effective data representations has been crucial in non-parametric two-sample testing. Common approaches first split the data into training and test sets and then learn data representations purely on the training set. However, recent theoretical studies have shown that, as long as the sample indices are not used during the learning process, the whole dataset can be used to learn data representations while still ensuring control of Type-I errors. This fact motivates us to use the test set (but without sample indices) to facilitate data representation learning for testing. To this end, we propose a representation-learning two-sample testing (RL-TST) framework. RL-TST first performs purely self-supervised representation learning on the entire dataset to capture inherent representations (IRs) that reflect the underlying data manifold. A discriminative model is then trained on these IRs to learn discriminative representations (DRs), enabling the framework to leverage both the rich structural information from IRs and the discriminative power of DRs. Extensive experiments demonstrate that RL-TST outperforms representative approaches by simultaneously using data manifold information in the test set and enhancing test power by finding the DRs with the training set.
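The two-phase idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: PCA (a linear auto-encoder) stands in for the self-supervised encoder, a mean-difference direction stands in for the discriminative model, and all function names (`fit_linear_encoder`, `fit_direction`, `permutation_test`, `rl_tst_sketch`) are hypothetical. The point it shows is that the encoder sees every sample but never the sample indices, while the discriminative step sees indices only on the training split.

```python
import numpy as np

def fit_linear_encoder(pooled, k):
    """Phase 1 stand-in (assumption): PCA on pooled data, no sample indices used."""
    mean = pooled.mean(axis=0)
    # Rows of vt are principal directions of the pooled, centred data.
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    return lambda x: (x - mean) @ vt[:k].T

def fit_direction(zx_tr, zy_tr):
    """Phase 2 stand-in (assumption): mean-difference direction from the training split."""
    w = zx_tr.mean(axis=0) - zy_tr.mean(axis=0)
    return w / (np.linalg.norm(w) + 1e-12)

def permutation_test(sx, sy, n_perm=500, rng=None):
    """Permutation p-value for the statistic |mean(sx) - mean(sy)| on 1-D scores."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate([sx, sy])
    n = len(sx)
    observed = abs(sx.mean() - sy.mean())
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)  # reshuffle group membership
        exceed += int(abs(perm[:n].mean() - perm[n:].mean()) >= observed)
    return (exceed + 1) / (n_perm + 1)

def rl_tst_sketch(x, y, k=2, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    hx, hy = len(x) // 2, len(y) // 2
    x_tr, x_te = x[:hx], x[hx:]
    y_tr, y_te = y[:hy], y[hy:]
    # Phase 1: encoder learned on ALL samples, pooled and shuffled (indices discarded).
    encode = fit_linear_encoder(rng.permutation(np.concatenate([x, y])), k)
    # Phase 2: discriminative direction learned on the training split only.
    w = fit_direction(encode(x_tr), encode(y_tr))
    # Test: permutation test on the held-out split, scored in the learned space.
    return permutation_test(encode(x_te) @ w, encode(y_te) @ w, n_perm, rng)
```

Calling `rl_tst_sketch(x, y)` on two arrays of shape `(n, d)` returns a p-value; the null hypothesis would be rejected when it falls at or below the chosen significance level.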

Paper Structure

This paper contains 24 sections, 4 theorems, 28 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Theorem D.2

(Lopez:C2ST) Let $H_0: t = \frac{1}{2}$ and $H_1: t = 1 - \epsilon(\mathbb{P}, \mathbb{Q}; f')$, where $t$ is the test accuracy and $\epsilon(\mathbb{P}, \mathbb{Q}; f') = {\rm Pr}_{(z_i, l_i) \sim \mathcal{D}} \left[f'(z_i) \neq l_i \right] / 2 \in \left(0, \frac{1}{2}\right)$ represents the inability ... where $\alpha \in (0, 1)$ is the significance level, $t_{\alpha}$ is the $(1-\alpha)$ quantile and ...
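The statement above is truncated and is not reconstructed here. As a hedged illustration of the null hypothesis $t = \frac{1}{2}$, the sketch below uses the standard approximation from the classifier two-sample test literature: with $n_{te}$ held-out samples, the test accuracy under $H_0$ is approximately $\mathcal{N}\left(\frac{1}{2}, \frac{1}{4 n_{te}}\right)$. The function names `c2st_threshold` and `c2st_p_value` are hypothetical, and RL-TST itself is described as using permutation tests rather than this closed-form threshold.

```python
from statistics import NormalDist

def _null_accuracy_dist(n_te):
    # Assumption: under H0 the accuracy is Binomial(n_te, 1/2) / n_te,
    # approximated here by a normal with mean 1/2 and variance 1 / (4 * n_te).
    return NormalDist(mu=0.5, sigma=(0.25 / n_te) ** 0.5)

def c2st_threshold(n_te, alpha=0.05):
    """(1 - alpha) quantile of the approximate null accuracy distribution."""
    return _null_accuracy_dist(n_te).inv_cdf(1.0 - alpha)

def c2st_p_value(accuracy, n_te):
    """One-sided p-value: probability of an accuracy at least this large under H0."""
    return 1.0 - _null_accuracy_dist(n_te).cdf(accuracy)
```

Under this approximation, the test would reject $H_0$ when the observed held-out accuracy exceeds `c2st_threshold(n_te, alpha)`, which is equivalent to `c2st_p_value(accuracy, n_te)` falling below `alpha`.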

Figures (4)

  • Figure 1: Overview of the RL-TST framework. First, an encoder is learned on the whole dataset with any AE-based representation learning algorithm, such as a standard auto-encoder or a Wasserstein auto-encoder. Second, the learned encoder is fine-tuned together with a subsequent component that has discriminative ability. Finally, the resulting classifier or deep kernel is used to perform a permutation test based on the C2ST, C2ST-L, or MMD statistic to obtain the two-sample testing result.
  • Figure 2: Visualisation of the first two dimensions of samples for different difficulty levels of the high-dimensional Gaussian mixture (HDGM) dataset with dimension 10. For HDGM-Easy and HDGM-Medium, the cluster mean difference $\Delta_{\mu}$ within the same distribution is 10, while for HDGM-Hard, $\Delta_{\mu}$ is 0.5. For HDGM-Easy, the distribution mean difference $\Delta_{q}$ between $\mathbb{P}$ and $\mathbb{Q}$ is 5, while for HDGM-Medium and HDGM-Hard, $\Delta_{q}$ is 0. Further details on how the HDGM dataset is generated are given in the appendix.
  • Figure 3: Test power of two different implementations of the RL-TST framework applied to the two-sample testing method C2ST. Bar plots show that both the standard auto-encoder RL-C2ST and the Wasserstein auto-encoder RL-C2ST outperform C2ST on (a) the MNIST dataset, (b) HDGM-D with $d=2$, and (c) HDGM-D with $d=10$.
  • Figure 4: Results on HDGM-D and HDGM-S for $\alpha = 0.05$. (a) Average test power and (b) average Type-I error over 100 trials with $d=2$ fixed, as $N$ increases from $N=1000$ to $N=10000$. All RL-TST methods here use the standard auto-encoder, which could be replaced by alternatives such as the Wasserstein auto-encoder, as discussed in the ablation study.
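The test power and Type-I error in Figure 4 are rejection rates averaged over repeated trials. The sketch below shows one way such rates could be estimated for an MMD permutation test. It rests on several assumptions that are not from the paper: a fixed Gaussian kernel replaces the learned deep kernel, the biased MMD estimate is used, the sampler passed in is a user-supplied toy distribution rather than HDGM, and all names (`gaussian_gram`, `mmd2_biased`, `mmd_permutation_test`, `rejection_rate`) are hypothetical.

```python
import numpy as np

def gaussian_gram(z, bandwidth):
    """Gaussian kernel Gram matrix on the rows of z."""
    sq = np.sum(z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * z @ z.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2_biased(gram, idx_x, idx_y):
    """Biased squared-MMD estimate read off a precomputed Gram matrix."""
    kxx = gram[np.ix_(idx_x, idx_x)].mean()
    kyy = gram[np.ix_(idx_y, idx_y)].mean()
    kxy = gram[np.ix_(idx_x, idx_y)].mean()
    return kxx + kyy - 2.0 * kxy

def mmd_permutation_test(x, y, bandwidth=1.0, n_perm=200, rng=None):
    """Permutation p-value for the MMD statistic with a fixed Gaussian kernel."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    gram = gaussian_gram(np.concatenate([x, y]), bandwidth)
    idx = np.arange(len(x) + len(y))
    observed = mmd2_biased(gram, idx[:n], idx[n:])
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(idx)  # reassign samples to the two groups
        exceed += int(mmd2_biased(gram, p[:n], p[n:]) >= observed)
    return (exceed + 1) / (n_perm + 1)

def rejection_rate(sample_pair, n_trials=100, alpha=0.05, seed=0, **test_kwargs):
    """Fraction of trials rejecting H0: power if P != Q, Type-I error if P == Q."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        x, y = sample_pair(rng)  # sample_pair(rng) must return two (n, d) arrays
        rejections += int(mmd_permutation_test(x, y, rng=rng, **test_kwargs) <= alpha)
    return rejections / n_trials
```

Passing a sampler that draws both samples from the same distribution gives an estimate of the Type-I error, which should stay near `alpha`; passing one with different distributions gives an estimate of test power.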

Theorems & Definitions (8)

  • Definition D.1
  • Theorem D.2
  • Definition D.3: Compatibility
  • Theorem D.4
  • Theorem D.5
  • Definition D.6
  • Theorem D.7
  • proof