Semi-Supervised U-statistics

Ilmun Kim; Larry Wasserman; Sivaraman Balakrishnan; Matey Neykov

TL;DR

This work introduces semi-supervised U-statistics that exploit abundant unlabeled data, proposes a refined approach for bivariate kernels that outperforms the classical U-statistic across all degeneracy regimes, and demonstrates its optimality properties.

Abstract

Semi-supervised datasets are ubiquitous across diverse domains where obtaining fully labeled data is costly or time-consuming. The prevalence of such datasets has consistently driven the demand for new tools and methods that exploit the potential of unlabeled data. Responding to this demand, we introduce semi-supervised U-statistics enhanced by the abundance of unlabeled data, and investigate their statistical properties. We show that the proposed approach is asymptotically Normal and exhibits notable efficiency gains over classical U-statistics by effectively integrating various powerful prediction tools into the framework. To understand the fundamental difficulty of the problem, we derive minimax lower bounds in semi-supervised settings and showcase that our procedure is semi-parametrically efficient under regularity conditions. Moreover, tailored to bivariate kernels, we propose a refined approach that outperforms the classical U-statistic across all degeneracy regimes, and demonstrate its optimality properties. Simulation studies are conducted to corroborate our findings and to further demonstrate our framework.
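
To make the setting concrete, here is a minimal sketch and not the paper's estimator. It shows (i) a classical order-2 U-statistic and (ii) a two-fold cross-fit semi-supervised estimate of a mean, in which a prediction function fitted on labeled data is averaged over labeled and unlabeled covariates and then bias-corrected with labeled residuals. The function names `u_statistic` and `ss_mean_crossfit`, the least-squares choice for the prediction function, and the two-fold split are all assumptions made for illustration.

```python
import numpy as np
from itertools import combinations


def u_statistic(y, kernel):
    """Classical order-2 U-statistic: kernel averaged over all unordered pairs."""
    pairs = combinations(range(len(y)), 2)
    return float(np.mean([kernel(y[i], y[j]) for i, j in pairs]))


def _fit_linear(x, y):
    """Least-squares fit of y on (1, x); returns a prediction function.

    Linear regression is an illustrative assumption; any predictor could be used.
    """
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda z: np.column_stack([np.ones(len(z)), z]) @ coef


def ss_mean_crossfit(x_lab, y_lab, x_unlab, seed=0):
    """Two-fold cross-fit semi-supervised estimate of E[Y] (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_lab)), 2)
    estimates = []
    for k in range(2):
        train, evaluate = folds[1 - k], folds[k]
        # Fit the predictor on one fold, evaluate it on the other.
        f_hat = _fit_linear(x_lab[train], y_lab[train])
        # Labeled residuals correct the bias of the prediction average.
        residual = np.mean(y_lab[evaluate] - f_hat(x_lab[evaluate]))
        # Predictions are averaged over held-out labeled and all unlabeled covariates.
        x_all = np.concatenate([x_lab[evaluate], x_unlab])
        estimates.append(residual + np.mean(f_hat(x_all)))
    return float(np.mean(estimates))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    x_lab = rng.normal(size=200)
    y_lab = 2.0 * x_lab + 1.0 + rng.normal(scale=0.5, size=200)
    x_unlab = rng.normal(size=5000)
    # The kernel (a - b)^2 / 2 yields the unbiased sample variance.
    print(u_statistic(y_lab, lambda a, b: 0.5 * (a - b) ** 2))
    print(ss_mean_crossfit(x_lab, y_lab, x_unlab))
```

With the kernel $(a-b)^2/2$, `u_statistic` reproduces the unbiased sample variance, which gives a quick sanity check. The cross-fit mean stays unbiased for any fitted predictor, because the labeled residual term cancels the predictor's bias; a better predictor only reduces the variance.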

Paper Structure (74 sections, 26 theorems, 373 equations, 4 figures)

Introduction
Contributions
Related Work
Outline
Notation
Problem Setup and Motivation
Oracle Mean Estimation
Extension to a General Kernel
Procedure with Cross-Fitting
Estimation of $\psi_1$
Alternative Options for $\widehat{f}$
Procedure without Sample Splitting
Berry--Esseen Bounds
Bound for the Cross-Fit Estimator
Bound for the Single-Split Estimator
...and 59 more sections

Key Result

Lemma 1

Denote $\mathrm{Var}\{\ell_1(Y)\} = \sigma_1^2 + \sigma_2^2 > 0$, where … Assume that $\mathrm{Var}\{\ell(Y_1,\ldots,Y_r)\} < \infty$ and $\sigma_1^2 > 0$. Then the semi-supervised U-statistic $U_{\psi_1}$ satisfies …

Figures (4)

Figure 1: Comparing MSE ratios for different $m$ values: (a) The left panel indicates that the ZB estimator performs better than $\{U_{\mathrm{cross}},U_{\mathrm{plug}}\}$ in Model 1 (linear additive model). (b) Conversely, the right panel demonstrates that the ZB estimator performs worse than $\{U_{\mathrm{cross}},U_{\mathrm{plug}}\}$ in Model 2 (non-linear model). In all scenarios, the semi-supervised estimators consistently outperform $U$, especially when $m$ is large. See the section "Variance Estimation (Sim)" for details.
Figure 2: Comparing MSE ratios for different mean values ($\mu$): (a) The left panel indicates that $U_{\mathrm{adapt}}$ performs better than both $U_{\mathrm{cross}}$ and $U$ when $\mu$ is close to zero, whereas it performs comparably to $U_{\mathrm{cross}}$ when $\mu$ is far from zero. This observation applies to both regression methods and highlights the adaptive property of $U_{\mathrm{adapt}}$. (b) The right panel displays a pattern similar to the left panel, except that the estimator based on least squares regression shows no gain over $U$ due to model misspecification. See the section "Simulation for Adaptive Estimation" for details.
Figure 3: Type I error and power results for Kendall's $\tau$ experiments with $m=50000$: (a) The left panel displays estimated type I error rates of Kendall's $\tau$ and its semi-supervised counterparts at $\alpha = 0.05$ for varying sample sizes. (b) The right panel shows the estimated power of the considered tests as the correlation parameter $\rho$ varies, with $n=5000$. These results indicate that the semi-supervised tests outperform classical Kendall's $\tau$ in terms of power, while the approach using $U_{\mathrm{plug}}$ is anti-conservative in small-sample scenarios. See the section "Semi-Supervised Kendall" for details.
Figure 4: Type I error and power results for experiments with the Wilcoxon signed rank test with $m=50000$: (a) The left panel displays estimated type I error rates of the Wilcoxon test and its semi-supervised counterparts at $\alpha = 0.05$ for varying sample sizes. (b) The right panel shows the estimated power of the considered tests as the correlation parameter $\mu$ varies, with $n=2500$. These results indicate that the semi-supervised tests outperform the classical Wilcoxon test in terms of power, while the approach using $U_{\mathrm{plug}}$ is anti-conservative in small-sample scenarios. See the section "Semi-Supervised Wilcoxon Signed Rank Test" for details.

Theorems & Definitions (35)

Lemma 1
Theorem 1
Lemma 2
Proposition 1
Proposition 2
Theorem 2
Example 1
Theorem 3
Proposition 3
Theorem 4
...and 25 more
