
Semi-Supervised U-statistics

Ilmun Kim, Larry Wasserman, Sivaraman Balakrishnan, Matey Neykov

TL;DR

This work introduces semi-supervised U-statistics that exploit abundant unlabeled data, proposes a refined approach for bivariate kernels that outperforms the classical U-statistic across all degeneracy regimes, and establishes its optimality properties.

Abstract

Semi-supervised datasets are ubiquitous across diverse domains where obtaining fully labeled data is costly or time-consuming. The prevalence of such datasets has consistently driven the demand for new tools and methods that exploit the potential of unlabeled data. Responding to this demand, we introduce semi-supervised U-statistics enhanced by the abundance of unlabeled data, and investigate their statistical properties. We show that the proposed approach is asymptotically Normal and exhibits notable efficiency gains over classical U-statistics by effectively integrating various powerful prediction tools into the framework. To understand the fundamental difficulty of the problem, we derive minimax lower bounds in semi-supervised settings and showcase that our procedure is semi-parametrically efficient under regularity conditions. Moreover, tailored to bivariate kernels, we propose a refined approach that outperforms the classical U-statistic across all degeneracy regimes, and demonstrate its optimality properties. Simulation studies are conducted to corroborate our findings and to further demonstrate our framework.
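
For readers unfamiliar with the baseline, the classical U-statistic that the abstract compares against can be sketched as follows. This is a minimal illustration of an order-2 U-statistic only, not the semi-supervised estimator proposed in the paper; the function names `u_statistic`, `variance_kernel`, and `kendall_kernel` are hypothetical and chosen for this example.

```python
from itertools import combinations

import numpy as np


def u_statistic(sample, kernel):
    """Classical order-2 U-statistic: the average of a symmetric kernel
    over all unordered pairs of observations (illustrative helper)."""
    pairs = combinations(range(len(sample)), 2)
    values = [kernel(sample[i], sample[j]) for i, j in pairs]
    return float(np.mean(values))


def variance_kernel(a, b):
    # h(a, b) = (a - b)^2 / 2 has expectation Var(Y), so the resulting
    # U-statistic coincides with the unbiased sample variance.
    return 0.5 * (a - b) ** 2


def kendall_kernel(p, q):
    # p = (x1, y1), q = (x2, y2); the kernel is the sign of the product
    # of coordinate differences, which yields Kendall's tau (no ties).
    return float(np.sign((p[0] - q[0]) * (p[1] - q[1])))


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]
    print(u_statistic(y, variance_kernel))   # matches np.var(y, ddof=1)

    xy = [(1, 10), (2, 20), (3, 30)]
    print(u_statistic(xy, kendall_kernel))   # perfectly concordant pairs
```

The variance and Kendall's $\tau$ kernels are used here because both appear in the simulation studies summarized in the figure captions; the double loop over pairs is quadratic in the sample size and is meant for clarity rather than speed.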

Paper Structure

This paper contains 74 sections, 26 theorems, 373 equations, and 4 figures.

Key Result

Lemma 1

Denote $\mathrm{Var}\{\ell_1(Y)\} = \sigma_1^2 + \sigma_2^2 > 0$. Assume that $\mathrm{Var}\{\ell(Y_1,\ldots,Y_r)\} < \infty$ and $\sigma_1^2 > 0$. Then the semi-supervised U-statistic $U_{\psi_1}$ satisfies

Figures (4)

  • Figure 1: Comparing MSE ratios for different $m$ values: (a) The left panel indicates that the ZB estimator performs better than $\{U_{\mathrm{cross}},U_{\mathrm{plug}}\}$ in Model 1 (linear additive model). (b) Conversely, the right panel demonstrates that the ZB estimator performs less effectively than $\{U_{\mathrm{cross}},U_{\mathrm{plug}}\}$ in Model 2 (non-linear model). In all scenarios, the semi-supervised estimators consistently outperform $U$, especially when $m$ is large. See Section "Variance Estimation (Sim)" for details.
  • Figure 2: Comparing MSE ratios for different mean values ($\mu$): (a) The left panel indicates that $U_{\mathrm{adapt}}$ performs better than both $U_{\mathrm{cross}}$ and $U$ when $\mu$ is close to zero, whereas it performs comparably to $U_{\mathrm{cross}}$ when $\mu$ is far from zero. This observation applies to both regression methods and highlights the adaptive property of $U_{\mathrm{adapt}}$. (b) The right panel displays a similar pattern to the left panel, except that the estimator based on least squares regression shows no gain over $U$ due to model misspecification. See Section "Simulation for Adaptive Estimation" for details.
  • Figure 3: Type I error and power results for Kendall's $\tau$ experiments with $m=50000$: (a) The left panel displays estimated type I error rates of Kendall's $\tau$ and its semi-supervised counterparts at $\alpha = 0.05$ as the sample size varies. (b) The right panel shows the estimated power of the considered tests as the correlation parameter $\rho$ varies, with $n=5000$. These results indicate that the semi-supervised tests outperform classical Kendall's $\tau$ in terms of power, while the approach using $U_{\mathrm{plug}}$ is anti-conservative in small-sample scenarios. See Section "Semi-Supervised Kendall" for details.
  • Figure 4: Type I error and power results for experiments with the Wilcoxon signed rank test with $m=50000$: (a) The left panel displays estimated type I error rates of the Wilcoxon test and its semi-supervised counterparts at $\alpha = 0.05$ as the sample size varies. (b) The right panel shows the estimated power of the considered tests as the correlation parameter $\mu$ varies, with $n=2500$. These results indicate that the semi-supervised tests outperform the classical Wilcoxon test in terms of power, while the approach using $U_{\mathrm{plug}}$ is anti-conservative in small-sample scenarios. See Section "Semi-Supervised Wilcoxon Signed Rank Test" for details.

Theorems & Definitions (35)

  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Proposition 1
  • Proposition 2
  • Theorem 2
  • Example 1
  • Theorem 3
  • Proposition 3
  • Theorem 4
  • ...and 25 more