Toward Robust Semi-supervised Regression via Dual-stream Knowledge Distillation

Ye Su, Hezhe Qiao, Wei Huang, Lin Chen

TL;DR

This work tackles semi-supervised regression (SSR) under limited labeled data by introducing Dual-stream Knowledge Distillation (DKD), which jointly distills continuous-valued knowledge and label-distribution information from a ground-truth-informed teacher to a student. A Decoupled Distribution Alignment (DDA) module further refines supervision by aligning the target and non-target parts of the distribution separately, with adaptive weighting to mitigate pseudo-label noise. The approach converts regression to label-distribution learning over discretized bins, enabling robust learning from unlabeled data via teacher-generated pseudo targets and end-to-end distillation losses. Empirical results across audio, text, image, and medical datasets show that DKD achieves state-of-the-art performance on MAE, $R^2$, and SRCC, demonstrating strong generalization and robustness to label scarcity.
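The conversion from regression to label-distribution learning over discretized bins can be illustrated with a minimal sketch. It assumes evenly spaced bin centers and a Gaussian-shaped soft label; the summary above specifies neither, so the function names, the `sigma` width, and the score range are all hypothetical choices made for illustration only.

```python
import numpy as np

def make_bins(lo, hi, num_bins):
    """Evenly spaced bin centers over [lo, hi] (an assumed discretization)."""
    return np.linspace(lo, hi, num_bins)

def score_to_distribution(score, centers, sigma=0.5):
    """Turn a scalar score into a soft label over the bins.

    The Gaussian shape and `sigma` are assumptions, not taken from the paper.
    """
    logits = -0.5 * ((centers - score) / sigma) ** 2
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    return probs / probs.sum()

def distribution_to_score(probs, centers):
    """Decode a bin distribution back to a scalar as its expected value."""
    return float(np.dot(probs, centers))

# Hypothetical score range of 1 to 5 with 41 bins (spacing 0.1).
centers = make_bins(1.0, 5.0, 41)
target = score_to_distribution(3.2, centers)
recovered = distribution_to_score(target, centers)
```

Decoding by expectation keeps the prediction continuous even though training operates on a discrete distribution, which is why a distribution-based formulation can still be evaluated with regression metrics such as MAE.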

Abstract

Semi-supervised regression (SSR), which aims to predict continuous scores while reducing reliance on large amounts of labeled data, has recently received considerable attention across applications including computer vision, natural language processing, audio analysis, and medical analysis. Existing SSR methods typically train models on scarce labeled data and introduce constraint-based regularization or ordinal ranking to reduce overfitting. However, these approaches fail to fully exploit the abundance of unlabeled samples. Consistency-driven pseudo-labeling methods do incorporate unlabeled data, but they are highly sensitive to pseudo-label quality and noisy predictions. To address these challenges, we introduce a Dual-stream Knowledge Distillation framework (DKD), designed specifically for SSR, which distills both continuous-valued knowledge and distribution information, better preserving regression magnitude information and improving sample efficiency. Specifically, in DKD, the teacher is optimized solely with ground-truth labels for label distribution estimation, while the student learns from a mixture of real labels and teacher-generated pseudo targets on unlabeled data. This distillation design supports effective supervision transfer, allowing the student to leverage pseudo labels more robustly. We further introduce Decoupled Distribution Alignment (DDA), which aligns the teacher and student distributions separately over the target class and the non-target classes, enhancing the student's capacity to mitigate noise in pseudo-label supervision and to learn a better-calibrated regression predictor.
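To make the idea of aligning target and non-target parts separately more concrete, the following is a rough sketch rather than the authors' DDA. It assumes the target class is the teacher's most probable bin, uses KL divergence for both terms, and replaces the adaptive weighting with two fixed, hypothetical weights (`w_target`, `w_non_target`), since the abstract does not give the weighting rule.

```python
import numpy as np

EPS = 1e-12  # guards against log(0) and division by zero

def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    p = np.clip(p, EPS, 1.0)
    q = np.clip(q, EPS, 1.0)
    return float(np.sum(p * np.log(p / q)))

def decoupled_alignment(teacher, student, w_target=1.0, w_non_target=1.0):
    """Align teacher and student bin distributions in two separate parts.

    Assumptions: the target bin is the teacher's argmax, and the two
    weights are fixed constants standing in for adaptive weighting.
    """
    t_idx = int(np.argmax(teacher))

    # Target part: binary distribution (target bin vs. everything else).
    bin_teacher = np.array([teacher[t_idx], 1.0 - teacher[t_idx]])
    bin_student = np.array([student[t_idx], 1.0 - student[t_idx]])
    target_term = kl(bin_teacher, bin_student)

    # Non-target part: distributions renormalized over the remaining bins.
    mask = np.ones_like(teacher, dtype=bool)
    mask[t_idx] = False
    nt_teacher = teacher[mask] / max(teacher[mask].sum(), EPS)
    nt_student = student[mask] / max(student[mask].sum(), EPS)
    non_target_term = kl(nt_teacher, nt_student)

    total = w_target * target_term + w_non_target * non_target_term
    return total, target_term, non_target_term

teacher = np.array([0.1, 0.6, 0.2, 0.1])
student = np.array([0.25, 0.25, 0.25, 0.25])
loss, target_term, non_target_term = decoupled_alignment(teacher, student)
```

Splitting the loss this way lets the two terms be weighted independently, so confidence in the target bin and the shape of the remaining mass can be supervised with different strengths.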

Paper Structure

This paper contains 25 sections, 13 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: MAE and test SRCC curves for direct regression (DR), Rankup, UCVME, and DKD on the BVCC dataset. Rankup yields the least competitive results, as its pairwise ranking method may not fully capture the continuous relationships. Although UCVME improves performance via consistency-constrained pseudo-labels, it still falls short of DKD.
  • Figure 2: Overview of DKD. (1) The input of DKD includes both labeled and unlabeled data. The teacher model is trained exclusively on the labeled data and is then used to generate pseudo labels for the unlabeled samples. The student model is trained on the entire dataset with a mixture of real labels and teacher-generated pseudo targets on unlabeled data. (2) Both the teacher and student are trained under a label-distribution formulation of regression, while continuous-valued distillation enforces consistency by minimizing the gap between their expected scores. (3) In distribution distillation, DDA first identifies the target class to construct a binary probability distribution, then aligns the predicted class distributions of the teacher and student models separately for the target and non-target parts.
  • Figure 3: MAE values with respect to the labeled sample size $n$.
  • Figure 4: The effect of hyperparameters $L$ and $\beta$.
  • Figure 5: The effect of the hyperparameters.
  • ...and 1 more figure
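The data flow described in the Figure 2 caption (real labels where available, teacher pseudo targets elsewhere, plus a continuous-valued consistency term on expected scores) might be sketched as follows. The absolute-difference gap, the use of the teacher's expected score as the pseudo target, and every function name here are assumptions made for illustration; the caption does not specify them.

```python
import numpy as np

def expected_score(probs, centers):
    """Expected score per sample; `probs` has shape (batch, bins)."""
    return probs @ centers

def student_targets(labels, is_labeled, teacher_probs, centers):
    """Use the real label where one exists, else the teacher's expected score.

    Entries of `labels` for unlabeled samples are ignored (NaN is fine).
    """
    pseudo = expected_score(teacher_probs, centers)
    return np.where(is_labeled, labels, pseudo)

def continuous_distillation(teacher_probs, student_probs, centers):
    """Mean absolute gap between teacher and student expected scores.

    The choice of an L1 gap is an assumption; the caption only says the
    gap between expected scores is minimized.
    """
    gap = expected_score(teacher_probs, centers) - expected_score(student_probs, centers)
    return float(np.mean(np.abs(gap)))

# Toy batch: three bins, one labeled sample and one unlabeled sample.
centers = np.array([1.0, 2.0, 3.0])
teacher_probs = np.array([[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]])
student_probs = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
labels = np.array([2.5, np.nan])
is_labeled = np.array([True, False])

targets = student_targets(labels, is_labeled, teacher_probs, centers)
gap = continuous_distillation(teacher_probs, student_probs, centers)
```

In the toy batch, the second sample shows why a score-level term complements distribution alignment: the teacher and student distributions differ in shape, and the expected-score gap measures how far that difference moves the final prediction.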