Toward Robust Semi-supervised Regression via Dual-stream Knowledge Distillation
Ye Su, Hezhe Qiao, Wei Huang, Lin Chen
TL;DR
This work tackles semi-supervised regression (SSR) under limited labeled data by introducing Dual-stream Knowledge Distillation (DKD), which jointly distills continuous-valued knowledge and label-distribution information from a ground-truth-informed teacher to a student. A Decoupled Distribution Alignment (DDA) module further refines supervision by aligning target and non-target components of the distribution separately, with adaptive weighting to mitigate pseudo-label noise. The approach converts regression into label-distribution learning over discretized bins, enabling robust learning from unlabeled data via teacher-generated pseudo targets and end-to-end distillation losses. Empirical results across audio, text, image, and medical datasets show that DKD achieves state-of-the-art performance on MAE, R^2, and SRCC, demonstrating strong generalization and robustness to label scarcity.
Abstract
Semi-supervised regression (SSR), which aims to predict continuous scores while reducing reliance on large amounts of labeled data, has recently received considerable attention across applications including computer vision, natural language processing, audio analysis, and medical analysis. Existing SSR methods typically train models on scarce labeled data and introduce constraint-based regularization or ordinal ranking to reduce overfitting. However, these approaches fail to fully exploit the abundance of unlabeled samples. Consistency-driven pseudo-labeling methods do incorporate unlabeled data, but they are highly sensitive to pseudo-label quality and noisy predictions. To address these challenges, we introduce a Dual-stream Knowledge Distillation framework (DKD), designed specifically for the SSR task, which distills both continuous-valued knowledge and distribution information, better preserving regression magnitude information and improving sample efficiency. Specifically, in DKD, the teacher is optimized solely with ground-truth labels for label distribution estimation, while the student learns from a mixture of real labels and teacher-generated pseudo targets on unlabeled data. This distillation design ensures effective supervision transfer, allowing the student to leverage pseudo labels more robustly. We then introduce a Decoupled Distribution Alignment (DDA) module that aligns the target-class and non-target-class components of the teacher and student distributions, enhancing the student's capacity to mitigate noise in pseudo-label supervision and learn a better-calibrated regression predictor.
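
To make the two ideas above concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give formulas, so several choices here are assumptions: a Gaussian soft label over bin centers for the regression-to-distribution conversion, the teacher's most probable bin as the "target class", and a target/non-target split of the KL divergence in the style of decoupled knowledge distillation. The function names, `sigma`, and the fixed weights `alpha` and `beta` are hypothetical; the adaptive weighting described above is not reproduced.

```python
import math


def value_to_distribution(y, bin_centers, sigma=1.0):
    """Turn a scalar target into a soft label distribution over bins.

    Assumption: a Gaussian centered at y, evaluated at the bin centers
    and normalized to sum to 1.
    """
    weights = [math.exp(-0.5 * ((c - y) / sigma) ** 2) for c in bin_centers]
    total = sum(weights)
    return [w / total for w in weights]


def expected_value(dist, bin_centers):
    """Decode a distribution back to a continuous prediction."""
    return sum(p * c for p, c in zip(dist, bin_centers))


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def decoupled_alignment(teacher, student, target, alpha=1.0, beta=1.0, eps=1e-12):
    """Align target and non-target parts of the distributions separately.

    target_term: binary KL between (target bin, everything else).
    non_target_term: KL between the distributions over the remaining bins,
    each renormalized to sum to 1.
    alpha and beta are fixed illustrative weights (hypothetical).
    """
    pt, ps = teacher[target], student[target]
    target_term = kl([pt, 1.0 - pt], [ps, 1.0 - ps])
    rest_t = [p / max(1.0 - pt, eps) for i, p in enumerate(teacher) if i != target]
    rest_s = [q / max(1.0 - ps, eps) for i, q in enumerate(student) if i != target]
    non_target_term = kl(rest_t, rest_s)
    return alpha * target_term + beta * non_target_term


# Toy example: 11 bins with centers 0, 1, ..., 10.
bin_centers = [float(c) for c in range(11)]
teacher = value_to_distribution(3.0, bin_centers)             # pseudo target
student = value_to_distribution(4.2, bin_centers, sigma=1.5)  # student output
target = max(range(len(teacher)), key=teacher.__getitem__)    # teacher's top bin

loss = decoupled_alignment(teacher, student, target, alpha=1.0, beta=2.0)
print(f"decoded teacher value: {expected_value(teacher, bin_centers):.3f}")
print(f"decoupled alignment loss: {loss:.4f}")
```

With `alpha = 1` and `beta = 1 - teacher[target]`, the two terms sum to the ordinary KL divergence between teacher and student. Decoupling them lets the non-target term be weighted independently of how confident the teacher is in its top bin, which is one plausible reading of how such a split could help with noisy pseudo labels.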
