Enhancing Classification with Semi-Supervised Deep Learning Using Distance-Based Sample Weights
Aydin Abedinia, Shima Tabakhi, Vahid Seydi
TL;DR
The paper tackles classification under limited labeled data by introducing a distance-based weighting scheme that assigns each training sample a continuous weight based on its proximity to the test distribution. For a training sample $x_i$ and $M$ test samples $x'_j$, the weight is $w_i = \frac{1}{M} \sum_{j=1}^{M} \exp(-\lambda d(x_i, x'_j))$, where $d$ is a distance function and $\lambda$ controls how quickly the weight decays with distance. The weights enter the training objective over the $N$ training samples as $L_{\text{weighted}} = \frac{1}{N} \sum_{i=1}^{N} w_i \mathcal{L}(y_i, f(x_i; \theta))$. The distance metric is chosen per dataset (e.g., Euclidean, Hamming, Cosine, Jaccard) and hyperparameters are tuned accordingly. Experiments on twelve benchmarks show that the weighted approach improves accuracy, precision, recall, F1, and AUC over baselines and inverse-distance weighting, especially in imbalanced and data-scarce settings. The method offers a robust and scalable solution for semi-supervised learning, with potential impact in domains such as healthcare and security where labeled data are scarce.
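The two formulas in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes Euclidean distance for $d$, binary cross-entropy for $\mathcal{L}$, and hypothetical function names (`distance_weights`, `weighted_cross_entropy`) and toy data chosen only for illustration.

```python
import numpy as np

def distance_weights(X_train, X_test, lam=1.0):
    """w_i = (1/M) * sum_j exp(-lam * d(x_i, x'_j)), with Euclidean d (an assumption)."""
    # Pairwise differences via broadcasting: shape (N, M, D)
    diffs = X_train[:, None, :] - X_test[None, :, :]
    # Euclidean distance between every train/test pair: shape (N, M)
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Average the exponential kernel over the M test samples: shape (N,)
    return np.exp(-lam * dists).mean(axis=1)

def weighted_cross_entropy(y_true, p_pred, weights, eps=1e-12):
    """L_weighted = (1/N) * sum_i w_i * L(y_i, p_i), with binary cross-entropy as L (an assumption)."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    per_sample = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(np.mean(weights * per_sample))

# Toy data: the first training point lies on the test cloud, the second is far away.
X_train = np.array([[0.0, 0.0], [5.0, 5.0]])
X_test = np.array([[0.0, 0.0], [0.0, 1.0]])
weights = distance_weights(X_train, X_test, lam=1.0)

y_true = np.array([1.0, 0.0])
p_pred = np.array([0.9, 0.2])  # hypothetical model outputs f(x_i; theta)
loss = weighted_cross_entropy(y_true, p_pred, weights)
```

In this toy case the training point nearest the test samples receives the larger weight, so its loss term dominates the objective. Increasing `lam` makes the weights fall off faster with distance, and `lam = 0` gives every sample a weight of 1, which recovers the unweighted mean loss.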
Abstract
Recent advancements in semi-supervised deep learning have introduced effective strategies for leveraging both labeled and unlabeled data to improve classification performance. This work proposes a semi-supervised framework that utilizes a distance-based weighting mechanism to prioritize critical training samples based on their proximity to test data. By focusing on the most informative examples, the method enhances model generalization and robustness, particularly in challenging scenarios with noisy or imbalanced datasets. Building on techniques such as uncertainty consistency and graph-based representations, the approach addresses key challenges of limited labeled data while maintaining scalability. Experiments on twelve benchmark datasets demonstrate significant improvements across key metrics, including accuracy, precision, and recall, consistently outperforming existing methods. This framework provides a robust and practical solution for semi-supervised learning, with potential applications in domains such as healthcare and security where data limitations pose significant challenges.
