Sample Selection via Contrastive Fragmentation for Noisy Label Regression
Chris Dongjoo Kim, Sangwoo Moon, Jihwan Moon, Dongyeon Woo, Gunhee Kim
TL;DR
ConFrag is a contrastive fragmentation framework for regression with noisy labels. It partitions the label space into fragments, trains specialized experts on contrastive fragment pairs, and selects clean samples through a neighborhood agreement mechanism. A mixture-of-experts model aggregates local consensus across neighboring fragments, and neighborhood jittering regularizes learning. The method achieves state-of-the-art performance on six real-world regression benchmarks under symmetric and Gaussian label noise, as measured by the ERR and MRAE metrics and supported by extensive ablations. The results show the benefit of converting regression noise into structured, open-set-like signals via contrastive fragment pairs, with implications for scalable, robust regression in noisy-data regimes.
Abstract
As with many other problems, real-world regression is plagued by noisy labels, an inevitable issue that demands attention. Fortunately, real-world data often exhibit an intrinsic property of continuously ordered correlations between labels and features: data points with similar labels also tend to have closely related features. Exploiting this property, we propose a novel approach named ConFrag, which collectively models the regression data by transforming them into disjoint yet contrasting fragmentation pairs. This enables the training of more distinctive representations, enhancing the ability to select clean samples. The ConFrag framework leverages a mixture of neighboring fragments to discern noisy labels through neighborhood agreement among expert feature extractors. We perform extensive experiments on six newly curated benchmark datasets from diverse domains, including age prediction, price prediction, and music production year estimation. We also introduce a metric called Error Residual Ratio (ERR) to better account for varying degrees of label noise. Our approach consistently outperforms fourteen state-of-the-art baselines and is robust against symmetric and random Gaussian label noise.
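To make the fragmentation idea concrete, the following is a minimal sketch of the three steps the abstract names: splitting a continuous label range into fragments, forming contrasting fragment pairs, and checking neighborhood agreement. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the binning scheme, the pairing rule, or the agreement radius, so the equal-width bins, the rule that pairs fragment i with fragment i + F/2, the `radius` parameter, and all function names are hypothetical. The expert's prediction is represented by a plain integer rather than a trained model.

```python
from bisect import bisect_right


def fragment_ids(labels, num_fragments):
    """Map each continuous label to a fragment index in [0, num_fragments).

    Assumption: equal-width fragments over the observed label range.
    """
    lo, hi = min(labels), max(labels)
    width = (hi - lo) / num_fragments
    # Interior boundaries only, so the largest label lands in the last fragment.
    interior = [lo + width * k for k in range(1, num_fragments)]
    return [bisect_right(interior, y) for y in labels]


def contrastive_pairs(num_fragments):
    """Pair fragments that lie far apart in label space.

    Assumption: fragment i is paired with fragment i + F/2 (F even).
    """
    half = num_fragments // 2
    return [(i, i + half) for i in range(half)]


def neighborhood_agrees(given_fragment, predicted_fragment, radius=1):
    """Treat a sample as clean if the fragment implied by its given label
    is within `radius` fragments of the fragment the experts predict.

    Assumption: a simple distance threshold stands in for the paper's
    mixture-based neighborhood agreement.
    """
    return abs(given_fragment - predicted_fragment) <= radius


# Hypothetical age labels split into four fragments.
ages = [0, 10, 20, 30, 40, 50, 60, 70, 80]
ids = fragment_ids(ages, num_fragments=4)
pairs = contrastive_pairs(4)
```

In this sketch, a sample whose given label falls in fragment 0 while the experts predict fragment 3 fails the agreement check and would be set aside as likely noisy, whereas a prediction in an adjacent fragment passes.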
