Trust-Aware Diversion for Data-Effective Distillation
Zhuojie Wu, Yanbin Liu, Xin Shen, Xiaofeng Cao, Xin Yu
TL;DR
This work tackles dataset distillation under noisy labels by introducing Trust-Aware Diversion (TAD), a dual-loop optimization that partitions data into trusted and untrusted spaces and iteratively refines them to mitigate mislabeled information. The outer loop fits a class-wise dynamic Gaussian Mixture Model to sample losses to concentrate distillation on trusted data, paired with a consistent regularization term to stabilize learning. The inner loop recalibrates untrusted samples using reliability measures based on Mahalanobis distance to anchors and cosine similarity, enabling selective pseudo-labeling and recall of informative data. Empirical results on CIFAR-10/100 and Tiny ImageNet across symmetric, asymmetric, and real-world noise demonstrate that TAD consistently outperforms state-of-the-art DDNL baselines, improving data efficiency and robustness in realistic noisy-label settings. Overall, TAD advances practical dataset distillation by explicitly accounting for label noise through an adaptive, interactive trust mechanism that expands the trusted space while reducing reliance on unreliable data.
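The loss-based split in the outer loop can be illustrated with a small sketch. It assumes a two-component, one-dimensional Gaussian mixture fitted per class with plain EM, and treats the low-loss component as "trusted". The function names, the 0.5 posterior threshold, and the initialization are illustrative assumptions, not the authors' implementation, and the "dynamic" aspect of the paper's GMM is not modeled here.

```python
import math

def _pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(losses, iters=100):
    """Fit a two-component 1-D GMM to per-sample losses with plain EM."""
    lo, hi = min(losses), max(losses)
    mu = [lo, hi]                                   # start the means at the extremes
    var = [max((hi - lo) ** 2 / 4.0, 1e-6)] * 2     # broad initial variances
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value
        resp = []
        for x in losses:
            p = [pi[k] * _pdf(x, mu[k], var[k]) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and (floored) variances
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, losses)) / nk, 1e-6)
            pi[k] = nk / len(losses)
    return mu, var, pi

def trusted_posterior(x, mu, var, pi):
    """Posterior probability that loss x belongs to the low-mean component."""
    clean = 0 if mu[0] <= mu[1] else 1
    p = [pi[k] * _pdf(x, mu[k], var[k]) for k in range(2)]
    s = p[0] + p[1]
    return p[clean] / s if s > 0 else 0.0

def split_by_class(losses, labels, threshold=0.5):
    """Fit one GMM per class and return (trusted, untrusted) sample indices."""
    trusted, untrusted = [], []
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        mu, var, pi = fit_two_gaussians([losses[i] for i in idx])
        for i in idx:
            if trusted_posterior(losses[i], mu, var, pi) >= threshold:
                trusted.append(i)
            else:
                untrusted.append(i)
    return sorted(trusted), sorted(untrusted)
```

Fitting the mixture per class rather than globally keeps classes that are intrinsically harder, and therefore have higher losses overall, from being pushed wholesale into the untrusted space.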
Abstract
Dataset distillation compresses a large dataset into a small synthetic subset that retains essential information. Existing methods assume that all samples are perfectly labeled, which limits their applicability in real-world settings where incorrect labels are ubiquitous. These mislabeled samples introduce untrustworthy information into the dataset, which misleads model optimization in dataset distillation. To tackle this issue, we propose a Trust-Aware Diversion (TAD) dataset distillation method. TAD introduces an iterative dual-loop optimization framework for data-effective distillation. Specifically, the outer loop divides data into trusted and untrusted spaces and redirects distillation toward trusted samples, which minimizes the impact of mislabeled samples on dataset distillation. The inner loop maximizes the distillation objective by recalibrating untrusted samples, thus transforming them into valuable ones for distillation. The two loops iteratively refine and compensate for each other, gradually expanding the trusted space and shrinking the untrusted space. Experiments demonstrate that our method can significantly improve the performance of existing dataset distillation methods on three widely used benchmarks (CIFAR-10, CIFAR-100, and Tiny ImageNet) in three challenging mislabeled settings (symmetric, asymmetric, and real-world).
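A rough sketch of how the inner-loop recalibration might score untrusted samples is given below. It assumes each class anchor is the mean feature of that class's trusted samples, uses a diagonal covariance as a simplification of the Mahalanobis distance, and accepts a pseudo-label only when both the distance and the cosine similarity pass fixed thresholds. The thresholds, the function names, and the way the two measures are combined are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def class_anchor(features):
    """Mean vector and per-dimension variance of one class's trusted features."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [max(sum((f[j] - mean[j]) ** 2 for f in features) / n, 1e-6)
           for j in range(d)]
    return mean, var

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (a simplifying assumption)."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var)))

def cosine(x, y):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def recalibrate(untrusted, anchors, max_dist=3.0, min_cos=0.9):
    """Return {index: pseudo_label} for untrusted samples that pass both checks.

    anchors maps each class label to its (mean, var) pair from class_anchor.
    """
    recalled = {}
    for i, x in enumerate(untrusted):
        # Nearest anchor by (diagonal) Mahalanobis distance
        label, (mean, var) = min(
            anchors.items(), key=lambda kv: mahalanobis_diag(x, *kv[1]))
        close = mahalanobis_diag(x, mean, var) <= max_dist
        aligned = cosine(x, mean) >= min_cos
        if close and aligned:
            recalled[i] = label       # moved back into the trusted space
    return recalled
```

Samples that fail either check stay in the untrusted space, which matches the selective nature of the recall described above: only samples that sit close to a trusted anchor and point in a similar direction are relabeled.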
