Collaborative Unlabeled Data Optimization
Xinyi Shang, Peng Sun, Fengyuan Liu, Tao Lin
TL;DR
CoOpt addresses the inefficient use of unlabeled data in deep learning by shifting from a model-centric to a data-centric paradigm. It distributes unlabeled data across participants; each participant applies a task-agnostic prior model to assign optimized targets to its share, and a lightweight transformation then aligns these targets to a common distribution so that they can be reused across architectures. The framework shows strong gains over self-supervised methods and centralized optimization, including accuracy improvements of up to 13.6% on Tiny-ImageNet and 6.8% on ImageNet-1K, with notable training speedups. A key insight is that the target distribution inconsistency arising from heterogeneous priors can be mitigated by a learned alignment, and that the approach remains effective even when priors are weak, thanks to the collaborative weighting of information. Overall, CoOpt offers a scalable, privacy-preserving path to faster, more generalizable learning from unlabeled data, with practical implications for large-scale training pipelines.
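The pipeline summarized in the TL;DR (shard the unlabeled data, let each participant assign targets with its own prior, then align all targets to a common space) can be illustrated with a toy sketch. This is not the authors' implementation: the linear "priors", the shared anchor set, the least-squares alignment, and the names `fit_alignment` and `collaborative_targets` are all assumptions made for illustration, and the lightweight transformation used in the paper may differ.

```python
import numpy as np

def fit_alignment(src_targets, ref_targets):
    """Least-squares linear map A such that src_targets @ A ~= ref_targets.

    Assumption: the alignment is a single linear layer fitted on targets
    that both spaces assign to the same (hypothetical) anchor samples.
    """
    A, *_ = np.linalg.lstsq(src_targets, ref_targets, rcond=None)
    return A

def collaborative_targets(data, num_participants=3, num_anchors=32, seed=0):
    """Toy collaborative target assignment with linear stand-in priors."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]

    # Hypothetical heterogeneous priors: one random linear encoder per participant.
    priors = [rng.normal(size=(dim, dim)) for _ in range(num_participants)]

    # Distribute the unlabeled data across participants.
    shards = np.array_split(data, num_participants)

    # Assumed shared anchor samples used only to fit the alignment maps.
    anchors = rng.normal(size=(num_anchors, dim))
    ref_anchor_targets = anchors @ priors[0]  # participant 0 defines the common space

    aligned = []
    for shard, W in zip(shards, priors):
        local_targets = shard @ W  # targets in the participant's own space
        A = fit_alignment(anchors @ W, ref_anchor_targets)
        aligned.append(local_targets @ A)  # targets mapped to the common space
    return np.concatenate(aligned, axis=0), priors

data = np.random.default_rng(1).normal(size=(30, 8))
targets, priors = collaborative_targets(data)
print(targets.shape)
```

Because the stand-in priors here are square linear maps, the fitted alignment recovers the reference space up to numerical error; with real, nonlinear prior models the alignment would only be approximate.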
Abstract
This paper introduces a data-centric paradigm for maximizing the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations of existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked into model parameters, which hinders its reusability and scalability. To address this, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization that encodes knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt enables scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency: CoOpt achieves improvements of 13.6% and 6.8% on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94\times$ and $1.2\times$.
