Collaborative Unlabeled Data Optimization

Xinyi Shang, Peng Sun, Fengyuan Liu, Tao Lin

TL;DR

CoOpt tackles the inefficiency of unlabeled data in deep learning by shifting from a model-centric to a data-centric paradigm. It distributes unlabeled data across participants, each applying a task-agnostic prior model to assign optimized targets, and then aligns these targets to a common distribution using a lightweight transformation, enabling reuse across architectures. The framework demonstrates strong gains over self-supervised methods and centralized optimization, including up to 13.6% accuracy improvements on Tiny-ImageNet and 6.8% on ImageNet-1K, with notable training speedups. A key insight is that target distribution inconsistency, arising from heterogeneous priors, can be mitigated via a learned alignment, and the approach remains effective even when priors are weak, thanks to the collaborative weighting of information. Overall, CoOpt offers a scalable, privacy-preserving path to faster, more generalizable learning from unlabeled data with practical implications for large-scale training pipelines.
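The pipeline summarized above (split the unlabeled data, let each participant assign targets with its own prior model, then align those targets to a common distribution) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `LinearPrior`, `fit_alignment`, and `co_optimize` are hypothetical, the priors are random linear maps standing in for pre-trained models, and the alignment is an ordinary least-squares linear map fitted on a small shared batch, which is only one possible reading of "lightweight transformation". The paper's actual procedure may differ.

```python
import numpy as np


class LinearPrior:
    """Hypothetical stand-in for a pre-trained, task-agnostic prior model."""

    def __init__(self, rng, in_dim, out_dim):
        # Fixed random weights; a real prior would be e.g. a pre-trained ResNet.
        self.W = rng.standard_normal((in_dim, out_dim))

    def __call__(self, x):
        # Map raw inputs to target vectors.
        return x @ self.W


def fit_alignment(source_targets, reference_targets):
    """Least-squares matrix T with source_targets @ T close to reference_targets."""
    T, *_ = np.linalg.lstsq(source_targets, reference_targets, rcond=None)
    return T


def co_optimize(shards, priors, shared):
    """Each participant labels its shard, then maps its targets into the
    target space of participant 0 (an assumed choice of common space)."""
    reference = priors[0](shared)
    aligned = []
    for x, prior in zip(shards, priors):
        T = fit_alignment(prior(shared), reference)  # fitted on shared data only
        aligned.append(prior(x) @ T)                 # applied to the private shard
    return np.concatenate(aligned, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    priors = [LinearPrior(rng, dim, dim) for _ in range(3)]
    shards = [rng.standard_normal((32, dim)) for _ in range(3)]
    shared = rng.standard_normal((64, dim))

    targets = co_optimize(shards, priors, shared)
    print(targets.shape)
```

In this toy setting the priors are invertible linear maps, so the fitted alignment recovers the reference participant's targets almost exactly. With real, nonlinear priors the alignment would only be approximate, which is presumably where the amount of shared data starts to matter.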

Abstract

This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94 \times $ and $1.2 \times$.

Paper Structure

This paper contains 51 sections, 4 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: A Collaborative Data Optimization Framework CoOpt. For large-scale unlabeled data, self-supervised learning results in low training efficiency. Therefore, we propose CoOpt, an efficient and parallel framework enabling participants to use diverse task-agnostic models, such as pre-trained ResNets, termed prior models, for collaborative data optimization.
  • Figure 2: Comparison Between KD and Ours.
  • Figure 3: Lifecycle of the proposed collaborative data optimization framework CoOpt. The framework encompasses an open data platform and multiple participants, involving five key data operations.
  • Figure 4: A Practical Scenario: Continuous Optimization.
  • Figure 5: Comprehensive Analysis of CoOpt. (a) Training curves: Comparison of SSL methods and our CoOpt. (b) Prior models with varying accuracies: Even with a very weak prior model, CoOpt accelerates the early-stage training. (c) Correlation verification: Verify the correlation between the uniform value and performance. (d) Influence of shared data size: As the shared data's size increases, the performance gains diminish.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 1: Data optimization with prior model $\boldsymbol{\psi}$