
A Dataset is Worth 1 MB

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

TL;DR

The paper proposes Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission, together with a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task.

Abstract

A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task. This selection process simultaneously maximizes training efficiency and minimizes transmission payload. Experiments on 10 diverse datasets demonstrate that our approach can transfer task knowledge with a payload of less than 1 MB while retaining high classification accuracy, offering a promising solution for efficient dataset serving.
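
As a rough sanity check on the sub-1 MB budget, the following back-of-the-envelope sketch estimates an idealized payload size. It is not the authors' encoding. It assumes the kept images are signalled with an entropy-coded selection mask and that each kept label costs log2(C) bits. The function names and the example values (the 1,281,167-image ImageNet-1K training set as reference, a 5% keep rate, 200 target classes) are illustrative assumptions.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable (0 at the endpoints)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def estimated_payload_bytes(num_reference: int,
                            keep_fraction: float,
                            num_classes: int) -> float:
    """Idealized payload estimate (an assumption, not the paper's codec).

    mask_bits:  cost of saying which reference images are kept, assuming
                an ideal entropy coder over an i.i.d. keep/drop mask.
    label_bits: cost of one uniformly coded class label per kept image.
    """
    mask_bits = num_reference * binary_entropy(keep_fraction)
    label_bits = num_reference * keep_fraction * math.log2(num_classes)
    return (mask_bits + label_bits) / 8.0


if __name__ == "__main__":
    # Hypothetical setting: ImageNet-1K reference, keep 5%, 200 classes.
    size = estimated_payload_bytes(1_281_167, 0.05, 200)
    print(f"estimated payload: {size / 1000:.1f} KB")
```

Under these assumptions the estimate stays well below 1 MB, and it shrinks further as the keep rate drops, because both the mask term and the label term scale with the number of retained images.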

Paper Structure (44 sections, 16 equations, 9 figures, 17 tables)


Figures (9)

  • Figure 1: Motivation. A dataset server transmits the same large dataset many times at massive cost. Our method allows the server to send a compressed payload of less than 1 MB, enabling clients with heterogeneous hardware, even if they have ultra-narrow bandwidth, to train their own models locally.
  • Figure 2: The PLADA Pipeline. The server (left) trains a teacher classifier on the task dataset and distills this task knowledge into hard labels on the reference data. It then filters to the lowest-uncertainty $p\%$ of pseudo-labels and transmits a compressed payload ($<1$ MB). The client (right) reconstructs a virtual dataset using its preloaded reference dataset and the payload to train the student model.
  • Figure 3: Reference set images vs. energy percentile. High-confidence (low-energy) samples retrieved from ImageNet-21K demonstrate semantic and structural alignment with the target domains. For additional visualizations see Appendix \ref{sec:app:energy_visualizations}.
  • Figure 4: Class distribution of the RESISC45 hard pseudo-labels before and after Safety-Net Filtering. The yellow bars show the original global distribution, which is heavily imbalanced: RESISC45 consists of images extracted from Google Earth, and class 0 is airplane. Standard global filtering would eliminate some of the tail classes entirely. The blue bars show our Safety-Net Filtering (keeping 5%, $\alpha=-0.2$), which preserves a representation of under-represented classes even under extreme compression. Note that the y-axis uses a cube-root scale to accommodate the large differences in magnitude between the 'strong' and 'weak' classes.
  • Figure 5: Bandwidth-Accuracy Baselines (CUB-200). Comparison of PLADA against weight and data transmission baselines. PLADA (red star) dominates the top-left corner, achieving higher accuracy than weight-based methods while requiring a smaller payload (<35 KB). Data-centric baselines (Random Subset/K-Center) fail to provide a viable signal at this extreme budget. All payloads are Zstd-compressed (level 19).
  • ...and 4 more figures
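
The pipeline described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the uncertainty score is the standard free-energy score, -logsumexp of the teacher logits (Figure 3 refers to energy, but the exact formula is not given here). It uses the standard-library `zlib` as a stand-in for the Zstd compression mentioned in Figure 5. It omits Safety-Net Filtering. The function names and the byte layout are hypothetical.

```python
import math
import struct
import zlib


def energy(logits):
    """Free-energy score -logsumexp(logits); lower means more confident."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def build_payload(teacher_logits, keep_percent):
    """Server side: keep the lowest-energy keep_percent of reference
    images and pack their (index, hard label) pairs into bytes."""
    n = len(teacher_logits)
    k = max(1, int(n * keep_percent / 100))
    ranked = sorted(range(n), key=lambda i: energy(teacher_logits[i]))
    kept = sorted(ranked[:k])  # ascending reference indices
    labels = [
        max(range(len(teacher_logits[i])), key=teacher_logits[i].__getitem__)
        for i in kept
    ]  # hard label = argmax of the teacher logits
    # Hypothetical layout: count, then k uint32 indices, then k uint16 labels.
    raw = struct.pack(f"<I{k}I{k}H", k, *kept, *labels)
    return zlib.compress(raw, 9)  # zlib stands in for Zstd here


def rebuild_virtual_dataset(payload, reference_images):
    """Client side: pair preloaded reference images with received labels."""
    raw = zlib.decompress(payload)
    (k,) = struct.unpack_from("<I", raw, 0)
    fields = struct.unpack_from(f"<{k}I{k}H", raw, 4)
    kept, labels = fields[:k], fields[k:]
    return [(reference_images[i], y) for i, y in zip(kept, labels)]


if __name__ == "__main__":
    # Toy teacher logits for four reference images and three target classes.
    logits = [
        [5.0, 0.0, 0.0],
        [0.1, 0.0, 0.2],
        [0.0, 9.0, 0.0],
        [0.3, 0.3, 0.3],
    ]
    payload = build_payload(logits, keep_percent=50)
    print(rebuild_virtual_dataset(payload, ["a", "b", "c", "d"]))
```

The client never receives pixels: it only needs the payload and its own copy of the reference images, which is the property the caption highlights. Sorting the kept indices before packing is a design choice that makes them friendlier to delta or entropy coding, although this sketch does not exploit it.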