DataS^3: Dataset Subset Selection for Specialization

Neha Hulkund, Alaa Maalouf, Levi Cai, Daniel Yang, Tsun-Hsuan Wang, Abigail O'Neil, Timm Haucke, Sandeep Mukherjee, Vikram Ramaswamy, Judy Hansen Shen, Gabriel Tseng, Mike Walmsley, Daniela Rus, Ken Goldberg, Hannah Kerner, Irene Chen, Yogesh Girdhar, Sara Beery

TL;DR

DataS$^3$ introduces a dedicated benchmark for dataset subset selection aimed at deployment-specific specialization, formalizing the DS3 problem where a subset S of a general training pool T is chosen to minimize deployment-distribution loss on a query set Q drawn from P_Q. The benchmark covers five real-world domains (iWildCam, GeoDE, Auto Arborist, FishDetection, NuScenes) with multiple deployments, and evaluates a spectrum of baselines from knowledge-driven subsets to unsupervised and labeled-query methods. Empirically, carefully curated subsets can outperform training on the full data by large margins (e.g., up to 51.3% accuracy gain) and achieve substantial data efficiency, while current general-purpose subset methods often fail in deployment-specific settings. These findings highlight the practical value of tailoring data curation to deployment needs and frame a challenging open problem for unsupervised DS3 approaches within diverse real-world domains.
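One way to write the objective described above is sketched below. The symbols $S$, $T$, $Q$, and $P_Q$ come from the summary; the loss $\ell$, the model $f_\theta$, and the training procedure $\mathcal{A}$ are notation assumed here for illustration and may differ from the paper's own equation.

```latex
% Ideal objective: choose the subset whose trained model minimizes
% expected loss under the deployment distribution P_Q.
S^{*} \;=\; \arg\min_{S \subseteq T}\;
  \mathbb{E}_{(x, y) \sim P_Q}
  \Big[ \ell\big( f_{\theta_S}(x),\, y \big) \Big],
\qquad \theta_S = \mathcal{A}(S)

% Empirical surrogate when labels for the query set Q are available:
\hat{S} \;=\; \arg\min_{S \subseteq T}\;
  \frac{1}{|Q|} \sum_{(x, y) \in Q}
  \ell\big( f_{\theta_S}(x),\, y \big)
```

When $Q$ is unlabeled, the surrogate above cannot be evaluated directly, which is why the unsupervised setting is the harder one.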

Abstract

In many real-world machine learning (ML) applications (e.g., detecting broken bones in x-ray images, detecting species in camera traps), models must perform well on specific deployments (e.g., a specific hospital, a specific national park) rather than on the domain broadly. However, deployments often have imbalanced, unique data distributions. A discrepancy between the training distribution and the deployment distribution can lead to suboptimal performance, highlighting the need to select deployment-specialized subsets from the available training data. We formalize dataset subset selection for specialization (DS3): given a training set drawn from a general distribution and a (potentially unlabeled) query set drawn from the desired deployment-specific distribution, the goal is to select a subset of the training data that optimizes deployment performance. We introduce DataS^3, the first dataset and benchmark designed specifically for the DS3 problem. DataS^3 encompasses diverse real-world application domains, each with a set of distinct deployments to specialize in. We conduct a comprehensive study evaluating algorithms from various families (including coresets, data filtering, and data curation) on DataS^3, and find that general-distribution methods consistently fail on deployment-specific tasks. Additionally, we demonstrate the existence of manually curated (deployment-specific) expert subsets that outperform training on all available data, with accuracy gains of up to 51.3 percent. Our benchmark highlights the critical role of tailored dataset curation in enhancing performance and training efficiency on deployment-specific distributions, which we posit will only become more important as global, public datasets become available across domains and ML models are deployed in the real world.
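To make the DS3 setting concrete, the following is a minimal sketch of one generic unsupervised selection strategy: score each training example by its cosine similarity to the nearest example in the unlabeled query set and keep the top-scoring ones. It is an illustration only and is not claimed to be any of the baselines evaluated in the paper; the function name, the `budget` parameter, and the toy 2-D "embeddings" are all assumptions.

```python
import numpy as np


def select_subset_by_query_similarity(train_emb, query_emb, budget):
    """Return sorted indices of the `budget` training rows most similar to the query set.

    Hypothetical helper: each training embedding is scored by its maximum
    cosine similarity to any query embedding.
    """
    def l2_normalize(x):
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.clip(norms, 1e-12, None)  # guard against zero vectors

    t = l2_normalize(np.asarray(train_emb, dtype=float))
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    sims = t @ q.T                  # shape: (n_train, n_query)
    scores = sims.max(axis=1)       # similarity to the closest query example
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:budget])


# Toy data: 50 "general" points along one axis, 10 deployment-like points along another.
rng = np.random.default_rng(0)
general = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(50, 2))
deployment_like = rng.normal(loc=[0.0, 5.0], scale=0.1, size=(10, 2))
train_emb = np.vstack([general, deployment_like])   # rows 50..59 resemble the deployment
query_emb = rng.normal(loc=[0.0, 5.0], scale=0.1, size=(5, 2))

selected = select_subset_by_query_similarity(train_emb, query_emb, budget=10)
print(selected)
```

In a real pipeline the embeddings would come from a pretrained encoder, and the selected indices would define the subset used for fine-tuning before evaluation on held-out deployment data.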

Paper Structure

This paper contains 25 sections, 1 equation, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Foundation model training aims for broad generalization, by using all data available, usually from massive internet-scale datasets. In practice, we find these models are often suboptimal for specific deployments, which may exhibit different distributions over categories or data characteristics from the general training data pool. Dataset subset selection for specialization seeks to identify model training subsets closely aligned with the target deployment, achieving superior performance under the given distribution and attribute shifts.
  • Figure 2: DataS$^3$ benchmark process, involving dataset splitting, subset selection, model specialization/finetuning, and then evaluation.
  • Figure 3: The five datasets in our benchmark (iWildCam, GeoDE, AutoArborist, FishDetection, and NuScenes) each have real-world deployment applications. For iWildCam, GeoDE, and AutoArborist, we show the class distribution of each deployment; for FishDetection, the number of detections per image; and for NuScenes, environment/output features. These diagrams show that the deployments in each dataset pose unique challenges that create a need for model specialization, including long-tailedness (AutoArborist, iWildCam), covariate shift (all), subpopulation shifts (GeoDE, FishDetection), and more. These axes of variation are described in depth in Section \ref{sec:benchmark} and further in Appendix \ref{appdx:additional_dataset_details}.
  • Figure 4: Sample efficiency of the baselines for which subset-size thresholds were set (CLIP-score, Image-Align, Match-Dist), with Random as a comparison point, for the linear-probing results. NuScenes is omitted because it uses MSE as its performance metric and, as a regression task, cannot use the Match-Dist baseline. We find that models often perform nearly as well with 50% of the data, and that certain subsets outperform training on all the data.
  • Figure 5: Visualization of the iWildCam dataset across deployments
  • ...and 7 more figures
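
The captions above refer to subset-size thresholds and to a labeled-query baseline named Match-Dist without defining it. As a hedged illustration only, the sketch below shows one generic way to draw a subset whose class proportions follow those of a labeled query set at a given budget. It is not the paper's implementation; the function name, the rounding rule, and the toy labels are assumptions.

```python
import numpy as np


def match_label_distribution(train_labels, query_labels, budget, seed=0):
    """Sample training indices so class proportions follow the query set.

    Hypothetical helper: per-class quotas are round(query_fraction * budget),
    capped by how many training examples of that class exist, so the result
    can be smaller than `budget` when the pool lacks a class.
    """
    rng = np.random.default_rng(seed)
    train_labels = np.asarray(train_labels)
    classes, counts = np.unique(query_labels, return_counts=True)
    fractions = counts / counts.sum()

    chosen = []
    for cls, frac in zip(classes, fractions):
        pool = np.flatnonzero(train_labels == cls)   # training rows of this class
        take = min(int(round(frac * budget)), pool.size)
        if take > 0:
            chosen.append(rng.choice(pool, size=take, replace=False))
    if not chosen:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(chosen))


# Long-tailed general pool: 80 of class 0, 15 of class 1, 5 of class 2.
train_labels = np.array([0] * 80 + [1] * 15 + [2] * 5)
# Deployment query set: only classes 1 and 2, in a 60/40 split.
query_labels = np.array([1] * 6 + [2] * 4)

small = match_label_distribution(train_labels, query_labels, budget=10)
large = match_label_distribution(train_labels, query_labels, budget=20)
print(np.bincount(train_labels[small], minlength=3))  # class counts at budget 10
print(len(large))                                     # capped by the 5 class-2 examples
```

Sweeping `budget` over several values and retraining on each subset is the kind of experiment a sample-efficiency plot summarizes.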