Rethinking Representativeness and Diversity in Dynamic Data Selection

Yuzhe Zhou, Zhenglin Hua, Haiyun Guo, Yuheng Jia

TL;DR

This work rethinks two core notions underlying sample evaluation, representativeness and diversity, and realizes process-level diversity by combining rare-factor sampling with a Usage-Frequency Penalty that promotes sample rotation, provably discourages monopoly, and reduces gradient bias.

Abstract

Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric centrality, we define representativeness as coverage of dataset-level common or high-frequency feature factors. Instead of within-subset dispersion, we define diversity at the process level, requiring the selection trajectory to gradually include complementary rare factors over training. Based on this view, we propose a dynamic selection framework with three components. First, we score representativeness in a plug-in feature space to prioritize samples covering frequent factors. We instantiate this with a sparse autoencoder trained on the target dataset, using sparse unit activations to summarize both individual samples and dataset-wide factor statistics. Second, we realize process-level diversity by combining rare-factor sampling with a Usage-Frequency Penalty that promotes sample rotation, provably discourages monopoly, and reduces gradient bias. Third, we couple the two-dimensional scoring with a smooth scheduler that transitions selection from core-pattern consolidation to rare-factor exploration, without extra gradients, influence estimates, or second-order computations on the training model. Extensive experiments on five benchmarks across vision and text tasks demonstrate improved accuracy-efficiency trade-offs across models. Our method matches or exceeds full-data accuracy with over 2x training acceleration. Code will be released.
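
The three components described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formulas, so the frequency-weighted coverage score, the negative-log rarity score, the cosine form of the scheduler, the penalty weight `lam`, and every function name below are hypothetical. The sparse activations `acts` stand in for the output of a sparse autoencoder.

```python
import math

def factor_frequencies(acts):
    """Fraction of samples in which each sparse unit is active (> 0)."""
    n, k = len(acts), len(acts[0])
    return [sum(1 for row in acts if row[j] > 0) / n for j in range(k)]

def rep_and_div(row, freq):
    """Assumed scores: summed frequency of active factors, and rarity of the rarest one."""
    active = [j for j, a in enumerate(row) if a > 0]
    if not active:
        return 0.0, 0.0
    rep = sum(freq[j] for j in active)            # coverage of frequent factors
    div = max(-math.log(freq[j]) for j in active)  # rarity of the rarest active factor
    return rep, div

def alpha(t, total_epochs):
    """Assumed cosine schedule: 1 at the first epoch, 0 at the last."""
    return 0.5 * (1.0 + math.cos(math.pi * t / max(total_epochs - 1, 1)))

def select(acts, usage, t, total_epochs, budget, lam=0.5):
    """Rank samples by the blended score minus a usage penalty; update usage counts."""
    freq = factor_frequencies(acts)
    a = alpha(t, total_epochs)
    scores = []
    for i, row in enumerate(acts):
        rep, div = rep_and_div(row, freq)
        penalty = lam * usage[i] / (t + 1)  # hypothetical usage-frequency penalty
        scores.append(a * rep + (1.0 - a) * div - penalty)
    chosen = sorted(range(len(acts)), key=lambda i: scores[i], reverse=True)[:budget]
    for i in chosen:
        usage[i] += 1
    return chosen

# Toy data: units 0 and 1 are common, unit 2 is rare and appears only in sample 3.
acts = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
usage = [0] * len(acts)
early = select(acts, usage, t=0, total_epochs=3, budget=2)
late = select(acts, usage, t=2, total_epochs=3, budget=2)
print(early, late)
```

In this toy run the first epoch favors the samples that cover the common units, while the last epoch ranks the sample carrying the rare unit first. Only the feature activations and selection counts are used, consistent with the abstract's statement that no extra gradients on the training model are required.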

Paper Structure (49 sections, 1 theorem, 26 equations, 12 figures, 10 tables, 1 algorithm)

This paper contains 49 sections, 1 theorem, 26 equations, 12 figures, 10 tables, 1 algorithm.

Key Result

Theorem 3.1

For any two samples $i$ and $j$, if then $\tilde{s}_i(t)\le \tilde{s}_j(t)$; i.e., sufficiently over-sampled instances cannot dominate the ranking indefinitely.
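
The rotation effect can be illustrated with a toy simulation. The penalized score used here, $\tilde{s}_i(t) = s_i - \lambda c_i(t)$ with $c_i(t)$ the number of times sample $i$ has been selected so far, is an assumed linear form chosen for illustration only; it is not taken from the paper, and the function name and parameters are hypothetical.

```python
def simulate_rotation(base_scores, lam, steps):
    """Select the top penalized sample at each step; return selection counts."""
    counts = [0] * len(base_scores)
    for _ in range(steps):
        # Assumed penalty: linear in the number of past selections.
        penalized = [s - lam * c for s, c in zip(base_scores, counts)]
        winner = max(range(len(base_scores)), key=lambda i: penalized[i])
        counts[winner] += 1
    return counts

base = [1.0, 0.75, 0.5]
print(simulate_rotation(base, lam=0.0, steps=12))   # no penalty: [12, 0, 0]
print(simulate_rotation(base, lam=0.25, steps=12))  # with penalty: [5, 4, 3]
```

Without the penalty the highest-scoring sample wins every step. With it, each selection lowers that sample's penalized score until others overtake it, so selections rotate across all three samples.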

Figures (12)

  • Figure 1: Conceptual comparison with prior data selection methods. Previous methods relying on geometry-based metrics may overemphasize local centrality and miss implicit feature factors. Our framework measures representativeness as dataset-level coverage of common or high-frequency factors, and enforces process-level diversity by promoting sample rotation across epochs rather than optimizing a single static subset.
  • Figure 2: Our framework for dynamic data selection. A plug-in feature encoder (CLIP by default) maps inputs to a fixed feature space, where an SAE yields sparse unit activations and dataset-wide statistics. At epoch $t$, each example is scored by representativeness (weighted coverage of high-frequency factors), diversity (factor rarity), and a usage-frequency penalty that discourages repeated selection and enforces rotation. A scheduler $\alpha(t)$ smoothly balances representativeness and diversity, transitioning from core-pattern consolidation to rare-feature exploration.
  • Figure 3: MMD comparison between subsets selected by our representativeness score (Rep-TopK) and the geometric baseline K-Center under different selection ratios. Lower is better.
  • Figure 4: Convergence speed comparison on CIFAR-10 (ResNet-18). Ours denotes our method, RCAP denotes the previous SOTA method, and Full uses the full dataset.
  • Figure 5: Comparison of our method with various data selection baselines on different datasets and models: (a) VGG-16 on Tiny-ImageNet; (b) ViT-small on CIFAR-10.
  • ...and 7 more figures
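
Figure 3 reports MMD (maximum mean discrepancy) between selected subsets and a reference set. As a rough sketch of how such a comparison can be computed, the snippet below gives the standard biased estimate of squared MMD with an RBF kernel. The kernel choice, the bandwidth `gamma`, and the toy one-dimensional data are assumptions; the feature space and kernel used for the figure are not specified in this summary.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y)."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two point sets."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))

full = [[0.0], [1.0], [2.0], [3.0]]
spread = [[0.0], [3.0]]   # subset spanning the full range
clumped = [[0.0], [1.0]]  # subset concentrated at one end
print(mmd2(full, spread), mmd2(full, clumped))
```

On this toy data the subset that spans the range has the lower MMD to the full set, matching the "lower is better" reading of the figure.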

Theorems & Definitions (2)

  • Theorem 3.1: Sample Rotation / Anti-Monopoly
  • proof