Diffusion Reconstruction-based Data Likelihood Estimation for Core-Set Selection

Mingyang Chen, Jiawei Du, Bo Huang, Yi Wang, Xiaobo Zhang, Wei Wang

TL;DR

This paper addresses the reliance of core-set selection on heuristic scoring signals by introducing a likelihood-informed scoring criterion, DRD, based on diffusion model reconstruction deviation. It ties reconstruction error under partial reverse denoising to data likelihood through the diffusion ELBO, and selects reconstruction timesteps via an information-theoretic objective evaluated with a diffusion classifier, yielding a principled, distribution-aware data selection mechanism. On ImageNet, DRD consistently outperforms baselines across selection budgets and closely matches full-data performance with only 50% of the data, while further analyses show how data distribution interacts with model learning. The approach offers practical scalability, interpretability, and a new perspective on curriculum-like data selection grounded in probabilistic data modeling.

Abstract

Existing core-set selection methods predominantly rely on heuristic scoring signals such as training dynamics or model uncertainty, lacking explicit modeling of data likelihood. This omission may hinder the constructed subset from capturing subtle yet critical distributional structures that underpin effective model training. In this work, we propose a novel, theoretically grounded approach that leverages diffusion models to estimate data likelihood via reconstruction deviation induced by partial reverse denoising. Specifically, we establish a formal connection between reconstruction error and data likelihood, grounded in the Evidence Lower Bound (ELBO) of Markovian diffusion processes, thereby enabling a principled, distribution-aware scoring criterion for data selection. Complementarily, we introduce an efficient information-theoretic method to identify the optimal reconstruction timestep, ensuring that the deviation provides a reliable signal indicative of underlying data likelihood. Extensive experiments on ImageNet demonstrate that reconstruction deviation offers an effective scoring criterion, consistently outperforming existing baselines across selection ratios, and closely matching full-data training using only 50% of the data. Further analysis shows that the likelihood-informed nature of our score reveals informative insights in data selection, shedding light on the interplay between data distributional characteristics and model learning preferences.
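To make the scoring idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy setting in which the data follow an isotropic Gaussian, so the posterior mean $\mathbb{E}[x_0 \mid x_t]$ is available in closed form and stands in for a trained diffusion model. It also replaces multi-step partial reverse denoising with a single-shot estimate. The names `gaussian_denoiser` and `reconstruction_deviation`, and all parameter values, are hypothetical.

```python
import numpy as np

def gaussian_denoiser(mu, s2, alpha_bar_t):
    """Closed-form E[x0 | x_t] for toy data x0 ~ N(mu, s2 * I).

    Stand-in (assumption) for a trained diffusion model.
    """
    gain = np.sqrt(alpha_bar_t) * s2 / (alpha_bar_t * s2 + 1.0 - alpha_bar_t)

    def denoise(x_t):
        return mu + gain * (x_t - np.sqrt(alpha_bar_t) * mu)

    return denoise

def reconstruction_deviation(x0, alpha_bar_t, denoise, rng, n_draws=32):
    """Monte Carlo estimate of E ||x0 - x0'||^2.

    x0 is noised to level alpha_bar_t, reconstructed with `denoise`,
    and the squared deviation is averaged over n_draws noise samples.
    """
    total = 0.0
    for _ in range(n_draws):
        eps = rng.standard_normal(x0.shape)
        x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
        total += float(np.sum((x0 - denoise(x_t)) ** 2))
    return total / n_draws

rng = np.random.default_rng(0)
mu = np.zeros(16)
denoise = gaussian_denoiser(mu, s2=1.0, alpha_bar_t=0.5)

typical = reconstruction_deviation(mu.copy(), 0.5, denoise, rng)  # at the mode
outlier = reconstruction_deviation(mu + 5.0, 0.5, denoise, rng)   # far from it
print(typical, outlier)
```

In this toy setting a sample at the mode of the distribution receives a lower deviation than a sample far from it, which mirrors the relationship between low reconstruction deviation and high data likelihood described in the abstract.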

Paper Structure

This paper contains 22 sections, 2 theorems, 11 equations, 6 figures, 2 tables.

Key Result

Theorem 1

Let $x_0 \in \mathbb{R}^d$, $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$, and $x_{0:t}'$ be the reconstructed data obtained by DDPM denoising from $x_t$, the expected squared reconstruction deviation satisfies: where $\kappa(t) = \frac{1}{t} \sum_{s=1}^t \frac{1}{\sigma_s^2}$, and $\mathcal{C}_{\text{noise}}(t)$ is a constant indep
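The forward noising step and the factor $\kappa(t)$ from Theorem 1 can be evaluated numerically. The sketch below is illustrative only: it assumes the linear $\beta$ schedule commonly used with DDPM ($T = 1000$, $\beta$ from $10^{-4}$ to $0.02$) and the variance choice $\sigma_s^2 = \beta_s$, neither of which is stated in the text above. The helper names `forward_noise` and `kappa` are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)   # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t for t = 1..T
sigma2 = betas                       # assumption: sigma_s^2 = beta_s

def forward_noise(x0, t, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    if not 1 <= t <= T:
        raise ValueError("t must lie in [1, T]")
    a = alpha_bar[t - 1]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def kappa(t):
    """kappa(t) = (1 / t) * sum_{s=1}^{t} 1 / sigma_s^2."""
    if not 1 <= t <= T:
        raise ValueError("t must lie in [1, T]")
    return float(np.mean(1.0 / sigma2[:t]))

rng = np.random.default_rng(0)
x_t = forward_noise(np.ones(8), t=300, rng=rng)
print(x_t.shape, kappa(1), kappa(300))
```

Under these assumed variances $\sigma_s^2$ grows with $s$, so $\kappa(t)$ shrinks as the reconstruction timestep $t$ increases.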

Figures (6)

  • Figure 1: Intuitive illustration of our likelihood-informed scoring and optimal reconstruction timestep selection. Left: The deviation between a real data point $x_0$ and its reconstruction $x'_{0:t}$ serves as a likelihood-sensitive signal, with lower deviation indicating higher estimated data likelihood. The visualized examples reveal a semantic pattern: high-likelihood samples typically feature class-relevant objects that are spatially prominent and well-formed; moderate-likelihood samples often contain target objects that are less visually salient, e.g., occupying smaller regions or blended with irrelevant elements; and low-likelihood samples exhibit apparent out-of-distribution characteristics, leading to significant semantic shifts after reconstruction. Right: We select the optimal reconstruction timestep ($0<t<T$) by maximizing the drop rate $\left| \partial \mathcal{I}(x_t; c)/\partial t \right|$. Following Lemma 1, we equivalently maximize it using the time derivative of $\log p_\theta(c \mid x_t)$ predicted by a diffusion classifier DBLP:conf/iccv/LiPDBP23. The search is constrained to $\text{SNR}(t) \in [\gamma_{\min}, \gamma_{\max}]$ to avoid degenerate regions of timesteps.
  • Figure 2: t-SNE visualization of stratified samples from the ImageWoof dataset. Samples are grouped by ascending score ranges for Forgetting Score DBLP:conf/iclr/TonevaSCTBG19, EL2N DBLP:conf/nips/PaulGD21, and our proposed Reconstruction Deviation, with Random representing equally sized random groups for reference. The distributions show that Forgetting Score and EL2N produce stratifications that resemble random sampling, whereas Reconstruction Deviation yields more distinct and semantically coherent groupings.
  • Figure 3: Cross-evaluation of ResNet-18 models trained on deviation-stratified subsets. Each model is trained on one subset and tested across all five. Lower indices (e.g., $0$–$20\%$) indicate higher estimated likelihoods.
  • Figure 5: Comparison of different reconstruction timesteps. Test results of fixed timesteps and timesteps selected by our IB-informed method on ImageNette and ImageWoof. While grid search requires full reconstruction and evaluation over the entire dataset, our method identifies effective, class-wise timesteps using lightweight Monte Carlo estimates, enabling timestep selection within minutes.
  • Figure 6: Comparison of selected window start points based on DRD score. Test accuracy on ImageNette and ImageWoof when sliding a fixed-width selection window across the DRD-sorted score list. High-performing subsets consistently arise from windows starting between 20% and 40% of the ranked list, reflecting a moderate-likelihood preference in model learning.
  • ...and 1 more figure
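Two procedures described in the captions above, the SNR-constrained timestep search (Figure 1) and the sliding selection window over the DRD-sorted list (Figure 6), can be sketched as follows. This is a hedged illustration and not the authors' code. It assumes that per-timestep estimates of $\log p_\theta(c \mid x_t)$ have already been computed (for example by Monte Carlo with a diffusion classifier), that the derivative is approximated by finite differences, and that the list is sorted by ascending deviation. The function names and arguments are hypothetical.

```python
import numpy as np

def select_timestep(log_p_c, snr, snr_min, snr_max):
    """Return the timestep (1-indexed) with the steepest change in log p(c | x_t).

    log_p_c[i] and snr[i] are the estimate and the SNR at timestep i + 1.
    Only timesteps with snr_min <= SNR(t) <= snr_max are considered.
    """
    log_p_c = np.asarray(log_p_c, dtype=float)
    snr = np.asarray(snr, dtype=float)
    drop_rate = np.abs(np.gradient(log_p_c))  # finite-difference |d/dt|
    valid = (snr >= snr_min) & (snr <= snr_max)
    if not valid.any():
        raise ValueError("no timestep falls inside the SNR range")
    drop_rate = np.where(valid, drop_rate, -np.inf)
    return int(np.argmax(drop_rate)) + 1

def window_select(scores, start_frac, budget_frac):
    """Indices of a fixed-width window over samples sorted by ascending score.

    start_frac is where the window starts in the ranked list (e.g. 0.2),
    budget_frac is the fraction of the dataset to keep.
    """
    scores = np.asarray(scores)
    order = np.argsort(scores)  # ascending deviation (assumed ordering)
    n = len(scores)
    start = int(round(start_frac * n))
    width = int(round(budget_frac * n))
    if start + width > n:
        raise ValueError("window exceeds the ranked list")
    return order[start:start + width]
```

A call such as `window_select(scores, 0.2, 0.5)` would keep half of the data starting at the 20% mark of the ranked list, the region that Figure 6 reports as consistently high-performing.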

Theorems & Definitions (2)

  • Theorem 1: Inverse Dependence of Reconstruction Deviation on Log-Likelihood
  • Lemma 1