GMValuator: Similarity-based Data Valuation for Generative Models

Jiaxi Yang, Wenlong Deng, Benlin Liu, Yangsibo Huang, James Zou, Xiaoxiao Li

TL;DR

GMValuator addresses the challenge of valuing training data for generative models by reframing data contribution as a similarity-matching problem between generated samples $\hat{X}$ and training data $X$. It introduces a three-module, training-free framework: Efficient Similarity Matching identifies the top-$k$ contributors $\mathcal{P}_j$ of each generated sample $\hat{x}_j$, Image Quality Assessment calibrates contributions with a quality score $q_j$, and Value Calculation defines $\mathcal{V}(x_i, \hat{x}_j, d_{ij}, q_j) = q_j \cdot \frac{\exp(-d_{ij})}{\sum_{i'\in \mathcal{P}_j} \exp(-d_{i'j})}$ and $\phi_i = \sum_{j=1}^m \mathcal{V}(x_i, \hat{x}_j, d_{ij}, q_j)$, where $d_{ij}$ is the distance between $x_i$ and $\hat{x}_j$, $m$ is the number of generated samples, and $x_i$ contributes to $\hat{x}_j$ only when it ranks among that sample's top-$k$ contributors. The authors report strong truthfulness and efficiency under four evaluation criteria (C1–C4) across diverse datasets and generative architectures, outperforming baselines and remaining robust on large-scale data. This approach enables model-agnostic data governance for generative AI and has practical implications for privacy, data stewardship, and responsible deployment. The combination of PQ-based recall, perceptual reranking, and quality calibration yields a scalable, training-free mechanism to attribute value to individual training samples.
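
The value calculation above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `data_values` and its array layout (one row of top-$k$ indices and distances per generated sample) are hypothetical, and the matches and quality scores are assumed to have been computed beforehand.

```python
import numpy as np


def data_values(neighbors, distances, quality, n_train):
    """Accumulate phi_i from precomputed top-k matches (hypothetical helper).

    neighbors: (m, k) int array, indices of the top-k training samples P_j
    distances: (m, k) float array, d_ij for each matched pair
    quality:   (m,)   float array, q_j for each generated sample
    n_train:   number of training samples
    """
    neighbors = np.asarray(neighbors)
    distances = np.asarray(distances, dtype=float)
    quality = np.asarray(quality, dtype=float)

    # exp(-d_ij) normalized over P_j. Subtracting the row minimum leaves the
    # ratio unchanged and avoids underflow when all distances are large.
    shifted = distances - distances.min(axis=1, keepdims=True)
    weights = np.exp(-shifted)
    weights /= weights.sum(axis=1, keepdims=True)

    # V(x_i, x_hat_j, d_ij, q_j) = q_j * normalized weight.
    contributions = quality[:, None] * weights

    # phi_i sums the contributions over every generated sample that lists
    # x_i among its top-k contributors; unmatched samples keep value 0.
    phi = np.zeros(n_train)
    np.add.at(phi, neighbors, contributions)
    return phi


# Two generated samples, k = 2, four training samples.
phi = data_values(
    neighbors=[[0, 1], [1, 2]],
    distances=[[0.0, np.log(3.0)], [0.0, 0.0]],
    quality=[0.8, 1.0],
    n_train=4,
)
print(phi)
```

Because the weights of each generated sample sum to one, every $\hat{x}_j$ distributes exactly $q_j$ among its contributors, so the values $\phi_i$ sum to $\sum_j q_j$.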

Abstract

Data valuation plays a crucial role in machine learning. Existing data valuation methods focus mainly on discriminative models and overlook generative models, which have recently gained attention. In generative models, data valuation measures the impact of training data on generated datasets. The few existing data valuation methods designed for deep generative models either concentrate on specific models or lack robustness in their outcomes. Moreover, their efficiency remains a notable shortcoming. To bridge these gaps, we formulate the data valuation problem in generative models from a similarity-matching perspective. Specifically, we introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to data valuation for image generation tasks. It enables efficient data valuation through a similarity matching module, calibrates biased contributions by incorporating image quality assessment, and attributes credit to all training samples based on their contributions to the generated samples. Additionally, we introduce four evaluation criteria for assessing data valuation methods in generative models. GMValuator is extensively evaluated on benchmark and high-resolution datasets and on various mainstream generative architectures to demonstrate its effectiveness.
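
To make the similarity-matching view concrete, the sketch below finds the top-$k$ nearest training samples for each generated sample by exhaustive Euclidean search. It is a deliberately simplified stand-in: the summary describes the actual matching module as PQ-based recall followed by perceptual reranking, neither of which is reproduced here. The function name `top_k_contributors` and the assumption that samples are already embedded as feature vectors are hypothetical.

```python
import numpy as np


def top_k_contributors(train_feats, gen_feats, k):
    """Brute-force top-k matching in an assumed feature space.

    train_feats: (n, d) array of training-sample features
    gen_feats:   (m, d) array of generated-sample features
    Returns (indices, distances), both of shape (m, k), sorted by distance.
    """
    train_feats = np.asarray(train_feats, dtype=float)
    gen_feats = np.asarray(gen_feats, dtype=float)

    # Pairwise Euclidean distances between generated and training samples,
    # shape (m, n). This costs O(m * n * d) and is only meant for small data.
    diff = gen_feats[:, None, :] - train_feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Keep the k closest training samples for each generated sample.
    idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)


train = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
generated = [[0.9, 0.0]]
indices, dists = top_k_contributors(train, generated, k=2)
print(indices, dists)
```

The exhaustive search is what an approximate index is meant to replace at scale; the returned indices and distances play the roles of the contributor set and $d_{ij}$ in the value calculation.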

Paper Structure (43 sections, 2 theorems, 13 equations, 12 figures, 15 tables, 1 algorithm)

Key Result

Theorem 2.4

(Bounded Attributes Classification Error on $S^{*}$ to $T$.) Let $f'_{S^{*}}: \mu \rightarrow \mathcal{A}=\{0, 1\}^V$ be the model trained on the optimal contributor dataset $S^{*}$. Following Assumption assump_1, if the contributors correspond to the given generated data $\hat{X}$, we have …

Figures (12)

  • Figure 1: Data distributions of $X_{v1}$, $X_{v2}$, and $\hat{X}$ for CIFAR-10; $X_{v1}$ and $X_{v2}$ are both airplane datasets.
  • Figure 2: Overview of GMValuator, a unified and training-free data valuation approach for any generative model. GMValuator contains three modules: (1) Efficient Similarity Matching (ESM), (2) Image Quality Assessment, and (3) Value Calculation. Each generated sample $\hat{x}_j$ is matched with the training data through ESM, yielding the distances to its top $k$ contributors. The normalized contribution score from training sample $x_i$ to $\hat{x}_j$, defined as $\exp(-d_{ij})/\sum_{i'=1}^{k} \exp(-d_{i'j})$, is adjusted by the quality $q_{j}$ of the associated generated sample. The data value $\phi_i$ of each training sample $x_{i}$ is the sum of its contributions to the generated samples for which it ranks among the top $k$ contributors.
  • Figure 3: The value without generated image quality calibration for a high-quality image (top row) and a low-quality image (bottom row). Column 1: generated images. Columns 2-5: their top 4 contributors.
  • Figure 4: Visualization of Identical Attributes Test on CelebA. Left: generated samples. Right: top $k$ contributors.
  • Figure 5: The y-axis shows the value ranking from high to low, with the highest value at the top and the lowest at the bottom. The x-axis shows the index of each noisy data point.
  • ...and 7 more figures
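
As a small worked instance of the normalized contribution score described in the Figure 2 caption, consider one generated sample $\hat{x}_j$ with $k = 2$ contributors. The numbers are illustrative assumptions chosen for easy arithmetic, not values from the paper: $d_{1j} = 0$, $d_{2j} = \ln 3$, and $q_j = 0.8$.

$$
\exp(-d_{1j}) = 1, \qquad \exp(-d_{2j}) = \tfrac{1}{3}, \qquad \sum_{i'=1}^{2} \exp(-d_{i'j}) = \tfrac{4}{3}
$$

$$
\mathcal{V}(x_1, \hat{x}_j, d_{1j}, q_j) = 0.8 \cdot \frac{1}{4/3} = 0.6, \qquad \mathcal{V}(x_2, \hat{x}_j, d_{2j}, q_j) = 0.8 \cdot \frac{1/3}{4/3} = 0.2
$$

The two contributions sum to $q_j = 0.8$, so a lower-quality generated sample hands out less total credit, and the closer training sample receives the larger share.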

Theorems & Definitions (7)

  • Definition 2.1
  • Definition 2.2
  • Theorem 2.4
  • Definition 2.5
  • Definition 2.6
  • Theorem M.2
  • Proof M.3