Biased Generalization in Diffusion Models

Jerome Garnier-Brun, Luca Biggio, Davide Beltrame, Marc Mézard, Luca Saglietti

TL;DR

This work identifies a phase of biased generalization during training, in which the model continues to decrease the test loss while favoring samples with anomalously high proximity to training data.

Abstract

Generalization in generative modeling is defined as the ability to learn an underlying distribution from a finite dataset and produce novel samples, with evaluation largely driven by held-out performance and perceived sample quality. In practice, training is often stopped at the minimum of the test loss, taken as an operational indicator of generalization. We challenge this viewpoint by identifying a phase of biased generalization during training, in which the model continues to decrease the test loss while favoring samples with anomalously high proximity to training data. By training the same network on two disjoint datasets and comparing the mutual distances of generated samples and their similarity to training data, we introduce a quantitative measure of bias and demonstrate its presence on real images. We then study the mechanism of bias, using a controlled hierarchical data model where access to exact scores and ground-truth statistics allows us to precisely characterize its onset. We attribute this phenomenon to the sequential nature of feature learning in deep networks, where coarse structure is learned early in a data-independent manner, while finer features are resolved later in a way that increasingly depends on individual training samples. Our results show that early stopping at the test loss minimum, while optimal under standard generalization criteria, may be insufficient for privacy-critical applications.
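The sample-split comparison described above can be illustrated with a minimal sketch. It assumes that two models, trained on disjoint splits A and B, have each generated `n` samples, and that row `i` of both arrays was produced from the same initial noise; this pairing is an assumption made here for illustration. The function name `mean_cosine_distance` is hypothetical and is not taken from the paper's code.

```python
import numpy as np

def mean_cosine_distance(samples_a, samples_b):
    """Mean cosine distance between paired samples from two models.

    samples_a, samples_b: arrays of shape (n, ...) holding samples generated
    by models trained on disjoint splits A and B. Row i of each array is
    assumed (for this sketch) to come from the same initial noise.
    """
    a = np.asarray(samples_a, dtype=float)
    b = np.asarray(samples_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both sample arrays must have the same shape")
    # Flatten each sample (e.g. an image) into a vector.
    a = a.reshape(a.shape[0], -1)
    b = b.reshape(b.shape[0], -1)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Guard against division by zero for all-zero samples.
    cosine_similarity = dots / np.maximum(norms, 1e-12)
    return float(np.mean(1.0 - cosine_similarity))
```

Tracking this quantity across checkpoints, alongside the test loss, would show whether the two models start to diverge from each other before the test loss reaches its minimum, which is the signature of biased generalization discussed in the abstract.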

Paper Structure (54 sections, 14 equations, 11 figures)

Figures (11)

  • Figure 1: Biased generalization emerges before overfitting across models and settings. (a) Sample-split analysis on CelebA: we compare two denoising diffusion models trained on non-overlapping data slices. Left: cosine distance between generated samples (left axis) and denoising score matching (DSM) test loss (right axis) during training, shown as means over 15 models with standard errors. Generated samples become maximally similar before the test loss reaches its minimum, indicating the onset of biased generalization while the test loss is still decreasing. Colored stars mark epochs selected for visualization. Right: samples generated at the starred epochs by models trained on each of the data splits A/B (central columns), with nearest neighbors (NN) from each training split (side columns). Early in training, both models evolve similarly and sample quality improves; near the test-loss minimum, generated samples can differ substantially across splits and may get close to training examples, showing a bias without exact memorization. (b) Neural network trained on a controlled hierarchical dataset. Model-oracle divergence (left axis), representing the distance of the learned score to the exact score (red) and to four coarser versions of the latter that account for lower-level features (as indicated by arrows). Sample-split analysis (right axis, blue) measuring the distance between denoising scores of models trained on disjoint datasets. All scores are computed on test samples (with respect to the model's training data), noised up to a critical diffusion time $t/T = 0.15$. The biased generalization phase lies between the minimum of the sample-split curve (blue) and the minimum of the model-exact oracle divergence (red). It takes place when the models start resolving finer-scale features (light green). (c) Training-free score model on the same hierarchical data, parametrized by a sharpness parameter $\varepsilon$ controlling the concentration of probability mass around the training data. Top: divergence between the generated-data and ground-truth distributions of distances to the nearest neighbor (NN) in the training set (left axis) and DSM test loss (right axis) as a function of the sharpness parameter $\varepsilon$, showing sizeable bias at the test-loss minimum. Stars indicate selected values of $\varepsilon$. Bottom: in-training NN overlap (1 - normalized distance) distributions at selected sharpness levels, comparing model samples (solid black) to the ground truth (dashed gray).
  • Figure 2: Cosine distance between the predictions of two networks trained on disjoint subsets of CelebA of size $n = 1024$, evaluated on inputs noised until time $t$ out of $T = 1000$. The distance is evaluated for original images that are either outside of both training sets ("Test") or in one of them ("Train"). The vertical dashed lines correspond to the minima of the diffusion time-averaged metrics shown in Fig. 1(a).
  • Figure 3: (a) Illustration of an $\ell = 3$ tree-based data generation for the full model (top) and a hierarchically filtered version with $k=2$ (bottom). (b) Kullback-Leibler divergence between the exact posterior mean of BP$_0$ and those from filtered BP$_k$ denoisers along a reverse trajectory following the exact BP, averaged over $2\mathrm{k}$ realizations. Dashed lines show trained models at the minimum of their test losses for different training set size $n$.
  • Figure 4: Bias metrics for a diffusion model trained on $n = 12\mathrm{k}$ tree-based sequences, computed with $50\mathrm{k}$ generated samples and evaluation points and averaged over 15 training runs, with standard errors shown. (a) Nearest-neighbor divergence (see the section on bias metrics) of generated sequences and denoising test loss over the course of training. (b) Expectation value of the excess data-dependent loss of Eq. (4), evaluated on test and train sequences noised up to time $t = 150$. Vertical dashed lines show the minimum of the bias metric (blue) and of the test loss (red).
  • Figure 5: Average normalized overlap to the starting sequence for a "U-turn" experiment after noising to time $t$, using $100$ trajectories for each of $1000$ starting points, for a model trained on $n = 12\mathrm{k}$ data. Left: checkpoint minimizing the Kullback-Leibler divergence between the nearest-neighbor overlaps and the result expected from fair sampling. Right: checkpoint minimizing the denoising test loss. Shaded areas show the standard error over starting sequences.
  • ...and 6 more figures
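
The nearest-neighbor bias metric referred to in the captions of Figures 1(c), 4(a) and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: sequences are taken to be equal-length arrays of discrete symbols, the normalized distance is taken to be the Hamming distance divided by the sequence length, the divergence is a Kullback-Leibler divergence between histograms, and the direction KL(generated || reference) is chosen here for concreteness. The names `nn_overlap` and `overlap_kl` are hypothetical.

```python
import numpy as np

def nn_overlap(queries, train):
    """Overlap of each query sequence with its nearest training sequence.

    Overlap is 1 - normalized Hamming distance, i.e. the fraction of
    positions at which the two sequences agree (an assumption of this sketch).
    """
    q = np.asarray(queries)
    x = np.asarray(train)
    # Fraction of matching positions for every (query, train) pair.
    match = (q[:, None, :] == x[None, :, :]).mean(axis=-1)
    # The nearest neighbor is the training sequence with maximal overlap.
    return match.max(axis=1)

def overlap_kl(generated_overlaps, reference_overlaps, bins=20, eps=1e-9):
    """KL(generated || reference) between histograms of overlaps in [0, 1]."""
    p, _ = np.histogram(generated_overlaps, bins=bins, range=(0.0, 1.0))
    r, _ = np.histogram(reference_overlaps, bins=bins, range=(0.0, 1.0))
    # Smooth empty bins so the logarithm stays finite, then normalize.
    p = (p + eps) / np.sum(p + eps)
    r = (r + eps) / np.sum(r + eps)
    return float(np.sum(p * np.log(p / r)))
```

In this sketch, `reference_overlaps` would be computed from samples drawn fairly from the ground-truth distribution, which is available in the controlled hierarchical setting. A value near zero means that generated samples are no closer to the training set than fair samples would be; a growing value signals bias toward the training data.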