Variance reduction of diffusion model's gradients with Taylor approximation-based control variate

Paul Jeha; Will Grathwohl; Michael Riis Andersen; Carl Henrik Ek; Jes Frellsen

TL;DR

This paper tackles the high variance in the denoising score matching objective used to train score-based diffusion models. It introduces a family of Taylor expansion-based control variates of order $k$ that can be applied to both the training objective and its gradients, and proves an equivalence between controlling the objective and controlling the gradients. Empirically, the CVs reduce variance on a low-dimensional toy task and illuminate how factors like the order $k$, network irregularity, and optimizer choice impact effectiveness; results on MNIST show limited gains in complex models, suggesting variance may not always be harmful. The work highlights the necessity of gradient-focused variance control and opens avenues to understand the relationship between $k$ and variance reduction across architectures and noise regimes.

Abstract

Score-based models, trained with denoising score matching, are remarkably effective at generating high-dimensional data. However, the high variance of their training objective hinders optimisation. We attempt to reduce it with a control variate derived via a $k$-th order Taylor expansion of the training objective and of its gradient. We prove an equivalence between the two, demonstrate empirically the effectiveness of our approach in a low-dimensional problem setting, and study its effect on larger problems.
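To make the idea of a Taylor-based control variate concrete, the following is a minimal one-dimensional sketch and not the paper's implementation. It assumes a hypothetical smooth score function `score(x) = -x` (the exact score of a standard normal), a single data point `x`, and a zeroth-order expansion $s(x + \sigma\epsilon) \approx s(x)$. Under that expansion the surrogate loss $(s(x) + \epsilon/\sigma)^2$ has the closed-form mean $s(x)^2 + 1/\sigma^2$, so subtracting that mean gives a zero-mean control variate. All function names and parameter values are illustrative assumptions.

```python
import random
import statistics


def score(x):
    # Hypothetical smooth "score network": the exact score of N(0, 1).
    return -x


def dsm_loss(x, eps, sigma):
    # Single-sample denoising score matching loss in one dimension.
    return (score(x + sigma * eps) + eps / sigma) ** 2


def control_variate(x, eps, sigma):
    # Zeroth-order Taylor surrogate, s(x + sigma * eps) ~ s(x),
    # minus its closed-form expectation over eps ~ N(0, 1).
    surrogate = (score(x) + eps / sigma) ** 2
    expected = score(x) ** 2 + 1.0 / sigma ** 2
    return surrogate - expected


def compare(x=0.5, sigma=0.1, n=20000, seed=0):
    # Returns (variance of plain loss, variance of controlled loss, beta).
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    f = [dsm_loss(x, e, sigma) for e in eps]
    c = [control_variate(x, e, sigma) for e in eps]
    # Regression coefficient beta = Cov(f, c) / Var(c), estimated in-sample.
    mean_f, mean_c = statistics.fmean(f), statistics.fmean(c)
    cov = sum((a - mean_f) * (b - mean_c) for a, b in zip(f, c)) / (n - 1)
    beta = cov / statistics.variance(c)
    controlled = [a - beta * b for a, b in zip(f, c)]
    return statistics.variance(f), statistics.variance(controlled), beta


if __name__ == "__main__":
    var_plain, var_cv, beta = compare()
    print(f"plain variance:      {var_plain:.4g}")
    print(f"controlled variance: {var_cv:.4g}")
    print(f"regression coefficient beta: {beta:.3f}")
```

In this toy setting the surrogate tracks the true loss closely because the assumed score is linear and $\sigma$ is small, so the controlled estimator has much lower variance. For a less regular score function or a larger $\sigma$ the correlation, and hence the reduction, would be expected to weaken.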

Paper Structure (37 sections, 4 theorems, 34 equations, 6 figures, 3 tables)

Introduction
Related work
Score matching
Control variate
Theory
Denoising score matching
Control variate
Taylor series
Remarks
A control variate on the training objective
A control variate on the gradients
Controlling the training objective is equivalent to controlling its gradient
A control variate for large values of $\sigma$
Experiments
Control variate on a toy dataset
...and 22 more sections

Key Result

Theorem 3.1

Let $U$ be an open subset of $\mathbb{R}^d$ and $s \in C^{l}(U, \mathbb{R}^d)$ be an $l$-times continuously differentiable mapping from $U$ to $\mathbb{R}^d$. For $k \leq l$ and a point $\mathbf{a} \in U$, we define the Taylor polynomial $T_{s, \mathbf{a}}^k$, using a multi-index notation, such that: $T_{s, \mathbf{a}}^k(x) = \sum_{|\alpha| \leq k} \frac{\partial^{\alpha} s(\mathbf{a})}{\alpha!} (x - \mathbf{a})^{\alpha}$. Then the mapping $(\mathbf{a}, x) \mapsto R_{s, \mathbf{a}}^k = s - T_{s, \mathbf{a}}^k$ is $l-k$ diff
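As a small numerical illustration of the Taylor polynomial and its remainder, the sketch below works in one dimension ($d = 1$) with the assumed example function $s = \sin$, whose derivatives are known in closed form. It is only meant to show how the remainder $R^k = s - T^k$ shrinks as the order $k$ grows; the helper names are hypothetical and nothing here is taken from the paper's code.

```python
import math


def sin_derivative(n, a):
    # The n-th derivative of sin at a equals sin(a + n * pi / 2).
    return math.sin(a + n * math.pi / 2)


def taylor_polynomial(a, x, k):
    # Order-k Taylor polynomial of sin around a, evaluated at x.
    return sum(
        sin_derivative(n, a) * (x - a) ** n / math.factorial(n)
        for n in range(k + 1)
    )


def remainder(a, x, k):
    # Remainder R^k = s - T^k for s = sin.
    return math.sin(x) - taylor_polynomial(a, x, k)


if __name__ == "__main__":
    a, x = 0.3, 0.4
    for k in range(4):
        print(f"k={k}: |remainder| = {abs(remainder(a, x, k)):.3e}")
```

Because every derivative of $\sin$ is bounded by 1, the Lagrange form of the remainder gives $|R^k| \leq |x - a|^{k+1} / (k+1)!$ in this example, which is the kind of bound that makes a higher-order expansion a closer surrogate near the expansion point.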

Figures (6)

Figure 1: Convergence with (right) and without control variate (left)
Figure 2: Variance reduction (right) and regression coefficient (left) for $C_{\textbf{g}, \boldsymbol{\theta}}^{0}$, $C_{\textbf{g}, \boldsymbol{\theta}}^{1}$ and $C_{\textbf{g}, \boldsymbol{\theta}}^{2}$
Figure 3: Variance reduction (right) and training loss (left) on MNIST
Figure 4: Variance reduction on toy dataset comparing Adam and SGD
Figure 5: Average variance reduction for various MLP configurations (lower is better). In each box, the value is the variance reduction and the number of parameters is given in parentheses.
...and 1 more figures

Theorems & Definitions (4)

Theorem 3.1
Lemma 3.2
Theorem 3.3
Theorem 3.4
