Variance reduction of diffusion model's gradients with Taylor approximation-based control variate

Paul Jeha, Will Grathwohl, Michael Riis Andersen, Carl Henrik Ek, Jes Frellsen

TL;DR

This paper tackles the high variance in the denoising score matching objective used to train score-based diffusion models. It introduces a family of Taylor expansion-based control variates of order $k$ that can be applied to both the training objective and its gradients, and proves an equivalence between controlling the objective and controlling the gradients. Empirically, the CVs reduce variance on a low-dimensional toy task and illuminate how factors like the order $k$, network irregularity, and optimizer choice impact effectiveness; results on MNIST show limited gains in complex models, suggesting variance may not always be harmful. The work highlights the necessity of gradient-focused variance control and opens avenues to understand the relationship between $k$ and variance reduction across architectures and noise regimes.

Abstract

Score-based models, trained with denoising score matching, are remarkably effective at generating high-dimensional data. However, the high variance of their training objective hinders optimisation. We attempt to reduce it with a control variate, derived via a $k$-th order Taylor expansion on the training objective and its gradient. We prove an equivalence between the two, demonstrate empirically the effectiveness of our approach in a low-dimensional problem setting, and study its effect on larger problems.
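
To make the idea of a Taylor-based control variate concrete, here is a minimal generic sketch in plain Python. It is not the paper's denoising score matching setup: the integrand `f`, the expansion point `a`, the noise scale `sigma`, and the helper `sample_estimates` are all hypothetical names chosen for illustration. The sketch estimates $E[f(a + \sigma\epsilon)]$ for $\epsilon \sim \mathcal{N}(0, 1)$ and subtracts the zero-mean part of a first-order Taylor expansion of `f` around `a`.

```python
import math
import random
import statistics


def f(x):
    # Hypothetical smooth integrand standing in for a network-dependent term.
    return math.sin(x)


def f_prime(x):
    # Analytic derivative of f, used to build the first-order Taylor term.
    return math.cos(x)


def sample_estimates(a=0.5, sigma=0.3, n=20000, seed=0):
    """Return per-sample values of the plain and control-variate estimators
    of E[f(a + sigma * eps)] with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    plain, controlled = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        value = f(a + sigma * eps)
        # First-order Taylor expansion around a:
        #   T(eps) = f(a) + f'(a) * sigma * eps,  with E[T] = f(a).
        # The control variate is T - E[T], which has zero mean by construction,
        # so subtracting it leaves the estimator unbiased.
        control = f_prime(a) * sigma * eps
        plain.append(value)
        controlled.append(value - control)
    return plain, controlled


if __name__ == "__main__":
    plain, controlled = sample_estimates()
    print("mean (plain):         ", statistics.fmean(plain))
    print("mean (controlled):    ", statistics.fmean(controlled))
    print("variance (plain):     ", statistics.variance(plain))
    print("variance (controlled):", statistics.variance(controlled))
```

Both estimators target the same expectation, but the controlled one only has to average the higher-order remainder of `f`, so its per-sample variance is much smaller when `sigma` is small. The coefficient on the control variate is fixed to 1 here for simplicity; a fitted regression coefficient, as reported in Figure 2, would be a natural refinement.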

Paper Structure (37 sections, 4 theorems, 34 equations, 6 figures, 3 tables)

Key Result

Theorem 3.1

Let $U$ be an open subset of $\mathbb{R}^d$ and let $s \in C^{l}(U, \mathbb{R}^d)$ be an $l$-times continuously differentiable mapping from $U$ to $\mathbb{R}^d$. For $k \leq l$ and a point $\mathbf{a} \in U$, we define the Taylor polynomial $T_{s, \mathbf{a}}^k$ using multi-index notation. Then the mapping $(\mathbf{a}, x) \mapsto R_{s, \mathbf{a}}^k = s - T_{s, \mathbf{a}}^k$ is $l-k$ diff
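
For reference, the standard multi-index form of the order-$k$ Taylor polynomial, which is presumably the definition intended in Theorem 3.1, is sketched below. The symbols $\alpha$ and $\partial^{\alpha}$ and the explicit argument $x$ are assumptions of this sketch and may differ from the paper's notation.

$$
T_{s, \mathbf{a}}^{k}(x) = \sum_{|\alpha| \leq k} \frac{\partial^{\alpha} s(\mathbf{a})}{\alpha!} \, (x - \mathbf{a})^{\alpha},
\qquad
R_{s, \mathbf{a}}^{k}(x) = s(x) - T_{s, \mathbf{a}}^{k}(x),
$$

where $\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}^d$ is a multi-index with $|\alpha| = \alpha_1 + \dots + \alpha_d$, $\alpha! = \alpha_1! \cdots \alpha_d!$, and $(x - \mathbf{a})^{\alpha} = \prod_{i=1}^{d} (x_i - a_i)^{\alpha_i}$. For $k = 0$ this reduces to $T_{s, \mathbf{a}}^{0}(x) = s(\mathbf{a})$, and for $k = 1$ it adds the Jacobian term $J_s(\mathbf{a})(x - \mathbf{a})$.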

Figures (6)

  • Figure 1: Convergence with (right) and without control variate (left)
  • Figure 2: Variance reduction (right) and regression coefficient (left) for $C_{\textbf{g}, \boldsymbol{\theta}}^{0}$, $C_{\textbf{g}, \boldsymbol{\theta}}^{1}$ and $C_{\textbf{g}, \boldsymbol{\theta}}^{2}$
  • Figure 3: Variance reduction (right) and training loss (left) on MNIST
  • Figure 4: Variance reduction on toy dataset comparing Adam and SGD
  • Figure 5: Average variance reduction for various MLP configurations (lower is better). Each box shows the variance reduction, with the number of parameters in parentheses.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 3.1
  • Lemma 3.2
  • Theorem 3.3
  • Theorem 3.4