Generalizing Stochastic Smoothing for Differentiation and Gradient Estimation

Felix Petersen, Christian Borgelt, Aashwin Mishra, Stefano Ermon

TL;DR

Problem: gradient estimation for stochastic relaxations of non-differentiable black-box functions. Approach: generalized stochastic smoothing that perturbs inputs with noise drawn from a density $\mu$ to form $f_\epsilon(x)=\mathbb{E}_{\epsilon\sim\mu}[f(x+\epsilon)]$, with unbiased gradient estimators such as $\nabla_{x} f_\epsilon(x)=\mathbb{E}_{\epsilon\sim\mu}[f(x+\epsilon)\nabla_{\epsilon}(-\log\mu(\epsilon))]$, extended to vector-valued outputs and anisotropic scale matrices $\mathbf{L}$, together with variance-reduction techniques. Key contributions include relaxing assumptions on $\mu$ (including non-differentiable and compact-support densities like Laplace and Triangular), a $k$-sample median extension, and a clear algorithm-vs-loss smoothing distinction, with broad empirical validation. Significance: enables differentiating a wide class of non-differentiable black-box components (sorting, shortest-paths, rendering, cryo-ET) with controllable variance, and guides practical choices of distributions and variance-reduction strategies for improved performance.
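
The gradient estimator quoted in the TL;DR can be motivated by a change of variables. The following is only a sketch of the standard argument, not a reproduction of the paper's proof. It assumes that $\mu$ is differentiable and strictly positive, and that differentiation and integration may be interchanged; the integration variable $u=x+\epsilon$ is introduced here for illustration.

```latex
\begin{align*}
f_\epsilon(x)
  &= \int f(x+\epsilon)\,\mu(\epsilon)\,d\epsilon
   = \int f(u)\,\mu(u-x)\,du \\
\nabla_x f_\epsilon(x)
  &= \int f(u)\,\nabla_x\,\mu(u-x)\,du
   = -\int f(x+\epsilon)\,\nabla_\epsilon\,\mu(\epsilon)\,d\epsilon \\
  &= \int f(x+\epsilon)\,\mu(\epsilon)\,\nabla_\epsilon\bigl(-\log\mu(\epsilon)\bigr)\,d\epsilon
   = \mathbb{E}_{\epsilon\sim\mu}\bigl[f(x+\epsilon)\,\nabla_\epsilon(-\log\mu(\epsilon))\bigr]
\end{align*}
```

The derivative never touches $f$ itself, which is why $f$ may be a non-differentiable black box: only the density $\mu$ is differentiated.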

Abstract

We deal with the problem of gradient estimation for stochastic differentiable relaxations of algorithms, operators, simulators, and other non-differentiable functions. Stochastic smoothing conventionally perturbs the input of a non-differentiable function with a differentiable density distribution with full support, smoothing it and enabling gradient estimation. Our theory starts at first principles to derive stochastic smoothing with reduced assumptions, without requiring a differentiable density nor full support, and we present a general framework for relaxation and gradient estimation of non-differentiable black-box functions $f:\mathbb{R}^n\to\mathbb{R}^m$. We develop variance reduction for gradient estimation from 3 orthogonal perspectives. Empirically, we benchmark 6 distributions and up to 24 variance reduction strategies for differentiable sorting and ranking, differentiable shortest-paths on graphs, differentiable rendering for pose estimation, as well as differentiable cryo-ET simulations.

Paper Structure

This paper contains 30 sections, 9 theorems, 53 equations, 15 figures, 5 tables.

Key Result

Lemma 1

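Given a function $f:\mathbb{R}^n\to\mathbb{R}$ and a differentiable probability density function $\mu(\epsilon)$ with full support on $\mathbb{R}^n$, $f_\epsilon$ is differentiable and $\nabla_{x} f_\epsilon(x)=\mathbb{E}_{\epsilon\sim\mu}[f(x+\epsilon)\nabla_{\epsilon}(-\log\mu(\epsilon))]$.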
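
As an illustration of the setting of Lemma 1, the estimator $\mathbb{E}_{\epsilon\sim\mu}[f(x+\epsilon)\nabla_{\epsilon}(-\log\mu(\epsilon))]$ from the TL;DR can be approximated by Monte Carlo sampling. The sketch below is not the authors' implementation. It assumes an isotropic Gaussian $\mu=\mathcal{N}(0,\sigma^2 I)$, for which $\nabla_{\epsilon}(-\log\mu(\epsilon))=\epsilon/\sigma^2$; the function name `smoothed_grad` and its parameters are hypothetical.

```python
import numpy as np

def smoothed_grad(f, x, sigma=1.0, num_samples=10_000, rng=None):
    """Monte Carlo estimate of the gradient of f_eps(x) = E[f(x + eps)].

    Assumption (not from the source): eps ~ N(0, sigma^2 I), so the score
    term grad_eps(-log mu(eps)) equals eps / sigma^2. f is treated as a
    scalar-valued black box and is never differentiated.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # One perturbation per sample, with the same shape as x.
    eps = rng.normal(scale=sigma, size=(num_samples,) + x.shape)
    # Evaluate the black box at every perturbed input.
    values = np.array([f(x + e) for e in eps], dtype=float)
    score = eps / sigma**2
    # Broadcast the scalar function values against the score and average.
    weights = values.reshape((-1,) + (1,) * x.ndim)
    return (weights * score).mean(axis=0)
```

For example, for the step function `lambda z: float(z > 0)` at `x = 0.0` with `sigma = 1.0`, the smoothed function is the standard normal CDF, so the estimate should approach $1/\sqrt{2\pi}\approx 0.399$ as the number of samples grows.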

Figures (15)

  • Figure 1: Comparison of covariates: a non-differentiable function (dark blue) is smoothed with a logistic distribution (light blue). The original gradient (dark red) is not defined everywhere and does not meaningfully represent the gradient. The gradient of the smoothed function is shown in pink. Grey illustrates the variance of a gradient estimate with $5$ samples via the $[25\%,75\%]$ (dark grey) and $[10\%,90\%]$ (light grey) percentiles. Using $f(x)$ as a covariate, instead of using none, reduces the gradient variance, in particular whenever $f(x)$ is large. Leave-one-out (LOO) further improves over $f(x)$ at discontinuities of the original function $f$ (i.e., at $x{=}1$), but has slightly higher variance than $f(x)$ where $f$ is continuous and has large values (i.e., at $x{=}{-}2$).
  • Figure 2: Sampling strategies. Left to right: Monte-Carlo (MC), Antithetic Monte-Carlo, Cartesian Quasi-Monte-Carlo (QMC), Cartesian Randomized-Quasi-Monte-Carlo (RQMC), Latin-Hypercube Sampled QMC and RQMC. Samples can be transformed via the inverse CDF of a respective distribution.
  • Figure 3: Average $L_2$ norms between ground truth (oracle) and estimated gradient for different numbers $n$ of elements to sort and rank, and different distributions. Each plot compares different variance reduction strategies as indicated in the legend to the right of the caption. Darker is better (smaller values). Colors are only comparable within each subplot. We use $1\,024$ samples, except for Cartesian and $n=3$ where we use $10^3=1\,000$ samples. An extension with $n\in\{7,10\}$ can be found in Figure \ref{fig:sorting-variance-sm} in the appendix. Absolute values are reported in Table \ref{tab:sorting-variance}.
  • Figure 4: Average $L_2$ norms between ground truth (oracle) and estimated gradient for smoothing shortest-path algorithms, and different distributions. Each plot compares different variance reduction strategies as indicated in the legend to the right of the caption. Darker is better (smaller values). Colors are only comparable within each subplot. We use $1\,024$ samples. Absolute values are reported in Table \ref{tab:shortest-path-variance}.
  • Figure 5: Sorting benchmark ($n{=}5$). Exact match (EM) accuracy. Brighter is better (greater values). Values between subplots are comparable. IQM over 12 seeds and displayed range of $[75\%, 85.5\%]$.
  • ...and 10 more figures
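
The covariates compared in Figure 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: it uses Gaussian rather than logistic noise, and the function name `covariate_grad` and its option strings are hypothetical. Because the score $\epsilon/\sigma^2$ has zero mean, subtracting a baseline that does not depend on the current sample (either $f(x)$ or the leave-one-out mean of the other samples) keeps the estimate unbiased while changing its variance.

```python
import numpy as np

def covariate_grad(f, x, sigma, num_samples, rng, covariate="none"):
    """Gaussian-smoothing gradient estimate of a scalar black box f at 1-D x.

    covariate (hypothetical option names):
      "none": plain estimator, mean of f(x + eps_i) * score_i
      "fx"  : subtract f(x) from every function value
      "loo" : subtract, for sample i, the mean of f over all other samples
    """
    x = np.asarray(x, dtype=float)
    eps = rng.normal(scale=sigma, size=(num_samples, x.size))
    vals = np.array([f(x + e) for e in eps], dtype=float)
    if covariate == "none":
        base = 0.0
    elif covariate == "fx":
        base = f(x)
    elif covariate == "loo":
        # Leave-one-out mean: exclude sample i from its own baseline.
        base = (vals.sum() - vals) / (num_samples - 1)
    else:
        raise ValueError(f"unknown covariate: {covariate!r}")
    score = eps / sigma**2  # grad_eps(-log mu(eps)) for N(0, sigma^2 I)
    return ((vals - base)[:, None] * score).mean(axis=0)
```

Repeating the estimate many times with a small `num_samples` for a function with a large offset, such as `lambda z: 10.0 + float(z[0] > 0)`, gives a rough empirical view of how the spread of the estimates differs between the three options.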

Theorems & Definitions (21)

  • Lemma 1: Differentiable Density Smoothing
  • proof
  • Corollary 2: Differentiable Density Smoothing for Vector-valued Functions
  • Lemma 3: Requirement of Continuity of $\mu$
  • Remark 4: Requirement of Continuity of $\mu$
  • Remark 5: Gaussian Smoothing
  • Lemma 6: Differentiation wrt. $\gamma$
  • Theorem 7: Multivariate Smoothing with Covariance Matrix
  • Theorem 8: Output Covariance of Multivariate Smoothing for UQ
  • proof: Proof of Lemma \ref{cor:continuous-dae}
  • ...and 11 more