Smoothing Methods for Automatic Differentiation Across Conditional Branches

Justin N. Kreikemeyer; Philipp Andelfinger

TL;DR

This paper addresses differentiation across the discontinuities that control flow introduces, along two avenues: Smooth Interpretation (SI) combined with automatic differentiation (AD), which yields gradients that account for alternative branches, and a Monte Carlo gradient estimator built on AD and sampling (DGO), which avoids the assumptions SI makes for tractability. The authors introduce DiscoGrad, a tool that automatically translates simple C++ programs into a smooth differentiable form for these estimators, and evaluate both against gradient-free and stochastic methods on four originally discontinuous optimization problems. Optimization progress with the SI-based estimator depends on the complexity of the program's control flow, whereas DGO is competitive on all four problems and converges fastest by a substantial margin on the highest-dimensional one. The work thus enables direct gradient-based parameter synthesis for branching programs, for example calibrating simulation models or combining them with neural networks in machine learning pipelines.

Abstract

Programs involving discontinuities introduced by control flow constructs such as conditional branches pose challenges to mathematical optimization methods that assume a degree of smoothness in the objective function's response surface. Smooth interpretation (SI) is a form of abstract interpretation that approximates the convolution of a program's output with a Gaussian kernel, thus smoothing its output in a principled manner. Here, we combine SI with automatic differentiation (AD) to efficiently compute gradients of smoothed programs. In contrast to AD across a regular program execution, these gradients also capture the effects of alternative control flow paths. The combination of SI with AD enables the direct gradient-based parameter synthesis for branching programs, allowing for instance the calibration of simulation models or their combination with neural network models in machine learning pipelines. We detail the effects of the approximations made for tractability in SI and propose a novel Monte Carlo estimator that avoids the underlying assumptions by estimating the smoothed programs' gradients through a combination of AD and sampling. Using DiscoGrad, our tool for automatically translating simple C++ programs to a smooth differentiable form, we perform an extensive evaluation. We compare the combination of SI with AD and our Monte Carlo estimator to existing gradient-free and stochastic methods on four non-trivial and originally discontinuous problems ranging from classical simulation-based optimization to neural network-driven control. While the optimization progress with the SI-based estimator depends on the complexity of the program's control flow, our Monte Carlo estimator is competitive in all problems, exhibiting the fastest convergence by a substantial margin in our highest-dimensional problem.
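To make the idea of estimating a smoothed program's gradient by sampling concrete, the following is a minimal sketch. It is not the paper's Monte Carlo estimator, which combines AD with sampling; it uses a textbook score-function estimator for Gaussian smoothing instead. The program `crisp_program` (a single Heaviside-style branch) and all function names are hypothetical. For this program the smoothed output is $\Phi(x/\sigma)$, so the exact derivative $\varphi(x/\sigma)/\sigma$ is available for comparison, whereas the pathwise derivative of the crisp program is 0 almost everywhere.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Hypothetical crisp program with a single conditional branch
// (a Heaviside step at 0). Its pathwise derivative is 0 almost everywhere.
double crisp_program(double x) {
    if (x > 0.0) return 1.0;
    return 0.0;
}

// Standard normal probability density function.
double normal_pdf(double z) {
    const double inv_sqrt_2pi = 0.3989422804014327;
    return inv_sqrt_2pi * std::exp(-0.5 * z * z);
}

// Closed-form derivative of the smoothed step:
// d/dx E[H(x + sigma * eps)] = d/dx Phi(x / sigma) = phi(x / sigma) / sigma.
double smoothed_step_derivative(double x, double sigma) {
    return normal_pdf(x / sigma) / sigma;
}

// Score-function Monte Carlo estimate of d/dx E[f(x + sigma * eps)],
// eps ~ N(0, 1), computed as the sample mean of f(x + sigma * eps) * eps / sigma.
// This is a standard smoothing estimator, used here only for illustration.
double mc_smoothed_derivative(double x, double sigma,
                              std::size_t num_samples, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> standard_normal(0.0, 1.0);
    double sum = 0.0;
    for (std::size_t i = 0; i < num_samples; ++i) {
        const double eps = standard_normal(rng);
        sum += crisp_program(x + sigma * eps) * eps;
    }
    return sum / (sigma * static_cast<double>(num_samples));
}
```

Calling `mc_smoothed_derivative(0.0, 1.0, 200000, 42u)` should land close to `smoothed_step_derivative(0.0, 1.0)`, with the remaining gap shrinking as the sample count grows. The choice of $\sigma$ trades smoothing bias against estimator variance.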

Paper Structure (32 sections, 24 equations, 17 figures, 2 tables)

Introduction
Background
Automatic Differentiation
Smooth Interpretation
Related Work
Sampling-Based Gradient Estimation
Combination of Sampling and AD
Differentiable Programming Languages and Neurosymbolic Programming
Domain-Specific Approaches
Smooth Automatic Differentiation
Approach
Relaxing SI's Assumptions
State Restriction Strategies
Monte Carlo Approach to Smooth Differentiation
DiscoGrad: Smooth Differentiation of C++ Programs
...and 17 more sections

Figures (17)

Figure 1: Graph of the Heaviside step function and its derivative as estimated pathwise by IPA (e.g., through automatic differentiation) and by a smoothing estimator.
Figure 2: Execution of an example program (left), showcasing the probabilistic semantics of SI (center) and their integration with forward-mode AD (right). Only relevant AD operations are shown. The tangents $\dot v$ denote the (generally partial) derivative $\partial v/\partial \hat{\mu}_X$ wrt. the mean $\hat{\mu}_X$ of the normally distributed random variable $X \sim \mathcal{N}(\hat{\mu}_X, \hat{\sigma}_X^2)$. The hat symbol $\hat{\ }$ indicates an input value. Upon initialization, the (here scalar) input vector $\mathbf{x}$ is taken as the mean $\hat{\mu}_X$; $\varphi$ and $\Phi$ denote the normal distribution's probability density function and cumulative distribution function, respectively. Note that in this example the pathwise derivative is 0, but through the combination of SI and AD, the derivative wrt. the branching condition is obtained.
Figure 3: Comparison of gradient fidelity wrt. the convolution with the original SI proposal with 32 tracked control flow paths and with correlation-preserving variance calculation using AD (uncertainty propagation, UP). When reducing the number of tracked paths to a small subset, as would be required for larger programs, the assumption of a fixed-size mixture dominates the error. The non-smooth merging of mixture elements causes the gradient to jump or even assume the wrong sign, which is problematic for gradient descent.
Figure 4: Example DiscoGrad program (a) and the smoothed versions of the contained branch for SI (b) and DGO (c).
Figure 5: Slowdown of the different gradient estimators compared to a crisp execution without AD, normalized to reflect the slowdown per sample (Monte Carlo estimators) or path (DGSI).
...and 12 more figures
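
The combination of SI's probabilistic semantics with forward-mode AD described in the caption of Figure 2 can be sketched for a single branch as follows. This is an illustrative assumption, not the program shown in the figure and not DiscoGrad's implementation: the `Dual` type, the function names, and the branch `if (x < threshold)` are hypothetical. The input is treated as the mean $\hat{\mu}_X$ of $X \sim \mathcal{N}(\hat{\mu}_X, \hat{\sigma}_X^2)$, the branch is taken with probability $\Phi((c - \hat{\mu}_X)/\hat{\sigma}_X)$, and the tangent of that probability wrt. the mean is $-\varphi((c - \hat{\mu}_X)/\hat{\sigma}_X)/\hat{\sigma}_X$.

```cpp
#include <cmath>

// A value paired with its tangent, i.e. its derivative wrt. the input mean mu.
struct Dual {
    double value;
    double tangent;
};

// Standard normal probability density function (phi).
double std_normal_pdf(double z) {
    const double inv_sqrt_2pi = 0.3989422804014327;
    return inv_sqrt_2pi * std::exp(-0.5 * z * z);
}

// Standard normal cumulative distribution function (Phi).
double std_normal_cdf(double z) {
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// Probability that X < threshold for X ~ N(mu, sigma^2),
// together with its tangent d/dmu = -phi(z) / sigma, z = (threshold - mu) / sigma.
Dual branch_probability(double mu, double sigma, double threshold) {
    const double z = (threshold - mu) / sigma;
    return {std_normal_cdf(z), -std_normal_pdf(z) / sigma};
}

// Smoothed output of the hypothetical program
//   if (x < threshold) y = then_value; else y = else_value;
// Both branch bodies are constants, so the pathwise derivative is 0;
// the tangent below stems entirely from the branching condition.
Dual smoothed_branch(double mu, double sigma, double threshold,
                     double then_value, double else_value) {
    const Dual p = branch_probability(mu, sigma, threshold);
    return {p.value * then_value + (1.0 - p.value) * else_value,
            p.tangent * (then_value - else_value)};
}
```

For example, `smoothed_branch(0.0, 1.0, 0.0, 1.0, 0.0)` weights both branches equally and carries a nonzero tangent, even though neither branch body depends on the input.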
