Backpropagation through Combinatorial Algorithms: Identity with Projection Works

Subham Sekhar Sahoo, Anselm Paulus, Marin Vlastelica, Vít Musil, Volodymyr Kuleshov, Georg Martius

TL;DR

The paper tackles the challenge of backpropagating through discrete combinatorial solvers, whose true gradients are zero or undefined. It introduces Identity, a hyperparameter-free gradient replacement that treats the solver as a negative identity on the backward pass, augmented by cost projections that exploit solver invariances and, optionally, by a margin induced through symmetric noise. The framework reframes solver differentiation in terms of a relaxation of the solver, which avoids extra solver calls while stabilizing learning. Across DVAE, learning-to-explain, deep graph matching, image retrieval, and TSP, Identity performs competitively and robustly, with projections and margins significantly improving stability and preventing cost collapse. The approach offers a practical, scalable alternative to complex smoothing or learning-based differentiation methods, enabling more reliable integration of combinatorial modules into end-to-end models.
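
The role of the cost projection can be illustrated on a toy solver. The sketch below is not the authors' implementation: `top_k_solver` and `project` are hypothetical helpers, and mean subtraction followed by normalization is an assumed choice of projection that suits a solver selecting exactly $k$ items. For such a solver, shifting all costs by a constant or scaling them by a positive factor leaves the solution unchanged, so the projection removes these redundant degrees of freedom without affecting the forward pass.

```python
import math


def top_k_solver(cost, k):
    """Toy solver: 0/1 indicator of the k cheapest entries (argmin of <cost, y>)."""
    order = sorted(range(len(cost)), key=lambda i: cost[i])
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(cost))]


def project(cost):
    """Assumed projection: subtract the mean, then rescale to unit Euclidean norm."""
    mean = sum(cost) / len(cost)
    centered = [c - mean for c in cost]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


cost = [0.3, -1.2, 2.5, 0.1, -0.4]
# A shifted and positively scaled copy of the same cost vector.
shifted_scaled = [7.0 * c + 100.0 for c in cost]

# The solver cannot distinguish the two cost vectors ...
same_solution = top_k_solver(cost, 2) == top_k_solver(shifted_scaled, 2)
# ... and the projection maps both to (numerically) the same point.
max_gap = max(abs(a - b) for a, b in zip(project(cost), project(shifted_scaled)))
# Projecting before solving leaves the solution untouched.
unchanged = top_k_solver(project(cost), 2) == top_k_solver(cost, 2)
```

Because the projected cost has a fixed norm, the cost cannot shrink toward zero during training in this toy setting, which is one way to read the claim that projections help prevent cost collapse.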

Abstract

Embedding discrete solvers as differentiable layers has given modern deep learning architectures combinatorial expressivity and discrete reasoning capabilities. The derivative of these solvers is zero or undefined, therefore a meaningful replacement is crucial for effective gradient-based learning. Prior works rely on smoothing the solver with input perturbations, relaxing the solver to continuous problems, or interpolating the loss landscape with techniques that typically require additional solver calls, introduce extra hyper-parameters, or compromise performance. We propose a principled approach to exploit the geometry of the discrete solution space to treat the solver as a negative identity on the backward pass and further provide a theoretical justification. Our experiments demonstrate that such a straightforward hyper-parameter-free approach is able to compete with previous more complex methods on numerous experiments such as backpropagation through discrete samplers, deep graph matching, and image retrieval. Furthermore, we substitute the previously proposed problem-specific and label-dependent margin with a generic regularization procedure that prevents cost collapse and increases robustness.
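
As a rough illustration of what "negative identity on the backward pass" means, the following minimal sketch applies the rule by hand to a toy problem. It is an assumption-laden example rather than the authors' code: the top-$k$ solver, the squared-error loss, the learning rate, and the helper names `top_k_solver` and `identity_backward` are all hypothetical, and the update is applied directly to the cost vector, whereas in practice it would be propagated to the weights of a backbone network.

```python
def top_k_solver(cost, k):
    """Toy solver: 0/1 indicator of the k cheapest entries (argmin of <cost, y>)."""
    order = sorted(range(len(cost)), key=lambda i: cost[i])
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(cost))]


def identity_backward(grad_y):
    """Identity rule: the gradient w.r.t. the cost is minus the gradient w.r.t. the solution."""
    return [-g for g in grad_y]


cost = [0.5, 0.2, 0.9, 0.1]   # initial costs (hypothetical values)
target = [1, 0, 1, 0]         # desired solution
k, lr = 2, 0.5

history = []
for step in range(10):
    y = top_k_solver(cost, k)                                # forward: one solver call
    history.append(y)
    grad_y = [2 * (yi - ti) for yi, ti in zip(y, target)]    # d(squared error)/dy
    grad_cost = identity_backward(grad_y)                    # backward: no solver call
    cost = [c - lr * g for c, g in zip(cost, grad_cost)]     # plain gradient step
```

In this toy run the first forward pass selects items 1 and 3, the update raises their costs and lowers those of items 0 and 2, and every later forward pass returns the target. The backward pass never invokes the solver, which is the property the abstract contrasts with methods that need additional solver calls.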

Paper Structure

This paper contains 49 sections, 5 theorems, 51 equations, 9 figures, and 8 tables.

Key Result

Theorem 1

For sufficiently small $\alpha>0$, either $Y^*\bigl({y(\omega)}\bigr)$ is empty and $y(\omega_k)=y(\omega)$ for every $k\in\mathbb{N}$, or there is $n\in\mathbb{N}$ such that $y(\omega_n)\in Y^*\bigl({y(\omega)}\bigr)$ and $y(\omega_k)=y(\omega)$ for all $k<n$.
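
The behaviour described by the theorem can be observed in a small simulation. The sketch below rests on several assumptions that this summary does not spell out: the iterates $\omega_k$ are taken to be gradient-descent steps on the cost with step size $\alpha$ under the Identity rule, and $Y^*\bigl(y(\omega)\bigr)$ is read as a set of solutions that improve on $y(\omega)$ under the loss linearized at $y(\omega)$. The solver, the loss, and all numeric values are hypothetical.

```python
from itertools import combinations


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Feasible set: all indicator vectors that pick exactly 2 of 4 items.
solutions = [[1 if i in pair else 0 for i in range(4)]
             for pair in combinations(range(4), 2)]


def solve(cost):
    """Toy solver: the feasible y minimizing <cost, y>."""
    return min(solutions, key=lambda y: dot(cost, y))


omega = [0.1, 0.2, 0.9, 1.0]   # initial cost (hypothetical)
target = [0, 0, 1, 1]
alpha = 0.03                   # small step size

y0 = solve(omega)
grad_y = [2 * (a - b) for a, b in zip(y0, target)]   # d(squared error)/dy at y0

trajectory = [y0]
n = None
for k in range(1, 50):
    # Identity rule: d loss / d omega = -grad_y, so a descent step adds alpha * grad_y.
    omega = [w + alpha * g for w, g in zip(omega, grad_y)]
    y_k = solve(omega)
    trajectory.append(y_k)
    if y_k != y0:
        n = k          # first iterate whose solution differs from y0
        break

# The new solution has a strictly lower linearized loss <grad_y, y> than y0.
improved = dot(grad_y, trajectory[-1]) < dot(grad_y, y0)
```

Here the solution stays at $y(\omega)$ for the first five steps and changes at the sixth, to a solution with lower linearized loss. The strict decrease is not specific to these numbers: whenever $y(\omega)$ is the unique minimizer for $\omega$, any different minimizer for $\omega + k\alpha\,\mathrm{d}\ell/\mathrm{d}y$ must have a smaller inner product with $\mathrm{d}\ell/\mathrm{d}y$.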

Figures (9)

  • Figure 1: Hybrid architecture with blackbox combinatorial solver and Identity module (green dotted line) with the projection of a cost $\omega$ and negative identity on the backward pass.
  • Figure 2: Intuitive illustration of the Identity (Id) gradient and its equivalence to Blackbox Backpropagation (BB) when $- \mathrm{d}\ell/\mathrm{d}y$ points directly to a target $y^*$. The cost and solution spaces are overlaid; the cost space partitions resulting in the same solution are drawn in blue. Note that the drawn updates to $\omega$ are purely illustrative, as the updates are typically applied to the weights of a backbone.
  • Figure 3: N-ELBO training progress on the MNIST train-set ($k=10$).
  • Figure 4: Susceptibility to perturbations and cost collapse in TSP(20), comparing BB and Identity. (a) Adding noise to the gradient $\mathrm{d}\ell/\mathrm{d}y$ with std $\sigma$ affects Identity much less than BB. (b) Corrupting labels $y^*$ with probability ${\rho}/{k}$. (c) Average cost norm with gradient noise $\sigma=0.25$. The markers indicate the best validation performance.
  • Figure 5: N-ELBO over training epoch for DVAE on MNIST ($k=10$), comparing I-MLE with Identity for different projections.
  • ...and 4 more figures

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • proof: Proof of Theorem 1