
LPGD: A General Framework for Backpropagation through Embedded Optimization Layers

Anselm Paulus, Georg Martius, Vít Musil

TL;DR

This work tackles training models that include embedded optimization layers, where traditional derivatives can be degenerate. It proposes LPGD, a gradient-descent method on a smoothed Lagrangian-Moreau envelope that uses finite-difference perturbations computed via the forward solver, enabling informative gradient updates even when the solution mapping is non-differentiable. The framework unifies several prior approaches, provides theoretical guarantees about smoothness and convergence, and demonstrates faster convergence than standard gradient descent in Sudoku and Markowitz portfolio experiments. Its practical advantage lies in integrating with standard autodiff toolchains using a black-box solver, while offering a spectrum of behavior through the temperature parameter and optional smoothing. The results suggest LPGD as a versatile and effective alternative for learning with embedded optimization layers in convex and saddle-point problems.

Abstract

Embedding parameterized optimization problems as layers into machine learning architectures serves as a powerful inductive bias. Training such architectures with stochastic gradient descent requires care, as degenerate derivatives of the embedded optimization problem often render the gradients uninformative. We propose Lagrangian Proximal Gradient Descent (LPGD), a flexible framework for training architectures with embedded optimization layers that seamlessly integrates into automatic differentiation libraries. LPGD efficiently computes meaningful replacements of the degenerate optimization layer derivatives by re-running the forward solver oracle on a perturbed input. LPGD captures various previously proposed methods as special cases, while fostering deep links to traditional optimization methods. We theoretically analyze our method and demonstrate on historical and synthetic data that LPGD converges faster than gradient descent even in a differentiable setup.
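The idea of replacing a degenerate derivative by re-running the forward solver on a perturbed input can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear cost ⟨c, x⟩ minimized over one-hot vectors and a linearized loss, a simplification chosen here purely for illustration. The names `solver`, `perturbed_solver_grad`, `tau`, the step size, and the toy data are all hypothetical.

```python
import numpy as np

def solver(c):
    """Toy forward solver oracle (assumed): argmin over one-hot x of <c, x>.

    The output is piecewise constant in c, so its true Jacobian is zero
    almost everywhere and carries no learning signal.
    """
    x = np.zeros_like(c, dtype=float)
    x[np.argmin(c)] = 1.0
    return x

def perturbed_solver_grad(c, grad_x, tau):
    """Finite-difference replacement for d(loss)/dc (illustrative sketch).

    Calls the solver a second time on the cost shifted by tau times the
    incoming gradient grad_x = d(loss)/dx, then takes the scaled difference
    of the two solutions.
    """
    x_star = solver(c)                 # unperturbed solution
    x_tau = solver(c + tau * grad_x)   # solution on the perturbed input
    return (x_tau - x_star) / tau

# Toy data (hypothetical): we want the solver to select index 2.
c = np.array([1.0, 2.0, 3.0])
target = np.array([0.0, 0.0, 1.0])

x = solver(c)               # currently selects index 0
grad_x = x - target         # gradient of 0.5 * ||x - target||^2 w.r.t. x

g = perturbed_solver_grad(c, grad_x, tau=2.0)
c_new = c - 3.0 * g         # one plain gradient step with step size 3.0
print(g, solver(c_new))
```

In this toy setting the replacement gradient is nonzero and a single step moves the solver output onto the target, although the exact derivative of the solution mapping is zero. The perturbation strength matters: with `tau=0.5` the perturbed cost yields the same solution as the original one, so the returned vector is zero again.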
