Improved Stochastic Optimization of LogSumExp

Egor Gladin; Alexey Kroshnin; Jia-Jie Zhu; Pavel Dvurechensky

TL;DR

This work tackles the computational bottleneck of optimizing LogSumExp in large or continuous settings by introducing a SoftPlus-based relaxation built on a new overflow-safe KL divergence (D_ρ). The authors derive F_ρ as a variational surrogate of the log-partition function that remains convex and becomes arbitrarily close to the original objective as ρ → 0, with explicit bounds and links to CVaR. They show how to apply this surrogate in three contexts—continuous entropy-regularized OT and two forms of distributionally robust optimization—providing tractable gradient estimators and demonstrating superior stability and performance over state-of-the-art baselines in experiments. The approach effectively mitigates numerical overflow issues and enables scalable SGD-based optimization across tasks with large or unbalanced support, offering a versatile tool for large-scale convex optimization problems involving log-sum-exp. Overall, the safe KL-based LogSumExp approximation broadens the practical applicability of entropy-regularized and robust optimization methods by delivering controllable accuracy, preserved convexity, and improved numerical stability.

Abstract

The LogSumExp function, also known as the free energy, plays a central role in many important optimization problems, including entropy-regularized optimal transport and distributionally robust optimization (DRO). It is also the dual to the Kullback-Leibler (KL) divergence, which is widely used in machine learning. In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. Previous approaches that replace the full sum with a small batch introduce significant bias. We propose a novel approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new $f$-divergence called the safe KL divergence. The approximation error is controlled by a tunable parameter and can be made arbitrarily small. Like the LogSumExp, our approximation preserves convexity. Moreover, when applied to an $L$-smooth function bounded from below, the smoothness constant of the resulting objective scales linearly with $L$. Experiments in DRO and continuous optimal transport demonstrate the advantages of our approach over state-of-the-art baselines and the effective treatment of numerical issues associated with the standard LogSumExp and KL.
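The bias of small-batch estimates mentioned in the abstract can be seen in a small numerical experiment. The sketch below is illustrative only and is not the authors' method or code: it compares the full LogSumExp of the mean, $\log\big(\tfrac{1}{n}\sum_i e^{v_i}\big)$, with the average of the same quantity computed on uniformly sampled minibatches. By Jensen's inequality the minibatch estimate underestimates the full value in expectation. The synthetic Gaussian data, the batch sizes, and the helper names `logsumexp_mean` and `minibatch_bias` are assumptions made for this example.

```python
import numpy as np


def logsumexp_mean(values):
    """Numerically stable log((1/n) * sum_i exp(values[i]))."""
    values = np.asarray(values, dtype=float)
    shift = values.max()  # subtract the max to avoid overflow in exp
    return shift + np.log(np.mean(np.exp(values - shift)))


def minibatch_bias(values, batch_size, num_trials, seed=0):
    """Monte Carlo estimate of E[minibatch LogSumExp] - full LogSumExp."""
    rng = np.random.default_rng(seed)
    full = logsumexp_mean(values)
    estimates = [
        logsumexp_mean(rng.choice(values, size=batch_size, replace=True))
        for _ in range(num_trials)
    ]
    return float(np.mean(estimates) - full)


# Synthetic exponents (an assumption for this demo, not data from the paper).
data_rng = np.random.default_rng(0)
values = data_rng.normal(scale=2.0, size=10_000)

full_value = logsumexp_mean(values)
bias_small = minibatch_bias(values, batch_size=8, num_trials=2000)
bias_large = minibatch_bias(values, batch_size=512, num_trials=2000)

print(f"full LogSumExp:        {full_value:.3f}")
print(f"bias, batch size 8:    {bias_small:.3f}")
print(f"bias, batch size 512:  {bias_large:.3f}")
```

In this toy setting the bias shrinks as the batch grows but does not vanish for small batches, which is the difficulty that motivates looking for an objective that admits unbiased stochastic gradients.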
