Improved Stochastic Optimization of LogSumExp

Egor Gladin, Alexey Kroshnin, Jia-Jie Zhu, Pavel Dvurechensky

TL;DR

This work tackles the computational bottleneck of optimizing LogSumExp in large or continuous settings by introducing a SoftPlus-based relaxation built on a new overflow-safe KL divergence ($D_\rho$). The authors derive $F_\rho$ as a variational surrogate of the log-partition function that remains convex and becomes arbitrarily close to the original objective as $\rho \to 0$, with explicit bounds and links to CVaR. They apply this surrogate in three settings (continuous entropy-regularized OT and two forms of distributionally robust optimization), providing tractable gradient estimators and reporting better stability and performance than state-of-the-art baselines in experiments. The approach mitigates numerical overflow and enables scalable SGD-based optimization across tasks with large or unbalanced support. Overall, the safe KL-based LogSumExp approximation broadens the practical applicability of entropy-regularized and robust optimization methods by offering controllable accuracy, preserved convexity, and improved numerical stability.
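The overflow issue mentioned above can be illustrated with a minimal sketch. It is not the paper's method: it only shows, with standard NumPy calls, why a direct evaluation of LogSumExp overflows and why a SoftPlus-type function does not. The function names and the test vector are illustrative assumptions.

```python
import numpy as np

def naive_logsumexp(x):
    # Direct evaluation: exp() overflows in float64 once an entry exceeds ~709.
    with np.errstate(over="ignore"):
        return np.log(np.sum(np.exp(x)))

def stable_logsumexp(x):
    # Standard max-shift trick: every exponent is <= 0, so exp() cannot overflow.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softplus(t):
    # log(1 + exp(t)), computed stably as logaddexp(0, t); grows linearly in t.
    return np.logaddexp(0.0, t)

x = np.array([1000.0, 999.0, 998.0])  # hypothetical large inputs
print(naive_logsumexp(x))   # inf
print(stable_logsumexp(x))  # approximately 1000.4076
print(softplus(1000.0))     # 1000.0
```

The max-shift trick needs the maximum over all terms, which is awkward to obtain when terms are sampled one batch at a time. SoftPlus, by contrast, is stable term by term.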

Abstract

The LogSumExp function, also known as the free energy, plays a central role in many important optimization problems, including entropy-regularized optimal transport and distributionally robust optimization (DRO). It is also the dual to the Kullback-Leibler (KL) divergence, which is widely used in machine learning. In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. Previous approaches that replace the full sum with a small batch introduce significant bias. We propose a novel approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new $f$-divergence called the safe KL divergence. The approximation error is controlled by a tunable parameter and can be made arbitrarily small. Like the LogSumExp, our approximation preserves convexity. Moreover, when applied to an $L$-smooth function bounded from below, the smoothness constant of the resulting objective scales linearly with $L$. Experiments in DRO and continuous optimal transport demonstrate the advantages of our approach over state-of-the-art baselines and the effective treatment of numerical issues associated with the standard LogSumExp and KL.
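The bias of small-batch estimates noted in the abstract follows from Jensen's inequality: the logarithm is concave, so the expected log of a batch average is at most the log of the full average. The sketch below checks this numerically. The synthetic Gaussian values, the batch size, and the helper name `log_mean_exp` are assumptions for illustration and are not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch, trials = 10_000, 16, 2_000
phi = rng.normal(scale=2.0, size=n)  # hypothetical values inside the exponentials

def log_mean_exp(v):
    # Normalized LogSumExp (differs from LogSumExp by the constant log(len(v))),
    # evaluated with the max-shift trick for numerical stability.
    m = np.max(v)
    return m + np.log(np.mean(np.exp(v - m)))

full = log_mean_exp(phi)
# Plug-in estimates from random batches drawn with replacement.
estimates = np.array(
    [log_mean_exp(rng.choice(phi, size=batch)) for _ in range(trials)]
)
print("full objective:      ", full)
print("mean batch estimate: ", estimates.mean())
```

In this sketch the mean batch estimate should fall below the full value, and the gap does not vanish by averaging over more batches. Only a larger batch shrinks it.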

Paper Structure

This paper contains 23 sections, 9 theorems, 65 equations, 5 figures, 1 table.

Key Result

Lemma 2.2

The functional $F_\rho$ defined by (def:F_rho) has an equivalent variational representation

Figures (5)

  • Figure 1: $f_{\rho}(t)$ for different values of $\rho$.
  • Figure 2: Test-set eOT semi-dual objective vs. iteration for different regularization strengths $\varepsilon$ (left to right: $1$, $10^{-2}$, $10^{-4}$). Lines show the mean across 5 runs; shaded areas are $\pm$ one standard deviation. We compare LSOT (red) with our method (colored by $\rho$). Dashed black curves are examples where LSOT with lr=$10^{-4}$ terminates early due to overflow, while lr=$10^{-5}$ results in a prohibitively slow convergence (nearly horizontal red lines for $\varepsilon=10^{-2}, 10^{-4}$). Our proposed method remains stable and efficient for all $\varepsilon$.
  • Figure 3: Performance of ERM and two DRO approaches on MNIST with noisy labels. Left: ERM accuracy on the noisy validation set vs. clean test set. Middle: validation vs. test accuracy for DRO approaches. Right: training loss $F(\theta)$ from (\ref{eq:lse_dro}).
  • Figure 4: Densities of source and target distributions in the eOT experiment.
  • Figure 5: Left: convergence of kernel SGD applied to the dual objective (\ref{eq:f_eps}) (blue and orange) and the approximate semi-dual problem (\ref{eq:approx_eot}) (green, red and purple). Solid lines show the average optimality gap across 20 runs; shaded regions indicate $\pm$ one standard deviation. The y-axis uses a logarithmic scale. Middle: a zoomed-in view of the blue and orange curves from the left plot. Right: examples of divergent optimality gap curves obtained by running the baseline approach with the stepsize parameter $C=10^{-2}$.

Theorems & Definitions (19)

  • Definition 2.1: Safe KL entropy
  • Lemma 2.2
  • Lemma 2.3
  • Proposition 2.4
  • Corollary 2.5
  • Proposition 2.6
  • Lemma 2.7
  • Remark 3.1: The overflow issue
  • Proof of Proposition \ref{prop:approx}
  • Proof of Corollary \ref{cor:logsumexp_approx}
  • ...and 9 more