Accelerated Bregman divergence optimization with SMART: an information geometry point of view

Maren Raus, Yara Elshiaty, Stefania Petra

TL;DR

The paper explores the exponentiated gradient method, which can be viewed both as a Bregman proximal gradient method and as a Riemannian gradient descent on the parameter manifold of a corresponding distribution of the exponential family.

Abstract

We investigate the problem of minimizing Kullback-Leibler divergence between a linear model $Ax$ and a positive vector $b$ in different convex domains (positive orthant, $n$-dimensional box, probability simplex). Our focus is on the SMART method that employs efficient multiplicative updates. We explore the exponentiated gradient method, which can be viewed as a Bregman proximal gradient method and as a Riemannian gradient descent on the parameter manifold of a corresponding distribution of the exponential family. This dual interpretation enables us to establish connections and achieve accelerated SMART iterates while smoothly incorporating constraints. The performance of the proposed acceleration schemes is demonstrated by large-scale numerical examples.
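
The multiplicative update mentioned in the abstract can be illustrated with a minimal sketch for the positive orthant. It assumes the objective $\mathrm{KL}(Ax, b) = \sum_i \big[(Ax)_i \log\frac{(Ax)_i}{b_i} - (Ax)_i + b_i\big]$ with an entrywise positive matrix $A$, whose gradient is $A^{\top}\log(Ax/b)$, and uses the exponentiated gradient step $x \leftarrow x \odot \exp(-\tau \nabla f(x))$. The function names, the random test problem, and the step size $\tau = 1/\max_j \sum_i A_{ij}$ are illustrative assumptions and are not taken from the paper; the accelerated variants and the box and simplex constraints are not covered.

```python
import numpy as np


def kl_objective(A, x, b):
    """KL(Ax, b) = sum(Ax * log(Ax / b) - Ax + b), for positive Ax and b."""
    Ax = A @ x
    return float(np.sum(Ax * np.log(Ax / b) - Ax + b))


def smart(A, b, x0, num_iters=500, tau=None):
    """Exponentiated-gradient (SMART-style) iteration on the positive orthant.

    Hypothetical helper for illustration only, not the authors' code.
    """
    if tau is None:
        # Assumed step size: reciprocal of the largest column sum of A.
        tau = 1.0 / A.sum(axis=0).max()
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        grad = A.T @ np.log((A @ x) / b)  # gradient of KL(Ax, b) w.r.t. x
        x = x * np.exp(-tau * grad)       # multiplicative step keeps x > 0
    return x


# Small synthetic problem (assumed data, chosen only for illustration).
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(30, 10))
x_true = rng.uniform(0.5, 2.0, size=10)
b = A @ x_true

x0 = np.ones(10)
x = smart(A, b, x0)
print("objective at x0:", kl_objective(A, x0, b))
print("objective at x: ", kl_objective(A, x, b))
```

Because each step multiplies the current iterate by a positive factor, positivity is preserved without any projection, which is the property the abstract refers to as efficient multiplicative updates.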

Paper Structure

This paper contains 36 sections, 20 theorems, 124 equations, 12 figures, 3 tables, 4 algorithms.

Key Result

Proposition 2.2

Let $\mathcal{A}\subseteq\mathbb{R}^{n}$ be an affine subspace and suppose the convex function $\varphi\colon\mathbb{R}^{n}\to\mathbb{R}\cup \{+\infty\}$ is of Legendre type. If $\mathop{\mathrm{int}}\nolimits(\mathop{\mathrm{dom}}\nolimits \varphi) \cap\mathcal{A}\neq\emptyset$, then the restriction $\varphi|_{\mathcal{A}}$ of $\varphi$ to $\mathcal{A}$ is of

Figures (12)

  • Figure 4.1: E-geodesics \ref{eq:Exp-maps} emanating from three points $\eta$ (black dots) in all directions $v=(\cos \omega,\sin \omega),\,\omega\in[0,2\pi)$ with $t\in\{0.02, 0.05, 0.1, 0.15, 0.2\}$.
  • Figure 5.1: Toy example: Trajectories of iterates, starting at $x_0=(0.5, 0.5)^{\top}$ and converging to the unique solution $\hat{x}=(1,1)^{\top}$, are displayed in the left panel. The projected gradient method (PG) shows a kink due to its nonsmooth projection step. Methods aware of the constraint geometry trace smoother, more favorable iterate trajectories by exploiting smooth Riemannian geometry to accommodate box constraints. In the right panel, FSMART-g stands out with larger initial steps and a rapid decrease in the relative objective value.
  • Figure 5.2: Comparison of CG variants on expander graphs across three instances with varying levels of ill-posedness ($m \in \{40, 70, 100\}$ from left to right). The plots illustrate the relative decrease in objective values over iterations (first row) and over costly operations (second row) for the different approaches detailed in \ref{eq:beta-CG} for selecting $\beta_k$ when updating the search direction in the conjugate gradient approach. Additionally, Riemannian gradient descent with Armijo line search is considered as a baseline. We observe a rough clustering of the variants into two groups: $\{$OV, PR$\}$ perform similarly to the Armijo baseline, while $\{$DY, FR, HS, HZ$\}$ exhibit notably faster convergence.
  • Figure 5.3: Comparison of different algorithms on expander graphs across instances with varying levels of ill-posedness ($m \in \{40, 70, 100\}$ from left to right). The plots illustrate the relative decrease in objective values over iterations (first row) and over costly operations (second row) for the different approaches detailed in Section \ref{sec:implDetails}. Among the accelerated algorithms within the Bregman regime, FSMART-e and particularly FSMART-g demonstrate superior performance. Within the Riemannian optimization regime, the conjugate gradient (CG) method with the DY rule \ref{eq:beta-Dai-Yuan} for the $\beta_k$ update achieves results comparable to FSMART-e.
  • Figure 5.4: Sparse spike reconstructions from adjacency matrices corresponding to expander graphs, generated by different algorithms, are displayed for $m=40$ after $1000$ iterations. The original binary signal $\hat{x}$ with sparsity $20$ is presented in the top left. Faster algorithms (see Figure \ref{fig:ExperimentExpander}) demonstrate perfect signal reconstruction. However, reconstructions from slower algorithms can be thresholded entrywise at $0.5$, which turns them into perfect reconstructions.
  • ...and 7 more figures

Theorems & Definitions (39)

  • Definition 2.1
  • Proposition 2.2: Legendre functions on affine subspaces [Alvarez:2004]
  • Theorem 2.3
  • Lemma 2.4: Bregman projection [Bauschke:1997aa]
  • Lemma 2.5: Three-points identity [Chen_Teboulle_1993]
  • Lemma 2.6: probability simplex, Fisher-Rao metric
  • proof
  • Lemma 2.7: Riemannian gradient in ambient coordinates
  • proof
  • Remark 2.8: Bregman kernels
  • ...and 29 more