Dropping Convexity for Faster Semi-definite Optimization

Srinadh Bhojanapalli, Anastasios Kyrillidis, Sujay Sanghavi

TL;DR

This work studies minimizing a convex function over the PSD cone using a non-convex factorization $X = UU^\top$, introducing Factored Gradient Descent (FGD) with a specially designed step size. The authors prove that FGD achieves an $O(1/k)$ convergence rate for smooth convex objectives and linear convergence under $(m, r)$-restricted strong convexity, with performance dependent on spectral properties of the optimum. They propose initialization schemes based on first-order information to guarantee a good starting point and demonstrate computational advantages over traditional SDP methods. Overall, the paper provides precise convergence guarantees for general convex objectives in the PSD setting and explains practical performance observed in matrix sensing and related tasks.

Abstract

We study the minimization of a convex function $f(X)$ over the set of $n\times n$ positive semi-definite matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq n$. We study the performance of gradient descent on $g$---which we refer to as Factored Gradient Descent (FGD)---under standard assumptions on the original function $f$. We provide a rule for selecting the step size and, with this choice, show that the local convergence rate of FGD mirrors that of standard gradient descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. In addition, we provide a procedure to initialize FGD for (restricted) strongly convex objectives and when one only has access to $f$ via a first-order oracle; for several problem instances, such proper initialization leads to global convergence guarantees. FGD and similar procedures are widely used in practice for problems that can be posed as matrix factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard convex assumptions.
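
As a rough illustration of the recasting described in the abstract, the following is a minimal sketch of gradient descent on $g(U) = f(UU^\top)$, not the authors' implementation. By the chain rule, $\nabla g(U) = (\nabla f(UU^\top) + \nabla f(UU^\top)^\top) U$. The objective $f(X) = \tfrac{1}{2}\|X - Y\|_F^2$ (which is $1$-smooth), the function name `fgd`, the perturbed starting point, and the particular fixed step size are all assumptions chosen for the example; the summary above does not state the paper's step size rule or initialization procedure.

```python
import numpy as np


def fgd(grad_f, U0, M, num_iters=2000):
    """Plain gradient descent on g(U) = f(U U^T).

    grad_f : callable returning the gradient of f at a matrix X
    U0     : (n, r) starting factor
    M      : smoothness constant of f (assumed known)
    """
    U = U0.copy()
    X0 = U0 @ U0.T
    # Assumed conservative fixed step size, computed once at the start.
    # It shrinks with M and with the spectral norms of X0 and grad f(X0).
    eta = 1.0 / (16.0 * (M * np.linalg.norm(X0, 2)
                         + np.linalg.norm(grad_f(X0), 2)))
    for _ in range(num_iters):
        G = grad_f(U @ U.T)
        # Chain rule: grad g(U) = (G + G^T) U
        U = U - eta * (G + G.T) @ U
    return U


# Hypothetical test problem: recover a rank-2 PSD matrix Y.
rng = np.random.default_rng(0)
n, r = 20, 2
U_star = rng.standard_normal((n, r))
Y = U_star @ U_star.T


def f(X):
    return 0.5 * np.linalg.norm(X - Y, "fro") ** 2


def grad_f(X):
    return X - Y


# Start from a small perturbation of a true factor (a stand-in for a
# proper initialization scheme, which is not reproduced here).
U0 = U_star + 0.1 * rng.standard_normal((n, r))
U_hat = fgd(grad_f, U0, M=1.0)

f_init = f(U0 @ U0.T)
f_final = f(U_hat @ U_hat.T)
print(f"objective before: {f_init:.3e}, after: {f_final:.3e}")
```

Each iteration only touches the $n \times r$ factor, so no eigendecomposition or projection onto the PSD cone is needed; the iterate $UU^\top$ is PSD by construction.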

Paper Structure

This paper contains 44 sections, 26 theorems, 157 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $X^{\star}_r = U^{\star}_r U_r^{\star^\top}$ denote an optimum of $M$-smooth $f$ over the PSD cone. Let $f(X^0) > f(X^{\star}_r)$. Then, under assumption $(A1)$, after $k$ iterations, the FGD algorithm finds solution $X^k$ such that

Figures (11)

  • Figure 1: Abstract illustration of Theorem 4.1 and the behavior of FGD in the case where $f$ is just $M$-smooth. The grey-shaded area represents the set of optimum solutions $X^\star = X^{\star}_r$. Let the orange triangle denote the optimum, close to which FGD converges; the dashed red circle denotes the optimization tolerance/error.
  • Figure 2: Abstract illustration of Theorem 4.2 and Corollary 4.3. The two curves denote the two cases: (i) $r = \text{rank}(X^\star)$ and (ii) $r < \text{rank}(X^\star)$. (i) In the first case, the triangle marker denotes the unique optimum $X^\star$ and the dashed red circle denotes the optimization tolerance/error. (ii) In the case where $r < \text{rank}(X^\star)$, let the cyan circle with radius $c\|X^\star - X^\star_r\|_F$ (set $c = 1$ for simplicity) denote a neighborhood around $X^\star$. In this case, FGD converges to a rank-$r$ approximation in the vicinity of $X^\star$ in sublinear rate, according to Theorem 4.2.
  • Figure 3: Abstract illustration of initialization effect on a toy example. In this experiment, we design $X^{\star} = U^{\star} U^{\star \top}$ where $U^{\star} = [1 ~~1]^\top$ (or $U^{\star} = -[1 ~~1]^\top$; these are equivalent). We observe $X^{\star}$ via $y = \text{vec}\left(A\cdot X^{\star}\right)$ where $A \in \mathbb{R}^{3 \times 2}$ is randomly generated. We consider the loss function $f(UU^\top) = \tfrac{1}{2} \|y - \text{vec}\left(A\cdot UU^\top\right)\|_2^2$. Left panel: $f$ values in logarithmic scale for various values of variable $U \in \mathbb{R}^{2 \times 1}$. Center panel: Contour lines of $f$ and the behavior of FGD using our initialization scheme. Right panel: zoom-in plot of center plot.
  • Figure 4: Left panel: Assume dimension $n = 50$. We consider the matrix sensing setup of Recht et al. (2010) and generate $m = \lceil 2 n \log n \rceil$ Gaussian linear measurements of $n \times n$ matrices $X^{\star}$ of rank $r = 2$, with varying condition number $\tau(X^{\star})$. We compute the matrix $X = UU^\top$, where $U$ is a tall $n \times r$ matrix, by minimizing the standard least squares loss function, using our scheme. In the plot, we show the log error versus total number of iterations. Observe that a higher condition number $\tau(X^{\star})$ leads to slower convergence. Right panel: Contour of the function $f(u_1, u_2) = (u_1^2+u_2^2-1)^2$. Observe the "ring" of points $(u_1, u_2)$ where $f$ is minimized. This illustrates the existence of multiple points with zero gradient and, thus, directions where the Hessian of the objective is zero.
  • Figure 5: Median error per iteration of the factored gradient descent algorithm for different step sizes, over 20 Monte Carlo iterations. The number of measurements is fixed to $C_{\text{sam}} \cdot n \cdot r$ for varying $C_{\text{sam}} \in \left\{4, 6, 10 \right\}$. Here, $n = 1204$ and rank $r = 5$. Curves show the convergence behavior of factored gradient descent as a function of the step size selection. One can observe that arbitrary step size selections can lead to slow convergence. Moreover, constant step sizes that work well for one problem configuration do not necessarily translate into good performance in a different setting; e.g., observe how the convergence rates for constant step sizes worsen faster as we decrease the number of observations.
  • ...and 6 more figures
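
The toy problem in the Figure 3 caption is small enough to write out directly. The sketch below is only an illustration under assumptions: the random seed, the helper names `loss` and `grad`, and the comparison point `U_other` are hypothetical, and the paper's initialization scheme is not reproduced. It checks numerically that $U^\star$ and $-U^\star$ give the same (zero) loss and zero gradient, which is the equivalence the caption mentions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))        # randomly generated A, as in the caption
U_star = np.array([[1.0], [1.0]])      # U* = [1 1]^T
X_star = U_star @ U_star.T
y = (A @ X_star).ravel()               # y = vec(A X*)


def loss(U):
    """f(U U^T) = 0.5 * || y - vec(A U U^T) ||_2^2."""
    residual = y - (A @ (U @ U.T)).ravel()
    return 0.5 * float(residual @ residual)


def grad(U):
    """Gradient of the loss with respect to U, via the chain rule."""
    G = A.T @ (A @ (U @ U.T) - A @ X_star)   # gradient with respect to X
    return (G + G.T) @ U


U_other = np.array([[1.0], [0.5]])     # arbitrary non-optimal point

print("loss at  U*:", loss(U_star))
print("loss at -U*:", loss(-U_star))
print("loss at U_other:", loss(U_other))
print("gradient norm at U*:", np.linalg.norm(grad(U_star)))
```

Because the loss depends on $U$ only through $UU^\top$, flipping the sign of $U$ never changes its value, so the non-convex problem in $U$ has at least two global minimizers even when $X^\star$ is unique.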

Theorems & Definitions (45)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Theorem 4.1: Convergence performance for smooth $f$
  • Theorem 4.2: Convergence rate for restricted strongly convex $f$
  • Corollary 4.3: Exact recovery of $X^{\star}$
  • Remark 1
  • Remark 2
  • Lemma 5.1
  • ...and 35 more