Block majorization-minimization with diminishing radius for constrained nonsmooth nonconvex optimization

Hanbaek Lyu, Yuchen Li

TL;DR

This work develops Block Majorization-Minimization (BMM) for constrained nonsmooth nonconvex optimization by cyclically minimizing majorizing surrogates in each block. It introduces BMM-DR, a diminishing-radius trust-region variant, and proves that the iteration complexity is $\widetilde{O}((1+L_g+\rho^{-1})\varepsilon^{-2})$ for standard BMM and improves to $\widetilde{O}((1+L_g)\varepsilon^{-2})$ with DR, removing the dependence on $\rho^{-1}$. Asymptotic convergence to stationary-Nash points holds under mild assumptions and tolerates inexact subproblem solutions. The theory is instantiated for practical problems including Nonnegative Matrix Factorization (NMF) and constrained tensor factorization (CP/NCPD), yielding concrete results for algorithms like MU/MUR and ALS-type methods, and for Block Projected Gradient Descent (BPGD). Numerical experiments show that diminishing-radius strategies can accelerate convergence, particularly with nearly flat surrogates or ill-conditioned problems, while maintaining convergence guarantees. Overall, the paper provides new global rates and robustness results for BMM variants, guiding the design of efficient constrained nonconvex solvers in matrix and tensor factorization and related domains.

Abstract

Block majorization-minimization (BMM) is a simple iterative algorithm for constrained nonconvex optimization that sequentially minimizes majorizing surrogates of the objective function in each block while the others are held fixed. BMM entails a large class of optimization algorithms such as block coordinate descent and its proximal-point variant, expectation-maximization, and block projected gradient descent. We first establish that for general constrained nonsmooth nonconvex optimization, BMM with $\rho$-strongly convex and $L_g$-smooth surrogates can produce an $\varepsilon$-approximate first-order optimal point within $\widetilde{O}((1+L_g+\rho^{-1})\varepsilon^{-2})$ iterations and asymptotically converges to the set of first-order optimal points. Next, we show that BMM combined with trust-region methods with diminishing radius has an improved complexity of $\widetilde{O}((1+L_g)\varepsilon^{-2})$, independent of the inverse strong convexity parameter $\rho^{-1}$, allowing improved theoretical and practical performance with "flat" surrogates. Our results hold robustly even when the convex sub-problems are solved inexactly, as long as the optimality gaps are summable. Central to our analysis is a novel continuous first-order optimality measure, by which we bound the worst-case sub-optimality in each iteration by the first-order improvement the algorithm makes. We apply our general framework to obtain new results on various algorithms such as the celebrated multiplicative update algorithm for nonnegative matrix factorization by Lee and Seung, regularized nonnegative tensor decomposition, and the classical block projected gradient descent algorithm. Lastly, we numerically demonstrate that the additional use of diminishing radius can improve the convergence rate of BMM in many instances.
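As a rough illustration of the block-wise scheme the abstract describes, the following sketch runs cyclic block projected gradient descent on a two-block NMF objective and shrinks each step into a ball whose radius decreases with the iteration count. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the schedule $r_n = c/n^{\beta}$, the objective $\tfrac12\lVert X - WH\rVert_F^2$, and the choice to pull the projected step back along a segment, which gives a feasible but possibly inexact solution of the trust-region subproblem.

```python
import numpy as np


def radius(n, c=1.0, beta=0.5):
    """Hypothetical diminishing-radius schedule r_n = c / n**beta (an assumption)."""
    return c / n ** beta


def block_step(theta, grad, lipschitz, r):
    """Projected gradient step on one block, shrunk into a ball of radius r.

    The projected point minimizes the prox-linear surrogate over the
    nonnegative orthant; pulling it back along the segment to theta keeps
    it feasible and does not increase the (convex) surrogate.
    """
    target = np.maximum(theta - grad / lipschitz, 0.0)
    step = target - theta
    norm = np.linalg.norm(step)  # Frobenius norm of the step
    if norm > r:
        step *= r / norm
    return theta + step


def bpgd_dr_nmf(X, rank, n_iter=200, c=1.0, beta=0.5, seed=0):
    """Cyclic two-block updates for 0.5 * ||X - W H||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank))
    H = rng.random((rank, X.shape[1]))
    losses = [0.5 * np.linalg.norm(X - W @ H) ** 2]
    for n in range(1, n_iter + 1):
        r = radius(n, c, beta)
        # Block 1: update W with H held fixed; L_W bounds the gradient's Lipschitz constant.
        L_W = max(np.linalg.norm(H @ H.T, 2), 1e-12)
        W = block_step(W, (W @ H - X) @ H.T, L_W, r)
        # Block 2: update H with the new W held fixed.
        L_H = max(np.linalg.norm(W.T @ W, 2), 1e-12)
        H = block_step(H, W.T @ (W @ H - X), L_H, r)
        losses.append(0.5 * np.linalg.norm(X - W @ H) ** 2)
    return W, H, losses
```

Because each block update minimizes (or at least does not increase) a surrogate that upper-bounds the objective and matches it at the current iterate, the recorded losses should be non-increasing. Whether this particular shrinking rule satisfies the summable optimality-gap condition mentioned in the abstract is not checked here.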

Paper Structure

This paper contains 18 sections, 17 theorems, 81 equations, 4 figures, 1 table.

Key Result

Theorem 2.1

Suppose Assumptions A1 through A4 hold. Let $(\boldsymbol{\theta}_{n})_{n\ge 0}$ be a (possibly inexact) output of Algorithm \ref{eq:BMM_DR_highlevel} (BMM-DR). Then the following hold: (See \ref{eq:C_A3} and below for explicit expressions for the constants $M,c$.)

Figures (4)

  • Figure 1: Illustration of the proof of Theorem \ref{thm:complexity}(iii) with diminishing radius.
  • Figure 2: Comparison of BMM-DR with BMM on NMF. $\beta=0.5$ is the diminishing radius parameter used for all algorithms with DR and $\lambda$ is the proximal regularization parameter. The average relative reconstruction error with standard deviation is shown by the lines and shaded regions of respective colors.
  • Figure 3: Comparison of the performance of BMM-DR (Algorithm \ref{eq:BMM_DR_highlevel}) and MUR against BCD and MU for the nonnegative CP-decomposition (NCPD) problem. BCD (equivalent to ALS) is implemented as \ref{eq:BMM_DR_CTF}, \ref{eq:ALS_CTF} with $r_{n}=\infty$ for $n\ge 1$. BCD-DR is implemented as \ref{eq:BMM_DR_CTF}, \ref{eq:ALS_DR_CTF} with $c'=\lVert \mathbf{X} \rVert_{F}/(1.5\times 10^{5})$ for synthetic data and $c'=\lVert \mathbf{X} \rVert_{F}/(3\times 10^{5})$ for CIFAR-10 data ($\mathbf{X}$ denoting the data tensor). BMM \ref{eq:ALS_PR_CTF} is implemented with a proximal regularizer with parameter $\lambda$. BMM-DR \ref{eq:ALS_DR_PR_CTF} is implemented on top of BMM with diminishing radius parameter $\beta$ and the same $c'$ as BCD-DR. The average relative reconstruction error with standard deviation is shown by the solid lines and shaded regions of respective colors.
  • Figure 4: Comparison of the performance of MUR for the nonnegative matrix factorization (NMF) problem against MU. For MUR in the first (second) row, $\delta$ ($\rho$) is fixed as $10^{-8}$. The number of columns of loading matrices is set to be $r=2$ for synthetic data and $r=15$ for MNIST data. The average relative reconstruction error with standard deviation is shown by the solid lines and shaded regions of respective colors.
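For readers who want a concrete reference point for the MU baseline and the error metric named in the captions above, here is a minimal sketch of the classical Lee and Seung multiplicative updates for the Frobenius-norm NMF objective. It does not reproduce MUR, whose definition is not given in this section. The function names, the `eps` safeguard, the initialization, and the formula $\lVert X - WH\rVert_F / \lVert X\rVert_F$ for the relative reconstruction error are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np


def relative_error(X, W, H):
    """Relative reconstruction error ||X - W H||_F / ||X||_F (assumed definition)."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)


def mu_nmf(X, rank, n_iter=100, eps=1e-12, seed=0):
    """Classical multiplicative updates for min 0.5 * ||X - W H||_F^2, W, H >= 0.

    X is assumed entrywise nonnegative. eps guards against division by
    zero; it is an implementation detail of this sketch.
    """
    rng = np.random.default_rng(seed)
    # Strictly positive initialization so that no entry is stuck at zero.
    W = rng.random((X.shape[0], rank)) + 0.1
    H = rng.random((rank, X.shape[1])) + 0.1
    errors = [relative_error(X, W, H)]
    for _ in range(n_iter):
        # Block H: elementwise rescaling keeps H nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # Block W: same rule with the roles of the factors swapped.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(relative_error(X, W, H))
    return W, H, errors
```

Each multiplicative update can be read as minimizing a separable quadratic majorizer of the block objective, which is why the recorded errors are expected to be non-increasing.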

Theorems & Definitions (35)

  • Theorem 2.1
  • Lemma 4.1: First-order approximation of functions with Lipschitz gradient
  • proof
  • Proposition 4.2
  • proof
  • Proposition 4.3: Monotonicity of objective and Stability of iterates
  • proof
  • Proposition 4.4: Boundedness of iterates
  • proof
  • Proposition 4.5: Finite first variation I
  • ...and 25 more