Efficient Algorithms for Regularized Nonnegative Scale-invariant Low-rank Approximation Models

Jeremy E. Cohen, Valentin Leplat

TL;DR

The paper develops the Homogeneous Regularized Scale-Invariant (HRSI) framework for nonnegative low-rank approximations, revealing that scale invariance induces an implicit balancing among regularization terms. It derives the optimal column-wise scaling, connects explicit regularization to an implicit scale-free penalty, and analyzes how common penalties (e.g., ridge, $\ell_1$) affect the rank and sparsity of the factors. A generic Majorization-Minimization (MM) meta-algorithm with convergence guarantees for beta-divergence losses and $\ell_p^p$ regularizations is proposed, including a balancing step that accelerates convergence. The authors demonstrate the theory through synthetic experiments and real applications (sNMF, rNCPD, sNTD, including music segmentation) and provide open-source Python implementations to facilitate adoption.

Abstract

Regularized nonnegative low-rank approximations, such as sparse Nonnegative Matrix Factorization or sparse Nonnegative Tucker Decomposition, form an important branch of dimensionality reduction models known for their enhanced interpretability. From a practical perspective, however, selecting appropriate regularizers and regularization coefficients, as well as designing efficient algorithms, remains challenging due to the multifactor nature of these models and the limited theoretical guidance available. This paper addresses these challenges by studying a more general model, the Homogeneous Regularized Scale-Invariant model. We prove that the scale-invariance inherent to low-rank approximation models induces an implicit regularization effect that balances solutions. This insight provides a deeper understanding of the role of regularization functions in low-rank approximation models, informs the selection of regularization hyperparameters, and enables the design of balancing strategies to accelerate the empirical convergence of optimization algorithms. Additionally, we propose a generic Majorization-Minimization (MM) algorithm capable of handling $\ell_p^p$-regularized nonnegative low-rank approximations with non-Euclidean loss functions, with convergence guarantees. Our contributions are demonstrated on sparse Nonnegative Matrix Factorization, ridge-regularized Nonnegative Canonical Polyadic Decomposition, and sparse Nonnegative Tucker Decomposition.
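
The balancing effect mentioned in the abstract can be illustrated on a small special case. The derivation below is a sketch under stated assumptions, not the paper's general statement: it assumes two factors that both carry an $\ell_1$ penalty, nonzero columns, and a data-fitting term that is unchanged when column $q$ of $X_1$ is multiplied by $t>0$ while column $q$ of $X_2$ is divided by $t$. The symbols $a$, $b$, $t$ and $\phi$ are introduced here for illustration only.

```latex
\begin{aligned}
a &= \|X_1[:,q]\|_1, \qquad b = \|X_2[:,q]\|_1, \qquad a, b > 0, \\
\phi(t) &= \mu_1\, t\, a + \frac{\mu_2}{t}\, b, \qquad t > 0, \\
\phi'(t^\star) &= \mu_1 a - \frac{\mu_2 b}{(t^\star)^2} = 0
  \;\Longrightarrow\; t^\star = \sqrt{\frac{\mu_2\, b}{\mu_1\, a}}, \\
\mu_1\, t^\star a &= \frac{\mu_2}{t^\star}\, b = \sqrt{\mu_1 \mu_2\, a\, b},
  \qquad \phi(t^\star) = 2\sqrt{\mu_1 \mu_2\, a\, b}.
\end{aligned}
```

In this special case the two penalty terms are equal at the optimal scaling, and the minimized penalty depends on $\mu_1$ and $\mu_2$ only through their product. This is consistent with the observation reported for Figure 2 that varying $\mu_2$ with $\mu_1$ fixed does not change the sparsity ratio of the two factors.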

Paper Structure (46 sections, 8 theorems, 103 equations, 6 figures, 1 table, 3 algorithms)

Key Result

Proposition 1

Let $\mu_j=0$ for at least one index $j\leq n$. Then the infimum of the function is equal to $\underset{X_i\in\mathbb{R}^{m_i\times r}}{\inf}~f(\{X_i\}_{i\leq n})$. Moreover, the infimum is not attained unless $f$ attains its minimum at $X_i=0$ for all $i\leq n$.
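
The mechanism behind this statement can be sketched on a two-factor example. The lines below are an illustration under assumptions that the excerpt does not spell out: $f$ is taken to be scale-invariant, the regularized function is taken to be $f + \mu_1 g_1(X_1) + \mu_2 g_2(X_2)$ with $\mu_2 = 0$, and $g_1$ is taken to be nonnegative and positively homogeneous of some degree $p_1 > 0$ (the symbols $g_1$, $p_1$ and $t$ are introduced here for illustration).

```latex
\begin{aligned}
f(tX_1,\, t^{-1}X_2) &= f(X_1, X_2) \quad \text{for all } t > 0, \\
f(tX_1,\, t^{-1}X_2) + \mu_1\, g_1(tX_1)
  &= f(X_1, X_2) + t^{p_1} \mu_1\, g_1(X_1)
  \;\xrightarrow[t \to 0^+]{}\; f(X_1, X_2).
\end{aligned}
```

Under these assumptions the penalty can be made arbitrarily small without changing the fit, by shrinking the penalized factor and growing the unpenalized one. The regularized infimum therefore coincides with the infimum of $f$ alone, but the limit $t \to 0^+$ is not reached by any finite pair of factors whose penalty is nonzero.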

Figures (6)

  • Figure 1: Convergence of ALS on the toy problem. We choose $y=10$ and $\lambda=10^{-3}$, and run ALS for 20000 iterations. Top left: cost function $f$ with respect to the iteration index. Top right: values of $x_1$ (orange), $x_2$ (blue) and $\sqrt{y-\lambda}$ (magenta). Bottom left: value of $x_1x_2$ (blue) and $y-\lambda$ (magenta). Bottom right: empirical value of the cost function variation (blue) and theoretical value $16\frac{\lambda^2}{y}e_k^2$ (magenta). Several observations can be made. The value of $x_1x_2$ rapidly converges towards $y-\lambda$ (bottom left graph). However, the cost needs at least 5000 iterations to converge, and the individual values of $x_1$ and $x_2$ approach the optimum $\sqrt{y-\lambda}$ much more slowly than $x_1x_2$ approaches $y-\lambda$. The bottom right graph shows a linear convergence rate with a multiplicative constant very close to $1$, and a close match between the theoretical approximate cost decrease and the observed decrease.
  • Figure 2: Results for the simulated experiment on sNMF. The left plot shows the loss after the stopping criterion is reached, plotted against the regularization parameter $\mu_2$ while $\mu_1=1$. The right plot displays the ratio of the sparsity of the estimated matrix $X_1$ to that of the estimated matrix $X_2$. Results are presented for three scenarios: balancing used only at initialization (blue), balancing applied at each iteration (red), and no balancing (green). Notably, the left plot indicates that balancing reduces the loss function value upon reaching the convergence criterion, particularly for the weakest regularizations. In the right plot, it is observed that tuning the regularization hyperparameter $\mu_2$ while fixing $\mu_1=1$ does not affect the sparsity ratio of the parameter matrices $X_2$ and $X_1$, confirming that the sparsity of matrix $X_2$ cannot be adjusted independently of matrix $X_1$.
  • Figure 3: Results for the simulated experiment on ridge Nonnegative CPD. The left plot shows the normalized loss after the stopping criterion is reached, plotted against the regularization parameter $\mu$. The right plot shows the number of components in the estimated rNCPD model against the same parameter. Results are shown in blue for balancing used only at initialization, in red for balancing at each iteration, and in green for no balancing. The left plot shows that balancing helps reduce the loss function value when the convergence criterion is reached. This reduction is significant in particular for small regularization parameter $\mu$ and when balancing is performed at each iteration, although balancing only the initialization already improves over no balancing. The right plot shows that, for a wide range of ridge regularization hyperparameter values, the number of estimated components in rNCPD matches the true rank $r=4$. Therefore, ridge regularization has an implicit group-sparse action on the rank-one components, as predicted by the theory. This phenomenon happens regardless of the balancing procedure.
  • Figure 4: Results for the simulated experiment on sparse NTD. The normalized loss after 500 outer iterations is presented in the top left plot, showing its relationship with the regularization parameter $\mu$. The top right plot displays the scaled sparsity of the estimated core tensor against the regularization parameters, while the bottom plot illustrates the number of nonzero components along the first mode of the estimated core. Results are shown in blue for balancing used only at initialization, in red for balancing at each iteration (using scalar balancing), and in green for no balancing. The top left plot indicates that balancing effectively reduces the loss function value upon reaching the convergence criterion, with a notable reduction for small regularization parameter $\mu$, especially when balancing is applied at each iteration. Even balancing only at initialization improves results compared to no balancing. In the top right and bottom plots, $\ell_1$ regularization on the core tensor promotes sparsity and helps select the appropriate number of components, $r_1=4$. Balancing, whether at initialization or during iterations, broadens the range of regularization values for which components are effectively pruned.
  • Figure 5: Results for the sNTD experiment on music redundancy detection. Top: the loss function for each tested algorithm is shown across iterations for various values of the regularization hyperparameters. Bottom: the reconstructed matrix $X_3$ is shown, transposed, for each tested algorithm and various values of the regularization hyperparameter. Green vertical bars indicate an expert segmentation of the song (Come-Together). The number of patterns (rows) is hard to estimate a priori, so ideally the estimated matrix $X_3$ should have sparse rows to prune unnecessary patterns. Moreover, it should be group sparse between the green bars to allow for easier segmentation by downstream methods such as dynamic programming or K-means.
  • ...and 1 more figure
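
The slow convergence described for Figure 1 can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code. The toy equation itself is not reproduced in this summary, so the cost $(y - x_1x_2)^2 + \lambda(x_1^2 + x_2^2)$ is an assumption, chosen because its minimizer satisfies $x_1 = x_2 = \sqrt{y-\lambda}$ as stated in the caption. The function names `toy_cost` and `als`, the initial value `x2_init`, and the balancing rule (replacing both variables by the square root of their product, which keeps $x_1x_2$ fixed while minimizing the ridge term) are likewise illustrative choices.

```python
import math


def toy_cost(y, lam, x1, x2):
    """Assumed toy cost: (y - x1*x2)^2 + lam * (x1^2 + x2^2)."""
    return (y - x1 * x2) ** 2 + lam * (x1 ** 2 + x2 ** 2)


def als(y, lam, x2_init, n_iter, balance=False):
    """Alternating exact minimization over x1, then x2.

    Each update is the closed-form minimizer of the assumed cost with
    the other variable held fixed. If `balance` is True, both variables
    are replaced by sqrt(x1 * x2) after each sweep, which leaves the
    product unchanged and minimizes the ridge penalty for that product.
    """
    x2 = x2_init
    x1 = 0.0  # overwritten by the first update
    for _ in range(n_iter):
        x1 = y * x2 / (x2 ** 2 + lam)
        x2 = y * x1 / (x1 ** 2 + lam)
        if balance:
            x1 = x2 = math.sqrt(x1 * x2)
    return x1, x2


y, lam = 10.0, 1e-3            # values from the Figure 1 caption
target = math.sqrt(y - lam)    # optimum stated in the caption

# Deliberately unbalanced start (hypothetical choice).
x1_plain, x2_plain = als(y, lam, x2_init=0.01, n_iter=100)
x1_bal, x2_bal = als(y, lam, x2_init=0.01, n_iter=100, balance=True)

print("target        :", target)
print("plain ALS     :", x1_plain, x2_plain, toy_cost(y, lam, x1_plain, x2_plain))
print("balanced ALS  :", x1_bal, x2_bal, toy_cost(y, lam, x1_bal, x2_bal))
```

With this unbalanced start, plain ALS brings the product $x_1x_2$ close to $y-\lambda$ almost immediately, while $x_1$ and $x_2$ individually remain far from $\sqrt{y-\lambda}$ after 100 sweeps. The balanced variant reaches the optimum within a few sweeps, which is the kind of acceleration the balancing strategies aim for.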

Theorems & Definitions (12)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 1
  • Theorem 2
  • Definition 1
  • Proposition 4
  • proof
  • Theorem 3
  • proof
  • ...and 2 more