Simple Alternating Minimization Provably Solves Complete Dictionary Learning

Geyu Liang, Gavin Zhang, Salar Fattahi, Richard Y. Zhang

TL;DR

This work tackles noiseless complete dictionary learning by proposing a simple alternating minimization scheme that converges linearly to the ground truth under mild initialization. A data-driven preconditioning step makes the method effective for general complete dictionaries and enables scalable mini-batch and online updates without reliance on incoherence or RIP assumptions. Theoretical guarantees accompany practical algorithms, including exact support recovery and convergence bounds with explicit sample complexities, plus a warm-start initialization strategy. Empirical results on synthetic and real data demonstrate fast convergence and superior performance in image denoising and inpainting compared to KSVD and related methods, highlighting both theoretical and practical impact for large-scale dictionary learning.

Abstract

This paper focuses on the noiseless complete dictionary learning problem, where the goal is to represent a set of given signals as linear combinations of a small number of atoms from a learned dictionary. There are two main challenges faced by theoretical and practical studies of dictionary learning: the lack of theoretical guarantees for heuristic algorithms used in practice and their poor scalability when dealing with huge-scale datasets. To address these issues, we propose a simple and efficient algorithm that provably recovers the ground truth when applied to the nonconvex and discrete formulation of the problem in the noiseless setting. We also extend our proposed method to mini-batch and online settings where the data is huge-scale or arrives continuously over time. At the core of our proposed method lies an efficient preconditioning technique that transforms the unknown dictionary to a near-orthonormal one, for which we prove that a simple alternating minimization technique converges linearly to the ground truth under minimal conditions. Our numerical experiments on synthetic and real datasets showcase the superiority of our method compared with existing techniques.
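
As a rough illustration of the pipeline the abstract describes, the Python sketch below whitens the data and then alternates between a sparse-coding step and a dictionary update. The abstract does not spell out either step, so the specific choices here are assumptions: whitening by $(\frac{1}{p\theta}\boldsymbol{Y}\boldsymbol{Y}^\top)^{-1/2}$ as the preconditioner, entrywise hard thresholding for the codes, and the orthogonal polar factor of $\boldsymbol{Y}\boldsymbol{X}^\top$ for the dictionary (the last two are suggested only by the lemma titles on exact support recovery and polar decomposition). The function names and the threshold `zeta` are hypothetical.

```python
import numpy as np

def precondition(Y, theta):
    """Whiten Y so that the effective dictionary is near-orthonormal.

    Assumed form: multiply Y by (Y Y^T / (p * theta))^(-1/2).
    """
    n, p = Y.shape
    cov = (Y @ Y.T) / (p * theta)
    w, V = np.linalg.eigh(cov)            # cov is symmetric positive definite
    inv_sqrt = (V / np.sqrt(w)) @ V.T     # V diag(w^(-1/2)) V^T
    return inv_sqrt @ Y

def hard_threshold(Z, zeta):
    """Zero out every entry whose magnitude is at most zeta."""
    return np.where(np.abs(Z) > zeta, Z, 0.0)

def polar_factor(M):
    """Orthogonal factor of the polar decomposition of a square matrix."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

def alternating_minimization(Y, D0, zeta, num_iters=50):
    """Alternate a thresholded sparse-coding step with a polar dictionary update."""
    D = D0
    for _ in range(num_iters):
        X = hard_threshold(D.T @ Y, zeta)  # sparse code for the current dictionary
        D = polar_factor(Y @ X.T)          # closest orthogonal dictionary for these codes
    return D, hard_threshold(D.T @ Y, zeta)
```

For data generated from an orthogonal dictionary, one would call `alternating_minimization(Y, D0, zeta)` directly; for a general complete dictionary, `precondition(Y, theta)` would be applied to `Y` first, under the assumptions stated above.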

Paper Structure

This paper contains 33 sections, 11 theorems, 77 equations, 7 figures, 5 algorithms.

Key Result

Theorem 1

Suppose that $\boldsymbol{Y} = \boldsymbol{D}^* \boldsymbol{X}^*$ where $\boldsymbol{D}^*$ is orthogonal and $\boldsymbol{X}^*$ satisfies Assumption \ref{assump:sparse} and normalization with sparsity level $0 < \theta < 1/2$. Suppose that the initial dictionary $\boldsymbol{D}^{(0)}$ satisfies $\|\boldsy

Figures (7)

  • Figure 1: A comparison of image denoising using dictionaries learned via our proposed method (Algorithm \ref{alg: complete mini-batch}) and via KSVD. We choose a random landscape image and artificially corrupt 50% of the pixels. Reconstruction is done via orthogonal matching pursuit with the learned dictionaries. The corrupted original image is shown on the right, and the two reconstructed images are shown on the left. The dictionary learned via our method achieves a much better denoising result than the one learned via KSVD. See Section 4.2 for the details of the setup.
  • Figure 2: The plots above show the iterates of Algorithm \ref{alg: initialization} with $p=100$, $n=5$ and $\theta=0.3$. The left panel shows the error in the sparse code. The right panel shows the number of non-zero entries of $\mathrm{Supp}(\boldsymbol{X}^*)-\mathrm{Supp}(\boldsymbol{X}^{(t)})$. The number is 126 at the beginning, which is the total number of non-zero entries in $\boldsymbol{X}^*$, and 0 at the end, which indicates full recovery of the support of $\boldsymbol{X}^*$.
  • Figure 3: We compare three dictionary learning methods in terms of running time and final error at convergence. The results above are averaged over 5 independent trials. All methods use the same initial point. The stopping criterion is that consecutive iterates are close to each other ($\|\boldsymbol{D}^{(t)}-\boldsymbol{D}^{(t-1)}\|_2\le 10^{-7}$).
  • Figure 4: Results of Algorithm \ref{alg: complete offline} with $n=5$, $\theta = 0.3$, $\tilde{p}=100$, and varying $p$. The number of iterations needed to reach convergence varies with the sample size, since it depends on the distance between the ground truth and the initialization.
  • Figure 5: (left) Final error of Algorithm 2.2 in relation to condition number $\kappa(\boldsymbol{A}^*)$ and varied noise levels $\beta$ for a fixed sample size of $\tilde{p}=p=10^5$. (right) Required sample size to achieve $\frac{\left\lVert\boldsymbol{A}^{(T)}-\boldsymbol{A}^*\right\rVert_F}{\left\lVert\boldsymbol{A}^*\right\rVert_F}\leq 0.1$ as a function of the condition number $\kappa(\boldsymbol{A}^*)$. In both settings, we fix $n=5$, $\theta = 0.3$, and $T=1000$.
  • ...and 2 more figures
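
The quantities tracked in these captions can be computed with a few lines of NumPy. The sketch below is illustrative only: the helper names are hypothetical, the subscript-2 matrix norm in the Figure 3 stopping rule is assumed to be the spectral norm, and the support count in Figure 2 is assumed to mean the entries that are non-zero in $\boldsymbol{X}^*$ but zero in $\boldsymbol{X}^{(t)}$.

```python
import numpy as np

def missed_support_count(X_true, X_est):
    """Count entries in Supp(X_true) that are missing from Supp(X_est)."""
    return int(np.count_nonzero((X_true != 0) & (X_est == 0)))

def has_converged(D_curr, D_prev, tol=1e-7):
    """Stopping rule: consecutive iterates are within tol in spectral norm (assumed)."""
    return bool(np.linalg.norm(D_curr - D_prev, ord=2) <= tol)

def relative_error(A_est, A_true):
    """Relative Frobenius error ||A_est - A_true||_F / ||A_true||_F."""
    return float(np.linalg.norm(A_est - A_true, "fro") / np.linalg.norm(A_true, "fro"))
```

With an all-zero initial code, `missed_support_count` returns the total number of non-zero entries of the true code, which matches the behaviour described for the start of the run in Figure 2.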

Theorems & Definitions (11)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 1: Exact support recovery
  • Lemma 2: Guaranteed improvement on polar decomposition
  • Lemma 3: Bounding preconditioner error
  • Lemma 4: Spectral property for sparse code matrix
  • Lemma 5: Approximation error for dictionary (see the proof of Lemma 2.4 in ravishankar2020analysis)
  • Theorem 4: Concentration of sample covariance matrix (vershynin2018high)
  • Theorem 5: Concentration of norm (vershynin2018high)
  • ...and 1 more