Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth

Kevin Kögler, Alexander Shevchenko, Hamed Hassani, Marco Mondelli

TL;DR

This work analyzes two-layer autoencoders trained with gradient descent for compressing structured data, focusing on the 1-bit compression of sparse Gaussian inputs. It shows that, with a linear decoder, the learned solution ignores sparsity and attains the Gaussian MSE $\mathcal{R}_{\rm Gauss}$, and it uncovers a sparsity-driven phase transition in the gradient descent minimizer, from a random rotation to the identity (up to a permutation). By leveraging a connection to RI-GAMP, the paper demonstrates that nonlinear denoising and deeper decoding can surpass Gaussian performance, with a Bayes-optimal benchmark attained by carefully designed multi-layer decoders. Empirical results on CIFAR-10, MNIST, and masked images corroborate the theory and highlight practical gains from nonlinearities and depth for structured data compression.

Abstract

Autoencoders are a prominent model in many empirical branches of machine learning and lossy data compression. However, basic theoretical questions remain unanswered even in a shallow two-layer setting. In particular, to what degree does a shallow autoencoder capture the structure of the underlying data distribution? For the prototypical case of the 1-bit compression of sparse Gaussian data, we prove that gradient descent converges to a solution that completely disregards the sparse structure of the input. Namely, the performance of the algorithm is the same as if it were compressing a Gaussian source with no sparsity. For general data distributions, we give evidence of a phase transition phenomenon in the shape of the gradient descent minimizer, as a function of the data sparsity: below the critical sparsity level, the minimizer is a rotation taken uniformly at random (just like in the compression of non-sparse data); above the critical sparsity, the minimizer is the identity (up to a permutation). Finally, by exploiting a connection with approximate message passing algorithms, we show how to improve upon Gaussian performance for the compression of sparse data: adding a denoising function to a shallow architecture already provably reduces the loss, and a suitable multi-layer decoder leads to a further improvement. We validate our findings on image datasets, such as CIFAR-10 and MNIST.
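
As a rough numerical illustration of the Gaussian baseline, the sketch below estimates the MSE of a 1-bit autoencoder on standard Gaussian data. It assumes the form $\hat{\boldsymbol x} = {\boldsymbol A}\,\mathrm{sign}({\boldsymbol B}{\boldsymbol x})$, an encoder ${\boldsymbol B}$ whose rows are taken from a random rotation, and a tied decoder ${\boldsymbol A} = c\,{\boldsymbol B}^\top$ with $c=\sqrt{2/\pi}$. These choices, the dimensions, and the helper names `haar_rows` and `sign_autoencoder_mse` are illustrative assumptions, not the authors' code. The estimate is compared with the value $\mathcal{R}_{\rm Gauss}=1-2r/\pi$ quoted in the caption of Figure 1.

```python
import numpy as np

def haar_rows(n, d, rng):
    """Return an n x d matrix with orthonormal rows, taken from a random rotation."""
    q, upper = np.linalg.qr(rng.standard_normal((d, d)))
    # Fix the column signs so that q follows the Haar distribution.
    q = q * np.sign(np.diag(upper))
    return q[:n]

def sign_autoencoder_mse(x, B, c):
    """Per-coordinate MSE of x_hat = c * B^T sign(B x) for a batch x of shape (N, d)."""
    z = np.sign(x @ B.T)      # 1-bit codes, shape (N, n)
    x_hat = c * (z @ B)       # tied linear decoder A = c * B^T (assumption)
    return np.mean((x - x_hat) ** 2)

rng = np.random.default_rng(0)
d, r = 200, 0.5               # input dimension and compression rate n/d (illustrative)
n = int(r * d)
B = haar_rows(n, d, rng)
x = rng.standard_normal((5000, d))   # standard Gaussian data, no sparsity
c = np.sqrt(2 / np.pi)               # minimizes 1 - 2*c*r*E|g| + c^2*r for g ~ N(0, 1)

mse = sign_autoencoder_mse(x, B, c)
predicted = 1 - 2 * r / np.pi
print(f"empirical MSE = {mse:.4f}, 1 - 2r/pi = {predicted:.4f}")
```

The scaling $c=\sqrt{2/\pi}$ comes from minimizing $1 - 2cr\,\mathbb{E}|g| + c^2 r$ over $c$ with $g\sim\mathcal{N}(0,1)$, which gives $1-2r/\pi$ for this tied decoder.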

Paper Structure

This paper contains 46 sections, 23 theorems, 231 equations, 17 figures.

Key Result

Theorem 4.1

Consider the gradient descent algorithm in \ref{eq:body-GDmin-formulas} with ${\boldsymbol x}\sim {\rm SG}_d(p)$ and $({\boldsymbol G}(t))_{i,j} \sim \mathcal{N}(0, \sigma^2)$, where $d^{-\gamma_g}\leq \sigma \leq C/d$ for some fixed $1<\gamma_g<\infty$. Initialize the algorithm with ${\boldsymbol B}(0)$ where $C, c$ are universal constants depending only on $p, r$ and $T_{\rm max}$. Moreover, we have

Figures (17)

  • Figure 1: Compression of sparse Rademacher data via the two-layer autoencoder in \ref{eq:linear_decoding}. We set $d=200$ and $r=1$. Left. MSE achieved by SGD at convergence, as a function of the sparsity level $p$. The empirical values (dots) match our theoretical prediction (blue line): for $p<p_{\mathrm{crit}}$, the loss is equal to the value obtained for Gaussian data, i.e., $\mathcal{R}_{\rm Gauss}=1-2r/\pi$; for $p\ge p_{\mathrm{crit}}$, the loss is smaller, and it is equal to $1 - r \cdot \left({\mathbb{E}} |x_1|\right)^2=1-r\cdot p$. Center. Encoder matrix ${\boldsymbol B}$ at convergence of SGD when $p=0.3<p_{\mathrm{crit}}$: the matrix is a random rotation. Right. Encoder matrix ${\boldsymbol B}$ at convergence of SGD when $p=0.7\ge p_{\mathrm{crit}}$: the negative sign in part of the entries of ${\boldsymbol B}$ is cancelled by the corresponding sign in the entries of ${\boldsymbol A}$; hence, ${\boldsymbol B}$ is equivalent to a permutation of the identity.
  • Figure 2: Compression of sparse Rademacher data via the two-layer autoencoder in \ref{eq:linear_decoding}. We set $d=200$, $r=1$ and $p=0.8$. The MSE is plotted as a function of the number of iterations and, as $p>p_{\mathrm{crit}}$, it displays a staircase behavior.
  • Figure 3: Compression of masked and whitened CIFAR-10 images of the class "dog" via the two-layer autoencoder in \ref{eq:linear_decoding}. First, the data is whitened so that it has identity covariance (as in the setting of Theorem \ref{thm:GD-min-sparse-body}). Then, the data is masked by setting each pixel independently to $0$ with probability $p=0.7$. An example of an original image is on the top right, and the corresponding masked and whitened image is on the bottom right. The SGD loss at convergence (dots) matches the solid line, which corresponds to the prediction in \ref{eq:gaussian_val} for the compression of standard Gaussian data (with no sparsity).
  • Figure 4: Compression of sparse Gaussian data via the autoencoder in \ref{eq:linear_decoding_denoising}, where $f$ has the form in \ref{eq:parametric_denoiser} and its parameters $(\alpha_1, \alpha_2, \alpha_3)$ are optimized via SGD. We set $d=100$ and $p=0.4$. Left. Distance between $\hat{{\boldsymbol B}}\hat{{\boldsymbol B}}^\top$, $\hat{{\boldsymbol B}}\skew{5}\hat{{\boldsymbol A}}$ and the identity, as a function of the number of iterations, where $\hat{{\boldsymbol B}}$, $\skew{5}\hat{{\boldsymbol A}}$ denote the row-normalized versions of ${\boldsymbol B}$, ${\boldsymbol A}$. $\|\hat{{\boldsymbol B}}\hat{{\boldsymbol B}}^\top-{\boldsymbol I}\|_F$ and $\|\hat{{\boldsymbol B}}\skew{5}\hat{{\boldsymbol A}}-{\boldsymbol I}\|_F$ decrease and tend to $0$, meaning that (up to a rescaling of the rows) ${\boldsymbol B}{\boldsymbol A}$ and ${\boldsymbol B}{\boldsymbol B}^\top$ approach the identity. Here, we take $r=1$. Right. MSE achieved by SGD at convergence, as a function of the compression rate $r$. The empirical values (dots) match the characterization of Proposition \ref{proposition:1} for $f=f^*$ in \ref{eq:pmean} (blue line), and they outperform the MSE \ref{eq:gaussian_val} obtained by compressing standard Gaussian data (orange dashed line).
  • Figure 5: Compression of sparse Rademacher data via the autoencoder in \ref{eq:linear_decoding_denoising}. We set $d=200$ and $r=1$. The MSE achieved by SGD at convergence is plotted as a function of the sparsity level $p$. The empirical values (blue dots) match our theoretical prediction (blue line). For $p<\tilde{p}_{\mathrm{crit}}$, the MSE is given by Proposition \ref{proposition:1} for ${\boldsymbol B}$ sampled from the Haar distribution; for $p\ge \tilde{p}_{\mathrm{crit}}$, the MSE is given by Proposition \ref{proposition:sparse_rademacher_id_denoising} for ${\boldsymbol B}$ equal to the identity.
  • ...and 12 more figures
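
The piecewise prediction in the caption of Figure 1 can be sketched in a few lines. The snippet below evaluates the two candidate losses stated there, $1-2r/\pi$ (random rotation) and $1-r\cdot p$ (identity up to a permutation), and switches between them at the point where they coincide. Taking $p_{\mathrm{crit}}=2/\pi$ is an inference from equating the two expressions, not a value quoted in the text above, and the function name `predicted_mse` is hypothetical.

```python
import numpy as np

# Equating 1 - 2r/pi and 1 - r*p gives p = 2/pi for any r > 0 (inferred, see lead-in).
P_CRIT = 2 / np.pi

def predicted_mse(p, r=1.0):
    """Piecewise MSE prediction for sparse Rademacher data, following the Figure 1 caption."""
    p = np.asarray(p, dtype=float)
    gauss = 1 - 2 * r / np.pi        # loss of the random-rotation solution
    identity = 1 - r * p             # loss of the identity solution, 1 - r * (E|x_1|)^2
    return np.where(p < P_CRIT, gauss, identity)

# The two sparsity levels shown in the center and right panels of Figure 1.
for p in (0.3, 0.7):
    regime = "rotation" if p < P_CRIT else "identity"
    print(f"p = {p}: predicted MSE = {float(predicted_mse(p)):.4f} ({regime})")
```

With $r=1$ as in the caption, $p=0.3$ falls in the rotation regime and $p=0.7$ in the identity regime, consistent with the center and right panels.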

Theorems & Definitions (41)

  • Theorem 4.1: Gradient descent does not capture the sparsity
  • Proposition 4.2: Candidate comparison
  • Proposition 5.1: MSE characterization
  • Proposition 5.2
  • Lemma 1.1
  • proof
  • Theorem 1.2
  • Lemma 1.3: Gradient formulas
  • Lemma 1.4: Linear algebra results
  • proof
  • ...and 31 more