High-dimensional Asymptotics of Denoising Autoencoders

Hugo Cui; Lenka Zdeborová

TL;DR

The paper analyzes denoising of data drawn from a high-dimensional Gaussian mixture using a two-layer denoising autoencoder (DAE) with tied weights and a skip connection. Using the replica method under the replica-symmetric (RS) ansatz, it derives sharp closed-form expressions for the denoising test MSE and related metrics as functions of the sample-to-dimension ratio $\alpha$, the noise level $\Delta$, and the architecture hyperparameters, reducing the problem to a finite set of equations on summary statistics. It shows that the full DAE with a skip connection can outperform the PCA-like bottleneck network without one, and that the skip and bottleneck components play complementary roles. The theoretical predictions agree with simulations on synthetic Gaussian mixtures and on real datasets such as MNIST and FashionMNIST. The findings suggest a form of Gaussian universality for denoising and give theoretical guidance for designing shallow nonlinear denoisers that improve on PCA baselines. Overall, the work clarifies how architectural choices in DAEs affect denoising performance in high dimensions and offers a framework for exact, tractable analysis that also applies to practical datasets.

Abstract

We address the problem of denoising data from a Gaussian mixture using a two-layer non-linear autoencoder with tied weights and a skip connection. We consider the high-dimensional limit where the number of training samples and the input dimension jointly tend to infinity while the number of hidden units remains bounded. We provide closed-form expressions for the denoising mean-squared test error. Building on this result, we quantitatively characterize the advantage of the considered architecture over the autoencoder without the skip connection that relates closely to principal component analysis. We further show that our results accurately capture the learning curves on a range of real data sets.
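The architecture described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the functional form (a skip term `b * x` added to a tied-weight bottleneck with `1/sqrt(d)` scaling), the function names, and the per-coordinate normalisation of the MSE are all assumptions, since this summary does not reproduce the model equations.

```python
import math


def bottleneck(x_noisy, w, sigma=math.tanh):
    """Bottleneck network with tied weights (assumed form).

    x_noisy: list of d floats; w: list of p rows, each a list of d floats.
    Encodes with w, applies sigma, then decodes with the same w (transposed).
    """
    d = len(x_noisy)
    scale = 1.0 / math.sqrt(d)
    # One hidden activation per hidden unit.
    hidden = [
        sigma(scale * sum(w_ki * x_i for w_ki, x_i in zip(w_k, x_noisy)))
        for w_k in w
    ]
    # Decode with the same weights: this is what "tied" means here.
    return [scale * sum(h * w_k[i] for h, w_k in zip(hidden, w)) for i in range(d)]


def rescaling(x_noisy, c):
    """Skip-connection-only component: multiplies the input by one scalar c."""
    return [c * x_i for x_i in x_noisy]


def dae(x_noisy, b, w, sigma=math.tanh):
    """Full DAE (assumed form): skip term b * x plus the bottleneck output."""
    u = bottleneck(x_noisy, w, sigma)
    return [b * x_i + u_i for x_i, u_i in zip(x_noisy, u)]


def mse(x_hat, x_clean):
    """Squared reconstruction error, divided by d (normalisation is a choice)."""
    return sum((a - t) ** 2 for a, t in zip(x_hat, x_clean)) / len(x_clean)
```

Setting `b = 0` recovers the bottleneck network alone, and setting `w` to zero recovers the pure rescaling, which is how the two components compared in the paper relate to the full model in this sketch.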

Paper Structure (57 sections, 2 theorems, 131 equations, 11 figures)


Setting
Data model
DAE model
Learning metrics
High-dimensional limit
Asymptotic formulae for DAEs
Example 1: Isotropic homoscedastic mixture
Example 2: MNIST, FashionMNIST
The role and importance of the skip connection
Full DAE and the rescaling component
DAEs with(out) skip connection
A tradeoff between the rescaling and the bottleneck network
Derivation of Result \ref{result:Main_formulae}
Derivation technique
Replicated partition function
...and 42 more sections

Key Result

Corollary 2.4

(MSE of components) The test MSE of $\hat{r}$ (\ref{eq:building_blocks}) is given by $\mathrm{mse}_{\hat{r}}=\mathrm{mse}_\circ$ (\ref{eq:replica_scal}). Furthermore, the learnt value of its single parameter $\hat{c}$ is given by (\ref{eq:replica_b}). The test MSE, cosine similarity and summary statistics of the bottleneck n...
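As a rough illustration of why a one-parameter rescaling admits a closed-form error, the following worked computation finds the population-optimal scalar. It is a sketch under assumptions that this summary does not state: the noisy input is taken to be $\tilde{\boldsymbol{x}}=\sqrt{1-\Delta}\,\boldsymbol{x}+\sqrt{\Delta}\,\boldsymbol{\xi}$ with $\boldsymbol{\xi}\sim\mathcal{N}(0,\mathbb{I}_d)$ independent of $\boldsymbol{x}$, and $m=\mathbb{E}\lVert\boldsymbol{x}\rVert^2$ is introduced here as shorthand. It is not the paper's formula (\ref{eq:replica_b}), which is not reproduced in this summary.

```latex
\begin{align}
\mathbb{E}\,\lVert \boldsymbol{x} - c\,\tilde{\boldsymbol{x}} \rVert^2
  &= \bigl(1 - c\sqrt{1-\Delta}\bigr)^2 m + c^2 \Delta\, d, \\
\frac{\partial}{\partial c} = 0
  \;\Longrightarrow\;
c^\star &= \frac{\sqrt{1-\Delta}\; m}{(1-\Delta)\, m + \Delta\, d}, \\
\mathbb{E}\,\lVert \boldsymbol{x} - c^\star \tilde{\boldsymbol{x}} \rVert^2
  &= \frac{m\, \Delta\, d}{(1-\Delta)\, m + \Delta\, d}.
\end{align}
```

The cross term vanishes because the noise is independent of the data and has zero mean, which is the only property of the noise used here.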

Figures (11)

Figure 1: $\alpha=1$, $K=2$, $\rho_{1,2}=1/2$, $\boldsymbol{\Sigma}_{1,2}=0.09\times \mathbb{I}_d$, $p=1$, $\lambda=0.1$, $\sigma(\cdot)=\tanh(\cdot)$; the cluster mean $\boldsymbol{\mu}_1=-\boldsymbol{\mu}_2$ was taken as a random Gaussian vector of norm $1$. (left) In blue, the difference in MSE between the full DAE $\hat{f}$ (\ref{eq:DAE}) and the rescaling component $\hat{r}$ (\ref{eq:building_blocks}). Solid lines correspond to the sharp asymptotic characterization of Result \ref{result:Main_formulae}. Dots represent numerical simulations for $d=700$, training the DAE using the PyTorch implementation of full-batch Adam, with learning rate $\eta=0.05$ over $2000$ epochs, averaged over $N=10$ instances. Error bars represent one standard deviation. For completeness, the MSE of the oracle denoiser is given as a baseline in green, see Section \ref{sec:Architecture}. The performance of a linear DAE ($\sigma(x)=x$) is represented in dashed red. (right) Cosine similarity $\theta$ (\ref{eq:cosine_similarity}) (green), squared weight norm $\lVert \hat{\boldsymbol{w}}\rVert^2_F/d$ (red) and skip connection strength $\hat{b}$ (blue). Solid lines correspond to the formulae (\ref{eq:replica_theta}), (\ref{eq:replica_op}) and (\ref{eq:replica_b}) of Result \ref{result:Main_formulae}; dots are numerical simulations. For completeness, the cosine similarity of the first principal component of the clean train data $\{\boldsymbol{x}^\mu\}_{\mu=1}^n$ is plotted in dashed black.
Figure 2: Difference in MSE between the full DAE (\ref{eq:DAE}) and the rescaling component (\ref{eq:building_blocks}) for the MNIST data set (middle), of which for simplicity only $1$s and $7$s were kept, and FashionMNIST (right), of which only boots and shoes were kept. In blue, the theoretical predictions resulting from using Result \ref{result:Main_formulae} with the empirically estimated covariances and means, see Appendix \ref{App:Real} for further details. In red, numerical simulations of a DAE ($p=1$, $\sigma=\tanh$) trained with $n=784$ training points, using the PyTorch implementation of full-batch Adam, with learning rate $\eta=0.05$ and weight decay $\lambda=0.1$ over $2000$ epochs, averaged over $N=10$ instances. Error bars represent one standard deviation. (left) Illustration of the denoised images: (top left) original image, (top right) noisy image, (bottom left) DAE $\hat{f}$ (\ref{eq:DAE}), (bottom right) rescaling $\hat{r}$ (\ref{eq:building_blocks}).
Figure 3: (left) Solid lines: difference in MSE between the full DAE $\hat{f}$ (\ref{eq:DAE}), with $\sigma=\tanh$, $p=1$, and the rescaling $\hat{r}$ (\ref{eq:building_blocks}). Dashed: the same curve for the oracle denoiser. Different colours represent different sample complexities $\alpha$ (solid lines). (right) Difference in MSE between the bottleneck network $\hat{u}$ (\ref{eq:building_blocks}) and the complete DAE $\hat{f}$ (\ref{eq:DAE}). In blue, the theoretical prediction (\ref{eq:gap_noskip}); in red, numerical simulations for the bottleneck network (\ref{eq:building_blocks}) ($\sigma=\tanh$, $p=1$) trained with the PyTorch implementation of full-batch Adam, with learning rate $\eta=0.05$ and weight decay $\lambda=0.1$ over $2000$ epochs, averaged over $N=5$ instances, for $d=700$. In green, the MSE (minus the MSE of the complete DAE (\ref{eq:DAE})) achieved by PCA. Error bars represent one standard deviation. The model and parameters are the same as in Fig. \ref{fig:synthetic}.
Figure 4: Illustration of the denoised image for the various networks and algorithms. (a) original image, (b) noisy image, for $\sqrt{\Delta}=0.2$, (c) trained rescaling $\hat{r}$ (\ref{eq:building_blocks}), (d) full DAE $\hat{f}$ (\ref{eq:DAE}), (e) bottleneck network $\hat{u}$ (\ref{eq:building_blocks}), (f) PCA. The DAE and training parameters are the same as in Fig. \ref{fig:MNIST}, see also Appendix \ref{App:Real}.
Figure 5: (left) Training MSE for the full DAE (\ref{eq:DAE}) ($p=1,~\sigma=\tanh$). Solid lines represent the sharp asymptotic formula (\ref{App:repl:et}); dots correspond to simulations, training the DAE with the $\texttt{PyTorch}$ implementation of full-batch Adam, over $T=2000$ epochs using learning rate $\eta=0.05$ and weight decay $\lambda=0.1$. The data was averaged over $N=5$ instances; error bars are smaller than the point size. (right) Generalization gap $\mathrm{mse}_{\hat{f}}-\epsilon_t$. Solid lines correspond to the asymptotic prediction of Result \ref{result:Main_formulae} (for the test MSE) and of (\ref{App:repl:et}) (for the train MSE), while dots correspond to simulations. Error bars represent one standard deviation. The Gaussian mixture is the isotropic binary mixture, whose parameters are specified in the caption of Fig. \ref{fig:synthetic} in the main text.
...and 6 more figures
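The synthetic setup in the Figure 1 caption (balanced binary mixture with opposite means of norm $1$, covariance $0.09\times\mathbb{I}_d$) and the cosine similarity reported there can be sketched as follows. This is an illustration only: the noise model `sqrt(1 - delta) * x + sqrt(delta) * xi` is an assumption not stated in this summary, all function names are hypothetical, and the value `delta = 0.04` in the usage lines is an arbitrary example value. The sketch does not train a DAE.

```python
import math
import random


def random_unit_vector(d, rng):
    """Random Gaussian vector rescaled to norm 1 (used as the cluster mean mu)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def sample_clean(mu, var, rng):
    """One draw from the balanced mixture with means +mu / -mu, covariance var * I."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    std = math.sqrt(var)
    return [sign * m + std * rng.gauss(0.0, 1.0) for m in mu]


def corrupt(x, delta, rng):
    """ASSUMED noise model: sqrt(1 - delta) * x + sqrt(delta) * standard Gaussian."""
    a, s = math.sqrt(1.0 - delta), math.sqrt(delta)
    return [a * x_i + s * rng.gauss(0.0, 1.0) for x_i in x]


def cosine_similarity(w, mu):
    """Cosine of the angle between a single weight vector w (p = 1) and mu."""
    dot = sum(a * b for a, b in zip(w, mu))
    norm_w = math.sqrt(sum(a * a for a in w))
    norm_mu = math.sqrt(sum(b * b for b in mu))
    return dot / (norm_w * norm_mu)


# Usage: a handful of clean/noisy pairs at the dimension used in the caption.
rng = random.Random(0)
d = 700
mu = random_unit_vector(d, rng)
clean = [sample_clean(mu, 0.09, rng) for _ in range(10)]
noisy = [corrupt(x, 0.04, rng) for x in clean]
```

A trained weight vector would be passed as `w` to `cosine_similarity`; a value near $\pm 1$ means the hidden unit has aligned with the cluster mean direction.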

Theorems & Definitions (2)

Corollary 2.4
Corollary 2.5
