On the Convergence Analysis of Over-Parameterized Variational Autoencoders: A Neural Tangent Kernel Perspective

Li Wang; Wei Huang

TL;DR

This work addresses the convergence of over-parameterized VAEs that incorporate stochastic neural networks by leveraging Neural Tangent Kernel (NTK) theory. By analyzing the training dynamics in the infinite-width limit, it shows the dynamics linearize around a limiting kernel $\boldsymbol{\Theta}^{(\infty)}$ with a positive smallest eigenvalue $\lambda_0$, yielding linear convergence rates for the reconstruction loss. Incorporating the KL divergence term, the authors establish a direct connection to kernel ridge regression, demonstrating that training under the full VAE objective converges to a KRR-like solution with explicit regularization, and provide finite-width error bounds. Empirically, they validate the theory on MNIST and related datasets, finding that larger latent spaces improve disentanglement and enable discovery of new attributes, consistent with the predicted benefits of over-parameterization. Overall, the paper provides a rigorous NTK-based foundation for the optimization and generalization behavior of over-parameterized VAEs and highlights the regularizing role of KL divergence in shaping the learned representations.
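The linear-rate claim can be illustrated in the kernel regime itself. The sketch below is not the paper's construction: it assumes a fixed RBF kernel as a stand-in for the limiting NTK $\boldsymbol{\Theta}^{(\infty)}$, synthetic unit-norm inputs, and a squared reconstruction loss, and the names `rbf_kernel` and `kernel_gradient_descent` are hypothetical. Under those assumptions the residual obeys $r_{t+1} = (I - \eta K) r_t$, so the loss is bounded by $(1 - \eta\lambda_0)^{2t}$ times its initial value.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF Gram matrix; an assumed stand-in for the limiting NTK."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_gradient_descent(K, y, lr, steps):
    """Run u <- u - lr * K (u - y) and record the squared loss at every step."""
    u = np.zeros_like(y)
    losses = [float(np.sum((u - y) ** 2))]
    for _ in range(steps):
        u = u - lr * K @ (u - y)
        losses.append(float(np.sum((u - y) ** 2)))
    return np.array(losses)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs (synthetic data)
y = rng.normal(size=10)

K = rbf_kernel(X)
eigs = np.linalg.eigvalsh(K)       # ascending order
lam0, lam_max = eigs[0], eigs[-1]  # smallest and largest eigenvalue
lr = 1.0 / lam_max                 # keeps every eigenvalue of I - lr*K in [0, 1)

steps = 200
losses = kernel_gradient_descent(K, y, lr, steps)
# Geometric envelope implied by a positive smallest eigenvalue.
bound = losses[0] * (1.0 - lr * lam0) ** (2 * np.arange(steps + 1))
```

Comparing `losses` against `bound` elementwise shows the loss staying under the geometric envelope, which is the role $\lambda_0 > 0$ plays in the convergence statement.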

Abstract

Variational Auto-Encoders (VAEs) have emerged as powerful probabilistic models for generative tasks, yet their convergence has not been rigorously proven. Proving convergence is difficult because the training objective is highly non-convex and because VAE architectures embed a Stochastic Neural Network (SNN). This paper addresses these challenges by characterizing the optimization trajectory of the SNNs used in VAEs through the lens of Neural Tangent Kernel (NTK) techniques, which describe the optimization and generalization behavior of ultra-wide neural networks. We provide a mathematical proof of VAE convergence under mild assumptions, advancing the theoretical understanding of VAE optimization dynamics. Furthermore, we establish a novel connection between the optimization problem faced by over-parameterized SNNs and the Kernel Ridge Regression (KRR) problem. Our findings contribute to the theoretical foundation of VAEs and open new avenues for studying the optimization of generative models with kernel methods. Experimental simulations verify our theoretical claims.
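The KRR connection can be made concrete with a generic linearized model. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: random Gaussian features `Phi` stand in for the network's gradient features at initialization, an L2 penalty with weight `lam` stands in for the regularization contributed by the KL term, and the helper names are hypothetical. In that setting, gradient descent on the penalized squared loss reaches the same predictions as kernel ridge regression with kernel $K = \Phi\Phi^\top$.

```python
import numpy as np

def ridge_gradient_descent(Phi, y, lam, lr, steps):
    """Gradient descent on 0.5*||Phi w - y||^2 + 0.5*lam*||w||^2, starting at w = 0."""
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = Phi.T @ (Phi @ w - y) + lam * w
        w -= lr * grad
    return w

def krr_predict(K_train, K_test_train, y, lam):
    """Closed-form kernel ridge regression: K(x, X) (K + lam*I)^{-1} y."""
    alpha = np.linalg.solve(K_train + lam * np.eye(len(y)), y)
    return K_test_train @ alpha

rng = np.random.default_rng(0)
n, p, lam = 15, 200, 0.5                       # p >> n mimics over-parameterization
Phi = rng.normal(size=(n, p)) / np.sqrt(p)     # assumed features of the training inputs
Phi_test = rng.normal(size=(4, p)) / np.sqrt(p)
y = rng.normal(size=n)

K = Phi @ Phi.T                                # kernel induced by the features
lr = 1.0 / (np.linalg.eigvalsh(K)[-1] + lam)   # step size below 2 / (largest curvature)

w = ridge_gradient_descent(Phi, y, lam, lr, steps=500)
gd_pred = Phi_test @ w
krr_pred = krr_predict(K, Phi_test @ Phi.T, y, lam)
```

The two prediction vectors agree up to numerical precision because the penalized objective is strongly convex, so gradient descent has a unique limit and that limit is the ridge solution expressed in kernel form.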

Paper Structure (17 sections, 7 theorems, 61 equations, 4 figures)

Introduction
Related Work
Convergence Analysis of Over-parameterized Neural Networks
Theoretical study of VAEs
Problem Setup and Preliminary
Notation
Variational Auto-encoder
Stochastic Neural Network and Objective Function
Theoretical Results
Definition and Assumptions
Optimization analysis
Regularization effect of KL divergence
Proof Sketch
Experiments
Theoretical verification
...and 2 more sections

Key Result

Theorem 1

Assume the smallest eigenvalue of the limiting NTK is greater than zero, i.e., $\lambda_{0}(\boldsymbol{\Theta}^\infty) > 0$, and $\| \mathbf{x}^{(e)}_i \|_2 = 1$ for $i\in [n]$. Suppose the network's width $m = \Omega \left( \max \left\{ \frac{n^5 d^3 }{\lambda_0^4 \delta^2 } , \frac{n^2d^2}{\lambda_

Figures (4)

Figure 1: Architecture of Variational Auto-Encoder.
Figure 2: Relative Frobenius norm change in weights after training, where $m$ is the width of the network. Solid lines correspond to empirical simulations and dotted lines are theoretical predictions.
Figure 3: Disentanglement scores for networks with latent dimension $m=10,20,50,100,200$ on dSprites and Cars 3D. Observation: the larger the latent space, the better the disentanglement.
Figure 4: New image attributes discovered by a large-latent-space VAE ($m=256$) but not by a small-latent-space VAE ($m=10$) on the CelebA dataset.
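The quantity plotted in Figure 2, the relative Frobenius norm change in weights, can be measured with a few lines of code. The sketch below is a simplified assumption rather than the paper's experiment: it uses a deterministic two-layer ReLU network with $1/\sqrt{m}$ output scaling instead of the SNN, synthetic unit-norm data, fixed random output signs, and a hypothetical helper `relative_weight_change`. It only shows how $\|W(T) - W(0)\|_F / \|W(0)\|_F$ can be tracked as the width $m$ grows.

```python
import numpy as np

def relative_weight_change(m, X, y, lr=0.1, steps=200, seed=0):
    """Train the hidden layer of f(x) = a^T relu(W x) / sqrt(m) by gradient descent
    on the squared loss and return ||W_T - W_0||_F / ||W_0||_F."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W0 = rng.normal(size=(m, d))          # hidden weights at initialization
    a = rng.choice([-1.0, 1.0], size=m)   # fixed output signs (not trained)
    W = W0.copy()
    for _ in range(steps):
        pre = X @ W.T                                   # (n, m) pre-activations
        out = (np.maximum(pre, 0.0) @ a) / np.sqrt(m)   # (n,) network outputs
        resid = out - y
        # Gradient of 0.5 * sum(resid^2) with respect to W, shape (m, d).
        grad = (((pre > 0) * resid[:, None]).T @ X) * a[:, None] / np.sqrt(m)
        W -= lr * grad
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm synthetic inputs
y = rng.normal(size=10)

widths = [64, 1024, 16384]
changes = [relative_weight_change(m, X, y) for m in widths]
```

In this toy setting the relative change shrinks as the width increases, which is the qualitative lazy-training behavior that NTK analyses rely on.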

Theorems & Definitions (14)

Definition 1: Stochastic Neural Tangent Kernel
Theorem 1
Theorem 2
Lemma 1
Proof of Lemma 1
Lemma 2: NTK at initialization
Proof of Lemma 2
Lemma 3
proof
Lemma 4
...and 4 more
