A Note on Generalization in Variational Autoencoders: How Effective Is Synthetic Data & Overparameterization?
Tim Z. Xiao, Johannes Zenn, Robert Bamler
TL;DR
This work addresses encoder overfitting in variational autoencoders by examining two generalization pathways: leveraging synthetic data from pre-trained diffusion models (DMaaPx) and increasing parameters, particularly near the latent variable. DMaaPx replaces finite training data with unlimited, high-fidelity samples from a diffusion model trained on the dataset, yielding improved generalization, tighter amortization gaps, and greater robustness without altering the standard inference pipeline. The study also shows that expanding latent-adjacent parameters boosts performance, while excessive growth in other parts can degrade it, and provides evidence of double descent under certain parameter-growth trajectories. Collectively, the results offer practical guidance for mitigating encoder overfitting in VAEs and highlight the nuanced effects of model scaling in the presence of synthetic data.
Abstract
Variational autoencoders (VAEs) are deep probabilistic models that are used in scientific applications, but they are prone to overfitting. Many works try to mitigate this problem from the probabilistic methods perspective with new inference techniques or training procedures. In this paper, we instead approach the problem from the deep learning perspective by investigating how effectively synthetic data and overparameterization improve generalization performance. Our motivation comes from (1) the recent discussion on whether the increasing amount of publicly accessible synthetic data will improve or hurt currently trained generative models; and (2) the modern deep learning insight that overparameterization improves generalization. Our investigation shows that both training on samples from a pre-trained diffusion model and using more parameters at certain layers effectively mitigate overfitting in VAEs, thereby improving their generalization, amortized inference, and robustness. Our study provides timely insights in the current era of synthetic data and scaling laws.
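The idea of training on samples from a pre-trained diffusion model instead of a fixed training set can be illustrated with a minimal sketch of the data pipeline alone. This is not the authors' implementation: `finite_batches`, `synthetic_batches`, `count_distinct`, and `toy_sampler` are hypothetical names, and `toy_sampler` is a trivial Gaussian stand-in for a diffusion model's sampler. The sketch only shows that a sampler-backed stream never repeats data points, whereas a finite training set is revisited many times.

```python
import random

def finite_batches(dataset, batch_size, num_steps, rng):
    """Baseline: every minibatch is drawn from the same fixed training set."""
    for _ in range(num_steps):
        yield rng.sample(dataset, batch_size)

def synthetic_batches(sample_fn, batch_size, num_steps):
    """Synthetic-data variant: every minibatch consists of freshly generated samples."""
    for _ in range(num_steps):
        yield [sample_fn() for _ in range(batch_size)]

def count_distinct(batches):
    """Number of distinct data points a model would see across all batches."""
    return len({x for batch in batches for x in batch})

rng = random.Random(0)

# Assumption: a toy Gaussian sampler stands in for a pre-trained diffusion model.
toy_sampler = lambda: rng.gauss(0.0, 1.0)

# A small fixed training set of 8 points, reused across all training steps.
train_set = [toy_sampler() for _ in range(8)]

seen_finite = count_distinct(finite_batches(train_set, 4, 50, rng))
seen_synthetic = count_distinct(synthetic_batches(toy_sampler, 4, 50))
print(seen_finite, seen_synthetic)
```

In an actual experiment, each minibatch would feed a VAE training step; the fixed-set loop lets the encoder see the same points repeatedly, while the sampler-backed loop removes that repetition.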
