Distilling Latent Manifolds: Resolution Extrapolation by Variational Autoencoders

Jiaming Chu, Tao Wang, Lei Jin

Abstract

Variational Autoencoder (VAE) encoders play a critical role in modern generative models, yet their computational cost often motivates the use of knowledge distillation or quantization to obtain compact alternatives. Existing studies typically assume that a model performs better on samples close to its training data distribution than on samples from an unseen distribution. In this work, we report a counter-intuitive phenomenon in VAE encoder distillation: a compact encoder distilled only at low resolutions exhibits poor reconstruction performance at its native resolution, but achieves dramatically improved results when evaluated at higher, unseen input resolutions. Despite never being trained beyond $256^2$ resolution, the distilled encoder generalizes effectively to $512^2$ resolution inputs, partially inheriting the teacher model's resolution preference. We further analyze latent distributions across resolutions and find that higher-resolution inputs produce latent representations more closely aligned with the teacher's manifold. Through extensive experiments on ImageNet-256, we show that simple resolution remapping (upsampling inputs before encoding and downsampling reconstructions for evaluation) leads to substantial gains across PSNR, MSE, SSIM, LPIPS, and rFID metrics. These findings suggest that VAE encoder distillation learns resolution-consistent latent manifolds rather than resolution-specific pixel mappings. It also means that the high memory and time costs of training, and access to high-resolution datasets, are not necessary conditions for distilling a VAE capable of high-resolution image reconstruction. Even on low-resolution datasets, the distilled model can still learn the teacher model's detailed knowledge of high-resolution image reconstruction.
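The resolution remapping described in the abstract can be sketched in a few lines. The following is a minimal, illustrative version in pure Python, not the authors' implementation: it assumes grayscale images stored as nested lists with values in [0, 1], uses bilinear interpolation for both resizing steps (the paper's interpolation method is not stated here), and takes `encode` and `decode` as hypothetical callables standing in for the distilled encoder and the decoder.

```python
import math

def resize_bilinear(img, out_h, out_w):
    """Resize a 2D grayscale image (list of rows of floats) bilinearly."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the output pixel centre back into input coordinates, clamped to the image.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

def mse(a, b):
    """Mean squared error between two images of identical shape."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    return math.inf if err == 0 else 10.0 * math.log10(max_val ** 2 / err)

def remapped_reconstruction(img, encode, decode, scale=2.0):
    """Upsample, encode, decode, then downsample back to the input size."""
    h, w = len(img), len(img[0])
    up = resize_bilinear(img, round(h * scale), round(w * scale))
    recon_up = decode(encode(up))          # hypothetical model calls
    return resize_bilinear(recon_up, h, w)

# Placeholder models (assumption): identity maps instead of a real VAE.
identity = lambda t: t
image = [[((3 * i + 5 * j) % 11) / 10.0 for j in range(8)] for i in range(8)]
recon = remapped_reconstruction(image, identity, identity, scale=2.0)
print("MSE:", mse(image, recon), "PSNR:", psnr(image, recon))
```

With real models, the metrics would be computed between the original image and the downsampled reconstruction, so all comparisons stay at the native evaluation resolution.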

Paper Structure (27 sections, 3 theorems, 12 equations, 8 figures, 5 tables)


Key Result

Theorem 1

Assume the teacher's mapping $\psi_T$ is a local diffeomorphism at $z_0 \in \mathcal{M}$. Let $x_{r_0} = \phi_{r_0}(z_0)$ and $E_S^*$ be the optimal student encoder minimizing the distillation loss at resolution $r_0$. Then at $x_{r_0}$: where $J_f$ denotes the Jacobian matrix of function $f$.

Figures (8)

  • Figure 1: A low-resolution image sample with reconstructions from the Flux VAE encoder used as the teacher model (34MB), the distilled model (2MB) trained only on low-resolution images, and the distilled model with resolution augmentation. The second row reports the MSE $\downarrow$ and LPIPS distance $\downarrow$ (lower is better), together with the difference from the original image, for each of the three.
  • Figure 2: Resolution generalization of the manifold learned by a distilled model, compared with the manifold of a model trained from scratch on low-resolution data. Although both distillation and training are carried out at around 256 resolution, the student model learns knowledge about high-resolution generalization from the teacher during distillation, whereas a model trained from scratch generalizes only around low resolutions. Evidence for the distilled model's cross-resolution generalization ability and the degree of matching between manifolds is given in Sections \ref{sec:results} and \ref{sec:analysis}.
  • Figure 3: The distilled model (2MB) and the teacher model (34MB) show consistent cross-resolution performance trends, whereas the variational encoder (24MB) trained on low-resolution images lacks the ability to generalize across resolutions.
  • Figure 4: The original input image, the teacher model's reconstruction, the student model's reconstruction, and the reconstruction after 1.5-fold upsampling enhancement. The visualizations show that the student model differs little from the teacher model in layout, color, and brightness, but lacks high-frequency detail. After upsampling enhancement, the reconstruction surpasses the teacher model even in detail.
  • Figure 5: Resolution-dependent latent statistics of the teacher (Flux VAE) and the distilled student encoder. The empirical latent mean and standard deviation are computed over ImageNet validation images at resolutions ranging from $64^2$ to $1024^2$. Both models exhibit highly consistent trends. Despite being trained only on low-resolution images, the student closely follows the teacher’s scaling behavior, indicating that distillation transfers resolution-aware latent parameterization rather than fixed-resolution representations.
  • ...and 3 more figures
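
The per-resolution latent statistics summarized in the Figure 5 caption could be gathered along the following lines. This is only a sketch under stated assumptions: `toy_encode` is a hypothetical 2x2 average-pooling stand-in rather than the Flux VAE or the distilled student, `resize_nearest` is a simple nearest-neighbour resize chosen for brevity, and the single synthetic image replaces the ImageNet validation set.

```python
from statistics import fmean, pstdev

def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2D list image to size x size."""
    h, w = len(img), len(img[0])
    return [[img[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]

def toy_encode(img):
    """Hypothetical stand-in encoder: 2x2 average pooling (not a real VAE)."""
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, len(img[0]) - 1, 2)]
            for i in range(0, len(img) - 1, 2)]

def latent_stats_by_resolution(images, encode, resolutions):
    """Return {resolution: (mean, std)} over all latent values at each resolution."""
    stats = {}
    for r in resolutions:
        values = []
        for img in images:
            latent = encode(resize_nearest(img, r))
            values.extend(v for row in latent for v in row)
        # Population statistics pooled over every latent element and image.
        stats[r] = (fmean(values), pstdev(values))
    return stats

# Synthetic placeholder data (assumption): one 16x16 patterned image.
images = [[[((i * j) % 7) / 6.0 for j in range(16)] for i in range(16)]]
stats = latent_stats_by_resolution(images, toy_encode, [8, 16, 32])
for r, (mu, sigma) in stats.items():
    print(f"{r}x{r}: mean={mu:.4f} std={sigma:.4f}")
```

Running the same loop with the teacher and the student as `encode` would give two curves of mean and standard deviation against resolution, which is the kind of comparison the caption describes.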

Theorems & Definitions (8)

  • Definition 1: Image Manifold
  • Definition 2: Resolution Parameterization
  • Definition 3: Encoder as Manifold Learning
  • Theorem 1: Local Tangent Space Alignment
  • Definition 4: Resolution Extrapolation Operator
  • Theorem 2: Generalization Error Bound
  • Definition 5: Resolution Sweet Spot
  • Proposition 1: Sweet Spot Transfer