Laminating Representation Autoencoders for Efficient Diffusion
Ramón Calvo-González, François Fleuret
TL;DR
The paper tackles the computational cost of diffusion in high-dimensional SSL feature spaces by compressing DINOv2 patch embeddings into a one-dimensional latent sequence of $32$ tokens using a $\beta$-VAE, achieving an $8\times$ reduction in sequence length and a $48\times$ reduction in total latent dimensionality. A flow-matching diffusion model is trained in this FlatDINO latent space, with a frozen ViT-XL decoder mapping generated latents back to patch embeddings and then to RGB images; this yields a gFID of $1.80$ with classifier-free guidance on ImageNet $256 \times 256$, while requiring $8\times$ fewer FLOPs per forward pass and up to $4.5\times$ fewer FLOPs per training step than diffusion on uncompressed DINOv2 features. The results are preliminary: they show competitive quality and substantial efficiency gains, but also indicate a need for longer training and for diffusion recipes tailored to compressed, semantically informed latents. Overall, FlatDINO offers a promising path toward scalable diffusion over semantic SSL representations, enabling faster generation in resource-constrained settings.
Abstract
Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens, an 8x reduction in sequence length and a 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step compared to diffusion on uncompressed DINOv2 features. These are preliminary results and this work is in progress.
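The two compression figures can be checked with a few lines of arithmetic. The sketch below takes only the 32 latent tokens, the 8x sequence reduction, and the 48x total compression from the abstract. The 768-dimensional patch embedding (typical of a ViT-B backbone) is an assumption that the text does not state, so the resulting per-token latent width is illustrative only. The helper name `compression_ratios` is hypothetical.

```python
from fractions import Fraction


def compression_ratios(n_patch_tokens, patch_dim, n_latent_tokens, latent_dim):
    """Return (sequence-length ratio, total-dimensionality ratio) as exact fractions."""
    seq_ratio = Fraction(n_patch_tokens, n_latent_tokens)
    total_ratio = Fraction(n_patch_tokens * patch_dim, n_latent_tokens * latent_dim)
    return seq_ratio, total_ratio


# From the abstract: 32 latent tokens and an 8x shorter sequence
# together imply 8 * 32 = 256 patch tokens in the uncompressed grid.
n_latent_tokens = 32
n_patch_tokens = 8 * n_latent_tokens

# ASSUMPTION (not stated in the text): 768-dim patch embeddings, as in a ViT-B backbone.
patch_dim = 768

# Per-token latent width that reproduces the stated 48x compression
# under the assumption above.
latent_dim = (n_patch_tokens * patch_dim) // (48 * n_latent_tokens)

seq_ratio, total_ratio = compression_ratios(
    n_patch_tokens, patch_dim, n_latent_tokens, latent_dim
)
print(f"patch tokens: {n_patch_tokens}, latent width: {latent_dim}")
print(f"sequence ratio: {seq_ratio}x, total ratio: {total_ratio}x")
```

Independently of the assumed embedding width, the stated figures imply that each latent token is 48 / 8 = 6 times narrower than a patch embedding; only the absolute widths depend on the backbone.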
