Laminating Representation Autoencoders for Efficient Diffusion

Ramón Calvo-González; François Fleuret

TL;DR

The paper tackles the computational cost of diffusion in high-dimensional SSL feature spaces by compressing DINOv2 patch embeddings into a one-dimensional latent sequence of $32$ tokens using a $\beta$-VAE, an $8\times$ reduction in sequence length and a $48\times$ reduction in total latent dimensionality. A flow-matching diffusion model is trained in this FlatDINO latent space, and a frozen ViT-XL decoder maps generated latents back to patch embeddings and then to RGB images. On ImageNet 256$\times$256 this yields a gFID of $1.80$ with classifier-free guidance, while reducing FLOPs per forward pass by $8\times$ and FLOPs per training step by up to $4.5\times$ relative to diffusion on full DINOv2 features. The results are preliminary: they show competitive quality and substantial efficiency gains, but also indicate a need for longer training and for diffusion recipes tailored to compressed, semantically informed latents. Overall, FlatDINO offers a promising path toward scalable diffusion over semantic SSL representations and faster generation in resource-constrained settings.

Abstract

Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens, an 8x reduction in sequence length and a 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step compared to diffusion on uncompressed DINOv2 features. These are preliminary results and this work is in progress.
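
The two compression figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumed shapes: 256 source patch tokens (inferred from the 8x reduction to 32 tokens), a 768-dimensional patch embedding (the standard ViT-B width), and a latent width of 128 (taken from the 32$\times$128 configuration named in the Figure 3 caption). The function name `compression_ratios` is hypothetical.

```python
# Assumed shapes (not stated in the abstract itself):
#   source: 256 DINOv2 patch tokens x 768 dims (standard ViT-B width)
#   latent: 32 FlatDINO tokens x 128 dims (the 32x128 configuration)
def compression_ratios(src_tokens, src_dim, lat_tokens, lat_dim):
    """Return (sequence-length ratio, total-dimensionality ratio)."""
    seq_ratio = src_tokens / lat_tokens
    total_ratio = (src_tokens * src_dim) / (lat_tokens * lat_dim)
    return seq_ratio, total_ratio


seq_ratio, total_ratio = compression_ratios(256, 768, 32, 128)
print(f"sequence length: {seq_ratio:g}x, total dimensionality: {total_ratio:g}x")
# prints "sequence length: 8x, total dimensionality: 48x"
```

Under these assumptions the sequence shrinks from 256 to 32 tokens and the total latent size from 196,608 to 4,096 values, which matches the 8x and 48x figures in the abstract.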

Paper Structure (37 sections, 10 equations, 18 figures, 10 tables)


Introduction
Related Work
Method
1D Autoencoder
Decoding to Images
Latent Generation
Experiments
Latent Shape Selection
Token Ablation
Latent Diffusion
Discussion and Future Work
Out-of-Distribution Decoding
Latent Robustness to Noise
Normalizing the KL Penalty Across Latent Dimensionalities
Background.
...and 22 more sections

Figures (18)

Figure 1: GFLOPs per forward pass versus gFID (without CFG) for similar-sized diffusion transformers on ImageNet 256$\times$256. FlatDINO (ours) achieves a substantial reduction in FLOPs while maintaining competitive generation quality.
Figure 2: A frozen DINOv2 ViT-B/14 with registers (Darcet et al., 2024) encodes the input image into patch embeddings. The CLS token and register tokens are discarded. The FlatDINO encoder, a ViT with learnable embedding tokens, compresses the patch embeddings into a one-dimensional latent sequence, achieving an 8$\times$ reduction in token count. The decoder, also a ViT with learnable embeddings, reconstructs the original DINOv2 patch embeddings.
Figure 3: Selected class-conditional samples from a DiT-XL model trained for 600 epochs on FlatDINO 32$\times$128 latents. Samples were generated using classifier-free guidance with an Euler sampler (50 steps). Despite the $8\times$ reduction in sequence length compared to RAE, FlatDINO produces diverse, high-fidelity images across a range of ImageNet classes.
Figure 4: Reconstruction quality (rFID, lower is better) versus total latent dimensionality for different token counts. All configurations were trained for 50 epochs; slightly better performance is expected with longer training. For a fixed latent size, configurations with more tokens consistently outperform those with fewer tokens but larger feature dimensions.
Figure 5: Cosine similarity between DINOv2-B patch embeddings as a function of spatial distance (in patch units), averaged over ImageNet validation images. Nearby patches share more information than distant ones, which may explain why FlatDINO learns spatially localized receptive fields.
...and 13 more figures
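
The measurement described in the Figure 5 caption (cosine similarity between patch embeddings as a function of spatial distance) can be sketched roughly as follows. This is a minimal illustration for a single image, not the authors' evaluation code: the function name `similarity_by_distance`, the rounding of distances to whole patch units, and the random placeholder input are all assumptions. Reproducing the figure would require real DINOv2-B patch embeddings averaged over ImageNet validation images.

```python
import numpy as np


def similarity_by_distance(patches):
    """Mean cosine similarity between patch pairs, binned by grid distance.

    patches: (H, W, D) array of patch embeddings for one image.
    Returns a 1-D array whose index k holds the mean similarity of all
    patch pairs whose Euclidean grid distance rounds to k.
    """
    h, w, d = patches.shape
    flat = patches.reshape(h * w, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)  # unit-normalise rows
    sim = flat @ flat.T                                        # pairwise cosine similarity

    ys, xs = np.divmod(np.arange(h * w), w)                    # (row, col) of each patch
    dist = np.hypot(ys[:, None] - ys[None, :], xs[:, None] - xs[None, :])
    bins = np.rint(dist).astype(int).ravel()                   # assumption: round to patch units

    sums = np.bincount(bins, weights=sim.ravel())
    counts = np.bincount(bins)
    return sums / np.maximum(counts, 1)                        # empty bins are reported as 0


# Placeholder input: random vectors standing in for a 16x16 grid of 768-dim embeddings.
rng = np.random.default_rng(0)
curve = similarity_by_distance(rng.normal(size=(16, 16, 768)))
print(curve[:4])
```

With random placeholder vectors the curve is close to zero everywhere except at distance 0, where each patch is compared with itself; the decay with distance reported in the caption is a property of real DINOv2 features and will not appear in this toy input.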
