GMem: A Modular Approach for Ultra-Efficient Generative Models

Yi Tang, Peng Sun, Zhenglin Cheng, Tao Lin

TL;DR

GMem introduces an external, immutable memory bank that decouples memorization from the diffusion model backbone, enabling more efficient training and sampling. Decomposing the memory bank with a low-rank SVD approximation and masking reduces storage and computation while preserving semantic guidance. Manipulating external and internal memory allows training-free integration of new concepts and compositional generation. GMem achieves SoTA results on ImageNet variants with substantial speedups, generalizes beyond the training data, and performs robustly across backbones and tokenizers.
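The low-rank SVD compression mentioned above can be illustrated with a minimal sketch. It assumes the memory bank is a matrix with one snippet per row; the sizes, the rank, and the function names `compress_memory_bank` and `reconstruct_snippet` are hypothetical and chosen only for illustration. The masking step is not shown, since the summary does not describe it.

```python
import numpy as np

def compress_memory_bank(M, rank):
    """Truncated SVD: keep the top-`rank` singular triplets of M (N x d).

    Returns per-snippet coefficients (N x rank) and a shared basis (rank x d).
    """
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    coeffs = U[:, :rank] * S[:rank]   # scale each kept column by its singular value
    basis = Vt[:rank]                 # shared low-rank basis
    return coeffs, basis

def reconstruct_snippet(coeffs, basis, index):
    """Rebuild a single snippet from its low-rank coefficients."""
    return coeffs[index] @ basis

# Synthetic bank with exact rank 8 (an assumption, so the truncation is lossless here).
rng = np.random.default_rng(0)
N, d, true_rank = 1000, 64, 8
M = rng.standard_normal((N, true_rank)) @ rng.standard_normal((true_rank, d))

coeffs, basis = compress_memory_bank(M, rank=8)
stored = coeffs.size + basis.size     # numbers kept after compression
full = M.size                         # numbers in the uncompressed bank
max_error = np.abs(coeffs @ basis - M).max()
```

On a real bank the singular values decay gradually rather than dropping to zero, so the rank trades storage against reconstruction error.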

Abstract

Recent studies indicate that the denoising process in deep generative diffusion models implicitly learns and memorizes semantic information from the data distribution. These findings suggest that capturing more complex data distributions requires larger neural networks, leading to a substantial increase in computational demands, which in turn becomes the primary bottleneck in both training and inference of diffusion models. To this end, we introduce GMem: A Modular Approach for Ultra-Efficient Generative Models. GMem decouples memory capacity from the model and implements it as a separate, immutable memory set that preserves the essential semantic information in the data. This design reduces the network's reliance on memorizing complex data distributions, which improves training efficiency, sampling efficiency, and generation diversity. On ImageNet at $256 \times 256$ resolution, GMem achieves a $50\times$ training speedup compared to SiT, reaching FID $=7.66$ in fewer than $28$ epochs ($\sim 4$ hours of training time), while SiT requires $1400$ epochs. Without classifier-free guidance, GMem achieves state-of-the-art (SoTA) performance of FID $=1.53$ in $160$ epochs with only $\sim 20$ hours of training, outperforming LightningDiT, which requires $800$ epochs and $\sim 95$ hours to attain FID $=2.17$.
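As a quick sanity check, the speedup figures quoted in the abstract follow directly from the reported epoch and hour counts. The helper `speedup` below is a hypothetical name, and the hour values are the approximate ones given in the abstract.

```python
def speedup(baseline, ours):
    """Ratio of baseline cost to our cost (epochs or hours)."""
    return baseline / ours

# SiT needs 1400 epochs; GMem reaches FID 7.66 in fewer than 28 epochs.
epoch_speedup_vs_sit = speedup(1400, 28)

# LightningDiT: 800 epochs and about 95 hours; GMem: 160 epochs and about 20 hours.
epoch_ratio_vs_lightningdit = speedup(800, 160)
hours_ratio_vs_lightningdit = speedup(95, 20)
```

Because GMem needs fewer than 28 epochs, the 50x figure against SiT is a lower bound on the epoch ratio.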

Paper Structure

This paper contains 69 sections, 23 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: GMem significantly enhances the training and sampling efficiency of diffusion models on ImageNet $256\times256$. We propose decoupling memorization capabilities from the model by implementing a separate, immutable memory bank that preserves essential data information. Sub-figure (a) highlights the core concept of GMem, where $\boldsymbol{\epsilon}$ denotes input noise and $\mathbf{x}_0$ represents generated samples. In GMem, we disentangle generalization and memorization capabilities, assigning memorization to an external memory bank $\mathbf{M}$. This decoupling reduces computational and capacity overhead, thus accelerating the process. Sub-figure (b) demonstrates the training efficiency of GMem on ImageNet $256\times256$. At FID$=4.86$, GMem achieves over $25\times$ speedup compared to REPA (Yu et al., 2024). At FID$=7.66$, it achieves over $50\times$ speedup relative to SiT (Ma et al., 2024). Sub-figure (c) illustrates sampling efficiency. At the same FID target, GMem requires $5\times$ fewer NFEs than REPA and $10\times$ fewer NFEs than SiT.
  • Figure 2: Selected samples on ImageNet $512 \times 512$ and $256 \times 256$. This figure presents images generated by GMem under two experimental settings: (1) For ImageNet $256\times256$, GMem was trained for $160$ epochs and sampled via the Euler method (NFE$=100$), achieving FID$=1.53$ without classifier-free guidance. (2) For ImageNet $512\times512$, training was extended to $400$ epochs with identical sampling settings, yielding FID$=1.89$.
  • Figure 3: Data generation via GMem-enhanced diffusion models. (a) Sampled noise $\boldsymbol{\epsilon}$ is used to index a memory snippet from the memory bank. (b) Both the sampled noise $\boldsymbol{\epsilon}$ and the memory snippet $\mathbf{s}$ are simultaneously fed into the neural network. (c) The neural network generates data using SDE or ODE solvers.
  • Figure 3: GMem accelerates sampling by $10\times$. This table examines the impact of different NFE choices on FID. For a fair comparison with REPA, we use SiT-L/2 as the backbone and train GMem for $20$ epochs (400K iterations with a batch size of 256) without using classifier-free guidance. $\downarrow$ means lower is better.
  • Figure 4: Demonstration of novel and compositional image generation via memory bank manipulation. Selected samples from ImageNet $256\times256$ generated by GMem. In the "Novel image generation" part, we show the reference image used to build a new memory snippet (left), followed by the generated samples and 5 of the nearest training images, illustrating GMem's adaptation to external knowledge. In the "Compositional image generation" examples, two reference images (left and right) form an interpolated image (center), demonstrating that GMem can manipulate internal knowledge to create new concepts.
  • ...and 7 more figures
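
The generation pipeline in the first Figure 3 caption (index a snippet with the noise, feed both to the network, integrate with an ODE solver) and the snippet interpolation in Figure 4 can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the indexing rule `index_snippet`, the projection `proj`, and the closed-form `toy_velocity` that stands in for the trained network are all hypothetical. The toy assumes a linear path $\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon}$ whose target $\mathbf{x}_0$ is the snippet itself.

```python
import numpy as np

def index_snippet(eps, bank, proj):
    """Hypothetical indexing rule: pick the bank row most similar to a
    fixed projection of the noise. The captions do not specify the real rule."""
    scores = bank @ (proj @ eps)
    return int(np.argmax(scores))

def toy_velocity(x, t, snippet):
    """Stand-in for the trained network. On the linear path
    x_t = (1 - t) * x0 + t * eps with x0 = snippet, the velocity
    eps - x0 equals (x_t - x0) / t."""
    return (x - snippet) / t

def euler_sample(eps, snippet, nfe=100):
    """Euler ODE solver from t = 1 (noise) to t = 0 (data), one network call per step."""
    ts = np.linspace(1.0, 0.0, nfe + 1)
    x = eps.copy()
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * toy_velocity(x, t_cur, snippet)
    return x

rng = np.random.default_rng(0)
dim, bank_size = 16, 32
bank = rng.standard_normal((bank_size, dim))   # toy memory bank, one snippet per row
proj = rng.standard_normal((dim, dim))         # hypothetical noise-to-key projection
eps = rng.standard_normal(dim)                 # sampled noise

# (a) index a snippet, (b) condition on it, (c) integrate with Euler.
idx = index_snippet(eps, bank, proj)
sample = euler_sample(eps, bank[idx])

# Compositional generation, read here as linear interpolation of two snippets.
mixed = 0.5 * (bank[0] + bank[1])
mixed_sample = euler_sample(eps, mixed)
```

In this toy setting the Euler solver lands on the conditioning snippet, so swapping or interpolating snippets changes the output without retraining anything, which is the training-free behaviour the Figure 4 caption describes.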