GMem: A Modular Approach for Ultra-Efficient Generative Models
Yi Tang, Peng Sun, Zhenglin Cheng, Tao Lin
TL;DR
GMem introduces an external, immutable memory bank to decouple memorization from diffusion model backbones, enabling ultra-efficient training and sampling. By decomposing the memory bank with a low-rank SVD approximation and masking, it reduces storage and computation while preserving semantic guidance. External and internal memory manipulation allow training-free integration of new concepts and compositional generation, achieving SoTA results on ImageNet variants with substantial speedups. The approach improves generalization beyond the training data and demonstrates robust performance across backbones and tokenizers, offering a scalable path for diffusion-based generation. Overall, GMem advances efficient diffusion by offloading memorization to a modular memory, enabling faster, more flexible and diverse generative capabilities.
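To make the low-rank idea concrete, here is a minimal NumPy sketch of compressing a memory bank with a truncated SVD. It is an illustration under stated assumptions, not the authors' implementation: the bank shape, the function names `low_rank_memory` and `reconstruct`, and the elementwise coefficient mask are all hypothetical, and the paper's actual masking scheme may differ.

```python
import numpy as np


def low_rank_memory(bank: np.ndarray, rank: int):
    """Factor a (n_entries, dim) bank into per-entry coefficients and a shared basis.

    Hypothetical helper: keeps only the top-`rank` singular directions.
    """
    U, S, Vt = np.linalg.svd(bank, full_matrices=False)
    coeffs = U[:, :rank] * S[:rank]  # shape (n_entries, rank)
    basis = Vt[:rank]                # shape (rank, dim)
    return coeffs, basis


def reconstruct(coeffs: np.ndarray, basis: np.ndarray, mask=None) -> np.ndarray:
    """Rebuild memory entries; `mask` is an assumed elementwise mask on coefficients."""
    if mask is not None:
        coeffs = coeffs * mask
    return coeffs @ basis


rng = np.random.default_rng(0)
# Synthetic bank with exact rank 8, so a rank-8 factorization is lossless.
bank = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 64))

coeffs, basis = low_rank_memory(bank, rank=8)
approx = reconstruct(coeffs, basis)
rel_err = np.linalg.norm(bank - approx) / np.linalg.norm(bank)

# Storage drops from n_entries * dim values to n_entries * rank + rank * dim.
stored_values = coeffs.size + basis.size

# Masking every coefficient removes all memory content from the reconstruction.
masked = reconstruct(coeffs, basis, mask=np.zeros_like(coeffs))
```

In this toy setting the factorized form stores 2,560 values instead of 16,384. On real feature banks the rank would be a tunable trade-off between storage and reconstruction error.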
Abstract
Recent studies indicate that the denoising process in deep generative diffusion models implicitly learns and memorizes semantic information from the data distribution. These findings suggest that capturing more complex data distributions requires larger neural networks, which substantially increases computational demands and makes computation the primary bottleneck in both training and inference of diffusion models. To this end, we introduce GMem: A Modular Approach for Ultra-Efficient Generative Models. GMem decouples memory capacity from the model and implements it as a separate, immutable memory set that preserves the essential semantic information in the data. This design reduces the network's reliance on memorizing complex data distributions, and thereby improves training efficiency, sampling efficiency, and generation diversity. On ImageNet at $256 \times 256$ resolution, GMem achieves a $50\times$ training speedup compared to SiT, reaching FID $=7.66$ in fewer than $28$ epochs ($\sim 4$ hours of training), while SiT requires $1400$ epochs. Without classifier-free guidance, GMem achieves state-of-the-art (SoTA) performance of FID $=1.53$ in $160$ epochs with only $\sim 20$ hours of training, outperforming LightningDiT, which requires $800$ epochs and $\sim 95$ hours to attain FID $=2.17$.
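As a quick check on the reported figures, assuming the $50\times$ speedup is measured as a ratio of training epochs (an interpretation, not something stated explicitly above), the ratios work out as:

$$\frac{1400}{28} = 50, \qquad \frac{800}{160} = 5, \qquad \frac{95}{20} = 4.75$$

Under that reading, GMem reaches FID $=7.66$ in at most $1/50$ of SiT's epochs, and reaches FID $=1.53$ with $5\times$ fewer epochs and roughly $4.75\times$ fewer training hours than LightningDiT needs for FID $=2.17$. The hour counts are approximate in the source, so the last ratio is approximate as well.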
