LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization

Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, Ser-Nam Lim

TL;DR

LightGen tackles the resource bottleneck in high-quality text-to-image generation by combining knowledge distillation from SOTA models, synthetic data distillation, and a compact Masked Autoregressive architecture. It shows that diverse synthetic data can substitute for massive datasets, while Direct Preference Optimization mitigates synthetic-data shortcomings in high-frequency details and spatial alignment. The approach achieves competitive image quality with only 0.7B parameters and around 88 A100 GPU-days, dramatically reducing compute and expanding accessibility for resource-constrained researchers. This work highlights a practical, data-centric path to efficient T2I generation that leverages distillation and fine-tuning to bridge the gap to SOTA performance.

Abstract

Recent advances in text-to-image generation have primarily relied on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce LightGen, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in Multi-Modal Large Language Models (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only $0.7$B parameters. Using a compact synthetic dataset of just $2$M high-quality images generated from varied captions, we demonstrate that data diversity significantly outweighs data volume in determining model performance. This strategy dramatically lowers computational demands, cutting pre-training time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, particularly poor high-frequency details and spatial inaccuracies, we integrate a DPO stage that refines image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at https://github.com/XianfengWu01/LightGen
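
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference pair, in its original language-model formulation. It is not taken from the LightGen code. The function name `dpo_loss`, its arguments, and the default `beta` are illustrative assumptions, and this page does not specify how LightGen obtains the log-probability terms for its MAR model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_win: float, policy_logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    All four inputs are log-probabilities of a sample given the prompt:
    two from the model being trained (policy) and two from a frozen
    reference copy. beta (hypothetical default) scales the preference margin.
    """
    # How much more the policy favors each sample than the reference does.
    win_margin = policy_logp_win - ref_logp_win
    lose_margin = policy_logp_lose - ref_logp_lose
    logits = beta * (win_margin - lose_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


# With the policy equal to the reference, the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the preferred sample's likelihood relative to the rejected one lowers it.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy assigns relatively more probability to the preferred image than the reference model does, which is the mechanism the abstract relies on to improve fine detail and spatial accuracy after pre-training on synthetic data.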

Paper Structure

This paper contains 24 sections, 18 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of LightGen's capabilities in image generation, zero-shot inpainting, and resource usage. (First Row) Images generated at multiple resolutions ($512\times512$ and $1024\times1024$) illustrate the scalability of LightGen. (Second Row) Zero-shot inpainting results showcase LightGen's inherent editing ability. (Third Row) LightGen's resource consumption: drastically reduced dataset size, model parameters, and GPU hours compared to state-of-the-art models demonstrate significant cost reductions without sacrificing performance.
  • Figure 2: Overview of LightGen efficient pretraining. (a) Training: Images are encoded into tokens via a pre-trained tokenizer, while text embeddings from a T5 encoder are refined by a trainable aligner. A masked autoencoder uses text tokens as queries/values and image tokens as keys for cross-attention, followed by refinement with a Diffusion MLP (D-MLP). (b) Inference: Tokens are predicted and iteratively refined over $N$ steps, then decoded by the image tokenizer to generate final images.
  • Figure 3: Illustration of DPO post-processing in LightGen.
  • Figure 4: Visualization Results. Sample outputs generated using LightGen, showcasing high-quality images at multiple resolutions ($256 \times 256$, $512 \times 512$, $1024 \times 1024$) and across diverse styles (realistic, animated, virtual, etc.), which demonstrate the versatility and scalability of our approach.
  • Figure 5: Image inpainting demonstrations.
  • ...and 3 more figures