Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

Chaojie Yang, Tian Li, Yue Zhang, Jun Gao

TL;DR

Amber-Image tackles the computational and deployment barriers of large diffusion transformers by compressing a 60-layer dual-stream MMDiT backbone (Qwen-Image) into lighter models through a two-stage pipeline: (i) Amber-Image-10B via depth pruning with fidelity-aware initialization and two-phase recovery, and (ii) Amber-Image-6B via deep-layer single-stream conversion initialized from the image branch and refined through progressive distillation. The approach reduces parameters by about 70% and requires fewer than 2,000 GPU hours, while delivering competitive or superior performance on major T2I benchmarks (DPG-Bench, GenEval) and robust text rendering on LongText-Bench and CVTG-2K. The results demonstrate effective knowledge transfer and architectural simplification without training from scratch, enabling practical deployment on consumer hardware. Limitations include some gaps in style/diversity, motivating future RLHF and the development of ultra-lightweight domain-specific variants (2–3B).
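The "about 70%" figure can be checked with simple arithmetic. The summary above does not state the size of the starting model; the calculation below assumes the roughly 20B parameter count publicly reported for Qwen-Image and takes the "10B" and "6B" model names at face value, so the results are approximate.

```latex
% Assumption: N_teacher ~ 20B (reported size of Qwen-Image, not stated in this summary).
% N_10B and N_6B are nominal sizes read off the model names.
\[
  1 - \frac{N_{\text{10B}}}{N_{\text{teacher}}} \approx 1 - \frac{10}{20} = 0.50,
  \qquad
  1 - \frac{N_{\text{6B}}}{N_{\text{teacher}}} \approx 1 - \frac{6}{20} = 0.70 .
\]
```

Under these assumptions, the first stage removes about half of the parameters and the second stage brings the total reduction to about 70%.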

Abstract

Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline, from the 10B to the 6B variant, requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks like DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.
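To make "retained layers are reinitialized via local weight averaging" concrete, here is a minimal NumPy sketch of one plausible reading. It is not the authors' implementation: the abstract does not define the averaging neighbourhood, so the sketch assumes each pruned layer is merged into its nearest retained layer (ties go to the earlier layer). The function name `local_average_init`, the dict-of-arrays layer format, and the toy `keep` indices are all hypothetical, and the selection of which layers to keep (the timestep-sensitive part) is taken as given.

```python
import numpy as np


def local_average_init(layer_weights, keep):
    """Re-initialise retained layers by averaging over nearby pruned layers.

    layer_weights: list of dicts mapping parameter name -> np.ndarray,
                   one dict per transformer layer (all with the same shapes).
    keep:          indices of layers that survive depth pruning.

    Assumption (not from the paper): every pruned layer is assigned to its
    nearest retained layer, with ties broken toward the earlier layer.
    """
    keep = sorted(keep)
    kept = set(keep)
    groups = {k: [k] for k in keep}  # retained index -> layers averaged into it

    for i in range(len(layer_weights)):
        if i in kept:
            continue
        nearest = min(keep, key=lambda k: (abs(k - i), k))
        groups[nearest].append(i)

    # Average each parameter tensor over the layers in its group.
    return [
        {
            name: np.mean([layer_weights[j][name] for j in groups[k]], axis=0)
            for name in layer_weights[k]
        }
        for k in keep
    ]


# Toy example: 6 "layers", each holding a single 2x2 weight filled with its index.
layers = [{"w": np.full((2, 2), float(i))} for i in range(6)]
pruned = local_average_init(layers, keep=[0, 2, 4])
print([float(p["w"][0, 0]) for p in pruned])  # [0.5, 2.5, 4.5]
```

In the toy run, layers 1, 3, and 5 are dropped and folded into layers 0, 2, and 4 respectively, so the three-layer student starts from weights that blend each retained layer with its pruned neighbour. The distillation and fine-tuning stages described in the abstract would then recover the remaining quality gap.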

Paper Structure (16 sections, 6 equations, 2 figures, 6 tables, 2 algorithms)