Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation
Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun
TL;DR
Skrr reduces the memory footprint of text encoders in text-to-image (T2I) diffusion models by pruning redundant transformer sub-blocks (Skip) and then reusing adjacent remaining blocks (Re-use) to recover performance. It introduces a diffusion-tailored discrepancy metric (based on MSE) and a beam-search strategy that accounts for interactions between blocks, along with a theoretical bound showing when Re-use can outperform skipping alone. Empirically, Skrr achieves state-of-the-art memory efficiency: it maintains image fidelity and text alignment at sparsities above $40\%$ across multiple models, with memory and parameter reductions comparable to or better than prior blockwise pruning baselines. The approach is also robust across different diffusion architectures and can potentially be used to compress multiple text encoders. The results further highlight the importance of a projection module and null-condition handling for stable T2I generation.
Abstract
Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
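The two phases named in the TL;DR (Skip via beam search over an MSE-based discrepancy, then Re-use of adjacent remaining blocks) can be illustrated with a toy sketch. This is not the authors' implementation. The scalar "blocks", the plain MSE on final outputs, the calibration vectors, and every name here (`make_block`, `run`, `discrepancy`, `skip_plan`, `beam_search_skip`, `add_reuse`) are assumptions chosen for illustration, and the paper's diffusion-tailored metric may differ in its details.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]
Plan = List[Optional[int]]  # plan[pos] = index of the block run at pos; None = skipped


def make_block(scale: float) -> Block:
    """Hypothetical residual sub-block: x -> x + scale * x."""
    return lambda x: [v + scale * v for v in x]


def run(blocks: Sequence[Block], plan: Plan, x: Vector) -> Vector:
    """Apply the blocks named in the plan, in order, skipping None entries."""
    for src in plan:
        if src is not None:
            x = blocks[src](x)
    return x


def discrepancy(blocks: Sequence[Block], plan: Plan, calib: Sequence[Vector]) -> float:
    """Mean squared error between dense and pruned outputs (assumed metric)."""
    dense_plan: Plan = list(range(len(blocks)))
    total = 0.0
    for x in calib:
        ref = run(blocks, dense_plan, x)
        out = run(blocks, plan, x)
        total += sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref)
    return total / len(calib)


def skip_plan(n: int, skipped: FrozenSet[int]) -> Plan:
    """Plan that keeps every block in place except the skipped ones."""
    return [None if i in skipped else i for i in range(n)]


def beam_search_skip(blocks: Sequence[Block], calib: Sequence[Vector],
                     n_skip: int, beam_width: int) -> FrozenSet[int]:
    """Skip phase: grow sets of skipped blocks, keeping the best few per step."""
    n = len(blocks)
    beams: List[FrozenSet[int]] = [frozenset()]
    for _ in range(n_skip):
        candidates = {beam | {i} for beam in beams for i in range(n) if i not in beam}
        ranked = sorted(
            candidates,
            key=lambda s: (discrepancy(blocks, skip_plan(n, s), calib), sorted(s)),
        )
        beams = ranked[:beam_width]
    return beams[0]


def add_reuse(blocks: Sequence[Block], skipped: FrozenSet[int],
              calib: Sequence[Vector]) -> Tuple[Plan, float]:
    """Re-use phase: fill a skipped slot with an adjacent kept block if that helps."""
    n = len(blocks)
    plan = skip_plan(n, skipped)
    best = discrepancy(blocks, plan, calib)
    for pos in sorted(skipped):
        for neighbor in (pos - 1, pos + 1):
            # Only blocks whose weights were kept are available for reuse.
            if 0 <= neighbor < n and neighbor not in skipped:
                trial = plan.copy()
                trial[pos] = neighbor
                score = discrepancy(blocks, trial, calib)
                if score < best:
                    plan, best = trial, score
    return plan, best


# Toy usage: four scalar blocks, two of which are pruned.
blocks = [make_block(s) for s in (0.30, 0.28, 0.02, 0.30)]
calib = [[1.0, -2.0], [0.5, 3.0]]
skipped = beam_search_skip(blocks, calib, n_skip=2, beam_width=2)
plan, score = add_reuse(blocks, skipped, calib)
print(sorted(skipped), plan, score)
```

In this toy setup, re-running a kept neighbor in a skipped slot is accepted only when it lowers the discrepancy, which mirrors the TL;DR's point that Re-use helps under some conditions but not all. No extra weights are stored for a reused slot, so the parameter count stays at the post-Skip level.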
