Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun

TL;DR

Skrr tackles the memory inefficiency of text encoders in text-to-image diffusion models by pruning redundant transformer sub-blocks (Skip) and then reusing adjacent remaining blocks (Re-use) to recover performance. It introduces a diffusion-tailored discrepancy metric (based on MSE) and a beam-search strategy to account for block interactions, along with a theoretical bound showing when Re-use can outperform skipping alone. Empirically, Skrr achieves state-of-the-art memory efficiency, maintaining image fidelity and text alignment at sparsities above $40\%$ across multiple models, with memory and parameter reductions comparable to or better than prior blockwise pruning baselines. The approach also demonstrates robustness across different diffusion architectures and enables potential compression of multiple text encoders, while highlighting the importance of a projection module and null-condition handling for stable T2I generation.

Abstract

Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
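
The Skip phase summarized above (an MSE-based discrepancy on calibration data plus beam search over candidate sub-blocks) can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the residual `tanh` blocks, `make_blocks`, `forward`, and `beam_search_skip` are hypothetical stand-ins for a real text encoder and its pruning loop, and the plain output MSE stands in for the paper's diffusion-tailored metric.

```python
import numpy as np

def make_blocks(num_blocks, dim, seed=0):
    """Toy stand-in for transformer sub-blocks: one weight matrix per block."""
    rng = np.random.default_rng(seed)
    scales = rng.uniform(0.01, 0.5, size=num_blocks)  # some blocks matter more
    return [s * rng.standard_normal((dim, dim)) / np.sqrt(dim) for s in scales]

def forward(x, blocks, skipped=frozenset()):
    """Run the residual stack; a skipped block leaves the hidden state unchanged."""
    z = x
    for i, w in enumerate(blocks):
        if i in skipped:
            continue
        z = z + np.tanh(z @ w)
    return z

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def beam_search_skip(blocks, calib, num_skip, beam_width=3):
    """Choose `num_skip` blocks to skip, keeping the `beam_width` best skip sets per step."""
    dense_out = forward(calib, blocks)
    scores = {frozenset(): 0.0}  # skip set -> discrepancy against the dense output
    for _ in range(num_skip):
        candidates = {}
        for skipped in scores:
            for i in range(len(blocks)):
                cand = skipped | {i}
                if i in skipped or cand in candidates:
                    continue
                candidates[cand] = mse(dense_out, forward(calib, blocks, cand))
        kept = sorted(candidates, key=candidates.get)[:beam_width]
        scores = {c: candidates[c] for c in kept}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Usage on random calibration data (hypothetical sizes).
blocks = make_blocks(num_blocks=8, dim=16)
calib = np.random.default_rng(1).standard_normal((32, 16))
skip_set, disc = beam_search_skip(blocks, calib, num_skip=2, beam_width=3)
print(sorted(skip_set), disc)
```

With `beam_width=1` the search reduces to greedy selection. A wider beam keeps several partial skip sets alive, which is how interactions between blocks can be taken into account.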

Paper Structure

This paper contains 57 sections, 4 theorems, 34 equations, 22 figures, 17 tables, 2 algorithms.

Key Result

Lemma 3.1

Let $\mathcal{M}: (x, \theta) \mapsto \mathbb{R}^d$ be an $L$-block transformer with input $x \in \mathbb{R}^d$ and parameter set $\theta = (\theta_1, \dots, \theta_L)$, defined as the composition of its blocks, $\mathcal{M}(x, \theta) = F_L(F_{L-1}(\cdots F_1(x, \theta_1) \cdots, \theta_{L-1}), \theta_L)$, where $F_i: (z_i, \theta_i) \mapsto \mathbb{R}^d$ is the $i$-th block with parameters $\theta_i$, and $z_i \in \mathbb{R}^d$. Assume that $F_i$ is $L_i$-Lipschitz in $z_i$ and $M_i$-Lipschitz in $\theta_i$. Then, for
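
The bound itself is not reproduced in this summary. As a rough illustration of the kind of estimate these Lipschitz assumptions support, the following is the standard telescoping argument for two parameter sets $\theta$ and $\theta'$. It assumes that $\mathcal{M}$ is the plain composition $z_{i+1} = F_i(z_i, \theta_i)$ with $z_1 = x$, and that the Lipschitz constants hold uniformly in the other argument. The primed symbols are introduced here for illustration, and the paper's exact statement may differ.

```latex
% Assumption: z_{i+1} = F_i(z_i, \theta_i) and z'_{i+1} = F_i(z'_i, \theta'_i), with z_1 = z'_1 = x.
\begin{align*}
\|z_{i+1} - z'_{i+1}\|
  &= \|F_i(z_i, \theta_i) - F_i(z'_i, \theta'_i)\| \\
  &\le \|F_i(z_i, \theta_i) - F_i(z'_i, \theta_i)\|
     + \|F_i(z'_i, \theta_i) - F_i(z'_i, \theta'_i)\| \\
  &\le L_i \,\|z_i - z'_i\| + M_i \,\|\theta_i - \theta'_i\|, \\
\|\mathcal{M}(x, \theta) - \mathcal{M}(x, \theta')\|
  &\le \sum_{i=1}^{L} \Big( \prod_{j=i+1}^{L} L_j \Big) M_i \,\|\theta_i - \theta'_i\|.
\end{align*}
```

The last line follows by unrolling the recursion from $i = 1$ to $i = L$, using $\|z_1 - z'_1\| = 0$ and the convention that the empty product (for $i = L$) equals 1.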

Figures (22)

  • Figure 1: (a) FLOPs distribution during image generation in Stable Diffusion 3 (SD3) (Esser et al., 2024). (b) Parameter distribution across modules in SD3. The text encoders contribute less than 0.5% of the overall FLOPs but account for over 70% of the total model parameters. For the VAE, only the decoder was considered.
  • Figure 2: Overview of the Skrr framework. (a) shows the Skip phase, which repeatedly assesses each sub-block by measuring the output discrepancy (Disc.) between the dense and skipped models on a calibration dataset (Calib. data). To account for block interactions, it keeps the top $k$ options with the smallest discrepancies and uses beam search for refined selection. (b) presents the Re-use phase, which evaluates whether reusing a remaining block in place of a skipped sub-block results in a smaller output discrepancy. If so, hidden states are fed back into the chosen layers. This two-phase approach efficiently reduces model size with minimal T2I performance loss.
  • Figure 3: (a) The cosine similarity of hidden states across T5 transformer blocks changes only gradually, indicating that specific layers could be omitted without serious performance degradation. (b) The cosine similarity of block outputs using fixed inputs. The similarity map reveals redundant roles across blocks, suggesting that certain blocks could be replaced by adjacent blocks.
  • Figure 4: (a) An image is created by the PixArt-$\Sigma$ dense text encoder using the prompt "A car made out of vegetables." with $\|f_{\varnothing}\|_2 = 0.03$. For image (b), the $7^{\text{th}}$ and $22^{\text{nd}}$ sub-blocks are excluded, resulting in $\text{Metric}_1 = 0.85$, $\text{Metric}_2 = 0.002$, and $\|f_{\varnothing}\|_2 = 0.19$. Image (c) is generated by removing the $3^{\text{rd}}$ and $5^{\text{th}}$ sub-blocks, producing $\text{Metric}_1 = 0.89$, $\text{Metric}_2 = 0.04$, and $\|f_{\varnothing}\|_2 = 3.34$. Despite $\text{Metric}_1$ being higher in (c), the large $\|f_{\varnothing}\|_2$ value compared to (b) leads to an abnormal image. Notably, $\text{Metric}_2$ more accurately indicates differences in image quality.
  • Figure 5: Comparison of images generated with baseline and Skrr-compressed text encoders across PixArt-$\Sigma$, Stable Diffusion 3 (SD3), and FLUX.1-dev. At low sparsity (level 1: 24.3% for ShortGPT and LaCo, 26.3% for FinerCut, and 27.0% for Skrr), all methods perform comparably to the dense models, but Skrr outperforms the baselines at higher sparsity (level 2: 32.4% for ShortGPT and LaCo, 32.2% for FinerCut, and 32.4% for Skrr; level 3: 40.5% for ShortGPT and LaCo, 41.7% for FinerCut, and 41.9% for Skrr), maintaining alignment with the dense model and preserving prompt details such as "glasses", "colorful apron", and "paint-splattered hands", where the baseline methods fail.
  • ...and 17 more figures
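
The Re-use phase described in the Figure 2(b) caption can be sketched in the same toy setting. This is an illustration under stated assumptions rather than the paper's algorithm: `forward`, `reuse_phase`, and the `plan` representation are hypothetical, the blocks are simple residual `tanh` layers, and plain output MSE is used as the discrepancy. For each skipped position, the sketch tries the nearest kept block on either side and keeps the substitution only if it lowers the discrepancy against the dense output.

```python
import numpy as np

def forward(x, blocks, plan):
    """plan[i] is the index of the block whose weights run at position i, or None to skip."""
    z = x
    for src in plan:
        if src is not None:
            z = z + np.tanh(z @ blocks[src])
    return z

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def reuse_phase(blocks, calib, skipped):
    """Fill skipped positions with an adjacent kept block when that reduces the discrepancy."""
    n = len(blocks)
    kept = [i for i in range(n) if i not in skipped]
    dense_out = forward(calib, blocks, list(range(n)))
    plan = [None if i in skipped else i for i in range(n)]
    best = mse(dense_out, forward(calib, blocks, plan))  # skip-only discrepancy
    for i in sorted(skipped):
        prev_kept = max((k for k in kept if k < i), default=None)
        next_kept = min((k for k in kept if k > i), default=None)
        for j in (prev_kept, next_kept):
            if j is None:
                continue
            trial = plan.copy()
            trial[i] = j  # feed the hidden state through block j again at position i
            d = mse(dense_out, forward(calib, blocks, trial))
            if d < best:
                best, plan[i] = d, j
    return plan, best

# Usage: block 1 duplicates block 0, so re-using block 0 at position 1 is lossless.
rng = np.random.default_rng(0)
w0 = 0.3 * rng.standard_normal((8, 8))
w2 = 0.3 * rng.standard_normal((8, 8))
blocks = [w0, w0.copy(), w2]
calib = rng.standard_normal((16, 8))
plan, disc = reuse_phase(blocks, calib, skipped={1})
print(plan, disc)
```

Only the kept blocks' weights need to be stored, so a re-used position adds compute but no parameters. That is the trade-off that makes Re-use attractive when memory, rather than FLOPs, is the bottleneck.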

Theorems & Definitions (6)

  • Lemma 3.1: Error bound of two transformers
  • Theorem 3.2: Tighter error bound of Re-use
  • Lemma 3.1: Error bound of two transformers
  • proof
  • Theorem 3.2: Tighter error bound of Re-use
  • proof