PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

Jian Ma; Chen Chen; Qingsong Xie; Haonan Lu

TL;DR

PEA-Diffusion introduces a parameter-efficient adapter (6M parameters) that, trained under knowledge distillation (KD) from a pretrained English diffusion model, enables non-English text-to-image (T2I) generation while keeping the UNet frozen. The lightweight adapter is attached to a language-specific CLIP encoder, and KD aligns the student's intermediate UNet features and final outputs with those of the English teacher. The method achieves culture-aware generation at minimal training cost and shows strong cross-lingual performance, often surpassing translation-based baselines on culturally relevant prompts. It remains plug-and-play for downstream workflows, integrating with LoRA, ControlNet, inpainting, and accelerated diffusion variants, and it supports data-efficient domain adaptation from modest parallel corpora. Overall, PEA-Diffusion narrows the English-centric bias of T2I models while preserving the generation capabilities of the original English model, offering a practical path to multilingual deployment.

Abstract

Text-to-image diffusion models are well known for generating realistic images from textual prompts. However, existing work has focused predominantly on English, leaving non-English text-to-image generation poorly supported. The most common workaround, translating prompts into English, cannot solve generation problems tied to language-specific culture, while training from scratch on a language-specific dataset is prohibitively expensive. In this paper, we propose a simple plug-and-play language transfer method based on knowledge distillation. It only requires training a lightweight MLP-like parameter-efficient adapter (PEA) with 6M parameters under teacher knowledge distillation, using a small parallel data corpus. Surprisingly, even with the UNet parameters frozen, the method achieves remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. It also closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin for downstream tasks in cross-lingual text-to-image generation, where it achieves significant results. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
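
To make the distillation idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the distillation objective is a mean squared error between teacher and student outputs plus a weighted sum of per-block feature errors. The function names (`mse`, `kd_loss`, `mlp_param_count`), the `feat_weight` parameter, and all toy values are hypothetical and chosen only for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kd_loss(teacher_out, student_out, teacher_feats, student_feats, feat_weight=1.0):
    """Assumed distillation objective: output term + weighted feature terms.

    teacher_feats / student_feats are lists of per-block feature vectors.
    feat_weight is a hypothetical balancing coefficient.
    """
    out_term = mse(teacher_out, student_out)
    feat_term = sum(mse(t, s) for t, s in zip(teacher_feats, student_feats))
    return out_term + feat_weight * feat_term


def mlp_param_count(dims):
    """Weights plus biases of a fully connected stack with the given layer widths."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


if __name__ == "__main__":
    # Toy values only: one output vector and one intermediate feature block.
    loss = kd_loss(
        teacher_out=[1.0, 2.0],
        student_out=[1.0, 4.0],
        teacher_feats=[[0.0, 0.0]],
        student_feats=[[1.0, 1.0]],
        feat_weight=0.5,
    )
    print("toy distillation loss:", loss)
    # Hypothetical layer widths, not the adapter's real configuration.
    print("toy adapter parameters:", mlp_param_count([4, 8, 4]))
```

In a setup like the one the abstract describes, only the adapter's parameters would receive gradients from such a loss, while the teacher and the student UNet stay frozen. `mlp_param_count` shows how the size of an MLP-like adapter follows from its layer widths.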

Paper Structure (38 sections, 5 equations, 24 figures, 10 tables)

Introduction
Related Work
Multilingual Text-to-image Generation
Multilingual CLIP
Knowledge Distillation
Method
Preliminary
Cross-Lingual Transfer
Training Strategy and Final Objective
Experiments
Data Preparation
Implementation Details and Evaluation
Experimental Results
Ablation Studies
Exploring the Domain Adaptation of PEA-Diffusion
...and 23 more sections

Figures (24)

Figure 1: An overview of the proposed PEA-Diffusion. Notice that only the lightweight adapter is trainable through the whole training process.
Figure 1: CLIPScore with different specific languages. For Chinese, we evaluated two general evaluation metrics with MG data. $\dagger$ indicates open-source models.
Figure 2: Image generation visualization of different models in Chinese specific language.
Figure 2: Ablation results for different knowledge strategies.
Figure 3: Training parameters (M) and cost comparison for different methods.
...and 19 more figures
