Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers
Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu
TL;DR
Prompt Reinjection addresses prompt forgetting in Multimodal Diffusion Transformers by reinjecting aligned shallow text features into deeper layers during inference. This training-free intervention uses distribution anchoring and orthogonal Procrustes alignment to stabilize cross-layer semantic transfer, substantially improving instruction following and spatial-numerical reasoning across multiple MMDiT backbones without sacrificing image quality. Layer-wise analyses (CKNNA and probes) reveal robust mitigation of semantic drift, while ablations underscore the importance of origin layer choice, full-layer reinjection, and alignment components. The approach offers a practical, scalable boost to prompt adherence in complex text–image generation tasks with broad applicability to real-world prompts and diverse prompt styles.
Abstract
Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch are progressively forgotten as depth increases. We verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the text-branch representations layer by layer. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text–image generation quality.
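To make the mechanism concrete, the following is a minimal NumPy sketch of reinjecting shallow text features into a deeper layer, using the two ingredients named in the TL;DR: orthogonal Procrustes alignment and distribution anchoring. It is an illustration under stated assumptions, not the authors' implementation. The function names, the per-token mean and standard-deviation matching used as "anchoring", the additive residual form, and the `weight` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def orthogonal_procrustes(src, dst):
    """Return the orthogonal matrix R that minimizes ||src @ R - dst||_F.

    src, dst: arrays of shape (num_tokens, dim).
    """
    # Closed-form solution: R = U @ V^T, where U S V^T is the SVD of src^T dst.
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt


def anchor_distribution(x, ref, eps=1e-6):
    """Rescale x so each token matches the mean and std of the same token in ref.

    Assumption: "distribution anchoring" is modeled here as simple per-token
    moment matching; the paper may define it differently.
    """
    x_norm = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)
    return x_norm * ref.std(-1, keepdims=True) + ref.mean(-1, keepdims=True)


def reinject(shallow, deep, weight=0.1):
    """Add aligned shallow-layer text features to deep-layer text features.

    shallow, deep: text-token features of shape (num_tokens, dim) taken from
    an early and a later layer. `weight` is a hypothetical mixing strength.
    """
    rotation = orthogonal_procrustes(shallow, deep)
    aligned = anchor_distribution(shallow @ rotation, deep)
    return deep + weight * aligned
```

In a real MMDiT, a step like `reinject` would be applied to the text-branch hidden states at inference time, for example through forward hooks on the chosen layers. Which layer supplies `shallow` and which layers receive the reinjection are the design choices the TL;DR says the ablations examine.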
