FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Tianyi Wei; Yifan Zhou; Dongdong Chen; Xingang Pan

TL;DR

This work delivers the first mechanistic analysis of RoPE-based MMDiT (FLUX) by probing layer-wise dependencies on positional embeddings versus content similarity during generation. It introduces an automated RoPE-manipulation strategy and a PSNR-based metric to map each layer's reliance, uncovering non-trivial, non-depth-correlated patterns. Guided by these insights, it proposes a training-free, task-specific editing framework that categorizes editing tasks into position-dependent, content similarity-dependent, and region-preserved types with tailored key–value injections and a reasoning-before-generation step. Across object addition, non-rigid editing, and background replacement, the approach achieves superior qualitative and quantitative results compared to state-of-the-art baselines and demonstrates strong generalization to additional tasks like object movement and outpainting.
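
As a rough illustration of the PSNR-based metric mentioned above, the sketch below computes PSNR between a reference image and an image generated with one layer's RoPE altered, then orders layers by score. The helper names (`psnr`, `rank_layers_by_position_reliance`) and the reading that a lower PSNR indicates stronger reliance on positional embeddings are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def psnr(reference, probe, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    probe = np.asarray(probe, dtype=np.float64)
    mse = np.mean((reference - probe) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def rank_layers_by_position_reliance(reference, probes_by_layer):
    """Sort (layer, PSNR) pairs from lowest to highest PSNR.

    Assumption: probes_by_layer maps a layer index to the image produced
    when only that layer's RoPE is altered, so a low PSNR suggests the
    layer leans more heavily on positional information.
    """
    scores = {layer: psnr(reference, img) for layer, img in probes_by_layer.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

In practice the probe images would come from full generation runs with a fixed seed, so that the RoPE change in the probed layer is the only source of difference.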

Abstract

The integration of Rotary Position Embedding (RoPE) into the Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, how strongly each self-attention layer relies on positional embedding versus query-key similarity during generation remains an open question. We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional dependencies from content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.
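
To make the distinction between positional and content dependence concrete, here is a minimal NumPy sketch of attention logits with and without RoPE. It uses a simplified one-dimensional, rotate-half form of RoPE, whereas FLUX applies RoPE over multiple position axes; the function names `rope_rotate` and `attention_logits` are hypothetical and this is not the paper's probing code.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply a simplified 1D rotary position embedding.

    x: array of shape (n_tokens, dim) with even dim.
    positions: array of shape (n_tokens,) holding token positions.
    """
    positions = np.asarray(positions, dtype=np.float64)
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per channel pair
    angles = positions[:, None] * freqs[None, :]     # (n_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def attention_logits(q, k, positions=None):
    """Scaled dot-product logits; RoPE is applied only if positions are given."""
    if positions is not None:
        q = rope_rotate(q, positions)
        k = rope_rotate(k, positions)
    return q @ k.T / np.sqrt(q.shape[1])
```

With positions supplied, the logits depend on the relative offset between tokens as well as on query-key similarity; with all positions set equal (or omitted), only the content similarity term remains. Manipulating the positions fed to a single layer is therefore one plausible way to test how much that layer depends on position.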
