FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag

TL;DR

FluxSpace addresses disentangled semantic editing in rectified-flow transformers by using the joint transformer blocks to extract semantically meaningful directions from attention outputs. It defines fine-grained edits as linear directions in attention space and coarse edits via pooled text embeddings, so edits are applied at inference time without any training. The method shows improved disentanglement, preserving identity under edits in several domains (faces, cars, scenes), and outperforms state-of-the-art editing methods in qualitative comparisons and on quantitative metrics, including CLIP and DINO; these results are supported by a user study and ablations. Ethical considerations for realistic manipulation are discussed, emphasizing the need for guidelines that mitigate potential misuse while still enabling research into controllable image editing.

Abstract

Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation prevents precise, attribute-specific modifications that leave unrelated aspects of the image unchanged. In this paper, we introduce FluxSpace, a domain-agnostic image editing method that leverages a representation space able to control the semantics of images generated by rectified flow transformers, such as Flux. Using the representations learned by the transformer blocks within rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work offers a scalable, effective, and disentangled approach to image editing.

Paper Structure

This paper contains 34 sections, 11 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: FluxSpace. We propose a text-guided image editing approach for rectified flow transformers [esser2024scaling], such as Flux. Our method generalizes to semantic edits in different domains such as humans, animals, and cars, and extends to more complex scenes such as an image of a street (third row, first example). FluxSpace can apply edits described as keywords (e.g. "truck" for transforming a car into a truck) and offers disentangled editing that does not require manually provided masks to target a specific aspect of the original image. In addition, our method does not require any training and applies the desired edit at inference time.
  • Figure 2: FluxSpace Framework. The FluxSpace framework introduces a dual-level editing scheme within the joint transformer blocks of Flux, enabling coarse and fine-grained visual editing. Coarse editing operates on pooled representations of the base ($c_{pool}$) and edit ($c_{e, pool}$) conditions, allowing global changes like stylization, controlled by the scale $\lambda_{coarse}$ (a). For fine-grained editing, we define a linear editing scheme using base, prior, and edit attention outputs, guided by the scale $\lambda_{fine}$ (b). With this flexible design, our framework can perform both coarse-level and fine-grained editing, each with a linearly adjustable scale.
  • Figure 3: Qualitative Results on Face Editing. Our method can perform a variety of edits, from fine-grained face editing (e.g. adding eyeglasses) to changes to the overall structure of the image (e.g. comics style). Because our method uses disentangled representations to perform image editing, we can precisely edit a variety of attributes while preserving the properties of the original image.
  • Figure 4: Qualitative Comparisons. We compare the disentangled editing capabilities of our method with both latent diffusion-based approaches (LEDITS++ [brack2024ledits++] and TurboEdit [deutch2024turboedittextbasedimageediting]) and flow-based methods (Sliders-FLUX [gandikota2023concept] and RF-Inversion [rout2024rfinversion]). We present qualitative results for smile, eyeglasses, and age edits, where our method outperforms competing methods both in reflecting the target semantics and in preserving the input identity.
  • Figure 5: Real Image Editing. By integrating FluxSpace into the inversion approach proposed by RF-Inversion [rout2024rfinversion], we extend our method to the real image editing task. As we show qualitatively, our method achieves improved disentanglement in the performed edits compared to the baseline approach, with identical inversion hyperparameters used for both approaches.
  • ...and 7 more figures
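
The dual-level scheme described in the Figure 2 caption can be illustrated with a small sketch. The caption states only that both edits are linear and scaled by $\lambda_{coarse}$ and $\lambda_{fine}$; it does not give the formulas. The code below is therefore one plausible reading, not the authors' implementation: it assumes the edit direction is the edit representation with its component along a reference (the base pooled embedding for coarse edits, the prior attention output for fine edits) removed by orthogonal projection. All function names are hypothetical, and plain Python lists stand in for single embedding or attention vectors.

```python
# Hypothetical sketch of a dual-level linear edit (names are illustrative).
# Assumption: the edit direction is the edit vector minus its orthogonal
# projection onto a reference vector; the caption does not specify this.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def remove_component(v, ref):
    """Return v with its projection onto ref subtracted."""
    denom = dot(ref, ref)
    if denom == 0.0:
        # A zero reference has no direction to remove.
        return list(v)
    scale = dot(v, ref) / denom
    return [x - scale * r for x, r in zip(v, ref)]

def linear_edit(base, direction, lam):
    """Shift base along direction by a linearly adjustable scale lam."""
    return [b + lam * d for b, d in zip(base, direction)]

def coarse_edit(c_pool, c_e_pool, lam_coarse):
    """Coarse edit on pooled embeddings (assumed form)."""
    direction = remove_component(c_e_pool, c_pool)
    return linear_edit(c_pool, direction, lam_coarse)

def fine_edit(attn_base, attn_prior, attn_edit, lam_fine):
    """Fine-grained edit on attention outputs (assumed form)."""
    direction = remove_component(attn_edit, attn_prior)
    return linear_edit(attn_base, direction, lam_fine)

if __name__ == "__main__":
    base, prior, edit = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
    for lam in (0.0, 0.5, 1.0):
        print(lam, fine_edit(base, prior, edit, lam))
```

Under this reading, a scale of zero returns the base representation unchanged, and increasing the scale moves the output further along the edit direction, which matches the caption's description of a linearly adjustable edit strength.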