ReMix: Towards a Unified View of Consistent Character Generation and Editing

Benjia Zhou, Bin Fu, Pei Cheng, Yanru Wang, Jiayuan Fan, Tao Chen

TL;DR

ReMix tackles the challenge of unifying consistent character generation and editing under a single diffusion-based framework. It combines a semantic editing module (ReMix Module) powered by Multimodal Large Language Model embeddings with an IP-ControlNet that decouples semantic and spatial cues, and introduces an ε-equivariant latent space to align reference and target features. The approach avoids retraining the diffusion backbone and demonstrates strong results on both character-centric generation and image editing, with ablations confirming the importance of ε-equivariant optimization, DVE, and ID loss. Its modular design and efficient training offer a practical path toward robust, identity-preserving, and layout-consistent synthesis across tasks.

Abstract

Recent advances in large-scale text-to-image diffusion models (e.g., FLUX.1) have greatly improved visual fidelity in consistent character generation and editing. However, existing methods rarely unify these tasks within a single framework. Generation-based approaches struggle with fine-grained identity consistency across instances, while editing-based methods often lose spatial controllability and instruction alignment. To bridge this gap, we propose ReMix, a unified framework for character-consistent generation and editing. It comprises two core components: the ReMix Module and IP-ControlNet. The ReMix Module leverages the multimodal reasoning ability of MLLMs to edit semantic features of input images and adapt instruction embeddings to the native DiT backbone without fine-tuning. While this ensures coherent semantic layouts, pixel-level consistency and pose controllability remain challenging. To address this, IP-ControlNet extends ControlNet to decouple semantic and layout cues from reference images and introduces an ε-equivariant latent space that jointly denoises the reference and target images within a shared noise space. Inspired by convergent evolution and quantum decoherence, in which environmental noise drives state convergence, this design promotes feature alignment in the hidden space, enabling consistent object generation while preserving identity. ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis. Extensive experiments validate its effectiveness and efficiency as a unified framework for character-consistent image generation and editing.
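
The idea of denoising the reference and target images "within a shared noise space" can be illustrated with a toy sketch. This is not the paper's formulation: the function names, the 4-dimensional toy latents, and the linear interpolation x_t = (1 - t) * x_0 + t * ε are all assumptions made for illustration. The sketch only shows that when both latents are perturbed with the same ε at the same t, their difference shrinks by a factor of (1 - t), which is one way to read the "noise drives state convergence" intuition.

```python
import random

def sample_noise(dim, rng):
    """Draw one Gaussian noise vector eps ~ N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def add_noise(latent, eps, t):
    """Assumed linear interpolation: x_t = (1 - t) * x_0 + t * eps."""
    return [(1.0 - t) * x + t * e for x, e in zip(latent, eps)]

def shared_noise_pair(ref_latent, tgt_latent, t, rng):
    """Noise the reference and target latents with the SAME eps and the same t."""
    eps = sample_noise(len(ref_latent), rng)
    return add_noise(ref_latent, eps, t), add_noise(tgt_latent, eps, t), eps

def latent_gap(a, b):
    """Largest per-coordinate difference between two latents."""
    return max(abs(x - y) for x, y in zip(a, b))

if __name__ == "__main__":
    rng = random.Random(0)
    ref = [0.5, -1.0, 2.0, 0.0]   # hypothetical reference latent
    tgt = [0.4, -0.8, 2.5, 0.3]   # hypothetical target latent
    for t in (0.0, 0.5, 1.0):
        ref_t, tgt_t, _ = shared_noise_pair(ref, tgt, t, rng)
        # The gap equals (1 - t) * latent_gap(ref, tgt) because eps cancels.
        print(f"t={t:.1f}  gap={latent_gap(ref_t, tgt_t):.3f}")
```

With independent noise draws for the two latents the ε terms would not cancel, so the gap would not shrink in this way; that contrast is the only point the sketch is meant to make.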

Paper Structure

This paper contains 39 sections, 11 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Examples generated by ReMix. The framework supports a wide range of multimodal synthesis tasks, such as personalized generation, image editing, layout-consistent synthesis, style transfer, multi-condition generation, and narrative-driven story visualization.
  • Figure 2: Method Overview. The architecture includes two major components: the ReMix Module and IP-ControlNet. The ReMix Module uses an MLLM to edit the semantic content of images, while IP-ControlNet controls pixel-level consistent generation by extracting low-level visual features.
  • Figure 3: Overview of the Semantic Editing Pipeline. The ReMix Module implements semantic editing of Redux flux1-dev features through a learnable Connector.
  • Figure 4: (a) Qualitative comparison results on Kontext-Bench1K labs2025flux. (b) Quantitative visualization results (see the Appendix for more).
  • Figure 5: Effect of $\epsilon$-equivariant optimization. Middle: 500K iter. standard one-to-many setting; Right: last 100K iter. $\epsilon$-equivariant training.
  • ...and 9 more figures