ReMix: Towards a Unified View of Consistent Character Generation and Editing

Benjia Zhou; Bin Fu; Pei Cheng; Yanru Wang; Jiayuan Fan; Tao Chen

TL;DR

ReMix tackles the challenge of unifying consistent character generation and editing under a single diffusion-based framework. It combines a semantic editing module (ReMix Module) powered by Multimodal Large Language Model embeddings with an IP-ControlNet that decouples semantic and spatial cues, and introduces an ε-equivariant latent space to align reference and target features. The approach avoids retraining the diffusion backbone and demonstrates strong results on both character-centric generation and image editing, with ablations confirming the importance of ε-equivariant optimization, DVE, and ID loss. Its modular design and efficient training offer a practical path toward robust, identity-preserving, and layout-consistent synthesis across tasks.

Abstract

Recent advances in large-scale text-to-image diffusion models (e.g., FLUX.1) have greatly improved visual fidelity in consistent character generation and editing. However, existing methods rarely unify these tasks within a single framework. Generation-based approaches struggle with fine-grained identity consistency across instances, while editing-based methods often lose spatial controllability and instruction alignment. To bridge this gap, we propose ReMix, a unified framework for character-consistent generation and editing. It comprises two core components: the ReMix Module and IP-ControlNet. The ReMix Module leverages the multimodal reasoning ability of MLLMs to edit semantic features of input images and adapt instruction embeddings to the native DiT backbone without fine-tuning. While this ensures coherent semantic layouts, pixel-level consistency and pose controllability remain challenging. To address this, IP-ControlNet extends ControlNet to decouple semantic and layout cues from reference images and introduces an ε-equivariant latent space that jointly denoises the reference and target images within a shared noise space. Inspired by convergent evolution and quantum decoherence, i.e., settings where environmental noise drives state convergence, this design promotes feature alignment in the hidden space, enabling consistent object generation while preserving identity. ReMix supports a wide range of tasks, including personalized generation, image editing, style transfer, and multi-condition synthesis. Extensive experiments validate its effectiveness and efficiency as a unified framework for character-consistent image generation and editing.
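
The "shared noise space" idea can be illustrated with a toy sketch. This is not the authors' implementation: the helper name `shared_noise_pair`, the list-based latents, and the rectified-flow style interpolation x_t = (1 - t) * x_0 + t * ε are all assumptions made for illustration. It only shows that when a reference and a target latent are perturbed with the same ε, their difference shrinks by a factor of (1 - t), so the two states converge as the noise level grows.

```python
import random


def shared_noise_pair(ref_latent, tgt_latent, t, seed=0):
    """Noise two latents with ONE shared Gaussian sample.

    Hypothetical helper for illustration only; the interpolation
    x_t = (1 - t) * x_0 + t * eps is an assumed schedule.
    """
    if len(ref_latent) != len(tgt_latent):
        raise ValueError("latents must have the same length")
    rng = random.Random(seed)
    # A single eps is drawn and reused for both latents.
    eps = [rng.gauss(0.0, 1.0) for _ in ref_latent]
    noisy_ref = [(1 - t) * x + t * e for x, e in zip(ref_latent, eps)]
    noisy_tgt = [(1 - t) * x + t * e for x, e in zip(tgt_latent, eps)]
    return noisy_ref, noisy_tgt, eps


# Toy latents standing in for encoded reference and target images.
ref = [0.5, -1.0, 2.0]
tgt = [0.4, -0.9, 2.2]

noisy_ref, noisy_tgt, eps = shared_noise_pair(ref, tgt, t=0.8)
# The gap between the noised latents is (1 - t) times the clean gap.
gap = [a - b for a, b in zip(noisy_ref, noisy_tgt)]
```

With independent noise samples the gap would instead contain an extra random term, which is the contrast this sketch is meant to make visible.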
