CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing

Yucheng Wang, Zedong Wang, Yuetong Wu, Yue Ma, Dan Xu

TL;DR

Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer, and demonstrate the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts.

Abstract

Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, and therefore suffer from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters, which cannot dynamically prioritize or suppress conflicting modalities. This results in artifacts such as color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit), which aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information into the base image. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of the specialized experts, showing the importance of dynamic, condition-aware processing for mitigating multi-condition conflicts.
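
The sparse top-K routing and fusion steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the router logits are passed in directly rather than computed from the conditions and the diffusion timestep, the four toy experts are hypothetical stand-ins, and the fusion is a plain weighted sum rather than the paper's Latent Mixture module. The names `route_top_k` and `mix_experts` are likewise made up for illustration.

```python
import math

# Expert order assumed for this sketch: Text, Mask, Reference, Base.
EXPERT_NAMES = ("text", "mask", "reference", "base")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k=2):
    """Keep the k highest-probability experts and renormalize their weights.

    Returns a dict {expert_index: weight} whose weights sum to 1.
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def mix_experts(token, logits, experts, k=2):
    """Run only the selected experts on one token and fuse by weighted sum."""
    weights = route_top_k(logits, k)
    fused = [0.0] * len(token)
    for idx, weight in weights.items():
        expert_out = experts[idx](token)
        fused = [f + weight * v for f, v in zip(fused, expert_out)]
    return fused, weights


# Hypothetical toy experts; real experts would be learned adapter branches.
toy_experts = [
    lambda x: [v + 1.0 for v in x],  # "text": shifts the token
    lambda x: [0.0 for _ in x],      # "mask": zeroes the token
    lambda x: [v * 2.0 for v in x],  # "reference": scales the token
    lambda x: list(x),               # "base": identity
]

token = [1.0, -1.0]
logits = [2.0, -1.0, 0.5, 2.0]  # made-up router scores for one token
fused, weights = mix_experts(token, logits, toy_experts, k=2)
print({EXPERT_NAMES[i]: round(w, 3) for i, w in weights.items()}, fused)
```

With these made-up scores, only the Text and Base experts are evaluated for the token; the Mask and Reference experts are skipped entirely, which is where the sparsity saves computation.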

Paper Structure (70 sections, 20 equations, 15 figures, 4 tables)

Figures (15)

  • Figure 1: Overview of contextual image editing paradigms. (A) Instruction‑based editing guides modifications via a text prompt $\mathbf{T}$ and a base image $\mathbf{B}$. (B) Subject‑based editing uses the base image $\mathbf{B}$ and a reference image $\mathbf{R}$ to preserve identity or style. (C) CARE‑Edit incorporates all of these modalities $(\mathbf{T}, \mathbf{B}, \mathbf{R})$ and the user-defined mask $\mathbf{M}$ in a diffusion transformer (DiT) backbone with condition‑aware routing of experts.
  • Figure 2: CARE-Edit introduces condition‑aware specialized experts within the frozen DiT backbone. Given multimodal conditions, inputs are tokenized and projected to heterogeneous expert branches. The router assigns confidence scores and selects the top‑$K$ experts to process each token. Expert outputs are normalized, modulated, and fused through the Latent Mixture module, yielding denoised representations $\mathbf{h}'$ refined by Mask Repaint. Only lightweight adapters, the router, and fusion layers are trainable. This design lets CARE-Edit dynamically allocate computation, mitigate conflicts between heterogeneous conditions (e.g., text vs mask), and produce high-fidelity, coherent edits.
  • Figure 3: Qualitative comparison of subject-driven contextual editing. Green annotations indicate data belonging to the benchmark, while blue annotations denote whether mask inputs are provided. CARE-Edit preserves subject identity and keeps the surrounding context coherent.
  • Figure 4: Qualitative comparison of instruction-based editing.
  • Figure 5: Visualization of base expert attention map over iterations.
  • ...and 10 more figures