InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization

Daniel Gilo, Or Litany

TL;DR

This work tackles sparse-view multi-view image editing by transferring edits from a powerful monocular editor into a pretrained multi-view diffusion model to enforce a strong 3D prior. The approach, InstructMix2Mix (I-Mix2Mix), personalizes a multi-view diffusion student (SEVA) via incremental Score Distillation Sampling (SDS) updates, employing a stochastic teacher-forward schedule and Random Cross-View Attention to maintain cross-view coherence. Key contributions include replacing neural-field consolidators with a data-driven 3D prior, adapting SDS for multi-view personalization, and demonstrating substantial improvements in cross-view consistency while preserving per-frame edit quality. The method enables robust, instruction-faithful edits from extremely sparse inputs and shows promise for broader multi-view generation tasks, albeit with increased computational cost due to iterative distillation.

Abstract

We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.

Paper Structure

This paper contains 43 sections, 3 equations, 23 figures, 4 tables, 1 algorithm.

Figures (23)

  • Figure 1: I-Mix2Mix overview. Given a set of input images, a randomly chosen reference image is edited by the frozen teacher and encoded to serve as the personalized multi-view student's input latent (Initialization). At each distillation iteration, noisy multi-view latents $\zeta_\tau$ are denoised by the student (Student Query), aligned to the teacher’s latent space (Alignment), and perturbed with our forward schedule (Perturbation). The teacher predicts edits with Random Cross-View Attention (Teacher Prediction), where all frames attend to the $\kappa$-th frame, and the resulting supervision is distilled back into the student (Student Update). After distillation, the student outputs a set of multi-view consistent edited frames.
  • Figure 2: The five SDS stages.
  • Figure 3: Random Cross-View Attention effect when used for full teacher sampling.
  • Figure 4: Qualitative comparison with prior work. The top row shows the original scenes, and the lower rows present edits from different methods. Matching red or purple rectangles indicate pairs of inconsistent regions, which frequently appear in baselines but not in our edits. Please zoom in electronically for details; enlarged views are provided in Appendix \ref{sec:extended_comparison_to_baselines}.
  • Figure 5: Failure cases from variants of the perturbation and teacher prediction stages. Rows 1–2: alternative forward schedules collapse to near-identity edits. Row 3: removing RCVAttn breaks multi-view coherence. Row 4: full method output.
  • ...and 18 more figures
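
The Figure 1 caption describes Random Cross-View Attention as a step in which all frames attend to a single frame $\kappa$. The block below is a minimal NumPy sketch of that idea only, not the authors' implementation. It assumes a single attention head, no learned projections, and a uniformly sampled $\kappa$. The function name `random_cross_view_attention`, the tensor layout `(views, tokens, dim)`, and the sampling rule are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def random_cross_view_attention(q, k, v, rng):
    """Illustrative sketch (assumed layout): q, k, v have shape (views, tokens, dim).

    Every view keeps its own queries but attends to the keys and values
    of one randomly chosen view, kappa. Returns (output, kappa).
    """
    num_views, _, dim = q.shape
    kappa = int(rng.integers(num_views))  # assumption: kappa is sampled uniformly
    k_ref = k[kappa]                      # (tokens, dim), shared by all views
    v_ref = v[kappa]                      # (tokens, dim), shared by all views
    scores = q @ k_ref.T / np.sqrt(dim)   # (views, tokens, tokens)
    weights = softmax(scores, axis=-1)
    out = weights @ v_ref                 # (views, tokens, dim)
    return out, kappa


rng = np.random.default_rng(0)
q = rng.normal(size=(3, 5, 4))
k = rng.normal(size=(3, 5, 4))
v = rng.normal(size=(3, 5, 4))
out, kappa = random_cross_view_attention(q, k, v, rng)
print(out.shape, kappa)
```

In this sketch the score matrix per view has the same shape as in ordinary per-frame self-attention, which is in line with the abstract's statement that the attention modification adds no cost. For the view equal to `kappa`, the result reduces to standard self-attention on that frame.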