Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection
Gihyun Kwon, Jangho Park, Jong Chul Ye
TL;DR
The paper tackles editing across multiple modalities (3D scenes, videos, and panoramas) without training modality-specific diffusion models. It introduces a unified framework that couples disentangled editing of a reference image with cross-image context transfer through two parallel sampling paths. DDIM inversion supplies inverted features, which are injected into the U-Net at selected layers. A timestep-based scheduler modulates the injection strength, balancing structure preservation ($t_{edit}$) against cross-image context ($t_{context}$) to produce coherent edits across frames and patches. Empirical results on 3D scenes (NeRF-based), panorama images, and video frames show superior editing quality and cross-image consistency compared with baselines. They suggest that a plain 2D diffusion model, used without any additional training, is practical for multi-modal editing, and that the framework extends to custom concepts and localized edits.
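As a rough illustration of the timestep-based scheduling described above, the sketch below decides which injections are active at a given sampling step. It is not the authors' implementation: the function name `injection_flags`, the dictionary keys, and the rule "inject while the timestep is at or above the threshold" are all assumptions made for illustration. The summary only states that $t_{edit}$ governs structure preservation and $t_{context}$ governs cross-image context.

```python
def injection_flags(t, t_edit, t_context):
    """Decide which feature injections are active at timestep t.

    Assumptions (not taken from the paper):
      - sampling runs from high-noise timesteps down to 0;
      - each injection is active while t is at or above its threshold.
    """
    return {
        # Inject DDIM-inverted features to preserve the source structure.
        "inject_inverted_features": t >= t_edit,
        # Share self-attention features to carry context across images.
        "share_reference_attention": t >= t_context,
    }


if __name__ == "__main__":
    # Hypothetical thresholds and a coarse, descending timestep grid.
    for t in (900, 600, 300, 0):
        print(t, injection_flags(t, t_edit=500, t_context=200))
```

Under these assumptions, raising `t_edit` shortens the window in which source structure is enforced, while lowering `t_context` lengthens the window in which cross-image context is shared.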
Abstract
While text-to-image models have achieved impressive capabilities in image generation and editing, their application across various modalities often necessitates training separate models. Inspired by existing methods for single-image editing with self-attention injection and for video editing with shared attention, we propose a novel unified editing framework that combines the strengths of both approaches while using only a basic 2D text-to-image (T2I) diffusion model. Specifically, we design a sampling method that edits consecutive images while maintaining semantic consistency, by utilizing shared self-attention features during both the reference and the consecutive image sampling processes. Experimental results confirm that our method enables editing across diverse modalities, including 3D scenes, videos, and panorama images.
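The idea of shared self-attention can be sketched in a few lines: queries from the image being edited attend to keys and values gathered from both that image and the reference image, so the output mixes in reference features. This is a minimal, dependency-free sketch of the general technique, not the paper's code. The function names, the plain concatenation of keys and values, and the toy token vectors are assumptions for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out


def shared_self_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Target queries attend to target and reference tokens together.

    Assumption: sharing is done by concatenating keys and values along
    the token axis; other sharing schemes are possible.
    """
    return attention(q_tgt, k_tgt + k_ref, v_tgt + v_ref)


if __name__ == "__main__":
    # Toy example: one target token and one reference token, 2-D features.
    out = shared_self_attention(
        q_tgt=[[1.0, 0.0]],
        k_tgt=[[0.0, 0.0]], v_tgt=[[0.0, 0.0]],
        k_ref=[[0.0, 0.0]], v_ref=[[2.0, 4.0]],
    )
    print(out)
```

The output has one row per target token, so the target's spatial layout is kept while its content is drawn partly from the reference.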
