Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection

Gihyun Kwon, Jangho Park, Jong Chul Ye

TL;DR

The paper tackles editing across multiple modalities (3D scenes, videos, and panoramas) without training modality-specific diffusion models. It introduces a unified, training-free framework that couples disentangled editing of a reference image with cross-image context transfer through two parallel sampling paths. DDIM inversion supplies the inverted features, which are injected into the U-Net at selected layers. A timestep-based schedule modulates injection strength, balancing structure preservation ($t_{edit}$) against cross-image context ($t_{context}$) to produce coherent edits across frames and patches. Experiments on NeRF-based 3D scenes, panorama images, and video frames report better editing quality and cross-image consistency than the baselines, showing that a plain 2D diffusion model is practical for multi-modal editing and supports extensions such as custom concepts and localized edits.
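The timestep-based schedule can be pictured with a minimal sketch. The summary does not define $t_{edit}$ and $t_{context}$ precisely, so the code below assumes (hypothetically) that each is a fraction of the sampling run and that injection is active during the early, noisiest steps. The class name `InjectionSchedule` and its `flags` method are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InjectionSchedule:
    """Hypothetical schedule; thresholds are fractions of the sampling run."""

    t_edit: float     # assumed: share of early steps that inject inverted features
    t_context: float  # assumed: share of early steps that share reference key/value

    def flags(self, step: int, num_steps: int) -> Tuple[bool, bool]:
        """Return (inject_structure, inject_context) for a sampling step.

        `step` counts from 0 (noisiest) up to num_steps - 1 (cleanest).
        """
        progress = step / num_steps
        return progress < self.t_edit, progress < self.t_context


schedule = InjectionSchedule(t_edit=0.8, t_context=0.5)
for step in (0, 30, 45):
    print(step, schedule.flags(step, num_steps=50))
```

Under these assumptions, a larger `t_edit` keeps more of the source structure, while a larger `t_context` ties each image more closely to the reference edit.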

Abstract

While text-to-image models have achieved impressive capabilities in image generation and editing, their application across various modalities often necessitates training separate models. Inspired by existing methods for single-image editing with self-attention injection and video editing with shared attention, we propose a novel unified editing framework that combines the strengths of both approaches while using only a basic 2D text-to-image (T2I) diffusion model. Specifically, we design a sampling method that edits consecutive images while maintaining semantic consistency by using shared self-attention features during both the reference and the consecutive image sampling processes. Experimental results confirm that our method enables editing across diverse modalities, including 3D scenes, videos, and panorama images.
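As a rough illustration of what sharing self-attention features across sampling paths can look like, the NumPy sketch below lets the queries of the current image attend to keys and values from both a reference image and the image itself. Concatenating the two sets is an assumption made here for illustration; the abstract does not say whether the method concatenates or replaces keys and values, and the function name `shared_self_attention` is hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def shared_self_attention(q, k, v, k_ref, v_ref):
    """Attention where the current image also attends to reference tokens.

    q, k, v:       (n_tokens, dim) features of the image being sampled
    k_ref, v_ref:  (m_tokens, dim) features taken from the reference path
    Concatenation is an illustrative assumption, not the paper's stated rule.
    """
    k_all = np.concatenate([k_ref, k], axis=0)
    v_all = np.concatenate([v_ref, v], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v_all


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
k_ref, v_ref = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = shared_self_attention(q, k, v, k_ref, v_ref)
print(out.shape)
```

Because every output token is a weighted average over reference and current values, edits made in the reference path can influence the appearance of each subsequent image.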

Paper Structure (15 sections, 4 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Method overview. (a) Plug-and-play diffusion (PnP) single-image editing: the method injects ResNet, query, and key features during sampling for disentangled image editing. (b) Pix2Vid propagates context from one sampling path to the subsequent one by injecting key and value features. (c) Our proposed method applies DDIM inversion to a series of images to obtain the initial noise. During inversion, we extract the ResNet and self-attention features from the diffusion U-Net. Starting from the inverted noise, we sample the outputs with feature injection: we inject the inverted ResNet and self-attention features into the image-editing path and, for consistent editing, propagate the key and value features of the edited sample to the subsequent sampling path. Our method enables editing across various modalities, including 3D scenes, panoramas, and videos.
  • Figure 2: Qualitative evaluation of 3D scene editing. Our method outperforms the baselines in both semantic editing and overall style transfer.
  • Figure 3: Qualitative evaluation of panorama editing. We compare the sampled panorama outputs. Our method outperforms the baselines, producing realistic outputs with high structural consistency.
  • Figure 4: Qualitative evaluation of video editing. Our method shows better cross-frame consistency and better semantic alignment between text and output than the baseline methods.
  • Figure 5: Results of custom concept editing. Our proposed method successfully transfers the semantics of custom concepts to various modalities.
  • ...and 5 more figures