Collaborative Score Distillation for Consistent Visual Synthesis

Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin

TL;DR

Collaborative Score Distillation (CSD) extends diffusion-prior usage to multi-sample settings by using Stein variational gradient descent to share score information across a set of samples, promoting consistency in synthesis across panoramas, videos, and 3D scenes. The authors introduce CSD-Edit to enable text-guided editing with a minimal, image-conditional baseline, avoiding degradation of source details. They demonstrate significant gains in inter-sample consistency and instruction fidelity across panorama, video, and NeRF-based 3D scene editing, with ablations confirming the importance of SVGD and the baseline choice. The approach broadens the applicability of large-scale text-to-image diffusion models to high-dimensional modalities without modifying the underlying diffusion models.

Abstract

Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
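The particle view described in the abstract can be illustrated with a generic SVGD update. The sketch below is not the authors' implementation: it assumes a toy Gaussian target whose score is known in closed form, whereas CSD would obtain its score signal from a text-to-image diffusion model. The names `rbf_kernel`, `toy_score`, and `svgd_step`, the fixed bandwidth, and the step size are all hypothetical choices made for illustration.

```python
import math
import random


def rbf_kernel(a, b, bandwidth):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / bandwidth)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / bandwidth)


def toy_score(x, mean):
    """Score of a unit-variance isotropic Gaussian: grad log p(x) = mean - x.
    Assumption: a stand-in for the score a diffusion prior would provide."""
    return [m - xi for xi, m in zip(x, mean)]


def svgd_step(particles, score_fn, step_size=0.1, bandwidth=1.0):
    """One SVGD update: every particle moves along a kernel-weighted average
    of all particles' scores plus a repulsive kernel-gradient term."""
    n = len(particles)
    dim = len(particles[0])
    scores = [score_fn(p) for p in particles]
    updated = []
    for xi in particles:
        direction = [0.0] * dim
        for xj, sj in zip(particles, scores):
            k = rbf_kernel(xj, xi, bandwidth)
            for d in range(dim):
                # Gradient of k(xj, xi) with respect to xj (pushes xi away from xj).
                grad_k = -2.0 * (xj[d] - xi[d]) / bandwidth * k
                # Shared score information (attraction) plus repulsion.
                direction[d] += k * sj[d] + grad_k
        updated.append([x + step_size * g / n for x, g in zip(xi, direction)])
    return updated


# Usage: eight 2D particles updated jointly toward a toy target.
random.seed(0)
mean = [2.0, -1.0]
particles = [[random.uniform(-3, 3), random.uniform(-3, 3)] for _ in range(8)]
for _ in range(500):
    particles = svgd_step(particles, lambda x: toy_score(x, mean))
centroid = [sum(p[d] for p in particles) / len(particles) for d in range(2)]
```

The kernel-weighted sum is what lets each particle use the scores of its neighbours, which is the mechanism the abstract credits for inter-sample consistency. The kernel-gradient term keeps the particles from collapsing onto a single point.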

Paper Structure (38 sections, 14 equations, 14 figures, 12 tables)

Figures (14)

  • Figure 1: Method overview. CSD-Edit enables various visual-to-visual translations with two novel components. First, a new score distillation scheme based on Stein variational gradient descent considers inter-sample relationships (Section \ref{sec:csd}) to synthesize a set of images while preserving modality-specific consistency constraints. Second, the method edits images using only the minimal information given by the text instruction, subtracting an image-conditional noise estimate instead of random noise during score distillation (Section \ref{sec:csdedit}). As a result, CSD-Edit can be used for text-guided manipulation of various visual domains, e.g., panorama images, videos, and 3D scenes (Section \ref{sec:appl}).
  • Figure 2: Panorama image editing. (Top right) Instruct-Pix2Pix \cite{brooks2022instructpix2pix} applied to cropped patches produces inconsistent edits. (Second row) Instruct-Pix2Pix with MultiDiffusion \cite{bar2023multidiffusion} produces a consistent image but with lower fidelity to the instruction, even at a high guidance scale $\omega_y$. (Third row) CSD-Edit provides consistent editing with better instruction fidelity when the guidance scale is set properly.
  • Figure 3: Video editing. Qualitative results on the lucia video in DAVIS 2017 \cite{Pont-Tuset_arXiv_2017}. CSD shows frame-wise consistent editing, providing coherent content across video frames, e.g., consistent color and background with no changes to the person. Compared to Gen-1 \cite{esser2023structure}, a video editing method trained on a large video dataset, CSD-Edit produces high-quality video edits that reflect the given prompts.
  • Figure 4: 3D NeRF scene editing. Novel views of the edited Fangzhou NeRF scene \cite{wang2022nerf}. CSD-Edit produces high-quality edits of 3D scenes and better preserves the semantics of the source scene, e.g., it obtains sharp facial details (left) and makes him smile without adding a beard (right).
  • Figure 5: Panorama image editing. Comparison of CSD-Edit with baselines at different guidance scales $\omega_y\in\{3.0, 5.0, 7.5, 10.0\}$.
  • ...and 9 more figures