Collaborative Score Distillation for Consistent Visual Synthesis
Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin
TL;DR
Collaborative Score Distillation (CSD) extends the use of diffusion priors to multi-sample settings: it applies Stein variational gradient descent (SVGD) to share score information across a set of samples, which promotes consistent synthesis across panoramas, videos, and 3D scenes. The authors also introduce CSD-Edit, a text-guided editing variant that uses an image-conditional baseline to avoid degrading source details. They report improved inter-sample consistency and instruction fidelity in panorama, video, and NeRF-based 3D scene editing, with ablations supporting the importance of both SVGD and the baseline choice. The approach broadens the applicability of large-scale text-to-image diffusion models to high-dimensional modalities without modifying the underlying diffusion model.
Abstract
Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
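To make the "particles" idea concrete, the following is a minimal sketch of a generic SVGD update in NumPy. It is not the authors' implementation: the RBF kernel with a median-heuristic bandwidth, the function names (`rbf_kernel`, `svgd_direction`, `run_svgd`), and the toy Gaussian score are all assumptions chosen for illustration. In CSD the per-sample score would come from a text-to-image diffusion model, which is replaced here by an analytic stand-in.

```python
import numpy as np


def rbf_kernel(x):
    """RBF kernel matrix and its gradients for particles x of shape (n, d).

    Returns k with k[j, i] = k(x_j, x_i), and grad_k with
    grad_k[j, i] = gradient of k(x_j, x_i) with respect to x_j.
    The bandwidth uses the median heuristic (an assumption of this sketch).
    """
    diff = x[:, None, :] - x[None, :, :]            # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)            # (n, n)
    h = np.median(sq_dist) / np.log(x.shape[0] + 1.0) + 1e-8
    k = np.exp(-sq_dist / h)                        # (n, n)
    grad_k = -2.0 / h * diff * k[:, :, None]        # (n, n, d)
    return k, grad_k


def svgd_direction(x, score_fn):
    """SVGD update direction for every particle.

    phi_i = (1/n) * sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    The first term mixes the scores of all particles (the "collaborative"
    part); the second term is a repulsive force that keeps particles apart.
    """
    n = x.shape[0]
    scores = score_fn(x)                            # (n, d)
    k, grad_k = rbf_kernel(x)
    return (k.T @ scores + grad_k.sum(axis=0)) / n  # (n, d)


def run_svgd(x0, score_fn, step_size=0.1, num_steps=2000):
    """Plain gradient-ascent loop over the SVGD direction."""
    x = x0.copy()
    for _ in range(num_steps):
        x = x + step_size * svgd_direction(x, score_fn)
    return x


def toy_score(x):
    """Score of a standard Gaussian, a stand-in for a diffusion-model score."""
    return -x
```

A usage example under the same assumptions: `run_svgd(np.random.default_rng(0).normal(5.0, 1.0, size=(20, 2)), toy_score)` moves 20 two-dimensional particles from around (5, 5) toward the standard Gaussian centered at the origin. The kernel-weighted sum is what lets each sample's update depend on the scores of the others, which is the mechanism the abstract credits for inter-sample consistency.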
