OneActor: Consistent Character Generation via Cluster-Conditioned Guidance

Jiahao Wang, Caixia Yan, Haonan Lin, Weizhan Zhang, Mengmeng Wang, Tieliang Gong, Guang Dai, Hao Sun

TL;DR

This work proposes a novel one-shot tuning paradigm that performs consistent subject generation efficiently, driven solely by prompts. It relies on learned semantic guidance and thereby bypasses laborious backbone tuning. The work also takes the lead in formalizing the objective of consistent subject generation from a clustering perspective, and designs a cluster-conditioned model accordingly.

Abstract

Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature hinders artists from creating consistent images of the same subject. Existing methods try to tackle this challenge and generate consistent content in various ways. However, they either depend on external restricted data or require expensive tuning of the diffusion model. To address this issue, we propose a novel one-shot tuning paradigm, termed OneActor. It efficiently performs consistent subject generation driven solely by prompts, using learned semantic guidance to bypass laborious backbone tuning. We take the lead in formalizing the objective of consistent subject generation from a clustering perspective, and design a cluster-conditioned model accordingly. To mitigate the overfitting challenge shared by one-shot tuning pipelines, we augment the tuning with auxiliary samples and devise two inference strategies: semantic interpolation and cluster guidance. These techniques are later verified to significantly improve generation quality. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory subject consistency, superior prompt conformity, and high image quality. Our method is capable of multi-subject generation and is compatible with popular diffusion extensions. In addition, we achieve a 4 times faster tuning speed than tuning-based baselines and, if desired, avoid increasing the inference time. Furthermore, our method can be naturally utilized to pre-train a consistent subject generation network from scratch, which will bring this research task into more practical applications. (Project page: https://johnneywang.github.io/OneActor-webpage/)

Paper Structure

This paper contains 31 sections, 25 equations, 20 figures, 2 tables, 1 algorithm.

Figures (20)

  • Figure 1: For every subject in the latent space, there are identity sub-clusters within the subject base cluster. (a) Given different prompts and initial noises, ordinary diffusion models generate inconsistent images from different identity sub-clusters of the "hobbit" base cluster. (b) In contrast, our OneActor, after a quick tuning, provides extra cluster guidance and thus generates images from the same target sub-cluster, which show a consistent identity. Different colors denote different identity sub-clusters.
  • Figure 2: The overall architecture of our method. (a) We first generate base images and construct the target and auxiliary sets. (b) We design a cluster-conditioned model and tune the projector with batched data. (c) The projector consists of a ResNet network, linear layers, and AdaIN layers. Tuned and frozen weights are denoted by fire and snowflake marks, respectively. The items used to compute different objectives are outlined in different colors. The unimplemented theoretical models are semi-transparent.
  • Figure 3: Observation of the semantic-latent guidance equivalence property. We vary the latent guidance scale on the left side and the semantic interpolation scale on the right side. The semantic and latent manipulations show the same effect, which supports our argument.
  • Figure 4: The qualitative comparison between personalization pipelines and our OneActor. TI lacks consistency, while DB and IP exhibit limited prompt conformity and diversity. BL suffers from poor quality in certain cases. In contrast, our method shows superior consistency, diversity, and stability. Target prompts and base words are marked blue and red, respectively.
  • Figure 5: The qualitative comparison between consistent subject generation methods. Though all methods generate consistent images given different prompts, our OneActor refines more details such as the characters' clothes.
  • ...and 15 more figures
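
The two knobs varied in Figure 3 can be illustrated with a toy sketch. This is not the paper's implementation. It assumes that semantic interpolation is a linear move between two conditioning embeddings and that latent guidance is a classifier-free-guidance-style mix of two noise predictions. The names `interpolate_semantic`, `guide_latent`, and `make_toy_denoiser` are hypothetical, and the "denoiser" is a plain linear map chosen only because the two manipulations then coincide exactly.

```python
import math
import random


def interpolate_semantic(c_base, c_target, scale):
    """Move the conditioning embedding from c_base toward c_target (assumed form)."""
    return [b + scale * (t - b) for b, t in zip(c_base, c_target)]


def guide_latent(eps_base, eps_target, scale):
    """Guidance-style mix of two noise predictions (assumed form)."""
    return [b + scale * (t - b) for b, t in zip(eps_base, eps_target)]


def make_toy_denoiser(weights):
    """Hypothetical linear stand-in for a denoiser: eps = W @ c."""
    def denoise(c):
        return [sum(w * x for w, x in zip(row, c)) for row in weights]
    return denoise


random.seed(0)
weights = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
toy_denoiser = make_toy_denoiser(weights)

c_base = [random.gauss(0, 1) for _ in range(8)]    # e.g. embedding of the base word
c_target = [random.gauss(0, 1) for _ in range(8)]  # e.g. embedding shifted toward the target
scale = 0.7

# Manipulate the condition first, then predict.
eps_semantic = toy_denoiser(interpolate_semantic(c_base, c_target, scale))
# Predict twice, then manipulate the predictions.
eps_latent = guide_latent(toy_denoiser(c_base), toy_denoiser(c_target), scale)

matches = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(eps_semantic, eps_latent)
)
print(matches)
```

For a linear map the two results agree by construction. A real diffusion denoiser is nonlinear, so no such identity is guaranteed there; Figure 3 reports the matching effect as an empirical observation.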