DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis

Yongjin Choi, Chanhun Park, Seung Jun Baek

TL;DR

DynASyn tackles multi-subject personalization from a single reference image. It integrates class-prior knowledge into cross-attention through a novel ICA loss, and it augments the training data with concept-based prompt-and-image pairs produced by Guided SDE Augmentation (GSA). Together these preserve subject identity while enabling diverse actions and dynamic interactions, and DynASyn outperforms TI, DreamBooth, and Break-a-Scene on both objective and human-evaluated metrics. The approach advances controllable single-image personalization for multiple subjects, with better generalization and richer contextual synthesis. Its limitations stem from the priors and capacity of the backbone model, which suggests future work on targeted backbone fine-tuning and richer priors to further improve realism and versatility.

Abstract

Recent advances in text-to-image diffusion models have spurred research on personalization, i.e., customized image synthesis of subjects within reference images. Although existing personalization methods can alter the subjects' positions or personalize multiple subjects simultaneously, they often struggle to modify the subjects' behaviors or their dynamic interactions. The difficulty is attributable to overfitting to the reference images, which worsens when only a single reference image is available. We propose DynASyn, an effective method for multi-subject personalization from a single reference image that addresses these challenges. DynASyn preserves subject identity during personalization by aligning concept-based priors with subject appearances and actions. This is achieved by regularizing the attention maps between the subject token and images through concept-based priors. In addition, we propose concept-based prompt-and-image augmentation for an improved trade-off between identity preservation and action diversity. We adopt SDE-based editing guided by augmented prompts to generate diverse appearances and actions while maintaining identity consistency in the augmented images. Experiments show that DynASyn synthesizes highly realistic images of subjects in novel contexts and in dynamic interactions with their surroundings, and that it outperforms baseline methods both quantitatively and qualitatively.
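
The attention-map regularization mentioned in the abstract can be illustrated with a toy sketch. The abstract does not state the form of the loss, so the squared-error penalty below, the function names `normalize_map` and `attention_prior_penalty`, and the 2x2 example maps are all assumptions made for illustration, not the authors' formulation. The idea shown is only that the attention map of a subject placeholder token is pulled toward the map obtained from a concept prior.

```python
def normalize_map(attn):
    """Flatten a 2D non-negative attention map and scale it to sum to 1."""
    flat = [float(v) for row in attn for v in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [v / total for v in flat]


def attention_prior_penalty(placeholder_attn, prior_attn):
    """Mean squared difference between two normalized attention maps.

    Assumed stand-in for an attention regularizer: it is zero when the
    placeholder token attends exactly like the concept prior and grows
    as the two maps diverge.
    """
    p = normalize_map(placeholder_attn)
    q = normalize_map(prior_attn)
    if len(p) != len(q):
        raise ValueError("attention maps must have the same number of cells")
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


# Hypothetical 2x2 maps: the prior spreads attention over the subject,
# while an overfit placeholder collapses onto a single location.
prior = [[0.0, 1.0], [1.0, 2.0]]
overfit = [[4.0, 0.0], [0.0, 0.0]]

same = attention_prior_penalty(prior, prior)
diverged = attention_prior_penalty(overfit, prior)
print(same, diverged)
```

In a training loop, a term of this kind would be added to the usual diffusion objective with a weighting coefficient; the choice of weight and of which layers' attention maps to use is not specified in this section.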

Paper Structure

This paper contains 19 sections, 6 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Personalization outputs from the proposed method, DynASyn. When provided with a single image containing multiple subjects, each subject can be trained into placeholders denoted as <asset>. DynASyn is capable of synthesizing diverse types of novel poses and dynamic actions of the subjects from the text prompts by avoiding overfitting to the reference image.
  • Figure 2: Overview of DynASyn. (a) Concept-based Attention Regularization: the attention map derived from concept priors is used to regularize the attention map of the placeholder token to prevent overfitting. (b) Concept-based Prompt-and-Image Augmentation: prompt-and-image pairs containing diverse actions and poses of the subjects are composed. (c) Optimization with Augmented Prompts and Images: the augmented data from (b) is used to train the model to generate novel actions and poses of the subjects.
  • Figure 3: Overview of Guided SDE Augmentation (GSA).
  • Figure 4: Qualitative comparisons with baseline methods. While baseline models often fail to align effectively with the provided text, DynASyn generates images that accurately reflect the textual input.
  • Figure 5: Visualization of personalized images. Given a single input image, DynASyn generates a variety of images from text prompts. It can depict multiple subjects interacting dynamically or performing actions. Examples of additional generation tasks, such as re-contextualization or artistic stylization, can be found in the supplementary materials.
  • ...and 6 more figures