PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation

Fan Wu, Cheng Chen, Zhoujie Fu, Jiacheng Wei, Yi Xu, Deheng Ye, Guosheng Lin

TL;DR

PhyCustom addresses realistic physical transformations in text-to-image diffusion by learning physical concepts and merging them with object concepts as independent factors. It introduces two regularizations: an isometric regularization that uncovers physics-related embeddings from cross-object prompts, and a decouple loss that orthogonalizes the learning of object and physical concepts. Together they enable physical customization in LoRA-fine-tuned diffusion models. Evaluated on a diverse dataset of object and physical concepts, PhyCustom outperforms state-of-the-art baselines on quantitative metrics (CLIP-V, CLIP-V-O) and in human judgments, and ablations confirm that both losses are necessary. The approach offers a practical path to physics-aware generation, with potential use for out-of-distribution (OoD) data generation, and improves how diffusion-based generative systems handle abstract physical concepts.

Abstract

Recent diffusion-based text-to-image customization methods have achieved significant success in understanding concrete concepts, such as styles and shapes, to control the generation process. However, few efforts address the realistic yet challenging customization of physical concepts. The core limitation of current methods is that they do not explicitly introduce physical knowledge during training. Even when physics-related words appear in the input text prompts, our experiments consistently show that these methods fail to accurately reflect the corresponding physical properties in the generated results. In this paper, we propose PhyCustom, a fine-tuning framework comprising two novel regularization losses that enable diffusion models to perform physical customization. Specifically, the proposed isometric loss encourages the diffusion model to learn physical concepts, while the decouple loss helps prevent independent concepts from being learned as a mixture. Experiments are conducted on a diverse dataset, and our benchmark results demonstrate that PhyCustom outperforms previous state-of-the-art and popular methods on physical customization, both quantitatively and qualitatively.

Paper Structure

This paper contains 33 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: PhyCustom's ability. We present results from the proposed PhyCustom that showcase its ability to perform realistic physical transformations on common objects, generating novel concepts from a limited number (2$\sim$5) of provided images and their corresponding texts.
  • Figure 2: Comparison of different methods. a) Given the text prompt, T2I diffusion models randomly select a pattern (e.g., style) associated with the object concept. b) After being fine-tuned on reference images, customization methods learn the patterns and merge them by preserving the shape from reference 1 and the style from reference 2, but fail to learn the physical concept. c) PhyCustom learns the physical concept and performs a concept-level merge to generate the desired results.
  • Figure 3: Overview of PhyCustom. Given two sets of images and their corresponding prompts $\{\mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3\}$, we fine-tune the diffusion model in a single stage with three losses. (1) The diffusion model with trainable LoRA modules is fine-tuned with the MSE loss of Eq. \ref{eq:diffusion_training_loss}, while the original parameters are frozen. (2) The isometric loss of Eq. \ref{eq:cross_context_loss} fine-tunes the text encoder to find a subspace where the distances between $\mathbf{p}_i, i \in \{1, 2, 3\}$ and $\mathbf{p}_\text{a}$ are equal, so that the text encoder learns the invariant, physics-related text embedding. (3) The decouple loss of Eq. \ref{eq:decouple_loss} decouples the learning of different concept features by regularizing the gradient descent steps to follow two orthogonal directions.
  • Figure 4: Comparison of different methods. The results show the superior performance of PhyCustom on physical customization.
  • Figure 5: Ablation study of the proposed losses. The results without (w/o) the isometric loss fail to show physical customization, indicating that the model has not learned the physical concepts. The results w/o the decouple loss exhibit pattern leaking, where the model learns to generate the vase combined with the given spoon's shape and the plastic bag's color.
  • ...and 5 more figures
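
The Figure 3 caption describes the two regularizers only in words, and the paper's actual equations are not reproduced on this page. The sketch below is therefore an illustrative reading, not the authors' formulation: it assumes the isometric term can be approximated by the variance of the distances from each prompt embedding to an anchor embedding (zero when all distances are equal), and the decouple term by the squared cosine similarity between two gradient vectors (zero when they are orthogonal). The function names `isometric_penalty` and `decouple_penalty` and the toy vectors are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def isometric_penalty(anchor, embeddings):
    """Assumed stand-in for the isometric loss: the variance of the
    distances from each embedding to the anchor. It is zero exactly
    when all embeddings are equidistant from the anchor."""
    distances = [l2_distance(e, anchor) for e in embeddings]
    mean = sum(distances) / len(distances)
    return sum((d - mean) ** 2 for d in distances) / len(distances)


def decouple_penalty(grad_object, grad_physical):
    """Assumed stand-in for the decouple loss: squared cosine similarity
    between the object-concept and physical-concept gradients. It is
    zero exactly when the two gradients are orthogonal."""
    dot = sum(a * b for a, b in zip(grad_object, grad_physical))
    norm_o = math.sqrt(sum(a * a for a in grad_object))
    norm_p = math.sqrt(sum(b * b for b in grad_physical))
    if norm_o == 0.0 or norm_p == 0.0:
        return 0.0  # a zero gradient has no direction to penalize
    return (dot / (norm_o * norm_p)) ** 2


# Toy 2-D vectors, purely for illustration.
anchor = [0.0, 0.0]
equidistant = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
uneven = [[1.0, 0.0], [0.0, 3.0]]

print(isometric_penalty(anchor, equidistant))  # equal distances
print(isometric_penalty(anchor, uneven))       # unequal distances
print(decouple_penalty([1.0, 0.0], [0.0, 1.0]))  # orthogonal gradients
print(decouple_penalty([1.0, 0.0], [2.0, 0.0]))  # parallel gradients
```

In an actual fine-tuning run these quantities would be computed on text-encoder outputs and parameter gradients inside an autodiff framework and added to the diffusion MSE loss with weighting coefficients; those weights are not given in this section.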