VirtualModel: Generating Object-ID-retentive Human-object Interaction Image by Diffusion Model for E-commerce Marketing

Binghui Chen, Chongyang Zhong, Wangmeng Xiang, Yifeng Geng, Xuansong Xie

TL;DR

This work defines Object-ID-retentive Human-object Interaction image Generation (OHG) for e-commerce and introduces HoIHuman, a large-scale dataset with rich annotations. It proposes VirtualModel, a diffusion-based framework with two parallel branches, the Interaction-guided Branch and the Content-guided Branch, to jointly ensure realistic human-object interactions and exact product identity preservation. The model uses a HoI-controlled diffusion pipeline and a Content Backfill post-processing step to further enforce product-content fidelity. Experimental results show that VirtualModel achieves superior image quality, pose accuracy, and object-ID consistency compared to state-of-the-art baselines, with strong performance in human preference studies, indicating practical potential for real-world marketing use.

Abstract

Owing to significant advances in large-scale text-to-image generation with diffusion models (DMs), controllable human image generation has recently attracted much attention. Although existing works such as ControlNet [36], T2I-adapter [20], and HumanSD [10] generate human images well from pose conditions, they still fail to meet the requirements of real e-commerce scenarios: (1) the interaction between the displayed product and the human should be considered; (2) human parts such as the face, hands, arms, and feet, as well as the interaction between the human model and the product, should be hyper-realistic; and (3) the identity of the product shown in the advertisement should be exactly consistent with the product itself. To this end, we first define a new human image generation task for e-commerce marketing, Object-ID-retentive Human-object Interaction image Generation (OHG), and then propose the VirtualModel framework to generate human images for product display, supporting any product category and any type of human-object interaction. As shown in Figure 1, VirtualModel not only outperforms other methods in pose-control accuracy and image quality but also displays user-specified products while maintaining product-ID consistency and enhancing the plausibility of human-object interaction. Code and data will be released.

Paper Structure (19 sections, 3 equations, 15 figures, 4 tables)

This paper contains 19 sections, 3 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Example Results and Visual Comparisons. This paper focuses on the Object-ID-retentive Human-object Interaction image Generation (OHG) task for the e-commerce marketing scenario. Each row contains: (a) a generation by text-guided Stable Diffusion [rombach2022high], (b) the given product and pose conditions for OHG, (c) a generation by T2I-adapter [mou2023t2i], (d) a generation by ControlNet [zhang2023adding], (e) a generation by HumanSD [ju2023humansd], and (f) a generation by our proposed VirtualModel. Since (c), (d), and (e) do not support the OHG task, they use only the pose conditions. Compared with other methods, when given the target products and the corresponding pose-skeleton images, VirtualModel can generate hyper-realistic marketing images in terms of the plausibility of human-object interaction, image quality, and challenging local poses. Best viewed with zoom-in.
  • Figure 2: Comparisons of tasks (HIG and our proposed OHG) and the corresponding datasets (HumanArt/Laion-Human [ju2023humansd] and our HoIHuman).
  • Figure 3: Overall framework of the proposed VirtualModel, which consists of the Human-object-interaction (HoI) controlled pipeline, the interaction-guided branch, and the content-guided branch. During training, paired data are built and fed into VirtualModel, where $\textsl{x}, o, p, e, v$ denote the original image and its corresponding image conditions: the product object, the human pose skeleton, the edge map of the product object, and a close-up view of the product, respectively. The text condition $C_{te}$ is obtained from a large language model.
  • Figure 4: Resolution statistics of our HoIHuman dataset. The statistics show that HoIHuman consists of high-resolution, high-quality images.
  • Figure 5: Illustration of how the proposed Object Extension Ratio (OER) metric is computed. The red region is the extended content, which is incorrect.
  • ...and 10 more figures
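
The Figure 5 caption describes OER only informally: the red region is product content that the generator has incorrectly extended. The exact formula is not given in this excerpt, so the following is a minimal sketch under one assumed reading, namely that OER is the number of generated product pixels falling outside the reference product mask, divided by the area of the reference mask. The function name `object_extension_ratio`, the mask representation, and the normalization are all illustrative assumptions, not the authors' definition.

```python
from typing import List

# Binary mask as nested lists: 1 = product pixel, 0 = background.
Mask = List[List[int]]


def object_extension_ratio(reference: Mask, generated: Mask) -> float:
    """Assumed OER: extended product pixels / reference product pixels.

    This is a hypothetical reading of the metric, not the paper's formula.
    """
    same_shape = len(reference) == len(generated) and all(
        len(r) == len(g) for r, g in zip(reference, generated)
    )
    if not same_shape:
        raise ValueError("masks must have the same shape")

    ref_area = sum(sum(row) for row in reference)
    if ref_area == 0:
        raise ValueError("reference mask is empty")

    # Count pixels marked as product in the generated image but not in the
    # reference: this corresponds to the red "extended" region in Figure 5.
    extended = sum(
        1
        for ref_row, gen_row in zip(reference, generated)
        for r, g in zip(ref_row, gen_row)
        if g == 1 and r == 0
    )
    return extended / ref_area


# Toy example: the generated product grows one column to the right.
reference = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
generated = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
oer = object_extension_ratio(reference, generated)
print(oer)  # 2 extended pixels over a reference area of 4 pixels
```

Under this assumed definition, a lower value means less spurious product content, and a generation whose product mask matches the reference exactly scores 0.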