ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model

Binghui Chen; Wenyu Li; Yifeng Geng; Xuansong Xie; Wangmeng Zuo

TL;DR

This work addresses generating hyper-realistic advertising images of a human wearing a user-specified shoe while preserving the shoe's identity. It introduces ShoeModel, a three-module diffusion-based pipeline consisting of Wearable-area Detection (WD), Leg-pose Synthesis (LpS), and Shoe-wearing Image Generation (SW), with a staged training strategy. Two new datasets support training: a wearable-area detection dataset and a large shoe-leg dataset for pose-conditioned generation. Quantitative and qualitative experiments show ShoeModel outperforms diffusion-based and inpainting baselines in image realism, identity preservation, and human-shoe interaction plausibility, highlighting its practical impact for automated e-commerce advertising content generation. The approach enables consistent, realistic shoe-advertising imagery and can inspire further research on object-identity preservation and object-human interaction in controllable diffusion systems.

Abstract

With the development of large-scale diffusion models, Artificial Intelligence Generated Content (AIGC) techniques have recently become popular. However, how to make them truly serve our daily lives remains an open question. To this end, in this paper we focus on applying AIGC techniques in one field of e-commerce marketing, i.e., generating hyper-realistic advertising images in which user-specified shoes are displayed by humans. Specifically, we propose a shoe-wearing system, called ShoeModel, to generate plausible images of human legs interacting with the given shoes. It consists of three modules: (1) a shoe wearable-area detection module (WD), (2) a leg-pose synthesis module (LpS), and (3) the final shoe-wearing image generation module (SW). The three modules are performed in ordered stages. Compared to baselines, ShoeModel is shown to generalize better to different types of shoes and is able to keep the ID-consistency of the given shoes while automatically producing reasonable interactions with humans. Extensive experiments show the effectiveness of the proposed shoe-wearing system. Figure 1 shows input and output examples of ShoeModel.
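The staged structure described in the abstract (WD, then LpS, then SW) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the names `PipelineState`, `detect_wearable_area`, `synthesize_leg_pose`, `generate_shoe_wearing_image` and `run_pipeline` are not from the paper, the stage bodies are dummy placeholders rather than diffusion models, and the assumption that each stage consumes the previous stage's output is an illustration of "ordered stages", not a confirmed detail of the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class PipelineState:
    """Hypothetical container for data handed from one stage to the next."""
    shoe_image: str
    wearable_area: Optional[Tuple[int, int, int, int]] = None
    leg_pose: Optional[List[Tuple[int, int]]] = None
    output_image: Optional[str] = None
    trace: List[str] = field(default_factory=list)


def detect_wearable_area(state: PipelineState) -> PipelineState:
    # Stand-in for WD: a real module would predict this region from the image.
    state.wearable_area = (0, 0, 10, 10)  # dummy box, not real data
    state.trace.append("WD")
    return state


def synthesize_leg_pose(state: PipelineState) -> PipelineState:
    # Stand-in for LpS: assumed here to depend on the WD output.
    if state.wearable_area is None:
        raise ValueError("LpS needs the wearable area produced by WD")
    x0, y0, x1, y1 = state.wearable_area
    state.leg_pose = [((x0 + x1) // 2, (y0 + y1) // 2)]  # dummy keypoint
    state.trace.append("LpS")
    return state


def generate_shoe_wearing_image(state: PipelineState) -> PipelineState:
    # Stand-in for SW: assumed here to be conditioned on the synthesized pose.
    if state.leg_pose is None:
        raise ValueError("SW needs the leg pose produced by LpS")
    state.output_image = f"generated:{state.shoe_image}"
    state.trace.append("SW")
    return state


Stage = Callable[[PipelineState], PipelineState]
DEFAULT_STAGES: List[Stage] = [
    detect_wearable_area,
    synthesize_leg_pose,
    generate_shoe_wearing_image,
]


def run_pipeline(shoe_image: str, stages: Optional[List[Stage]] = None) -> PipelineState:
    """Run the stages strictly in the order given."""
    state = PipelineState(shoe_image=shoe_image)
    for stage in (DEFAULT_STAGES if stages is None else stages):
        state = stage(state)
    return state


result = run_pipeline("shoe.png")
print(result.trace)
```

Keeping each stage as an independent callable mirrors the idea that the modules can be developed separately, while the explicit checks make an out-of-order run fail early instead of producing a silent error.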

Paper Structure (14 sections, 2 equations, 10 figures, 5 tables)

This paper contains 14 sections, 2 equations, 10 figures, 5 tables.

Introduction
Related Works
Method
Wearable-area Detection Module
Leg-pose Synthesis Module
Shoe-wearing Module
Shoe-wearing Dataset
Wearable-area detection dataset
Shoe-leg dataset
Experiments
Quantitative Analysis
Qualitative Analysis
Ablation Study
Discussion

Figures (10)

Figure 1: Example results of ShoeModel. Each group contains the input user-specified shoes (left) and the output images in which the shoes interact with human legs (right). It can be observed that (1) the generated images are hyper-realistic and well suited to product display, (2) the interaction between human and shoes is reasonable, and (3) the identity of the output shoes is exactly the same as that of the input ones. Best viewed zoomed in.
Figure 2: The overall pipeline of the proposed system, ShoeModel. It consists of three modules: the shoe Wearable-area Detection module (WD), the Leg-pose Synthesis module (LpS) and the Shoe-wearing module (SW). These three modules are performed in ordered stages, and each one can be trained independently. Processed by our system, the given shoes interact plausibly with human legs, resulting in an advertising image for shoe display.
Figure 3: Visualization of the visible area and the wearable area. The visible area (shown in blue) can be seen in both the unworn-shoe and the worn-shoe images, while the wearable area (shown in yellow) can be seen only in the unworn-shoe image.
Figure 4: Example of leg-pose synthesis task.
Figure 5: Qualitative comparisons with SDXL (Podell et al., 2023), Paint-by-Example (Yang et al., 2023), ControlNet-Inpainting (Zhang et al., 2023) and IP-Adapter (Ye et al., 2023). ShoeModel performs best in terms of image quality, shoe-ID consistency and plausibility of the interactions between human and shoes. More results are in the supplementary file.
...and 5 more figures
