Strictly-ID-Preserved and Controllable Accessory Advertising Image Generation

Youze Xue, Binghui Chen, Yifeng Geng, Xuansong Xie, Jiansheng Chen, Hongbing Ma

TL;DR

This work addresses strictly-ID-preserved advertising image generation by proposing a Control-Net based pipeline that takes an earring image as the conditioning input and uses a new multi-branch cross-attention design to control the scale, pose, and appearance of the model. It introduces STD-Norm and TDW to balance the multiple control streams, enabling precise and diverse advertising visuals while preserving the product identity. Experiments on a large earring-model dataset show better identity preservation and controllability than baselines such as SD in-painting, Paint-by-Example, DreamBooth, Custom Diffusion, and IP-Adapter, as measured by $\text{FID}$, $\text{Mask IoU}$, and $\text{CLIP-S}$. The approach is intended to improve the reliability and flexibility of e-commerce advertising images, though it currently relies on copying the earring region during inference; future work aims to handle rotation and lighting variations automatically.
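For readers unfamiliar with the $\text{Mask IoU}$ metric named above, the following is a minimal sketch of intersection-over-union between two binary masks. It assumes the standard definition; the summary does not state which masks the paper compares or how they are obtained, so the function name `mask_iou`, the toy masks, and the empty-mask convention are illustrative assumptions rather than the authors' evaluation code.

```python
import numpy as np


def mask_iou(pred_mask, ref_mask):
    """Intersection-over-union of two binary masks with the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    if pred.shape != ref.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        # Both masks are empty. Returning 1.0 is a convention assumed here.
        return 1.0
    intersection = np.logical_and(pred, ref).sum()
    return float(intersection) / float(union)


# Toy example (hypothetical data): a 2x2 reference region and a 2x3 prediction.
ref = np.zeros((4, 4), dtype=bool)
ref[1:3, 1:3] = True   # 4 pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True  # 6 pixels, fully covering the reference region

iou = mask_iou(pred, ref)  # intersection 4, union 6
```

A value of 1 means the two masks coincide exactly, so a higher score under this definition indicates that the region occupied by the product in the output matches the reference more closely.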

Abstract

Customized generative text-to-image models have the ability to produce images that closely resemble a given subject. However, in the context of generating advertising images for e-commerce scenarios, it is crucial that the generated subject's identity aligns perfectly with the product being advertised. In order to address the need for strictly-ID preserved advertising image generation, we have developed a Control-Net based customized image generation pipeline and have taken earring model advertising as an example. Our approach facilitates a seamless interaction between the earrings and the model's face, while ensuring that the identity of the earrings remains intact. Furthermore, to achieve a diverse and controllable display, we have proposed a multi-branch cross-attention architecture, which allows for control over the scale, pose, and appearance of the model, going beyond the limitations of text prompts. Our method manages to achieve fine-grained control of the generated model's face, resulting in controllable and captivating advertising effects.

Paper Structure

This paper contains 19 sections, 11 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: A visual comparison between our method and the baselines. The first row presents examples of the input for each method. Our method and Stable Diffusion (SD) in-painting use the earring image as the foreground and in-paint the background. Paint-by-Example in-paints the earring area on an earring-removed background. DreamBooth and Custom Diffusion are tuning-based text-to-image methods, where a text prompt with a special identifier $\langle sks \rangle$ is used as input.
  • Figure 2: The illustration of the over-fitting issue for the tuning-based customized generative models.
  • Figure 3: The overall pipeline of our method. By adding the standard-deviation based normalization (STD-Norm) and the time-dependent weighting (TDW), controls from different branches work well with each other.
  • Figure 4: The illustration of the scale and pose control. In each column of the figure, the pose control is fixed and the scale control changes from small to large. In each row of the figure, the scale control and the earring image are fixed, whereas the pose control changes from facing left to facing right.
  • Figure 5: The illustration of the appearance control. For each appearance control $\mathbf{I}_a$, we select two different earrings and generate two images based on the earring and $\mathbf{I}_a$.
  • ...and 6 more figures