Patch-enhanced Mask Encoder Prompt Image Generation

Shusong Xu, Peiye Liu

TL;DR

This work tackles the challenge of generating advertising visuals with accurate product descriptions while maintaining diverse backgrounds. It introduces a Patch Flexible Visibility (PFV) module and a Mask Encoder Prompt Adapter (MEPA) that enable region-controlled fusion within a diffusion-based Foundation Model framework, including depth-informed generation. Ablations and experiments on SAM-1B and COCO show improved FID and qualitative fidelity over text-only or image-only baselines, supporting the effectiveness of PFV and MEPA in preserving foreground fidelity and producing harmonious backgrounds. The approach offers a practical path toward scalable AIGC ads with reliable product depiction and adaptable background aesthetics, potentially reducing legal and quality risks in automated advertising creation.

Abstract

Artificial Intelligence Generated Content (AIGC), known for its superior visual results, is a promising way to mitigate the high cost of advertising applications. Numerous approaches have been developed to manipulate generated content under different conditions. However, a crucial limitation lies in the accurate description of products in advertising applications. Applying previous methods directly may lead to considerable distortion and deformation of advertised products, primarily due to oversimplified content control conditions. Hence, in this work, we propose a patch-enhanced mask encoder approach to ensure accurate product descriptions while preserving diverse backgrounds. Our approach consists of three components: Patch Flexible Visibility, a Mask Encoder Prompt Adapter, and an image Foundation Model. Patch Flexible Visibility is used to generate a more reasonable background image. The Mask Encoder Prompt Adapter enables region-controlled fusion. We also analyze the structure and operational mechanisms of the Generation Module. Experimental results show that our method achieves the best visual results and FID scores compared with other methods.
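
Since the abstract reports FID scores, it may help to recall how that metric is defined. FID is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images, and lower values are better. The sketch below is a minimal NumPy illustration, not the evaluation code used in the paper. The function name `frechet_distance` is hypothetical, and the random arrays are stand-ins for the Inception-v3 features that a real FID evaluation would extract.

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_dims) with n_dims >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against rounding noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Stand-in features (assumption): real pipelines use Inception-v3 activations.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 8))
shifted_feats = real_feats + 3.0  # same covariance, mean moved by 3 in each dim

fid_same = frechet_distance(real_feats, real_feats)
fid_shifted = frechet_distance(real_feats, shifted_feats)
```

With identical feature sets the distance is zero up to rounding. Shifting every feature by a constant leaves the covariance unchanged, so the distance reduces to the squared norm of the mean difference.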

Paper Structure (19 sections, 7 equations, 5 figures, 1 table)

This paper contains 19 sections, 7 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Various prompts for advertising image synthesis are utilized in our proposed method. These include the PFV (Patch-Enhanced via Flexible Visibility) image prompt, text prompt, and full image prompt. These prompts guide the generation of backgrounds for the advertisements. The left column demonstrates the synthesis of bed product advertising images, while the right column displays the synthesis of sock advertising graphics.
  • Figure 2: The overall architecture of our proposed method. The patch flexible visibility masks both the reference image and its associated features. The mask encoder prompt adapter (MEPA) consists of a mask encoder (ME) and text/image mask encoder cross attention (T/I-MECA). MECA is employed to fuse the text and image prompts for region control. During training, the weights of the T-MECA layers are kept frozen, ensuring their stability and consistency.
  • Figure 3: Comparisons. "Ref" refers to the reference image that guides the style, while "The foreground" represents the product. The models "PixArt-α" [chen2023pixart] and "SDXL" [podell2023sdxl] operate under the control of text prompts. "Uni-I" and "Uni-I-T" represent the results of [zhao2024uni] under text prompts and text-image prompts. "IP-I" and "IP-I-T" represent the results of [ye2023ip] under text prompts and text-image prompts. The text used for these models is converted from the "Ref" image using BLIP [li2022blip].
  • Figure 4: Effects of the Patch Flexible Visibility. "Ref" refers to the reference image that guides the style, and "The foreground" represents the product. "w/o" indicates the condition without RFV (Reference Image Guided Style), while "w" represents the condition with RFV.
  • Figure 5: Effects of the mask encoder prompt adapter. "Ref" refers to the reference image that guides the style, while "The foreground" represents the product. "Global prompts" indicate the global weights assigned to the image prompt and text prompt. "I-prompt" denotes the results achieved solely under the image prompt. "ME prompts" represent the mask encoder prompts generated by our MECA mentioned above.
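
The captions for Figures 2 and 5 describe two ideas: hiding patches of the reference image, and fusing text and image prompts under a region mask. The sketch below is one possible reading of those ideas in NumPy and is not the authors' implementation. The function names, the binary `keep` grid, and the per-token linear blend are all assumptions made for illustration, since the exact fusion rule is not given in this summary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product attention: q is (n, d), k and v are (m, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def patch_visibility(image, patch, keep):
    """Zero out each patch of `image` (H, W, C) whose `keep` entry is False.

    `keep` is a boolean grid of shape (H // patch, W // patch) (assumed form).
    """
    pixel_mask = np.repeat(np.repeat(keep, patch, axis=0), patch, axis=1)
    return image * pixel_mask[..., None]

def region_fused_attention(q, text_kv, image_kv, region):
    """Blend text- and image-conditioned attention per latent token.

    `region` has shape (n,) with values in [0, 1]: 1 selects the image
    prompt, 0 selects the text prompt (hypothetical blending rule).
    """
    out_text = cross_attention(q, *text_kv)
    out_image = cross_attention(q, *image_kv)
    w = region[:, None]
    return w * out_image + (1.0 - w) * out_text

rng = np.random.default_rng(0)

# Hide the top-right 4x4 patch of an 8x8 reference image.
ref = rng.random((8, 8, 3))
keep = np.array([[True, False], [True, True]])
visible = patch_visibility(ref, patch=4, keep=keep)

# 16 latent tokens: the first 8 follow the image prompt, the rest the text prompt.
q = rng.normal(size=(16, 32))
text_kv = (rng.normal(size=(5, 32)), rng.normal(size=(5, 32)))
image_kv = (rng.normal(size=(4, 32)), rng.normal(size=(4, 32)))
region = np.zeros(16)
region[:8] = 1.0
fused = region_fused_attention(q, text_kv, image_kv, region)
```

A hard 0/1 `region` assigns each token to exactly one prompt. Fractional values would give a soft transition at region boundaries, which is one plausible way to obtain a smoother blend than a single global weight over both prompts.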