CTR-Driven Advertising Image Generation with Multimodal Large Language Models

Xingye Chen, Wei Feng, Zhenbang Du, Weizhen Wang, Yanyin Chen, Haohan Wang, Linkai Liu, Yaoyu Li, Jinyuan Zhao, Yu Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Zhangang Lin, Jingping Shao, Yuanjie Shao, Xinge You, Changxin Gao, Nong Sang

TL;DR

This work tackles the misalignment between advertising image aesthetics and online performance by introducing CAIG, a CTR-driven image generation framework based on Multimodal Large Language Models. CAIG pre-trains an MLLM on a large-scale e-commerce multimodal dataset to acquire domain knowledge, uses a two-branch reward model to simulate user CTR through pairwise image comparisons, and applies Product-Centric Preference Optimization (PCPO) with Direct Preference Optimization (DPO) to generate product-consistent, CTR-friendly backgrounds. The reward model combines cross-entropy and point-wise CTR losses as $\mathcal{L}_{reward} = \lambda_1 \mathcal{L}_{CE} + \lambda_2 \mathcal{L}_{Point}$, and PCPO enforces alignment between product information and background prompts to maintain contextual relevance. Extensive offline experiments show state-of-the-art pairwise CTR prediction, while online deployment demonstrates a 2% CTR improvement over millions of impressions, validating the practical impact of integrating MLLMs, CTR-aware RL, and product-centric optimization in advertising image generation.
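As a rough illustration of the combined objective above, the following Python sketch evaluates $\lambda_1 \mathcal{L}_{CE} + \lambda_2 \mathcal{L}_{Point}$ for a single image pair. The summary does not give the exact form of either term, so this assumes a two-way softmax cross-entropy over which image has the higher CTR and a squared-error point-wise term. The function names, argument names, and default weights are hypothetical and are not taken from the released code.

```python
import math


def pairwise_ce(score_a: float, score_b: float, a_wins: bool) -> float:
    """Cross-entropy of a two-way softmax over the pair's scores (assumed form)."""
    # Stable log-sum-exp of the two scores.
    m = max(score_a, score_b)
    log_z = m + math.log(math.exp(score_a - m) + math.exp(score_b - m))
    target = score_a if a_wins else score_b
    return log_z - target


def pointwise_ctr(pred_a: float, pred_b: float, ctr_a: float, ctr_b: float) -> float:
    """Mean squared error between predicted and observed CTR (assumed form)."""
    return 0.5 * ((pred_a - ctr_a) ** 2 + (pred_b - ctr_b) ** 2)


def reward_loss(score_a: float, score_b: float,
                pred_a: float, pred_b: float,
                ctr_a: float, ctr_b: float,
                lambda_1: float = 1.0, lambda_2: float = 1.0) -> float:
    """Weighted sum of the pairwise and point-wise terms for one image pair."""
    a_wins = ctr_a > ctr_b  # the pairwise label comes from the observed CTRs
    l_ce = pairwise_ce(score_a, score_b, a_wins)
    l_point = pointwise_ctr(pred_a, pred_b, ctr_a, ctr_b)
    return lambda_1 * l_ce + lambda_2 * l_point
```

Under these assumptions, the pairwise term trains the comparison branch to rank the two images, while the point-wise term keeps the predicted CTR values themselves calibrated against observed click rates.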

Abstract

In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods that generate backgrounds for products focus primarily on aesthetic quality, which may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. First, we build targeted pre-training tasks and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation tasks. To further improve the CTR of generated images, we propose a novel reward model that jointly utilizes multimodal features and accurately reflects user click preferences, and we use it to fine-tune the pre-trained MLLMs through Reinforcement Learning (RL). Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: https://github.com/Chenguoz/CAIG.
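The preference optimization step is described as building on Direct Preference Optimization (DPO). For orientation, the sketch below computes the standard DPO loss for one preference pair from sequence log-probabilities. It does not model the product-centric term of PCPO, whose form is not given here. The function name, argument names, and the value of `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained (logp_*) or the frozen reference (ref_logp_*).
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and reference agree, the margin is zero and the loss equals $\log 2$; the loss falls as the policy raises the relative likelihood of the chosen background description.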

Paper Structure

This paper contains 30 sections, 12 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: (a) Example of the impact of different backgrounds on product CTR. While visual features play a crucial role, other modalities such as textual captions and product attributes also have a significant influence on CTR. (b) Examples of product-background mismatches using existing reinforcement learning algorithms.
  • Figure 2: (a) E-commerce knowledge pre-training. The MLLM is pre-trained on a large-scale multimodal e-commerce dataset to incorporate domain-specific knowledge. (b) The structure of the reward model (RM). The RM integrates multimodal product features using visual and textual encoders, with dual branches to estimate CTR and identify appealing ad images. (c) CTR-driven preference optimization stage. The PM generates background descriptions, which the background generation model uses to create product images with various backgrounds. The RM then estimates the CTR for these images, simulating human feedback to optimize the PM.
  • Figure 3: Comparison of Pair Accuracy across different methods on commercial and public datasets.
  • Figure 4: Comparison of Match Rate across different preference optimization strategies over training epochs.
  • Figure 5: Comparison between DPO and the proposed PCPO. The first line shows the name of the product, followed by the generated results for each method, including the generated image and corresponding background prompt.
  • ...and 3 more figures
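
The loop in Figure 2(c), where the RM scores several generated backgrounds to simulate human feedback, can be sketched as follows. This is a minimal illustration that assumes the highest- and lowest-scoring candidates form the preferred and rejected pair; the captions do not state how pairs are actually selected. `build_preference_pair` and `estimate_ctr` are hypothetical names, and `estimate_ctr` stands in for rendering an image from a background description and scoring it with the RM.

```python
from typing import Callable, Sequence, Tuple


def build_preference_pair(prompts: Sequence[str],
                          estimate_ctr: Callable[[str], float]) -> Tuple[str, str]:
    """Return (chosen, rejected) background prompts ranked by estimated CTR."""
    if len(prompts) < 2:
        raise ValueError("need at least two candidate prompts to form a pair")
    # Sort ascending by the stand-in reward; each prompt is scored once.
    ranked = sorted(prompts, key=estimate_ctr)
    return ranked[-1], ranked[0]


# Toy scores standing in for RM outputs; the values are made up for illustration.
toy_scores = {
    "wooden kitchen counter, morning light": 0.031,
    "plain white studio backdrop": 0.024,
    "neon city street at night": 0.012,
}
chosen, rejected = build_preference_pair(list(toy_scores), toy_scores.get)
```

The resulting pair can then feed a preference objective such as DPO, with the product information as the shared context for both prompts.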