A$^3$: Towards Advertising Aesthetic Assessment

Kaiyuan Ji, Yixuan Gao, Lu Sun, Yushuo Zheng, Zijian Chen, Jianbo Zhang, Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Guangtao Zhai

Abstract

Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A$^3$ (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A$^3$-Law), a dataset (A$^3$-Dataset), a multimodal large language model (A$^3$-Align), and a benchmark (A$^3$-Bench). Central to A$^3$ is a theory-driven paradigm, A$^3$-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A$^3$-Law, we construct A$^3$-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A$^3$-Align, trained under A$^3$-Law with CoT-guided learning on A$^3$-Dataset. Extensive experiments on A$^3$-Bench demonstrate that A$^3$-Align achieves superior alignment with A$^3$-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: https://github.com/euleryuan/A3-Align.

Paper Structure

This paper contains 12 sections, 2 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview of A$^3$: Advertising Aesthetic Assessment. A$^3$ centers on A$^3$-Law, a three-stage paradigm comprising Perceptual Attention, Formal Interest, and Desire Impact. Built on this paradigm, A$^3$-Dataset contains 30K images and 120K instruction-response pairs with Chain-of-Thought (CoT) rationales; A$^3$-Align learns the rules of A$^3$-Law; and A$^3$-Bench evaluates MLLMs, including on two tasks: quality selection and prescriptive critique.
  • Figure 2: A$^3$-Dataset construction pipeline. The pipeline has two stages. In the Human-Centric Phase, we collect diverse advertising images, assign preliminary aesthetic and content tags under A$^3$-Law, and remove noisy or inconsistent samples. In the Model-Enhanced Phase, multimodal LLMs generate Chain-of-Thought rationales that are refined and validated by experts. The result is the A$^3$-Dataset, with 30K images and 120K high-quality instruction-response pairs.
  • Figure 3: A$^3$-Align under the A$^3$-Law. The top panel shows examples from the A$^3$-Dataset organized by the three stages (Perceptual Attention, Formal Interest, and Desire Impact), with subcriteria and Chain-of-Thought summaries. The bottom panel presents a two-phase training pipeline. In the SFT phase, the multimodal LLM learns A$^3$-Law rules, structured output format, tool use, and Chain-of-Thought reasoning from the A$^3$-Dataset with a token-level cross-entropy loss. In the GRPO phase, the model is optimized with multi-signal rewards, yielding A$^3$-Align, which produces rule-based judgments.
  • Figure 4: Satisfaction and Action Intent Gains Through A$^3$-Law Screening. The figure shows the cumulative score gains in Satisfaction (left) and Action Intent (right) across the three stages of A$^3$-Law screening: Perceptual Attention, Formal Interest, and Desire Impact.
  • Figure 5: Evaluation of Problem Identification Accuracy, Depth of CoT, and Overall Clarity. The figure presents the scores of various models on three evaluation dimensions: Problem Identification Accuracy, Depth of Chain of Thought (CoT), and Overall Clarity.
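
The Figure 3 caption says only that the GRPO phase optimizes the model with multi-signal rewards. As a rough illustration of that idea, and not the authors' implementation, the sketch below combines several reward signals into one scalar per sampled response and then standardizes the scalars within a group of responses to the same prompt, which is the group-relative advantage commonly used in GRPO. The signal names (`format`, `accuracy`), the weights, and the weighted-sum combination are all assumptions made for this example; none of them come from the source.

```python
from statistics import mean, pstdev

# Hypothetical reward signals and weights; the source does not specify them.
WEIGHTS = {"format": 0.2, "accuracy": 0.8}


def combine_rewards(signals, weights=WEIGHTS):
    """Weighted sum of the per-signal rewards for one sampled response."""
    return sum(weights[name] * signals[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that score above the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for the same advertising image.
group = [
    {"format": 1.0, "accuracy": 1.0},
    {"format": 1.0, "accuracy": 0.0},
    {"format": 0.0, "accuracy": 0.0},
    {"format": 1.0, "accuracy": 1.0},
]
rewards = [combine_rewards(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, only the ranking and spread of rewards within a group matter, not their absolute scale; how the paper actually weights or combines its signals is not stated in this section.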