AdTEC: A Unified Benchmark for Evaluating Text Quality in Search Engine Advertising

Peinan Zhang, Yusuke Sakai, Masato Mita, Hiroki Ouchi, Taro Watanabe

TL;DR

AdTEC provides the first public, multi-perspective benchmark for evaluating ad texts in sponsored search, combining five tasks that cover acceptability, consistency, predicted performance, and semantic appeals within a realistic AdOps workflow. It delivers a Japanese dataset built from real industry processes, alongside evaluations of fine-tuned encoders, zero-/few-shot LLMs, and humans. These evaluations show that PLMs are strong on several tasks, that humans remain superior in certain qualitative judgments, and that LLMs excel in specific semantic areas. The study highlights practical insights for deploying automated ad-text quality estimators in production, including data leakage handling, task-specific evaluation metrics, and the value of real-world data sampling. It also outlines limitations related to language scope, lack of multimodal data, and potential biases, and proposes extending AdTEC to multilingual and multimodal contexts to better support diverse advertising operations.

Abstract

With the increase in the fluency of ad texts automatically created by natural language generation technology, there is high demand to verify the quality of these creatives in a real-world setting. We propose AdTEC (Ad Text Evaluation Benchmark by CyberAgent), the first public benchmark to evaluate ad texts from multiple perspectives within practical advertising operations. Our contributions are as follows: (i) Defining five tasks for evaluating the quality of ad texts and building a Japanese dataset based on the practical operational experiences of advertising agencies, which are typically kept in-house. (ii) Validating the performance of existing pre-trained language models (PLMs) and human evaluators on the dataset. (iii) Analyzing the characteristics of the benchmark and identifying its challenges. The results show that while PLMs have already reached a practical usage level in several tasks, humans still outperform them in certain domains, implying that there is significant room for improvement in this area.

Paper Structure (101 sections, 10 figures, 12 tables)

This paper contains 101 sections, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Overview of sponsored search ad and its key terms.
  • Figure 2: Generalized AdOps workflow: (1) The advertiser creates an LP to promote a product. (2) Based on the product information in the LP and target customers, text and graphics are designed by creators. (3) The creatives are evaluated based on fluency, attractiveness, regulations, legality, and other factors. (4) Once the creatives pass the quality evaluation, they are submitted to a delivery platform. (5) Customers respond to the displayed ads through actions such as page views, clicks, and purchases. (6) Based on the customer engagement, ad performance is reported back to the advertiser, and Steps 1-5 are repeated to improve the quality of the LP and ads.
  • Figure 3:
  • Figure 4: Account structure in ad delivery is hierarchical. A client represents a single company, and the account typically encompasses the commercial products offered by that client. Campaigns are created to promote these commercial products, while ad groups are used to organize keywords and ad texts. At higher levels of the hierarchy, there are more ads and greater variance. Conversely, at lower levels, there are fewer ads, which tend to be similar.
  • Figure 5: Examples of integrated gradients visualizations of the Tohoku BERT model's outputs, showing the difference in attention for small (top) and large (bottom) gaps between ground truth and predicted labels in the Ad Similarity task. Red indicates a negative influence on the predictions, while green indicates a positive influence.
  • ...and 5 more figures