GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing

Silan Hu, Shiqi Zhang, Yimin Shi, Xiaokui Xiao

TL;DR

GEM-Bench introduces the first comprehensive benchmark for ad-injected response (AIR) generation in Generative Engine Marketing (GEM). It features three curated datasets (MT-Human and LM-Market for chatbots; CA-Prod for search), a multi-faceted measurement ontology combining semantic and qualitative metrics, and an extensible multi-agent Ad-LLM framework. Experimental results show that generating an ad-free response and then injecting ads (Ad-LLM) improves user satisfaction and engagement over simple system-prompt injection, albeit at higher computational cost. Prompt-based AIR tends to yield higher CTR but can degrade trust and accuracy. The work provides a reproducible baseline for AIR research, discusses trade-offs and practical considerations, and makes datasets and resources publicly available to advance research in GEM and AIR design.

Abstract

Generative Engine Marketing (GEM) is an emerging ecosystem for monetizing generative engines, such as LLM-based chatbots, by seamlessly integrating relevant advertisements into their responses. At the core of GEM lies the generation and evaluation of ad-injected responses. However, existing benchmarks are not specifically designed for this purpose, which limits future research. To address this gap, we propose GEM-Bench, the first comprehensive benchmark for ad-injected response generation in GEM. GEM-Bench includes three curated datasets covering both chatbot and search scenarios, a metric ontology that captures multiple dimensions of user satisfaction and engagement, and several baseline solutions implemented within an extensible multi-agent framework. Our preliminary results indicate that, while simple prompt-based methods achieve reasonable engagement such as click-through rate, they often reduce user satisfaction. In contrast, approaches that insert ads based on pre-generated ad-free responses help mitigate this issue but introduce additional overhead. These findings highlight the need for future research on designing more effective and efficient solutions for generating ad-injected responses in GEM. The benchmark and all related resources are publicly available at https://gem-bench.org/.

Paper Structure

This paper contains 26 sections, 8 equations, 11 figures, 10 tables, and 1 algorithm.

Figures (11)

  • Figure 1: The illustration of two existing workflows for ad-injected response (AIR) generation in GEM and the proposed measurement ontologies.
  • Figure 2: A case study for a query about Zürich transport. The ad-related content is in blue.
  • Figure 3: Comparison: Ad-Chat vs. GIR-R (Seattle Attractions).
  • Figure 4: Comparison: Ad-Chat vs. GIR-R (Detox Meals).
  • Figure 5: Comparison: Ad-Chat vs. GIR-R (Gift Ideas for 23-Year-Old Male IT Coworker).
  • ...and 6 more figures