GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing
Silan Hu, Shiqi Zhang, Yimin Shi, Xiaokui Xiao
TL;DR
GEM-Bench introduces the first comprehensive benchmark for ad-injected response (AIR) generation in Generative Engine Marketing (GEM). It features three curated datasets (MT-Human and LM-Market for chatbots; CA-Prod for search), a multi-faceted measurement ontology combining semantic and qualitative metrics, and an extensible multi-agent Ad-LLM framework. Experimental results show that generating an ad-free response first and then injecting ads into it (Ad-LLM) improves user satisfaction and engagement over simple system-prompt injection, albeit at higher computational cost; prompt-based AIR tends to yield a higher click-through rate (CTR) but can degrade trust and accuracy. The work provides a reproducible baseline for AIR research, discusses trade-offs and practical considerations, and makes its datasets and resources publicly available to advance research in GEM and AIR design.
Abstract
Generative Engine Marketing (GEM) is an emerging ecosystem for monetizing generative engines, such as LLM-based chatbots, by seamlessly integrating relevant advertisements into their responses. At the core of GEM lies the generation and evaluation of ad-injected responses. However, existing benchmarks are not specifically designed for this purpose, which limits future research. To address this gap, we propose GEM-Bench, the first comprehensive benchmark for ad-injected response generation in GEM. GEM-Bench includes three curated datasets covering both chatbot and search scenarios, a metric ontology that captures multiple dimensions of user satisfaction and engagement, and several baseline solutions implemented within an extensible multi-agent framework. Our preliminary results indicate that, while simple prompt-based methods achieve reasonable engagement, as measured by metrics such as click-through rate, they often reduce user satisfaction. In contrast, approaches that insert ads into pre-generated ad-free responses help mitigate this issue but introduce additional overhead. These findings highlight the need for future research on designing more effective and efficient solutions for generating ad-injected responses in GEM. The benchmark and all related resources are publicly available at https://gem-bench.org/.
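
To make the contrast between the two families of methods concrete, the following is a minimal sketch and not the benchmark's implementation. The names `CountingLLM`, `prompt_based_air`, and `post_hoc_air`, as well as the prompt wording, are hypothetical and introduced only for illustration. The stub model simply echoes its inputs, so the example shows only the control flow and the extra model call that the two-stage approach incurs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CountingLLM:
    """Hypothetical stand-in for an LLM client: echoes inputs and records each call."""
    calls: List[str] = field(default_factory=list)

    def __call__(self, system: str, user: str) -> str:
        self.calls.append(system)
        return f"[{system}] {user}"


def prompt_based_air(llm: CountingLLM, query: str, ad: str) -> str:
    """Single call: the ad is placed in the system prompt (assumed wording)."""
    system = f"Answer the user and naturally mention this ad: {ad}"
    return llm(system, query)


def post_hoc_air(llm: CountingLLM, query: str, ad: str) -> str:
    """Two calls: draft an ad-free answer first, then inject the ad into it."""
    draft = llm("Answer the user.", query)
    rewrite = f"Insert this ad into the draft without changing its facts: {ad}"
    return llm(rewrite, draft)


if __name__ == "__main__":
    one_stage, two_stage = CountingLLM(), CountingLLM()
    prompt_based_air(one_stage, "Which laptop should I buy?", "AcmeBook")
    post_hoc_air(two_stage, "Which laptop should I buy?", "AcmeBook")
    print(len(one_stage.calls), len(two_stage.calls))
```

In this sketch the two-stage variant issues twice as many model calls as the prompt-based one, which is one simple way to picture the additional overhead the abstract mentions; actual costs depend on the models and pipeline used.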
