Machine Generated Product Advertisements: Benchmarking LLMs Against Human Performance

Sanjukta Ghosh

TL;DR

The paper examines the quality gap between AI-generated and human-written product descriptions in e-commerce by benchmarking four LLMs (Gemma 2B, LLAMA, GPT2, and ChatGPT 4), each with and without sample descriptions in the prompt, against human-written references for 100 products. It uses a multi-metric evaluation framework covering sentiment, readability, persuasiveness, SEO, clarity, emotional appeal, and call-to-action (CTA) effectiveness, with SST-2 and Flesch-Kincaid as core components. The key finding is that ChatGPT 4 leads or closely approaches human performance on several metrics, though gaps remain in emotional resonance and CTA impact, while the simpler models lag in several areas; the results also indicate that sample prompts improve quality. The study points toward hybrid AI-human content pipelines, emphasizes prompt engineering and brand alignment, and offers a data-driven basis for deploying AI in e-commerce content creation.

Abstract

This study compares the performance of AI-generated and human-written product descriptions using a multifaceted evaluation model. We analyze descriptions for 100 products generated by four AI models (Gemma 2B, LLAMA, GPT2, and ChatGPT 4), with and without sample descriptions, against human-written descriptions. Our evaluation metrics include sentiment, readability, persuasiveness, Search Engine Optimization (SEO), clarity, emotional appeal, and call-to-action effectiveness. The results indicate that ChatGPT 4 performs best. In contrast, the other models show significant shortcomings, producing incoherent output that lacks logical structure and contextual relevance. These models struggle to stay focused on the product being described, resulting in disjointed sentences that convey no meaningful information. This research provides insights into the current capabilities and limitations of AI in content creation for e-commerce.
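
The TL;DR names Flesch-Kincaid as a core readability component, but this page does not say which variant or implementation the paper uses. As a minimal sketch, assuming the Flesch-Kincaid grade-level variant and a rough vowel-group syllable heuristic (both assumptions, as are the function names `count_syllables` and `flesch_kincaid_grade`), a readability score for a product description might be computed like this:

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (a heuristic, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as "table".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39 * words/sentence + 11.8 * syllables/word - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    # Hypothetical descriptions, not taken from the paper's dataset.
    human = "Buy this mug. It is nice."
    model = "This extraordinarily versatile ceramic beverage container revolutionizes morning hydration experiences."
    print(round(flesch_kincaid_grade(human), 2), round(flesch_kincaid_grade(model), 2))
```

A lower grade means easier text, so scores like these could be compared between human-written and model-generated descriptions of the same product. The syllable heuristic is crude; a maintained readability library would likely give more reliable counts.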

Paper Structure

This paper contains 8 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overview of our proposed approach for assessing LLM-generated advertisements against our human benchmark