ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph

Langming Liu, Haibin Chen, Yuhao Wang, Yujin Yuan, Shilei Liu, Wenbo Su, Xiangyu Zhao, Bo Zheng

TL;DR

This work introduces ECKGBench, a Chinese benchmark for evaluating LLM factuality in e-commerce by leveraging a large-scale Taobao-derived knowledge graph. It automatically generates high-quality, multiple-choice questions via relation templating, a three-stage negative sampling pipeline, and rigorous verification (LLM + human), enabling reliable and efficient assessment across common and abstract knowledge. The study demonstrates that current advanced LLMs have limited accuracy in e-commerce factuality, with performance improving with model scale and showing clearer gains on common knowledge; it also provides a framework to explore base-model knowledge boundaries through predefined metrics like SC@k, Precision@k, and Recall@k. The benchmark offers a practical tool for evaluating and guiding LLM deployment in e-commerce, highlighting the importance of prompt design, reliability-focused question generation, and domain-specific evaluation to translate LLM capabilities into real-world commercial benefits.

Abstract

Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, as evidenced by practical implementations such as platform search, personalized recommendations, and customer service. One primary concern with LLMs is their factuality (e.g., hallucination), which is especially pressing in e-commerce because of its significant impact on user experience and revenue. Although some methods have been proposed to evaluate LLMs' factuality, issues such as lack of reliability, high resource consumption, and lack of domain expertise leave a gap in effective assessment for e-commerce. To bridge this evaluation gap, we propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge. Specifically, we adopt a standardized workflow to automatically generate questions based on a large-scale knowledge graph, guaranteeing sufficient reliability. We employ a simple question-answering paradigm, which substantially improves evaluation efficiency by minimizing input and output tokens. Furthermore, we inject abundant e-commerce expertise into each evaluation stage, including human annotation, prompt design, negative sampling, and verification. In addition, we explore LLMs' knowledge boundaries in e-commerce from a novel perspective. Through comprehensive evaluations of several advanced LLMs on ECKGBench, we provide meticulous analysis and insights into leveraging LLMs for e-commerce.
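To make the question-generation workflow more concrete, the following is a minimal sketch of turning one knowledge-graph triple into a multiple-choice question. It is not the authors' implementation: the `Triple` class, the `TEMPLATES` dictionary, the `build_question` function, and the example relation names are all hypothetical, and the single random sampling step is a simplified stand-in for the three-stage negative sampling and the LLM-plus-human verification described above.

```python
import random
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


# Hypothetical relation templates; the benchmark's real templates are not
# shown in this summary.
TEMPLATES = {
    "suitable_for": "Which scenario is {head} most suitable for?",
    "made_of": "Which material is {head} typically made of?",
}


def build_question(triple, kg, num_negatives=3, rng=None):
    """Build one multiple-choice question from a triple (illustrative only)."""
    rng = rng or random.Random(0)
    # Every tail that is correct for this head and relation is excluded,
    # so a distractor can never be a second valid answer.
    gold = {t.tail for t in kg
            if t.head == triple.head and t.relation == triple.relation}
    # Assumption: distractors are tails of the same relation from other heads.
    candidates = sorted({t.tail for t in kg if t.relation == triple.relation} - gold)
    if len(candidates) < num_negatives:
        raise ValueError("not enough distractor candidates for this relation")
    options = rng.sample(candidates, num_negatives) + [triple.tail]
    rng.shuffle(options)
    labels = string.ascii_uppercase[:len(options)]
    return {
        "question": TEMPLATES[triple.relation].format(head=triple.head),
        "options": dict(zip(labels, options)),
        "answer": labels[options.index(triple.tail)],
    }


if __name__ == "__main__":
    demo_kg = [
        Triple("yoga mat", "suitable_for", "home workout"),
        Triple("tent", "suitable_for", "camping"),
        Triple("sunscreen", "suitable_for", "beach trip"),
        Triple("thermos", "suitable_for", "commuting"),
    ]
    print(build_question(demo_kg[0], demo_kg))
```

Because the output is a single option letter, a model's answer can be scored with very few output tokens, which is in line with the efficiency argument made in the abstract.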

Paper Structure

This paper contains 40 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: An overview of how the ECKGBench dataset is generated. The sampled triples are fed into the question generation (upper branch) and negative sampling (lower branch) workflows, and the outputs are finally combined to form the questions of ECKGBench.
  • Figure 2: Inconsistency rates (lower is better). Green and blue represent random sampling and our sampling method, respectively.