ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models

Haibin Chen, Kangtao Lv, Chengwei Hu, Yanshi Li, Yujin Yuan, Yancheng He, Xingyao Zhang, Langming Liu, Shilei Liu, Wenbo Su, Bo Zheng

TL;DR

ChineseEcomQA proposes a scalable, domain-focused benchmark to rigorously evaluate LLMs on fundamental e-commerce concepts, balancing generality and expertise. The dataset is built via a hybrid pipeline that combines LLM-based QA generation, multi-stage validation (including RAG), and manual annotation, resulting in 1,800 QA pairs across 20 industries and 10 concept dimensions. Experiments across 11 closed-source and 16 open-source models show that larger, reasoning-capable models (e.g., DeepSeek-R1/V3) perform best, but RAG and calibration dynamics strongly influence performance and cross-model gaps. The work provides actionable insights into model strengths and limitations in e-commerce, highlights the value of RAG for domain knowledge, and offers a practical benchmark to guide future domain-specific evaluation and deployment in Chinese e-commerce contexts.

Abstract

With the increasing use of Large Language Models (LLMs) in fields such as e-commerce, domain-specific concept evaluation benchmarks are crucial for assessing their domain capabilities. Existing LLMs may generate factually incorrect information in complex e-commerce applications, so an e-commerce concept benchmark is needed. Existing benchmarks face two primary challenges: (1) handling the heterogeneous and diverse nature of tasks, and (2) distinguishing between generality and specificity within the e-commerce field. To address these problems, we propose ChineseEcomQA, a scalable question-answering benchmark focused on fundamental e-commerce concepts. ChineseEcomQA is built on three core characteristics: Focus on Fundamental Concept, E-commerce Generality, and E-commerce Expertise. Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks, thus addressing the challenge of heterogeneity and diversity. Additionally, by carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts, allowing for precise validation of domain capabilities. We achieve this through a scalable benchmark construction process that combines LLM validation, Retrieval-Augmented Generation (RAG) validation, and rigorous manual annotation. Based on ChineseEcomQA, we conduct extensive evaluations of mainstream LLMs and provide valuable insights. We hope that ChineseEcomQA can guide future domain-specific evaluations and facilitate broader LLM adoption in e-commerce applications.
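The construction process described in the abstract (LLM validation, then RAG validation, then manual annotation) can be pictured as a chain of filters over candidate QA pairs. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `QAPair`, `run_pipeline`, `nonempty_check`, `make_reference_check`, and `make_approval_check` are hypothetical names, and each check is a trivial stand-in for the real LLM, retrieval, or human step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class QAPair:
    """A candidate question-answer pair (hypothetical structure)."""
    question: str
    answer: str


# A validation stage is a name plus a predicate that accepts or rejects a pair.
Stage = Tuple[str, Callable[[QAPair], bool]]


def run_pipeline(
    candidates: Iterable[QAPair], stages: List[Stage]
) -> Tuple[List[QAPair], Dict[str, int]]:
    """Apply the stages in order; return survivors and per-stage drop counts."""
    survivors = list(candidates)
    dropped: Dict[str, int] = {}
    for name, check in stages:
        kept = [qa for qa in survivors if check(qa)]
        dropped[name] = len(survivors) - len(kept)
        survivors = kept
    return survivors, dropped


def nonempty_check(qa: QAPair) -> bool:
    """Stand-in for LLM validation: only rejects blank questions or answers."""
    return bool(qa.question.strip()) and bool(qa.answer.strip())


def make_reference_check(reference: Dict[str, str]) -> Callable[[QAPair], bool]:
    """Stand-in for RAG validation: the answer must appear in the reference
    text stored for the question (assumption: a dict replaces real retrieval)."""
    def check(qa: QAPair) -> bool:
        return qa.answer.lower() in reference.get(qa.question, "").lower()
    return check


def make_approval_check(approved: Set[str]) -> Callable[[QAPair], bool]:
    """Stand-in for manual annotation: keep only annotator-approved questions."""
    def check(qa: QAPair) -> bool:
        return qa.question in approved
    return check
```

Passing the stages as an ordered list keeps the sketch scalable in the loose sense that a new check can be added without touching `run_pipeline`, and the per-stage drop counts show where candidates are lost.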

Paper Structure

This paper contains 30 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of fundamental e-commerce concepts. Ranging from basic to advanced, the concepts are grouped into 10 sub-concepts.
  • Figure 2: An overview of the data construction process of ChineseEcomQA.
  • Figure 3: Illustrative examples of the data construction process.
  • Figure 4: Dataset statistics of ChineseEcomQA.
  • Figure 5: Detailed results for selected models across the ten sub-concept tasks.
  • ...and 6 more figures