IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce

Wenxuan Ding; Weiqi Wang; Sze Heng Douglas Kwok; Minghao Liu; Tianqing Fang; Jiaxin Bai; Xin Liu; Changlong Yu; Zheng Li; Chen Luo; Qingyu Yin; Bing Yin; Junxian He; Yangqiu Song

TL;DR

This paper presents IntentionQA, a double-task multiple-choice question answering benchmark that evaluates LMs' comprehension of purchase intentions in E-commerce. LMs must infer intentions from purchased products and then use those intentions to predict additional purchases.

Abstract

Enhancing Language Models' (LMs) ability to understand purchase intentions in E-commerce scenarios is crucial for their effective assistance in various downstream tasks. However, previous approaches that distill intentions from LMs often fail to generate meaningful and human-centric intentions applicable in real-world E-commerce contexts. This raises concerns about whether LMs truly comprehend and can utilize purchase intentions. In this paper, we present IntentionQA, a double-task multiple-choice question answering benchmark to evaluate LMs' comprehension of purchase intentions in E-commerce. Specifically, LMs are tasked with inferring intentions from purchased products and utilizing them to predict additional purchases. IntentionQA consists of 4,360 carefully curated problems across three difficulty levels, constructed using an automated pipeline to ensure scalability on large E-commerce platforms. Human evaluations demonstrate the high quality and low false-negative rate of our benchmark. Extensive experiments across 19 language models show that they still struggle in certain scenarios, such as accurately understanding products and intentions and jointly reasoning over products and intentions, where they fall far behind human performance. Our code and data are publicly available at https://github.com/HKUST-KnowComp/IntentionQA.
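To make the benchmark format concrete, the following is a minimal Python sketch of how a multiple-choice problem from either task could be represented and scored by accuracy, overall and per difficulty level. The class name `MCQProblem`, its fields, the difficulty labels, and the example product and intention strings are all hypothetical and chosen for illustration; they are not taken from the released data or code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MCQProblem:
    """One multiple-choice problem (hypothetical schema, not the released format)."""
    task: str                 # "IntentUnderstand" (Task 1) or "IntentUtilize" (Task 2)
    question: str             # purchased product(s), plus the intention for Task 2
    options: Tuple[str, ...]  # candidate intentions (Task 1) or products (Task 2)
    answer_index: int         # index of the gold option
    difficulty: str           # assumed labels: "easy", "medium", "hard"


def accuracy(problems: List[MCQProblem], predictions: List[int]) -> float:
    """Fraction of problems whose predicted option index matches the gold index."""
    if len(problems) != len(predictions):
        raise ValueError("problems and predictions must have the same length")
    if not problems:
        return 0.0
    correct = sum(p.answer_index == pred for p, pred in zip(problems, predictions))
    return correct / len(problems)


def accuracy_by_difficulty(
    problems: List[MCQProblem], predictions: List[int]
) -> Dict[str, float]:
    """Accuracy computed separately for each difficulty label."""
    if len(problems) != len(predictions):
        raise ValueError("problems and predictions must have the same length")
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for problem, pred in zip(problems, predictions):
        totals[problem.difficulty] += 1
        hits[problem.difficulty] += int(problem.answer_index == pred)
    return {level: hits[level] / totals[level] for level in totals}


# Invented toy problems, one per task plus one extra hard problem.
toy_problems = [
    MCQProblem(
        task="IntentUnderstand",
        question="A customer buys a tent and a sleeping bag. What is the most likely intention?",
        options=("decorate a living room", "go camping", "bake a cake", "repair a bicycle"),
        answer_index=1,
        difficulty="easy",
    ),
    MCQProblem(
        task="IntentUtilize",
        question="A customer buys a tent because they want to go camping. What else might they buy?",
        options=("sleeping bag", "desk lamp", "blender", "office chair"),
        answer_index=0,
        difficulty="hard",
    ),
    MCQProblem(
        task="IntentUtilize",
        question="A customer buys flour because they want to bake a cake. What else might they buy?",
        options=("tent", "bicycle pump", "baking powder", "phone case"),
        answer_index=2,
        difficulty="hard",
    ),
]
toy_predictions = [1, 0, 3]  # the last prediction is wrong
```

A model's outputs would first need to be mapped to option indices before calling `accuracy`; how that mapping is done (for example, parsing a letter from generated text) is left open here.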

Paper Structure (44 sections, 2 equations, 4 figures, 10 tables)

Introduction
Related Works
Intention Discovery with Large Language Models in E-commerce
Benchmarking (Large) Language Models
IntentionQA
Task Definitions
Task 1: IntentUnderstand
Task 2: IntentUtilize
Source Intention Collection and Context Augmentation
Product and Intention Similarity
Distractor Sampling and QA Construction
Task 1: IntentUnderstand
Task 2: IntentUtilize
Difficulty Labeling
Quality Control
...and 29 more sections

Figures (4)

Figure 1: Examples of the two tasks in IntentionQA. Task 1 requires the language model to determine the customer's intention in purchasing two products, and Task 2 involves recommending a product that fulfills the customer's intention and matches their currently purchased product.
Figure 2: Overview of IntentionQA and the construction pipeline. We map products from intention assertions to event nodes in ASER (#1) and calculate their context embedding with the one-hop neighborhood (#2). Product (#3) and intention (#4) similarities are then computed accordingly. Products/intentions with higher similarities are represented closer to each other. Negative distractor sampling for Task 1/2 is based on intention/product similarity respectively.
Figure 3: Performances of various language models in comprehending intentions with different relations.
Figure 4: Comparisons between models fine-tuned on intentions from MIND and baseline models achieving top performances.
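
The similarity-based distractor sampling mentioned in the Figure 2 caption can be illustrated with a small Python sketch. It assumes each candidate (an intention for Task 1, a product for Task 2) already has an embedding vector, and it draws distractors whose cosine similarity to the gold answer falls inside a chosen band. The function names, the band thresholds, and the toy vectors are hypothetical; the ASER-based context embeddings and the actual sampling rules of the benchmark are not reproduced here.

```python
import math
import random
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_distractors(
    gold: str,
    embeddings: Dict[str, Sequence[float]],
    k: int,
    low: float,
    high: float,
    rng: random.Random,
) -> List[str]:
    """Sample k candidates whose similarity to the gold answer lies in [low, high).

    The band [low, high) is an illustrative assumption: a band closer to 1.0
    yields distractors that resemble the gold answer more strongly.
    """
    gold_vec = embeddings[gold]
    pool = sorted(
        name
        for name, vec in embeddings.items()
        if name != gold and low <= cosine(gold_vec, vec) < high
    )
    if len(pool) < k:
        raise ValueError(f"only {len(pool)} candidates in band, need {k}")
    return rng.sample(pool, k)


# Invented two-dimensional toy embeddings for illustration only.
toy_embeddings = {
    "gold": [1.0, 0.0],
    "a": [0.9, 0.1],   # very similar to gold
    "b": [0.0, 1.0],   # orthogonal to gold
    "c": [0.5, 0.5],   # moderately similar to gold
    "d": [-1.0, 0.0],  # opposite to gold
}
rng = random.Random(0)
similar_band = sample_distractors("gold", toy_embeddings, k=1, low=0.5, high=0.99, rng=rng)
dissimilar_band = sample_distractors("gold", toy_embeddings, k=2, low=-1.0, high=0.5, rng=rng)
```

Excluding candidates above the upper threshold is one plausible way to avoid near-duplicates of the gold answer, which would otherwise act as false negatives; whether the benchmark uses this exact mechanism is not stated in this section.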
