Combating Missed Recalls in E-commerce Search: A CoT-Prompting Testing Approach
Shengnan Wu, Yongxiang Hu, Yingchuan Wang, Jiazhen Gu, Jin Meng, Liujie Fan, Zhongshi Luan, Xin Wang, Yangfan Zhou
TL;DR
Missed recalls in e-commerce search pose a practical risk to user satisfaction and platform profitability. The paper introduces mrDetector, an automatic testing framework that uses a customized CoT prompting strategy with a large language model to generate realistic, user-aligned test cases, and a metamorphic-relation-based oracle to detect missed recalls. The approach comprises three components: test-case generation, test-case validation, and missed-recall detection. It is evaluated on open data and on industrial data from Meituan, where it outperforms the baselines, with a low false-positive rate $R_{fp}$ and favorable test-case efficiency $E_{tc}$. The findings indicate that larger LLMs and the tailored CoT prompt both improve performance, and that a substantial portion of the discovered missed recalls can be reproduced in real deployments, which underlines the practical value of the approach for improving e-commerce search quality and revenue.
Abstract
Search components in e-commerce apps, often complex AI-based systems, are prone to bugs that can lead to missed recalls: situations where items that should be listed in the search results are not. Missed recalls can frustrate shop owners and harm the app's profitability. However, testing for them is challenging because user-aligned test cases are difficult to generate and no test oracle is available. In this paper, we introduce mrDetector, the first automatic testing approach designed specifically for missed recalls. To address the test-case generation challenge, we draw on findings about how users construct queries while searching and build a CoT prompt that guides an LLM to generate user-aligned queries. To address the oracle problem, we learn from users who issue multiple queries for the same shop and compare the search results, and we formalize this behavior as a metamorphic relation that serves as the test oracle. Extensive experiments on open-access data demonstrate that mrDetector outperforms all baselines and achieves the lowest false positive ratio. Experiments with real industrial data show that mrDetector discovers over one hundred missed recalls with only 17 false positives.
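The idea of comparing the results of several queries for one shop can be illustrated with a minimal Python sketch. The abstract does not state the exact metamorphic relation, so the sketch assumes one: if at least one query written for a shop recalls it, every other query written for that shop should recall it too. The names `detect_missed_recalls`, `search`, and `shop_id` are hypothetical and do not come from mrDetector.

```python
from typing import Callable, Dict, List


def detect_missed_recalls(
    shop_id: str,
    queries: List[str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Return the queries suspected of missing `shop_id`.

    Assumed relation (not taken from the paper): all queries target the
    same shop, so if any of them recalls the shop, the others should too.
    """
    # Record, for each query, whether the target shop appears in its results.
    recalled: Dict[str, bool] = {q: shop_id in search(q) for q in queries}

    # If no query recalls the shop, the assumed relation gives no evidence.
    if not any(recalled.values()):
        return []

    # Queries that omit the shop while a sibling query recalls it are suspects.
    return [q for q, hit in recalled.items() if not hit]
```

Here `search` stands in for the search component under test and returns a list of shop identifiers for a query. A flagged query is only a candidate missed recall under the assumed relation and would still need validation.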
