TaoSR1: The Thinking Model for E-commerce Relevance Search

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng

TL;DR

TaoSR1 tackles the challenge of deploying large language models (LLMs) directly for e-commerce query-item relevance by combining structured reasoning with a scalable training and deployment pipeline. It starts with supervised fine-tuning (SFT) on Chain-of-Thought (CoT) data generated via a Retrieval-Augmented Generation system, then applies offline pass@N-based Direct Preference Optimization (DPO) and difficulty-aware Group Relative Policy Optimization (GRPO) to reduce error propagation and discriminative hallucination. A cumulative probability-based partitioning method (CumPT) enables simple, reliable online tiering. The approach yields significant offline macro-F1 gains and substantial improvements in online side-by-side (SBS) human evaluations, supporting CoT-based reasoning as a viable paradigm for relevance classification in production search systems.

Abstract

Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. Although Large Language Models (LLMs) have been explored for this task, most approaches still rely on discriminative fine-tuning or distill the LLM into a smaller model for deployment. We propose a framework to directly deploy LLMs for this task, addressing three key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
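Stage (2) pairs pass@N sampling with DPO, but the abstract does not say how preference pairs are formed. One common recipe, shown here as a minimal sketch and not as the authors' procedure, is to sample N reasoning traces per query-item pair and, whenever at least one trace reaches the gold label and at least one does not, keep a correct trace as "chosen" and an incorrect one as "rejected". The function name, the `(cot_text, predicted_label)` tuple layout, and the random tie-breaking are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple

Sample = Tuple[str, int]  # (cot_text, predicted_label); hypothetical layout


def build_preference_pair(
    samples: List[Sample],
    gold_label: int,
    rng: Optional[random.Random] = None,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) CoT texts, or None if no contrast exists.

    Assumption: a sample is "correct" when its final label equals gold_label.
    """
    rng = rng or random.Random(0)
    correct = [s for s in samples if s[1] == gold_label]
    wrong = [s for s in samples if s[1] != gold_label]
    # All-correct or all-wrong groups carry no preference signal, so skip them.
    if not correct or not wrong:
        return None
    return rng.choice(correct)[0], rng.choice(wrong)[0]


if __name__ == "__main__":
    sampled = [("trace A", 1), ("trace B", 0), ("trace C", 1)]
    print(build_preference_pair(sampled, gold_label=1))
```

Under this assumption, queries that the model never solves within N samples produce no pair and would need separate handling.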

Paper Structure

This paper contains 25 sections, 9 equations, 1 figure, 8 tables, 1 algorithm.

Figures (1)

  • Figure 1: Our proposed TaoSR1 framework, comprising three stages: (1) SFT with CoT to endow the model with reasoning capabilities; (2) offline multiple sampling based on a pass@N strategy, combined with DPO, to enhance generation quality; and (3) difficulty-based dynamic sampling integrated with GRPO to further mitigate the model's discriminative hallucination.
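Stage (3) in the caption can be illustrated with the standard GRPO group-relative advantage, in which each sampled response's reward is normalized by the mean and standard deviation of its own group. The caption does not define the difficulty-based sampling rule, so the filter below is only an assumption: it keeps a prompt when its sampled rewards disagree, since a group with identical rewards yields zero advantage everywhere. The function names and the use of the population standard deviation are illustrative choices, not details from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_informative(rewards: List[float]) -> bool:
    """Assumed difficulty filter: keep prompts whose samples disagree."""
    return len(set(rewards)) > 1


if __name__ == "__main__":
    group = [1.0, 0.0, 1.0, 0.0]  # e.g. 1.0 when the predicted label is correct
    if is_informative(group):
        print(group_relative_advantages(group))
```

With binary correctness rewards, this filter drops prompts that are either always solved or never solved, which concentrates updates on prompts of intermediate difficulty.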