TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

Pengkun Jiao, Yiming Jin, Jianhui Yang, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang

TL;DR

The paper tackles the challenge of training interpretable, robust LLM-based e-commerce search relevance models for long-tail queries. It introduces TaoSR-SHE, a Stepwise Hybrid Examination RL framework that combines a Generative Stepwise Reward Model with offline human verification and uses the Stepwise Reward Policy Optimization (SRPO) algorithm to provide dense, step-level supervision. Key innovations include Offline Rejection Sampling, Diverse Sampling, and multi-stage Curriculum Learning to improve data efficiency and exploration, along with a stepwise credit assignment mechanism that mitigates sparse rewards. Offline and online experiments on Taobao data show better reasoning quality and relevance prediction than SFT, DPO, and GRPO baselines, with gains in macro F1, Good F1, and accuracy, while maintaining interpretability and robustness for real-world deployment.

Abstract

Query-product relevance analysis is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of large language models (LLMs), particularly their chain-of-thought (CoT) reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: SFT and DPO suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verification rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for Taobao Search Relevance (TaoSR-SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. TaoSR-SHE further incorporates two key techniques: diversified data filtering to encourage exploration across varied reasoning paths and mitigate policy entropy collapse, and multi-stage curriculum learning to foster progressive capability growth. Extensive experiments on real-world search benchmarks show that TaoSR-SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.

Paper Structure

This paper contains 41 sections, 7 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of our proposed Hybrid Stepwise RL pipeline. Each key step is extracted from the policy-model rollout, and both a generative stepwise reward model and offline human verification are employed to obtain step-level rewards. These rewards are then used to estimate step-level advantages to guide reinforcement learning.
  • Figure 2: Unlike PPO, which uses token-level advantages, and GRPO, which uses sequence-level advantages, our SRPO estimates step-level advantages.
  • Figure 3: Representative outputs produced by our generative stepwise reward model.
  • Figure 4: Quality results for SRPO.
  • Figure 5: Quality results for SRPO.
  • ...and 1 more figure
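
The contrast drawn in the Figure 2 caption (sequence-level advantages in GRPO versus step-level advantages in SRPO) can be illustrated with a small sketch. This is not the paper's formulation, which is not reproduced on this page. It assumes, hypothetically, that every rollout in a group has the same number of key steps, that each step already carries a scalar reward from the hybrid examination, and that advantages are obtained by GRPO-style mean/standard-deviation normalization applied per step index. The function names `step_level_advantages` and `broadcast_to_tokens` are illustrative only.

```python
from statistics import mean, pstdev
from typing import List


def step_level_advantages(
    step_rewards: List[List[float]], eps: float = 1e-6
) -> List[List[float]]:
    """Normalize each step's reward across a group of rollouts.

    step_rewards[i][k] is the reward of key step k in rollout i.
    ASSUMPTION: all rollouts share the same number of steps, and the
    per-step normalization mirrors GRPO's group normalization. The
    paper's actual SRPO estimator may differ.
    """
    num_steps = len(step_rewards[0])
    if any(len(r) != num_steps for r in step_rewards):
        raise ValueError("all rollouts must have the same number of steps")

    advantages = [[0.0] * num_steps for _ in step_rewards]
    for k in range(num_steps):
        column = [r[k] for r in step_rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, reward in enumerate(column):
            # eps keeps the division defined when every rollout agrees.
            advantages[i][k] = (reward - mu) / (sigma + eps)
    return advantages


def broadcast_to_tokens(
    step_advantages: List[float], step_lengths: List[int]
) -> List[float]:
    """Give every token the advantage of the step it belongs to."""
    token_advantages: List[float] = []
    for adv, length in zip(step_advantages, step_lengths):
        token_advantages.extend([adv] * length)
    return token_advantages


# Three rollouts for one query-item pair, three key steps each.
# 1.0 = step judged correct, 0.0 = step judged incorrect (hypothetical data).
group_rewards = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]
group_advantages = step_level_advantages(group_rewards)
```

In this toy group every rollout gets the first step right, so that step contributes zero advantage, while the second rollout is penalized specifically at the step where it first goes wrong. A sequence-level scheme would instead assign one value to the whole rollout, which is the sparse-feedback problem the abstract describes.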