Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning

Shijie Zhang, Xiang Guo, Rujun Guo, Shaoyu Liu, Xiaozhao Wang, Guanjun Jiang, Kevin Zhang

TL;DR

This work tackles the challenge of achieving both millisecond-level latency and deep reasoning in industrial search. It introduces the Answer First, Reason Later (AFRL) paradigm, which emits a definitive relevance label as the first token and follows it with a verifiable reasoning trace. AFRL is paired with Mode-Balanced Optimization, which balances forward and reverse KL divergences. A training stack of PIAR automated instruction refinement, structured data synthesis, strict rule-based rewards, and a multi-stage curriculum yields a 32B teacher that achieves state-of-the-art results and can distill its reasoning into a 0.6B student, reconciling reasoning depth with deployment efficiency. The approach reports robust improvements on realistic industrial datasets and offers a scalable path to interpretable, high-performance ranking models in latency-critical, real-time search pipelines.

Abstract

Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel Answer First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to achieve AFRL. However, directly applying existing RL training often leads to mode collapse in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information-theoretic perspective, RL inherently minimizes the reverse KL divergence, which tends to seek probability peaks (mode-seeking) and is prone to "reward hacking." SFT, by contrast, minimizes the forward KL divergence, forcing the model to cover the data distribution (mode-covering) and effectively anchoring expert rules. Based on this insight, we propose a Mode-Balanced Optimization strategy that incorporates an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model, thereby reconciling reasoning depth with deployment latency.
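The mode-seeking versus mode-covering contrast drawn in the abstract can be illustrated with a small numerical sketch. The distributions below are invented toy values, not data or code from the paper: `target` stands in for a bimodal expert distribution, `q_cover` spreads mass over every outcome, and `q_peak` concentrates on a single mode. Under these assumptions, the forward KL favors the covering candidate while the reverse KL favors the peaked one.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy values chosen for illustration only.
target = [0.48, 0.02, 0.02, 0.48]    # bimodal "expert" distribution
q_cover = [0.25, 0.25, 0.25, 0.25]   # spreads mass over all outcomes
q_peak = [0.94, 0.02, 0.02, 0.02]    # collapses onto one mode

# Forward KL(target || q): the SFT-style, mode-covering direction.
forward = {"cover": kl(target, q_cover), "peak": kl(target, q_peak)}
# Reverse KL(q || target): the mode-seeking direction.
reverse = {"cover": kl(q_cover, target), "peak": kl(q_peak, target)}

print(min(forward, key=forward.get))  # prints "cover"
print(min(reverse, key=reverse.get))  # prints "peak"
```

The peaked candidate ignores the second mode entirely, yet it wins under the reverse KL. This is the collapse behavior that the abstract argues an auxiliary SFT (forward KL) term can counteract.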

Paper Structure

This paper contains 28 sections, 8 equations, 5 figures, 6 tables, and 1 algorithm.

Figures (5)

  • Figure 1: AFRL addresses the latency–reasoning tradeoff by generating answers first and reasoning afterward, enabling instant predictions while maintaining interpretable and robust decision logic via mode-balanced optimization.
  • Figure 2: Overview of the proposed framework. (Left) Expert Data Construction via the PIAR loop and hard-sample mining; (Middle) Mode-Balanced Optimization, a hybrid training paradigm balancing mode-seeking (Reverse KL) and mode-covering (Forward KL) dynamics; (Right) Curriculum-Guided Learning and final knowledge distillation to bridge the gap between high-level reasoning and deployment efficiency.
  • Figure 3: Evolution of Pair Accuracy (left) and F1 Score (right).
  • Figure 4: Dynamics of Reward Score (left) and Policy Entropy (right).
  • Figure 5: Comparison of accuracy evolution between standard GRPO (random sampling) and GRPO-CL (curriculum learning) on Qwen3-8B. Both methods use the same training data, differing only in the order of sample presentation.