Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling

Zihao Feng; Xiaoxue Wang; Ziwei Bai; Donghang Su; Bowen Wu; Qun Yu; Baoxun Wang

TL;DR

This work tackles the generalization gap in intent detection for task-oriented dialogue (TOD) by training with Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) method, combined with Reward-Based Curriculum Sampling (RCS). It shows that GRPO improves generalization to unseen and complex intents, that offline RCS further concentrates learning on challenging cases, and that Chain-of-Thought (CoT) prompting improves performance on harder tasks. Experiments on MultiWOZ 2.2 and the TODAssistant dataset show that GRPO often outperforms supervised fine-tuning (SFT) on generalization benchmarks and that, with RL, base models can approach the performance of instruction-tuned models. The findings offer practical guidance for deploying adaptable TOD systems: they highlight data-efficient training and cross-domain robustness, and they point to online curriculum methods and multi-intent detection as directions for future work.

Abstract

Intent detection, a critical component of task-oriented dialogue (TOD) systems, faces significant challenges in adapting to the rapid influx of integrable tools with complex interrelationships. Existing approaches, such as zero-shot reformulations and LLM-based dynamic recognition, suffer performance degradation when they encounter unseen intents, which leads to erroneous task routing. To improve generalization to unseen tasks, we apply Reinforcement Learning (RL) to intent detection, combining Group Relative Policy Optimization (GRPO) training with Reward-based Curriculum Sampling (RCS). Experiments demonstrate that RL-trained models substantially outperform supervised fine-tuning (SFT) baselines in generalization. In addition, introducing RCS significantly bolsters the effectiveness of RL in intent detection by focusing the model on challenging cases during training. Moreover, incorporating Chain-of-Thought (CoT) reasoning in RL notably improves generalization in complex intent detection tasks, underscoring the importance of explicit reasoning in challenging scenarios. This work advances the generalization of intent detection, offering practical insights for deploying adaptable dialogue systems.
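As a rough illustration of the two ideas named above, the sketch below shows the group-relative reward normalization that GRPO is generally built on, followed by one possible reading of "focusing the model on challenging cases". The function names, the 0/1 reward, and the 0.5 threshold are assumptions made for this example; the abstract does not specify the reward design or the sampling rule, so this should not be read as the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt.

    GRPO scores a sampled response by how much better or worse it is than
    the rest of its group, instead of using a learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_challenging(mean_rewards, threshold=0.5):
    """Hypothetical offline curriculum filter (an assumption, not the paper's rule).

    Keeps the ids of training examples whose average reward under the
    current policy is below `threshold`, i.e. the ones it still gets wrong.
    """
    return [qid for qid, r in mean_rewards.items() if r < threshold]


# Four sampled responses for one query, rewarded 1 if the intent is correct
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Average reward per training example, collected in an earlier pass
hard_cases = select_challenging({"q1": 1.0, "q2": 0.25, "q3": 0.5})
```

In this toy setup, correct responses receive positive advantages and incorrect ones negative advantages of equal magnitude, and only `q2` falls below the assumed threshold and would be kept for a second training stage.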
