Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhoupeng Shou, Yumeng Zhang, Zhiqiang You

TL;DR

The paper addresses the misalignment between fluent language generation and real business KPIs in task-oriented dialogue. It proposes GOPO, a hierarchical RL framework with a dual-agent setup: an Expert Agent plans the strategy and a Customer Service Agent generates responses constrained by that strategy. The two agents are connected by hard SOP constraints and a joint business-focused reward. A new Task-focused Sequential Engagement (TSE) metric captures long-horizon task success, and an ESNDCG-based expert reward guides trajectory-level strategy optimization. On the Mgshop, MultiWOZ, and TmallBrand datasets, GOPO achieves higher TSE and GRE than the baselines, smaller GOPO-trained models rival or surpass larger models, and ablations confirm the Expert Agent’s crucial role. The work demonstrates a practical path to long-horizon optimization in commercial dialogue and points to future directions in dynamic skill discovery and cross-domain applicability.
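
The summary does not define ESNDCG, so the sketch below shows only standard NDCG, the ranking measure that the name suggests it builds on (an assumption). The function names and the example relevance scores are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: each item's relevance is discounted by log2(rank + 1)."""
    # enumerate() is 0-based, so rank = i + 1 and the discount is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """Normalized DCG: DCG of the given order divided by DCG of the ideal (sorted) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance scores for an ordered sequence of strategies.
well_ordered = ndcg([3.0, 2.0, 1.0])    # most relevant first
badly_ordered = ndcg([1.0, 2.0, 3.0])   # most relevant last
```

Because the score depends on the order of the whole sequence rather than on any single item, a measure of this family can reward a trajectory as a unit, which matches the trajectory-level framing in the summary.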

Abstract

Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which align poorly with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, respectively, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization, and GOPO shows consistent improvements on the other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.
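
The decoupling described in the abstract can be pictured with a minimal sketch: one component picks a strategy per turn, and a second component may only respond within that strategy. This is a toy illustration under stated assumptions, not the authors' implementation. The strategy labels, keyword rules, and response templates are all hypothetical stand-ins for the trained Expert Agent and Customer Service Agent.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical response templates keyed by strategy; the paper's real
# strategy and SOP inventory is not given in this summary.
TEMPLATES: Dict[str, str] = {
    "clarify_need": "Could you tell me more about what you are looking for?",
    "recommend_product": "Based on what you said, this item may suit you.",
    "handle_objection": "I understand the concern. There is a discount available.",
    "close_order": "Great, I will help you place the order now.",
}


@dataclass
class Turn:
    user: str
    strategy: str
    response: str


def expert_agent(history: List[Turn], user_msg: str) -> str:
    """Toy planner: keyword rules stand in for a learned strategy policy."""
    text = user_msg.lower()
    if "expensive" in text or "price" in text:
        return "handle_objection"
    if "buy" in text or "order" in text:
        return "close_order"
    if not history:
        return "clarify_need"
    return "recommend_product"


def customer_service_agent(strategy: str, user_msg: str) -> str:
    """Toy generator: the reply is hard-constrained to the chosen strategy."""
    if strategy not in TEMPLATES:
        raise ValueError(f"unknown strategy: {strategy}")
    return TEMPLATES[strategy]


def run_dialogue(user_msgs: List[str]) -> List[Turn]:
    """Plan first, then generate, once per user message."""
    history: List[Turn] = []
    for msg in user_msgs:
        strategy = expert_agent(history, msg)
        response = customer_service_agent(strategy, msg)
        history.append(Turn(msg, strategy, response))
    return history


dialogue = run_dialogue(
    ["Hi, I need headphones", "That seems expensive", "OK, I'll buy it"]
)
```

Keeping the planner and the generator behind separate functions means a trajectory-level reward can be attributed to the sequence of strategies alone, while the generator is judged only on whether it stays within the strategy it was given.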

Paper Structure

This paper contains 20 sections, 11 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: TSE, GRE, and G-Eval Across Datasets: GOPO-Qwen3-14B vs. Qwen-235B, DeepSeek-R1, and GPT-5.2.
  • Figure 2: Hierarchical Dual-Agent Architecture of GOPO with Decoupled Strategy and Generation.
  • Figure 3: Convergence Performance of GOPO Under Different Parameter Settings.