DyBBT: Dynamic Balance via Bandit inspired Targeting for Dialog Policy with Cognitive Dual-Systems

Shuyu Zhang, Yifan Wei, Jialuo Yuan, Xinru Wang, Yanmin Zhu, Bin Li, Yujie Liu

TL;DR

DyBBT tackles adaptive exploration in task-oriented dialog systems by introducing a structured cognitive state space that captures dialog progress, user uncertainty, and slot dependencies. A bandit-inspired meta-controller dynamically switches between a fast System 1 and a slower System 2, guided by visitation counts and confidence signals, with a Lipschitz reward assumption enabling sublinear regret in the cognitive space. The approach is instantiated as a dual-system architecture and validated on MS Dialog and MultiWOZ, achieving state-of-the-art success, efficiency, and generalization, with human evaluations confirming alignment with expert judgment. Empirical results, ablation studies, and real-world tests demonstrate DyBBT's practical viability and offer insights into robust, scalable adaptive exploration for TODS, while also identifying areas for end-to-end cognitive representation learning.

Abstract

Task-oriented dialog systems often rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency. DyBBT introduces a bandit-inspired meta-controller that dynamically switches between a fast intuitive inference module (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art performance in success rate, efficiency, and generalization, with human evaluations confirming that its decisions are well aligned with expert judgment. Code is available at https://github.com/carsonz/DyBBT.
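
The switching behavior described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the UCB-style exploration bonus, the thresholds `tau` and `kappa`, the class name `MetaController`, and the use of a hashable tuple as the cognitive state are all assumptions made for illustration. The only elements taken from the text are the controller's inputs (cognitive state, visitation count, and System 1 confidence) and its two possible outputs.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MetaController:
    """Hypothetical bandit-style switch between System 1 and System 2."""

    tau: float = 0.5    # assumed threshold on the exploration bonus
    kappa: float = 0.7  # assumed threshold on System 1 confidence
    counts: dict = field(default_factory=lambda: defaultdict(int))
    total: int = 0      # number of decisions made so far

    def select(self, state, s1_confidence):
        """Pick a system for a discretized cognitive state.

        Rarely visited states get a large bonus and are routed to the
        slow reasoner; so are states where System 1 is unsure.
        """
        self.total += 1
        self.counts[state] += 1
        n = self.counts[state]
        # Assumed UCB-like bonus: shrinks as the state is visited more often.
        bonus = math.sqrt(math.log(self.total + 1) / n)
        if bonus > self.tau or s1_confidence < self.kappa:
            return "system2"
        return "system1"


controller = MetaController()
# A hypothetical state: (dialog progress bucket, user uncertainty bucket).
choice = controller.select(("early", "high"), s1_confidence=0.9)
print(choice)
```

Under these assumed thresholds, a state that has been visited many times with high System 1 confidence is handled by System 1, while a novel state or a low-confidence prediction triggers System 2. The actual decision rule and hyperparameters should be taken from the paper and its released code.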

Paper Structure

This paper contains 82 sections, 23 equations, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: A comparison of exploration strategies for dialog policy learning. Traditional DRL relies on static heuristics that cannot adapt to dynamic dialog contexts. EIERL uses population-based evolution but struggles in complex tasks. DyBBT addresses the adaptive exploration challenge with a cognitive meta-controller that achieves a principled balance between efficiency and robustness.
  • Figure 2: The DyBBT Architecture. A meta-controller uses the cognitive state $\mathbf{c}_t$, visitation count $n_t(\mathbf{c}_t)$, and System 1's confidence $p_t^{S1}$ to dynamically select between System 1 (fast intuitive) and System 2 (slow deliberative). Outputs drive action execution and update visitation/distillation buffers for continuous learning.
  • Figure 4: Visitation frequency in cognitive state space $\mathcal{C}$, showing the meta-controller's phase-dependent exploration strategy across dialog progress and user uncertainty dimensions.
  • Figure 5: Analysis of meta-controller decisions: the rate of System 2 invocation across dialog progress, and a pie chart showing the proportion of System 2 invocations.
  • Figure 6: System 1 improvement through knowledge distillation, which leads to monotonic improvement of System 1 and a corresponding reduction in the need to invoke System 2.
  • ...and 6 more figures