
ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

Ruike Cao, Shaojie Bai, Fugen Yao, Liang Dong, Jian Xu, Li Xiao

TL;DR

This work proposes a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm, which adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance.

Abstract

Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism to minimize the number of rollouts, and an asynchronous search architecture that leverages KV cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks demonstrate that our algorithm significantly outperforms several strong baselines, culminating in the Qwen3-8B model surpassing the much larger GPT-4o ($+0.92\%$ accuracy).
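
To make the allocation idea concrete, the following is a minimal Python sketch of an uncertainty-gated expansion step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weight `alpha`, the threshold, and the exact forms of the two terms (absolute one-step Bellman residual against the mean child Q-value, and population variance of the child Q-values) are all hypothetical choices. The only parts taken from the page are the two ingredients of the score and the rule, stated in the Figure 1 caption, that high-uncertainty nodes are fully expanded while low-uncertainty nodes keep a single randomly chosen child.

```python
import random
import statistics


def composite_uncertainty(v_parent, child_qs, reward=0.0, gamma=1.0, alpha=0.5):
    """Hypothetical composite score U = alpha * U1 + (1 - alpha) * U2.

    U1: absolute one-step Bellman residual between the parent's value
        estimate and the mean Q-value of its candidate children (assumed form).
    U2: population variance of the candidate children's Q-values (assumed form).
    """
    mean_q = statistics.fmean(child_qs)
    u1 = abs(reward + gamma * mean_q - v_parent)
    u2 = statistics.pvariance(child_qs)
    return alpha * u1 + (1 - alpha) * u2


def select_children(children, v_parent, child_qs, threshold, rng=random):
    """Keep every child if the node is uncertain, else one random child."""
    u = composite_uncertainty(v_parent, child_qs)
    if u >= threshold:
        return list(children)      # high uncertainty: expand fully
    return [rng.choice(children)]  # low uncertainty: prune to one child


# Children that disagree strongly trigger full expansion.
kept = select_children(["ask_fever", "ask_travel"], 0.5, [0.0, 1.0], threshold=0.1)
```

In this sketch, a node whose children all carry the same Q-value as the parent scores zero and is pruned to one child, so the rollout budget concentrates on nodes where the critic is inconsistent or the candidate actions disagree.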

Paper Structure (24 sections, 11 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of the ATPO algorithm. ATPO generates training data via an adaptive tree search. At each expansion step, it generates candidate child nodes and computes a composite uncertainty score based on their Bellman error $U_1$ and Q-value variance $U_2$. Nodes with high uncertainty $U$ are fully expanded, while low-uncertainty nodes are pruned by randomly selecting a single child for the subsequent rollout. The collected trajectories are then used for policy and critic updates.
  • Figure 2: Analysis of the ATPO algorithm on Qwen3-4B. (a) Training efficiency and performance comparison of various algorithms, plotting accuracy against the number of generated turns. (b), (c) Return variance and critic loss for ATPO and baseline methods. (d), (e) Distribution of branching nodes and returns by depth for samples from ATPO at a representative training step. (f), (g) Stability analysis of ATPO with and without visit-count-based down-weighting.
  • Figure 3: Schematic diagram of the interaction flow between the Assistant Agent and the User Simulator in the multi-turn clinical reasoning environment. The process starts from an incomplete initial user query, after which the Assistant Agent asks targeted questions and the User Simulator responds strictly within the scope of predefined atomic facts, until a final answer is produced or the turn limit is reached.
  • Figure 4: Proportion of effective questions in multi-turn dialogues for models of various scales as ATPO training progresses.
  • Figure 5: Comparison of model dialogue quality before and after ATPO training.
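
The interaction flow described in the Figure 3 caption can be sketched as a simple loop. This is a toy illustration only: in the paper both roles are played by language models, whereas here the user simulator is a hypothetical dictionary lookup over atomic facts and the assistant is a scripted function. The names `run_dialogue`, `scripted_assistant`, the fallback reply, and the default turn limit are all assumptions made for the example.

```python
def run_dialogue(assistant, atomic_facts, max_turns=5):
    """Run one episode of the ask/answer loop.

    assistant(history) must return ("question", topic) or ("answer", text).
    The simulated user replies only from atomic_facts; anything outside
    that scope gets a fixed fallback reply (hypothetical wording).
    """
    history = []
    for _ in range(max_turns):
        kind, content = assistant(history)
        if kind == "answer":
            return content, history            # final answer ends the episode
        reply = atomic_facts.get(content, "I don't know.")
        history.append((content, reply))
    return None, history                        # turn limit reached, no answer


def scripted_assistant(history):
    """Toy policy: ask two fixed questions, then commit to an answer."""
    topics = ["fever", "travel"]
    if len(history) < len(topics):
        return ("question", topics[len(history)])
    return ("answer", "influenza")


answer, transcript = run_dialogue(scripted_assistant, {"fever": "Yes, 39C."})
```

The two exit paths mirror the caption: the episode ends either when the assistant produces a final answer or when the turn limit is exhausted.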