Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization

Yihang Yao, Zhepeng Cen, Haohong Lin, Shiqi Liu, Zuxin Liu, Jiacheng Zhu, Zhang-Wei Hong, Laixi Shi, Ding Zhao

TL;DR

The paper addresses the challenge of balancing task performance and user engagement in proactive, multi-turn LLM agents. It formulates training as a multi-objective optimization over trajectories $\tau$ of horizon $T$, with a task-performance objective $R(\tau)$ and a user-engagement cost $U(\tau)$ combined through a penalty weight $w$, i.e., $\max_{\pi_{\theta}} \mathbb{E}_{\tau \sim \pi_{\theta}}[ R(\tau) - w\,U(\tau) ]$. It introduces Behavioral Agentic Optimization (BAO), a framework that couples behavior enhancement (retrospective reasoning and prospective planning) with behavior-regularized RL (information-seeking and over-thinking regularizations) to guide inter-turn reasoning and information gathering. BAO uses a warm-start SFT phase with an external teacher to embed the target behaviors, followed by GRPO-based RL to shape turn-level outcomes. It achieves strong Pareto performance on the UserRL benchmark, approaching or matching commercial models. The results show that BAO improves task performance while reducing user burden, increases exploration and information diversity, and mitigates reward hacking, pointing to a practical path toward reliable, user-aligned proactive agents in complex multi-turn scenarios.
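
The scalarized objective above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trajectory` fields, the choice of counting user-involved actions as $U(\tau)$, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trajectory:
    task_reward: float  # R(tau): assumed scalar task score for the episode
    user_actions: int   # U(tau): assumed count of user-involved actions


def scalarized_return(traj: Trajectory, w: float) -> float:
    """Return R(tau) - w * U(tau) for a single trajectory."""
    return traj.task_reward - w * traj.user_actions


def estimate_objective(trajs: list[Trajectory], w: float) -> float:
    """Monte Carlo estimate of E[R(tau) - w * U(tau)] over sampled trajectories."""
    return mean(scalarized_return(t, w) for t in trajs)


# Hypothetical batch of sampled trajectories (values are made up).
batch = [Trajectory(1.0, 4), Trajectory(0.0, 1), Trajectory(1.0, 2)]

for w in (0.0, 0.1, 0.5):
    print(f"w={w}: estimated objective = {estimate_objective(batch, w):.4f}")
```

A larger $w$ penalizes trajectories that lean heavily on the user, so the same batch scores lower as $w$ grows. This only changes the weighting of the two objectives; it does not by itself change which behaviors the policy has available.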

Abstract

Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real-world, user-centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi-turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement: passive agents cannot efficiently adapt to users' intentions, while overuse of human feedback reduces user satisfaction. To address this trade-off, we propose BAO, an agentic RL framework that combines behavior enhancement, which enriches proactive reasoning and information-gathering capabilities, with behavior regularization, which suppresses inefficient or redundant interactions and aligns agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite and demonstrate that it substantially outperforms proactive agentic RL baselines while achieving comparable or even superior performance to commercial LLM agents, highlighting its effectiveness for training proactive, user-aligned LLM agents in complex multi-turn scenarios. Our website: https://proactive-agentic-rl.github.io/.

Paper Structure

This paper contains 32 sections, 5 equations, 14 figures, 3 tables, and 1 algorithm.

Figures (14)

  • Figure 1: The overview of BAO. (Left): Behavior Enhancement. (Middle): Behavior-Regularized RL. (Right): Pareto-Frontiers between user engagement efforts and task performance.
  • Figure 2: Performance on the two objectives with different penalty weights $w_1 < w_2 < w_3$ in the multi-objective formulation $\max_{\pi_{\theta}} \mathbb{E}_{\tau \sim \pi_{\theta}}[ R(\tau) - w\,U(\tau) ]$. Pass@U-$k$ is defined as the pass rate when allowing up to $k$ user-involved actions per trajectory ($U(\tau)=k$). $\uparrow, \downarrow$: higher/lower is better. Simply tuning the weights fails to improve the trade-off between maximizing task performance and minimizing user engagement.
  • Figure 3: Behavior examples from Turtle-Gym. Hidden twist: This person is a programmer; in the computer industry, old, large, and difficult-to-maintain code is referred to as a pile of excrement. Red: Prospective Planning; Blue: Retrospective Reasoning.
  • Figure 4: Function-Gym training curves. BAO keeps a higher exploration ratio, achieving higher task performance with even fewer generated tokens compared to UserRL.
  • Figure 5: Pareto frontiers in Function-Gym. The results are averaged over three random seeds. The shaded area represents the standard deviation. BAO achieves better Pareto frontiers.
  • ...and 9 more figures
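
The Pass@U-$k$ metric from the Figure 2 caption can be sketched as follows. This is one possible reading, not the authors' evaluation code: it assumes each episode is recorded as a hypothetical `(success, user_actions)` pair and that an episode counts as a pass only if it succeeds while using at most $k$ user-involved actions.

```python
def pass_at_u_k(episodes: list[tuple[bool, int]], k: int) -> float:
    """Fraction of episodes that succeed with at most k user-involved actions.

    `episodes` holds (success, user_actions) pairs; this record layout is an
    assumption for illustration, not a format taken from the paper.
    """
    if not episodes:
        return 0.0
    passed = sum(
        1 for success, user_actions in episodes
        if success and user_actions <= k
    )
    return passed / len(episodes)


# Hypothetical evaluation results (values are made up).
episodes = [(True, 2), (True, 5), (False, 1), (True, 3)]

for k in (1, 3, 5):
    print(f"Pass@U-{k} = {pass_at_u_k(episodes, k):.2f}")
```

Under this reading the metric is non-decreasing in $k$, so sweeping $k$ traces out a curve of task performance against the allowed user-engagement budget.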