Large Language Model-based Human-Agent Collaboration for Complex Task Solving

Xueyang Feng; Zhi-Yuan Chen; Yujia Qin; Yankai Lin; Xu Chen; Zhiyuan Liu; Ji-Rong Wen

TL;DR

Fully autonomous LLM-based agents still struggle with complex, dynamic tasks. This work introduces ReHAC, a reinforcement learning-driven framework that learns when to request human intervention. By formulating human-agent collaboration as an MDP and training a dual-policy system with offline RL, ReHAC balances task performance against intervention cost. Empirical results on HotpotQA, StrategyQA, and InterCode show that ReHAC outperforms baselines and generalizes across datasets, with scalable policy models and GPT-4 simulations supporting broader applicability. The study also discusses extensions to multi-level collaboration, development-stage frameworks for LLM agents, and safety/alignment considerations for real-world deployment.

Abstract

In recent developments within the research community, the integration of Large Language Models (LLMs) in creating fully autonomous agents has garnered significant interest. Despite this, LLM-based agents frequently demonstrate notable shortcomings in adjusting to dynamic environments and fully grasping human needs. In this work, we introduce the problem of LLM-based human-agent collaboration for complex task-solving, exploring their synergistic potential. In addition, we propose a Reinforcement Learning-based Human-Agent Collaboration method, ReHAC. This approach includes a policy model designed to determine the most opportune stages for human intervention within the task-solving process. We construct a human-agent collaboration dataset to train this policy model in an offline reinforcement learning environment. Our validation tests confirm the model's effectiveness. The results demonstrate that the synergistic efforts of humans and LLM-based agents significantly improve performance in complex tasks, primarily through well-planned, limited human intervention. Datasets and code are available at: https://github.com/XueyangFeng/ReHAC.
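The collaboration loop described above can be illustrated with a minimal sketch. It assumes a step-wise setting in which a collaboration policy first decides who acts (agent or human) and the chosen actor then produces the next action. The names `rollout`, `collab_policy`, `agent_act`, and `human_act` are hypothetical, and the toy policies are stand-ins, not the trained models from the paper.

```python
from typing import Callable, List, Tuple

AGENT, HUMAN = "agent", "human"

State = List[str]  # assumed state: the history of actions taken so far


def rollout(
    collab_policy: Callable[[State], str],
    agent_act: Callable[[State], str],
    human_act: Callable[[State], str],
    initial_state: State,
    max_steps: int,
) -> List[Tuple[str, str]]:
    """Run one episode; return a list of (actor, action) pairs."""
    state = list(initial_state)
    trajectory: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        # Step 1: the collaboration policy decides who handles this step.
        who = collab_policy(state)
        # Step 2: the chosen actor produces the task action.
        action = human_act(state) if who == HUMAN else agent_act(state)
        trajectory.append((who, action))
        state = state + [action]
        if action == "finish":  # hypothetical terminal action
            break
    return trajectory


# Toy stand-ins: the agent handles the first two steps, a human the third.
toy_policy = lambda s: HUMAN if len(s) == 2 else AGENT
toy_agent = lambda s: f"search[{len(s)}]"
toy_human = lambda s: "finish"

episode = rollout(toy_policy, toy_agent, toy_human, [], max_steps=5)
```

In this toy episode the human is consulted once, at the third step, which mirrors the allocation pattern described in the Figure 5 case study. A learned policy would replace `toy_policy`.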

Paper Structure (36 sections, 10 equations, 6 figures, 6 tables)

Introduction
Approach
Preliminary and Problem Formulation
ReHAC
Optimization
Experiments
Experimental Setup
Datasets
Implementation details
Reward Calculation
Baselines
Overall Results
Human-Agent Experiments
Human Simulation
Learning Curves
...and 21 more sections

Figures (6)

Figure 1: Different Levels of Automation. (a) No automation: Tasks are entirely performed by humans. (b) Full automation: Tasks are completely executed by agents without human intervention. (c) Conditional automation: Humans are required only for specific sub-tasks, without continuous monitoring.
Figure 2: (a) Human-agent collaboration evaluation. (b) GPT-4-agent collaboration evaluation. The bars above the 0-axis represent the reward $R$, the bars below the 0-axis represent the human intervention cost $\lambda C$, and the entire columns, composed of the bars above and below the 0-axis, represent the task reward $T$. Numbers within the bars indicate the human intervention rate (%). $\text{ReHAC}_{\text{GPT-4}}$ and $\text{ReHAC}_{\text{Human}}$ represent the policy models trained on the GPT-4-agent and human-agent collaboration datasets, respectively. ReHAC outperforms other baselines in human-agent collaboration scenarios.
Figure 3: Reward $R$ variations of different methods during the training process on HotpotQA dataset. Here we set the human intervention penalty coefficient $\lambda$ to 0.06, 0.08, and 0.1. Curves of ReHAC and IL are averaged over 15 points, with shadows indicating the variance.
Figure 4: Reward $R$ variations during the training process on three datasets. Curves of ReHAC and IL are averaged over 15 points, with shadows indicating the variance.
Figure 5: Case Study. When the agent completes the task on its own, it cannot answer the third step because the question is ambiguous. With our method, the first two simple retrieval steps are assigned to the agent and the third step is assigned to a human, who can arrive at the correct answer through bold speculation.
...and 1 more figure
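
The Figure 2 caption describes each column as the reward $R$ (above the axis) plus the human intervention cost $\lambda C$ (below the axis), with the whole column being the task reward $T$. Read that way, $R = T - \lambda C$. The sketch below works through that arithmetic under one loud assumption: $C$ is taken to be the number of steps assigned to a human, which this page does not state. The function names and the task reward of 1.0 are hypothetical; the $\lambda$ values are the ones listed in the Figure 3 caption.

```python
from typing import Sequence

AGENT, HUMAN = 0, 1


def intervention_cost(assignments: Sequence[int]) -> int:
    """C: number of human-assigned steps (an assumption of this sketch)."""
    return sum(1 for a in assignments if a == HUMAN)


def collaboration_reward(task_reward: float,
                         assignments: Sequence[int],
                         lam: float) -> float:
    """R = T - lam * C, following the decomposition in the Figure 2 caption."""
    return task_reward - lam * intervention_cost(assignments)


def intervention_rate(assignments: Sequence[int]) -> float:
    """Share of steps handled by a human, as a percentage."""
    if not assignments:
        return 0.0
    return 100.0 * intervention_cost(assignments) / len(assignments)


# Hypothetical episode: four steps, one of them delegated to a human.
steps = [AGENT, AGENT, HUMAN, AGENT]
rewards = {lam: collaboration_reward(1.0, steps, lam) for lam in (0.06, 0.08, 0.1)}
rate = intervention_rate(steps)
```

With a fixed task reward, a larger $\lambda$ lowers $R$ for the same allocation, so a policy trained with a larger penalty has more incentive to call on the human less often.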
