PFEA: An LLM-based High-Level Natural Language Planning and Feedback Embodied Agent for Human-Centered AI

Wenbin Ding, Jun Chen, Mingjia Chen, Fei Xie, Qi Mao, Philip Dames

TL;DR

This work addresses the challenge of enabling robots to execute high-level natural language instructions in human-centered settings by introducing PFEA, an LLM-based vision-language embodied agent. The architecture combines a speech-processing front end, a vision-language planner-converter-evaluator stack, and a robust action execution module with open-vocabulary perception for real-world manipulation. The key innovations are a unified scene-understanding framework for planning, a feedback-driven task evaluator, and a training-free deployment pathway validated through extensive simulation and real-world experiments, achieving a 28% improvement over LLM+CLIP baselines. The results demonstrate improved planning generalization, robust task execution, and meaningful human-robot interaction, advancing practical, adaptable, and interpretable embodied AI for human-centered robotics.

Abstract

The rapid advancement of Large Language Models (LLMs) has marked a significant breakthrough in Artificial Intelligence (AI), ushering in a new era of Human-centered Artificial Intelligence (HAI). HAI aims to better serve human welfare and needs, thereby placing higher demands on the intelligence level of robots, particularly in aspects such as natural language interaction, complex task planning, and execution. Intelligent agents powered by LLMs have opened up new pathways for realizing HAI. However, existing LLM-based embodied agents often lack the ability to plan and execute complex natural language control tasks online. This paper explores the implementation of intelligent robotic manipulating agents based on Vision-Language Models (VLMs) in the physical world. We propose a novel embodied agent framework for robots, which comprises a human-robot voice interaction module, a vision-language agent module and an action execution module. The vision-language agent itself includes a vision-based task planner, a natural language instruction converter, and a task performance feedback evaluator. Experimental results demonstrate that our agent achieves a 28% higher average task success rate in both simulated and real environments compared to approaches relying solely on LLM+CLIP, significantly improving the execution success rate of high-level natural language instruction tasks.

Paper Structure

This paper contains 24 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overall system architecture of PFEA. The complete workflow begins with human voice input, followed by the agent performing speech recognition, vision-language task planning, conversion into executable Python instructions, task execution through robotic actions, and task completion assessment. Finally, the agent generates a natural language response reporting the execution result.
  • Figure 2: Prompts for the planner, converter, and evaluator, shown in the sub-figures labeled fig:Planner, fig:Converter, and fig:Evaluator, respectively.
  • Figure 3: The first row shows the initial desktop states for scenarios 1 to 10, and the second row shows the same ten scenarios after completing the prompts listed in the simulation data table (label table:sim_data). The tasks in each column are categorized as shown in the figure.
  • Figure 4: Task execution in four real-world scenarios, with each column representing one task. From left to right, the images illustrate the execution process of the four tasks listed in the real-world data table (label table:real_data).
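
The workflow in the Figure 1 caption (speech recognition, planning, conversion to executable instructions, execution, completion assessment, spoken report) can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: the class `AgentPipeline`, all of its field names, the `max_attempts` retry limit, and the stub components in `demo` are hypothetical. In particular, re-planning after a failed assessment is an assumption about how the evaluator's feedback might be used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentPipeline:
    """Hypothetical wiring of the stages named in the Figure 1 caption."""
    recognize_speech: Callable[[bytes], str]   # audio -> instruction text
    observe: Callable[[], str]                 # current scene description
    plan: Callable[[str, str], List[str]]      # (instruction, scene) -> sub-tasks
    convert: Callable[[str], str]              # sub-task -> executable instruction
    execute: Callable[[str], None]             # run one instruction on the robot
    evaluate: Callable[[str, str], bool]       # (instruction, scene) -> completed?
    max_attempts: int = 2                      # assumed retry limit, not from the paper
    log: List[str] = field(default_factory=list)

    def run(self, audio: bytes) -> str:
        instruction = self.recognize_speech(audio)
        for attempt in range(1, self.max_attempts + 1):
            # Plan against the scene as it looks right now.
            for step in self.plan(instruction, self.observe()):
                command = self.convert(step)
                self.log.append(command)
                self.execute(command)
            # Assess completion on a fresh observation; re-plan if it failed.
            if self.evaluate(instruction, self.observe()):
                return f"Completed '{instruction}' after {attempt} attempt(s)."
        return f"Could not complete '{instruction}' in {self.max_attempts} attempt(s)."


def demo() -> str:
    """Run the loop with toy stand-ins for every component."""
    scene: Dict[str, int] = {"blocks_in_bowl": 0}

    def execute(command: str) -> None:
        # Toy executor: only a 'place' command changes the scene.
        if "place" in command:
            scene["blocks_in_bowl"] += 1

    pipeline = AgentPipeline(
        recognize_speech=lambda audio: audio.decode("utf-8"),
        observe=lambda: f"blocks_in_bowl={scene['blocks_in_bowl']}",
        plan=lambda instruction, scene_text: ["pick block", "place block in bowl"],
        convert=lambda step: f"robot.do({step!r})",
        execute=execute,
        evaluate=lambda instruction, scene_text: scene_text == "blocks_in_bowl=1",
    )
    return pipeline.run(b"put the block in the bowl")


if __name__ == "__main__":
    print(demo())
```

Passing each stage as a plain callable keeps the loop independent of any particular speech recognizer, VLM, or robot driver, so a stub can be swapped for a real component one stage at a time.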