PFEA: An LLM-based High-Level Natural Language Planning and Feedback Embodied Agent for Human-Centered AI

Wenbin Ding; Jun Chen; Mingjia Chen; Fei Xie; Qi Mao; Philip Dames

TL;DR

This work addresses the challenge of enabling robots to execute high-level natural language instructions in human-centered settings by introducing PFEA, an LLM-based vision-language embodied agent. The architecture combines a speech-processing front end, a vision-language agent made up of a task planner, an instruction converter, and a feedback evaluator, and an action execution module with open-vocabulary perception for real-world manipulation. The key innovations are a unified scene-understanding framework for planning, a feedback-driven task evaluator, and a training-free deployment pathway. In simulation and real-world experiments, the agent achieves a 28% higher average task success rate than approaches relying solely on LLM+CLIP. The results indicate improved planning generalization, more robust task execution, and more natural human-robot interaction, advancing practical, adaptable, and interpretable embodied AI for human-centered robotics.

Abstract

The rapid advancement of Large Language Models (LLMs) has marked a significant breakthrough in Artificial Intelligence (AI), ushering in a new era of Human-centered Artificial Intelligence (HAI). HAI aims to better serve human welfare and needs, and therefore places higher demands on the intelligence of robots, particularly in natural language interaction, complex task planning, and execution. Intelligent agents powered by LLMs have opened up new pathways for realizing HAI. However, existing LLM-based embodied agents often lack the ability to plan and execute complex natural language control tasks online. This paper explores the implementation of intelligent robotic manipulation agents based on Vision-Language Models (VLMs) in the physical world. We propose a novel embodied agent framework for robots, which comprises a human-robot voice interaction module, a vision-language agent module, and an action execution module. The vision-language agent itself includes a vision-based task planner, a natural language instruction converter, and a task performance feedback evaluator. Experimental results demonstrate that our agent achieves a 28% higher average task success rate in both simulated and real environments compared to approaches relying solely on LLM+CLIP, significantly improving the execution success rate of high-level natural language instruction tasks.
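
The module layout described above (planner, instruction converter, action execution, feedback evaluator) can be read as a plan-convert-execute-evaluate loop. The following is only a minimal sketch of that control flow under stated assumptions, not the authors' implementation: the names `run_agent`, `Attempt`, `demo`, and `max_retries`, the retry-on-failure policy, and all stub behaviors are hypothetical, the voice front end is assumed to have already produced a text command, and the VLM and robot are replaced by plain Python callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One execution attempt and the evaluator's verdict (hypothetical record type)."""
    instruction: str
    succeeded: bool


def run_agent(
    command: str,
    plan: Callable[[str], List[str]],   # stand-in for the vision-based task planner
    convert: Callable[[str], str],      # stand-in for the instruction converter
    execute: Callable[[str], None],     # stand-in for the action execution module
    evaluate: Callable[[str], bool],    # stand-in for the feedback evaluator
    max_retries: int = 2,               # assumed retry budget, not from the paper
) -> Tuple[bool, List[Attempt]]:
    """Run each planned subtask, retrying while the evaluator reports failure."""
    history: List[Attempt] = []
    for subtask in plan(command):
        instruction = convert(subtask)
        for _ in range(max_retries + 1):
            execute(instruction)
            ok = evaluate(subtask)
            history.append(Attempt(instruction, ok))
            if ok:
                break
        else:
            # Retry budget exhausted without a successful evaluation.
            return False, history
    return True, history


def demo() -> Tuple[bool, List[Attempt]]:
    """Toy run in which the very first execution fails and is retried."""
    done = set()
    calls = {"n": 0}

    def plan(command: str) -> List[str]:
        # A real planner would use the scene image; this one is hard-coded.
        return ["pick red block", "place red block in bowl"]

    def convert(subtask: str) -> str:
        verb, _, rest = subtask.partition(" ")
        return f"{verb}({rest!r})"

    def execute(instruction: str) -> None:
        calls["n"] += 1
        if calls["n"] > 1:  # simulate a failed first attempt
            done.add(instruction)

    def evaluate(subtask: str) -> bool:
        return convert(subtask) in done

    return run_agent("put the red block in the bowl", plan, convert, execute, evaluate)
```

In this toy run, the evaluator rejects the first pick attempt, the same instruction is executed again, and the remaining subtask then completes. A real evaluator might instead trigger replanning; the paper's abstract does not specify which policy is used.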
