EMPOWER: Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution

Francesco Argenziano; Michele Brienza; Vincenzo Suriani; Daniele Nardi; Domenico D. Bloisi

TL;DR

EMPOWER is a framework for open-vocabulary online grounding and planning for embodied agents. It targets the problems of grounded planning and execution for robots in real-life settings.

Abstract

Task planning for robots in real-life settings presents significant challenges. These challenges stem from three primary issues: the difficulty in identifying grounded sequences of steps to achieve a goal; the lack of a standardized mapping between high-level actions and low-level commands; and the challenge of maintaining low computational overhead given the limited resources of robotic hardware. We introduce EMPOWER, a framework designed for open-vocabulary online grounding and planning for embodied agents aimed at addressing these issues. By leveraging efficient pre-trained foundation models and a multi-role mechanism, EMPOWER demonstrates notable improvements in grounded planning and execution. Quantitative results highlight the effectiveness of our approach, achieving an average success rate of 0.73 across six different real-life scenarios using a TIAGo robot.

Paper Structure (16 sections, 1 equation, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
LLMs and robotics
Multi-role prompting
Open-vocabulary systems
Methodology
Multi-role planner
Open-vocabulary grounding
Plan Actuator
Experimental Results
Hardware specification
Qualitative Results
Quantitative Results
Temporal Analysis
Discussion
...and 1 more sections

Figures (5)

Figure 1: The EMPOWER architecture during the execution of one of the use cases analyzed: reordering the shelf to have only 2 objects per level.
Figure 2: Complete architecture of EMPOWER, from the task description to the execution of the plan in the world. The RGB image is used to extract a graph of the scene as well as the final plan and the object labels. These labels are then grounded via an NLP pipeline and reprojected onto the point clouds extracted from the depth image of the robot. Lastly, reference points are computed from these point clouds to facilitate actions in the world. Use case illustrated: order the objects on the table from the highest to the lowest.
Figure 3: Success rates of the experiments in the single-role setup. Each marker represents one particular use case. The deviation from the mean number of plan steps over $10$ trials is also shown.
Figure 4: Success rates of the experiments in the multi-role setup. The success rates are higher than those of the single-role setting in Fig. 3.
Figure 5: Comparison of the average number of steps per plan between the multi-role and the single-role setups. The plans obtained in the multi-role setup are shorter on average than those obtained in the single-role setup.
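
The reprojection step described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes a standard pinhole camera model and a per-object pixel mask, and it uses the centroid of the resulting points as the reference point. These are assumptions made for illustration: the captions do not state how EMPOWER computes its reference points, and the names `backproject`, `reference_point`, `fx`, `fy`, `cx`, and `cy` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift masked depth pixels to 3D camera-frame points (pinhole model).

    depth holds metric depth per pixel; mask marks the pixels that belong
    to one grounded object label. Both are hypothetical inputs.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the object and invalid (zero) depth readings.
            if keep and z > 0.0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def reference_point(points: Sequence[Point]) -> Optional[Point]:
    """Return the centroid of the points, or None if there are none.

    The centroid is an assumed choice of reference point, not the
    paper's stated method.
    """
    if not points:
        return None
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )
```

In this sketch, a robot would call `backproject` once per grounded label and pass the result to `reference_point` to obtain a single 3D target in the camera frame. Transforming that target into the robot's base frame is outside the scope of the sketch.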
