PerspAct: Enhancing LLM Situated Collaboration Skills through Perspective Taking and Active Vision

Sabrina Patania, Luca Annese, Anita Pellegrini, Silvia Serino, Anna Lambiase, Luca Pallonetto, Silvia Rossi, Simone Colombani, Tom Foulsham, Azzurra Ruggeri, Dimitri Ognibene

TL;DR

This work addresses the challenge of perspective-taking in multi-agent AI by extending the classic Director task with active visual exploration within a ReAct-based framework. By embedding perspective-taking cues and enabling dynamic viewpoint adjustment, the study demonstrates improved interpretative accuracy and collaborative effectiveness, while highlighting persistent gaps in inferring non-visible targets and in matching optimal epistemic planning. The results suggest a viable path toward more adaptive, context-aware robotic systems that can reason about others’ beliefs and intentions, though further advances in prompting, exploration strategies, and embodied deployment are needed. Overall, the approach lays groundwork for integrating active perception with epistemic reasoning in multimodal LLMs for robust, real-world teamwork tasks.

Abstract

Recent advances in Large Language Models (LLMs) and multimodal foundation models have significantly broadened their application in robotics and collaborative systems. However, effective multi-agent interaction necessitates robust perspective-taking capabilities, enabling models to interpret both physical and epistemic viewpoints. Current training paradigms often neglect these interactive contexts, resulting in challenges when models must reason about the subjectivity of individual perspectives or navigate environments with multiple observers. This study evaluates whether explicitly incorporating diverse points of view using the ReAct framework, an approach that integrates reasoning and acting, can enhance an LLM's ability to understand and ground the demands of other agents. We extend the classic Director task by introducing active visual exploration across a suite of seven scenarios of increasing perspective-taking complexity. These scenarios are designed to challenge the agent's capacity to resolve referential ambiguity based on visual access and interaction, under varying state representations and prompting strategies, including ReAct-style reasoning. Our results demonstrate that explicit perspective cues, combined with active exploration strategies, significantly improve the model's interpretative accuracy and collaborative effectiveness. These findings highlight the potential of integrating active perception with perspective-taking mechanisms in advancing LLMs' application in robotics and multi-agent systems, setting a foundation for future research into adaptive and context-aware AI systems.

Paper Structure

This paper contains 9 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Schematic view of the experiment. The top-left panel illustrates a sample environment, highlighting the perspectives of both the Matcher and the Director (yellow cones). The bottom-left panel presents a sample of the observations the Environment produces sequentially. Each observation is provided to the Matcher in natural language, via a fixed template that re-elaborates the output of the PDDL-defined environment. Finally, the right panel shows the action-perception cycle of the Matcher: the Matcher first observes the Environment and then thinks about the possible courses of action. After that, the Matcher decides whether to question the Director for clues or to engage with the environment. The inner-thought cycle depicted here refers to the ReAct-based Matcher, which reflects each time before taking the next action.
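
The observe, think, act cycle described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the fact names (`visible`, `occluded`), the action strings (`ask_director`, `pick ...`), the template wording, and the rule-based `stub_policy` standing in for the LLM are all assumptions made for illustration. It only shows the general shape of the loop: facts are rendered through a fixed template, a thought is produced before each action, and the Matcher either asks the Director for a clue or acts on the environment.

```python
from typing import Callable, List, Tuple

Fact = Tuple[str, ...]

# Hypothetical fixed templates mapping PDDL-style ground facts to sentences.
TEMPLATES = {
    "visible": "You can see the {0}.",
    "occluded": "The Director cannot see the {0}.",
}


def render_observation(facts: List[Fact]) -> str:
    """Render each (predicate, arg, ...) fact with its template and join them."""
    return " ".join(TEMPLATES[name].format(*args) for name, *args in facts)


def stub_policy(observation: str, clues: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: returns a (thought, action) pair.

    With no clue yet it asks the Director; otherwise it picks the object
    named in the latest clue. A real Matcher would reason over the
    observation text instead of following this fixed rule.
    """
    if not clues:
        return ("The request is ambiguous, so I should ask.", "ask_director")
    return (f"The clue points to the {clues[-1]}.", f"pick {clues[-1]}")


def run_matcher(
    facts: List[Fact],
    director_answer: str,
    policy: Callable[[str, List[str]], Tuple[str, str]] = stub_policy,
    max_steps: int = 4,
) -> List[Tuple[str, str, str]]:
    """Run the observe -> think -> act loop and return the trace."""
    trace: List[Tuple[str, str, str]] = []
    clues: List[str] = []
    for _ in range(max_steps):
        observation = render_observation(facts)        # observe
        thought, action = policy(observation, clues)   # think, then choose
        trace.append((observation, thought, action))
        if action == "ask_director":
            clues.append(director_answer)              # Director replies
        else:
            break                                      # acted on environment
    return trace


if __name__ == "__main__":
    scene = [
        ("visible", "small ball"),
        ("visible", "large ball"),
        ("occluded", "small ball"),
    ]
    for obs, thought, action in run_matcher(scene, "large ball"):
        print(f"Obs: {obs}\nThought: {thought}\nAction: {action}\n")
```

Separating the policy from the loop is a design choice of this sketch: it would let a non-reflective baseline (action only) and a ReAct-style variant (thought before action) share the same environment and trace format.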