GOMA: Proactive Embodied Cooperative Communication via Goal-Oriented Mental Alignment

Lance Ying, Kunal Jha, Shivam Aarya, Joshua B. Tenenbaum, Antonio Torralba, Tianmin Shu

TL;DR

GOMA addresses the challenge of proactive verbal communication in human-robot collaboration under partial observability by modeling goal-relevant mental alignment as a planning problem. It formulates a two-level I-POMDP framework, uses goal inference via LLMs, and optimizes communication through KL-divergence proxy rewards with multimodal belief updates. The approach is evaluated in Overcooked and VirtualHome, showing superior coordination performance and more favorable human judgments compared to baselines, including LLM-based agents. The work demonstrates that grounding communication in task and social context can substantially enhance embodied cooperative performance and user perception, while outlining future directions toward broader belief representations and real-world robotics.

Abstract

Verbal communication plays a crucial role in human cooperation, particularly when the partners only have incomplete information about the task, environment, and each other's mental state. In this paper, we propose a novel cooperative communication framework, Goal-Oriented Mental Alignment (GOMA). GOMA formulates verbal communication as a planning problem that minimizes the misalignment between the parts of agents' mental states that are relevant to the goals. This approach enables an embodied assistant to reason about when and how to proactively initiate communication with humans verbally using natural language to help achieve better cooperation. We evaluate our approach against strong baselines in two challenging environments, Overcooked (a multiplayer game) and VirtualHome (a household simulator). Our experimental results demonstrate that large language models struggle with generating meaningful communication that is grounded in the social and physical context. In contrast, our approach can successfully generate concise verbal communication for the embodied assistant to effectively boost the performance of the cooperation as well as human users' perception of the assistant.
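As a rough illustration of the idea that communication should reduce goal-relevant misalignment, the sketch below scores candidate messages by how much they shrink the KL divergence between the assistant's and the human's beliefs about object locations. It is not the authors' implementation: the function names, the dictionary-based beliefs, the `trust` mixing update, and the fixed `cost` per message are all assumptions made for this example.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for discrete distributions stored as dicts over the same keys."""
    return sum(
        pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
        for k, pv in p.items()
        if pv > 0
    )

def belief_after_message(belief, location, trust=0.9):
    """Hypothetical listener update: shift `trust` of the mass onto the stated location."""
    return {
        loc: (1.0 - trust) * prob + (trust if loc == location else 0.0)
        for loc, prob in belief.items()
    }

def choose_message(robot_beliefs, human_beliefs, goal_objects, cost=0.1):
    """Return (object, location) for the most useful message, or None to stay silent.

    Only objects relevant to the inferred goal are considered; `cost` is an
    assumed fixed penalty for speaking.
    """
    best, best_gain = None, 0.0
    for obj in goal_objects:
        robot_b, human_b = robot_beliefs[obj], human_beliefs[obj]
        location = max(robot_b, key=robot_b.get)  # the assistant's best guess
        before = kl_divergence(robot_b, human_b)
        after = kl_divergence(robot_b, belief_after_message(human_b, location))
        gain = before - after - cost
        if gain > best_gain:
            best, best_gain = (obj, location), gain
    return best

# Toy beliefs: the assistant knows where things are; the human is unsure about the plate.
robot_beliefs = {
    "plate": {"coffee_table": 1.0, "fridge": 0.0, "cabinet": 0.0},
    "fork": {"drawer": 1.0, "cabinet": 0.0},
    "book": {"shelf": 1.0, "sofa": 0.0},
}
human_beliefs = {
    "plate": {"coffee_table": 0.2, "fridge": 0.5, "cabinet": 0.3},
    "fork": {"drawer": 0.95, "cabinet": 0.05},
    "book": {"shelf": 0.1, "sofa": 0.9},
}

message = choose_message(robot_beliefs, human_beliefs, goal_objects=["plate", "fork"])
silent = choose_message(robot_beliefs, human_beliefs, goal_objects=["fork"])
```

In this toy setup the book is ignored even though the beliefs about it disagree strongly, because it is not in the goal-relevant set, and the fork is not worth mentioning because the human's belief is already close to the assistant's.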

Paper Structure

This paper contains 23 sections, 5 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of cooperation with a shared mind or misaligned minds and communication optimized via goal-oriented mental alignment. (a) When human and robot minds are perfectly aligned (i.e., a shared mind), they share the same belief of the physical state and the same goal, which leads to the same joint plan shared by both agents, an ideal condition for human-robot cooperation. (b) However, in real-world tasks, human and robot minds are typically unaligned, leading to two different (often conflicting) joint plans in their minds. (c) To achieve a shared joint plan that optimizes cooperation, we optimize the robot's verbal communication to actively align the joint plans in both agents' minds.
  • Figure 2: Example Overcooked environment. In each environment, there are two rooms. The two agents are always in different rooms. An agent cannot observe the other room and has to rely on verbal communication to infer the states of the objects in the other room.
  • Figure 3: Experimental results in Overcooked and VirtualHome. The quantitative results from experiments (a, b, c) demonstrate that GOMA led to the greatest speedup (left) and the lowest plan cost (right) compared to the baselines. In human subjective ratings (d), participants rated GOMA as more helpful and as communicating more useful information than the other models.
  • Figure 4: Example of typical communication enabled by GOMA in VirtualHome. (a) Once the human (in the blue shirt) gives a command to the AI Assistant (in the orange shirt), it infers the human goal and reasons that the human needs 2 plates and 2 forks. (b) As the AI watches the human agent opening the fridge, GOMA informs the human that the plates are on the coffee table. Consequently, the human goes to the coffee table to pick up a plate.
  • Figure 5: Agents' trajectories with No-Comm (left) and with GOMA (right) in a VirtualHome environment. In this example, the team goal is to set up a table for 1 person. The AI Assistant needs to find a plate and a fork while the human is looking for a water glass. Both agents have knowledge about the items that the other agent is looking for but not their own goal objects. In the No-Comm setting, the agents cannot share knowledge and have to open many containers to search for goal items. By inferring the other agent's goal and communicating goal-relevant knowledge, GOMA drastically reduces the total number of steps taken to complete the task.