SYNERGAI: Perception Alignment for Human-Robot Collaboration

Yixin Chen, Guoxi Zhang, Yaowei Zhang, Hongming Xu, Peiyuan Zhi, Qing Li, Siyuan Huang

TL;DR

SYNERGAI addresses the misalignment between human and robot perception in LLM-driven collaboration by using a 3D scene graph (3DSG) as an explicit, manipulable representation. The system reconstructs 3D scenes from posed images, builds a 3DSG, and uses an LLM to decompose tasks and select tools that operate on the 3DSG, enabling zero-shot 3D reasoning and interactive alignment with users. An automatic perceptual alignment mechanism updates the 3DSG online as users interact through a GUI, improving task success and transfer to novel tasks. The system achieves zero-shot 3D QA performance on ScanQA comparable to data-driven models; in experiments across ten real-world scenes, it reaches a 61.9% success rate on alignment tasks and raises the success rate on novel tasks from 3.7% to 45.68%. The results highlight the value of explicit structured representations for combining perception, reasoning, and human feedback in real-world robotics.

Abstract

Recently, large language models (LLMs) have shown strong potential in facilitating human-robot interaction and collaboration. However, existing LLM-based systems often overlook the misalignment between human and robot perceptions, which hinders effective communication and real-world robot deployment. To address this issue, we introduce SYNERGAI, a unified system designed to achieve both perceptual alignment and human-robot collaboration. At its core, SYNERGAI employs a 3D Scene Graph (3DSG) as its explicit and innate representation. This enables the system to leverage LLMs to break down complex tasks and, at intermediate steps, allocate appropriate tools that extract relevant information from the 3DSG, modify its structure, or generate responses. Importantly, SYNERGAI incorporates an automatic mechanism that corrects perceptual misalignment with users by updating its 3DSG through online interaction. SYNERGAI achieves performance comparable to data-driven models on ScanQA in a zero-shot manner. Through comprehensive experiments across 10 real-world scenes, SYNERGAI demonstrates its effectiveness in establishing common ground with humans, realizing a success rate of 61.9% in alignment tasks. It also significantly improves the success rate on novel tasks from 3.7% to 45.68% by transferring the knowledge acquired during alignment.
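To make the idea of an explicit, modifiable 3DSG more concrete, the following is a minimal Python sketch of a scene graph with object nodes, relation edges, and three small operations: look up an object, follow a relation, and relabel a node after user feedback. Every name here (`ObjectNode`, `SceneGraph`, `find`, `related_to`, `relabel`) and the corrected label "laptop" are illustrative assumptions, not the paper's actual data structures, tools, or data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectNode:
    """One detected object in the scene (hypothetical schema)."""
    node_id: int
    label: str
    color: str = ""


@dataclass
class SceneGraph:
    """Toy 3D scene graph: nodes plus (subject, relation, object) edges."""
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, node_id: int, label: str, color: str = "") -> None:
        self.nodes[node_id] = ObjectNode(node_id, label, color)

    def add_relation(self, subject_id: int, relation: str, object_id: int) -> None:
        self.edges.append((subject_id, relation, object_id))

    def find(self, label: str, color: Optional[str] = None) -> List[int]:
        """Return ids of nodes matching a label and, optionally, a color."""
        return [
            n.node_id
            for n in self.nodes.values()
            if n.label == label and (color is None or n.color == color)
        ]

    def related_to(self, relation: str, object_id: int) -> List[int]:
        """Return ids of subjects that stand in `relation` to `object_id`."""
        return [s for s, r, o in self.edges if r == relation and o == object_id]

    def relabel(self, node_id: int, new_label: str) -> None:
        """Apply a user correction by overwriting the stored label."""
        self.nodes[node_id].label = new_label


graph = SceneGraph()
graph.add_object(0, "box", color="blue")
graph.add_object(1, "book")  # assumed perception error for illustration
graph.add_relation(1, "on", 0)

# Resolve the reference "on the blue box", then correct the label.
box_id = graph.find("box", color="blue")[0]
target_id = graph.related_to("on", box_id)[0]
label_before = graph.nodes[target_id].label
graph.relabel(target_id, "laptop")  # hypothetical user-provided label
label_after = graph.nodes[target_id].label
```

Because the correction is written back into the graph, any later query that touches the same node sees the updated label, which is one plausible reading of how knowledge acquired during alignment can carry over to novel tasks.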

Paper Structure (10 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of SYNERGAI. Using a 3D scene graph (3DSG) as its representation, SYNERGAI decomposes complex tasks with LLMs and acts at intermediate steps through purpose-built tools. It interacts with humans through natural language and non-verbal mouse clicks, which make object references more precise, and it supports human-robot collaboration and perceptual alignment by automatically modifying the data stored in the 3DSG.
  • Figure 2: The design of SYNERGAI and an example interaction. SYNERGAI represents 3D scenes with 3DSGs and uses LLMs to respond to user inputs. The LLM is first prompted to generate a plan that decomposes the input task into sub-tasks to be solved sequentially. At each step, SYNERGAI selects a tool as its action based on the observation, which contains the results of the previous actions. In this example, the system correctly identifies the object that satisfies the relation "on the blue box" but mislabels it as a book, which is where the perceptual misalignment arises.
  • Figure 3: Qualitative results of 3D Reasoning Tasks.
  • Figure 4: Examples of human-robot alignment. Users solve the EASY task in fewer interaction steps than the HARD task, in which the user checks and corrects the label of the ironing board by clicking (the 2nd and 3rd user inputs). Novel tasks are designed so that knowledge gained during alignment is required to complete them.
  • Figure 5: Statistics of alignment experiments. (a) The success rate decreases for more complex tasks, which require more interaction steps to achieve alignment. (b) The trend of task success rate as the number of interaction steps increases. (c) The user interface affects users' ability to reference objects: alignment performance drops when mouse clicks are not used.
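
The plan-then-act process described in the Figure 2 caption can be sketched in Python as a loop that runs one tool per sub-task and passes each result on as the next observation. This is a simplified illustration under stated assumptions: `stub_planner` is a hard-coded stand-in for the LLM, the whole plan is fixed up front rather than choosing each tool from the latest observation, and the tool names (`find_object`, `get_related`, `respond`) and the toy scene are hypothetical rather than taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Toy scene (hypothetical): object id -> label, plus relation triples.
LABELS: Dict[int, str] = {0: "blue box", 1: "book"}
RELATIONS: List[Tuple[int, str, int]] = [(1, "on", 0)]


def find_object(observation: list, label: str) -> list:
    """Return ids of objects whose label matches; ignores the observation."""
    return [i for i, name in LABELS.items() if name == label]


def get_related(observation: list, relation: str) -> list:
    """Return subjects linked by `relation` to any id in the observation."""
    return [s for s, r, o in RELATIONS if r == relation and o in observation]


def respond(observation: list, template: str) -> str:
    """Turn the ids from the previous step into a text answer."""
    return template.format(", ".join(LABELS[i] for i in observation))


TOOLS: Dict[str, Callable] = {
    "find_object": find_object,
    "get_related": get_related,
    "respond": respond,
}


def stub_planner(task: str) -> List[Tuple[str, str]]:
    """Stand-in for the LLM: returns a fixed plan of (tool, argument) steps."""
    return [
        ("find_object", "blue box"),
        ("get_related", "on"),
        ("respond", "I see: {}"),
    ]


def run(task: str):
    """Execute the plan step by step, feeding each result forward."""
    observation: list = []
    trace = []
    for tool_name, argument in stub_planner(task):
        observation = TOOLS[tool_name](observation, argument)
        trace.append((tool_name, observation))
    return observation, trace


answer, trace = run("What is on the blue box?")
```

In this toy run the answer reports a book, mirroring the mislabeling in the Figure 2 example. The relation lookup succeeds, but the stored label is wrong, so the error sits in the representation rather than in the reasoning steps.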