JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

Kaizhi Zheng; Kaiwen Zhou; Jing Gu; Yue Fan; Jialu Wang; Zonglin Di; Xuehai He; Xin Eric Wang

TL;DR

JARVIS addresses the challenge of dialog-based embodied task execution by integrating a neural language-planning module with a semantic perception system and a symbolic commonsense reasoning component. The framework converts free-form dialogue and egocentric visual input into symbolic sub-goals and semantic maps, then uses task- and action-level commonsense to validate and execute actions, with a Goal Transformer as a fallback. Evaluations on TEACh show state-of-the-art results across EDH, TfD, and TATC, along with strong few-shot generalization and insightful ablations confirming the value of symbolic reasoning. The work advances interpretable, modular approaches to robust, real-world embodied agents capable of following complex human guidance.

Abstract

Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To benefit from both worlds, we propose JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then the symbolic module reasons for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1% to 15.8%). Moreover, we systematically analyze the essential factors that affect the task performance and also demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.

Paper Structure (26 sections, 7 equations, 7 figures, 8 tables, 3 algorithms)

Introduction
Related Work
Neuro-Symbolic Conversational Embodied Agents
Problem Formulation
Proposed Methods
Language Understanding and Planning
Semantic World Representation
Action Execution via Symbolic Commonsense Reasoning
Experiments
Dataset and Tasks
Experimental Setup
Main Results and Analysis
Few-Shot Learning
Unit Test of Individual Module
Conclusion
...and 11 more sections

Figures (7)

Figure 1: Dialogue-based embodied navigation and task completion. The Commander (often a human) issues a task such as making a sandwich, and the Follower agent completes the task while communicating with the Commander. Unlike the agent in fine-grained instruction following tasks, the Follower agent needs to extract sub-goals from the free-form dialogue and execute actions in the visual environment. Note that the Follower agent can only navigate and interact with objects in an egocentric view and has no access to the map or other oracle information.
Figure 2: An overview of our JARVIS framework. The fine-tuned language planning model takes dialogue and previous sub-goals $G_{0:t-1}$ as input and produces the future sub-goals $G_{t:T}$ (Section \ref{sec:sub-goal planning}). $G_{t:T}$ will be further examined by our Task-Level Common Sense model and converted to more reasonable and detailed future sub-goals $G_{t:T}^{\prime}$. Meanwhile, the Visual Semantic module actively updates the semantic world representations (Section \ref{sec:semantic}). If the object is found in the world representation, the next action is determined by the Fast Marching method. If not, the Goal Transformer will generate the next action. The next action $a_{t+1} \in \mathcal{A}$ will be post-processed by the Action-Level Common Sense model.
Figure 3: EDH example. (a) shows an example of our JARVIS on the EDH task, where the inputs are the dialog history and the sub-goal history (converted from the action history input). The inputs are first interpreted by the Language Parsing Module to become sub-goals. Then, our Symbolic Reasoning Module generates action predictions. The predicted actions change the follower's egocentric views, and the semantic map is built up and completed gradually. The snapshots show the agent opening the fridge, then having placed the knife and navigating back to the fridge, and finally picking up the potato. (b) is an example of a typical way in which the Episodic Transformer (E.T.) fails on the EDH task. In this case, the E.T. model repeatedly predicts "Forward" even when facing the wall, and is therefore stuck at its current position.
Figure 4: Successful TfD example from our JARVIS framework. According to the dialog, the language planner estimates four future sub-goals: ("Navigate cloth", "PickUp Cloth", "Navigate Bathtub", "Place Bathtub"). Then, with the symbolic reasoning module, interaction and navigation actions are predicted. The snapshots show the agent finding the cloth, then having picked up the cloth and found the bathtub, and finally correctly putting the cloth on the bathtub.
Figure 5: TATC "Water the plant" sequence. At each step the commander calls SearchObject to get an optimal navigation pose (e.g., Navigate to Faucet @ (3, 5.25, 270°)), then uses ground-truth segmentation to compute an interaction target (e.g., Pickup Mug @ (0.7, 0.37)). Steps 1–6 repeat for Faucet, Sink, Mug, Sink, Faucet, and Mug; step 7 pours water on the Plant @ (0.55, 0.6). After each sub-goal the follower executes in its egocentric view and reports completion.
...and 2 more figures
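
The control flow described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the "navigate before interacting" rule, the "blocked ahead" check, and the "Turn Left" replacement action are all hypothetical stand-ins chosen for illustration, and the path planner and Goal Transformer are passed in as plain callables rather than real models.

```python
from typing import Callable, Dict, List, Tuple

SubGoal = Tuple[str, str]  # (action, object), e.g. ("PickUp", "Cloth")
Pose = Tuple[int, int]     # hypothetical 2D map cell of a known object


def refine_subgoals(subgoals: List[SubGoal]) -> List[SubGoal]:
    """Hypothetical task-level rule: navigate to an object before
    interacting with it, inserting a Navigate sub-goal when missing."""
    refined: List[SubGoal] = []
    for action, obj in subgoals:
        is_interaction = action != "Navigate"
        # True when the previous sub-goal already targets the same object.
        same_target = bool(refined) and refined[-1][1].lower() == obj.lower()
        if is_interaction and not same_target:
            refined.append(("Navigate", obj))
        refined.append((action, obj))
    return refined


def select_action(
    target: str,
    semantic_map: Dict[str, Pose],
    plan_step: Callable[[Pose], str],        # stand-in for Fast Marching
    goal_transformer: Callable[[str], str],  # stand-in for the fallback model
    blocked_ahead: bool,
) -> str:
    """Pick the next action following the branching in the Figure 2 caption."""
    if target in semantic_map:
        # Object already located: plan toward its known position.
        action = plan_step(semantic_map[target])
    else:
        # Object not yet observed: fall back to the learned policy.
        action = goal_transformer(target)
    # Hypothetical action-level rule: never walk into an obstacle.
    if action == "Forward" and blocked_ahead:
        action = "Turn Left"
    return action


if __name__ == "__main__":
    print(refine_subgoals([("PickUp", "Cloth"), ("Place", "Bathtub")]))
    print(select_action("Fridge", {"Fridge": (3, 5)},
                        lambda pose: "Forward", lambda t: "Turn Right", True))
```

The post-processing step is where a failure like the one in Figure 3(b), repeated "Forward" predictions while facing a wall, would be caught in this sketch, since the symbolic check overrides the proposed action regardless of which branch produced it.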
