DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control

Xinyu Xu; Shengcheng Luo; Yanchao Yang; Yong-Lu Li; Cewu Lu

TL;DR

DISCO tackles embodied navigation and interaction from high-level directives by introducing differentiable scene semantics and a dual-level coarse-to-fine control scheme. It combines on-the-fly scene learning with global map-based planning and local neural refinement to efficiently reach and manipulate target objects in long-horizon tasks. On ALFRED, DISCO achieves state-of-the-art performance, particularly in unseen environments, and remains effective without step-by-step instructions, demonstrating data efficiency and robust generalization. The work advances practical embodied AI by enabling scalable, instruction-guided mobile manipulation, with an open-source implementation.

Abstract

Building a general-purpose intelligent home-assistant agent that can carry out diverse tasks from human commands is a long-term goal of embodied AI research, and it places requirements on task planning, environment modeling, and object interaction. In this work, we study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair. We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls. In particular, DISCO incorporates differentiable scene representations with rich object and affordance semantics, which are learned dynamically on the fly and facilitate navigation planning. In addition, we propose dual-level coarse-to-fine action controls that leverage both global and local cues to accomplish mobile manipulation tasks efficiently. DISCO integrates easily into embodied tasks such as embodied instruction following. To validate our approach, we use the ALFRED benchmark of large-scale, long-horizon vision-language navigation and interaction tasks as a test bed. In extensive experiments, we demonstrate that DISCO outperforms the state of the art by a sizable +8.6% success rate margin in unseen scenes, even without step-by-step instructions. Our code is publicly released at https://github.com/AllenXuuu/DISCO.
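The dual-level control idea in the abstract can be pictured with a toy example. The sketch below is an illustration under stated assumptions, not the released DISCO code: the coarse level is approximated by breadth-first search over a 2D navigable mask, the hand-off rule is a hypothetical Manhattan-distance radius, and `stub_fine_policy` is a placeholder for the neural policy that the paper runs on the local visual frame. All function and parameter names are invented for illustration.

```python
from collections import deque


def coarse_path(navigable, start, goal):
    """Shortest 4-connected path on a 2D navigable mask (assumed coarse planner)."""
    rows, cols = len(navigable), len(navigable[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            # Walk the parent links back to the start and reverse them.
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and navigable[nr][nc] and nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None  # goal unreachable on this map


def stub_fine_policy(pos, target):
    """Placeholder for a learned local policy; always returns one action."""
    return "interact"


def run_episode(navigable, start, target, fine_policy, handoff_radius=1):
    """Follow the coarse path, then hand off to the fine policy near the target."""
    trace = []
    path = coarse_path(navigable, start, target)
    if path is None:
        return trace
    pos = start
    for cell in path[1:]:
        # Hypothetical hand-off rule: stop coarse motion once close enough.
        if abs(pos[0] - target[0]) + abs(pos[1] - target[1]) <= handoff_radius:
            break
        pos = cell
        trace.append(("coarse", cell))
    trace.append(("fine", fine_policy(pos, target)))
    return trace


# 1 = navigable, 0 = blocked (toy map, not from ALFRED).
toy_map = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
episode = run_episode(toy_map, (0, 0), (2, 3), stub_fine_policy)
```

Here `episode` holds the coarse steps taken on the map followed by a single fine action. In a real agent the fine level would loop over observations until the interaction succeeds; the single call here only marks where that hand-off would occur.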

Paper Structure (30 sections, 2 equations, 9 figures, 7 tables)

Introduction
Related Works
Approach
Perception
Learning Scene Representations
Coarse-to-Fine Action Control
Application: Embodied Instruction Following
Experiments
Evaluation Protocols
Baselines
Quantitative Comparisons
Ablation Study
Qualitative Results
Conclusion
Limitations
...and 15 more sections

Figures (9)

Figure 1: An example of a vision-language navigation and interaction task in ALFRED. An agent is given a goal directive and step-by-step instructions to perform mobile manipulation of multiple subgoals. Our work can omit step-by-step instructions.
Figure 2: The perception foundation. (i) 1st column: egocentric RGB frames as the initial observation. (ii) 2nd column: depth estimations. (iii) 3rd column: object instance segmentations. (iv) 4th-6th columns: affordance mask predictions: the navigable mask and two interactable masks (namely pickable and openable) as references.
Figure 3: An overview of the DISCO framework. Starting from the egocentric RGB frame, our perception system predicts pixel-wise depth, instance segmentation, and affordance frames. They are converted into semantic point clouds via projection and localized in the scene. We build differentiable scene representations with semantic queries to model the scene. They are optimized using gradient descent to match localized point cloud semantics. We apply dual-level coarse-to-fine controls. The coarse control depends on the global semantic map to approach the localized target. The fine control leverages a neural policy based on the local visual frame to interact.
Figure 4: The design of fine action control. DISCO employs a neural policy network to predict fine action steps. RGB, depth, and the object mask are sent to the network to derive a feature, followed by an object-specific classifier to predict the action. The policy is trained by mimicking expert actions.
Figure 5: Qualitative case of affordance. Left: The agent fails to put the bowl into the microwave without openable knowledge. Right: DISCO is aware of the openable affordance property in microwave interaction.
...and 4 more figures
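
The Figure 3 caption says the scene representation is optimized with gradient descent to match localized point cloud semantics. A minimal sketch of that kind of fitting is given below, under loud assumptions: the representation is simplified to one logit vector per 2D map cell rather than the paper's semantic queries, the loss is a plain softmax cross-entropy, and the cell size, learning rate, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fit_semantic_grid(points, grid_size, num_classes, cell=0.25, lr=0.5, steps=200):
    """Fit per-cell class logits to labeled points by gradient descent.

    points: iterable of (x, y, class_id), assumed already localized in the
    scene frame, with x and y in metres.
    """
    logits = [[[0.0] * num_classes for _ in range(grid_size)]
              for _ in range(grid_size)]
    # Bin each point into a map cell; drop points that fall outside the grid.
    observations = []
    for x, y, label in points:
        col, row = int(x // cell), int(y // cell)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            observations.append((row, col, label))
    for _ in range(steps):
        # Gradient of the mean cross-entropy w.r.t. logits is (probs - onehot).
        grads = {}
        for row, col, label in observations:
            probs = softmax(logits[row][col])
            grad = grads.setdefault((row, col), [0.0] * num_classes)
            for k in range(num_classes):
                target = 1.0 if k == label else 0.0
                grad[k] += (probs[k] - target) / len(observations)
        for (row, col), grad in grads.items():
            for k in range(num_classes):
                logits[row][col][k] -= lr * grad[k]
    return logits


def predict(logits, row, col):
    """Most likely class index for one map cell."""
    cell_logits = logits[row][col]
    return cell_logits.index(max(cell_logits))


# Toy labeled points (x, y, class_id); the values are made up.
toy_points = [(0.1, 0.1, 1), (0.2, 0.1, 1), (0.6, 0.6, 2)]
grid = fit_semantic_grid(toy_points, grid_size=4, num_classes=3)
```

After fitting, `predict(grid, 0, 0)` recovers the class observed in that cell, while cells with no observations keep their initial logits. Because the loss is differentiable in the map parameters, new observations can be folded in by running more update steps, which is the property the caption relies on.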
