Closed Loop Interactive Embodied Reasoning for Robot Manipulation

Michal Nazarczuk, Jan Kristof Behrens, Karla Stepanova, Matej Hoffmann, Krystian Mikolajczyk

TL;DR

CLIER tackles the challenge of long-horizon robotic manipulation by integrating visual scene understanding with physical measurements in a closed-loop, neuro-symbolic framework. It combines a scene parser, scene graph, symbolic program generator, and a transformer-based action planner to iteratively select and execute primitive actions, updating plans after each keyframe. By reasoning about non-visual properties such as weight and stiffness through physical interactions, CLIER demonstrates sim-to-real transfer on SHOP-VRB2 and YCB-VRB benchmarks and shows robustness to environmental disturbances and manipulation failures. The work provides a modular approach that unifies perception, symbolic reasoning, and action execution in a fast feedback loop, enabling reliable long-horizon embodied reasoning for robotic manipulation.
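The closed-loop idea described above can be illustrated with a toy sketch. This is not the authors' implementation: `ToyEnvironment`, `update_scene_graph`, `plan_next_action`, and `run_closed_loop` are hypothetical names, the learned scene parser and transformer planner are replaced by hand-written rules, and the task (measure every object's weight, then order the objects) is chosen only for illustration. The point is the loop structure: re-observe, update the scene graph, choose one primitive action, execute it, and repeat, so that a failed action is simply re-planned on the next iteration.

```python
import random
from typing import Optional


class ToyEnvironment:
    """Hypothetical stand-in for a simulator in which actions may fail."""

    def __init__(self, weights, failure_rate=0.3, seed=0):
        self._weights = dict(weights)      # ground truth, hidden from the planner
        self._held: Optional[str] = None   # name of the object in the gripper
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)

    def execute(self, action, target):
        """Attempt a primitive action; on failure the scene is left unchanged."""
        if self._rng.random() < self._failure_rate:
            return False
        if action == "pick_up" and self._held is None:
            self._held = target
        elif action == "put_down" and self._held == target:
            self._held = None
        return True

    def observe(self):
        """Report what is held and, only for a held object, its weight."""
        weight = self._weights[self._held] if self._held is not None else None
        return {"held": self._held, "weight": weight}


def update_scene_graph(graph, observation):
    """Fold the latest observation into the scene graph."""
    for name, node in graph.items():
        node["held"] = name == observation["held"]
        if node["held"]:
            node["weight"] = observation["weight"]


def plan_next_action(graph):
    """Pick one primitive action from the current graph, or None when done."""
    for name, node in graph.items():
        if node["held"]:
            return ("put_down", name)   # its weight is already recorded
    for name, node in graph.items():
        if node["weight"] is None:
            return ("pick_up", name)    # weight needs a physical measurement
    return None


def run_closed_loop(env, object_names, max_steps=100):
    """Measure all weights in a closed loop and return names, lightest first."""
    graph = {name: {"held": False, "weight": None} for name in object_names}
    for _ in range(max_steps):
        update_scene_graph(graph, env.observe())   # re-perceive every step
        action = plan_next_action(graph)           # re-plan from current state
        if action is None:
            return sorted(graph, key=lambda n: graph[n]["weight"])
        env.execute(*action)                       # outcome is checked next step
    raise RuntimeError("step budget exhausted before all weights were measured")


if __name__ == "__main__":
    env = ToyEnvironment({"mug": 0.35, "can": 0.2, "sponge": 0.05})
    print(run_closed_loop(env, ["mug", "can", "sponge"]))
```

Because the planner never assumes that the previous action succeeded, no separate failure-recovery branch is needed in this sketch: a slipped grasp just leaves the graph unchanged, and the same action is proposed again.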

Abstract

Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting with and changing the scene (e.g. sorting the objects from lightest to heaviest). To facilitate the development of such systems, we introduce a new modular Closed Loop Interactive Embodied Reasoning (CLIER) approach that takes into account measurements of non-visual object properties, changes in the scene caused by external disturbances, and the uncertain outcomes of robotic actions. CLIER performs multi-modal reasoning and action planning and generates a sequence of primitive actions that can be executed by a robot manipulator. Our method operates in a closed loop, responding to changes in the environment. Our approach is developed using the MuBlE simulation environment and tested in 10 interactive benchmark scenarios. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks, with success rates above 76% and 64%, respectively.

Paper Structure

This paper contains 20 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: A diagram of CLIER reasoning implemented in the MuBlE environment, including the interaction between the two and the use of the SHOP-VRB2 benchmark. Transferred data: $\mathtt{T}$ -- text of the query, $\mathtt{\tilde{G}}$ -- prediction of scene graph elements, $\mathtt{G}$ -- current scene graph, $\mathtt{S}$ -- subgoal (symbolic program requiring physical measurements), $\mathtt{I}$ -- image, $\mathtt{P}$ -- physical observations, $\mathtt{C}$ -- control signal, $\mathtt{A}$ -- primitive action to take, $\mathtt{R}$ -- returned result.
  • Figure 2: An example long-horizon manipulation task from SHOP-VRB2 implemented within the MuBlE environment. A synthetic scene is rendered in Blender at every keyframe, followed by the execution of the symbolic manipulation action planned by CLIER.
  • Figure 3: (Left) Examples of simulated visual observations (selected frames) generated for the actions corresponding to the instruction: "Stack the lightest of metal objects on the yellow object." Note that the metal cans were first picked up to measure their weight. Then, based on the measurements taken, the heavier can was put down and the lighter one was picked up and stacked on the yellow plate. (Right) Example simulated scenes and corresponding natural language instructions generated with MuBlE (in the dataset, the instructions, left to right, belong to tasks 7, 3, and 1 in Tab. \ref{tab:templates}).
  • Figure 4: The CLIER pipeline demonstrated on an example task from the SHOP-VRB2 dataset. Blue denotes a query and the extracted symbolic programs, the output targets of the programs are shown in green, and yellow denotes the execution of a sequence of primitive actions after a subgoal has been received from a program (blue). Note that the actions in yellow result in updates of the scene graph depicted in the top middle.
  • Figure 5: The real setup with YCB objects (left), corresponding MuJoCo simulation using estimated poses (middle), and RViz visualisation of colored pointcloud with overlaid gray models detected by CosyPose (right).
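The instruction in Figure 3 ("Stack the lightest of metal objects on the yellow object") shows why a symbolic program may have to pause for a physical measurement, which is the role of the subgoal $\mathtt{S}$ in Figure 1. The sketch below is a simplified illustration under assumed names: `filter_attr`, `resolve_lightest`, and `plan_stack_lightest_metal_on_yellow`, as well as the dictionary-based scene graph and the subgoal format, are hypothetical and are not taken from the paper. It only shows the control flow: visual attributes are filtered directly, while an unknown weight turns into a measurement subgoal before the final stacking step can be issued.

```python
def filter_attr(graph, names, key, value):
    """Keep the objects whose (visually parsed) attribute matches the value."""
    return [n for n in names if graph[n].get(key) == value]


def resolve_lightest(graph, candidates):
    """Return the lightest candidate, or a measurement subgoal if weights are missing."""
    if not candidates:
        raise ValueError("no candidate objects to compare")
    unmeasured = [n for n in candidates if graph[n].get("weight") is None]
    if unmeasured:
        return {"subgoal": "measure_weight", "targets": unmeasured}
    return {"result": min(candidates, key=lambda n: graph[n]["weight"])}


def plan_stack_lightest_metal_on_yellow(graph):
    """Hand-written program for the Figure 3 instruction (illustrative only)."""
    metal = filter_attr(graph, list(graph), "material", "metal")
    yellow = filter_attr(graph, list(graph), "color", "yellow")
    if not yellow:
        raise ValueError("no yellow object in the scene graph")
    lightest = resolve_lightest(graph, metal)
    if "subgoal" in lightest:
        return lightest                     # weights must be measured first
    return {"subgoal": "stack", "targets": [lightest["result"], yellow[0]]}


if __name__ == "__main__":
    scene = {
        "can_a": {"material": "metal", "color": "gray", "weight": None},
        "can_b": {"material": "metal", "color": "gray", "weight": None},
        "plate": {"material": "plastic", "color": "yellow", "weight": None},
    }
    print(plan_stack_lightest_metal_on_yellow(scene))   # asks for measurements
    scene["can_a"]["weight"] = 0.40                     # filled in after pick-up
    scene["can_b"]["weight"] = 0.25
    print(plan_stack_lightest_metal_on_yellow(scene))   # now issues the stack step
```

Re-running the same program after each scene graph update keeps the reasoning stateless, which fits a closed-loop setting where the graph may change between calls.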