Closed Loop Interactive Embodied Reasoning for Robot Manipulation
Michal Nazarczuk, Jan Kristof Behrens, Karla Stepanova, Matej Hoffmann, Krystian Mikolajczyk
TL;DR
CLIER tackles the challenge of long-horizon robotic manipulation by integrating visual scene understanding with physical measurements in a closed-loop, neuro-symbolic framework. It combines a scene parser, a scene graph, a symbolic program generator, and a transformer-based action planner to iteratively select and execute primitive actions, updating the plan after each keyframe. By reasoning about non-visual properties such as weight and stiffness through physical interaction, CLIER demonstrates sim-to-real transfer on the SHOP-VRB2 and YCB-VRB benchmarks and shows robustness to environmental disturbances and manipulation failures. The work provides a modular approach that unifies perception, symbolic reasoning, and action execution in a fast feedback loop, enabling reliable long-horizon embodied reasoning for robotic manipulation.
Abstract
Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting with and changing the scene (e.g., sorting the objects from lightest to heaviest). To facilitate the development of such systems, we introduce a new modular Closed Loop Interactive Embodied Reasoning (CLIER) approach that takes into account the measurements of non-visual object properties, changes in the scene caused by external disturbances, and uncertain outcomes of robotic actions. CLIER performs multi-modal reasoning and action planning and generates a sequence of primitive actions that can be executed by a robot manipulator. Our method operates in a closed loop, responding to changes in the environment. Our approach is developed using the MuBle simulation environment and tested in 10 interactive benchmark scenarios. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks, achieving success rates above 76% and 64%, respectively.
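To make the closed-loop idea concrete, the following is a minimal toy sketch in Python of the "sort the objects from lightest to heaviest" example. It is not the CLIER implementation: the `ToyWorld` class, the `weigh` primitive, the `plan_next_action` planner, the object names, the weights, and the failure probability are all hypothetical and chosen only for illustration. The sketch shows a single pattern: select one primitive action from the current belief, execute it, and replan from the updated belief, so that a failed action is simply retried on the next iteration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyWorld:
    """Hypothetical stand-in for a simulator: hidden weights, unreliable actions."""
    weights: dict                      # object name -> true weight in kg (hidden from the planner)
    fail_prob: float = 0.5             # assumed chance that a primitive action fails
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def weigh(self, name):
        """Primitive action: lift and weigh an object; returns None on failure."""
        if self.rng.random() < self.fail_prob:
            return None
        return self.weights[name]


def plan_next_action(belief, objects):
    """Toy planner: weigh the first object whose weight is still unknown."""
    for name in objects:
        if name not in belief:
            return ("weigh", name)
    return ("done", None)


def closed_loop_sort(world, objects, max_steps=50):
    """Replan after every action until all weights are known, then sort."""
    belief = {}                        # measured non-visual properties so far
    for _ in range(max_steps):
        action, target = plan_next_action(belief, objects)
        if action == "done":
            return sorted(objects, key=belief.get)
        result = world.weigh(target)
        if result is not None:         # a failed action leaves the belief unchanged
            belief[target] = result
    raise RuntimeError("step budget exhausted before the task was completed")


world = ToyWorld(weights={"mug": 0.35, "sponge": 0.02, "pan": 0.90})
order = closed_loop_sort(world, ["mug", "sponge", "pan"])
print(order)
```

Because the planner is queried again after every action, no separate recovery logic is needed in this toy setting: an unmeasured object stays unmeasured in the belief, and the same action is proposed again. A real system would additionally re-parse the scene after each step to detect external disturbances, which this sketch omits.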
