Simulated Mental Imagery for Robotic Task Planning

Shijia Li; Tomas Kulvicius; Minija Tamosiunaite; Florentin Wörgötter

TL;DR

The paper introduces Simulated Mental Imagery for Planning (SiMIP), a sub-symbolic planning framework that generates robot action plans by imagining future scenes without requiring explicit symbolic domain descriptions. By combining perception, object-level imagination, and success checks, SiMIP builds a planning tree from real-world scene data and then parses this visual plan into a symbolic, executable sequence, improving interpretability over end-to-end deep RL approaches. The authors validate SiMIP on a packing task using a real dataset with labeled objects and affordances, achieving a 90.92% plan-success rate and demonstrating robustness across plan lengths with a modular, data-efficient pipeline that includes object detection, affordance segmentation, and object completion via GANs. The work highlights practical implications for human-robot collaboration, explainability, and rapid adaptation to new tasks with limited labeled data, and outlines avenues for future robot-in-the-loop feedback and more advanced generative models.

Abstract

Traditional AI-planning methods for task planning in robotics require a symbolically encoded domain description. While powerful in well-defined scenarios, as well as human-interpretable, setting this up requires substantial effort. Different from this, most everyday planning tasks are solved by humans intuitively, using mental imagery of the different planning steps. Here we suggest that the same approach can be used for robots, too, in cases which require only limited execution accuracy. In the current study, we propose a novel sub-symbolic method called Simulated Mental Imagery for Planning (SiMIP), which consists of perception, simulated action, success-checking and re-planning performed on 'imagined' images. We show that it is possible to implement mental imagery-based planning in an algorithmically sound way by combining regular convolutional neural networks and generative adversarial networks. With this method, the robot acquires the capability to use the initially existing scene to generate action plans without symbolic domain descriptions, while at the same time plans remain human-interpretable, different from deep reinforcement learning, which is an alternative sub-symbolic approach. We create a dataset from real scenes for a packing problem of having to correctly place different objects into different target slots. This way efficiency and success rate of this algorithm could be quantified.
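The loop the abstract describes (simulate an action on an imagined scene, check success, re-plan on failure) can be sketched as a small tree search. This is a toy illustration, not the authors' implementation. Scenes are reduced to dictionaries instead of images, the shape-matching `is_valid` check stands in for the affordance-based validation, and depth-first expansion is an assumption. All names (`plan`, `is_valid`, the object and slot labels) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Action = Tuple[str, str]  # (object name, slot name), a symbolic plan step


def is_valid(obj_shape: str, slot_shape: str) -> bool:
    # Hypothetical stand-in for the success check on an imagined scene:
    # here an object fits a slot only if the shapes match.
    return obj_shape == slot_shape


def plan(outside: Dict[str, str],
         free_slots: Dict[str, str]) -> Optional[List[Action]]:
    """Depth-first search over imagined scenes.

    `outside` maps objects still outside the box to their shape,
    `free_slots` maps empty slots to the shape they accept.
    Returns a list of (object, slot) steps, or None if no plan exists.
    """
    if not outside:  # goal: no object left outside the box
        return []
    for obj, shape in outside.items():
        for slot, slot_shape in free_slots.items():
            if not is_valid(shape, slot_shape):
                continue  # failed check: prune this branch
            # "Imagine" the scene after placing obj into slot.
            next_outside = {k: v for k, v in outside.items() if k != obj}
            next_slots = {k: v for k, v in free_slots.items() if k != slot}
            rest = plan(next_outside, next_slots)
            if rest is not None:
                return [(obj, slot)] + rest
    return None  # every branch failed: the caller backtracks (re-plans)
```

Calling `plan({"cube": "square", "ball": "round"}, {"s1": "round", "s2": "square"})` yields a two-step symbolic sequence. An unsatisfiable scene returns `None`, which corresponds to a branch of the tree with no valid continuation.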

Paper Structure (22 sections, 8 figures, 5 tables)

This paper contains 22 sections, 8 figures, 5 tables.

Introduction
RELATED WORK
Symbolic Planning
Simulation
Sub-Symbolic Planning Using Neural Networks
Affordance recognition
Neuro-symbolic representations
OVERVIEW
IMPLEMENTATION
Data set
Network implementation details
Pose estimation
Applying the action
Action validation
Formation of a planning tree
...and 7 more sections

Figures (8)

Figure 1: Task definition. The table top has to be ordered by putting all objects in the given box. The target is to leave no objects outside the box. Note, the "target scene" here is presented only for illustration purposes, as all other configurations, where there is no object left outside the box, would be considered valid, too.
Figure 2: Flow diagram of our approach. Our system contains two main parts: scene understanding and action planning. For scene understanding we use three deep networks: a) Object detection, b) Affordance & Semantic segmentation, and c) Object completion. The details of the training and inference process can be seen in Fig. 3. Through the scene understanding part we can get the complete shape of the background and each individual object and its affordance class. Then, we can apply actions such as move and rotate to the object and use the information obtained from the affordance map to check whether the action is valid or not. If it is valid, we can perform the next action.
Figure 3: Training and inference of our model. In training: a) Object detection, b) Instance & Affordance segmentation, c) Object completion (de-occlusion). In the training phase, we train the three models individually and then combine the obtained results in the inference phase. Note that after finishing the object completion (c, above), we need to do affordance segmentation (b, above) again, to get the complete object corresponding to the affordance classes (see red arrows). Bbox = bounding box. Details are explained in the subsection "Network implementation details".
Figure 4: Demonstration of a planning tree. Each column represents an action step, the branches represent possible actions and each action is based on an imagined scene, where the previous action had been completed. The red dashed boxes mark the scenes indicating the valid planning sequence and are numbered consecutively (these numbers are used in the paper's planning algorithm). Red circles indicate the objects on which the action is applied. The green pointer indicates where the object marked by the circle in the previous image has been placed.
Figure 5: Example of pose mapping. We create a dictionary to store the horizontal and vertical pose of the blue cuboid. When we apply a flipping action to this object, we can look up the dictionary and retrieve the corresponding pose.
...and 3 more figures
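
The pose dictionary mentioned in the Figure 5 caption can be pictured as a nested lookup table. The sketch below is an assumption about its shape, not the authors' data structure: the keys, the placeholder mask names, and the `flip` helper are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical pose dictionary: object name -> pose name -> stored appearance.
# In a real system the values would be image masks; strings stand in here.
POSES: Dict[str, Dict[str, str]] = {
    "blue_cuboid": {
        "horizontal": "blue_cuboid_horizontal_mask",
        "vertical": "blue_cuboid_vertical_mask",
    },
}


def flip(obj: str, current_pose: str) -> Tuple[str, str]:
    """Return the pose name and stored appearance after a flipping action."""
    new_pose = "vertical" if current_pose == "horizontal" else "horizontal"
    return new_pose, POSES[obj][new_pose]
```

For example, `flip("blue_cuboid", "horizontal")` looks up the stored vertical pose, so the imagined scene can be redrawn without estimating the flipped appearance from scratch.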
