Simulated Mental Imagery for Robotic Task Planning
Shijia Li, Tomas Kulvicius, Minija Tamosiunaite, Florentin Wörgötter
TL;DR
The paper introduces Simulated Mental Imagery for Planning (SiMIP), a sub-symbolic planning framework that generates robot action plans by imagining future scenes without requiring explicit symbolic domain descriptions. By combining perception, object-level imagination, and success checks, SiMIP builds a planning tree from real-world scene data and then parses this visual plan into a symbolic, executable sequence, improving interpretability over end-to-end deep RL approaches. The authors validate SiMIP on a packing task using a real dataset with labeled objects and affordances, achieving a 90.92% plan-success rate and demonstrating robustness across plan lengths with a modular, data-efficient pipeline that includes object detection, affordance segmentation, and object completion via GANs. The work highlights practical implications for human-robot collaboration, explainability, and rapid adaptation to new tasks with limited labeled data, and outlines avenues for future robot-in-the-loop feedback and more advanced generative models.
Abstract
Traditional AI-planning methods for task planning in robotics require a symbolically encoded domain description. While powerful in well-defined scenarios and human-interpretable, such a description takes substantial effort to set up. In contrast, humans solve most everyday planning tasks intuitively, using mental imagery of the different planning steps. Here we suggest that the same approach can be used for robots in cases that require only limited execution accuracy. In the current study, we propose a novel sub-symbolic method called Simulated Mental Imagery for Planning (SiMIP), which consists of perception, simulated action, success checking, and re-planning performed on 'imagined' images. We show that it is possible to implement mental imagery-based planning in an algorithmically sound way by combining regular convolutional neural networks and generative adversarial networks. With this method, the robot acquires the capability to use the initially existing scene to generate action plans without symbolic domain descriptions, while the plans remain human-interpretable, unlike those of deep reinforcement learning, an alternative sub-symbolic approach. We create a dataset from real scenes for a packing problem in which different objects must be placed correctly into different target slots. This allows the efficiency and success rate of the algorithm to be quantified.
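To make the loop of perception, simulated action, success checking, and re-planning concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes that scenes are plain sets of (name, size) pairs rather than images, that a size comparison stands in for the learned success check, and that a set update stands in for the generative imagination step. All function names (perceive, success_check, imagine, plan) are hypothetical.

```python
from typing import Optional

# Toy stand-ins: a "scene" is (unplaced objects, free slots), each item a (name, size) pair.
# In the paper these roles are played by images and neural networks, not tuples.
Scene = tuple[frozenset, frozenset]
Action = tuple[str, str]  # (object name, slot name)


def perceive(objects: dict[str, int], slots: dict[str, int]) -> Scene:
    """Stand-in for perception: turn raw inputs into an object-level scene."""
    return frozenset(objects.items()), frozenset(slots.items())


def success_check(obj_size: int, slot_size: int) -> bool:
    """Stand-in for the learned success check: the object must fit the slot."""
    return obj_size <= slot_size


def imagine(scene: Scene, obj: tuple, slot: tuple) -> Scene:
    """Stand-in for the generative step: the scene after placing obj in slot."""
    unplaced, free = scene
    return unplaced - {obj}, free - {slot}


def plan(scene: Scene) -> Optional[list[Action]]:
    """Depth-first search over imagined scenes; backtracking acts as re-planning."""
    unplaced, free = scene
    if not unplaced:
        return []  # goal reached: everything is packed
    obj = min(unplaced)  # deterministic choice of the next object
    for slot in sorted(free):
        if not success_check(obj[1], slot[1]):
            continue  # imagined action fails, try another slot
        rest = plan(imagine(scene, obj, slot))
        if rest is not None:
            return [(obj[0], slot[0])] + rest
    return None  # dead end: the caller backtracks


if __name__ == "__main__":
    scene = perceive({"a": 1, "b": 3}, {"s1": 3, "s2": 1})
    print(plan(scene))
```

In this toy run the search first imagines placing object a in slot s1, finds that b then has no fitting slot, and backtracks to a different placement. The returned list of (object, slot) pairs plays the role of the symbolic, executable sequence that is read off the planning tree.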
