Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models
Yixuan Huang, Jialin Yuan, Chanho Kim, Pupul Pradhan, Bryan Chen, Li Fuxin, Tucker Hermans
TL;DR
This work tackles long-horizon manipulation with occluded objects by introducing two explicit object-oriented memory architectures, DOOM and LOOM, which combine UVOS-based object tracking with transformer relational dynamics. The models encode memory as discrete object slots or latent tokens and predict future poses and inter-object relations, which supports planning when objects are occluded, newly appear, or reappear. On a large synthetic dataset and in real-world trials, explicit memory substantially outperforms an implicit-memory baseline in relational reasoning and planning success, particularly under distractor actions and occlusions. The framework is aimed at more reliable long-horizon robotic manipulation in realistic environments where objects may vanish and reappear during task execution.
Abstract
Robots need a memory of previously observed but currently occluded objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks, including reasoning with occluded objects, novel object appearance, and object reappearance. In extensive simulation and real-world experiments, we find that our approaches perform well across different numbers of objects and different numbers of distractor actions. Furthermore, we show that our approaches outperform an implicit memory baseline.
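
To make the idea of an explicit object-oriented memory concrete, the following is a minimal sketch, not the DOOM or LOOM architecture. It assumes a tracker that returns a stable integer ID per object at each step, and it stores only a 3D position per object. The names `ObjectSlot`, `ObjectMemory`, `update`, and `occluded_ids` are hypothetical, and there is no learned dynamics model here. The sketch only shows the bookkeeping for the three cases named in the abstract: occluded objects keep their last observed state, novel objects get a new slot, and reappearing objects refresh their existing slot.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]


@dataclass
class ObjectSlot:
    """Hypothetical per-object memory entry (illustration only)."""
    position: Position   # last observed position
    last_seen: int       # step index of the last observation
    visible: bool = True


@dataclass
class ObjectMemory:
    """Toy explicit memory keyed by tracker ID (an assumption, not the paper's model)."""
    slots: Dict[int, ObjectSlot] = field(default_factory=dict)
    step: int = 0

    def update(self, observations: Dict[int, Position]) -> None:
        """Merge one frame of tracked observations into memory."""
        self.step += 1
        # Everything is assumed occluded until observed in this frame.
        for slot in self.slots.values():
            slot.visible = False
        # Observed objects create a slot (novel) or refresh one (reappearing).
        for track_id, position in observations.items():
            self.slots[track_id] = ObjectSlot(position, self.step, True)

    def occluded_ids(self) -> List[int]:
        """IDs remembered from earlier frames but not visible now."""
        return sorted(i for i, s in self.slots.items() if not s.visible)


memory = ObjectMemory()
memory.update({1: (0.0, 0.0, 0.0), 2: (0.1, 0.0, 0.0)})
memory.update({1: (0.0, 0.0, 0.0), 3: (0.3, 0.0, 0.0)})  # 2 occluded, 3 novel
print(memory.occluded_ids())      # [2]
print(memory.slots[2].position)   # (0.1, 0.0, 0.0), retained while occluded
memory.update({1: (0.0, 0.0, 0.0), 2: (0.1, 0.2, 0.0), 3: (0.3, 0.0, 0.0)})
print(memory.occluded_ids())      # [], object 2 has reappeared
```

A planner built on such a memory can still reason about object 2 while it is out of view, which is the capability an implicit memory has to recover from raw observation history.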
