Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models

Yixuan Huang, Jialin Yuan, Chanho Kim, Pupul Pradhan, Bryan Chen, Li Fuxin, Tucker Hermans

TL;DR

This work tackles long-horizon manipulation with occluded objects by introducing two explicit object-oriented memory architectures, DOOM and LOOM, that integrate UVOS-based tracking with transformer-relational dynamics. By encoding memory as discrete object slots or latent tokens and predicting future poses and inter-object relations, the approach enables planning across occluded, novel, and reappearing objects. Empirical results on a large synthetic dataset and real-world trials show that explicit memory substantially outperforms an implicit baseline in relational reasoning and planning success, particularly under distractor actions and occlusions. The proposed framework advances robust, long-horizon robotic manipulation, enabling more reliable autonomy in realistic environments where objects may vanish and reappear during task execution.

Abstract

Robots need a memory of previously observed but currently occluded objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks, including reasoning with occluded objects, novel object appearance, and object reappearance. Throughout our extensive simulation and real-world experiments, we find that our approaches perform well across different numbers of objects and different numbers of distractor actions. Furthermore, we show that our approaches outperform an implicit memory baseline.

Paper Structure (21 sections, 1 equation, 8 figures, 2 tables)

Figures (8)

  • Figure 2: Overview of our approaches. As the robot takes action over time, some objects may disappear and reappear, and some objects may newly appear in the scene. In this paper, we propose two types of object-oriented memory, called DOOM and LOOM, that enable the robot to plan with occluded and newly appeared objects. DOOM and LOOM utilize a UVOS algorithm to keep track of the current object list and update object memory slots based on the occlusion status of each object accordingly. (Best viewed in color.)
  • Figure 3: Two examples from our training dataset (top row) and one testing example (bottom row). We train with a maximum of 5 segments, including objects and the environment. The testing example has 8 segments with different shapes and a novel viewpoint. In the history, the robot pushes the mug below the shelf, then picks and places two apples inside the bowl. During planning, the robot picks and places the orange to achieve the goal relation based on the current observation and history. Left/Right are defined from the robot's viewpoint.
  • Figure 4: Execution success rate as a function of (left) the number of objects in the scene, and (right) the number of distractor actions. We find that our approaches perform consistently well across both conditions, outperforming the baseline by a large margin. The gap is especially pronounced with respect to the number of distractor actions. The legend applies to both plots.
  • Figure 5: Two failure cases of the baseline in which our approach achieves the goal relations. In the first example, after the robot pushes the red box off the table, two novel objects appear in the current observation. The goal is to remove all the objects on the table. DOOM achieves the goal by pushing the red object, while the baseline fails because it pushes the wrong (black) object. In the second example, after the robot pushes the mug, the mug is occluded by the shelf. To achieve the goal, DOOM picks and places the orange, while the baseline picks and places the wrong object (an apple).
  • Figure 6: Visualization of objects in the training set for UVOS model.
  • ...and 3 more figures
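
The Figure 2 caption describes object memory slots that are updated according to each object's occlusion status, given the object list from a tracker. The following is a minimal sketch of that bookkeeping only, not the authors' implementation: the names `Slot`, `ObjectMemory`, and `update` are hypothetical, the tracker is assumed to supply stable integer IDs, and each slot stores a raw 3D position where the paper's models would hold learned representations.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Position = Tuple[float, float, float]


@dataclass
class Slot:
    position: Position      # last observed position (assumed slot content)
    occluded: bool = False  # True while the tracker does not report the object


@dataclass
class ObjectMemory:
    # Hypothetical container: one slot per tracker ID.
    slots: Dict[int, Slot] = field(default_factory=dict)

    def update(self, visible: Dict[int, Position]) -> None:
        # Visible objects: create a slot (novel object) or refresh it
        # (still visible, or reappearing after occlusion).
        for track_id, position in visible.items():
            self.slots[track_id] = Slot(position=position, occluded=False)
        # Objects in memory but absent from this frame keep their last
        # observed position and are flagged as occluded.
        for track_id, slot in self.slots.items():
            if track_id not in visible:
                slot.occluded = True


memory = ObjectMemory()
memory.update({0: (0.1, 0.0, 0.0), 1: (0.4, 0.2, 0.0)})  # two objects visible
memory.update({1: (0.4, 0.2, 0.0), 2: (0.7, 0.1, 0.0)})  # object 0 hidden, object 2 is new
memory.update({0: (0.1, 0.3, 0.0), 1: (0.4, 0.2, 0.0), 2: (0.7, 0.1, 0.0)})  # object 0 reappears
```

Keeping the last observed state for an occluded object, rather than dropping it, is what lets a downstream planner still reason about that object while it is out of view.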