Occlusion resistant learning of intuitive physics from videos

Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

TL;DR

This work introduces an occlusion-resistant framework for intuitive physics that jointly learns object-centered dynamics and a differentiable renderer. By modeling object states as latent variables and decoupling physics from rendering, the method can reason through occlusions and predict long-horizon object trajectories, demonstrated on the IntPhys benchmark and synthetic/pseudo-real datasets. The key contributions include the Compositional Rendering Network, the Recurrent Interaction Network with uncertainty, and a differentiable event-decoding objective that yields plausible scene interpretations without requiring ground-truth inter-frame correspondences. The results show improved performance under occlusion, robust trajectory prediction, and some generalization to real scenes, highlighting the approach's potential for robust, scene-level physical reasoning in vision systems.

Abstract

To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects and to predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences. Yet most of these methods are restricted to the case where no, or only limited, occlusions occur. In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions. In our formulation, object positions are modeled as latent variables, enabling the reconstruction of the scene. We then propose a series of approximations that make this problem tractable. Object proposals are linked across frames using a combination of a recurrent interaction network, modeling the physics in object space, and a compositional renderer, modeling the way in which objects project onto pixel space. We demonstrate significant improvements over the state of the art on the IntPhys intuitive physics benchmark. We apply our method to a second dataset with increasing levels of occlusion, showing that it realistically predicts segmentation masks up to 30 frames into the future. Finally, we also show results on predicting the motion of objects in real videos.
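The linking of object proposals across frames can be illustrated with a small sketch. It is not the authors' implementation: the learned recurrent interaction network is replaced here by a constant-velocity predictor, the graph proposal matching by a greedy nearest-neighbour search, and the names `predict_next`, `link_proposals`, and `max_dist` are hypothetical. It only shows the rule stated in the Figure 1 caption: match each prediction to the closest proposal, and keep the prediction as the object state when no proposal is available.

```python
import math

def predict_next(prev, curr):
    """Constant-velocity stand-in (assumption) for the learned dynamics model."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])

def link_proposals(seed_prev, seed_curr, detections, max_dist=1.5):
    """Link object proposals through time, tolerating missing detections.

    seed_prev, seed_curr: object positions in two seed frames (same order).
    detections: one list of (x, y) proposals per subsequent frame.
    max_dist: hypothetical gating threshold for accepting a match.
    Returns one trajectory (list of positions) per object.
    """
    tracks = [[p, c] for p, c in zip(seed_prev, seed_curr)]
    for proposals in detections:
        free = list(proposals)  # proposals not yet claimed in this frame
        for track in tracks:
            pred = predict_next(track[-2], track[-1])
            best = min(free, key=lambda d: math.dist(pred, d), default=None)
            if best is not None and math.dist(pred, best) <= max_dist:
                free.remove(best)
                track.append(best)  # visible: update the state with the observation
            else:
                track.append(pred)  # occluded: keep the prediction as the state
    return tracks
```

With an empty proposal list for a frame, the track is simply extended by the predicted position, so the object can be matched again once it reappears.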

Paper Structure

This paper contains 44 sections, 8 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of our occlusion resistant intuitive physics model. A pre-trained object detector (MaskRCNN) returns object detections and masks (top). A graph proposal matching links object proposals through time: from a pair of frames the Recurrent Interaction Network ($RecIntNet$) predicts the next object position and matches it with the closest object proposal. If an object disappears (e.g. due to occlusion, so that there is no object proposal), the model keeps the prediction as the object state; otherwise the object state is updated with the observation. Finally, the Compositional Rendering Network ($Renderer$) predicts masks from object states and compares them with the observed masks. The prediction errors of $RecIntNet$ and $Renderer$ over the full sequence are summed into a physics loss and a render loss, respectively. The two losses are used to assess whether the observed scene is physically plausible.
  • Figure 2: Compositional Rendering Network ($Renderer$). The network takes as input a list of object states. First, the object rendering network reconstructs a segmentation mask and a depth map for each object independently. Second, the occlusion predictor composes all predicted object masks into the final scene mask, generating the appropriate pattern of inter-object occlusions from the predicted depth maps of the individual objects.
  • Figure 3: Illustration of event decoding in the videos of the IntPhys dataset. A pre-trained object detector returns object proposals in the video (bounding boxes). An initial match is made across two seed neighbouring frames, also estimating object velocity (left, white arrows). The dynamic model (RecIntNet) predicts object positions and velocities in future frames, enabling the match of objects despite significant occlusions (right, bounding box colors and highlights).
  • Figure 4: Images from the Future Prediction experiment. 1: An overview of the pybullet scene. 2: Sample video frames (instance mask + depth field) from our datasets (top) together with predictions obtained by our model (bottom), taken from the tilted 25 experiments. 3: Example of prediction for a real video, with a prediction span of 8 frames. The small colored dots show the predicted positions of objects together with the estimated uncertainty shown by the colored “cloud”. The same colored dot is also shown in the (ground truth) center of each object. The prediction is correct when the two dots coincide. (see https://drive.google.com/open?id=1Qc8flIAxUGzfRfeFyyUEGXe6J5AUGUjE).
  • Figure 5: Video example from the IntPhys benchmark. Four frames from a video in block O1, with superimposed heatmaps. Heatmaps (colored blobs) correspond to the per-pixel difference between the predicted and the observed object mask. In this video, a cube moves from left to right but disappears behind the occluder. The Recurrent Interaction Network correctly predicts its motion behind the occluder and the Compositional Renderer reconstructs its mask. The fact that the object is absent from the observed mask leads to a large render loss, illustrated by the high heatmap values (violet) at the position where the object is expected to be.
  • ...and 4 more figures
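
The composition step in the Figure 2 caption and the per-pixel difference in the Figure 5 caption can be illustrated with a minimal sketch. It rests on strong assumptions: the paper's occlusion predictor and render loss are learned or defined in the paper itself, whereas this example uses a hard nearest-depth rule and a simple count of disagreeing pixels. The function names `compose_scene` and `render_error` and the 0.5 mask threshold are hypothetical.

```python
def compose_scene(masks, depths):
    """Combine per-object masks into one scene mask using per-object depth.

    masks:  list of H x W grids with values in [0, 1] (1 = object present).
    depths: list of H x W grids; a smaller value is closer to the camera.
    Returns an H x W grid of object indices (-1 for background).
    """
    height, width = len(masks[0]), len(masks[0][0])
    scene = [[-1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            nearest = float("inf")
            for k, (mask, depth) in enumerate(zip(masks, depths)):
                # The closest object covering this pixel occludes the others.
                if mask[y][x] > 0.5 and depth[y][x] < nearest:
                    nearest = depth[y][x]
                    scene[y][x] = k
    return scene

def render_error(predicted, observed):
    """Per-pixel disagreement map (1 where labels differ) and its total."""
    heat = [[int(p != o) for p, o in zip(prow, orow)]
            for prow, orow in zip(predicted, observed)]
    return heat, sum(map(sum, heat))
```

In this simplified form, an object that is predicted but missing from the observed mask produces a cluster of ones in the disagreement map at its expected position, which is the kind of signal the Figure 5 heatmaps visualize.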