Learning Physical Dynamics for Object-centric Visual Prediction

Huilin Xu, Tao Chen, Feng Xu

TL;DR

This paper proposes an unsupervised object-centric prediction model that makes future predictions by learning visual dynamics between objects. It generates predictions with higher visual quality and greater physical reliability than state-of-the-art methods.

Abstract

The ability to model the underlying dynamics of visual scenes and reason about the future is central to human intelligence. Many attempts have been made to empower intelligent systems with such physical understanding and prediction abilities. However, most existing methods focus on pixel-to-pixel prediction, which incurs heavy computational costs and lacks a deep understanding of the physical dynamics behind videos. Recently, object-centric prediction methods have emerged and attracted increasing interest. Inspired by them, this paper proposes an unsupervised object-centric prediction model that makes future predictions by learning visual dynamics between objects. Our model consists of two modules: a perceptual module and a dynamic module. The perceptual module decomposes images into several objects and synthesizes images from a set of object-centric representations. The dynamic module fuses contextual information, takes environment-object and object-object interactions into account, and predicts the future trajectories of objects. Extensive experiments are conducted to validate the effectiveness of the proposed method. Both quantitative and qualitative experimental results demonstrate that our model generates predictions with higher visual quality and greater physical reliability than state-of-the-art methods.

Paper Structure

This paper contains 26 sections, 17 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: The motivation of the proposed model for the visual prediction task. Left: an object representation with rich expressive power is beneficial for prediction. Middle: incorporating prior physical knowledge helps simplify complex dynamics and boosts the model's performance. Right: context information, such as object-object interaction and object-environment interaction, should be considered when inferring future object states.
  • Figure 2: The overall architecture of the proposed model for unsupervised object-centric visual prediction. The perceptual module and dynamic module are shown in Fig. 3 and Fig. 5, and their detailed structures are described in the Perceptual Module and Dynamic Module sections. The prediction procedure of our model consists of three stages: (a) The visual image is decomposed into spatial features $\textbf{F}$ and multiple physically meaningful object states $\textbf{O}^{t} = \{\textbf{o}^t_{1:N}\}$ in an unsupervised manner. (b) The dynamic module learns the underlying dynamics from past state trajectories and makes object-wise future predictions in state space. (c) The future frame in pixel space is produced by combining spatial features and predicted states.
  • Figure 3: The schematic diagram of the perceptual module. The encoder $\phi$ is used to obtain keypoints and static feature maps, while the decoder $\psi$ is used to reconstruct images. $\bm{F}$ and $\bm{O}$ denote spatial features and pose vectors of individual objects.
  • Figure 4: An illustration of the Gaussian-like map construction process $g$.
  • Figure 5: The schematic diagram of the dynamic module $\mathcal{D}$. The context-aware aggregator $\mathcal{A}$ aggregates static appearance information in the feature map $F$ and the object's state $\bm{O}$ to generate a new state vector $\overline{\bm{O}}$, which additionally contains the contextual features around objects. The interaction-aware predictor $\mathcal{P}$ outputs the displacement vector used to predict future states. Both are described in the Context-aware aggregator and Interaction-aware Dynamic predictor sections.
  • ...and 5 more figures
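
The Figure 4 caption mentions a Gaussian-like map construction process $g$ without giving its form here. As a rough illustration only, the sketch below renders each keypoint as an isotropic Gaussian bump on a pixel grid. The function names `gaussian_map` and `gaussian_maps`, the normalised coordinate range of [-1, 1], and the fixed width `sigma` are all assumptions made for this example and are not taken from the paper.

```python
import math


def gaussian_map(keypoint, height, width, sigma=0.1):
    """Render one keypoint (x, y), given in [-1, 1] x [-1, 1], as a
    height x width map that peaks at the keypoint (assumed form of g).
    Requires height > 1 and width > 1."""
    kx, ky = keypoint
    rows = []
    for i in range(height):
        # Row index -> normalised y coordinate in [-1, 1].
        y = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            # Column index -> normalised x coordinate in [-1, 1].
            x = -1.0 + 2.0 * j / (width - 1)
            squared_distance = (x - kx) ** 2 + (y - ky) ** 2
            row.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
        rows.append(row)
    return rows


def gaussian_maps(keypoints, height, width, sigma=0.1):
    """Build one map per keypoint, i.e. one channel per object."""
    return [gaussian_map(kp, height, width, sigma) for kp in keypoints]


if __name__ == "__main__":
    # Two hypothetical keypoints: the image centre and one corner.
    maps = gaussian_maps([(0.0, 0.0), (-1.0, 1.0)], height=5, width=5)
    for channel in maps:
        print(max(max(row) for row in channel))
```

A fixed `sigma` keeps the example short; a learned or per-object scale would be an equally plausible choice and may be closer to what the paper does.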