Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers

Sanket Gandhi, Atul, Samanyu Mahajan, Vishal Sharma, Rushil Gupta, Arnab Kumar Mondal, Parag Singla

TL;DR

Through a series of experiments, this work demonstrates that the proposed architecture can discover semantically meaningful blocks, improves the accuracy of dynamics prediction compared to SOTA object-centric models, and performs significantly better in the OOD setting, where specific attribute combinations are not seen during training.

Abstract

Recent work has shown that object-centric representations can greatly improve the accuracy of learned dynamics while also bringing interpretability. In this work, we take this idea one step further and ask the following question: "can learning disentangled representations further improve the accuracy of visual dynamics prediction in object-centric models?" While there have been some attempts to learn such disentangled representations for static images \citep{nsb}, to the best of our knowledge, ours is the first work that tries to do this in a general setting for video, without making any specific assumptions about the kinds of attributes an object might have. The key building block of our architecture is the notion of a {\em block}, where several blocks together constitute an object. Each block is represented as a linear combination of a given number of learnable concept vectors, which are iteratively refined during the learning process. The blocks in our model are discovered in an unsupervised manner by attending over object masks, in a style similar to the discovery of slots \citep{slot_attention}, to learn a dense object-centric representation. We employ self-attention via transformers over the discovered blocks to predict the next state, resulting in the discovery of visual dynamics. We perform a series of experiments on several benchmark 2-D and 3-D datasets demonstrating that our architecture (1) can discover semantically meaningful blocks, (2) helps improve the accuracy of dynamics prediction compared to SOTA object-centric models, and (3) performs significantly better in the OOD setting, where specific attribute combinations are not seen during training. Our experiments highlight the importance of discovering disentangled representations for visual dynamics prediction.
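
The statement that each block is a linear combination of learnable concept vectors can be illustrated with a minimal sketch. It assumes the combination weights come from a softmax over scaled dot products between the block and its concept vectors; the softmax, the scaling by the square root of the dimension, and the names `project_block_onto_concepts`, `block`, and `concepts` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def project_block_onto_concepts(block, concepts):
    """Re-express one block vector as a weighted sum of concept vectors.

    block:    array of shape (d,), a single block representation
    concepts: array of shape (k, d), the k concept vectors for this block

    Assumption: weights are a softmax over scaled dot products. The paper
    only states that a block is a linear combination of concept vectors.
    """
    d = block.shape[-1]
    scores = concepts @ block / np.sqrt(d)   # (k,) similarity to each concept
    scores = scores - scores.max()           # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()        # softmax: non-negative, sums to 1
    projected = weights @ concepts           # (d,) linear combination
    return projected, weights


# Toy example with random values standing in for learned parameters.
rng = np.random.default_rng(0)
concepts = rng.normal(size=(4, 8))   # k = 4 concept vectors of dimension 8
block = rng.normal(size=8)
projected, weights = project_block_onto_concepts(block, concepts)
```

Under these assumptions, `projected` always lies in the span of the concept vectors, so two objects that share an attribute value can end up with similar blocks even if that attribute combination was never seen together during training.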

Paper Structure (16 sections, 6 figures, 7 tables, 2 algorithms)

This paper contains 16 sections, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: Main architecture for DisFormer. SAM and SaVI serve as the Mask Extractor, and the product operation is the Hadamard product between the object masks and the input images. The Block Extractor module takes in the object representations $z_{t}^{i}$, one at a time, along with the concept vectors $C$, and outputs the set of block representations for each object $i$. Note that each block has its own set of concept vectors; Attn is the simple dot-product attention module where the input block-based representation of all the objects is converted into a linear combination of the corresponding concept vectors.
  • Figure 2: Rollouts at various time steps for three datasets (in distribution). GT: Ground Truth.
  • Figure 3: Rollouts at various time steps for three datasets (OOD). GT: Ground Truth.
  • Figure 4: Disentanglement Results.
  • Figure 5: Mask extractor: Bouncing Circles. (a) Slot masks for a frame. (b) Generated point prompts. Green points are foreground point prompts and red ones are background point prompts. (c) SAM masks.
  • ...and 1 more figures