Transformers and Slot Encoding for Sample Efficient Physical World Modelling

Francesco Petri, Luigi Asprino, Aldo Gangemi

TL;DR

This work tackles efficient physical world modelling by integrating object-centric representations with transformer-based dynamics. The authors introduce the Future-Predicting Transformer Triplet (FPTT), which tokenizes frames via a Vector Quantized VAE and employs a corrector, predictor, and decoder transformer to model and predict object interactions over time. Empirical results on PHYRE-like physical reasoning tasks show that FPTT is more stable and sample-efficient than baselines such as STEVE and decoder-only variants, while achieving strong predictive performance. Limitations include opaque representations and high memory demands, with future plans to test on more realistic datasets and explore causal discovery applications.

Abstract

World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Recent applications of the Transformer architecture to world modelling from video input show notable improvements in sample efficiency. However, existing approaches tend to work only at the image level, thus disregarding the fact that the environment is composed of objects interacting with each other. In this paper, we propose an architecture that combines Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of the objects appearing in a scene. We describe the resulting neural architecture and report experimental results showing an improvement over existing solutions in terms of sample efficiency, together with a reduction in the variation of performance over the training examples. The code for our architecture and experiments is available at https://github.com/torchipeppo/transformers-and-slot-encoding-for-wm

Paper Structure (29 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Architecture diagram for the Future-Predicting Transformer Triplet for world modelling.
  • Figure 2: Example frames from the PHYRE dataset.
  • Figure 3: Diagram of the experimental setup, showing how the classifier is positioned with respect to the world modelling architecture. Note: in the case of the decoder-only baseline, replace $\Lambda_T$ with $z_T$.
  • Figure 4: Illustration of the process described in the experimental setup subsection for FPTT and STEVE. Notation has been simplified with respect to the architecture diagram (Figure 1). C represents the corrector transformer and P the predictor transformer.
  • Figure 5: Classification results on test data as a function of the number of training samples observed. Each line represents an average over 5 experiments; the coloured bands indicate the standard error of the mean.
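
The Figure 5 caption describes each line as an average over 5 experiments, with bands showing the standard error of the mean. As a minimal sketch of how such a point and its band could be computed from per-run scores: the function name `mean_and_sem` and the accuracy values are hypothetical and do not come from the paper, and the use of the sample (n - 1) standard deviation is an assumption, since the caption does not say which estimator was used.

```python
import math
from statistics import mean, stdev


def mean_and_sem(run_scores):
    """Return (mean, standard error of the mean) for one x-axis point.

    Assumption: the SEM is the sample standard deviation (n - 1
    denominator) divided by sqrt(n); the source does not specify this.
    """
    n = len(run_scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate the SEM")
    return mean(run_scores), stdev(run_scores) / math.sqrt(n)


# Hypothetical test accuracies from 5 runs at one training-set size
# (illustrative values only, not results from the paper).
scores = [0.70, 0.72, 0.68, 0.71, 0.69]
m, sem = mean_and_sem(scores)

# The coloured band around the line would then span mean +/- SEM.
band = (m - sem, m + sem)
```

Repeating this for every training-set size on the x-axis would give one averaged curve and its band per method.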