To Fall Or Not To Fall: A Visual Approach to Physical Stability Prediction

Wenbin Li, Seyedmajid Azimi, Aleš Leonardis, Mario Fritz

TL;DR

This work tackles predicting physical stability of block towers from visual input without relying on explicit 3D models or physics simulation at test time. It generates a large synthetic dataset with controlled variations (block counts, stacking depth, and block sizes) using a Bullet-based simulator to provide stability labels and trains CNNs to predict stability directly from images, comparing results to human judgments. Through intra-group, cross-group, and generalization experiments, the study demonstrates strong, often superior, performance of the image-based predictor across a wide range of scene variations, while analyzing how humans and machines diverge on challenging configurations. The findings highlight the potential of data-driven visual intuition for physical reasoning and motivate future work on richer outputs and deeper grounding of visual stability concepts.

Abstract

Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. Developmental psychology has shown that infants acquire such skills from observation at a very early stage. In this paper, we contrast the more traditional model-based route, which relies on explicit 3D representations and physical simulation, with an end-to-end approach that directly predicts stability and related quantities from appearance. We ask whether, to what extent, and with what quality such a skill can be acquired directly in a data-driven way, bypassing the need for an explicit simulation. We present a learning-based approach, trained on simulated data, that predicts the stability of towers composed of wooden blocks under different conditions, as well as quantities related to the potential fall of the towers. The evaluation is carried out on synthetic data and compared to human judgments on the same stimuli.
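To make the notion of a stability label concrete, the following is a minimal sketch of a static stability test for a single-column 2D stack. It is not the paper's labeling procedure, which relies on a Bullet-based simulation; it only illustrates the kind of ground truth a visual predictor is asked to reproduce. The `Block` class, the `is_stable` function, and the assumption of rigid blocks of uniform density stacked one per level are all hypothetical choices made for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical 2D block: only horizontal placement matters here."""
    x_center: float    # horizontal position of the block's center
    width: float       # horizontal extent of the block
    mass: float = 1.0  # assumed uniform density, so the center of mass is x_center


def is_stable(tower: List[Block]) -> bool:
    """Static check for a single-column stack; tower[0] rests on the ground.

    At every interface, the combined center of mass of all blocks above
    must lie strictly inside the contact region between the two blocks
    that meet at that interface.
    """
    for i in range(len(tower) - 1):
        lower, upper = tower[i], tower[i + 1]
        above = tower[i + 1:]

        # Combined horizontal center of mass of everything above the interface.
        total_mass = sum(b.mass for b in above)
        com = sum(b.mass * b.x_center for b in above) / total_mass

        # Contact region: overlap of the horizontal extents of the two blocks.
        left = max(lower.x_center - lower.width / 2,
                   upper.x_center - upper.width / 2)
        right = min(lower.x_center + lower.width / 2,
                    upper.x_center + upper.width / 2)

        if left >= right or not (left < com < right):
            return False
    return True


if __name__ == "__main__":
    aligned = [Block(0.0, 1.0), Block(0.0, 1.0), Block(0.0, 1.0)]
    # Each neighboring pair overlaps enough on its own, but the two upper
    # blocks together overhang the bottom block.
    staircase = [Block(0.0, 1.0), Block(0.4, 1.0), Block(0.8, 1.0)]
    print(is_stable(aligned), is_stable(staircase))
```

The staircase case shows why stability is a global property of the tower: every adjacent pair looks acceptable in isolation, yet the stack as a whole tips. A full rigid-body simulation additionally handles 3D stacking, multiple supports per level, and dynamics, none of which this sketch covers.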

Paper Structure

This paper contains 30 sections, 2 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Example toys that embody the support event.
  • Figure 2: Example scenes for 2D and 3D stacking with 6 blocks from side view.
  • Figure 3: Examples of scenes with different numbers of blocks.
  • Figure 4: Examples of scenes with fixed and varied block sizes.
  • Figure 5: Examples of scenes with 3D stacking.
  • ...and 9 more figures