Can vision language models learn intuitive physics from interaction?

Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz

TL;DR

The paper investigates whether interaction with an environment helps vision-language models acquire robust intuitive physics. It compares one-step reinforcement learning via Group-Relative Policy Optimization (GRPO) against supervised fine-tuning (SFT) using parameter-efficient adapters (with $N=16$ completions for the generation policy and adapter rank $r=16$). Although post-training yields near-ceiling performance on the trained tasks, neither GRPO nor SFT consistently generalizes to new, related intuitive-physics tasks or to real images. Decodability analyses show that physics-relevant quantities can be decoded from model activations yet are not leveraged for cross-task transfer, suggesting task-specific shortcuts rather than true intuitive understanding. These results challenge the idea that interaction alone suffices to instill human-like intuitive physics in vision-language models and motivate exploring training paradigms beyond standard PEFT post-training. In short, interaction, as tested here, does not produce robust, generalizable intuitive physics.
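As a rough illustration of the "group-relative" part of GRPO named in the TL;DR, the sketch below normalizes each completion's reward against the other completions sampled for the same prompt. It is a minimal sketch, not the authors' training code: the function name `group_relative_advantages`, the `eps` constant, the binary reward values, and the use of the population standard deviation are all assumptions made for illustration (implementations differ on these details).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own group of samples.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. The advantage is the reward's deviation from the group mean,
    scaled by the group's standard deviation (population std is an
    assumption here; `eps` avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of 16 completions of one prompt:
# 1.0 for a rewarded answer, 0.0 otherwise (illustrative values only).
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average receive positive advantages and the rest receive negative ones. When every completion in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.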

Abstract

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.

Paper Structure (40 sections, 3 equations, 21 figures, 2 tables)

Figures (21)

  • Figure 1: Overview of all combinations of datasets and action types. Datasets: We train models on two related datasets, one where the block on top of a tower is displaced, and one where a block is displaced on the ground next to a tower. Action types: For each dataset, we train models on two action types. Binary stability requires models to make a binary judgment on whether a given tower is stable. For the x-only task, models need to give a single value by which the displaced block should be moved to build a more stable or bigger tower. The x-y task requires the same action as x-only, but with an added dimension: here the block needs to be moved to the side and up. We train models on all four combinations of dataset and action types using GRPO to test whether models trained through interaction with an environment learn generalizable physical intuitions. We also evaluate models on an external dataset of real wooden block towers, taken from Lerer et al. (2016).
  • Figure 2: Performance by test task and training task for Qwen3-VL-8B. Rows show models evaluated on a given task. Columns show models trained on a given task. The blue and orange lines show the performance of the models trained with SFT and GRPO, respectively. The grey dotted line shows the baseline for the evaluation task. Plots on the diagonal show within-task performance, meaning models are evaluated on the same task they are trained on. All other subplots represent some degree of generalization.
  • Figure 3: Qwen3-VL-8B trained on all four conditions and evaluated on the real images of block towers from Lerer et al. (2016). Crucially, we find some generalization from our synthetic binary-stability top block images to real images of block towers. However, we still find that no model fine-tuned on other tasks generalizes to judging the stability of real block towers. Furthermore, we find that even the models post-trained on the binary-stability task do not perform this task at a human level when presented with real images (the red line shows the mean human performance). Error bars show 95% confidence intervals.
  • Figure 4: Example images for the top block dataset. Images feature towers with 2 to 4 blocks with the top block displaced.
  • Figure 5: Example images for the side block dataset. Images feature towers with 1 to 3 blocks with a misplaced block to the side.
  • ...and 16 more figures
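
The Figure 3 caption reports 95% confidence intervals as error bars. The source does not say how these were computed; as a hedged illustration only, the sketch below uses the normal-approximation interval for a binary accuracy. The function name `accuracy_ci95` and the example counts are hypothetical.

```python
from math import sqrt


def accuracy_ci95(correct, total):
    """Normal-approximation 95% confidence interval for an accuracy.

    Assumes independent binary outcomes; the authors' actual procedure
    (e.g. bootstrapping) may differ.
    """
    p = correct / total
    half_width = 1.96 * sqrt(p * (1 - p) / total)
    # Clip to the valid range for a proportion.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical example: 50 correct stability judgments out of 100 images.
low, high = accuracy_ci95(50, 100)
```

With these illustrative counts the interval spans roughly 0.40 to 0.60, so an observed accuracy of 0.5 on 100 images would not be distinguishable from chance on a binary task.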