Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory

Haoyang Li, Yang You, Hao Su, Leonidas Guibas

TL;DR

PhysMem is a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters.

Abstract

Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly, reducing rigid reliance on prior experience when physical conditions change. We evaluate PhysMem on three real-world manipulation tasks and simulation benchmarks across four VLM backbones. On a controlled brick insertion task, principled abstraction achieves 76% success compared to 23% for direct experience retrieval, and real-world experiments show consistent improvement over 30-minute deployment sessions.
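The record, hypothesize, verify, and promote cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names, the `min_trials` and `promote_at` thresholds, the success-rate confidence measure, and the example hypothesis text are all assumptions made for illustration. The one property it aims to capture is that a hypothesis guides nothing as a principle until it has been tested against new outcomes.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """A candidate physical rule awaiting verification (hypothetical structure)."""
    statement: str
    successes: int = 0
    trials: int = 0

    @property
    def confidence(self) -> float:
        # Assumed measure: fraction of verification trials that succeeded.
        return self.successes / self.trials if self.trials else 0.0


@dataclass
class Memory:
    """Toy three-tier store: experiences, hypotheses, principles."""
    experiences: list = field(default_factory=list)
    hypotheses: dict = field(default_factory=dict)
    principles: dict = field(default_factory=dict)
    min_trials: int = 3       # assumed: trials required before any decision
    promote_at: float = 0.8   # assumed: confidence needed for promotion

    def record(self, action: str, succeeded: bool) -> None:
        # Raw experiences are stored but never applied directly.
        self.experiences.append((action, succeeded))

    def propose(self, statement: str) -> None:
        # Add a candidate unless it is already tracked or already verified.
        if statement not in self.principles:
            self.hypotheses.setdefault(statement, Hypothesis(statement))

    def verify(self, statement: str, succeeded: bool) -> None:
        # Update a hypothesis with the outcome of a targeted interaction.
        h = self.hypotheses[statement]
        h.trials += 1
        h.successes += int(succeeded)
        if h.trials >= self.min_trials and h.confidence >= self.promote_at:
            # Verification before application: promote only after testing.
            self.principles[statement] = self.hypotheses.pop(statement)


memory = Memory()
rule = "place large parts first"  # made-up example hypothesis
memory.propose(rule)
for succeeded in (True, True, True):
    memory.record("place_large_first", succeeded)
    memory.verify(rule, succeeded)
print(sorted(memory.principles))
```

In this sketch a hypothesis that keeps failing simply stays unpromoted; a fuller version would also need a rule for discarding or revising it, which the abstract does not specify.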

Paper Structure (100 sections, 9 equations, 14 figures, 6 tables, 1 algorithm)


Figures (14)

  • Figure 1: PhysMem learns physical principles through interaction. (a) Continual learning: the memory consolidation system maintains principles $\mathcal{P}$, hypotheses $\mathcal{H}$, and experiences $\mathcal{E}$, which guide the embodied agent's actions $a$; world feedback (observations $o$, rewards $r$) generates new experiences $e$ that refine knowledge. (b) Test-time learning on Parts Organization: PhysMem (blue) improves continuously while the no-memory baseline (gray) remains flat. (c) Qualitative results on Parts Organization: green boxes show covered cells, red dashed boxes indicate potential collisions avoided, and blue circles highlight learned space-saving strategies.
  • Figure 2: System overview of PhysMem. Left (Consolidation): A three-tier memory system stores raw experiences in episodic memory, clusters them into testable hypotheses in working memory, and promotes verified knowledge to long-term memory as principles. The consolidation process continuously refines memory through interaction. Top-right (Embodied Agent): A Vision-Language Model receives language instructions along with retrieved principles and active hypotheses from memory, then outputs high-level decisions that are executed by a low-level policy. Bottom-right (World Interaction): The agent interacts with challenging physical tasks (Parts Organization, Ball Navigation, and Balanced Stacking), and outcomes feed back into the memory system as new experiences.
  • Figure 3: Memory injection into VLM prompts. Verified principles (blue) and active hypotheses (yellow) are inserted into the planner's context with confidence scores and typed constraints (Prefer, Avoid, Sequence).
  • Figure 4: Experimental environments. (a) Left: Real-world platform with xArm6 robot, fin-ray soft grippers, and multi-view RealSense cameras in an enclosed workspace. Right: A subset of the props used in the experiments. (b) Reflect-VLM simulation (feng2025reflective) with Franka Panda robot for large-scale experiments.
  • Figure 5: Real-world tasks. Top: symbolic representations; bottom: actual setups. (a) Grid layout and placement trajectories for Parts Organization. (b) Obstacle course and ball trajectory for Ball Navigation. (c) Stone arrangement and stacking position for Balanced Stacking.
  • ...and 9 more figures
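
The memory injection described in the Figure 3 caption can be sketched as plain prompt formatting. The snippet below is an illustrative assumption, not the paper's actual prompt template: the `Entry` type, the `render_memory` function, the section headings, and the example entries are hypothetical. Only the three constraint types (Prefer, Avoid, Sequence), the confidence scores, and the split between verified principles and active hypotheses come from the caption.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """Constraint types named in the Figure 3 caption."""
    PREFER = "Prefer"
    AVOID = "Avoid"
    SEQUENCE = "Sequence"


@dataclass(frozen=True)
class Entry:
    """One memory item to inject (hypothetical structure)."""
    kind: Kind
    text: str
    confidence: float


def render_memory(principles: list, hypotheses: list) -> str:
    """Format memory entries as a text block for a planner prompt."""
    def fmt(e: Entry) -> str:
        return f"- [{e.kind.value}, conf={e.confidence:.2f}] {e.text}"

    lines = ["Verified principles:"]
    lines += [fmt(e) for e in principles]
    lines.append("Active hypotheses (unverified):")
    lines += [fmt(e) for e in hypotheses]
    return "\n".join(lines)


# Made-up example entries, for illustration only.
block = render_memory(
    principles=[Entry(Kind.AVOID, "stacking on rounded stones", 0.90)],
    hypotheses=[Entry(Kind.PREFER, "flat stones as the base", 0.50)],
)
print(block)
```

Keeping principles and hypotheses under separate headings lets the planner weigh verified knowledge differently from candidates that are still being tested.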