Perceptual Self-Reflection in Agentic Physics Simulation Code Generation
Prashant Shende, Bradley Camburn
TL;DR
This work tackles the oracle gap in physics-simulation code generation, where syntactically correct code can produce physically incorrect results. It introduces a perceptual self-reflection loop, implemented as a four-agent pipeline, that renders animations and validates them with a vision-capable language model, enabling iterative refinement beyond code-level checks. Across seven diverse domains the approach reaches an average physics accuracy of 91%, while selective model usage keeps the cost low (about $0.20 per animation). These results indicate that analyzing visual simulation outputs can significantly outperform single-shot generation for physics tasks, and they suggest practical pathways for integrating agentic AI into engineering workflows and physics data pipelines.
Abstract
We present a multi-agent framework for generating physics simulation code from natural language descriptions, featuring a novel perceptual self-reflection mechanism for validation. The system employs four specialized agents: a natural language interpreter that converts user requests into physics-based descriptions; a technical requirements generator that produces scaled simulation parameters; a physics code generator with automated self-correction; and a physics validator that implements perceptual self-reflection. The key innovation is perceptual validation, which analyzes rendered animation frames using a vision-capable language model rather than inspecting code structure directly. This approach addresses the "oracle gap", where syntactically correct code produces physically incorrect behavior, a limitation that conventional testing cannot detect. We evaluate the system across seven domains: classical mechanics, fluid dynamics, thermodynamics, electromagnetics, wave physics, reaction-diffusion systems, and non-physics data visualization. The perceptual self-reflection architecture demonstrates substantial improvement over single-shot generation baselines, with the majority of tested scenarios achieving target physics accuracy thresholds. The system exhibits robust pipeline stability with consistent code self-correction capability, operating at approximately $0.20 per animation. These results validate our hypothesis that feeding visual simulation outputs back to a vision-language model for iterative refinement significantly outperforms single-shot code generation for physics simulation tasks, and they highlight the potential of agentic AI to support engineering workflows and physics data generation pipelines.
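The generate, render, and validate cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `refine` and `Verdict`, the callable signatures, the 0.9 score threshold, and the three-round limit are all assumptions chosen for illustration, and the interpreter and requirements agents are folded into a single `request` string.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Hypothetical result returned by the perceptual validator."""
    score: float    # estimated physics accuracy in [0, 1]
    feedback: str   # critique of what the rendered frames show


def refine(
    request: str,
    generate: Callable[[str, str], str],            # (request, feedback) -> code
    render: Callable[[str], List[bytes]],           # code -> rendered frames
    validate: Callable[[str, List[bytes]], Verdict],  # judges frames, not code
    threshold: float = 0.9,   # assumed acceptance threshold
    max_rounds: int = 3,      # assumed iteration budget
) -> Tuple[str, Verdict, int]:
    """Regenerate code until the rendered output passes validation."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    feedback = ""  # the first round is plain single-shot generation
    for round_no in range(1, max_rounds + 1):
        code = generate(request, feedback)
        frames = render(code)
        # The validator sees only the request and the frames, so it can
        # catch code that runs cleanly but shows wrong physics.
        verdict = validate(request, frames)
        if verdict.score >= threshold:
            break
        feedback = verdict.feedback  # feed the visual critique back in
    return code, verdict, round_no
```

In a real pipeline, `generate` and `validate` would wrap calls to a language model and a vision-capable model respectively, and `render` would execute the generated simulation and capture frames. Passing them in as callables keeps the loop itself testable with simple stubs.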
