Perceptual Self-Reflection in Agentic Physics Simulation Code Generation

Prashant Shende, Bradley Camburn

TL;DR

This work tackles the oracle gap in physics-simulation code generation, where syntactically correct code can produce physically incorrect results. It introduces a perceptual self-reflection loop, implemented as a four-agent pipeline that renders animations and validates them with a vision-capable language model, enabling iterative refinement beyond code-level checks. The approach achieves an average physics accuracy of 91% across seven diverse domains while keeping cost low (~$0.20 per animation) through selective model usage. These results indicate that analyzing visual simulation outputs can significantly outperform single-shot generation for physics tasks, and they suggest practical pathways for integrating agentic AI into engineering workflows and physics data pipelines.

Abstract

We present a multi-agent framework for generating physics simulation code from natural language descriptions, featuring a novel perceptual self-reflection mechanism for validation. The system employs four specialized agents: a natural language interpreter that converts user requests into physics-based descriptions; a technical requirements generator that produces scaled simulation parameters; a physics code generator with automated self-correction; and a physics validator that implements perceptual self-reflection. The key innovation is perceptual validation, which analyzes rendered animation frames using a vision-capable language model rather than inspecting code structure directly. This approach addresses the "oracle gap", where syntactically correct code produces physically incorrect behavior, a failure mode that conventional testing cannot detect. We evaluate the system across seven domains: classical mechanics, fluid dynamics, thermodynamics, electromagnetics, wave physics, reaction-diffusion systems, and non-physics data visualization. The perceptual self-reflection architecture demonstrates substantial improvement over single-shot generation baselines, with the majority of tested scenarios achieving target physics accuracy thresholds. The system exhibits robust pipeline stability with consistent code self-correction capability, operating at approximately $0.20 per animation. These results validate our hypothesis that feeding visual simulation outputs back to a vision-language model for iterative refinement significantly outperforms single-shot code generation for physics simulation tasks, and they highlight the potential of agentic AI to support engineering workflows and physics data generation pipelines.
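
The control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_pipeline`, `Verdict`, `threshold`, and `max_rounds` are hypothetical, and the four agents and the renderer are passed in as plain callables, where a real system would call language models and an animation renderer. The code generator's own self-correction step is not modeled here; the sketch covers only the outer loop in which the validator's critique of the rendered frames is fed back into code generation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Verdict:
    """Result of one perceptual validation pass (hypothetical structure)."""
    score: float    # estimated physics accuracy in [0, 1]
    feedback: str   # critique handed back to the code generator


def run_pipeline(
    request: str,
    interpret: Callable[[str], str],
    make_requirements: Callable[[str], Any],
    generate_code: Callable[[Any, str], str],
    render: Callable[[str], List[Any]],
    validate: Callable[[List[Any], str], Verdict],
    threshold: float = 0.9,   # assumed acceptance threshold
    max_rounds: int = 3,      # assumed cap on refinement rounds
) -> Tuple[str, List[Verdict]]:
    """Run the four agents, looping until the rendered output is accepted."""
    description = interpret(request)               # agent 1: NL interpreter
    requirements = make_requirements(description)  # agent 2: requirements
    feedback = ""
    code = ""
    history: List[Verdict] = []
    for _ in range(max_rounds):
        code = generate_code(requirements, feedback)  # agent 3: code generator
        frames = render(code)                         # run simulation, grab frames
        verdict = validate(frames, description)       # agent 4: judges the frames
        history.append(verdict)
        if verdict.score >= threshold:
            break
        feedback = verdict.feedback                   # reflect and retry
    return code, history


def demo() -> Tuple[str, List[Verdict]]:
    """Toy stand-ins for the agents: the first draft fails, the second passes."""
    def interpret(request: str) -> str:
        return f"physics description of: {request}"

    def make_requirements(description: str) -> dict:
        return {"description": description, "dt": 0.01}

    def generate_code(requirements: dict, feedback: str) -> str:
        return "sim_v2" if feedback else "sim_v1"

    def render(code: str) -> List[str]:
        return [f"{code}_frame{i}" for i in range(3)]

    def validate(frames: List[str], description: str) -> Verdict:
        if frames[0].startswith("sim_v2"):
            return Verdict(0.95, "")
        return Verdict(0.5, "no vortex shedding visible")

    return run_pipeline("flow past a cylinder", interpret, make_requirements,
                        generate_code, render, validate)


if __name__ == "__main__":
    final_code, history = demo()
    print(final_code, [v.score for v in history])
```

The validator in this sketch receives only the rendered frames and the physics description, never the source code, which mirrors the paper's point that validation happens on what the simulation looks like rather than on how the code is written.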

Paper Structure (30 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: System architecture diagram of the perceptual self-reflection pipeline for agentic physics simulation code generation. The Physics Validator applies a perceptual validation check by reviewing the rendered animation.
  • Figure 2: Case Study Comparison Grid: Self-Validating Pipeline (top row, green border) vs Single-Shot API Generation (bottom row, red border) across four representative scenarios. The self-validating pipeline produces physically correct simulations with proper vortex shedding (CFD), stable concentric wave propagation (FDTD), and accurate thermal diffusion. Single-shot generation exhibits critical failures: CFD lacks wake development and vortex shedding, FDTD shows catastrophic checkerboard numerical instability, and heat diffusion solves a different problem variant (steady-state instead of transient). For non-physics data visualization (Population Growth), both approaches achieve comparable quality, demonstrating that perceptual self-reflection provides the greatest benefit for physics simulations requiring numerical accuracy.