Chain of Unit-Physics: A Primitive-Centric Approach to Scientific Code Synthesis

Vansh Sharma, Venkat Raman

TL;DR

The paper tackles the reliability gap in agentic large-language-model–driven scientific code by proposing Chain of Unit-Physics, a primitives-centric, test-driven framework that embeds first-principles constraints into a multi-agent code-generation workflow. By formalizing unit-physics primitives and employing a supervisor–diagnostic–verification loop, the approach guides synthesis toward physically consistent solvers, demonstrated on a combustion benchmark. Closed-weight analyses reveal widespread failure modes, while open-weight setups improve but do not yet reach reliable end-to-end solutions without the unit-physics discipline. The Chain of Unit-Physics system converges in 5–6 iterations, matching human-expert results with significantly better efficiency (≈33% faster runtime, ≈30% lower memory) and a mean L2 error of $3.1\times10^{-3}$%, establishing a practical template for physics-grounded code generation. As models evolve, embedding first-principles checks offers robustness beyond raw training data, promising more trustworthy scientific software from natural-language queries.

Abstract

Agentic large language models are proposed as autonomous code generators for scientific computing, yet their reliability in high-stakes problems remains unclear. Developing computational scientific software from natural-language queries remains challenging, broadly because of (a) sparse representation of domain codes during training and (b) the limited feasibility of RLHF with a small expert community. To address these limitations, this work conceptualizes an inverse approach to code design, embodied in the Chain of Unit-Physics framework: a first-principles (or primitives)-centric, multi-agent system in which human expert knowledge is encoded as unit-physics tests that explicitly constrain code generation. The framework is evaluated on a nontrivial combustion task, used here as a representative benchmark for scientific problems with realistic physical constraints. Closed-weight systems and code-focused agentic variants fail to produce correct end-to-end solvers, despite tool and web access, exhibiting four recurrent error classes: interface (syntax/API) hallucinations, overconfident assumptions, numerical/physical incoherence, and configuration fragility. Open-weight models with chain-of-thought (CoT) decoding reduce interface errors but still yield incorrect solutions. On the benchmark task, the proposed framework converges within 5–6 iterations and matches the human-expert implementation (mean error of $3.1\times10^{-3}$ %), with a $\sim$33.4 % faster runtime and $\sim$30 % lower memory usage at a cost comparable to mid-sized commercial APIs, yielding a practical template for physics-grounded scientific code generation. As datasets and models evolve, zero-shot code accuracy will improve; however, the Chain of Unit-Physics framework goes further by embedding first-principles analysis that is foundational to scientific codes.

Paper Structure

This paper contains 9 sections, 3 figures, 2 tables.

Table of Contents

  1. Introduction
  2. Methodology
  3. Multi-Agent System
  4. Unit-Physics Encoding
  5. Results
  6. Closed-weight Models and Systems
  7. Open-weight Models and Systems
  8. Chain of Unit-Physics System
  9. Code Performance

Code Performance

The code produced by the framework is evaluated against a reference implementation developed by a human expert and compared along three dimensions: (1) execution time, (2) memory usage, and (3) L$^{2}$ error. The chemical conditions (H$_{2}$–O$_{2}$ combustion at $\phi = 1$ and $p = 1$ atm) and the numerical conditions ($dt = 10^{-10}$ with an RK4 integrator) are fixed, and only the input temperature is varied from 1300 to 2400 K. Figure \ref{fig:coderuns} compares the performance of the proposed framework (green) with the reference code developed by a human expert (orange). In terms of runtime [plot (a)], the human-expert code consistently takes longer to reach a solution, on the order of 32–34 s (33.4% on average) slower across the temperature range, indicating that the framework implementation is better optimized for wall-clock performance. This speedup can be attributed to vectorized energy evaluations that replace an explicit loop over each species. The proposed framework is also more memory efficient [plot (b)], reducing peak memory usage from roughly 270 MB for the human code to about 200 MB (a reduction of nearly 30%), with only weak dependence on temperature. The accuracy of the generated solver is quantified by the L$^{2}$ error between the two solutions [plot (c)]. The error remains below $10^{-4}$ for all tested temperatures (mean relative error of $3.1\times10^{-3}$%), with an exact match at some temperatures and a modest increase at higher temperatures, demonstrating that Chain of Unit-Physics closely reproduces the human-expert solution while reducing both runtime and memory usage.
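As a rough illustration of two ideas in this paragraph, the sketch below contrasts a looped and a vectorized evaluation of a mixture energy and computes a relative L$^{2}$ error between two solutions. The function names, the sample values, and the choice of a relative (reference-normalized) norm are assumptions made for illustration; the paper does not give its exact formulas or code.

```python
import numpy as np


def mixture_energy_loop(mass_fractions, species_energies):
    """Mixture energy as sum_k Y_k * e_k, looping explicitly over species."""
    total = 0.0
    for y_k, e_k in zip(mass_fractions, species_energies):
        total += y_k * e_k
    return total


def mixture_energy_vectorized(mass_fractions, species_energies):
    """Same sum evaluated as a single dot product over all species."""
    return float(np.dot(mass_fractions, species_energies))


def relative_l2_error(candidate, reference):
    """Relative L2 error ||candidate - reference||_2 / ||reference||_2 (assumed definition)."""
    candidate = np.asarray(candidate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.linalg.norm(candidate - reference) / np.linalg.norm(reference))


# Illustrative values only; not taken from the paper.
Y = np.array([0.1, 0.6, 0.3])          # hypothetical mass fractions
e = np.array([2.0e6, -1.0e5, -1.3e7])  # hypothetical specific energies, J/kg
```

Both energy functions return the same value up to floating-point rounding; the vectorized form simply moves the species loop into compiled NumPy code, which is the kind of change the paragraph credits for the runtime difference.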
Upon additional code review, the improved memory footprint of Chain of Unit-Physics arises from the way the AI-generated code organizes data: state variables are packed into a single contiguous structure rather than being split across multiple arrays and objects, which reduces overhead and allocator fragmentation. The model also introduced an additional safeguard, an internal high-temperature bounds check (T $\geq$ 4000 K) that is evaluated frequently during the integration. This extra validation step improves robustness but adds a small computational penalty ($\sim$5 s) relative to the human-optimized implementation; the vectorized approach offsets the time lost in these checks.

Figure \ref{fig:coderuns}: Comparison between code implementations: Human+AI (Chain of Unit-Physics, green) and Human (orange). Plots show (a) time to solution, (b) peak memory usage, and (c) L$^2$ error of the Chain of Unit-Physics solution with respect to the human-expert reference (single curve only).
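The safeguard and the packed state described above can be sketched as a classical RK4 step acting on one contiguous NumPy array, followed by a temperature-bounds check. This is a minimal sketch under stated assumptions: the names `rk4_step` and `guarded_rk4_step`, the position of temperature in the state vector, and the choice to raise an exception when the bound is reached are all hypothetical and are not taken from the generated solver.

```python
import numpy as np

T_MAX = 4000.0  # K, upper temperature bound quoted in the text


def rk4_step(rhs, t, state, dt):
    """One classical fourth-order Runge-Kutta step for d(state)/dt = rhs(t, state)."""
    k1 = rhs(t, state)
    k2 = rhs(t + 0.5 * dt, state + 0.5 * dt * k1)
    k3 = rhs(t + 0.5 * dt, state + 0.5 * dt * k2)
    k4 = rhs(t + dt, state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def guarded_rk4_step(rhs, t, state, dt, temperature_index=0):
    """RK4 step followed by a bounds check on the temperature entry.

    Assumption: the whole state (temperature plus species) lives in one
    contiguous array, with temperature stored at `temperature_index`.
    """
    new_state = rk4_step(rhs, t, state, dt)
    if new_state[temperature_index] >= T_MAX:
        raise ValueError("temperature reached the upper bound of 4000 K")
    return new_state
```

Running the check after every step is cheap per call but is paid at every one of the many small time steps, which is consistent with the modest overhead the text attributes to the safeguard.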

Figures (3)

  • Figure 1: Conceptual difference between Chain of Unit-Physics approach and existing methods. Left: the existing "code-first" approach, where unit tests are written after implementation, merely exposing latent errors and forcing rework. Right: the proposed approach, in which a human expert specifies first-principles unit tests (e.g., conservation laws) that guide code generation.
  • Figure 2: Chain of Unit-Physics workflow: User queries and unit-physics tests are processed by a supervisor agent (1), which orchestrates chain-of-thought (CoT) code generation (2); diagnostic (3) and verification (4) agents then evaluate the code against physics-based tests and expert knowledge, feeding back signals that steer code synthesis toward physically consistent solutions. The final green‐bordered block confirms successful query execution.
  • Figure 3: Approximate states of the reactor task with primitives. Markers distinguish the correct state, the input to the agent, a detected mismatch, a pruned state, and an incorrect state. Numbers are CoT-confidence scores self-reported by the Code agent.
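To make the notion of a first-principles unit test (e.g., a conservation law, as in the Figure 1 caption) more concrete, the sketch below shows two checks that could plausibly serve as unit-physics primitives for an H$_{2}$–O$_{2}$ system: mass fractions summing to one and conservation of element counts. The function names, tolerances, and the three-species element matrix are hypothetical and chosen for illustration; the paper's actual test suite is not reproduced here.

```python
import numpy as np

# Species order [H2, O2, H2O]; rows count H and O atoms per molecule.
ELEMENT_MATRIX = np.array([[2.0, 0.0, 2.0],
                           [0.0, 2.0, 1.0]])


def mass_fractions_sum_to_one(mass_fractions, tol=1e-8):
    """Primitive: species mass fractions must sum to unity."""
    return abs(float(np.sum(mass_fractions)) - 1.0) <= tol


def elements_conserved(moles_before, moles_after, rtol=1e-8):
    """Primitive: total H and O atom counts are unchanged by reaction."""
    before = ELEMENT_MATRIX @ np.asarray(moles_before, dtype=float)
    after = ELEMENT_MATRIX @ np.asarray(moles_after, dtype=float)
    return bool(np.allclose(before, after, rtol=rtol, atol=0.0))


def failed_tests(named_results):
    """Names of failing primitives, e.g. as feedback for a verification step."""
    return [name for name, passed in named_results.items() if not passed]
```

For instance, converting 2 mol H$_{2}$ and 1 mol O$_{2}$ into 2 mol H$_{2}$O passes `elements_conserved`, whereas a candidate state with only 1.5 mol H$_{2}$O fails it; a list of such failures is the kind of signal a verification agent could return to steer regeneration.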