Discovering High Level Patterns from Simulation Traces

Sean Memery; Kartic Subr

TL;DR

This work tackles the challenge that language models struggle to reason about physics without ground-truth simulation data. It proposes learning a library of high-level event patterns by evolving detectors that annotate detailed simulation traces, producing Annotated Simulation Traces (ASTs) that support natural language reasoning, planning, and reward-program synthesis. The approach combines NL-guided pattern discovery with FunSearch-style program synthesis to grow the pattern library from seed descriptions and to produce executable reward programs for trajectory optimization. Evaluations on the Phyre physics benchmark and a Phyre-derived Q&A task show that ASTs improve LM summarization, question answering, and the quality of learned reward functions, while enabling more efficient optimization and downstream training of value networks. Overall, the pattern-based abstraction provides a scalable, interpretable bridge between physics simulations and NL reasoning, with implications for NL-guided control and learning in physics-rich environments.

Abstract

Artificial intelligence (AI) agents embedded in environments with physics-based interaction face many challenges, including reasoning, planning, summarization, and question answering. These challenges are exacerbated when a human user wishes to guide or interact with the agent in natural language. Although Language Models (LMs) are the default choice for such interaction, they struggle with tasks involving physics: an LM's capability for physical reasoning is learned from observational data rather than being grounded in simulation. A common approach is to include simulation traces as context, but this scales poorly because simulation traces contain large volumes of fine-grained numerical and semantic data. In this paper, we propose a natural language guided method to discover coarse-grained patterns (e.g., 'rigid-body collision', 'stable support') from detailed simulation logs. Specifically, we synthesize programs that operate on simulation logs and map them to a series of high-level activated patterns. We show, through two physics benchmarks, that this annotated representation of the simulation log is more amenable to natural language reasoning about physical systems. We demonstrate how this method enables LMs to generate effective reward programs from goals specified in natural language, which may be used within the context of planning or supervised learning.
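To make the idea of a pattern detector concrete, the following is a minimal Python sketch of a program that maps a fine-grained trace to a short list of pattern activations. It is an illustration under stated assumptions, not the paper's implementation: the trace format (one dictionary of circles per time step), the `Activation` record, and the names `detect_contact_onset` and `annotate` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical trace format (an assumption, not the paper's):
# one dict per time step, mapping object name -> (x, y, radius).
Frame = Dict[str, Tuple[float, float, float]]


@dataclass(frozen=True)
class Activation:
    """One activated pattern: when it fired, its name, and the objects involved."""
    step: int
    pattern: str
    objects: Tuple[str, str]


def detect_contact_onset(trace: List[Frame]) -> List[Activation]:
    """Fire once at the first step of each period in which two circles overlap."""
    activations: List[Activation] = []
    touching_before = set()
    for step, frame in enumerate(trace):
        names = sorted(frame)
        touching_now = set()
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                ax, ay, ar = frame[a]
                bx, by, br = frame[b]
                # Circles touch when centre distance <= sum of radii.
                if (ax - bx) ** 2 + (ay - by) ** 2 <= (ar + br) ** 2:
                    touching_now.add((a, b))
        # Only newly touching pairs produce an activation.
        for pair in sorted(touching_now - touching_before):
            activations.append(Activation(step, "contact_onset", pair))
        touching_before = touching_now
    return activations


Detector = Callable[[List[Frame]], List[Activation]]


def annotate(trace: List[Frame], library: List[Detector]) -> List[Activation]:
    """Run every detector in the library and merge activations in time order."""
    events = [e for detector in library for e in detector(trace)]
    return sorted(events, key=lambda e: (e.step, e.pattern, e.objects))


# Toy trace: a red ball falls onto a stationary green ball.
trace = [
    {"red_ball": (0.0, 5.0, 1.0), "green_ball": (0.0, 0.0, 1.0)},
    {"red_ball": (0.0, 3.0, 1.0), "green_ball": (0.0, 0.0, 1.0)},
    {"red_ball": (0.0, 2.0, 1.0), "green_ball": (0.0, 0.0, 1.0)},
    {"red_ball": (0.0, 1.9, 1.0), "green_ball": (0.0, 0.0, 1.0)},
]
annotated = annotate(trace, [detect_contact_onset])
for event in annotated:
    print(f"t={event.step}: {event.pattern} {event.objects}")
```

In this toy example, four frames of coordinates collapse into a single time-stamped event, which is the kind of compact, nameable annotation that is easier to pass to an LM as context than the raw numbers.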

Paper Structure (35 sections, 1 equation, 16 figures, 3 algorithms)


Introduction
Related work
Language models and physics environments
Reward program synthesis
Program synthesis via FunSearch
Method
Definitions: Patterns, annotation and detectors
Natural language guided pattern discovery
Reward program synthesis
Partial-credit scoring
Evaluation of discovered patterns
Datasets and patterns
Phyre task suite.
Q&A benchmark.
Guided and self-discovered patterns
...and 20 more sections

Figures (16)

Figure 1: (a-d) Example Phyre task. The objective for all Phyre tasks is to place the red ball so that the green and blue objects are in contact at the end of the simulation. (e) Highlighted patterns discovered by our system. (f) Example questions from our Q&A benchmark designed to probe physical reasoning.
Figure 2: (a) Simulation traces $\tau_1, \tau_2$ are mapped to annotated simulation traces (ASTs) $A_1, A_2$ using detector code. Distance metrics $d_x$ and $d_p$ are defined between traces and ASTs. (b) We use FunSearch (Romera-Paredes et al., 2024), with a custom evaluation function, to augment the library with new detector code. (c) Given a custom domain-specific language (DSL), a description of objects in the scene, and the current library of pattern-detecting code, we synthesize a reward program in the DSL, which can be optimized to produce actions. Simulation traces produced by optimized actions are processed for reward evaluation.
Figure 3: Effect of library size on performance within the Q&A and Phyre benchmarks, using the Qwen3-VL 8B (Thinking) language model. Beyond $|\mathcal{P}|=16$, automatic patterns were included. Note that the number of attempts at the Phyre task is capped at $5$ per scene.
Figure 4: Importance values for each pattern in the learned library, computed via leave-one-out ablation on the Q&A and Phyre benchmarks. Importance is measured as the percentage drop in performance when ablating each pattern (higher is more important).
Figure 5: Example annotations produced by the learned library on Phyre rollouts. Each row shows a different scene, with rendered frames on the left and time-stamped pattern activations on the right.
...and 11 more figures
