APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight

Wanjing Huang, Weixiang Yan, Zhen Zhang, Ambuj Singh

TL;DR

APEX addresses a core gap in language-model planning by endowing LLMs with explicit, physics-grounded foresight. It builds a motion-aware relational graph, triggers forward physics rollouts, and uses those predictions to guide language-based decision making, enabling low-latency, physically grounded action. Across Physics Reasoning, Tetris planning, and Dynamic Obstacle Avoidance benchmarks, APEX consistently outperforms vanilla LLMs and vision-based baselines, demonstrating the necessity of explicit physics reasoning for real-world task execution. The approach, framed as Perception–Graph–Language–Physics–Action (PGLPA), decouples numerical physics from probabilistic inference, enhancing interpretability, robustness, and transferability to embodied AI settings.

Abstract

Large Language Models (LLMs) demonstrate strong reasoning and task planning capabilities but remain fundamentally limited in physical interaction modeling. Existing approaches integrate perception via Vision-Language Models (VLMs) or adaptive decision-making through Reinforcement Learning (RL), but they fail to capture dynamic object interactions or require task-specific training, limiting their real-world applicability. We introduce APEX (Anticipatory Physics-Enhanced Execution), a framework that equips LLMs with physics-driven foresight for real-time task planning. APEX constructs structured graphs to identify and model the most relevant dynamic interactions in the environment, providing LLMs with explicit physical state updates. Simultaneously, APEX provides low-latency forward simulations of physically feasible actions, allowing LLMs to select optimal strategies based on predictive outcomes rather than static observations. We evaluate APEX on three benchmarks designed to assess perception, prediction, and decision-making: (1) Physics Reasoning Benchmark, testing causal inference and object motion prediction; (2) Tetris, evaluating whether physics-informed prediction enhances decision-making performance in long-horizon planning tasks; (3) Dynamic Obstacle Avoidance, assessing the immediate integration of perception and action feasibility analysis. APEX significantly outperforms standard LLMs and VLM-based models, demonstrating the necessity of explicit physics reasoning for bridging the gap between language-based intelligence and real-world task execution. The source code and experiment setup are publicly available at https://github.com/hwj20/APEX_EXP.

Paper Structure

This paper contains 57 sections, 19 equations, 15 figures, 17 tables, 1 algorithm.

Figures (15)

  • Figure 1: Comparison of physical reasoning capabilities across three systems (LLM without spatial grounding, VLM and world modeling, and our proposed APEX) on three scenarios involving object prediction, agent-object interaction, and action planning. While vanilla LLMs are not necessarily making random choices in the prediction task, our experimental results (see the experiments section) indicate that their performance is statistically indistinguishable from random selection in this context. APEX provides not only qualitative predictions but also quantitative estimations of outcomes (e.g., time to impact, risk of collision), demonstrating its structured understanding of physical causality.
  • Figure 2: Overview of the APEX reasoning pipeline. Environment snapshots are abstracted into a motion-aware interaction graph via DG-Motion Attention. This graph structure triggers simulation in a physics engine (MuJoCo), which evaluates the outcome of candidate actions. A vanilla LLM then selects an action based on the simulated consequences. This loop (perception → graph trigger → simulation → LLM → action) enables grounded, temporally aware physical reasoning.
  • Figure 3: Illustration of the conventional Vision–Language–Action (VLA) paradigm. Visual perception encodes the real world into language features, which are then mapped directly to action commands. The execution loop closes by applying actions back to the real world. Although conceptually simple, VLA tightly couples perception, reasoning, and control within a single embedding space, limiting interpretability and robustness.
  • Figure 4: Illustration of our Perception–Graph–Language–Physics–Action (PGLPA) paradigm. Perception constructs a relational graph from the real world; this graph informs both symbolic reasoning and an explicit SE(3)-consistent physics simulator. The simulator evaluates candidate actions via rollouts, producing structured feedback that is integrated with LLM-based reasoning before execution. This "real-to-sim-to-real" loop decouples numerical physical computation from probabilistic inference, improving stability, interpretability, and zero-shot transfer.
  • Figure 5: Frame montage from the real-world deployment video, with one frame sampled per second. The sequence illustrates three key phases of the experiment: (6 s) a human moving a green object toward a static red object, (18 s) physical evaluations by the APEX simulation loop, and (19 s) an intervention performed by the robotic arm to prevent collision. This visualization highlights how APEX integrates perception, simulation, and LLM reasoning into grounded physical action.
  • ...and 10 more figures
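
The loop described in the Figure 2 caption (perception → graph trigger → simulation → LLM → action) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several pieces are swapped for toy stand-ins: a closed-form constant-velocity disc model replaces the MuJoCo rollouts, a time-to-contact threshold replaces DG-Motion Attention, and a simple rule replaces the LLM. All names here (`Body`, `time_to_contact`, `interaction_edges`, `evaluate_actions`, `choose_action`) and the `horizon` parameter are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Body:
    """A 2D disc moving at constant velocity (toy stand-in for a simulated object)."""
    name: str
    x: float
    y: float
    vx: float
    vy: float
    radius: float


def time_to_contact(a, b):
    """Earliest t >= 0 at which two constant-velocity discs touch, or None."""
    px, py = b.x - a.x, b.y - a.y          # relative position
    vx, vy = b.vx - a.vx, b.vy - a.vy      # relative velocity
    r = a.radius + b.radius
    c = px * px + py * py - r * r
    if c <= 0:
        return 0.0                          # already touching or overlapping
    qa = vx * vx + vy * vy
    qb = 2.0 * (px * vx + py * vy)
    if qa == 0 or qb >= 0:
        return None                         # no relative motion, or moving apart
    disc = qb * qb - 4.0 * qa * c
    if disc < 0:
        return None                         # paths never come within r
    return (-qb - math.sqrt(disc)) / (2.0 * qa)


def interaction_edges(bodies, horizon):
    """Graph trigger (assumed rule): keep pairs predicted to touch within `horizon`."""
    edges = []
    for a, b in combinations(bodies, 2):
        t = time_to_contact(a, b)
        if t is not None and t <= horizon:
            edges.append((a.name, b.name, t))
    return edges


def evaluate_actions(agent, obstacles, actions, horizon):
    """Roll each candidate agent velocity forward; report earliest contact or None."""
    results = {}
    for label, (vx, vy) in actions.items():
        moved = Body(agent.name, agent.x, agent.y, vx, vy, agent.radius)
        hits = []
        for obstacle in obstacles:
            t = time_to_contact(moved, obstacle)
            if t is not None and t <= horizon:
                hits.append(t)
        results[label] = min(hits) if hits else None
    return results


def choose_action(results):
    """Rule-based stand-in for the LLM: prefer no contact, else the latest contact."""
    return max(results, key=lambda k: math.inf if results[k] is None else results[k])


if __name__ == "__main__":
    agent = Body("agent", 0.0, 0.0, 0.0, 0.0, 0.5)
    obstacle = Body("obstacle", 4.0, 0.0, -1.0, 0.0, 0.5)
    actions = {"stay": (0.0, 0.0), "up": (0.0, 1.0), "forward": (1.0, 0.0)}
    if interaction_edges([agent, obstacle], horizon=5.0):   # simulate only when triggered
        outcomes = evaluate_actions(agent, [obstacle], actions, horizon=5.0)
        print(outcomes, "->", choose_action(outcomes))
```

In this toy scene the obstacle approaches head-on, so the graph trigger fires and the rollouts report a contact at 3.0 s for `stay`, 1.5 s for `forward`, and no contact for `up`, which the stand-in rule then selects. In the pipeline the caption describes, the structured outcomes would be handed to an LLM rather than to a fixed rule.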