ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking

Guangming Wang, Qizhen Ying, Yixiong Jing, Olaf Wysocki, Brian Sheil

TL;DR

This work introduces ActionReasoning, an LLM-driven framework that performs explicit action reasoning to produce physics-consistent, prior-guided decisions for robotic manipulation. By integrating physical reasoning with Large Language Models (LLMs), it offers a promising approach to bridging perception and execution.

Abstract

Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting the scalability of embodied AI and general-purpose robots. Recent data-driven Vision-Language-Action (VLA) approaches aim to learn policies from large-scale simulation and real-world data. However, the continuous action space of the physical world significantly exceeds the representational capacity of linguistic tokens, making it unclear if scaling data alone can yield general robotic intelligence. To address this gap, we propose ActionReasoning, an LLM-driven framework that performs explicit action reasoning to produce physics-consistent, prior-guided decisions for robotic manipulation. ActionReasoning leverages the physical priors and real-world knowledge already encoded in Large Language Models (LLMs) and structures them within a multi-agent architecture. We instantiate this framework on a tractable case study of brick stacking, where the environment states are assumed to be already accurately measured. The environmental states are then serialized and passed to a multi-agent LLM framework that generates physics-aware action plans. The experiments demonstrate that the proposed multi-agent LLM framework enables stable brick placement while shifting effort from low-level domain-specific coding to high-level tool invocation and prompting, highlighting its potential for broader generalization. This work introduces a promising approach to bridging perception and execution in robotic manipulation by integrating physical reasoning with LLMs.
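
The serialization step mentioned in the abstract can be illustrated with a minimal sketch. The schema here is an assumption: the field names (`brick_id`, `position`, `yaw`, `placed`), the units, and the `serialize_state` helper are hypothetical and are not taken from the paper, which does not specify its state format at this point.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Brick:
    # Hypothetical fields; the paper does not specify its state schema here.
    brick_id: int
    position: tuple  # (x, y, z), assumed to be in metres
    yaw: float       # rotation about the vertical axis, assumed radians
    placed: bool


def serialize_state(bricks, gripper_position, gripper_open):
    """Pack the measured scene state into a JSON string for an LLM prompt."""
    state = {
        "bricks": [asdict(b) for b in bricks],
        "gripper": {"position": list(gripper_position), "open": gripper_open},
    }
    # sort_keys gives a stable key order, so identical states yield identical text.
    return json.dumps(state, sort_keys=True)
```

A call such as `serialize_state([Brick(0, (0.1, 0.2, 0.03), 0.0, False)], (0.0, 0.0, 0.5), True)` returns a single JSON string that could be embedded in an agent prompt.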

Paper Structure (28 sections, 22 equations, 5 figures, 1 table)

This paper contains 28 sections, 22 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Three stages of the ActionReasoning pipeline. (1) Inputs & World Model: The world model of bricklaying is provided as input. (2) Multi-Agent Planning: Using the world model, an LLM orchestrator decomposes the task among specialized agents that generate actions and waypoints to plan the motion of a selected brick toward its target location. (3) Simulator & Robot Execution: The robot (a KUKA simulator in this case) executes the actions planned by the agents to control grasp and motion. Observations of changes in the 3D scene and robot arm state are used to update the world model, enabling continual re-planning as the task progresses.
  • Figure 2: Illustration of the six agents ($Ag_1$ to $Ag_6$) in the ActionReasoning framework: (1) Pre-grasp positioning to guide the arm toward the brick; (2) Opening and descent to position the gripper above the brick surface; (3) Grasp closure to secure the brick; (4) Safe lift to raise the brick from the ground; (5) Brick placement to move the brick stably and align it accurately at the target location; and (6) Grasp release to set the brick down. The corresponding execution of each agent is illustrated on the simulated KUKA robot arm at the bottom of the figure.
  • Figure 3: Detailed architecture of Agent 5 (Brick Placement) with six prompt-driven components. (1) Current environment state: provides the latest world model; (2) Memory information: summarizes the progress of previous agents and the current task status; (3) Role definition: specifies the agent’s function and responsibilities, including collision avoidance for stable brick placement. A comparison of brick laying with and without this role specification is shown alongside; (4) Knowledge base: describes domain knowledge such as dynamics, gripper strategies, and safety constraints; (5) Thinking chain: outlines stepwise reasoning for placement and retreat; and (6) Output format: structured JSON commands for execution in the KUKA simulator.
  • Figure 4: Qualitative comparison across two stacking patterns. Each block shows the baseline (top row) and our method (bottom row). Within each block, columns show selected stages of the placement cycle in chronological order from left to right. Our method achieves noticeably better brick alignment, as evidenced by the more neatly stacked bricks.
  • Figure 5: Ablation visualization comparing the single-agent and multi-agent configurations. Rows show the single-agent and multi-agent configurations; columns show key stages of the placement cycle in chronological order from left to right. The single agent toppled the structure and failed to complete the wall.
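
The captions for Figures 2 and 3 describe a fixed sequence of six agents, each driven by a prompt built from six components and returning a structured JSON command. A minimal sketch of that flow follows. It is not the authors' implementation: the stage identifiers, the section titles, the assumed JSON keys (`action`, `waypoints`), and the `call_llm` callable (a stand-in for whatever LLM client is used) are all hypothetical. Re-planning from simulator observations, described in Figure 1, is omitted for brevity.

```python
import json

# Stage names follow the Figure 2 caption; the identifiers are illustrative.
AGENT_STAGES = [
    "pre_grasp_positioning",
    "opening_and_descent",
    "grasp_closure",
    "safe_lift",
    "brick_placement",
    "grasp_release",
]


def build_prompt(state, memory, role, knowledge, thinking_chain, output_format):
    """Join the six prompt components listed in the Figure 3 caption."""
    sections = [
        ("Current environment state", json.dumps(state, sort_keys=True)),
        ("Memory information", memory),
        ("Role definition", role),
        ("Knowledge base", knowledge),
        ("Thinking chain", thinking_chain),
        ("Output format", output_format),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


def parse_command(reply):
    """Parse an agent reply as JSON and check the (assumed) required keys."""
    command = json.loads(reply)
    missing = {"action", "waypoints"} - set(command)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return command


def run_cycle(state, call_llm):
    """Run the six stages in order, passing a short memory between agents."""
    memory, commands = [], []
    for stage in AGENT_STAGES:
        prompt = build_prompt(
            state,
            memory="; ".join(memory) or "none",
            role=f"You are the {stage} agent.",
            knowledge="Avoid collisions; keep the brick level.",  # placeholder text
            thinking_chain="Reason step by step before answering.",
            output_format='JSON with keys "action" and "waypoints".',
        )
        command = parse_command(call_llm(prompt))
        commands.append(command)
        memory.append(f"{stage} -> {command['action']}")
    return commands
```

Passing `call_llm` as a parameter keeps the sketch independent of any particular LLM API and lets the sequence be exercised with a stub that returns a fixed JSON string.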