VLM-DEWM: Dynamic External World Model for Verifiable and Resilient Vision-Language Planning in Manufacturing

Guoqin Tang, Qingxuan Jia, Gang Chen, Tong Li, Zeyuan Huang, Zihang Lv, Ning Ji

TL;DR

VLM-DEWM tackles world-state drift in long-horizon vision-language planning for manufacturing by decoupling persistent world memory from semantic reasoning, using a Dynamic External World Model (DEWM). It introduces an Externalizable Reasoning Trace (ERT) to structure VLM outputs as auditable transactions validated against the world model, enabling targeted discrepancy-driven diagnosis. The architecture combines a geometry-semantic hybrid Environment Memory Core with a lightweight, VLM-guided planning loop, achieving verifiable and resilient robotic planning in simulation and real-robot experiments. Results show state-tracking accuracy improved from 56% to 93%, recovery success raised from below 5% to 95%, and inference overhead reduced by over 70%, highlighting practical impact for dynamic manufacturing.

Abstract

Vision-language models (VLMs) show promise for high-level planning in smart manufacturing, yet their deployment in dynamic workcells faces two critical challenges: (1) stateless operation: they cannot persistently track out-of-view states, causing world-state drift; and (2) opaque reasoning: failures are difficult to diagnose, leading to costly blind retries. This paper presents VLM-DEWM, a cognitive architecture that decouples VLM reasoning from world-state management through a persistent, queryable Dynamic External World Model (DEWM). Each VLM decision is structured into an Externalizable Reasoning Trace (ERT), comprising an action proposal, a world belief, and a causal assumption, which is validated against the DEWM before execution. When failures occur, discrepancy analysis between predicted and observed states enables targeted recovery instead of global replanning. We evaluate VLM-DEWM on multi-station assembly, large-scale facility exploration, and real-robot recovery under induced failures. Compared to baseline memory-augmented VLM systems, VLM-DEWM improves state-tracking accuracy from 56% to 93%, increases recovery success rate from below 5% to 95%, and significantly reduces computational overhead through structured memory. These results establish VLM-DEWM as a verifiable and resilient solution for long-horizon robotic operations in dynamic manufacturing environments.
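To make the validate-before-execute idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the world model can be reduced to a flat dictionary of facts; the class names `ERT` and `DEWM`, the `validate` method, and keys such as `block_A.location` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ERT:
    """Hypothetical container for one Externalizable Reasoning Trace."""
    action: str              # action proposal, e.g. "pick(block_A)"
    world_belief: dict       # facts the VLM believes hold before acting
    causal_assumption: dict  # facts the VLM expects to hold after acting


@dataclass
class DEWM:
    """Toy stand-in for the persistent world model: a flat fact store."""
    facts: dict = field(default_factory=dict)

    def validate(self, ert: ERT) -> list:
        """Return every belief in the trace that the stored facts do not support.

        Each conflict is a (key, believed_value, stored_value) tuple.
        A key missing from the store is treated as a conflict (assumption).
        """
        conflicts = []
        for key, believed in ert.world_belief.items():
            if key not in self.facts or self.facts[key] != believed:
                conflicts.append((key, believed, self.facts.get(key)))
        return conflicts


# The store remembers where block_A is even when it is out of view.
dewm = DEWM({"block_A.location": "station_1", "gripper.holding": None})

consistent = ERT(
    action="pick(block_A)",
    world_belief={"block_A.location": "station_1"},
    causal_assumption={"gripper.holding": "block_A"},
)
stale = ERT(
    action="pick(block_A)",
    world_belief={"block_A.location": "station_2"},
    causal_assumption={"gripper.holding": "block_A"},
)

consistent_conflicts = dewm.validate(consistent)  # nothing to report
stale_conflicts = dewm.validate(stale)            # the location belief is outdated
```

In this sketch an action would be dispatched only when `validate` returns an empty list; a non-empty list names exactly which belief drifted, which is the kind of auditable signal the abstract contrasts with blind retries.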

Paper Structure (178 sections, 41 equations, 11 figures, 11 tables, 4 algorithms)


Figures (11)

  • Figure 1: Architectural Overview of the VLM-DEWM Framework. Operating on a Database-Transaction paradigm, the system is orchestrated through three tightly coupled modules. The perception module first projects raw multimodal sensory streams into a structured instantaneous snapshot ($Env_t$) via geometric projection and filtration. This snapshot drives the thinking module, where the VLM serves as the reasoning core to simultaneously update the external world model (DEWM), the central knowledge hub comprising environmental ($\Omega$) and internal ($M, CS$) memory, and synthesize a structured Externalizable Reasoning Trace (ERT). Crucially, the validation module treats the persistent memory in the DEWM as ground-truth evidence to rigorously verify the ERT's logical and physical consistency before dispatching the final action sequence to the Robot Skill Library $\mathcal{A}$ for execution.
  • Figure 2: Structural Composition of the Environment Memory Core ($\Omega$). The Spatial Network ($\mathcal{S}$) maintains geometric state (left), while the Hierarchical Semantic Graph ($\mathcal{G}$) encodes logical relations (right). The Grounding Link ($\Phi_{ground}$, red) anchors every semantic node to its physical instantiation.
  • Figure 3: The Perception-to-Memory Synchronization Pipeline. Raw observations ($O_t$) are transformed into persistent belief updates via three stages: (1) task-focused perception producing $Env_t$, (2) geometric-semantic data association, and (3) dual-layer updates (continuous S-Update and discrete G-Update with geometric verification).
  • Figure 4: The Progressive VLM-Guided Planning and Deep Recovery Loop (Sec. 3.3). Dynamic Context Engineering distills the DEWM state ($\Omega, M, CS$) into a task-relevant prompt. The VLM then generates an Externalizable Reasoning Trace (ERT), which is mathematically validated before execution. On failure, the Deep Causal Diagnosis module computes the belief-outcome discrepancy ($\Delta = \Omega_{t+1} \ominus \mathcal{A}_{causal}$) to generate targeted feedback for belief correction and informed recovery.
  • Figure 5: Task 1 Environment Setup. The multi-zone factory simulation scene containing blocks, spheres, and cylinders distributed across three distinct platforms. This raw visual input triggers the memory construction process.
  • ...and 6 more figures
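
The belief-outcome discrepancy in the Figure 4 caption, $\Delta = \Omega_{t+1} \ominus \mathcal{A}_{causal}$, can be illustrated with a small Python sketch. This is an assumption-laden reading rather than the paper's operator: it treats both the observed state and the causal assumption as flat dictionaries and interprets $\ominus$ as "list every predicted fact that the observation contradicts". The function names `discrepancy` and `feedback` and the example keys are hypothetical.

```python
def discrepancy(observed: dict, predicted: dict) -> dict:
    """Return {key: (expected, actually_observed)} for each violated prediction.

    `observed` plays the role of the post-action world state, and
    `predicted` plays the role of the causal assumption in the trace.
    """
    return {
        key: (expected, observed.get(key))
        for key, expected in predicted.items()
        if observed.get(key) != expected
    }


def feedback(delta: dict) -> str:
    """Turn a discrepancy into a short, targeted message for the planner."""
    if not delta:
        return "All causal assumptions held."
    lines = [
        f"- {key}: expected {expected!r}, observed {actual!r}"
        for key, (expected, actual) in sorted(delta.items())
    ]
    return "Violated assumptions:\n" + "\n".join(lines)


# A failed grasp: the plan assumed block_A would end up in the gripper.
predicted = {"gripper.holding": "block_A", "block_A.location": "gripper"}
observed = {"gripper.holding": None, "block_A.location": "station_1"}

delta = discrepancy(observed, predicted)
message = feedback(delta)
```

Because the message lists only the violated assumptions, a planner receiving it can correct the specific belief and retry the affected step, which matches the caption's contrast between informed recovery and starting over.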