SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

Yuyuan Yang, Junkun Hong, Hongrong Wang, Honghao Cai, Xunpeng Ren, Ge Wang, Mingcong Lei, Shenhao Yan, Jiahao Yang, Chengsi Yao, Xi Li, Yiming Zhao, Yatong Han, Jinke Ren

Abstract

Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source models (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
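
For reference, the "purely relative" property mentioned above can be read off the standard DPO objective as it is commonly written. The notation here ($x$ for the context, $y_w$ and $y_l$ for the winning and losing trajectories, $\pi_{\text{ref}}$ for the frozen reference policy, $\beta$ for the temperature) is the usual one and is assumed, not taken from the paper:

$$
\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Only the difference of the two log-ratios enters the loss. Subtracting the same constant $c$ from both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ leaves $\mathcal{L}_{\text{DPO}}$ unchanged, so the loss alone does not prevent the likelihood of the expert trajectory from falling as long as the losing trajectory falls at least as fast.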

Paper Structure (18 sections, 4 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: A comparison between existing task planning methods and our proposed SVLL framework. (a) Existing Methods: Existing methods suffer from premature temporal binding due to joint end-to-end training and likelihood displacement caused by standard DPO, which pushes the policy away from the expert distribution. (b) Ours (SVLL): SVLL addresses these issues by integrating staged learning and Bias-DPO, where the former decouples spatial grounding from temporal reasoning, while the latter introduces strong inductive biases to explicitly anchor the policy to the feasible expert manifold and suppress confident errors.
  • Figure 2: Execution of long-horizon embodied tasks via SVLL. (a) Real-World Deployment: A physical robotic arm executes a sequence of visually grounded actions: locating a banana, picking it up, placing it into a basket, and moving the basket to a designated target. (b) Simulated Environment: An agent in an interactive 3D simulator successfully performs a multi-step task involving finding a pen, picking it up, opening a drawer, and physically placing the pen inside. Both deployments demonstrate strict adherence to physical constraints and robust causal reasoning.
  • Figure 3: Formulation of embodied task planning. The agent engages in a closed-loop interaction with the environment based on natural language instructions. The objective is to maximize the task success rate ($\mathcal{R}_{\text{success}}$), subject to a critical admissibility constraint: every intermediate action $a_t$ must be physically executable and grounded within the valid visual affordance space $\mathcal{V}(I_t)$ of the current observation.
  • Figure 4: Overview of the SVLL framework. To mitigate the premature temporal binding in standard end-to-end training, SVLL decouples spatial and temporal learning across three stages: Stage 1 blocks the action history to force reliance on current visual affordances; Stage 2 unlocks the history context, inheriting robust visual features from Stage 1 to learn sequential dependencies; Stage 3 refines the initial policy by anchoring it to the expert manifold via an auxiliary $\mathcal{L}_{\text{SFT}}$ and a threshold-triggered penalty $\mathcal{L}_{\text{UL}}$, effectively overcoming the likelihood displacement inherent in standard DPO.
  • Figure 5: Comparison of the RoboBrain2.0-32B model and our 7B-parameter SVLL-Stage 3 model in executing real-world tasks. Both models are tasked with locating an object and placing it inside the microwave. (a) Physical Constraint Violation: The baseline model successfully navigates to and picks up the apple but commits a critical physical constraint violation by attempting to Place it without first opening the microwave door. (b) Strict Causal Adherence: Our model successfully locates the pepper, explicitly executes the prerequisite Open action, and then safely places the object inside, demonstrating strict adherence to causal physical constraints throughout the task.
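
The Stage 3 objective described in the Figure 4 caption combines a preference term with an auxiliary $\mathcal{L}_{\text{SFT}}$ and a threshold-triggered penalty $\mathcal{L}_{\text{UL}}$. The following is a minimal sketch of how such a combination could look for a single preference pair, not the authors' implementation. The function names, the weights `lam_sft` and `lam_ul`, the threshold `tau`, and the exact form of the penalty (an unlikelihood-style term on the rejected action) are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair; depends only on the log-ratio gap."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def bias_dpo_sketch(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, lam_sft=1.0, lam_ul=1.0, tau=0.5):
    """Hypothetical combined loss: DPO + SFT anchor + thresholded penalty.

    logp_w / logp_l are policy log-probabilities of the expert (winning)
    and rejected (losing) actions; ref_* are the reference-policy values.
    """
    l_dpo = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)

    # SFT anchor: negative log-likelihood of the expert action, which adds
    # the absolute likelihood constraint that DPO alone lacks.
    l_sft = -logp_w

    # Assumed penalty form: active only when the rejected action is
    # assigned probability above tau (an "overconfident" error).
    p_l = math.exp(logp_l)
    l_ul = -math.log(max(1.0 - p_l, 1e-12)) if p_l > tau else 0.0

    return l_dpo + lam_sft * l_sft + lam_ul * l_ul


if __name__ == "__main__":
    # Lowering both policy log-probs by 4 leaves the DPO loss unchanged...
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0), dpo_loss(-5.0, -7.0, -2.0, -2.0))
    # ...but the sketched combined loss penalizes the drop in expert likelihood.
    print(bias_dpo_sketch(-1.0, -3.0, -2.0, -2.0),
          bias_dpo_sketch(-5.0, -7.0, -2.0, -2.0))
```

In this sketch the SFT term is what ties the policy to the expert action in absolute terms, while the penalty stays at zero unless the rejected action's probability crosses `tau`. How the paper weights, thresholds, or aggregates these terms over a trajectory is not stated in the text above.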