EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning

Xinyan Cai, Shiguang Wu, Dafeng Chi, Yuzheng Zhuang, Xingyue Quan, Jianye Hao, Qiang Guan

TL;DR

EVLP presents a unified multimodal Vision-Language Planner that jointly reasons over language and visuals to tackle long-horizon embodied manipulation. The approach combines a vision tower that integrates understanding and generation, one-step image-token generation, dynamic perception pretraining with inverse and forward dynamics tasks, and Reinforced Supervised Fine-Tuning (RSFT) to align spatial logic with generated visuals. Key contributions include a sampling-efficient generator that models $p(\cdot|c)$ in a single forward pass, bidirectional pretraining for cross-modal world modeling, and RSFT, which merges maximum likelihood with policy gradients to enforce dynamic consistency. Empirical results on LoHoRavens and Meeting Preparation show EVLP surpassing strong baselines in success rate and planning fidelity, and real-world validation on BridgeData v2 shows improved language and visual subgoal quality, suggesting practical impact for robust, instruction-driven embodied AI.

Abstract

In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework for multimodal planning, which leads to inconsistent multimodal plans. To address this challenge, we present EVLP (Embodied Vision-Language Planner), an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: 1) Unified Multimodal Generation Framework: For understanding, we integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete images for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. 2) Dynamic Perception Pretraining: We propose a bidirectional dynamic alignment strategy employing inverse dynamics tasks and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. 3) Reinforced Supervised Fine-Tuning: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforce loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatially aware multimodal planning capabilities.
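
The abstract describes Reinforced Supervised Fine-Tuning only at a high level: a supervised (maximum-likelihood) objective combined with a reinforce loss. The following is a minimal sketch of that idea, not the paper's actual objective. It assumes the two terms are simply added with a weight `lam`, that the reinforce term uses a mean-reward baseline, and that log-probabilities are plain floats; `lam`, the baseline, the function names, and the toy numbers are all illustrative assumptions.

```python
import math

def sft_loss(target_logps):
    """Maximum-likelihood term: mean negative log-probability of reference tokens."""
    return -sum(target_logps) / len(target_logps)

def reinforce_loss(sample_logps, rewards):
    """REINFORCE surrogate with a mean-reward baseline (baseline is an assumption)."""
    baseline = sum(rewards) / len(rewards)
    weighted = [(r - baseline) * lp for lp, r in zip(sample_logps, rewards)]
    return -sum(weighted) / len(weighted)

def rsft_loss(target_logps, sample_logps, rewards, lam=0.5):
    """Hypothetical combined objective: SFT term plus lam times the reinforce term."""
    return sft_loss(target_logps) + lam * reinforce_loss(sample_logps, rewards)

# Toy values: log-probs the model assigns to the reference output, and to two
# sampled outputs that received rewards 1.0 and 0.0 respectively.
target_logps = [math.log(0.5), math.log(0.25)]
sample_logps = [math.log(0.5), math.log(0.25)]
rewards = [1.0, 0.0]
loss = rsft_loss(target_logps, sample_logps, rewards)
print(round(loss, 4))
```

In a real training loop the log-probabilities would be differentiable tensors and the rewards would come from a consistency check between textual actions and generated images; the floats here only show how the two terms combine.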

Paper Structure

This paper contains 54 sections, 12 equations, 10 figures, 8 tables, 2 algorithms.

Figures (10)

  • Figure 1: Our overall framework diagram. In terms of the model architecture, we adopt a vision tower design that integrates understanding and generation. For image understanding, we combine SigLIP with a learnable spatial encoder, while for image generation, we introduce image tokens to achieve one-step generation. Regarding the training pipeline, we design a two-stage framework: dynamic perception pretraining (illustrated above) and reinforced supervised fine-tuning (illustrated below). The black arrows represent the forward process, while the red arrows indicate the backward process.
  • Figure 2: (1) A diffusion-based model formulates image generation as $x_{0:N}^{t-1} \sim p(\cdot|c,x_{0:N}^{t})$. Sampling $n$ samples from the distribution $p(\cdot|c)$ requires $n \times T$ forward passes, where $T$ denotes the number of diffusion denoising steps. (2) An autoregressive model formulates image generation as $x_{i} \sim p(\cdot|c,x_{0:i-1})$. Sampling $n$ samples from $p(\cdot|c)$ requires $n \times N$ forward passes, where $N$ is the token count. (3) Our model directly models $p(\cdot|c)$, enabling the sampling of $n$ samples with only one forward pass.
  • Figure 3: (1) SFT optimizes the model by minimizing the KL divergence between model outputs and the dataset distribution, but it lacks per-sample preference alignment. (2) Reinforcement Learning (RL) aligns preferences through per-sample feedback, focusing on reward maximization but risking distribution shift. (3) Reinforced Supervised Fine-Tuning (RSFT) combines distribution constraints with per-sample optimization, enforcing preference alignment under maximum likelihood constraints.
  • Figure 4: Comparison of generation effects between RSFT and SFT shows that RSFT generates more finely detailed results with better dynamic consistency.
  • Figure 5: Visualization of Real-World Dataset experiments, showcasing EVLP’s planning quality in complex, real-world scenes.
  • ...and 5 more figures
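
The forward-pass counts stated in the Figure 2 caption can be made concrete with a small sketch. The counting function follows the caption directly. The sampling function is an assumption about what "one forward pass" could look like: it treats the model output as an independent categorical distribution per image-token position, which the caption does not specify. Function names and the toy numbers are hypothetical.

```python
import random

def forward_passes(method, n_samples, denoise_steps=50, n_tokens=256):
    """Forward passes needed to draw n_samples images, per the Figure 2 caption."""
    if method == "diffusion":
        return n_samples * denoise_steps   # n x T
    if method == "autoregressive":
        return n_samples * n_tokens        # n x N
    if method == "one_step":
        return 1                           # p(.|c) is produced once
    raise ValueError(f"unknown method: {method}")

def sample_token_grids(probs, n_samples, rng):
    """Draw n_samples token sequences from the output of a single forward pass.

    probs: list with one entry per token position, each a list of vocabulary
    probabilities. Per-position independence is an illustrative assumption.
    """
    vocab = range(len(probs[0]))
    return [
        [rng.choices(vocab, weights=p, k=1)[0] for p in probs]
        for _ in range(n_samples)
    ]

# Toy output of one forward pass: 3 token positions, vocabulary of size 3.
probs = [[0.0, 1.0, 0.0], [0.2, 0.3, 0.5], [1.0, 0.0, 0.0]]
samples = sample_token_grids(probs, n_samples=4, rng=random.Random(0))

for method in ("diffusion", "autoregressive", "one_step"):
    print(method, forward_passes(method, n_samples=8))
```

Once the distribution is available, drawing additional samples costs only the sampling step, which is why the count for the one-step case does not grow with the number of samples.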