3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

TL;DR

The paper addresses the core problem of insufficient 3D spatial grounding in Vision-Language Models (VLMs) for robotic task planning. It introduces a modular framework that couples a 2D prompt synthesis module (which maps 2D images to 3D point clouds) with a frozen VLM and a back-end Small Language Model (SLM) for supervisory refinement, enabling robust, 3D-aware reasoning without extensive retraining. A confidence-based registration strategy with entropy-based components binds multimodal data, while nearest-neighbor and iterative prompting strategies enhance spatial precision and adaptability. Experimental results on a FRANKA robotic arm report a 96.0% Task Success Rate (TSR) and show that removing either the 2D prompt synthesis or the SLM supervision greatly degrades performance, underscoring the critical role of both components. Overall, the framework offers a scalable, data-efficient path to reliable 3D-grounded robotic planning in dynamic environments, with demonstrated improvements in 3D recognition, localization, and execution reliability over state-of-the-art baselines.

Abstract

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and limited reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module, which maps 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.
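
To make the idea of mapping 2D images to point clouds concrete, the following is a minimal sketch of one common way to associate pixels with 3D points: a pinhole-camera projection. It is not the paper's method. The function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the assumption that the points are already expressed in the camera frame are all illustrative choices.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]


def project_points(points: List[Point3D], fx: float, fy: float,
                   cx: float, cy: float,
                   width: int, height: int) -> Dict[Pixel, Point3D]:
    """Project camera-frame 3D points to pixels with a pinhole model.

    Returns a pixel -> 3D point lookup that keeps the nearest point per pixel.
    All parameters are hypothetical; the paper does not specify them.
    """
    lookup: Dict[Pixel, Point3D] = {}
    for x, y, z in points:
        if z <= 0:  # point is behind the camera, so it cannot be imaged
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image bounds
        if (u, v) not in lookup or z < lookup[(u, v)][2]:
            lookup[(u, v)] = (x, y, z)
    return lookup


def mask_centroid(lookup: Dict[Pixel, Point3D],
                  mask_pixels: Iterable[Pixel]) -> Optional[Point3D]:
    """Average the 3D points whose pixels fall inside a 2D segmentation mask."""
    hits = [lookup[p] for p in mask_pixels if p in lookup]
    if not hits:
        return None
    n = len(hits)
    return (sum(p[0] for p in hits) / n,
            sum(p[1] for p in hits) / n,
            sum(p[2] for p in hits) / n)
```

With a lookup like this, a 2D mask or annotation on the image can be turned into a 3D position estimate, which is the kind of spatial cue a 2D-trained VLM would otherwise lack.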

Paper Structure

This paper contains 70 sections, 29 equations, 8 figures, 2 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overview of the robotic task execution process using a Franka robotic arm. The cloud (top left) defines the task involving objects on a table. Key challenges (top right bubble) include: Perception, Localization, and Planning. The proposed solution (right) integrates multimodal perception (camera & lidar) with reasoning (VLM). The toolbox (bottom left) outlines available resources, including sensors, computing modules, and robotic skills.
  • Figure 2: The overall architecture of the proposed framework. The framework consists of three main components: 2D Prompt Synthesis Module (orange), including Process & Alignment (light yellow) for multimodal data preprocessing and alignment, and Registration & Synthesis (light yellow) for credit-based prompt generation. A red arrow indicates data flow between these submodules. The Frozen Vision-Language Model (VLM, gray) serves as the reasoning core, receiving inputs from the 2D Prompt Synthesis Module and Text Prompts. A dashed arrow represents iterative refinement with the iterative prompt algorithm in Registration & Synthesis. The Back-End Small Language Model (SLM) Supervision (brown) validates and refines outputs via a solid arrow, with a dashed arrow enabling feedback correction to the VLM. Final validated outputs are archived in the Archive Historical Responses submodule.
  • Figure 3: The architecture of the Vision-Language Model (VLM). Inputs include a robot task (top-left), segmented image (bottom-left), and text template (top-right), processed by encoders to generate feature vectors. The cross-modal alignment module (center) integrates text and image features, producing a unified representation. The multimodal decoder (right) generates executable robot control codes.
  • Figure 4: The figure illustrates the process of computing confidence scores using filtered point cloud data and corresponding paired image data. The central section outlines the computational steps, including mask processing, 3D point cloud analysis, and RGB image integration. The rightmost block presents the final assignment results, where confidence scores are overlaid onto the original image for visualization.
  • Figure 5: The figure illustrates confidence-driven strategies for task-specific prompting. The left section represents time-sensitive tasks, which use a 2D Image Annotation Strategy that quickly filters and annotates key credit points. The middle section corresponds to precision-sensitive tasks, which use an Interactive Multi-Step Prompting Strategy; it shows the query-and-analyze interaction among the Vision-Language Model (VLM), Credit Image, and RGB Image for iterative refinement. The right section provides a Prompt Template, whose green and pink blocks align with their corresponding components in the middle section, offering structured guidance for logical and accurate task execution.
  • ...and 3 more figures
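
The supervision loop described in the Figure 2 caption (the SLM validates VLM output, feeds corrections back, and archives validated responses) could be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `supervised_generate`, the callable signatures, `max_rounds`, and the parenthesis-balance check standing in for the SLM are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; neither signature comes from the paper.
Generator = Callable[[str, Optional[str]], str]         # (task, feedback) -> code
Checker = Callable[[str], Tuple[bool, Optional[str]]]   # code -> (ok, feedback)


def supervised_generate(task: str, generate: Generator, check: Checker,
                        archive: List[str], max_rounds: int = 3) -> Optional[str]:
    """Ask the generator for code, let the checker accept it or return feedback.

    Accepted outputs are appended to `archive`. Returns None if no candidate
    passes within `max_rounds` attempts.
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        ok, feedback = check(candidate)
        if ok:
            archive.append(candidate)
            return candidate
    return None


# Toy stand-ins for the VLM and the SLM, for illustration only.
def toy_generate(task: str, feedback: Optional[str]) -> str:
    # The first attempt is deliberately malformed; feedback triggers a fix.
    return "pick('cup')" if feedback else "pick('cup'"


def toy_check(code: str) -> Tuple[bool, Optional[str]]:
    if code.count("(") != code.count(")"):
        return False, "unbalanced parentheses"
    return True, None


history: List[str] = []
result = supervised_generate("pick up the cup", toy_generate, toy_check, history)
```

Bounding the number of rounds keeps the loop from cycling indefinitely when the generator never produces an acceptable output; a real system would need a fallback for that case.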