From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

Xiuchao Sui, Daiying Tian, Qi Sun, Ruirui Chen, Dongkyu Choi, Kenneth Kwok, Soujanya Poria

TL;DR

The paper analyzes how foundation models can be integrated into embodied robotic systems to enable language-guided control, comparing end-to-end vision-language-action (VLA) models, modular vision-language model (VLM) pipelines, and multimodal LLM orchestrators. Through two tabletop case studies on instruction grounding and object manipulation, it shows distinct trade-offs: end-to-end VLAs offer streamlined control but can be data-hungry and hard to ground; modular VLM pipelines provide interpretability and efficiency but risk brittle error propagation; multimodal LLM agents deliver strong reasoning and grounding but incur high computational cost and are harder to deploy. Experiments in zero-shot and few-shot settings show that the paradigms differ in generalization and data efficiency, and that robustness, sim-to-real transfer, and rapid adaptation remain significant challenges. The paper concludes with design implications, identifies data scarcity, inference efficiency, and safety as central obstacles, and proposes directions for developing practical FM-powered robotic systems for real-world environments.

Abstract

Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (LLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.

Paper Structure

This paper contains 42 sections, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Key challenges of FMs in embodied robotics.
  • Figure 2: Three FM integration strategies for embodied robotics, highlighting distinct interfaces between language, perception, and control.
  • Figure 3: Experimental setup for two case studies in a cluttered tabletop environment. The top row shows egocentric video data collected for the manipulation case study. The bottom row shows an example setup for the instruction grounding task: an annotated visual prompt paired with complex instructions in three forms (implicit, explicit with attributes, and explicit with spatial references).
  • Figure 4: Performance of complex instruction grounding across modular VLM pipeline and MLLMs. Macro accuracy is reported across instruction types (implicit, attribute-based, and relationship-based). Subfigures show (a) proprietary models and (b) open-source models along with their Int4-quantized variants.
  • Figure 5: Partial fine-tuning results for VLAs (OpenVLA and $\pi_0$) compared with training Diffusion Policy (DP) and ACT from scratch on our dataset. VLAs require more epochs to converge and show higher performance variance.
  • ...and 6 more figures