From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems
Xiuchao Sui, Daiying Tian, Qi Sun, Ruirui Chen, Dongkyu Choi, Kenneth Kwok, Soujanya Poria
TL;DR
The paper analyzes how foundation models (FMs) can be integrated into embodied robotic systems to enable language-guided control, comparing three paradigms: end-to-end vision-language-action (VLA) models, modular vision-language model (VLM) pipelines, and multimodal LLM orchestrators. Through two tabletop case studies on instruction grounding and object manipulation, it shows distinct trade-offs: end-to-end VLAs offer streamlined control but can be data-hungry and hard to ground; modular VLM pipelines provide interpretability and efficiency but risk brittle error propagation; multimodal LLM agents deliver strong reasoning and grounding, but at high computational cost and with deployment challenges. Experiments in zero-shot and few-shot settings reveal differing generalization and data-efficiency characteristics, with significant remaining challenges in robustness, sim-to-real transfer, and rapid adaptation. The paper concludes with design implications, identifies data scarcity, inference efficiency, and safety as central obstacles, and proposes directions for developing practical FM-powered robotic systems for real-world environments.
Abstract
Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored, particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, modular pipelines built around vision-language models (VLMs), and modular pipelines built around multimodal large language models (LLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.
