Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, Shibiao Xu
TL;DR
The paper surveys multimodal fusion methods and vision-language models (VLMs) for robot vision through a task-oriented lens, covering semantic scene understanding, 3D object detection, navigation, SLAM, and manipulation. It contrasts encoder-decoder, attention-based, and graph-based fusion with the evolving role of large-scale VLMs, emphasizing cross-modal alignment, pretraining, and lightweight deployment. Core contributions include a structured synthesis of fusion architectures, a comparative analysis of datasets and benchmarks, and forward-looking directions such as self-supervised cross-modal learning, structured spatial memory, and ethically aligned deployment. The work highlights practical challenges in real-time performance, data quality, and domain adaptation, and it outlines actionable pathways to build robust, generalizable, and operational robotic vision systems. Overall, the survey provides a comprehensive reference for researchers and practitioners aiming to integrate multimodal perception and reasoning into autonomous robots.
Abstract
Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We adopt a task-oriented perspective to systematically review the applications and advancements of multimodal fusion methods and VLMs in the field of robot vision. For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks. We also analyze the architectural characteristics and practical implementations of these fusion strategies in key tasks such as simultaneous localization and mapping (SLAM), 3D object detection, navigation, and manipulation. We compare the evolutionary paths and applicability of VLMs based on large language models (LLMs) with traditional multimodal fusion methods. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Building on this analysis, we identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation. We propose future directions such as self-supervised learning for robust multimodal representations, structured spatial memory and environment modeling to enhance spatial intelligence, and the integration of adversarial robustness and human feedback mechanisms to enable ethically aligned system deployment. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.
