Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, Shibiao Xu

TL;DR

The paper surveys multimodal fusion methods and vision-language models (VLMs) for robot vision from a task-oriented perspective, covering semantic scene understanding, 3D object detection, navigation, SLAM, and manipulation. It contrasts encoder-decoder, attention-based, and graph-based fusion with the evolving role of large-scale VLMs, emphasizing cross-modal alignment, pretraining, and lightweight deployment. Core contributions include a structured synthesis of fusion architectures, a comparative analysis of datasets and benchmarks, and forward-looking directions such as self-supervised cross-modal learning, structured spatial memory, and ethically aligned deployment. The work highlights practical challenges in real-time performance, data quality, and domain adaptation, and it outlines pathways toward robust, generalizable, and deployable robotic vision systems. Overall, the survey serves as a reference for researchers and practitioners aiming to integrate multimodal perception and reasoning into autonomous robots.

Abstract

Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We adopt a task-oriented perspective to systematically review the applications and advancements of multimodal fusion methods and VLMs in the field of robot vision. For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks. We also analyze the architectural characteristics and practical implementations of these fusion strategies in key tasks such as simultaneous localization and mapping (SLAM), 3D object detection, navigation, and manipulation. We compare the evolutionary paths and applicability of VLMs based on large language models (LLMs) with traditional multimodal fusion methods. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Building on this analysis, we identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation. We propose future directions such as self-supervised learning for robust multimodal representations, structured spatial memory and environment modeling to enhance spatial intelligence, and the integration of adversarial robustness and human feedback mechanisms to enable ethically aligned system deployment. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.

Paper Structure

This paper contains 40 sections, 15 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: The overview figure illustrates the overall framework of multimodal fusion and VLMs for robot vision. Various sensory inputs (e.g., RGB, Depth, LiDAR, GPS, IMU) are first processed through multimodal fusion strategies, including encoder-decoder frameworks, attention mechanisms, and graph neural networks, to enhance perception. The resulting fused features support core robotic vision tasks such as 3D semantic scene understanding, SLAM, 3D object detection, navigation and localization, and robot manipulation. Vision-language models further bridge perception and reasoning by aligning visual and linguistic information, enabling semantic understanding and action generation. The diagram highlights the integration of traditional fusion pipelines with large vision-language models for complex task execution in robotic systems.
  • Figure 2: The overall structure of the survey on multimodal fusion and vision-language models in robot vision.
  • Figure 3: The diagram illustrates the basic workflow of the three main strategies for multimodal fusion: early fusion, mid fusion, and late fusion. In the figure, data from each modality (such as images, audio, LiDAR, and text) is processed by a feature extractor. Early fusion combines the data from different modalities before feature extraction. Mid fusion first extracts features from each modality and then combines them through mechanisms such as feature concatenation or weighting. Late fusion lets each modality reach its own decision and then integrates those decisions. The diagram shows where in the multimodal processing flow each strategy merges the modalities.
  • Figure 4: Illustration of the standard self-attention mechanism. The query vector $Q$ is compared against all key vectors using a compatibility function $f(Q, K)$, typically a dot product. The resulting scores $s_1, s_2, s_3$ are normalized via the Softmax function to obtain attention weights $a_1, a_2, a_3$, which are then used to compute a weighted sum over the value vectors $V_1, V_2, V_3$, producing the final attention output.
  • Figure 5: The diagram illustrates the workflow of multimodal fusion using Graph Neural Networks (GNNs). In the figure, data from each modality (such as images, audio, LiDAR, and text) is first transformed into a graph structure through a graph construction process. These graphs are then processed by separate GNNs to extract high-level representations. The final decision stage integrates the outputs of the different GNNs, enabling multimodal reasoning. The diagram shows how GNNs learn structured relationships within each modality and where fusion takes place.
  • ...and 7 more figures
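
The three strategies named in the Figure 3 caption differ only in where the modalities are merged. The following is a minimal sketch of that difference, not the survey's implementation: `extract` and `decide` are hypothetical stand-ins for a learned feature extractor and decision head, and the `rgb` and `depth` inputs are made-up toy signals.

```python
import math

def extract(x):
    """Hypothetical feature extractor: mean and max of a raw signal."""
    return [sum(x) / len(x), max(x)]

def decide(features):
    """Hypothetical decision head: sigmoid of the feature sum, in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-sum(features)))

def early_fusion(modalities):
    # Merge raw data first, then run a single extractor and decision head.
    fused_raw = [v for x in modalities for v in x]
    return decide(extract(fused_raw))

def mid_fusion(modalities, weights=None):
    # Extract features per modality, then concatenate them with optional weights.
    weights = weights or [1.0] * len(modalities)
    fused = [w * f for w, x in zip(weights, modalities) for f in extract(x)]
    return decide(fused)

def late_fusion(modalities):
    # Each modality makes its own decision; the decisions are then averaged.
    scores = [decide(extract(x)) for x in modalities]
    return sum(scores) / len(scores)

# Toy inputs (assumed values, for illustration only).
rgb = [0.2, 0.4, 0.6]
depth = [1.0, 0.5]

for name, fuse in [("early", early_fusion), ("mid", mid_fusion), ("late", late_fusion)]:
    print(name, round(fuse([rgb, depth]), 4))
```

Averaging is only one way to integrate late-stage decisions; voting or a learned combiner would fit the same slot.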
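
The attention computation described in the Figure 4 caption (scores from $f(Q, K)$, Softmax normalization, weighted sum over values) can be sketched in a few lines. This is a single-query illustration under stated assumptions: the compatibility function defaults to a plain dot product as in the caption, the helper names are hypothetical, and the key and value vectors are made-up toy data.

```python
import math

def dot(u, v):
    """Dot-product compatibility function f(Q, K)."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, compat=dot):
    """Return (attention weights, attention output) for one query."""
    scores = [compat(query, k) for k in keys]   # s_1, s_2, s_3
    weights = softmax(scores)                   # a_1, a_2, a_3
    dim = len(values[0])
    # Weighted sum over the value vectors V_1, V_2, V_3.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy data (assumed values, for illustration only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, output = attend(query, keys, values)
print(weights, output)
```

Practical implementations usually also divide the scores by the square root of the key dimension before the Softmax; that scaling is omitted here because the caption does not mention it.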
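
The GNN workflow in the Figure 5 caption (graph construction per modality, a separate GNN per graph, then integration of the outputs) can be illustrated with a toy pipeline. This is a rough sketch under heavy assumptions: the chain-graph construction, the mean-aggregation update without learned weights, and the averaging of per-modality embeddings are all placeholders chosen for brevity, not methods taken from the survey.

```python
def build_chain_graph(n):
    """Hypothetical graph construction: link each node to its sequence neighbours."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def message_pass(features, graph):
    """One mean-aggregation round: each node averages itself with its neighbours."""
    updated = []
    for i, h in enumerate(features):
        group = [h] + [features[j] for j in graph[i]]
        updated.append(sum(group) / len(group))
    return updated

def readout(features):
    """Graph-level representation: mean of the node features."""
    return sum(features) / len(features)

def gnn_fusion(modalities, rounds=2):
    """Run a separate (weight-free) GNN per modality, then integrate the outputs."""
    embeddings = []
    for x in modalities:
        graph = build_chain_graph(len(x))
        h = list(x)
        for _ in range(rounds):
            h = message_pass(h, graph)
        embeddings.append(readout(h))
    fused = sum(embeddings) / len(embeddings)
    return fused, embeddings

# Toy scalar node features per modality (assumed values, for illustration only).
fused, embeddings = gnn_fusion([[0.1, 0.9, 0.5], [2.0, 4.0]])
print(fused, embeddings)
```

A real system would use learned message and update functions and a graph built from the data itself, for example spatial neighbours in a point cloud.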