OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation

Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qiu

TL;DR

OmniVLA tackles the limitation of RGB-only vision-language-action models by incorporating beyond-RGB sensing through sensor-masked images that align infrared, mmWave, and acoustic data with RGB frames. It constructs a unified, image-native representation by overlaying sensor-derived masks onto RGB images, guided by a vision-language model and Grounded SAM prompts, and processes these with a frozen vision-language backbone plus lightweight per-sensor projectors. Empirical results on real-world robotic manipulation show an average 84% task success, outperforming RGB-only baselines by 59% and raw-sensor baselines by 28%, while achieving data efficiency comparable to using only half the demonstrations. The work demonstrates a general framework for integrating diverse sensors with VLA models, enabling physically-grounded spatial intelligence and stronger generalization for embodied AI tasks.

Abstract

Vision-language-action (VLA) models have shown strong generalization for robotic action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, their manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks onto the RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Building on this representation, we present a multisensory vision-language-action model architecture and train the model starting from an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception guides the robotic manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming both RGB-only and raw-sensor-input baseline models by 59% and 28%, respectively, while showing higher learning efficiency and stronger generalization capability.
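
The sensor-masked image idea can be illustrated with a minimal sketch. The abstract only states that sensor-derived masks are overlaid onto RGB images; everything else here is an assumption made for illustration: the function name `overlay_sensor_mask`, the min-max normalization, the tint color, the `alpha` blend weight, and the premise that the sensor map is already registered to the RGB frame and that a boolean object mask (standing in for a segmentation output) is available.

```python
import numpy as np


def overlay_sensor_mask(rgb, sensor_map, object_mask, color=(255, 0, 0), alpha=0.6):
    """Blend a normalized sensor heatmap onto an RGB frame inside object_mask.

    rgb:         (H, W, 3) uint8 RGB frame
    sensor_map:  (H, W) numeric array, assumed already aligned to the RGB frame
    object_mask: (H, W) bool array, e.g. produced by a segmentation model
    """
    if rgb.shape[:2] != sensor_map.shape or sensor_map.shape != object_mask.shape:
        raise ValueError("rgb, sensor_map and object_mask must share height and width")

    # Min-max normalize sensor readings to [0, 1]; a flat map yields zero intensity.
    lo, hi = float(sensor_map.min()), float(sensor_map.max())
    span = hi - lo
    if span > 0:
        intensity = (sensor_map - lo) / span
    else:
        intensity = np.zeros_like(sensor_map, dtype=float)

    # Per-pixel blend weight: zero outside the mask, up to alpha inside it.
    weight = (alpha * intensity * object_mask)[..., None]

    # Pixels with zero weight keep their original RGB value.
    tint = np.asarray(color, dtype=float)
    blended = (1.0 - weight) * rgb.astype(float) + weight * tint
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


# Toy usage: a 2x2 gray frame with a hypothetical thermal map and object mask.
frame = np.full((2, 2, 3), 100, dtype=np.uint8)
thermal = np.array([[20.0, 20.0], [20.0, 40.0]])
mask = np.array([[False, False], [True, True]])
masked_image = overlay_sensor_mask(frame, thermal, mask)
```

Because the output is still an ordinary uint8 RGB-shaped image, it can be fed to an image encoder unchanged, which is the property the abstract describes as keeping sensor input close to RGB statistics.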

Paper Structure

This paper contains 13 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Instead of relying solely on RGB cameras, OmniVLA equips embodied AI with multi-sensor perception beyond RGB. We use beamforming to construct heatmap-like sensor images for the acoustic and mmWave modalities, highlighting the sound source and the hidden item, respectively.
  • Figure 2: System Overview. OmniVLA processes diverse sensor data into image-like 2D spatial representations, overlaying sensor information on top of RGB images to obtain spatially grounded and semantically aligned sensor-masked images. We train a VLA model with individual MLP sensor projectors to perform challenging tasks requiring beyond-RGB perception.
  • Figure 3: Sensor Data Processing Illustration. We propose a general sensor data processing pipeline applicable to various sensors, including (a) thermal camera, (b) mmWave radar, and (c) acoustic microphone array, by overlaying sensor information on top of RGB images as VLA model input. We update the prompt input to the SAM2 model when the task begins and then asynchronously check for updates, so that VLM output delay does not affect the real-time processing of sensor-masked images.
  • Figure 4: Hardware Implementation. (a) robot arm and sensor setup (b) sensor module, integrating multiple sensors and cameras.
  • Figure 5: Examples of Robotic Manipulation Task Completion over Time. (a) Thermal: finding the cold drink. (b) mmWave: opening the box with an object inside. (c) Acoustic: uncovering the ringing phone. The first three images are sensor-masked images; the remaining images are raw RGB images, shown for visibility.
  • ...and 2 more figures
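
The asynchronous prompt update described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `PromptHolder`, `slow_vlm_prompt`, `refresh_prompt`, and `run_frames` are hypothetical names, the VLM call is simulated with a short sleep, and no real SAM2 or VLM API is invoked. It only shows the pattern in which the per-frame loop reads the most recent prompt without waiting for the slower model.

```python
import threading
import time


class PromptHolder:
    """Thread-safe holder for the latest segmentation prompt."""

    def __init__(self, initial_prompt):
        self._lock = threading.Lock()
        self._prompt = initial_prompt

    def get(self):
        with self._lock:
            return self._prompt

    def set(self, prompt):
        with self._lock:
            self._prompt = prompt


def slow_vlm_prompt(task):
    """Hypothetical stand-in for a slow VLM call that names the target object."""
    time.sleep(0.05)  # simulated model latency
    return f"target for: {task}"


def refresh_prompt(holder, task):
    """Background worker: query the (simulated) VLM once and publish the result."""
    holder.set(slow_vlm_prompt(task))


def run_frames(holder, frames):
    """Per-frame loop: pairs each frame with the current prompt, never blocking on the VLM."""
    return [(frame, holder.get()) for frame in frames]


# Toy usage: start the refresh in the background and keep processing frames.
holder = PromptHolder("initial prompt")
worker = threading.Thread(target=refresh_prompt, args=(holder, "find the cold drink"))
worker.start()
early = run_frames(holder, range(3))  # returns without waiting for the worker
worker.join()
late = run_frames(holder, range(3))   # reads the refreshed prompt
```

The design choice is that the frame loop only ever performs a cheap locked read, so the latency of the prompt-producing model affects how fresh the prompt is but not the frame rate.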