RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu, Junhao Shi, Kexin Huang, Zhaoye Fei, Jingjing Gong, Zuxuan Wu, Yu-Gang Jiang, See-Kiong Ng, Tat-Seng Chua, Xipeng Qiu
TL;DR
This work tackles proactive robot manipulation by enabling robots to infer latent user intent from cross-modal context (speech, environmental sounds, and visual cues) rather than relying on explicit commands. It introduces RoboOmni, an end-to-end omni-modal framework built on a Perceiver-Thinker-Talker-Executor architecture that unifies intention recognition, interaction, and action generation in a single model. To address the scarcity of data for proactive reasoning, the authors present OmniAction, a large multimodal dataset with 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types, plus OmniAction LIBERO for simulation evaluation. Across simulation and real-world experiments, RoboOmni outperforms text- and ASR-based baselines in success rate, inference speed, proactive assistance, and intention recognition, demonstrating emerging cognitive capabilities and improved human-robot collaboration in everyday environments.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
