OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan

TL;DR

OmniEVA tackles two core hurdles in embodied AI: a lack of robust 3D grounding across tasks, and task plans that ignore the robot's embodiment constraints. It introduces a Task-Adaptive Gated Router (TAGR) that selectively fuses 3D cues based on task and scene context, and an Embodiment-Aware Reinforcement Fine-tuning (TE-GRPO) framework that optimizes for both task satisfaction and plan executability under physical constraints. Through a multi-stage training pipeline and diverse datasets, OmniEVA achieves state-of-the-art results on 2D and 3D embodied benchmarks, along with strong generalization to varied embodiments and real-world deployment. The work demonstrates that closing the gap between perception, reasoning, and real-world execution can yield robust, versatile embodied agents capable of long-horizon planning in complex environments.
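
To make the idea of optimizing for both task satisfaction and executability concrete, here is a minimal sketch of a GRPO-style update signal. It is not the paper's implementation: the additive reward, the weight `w_embodiment`, and the function names are assumptions made for illustration. Only the group-relative standardization of rewards is the general GRPO technique.

```python
from statistics import mean, pstdev


def combined_reward(task_score: float, feasible: bool, w_embodiment: float = 0.5) -> float:
    """Hypothetical reward: task term plus a weighted embodiment-feasibility term."""
    return task_score + w_embodiment * (1.0 if feasible else 0.0)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled plans for one prompt: (task score, physically executable?)
group = [(1.0, True), (1.0, False), (0.0, True), (0.0, False)]
rewards = [combined_reward(score, ok) for score, ok in group]
advantages = group_relative_advantages(rewards)
```

Under these assumed weights, a plan that satisfies the task but cannot be executed receives a lower advantage than one that does both, which is the behavior the embodiment-aware objective is meant to encourage.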

Abstract

Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA -- an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks; and (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also performs strongly across a wide range of downstream scenarios. Evaluations on a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
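
The gated 3D fusion described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the OmniEVA architecture: the linear-plus-sigmoid gate, the 0.5 threshold, the additive fusion, and all names and numbers (`gate_probability`, `fuse_tokens`, `weights`) are hypothetical.

```python
import math


def gate_probability(task_vec, scene_vec, weights, bias):
    """Sigmoid of a linear score over concatenated task and scene features."""
    features = task_vec + scene_vec  # list concatenation
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def fuse_tokens(visual_tokens, pos3d_tokens, gate_open):
    """Add 3D positional embeddings to visual tokens only when the gate is open."""
    g = 1.0 if gate_open else 0.0
    return [
        [v + g * p for v, p in zip(vis, pos)]
        for vis, pos in zip(visual_tokens, pos3d_tokens)
    ]


visual = [[0.2, 0.4], [0.1, 0.3]]   # toy visual tokens
pos3d = [[1.0, -1.0], [0.5, 0.5]]   # toy 3D positional embeddings
weights = [2.0, -1.0, 1.0, 0.5]     # hypothetical router weights

# A task embedding that (by assumption) signals a spatially demanding task.
p_spatial = gate_probability([1.0, 0.0], [0.5, 0.5], weights, bias=-1.0)
# A task embedding that (by assumption) signals a purely 2D task.
p_2d = gate_probability([0.0, 1.0], [0.5, 0.5], weights, bias=-1.0)

fused_on = fuse_tokens(visual, pos3d, p_spatial > 0.5)
fused_off = fuse_tokens(visual, pos3d, p_2d > 0.5)
```

With the gate closed, the visual tokens pass through unchanged, so a 2D-only task sees exactly the 2D pathway. With the gate open, the 3D positional signal is added to each token.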

Paper Structure

This paper contains 89 sections, 10 equations, 13 figures, 8 tables, 2 algorithms.

Figures (13)

  • Figure 1: Performance Comparison across 2D and 3D Embodied Reasoning Benchmarks.
  • Figure 2: Model Architecture of OmniEVA. Left: The overall architecture of OmniEVA, featuring a novel task-adaptive gated router that dynamically incorporates 3D positional embeddings. Middle: Detailed implementation of the gated router module. Right: Illustrative examples of the gated router's activation state across different tasks.
  • Figure 3: Training Paradigm of OmniEVA. The two-stage cascade progressively enhances embodied intelligence: Stage 1 builds a broad reasoning foundation, while Stage 2 grounds it in physical reality, culminating in robust task execution across diverse real-world scenarios.
  • Figure 4: 3D Activation Analysis by Prompt Clustering: Prompts are embedded using a lightweight sentence transformer and clustered into semantic groups. The chart shows the 3D activation probability per category.
  • Figure 5: Ablation Results of the proposed TE-GRPO Method on Local Mobile-Manipulation Tasks.
  • ...and 8 more figures
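
The per-category statistic in the Figure 4 caption (3D activation probability per prompt cluster) can be computed as a simple grouped rate. The sketch below assumes cluster labels have already been assigned by the embedding and clustering step, which is not reproduced here. The cluster names and records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def activation_rate_by_cluster(records):
    """Return the fraction of prompts per cluster for which the 3D gate was active."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [activations, total]
    for cluster, activated in records:
        counts[cluster][0] += int(activated)
        counts[cluster][1] += 1
    return {cluster: on / total for cluster, (on, total) in counts.items()}


# Hypothetical (cluster label, gate activated?) pairs.
records = [
    ("spatial", True),
    ("spatial", True),
    ("spatial", False),
    ("counting", False),
    ("counting", False),
    ("navigation", True),
]
rates = activation_rate_by_cluster(records)
```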