MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation

Pingrui Zhang, Xianqiang Gao, Yuhan Wu, Kehui Liu, Dong Wang, Zhigang Wang, Bin Zhao, Yan Ding, Xuelong Li

TL;DR

MoMa-Kitchen tackles the last-mile gap between navigation and manipulation in mobile manipulation by introducing a large-scale, automated dataset of over 127k episodes across 569 kitchen scenes, paired with dense floor affordance maps. The authors build NavAff, a lightweight baseline that fuses RGB-D, global and floor point clouds, and robot-specific information via cross-attention to predict manipulation-ready final positions. Quantitative results show NavAff outperforming adapted baselines on RMSE, logMSE, PCC, and SIM, with a Top1 MSR of 72% and Top5 MSR of 66%, and real-world demonstrations validate transfer from simulation to practice. The work advances integrated navigation-manipulation learning, enabling robust, device-agnostic planning in cluttered household environments and providing a scalable path toward embodied AI that effectively couples navigation with manipulation.

Abstract

In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduce MoMa-Kitchen, a benchmark dataset comprising over 100k samples that provide training data for models to learn optimal final navigation positions for seamless transition to manipulation. Our dataset includes affordance-grounded floor labels collected from diverse kitchen environments, in which robotic mobile manipulators of different models attempt to grasp target objects amidst clutter. Using a fully automated pipeline, we simulate diverse real-world scenarios and generate affordance labels for optimal manipulation positions. Visual data are collected from RGB-D inputs captured by a first-person view camera mounted on the robotic arm, ensuring consistency in viewpoint during data collection. We also develop a lightweight baseline model, NavAff, for navigation affordance grounding that demonstrates promising performance on the MoMa-Kitchen benchmark. Our approach enables models to learn affordance-based final positioning that accommodates different arm types and platform heights, thereby paving the way for more robust and generalizable integration of navigation and manipulation in embodied AI. Project page: https://momakitchen.github.io/

Paper Structure

This paper contains 34 sections, 9 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Conventional navigation methods typically prioritize reaching a target location but do not account for constraints affecting manipulation feasibility. Left: Position A prioritizes proximity but is obstructed by chairs, preventing stable execution. Middle: Position B places the robot in a spacious and stable area for operation but beyond its effective reach. Right: Our approach, leveraging navigation affordance grounding, identifies Position C as the optimal stance, ensuring both reachability and task feasibility.
  • Figure 2: Overview of scene setup and data generation pipeline. Each scene features unique base furniture and layout, with randomly placed obstacles surrounding the target object to enhance scene complexity. Discrete navigation affordance values are collected by moving the mobile manipulator and interacting with the target objects in the scene. View transformation and Gaussian interpolation are then applied to generate a dense affordance map, along with corresponding RGB-D data.
  • Figure 3: (a) Robot arms used in MoMa-Kitchen. The end-effectors of the Panda, Flexiv, Elephant, Realman and xArm6 robots are grippers, while the end-effector of the UR5e is a suction cup. (b) Object categories utilized in MoMa-Kitchen. Each category consists of multiple object instances. Rigid and articulated objects serve as manipulation targets, while obstacle objects are strategically placed around the target to enhance scene complexity. (c) Examples of affordance maps in MoMa-Kitchen. Discrete affordance values are first collected (left) by moving the mobile manipulator and allowing it to interact with the target. Gaussian interpolation is then applied to obtain a smooth affordance map (right).
  • Figure 4: NavAff Baseline. (a) Visual Alignment Module: Projects object masks to align 2D visual features with 3D spatial representations. (b) Navigation Affordance Grounding Module: Fuses global point cloud, floor point cloud, and robot-specific features to predict navigation affordance maps.
  • Figure 5: Qualitative comparison of navigation affordance between all methods and ground truth. Blue to red regions indicate affordance values ranging from $0$ to $1$, while void areas represent obstacle-occupied spaces.
  • ...and 6 more figures
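
The captions for Figures 2 and 3 describe turning discrete affordance samples into a smooth, dense floor map via Gaussian interpolation. The sketch below illustrates one plausible reading of that step: a Gaussian-weighted average of the sampled values on a regular floor grid. The function name `densify_affordance`, the bandwidth `sigma`, the grid layout, and the choice to normalize the weights are all assumptions made for illustration; the paper's exact formulation is not given in this section and may differ.

```python
import numpy as np


def densify_affordance(points_xy, values, grid_x, grid_y, sigma=0.2):
    """Gaussian-weighted interpolation of sparse affordance samples onto a grid.

    Hypothetical helper (not from the paper). `points_xy` holds the (x, y)
    floor positions where the robot attempted the task, `values` the discrete
    affordance scores observed there, and `sigma` an assumed kernel bandwidth
    in the same units as the coordinates.
    """
    points_xy = np.asarray(points_xy, dtype=float)   # shape (N, 2)
    values = np.asarray(values, dtype=float)         # shape (N,)

    # Build the (H, W, 2) array of grid-cell coordinates.
    gx, gy = np.meshgrid(grid_x, grid_y)
    grid = np.stack([gx, gy], axis=-1)

    # Squared distance from every grid cell to every sample: shape (H, W, N).
    diff = grid[:, :, None, :] - points_xy[None, None, :, :]
    d2 = (diff ** 2).sum(axis=-1)

    # Gaussian kernel weights, then a normalized weighted average per cell.
    # Normalizing keeps the output within the range of the sampled values.
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    dense = (weights * values).sum(axis=-1) / np.maximum(weights.sum(axis=-1), 1e-12)
    return dense


if __name__ == "__main__":
    # Two samples: a successful stance at the origin and a failed one 1 m away.
    samples = [[0.0, 0.0], [1.0, 0.0]]
    scores = [1.0, 0.0]
    xs = np.linspace(0.0, 1.0, 5)
    ys = np.array([0.0])
    print(densify_affordance(samples, scores, xs, ys))
```

With this normalized form, a grid cell equidistant from a success and a failure receives an intermediate value, and cells far from every sample fall back to zero once the kernel weights underflow. An unnormalized sum clipped to the range 0 to 1 would be an equally reasonable alternative under the same captions.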