Aim My Robot: Precision Local Navigation to Any Object

Xiangyun Meng, Xuning Yang, Sanghun Jung, Fabio Ramos, Srid Sadhan Jujjavarapu, Sanjoy Paul, Dieter Fox

TL;DR

The paper addresses high-precision object-centric navigation without maps or CAD models by introducing AMR, a vision-based local navigation system that uses RGB-D and LiDAR inputs along with a reference image and mask to achieve centimeter-level precision in reaching objects. AMR is trained on a large-scale photorealistic simulation pipeline and employs a transformer-based architecture with three stages: multi-modal sensor encoding, goal- and robot-aware fusion, and autoregressive motion generation to produce precise base trajectories and camera tilt commands. The approach demonstrates strong sim2real transfer, generalizes to unseen objects and different kinematics, and enables downstream tasks like docking and manipulation with minimal fine-tuning. Overall, AMR provides a practical, map-free solution for precise object-centric navigation that can be integrated with higher-level planners and robotic systems for real-world precision tasks.

Abstract

Existing navigation systems mostly consider a run a "success" when the robot reaches a point within a 1 m radius of the goal. This precision is insufficient for emerging applications where the robot needs to be positioned precisely relative to an object for downstream tasks, such as docking, inspection, and manipulation. To this end, we design and implement Aim-My-Robot (AMR), a local navigation system that enables a robot to reach any object in its vicinity at the desired relative pose, with centimeter-level precision. AMR achieves high precision and robustness by leveraging multi-modal perception and precise action prediction, and it is trained on large-scale photorealistic data generated in simulation. AMR shows strong sim2real transfer and can adapt to different robot kinematics and unseen objects with little to no fine-tuning.

Paper Structure

This paper contains 16 sections, 1 equation, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of AMR. Given a masked image describing the target object and an object-centric pose (relative position and orientation), AMR tracks the object while moving, avoids obstacles, and aligns the robot to the target object with centimeter-level precision without maps or object 3D models.
  • Figure 2: Problem setup. We specify the target object via a reference image $I_R$ taken in the scene and an object mask $M$ (in green). The goal condition $\mathbf{C}$ is defined as the relative side and pose of the object in $I_R$. A robot needs to navigate to the object conditioned on $\mathbf{C}$ and tilt its camera to gaze at the object. Note the reference image does not represent the final image captured by the robot at the desired goal.
  • Figure 3: Example rendered images of HSSD scenes in Isaac Sim.
  • Figure 4: Sample objects in the scenes. We consider all objects, including those that are not semantically labeled.
  • Figure 5: Network architecture. The reference image $I_R$ and the robot's RGB-D observations $I_t$ are tokenized with MAE. The current LiDAR scan is tokenized by grouping points into directional bins. Image and LiDAR tokens are input into the multi-modal context encoder jointly with the look-at pose and footprint tokens. Finally, the base trajectory decoder and the camera tilt decoder cross-attend to the output tokens of the context encoder. $S$ denotes the learned start tokens for the decoders.
  • ...and 6 more figures
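
The Figure 5 caption says the LiDAR scan is tokenized by grouping points into directional bins. The sketch below illustrates one way such binning could work; it is not the authors' implementation. The function name `bin_lidar_scan`, the number of bins, the maximum range, and the choice of "closest range per bin" as the per-bin feature are all assumptions made for illustration.

```python
import numpy as np


def bin_lidar_scan(points_xy, num_bins=8, max_range=10.0):
    """Group 2D LiDAR points into equal-width angular bins around the robot.

    Hypothetical helper: returns an array of shape (num_bins,) holding the
    closest range observed in each bin. Empty bins are filled with max_range.
    """
    points_xy = np.asarray(points_xy, dtype=float).reshape(-1, 2)
    ranges = np.hypot(points_xy[:, 0], points_xy[:, 1])
    angles = np.arctan2(points_xy[:, 1], points_xy[:, 0])  # in [-pi, pi]

    # Map each angle to a bin index in [0, num_bins); angle == pi wraps to bin 0.
    idx = np.floor((angles + np.pi) / (2.0 * np.pi) * num_bins).astype(int) % num_bins

    # Keep the minimum (clipped) range per bin; unbuffered so repeated indices work.
    tokens = np.full(num_bins, max_range)
    np.minimum.at(tokens, idx, np.minimum(ranges, max_range))
    return tokens


if __name__ == "__main__":
    scan = [(1.0, 1.0), (2.0, 2.0), (-1.0, 1.0), (1.0, -1.0)]
    print(bin_lidar_scan(scan, num_bins=4))
```

In a learned model, each bin would more plausibly be mapped to an embedding vector rather than a single scalar; the scalar here only keeps the example short.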