SPIN: Simultaneous Perception, Interaction and Navigation

Shagun Uppal, Ananye Agarwal, Haoyu Xiong, Kenneth Shaw, Deepak Pathak

TL;DR

SPIN tackles mobile manipulation in cluttered, unstructured environments by learning a single end-to-end policy that coordinates the base, the arm, and an actuated ego-centric camera. Training proceeds in two phases: Phase 1 uses privileged scandots to train coupled or decoupled visuomotor policies with PPO, and Phase 2 distills the learned behavior, using asynchronous DAgger, into a student policy that operates on ego-centric depth images. The approach yields emergent behaviors such as dynamic obstacle avoidance and whole-body coordination. It is validated on six simulation benchmarks and two real-world setups, where it outperforms classical map-based baselines and ablations such as FixCam and NoPointNet. This reactive, perception-driven approach reduces reliance on precise maps and could be extended to richer sensing such as RGB data. Overall, the work advances end-to-end learning for mobile manipulation by integrating active vision with coordinated perception and action.
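
The Phase 2 distillation step can be illustrated with a toy example. The sketch below is plain, synchronous DAgger on a made-up 2D problem, not the asynchronous variant or the depth-conditioned network used in the paper. The linear sensor matrix `A`, the teacher rule, and all function names are hypothetical and chosen only to keep the example self-contained. The point it shows is the DAgger data flow: the student drives the rollout, the privileged teacher labels the states the student visits, and the student is refit on the aggregated dataset.

```python
import numpy as np

# Hypothetical sensor model: the student observes a linear mix of the true state
# (a stand-in for ego-depth features; not from the paper).
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])

def teacher_action(state):
    # Hypothetical privileged teacher: sees the true state and steps halfway to the origin.
    return -0.5 * state

def observe(state):
    # What the student sees instead of the true state.
    return state @ A

def dagger(num_iters=4, horizon=20, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((2, 2))          # linear student: action = obs @ W
    obs_buf, act_buf = [], []     # aggregated dataset across all iterations
    for _ in range(num_iters):
        state = rng.uniform(-1.0, 1.0, size=2)
        for _ in range(horizon):
            obs = observe(state)
            obs_buf.append(obs)
            act_buf.append(teacher_action(state))  # teacher labels the visited state
            state = state + obs @ W                # the student's action drives the rollout
        # Supervised refit of the student on everything collected so far.
        W, *_ = np.linalg.lstsq(np.asarray(obs_buf), np.asarray(act_buf), rcond=None)
    return W

W = dagger()
```

In this toy setting the teacher's behavior is exactly representable by the student, so `W` converges to `-0.5 * np.linalg.inv(A)` once the aggregated observations span both dimensions.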

Abstract

While there has been remarkable progress recently in the fields of manipulation and locomotion, mobile manipulation remains a long-standing challenge. Compared to locomotion or static manipulation, a mobile system must make a diverse range of long-horizon tasks feasible in unstructured and dynamic environments. While the applications are broad and interesting, there are a plethora of challenges in developing these systems, such as coordination between the base and arm, reliance on onboard perception for perceiving and interacting with the environment, and, most importantly, integrating all these parts simultaneously. Prior works approach the problem using disentangled modular skills for mobility and manipulation that are trivially tied together. This causes several limitations, such as compounding errors, delays in decision-making, and a lack of whole-body coordination. In this work, we present a reactive mobile manipulation framework that uses an active visual system to consciously perceive and react to its environment. Similar to how humans leverage whole-body and hand-eye coordination, we develop a mobile manipulator that exploits its ability to move and see: more specifically, to move in order to see and to see in order to move. This allows it not only to move around and interact with its environment but also to choose "when" to perceive "what" using an active visual system. We observe that such an agent learns to navigate around complex cluttered scenarios while displaying agile whole-body coordination using only ego-vision, without needing to create environment maps. Result visualizations and videos are available at https://spin-robot.github.io/

Paper Structure (32 sections, 9 equations, 11 figures, 5 tables)

This paper contains 32 sections, 9 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Learning to SPIN: Our robot learns to simultaneously perceive, manipulate, and navigate cluttered unstructured environments in a whole-body fashion. The robot has an actuated camera with a limited field of view that it must control to get information about its environment. The motion and perception problems are tightly coupled, since what the robot knows about the environment influences how it can move and vice versa. We show results in a large variety of scenarios, both indoors and outdoors, with different obstacles like boxes and furniture. Our robot can pick up different objects like cups and utensils. Video demos at https://spin-robot.github.io
  • Figure 2: Human and robot illustration of whole-body navigation through the clutter.
  • Figure 3: We learn a policy that uses ego-vision to simultaneously perceive, interact, and navigate in cluttered environments. We propose two methods. (1) Coupled Visuomotor Optimization (CVO) learns robot and camera actions at the same time: we train an RL policy to predict both. We only provide scandots that are visible in the agent's field of view, which lets the agent learn to move its camera and aggregate information about its environment. This is followed by a phase-2 supervised training stage in which this behavior is distilled into a student network that operates on ego-centric depth images. (2) Decoupled Visuomotor Optimization (DVO) decouples action and perception learning into two parts: first, the agent learns to navigate across clutter assuming access to all obstacles. Then, in phase 1b, the robot learns to move its camera to estimate the relevant information. This is followed by the same supervised distillation as in CVO.
  • Figure 4: One scenario from the simulation benchmark, with many obstacles in a narrow passage. The agent learns whole-body coordination, such as the arm movement in the last two frames, to reactively adapt and navigate through such cluttered scenes. It actively moves its camera and aggregates information to navigate efficiently without collisions.
  • Figure 5: (Left) We compute visible scandots by projecting them to the camera frame and checking if they lie within the image plane. (Right) The Stretch RE1 robot that we use in our experiments. It has two DoFs in the base, one each for arm lift and extension, two for the camera, three for the wrist, and one for the gripper.
  • ...and 6 more figures
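
The visibility test described in the Figure 5 caption (project scandots into the camera frame, then keep those that land inside the image plane) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a pinhole camera with OpenCV-style axes (x right, y down, z forward), a 4x4 camera-to-world pose matrix, and hypothetical names (`visible_scandots`, `T_world_cam`, `fx`, `fy`, `cx`, `cy`).

```python
import numpy as np

def visible_scandots(points_world, T_world_cam, fx, fy, cx, cy, width, height):
    """Return a boolean mask of the 3D points that project inside the image.

    points_world: (N, 3) scandot positions in the world frame.
    T_world_cam:  (4, 4) camera pose (camera-to-world transform), assumed known.
    fx, fy, cx, cy: assumed pinhole intrinsics in pixels.
    """
    R = T_world_cam[:3, :3]
    t = T_world_cam[:3, 3]
    # World -> camera: p_cam = R^T (p_world - t), written for row vectors.
    pts_cam = (points_world - t) @ R
    z = pts_cam[:, 2]
    in_front = z > 1e-6                      # discard points behind the camera
    safe_z = np.where(in_front, z, 1.0)      # avoid dividing by zero or negative depth
    u = fx * pts_cam[:, 0] / safe_z + cx     # pixel column
    v = fy * pts_cam[:, 1] / safe_z + cy     # pixel row
    in_image = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return in_front & in_image

# Usage: camera at the world origin looking along +z, 100x100 image.
pose = np.eye(4)
dots = np.array([[0.0, 0.0, 1.0],    # straight ahead
                 [0.0, 0.0, -1.0],   # behind the camera
                 [1.0, 0.0, 1.0],    # too far to the side
                 [0.2, 0.1, 1.0]])   # slightly off-center
mask = visible_scandots(dots, pose, fx=100.0, fy=100.0, cx=50.0, cy=50.0,
                        width=100, height=100)
```

Masking the scandot input this way is what makes camera motion matter during training: obstacles outside the current field of view are simply not observed, so the policy has to move the camera to see them.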