SPIN: Simultaneous Perception, Interaction and Navigation
Shagun Uppal, Ananye Agarwal, Haoyu Xiong, Kenneth Shaw, Deepak Pathak
TL;DR
SPIN tackles mobile manipulation in cluttered, unstructured environments by learning a single end-to-end policy that coordinates the base, the arm, and an actuated egocentric camera. Training proceeds in two phases: Phase 1 uses privileged scandots to train coupled or decoupled visuomotor policies via PPO, and Phase 2 distills the learned behavior into a depth-conditioned policy that operates on egocentric depth inputs, using asynchronous DAgger. The approach yields emergent, robust behaviors such as dynamic obstacle avoidance and whole-body coordination. It is validated on six simulation benchmarks and two real-world setups, where it outperforms classical map-based baselines and ablations such as FixCam and NoPointNet. This reactive, perception-driven paradigm reduces reliance on precise maps, and it could be extended to richer sensing such as RGB data. Overall, the work advances end-to-end learning for mobile manipulation by integrating active vision with coordinated perception and action.
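The Phase 2 distillation step can be illustrated with a toy sketch. This is not the authors' implementation: it uses plain synchronous DAgger rather than the asynchronous variant, a fixed linear controller stands in for the PPO-trained teacher, and a linear least-squares student stands in for the depth-conditioned network. All names, dimensions, and dynamics here (`W_TEACHER`, `C`, `observe`, `step`, `dagger`) are hypothetical and chosen only to show the loop structure: the student drives the rollout, the privileged teacher labels every visited state, and the student is refit on the aggregated data.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, OBS_DIM, ACT_DIM = 4, 6, 2

# Hypothetical frozen "teacher": acts on the privileged state
# (a stand-in for a Phase 1 policy trained with privileged inputs).
W_TEACHER = np.array([[-1.0, 0.0, 0.5, 0.0],
                      [0.0, -1.0, 0.0, 0.5]])

# Hypothetical sensor model: the student only sees a noisy linear
# projection of the state (a stand-in for egocentric depth).
C = np.vstack([np.eye(STATE_DIM), 0.5 * np.ones((2, STATE_DIM))])


def teacher_action(state):
    return W_TEACHER @ state


def observe(state):
    return C @ state + 0.005 * rng.normal(size=OBS_DIM)


def step(state, action):
    # Toy dynamics: decay, actions push the first ACT_DIM coordinates.
    nxt = 0.9 * state
    nxt[:ACT_DIM] += 0.1 * action
    return nxt + 0.05 * rng.normal(size=STATE_DIM)


def dagger(iterations=5, horizon=50):
    """Synchronous DAgger with a linear student fit by least squares."""
    w_student = np.zeros((ACT_DIM, OBS_DIM))
    obs_buf, act_buf = [], []
    for _ in range(iterations):
        state = rng.normal(size=STATE_DIM)
        for _ in range(horizon):
            obs = observe(state)
            obs_buf.append(obs)
            act_buf.append(teacher_action(state))  # teacher labels the state
            state = step(state, w_student @ obs)   # student drives the rollout
        # Refit the student on ALL data aggregated so far.
        X, Y = np.stack(obs_buf), np.stack(act_buf)
        w_student = np.linalg.lstsq(X, Y, rcond=None)[0].T
    return w_student


if __name__ == "__main__":
    w = dagger()
    s = rng.normal(size=STATE_DIM)
    print("teacher:", teacher_action(s))
    print("student:", w @ (C @ s))
```

Labeling states visited under the student's own actions, rather than the teacher's, is what distinguishes DAgger from ordinary behavior cloning: the student is trained on the state distribution it will actually encounter at deployment.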
Abstract
While there has been remarkable progress recently in the fields of manipulation and locomotion, mobile manipulation remains a long-standing challenge. Compared to locomotion or static manipulation, a mobile system must make a diverse range of long-horizon tasks feasible in unstructured and dynamic environments. While the applications are broad and interesting, there are a plethora of challenges in developing these systems, such as coordination between the base and arm, reliance on onboard perception for perceiving and interacting with the environment, and, most importantly, simultaneously integrating all these parts together. Prior works approach the problem using disentangled modular skills for mobility and manipulation that are trivially tied together. This causes several limitations, such as compounding errors, delays in decision-making, and no whole-body coordination. In this work, we present a reactive mobile manipulation framework that uses an active visual system to consciously perceive and react to its environment. Similar to how humans leverage whole-body and hand-eye coordination, we develop a mobile manipulator that exploits its ability to move and see, more specifically, to move in order to see and to see in order to move. This allows it not only to move around and interact with its environment but also to choose "when" to perceive "what" using an active visual system. We observe that such an agent learns to navigate around complex cluttered scenarios while displaying agile whole-body coordination using only ego-vision, without needing to create environment maps. Result visualizations and videos are available at https://spin-robot.github.io/
