Innovative Integration of Visual Foundation Model with a Robotic Arm on a Mobile Platform
Shimian Zhang, Qiuhong Lu
TL;DR
The work addresses robust grasping of unseen objects in dynamic environments by fusing a depth-camera Visual Interpretation Module (VIM) with the Segment Anything Model (SAM) on a mobile robot. The VIM performs zero-shot segmentation via prompts and computes 3D coordinates ($P_{cam}$) that are transformed to ($P_{arm}$) for the MCM to plan trajectories with inverse kinematics and Denavit-Hartenberg kinematics. An eye-in-hand configuration enables continuous tracking and mobile relocation when targets are out of reach, eliminating the need for task-specific training data. Mobile SAM delivers comparable segmentation speed to the original while being ~60× smaller and achieving about $50$ ms latency on an NVIDIA $3060$ GPU, with real-world tests validating grasps indoors and outdoors and across industrial and service scenarios. The approach supports multimodal human-robot interaction via clicks, drawings, or voice prompts and broadens deployment possibilities in automation and service domains.
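The step from a segmentation mask to $P_{cam}$ and then to $P_{arm}$ can be illustrated with a minimal sketch. It assumes a pinhole camera model and a known 4x4 homogeneous camera-to-arm transform; the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the matrix `T_arm_cam` are hypothetical placeholders for illustration, not the authors' implementation.

```python
import numpy as np

def mask_centroid(mask):
    """Return the (u, v) pixel centroid of a boolean segmentation mask."""
    vs, us = np.nonzero(mask)          # rows are v (y), columns are u (x)
    return us.mean(), vs.mean()

def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in meters to a 3D point P_cam.

    Assumes an ideal pinhole model without lens distortion.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def camera_to_arm(p_cam, T_arm_cam):
    """Map P_cam to P_arm with a 4x4 homogeneous transform (assumed known,
    e.g. from a hand-eye calibration combined with the current arm pose)."""
    p_h = np.append(p_cam, 1.0)        # homogeneous coordinates
    return (T_arm_cam @ p_h)[:3]

# Illustrative values only: intrinsics and the transform are made up.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 2:4] = True
u, v = mask_centroid(mask)

p_cam = deproject_pixel(380, 240, 0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
T_arm_cam = np.eye(4)
T_arm_cam[:3, 3] = [0.1, 0.0, 0.2]     # pure translation, for simplicity
p_arm = camera_to_arm(p_cam, T_arm_cam)
```

In an eye-in-hand setup the camera moves with the end-effector, so a transform like `T_arm_cam` would need to be recomputed from the arm's current pose whenever the arm moves.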
Abstract
In the rapidly advancing field of robotics, integrating state-of-the-art visual technologies with mobile robotic arms has become increasingly important. This paper introduces a novel system that combines the Segment Anything Model (SAM), a transformer-based visual foundation model, with a robotic arm on a mobile platform. Mounting a depth camera on the robotic arm's end-effector ensures continuous object tracking, significantly mitigating environmental uncertainties. Deployment on a mobile platform gives the grasping system enhanced mobility, which plays a key role in dynamic environments where adaptability is critical. This synthesis enables dynamic object segmentation, tracking, and grasping. It also elevates user interaction, allowing the robot to respond intuitively to various modalities such as clicks, drawings, or voice commands, beyond what traditional robotic systems offer. Empirical assessments in both simulated and real-world settings demonstrate the system's capabilities. This configuration opens avenues for wide-ranging applications, from industrial settings, agriculture, and household tasks to specialized assignments and beyond.
