Innovative Integration of Visual Foundation Model with a Robotic Arm on a Mobile Platform

Shimian Zhang; Qiuhong Lu

TL;DR

The work addresses robust grasping of unseen objects in dynamic environments by pairing a depth-camera Visual Interpretation Module (VIM), built on the Segment Anything Model (SAM), with a Motion Control Module (MCM) on a mobile robot. The VIM performs zero-shot segmentation from user prompts and computes the target's 3D coordinates ($P_{cam}$), which are transformed to ($P_{arm}$) so that the MCM can plan trajectories using inverse kinematics and Denavit-Hartenberg kinematics. An eye-in-hand configuration enables continuous tracking, and the mobile platform relocates when targets are out of reach; the zero-shot pipeline eliminates the need for task-specific training data. Mobile SAM delivers segmentation speed comparable to the original while being ~60× smaller, with about $50$ ms latency on an NVIDIA $3060$ GPU. Real-world tests validate grasps indoors and outdoors and across industrial and service scenarios. The approach supports multimodal human-robot interaction via clicks, drawings, or voice prompts and broadens deployment possibilities in automation and service domains.
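The step from a segmented pixel to $P_{cam}$ and then to $P_{arm}$ can be illustrated with a minimal sketch. It assumes a pinhole camera model and a known 4x4 camera-to-arm transform; the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the matrix `T_ARM_CAM` are hypothetical placeholders, not values or APIs from the paper.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def deproject_pixel(u: float, v: float, depth_m: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with metric depth into the camera frame (P_cam).

    Assumes an ideal pinhole model without lens distortion.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def transform_point(T: Sequence[Sequence[float]], p: Vec3) -> Vec3:
    """Apply a 4x4 homogeneous transform (row-major nested lists) to a 3D point."""
    px, py, pz = p
    x, y, z = (T[i][0] * px + T[i][1] * py + T[i][2] * pz + T[i][3]
               for i in range(3))
    return (x, y, z)


# Hypothetical camera-to-arm transform: identity rotation, small offset (metres).
T_ARM_CAM = [
    [1.0, 0.0, 0.0, 0.10],
    [0.0, 1.0, 0.0, 0.00],
    [0.0, 0.0, 1.0, 0.05],
    [0.0, 0.0, 0.0, 1.00],
]

# Hypothetical mask centroid (u, v), depth reading, and intrinsics.
p_cam = deproject_pixel(400.0, 240.0, 0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
p_arm = transform_point(T_ARM_CAM, p_cam)
```

In an eye-in-hand setup the camera moves with the end effector, so a real system would recompute the camera-to-arm transform from the current arm pose and a fixed hand-eye calibration rather than use a constant matrix as above.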

Abstract

In the rapidly advancing field of robotics, the fusion of state-of-the-art visual technologies with mobile robotic arms has emerged as a critical integration. This paper introduces a novel system that combines the Segment Anything Model (SAM) -- a transformer-based visual foundation model -- with a robotic arm on a mobile platform. Mounting a depth camera on the robotic arm's end-effector ensures continuous object tracking, significantly mitigating environmental uncertainties. Deploying the grasping system on a mobile platform enhances its mobility, which plays a key role in dynamic environments where adaptability is critical. This synthesis enables dynamic object segmentation, tracking, and grasping. It also elevates user interaction beyond that of traditional robotic systems, allowing the robot to respond intuitively to various modalities such as clicks, drawings, or voice commands. Empirical assessments in both simulated and real-world settings demonstrate the system's capabilities. This configuration opens avenues for wide-ranging applications, from industrial settings, agriculture, and household tasks to specialized assignments and beyond.

Paper Structure (24 sections, 5 figures, 1 algorithm)

Introduction
Related Works
Visual Foundation Models
Language-driven Robotic Manipulation
Integrating SAM into Robotic Grasping
Methodology
System Overview
Visual Interpretation Module (VIM)
Visual Foundation Model Integration
Depth Estimation and Object Positioning
Motion Control Module (MCM)
Platform Mobility
Motion Planning
Tracking Feedback
Grasping Mechanism
...and 9 more sections

Figures (5)

Figure 1: System Overview: Our integrated mobile robotic grasping system comprises two core modules: the Visual Interpretation Module (VIM) and the Motion Control Module (MCM). VIM, utilizing a depth camera, captures a live scene and, through the SAM visual foundation model, segments the user-indicated object for grasping. It then computes the object's 3D coordinates, relaying this data to MCM. Based on this localization, MCM plans the motion, determining if the platform needs relocation for optimal grasping. The robotic arm's movement leverages inverse kinematics for precision, with a closed-loop control anchored on continuous object tracking via the "eye-in-hand" system. The process ends with controlled grasping using force feedback from the arm's end effector.
Figure 2: Mobile Grasping Platform overview from different viewpoints
Figure 3: Comparison of SAM [kirillov2023segment] and Mobile-SAM [zhang2023faster] performance in both indoor and outdoor scenarios. Zoom in for the best view.
Figure 4: The simulated grasping process is shown in three primary phases: initial, midway, and final poses. The analogue camera's viewpoint is presented on the top-left, while the corresponding joint angles are displayed on the top-right. Zoom in for the best view.
Figure 5: Our mobile grasping platform performs grasping in different scenarios.
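
The Figure 1 caption describes the MCM deciding whether the platform must relocate and then moving the arm with kinematics-based planning. The following is a minimal sketch of two building blocks under stated assumptions: a standard Denavit-Hartenberg link transform chained into forward kinematics, and a simple distance-based reach test. The DH rows, the `max_reach_m` threshold, and all function names are hypothetical; the paper's actual arm parameters and relocation criterion are not given here.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def dh_matrix(theta: float, d: float, a: float, alpha: float) -> Matrix:
    """Standard (distal) Denavit-Hartenberg transform for one link."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul4(A: Matrix, B: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def forward_kinematics(dh_rows: Sequence[Tuple[float, float, float, float]]) -> Matrix:
    """Chain (theta, d, a, alpha) rows into the base-to-end-effector transform."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for theta, d, a, alpha in dh_rows:
        T = matmul4(T, dh_matrix(theta, d, a, alpha))
    return T


def needs_relocation(p_arm: Tuple[float, float, float], max_reach_m: float) -> bool:
    """Assumed criterion: relocate the platform if the target lies beyond the reach radius."""
    return math.sqrt(sum(c * c for c in p_arm)) > max_reach_m


# Hypothetical two-link planar arm (link lengths in metres), both joints at zero.
T_end = forward_kinematics([(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0)])
end_effector_xyz = (T_end[0][3], T_end[1][3], T_end[2][3])
```

An inverse kinematics solver would search for joint angles whose `forward_kinematics` result matches the target pose; that solver and the closed-loop tracking update are outside this sketch.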
