Multi-Layered Reasoning from a Single Viewpoint for Learning See-Through Grasping

Fang Wan; Zheng Wang; Wei Zhang; Chaoyang Song

TL;DR

This work introduces VBSeeThruP, a Vision-based See-Through Perception framework that enables multi-modal perception from a single in-finger view behind a See-Thru-Network embedded in a Soft Polyhedral Network. It integrates markerless deformation tracking (via XMem), in-finger scene inpainting (via E$^2$FGVI), object detection on inpainted scenes (via RT-DETR), and 6D force/torque estimation (via SVAE) to achieve reactive grasping without external cameras or tactile sensors. The approach demonstrates reactive grasping, monocular depth sensing, and scene segmentation from occluded viewpoints, supported by an ablation study and quantitative metrics (e.g., 6D FT MAE and $R^2$ values). Limitations include speed, generalization across fingers, and blur in larger scenes, with future directions pointing toward underwater perception, embodied multi-modal learning, and stereo extensions for enhanced 3D reconstruction.
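The summary cites mean absolute error (MAE) and $R^2$ as the metrics for 6D force/torque estimation. As a minimal sketch of how such per-axis scores are commonly computed, the snippet below uses the standard textbook definitions. The axis names, function names, and data layout are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Dict, Sequence, Tuple

# Assumed axis ordering for a 6D wrench; the paper's convention may differ.
AXES = ("Fx", "Fy", "Fz", "Tx", "Ty", "Tz")


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between two equal-length series."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the ground truth is constant")
    return 1.0 - ss_res / ss_tot


def per_axis_metrics(
    truth: Sequence[Sequence[float]], pred: Sequence[Sequence[float]]
) -> Dict[str, Tuple[float, float]]:
    """Return {axis: (MAE, R^2)} for sequences of 6D wrench samples."""
    report = {}
    for i, axis in enumerate(AXES):
        t = [row[i] for row in truth]
        p = [row[i] for row in pred]
        report[axis] = (mae(t, p), r2(t, p))
    return report
```

Reporting the two scores per axis, rather than one pooled number, keeps force and torque errors from being mixed across different units.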

Abstract

Sensory substitution enables biological systems to perceive stimuli typically obtained by another organ, which is inspirational for physical agents. Multi-modal perception of intrinsic and extrinsic interactions is critical in building an intelligent robot that learns. This study presents a Vision-based See-Through Perception (VBSeeThruP) architecture that simultaneously perceives multiple intrinsic and extrinsic modalities from a single visual input in a markerless way, all packed within a soft robotic finger using the Soft Polyhedral Network design. It is generally applicable to miniature vision systems placed underneath deformable networks with a see-through design, which capture real-time images of the network's physical interactions induced by contact-based events, overlaid on the visual scene of the external environment, as demonstrated in the ablation study. We present VBSeeThruP's capability for learning reactive grasping without using external cameras or dedicated force and torque sensors on the fingertips. Using the inpainted scene and the deformation mask, we further demonstrate the multi-modal performance of the VBSeeThruP architecture, which simultaneously achieves various perceptions, including but not limited to scene inpainting, object detection, depth sensing, scene segmentation, masked deformation tracking, 6D force/torque sensing, and contact event detection, all from a single markerless in-finger visual input.
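The data flow described above can be sketched as a composition of four stages that share one in-finger frame. The code below is only a structural illustration: the stage functions are toy stand-ins, not XMem, E$^2$FGVI, RT-DETR, or SVAE. It also assumes, as is not stated explicitly here, that inpainting consumes the tracked mask and that the wrench is estimated from the mask alone. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]  # toy grayscale frame, values in [0, 1]
Mask = List[List[int]]     # 1 where the see-through network occludes the scene


@dataclass
class Perception:
    mask: Mask                  # tracked deformation mask
    scene: Image                # frame with the network inpainted away
    detections: List[str]       # labels found in the inpainted scene
    wrench: Tuple[float, ...]   # assumed order: (Fx, Fy, Fz, Tx, Ty, Tz)


def perceive(
    frame: Image,
    track: Callable[[Image], Mask],
    inpaint: Callable[[Image, Mask], Image],
    detect: Callable[[Image], List[str]],
    estimate_wrench: Callable[[Mask], Tuple[float, ...]],
) -> Perception:
    """Run all four stages on a single frame (assumed wiring)."""
    mask = track(frame)                  # intrinsic: network deformation
    scene = inpaint(frame, mask)         # extrinsic: unoccluded scene
    detections = detect(scene)           # extrinsic: objects in the scene
    wrench = tuple(estimate_wrench(mask))  # intrinsic: contact force/torque
    if len(wrench) != 6:
        raise ValueError("expected a 6D force/torque estimate")
    return Perception(mask, scene, detections, wrench)


# Toy stand-ins so the sketch runs; real models would replace these.
def toy_track(frame: Image) -> Mask:
    return [[1 if px < 0.1 else 0 for px in row] for row in frame]


def toy_inpaint(frame: Image, mask: Mask) -> Image:
    visible = [px for row, mrow in zip(frame, mask)
               for px, m in zip(row, mrow) if not m]
    fill = sum(visible) / len(visible) if visible else 0.0
    return [[fill if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(frame, mask)]


def toy_detect(scene: Image) -> List[str]:
    bright = any(px > 0.5 for row in scene for px in row)
    return ["bright_region"] if bright else []


def toy_wrench(mask: Mask) -> Tuple[float, ...]:
    area = float(sum(sum(row) for row in mask))
    return (0.0, 0.0, area, 0.0, 0.0, 0.0)
```

Passing the stages in as callables keeps the wiring separate from any particular model, so a single stage can be swapped or disabled, for instance when running an ablation.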

Paper Structure (25 sections, 7 equations, 13 figures, 2 tables)

Introduction
Related Works
Differentiating Vision-based Sensing and Perception
A General Classification of Vision-based Perception
Vision-based Rigid Perception (VBRigidP)
Vision-based Deformable Perception (VBDeformP)
Vision-based See-Through Perception (VBSeeThruP)
Multi-Modal Perception in Robotics with Vision
Proposed Methods
Markerless Design of the See-Thru-Network
Formalizing Multi-Modal VBSeeThruP
Markerless, Real-time, Large-Scale Deformation Tracking
In-Finger Visual Perception from Scene Inpainting
Markerless Contact Perception via a Deformation Mask
Ablation Study
...and 10 more sections

Figures (13)

Figure 1: Learning see-through grasping via multi-layered reasoning from a single viewpoint. (A) A common setup for vision-based grasping. (B) Proposed pipeline in this work. (C) Grasping principle via See-Thru-Network. (D) Markerless representation via deformable mask tracking.
Figure 2: Platform setup as a hand-eye system.
Figure 3: Multi-layered reasoning for VBSeeThruP based on the SPN's See-Through design.
Figure 4: Formalizing three research problems for VBSeeThruP via a See-Thru-Network.
Figure 5: Markerless deformation tracking of STN mask.
...and 8 more figures
