Multi-Layered Reasoning from a Single Viewpoint for Learning See-Through Grasping
Fang Wan, Zheng Wang, Wei Zhang, Chaoyang Song
TL;DR
This work introduces VBSeeThruP, a Vision-based See-Through Perception framework that enables multi-modal perception from a single in-finger view behind a See-Thru-Network embedded in a Soft Polyhedral Network. It integrates markerless deformation tracking (via XMem), in-finger scene inpainting (via E$^2$FGVI), object detection on inpainted scenes (via RT-DETR), and 6D force/torque estimation (via SVAE) to achieve reactive grasping without external cameras or tactile sensors. The approach demonstrates reactive grasping, monocular depth sensing, and scene segmentation from occluded viewpoints, supported by an ablation study and quantitative metrics (e.g., 6D FT MAE and $R^2$ values). Limitations include speed, generalization across fingers, and blur in larger scenes, with future directions pointing toward underwater perception, embodied multi-modal learning, and stereo extensions for enhanced 3D reconstruction.
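The quantitative metrics mentioned above (MAE and $R^2$ for 6D force/torque estimates) can be computed per axis as in the following minimal sketch. It uses the standard definitions of the two metrics; the axis labels in `AXES` and the function names are illustrative assumptions, not taken from the authors' evaluation code.

```python
from typing import List, Sequence

# Assumed axis ordering for a 6D force/torque reading (illustrative only).
AXES = ("Fx", "Fy", "Fz", "Tx", "Ty", "Tz")


def per_axis_mae(y_true: Sequence[Sequence[float]],
                 y_pred: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute error for each of the six axes."""
    n = len(y_true)
    return [
        sum(abs(t[k] - p[k]) for t, p in zip(y_true, y_pred)) / n
        for k in range(len(AXES))
    ]


def per_axis_r2(y_true: Sequence[Sequence[float]],
                y_pred: Sequence[Sequence[float]]) -> List[float]:
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot, per axis."""
    n = len(y_true)
    scores = []
    for k in range(len(AXES)):
        mean_k = sum(t[k] for t in y_true) / n
        ss_res = sum((t[k] - p[k]) ** 2 for t, p in zip(y_true, y_pred))
        ss_tot = sum((t[k] - mean_k) ** 2 for t in y_true)
        # R^2 is undefined when the ground truth is constant on this axis.
        scores.append(1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"))
    return scores
```

Each function takes two equally long sequences of six-element samples (ground truth and prediction) and returns one score per axis, so forces and torques can be reported separately despite their different units.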
Abstract
Sensory substitution enables biological systems to perceive stimuli typically obtained by another organ, which is inspirational for physical agents. Multi-modal perception of intrinsic and extrinsic interactions is critical in building an intelligent robot that learns. This study presents a Vision-based See-Through Perception (VBSeeThruP) architecture that simultaneously perceives multiple intrinsic and extrinsic modalities from a single visual input in a markerless way, all packed within a soft robotic finger using the Soft Polyhedral Network design. It is generally applicable to miniature vision systems placed underneath deformable networks with a see-through design, capturing real-time images of the network's physical interactions induced by contact-based events, overlaid on the visual scene of the external environment, as demonstrated in the ablation study. We demonstrate VBSeeThruP's capability for learning reactive grasping without external cameras or dedicated force and torque sensors on the fingertips. Using the inpainted scene and the deformation mask, we further demonstrate that the VBSeeThruP architecture simultaneously achieves multiple perceptions, including but not limited to scene inpainting, object detection, depth sensing, scene segmentation, masked deformation tracking, 6D force/torque sensing, and contact event detection, all from a single markerless in-finger visual input.
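The data flow described above can be sketched as a small composition of stages. This is a hedged illustration only: the stage callables (`track`, `inpaint`, `detect`, `estimate_wrench`) are hypothetical placeholders standing in for the tracking, inpainting, detection, and force/torque models, not the real APIs of XMem, E$^2$FGVI, RT-DETR, or the SVAE. Feeding the deformation mask to the force/torque estimator is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Perception:
    mask: Any        # deformation mask of the soft network in the frame
    scene: Any       # external scene with the network inpainted away
    detections: Any  # objects detected on the inpainted scene
    wrench: Any      # 6D force/torque estimate


def perceive(frame: Any,
             track: Callable[[Any], Any],
             inpaint: Callable[[Any, Any], Any],
             detect: Callable[[Any], Any],
             estimate_wrench: Callable[[Any], Any]) -> Perception:
    """Run one in-finger frame through hypothetical perception stages."""
    mask = track(frame)            # markerless deformation tracking
    scene = inpaint(frame, mask)   # remove the masked network from the view
    detections = detect(scene)     # detection runs on the inpainted scene
    wrench = estimate_wrench(mask) # assumption: wrench inferred from the mask
    return Perception(mask, scene, detections, wrench)
```

All outputs derive from the same single frame, which mirrors the single-viewpoint premise; any real implementation would replace the placeholders with model inference calls.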
