MultiClear: Multimodal Soft Exoskeleton Glove for Transparent Object Grasping Assistance

Chen Hu, Timothy Neate, Shan Luo, Letizia Gionfrida

TL;DR

This work addresses the difficulty of grasping transparent objects for users with hand impairments by introducing MultiClear, a multimodal soft exoglove that fuses RGB, depth, and auditory signals. The system employs a vision foundation model for zero-shot segmentation to delineate transparent object boundaries and a three-layer hierarchical control stack (high-level context, mid-level multimodal fusion, low-level PID) to enable stable, adaptive grasping. Experimental results show a Grasping Ability Score (GAS) of 70.37% (±3.96), with an average Grasping score of 80.4% and an average Maintaining score of 60.41% across six objects and six participants, highlighting the potential of vision–auditory–tactile integration for transparent object manipulation. Limitations include sparse depth readings in the interior of transparent objects and glove fit issues; future work will focus on depth completion integration, broader grasp types, deformable objects, and clinical validation for neurodegenerative populations.

Abstract

Grasping is a fundamental skill for interacting with the environment. However, this ability can be difficult for some people (e.g., due to disability). Wearable robotic solutions can enhance or restore hand function, and recent advances have leveraged computer vision to improve grasping capabilities. However, grasping transparent objects remains challenging due to their poor visual contrast and ambiguous depth cues. Furthermore, while multimodal control strategies incorporating tactile and auditory feedback have been explored for grasping transparent objects, the integration of vision with these modalities remains underdeveloped. This paper introduces MultiClear, a multimodal framework designed to enhance grasping assistance for transparent objects in a wearable soft exoskeleton glove by fusing RGB data, depth data, and auditory signals. The exoskeleton glove integrates a tendon-driven actuator with an RGB-D camera and a built-in microphone. To achieve precise and adaptive control, a hierarchical control architecture is proposed: a high-level control layer provides contextual awareness, a mid-level control layer processes multimodal sensory inputs, and a low-level control layer executes PID motor control for fine-tuned grasping adjustments. The challenge of transparent object segmentation is addressed by introducing a vision foundation model for zero-shot segmentation. The proposed system achieves a Grasping Ability Score of 70.37%, demonstrating its effectiveness in transparent object manipulation.

Paper Structure

This paper contains 15 sections, 5 figures, 1 table, and 1 algorithm.

Figures (5)

  • Figure 1: (a) Overview of MultiClear: The proposed multimodal framework captures RGB, depth, and auditory inputs through sensors mounted on the exoglove. The high-level control module processes the RGB frames using DIS-Net [qin2022highly] for forward inference, generating a mask of the target object. The mask is aligned with the current depth frame to extract depth information, which is then passed to the middle layer. Simultaneously, voice input is processed by the Vosk ASR model [vosk2020], which converts predefined prompt words into control commands. These commands are fed into the middle layer. By integrating the data from all three modalities, the middle layer makes decisions and sends target commands to the low-level PID controller for motor rotation, executing grasp or release actions. (b) Hardware setup: The soft exoglove consists of three main components: an actuator, a tendon-driven glove, and sensing equipment. Custom 3D-printed components route the wires, which are connected to the motor to trigger grasp and release actions. After data are gathered from the three modalities, client-server interaction is facilitated via the Wi-Fi and Bluetooth modules in the microcontrollers.
  • Figure 2: The dataset comprises six transparent objects, categorized into three groups, with users performing grasps using three distinct grip types: Cylindrical Grip, Spherical Grip, and Pinch. The mass, dimensions, and material composition of the objects can significantly impact grasping performance.
  • Figure 3: Five stages of the research procedure: (1) Begin: The user starts seated at a table, gives the "grip" command, and moves their hand toward the object. (2) Grasp: When the distance between the camera and the object is less than 400 mm for 2 seconds, the system triggers the grasp. (3) Lift: Once the motor stops and the grasp stabilizes, the user lifts the object. (4) Maintain: For the Cylindrical and Spherical grips, the user rotates their wrist to a palm-down position and holds this for three seconds. (5) End: The user places the object back on the table, gives the "release" command to reverse the motor, and then issues the "stop" command to halt the motor.
  • Figure 4: The first row displays RGB images of six transparent objects from the dataset. The second row shows masks generated by Segment Anything (SAM), where SAM demonstrates sensitivity to image edges (e.g., Glass (high), Glass (low), and Wine Glass) and object contours, leading to inaccurate segmentation. In contrast, the third row illustrates the masks produced by DIS-Net, which focuses specifically on the outer contours of the transparent objects.
  • Figure 5: The average GAS for six users performing three different grip types on six transparent objects. Grey, red, and blue bars represent the Grasping, Maintaining, and overall GAS, respectively. The table below provides the measured distances (mm) from each user’s palm to the tips of the thumb, index finger, and middle finger.
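
The captions for Figures 1 and 3 describe a mid-level decision that combines a segmentation mask, a depth frame, and a voice command: after "grip" is spoken, the grasp fires once the object has been closer than 400 mm for 2 seconds. The following Python sketch illustrates that logic under stated assumptions and is not the authors' implementation. The names `masked_depth_mm` and `GraspTrigger`, the use of the median to summarize depth inside the mask, the treatment of zero as a missing depth reading, and the single-fire behavior per "grip" command are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence

GRASP_DISTANCE_MM = 400.0  # distance threshold from the Figure 3 caption
DWELL_SECONDS = 2.0        # dwell time from the Figure 3 caption


def masked_depth_mm(
    depth_mm: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> Optional[float]:
    """Median of the valid depth readings inside the mask, or None if there are none."""
    values = [
        d
        for depth_row, mask_row in zip(depth_mm, mask)
        for d, m in zip(depth_row, mask_row)
        if m and d > 0  # assumption: 0 marks a missing depth reading
    ]
    return median(values) if values else None


@dataclass
class GraspTrigger:
    """Hypothetical mid-level trigger: voice arms it, sustained proximity fires it."""

    armed: bool = False
    near_since: Optional[float] = None

    def on_voice(self, word: str) -> None:
        # "grip" arms the trigger; "release" and "stop" disarm it.
        if word == "grip":
            self.armed = True
        elif word in ("release", "stop"):
            self.armed = False
        self.near_since = None  # any command restarts the dwell timer

    def update(self, distance_mm: Optional[float], now_s: float) -> bool:
        """Return True once the object has stayed within range for the dwell time."""
        in_range = distance_mm is not None and distance_mm < GRASP_DISTANCE_MM
        if not self.armed or not in_range:
            self.near_since = None
            return False
        if self.near_since is None:
            self.near_since = now_s
        if now_s - self.near_since >= DWELL_SECONDS:
            self.armed = False  # assumption: fire once per "grip" command
            self.near_since = None
            return True
        return False


if __name__ == "__main__":
    depth = [[0, 350, 360], [0, 0, 900]]
    mask = [[1, 1, 1], [1, 1, 0]]
    trigger = GraspTrigger()
    trigger.on_voice("grip")
    distance = masked_depth_mm(depth, mask)
    for t in (0.0, 1.0, 2.0):
        print(t, trigger.update(distance, t))
```

Using a robust statistic such as the median is one way to tolerate the sparse interior depth of transparent objects noted in the TL;DR; a real system would also need to handle frames in which the mask contains no valid depth at all, which this sketch treats as "not in range".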