
Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for Robotic Grasping

Sanket Kachole, Xiaoqian Huang, Fariborz Baghaei Naeini, Rajkumar Muthusamy, Dimitrios Makris, Yahya Zweiri

TL;DR

The paper tackles robust instance segmentation for robotic grasping under dynamic conditions by fusing event-based and RGB vision in a dual-encoder network. It introduces Bimodal SegNet, which employs an Integrated Multisensory Encoder, Cross-Domain Contextual Attention, and Atrous Pyramidal Feature Amplification to fuse the two modalities at multiple scales and recover precise object boundaries. Experiments on the ESD-1 and ESD-2 datasets show higher mIoU and pixel accuracy than RGB-event baselines, along with robustness to occlusion, blur, lighting, and scale variation. The results demonstrate the practical value of multi-scale, cross-modal fusion for reliable robotic perception in industrial settings, and the approach is more efficient than transformer-based alternatives.

Abstract

Object segmentation for robotic grasping under dynamic conditions often faces challenges such as occlusion, low-light conditions, motion blur, and object size variance. To address these challenges, we propose a deep learning network that fuses two types of visual signals: event-based data and RGB frame data. The proposed Bimodal SegNet has two distinct encoders, one for each signal input, and a spatial pyramidal pooling module with atrous convolutions. The encoders capture rich contextual information by pooling the concatenated features at different resolutions, while the decoder recovers sharp object boundaries. The proposed method is evaluated on five image degradation challenges (occlusion, blur, brightness, trajectory, and scale variance) on the Event-based Segmentation (ESD) Dataset. The results show a 6-10% improvement in segmentation accuracy over state-of-the-art methods in terms of mean intersection over union (mIoU) and pixel accuracy. The model code is available at https://github.com/sanket0707/Bimodal-SegNet.git
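
The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions (per-class intersection over union averaged over the classes that occur, and the fraction of correctly labelled pixels); the paper's exact evaluation protocol may differ. The function names and the toy labels are hypothetical and are not taken from the authors' repository.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    correct = sum(1 for p, t in zip(pred, truth) if p == t)
    return correct / len(truth)


def mean_iou(pred, truth, num_classes):
    """Mean of per-class IoU over classes that appear in pred or truth."""
    inter = [0] * num_classes  # pixels where pred and truth agree on class c
    union = [0] * num_classes  # pixels where pred or truth is class c
    for p, t in zip(pred, truth):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


# Toy example: six pixels flattened to a list, three classes (assumed labels).
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
print(pixel_accuracy(pred, truth))  # 4 of 6 pixels are correct
print(mean_iou(pred, truth, 3))     # per-class IoU: 1/3, 2/3, 1/2
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5, while pixel accuracy is 4/6. The gap between the two shows why segmentation results are usually reported with both metrics.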

Paper Structure (43 sections, 26 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Cross-Domain Contextual Attention - Bimodal SegNet uses Cross-Domain Contextual Attention (CDCA) to focus on distinct portions of the signal, taking input features from the RGB and event encoders. The module applies attention between the linearly embedded residual and interactive vectors. Each modality generates Query, Key, and Value matrices, and a cross-attention step produces the attended results. These results are concatenated with the residual vectors to create attention-augmented features, which are combined with the original features to produce enhanced representations. The final output fuses information from both modalities and is passed through a 1x1 convolution to yield consolidated, fused features.
  • Figure 2: The proposed approach, built on Bimodal SegNet, exploits both modalities and combines the distinct strengths of each to improve instance segmentation. This keeps object recognition robust in challenging conditions such as occlusion, blur, and variations in brightness, trajectory, and scale. Example scenario: eye-in-hand robotic pick-and-place of cluttered objects in modern industrial settings that require quick operation under varying dynamic conditions.
  • Figure 3: The proposed Bimodal SegNet architecture uses event-based vision sensors such as the DAVIS346 to produce both asynchronous events and RGB frames. These data are passed into the Event Synchronisation and RGB encoders, respectively. The convolutional blocks within these encoders downscale the input multiple times to infer feature maps. At each downscaling stage, a CDCA layer is applied, and its output feeds the APFA block. The features for each sampling rate in the APFA block are then fused and sent to the decoder block, where the image is upscaled multiple times. The decoder combines up-convolution, copy and crop, and convolution with ReLU, ultimately recovering the original spatial dimensions of the input image. The final fused tensor is formed from the output of the CDCA module and the previous decoder layer.
  • Figure 4: Examples from the ESD-1 dataset (rows 1-5) for the number-of-known-objects attribute, under the conditions of 0.15 moving speed, normal lighting, linear movement, and 0.82 height. The ESD-2 examples (rows 6-7) show previously unseen objects with varying attributes, under the same conditions: 0.15 moving speed, normal lighting, linear motion, and 0.82 height. The RGB ground truth and annotated event mask use different colors to represent different object labels. The figure is best viewed in color.
  • Figure 5: Qualitative Results - The qualitative results compare the performance of four different methods, mainly CMX and ours, for instance segmentation. Predictions were made on the ESD-1 dataset (known objects) and the ESD-2 dataset (unknown objects), mirroring the experiments conducted in the quantitative evaluation.
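
The cross-attention fusion described in the Figure 1 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes standard scaled dot-product attention, treats each feature map as a list of token vectors, and uses hypothetical names (`cross_attend`, `fuse`) and identity weight matrices chosen only for readability. The 1x1 convolution is modelled as a per-token linear projection over channels, which is what a 1x1 convolution computes at each spatial position.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attend(query_tokens, context_tokens, w_q, w_k, w_v):
    """Queries come from one modality; keys and values come from the other."""
    q = matmul(query_tokens, w_q)
    k = matmul(context_tokens, w_k)
    v = matmul(context_tokens, w_v)
    k_t = [list(col) for col in zip(*k)]
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def fuse(rgb, event, rgb_w, event_w, w_out):
    """Cross-attend in both directions, add residuals, concatenate, project."""
    rgb_att = cross_attend(rgb, event, *rgb_w)      # RGB queries, event keys/values
    event_att = cross_attend(event, rgb, *event_w)  # event queries, RGB keys/values
    rgb_enh = [[a + b for a, b in zip(x, y)] for x, y in zip(rgb, rgb_att)]
    event_enh = [[a + b for a, b in zip(x, y)] for x, y in zip(event, event_att)]
    concat = [r + e for r, e in zip(rgb_enh, event_enh)]  # channel-wise concat
    return matmul(concat, w_out)  # stands in for the 1x1 convolution


# Toy inputs: 2 tokens with 2 channels per modality (assumed values).
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = (identity, identity, identity)
w_out = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]  # 4 channels -> 2
rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
event_tokens = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse(rgb_tokens, event_tokens, weights, weights, w_out)
print(len(fused), len(fused[0]))  # one fused 2-channel vector per token
```

In a real network the projection matrices would be learned separately for each modality and each encoder stage; the identity matrices here only make the data flow easy to trace.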