
MARVIS: Motion & Geometry Aware Real and Virtual Image Segmentation

Jiayi Wu, Xiaomin Lin, Shahriar Negahdaripour, Cornelia Fermüller, Yiannis Aloimonos

TL;DR

The paper tackles real–virtual image segmentation near water surfaces, where reflections and refractions create challenging, domain-shifting visual regions. It introduces MARVIS, a motion- and geometry-aware network that leverages a Local Motion Entropy (LME) kernel and an Epipolar Geometric Consistency (EGC) loss, trained on the photorealistic AquaSim synthetic dataset to achieve strong cross-domain performance without retraining. MARVIS attains state-of-the-art results in both synthetic ($IoU>94\%$, $F1>96\%$) and real-world ($IoU>78\%$, $F1>86\%$) domains while remaining lightweight (~2.56M parameters) and fast (up to 43.4 FPS on an RTX 4070). The AquaSim simulator, combined with the proposed temporal and geometric cues, enables robust perception for autonomous marine robots and opens avenues for 3D real–virtual segmentation and reconstruction in multi-media environments.

Abstract

Tasks such as autonomous navigation, 3D reconstruction, and object recognition near water surfaces are crucial in marine robotics applications. However, dynamic disturbances, e.g., light reflection and refraction at the random air-water interface, irregular liquid flow, and similar factors, can cause failures in perception and navigation systems. Traditional computer vision algorithms struggle to differentiate between real and virtual image regions, which significantly complicates these tasks. A virtual image region is an apparent representation formed by the redirection of light rays, typically through reflection or refraction, creating the illusion that an object is present at a location where it does not physically exist. This work proposes a novel approach for segmenting real and virtual image regions, exploiting synthetic images combined with domain-invariant information, a Motion Entropy Kernel, and Epipolar Geometric Consistency. Our segmentation network does not need to be re-trained if the domain changes. We show this by deploying the same segmentation network in two different domains: simulation and the real world. By creating realistic synthetic images that mimic the complexities of the water surface, we provide fine-grained training data for our network (MARVIS) to discern between real and virtual images effectively. Through motion- and geometry-aware design choices and comprehensive experimental analysis, we achieve state-of-the-art real-virtual image segmentation performance in an unseen real-world domain, with an IoU over 78% and an F1-score over 86%, while ensuring a small computational footprint. MARVIS offers inference rates of over 43 FPS on a single GPU and over 8 FPS on a single CPU core. Our code and dataset are available at https://github.com/jiayi-wu-umd/MARVIS.
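The abstract reports segmentation quality as IoU and F1-score. As a reminder of what those numbers measure, here is a minimal sketch for a pair of binary masks. The function name `iou_and_f1`, the choice of which class counts as positive, and the handling of two empty masks are illustrative assumptions; this summary does not say how the paper aggregates scores across images or classes.

```python
import numpy as np


def iou_and_f1(pred, target):
    """Return (IoU, F1) for two binary masks of identical shape.

    Hypothetical helper for illustration; not taken from the MARVIS code base.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    tp = np.logical_and(pred, target).sum()    # predicted positive, truly positive
    fp = np.logical_and(pred, ~target).sum()   # predicted positive, truly negative
    fn = np.logical_and(~pred, target).sum()   # predicted negative, truly positive

    union = tp + fp + fn
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0, 1.0

    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return float(iou), float(f1)


# Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
iou, f1 = iou_and_f1(pred, target)
print(f"IoU={iou:.3f}, F1={f1:.3f}")
```

For a single pair of masks the two metrics are tied by $F1 = 2\,IoU / (1 + IoU)$, so F1 is never lower than IoU. The relation need not hold exactly for scores averaged over a dataset.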

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Various computer vision tasks fail to varying degrees in multi-media scenarios due to virtual images formed by the diverging reflection or refraction of light rays. The real-virtual image segmentation mask is crucial to provide the robot with credible region information in the visual sensor input.
  • Figure 2: A BlueROV collecting data for our synthetic dataset in AquaSim. The inset shows a sample image captured from the camera on the BlueROV. The top left displays the feed from a front-facing RGB camera, while the bottom left shows the corresponding ground-truth masks for that image.
  • Figure 3: Full learning pipeline of MARVIS. First, two consecutive grayscale video frames (at times $t-\delta t$ and $t$) are fed into the optical flow estimation module (RAFT; Teed and Deng, 2020). This motion information is forwarded to the proposed Local Motion Entropy (LME) layer for LME feature extraction. The fused spatio-temporal feature maps are then passed to a five-stage encoder to obtain features in a high-dimensional latent space. Subsequently, the five-stage decoder progressively upsamples, fusing high-resolution feature maps from the corresponding encoder stage through skip connections, to generate a segmentation mask at the original resolution. The learning objective, which involves pixel-wise losses and the EGC loss, is given by Eq. \ref{eq:2_Epipolar_Loss}.
  • Figure 4: The detailed operations in the Attention Module and the Tokenized Block (also described in Eq. \ref{eq:3_Tokenized_MLP}).
  • Figure 5: Qualitative comparisons of real-virtual image segmentation by MARVIS, SOTA water segmentation pipelines, and widely used image segmentation models on both the synthetic and real-world test sets. MARVIS infers accurate and consistent segmentations across samples from both domains.
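
The Figure 3 caption names two cues derived from optical flow: a Local Motion Entropy feature and an epipolar consistency term. Neither is defined in this summary, so the sketch below uses stand-ins that should be read as assumptions rather than the authors' formulation. `local_motion_entropy` takes one plausible reading of the name, the Shannon entropy of quantized flow directions in a square window around each pixel. `epipolar_residual` evaluates the standard two-view constraint $x_2^\top F x_1 = 0$ on flow correspondences. The function names, the window size, the bin count, and the toy fundamental matrix are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_motion_entropy(flow, window=5, bins=8):
    """Entropy (bits) of quantized flow directions in a window around each pixel.

    flow: (H, W, 2) array of per-pixel displacements (dx, dy).
    Assumed definition for illustration; the paper's LME kernel may differ.
    """
    assert window % 2 == 1, "window must be odd so it can be centred on a pixel"
    h, w, _ = flow.shape
    angle = np.arctan2(flow[..., 1], flow[..., 0])              # in [-pi, pi]
    idx = np.floor((angle + np.pi) / (2 * np.pi) * bins).astype(int) % bins

    pad = window // 2
    padded = np.pad(idx, pad, mode="edge")                      # replicate borders
    windows = sliding_window_view(padded, (window, window))     # (H, W, window, window)

    entropy = np.zeros((h, w))
    for b in range(bins):
        p = (windows == b).mean(axis=(-2, -1))                  # share of bin b per window
        nz = p > 0
        entropy[nz] -= p[nz] * np.log2(p[nz])
    return entropy


def epipolar_residual(flow, F):
    """Per-pixel |x2^T F x1| with x2 = x1 + flow, in homogeneous pixel coordinates."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    x1 = np.stack([xs, ys, np.ones_like(xs)], axis=-1)          # (H, W, 3)
    x2 = x1.copy()
    x2[..., :2] += flow
    Fx1 = x1 @ F.T                                              # F @ x1 at every pixel
    return np.abs(np.sum(x2 * Fx1, axis=-1))


# Toy inputs: coherent rightward motion versus random motion.
rng = np.random.default_rng(0)
uniform_flow = np.zeros((16, 16, 2))
uniform_flow[..., 0] = 2.0
random_flow = rng.normal(size=(16, 16, 2))

lme_uniform = local_motion_entropy(uniform_flow)
lme_random = local_motion_entropy(random_flow)

# Hypothetical F for a camera translating along x (identity intrinsics and rotation).
F_toy = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
res_consistent = epipolar_residual(uniform_flow, F_toy)
vertical_flow = np.zeros((16, 16, 2))
vertical_flow[..., 1] = 1.5
res_violating = epipolar_residual(vertical_flow, F_toy)
```

In this toy setup, coherent motion gives zero entropy and zero epipolar residual, while random or off-epipolar motion raises both. How MARVIS fuses such cues with image features, and how the EGC loss is actually weighted against the pixel-wise losses, is not specified in this summary.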