MirrorMamba: Towards Scalable and Robust Mirror Detection in Videos

Rui Song, Jiaying Lin, Rynson W. H. Lau

TL;DR

MirrorMamba tackles the challenge of robust mirror detection in videos by integrating static cues (perceived depth) and dynamic cues (correspondence, optical flow) within a unified Mamba-based architecture. It introduces two novel modules: MMCE for multi-direction correspondence extraction and BED for layer-wise boundary enforcement, enabling global reasoning with linear complexity. The method uses a shared VMamba-T backbone to fuse RGB, depth, and flow, and demonstrates state-of-the-art results on video benchmarks VMD-D and MMD and strong performance on the image-based PMD dataset, with ablations confirming the complementary benefits of the cues and modules. The work also shows the framework's scalability and potential for image-based mirror detection by removing dynamic cues.

Abstract

Video mirror detection has received significant research attention, yet existing methods suffer from limited performance and robustness. These approaches often over-rely on a single, unreliable dynamic feature, and are typically built on CNNs with limited receptive fields or Transformers with quadratic computational complexity. To address these limitations, we propose a new, effective, and scalable video mirror detection method called MirrorMamba. Our approach leverages multiple cues to adapt to diverse conditions, incorporating perceived depth, correspondence, and optical flow. We also introduce an innovative Mamba-based Multi-direction Correspondence Extractor, which benefits from the global receptive field and linear complexity of the emerging Mamba state space model to effectively capture correspondence properties. Additionally, we design a Mamba-based layer-wise boundary enforcement decoder to resolve the unclear boundaries caused by blurred depth maps. Notably, this work marks the first successful application of a Mamba-based architecture in the field of mirror detection. Extensive experiments demonstrate that our method outperforms existing state-of-the-art approaches for video mirror detection on the benchmark datasets. Furthermore, on the most challenging and representative image-based mirror detection dataset, our approach achieves state-of-the-art performance, demonstrating its robustness and generalizability.

Paper Structure

This paper contains 12 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Three typical scenarios where only a single cue is useful for mirror detection. In the top scenario, symmetry is the only useful cue; depth and flow information do not help. This allows VMD-Net, which relies on detecting correspondence, to detect the mirror correctly while other methods do not. In the middle scenario, only the relative depth map reveals the location of the mirror, so the method that uses depth information, i.e., PD-Net, performs best. In the bottom scenario, even humans have difficulty locating the mirror in a static image, while the optical flow map hints at its location. Thus, the method that uses flow information, i.e., MG-VMD, successfully detects the mirror. Our method leverages all three cues simultaneously and outperforms the others because it can handle all of these challenging scenarios.
  • Figure 2: The proposed MirrorMamba framework consists of three main components: (1) a shared VMamba-T backbone for feature extraction from RGB, depth, and optical flow (video only) inputs; (2) the Mamba-based Multi-direction Correspondence Extractor (MMCE), which fuses the extracted features to model the implicit correspondence between the inside and outside of the mirror; and (3) the Mamba-based Layer-wise Boundary Enforcement Decoder (BED), which progressively refines features by combining high-level semantic information from the previous BED layer with low-level detail features from the current layer. The final output is a high-quality mirror segmentation map with precise boundary details.
  • Figure 3: The MMCE module takes RGB, depth, and optical flow as inputs. To detect mirrors at various angles, MMCE employs four scanning blocks to capture horizontal and vertical flipping correspondences. $M1$ and $M2$ scan the image in opposite horizontal directions, while $M3$ and $M4$ scan in opposite vertical directions. The resulting attention maps are multiplied by T to enhance features with flipping-aware information, enabling robust mirror detection across diverse orientations and positions.
  • Figure 4: The BED module refines boundary details by integrating high-level semantic features with low-level spatial features. It employs a cross-Mamba module, a Mamba module, and a channel attention module to dynamically refine features, ensuring precise mirror boundary detection.
  • Figure 5: Qualitative results.
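The four scan directions described for MMCE in the Figure 3 caption can be illustrated with a small, dependency-free sketch. It only shows how a 2D grid of patch tokens might be flattened into four 1D sequences before being fed to a sequence model. The helper names `scan_orders` and `unscan` are hypothetical, and the exact assignment of directions to $M1$ through $M4$ (row-major, reversed row-major, column-major, reversed column-major) is an assumption, not the authors' implementation.

```python
from typing import Dict, List

Grid = List[List[int]]


def scan_orders(grid: Grid) -> Dict[str, List[int]]:
    """Flatten a 2D token grid into four 1D scan sequences.

    Hypothetical helper: the keys M1..M4 follow the Figure 3 caption, but
    the exact direction used by each block is an assumption.
    """
    # Row-major: left to right within a row, rows from top to bottom.
    row_major = [tok for row in grid for tok in row]
    # Column-major: top to bottom within a column, columns from left to right.
    col_major = [tok for col in zip(*grid) for tok in col]
    return {
        "M1": row_major,        # horizontal scan
        "M2": row_major[::-1],  # horizontal scan, opposite direction
        "M3": col_major,        # vertical scan
        "M4": col_major[::-1],  # vertical scan, opposite direction
    }


def unscan(seq: List[int], name: str, height: int, width: int) -> Grid:
    """Map a scanned sequence back onto the height x width grid."""
    if name in ("M2", "M4"):
        seq = seq[::-1]  # undo the reversal first
    if name in ("M1", "M2"):
        # Row-major layout: sequence index = r * width + c.
        return [seq[r * width:(r + 1) * width] for r in range(height)]
    # Column-major layout: sequence index = c * height + r.
    return [[seq[c * height + r] for c in range(width)] for r in range(height)]


grid = [[0, 1, 2],
        [3, 4, 5]]
orders = scan_orders(grid)
for name, seq in orders.items():
    # Every scan must be invertible so the outputs can be realigned.
    assert unscan(seq, name, 2, 3) == grid
    print(name, seq)
```

Scanning in opposite directions gives each token a causal view of the tokens on either side of it, which is one plausible reason paired scans help with flipped correspondences. Inverting each scan lets the per-direction outputs be realigned on the same grid before they are combined.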