Fusion of Short-term and Long-term Attention for Video Mirror Detection

Mingchen Xu, Jing Wu, Yukun Lai, Ze Ji

TL;DR

This work tackles video mirror detection by fusing short-term appearance features with long-term contextual information using a transformer-based architecture. The proposed FusionFormer, composed of Dual Gated Short-term Attention (DGSA), Long-term Attention (LA), and Short-Long Fusion (SLF), produces per-frame mirror maps by leveraging both micro-frame appearance and macro-video position cues. A new ViMirr dataset with 19,255 frames across 281 videos is introduced to evaluate robustness in diverse scenes. Experiments on ViMirr and the existing VMD dataset show state-of-the-art performance, and ablations demonstrate the contribution of each module. The approach advances robust mirror detection in realistic video settings and has potential impact on downstream scene understanding and robotics tasks.

Abstract

Techniques for detecting mirrors from static images have witnessed rapid growth in recent years. However, these methods detect mirrors from single input images. Detecting mirrors from video requires further consideration of temporal consistency between frames. We observe that humans can recognize mirror candidates, from just one or two frames, based on their appearance (e.g. shape, color). However, to ensure that the candidate is indeed a mirror (not a picture or a window), we often need to observe more frames for a global view. This observation motivates us to detect mirrors by fusing appearance features extracted from a short-term attention module and context information extracted from a long-term attention module. To evaluate the performance, we build a challenging benchmark dataset of 19,255 frames from 281 videos. Experimental results demonstrate that our method achieves state-of-the-art performance on the benchmark dataset.

Paper Structure (15 sections, 5 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Two ordinary scenarios where existing methods [b4, b5] fail. HetNet [b4] is a single-image mirror detection method, and VMDNet [b5] is designed for video mirror detection. Unlike HetNet and VMDNet, our method detects the mirror regions correctly by fusing short-term and long-term information.
  • Figure 2: The architecture of our proposed model. We first feed three frames from the same video to the backbone feature extractor. The DGSA module then extracts appearance features from adjacent frames while, in parallel, the LA module extracts context features from long video clips. Finally, the SLF module fuses the short-term and long-term attention to finalize the mirror region.
  • Figure 3: Schematic illustration of the Dual Gated Short-term Attention (DGSA) module. The grey part represents the short-term attention (SA) block. The pink parts represent the fusion blocks. The green and blue parts represent the spatial-wise gate (SG) block and the channel-wise gate (CG) block, respectively.
  • Figure 4: Videos in our ViMirr dataset show high diversity and low similarity. They cover a wide range of everyday scenes.
  • Figure 5: Details of the Short-Long Fusion (SLF) module.
  • ...and 1 more figure
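
The data flow described in the Figure 2 caption (a short-term branch over adjacent frames, a long-term branch over a distant frame, then fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `attend`, `fuse`, and `sketch_forward` are hypothetical, plain scaled dot-product attention stands in for the DGSA and LA modules (the spatial and channel gates are omitted), and a fixed convex combination stands in for the SLF module. Each frame is assumed to be already encoded by a backbone as a list of token vectors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: every query token attends over all key tokens."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def fuse(short_feat, long_feat, gate=0.5):
    """Assumed fusion rule (not the paper's SLF): convex combination of both branches."""
    return [
        [gate * s + (1.0 - gate) * l for s, l in zip(s_tok, l_tok)]
        for s_tok, l_tok in zip(short_feat, long_feat)
    ]


def sketch_forward(current, adjacent, distant, gate=0.5):
    """Three frames in, fused per-token features for the current frame out."""
    short_feat = attend(current, adjacent, adjacent)  # short-term branch: adjacent frame
    long_feat = attend(current, distant, distant)     # long-term branch: distant frame
    return fuse(short_feat, long_feat, gate)
```

In a real model the fused features would pass through a decoder to produce the per-frame mirror map; here the output is simply one fused vector per token of the current frame, which is enough to show how the two branches run independently before being combined.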