Occluded Video Instance Segmentation: A Benchmark

Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip H. S. Torr, Song Bai

TL;DR

The paper introduces OVIS, a large-scale benchmark for occluded video instance segmentation, revealing that state-of-the-art methods struggle under heavy occlusion (APs as low as 6.3 on heavily occluded instances). It proposes Temporal Feature Calibration (TFC), a plug-in that uses nearby frames via deformable fusion to recover missing cues, and demonstrates substantial gains when integrated with MaskTrack R-CNN and SipMask. Through extensive evaluations of nine baselines, oracle analyses, and data-augmentation studies, the work shows that temporal context and better feature representations significantly boost occlusion handling, while NMS adjustments and occlusion-specific image-level methods offer limited gains. The authors provide detailed dataset statistics, baselines, and ablations to guide future occlusion-aware VIS research and suggest directions like occlusion-aware modeling, data generation, and large-scale pre-training. Overall, OVIS serves as a challenging testbed to catalyze advances in robust object perception in real-world occluded scenes.

Abstract

Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis .

Paper Structure

This paper contains 40 sections, 4 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Sample video clips from OVIS. The hairs and whiskers of animals are all exhaustively annotated.
  • Figure 2: Different occlusion levels in OVIS. Unoccluded objects are colored green, slightly occluded objects are colored yellow, and severely occluded objects are colored red.
  • Figure 3: Number of instances per category in the OVIS dataset.
  • Figure 4: Comparison of OVIS with YouTube-VIS, including the distribution of instance duration (a), the Bounding-box Occlusion Rate (BOR) (b), the number of instances per video (c), and the number of objects per frame (d).
  • Figure 5: Visualization of occlusions with different BOR values.
  • ...and 6 more figures