PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation

Shilin Yan, Xiaohao Xu, Renrui Zhang, Lingyi Hong, Wenchao Chen, Wenqiang Zhang, Wei Zhang

TL;DR

Panoramic video object segmentation faces domain gaps and unique challenges such as content discontinuities and distortions absent in planar data. The authors introduce PanoVOS, a high-quality panoramic VOS dataset with 150 videos, 19,145 instance masks, and long sequences to benchmark panoramic tracking and segmentation. They also propose PSCFormer, a Panoramic Space Consistency Transformer that employs PSC blocks to learn pixel-level spatial-temporal correspondence and boundary continuity across panorama frames, enabling effective segmentation under distortion and discontinuity. Empirical results show PSCFormer achieving state-of-the-art performance on PanoVOS and reveal significant domain-transfer gaps for planar VOS methods, underscoring the need for panorama-aware modeling and datasets.

Abstract

Panoramic videos contain richer spatial information and have attracted considerable attention owing to the exceptional experience they provide in fields such as autonomous driving and virtual reality. However, existing datasets for video segmentation focus only on conventional planar images. To address this challenge, we present a panoramic video dataset, PanoVOS. The dataset provides 150 videos with high video resolutions and diverse motions. To quantify the domain gap between 2D planar videos and panoramic videos, we evaluate 15 off-the-shelf video object segmentation (VOS) models on PanoVOS. Through error analysis, we find that all of them fail to handle the pixel-level content discontinuities of panoramic videos. Thus, we present a Panoramic Space Consistency Transformer (PSCFormer), which can effectively utilize the semantic boundary information of the previous frame for pixel-level matching with the current frame. Extensive experiments demonstrate that, compared with previous SOTA models, our PSCFormer network exhibits a clear advantage in segmentation results under the panoramic setting. Our dataset poses new challenges in panoramic VOS, and we hope that PanoVOS can advance the development of panoramic segmentation and tracking.
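
To make the notion of a content discontinuity concrete, the following is a minimal sketch rather than anything taken from the paper. It assumes the panoramic frames are stored in an equirectangular layout, where the leftmost and rightmost image columns are neighbours on the sphere. The function name `count_runs` and the toy mask row are hypothetical. The sketch counts contiguous foreground runs in one mask row, first treating the row as a planar image and then as a wrap-around panorama.

```python
import numpy as np

def count_runs(row, circular=False):
    """Count contiguous runs of foreground pixels in a 1D mask row.

    circular=True treats the first and last columns as adjacent, which is
    the assumption made here for an equirectangular panorama.
    """
    row = np.asarray(row, dtype=bool)
    if not row.any():
        return 0
    # A run starts where a pixel is foreground and its left neighbour is not.
    left = np.roll(row, 1)
    if not circular:
        left[0] = False  # planar view: nothing lies to the left of column 0
    starts = row & ~left
    n = int(starts.sum())
    if circular and n == 0:
        return 1  # every pixel is foreground: one run that wraps all the way round
    return n

# Hypothetical mask row of an object that crosses the left/right border.
row = [1, 1, 0, 0, 0, 0, 1, 1]
planar_runs = count_runs(row)                    # object appears split in two
panoramic_runs = count_runs(row, circular=True)  # one object once the border wraps
```

Under these assumptions, a model that treats the frame as planar sees two disconnected pieces, while a wrap-around view sees a single object, which is the kind of boundary case the abstract refers to.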

Paper Structure (22 sections, 4 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Panoramic video object segmentation (PanoVOS). PanoVOS targets tracking and distinguishing particular instances under content discontinuities (e.g., the penguin in the image at $T=15$) and severe distortion (e.g., the penguin in the image at $T=65$). We show samples of (a) frames, (b) segmentation annotations, and (c) the foreground area proportion for the Penguin video in our dataset.
  • Figure 2: PanoVOS dataset. We select 10 samples from the dataset involving major scenes. For each video, there are high-quality instance-level pixel-wise masks.
  • Figure 3: Instance-level distribution of PanoVOS dataset. Our dataset contains three major divisions: person, animals, and common objects with 35 sub-divisions.
  • Figure 4: PanoVOS annotation pipeline. Our annotation pipeline includes two phases. (1) The first phase is called Key Frames Select and Annotate. The annotator browses the video and picks out the object to be annotated. Then, instances are manually annotated at 1 fps and corrected by another annotator. (2) The second phase is called All Frames Propagate and Refine. In this phase, we apply a semi-supervised video object segmentation model to help propagate the annotated masks and the generated instances are refined by annotators.
  • Figure 5: (a) PSCFormer overview. Given the query frame $\mathbf{x}_t$ and reference frames $\{\mathbf{x}_i| i\in\mathcal{R} \}$, the goal of VOS is to delineate objects from the background by generating mask $\mathbf{y}_t$ for query frame $\mathbf{x}_t$. The reference frames and the query frame are encoded by the memory encoder and query encoder, respectively. Multiple stacked panoramic space consistency (PSC) blocks are used to leverage the correspondence in the panoramic space between the reference frames and the query frame. A decoder is used to generate the prediction for the query frame. (b) Panoramic space consistency block architecture details.
  • ...and 3 more figures
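
The Figure 5 caption describes matching between reference frames and the query frame. As a rough illustration only, the sketch below uses generic scaled dot-product attention to carry a reference mask over to query pixels. It is not the PSC block, whose internals are not given here. The function name `attention_readout`, the feature shapes, and the toy values are all assumptions made for this example.

```python
import numpy as np

def attention_readout(query_feat, ref_feat, ref_mask):
    """Propagate a soft mask from reference pixels to query pixels.

    query_feat: (Nq, C) features of the query frame pixels
    ref_feat:   (Nr, C) features of the reference frame pixels
    ref_mask:   (Nr,)   foreground probability of each reference pixel
    Returns an (Nq,) array of foreground scores for the query pixels.
    """
    query_feat = np.asarray(query_feat, dtype=float)
    ref_feat = np.asarray(ref_feat, dtype=float)
    ref_mask = np.asarray(ref_mask, dtype=float)
    scale = np.sqrt(query_feat.shape[1])
    logits = query_feat @ ref_feat.T / scale        # (Nq, Nr) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over reference pixels
    return weights @ ref_mask

# Toy example: two reference pixels (foreground, background) and two query
# pixels whose features match one reference pixel each.
ref_feat = np.array([[10.0, 0.0], [0.0, 10.0]])
ref_mask = np.array([1.0, 0.0])
query_feat = np.array([[10.0, 0.0], [0.0, 10.0]])
scores = attention_readout(query_feat, ref_feat, ref_mask)
```

In this toy setting the first query pixel inherits a score close to 1 and the second a score close to 0, because each attends almost entirely to the reference pixel with matching features.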