Show Me When and Where: Towards Referring Video Object Segmentation in the Wild

Mingqi Gao, Jinyu Yang, Jingnan Luo, Xiantong Zhen, Jungong Han, Giovanni Montana, Feng Zheng

Abstract

Referring video object segmentation (RVOS) has recently attracted considerable attention in computer vision due to its widespread applications. The existing RVOS setting relies on elaborately trimmed videos in which the text-referred objects appear in every frame, and therefore fails to fully reflect the realistic challenges of the task. This simplified setting requires RVOS methods only to predict where objects are, with no need to show when they appear. In this work, we introduce a new setting towards in-the-wild RVOS. To this end, we collect a new benchmark dataset, YouTube Untrimmed videos for RVOS (YoURVOS), which contains 1,120 in-the-wild videos with 7 times more duration and scenes than existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object-level Multimodal TransFormers (OMFormer), which tackles these challenges by encoding object-level multimodal interactions for efficient and global spatial-temporal localisation. We demonstrate that previous VOS methods struggle on our YoURVOS benchmark, especially as the number of target-absent frames increases, while OMFormer consistently performs well. Our YoURVOS dataset offers a much-needed benchmark that will push forward the advancement of RVOS methods for practical applications.

Paper Structure (32 sections, 6 equations, 9 figures, 8 tables)

This paper contains 32 sections, 6 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Examples from YoURVOS. The text-referred objects are highlighted spatially with blue and orange masks and temporally with thick lines in the same colours. The most distinctive feature of YoURVOS is that its videos are untrimmed, i.e., no video is trimmed to fit the span of a text-referred object. As the examples show, most objects in YoURVOS appear in one or several segments of a video rather than in all frames. The untrimmed setting also covers more complex situations, e.g., broader temporal contexts and multiple scenes. These bring RVOS closer to realistic scenarios and introduce several new challenges: (1) spatial-temporal joint localisation, (2) multimodal analysis on untrimmed videos, and (3) long-term video segmentation.
  • Figure 2: Dataset construction pipeline. (a) For Ref-YouTube-VOS videos (orange film), we retrieve their untrimmed sources from YouTube-8M [abu2016youtube] and pad them with the footage before and after (red films); (b) we select target objects from the collected videos and annotate the corresponding language descriptions and spans (orange thick line); (c) we annotate masks (orange masks) for the target objects. The example above also shows that YoURVOS videos involve much more complex contexts and scenes than trimmed ones.
  • Figure 3: Previous RVOS vs. RVOS in the wild. Video frames are sampled every 3 seconds, with timestamps shown at the bottom left. Yellow and white numbers indicate target-relevant and target-irrelevant frames, respectively. Compared to previous benchmarks, YoURVOS is much more challenging and closer to realistic scenarios, with long-term videos, target-irrelevant distractions, and multiple scenes, calling for spatial-temporal joint localisation.
  • Figure 4: Distributions of RVOS datasets over object span, number of scenes, and TI. In each violin chart summarising the start/end times of object appearance, the shading depicts the probability density. The three markers denote the minimum, median, and maximum. Plots with fewer than three markers result from highly concentrated data distributions. The violin charts share the same legend as the line charts in this figure. A2D-Sentences [gavrilyuk2018actor] is not shown in some cases due to its sparse mask annotations.
  • Figure 5: Framework of Object-level Multimodal TransFormers (OMFormer).
  • ...and 4 more figures