VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement

Hanjung Kim, Jaehyun Kang, Miran Heo, Sukjun Hwang, Seoung Wug Oh, Seon Joo Kim

TL;DR

This paper shows that appearance information is a key axis of object matching in trackers, and that it becomes highly instructive when positional cues are insufficient for distinguishing object identities. It proposes a simple yet powerful extension to object decoders that explicitly extracts embeddings from backbone features and drives queries to capture the appearances of objects.

Abstract

In recent years, online Video Instance Segmentation (VIS) methods have advanced remarkably thanks to powerful query-based detectors. By using the frame-level output queries of the detector, these methods achieve high accuracy on challenging benchmarks. However, we observe that they rely heavily on location information, which often causes incorrect associations between objects. This paper shows that appearance information is a key axis of object matching in trackers, and that it becomes highly instructive when positional cues are insufficient for distinguishing object identities. We therefore propose a simple yet powerful extension to object decoders that explicitly extracts embeddings from backbone features and drives queries to capture the appearances of objects, which greatly improves instance association accuracy. Furthermore, because existing benchmarks do not fully evaluate appearance awareness, we construct a synthetic dataset to rigorously validate our method. By effectively resolving the over-reliance on location information, we achieve state-of-the-art results on YouTube-VIS 2019/2021 and Occluded VIS (OVIS). Code is available at https://github.com/KimHanjung/VISAGE.
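
To illustrate why an appearance cue can correct a location-biased association, the following is a minimal sketch, not the released implementation. It assumes each instance carries an object embedding and an appearance embedding, scores pairs by the sum of the two cosine similarities, and finds the assignment by exhaustive search. The function names `cosine` and `match`, the toy vectors, and the scoring rule are all hypothetical choices made for this example.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match(prev, curr, use_appearance=True):
    """Return a tuple m such that curr[i] is assigned to prev[m[i]].

    prev and curr are equal-length lists of
    (object_embedding, appearance_embedding) pairs.
    Assumption: exhaustive search over permutations, which is only
    practical for a handful of instances.
    """
    def score(i, j):
        s = cosine(curr[i][0], prev[j][0])
        if use_appearance:
            s += cosine(curr[i][1], prev[j][1])
        return s

    n = len(curr)
    return max(permutations(range(n)),
               key=lambda perm: sum(score(i, perm[i]) for i in range(n)))


# Toy scenario (made-up numbers): two instances exchange positions, so the
# location-biased object embeddings of the current frame resemble the
# *other* instance, while the appearance embeddings stay stable.
prev = [([1.0, 0.2], [1.0, 0.0]), ([0.2, 1.0], [0.0, 1.0])]
curr = [([0.4, 1.0], [0.9, 0.1]), ([1.0, 0.4], [0.1, 0.9])]

object_only = match(prev, curr, use_appearance=False)  # identities swapped
guided = match(prev, curr, use_appearance=True)        # identities preserved
```

In this toy case the object-only score prefers the swapped assignment, while adding the appearance term restores the correct one. A practical tracker would replace the exhaustive search with a polynomial-time assignment solver.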

Paper Structure (26 sections, 1 equation, 7 figures, 7 tables, 1 algorithm)

Figures (7)

  • Figure 1: Qualitative results across challenging scenarios. Predictions from query-propagation methods (GenVIS, DVIS), query-matching methods (MinVIS, CTVIS, IDOL), and our appearance-guided method. The first row illustrates a shot change across consecutive frames, a scenario in which previous methods fail to maintain consistent tracking. The second and third rows show trajectory intersections, which lead to ID switches with previous methods. Unlike previous methods, our method tracks objects without ID switches or lost tracks. Best viewed in color.
  • Figure 2: Proof of concept demonstrated with a flipped image. Previous methods (MinVIS, CTVIS, DVIS, GenVIS, IDOL) struggle with instance matching in flipped images, revealing a dependency on location. Our method, VISAGE, addresses this by emphasizing appearance, enabling accurate instance matching even when the image is flipped.
  • Figure 3: Overview of VISAGE. (a) The architecture of VISAGE, which generates an object embedding and an appearance embedding. (b) Overall inference pipeline of VISAGE: at time step $t-1$, the memory bank is updated with both the appearance embedding and the object embedding. Then, at time step $t$, the memory embedding is read from the memory bank and used for matching. (c) Details of the matching process: in this scenario, using only object embeddings leads to incorrect matching, but guidance from the appearance embedding corrects it. Best viewed in color.
  • Figure 4: t-SNE visualization on the OVIS dataset. The rows represent three different videos. Each column corresponds to the type of query embedding used. Points plotted in the same color indicate the same instance across the dataset. Best viewed in color.
  • Figure 5: Visualization of the pseudo dataset. In track-type videos, instances move along random Bézier curves. In swap-type videos, the positions of the instances are exchanged in the middle of the video. The colored dot above each instance marks the corresponding instance in the swapped frame.
  • ...and 2 more figures
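
The track-type motion described in the Figure 5 caption can be sketched as follows. This is an illustrative guess rather than the authors' data generator: the caption only states that instances move along random Bézier curves, so the cubic degree, the uniform sampling of control points inside the frame, and the names `cubic_bezier` and `random_track` are assumptions made here.

```python
import random


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at t in [0, 1]; points are (x, y) tuples."""
    s = 1.0 - t
    # Bernstein weights of the four control points.
    w = (s ** 3, 3 * s ** 2 * t, 3 * s * t ** 2, t ** 3)
    pts = (p0, p1, p2, p3)
    x = sum(wi * p[0] for wi, p in zip(w, pts))
    y = sum(wi * p[1] for wi, p in zip(w, pts))
    return (x, y)


def random_track(num_frames, width, height, rng):
    """Sample four random control points and return one (x, y) centre per frame.

    Assumption: num_frames >= 2 and control points are drawn uniformly
    inside a width x height frame.
    """
    ctrl = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(4)]
    return [cubic_bezier(*ctrl, t=i / (num_frames - 1))
            for i in range(num_frames)]


# Example: a 5-frame trajectory inside a hypothetical 640x480 frame.
track = random_track(5, 640, 480, random.Random(0))
```

Because a Bézier curve stays inside the convex hull of its control points, every sampled centre remains within the frame bounds.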