Putting the Object Back into Video Object Segmentation

Ho Kei Cheng; Seoung Wug Oh; Brian Price; Joon-Young Lee; Alexander Schwing

TL;DR

Cutie introduces object-level memory reading for video object segmentation. It maintains a compact object memory and a small set of object queries that interact with high-resolution pixel features through an object transformer. Foreground-background masked attention cleanly separates the semantics of the foreground object from the background, which makes segmentation more robust in scenes with distractors. With ablations and evaluations on MOSE, DAVIS, and YouTubeVOS, the authors report that Cutie improves on MOSE by 8.7 J&F over XMem at a similar running time and by 4.2 J&F over DeAOT while running three times faster. The approach fuses top-down object-centric reasoning with bottom-up pixel memory in an end-to-end trainable framework, and the authors release their code.
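The foreground-background masked attention mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the even split of queries into a foreground half and a background half, the externally supplied boolean mask, and the fallback for an empty mask are all assumptions made for illustration.

```python
import numpy as np

def fg_bg_masked_attention(queries, pixels, fg_mask):
    """Hypothetical sketch of foreground-background masked cross-attention.

    queries: (N, C) object queries
    pixels:  (P, C) flattened pixel features
    fg_mask: (P,) boolean, True where a pixel is currently believed foreground
    Returns (N, P) attention weights; each row sums to 1.
    """
    n, c = queries.shape
    logits = queries @ pixels.T / np.sqrt(c)  # scaled dot-product scores, (N, P)

    # Assumption: the first half of the queries are foreground queries,
    # the second half are background queries.
    is_fg_query = np.arange(n) < n // 2
    allowed = np.where(is_fg_query[:, None], fg_mask[None, :], ~fg_mask[None, :])

    # Assumption: if a query has no allowed pixel (e.g. empty foreground),
    # fall back to unmasked attention so the softmax stays well defined.
    empty = ~allowed.any(axis=1)
    allowed[empty] = True

    logits = np.where(allowed, logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)                     # masked entries become exactly 0
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))    # 3 foreground + 3 background queries
p = rng.normal(size=(10, 8))   # 10 pixels
mask = np.zeros(10, dtype=bool)
mask[:4] = True                # first 4 pixels are foreground
attn = fg_bg_masked_attention(q, p, mask)
```

In this sketch the foreground queries place zero weight on background pixels and vice versa, so the two groups can only exchange information in a later step such as self-attention among the queries.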

Abstract

We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Through these queries, it interacts iteratively with the bottom-up pixel features using a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: https://hkchengrex.github.io/Cutie
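For readers unfamiliar with the J&F score quoted above: in the standard VOS benchmarks it is the mean of region similarity J (the intersection-over-union of predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below computes J for a single mask pair and averages it with a given F value. The function names are hypothetical, the boundary-matching computation of F is omitted, and returning 1.0 when both masks are empty is an assumed convention.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return 0.5 * (j + f)

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
j = region_similarity(pred, gt)   # intersection 1, union 2
score = j_and_f(j, 0.7)           # 0.7 is a made-up F value for illustration
```

Benchmark numbers such as those in the abstract are averages of these per-object, per-frame scores over a whole dataset, usually reported on a 0 to 100 scale.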

Paper Structure (44 sections, 13 equations, 12 figures, 12 tables)

Introduction
Related Works
Cutie
  Overview
  Object Transformer
    Overview
    Foreground-Background Masked Attention
    Positional Embeddings
  Object Memory
Implementation Details
  Pixel Memory
  Network Architecture
  Training
  Inference
Experiments
...and 29 more sections

Figures (12)

Figure 1: Comparison of pixel-level memory reading vs. object-level memory reading. In each box, the left is the reference frame, and the right is the query frame to be segmented. Red arrows indicate wrong matches. Low-level pixel matching (e.g., XMem [cheng2022xmem]) can be noisy in the presence of distractors. We propose object-level memory reading for more robust video object segmentation.
Figure 2: Overview of Cutie. We store pixel memory $F$ and object memory $S$ representations from past segmented (memory) frames. Pixel memory is retrieved for the query frame as pixel readout $R_0$, which bidirectionally interacts with object queries $X$ and object memory $S$ in the object transformer. The $L$ object transformer blocks enrich the pixel feature with object-level semantics and produce the final object readout $R_L$ for decoding into the output mask. Standard residual connections, layer normalization, and skip-connections from the query encoder to the decoder are omitted for readability.
Figure 3: Visualization of cross-attention weights (rows of $A_L$) in the object transformer. The middle cat is the target object. Top: without foreground-background masking, some queries mix semantics from foreground and background (framed in red). Bottom: with foreground-background masking. The leftmost three are foreground queries, and the rightmost three are background queries. Semantics are thus cleanly separated. The foreground and background queries can communicate in the subsequent self-attention layer. Note that the queries attend to different foreground regions, distractors, and background regions.
Figure 4: Visualization of auxiliary masks ($M_l$) at different layers of the object transformer. At every layer, noise is suppressed (pink arrows) and the target object becomes more coherent (yellow arrows).
Figure S1: Comparison of foreground attention patterns between pixel attention and object query attention. In each of the three examples, the two leftmost columns show the input image and the ground truth (annotated by us for reference). The two rightmost columns show the attention patterns for pixel attention and object query attention respectively. Ideally, the attention weights should focus on the foreground object. As shown, pixel attention has broader coverage but is easily distracted by similar objects. Object query attention is sparser (as we use a small number of object queries) and is more focused on the foreground. Our object transformer makes use of both: it first initializes with pixel attention and restructures the features iteratively with object query attention.
...and 7 more figures
