Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation

Jiaming Zhang, Yutao Cui, Gangshan Wu, Limin Wang

TL;DR

This work introduces JointFormer, a unified transformer-based framework for semi-supervised video object segmentation that jointly models feature extraction, dense correspondence, and an instance-level compressed memory. The core component, the Joint Modeling Block, enables simultaneous propagation of target information across frames while handling distractors through an asymmetric interaction design, complemented by a customized online updating mechanism for a single memory token per target. The approach achieves state-of-the-art results on major benchmarks (DAVIS 2017 and YouTube-VOS) and demonstrates strong generalization to new datasets (MOSE, VISOR, VOST, LVOS), outperforming prior decoupled methods without requiring extensive synthetic pre-training. Overall, JointFormer establishes a robust, extensible baseline for VOS by tightly integrating multi-level feature learning, temporal memory, and frame-wise target propagation within a single architecture.

Abstract

Prevailing Video Object Segmentation methods follow an extraction-then-matching pipeline, which first extracts features from the current and reference frames independently and then performs dense matching between them. This decoupled pipeline limits information propagation between frames to high-level features, which deprives matching of fine-grained details. Furthermore, pixel-wise matching lacks a holistic understanding of the target, making it prone to disturbance by similar distractors. To address these issues, we propose a unified VOS framework, coined JointFormer, for jointly modeling feature extraction, correspondence matching, and a compressed memory. The core Joint Modeling Block leverages attention to simultaneously extract target information and propagate it from the reference frame to the current frame and to a compressed memory token. This joint scheme enables extensive multi-layer propagation beyond the high-level feature space and facilitates robust instance-distinctive feature learning. To incorporate long-term and holistic target information, we introduce a compressed memory token with a customized online updating mechanism, which aggregates target features and facilitates temporal information propagation in a frame-wise manner, enhancing global modeling consistency. Our JointFormer achieves new state-of-the-art performance on the DAVIS 2017 val/test-dev benchmarks (89.7% and 87.6%) and the YouTube-VOS 2018/2019 val benchmarks (87.0% and 87.0%), outperforming existing works. To demonstrate the generalizability of our model, we further evaluate it on four new benchmarks of varying difficulty: MOSE for complex scenes, VISOR for egocentric videos, VOST for complex transformations, and LVOS for long-term videos.
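The one-directional propagation described in the abstract (target information flows from the reference frame to the current frame and to the compressed memory token) can be illustrated with a masked attention sketch. This is a minimal pure-Python illustration, not the authors' implementation: the function names `build_asymmetric_mask` and `masked_attention`, the token ordering, and the exact attend-to sets (reference attends only to reference; current attends to reference and current; memory attends to reference and itself) are all assumptions made for the example.

```python
import math


def build_asymmetric_mask(n_ref, n_cur, n_mem=1):
    """Return allowed[q][k]: True if query token q may attend to key token k.

    Token order is [reference | current | memory]. The attend-to sets below
    are assumptions for illustration, not details taken from the paper.
    """
    groups = ["ref"] * n_ref + ["cur"] * n_cur + ["mem"] * n_mem
    attends_to = {
        "ref": {"ref"},          # reference tokens are never altered by other groups
        "cur": {"ref", "cur"},   # current tokens receive reference information
        "mem": {"ref", "mem"},   # memory token aggregates reference information
    }
    return [[k in attends_to[q] for k in groups] for q in groups]


def masked_attention(queries, keys, values, allowed):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    dim = len(queries[0])
    outputs = []
    for qi, q in enumerate(queries):
        # Keep only the keys this query is allowed to see.
        idx = [ki for ki in range(len(keys)) if allowed[qi][ki]]
        scores = [
            sum(a * b for a, b in zip(q, keys[ki])) / math.sqrt(dim) for ki in idx
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[ki][d] for w, ki in zip(weights, idx))
            for d in range(len(values[0]))
        ])
    return outputs
```

With such a mask, changing the current-frame tokens leaves the outputs at the reference positions unchanged, which is the asymmetry the sketch is meant to show. A real block would add learned query/key/value projections, multiple heads, and the mode-dependent interactions the paper describes.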

Paper Structure (24 sections, 6 equations, 11 figures, 13 tables)

This paper contains 24 sections, 6 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: The pipeline comparison between existing VOS works (a) and ours (b). (a) Existing works perform feature extraction and matching separately in a decoupled way. (b) Our framework jointly models feature, correspondence, and our compressed memory without the post-matching in a unified pipeline.
  • Figure 2: Overview of our JointFormer. The current frames and the reference frames with masks are split and flattened into patches, then fed together with our compressed memory into the Vision Transformer, which consists of Joint Modeling Blocks for simultaneous feature extraction and target information propagation from the reference features to the current features and the compressed memory token. After that, the current tokens are enhanced with the compressed memory; the current tokens and the enhanced current tokens are reshaped to 2D, concatenated, and fed into the decoder for mask prediction. In addition, the decoder fuses its internal features with the mask prediction to generate decoder tokens that update the compressed memory token, which is subsequently used for predicting the next frame.
  • Figure 3: Detailed view of the Joint Modeling Block, which jointly models features, correspondence, and the compressed memory with attention mechanisms, shown as brown arrows. Specifically, the source of an arrow represents the key/value and the arrowhead represents the query. A dotted arrow indicates that the interaction exists only in the corresponding mode labeled at the side.
  • Figure 4: Detailed view of the Decoder. We first fuse the current tokens and the enhanced current tokens, then progressively upsample and fuse them with multi-scale features from the backbone, and finally predict the target logits with a convolution. Furthermore, we leverage the predicted logits and internal features within the decoder to generate decoder tokens for updating the compressed memory.
  • Figure 5: Qualitative comparisons of our model with XMem and Swin-DeAOT on the DAVIS and YouTube-VOS benchmarks. The difficulty in the first three columns lies in recognizing similar objects, while in the remaining columns it lies in tiny targets and object boundaries. We mark their failures with white dashed boxes. Our model outperforms them in capturing fine details and discriminating between similar objects.
  • ...and 6 more figures