Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation

Sun-Hyuk Choi, Hayoung Jo, Seong-Whan Lee

TL;DR

This work addresses two core issues in referring video object segmentation (RVOS): query inconsistency and insufficient temporal context. It introduces the Multi-context Temporal Consistency Module (MTCM), composed of an Aligner that enforces cross-frame query consistency and a Multi-Context Enhancer (MCE) that integrates local and global temporal context to identify the target object. Applied to four baselines, MTCM consistently improves performance across MeViS, A2D Sentences, and JHMDB Sentences, reaching 47.6 J&F on MeViS. The proposed training strategy and modular design enable temporal modeling while preserving frame-level detail. Code is available at https://github.com/Choi58/MTCM. Overall, MTCM is a modular enhancement for transformer-based RVOS that improves both robustness to mid-video target shifts and text-to-object alignment.

Abstract

Referring video object segmentation aims to segment objects within a video corresponding to a given text description. Existing transformer-based temporal modeling approaches face challenges related to query inconsistency and the limited consideration of context. Query inconsistency produces unstable masks of different objects in the middle of the video. The limited consideration of context leads to the segmentation of incorrect objects by failing to adequately account for the relationship between the given text and instances. To address these issues, we propose the Multi-context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency. The MCE predicts text-relevant queries by considering multiple contexts. We applied MTCM to four different models and improved performance on all of them, notably achieving 47.6 J&F on MeViS. Code is available at https://github.com/Choi58/MTCM.
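
For readers unfamiliar with the metric, J&F is conventionally the mean of region similarity J (mask intersection over union) and contour accuracy F (a boundary F-measure). The sketch below is illustrative only and is not taken from the authors' code: it computes J for one pair of binary masks and averages it with an F value supplied by the caller, since a full boundary F-measure does not fit in a short example. The names `region_similarity` and `j_and_f`, the toy masks, and the F value are all assumptions made for illustration.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = region_similarity(pred, gt)
print(j)
print(j_and_f(j, 0.7))  # F = 0.7 is a made-up value for illustration
```

In a benchmark setting these per-frame scores would be aggregated over frames and videos; the aggregation protocol is dataset-specific and is not shown here.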

Paper Structure

This paper contains 13 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the proposed module, which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner improves query consistency, while the MCE selects objects by considering both local and global contexts.
  • Figure 2: The structure of the Aligner (a) and the Multi-Context Enhancer (b). The Aligner aligns the queries and removes irrelevant information by utilizing queries from the previous frame, ensuring that each query shares common features. The MCE captures the multi-context of each query to supplement the information of each frame, enabling accurate object selection.
  • Figure 3: Qualitative comparison of our method with LMPM and DsHmp. Red boxes indicate the targets. (a) and (b) show the text query given for each video. Both videos are challenging samples in which the target object appears in the middle of the video.
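
The two ideas in the captions of Figures 1 and 2 (keeping each query slot tied to the same object across frames, then summarizing that slot over local and global temporal ranges) can be illustrated with a toy sketch. This is not the authors' implementation: the paper's Aligner and MCE are learned modules, whereas this example swaps in greedy cosine-similarity matching for alignment and plain averaging for context, and it does not model noise removal. All names (`align_to_previous`, `local_context`, `global_context`, the `window` parameter) and the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def align_to_previous(prev: List[Vector], curr: List[Vector]) -> List[Vector]:
    """Reorder curr so that slot i holds the query most similar to prev[i].

    Greedy one-to-one matching, a simple stand-in for a learned aligner.
    Assumes prev and curr contain the same number of queries.
    """
    remaining = list(range(len(curr)))
    aligned = []
    for p in prev:
        best = max(remaining, key=lambda j: cosine(p, curr[j]))
        remaining.remove(best)
        aligned.append(curr[best])
    return aligned


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def local_context(track: List[Vector], t: int, window: int) -> Vector:
    """Average of one query slot over frames within `window` of frame t."""
    lo, hi = max(0, t - window), min(len(track), t + window + 1)
    return mean(track[lo:hi])


def global_context(track: List[Vector]) -> Vector:
    """Average of one query slot over every frame of the video."""
    return mean(track)


# Three frames with two queries each; frame 1 lists its queries in swapped order.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],
    [[1.0, 0.1], [0.0, 1.0]],
]

aligned = [frames[0]]
for frame in frames[1:]:
    aligned.append(align_to_previous(aligned[-1], frame))

track0 = [frame[0] for frame in aligned]  # slot 0 followed across all frames
print(track0)
print(local_context(track0, t=0, window=1))
print(global_context(track0))
```

After alignment, slot 0 follows the same toy "object" in every frame, so averaging over a short window or over the whole video mixes features of one object rather than of several.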