Referring Video Object Segmentation with Cross-Modality Proxy Queries

Baoli Sun, Xinzhu Ma, Ning Wang, Zhihui Wang, Zhiyong Wang

TL;DR

This paper tackles the referring video object segmentation (RVOS) problem by improving cross-modal alignment between visual content and natural language expressions. It introduces ProxyFormer, which leverages cross-modality proxy queries that propagate through the video encoder and are refined via two CMIE blocks, enabling inter-frame dependency modeling and early, integrated language constraints. A Joint Semantic Consistency training strategy further aligns proxy-query semantics with joint video-text representations, boosting segmentation accuracy. Across four RVOS benchmarks, ProxyFormer achieves state-of-the-art results with notable gains over strong baselines, demonstrating improved temporal coherence and robustness to appearance changes in video targets.

Abstract

Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response mechanism built on a transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints late, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of ProxyFormer over state-of-the-art methods.

Paper Structure

This paper contains 17 sections, 11 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Comparison of current RVOS pipelines based on conditional queries and their segmentation results. (a) Learnable queries. (b) Language as queries. (c) Our cross-modality proxy queries. Bottom: The language expression for the target instance is "a person with the red surfboard" and the target segmentation objects are indicated by the red boxes. The learnable-queries method, MTTR [DBLP:conf/cvpr/BotachZB22], may segment incorrect objects due to the absence of linguistic constraints during the decoding stage. The language-as-queries method, ReferFormer [DBLP:conf/cvpr/WuJSYL22], struggles to track the target accurately through significant frame-to-frame variations because it does not model inter-frame dependence and variability. Our method based on cross-modality proxy queries, ProxyFormer, produces more reasonable predictions.
  • Figure 2: Overview of the proposed ProxyFormer framework. Given a frame sequence $\mathcal{V}= \{v_t\}_{t=1}^T$, and a textual expression $\mathcal{T}$, we first extract visual and linguistic features $F_v$ and $F_t$, then introduce a set of proxy queries $Q$ to bi-directionally interact with $F_v$ and $F_t$ via $K$ stacked Cross-Modality Interaction Encoding (CMIE) modules. Then these updated proxy queries $Q^K$ from the $K$-th CMIE module are used to predict object masks during decoding. Meanwhile, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the joint video-text pair.
  • Figure 3: Illustration of the spatio-temporal divided cross-modality interaction of (a) proxy conditioned video encoding and (b) video conditioned proxy encoding.
  • Figure 4: Illustration of the Joint Semantic Consistency (JSC). In JSC, we first pool the proxy queries $Q^k$ along the temporal dimension to generate video-level queries $Q_{v}$. Then, we select the best prediction as the positive video-level query $q$ according to the matching loss. Finally, we align semantic consensus between the video-level query $q$ and the joint video-text representation $s_{v\&t}$ through a joint semantic consistency loss.
  • Figure 5: Qualitative comparison among our ProxyFormer, SOC [DBLP:conf/nips/LuoXLLWTLY23] and ReferFormer [DBLP:conf/cvpr/WuJSYL22] on Ref-YouTube-VOS. Our method effectively understands the spatial positions and appearances detailed in the queries, accurately identifying the referring objects. The colors of referring expressions correspond to the colors of segmentation masks.
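
The three JSC steps described in the Figure 4 caption (temporal pooling, selecting the positive query by matching loss, and aligning it with the joint video-text representation) can be illustrated with a minimal sketch in plain Python. Several details here are assumptions, not taken from the paper: mean pooling over time, a consistency loss of one minus cosine similarity, and all function names (`temporal_mean_pool`, `jsc_loss`) are hypothetical. The paper's actual pooling operator and loss may differ.

```python
import math


def temporal_mean_pool(queries):
    """Average proxy queries over time (assumed pooling operator).

    queries: nested list of shape T x N x C (frames, queries, channels).
    Returns N x C video-level queries.
    """
    num_frames = len(queries)
    num_queries = len(queries[0])
    channels = len(queries[0][0])
    return [
        [sum(queries[t][n][c] for t in range(num_frames)) / num_frames
         for c in range(channels)]
        for n in range(num_queries)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def jsc_loss(queries, matching_losses, joint_repr):
    """Sketch of the JSC steps under the assumptions stated above.

    matching_losses: one matching loss per query (length N); the query
    with the lowest loss is treated as the positive video-level query q.
    joint_repr: joint video-text representation (length C).
    Returns (index of positive query, assumed consistency loss).
    """
    video_queries = temporal_mean_pool(queries)
    best = min(range(len(matching_losses)), key=matching_losses.__getitem__)
    q = video_queries[best]
    # Assumed loss form: 1 - cosine similarity between q and joint_repr.
    return best, 1.0 - cosine_similarity(q, joint_repr)


# Toy example: T=2 frames, N=2 queries, C=2 channels.
toy_queries = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 4.0]],
]
best_index, loss = jsc_loss(toy_queries, [0.9, 0.2], [0.0, 1.0])
```

In the toy example the second query has the lower matching loss, so it is selected as the positive query; because its pooled vector points in the same direction as the toy joint representation, the assumed loss is close to zero.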