Referring Video Object Segmentation with Cross-Modality Proxy Queries
Baoli Sun, Xinzhu Ma, Ning Wang, Zhihui Wang, Zhiyong Wang
TL;DR
This paper tackles the referring video object segmentation (RVOS) problem by improving cross-modal alignment between visual content and natural language expressions. It introduces ProxyFormer, which leverages cross-modality proxy queries that propagate through the video encoder and are refined via two CMIE blocks, enabling inter-frame dependency modeling and early, integrated language constraints. A Joint Semantic Consistency training strategy further aligns proxy-query semantics with joint video-text representations, boosting segmentation accuracy. Across four RVOS benchmarks, ProxyFormer achieves state-of-the-art results with notable gains over strong baselines, demonstrating improved temporal coherence and robustness to appearance changes in video targets.
Abstract
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The core idea is to learn an accurate alignment between visual elements and language expressions within a shared semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object with a query-response mechanism built on the transformer architecture. However, they exhibit two limitations: (1) the conditional queries lack inter-frame dependency and variation modeling, which makes accurate target tracking difficult under significant frame-to-frame variations; and (2) they integrate textual constraints late, which may cause the video features to focus on non-referred objects. We therefore propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating the proxy queries across multiple stages of the video feature encoder, ProxyFormer keeps the video features focused on the object of interest. This dynamic evolution also establishes inter-frame dependencies, improving the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy that aligns the semantics of the proxy queries with those of the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of ProxyFormer over state-of-the-art methods.
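
The cost argument behind decoupling interactions into temporal and spatial dimensions can be illustrated with a rough count of query-key pairs. The sketch below is not ProxyFormer's actual interaction block; it assumes generic dense attention over a clip of `t` frames with `h × w` tokens each, and the function names and example sizes are hypothetical. It compares one joint pass over all tokens with a spatial pass inside each frame followed by a temporal pass across frames at each location.

```python
def joint_attention_pairs(t, h, w):
    """Query-key pairs when every token attends to every token in the clip."""
    n = t * h * w
    return n * n


def decoupled_attention_pairs(t, h, w):
    """Pairs when attention is split into a spatial and a temporal pass."""
    spatial = t * (h * w) ** 2   # tokens attend only within their own frame
    temporal = (h * w) * t ** 2  # each spatial location attends across frames
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical clip size, chosen only for illustration.
    t, h, w = 8, 20, 20
    joint = joint_attention_pairs(t, h, w)
    split = decoupled_attention_pairs(t, h, w)
    print(f"joint: {joint}, decoupled: {split}, ratio: {joint / split:.2f}")
```

Under these assumptions the decoupled count grows as `t * (h*w)^2 + (h*w) * t^2` rather than `(t*h*w)^2`, so the saving increases with the number of frames. For a single frame there is nothing to decouple and the split form is slightly more expensive.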
