LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

Linfeng Yuan, Miaojing Shi, Zijie Yue, Qijun Chen

TL;DR

Referring video object segmentation (RVOS) models often over-rely on the action and relational cues in a long text expression, which leads to partial or incorrect masks. LoSh derives a subject-centric short text expression from the long one and fuses the two through a long-short cross-attention module, with a long-short predictions intersection loss $\mathcal{L}_{lsi}$ to regulate the joint predictions. A forward-backward visual consistency loss $\mathcal{L}_{fbc}$, based on optical-flow warping, enforces temporal coherence. Across four benchmarks (A2D-Sentences, Refer-YouTube-VOS, JHMDB-Sentences and Refer-DAVIS17), LoSh delivers consistent improvements with modest overhead, and the approach is compatible with existing query-based RVOS pipelines.

Abstract

Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip. The text expression normally contains a sophisticated description of the instance's appearance, action, and relation with others. It is therefore rather difficult for an RVOS model to capture all these attributes correspondingly in the video; in fact, the model often favours the action- and relation-related visual attributes of the instance. This can end up with a partial or even incorrect mask prediction for the target instance. We tackle this problem by taking a subject-centric short text expression from the original long text expression. The short one retains only the appearance-related information of the target instance, so that we can use it to focus the model's attention on the instance's appearance. We let the model make joint predictions using both long and short text expressions, and insert a long-short cross-attention module to let the joint features interact and a long-short predictions intersection loss to regulate the joint predictions. Besides the improvement on the linguistic part, we also introduce a forward-backward visual consistency loss, which utilizes optical flows to warp visual features between the annotated frames and their temporal neighbors for consistency. We build our method on top of two state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS, JHMDB-Sentences and Refer-DAVIS17 show impressive improvements of our method. Code is available at https://github.com/LinfengYuan1997/Losh.
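
The abstract names a long-short predictions intersection loss but this excerpt does not give its formula. The sketch below is one plausible form, written under a stated assumption: because the short expression is less specific than the long one, the mask predicted from the long text is assumed to lie inside the mask predicted from the short text. The function name `long_short_intersection_loss`, the soft-intersection formulation, and the `eps` parameter are all hypothetical and are not taken from the paper or its code.

```python
def long_short_intersection_loss(p_long, p_short, eps=1e-6):
    """Hypothetical intersection-style loss on two soft masks.

    p_long, p_short: flat sequences of per-pixel foreground probabilities
    in [0, 1], predicted from the long and short expressions respectively.
    The loss is low when the long-text mask falls inside the short-text mask
    (an assumption of this sketch, not the paper's stated definition).
    """
    if len(p_long) != len(p_short):
        raise ValueError("masks must have the same number of pixels")
    # Soft intersection: pixels that both predictions mark as foreground.
    intersection = sum(a * b for a, b in zip(p_long, p_short))
    # Total foreground mass of the long-text prediction.
    long_mass = sum(p_long)
    return 1.0 - intersection / (long_mass + eps)


# Long-text mask fully inside the short-text mask: loss is close to 0.
contained = long_short_intersection_loss([1, 1, 0, 0], [1, 1, 1, 0])
# Long-text mask entirely outside the short-text mask: loss is 1.0.
disjoint = long_short_intersection_loss([0, 0, 0, 1], [1, 1, 1, 0])
```

In a real pipeline the same expression would be computed on tensors so that gradients flow to both predictions; plain Python lists are used here only to keep the sketch dependency-free.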

Paper Structure (21 sections, 7 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Qualitative comparison between our LoSh-M and MTTR. The text expression for the target instance is 'a man in white t-shirt is walking'. LoSh-M generates an accurate prediction while MTTR predicts a wrong mask compared to ground truth.
  • Figure 2: The overall pipeline of LoSh built upon the query-based model MTTR. Our model takes long and short text expressions as text inputs and uses them to guide the target instance's segmentation in the given video. A long-short cross-attention module, a long-short predictions intersection loss ($\mathcal{L}_{lsi}$) and a forward-backward visual consistency loss ($\mathcal{L}_{fbc}$) are specifically introduced. Note that the feed-forward networks in the transformer encoder are omitted for simplicity.
  • Figure 3: Qualitative comparison between LoSh-M and MTTR on A2D-Sentences. LoSh-M generates reasonable predictions while MTTR predicts incorrect or partial ones compared to ground truth.
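
The expression quoted in Figure 1 gives a concrete case for the subject-centric short text: dropping the action from 'a man in white t-shirt is walking' leaves 'a man in white t-shirt'. The sketch below illustrates that idea with a deliberately naive heuristic that cuts the expression at the first predicate-like word. The function `extract_short_expression` and the word list `PREDICATE_MARKERS` are hypothetical; this excerpt does not describe how the authors actually derive the short expression, and a real system would more likely rely on a syntactic parser.

```python
# Illustrative word list (an assumption of this sketch, not from the paper):
# tokens that typically start the action or relation part of an expression.
PREDICATE_MARKERS = {"is", "are", "was", "were", "who", "that", "which"}


def extract_short_expression(long_expression: str) -> str:
    """Keep the tokens before the first predicate-like marker."""
    kept = []
    for token in long_expression.split():
        if token.lower() in PREDICATE_MARKERS:
            break
        kept.append(token)
    # Fall back to the full expression if nothing precedes the first marker.
    return " ".join(kept) if kept else long_expression


short = extract_short_expression("a man in white t-shirt is walking")
print(short)  # a man in white t-shirt
```

The heuristic fails on expressions whose action is not introduced by one of the listed words (for example 'a dog running on grass'), which is why it should be read only as an illustration of the long-to-short relationship.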