ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Chen Liang, Yu Wu, Yawei Luo, Yi Yang

TL;DR

This paper tackles text-based video segmentation by reframing the problem as cross-modal retrieval over object-level candidates. It introduces ClawCraneNet, a top-down pipeline that first detects object proposals and then uses three object-level relation modules—positional, text-guided semantic, and temporal—to build discriminative embeddings for language-grounded retrieval. The method leverages a CondInst-based segmentation backbone, Bi-LSTM language encoding with self-guided attention, and a contrastive loss to align language and object-relational embeddings, achieving state-of-the-art results on A2D Sentences and J-HMDB Sentences with notable gains at high IoU thresholds. The results demonstrate improved explainability and robustness to occlusion and complex inter-object relations, highlighting the practical impact of object-centric relational reasoning in video understanding.

Abstract

Text-based video segmentation is a challenging task that segments out the objects referred to by natural language in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language referring expressions. In fact, people usually describe a target object using relations with other objects, which may not be easily understood without seeing the whole video. To address the issue, we introduce a novel top-down approach that imitates how humans segment an object under language guidance. We first identify all candidate objects in videos and then choose the referred one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation. Extensive experiments on A2D Sentences and J-HMDB Sentences show our method outperforms state-of-the-art methods by a large margin. Qualitative results also show our results are more explainable.
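
The top-down selection step described above (detect candidate objects first, then pick the one that best matches the sentence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `retrieve_referred_object`, the use of cosine similarity as the matching score, and the toy 3-dimensional embeddings are all assumptions made for illustration. In the paper the object embeddings would come from the relation modules and the text embedding from the language encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def retrieve_referred_object(text_embedding, object_embeddings):
    """Score every candidate object against the sentence and return
    the index of the best match together with all scores.
    (Hypothetical helper; the scoring function is an assumption.)"""
    scores = [cosine(text_embedding, obj) for obj in object_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy data: one sentence embedding and three candidate object embeddings.
text = [1.0, 0.0, 0.5]
objects = [
    [0.0, 1.0, 0.0],   # unrelated object
    [0.9, 0.1, 0.4],   # object closest to the description
    [-1.0, 0.0, 0.2],  # object pointing away from the description
]
best, scores = retrieve_referred_object(text, objects)
print(best, [round(s, 3) for s in scores])
```

The mask of the selected candidate would then be returned as the segmentation result, so segmentation quality depends on the proposal stage and selection quality on the learned embeddings.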

Paper Structure

This paper contains 20 sections, 8 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: (a) Previous bottom-up methods mainly formulate semantic relationships at the pixel level. Such models cannot correctly identify high-level relations from local receptive fields alone, which directly leads to ambiguous predictions. (b) Our top-down pipeline first extracts features for objects and then models the crucial relation information from a high-level view, leading to better segmentation masks through multi-modal retrieval. We liken the process of retrieving a visual object with a linguistic query to playing a claw crane machine.
  • Figure 2: The framework of our proposed ClawCraneNet. As a top-down pipeline, objects are first perceived by an off-the-shelf instance segmentation module and then selected by finding the best visual-semantic match. During this process, we propagate information among objects through three relation modules, i.e., positional relation, text-guided semantic relation, and temporal relation. We then use the linguistic embedding to retrieve the final prediction.
  • Figure 3: Illustration of the text-guided semantic relation module. $\otimes$: matrix multiplication; ⓒ: matrix concatenation. The three boxes with different colors stand for three different objects. The self-guided linguistic context $L$ is used as guidance to infer the relationships between object features $f_V$.
  • Figure 4: Qualitative results of text-based video segmentation. We show three language queries and draw the corresponding segmentation results in the same color as the query. As queries 1 and 3 are predicted on a single object, the leftmost object in the first row of (c) is covered with both red (query 1) and green (query 3). (a) Original frames. (b) Results of the bottom-up method [wang2019asymmetric]. (c) Results of our basic top-down pipeline (row 1 in the ablation table). (d) Results of our full model. (e) Ground truth. (f) Visualization of attention weights in our TSRM.
  • Figure 5: Visualization results of a complex video. An object is covered with different colors if it is referred to by more than one query, e.g., the leftmost object in the first row of (c). (a) Original video frame. (b) Results of the bottom-up method [wang2019asymmetric]. (c) Results of our top-down pipeline. (d) Results of the PRM-enhanced top-down model (row 4 in the ablation table). (e) Results of our full ClawCraneNet.
  • ...and 4 more figures
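
The Figure 3 caption mentions matrix multiplication, concatenation, and a linguistic context $L$ that guides relation inference among object features $f_V$. The sketch below is one plausible reading of that description, not the module as published. It assumes that $L$ gates the object features elementwise, that pairwise affinities are dot products followed by a softmax, and that the aggregated context is concatenated to each object's original feature. The function name `text_guided_relation` and all dimensions are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_relation(f_v, l_ctx):
    """f_v: list of N object feature vectors of dimension D.
    l_ctx: linguistic context vector of dimension D.
    Returns N vectors of dimension 2*D (feature plus relational context)."""
    # Assumption: the text modulates each feature channel elementwise.
    guided = [[x * w for x, w in zip(obj, l_ctx)] for obj in f_v]
    dim = len(l_ctx)
    outputs = []
    for i, query in enumerate(guided):
        # Pairwise affinities between object i and every object (matrix product).
        logits = [sum(a * b for a, b in zip(query, key)) for key in guided]
        attn = softmax(logits)
        # Attention-weighted sum of the original object features.
        context = [sum(attn[j] * f_v[j][d] for j in range(len(f_v)))
                   for d in range(dim)]
        # Concatenate the object's own feature with its relational context.
        outputs.append(list(f_v[i]) + context)
    return outputs

# Toy data: three objects with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
language = [1.0, 0.5]
related = text_guided_relation(features, language)
```

Because the affinities are computed on text-gated features, channels that the sentence emphasizes contribute more to which objects attend to each other, which is the intuition the caption conveys.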