Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation

Changcheng Xiao, Qiong Cao, Yujie Zhong, Xiang Zhang, Tao Wang, Canqun Yang, Long Lan

TL;DR

This study introduces TenRMOT, a compact Transformer-based method that conducts feature fusion at both the encoding and decoding stages to fully exploit the Transformer architecture. It also introduces a novel task, Referring Multi-Object Tracking and Segmentation, and constructs a new dataset named Ref-KITTI Segmentation.

Abstract

Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects referred to by a language expression in a video and to maintain their identities. This intricate task involves reasoning over linguistic and visual modalities, along with the temporal association of target objects. However, the seminal work employs only loose feature fusion and overlooks long-term information on tracked objects. In this study, we introduce a compact Transformer-based method, termed TenRMOT. We conduct feature fusion at both the encoding and decoding stages to fully exploit the advantages of the Transformer architecture. Specifically, we perform cross-modal fusion incrementally, layer by layer, during the encoding phase. In the decoding phase, we utilize language-guided queries to probe memory features for accurate prediction of the desired objects. Moreover, we introduce a query update module that explicitly leverages temporal prior information of the tracked objects to enhance the consistency of their trajectories. In addition, we introduce a novel task called Referring Multi-Object Tracking and Segmentation (RMOTS) and construct a new dataset named Ref-KITTI Segmentation. Our dataset consists of 18 videos with 818 expressions, and each expression averages 10.7 masks, which poses a greater challenge than the single mask typical of most existing referring video segmentation datasets. TenRMOT demonstrates superior performance on both the referring multi-object tracking and the segmentation tasks.
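
To make the idea of carrying temporal prior information between frames more concrete, here is a minimal sketch of a per-object track query that is updated from one frame to the next. It is not the paper's query update module: the names `TrackQuery` and `update_query`, the `momentum` parameter, and the simple linear blending rule are all assumptions chosen for illustration. It only shows the general pattern of reusing the previous frame's content embedding and box as priors while keeping the object's identity fixed.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Normalized box (cx, cy, w, h); this layout is an assumption for the sketch.
Box = Tuple[float, float, float, float]


@dataclass
class TrackQuery:
    """Hypothetical per-object state carried from frame t-1 to frame t."""
    track_id: int          # identity label, kept fixed across frames
    content: List[float]   # content embedding (contextual prior)
    ref_box: Box           # last predicted box (spatial prior)


def update_query(prev: TrackQuery, decoded: List[float], new_box: Box,
                 momentum: float = 0.5) -> TrackQuery:
    """Blend the previous content embedding with the current decoder output.

    The linear blend is an illustrative stand-in, not the rule used by TenRMOT.
    """
    if len(prev.content) != len(decoded):
        raise ValueError("embedding sizes must match")
    content = [momentum * p + (1.0 - momentum) * d
               for p, d in zip(prev.content, decoded)]
    # The identity is preserved; the box becomes the spatial prior for the next frame.
    return TrackQuery(prev.track_id, content, new_box)


# Usage: one object tracked across two consecutive frames.
q_prev = TrackQuery(track_id=7, content=[1.0, 0.0], ref_box=(0.5, 0.5, 0.2, 0.2))
q_next = update_query(q_prev, decoded=[0.0, 1.0], new_box=(0.6, 0.5, 0.2, 0.2))
print(q_next.track_id, q_next.content, q_next.ref_box)
```

In a real model the blend would typically be a learned function of the previous query and the current decoder output; the fixed `momentum` here just keeps the sketch dependency-free.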

Paper Structure

This paper contains 16 sections, 5 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Comparison of referring multi-object tracking pipelines: (a) TransRMOT conducts feature interaction solely before the Transformer; (b) TenRMOT, on the other hand, integrates multi-modal feature fusion at both the encoder and decoder. Additionally, TenRMOT uses the query update module (QUM) to incorporate long-term temporal information about the objects. $F_w$ and $F_s$ represent the word-level and sentence-level features of the query expression, respectively.
  • Figure 2: An overview of the proposed TenRMOT. It mainly consists of four parts: feature extraction, an Interleaving Cross-modality feature Encoder (ICE), a Language Guided Decoder (LGD), and a query update module (QUM). TenRMOT takes a video sequence $\mathcal{V}$ and a natural language expression $\mathcal{L}$ as input, and outputs the linguistically indicated objects and the corresponding identity labels. TenRMOT conducts vision-language feature fusion at both the encoder and decoder stages, while QUM leverages prior information of tracked objects to update their track queries. Track queries of the same object are drawn as the same colored geometric shape.
  • Figure 3: The architecture of the proposed cross-modality attention.
  • Figure 4: Illustration of inter-frame query update. We explicitly incorporate $q_{t-1}^c$ and $b_{t-1}$ from frame $t-1$ to provide contextual and spatial region prior information, respectively, for frame $t$.
  • Figure 5: Illustration of the segmentation branch.
  • ...and 8 more figures