
TransCenter: Transformers with Dense Representations for Multiple-Object Tracking

Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, Xavier Alameda-Pineda

TL;DR

This work tackles multi-object tracking by moving beyond fixed sparse queries to a dense, image-sized detection representation coupled with sparse tracking queries. The authors introduce Query Learning Networks (QLN) and the TransCenter Decoder to enable global, efficient association of objects across frames via deformable attention. Their approach yields state-of-the-art MOT performance on MOT17 and especially MOT20, demonstrating strong accuracy in crowded scenes while maintaining practical inference speeds. The study includes comprehensive ablations, efficiency analyses, and qualitative visualizations, and provides two practical variants (TransCenter-Dual and TransCenter-Lite) for different deployment needs.

Abstract

Transformers have demonstrated superior performance on a wide variety of tasks since they were introduced. In recent years, they have drawn attention from the vision community in tasks such as image classification and object detection. Despite this wave, an accurate and efficient multiple-object tracking (MOT) method based on transformers is yet to be designed. We argue that the direct application of a transformer architecture, with its quadratic complexity and insufficient noise-initialized sparse queries, is not optimal for MOT. We propose TransCenter, a transformer-based MOT architecture with dense representations for accurately tracking all the objects while keeping a reasonable runtime. Methodologically, we propose the use of image-related dense detection queries and efficient sparse tracking queries produced by our carefully designed query learning networks (QLN). On the one hand, the dense image-related detection queries allow us to infer targets' locations globally and robustly through dense heatmap outputs. On the other hand, the set of sparse tracking queries efficiently interacts with image features in our TransCenter Decoder to associate object positions through time. As a result, TransCenter exhibits remarkable performance improvements and outperforms the current state-of-the-art methods by a large margin on two standard MOT benchmarks with two tracking settings (public/private). An extensive ablation study and comparisons to more naive alternatives and concurrent works further show that TransCenter is efficient and accurate. For scientific interest, the code is made publicly available at https://github.com/yihongxu/transcenter.
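
To make the idea of reading target locations off a dense heatmap concrete, here is a minimal sketch of peak extraction. It is not the authors' implementation: the function name `extract_centers`, the 3x3 local-maximum rule, and the `threshold` value are all assumptions chosen for illustration, in the spirit of center-based detectors.

```python
import numpy as np


def extract_centers(heatmap, threshold=0.3):
    """Return (row, col, score) for each local maximum above `threshold`.

    Hypothetical helper: the 3x3 window and the default threshold are
    illustrative assumptions, not values taken from the paper.
    """
    h, w = heatmap.shape
    # Pad with -inf so border pixels only compete with real neighbors.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views to get the max over each 3x3 window.
    shifts = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    local_max = np.max(np.stack(shifts), axis=0)
    # A pixel is a center if it equals its window maximum and is confident.
    keep = (heatmap == local_max) & (heatmap >= threshold)
    rows, cols = np.nonzero(keep)
    return [(int(r), int(c), float(heatmap[r, c])) for r, c in zip(rows, cols)]


# Toy heatmap: two true peaks, one suppressed neighbor, one weak response.
heatmap = np.zeros((6, 8))
heatmap[1, 2] = 0.9
heatmap[1, 3] = 0.5  # adjacent to a stronger peak, so it is suppressed
heatmap[4, 6] = 0.7
heatmap[3, 0] = 0.1  # below the threshold, so it is ignored
centers = extract_centers(heatmap)
print(centers)
```

Because every pixel of the heatmap is a candidate, the number of detections is not capped by a fixed number of queries, which is the property the abstract attributes to the dense representation.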

Paper Structure

This paper contains 22 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: With TransCenter, we propose to tackle the MOT problem with transformers accurately and efficiently: the dense non-overlapping representations provide sufficient and accurate detections through a dense image-size center heatmap, as shown in (a); the sparse tracking queries, obtained from features sampled at object positions from the previous time step, efficiently produce the sparse displacement vectors of objects from the previous to the current time step (shown as arrows plotted on the image, which originally has a gray background), as shown in (b).
  • Figure 2: Generic pipeline of TransCenter and different variants: Images at $t$ and $t-1$ are fed to the transformer encoder (DETR-Encoder or PVT-Encoder) to produce multi-scale memories $\mathbf{M}_t$ and $\mathbf{M}_{t-1}$ respectively. They are passed (together with track positions at $t-1$) to the Query Learning Networks (QLN) operating in the feature's channel. QLN produce (1) dense pixel-level multi-scale detection queries $\mathbf{DQ}$, (2) detection memory $\mathbf{DM}$, (3) (sparse or dense) tracking queries $\mathbf{TQ}$, (4) tracking memory $\mathbf{TM}$. For associating objects through frames, the TransCenter Decoder performs cross attention between $\mathbf{TQ}$ and $\mathbf{TM}$, producing Tracking Features $\mathbf{TF}$. For detection, the TransCenter Decoder either calculates the cross attention between $\mathbf{DQ}$ and $\mathbf{DM}$ or directly outputs $\mathbf{DQ}$ (in our efficient versions, TransCenter and TransCenter-Lite, see Sec. \ref{sec:methodology}), resulting in Detection Features $\mathbf{DF}$ for the output branches, $\mathbf{S}_t$ and $\mathbf{C}_t$. $\mathbf{TF}$, together with object positions at $t-1$ (sparse $\mathbf{TQ}$) or center heatmap $\mathbf{C}_{t-1}$ (omitted in the figure for simplicity) and $\mathbf{DF}$ (dense $\mathbf{TQ}$), are used to estimate image center displacements $\mathbf{T}_t$ indicating for each center its displacement in the adjacent frames (red arrows). We detail our choice (TransCenter) of QLN and TransCenter Decoder structures in the figure. Other designs of QLN and TransCenter Decoder are detailed in Fig. \ref{fig:qln} and Fig. \ref{fig:transformerdecoder}. Arrows with a dotted line are only necessary for models with sparse $\mathbf{TQ}$.
  • Figure 3: Query Learning Networks (QLN): TransCenter uses QLN$_{S-}$ (our choice) as its query learning network, producing sparse tracking queries by sampling prior object features from $\mathbf{M}_{t-1}$. Different structures of QLN are studied such as QLN$_{SE-}$, QLN$_{D-}$, QLN$_{D}$ (QLN$_{M_t}$ in green arrow and QLN$_{DQ}$ in blue arrow), and QLN$_{E}$, detailed in Sec. \ref{subsec:dlq}. Best seen in color.
  • Figure 4: The TransCenter Decoder handles tracking queries $\mathbf{TQ}$ and detection queries $\mathbf{DQ}$. The detection attention correlates $\mathbf{DQ}$ and $\mathbf{DM}$ with the attention modules to detect objects. The tracking attention correlates $\mathbf{TQ}$ and $\mathbf{TM}$ to learn the displacements, between different frames, of the objects tracked until $t-1$. The TransCenter Decoder has three main modules: TQSA, DDCA, and TDCA (detailed in Sec. \ref{subsec:dualdecoders}). Versions of the TransCenter Decoder that discard the DDCA are denoted Single decoders, and versions that keep it are denoted Dual decoders. An extra prefix "TQSA-" is added if the decoder has TQSA. TransCenter uses TQSA-Single considering the efficiency-accuracy tradeoff. The choice is based on the ablation of the aforementioned variants in Sec. \ref{subsec:ablation}. $\textbf{N}_{dec}$ is the number of decoder layers.
  • Figure 5: Overview of the center heatmap branch. The multi-scale detection features are up-scaled (bilinear up.) and merged via a series of deformable convolutions \cite{dai2017deformable} (Def. Conv., the ReLU activation is omitted for simplicity) into the output center heatmap. A similar strategy is followed for the object size and the tracking branches.
  • ...and 5 more figures
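
The captions of Figures 1 and 3 describe sparse tracking queries as features sampled from $\mathbf{M}_{t-1}$ at the previous object positions. The sketch below illustrates that sampling step on a single-scale feature map. It is an illustration under stated assumptions rather than the paper's code: the function name `sample_tracking_queries`, the nearest-neighbor lookup, the single scale, and the `stride` parameter are all hypothetical.

```python
import numpy as np


def sample_tracking_queries(memory_prev, positions, stride):
    """Gather one feature vector per tracked object.

    memory_prev: array of shape (C, H, W), standing in for one scale of M_{t-1}.
    positions:   array of shape (K, 2), object centers at t-1 as (x, y) pixels.
    stride:      assumed downsampling factor between image and feature map.

    Returns an array of shape (K, C): one sparse tracking query per object.
    Nearest-neighbor sampling is an assumption made to keep the sketch short.
    """
    _, h, w = memory_prev.shape
    # Convert image coordinates to feature-map cells, clamped to the map.
    cols = np.clip(np.floor(positions[:, 0] / stride).astype(int), 0, w - 1)
    rows = np.clip(np.floor(positions[:, 1] / stride).astype(int), 0, h - 1)
    # Advanced indexing yields (C, K); transpose to (K, C).
    return memory_prev[:, rows, cols].T


# Toy memory with 2 channels on a 3x4 grid, and two tracked objects.
memory_prev = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
positions = np.array([[0.0, 0.0], [13.0, 9.0]])
queries = sample_tracking_queries(memory_prev, positions, stride=4)
print(queries.shape)
print(queries)
```

The number of queries equals the number of tracked objects rather than the number of pixels, which is why this form of tracking query is described as sparse and efficient.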