Online Descriptor Enhancement via Self-Labelling Triplets for Visual Data Association

Yorai Shaoul, Katherine Liu, Kyel Ok, Nicholas Roy

TL;DR

This work proposes a self-supervised method for incrementally refining visual descriptors to improve object-level visual data association. Using visual information alone, it achieves a MOTA score of 21.25% on the 2D-MOT-2015 dataset, outperforming methods that incorporate motion information.

Abstract

Object-level data association is central to robotic applications such as tracking-by-detection and object-level simultaneous localization and mapping. While current learned visual data association methods outperform hand-crafted algorithms, many rely on large collections of domain-specific training examples that can be difficult to obtain without prior knowledge. Additionally, such methods often remain fixed at inference time and do not use observed information to improve their performance. We propose a self-supervised method for incrementally refining visual descriptors to improve performance in the task of object-level visual data association. Our method optimizes deep descriptor generators online by continuously training a widely available image classification network pre-trained with domain-independent data. We show that earlier layers in the network outperform later-stage layers for the data association task while also allowing for a 94% reduction in the number of parameters, which makes the online optimization feasible. We show that self-labelling challenging triplets, formed by choosing positive examples separated by large temporal distances and negative examples close in the descriptor space, improves the quality of the learned descriptors for the multi-object tracking task. Finally, we demonstrate that our approach surpasses other visual data association methods applied to a tracking-by-detection task, and show that it provides larger performance gains than other methods that attempt to adapt to observed information.

Paper Structure

This paper contains 16 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Our proposed approach self-supervises label generation to incrementally optimize a deep descriptor generator (cyan). To construct a triplet when frame 100 is received, we choose a positive (i.e., correct) object detection ($p^c_{100}$), a temporally distant anchor instance of the same object ($p^c_{80}$), and a negative example from the same frame that is closest in the current descriptor space ($p^b_{100}$). When enough patch-triplets are aggregated, they are used to train a descriptor generator as a batch. The visuals are frames 80 to 100 of the sequence ADL-Rundle-6 included in [leal2015motchallenge].
  • Figure 2: Illustration of triplet selection, where tracked objects are distinguished by color and similarity to the anchor ($a$) by vertical distance. To build a challenging triplet with a positive sample at $b$ for the black object, we choose a negative example that is near in the descriptor space ($d$) and an anchor example that is distant temporally ($a$). By choosing $d$ over $f$ for the negative example, we generate a more informative, nuanced label, as $d$ and $b$ are relatively close in the descriptor space. Object patches are extracted from the sequence TUD-Campus in the 2D-MOT-2015 dataset [leal2015motchallenge].
  • Figure 3: Our proposed system is composed of a descriptor generator, object tracking module, and self-supervised learning pipeline. The descriptor generator $\mathcal{F}(\cdot, \boldsymbol{\theta})$ converts object measurements to descriptors, and the two processes (tracking and descriptor refinement) run in parallel and interact only through the learned descriptor generator. The self-supervised method keeps selective buffers $\hat{T}_j$, which require that bi-directional preference be satisfied, for the purposes of dataset construction. The object tracker is less particular, and matches all new patches to existing tracks $T_j$ to provide the best possible estimates for all detections.
  • Figure 4: Supervised descriptor evaluation on the KITTI dataset [geiger2012are]. We evaluate descriptors extracted from the last max-pooling layer within AlexNet (AlexNet3), and the two fully connected layers that follow it (AlexNet2, AlexNet1). Given two frames at time steps $t$ and $t+\Delta$, for all similar patches $p_t^i, p_{t+\Delta}^i$ with dissimilar patches $p_{t+\Delta}^j$, we declare an error if $\mathcal{D}_{\cos}(\mathbf{d}_t^i, \mathbf{d}_{t+\Delta}^j) \leq \mathcal{D}_{\cos}(\mathbf{d}_t^i, \mathbf{d}_{t+\Delta}^i)$. We observe overall worse performance as $\Delta$ grows. "AlexNet3 One Epoch" was trained with a single pass over the training samples, as opposed to repeating the training to convergence, and shows quick learning of effective descriptors. Training was performed using ground truth detections in sequences 08-20, divided into 1740 batches of 20 triplets where the positive and anchor samples were 15 and 20 frames apart. The evaluation above was done on the remaining sequences in 80460 comparisons. Our learning rates were $10^{-4}$ for "AlexNet3 One Epoch" and $10^{-10}$ for the converged models.
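
The negative selection described in Figures 1 and 2 and the error criterion in Figure 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, descriptors are plain lists of floats, and $\mathcal{D}_{\cos}$ is assumed to be the usual cosine distance (one minus cosine similarity), which the captions do not spell out.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); assumes both vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def select_hard_negative(positive, candidates):
    """Pick the hardest negative for a positive descriptor.

    `candidates` maps ids of *other* objects in the same frame to their
    descriptors (hypothetical data layout). The hardest negative is the
    one closest to the positive in the current descriptor space.
    """
    return min(
        candidates,
        key=lambda obj_id: cosine_distance(positive, candidates[obj_id]),
    )


def is_association_error(d_anchor, d_same, d_other):
    """Figure 4 criterion: an error is declared when a different object
    is at least as close to the anchor as the same object is."""
    return cosine_distance(d_anchor, d_other) <= cosine_distance(d_anchor, d_same)


# Toy 2-D descriptors: "d" is near the positive, "f" is far from it.
positive = [1.0, 0.0]
candidates = {"d": [0.9, 0.1], "f": [0.0, 1.0]}
hard_negative_id = select_hard_negative(positive, candidates)
print(hard_negative_id)
```

With these toy values the sketch selects `d` rather than `f`, mirroring the choice described in Figure 2: the closer negative gives a more informative triplet. Choosing the temporally distant anchor is not shown, since it depends on how track buffers are stored.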