Fully-Convolutional Siamese Networks for Object Tracking

Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr

TL;DR

The paper tackles the challenge of tracking arbitrary objects in video by learning a similarity function offline, using a fully-convolutional Siamese network trained on large-scale video data (ImageNet Video). The learned embedding is then evaluated online as a dense cross-correlation over a large search region, enabling real-time tracking without online model updates. Key contributions include the fully-convolutional Siamese architecture, training with large search images, and extensive evaluation on the OTB-13 and VOT benchmarks, showing competitive accuracy at real-time speeds. The work demonstrates that rich, offline-learned embeddings can generalize across datasets and complement traditional online tracking methods, and that larger training sets improve performance.
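As a rough illustration of the dense cross-correlation step described above, the following NumPy sketch slides an exemplar feature map over a larger search feature map and records one similarity score per translation. The function name `cross_correlate`, the feature-map sizes, and the random features standing in for the network's embedding are all assumptions made for this example; they are not taken from the authors' implementation.

```python
import numpy as np


def cross_correlate(exemplar_feat, search_feat, bias=0.0):
    """Dense cross-correlation of two feature maps with shape (H, W, C).

    Hypothetical helper: returns a score map with one inner-product
    similarity per translated sub-window of the search feature map.
    """
    hz, wz, c = exemplar_feat.shape
    hx, wx, cx = search_feat.shape
    assert c == cx, "both feature maps must have the same channel count"

    out_h, out_w = hx - hz + 1, wx - wz + 1
    score = np.empty((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            window = search_feat[i:i + hz, j:j + wz, :]
            score[i, j] = np.sum(window * exemplar_feat) + bias
    return score


# Random features stand in for the learned embedding (assumption).
rng = np.random.default_rng(0)
z = rng.standard_normal((6, 6, 8))      # exemplar features
x = rng.standard_normal((22, 22, 8))    # search-region features
x[5:11, 9:15, :] = z                    # plant the exemplar inside the search map

score = cross_correlate(z, x)
peak = np.unravel_index(np.argmax(score), score.shape)
```

The score map grows with the search region (here 22 - 6 + 1 = 17 positions per axis), and its maximum falls at the offset where the exemplar was planted. A practical tracker would compute the same quantity with a convolution routine rather than explicit Python loops.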

Abstract

The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object's appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.

Paper Structure

This paper contains 23 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Fully-convolutional Siamese architecture. Our architecture is fully-convolutional with respect to the search image $x$. The output is a scalar-valued score map whose dimension depends on the size of the search image. This enables the similarity function to be computed for all translated sub-windows within the search image in one evaluation. In this example, the red and blue pixels in the score map contain the similarities for the corresponding sub-windows. Best viewed in colour.
  • Figure 2: Training pairs: an exemplar image and the corresponding search image, both extracted from the same video. When a sub-window extends beyond the image boundary, the missing portions are filled with the mean RGB value.
  • Figure 3: Success plots for OPE (one pass evaluation), TRE (temporal robustness evaluation) and SRE (spatial robustness evaluation) of the OTB-13 benchmark [WuLimYang13]. The results of CCT, SCT4 and KCFDP were only available for OPE at the time of writing.
  • Figure 4: VOT-14 Accuracy-robustness plot. Best trackers are closer to the top-right corner.
  • Figure 5: VOT-15 ranking in terms of expected average overlap. Only the best 40 results have been reported.
  • ...and 1 more figure
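The padding rule mentioned in the Figure 2 caption can be sketched as follows. This is a minimal example under stated assumptions: the helper name `crop_with_mean_padding` is hypothetical, the crop is a square centred on integer pixel coordinates, and the fill value is taken to be the mean RGB of the image being cropped, which the caption does not spell out.

```python
import numpy as np


def crop_with_mean_padding(image, cx, cy, size):
    """Crop a size x size window around (cx, cy) from an (H, W, 3) image.

    Hypothetical helper: parts of the window that fall outside the image
    are filled with the image's mean RGB value (an assumption).
    """
    h, w, _ = image.shape
    mean_rgb = image.reshape(-1, 3).mean(axis=0)

    # Start from a patch filled entirely with the mean colour.
    out = np.empty((size, size, 3), dtype=np.float64)
    out[:] = mean_rgb

    # Top-left corner of the requested window in image coordinates.
    x0 = cx - size // 2
    y0 = cy - size // 2

    # Intersection of the window with the image bounds.
    ix0, iy0 = max(x0, 0), max(y0, 0)
    ix1, iy1 = min(x0 + size, w), min(y0 + size, h)
    if ix1 > ix0 and iy1 > iy0:
        out[iy0 - y0:iy1 - y0, ix0 - x0:ix1 - x0] = image[iy0:iy1, ix0:ix1]
    return out


# Tiny synthetic image; a window centred on the top-left corner
# extends beyond the image on two sides.
image = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
patch = crop_with_mean_padding(image, cx=0, cy=0, size=4)
```

In this usage, the lower-right quadrant of `patch` holds real pixels from the image, while the rest carries the mean colour.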