Fully-Convolutional Siamese Networks for Object Tracking
Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr
TL;DR
The paper addresses tracking of arbitrary objects in video by learning a similarity function offline with a fully-convolutional Siamese network trained on a large video dataset (ImageNet Video, ILSVRC15). At tracking time, the learned embedding is evaluated as a dense cross-correlation between the target exemplar and a larger search region, which enables real-time tracking without online model updates. The main contributions are the fully-convolutional Siamese architecture, training with large search images, and an evaluation on the OTB-13 and VOT benchmarks that shows competitive accuracy at real-time speeds. The work shows that rich embeddings learned offline generalize across datasets and can complement traditional online tracking methods, and that performance improves with the size of the training dataset.
Abstract
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object's appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
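The dense cross-correlation step mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes the exemplar and search embeddings have already been computed by the shared network, which is not modelled here, and it uses plain Python lists in place of a tensor library. The names `cross_correlate` and `argmax_2d` are hypothetical helpers introduced for this illustration, not the authors' code.

```python
def cross_correlate(exemplar, search):
    """Slide `exemplar` over `search` and return a 2-D score map.

    Both inputs are feature maps stored as nested lists indexed
    [channel][row][col]. They stand in for the embeddings produced by
    the shared (Siamese) network, which is assumed, not implemented.
    """
    channels = len(exemplar)
    eh, ew = len(exemplar[0]), len(exemplar[0][0])
    sh, sw = len(search[0]), len(search[0][0])
    scores = []
    for top in range(sh - eh + 1):
        row = []
        for left in range(sw - ew + 1):
            # Inner product between the exemplar and the search window
            # whose top-left corner is at (top, left).
            total = 0.0
            for c in range(channels):
                for i in range(eh):
                    for j in range(ew):
                        total += exemplar[c][i][j] * search[c][top + i][left + j]
            row.append(total)
        scores.append(row)
    return scores


def argmax_2d(scores):
    """Return (row, col) of the first maximum in a 2-D score map."""
    best_pos, best_val = (0, 0), scores[0][0]
    for r, row in enumerate(scores):
        for c, value in enumerate(row):
            if value > best_val:
                best_pos, best_val = (r, c), value
    return best_pos


# Toy single-channel example: the exemplar pattern is embedded in the
# search map with its top-left corner at row 1, column 2.
exemplar = [[[1.0, 2.0],
             [3.0, 4.0]]]
search = [[[0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 2.0],
           [0.0, 0.0, 3.0, 4.0],
           [0.0, 0.0, 0.0, 0.0]]]

score_map = cross_correlate(exemplar, search)
peak = argmax_2d(score_map)
print(peak)  # (1, 2)
```

In this toy case the peak of the score map coincides with the position of the pattern in the search map. A tracker built on this idea would map the peak back to image coordinates to update the target position. In practice the sliding inner product is computed with a convolution routine, not explicit loops.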
