A Novel Bounding Box Regression Method for Single Object Tracking
Omar Abdelaziz, Mohamed Sami Shehata
TL;DR
This work addresses the bounding-box regression bottleneck in ViT-based single-object trackers by introducing two receptive-field-aware heads: Inception and deformable Inception. By operating on the joint search and template embeddings produced by a ViT backbone, these heads learn multi-scale context to predict three score maps for center location, size, and center offset, enabling end-to-end bounding-box prediction. Evaluations on GOT-10k, UAV123, and OTB2015 show consistent state-of-the-art gains over retrained baselines, with the Inception variant providing the largest improvements. The findings highlight the importance of bounding-box regression design and receptive-field learning for robust, accurate tracking in diverse visual domains.
Abstract
Locating an object in a sequence of frames, given its appearance in the first frame, is a hard problem that involves many stages. State-of-the-art methods usually focus on novel ideas in the visual encoding or relational modelling phases. In this work, we show that bounding box regression from the learned joint search and template features is also highly important. While previous methods relied heavily on well-learned features representing interactions between search and template, we hypothesize that the receptive field of the convolutional bounding box network operating on these features plays an important role in accurately determining the object location. To this end, we introduce two novel bounding box regression networks: inception and deformable. Experiments and ablation studies show that our inception module, installed on the recent ODTrack, outperforms the original ODTrack on three benchmarks: GOT-10k, UAV123, and OTB2015.
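As a rough illustration of how a single box might be read out of the three score maps mentioned in the TL;DR (center location, size, and center offset), the sketch below uses plain Python lists. The function name `decode_box`, the helper `zeros`, the map layout, and the normalization to [0, 1] are all assumptions made for this example, following a common convention for center-based heads; they are not taken from the paper.

```python
from typing import List, Tuple

Grid = List[List[float]]


def zeros(rows: int, cols: int) -> Grid:
    """Hypothetical helper: a rows x cols grid filled with 0.0."""
    return [[0.0] * cols for _ in range(rows)]


def decode_box(
    center: Grid,
    size: Tuple[Grid, Grid],
    offset: Tuple[Grid, Grid],
) -> Tuple[float, float, float, float]:
    """Decode one box (cx, cy, w, h) from three score maps.

    Assumed layout (not from the paper):
      center      : score per grid cell
      size[0/1]   : normalized width / height per cell
      offset[0/1] : sub-cell x / y shift per cell, in cell units
    """
    rows, cols = len(center), len(center[0])
    # Pick the grid cell with the highest center score.
    k = max(range(rows * cols), key=lambda i: center[i // cols][i % cols])
    r, c = divmod(k, cols)
    # Refine the coarse cell index with the offset, then normalize to [0, 1].
    cx = (c + offset[0][r][c]) / cols
    cy = (r + offset[1][r][c]) / rows
    # Read the box size at the same cell.
    w = size[0][r][c]
    h = size[1][r][c]
    return cx, cy, w, h


# Toy 4x4 maps with a single peak at row 1, column 2.
center = zeros(4, 4)
center[1][2] = 0.9
size = (zeros(4, 4), zeros(4, 4))
size[0][1][2], size[1][1][2] = 0.3, 0.2
offset = (zeros(4, 4), zeros(4, 4))
offset[0][1][2], offset[1][1][2] = 0.5, 0.25

box = decode_box(center, size, offset)
print(box)
```

In this toy setup the offset map compensates for the discretization of the center map, so the decoded center can fall between grid cells. A real head would produce these maps as tensors from the joint search and template features.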
