A Novel Bounding Box Regression Method for Single Object Tracking

Omar Abdelaziz, Mohamed Sami Shehata

TL;DR

This work addresses the bounding-box regression bottleneck in ViT-based single-object trackers by introducing two receptive-field-aware heads: Inception and deformable Inception. Operating on the joint search and template embeddings produced by a ViT backbone, these heads learn multi-scale context and predict three score maps (center location, size, and center offset), which together enable end-to-end bounding-box prediction. Evaluations on GOT-10k, UAV123, and OTB2015 show consistent gains over retrained baselines, with the Inception variant providing the largest improvements. The findings highlight the importance of bounding-box regression design and receptive-field learning for robust, accurate tracking across diverse visual domains.
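
To make the three-map output concrete, the following is a minimal sketch of how a box could be decoded from center, size, and offset maps. The decoding convention (argmax of the center map, sub-cell offset added to the peak index, normalized width and height read at the peak) is an assumption borrowed from common center-based heads, not a detail confirmed by this summary; `decode_box` and the toy arrays are hypothetical names for illustration.

```python
import numpy as np


def decode_box(score, size, offset):
    """Decode one normalized (cx, cy, w, h) box from three maps.

    Assumed shapes (hypothetical, for illustration only):
      score  : (H, W)    center-location score map
      size   : (2, H, W) normalized width and height per cell
      offset : (2, H, W) sub-cell x and y offsets per cell
    """
    h, w = score.shape
    # Peak of the center score map gives the coarse cell of the object center.
    iy, ix = np.unravel_index(np.argmax(score), score.shape)
    # Refine the coarse cell with the predicted offset, then normalize to [0, 1].
    cx = (ix + offset[0, iy, ix]) / w
    cy = (iy + offset[1, iy, ix]) / h
    # Read the box size at the same peak location.
    bw, bh = size[0, iy, ix], size[1, iy, ix]
    return float(cx), float(cy), float(bw), float(bh)


# Toy 4x4 maps: the peak sits at row 1, column 2.
score = np.zeros((4, 4))
score[1, 2] = 0.9
size = np.full((2, 4, 4), 0.25)
offset = np.zeros((2, 4, 4))
offset[:, 1, 2] = 0.5

box = decode_box(score, size, offset)
```

In this sketch the offset map only corrects the quantization introduced by the coarse feature grid, so the quality of the final box depends mainly on how well the head localizes the peak and regresses the size at that cell.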

Abstract

Locating an object in a sequence of frames, given its appearance in the first frame, is a hard problem that involves many stages. State-of-the-art methods usually focus on novel ideas in the visual encoding or relational modelling phases. In this work, we show that bounding box regression from learned joint search and template features is highly important as well. While previous methods relied heavily on well-learned features representing interactions between search and template, we hypothesize that the receptive field of the convolutional bounding box network plays an important role in accurately determining the object location. To this end, we introduce two novel bounding box regression networks: Inception and deformable Inception. Experiments and ablation studies show that our Inception module installed on the recent ODTrack outperforms the original ODTrack on three benchmarks: GOT-10k, UAV123, and OTB2015.
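
The receptive-field hypothesis can be illustrated with a small calculation. The sketch below applies the standard receptive-field recurrence for stacked convolutions to three parallel branches of the kind an Inception-style block concatenates. The branch layouts are hypothetical and are not the configuration used in the paper.

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    Each layer widens the field by (k - 1) * dilation * jump, where jump is
    the product of the strides of all earlier layers.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical parallel branches (illustrative only, not the paper's settings).
branches = {
    "1x1": [(1, 1, 1)],
    "3x3": [(1, 1, 1), (3, 1, 1)],
    "double 3x3": [(1, 1, 1), (3, 1, 1), (3, 1, 1)],
}

fields = {name: receptive_field(layers) for name, layers in branches.items()}
```

Concatenating such branches gives every output location access to context at several scales at once, which is the property the abstract argues a bounding box head should have.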

Paper Structure (19 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: The overall architecture of the recent ViT-based single object trackers.
  • Figure 2: The simple Inception subnetwork architecture, adapted from Inception-v4 (Szegedy et al., 2017). The blue-filled circle indicates a concatenation operation. Filter sizes are shown in the figure. Please refer to Szegedy et al. (2017) for more details about the parameters of each convolution.
  • Figure 3: The deformable inception block.
  • Figure 4: Success cases of the proposed methods on the OTB2015 Basketball sequence. The pure Inception network is the most accurate, especially in the first column of frames. Although the bounding boxes converge at frame $\#50$, the boxes initially predicted at frame $\#2$ indicate the fast adaptation of the Inception bounding box network.