Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking

Patrick Poggi, Divake Kumar, Theja Tulabandhula, Amit Ranjan Trivedi

TL;DR

UncL-STARK is an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers. It does not modify the underlying network or add auxiliary heads, and it makes inference-time truncation safe.

Abstract

Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference: they execute the full encoder-decoder stack for every frame regardless of visual complexity, incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame from the current prediction confidence, exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
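The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' policy: the class name `DepthPolicy`, the two confidence thresholds, the depth range of 2 to 6 layers, and the use of a single shared depth for encoder and decoder are all assumptions made for illustration. The only property taken from the text is that the confidence measured at frame t selects the depth used at frame t+1.

```python
from dataclasses import dataclass


@dataclass
class DepthPolicy:
    """Hypothetical policy: confidence at frame t picks the depth for frame t+1."""

    max_depth: int = 6      # assumed full depth of the transformer stack
    min_depth: int = 2      # assumed shallowest depth allowed
    high_conf: float = 0.8  # assumed threshold above which we go shallower
    low_conf: float = 0.5   # assumed threshold below which we reset to full depth

    def next_depth(self, current_depth: int, confidence: float) -> int:
        if confidence < self.low_conf:
            # Low confidence: fall back to the full stack for the next frame.
            return self.max_depth
        if confidence >= self.high_conf:
            # High confidence: drop one layer, but never below min_depth.
            return max(self.min_depth, current_depth - 1)
        # Intermediate confidence: keep the current depth.
        return current_depth


def run_policy(confidences, policy):
    """Return the depth used at each frame, starting from full depth."""
    depth = policy.max_depth
    schedule = []
    for conf in confidences:
        schedule.append(depth)                 # depth used for this frame
        depth = policy.next_depth(depth, conf)  # depth chosen for the next frame
    return schedule
```

For example, `run_policy([0.9, 0.9, 0.3, 0.6], DepthPolicy())` steps down one layer after each confident frame and returns to full depth after the low-confidence frame. A one-frame lag like this is only reasonable when consecutive frames are similar, which is the temporal-coherence assumption the abstract relies on.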

Paper Structure (21 sections, 8 equations, 9 figures, 4 tables, 2 algorithms)

Figures (9)

  • Figure 1: Overview of the UncL-STARK framework. Uncertainty derived from corner localization heatmaps at frame $t$ drives inference-time depth adaptation by selecting the encoder-decoder depth for frame $t+1$. This feedback mechanism exploits temporal coherence in video to reduce computation while preserving tracking accuracy.
  • Figure 2: Architecture-preserving depth truncation in UncL-STARK. The decoder attends to the output of a selected encoder layer, and predictions are produced from a selected decoder layer, enabling inference at arbitrary depths without modifying the prediction head.
  • Figure 3: Mean IoU versus encoder-decoder depth before and after random-depth (RD) training with knowledge distillation (KD) on GOT-10k (val) and LaSOT (test).
  • Figure 4: Profiled GFLOPs and average latency across depth configurations, both scaling approximately linearly with the number of executed transformer layers.
  • Figure 5: Correlation-calibration trade-off of heatmap-derived confidence proxies. Each point shows one proxy's Pearson correlation with IoU (x-axis) against its Expected Calibration Error (ECE, y-axis). Higher absolute correlation and lower ECE indicate better alignment with tracking accuracy. The dashed line denotes the Pareto-optimal frontier; the selected top-$k$ mass estimator lies on this frontier, achieving a favorable balance between correlation and calibration.
  • ...and 4 more figures
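
The Figure 5 caption names a top-$k$ mass estimator and Expected Calibration Error without defining either here. The sketch below shows one plausible reading, in plain Python: top-$k$ mass as the share of total heatmap score held by the $k$ largest cells, and ECE as the bin-weighted gap between mean confidence and mean IoU. Both function names, the default `k=5`, the 10 equal-width bins, and the use of per-frame IoU as the calibration target are assumptions; the paper's exact definitions may differ.

```python
import heapq


def top_k_mass(heatmap, k=5):
    """Fraction of total heatmap score in the k largest cells (assumed definition).

    `heatmap` is a list of rows of non-negative scores. A sharply peaked map
    gives a value near 1; a diffuse map gives a value near k / number_of_cells.
    """
    flat = [value for row in heatmap for value in row]
    total = sum(flat)
    if total <= 0:
        return 0.0
    return sum(heapq.nlargest(k, flat)) / total


def expected_calibration_error(confidences, ious, n_bins=10):
    """Bin-weighted mean |confidence - IoU| over equal-width bins on [0, 1]."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, iou in zip(confidences, ious):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, iou))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        mean_iou = sum(i for _, i in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - mean_iou)
    return ece
```

Under this reading, a proxy sits well in Figure 5 when its values track IoU across frames (high correlation) and also match IoU in magnitude within each bin (low ECE).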