GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

TL;DR

GOT-JEPA is a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. A companion module, OccuSolver, enhances occlusion perception for object tracking.

Abstract

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.
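
The teacher-student objective described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the linear `predict_model`, the token-dropping `corrupt` function, the mean-squared-error loss, and all shapes and names are hypothetical. The sketch only shows the data flow: both predictors receive identical history, the teacher sees the clean current frame, and the student sees a corrupted copy and is trained to match the teacher's output.

```python
import numpy as np

def corrupt(frame_feat, rng, drop_prob=0.5):
    """Hypothetical corruption: zero out random feature tokens to mimic occlusion."""
    keep = rng.random(frame_feat.shape[0]) >= drop_prob
    return frame_feat * keep[:, None]

def predict_model(weights, history_feat, frame_feat):
    """Toy predictor: pool history and current-frame tokens, then map them
    linearly to a vector standing in for the tracking-model weights."""
    pooled = np.concatenate([history_feat.mean(axis=0), frame_feat.mean(axis=0)])
    return weights @ pooled

def jepa_model_loss(student_w, teacher_w, history_feat, frame_feat, rng, drop_prob=0.5):
    """MSE between the student's prediction (corrupted frame) and the
    teacher's pseudo-tracking model (clean frame, treated as a fixed target)."""
    target = predict_model(teacher_w, history_feat, frame_feat)
    corrupted = corrupt(frame_feat, rng, drop_prob)
    pred = predict_model(student_w, history_feat, corrupted)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
dim, tokens, model_dim = 8, 16, 4            # illustrative sizes only
history = rng.normal(size=(tokens, dim))     # shared historical information
frame = rng.normal(size=(tokens, dim))       # clean current frame
teacher_w = rng.normal(size=(model_dim, 2 * dim))
student_w = teacher_w.copy()

loss = jepa_model_loss(student_w, teacher_w, history, frame, rng)
```

In this toy setup the loss is zero when nothing is corrupted and the two predictors share weights, so any positive value comes from the corruption alone. That is the signal the student would be trained to suppress.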

Paper Structure

This paper contains 20 sections, 9 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: (a) GOT-JEPA extends the JEPA architecture to build a robust model predictor for model adaptation. The t-Predictor generates diverse tracking models for pseudo-labeling, while the s-Predictor predicts them using corrupted current frames. Both predictors use identical historical frames and tracking results as few-shot examples to aid target identification, while variations in the current frame drive robustness in tracking-model prediction. (b) OccuSolver enhances occlusion-handling capabilities. T1' shows initial point sampling. T2' shows that OccuSolver filters redundant points as invisible states, retaining essential points as visible states. This information is then utilized to enhance occlusion perception in the proposed GOT tracker.
  • Figure 2: Overview of the proposed framework. (a) We pre-train a robust model predictor using a JEPA-based approach. Conditioned on identical past information, a student predictor (s-Predictor) learns from a corrupted current frame to predict the tracking models generated by a teacher (t-Predictor) with the uncorrupted input. This process compels the student to learn representations that are robust to frame variations. Details are provided in the GOT-JEPA section. (b) The pre-trained student predictor is then integrated into the tracking head with classification and regression decoders and fine-tuned for precise object localization. Details are provided in the Background and GOT-JEPA sections. (c) To address occlusions, OccuSolver adapts a point tracker to be object-aware using priors from the object tracker. The resulting point visibility states are then integrated with visual features via an Ensemble Network, enabling the final model to better handle occluded targets and generate more accurate tracking models over time. Refer to the OccuSolver section for details.
  • Figure 3: This figure presents the attribute analysis of OTB-100, AVisT, and LaSOT from left to right, with the average scores at the bottom.
  • Figure 4: Comparison of methods using NPr, Pr, and SUC plots on the NfS dataset, from left to right.
  • Figure 5: Comparison of methods using NPr, Pr, and SUC plots on the AVisT dataset, from left to right.
  • ...and 6 more figures
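
The SUC, Pr, and NPr plots named in Figures 4 and 5 are presumably the success, precision, and normalized-precision curves commonly used in single-object tracking benchmarks. The sketch below computes them under that assumption. The `(x, y, w, h)` box format, the 21 overlap thresholds, the 20-pixel precision threshold, and the 0.2 normalized threshold are conventional choices assumed here, not values confirmed by this paper.

```python
import numpy as np

def iou(a, b):
    """IoU of boxes in (x, y, w, h) format; a and b have shape (N, 4)."""
    x1 = np.maximum(a[:, 0], b[:, 0])
    y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 0] + a[:, 2], b[:, 0] + b[:, 2])
    y2 = np.minimum(a[:, 1] + a[:, 3], b[:, 1] + b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def centers(boxes):
    """Box centers for (x, y, w, h) boxes."""
    return boxes[:, :2] + boxes[:, 2:] / 2

def success_auc(pred, gt, n_thresholds=21):
    """Mean of the success curve: fraction of frames with IoU above each threshold."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0, 1, n_thresholds)
    curve = np.array([(overlaps > t).mean() for t in thresholds])
    return float(curve.mean())

def precision(pred, gt, pixel_threshold=20.0):
    """Fraction of frames whose center error is within pixel_threshold pixels."""
    dist = np.linalg.norm(centers(pred) - centers(gt), axis=1)
    return float((dist <= pixel_threshold).mean())

def normalized_precision(pred, gt, threshold=0.2):
    """Like precision, but the center error is divided by the ground-truth box size."""
    err = (centers(pred) - centers(gt)) / gt[:, 2:]
    return float((np.linalg.norm(err, axis=1) <= threshold).mean())

# Two frames: one perfect prediction, one complete miss.
gt = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
pred = np.array([[0.0, 0.0, 10.0, 10.0], [100.0, 100.0, 10.0, 10.0]])
suc = success_auc(pred, gt)
pr = precision(pred, gt)
npr = normalized_precision(pred, gt)
```

Normalizing the center error by the ground-truth box size makes NPr less sensitive to image resolution and target scale than Pr, which is why benchmarks often report both.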