GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

TL;DR

GOT-Edit is a geometry-aware generic object tracker that uses online model editing to incorporate 3D geometric cues while preserving semantic discrimination. It extracts semantic features from DINOv2 and geometric cues from VGGT, fuses them through a ToMPDETR-based predictor, and updates track-specific weights online; geometry updates are projected into the semantic null space so that they do not degrade semantic discrimination. Across multiple benchmarks, the approach is robust under occlusion and clutter, outperforms several state-of-the-art trackers, and shows the value of 2D–3D fusion without requiring explicit 3D input data. By bridging 2D semantics with implicit geometric reasoning, it offers a practical and generalizable direction for improving GOT, with implications for autonomous and robotic vision systems.

Abstract

Human perception for effective object tracking in a 2D video stream arises from the implicit use of prior 3D knowledge combined with semantic reasoning. In contrast, most generic object tracking (GOT) methods rely primarily on 2D features of the target and its surroundings while neglecting 3D geometric cues, which makes them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer (VGGT) to enable geometric cue inference from only a few 2D images. To tackle the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing with null-space constrained updates that incorporate geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking.
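
The null-space constrained update mentioned above can be illustrated with a small NumPy sketch. It is not the paper's formulation: it assumes a single linear weight matrix `W`, a matrix `K` whose columns stand in for semantic features, and a candidate geometry-driven update `delta_W`. All of these names, along with the helpers `null_space_projector` and `edit_weights`, are hypothetical. The idea shown is only the generic one: projecting an update onto the null space of `K` leaves the model's responses to those semantic features unchanged.

```python
import numpy as np

def null_space_projector(K, rel_tol=1e-10):
    """Return P (d x d) with P @ K ~= 0, where K is (d x n).

    P projects onto the orthogonal complement of the column space of K.
    """
    U, s, _ = np.linalg.svd(K, full_matrices=True)
    # Numerical rank: count singular values above a relative threshold.
    rank = int(np.sum(s > rel_tol * s.max())) if s.size else 0
    U_null = U[:, rank:]  # directions orthogonal to every column of K
    return U_null @ U_null.T

def edit_weights(W, delta_W, K):
    """Apply delta_W only along directions that K cannot 'see'."""
    P = null_space_projector(K)
    return W + delta_W @ P

# Toy data (assumed shapes): d-dim features, n semantic samples, m outputs.
rng = np.random.default_rng(0)
d, n, m = 16, 5, 4
K = rng.standard_normal((d, n))        # stand-in for semantic features
W = rng.standard_normal((m, d))        # stand-in for tracker weights
delta_W = rng.standard_normal((m, d))  # stand-in for a geometry update

W_edited = edit_weights(W, delta_W, K)

# Responses to the semantic features are preserved, yet W has changed.
print(np.allclose(W_edited @ K, W @ K))
print(np.allclose(W_edited, W))
```

In this toy setting the edited weights differ from the originals, but their outputs on the columns of `K` are identical up to floating-point error. That is the sense in which a null-space projection can add new information without disturbing existing behavior.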

Paper Structure

This paper contains 20 sections, 14 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: The GOT-Edit Framework. GOT-Edit facilitates the understanding of 3D geometry to aid generic object tracking from 2D streaming inputs. It predicts semantic and geometric weights concurrently to incrementally adapt the tracking model. Through online model editing, it ensures geometry-aware, semantic-preserving updates to the tracking model. The solid red box marks the ground-truth target in the input reference frames. The dashed red boxes indicate these same annotations utilized for the online knowledge update within the geometry branch. The green box represents the final predicted tracking result.
  • Figure 2: From left to right, success plots of competing methods on OTB, AVisT, and NfS are shown.
  • Figure 3: Attribute analysis on OTB, AVisT, and LaSOT (left to right), with average scores shown below.
  • Figure 4: Visual comparisons of tracking results from GOT-Edit, PiVOT, and LoRAT across diverse video sequences under adverse scenarios are shown. The three left columns illustrate object tracking evaluation on AVisT, while the three right columns present tracking results on LaSOT.
  • Figure 5: Comparison of methods on NfS using normalized precision (NPr), precision (Pr), and success (SUC), from left to right.
  • ...and 3 more figures