Accurate Planar Tracking With Robust Re-Detection

Jonas Serych; Jiri Matas

TL;DR

This paper presents SAM-H and WOFTSAM, planar trackers that combine the robust long-term segmentation tracking of SAM 2 with 8 degrees-of-freedom homography pose estimation. It also presents improved ground-truth annotations of initial PlanarTrack poses, enabling more accurate benchmarking in the high-precision p@5 metric.

Abstract

We present SAM-H and WOFTSAM, novel planar trackers that combine robust long-term segmentation tracking provided by SAM 2 with 8 degrees-of-freedom homography pose estimation. SAM-H estimates homographies from segmentation mask contours and is thus highly robust to target appearance changes. WOFTSAM significantly improves the current state-of-the-art planar tracker WOFT by exploiting lost target re-detection provided by SAM-H. The proposed methods are evaluated on the POT-210 and PlanarTrack tracking benchmarks, setting a new state of the art on both. On the latter, they outperform the second-best method by a large margin: +12.4 and +15.2 pp on the p@15 metric. We also present improved ground-truth annotations of initial PlanarTrack poses, enabling more accurate benchmarking in the high-precision p@5 metric. The code and the re-annotations are available at https://github.com/serycjon/WOFTSAM
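To make the idea of estimating a homography from mask contours concrete, the following is a minimal sketch, not the authors' implementation. It assumes the four boundary lines of the mask are already available in homogeneous form (a, b, c) with ax + by + c = 0, intersects them to obtain corners, and fits a homography to a template rectangle with the standard direct linear transform. The function names, the example lines, and the template size are all hypothetical; the robust Hough-line extraction and the symmetry disambiguation described in the paper are not reproduced here.

```python
import numpy as np


def intersect_lines(l1, l2):
    """Intersect two lines given as homogeneous triples (a, b, c), ax + by + c = 0."""
    p = np.cross(l1, l2)
    if abs(p[2]) < 1e-12:
        raise ValueError("lines are (nearly) parallel")
    return p[:2] / p[2]


def homography_from_points(src, dst):
    """Direct linear transform: H such that dst ~ H @ src for >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The null-space vector of the stacked system is the flattened homography.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # assumes H[2, 2] != 0, true for the example below


def apply_homography(H, pts):
    """Map 2D points through H and de-homogenize."""
    pts_h = np.hstack([np.asarray(pts, dtype=float), np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]


# Hypothetical mask-boundary lines (left, top, right, bottom) in image coordinates.
left, top, right, bottom = (1, 0, -10), (0, 1, -20), (1, 0, -110), (0, 1, -80)
corners = [intersect_lines(top, left), intersect_lines(top, right),
           intersect_lines(bottom, right), intersect_lines(bottom, left)]
# Hypothetical template rectangle whose corners are listed in the same order.
template = [(0, 0), (100, 0), (100, 60), (0, 60)]
H_sam = homography_from_points(template, corners)
```

The corner ordering is fixed by hand in this sketch. In practice, deciding which detected corner corresponds to which template corner is the ambiguity that SAM-H resolves with a motion model and the target appearance.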

Paper Structure (19 sections, 1 equation, 16 figures, 4 tables)

Introduction
Related Work
Method
SAM-H: from segmentation mask to homography.
WOFTSAM: precise tracking with robust re-detection.
Implementation details
Experiments
POT results
PlanarTrack results
WOFTSAM compared to SAM-H.
PlanarTrack GT quality
Limitations and Discussion
Conclusions
Detailed POT-210 Results
Detailed PlanarTrack Results
...and 4 more sections

Figures (16)

Figure 1: Overview of the SAM-H procedure for estimating a homography from the output of a segmentation tracker. First, corners of the SAM 2 [ravi2024sam2] mask are robustly extracted as intersections of Hough lines. Next, a symmetry disambiguation process based on a motion model and the target appearance decides which corner is which, and the SAM-H homography is estimated. The SAM-H homography pose provides the initialization for target re-detection in the proposed WOFTSAM planar tracker.
Figure 2: Overview of the proposed WOFTSAM planar tracker. As in WOFT [serych2023planar], tracking ① consists of image pre-warping with the previous frame homography ${\mathbf{H}}_{t-1}$ and homography estimation via the Weighted Flow Homography (WFH) module [serych2023planar]. If the homography estimation fails (detected by a small correspondence support set), WOFTSAM performs a re-detection step ②, in which the SAM-H output ${\mathbf{H}}_\text{SAM}$ acts as the pre-warping homography. If the re-detection also fails, WOFTSAM falls back ③ to ${\mathbf{H}}_\text{SAM}$.
Figure 3: POT-210 [liang2017planar] overall (All) and per-attribute results. Precision is measured at 5 px and 15 px alignment-error thresholds on the re-annotated GT [serych2023planar]. Although SAM-H is the least precise of the shown methods, using it as a robust re-detection mechanism in the proposed WOFTSAM outperforms the WOFT [serych2023planar] baseline, almost halving its failure rate at the 15 px threshold. The improvement is particularly high on the sequences where re-detection is needed (blur, occlusion, and unconstrained), without decreasing performance on the rest of the benchmark.
Figure 4: Typical SAM-H failure cases on the POT-210 [liang2017planar] dataset. The target is highlighted in red on the initial frame. Even when the SAM 2 segmentation works perfectly (not depicted for clarity), occlusions by objects with linear boundaries (\ref{fig:pot-occlusion-paper}) result in incorrect homographies: SAM-H correctly finds the four corners (not depicted for clarity) of the segmentation mask, but these are no longer the corners of the target object. When the target is partially out-of-view (\ref{fig:pot-oov}), it is easy to distinguish which parts of the mask boundary belong to the object boundary, but there is not enough data to estimate the 8-DoF homography, e.g., one corner and two directions are not enough.
Figure 5: Development of the p@15 score on PlanarTrack$_\text{TST}$ over time, smoothed with an exponential moving average (EMA) with coefficient $0.1$. The x-axis is truncated at 500; beyond that, the plot becomes noisy because few sequences are that long. After about 3.5 seconds of video ($\approx$ frame 100), the SAM-H-based re-detection starts to significantly boost WOFTSAM performance compared to the baseline WOFT.
...and 11 more figures
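
The three-stage control flow described in the Figure 2 caption (track, re-detect, fall back) can be sketched as follows. This is an illustrative outline under stated assumptions, not the released code: `estimate_wfh` stands in for the WFH module as a hypothetical callable returning a homography and the size of its correspondence support set, and `min_support` is a hypothetical threshold for deciding that the support set is too small.

```python
from typing import Callable, Optional, Tuple

import numpy as np

# Hypothetical signature: (frame, pre-warp homography) -> (homography, support size)
Estimator = Callable[[object, np.ndarray], Tuple[Optional[np.ndarray], int]]


def woftsam_step(frame, H_prev, H_sam, estimate_wfh: Estimator, min_support: int):
    """One tracking step following the three stages in the Figure 2 caption."""
    # (1) Tracking: pre-warp with the previous-frame homography.
    H, support = estimate_wfh(frame, H_prev)
    if H is not None and support >= min_support:
        return H, "tracked"
    # (2) Re-detection: pre-warp with the SAM-H homography instead.
    H, support = estimate_wfh(frame, H_sam)
    if H is not None and support >= min_support:
        return H, "re-detected"
    # (3) Fallback: use the SAM-H homography directly.
    return H_sam, "fallback"
```

Returning a status label next to the homography is a convenience of this sketch; it makes it easy to count how often each stage is used over a sequence.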
