Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony

TL;DR

Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation.

Abstract

Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.
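To make the idea of a continuous spatiotemporal feature field concrete, the following is a minimal sketch of a SIREN-style network (an MLP with sine activations) that maps a coordinate (x, y, t) to a feature vector. It shows only the forward query at an arbitrary, non-grid coordinate; the test-time fitting to DINOv3 features is not shown. The layer sizes, the frequency scale `w0 = 30`, the initialization bounds, and the names `make_siren` and `siren_forward` are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def make_siren(sizes, w0=30.0, seed=0):
    """Build weights for a sine-activated MLP.

    `sizes` (e.g. [3, 16, 16, 8]) and `w0` are hypothetical choices for
    illustration; the paper's actual architecture is not specified here.
    """
    rng = random.Random(seed)
    layers = []
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        # Commonly used SIREN init: U(-1/n_in, 1/n_in) for the first layer,
        # U(-sqrt(6/n_in)/w0, +sqrt(6/n_in)/w0) for the later layers.
        bound = 1.0 / n_in if i == 0 else math.sqrt(6.0 / n_in) / w0
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def siren_forward(layers, coord, w0=30.0):
    """Evaluate the field at one continuous (x, y, t) coordinate."""
    h = list(coord)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        is_last = i == len(layers) - 1
        # Hidden layers use sin(w0 * z); the output layer stays linear.
        h = z if is_last else [math.sin(w0 * v) for v in z]
    return h


# Query an (untrained) field at a coordinate that need not lie on a pixel grid.
f_theta = make_siren([3, 16, 16, 8])
feat = siren_forward(f_theta, (0.25, -0.5, 0.1))
```

Because the network takes real-valued coordinates, features can be sampled at any resolution once the weights have been fitted, which is what makes the representation "continuous" in the sense used in the abstract.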

Paper Structure

This paper contains 31 sections, 10 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Overview of Match4Annotate. Annotation propagation across two medical ultrasound datasets. Each panel shows a source frame (left) with ground-truth annotations and a target frame (right) with propagated predictions. (a–b) Intra-video propagation on EchoNet cardiac ultrasound: (a) boundary point tracking with colored matching lines connecting corresponding source and predicted points, and (b) segmentation mask propagation. (c–d) Inter-video propagation on EchoNet between different subject videos. (e–f) Intra-video propagation on MSK-Bone musculoskeletal ultrasound. (g–h) Inter-video propagation on MSK-Bone across different subject videos.
  • Figure 2: Match4Annotate workflow. (a) Given a video, we fit an implicit SIREN to represent DINOv3 features as a continuous, high-resolution spatiotemporal field $f_\theta(x,y,t)$, enabling feature queries at arbitrary coordinates. (b) For a source–target pair (across frames or videos), we optimize a second SIREN $g_\phi$ to predict per-coordinate 2D displacements, yielding a smooth deformation prior via feature-alignment and regularization. (c) We propagate user annotations by combining this flow-guided prior with feature matching: features from $f_\theta$ are compared (cosine similarity) to select correspondences, improving stability over naive pairwise matching.
  • Figure 3: Visualization of inter-video point propagation on EchoNet (left) and MSK-Bone (right). Each cell shows a source frame with annotated points (left) and a target frame with propagated predictions (right). Green lines indicate correct correspondences and red lines indicate incorrect correspondences.
  • Figure 4: Visualization of inter-video mask propagation on EchoNet (left) and MSK-Bone (right). Each cell shows the target frame with the propagated mask (green) and ground-truth contour (yellow dashed); the leftmost column shows the source reference mask (cyan).
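Step (c) of the Figure 2 workflow, combining a deformation prior with cosine-similarity matching, can be illustrated with a small sketch. The version below restricts the cosine-similarity search to a window around the flow-predicted location. This windowing rule is an assumption chosen for illustration, since the captions do not state how the paper combines the two signals; the function names, the `radius` parameter, and the toy features are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def match_point(src_xy, src_feat, displacement, candidates, radius):
    """Pick the best target location for one annotated source point.

    displacement: 2D offset predicted for src_xy (stand-in for g_phi).
    candidates:   list of ((x, y), feature) pairs sampled in the target.
    radius:       hypothetical search-window size around the prior.
    """
    px = src_xy[0] + displacement[0]
    py = src_xy[1] + displacement[1]
    best_xy, best_sim = None, -math.inf
    for (x, y), feat in candidates:
        # Ignore candidates far from the flow-predicted location.
        if math.hypot(x - px, y - py) > radius:
            continue
        sim = cosine(src_feat, feat)
        if sim > best_sim:
            best_xy, best_sim = (x, y), sim
    if best_xy is None:
        # No candidate inside the window: fall back to the prior itself.
        return (px, py), None
    return best_xy, best_sim


candidates = [
    ((0.90, 0.10), [1.0, 0.0]),  # distant look-alike with identical feature
    ((0.32, 0.41), [0.9, 0.1]),  # near the prior, similar feature
    ((0.30, 0.45), [0.1, 0.9]),  # near the prior, dissimilar feature
]
guided_xy, guided_sim = match_point(
    (0.2, 0.4), [1.0, 0.0], (0.1, 0.0), candidates, radius=0.1)
naive_xy, _ = match_point(
    (0.2, 0.4), [1.0, 0.0], (0.1, 0.0), candidates, radius=2.0)
```

In this toy setup, the windowed search selects the candidate at (0.32, 0.41), whereas an effectively unrestricted search (`radius=2.0`) jumps to the distant look-alike at (0.90, 0.10). That kind of jump to a similar-looking but distant region is one way naive pairwise matching can be unstable.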