AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

Teng Yan, Yihan Liu, Jiongxu Chen, Teng Wang, Jiaqi Li, Bingzhuo Zhong

TL;DR

AR2-4FV improves Re-Capture Rate (RCR) by 10.3% and reduces Re-Capture Latency (RCL) by 24.2% over the best baseline; ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.

Abstract

Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR2-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame and without explicitly modeling appearance variations. AR2-4FV improves Re-Capture Rate (RCR) by 10.3% and reduces Re-Capture Latency (RCL) by 24.2% over the best baseline; ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.
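To make the query-to-bank alignment step concrete, the following is a minimal sketch of one way an Anchor Map could be formed. It is not the authors' implementation. It assumes each anchor in the bank has an embedding and a spatial mask, and that the query has already been embedded in the same space. The function name `anchor_map`, the softmax weighting, and the `temperature` parameter are all hypothetical choices made for illustration.

```python
import numpy as np

def anchor_map(text_emb, anchor_embs, anchor_masks, temperature=0.1):
    """Hypothetical alignment of a text query with an anchor bank.

    text_emb:     (D,)      query embedding (assumed precomputed)
    anchor_embs:  (K, D)    one embedding per anchor (assumed)
    anchor_masks: (K, H, W) spatial support of each anchor (assumed)
    Returns an (H, W) map and the (K,) anchor weights.
    """
    # Cosine similarity between the query and every anchor.
    t = text_emb / np.linalg.norm(text_emb)
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    sims = a @ t                                   # shape (K,)

    # Numerically stable softmax over anchors (illustrative choice).
    logits = sims / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()

    # Weighted sum of anchor masks gives a spatial map of shape (H, W).
    amap = np.tensordot(weights, anchor_masks, axes=1)
    return amap, weights
```

Because the anchors come from static background structures, a map of this kind depends only on the query and the bank, so it can stay fixed while the referent is out of view.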

Paper Structure

This paper contains 16 sections, 24 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: AR$^2$-4FV-Bench scenes and protocol. Panels from several fixed-view locations and two cinematic sequences. For each sequence we show the first frame, a visible instance, a period of absence, and the re-entry moment, which together constitute a full tracking cycle. Green overlays indicate anchors (stable background supports). The benchmark targets long-term disappearance and re-entry under a fixed camera and evaluates identity-consistent trajectories without assuming the target is visible in the first frame.
  • Figure 2: Dataset statistics: video duration distribution, target re-entry frequency, and scene category composition.
  • Figure 3: Diverse language-guided referring queries and annotated referents in AR$^2$-4FV-Bench. Examples span indoor and outdoor fixed-view scenes as well as cinematic clips. Queries cover anchor-referential descriptions (e.g., “the person near the doorway”) and attribute-based disambiguation by color, clothing, or pose. Green boxes indicate ground-truth referents.
  • Figure 4: Overview of AR$^2$-4FV. Offline: static structures are distilled into an Anchor Bank. Online: the query is aligned to the bank to produce a persistent Anchor Map; this map generates proposals and drives the search and re-entry prior $P^{\mathrm{re}}$. With mask-aware pooling we compute a fusion score, and ReID-Gating validates candidates using appearance similarity, anchor evidence, and displacement in the anchor frame, yielding per-frame boxes $\{y_t\}$.
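The gating step described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the paper's algorithm: the thresholds `tau_app`, `tau_anchor`, and `max_disp`, the `Candidate` fields, and the additive ranking score are all hypothetical. It also assumes that, under a fixed camera, displacement in the anchor frame can be approximated by the distance between box centers in image coordinates.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class Candidate:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    appearance_sim: float                   # similarity to the stored identity (assumed in [0, 1])
    anchor_evidence: float                  # Anchor Map support under the box (assumed in [0, 1])

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def reid_gate(candidates: Sequence[Candidate],
              last_center: Optional[Tuple[float, float]],
              tau_app: float = 0.5,
              tau_anchor: float = 0.3,
              max_disp: float = 80.0) -> Optional[Candidate]:
    """Return the best candidate passing all three gates, or None."""
    best, best_score = None, -math.inf
    for c in candidates:
        # Gates 1 and 2: appearance similarity and anchor evidence.
        if c.appearance_sim < tau_app or c.anchor_evidence < tau_anchor:
            continue
        # Gate 3: displacement from the last confirmed position, if any.
        if last_center is not None:
            cx, cy = box_center(c.box)
            if math.hypot(cx - last_center[0], cy - last_center[1]) > max_disp:
                continue
        # Rank survivors with a simple additive score (illustrative only).
        score = c.appearance_sim + c.anchor_evidence
        if score > best_score:
            best, best_score = c, score
    return best
```

Returning `None` when no candidate passes lets the caller report the target as absent for that frame instead of committing to a wrong identity.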