Prompt When the Animal is: Temporal Animal Behavior Grounding with Positional Recovery Training
Sheng Yan, Xin Du, Zongying Li, Yi Wang, Hongcang Jin, Mengyuan Liu
TL;DR
This work tackles temporal grounding for animal-behavior videos, where moments are sparse and uniformly distributed. It introduces Port, a Positional Recovery Training (PRT) framework that augments a VSLNet baseline with a Recovering part and a Dual-alignment mechanism. The Recovering branch is trained on flipped label sequences and its output is aligned with that of the Predicting branch, which focuses the model on the ground-truth temporal regions it is prompted with during training. Port reaches an IoU@0.3 of 38.52 on the Animal Kingdom dataset and ranks among the top performers in its MMVRAC sub-track, and ablations show that both PRT and Dual-alignment are essential. These results suggest that leveraging ground-truth temporal cues during training can substantially improve grounding performance on challenging wildlife video data, with potential extensions to subject identification and to classification branches that use language models.
Abstract
Temporal grounding is crucial in multimodal learning, but it poses challenges when applied to animal behavior data due to the sparsity and uniform distribution of moments. To address these challenges, we propose a novel Positional Recovery Training framework (Port), which prompts the model with the start and end times of specific animal behaviors during training. Specifically, Port enhances the baseline model with a Recovering part that predicts flipped label sequences, and aligns distributions with a Dual-alignment method. This allows the model to focus on the specific temporal regions prompted by ground-truth information. Extensive experiments on the Animal Kingdom dataset demonstrate the effectiveness of Port, which achieves an IoU@0.3 of 38.52. Port is also one of the top performers in its sub-track of MMVRAC in the ICME 2024 Grand Challenges.
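For readers unfamiliar with the reported metric, the following is a minimal sketch of how a score such as IoU@0.3 is commonly computed in temporal grounding. It assumes the usual convention (the percentage of queries whose top-1 predicted segment overlaps the ground-truth segment with temporal IoU of at least 0.3). The function names `temporal_iou` and `recall_at_iou` and the example segments are hypothetical and are not taken from the Port implementation.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time segments."""
    ps, pe = pred
    gs, ge = gt
    # Overlap length is zero when the segments are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment],
                  threshold: float = 0.3) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumed convention: one predicted segment and one ground-truth
    segment per query, paired by position in the two lists.
    """
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and equal in length")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


# Illustrative (made-up) segments, not data from the paper.
example_preds = [(2.0, 6.0), (0.0, 1.0), (10.0, 20.0)]
example_gts = [(4.0, 8.0), (5.0, 9.0), (10.0, 20.0)]
example_score = recall_at_iou(example_preds, example_gts, threshold=0.3)
```

Under this convention, a score of 38.52 would mean that roughly 38.5% of the evaluated queries have a top-1 prediction overlapping the annotated moment with IoU of at least 0.3.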
