Prompt When the Animal is: Temporal Animal Behavior Grounding with Positional Recovery Training

Sheng Yan, Xin Du, Zongying Li, Yi Wang, Hongcang Jin, Mengyuan Liu

TL;DR

This work tackles temporal grounding for animal-behavior videos, where target moments are sparse and uniformly distributed, by introducing Port, a Positional Recovery Training (PRT) framework that augments a VSLNet baseline with a Recovering part and a Dual-alignment mechanism. The Recovering part is trained on flipped label sequences and its output is aligned with that of the Predicting part, which focuses the model on the ground-truth temporal regions prompted during training. Port reaches an IoU@0.3 of $38.52$ on the Animal Kingdom dataset and is among the top performers in its MMVRAC sub-track at the ICME 2024 Grand Challenges, with ablations showing that both PRT and Dual-alignment contribute to the result. These results suggest that using ground-truth temporal cues during training can improve grounding performance on challenging wildlife video data, with potential extensions to subject identification and classification branches using language models.

Abstract

Temporal grounding is crucial in multimodal learning, but it poses challenges when applied to animal behavior data due to the sparsity and uniform distribution of moments. To address these challenges, we propose a novel Positional Recovery Training framework (Port), which prompts the model with the start and end times of specific animal behaviors during training. Specifically, Port enhances the baseline model with a Recovering part to predict flipped label sequences and align distributions with a Dual-alignment method. This allows the model to focus on specific temporal regions prompted by ground-truth information. Extensive experiments on the Animal Kingdom dataset demonstrate the effectiveness of Port, achieving an IoU@0.3 of 38.52. It emerges as one of the top performers in the sub-track of MMVRAC in ICME 2024 Grand Challenges.
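
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an IoU@0.3 score is commonly computed in temporal grounding. It assumes the usual convention (top-1 prediction per query, score reported as a percentage of queries whose predicted span overlaps the ground truth with temporal IoU of at least 0.3); the function names `temporal_iou` and `recall_at_iou` are illustrative and are not taken from the paper or its code.

```python
from typing import Sequence, Tuple

# A moment is a (start_seconds, end_seconds) pair. Illustrative type alias.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    # Guard against degenerate zero-length spans.
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Span], gts: Sequence[Span],
                  threshold: float = 0.3) -> float:
    """Percentage of queries whose top-1 span reaches the IoU threshold.

    Assumption: one predicted span per query, paired by index with its
    ground-truth span.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds over a 15-second union, giving an IoU of 1/3, which counts as a hit at the 0.3 threshold.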

Paper Structure (12 sections, 4 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: Distribution of temporal positions of target moments: conventional grounding benchmarks vs. the Animal Kingdom dataset. The color gradient indicates the relative frequency, with brighter shades indicating higher proportions.
  • Figure 2: Our proposed Port is built upon VSLNet. We significantly improve the predictor through Positional Recovery Training. We divide the predictor into two parts: the Predicting part and the Recovering part. Both parts share the same optimization objective, but the Recovering part conducts recovery training on the flipped embedded sequences $\bar{\mathcal{E}}_{\text{s/e}}$ (composed of start/non-start or end/non-end label embeddings).
  • Figure 3: Visualization of the ground-truth moment and predictions by competitor.
  • Figure 4: Visualization of the predicted distributions of the Predicting part and the Recovering part in Port, compared to the predicted distribution of VSLNet, with the moment regions highlighted in yellow.