CARLOR @ Ego4D Step Grounding Challenge: Bayesian temporal-order priors for test time refinement

Carlos Plou, Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Ana C. Murillo

TL;DR

This work tackles Step Grounding in long, untrimmed egocentric videos by proposing Bayesian-VSLNet, which introduces a per-segment probability head and a Bayesian temporal-order prior for test-time refinement. The prior integrates step-order information to produce a posterior over segments, which helps with cyclic and repeated actions and mitigates the needle-in-a-haystack problem in long videos. Training redefines the ground truth as an event vector across segments to accommodate repeated queries, while inference combines a peak-based extension with a Gaussian temporal-order prior controlled by the parameters $\alpha$ and $\beta$. On Ego4D Goal-Step, the method achieves a state-of-the-art Recall Top-1 of $35.18$ at IoU $0.3$ and $20.48$ at IoU $0.5$ on the test set.
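The event-vector ground truth mentioned above can be illustrated with a minimal sketch. It assumes the target is a binary vector with one entry per video segment, set to 1 wherever the queried step occurs, so that a step repeated several times yields several active spans. The function name `event_vector` and the inclusive `(start, end)` segment-index convention are hypothetical and are not taken from the paper.

```python
import numpy as np

def event_vector(num_segments, occurrences):
    """Build a per-segment binary target for one text query.

    occurrences: list of (start, end) segment indices, end inclusive
    (a hypothetical convention chosen for this sketch).
    """
    target = np.zeros(num_segments, dtype=np.float32)
    for start, end in occurrences:
        # Mark every segment covered by this occurrence of the step.
        target[start:end + 1] = 1.0
    return target

# A step that happens twice in a 10-segment video.
target = event_vector(10, [(1, 2), (6, 8)])
```

Unlike a single (start, end) pair, this representation lets one query supervise every occurrence of a repeated step.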

Abstract

The goal of the Step Grounding task is to locate the temporal boundaries of activities based on natural language descriptions. This technical report introduces Bayesian-VSLNet to address the challenge of identifying such temporal segments in lengthy, untrimmed egocentric videos. Our model improves upon traditional models by incorporating a novel Bayesian temporal-order prior during inference, enhancing the accuracy of moment predictions. This prior adjusts for cyclic and repetitive actions within videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset with a 35.18 Recall Top-1 at 0.3 IoU and 20.48 Recall Top-1 at 0.5 IoU on the test set.

Paper Structure (10 sections, 4 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Step grounding task: given a long untrimmed video of a high-level procedural activity (e.g., "Make a soup"), the goal is to localize the video segment that matches a free-form natural language description of a step.
  • Figure 2: Bayesian VSLNet. We introduce two novel components: a head that predicts the probability of the text query in each video segment, and a Bayesian temporal-order prior that refines the predictions during the inference stage.
  • Figure 3: Influence of the $\alpha$ and $\beta$ parameters at the inference stage. $\alpha$ sets the threshold ($\alpha$-percentile of the posterior probability value $p^k_{ij}$) that controls the length of the segment. $\beta$ determines the variance of the prior that influences the posterior.
  • Figure 4: Positive (left) and negative (right) examples of our method before and after applying the temporal-order prior during test-time refinement. True Positive (TP) segments are shown in green, False Positives (FP) in purple, and False Negatives (FN) in blue, together with the probability score along the temporal axis (measured in number of segments).
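
The roles of $\alpha$ and $\beta$ described for Figure 3 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the prior is taken to be a Gaussian over segment indices centred on an expected position for the step, its standard deviation is assumed to be `beta` times the video length, the posterior is the normalised product of the per-segment probabilities and the prior, and the predicted segment is grown outwards from the posterior peak while values stay above the $\alpha$-percentile. The names `gaussian_order_prior` and `refine`, the `center` argument, and the toy numbers are all hypothetical.

```python
import numpy as np

def gaussian_order_prior(num_segments, center, beta):
    """Gaussian prior over segment indices (assumed form).

    center: expected segment index of the step (hypothetical input).
    beta: spread of the prior as a fraction of the video length (assumption).
    """
    t = np.arange(num_segments)
    sigma = beta * num_segments
    prior = np.exp(-0.5 * ((t - center) / sigma) ** 2)
    return prior / prior.sum()

def refine(likelihood, prior, alpha):
    """Combine per-segment scores with the prior, then extend from the peak."""
    posterior = likelihood * prior
    posterior = posterior / posterior.sum()
    # alpha-percentile of the posterior values acts as the extension threshold.
    threshold = np.percentile(posterior, alpha)
    peak = int(np.argmax(posterior))
    start = end = peak
    while start > 0 and posterior[start - 1] >= threshold:
        start -= 1
    while end < len(posterior) - 1 and posterior[end + 1] >= threshold:
        end += 1
    return posterior, (start, end)

# Toy per-segment scores with two peaks, as for a repeated action.
likelihood = np.full(20, 0.05)
likelihood[[1, 4, 11, 14]] = 0.1
likelihood[2], likelihood[3] = 0.9, 0.8
likelihood[12], likelihood[13] = 0.95, 0.85

prior = gaussian_order_prior(20, center=3, beta=0.15)
posterior, segment = refine(likelihood, prior, alpha=90)
```

In this toy case the raw scores peak at the later repetition, while the prior centred near segment 3 shifts the posterior peak to the earlier one. A larger `alpha` raises the threshold and shortens the predicted segment; a larger `beta` flattens the prior so it influences the posterior less.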