CARLOR @ Ego4D Step Grounding Challenge: Bayesian temporal-order priors for test time refinement
Carlos Plou, Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Ana C. Murillo
TL;DR
This work tackles Step Grounding in long, untrimmed egocentric videos by proposing Bayesian-VSLNet, which adds a per-segment probability head and a Bayesian temporal-order prior for test-time refinement. The prior uses step-order information to produce a posterior over segments, which handles cyclic and repeated actions and mitigates the needle-in-a-haystack problem in long videos. Training redefines the ground truth as an event vector across segments to accommodate repeated queries, while inference combines a peak-based extension with a Gaussian temporal-order prior controlled by parameters $\alpha$ and $\beta$. On Ego4D Goal-Step, the method achieves state-of-the-art results on the test set, with a top-1 recall of $35.18$ at IoU $0.3$ and $20.48$ at IoU $0.5$.
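The test-time refinement described above can be illustrated with a minimal sketch: per-segment probabilities are multiplied by a Gaussian prior over segment indices and renormalized into a posterior. The summary only states that the prior is Gaussian and controlled by $\alpha$ and $\beta$; the roles given to them here (alpha shifts the prior mean past the end of the previous step, beta sets its width, both as fractions of the video length), along with the function names and default values, are assumptions made for illustration and not the report's definition.

```python
import math


def gaussian_prior(num_segments, center, sigma):
    """Unnormalized Gaussian weights over segment indices, peaked at `center`."""
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(num_segments)]


def refine(likelihood, prev_end, alpha=0.1, beta=0.2):
    """Combine per-segment probabilities with a temporal-order prior.

    Assumed roles (hypothetical, not taken from the report):
      alpha: offset of the prior mean after the previous step's end,
             as a fraction of the number of segments.
      beta:  standard deviation of the prior, as a fraction of the
             number of segments.
    """
    n = len(likelihood)
    prior = gaussian_prior(n, center=prev_end + alpha * n, sigma=beta * n)
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalized)
    if total == 0:
        # Degenerate case: no evidence at all, fall back to a uniform posterior.
        return [1.0 / n] * n
    return [u / total for u in unnormalized]


# A repeated action: the model is equally confident at segments 2 and 8.
likelihood = [0.05, 0.05, 0.9, 0.05, 0.05, 0.05, 0.05, 0.05, 0.9, 0.05]
# The previous step in the sequence ended at segment 6.
posterior = refine(likelihood, prev_end=6)
best_segment = posterior.index(max(posterior))
```

In this toy case the likelihood alone cannot separate the two occurrences, while the posterior selects segment 8, the occurrence that follows the previous step.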
Abstract
The goal of the Step Grounding task is to locate the temporal boundaries of activities described in natural language. This technical report introduces Bayesian-VSLNet to address the challenge of identifying such temporal segments in lengthy, untrimmed egocentric videos. Our model improves upon traditional models by incorporating a novel Bayesian temporal-order prior during inference, which refines moment predictions and accounts for cyclic and repetitive actions within videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset with a Top-1 Recall of 35.18 at 0.3 IoU and 20.48 at 0.5 IoU on the test set.
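For readers unfamiliar with the reported metric, the following sketch shows how top-1 recall at an IoU threshold is commonly computed for temporal grounding. It assumes one ground-truth interval per query and intervals given as (start, end) pairs in seconds; the function names are illustrative, and the official challenge evaluation may differ in its details.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Three toy queries: an exact match, a partial overlap (IoU = 1/3), and a miss.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]
r_at_03 = recall_at_1(preds, gts, threshold=0.3)
r_at_05 = recall_at_1(preds, gts, threshold=0.5)
```

With these toy intervals, two of the three predictions pass the 0.3 threshold and only one passes 0.5, which is why recall at the stricter threshold is always the lower of the two figures.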
