Enhancing Weakly Supervised Video Grounding via Diverse Inference Strategies for Boundary and Prediction Selection
Sunoh Kim, Daeho Um
TL;DR
This paper tackles weakly supervised video grounding by highlighting the underexplored role of inference strategies. It extends Gaussian-based proposals with diverse boundary prediction methods and loss-aware top-1 selection to better localize the query-relevant segment ($s$, $e$) without additional training. Five boundary strategies and four top-1 selection strategies are evaluated, with Shortest Tail for boundary prediction and IoU+LossMax (or IoU+LossSum, depending on the dataset) for top-1 selection emerging as the best configurations. On ActivityNet Captions and Charades-STA, the proposed inference strategies improve over the baselines, providing practical, training-free gains in boundary localization and proposal selection for complex videos.
Abstract
Weakly supervised video grounding aims to localize the temporal boundaries relevant to a given query without ground-truth boundary annotations. While existing methods primarily use Gaussian-based proposals, they overlook the importance of (1) boundary prediction and (2) top-1 prediction selection during inference. For boundary prediction, they simply place the boundaries half a standard deviation from the Gaussian mean on each side, which may not capture the optimal boundaries. For top-1 prediction selection, they rely heavily on intersections with other proposals and do not consider the varying quality of each proposal. To address these issues, we explore various inference strategies by introducing (1) novel boundary prediction methods that capture diverse boundaries from multiple Gaussians and (2) new selection methods that take proposal quality into account. Extensive experiments on the ActivityNet Captions and Charades-STA datasets validate the effectiveness of our inference strategies, demonstrating performance improvements without requiring additional training.
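To make the baseline inference described above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each proposal is a Gaussian with a mean and standard deviation expressed in normalized video time in [0, 1], and it assumes that "intersections with other proposals" means summing each proposal's temporal IoU with all the others. The function names `baseline_boundaries`, `temporal_iou`, and `select_top1_by_iou` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]


def baseline_boundaries(mean: float, std: float) -> Segment:
    """Place boundaries half a standard deviation on each side of the mean.

    Assumption: time is normalized to [0, 1], so the result is clipped there.
    """
    start = max(0.0, mean - 0.5 * std)
    end = min(1.0, mean + 0.5 * std)
    return start, end


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_top1_by_iou(segments: List[Segment]) -> int:
    """Return the index of the segment with the largest summed IoU to the others.

    Assumption: this voting scheme stands in for intersection-based top-1
    selection; it ignores per-proposal quality, which is the limitation the
    abstract points out.
    """
    scores = [
        sum(temporal_iou(seg, other) for j, other in enumerate(segments) if j != i)
        for i, seg in enumerate(segments)
    ]
    return max(range(len(segments)), key=scores.__getitem__)


# Illustrative (made-up) Gaussian proposals as (mean, std) pairs.
gaussians = [(0.2, 0.4), (0.3, 0.4), (0.35, 0.4), (0.8, 0.4)]
segments = [baseline_boundaries(m, s) for m, s in gaussians]
best_index = select_top1_by_iou(segments)
```

In this sketch every proposal's vote counts equally regardless of how well it fits the query, so an outlier such as the last proposal simply contributes nothing, while a low-quality proposal that overlaps others would still influence the choice.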
