SCANet: Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval

Sunjae Yoon, Gwanhyeong Koo, Dahyun Kim, Chang D. Yoo

TL;DR

SCANet tackles weakly-supervised video moment retrieval by introducing scene complexity as a prior to adapt proposal generation and enhancement. It defines a scalar scene complexity α through scene finding and redundancy removal, then uses a codebook-backed complexity vector to generate a variable number of proposals and a Flatten Gaussian mask to localize them, followed by cross-modal reconstruction and hierarchical contrastive learning for robust alignment. Dynamic calibration further tailors the training loss to videos of differing complexity. Empirically, SCANet achieves state-of-the-art or strong results on Charades-STA, ActivityNet Captions, and TVR, and ablations confirm the value of redundancy removal, complexity-adaptive proposals, and multi-level contrastive signals in reducing scene-proposal mismatch. This work advances wsVMR by introducing a principled, complexity-aware mechanism to handle diverse scene counts in real videos, improving both retrieval accuracy and interpretability of proposals.

Abstract

Video moment retrieval (VMR) aims to localize the moment in a video that corresponds to a given language query. To avoid the high cost of annotating temporal moments, weakly-supervised VMR (wsVMR) systems have been studied. A popular approach in such systems is to generate a number of proposals as moment candidates and then select the most appropriate one. These proposals are assumed to cover the many distinguishable scenes in a video. However, existing wsVMR systems determine their proposals heuristically, irrespective of the video, and so do not respect the varying number of scenes in each video. We argue that a retrieval system should be able to cope with the complexity caused by the varying number of scenes in each video. To this end, we present a novel retrieval system, the Scene Complexity Aware Network (SCANet), which measures the 'scene complexity' of multiple scenes in each video and generates adaptive proposals that respond to the variable scene complexity of each video. Experimental results on three retrieval benchmarks (Charades-STA, ActivityNet, and TVR) show state-of-the-art performance and demonstrate the effectiveness of incorporating scene complexity.
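
To make the idea of counting scenes with redundancy removal more concrete, the following is a minimal sketch, not the authors' procedure. It assumes per-frame feature vectors are already available, splits the video wherever consecutive frames become dissimilar, and then skips any scene whose averaged feature nearly matches one already counted. The function name `estimate_scene_count` and both thresholds are hypothetical choices made for illustration.

```python
import numpy as np

def estimate_scene_count(frame_feats, boundary_thresh=0.8, redundancy_thresh=0.9):
    """Hypothetical scene counter: split at low-similarity frames, skip repeated scenes.

    frame_feats: array of shape (num_frames, dim), one feature vector per frame.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # L2-normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(frame_feats, axis=1, keepdims=True)
    feats = frame_feats / np.maximum(norms, 1e-8)

    # Scene finding: a boundary is placed wherever two consecutive frames
    # are less similar than boundary_thresh.
    sims = np.sum(feats[:-1] * feats[1:], axis=1)
    boundaries = np.where(sims < boundary_thresh)[0] + 1
    segments = np.split(feats, boundaries)

    # Redundancy removal: a scene only counts if its mean feature is not
    # too similar to the mean feature of a scene that was already counted.
    prototypes = []
    for seg in segments:
        proto = seg.mean(axis=0)
        proto = proto / max(np.linalg.norm(proto), 1e-8)
        if all(float(proto @ p) < redundancy_thresh for p in prototypes):
            prototypes.append(proto)
    return len(prototypes)

# Toy video: scene A, scene B, then scene A again (a redundant repeat).
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
toy_video = np.stack([a] * 3 + [b] * 3 + [a] * 3)
scene_count = estimate_scene_count(toy_video)
```

In the toy video the third segment repeats the first, so it is not counted again. Setting `redundancy_thresh` above 1 disables the redundancy check, which is a simple way to compare counts with and without it.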

Paper Structure (32 sections, 11 equations, 5 figures, 6 tables, 1 algorithm)

This paper contains 32 sections, 11 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Scene-proposal mismatch in current wsVMR systems: (a) shows unnecessarily many proposals on a video containing few scenes and too few proposals on a video containing many scenes, (b) shows mIoU scores of the current model's predictions according to the number of scenes and the number of generated proposals, and (c) shows a method for estimating the number of scenes, where redundant scenes are removed from the count.
  • Figure 2: Illustration of the proposed SCANet. (a) shows scene complexity estimation, which takes an input video and estimates its scene complexity using video-query pairs, (b) shows the input representations, (c) shows complexity-adaptive proposal generation, which generates adaptive proposals according to the complexity, and (d) shows complexity-adaptive proposal enhancement, which introduces multiple representation enhancements and calibrates them according to the complexity.
  • Figure 3: Illustration of (a) the Gaussian mask [zheng2022weakly1, zheng2022weakly2] and (b) our proposed flatten Gaussian mask.
  • Figure 4: Qualitative results of SCANet: (a) shows the video retrieval performance of SCANet in terms of R@k, (b) shows moment retrieval performance as a function of the top-K retrieved videos used for hierarchical contrastive learning, (c) shows the IoU score distributions according to the number of scenes and proposals (upper: length-dependent proposals, lower: complexity-adaptive proposals), and (d) illustrates the proposals in SCANet for videos with diverse scenes.
  • Figure 5: Illustration of a failure case of moment prediction.
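
The contrast between the two masks named in the Figure 3 caption can be sketched as follows. The text here does not give the formula for the flatten Gaussian mask, so the second function is only a stand-in: a generic super-Gaussian whose `power` parameter widens the plateau around the center. The function names, the `power` parameter, and the normalized time axis are all assumptions made for illustration, not the paper's definition.

```python
import numpy as np

def gaussian_mask(num_clips, center, width):
    """Standard Gaussian proposal mask over a normalized time axis in [0, 1]."""
    t = np.linspace(0.0, 1.0, num_clips)
    z = np.abs(t - center) / width
    return np.exp(-0.5 * z ** 2)

def flattened_gaussian_mask(num_clips, center, width, power=3.0):
    """Hypothetical flat-topped variant (a super-Gaussian), NOT the paper's formula.

    power=1 recovers the standard Gaussian; larger values keep the mask close
    to 1 inside the proposal and make it fall off more sharply outside.
    """
    t = np.linspace(0.0, 1.0, num_clips)
    z = np.abs(t - center) / width
    return np.exp(-0.5 * z ** (2.0 * power))

# Compare both masks for a proposal centered at 0.5 with width 0.2.
standard = gaussian_mask(11, center=0.5, width=0.2)
flattened = flattened_gaussian_mask(11, center=0.5, width=0.2)
```

Under this assumed form, clips inside the proposal receive nearly uniform weight while clips well outside it are suppressed more strongly than with the plain Gaussian.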