Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels

Weitong Cai; Jiabo Huang; Shaogang Gong

TL;DR

This work tackles video moment retrieval (VMR) under mixed supervision by transferring temporal-boundary knowledge from a richly labelled source domain to a weakly-labelled target domain. It introduces EVA, a multiplE branch Video-text Alignment model with two branches, which uses cross-modal attention to share precise video-text matching information across domains. To mitigate the domain gap, EVA employs a Modality Feature Alignment Constraint based on maximum mean discrepancy (MMD) and a Joint-Modal Domain Classifier trained with gradient reversal. The learning objective combines weakly-supervised and fully-supervised signals with alignment and domain losses: $\mathcal{L} = \mathcal{L}_w + \lambda_f \mathcal{L}_f + \lambda_{align} \mathcal{L}_{align} - \lambda_{domain} \mathcal{L}_{domain}$. Empirical results show that EVA improves cross-domain VMR performance, and ablations confirm the contributions of the alignment and domain-adversarial components and show better generalisation to unseen data. The approach lets richly annotated data improve weakly-labelled VMR, reducing labelling costs while maintaining strong retrieval accuracy across diverse datasets.
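As a rough illustration of the objective above, the following sketch computes a squared MMD between two small feature sets and combines four scalar loss terms with the signs given in the formula. It is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian kernel, its bandwidth, the biased estimator, the default weights, and the names `rbf_kernel`, `mmd_squared` and `total_loss` are all hypothetical choices made here for illustration.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors.

    The kernel family and bandwidth are assumptions; the summary above
    only states that the alignment constraint is MMD-based.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of vectors."""

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


def total_loss(l_w, l_f, l_align, l_domain,
               lam_f=1.0, lam_align=1.0, lam_domain=1.0):
    """Combine the four terms as L = L_w + lam_f*L_f + lam_align*L_align - lam_domain*L_domain.

    The default weights are placeholders, not values from the paper.
    """
    return l_w + lam_f * l_f + lam_align * l_align - lam_domain * l_domain


if __name__ == "__main__":
    # Toy source-domain and target-domain features (hypothetical values).
    source_feats = [[0.0, 0.0], [0.0, 1.0]]
    target_feats = [[5.0, 5.0], [5.0, 6.0]]
    l_align = mmd_squared(source_feats, target_feats)
    print("alignment term:", l_align)
    print("total:", total_loss(1.0, 2.0, l_align, 0.5))
```

The domain term enters with a negative sign, matching the formula. In an adversarial setup, this is the effect a gradient reversal layer has on the feature extractor.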

Abstract

Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video given a text query description (sentence). Existing studies either rely on collecting exhaustive frame-wise annotations of the temporal boundaries of target moments (fully-supervised), or learn with only video-level video-text pairing labels (weakly-supervised). The former generalises poorly to unknown concepts and/or novel scenes, because expensive annotation costs restrict dataset scale and diversity; the latter is subject to visual-textual mis-correlations arising from incomplete labels. In this work, we introduce a new approach called hybrid-learning video moment retrieval, which addresses the problem through knowledge transfer: the video-text matching relationships learned from a fully-supervised source domain are adapted to a weakly-labelled target domain when the two do not share a common label space. Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain. Specifically, we introduce a multiplE branch Video-text Alignment model (EVA) that performs cross-modal (visual-textual) matching information sharing and multi-modal feature alignment to optimise domain-invariant visual and textual features as well as per-task discriminative joint video-text representations. Experiments show EVA's effectiveness in exploiting temporal segment annotations in a source domain to help learn video moment retrieval without temporal labels in a target domain.
