VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman

TL;DR

VideoLights tackles the joint problem of video highlight detection and moment retrieval by introducing a cross-modal, cross-task transformer framework. It integrates a Feature Refinement & Alignment (FRA) module, a Bi-Directional Cross-Modal Fusion (Bi-CMF) network, and a Unidirectional Joint-Task Feedback mechanism, guided by adaptive hard-positive/negative losses and task-coupled supervision, with LVLM-based pretraining (e.g., BLIP-2) to enhance multimodal fusion. The model achieves state-of-the-art results on QVHighlights, TVSum, and Charades-STA, with notable gains attributed to FRA’s local-global alignment, Bi-CMF’s hierarchical fusion, and the cross-task feedback strategy. This approach demonstrates strong generalization, particularly when augmented with synthetic pretraining data, and highlights the practical impact of tightly integrated cross-modal and cross-task dynamics for video understanding tasks.

Abstract

Prevailing joint prediction transformers for Video Highlight Detection and Moment Retrieval (HD/MR) exhibit deficiencies in handling cross-task dynamics, achieving robust video-text alignment, and utilizing effective attention mechanisms, with the potential of Large Language/Vision-Language Models (LLMs/LVLMs) being largely untapped. This paper introduces VideoLights, a novel HD/MR framework addressing these limitations by incorporating: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for enhanced video-text feature congruity; (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware representations; (iii) a Uni-directional joint-task feedback mechanism for synergistic task improvement; (iv) hard positive/negative losses for adaptive learning; and (v) the leveraging of LVLMs (e.g., BLIP-2) for superior multimodal feature integration and intelligent pre-training with synthetic data. Comprehensive evaluations on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that VideoLights significantly surpasses existing baselines, establishing new state-of-the-art performance. Code and model checkpoints are available at https://github.com/dpaul06/VideoLights.

Paper Structure

This paper contains 22 sections, 9 equations, 6 figures, 13 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overall VideoLights architecture. The FRA module models video-text correlations from projected embeddings, which are then refined by the Bi-CMF encoder. A trainable saliency vector predicts output levels, while class and moment prediction heads generate logits and video moments. Cross-task feedback is provided by saliency cosine similarity and task-coupled HD/MR losses (Uni-JFM), with new losses highlighted in purple.
  • Figure 2: (a) The input video. (b) and (c) Correspondence maps between query and video tokens using linear and convolutional projection layers, respectively; video and text are more closely aligned under the convolutional projection than under the linear one. (d) The effect of the Feature Refinement module, which aligns video and text tokens so that they match the ground-truth saliency levels. In each heat map, the ground-truth saliency level is plotted as a green line.
  • Figure 3: Bi-CMF module. It learns query-oriented video representations through text2video, video2text, and then text2video attention. Dropout and normalization are applied after each step, and an activation is applied at the last stage.
  • Figure 4: (a) and (b) show video-query correspondence maps: (a) after text-to-video (t2v) attention and (b) after the Bi-CMF layer. The green line represents the ground truth saliency scores. Bi-CMF attends to the correct video region better than t2v (highlighted in the magenta box). The word 'Is' asserts that 'a' refers to one basket, unlike 'is not'.
  • Figure 5: Qualitative results, showing that VideoLights outperforms TR-DETR (Sun, Zhou, Chen, and Xie, 2024) in both MR and HD.
  • ...and 1 more figures
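
The three-step attention order described in the Figure 3 caption (text2video, video2text, then text2video) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. It assumes single-head scaled dot-product attention with no learned projections, reads "text2video" as video tokens querying text tokens, uses residual connections with layer normalization after each step, omits dropout (as at inference time), and picks ReLU as the final activation. The function names `cross_attention`, `add_and_norm`, and `bi_cmf_sketch` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query row attends over all rows of keys_values (used as keys and values)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)])
    return out


def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def add_and_norm(residual, update):
    """Residual connection followed by layer normalization (assumed, not from the source)."""
    return [layer_norm([r + u for r, u in zip(r_row, u_row)])
            for r_row, u_row in zip(residual, update)]


def bi_cmf_sketch(video, text):
    """Hypothetical three-step fusion: text2video, video2text, text2video."""
    # Step 1 (text2video): video tokens query the text tokens.
    v1 = add_and_norm(video, cross_attention(video, text))
    # Step 2 (video2text): text tokens query the text-conditioned video tokens.
    t1 = add_and_norm(text, cross_attention(text, v1))
    # Step 3 (text2video): video tokens query the refined text tokens.
    v2 = add_and_norm(v1, cross_attention(v1, t1))
    # Activation at the last stage; ReLU is an assumption.
    return [[max(0.0, x) for x in row] for row in v2]


if __name__ == "__main__":
    # Toy inputs: 3 video clips and 2 query tokens, each a 4-dimensional feature.
    video = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [1.0, 0.0, -1.0, 0.5]]
    text = [[0.3, 0.1, -0.2, 0.0], [0.0, 0.4, 0.1, -0.3]]
    fused = bi_cmf_sketch(video, text)
    print(len(fused), len(fused[0]))
```

The output keeps one vector per video clip, so the fused representation can feed per-clip saliency and moment prediction heads. The second text2video pass is what distinguishes this ordering from a single text-to-video attention step: the video attends to text that has already been conditioned on the video.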