Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning

Yunbin Tu, Liang Li, Li Su, Qingming Huang

TL;DR

The paper addresses the problem of user-centric, hierarchical video understanding, focusing on moment retrieval, moment segmentation, and step-captioning within the HIREST framework. It introduces QUAG, a two-module architecture with modality-synergistic perception (MSP) and query-centric cognition (QC$^2$) to build a query-centric audio-visual representation by modeling global AV alignment, local cross-modal interactions, and query-guided filtration. Key contributions include a global-to-local AV fusion strategy and a deep-query filtration mechanism, achieving state-of-the-art results on HIREST and strong generalization to TVSum for query-based video summarization. The approach advances practical video understanding by aligning multi-modal content with user queries, enabling precise moment localization, structured segmentation, and captioning, with potential applicability to real-world video search and summarization tasks.

Abstract

Video has emerged as a favored multimedia format on the internet. To help users better understand video content, a new task, HIREST, has been proposed, comprising video retrieval, moment retrieval, moment segmentation, and step-captioning. The pioneering work uses a pre-trained CLIP-based model for video retrieval and reuses it as a feature extractor for the other three challenging tasks, which are solved in a multi-task learning paradigm. Nevertheless, this work struggles to learn a comprehensive cognition of user-preferred content because it disregards the hierarchies and associations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation, and step-captioning. Specifically, we first design the modality-synergistic perception module to obtain rich audio-visual content by modeling global contrastive alignment and local fine-grained interaction between the visual and audio modalities. Then, we devise the query-centric cognition module, which uses the deep-level query to perform temporal-channel filtration on the shallow-level audio-visual representation. This lets the model recognize user-preferred content and thus attain a query-centric audio-visual representation for the three tasks. Extensive experiments show that QUAG achieves state-of-the-art results on HIREST. Further, we test QUAG on the query-based video summarization task and verify that it generalizes well.
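The abstract names three operations: global contrastive alignment, local audio-visual interaction, and query-guided temporal-channel filtration. The NumPy sketch below illustrates one plausible reading of each. It is not the authors' implementation: the function names, the symmetric InfoNCE loss, the single-head cross-attention, the sigmoid gates, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_contrastive_loss(video_global, audio_global, temperature=0.07):
    """Assumed form of global alignment: symmetric InfoNCE over a batch.

    video_global, audio_global: (B, D) pooled clip-level features.
    Matching video/audio pairs share the same batch index.
    """
    v = l2_normalize(video_global)
    a = l2_normalize(audio_global)
    logits = v @ a.T / temperature  # (B, B) scaled cosine similarities

    def diagonal_cross_entropy(lg):
        log_prob = np.log(softmax(lg, axis=1))
        return -np.mean(np.diag(log_prob))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def local_interaction(video_seq, audio_seq):
    """Assumed form of local interaction: video frames attend over audio frames.

    video_seq: (T_v, D), audio_seq: (T_a, D). Returns (T_v, D).
    """
    d = video_seq.shape[-1]
    attn = softmax(video_seq @ audio_seq.T / np.sqrt(d), axis=-1)  # (T_v, T_a)
    return video_seq + attn @ audio_seq  # residual fusion

def query_filtration(av_seq, query_vec):
    """Assumed form of temporal-channel filtration: two query-driven gates.

    av_seq: (T, D) fused audio-visual sequence, query_vec: (D,) query embedding.
    """
    d = av_seq.shape[-1]
    temporal_gate = sigmoid(av_seq @ query_vec / np.sqrt(d))[:, None]  # (T, 1)
    channel_gate = sigmoid(av_seq.mean(axis=0) * query_vec)[None, :]   # (1, D)
    return av_seq * temporal_gate * channel_gate
```

In this reading, the contrastive loss shapes the clip-level feature space, the cross-attention mixes the two streams frame by frame, and the gates suppress time steps and channels that the query does not support. Because both gates lie in (0, 1), filtration can only attenuate features, never amplify them.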

Paper Structure

This paper contains 26 sections, 15 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The illustrative example consisting of moment retrieval, moment segmentation, and step-captioning. First, given a text query "How to make perfect strawberry glazed pie", the model is required to localize the most related moment in the video (moment retrieval). Then, the model proceeds to break down the moment into finer-level steps (moment segmentation). Finally, the model should describe each step with a concise sentence (step-captioning).
  • Figure 2: The overview of our method. Based on the principle of shallow-to-deep, we propose a query-centric audio-visual cognition (QUAG) network, where the core modules are the modality-synergistic perception and query-centric cognition. QUAG aims to learn a comprehensive cognition of user-preferred video content, and thus attain a query-centric audio-visual representation for jointly addressing the moment retrieval, moment segmentation, and step-captioning.
  • Figure 3: Given a text query "How to create a brunch menu", and the ground-truth annotations, we compare the predicted and generated outputs of our QUAG and Joint [zala2023hierarchical] for moment retrieval, segmentation, and step-captioning.
  • Figure 4: Given a text query "How to use a fire pit", and the ground-truth annotations, we compare the predicted and generated outputs of our QUAG and Joint [zala2023hierarchical] for moment retrieval, segmentation, and step-captioning.
  • Figure 5: Given a text query "How to clean resin", and the ground-truth annotations, we compare the predicted and generated outputs of our QUAG and Joint [zala2023hierarchical] for moment retrieval, segmentation, and step-captioning.
  • ...and 2 more figures
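
The Figure 1 caption describes a nested output: one retrieved moment, split into steps, each with a caption. The sketch below shows one way to hold such a prediction in Python. The class names, field names, and the rule that steps are ordered, non-overlapping, and contained in the moment are assumptions for illustration, not the paper's data format, and the timestamps and step captions in the usage example are made up.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    start: float   # seconds from the beginning of the video
    end: float
    caption: str   # step-captioning output

@dataclass
class MomentPrediction:
    query: str
    start: float   # moment retrieval output
    end: float
    steps: List[Step] = field(default_factory=list)  # moment segmentation output

    def is_consistent(self) -> bool:
        """Check the assumed constraint: steps are ordered, do not overlap,
        and lie inside the retrieved moment."""
        previous_end = self.start
        for step in self.steps:
            if not (previous_end <= step.start < step.end <= self.end):
                return False
            previous_end = step.end
        return True

# Hypothetical values; only the query text comes from the Figure 1 caption.
prediction = MomentPrediction(
    query="How to make perfect strawberry glazed pie",
    start=10.0,
    end=95.0,
    steps=[
        Step(10.0, 40.0, "prepare the crust"),
        Step(40.0, 95.0, "add the strawberry glaze"),
    ],
)
```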