Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

TL;DR

The paper tackles dense video captioning for untrimmed videos by enriching visual features with external textual memory. It introduces CM$^2$, a cross-modal memory-based framework that retrieves segment-relevant text from a memory bank via video-to-text matching and fuses it through a versatile encoder-decoder with visual and textual cross-attention. Memory is constructed from in-domain captions using CLIP-based embeddings, and segment-level retrieval uses temporal anchors to produce retrieved text features that augment both localization and captioning. Experiments on ActivityNet Captions and YouCook2 show improved caption quality and localization without reliance on extensive pretraining, demonstrating the practicality and impact of memory-augmented dense video understanding.

Abstract

There has been significant attention to the research on dense video captioning, which aims to automatically localize and caption all events within untrimmed video. Several studies introduce methods by designing dense video captioning as a multitasking problem of event localization and event captioning to consider inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by the cognitive information processing of humans. Our model utilizes external memory to incorporate prior knowledge. The memory retrieval method is proposed with cross-modal video-to-text matching. To effectively incorporate retrieved text features, the versatile encoder and the decoder with visual and textual cross-attention modules are designed. Comparative experiments have been conducted to show the effectiveness of the proposed method on ActivityNet Captions and YouCook2 datasets. Experimental results show promising performance of our model without extensive pretraining from a large video dataset.
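The decoder with visual and textual cross-attention mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head with no learned projections, and the way the two attention outputs are combined (a residual sum) is an assumption made for illustration. The names `cross_attend` and `refine_queries` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, features):
    """Scaled dot-product attention of one query over a list of vectors.

    Assumption: the features serve as both keys and values, and there are
    no learned projection matrices.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    return [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(len(query))
    ]


def refine_queries(event_queries, visual_feats, text_feats):
    """Refine each event query with visual and textual cross-attention.

    Assumption: the two attention outputs are added to the query as a
    residual sum; the source does not specify the combination rule.
    """
    refined = []
    for query in event_queries:
        visual_ctx = cross_attend(query, visual_feats)
        text_ctx = cross_attend(query, text_feats)
        refined.append([q + v + t for q, v, t in zip(query, visual_ctx, text_ctx)])
    return refined
```

In a full model the refined queries would then feed the localization and captioning heads; here they are returned as plain lists so the data flow is easy to inspect.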

Paper Structure (23 sections, 1 equation, 5 figures, 12 tables)

Figures (5)

  • Figure 1: Conceptual figure of the proposed cross-modal memory-based dense video captioning (CM$^2$). Our method can search for relevant clues from an external memory bank to provide precise descriptions and localization for untrimmed video.
  • Figure 2: Overview of CM$^2$. We approach dense video captioning as memory-retrieval-augmented caption generation. (a) The overall architecture: we conduct video-to-text cross-modal retrieval using input video features obtained through a pre-trained encoder. (b) We generate $W$ segment-level temporal anchors from the input video features, then measure similarities between the anchors and the text features stored in memory to obtain $W$ retrieved features through aggregation. (c) We encode the multi-scale video features $\tilde{\textbf{x}}$ and the retrieved features using a versatile transformer encoder. Each encoded feature vector passes through its corresponding cross-attention layers to obtain refined event queries. Finally, we obtain the set of start times, end times, and captions by passing the event queries through a head.
  • Figure 3: Example of dense video captioning predictions from our method on the ActivityNet Captions validation set. We show a comparison with the ground truth. Retrieved sentences are example retrieval results that have the highest semantic similarity to the corresponding segments of input frames. Each retrieved sentence is utilized in our model's predictions for the segments with the corresponding color.
  • Figure 4: Example of predictions from our method on the ActivityNet Captions dataset. We show a comparison with the ground truth. Retrieved sentences are example retrieval results that have the highest similarity to the corresponding segments of input frames. Each retrieved sentence is utilized in our model's predictions for the segments with the corresponding color.
  • Figure 5: Example of predictions from our method on the YouCook2 dataset. We show a comparison with the ground truth. Retrieved sentences are example retrieval results that have the highest semantic similarity to the corresponding segments of input frames. Each retrieved sentence is utilized in our model's predictions for the segments with the corresponding color.
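
The segment-level retrieval described in the Figure 2 caption (temporal anchors, similarity against stored text features, aggregation) can be sketched roughly as follows. This is an illustrative sketch, not the paper's code. Mean pooling for the anchors, cosine similarity, top-k selection, and mean aggregation are all assumptions, and the function names and the `top_k` parameter are hypothetical. Toy vectors stand in for the CLIP-based embeddings.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def temporal_anchors(frame_feats, num_anchors):
    """Split frames into W contiguous segments and mean-pool each one.

    Assumption: anchors are uniform, non-overlapping segment averages.
    """
    n = len(frame_feats)
    anchors = []
    for w in range(num_anchors):
        start = w * n // num_anchors
        end = max((w + 1) * n // num_anchors, start + 1)
        anchors.append(mean_pool(frame_feats[start:end]))
    return anchors


def retrieve(frame_feats, memory_feats, num_anchors, top_k):
    """Return W retrieved features, one per temporal anchor.

    For each anchor, rank the memory text features by cosine similarity,
    keep the top_k, and aggregate them by averaging (an assumption).
    """
    retrieved = []
    for anchor in temporal_anchors(frame_feats, num_anchors):
        ranked = sorted(memory_feats, key=lambda m: cosine(anchor, m), reverse=True)
        retrieved.append(mean_pool(ranked[:top_k]))
    return retrieved
```

With `num_anchors` set to $W$, `retrieve` returns $W$ vectors in the text embedding space, one per segment, which matches the count of retrieved features stated in the caption.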