HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning

Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

TL;DR

Dense video captioning requires both locating event boundaries in untrimmed videos and describing each event in natural language. HiCM$^2$ introduces a hierarchical compact memory and a top-down retrieval mechanism that leverages cross-modal cues and LLM-based summarization to recall semantically relevant episodes and abstract concepts. The approach achieves state-of-the-art results on YouCook2 and ViTT, improving caption quality while maintaining competitive event localization, as demonstrated through extensive ablations. This work highlights the potential of memory-augmented, retrieval-enabled architectures to enhance vision-language tasks by combining structured external memory with pretrained knowledge.

Abstract

With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by employing clustering of memory events and summarization using large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by achieving state-of-the-art performance on YouCook2 and ViTT datasets.
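
The bottom-up construction described in the abstract (cluster memory events, then summarize each cluster) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `MemoryNode`, `kmeans`, `build_level`, and `build_hierarchy` are hypothetical names, the toy k-means stands in for whatever clustering the paper uses, the `summarize` callable stands in for an LLM call, and a parent's embedding is taken as the mean of its children rather than an encoding of the summary text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class MemoryNode:
    """One memory entry: a caption (or summary) plus its embedding."""
    text: str
    embedding: Vector
    children: List["MemoryNode"] = field(default_factory=list)


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Vector], k: int, iters: int = 10) -> List[int]:
    """Toy k-means; centroids start at the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assign


def build_level(nodes: List[MemoryNode], k: int,
                summarize: Callable[[List[str]], str]) -> List[MemoryNode]:
    """Cluster one level and create one summarized parent per cluster."""
    assign = kmeans([n.embedding for n in nodes], k)
    parents = []
    for c in range(k):
        members = [n for n, a in zip(nodes, assign) if a == c]
        if not members:
            continue
        parents.append(MemoryNode(
            text=summarize([m.text for m in members]),  # LLM stand-in
            embedding=mean([m.embedding for m in members]),
            children=members,
        ))
    return parents


def build_hierarchy(leaves: List[MemoryNode], ks: Sequence[int],
                    summarize: Callable[[List[str]], str]
                    ) -> List[List[MemoryNode]]:
    """Return levels from the leaves (index 0) up to the most abstract."""
    levels = [leaves]
    for k in ks:
        prev = levels[-1]
        levels.append(build_level(prev, min(k, len(prev)), summarize))
    return levels
```

Calling `build_hierarchy(leaves, ks=[2, 1], summarize=my_llm_summary)` would yield three levels, where `my_llm_summary` is any function mapping a list of captions to one summary string. Each higher level is smaller than the one below it, which is what makes the memory compact.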

Paper Structure

This paper contains 33 sections, 1 equation, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Conceptual figure of the proposed hierarchical compact memory construction. Our method leverages relevant clues from hierarchical compact memory with cross-modal retrieval, effectively mimicking human memory processes for improved event localization and description.
  • Figure 2: Overview of the HiCM$^2$ architecture. HiCM$^2$ approaches the DVC task in a memory retrieval-augmented generation manner using hierarchical compact memory. We conduct video-to-text cross-modal retrieval at each hierarchical level using input video features obtained through a pre-trained spatial encoder. The visual features are temporally encoded through a temporal encoder. The time information of speech is converted into time tokens, and the speech text is tokenized for speech encoding. The encoded feature vectors are concatenated and fed into the cross-attention layer of the text decoder. Finally, we obtain a sequence consisting of the start time, end time, and caption.
  • Figure 3: The construction and read process of hierarchical compact memory. As illustrated in (a), hierarchical memory is constructed using iterative clustering in a bottom-up approach to recall relevant episodes and abstract concepts, with LLM-based summarization generating compact, memory-efficient representations. As illustrated in (b), we retrieve a segmented visual cue from each temporal anchor, beginning at the high-level, abstract layer, and then recursively exploring lower layers for additional relevant information. K features are retrieved per level, and this process is repeated for all temporal anchors.
  • Figure 4: Example of predictions on the YouCook2 Validation set using our approach. The hierarchical retrieved sentences shown are examples of retrieval results with the highest semantic similarity at each hierarchical level, corresponding to specific segments of the input frames. Each retrieved sentence is converted into features and utilized in our model's predictions for the segments. Matching colors indicate the association between retrieved knowledge and prediction.
  • Figure 5: Example of the LLM summarization instruction used in our approach on the YouCook2 training set.
  • ...and 1 more figure
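
The top-down read process described in the Figure 3(b) caption (start at the most abstract level, keep K entries, then descend into their children, repeated for every temporal anchor) can be sketched as below. This is an illustrative sketch under assumptions rather than the paper's code: `Node`, `cosine`, `retrieve_top_down`, and `retrieve_for_anchors` are hypothetical names, cosine similarity is assumed as the cross-modal matching score, and the query vectors stand in for visual features that already live in the same embedding space as the memory text.

```python
import math
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class Node:
    """A memory entry with links to the lower-level entries it summarizes."""
    text: str
    embedding: Vector
    children: List["Node"] = field(default_factory=list)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_down(query: Vector, top_level: List[Node],
                      k: int) -> List[List[Node]]:
    """Return up to k retrieved nodes per level, most abstract level first."""
    retrieved = []
    candidates = list(top_level)
    while candidates:
        ranked = sorted(candidates,
                        key=lambda n: cosine(query, n.embedding),
                        reverse=True)
        selected = ranked[:k]
        retrieved.append(selected)
        # Only the children of selected nodes are explored at the next level.
        candidates = [child for n in selected for child in n.children]
    return retrieved


def retrieve_for_anchors(anchor_queries: List[Vector], top_level: List[Node],
                         k: int) -> List[List[List[Node]]]:
    """Repeat the top-down read for every temporal anchor's query feature."""
    return [retrieve_top_down(q, top_level, k) for q in anchor_queries]
```

Restricting each lower-level search to the children of the nodes already selected keeps the number of similarity computations small compared with scanning the whole memory, at the cost of missing relevant leaves that sit under a parent that was not selected.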