Enhancing Long Video Understanding via Hierarchical Event-Based Memory

Dingxin Cheng, Mingda Li, Jingyu Liu, Yongxin Guo, Bin Jiang, Qingbin Liu, Xi Chen, Bo Zhao

TL;DR

A Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) is proposed for better understanding of long videos. It uses a novel adaptive sequence segmentation scheme to divide long videos into multiple events.

Abstract

Recently, integrating visual foundation models into large language models (LLMs) to form video understanding systems has attracted widespread attention. Most existing models compress the diverse semantic information of the whole video and feed it into LLMs for content comprehension. While this approach excels at short video understanding, its coarse compression can blend the information of multiple events in long videos, causing information redundancy. Consequently, the semantics of key events may be obscured by the sheer volume of information, which hinders the model's understanding capabilities. To address this issue, we propose a Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) for better understanding of long videos. First, we design a novel adaptive sequence segmentation scheme to divide long videos into multiple events. In this way, we can perform individual memory modeling for each event to establish intra-event contextual connections, thereby reducing information redundancy. Second, while modeling the current event, we compress and inject the information of the previous event to enhance long-term inter-event dependencies in videos. Finally, we perform extensive experiments on various video understanding tasks, and the results show that our model achieves state-of-the-art performance.

Paper Structure (22 sections, 12 equations, 8 figures, 8 tables, 2 algorithms)

Figures (8)

  • Figure 1: (a) Overview of HEM-LLM. We first sequentially sample video frames and perform adaptive sequence segmentation to divide them into individual events. Then, we introduce local memory and global memory to model the temporal context within each event (intra-event) and across events (inter-event). In this way, HEM-LLM can progressively mine video semantic information at multiple granularities to enhance its multimodal understanding capabilities. Finally, we employ Q-Former to integrate and compress the visual tokens of each event, then concatenate them to form event-based visual tokens that are fed into the LLM for text generation. (b) Adaptive Sequence Segmentation. To perform event-based adaptive segmentation effectively and efficiently, we proceed in three steps: (i) we pair each frame with its adjacent frame; (ii) we compute the token-level cosine similarity of each pair and select the $K-1$ pairs with the lowest similarity as segmentation points; (iii) we split the video at the segmentation points to form $K$ events.
  • Figure 2: Study of the number of event segments. $Mean$ denotes the average of Top-1 and Top-5 accuracy.
  • Figure 3: Two cases of event segmentation on the Breakfast dataset.
  • Figure 4: Qualitative analysis of video question answering on MovieChat-1K.
  • Figure 5: Study of the number of event segments. $Mean$ denotes the average of Top-1 and Top-5 accuracy.
  • ...and 3 more figures
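
The three segmentation steps in the Figure 1(b) caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `adjacent_similarities`, `segment_events`) are hypothetical, frames are assumed to be plain lists of token vectors, and averaging the per-token cosine similarities into one score per frame pair is an assumption, since the caption does not say how token-level similarities are aggregated.

```python
import math
from typing import List, Sequence

# Assumption: a frame is a list of token vectors, and every frame has the
# same number of tokens with the same dimensionality.
Frame = Sequence[Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adjacent_similarities(frames: Sequence[Frame]) -> List[float]:
    """Steps (i)-(ii): score each adjacent frame pair.

    Tokens at the same position are compared, and the per-token scores are
    averaged (the averaging is an assumption of this sketch).
    """
    sims = []
    for prev, curr in zip(frames, frames[1:]):
        token_sims = [cosine(p, c) for p, c in zip(prev, curr)]
        sims.append(sum(token_sims) / len(token_sims))
    return sims


def segment_events(frames: Sequence[Frame], k: int) -> List[List[int]]:
    """Step (iii): split the frame indices into k events.

    The k-1 adjacent pairs with the lowest similarity become boundaries.
    """
    n = len(frames)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of frames")
    sims = adjacent_similarities(frames)
    # Index i refers to the boundary between frame i and frame i + 1.
    lowest = sorted(range(len(sims)), key=lambda i: sims[i])[: k - 1]
    events, start = [], 0
    for i in sorted(lowest):
        events.append(list(range(start, i + 1)))
        start = i + 1
    events.append(list(range(start, n)))
    return events


if __name__ == "__main__":
    a = [[1.0, 0.0], [1.0, 0.0]]
    b = [[0.0, 1.0], [0.0, 1.0]]
    c = [[-1.0, 0.0], [-1.0, 0.0]]
    print(segment_events([a, a, b, b, c], k=3))
```

In the toy input, the frame content changes after the second and fourth frames, so those two boundaries have the lowest similarity and the five frames fall into three events. Selecting a fixed number of lowest-similarity boundaries, rather than thresholding, guarantees exactly $K$ events regardless of how similar the frames are overall.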