
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji

Abstract

Long video understanding is a key challenge that plagues the advancement of \emph{Multimodal Large Language Models} (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed \emph{Flexible Memory} (\textbf{FlexMem}). In principle, FlexMem aims to mimic how humans watch videos, \emph{i.e.}, continually watching the video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs understand videos of effectively unbounded length, unlike previous methods that process all video information at once and are bounded by an input upper limit. Concretely, FlexMem first treats the visual KV caches as the memory source and realizes effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for diverse video understanding tasks, including the popular streaming setting. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that, on \textbf{a single 3090 GPU}, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than \textbf{1k frames}, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, \emph{e.g.}, GPT-4o and Gemini-1.5 Pro, on some benchmarks.
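
To make the abstract's write/recall idea concrete, the following is a minimal, hypothetical sketch of such a memory loop in Python. The names (MemoryBank, compress_kv, recall) and the norm-based compression heuristic are illustrative assumptions, not the paper's implementation, which compresses real visual KV caches with its dual-pathway design and task-specific reading strategies.

```python
import torch

class MemoryBank:
    """Stores compressed per-clip KV memories and recalls the most relevant fragments."""
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)    # (n_kept, d)
        self.values.append(v)  # (n_kept, d)

    def recall(self, query: torch.Tensor, top_k: int = 2):
        # Score each stored fragment by the mean similarity between the question
        # embedding and its compressed key tokens, then return the top-k fragments.
        scores = torch.stack([(k @ query).mean() for k in self.keys])
        idx = scores.topk(min(top_k, len(self.keys))).indices
        return [(self.keys[i], self.values[i]) for i in idx]

def compress_kv(k: torch.Tensor, v: torch.Tensor, keep: int):
    # Stand-in for the dual-pathway compression: keep the tokens with the largest
    # key norm as a crude saliency proxy (the paper defines its own scores).
    idx = k.norm(dim=-1).topk(keep).indices
    return k[idx], v[idx]

# Iterate over clips of an arbitrarily long video instead of encoding everything at once.
bank = MemoryBank()
d, keep = 64, 16
question = torch.randn(d)                                       # stand-in question embedding
for clip_kv in (torch.randn(128, 2 * d) for _ in range(100)):   # 100 dummy clips
    k, v = clip_kv.split(d, dim=-1)                             # dummy per-clip KV cache
    bank.write(*compress_kv(k, v, keep))                        # memory writing
memories = bank.recall(question)                                # memory reading before decoding
```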

Paper Structure

This paper contains 17 sections, 13 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Comparison between FlexMem (ours) and existing efficient video understanding methods for MLLMs on five benchmarks. All methods are run on the same device, a single 3090 GPU, and FlexMem shows clear performance gains.
  • Figure 2: Illustration of the proposed FlexMem method. (a) FlexMem is an iterative method that encodes two types of compressed memories for each video clip $V_i$, namely Context Memory $C_i$ and Local Memory $M_i$, based on the aggregation score $S_i$ and the local saliency score $\hat{S}_i$, respectively. $M_i$ is then stored in the visual memory bank $M_{bank}$, while the context memory $C_i$ is used in the iterative encoding step for information propagation. In addition, some stored $M_l$ can be retrieved as long-term memory for encoding, though this step is optional, as is the text instruction $T_q$. (b) The stored memories $M_a$ are recalled from the memory bank for decoding the answer $Y$. (c) One intuitive and effective indexing scheme for FlexMem is the encoding-based one, which uses the cross-attention with $T_q$ during memory encoding (a) to reflect the relevance of memories. (d) We also investigate another, faster indexing method, termed MemIndex, which relies on compact index tensors for both the question and the visual memories and is computed independently of memory encoding. Its selection of cache layers and tokens stems from fitting the results of the encoding-based index. A toy sketch contrasting these two indexing routes is given after this figure list.
  • Figure 3: Qualitative evaluation of FlexMem. Input Video denotes the sampled frames, and Key Fragments are the clips selected for answer generation via the memory recall mechanism. These results demonstrate FlexMem's capacity for comprehensive and fine-grained visual understanding.
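
As a rough illustration of the two recall routes described in Figure 2 (c) and (d), the toy sketch below contrasts an encoding-based index built from cross-attention mass with a MemIndex-style lookup over compact index tensors. All names, shapes, and the dot-product scoring are assumptions for exposition; in the paper, the actual selection of cache layers and tokens for MemIndex comes from fitting the encoding-based index.

```python
import torch

def encoding_based_index(cross_attn: torch.Tensor, top_k: int) -> torch.Tensor:
    """cross_attn: (n_clips, n_question_tokens) attention mass gathered while each
    clip is encoded together with the instruction T_q (Figure 2 (c))."""
    relevance = cross_attn.sum(dim=-1)      # aggregate attention per clip
    return relevance.topk(top_k).indices    # indices of memories to recall

def memindex_lookup(question_idx: torch.Tensor, memory_idx: torch.Tensor,
                    top_k: int) -> torch.Tensor:
    """question_idx: (d,) compact index tensor of the question;
    memory_idx: (n_clips, d) compact index tensors of the stored memories.
    Runs independently of memory encoding, so re-querying is cheap (Figure 2 (d))."""
    relevance = memory_idx @ question_idx   # one similarity score per clip
    return relevance.topk(top_k).indices

# Toy usage with random stand-ins for the cached statistics.
n_clips, n_q, d = 40, 12, 64
print(encoding_based_index(torch.rand(n_clips, n_q), top_k=4))
print(memindex_lookup(torch.randn(d), torch.randn(n_clips, d), top_k=4))
```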