Hierarchical Memory for Long Video QA

Yiqin Wang; Haoji Zhang; Yansong Tang; Yong Liu; Jiashi Feng; Jifeng Dai; Xiaojie Jin

TL;DR

Long video QA incurs heavy memory use and latency because of dense visual tokens. The authors adopt STAR Memory, a hierarchical memory mechanism from Flash-VStream, to compress visual tokens across four levels (spatial, temporal, abstract, and retrieved) plus a feature buffer, enabling long-video processing under limited VRAM with a token budget $\mathrm{MAXSIZE} = (N_{spa} + N_{ret}) \times P_{spa}^2 + N_{tem} \times P_{tem}^2 + N_{abs} \times P_{abs}^2$. They fine-tune the pretrained Flash-VStream on the MovieChat-1K training set and incorporate ASR-transcribed audio as additional input to the LLM decoder. Experiments on MovieChat-1K show that stage-3 fine-tuning and audio integration substantially improve accuracy and scores, surpassing a training-free MovieChat baseline and achieving 1st place in the LOVEU Challenge Track 1. The work highlights the viability of hierarchical memory for long-form video QA and shows practical gains from fusing audio information with vision-language reasoning.
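The token budget above is plain arithmetic, so a short Python sketch may help make it concrete. It reads each $N$ as the number of feature maps held at a memory level and each $P$ as the side length of one map; that reading is an interpretation, since this summary does not define the symbols. The class name `StarMemoryConfig` and the numeric values are hypothetical and are not the settings used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StarMemoryConfig:
    """Hypothetical container for the symbols in the MAXSIZE formula."""
    n_spa: int  # N_spa: feature maps kept in spatial memory
    n_ret: int  # N_ret: feature maps kept in retrieved memory
    p_spa: int  # P_spa: side length shared by spatial and retrieved maps
    n_tem: int  # N_tem: feature maps kept in temporal memory
    p_tem: int  # P_tem: side length of a temporal map
    n_abs: int  # N_abs: feature maps kept in abstract memory
    p_abs: int  # P_abs: side length of an abstract map

    def max_size(self) -> int:
        """MAXSIZE = (N_spa + N_ret) * P_spa^2 + N_tem * P_tem^2 + N_abs * P_abs^2."""
        return (
            (self.n_spa + self.n_ret) * self.p_spa ** 2
            + self.n_tem * self.p_tem ** 2
            + self.n_abs * self.p_abs ** 2
        )


# Illustrative values only (assumed, not taken from the paper).
cfg = StarMemoryConfig(n_spa=1, n_ret=3, p_spa=8, n_tem=16, p_tem=4, n_abs=25, p_abs=1)
print(cfg.max_size())  # (1 + 3) * 64 + 16 * 16 + 25 * 1 = 537
```

Because every term is fixed by the configuration, the budget does not depend on the length of the video, which is what lets the memory stay bounded on long inputs.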

Abstract

This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency, while preserving the essential information for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that is capable of processing long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: https://invinciblewyq.github.io/vstream-page

Paper Structure (9 sections, 1 equation, 2 figures, 1 table)

Introduction
Method
Streaming visual encoder
Spatial-Temporal-Abstract-Retrieved memory
Real-time LLM decoder
Adopting automatic speech recognition (ASR)
Implementation details
Experiments
Conclusion

Figures (2)

Figure 1: Overview of the Flash-VStream framework that we adopted for real-time online video stream understanding. Flash-VStream is executed by two processes, namely the "frame handler" and the "question handler". The frame handler, which contains a visual encoder, a STAR memory and a feature buffer, is responsible for encoding frames and writing to memory. The question handler, which contains a projector and a Large Language Model, is responsible for reading from memory and answering questions at any time.
Figure 2: STAR memory writing mechanism. (a) Update spatial memory by a FIFO queue. (b) Update temporal memory by Weighted K-means Clustering. (c) Update abstract memory by Semantic Attention. (d) Update retrieved memory by key frame feature retrieval. Here feature map $e^T$ has multiple sizes. "S", "T", "A" and "R" represent tokens of spatial, temporal, abstract and retrieved memory, respectively.
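
As a rough illustration of panels (a) and (b) in Figure 2, the sketch below shows a FIFO queue update and a generic weighted K-means step in plain Python. It is not the Flash-VStream implementation: the function names, the deterministic initialisation from the first `k` points, the fixed iteration count, and the use of flat lists of floats as "features" are all assumptions made for the example.

```python
from collections import deque


def fifo_update(spatial_memory: deque, feature) -> None:
    """Append the newest feature; a deque with maxlen evicts the oldest one."""
    spatial_memory.append(feature)


def weighted_kmeans(points, weights, k, iters=10):
    """Generic weighted K-means on lists of floats (illustrative only).

    Returns the k centroids and the total weight assigned to each cluster.
    """
    centroids = [list(p) for p in points[:k]]  # assumed init: first k points
    dim = len(points[0])
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid becomes the weighted mean of its members.
        for j in range(k):
            members = [i for i, a in enumerate(assign) if a == j]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # keep an empty cluster's centroid unchanged
            centroids[j] = [
                sum(weights[i] * points[i][d] for i in members) / total
                for d in range(dim)
            ]
    cluster_weights = [
        sum(weights[i] for i, a in enumerate(assign) if a == j) for j in range(k)
    ]
    return centroids, cluster_weights


# Usage with toy data: a 2-slot spatial queue and four 1-D "features".
spatial = deque(maxlen=2)
for name in ["f1", "f2", "f3"]:
    fifo_update(spatial, name)

centroids, cluster_weights = weighted_kmeans(
    points=[[0.0], [0.2], [10.0], [10.4]], weights=[1, 3, 1, 1], k=2
)
```

In this toy run the queue keeps only the two most recent features, and the heavier point at 0.2 pulls its cluster centre toward itself, which is the effect that weighting is meant to have when older centroids stand in for many past frames.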
