Hierarchical Memory for Long Video QA
Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
TL;DR
Long video QA incurs heavy memory use and high latency because of dense visual tokens. The authors adopt STAR Memory, a hierarchical memory mechanism from Flash-VStream, which compresses visual tokens across four levels (spatial, temporal, abstract, and retrieved) plus a feature buffer. This enables long-video processing under limited VRAM with a fixed token budget $\mathrm{MAXSIZE} = (N_{\mathrm{spa}} + N_{\mathrm{ret}}) \times P_{\mathrm{spa}}^2 + N_{\mathrm{tem}} \times P_{\mathrm{tem}}^2 + N_{\mathrm{abs}} \times P_{\mathrm{abs}}^2$. They fine-tune the pretrained Flash-VStream on the MovieChat-1K training branch and incorporate ASR-transcribed audio as additional input to the LLM decoder. Experiments on MovieChat-1K show that stage-3 fine-tuning and audio integration substantially improve accuracy and scores, surpassing a training-free MovieChat baseline and achieving 1st place in the LOVEU Challenge Track 1. The work highlights the viability of hierarchical memory for long-form video QA and shows practical gains from fusing audio information with vision-language reasoning.
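As a rough illustration of the token budget above, the following Python sketch evaluates the MAXSIZE formula for one configuration. It assumes that each N counts the feature maps held by a memory level and each P is the per-side spatial size of those maps; this reading is not spelled out in the summary. The class name `StarMemoryConfig` and all numeric values are hypothetical placeholders chosen for the example, not settings confirmed by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StarMemoryConfig:
    """Hypothetical container for the symbols in the MAXSIZE formula."""
    n_spa: int  # N_spa: spatial memory entries (assumed meaning)
    n_ret: int  # N_ret: retrieved memory entries (assumed meaning)
    n_tem: int  # N_tem: temporal memory entries (assumed meaning)
    n_abs: int  # N_abs: abstract memory entries (assumed meaning)
    p_spa: int  # P_spa: per-side size of spatial/retrieved maps (assumed)
    p_tem: int  # P_tem: per-side size of temporal maps (assumed)
    p_abs: int  # P_abs: per-side size of abstract maps (assumed)

    def max_size(self) -> int:
        # MAXSIZE = (N_spa + N_ret) * P_spa^2 + N_tem * P_tem^2 + N_abs * P_abs^2
        return (
            (self.n_spa + self.n_ret) * self.p_spa ** 2
            + self.n_tem * self.p_tem ** 2
            + self.n_abs * self.p_abs ** 2
        )


# Illustrative values only; they are not taken from the summary.
cfg = StarMemoryConfig(n_spa=1, n_ret=3, n_tem=25, n_abs=25,
                       p_spa=8, p_tem=4, p_abs=1)
budget = cfg.max_size()
print(budget)  # (1 + 3) * 64 + 25 * 16 + 25 * 1 = 681
```

Because every term is fixed by the configuration, the budget does not grow with video length, which is what allows long videos to be processed within a fixed VRAM allowance.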
Abstract
This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency while preserving the information essential for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, which can process long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available on the project homepage: https://invinciblewyq.github.io/vstream-page
