VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma

TL;DR

Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.

Abstract

Video-based multimodal large language models (Video-LLMs) possess significant potential for video understanding tasks. However, most Video-LLMs treat videos as a sequential set of individual frames, which results in insufficient temporal-spatial interaction that hinders fine-grained comprehension and difficulty in processing longer videos due to limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing Q-Former and integrating temporal contexts into query embeddings with cross attention. Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.
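
The memory-cache idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: mean pooling stands in for the multiscale transformer, and the names `compress_video`, `clip_size`, and `memory_size` are hypothetical. It only shows the general pattern of processing a video clip by clip while a bounded cache of earlier clip summaries supplies long-term context.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def compress_video(frames: List[Vector], clip_size: int, memory_size: int) -> List[Vector]:
    """Return one compressed token per clip, mixed with cached long-term context.

    Assumption: mean pooling is a placeholder for a learned compressor, and the
    cache stores the short-term summary of each previous clip.
    """
    if clip_size < 1 or memory_size < 0:
        raise ValueError("clip_size must be >= 1 and memory_size must be >= 0")

    # Bounded cache: the oldest clip summary is evicted once the cache is full.
    memory: Deque[Vector] = deque(maxlen=memory_size)
    tokens: List[Vector] = []

    for start in range(0, len(frames), clip_size):
        clip = frames[start:start + clip_size]
        short_term = mean_pool(clip)  # summary of the current clip only

        if memory:
            long_term = mean_pool(list(memory))  # summary of cached clips
            token = mean_pool([short_term, long_term])
        else:
            token = short_term  # first clip, or caching disabled

        tokens.append(token)
        memory.append(short_term)

    return tokens


if __name__ == "__main__":
    # Six one-dimensional "frame features", two frames per clip, one cached clip.
    frames = [[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]]
    print(compress_video(frames, clip_size=2, memory_size=1))
```

In this sketch the number of output tokens depends only on the number of clips, and the cache size bounds how much history each token can draw on; these correspond loosely to the clip size and cached memory size varied in the ablation of Figure 4.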

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: An example of a badminton match video, where temporal reasoning is required to detect the event of the winner scoring a point by making a second edge ball, and single-frame fine-grained recognition is also needed to identify the specific score. Our proposed VidCompress, with a memory-aware dual-compressor architecture, is capable of performing both long-term and short-term temporal modeling to correctly answer the question.
  • Figure 2: The overall framework of our proposed VidCompress, following a dual-compressor architecture. The visual encoder extracts frame-level features that are fed into the memory-enhanced compressor and the text-perceived compressor to generate two types of visual tokens. The right part details the memory-enhanced compressor with the devised memory-cache strategy.
  • Figure 3: Chat examples of our VidCompress, with a DIY keychain video and a cooking video.
  • Figure 4: Ablation studies on (a) clip size and (b) cached memory size.