TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

Tingyu Qu, Mingxiao Li, Tinne Tuytelaars, Marie-Francine Moens

TL;DR

This work explores the limitations of existing compression strategies for building a training-free video LLM and develops TS-LLaVA, a method that constructs visual tokens through a Thumbnail-and-Sampling strategy. TS-LLaVA establishes new state-of-the-art performance among training-free video LLMs on various benchmarks.

Abstract

Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multimodal content. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs to video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage powerful pre-trained image LLMs. In this work, we explore the limitations of existing compression strategies for building a training-free video LLM. The findings lead to our method TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video, we select a few equidistant frames from all input frames to construct a Thumbnail image as a detailed visual cue, complemented by Sampled visual tokens from all input frames. Our method establishes new state-of-the-art performance among training-free video LLMs on various benchmarks. Notably, our 34B model outperforms GPT-4V on the MVBench benchmark and achieves performance comparable to the 72B training-based video LLM, Video-LLaMA2, on the challenging MLVU benchmark. Code is available at https://github.com/tingyu215/TS-LLaVA.
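
The Thumbnail step described in the abstract (pick a few equidistant frames, then compose them into one image) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `equidistant_indices` and `make_thumbnail`, the `rows` and `cols` grid layout, and the choice to tile frames without resizing are all hypothetical. The linked repository remains the reference for the actual procedure.

```python
import numpy as np


def equidistant_indices(num_frames: int, k: int) -> np.ndarray:
    """Return k frame indices spread evenly over [0, num_frames - 1]."""
    if not 1 <= k <= num_frames:
        raise ValueError("k must be between 1 and num_frames")
    return np.linspace(0, num_frames - 1, num=k).round().astype(int)


def make_thumbnail(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile rows * cols equidistant frames into a single grid image.

    frames has shape (T, H, W, C); the result has shape (rows * H, cols * W, C).
    The grid layout is an assumption made for this sketch.
    """
    idx = equidistant_indices(frames.shape[0], rows * cols)
    picked = frames[idx]  # (rows * cols, H, W, C), in temporal order
    _, h, w, c = picked.shape
    # Arrange frames row by row, then merge the grid axes into image axes.
    grid = picked.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)
```

For example, with 50 input frames and a hypothetical 3 x 2 layout, `make_thumbnail(frames, rows=3, cols=2)` selects six frames and returns one image three frames tall and two frames wide, which an image LLM's vision tower could then encode like any other image.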

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Illustration of a training-free video LLM. Vision Tower: the vision encoder and projection module of the image LLM.
  • Figure 2: Visual token compression strategies illustrated. Pooling and Sampling operate on encoded tokens, Grid operates on RGB images. We omit the encoding procedure for simplicity. We extend Grid to Grids by composing multiple grid view images.
  • Figure 3: Illustration of our TS-LLaVA. The vision tower includes the vision encoder and projection module of the image LLM. The dashed lines and solid lines trace the procedures for constructing the thumbnail image tokens and sampled image tokens, respectively. $V$ denotes the number of visual tokens from the vision tower, and $M$ is the pre-defined number of visual tokens. We omit the text input to the LLM for simplicity.
  • Figure 4: Design choices of TS-LLaVA. In (a), (b), and (c), IntentQA shows a pattern similar to NExT-QA; please refer to the Appendix.
  • Figure 5: Results from different ways of positioning visual tokens. Grid First: the thumbnail image tokens are prepended to sampled visual tokens.
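
The sampling and positioning steps named in the captions of Figures 3 and 5 can be sketched as below. This is an illustrative sketch only: the evenly spaced sampling rule, the `num_sampled` parameter, and the function names are assumptions made for this example, and the captions do not specify how the token budget $M$ is split between thumbnail and sampled tokens. The `thumbnail_first=True` setting corresponds to the "Grid First" ordering in Figure 5, where thumbnail image tokens are prepended to the sampled visual tokens.

```python
import numpy as np


def sample_tokens(frame_tokens: np.ndarray, num_sampled: int) -> np.ndarray:
    """Pick num_sampled tokens at evenly spaced positions across all frames.

    frame_tokens has shape (T, V, D): T frames, V tokens per frame, D dims.
    Evenly spaced sampling is an assumption of this sketch.
    """
    flat = frame_tokens.reshape(-1, frame_tokens.shape[-1])  # (T * V, D)
    if not 1 <= num_sampled <= flat.shape[0]:
        raise ValueError("num_sampled must be between 1 and T * V")
    idx = np.linspace(0, flat.shape[0] - 1, num=num_sampled).round().astype(int)
    return flat[idx]


def build_visual_tokens(
    thumb_tokens: np.ndarray,
    frame_tokens: np.ndarray,
    num_sampled: int,
    thumbnail_first: bool = True,
) -> np.ndarray:
    """Concatenate thumbnail tokens (V, D) with tokens sampled from all frames."""
    sampled = sample_tokens(frame_tokens, num_sampled)
    parts = [thumb_tokens, sampled] if thumbnail_first else [sampled, thumb_tokens]
    return np.concatenate(parts, axis=0)
```

Switching `thumbnail_first` to `False` places the sampled tokens before the thumbnail tokens, which is one way to reproduce the kind of ordering comparison that Figure 5 reports.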