MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding

Pengyi Li, Irina Abdullaeva, Alexander Gambashidze, Andrey Kuznetsov, Ivan Oseledets

TL;DR

This work tackles long-video understanding with Video Large Language Models (VLLMs) by addressing the shortcomings of uniform frame sampling. It introduces MaxInfo, a training-free frame-selection method based on the maximum volume principle, which combines dimensionality reduction with rectangular MaxVol optimization to pick a small, informative, and diverse set of frames for VLLM inference. The method comes in Fast, Slow, and Chunk-based variants and yields consistent gains across multiple benchmarks (e.g., LongVideoBench, EgoSchema) and model families, with minimal latency and memory overhead. The results suggest that information-aware frame sampling can improve long-video comprehension, and they motivate future exploration of scene-aware and training-time integration strategies for VLLMs.

Abstract

Modern Video Large Language Models (VLLMs) often rely on uniform frame sampling for video understanding, but this approach frequently fails to capture critical information because of frame redundancy and variations in video content. We propose MaxInfo, the first training-free method based on the maximum volume principle, which selects and retains the most representative frames from a video and is available in Fast, Slow, and Chunk-based versions. By maximizing the geometric volume formed by the selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, reducing redundancy while preserving diversity. This improves the quality of input representations and long-video comprehension performance across benchmarks. For instance, MaxInfo achieves a 3.28% improvement on LongVideoBench and a 6.4% improvement on EgoSchema for LLaVA-Video-7B. Moreover, MaxInfo boosts LongVideoBench performance by 3.47% on LLaVA-Video-72B and 3.44% on MiniCPM4.5. The approach is simple to implement and works with existing VLLMs without additional training and with very low latency overhead, making it a practical and effective alternative to traditional uniform sampling. Our code is available at https://github.com/FusionBrainLab/MaxInfo.git
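
To make the volume-maximization idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes frame embeddings are already available as a NumPy array, uses a truncated SVD as the dimensionality-reduction step, and substitutes a simple greedy selection (pick the row with the largest residual norm, then project its direction out) for the rectangular MaxVol optimization named in the abstract. The function name `select_frames_greedy_volume` and the `rank` parameter are hypothetical, and unlike rectangular MaxVol this greedy variant returns at most `rank` frames.

```python
import numpy as np


def select_frames_greedy_volume(embeddings, k, rank=8):
    """Greedy stand-in for max-volume frame selection (illustrative only).

    embeddings: array of shape (n_frames, dim), one embedding per frame.
    k:          maximum number of frames to keep.
    rank:       reduced dimension after truncated SVD (assumed choice).
    Returns a sorted list of selected frame indices.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    n = X.shape[0]

    # Step 1: dimensionality reduction with a truncated SVD.
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    r = min(rank, U.shape[1])
    Z = U[:, :r] * S[:r]  # reduced frame representations, shape (n, r)

    # Step 2: greedy volume growth. Each step adds the frame whose
    # component orthogonal to the already selected frames is largest,
    # which is the frame that enlarges the spanned volume the most.
    residual = Z.copy()
    selected = []
    for _ in range(min(k, r, n)):
        norms = np.sum(residual ** 2, axis=1)
        idx = int(np.argmax(norms))
        if norms[idx] < 1e-12:
            break  # remaining frames add no new direction
        selected.append(idx)
        q = residual[idx] / np.sqrt(norms[idx])
        # Remove the chosen direction from every frame's residual.
        residual = residual - np.outer(residual @ q, q)
    return sorted(selected)


# Toy input: eight "frames" that contain only three distinct embeddings.
frames = np.array([
    [1, 0, 0], [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [0, 0, 1],
], dtype=float)
picked = select_frames_greedy_volume(frames, k=3, rank=3)
print(picked)
```

On the toy input, near-duplicate frames contribute no residual once one copy has been selected, so the sketch keeps one frame from each distinct group. This is the redundancy-reduction behaviour the abstract describes, shown here on a deliberately simple case.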

Paper Structure

This paper contains 28 sections, 16 equations, 10 figures, 10 tables, 1 algorithm.

Figures (10)

  • Figure 1: Why uniform sampling fails to yield the correct answer on long videos, alongside an example of MaxInfo's sampling approach.
  • Figure 2: Overview of the MaxInfo Block integrated into a VLLM. We extract the most informative frames via the MaxInfo Block and then perform inference on the resulting subset of frames.
  • Figure 3: Effect of initial sampling on MaxInfo performance for the LLaVA-Video-7B model.
  • Figure 4: Effect of Initial Sampling on MaxInfo. Starting from $n^*$ sampled frames, the MaxInfo Block selects up to 64 informative frames for further processing.
  • Figure 5: Qualitative: (a) MaxInfo vs Uniform Sampling with GT-aligned frames; (b) CLIP scores show MaxInfo's answer coverage in single samples.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Definition 1