MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding

Pengyi Li; Irina Abdullaeva; Alexander Gambashidze; Andrey Kuznetsov; Ivan Oseledets

TL;DR

This work tackles long-video understanding with Video Large Language Models (VLLMs) by addressing the shortcomings of uniform frame sampling. It introduces MaxInfo, a training-free frame-selection method based on the maximum volume principle, which combines dimensionality reduction with rectangular MaxVol optimization to pick a small, informative, and diverse set of frames for VLLM inference. The method comes in Fast, Slow, and Chunk-based variants and yields consistent gains across multiple benchmarks (e.g., LongVideoBench, EgoSchema) and model families, with minimal latency and memory overhead. The results suggest that information-aware frame sampling can significantly improve long-video comprehension, and they motivate future exploration of scene-aware and training-time integration strategies for VLLMs.

Abstract

Modern Video Large Language Models (VLLMs) often rely on uniform frame sampling for video understanding, but this approach frequently fails to capture critical information because of frame redundancy and variations in video content. We propose MaxInfo, the first training-free method based on the maximum volume principle, which selects and retains the most representative frames from a video and is available in Fast, Slow, and Chunk-based versions. By maximizing the geometric volume formed by the selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, reducing redundancy while preserving diversity. This improves the quality of the input representations and raises long-video comprehension performance across benchmarks. For instance, MaxInfo achieves a 3.28% improvement on LongVideoBench and a 6.4% improvement on EgoSchema for LLaVA-Video-7B. Moreover, MaxInfo boosts LongVideoBench performance by 3.47% on LLaVA-Video-72B and 3.44% on MiniCPM4.5. The approach is simple to implement and works with existing VLLMs without additional training and with very low latency overhead, making it a practical and effective alternative to traditional uniform sampling. Our code is available at https://github.com/FusionBrainLab/MaxInfo.git
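To make the volume-maximization idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes frame embeddings are already available as a NumPy array, reduces them with a truncated SVD, and then uses a simple greedy pivoting heuristic as a stand-in for the rectangular MaxVol optimization named above. The function name `select_frames_greedy_volume` and the parameters `rank` and `tol` are hypothetical and chosen only for illustration.

```python
import numpy as np


def select_frames_greedy_volume(embeddings, k, rank=8, tol=1e-12):
    """Greedily pick up to k frame indices whose reduced embeddings span a large volume.

    Illustrative stand-in for MaxVol-style selection (an assumption, not the
    paper's algorithm). At most `rank` frames can be returned, because the
    residual vanishes once the reduced space is fully spanned.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    n = X.shape[0]

    # Dimensionality reduction: coordinates along the top singular directions.
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    r = min(rank, U.shape[1])
    residual = U[:, :r] * S[:r]

    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        # Squared distance of each frame from the span of the frames chosen so far.
        norms = np.einsum("ij,ij->i", residual, residual)
        norms[chosen] = -1.0
        i = int(np.argmax(norms))
        if norms[i] <= tol:
            break  # remaining frames are (numerically) redundant
        chosen[i] = True
        selected.append(i)
        # Remove the chosen direction from every frame (Gram-Schmidt step).
        q = residual[i] / np.sqrt(norms[i])
        residual = residual - np.outer(residual @ q, q)
    return sorted(selected)


if __name__ == "__main__":
    # Toy "video": three scenes, each repeated four times (exact duplicates).
    scenes = np.eye(3, 4)
    frames = np.repeat(scenes, 4, axis=0)
    print(select_frames_greedy_volume(frames, k=5))
```

In this toy input the duplicated frames contribute no additional volume, so the loop stops after one frame per scene even though five were requested. Uniform sampling with the same budget would instead spend several of its picks on repeats.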
