FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding

De-An Huang, Subhashree Radhakrishnan, Zhiding Yu, Jan Kautz

TL;DR

FRAG tackles long-input understanding by decoupling frame selection from generation: it independently scores sampled frames or pages for relevance to a query using zero-shot LMM prompts, then feeds the Top-$K$ selected frames into an answering LMM to produce the final answer. This two-step process avoids the computational burden of long-context LMMs and applies to both long videos and multi-page documents, as demonstrated across diverse datasets with two LMM families (LLaVA-OneVision and InternVL2). The approach yields state-of-the-art or near-state-of-the-art performance on five long-video benchmarks and three multi-page document benchmarks, including substantial gains on MP-DocVQA and competitive results against GPT-4o on long-video tasks. FRAG’s zero-shot frame selection, scalability across model sizes, and minimal prompt engineering suggest a practical, efficient path for long-context multimodal understanding, with room for better selection strategies and efficiency in future work.

Abstract

There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long-context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long-context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then generates the final outputs based only on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long-context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves performance and achieves state-of-the-art results for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves an improvement of over 20% on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: https://github.com/NVlabs/FRAG
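The select-then-generate pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the authors' repository: `uniform_sample_indices`, `select_top_k`, `frag`, `score_fn`, and `answer_fn` are hypothetical names, the scoring and answering LMMs are replaced by plain callables, and restoring the selected frames to their original order is an assumption rather than something the abstract specifies.

```python
from typing import Callable, List, Sequence


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples evenly spaced indices from range(num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of the num_samples equal-width bins.
    return [int(step * i + step / 2) for i in range(num_samples)]


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Assumption: selected frames are passed on in their original order.
    return sorted(ranked[:k])


def frag(
    frames: Sequence[str],
    query: str,
    score_fn: Callable[[str, str], float],  # hypothetical stand-in for the scoring LMM
    answer_fn: Callable[[List[str], str], object],  # hypothetical stand-in for the answering LMM
    num_samples: int,
    k: int,
):
    """Score each sampled frame independently, keep the Top-K, then answer."""
    sampled = uniform_sample_indices(len(frames), num_samples)
    # Each frame is scored on its own, so no long-context processing is needed.
    scores = [score_fn(frames[i], query) for i in sampled]
    kept = [sampled[j] for j in select_top_k(scores, k)]
    return answer_fn([frames[i] for i in kept], query)


def toy_score(frame: str, query: str) -> float:
    """Toy relevance score: number of query words that appear in the frame text."""
    return float(sum(word in frame.split() for word in query.split()))


def toy_answer(selected: List[str], query: str) -> List[str]:
    """Toy answerer: simply returns the frames it was given."""
    return selected


if __name__ == "__main__":
    video = ["street", "black backpack on bench", "tree",
             "person with black backpack", "sky", "car"]
    print(frag(video, "black backpack", toy_score, toy_answer, num_samples=6, k=2))
```

Here text strings stand in for frames so the sketch runs without any model; in practice `score_fn` and `answer_fn` would wrap calls to an LMM, and they need not be the same model.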

Paper Structure

This paper contains 23 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Questions about long inputs can often be answered without long context global processing. For the query about the black backpack in the video, one would focus on the frames with black backpacks. Similarly for the slides, one would just focus on the pages with pie charts if the question is about pie charts.
  • Figure 2: Overview of FRAG. FRAG first uses a scoring LMM to score each sampled frame in a video or document. The Top-K scoring frames are then selected to use as input to the answering LMM for answer generation. The scoring LMM and the answering LMM can be the same, but are not required to be the same. We find that existing LMMs can serve both purposes without any tuning.
  • Figure 3: Qualitative result for FRAG. The x-axis is the frame index, and the y-axis is the FRAG score. FRAG selects frames that are much more relevant to the query and successfully answers the question. Uniform sampling misses the important frames and cannot answer the question.
  • Figure 4: Results with different numbers of sampled frames. Performance peaks at 512 frames and regresses at 1024 frames for our setting. Oversampling can lead to concentrated Top-K frames with less diversity and hurt performance.
  • Figure 5: Results with different numbers of selected frames. This is also the number of input frames to the answering LMM. Performance peaks at 32 frames, which is consistent with findings in other works. We thus use 32 frames for our experiments.
  • ...and 2 more figures
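
The oversampling effect noted in the Figure 4 caption can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes a hypothetical relevance score shaped like a single Gaussian bump around one moment of the video (`peak` and `width` are made-up parameters), and measures how much of the timeline the Top-K frames cover as the number of sampled frames grows.

```python
import math


def top_k_span(num_samples: int, k: int, peak: float = 0.5, width: float = 0.05) -> float:
    """Fraction of the video timeline spanned by the Top-K sampled frames."""
    # Uniformly sampled timestamps in [0, 1).
    times = [(i + 0.5) / num_samples for i in range(num_samples)]
    # Hypothetical relevance: a smooth bump centred on `peak` (an assumption).
    scores = [math.exp(-((t - peak) ** 2) / (2 * width ** 2)) for t in times]
    top = sorted(range(num_samples), key=lambda i: scores[i], reverse=True)[:k]
    chosen = [times[i] for i in top]
    return max(chosen) - min(chosen)


if __name__ == "__main__":
    for n in (64, 512, 1024):
        print(n, round(top_k_span(n, k=32), 4))
```

Under this assumed score shape, the same Top-32 budget covers a progressively narrower slice of the video as sampling gets denser, which is one way the selected frames can lose diversity. Real relevance scores need not be unimodal, so this is only an intuition aid.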