M-LLM Based Video Frame Selection for Efficient Video Understanding

Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, Trishul Chilimbi

TL;DR

This work addresses the inefficiency and information loss of uniform frame sampling in video-LLMs by introducing a lightweight, M-LLM-based frame selector conditioned on the user's question. The selector scores frames adaptively and is trained with two pseudo-labeling strategies, spatial and temporal, so that frame relevance can be learned without extensive annotated data. It is plug-and-play for frozen downstream video-LLMs and yields consistent QA improvements on both medium- and long-context benchmarks while reducing the number of frames required. By focusing computation on the most informative frames, the approach supports accurate video understanding in resource-constrained settings.

Abstract

Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the model, particularly for long-context videos. However, uniform sampling can miss crucial context in certain periods of a video, leaving the downstream M-LLM without sufficient visual information to answer a question. To address this pain point, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames that are more relevant to users' queries. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which a single-frame importance score is obtained by prompting an M-LLM; and (ii) a temporal signal, in which multiple frames are selected by prompting a Large Language Model (LLM) with the captions of all frame candidates. The selected frames are then consumed by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) across medium-context (ActivityNet, NExT-QA) and long-context (EgoSchema, LongVideoBench) video question answering benchmarks.
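
The contrast between uniform sampling and query-conditioned selection can be illustrated with a minimal sketch. It assumes per-frame relevance scores are already available; in the paper these come from the learned M-LLM selector, whereas here `scores` is a hand-written, hypothetical list. The function names `uniform_sample` and `select_top_k` are illustrative, and the plain top-k rule is a simplification rather than the authors' exact selection procedure.

```python
def uniform_sample(num_frames, k):
    """Pick k frame indices evenly spaced across the video (query-agnostic)."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    # Take the midpoint of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]


def select_top_k(scores, k):
    """Pick the k highest-scoring frame indices, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical relevance scores for an 8-frame clip: the frames that
# matter for the question sit in a short burst around indices 3 and 4.
scores = [0.1, 0.0, 0.2, 0.9, 0.8, 0.1, 0.0, 0.7]

uniform_idx = uniform_sample(len(scores), 2)
selected_idx = select_top_k(scores, 2)
print("uniform: ", uniform_idx)
print("selected:", selected_idx)
```

With these made-up scores, uniform sampling lands on frames 2 and 6 and misses the relevant burst, while score-based selection returns frames 3 and 4. Either index list would then be passed to the frozen downstream video-LLM.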

Paper Structure

This paper contains 18 sections, 6 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: An example of our video frame selection for video QA. Compared to uniform sampling, our method has a higher hit rate.
  • Figure 2: An illustration of the conventional n-frame video M-LLM framework and our video M-LLM framework with frame selection.
  • Figure 3: An illustration of the spatial and temporal pseudo-labeling for the importance scores.
  • Figure 4: Visualization of the frame selection results.
  • Figure 5: One visualization example of the frame selection results.
  • ...and 3 more figures