CoS: Chain-of-Shot Prompting for Long Video Understanding

Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong

TL;DR

CoS tackles long-video understanding by addressing the shot-selection problem in multimodal LLMs with a training-free, test-time visual prompt optimization pipeline. It introduces a binary video summary via mosaicing to perform pseudo temporal grounding and builds task-relevant positive and irrelevant negative sub-shots for co-reasoning, all while adaptively weighting inputs with α to handle sparse information. The method fuses the original video, S^p, and S^n through a principled formula, enabling dynamic, per-video-instance prompt optimization. Extensive experiments across five datasets and multiple baselines demonstrate consistent gains, validating CoS as a practical approach to improve long-video reasoning in MLLMs.

Abstract

Multi-modal Large Language Models (MLLMs) struggle with long videos because they require an excessive number of visual tokens. These tokens massively exceed the context length of MLLMs, so the context ends up filled with redundant, task-irrelevant shots. How to select shots is a critical unsolved problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adapted to the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimise long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.

Paper Structure

This paper contains 13 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The effects of changing shot-sampling rates on video understanding task performance on videos of different lengths in the VideoMME [fu2024video] dataset. Two models are evaluated: LongVA [zhang2024long] and Video-XL [shu2024video]. As the number of sampled shots increases, performance does not consistently improve across video lengths. This is because sparse sampling may miss crucial details, while exhaustive sampling often overwhelms the model with irrelevant content. This illustrates the key challenge of optimal shot selection, especially in long video understanding: how to sample variable details so as to maximise the extraction of task-relevant semantic information whilst minimising distraction from irrelevant details (noise).
  • Figure 2: The critical problem of how to select shots in video understanding. In a video that depicts how a boy gradually gains a dragon's trust, different sampling methods create two distinct narratives: split video A shows the boy being attacked by the dragon, while split video B shows him happily sharing food with the dragon. This shows that minor differences in video sampling lead to significant variations in semantic understanding (interpretation).
  • Figure 3: The overall framework of CoS. It first utilises LLaVA to perform a mosaicing binary coding to bootstrap video summarisation for temporal grounding on a long video. Specifically, every four shots are aggregated into a mosaicing composition image. LLaVA determines whether task-related elements exist within each composition image by encoding a binary value of 1 or 0 ('yes' or 'no'), thereby identifying sparsely distributed task-related shots to achieve pseudo temporal grounding. Given this binary video summary, task-related positive shots $S^p$ and irrelevant negative shots $S^n$ are generated and represented by binary codes. $S^p$, $S^n$ and the original frame sequence $X$ sampled from original video $V$ are then fed into the MLLM for co-reasoning, minimising interference of irrelevant video content.
  • Figure 4: A qualitative evaluation example from the MLVU [zhou2024mlvu] dataset.
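
The binary video summary step described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the 2x2 tile layout, the toy frame representation (nested lists of pixel values), the zero-padding of an incomplete final group, and the `is_relevant` callback (a hypothetical stand-in for the yes/no answer LLaVA gives for each composition image) are all assumptions made for illustration. How $S^p$, $S^n$ and $X$ are then combined inside the MLLM is not shown.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[List[int]]  # toy grayscale frame: rows of pixel values
GROUP_SIZE = 4           # the caption states that every four shots form one mosaic


def make_mosaic(group: Sequence[Frame]) -> Frame:
    """Tile up to four equally sized frames into one 2x2 composition image.

    The 2x2 layout and the zero padding are assumptions of this sketch.
    """
    height, width = len(group[0]), len(group[0][0])
    blank = [[0] * width for _ in range(height)]
    tiles = list(group) + [blank] * (GROUP_SIZE - len(group))
    top = [left + right for left, right in zip(tiles[0], tiles[1])]
    bottom = [left + right for left, right in zip(tiles[2], tiles[3])]
    return top + bottom


def binary_summary(
    frames: Sequence[Frame],
    is_relevant: Callable[[Frame], bool],
) -> List[int]:
    """Return one binary code (1 = task-related, 0 = not) per mosaic.

    `is_relevant` is a hypothetical callback standing in for the
    yes/no judgement made on each composition image.
    """
    codes = []
    for start in range(0, len(frames), GROUP_SIZE):
        mosaic = make_mosaic(frames[start:start + GROUP_SIZE])
        codes.append(1 if is_relevant(mosaic) else 0)
    return codes


def split_shots(num_frames: int, codes: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Expand mosaic-level codes into positive and negative frame indices."""
    positive, negative = [], []
    for index in range(num_frames):
        if codes[index // GROUP_SIZE]:
            positive.append(index)
        else:
            negative.append(index)
    return positive, negative
```

In this sketch, frames whose mosaic is coded 1 become candidates for the positive shots $S^p$ and the rest become candidates for the negative shots $S^n$. Working at mosaic granularity means one relevance query covers four shots, at the cost of coarser temporal grounding.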