Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Harsh Lunia

TL;DR

The paper investigates whether Cola, in which an LLM coordinates multiple VLMs, can recognize actions in video from sparse temporal information. Applied to surveillance-style SPHAR videos, the LLM synthesizes diverse VLM outputs from up to 10 keyframes to infer the action, achieving results that surpass a baseline ensemble and remain competitive despite the limited number of frames. The approach reveals both the promise and the current limits of video understanding from weak temporal signals with LLM-coordinated VLMs, and highlights the need for stronger temporal cues or more frames to reach robust performance. The work contributes a concrete pipeline for cross-modal reasoning in video, shows how to train the coordinating LLM with a structured template, and analyzes error modes to guide future enhancements. The ensemble combination underpinning the coordination strategy is formalized as $P(v,q) = \frac{1}{n}\sum_{i=1}^{n} P_i(v,q)$.
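As a rough illustration of the ensemble formula, the sketch below averages per-model answer distributions. It assumes that $P_i(v,q)$ is the probability distribution over candidate actions produced by the i-th VLM for visual input $v$ and question $q$; the summary does not define these symbols, so this reading, the function name `ensemble_average`, and the dictionary layout are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Dict, List


def ensemble_average(per_model_probs: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute P(v, q) = (1/n) * sum_i P_i(v, q) over n model distributions.

    Each dict maps a candidate action label to the probability one model
    assigns to it (hypothetical layout). A label missing from a model's
    dict is treated as probability 0.0 for that model.
    """
    if not per_model_probs:
        raise ValueError("need at least one model distribution")
    n = len(per_model_probs)
    # Collect every label mentioned by any model, in a stable order.
    labels = sorted(set().union(*per_model_probs))
    return {
        label: sum(probs.get(label, 0.0) for probs in per_model_probs) / n
        for label in labels
    }
```

For two models that assign `{"walking": 0.5, "running": 0.5}` and `{"walking": 0.25, "running": 0.75}`, the averaged distribution favors "running"; taking the label with the highest averaged probability would give the ensemble's prediction under this reading.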

Abstract

Recent advancements have introduced multiple vision-language models (VLMs) demonstrating impressive commonsense reasoning across various domains. Despite their individual capabilities, the potential of synergizing these complementary VLMs remains underexplored. The Cola Framework addresses this by showcasing how a large language model (LLM) can efficiently coordinate multiple VLMs through natural language communication, leveraging their distinct strengths. We have verified this claim on the challenging A-OKVQA dataset, confirming the effectiveness of such coordination. Building on this, our study investigates whether the same methodology can be applied to surveillance videos for action recognition. Specifically, we explore whether leveraging the combined knowledge base of the VLMs and an LLM can effectively deduce actions from a video when presented with only a few selectively important frames and minimal temporal information. Our experiments demonstrate that an LLM, when coordinating different VLMs, can successfully recognize patterns and deduce actions in various scenarios despite the weak temporal signals. However, our findings suggest that, to make this approach a viable alternative solution, integrating a stronger temporal signal and exposing the models to slightly more frames would be beneficial.

Paper Structure

This paper contains 11 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Cola coordinates multiple pretrained VLMs based on the visual context and plausible answers they provide.
  • Figure 2: LM prompt template. The LM is instructed to coordinate VLMs. Each question set defines visual context, question (and choices), and plausible answers.
  • Figure 3: High-level architecture of the keyframe selection.
  • Figure 4: Custom template construction: Outputs from the queried VLMs for each keyframe are collated and used to create a template for LLM training against the correct target action name.
  • Figure 5: Comparison between Paradigms
  • ...and 1 more figure
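
The template construction described in the captions for Figures 2 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's actual template: the prompt wording, the field names `caption` and `answer`, the function `build_training_example`, and the choice to truncate at `max_frames` are all hypothetical. Only the general idea (collate per-keyframe VLM outputs, add the question and choices, pair the result with the correct target action, use at most 10 keyframes) comes from the text above.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: for one keyframe, VLM name -> {"caption": ..., "answer": ...}
FrameOutputs = Dict[str, Dict[str, str]]


def build_training_example(
    frames: Sequence[FrameOutputs],
    question: str,
    choices: Sequence[str],
    target_action: str,
    max_frames: int = 10,
) -> Tuple[str, str]:
    """Collate per-keyframe VLM outputs into one prompt and pair it with the target."""
    lines: List[str] = []
    # Keep at most max_frames keyframes (assumed policy: drop the extras).
    for idx, outputs in enumerate(frames[:max_frames], start=1):
        lines.append(f"Keyframe {idx}:")
        # Sort VLM names so the prompt is deterministic across runs.
        for vlm_name in sorted(outputs):
            out = outputs[vlm_name]
            lines.append(f"  {vlm_name} caption: {out['caption']}")
            lines.append(f"  {vlm_name} answer: {out['answer']}")
    lines.append(f"Question: {question}")
    lines.append("Choices: " + ", ".join(choices))
    lines.append("Answer:")
    return "\n".join(lines), target_action
```

The returned pair is the kind of (input text, target text) example a sequence-to-sequence LLM could be fine-tuned on; how the paper actually formats and orders these fields is not stated in this summary.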