Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators
Harsh Lunia
TL;DR
The paper investigates whether Cola, a framework in which an LLM coordinates multiple VLMs, can be used for video action recognition from sparse temporal information. Applied to surveillance-style SPHAR videos, the approach shows that an LLM can synthesize diverse VLM outputs from up to 10 keyframes to infer actions, achieving results that surpass a baseline ensemble and remain competitive despite the limited number of frames. The results reveal both the promise and the current limits of video understanding with LLM-coordinated VLMs under weak temporal signals, and point to the need for stronger temporal cues or more frames to reach robust performance. The work contributes a concrete pipeline for cross-modal reasoning in video, shows how to train the coordinating LLM with a structured template, and analyzes error modes to guide future improvements. The ensemble combination underpinning the coordination strategy is formalized as $P(v,q) = \frac{1}{n}\sum_{i=1}^{n} P_i(v,q)$, the average of the $n$ individual predictions $P_i(v,q)$.
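The averaging formula $P(v,q) = \frac{1}{n}\sum_{i=1}^{n} P_i(v,q)$ can be illustrated with a minimal sketch. It assumes each model returns a dictionary mapping action labels to probabilities; the function names, the label names, and the choice to treat a label a model omits as probability 0 are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def ensemble_average(per_model_probs):
    """Average per-model distributions: P(v, q) = (1/n) * sum_i P_i(v, q).

    per_model_probs: list of dicts, one per model, mapping label -> probability.
    A label missing from a model's dict contributes 0 for that model (assumption).
    """
    if not per_model_probs:
        raise ValueError("need at least one model distribution")
    n = len(per_model_probs)
    totals = defaultdict(float)
    for probs in per_model_probs:
        for label, p in probs.items():
            totals[label] += p
    # Divide by the number of models, not by the number of models that voted.
    return {label: total / n for label, total in totals.items()}


def predict(per_model_probs):
    """Return the label with the highest averaged probability."""
    averaged = ensemble_average(per_model_probs)
    return max(averaged, key=averaged.get)


if __name__ == "__main__":
    # Hypothetical outputs from two models for one (video, question) pair.
    outputs = [
        {"walking": 0.6, "running": 0.4},
        {"walking": 0.2, "running": 0.8},
    ]
    print(ensemble_average(outputs))
    print(predict(outputs))
```

A plain average like this weights every model equally, which is what makes it a natural baseline to compare against a coordinator that can weigh the models' outputs differently per input.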
Abstract
Recent advancements have introduced multiple vision-language models (VLMs) that demonstrate impressive commonsense reasoning across various domains. Despite their individual capabilities, the potential of combining these complementary VLMs remains underexplored. The Cola framework addresses this by showing how a large language model (LLM) can efficiently coordinate multiple VLMs through natural language communication, leveraging their distinct strengths. We have verified this claim on the challenging A-OKVQA dataset, confirming the effectiveness of such coordination. Building on this, our study investigates whether the same methodology can be applied to surveillance videos for action recognition. Specifically, we explore whether the combined knowledge of the VLMs and the LLM is sufficient to deduce the action in a video when presented with only a few selectively important frames and minimal temporal information. Our experiments demonstrate that an LLM, when coordinating different VLMs, can successfully recognize patterns and deduce actions in various scenarios despite the weak temporal signal. However, our findings suggest that, for this approach to become a viable alternative, it would benefit from a stronger temporal signal and from exposing the models to slightly more frames.
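As a rough illustration of the setup described above, the sketch below picks a small number of frames from a video and assembles the VLM outputs for those frames into a single text prompt for the coordinating LLM. Everything here is an assumption made for illustration: uniform sampling stands in for whatever frame-selection method the study uses, and the function names, the prompt layout, and the placeholder model names `vlm_a` and `vlm_b` are hypothetical.

```python
def sample_keyframe_indices(num_frames, max_keyframes=10):
    """Pick up to max_keyframes evenly spaced frame indices (uniform sampling
    is an assumed stand-in for the study's actual keyframe selection)."""
    if num_frames <= 0:
        return []
    k = min(num_frames, max_keyframes)
    if k == 1:
        return [0]
    # Integer arithmetic keeps indices exact and strictly increasing.
    return [(i * (num_frames - 1)) // (k - 1) for i in range(k)]


def build_coordinator_prompt(question, choices, vlm_outputs):
    """Assemble a text prompt from per-frame VLM answers.

    vlm_outputs: dict mapping frame index -> {model name: answer text}.
    The layout is hypothetical, not the template used in the paper.
    """
    lines = [f"Question: {question}", "Choices: " + ", ".join(choices)]
    # Listing frames in order is the only temporal cue the LLM receives here.
    for frame_idx in sorted(vlm_outputs):
        for model_name, answer in sorted(vlm_outputs[frame_idx].items()):
            lines.append(f"Frame {frame_idx} | {model_name}: {answer}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    indices = sample_keyframe_indices(num_frames=100)
    print(indices)
    # Placeholder answers; real ones would come from querying each VLM.
    outputs = {idx: {"vlm_a": "a person walking", "vlm_b": "someone walks"}
               for idx in indices[:2]}
    print(build_coordinator_prompt(
        "What action is the person performing?",
        ["walking", "running", "sitting"],
        outputs,
    ))
```

In this sketch the frame order in the prompt carries the only temporal information, which mirrors the weak-temporal-signal setting the abstract describes.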
