VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad Mahmood, Ashmal Vayani, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan
TL;DR
VURF addresses the lack of a general-purpose reasoning framework for video understanding by leveraging LLMs to generate executable visual programs that decompose complex video queries into sub-tasks handled by off-the-shelf vision models. It introduces a GPT-3.5–driven feedback loop for correcting invalid functions and an auto self-refinement process to iteratively improve in-context examples, enhancing robustness to contextual cues. The framework demonstrates effectiveness across video tasks including Visual Question Answering, video anticipation, pose estimation, and multi-video VQA, with empirical gains over zero-shot baselines and ablative analyses confirming the value of self-refinement and error correction. By enabling a plug-and-play, interpretable reasoning pipeline that can incorporate diverse vision modules, VURF offers a scalable approach to complex video understanding with potential for continuous self-improvement through refined in-context demonstrations.
Abstract
Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can decompose complex tasks into more manageable sub-tasks, particularly for visual reasoning over images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach that extends the utility of LLMs to video tasks, leveraging their capacity to generalize from a small number of input-output demonstrations provided in context. We harness this in-context learning capability by presenting LLMs with pairs of instructions and their corresponding high-level programs, from which they generate executable visual programs for video understanding. To enhance the accuracy and robustness of the generated programs, we implement two strategies. First, we employ a feedback-generation approach, powered by GPT-3.5, to rectify errors in programs that use unsupported functions. Second, taking motivation from recent work on self-refinement of LLM outputs, we introduce an iterative procedure that improves the quality of the in-context examples by aligning the initial outputs with the outputs that would have been generated had the LLM not been bound by the structure of the in-context examples. Our results on several video-specific tasks, including visual QA, video anticipation, pose estimation, and multi-video QA, illustrate the efficacy of these enhancements in improving the performance of visual programming approaches for video tasks.
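To make the program-generation and error-correction steps concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that generated programs are sequences of Python-syntax function calls, that the supported module names (`FIND`, `TRACK`, `CAPTION`, `VQA`, `RESULT`) are hypothetical placeholders, and that `fix_fn` is a stand-in for whatever LLM call (GPT-3.5 in the paper) rewrites a program given textual feedback.

```python
import ast

# Hypothetical module names for illustration; the real function set is not listed here.
SUPPORTED = {"FIND", "TRACK", "CAPTION", "VQA", "RESULT"}


def build_prompt(examples, instruction):
    """Format (instruction, program) pairs as in-context demonstrations,
    followed by the new instruction whose program the LLM should complete."""
    parts = [f"Instruction: {q}\nProgram:\n{p}" for q, p in examples]
    parts.append(f"Instruction: {instruction}\nProgram:")
    return "\n\n".join(parts)


def unsupported_calls(program, supported=SUPPORTED):
    """Return the sorted names of called functions that are not supported.
    Assumes the program parses as Python source."""
    tree = ast.parse(program)
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return sorted(called - set(supported))


def correct_program(program, fix_fn, supported=SUPPORTED, max_rounds=3):
    """Repeatedly ask fix_fn (a stand-in for the LLM) to rewrite the program
    until it calls only supported functions or max_rounds is exhausted."""
    for _ in range(max_rounds):
        bad = unsupported_calls(program, supported)
        if not bad:
            return program
        feedback = (
            f"The program calls unsupported functions: {', '.join(bad)}. "
            f"Rewrite it using only: {', '.join(sorted(supported))}."
        )
        program = fix_fn(program, feedback)
    return program
```

In practice `fix_fn` would send the faulty program and the feedback string to the LLM and return its revised program; a deterministic stub can be substituted to exercise the loop without model access.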
