ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task

Ahmad Khalil, Mahmoud Khalil, Alioune Ngom

TL;DR

ResNetVLLM tackles zero-shot video understanding by training a plain, randomly initialized ResNet visual encoder jointly with an LLM instead of relying on pre-trained video encoders. The model samples frames from the video, extracts per-frame features with a 2D ResNet, and merges them with a language model via a unified transformer, so visual and semantic representations are learned end to end. It reports state-of-the-art zero-shot results on MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA, outperforming prior VideoLLMs and CLIP-based approaches. The work demonstrates that a simple, end-to-end cross-modal architecture with two-stage training and a randomly initialized visual encoder can yield robust video-language understanding with efficient computation and reduced dependence on pre-trained video models. This has practical implications for scalable video description, question answering, and instruction following in video contexts.

Abstract

In this paper, we introduce ResNetVLLM (ResNet Vision LLM), a novel cross-modal framework for zero-shot video understanding that integrates a ResNet-based visual encoder with a Large Language Model (LLM). ResNetVLLM addresses the challenges associated with zero-shot video models by avoiding reliance on pre-trained video understanding models and instead employing a non-pretrained ResNet to extract visual features. This design ensures the model learns visual and semantic representations within a unified architecture, enhancing its ability to generate accurate and contextually relevant textual descriptions from video inputs. Our experimental results demonstrate that ResNetVLLM achieves state-of-the-art performance in zero-shot video understanding (ZSVU) on several benchmarks, including MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA.
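
The first step of the pipeline, sampling frames before they are passed to the 2D ResNet, can be illustrated with a minimal sketch. The summary does not say which sampling strategy the authors use, so uniform sampling is assumed here; the function name `sample_frame_indices` and its parameters are hypothetical and do not come from the paper.

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly across a video.

    Assumption: uniform sampling. The paper summary only says that
    frames are sampled, not how.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")

    # A short video has fewer frames than requested: use every frame.
    if num_samples >= num_frames:
        return list(range(num_frames))

    # Split the video into equal segments and take the middle of each one.
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 100-frame clip reduced to 4 frames for the visual encoder.
    print(sample_frame_indices(100, 4))
```

Each selected frame would then be encoded independently by the 2D ResNet. Taking the segment midpoint rather than the segment start avoids always including the very first frame, which is often a fade-in or title card.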

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of ResNetVLLM.
  • Figure 2: Overall architecture of the proposed ResNetVLLM framework.
  • Figure 3: Training Phase of ResNetVLLM.
  • Figure 4: Sample Output of ResNetVLLM.