Exploring Efficient Foundational Multi-modal Models for Video Summarization

Karan Samel, Apoorva Beedu, Nitish Sontakke, Irfan Essa

TL;DR

The paper proposes a plug-and-play video language model that feeds the text generated from each input modality directly into the language model. This avoids the overhead of pre-training alignment and relies on few-shot instruction adaptation instead of fine-tuning.

Abstract

Foundational models can generate text outputs given prompt instructions and text, audio, or image inputs. Recently these models have been combined to perform tasks on video, such as video summarization. Such video foundation models are pre-trained by aligning the outputs of each modality-specific model into a shared embedding space. The embeddings from each model are then passed to a language model, which is fine-tuned on a desired instruction set. Aligning each modality during pre-training is computationally expensive and prevents rapid testing of different base modality models. After fine-tuning, evaluation is carried out on in-domain videos, which makes it hard to assess the generalizability and data efficiency of these methods. To alleviate these issues we propose a plug-and-play video language model. It feeds the text generated from each input modality directly into the language model, avoiding pre-training alignment overhead. Instead of fine-tuning, we leverage few-shot instruction adaptation strategies. We compare performance against computational cost for our plug-and-play method and for baseline tuning methods. Finally, we explore the generalizability of each method under domain shift and present insights on what data is useful when training data is limited. Through this analysis, we present practical insights on how to leverage multi-modal foundational models for effective results given realistic compute and data limitations.

Paper Structure

This paper contains 24 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: In a) video temporal representations are learned through inferring frame and caption data in videos. Instead of a single model to infer each modality of a video, in b) each modality is handled by a foundational model. The embeddings are aligned and passed as tokens into an LLM for fine-tuning. We explore the setting in c) where we use the text outputs of each modality to quickly leverage new models and adapt to new domains.
  • Figure 2: An overview of our model architecture. Each video modality input source generates a text output that is used to generate a prompt fed directly into a language model for instruction following.
  • Figure 3: YouCook random sampling performance across different data modality inputs and model sizes.
  • Figure 4: METEOR performance gains using greedy search over random few-shot sampling.
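
The plug-and-play setup described in the abstract and in Figure 2 can be illustrated with a minimal sketch, assuming each modality-specific model has already produced plain text. Everything named here (`VideoTexts`, `format_video`, `build_prompt`, the prompt wording, and the field names) is hypothetical and chosen for illustration; it is not the authors' implementation. The sketch draws few-shot examples at random, in the spirit of the random sampling baseline in Figure 3, and leaves the final summary slot open for a language model to complete.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoTexts:
    """Text emitted by each modality model for one video (hypothetical fields)."""
    frame_captions: List[str]       # e.g. output of an image captioning model
    audio_transcript: str           # e.g. output of a speech recognition model
    summary: Optional[str] = None   # reference summary, set only for few-shot examples


def format_video(video: VideoTexts) -> str:
    """Flatten the per-modality text into one block of prompt text."""
    captions = " ".join(video.frame_captions)
    return (
        f"Frame captions: {captions}\n"
        f"Audio transcript: {video.audio_transcript}"
    )


def build_prompt(query: VideoTexts, examples: List[VideoTexts],
                 k: int = 2, seed: int = 0) -> str:
    """Assemble an instruction prompt with k randomly sampled few-shot examples."""
    rng = random.Random(seed)  # seeded so the sampled shots are reproducible
    shots = rng.sample(examples, min(k, len(examples)))
    parts = ["Describe the video below in one or two sentences."]
    for shot in shots:
        parts.append(f"{format_video(shot)}\nSummary: {shot.summary}")
    # The query video ends with an open slot for the language model to fill.
    parts.append(f"{format_video(query)}\nSummary:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    pool = [
        VideoTexts(["a pan on a stove"], "heat the oil first", "Heating oil in a pan."),
        VideoTexts(["chopped onions"], "dice the onions", "Dicing onions."),
        VideoTexts(["a pot of water"], "bring it to a boil", "Boiling water."),
    ]
    query = VideoTexts(["eggs in a bowl"], "whisk the eggs")
    print(build_prompt(query, pool, k=2))
```

Because only text crosses the boundary between the modality models and the language model, swapping a captioning or transcription model in this sketch changes the contents of `VideoTexts` and nothing else, which is the property the abstract contrasts with embedding alignment.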