Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Kirolos Ataallah; Xiaoqian Shen; Eslam Abdelrahman; Essam Sleiman; Mingchen Zhuge; Jian Ding; Deyao Zhu; Jürgen Schmidhuber; Mohamed Elhoseiny

TL;DR

Goldfish tackles the challenge of understanding arbitrarily long videos by introducing a retrieval-based framework that first extracts top-k relevant clips via a Video Descriptor (powered by MiniGPT4-Video) before answering queries. A retrieval module aligns query embeddings with clip descriptions and subtitles to obtain concise context, which is then used by an answer module to generate responses. The authors also propose TVQA-long, a long-video benchmark derived from TVQA, and demonstrate state-of-the-art performance on both long- and short-video benchmarks, including strong zero-shot results when using vision plus subtitles. The approach shows robust long-video understanding with scalable efficiency and is complemented by extensive ablations and qualitative analyses.

Abstract

Most current LLM-based models for video understanding can process videos within minutes. However, they struggle with lengthy videos due to challenges such as "noise and redundancy", as well as "memory and computation" constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models' capabilities in understanding long videos with questions about both vision and text content. Goldfish approaches these challenges with an efficient retrieval mechanism that first gathers the top-k video clips relevant to the instruction before providing the desired response. This retrieval design enables Goldfish to efficiently process arbitrarily long video sequences, facilitating its application in contexts such as movies or television series. To facilitate the retrieval process, we developed MiniGPT4-Video, which generates detailed descriptions for the video clips. To address the scarcity of benchmarks for long video evaluation, we adapted the TVQA short video benchmark for extended content analysis by aggregating questions from entire episodes, thereby shifting the evaluation from partial to full episode comprehension. We attained a 41.78% accuracy rate on the TVQA-long benchmark, surpassing previous methods by 14.94%. Our MiniGPT4-Video also shows exceptional performance in short video comprehension, exceeding existing state-of-the-art methods by 3.23%, 2.03%, 16.5% and 23.59% on the MSVD, MSRVTT, TGIF, and TVQA short video benchmarks, respectively. These results indicate that our models achieve significant improvements in both long- and short-video understanding. Our models and code have been made publicly available at https://vision-cair.github.io/Goldfish_website/
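
The top-k retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that each clip (its generated description and subtitles) and the user query have already been embedded as fixed-size vectors, and that relevance is scored by cosine similarity. The function name `top_k_clips`, the toy 3-dimensional embeddings, and the choice of cosine similarity are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def top_k_clips(query_vec, clip_vecs, k=3):
    """Return (indices, scores) of the k clips most similar to the query.

    Hypothetical helper: similarity is cosine similarity between the query
    embedding and one embedding per clip (rows of clip_vecs).
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = clip_vecs / np.linalg.norm(clip_vecs, axis=1, keepdims=True)
    scores = c @ q                      # one cosine score per clip
    k = min(k, len(scores))             # never request more clips than exist
    order = np.argsort(-scores)[:k]     # highest score first
    return order.tolist(), scores[order].tolist()


# Toy embeddings standing in for the encoded clip descriptions (assumed data).
clip_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.2, 0.0])

indices, scores = top_k_clips(query_vec, clip_vecs, k=2)
print(indices)
```

Only the retrieved clips' information would then be passed to the answer module, so the cost of answering depends on k and not on the total length of the video.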

Paper Structure (35 sections, 12 figures, 9 tables)

Introduction
Related Work
LLM-Based Short Video Understanding
LLM-Based Long Video Understanding
Retrieval Systems
Goldfish
Retrieval-based Long Video Understanding
Training Pipeline
Experiments
Datasets
Training Datasets
Short Benchmarks
Long Benchmarks
Evaluation Metrics
Ablation Studies
...and 20 more sections

Figures (12)

Figure 1: Goldfish model: A long-video model capable of handling lengthy videos by filtering out noisy information and focusing on the most relevant content to accurately answer questions.
Figure 1: Ablation study of the retrieval inputs. The reported numbers are retrieval accuracies on TVQA-Long, TVR-Text, and TVR-Vision. "&" means "and", while "|" indicates "or".
Figure 2: Goldfish framework. First, the long video is broken down into clips, which the Video Descriptor encodes according to their timing and corresponding subtitles. The user query is then encoded, and the retrieval module retrieves the most related clips. Finally, the top-K clips' information is sent to the answer module to get the final answer.
Figure 3: MiniGPT4-Video architecture: For each frame, we use EVA-CLIP to obtain the visual tokens, concatenate adjacent visual tokens into a single token, and convert these tokens to the language-model space using a linear layer; the language tokens are obtained from the LLM tokenizer. The visual and subtitle text tokens are concatenated for every sampled frame, and the instruction tokens are appended at the end of the input sequence.
Figure 4: Ablation study of the impact of video length on 5% of the TVQA validation set; video length is given in minutes.
...and 7 more figures
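
The token handling described in the Figure 3 caption (merging adjacent visual tokens, projecting them with a linear layer, then concatenating them with subtitle tokens) can be sketched as follows. The group size of 4, all dimensions, the random weights, and the helper names `merge_adjacent_tokens` and `project` are assumptions chosen for illustration; the caption does not specify them.

```python
import numpy as np


def merge_adjacent_tokens(frame_tokens, group=4):
    """Concatenate every `group` adjacent tokens into one wider token.

    frame_tokens has shape (n_tokens, dim); the result has shape
    (n_tokens // group, group * dim). The group size is an assumption.
    """
    n, d = frame_tokens.shape
    if n % group != 0:
        raise ValueError("token count must be divisible by the group size")
    return frame_tokens.reshape(n // group, group * d)


def project(tokens, weight, bias):
    """Linear layer mapping merged visual tokens into the LLM embedding space."""
    return tokens @ weight + bias


rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 6))            # 8 visual tokens of dim 6 for one frame
merged = merge_adjacent_tokens(visual, 4)   # 2 tokens of dim 24
W = rng.normal(size=(24, 10))               # stand-in projection weights
b = np.zeros(10)
visual_llm = project(merged, W, b)          # 2 tokens in a 10-dim "LLM space"

subtitle = rng.normal(size=(3, 10))         # stand-in subtitle token embeddings
frame_sequence = np.concatenate([visual_llm, subtitle], axis=0)
print(frame_sequence.shape)
```

Merging tokens this way shortens the per-frame sequence by the group factor, which leaves more of the language model's context window for additional frames.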
