FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering

Zheng Cheng, Rendong Wang, Zhicheng Wang

TL;DR

FocusChat is a text-guided multi-modal large language model (LLM) that emphasizes visual information correlated with the user's prompt. It significantly outperforms Video-LLaMA in zero-shot experiments while using an order of magnitude less training data and only 16 visual tokens.

Abstract

Recently, multi-modal large language models have made significant progress. However, visual information that is not guided by the user's intention may lead to redundant computation and introduce unnecessary visual noise, especially in long, untrimmed videos. To address this issue, we propose FocusChat, a text-guided multi-modal large language model (LLM) that emphasizes visual information correlated with the user's prompt. In detail, the input first passes through a semantic extraction module, which comprises a visual semantic branch and a text semantic branch that extract image and text semantics, respectively. The outputs of the two branches are combined by the Spatial-Temporal Filtering Module (STFM). STFM enables explicit spatial-level information filtering and implicit temporal-level feature filtering, ensuring that the visual tokens are closely aligned with the user's query. This reduces the number of visual tokens that must be fed into the LLM. FocusChat significantly outperforms Video-LLaMA in zero-shot experiments while using an order of magnitude less training data and only 16 visual tokens. It achieves results comparable to the state of the art in few-shot experiments, with only 0.72M pre-training data.

Paper Structure

This paper contains 18 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: This illustration contrasts traditional and text-guided models. Left: The traditional model converts visual patches directly into tokens for the LLM without considering specific frames or areas of interest. As a result, whether the query concerns a "kitchen" or a "gym," the model produces the same tokens and applies uniform attention to all details in the scene, potentially increasing the cognitive burden on the LLM. Right: The text-guided model uses the prompt to identify the most relevant visual cues and generates adaptive tokens, thereby improving the LLM's capacity to comprehend and interpret visual information.
  • Figure 2: The overall architecture of FocusChat. Uniformly sampled frames are input into the visual semantic branch, which consists of an image encoder and an image Q-Former. Simultaneously, the user's query is input into the text semantic branch to extract rich semantic representations. Finally, the two are fused in STFM to achieve both spatial-level and temporal-level filtering of visual information. The output of STFM is projected into visual tokens, which are fed into the LLM along with the text tokens.
  • Figure 3: When asking a question about a five-minute video, such as "How many people are playing basketball in the first two minutes?", the semantic similarity matrix H and the attention-map SAM diagram in STFM are presented as indicative illustrations.
  • Figure 4: Qualitative result comparison between Video-LLaMA and FocusChat.
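
The text-guided filtering idea described in Figures 1 to 3 can be illustrated with a minimal NumPy sketch. This is not the authors' STFM implementation: the function names, the use of cosine similarity for the matrix H, the max-over-text-tokens relevance score, the softmax weighting, and the `temperature` parameter are all assumptions chosen only to show how a text query could re-weight visual patches and frames.

```python
import numpy as np


def cosine_similarity_matrix(text, visual):
    """Return H where H[i, j] is the cosine similarity of text token i and visual token j."""
    eps = 1e-8  # guards against division by zero for all-zero vectors
    t = text / (np.linalg.norm(text, axis=-1, keepdims=True) + eps)
    v = visual / (np.linalg.norm(visual, axis=-1, keepdims=True) + eps)
    return t @ v.T


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def filter_visual_tokens(text, frames, temperature=0.1):
    """Hypothetical text-guided filter (an assumption, not the paper's STFM).

    text:   (T, d) text token embeddings
    frames: (F, P, d) patch embeddings for F frames with P patches each
    Returns one relevance-weighted token per frame, shape (F, d),
    and a relevance distribution over frames, shape (F,).
    """
    num_frames, num_patches, dim = frames.shape
    # Similarity between every text token and every patch of every frame.
    H = cosine_similarity_matrix(text, frames.reshape(num_frames * num_patches, dim))
    # Score each patch by its best-matching text token.
    relevance = H.max(axis=0).reshape(num_frames, num_patches)
    # Spatial step: weight patches within each frame by relevance.
    spatial_weights = softmax(relevance / temperature, axis=1)
    frame_tokens = np.einsum("fp,fpd->fd", spatial_weights, frames)
    # Temporal step: score frames by their mean patch relevance.
    frame_scores = softmax(relevance.mean(axis=1) / temperature, axis=0)
    return frame_tokens, frame_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(5, 32))        # 5 query tokens
    frames = rng.normal(size=(8, 16, 32))  # 8 frames, 16 patches each
    tokens, scores = filter_visual_tokens(text, frames)
    print(tokens.shape, scores.shape)
```

In this sketch the query changes both which patches dominate each frame token and which frames receive the most weight, which is the contrast Figure 1 draws against a model that emits the same tokens regardless of the question.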