
Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition

Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang

TL;DR

This paper tackles open-world video recognition by bridging domain gaps with a three-stage PCA pipeline that progressively leverages external knowledge from foundation models. Percept enhances the visual input to reduce domain shift, Chat generates rich textual knowledge via LLMs or captioners, and Adapt fuses the visual and textual cues through modular adapters embedded in backbones. Empirical results on TinyVIRAT, ARID, and QV-Pipe demonstrate state-of-the-art performance and robust improvements from multimodal knowledge integration, supported by thorough ablations and visualizations. The approach highlights the practical potential of modular, multimodal knowledge transfer for real-world video understanding tasks.

Abstract

Open-world video recognition is challenging because traditional networks do not generalize well to complex environmental variations. Foundation models with rich knowledge, by contrast, have recently shown strong generalization power. However, how to apply such knowledge to open-world video recognition has not been fully explored. To this end, we propose a generic knowledge transfer pipeline, which progressively exploits and integrates external multimodal knowledge from foundation models to boost open-world video recognition. We name it PCA, after its three stages: Percept, Chat, and Adapt. First, we perform the Percept process to reduce the video domain gap and obtain external visual knowledge. Second, we generate rich linguistic semantics as external textual knowledge in the Chat stage. Finally, we blend the external multimodal knowledge in the Adapt stage by inserting multimodal knowledge adaptation modules into networks. We conduct extensive experiments on three challenging open-world video benchmarks, i.e., TinyVIRAT, ARID, and QV-Pipe. Our approach achieves state-of-the-art performance on all three datasets.

Paper Structure

This paper contains 12 sections, 7 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Overview of PCA. Given an open-world video, we first perform the Percept process to reduce the domain gap of the video and obtain enhanced visual features. Then, in the Chat process, we use large language models and large visual-language models to obtain external textual knowledge. Finally, in the Adapt process, we integrate the multimodal knowledge into training, enhancing the models' ability for open-world video perception.
  • Figure 2: Percept process. We first preprocess open-world videos using low-level visual models to enhance the visual content and reduce the domain gap. Subsequently, we use visual networks to extract enhanced features and predict a confidence score for every category.
  • Figure 3: Chat process. Because open-world videos are difficult to recognize, textual knowledge is needed to assist video perception. If the maximum prediction score from the Percept process is higher than the set threshold, prompt methods are used to semantically expand the predicted labels. If no category has a confidence score higher than the threshold, the enhanced visual features are considered not applicable, and VideoChat is used instead to obtain captions for the original video. In this way, external video captions and label descriptions are obtained.
  • Figure 4: Adapt process. We incorporate visual and textual knowledge into the training process through specially designed Adapt modules. The Adapt module is capable of integrating multimodal knowledge and can be seamlessly inserted into any block of the networks, enabling plug-and-play usage.
  • Figure 5: Structure variants. We explore four fusion structures for integrating external knowledge with the model. (a) uses direct weighted addition. (b) uses cross-attention with a residual connection. (c) extends (b) by integrating external knowledge with learnable prompts. (d) is the Adapt module, which adds self-attention and an FFN on top of (c).
  • ...and 2 more figures
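
The routing rule described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `choose_chat_source`, the default threshold of 0.5, and the two string tags it returns are all hypothetical, and the threshold value and prompt methods actually used in the paper are not given in this summary.

```python
from typing import Dict, Tuple


def choose_chat_source(scores: Dict[str, float],
                       threshold: float = 0.5) -> Tuple[str, str]:
    """Pick the textual knowledge source for the Chat stage.

    scores:    per-category confidences produced by the Percept stage.
    threshold: hypothetical cutoff (the paper's value is not stated here).

    Returns ("label_expansion", top_label) when the highest score is above
    the threshold, otherwise ("video_caption", "") to signal that the
    original video should be captioned instead.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    # Category with the highest Percept confidence.
    top_label = max(scores, key=scores.get)
    if scores[top_label] > threshold:
        # Confident prediction: expand the label with prompt methods.
        return "label_expansion", top_label
    # No confident category: fall back to captioning the original video.
    return "video_caption", ""
```

For example, `choose_chat_source({"run": 0.9, "walk": 0.1})` selects label expansion for `"run"`, whereas a video whose scores all fall below the threshold is routed to captioning.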
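
To make the first two fusion variants in the Figure 5 caption concrete, the sketch below shows weighted addition (a) and cross-attention with a residual connection (b) on plain Python lists. It is a simplified illustration only: the names `weighted_add` and `cross_attend_residual`, the form `x + alpha * k` for the weighted addition, and the omission of learned query/key/value projections are assumptions made here, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def weighted_add(x: Vector, k: Vector, alpha: float = 0.5) -> Vector:
    """Variant (a), assumed form: add knowledge k to feature x with weight alpha."""
    return [xi + alpha * ki for xi, ki in zip(x, k)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend_residual(tokens: List[Vector],
                          knowledge: List[Vector]) -> List[Vector]:
    """Variant (b), simplified: each model token attends over the external
    knowledge tokens (scaled dot-product, no learned projections), and the
    attended vector is added back to the token as a residual."""
    dim = len(tokens[0])
    fused = []
    for q in tokens:
        # Similarity between this token and every knowledge token.
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in knowledge]
        weights = softmax(logits)
        # Attention-weighted average of the knowledge tokens.
        attended = [sum(w * k[c] for w, k in zip(weights, knowledge))
                    for c in range(dim)]
        # Residual connection keeps the original token information.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused
```

The residual in (b) means that uninformative knowledge tokens perturb the model features rather than replace them, which is one plausible reason to prefer it over (a); variants (c) and (d) add learnable prompts, self-attention, and an FFN on top and are not sketched here.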