Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding

Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang

TL;DR

Addresses how to leverage multimodal foundation models for deeper video understanding using a recursive, parameter-efficient architecture. Introduces the Super Encoding Network (SEN), which treats frozen encoders as super neurons and fuses them via Recursive Association to achieve deep multimodal interaction. Demonstrates strong, plug-and-play improvements across pixel-level tracking, zero-shot recognition, video-to-text chatting, and one-shot editing, with extensive ablations and a depth-optimization finding (3 RA blocks). This work offers a scalable path to richer multimodal representations in video understanding by reusing existing foundation-model encoders.

Abstract

Video understanding is widely regarded as a critical step towards world modeling, an important long-term problem in AI research. Recently, multimodal foundation models have shown potential for this through large-scale pretraining. These models effectively align encoders of different modalities via contrastive learning. Alignment alone, however, does not provide the deeper multimodal interactions that are critical for understanding complex target movements in diversified video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such interactions through the recursive association of multimodal encoders in the foundation models. Specifically, we treat those well-trained encoders as "super neurons" in our SEN. By designing a Recursive Association (RA) block, we progressively fuse multiple modalities with the input video, based on knowledge integrating, distributing, and prompting of super neurons in a recursive manner. In this way, our SEN can effectively encode deeper multimodal interactions for prompting various downstream video understanding tasks. Extensive experiments show that our SEN remarkably boosts four representative video tasks: tracking, recognition, chatting, and editing. For pixel-level tracking, the average Jaccard index improves by 2.7% and temporal coherence (TC) drops by 8.8% compared to the popular CaDeX++ approach. For one-shot video editing, textual alignment improves by 6.4% and frame consistency increases by 4.1% compared to the Tune-A-Video approach.
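
The abstract does not spell out the internals of the RA block, so the following is only a toy sketch of the general pattern it describes: frozen encoders are reused as fixed "super neurons", and a small block is applied repeatedly to refine the video representation. Everything here is an assumption made for illustration, not the authors' implementation: the names `make_frozen_encoder`, `ToyRABlock`, and `toy_sen`, the feature width, the random stand-in encoders, and the particular way the integrate, distribute, and prompt steps are realized.

```python
import numpy as np

DIM = 8          # hypothetical shared feature width
NUM_BLOCKS = 3   # depth highlighted in the TL;DR (3 RA blocks)


def make_frozen_encoder(rng, dim=DIM):
    """Stand-in for a pretrained encoder ("super neuron"); its weights are never updated."""
    w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return lambda x: np.tanh(x @ w)


class ToyRABlock:
    """Hypothetical association block; in this sketch it holds the only trainable weights."""

    def __init__(self, rng, dim=DIM):
        self.w_integrate = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.w_prompt = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, video_tokens, encoders):
        # Distribute (assumed): every frozen encoder sees the current video representation.
        feats = np.stack([enc(video_tokens) for enc in encoders])
        # Integrate (assumed): average the encoder outputs and mix them with a learned map.
        fused = np.tanh(feats.mean(axis=0) @ self.w_integrate)
        # Prompt (assumed): add the fused knowledge back to the video tokens as a residual.
        return video_tokens + fused @ self.w_prompt


def toy_sen(video_tokens, encoders, blocks):
    """Apply the blocks in sequence so each one refines the previous block's output."""
    for block in blocks:
        video_tokens = block(video_tokens, encoders)
    return video_tokens


rng = np.random.default_rng(0)
encoders = [make_frozen_encoder(rng) for _ in range(3)]   # three stand-in modality encoders
blocks = [ToyRABlock(rng) for _ in range(NUM_BLOCKS)]
video_tokens = rng.standard_normal((4, DIM))              # 4 tokens of width DIM
features = toy_sen(video_tokens, encoders, blocks)
print(features.shape)  # (4, 8)
```

In this sketch the encoders stay fixed and only the block matrices would be trained, which is one way to read the "parameter-efficient" and "plug-and-play" claims; the real block may differ substantially.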

Paper Structure

This paper contains 10 sections, 5 equations, 7 figures, 18 tables.

Figures (7)

  • Figure 1: Our Motivation. Different from the previous paradigms, our SEN creatively treats the well-pretrained multimodal encoders as super neurons, and leverages a novel Recursive Association (RA) block to achieve deeper multimodal interaction of super neurons. Such a flexible paradigm enables it to serve as a unified encoder network for boosting various complex downstream video understanding tasks.
  • Figure 2: Overall Structure of Our SEN. Each neuron is one of the multimodal encoders in the pretrained foundation model. We design a Recursive Association (RA) block to learn deeper interactions of multimodal knowledge in a progressive manner. Subsequently, we utilize the final feature of SEN to boost various complex video tasks in the downstream.
  • Figure 3: Illustration of SEN applied to downstream video tasks, from video tracking to video recognition, from video chatting to video editing.
  • Figure 4: Visualization of SEN. (a) is for the pixel-level tracking task. (b) is for the video-level zero-shot recognition task.
  • Figure 5: Visualization of SEN. (a) is for the video-to-text chatting task for the autopilot scenario. (b) is for the text-to-video editing task.
  • ...and 2 more figures