Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding

Boyu Chen; Siran Chen; Kunchang Li; Qinglin Xu; Yu Qiao; Yali Wang

TL;DR

Addresses how to leverage multimodal foundation models for deeper video understanding using a recursive, parameter-efficient architecture. Introduces the Super Encoding Network (SEN), which treats frozen encoders as super neurons and fuses them via Recursive Association to achieve deep multimodal interaction. Demonstrates strong, plug-and-play improvements across pixel-level tracking, zero-shot recognition, video-to-text chatting, and one-shot editing, with extensive ablations and a depth-optimization finding (3 RA blocks). This work offers a scalable path to richer multimodal representations in video understanding by reusing existing foundation-model encoders.

Abstract

Video understanding is considered a critical step towards world modeling, an important long-term problem in AI research. Recently, multimodal foundation models have shown such potential via large-scale pretraining. These models effectively align encoders of different modalities via contrastive learning. To further improve performance, we augment this alignment with deeper multimodal interactions, which are critical for understanding complex target movements in diversified video scenes. To this end, we propose a unified Super Encoding Network (SEN) for video understanding, which builds such interactions through the recursive association of the multimodal encoders in foundation models. Specifically, we treat those well-trained encoders as "super neurons" in our SEN. By designing a Recursive Association (RA) block, we progressively fuse multiple modalities with the input video, based on knowledge integration, distribution, and prompting of super neurons in a recursive manner. In this way, our SEN can effectively encode deeper multimodal interactions to prompt various downstream video understanding tasks. Extensive experiments show that our SEN substantially boosts four representative video tasks: tracking, recognition, chatting, and editing. For example, for pixel-level tracking, the average Jaccard index improves by 2.7% and temporal coherence (TC) drops by 8.8% compared to the popular CaDeX++ approach. For one-shot video editing, textual alignment improves by 6.4% and frame consistency increases by 4.1% compared to the Tune-A-Video approach.
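
The recursive association described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how integration, distribution, and prompting are computed, so the mean pooling, the residual additions, the feature size `DIM`, and every function name below are assumptions made purely for illustration. The frozen encoders are stand-ins (fixed random linear maps), and the depth of 3 follows the TL;DR's note about 3 RA blocks.

```python
import math
import random

DIM = 4  # toy feature size (assumption, not from the paper)


def make_frozen_encoder(seed):
    """Hypothetical stand-in for a pretrained encoder: a fixed random linear map + tanh.

    The weights are created once and never updated, mimicking a frozen "super neuron".
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

    def encode(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return encode


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def ra_block(video, modality_inputs, encoders):
    """One illustrative Recursive Association step (assumed form)."""
    # Integrate: encode every modality with its frozen encoder, then pool.
    features = [encoders[name](x) for name, x in modality_inputs.items()]
    fused = mean_vectors(features)
    # Distribute: feed the fused knowledge back to each modality as a residual.
    new_inputs = {name: add(x, fused) for name, x in modality_inputs.items()}
    # Prompt: inject the fused knowledge into the video representation.
    new_video = add(video, fused)
    return new_video, new_inputs


def super_encode(video, modality_inputs, encoders, depth=3):
    """Apply the RA step recursively; depth=3 mirrors the TL;DR's 3 RA blocks."""
    for _ in range(depth):
        video, modality_inputs = ra_block(video, modality_inputs, encoders)
    return video


# Usage with made-up inputs for two hypothetical modalities.
encoders = {"text": make_frozen_encoder(0), "audio": make_frozen_encoder(1)}
modality_inputs = {"text": [0.1, 0.2, 0.3, 0.4], "audio": [0.4, 0.3, 0.2, 0.1]}
video = [0.0] * DIM
out = super_encode(video, modality_inputs, encoders)
```

The point of the sketch is structural: the encoders are reused unchanged at every step, and only the signals passed between them evolve with depth, which is what makes the approach parameter-efficient in the sense the TL;DR describes.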
