BIMM: Brain Inspired Masked Modeling for Video Representation Learning

Zhifan Wan, Jie Zhang, Changzhen Li, Shiguang Shan

TL;DR

BIMM tackles video representation learning by emulating the brain's ventral and dorsal visual pathways through a dual-branch ViT architecture trained with masked modeling. Each branch is divided into three intermediate blocks with lightweight decoders and progressive targets that capture texture, contour, color, and motion, while a partial weight-sharing strategy facilitates information flow between branches. The two-stage pretraining (ImageNet-1K for the ventral branch, then joint video pretraining with the dorsal branch) and targeted losses enable strong spatiotemporal representations, achieving state-of-the-art results on datasets such as K400, SSv2, AVA, COCO, and ADE20K. The approach demonstrates that combining brain-inspired architecture with progressive, multi-target supervision yields robust performance on both video and image tasks, suggesting practical benefits for broad visual understanding applications.

Abstract

The visual pathway of the human brain includes two sub-pathways, i.e., the ventral pathway and the dorsal pathway, which focus on object identification and dynamic information modeling, respectively. Both pathways comprise multi-layer structures, with each layer responsible for processing different aspects of visual information. Inspired by the visual information processing mechanism of the human brain, we propose the Brain Inspired Masked Modeling (BIMM) framework, which aims to learn comprehensive representations from videos. Specifically, our approach consists of ventral and dorsal branches, which learn image and video representations, respectively. Both branches employ the Vision Transformer (ViT) as their backbone and are trained with masked modeling. To mirror the roles of different visual cortices in the brain, we segment the encoder of each branch into three intermediate blocks and reconstruct progressive prediction targets with lightweight decoders. Furthermore, drawing inspiration from the information-sharing mechanism in the visual pathways, we propose a partial parameter sharing strategy between the branches during training. Extensive experiments demonstrate that BIMM achieves superior performance compared to state-of-the-art methods.
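
The encoder segmentation and partial parameter sharing described in the abstract can be illustrated with a small, framework-free sketch. The abstract does not state where the encoder is cut or which parameters are shared, so everything concrete here is an assumption made for illustration: a 12-layer encoder, an even split into three blocks, sharing of the first block only, and the hypothetical helper names `split_into_blocks` and `build_branches`. Layers are stand-in dictionaries rather than real Transformer layers.

```python
def split_into_blocks(depth, num_blocks=3):
    """Split layer indices 0..depth-1 into contiguous, near-equal blocks."""
    base, extra = divmod(depth, num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)  # earlier blocks absorb the remainder
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def build_branches(depth, shared_layers):
    """Build two branches; layers listed in shared_layers are the same object in both."""
    ventral = [{"name": f"layer_{i}", "weight": 0.0} for i in range(depth)]
    dorsal = [
        ventral[i] if i in shared_layers else {"name": f"layer_{i}", "weight": 0.0}
        for i in range(depth)
    ]
    return ventral, dorsal


# Assumption: a 12-layer encoder whose first block is shared between branches.
blocks = split_into_blocks(depth=12)
shared = set(blocks[0])
ventral, dorsal = build_branches(depth=12, shared_layers=shared)

# An update to a shared layer is visible in both branches; an unshared one is not.
ventral[0]["weight"] = 1.0
ventral[5]["weight"] = 2.0
print(blocks)
print(dorsal[0]["weight"], dorsal[5]["weight"])
```

In a real implementation the same effect is usually obtained by registering one layer object in both branches, so that a single set of weights receives gradients from both the image and the video objectives.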

Paper Structure

This paper contains 20 sections, 4 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Visual pathways in the cerebral cortex. The four main areas in the visual pathway are denoted as V1, V2, V4, and MT. Each area specializes in one aspect of visual information processing.
  • Figure 2: An overview of BIMM. BIMM maintains a ventral branch and a dorsal branch, each of which is trained with masked modeling. Drawing inspiration from the visual pathways, the ViT of each branch is divided into three intermediate blocks, and a lightweight decoder is appended to each block. The intermediate blocks (denoted as V1, V2, V4, and MT) are responsible for learning specific aspects of visual representation. V1 learns texture and predicts Gabor features [viola2001harr]. V2 specializes in contour detection and learns from contour images generated by SAM [kirillov2023sam]. V4 is dedicated to color and object segmentation, with RGB pixels as the prediction target. MT is concerned with dynamic motion and predicts motion information [yang2022motionmae]. During pretraining, BIMM applies a partial weight sharing strategy between the branches.
  • Figure 3: Ablation on training schedule. Beyond 800 epochs on K400 and 1600 epochs on UCF101, longer pretraining does not lead to significant improvement. All other settings are kept at their defaults.
  • Figure 4: Prediction examples of different models on UCF101. Each example is drawn from the validation set; green text marks a correct prediction and red text an incorrect one. "GT" indicates the ground truth annotation of the video.
  • Figure 5: Reconstruction results of videos on the UCF101 validation set. We show the original video sequence, the masked video sequence, and the reconstructions for different videos. The label of each video is listed under its group of images. Reconstructions are predicted by the pretrained dorsal branch with a high masking ratio of 90%, which indicates that BIMM is able to learn comprehensive features even when most patches are masked.
  • ...and 1 more figures
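
To give a sense of what the 90% masking ratio in Figure 5 means in practice, the following is a minimal sketch of uniform random patch masking. It is not taken from the paper: the masking scheme (uniform random rather than, say, tube masking), the token count of 1568 (8 temporal by 14 by 14 spatial patches), and the helper name `random_mask` are all assumptions chosen for illustration.

```python
import random


def random_mask(num_patches, mask_ratio=0.9, seed=0):
    """Return (visible, masked) patch indices for uniform random masking."""
    rng = random.Random(seed)  # local generator keeps the example reproducible
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)


# Assumption: 8 x 14 x 14 = 1568 patch tokens per clip.
visible, masked = random_mask(num_patches=8 * 14 * 14, mask_ratio=0.9)
print(len(visible), len(masked))
```

Under these assumptions the encoder sees only 157 of 1568 tokens, and the decoders must reconstruct their targets for the remaining 1411.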