MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer
Rezaul Karim, He Zhao, Richard P. Wildes, Mennatullah Siam
TL;DR
This work introduces MED-VT, a unified Multiscale Encoder-Decoder Video Transformer, and its multimodal extension MED-VT++, for end-to-end dense video segmentation without optical flow input. The architecture combines within- and between-scale spatiotemporal attention, a learnable coarse-to-fine decoder with adaptive queries, and a many-to-many temporal label propagator for temporally consistent predictions; MED-VT++ additionally fuses an auxiliary modality such as audio via bidirectional cross-attention and a context-based query generator. Key contributions are the unified multiscale encoder-decoder framework, many-to-many label propagation for temporal coherence, the multimodal extension, and an extensive interpretability analysis. Empirically, MED-VT and MED-VT++ achieve state-of-the-art results on multiple benchmarks spanning Automatic Video Object Segmentation (AVOS), actor-action segmentation, Video Semantic Segmentation (VSS), and Audio-Visual Segmentation (AVS, on AVSBench), while avoiding the added cost of optical flow.
Abstract
In this paper, we present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video. The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on input optical flow, (ii) temporal consistency at encoding and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions. We showcase MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation and Video Semantic Segmentation (VSS)) as well as a multimodal segmentation task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on optical flow. Finally, to document details of the model's internal learned representations, we present a detailed interpretability study, encompassing both quantitative and qualitative analyses.
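To make the idea of many-to-many label propagation concrete, the following is a minimal sketch, not the formulation used in MED-VT. It assumes that every position in every frame of a clip has a feature vector and an initial class-score vector, and that scores are refined by a softmax-weighted average over all positions in all frames, with weights given by feature similarity. The function names `softmax` and `propagate_labels`, the dot-product affinity, and the `temperature` parameter are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def propagate_labels(features, logits, temperature=1.0):
    """Refine class scores with affinities computed across the whole clip.

    features: N feature vectors, one per (frame, position) pair, with all
              frames flattened into a single list (hypothetical layout).
    logits:   N class-score vectors aligned with `features`.
    Returns N refined class-score vectors.
    """
    num_classes = len(logits[0])
    refined = []
    for f_i in features:
        # Dot-product affinity between this position and every position
        # in every frame (assumed similarity measure).
        sims = [sum(a * b for a, b in zip(f_i, f_j)) / temperature
                for f_j in features]
        weights = softmax(sims)
        # Each refined score is an affinity-weighted average of all scores.
        refined.append([
            sum(w * l[c] for w, l in zip(weights, logits))
            for c in range(num_classes)
        ])
    return refined


# Toy clip: positions 0 and 1 show the same object in two frames,
# position 2 is background. Position 1 starts with a weak, wrong score.
features = [[5.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
logits = [[4.0, 0.0], [0.0, 1.0], [0.0, 4.0]]
refined = propagate_labels(features, logits)
# After propagation, position 1 favors class 0, matching position 0.
```

Because every position attends to every other position in the clip, a confident prediction in one frame can correct a weak prediction for the same object in another frame, which is the intuition behind using propagation for temporal consistency.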
