MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer

Rezaul Karim; He Zhao; Richard P. Wildes; Mennatullah Siam

TL;DR

This work introduces MED-VT, a unified Multiscale Encoder-Decoder Video Transformer, and its multimodal extension MED-VT++, for end-to-end dense video segmentation without optical flow input. The architecture combines within- and between-scale spatiotemporal attention in the encoder, a learnable coarse-to-fine decoder with adaptive queries, and a many-to-many temporal label propagator that promotes temporally consistent predictions. MED-VT++ additionally fuses an auxiliary modality such as audio via bidirectional cross-attention and a context-based query generator. Key contributions are the unified multiscale encoder-decoder framework, many-to-many label propagation for temporal coherence, the multimodal extension, and an interpretability study with quantitative and qualitative analyses. Empirically, MED-VT and MED-VT++ outperform alternative state-of-the-art approaches on multiple benchmarks across AVOS, actor-action segmentation, VSS, and audio-visual segmentation (AVS), using only video (and optional audio) as input.

Abstract

In this paper, we present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video. The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on input optical flow, (ii) temporal consistency at encoding and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions. We showcase MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation and Video Semantic Segmentation (VSS)) as well as a multimodal segmentation task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on optical flow. Finally, to document details of the model's internal learned representations, we present a detailed interpretability study, encompassing both quantitative and qualitative analyses.
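The many-to-many label propagation mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the plain scaled dot-product attention form, with the decoder features acting as both queries and keys and the initial predictions acting as values (the roles given in the Figure 2 caption). The function names `softmax` and `propagate_labels`, the tensor shapes, and the omission of any learned projections are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def propagate_labels(features, initial_logits):
    """Hypothetical many-to-many propagation over a whole clip.

    features:       (T, H, W, d) decoder output, used as queries and keys.
    initial_logits: (T, H, W, C) per-pixel predictions, used as values.
    Every space-time position attends to every other position in the clip,
    so each refined prediction is a weighted average of all initial ones.
    """
    T, H, W, d = features.shape
    C = initial_logits.shape[-1]
    q = k = features.reshape(T * H * W, d)
    v = initial_logits.reshape(T * H * W, C)
    affinity = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (THW, THW), rows sum to 1
    return (affinity @ v).reshape(T, H, W, C)

# Toy clip: 3 frames of 4x5 positions, 8-dim features, 2 classes.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 5, 8))
logits = rng.standard_normal((3, 4, 5, 2))
refined = propagate_labels(feats, logits)
```

Because each row of the affinity matrix sums to one, every refined prediction is a convex combination of the initial predictions across all frames, which is one way to read "many-to-many": all frames contribute to all frames, rather than labels flowing from a single reference frame.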

Paper Structure (27 sections, 20 equations, 12 figures, 10 tables)


Introduction
Related work
MED-Video Transformer (MED-VT)
Overview
Multiscale transformer encoder
Multiscale transformer decoder
Many-to-many temporal label propagation
Multimodal extension (MED-VT++)
End-to-end training
Tasks
Empirical evaluation
Experiment design
Comparison to the state of the art
Video object segmentation
Actor-action segmentation
...and 12 more sections

Figures (12)

Figure 1: Multiscale video transformer (MED-VT, MED-VT++) on Different Unimodal and Multimodal Video Segmentation Tasks. For unimodal tasks, such as primary object segmentation, video semantic segmentation and actor-action segmentation, MED-VT takes an input clip and estimates target masks. For multimodal video segmentation, e.g. audio-visual segmentation, MED-VT++ takes audio as additional input and uses it as a context cue in segmentation.
Figure 2: Detailed MED-VT/MED-VT++ architecture with unified multiscale encoder-decoder transformer, illustrated with application to Audio-Visual Segmentation (AVS). The core MED-VT has five functionally distinct components. (i) Backbone feature extractor to extract per-frame features, $\mathsf{f}_s$, at multiple scales, $s\in\{1,\cdots,s_{max}\}$. (ii) Multiscale transformer encoder consisting of spatiotemporal within- and between-scale attention with resulting features, $\mathsf{f}^{\mathcal{W}}_s$ and $\mathsf{f}^{\mathcal{B}}_s$, resp.; the multihead attention transformation, equation \ref{eq:multihead}, is used for both. (iii) Multiscale transformer decoder consisting of pixel decoding, which produces decoded features, $\mathsf{f}^{\mathcal{P}}_s$, and a series of multiscale query learning decoder blocks, $\mathcal{D}^i_s$, for the corresponding $i^{th}$ iteration and scale $s$, each of which entails self and cross attention, again using the multihead attention transformation, equation \ref{eq:multihead}. The inputs to each block are the decoded features $\mathsf{f}^{\mathcal{P}}_s$ and the query resulting from the previous block, with a randomized query, $\mathsf{Q}^r$, as initialization; the output is a final object query, $\mathsf{Q}^o$. The decoder applies an affinity, equation \ref{eq:decoder}, between $\mathsf{Q}^o$ and the finest scale decoded features, $\mathsf{f}^\mathcal{P}_1$, to yield an object attention map, which is concatenated with the finest scale decoded features for final decoder output, $\mathsf{F}^D$. (iv) A task specific head, $\mathcal{H}$, that inputs $\mathsf{F}^D$ to produce initial predictions. (v) Many-to-many label propagation, equation \ref{eq:labelProp}, that inputs the initial predictions as values, $\mathsf{V}$, as well as $\mathsf{F}^D$ as queries, $\mathsf{Q}$, and keys, $\mathsf{K}$, to yield temporally consistent final segmentation masks, $\hat{\mathsf{Y}}$.
MED-VT++ adds three components: (vi) An additional backbone to extract auxiliary (e.g., audio) features, (vii) Multimodal feature interaction that uses bidirectional cross-attention, equation \ref{eq:bidir_xy}, and (viii) Multimodal query generator that initializes the query, $\mathsf{Q}^r$, from the auxiliary-modality features output by the feature fusion module (e.g., audio feature, $\mathsf{f}^a$), equation \ref{eq:decoder_cross_attn_audio}, instead of the random initialization used for unimodal segmentation. Our key innovations, outlined in bold boxes, lie in the unified multimodal multiscale encoder-decoder and label propagator.
Figure 3: Stacked coarse-to-fine processing in the decoder. Our multiscale decoder inputs a multiscale feature pyramid, $F^{\mathcal{P}}$, and randomly initialized queries, $\mathsf{Q}^r$, and outputs final object queries, $\mathsf{Q}^o$. The input is processed coarse-to-fine and iteratively through multiple decoder blocks, $\mathcal{D}^i_s$, with $s$ indicating input feature scale and $i$ indicating iteration. For simplicity, we show $s=3$ scales and $i=3$ iterations, with $\mathsf{f}^{\_}$ denoting features from each level of the pyramid, $F^{\mathcal{P}}$, where the corresponding dimensions of the three levels are $TH_3W_3\times d, TH_2W_2\times d, TH_1W_1\times d$, resp.
Figure 4: Qualitative segmentation results (red masks) showing the efficacy of our full model. From top to bottom, rows are arranged as input image, ground truth, our single scale encoder-decoder (baseline) and MED-VT. Left: Two frames of DAVIS'16 breakdance. Middle: Two frames of MoCA Flounder-6. Right: Two frames of YouTube Objects train shot $0025$. The examples show that MED-VT handles complex motion, fine localization, strong camouflage, and partial occlusion in videos.
Figure 5: Qualitative segmentation results comparing MED-VT to a baseline algorithm on the A2D, VSPW and AVSBench datasets. A2D: Two frames of $2yu9Qkdo4HY$ with $<$adult, none$>$ and $<$baby, climbing$>$ actor-action tuples. VSPW: Two frames of $8aIZCJKQL1s$. AVSBench: Two frames from AVSBench where the audio has sounds for 'man talking' as well as 'man talking and playing piano'. MED-VT segments with fine precision and classifies correctly in scenarios with multiple objects, multiple actions and multiple sound sources, including scenarios involving articulated parts and complex object motions.
...and 7 more figures
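
The bidirectional cross-attention named in component (vii) of the Figure 2 caption can be sketched in NumPy as two cross-attention passes, one per direction. This is a simplified illustration under stated assumptions: single-head attention, no learned projections or normalization, and a residual update are all choices made here for brevity, and the names `cross_attend` and `bidirectional_cross_attention` are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context):
    """Update `query` tokens with information gathered from `context` tokens.

    query:   (N_q, d) tokens of one modality.
    context: (N_c, d) tokens of the other modality (keys and values).
    The residual connection is an assumption of this sketch.
    """
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d), axis=-1)  # (N_q, N_c)
    return query + weights @ context

def bidirectional_cross_attention(video, audio):
    # Both directions read the original inputs, so neither update
    # depends on the order in which the two passes are computed.
    video_out = cross_attend(video, audio)  # video attends to audio
    audio_out = cross_attend(audio, video)  # audio attends to video
    return video_out, audio_out

# Toy example: 6 video tokens and 2 audio tokens, both 4-dimensional.
rng = np.random.default_rng(0)
video = rng.standard_normal((6, 4))
audio = rng.standard_normal((2, 4))
video_out, audio_out = bidirectional_cross_attention(video, audio)
```

Each modality keeps its own token count after fusion, so the audio-side output remains available as context for a query generator, as component (viii) describes.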
