MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer

Rezaul Karim, He Zhao, Richard P. Wildes, Mennatullah Siam

TL;DR

This work introduces MED-VT, a unified Multiscale Encoder-Decoder Video Transformer, and its multimodal extension MED-VT++ for end-to-end dense video segmentation without optical flow. The architecture combines within- and between-scale spatiotemporal attention, a learnable coarse-to-fine decoder with adaptive queries, and a many-to-many temporal label propagator that yields temporally consistent predictions; MED-VT++ additionally fuses an auxiliary modality such as audio via bidirectional cross-attention and a context-based query generator. Key contributions include the fully unified multiscale encoder-decoder framework, the many-to-many label propagation for temporal coherence, the multimodal extension, and extensive interpretability analyses. Empirically, MED-VT and MED-VT++ achieve state-of-the-art results on multiple benchmarks spanning AVOS, actor-action segmentation, VSS, and audio-visual segmentation (AVSBench), while maintaining efficiency by avoiding optical flow.

Abstract

In this paper, we present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video. The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on input optical flow, (ii) temporal consistency at encoding and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions. We showcase MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation and Video Semantic Segmentation (VSS)) as well as a multimodal segmentation task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on optical flow. Finally, to document details of the model's internal learned representations, we present a detailed interpretability study, encompassing both quantitative and qualitative analyses.
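The many-to-many label propagation mentioned above can be illustrated with a small NumPy sketch. It follows the description in the Figure 2 caption (decoder features act as queries and keys, initial predictions act as values), but the softmax-scaled dot-product form, the function name `propagate_labels`, and the tensor shapes are assumptions made for illustration; the paper's own equation is not reproduced here and may differ in detail (e.g., learned projections).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def propagate_labels(features, initial_preds):
    """Hypothetical many-to-many propagation over a whole clip.

    features:      (T, H, W, d) decoder features, used as queries and keys.
    initial_preds: (T, H, W, C) per-pixel class probabilities, used as values.
    Every pixel in every frame attends to every pixel in every frame.
    """
    T, H, W, d = features.shape
    C = initial_preds.shape[-1]
    q = k = features.reshape(T * H * W, d)
    v = initial_preds.reshape(T * H * W, C)
    # Assumed affinity: scaled dot product followed by a row-wise softmax.
    affinity = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (THW, THW)
    return (affinity @ v).reshape(T, H, W, C)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 4, 4, 8))            # T=2 frames, 4x4 pixels, d=8
preds = softmax(rng.normal(size=(2, 4, 4, 3)))   # C=3 classes
refined = propagate_labels(feats, preds)
```

Because each affinity row sums to one, the refined predictions remain valid per-pixel distributions in this sketch; pixels with similar features, whether in the same frame or a different one, pull each other's labels together.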

Paper Structure (27 sections, 20 equations, 12 figures, 10 tables)

Figures (12)

  • Figure 1: Multiscale video transformer (MED-VT, MED-VT++) on Different Unimodal and Multimodal Video Segmentation Tasks. For unimodal tasks, such as primary object segmentation, video semantic segmentation and actor-action segmentation, MED-VT takes an input clip and estimates target masks. For multimodal video segmentation, e.g. audio-visual segmentation, MED-VT++ takes audio as additional input and uses it as a context cue in segmentation.
  • Figure 2: Detailed MED-VT/MED-VT++ architecture with unified multiscale encoder-decoder transformer, illustrated with application to Audio-Visual Segmentation (AVS). The core MED-VT has five functionally distinct components. (i) Backbone feature extractor to extract per-frame features, $\mathsf{f}_s$, at multiple scales, $s\in\{1,\cdots,s_{max}\}$. (ii) Multiscale transformer encoder consisting of spatiotemporal within- and between-scale attention with resulting features, $\mathsf{f}^{\mathcal{W}}_s$ and $\mathsf{f}^{\mathcal{B}}_s$, respectively; the multihead attention transformation, equation \ref{eq:multihead}, is used for both. (iii) Multiscale transformer decoder consisting of pixel decoding, which produces decoded features, $\mathsf{f}^{\mathcal{P}}_s$, and a series of multiscale query learning decoder blocks, $\mathcal{D}^i_s$, for the corresponding $i^{th}$ iteration and scale $s$, each of which entails self and cross attention, again using the multihead attention transformation, equation \ref{eq:multihead}. The inputs to the blocks are the decoded features $\mathsf{f}^{\mathcal{P}}_s$ and the query resulting from the previous block, with a randomized query, $\mathsf{Q}^r$, initialization; the output is a final object query, $\mathsf{Q}^o$. The decoder applies an affinity, equation \ref{eq:decoder}, between $\mathsf{Q}^o$ and the finest scale decoded features, $\mathsf{f}^\mathcal{P}_1$, to yield an object attention map, which is concatenated with the finest scale decoded features for final decoder output, $\mathsf{F}^D$. (iv) A task-specific head, $\mathcal{H}$, that inputs $\mathsf{F}^D$ to produce initial predictions. (v) Many-to-many label propagation, equation \ref{eq:labelProp}, that inputs the initial predictions as values, $\mathsf{V}$, as well as $\mathsf{F}^D$ as queries, $\mathsf{Q}$, and keys, $\mathsf{K}$, to yield temporally consistent final segmentation masks, $\hat{\mathsf{Y}}$.
MED-VT++ adds three components: (vi) An additional backbone to extract auxiliary (e.g., audio) features, (vii) Multimodal feature interaction that uses bidirectional cross-attention, equation \ref{eq:bidir_xy}, and (viii) Multimodal query generator that initializes the query, $\mathsf{Q}^r$, from the additional-modality features output by the feature fusion module (e.g., audio feature, $\mathsf{f}^a$), equation \ref{eq:decoder_cross_attn_audio}, instead of the random initialization used for unimodal segmentation. Our key innovations, outlined in bold boxes, lie in the unified multimodal multiscale encoder-decoder and label propagator.
  • Figure 3: The decoder's stacked coarse-to-fine processing. Our multiscale decoder inputs a multiscale feature pyramid, $F^{\mathcal{P}}$, and randomly initialized queries, $\mathsf{Q}^r$, and outputs final object queries, $\mathsf{Q}^o$. The input is processed coarse-to-fine and iteratively through multiple decoder blocks, $\mathcal{D}^i_s$, with $s$ indicating input feature scale and $i$ indicating iteration. For simplicity, we show $s=3$ scales and $i=3$ iterations, with $\mathsf{f}^{\_}$ denoting features from each level of the pyramid, $F^{\mathcal{P}}$, where the corresponding dimensions of the three levels are $TH_3W_3\times d$, $TH_2W_2\times d$, and $TH_1W_1\times d$, respectively.
  • Figure 4: Qualitative segmentation results (red masks) showing the efficacy of our full model. From top to bottom, rows are arranged as input image, ground truth, our single scale encoder-decoder (baseline) and MED-VT. Left: Two frames of DAVIS'16 breakdance. Middle: Two frames of MoCA Flounder-6. Right: Two frames of YouTube Objects train shot $0025$. The examples show that MED-VT handles complex motion, fine localization, strong camouflage, and partial occlusion in videos.
  • Figure 5: Qualitative segmentation results comparing MED-VT to the baseline algorithm on the A2D, VSPW and AVSBench datasets. A2D: Two frames of $2yu9Qkdo4HY$ with $<$adult, none$>$ and $<$baby, climbing$>$ actor-action tuples. VSPW: Two frames of $8aIZCJKQL1s$. AVSBench: Two frames from AVSBench where the audio has sounds for 'man talking' as well as 'man talking and playing piano'. MED-VT segments with fine precision and classifies correctly in scenarios with multiple objects, multiple actions, and multiple sound sources, even when these involve articulated parts and complex object motions.
  • ...and 7 more figures
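The coarse-to-fine query refinement described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each decoder block is reduced to a single residual cross-attention step (self-attention, learned projections, multiple heads and normalization are omitted), the loop order (iterations outside, scales inside) is assumed, and the names `cross_attend` and `decode` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, feats):
    # Simplified stand-in for one decoder block: queries attend to the
    # flattened spatiotemporal features of one scale (residual update).
    d = queries.shape[-1]
    attn = softmax(queries @ feats.T / np.sqrt(d), axis=-1)  # (n_q, N_s)
    return queries + attn @ feats

def decode(pyramid, queries, iterations=3):
    """pyramid: list of (T*H_s*W_s, d) arrays ordered coarse -> fine."""
    for _ in range(iterations):
        for feats in pyramid:            # coarse-to-fine within each iteration
            queries = cross_attend(queries, feats)
    finest = pyramid[-1]
    # Assumed affinity: dot product between final queries and finest features.
    attention_map = finest @ queries.T   # (N_1, n_q)
    decoder_out = np.concatenate([finest, attention_map], axis=-1)
    return queries, decoder_out

rng = np.random.default_rng(0)
T, d, n_q = 2, 16, 5
pyramid = [rng.normal(size=(T * h * h, d)) for h in (2, 4, 8)]  # three scales
q_init = rng.normal(size=(n_q, d))      # random init, as in the unimodal case
q_final, f_dec = decode(pyramid, q_init)
```

In the multimodal setting the caption describes, `q_init` would instead be derived from the auxiliary (e.g., audio) features rather than sampled at random; the rest of the loop would be unchanged in this sketch.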