MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders

Baijiong Lin, Weisen Jiang, Pengguang Chen, Yu Zhang, Shu Liu, Ying-Cong Chen

TL;DR

MTMamba tackles multi-task dense scene understanding by introducing a Mamba-based decoder with self-task (STM) and cross-task (CTM) blocks that model long-range context and inter-task interactions. The architecture couples a shared Swin-Large encoder with a three-stage decoder featuring SS2D-enabled MFE modules and adaptive task gating, producing task-specific predictions via lightweight heads. Empirical results on NYUDv2 and PASCAL-Context show MTMamba consistently surpasses CNN- and Transformer-based decoders, with substantial gains in semantic segmentation, parsing, and boundary detection, and qualitative analyses confirm more precise and detailed predictions. The work demonstrates that Mamba-based decoders can effectively address multi-task learning challenges, offering a scalable, efficient alternative to attention-based approaches with strong practical impact.

Abstract

Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at https://github.com/EnVision-Research/MTMamba.

Paper Structure

This paper contains 31 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of the proposed MTMamba for multi-task dense scene understanding, illustrated with the semantic segmentation ("Semseg") and depth estimation ("Depth") tasks. The red blocks are shared across all tasks, while the blue and green ones are task-specific. The pretrained encoder (a Swin-Large Transformer) extracts multi-scale generic visual representations from the input RGB image. In the decoder, all task representations from the task-specific STM blocks are fused and refined in the CTM block. Each task has its own head to generate the final predictions. Note that the STM and CTM blocks in the decoder (detailed in Figure 2) are Mamba-based, i.e., they use no attention.
  • Figure 2: (a) Illustration of the self-task Mamba (STM) block. Its core module is the Mamba-based feature extractor (MFE), in which the 1D S6 operation (introduced in the paper's section on state space models) is extended to 2D images as SS2D. The MFE learns discriminative features, and an input-dependent gate $\sigma(\texttt{Linear}(\texttt{LN}({\bf z})))$ further refines them. (b) Overview of the cross-task Mamba (CTM) block, illustrated with two tasks. Let $T$ be the number of tasks ($T=2$ in this illustration). The CTM block takes $T$ features as input, outputs $T$ features, and contains $T+1$ MFE modules: one generates a global feature $\tilde{{\bf z}}^\text{sh}$, and the other $T$ each produce a task-specific feature $\tilde{{\bf z}}^t$. Each output feature aggregates the task-specific feature $\tilde{{\bf z}}^t$ and the global feature $\tilde{{\bf z}}^\text{sh}$, weighted by a task-specific gate ${\bf g}^t$. More details about these two blocks are provided in the paper's section on the Mamba-based decoder.
  • Figure 3: Visualization of the final decoder feature for semantic segmentation. Compared with InvPT (Ye and Xu, 2022), our method generates more discriminative features.
  • Figure 4: Qualitative comparison with the state-of-the-art method InvPT (Ye and Xu, 2022) on the NYUDv2 dataset. The proposed method generates better predictions with more accurate details, as marked by the yellow circles. Zoom in for more details.
  • Figure 5: Qualitative comparison with the state-of-the-art method InvPT (Ye and Xu, 2022) on the PASCAL-Context dataset. The proposed method generates better predictions with more accurate details, as marked by the yellow circles. Zoom in for more details.
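
The gated fusion described in the Figure 2(b) caption can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation. The names `ToyMFE` and `ToyCTM` are hypothetical, and the sketch makes several assumptions that the caption does not state: the MFE is replaced by a plain linear map followed by tanh (no SS2D or S6), the global feature is computed from the channel-wise concatenation of all task features, the task gate ${\bf g}^t$ reuses the form $\sigma(\texttt{Linear}(\texttt{LN}({\bf z})))$ given for the STM gate, and the aggregation is the convex combination ${\bf g}^t \tilde{{\bf z}}^t + (1-{\bf g}^t)\tilde{{\bf z}}^\text{sh}$.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-5):
    # Normalise over the channel (last) axis; no learned scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class ToyMFE:
    """Placeholder for the Mamba-based feature extractor.

    ASSUMPTION: a linear map + tanh stands in for the real SS2D-based module.
    """

    def __init__(self, in_dim, out_dim, rng):
        self.w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def __call__(self, x):
        return np.tanh(x @ self.w)


class ToyCTM:
    """Cross-task fusion with T task-specific extractors plus one shared one."""

    def __init__(self, num_tasks, dim, seed=0):
        rng = np.random.default_rng(seed)
        # ASSUMPTION: the shared extractor sees all task features concatenated.
        self.shared_mfe = ToyMFE(num_tasks * dim, dim, rng)
        self.task_mfes = [ToyMFE(dim, dim, rng) for _ in range(num_tasks)]
        # ASSUMPTION: one linear gate projection per task.
        self.gate_w = [
            rng.standard_normal((dim, dim)) / np.sqrt(dim)
            for _ in range(num_tasks)
        ]

    def __call__(self, feats):
        # Global feature shared by every task.
        z_sh = self.shared_mfe(np.concatenate(feats, axis=-1))
        outs = []
        for z, mfe, w in zip(feats, self.task_mfes, self.gate_w):
            z_t = mfe(z)                     # task-specific feature
            g = sigmoid(layer_norm(z) @ w)   # task-specific gate in (0, 1)
            # ASSUMPTION: convex combination of task and global features.
            outs.append(g * z_t + (1.0 - g) * z_sh)
        return outs


# Two tasks (e.g. Semseg and Depth), each with a 4x4 map of 8 channels.
rng = np.random.default_rng(1)
feats = [rng.standard_normal((4, 4, 8)) for _ in range(2)]
ctm = ToyCTM(num_tasks=2, dim=8)
outs = ctm(feats)
print([o.shape for o in outs])
```

With $T=2$ the sketch holds $T+1=3$ extractors and returns one fused feature per task, each with the same shape as its input. Because the gate lies strictly between 0 and 1, every output value is an interpolation between the task-specific and the global feature at that position.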