MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
Baijiong Lin, Weisen Jiang, Pengguang Chen, Yu Zhang, Shu Liu, Ying-Cong Chen
TL;DR
MTMamba addresses multi-task dense scene understanding with a Mamba-based decoder built from self-task Mamba (STM) and cross-task Mamba (CTM) blocks, which model long-range context and inter-task interactions, respectively. The architecture pairs a shared Swin-Large encoder with a three-stage decoder that uses SS2D-enabled MFE modules and adaptive task gating, and it produces task-specific predictions through lightweight heads. On NYUDv2 and PASCAL-Context, MTMamba outperforms CNN- and Transformer-based decoders; on PASCAL-Context it improves over the previous best methods by +2.08 in semantic segmentation, +5.01 in human parsing, and +4.90 in object boundary detection, and qualitative analyses show more precise and detailed predictions. The results indicate that Mamba-based decoders are an effective alternative to attention-based decoders for multi-task learning.
Abstract
Multi-task dense scene understanding, which learns a single model for multiple dense prediction tasks, has a wide range of applications. Modeling long-range dependencies and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependencies by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on the NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at https://github.com/EnVision-Research/MTMamba.
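To make the idea of cross-task information exchange concrete, the following is a minimal sketch of gated fusion between a task's own feature and a feature shared across tasks. It is an illustration under stated assumptions, not the released implementation: the function name `cross_task_fuse`, the use of an element-wise mean as the shared feature, and the externally supplied gate logits are all hypothetical stand-ins for the learned Mamba-based modules, and the exact fusion rule used by the CTM block may differ.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cross_task_fuse(
    task_feats: Dict[str, List[float]],
    gate_logits: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Blend each task's feature with a shared feature via a per-element gate.

    ASSUMPTION: the shared feature is a plain element-wise mean over tasks.
    In a real model it would come from a learned module, and the gate logits
    would be predicted from the features rather than passed in.
    """
    names = list(task_feats)
    dim = len(task_feats[names[0]])

    # Shared feature: element-wise mean across all task features.
    shared = [
        sum(task_feats[name][i] for name in names) / len(names)
        for i in range(dim)
    ]

    fused: Dict[str, List[float]] = {}
    for name in names:
        gates = [sigmoid(z) for z in gate_logits[name]]
        # A gate near 1 favors the shared feature; near 0 keeps the task's own.
        fused[name] = [
            g * s + (1.0 - g) * f
            for g, s, f in zip(gates, shared, task_feats[name])
        ]
    return fused


# Toy example with two tasks and two-dimensional features.
feats = {"semseg": [1.0, 3.0], "depth": [3.0, 5.0]}
half_open = {"semseg": [0.0, 0.0], "depth": [0.0, 0.0]}
fused_half = cross_task_fuse(feats, half_open)
```

With all gate logits at zero, every gate equals 0.5, so each fused feature lies halfway between the task's own feature and the shared mean. Strongly negative logits drive the gates toward zero and leave each task's feature essentially unchanged, which is how a gate of this kind lets a task opt out of cross-task information where it does not help.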
