DenseMTL: Cross-task Attention Mechanism for Dense Multi-task Learning
Ivan Lopes, Tuan-Hung Vu, Raoul de Charette
TL;DR
DenseMTL addresses dense scene understanding by jointly learning semantic segmentation, dense depth, surface normals, and edges with a shared encoder and task-specific decoders. The key innovations are xTAM, a bidirectional cross-task attention mechanism that fuses correlation-guided attention and self-attention to produce directional features $f_{j \rightarrow i}$ (what task $j$ contributes to task $i$), and the multi-Task Exchange Block (mTEB), which aggregates these features to refine each task representation $\tilde{f}_i$. Across synthetic and real-world datasets, both outdoor and indoor, DenseMTL outperforms single-task baselines and existing multi-task approaches, and it extends to multi-task unsupervised domain adaptation (MTL-UDA), where it improves target-domain performance. The work demonstrates practical, memory-efficient improvements for joint semantic and geometric understanding, and the code is publicly available for reproducibility.
Abstract
Multi-task learning has recently emerged as a promising solution for a comprehensive understanding of complex scenes. In addition to being memory-efficient, multi-task models, when appropriately designed, can facilitate the exchange of complementary signals across tasks. In this work, we jointly address 2D semantic segmentation and three geometry-related tasks: dense depth estimation, surface normal estimation, and edge estimation, demonstrating their benefits on both indoor and outdoor datasets. We propose a novel multi-task learning architecture that leverages pairwise cross-task exchange through correlation-guided attention and self-attention to enhance the overall representation learning for all tasks. We conduct extensive experiments across three multi-task setups, showing the advantages of our approach compared to competitive baselines in both synthetic and real-world benchmarks. Additionally, we extend our method to the novel multi-task unsupervised domain adaptation setting. Our code is available at https://github.com/cv-rits/DenseMTL
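
To make the pairwise cross-task exchange more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each task feature map is flattened to an array of shape (N, C), uses scaled dot-product softmax attention as a stand-in for the correlation-guided branch, and fuses branches by simple concatenation and averaging. The function names `directional_feature` and `exchange_block` are hypothetical, and learned projections, normalization, and the exact fusion used in the paper are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def directional_feature(f_i, f_j):
    """Illustrative directional feature f_{j->i} (assumption, not the paper's exact form).

    f_i, f_j: arrays of shape (N, C), i.e. N spatial positions with C channels.
    Returns an array of shape (N, 2C).
    """
    scale = np.sqrt(f_j.shape[1])
    # Correlation-guided branch: positions of task i attend to features of task j.
    corr = softmax(f_i @ f_j.T / scale, axis=-1)       # (N, N)
    corr_feat = corr @ f_j                             # (N, C)
    # Self-attention branch: task j attends to itself.
    self_attn = softmax(f_j @ f_j.T / scale, axis=-1)  # (N, N)
    self_feat = self_attn @ f_j                        # (N, C)
    # Fuse the two signals by concatenation (a simplifying assumption).
    return np.concatenate([corr_feat, self_feat], axis=-1)


def exchange_block(features):
    """Illustrative exchange step: refine every task with features from all other tasks.

    features: dict mapping task name -> (N, C) array (requires at least two tasks).
    Returns a dict mapping task name -> (N, 3C) array: the original feature
    concatenated with the mean of the incoming directional features.
    """
    refined = {}
    for i, f_i in features.items():
        incoming = [directional_feature(f_i, f_j)
                    for j, f_j in features.items() if j != i]
        aggregated = np.mean(incoming, axis=0)         # (N, 2C)
        refined[i] = np.concatenate([f_i, aggregated], axis=-1)
    return refined


# Toy usage with random features for three tasks (16 positions, 8 channels).
rng = np.random.default_rng(0)
features = {task: rng.normal(size=(16, 8))
            for task in ("semseg", "depth", "normals")}
refined = exchange_block(features)
print({task: f.shape for task, f in refined.items()})
```

In this sketch the exchange is directional: `directional_feature(f_i, f_j)` and `directional_feature(f_j, f_i)` generally differ, mirroring the distinction between $f_{j \rightarrow i}$ and $f_{i \rightarrow j}$. A trained model would replace the plain concatenations with learned layers.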
