DenseMTL: Cross-task Attention Mechanism for Dense Multi-task Learning

Ivan Lopes, Tuan-Hung Vu, Raoul de Charette

TL;DR

DenseMTL addresses dense scene understanding by jointly learning semantic segmentation, dense depth, surface normals, and edges with a shared encoder and task-specific decoders. Its two key components are xTAM, a bidirectional cross-task attention mechanism that fuses correlation-guided and self-attention signals to produce directional features $f_{j \rightarrow i}$, and the multi-Task Exchange Block (mTEB), which aggregates these features to refine each task representation $\tilde{f}_i$. Across synthetic and real datasets, both outdoor and indoor, DenseMTL outperforms single-task baselines and existing multi-task approaches, and it extends to multi-task unsupervised domain adaptation (MTL-UDA) with gains on the target domain. The work demonstrates memory-efficient improvements for joint semantic and geometric understanding and provides code for reproducibility.

Abstract

Multi-task learning has recently emerged as a promising solution for a comprehensive understanding of complex scenes. In addition to being memory-efficient, multi-task models, when appropriately designed, can facilitate the exchange of complementary signals across tasks. In this work, we jointly address 2D semantic segmentation and three geometry-related tasks: dense depth estimation, surface normal estimation, and edge estimation, demonstrating their benefits on both indoor and outdoor datasets. We propose a novel multi-task learning architecture that leverages pairwise cross-task exchange through correlation-guided attention and self-attention to enhance the overall representation learning for all tasks. We conduct extensive experiments across three multi-task setups, showing the advantages of our approach compared to competitive baselines in both synthetic and real-world benchmarks. Additionally, we extend our method to the novel multi-task unsupervised domain adaptation setting. Our code is available at https://github.com/cv-rits/DenseMTL

Paper Structure (32 sections, 10 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Overview of our MTL framework. The three tasks semantic Segmentation, Depth regression and Normal estimation share an encoder $\textbf{E}$. The task-specific decoders $S$, $D$ and $N$ exchange information in the "multi-Task Exchange Block" (mTEB) via an attention-based mechanism, resulting in refined features to produce final predictions.
  • Figure 2: Bidirectional Cross-task Attention (xTAM). We enable information flow between task pairs ($i$, $j$) via the discovery of directional features, $f_{j \shortrightarrow i}$ and $f_{i \shortrightarrow j}$. Only $\text{xTAM}_{j \shortrightarrow i}$ is detailed here. It relies on two attention mechanisms. First, a correlation-guided attention (green) to output $\textbf{xtask}_{j \shortrightarrow i}$, the features from task $j$ contributing to task $i$. Second, a self-attention (purple) to discover complementary signal $\textbf{self}_{j \shortrightarrow i}$ from $j$. Best viewed in color.
  • Figure 3: Our cross-task MTL framework. We jointly predict semantic segmentation ($S$, blue), depth ($D$, orange) and surface normals ($N$, green). All tasks share an encoder ($\textbf{E}$) and have dedicated decoders (shown as trapezoids). Our cross-task attention module is inserted within a "multi-Task Exchange Block" (mTEB) that takes task-specific features as inputs and returns refined features of the same kinds. Here, three xTAM modules are used to extract pairs of directional features, each representing the complementary signal from one task to another, e.g. $f_{D\shortrightarrow S}$ encapsulates depth information ($D$) extracted to improve the segmentation task ($S$). Multi-modal directional features are combined in the square blocks, resulting in multi-modal residuals for refinement. Task-specific losses (shown in red) are applied before and after our mTEB.
  • Figure 4: Qualitative MTL results on Cityscapes in the '${S\text{-}D\text{-}N}$' setup. The first row shows the input image and its segmentation, depth and normal ground truths. Segmentation results of our models are better overall, especially in boundary areas; see "bicycle", "rider" and "pedestrian". The surface normals of PAD-Net are blurrier than those of $\text{3-ways}_\text{PAD-Net}$ and Ours.
  • Figure 5: General architectures. For clarity, we only visualize two tasks in the multi-task networks. While the encoder is identical, the models differ in their decoder architecture, with PAD-Net, $\text{3-ways}_\text{PAD-Net}$ and Ours using dedicated task-exchange blocks.
  • ...and 5 more figures
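
To make the xTAM description in the Figure 2 caption concrete, here is a minimal sketch of how a directional feature $f_{j \shortrightarrow i}$ could be assembled from the two attention branches. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attend`, `directional_features`) are hypothetical, features are plain lists of per-position vectors, the learned projections of a real attention layer are omitted, and fusing the two branches by channel concatenation is an assumption.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys, values):
    """Scaled dot-product attention over flattened spatial positions.

    Each argument is a list of per-position feature vectors.
    No learned projections are applied (simplifying assumption).
    """
    scale = math.sqrt(len(keys[0]))
    channels = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(channels)])
    return out


def directional_features(f_i, f_j):
    """Hypothetical f_{j->i}: what task j contributes to task i.

    Assumes f_i and f_j cover the same spatial positions.
    """
    # Correlation-guided branch: similarity between task-i and task-j
    # features selects which task-j features to pass on (xtask_{j->i}).
    xtask = attend(f_i, f_j, f_j)
    # Self-attention branch: task j attends to itself (self_{j->i}).
    self_branch = attend(f_j, f_j, f_j)
    # Assumed fusion: concatenate the two branches along channels,
    # so each position ends up with 2C channels.
    return [x + s for x, s in zip(xtask, self_branch)]


# Toy features: 3 spatial positions, 2 channels per task.
f_seg = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_depth = [[0.5, 2.0], [1.5, -1.0], [0.0, 0.0]]
f_depth_to_seg = directional_features(f_seg, f_depth)
print(len(f_depth_to_seg), len(f_depth_to_seg[0]))
```

Calling the function with the arguments swapped gives the opposite direction, $f_{i \shortrightarrow j}$, which is what makes the exchange bidirectional. In the framework described above, an mTEB would then combine the directional features arriving at each task into a residual that refines that task's features.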