DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data

Hanrong Ye, Dan Xu

TL;DR

Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps and outperform state-of-the-art methods on three challenging multi-task benchmarks under two different partial-labeling evaluation settings.

Abstract

Recently, there has been increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data, where each training sample is labeled for only a subset of the tasks. The absence of task labels during training leads to low-quality and noisy predictions, as can be observed in state-of-the-art methods. To tackle this issue, we reformulate partially-labeled multi-task dense prediction as a pixel-level denoising problem, and propose a novel multi-task denoising diffusion framework named DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and to generate rectified outputs for the different tasks. To exploit multi-task consistency in denoising, we further introduce a Multi-Task Conditioning strategy, which implicitly utilizes the complementary nature of the tasks to help learn the unlabeled tasks, improving the denoising performance across tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps and outperform state-of-the-art methods on three challenging multi-task benchmarks under two different partial-labeling evaluation settings. The code is available at https://prismformore.github.io/diffusionmtl/.

Paper Structure (23 sections, 5 equations, 15 figures, 5 tables, 1 algorithm)

This paper contains 23 sections, 5 equations, 15 figures, 5 tables, 1 algorithm.

Figures (15)

  • Figure 1: Motivating illustration of the proposed DiffusionMTL for multi-task partially supervised dense prediction. The model denoises the manually degraded multi-task prediction or feature maps (denoted as $\{\mathbf{X}^0_S,...,\mathbf{X}^T_S\}$, where $T$ and $S$ are the numbers of tasks and steps, respectively) step by step, and obtains the denoised outputs $\{\mathbf{X}^0_0,...,\mathbf{X}^T_0\}$. The denoising process is guided by the designed multi-task condition feature ${\bm{F}}_{cond}$.
  • Figure 2: Illustration of the proposed DiffusionMTL (Prediction Diffusion) framework for the MTPSL setting. DiffusionMTL first uses an initial backbone model to produce initial prediction maps for all tasks. To denoise the initial prediction maps and generate rectified maps, we propose a Multi-Task Denoising Diffusion Network (MTDNet). MTDNet involves a diffusion process and a denoising process. During training, the initial prediction map of the labeled target task $\mathcal{T}$ is gradually degraded by applying noise, resulting in the noisy prediction map ${\bm{P}}_S^\mathcal{T}$. Then, we utilize a Multi-Task Conditioned Denoiser (referred to as the "Denoiser") to denoise ${\bm{P}}_S^\mathcal{T}$ iteratively over $S$ steps, resulting in a clean prediction map ${\bm{P}}_0^\mathcal{T}$ that is supervised by the ground-truth label. For better learning of unlabeled tasks, we propose a Multi-Task Conditioning mechanism in the denoising process to stimulate information sharing across different tasks. During inference, the diffusion and denoising processes are applied to all tasks to produce denoised multi-task prediction maps.
  • Figure 3: Illustration of the proposed DiffusionMTL (Feature Diffusion), which conducts noise decay and denoising on initial feature maps ${\bm{F}}_{init}^\mathcal{T}$. The denoised feature maps ${\bm{F}}_{0}^\mathcal{T}$ are projected to the final prediction map ${\bm{P}}_{0}^\mathcal{T}$ with a task head after the denoising.
  • Figure 4: Pipeline of a single step $s$ in the denoising process of DiffusionMTL (Prediction Diffusion). Multi-Task Conditioning: The initial prediction maps for all tasks are projected to task-specific features and then stacked. The stacked features are then processed with a $3\times3$ convolution to reduce the channel dimension, resulting in a Multi-Task Condition Feature ${\bm{F}}_{cond}$ which is shared across all tasks. Multi-Task Conditioned Denoiser: The denoiser consists of several cross-attention transformer blocks, which learn to denoise the input conditioned on ${\bm{F}}_{cond}$. For its input, we perform a $3\times3$ convolution on the noisy prediction map ${\bm{P}}_s^\mathcal{T}$ and combine the output with the step embedding, obtaining a task embedding ${\bm{E}}_{s}^{\mathcal{T}}$. The denoiser takes ${\bm{F}}_{cond}$ as query input and ${\bm{E}}_{s}^{\mathcal{T}}$ as key and value inputs. We use a task-specific head to obtain the denoised prediction map ${\bm{P}}_{s-1}^\mathcal{T}$, which is the input of the next denoising step $s-1$.
  • Figure 5: Visualization of the prediction maps at different stages of the pipeline on Cityscapes. Our DiffusionMTL effectively denoises the noisy prediction maps of both tasks.
  • ...and 10 more figures
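
The diffusion and denoising loop described in the Figure 2 and Figure 4 captions can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes a DDPM-style closed-form noising rule with a linear beta schedule, which the captions do not specify, and it replaces the Multi-Task Conditioned Denoiser with a hypothetical `toy_denoiser` that simply blends the input toward a condition map. All function names and the toy 2x2 maps are illustrative.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_s) for a linear beta schedule (an assumption)."""
    alpha_bars, running = [], 1.0
    for s in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * s / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def diffuse(pred_map, s, alpha_bars, rng):
    """Degrade an initial prediction map to step s (1-indexed) with Gaussian noise."""
    a = alpha_bars[s - 1]
    return [[math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for p in row]
            for row in pred_map]

def denoise(noisy_map, cond, num_steps, denoiser):
    """Apply the denoiser from step S down to step 1; each output feeds the next step."""
    current = noisy_map
    for s in range(num_steps, 0, -1):
        current = denoiser(current, cond, s)
    return current

def toy_denoiser(noisy_map, cond, s):
    """Hypothetical stand-in for the real denoiser: halves the gap to the condition map."""
    return [[0.5 * x + 0.5 * c for x, c in zip(row, cond_row)]
            for row, cond_row in zip(noisy_map, cond)]

rng = random.Random(0)
S = 3                                   # number of diffusion/denoising steps
alpha_bars = linear_alpha_bars(S)
init_pred = [[0.2, 0.8], [0.6, 0.4]]    # toy 2x2 initial prediction map
cond = init_pred                        # toy condition; the paper uses a learned F_cond
noisy = diffuse(init_pred, S, alpha_bars, rng)   # plays the role of P_S
clean = denoise(noisy, cond, S, toy_denoiser)    # plays the role of P_0
```

In the framework as the captions describe it, the denoiser would be a stack of cross-attention transformer blocks conditioned on the shared multi-task feature, and the final map would be supervised by the ground-truth label for labeled tasks. The sketch only shows the control flow: noise once up to step $S$, then refine step by step.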