Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning

Meixuan Li, Tianyu Li, Guoqing Wang, Peng Wang, Yang Yang, Heng Tao Shen

TL;DR

Rather than learning a single monolithic image representation, the proposed method models region-wise representations as Gaussian distributions. This captures cross-task relationships more effectively and improves overall performance in partially supervised multi-task dense prediction.

Abstract

We address multi-task dense prediction, covering tasks such as semantic segmentation, depth estimation, and surface normal estimation, in the setting where the training data are only partially annotated (multi-task partially supervised learning, MTPSL). The difficulty is that each training image lacks labels for some of the tasks. Because these pixel-wise dense tasks are inter-related, we focus on mining and capturing cross-task relationships. Existing solutions typically learn global image representations for global cross-task image matching, which imposes constraints that sacrifice the finer structures within images. Local matching is a natural remedy, but it is hampered by the lack of precise region supervision, which makes local alignment difficult. The Segment Anything Model (SAM) offers a way forward by providing free, high-quality region detection. Given SAM-detected regions, the remaining challenge is to align the representations within them. Rather than directly learning a monolithic image representation, we model region-wise representations as Gaussian distributions. Aligning these distributions between corresponding regions from different tasks gives greater flexibility and capacity to capture intra-region structures, and accommodates a broader range of tasks. This approach captures cross-task relationships more effectively and improves overall performance in partially supervised multi-task dense prediction. Extensive experiments on two widely used benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art performance even when compared to fully supervised methods.
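
One way to make the region-wise Gaussian modeling and alignment described in the abstract concrete is sketched below. This is an illustration under stated assumptions, not the paper's own formulation: the symbols $\Omega_r$ (the pixel set of SAM region $r$), $f_t$ (the feature map for task $t$), and $\tau$ (a temperature), the diagonal-covariance form, and the choice of the squared 2-Wasserstein distance are all hypothetical.

$$
\mu_{r,t} = \frac{1}{|\Omega_r|} \sum_{p \in \Omega_r} f_t(p), \qquad
\sigma_{r,t}^{2} = \frac{1}{|\Omega_r|} \sum_{p \in \Omega_r} \bigl(f_t(p) - \mu_{r,t}\bigr)^{2}
$$

$$
d\bigl((r,s),(r',t)\bigr) = \lVert \mu_{r,s} - \mu_{r',t} \rVert_2^2 + \lVert \sigma_{r,s} - \sigma_{r',t} \rVert_2^2
$$

$$
L_{\mathrm{region}} = -\frac{1}{R} \sum_{r=1}^{R} \log \frac{\exp\bigl(-d((r,s),(r,t))/\tau\bigr)}{\sum_{r'=1}^{R} \exp\bigl(-d((r,s),(r',t))/\tau\bigr)}
$$

Under these assumptions, the same region seen through tasks $s$ and $t$ forms the positive pair, and every other region $r' \neq r$ acts as a negative.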

Paper Structure (11 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Multiview consistency and region consistency. (a) Illustration of contrastive multiview consistency (Tian et al., 2020). (b) Illustration of region consistency, where $p_{i,s}$ and $p_{j,s}$ denote the region distributions for semantic segmentation, and $p_{i,d}$ and $p_{j,d}$ denote the region distributions for depth estimation.
  • Figure 2: Illustration of our region-aware distribution contrast learning method for MTPSL. During training, the supervised loss $L_{Sup}$ is applied to the annotated task $s$. For task $s$ and the unlabelled task $t$, the mapping $a_{\theta}$ projects the true label $y_s$ and the prediction $\hat{y}_t$, respectively, into a high-dimensional space. Each SAM-extracted region of the resulting task-specific features is then modeled as a Gaussian distribution. Contrastive learning is then employed to minimize the distance between Gaussian distributions of the same region across different tasks, while maximizing the distance to Gaussian distributions of other regions.
  • Figure 3: Illustration of region-level cross-task consistency. First, SAM predictions are used to extract regions from the features of each task. These regions are then modeled as Gaussian distributions. Finally, contrastive learning is used to minimize the distance between distributions of the same region across different tasks and to maximize the distance to distributions of other regions.
  • Figure 4: Qualitative results for the one-label setting on NYU-V2. The first row shows the input image, the second row shows the ground truth or predictions for semantic segmentation, the third row shows the ground truth or predictions for depth estimation, and the final row shows the ground truth or predictions for surface normal estimation.
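
The pipeline described in the captions for Figures 2 and 3 (extract regions with masks, model each region as a Gaussian, contrast the same region across tasks against other regions) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the diagonal-covariance assumption, the squared 2-Wasserstein distance, and the temperature `tau` are all assumptions introduced here, and the boolean `masks` array stands in for SAM output.

```python
import numpy as np

def region_gaussians(features, masks, eps=1e-6):
    """Fit a diagonal Gaussian to the features inside each region.

    features: (C, H, W) task-specific feature map.
    masks:    (R, H, W) boolean region masks (stand-in for SAM output;
              each mask is assumed to be non-empty).
    Returns per-region means and standard deviations, each of shape (R, C).
    """
    channels = features.shape[0]
    flat = features.reshape(channels, -1)              # (C, H*W)
    means, stds = [], []
    for mask in masks.reshape(masks.shape[0], -1):     # (H*W,) per region
        pixels = flat[:, mask]                         # (C, n_pixels)
        means.append(pixels.mean(axis=1))
        stds.append(pixels.std(axis=1) + eps)
    return np.stack(means), np.stack(stds)

def wasserstein2_sq(mu_a, std_a, mu_b, std_b):
    """Pairwise squared 2-Wasserstein distance between diagonal Gaussians.

    Assumed distance, chosen for illustration. Output shape: (R_a, R_b).
    """
    d_mu = ((mu_a[:, None, :] - mu_b[None, :, :]) ** 2).sum(axis=-1)
    d_std = ((std_a[:, None, :] - std_b[None, :, :]) ** 2).sum(axis=-1)
    return d_mu + d_std

def region_contrast_loss(feat_s, feat_t, masks, tau=0.5):
    """InfoNCE-style loss: region r in task s should match region r in task t."""
    mu_s, std_s = region_gaussians(feat_s, masks)
    mu_t, std_t = region_gaussians(feat_t, masks)
    logits = -wasserstein2_sq(mu_s, std_s, mu_t, std_t) / tau   # (R, R)
    logits = logits - logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())                     # positives on the diagonal

# Toy usage: two regions (left and right halves) with clearly different features.
rng = np.random.default_rng(0)
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :, :2] = True
masks[1, :, 2:] = True
feat_s = rng.normal(size=(3, 4, 4))
feat_s[:, :, 2:] += 5.0
feat_t = feat_s + 0.01 * rng.normal(size=(3, 4, 4))
loss = region_contrast_loss(feat_s, feat_t, masks)
```

In this toy setup the two task feature maps agree region by region, so the loss is small. Mirroring one of the maps horizontally, so that corresponding regions no longer match, should raise it.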