3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

Xiaoye Wang, Chen Tang, Xiangyu Yue, Wei-Hong Li

TL;DR

The paper tackles the challenge of jointly predicting dense scene properties by making multi-task learning 3D-aware. It introduces a Cross-view Module (CvM) consisting of a spatial-aware encoder, a multi-view transformer, and a differentiable cost volume to capture cross-view geometry and enforce consistency across tasks. CvM is architecture-agnostic and can be plugged into existing MTL encoders, trained on multi-view data or video, while enabling single-view inference by duplicating the input. Empirical results on NYUv2 and PASCAL-Context show consistent improvements across segmentation, depth, normals, and boundaries, with competitive or state-of-the-art performance and modest computational overhead. The work demonstrates that explicit modeling of cross-view correlations yields higher-quality, geometry-aware representations for dense scene understanding and points to future directions for dynamic scenes and motion-aware extensions.

Abstract

This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling the cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., a cost volume, into the MTL network as a geometric-consistency cue. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, that exchanges information across views and captures cross-view correlations; its output is integrated with features from the MTL encoder for multi-task prediction. This module is architecture-agnostic and can be applied to both single-view and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.
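
The data flow described above (an MTL encoder and a shared cross-view module run in parallel, their outputs are concatenated, and a single image is duplicated when no second view is available) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `mtl_encoder`, `cross_view_module`, and `geometry_aware_features` are hypothetical names, images are flat lists of floats, and the "features" are simple statistics chosen only to make the flow visible.

```python
from typing import List, Optional, Sequence, Tuple

Feature = List[float]  # toy stand-in for a dense feature map


def mtl_encoder(image: Sequence[float]) -> Feature:
    """Hypothetical stand-in for the MTL encoder: per-image statistics."""
    return [sum(image) / len(image), max(image)]


def cross_view_module(
    view_a: Sequence[float], view_b: Sequence[float]
) -> Tuple[Feature, Feature]:
    """Hypothetical stand-in for the CvM: each output mixes both views."""
    mean_a = sum(view_a) / len(view_a)
    mean_b = sum(view_b) / len(view_b)
    shared = 0.5 * (mean_a + mean_b)  # information exchanged across views
    return [mean_a, shared], [mean_b, shared]


def geometry_aware_features(
    view_a: Sequence[float], view_b: Optional[Sequence[float]] = None
) -> Tuple[Feature, Feature]:
    """Concatenate encoder features with cross-view features for each view."""
    if view_b is None:
        # Single-view inference: duplicate the input to form the view pair.
        view_b = view_a
    enc_a, enc_b = mtl_encoder(view_a), mtl_encoder(view_b)
    cv_a, cv_b = cross_view_module(view_a, view_b)
    # List concatenation plays the role of channel-wise feature concatenation.
    return enc_a + cv_a, enc_b + cv_b
```

Calling `geometry_aware_features(image)` with one argument exercises the duplicated-input path, so both returned features are identical; passing two different views yields features that each carry a component shared across views.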

Paper Structure

This paper contains 44 sections, 3 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Comparison between standard MTL and our 3D-aware MTL framework. (a) Standard MTL relies solely on 2D per-pixel supervision, while (b) our approach incorporates geometric consistency through a lightweight cross-view module (CvM). (c) Using DINOv3 [simeoni2025dinov3] as the encoder, standard MTL (top) shows cross-view ambiguities (highlighted by arrows), e.g., inconsistent curtain segmentation, leading to reduced inter-task coherence. In contrast, our method yields more consistent predictions across both views and tasks.
  • Figure 2: Illustration of our method for integrating cross-view correlations to enable 3D-aware MTL. Given an image $\bm{I}_1$, we feed it and its neighbouring view $\bm{I}_2$ into the MTL encoder $f(\cdot)$ and extract the MTL features $f(\bm{I}_1)$ and $f(\bm{I}_2)$. In parallel, our lightweight cross-view module (CvM) $g(\cdot)$ takes both views as input. In CvM, a spatial-aware encoder $s(\cdot)$ encodes geometry-biased features, followed by a multi-view transformer $m(\cdot)$ that enables information exchange across views and outputs cross-view features $\bm{F}_1$ and $\bm{F}_2$. A cost volume module $e(\cdot)$ then converts $\bm{F}_1$ and $\bm{F}_2$ into the cost volumes $\bm{C}_1$ and $\bm{C}_2$ by warping the features from one view to the other given their relative camera parameters and matching features across views. Finally, the cost volumes and cross-view features are concatenated with the MTL features, forming the geometry-aware MTL features $\tilde{\bm{F}}_1$ and $\tilde{\bm{F}}_2$ used to predict multiple dense vision tasks.
  • Figure 3: Qualitative comparisons on NYUv2. The first column shows the RGB image, and the remaining columns display either the ground truth or model predictions. Rows one to four show the predictions of SAK, DINOv3, DINOv3 trained with videos as multi-view data, and our method, respectively. The last row shows the ground truth for the four tasks.
  • Figure A4: Qualitative comparisons on NYUv2. The first column shows the RGB image, while the remaining columns present either the ground truth or model predictions. Rows one to four show the predictions of SAK, DINOv3, DINOv3 trained with videos as multi-view data, and our method, respectively. The last row shows the ground truth for the four tasks.
  • Figure A5: Qualitative comparisons on PASCAL-Context. The first column shows the RGB image, while the remaining columns present either the ground truth or model predictions. Rows one to three show the predictions of SAK, DINOv3, and our method, respectively. The last row shows the ground truth for the five tasks.
  • ...and 1 more figure
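
The cost-volume step in the Figure 2 caption (warp features from one view to the other, then match them) can be illustrated with a deliberately simplified sketch. The assumptions here are ours, not the paper's: features live on a single 1D scanline, the "warp" is a plain integer horizontal shift rather than a projection through relative camera parameters, and matching is a channel-averaged dot product. All names (`cost_volume_1d`, `best_shift`, the demo features) are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def cost_volume_1d(
    feat_ref: List[Vector], feat_src: List[Vector], max_shift: int
) -> List[List[float]]:
    """Return volume[d][x]: similarity of reference pixel x and source pixel x - d.

    Simplifying assumption: each candidate d is an integer shift standing in
    for warping with camera parameters at one depth hypothesis.
    """
    width = len(feat_ref)
    channels = len(feat_ref[0])
    volume: List[List[float]] = []
    for shift in range(max_shift + 1):
        row: List[float] = []
        for x in range(width):
            src_x = x - shift  # location in the source view under this shift
            if 0 <= src_x < width:
                row.append(dot(feat_ref[x], feat_src[src_x]) / channels)
            else:
                row.append(0.0)  # no valid correspondence outside the image
        volume.append(row)
    return volume


def best_shift(volume: List[List[float]], x: int) -> int:
    """Shift with the highest matching score at pixel x (first one on ties)."""
    return max(range(len(volume)), key=lambda d: volume[d][x])


# Toy data: the reference scanline is the source scanline moved right by one.
demo_src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
demo_ref = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
demo_volume = cost_volume_1d(demo_ref, demo_src, max_shift=2)
```

In this toy case the matching score peaks at a shift of one for every reference pixel that has a valid correspondence, which is the kind of geometric cue a cost volume is meant to expose. A practical version would warp 2D feature maps with camera intrinsics and relative pose over a set of depth candidates.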