Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models

Beomseok Kang, Niluthpol Chowdhury Mithun, Mikhail Sizintsev, Han-Pang Chiu, Supun Samarasekera

TL;DR

FAMDA is a simple yet effective UDA framework for multi-task dense prediction. Rather than relying on adversarial learning, as existing multi-task UDA methods primarily do, it leverages Vision Foundation Models (VFMs) as powerful teachers within a self-training paradigm to generate high-quality pseudo-labels for the target domain.

Abstract

Multi-task dense prediction, which aims to jointly solve tasks like semantic segmentation and depth estimation, is crucial for robotics applications but suffers from domain shift when models are deployed in new environments. While unsupervised domain adaptation (UDA) addresses this challenge for single tasks, existing multi-task UDA methods primarily rely on adversarial learning approaches that are less effective than recent self-training techniques. In this paper, we introduce FAMDA, a simple yet effective UDA framework that addresses this limitation by leveraging Vision Foundation Models (VFMs) as powerful teachers within a self-training paradigm. Our approach uses segmentation and depth foundation models to generate high-quality pseudo-labels for the target domain, distilling their robust generalization capabilities into a single, efficient student network. Extensive experiments show that FAMDA achieves state-of-the-art (SOTA) performance on standard synthetic-to-real UDA multi-task learning (MTL) benchmarks and on a challenging new day-to-night adaptation task. Our framework also enables the training of highly efficient models: a lightweight variant achieves SOTA accuracy while being more than 10× smaller than foundation models, highlighting FAMDA's suitability for creating domain-adaptive and efficient models for resource-constrained robotics applications.

Paper Structure

This paper contains 36 sections, 8 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison with recent UDA methods. (a) shows qualitative semantic segmentation and depth estimation results on Cityscapes after UDA (SYNTHIA$\rightarrow$Cityscapes) using lightweight backbones (MiT-B0, B1, and B2). The results are obtained from multi-task prediction models. (b) and (c) plot model size against performance in SYNTHIA$\rightarrow$Cityscapes adaptation: (b) mIoU for semantic segmentation (higher is better) and (c) RMSE for depth estimation (lower is better). Baselines include XTAM [lopes2023cross], MTL-UDA [vandenhende2021multi], STL-UDA [vandenhende2021multi], MulT [bhattacharjee2022mult], VTAGML [bhattacharjee2023vision], and Depth Anything (DAM) [yang2024depth]. Since STL-UDA is single-task, its model size is reported as doubled to reflect two independent models for segmentation and depth. The sizes of MulT and VTAGML are obtained from Table 6 in [bhattacharjee2023vision]. DAM is a vision foundation model for depth estimation with a ViT-L backbone. Our approach is shown with four backbones (B0, B1, B2, and B5, from left to right). Our method achieves superior performance with significantly smaller models.
  • Figure 2: Overview of the proposed approach. The framework consists of four key components: a student–teacher pair of networks for EMA-based self-training in UDA, a segmentation foundation model that refines pseudo-labels generated by the teacher ($\tilde{y}_\text{seg, T}$), and a depth foundation model that produces pseudo-depth maps for target images ($\tilde{y}_\text{dep, T}$). EMA updates are applied only to the shared feature extractors and segmentation decoders.
  • Figure 3: Qualitative results on the low-light sensor dataset. Our multi-task model (MiT-B5) and single-task VFMs (SSAM and DAM) are compared.
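
The EMA-based self-training in the Figure 2 caption can be illustrated with a minimal sketch of a selective teacher update. This is not the authors' implementation: it assumes the common mean-teacher rule, teacher ← α·teacher + (1 − α)·student, and the parameter names (`encoder.`, `seg_decoder.`, `depth_decoder.`), the dictionary-of-lists parameter format, and the momentum values are all hypothetical. The only detail taken from the caption is that EMA updates touch the shared feature extractor and the segmentation decoder, and nothing else.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter store: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(
    teacher: Params,
    student: Params,
    alpha: float = 0.999,  # assumed momentum; not specified in this section
    ema_prefixes: Tuple[str, ...] = ("encoder.", "seg_decoder."),  # assumed names
) -> None:
    """Update the teacher in place: t <- alpha * t + (1 - alpha) * s.

    Only parameters whose name starts with one of `ema_prefixes` are updated;
    all other teacher parameters (here, the depth decoder) are left untouched.
    """
    for name, student_vals in student.items():
        if not name.startswith(ema_prefixes):
            continue  # skipped: outside the shared encoder / segmentation decoder
        teacher_vals = teacher[name]
        teacher[name] = [
            alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_vals, student_vals)
        ]


# Toy usage with alpha = 0.5 so the arithmetic is easy to follow.
student = {"encoder.w": [1.0, 2.0], "seg_decoder.w": [4.0], "depth_decoder.w": [8.0]}
teacher = {"encoder.w": [0.0, 0.0], "seg_decoder.w": [0.0], "depth_decoder.w": [0.0]}
ema_update(teacher, student, alpha=0.5)
print(teacher)
```

In this toy run the encoder and segmentation-decoder weights move halfway toward the student, while the depth-decoder weights stay at their initial values. That matches the caption's description, in which depth pseudo-labels for target images come from the depth foundation model rather than from the EMA teacher.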