A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to 1st VCL Challenge -- Multi-Task Robustness Track

Zehui Chen, Qiuchen Wang, Zhenyu Li, Jiaming Liu, Shanghang Zhang, Feng Zhao

TL;DR

The paper addresses robust multi-task dense visual prediction under distribution shifts by introducing UniNet, a vanilla multi-task framework that unifies three tasks (monocular 3D object detection, instance segmentation, and depth estimation) within a single model. It adopts a two-stage training strategy: train task-specific models independently, then merge their backbones by weight averaging and fine-tune jointly, using task-specific augmentations. On the SHIFT validation set for the VCL robustness track, UniNet achieves an overall score of 49.6, with per-task metrics of 29.5 Det mAP, 80.3 mTPS, 46.4 Seg mAP, and 7.93 silog, outperforming its individual components. The approach demonstrates that simple, well-structured multi-task fusion can yield strong cross-task performance, and it provides insights into augmentation, initialization, and fusion strategies for dense visual prediction in continual learning scenarios.
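The backbone-merging step of the two-stage strategy can be illustrated with a minimal sketch. It assumes the single-task backbones share the same architecture, so their parameters match by name and shape, and it uses plain Python lists as a hypothetical stand-in for framework tensors. The function name `average_backbones` and the toy parameter names are illustrative only and are not taken from the paper.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a framework state dict: parameter name -> flat values.
StateDict = Dict[str, List[float]]


def average_backbones(state_dicts: Sequence[StateDict]) -> StateDict:
    """Return the element-wise mean of parameters that match by name and length."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    reference = state_dicts[0]
    merged: StateDict = {}
    for name, ref_values in reference.items():
        # Every model must provide the same parameter with the same size.
        for sd in state_dicts[1:]:
            if name not in sd or len(sd[name]) != len(ref_values):
                raise ValueError(f"parameter {name!r} does not match across models")
        merged[name] = [
            sum(sd[name][i] for sd in state_dicts) / len(state_dicts)
            for i in range(len(ref_values))
        ]
    return merged


# Toy backbones from three hypothetical single-task models.
det_backbone = {"stem.weight": [1.0, 2.0], "stem.bias": [0.0]}
seg_backbone = {"stem.weight": [2.0, 4.0], "stem.bias": [3.0]}
depth_backbone = {"stem.weight": [3.0, 6.0], "stem.bias": [6.0]}

merged = average_backbones([det_backbone, seg_backbone, depth_backbone])
print(merged)
```

In this sketch the merged weights would serve only as the initialization for the joint fine-tuning stage; the task heads are left out of the averaging because they differ per task.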

Abstract

In this report, we present our solution to the multi-task robustness track of the 1st Visual Continual Learning (VCL) Challenge at the ICCV 2023 Workshop. We propose a vanilla framework named UniNet that seamlessly combines various visual perception algorithms into a multi-task model. Specifically, we choose DETR3D, Mask2Former, and BinsFormer for the 3D object detection, instance segmentation, and depth estimation tasks, respectively. The final submission is a single model with an InternImage-L backbone that achieves a 49.6 overall score (29.5 Det mAP, 80.3 mTPS, 46.4 Seg mAP, and 7.93 silog) on the SHIFT validation set. We also report several observations from our experiments that may facilitate the development of multi-task learning in dense visual prediction.
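The overall structure described here, one backbone feeding several task heads, can be sketched as follows. This is a structural illustration only: the class name `MultiTaskModel`, the toy backbone, and the toy heads are hypothetical placeholders, and plain Python callables stand in for the real networks (InternImage-L, DETR3D, Mask2Former, BinsFormer).

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for backbone feature maps.
Features = List[float]


class MultiTaskModel:
    """One shared backbone whose features are reused by every task head."""

    def __init__(
        self,
        backbone: Callable[[List[float]], Features],
        heads: Dict[str, Callable[[Features], object]],
    ) -> None:
        self.backbone = backbone
        self.heads = heads

    def forward(self, image: List[float]) -> Dict[str, object]:
        # Features are computed once and passed to each head in turn.
        features = self.backbone(image)
        return {task: head(features) for task, head in self.heads.items()}


def toy_backbone(image: List[float]) -> Features:
    """Toy feature extractor; a real model would use an image backbone."""
    return [2.0 * pixel for pixel in image]


# Toy heads; real heads would be far more complex task-specific decoders.
model = MultiTaskModel(
    backbone=toy_backbone,
    heads={
        "det3d": lambda f: max(f),
        "instance_seg": lambda f: [v > 1.0 for v in f],
        "depth": lambda f: sum(f) / len(f),
    },
)

outputs = model.forward([0.25, 0.5, 1.0])
print(sorted(outputs))
```

Keeping the heads in a dictionary keyed by task name makes it straightforward to evaluate or fine-tune each task separately while still sharing a single feature computation.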

Paper Structure (12 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Framework overview of UniNet, which consists of an image backbone and three task-specific heads.
  • Figure 2: Per-task evaluation results during the fine-tuning process of UniNet, covering 3D object detection, instance segmentation, and depth estimation.
  • Figure 3: Visualization of predictions from our final UniNet model with an InternImage-L backbone on the instance segmentation, 3D detection, and depth estimation tasks.