A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to 1st VCL Challenge -- Multi-Task Robustness Track
Zehui Chen, Qiuchen Wang, Zhenyu Li, Jiaming Liu, Shanghang Zhang, Feng Zhao
TL;DR
The paper addresses robust multi-task dense visual prediction under distribution shifts by introducing UniNet, a vanilla multi-task framework that unifies three tasks (monocular 3D object detection, instance segmentation, and depth estimation) within a single model. It adopts a two-stage training strategy: first train task-specific models independently, then merge their backbones by weight averaging and fine-tune the merged model jointly with task-specific augmentations. On the SHIFT dataset for the VCL robustness track, UniNet achieves an overall score of 49.6, with per-task metrics of 29.5 Det mAP, 80.3 mTPS, 46.4 Seg mAP, and 7.93 silog, outperforming its individual components. The approach demonstrates that simple, well-structured multi-task fusion can yield strong cross-task performance, and it provides insights into augmentation, initialization, and fusion strategies for dense visual prediction in continual learning scenarios.
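The backbone-merging step can be illustrated with a minimal sketch. It assumes a uniform element-wise mean over identically named parameters; the summary says only "weight averaging", so the exact weighting is an assumption. The function name `average_backbones`, the parameter names, and the toy values are hypothetical, and plain Python lists stand in for real tensors.

```python
from typing import Dict, List, Sequence

# A toy "state dict": parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def average_backbones(state_dicts: Sequence[Weights]) -> Weights:
    """Return the element-wise mean of parameters shared by all backbones."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    keys = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != keys:
            raise ValueError("backbones must have identical parameter names")

    merged: Weights = {}
    for name in state_dicts[0]:
        params = [sd[name] for sd in state_dicts]
        if len({len(p) for p in params}) != 1:
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Average position by position across the task-specific backbones.
        merged[name] = [sum(vals) / len(vals) for vals in zip(*params)]
    return merged


# Hypothetical backbones from the three independently trained task models.
det = {"stem.weight": [0.0, 3.0], "stem.bias": [1.0]}
seg = {"stem.weight": [3.0, 3.0], "stem.bias": [2.0]}
depth = {"stem.weight": [6.0, 0.0], "stem.bias": [3.0]}

merged = average_backbones([det, seg, depth])
print(merged)
```

In a real pipeline the merged weights would initialize the shared backbone, the task-specific heads would keep their own trained weights, and the whole model would then be fine-tuned jointly.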
Abstract
In this report, we present our solution to the multi-task robustness track of the 1st Visual Continual Learning (VCL) Challenge at the ICCV 2023 Workshop. We propose a vanilla framework named UniNet that seamlessly combines various visual perception algorithms into a multi-task model. Specifically, we choose DETR3D, Mask2Former, and BinsFormer for the 3D object detection, instance segmentation, and depth estimation tasks, respectively. The final submission is a single model with an InternImage-L backbone that achieves a 49.6 overall score (29.5 Det mAP, 80.3 mTPS, 46.4 Seg mAP, and 7.93 silog) on the SHIFT validation set. We also report several observations from our experiments that may facilitate the development of multi-task learning in dense visual prediction.
