From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios
Changliang Xia, Chengyou Jia, Zhuohang Dang, Minnan Luo, Zhihui Li, Xiaojun Chang
TL;DR
This work tackles the gap between dense-prediction research and real-world deployment, where data scarcity and cross-task generalization are the main obstacles. It introduces DenseWorld, a 25-task benchmark with unified evaluation that reflects real-world scenarios, and DenseDiT, a diffusion-transformer framework that preserves pretrained visual priors through a parameter-reuse paradigm and two lightweight branches (prompt and demonstration) for data-efficient adaptation. On DenseWorld, DenseDiT outperforms both general-purpose and task-specific baselines while using far less training data and only a tiny fraction of additional parameters, and it remains effective on common benchmarks and across backbones. The results demonstrate the practical value of leveraging generative priors for unified, real-world dense prediction and point to promising directions for future research on data-efficient, cross-domain vision tasks.
Abstract
Dense prediction tasks are of significant importance in computer vision, aiming to learn pixel-wise annotated labels for input images. Despite advances in this field, existing methods primarily focus on idealized conditions; they generalize poorly to real-world settings and struggle with the acute scarcity of real-world data in practical scenarios. To systematically study this problem, we first introduce DenseWorld, a benchmark spanning a broad set of 25 dense prediction tasks that correspond to urgent real-world applications, featuring unified evaluation across tasks. We then propose DenseDiT, which exploits the visual priors of generative models to perform diverse real-world dense prediction tasks through a unified strategy. DenseDiT combines a parameter-reuse mechanism with two lightweight branches that adaptively integrate multi-scale context. This design enables efficient tuning with less than 0.1% additional parameters, activating the visual priors while effectively adapting to diverse real-world dense prediction tasks. Evaluations on DenseWorld reveal significant performance drops in existing general and specialized baselines, highlighting their limited real-world generalization. In contrast, DenseDiT achieves superior results using less than 0.01% of the training data used by the baselines, underscoring its practical value for real-world deployment.
