From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios

Changliang Xia, Chengyou Jia, Zhuohang Dang, Minnan Luo, Zhihui Li, Xiaojun Chang

TL;DR

This work tackles the gap between dense-prediction research and real-world deployment by highlighting data scarcity and cross-task generalization challenges. It introduces DenseWorld, a 25-task benchmark with unified evaluation to reflect real-world scenarios, and DenseDiT, a diffusion-transformer framework that preserves pretrained visual priors through a parameter-reuse paradigm and two lightweight branches (prompt and demonstration) for data-efficient adaptation. On DenseWorld, DenseDiT outperforms both general-purpose and task-specific baselines while using far less training data and only a tiny fraction of additional parameters, and it remains effective on common benchmarks and across backbones. The results demonstrate the practical value of leveraging generative priors for unified, real-world dense prediction and point to promising directions for future research in data-efficient, cross-domain vision tasks.

Abstract

Dense prediction tasks hold significant importance in computer vision, aiming to learn pixel-wise annotated labels for input images. Despite advances in this field, existing methods primarily focus on idealized conditions, exhibiting limited real-world generalization and struggling with the acute scarcity of real-world data in practical scenarios. To systematically study this problem, we first introduce DenseWorld, a benchmark spanning a broad set of 25 dense prediction tasks that correspond to urgent real-world applications, featuring unified evaluation across tasks. We then propose DenseDiT, which exploits generative models' visual priors to perform diverse real-world dense prediction tasks through a unified strategy. DenseDiT combines a parameter-reuse mechanism and two lightweight branches that adaptively integrate multi-scale context. This design enables DenseDiT to achieve efficient tuning with less than 0.1% additional parameters, activating the visual priors while effectively adapting to diverse real-world dense prediction tasks. Evaluations on DenseWorld reveal significant performance drops in existing general and specialized baselines, highlighting their limited real-world generalization. In contrast, DenseDiT achieves superior results using less than 0.01% of the baselines' training data, underscoring its practical value for real-world deployment.

Paper Structure

This paper contains 30 sections, 4 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: Comparison of idealized vs. real-world dense prediction. (Left) Idealized tasks under controlled conditions with uniform lighting, minimal occlusion, and abundant data. (Right) Real-world tasks exhibiting complex scenes, adverse conditions, and inherent data scarcity, presenting substantially greater challenges.
  • Figure 2: Overview of the DenseWorld benchmark. Upper left: the construction pipeline. Center left: examples of representative tasks across five real-world categories. Lower left: unified evaluation. Right: full taxonomy of 25 dense prediction tasks, each aligned with a practical application scenario.
  • Figure 3: Overview of the DenseDiT architecture. The framework processes a query image through a parameter-reused VAE encoder, while lightweight prompt and demo branches provide contextual cues. These elements interact within the generative backbone, requiring only LoRA-based fine-tuning to achieve dense prediction in complex, data-scarce real-world scenarios.
  • Figure 4: Qualitative comparisons on pixel-level regression. In the first and second rows, DenseDiT successfully predicts occluded structures in fog or shadow, highlighting its capability for scene-level reasoning. The third row showcases DenseDiT’s ability to capture fine-grained details such as distant lampposts and layered foliage. The fourth row emphasizes its sensitivity to abrupt depth transitions, producing sharper and more consistent boundaries than competing models.
  • Figure 5: Qualitative comparisons on pixel-level classification. DenseDiT handles cluttered backgrounds (row 1), detects dynamic concepts like fire (row 2), and localizes fine structures in medical/satellite images (rows 3-4).
  • ...and 3 more figures
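
The Figure 3 caption mentions LoRA-based fine-tuning, and the abstract reports less than 0.1% additional parameters. As a rough illustration of why low-rank adapters add so few parameters, the sketch below counts the extra weights for a generic LoRA update of the form W + BA. It is not the authors' implementation: the function names, the layer shapes, and the rank are hypothetical values chosen only for the arithmetic, and the result is not meant to reproduce the paper's reported figure.

```python
from typing import List, Tuple


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Extra parameters for a rank-`rank` update W + B @ A of a (d_out x d_in) weight.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


def added_fraction(layers: List[Tuple[int, int]], rank: int) -> float:
    """Ratio of LoRA parameters to frozen base parameters over the adapted layers."""
    base = sum(d_in * d_out for d_in, d_out in layers)
    extra = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layers)
    return extra / base


# Hypothetical setup: ten square 3072 x 3072 projections adapted at rank 4.
layers = [(3072, 3072)] * 10
fraction = added_fraction(layers, rank=4)
print(f"added parameters: {fraction:.4%} of the adapted layers")
```

For a square d x d layer the ratio reduces to 2r/d, so it shrinks as the hidden size grows or the rank drops. Measured against a whole model, where many weights are not adapted at all, the share is smaller still.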