DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi

TL;DR

DenseWorld-1M tackles the need for fine-grained grounded captions in real-world imagery by introducing a three-stage labeling pipeline that yields pixel-level masks, object-level detailed captions, and scene-level dense grounded captions. Two specialized models, Detailed Region Caption (DRC) and Spatial Caption Merging (SCM), accelerate labeling and improve grounding fidelity. Extensive experiments across vision-language understanding, grounding, and region-caption tasks demonstrate improvements on multiple benchmarks, validating the dataset's utility for pretraining and evaluation. The work releases both the DenseWorld-1M data and the labeling models to spur progress in fine-grained visual grounding and reasoning for Multimodal Large Language Models.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale, high-quality datasets. Most existing caption datasets lack grounded locations and relations for visual entities. Existing grounded caption datasets, in turn, lack detailed descriptions, relations, and descriptions of the many objects found in high-resolution images. To fill this gap for the community, we present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline comprising open-world perception, detailed object caption generation, and dense caption merging. The first stage obtains entity-level masks and labels. The second stage generates object-level, detailed captions guided by the masks and labels from the first stage. The final stage merges object captions and masks into spatial and relational dense captions. To accelerate the labeling process and improve caption quality, we present two VLMs: the Detailed Region Caption model and the Spatial Caption Merging model. Extensive experiments in various settings, including vision-language understanding, visual grounding, and region caption generation, demonstrate the effectiveness of our DenseWorld-1M dataset and labeling models.
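
To make the data flow of the three-stage pipeline concrete, the following is a minimal Python sketch, not the authors' implementation. Every name in it (`Entity`, `SceneAnnotation`, `run_pipeline`, and the `toy_*` stand-ins) is hypothetical, as is the `[id]` tag format used to tie caption phrases to masks. The toy functions only mark where an open-world perception model, the Detailed Region Caption model, and the Spatial Caption Merging model would be called.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Entity:
    """One visual entity: stage 1 fills id/label/mask, stage 2 fills caption."""
    entity_id: int
    label: str
    mask: List[List[int]]  # binary mask as rows of 0/1 (assumed layout)
    caption: str = ""


@dataclass
class SceneAnnotation:
    """Final record: per-entity results plus the merged scene-level caption."""
    entities: List[Entity] = field(default_factory=list)
    dense_caption: str = ""


def run_pipeline(
    image: Any,
    perceive: Callable[[Any], List[Entity]],
    caption_region: Callable[[Any, Entity], str],
    merge_captions: Callable[[Any, List[Entity]], str],
) -> SceneAnnotation:
    # Stage 1: open-world perception yields entity-level masks and labels.
    entities = perceive(image)
    # Stage 2: caption each entity, guided by its mask and label.
    for entity in entities:
        entity.caption = caption_region(image, entity)
    # Stage 3: merge object captions and masks into one dense grounded caption.
    dense_caption = merge_captions(image, entities)
    return SceneAnnotation(entities=entities, dense_caption=dense_caption)


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_perceive(image: Any) -> List[Entity]:
    return [
        Entity(1, "dog", [[1, 0], [0, 0]]),
        Entity(2, "ball", [[0, 0], [0, 1]]),
    ]


def toy_caption_region(image: Any, entity: Entity) -> str:
    return f"a {entity.label}"


def toy_merge(image: Any, entities: List[Entity]) -> str:
    # The "[id]" suffix is an assumed convention for grounding phrases to masks.
    return " next to ".join(f"{e.caption} [{e.entity_id}]" for e in entities)


annotation = run_pipeline(None, toy_perceive, toy_caption_region, toy_merge)
print(annotation.dense_caption)
```

Passing the three stages as callables keeps the sketch independent of any particular model, so each toy function could be swapped for a real one without changing `run_pipeline`.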

Paper Structure

This paper contains 17 sections, 16 figures, 10 tables.

Figures (16)

  • Figure 1: DenseWorld-1M Annotation Example. DenseWorld-1M contains extremely dense and detailed grounded captions, with three-stage outputs. Current proprietary models, such as GPT-4 and Gemini, are unable to generate such captions, even when provided with pixel-level tags as visual prompts. The inconsistencies between the text and object ID numbers in the detailed grounded captions generated by GPT-4o are highlighted in red. Best viewed in color.
  • Figure 2: DenseWorld-1M labeling pipeline. We present a three-stage pipeline, including stage-1 for pixel-level mask generation, stage-2 for object-level detailed caption, and stage-3 for scene-level detailed dense grounded caption. Note that there are no human costs in the loop.
  • Figure 3: The proposed Detailed Region Caption model. DRC combines both visual patch embedding and ID patch embedding to generate a more fine-grained and accurate description of object captions.
  • Figure 4: The proposed Spatial Caption Merging model. With multiple inputs, SCM merges multiple detailed object captions into one fluent, dense, grounded caption.
  • Figure 5: Visual comparison on dense grounded captioning. After training on our dataset, the baseline model generates more fine-grained and dense grounded captions. Best viewed in color.
  • ...and 11 more figures