DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions

Xiaoyu Lin, Aniket Ghorpade, Hansheng Zhu, Justin Qiu, Dea Rrozhani, Monica Lama, Mick Yang, Zixuan Bian, Ruohan Ren, Alan B. Hong, Jiatao Gu, Chris Callison-Burch

TL;DR

DenseAnnotate presents an audio-driven platform and associated multilingual, multicultural dense-captioning datasets (MLDC-MC for 2D images and MLDC-3D for 3D scenes) to address the scarcity of high-quality dense annotations. By unifying 2D and 3D annotation workflows, employing a multi-stage QA and summarization pipeline, and providing open access to large-scale multilingual data, the work demonstrates substantial improvements in multilingual captioning, cultural alignment, and 3D spatial reasoning. The suite includes a PointLLM-based scene model evaluated on 3D data, showing meaningful gains after targeted fine-tuning, and highlights the platform’s potential to advance vision-language research across diverse languages and modalities. Overall, DenseAnnotate offers a scalable, open framework that combines human expressiveness with automated summarization to enable more inclusive and capable multimodal AI systems.

Abstract

With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations, mined from the Internet or typed manually, that capture only a fraction of an image's visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. Our platform incorporates speech-to-text transcription and region-of-attention marking. To demonstrate the effectiveness of DenseAnnotate, we conducted case studies involving over 1,000 annotators across two domains: culturally diverse images and 3D scenes. We curated a human-annotated multimodal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this dataset exhibit improvements of 5% in multilingual capability, 47% in cultural alignment, and 54% in 3D spatial capability. Our results show that our platform offers a feasible approach for future vision-language research and can be applied to various tasks and diverse types of data.
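
To make the idea of linking spoken phrases to image regions concrete, the following is a minimal sketch, not the platform's actual implementation. It assumes the speech-to-text step yields word-level timestamps and that each region mark is recorded as a timestamped click; the names `Word`, `Click`, and `link_phrases_to_points`, and the rule "a word belongs to the most recent click before it", are all hypothetical choices made for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its (assumed) start/end time in seconds."""
    text: str
    start: float
    end: float


@dataclass
class Click:
    """One region mark: when the annotator clicked and where (pixel coords)."""
    time: float
    x: float
    y: float


def link_phrases_to_points(words, clicks):
    """Attach each word to the most recent click at or before its start time.

    Words spoken before the first click are left unlinked (dropped here).
    This heuristic is an assumption for the sketch, not the paper's method.
    """
    clicks = sorted(clicks, key=lambda c: c.time)
    click_times = [c.time for c in clicks]
    phrases = [[] for _ in clicks]
    for word in sorted(words, key=lambda w: w.start):
        idx = bisect_right(click_times, word.start) - 1
        if idx >= 0:
            phrases[idx].append(word.text)
    return [(click, " ".join(tokens)) for click, tokens in zip(clicks, phrases)]


if __name__ == "__main__":
    demo_words = [
        Word("a", 0.2, 0.4),
        Word("red", 0.6, 0.8),
        Word("lantern", 0.9, 1.3),
        Word("wooden", 2.1, 2.5),
        Word("door", 2.6, 2.9),
    ]
    demo_clicks = [Click(0.5, 120.0, 80.0), Click(2.0, 300.0, 210.0)]
    for click, phrase in link_phrases_to_points(demo_words, demo_clicks):
        print(f"({click.x}, {click.y}) -> {phrase!r}")
```

A time-window rule (for example, words within a fixed interval around each click) would be an equally plausible alternative; the sketch only illustrates the kind of alignment that audio-aligned dense annotation requires.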

Paper Structure

This paper contains 24 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of our Multilingual Dense Captioning (MLDC) dataset. It includes multicultural images with keypoint annotations aligned with the captions, as well as annotated 3D scenes and their corresponding 3D objects. Partly omitted for clarity.
  • Figure 2: Dense captioning generation. a. Customized Task Creation: After task creators upload unlabeled images or 3D scenes, the platform will generate advice to guide annotators. b. Data Collection: Annotators can record audio while performing Point & Name on images or Interact & Segment on 3D scenes. c. Captions Improvement: Summarizing into high-quality captions.
  • Figure 3: Captions for an image from COCO, ShareGPT4V, and our MLDC-MC dataset. The English annotation is summarized from three individual annotations and post-processed. The COCO caption lacks detail, while the ShareGPT4V caption contains several inaccuracies, which are in red. Our annotation is both detailed and accurate.
  • Figure 4: Overview of the 3D scene dataset structure (MLDC-3D). This figure selects one representative example to showcase the full scope of our dataset. The upper-left corner displays the 3D scene generated via HOLODECK 2.0. The data structure is presented in three columns: The first column illustrates Individual-Object Dense Captioning, providing fine-grained descriptions. The second column presents Open-Ended QA (OEQA) pairs derived from the dense transcriptions of human annotations. The third column shows the Multiple Choice QA (MCQA) generated based on the OEQA pairs. Note that the OEQA and MCQA examples shown in the figure only represent a subset of the available question types. The full set of categories and their distribution, which highlights the dataset's comprehensive coverage across various tasks (including scene understanding and spatial reasoning), is presented in the accompanying pie charts.
  • Figure 5: Human evaluation results of visual grounding capability. We compare the Qwen3 model fine-tuned on MLDC-MC with the vanilla Qwen3 on three dimensions: Point–Caption Consistency, Spatial Accuracy, and Object Coverage Completeness. Each bar shows the percentage of test set images.
  • ...and 6 more figures