MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

Sankalp Sinha, Mohammad Sadil Khan, Muhammad Usama, Shino Sam, Didier Stricker, Sk Aziz Ali, Muhammad Zeshan Afzal

TL;DR

MARVEL-40M+ is presented as the largest multi-level text-to-3D captioning resource to date. It combines automated multi-view vision-language descriptions with domain-specific human metadata to improve annotation quality and linguistic diversity. The work introduces a five-level captioning schema and a two-stage TT3D pipeline, MARVEL-FX3D, which fine-tunes Stable Diffusion on MARVEL annotations and uses SF3D to generate texture-rich 3D meshes in about 15 s. Empirical results show improvements in annotation richness, image-text alignment, and high-fidelity TT3D generation over prior datasets and baselines, with both GPT-4 and human evaluators preferring MARVEL (win rates of 72.41% by GPT-4 and 73.40% by human evaluators). The framework is scalable and cost-aware, targets fast, accurate TT3D content creation for gaming, AR/VR, and film production, and is accompanied by supplementary material detailing metadata usage, hierarchical prompts, and implementation specifics.

Abstract

Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed descriptions (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from the source datasets into our annotation pipeline to add domain-specific information to the annotations and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline. We fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate 3D textured meshes within 15 s. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of 72.41% by GPT-4 and 73.40% by human evaluators. Project page is available at https://sankalpsinha-cmos.github.io/MARVEL/.
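
To make the multi-level structure concrete, the following is a minimal Python sketch of how one asset's annotations might be stored and checked against the word budgets quoted in the abstract. It is an illustration only, not the authors' data format: the class `MultiLevelAnnotation`, its field names, and the ordering of levels from most detailed to most concise are all assumptions. Only the two endpoint budgets (150-200 words and 10-20 words) come from the abstract, so the three intermediate levels are deliberately left without a budget.

```python
from dataclasses import dataclass
from typing import Optional

NUM_LEVELS = 5  # the summary describes a five-level captioning schema

# (min_words, max_words) per level, ordered from most detailed to most concise.
# ASSUMPTION: this ordering is illustrative. Only the two endpoints are stated
# in the abstract; the middle levels are None rather than guessed.
WORD_BUDGETS: list[Optional[tuple[int, int]]] = [
    (150, 200), None, None, None, (10, 20),
]


@dataclass
class MultiLevelAnnotation:
    """Hypothetical container for one 3D asset's captions."""
    asset_id: str          # hypothetical identifier, not a real dataset key
    source_metadata: str   # human metadata carried over from the source dataset
    levels: list[str]      # five captions, most detailed first

    def budget_violations(self) -> list[int]:
        """Return indices of levels whose word count is outside a known budget."""
        if len(self.levels) != NUM_LEVELS:
            raise ValueError(
                f"expected {NUM_LEVELS} levels, got {len(self.levels)}"
            )
        bad = []
        for i, (text, budget) in enumerate(zip(self.levels, WORD_BUDGETS)):
            if budget is None:
                continue  # no stated budget for this level, so nothing to check
            lo, hi = budget
            n_words = len(text.split())
            if not lo <= n_words <= hi:
                bad.append(i)
        return bad
```

A check of this kind could serve as a cheap sanity filter on generated captions before they are used for fine-tuning; whether the authors apply such a filter is not stated in this summary.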

Paper Structure

This paper contains 30 sections, 24 figures, 7 tables, 1 algorithm.

Figures (24)

  • Figure 1: Left: Examples of MARVEL annotations created using our proposed pipeline, which produces high-quality, domain-specific, multi-level text descriptions for 3D assets (see the annotation pipeline section). Right: Qualitative results from MARVEL-FX3D, our two-stage text-to-3D pipeline, which can generate a textured mesh from text within 15 s (see the MARVEL-FX3D architecture section). Please zoom in for details.
  • Figure 2: Left: MARVEL annotation pipeline for 3D assets. Our pipeline starts with human metadata [objaverse, objaversexl] and rendered multi-view images to create detailed visual descriptions using InternVL-2 [internvl]. These contain object names, shapes, textures, colors, and environments. Qwen2 [qwen2] then processes these descriptions into five hierarchical levels, progressively compressing different aspects of the 3D assets. Right: Our Text-to-3D pipeline finetunes SD 3.5 [sd3, sd3_huggingface] with this dataset and uses pretrained SF3D [sf3d] to generate a textured mesh in 15 s.
  • Figure 3: Qualitative Annotation Comparison: From top to bottom: Cap3D [cap3d], 3DTopia [3dtopia], Kabra [ICML2024_LeveragingVLMs], MARVEL (Level-4) annotations and GPT-4 [achiam2023gpt] evaluation. MARVEL consistently provides the most comprehensive and precise annotations, capturing intricate details such as object names, color, structure, and specific attributes. Red is for wrong captions. Green shows important information.
  • Figure 4: Visual results of high fidelity TT3D generation. Left to right: the reconstructed 3D assets from Shap-E [shap_e], DreamFusion [dreamfusion], Lucid-Dreamer [luciddreamer], HIFA [hifa] and MARVEL-FX3D.
  • Figure 5: MARVEL uses human-generated metadata from source datasets to create detailed, accurate captions (e.g., names of the lunar craters, detection of human footprints) and reduce hallucinations. Without metadata, VLMs like GPT-4 [achiam2023gpt] and InternVL2 [internvl1.5] generate vague or speculative descriptions.
  • ...and 19 more figures