MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

Sankalp Sinha; Mohammad Sadil Khan; Muhammad Usama; Shino Sam; Didier Stricker; Sk Aziz Ali; Muhammad Zeshan Afzal

TL;DR

MARVEL-40M+ is presented as the largest multi-level text-to-3D (TT3D) captioning resource to date. It combines automated multi-view vision-language descriptions with human-written metadata from the source datasets to improve annotation quality and linguistic diversity. The work introduces a five-level captioning schema and a two-stage TT3D pipeline, MARVEL-FX3D, which fine-tunes Stable Diffusion on MARVEL annotations and uses SF3D to generate textured 3D meshes in around 15 s. Empirical results show richer annotations, better image-text alignment, and higher-fidelity TT3D generation than prior datasets and baselines, with both GPT-4 and human evaluators preferring MARVEL annotations (win rates of 72.41% and 73.40%, respectively). The framework is scalable and cost-aware, targets fast and accurate TT3D content creation for gaming, AR/VR, and film production, and comes with supplementary material covering metadata usage, hierarchical prompts, and implementation details.

Abstract

Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed descriptions (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from the source datasets into our annotation pipeline to add domain-specific information to our annotations and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline: we fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate textured 3D meshes within 15 s. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of 72.41% by GPT-4 and 73.40% by human evaluators. The project page is available at https://sankalpsinha-cmos.github.io/MARVEL/.
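
To make the multi-level structure concrete, the following is a minimal illustrative sketch of how one asset's annotations might be stored and checked against the word-count ranges quoted in the abstract. It is not the authors' implementation: the class name `MultiLevelAnnotation`, the field names, and the level numbering (1 for the most detailed description, 5 for the semantic tags) are assumptions made for illustration. Only the two ranges stated in the abstract are checked; the intermediate levels are left unconstrained because their lengths are not given here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Word-count ranges quoted in the abstract. The level numbers are an
# assumption (1 = most detailed, 5 = semantic tags); ranges for the
# intermediate levels are not stated in the abstract, so they are omitted.
WORD_RANGES: Dict[int, Tuple[int, int]] = {1: (150, 200), 5: (10, 20)}


@dataclass
class MultiLevelAnnotation:
    """Hypothetical container for one 3D asset's multi-level captions."""
    asset_id: str
    captions: Dict[int, str]  # level number -> caption text

    def out_of_range_levels(self) -> List[int]:
        """Return the levels whose caption length falls outside its known range."""
        bad = []
        for level, (low, high) in WORD_RANGES.items():
            text = self.captions.get(level)
            if text is None:
                continue  # level not present; nothing to check
            n_words = len(text.split())
            if not low <= n_words <= high:
                bad.append(level)
        return bad
```

A check of this kind could serve as a simple sanity filter over generated captions, for example flagging an asset whose detailed description came out far shorter than expected.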
