Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy

Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Bhargava Kumar, Amit Agarwal, Ishan Banerjee, Srikant Panda, Tejaswini Kumar

TL;DR

This survey analyzes large multimodal model datasets and proposes a three-tier taxonomy (training-specific, task-specific, and domain-specific) to organize datasets used for pretraining, instruction tuning, and domain application. It covers multimodal pretraining (MM-PT) and multimodal instruction-tuning (MM-IT) datasets, highlights key task-specific datasets and benchmarks (e.g., SlideVQA, OmniACT, HowTo100M) and domain-specific resources (medical imaging, autonomous driving, geospatial data, egocentric environments), and discusses dataset scale, modality diversity, and annotation quality. The paper also addresses challenges such as data bias, privacy, and computational demands, and outlines emerging needs for diverse, representative data and standardized benchmarking practices. Overall, it emphasizes the critical role of curated, well-documented multimodal datasets in advancing robust, ethically aligned multimodal large language models (MLLMs) with broad real-world impact.

Abstract

Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems by integrating and analyzing diverse types of data, including text, images, audio, and video. Inspired by the human ability to assimilate information through many senses, this approach enables applications such as text-to-video conversion, visual question answering, and image captioning. This survey highlights recent developments in datasets that support multimodal large language models (MLLMs). Large-scale multimodal datasets are essential because they allow for thorough training and testing of these models. With an emphasis on their contributions to the discipline, the study examines a variety of datasets, including those for training, domain-specific tasks, and real-world applications. It also emphasizes how crucial benchmark datasets are for assessing models' performance, scalability, and applicability across a range of scenarios. Because multimodal learning continues to evolve, overcoming the open challenges in building and evaluating these datasets will help advance AI research and applications.

Paper Structure

This paper contains 36 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Flowchart describing the multimodal language model pipeline
  • Figure 2: High-level classification of the datasets covered in the survey into training-specific (datasets under MM-PT and MM-IT), task-specific, and domain-specific categories
  • Figure 3: The datasets released as part of MLLMs
  • Figure 4: Datasets covered in the survey, grouped by task-specific needs
  • Figure 5: Datasets grouped by various domains