Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy
Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Bhargava Kumar, Amit Agarwal, Ishan Banerjee, Srikant Panda, Tejaswini Kumar
TL;DR
This survey analyzes large multimodal model datasets and proposes a three-tier taxonomy (training-specific, task-specific, and domain-specific) to organize datasets used for pretraining, instruction tuning, and domain application. It covers multimodal pretraining (MM-PT) and multimodal instruction tuning (MM-IT) datasets, highlights key task-specific benchmarks (e.g., SlideVQA, OmniACT, HowTo100M) and domain-specific resources (medical imaging, autonomous driving, geospatial data, egocentric environments), and discusses dataset scale, modality diversity, and annotation quality. The paper also addresses challenges such as data bias, privacy, and computational demands, and outlines emerging needs for diverse, representative data and standardized benchmarking practices. Overall, it emphasizes the critical role of curated, well-documented multimodal datasets in advancing robust, ethically aligned MLLMs with broad real-world impact.
Abstract
Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to build more versatile and robust systems by integrating and analyzing diverse types of data, including text, images, audio, and video. Inspired by the human ability to assimilate information through multiple senses, this approach enables applications such as text-to-video conversion, visual question answering, and image captioning. This survey highlights recent developments in the datasets that support multimodal language models (MLLMs). Large-scale multimodal datasets are essential because they enable thorough training and testing of these models. The survey examines a variety of datasets, including those for training, domain-specific tasks, and real-world applications, with an emphasis on their contributions to the field. It also underscores the importance of benchmark datasets for assessing model performance across a range of scenarios, as well as model scalability and applicability. As multimodal learning continues to evolve, overcoming the challenges associated with these datasets will help AI research and applications reach new heights.
