EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling

TL;DR

This work tackles the gap in culturally grounded multimodal QA by introducing EverydayMMQA and the OASIS dataset, a large-scale resource that unifies speech, images, and text across 18 Arabic-speaking countries. The authors present a modular framework for topic/query generation, country-localized image retrieval, and QA synthesis (including spoken data and dialect translation), paired with rigorous quality controls. OASIS supports four input combinations (speech-only, text-only, speech+image, and text+image) and multiple QA types to evaluate cultural grounding and everyday reasoning beyond object recognition. Benchmark results show the pivotal role of visual grounding in reducing language burdens and indicate that fine-tuning helps smaller models close the gap with larger ones. Together, the framework and dataset offer a practical, scalable path toward culturally aware multimodal LLMs and show that cross-modal alignment and data quality are critical for progress in diverse linguistic and cultural contexts.

Abstract

Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With approximately 0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties across 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.

Paper Structure

This paper contains 37 sections, 6 figures, 14 tables.

Figures (6)

  • Figure 1: OASIS data sample, multimodal and multilingual QA around a culturally-grounded image.
  • Figure 2: Proposed EverydayMMQA framework, OASIS dataset construction and experimental pipeline.
  • Figure 3: OASIS dataset overview: geographic coverage across 18 Arab countries, languages and dialects, modality setups (text, image, speech), QA types, audio durations, token counts, and per-(sub)category distributions. Total QA = total number of images (0.92M) $\times$ 4 questions $\times$ 4 language varieties.
  • Figure 4: MSA Judge scores across modalities. Left: Qwen2.5-7B vs Gemini 2.5-pro. Right: Qwen2.5-3B vs its fine-tuned variant. English results are in the Appendix.
  • Figure 5: Distribution of commonsense and knowledge-based questions across the whole dataset.
  • ...and 1 more figure
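
The Figure 3 caption gives the dataset size as a product: number of images, times 4 questions, times 4 language varieties. The sketch below simply replays that arithmetic with the rounded image count quoted in the abstract, and enumerates the four input combinations the abstract lists. The constant and function names are illustrative and do not come from the paper or its code release.

```python
from itertools import product

# Rounded figures quoted in the abstract and the Figure 3 caption.
NUM_IMAGES = 920_000        # "0.92M images" (rounded in the source)
QUESTIONS_PER_IMAGE = 4     # per the Figure 3 caption
LANGUAGE_VARIETIES = 4      # per the Figure 3 caption


def total_qa_pairs(images: int, questions: int, varieties: int) -> int:
    """Total QA pairs as the product stated in the Figure 3 caption."""
    return images * questions * varieties


def input_combinations() -> list[str]:
    """Question modality (speech or text), with or without an image."""
    return [
        f"{question}+image" if with_image else f"{question}-only"
        for question, with_image in product(("speech", "text"), (False, True))
    ]


total = total_qa_pairs(NUM_IMAGES, QUESTIONS_PER_IMAGE, LANGUAGE_VARIETIES)
print(f"{total / 1e6:.2f}M QA pairs")
print(input_combinations())
```

With the rounded image count the product comes to 14.72M, slightly below the 14.8M QA pairs reported in the abstract; the gap is what one would expect if the true image count is a little above 0.92M.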