CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou, Gunveer Gujral, Yinglong Xia, Shane Moon, Nicolas Scheffer, Nirav Shah, Eun Chang, Yue Liu, Florian Metze, Tammy Stark, Zhaleh Feizollahi, Andrea Jessee, Mangesh Pujari, Ahmed Aly, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Wen-tau Yih, Xin Luna Dong

TL;DR

CRAG-MM tackles the lack of a comprehensive benchmark for multi-modal, multi-turn RAG in wearable AI by introducing a large-scale dataset (6.5K single-turn QA + 2K multi-turn conversations across 13 domains) with egocentric and normal images, plus retrieval APIs for image-KG and web sources. It defines three tasks that span single- and multi-source augmentation and multi-turn dialogues, and provides detailed data splits and evaluation protocols to measure truthfulness and reliability. Benchmarking reveals that current MM-RAG systems still struggle with image quality, cross-source synthesis, and maintaining coherent multi-turn conversations, despite gains from competition-winning approaches; this highlights substantial room for improvement and informs targeted directions for robust, trustworthy wearable QA. Overall, CRAG-MM offers a principled, publicly accessible platform that accelerates progress in wearable MM-RAG by clarifying challenges, benchmarking current capabilities, and guiding future research directions with real-world relevance.

Abstract

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information about entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially for wearable scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different numbers of conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, and state-of-the-art industry solutions reach similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.

Paper Structure

This paper contains 34 sections, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: CRAG-MM examples.
  • Figure 2: CRAG-MM task design.
  • Figure 3: Image search recall (left): the image search index covers 93.4% of the query entities. However, querying the index with the full image achieves only 51.9% recall. Manual cropping improves the recall slightly, to 58.0%. Web search recall (right): the top 50 web search results are estimated to contain the ground-truth facts for 88.6% of the questions.
  • Figure 4: Winning and SOTA systems' performance across different dimensions. Panels (a)--(e) show truthfulness; panel (f) shows the average number of successful turns in multi-turn QA.
  • Figure 5: An example of the image search API.
  • ...and 2 more figures