Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

Sara Ghaboura, Ketan More, Ritesh Thawkar, Wafa Alghallabi, Omkar Thawakar, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

TL;DR

TimeTravel addresses the need for evaluating large multimodal models on historical artifacts by providing a comprehensive, expert-verified benchmark spanning 266 cultures and 10 regions. The approach combines a curated 10,250-sample artifact collection with image-text pair generation via GPT-4o and rigorous human validation to enable multimodal historical reasoning. The study benchmarks both closed-source and open-source LMMs using diverse metrics (BLEU, ROUGE-L, METEOR, SPICE, BERTScore, LLM-Judge), finding that closed models generally outperform open models in contextual artifact description, with area-specific strengths. These contributions advance cultural heritage AI by enabling more accurate, scalable analysis of artifacts while highlighting limitations and biases that require ongoing expert oversight.

Abstract

Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models' capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. Our code is available at https://github.com/mbzuai-oryx/TimeTravel.

Paper Structure

This paper contains 12 sections, 8 figures, 14 tables.

Figures (8)

  • Figure 1: TimeTravel Taxonomy categorizes artifacts from 10 major civilizations, spanning diverse historical and prehistoric periods. It encompasses 266 distinct cultures and over 10k manually verified historical artifact samples, providing a structured framework for comprehensive AI-driven analysis.
  • Figure 2: TimeTravel Samples. Showcasing diverse cultural representations from various regions across the globe, these examples span multiple artifact categories, including coins, accessories, tools, and statues from ancient civilizations. Each artifact is accompanied by a detailed description, providing valuable contextual and historical insights. Additional TimeTravel examples can be found in the appendix figures.
  • Figure 3: TimeTravel Data Pipeline. A structured workflow that collects image and text data from museum websites, cleans metadata, and integrates it with visual content. The GPT-4o model generates detailed, context-aware descriptions, which are refined by experts for accuracy before forming the TimeTravel Benchmark.
  • Figure 4: Regional distribution of dataset samples based on their archaeological provenance. Greece holds the largest share at 18%, with a roughly balanced distribution across regions.
  • Figure 5: This entry represents a silver coin from the Gupta dynasty of India, featuring a distinguished portrait of Skandagupta on the obverse. GPT-4o generated a detailed, context-aware description based on the available metadata, highlighting its craftsmanship, ceremonial significance, and cultural context.
  • ...and 3 more figures