VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Hao Wang, Eiki Murata, Lingfang Zhang, Ayako Sato, So Fukuda, Ziqi Yin, Wentao Hu, Keisuke Nakao, Yusuke Nakamura, Sebastian Zwirner, Yi-Chia Chen, Hiroyuki Otomo, Hiroki Ouchi, Daisuke Kawahara

TL;DR

VIR-Bench introduces a new benchmark for evaluating long-range geospatial-temporal understanding in multimodal LLMs through itinerary reconstruction from travel videos. It decomposes the task into node and edge prediction on a visiting order graph, revealing persistent challenges for both open-weight and proprietary models, especially in POI and transition-edge prediction. The authors also demonstrate practical value by building a travel-planning agent that generates plans from videos and graphs, showing improvements when combining video context with POI lists. The work highlights key bottlenecks, such as the need for longer temporal context and multimodal reasoning, and offers a data-rich platform to advance video-grounded geospatial-temporal understanding for real-world applications.

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

Paper Structure

This paper contains 59 sections, 16 figures, 14 tables.

Figures (16)

  • Figure 1: Overview of VIR-Bench. Given an input travel video (Top), we reconstruct a visiting order graph (Right) whose nodes are visited locations (prefectures, cities, and POIs) and whose edges capture both temporal transitions and geographic containment among the locations. The itinerary visualization (Left) omits the second stop at Atami Station for visual clarity. The video frames are taken from https://www.youtube.com/watch?v=6aJ4CZfn9c8.
  • Figure 2: Example of a visiting order graph. Inclusion edges represent containment relationships, flowing from a larger geographical area to a smaller one. Transition edges indicate chronological movement between distinct locations at the same hierarchical level.
  • Figure 3: Evaluation results on edge prediction.
  • Figure 4: Overall results of top-performing models.
  • Figure 5: Crowdsourcing results of the agent system.
  • ...and 11 more figures
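
The visiting order graph described in the captions of Figures 1 and 2 can be sketched as a small data structure. The following is a minimal illustration, not the authors' implementation: the class name `VisitingOrderGraph`, its methods, and the place names in the example are all hypothetical, and the sketch assumes each location is visited only once (a revisit, such as the second stop at Atami Station mentioned in Figure 1, would need a distinct node identifier).

```python
from dataclasses import dataclass, field

# Hierarchy levels named in the Figure 1 caption, ordered from largest to smallest area.
LEVELS = ("prefecture", "city", "poi")


@dataclass
class VisitingOrderGraph:
    """Hypothetical container for a visiting order graph (illustrative only)."""
    nodes: dict[str, str] = field(default_factory=dict)             # name -> level
    inclusion: set[tuple[str, str]] = field(default_factory=set)    # (larger area, smaller area)
    transition: set[tuple[str, str]] = field(default_factory=set)   # (earlier stop, later stop)

    def add_node(self, name: str, level: str) -> None:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.nodes[name] = level

    def add_inclusion(self, parent: str, child: str) -> None:
        # Inclusion edges flow from a larger geographical area to a smaller one.
        if LEVELS.index(self.nodes[parent]) >= LEVELS.index(self.nodes[child]):
            raise ValueError("inclusion must go from a larger area to a smaller one")
        self.inclusion.add((parent, child))

    def add_transition(self, src: str, dst: str) -> None:
        # Transition edges connect distinct locations at the same hierarchical level.
        if src == dst or self.nodes[src] != self.nodes[dst]:
            raise ValueError("transition must link distinct nodes of the same level")
        self.transition.add((src, dst))

    def visiting_order(self, level: str) -> list[str]:
        """Walk the transition edges of one level from its first stop to its last."""
        members = [n for n, lv in self.nodes.items() if lv == level]
        succ = {s: d for s, d in self.transition if self.nodes[s] == level}
        targets = set(succ.values())
        starts = [n for n in members if n not in targets]
        if len(starts) != 1:
            raise ValueError("expected exactly one chain of transitions at this level")
        order = [starts[0]]
        while order[-1] in succ:
            order.append(succ[order[-1]])
        return order


# Illustrative itinerary; the place names are examples, not benchmark data.
g = VisitingOrderGraph()
g.add_node("Shizuoka", "prefecture")
g.add_node("Atami", "city")
g.add_node("Ito", "city")
g.add_node("Atami Station", "poi")
g.add_node("Atami Castle", "poi")
g.add_node("Ito Station", "poi")

g.add_inclusion("Shizuoka", "Atami")
g.add_inclusion("Shizuoka", "Ito")
g.add_inclusion("Atami", "Atami Station")
g.add_inclusion("Atami", "Atami Castle")
g.add_inclusion("Ito", "Ito Station")

g.add_transition("Atami", "Ito")
g.add_transition("Atami Station", "Atami Castle")
g.add_transition("Atami Castle", "Ito Station")

print(g.visiting_order("poi"))
```

Keeping inclusion and transition edges in separate sets mirrors the split between the two edge types in Figure 2, so a predicted graph could be compared with a reference graph one edge type at a time.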