TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning

Meng Chu, Yukang Chen, Haokun Gui, Shaozuo Yu, Yi Wang, Jiaya Jia

TL;DR

TraveLLaMA introduces TravelQA, the first large-scale multimodal travel dataset combining text, maps, and imagery with expert CoT annotations to support domain-specific reasoning. The Travel-CoT framework decomposes travel queries into spatial, temporal, and practical components, yielding interpretable reasoning and a 10.8% improvement in answer accuracy. An interactive ReAct-style agent integrates real-time services to produce actionable itineraries and is validated by a 500-participant user study with high usability (SUS 82.5). Fine-tuned vision-language models (LLaVA, Qwen-VL, Shikra) gain 6.2–9.4%, which Travel-CoT reasoning improves further, illustrating the value of structured reasoning for specialized multimodal travel assistance.

Abstract

Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for comprehensive travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through three key contributions: (1) TravelQA, a novel dataset of 265k question-answer pairs combining 160k text QA from authentic travel sources, 100k vision-language QA featuring maps and location imagery, and 5k expert-annotated Chain-of-Thought reasoning examples; (2) Travel-CoT, a structured reasoning framework that decomposes travel queries into spatial, temporal, and practical dimensions, improving answer accuracy by 10.8% while providing interpretable decision paths; and (3) an interactive agent system validated through extensive user studies. Through fine-tuning experiments on state-of-the-art vision-language models (LLaVA, Qwen-VL, Shikra), we achieve 6.2–9.4% base improvements, further enhanced by Travel-CoT reasoning. Our model demonstrates superior capabilities in contextual travel recommendations, map interpretation, and scene understanding while providing practical information such as operating hours and cultural insights. User studies with 500 participants show TraveLLaMA achieves a System Usability Scale score of 82.5, significantly outperforming general-purpose models and establishing new standards for multimodal travel assistance systems.
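
To make the Travel-CoT decomposition more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a reasoning trace can be held as three free-text fields, one per dimension named in the abstract. The class name `TravelCoTTrace`, its field names, the `to_prompt` layout, and the example question are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TravelCoTTrace:
    """Hypothetical container for one structured reasoning trace."""

    question: str
    spatial: str    # assumed meaning: locations, distances, routes
    temporal: str   # assumed meaning: opening hours, durations, ordering
    practical: str  # assumed meaning: cost, tickets, accessibility
    answer: str

    def to_prompt(self) -> str:
        # Render the three dimensions as labeled steps before the answer.
        steps = [
            ("Spatial", self.spatial),
            ("Temporal", self.temporal),
            ("Practical", self.practical),
        ]
        lines = [f"Question: {self.question}"]
        lines += [f"{name} reasoning: {text}" for name, text in steps]
        lines.append(f"Answer: {self.answer}")
        return "\n".join(lines)


# Made-up example data, not drawn from TravelQA.
trace = TravelCoTTrace(
    question="Can I visit the museum and the tower in one afternoon?",
    spatial="The two sites are a short walk apart.",
    temporal="Both are open in the afternoon; allow about two hours each.",
    practical="Book timed tickets in advance to avoid queues.",
    answer="Yes, start at the museum and walk to the tower afterwards.",
)
print(trace.to_prompt())
```

Keeping each dimension in its own field is one way to make the decision path inspectable, since every part of the reasoning can be checked separately from the final answer.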

Paper Structure

This paper contains 11 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: TraveLLaMA, an advanced multimodal AI travel assistant that seamlessly processes both text and image-based queries. This powerful system enables travelers to plan trips efficiently by providing contextual responses including human service information, localization details, and personalized recommendations based on visual inputs and textual questions about destinations, sights, and restaurants.
  • Figure 2: The TravelQA dataset features iconic landmarks and destinations across major cities worldwide, connecting locations like San Francisco, New York, Paris, Rome, Berlin, Bangkok, Singapore, Shanghai, and Beijing through a global travel network.
  • Figure 3: TraveLLaMA's data construction and training process combines vision-language and text-based travel QA from diverse global sources through multi-round collection and fine-tuning.
  • Figure 4: TraveLLaMA uses a reasoning and acting process to create travel plans. When a user submits a text-image query, the system reasons either offline or online: it analyzes both components, identifies locations, calls specialized tools through APIs, and generates detailed itineraries with budget calculations that match the user's requirements.
  • Figure 5: Comparison between TraveLLaMA and Claude 3.5 shows that TraveLLaMA provides more accurate and detailed travel information in these location-based examples.
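
The reason-then-act loop described in the Figure 4 caption can be illustrated with a small sketch of the general ReAct pattern (thought, action, observation). It is not the paper's agent. A scripted function stands in for the language model, and the tool `lookup_ticket_price`, the `TOOLS` registry, the place names, and all prices are hypothetical placeholders for real-time services.

```python
from typing import Callable, Dict, List, Tuple


def lookup_ticket_price(place: str) -> float:
    """Hypothetical tool; a real agent would call an external API here."""
    prices = {"museum": 15.0, "tower": 25.0}  # made-up values
    return prices.get(place, 0.0)


# Registry mapping action names to callable tools (assumed design).
TOOLS: Dict[str, Callable[[str], float]] = {"ticket_price": lookup_ticket_price}


def scripted_policy(places: List[str], step: int) -> Tuple[str, str, str]:
    """Stand-in for the model: returns (thought, action, argument)."""
    if step < len(places):
        place = places[step]
        return (f"I need the price for {place}.", "ticket_price", place)
    return ("I have every price, so I can finish.", "finish", "")


def plan_budget(places: List[str], max_steps: int = 10) -> Tuple[float, List[str]]:
    total = 0.0
    log: List[str] = []
    for step in range(max_steps):
        # Reason: decide what to do next.
        thought, action, arg = scripted_policy(places, step)
        log.append(f"Thought: {thought}")
        if action == "finish":
            break
        # Act: call the chosen tool, then record what it returned.
        observation = TOOLS[action](arg)
        log.append(f"Action: {action}({arg}) -> Observation: {observation}")
        total += observation
    return total, log


total, log = plan_budget(["museum", "tower"])
print("\n".join(log))
print(f"Estimated ticket budget: {total}")
```

The `max_steps` cap is a common safeguard in loops of this kind: it bounds the number of tool calls if the policy never chooses to finish.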