TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning

Meng Chu; Yukang Chen; Haokun Gui; Shaozuo Yu; Yi Wang; Jiaya Jia

TL;DR

TraveLLaMA introduces TravelQA, a 265k-pair dataset presented as the first large-scale multimodal travel dataset, combining text, maps, and imagery with expert-annotated Chain-of-Thought (CoT) examples to support domain-specific reasoning. The Travel-CoT framework decomposes travel queries into spatial, temporal, and practical components, which yields interpretable reasoning and a 10.8% improvement in answer accuracy. An interactive ReAct-style agent integrates real-time services to produce actionable itineraries, and a user study with 500 participants reports a System Usability Scale (SUS) score of 82.5. Fine-tuning vision-language models (LLaVA, Qwen-VL, Shikra) on TravelQA gives 6.2–9.4% base improvements before Travel-CoT is applied, illustrating the value of structured reasoning for specialized multimodal travel assistance.

Abstract

Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for comprehensive travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through three key contributions: (1) TravelQA, a novel dataset of 265k question-answer pairs combining 160k text QA from authentic travel sources, 100k vision-language QA featuring maps and location imagery, and 5k expert-annotated Chain-of-Thought reasoning examples; (2) Travel-CoT, a structured reasoning framework that decomposes travel queries into spatial, temporal, and practical dimensions, improving answer accuracy by 10.8% while providing interpretable decision paths; and (3) an interactive agent system validated through extensive user studies. Through fine-tuning experiments on state-of-the-art vision-language models (LLaVA, Qwen-VL, Shikra), we achieve 6.2–9.4% base improvements, further enhanced by Travel-CoT reasoning. Our model demonstrates superior capabilities in contextual travel recommendations, map interpretation, and scene understanding while providing practical information such as operating hours and cultural insights. User studies with 500 participants show TraveLLaMA achieves a System Usability Scale score of 82.5, significantly outperforming general-purpose models and establishing new standards for multimodal travel assistance systems.
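
The abstract says Travel-CoT decomposes a query into spatial, temporal, and practical dimensions but does not give the prompt format. The following is only a minimal sketch of what such a decomposition could look like as a prompt template. The function name `build_travel_cot_prompt`, the field wording, and the example question are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the template below is an assumption, not the
# authors' actual Travel-CoT prompt. Only the three dimension names
# (spatial, temporal, practical) come from the abstract.

# Hypothetical hints describing what each dimension might cover.
DIMENSIONS = {
    "Spatial": "locations, distances, and routes between places",
    "Temporal": "opening hours, visit durations, and ordering of stops",
    "Practical": "tickets, costs, and cultural or local considerations",
}


def build_travel_cot_prompt(question: str) -> str:
    """Build a prompt asking a model to reason per dimension, then answer."""
    lines = [f"Question: {question}", "Reason step by step before answering."]
    for name, hint in DIMENSIONS.items():
        # One labelled slot per reasoning dimension.
        lines.append(f"{name} reasoning ({hint}):")
    lines.append("Final answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example query, not from the TravelQA dataset.
    print(build_travel_cot_prompt("How should I spend one day near the old town?"))
```

Keeping one labelled slot per dimension makes the intermediate reasoning easy to inspect, which matches the interpretability goal the abstract describes, although the real framework may structure its steps differently.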
