WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

Zexuan Wang, Chenghao Yang, Yingqi Que, Zhenzhu Yang, Huaqing Yuan, Yiwen Wang, Zhengxuan Jiang, Shengjie Fang, Zhenhe Wu, Zhaohui Wang, Zhixin Yao, Jiashuo Liu, Jincheng Ren, Yuzhen Li, Yang Yang, Jiaheng Liu, Jian Yang, Zaiyuan Wang, Ge Zhang, Zhoufutu Wen, Wenhao Huang

TL;DR

WorldTravel addresses tightly coupled, real-world planning with a benchmark that pairs a large task suite with a web-based, multi-modal evaluation environment. It shows that current models struggle both to perceive constraint information from visual layouts and to plan coherently over long horizons, exhibiting a notable Perception-Action Gap and a planning horizon of roughly 10 constraints. Performance falls well short across both proprietary and open-source models, even when hard constraints are provided, underscoring the need for integrated perception and reasoning capabilities. The work contributes a rigorous, end-to-end benchmark, a scalable data-generation pipeline, and diagnostic insights intended to guide future development of autonomous planning agents in realistic, constraint-rich settings.

Abstract

Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67% feasibility in text-only settings, which plummets to 19.33% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.
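
To make the feasibility figures concrete, the following is a minimal sketch of how such a rate could be computed. It assumes, without confirmation from the paper, that a plan counts as feasible only when every one of its constraints is satisfied and that the rate is the share of feasible plans over all tasks. The names `is_feasible` and `feasibility_rate` and the constraint keys are hypothetical.

```python
from typing import Dict, List


def is_feasible(constraint_results: Dict[str, bool]) -> bool:
    """Assumed definition: a plan is feasible only if every constraint holds."""
    return all(constraint_results.values())


def feasibility_rate(per_task_results: List[Dict[str, bool]]) -> float:
    """Percentage of tasks whose plan satisfies all of its constraints."""
    if not per_task_results:
        return 0.0
    feasible = sum(is_feasible(r) for r in per_task_results)
    return 100.0 * feasible / len(per_task_results)


# Hypothetical outcome: 49 fully satisfied plans among 150 tasks.
# A single violated constraint ("c2") is enough to make a plan infeasible.
results = [{"c1": True, "c2": True}] * 49 + [{"c1": True, "c2": False}] * 101
print(round(feasibility_rate(results), 2))  # 32.67
```

Under this assumed definition, the reported 32.67% would correspond to 49 of the 150 tasks and 19.33% to 29 of them. The all-or-nothing check reflects the tightly coupled setting described above, where one violated constraint invalidates the whole plan.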

Paper Structure (57 sections, 4 equations, 14 figures, 12 tables)

This paper contains 57 sections, 4 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Data construction pipeline of WorldTravel. (1) Data Collection: We collect factual data (operating hours, pricing, schedules) and user-generated data (dwell times, visiting tips) from diverse sources across 5 European cities selected based on POI richness. (2) Webpage & Task Synthesis: Aggregated entity data is used to generate static webpages via GPT-5.2, while expert annotators design tasks with city-specific challenges. (3) Quality Control: Each task includes user queries, constraint annotations, and verification functions; tasks undergo manual review and multi-LLM filtering to yield 150 verified tasks.
  • Figure 2: Timed-Entry Slot violation. The exhibition tour is available at 14:00, 16:00, and 17:00, but the model schedules the visit in the morning.
  • Figure 3: Minimum Dwell Time violation. The user requests an in-depth tour (4-5 hours required), but the model schedules only 2 hours.
  • Figure 4: Constraint analysis. (a) Feasibility collapses when constraints exceed 10, even when constraint parameters are provided as structured text. (b) Timed-Entry Slots show the lowest satisfaction rates across all models, with vision-based extraction further degrading performance.
  • Figure 5: Attraction overview page displaying landmarks through category-tagged cards with imagery and descriptions.
  • ...and 9 more figures
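
The two violations in Figures 2 and 3 can be illustrated with a small sketch of per-constraint checks, in the spirit of the verification functions mentioned in Figure 1. This is not the benchmark's actual code: the function names, the `HH:MM` string format, the rule that a visit must start exactly at an offered slot, and the example visit times are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List


def _parse(hhmm: str) -> datetime:
    """Parse a same-day clock time given as 'HH:MM'."""
    return datetime.strptime(hhmm, "%H:%M")


def satisfies_timed_entry(visit_start: str, slots: List[str]) -> bool:
    """Assumed rule: the visit must begin exactly at one of the offered slots."""
    return _parse(visit_start) in {_parse(s) for s in slots}


def satisfies_min_dwell(visit_start: str, visit_end: str, min_hours: float) -> bool:
    """The scheduled stay must last at least `min_hours`."""
    return _parse(visit_end) - _parse(visit_start) >= timedelta(hours=min_hours)


# Figure 2: slots at 14:00, 16:00, and 17:00; a morning visit (10:00 is a
# hypothetical time) violates the timed-entry constraint.
tour_slots = ["14:00", "16:00", "17:00"]
print(satisfies_timed_entry("10:00", tour_slots))  # False
print(satisfies_timed_entry("14:00", tour_slots))  # True

# Figure 3: an in-depth tour needs at least 4 hours; a 2-hour stay
# (hypothetical 09:00 to 11:00) violates the minimum dwell time.
print(satisfies_min_dwell("09:00", "11:00", 4))  # False
print(satisfies_min_dwell("09:00", "13:30", 4))  # True
```

Keeping each constraint as a separate boolean check makes it straightforward to report per-constraint satisfaction rates, such as those compared in Figure 4(b), alongside an overall pass or fail for the plan.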