ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models

Tianlong Wang, Pinqiao Wang, Weili Shi, Sheng Li

Abstract

Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluations often focus on specific reasoning or planning questions within controlled environments. Recent studies have explored travel planning as a medium for integrating various verbal reasoning tasks into real-world contexts. However, reasoning extends beyond verbal reasoning alone, and a comprehensive evaluation of LLMs requires a testbed that incorporates tasks from multiple cognitive domains. To address this gap, we introduce ItinBench, a benchmark that adds a spatial reasoning task, namely route optimization, to trip itinerary planning while keeping the traditional verbal reasoning tasks. ItinBench evaluates various LLMs across these diverse tasks simultaneously, including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and the GPT family. Our findings reveal that LLMs struggle to maintain high and consistent performance when handling multiple cognitive dimensions concurrently. By incorporating tasks from distinct human-level cognitive domains, ItinBench provides new insights into building more comprehensive reasoning testbeds that better reflect real-world challenges. The code and dataset are available at https://ethanwtl.github.io/IBweb/

Paper Structure (41 sections, 9 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: An overview of ItinBench. One of the four tasks, "Tool Use with Route Optimization," is chosen in this figure. The database with additional extracted information from user reviews, the human query, and a list of tools are integrated into a task-specific prompt. LLMs need to utilize their verbal and spatial reasoning ability to plan a trip itinerary based on the task constructed. The verbal and spatial reasoning aspects are evaluated to assess LLMs' ability to simultaneously address tasks from multiple cognitive dimensions.
  • Figure 2: Visualizations of the main results for Validated Rate (VR) and Total Distance Gap (Total-DG). Task 1 and Task 2 (red) do not have access to filtered data. Task 3 and Task 4 (blue) have access to filtered data and spatial clustering information. The second-best result is shown in darker color, and the best result is shown in the darkest color.
  • Figure 3: Error distribution for GPT-4o across four tasks. Errors primarily occur in out-of-pool, cuisine, restaurant, and hotel-related recommendations.
  • Figure B.1: Visualization of a case study of the itinerary generated by GPT-4o in tool-use mode, drawn in Google Maps. In the generated plan (red markers and routes in Figure B.1c), one of the main issues is that the route leads to the bottom-right corner of the map (Cluster 13) and then visits the same area (Cluster 0) again on the second day. The mistake is corrected in the optimized route in Figure B.1d. For this itinerary, the total distance gap ratio is 25.6% and the extra cluster jump ratio is 100%.
  • Figure D.2: Alignment between human evaluation preferences (yellow), Validated Rate (blue), and Total Distance Gap (green). Total-DG is inverted so that the trend is easier to compare.