COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization

Tian Qin, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu, Bowen Jin, Mert Cemri, Jiarui Lu, Zirui Wang, Meng Cao

TL;DR

COMPASS addresses a gap in evaluating LLM agents by framing travel planning as constrained preference optimization in a multi-turn setting. It introduces a realistic travel database, a modular GPT-5-based user simulator, and a comprehensive tool ecosystem to test cross-service coordination and preference optimization across hotel, flight, and permit bookings. Exhaustive search for ground-truth solutions enables exact utility scoring, with metrics for constraint satisfaction (acceptable rate) and optimization quality (optimal rate). The results reveal two key gaps: an acceptable-optimal gap and a plan-coordination gap, the latter especially pronounced for open-source models. The benchmark provides a practical, user-facing evaluation of agent reasoning and tool-use capabilities, guiding future improvements toward trustworthy, user-aligned AI assistants in real-world planning tasks.

Abstract

Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent's ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
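The constrained preference optimization framing in the abstract can be illustrated with a minimal sketch: filter candidates by hard constraints, then pick the feasible candidate with the highest utility. The `Hotel` record, its fields, the sample data, and the `best_feasible` helper are all hypothetical names chosen for illustration; they are not taken from the COMPASS database or tool ecosystem.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Hotel:
    # Hypothetical record; the real COMPASS schema is not shown in this section.
    name: str
    price: float
    has_wifi: bool
    pet_friendly: bool


def best_feasible(
    options: Sequence[Hotel],
    hard_constraints: Sequence[Callable[[Hotel], bool]],
    utility: Callable[[Hotel], float],
) -> Optional[Hotel]:
    """Exhaustively search the options: keep those that satisfy every hard
    constraint, then return the one with the highest utility (or None)."""
    feasible = [o for o in options if all(c(o) for c in hard_constraints)]
    if not feasible:
        return None
    return max(feasible, key=utility)


# Illustrative data only.
hotels = [
    Hotel("Pine Lodge", 180.0, True, True),
    Hotel("Canyon Inn", 120.0, True, False),
    Hotel("River Camp", 95.0, False, True),
    Hotel("Summit Hotel", 150.0, True, True),
]

# Hard constraints: must have wifi and allow pets.
hard = [lambda h: h.has_wifi, lambda h: h.pet_friendly]

# Soft preference: minimize cost, expressed as maximizing negative price.
choice = best_feasible(hotels, hard, utility=lambda h: -h.price)
print(choice.name if choice else "no feasible option")
```

In this toy instance, "Canyon Inn" and "River Camp" are cheaper but each violates a hard constraint, so an agent that ignores the constraints would pick an unacceptable plan, while an agent that stops at the first feasible hit ("Pine Lodge") would be acceptable but not optimal.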

Paper Structure

This paper contains 58 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: COMPASS benchmark framework. The environment integrates three key components for realistic evaluation of agentic capabilities in travel planning. (A) A modular LLM-based user simulator enables controllable multi-turn interactions, progressive constraint revelation, and diverse user personas. (B) We formalize travel planning as constrained preference optimization, where agents must satisfy hard constraints (feasible solutions) while optimizing soft user objectives (e.g., minimizing cost, maximizing amenities). (C) Agents interact with realistic travel databases (E) using a comprehensive tool ecosystem (D), requiring iterative planning and refinement across conversation turns to construct optimal itineraries.
  • Figure 2: COMPASS benchmark main results. Acceptable rate measures feasibility (satisfying all hard constraints). Optimal rate measures preference optimization (achieving utility within the top 10% of feasible solutions). All models show a $\sim20\%$ gap between high acceptable rates and low optimal rates, revealing that agents settle for feasible solutions rather than optimizing preferences. Encouragingly, open-source models like Qwen3-32B achieve non-trivial performance, demonstrating emerging agentic capabilities.
  • Figure 3: Examples of task types and dynamic user simulator prompt. (A) Two task types are defined based on the soft preference-optimization objective. Each task type includes hard constraints but differs in optimization objective (Sec. \ref{sec:task_design}). (B) The dynamic LLM user simulator prompt (Sec. \ref{sec:user_simulator}) controls multi-turn conversation dynamics. The system prompt consists of static instructions (orange), fixed for each conversation, and dynamic fields (purple), which are updated at every turn based on the conversation state.
  • Figure 4: Performance breakdown across benchmark dimensions. (A) Performance degrades with increasing plan coordination complexity (Levels II–III), with open-source models showing especially steep declines (green). (B) Constraint satisfaction rates drop as the number of hard constraints increases, with only the strongest models handling 8+ constraints reliably. (C) Preference optimization weakens as search complexity grows (more searches required to reach the ground-truth optimum). (D) Conversation efficiency analysis: how quickly each agent reaches a solution, measured by the number of turns it takes after the user's information has been revealed.
  • Figure 5: Case study of tool calls and reasoning traces. Top: Prompt given to models for a Level II task, with explicit reasoning requested. Bottom: GPT-5 (left) demonstrates strategic planning by avoiding weekends, systematically exploring date ranges, and using optional parameters (e.g., price filters) to narrow searches. Claude-Sonnet-4 (right) applies optional parameters but searches only two arbitrary dates without justification. It also makes a temporal coordination error by misaligning hotel and flight dates.
  • ...and 6 more figures
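
The two metrics defined in the Figure 2 caption can be sketched as follows. This is an illustrative reading, not the authors' scoring code: it assumes "top 10% of feasible solutions" is rank-based (the agent's utility must reach that of the feasible solution at the top-10% cutoff, with at least one solution always counted), and that each episode supplies the utilities of all feasible solutions from exhaustive search. The function names and the episode format are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

# One episode: (utilities of all feasible solutions, agent's utility).
# The agent's utility is None when its plan violates a hard constraint.
Episode = Tuple[Sequence[float], Optional[float]]


def optimal_threshold(feasible_utilities: Sequence[float],
                      top_fraction: float = 0.10) -> float:
    """Utility of the lowest-ranked solution still inside the top fraction.
    Assumption: rank-based cutoff, always including at least one solution."""
    ranked = sorted(feasible_utilities, reverse=True)
    k = max(1, math.ceil(top_fraction * len(ranked)))
    return ranked[k - 1]


def score(episodes: List[Episode],
          top_fraction: float = 0.10) -> Tuple[float, float]:
    """Return (acceptable rate, optimal rate) over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    acceptable = 0
    optimal = 0
    for feasible_utilities, agent_utility in episodes:
        if agent_utility is None:
            continue  # infeasible plan: neither acceptable nor optimal
        acceptable += 1
        if agent_utility >= optimal_threshold(feasible_utilities, top_fraction):
            optimal += 1
    n = len(episodes)
    return acceptable / n, optimal / n


# Illustrative episodes only.
ten = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
episodes: List[Episode] = [
    (ten, 10),          # feasible and in the top 10%
    (ten, 7),           # feasible but not optimal
    ([5, 4, 3], None),  # violates a hard constraint
    ([5, 4, 3], 5),     # feasible and optimal
]
acceptable_rate, optimal_rate = score(episodes)
print(acceptable_rate, optimal_rate)
```

Under this reading, every optimal plan is also acceptable, so the optimal rate can never exceed the acceptable rate; the difference between the two is the acceptable-optimal gap the caption describes.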