COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization
Tian Qin, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu, Bowen Jin, Mert Cemri, Jiarui Lu, Zirui Wang, Meng Cao
TL;DR
COMPASS addresses a gap in evaluating LLM agents by framing travel planning as constrained preference optimization in a multi-turn setting. It introduces a realistic travel database, a modular GPT-5-based user simulator, and a comprehensive tool ecosystem to test cross-service coordination and preference optimization across hotel, flight, and permit bookings. Exhaustive search provides the ground truth for exact utility scoring, with metrics for constraint satisfaction (acceptable rate) and optimization quality (optimality rate). The evaluation reveals two key gaps: an acceptable-optimal gap and a plan-coordination gap, the latter especially pronounced for open-source models. The benchmark provides a practical, user-facing evaluation of agent reasoning and tool-use capabilities, guiding future improvements toward trustworthy, user-aligned AI assistants in real-world planning tasks.
Abstract
Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent's ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
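To make the constrained preference optimization framing concrete, the following is a minimal sketch, not the benchmark's actual implementation. It assumes a toy hotel list, a single hard constraint (a nightly budget), and a single soft preference (a higher rating is better). The names `Option`, `best_feasible`, and `evaluate` are hypothetical, and the two rates are computed under one plausible reading of the metrics: the acceptable rate as the fraction of choices that satisfy the hard constraint, and the optimality rate as the fraction that also match the best feasible utility found by exhaustive search.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Option:
    """A hypothetical bookable item (illustrative only, not from COMPASS)."""
    name: str
    price: float
    rating: float


def best_feasible(
    options: List[Option],
    is_feasible: Callable[[Option], bool],
    utility: Callable[[Option], float],
) -> Optional[Option]:
    """Exhaustively search all options and return the best one that satisfies the hard constraint."""
    feasible = [o for o in options if is_feasible(o)]
    return max(feasible, key=utility) if feasible else None


def evaluate(
    options: List[Option],
    chosen: Optional[Option],
    is_feasible: Callable[[Option], bool],
    utility: Callable[[Option], float],
) -> Tuple[bool, bool]:
    """Return (acceptable, optimal) flags for a single agent choice."""
    best = best_feasible(options, is_feasible, utility)
    acceptable = chosen is not None and is_feasible(chosen)
    optimal = acceptable and best is not None and utility(chosen) == utility(best)
    return acceptable, optimal


# Toy data: assumed values, not taken from the benchmark database.
hotels = [
    Option("A", price=120.0, rating=4.0),
    Option("B", price=180.0, rating=4.6),
    Option("C", price=260.0, rating=4.9),
]


def within_budget(o: Option) -> bool:
    """Hard constraint: nightly price must not exceed the budget."""
    return o.price <= 200.0


def prefer_rating(o: Option) -> float:
    """Soft preference: a higher rating gives a higher utility."""
    return o.rating


# Three simulated agent choices for the same task: A, B, and C in turn.
results = [evaluate(hotels, choice, within_budget, prefer_rating) for choice in hotels]

acceptable_rate = sum(acc for acc, _ in results) / len(results)
optimality_rate = sum(opt for _, opt in results) / len(results)
print(f"acceptable rate: {acceptable_rate:.2f}, optimality rate: {optimality_rate:.2f}")
```

In this toy setup, choosing A satisfies the budget but misses the higher-rated B, which is the kind of outcome the acceptable-optimal gap describes: the plan is acceptable without being optimal.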
