LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

TL;DR

LLM-WikiRace is a simple benchmark that reveals clear limitations in current reasoning systems and offers an open arena where planning-capable LLMs still have much to prove. It also shows that world knowledge is a necessary ingredient for success, but only up to a point.

Abstract

We introduce LLM-WikiRace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-WikiRace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, which requires look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-WikiRace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.

Paper Structure (23 sections, 1 equation, 20 figures, 2 tables, 1 algorithm)

Figures (20)

  • Figure 1: LLM-WikiRace evaluates both world knowledge and reasoning. We identify a clear performance gap between reasoning models and instruction-tuned models, even though both show similar levels of world knowledge. Further details are given in the World Knowledge section.
  • Figure 2: Success rates of the best-performing models across all three difficulty levels of the LLM-WikiRace benchmark.
  • Figure 3: In the LLM-WikiRace task, an LLM agent is instructed to navigate the Wikipedia hyperlink graph from a source page (e.g., Banana) to a target page (e.g., Ferrari). At each step, the game engine provides the agent with the current page, the titles of its outgoing links, the target page name, and the history of previously visited pages. The agent selects one outgoing link to follow and transitions to that page. Throughout the episode, a logger records success or failure, the number of steps taken, token usage, and elapsed time.
  • Figure 4: The LLM-WikiRace prompt. At each step of the game, the LLM is prompted with the current page, the target page, a history of previously visited states, and 50 possible next states.
  • Figure 5: Success rate vs. loop frequency for different models. We say that a game contains a loop if the model visited any page more than once. Each model family is shown in a different color on the scatter plot, and the best model in each family is marked with an X. The dashed gray line shows a linear regression fitted to the points.
  • ...and 15 more figures
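
The episode described in the captions of Figures 3 to 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph, the names `play_episode`, `EpisodeLog`, `has_loop`, and `first_unvisited_policy`, the step budget `max_steps`, the rule that an invalid choice ends the game, and the simple truncation to 50 links are all hypothetical. The policy stands in for the LLM call, and token usage and elapsed time are omitted from the log.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical toy link graph standing in for the Wikipedia hyperlink graph.
TOY_GRAPH: Dict[str, List[str]] = {
    "Banana": ["Fruit", "Plant"],
    "Fruit": ["Banana", "Italy"],
    "Plant": ["Banana", "Fruit"],
    "Italy": ["Ferrari", "Fruit"],
    "Ferrari": ["Italy"],
}

# A policy sees (current page, target page, history, outgoing links)
# and returns the title of the link to follow.
Policy = Callable[[str, str, List[str], List[str]], str]


@dataclass
class EpisodeLog:
    """What the logger records for one game (tokens and time omitted)."""
    success: bool
    steps: int
    path: List[str] = field(default_factory=list)


def play_episode(graph: Dict[str, List[str]], source: str, target: str,
                 policy: Policy, max_steps: int = 30,
                 max_links: int = 50) -> EpisodeLog:
    """Run one game; max_steps and the truncation rule are assumptions."""
    current = source
    path = [source]
    for _ in range(max_steps):
        if current == target:
            break
        # Show at most max_links candidate next pages (50 in Figure 4).
        links = graph.get(current, [])[:max_links]
        if not links:
            break  # dead end: no outgoing links to follow
        choice = policy(current, target, path, links)
        if choice not in links:
            break  # assumption: an invalid choice ends the game
        current = choice
        path.append(current)
    return EpisodeLog(success=current == target, steps=len(path) - 1, path=path)


def has_loop(path: List[str]) -> bool:
    """A game contains a loop if any page was visited more than once."""
    return len(set(path)) < len(path)


def first_unvisited_policy(current: str, target: str,
                           history: List[str], links: List[str]) -> str:
    """Toy stand-in for the LLM: take the target if linked, else explore."""
    if target in links:
        return target
    for link in links:
        if link not in history:
            return link
    return links[0]


log = play_episode(TOY_GRAPH, "Banana", "Ferrari", first_unvisited_policy)
print(log.success, log.steps, log.path, has_loop(log.path))
```

Replacing `first_unvisited_policy` with a policy that always returns the first link makes the agent bounce between Banana and Fruit, which `has_loop` flags; averaging that flag over many games gives a loop frequency of the kind plotted in Figure 5.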