Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, Bo Han
TL;DR
Landscape of Thoughts (LoT) is a scalable visualization framework for analyzing LLM reasoning trajectories on multi-choice tasks. It converts intermediate thoughts into perplexity-based state features and projects them to 2D with t-SNE. It defines three metrics (consistency, uncertainty, and perplexity) to quantify reasoning dynamics, and shows how the resulting landscapes reveal convergence patterns, task-specific fingerprints, and method-specific behaviors. A lightweight verifier built on the state features predicts trajectory correctness and improves test-time accuracy without retraining large models, with gains reported across models, tasks, and decoding strategies. The tool supports cross-model and cross-task analysis and can be adapted into predictive models, opening new avenues for debugging, safety monitoring, and iterative improvement of LLM reasoning. The work provides a first general, automated lens into thought-level dynamics that complements traditional performance metrics.
Abstract
Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts (LoT), the first landscape visualization tool to inspect the reasoning trajectories of certain reasoning methods on any multi-choice dataset. We represent the textual states in a trajectory as numerical features that quantify the states' distances to the answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, and different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt LoT into a model that predicts the property they observe. We showcase this advantage by adapting LoT into a lightweight verifier that evaluates the correctness of trajectories. Empirically, this verifier boosts both the reasoning accuracy and the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
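To make the feature construction concrete, the following is a minimal sketch of how a textual state might be turned into a vector of distances to the answer choices. It is not the authors' implementation. It assumes that the distance to a choice is the perplexity the model assigns to that choice when conditioned on the state, and that the vector is normalized to sum to 1. The function names (`perplexity`, `state_feature`, `nearest_choice`) and the toy log-probabilities are hypothetical; in practice the log-probabilities would come from the LLM under study.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a token sequence, given per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def state_feature(choice_logprobs):
    """Map one reasoning state to a vector with one entry per answer choice.

    choice_logprobs[j] holds the (hypothetical) token log-probabilities the
    model assigns to answer choice j when conditioned on the state. Lower
    perplexity is read as "closer to that choice". Normalizing the vector to
    sum to 1 is an assumption of this sketch.
    """
    ppl = [perplexity(lp) for lp in choice_logprobs]
    total = sum(ppl)
    return [p / total for p in ppl]


def nearest_choice(feature):
    """Index of the answer choice with the smallest distance entry."""
    return min(range(len(feature)), key=feature.__getitem__)


# Toy trajectory: two states, three answer choices, made-up log-probabilities.
trajectory = [
    [[-1.2, -1.0], [-1.1, -1.3], [-0.9, -1.5]],  # early state: choices look similar
    [[-0.4, -0.3], [-1.6, -1.8], [-1.4, -1.9]],  # later state: choice 0 is clearly closest
]
features = [state_feature(state) for state in trajectory]
```

The resulting `features` list holds one vector per state. Stacking such vectors across many trajectories gives a matrix that could then be passed to any t-SNE implementation (for example, scikit-learn's `sklearn.manifold.TSNE`) to obtain the two-dimensional coordinates used in the plots.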
