Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models

Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, Bo Han

TL;DR

Landscape of Thoughts (LoT) introduces a scalable visualization framework for analyzing LLM reasoning trajectories on multi-choice tasks: it converts intermediate thoughts into perplexity-based state features and projects them to 2D using t-SNE. It defines three metrics (consistency, uncertainty, and perplexity) to quantify reasoning dynamics and demonstrates how landscapes reveal convergence patterns, task-specific fingerprints, and method-specific behaviors. A lightweight verifier built on the state features can predict trajectory correctness and improve test-time accuracy without retraining large models, with demonstrated gains across models, tasks, and decoding strategies. The tool supports cross-model and cross-task analysis and is adaptable to predictive modeling, offering new avenues for debugging, safety monitoring, and iterative improvement of LLM reasoning. The work provides a first general, automated lens into thought-level dynamics that complements traditional performance metrics.

Abstract

Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts (LoT), the first landscape visualization tool to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset. We represent the textual states in a trajectory as numerical features that quantify the states' distances to the answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt LoT to a model that predicts the property they observe. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories. Empirically, this verifier boosts the reasoning accuracy and the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
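The feature construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each reasoning state is scored by the perplexity of every answer choice conditioned on that state, and that the perplexities are normalized to sum to 1. The function names, the normalization, and the log-probability values are all hypothetical; the released code may differ. The t-SNE projection that would follow is omitted.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a token sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def state_feature(choice_logprobs):
    """Turn one reasoning state into a vector with one entry per answer choice.

    choice_logprobs[j] holds the log-probabilities a model assigns to the tokens
    of choice j when conditioned on the state. A lower entry means the state is
    "closer" to that choice. Normalizing to sum to 1 is an assumption made here.
    """
    ppl = [perplexity(lp) for lp in choice_logprobs]
    total = sum(ppl)
    return [p / total for p in ppl]

# Hypothetical log-probabilities for a 3-choice question at two states.
early = [[-1.2, -0.9], [-1.0, -1.1], [-1.1, -1.0]]   # undecided state
late = [[-0.1, -0.2], [-2.0, -2.5], [-1.8, -2.2]]    # state leaning to choice 0

trajectory = [state_feature(early), state_feature(late)]

# Index of the nearest choice (smallest normalized perplexity) at each state.
closest = [min(range(len(f)), key=f.__getitem__) for f in trajectory]
```

Each row of `trajectory` is one point in a space whose dimension equals the number of answer choices; a 2D projection such as t-SNE would then be applied to the stacked rows from many sampled trajectories.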

Paper Structure

This paper contains 41 sections, 8 equations, 28 figures, 13 tables.

Figures (28)

  • Figure 1: Landscape of thoughts for visualizing the reasoning steps of LLMs. The red landscapes represent incorrect reasoning cases, while the blue ones represent correct cases. Darker regions indicate more thoughts, and distinct markers denote the incorrect and the correct answers. Specifically, given a question with multiple choices, we sample a few thoughts from an LLM and divide them into two categories based on correctness. We visualize the landscape of each category by projecting the thoughts into a two-dimensional feature space, where each density map reflects the distribution of states at a reasoning step. With these landscapes, users can easily discover the reasoning patterns of an LLM or a decoding method. In addition, a predictive model is applied to predict the correctness of landscapes and can help improve the accuracy of reasoning.
  • Figure 2: Comparing the LoT of different language models (with CoT on the AQuA dataset). Darker regions represent higher state density, and distinct markers denote the incorrect and the correct answers. Across the reasoning trajectories, from the early stage (0-20% of states) to the late stage (80-100% of states), the visualization shows correct cases (bottom rows, in blue) alongside incorrect cases (top rows, in red). Metrics are calculated w.r.t. each bin, e.g., 20%-40% of states. The reasoning accuracy of the four subfigures is: (a) 15.8%, (b) 42.0%, (c) 53.2%, and (d) 84.4%.
  • Figure 3: The LoT of the reasoning model QwQ-32B (using CoT prompting on the AQuA dataset).
  • Figure 4: Comparing the LoT of different datasets (using Llama-3.1-70B with CoT). The accuracy of reasoning for the four subfigures is: (a) 84.4%, (b) 80.2%, (c) 75.8%, and (d) 64.8%.
  • Figure 5: Comparing the LoT of four reasoning methods (using Llama-3.1-70B on the AQuA dataset). The reasoning accuracy is: (a) 84.4%, (b) 82.2%, (c) 75.8%, and (d) 81.6%, respectively.
  • ...and 23 more figures

Theorems & Definitions (1)

  • Remark 2.1