Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency

Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi

Abstract

Chain-of-Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we introduce Hierarchical Chain-of-Thought (Hi-CoT) prompting, a structured reasoning paradigm specifically designed to address the challenges of complex, multi-step reasoning. Hi-CoT decomposes the reasoning process into hierarchical substeps by alternating between instructional planning and step-by-step execution. This decomposition enables LLMs to better manage long reasoning horizons and maintain logical coherence. Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9% compared to CoT prompting. We further show that accuracy and efficiency are maximized when models strictly adhere to the hierarchical structure. Our code is available at https://github.com/XingshuaiHuang/Hi-CoT.
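As a concrete illustration of the alternating instruction-execution pattern described in the abstract, below is a minimal sketch of a Hi-CoT-style prompt scaffold. The tag names (<instruction>, <execution>, <answer>) and prompt wording are illustrative assumptions, not the paper's exact prompts; those are available in the linked repository.

```python
# Minimal sketch of a Hi-CoT-style prompt scaffold.
# NOTE: the tag names and instructions below are illustrative assumptions,
# not the exact prompts used in the paper (see the linked repository).

HI_COT_SYSTEM_PROMPT = """\
Solve the problem by alternating between planning and execution:
1. Emit <instruction>...</instruction> stating the next sub-goal.
2. Emit <execution>...</execution> carrying out only that sub-goal.
Repeat until the problem is solved, then give the final answer inside
<answer>...</answer>."""


def build_hi_cot_messages(question: str) -> list[dict]:
    """Assemble a chat-style message list that enforces the hierarchical format."""
    return [
        {"role": "system", "content": HI_COT_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


# Usage with any chat-completion API that accepts OpenAI-style messages:
messages = build_hi_cot_messages(
    "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
)
```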

Figures (4)

  • Figure 1: Accuracy (%) and average token length for the Qwen3-8B model [yang2025qwen3] across multiple prompting methods on mathematical reasoning benchmarks. For Hi-CoT (correct format), results are computed only from responses that strictly follow the hierarchical structure (a compliance-check sketch follows this list).
  • Figure 2: Comparison of prompting methods. (a) CoT generates step-by-step reasoning, (b) Plan-and-Solve first creates a global plan and then executes it at a coarse level, while (c) Hi-CoT enforces fine-grained alternating instruction–execution steps for more structured and controlled reasoning.
  • Figure 3: Average response length (tokens) for multiple prompting methods. The rightmost panel shows the macro-average. Lower values generally indicate higher efficiency.
  • Figure 4: Comparative performance of Hi-CoT across aggregate and format-compliant outputs. Accuracy and average tokens are evaluated on four mathematical benchmarks for selected Qwen models, highlighting the performance gain when the hierarchical reasoning structure is strictly followed.
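Figures 1 and 4 report a format-compliant subset: responses that strictly follow the hierarchical structure. The snippet below is a minimal sketch of one way such compliance could be checked, reusing the hypothetical tags from the scaffold above; the paper's actual compliance criteria may differ.

```python
import re

# Minimal sketch of a format-compliance check, reusing the hypothetical
# <instruction>/<execution>/<answer> tags from the scaffold above.
# The paper's actual criteria for "correct format" may differ.

_PAIR = re.compile(
    r"<instruction>.*?</instruction>\s*<execution>.*?</execution>", re.DOTALL
)
_ANSWER = re.compile(r"<answer>.*?</answer>", re.DOTALL)


def follows_hi_cot_format(response: str) -> bool:
    """True iff the response contains at least one instruction-execution
    pair and a final answer tag."""
    return bool(_PAIR.search(response)) and bool(_ANSWER.search(response))
```

Responses failing a check of this kind would be excluded from the "Hi-CoT (correct format)" numbers reported in Figures 1 and 4.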