HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning

Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q. Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter H. F. Ng, Qing Li

TL;DR

HiBench addresses a gap in evaluating how LLMs reason over hierarchical structures. It is a benchmark that spans from data generation to proficiency assessment, covering six scenarios and 30 tasks for a total of 39,519 queries. The framework defines five capability dimensions and uses a two-component architecture (Hierarchical Dataset Constructor and Evaluator) to systematically probe relationships, structure, manipulation, and textual reasoning in multi-level data. The scenarios are split into fundamental (Binary Tree, Multiple Tree, JSON) and practical (Formula, Code, Paper) tasks. Empirical results across 20 LLMs from 10 families show that models handle basic hierarchical reasoning well but struggle with complex or implicit hierarchies, and that both structure complexity and representation significantly affect performance. Instruction tuning on small models yields substantial gains across many tasks, surpassing GPT-4 on some of them, which highlights the practical impact of targeted data and fine-tuning for hierarchical reasoning. The HiBench dataset and toolkit are publicly released to support systematic evaluation and guide future improvements in hierarchical reasoning for LLMs.

Abstract

Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures (e.g., graphs), overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial for human cognition, particularly in memory organization and problem-solving. It also plays a key role in various real-world tasks, such as information extraction and decision-making. To address this gap, we propose HiBench, the first framework spanning from initial structure generation to final proficiency assessment, designed to benchmark the hierarchical reasoning capabilities of LLMs systematically. HiBench encompasses six representative scenarios, covering both fundamental and practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more complex structures and implicit hierarchical representations, especially in structural modification and textual reasoning. Based on these findings, we create a small yet well-designed instruction dataset, which enhances LLMs' performance on HiBench by an average of 88.84% (Llama-3.1-8B) and 31.38% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are available at https://github.com/jzzzzh/HiBench to encourage evaluation.
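To make the "structure generation to proficiency assessment" pipeline concrete, here is a minimal Python sketch of what a constructor and evaluator pair could look like for a single tree-depth task. It is an illustration only and not the HiBench toolkit's API: every function name (`build_random_tree`, `node_depths`, `make_query`, `exact_match`), the edge-list prompt format, and the exact-match scoring rule are assumptions introduced here.

```python
import random
from collections import deque


def build_random_tree(num_nodes, max_children, seed):
    """Hypothetical constructor: random rooted tree as a parent -> children dict.

    Node 0 is the root; max_children must be at least 1.
    """
    rng = random.Random(seed)
    children = {i: [] for i in range(num_nodes)}
    for node in range(1, num_nodes):
        # Attach each new node to an earlier node that still has spare capacity.
        candidates = [p for p in range(node) if len(children[p]) < max_children]
        children[rng.choice(candidates)].append(node)
    return children


def node_depths(children, root=0):
    """Breadth-first traversal that records the depth of every node."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            depths[child] = depths[current] + 1
            queue.append(child)
    return depths


def make_query(children, target):
    """Render the tree as an edge list and pair the question with its ground truth."""
    edges = [f"{p}->{c}" for p, kids in children.items() for c in kids]
    prompt = (
        f"Tree edges (parent->child): {', '.join(edges)}. "
        f"The root, node 0, has depth 0. What is the depth of node {target}?"
    )
    return prompt, node_depths(children)[target]


def exact_match(prediction, answer):
    """Hypothetical evaluator: strict string match after trimming whitespace."""
    return prediction.strip() == str(answer)


if __name__ == "__main__":
    tree = build_random_tree(num_nodes=8, max_children=2, seed=0)
    prompt, answer = make_query(tree, target=7)
    print(prompt)
    print("ground truth:", answer)
```

Under these assumptions, raising `num_nodes` or `max_children` is one way to vary structural complexity, and swapping the edge-list rendering for another format (nested JSON, for example) is one way to vary representation.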

Paper Structure

This paper contains 86 sections, 14 figures, 24 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Performance Distribution of LLM Model Families on HiBench.
  • Figure 2: Comprehensive Breakdown of Hierarchical Scenarios Categories and Tasks in HiBench.
  • Figure 3: Overview of the Paradigm for HiBench: Hierarchical Construction and Evaluation Architecture.
  • Figure 4: Average Performance of LLM Tasks across Capability Dimensions and Scenarios.
  • Figure 5: Impact of Structural Complexity on LLM Hierarchical Reasoning Capabilities.
  • ...and 9 more figures