T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Qinsi Wang; Hancheng Ye; Jinhee Kim; Jinghan Ke; Yifei Wang; Martin Kuo; Zishan Shao; Dongting Li; Yueqian Lin; Ting Jiang; Chiyue Wei; Qi Qian; Wei Wen; Helen Li; Yiran Chen

TL;DR

This work first introduces Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families, and presents T2S-Bench, the first benchmark designed to evaluate and improve text-to-structure capabilities of models.

Abstract

Think about how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve the text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial room for improvement: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves only 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code have been released at https://t2s-bench.github.io/T2S-Bench-Page/.
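To make the idea of an intermediate text structure concrete, the following is a minimal sketch of what an SoT-style prompt could look like. The function name `build_sot_prompt` and the exact wording of the instructions are illustrative assumptions, not the prompt used in the paper; the sketch only reflects the abstract's description of asking the model to mark key points, infer their relationships, and then answer.

```python
def build_sot_prompt(text: str, question: str) -> str:
    """Build a prompt that asks for a text structure before the answer.

    Hypothetical template: the wording below is an assumption made for
    illustration and is not taken from the paper.
    """
    return (
        "Read the text below.\n"
        "Step 1: List the key points as nodes.\n"
        "Step 2: List the relationships between nodes as links "
        "(source -> target).\n"
        "Step 3: Use that structure to answer the question.\n\n"
        f"Text:\n{text}\n\n"
        f"Question: {question}\n"
    )


# Example usage with a toy passage (hypothetical input).
prompt = build_sot_prompt("A causes B. B causes C.", "What does A lead to?")
print(prompt)
```

The resulting string can be sent to any chat or completion model; a direct-answering baseline would simply omit the three structuring steps.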

Paper Structure (56 sections, 7 figures, 7 tables)

This paper contains 56 sections, 7 figures, 7 tables.

Introduction
Motivation & Challenges
Structure of Thought
Challenges in Dataset Construction
Construction Process of T2S-Bench
Sample Collection
T2S Multi-hop Reasoning Dataset Construction
T2S-Bench-E2E Dataset Construction
Data Distribution Statistics
Evaluation
General Performance on T2S‑Bench
The Importance of Structure for Downstream Tasks
Analysis Experiments
Conclusion
Background
...and 41 more sections

Figures (7)

Figure 1: Performance of SoT and Importance of Text Structuring. We evaluated three models on eight distinct text-processing tasks using three prompting strategies: direct answering, Chain-of-Thought (CoT), and Structure of Thought (SoT). The horizontal axis shows the model's performance with direct answering, while the vertical axis indicates the performance change relative to direct answering. Our evaluations follow the standards of the lm-eval and LongBench tasks. SoT consistently boosts performance across different tasks and models.
Figure 2: Construction Process of T2S-Bench, including Sample Collection, Multi-hop Reasoning, and End-to-End Dataset Construction.
Figure 3: (Left) Sample distributions of different datasets. (Right) Overview of T2S-Bench sample distributions.
Figure 4: F1 scores across different topics on T2S-Bench-MR. We selected one representative model from each model family. The first panel shows their average F1 scores across domains. The remaining panels show individual model performance per domain, with the vertical axis indicating deviation from the average performance. The dark dashed rectangle represents the average performance (set to zero). Scores outside this rectangle indicate above-average performance, while scores inside indicate below-average performance.
Figure 5: Link F1 scores on MR-Bench-E2E across texts corresponding to reference graphs with varying node counts.
...and 2 more figures
