
Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs

Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang

Abstract

Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. We therefore study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, a graph, or text chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce the CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities and units, aligns records, serializes the output, and verifies and refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. Compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning (SFT) for structural alignment, followed by Group Relative Policy Optimization (GRPO) with a triple reward covering answer quality, format quality, and process consistency. By distilling this structure-first behavior into SLMs, the approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
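To make Pillar 1 concrete, below is a minimal sketch of template-driven supervision generation, assuming a generic text-in/text-out `llm` callable; the template wording, the JSON keys, and `generate_cost_example` are illustrative assumptions, not the authors' released prompt or code.

```python
# Minimal sketch of CoST-style supervision generation (illustrative only).
# The template wording, JSON keys, and the `llm` callable are assumptions,
# not the released LiteCoST prompt (https://github.com/HKUSTDial/LiteCoST).
import json

COST_TEMPLATE = """You are given a question and a long document.
Answer with a Chain-of-Structured-Thought:
1. Induce a minimal structure (table / graph / chunks) sufficient for the question.
2. Normalize entities and units.
3. Align records drawn from different parts of the document.
4. Serialize the structure.
5. Verify the structure against the document and refine it if needed.
Return a single JSON object with keys "cost_trace", "structured_output", and "answer".

Question: {question}
Document: {document}
"""

def generate_cost_example(llm, question: str, document: str) -> dict:
    """Ask a strong teacher LLM for a CoST trace plus structured output."""
    prompt = COST_TEMPLATE.format(question=question, document=document)
    record = json.loads(llm(prompt))
    # Keep only well-formed records as fine-tuning supervision.
    if not {"cost_trace", "structured_output", "answer"} <= set(record):
        raise ValueError("malformed CoST record")
    return record
```

Records that fail the key check would presumably, per the quality-verification and iterative-refinement steps of Figure 3, be sent back to the teacher with the failure reason rather than discarded outright.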


Paper Structure

This paper contains 36 sections, 5 equations, 17 figures, and 12 tables.

Figures (17)

  • Figure 1: Structured data makes QA more accurate and reliable. (a) From raw document $D$, we extract a table $T_1$ for query $Q_1$ and a graph $G_1$ for query $Q_2$. (b) LLMs often fail when reasoning directly over unstructured text ($Q_1,D$; $Q_2,D$), but succeed with structured inputs ($Q_1,T_1$; $Q_2,G_1$).
  • Figure 2: (a) Direct prompting LLMs often causes hallucinations and format errors. (b) (Question, Document, CoST Template) $\Rightarrow$ LLM $\Rightarrow$ (CoST Trace, SSO), yielding verifiable and auditable QA.
  • Figure 3: Overview of LiteCoST, containing two stages: (1) CoST: Structure-First Reasoning and Trace Generation through structure analysis, trace generation, quality verification, and iterative refinement; and (2) SLM Fine-Tuning: SFT → GRPO process, including SFT for structure/format/steps, followed by GRPO with dual signals for answer/format quality and process consistency.
  • Figure 4: The GRPO training pipeline based on a dual-level reward (a minimal reward sketch follows this list).
  • Figure 5: Radar plot of detailed scores for different prompting methods on four subtasks of the Finance subset of Loong.
  • ...and 12 more figures
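The triple reward from the abstract (answer quality, format quality, process consistency), which Figures 3 and 4 describe as dual signals at the outcome and process levels, reduces to one scalar score per sampled completion that GRPO normalizes within each group. The sketch below is a guess at its shape: the weights, parsing rules, and all function names are illustrative assumptions, not LiteCoST's implementation.

```python
# Illustrative sketch of a triple reward (answer / format / process) for
# GRPO fine-tuning. Weights, parsing rules, and names are assumptions,
# not LiteCoST's released implementation.
import json

EXPECTED_KEYS = {"cost_trace", "structured_output", "answer"}

def _parse(completion: str) -> dict | None:
    """Return the completion as a dict, or None if it is not valid JSON."""
    try:
        rec = json.loads(completion)
        return rec if isinstance(rec, dict) else None
    except json.JSONDecodeError:
        return None

def answer_reward(rec: dict | None, gold: str) -> float:
    pred = str(rec.get("answer", "")) if rec else ""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def format_reward(rec: dict | None) -> float:
    # 1.0 iff the completion is valid JSON carrying the expected keys.
    return 1.0 if rec is not None and EXPECTED_KEYS <= set(rec) else 0.0

def process_reward(rec: dict | None) -> float:
    # Crude consistency proxy: the final answer should be traceable to the
    # serialized structure (string containment stands in for real checks).
    if rec is None:
        return 0.0
    structure = json.dumps(rec.get("structured_output", {}))
    return 1.0 if str(rec.get("answer", "")) in structure else 0.0

def total_reward(completion: str, gold: str, w=(0.6, 0.2, 0.2)) -> float:
    """One scalar per completion; GRPO turns a group of such rewards into
    advantages via A_i = (r_i - mean(r)) / std(r)."""
    rec = _parse(completion)
    return (w[0] * answer_reward(rec, gold)
            + w[1] * format_reward(rec)
            + w[2] * process_reward(rec))
```

One plausible reason the SFT stage precedes GRPO is visible in this shape: without structural alignment first, most sampled completions would fail the JSON gate, score zero on all three terms, and leave the group-relative advantages uninformative.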