Long text outline generation: Chinese text outline based on unsupervised framework and large language model
Yan Yan, Yuanchi Ma
TL;DR
This work tackles outline generation for ultra-long Chinese fiction, where existing LLMs struggle with coherent segmentation and long-range structure. It proposes an unsupervised pipeline: each chapter is represented as a graph built from named entities and syntactic dependencies, and a graph-attention-based autoencoder learns chapter embeddings $Z$ from these graphs. A Markov-chain-driven plot-boundary operator then uses path dependence and a threshold $s_t = \beta \cdot \mathrm{mean}(EU)$ to detect segment boundaries. Once boundaries are identified, a large language model generates themes and per-segment summaries that form the final outline. The work uses a Chinese ultra-long text dataset of over a million words per work. The results show higher boundary accuracy and outline readability (via CheckEval and Kendall's tau) than baselines such as GPT-3.5, GPT-4, and Llama7b, demonstrating the practical utility of combining unsupervised graph representations with LLM-guided summarization for long-form narratives. These findings support applying structured graph-based representations and Markovian boundary detection to enhance retrieval-augmented generation and humanities research on large-scale literary corpora, with potential for tighter integration into end-to-end LLM pipelines.
Abstract
Outline generation aims to reveal the internal structure of a document by identifying underlying chapter relationships and generating corresponding chapter summaries. Although existing deep learning methods and large models perform well on small- and medium-sized texts, they struggle to produce readable outlines for very long texts (such as fictional works) and often fail to segment chapters coherently. In this paper, we propose a novel outline generation method for Chinese that combines an unsupervised framework with large models. Specifically, the method first generates chapter feature graph data based on entity and syntactic dependency relationships. Then, a representation module based on graph attention layers learns deep embeddings of the chapter graph data. Using these chapter embeddings, we design an operator based on Markov chain principles to segment plot boundaries. Finally, we employ a large model to generate a summary of each plot segment and produce the overall outline. We evaluate our model on segmentation accuracy and outline readability, and it outperforms several deep learning models and large models in comparative evaluations.
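
To make the thresholding idea concrete, the following is a minimal sketch of a boundary test using a threshold of the form $s_t = \beta \cdot \mathrm{mean}(EU)$. It rests on assumptions not stated in this summary: $EU$ is read here as the Euclidean distances between consecutive chapter embeddings in $Z$, the function name `detect_boundaries` and the value of `beta` are hypothetical, and the Markov-chain path-dependence part of the operator is not reproduced. It is an illustration of threshold-based segmentation, not the authors' implementation.

```python
import numpy as np

def detect_boundaries(Z, beta=1.5):
    """Flag a plot boundary after chapter i when the embedding jump is large.

    Assumption: EU is the vector of Euclidean distances between
    consecutive chapter embeddings; the source does not define it.
    """
    Z = np.asarray(Z, dtype=float)
    # Distance between each chapter embedding and the next one.
    eu = np.linalg.norm(Z[1:] - Z[:-1], axis=1)
    # Threshold s_t = beta * mean(EU), as in the summary's formula.
    s_t = beta * eu.mean()
    # A boundary sits between chapter i and chapter i + 1.
    boundaries = [int(i) for i in np.nonzero(eu > s_t)[0]]
    return boundaries, s_t

# Toy chapter embeddings (made up): chapters 0-2 are close together,
# then the story jumps to a different region for chapters 3-4.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [3.0, 0.0], [3.1, 0.0]])
boundaries, s_t = detect_boundaries(Z, beta=1.5)
print(boundaries, round(s_t, 4))
```

In this toy input the only jump that exceeds the threshold is the one between chapters 2 and 3, so a single boundary is reported there. A larger `beta` makes the test stricter and yields fewer, coarser segments.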
