Fine-grained Classification of A Million Life Trajectories from Wikipedia
Zhaoyang Liu, Xiaocong Du, Yixi Zhou, Ye Shi, Haipeng Zhang
TL;DR
The paper addresses fine-grained classification of life trajectories for notable individuals, starting from (person, time, location) triples extracted from Wikipedia. It introduces SAM4LTC, a syntax-aware model that fuses syntactic graphs with text embeddings and uses LLM-based text refinement to standardize input sentences, reaching 84.5% accuracy. The resulting dataset contains 3.8 million labeled activities for 589,193 individuals spanning 3 centuries. The work uses a 24-type taxonomy across 9 categories and a manually annotated set of 2,826 samples, and releases the large-scale dataset and code publicly. The approach shows consistent gains over baselines, supported by ablations and prompt-sensitivity analyses, and the data analysis offers insights into human dynamics across time and space.
Abstract
Life trajectories of notable people convey essential messages for human dynamics research. These trajectories consist of (person, time, location, activity type) tuples recording when and where a person was born, went to school, started a job, or fought in a war. However, current studies cover only a limited set of activity types, such as births and deaths, and lack large-scale fine-grained trajectories. Using a tool that extracts (person, time, location) triples from Wikipedia, we formulate the problem of classifying these triples into 24 carefully defined types, using textual context as complementary information. The challenge is that triple entities are often scattered in noisy contexts. We use syntactic graphs to bring triple entities and relevant information closer, fusing them with text embeddings to classify life trajectory activities. Since Wikipedia text quality varies, we use LLMs to refine the text for more standardized syntactic graphs. Our framework achieves 84.5% accuracy, surpassing baselines. We construct the largest fine-grained life trajectory dataset, with 3.8 million labeled activities for 589,193 individuals spanning 3 centuries. Finally, we showcase how these trajectories can support grand narratives of human dynamics across time and space. Code and data are publicly available.
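To make the data model concrete, the following is a minimal Python sketch of how a (person, time, location) triple could be turned into a (person, time, location, activity type) tuple and grouped into per-person trajectories. It is an illustration only, not the released code: the class names, the `label_triple` and `build_trajectories` helpers, the use of an integer year for time, and the placeholder labels "birth" and "education" are all assumptions, and the classifier itself (SAM4LTC) is not modeled here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Triple:
    """A (person, time, location) triple. Field names are hypothetical."""
    person: str
    year: int       # assumption: time reduced to a year for simplicity
    location: str


@dataclass(frozen=True)
class TrajectoryPoint:
    """A triple plus its activity type, i.e. one labeled activity."""
    person: str
    year: int
    location: str
    activity_type: str  # one of the 24 types; labels used here are placeholders


def label_triple(triple: Triple, activity_type: str) -> TrajectoryPoint:
    """Attach an activity type (e.g. a classifier's prediction) to a triple."""
    return TrajectoryPoint(triple.person, triple.year, triple.location, activity_type)


def build_trajectories(points: Iterable[TrajectoryPoint]) -> Dict[str, List[TrajectoryPoint]]:
    """Group labeled points by person and order each trajectory by year."""
    grouped: Dict[str, List[TrajectoryPoint]] = defaultdict(list)
    for point in points:
        grouped[point.person].append(point)
    return {person: sorted(pts, key=lambda p: p.year) for person, pts in grouped.items()}


# Usage with placeholder data (not taken from the dataset).
points = [
    label_triple(Triple("Person A", 1920, "City Y"), "education"),
    label_triple(Triple("Person A", 1900, "City X"), "birth"),
]
trajectories = build_trajectories(points)
```

Keeping the unlabeled triple and the labeled point as separate types mirrors the problem statement: the extraction tool supplies the triple, and classification adds the fourth field.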
