Fine-grained Classification of A Million Life Trajectories from Wikipedia

Zhaoyang Liu, Xiaocong Du, Yixi Zhou, Ye Shi, Haipeng Zhang

TL;DR

The paper tackles fine-grained life trajectory classification for notable individuals, starting from (person, time, location) triples extracted from Wikipedia. It introduces SAM4LTC, a syntax-aware model that fuses syntactic graphs with text embeddings and uses LLM refinement to standardize input sentences, reaching 84.5% accuracy. The taxonomy has 24 activity types across 9 categories. The work includes a manually annotated dataset of 2,826 samples and a publicly released large-scale dataset of 3.8 million labeled activities for 589,193 individuals spanning 3 centuries, together with the code. SAM4LTC outperforms the baselines, a result supported by ablation and prompt-sensitivity analyses, and the resulting data enables analysis of human dynamics across time and space.

Abstract

Life trajectories of notable people convey essential messages for human dynamics research. These trajectories consist of (person, time, location, activity type) tuples recording when and where a person was born, went to school, started a job, or fought in a war. However, current studies only cover limited activity types such as births and deaths, lacking large-scale fine-grained trajectories. Using a tool that extracts (person, time, location) triples from Wikipedia, we formulate the problem of classifying these triples into 24 carefully-defined types using textual context as complementary information. The challenge is that triple entities are often scattered in noisy contexts. We use syntactic graphs to bring triple entities and relevant information closer, fusing them with text embeddings to classify life trajectory activities. Since Wikipedia text quality varies, we use LLMs to refine the text for more standardized syntactic graphs. Our framework achieves 84.5% accuracy, surpassing baselines. We construct the largest fine-grained life trajectory dataset with 3.8 million labeled activities for 589,193 individuals spanning 3 centuries. In the end, we showcase how these trajectories can support grand narratives of human dynamics across time and space. Code/data are publicly available.
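
To make the idea of "bringing triple entities closer" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a made-up sentence, hand-written dependency edges (not real parser output), and a hypothetical type label `"birth"`. The names `Activity`, `TOKENS`, `EDGES`, and `graph_distance` are illustrative only. The sketch compares how far apart the person, location, and time tokens are in the linear sentence versus in an undirected syntactic graph.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One life-trajectory tuple: (person, time, location, activity type)."""
    person: str
    time: str
    location: str
    activity_type: str  # hypothetical label name, not the paper's taxonomy


# Made-up example sentence, tokenized by hand.
TOKENS = ["Ada", ",", "who", "later", "taught", "mathematics", "for", "many",
          "years", ",", "was", "born", "in", "London", "in", "1815"]

# Hand-written (head, child) dependency edges over token indices.
# These are an assumption for illustration, not output from a parser.
EDGES = [
    (11, 0), (0, 4), (4, 2), (4, 3), (4, 5), (4, 6), (6, 8), (8, 7),
    (0, 1), (0, 9), (11, 10), (11, 12), (12, 13), (11, 14), (14, 15),
]

PERSON, LOCATION, TIME = 0, 13, 15  # indices of "Ada", "London", "1815"


def graph_distance(edges, source, target):
    """Return the hop count between two nodes in the undirected graph, or None."""
    adjacency = {}
    for head, child in edges:
        adjacency.setdefault(head, set()).add(child)
        adjacency.setdefault(child, set()).add(head)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:  # plain breadth-first search
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # target not reachable from source


if __name__ == "__main__":
    for name, idx in (("location", LOCATION), ("time", TIME)):
        linear = abs(PERSON - idx)
        hops = graph_distance(EDGES, PERSON, idx)
        print(f"person -> {name}: {linear} tokens apart, {hops} graph hops")
    print(Activity(TOKENS[PERSON], TOKENS[TIME], TOKENS[LOCATION], "birth"))
```

In this toy sentence the relative clause pushes the person 13 and 15 tokens away from the location and time, while each is only 3 hops away in the graph, passing through the verb "born" that signals the activity type.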

Paper Structure (46 sections, 18 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Examples of life trajectories extracted from Wikipedia. Words in pink are the extracted triple entities, indicating person, time, and location. The graph is the syntactic graph constructed by spaCy. Red nodes correspond to the extracted triple entities; orange marks the nodes and words on the paths.
  • Figure 2: Distribution of types in the Regular dataset.
  • Figure 3: Workflow of SAM4LTC. The trainable modules and embeddings are in blue.
  • Figure 4: (a) Life trajectory of Malcolm John Rebennack Jr. (b) Ratios of military and competition activities from 1700 to 2000 in five-year intervals, with wars and the first Olympics marked. (c) and (d) International departures from Germany and the US in the 20th century. (e) Stacked chart of activities by life stage; the x-axis is the age group and the y-axis is the number of life trajectory activities.
  • Figure 5: Distribution of distance from locations of birth to education/career.
  • ...and 1 more figure