Using Sequences of Life-events to Predict Human Lives
Germans Savcisens, Tina Eliassi-Rad, Lars Kai Hansen, Laust Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, Sune Lehmann
TL;DR
Life2vec is a transformer-based framework for learning structured representations from nation-scale, day-by-day life-event sequences. The model encodes labor and health events into a unified synthetic language and is pre-trained with masked language modeling (MLM) and sequence ordering prediction (SOP). From this it learns a shared concept space and generates task-specific person-summaries that can predict outcomes such as mortality within four years and personality nuances. On both the mortality and personality tasks, life2vec outperforms strong baselines. Concept directions, Testing with Concept Activation Vectors (TCAV), and attention analyses are used to interpret the learned embeddings. The work demonstrates the feasibility and value of applying language-inspired sequence modeling to socio-economic and health data, while foregrounding ethical considerations, the need for fairness audits, and caution about extrapolation beyond the Danish population.
Abstract
Over the past decade, machine learning has revolutionized computers' ability to analyze text through flexible computational models. Because many multivariate sequences are structurally similar to written language, transformer-based architectures have also shown promise as tools for making sense of data ranging from protein structures and music to electronic health records and weather forecasts. We can also represent human lives in a way that shares this structural similarity to language. From one perspective, lives are simply sequences of events: people are born, visit the pediatrician, start school, move to a new location, get married, and so on. Here, we exploit this similarity to adapt innovations from natural language processing to examine the evolution and predictability of human lives based on detailed event sequences. We do this by drawing on arguably the most comprehensive registry data in existence, available for an entire nation of more than six million individuals across decades. Our data include information about life-events related to health, education, occupation, income, address, and working hours, recorded with day-to-day resolution. We create embeddings of life-events in a single vector space and show that this embedding space is robust and highly structured. Our models allow us to predict diverse outcomes ranging from early mortality to personality nuances, outperforming state-of-the-art models by a wide margin. Using methods for interpreting deep learning models, we probe the algorithm to understand the factors that enable our predictions. Our framework allows researchers to identify new potential mechanisms that impact life outcomes and associated possibilities for personalized interventions.
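To make the "life as a sentence of events" idea concrete, the following is a minimal sketch of how dated events might be flattened into a token sequence and then masked for an MLM-style pre-training objective. It is an illustration under stated assumptions, not the authors' implementation: the `LifeEvent` class, the helper names `to_sentence` and `mask_tokens`, the token strings such as `JOB_NURSE`, and the 15% default masking rate are all hypothetical. The masking is also simplified (every selected token becomes `[MASK]`), and the SOP objective is not shown.

```python
import random
from dataclasses import dataclass
from datetime import date

# Hypothetical special tokens; the real vocabulary is not given in the text.
SPECIAL = {"[CLS]", "[SEP]", "[MASK]"}


@dataclass(frozen=True)
class LifeEvent:
    """One dated record, described by one or more concept tokens (assumed format)."""
    day: date
    tokens: tuple


def to_sentence(events):
    """Order events by date and flatten them into a single token sequence.

    Each event's tokens are followed by a [SEP] marker.
    """
    sentence = ["[CLS]"]
    for event in sorted(events, key=lambda e: e.day):
        sentence.extend(event.tokens)
        sentence.append("[SEP]")
    return sentence


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Simplified MLM masking: hide each non-special token with probability mask_prob.

    Returns the masked sequence and a parallel list of labels, where a label
    is the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Two invented events, deliberately given out of chronological order.
events = [
    LifeEvent(date(2012, 3, 1), ("JOB_NURSE", "INCOME_Q3")),
    LifeEvent(date(2011, 9, 14), ("HEALTH_GP_VISIT",)),
]
sentence = to_sentence(events)
# mask_prob=1.0 masks every non-special token, which makes the result easy to inspect.
masked, labels = mask_tokens(sentence, mask_prob=1.0)
print(sentence)
print(masked)
```

A model trained on such input would be asked to recover the hidden tokens from their context, which is what encourages related life-events to end up close together in the embedding space.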
