Using Sequences of Life-events to Predict Human Lives

Germans Savcisens, Tina Eliassi-Rad, Lars Kai Hansen, Laust Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, Sune Lehmann

TL;DR

Life2vec introduces a transformer-based framework for learning structured representations from nation-scale, day-by-day life-event sequences. By encoding labor and health events into a unified synthetic language and pre-training with masked language modeling (MLM) and sequence ordering prediction (SOP), the model learns a shared concept space and generates task-specific person-summaries that can predict outcomes such as mortality within four years and personality nuances. On both the mortality and personality tasks, life2vec outperforms strong baselines. The learned embeddings are also interpretable: concept directions, Testing with Concept Activation Vectors (TCAV), and attention analyses are used to probe what drives the predictions. The work demonstrates the feasibility and value of applying language-inspired sequence modeling to socio-economic and health data, while foregrounding ethical considerations, the need for fairness audits, and caution about extrapolation beyond the Danish population.

Abstract

Over the past decade, machine learning has revolutionized computers' ability to analyze text through flexible computational models. Due to their structural similarity to written language, transformer-based architectures have also shown promise as tools to make sense of a range of multivariate sequences, from protein structures, music, and electronic health records to weather forecasts. We can also represent human lives in a way that shares this structural similarity to language. From one perspective, lives are simply sequences of events: People are born, visit the pediatrician, start school, move to a new location, get married, and so on. Here, we exploit this similarity to adapt innovations from natural language processing to examine the evolution and predictability of human lives based on detailed event sequences. We do this by drawing on arguably the most comprehensive registry data in existence, available for an entire nation of more than six million individuals across decades. Our data include information about life-events related to health, education, occupation, income, address, and working hours, recorded with day-to-day resolution. We create embeddings of life-events in a single vector space, showing that this embedding space is robust and highly structured. Our models allow us to predict diverse outcomes ranging from early mortality to personality nuances, outperforming state-of-the-art models by a wide margin. Using methods for interpreting deep learning models, we probe the algorithm to understand the factors that enable our predictions. Our framework allows researchers to identify new potential mechanisms that impact life outcomes and associated possibilities for personalized interventions.

Paper Structure (46 sections, 17 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: A schematic individual-level data representation for the life2vec model. (A) We organize socio-economic and health data from the Danish national registers from 1st January 2008 until 31st December 2015 into a single chronologically ordered life-sequence. Each database entry becomes an event in the sequence, where an event has associated positional and contextual data. The contextual data include variables associated with the entry (e.g., industry, city, income, job type). The positional data include the person's age (expressed in full years), absolute position (number of days since 1st January 2008), and segment (alternating sequence of three elements). The raw life-sequence is then passed to the model described in panel (B). The model consists of multiple stacked encoders. The first encoder combines contextual and positional information to produce a contextual representation of each life-event. The following encoders output deep contextual representations of each life-event (considering the overall content of the life-sequence). The final encoder layer fuses the representations of life-events to produce the representation of a life-sequence. The decoder uses the latter to make predictions.
  • Figure 2: Performance of models on the Mortality Prediction Task, quantified with the Median Corrected Matthews correlation coefficient (C-MCC) [ramola2018estimating] with 95% CI. (A) Comparison of life2vec performance to baselines. (B-D) Performance of the life2vec model on different cohorts of the population. (B) Performance of life2vec per sequence length. Sequence length does not affect the performance. (C) Performance of life2vec based on the number of health events in a sequence. The model performs better on cohorts with a higher number of health events. (D) Performance of life2vec per intersectional group (based on age group and sex).
  • Figure 3: Performance Evaluation for the Personality Nuances Task. We display Cohen's Quadratic Kappa score for each item separately for the Random Guess, RNN, and life2vec models. The error bars indicate the Median Absolute Deviation. The question wordings are as follows. Q1 (Social Self-esteem): "I feel reasonably satisfied with myself overall". Q2 (Social Boldness): "When I'm in a group of people, I'm often the one who speaks on behalf of the group". Q3 (Sociability): "I prefer jobs that involve active social interaction to those that involve working alone". Q4 (Liveliness): "On most days, I feel cheerful and optimistic".
  • Figure 4: Two-dimensional projection of the concept space (using PaCMAP [wang2021understanding]). Each point corresponds to a concept token in the vocabulary. Points are colored based on the concept types (several types are omitted and shown as black points). Each region provides a closer look at a part of the concept space. For selected tokens, the three closest neighbors (based on the cosine distance) are also shown. (A) Diagnoses related to Pregnancy, childbirth, and the puerperium in ICD-10 [world1992icd]. (B) Job concepts related to Service and Sales Workers (corresponds to Job Category 5 of ISCO-08 [ilo_2012]). (C) Injury-related diagnoses in ICD-10 [world1992icd]. (D) Job concepts related to Technicians and Associate Professionals (corresponds to Job Category 3 of ISCO-08 [ilo_2012]). (E) Income-related concepts. life2vec arranges these concepts in increasing ordinal order. (F) Concepts related to the manufacturing industry in DB07 [db07].
  • Figure 5: Representation of life-sequences conditioned on the Mortality Predictions. (A-G) Two-dimensional projection of 280-dimensional life representations (with the DensMap method [narayan2020density]). (D) The full projection is colored based on the estimated probability of mortality. Pink points stand for the true deceased targets. Points with a smaller radius are uncertain predictions. (A-C and E-G) Zoomed-in regions with additional aspects associated with the life-sequence. (A-C) Region A contains points with a low probability of mortality, while (E-G) Region B contains points with a high probability. (J-H) Spider plot of life2vec's concept sensitivity. The blue line is the median score for the random concept directions, while the blue area specifies the variation of the scores for the random concepts. (J) Concept sensitivity with respect to the "Alive" prediction. (H) Concept sensitivity with respect to the "Deceased" prediction.
  • ...and 9 more figures
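
The event layout described in the Figure 1 caption can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation: the `LifeEvent` class, the `build_life_sequence` helper, and the token strings are hypothetical names, and the segment rule (consecutive events cycle through the values 1, 2, 3) is an assumption based on the caption's phrase "alternating sequence of three elements". Age in full years and absolute position in days since 1st January 2008 follow the caption directly.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence, Tuple

REFERENCE_DATE = date(2008, 1, 1)  # start of the observation window (Figure 1)
NUM_SEGMENTS = 3                   # assumption: segment ids cycle over three values


@dataclass(frozen=True)
class LifeEvent:
    """One registry entry: contextual tokens plus positional data (hypothetical layout)."""
    tokens: Tuple[str, ...]  # contextual data, e.g. industry, city, income, job type
    age: int                 # age in full years on the day of the event
    abspos: int              # days since REFERENCE_DATE
    segment: int             # 1, 2, 3, 1, 2, 3, ... (assumed alternation rule)


def full_years(birth: date, on: date) -> int:
    """Age in completed years on a given day."""
    years = on.year - birth.year
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached in that year
    return years


def build_life_sequence(
    birth: date, records: Sequence[Tuple[date, Sequence[str]]]
) -> List[LifeEvent]:
    """Sort raw (day, tokens) records chronologically and attach positional data."""
    ordered = sorted(records, key=lambda record: record[0])
    return [
        LifeEvent(
            tokens=tuple(tokens),
            age=full_years(birth, day),
            abspos=(day - REFERENCE_DATE).days,
            segment=index % NUM_SEGMENTS + 1,
        )
        for index, (day, tokens) in enumerate(ordered)
    ]


# Hypothetical records; the token names are made up for illustration.
records = [
    (date(2010, 3, 15), ["JOB_5223", "INCOME_42"]),
    (date(2008, 1, 1), ["DIAG_S52"]),
    (date(2012, 6, 10), ["CITY_A"]),
    (date(2015, 12, 31), ["JOB_3112"]),
]
sequence = build_life_sequence(date(1980, 6, 10), records)
for event in sequence:
    print(event)
```

In this sketch each event carries its own positional fields, so a downstream embedding layer could look up the tokens, age, absolute position, and segment separately before combining them, which mirrors the role the caption assigns to the first encoder.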