PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

Loka Li, Wong Yu Kang, Minghao Fu, Guangyi Chen, Zhenhao Chen, Gongxu Luo, Yuewen Sun, Salman Khan, Peter Spirtes, Kun Zhang

TL;DR

PersonaX introduces two large multimodal datasets, CelebPersona and AthlePersona, that pair LLM-inferred Big Five behavior-trait scores with facial embeddings and rich biographical metadata while enforcing privacy-preserving representations. The work proposes a two-level analysis framework: Level I assesses statistical dependencies between trait scores and structured features, and Level II develops a novel causal representation learning method with identifiability guarantees to recover shared and modality-specific latent factors across text and image data. Synthetic experiments on MNIST-based data validate identifiability and show gains over baselines, while real-world analyses on PersonaX reveal coherent cross-modal causal structures and interpretable latent groupings linking appearance, behavior traits, and biographical context. The results support cross-modal trait analysis and causal reasoning at population scale, with attention to ethics, consent, and bias, and an accompanying codebase is provided for reproducibility.

Abstract

Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning. The code is available at https://github.com/lokali/PersonaX.
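To make the first analysis level concrete, the following is a minimal sketch of one kind of independence test between a trait score and a numeric feature. It is an illustration only: the abstract does not name the five tests used, so a simple permutation test on the absolute Pearson correlation is swapped in here as a stand-in, and the function names `pearson` and `permutation_independence_test` are hypothetical. Note that this statistic only detects linear dependence.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant sequence carries no linear signal
    return cov / (vx * vy) ** 0.5


def permutation_independence_test(x, y, n_perm=2000, seed=0):
    """Permutation p-value for H0: x and y are independent.

    Assumption: |Pearson r| is used as the test statistic; this is a
    stand-in, not one of the tests reported in the paper.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any pairing between x and y
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A p-value below a chosen threshold (for example 0.05) would be read as evidence of dependence between the trait score and the feature; in practice several tests with different assumptions would be compared, as the abstract describes.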

Paper Structure

This paper contains 57 sections, 5 theorems, 54 equations, 15 figures, 8 tables.

Key Result

Lemma 1

Consider a set of modality observations $\mathbf{x}_{m}$ that satisfy Assumption 2.1 in yao2023multi. Suppose there exists a set of modality-specific encoders, each mapping to a common latent space. Let $\hat{g}^{-1}_{\mathbf{x}_k}$ denote a family of encoders aimed at recovering the shared latent v

Figures (15)

  • Figure 1: Data processing pipelines of the AthlePersona (Left) and CelebPersona (Right) datasets. (1) AthlePersona was constructed by collecting player rosters and publicly available data (including facial images and basic features) from the official websites of major sports leagues. These data were then processed with LLMs to infer behavior traits. (2) CelebPersona was derived from the CelebA dataset [3]. Celebrity face identities were linked to their corresponding Wikidata entities, enabling the retrieval of additional biographical details and physical characteristics, which were similarly processed with LLMs to infer behavior traits.
  • Figure 2: Evaluation of LLM consistency for prompt design. Top: Radar plots show the standard deviation (std) of Big Five trait scores across repeated runs under different prompt formats, for each model (by column) and dataset (by row). Middle: Box plots summarize the average std across the Big Five behavior traits, highlighting intra-prompt variability. Bottom: Manhattan distances between two prompt pairs quantify inter-prompt variability. Refer to § \ref{llm-persona-setting} for more setup details and result analysis.
  • Figure 3: Independence test (IT) results and distributions of trait scores. (a) and (b) present heatmaps of significant IT results between Big Five behavior traits and other structured features for CelebPersona and AthlePersona, respectively. Each cell reports "$x/y$," where $x$ is the number of methods that reject the null hypothesis ($p < 0.05$) and $y$ is the total number of applied methods. Lighter shades indicate stronger evidence of dependence. (c) shows the overall distribution of Big Five behavior scores across both datasets. Refer to Tab. \ref{app_athlete_cit} and Tab. \ref{app_celeb_cit} for complete $p$-values.
  • Figure 4: Multi-modality multi-measurement causal model. Latent space is in grey. $\mathbf{s}$ is shared latent variables across different modalities, $\mathbf{z}$ is modality-specific latent variables. $\mathbf{x}_{m,i}$ denotes the $m$-th modality $i$-th observed measurement. $\epsilon$ is the independent noise term.
  • Figure 5: Synthetic experiments. (a) The synthetic dataset consists of two modalities, colored MNIST and fashion MNIST (lecun1998mnist; xiao2017fashion). (b) The underlying true causal graph is shown here. For colored MNIST we generated three different measurements. (c) Experimental results show that our method outperforms the other baselines in terms of both $R^2$ and MCC.
  • ...and 10 more figures
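The summary quantities named in the captions of Figures 2 and 3 can be sketched in a few lines. This is a hedged illustration, not the authors' code: the helper names (`rejection_cell`, `per_trait_std`, `manhattan`) are hypothetical, the use of population rather than sample standard deviation is an assumption, and the vectors compared by the Manhattan distance are assumed to be per-trait score vectors obtained under two prompt formats.

```python
from statistics import pstdev


def rejection_cell(p_values, alpha=0.05):
    """Format one heatmap cell as "x/y": x tests reject H0 out of y applied."""
    rejected = sum(1 for p in p_values if p < alpha)
    return f"{rejected}/{len(p_values)}"


def per_trait_std(runs):
    """Std of each trait across repeated runs.

    `runs` is a list of equal-length score vectors, one per repeated run.
    Assumption: population std (pstdev); the caption does not say which.
    """
    return [pstdev(column) for column in zip(*runs)]


def manhattan(a, b):
    """Manhattan (L1) distance between two score vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


# Illustrative, made-up numbers (not values from the paper).
p_values_for_one_cell = [0.01, 0.20, 0.049, 0.05, 0.70]
print(rejection_cell(p_values_for_one_cell))  # 0.05 itself does not reject

runs_prompt_a = [(3, 4, 2, 5, 1), (3, 5, 2, 4, 1), (3, 4, 2, 5, 2)]
stds = per_trait_std(runs_prompt_a)
print(sum(stds) / len(stds))  # average std across traits (intra-prompt)

print(manhattan((3, 4, 2, 5, 1), (4, 4, 1, 5, 1)))  # inter-prompt distance
```

Lower per-trait std indicates more stable scores under a given prompt, while a smaller Manhattan distance indicates that two prompt formats lead to similar scores.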

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • proof
  • Lemma 1: Identifiability from a Set of Views (yao2023multi)
  • proof
  • Lemma 2: Identifiability of Hidden Causal Variables
  • proof
  • Proposition 1
  • proof
  • Proposition 2: Conditional Independence Condition
  • ...and 1 more