Prompting-in-a-Series: Psychology-Informed Contents and Embeddings for Personality Recognition With Decoder-Only Models

Jing Jie Tan, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum, Anissa Mokraoui, Shih-Yu Lo

TL;DR

The paper tackles personality recognition from text by introducing PICEPR, a two-pipeline framework that modularizes decoder-only LLMs into Contents and Embeddings components. It defines five LLM roles (Summary, Mimic, Psycho, Classify, Vector) and uses structured prompts with CoT reasoning and JSON outputs to produce robust trait labels and embeddings. Across Essays and Kaggle datasets, PICEPR achieves state-of-the-art gains (5-15%) over regular prompting and several baselines, while analyzing bias, invalid outputs, and cost-efficiency. The study also assesses both decoder-only and encoder-only configurations, showing that modular prompting can rival or surpass fine-tuning, with Embeddings pipelines enabling effective data augmentation for better generalization. Limitations include dataset size, labeling subjectivity, and potential biases, suggesting future work on larger diverse datasets and bias-aware evaluation.
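
The JSON-output step mentioned above can be illustrated with a minimal sketch of how a reply might be parsed and how invalid outputs might be flagged. The schema here (one boolean per Big Five trait under the keys `O`, `C`, `E`, `A`, `N`) and the function name `parse_trait_reply` are assumptions made for illustration; they are not taken from the paper's prompts.

```python
import json

# Hypothetical trait keys (Big Five initials); the paper's actual JSON schema may differ.
TRAITS = ("O", "C", "E", "A", "N")


def parse_trait_reply(reply: str):
    """Extract a trait -> bool mapping from an LLM reply.

    Returns None when the reply is an invalid output: no JSON object,
    malformed JSON, or a missing / non-boolean trait value.
    """
    # CoT replies may put reasoning text around the JSON object,
    # so slice from the first '{' to the last '}'.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    labels = {}
    for trait in TRAITS:
        value = data.get(trait)
        if not isinstance(value, bool):
            return None  # missing key or wrong type counts as invalid
        labels[trait] = value
    return labels


if __name__ == "__main__":
    ok = 'Reasoning... {"O": true, "C": false, "E": true, "A": true, "N": false}'
    print(parse_trait_reply(ok))
    print(parse_trait_reply("I cannot decide."))
```

Returning `None` instead of raising keeps invalid replies countable, which is one simple way to report an invalid-output rate alongside accuracy.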

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. This research introduces a novel "Prompting-in-a-Series" algorithm, termed PICEPR (Psychology-Informed Contents Embeddings for Personality Recognition), featuring two pipelines: (a) Contents and (b) Embeddings. The approach demonstrates how a modularised decoder-only LLM can summarize or generate content that aids personality classification, functioning both as a personality feature extractor and as a generator of personality-rich content. We conducted various experiments to justify the rationale behind the PICEPR algorithm. We also explored closed-source models such as gpt4o from OpenAI and gemini from Google, along with open-source models like mistral from Mistral AI, to compare the quality of the generated content. The PICEPR algorithm achieves new state-of-the-art performance for personality recognition, with a 5-15% improvement. The work repository and model weights can be found at https://research.jingjietan.com/?q=PICEPR.

Paper Structure

This paper contains 33 sections, 11 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Algorithmic Workflow of the PICEPR Framework (Psychology-Informed Contents and Embeddings for Personality Recognition): The diagram presents a structured overview of the dual-pipeline architecture, (a) Contents Pipeline and (b) Embeddings Pipeline, underlying the PICEPR algorithm. Rectangular nodes signify the four integral components of Large Language Models (LLMs), while the yellow cylinder represents the initial, unprocessed dataset. Colored cylinders illustrate various intermediate datasets generated throughout the process via LLM interactions. Note that the dotted line represents optional content or embeddings that were included during empirical testing but excluded based on ablation feedback.
  • Figure 2: The Chain of Thought (CoT) Prompt for PICEPR's Summary LLM ($\mathcal{S}$) is designed to provide a synopsis of user personality based on the given user_text from the dataset. The highlighted content (lines 9–15, 18, 22–27, and 30) is removed during inference (test dataset). This applies to both the Contents and Embeddings pipelines.
  • Figure 3: The Chain of Thought (CoT) Prompt for PICEPR's Psycho LLM ($\mathcal{P}$) is designed to provide facet_lists according to the 77 personality facets [Irwing2023] using the given user_text from the dataset. This applies to both the Contents and Embeddings pipelines.
  • Figure 4: The Chain-of-Thought (CoT) prompt for PICEPR's Contents Pipeline's Classify LLM ($\mathcal{C}$) is used to generate personality labels. The user_Contents is derived from $\mathcal{S}$, and the personality_facets are obtained from $\mathcal{P}$. In the baseline setup for standard CoT classification, the red-highlighted parts of the system prompt are excluded, the user content is replaced with the original content, and the personality facets are removed.
  • Figure 5: The Chain-of-Thought (CoT) prompt for PICEPR's Embeddings Pipeline's Mimic LLM ($\mathcal{M}$) is used to generate augmented positive and negative social media content. The summary is derived from $\mathcal{S}$, and personality facets from $\mathcal{P}$ may be merged into the input.
  • ...and 3 more figures
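
The data flow described in the captions for Figures 2 to 4 (the Summary LLM $\mathcal{S}$ and the Psycho LLM $\mathcal{P}$ both read user_text, and their outputs feed the Classify LLM $\mathcal{C}$) can be sketched as a chain of calls. This is a minimal sketch only: the `LLM` callable type, the placeholder prompt names, and the way the two intermediate outputs are joined are assumptions, not the paper's actual prompts or code.

```python
from typing import Callable

# Assumed interface: (system_prompt, user_input) -> model reply.
LLM = Callable[[str, str], str]

# Placeholder prompt identifiers; the real CoT prompts are shown in the paper's figures.
SUMMARY_PROMPT = "SUMMARY"
PSYCHO_PROMPT = "PSYCHO"
CLASSIFY_PROMPT = "CLASSIFY"


def contents_pipeline(user_text: str, llm: LLM) -> str:
    """Chain the Summary, Psycho, and Classify roles in series."""
    # Role S: synopsis of the user's personality from the raw text.
    summary = llm(SUMMARY_PROMPT, user_text)
    # Role P: personality facet list extracted from the same raw text.
    facets = llm(PSYCHO_PROMPT, user_text)
    # Role C: classify using the two intermediate outputs instead of the raw text.
    classify_input = f"user_contents: {summary}\npersonality_facets: {facets}"
    return llm(CLASSIFY_PROMPT, classify_input)


if __name__ == "__main__":
    # Stand-in model that only echoes which role was called.
    def fake_llm(system_prompt: str, user_input: str) -> str:
        return f"<{system_prompt} reply to {len(user_input)} chars>"

    print(contents_pipeline("I love meeting new people.", fake_llm))
```

Passing the model as a callable keeps the sketch independent of any particular provider, so the same chain could be tried with different closed-source or open-source models.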