Large Language Model Agent Personality and Response Appropriateness: Evaluation by Human Linguistic Experts, LLM-as-Judge, and Natural Language Processing Model

Eswari Jayakumar, Niladri Sekhar Dash, Debasmita Mukherjee

TL;DR

This paper tackles the challenge of assessing prompted personality in LLM-based agents during poetry explanation tasks. It couples LangChain/RAG-based agent design with a Bloom-inspired, linguistically grounded question bank and evaluates responses via a triad of methods: a transformer-based personality predictor, a Judge LLM, and human linguistic experts. Findings reveal limitations and biases in purely data-driven evaluation, underscoring the need for interdisciplinary design and psychometric validation to reliably infer agent personality. The proposed framework offers a robust approach to designing and validating personality-aware LLM agents for interactive NLP systems.

Abstract

While Large Language Model (LLM)-based agents can be used to create highly engaging interactive applications through prompting personality traits and contextual data, effectively assessing their personalities has proven challenging. This novel interdisciplinary approach addresses this gap by combining agent development and linguistic analysis to assess the prompted personality of LLM-based agents in a poetry explanation task. We developed a novel, flexible question bank, informed by linguistic assessment criteria and human cognitive learning levels, offering a more comprehensive evaluation than current methods. By evaluating agent responses with natural language processing models, other LLMs, and human experts, our findings illustrate the limitations of purely deep learning solutions and emphasize the critical role of interdisciplinary design in agent development.

Paper Structure

This paper contains 16 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Sample questions from the novel question bank crafted in this work, categorised by complexity and by Bloom's taxonomy of levels of learning
  • Figure 2: Evaluation results of chat with the Introvert Agent (IA) and Extrovert Agent (EA): (a) word cloud of Judge LLM reasoning for IA response evaluations, (b) word cloud of Judge LLM reasoning for EA response evaluations, (c) frequency distribution of personality assessed in IA and EA responses by the Judge LLM, (d) box plot of conversational lengths of responses from IA and EA, (e) comparison of Big Five personality traits between IA and EA as predicted by the personality transformer model