A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs

Julius Broomfield, Kartik Sharma, Srijan Kumar

TL;DR

This work examines how the modality used to represent a persona (text vs. image) shapes how multimodal LLMs embody it. It introduces a modality-parallel dataset of 40 personas, each in four representations (text, image, assisted image, descriptive image), and a 60-question framework to evaluate persona attributes and behaviors. Across five multimodal LLMs, the study finds that text-based personas generally yield richer linguistic habits and more faithful persona alignment, while purely visual representations lag in linguistic expressiveness, though certain image-embedded cues can boost specific actions. The dataset and evaluation framework provide a foundation for standardized assessment of multimodal persona grounding and highlight the need to advance vision-language capabilities to close the gap between textual and visual persona embodiment.

Abstract

Large language models (LLMs) have recently demonstrated remarkable advancements in embodying diverse personas, enhancing their effectiveness as conversational agents and virtual assistants. At the same time, LLMs have made significant strides in processing and integrating multimodal information. However, even though human personas can be expressed in both text and image, the extent to which the modality of a persona impacts its embodiment by the LLM remains largely unexplored. In this paper, we investigate how different modalities influence the expressiveness of personas in multimodal LLMs. To this end, we create a novel modality-parallel dataset of 40 diverse personas varying in age, gender, occupation, and location. Each persona is represented equivalently in four modalities: image-only, text-only, a combination of image and small text, and typographical images, where text is visually stylized to convey persona-related attributes. We then create a systematic evaluation framework with 60 questions and corresponding metrics to assess how well LLMs embody each persona across its attributes and scenarios. Comprehensive experiments on five multimodal LLMs show that personas represented by detailed text show more linguistic habits, while typographical images often show more consistency with the persona. Our results reveal that LLMs often overlook persona-specific details conveyed through images, highlighting underlying limitations and paving the way for future research to bridge this gap. We release the data and code at https://github.com/claws-lab/persona-modality.
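
To give a sense of the scale implied by the counts in the abstract, the following is a minimal sketch that enumerates the evaluation grid. It assumes a full cross product, that is, every model answers every question once for every persona in every modality; the paper may sample or repeat prompts differently. The modality identifiers and the `evaluation_grid` helper are hypothetical names chosen for illustration and are not taken from the released code.

```python
from itertools import product

# Counts stated in the abstract.
NUM_PERSONAS = 40
NUM_QUESTIONS = 60
NUM_MODELS = 5

# Hypothetical identifiers for the four representations named in the TL;DR.
MODALITIES = ("text", "image", "assisted_image", "descriptive_image")


def evaluation_grid(num_personas=NUM_PERSONAS,
                    num_questions=NUM_QUESTIONS,
                    modalities=MODALITIES):
    """Iterate over (persona_id, modality, question_id) triples.

    Assumption: a full cross product, one prompt per triple.
    """
    return product(range(num_personas), modalities, range(num_questions))


# Number of prompts a single model would see under this assumption.
prompts_per_model = sum(1 for _ in evaluation_grid())

# Number of responses across all evaluated models.
total_responses = prompts_per_model * NUM_MODELS

print(prompts_per_model, total_responses)
```

Under this assumption each model sees 40 × 4 × 60 = 9,600 prompts, for 48,000 responses across the five models.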

Paper Structure

This paper contains 34 sections, 1 equation, 5 figures, 10 tables.

Figures (5)

  • Figure 1: A comparison of visual and textual persona interactions for a chef from Paris. The left side presents an image persona, while the right side features a text persona derived from the image.
  • Figure 2: Our pipeline begins with curating a set of personas. Each persona receives a detailed text description, which is then fed into Stable Diffusion to generate the image $\mathcal{I}$. A separate model examines the image and generates an independent textual description, forming the text persona $\mathcal{T}$. Pairing $p$ with $\mathcal{I}$ produces an assisted image $\mathcal{I}_A$, while combining a typographic representation of $p$ with $\mathcal{I}$ produces a descriptive image $\mathcal{I}_D$.
  • Figure 3: LLM-based evaluation, stratified by question type and persona type.
  • Figure 4: The rate and number of refusals in response to persona prompts. Llama 3.2 90B shows a strong aversion to multimodal persona prompts, while other models rarely refuse.
  • Figure 5: Human survey design
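
The data flow described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: $p$ is taken to be a short persona statement, the four stages are passed in as plain callables, and the names `PersonaRecord` and `build_record` are hypothetical. In the paper the image stage is Stable Diffusion and the captioning stage is a separate model; here both are replaced by string stand-ins so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class PersonaRecord:
    """One persona in its four parallel representations (hypothetical layout)."""
    persona: str                      # p: short persona statement (assumed)
    image: str                        # I: image generated from the description
    text: str                         # T: description written from the image
    assisted_image: Tuple[str, str]   # I_A: p paired with I
    descriptive_image: str            # I_D: I with p rendered typographically


def build_record(persona: str,
                 describe: Callable[[str], str],
                 text_to_image: Callable[[str], str],
                 image_to_text: Callable[[str], str],
                 render_typography: Callable[[str, str], str]) -> PersonaRecord:
    """Follow the order of steps in the Figure 2 caption."""
    detailed = describe(persona)          # detailed text description
    image = text_to_image(detailed)       # fed to the image generator
    text = image_to_text(image)           # independent description of the image
    return PersonaRecord(
        persona=persona,
        image=image,
        text=text,
        assisted_image=(persona, image),
        descriptive_image=render_typography(persona, image),
    )


# Stand-in stages: placeholders only, no real model is called.
record = build_record(
    "a chef from Paris",
    describe=lambda p: f"detailed description of {p}",
    text_to_image=lambda d: f"<image generated from: {d}>",
    image_to_text=lambda i: f"caption of {i}",
    render_typography=lambda p, i: f"{i} + typography[{p}]",
)
print(record.text)
```

Passing the stages as callables keeps the sketch independent of any particular image generator or captioning model, which matches the caption's point that $\mathcal{T}$ is derived from $\mathcal{I}$ rather than copied from the original description.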