Evaluating Human-LLM Representation Alignment: A Case Study on Affective Sentence Generation for Augmentative and Alternative Communication

Shadab Choudhury, Asha Kumar, Lara J. Martin

TL;DR

This study introduces Representation Alignment as a human-judgment framework to measure how well LLMs’ emotion representations align with human expectations in AAC contexts. By prompting two large models, GPT-4 and LLaMA-3, to generate sentences from four representations (Words, Lexical VAD, Numeric VAD, and Emojis), the authors assess alignment via a Representation Alignment task and an Accuracy/Realism evaluation. Overall, Words and Lexical VAD most closely matched human judgments, while Numeric VAD performed poorly and Emojis showed limited alignment, indicating that representation choice materially affects perceived emotion conveyance and naturalness. The findings support using Words or Lexical VAD in AAC tools for more accurate and realistic affective communication, and they establish a methodological path for future representation-alignment and value-alignment research in NLP for assistive technologies.

Abstract

Gaps arise between a language model's use of concepts and people's expectations. This gap is critical when LLMs generate text to help people communicate via Augmentative and Alternative Communication (AAC) tools. In this work, we introduce the evaluation task of Representation Alignment for measuring this gap via human judgment. In our study, we expand keywords and emotion representations into full sentences. We select four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. In addition to Representation Alignment, we also measure people's judgments of the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., "angry") rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. Furthermore, we found that the perception of how much a generated sentence conveys an emotion is dependent on both the representation type and which emotion it is.

Paper Structure

This paper contains 26 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Representation Alignment experiment. Three keywords and an emotion from one of the four representations are used to generate a sentence. Participants are shown the emotion in only one of the representations and select the sentence that best fits that emotion.
  • Figure 2: Percentage of times a sentence was selected. Each category on the x-axis corresponds to the condition the participant was in, i.e., which representation they saw. The colors delineate what representation was used for sentence generation. Results for GPT-4-generated sentences are on the top, LLaMA-3 on the bottom.
  • Figure 3: Heat maps for Shannon entropy of each emotion across representations. Lower (brighter) values are better, denoting more "agreement" between participants and the LLM. Top: GPT-4, bottom: LLaMA-3.
  • Figure 4: Histograms of the mean scores, with GPT-4 on the left and LLaMA-3 on the right. (a, b) Mean scores for 'Convey'. (c, d) Mean scores for 'You'd say'. (e, f) Mean scores for 'Someone Else'd Say'.
  • Figure 5: Mean "Convey" Scores for each emotion per representation. Higher (brighter) values are better. The top map shows results for GPT-4, while the bottom map is LLaMA-3.
  • ...and 2 more figures
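
The selection percentages in Figure 2 and the Shannon entropy in Figure 3 can be illustrated with a minimal sketch. It assumes the entropy is taken over the distribution of which representation's sentence participants picked for a given emotion; the summary above does not spell out the paper's exact computation. The function names and the sample `choices` list are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def selection_distribution(choices):
    """Fraction of times each representation's sentence was selected."""
    counts = Counter(choices)
    total = sum(counts.values())
    return {rep: n / total for rep, n in counts.items()}


def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical selections for one emotion (illustrative only, not study data).
choices = [
    "Words", "Words", "Lexical VAD", "Words",
    "Emojis", "Words", "Numeric VAD", "Words",
]

dist = selection_distribution(choices)
entropy = shannon_entropy(dist.values())

for rep, share in sorted(dist.items()):
    print(f"{rep}: {share:.1%}")
print(f"entropy: {entropy:.3f} bits")
```

Under this reading, an entropy near 0 bits means participants concentrated on one sentence, and the maximum for four options (2 bits) means selections were spread evenly, which matches the Figure 3 note that lower values denote more agreement.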