
Email in the Era of LLMs

Dang Nguyen, Harvey Yiyun Fu, Peter West, Chenhao Tan, Ari Holtzman

Abstract

Email communication increasingly involves large language models (LLMs), but we lack intuition on how they will read, write, and optimize for nuanced social goals. We introduce HR Simulator, a game where communication is the core mechanic: players play as a Human Resources officer and write emails to solve socially challenging workplace scenarios. An analysis of 600+ human and LLM emails with LLMs-as-judge reveals evidence for larger LLMs becoming more homogeneous in their email quality judgments. Under LLM judges, humans underperform LLMs (e.g., 23.5% vs. 48-54% success rate), but a human+LLM approach can outperform LLM-only (e.g., from 40% to nearly 100% in one scenario). In cases where models' email preferences disagree, emergent tact is a plausible explanation: weaker models prefer less tactful strategies while stronger models prefer more tactful ones. Regarding tone, LLM emails are more formal and empathetic while human emails are more varied. LLM rewrites make human emails more formal and empathetic, but models still struggle to imitate human emails in the low-empathy, low-formality quadrant, which highlights a limitation of current post-training approaches. Our results demonstrate the efficacy of communication games as instruments to measure communication in the era of LLMs, and posit human-LLM co-writing as an effective form of communication in that future.

Paper Structure (35 sections, 6 equations, 40 figures, 17 tables)

This paper contains 35 sections, 6 equations, 40 figures, 17 tables.

Figures (40)

  • Figure 1: Empathy–formality analysis of human, LLM, and human+LLM emails. LLMs tend to write high-empathy, high-formality emails, whereas human emails are more diverse. When rewriting, LLMs take human emails toward the high-empathy, high-formality quadrant. We validate the scores with human annotations and find good agreement between the LLM annotator and humans, detailed in Appendix \ref{appendix:tone_analysis}.
  • Figure 2: HR Simulator™ interface. Pairs of models increasingly agree on email quality as they scale up. Red points indicate reasoning models. More details can be found in Appendix \ref{appendix:hr_simulator}.
  • Figure 3: HR Simulator system. The player reads a scenario email and responds. The scenario context and the player's email are sent to the backend. The Recipient(s) respond to the player's email in character. In scenarios 4 and 5, the Simulator simulates an outcome. The Judge reads the player's email, the recipient response, and the outcome, then produces an evaluation; all of these are sent to the frontend. The player reads the results and decides whether to re-attempt.
  • Figure 4: The hybrid advantage. Green arrows denote when the Human+LLM pass rate is higher than that of LLM-only, while red arrows denote when it is lower.
  • Figure 4: Pairwise Pearson correlations of tactfulness scores from different models. The models are Gemini 3 Flash, Grok 4 Fast, and GPT 5.2, covering a variety of sizes. All three models produce similar tactfulness annotations, suggesting that an email's tactfulness is a well-defined construct.
  • ...and 35 more figures