Mind the Gap: Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions

Fulei Zhang, Zhou Yu

TL;DR

The paper investigates how users' linguistic behavior differs when interacting with LLM-based assistants versus human agents and demonstrates a measurable stylistic divergence across six dimensions. It shows that models trained on human–human data underperform on human–LLM inputs due to domain shift, and that training-time style augmentation substantially improves robustness, whereas inference-time reformulation is less effective. The authors propose and evaluate a style-augmented training approach using minimal and enriched rewrites, finding that a combined, stylistically diverse dataset yields the best generalization for intent detection in task-oriented dialogues. These findings highlight the importance of exposing models to stylistic variation during training to improve real-world LLM–user interactions and user experience.

Abstract

As Large Language Models (LLMs) are increasingly deployed in customer-facing applications, a critical yet underexplored question is how users communicate differently with LLM chatbots compared to human agents. In this study, we present empirical evidence that users adopt distinct communication styles when they interact with chatbots versus human agents. Our analysis reveals significant differences in grammatical fluency, politeness, and lexical diversity in user language between the two settings. These findings suggest that models trained exclusively on human-human interaction data may not adequately accommodate the communication style shift that occurs once an LLM chatbot is deployed. To enhance LLM robustness to post-launch communication style changes, we experimented with two strategies: (1) data augmentation during the post-training phase and (2) inference-time user message reformulation. Our results indicate that models trained on stylistically diverse datasets significantly outperform those trained exclusively on original or stylistically uniform datasets, while inference-time reformulation proved less effective. These insights help us to better adapt our models for improved LLM-user interaction experiences.

Paper Structure

This paper contains 18 sections, 4 tables, and 1 algorithm.