Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
Q. Vera Liao, Ziang Xiao
TL;DR
This paper reframes the evaluation of large, general-purpose language models as a process of narrowing the socio-technical gap between downstream human needs and model capabilities. It draws on realism concepts from the social sciences, together with a synthesis of work in XAI and HCI, to propose two realism dimensions (context realism and human requirement realism) that guide the selection of evaluation methods. By mapping existing NLG and HCI evaluation approaches along these axes and presenting a running use case, the authors highlight opportunities for contextualized benchmarks, human-centered ratings, and downstream-grounded, cost-aware evaluation. The work argues for a shift from relying solely on automatic metrics to diverse, use-case-driven evaluation practices, and provides concrete recommendations and open questions to advance responsible, real-world deployment of LLMs.
Abstract
The recent development of generative large language models (LLMs) poses new challenges for model evaluation that the research community and industry have been grappling with. While the versatile capabilities of these models ignite much excitement, they also inevitably make a leap toward homogenization: powering a wide range of applications with a single model, often referred to as "general-purpose". In this position paper, we argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization: providing valid assessments of whether and how much human needs in diverse downstream use cases can be satisfied by the given model (the socio-technical gap). By drawing on lessons about improving research realism from the social sciences, human-computer interaction (HCI), and the interdisciplinary field of explainable AI (XAI), we urge the community to develop evaluation methods based on real-world contexts and human requirements, and to embrace diverse evaluation methods while acknowledging the trade-offs between realism and the pragmatic costs of conducting the evaluation. By mapping HCI and current natural language generation (NLG) evaluation methods, we identify opportunities for evaluation methods for LLMs to narrow the socio-technical gap and pose open questions.
