Mind the Gap: The Divergence Between Human and LLM-Generated Tasks

Yi-Long Lu, Jiajun Song, Chunhui Zhang, Wei Wang

TL;DR

The paper investigates whether large language model (LLM) agents replicate the value-driven, embodied processes underlying human autonomous task generation. It combines two experiments: a human baseline assessing how personal values and cognitive style, under different environmental conditions, shape task content, and a GPT-4o–based comparison where the model is either raw or conditioned on human profiles. Results show humans generate tasks that are systematically guided by values and environmental context, whereas LLM outputs are more abstract, less social, and less grounded in embodiment, even when provided with value profiles; paradoxically, LLM tasks can feel more novel and fun but are less feasible in real-world, embodied terms. The findings reveal a core gap between human motivational grounding and the statistical patterns of current LLMs, underscoring the need to incorporate intrinsic motivation and physical grounding to achieve more human-aligned autonomous agents.

Abstract

Humans constantly generate a diverse range of tasks guided by internal motivations. While generative agents powered by large language models (LLMs) aim to simulate this complex behavior, it remains uncertain whether they operate on similar cognitive principles. To address this, we conducted a task-generation experiment comparing human responses with those of an LLM agent (GPT-4o). We find that human task generation is consistently influenced by psychological drivers, including personal values (e.g., Openness to Change) and cognitive style. Even when these psychological drivers are explicitly provided to the LLM, it fails to reflect the corresponding behavioral patterns. It produces tasks that are markedly less social, less physical, and thematically biased toward abstraction. Interestingly, the LLM's tasks were perceived as more fun and novel, which highlights a disconnect between its linguistic proficiency and its capacity to generate human-like, embodied goals. We conclude that there is a core gap between the value-driven, embodied nature of human cognition and the statistical patterns of LLMs, highlighting the necessity of incorporating intrinsic motivation and physical grounding into the design of more human-aligned agents.

Paper Structure

This paper contains 33 sections and 5 figures.

Figures (5)

  • Figure 1: Overview of the study. (A) Conceptual framework. Human-generated tasks are typically driven by intrinsic motivation and grounded in embodied experience. In contrast, LLM-generated tasks are produced based on input prompts and their training data, which may result in a fundamental gap between the two. (B) Illustration of the thematic and embodiment gap of the tasks. Human-generated tasks tend to be more social and physically engaging, while LLM-generated tasks are less socially oriented and more abstract or cognitively focused.
  • Figure 2: Experimental interface and procedure. (A) Text-based task generation interface. Participants were asked to generate tasks using a given set of room items. They were instructed to report the task name, required items, detailed setup, goals, and scoring rules. (B) Experimental conditions. Participants were randomly assigned to one of four room scenarios, varying in environmental complexity (high vs. low) and social context (presence vs. absence of other people), sampled from a virtual simulation platform. (C) Task evaluation phase. Independent raters assessed both human- and LLM-generated tasks across multiple dimensions, including Fun, Novelty, Mental Demand, and Physical Demand.
  • Figure 3: Personal values and cognitive style shape human goal generation. (A) Regression coefficients of key predictors on task attributes (Novelty, Fun, and Task Diversity). * indicates $p < 0.05$, ** indicates $p < 0.01$. (B) Openness to Change values predict task fun: as Openness to Change increases, participants' tasks are rated as more enjoyable. (C) Cognitive style (TWS) interacts with environmental complexity to predict task diversity: in high-complexity environments (black), individuals with intuitive styles (lower TWS) produce more diverse tasks. Bars indicate 95% confidence intervals.
  • Figure 4: Humans and GPT generate systematically different goals. (A) Thematic distribution of human and LLM tasks. The 1718 tasks generated by humans and GPT were clustered into 12 topics and 3 themes: "Physical & Sports Activities", "Relaxation & Household Activities", and "Mental & Artistic Activities". (B) LLMs exhibit a lower propensity for social task generation. (C) LLM tasks are perceived as more mentally and less physically demanding. Black dots stand for the overall difficulty ratings of tasks. Orange and green dots stand for mental and physical load, respectively. (D) Engagement ratings of different body parts.
  • Figure 5: LLMs generate tasks rated as more fun and novel.