
Incentives shape how humans co-create with generative AI

Nathanael Jo, Manish Raghavan

Abstract

Generative AI is quickly becoming an integral part of people's everyday workflows. Early evidence has shown that while generative AI can increase individual-level productivity, it does so at the cost of collective diversity, potentially narrowing the set of ideas and perspectives produced. Our research stands in contrast to this concern: through a pre-registered randomized control trial, we show that incentives mediate AI's homogenizing force in a creative writing task where participants can use AI interactively. Participants rewarded for originality relative to peers produce collectively more diverse writing than those rewarded for quality alone. This divergence is driven not by abandoning AI, but by how participants use it: those incentivized for originality incorporate fewer AI suggestions verbatim, relying on the model more selectively for brainstorming, proofreading, and targeted edits. Our results reveal that the effects of generative AI depend not only on the technology itself, but also the behavioral strategies and incentive structures surrounding its use.

Paper Structure

This paper contains 68 sections, 9 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Effect sizes from difference-in-means tests (Welch t-test, one-sided) with significance: $^{*}$$p<0.1$, $^{**}$$p<0.05$, $^{***}$$p<0.01$, over the final submissions (left) and between the final submission and the first valid draft the AI suggested (right). A higher positive value means that the first group in the difference-in-means test is more homogeneous. All rows are similarity metrics, except for $n$-gram diversity score (NGDS), which is a diversity metric. As such, we flip the comparison for NGDS so that the effect sizes are directionally consistent. See Appendix Table \ref{tab:effects_combined} for specific effect sizes and multiple hypothesis correction.
  • Figure 2: (a) Average cosine similarity of embeddings across randomized groups, for both the final submission and the first full draft the AI produces (when available). (b) [Top] Average cosine similarity of embeddings across randomized groups through time spent in the session. We only include text that has at least 200 words at every timestep to ensure that the embeddings capture enough information about the story. [Bottom] Number of requests over time spent in the session. (c) First and second principal components of the embeddings across groups, for both the drafts that the AI produces (when available) and the final submissions. All plots use the embedding model all-MiniLM-L12-v2. See Appendix \ref{appsec:trajectories} for other metrics as robustness checks.
  • Figure 3: Example of the AI attribution procedure to construct the adoption metric.
  • Figure 4: (a) Histogram of adoption score $A_i$ for both the AI-T and AI-O groups. (b) Time spent writing AI prompts by adoption and group. (c) Number of words in AI prompts by adoption and group. (d) Number of AI requests by adoption and group.
  • Figure 5: Distribution of editing and drafting requests normalized by number of participants in each group, over adoption levels (a), incentive group (b), and general experience with AI (c). High (low) experience is a self-reported score of $\geq 3$ ($< 3$).
  • ...and 12 more figures
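The homogeneity measure referenced in Figures 1 and 2 aggregates pairwise cosine similarity over story embeddings (the paper uses all-MiniLM-L12-v2 to produce them). A minimal sketch of that aggregation step, with toy 2-D vectors standing in for real sentence embeddings; `avg_pairwise_similarity` is a hypothetical helper name, not the authors' code:

```python
from itertools import combinations
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avg_pairwise_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of embeddings.

    Higher values indicate a more homogeneous (less diverse) set of texts.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# Toy "embeddings": two orthogonal vectors and one duplicate.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(avg_pairwise_similarity(toy))  # prints 0.3333... (1 identical pair, 2 orthogonal)
```

In the experiment, a lower group-level average of this quantity for the originality-incentivized participants is what supports the claim that incentives mitigate AI-driven homogenization.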