CollabStory: Multi-LLM Collaborative Story Generation and Authorship Analysis

Saranya Venkatraman; Nafis Irtiza Tripto; Dongwon Lee

TL;DR

CollabStory introduces the first exclusively LLM-generated multi-LLM collaborative story dataset, enabling analysis of machine-machine writing across up to five authors. The authors develop a systematic data-generation pipeline using five open-source instruction-tuned LLMs and prompt templates that promote sequential, baton-style story writing, yielding 32,503 stories and enabling in-depth continuity and authorship analyses. By adapting PAN tasks to a multi-LLM setting and benchmarking five classical baselines, the paper demonstrates which authorship analyses are tractable (authorship verification and multi-authorship detection) and which remain challenging (exact author attribution and exact author count). The dataset, prompting framework, and baseline insights offer a resource for developing new methods to detect multi-LLM authorship, support credit attribution, and address IP concerns in automated writing contexts with significant practical implications for education, publishing, and misinformation mitigation.

Abstract

The rise of unifying frameworks that enable seamless interoperability of Large Language Models (LLMs) has made LLM-LLM collaboration for open-ended tasks a possibility. Despite this, there have not been efforts to explore such collaborative writing. We take the next step beyond human-LLM collaboration to explore this multi-LLM scenario by generating the first exclusively LLM-generated collaborative stories dataset called CollabStory. We focus on single-author to multi-author (up to 5 LLMs) scenarios, where multiple LLMs co-author stories. We generate over 32k stories using open-source instruction-tuned LLMs. Further, we take inspiration from the PAN tasks that have set the standard for human-human multi-author writing tasks and analysis. We extend their authorship-related tasks for multi-LLM settings and present baselines for LLM-LLM collaboration. We find that current baselines are not able to handle this emerging scenario. Thus, CollabStory is a resource that could help propel an understanding as well as the development of new techniques to discern the use of multiple LLMs. This is crucial to study in the context of writing tasks since LLM-LLM collaboration could potentially overwhelm ongoing challenges related to plagiarism detection, credit assignment, maintaining academic integrity in educational settings, and addressing copyright infringement concerns. We make our dataset and code available at https://github.com/saranya-venkatraman/CollabStory.

Paper Structure (25 sections, 1 equation, 3 figures, 9 tables)

Introduction
Related Work
LLMs as Collaborative Writers
Datasets
Methodology
CollabStory: Dataset Creation
LLM prompting
Post-processing and filtering
Dataset Analysis
Story Continuity
Authorship Analysis: Extending PAN tasks for multi-LLM scenario
Task 1: Is a story written by multiple authors or not?
Task 2: How many authors have written a story?
Task 3: Authorship Verification
Task 4: Authorship Attribution
...and 10 more sections

Figures (3)

Figure 1: CollabStory contains over 32k creative stories written collaboratively by up to 5 LLMs. Each story segment is generated by a single author, which then passes the narrative baton to the next, completing the storyline part by part in a sequential manner.
Figure 2: N on the X-axis denotes the number of authors; N=1(H) and N=1(M) correspond to the human-written and machine-generated single-authored texts, respectively. All other texts (N ≥ 2) are multi-LLM generated. The Y-axis shows the value of the measure named in each subplot's heading. For all measures, we show the average and standard deviation for N going from 1 to 5. For all measures except vocabulary richness (3rd column, 1st row), increasing the number of authors (N) does not lead to statistically significant deviations from the human text distribution.
Figure 3: F1 scores for authorship-related tasks in the multi-LLM scenario, using different methods (color-coded) as the number of authors increases from N=1 to N=5.
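
The baton-style process described in Figure 1, and the labels that the four authorship tasks would need, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the stub generator, and the placeholder author names (`model_a`, `model_b`) are all hypothetical, and the reading of Task 3 as a same-author check on adjacent segments is an assumption made for illustration. A real run would replace the stub with calls to instruction-tuned LLMs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an LLM call: receives the story so far,
# returns the next segment. Not an API from the paper or any library.
Generator = Callable[[str], str]


def write_collaboratively(
    prompt: str, authors: List[Tuple[str, Generator]]
) -> List[Tuple[str, str]]:
    """Each author writes one segment in turn, seeing all prior text."""
    segments: List[Tuple[str, str]] = []
    story_so_far = prompt
    for name, generate in authors:
        segment = generate(story_so_far)
        segments.append((name, segment))
        # Pass the baton: the next author continues from the full story.
        story_so_far = story_so_far + " " + segment
    return segments


def task_labels(segments: List[Tuple[str, str]]) -> Dict[str, object]:
    """Derive illustrative ground-truth labels for the four tasks."""
    names = [name for name, _ in segments]
    return {
        "is_multi_author": len(set(names)) > 1,   # Task 1
        "num_authors": len(set(names)),           # Task 2
        # Task 3 (assumed form): do adjacent segments share an author?
        "same_author_pairs": [a == b for a, b in zip(names, names[1:])],
        "segment_authors": names,                 # Task 4
    }


def make_stub(name: str) -> Generator:
    """Build a deterministic fake generator so the sketch runs offline."""
    def generate(context: str) -> str:
        return f"[{name} continues after {len(context.split())} words]"
    return generate


authors = [("model_a", make_stub("model_a")), ("model_b", make_stub("model_b"))]
segments = write_collaboratively("Once upon a time", authors)
labels = task_labels(segments)
```

Keeping the author name alongside each segment is what makes the later analyses possible: the same record yields the binary multi-author label, the author count, and the per-segment attribution target without any extra bookkeeping.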
