AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts
Jiří Milička, Anna Marklová, Václav Cvrček
TL;DR
The paper addresses the need for linguistically analyzable, comparable corpora of LLM-generated text by creating AI Brown and AI Koditex, which mirror the Brown-style reference corpora BE21 (English) and Koditex (Czech). The texts are generated from seeds taken from the reference corpora, using a broad set of frontier and open-weight models from multiple providers with controlled prompts and temperatures; the output is then cleaned and annotated according to Universal Dependencies (UD) to ensure comparability. Key contributions include a publicly licensed, UD-tagged, multi-format dataset linked to a KonText search interface, plus methodological documentation and scripts for reproducing and extending the corpora. The work enables rigorous cross-model, cross-language analysis of AI-generated language, supports reproducibility, and envisions ongoing expansion to track how models evolve over time, so that the corpora serve as a living museum of LLM outputs for corpus linguistics research.
Abstract
This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for the linguistic comparison of human-written and LLM-generated texts. Emphasis was placed on ensuring that these resources are multi-genre and rich in topics, authors, and text types, while maintaining comparability with existing human-created corpora. The generated corpora replicate two reference human corpora: BE21 by Paul Baker, a modern version of the original Brown Corpus, and the Koditex corpus, which also follows the Brown Corpus tradition but in Czech. The new corpora were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., they are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies with the model used: the English part contains on average 864k tokens per model (27M tokens altogether), and the Czech part contains on average 768k tokens per model (21.5M tokens altogether). The corpora are freely available for download under the CC BY 4.0 license (the annotated data are under the CC BY-NC-SA 4.0 license) and are also accessible through the search interface of the Czech National Corpus.
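To make the Universal Dependencies annotation concrete, the following minimal Python sketch reads a sentence in CoNLL-U, the standard UD exchange format, and extracts lemmas and part-of-speech counts. It assumes that the annotated data can be obtained in CoNLL-U, which the abstract does not state; the sample sentence is invented for illustration and is not taken from either corpus, and the function name parse_conllu is hypothetical.

```python
from collections import Counter

# The ten tab-separated columns of the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Invented sample sentence (an assumption for illustration, not corpus data).
SAMPLE = (
    "# text = Models write texts.\n"
    "1\tModels\tmodel\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\twrite\twrite\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_\n"
    "3\ttexts\ttext\tNOUN\t_\tNumber=Plur\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def parse_conllu(text):
    """Return a list of sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        # Keep only regular tokens: skip malformed lines, multiword
        # token ranges (e.g. "1-2"), and empty nodes (e.g. "1.1").
        if len(cols) != len(FIELDS) or not cols[0].isdigit():
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
lemmas = [tok["lemma"] for tok in sentences[0]]
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
print(lemmas)
print(upos_counts)
```

Applied in the same way to a generated subcorpus and to its human-written reference, such counts would give a simple first point of comparison between the two.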
