Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge
TL;DR
Project Alexandria tackles copyright barriers to scientific knowledge by proposing Knowledge Units, a knowledge-preserving, style-agnostic representation of scholarly text produced via LLMs. The approach separates factual content from creative expression and is argued to be legally defensible under German copyright law and US fair use, while empirical results show KU context preserves the majority of original facts across multiple domains with minimal textual reuse. The authors provide MCQ-based benchmarks across abstract and full-paper analyses, demonstrating that KUs closely match original-text information retention with limited degradation for long documents. They also discuss alternative positions, address common criticisms, and outline open problems plus a path toward open, interoperable KU databases and open-source tooling to democratize access to scientific knowledge.
Abstract
Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, such as text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We propose a new idea for the community to adopt: use LLMs to convert scholarly documents into knowledge-preserving but style-agnostic representations that we term Knowledge Units. These units use structured data to capture entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%) factual knowledge from the original text, measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work posits the feasibility of democratizing access to scientific knowledge while respecting copyright.
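As a rough illustration of the idea in the abstract, a Knowledge Unit could be represented as a small record of entities, attributes and relationships, and factual retention could be estimated by comparing MCQ accuracy with Knowledge Unit context against accuracy with the original text. The sketch below is a minimal one under stated assumptions: the class names, field names, the `mcq_accuracy` and `retention` helpers, and the ratio-based definition of retention are all hypothetical and are not taken from the authors' released tools. The sample values are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A (subject, predicate, object) triple linking two entities (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class KnowledgeUnit:
    """Style-free record of facts from one passage (hypothetical schema)."""
    entities: list[str] = field(default_factory=list)
    # Maps an entity name to its attribute/value pairs.
    attributes: dict[str, dict[str, str]] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)


def mcq_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def retention(ku_accuracy: float, original_accuracy: float) -> float:
    """Assumed metric: accuracy with KU context relative to accuracy with the original text."""
    if original_accuracy <= 0:
        raise ValueError("original_accuracy must be positive")
    return ku_accuracy / original_accuracy


# Invented toy example, not data from the paper.
unit = KnowledgeUnit(
    entities=["Knowledge Unit", "scholarly text"],
    attributes={"Knowledge Unit": {"style": "agnostic"}},
    relationships=[Relationship("Knowledge Unit", "derived_from", "scholarly text")],
)
gold = ["A", "B", "C", "A"]
with_original = ["A", "B", "C", "A"]   # answers given the original text
with_units = ["A", "B", "C", "D"]      # answers given only Knowledge Units
score = retention(mcq_accuracy(with_units, gold), mcq_accuracy(with_original, gold))
```

Keeping relationships as explicit triples, rather than free text, reflects the abstract's goal of storing facts without the stylistic content of the source.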
