Conflicts of Interest in Published NLP Research 2000-2024
Maarten Bosten, Bennett Kleinberg
TL;DR
The paper investigates conflicts of interest (COIs) in published NLP research from 2000 to 2024 by building a near-complete dataset of ACL Anthology papers and labeling author affiliations as industry or non-industry. It uses web scraping, GROBID XML extraction, and pattern-based labeling to produce a final corpus of 58,687 papers, then analyzes COI prevalence across venues and over time. Overall, 27.65% of papers include at least one industry-affiliated author; in 2024, more than one-third of papers had a COI, with EMNLP and ACL as the main drivers. The study argues for transparent COI reporting in NLP venues and proposes a simple policy change: require COI disclosures at submission to mitigate potential biases.
Abstract
Natural Language Processing research is increasingly reliant on large-scale data and computational power. Many achievements in the past decade resulted from collaborations with the tech industry. But the increasing entanglement of academic research and industry interests leads to conflicts of interest. We assessed published NLP research from 2000 to 2024 and labeled author affiliations as academic or industry-affiliated to measure conflicts of interest. Overall, 27.65% of the papers contained at least one industry-affiliated author. That share increased substantially over time, with more than one in three papers having a conflict of interest in 2024. We identify top-tier venues (ACL, EMNLP) as the main drivers of that increase. The paper closes with a discussion and a simple, concrete suggestion for the future.
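
To make the measurement concrete, the following is a minimal sketch of how pattern-based affiliation labeling and a per-year COI share could be computed. It is an illustration under stated assumptions, not the authors' implementation: the keyword list, the function names (`is_industry`, `has_coi`, `coi_share_by_year`), and the input layout (one dict per paper with a `year` and a list of `affiliations` strings) are all hypothetical. The only rule taken from the text is that a paper counts as having a COI if at least one author is industry-affiliated.

```python
import re
from collections import defaultdict

# Hypothetical keyword list for illustration only; the pattern set actually
# used in the paper is not reproduced here.
INDUSTRY_PATTERN = re.compile(
    r"\b(?:google|microsoft|amazon|ibm|inc|ltd|gmbh|corp|corporation)\b",
    re.IGNORECASE,
)


def is_industry(affiliation: str) -> bool:
    """Label one affiliation string as industry if any keyword matches."""
    return bool(INDUSTRY_PATTERN.search(affiliation))


def has_coi(affiliations: list[str]) -> bool:
    """A paper has a COI if at least one affiliation is labeled industry."""
    return any(is_industry(a) for a in affiliations)


def coi_share_by_year(papers: list[dict]) -> dict[int, float]:
    """Return, for each year, the fraction of papers with a COI."""
    totals: dict[int, int] = defaultdict(int)
    flagged: dict[int, int] = defaultdict(int)
    for paper in papers:
        year = paper["year"]
        totals[year] += 1
        if has_coi(paper["affiliations"]):
            flagged[year] += 1
    return {year: flagged[year] / totals[year] for year in sorted(totals)}
```

A keyword approach like this is cheap to run over tens of thousands of papers, but it will miss industry labs without a matching keyword and may mislabel academic units with company names in them, so any real pattern list would need manual validation.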
