Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset
Man Luo, Bradley Peterson, Rafael Gan, Hari Ramalingame, Navya Gangrade, Ariadne Dimarogona, Imon Banerjee, Phillip Howard
TL;DR
This work introduces the first dataset and benchmark for detecting toxicity in scientific peer reviews, collected from OpenReview and annotated with four toxicity categories. It benchmarks diverse models, including toxicity detectors, sentiment classifiers, open-source LLMs, and closed-source LLMs such as GPT-3.5 and GPT-4, and explores how prompt granularity affects alignment with human judgments. Key findings show that general toxicity detectors underperform in peer-review contexts, while detailed prompts and confidence-based filtering enable GPT-4 to achieve the best alignment (a Cohen's Kappa of up to 0.63 when only predictions with confidence above 95% are kept). The study also demonstrates that GPT-3.5/4 can revise toxic sentences with high human preference, pointing to practical uses for detoxifying peer reviews and guiding future model development for healthier scientific discourse.
Abstract
Peer review is crucial for advancing and improving science through constructive criticism. However, toxic feedback can discourage authors and hinder scientific progress. This work explores an important but underexplored area: detecting toxicity in peer reviews. We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform, annotated by human experts according to these definitions. Leveraging this dataset, we benchmark a variety of models, including a dedicated toxicity detection model, a sentiment analysis model, several open-source large language models (LLMs), and two closed-source LLMs. Our experiments explore the impact of different prompt granularities, from coarse to fine-grained instructions, on model performance. Notably, state-of-the-art LLMs like GPT-4 exhibit low alignment with human judgments under simple prompts but achieve improved alignment with detailed instructions. Moreover, the model's confidence score is a good indicator of alignment with human judgments: predictions made with higher confidence agree with the annotators more often. For example, GPT-4 achieves a Cohen's Kappa score of 0.56 with human judgments, which increases to 0.63 when using only predictions with a confidence score higher than 95%. Overall, our dataset and benchmarks underscore the need for continued research to enhance the toxicity detection capabilities of LLMs. By addressing this issue, our work aims to contribute to a healthy and responsible environment for constructive academic discourse and scientific collaboration.
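The confidence-filtered agreement reported above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `cohens_kappa` and `kappa_above_confidence` are hypothetical, the labels and confidence values are made-up toy data, and the strict `> 0.95` cutoff is an assumption based on the phrase "higher than 95%".

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def kappa_above_confidence(human, predicted, confidence, threshold=0.95):
    """Kappa restricted to predictions whose confidence exceeds the threshold.

    Returns (kappa, number_of_items_kept). The threshold semantics are an
    assumption for illustration.
    """
    kept = [(h, p) for h, p, c in zip(human, predicted, confidence) if c > threshold]
    if not kept:
        raise ValueError("no predictions exceed the confidence threshold")
    kept_human, kept_predicted = zip(*kept)
    return cohens_kappa(list(kept_human), list(kept_predicted)), len(kept)


# Toy data (hypothetical): 1 = toxic, 0 = non-toxic.
human = [1, 0, 1, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 1, 0, 1]
confidence = [0.99, 0.98, 0.60, 0.97, 0.99, 0.70, 0.96, 0.99]

kappa_all = cohens_kappa(human, predicted)
kappa_confident, n_kept = kappa_above_confidence(human, predicted, confidence)
print(f"kappa on all items: {kappa_all:.2f}")
print(f"kappa on {n_kept} high-confidence items: {kappa_confident:.2f}")
```

In this toy example the two disagreements both fall on low-confidence predictions, so filtering raises kappa. Filtering also shrinks the evaluated set, so the number of items kept is worth reporting alongside the score.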
