Measuring AI Progress in Drug Discovery: A Reproducible Leaderboard for the Tox21 Challenge

Antonia Ebner, Christoph Bartmann, Sonja Topf, Sohvi Luukkonen, Johannes Schimunek, Günter Klambauer

TL;DR

The paper tackles benchmark drift in molecular bioactivity prediction by re-establishing a faithful evaluation on the original Tox21 Challenge dataset. It introduces a reproducible leaderboard on Hugging Face Spaces that runs standardized inference on the original 647 test compounds across 12 toxicity endpoints, with models accessed through a FastAPI interface. The baselines span traditional descriptor-based methods, modern architectures, and zero-shot GPT-OSS 120B, which allows direct comparison under the original protocol. The results indicate that many classic methods remain competitive, giving a nuanced picture of progress as measured by $AUC$ across endpoints. Overall, the work emphasizes the importance of faithful benchmarks and provides a scalable blueprint for extending this approach to other endpoints and to future few-shot and zero-shot evaluations.
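
As a rough illustration of what an $AUC$ comparison across endpoints involves, the following minimal sketch computes a rank-based ROC AUC per endpoint and then averages the results. Everything specific in it is an assumption made for illustration: the helper names `roc_auc` and `mean_auc`, the placeholder endpoint names, the toy labels and scores, and in particular the choice to skip compounds with unmeasured labels (marked `None`). It is not the leaderboard's actual implementation.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(
    labels_by_endpoint: Dict[str, List[Optional[int]]],
    scores_by_endpoint: Dict[str, List[float]],
) -> Tuple[Dict[str, float], float]:
    """Return per-endpoint AUCs and their unweighted mean.

    Assumption: compounds whose label is None (not measured for that
    endpoint) are skipped rather than imputed.
    """
    per_endpoint: Dict[str, float] = {}
    for endpoint, labels in labels_by_endpoint.items():
        scores = scores_by_endpoint[endpoint]
        measured = [(y, s) for y, s in zip(labels, scores) if y is not None]
        per_endpoint[endpoint] = roc_auc(
            [y for y, _ in measured], [s for _, s in measured]
        )
    return per_endpoint, sum(per_endpoint.values()) / len(per_endpoint)


# Toy data for two hypothetical endpoints and five compounds.
toy_labels = {
    "endpoint_a": [1, 0, None, 1, 0],
    "endpoint_b": [0, 1, 1, None, 0],
}
toy_scores = {
    "endpoint_a": [0.9, 0.2, 0.5, 0.4, 0.6],
    "endpoint_b": [0.1, 0.8, 0.7, 0.3, 0.2],
}

per_endpoint, overall = mean_auc(toy_labels, toy_scores)
print(per_endpoint)  # {'endpoint_a': 0.75, 'endpoint_b': 1.0}
print(overall)       # 0.875
```

Masking unmeasured labels instead of imputing them keeps each endpoint's score tied to compounds that were actually assayed. How missing labels are treated is exactly the kind of detail that can change reported numbers between benchmark variants.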

Abstract

Deep learning's rise since the early 2010s has transformed fields like computer vision and natural language processing and strongly influenced biomedical research. For drug discovery specifically, a key inflection point, akin to vision's "ImageNet moment", arrived in 2015, when deep neural networks surpassed traditional approaches on the Tox21 Data Challenge. This milestone accelerated the adoption of deep learning across the pharmaceutical industry, and today most major companies have integrated these methods into their research pipelines. After the Tox21 Challenge concluded, its dataset was included in several established benchmarks, such as MoleculeNet and the Open Graph Benchmark. However, during these integrations, the dataset was altered and labels were imputed or manufactured, resulting in a loss of comparability across studies. Consequently, the extent to which bioactivity and toxicity prediction methods have improved over the past decade remains unclear. To address this, we introduce a reproducible leaderboard, hosted on Hugging Face and built on the original Tox21 Challenge dataset, together with a set of baseline and representative methods. The current version of the leaderboard indicates that the original Tox21 winner, the ensemble-based DeepTox method, and the descriptor-based self-normalizing neural networks introduced in 2017 continue to perform competitively and rank among the top methods for toxicity prediction, leaving it unclear whether substantial progress in toxicity prediction has been achieved over the past decade. As part of this work, we make all baselines and evaluated models publicly accessible for inference via standardized API calls to Hugging Face Spaces.

Paper Structure

This paper contains 21 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview of the Tox21 leaderboard and FastAPI interface linking model spaces with the leaderboard and external users.