Open Polymer Challenge: Post-Competition Report

Gang Liu, Sobin Alosious, Subhamoy Mahajan, Eric Inae, Yihan Zhu, Yuhan Liu, Renzheng Zhang, Jiaxin Xu, Addison Howard, Ying Li, Tengfei Luo, Meng Jiang

TL;DR

The paper introduces the Open Polymer Challenge (OPC), the first large-scale, community-driven benchmark for polymer informatics, featuring MD-derived properties for thousands of polymers and a multi-task prediction setup. It details the ADEPT data-generation pipeline, the five properties studied, and the competition design, including data leakage handling and distribution-shift considerations. Key findings show that careful data curation, diverse yet simple feature engineering, and robust tree-based models achieve strong performance on small, noisy datasets, while highlighting the need for standardized pipelines and improved Tg handling. The work provides a practical foundation (datasets, code, and analyses) that can accelerate molecular AI for sustainable polymer discovery and guide best practices for future large-scale polymer datasets.

Abstract

Machine learning (ML) offers a powerful path toward discovering sustainable polymer materials, but progress has been limited by the lack of large, high-quality, and openly accessible polymer datasets. The Open Polymer Challenge (OPC) addresses this gap by releasing the first community-developed benchmark for polymer informatics, featuring a dataset with 10K polymers and 5 properties: thermal conductivity, radius of gyration, density, fractional free volume, and glass transition temperature. The challenge centers on multi-task polymer property prediction, a core step in virtual screening pipelines for materials discovery. Participants developed models under realistic constraints that include small data, label imbalance, and heterogeneous simulation sources, using techniques such as feature-based augmentation, transfer learning, self-supervised pretraining, and targeted ensemble strategies. The competition also revealed important lessons about data preparation, distribution shifts, and cross-group simulation consistency, informing best practices for future large-scale polymer datasets. The resulting models, analysis, and released data create a new foundation for molecular AI in polymer science and are expected to accelerate the development of sustainable and energy-efficient materials. Along with the competition, we release the test dataset at https://www.kaggle.com/datasets/alexliu99/neurips-open-polymer-prediction-2025-test-data. We also release the data generation pipeline at https://github.com/sobinalosious/ADEPT, which simulates more than 25 properties, including thermal conductivity, radius of gyration, and density.

Paper Structure

This paper contains 22 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of Open Polymer Challenge from four dimensions.
  • Figure 2: An example of a high-throughput MD workflow for polymer property calculations. The process includes SMILES parsing, monomer-to-polymer chain construction, amorphous packing, multi-stage equilibration, and computation of density, radius of gyration, and thermal conductivity.
  • Figure 3: Glass transition temperature (T$_g$) from bi-linear and hyperbolic fits. (a) Early test cases where bi-linear and hyperbolic fits produced identical results. Later verifications showed that (c, d) good hyperbolic fits can cause significant deviations in T$_g$, and (d, e) produce a wide range of T$_g$ values that depends on the constraint bounds on the fitting parameters. (d) Best hyperbolic fit can be high.
  • Figure 4: Choices of data strategies among Top 5, Top 10, and Top 200 leaderboard teams. Feature engineering (e.g., custom descriptors, fingerprints) was universally adopted in the top tiers, while use of SMILES augmentation showed decreasing adoption beyond Top 10.
  • Figure 5: Model architecture usage among Top 5, Top 10, and Top 200 leaderboard teams. Tree-based models, especially LightGBM, remained dominant in top positions.
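
The bi-linear fit named in the Figure 3 caption can be illustrated with a minimal sketch. It is not the ADEPT implementation: it assumes T$_g$ is read off as the intersection of two straight lines fitted to the low-temperature and high-temperature branches of a density-temperature curve, with the split point chosen to minimize the total squared error. The function name `bilinear_tg`, the `min_points` parameter, and the synthetic data are all hypothetical and serve only as an example.

```python
import numpy as np


def bilinear_tg(temps, values, min_points=3):
    """Estimate Tg as the intersection of two fitted lines (illustrative only).

    Assumption: the property (e.g. density) is roughly linear in temperature
    on each side of the transition. Every admissible split is tried and the
    one with the lowest total squared error is kept.
    """
    temps = np.asarray(temps, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(temps)
    temps, values = temps[order], values[order]

    best = None
    for k in range(min_points, len(temps) - min_points + 1):
        # Fit one line to the low-temperature branch, one to the high branch.
        low = np.polyfit(temps[:k], values[:k], 1)
        high = np.polyfit(temps[k:], values[k:], 1)
        sse = (np.sum((np.polyval(low, temps[:k]) - values[:k]) ** 2)
               + np.sum((np.polyval(high, temps[k:]) - values[k:]) ** 2))
        if best is None or sse < best[0]:
            best = (sse, low, high)

    if best is None:
        raise ValueError("not enough points for two line segments")

    _, (m1, b1), (m2, b2) = best
    if np.isclose(m1, m2):
        raise ValueError("branches are parallel; no unique intersection")
    # Solve m1*T + b1 = m2*T + b2 for T.
    return (b2 - b1) / (m1 - m2)


# Synthetic, noise-free density curve with a change of slope at 400 K
# (made-up numbers, not data from the challenge).
temps = np.arange(200.0, 601.0, 25.0)
density = np.where(temps < 400.0,
                   1.2 - 2e-4 * (temps - 400.0),
                   1.2 - 6e-4 * (temps - 400.0))
tg_estimate = bilinear_tg(temps, density)
print(f"Estimated Tg: {tg_estimate:.1f} K")
```

On noisy simulation data the chosen split, and therefore the estimate, can move noticeably with `min_points` and with the temperature range sampled, so a sketch like this should be read as a baseline for comparison rather than a recommended procedure.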