From Patents to Dataset: Scraping for Oxide Glass Compositions and Properties
Gustavo Laranja Thomaello, Thomaz Yeiden Busnardo Aguena, Eric Trevelato Costa, Rafael Baságlia Rosante, Thiago Rodrigo Ramos, Daiane Aparecida Zuanetti, Edgar Dutra Zanotto
TL;DR
The paper addresses the scarcity and stagnation of comprehensive oxide glass property data by mining patent tables to build an ML-ready dataset containing liquidus temperature ($T_{\textrm{liq}}$), refractive index ($n$), and Abbe number ($\nu_d$). It introduces a web-scraping pipeline built on Google Patents with a crawler–scraper architecture, followed by merging and cleaning steps that yield compositions on both molar and mass bases while preserving full traceability via patent_id. Relative to SciGlass and INTERGLAD, the patent-derived data increase coverage by approximately $10.4\%$ for $T_{\textrm{liq}}$, $6.6\%$ for $n$, and $4.9\%$ for $\nu_d$, and broaden the compositional space with compositions richer in oxides such as TiO$_2$, MgO, ZrO$_2$, Nb$_2$O$_5$, Fe$_2$O$_3$, SnO$_2$, and Y$_2$O$_3$. The dataset is designed to be extensible; planned enhancements include LLM-based parsing, OCR for scanned documents, and additional properties, in support of data-driven design and exploration of glass composition–property relationships.
Abstract
In this work, we present web scraping techniques to extract information from patent tables, then clean and structure it for future use in predictive machine learning models for developing new glasses. We extracted compositions and three properties relevant to glass development and structured them into a database that can be used together with other available datasets. We also analyzed the consistency of the extracted information and what it adds to existing databases. The extracted data comprise 5,696 compositions with liquidus temperatures, 4,298 with refractive indices, and 1,771 with Abbe numbers. This extraction increases the available information by approximately 10.4% for liquidus temperature, 6.6% for refractive index, and 4.9% for Abbe number. The impact extends beyond quantity: the newly extracted data introduce compositions with property values that are more diverse than those in existing databases, thereby expanding the accessible compositional and property space for glass modeling applications. In particular, the compositions in the new database contain relatively more titanium, magnesium, zirconium, niobium, iron, tin, and yttrium oxides than those in the existing databases.
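The TL;DR mentions that the dataset provides compositions on both molar and mass bases, keyed by patent_id. As a minimal sketch of what such a conversion could look like (not the authors' implementation), the Python snippet below converts a hypothetical record between mass % and mol %. The record layout, the function names, the example composition, and the placeholder patent_id are all assumptions made for illustration; the molar masses are standard approximate values and are not taken from the paper.

```python
# Approximate molar masses in g/mol (standard values, not taken from the paper).
MOLAR_MASS = {"SiO2": 60.084, "Na2O": 61.979, "CaO": 56.077}


def mass_to_mol_percent(mass_pct):
    """Convert an oxide composition from mass % to mol %."""
    # Moles of each oxide per 100 g of glass.
    moles = {oxide: w / MOLAR_MASS[oxide] for oxide, w in mass_pct.items()}
    total = sum(moles.values())
    return {oxide: 100.0 * n / total for oxide, n in moles.items()}


def mol_to_mass_percent(mol_pct):
    """Convert an oxide composition from mol % to mass %."""
    # Mass of each oxide per 100 mol of glass.
    masses = {oxide: x * MOLAR_MASS[oxide] for oxide, x in mol_pct.items()}
    total = sum(masses.values())
    return {oxide: 100.0 * m / total for oxide, m in masses.items()}


# Hypothetical record: the identifier and composition are placeholders.
record = {
    "patent_id": "EXAMPLE-0001",
    "mass_pct": {"SiO2": 72.0, "Na2O": 14.0, "CaO": 14.0},
}

# Store both bases side by side so the patent_id stays attached to each.
record["mol_pct"] = mass_to_mol_percent(record["mass_pct"])
round_trip = mol_to_mass_percent(record["mol_pct"])
```

Keeping both bases in the same record, next to the source identifier, means a value can be traced back to its patent whichever basis a downstream model uses. Converting back with `mol_to_mass_percent` should reproduce the original mass percentages up to floating-point error, which gives a simple consistency check.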
