Atom-Level Optical Chemical Structure Recognition with Limited Supervision
Martijn Oldenhof, Edward De Brouwer, Adam Arany, Yves Moreau
TL;DR
This work tackles optical chemical structure recognition (OCSR) by introducing AtomLenz, a data-efficient, atom-level OCSR system that jointly predicts atoms, bonds, charges, and stereocenters from images and reconstructs a chemically valid molecular graph. It combines a four-channel object-detection backbone with a graph constructor, and uses weakly supervised training (ProbKT*) plus an edit-correction mechanism to adapt to new domains from SMILES supervision alone; the ChemExpert ensemble further improves its predictions. Experiments show strong performance on hand-drawn images and out-of-domain data, with better atom-level localization and data efficiency than SMILES-only baselines. A curated hand-drawn dataset and a code release support reproducibility. The approach makes chemical data extraction from diverse depictions more practical, supporting scalable digitization of literature and notes with atom-level interpretability.
Abstract
Identifying the chemical structure of a molecule from its graphical representation, or image, is a challenging pattern recognition task that would greatly benefit drug development. Yet existing methods for chemical structure recognition typically do not generalize well and show diminished effectiveness in domains where data is sparse or costly to generate, such as hand-drawn molecule images. To address this limitation, we propose a new chemical structure recognition tool that delivers state-of-the-art performance and can adapt to new domains with a limited number of data samples and limited supervision. Unlike previous approaches, our method provides atom-level localization and can therefore segment the image into its individual atoms and bonds. Ours is the first model to perform optical chemical structure recognition (OCSR) with atom-level entity detection using only SMILES supervision. Through rigorous and extensive benchmarking, we demonstrate that our approach excels in data efficiency, accuracy, and atom-level entity prediction.
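
To make the idea of atom-level localization concrete, the following is a minimal sketch of how detected atom and bond boxes could be assembled into a molecular graph. It is an illustration under stated assumptions, not the graph constructor used by the authors: the `Box` type, the `build_graph` function, the bond label names, and the nearest-two-atoms heuristic are all hypothetical, and charges and stereocenters are omitted for brevity.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """A detected entity: a class label plus an axis-aligned bounding box."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


# Hypothetical mapping from bond-detector labels to bond orders.
BOND_ORDER = {"single": 1, "double": 2, "triple": 3}


def build_graph(atoms, bonds):
    """Link each bond box to the two atoms whose centers are closest to it.

    Returns (nodes, edges): nodes is a list of atom labels, and edges is a
    list of (i, j, order) tuples with i < j indexing into nodes.
    This heuristic is an assumption made for illustration only.
    """
    nodes = [a.label for a in atoms]
    edges = []
    for bond in bonds:
        bx, by = bond.center
        # Rank atom indices by distance from the bond center.
        ranked = sorted(
            range(len(atoms)),
            key=lambda k: hypot(atoms[k].center[0] - bx, atoms[k].center[1] - by),
        )
        if len(ranked) < 2:
            continue  # a bond needs two endpoints
        i, j = sorted(ranked[:2])
        edges.append((i, j, BOND_ORDER[bond.label]))
    return nodes, edges


# Toy detections for a three-atom chain C-C=O laid out left to right.
atoms = [
    Box("C", -2, -2, 2, 2),
    Box("C", 8, -2, 12, 2),
    Box("O", 18, -2, 22, 2),
]
bonds = [
    Box("single", 2, -1, 8, 1),
    Box("double", 12, -1, 18, 1),
]
nodes, edges = build_graph(atoms, bonds)
print(nodes, edges)
```

Because every node and edge in such a graph traces back to a bounding box, each predicted atom or bond can be pointed to in the source image, which is the kind of interpretability that a SMILES-only output does not offer.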
