In-Context Molecular Property Prediction with LLMs: A Blinding Study on Memorization and Knowledge Conflicts

Matthias Busch, Marius Tacke, Sviatlana V. Lamaka, Mikhail L. Zheludkevich, Christian J. Cyron, Christian Feiler, Roland C. Aydin

Abstract

The capabilities of large language models (LLMs) have expanded beyond natural language processing to scientific prediction tasks, including molecular property prediction. However, their effectiveness in in-context learning remains ambiguous, particularly given the potential for training data contamination in widely used benchmarks. This paper investigates whether LLMs perform genuine in-context regression on molecular properties or rely primarily on memorized values. Furthermore, we analyze the interplay between pre-trained knowledge and in-context information through a series of progressively blinded experiments. We evaluate nine LLM variants across three families (GPT-4.1, GPT-5, Gemini 2.5) on three MoleculeNet datasets (Delaney solubility, Lipophilicity, QM7 atomization energy) using a systematic blinding approach that iteratively reduces available information. Complementing this, we utilize varying in-context sample sizes (0-, 60-, and 1000-shot) as an additional control for information access. This work provides a principled framework for evaluating molecular property prediction under controlled information access, addressing concerns regarding memorization and exposing conflicts between pre-trained knowledge and in-context information.

Paper Structure

This paper contains 49 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Our results show that LLMs leverage domain knowledge and in-context learning rather than direct memorization, which is a common problem [carlini2023memorization, cheng2025survey, sainz2023contamination, bordt2025forget].
  • Figure 2: Correlation coefficients for molecular property prediction across three datasets (rows: QM7, Lipophilicity, Delaney) and three LLM families (columns: GPT-4.1, GPT-5, Gemini 2.5). The x-axis represents the input configuration (0-shot, 60-shot, 1000-shot). Different colors indicate model sizes within each family. Higher correlation values indicate better performance. Note that the 0-shot correlations vary across datasets, reflecting how prevalent each dataset is in the training corpus, while they remain relatively constant across model families. Also note that the 1000-shot correlation is usually higher than the 0-shot correlation, whereas the 60-shot correlation varies.
  • Figure 3: Cumulative error distribution for 0-shot predictions on the Delaney dataset. The x-axis shows the absolute error threshold, and the y-axis shows the percentage of samples with error below that threshold. Steeper curves indicate better performance. Note the continuous distribution of errors without a substantial concentration at zero, which would be the expected behavior if the LLMs had memorized the target values.
  • Figure 4: Detailed performance breakdown for Gemini 2.5 models across all six blinding levels (x-axis) and three datasets (rows). Specific blinding levels identify the property to be predicted by name, generic blinding levels refer to it as a molecular property, and agnostic blinding levels refer to it as a sample property. "Transf." indicates that the label values were transformed as described in Section \ref{sec:transformation}. Note that the peak correlation is often not at the first blinding level, indicating that the LLMs' prior knowledge can hinder in-context learning.
  • Figure 5: Correlation of OpenAI models (GPT-4.1 and GPT-5 families) across all six blinding levels and three datasets (rows: QM7, Lipophilicity, Delaney). Different colors indicate model sizes within each family. Compared to the Gemini results in Figure \ref{fig:gemini_approaches}, the OpenAI models exhibit more uniform behavior across size variants, and the two families differ markedly in which blinding regime yields peak performance.
  • ...and 1 more figure