LICO: Large Language Models for In-Context Molecular Optimization

Tung Nguyen, Aditya Grover

TL;DR

This work tackles black-box optimization by proposing LICO, a framework that extends arbitrary base LLMs with dedicated x- and y-embedding layers and a prediction head to perform in-context function prediction in an embedding space. The added components are trained on semi-synthetic data drawn from a mix of 47 intrinsic molecular properties and GP-based synthetic functions, enabling the model to generalize to unseen molecular objectives without domain-specific natural-language prompts. Empirically, LICO achieves state-of-the-art performance on the PMO-1K molecular optimization benchmark and competitive results on PMO-10K, with larger base LLMs further boosting performance. The approach demonstrates that combining pretrained language-model priors with domain-aligned embeddings and diverse training signals can yield effective, data-efficient surrogate models for complex scientific domains, with potential applicability beyond molecular design.
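The GP-based half of that semi-synthetic training mixture can be pictured with a short sketch. The snippet below draws one synthetic objective from a Gaussian-process prior over molecule features; the RBF kernel, length scale, and featurization are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: sampling one GP-based synthetic objective for training.
# Kernel choice, length scale, and the feature representation are assumptions.
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0):
    """Squared-exponential kernel between two sets of feature vectors."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def sample_gp_objective(X, length_scale=1.0, noise=1e-6, rng=None):
    """Draw function values at inputs X from a zero-mean GP prior.

    X: (n, d) array of molecule features (e.g., fingerprints mapped to R^d).
    Returns an (n,) array of y-values defining one synthetic objective.
    """
    rng = rng or np.random.default_rng()
    K = rbf_kernel(X, X, length_scale) + noise * np.eye(len(X))
    return rng.multivariate_normal(np.zeros(len(X)), K)
```

Resampling the kernel hyperparameters and drawing repeatedly yields a diverse family of functions; mixing these draws with the 47 intrinsic molecular properties provides the training signal for the added layers.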

Abstract

Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as potential candidates for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecular properties simply via in-context prompting. LICO performs competitively on PMO, a challenging molecular optimization benchmark comprising 23 objective functions, and achieves state-of-the-art performance on its low-budget version PMO-1K.
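To make the wrapper concrete, here is a minimal PyTorch sketch of a model in the spirit of the abstract: a pretrained LLM equipped with new embedding and prediction layers and fed interleaved (x, y) pairs. The module names, the dimensions, the decision to freeze the base model, and the HuggingFace-style `inputs_embeds` interface are all assumptions; the special <x>/<y> tokens from Figure 1 are omitted for brevity.

```python
# Illustrative sketch, not the authors' implementation.
import torch
import torch.nn as nn

class LICOSurrogate(nn.Module):
    def __init__(self, base_llm, x_dim, hidden_dim):
        super().__init__()
        self.base_llm = base_llm                      # pretrained transformer
        for p in self.base_llm.parameters():          # assumed frozen here
            p.requires_grad = False
        self.x_embed = nn.Linear(x_dim, hidden_dim)   # molecule features -> LLM space
        self.y_embed = nn.Linear(1, hidden_dim)       # scalar property  -> LLM space
        self.head = nn.Linear(hidden_dim, 1)          # hidden state -> predicted y

    def forward(self, xs, ys):
        """xs: (B, T, x_dim) molecules; ys: (B, T, 1) observed properties.

        Interleaves embedded pairs as x_1 y_1 x_2 y_2 ... and, via causal
        attention, predicts each y_i from x_i and all preceding (x, y) pairs.
        """
        ex, ey = self.x_embed(xs), self.y_embed(ys)
        seq = torch.stack([ex, ey], dim=2).flatten(1, 2)      # (B, 2T, hidden)
        h = self.base_llm(inputs_embeds=seq).last_hidden_state
        return self.head(h[:, 0::2])   # read out at each x position -> (B, T, 1)
```

Training would minimize a regression loss (e.g., MSE) between these predictions and the observed ys across many functions, so that at test time an unseen objective can be handled purely by conditioning on in-context observations.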

Paper Structure

This paper contains 33 sections, 4 equations, 6 figures, 10 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Our proposed approach. We equip a pretrained LLM with an embedding layer for $x$, an embedding layer for $y$, and a prediction layer. We train the model on semi-synthetic data to predict $y$ given $x$ and previous $(x,y)$ pairs. We prepend each $x$ with a special token <x> and each $y$ with a special token <y> to guide in-context reasoning (a sketch of this token layout follows the figure list).
  • Figure 2: The predictive performance of LICO and GP on $3$ objective functions in PMO with different metrics and varying numbers of observations.
  • Figure 3: LICO with different LLM sizes.
  • Figure 4: Predictive performance of LICO with T5-base, Nach0, and Llama-2 as the backbones.
  • Figure 5: Performance of pretrained vs randomly initialized LLMs.
  • ...and 1 more figure
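As referenced in the Figure 1 caption, here is a small sketch of that token layout: each x embedding is preceded by a learned <x> marker and each y embedding by a learned <y> marker before the sequence enters the LLM. The shapes and the use of learned embeddings for the markers are assumptions.

```python
# Hypothetical sketch of the <x>/<y> prompt layout from Figure 1.
import torch

def build_sequence(ex, ey, x_tok, y_tok):
    """ex: (T, d) x-embeddings; ey: (T, d) y-embeddings;
    x_tok, y_tok: (d,) learned special-token embeddings.

    Returns a (4T, d) sequence ordered <x> x_1 <y> y_1 <x> x_2 <y> y_2 ...
    """
    T, d = ex.shape
    xt = x_tok.expand(T, d)                  # repeat <x> marker for every x
    yt = y_tok.expand(T, d)                  # repeat <y> marker for every y
    return torch.stack([xt, ex, yt, ey], dim=1).reshape(4 * T, d)
```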