Refining embeddings with fill-tuning: data-efficient generalised performance improvements for materials foundation models

Matthew P. Wilson, Edward O. Pyzer-Knapp, Nicolas Galichet, Luke Dicks

TL;DR

The paper tackles the problem that fine-tuning pretrained foundation models often harms performance on out-of-distribution tasks. It introduces fill-tuning, a data-efficient strategy that uses latent-space roughness (via frustration analysis and kinetic transition networks) to generate a small, targeted dataset that improves the embedding broadly, not for a single task. Applied to state-of-the-art materials models trained on up to $O(10^9)$ data points, fill-tuning yields roughly a $0.75$–$1\%$ uplift in downstream task performance with only 100 new examples, and this benefit transfers to larger models as well. The work demonstrates that data character and sampling geometry in latent space can drive generalized improvements, offering a practical route to enhance foundation models at low computational cost and motivating further exploration across modalities and architectures.

Abstract

Pretrained foundation models learn embeddings that can be used for a wide range of downstream tasks. These embeddings optimise general performance, and if they are insufficiently accurate at a specific task, the model can be fine-tuned to improve performance. For all current methodologies this operation necessarily degrades performance on all out-of-distribution tasks. In this work we present 'fill-tuning', a novel methodology to generate datasets for continued pretraining of foundation models; these datasets are not tailored to a particular downstream task, but instead aim to correct poor regions of the embedding. We present the application of roughness analysis to latent space topologies and illustrate how it can be used to propose the data that will be most valuable for improving the embedding. We apply fill-tuning to a set of state-of-the-art materials foundation models trained on $O(10^9)$ data points and show model improvement of almost 1% in all downstream tasks with the addition of only 100 data points. This method provides a route to the general improvement of foundation models at the computational cost of fine-tuning.

Paper Structure

This paper contains 9 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An illustrative kinetic transition network constructed from a two-dimensional molecular similarity measure (top). Local minima (green) and transition states (red) are connected by solid lines when they are joined by steepest-descent paths. The corresponding continuous roughness surface is constructed from a sum of multivariate normals (bottom).
  • Figure 2: UMAP projections of the SELFIES-TED (small) latent space. We fit a UMAP reducer using embeddings of $10^4$ molecules sampled from PubChem (top) and all the training data from MoleculeNet classification tasks (bottom), both of which are combined with the data generated by fill-tuning.
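
The Figure 1 caption describes a continuous roughness surface built from a sum of multivariate normals. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes isotropic two-dimensional normals with a shared width `sigma`, centred on hypothetical transition-state coordinates with hypothetical per-state weights. The names `gaussian_2d` and `roughness_surface`, the choice of centres, and all numerical values are illustrative assumptions that do not come from the paper.

```python
import math


def gaussian_2d(point, mean, sigma):
    """Density of an isotropic 2D normal N(mean, sigma^2 * I) at `point`."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return norm * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))


def roughness_surface(point, centres, weights, sigma=0.5):
    """Weighted sum of isotropic normals, one per centre (an assumed form)."""
    return sum(w * gaussian_2d(point, c, sigma) for c, w in zip(centres, weights))


# Hypothetical 2D coordinates and weights; not taken from the paper.
centres = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)]
weights = [1.0, 2.0, 0.5]

# Evaluate the surface on a coarse grid over [0, 2] x [0, 2] and pick the
# grid point with the largest value, i.e. the roughest region in this toy setup.
grid = [(0.25 * i, 0.25 * j) for i in range(9) for j in range(9)]
best = max(grid, key=lambda p: roughness_surface(p, centres, weights))
print(best)
```

In this toy setup the maximum falls on the most heavily weighted centre. A sampling scheme in the spirit of fill-tuning could then propose new data near such high-roughness regions, although how the paper selects those regions is not specified in this section.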