Chaining thoughts and LLMs to learn DNA structural biophysics

Tyler D. Ross, Ashwin Gopinath

TL;DR

This work tackles whether a general-purpose LLM can learn the structural biophysics of DNA for sequence analysis and design. It combines chain-of-thought prompting with a pipeline of task-specific experts, trained and validated on NUPACK-derived data for DNA secondary structure and energetics. The approach yields improved performance in predicting secondary structure, estimating minimum free energy, and designing sequences, with error checking enhancing reliability and design success. The findings suggest that modular, interpretable prompting and task specialization can empower AI systems to tackle complex physical-biological problems and may extend to more complex nucleic acid structures and RNA in the future.

Abstract

The future development of an AI scientist, a tool capable of integrating a variety of experimental data and generating testable hypotheses, holds immense potential. So far, bespoke machine learning models have been created to specialize in single scientific tasks, but they lack the flexibility of a general-purpose model. Here, we show that a general-purpose large language model, chatGPT 3.5-turbo, can be fine-tuned to learn the structural biophysics of DNA. We find that both fine-tuning models to return chain-of-thought responses and chaining together models fine-tuned for subtasks enhance the ability to analyze and design DNA sequences and their structures.

Paper Structure

This paper contains 8 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Schemes for chain of thought (CoT) and pipelines of models used to perform sequence analysis and design. a, Chain-of-thought fine-tuning, where the model prints out each base and its neighbors before determining whether a stable base pair is formed. b, A sequence analysis pipeline in which a model tuned to provide the reverse complement of a sequence feeds its output into a model tuned to determine the secondary structure. White boxes indicate values provided by the user, teal boxes are fine-tuned models, and orange boxes represent final answers from model outputs. c, Expert-based error-checking scheme, where the sequences designed by one model are analyzed by another to verify that the desired secondary structure is formed.
  • Figure 2: Learning curves for secondary structure prediction and the reverse complement expert. For the secondary structure, we are using the expert pipeline approach.
  • Figure 3: Impact of training set size on MFE predictions for the reverse complement expert pipeline approach.
  • Figure 4: Sequence design learning curve for the pipeline with expert error checking approach.
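
The reverse complement step and the expert error-checking loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: in the paper these roles are filled by fine-tuned language models, whereas here `designer` and `analyzer` are hypothetical stand-in callables, and the function names and the `max_attempts` parameter are assumptions introduced only for illustration.

```python
from typing import Callable, Optional

# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    In the paper this subtask is handled by a fine-tuned "expert" model;
    a deterministic function is used here purely for illustration.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def design_with_error_check(
    target_structure: str,
    designer: Callable[[str], str],
    analyzer: Callable[[str], str],
    max_attempts: int = 5,  # hypothetical retry budget, not from the paper
) -> Optional[str]:
    """Ask a designer for a sequence and accept it only if a separate
    analyzer predicts the requested secondary structure for it.

    Both callables are hypothetical stand-ins for fine-tuned models.
    Returns None if no candidate passes within max_attempts.
    """
    for _ in range(max_attempts):
        candidate = designer(target_structure)
        if analyzer(candidate) == target_structure:
            return candidate
    return None
```

For example, `reverse_complement("ATGC")` returns `"GCAT"`. Keeping the designer and the analyzer as separate callables mirrors the idea in the caption that one model's output is verified by another, so a failed check simply triggers another design attempt.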