Chaining thoughts and LLMs to learn DNA structural biophysics
Tyler D. Ross, Ashwin Gopinath
TL;DR
This work asks whether a general-purpose LLM can learn the structural biophysics of DNA for sequence analysis and design. It combines chain-of-thought prompting with a pipeline of task-specific experts, trained and validated on NUPACK-derived data for DNA secondary structure and energetics. The approach yields improved performance in predicting secondary structure, estimating minimum free energy, and designing sequences, with error checking enhancing reliability and design success. The findings suggest that modular, interpretable prompting and task specialization can empower AI systems to tackle complex physical-biological problems and may extend to more complex nucleic acid structures and RNA in the future.
Abstract
The future development of an AI scientist, a tool capable of integrating a variety of experimental data and generating testable hypotheses, holds immense potential. So far, bespoke machine learning models have been created to specialize in single scientific tasks, but they lack the flexibility of a general-purpose model. Here, we show that a general-purpose large language model, ChatGPT 3.5-turbo, can be fine-tuned to learn the structural biophysics of DNA. We find that both fine-tuning models to return chain-of-thought responses and chaining together models fine-tuned for subtasks enhance the ability to analyze and design DNA sequences and their structures.
