Language Model Powered Digital Biology with BRAD

Joshua Pickard; Ram Prakash; Marc Andrew Choi; Natalie Oliven; Cooper Stansbury; Jillian Cwycyshyn; Alex Gorodetsky; Alvaro Velasquez; Indika Rajapakse

TL;DR

BRAD is a retrieval-augmented digital assistant that integrates LLMs with bioinformatics tools, databases, and software pipelines through an agent-based architecture. The system connects a configurable Agent to document repositories, online literature, and external software in a modular way, enabling end-to-end biomarker workflows with grounded, verifiable outputs. Benchmarking and RAG Assessment (RAGAs) indicate improved faithfulness and relevance for BRAD's responses, while also exposing the costs and limitations of LLM-driven biomarker discovery. Overall, BRAD demonstrates a flexible, extensible framework for deploying LLM-powered, tool-rich bioinformatics assistants in research settings.

Abstract

Recent advancements in Large Language Models (LLMs) are transforming biology, computer science, engineering, and everyday life. However, integrating the wide array of computational tools, databases, and scientific literature continues to pose a challenge to biological research. LLMs are well suited to unstructured integration, efficient information retrieval, and the automation of standard workflows and actions across these diverse resources. To harness these capabilities in bioinformatics, we present a prototype Bioinformatics Retrieval Augmented Digital assistant (BRAD). BRAD is a chatbot and agentic system that integrates a variety of bioinformatics tools. The Python package implements an AI Agent that is powered by LLMs and connects to a local file system, online databases, and a user's software. The Agent is highly configurable, enabling tasks such as Retrieval-Augmented Generation, searches across bioinformatics databases, and the execution of software pipelines. BRAD's coordinated integration of bioinformatics tools delivers a context-aware and semi-autonomous system that extends beyond the capabilities of conventional LLM-based chatbots. A graphical user interface (GUI) provides intuitive access to the system.
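
The abstract describes an Agent that sits between the user and several tool modules (document retrieval, database search, software execution). The following is a minimal sketch of that dispatch pattern, not BRAD's actual API: the class name ToyAgent, the tool names, the stub handlers, and the keyword-based routing rule are all hypothetical stand-ins, and a real system would likely ask the LLM to choose the tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyAgent:
    """Hypothetical agent that routes queries to tools; not BRAD's Agent class."""

    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    memory: List[Tuple[str, str, str]] = field(default_factory=list)

    def register(self, name: str, handler: Callable[[str], str]) -> None:
        """Add a tool module under a name the router can return."""
        self.tools[name] = handler

    def route(self, query: str) -> str:
        """Pick a tool name. Assumption: keyword matching replaces LLM routing."""
        words = set(query.lower().replace("?", " ").split())
        if words & {"run", "pipeline", "script"}:
            return "software"
        if words & {"gene", "database"}:
            return "search"
        return "rag"  # default: answer from the document collection

    def invoke(self, query: str) -> str:
        """Route the query, call the tool, and record the exchange in memory."""
        name = self.route(query)
        answer = self.tools[name](query)
        self.memory.append((name, query, answer))
        return answer


def build_demo_agent() -> ToyAgent:
    """Wire up stub tools that only echo which module handled the query."""
    agent = ToyAgent()
    agent.register("rag", lambda q: f"[rag] retrieved passages for: {q}")
    agent.register("search", lambda q: f"[search] database hits for: {q}")
    agent.register("software", lambda q: f"[software] executed request: {q}")
    return agent


if __name__ == "__main__":
    demo = build_demo_agent()
    for question in (
        "Summarize the uploaded papers",
        "Which database lists this gene?",
        "Run the biomarker pipeline",
    ):
        print(demo.invoke(question))
```

Keeping the tools in a plain dictionary means a new module is added by registering one more handler, which mirrors the modularity the abstract emphasizes without claiming anything about how BRAD implements it.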

Paper Structure (21 sections, 2 equations, 3 figures, 1 table)

Introduction
Software Architecture
Python Architecture
Tool Modules
Document Chat Tool
Search Tool
Module Template
Deployment and Distributions
Graphical User Interface (GUI)
Deployment Options
Results and Applications
Biomarker Identification Workflow
Cost and Benchmarking BRAD's Tools
Summary
Results and Examples
...and 6 more sections

Figures (3)

Figure 1: Overview of BRAD's architecture, capabilities, and user interface. (A) The modular architecture uses the Agent as an interface between the user, the LLM, and tools that access external resources. The document and online databases are deployed in the GUI, and the modular design allows users to add new tools, similar to the Software tool, that interface with and retrieve information from new sources. (B) Agentic workflows use multiple tools to address user queries. In the biomarker workflow, the Agent uses the Software tool to run Python scripts, each of which writes an output file to the output directory and updates the Agent's memory. The output from one stage serves as input for the next, culminating in an interpretable spreadsheet for the user. Additional software can be integrated to support tasks beyond biomarker selection. (C) Example outputs from the newest commercial LLM versus a RAG pipeline. (D) The Graphical User Interface (GUI) of BRAD highlights the user query, the LLM response, and the retrieved information. The left panel lets users switch between chat sessions, while the right panel provides options to adjust settings for the LLM, the RAG pipeline, and additional features.
Figure 2: Profile of the costs associated with LLM use in each module of BRAD. Each task is color-coded according to its module: LAB NOTEBOOK in green, SOFTWARE in red, and DIGITAL LIBRARY in blue.
Figure 3: Visualizations of the RAGAs metrics using BRAD. The base LLM in all three BRAD bots is OpenAI GPT-4o-mini, and the enhancements used in ERAG are Multiquery, Contextual Compression, and Reranking.
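
The staged workflow described in the Figure 1(B) caption, where each script writes a file to the output directory, the Agent's memory is updated, and the file feeds the next stage, can be sketched as follows. This is an illustrative sketch only: the stage functions, the 0.5 threshold, the file naming, the list used as "memory", and the toy gene labels are all assumptions and do not come from the paper.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["gene", "score"]  # assumed column layout for the toy data


def write_rows(path, rows):
    """Write a list of row dicts to a CSV file with a header."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read a CSV file back into a list of row dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def filter_stage(in_path, out_path):
    """Hypothetical stage: keep rows whose score exceeds 0.5."""
    write_rows(out_path, [r for r in read_rows(in_path) if float(r["score"]) > 0.5])


def rank_stage(in_path, out_path):
    """Hypothetical stage: sort the surviving rows by descending score."""
    rows = sorted(read_rows(in_path), key=lambda r: float(r["score"]), reverse=True)
    write_rows(out_path, rows)


def run_workflow(input_csv, out_dir, stages):
    """Run stages in order; each output file becomes the next stage's input."""
    memory = []  # stand-in for the Agent's memory: a log of produced files
    current = Path(input_csv)
    for index, stage in enumerate(stages, start=1):
        out_path = Path(out_dir) / f"stage{index}_{stage.__name__}.csv"
        stage(current, out_path)
        memory.append(str(out_path))
        current = out_path
    return current, memory


def demo():
    """Run the two toy stages on made-up rows and return the final gene order."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "input.csv"
        write_rows(raw, [
            {"gene": "GENE_A", "score": "0.7"},
            {"gene": "GENE_B", "score": "0.2"},
            {"gene": "GENE_C", "score": "0.9"},
        ])
        final, memory = run_workflow(raw, tmp, [filter_stage, rank_stage])
        return [r["gene"] for r in read_rows(final)], len(memory)


if __name__ == "__main__":
    print(demo())
```

Passing file paths between stages, rather than in-memory objects, keeps every intermediate result on disk where a user can inspect it, which is consistent with the caption's emphasis on an interpretable final spreadsheet.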
