VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning

Yang Tan, Chen Liu, Jingyuan Gao, Banghao Wu, Mingchen Li, Ruilin Wang, Lingrong Zhang, Huiqun Yu, Guisheng Fan, Liang Hong, Bingxin Zhou

TL;DR

The paper tackles the fragmentation of AI-driven protein engineering workflows by uniting data retrieval, benchmarking, and model fine-tuning in a single platform. It introduces VenusFactory, an extensible engine that integrates rapid data collection from major databanks, a standardized benchmarking suite across five protein-engineering tasks, and modular fine-tuning pipelines for a wide range of PLMs. The system supports both CLI and Gradio interfaces and hosts 40+ datasets and 40+ PLMs, enabling researchers from biology and computer science to prototype end-to-end solutions. Empirical results show SES-Adapter often achieves top performance across tasks, demonstrating the value of task-specific, structure-aware fine-tuning for protein engineering. This platform promises to accelerate interdisciplinary AI-driven protein engineering by simplifying data access, benchmarking, and model adaptation.

Abstract

Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory serves both the computer science and biology communities by offering a command-line interface and a Gradio-based no-code interface, and it integrates 40+ protein-related datasets and 40+ popular PLMs. All implementations are open-sourced at https://github.com/tyang816/VenusFactory.

Paper Structure

This paper contains 36 sections, 1 figure, 8 tables.

Figures (1)

  • Figure 1: VenusFactory supports high-throughput raw data download, structure sequencing, a wide range of downstream task datasets, and protein language model fine-tuning and inference through either the graphical interface or the command line.