An Active Learning Framework for Data-Efficient, Human-in-the-Loop Enzyme Function Prediction

Ashley Babjac; Adrienne Hoarfrost

TL;DR

This paper introduces HATTER (Human-in-the-loop Adaptive Toolkit for Transferable Enzyme Representations), a modular framework that integrates multiple active learning strategies with human-in-the-loop experimental annotation to efficiently fine-tune function prediction models.

Abstract

Generalizable protein function prediction is increasingly constrained by the growing mismatch between the exponentially expanding number of environmental protein sequences and the comparatively slow accumulation of experimentally verified functional data. Active learning offers a promising path forward for accelerating biological function prediction by selecting the most informative proteins to annotate experimentally for data-efficient training, yet its potential remains largely unexplored. We introduce HATTER (Human-in-the-loop Adaptive Toolkit for Transferable Enzyme Representations), a modular framework that integrates multiple active learning strategies with human-in-the-loop experimental annotation to efficiently fine-tune function prediction models. We compare active learning training to standard supervised training for enzyme function prediction, demonstrating that active learning achieves performance comparable to standard training across diverse protein sequence evaluation datasets while requiring fewer model updates, processing less data, and substantially reducing computational cost. Interestingly, point-based uncertainty sampling methods such as entropy or margin sampling perform as well as or better than more complex acquisition functions such as Bayesian sampling or BALD, highlighting the relative importance of sequence diversity in training datasets and of model architecture design. These results demonstrate that human-in-the-loop active learning can efficiently accelerate enzyme discovery, providing a flexible platform for adaptive, scalable, and expert-guided protein function prediction.
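
To make the point-based uncertainty methods named in the abstract concrete, the following is a minimal sketch of entropy and margin sampling over predicted class probabilities. It is not HATTER's implementation: the function names, the three-class probability vectors, and the batch size are all hypothetical and chosen only for illustration.

```python
import math

def entropy_score(probs):
    """Shannon entropy (in nats) of a predicted class distribution.
    Higher values mean the model is more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin_score(probs):
    """Negative gap between the two most probable classes,
    negated so that higher values mean more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return -(top - second)

def select_queries(pool_probs, score_fn, k):
    """Return the indices of the k pool items with the highest score."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score_fn(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical model outputs over three enzyme classes
# for four unlabeled sequences (assumed values, not real data).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # probability spread over all classes
    [0.50, 0.49, 0.01],  # two classes nearly tied
    [0.70, 0.20, 0.10],
]

entropy_picks = select_queries(pool_probs, entropy_score, k=2)
margin_picks = select_queries(pool_probs, margin_score, k=2)
print(entropy_picks, margin_picks)
```

On this toy pool the two criteria pick different sequences: entropy favors predictions spread across all classes, while margin favors near-ties between the top two classes. Both need only a single forward pass per sequence, which is what distinguishes them from sampling-based acquisition functions such as BALD.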

Paper Structure (18 sections, 4 figures, 1 table)

Introduction
Methodology
HATTER
Inputs:
Underlying Model:
Active Learning Implementation:
Operational Modes and Outputs:
Evaluation Metrics:
Active Learning Benchmarking using HATTER
Data:
Training Procedure:
Implementation Details:
Results
Active Learners perform comparatively to standard training procedure
Active learning matches standard training performance with far fewer model updates
...and 3 more sections

Figures (4)

Figure 1: A graphical demonstration of the HATTER pipeline. The HATTER pipeline mirrors a standard active learning pipeline and follows four steps: (i) Select queries (amino acid sequences) based on predicted model "uncertainty", where uncertainty can be calculated using several implemented methods, (ii) label model queries with real-life experimental annotation (i.e. the "oracle"), (iii) use the labeled queries to update the functional prediction model, and (iv) repeat steps (i)-(iii) until model convergence or desired performance is achieved.
Figure 2: A comparison of performance for both underlying models (left: CLEAN, right: TwoLayer) across all acquisition strategies. Across accuracy, F1, and hF1 metrics, there is no statistically significant difference among acquisition functions: all acquisition functions fall within the 95% confidence intervals of one another, as well as of random sampling and standard training.
Figure 3: Early stopping analysis. Left: Distribution of model accuracy for CLEAN (blue) and TwoLayer (orange) across five-fold cross-validation splits, comparing full 100-epoch training (left) with early stopping (right). Right: Average epoch, with 95% confidence interval across the five-fold splits, at which early stopping occurred for each strategy using CLEAN.
Figure 4: Data efficiency and computation analysis. Left: Comparison of data usage and computational efficiency between active learning and standard training, showing cumulative data processed over 100 epochs. Dotted vertical lines mark the early-stopping epoch for the fastest-converging AL strategy (entropy sampling, green) vs. standard training (red). Blue vertical lines indicate the early-stopping points for the other sampling strategies. Right: Percent reduction in data processed, calculated at the epoch of early stopping relative to standard training.
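
The four-step loop described in the Figure 1 caption can be sketched as follows. This is an illustrative toy under stated assumptions, not the HATTER code: the "model" is a hypothetical nearest-centroid classifier over one-dimensional features standing in for CLEAN or TwoLayer, the "oracle" is a simple function standing in for experimental annotation, and a fixed round budget replaces the convergence or target-performance stopping criterion.

```python
import math

def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    logits = [-(x - c) ** 2 for c in centroids]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of a class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def fit_centroids(labeled, previous):
    """Set each class centroid to the mean of its labeled features;
    classes with no labels yet keep their previous centroid."""
    updated = []
    for c, old in enumerate(previous):
        xs = [x for x, y in labeled if y == c]
        updated.append(sum(xs) / len(xs) if xs else old)
    return updated

def active_learning_loop(pool, oracle, centroids, batch_size, max_rounds):
    labeled, unlabeled = [], list(pool)
    # (iv) repeat; a fixed budget is assumed here instead of a convergence test
    for _ in range(max_rounds):
        if not unlabeled:
            break
        # (i) select the queries the current model is least certain about
        unlabeled.sort(key=lambda x: entropy(predict_proba(centroids, x)),
                       reverse=True)
        queries, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        # (ii) ask the oracle (experimental annotation) for labels
        labeled.extend((x, oracle(x)) for x in queries)
        # (iii) update the model with everything labeled so far
        centroids = fit_centroids(labeled, centroids)
    return centroids, labeled

# Hypothetical pool of 1-D features; the true class boundary is at 1.5.
pool = [0.1, 0.2, 0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0]
oracle = lambda x: 0 if x < 1.5 else 1

centroids, labeled = active_learning_loop(
    pool, oracle, centroids=[0.0, 3.0], batch_size=2, max_rounds=3)
print(sorted(x for x, _ in labeled))
```

In this toy run the loop spends its six labels on the points closest to the class boundary and never queries the easy points at the extremes, which is the intuition behind the data-efficiency comparison in Figure 4.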
