A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR

Xinyu Mao; Bevan Koopman; Guido Zuccon

TL;DR

The paper investigates the reproducibility of the Goldilocks tuning approach for BERT in Technology-Assisted Review (TAR) and tests its generalisability to medical systematic reviews. It reproduces the two-component pipeline (further pre-training, then fine-tuning within an active-learning loop) and evaluates it on in-domain and out-of-domain datasets, including CLEF TAR collections. The key findings are that a Goldilocks epoch (a just-right number of further pre-training epochs) improves performance in some settings, but its optimal value is dataset-dependent and not easily predicted, and that domain mismatch can be substantial when a generic BERT backbone is used. Importantly, a domain-specific pre-trained backbone such as BioLinkBERT, used without further pre-training, can outperform both the Goldilocks-tuned BERT and logistic regression. This suggests a practical alternative to epoch tuning and has implications for domain-adapted TAR workflows.

Abstract

Screening documents is a tedious and time-consuming aspect of high-recall retrieval tasks, such as compiling a systematic literature review, where the goal is to identify all relevant documents for a topic. To help streamline this process, many Technology-Assisted Review (TAR) methods leverage active learning techniques to reduce the number of documents requiring review. BERT-based models have shown high effectiveness in text classification, leading to interest in their potential use in TAR workflows. In this paper, we investigate recent work that examined the impact of further pre-training epochs on the effectiveness and efficiency of a BERT-based active learning pipeline. We first report that we could replicate the original experiments on two specific TAR datasets, confirming some of the findings: importantly, that further pre-training is critical to high effectiveness, but requires attention in terms of selecting the correct training epoch. We then investigate the generalisability of the pipeline on a different TAR task, that of medical systematic reviews. In this context, we show that there is no need for further pre-training if a domain-specific BERT backbone is used within the active learning pipeline. This finding provides practical implications for using the studied active learning pipeline within domain-specific TAR tasks.
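
The active learning loop at the centre of such a TAR workflow can be illustrated with a minimal sketch. This is not the pipeline studied in the paper: the BERT classifier is replaced by a toy term-weight scorer so the example runs with the standard library alone, and the function names, batch size, fixed number of rounds, and toy documents are all assumptions made for illustration.

```python
from collections import Counter

def train_scorer(labelled):
    """Toy stand-in for the classifier (assumption: NOT BERT or logistic regression).
    Each term gets +1 per relevant document and -1 per non-relevant document."""
    weights = Counter()
    for text, is_relevant in labelled:
        for term in set(text.lower().split()):
            weights[term] += 1 if is_relevant else -1
    return weights

def score(weights, text):
    """Sum the learned weights of the distinct terms in a document."""
    return sum(weights[t] for t in set(text.lower().split()))

def active_learning_review(docs, oracle, seed_id, batch_size=2, max_rounds=10):
    """Relevance-feedback loop: retrain, rank unreviewed documents,
    send the top-scoring batch to the reviewer (oracle), repeat."""
    labelled = {seed_id: oracle(seed_id)}
    review_order = [seed_id]
    for _ in range(max_rounds):  # hypothetical stopping rule: fixed round budget
        unlabelled = [d for d in docs if d not in labelled]
        if not unlabelled:
            break
        weights = train_scorer([(docs[d], labelled[d]) for d in labelled])
        # Highest score first; ties broken by document id for determinism.
        ranked = sorted(unlabelled, key=lambda d: (-score(weights, docs[d]), d))
        for d in ranked[:batch_size]:
            labelled[d] = oracle(d)
            review_order.append(d)
    return review_order, labelled

# Hypothetical toy collection; True marks a relevant document.
docs = {
    "d1": "statin therapy randomized trial",
    "d2": "statin trial cardiovascular outcomes",
    "d3": "football match report",
    "d4": "randomized trial of statin dosage",
    "d5": "weather forecast weekend",
    "d6": "recipe for pasta",
}
truth = {"d1": True, "d2": True, "d3": False, "d4": True, "d5": False, "d6": False}

order, labels = active_learning_review(docs, truth.get, seed_id="d1")
```

In this toy run the relevant documents are reviewed before the non-relevant ones, which is the saving in screening effort that TAR aims for. In the studied pipeline the scorer would instead be a BERT model that is fine-tuned on the labelled documents at each iteration.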
