A Novel Two-Step Fine-Tuning Pipeline for Cold-Start Active Learning in Text Classification Tasks

Fabiano Belém, Washington Cunha, Celso França, Claudio Andrade, Leonardo Rocha, Marcos André Gonçalves

TL;DR

This work tackles the cold-start active learning (AL) problem in automatic text classification (ATC) by introducing DoTCAL, a two-step fine-tuning pipeline that first domain-adapts contextual embeddings via masked language modeling on unlabeled data, then performs task-adaptive fine-tuning using AL-labeled samples. It systematically compares BoW, LSI, FastText, and BERT-based representations across the selection and classification stages and shows that, compared with the traditional one-step method, DoTCAL achieves Macro-F1 gains of up to 33% while halving labeling requirements on eight ATC benchmarks. It also finds that BoW and LSI can outperform contextual embeddings in low-budget scenarios and hard-to-classify tasks. The study extends to RoBERTa, confirming the generality of DoTCAL with larger models and showing notable gains in low-label regimes. Overall, the paper provides practical guidance on which representation to use at each AL stage and highlights the value of domain and task adaptation for efficient AL in ATC, with implications for privacy-conscious, small-to-medium scale models and future AutoML-enabled representation selection.

Abstract

This is the first work to investigate the effectiveness of BERT-based contextual embeddings in active learning (AL) tasks in cold-start scenarios, where traditional fine-tuning is infeasible due to the absence of labeled data. Our primary contribution is a more robust fine-tuning pipeline, DoTCAL, that reduces the reliance on labeled data in AL using two steps: (1) fully leveraging unlabeled data through domain adaptation of the embeddings via masked language modeling and (2) further adjusting model weights using labeled data selected by AL. Our evaluation contrasts BERT-based embeddings with other prevalent text representation paradigms, including Bag of Words (BoW), Latent Semantic Indexing (LSI), and FastText, at two critical stages of the AL process: instance selection and classification. Experiments conducted on eight automatic text classification (ATC) benchmarks with varying AL budgets (number of labeled instances) and dataset sizes (about 5,000 to 300,000 instances) demonstrate DoTCAL's superior effectiveness, achieving up to a 33% improvement in Macro-F1 while reducing labeling effort by half compared to the traditional one-step method. We also found that in several tasks, BoW and LSI (due to information aggregation) produce results superior to BERT (by up to 59%), especially in low-budget scenarios and hard-to-classify tasks, which is quite surprising.
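
The order of operations in the two-step pipeline can be sketched as plain control flow. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `dotcal_sketch` and its callable parameters (`adapt_with_mlm`, `select`, `oracle`, `fine_tune`) are hypothetical names, and the toy stand-ins below replace the real masked-language-modeling and fine-tuning steps, which the abstract does not specify in detail. The selection strategy is passed in separately because the paper evaluates representations for selection and for classification independently.

```python
from typing import Callable, List, Sequence, Tuple


def dotcal_sketch(
    unlabeled_pool: Sequence[str],
    budget: int,
    adapt_with_mlm: Callable[[Sequence[str]], dict],
    select: Callable[[Sequence[str], int], List[int]],
    oracle: Callable[[str], str],
    fine_tune: Callable[[dict, List[Tuple[str, str]]], dict],
) -> Tuple[dict, List[Tuple[str, str]]]:
    """Hypothetical skeleton of a cold-start AL run with two-step fine-tuning."""
    # Step 1: domain adaptation on ALL unlabeled texts (no labels needed).
    adapted = adapt_with_mlm(unlabeled_pool)

    # Cold-start selection: choose at most `budget` instances to label.
    chosen = select(unlabeled_pool, budget)
    if len(chosen) > budget:
        raise ValueError("selection exceeded the labeling budget")

    # The oracle (e.g. a human annotator) labels only the chosen instances.
    labeled = [(unlabeled_pool[i], oracle(unlabeled_pool[i])) for i in chosen]

    # Step 2: task-specific fine-tuning on the AL-selected labeled data.
    return fine_tune(adapted, labeled), labeled


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_adapt(texts):
    return {"seen_texts": len(texts), "steps": ["mlm"]}


def toy_select(texts, k):
    # Placeholder strategy: pick the k longest texts.
    return sorted(range(len(texts)), key=lambda i: -len(texts[i]))[:k]


def toy_oracle(text):
    return "pos" if "good" in text else "neg"


def toy_fine_tune(model, labeled):
    return {**model, "steps": model["steps"] + ["task"], "n_labeled": len(labeled)}


pool = ["good film", "bad", "a very good and long review", "awful plot overall"]
model, labeled = dotcal_sketch(pool, 2, toy_adapt, toy_select, toy_oracle, toy_fine_tune)
print(model["steps"], labeled)
```

In a real setting, `adapt_with_mlm` and `fine_tune` would wrap a BERT-style model, and `select` would be an actual AL selection strategy operating on whichever representation (BoW, LSI, FastText, or contextual embeddings) is chosen for that stage.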

Paper Structure

This paper contains 16 sections, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Cold-Start Active Learning for ATC and contextual embeddings fine-tuning approaches
  • Figure 2: Macro-F1 for the BERT representation under different fine-tuning approaches: DoTCAL, the traditional one-step approach, MLM only (i.e., applying only the first step of our approach), and no fine-tuning. 95% confidence intervals are shown in shaded areas.
  • Figure 3: Macro-F1 for BERT and RoBERTa representations for different fine-tuning approaches and budget sizes. 95% confidence intervals shown in shaded areas.
  • Figure 4: Macro-F1 for different budgets and different numbers of dimensions $d$ for the LSI representation. $d$ = all denotes the vocabulary size of each dataset (i.e., the original BoW without compression). 95% confidence intervals are shown in shaded areas.
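
All figures above report Macro-F1, the unweighted mean of the per-class F1 scores, which gives rare classes the same weight as frequent ones. The following is a small, dependency-free sketch of that standard definition. The function name `macro_f1` and the toy labels are illustrative assumptions, and a class with no true or predicted instances scoring 0 is one common convention, not something stated in the paper.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    classes = set(y_true) | set(y_pred)
    per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# A classifier that always predicts the majority class "a":
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
print(macro_f1(y_true, y_pred))
```

In this toy case accuracy is 0.75, but class "b" has an F1 of 0 and class "a" an F1 of 6/7, so Macro-F1 is 3/7. This sensitivity to minority classes is why the metric is informative on imbalanced classification benchmarks.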