Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams
Cristiano Mesquita Garcia, Alessandro Lameiras Koerich, Alceu de Souza Britto, Jean Paul Barddal
TL;DR
The paper tackles concept drift in text streams by evaluating text-based sampling methods that select which texts are used to fine-tune SentenceBERT (SBERT) in a streaming setting. It compares four SBERT loss functions (BATL, CTL, OCL, SL) across seven sampling variants, including the newly proposed WordPieceToken ratio sampling. The evaluation uses the Airbnb and Yelp streams with an incremental SVM classifier, and reports prequential Macro F1 and elapsed time. The findings show that Softmax loss (SL) and Batch All Triplets loss (BATL) are effective for maintaining performance, and that WordPieceToken ratio sampling often yields the best improvements, especially when class information is incorporated. Larger sample sizes generally raise Macro F1 but increase runtime. The work offers practical guidance for adapting sentence representations in data streams, describes the trade-off between accuracy and computation, and highlights the potential of targeted sampling to mitigate concept drift in real-time text analytics.
Abstract
The proliferation of textual data on the Internet presents a unique opportunity for institutions and companies to monitor public opinion about their services and products. Given the rapid generation of such data, the text stream mining setting, which handles sequentially arriving, potentially infinite text streams, is often more suitable than traditional batch learning. While pre-trained language models are commonly employed in streaming contexts for their high-quality text vectorization, they struggle to adapt to concept drift, the phenomenon in which the data distribution changes over time and degrades model performance. To address concept drift, this study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models, thereby mitigating performance degradation. We assess the impact of these methods on fine-tuning the SBERT model using four different loss functions. Our evaluation, focused on Macro F1-score and elapsed time, employs two text stream datasets and an incremental SVM classifier to benchmark performance. Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification, and that larger sample sizes generally correlate with improved Macro F1-scores. Notably, our proposed WordPieceToken ratio sampling method significantly enhances performance with the identified loss functions, surpassing baseline results.
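The evaluation protocol named above, prequential (test-then-train) Macro F1 with an incremental SVM, can be illustrated with a minimal sketch. This is not the authors' implementation. The class `IncrementalLinearSVM`, the helpers `macro_f1` and `prequential_macro_f1`, the hyperparameters, and the toy two-dimensional stream are all hypothetical stand-ins. In the paper's setting the feature vectors would be SBERT embeddings of the incoming texts.

```python
import random


class IncrementalLinearSVM:
    """Hypothetical one-vs-rest linear SVM, updated one example at a time
    with hinge-loss SGD (a stand-in for an incremental SVM classifier)."""

    def __init__(self, n_features, classes, lr=0.1, reg=0.001):
        self.classes = list(classes)
        self.lr = lr
        self.reg = reg
        self.w = {c: [0.0] * n_features for c in self.classes}
        self.b = {c: 0.0 for c in self.classes}

    def _score(self, c, x):
        return sum(wi * xi for wi, xi in zip(self.w[c], x)) + self.b[c]

    def predict(self, x):
        # Pick the class whose one-vs-rest score is largest.
        return max(self.classes, key=lambda c: self._score(c, x))

    def learn_one(self, x, y):
        shrink = 1.0 - self.lr * self.reg  # L2 regularisation step
        for c in self.classes:
            target = 1.0 if c == y else -1.0
            violated = target * self._score(c, x) < 1.0
            step = self.lr * target if violated else 0.0
            # Shrink the weights, then add the hinge sub-gradient step
            # only when the margin is violated.
            self.w[c] = [wi * shrink + step * xi
                         for wi, xi in zip(self.w[c], x)]
            self.b[c] += step


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def prequential_macro_f1(stream, model, classes):
    """Test-then-train: predict each instance before learning from it."""
    y_true, y_pred = [], []
    for x, y in stream:
        y_pred.append(model.predict(x))  # test first
        y_true.append(y)
        model.learn_one(x, y)            # then train
    return macro_f1(y_true, y_pred, classes)


def toy_stream(n=400, seed=0):
    """Synthetic two-class stream; real inputs would be text embeddings."""
    rng = random.Random(seed)
    for _ in range(n):
        y = rng.choice([0, 1])
        center = 1.0 if y == 1 else -1.0
        yield [center + rng.gauss(0, 0.3), center + rng.gauss(0, 0.3)], y


if __name__ == "__main__":
    classes = [0, 1]
    model = IncrementalLinearSVM(n_features=2, classes=classes)
    print(prequential_macro_f1(toy_stream(), model, classes))
```

Because every instance is scored before the model sees its label, the resulting Macro F1 reflects performance on unseen data throughout the stream, which is what makes degradation under concept drift visible.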
