An Active Learning Framework with a Class Balancing Strategy for Time Series Classification
Shemonto Das
TL;DR
This work addresses high labeling costs and class imbalance in time series classification by introducing an Active Learning (AL) framework augmented with a class-balancing instance selection algorithm. It systematically evaluates uncertainty sampling (UNC), query-by-committee (QBC), and expected model change across tactile texture recognition and industrial fault detection, using sliding-window temporal features and the $F_1$-score as the evaluation metric. In tactile texture recognition, a 6-second window with 50% overlap and Extra Trees trained under UNC achieved an $F_1$-score of about 90.25% with only about 70% of the data labeled. In synthetic fiber manufacturing, AL with balancing reduced labeling to roughly 21% and, with XGBoost and QBC, reached an $F_1$-score of about 69.24% at full budget. Overall, the framework demonstrates meaningful reductions in annotation cost while maintaining or improving performance, and its modular design suggests applicability to other time series domains with imbalanced data.
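To make the windowing setup concrete, the following is a minimal sketch of segmenting a multichannel signal into 6-second windows with 50% overlap and computing simple per-window statistics. The function names `sliding_windows` and `window_features`, the sampling rate, and the choice of statistics (mean, standard deviation, minimum, maximum) are illustrative assumptions, not the feature set used in this work.

```python
import numpy as np

def sliding_windows(signal, sample_rate_hz, window_s=6.0, overlap=0.5):
    """Split a (n_samples, n_channels) signal into overlapping windows.

    Hypothetical helper: returns an array of shape
    (n_windows, window_length, n_channels).
    """
    signal = np.asarray(signal, dtype=float)
    if signal.ndim == 1:
        signal = signal.reshape(-1, 1)  # treat a 1-D signal as one channel
    win = int(round(window_s * sample_rate_hz))
    # Step between window starts; 50% overlap means half a window.
    step = max(1, int(round(win * (1.0 - overlap))))
    if win <= 0 or len(signal) < win:
        raise ValueError("signal is shorter than one window")
    starts = range(0, len(signal) - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

def window_features(windows):
    """Per-window, per-channel statistics (an assumed, simple feature set)."""
    return np.concatenate(
        [
            windows.mean(axis=1),
            windows.std(axis=1),
            windows.min(axis=1),
            windows.max(axis=1),
        ],
        axis=1,
    )

# Example: 30 s of a single-channel signal sampled at an assumed 10 Hz.
demo_signal = np.arange(300)
demo_windows = sliding_windows(demo_signal, sample_rate_hz=10)
demo_features = window_features(demo_windows)
```

Each row of `demo_features` would then serve as one instance for a classifier such as Extra Trees; a trailing partial window is dropped in this sketch.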
Abstract
Training machine learning models for classification tasks often requires labeling numerous samples, which is costly and time-consuming, especially in time series analysis. This research investigates Active Learning (AL) strategies to reduce the amount of labeled data needed for effective time series classification. Traditional AL techniques cannot control how many instances are selected per class for labeling, which can bias both instance selection and classification performance, particularly in imbalanced time series datasets. To address this, we propose a novel class-balancing instance selection algorithm integrated with standard AL strategies. Our approach selects more instances from classes with fewer labeled examples, thereby counteracting imbalance in time series datasets. We demonstrate the effectiveness of our AL framework in selecting informative data samples in two distinct domains: tactile texture recognition and industrial fault detection. In robotics, our method achieves high-performance texture categorization while reducing the labeled training data requirement to 70%. We also evaluate the impact of different sliding-window time intervals on robotic texture classification using AL strategies. In synthetic fiber manufacturing, we adapt AL techniques to fault classification, aiming to minimize the cost and time of data annotation for industry. We also address real-life class imbalance in the multiclass industrial anomaly dataset using our class-balancing instance selection algorithm integrated with AL strategies. Overall, this thesis highlights the potential of our AL framework across these two distinct domains.
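The idea of favoring classes with fewer labeled examples can be sketched as follows. This is only one possible reading, under stated assumptions: uncertainty is measured by least confidence, unlabeled instances are grouped by their predicted class, and each class receives a share of the query batch inversely proportional to its current labeled count. The function name `balanced_uncertainty_query` and the quota rule are hypothetical and are not taken from the algorithm proposed in this work.

```python
import numpy as np

def balanced_uncertainty_query(proba, labeled_counts, batch_size):
    """Pick pool indices to label, favoring under-represented classes.

    proba          : (n_pool, n_classes) predicted class probabilities
    labeled_counts : (n_classes,) number of labeled examples per class so far
    batch_size     : number of instances to query
    """
    proba = np.asarray(proba, dtype=float)
    labeled_counts = np.asarray(labeled_counts, dtype=float)

    uncertainty = 1.0 - proba.max(axis=1)  # least-confidence score
    predicted = proba.argmax(axis=1)       # proxy for the unknown true class

    # Assumed quota rule: share inversely proportional to labeled count.
    weights = 1.0 / (labeled_counts + 1.0)
    quotas = np.floor(batch_size * weights / weights.sum()).astype(int)

    chosen = []
    for cls in np.argsort(labeled_counts):  # rarest class first
        candidates = np.flatnonzero(predicted == cls)
        ranked = candidates[np.argsort(-uncertainty[candidates])]
        chosen.extend(int(i) for i in ranked[:quotas[cls]])

    # Fill any remaining slots with the most uncertain instances overall.
    taken = set(chosen)
    for idx in np.argsort(-uncertainty):
        if len(chosen) >= batch_size:
            break
        if int(idx) not in taken:
            chosen.append(int(idx))
            taken.add(int(idx))
    return chosen

# Toy pool: class 1 has no labeled examples yet, class 0 has nine.
demo_proba = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.35, 0.65],
    [0.05, 0.95],
]
demo_query = balanced_uncertainty_query(demo_proba, labeled_counts=[9, 0], batch_size=3)
```

In the toy pool, two of the three queried instances are ones the model assigns to the under-labeled class, and the last slot goes to the most uncertain remaining instance. In an AL loop, `proba` would come from the current model's `predict_proba` on the unlabeled pool, and `labeled_counts` would be updated after each annotation round.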
