DRIP: DRop unImportant data Points -- Enhancing Machine Learning Efficiency with Grad-CAM-Based Real-Time Data Prioritization for On-Device Training
Marcus Rüb, Daniel Konegen, Patrick Selle, Axel Sikora, Daniel Mueller-Gritschneder
TL;DR
The paper addresses data retention for on-device training in resource-constrained edge environments by introducing DRIP, a Grad-CAM-based online data valuation method. DRIP computes a DRIP Score from Grad-CAM heatmaps and uses it to decide, in real time, whether a streaming data point should be stored for potential retraining or discarded; the decision thresholds are learned during a Training Phase. Across MNIST, CIFAR-10, Plant Disease, and Speech Commands, DRIP matches or exceeds the accuracy of models trained on the full dataset while saving up to 39% of storage, and it remains robust to noise. The approach applies to both image and audio modalities, reduces data transmission demands, and offers practical benefits for TinyML deployments. Its limitations are the cost of the Training Phase and the risk that the retained data is not representative of the full distribution, both of which warrant further research.
Abstract
Selecting data points for model training is critical in machine learning. Effective selection methods can reduce labeling effort, optimize on-device training for embedded systems with limited data storage, and enhance model performance. This paper introduces a novel algorithm that uses Grad-CAM to make online decisions about retaining or discarding data points. Optimized for embedded devices, the algorithm computes a unique DRIP Score to quantify the importance of each data point. This enables dynamic decision-making on whether a data point should be stored for potential retraining or discarded without compromising model performance. Experimental evaluations on four benchmark datasets demonstrate that our approach can match or even surpass the accuracy of models trained on the entire dataset, all while achieving storage savings of up to 39%. To our knowledge, this is the first algorithm that makes online decisions about data point retention without requiring access to the entire dataset.
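The online keep-or-discard loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: this summary does not give the DRIP Score formula or the decision rule, so `placeholder_score` (mean of the min-max-normalized heatmap), the single `threshold`, and the "keep when the score is at or above the threshold" rule are all hypothetical stand-ins. The heatmaps are assumed to have been produced already by Grad-CAM on the device.

```python
from typing import Callable, Iterable, List, Sequence

Heatmap = Sequence[Sequence[float]]


def placeholder_score(heatmap: Heatmap) -> float:
    """Hypothetical stand-in for the DRIP Score (NOT the paper's formula).

    Returns the mean activation of the heatmap after min-max normalization.
    A constant heatmap carries no spatial information and scores 0.0.
    """
    values = [v for row in heatmap for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    return sum((v - lo) / (hi - lo) for v in values) / len(values)


def filter_stream(
    heatmaps: Iterable[Heatmap],
    threshold: float,
    score_fn: Callable[[Heatmap], float] = placeholder_score,
) -> List[int]:
    """Return the stream indices of the data points that would be stored.

    Each point is judged once, when it arrives, using only its own score
    and a threshold fixed beforehand, so no access to the full dataset is
    needed. The comparison direction (keep if score >= threshold) is an
    assumption made for illustration.
    """
    kept = []
    for index, heatmap in enumerate(heatmaps):
        if score_fn(heatmap) >= threshold:
            kept.append(index)
    return kept


# Toy 2x2 heatmaps standing in for Grad-CAM output.
stream = [
    [[0.0, 0.0], [0.0, 1.0]],  # one hot spot
    [[0.0, 1.0], [1.0, 1.0]],  # mostly active
    [[0.5, 0.5], [0.5, 0.5]],  # constant
]
kept_indices = filter_stream(stream, threshold=0.5)
```

In a real deployment the threshold would come from the Training Phase mentioned in the paper rather than being hand-picked, and `score_fn` would be replaced by the actual DRIP Score.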
