Pareto Data Framework: Steps Towards Resource-Efficient Decision Making Using Minimum Viable Data (MVD)
Tashfain Ahmed, Josh Siegel
TL;DR
The paper tackles data overabundance in resource-constrained IoT, proposing the Pareto Data Framework and the concept of Minimum Viable Data (MVD): the minimal data needed to meet performance targets. By systematically reducing sample rate, bit depth, and clip length in audio time-series and locating the inflection points (knees) where performance begins to decline, the approach shows that bandwidth and storage can be cut considerably with only modest losses in accuracy (performance remains at $90$–$99\%$). Experiments across multiple audio datasets and classifiers show consistent benefits from multi-dimensional data reduction and support generalization to other time-series domains and industrial applications, including a factory-scale example. The work offers a practical, scalable pathway to democratize AI on constrained devices, with implications for sustainable, cost-efficient deployment across sectors such as agriculture, transportation, and manufacturing.
Abstract
This paper introduces the Pareto Data Framework, an approach for identifying and selecting the Minimum Viable Data (MVD) required to enable machine learning applications on constrained platforms such as embedded systems, mobile devices, and Internet of Things (IoT) devices. We demonstrate that strategic data reduction can maintain high performance while significantly reducing bandwidth, energy, computation, and storage costs. The framework identifies MVD to optimize efficiency across resource-constrained environments without sacrificing performance. It addresses common inefficiencies in IoT applications, such as overprovisioning of sensors and overprecision and oversampling of signals, and proposes scalable solutions for optimal sensor selection, signal extraction and transmission, and data representation. An experimental methodology demonstrates effective acoustic data characterization after downsampling, quantization, and truncation, which simulate reduced-fidelity sensors and network and storage constraints; results show that performance can be maintained at up to 95\% with sample rates reduced by 75\% and bit depths and clip lengths reduced by 50\%, which translates into substantial cost and resource reductions. These findings have implications for the design and development of constrained systems. The paper also discusses broader implications of the framework, including the potential to democratize advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing to improve access and multiply the benefits of data-driven insights.
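To make the three reductions named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes uncompressed signed-integer PCM audio, a hypothetical 16 kHz, 16-bit, one-second test tone, and naive decimation without an anti-aliasing filter; the function names `downsample`, `requantize`, and `truncate` are illustrative only. Under those assumptions the raw data volume scales with the product of the three factors, 0.25 × 0.5 × 0.5 = 0.0625.

```python
import math

def downsample(samples, factor):
    """Keep every `factor`-th sample (naive decimation; a real pipeline
    would low-pass filter first to limit aliasing)."""
    return samples[::factor]

def requantize(samples, src_bits, dst_bits):
    """Reduce bit depth of signed integer samples by dropping low-order bits."""
    return [s >> (src_bits - dst_bits) for s in samples]

def truncate(samples, keep_fraction):
    """Keep only the leading fraction of the clip."""
    return samples[: int(len(samples) * keep_fraction)]

# Hypothetical baseline: 1 s of a 440 Hz tone at 16 kHz, 16-bit signed PCM.
SRC_RATE, SRC_BITS = 16_000, 16
original = [int(32767 * math.sin(2 * math.pi * 440 * n / SRC_RATE))
            for n in range(SRC_RATE)]

# Reductions quoted in the abstract: rate -75%, bit depth -50%, length -50%.
DST_BITS = SRC_BITS // 2
reduced = requantize(truncate(downsample(original, 4), 0.5), SRC_BITS, DST_BITS)

original_bits = len(original) * SRC_BITS
reduced_bits = len(reduced) * DST_BITS
size_ratio = reduced_bits / original_bits  # 0.25 * 0.5 * 0.5

print(f"raw size: {original_bits} -> {reduced_bits} bits "
      f"({size_ratio:.2%} of original)")
```

With these assumed settings the reduced clip occupies 6.25% of the original raw size. Whether a classifier still meets its performance target on the reduced clip is the empirical question the framework addresses, and this sketch does not evaluate it.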
