Toward Storage-Aware Learning with Compressed Data: An Empirical Exploratory Study on JPEG

Kichang Lee, Songkuk Kim, JaeYeon Park, JeongGil Ko

TL;DR

This work tackles storage-constrained on-device learning by formalizing and empirically investigating the joint data-quantity and data-quality trade-off under a fixed budget, using JPEG compression on CIFAR-10. It shows that naive uniform strategies are suboptimal and that different data samples have varying sensitivity to compression, motivating a sample-wise adaptive compression approach. The authors provide an actionable framework and discuss lightweight proxies for optimizing per-sample fidelity under a budget, laying groundwork for storage-aware learning systems. The findings have practical implications for deploying robust, personalized on-device models under real-world storage constraints and offer avenues to integrate with continual, federated, and active learning paradigms.

Abstract

On-device machine learning is often constrained by limited storage, particularly in continuous data collection scenarios. This paper presents an empirical study on storage-aware learning, focusing on the trade-off between data quantity and quality via compression. We demonstrate that naive strategies, such as uniform data dropping or one-size-fits-all compression, are suboptimal. Our findings further reveal that data samples exhibit varying sensitivities to compression, supporting the feasibility of a sample-wise adaptive compression strategy. These insights provide a foundation for developing a new class of storage-aware learning systems. The primary contribution of this work is the systematic characterization of this under-explored challenge, offering valuable insights that advance the understanding of storage-aware learning.
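The quantity-quality trade-off described above can be made concrete with a small feasibility check. The following is a minimal sketch, not the authors' procedure: it assumes a fixed byte budget and an average per-image size for each JPEG quality level, and it enumerates which (quantity, quality) pairs fit. The function name `feasible_configs` and every size value are hypothetical, chosen only for illustration.

```python
from itertools import product


def feasible_configs(num_samples, avg_bytes_by_quality, budget_bytes, keep_percents):
    """Return (keep_percent, quality, bytes_used) triples that fit within the budget.

    avg_bytes_by_quality maps a JPEG quality setting to an assumed average
    compressed size per image, in bytes.
    """
    configs = []
    qualities = sorted(avg_bytes_by_quality.items())
    for keep_percent, (quality, avg_bytes) in product(keep_percents, qualities):
        kept = num_samples * keep_percent // 100  # number of samples retained
        used = kept * avg_bytes                   # total storage for this pair
        if used <= budget_bytes:
            configs.append((keep_percent, quality, used))
    return configs


# Hypothetical average sizes per image (bytes); not measured values.
avg_bytes = {5: 400, 50: 900, 100: 2500}

configs = feasible_configs(
    num_samples=50_000,            # size of the CIFAR-10 training set
    avg_bytes_by_quality=avg_bytes,
    budget_bytes=25_000_000,       # hypothetical storage budget
    keep_percents=[25, 50, 100],
)

for keep_percent, quality, used in configs:
    print(f"keep {keep_percent:3d}% at Q={quality:3d}: {used:,} bytes")
```

The check only identifies which configurations are storable. Which feasible configuration yields the best accuracy still has to be measured by training on each one, which is the empirical question the study addresses.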

Paper Structure

This paper contains 9 sections, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Model accuracy as a function of storage usage under various data quantity and quality settings. Each line represents a fixed data quantity (10% to 100%), where points along a line indicate increasing data quality with higher storage usage.
  • Figure 2: Best model accuracy and the corresponding data configuration for various storage limitations. For each storage budget, the plot shows the highest achievable accuracy ($\textcolor{blue}{\bullet}$), along with the specific data quantity ($\textcolor{orange}{\blacktriangle}$) and quality ($\textcolor{green}{\star}$) settings required to obtain that performance.
  • Figure 3: Ten samples from the CIFAR-10 dataset at different data quality levels (i.e., compression rate, $Q$).
  • Figure 4: Test accuracy as a function of test data quality for models trained with different training data qualities ($Q$=5% ($\bullet$), $Q$=50% ($\blacksquare$), and $Q$=100% ($\blacklozenge$)). The shaded regions, R1 and R2, indicate where models trained on 5% and 50% quality data, respectively, achieve optimal performance, whereas region R3 denotes where higher-quality models ($Q$=50%, $Q$=100%) outperform the others.
  • Figure 5: IoU between top-5% SHAP masks at training quality $q$ and the same-seed baseline ($q{=}100$), averaged over five seeds. Lines summarize all samples and four cohorts: A ($100\checkmark/q\times$), B ($q\checkmark/100\times$), C (agree--correct), and D (agree--wrong). Higher IoU indicates more consistent explanations.
  • ...and 2 more figures
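
The IoU metric named in the Figure 5 caption compares two top-fraction attribution masks. The following is a rough sketch of that computation under stated assumptions: attribution scores are given as flat lists, features are ranked by absolute value, and two empty masks count as identical. The caption does not specify these details, and the helper names `top_fraction_mask` and `mask_iou` are hypothetical.

```python
def top_fraction_mask(scores, fraction=0.05):
    """Return the set of indices holding the largest-magnitude scores.

    Ranking by absolute value is an assumption; at least one index is kept.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(ranked[:k])


def mask_iou(mask_a, mask_b):
    """Intersection over union of two index sets."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # assumption: two empty masks are treated as identical
    return len(mask_a & mask_b) / len(union)


# Toy attribution scores (illustrative only), with a 50% mask for readability.
baseline_scores = [0.9, 0.1, -0.8, 0.2]    # stands in for the q=100 model
compressed_scores = [0.7, 0.6, 0.1, 0.0]   # stands in for a lower-q model

baseline_mask = top_fraction_mask(baseline_scores, fraction=0.5)
compressed_mask = top_fraction_mask(compressed_scores, fraction=0.5)
iou = mask_iou(baseline_mask, compressed_mask)
print(f"IoU = {iou:.3f}")
```

In this toy case the two masks share one index out of three distinct indices, so the IoU is one third. With real attribution maps the default `fraction=0.05` would correspond to the top-5% masks mentioned in the caption.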