Scaling Training Data with Lossy Image Compression

Katherine L. Mentzer; Andrea Montanari

TL;DR

The paper tackles storage constraints in machine learning by introducing a storage scaling law that describes how test error depends on the number of training samples $n$ and per-sample bits $L$. It validates the law empirically on three computer vision tasks using lossy JPEG XL compression and a Butteraugli-based scheme, and develops a stylized multiresolution model that recapitulates the observed behavior. The law yields an optimal allocation given storage budget $s$, with test error decaying as $Err_{test}^* + C_E s^{-\nu}$ where $\nu = \alpha\beta/(\alpha+\beta)$, and prescribes how to scale $n$ and $L$ via $L_*(s) = C s^{\alpha/(\alpha+\beta)}$ and $n_*(s) = C^{-1} s^{\beta/(\alpha+\beta)}$. Practically, this work guides storage-aware data curation by showing that more, lower-quality data can outperform fewer high-quality samples, and that introducing randomized compression can yield robustness gains across tasks.
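
The exponents quoted above can be recovered from a short calculation. This is a sketch under two assumptions that the summary does not state explicitly: that the law is additive in two power laws, $Err_{test}(n,L) \approx Err_{test}^* + A n^{-\alpha} + B L^{-\beta}$, and that the storage budget is $s = nL$. The constants $A$ and $B$ are illustrative names introduced here.

```latex
\begin{align*}
f(n) &= A n^{-\alpha} + B\left(\frac{s}{n}\right)^{-\beta}
      = A n^{-\alpha} + B s^{-\beta} n^{\beta}, \\
f'(n) &= -\alpha A n^{-\alpha-1} + \beta B s^{-\beta} n^{\beta-1} = 0
  \;\Longrightarrow\;
  n_*^{\alpha+\beta} = \frac{\alpha A}{\beta B}\, s^{\beta}, \\
n_*(s) &= C^{-1} s^{\beta/(\alpha+\beta)}, \qquad
L_*(s) = \frac{s}{n_*(s)} = C s^{\alpha/(\alpha+\beta)}, \qquad
C = \left(\frac{\beta B}{\alpha A}\right)^{1/(\alpha+\beta)}, \\
f(n_*) &= \left(A C^{\alpha} + B C^{-\beta}\right) s^{-\alpha\beta/(\alpha+\beta)}
        = C_E\, s^{-\nu}, \qquad \nu = \frac{\alpha\beta}{\alpha+\beta}.
\end{align*}
```

Under these assumptions both error terms decay at the same rate $s^{-\nu}$ at the optimum, which is why neither $n$ nor $L$ should be held fixed as storage grows.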

Abstract

Empirically-determined scaling laws have been broadly successful in predicting the evolution of large machine learning models with training data and number of parameters. As a consequence, they have been useful for optimizing the allocation of limited resources, most notably compute time. In certain applications, storage space is an important constraint, and data format needs to be chosen carefully as a consequence. Computer vision is a prominent example: images are inherently analog, but are always stored in a digital format using a finite number of bits. Given a dataset of digital images, the number of bits $L$ to store each of them can be further reduced using lossy data compression. This, however, can degrade the quality of the model trained on such images, since each example has lower resolution. In order to capture this trade-off and optimize storage of training data, we propose a `storage scaling law' that describes the joint evolution of test error with sample size and number of bits per image. We prove that this law holds within a stylized model for image compression, and verify it empirically on two computer vision tasks, extracting the relevant parameters. We then show that this law can be used to optimize the lossy compression level. At given storage, models trained on optimally compressed images present a significantly smaller test error with respect to models trained on the original data. Finally, we investigate the potential benefits of randomizing the compression level.
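
The optimization step described in the abstract can be illustrated numerically. The sketch below assumes an additive power-law form for the excess test error, $A n^{-\alpha} + B L^{-\beta}$, with storage $s = nL$; this functional form, the function names, and the parameter values are assumptions made for illustration and are not fitted values from the paper. It treats $n$ and $L$ as continuous and compares a closed-form split of the budget against a brute-force scan.

```python
import math

def optimal_allocation(s, A, B, alpha, beta):
    """Closed-form split of a storage budget s = n * L (assumed additive law)."""
    C = (beta * B / (alpha * A)) ** (1.0 / (alpha + beta))
    L_star = C * s ** (alpha / (alpha + beta))   # bits per image
    n_star = s ** (beta / (alpha + beta)) / C    # number of images
    return n_star, L_star

def excess_error(n, L, A, B, alpha, beta):
    """Assumed excess test error A * n^-alpha + B * L^-beta."""
    return A * n ** (-alpha) + B * L ** (-beta)

def grid_minimum(s, A, B, alpha, beta, num=2001):
    """Brute-force check: scan n on a log grid in [1, s] with L = s / n."""
    best = math.inf
    for i in range(num):
        n = math.exp(math.log(s) * i / (num - 1))
        best = min(best, excess_error(n, s / n, A, B, alpha, beta))
    return best

# Hypothetical parameters, not fitted values from the paper.
A, B, alpha, beta = 1.0, 2.0, 0.5, 1.0
s = 1e6
n_star, L_star = optimal_allocation(s, A, B, alpha, beta)
print(n_star, L_star, excess_error(n_star, L_star, A, B, alpha, beta))
print(grid_minimum(s, A, B, alpha, beta))
```

In practice $n$ must be an integer and $L$ is set indirectly through a codec quality setting, so the closed form is best read as a starting point for a search over available compression levels.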

Paper Structure (25 sections, 2 theorems, 34 equations, 7 figures, 2 tables)

Introduction
Background and motivation
Summary of results
Related Work
Empirical Results
Set up
Scaling curves
Optimizing $n$ and $L$
Test error for optimal $n$ and $L$
Scaling Optimal.
Original Format.
Randomized.
Comparison to naive compression
Test set compression
A stylized model
...and 10 more sections

Key Result

Theorem 1

Under the model described above, further assume $rp<1$, $q/p<1$, and that the conditions of Assumption ass:X-assumption hold. Then there exist positive constants $0<A_1<A_2$, $0<B_1<B_2$, $0<C$, $0<c_0$ (depending on the constants $p,q,r,\tau$ in the assumptions), such that, for all $n\le c_* L^{1+2\ka [...] with exponents [...] (In the above formulas, ${\mathbb E}_{{\varepsilon}}$ denotes expectation with resp [...]

Figures (7)

Figure 1: Test error scaling for the image classification, semantic segmentation, and object detection tasks. Circles represent empirical results, and the dotted line indicates the scaling curve fit to these observations.
Figure 2: Values of $n$ and $L$ as a function of total storage size for each task under the optimal and original format training data.
Figure 3: Test error scaling in $s$ for each task using different data scaling schemes.
Figure 4: Test error for models trained on naively compressed data relative to test error for models trained on optimally compressed data and original format data.
Figure 5: Error as a function of both training data compression level and test data compression level for the classification, segmentation, and detection tasks. Circles indicate empirical results from model evaluation, and the background represents a linear interpolation of those points.
...and 2 more figures

Theorems & Definitions (2)

Theorem 1
Lemma 1