Input Resolution Downsizing as a Compression Technique for Vision Deep Learning Systems

Jeremy Morlier, Mathieu Leonardon, Vincent Gripon

TL;DR

This work investigates input resolution reduction as a practical compression axis for vision models, complementing pruning, quantization, and distillation. It develops mechanisms to adjust resolution for CNNs (before/after embedding) and to reduce token sequences in ViTs, and evaluates these strategies on ImageNet and CityScapes for classification and segmentation. Across ResNet-50 and RegSeg, resolution scaling delivers meaningful compute and memory savings with small accuracy/mIoU losses, and proves complementary to model scaling and quantization. The results establish input resolution as a versatile, scalable component of the vision-model compression toolbox with clear practical impact for resource-constrained environments.

Abstract

Model compression is a critical area of research in deep learning, in particular in vision, driven by the need to lighten models' memory or computational footprints. While numerous methods for model compression have been proposed, most focus on pruning, quantization, or knowledge distillation. In this work, we delve into an under-explored avenue: reducing the resolution of the input image as a complementary approach to other types of compression. By systematically investigating the impact of input resolution reduction on both classification and semantic segmentation, and on both convnets and transformer-based architectures, we demonstrate that this strategy provides an interesting alternative for model compression. Our experimental results on standard benchmarks highlight the potential of this method, achieving competitive performance while significantly reducing computational and memory requirements. This study establishes input resolution reduction as a viable and promising direction in the broader landscape of model compression techniques for vision applications.

Paper Structure

This paper contains 14 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Benefits of adding resolution scaling to the more classical model scaling, when performing classification on ImageNet using a ResNet-50 with a batch size of 8. The dotted line represents the achievable trade-offs when using model scaling only. The three circular points correspond to different model scaling (MS) settings, achieving different accuracy levels. The square points of the same colors correspond to adding input resolution scaling (RS), achieving a better trade-off without sacrificing accuracy.
  • Figure 2:
  • Figure 3: Overview of the Vision Transformer (ViT) architecture. The input image, with dimensions $R \times R$, is divided into non-overlapping patches of size $P \times P$, resulting in an input resolution of $N=\frac{R}{P}$ tokens per line/column and a total sequence length of $N^{2}$. Each token is flattened and projected into a $D$-dimensional embedding. These embeddings are then fed into the transformer model, consisting of $n$ layers, where each layer includes a Multi-Head Self-Attention (MHSA) module with $n_{heads}$ heads and a Multi-Layer Perceptron (MLP) with dimensionality $D_{MLP}$. The output is passed to a classifier for the final prediction.
  • Figure 4: Illustration of the impact of resolution scaling on a RegSeg model. In the left column, the original image is resized to several resolutions (128x256 to 512x1024), each followed by the corresponding model output. An interpolation technique such as bicubic or bilinear is then applied to the model output, and the mIoU is calculated against the ground truth.
  • Figure 5: Comparison of the memory required and the number of FLOPs trade-offs between model scaling and sequence scaling for a ViT-S.
  • ...and 4 more figures
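
The token arithmetic in the Figure 3 caption can be illustrated with a small sketch. It computes $N = R/P$ and the sequence length $N^{2}$ for a square input, plus a rough count of the multiply-accumulates in the attention score and weighted-sum products of one layer. The function name `vit_sequence_stats`, the cost formula ($2 N^{4} D$, ignoring projections, the MLP, and any class token), and the example values $P = 16$ and $D = 384$ are illustrative assumptions, not figures taken from the paper.

```python
def vit_sequence_stats(resolution: int, patch_size: int, embed_dim: int) -> dict:
    """Token-grid statistics for a square R x R image (hypothetical helper)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    tokens_per_side = resolution // patch_size  # N = R / P
    seq_len = tokens_per_side ** 2              # N^2 tokens, class token ignored
    # Assumed rough cost model for one layer: the QK^T product and the
    # attention-weighted sum each take seq_len^2 * D multiply-accumulates.
    attn_macs = 2 * seq_len ** 2 * embed_dim
    return {
        "tokens_per_side": tokens_per_side,
        "seq_len": seq_len,
        "attn_macs": attn_macs,
    }


# Example values below are assumptions chosen for illustration.
full = vit_sequence_stats(resolution=224, patch_size=16, embed_dim=384)
half = vit_sequence_stats(resolution=112, patch_size=16, embed_dim=384)

print(full["seq_len"], half["seq_len"])        # 196 49
print(full["attn_macs"] / half["attn_macs"])   # 16.0
```

Under these assumptions, halving $R$ divides the sequence length by 4 and the quadratic attention term by 16, which is why the input resolution is a strong lever on ViT compute.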