Input Resolution Downsizing as a Compression Technique for Vision Deep Learning Systems
Jeremy Morlier, Mathieu Leonardon, Vincent Gripon
TL;DR
This work investigates input resolution reduction as a practical compression axis for vision models, complementing pruning, quantization, and distillation. It develops mechanisms to adjust resolution for CNNs (before/after embedding) and to reduce token sequences in ViTs, and evaluates these strategies on ImageNet and Cityscapes for classification and segmentation. Across ResNet-50 and RegSeg, resolution scaling delivers meaningful compute and memory savings with small accuracy/mIoU losses, and proves complementary to model scaling and quantization. The results establish input resolution as a versatile, scalable component of the vision-model compression toolbox, with clear practical value in resource-constrained environments.
Abstract
Model compression is a critical area of research in deep learning, particularly in vision, driven by the need to reduce models' memory or computational footprints. While numerous methods for model compression have been proposed, most focus on pruning, quantization, or knowledge distillation. In this work, we investigate an under-explored avenue: reducing the resolution of the input image as a complement to other types of compression. By systematically studying the impact of input resolution reduction on both classification and semantic segmentation, and on both convolutional and transformer-based architectures, we demonstrate that this strategy provides a useful alternative for model compression. Our experimental results on standard benchmarks highlight the potential of this method, which achieves competitive performance while significantly reducing computational and memory requirements. This study establishes input resolution reduction as a viable and promising direction in the broader landscape of model compression techniques for vision applications.
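As a rough illustration of why input resolution acts as a compression axis, the sketch below counts multiply-accumulate operations (MACs) for a single convolutional layer at two input sizes. The helper name `conv2d_macs`, the layer shape (64 input and output channels, 3x3 kernel), and the 224 and 112 pixel resolutions are illustrative assumptions, not values taken from the paper's experiments. The count assumes "same" padding, so the output size depends only on the stride.

```python
def conv2d_macs(height, width, c_in, c_out, kernel=3, stride=1):
    """Approximate MAC count for one 2D convolution with 'same' padding.

    Hypothetical helper for illustration only; it ignores bias terms,
    grouped convolutions, and dilation.
    """
    out_h = -(-height // stride)  # ceiling division
    out_w = -(-width // stride)
    # Each output position needs kernel*kernel*c_in MACs per output channel.
    return out_h * out_w * c_in * c_out * kernel * kernel


# Assumed example layer: 64 -> 64 channels, 3x3 kernel, stride 1.
full_res = conv2d_macs(224, 224, c_in=64, c_out=64)
half_res = conv2d_macs(112, 112, c_in=64, c_out=64)

# Halving each spatial dimension divides the layer's MACs by four.
print(f"MACs at 224x224: {full_res:,}")
print(f"MACs at 112x112: {half_res:,}")
print(f"Reduction factor: {full_res / half_res:.1f}x")
```

The same quadratic dependence on spatial size applies to the activation memory of each convolutional feature map, which is why lowering the input resolution reduces both compute and memory, with no change to the model's weights.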
