Multi-scale Unified Network for Image Classification
Wenzhuo Liu, Fei Zhu, Cheng-Lin Liu
TL;DR
The paper tackles the sensitivity of CNNs to multi-scale image inputs: resizing every image to a fixed size degrades performance, and heavy rescaling adds computational cost. It introduces the Multi-scale Unified Network (MSUN), which splits the shallow layers into multi-scale subnets, unifies their features in a shared high-level network, and enforces a scale-invariant constraint to keep features consistent across scales. Layer-wise CKA analysis and experiments on ImageNet and other datasets show that MSUN delivers substantial accuracy gains, especially at small scales, while reducing FLOPs compared to traditional multi-scale testing or rescaling approaches. The approach is a practical way to make image classification robust in real-world, scale-variant scenarios; it benefits transfer learning and applies to existing CNN backbones.
Abstract
Convolutional Neural Networks (CNNs) have advanced significantly in visual representation learning and recognition. However, they face notable challenges in performance and computational efficiency when dealing with real-world, multi-scale image inputs. Conventional methods rescale all input images to a fixed size. A larger fixed size favors performance, but rescaling small images to a larger size introduces digitization noise and increases computation cost. In this work, we carry out a comprehensive, layer-wise investigation of how CNN models respond to scale variation, based on Centered Kernel Alignment (CKA) analysis. The observations reveal that lower layers are more sensitive to input image scale variations than high-level layers. Inspired by this insight, we propose the Multi-scale Unified Network (MSUN), consisting of multi-scale subnets, a unified network, and a scale-invariant constraint. Our method divides the shallow layers into multi-scale subnets to enable feature extraction from multi-scale inputs, and the low-level features are unified in deep layers for extracting high-level semantic features. A scale-invariant constraint is imposed to maintain feature consistency across different scales. Extensive experiments on ImageNet and other scale-diverse datasets demonstrate that MSUN achieves significant improvements in both model performance and computational efficiency. In particular, MSUN yields an accuracy increase of up to 44.53% and reduces FLOPs by 7.01-16.13% in multi-scale scenarios.
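
To make the layer-wise CKA comparison concrete, here is a minimal sketch of linear CKA between two feature matrices, for example the activations of one layer on the same images fed at two input scales. The abstract does not say which CKA variant or which features the authors use, so the linear form is an assumption, and the function name `linear_cka` and the random stand-in features are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices.

    x: array of shape (n_samples, dim_x)
    y: array of shape (n_samples, dim_y)
    Rows must correspond to the same samples in both matrices.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Stand-in data (assumption): random features in place of real
# layer activations extracted at two input scales.
rng = np.random.default_rng(0)
feats_large = rng.normal(size=(64, 32))
feats_small = feats_large + 0.5 * rng.normal(size=(64, 32))

similarity = linear_cka(feats_large, feats_small)
```

A value near 1 indicates that the layer represents the two scales similarly. Under this reading, the finding that lower layers are more scale-sensitive would show up as lower CKA values in shallow layers than in deep ones.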
