Multi-scale Unified Network for Image Classification

Wenzhuo Liu, Fei Zhu, Cheng-Lin Liu

TL;DR

The paper addresses the sensitivity of CNNs to input image scale: resizing every image to one fixed size degrades accuracy when the native scale differs, and upscaling small images adds computational cost. It introduces the Multi-scale Unified Network (MSUN), which splits the shallow layers into multi-scale subnets, unifies their outputs in shared deep layers, and enforces a scale-invariant constraint to keep features consistent across scales. A layer-wise CKA analysis and experiments on ImageNet and other scale-diverse datasets show substantial accuracy gains, especially at small scales, along with fewer FLOPs than traditional multi-scale testing or rescaling approaches. The approach offers a practical, architecture-friendly path to robust image classification in real-world, scale-variant scenarios, with transfer learning benefits and applicability to existing CNN backbones.

Abstract

Convolutional Neural Networks (CNNs) have advanced significantly in visual representation learning and recognition. However, they face notable challenges in performance and computational efficiency when dealing with real-world, multi-scale image inputs. Conventional methods rescale all input images to a fixed size: a larger fixed size favors performance, but rescaling small images to a larger size incurs digitization noise and increased computation cost. In this work, we carry out a comprehensive, layer-wise investigation of CNN models in response to scale variation, based on Centered Kernel Alignment (CKA) analysis. The observations reveal that lower layers are more sensitive to input image scale variations than high-level layers. Inspired by this insight, we propose the Multi-scale Unified Network (MSUN), consisting of multi-scale subnets, a unified network, and a scale-invariant constraint. Our method divides the shallow layers into multi-scale subnets to enable feature extraction from multi-scale inputs, and the low-level features are unified in deep layers for extracting high-level semantic features. A scale-invariant constraint is imposed to maintain feature consistency across different scales. Extensive experiments on ImageNet and other scale-diverse datasets demonstrate that MSUN achieves significant improvements in both model performance and computational efficiency. In particular, MSUN yields an accuracy increase of up to 44.53% and reduces FLOPs by 7.01-16.13% in multi-scale scenarios.
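
To make the layer-wise CKA comparison concrete, the sketch below computes linear CKA between two activation matrices, for example the activations of one layer on the same images fed at two different input sizes. The abstract does not say which CKA variant the authors use, so the linear form and the function name `linear_cka` are assumptions made for illustration only.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices x (n, d1) and y (n, d2).

    Rows are the same n examples in both matrices; columns are features.
    Assumption: the linear variant, which may differ from the paper's setup.
    """
    # Center every feature column so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Usage with random stand-ins for real layer activations.
rng = np.random.default_rng(0)
acts_small = rng.normal(size=(64, 32))                       # e.g. layer output at one scale
acts_large = acts_small + 0.5 * rng.normal(size=(64, 32))    # perturbed copy for another scale
print(linear_cka(acts_small, acts_large))
```

A score near 1 means the two representations agree up to rotation and isotropic scaling; a lower score at a given layer indicates that the layer responds differently to the two input scales.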

Paper Structure (22 sections, 9 equations, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: In the real world, images come in different sizes, but Vanilla models are designed for a fixed input size, causing performance degradation when the input image scale changes. In contrast, MSUN retains stable performance over a range of scales.
  • Figure 2: The layer-wise feature similarity between $32 \times 32$ and $224 \times 224$ inputs, showing that the lower layers in these CNN models are considerably more sensitive to scale changes. Compared to the baseline, our method (MSUN) maintains a higher CKA between inputs of different input scales, obtaining more robust representations across scale variations.
  • Figure 3: Illustration of the Vanilla model [he2016deep, huang2017densely, simonyan2014very, sandler2018mobilenetv2], multi-scale training, and our method. MSUN has the lower layers divided into multi-scale subnets, which are unified in upper layers incorporating a scale-invariance constraint.
  • Figure 4: Multi-scale testing on ImageNet spanning from $32 \times 32$ to $224 \times 224$ image size with a stride of 16. (a) The model's accuracy curves at each input size. (b) The model's average accuracy, average FLOPs, and total model parameters across these sizes. Our method demonstrates significant accuracy improvements and reduced FLOPs, mitigating the model's performance breakdown at lower image scales.
  • Figure 5: Comparison of multi-scale subnetworks. Tick marks denote optimal accuracy at each test size. (a) Accuracy curves at different input sizes. (b) Mean accuracy, FLOPs, and total parameters across these inputs.
  • ...and 3 more figures
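
The design described in the Figure 3 caption (lower layers divided into multi-scale subnets, unified upper layers, and a scale-invariance constraint) can be illustrated with a toy NumPy sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: the three nominal scales, the nearest-scale routing rule, the single linear map per subnet, and the mean-squared-error form of the consistency term are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
SCALES = (32, 128, 224)                   # hypothetical nominal scale per subnet
GRID, FEAT_DIM, NUM_CLASSES = 8, 16, 10   # illustrative sizes only

# One shallow "subnet" per scale (a single linear map here) and one shared head.
subnets = {s: rng.normal(scale=0.1, size=(GRID * GRID, FEAT_DIM)) for s in SCALES}
head = rng.normal(scale=0.1, size=(FEAT_DIM, NUM_CLASSES))

def pool_to_grid(img):
    """Average-pool a square (H, H) image to GRID x GRID; H must be a multiple of GRID."""
    k = img.shape[0] // GRID
    return img.reshape(GRID, k, GRID, k).mean(axis=(1, 3))

def low_level_features(img):
    """Route the image to the subnet with the nearest nominal scale (assumed rule)."""
    scale = min(SCALES, key=lambda s: abs(s - img.shape[0]))
    return np.maximum(pool_to_grid(img).ravel() @ subnets[scale], 0.0)

def classify(img):
    """Shared deep layers, reduced to one linear head in this toy."""
    return low_level_features(img) @ head

def scale_consistency_loss(img_small, img_large):
    """Mean squared gap between features of the same image at two scales (assumed form)."""
    diff = low_level_features(img_small) - low_level_features(img_large)
    return float(np.mean(diff ** 2))

large = rng.random((224, 224))
small = large[::7, ::7]                   # crude 32 x 32 downsample by striding
print(classify(small).shape, scale_consistency_loss(small, large))
```

In training, a term like `scale_consistency_loss` would be added to the classification loss so that the subnets learn to hand similar features to the shared layers regardless of input size; how the paper weights or defines that term is not stated in this summary.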