WaveMix: A Resource-efficient Neural Network for Image Analysis

Pranav Jeevan, Kavitha Viswanathan, Anandu A S, Amit Sethi

TL;DR

WaveMix addresses the resource constraints of vision models by integrating a multi-level 2D-DWT into each block, enabling lossless downsampling and cross-scale token mixing that preserves spatial information. The architecture stacks $N$ WaveMix blocks, each operating at channel embedding dimension $C$ and combining a learnable convolution, multi-level Haar-wavelet mixing across $L$ levels, and an upsampling path with a residual connection, which expands the receptive field at little parameter cost. Empirically, WaveMix delivers competitive or state-of-the-art performance on Cityscapes segmentation and multiple image-classification benchmarks (e.g., Places-365, iNAT-mini, Galaxy 10 DECals) with far fewer parameters and lower GPU RAM than CNNs and ViTs, even without large-scale pretraining. This suggests that encoding image priors via fixed wavelet-based token mixing yields scalable, efficient vision models suitable for resource-constrained environments. Code and models are publicly available.

Abstract

We propose WaveMix, a novel neural architecture for computer vision that is resource-efficient yet generalizable and scalable. While using fewer trainable parameters, less GPU RAM, and fewer computations, WaveMix networks achieve comparable or better accuracy than state-of-the-art convolutional neural networks, vision transformers, and token mixers on several tasks. This efficiency can translate to savings in time, cost, and energy. To achieve these gains, we use a multi-level two-dimensional discrete wavelet transform (2D-DWT) in WaveMix blocks, which has the following advantages: (1) it reorganizes spatial information based on three strong image priors, namely scale-invariance, shift-invariance, and sparseness of edges, (2) it does so in a lossless manner without adding parameters, (3) it reduces the spatial sizes of feature maps, which reduces the memory and time required for forward and backward passes, and (4) it expands the receptive field faster than convolutions do. The whole architecture is a stack of self-similar and resolution-preserving WaveMix blocks, which allows architectural flexibility for various tasks and levels of resource availability. WaveMix establishes new benchmarks for segmentation on Cityscapes and for classification on Galaxy 10 DECals, Places-365, five EMNIST datasets, and iNAT-mini, and performs competitively on other benchmarks. Our code and trained models are publicly available.
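
The "lossless" and "reduced spatial size" properties claimed for the 2D-DWT can be checked on a toy input. The following is a minimal sketch, not the authors' implementation: it assumes a single-level orthonormal Haar transform on a plain 2D list with even height and width, and the function names (`haar_dwt2`, `haar_idwt2`) and sub-band names are illustrative only.

```python
def haar_dwt2(img):
    """Single-level 2D Haar DWT of a 2D list with even height and width.

    Returns four sub-bands, each half the height and width of the input:
    approximation, row-difference, column-difference, diagonal-difference.
    (Sub-band naming conventions vary between libraries.)
    """
    h, w = len(img), len(img[0])
    approx, row_d, col_d, diag_d = (
        [[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)
    )
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]          # top row of the 2x2 patch
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom row of the patch
            approx[i // 2][j // 2] = (a + b + c + d) / 2
            row_d[i // 2][j // 2] = (a + b - c - d) / 2
            col_d[i // 2][j // 2] = (a - b + c - d) / 2
            diag_d[i // 2][j // 2] = (a - b - c + d) / 2
    return approx, row_d, col_d, diag_d


def haar_idwt2(approx, row_d, col_d, diag_d):
    """Inverse of haar_dwt2: rebuilds the full-resolution 2D list."""
    h, w = len(approx), len(approx[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, r = approx[i][j], row_d[i][j]
            c, d = col_d[i][j], diag_d[i][j]
            img[2 * i][2 * j] = (s + r + c + d) / 2
            img[2 * i][2 * j + 1] = (s + r - c - d) / 2
            img[2 * i + 1][2 * j] = (s - r + c - d) / 2
            img[2 * i + 1][2 * j + 1] = (s - r - c + d) / 2
    return img


# Toy 4x4 "feature map" with values 0..15.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
approx, row_d, col_d, diag_d = haar_dwt2(image)
restored = haar_idwt2(approx, row_d, col_d, diag_d)

print(len(approx), len(approx[0]))  # each sub-band is 2x2
print(restored == image)            # the transform is invertible
```

The four half-resolution sub-bands together hold exactly as many values as the input, which is why the downsampling discards nothing. A multi-level transform would apply `haar_dwt2` again to the approximation band.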

Paper Structure (30 sections, 5 equations, 5 figures, 14 tables)

This paper contains 30 sections, 5 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: WaveMix architecture for (a) image classification (b) semantic segmentation, along with (c) details of the WaveMix block
  • Figure 2: Visualisation of receptive fields for different models shows a rapid expansion of the receptive field in WaveMix as layers or blocks are added, due to the multi-level 2D-DWT. A blank image with a single high pixel value near the center was sent as input to the models. All parameters were assigned a value of one and all biases were set to zero.
  • Figure 3: Details of the WaveMix-Lite block, which uses only a single-level 2D-DWT
  • Figure 4: A sample of semantic segmentation results on the Cityscapes dataset by WaveMix for qualitative assessment.
  • Figure 5: The results of occlusion analysis, which measures the significance of each pixel for the output decision, show that WaveMix identifies important regions (darker in second row) in an image for making the classification decision. The scale shows the probability of the class output when the pixel is occluded.
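
The impulse experiment described for Figure 2 (single bright pixel, all weights one, all biases zero) can be imitated on a toy scale. The sketch below is a simplified stand-in and not the WaveMix block: it assumes that a level-`l` Haar approximation band behaves like a 2^l x 2^l block sum (true up to a scale factor), that one 3x3 all-ones convolution acts at the coarse scale, and that upsampling is plain replication. All function names are hypothetical.

```python
def conv3x3_ones(img):
    """'Same' 3x3 convolution with all weights 1, zero bias, zero padding."""
    size = len(img)
    return [[sum(img[y][x]
                 for y in range(max(0, i - 1), min(size, i + 2))
                 for x in range(max(0, j - 1), min(size, j + 2)))
             for j in range(size)] for i in range(size)]


def block_sum(img, f):
    """Sum over non-overlapping f x f blocks (Haar approximation up to scale)."""
    size = len(img) // f
    return [[sum(img[f * i + dy][f * j + dx]
                 for dy in range(f) for dx in range(f))
             for j in range(size)] for i in range(size)]


def replicate(img, f):
    """Upsample by repeating every pixel f times along both axes."""
    size = len(img) * f
    return [[img[i // f][j // f] for j in range(size)] for i in range(size)]


def impulse(size=32):
    """Blank image with a single nonzero pixel at the center."""
    img = [[0.0] * size for _ in range(size)]
    img[size // 2][size // 2] = 1.0
    return img


def footprint_width(img, row):
    """Number of nonzero pixels in one row: the width the impulse reached."""
    return sum(1 for v in img[row] if v != 0)


def conv_stack_width(n_layers, size=32):
    """Width reached by the impulse after n stacked 3x3 convolutions."""
    img = impulse(size)
    for _ in range(n_layers):
        img = conv3x3_ones(img)
    return footprint_width(img, size // 2)


def wavelet_like_width(levels, size=32):
    """Width reached after downsampling by 2**levels, one conv, upsampling."""
    f = 2 ** levels
    coarse = conv3x3_ones(block_sum(impulse(size), f))
    return footprint_width(replicate(coarse, f), size // 2)


conv_widths = [conv_stack_width(n) for n in (1, 2, 3)]
wavelet_widths = [wavelet_like_width(l) for l in (1, 2, 3)]
print(conv_widths)     # [3, 5, 7]: linear growth with depth
print(wavelet_widths)  # [6, 12, 24]: doubles with each extra level
```

Under these assumptions, the width grows linearly with the number of stacked convolutions but doubles with every added decomposition level, which is the qualitative behaviour the figure caption attributes to the multi-level 2D-DWT.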