SceneMixer: Exploring Convolutional Mixing Networks for Remote Sensing Scene Classification
Mohammed Q. Alkhatib, Ali Jamali, Swalpa Kumar Roy
TL;DR
This work tackles remote sensing scene classification under variability in resolution and viewpoint by introducing a lightweight convolutional mixer that decouples spatial and channel processing through depthwise and pointwise convolutions. The model uses multiscale spatial mixing, a patch-embedding stage, and a simple classifier to achieve competitive accuracy with minimal parameters and computation. Experimental results on AID and EuroSAT show strong performance, surpassing several CNN and transformer baselines while maintaining high efficiency. The approach provides a practical, scalable option for high-resolution remote sensing tasks and sets a baseline for mixer-based architectures in this domain.
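To make the efficiency argument concrete, the following back-of-the-envelope sketch counts the parameters of a mixer-style block (parallel depthwise convolutions followed by one pointwise convolution) against dense convolutions of the same kernel sizes. The channel width of 256, the kernel sizes (3, 5, 7), and the block layout are illustrative assumptions, not values reported by the authors.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights plus biases of a dense k x k convolution."""
    return c_in * c_out * k * k + c_out


def depthwise_conv_params(channels, k):
    """One k x k filter per channel, plus one bias per channel."""
    return channels * k * k + channels


def pointwise_conv_params(c_in, c_out):
    """A 1 x 1 convolution that mixes channels only."""
    return c_in * c_out + c_out


def mixer_block_params(channels, kernel_sizes):
    """Hypothetical block: parallel depthwise branches, then one pointwise conv."""
    spatial = sum(depthwise_conv_params(channels, k) for k in kernel_sizes)
    return spatial + pointwise_conv_params(channels, channels)


channels = 256            # assumed width, not taken from the paper
kernel_sizes = (3, 5, 7)  # assumed scales, not taken from the paper

# Dense baseline: one full convolution per kernel size.
dense = sum(standard_conv_params(channels, channels, k) for k in kernel_sizes)
# Decoupled version: depthwise spatial mixing + pointwise channel mixing.
mixed = mixer_block_params(channels, kernel_sizes)

print(f"dense convolutions: {dense:,} parameters")
print(f"mixer-style block:  {mixed:,} parameters")
print(f"reduction factor:   {dense / mixed:.1f}x")
```

Under these assumptions, nearly all of the mixer-style block's parameters sit in the pointwise convolution, which is why adding further depthwise scales is cheap.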
Abstract
Remote sensing scene classification plays a key role in Earth observation by enabling the automatic identification of land use and land cover (LULC) patterns from aerial and satellite imagery. Despite recent progress with convolutional neural networks (CNNs) and vision transformers (ViTs), the task remains challenging due to variations in spatial resolution, viewpoint, orientation, and background conditions, which often reduce the generalization ability of existing models. To address these challenges, this paper proposes a lightweight architecture based on the convolutional mixer paradigm. The model alternates between spatial mixing through depthwise convolutions at multiple scales and channel mixing through pointwise operations, enabling efficient extraction of both local and contextual information while keeping the number of parameters and computations low. Extensive experiments were conducted on the AID and EuroSAT benchmarks. The proposed model achieved overall accuracy, average accuracy, and Kappa values of 74.7%, 74.57%, and 73.79 on the AID dataset, and 93.90%, 93.93%, and 93.22 on EuroSAT, respectively. These results demonstrate that the proposed approach provides a good balance between accuracy and efficiency compared with widely used CNN- and transformer-based models. Code will be made publicly available at https://github.com/mqalkhatib/SceneMixer
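For readers unfamiliar with the three reported metrics, the sketch below computes overall accuracy (OA), average accuracy (AA), and Cohen's Kappa from a confusion matrix. It assumes the definitions commonly used in remote sensing (AA as the mean of per-class recalls, Kappa scaled by 100); the abstract does not spell out its own definitions, and the toy matrix is invented for illustration rather than taken from the paper.

```python
def scene_metrics(confusion):
    """Return (OA, AA, Kappa), each scaled by 100, from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    Assumes every class has at least one true sample.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all samples classified correctly.
    oa = correct / total

    # Average accuracy: mean of per-class recalls, so small classes count equally.
    recalls = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(recalls) / n_classes

    # Kappa: agreement beyond what the row/column marginals predict by chance.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - chance) / (1 - chance)

    return 100 * oa, 100 * aa, 100 * kappa


# Toy, imbalanced 3-class example (made-up counts, not results from the paper).
toy = [
    [50, 0, 0],
    [5, 15, 0],
    [0, 5, 5],
]
oa, aa, kappa = scene_metrics(toy)
print(f"OA = {oa:.2f}, AA = {aa:.2f}, Kappa = {kappa:.2f}")
```

In the toy example OA exceeds AA because the largest class is classified perfectly while the smaller classes are not. This is the kind of imbalance that AA is meant to expose.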
