Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

Ziyue Kang; Weichuan Zhang

TL;DR

The paper tackles rare-species image classification under extreme data scarcity. It introduces a frequency-adaptive DCT preprocessing module and combines it with ViT and ResNet backbones and a Bayesian linear classifier. The core idea is to learn the boundaries between the low-, mid-, and high-frequency bands, then fuse the resulting frequency-domain cues with global context and local detail through cross-level feature fusion. The key contributions are (i) data-driven adaptive frequency partitioning, (ii) a hybrid DCT-ViT-ResNet architecture, and (iii) an uncertainty-aware Bayesian head trained with reparameterized sampling and KL regularization. The authors report state-of-the-art performance on a 50-species wildlife dataset under few-shot conditions. The intended application is ecological monitoring in data-sparse settings, where the method pairs frequency-domain preprocessing and transformer-based global reasoning with explicit uncertainty estimates.
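To make the third contribution concrete, the following is a minimal NumPy sketch of a mean-field Gaussian linear head with reparameterized weight sampling and a closed-form KL term. It is not the authors' implementation: the class name `BayesianLinear`, the softplus parameterization of the standard deviation, the standard-normal prior, and the initial values are all assumptions made for illustration, and only the forward pass is shown.

```python
import numpy as np


def softplus(x):
    # Numerically stable softplus; keeps the standard deviation positive.
    return np.logaddexp(0.0, x)


class BayesianLinear:
    """Mean-field Gaussian linear layer (illustrative, forward pass only)."""

    def __init__(self, in_dim, out_dim, rng):
        # Variational parameters: one mean and one scale per weight.
        self.mu = rng.normal(0.0, 0.1, size=(in_dim, out_dim))
        self.rho = np.full((in_dim, out_dim), -3.0)  # sigma = softplus(rho)
        self.rng = rng

    def sample_weights(self):
        # Reparameterization: w = mu + sigma * eps with eps ~ N(0, I),
        # so the sample is a deterministic function of (mu, rho).
        sigma = softplus(self.rho)
        eps = self.rng.standard_normal(self.mu.shape)
        return self.mu + sigma * eps

    def kl_to_standard_normal(self):
        # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over all weights.
        # The N(0, 1) prior is an assumption of this sketch.
        sigma = softplus(self.rho)
        return 0.5 * np.sum(sigma**2 + self.mu**2 - 1.0 - 2.0 * np.log(sigma))

    def predict(self, x, n_samples=20):
        # Monte Carlo estimate of class probabilities and their spread.
        probs = []
        for _ in range(n_samples):
            logits = x @ self.sample_weights()
            logits = logits - logits.max(axis=1, keepdims=True)
            p = np.exp(logits)
            probs.append(p / p.sum(axis=1, keepdims=True))
        probs = np.stack(probs)
        return probs.mean(axis=0), probs.std(axis=0)


rng = np.random.default_rng(0)
head = BayesianLinear(in_dim=8, out_dim=50, rng=rng)
features = rng.normal(size=(4, 8))  # stand-in for fused backbone features
mean_probs, std_probs = head.predict(features)
```

In a training loop, the KL term would be added to the classification loss with some weighting, and the per-class standard deviation across samples gives a rough uncertainty signal for each prediction.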

Abstract

A major challenge in rare-animal image classification is data scarcity: many species have only a handful of labeled samples. To address this, we design a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first extracts frequency-domain cues from the image via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy integrates these frequency- and spatial-domain embeddings, and the fused features are passed through a Bayesian linear classifier to produce the final category predictions. On our self-built 50-class wildlife dataset, this approach outperforms conventional CNN and fixed-band DCT pipelines, achieving state-of-the-art accuracy under extreme sample scarcity.
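The adaptive partitioning step can be illustrated with a small NumPy sketch that splits an image into low-, mid-, and high-frequency components using soft masks in the 2D DCT domain. The abstract does not specify how the boundaries are parameterized, so everything here is assumed: the function names, the sigmoid-shaped masks, the `(u + v)` frequency index, and the boundary values `b_low` and `b_high`, which are fixed floats in this sketch but would be trainable parameters in a learned version.

```python
import numpy as np


def dct_matrix(n):
    # Orthonormal DCT-II basis: row k is the k-th cosine basis vector.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def soft_band_masks(n, b_low, b_high, sharpness=20.0):
    # Normalized frequency index in [0, 1] for each DCT coefficient (u, v).
    u = np.arange(n)[:, None]
    v = np.arange(n)[None, :]
    r = (u + v) / (2.0 * (n - 1))
    # Smooth step functions make the masks differentiable in the boundaries.
    below_low = sigmoid(sharpness * (b_low - r))
    below_high = sigmoid(sharpness * (b_high - r))
    low = below_low
    mid = below_high - below_low
    high = 1.0 - below_high
    return low, mid, high  # the three masks sum to 1 at every coefficient


def split_bands(image, b_low=0.15, b_high=0.5):
    # image: square 2D array. Returns an array of shape (3, n, n) holding
    # the low-, mid-, and high-frequency reconstructions.
    n = image.shape[0]
    d = dct_matrix(n)
    coeffs = d @ image @ d.T  # forward 2D DCT
    masks = soft_band_masks(n, b_low, b_high)
    return np.stack([d.T @ (coeffs * m) @ d for m in masks])  # inverse DCT


rng = np.random.default_rng(0)
image = rng.random((16, 16))  # stand-in for one grayscale image channel
bands = split_bands(image)
```

Because the three masks sum to one and the transform is orthonormal, the three band images add back up to the input, so no information is lost by the split. One plausible use, again an assumption rather than the paper's stated design, is to stack the bands as channels for the ViT branch while the ResNet branch sees the original image.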
