LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer
Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao, Ziru Chen, Cheng Li, Dasa Gu, Rui Huang, Alexis Kai Hon Lau
TL;DR
LC4-DViT tackles data scarcity and geometric distortions in high-resolution land-cover mapping by pairing a generative augmentation pipeline (GPT-4o-generated scene descriptions, Real-ESRGAN super-resolution, and ControlNet-guided diffusion) with a Deformable Vision Transformer (DViT). DViT fuses a DCNv4 deformable convolutional backbone with a Transformer encoder to jointly model fine-scale geometry and global context, and it outperforms the CNN and vanilla ViT baselines. On eight AID classes, LC4-DViT achieves the highest scores among the compared models (OA 0.9572, mAcc 0.9592, Kappa 0.9510, macro F1 0.9576) and transfers well to SIRI-WHU. Interpretability analyses, in which GPT-4o judges Grad-CAM heatmaps, indicate better alignment with hydrological structures. These results suggest that description-driven data creation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping, with practical implications for environmental monitoring.
Abstract
Land cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID), namely Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River, DViT achieves 0.9572 overall accuracy (OA), 0.9576 macro F1-score, and 0.9510 Cohen's Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge that uses GPT-4o to score Grad-CAM heatmaps further shows that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
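For readers less familiar with the reported metrics, the following is a minimal sketch of how overall accuracy, mean per-class accuracy, macro F1, and Cohen's Kappa can be derived from a confusion matrix. It is not the authors' evaluation code: the function name `classification_metrics` is hypothetical, the toy matrix is made up for illustration, and mAcc is assumed here to mean the average of per-class recall.

```python
def classification_metrics(cm):
    """Compute OA, mAcc, macro F1, and Cohen's Kappa.

    cm[i][j] = number of samples with true class i predicted as class j.
    Assumption: mAcc is the mean of per-class recall.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    true_counts = [sum(cm[i]) for i in range(n)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # column sums
    correct = [cm[i][i] for i in range(n)]

    # Overall accuracy: fraction of all samples classified correctly.
    oa = sum(correct) / total

    # Mean per-class accuracy (recall averaged over classes).
    recalls = [correct[i] / true_counts[i] if true_counts[i] else 0.0
               for i in range(n)]
    macc = sum(recalls) / n

    # Macro F1: per-class F1 = 2*TP / (true count + predicted count), averaged.
    f1s = []
    for i in range(n):
        denom = true_counts[i] + pred_counts[i]
        f1s.append(2 * correct[i] / denom if denom else 0.0)
    macro_f1 = sum(f1s) / n

    # Cohen's Kappa: agreement corrected for chance agreement p_e.
    p_e = sum(t * p for t, p in zip(true_counts, pred_counts)) / (total * total)
    kappa = (oa - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"OA": oa, "mAcc": macc, "macro_F1": macro_f1, "Kappa": kappa}


if __name__ == "__main__":
    # Toy two-class confusion matrix (illustrative values, not from the paper).
    toy = [[8, 2],
           [1, 9]]
    for name, value in classification_metrics(toy).items():
        print(f"{name}: {value:.4f}")
```

On the toy matrix, 17 of 20 samples are correct, so OA is 0.85; chance agreement is 0.5, which gives a Kappa of 0.70. Kappa sits below OA because it discounts the agreement expected by chance, which is also why the reported Kappa values are lower than the corresponding OA values.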
