LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao, Ziru Chen, Cheng Li, Dasa Gu, Rui Huang, Alexis Kai Hon Lau

TL;DR

LC4-DViT tackles data scarcity and geometric distortions in high-resolution land-cover mapping by integrating a GPT-4o-guided, diffusion-based augmentation pipeline with Real-ESRGAN super-resolution and a ControlNet-guided diffusion process. The Deformable Vision Transformer (DViT) fuses a DCNv4 deformable convolutional backbone with a Transformer encoder to jointly model fine-scale geometry and global context, and it outperforms the CNN baselines and a vanilla ViT. Across eight AID classes, LC4-DViT achieves the highest scores among the five compared models (OA 0.9572, mAcc 0.9592, Kappa 0.9510, macro F1 0.9576) and transfers well to a three-class SIRI-WHU subset (OA 0.9333). Interpretability analyses, in which GPT-4o scores Grad-CAM heatmaps, indicate that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven data creation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping, with practical implications for environmental monitoring.

Abstract

Land cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID), namely Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River, DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen's Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.

Paper Structure

This paper contains 22 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overall architecture of LC4-DViT. The pipeline first enhances low-resolution remote sensing images using RRDBNet within the Real-ESRGAN framework, and then leverages GPT-4o–generated land-cover descriptions together with Stable Diffusion to synthesize diverse augmented samples. Finally, the proposed deformation-aware ViT (DViT) module performs land-cover classification by explicitly modeling complex landform geometries, resulting in improved robustness and accuracy.
  • Figure 2: Overall architecture of the proposed Deformable Vision Transformer (DViT).
  • Figure 3: Overall metrics of the five models on the AID dataset. The evaluation metrics include Overall Accuracy (OA), mean accuracy (mAcc), Kappa Coefficient (Kappa), Precision, Recall, and F1 Score (F1).
  • Figure 4: Normalized confusion matrices of the five models.
  • Figure 5: Models' attention to key areas of the images. The figure compares class-activation heatmaps of different models for eight scene categories (Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River). The first column is the original scene and the other columns are the attention heatmaps of different models.
  • ...and 1 more figures
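
The metrics reported in the abstract and listed in the Figure 3 caption (OA, mAcc, Kappa, macro F1) can all be derived from a single confusion matrix. The following is a minimal sketch in plain Python, not the authors' evaluation code. It assumes that mAcc means the unweighted mean of per-class recall and that macro F1 is the unweighted mean of per-class F1; the function name `classification_metrics` and the toy matrix are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def classification_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Compute OA, mAcc, Cohen's kappa, and macro F1 from a confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j. (Assumed convention; not taken from the paper.)
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                              # true counts per class
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # predicted counts per class
    diag = [cm[i][i] for i in range(k)]                              # correct predictions per class

    # Overall accuracy: fraction of all samples classified correctly.
    oa = sum(diag) / total

    # Per-class recall and precision, guarding against empty classes.
    recall = [diag[i] / row_sums[i] if row_sums[i] else 0.0 for i in range(k)]
    precision = [diag[i] / col_sums[i] if col_sums[i] else 0.0 for i in range(k)]

    # Assumption: mAcc is the unweighted mean of per-class recall.
    macc = sum(recall) / k

    # Macro F1: unweighted mean of per-class harmonic means.
    f1 = [
        2 * p * r / (p + r) if (p + r) else 0.0
        for p, r in zip(precision, recall)
    ]
    macro_f1 = sum(f1) / k

    # Cohen's kappa: agreement corrected for chance agreement p_e.
    p_e = sum(row_sums[i] * col_sums[i] for i in range(k)) / (total * total)
    kappa = (oa - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"oa": oa, "macc": macc, "kappa": kappa, "macro_f1": macro_f1}


# Toy two-class example (made-up counts, not data from the paper).
toy_cm = [[8, 2],
          [1, 9]]
toy_metrics = classification_metrics(toy_cm)
```

Because Kappa subtracts chance agreement, it is always at most OA, which is consistent with the reported pairs (for example, OA 0.9572 against Kappa 0.9510 on AID).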