LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping

Chenying Liu, Wei Huang, Xiao Xiang Zhu

TL;DR

LandSegmenter tackles the generalization bottleneck in LULC mapping with a task-specific foundation model trained on LAS, a large and mostly weakly labeled dataset, enabling flexible multi-modal inputs and adaptive outputs. It combines an RS-adaptive encoder, a GeoRSCLIP-based text prompter, and a vision-text decoder with an attention-based fusion module (AFM) and high-frequency/spectral enhancements, along with a confidence-guided fusion strategy to bolster zero-shot performance. Key contributions include constructing LAS (≈150k samples across eight subsets with ~80% weak labels), designing a three-part LandSegmenter architecture, and demonstrating strong zero-shot and fine-tuning results across six diverse LULC datasets. The approach highlights the practical value of weak supervision for scaling task-specific FMs in Earth observation and provides a pathway toward flexible, semantically aware LULC mapping with reduced labeling burden.

Abstract

Land Use and Land Cover (LULC) mapping is a fundamental task in Earth Observation (EO). However, current LULC models are typically developed for a specific modality and a fixed class taxonomy, limiting their generalizability and broader applicability. Recent advances in foundation models (FMs) offer promising opportunities for building universal models. Yet, task-agnostic FMs often require fine-tuning for downstream applications, whereas task-specific FMs rely on massive amounts of labeled data for training, which is costly and impractical in the remote sensing (RS) domain. To address these challenges, we propose LandSegmenter, an LULC FM framework that addresses challenges at three stages: input, model, and output. On the input side, to alleviate the heavy demand on labeled data for FM training, we introduce LAnd Segment (LAS), a large-scale, multi-modal, multi-source dataset built primarily with globally sampled weak labels from existing LULC products. LAS provides a scalable, cost-effective alternative to manual annotation, enabling large-scale FM training across diverse LULC domains. For model architecture, LandSegmenter integrates an RS-specific adapter for cross-modal feature extraction and a text encoder for semantic awareness enhancement. At the output stage, we introduce a class-wise confidence-guided fusion strategy to mitigate semantic omissions and further improve LandSegmenter's zero-shot performance. We evaluate LandSegmenter on six precisely annotated LULC datasets spanning diverse modalities and class taxonomies. Extensive transfer learning and zero-shot experiments demonstrate that LandSegmenter achieves competitive or superior performance, particularly in zero-shot settings when transferred to unseen datasets. These results highlight the efficacy of our proposed framework and the utility of weak supervision for building task-specific FMs.

Paper Structure

This paper contains 28 sections, 7 equations, 13 figures, 26 tables.

Figures (13)

  • Figure 1: Overview of the proposed workflow for LULC FM construction, comprising three main stages. (a) LAS dataset curation: a globally sampled collection of RS imagery spanning diverse modalities and LULC categories, primarily weakly labeled at low cost. (b) LandSegmenter model design: a task-adaptive architecture capable of processing varying multispectral inputs and producing LULC maps tailored to user-defined category sets. (c) Zero-shot inference enhancement: a confidence-guided fusion strategy to improve recognition of semantically omitted or underrepresented classes during inference.
  • Figure 2: LAS dataset for LandSegmenter training. Middle: geographic distribution of each subset; from left to right, the high-resolution, Sentinel-2 (S2), and Landsat-8/9 (L8/9) subsets. Top and bottom: examples from each subset. Please refer to the Appendix for details, including the category information and color systems.
  • Figure 3: Architecture of LandSegmenter, where the attention-based fusion module (AFM) is depicted per block to indicate the consistent additional input at every stage, with its layer-wise implementation detailed in Figure 4. The embeddings sent to the decoder are the summation of the outputs from Blocks 4 (upsampled) and 3. For simplicity, we omit this operator in the figure.
  • Figure 4: Attention-based fusion module (AFM), where the attention modules share the same architecture yet are individually optimized for each input.
  • Figure 5: An example from Potsdam, whose car class is absent from the LAS dataset. Top: class-wise confidence maps from softmax outputs. Bottom: pixel-wise uncertainty map (entropy of probability vectors); RGB image; GT mask; prediction by the confidence-guided fusion strategy (Fusion); prediction by LandSegmenter; prediction by ProxyCLIP with the features refined with LandSegmenter's embeddings (CLIP). Confidence and uncertainty values range from 0 (blue) to 1 (red). The class scheme of GT and predictions is the same as that in the paper's Potsdam class table.
  • ...and 8 more figures
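
The Figure 5 caption describes two per-pixel quantities: class-wise confidence maps taken from softmax outputs, and an uncertainty map computed as the entropy of each pixel's probability vector. The sketch below shows one way these could be computed. It is an illustration only, not the authors' implementation: the function names are hypothetical, and dividing the entropy by log(C) to bring it into the 0 to 1 range mentioned in the caption is an assumption.

```python
import math
from typing import List, Sequence, Tuple

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by log(C), so the value lies in [0, 1].

    The log(C) normalization is an assumption made for this sketch.
    """
    c = len(probs)
    if c < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(c)

def confidence_and_uncertainty(
    logit_map: Sequence[Sequence[Sequence[float]]],
) -> Tuple[List[List[List[float]]], List[List[float]]]:
    """Compute class-wise confidence maps and a pixel-wise uncertainty map.

    logit_map[row][col] holds the C class logits of one pixel.
    Returns (confidence, uncertainty), where confidence[k][row][col] is the
    softmax probability of class k and uncertainty[row][col] is the
    normalized entropy of that pixel's probability vector.
    """
    rows, cols = len(logit_map), len(logit_map[0])
    num_classes = len(logit_map[0][0])
    confidence = [[[0.0] * cols for _ in range(rows)] for _ in range(num_classes)]
    uncertainty = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for q in range(cols):
            probs = softmax(logit_map[r][q])
            for k, p in enumerate(probs):
                confidence[k][r][q] = p
            uncertainty[r][q] = normalized_entropy(probs)
    return confidence, uncertainty
```

With this normalization, a pixel whose logits are all equal gets uncertainty 1, while a pixel dominated by a single class gets a value close to 0. This matches the intuition that classes unseen during training, such as car in the Potsdam example, tend to appear as high-uncertainty regions.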