Table of Contents
Fetching ...

MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li

TL;DR

MVT addresses the challenge of open-world land-cover understanding under cross-dataset domain shift by decoupling geometry from semantics. It fuses a domain-adapted $SAM2$ region discovery front end with a two-step, lexical-label-free LoRA fine-tuning of multimodal LLMs to produce taxonomy-aligned tags and mask-grounded descriptions, evaluated via expert calibration and GPT-4o judging. The approach yields improved mask quality and richer, more accurate taxonomy outputs on LoveDA when trained on OpenEarthMap, demonstrating robust cross-domain generalization and interpretability. This framework provides a scalable pathway for interpretable, domain-agnostic land-cover tagging suitable for diverse remote sensing datasets and reporting standards.

Abstract

Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

TL;DR

MVT addresses the challenge of open-world land-cover understanding under cross-dataset domain shift by decoupling geometry from semantics. It fuses a domain-adapted region discovery front end with a two-step, lexical-label-free LoRA fine-tuning of multimodal LLMs to produce taxonomy-aligned tags and mask-grounded descriptions, evaluated via expert calibration and GPT-4o judging. The approach yields improved mask quality and richer, more accurate taxonomy outputs on LoveDA when trained on OpenEarthMap, demonstrating robust cross-domain generalization and interpretability. This framework provides a scalable pathway for interpretable, domain-agnostic land-cover tagging suitable for diverse remote sensing datasets and reporting standards.

Abstract

Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

Paper Structure

This paper contains 19 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the proposed MVT framework. The architecture includes the segmentation phase, the two-step fine-tuning of the MLLMs, and the evaluation phase.
  • Figure 2: (a) Original Image; (b) Off-the-shelf; (c) Fine-tuned.
  • Figure 3: Hydraulic Construction Land example.