MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging
Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li
TL;DR
MVT addresses the challenge of open-world land-cover understanding under cross-dataset domain shift by decoupling geometry from semantics. It fuses a domain-adapted $SAM2$ region discovery front end with a two-step, lexical-label-free LoRA fine-tuning of multimodal LLMs to produce taxonomy-aligned tags and mask-grounded descriptions, evaluated via expert calibration and GPT-4o judging. The approach yields improved mask quality and richer, more accurate taxonomy outputs on LoveDA when trained on OpenEarthMap, demonstrating robust cross-domain generalization and interpretability. This framework provides a scalable pathway for interpretable, domain-agnostic land-cover tagging suitable for diverse remote sensing datasets and reporting standards.
Abstract
Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
