OSM-based Domain Adaptation for Remote Sensing VLMs

Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool, Danda Pani Paudel

Abstract

Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.

Paper Structure (28 sections, 8 figures, 6 tables)

Figures (8)

  • Figure 1: (a) Estimated data generation costs, based on API pricing and measured self-hosting costs. (b) Performance per benchmark and task-aggregated performance. Benchmarks used for fine-tuning are highlighted in purple; zero-shot benchmarks are highlighted in teal.
  • Figure 2: The creation of OSMDA-Captions and OSMDA-VLM via the OSMDA method. We collect images of various areas and resolutions and fetch the tags of the OSM objects they contain. We use the image resolution and a set of heuristics to select the objects that are visible, then pass each object's OSM tags through an LLM to produce a short label capturing its essence. We overlay the labels onto OSM map tiles and feed the resulting overlays, alongside the matching satellite images, to the base VLM to generate OSMDA-Captions, a detailed captioning dataset incorporating OSM metadata. We then mix this dataset with existing remote-sensing data and fine-tune the base model, producing OSMDA-VLM, a state-of-the-art model specializing in the remote-sensing domain.
  • Figure 3: A comparison of four captions generated by different methods for the same area: the base model (top left); OSMDA-Captions, i.e., the base model provided with the OSM map (top right); the base model fine-tuned on the training splits of the benchmarks (bottom left); and OSMDA-VLM, the base model jointly trained on OSMDA-Captions and the training splits (bottom right). Methods are ranked by the number of wins across all metrics and benchmarks, where benchmarking is possible. Joint training with our dataset stabilizes the model, resulting in less hallucination and better descriptions of spatial and visual layout. With the help of the OSM map, OSMDA-Captions also contain comparatively accurate descriptions, though qualitatively not at the level of OSMDA-VLM. Both the base and base-fine-tuned models sometimes hallucinate: here, the base model incorrectly places the dome-shaped structure in the center, while the base-fine-tuned model hallucinates a nonexistent pond.
  • Figure 4: Effects of the OSMDA method on performance. Each barplot shows the average rank of the included models when their benchmark results are compared across all metrics. The generalization barplot shows performance on benchmarks evaluated without using their training sets (generalization-split). The fine-tuned performance barplot shows performance on benchmarks whose training splits the models are fine-tuned on (fine-tuning-split). Overall performance includes both. OSMDA-VLM improves the generalization of the base model the most. Fine-tuning after training on OSMDA-Captions (ours-fine-tuned) also yields a better model on the downstream tasks than fine-tuning the base model directly. The gains of the OSMDA method outweigh even those of the standard, much more costly approach of distilling a large teacher.
  • Figure 5: Per-category accuracy for classification (left) and VQA (right). In the VQA panel, rural/urban denotes scene discrimination, while other slash-separated labels (e.g., building/scrub) indicate quantitative object comparison.
  • ...and 3 more figures
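
The average-rank aggregation described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `average_ranks`, the convention that higher scores are better, the averaging of tied ranks, and all numbers in the example are assumptions made purely for illustration.

```python
from statistics import mean


def average_ranks(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each model's rank over all metrics.

    `scores[metric][model]` is a score where higher is better (assumption).
    Rank 1 is best; tied models share the mean of the positions they occupy
    (assumption, the caption does not say how ties are handled).
    """
    ranks: dict[str, list[float]] = {}
    for per_model in scores.values():
        # Best score first.
        ordered = sorted(per_model.items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j over the run of models tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            # Mean of the 1-based positions i+1 .. j+1.
            tied_rank = (i + 1 + j + 1) / 2
            for model, _ in ordered[i : j + 1]:
                ranks.setdefault(model, []).append(tied_rank)
            i = j + 1
    return {model: mean(r) for model, r in ranks.items()}


# Made-up scores for three hypothetical models on two hypothetical metrics.
example_scores = {
    "accuracy": {"base": 0.6, "ours": 0.8, "teacher": 0.7},
    "f1": {"base": 0.5, "ours": 0.5, "teacher": 0.4},
}
example_ranks = average_ranks(example_scores)
```

A lower average rank means the model placed better across metrics. Restricting `scores` to the metrics of one benchmark group would give a per-split view analogous to the separate barplots.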