CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates

Ankit Kumar Shaw, Kun Jiang, Tuopu Wen, Chandan Kumar Sah, Yining Shi, Mengmeng Yang, Diange Yang, Xiaoli Lian

TL;DR

CleanMAP is a confidence-driven framework for crowdsourced HD map updates. A Multimodal Large Language Model (MLLM) produces lane-visibility scores, and a Dynamic Piecewise Confidence Scoring (DPCS) mechanism adjusts them to align with human judgments. A confidence-driven fusion strategy then ranks the local maps and selects the top-$k$ within an optimal confidence band (best score minus 10%), balancing map accuracy against coverage. In real-world experiments, fusing the top-3 local maps reduces AME from $0.37\text{ m}$ to $0.28\text{ m}$, and the confidence scores reach 84.88% alignment with human evaluators. The paper also provides detailed data collection, annotation, and supplementary analyses to support deployment of crowdsourced HD map updates for autonomous navigation.

Abstract

The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP's effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (<= 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model's robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at https://Ankit-Zefan.github.io/CleanMap/
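The selection step described in the abstract (rank local maps by confidence, keep those within the band below the best score, then take the top-k) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LocalMap` type, the `select_top_k` function, and its parameters are hypothetical names, and "best score minus 10%" is assumed here to mean a threshold of 90% of the best score, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocalMap:
    """Hypothetical container for one crowdsourced local map."""
    map_id: str
    confidence: float  # confidence score on the 0-10 scale


def select_top_k(local_maps: List[LocalMap], k: int = 3, band: float = 0.10) -> List[LocalMap]:
    """Return up to k highest-confidence maps within the band below the best score.

    Assumption: the band is relative, i.e. threshold = best * (1 - band).
    """
    if not local_maps or k <= 0:
        return []
    # Rank all candidates from highest to lowest confidence.
    ranked = sorted(local_maps, key=lambda m: m.confidence, reverse=True)
    best = ranked[0].confidence
    threshold = best * (1.0 - band)
    # Keep only maps inside the confidence band, then cap at k.
    in_band = [m for m in ranked if m.confidence >= threshold]
    return in_band[:k]


candidates = [
    LocalMap("a", 7.9),
    LocalMap("b", 9.0),
    LocalMap("c", 5.0),
    LocalMap("d", 8.5),
    LocalMap("e", 8.2),
]
selected = select_top_k(candidates, k=3)
selected_ids = [m.map_id for m in selected]
```

Filtering by the band before truncating to k means fewer than k maps are returned when few candidates are close to the best one, which reflects the quality-versus-quantity trade-off the abstract mentions.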

Paper Structure

This paper contains 66 sections, 23 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Overall Framework for MLLM-driven Data Cleansing for Crowdsourced HD Map Updates. A pretrained MLLM processes multimodal inputs, scoring key parameters related to lane visibility. The confidence-driven selection filters high-quality local maps, ranks them, and fuses the top $k$ maps within an optimal confidence range to enhance HD map updates.
  • Figure 2: MLLM-driven scoring model for evaluating key parameters related to lane line visibility in an image.
  • Figure 3: Sample training images for instruction-tuning the MLLM-driven scoring model: (a) Nighttime blur, (b) Clear nighttime, (c) Glare from streetlights and vehicles, (d) Fog-induced blur, (e) Light rain on a busy street, (f) Heavy snowfall with blur.
  • Figure 4: MLLM-driven assessment of lane visibility under bright sunlight causing glare and partial occlusion of lane lines by vehicles.
  • Figure 5: Confidence score distribution for human evaluators vs. our MLLM on 18,937 timestamped images collected by crowdsourced vehicles.
  • ...and 12 more figures