Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification

Yiqiao Li, Bo Shang, Jie Wei

TL;DR

The paper tackles fine-grained truck classification from roadside LiDAR by addressing the modality gap between sparse 3D point clouds and the dense 2D images that vision-language models (VLMs) expect. It introduces a depth-aware image generation pipeline that converts sparse LiDAR data into depth-encoded 2D proxies, and pairs it with domain-aware prompt engineering to enable training-free few-shot classification with off-the-shelf VLMs such as CLIP and EVA. Key findings show competitive accuracy with as few as 16–30 examples per class across 20 vehicle classes, reveal a Semantic Anchor effect in which text guidance helps in ultra-low-shot regimes but can hurt with more data, and demonstrate the framework as an effective Cold Start strategy for bootstrapping lightweight supervised models. The approach reduces the labeling burden and offers a practical, scalable path for ITS deployments. Future work will focus on grounded spatial reasoning that explicitly localizes vehicle substructures to resolve the remaining fine-grained distinctions.
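To make the few-shot setup and the Semantic Anchor effect concrete, the following is a minimal sketch of a generic prototype-based classifier over precomputed embeddings. It is not the paper's implementation: the function names, the blending weight `alpha`, and the assumption that image and text embeddings share one space (as in CLIP-style models) are all illustrative choices.

```python
import numpy as np


def l2_normalize(v):
    """Scale vectors along the last axis to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def few_shot_predict(query, support, support_labels, text_embeds, alpha=0.5):
    """Classify one query embedding with class prototypes blended with text.

    query:          (D,) image embedding of the sample to classify
    support:        (N, D) image embeddings of the labeled few-shot examples
    support_labels: (N,) integer class ids in [0, C); every class needs >= 1 example
    text_embeds:    (C, D) one text-prompt embedding per class
    alpha:          hypothetical weight of the text anchor (0 = images only)
    """
    support = l2_normalize(np.asarray(support, dtype=float))
    text = l2_normalize(np.asarray(text_embeds, dtype=float))
    labels = np.asarray(support_labels)
    n_classes = text.shape[0]

    # One visual prototype per class: the mean of its support embeddings.
    prototypes = np.stack(
        [support[labels == c].mean(axis=0) for c in range(n_classes)]
    )
    prototypes = l2_normalize(prototypes)

    # Pull each visual prototype toward its class text embedding.
    blended = l2_normalize((1.0 - alpha) * prototypes + alpha * text)

    # Cosine similarity between the query and every blended prototype.
    scores = blended @ l2_normalize(np.asarray(query, dtype=float))
    return int(np.argmax(scores))
```

In this toy formulation, a larger `alpha` leans on the text prompt, which can stabilize prototypes built from very few examples, while a text embedding that poorly matches the visual class can pull predictions the wrong way once the visual prototypes are already reliable.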

Abstract

Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$), but degrades accuracy in higher-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a Cold Start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without costly training or fine-tuning, significantly reducing the demands of initial manual labeling and making the method practical for ITS applications.
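As a rough illustration of what a depth-encoded 2D visual proxy can look like, the sketch below projects a point cloud onto a side-view image whose pixel intensity encodes nearness to the sensor. It covers only the projection idea; the noise removal, registration, orientation rectification, morphological operations, and anisotropic smoothing named in the abstract are not implemented here. The axis convention (x along the vehicle, y pointing away from the sensor, z up), the image size, and the function name are assumptions made for the example.

```python
import numpy as np


def depth_image(points, height=64, width=128):
    """Project an (N, 3) point cloud onto the x-z plane as a uint8 image.

    Pixel value 0 means "no return"; values 1..255 encode nearness along y,
    with 255 for the point closest to the sensor (assumed axis convention).
    """
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    def to_index(values, size):
        # Map coordinates linearly onto pixel indices 0..size-1.
        span = values.max() - values.min()
        if span == 0:
            return np.zeros(len(values), dtype=int)
        scaled = (values - values.min()) / span
        return np.round(scaled * (size - 1)).astype(int)

    cols = to_index(x, width)
    rows = (height - 1) - to_index(z, height)  # image row 0 is the top

    # Nearness in [0, 1]: 1 for the closest point, 0 for the farthest.
    y_span = y.max() - y.min()
    nearness = np.ones(len(y)) if y_span == 0 else 1.0 - (y - y.min()) / y_span

    # Shift into 1..255 so that 0 stays reserved for empty pixels.
    values = np.round(nearness * 254).astype(np.uint8) + 1

    image = np.zeros((height, width), dtype=np.uint8)
    # When several points fall into one pixel, keep the nearest one.
    np.maximum.at(image, (rows, cols), values)
    return image
```

An image produced this way is typically still sparse, which is the kind of input the abstract's morphological and smoothing steps are described as densifying before it is passed to a VLM.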

Paper Structure (40 sections, 15 equations, 10 figures, 4 tables, 1 algorithm)

This paper contains 40 sections, 15 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Flowchart of the proposed framework.
  • Figure 2: Comparison of single frame and reconstructed frame (color represents depth information).
  • Figure 3: Illustration of the proposed domain-aware few-shot prompt design for vision–language model (VLM)-based truck classification.
  • Figure 4: Illustration of the 2D projected image without depth-aware smoothing (Step 4 of Algorithm 1) and its processed version after applying the entire Algorithm 1.
  • Figure 5: Attention map comparison: CLIP-L/14 attention on the original 2D LiDAR projection (left) versus the depth-aware generated image (right). The proposed depth-aware smoothing yields more focused attention on the vehicle structure, whereas the original projection without depth-aware smoothing elicits diffuse attention on sparse artifacts.
  • ...and 5 more figures