Concept-based Explainable Data Mining with VLM for 3D Detection

Mai Tsujimoto

TL;DR

The paper tackles the scarcity of rare-object examples in autonomous driving 3D detection by introducing a concept-based, explainable data mining framework that leverages Vision-Language Models to semantically mine informative 2D samples for training. It integrates object concept embeddings, outlier detection via t-SNE and Isolation Forest, and VLM-based captioning to identify and label rare concepts, enabling targeted data mining (Target/Rare) and reduced-annotation datasets. The approach demonstrates notable improvements in rare object categories on nuScenes with only 20% of the data, and provides interpretable visualizations that link concepts to detections. This cross-modal framework offers practical benefits for safety-critical perception systems by enhancing rare-object detection while maintaining dataset efficiency and transparency.

Abstract

Rare-object detection remains a challenging task in autonomous driving systems, particularly when relying solely on point cloud data. Although Vision-Language Models (VLMs) exhibit strong capabilities in image understanding, their potential to enhance 3D object detection through intelligent data mining has not been fully explored. This paper proposes a novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes, thereby improving 3D object detection performance. Our approach synthesizes complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection into a cohesive, explainable pipeline that systematically identifies rare but critical objects in driving scenes. By combining Isolation Forest and t-SNE-based outlier detection methods with concept-based filtering, the framework effectively identifies semantically meaningful rare objects. A key strength of this approach lies in its ability to extract and annotate targeted rare object concepts such as construction vehicles, motorcycles, and barriers. This substantially reduces the annotation burden and focuses only on the most valuable training samples. Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models while utilizing only a fraction of the training data, with particularly notable improvements for challenging object categories such as trailers and bicycles compared with the same amount of random data. This finding has substantial implications for the efficient curation of datasets in safety-critical autonomous systems.
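The comparison against "the same amount of random data" can be made concrete with a small sketch of how a mined subset and an equally sized random baseline might be drawn. It follows the "Random 10% + Target 10%" composition named in the Figure 1 caption, but the details are assumptions rather than the authors' procedure: the function name `build_subset`, the `random_pool` argument (left to the caller, since the random part could be drawn from the whole dataset or from the rare set), the cap applied when target images exceed their budget, and the toy image ids are all hypothetical.

```python
import random


def build_subset(all_ids, target_ids, random_pool,
                 target_fraction=0.10, random_fraction=0.10, seed=0):
    """Pick a training subset made of a target part and a random part.

    Fractions are relative to len(all_ids). The cap on the target part and
    the choice of random_pool are assumptions of this sketch.
    """
    rng = random.Random(seed)
    target_budget = int(len(all_ids) * target_fraction)
    random_budget = int(len(all_ids) * random_fraction)

    # Target part: every image with a target object, capped at the budget.
    targets = sorted(target_ids)
    if len(targets) > target_budget:
        targets = rng.sample(targets, target_budget)
    chosen = set(targets)

    # Random part: drawn from the pool, skipping images already chosen.
    candidates = sorted(set(random_pool) - chosen)
    chosen.update(rng.sample(candidates, min(random_budget, len(candidates))))
    return chosen


all_ids = list(range(1000))            # toy image ids
target_ids = set(range(0, 1000, 10))   # 100 toy images with target objects
subset = build_subset(all_ids, target_ids, random_pool=all_ids)

# Baseline of the same size, sampled uniformly at random.
baseline = set(random.Random(1).sample(all_ids, len(subset)))
print(len(subset), len(baseline))
```

Holding the two subsets to the same size is what makes a comparison between mined and random data meaningful; only the selection rule differs.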

Paper Structure

This paper contains 23 sections, 1 equation, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of the refined targeted concept mining framework used for selecting rare and safety-critical objects from large-scale 2D image datasets. YOLOv8 first detects and crops objects from over 40,000 images. CLIP then extracts image embeddings, which are analyzed with both t-SNE and Isolation Forest to identify outliers. Approximately 10,000 detected outlier crops are passed to a vision-language model (Qwen2-VL) to generate captions. Each caption is matched to predefined class concepts using cosine similarity in the CLIP embedding space. Detected objects are categorized into three groups based on the top-matching concept: Target (1,000; e.g., 'bicycle', 'motorcycle'), Rare (6,000; classes not in the common list, e.g., not in ['car', 'pedestrian']), and Common (e.g., 'car', 'pedestrian'). The final training dataset is constructed by including all images with target objects and randomly sampling from the rare set, forming a "Random 10% + Target 10%" strategy that enhances rare-class representation while maintaining dataset balance.
  • Figure 2: t-SNE visualization of object embeddings and outlier detection results. (Left) Embeddings colored according to object category. (Top-left) Outliers detected using the Isolation Forest. (Top-right) Outliers based on t-SNE anomaly regions. (Bottom-left) Combined outliers detected by both methods. The red points indicate anomalous samples, and the blue points indicate inliers. (The red points appear to dominate the plots due to overplotting, but in fact only about 20-30% of the samples were detected as outliers.)
  • Figure 3: An example of a detected construction vehicle, correctly identified with high concept similarity to "construction_vehicle".
  • Figure 4: Example of a motorcycle detected by the system, showing a strong association with "motorcycle" in the concept analysis.
  • Figure 5: Example of a bicycle detected by the system, showing a strong association with "bicycle" in the concept analysis.
  • ...and 5 more figures
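The concept-matching step described in the Figure 1 caption (cosine similarity against predefined concepts, then a Target / Rare / Common split on the top match) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the toy 3-dimensional vectors stand in for CLIP embeddings, the concept lists contain only the examples the caption names, and the helper names `top_concept` and `categorize` are hypothetical.

```python
import math

# Illustrative lists only; the caption gives these as examples, not full lists.
TARGET_CONCEPTS = {"bicycle", "motorcycle"}
COMMON_CONCEPTS = {"car", "pedestrian"}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_concept(caption_embedding, concept_embeddings):
    """Return (concept, similarity) for the best-matching concept."""
    return max(
        ((name, cosine_similarity(caption_embedding, emb))
         for name, emb in concept_embeddings.items()),
        key=lambda pair: pair[1],
    )


def categorize(concept):
    """Map a top-matching concept to Target, Rare, or Common."""
    if concept in TARGET_CONCEPTS:
        return "Target"
    if concept in COMMON_CONCEPTS:
        return "Common"
    return "Rare"  # neither a target nor in the common list


# Toy vectors standing in for CLIP embeddings (assumption of this sketch).
concept_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "motorcycle": [0.0, 1.0, 0.0],
    "trailer": [0.0, 0.0, 1.0],
}
caption_embedding = [0.1, 0.9, 0.2]  # stand-in for an embedded caption

concept, score = top_concept(caption_embedding, concept_embeddings)
category = categorize(concept)
print(concept, category)
```

Because the category depends only on the top-matching concept name, each selected crop comes with a human-readable reason for its selection, which is what makes the mining step explainable.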