Auto-Annotation with Expert-Crafted Guidelines: A Study through 3D LiDAR Detection Benchmark

Yechi Ma, Wei Hua, Shu Kong

Abstract

Data annotation is crucial for developing machine learning solutions. The current paradigm is to hire ordinary human annotators to annotate data following expert-crafted guidelines. As this paradigm is laborious, tedious, and costly, we are motivated to explore auto-annotation with expert-crafted guidelines. To this end, we first develop a supporting benchmark, AutoExpert, by repurposing the established nuScenes dataset, which has been widely used in autonomous driving research and provides authentic expert-crafted guidelines. The guidelines define 18 object classes using both nuanced language descriptions and a few visual examples, and require annotating objects in LiDAR data with 3D cuboids. Notably, the guidelines do not provide LiDAR visuals to demonstrate how to annotate. Therefore, AutoExpert requires methods to learn from few-shot labeled images and texts to perform 3D detection in LiDAR data. The challenges of AutoExpert lie in the data-modality and annotation-task discrepancies. Meanwhile, publicly available foundation models (FMs) serve as promising tools to tackle these challenges. Hence, we address AutoExpert by leveraging appropriate FMs within a conceptually simple pipeline, which (1) utilizes FMs for 2D object detection and segmentation in RGB images, (2) lifts 2D detections into 3D using known sensor poses, and (3) generates 3D cuboids for the 2D detections. In this pipeline, we progressively refine key components and eventually boost 3D detection mAP to 25.4, significantly higher than the 12.1 achieved by prior art.
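
Step (2) of the pipeline, lifting 2D detections into 3D with known sensor poses, can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsic matrix K and a rigid LiDAR-to-camera transform; the function name `points_in_mask` and all parameter names are hypothetical and not taken from the paper, and the authors' actual lifting procedure may differ.

```python
import numpy as np


def points_in_mask(points_lidar, T_cam_from_lidar, K, mask):
    """Return the LiDAR points whose image projection lands inside a 2D mask.

    points_lidar:      (N, 3) xyz coordinates in the LiDAR frame.
    T_cam_from_lidar:  (4, 4) rigid transform from LiDAR to camera frame (assumed known).
    K:                 (3, 3) pinhole camera intrinsics (assumed known).
    mask:              (H, W) boolean instance mask from a 2D segmenter.
    """
    n = points_lidar.shape[0]
    homogeneous = np.hstack([points_lidar, np.ones((n, 1))])
    cam = (T_cam_from_lidar @ homogeneous.T).T[:, :3]

    # Points behind the camera cannot be observed in the image.
    in_front = cam[:, 2] > 1e-6

    # Perspective projection; use a dummy depth of 1 for points behind the
    # camera so the division is safe (they are discarded below anyway).
    uvw = (K @ cam.T).T
    depth = np.where(in_front, uvw[:, 2], 1.0)
    u = np.floor(uvw[:, 0] / depth).astype(int)
    v = np.floor(uvw[:, 1] / depth).astype(int)

    h, w = mask.shape
    in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Look up the mask only for points that project inside the image bounds.
    keep = np.zeros(n, dtype=bool)
    keep[in_image] = mask[v[in_image], u[in_image]]
    return points_lidar[keep]
```

The selected points are only candidates for the object: as the Figure 4 caption notes, points that project onto a foreground mask can still belong to occluders or to the background, so a later fitting step has to tolerate such outliers.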

Paper Structure

This paper contains 25 sections, 6 equations, 14 figures, 18 tables.

Figures (14)

  • Figure 1: Excerpts of the authentic annotation guidelines of the nuScenes dataset [caesar2020nuscenes]. (a) The guidelines instruct human annotators to label LiDAR points with 3D cuboids for specific object classes. (b) Each class is defined with a few visual examples and nuanced textual descriptions (see the red box) without 3D annotations. Human annotators must comprehend and apply these guidelines to draw 3D boxes. (c) We visualize the ground-truth human-annotated 3D cuboids in the RGB image and the Bird's-Eye-View (BEV) of LiDAR points.
  • Figure 2: To solve AutoExpert, we adopt a conceptually simple pipeline and adapt open-source foundation models (FMs). Specifically, using the visual examples and textual descriptions that define the object classes of interest, we adapt appropriate Vision-Language Models (VLMs) and Vision Foundation Models (VFMs) for object detection and segmentation. The adapted FMs produce decent 2D detections on unlabeled RGB frames. With the known parameters of the LiDAR and camera, we develop novel techniques to lift 2D detections to 3D, locate the corresponding LiDAR points, and employ our proposed VLM-Guided Multi-Hypothesis Testing (v-MHT) strategy to generate 3D cuboids.
  • Figure 3: For each class name, we use a VLM (e.g., GPT-4o [achiam2023gpt] and Qwen [qwen]) to find a list of terms that match its description and visual examples in the annotation guidelines. We select the term or combination of terms that yields the best zero-shot detection performance of a foundational detector (e.g., GroundingDINO [liu2023grounding]) on the validation set. We construct a multimodal few-shot training set using the selected terms and the available images to finetune the detector, yielding notable improvements (Table \ref{tab:2D_comparison}).
  • Figure 4: Generating 3D cuboids based on LiDAR points is challenging because points can come from occluders and backgrounds. For example, (a) LiDAR points projected onto a bicycle foreground mask can come from the background scene seen through the wheels; (b) points projected onto a car mask can come from an occluding fence; (c-d) points projected onto car masks can come from the background seen through the windows and windshield.
  • Figure 5: Overview of the v-MHT method for 3D cuboid generation. Our v-MHT begins by prompting a VLM to infer 3D information about a target 2D detection, as shown in the left panel. Following the prompt, the VLM outputs estimated 3D dimensions of the object and information related to its orientation. We find that it is challenging to directly prompt the VLM to output an orientation angle (even after specifying the current camera coordinates). Therefore, we instruct the VLM to output the location of the object in the image and the visible parts of the object. With known camera extrinsic parameters, we derive a rough orientation, as shown in the middle panel. Lastly, with dimension $d$ and estimated orientation $\theta$, we initialize a 3D cuboid and perform multi-hypothesis testing (MHT) to search for the final cuboid that best fits the LiDAR points and the 2D detection box, as shown in the right panel.
  • ...and 9 more figures
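
The hypothesis search described in the Figure 5 caption can be illustrated with a simplified sketch. This is not the authors' v-MHT: it assumes a Bird's-Eye-View setting, enumerates a small grid of yaw and center perturbations around an initial guess, and scores each hypothesis only by the number of LiDAR points inside the rectangle (the caption also mentions fitting the 2D detection box, which is omitted here). The names `count_inside` and `fit_cuboid_bev`, the grid sizes, and the tie-breaking rule are all hypothetical.

```python
import itertools

import numpy as np


def count_inside(points_xy, center, dims, yaw):
    """Count BEV points inside a rotated rectangle of size dims = (length, width)."""
    c, s = np.cos(yaw), np.sin(yaw)
    d = points_xy - center
    # Rotate the offsets into the box frame (length along local x).
    local_x = c * d[:, 0] + s * d[:, 1]
    local_y = -s * d[:, 0] + c * d[:, 1]
    inside = (np.abs(local_x) <= dims[0] / 2) & (np.abs(local_y) <= dims[1] / 2)
    return int(inside.sum())


def fit_cuboid_bev(points_xy, dims, yaw_init,
                   yaw_range=np.pi / 6, n_yaw=7, max_shift=1.0, n_shift=5):
    """Pick the (center, yaw) hypothesis that covers the most points.

    Assumption: ties are broken in favour of the hypothesis closest to the
    initial guess (smallest yaw change, then smallest center shift).
    """
    center0 = np.median(points_xy, axis=0)  # robust initial center
    yaw_offsets = np.linspace(-yaw_range, yaw_range, n_yaw)
    shifts = np.linspace(-max_shift, max_shift, n_shift)

    best_key, best = None, None
    for dyaw, dx, dy in itertools.product(yaw_offsets, shifts, shifts):
        center = center0 + np.array([dx, dy])
        yaw = yaw_init + dyaw
        score = count_inside(points_xy, center, dims, yaw)
        key = (score, -abs(dyaw), -np.hypot(dx, dy))
        if best_key is None or key > best_key:
            best_key, best = key, (center, yaw, score)
    return best
```

In this sketch `dims` and `yaw_init` play the roles of the dimension $d$ and orientation $\theta$ that the caption says are obtained from the VLM; a fuller version would add height, a term that compares the projected cuboid with the 2D box, and some handling of the outlier points discussed in the Figure 4 caption.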