Label Anything: An Interpretable, High-Fidelity and Prompt-Free Annotator

Wei-Bin Kou, Guangxu Zhu, Rongguang Ye, Shuai Wang, Ming Tang, Yik-Chung Wu

TL;DR

The paper tackles the data-labeling bottleneck in street-scene semantic understanding for autonomous driving (AD) by introducing the Label Anything Model (LAM), a seed-based, prompt-free annotator that combines a pretrained Vision Transformer (ViT) backbone with a lightweight Semantic Class Adapter (SCA) and an Optimization-Oriented Unrolling (OptOU) module. LAM trains from a single RGB seed image and delivers fast per-image annotation while maintaining high fidelity, outperforming prompt-dependent approaches. The key innovations are the SCA, which maps ViT features to $C$ semantic channels, and the OptOU, which unrolls a cascade of optimization steps with learnable hyperparameters. Together they yield near-$100\%$ mIoU on the Cityscapes, CamVid, Apolloscape, and CARLA_ADV datasets with a small parameter budget. The approach promises substantial gains in annotation efficiency and interpretability, and could be extended to bring other modalities, such as LiDAR and depth sensing, into the AD data pipeline.
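To make the idea of "unrolling a cascade of optimization steps with learnable hyperparameters" concrete, here is a minimal sketch of generic algorithm unrolling. It is not the paper's OptOU formulation, which this summary does not spell out. It assumes a plain least-squares objective $\tfrac{1}{2}\lVert Ax - b\rVert^2$ and treats each gradient step as one "layer" with its own step size; the names `unrolled_descent`, `objective`, and `step_sizes` are hypothetical. In a trained model the step sizes would be learned, whereas here they are fixed by hand.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Multiply matrix a by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def transpose(a: Matrix) -> Matrix:
    """Return the transpose of matrix a."""
    return [list(col) for col in zip(*a)]


def objective(a: Matrix, b: Vector, x: Vector) -> float:
    """Illustrative objective (an assumption): 0.5 * ||a x - b||^2."""
    return 0.5 * sum((r - t) ** 2 for r, t in zip(matvec(a, x), b))


def unrolled_descent(a: Matrix, b: Vector, x0: Vector, step_sizes: Vector) -> Vector:
    """Run one gradient step per entry of step_sizes.

    Each loop iteration plays the role of one unrolled layer; its step
    size is the per-layer hyperparameter that training would tune.
    """
    x = list(x0)
    a_t = transpose(a)
    for alpha in step_sizes:
        residual = [r - t for r, t in zip(matvec(a, x), b)]
        grad = matvec(a_t, residual)  # gradient of the objective at x
        x = [x_i - alpha * g_i for x_i, g_i in zip(x, grad)]
    return x


if __name__ == "__main__":
    a = [[2.0, 0.0], [0.0, 1.0]]
    b = [2.0, 3.0]
    x = unrolled_descent(a, b, [0.0, 0.0], [0.25, 0.25, 0.25])
    print(x, objective(a, b, x))
```

Because every layer is an explicit optimization step, the intermediate iterates can be inspected one by one, which is the usual sense in which unrolled networks are called interpretable.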

Abstract

Learning-based street scene semantic understanding in autonomous driving (AD) has advanced significantly in recent years, but the performance of an AD model depends heavily on the quantity and quality of the annotated training data. Traditional manual labeling, however, is costly given the vast amount of data required to train a robust model. To mitigate this cost, we propose a Label Anything Model (denoted as LAM), which serves as an interpretable, high-fidelity, and prompt-free data annotator. Specifically, we first incorporate a pretrained Vision Transformer (ViT) to extract latent features. On top of the ViT, we propose a semantic class adapter (SCA) and an optimization-oriented unrolling algorithm (OptOU), both with a small number of trainable parameters. SCA fuses the ViT-extracted features to consolidate the basis for the subsequent automatic annotation. OptOU consists of multiple cascading layers, each containing an optimization formulation that aligns its output with the ground truth as closely as possible, which makes OptOU interpretable rather than a learning-based black box. In addition, training SCA and OptOU requires only a single pre-annotated RGB seed image, owing to their small number of learnable parameters. Extensive experiments demonstrate that the proposed LAM can generate high-fidelity annotations (almost 100% in mIoU) for multiple real-world datasets (i.e., CamVid, Cityscapes, and Apolloscape) and a CARLA simulation dataset.
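For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed for a single label map. It is an illustration, not the authors' evaluation code: the function name `mean_iou` is hypothetical, labels are assumed to be flattened integer class IDs, and classes absent from both prediction and ground truth are skipped. Benchmark protocols often differ, for example by accumulating counts over a whole dataset or excluding an "ignore" label.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over classes that occur in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels labeled c in both maps (intersection) or in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    prediction = [0, 0, 1, 1, 2, 2]
    ground_truth = [0, 0, 1, 2, 2, 2]
    print(mean_iou(prediction, ground_truth, num_classes=3))
```

In this toy example the per-class IoUs are 1, 1/2, and 2/3, so the mean is 13/18; a perfect annotation gives 1.0, which corresponds to the "almost 100%" regime reported in the abstract.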

Paper Structure

This paper contains 19 sections, 15 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of LAM's annotation against SAM's segmentation. (a) Raw image. (b) Illustration of SAM's weaknesses, such as being class-agnostic (for example, trees are assigned different semantic IDs (colors)) and producing coarser annotations (for instance, tree leaves are not annotated well). (c) LAM's annotation overcomes these weaknesses.
  • Figure 2: Overview of the proposed LAM.
  • Figure 3: Illustration of the proposed OptOU.
  • Figure 4: Convergence comparison of considered methods.
  • Figure 5: Image & Label