VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection

Chupeng Liu, Jiyong Rao, Shangquan Sun, Runkai Zhao, Weidong Cai

Abstract

Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individual objects across scenes, limiting the model's ability to learn scene-aware representations. To address this challenge, we propose Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance-conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). We then introduce Multi-Gaussian Prompt Modeling (MGPM), which incorporates scene-based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainties. From the fused vision-language embeddings, we decode a prompt-targeted Gaussian, from which we derive a unified object-level prompt embedding for each instance. RoI-level contrastive matching is employed to enforce modality alignment, bringing embeddings of co-occurring objects within the same scene closer in the latent space and thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving an improvement of up to 4.8% in average precision over the baseline.
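
The sampling, pooling, and matching steps described in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the diagonal Gaussian parameterized by a mean and log-variance, the one-directional InfoNCE-style loss (used here as a generic stand-in for the RoI-level contrastive matching), the temperature value, and all function names (`sample_prompts`, `max_pool`, `info_nce`) are assumptions made for illustration.

```python
import math
import random


def sample_prompts(mu, log_var, num_samples, rng):
    """Draw prompt embeddings from a diagonal Gaussian (assumed form)
    using the reparameterization mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        for _ in range(num_samples)
    ]


def max_pool(samples):
    """Element-wise max over sampled prompts, giving one object-level embedding."""
    return [max(column) for column in zip(*samples)]


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(visual, text, temperature=0.07):
    """Generic InfoNCE-style loss (a stand-in, not the paper's exact loss):
    visual[i] should match text[i]; all other text rows act as negatives."""
    v = [l2_normalize(row) for row in visual]
    t = [l2_normalize(row) for row in text]
    total = 0.0
    for i, v_i in enumerate(v):
        logits = [sum(a * b for a, b in zip(v_i, t_j)) / temperature for t_j in t]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(v)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two toy objects, each with a hypothetical decoded Gaussian in 4 dimensions.
    gaussians = [
        ([1.0, 0.0, 0.0, 0.0], [-2.0] * 4),
        ([0.0, 1.0, 0.0, 0.0], [-2.0] * 4),
    ]
    text_embeddings = [
        max_pool(sample_prompts(mu, log_var, num_samples=8, rng=rng))
        for mu, log_var in gaussians
    ]
    # Hypothetical RoI visual embeddings for the same two objects.
    roi_embeddings = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]
    print("contrastive loss:", info_nce(roi_embeddings, text_embeddings))
```

Sampling several prompts per object and pooling them is what lets a single text embedding reflect the spread of the Gaussian rather than only its mean; how the paper parameterizes and decodes that Gaussian is not specified in the abstract.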

Paper Structure (29 sections, 25 equations, 9 figures, 13 tables)


Figures (9)

  • Figure 1: Comparison of weak supervision labels for monocular 3D detection. To mitigate label scarcity, we propose an adaptive multi-modal pretraining paradigm that leverages visually-referred probabilistic prompts as auxiliary labels and can be seamlessly integrated into existing WS-M3D pipelines.
  • Figure 2: Overview of the VirPro paradigm. We propose an adaptive pretraining paradigm that generates scene-aware probabilistic prompts enriched with visual context, which can be seamlessly integrated into diverse WS-M3D frameworks. An Adaptive Prompt Bank includes diverse learnable prompts for each object, while Multi-Gaussian Prompt Modeling injects scene-specific visual features into textual embeddings and encodes prompts as a multivariate Gaussian distribution. The sampled probabilistic prompts are max-pooled for RoI-level contrastive learning to align semantics across modalities.
  • Figure 3: Qualitative results on the KITTI validation set comparing our method with the WeakM3D baseline. WeakM3D [peng_weakm3d_2022] is a WS-M3D method that relies solely on 3D pseudo-labels. Predicted boxes are rendered in green, and ground-truth boxes are shown in red.
  • Figure 4: Qualitative results on the KITTI validation set comparing our method with the GGA+PGD baseline. GGA [zhang2024geometryaware] + PGD [wang2022probabilistic] is a WS-M3D baseline that employs both 3D pseudo-labels and static textual prompts. Predicted boxes are rendered in green, and ground-truth boxes are shown in red.
  • Figure 5: Comparison of inter-scene centroid distances in latent space between CAW3D and our proposed VirPro. We extract RoI visual embeddings for the "Car" category from 15 scenes randomly chosen from the KITTI val set. We then compute the centroid of each scene's embedding distribution and calculate the pairwise distances between scene centroids.
  • ...and 4 more figures
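
The centroid analysis described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the Euclidean metric, the dictionary layout of per-scene embeddings, the toy values, and the function names (`centroid`, `pairwise_centroid_distances`) are all assumptions, since the caption does not state which distance is used.

```python
import math
from itertools import combinations


def centroid(embeddings):
    """Mean embedding of one scene (element-wise average over its RoIs)."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def pairwise_centroid_distances(scenes):
    """Map each unordered scene pair to the Euclidean distance (assumed
    metric) between the two scene centroids."""
    centroids = {scene_id: centroid(embs) for scene_id, embs in scenes.items()}
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }


if __name__ == "__main__":
    # Toy "Car" RoI embeddings for three hypothetical scenes (2-D for readability).
    scenes = {
        "scene_a": [[0.0, 0.0], [2.0, 0.0]],
        "scene_b": [[1.0, 4.0], [1.0, 4.0]],
        "scene_c": [[4.0, 0.0]],
    }
    for pair, distance in pairwise_centroid_distances(scenes).items():
        print(pair, round(distance, 3))
```

With 15 scenes this yields 105 unordered pairs; larger distances indicate that embeddings of the same category are more separated from scene to scene.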