
VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection

Jumin Lee, Siyeong Lee, Namil Kim, Sung-Eui Yoon

Abstract

Long-tail distributions in driving datasets pose a fundamental challenge for 3D perception, as rare classes exhibit substantial intra-class diversity yet available samples cover this variation space only sparsely. Existing instance augmentation methods based on copy-paste or asset libraries improve rare-class exposure but are often limited in fine-grained diversity and scene-context placement. We propose VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB--LiDAR instances using off-the-shelf foundation models and curates them with sequential semantic and geometric verification. This verification-centric design tends to select instances that better match real LiDAR statistics while spanning a wider range of intra-class variation. Stage-wise yield decomposition provides a log-based diagnostic of pipeline reliability. On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. Our code is available at https://sgvr.kaist.ac.kr/VERIA/.

Paper Structure

This paper contains 15 sections, 2 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Motivation for VERIA. (a) Driving datasets exhibit long-tail distributions, limiting 3D perception performance. (b) LiDAR point returns grow sparser with range, amplifying intra-class geometric variation. (c) Existing methods operate in the LiDAR domain and place objects without scene context, constraining diversity to curated asset libraries. (d) VERIA synthesizes objects conditioned on RGB context using foundation models, supporting subclass-level diversity with synchronized pseudo-LiDAR.
  • Figure 2: Overview of VERIA. (a) Given a target category $\mathcal{C}$, a VLM generates a subclass-level description $\mathcal{T}_c$ and physical size priors; a 3D bounding box is sampled and projected to define the inpainting region for RGB-context-conditioned synthesis. Semantic verification retains candidates that pass category correctness, scene-level plausibility, and artifact severity checks. (b) Verified RGB instances are converted to synchronized pseudo-LiDAR via segmentation, depth estimation, and spherical projection. Geometric verification further filters implausible reconstructions, yielding verified RGB--LiDAR pairs for downstream training.
  • Figure 3: Qualitative augmentations on nuScenes and Lyft with paired RGB and LiDAR. For each dataset, we show an original scene (a) and the corresponding augmented scene (b), alongside the individually synthesized instances. On nuScenes, we augment construction vehicle (I1), motorcycle (I2), and bicycle (I3); on Lyft, we augment bicycle (I1) and motorcycle (I2). Each instance is composited using a collision-aware strategy; in RGB, instances are layered in depth order, while in LiDAR, occluded background points are removed from the sensor origin. Red 3D bounding boxes indicate the augmented objects, which are placed at plausible locations within the scene. Both datasets show decreasing point density with range, consistent with real LiDAR characteristics, though Lyft's 64-beam sensor yields comparatively denser returns at distance than nuScenes' 32-beam configuration.
  • Figure 4: Yield versus geometric tolerance $\lambda$ on nuScenes and Lyft. Yield increases monotonically with $\lambda$ as relaxing the consistency check admits a larger fraction of candidates.
  • Figure 5: Qualitative pseudo-LiDAR comparison. VERIA against PGT-Aug and Text3DAug using MoGe2 (first row) and UniDepth2 (second row). All visualized instances contain at least 64 points. Despite relying on depth-based reconstruction, VERIA produces beam-pattern-consistent pseudo-LiDAR comparable to mesh-based outputs on nuScenes (32-beam) and Lyft (64-beam). Mesh-based instances can appear visually cleaner than real scans, as they often lack sensor noise and irregular returns; Tab. \ref{tab:quality} provides complementary quantitative evaluation.
  • ...and 2 more figures
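
Figure 2(b) describes converting verified RGB instances to pseudo-LiDAR via depth estimation and spherical projection, and Figures 3 and 5 emphasize that the result should follow the sensor's beam pattern (32 beams on nuScenes, 64 on Lyft). The sketch below illustrates one plausible form of that projection step: back-project a per-instance depth map to 3D and keep only points whose elevation lands near one of a fixed set of beam angles. This is an illustrative reconstruction, not the paper's implementation; the function name, the uniform beam layout, and the `elev_range` default are assumptions for the example.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy,
                          num_beams=32, elev_range=(-30.0, 10.0)):
    """Back-project a depth map to 3D camera-frame points, then keep only
    points whose elevation angle snaps onto one of `num_beams` discrete
    beam angles, approximating a rotating LiDAR's vertical scan pattern.

    depth      : (H, W) array of metric depths; 0 marks invalid pixels.
    fx, fy     : focal lengths in pixels; cx, cy: principal point.
    elev_range : (min, max) beam elevations in degrees (sensor-dependent).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx          # right
    y = (v[valid] - cy) * z / fy          # down (image convention)
    pts = np.stack([x, y, z], axis=1)

    # Elevation above the horizontal plane (y points down, hence the minus).
    r = np.linalg.norm(pts, axis=1)
    elev = np.degrees(np.arcsin(-pts[:, 1] / r))

    # Snap each point to its nearest beam; drop points that fall between
    # beams, which produces the range-dependent sparsity seen in real scans.
    lo, hi = elev_range
    beams = np.linspace(lo, hi, num_beams)
    nearest = np.argmin(np.abs(elev[:, None] - beams[None, :]), axis=1)
    spacing = (hi - lo) / (num_beams - 1)
    keep = np.abs(elev - beams[nearest]) < spacing / 2.0
    return pts[keep]
```

Because beam spacing is fixed in angle, an instance placed farther away subtends fewer beams and returns fewer points, which is the behavior Figure 3 notes as "decreasing point density with range."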