Towards Intrinsic-Aware Monocular 3D Object Detection

Zhihao Zhang, Abhinav Kumar, Xiaoming Liu

Abstract

Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsics govern how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision-language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics. This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras. Extensive experiments show that MonoIA achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).

Paper Structure

This paper contains 30 sections, 5 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: MonoIA enables intrinsic awareness in Mono3D on KITTI Val. Existing Mono3D detectors [zhang2023monodetr, pu2024monodgp, zhang2025unleashing] lack intrinsic awareness and thus generalize poorly to images with unseen intrinsics. In contrast, our intrinsic-aware MonoIA achieves superior performance under seen intrinsics and demonstrates strong generalization to unseen ones.
  • Figure 2: Impact of intrinsic variation on image appearance. Left: The two images show the same object at the same 3D position but captured with different intrinsics. As the focal length increases, the object appears larger and the FoV becomes smaller (see the projection sketch after this list). Right: Schematic illustration of how intrinsic variations affect object appearance.
  • Figure 3: Overview of MonoIA. (a) Training Stage: MonoIA is a unified intrinsic-aware detection framework built upon two designs. The Intrinsic Encoder leverages the knowledge of an LLM and CLIP to convert numeric intrinsics into semantically meaningful embeddings that capture their perceptual and geometric effects, providing a strong prior for generalization across cameras. The Intrinsic Adaptation Module bridges this semantic knowledge with visual perception through a lightweight Connector and hierarchical fusion, enabling the detector to interpret visual features in an intrinsic-aware manner and maintain consistent 3D detection under diverse camera settings. (b) Testing Stage: For each test intrinsic, we retrieve its two nearest seen intrinsics together with their embeddings, and then apply a Hybrid Interpolation Strategy that adaptively switches between nearest-neighbor selection and linear interpolation: if the intrinsic gap is $\leq 32$ px, the nearest seen embedding is reused; otherwise, the two nearest embeddings are linearly interpolated to synthesize the test intrinsic embedding (see the interpolation sketch after this list).
  • Figure 4: LLM-Guided Description Generation. Images rendered with diverse camera intrinsics are fed into an LLM, which generates concise descriptions linking each intrinsic’s perceptual and geometric effects with its numeric focal value, forming semantic intrinsic descriptions.
  • Figure 5: Cosine similarity of intrinsic embeddings under different encoding strategies. (a) Numeric-only encoding produces uniformly high similarity, showing that CLIP text embeddings of raw focal values lack discriminative structure. (b) Our Intrinsic Encoder, which integrates LLM-generated perceptual descriptions with numeric grounding, yields a smooth and ordered similarity pattern, indicating a structured and geometry-aware intrinsic space.
  • ...and 8 more figures
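
To make the Figure 2 effect concrete, here is a minimal pinhole-projection sketch; the function name and the example numbers (a 1.8 m wide car at 20 m depth) are illustrative assumptions, not values from the paper. Under a pinhole model, an object of width W at depth Z projects to w = f·W/Z pixels, so increasing the focal length f enlarges the apparent size while the FoV shrinks for a fixed sensor.

```python
# Minimal sketch of the pinhole-projection effect illustrated in Figure 2.
# An object of physical width W (meters) at depth Z (meters) projects to
# w = f * W / Z pixels, so a larger focal length f enlarges the object's
# apparent size while the field of view narrows (for a fixed sensor size).
def projected_width_px(focal_px: float, width_m: float, depth_m: float) -> float:
    """Projected image width (pixels) of an object under a pinhole camera."""
    return focal_px * width_m / depth_m

# Hypothetical example: the same 1.8 m wide car, 20 m away, under two focals.
print(projected_width_px(720.0, 1.8, 20.0))   # 64.8 px
print(projected_width_px(1440.0, 1.8, 20.0))  # 129.6 px -- appears 2x larger
```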
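
The Hybrid Interpolation Strategy of Figure 3(b) is simple enough to sketch directly. The snippet below is a minimal reconstruction under stated assumptions: intrinsics are compared by focal length in pixels, embeddings are NumPy vectors, and the function and parameter names (`synthesize_intrinsic_embedding`, `gap_px`) are hypothetical.

```python
import numpy as np

def synthesize_intrinsic_embedding(test_focal, seen_focals, seen_embeddings,
                                   gap_px=32.0):
    """Sketch of the Hybrid Interpolation Strategy (Figure 3b).

    Retrieves the two seen intrinsics nearest to the test focal length.
    If the nearest one lies within `gap_px` pixels, its embedding is reused
    (nearest-neighbor selection); otherwise the two nearest embeddings are
    linearly interpolated along the focal axis.
    """
    seen_focals = np.asarray(seen_focals, dtype=float)
    order = np.argsort(np.abs(seen_focals - test_focal))
    i, j = order[0], order[1]
    f_i, f_j = seen_focals[i], seen_focals[j]

    if abs(f_i - test_focal) <= gap_px:
        return seen_embeddings[i]  # small gap: reuse the nearest embedding

    # Large gap: linear interpolation, weighted by the test focal's position
    # between the two nearest seen focal lengths.
    w = (test_focal - f_i) / (f_j - f_i)
    return (1.0 - w) * seen_embeddings[i] + w * seen_embeddings[j]

# Hypothetical usage: three seen focal lengths with 4-d toy embeddings.
focals = [707.0, 1260.0, 2055.0]
embeds = [np.full(4, v) for v in (0.0, 1.0, 2.0)]
print(synthesize_intrinsic_embedding(715.0, focals, embeds))   # gap 8 px: reuse
print(synthesize_intrinsic_embedding(1000.0, focals, embeds))  # interpolated
```

Note that when the test focal falls outside the range of seen intrinsics, the interpolation weight leaves [0, 1] and the sketch extrapolates; how the paper handles that case is not specified here.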