LAM3D: Leveraging Attention for Monocular 3D Object Detection

Diana-Alexandra Sas; Leandro Di Bella; Yangxintong Lyu; Florin Oniga; Adrian Munteanu

TL;DR

LAM3D tackles monocular 3D object detection by employing a Pyramid Vision Transformer v2 backbone to extract multi-scale features, followed by dedicated 2D and 3D detection heads and depth inference via Geometry Uncertainty Projection. The approach integrates attention-driven feature extraction with a carefully staged multi-task loss, enabling end-to-end training on KITTI. Empirical results show that LAM3D surpasses prior Transformer-based baselines (e.g., DST3D) in 3D and BEV metrics, with notable improvements for small or occluded objects. The work demonstrates that Vision Transformer backbones can serve as effective feature extractors for ill-posed monocular 3D perception, with practical implications for autonomous driving systems.
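
The geometric step behind projection-based depth inference can be illustrated with the pinhole relation between an object's physical height and its height in the image. The sketch below is illustrative only and is not taken from the LAM3D code: the function name `project_depth`, its parameters, and the numeric values are hypothetical. It also assumes the height uncertainty is a single standard deviation that scales linearly with the projection.

```python
def project_depth(focal_px, height_3d_m, height_2d_px, height_3d_std_m=0.0):
    """Estimate depth from the pinhole relation depth = f * H_3d / h_2d.

    All names here are hypothetical. The standard deviation of the
    3D height is scaled by the same factor f / h_2d, because multiplying
    a random variable by a constant multiplies its spread by that constant.
    """
    if height_2d_px <= 0:
        raise ValueError("height_2d_px must be positive")
    scale = focal_px / height_2d_px
    depth_m = scale * height_3d_m
    depth_std_m = scale * height_3d_std_m
    return depth_m, depth_std_m


if __name__ == "__main__":
    # Assumed example values: 700 px focal length, a 1.5 m tall object
    # that spans 50 px in the image, with 0.1 m height uncertainty.
    depth, std = project_depth(700.0, 1.5, 50.0, 0.1)
    print(f"depth = {depth:.2f} m, std = {std:.2f} m")
```

Because the scale factor grows as the 2D height shrinks, small or distant objects amplify any error in the estimated 3D height, which is one reason such objects are hard for monocular methods.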

Abstract

Since the introduction of the self-attention mechanism and the adoption of the Transformer architecture for Computer Vision tasks, Vision Transformer-based architectures have gained considerable popularity in the field and are now used for tasks such as image classification, object detection and image segmentation. However, efficiently leveraging the attention mechanism in vision transformers for the Monocular 3D Object Detection task remains an open question. In this paper, we present LAM3D, a framework that Leverages the self-Attention mechanism for Monocular 3D object Detection. The proposed method builds on a Pyramid Vision Transformer v2 (PVTv2) feature extraction backbone combined with 2D/3D detection machinery. We evaluate the proposed method on the KITTI 3D Object Detection Benchmark, demonstrating its applicability to the autonomous driving domain and outperforming reference methods. Moreover, owing to its use of self-attention, LAM3D systematically outperforms the equivalent architecture that does not employ self-attention.

Paper Structure (19 sections, 1 equation, 2 figures, 4 tables)

Introduction
Related work
Vision Transformers
Monocular 3D Object Detection
Method
Transformer Backbone
Neck
Detection Heads
Loss Functions
Experiments
Setup
Dataset
Evaluation protocol
Implementation details
Results
...and 4 more sections

Figures (2)

Figure 1: Architecture overview. LAM3D takes as input a single image, extracts feature maps $\{F_1, F_2, F_3, F_4\}$ of different scales with a Transformer-based backbone and aggregates them using DLA into a single feature map $F$ further used to infer the 2D and 3D bounding box parameters.
Figure 2: Qualitative results for truncated and occluded objects. The ground truth is represented by green bounding boxes and the pink bounding boxes represent the predictions of different methods.
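
To make the multi-scale feature maps $\{F_1, F_2, F_3, F_4\}$ in Figure 1 more concrete, the sketch below computes the spatial size of each map for a given input resolution. It rests on two assumptions that the text above does not state: that the backbone uses the strides 4, 8, 16 and 32 typical of pyramid-style Transformer backbones, and that the input is 384 x 1280 pixels, a hypothetical value chosen for illustration. The function name `pyramid_shapes` is likewise hypothetical.

```python
# Assumed pyramid strides; the source does not list them explicitly.
PYRAMID_STRIDES = (4, 8, 16, 32)


def pyramid_shapes(height, width, strides=PYRAMID_STRIDES):
    """Return the (height, width) of each feature map for the given strides.

    The input size is required to be divisible by every stride so that the
    result does not depend on a particular padding or rounding convention.
    """
    shapes = []
    for stride in strides:
        if height % stride or width % stride:
            raise ValueError(f"input size must be divisible by stride {stride}")
        shapes.append((height // stride, width // stride))
    return shapes


if __name__ == "__main__":
    # Hypothetical input resolution used only for illustration.
    for name, shape in zip(("F1", "F2", "F3", "F4"), pyramid_shapes(384, 1280)):
        print(name, shape)
```

Under these assumptions the coarsest map is eight times smaller per side than the finest one, so an aggregation stage has to upsample the coarser maps before they can be merged into a single feature map $F$.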
