Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection

Yifan Wang, Xiaochen Yang, Fanqi Pu, Qingmin Liao, Wenming Yang

TL;DR

MonoASRH introduces a hybrid CNN–Transformer framework for monocular 3D detection that explicitly addresses cross-scale context and object scale variation. The Efficient Hybrid Feature Aggregation Module (EH-FAM) provides global semantic awareness with lightweight cross-scale fusion, while the Adaptive Scale-Aware 3D Regression Head (ASRH) uses 2D scale cues to generate dynamic receptive-field offsets for scale-aware 3D localization. A scale-aware fusion mechanism and a Selective Confidence-Guided Heatmap Loss further stabilize training and emphasize high-confidence detections. Balancing efficiency and accuracy, MonoASRH achieves state-of-the-art results on the KITTI and Waymo benchmarks and is particularly strong on small or distant objects. It remains limited on truncated objects, which motivates future uncertainty-aware and multi-view extensions.

Abstract

Monocular 3D object detection has attracted great attention due to its simplicity and low cost. Existing methods typically follow conventional 2D detection paradigms, first locating object centers and then predicting 3D attributes via neighboring features. However, these methods predominantly rely on progressive cross-scale feature aggregation and focus solely on local information, which may result in a lack of global awareness and the omission of small-scale objects. In addition, due to the large variation in object scales across different scenes and depths, inaccurate receptive fields often lead to background noise and degraded feature representation. To address these issues, we introduce MonoASRH, a novel monocular 3D detection framework composed of an Efficient Hybrid Feature Aggregation Module (EH-FAM) and an Adaptive Scale-Aware 3D Regression Head (ASRH). Specifically, EH-FAM employs multi-head attention with a global receptive field to extract semantic features for small-scale objects and leverages lightweight convolutional modules to efficiently aggregate visual features across different scales. The ASRH encodes 2D bounding box dimensions and then fuses scale features with the semantic features aggregated by EH-FAM through a scale-semantic feature fusion module. The scale-semantic feature fusion module guides ASRH in learning dynamic receptive field offsets, incorporating scale priors into 3D position prediction for better scale-awareness. Extensive experiments on the KITTI and Waymo datasets demonstrate that MonoASRH achieves state-of-the-art performance.
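
To make the scale variation mentioned in the abstract concrete, the following minimal sketch applies the standard pinhole camera relation (image height = focal length × real height / depth). It is only an illustration of why a fixed receptive field fits near and far objects poorly, not part of MonoASRH itself; the function name, the focal length, and the car height are all assumed values chosen for the example.

```python
def projected_height_px(real_height_m: float, depth_m: float, focal_px: float) -> float:
    """Pinhole model: image height in pixels of an object at a given depth."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * real_height_m / depth_m


FOCAL_PX = 720.0     # assumed focal length in pixels (illustrative only)
CAR_HEIGHT_M = 1.5   # assumed typical car height in meters

# Doubling the depth halves the projected size.
for depth in (5.0, 10.0, 20.0, 40.0, 80.0):
    height = projected_height_px(CAR_HEIGHT_M, depth, FOCAL_PX)
    print(f"depth {depth:5.1f} m -> {height:6.1f} px")
```

Under these assumed values the same car spans 216 px at 5 m but only 13.5 px at 80 m, a 16-fold change that a single fixed sampling window cannot cover well.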

Paper Structure

This paper contains 18 sections, 19 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Schematic representation of vehicle imaging size at varying distances from the camera. Cars farther from the camera tend to occupy a smaller proportion of the image, while those closer tend to occupy a larger proportion.
  • Figure 2: Visualization of attention heatmaps for different models: (a) DEVIANT [kumar2022deviant], (b) MonoLSS [li2024monolss], and (c) our proposed MonoASRH. Previous methods struggle to capture distant and occluded objects due to fixed receptive fields. In contrast, MonoASRH dynamically adjusts attention across multiple scales, improving detection of various object categories, such as cars and pedestrians.
  • Figure 3: Overview of our framework. The Efficient Hybrid Feature Aggregation Module (EH-FAM) efficiently aggregates multi-scale features. The Adaptive Scale-Aware 3D Regression Head (ASRH) fuses scale features with local semantics to guide the learning of the 3D regression head.
  • Figure 4: Efficient Hybrid Feature Aggregation Module. Self-attention is applied only to the highest-level semantic feature map using an 8-head multi-head attention mechanism. In addition, all convolutional layers use the Mish activation function for improved feature representation.
  • Figure 5: Structural diagram of the RepVGGplus block. During training, RepVGGplus employs a multi-branch convolutional architecture, which is re-parameterized into a single 3$\times$3 convolutional layer for inference.
  • ...and 7 more figures
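
The re-parameterization mentioned in the Figure 5 caption relies on convolution being linear: parallel branches can be summed into one kernel. The sketch below shows this idea in a deliberately simplified, single-channel form with a 3×3 branch, a 1×1 branch, and an identity branch. It is an assumption-laden illustration rather than the RepVGGplus block itself: practical RepVGG-style blocks also fold per-branch batch normalization into the weights and bias and operate on many channels, and the exact branch layout of RepVGGplus is not given here. All function names are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 3x3 cross-correlation, stride 1, zero padding."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = acc
    return out


def multi_branch(image, k3x3, k1x1):
    """Training-time form: 3x3 branch + 1x1 branch + identity branch."""
    y3 = conv2d_same(image, k3x3)
    return [
        [y3[i][j] + k1x1 * image[i][j] + image[i][j] for j in range(len(image[0]))]
        for i in range(len(image))
    ]


def fuse_branches(k3x3, k1x1):
    """Inference-time form: fold all three branches into one 3x3 kernel."""
    fused = [row[:] for row in k3x3]
    # The 1x1 weight and the identity (weight 1) both act on the center tap.
    fused[1][1] += k1x1 + 1
    return fused


# Small integer inputs (illustrative values) so the comparison is exact.
image = [[1, 2, 0, 3], [4, 1, 2, 1], [0, 3, 1, 2]]
k3x3 = [[1, 0, -1], [2, 1, 0], [0, -2, 1]]
k1x1 = 3

fused = fuse_branches(k3x3, k1x1)
assert conv2d_same(image, fused) == multi_branch(image, k3x3, k1x1)
```

The single fused kernel reproduces the multi-branch output exactly, which is why such a block can train with several branches yet run as one 3×3 convolution at inference.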