
You Only Look Bottom-Up for Monocular 3D Object Detection

Kaixin Xiong, Dingyuan Zhang, Dingkang Liang, Zhe Liu, Hongcheng Yang, Wondimu Dikubab, Jianwei Cheng, Xiang Bai

TL;DR

This work explores position modeling along image feature columns and proposes You Only Look Bottom-Up (YOLOBU), a method that exploits positional clues for monocular 3D detection by building relationships between pixels in a bottom-up manner.

Abstract

Monocular 3D object detection is an essential task for autonomous driving. However, accurate 3D object detection from images alone is very challenging due to the loss of depth information. Most existing image-based methods infer an object's location in 3D space from its 2D size on the image plane, which usually ignores the intrinsic position clues in images and leads to unsatisfactory performance. Motivated by the fact that humans can leverage bottom-up positional clues to locate objects in 3D space from a single image, we explore position modeling along the image feature column and propose a new method named You Only Look Bottom-Up (YOLOBU). Specifically, YOLOBU leverages Column-based Cross Attention (CCA) to determine how much a pixel contributes to the pixels above it. Next, Row-based Reverse Cumulative Sum (RRCS) is introduced to build connections between pixels in the bottom-up direction. In this way, YOLOBU fully explores the position clues for monocular 3D detection by relating pixels in a bottom-up manner. Extensive experiments on the KITTI dataset demonstrate the effectiveness and superiority of our method.
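
The bottom-up accumulation described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so the following assumes a single-channel feature map stored as a list of rows (row 0 at the top of the image) and treats the reverse cumulative sum as a plain running sum taken from the bottom row upward within each column. The function name `reverse_cumsum_rows` and the optional `weights` argument (a stand-in for per-pixel contribution scores such as an attention module might produce) are hypothetical and not taken from the paper.

```python
def reverse_cumsum_rows(feature, weights=None):
    """Bottom-up cumulative sum over image rows, independently per column.

    feature: H x W list of lists, with row 0 being the top of the image.
    weights: optional H x W contribution weights (hypothetical stand-in for
             attention scores); every pixel counts fully when omitted.
    """
    height = len(feature)
    width = len(feature[0]) if height else 0
    out = [[0.0] * width for _ in range(height)]
    for col in range(width):
        running = 0.0
        # Walk from the bottom row up, so each pixel aggregates itself
        # and every pixel below it in the same column.
        for row in range(height - 1, -1, -1):
            w = 1.0 if weights is None else weights[row][col]
            running += w * feature[row][col]
            out[row][col] = running
    return out


# Toy 3 x 2 feature map: the last row is the bottom of the image.
feature = [
    [1.0, 0.0],
    [2.0, 1.0],
    [3.0, 5.0],
]
accumulated = reverse_cumsum_rows(feature)
```

In this sketch the top row ends up holding the sum of its whole column, while the bottom row is unchanged, so information flows only upward. Passing a `weights` map scales how much each pixel contributes to the pixels above it.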

Paper Structure (19 sections, 7 equations, 3 figures, 7 tables)

This paper contains 19 sections, 7 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: For two cars with different dimensions but the same depth and appearance, the corresponding 2D bounding boxes on the image have different sizes. (a) For monocular 3D detectors, ambiguity occurs when only 2D size information is used; (b) For human perception from monocular images, position information removes the ambiguity.
  • Figure 2: (a) The pipeline of the proposed YOLOBU, which consists of a backbone network, the proposed Column-based Cross Attention (CCA) and Row-based Reverse Cumulative Sum (RRCS), and head branches. (b) The structure of the proposed Column-based Cross Attention (CCA) is illustrated to the left of the dotted line; the adjacency matrix is shown to the right, where gray grids denote no connection between the corresponding nodes. (c) The structure of the proposed Row-based Reverse Cumulative Sum (RRCS).
  • Figure 3: Visualizations on the KITTI val set. (a) Baseline (MonoDLE; Ma et al., 2021); (b) The proposed YOLOBU; (c) Failure case from YOLOBU. Red and green bounding boxes indicate predictions and ground truth, respectively. The LiDAR point clouds are used only for visualization.
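
The ambiguity described in the Figure 1 caption can be made concrete with a small pinhole-camera calculation. This is only a sketch under stated assumptions: the focal length, camera height, principal point, and object heights below are illustrative values, not numbers from the paper, and the flat-road relation used for the position clue is a standard geometric simplification rather than the paper's method.

```python
# Illustrative camera parameters (assumptions, not taken from the paper).
FOCAL_PX = 700.0         # focal length in pixels
CAMERA_HEIGHT_M = 1.65   # camera height above an assumed flat road
PRINCIPAL_Y = 180.0      # image row of the principal point


def projected_height(object_height_m, depth_m):
    """Pinhole projection: pixel height of an object at a given depth."""
    return FOCAL_PX * object_height_m / depth_m


def depth_from_size(box_height_px, assumed_height_m):
    """Depth inferred from 2D box height and an assumed physical height."""
    return FOCAL_PX * assumed_height_m / box_height_px


def bottom_row(depth_m):
    """Image row where an object touches a flat road at the given depth."""
    return PRINCIPAL_Y + FOCAL_PX * CAMERA_HEIGHT_M / depth_m


def depth_from_bottom_row(row_px):
    """Depth inferred from the ground-contact row (flat-road assumption)."""
    return FOCAL_PX * CAMERA_HEIGHT_M / (row_px - PRINCIPAL_Y)


# Two cars at the same depth but with different physical heights.
depth = 20.0
small_box = projected_height(1.40, depth)
large_box = projected_height(1.75, depth)

# A size-only estimate with one assumed height gives two different depths.
size_depths = (depth_from_size(small_box, 1.5), depth_from_size(large_box, 1.5))

# Both cars touch the road at the same image row, so the position clue agrees.
position_depth = depth_from_bottom_row(bottom_row(depth))
```

Under these assumptions the size-based estimates disagree with each other even though both cars sit at the same depth, while the ground-contact row yields a single depth for both. The flat-road relation breaks down on slopes or for objects that do not touch the ground.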