SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects

Abhinav Kumar; Yuliang Guo; Xinyu Huang; Liu Ren; Xiaoming Liu

TL;DR

This work investigates why monocular 3D detectors generalize poorly to large objects and attributes the gap to the noise sensitivity of depth regression losses. It compares the $L_1$, $L_2$, and $L_{dice}$ losses theoretically and proves, under a simplified model, that the dice loss offers superior noise-robustness and convergence for large objects. Building on these insights, SeaBird trains a foreground BEV segmentation head with the dice loss and then, in a two-stage sequential pipeline, feeds the refined BEV features into a Mono3D head to improve large-object detection. Empirically, SeaBird achieves state-of-the-art results on KITTI-360 and consistently improves nuScenes detectors, particularly for large objects, which matters for safer autonomous driving systems. Together, the theoretical analysis and the empirical gains make SeaBird a principled step toward robust monocular 3D perception of large objects.

Abstract

Monocular 3D detectors achieve remarkable performance on cars and smaller objects. However, their performance drops on larger objects, leading to fatal accidents. Some attribute the failures to training data scarcity or to the receptive field requirements of large objects. In this paper, we highlight this understudied problem of generalization to large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to the noise of larger objects. To bridge this gap, we comprehensively investigate regression and dice losses, examining their robustness under varying error levels and object sizes. We mathematically prove that, for a simplified case, the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird's View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection, with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI-360 leaderboard and improves existing detectors on the nuScenes leaderboard, particularly for large objects. Code and models are available at https://github.com/abhi1kumar/SeaBird
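
To make the dice loss mentioned in the abstract concrete, the following is a minimal NumPy sketch of a soft dice loss on a toy BEV foreground map. It is an illustration under stated assumptions, not the authors' implementation: the function name `soft_dice_loss`, the smoothing constant `eps`, and the 8x12 toy grid are all hypothetical choices made here.

```python
import numpy as np


def soft_dice_loss(pred_prob, target, eps=1e-6):
    """Soft dice loss between a foreground-probability map and a binary mask.

    Hypothetical helper for illustration only. `eps` is an assumed smoothing
    constant that keeps the ratio defined when both maps are empty.
    """
    pred_prob = np.asarray(pred_prob, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(pred_prob * target)
    denom = np.sum(pred_prob) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


# Toy 8x12 BEV grid: the GT footprint of a "large object" covers 2x6 cells.
gt = np.zeros((8, 12))
gt[3:5, 2:8] = 1.0

# A prediction of the same footprint, shifted by 2 cells along the long axis.
pred_shifted = np.zeros((8, 12))
pred_shifted[3:5, 4:10] = 1.0

# A prediction that does not overlap the GT footprint at all.
pred_disjoint = np.zeros((8, 12))
pred_disjoint[6:8, 2:8] = 1.0

for name, pred in [("perfect", gt), ("shifted", pred_shifted), ("disjoint", pred_disjoint)]:
    print(name, round(float(soft_dice_loss(pred, gt)), 4))
```

In this toy setup the loss depends on the overlap ratio between the predicted and GT footprints and saturates once the two no longer overlap, which is the bounded behaviour that the noise-robustness argument relies on.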

Paper Structure (29 sections, 4 theorems, 35 equations, 9 figures, 16 tables)

Introduction
Related Work
SeaBird
Background and Problem Statement
Loss Analysis: Dice vs. Regression
Discussions
SeaBird Pipeline
Experiments
KITTI-360 Mono3D
Ablation Studies on KITTI-360 Val
nuScenes Mono3D
Conclusions
Additional Explanations and Proofs
Proof of Converged Value
Comparison of Loss Functions
...and 14 more sections

Key Result

Lemma 1

Convergence analysis [shalev2007pegasos]. Consider a linear regression model with trainable weight $\mathbf{w}$ for depth prediction $\hat{z}$ from an image $\mathbf{h}$. Assume the noise $\eta$ is an additive error in depth prediction and is a normal random variable $\mathcal{N}(0, \sigma^2)$. Also, ... where $\epsilon = \frac{\partial \mathcal{L}(\eta)}{\partial \eta}$ is the gradient of the loss $\mathcal{L}$ ...

Figures (9)

Figure 1: Teaser. (a) SoTA frontal detectors struggle with large objects (low AP$_{Lrg}$) even on a nearly balanced KITTI-360 dataset (skewness in \ref{fig:skew}). Our proposed SeaBird achieves significant Mono3D improvements, particularly for large objects. (b) SeaBird also improves two SoTA BEV detectors, BEVerse-S [zhang2022beverse] and HoP [zong2023hop], on the nuScenes dataset, particularly for large objects. (c) Plot of convergence variance $\text{Var}(\epsilon)$ of dice and regression losses with the noise $\sigma$ in depth prediction. The $y$-axis denotes the deviation from the optimal weight, so the lower the better. SeaBird leverages dice loss, which we prove is more noise-robust than regression losses for large objects.
Figure 2: SeaBird Pipeline. SeaBird uses the predicted BEV foreground segmentation (For. Seg.) map to predict accurate $3$D boxes for large objects. The SeaBird training protocol involves BEV segmentation pre-training with the noise-robust dice loss, followed by Mono3D fine-tuning.
Figure 3: (a) Problem setup. The single-layer neural network takes an image $\mathbf{h}$ (or its features) and predicts depth $\hat{z}$ and the object length $\ell$. The noise $\eta$ is the additive error in depth prediction and is a normal random variable. The GT depth $z$ supervises the predicted depth $\hat{z}$ with a loss $\mathcal{L}$ in training. We assume the network predicts the GT length $\ell$. Frontal detectors directly regress the depth with the $\mathcal{L}_1$, $\mathcal{L}_2$, or $\text{Smooth}~\mathcal{L}_1$ loss, while SeaBird projects to the BEV plane and supervises through the dice loss $\mathcal{L}_{dice}$. (b) Shifting of predictions in BEV along the ray due to the noise $\eta$. (c) Cross Section (CS) view along the ray with classification scores $P(Z)$.
Figure 4: Plot of convergence variance Var$(\epsilon)$ of loss functions with the noise $\sigma$. Dice loss has minimum convergence variance with large noise, resulting in better detectors for large objects.
Figure 5: Lengthwise AP analysis of four SoTA detectors of \ref{tab:det_seg_results_kitti_360_val} and two SeaBird pipelines on the KITTI-360 Val split. SeaBird pipelines outperform all baselines on large objects over $10$ m in length.
...and 4 more figures
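
The convergence variance $\text{Var}(\epsilon)$ in the Figure 1(c) and Figure 4 captions can be explored numerically. The sketch below is a hedged illustration of the simplified setup in Figure 3, not the paper's code. It assumes a 1D cross section along the ray in which the prediction and the GT are segments of equal length $\ell$ offset by the noise $\eta$, so the dice loss becomes $\min(|\eta|/\ell, 1)$. It also assumes the conventions $\mathcal{L}_1 = |\eta|$ and $\mathcal{L}_2 = \frac{1}{2}\eta^2$. All function names are hypothetical.

```python
import math

import numpy as np


def grad_l1(eta):
    # d|eta|/d eta
    return np.sign(eta)


def grad_l2(eta):
    # d(0.5 * eta^2)/d eta, using the assumed 1/2 scaling
    return eta


def grad_dice(eta, length):
    # Assumed 1D model: dice loss = min(|eta| / length, 1), so the gradient
    # is sign(eta) / length while the segments overlap and 0 otherwise.
    return np.where(np.abs(eta) <= length, np.sign(eta) / length, 0.0)


def var_eps_monte_carlo(sigma, length, n=200_000, seed=0):
    """Estimate Var(eps) for each loss with eta ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    eta = rng.normal(0.0, sigma, size=n)
    return {
        "l1": float(np.var(grad_l1(eta))),
        "l2": float(np.var(grad_l2(eta))),
        "dice": float(np.var(grad_dice(eta, length))),
    }


def var_eps_closed_form(sigma, length):
    """Closed forms that follow from the gradients above under this toy model."""
    return {
        "l1": 1.0,
        "l2": sigma ** 2,
        # E[eps^2] = P(|eta| <= length) / length^2
        "dice": math.erf(length / (math.sqrt(2.0) * sigma)) / length ** 2,
    }


if __name__ == "__main__":
    length = 12.0  # assumed object length in metres, chosen for illustration
    for sigma in (0.5, 3.0, 10.0):
        print(sigma, var_eps_monte_carlo(sigma, length), var_eps_closed_form(sigma, length))
```

Under these assumptions the dice gradient is bounded by $1/\ell$, so its variance stays below $1/\ell^2$ regardless of $\sigma$, while the variance of the $\mathcal{L}_2$ gradient grows as $\sigma^2$. This matches the qualitative trend the captions describe for large objects and large noise.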

Theorems & Definitions (7)

Lemma 1
Lemma 2
Lemma 3
Theorem 1
proof
proof
proof
