VBR: A Vision Benchmark in Rome

Leonardo Brizi; Emanuele Giacomini; Luca Di Giammarino; Simone Ferrari; Omar Salem; Lorenzo De Rebotti; Giorgio Grisetti

TL;DR

VBR introduces a multi-sensor vision benchmark recorded in Rome and tailored for SLAM and odometry, providing six synchronized sequences acquired with handheld and car-mounted platforms. Ground truth is generated through a LiDAR Bundle Adjustment approach that fuses RTK-GPS priors with LiDAR odometry, achieving about $\pm 3\ \mathrm{cm}$ accuracy over long trajectories, and is validated with a Total Station. The dataset spans urban, garden, indoor, and highway-like scenes, totaling roughly $40\ \mathrm{km}$ of trajectories and $2\ \mathrm{TB}$ of raw data, with training/testing splits and a public evaluation server. Baseline experiments with KISS-ICP, F-LOAM, and ORB-SLAM3 illustrate the strengths of LiDAR-based methods and highlight the challenges of achieving precise global localization in diverse environments. This resource enables fair benchmarking across robotic platforms (quadrupeds, quadrotors, autonomous vehicles) and supports future work in semantics and dense perception alongside odometry and SLAM evaluation.

Abstract

This paper presents a vision and perception research dataset collected in Rome, featuring RGB data, 3D point clouds, IMU, and GPS data. We introduce a new benchmark targeting visual odometry and SLAM, to advance research in autonomous robotics and computer vision. This work complements existing datasets by simultaneously addressing several issues, such as environment diversity, motion patterns, and sensor frequency. It uses up-to-date devices and presents effective procedures to accurately calibrate the intrinsics and extrinsics of the sensors while addressing temporal synchronization. The recordings cover multi-floor buildings, gardens, urban and highway scenarios. By combining handheld and car-based data collection, our setup can simulate any robot (quadrupeds, quadrotors, autonomous vehicles). The dataset includes an accurate 6-DoF ground truth based on a novel methodology that refines the RTK-GPS estimate with LiDAR point clouds through Bundle Adjustment. All sequences, divided into training and testing sets, are accessible through our website.

Paper Structure (17 sections, 4 equations, 7 figures, 3 tables)

Introduction
Related Work
The Datasets
Sensors setup
Calibration
Synchronization
Ground truth generation
Data selection
Spagna
Colosseum
Pincio
DIAG
Campus
Ciampino
Benchmark
...and 2 more sections

Figures (7)

Figure 1: A summary of our dataset. Sample data from some of the recorded sequences (top). 3D map built with our ground truth (bottom).
Figure 2: Comparison between LiDAR clouds attached to the ground truth trajectories of KITTI (top) and ours (bottom). The zoomed inset shows the elevation view.
Figure 3: Projection of the KITTI LiDAR point cloud onto the image plane (top) and projection of our LiDAR point cloud onto the image plane (bottom). The many holes in the top image, caused by the uneven distribution of the LiDAR beams and by calibration issues, make the KITTI LiDAR image unusable for computer vision tasks.
Figure 4: Sensor setup and reference frames. Our ground truth is expressed in the LiDAR reference frame $\mathrm{RF_{L}}$. More details can be found on our website and in the supplementary materials.
Figure 5: Counts of the 20 most frequent semantic instances for the Ciampino (top) and Colosseum (bottom) sequences. The instances were counted using OneFormer [jain2023oneformer] over a subset of images for each sequence, excluding the most predominant classes: sky, wall, road, grass, sidewalk, ground.
...and 2 more figures
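
Because the ground truth is expressed in the LiDAR reference frame $\mathrm{RF_{L}}$ (Figure 4), comparing it against an estimate produced in another sensor's frame requires composing each pose with the corresponding extrinsic calibration. The following is a minimal sketch of that composition, assuming poses are stored as 4x4 homogeneous matrices. The helper names (`make_pose`, `lidar_to_sensor_trajectory`), the extrinsic values, and the example poses are hypothetical and are not taken from the dataset or its tooling.

```python
import numpy as np


def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation.

    Hypothetical helper used only to create example poses.
    """
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0],
                          [s, c, 0.0],
                          [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T


def lidar_to_sensor_trajectory(world_T_lidar_list, lidar_T_sensor):
    """Re-express a LiDAR-frame trajectory as the trajectory of another sensor.

    Each output pose is world_T_sensor = world_T_lidar @ lidar_T_sensor,
    where lidar_T_sensor is the pose of the sensor in the LiDAR frame.
    """
    return [world_T_lidar @ lidar_T_sensor for world_T_lidar in world_T_lidar_list]


# Assumed extrinsic: sensor mounted 0.1 m ahead of the LiDAR, same orientation.
lidar_T_sensor = make_pose(0.0, [0.1, 0.0, 0.0])

# Assumed ground-truth poses in the LiDAR frame (illustrative values only).
world_T_lidar_list = [
    make_pose(0.0, [0.0, 0.0, 0.0]),
    make_pose(np.pi / 2, [1.0, 0.0, 0.0]),
]

world_T_sensor_list = lidar_to_sensor_trajectory(world_T_lidar_list, lidar_T_sensor)
```

The extrinsic is applied on the right because it is defined relative to the LiDAR frame. After the 90 degree yaw in the second pose, the 0.1 m forward offset therefore appears along the world y axis rather than the world x axis.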
