NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects
Musawar Ali, Manuel Carranza-García, Nicola Fioraio, Samuele Salti, Luigi Di Stefano
TL;DR
This work introduces NVS-HO, the first RGB-only benchmark for novel view synthesis (NVS) of handheld objects in real-world settings. It provides two complementary sequences per object, a handheld sequence (HS) for training and a board sequence (BS) for ground-truth evaluation, along with a Sim(3)-based pose-alignment protocol and segmentation masks to enable fair foreground/background evaluation. Baselines include COLMAP SfM and VGGT for pose estimation, plus the NeRF-based Nerfacto and Gaussian Splatting for rendering. The results show that COLMAP generally outperforms VGGT and that Gaussian Splatting surpasses Nerfacto, although median PSNR remains low at around 16 dB and pose refinement yields only modest improvements. With a dataset of 67 objects and a robust evaluation protocol, NVS-HO provides a practical platform for pushing RGB-based NVS methods toward real-world handheld scenarios and for spurring work on robust pose estimation and appearance modeling under occlusion.
Abstract
We propose NVS-HO, the first benchmark designed for novel view synthesis (NVS) of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn an NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.
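Because poses estimated from the handheld sequence and poses measured on the ChArUco board live in different coordinate frames and at different scales, evaluation needs a similarity (Sim(3)) alignment between the two. The details of the benchmark's protocol are not given in this summary, so the following is only a minimal sketch of one common approach: the closed-form Umeyama least-squares fit applied to corresponding 3D points (for example, camera centers expressed in the object frame). The function names `umeyama_sim3` and `apply_sim3` and the choice of camera centers as correspondences are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points. Hypothetical helper;
    the benchmark's actual alignment procedure may differ.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)

    # Center both point sets.
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt

    # Scale from the singular values and the variance of the source points.
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src

    # Translation that maps the scaled, rotated source centroid onto the target centroid.
    t = mu_dst - s * (R @ mu_src)
    return s, R, t


def apply_sim3(s, R, t, pts):
    """Apply the similarity transform to an (N, 3) array of points."""
    return s * (np.asarray(pts, dtype=float) @ R.T) + t
```

In such a setup, the estimated transform would be applied to the evaluation poses before rendering, so that image metrics such as PSNR are computed on views that correspond to the ground-truth board images.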
