NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects

Musawar Ali, Manuel Carranza-García, Nicola Fioraio, Samuele Salti, Luigi Di Stefano

TL;DR

This work introduces NVS-HO, the first RGB-only benchmark for novel view synthesis (NVS) of handheld objects in real-world settings. It provides two complementary sequences per object, a handheld sequence (HS) for training and a board sequence (BS) for ground-truth evaluation, along with a Sim(3)-based pose-alignment protocol and segmentation masks that enable fair foreground/background evaluation. Baselines include COLMAP SfM and VGGT for pose estimation, plus the NeRF-based Nerfacto and Gaussian Splatting for NVS rendering. The results show that COLMAP generally outperforms VGGT and that Gaussian Splatting surpasses NeRF, albeit with a low median PSNR of around 16 and only modest improvements from pose refinement. By offering a dataset of 67 objects with a robust evaluation protocol, NVS-HO provides a practical platform to push RGB-based NVS methods toward real-world handheld scenarios and to spur robust pose estimation and appearance modeling under occlusions.

Abstract

We propose NVS-HO, the first benchmark designed for novel view synthesis of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn an NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.
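
As an illustration of mask-based evaluation, the sketch below computes PSNR over only the pixels selected by a segmentation mask, so that foreground and background can be scored separately. It is a minimal example under stated assumptions, not the benchmark's evaluation code: the function name `masked_psnr`, the `[0, 1]` image range, and the averaging over all color channels of the masked pixels are choices made here for illustration.

```python
import numpy as np

def masked_psnr(pred, gt, mask, max_val=1.0):
    """PSNR restricted to pixels where mask is True (hypothetical helper).

    pred, gt: float images of shape (H, W, 3), assumed to lie in [0, max_val].
    mask:     boolean array of shape (H, W).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no pixels")
    # Boolean indexing over (H, W) keeps the channel axis: result is (N, 3).
    mse = ((pred - gt) ** 2)[mask].mean()
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: the error is confined to the top half of the image.
gt_img = np.zeros((4, 4, 3))
pred_img = gt_img.copy()
pred_img[:2] += 0.1
fg_mask = np.zeros((4, 4), dtype=bool)
fg_mask[:2] = True

fg_psnr = masked_psnr(pred_img, gt_img, fg_mask)   # scored on masked pixels
bg_psnr = masked_psnr(pred_img, gt_img, ~fg_mask)  # scored on the complement
print(fg_psnr, bg_psnr)
```

Scoring the mask and its complement separately keeps a large, easy background from hiding errors on a small object.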

Paper Structure (11 sections, 5 figures)

This paper contains 11 sections and 5 figures.

Figures (5)

  • Figure 1: Two exemplar objects from our dataset. Our dataset addresses novel view synthesis from images of handheld objects. The columns show: (1) images from the HS, (2) images from the BS, (3) masked ground-truth images and (4) masked rendered images.
  • Figure 2: Data processing pipeline. For each object we record two monocular RGB sequences: an HS, where the object is manipulated by a human while the camera is static, and a BS, where the same object is fixed on a ChArUco board. BS poses are estimated with OpenCV via marker detection. Then, Grounded SAM2 [ren2024grounded] produces segmentation masks for the BS and HS, enabling foreground extraction. The HS and BS are combined to produce a combined sequence (CS). The CS, together with the fixed BS poses and the calibrated intrinsics, is then fed into COLMAP, which produces the CS poses; the HS poses are thereby estimated in the units and coordinate system of the BS.
  • Figure 3: Illustration of the pose alignment procedure. (a) Camera poses estimated from the BS using a ChArUco board, expressed in metric scale with a world coordinate system defined by the calibration board. (b) Camera poses estimated from the HS using COLMAP, with an arbitrary scale and coordinate frame. (c) Overlay of (a) and (b), illustrating the misalignment due to differing coordinate systems and scales. (d) HS poses estimated by running COLMAP on the CS, with calibrated intrinsics and BS poses injected, ensuring both sequences share a common coordinate system. This trajectory serves as a reference frame bridging (a) and (b). (e) The BS poses from (a), transformed with a Sim(3) alignment computed via the Kabsch–Umeyama algorithm [umeyama2002least], and now fully expressed in the HS coordinate frame. (f) Overlay of (b) and the aligned BS poses from (e), confirming that the BS poses are now expressed in the HS coordinate system and units.
  • Figure 4: NVS results. Top and bottom rows report metrics dealing with Foreground and Background Evaluation, respectively.
  • Figure 5: Qualitative NVS results. Red edges denote the ground-truth object masks.
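
The Sim(3) alignment named in the Figure 3 caption can be sketched with the Kabsch–Umeyama closed-form solution. The example below is a minimal illustration, assuming the alignment is estimated from corresponding 3D camera centers expressed in two frames; which trajectories the paper actually pairs is not specified in this excerpt. The function name `umeyama_sim3` and the synthetic data are assumptions made here.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity transform so that dst ~= s * R @ src + t.

    src, dst: arrays of shape (N, 3) holding corresponding points,
    e.g. camera centers of the same views in two coordinate frames.
    Returns scale s (float), rotation R (3x3), translation t (3,).
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt

    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Synthetic check: recover a known similarity transform.
rng = np.random.default_rng(0)
centers_a = rng.normal(size=(20, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
s_true, t_true = 2.5, np.array([0.3, -1.0, 4.0])
centers_b = s_true * centers_a @ R_true.T + t_true

s, R, t = umeyama_sim3(centers_a, centers_b)
aligned = s * centers_a @ R.T + t  # centers_a expressed in frame b
print(np.abs(aligned - centers_b).max())
```

For a camera-to-world pose with rotation `R_c` and center `c`, applying the estimated transform would give center `s * R @ c + t` and rotation `R @ R_c`; the scale affects only positions.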