MC-NeRF: Multi-Camera Neural Radiance Fields for Multi-Camera Image Acquisition Systems

Yu Gao, Lutong Su, Hao Liang, Yufeng Yue, Yi Yang, Mengyin Fu

TL;DR

MC-NeRF introduces a joint optimization framework for intrinsic and extrinsic camera parameters within Neural Radiance Fields in multi-camera systems, removing the assumption of a single camera and eliminating the need for initial parameter estimates. It employs an auxiliary calibration scheme with Pack1 and Pack2 images to decouple parameters via reprojection constraints and bundle adjustment, enabling accurate per-camera intrinsic/extrinsic recovery and real-world scale. The method is validated on a newly built 88-camera system with both synthetic and real-world datasets, showing competitive or superior camera parameter estimation and rendering quality compared with existing NeRF and 3D Gaussian baselines. This work lowers barriers to deploying NeRF in complex multi-camera setups and provides datasets and code to support reproducible multi-camera 3D reconstruction at real-world scale.

Abstract

Neural Radiance Fields (NeRF) use multi-view images for 3D scene representation and have demonstrated remarkable performance. Multi-camera systems, one of the primary sources of multi-view images, face challenges such as varying intrinsic parameters and frequent pose changes. Most previous NeRF-based methods assume a single camera and rarely consider multi-camera scenarios. Moreover, NeRF methods that can optimize intrinsic and extrinsic parameters remain susceptible to suboptimal solutions when these parameters are poorly initialized. In this paper, we propose MC-NeRF, a method that jointly optimizes intrinsic and extrinsic parameters alongside NeRF and allows each image to have its own independent camera parameters. First, we tackle the coupling issue and the degenerate case that arise from the joint optimization of intrinsic and extrinsic parameters. Second, based on the proposed solutions, we introduce an efficient calibration image acquisition scheme for multi-camera systems, including the design of the calibration object. Finally, we present an end-to-end network with a training sequence that enables the estimation of intrinsic and extrinsic parameters together with the rendering network. Furthermore, recognizing that most existing datasets are designed for a single camera, we construct a real multi-camera image acquisition system and create a corresponding new dataset that includes both simulated data and real-world captured images. Experiments confirm the effectiveness of our method when each image corresponds to different camera parameters. Specifically, we use multiple cameras, each with different intrinsic and extrinsic parameters in a real-world system, to achieve 3D scene representation without providing initial poses.

Paper Structure

This paper contains 25 sections, 21 equations, 19 figures, and 11 tables.

Figures (19)

  • Figure 1: We introduce MC-NeRF, which can jointly optimize camera parameters with NeRF. Unlike other joint optimization methods, MC-NeRF breaks the assumption of a unique camera and does not require initial camera parameters.
  • Figure 2: Overview of the proposed method. MC-NeRF utilizes three sets of images as inputs. $Pack1$ is used to unify the world coordinate systems across all cameras and provide initial extrinsics. $Pack2$ imposes intrinsic constraints to address the coupling issue that arises during joint optimization. Multi-view images are scene images, which are used for reconstructing the 3D scene. $Pack1$ and $Pack2$ provide 3D coordinates in space and feature points in the images, which are utilized to establish reprojection loss functions. Then, the intrinsics and extrinsics constrained by the reprojection loss function are used to generate sampling points. These points are subsequently fed into a Multilayer Perceptron (MLP) that employs BARF-based progressive encoding for further training. To ensure both efficiency and convergence, the training sequence must follow the process outlined in Fig. 6.
  • Figure 3: Coupling issue between intrinsic and extrinsic parameters. 1) The first row illustrates the joint optimization of extrinsics and NeRF. In methods such as BARF and L2G-NeRF, the intrinsic parameters are known, which mitigates the coupling between camera parameters. 2) The second row shows the joint optimization of both intrinsics and extrinsics, as outlined in Eq. 5. This procedure effectively represents a transformation involving scaling, rotation, and translation. $u_0$ and $v_0$ from the intrinsics, along with $t_1$, $t_2$ and $t_3$ from the extrinsics, coexist within $(\tilde{T} - R^{-1}T)$ and cannot be disentangled.
  • Figure 4: Degenerate case in intrinsic parameter estimation. When only a single AprilTag is present in the calibration image, a valid solution cannot be obtained. Having at least two AprilTags in the image ensures that the camera intrinsic parameters can be recovered.
  • Figure 5: Details of calibration data acquisition. First, the cube is captured by all cameras within the shared field of view, with its center defined as the origin of the world coordinate system. We define the set of images captured during this step as $Pack1$. Second, to ensure that each camera captures at least two AprilTags, providing non-coplanar correspondences, the operator randomly moves the cube in front of each camera. Once a camera detects more than two AprilTags, the image can be saved. Each camera only needs to capture one such image. The set of images captured in this phase is defined as $Pack2$.
  • ...and 14 more figures
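
The captions for Figures 2, 4 and 5 describe a reprojection loss built from known 3D tag coordinates and detected image points, and note that a single AprilTag is not enough to recover the intrinsics. The sketch below illustrates one simple instance of that ambiguity. It is not the authors' implementation: the pinhole model with identity rotation, the names `project` and `reprojection_loss`, the tag corner coordinates, and all camera values are illustrative assumptions. It covers only the fronto-parallel case, where doubling both focal length and distance leaves the projections of one planar tag unchanged, while corners from a second, non-coplanar tag expose the wrong parameters.

```python
def project(point_w, f, u0, v0, t):
    """Pinhole projection of one world point (assumes identity rotation)."""
    x, y, z = (p + ti for p, ti in zip(point_w, t))  # world -> camera frame
    return (f * x / z + u0, f * y / z + v0)          # perspective divide + intrinsics


def reprojection_loss(points_w, pixels, f, u0, v0, t):
    """Mean squared pixel error between projected and observed points."""
    total = 0.0
    for p, (u, v) in zip(points_w, pixels):
        pu, pv = project(p, f, u0, v0, t)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(points_w)


# Hypothetical tag corners in metres: tag_a lies in the plane z = 0,
# tag_b lies in the perpendicular plane x = 0.05 (not coplanar with tag_a).
tag_a = [(-0.05, -0.05, 0.0), (0.05, -0.05, 0.0), (0.05, 0.05, 0.0), (-0.05, 0.05, 0.0)]
tag_b = [(0.05, -0.05, 0.0), (0.05, 0.05, 0.0), (0.05, 0.05, 0.1), (0.05, -0.05, 0.1)]

# Assumed ground-truth camera, plus an alternative with focal length and depth both doubled.
true_cam = dict(f=800.0, u0=320.0, v0=240.0, t=(-0.2, 0.0, 1.0))
alt_cam = dict(f=1600.0, u0=320.0, v0=240.0, t=(-0.2, 0.0, 2.0))

# Synthetic "detections" generated with the ground-truth camera.
pixels_a = [project(p, **true_cam) for p in tag_a]
both_tags = tag_a + tag_b
pixels_both = [project(p, **true_cam) for p in both_tags]

# One planar tag: the wrong camera reproduces the observations, so the loss cannot tell them apart.
loss_one_tag = reprojection_loss(tag_a, pixels_a, **alt_cam)
# Two non-coplanar tags: the same wrong camera now produces a clear pixel error.
loss_two_tags = reprojection_loss(both_tags, pixels_both, **alt_cam)

print("one tag:", loss_one_tag, "two tags:", loss_two_tags)
```

In this toy setup the loss on the single tag vanishes for the wrong camera, whereas adding the second tag yields a nonzero error that an optimizer could use to separate focal length from translation.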