MultiViPerFrOG: A Globally Optimized Multi-Viewpoint Perception Framework for Camera Motion and Tissue Deformation
Guido Caccianiga, Julian Nubert, Cesar Cadena, Marco Hutter, Katherine J. Kuchenbecker
TL;DR
MultiViPerFrOG addresses the ill-posed problem of jointly estimating camera motion and tissue deformation from the data of a moving depth camera in deformable surgical scenes. It introduces a globally optimized, multi-viewpoint framework that fuses low-level perception outputs (data association, depth, relative scene flow) with kinematic priors into a fast, large-scale optimization that jointly estimates multiple camera motions and absolute scene flow, using a four-block residual architecture and nine input measures to yield five outputs $x_1, \dots, x_5$. Implemented with the Ceres Solver on Lie-group manifolds with automatic differentiation, the approach achieves real-time performance and demonstrates robustness to input noise on synthetic datasets generated via VisionBlender and ex vivo liver meshes. The work claims to be the first real-time, multi-viewpoint solution for simultaneous camera motion and deformation tracking in deformable scenes, with the potential to enable advanced free-viewpoint visualization and semi-autonomous surgical guidance. It provides a flexible, learning-free scaffold for future surgical scene representations and tool-control strategies, with code and data to be released upon acceptance.
Abstract
Reconstructing the 3D shape of a deformable environment from the information captured by a moving depth camera is highly relevant to surgery. The underlying challenge is the fact that simultaneously estimating camera motion and tissue deformation in a fully deformable scene is an ill-posed problem, especially from a single arbitrarily moving viewpoint. Current solutions are often organ-specific and lack the robustness required to handle large deformations. Here we propose a multi-viewpoint global optimization framework that can flexibly integrate the output of low-level perception modules (data association, depth, and relative scene flow) with kinematic and scene-modeling priors to jointly estimate multiple camera motions and absolute scene flow. We use simulated noisy data to show three practical examples that successfully constrain the convergence to a unique solution. Overall, our method shows robustness to combined noisy input measures and can process hundreds of points in a few milliseconds. MultiViPerFrOG builds a generalized learning-free scaffolding for spatio-temporal encoding that can unlock advanced surgical scene representations and will facilitate the development of the computer-assisted-surgery technologies of the future.
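The ill-posedness mentioned in the abstract can be illustrated with a minimal numerical sketch. It is not the paper's formulation: the observation model (points at time t, displaced by an absolute flow, then mapped by a rigid camera motion into the next camera frame) and all names such as `rigid_transform`, `observe_next`, and `flow_explaining` are assumptions made for illustration. The sketch shows that, from a single viewpoint, any assumed camera motion can be paired with a compensating flow field that reproduces the same observed points exactly.

```python
import numpy as np


def rigid_transform(rotation_z_rad, translation):
    """Build a 4x4 rigid transform: rotation about z, then translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def apply(T, points):
    """Apply a 4x4 rigid transform to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]


def observe_next(points_t, absolute_flow, camera_motion):
    """Assumed model: the scene deforms by absolute_flow, then the points are
    expressed in the next camera frame through camera_motion."""
    return apply(camera_motion, points_t + absolute_flow)


def flow_explaining(points_t, points_next, assumed_motion):
    """Absolute flow that explains the observations for ANY assumed motion."""
    return apply(np.linalg.inv(assumed_motion), points_next) - points_t


rng = np.random.default_rng(0)
points_t = rng.uniform(-0.05, 0.05, size=(100, 3)) + np.array([0.0, 0.0, 0.1])
true_flow = 0.002 * rng.standard_normal((100, 3))
true_motion = rigid_transform(0.05, [0.01, 0.0, 0.0])
points_next = observe_next(points_t, true_flow, true_motion)

# A deliberately wrong camera motion, paired with its compensating flow.
wrong_motion = rigid_transform(-0.10, [0.0, 0.02, 0.0])
wrong_flow = flow_explaining(points_t, points_next, wrong_motion)

# Both hypotheses reproduce the single-view observations equally well.
residual_wrong = np.abs(
    observe_next(points_t, wrong_flow, wrong_motion) - points_next
).max()
flow_difference = np.abs(wrong_flow - true_flow).max()
print(f"max residual of wrong hypothesis: {residual_wrong:.2e}")
print(f"max difference between flow fields: {flow_difference:.2e}")
```

Because the wrong hypothesis fits the data as well as the true one, single-view point observations alone cannot separate camera motion from deformation. Additional constraints, such as the extra viewpoints and the kinematic and scene-modeling priors named in the abstract, are what make a unique solution reachable.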
