MultiViPerFrOG: A Globally Optimized Multi-Viewpoint Perception Framework for Camera Motion and Tissue Deformation

Guido Caccianiga; Julian Nubert; Cesar Cadena; Marco Hutter; Katherine J. Kuchenbecker

TL;DR

MultiViPerFrOG tackles the ill-posed problem of jointly estimating camera motion and tissue deformation from moving depth cameras in deformable surgical scenes by introducing a globally optimized, multi-viewpoint framework. The method fuses low-level perception outputs (data association, depth, relative scene flow) with kinematic priors into a fast, large-scale optimization that jointly estimates multiple camera motions and absolute scene flow, using a four-block residual architecture that combines nine input measures $m_{1-9}$ to estimate five parameters $x_{1-5}$. Implemented with Ceres Solver on Lie-group manifolds with automatic differentiation, the approach achieves real-time performance and demonstrates robustness to input noise on synthetic datasets generated with VisionBlender from ex vivo liver meshes. The work claims to be the first real-time, multi-viewpoint solution for simultaneous camera motion and deformation tracking in deformable scenes, with potential to enable advanced free-viewpoint visualization and semi-autonomous surgical guidance. It provides a flexible, learning-free scaffold for future surgical scene representations and tool-control strategies, with code and data to be released upon acceptance.

Abstract

Reconstructing the 3D shape of a deformable environment from the information captured by a moving depth camera is highly relevant to surgery. The underlying challenge is the fact that simultaneously estimating camera motion and tissue deformation in a fully deformable scene is an ill-posed problem, especially from a single arbitrarily moving viewpoint. Current solutions are often organ-specific and lack the robustness required to handle large deformations. Here we propose a multi-viewpoint global optimization framework that can flexibly integrate the output of low-level perception modules (data association, depth, and relative scene flow) with kinematic and scene-modeling priors to jointly estimate multiple camera motions and absolute scene flow. We use simulated noisy data to show three practical examples that successfully constrain the convergence to a unique solution. Overall, our method shows robustness to combined noisy input measures and can process hundreds of points in a few milliseconds. MultiViPerFrOG builds a generalized learning-free scaffolding for spatio-temporal encoding that can unlock advanced surgical scene representations and will facilitate the development of the computer-assisted-surgery technologies of the future.
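The ill-posedness mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' formulation: it assumes pure camera translation (no rotation), expresses every quantity in the camera frame at $\mathtt{t_0}$, and uses hypothetical names such as `relative_scene_flow` and `camera_translation`. It shows that a static camera observing a moving point and a moving camera observing a static point can produce the same relative scene flow, so a single viewpoint cannot tell the two apart.

```python
# Illustrative only: pure translations, all vectors in the camera frame at t0.

def relative_scene_flow(point_t0, camera_translation, absolute_flow):
    """Displacement of a point as seen by a translating camera.

    point_t0: point coordinates in the camera frame at t0.
    camera_translation: camera displacement between t0 and t1 (hypothetical
        name); rotation is ignored in this sketch.
    absolute_flow: displacement of the point itself between t0 and t1.
    """
    # Point position at t1, still expressed in the camera frame at t0.
    moved = [p + f for p, f in zip(point_t0, absolute_flow)]
    # Re-express that position in the camera frame at t1.
    seen_t1 = [m - c for m, c in zip(moved, camera_translation)]
    # Relative scene flow: what the camera measures between the two frames.
    return [s - p for s, p in zip(seen_t1, point_t0)]


point = [0.0, 0.0, 0.10]  # a point 10 cm in front of the camera (meters)

# Case a) static camera, the tissue point moves 5 mm along X.
flow_static_camera = relative_scene_flow(point, [0.0, 0.0, 0.0], [0.005, 0.0, 0.0])

# Case b) static point, the camera moves 5 mm along -X.
flow_static_point = relative_scene_flow(point, [-0.005, 0.0, 0.0], [0.0, 0.0, 0.0])

print(flow_static_camera)
print(flow_static_point)
```

Both calls describe different physical situations yet yield the same measured flow, which is why additional measures or priors are needed to reach a unique solution.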


Paper Structure (5 sections, 1 equation, 7 figures)

This paper contains 5 sections, 1 equation, 7 figures.

Introduction
Background and Related Work
Methods
Experiments
Conclusion

Figures (7)

Figure 1: Multi-Viewpoint Perception Framework Optimized Globally (MultiViPerFrOG). In the center, a kinematic formalization for two moving cameras $\mathtt{C_{A}}$ and $\mathtt{C_{B}}$ observing a moving point $\mathtt{P}$ between two time instants $\mathtt{t_0}$ and $\mathtt{t_1}$. Around the periphery, the combinations of measures ($m_{1-9}$, black) and parameters ($x_{1-5}$, red) are used to compute the cost functions for each residual block: $DA$ = data association, $SFT$ = scene flow transformation, $KC$ = kinematic chaining, and $KS$ = kinematic supervision.
Figure 1: Kinematic representations (shown in the $X$-$Z$ plane) of the possible relative motions and resulting scene flows (colored arrows) between a camera $\mathtt{C}$ and a point $\mathtt{P}$ in its field of view at two time instants $\mathtt{t_0}$ and $\mathtt{t_1}$. a) Absolute scene flow: a static camera observes a moving point (green arrow). b) Camera scene flow: a moving camera observes a static point (blue arrow). The point $\mathtt{\hat{P}_{t_0}}$ represents the coordinates of the point $\mathtt{P_{t_0}}$, as measured by $\mathtt{C_{t_0}}$, applied in the reference frame of $\mathtt{C_{t_1}}$. c) Relative scene flow: a moving camera observes a moving point (red arrow).
Figure 2: Workflow for synthetic dataset generation. a) Ex vivo porcine liver captured for organ mesh. b) Simulated laparoscopic scene with two cameras. c) Liver mesh. d) Sample RGB, e) depth, and f) optical flow outputs from one virtual depth camera.
Figure 3: Experiment overview. a) Experiment 0: The optimization is underconstrained as infinitely many combinations of the unknown parameters ($x_{3-5}$) can explain the relevant measures ($m_{6-9}$). $m_{1-2}$ do not remove this ambiguity. b) Experiment 1: Measuring the odometry of one camera ($m_3$) constrains the problem to a unique solution. This setting is valid for the camera being either static or freely moving. c) Experiment 2: Measuring a number of absolute scene-flow values ($m_5$) also constrains the problem to a unique solution. These measures can be either static or moving points. d) Experiment 3: All the measures $m_{1-9}$ are available and overconstrain the problem.
Figure 4: Experiment 1: Left) Increasing [0--10 mm] noise is added to the data association (DA) between the two cameras. The ego-motion of one camera is exactly known ($m_3$), whether static or moving. Right) The same noise is added to both DA and $m_3$.
...and 2 more figures
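
The caption for Experiment 1 states that knowing the odometry of one camera constrains the problem to a unique solution. A minimal single-camera sketch of that idea follows. It is an assumption-laden illustration rather than the paper's optimization: the function names are hypothetical, and the rigid transform `(R, t)` is assumed to give the pose of the camera at $\mathtt{t_1}$ expressed in the camera frame at $\mathtt{t_0}$. Under those assumptions, the absolute scene flow of a point follows directly from its measured relative scene flow.

```python
import math

def transform(R, t, p):
    """Apply the rigid transform (R, t) to point p: R @ p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def inverse_transform(R, t, p):
    """Apply the inverse rigid transform: R^T @ (p - t)."""
    q = [p[j] - t[j] for j in range(3)]
    return [sum(R[j][i] * q[j] for j in range(3)) for i in range(3)]

def recover_absolute_flow(point_t0, relative_flow, R, t):
    """Absolute flow of a point, given known camera odometry (R, t).

    Assumption: (R, t) maps coordinates in the camera frame at t1 to the
    camera frame at t0, and all outputs are expressed in the frame at t0.
    """
    # Point coordinates as measured by the camera at t1.
    seen_t1 = [p + r for p, r in zip(point_t0, relative_flow)]
    # Bring that measurement back into the camera frame at t0.
    in_t0_frame = transform(R, t, seen_t1)
    return [a - p for a, p in zip(in_t0_frame, point_t0)]


# Hypothetical camera motion: 10 degree rotation about Y plus a small shift.
angle = math.radians(10.0)
c, s = math.cos(angle), math.sin(angle)
R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
t = [0.002, 0.0, 0.001]

point = [0.01, 0.02, 0.10]        # point in the camera frame at t0 (meters)
true_flow = [0.003, -0.001, 0.002]  # ground-truth absolute flow (made up)

# Simulate the relative scene flow that the moving camera would measure.
moved = [p + f for p, f in zip(point, true_flow)]
seen_t1 = inverse_transform(R, t, moved)
relative_flow = [a - p for a, p in zip(seen_t1, point)]

recovered_flow = recover_absolute_flow(point, relative_flow, R, t)
print(recovered_flow)
```

In the multi-camera setting described by the captions, the second camera's motion would then be tied to these values through the data association between views; that step is not modeled here.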
