Learning to Estimate Single-View Volumetric Flow Motions without 3D Supervision

Aleksandra Franz, Barbara Solenthaler, Nils Thuerey

TL;DR

This work tackles monocular estimation of 3D volumetric fluid motion by learning a global 3D velocity field $\mathbf{u}$ and density $\rho$ without 3D ground-truth supervision. It introduces Neural Global Transport (NGT), which combines a 2D-to-3D density estimator, a multi-scale curl-based velocity generator, differentiable transport, differentiable rendering, and an adversarial prior to resolve depth ambiguity from a single view. The method demonstrates stable long-term predictions and competitive realism on synthetic plumes and real ScalarFlow data, while offering an end-to-end, single-pass alternative to costly optimization-based reconstructions. The results indicate strong potential for real-world monocular fluid reconstruction, with limitations noted for isotropic scattering and obstacle-enabled transport, suggesting clear directions for future extension.

Abstract

We address the challenging problem of jointly inferring the 3D flow and volumetric densities moving in a fluid from a monocular input video with a deep neural network. Despite the complexity of this task, we show that it is possible to train the corresponding networks without requiring any 3D ground truth. In the absence of ground truth data, we can train our model with observations from real-world capture setups instead of relying on synthetic reconstructions. We make this unsupervised training approach possible by first generating an initial prototype volume, which is then moved and transported over time without the need for volumetric supervision. Our approach relies purely on image-based losses, an adversarial discriminator network, and regularization. Our method can estimate long-term sequences in a stable manner while closely matching the targets for inputs such as rising smoke plumes.
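The idea of generating one initial volume and then transporting it over time can be illustrated with a small rollout. The sketch below is not the paper's transport operator, which this summary does not specify. It swaps in plain semi-Lagrangian advection with bilinear interpolation, works in 2D on a periodic grid for brevity, and uses hypothetical names (`advect`, `rollout`).

```python
import numpy as np

def advect(rho, u, dt=1.0):
    """Semi-Lagrangian advection of a 2D density on a periodic grid.

    rho: (nx, ny) density; u: (2, nx, ny) velocity in cells per unit time.
    Assumption: this is a stand-in for the paper's differentiable transport.
    """
    nx, ny = rho.shape
    i, j = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
    # Trace each cell centre backwards along the velocity.
    x = (i - dt * u[0]) % nx
    y = (j - dt * u[1]) % ny
    xf, yf = np.floor(x), np.floor(y)
    fx, fy = x - xf, y - yf
    # Wrap indices so the lookup stays inside the periodic grid.
    x0, y0 = xf.astype(int) % nx, yf.astype(int) % ny
    x1, y1 = (x0 + 1) % nx, (y0 + 1) % ny
    # Bilinear interpolation at the traced-back position.
    return ((1 - fx) * (1 - fy) * rho[x0, y0] + fx * (1 - fy) * rho[x1, y0]
            + (1 - fx) * fy * rho[x0, y1] + fx * fy * rho[x1, y1])

def rollout(rho0, velocities, dt=1.0):
    """Transport an initial density through a sequence of velocity fields."""
    seq = [rho0]
    for u in velocities:
        seq.append(advect(seq[-1], u, dt))
    return seq

# Usage: a small blob moved by a constant velocity of one cell per step in x.
rho0 = np.zeros((8, 8))
rho0[2:4, 3:5] = 1.0
u_const = np.zeros((2, 8, 8))
u_const[0] = 1.0
seq = rollout(rho0, [u_const] * 3)
```

Because every later frame is a function of the initial density and the velocities, a loss on rendered images of any frame can, in a differentiable implementation, propagate back to both. That is the property the image-only training described above relies on.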

Paper Structure (38 sections, 16 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Left: An overview of the complete NGT framework. We generate an initial density volume $\rho^{0}$ that is advected by the velocity to form a sequence. Density estimates are used in addition to the single input image to guide and stabilize the velocity generation. Velocity training is done end-to-end over the whole sequence. Right: Our multi-scale velocity estimator $\mathcal{G}_{\mathbf{u}}$, shown for 3 resolution scales. The inputs contain information about the current ($t$) and next ($t+1$) time step. Each scale generates a residual velocity potential, which is used to advect the inputs of step $t$ before generating the next residual. The final velocity is divergence-free because it is obtained as the curl $\nabla\times$ of the potential.
  • Figure 2: Depth-ambiguity ablation study with the shape dataset: In the top row all versions closely match the input view while the side view (bottom) shows severe degradation when increasing depth ambiguity by varying the position of the objects (fix $\rightarrow$ var) or using only the input view for $\mathcal{L}_{\hat{\bm{I}}}$ (multi $\rightarrow$ single). Providing additional views alongside the discriminator does not yield further improvements. $\mathcal{D}$ by itself can recover a plausible configuration, even if the density does not exactly match the unknown reference location.
  • Figure 3: Adding the prototype volume $\tilde{\rho}$ produced by our 2D-to-3D UNet as guidance (right side) stabilizes the inference over long evaluations and results in better target matching and a better overall density distribution, especially from unseen views. Metrics can be found in Table \ref{tab:metricsSF}.
  • Figure 4: Qualitative comparison between different approaches using synthetic plume data for time-step 80. Top is the input view, which all methods match fairly well. Bottom is a 90° side view where the shortcomings of the different approaches become visible. Due to the overshoot, ScalarFlow densities are shown with a factor of $1/2$.
  • Figure 5: Qualitative comparison between different approaches using ScalarFlow data for time-step 100. Our method closely matches the given input and has a clearly defined shape that matches the general shape of the reference. It is only surpassed by the costly single-scene reconstruction method GlobTrans. RapidGen was adapted to be trained without 3D GT.
  • ...and 7 more figures
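
The Figure 1 caption states that the final velocity is divergence-free because it is obtained through a curl. A minimal numerical check of that identity is sketched below, under assumptions that are not from the paper: a periodic grid, central differences, and a random vector potential `psi` standing in for a network output. The helper names (`ddx`, `curl`, `divergence`) are hypothetical.

```python
import numpy as np

def ddx(f, axis, h=1.0):
    """Central difference along one axis of a periodic grid."""
    return (np.roll(f, -1, axis=axis) - np.roll(f, 1, axis=axis)) / (2.0 * h)

def curl(psi, h=1.0):
    """Curl of a vector potential psi with shape (3, nx, ny, nz)."""
    px, py, pz = psi
    ux = ddx(pz, 1, h) - ddx(py, 2, h)
    uy = ddx(px, 2, h) - ddx(pz, 0, h)
    uz = ddx(py, 0, h) - ddx(px, 1, h)
    return np.stack([ux, uy, uz])

def divergence(u, h=1.0):
    """Divergence of a vector field u with shape (3, nx, ny, nz)."""
    return ddx(u[0], 0, h) + ddx(u[1], 1, h) + ddx(u[2], 2, h)

# Assumption: a random potential stands in for the generator's output.
rng = np.random.default_rng(0)
psi = rng.standard_normal((3, 16, 16, 16))
u = curl(psi)

# The discrete difference operators commute, so the divergence of the
# curl cancels term by term, up to floating-point rounding.
max_div = float(np.abs(divergence(u)).max())
```

The cancellation holds for any potential, which is why predicting a potential and taking its curl enforces incompressibility by construction, with no separate divergence penalty.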