
Semantic Flow: Learning Semantic Field of Dynamic Scenes from Monocular Videos

Fengrui Tian, Yueqi Duan, Angtian Wang, Jianfei Guo, Shaoyi Du

TL;DR

Semantic Flow introduces a flow-based semantic representation for dynamic scenes from monocular videos, addressing 2D-to-3D ambiguity by supervising semantic flows with opacity priors from volume densities. The method combines an implicit flow field, flow feature aggregation, and flow attention to produce per-flow semantic logits that are rendered along rays with density-based integration for both moving foreground and static background. It demonstrates strong generalization across scenes, enables instance-level editing, semantic completion, and dynamic scene tracking, and provides a new Semantic Dynamic Scene dataset with pixel-level labels. The approach achieves higher semantic accuracy and boundary quality than prior dynamic NeRF baselines, and remains robust under reduced labeling and noisy flow supervision, suggesting practical utility for interpretable scene understanding from monocular video. Formally, the dynamic foreground semantics are learned via $F_{dy}$: the flow logits $\boldsymbol{s}_{dy}(\boldsymbol{\Gamma}(u))$ are integrated along each ray as $\boldsymbol{s}_{dy}(\boldsymbol{r}) = \int_{u_n}^{u_f} T_{dy}(u) \sigma_{dy}(u) \boldsymbol{s}_{dy}(\boldsymbol{\Gamma}(u)) du$, while the static background uses $F_{st}$ and $\boldsymbol{s}_{st}(\boldsymbol{r}) = \int_{u_n}^{u_f} T_{st}(u) \sigma_{st}(u) \boldsymbol{s}_{st}(\boldsymbol{r}(u)) du$ to produce a full semantic field $\boldsymbol{s}^{sem}_{full}(\boldsymbol{r})$.
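
The ray integrals above can be approximated numerically. The following is a minimal sketch, assuming the standard NeRF-style quadrature in which each ray segment has opacity $1 - e^{-\sigma_i \delta_i}$; the function name `render_semantic_logits` and its arguments are hypothetical and are not taken from the authors' code, and the paper's actual discretization may differ.

```python
import math


def render_semantic_logits(sigmas, logits, deltas):
    """Approximate s(r) = integral of T(u) * sigma(u) * s(u) du along one ray.

    Assumed (hypothetical) inputs, one entry per ray sample:
      sigmas: non-negative volume densities
      logits: semantic logit vectors, each of length C
      deltas: distances between adjacent samples
    Returns the rendered logits (length C) and the per-sample weights.
    """
    num_classes = len(logits[0])
    rendered = [0.0] * num_classes
    weights = []
    transmittance = 1.0  # nothing occludes the first sample
    for sigma, logit, delta in zip(sigmas, logits, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        weights.append(weight)
        for c in range(num_classes):
            rendered[c] += weight * logit[c]
        transmittance *= 1.0 - alpha            # light remaining after this segment
    return rendered, weights


# Toy ray: empty space, then two opaque samples that vote for different classes.
rendered, weights = render_semantic_logits(
    sigmas=[0.0, 50.0, 50.0],
    logits=[[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]],
    deltas=[0.1, 0.1, 0.1],
)
```

In this toy ray the first opaque sample receives almost all of the weight, so its class dominates the rendered logits and the occluded sample behind it contributes very little. This is the sense in which densities act as opacity priors along the viewing direction.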

Abstract

In this work, we pioneer Semantic Flow, a neural semantic representation of dynamic scenes from monocular videos. In contrast to previous NeRF methods that reconstruct dynamic scenes from the colors and volume densities of individual points, Semantic Flow learns semantics from continuous flows that contain rich 3D motion information. Because extracting 3D flow features from 2D video frames is ambiguous along the viewing direction (the 2D-to-3D ambiguity problem), we treat the volume densities as opacity priors that describe the contributions of flow features to the semantics on the frames. More specifically, we first learn a flow network to predict flows in the dynamic scene and propose a flow feature aggregation module to extract flow features from video frames. We then propose a flow attention module to extract motion information from the flow features, followed by a semantic network that outputs the semantic logits of flows. We integrate the logits with volume densities along the viewing direction to supervise the flow features with semantic labels on video frames. Experimental results show that our model is able to learn from multiple dynamic scenes and supports a series of new tasks such as instance-level scene editing, semantic completion, dynamic scene tracking, and semantic adaptation to novel scenes. Code is available at https://github.com/tianfr/Semantic-Flow/.
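
To make the step from per-frame flow features to per-flow semantic logits concrete, here is a rough sketch. It assumes plain scaled dot-product self-attention with identity query/key/value projections, mean pooling over frames, and a single linear layer as the semantic head. These are stand-ins chosen for illustration only: the abstract does not specify the architecture of the flow attention module or the semantic network, and the names `attend_over_flow` and `semantic_logits` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_flow(flow_features):
    """Self-attention over the features of one flow, then mean pooling.

    flow_features: T feature vectors (one per video frame) sampled along a
    single flow, each of length D. Projections are the identity here, which
    is an illustrative simplification rather than the paper's design.
    """
    dim = len(flow_features[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in flow_features:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in flow_features]
        weights = softmax(scores)
        attended.append([
            sum(w * value[d] for w, value in zip(weights, flow_features))
            for d in range(dim)
        ])
    # Pool the attended per-frame features into one vector for the flow.
    return [sum(row[d] for row in attended) / len(attended) for d in range(dim)]


def semantic_logits(pooled, class_weights):
    """Hypothetical linear semantic head: one weight row per class."""
    return [sum(w * x for w, x in zip(row, pooled)) for row in class_weights]


# A flow observed in three frames with 2-D features, and two classes.
pooled = attend_over_flow([[0.0, 1.0], [2.0, 3.0], [1.0, 1.0]])
logits = semantic_logits(pooled, [[1.0, 0.0], [0.0, 1.0]])
```

The resulting logits are what would then be rendered along camera rays with the volume densities, so that 2D semantic labels on the frames can supervise the flow features.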

Paper Structure

This paper contains 26 sections, 18 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Semantic Flow learns from flows that capture motion information in dynamic scenes. In this way, Semantic Flow can learn semantics from multiple scenes and conduct instance-level editing (Top). It also supports dynamic scene tracking and semantic completion (Middle), both of which learn with few semantic labels. Compared to DynNeRF [21iccv/chen_dynerf] and MonoNeRF [23iccv/tian_mononerf], Semantic Flow can transfer to novel scenes with more accurate details (Bottom).
  • Figure 2: The overview of the proposed model. We first design a flow network to predict flows in the dynamic scene. Then, taking the orange flow (in the bottom left part) as an example, we aggregate the flow features from the feature map of each frame. We propose the flow attention module to reveal the motion information from the aggregated flow features. Finally, we design a semantic network to output the semantic logits of each flow, and predict semantics on the frame by rendering the semantic logits along camera rays with volume densities $\sigma_{dy}$ as opacity priors.
  • Figure 3: Visualizations on various tasks. Unlike DynNeRF [21iccv/chen_dynerf] + semantic head and MonoNeRF [23iccv/tian_mononerf] + semantic head, which learn from point features, Semantic Flow learns from flow features to capture motion. In this way, Semantic Flow predicts semantic labels of dynamic foregrounds with more accurate motions and clearer boundaries.
  • Figure 4: Annotation examples in Semantic Dynamic Scene dataset.
  • Figure 5: Visualization of rendered RGB images, estimated flow fields, semantic predictions and pixel correspondence in three consecutive frames. In the correspondence visualization, the color coding illustrates correspondences of the dynamic foreground across time.
  • ...and 5 more figures