LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control

Delin Qu, Qizhi Chen, Pingrui Zhang, Xianqiang Gao, Junzhe Li, Bin Zhao, Dong Wang, Xuelong Li

TL;DR

LiveScene is a scene-level language-embedded radiance field that reconstructs multiple interactive objects in complex scenes from monocular video and lets users control them through natural language. It factorizes the high-dimensional interactive space into local 4D deformable fields stored on multi-scale feature planes and uses an interaction-aware language embedding to localize objects, achieving state-of-the-art view synthesis and language grounding with a compact model (~39M parameters). The method is evaluated on two new datasets, OmniSim and InterReal, which together comprise 28 scenes and 70 interactive objects, and it remains robust to topology changes and multi-object interactions. Ablations support the benefits of multi-scale factorization, boundary-preserving losses, and probability-based sampling for localization and control. The approach is a step toward scalable, language-guided interactive scene rendering and manipulation in real and simulated environments.

Abstract

This paper scales object-level reconstruction to complex scenes, advancing interactive scene reconstruction. We introduce two datasets, OmniSim and InterReal, featuring 28 scenes with multiple interactive objects. To tackle the challenge of inaccurate interactive motion recovery in complex scenes, we propose LiveScene, a scene-level language-embedded interactive radiance field that efficiently reconstructs and controls multiple objects. By decomposing the interactive scene into local deformable fields, LiveScene reconstructs the motion of each object separately, which reduces memory consumption. Additionally, our interaction-aware language embedding localizes individual interactive objects, allowing for arbitrary control using natural language. Extensive experiments show that our approach is significantly superior in novel view synthesis, interactive scene control, and language grounding. Project page: https://livescenes.github.io.

Paper Structure (17 sections, 13 equations, 21 figures, 8 tables)

Figures (21)

  • Figure 1: LiveScene enables scene-level reconstruction and control with language grounding. Left: Language-interactive articulated object control in Nerfstudio. Right: LiveScene achieves SOTA rendering quality on the OmniSim dataset and exhibits a significant advantage in parameter efficiency.
  • Figure 2: Overview of LiveScene. Given a camera view and the control variable $\boldsymbol{\kappa}$ of one specific interactive object, a series of 3D points is sampled in the local deformable field that models the interactive motions of that object, and the object is then rendered in its novel interactive motion state via volume rendering. In addition, an interaction-aware language embedding is used to localize and control individual interactive objects using natural language.
  • Figure 3: Illustration of hyperplanar factorization for compact storage. We maintain multiple local deformable fields for each interactive object region $\mathcal{R}_i$, and project high-dimensional interaction features into a compact 4D space, which can be further compressed into multiscale feature planes.
  • Figure 4: Illustration of (a) boundary sampling conflicts and (b) a rendering quality comparison.
  • Figure 5: Overview of the OmniSim and InterReal datasets.
  • ...and 16 more figures
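
The caption of Figure 3 describes compressing a compact 4D space into multiscale feature planes. The sketch below illustrates that general idea with a generic plane-factorized grid over a point $(x, y, z, \kappa)$. It is not the authors' implementation: the class `MultiScalePlanes`, the helper `bilinear_sample`, the resolutions, the channel count, the random initialization, and the choice to multiply plane features and concatenate scales are all assumptions made for illustration.

```python
import itertools
import numpy as np


def bilinear_sample(plane, uv):
    """Bilinearly interpolate a (R, R, C) feature plane at uv in [0, 1]^2."""
    res = plane.shape[0]
    # Map normalized coordinates to continuous grid indices.
    grid = np.clip(np.asarray(uv, dtype=float), 0.0, 1.0) * (res - 1)
    i0 = np.minimum(np.floor(grid).astype(int), res - 2)  # keep i0 + 1 in range
    (u0, v0), (fu, fv) = i0, grid - i0
    # Interpolate along the second axis, then along the first.
    row0 = (1 - fv) * plane[u0, v0] + fv * plane[u0, v0 + 1]
    row1 = (1 - fv) * plane[u0 + 1, v0] + fv * plane[u0 + 1, v0 + 1]
    return (1 - fu) * row0 + fu * row1


class MultiScalePlanes:
    """Hypothetical 4D feature grid stored as six 2D planes per scale."""

    def __init__(self, resolutions=(8, 16), channels=4, seed=0):
        rng = np.random.default_rng(seed)
        # All axis pairs of (x, y, z, kappa): xy, xz, xk, yz, yk, zk.
        self.axis_pairs = list(itertools.combinations(range(4), 2))
        self.planes = [
            {pair: rng.uniform(0.5, 1.5, size=(r, r, channels))
             for pair in self.axis_pairs}
            for r in resolutions
        ]

    def num_parameters(self):
        return sum(p.size for scale in self.planes for p in scale.values())

    def query(self, point4d):
        """Return the feature vector for a normalized point (x, y, z, kappa)."""
        point4d = np.asarray(point4d, dtype=float)
        per_scale = []
        for scale in self.planes:
            feat = 1.0
            for (a, b), plane in scale.items():
                # Combine the six plane projections by elementwise product.
                feat = feat * bilinear_sample(plane, point4d[[a, b]])
            per_scale.append(feat)
        # Concatenate features across scales.
        return np.concatenate(per_scale)


field = MultiScalePlanes()
feature = field.query([0.2, 0.5, 0.7, 0.3])  # last entry plays the role of kappa
dense_params = sum(r ** 4 * 4 for r in (8, 16))  # dense 4D grids, same settings
```

Under these assumed settings the planes hold `field.num_parameters()` values, while dense 4D grids at the same resolutions would hold `dense_params` values. That gap is the storage saving this kind of factorization aims for.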