Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images

JiaKui Hu, Shanshan Zhao, Qing-Guo Chen, Xuerui Qiu, Jialun Liu, Zhao Xu, Weihua Luo, Kaifu Zhang, Yanye Lu

TL;DR

Omni-View unifies 3D scene understanding and generation by adding a two-path generation model (texture and geometry) on top of a Bagel-based understanding backbone. A two-stage training regime, combining a dense-to-sparse curriculum with joint RGB-Depth-Pose generation, lets generation improve geometric and spatiotemporal understanding. The approach achieves state-of-the-art results on VSI-Bench and competitive performance in 3D QA, spatial reasoning, and novel view synthesis, indicating that generation can substantially strengthen 3D understanding. The work lays a foundation for broadly applicable 3D multimodal models and points to future work on grounding, long-range generation, and reinforcement-learning-based improvements.

Abstract

This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module, which is responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while also delivering strong performance in both novel view synthesis and 3D scene generation.

Paper Structure

This paper contains 17 sections, 6 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Architecture of Omni-View. Building upon Bagel [deng2025bagel], Omni-View consists of an understanding model and a generation model. The generation model is further composed of two specialized modules: one for texture and one for geometry. Trained via a two-stage process, Omni-View shows high effectiveness in scene understanding and novel view synthesis. Crucially, it unlocks the benefits of its generative capabilities to enhance the model's understanding performance.