FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views

Qijian Tian, Xin Tan, Jiayu Ying, Xuhong Wang, Yuan Xie, Lizhuang Ma

TL;DR

FLEG tackles the challenge of creating language-aware 3D representations from uncalibrated multi-view images without 3D annotations. It introduces a single-pass, feed-forward pipeline that predicts language-embedded Gaussians and employs a 3D-annotation-free training framework, InstanceMV-14K data, and instance-guided contrastive learning to align 2D language semantics with 3D geometry. A geometry-semantic sparsification strategy reduces memory costs while preserving high-fidelity geometry and semantic coverage, enabling open-vocabulary 3D querying and editing with novel-view synthesis. The method achieves state-of-the-art performance on reconstruction and open-vocabulary tasks across sparse to dense views, while delivering real-time inference suitable for robotics and AR/VR applications.

Abstract

We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads but suffer from fixed input views and insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich the semantic embedding. We also propose instance-guided contrastive learning to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we propose a geometry-semantic hierarchical sparsification strategy. FLEG efficiently reconstructs a language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: https://fangzhou2000.github.io/projects/fleg.
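
The abstract names instance-guided contrastive learning but does not state the loss. As a rough illustration of the general idea only, the sketch below implements a generic supervised-contrastive loss in which features sharing a 2D instance ID are pulled together and all others are pushed apart. It is not FLEG's formulation: the function name, the `temperature` value, and the choice of a supervised-contrastive form are all assumptions made for this example.

```python
import math


def instance_contrastive_loss(features, instance_ids, temperature=0.1):
    """Generic supervised-contrastive loss over per-point features.

    Hypothetical helper for illustration; NOT the loss defined in the paper.
    features:     list of equal-length float vectors (e.g. rendered semantic features)
    instance_ids: list of ints, one 2D instance label per feature
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f)) or 1.0
        normed.append([x / norm for x in f])

    n = len(normed)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if instance_ids[j] == instance_ids[i]]
        if not positives:
            continue  # an anchor with no same-instance partner contributes nothing

        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in others
        }
        # Numerically stable log-sum-exp over every non-anchor sample.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))

        # Average negative log-probability of the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        count += 1

    return total / count if count else 0.0
```

With this form, the loss is close to zero when same-instance features are nearly parallel and grows when features from different instances are more similar to each other than to their own instance.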

Paper Structure

This paper contains 18 sections, 13 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: FLEG reconstructs language-embedded Gaussians in a single feed-forward pass from any uncalibrated and unposed multi-view images, supporting both sparse and dense views in one model. The reconstructed language-embedded Gaussians simultaneously enable novel-view synthesis, 3D query, and 3D editing, exhibiting potential for efficient data generation and real-time downstream applications.
  • Figure 2: Overview of FLEG. FLEG adopts a large transformer with a DPT-based decoder and corresponding prediction heads to predict language-embedded Gaussians. We propose a 3D-annotation-free training framework to eliminate the reliance on 3D annotations. To embed semantics into 3D representations, we construct InstanceMV-14K to enrich semantic diversity. We also introduce instance-guided contrastive learning to effectively align 2D instances with 3D representations. We further propose a geometry–semantic hierarchical sparsification strategy to avoid the cost of per-pixel predictions.
  • Figure 3: Qualitative comparisons with feed-forward methods on the ScanNet dataset under sparse-view inputs.
  • Figure 4: Qualitative comparisons with per-scene optimized methods on the ScanNet dataset under dense-view input.