LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency

Zhongbin Guo, Jiahe Liu, Wenyu Gao, Yushan Li, Chengzhi Li, Ping Jian

TL;DR

LISA-3D tackles language-guided 3D reconstruction with a geometry-aware, LoRA-tuned LISA that produces cross-view-consistent masks from RGB-D sequences. These masks are combined with the RGB images into RGBA prompts that drive a frozen SAM-3D reconstructor. The core idea is to enforce multi-view geometric consistency through a differentiable reprojection loss, so that accurate 3D outputs are obtained without retraining the 3D reconstructor. On ScanRefer and Nr3D, the approach improves substantially over single-view and naive baselines (up to +15.6 points over single-view baselines) with only 11.6M trainable parameters, and it demonstrates robust open-vocabulary grounding for 3D content creation. The modular, data-efficient pipeline is practical to deploy and sets the stage for end-to-end or more tightly integrated language-to-3D systems.

Abstract

Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reusing a frozen SAM-3D reconstructor. During training we exploit off-the-shelf RGB-D sequences and their camera poses to build a differentiable reprojection loss that enforces cross-view agreement without requiring any additional 3D-text supervision. The resulting masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which outputs Gaussian splats or textured meshes without retraining. Across ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting only 11.6M parameters. The system is modular, data-efficient, and supports zero-shot deployment on unseen categories, providing a practical recipe for language-guided 3D content creation. Our code will be available at https://github.com/binisalegend/LISA-3D.
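
To make the RGBA prompt step concrete, here is a minimal sketch of how a predicted mask could be stacked onto an RGB image as an alpha channel. The function name `make_rgba_prompt`, the uint8 encoding, and the [0, 1] mask range are assumptions made for illustration; the abstract only states that masks are concatenated with RGB images, not how SAM-3D expects them to be encoded.

```python
import numpy as np

def make_rgba_prompt(rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stack an RGB image and a soft mask into a 4-channel RGBA array.

    rgb:  (H, W, 3) uint8 image.
    mask: (H, W) float mask, assumed to lie in [0, 1] (values outside are clipped).
    Returns an (H, W, 4) uint8 array whose last channel is the mask scaled to 0..255.
    """
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if rgb.shape[:2] != mask.shape:
        raise ValueError("rgb and mask must share height and width")
    # Hypothetical encoding: alpha = 255 inside the object, 0 outside.
    alpha = np.rint(np.clip(mask, 0.0, 1.0) * 255.0).astype(np.uint8)
    return np.concatenate([rgb.astype(np.uint8), alpha[..., None]], axis=-1)
```

Under these assumptions, each view of a sequence would yield one such RGBA array, which is then handed to the frozen reconstructor.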

Paper Structure

This paper contains 24 sections, 8 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of LISA-3D. Our framework operates in two stages. First, geometry-aware LISA receives multi-view RGB-D pairs and text instructions, producing consistent segmentation masks via LoRA tuning and differentiable warping supervision. The geometric consistency loss enforces that predictions on different views agree when projected through known camera transformations. Second, the predicted masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which reconstructs Gaussian splats or meshes without additional training. This decoupled design preserves the open-vocabulary reasoning of LISA while enabling faithful 3D reconstruction from natural language alone.
  • Figure 2: Qualitative demos. From left to right: input image, LISA-3D mask, SAM-3D reconstruction (Gaussian splat) and textured mesh. Our geometry-aware LoRA ensures precise localization even for fine-grained referring expressions.
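
The cross-view agreement described in the Figure 1 caption can be illustrated with a small NumPy sketch. It back-projects the pixels of a target view using its depth map, moves them into a source camera with a known relative pose, and compares the target mask with the source mask sampled at the projected locations. This is an assumption-laden illustration, not the paper's loss: the function names, the pinhole model with a shared intrinsics matrix `K`, nearest-neighbour sampling, and the L1 penalty are all choices made here. A training implementation would presumably use bilinear sampling inside an autodiff framework so that gradients reach the mask predictor.

```python
import numpy as np

def warp_mask(mask_src, depth_tgt, K, T_tgt_to_src):
    """Sample the source-view mask where target-view pixels land in the source view.

    mask_src:     (H, W) float mask predicted on the source view.
    depth_tgt:    (H, W) depth of the target view (0 marks missing depth).
    K:            (3, 3) pinhole intrinsics, assumed shared by both views.
    T_tgt_to_src: (4, 4) rigid transform from target to source camera frame.
    Returns (warped, valid), both of shape (H, W).
    """
    H, W = depth_tgt.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    depth = depth_tgt.reshape(-1).astype(np.float64)

    # Back-project target pixels to 3D points in the target camera frame.
    pts_tgt = (pix @ np.linalg.inv(K).T) * depth[:, None]
    # Move the points into the source camera frame.
    pts_src = pts_tgt @ T_tgt_to_src[:3, :3].T + T_tgt_to_src[:3, 3]
    # Project into the source image plane.
    z = pts_src[:, 2]
    safe_z = np.where(z > 1e-6, z, 1.0)
    proj = pts_src @ K.T
    ui = np.rint(proj[:, 0] / safe_z).astype(int)
    vi = np.rint(proj[:, 1] / safe_z).astype(int)

    valid = (depth > 0) & (z > 1e-6) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    warped = np.zeros(H * W, dtype=np.float64)
    warped[valid] = mask_src[vi[valid], ui[valid]]  # nearest-neighbour lookup
    return warped.reshape(H, W), valid.reshape(H, W)

def consistency_loss(mask_tgt, warped, valid):
    """Mean absolute disagreement over pixels that have a valid correspondence."""
    n = int(valid.sum())
    if n == 0:
        return 0.0
    return float(np.abs(mask_tgt - warped)[valid].sum() / n)
```

With an identity pose the warped mask equals the source mask and the loss is zero for identical predictions; any disagreement between views at corresponding pixels raises the loss.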