LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency

Zhongbin Guo; Jiahe Liu; Wenyu Gao; Yushan Li; Chengzhi Li; Ping Jian

TL;DR

LISA-3D tackles language-guided 3D reconstruction with a geometry-aware, LoRA-tuned version of LISA that produces cross-view-consistent masks from RGB-D sequences. These masks are combined with the RGB images into RGBA prompts, which a frozen SAM-3D lifts to 3D. The core idea is to enforce multi-view geometric consistency through a differentiable reprojection loss, so that accurate 3D outputs are obtained without retraining the 3D reconstructor. On ScanRefer and Nr3D, the approach outperforms single-view and naive baselines, by up to +15.6 points over the single-view ones, while training only 11.6M parameters, and it demonstrates robust open-vocabulary grounding for 3D content creation. The pipeline is modular and data-efficient, which makes it practical to deploy and a starting point for end-to-end or more tightly integrated language-to-3D systems.

Abstract

Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reusing a frozen SAM-3D reconstructor. During training we exploit off-the-shelf RGB-D sequences and their camera poses to build a differentiable reprojection loss that enforces cross-view agreement without requiring any additional 3D-text supervision. The resulting masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which outputs Gaussian splats or textured meshes without retraining. Across ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting only 11.6M parameters. The system is modular, data-efficient, and supports zero-shot deployment on unseen categories, providing a practical recipe for language-guided 3D content creation. Our code will be available at https://github.com/binisalegend/LISA-3D.
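
To make the cross-view agreement idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and every function name in it (`backproject`, `reprojection_consistency`, `make_rgba_prompt`) is hypothetical. It assumes a pinhole intrinsic matrix `K`, 4x4 camera-to-world poses, and soft masks in [0, 1]. It uses nearest-neighbor sampling for brevity, so it is not differentiable as written; a training loss would presumably use bilinear sampling inside an autodiff framework.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel of a depth map to world-space 3D points (assumes pinhole K)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # pixel row (v) and column (u) index grids
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # camera-frame rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)    # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

def reprojection_consistency(mask_a, mask_b, depth_a, K, pose_a, pose_b):
    """Mean squared disagreement between mask_a and mask_b at reprojected pixels.

    pose_a and pose_b are 4x4 camera-to-world matrices (an assumption of this sketch).
    """
    h, w = mask_a.shape
    pts_w = backproject(depth_a, K, pose_a)
    pts_h = np.concatenate([pts_w, np.ones((h * w, 1))], axis=1)
    pts_b = (pts_h @ np.linalg.inv(pose_b).T)[:, :3]   # world -> camera B
    z = pts_b[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)                # avoid division by zero
    proj = pts_b @ K.T
    u = np.round(proj[:, 0] / safe_z).astype(int)
    v = np.round(proj[:, 1] / safe_z).astype(int)
    valid = (in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
             & (depth_a.reshape(-1) > 0))
    if not valid.any():
        return 0.0                                     # no overlap between the views
    a = mask_a.reshape(-1)[valid]
    b = mask_b[v[valid], u[valid]]                     # nearest-neighbor lookup in view B
    return float(np.mean((a - b) ** 2))

def make_rgba_prompt(rgb, mask):
    """Stack an (H, W, 3) uint8 image with an (H, W) soft mask as the alpha channel."""
    alpha = np.clip(np.round(mask * 255.0), 0, 255).astype(np.uint8)
    return np.concatenate([rgb, alpha[..., None]], axis=-1)
```

With identical poses and identical masks the consistency term is zero, and it grows as the two views disagree about which reprojected pixels belong to the referred object. The RGBA helper only shows the channel layout implied by "concatenated with RGB images"; the exact prompt format SAM-3D expects is not specified here.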
