Splat Feature Solver
Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng
TL;DR
This work recasts feature lifting for splat-based 3D representations as a sparse linear inverse problem AX = B, where A derives from splat rendering and B from per-pixel feature observations. It introduces a closed-form row-sum preconditioner with a 1+β bound under convex losses, together with two regularizers, Tikhonov Guidance and Post-Lifting Aggregation, that stabilize solutions in the presence of noisy masks and multi-view inconsistencies. The method is kernel- and feature-agnostic, so it applies to diverse splat kernels and feature modalities, and it achieves state-of-the-art open-vocabulary 3D segmentation while running in minutes per scene. Auto-thresholding and denoising further improve robustness to mask noise and camera pose perturbations. The work provides practical, scalable, and theoretically grounded tools for enriching 3D splat representations with dense 2D features, with public code and visualization resources available.
Abstract
Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while resolving inconsistencies across multi-view images. We present a unified, kernel- and feature-agnostic formulation of feature lifting as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses, yielding high-quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies that stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing lifted features in minutes. Our code is available on GitHub (https://github.com/saliteta/splat-distiller/tree/main). Additional visualizations are available on the project website (https://splat-distiller.pages.dev/) and in the accompanying video (https://www.youtube.com/watch?v=CH-G5hbvArM).
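To make the linear-inverse-problem view concrete, the following is a minimal sketch and not the released implementation. It assumes that A stores, for each pixel, the rendering weight of every splat contributing to it, and that a row-sum (diagonal) preconditioner reduces the solve to X ≈ D⁻¹AᵀB with D = diag(Aᵀ1), i.e. a weight-normalized average of the pixel features each splat sees. The function name `lift_features`, the triplet layout, and the `damping` term (a generic Tikhonov-style addition to the diagonal, not necessarily the paper's Tikhonov Guidance) are all hypothetical.

```python
def lift_features(entries, pixel_features, num_splats, damping=0.0):
    """Diagonally preconditioned one-shot estimate of X in A X = B.

    entries        -- nonzeros of A as (pixel_idx, splat_idx, weight) triplets
    pixel_features -- rows of B, one feature vector per pixel
    num_splats     -- number of rows of X
    damping        -- hypothetical Tikhonov-style term added to the diagonal
    """
    dim = len(pixel_features[0])
    weight_sum = [0.0] * num_splats                     # diag(A^T 1)
    accum = [[0.0] * dim for _ in range(num_splats)]    # A^T B

    for pixel, splat, weight in entries:
        weight_sum[splat] += weight
        feature = pixel_features[pixel]
        row = accum[splat]
        for d in range(dim):
            row[d] += weight * feature[d]

    lifted = []
    for splat in range(num_splats):
        denom = weight_sum[splat] + damping
        if denom > 0.0:
            lifted.append([value / denom for value in accum[splat]])
        else:
            # Splat never observed and no damping: leave its feature at zero.
            lifted.append([0.0] * dim)
    return lifted


# Toy scene: 3 pixels, 3 splats (the last splat is never observed).
A = [(0, 0, 1.0), (1, 0, 0.5), (1, 1, 0.5), (2, 1, 1.0)]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
X = lift_features(A, B, num_splats=3)
X_damped = lift_features(A, B, num_splats=3, damping=0.5)
```

Because each splat's feature depends only on its own accumulated weights, this estimate needs a single pass over the nonzeros of A and no iterative optimization, which is consistent with the minutes-per-scene runtime the abstract reports. A larger `damping` shrinks the features of weakly observed splats toward zero.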
