Splat Feature Solver

Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng

TL;DR

This work recasts feature lifting for splat-based 3D representations as a sparse linear inverse problem AX = B, where A derives from splat rendering and B from per-pixel feature observations. It introduces a closed-form row-sum preconditioner with a 1+β bound under convex losses and two regularizers—Tikhonov Guidance and Post-Lifting Aggregation—to stabilize solutions in the presence of noisy masks and multi-view inconsistencies. The method is kernel- and feature-agnostic, applicable to diverse splat kernels and feature modalities, and achieves state-of-the-art open-vocabulary 3D segmentation while running in minutes per scene. Auto-thresholding and robust denoising further enhance robustness to mask noise and camera pose perturbations. The work provides practical, scalable, and theoretically grounded tools for enriching 3D splat representations with dense 2D features, with public code and visualization resources available.
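To make the AX = B framing concrete, here is a minimal sketch of a row-sum-normalized closed-form estimate, assuming A holds non-negative per-pixel rendering weights for each splat. The function name `lift_features_row_sum`, the dense NumPy arrays, and the toy data are illustrative assumptions, not the authors' implementation; a real pipeline would likely use sparse matrices, and the paper's exact preconditioner may differ.

```python
import numpy as np

def lift_features_row_sum(A, B, eps=1e-8):
    """Hypothetical helper: weighted-average feature per splat.

    A: (num_pixels, num_splats) non-negative rendering weights.
    B: (num_pixels, feat_dim) per-pixel feature observations.
    Returns X: (num_splats, feat_dim).
    """
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    # Total weight each splat contributes over all pixels
    # (the row sums of A^T, used here as a diagonal normalizer).
    weight_sums = A.sum(axis=0)
    # Back-project pixel features onto splats, then normalize.
    numerator = A.T @ B
    return numerator / np.maximum(weight_sums, eps)[:, None]

# Toy scene: 3 pixels, 2 splats, 2-dimensional features.
A = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
X_true = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
B = A @ X_true  # noise-free observations rendered from X_true

X = lift_features_row_sum(A, B)
print(X)
```

Each lifted feature in this sketch is a convex combination of the pixel features the splat contributes to, so it needs no iterative optimization, but it only approximates the exact solution of AX = B when splats overlap in the same pixels.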

Abstract

Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high-quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing lifted features in minutes. Our code is available on GitHub (https://github.com/saliteta/splat-distiller/tree/main). We provide a website (https://splat-distiller.pages.dev/) with more visualizations, as well as a video (https://www.youtube.com/watch?v=CH-G5hbvArM).
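As a rough illustration of how a Tikhonov-style term stabilizes such a linear inverse problem, the sketch below solves the textbook regularized normal equations (AᵀA + λI)X = AᵀB. This is the standard ridge form, not necessarily the paper's Tikhonov Guidance; the function name `solve_tikhonov`, the parameter `lam`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def solve_tikhonov(A, B, lam=1e-6):
    """Hypothetical helper: textbook Tikhonov (ridge) solve.

    Solves (A^T A + lam * I) X = A^T B. Adding lam to the diagonal
    keeps the system well-posed even when some splats are never
    observed (all-zero columns in A).
    """
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    num_splats = A.shape[1]
    gram = A.T @ A + lam * np.eye(num_splats)
    return np.linalg.solve(gram, A.T @ B)

# Toy scene: the third splat is never seen, so A^T A alone is singular.
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0]])
X_true = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.3, 0.7]])
B = A @ X_true

X_reg = solve_tikhonov(A, B)
print(X_reg)
```

In this toy case the two observed splats recover their features almost exactly, while the unobserved splat falls back to a zero feature instead of making the solve fail.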

Paper Structure

This paper contains 31 sections, 19 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Overview of our Feature Lifting Framework. Our pipeline lifts dense 2D feature observations (e.g., MaskCLIP, DINO) onto general 3D splat representations by formulating the task as a sparse linear inverse problem. The Solver incorporates Tikhonov Guidance to ensure numerical stability and Post-Lifting Aggregation to filter noisy inputs. The resulting lifted feature parameters enable high-fidelity downstream tasks, such as open-vocabulary 3D segmentation and localization.
  • Figure 2: Qualitative comparison on the LeRF-OVS Ramen scene. We compare our method against Dr. Splat and the Ground Truth. As shown in the legend, distinct colors represent different semantic classes (e.g., egg, chopsticks). Our method outperforms the recent state-of-the-art Dr. Splat. More qualitative results can be found in Fig. 3, Fig. 4, and Fig. 5.
  • Figure 3: Qualitative Comparison of Attention Maps and Segmentation Masks. We present a qualitative comparison between our proposed method (top row) and the baseline, Dr. Splat (middle row). The heatmaps visualize the model's attention response, or affinity feature projection, to various text prompts (e.g., "Green apple," "Jake the dog," "Miffy"). The bottom row compares the final semantic segmentation masks for the entire scene against the ground truth.
  • Figure 4: Attention Map Comparison: As demonstrated in the figures, our implementation produces clearer attention maps and segmentation masks compared to Dr. Splat.
  • Figure 5: Additional Qualitative Segmentation Results. Scenes from top to bottom: Figurines, Ramen, Teatime, and Waldo's Kitchen (LeRF-OVS dataset). Our method consistently produces sharper, less noisy segmentation masks than Dr. Splat and aligns closely with the Ground Truth.
  • ...and 10 more figures