Vector Quantized Feature Fields for Fast 3D Semantic Lifting

George Tang, Aditya Agarwal, Weiqiao Han, Trevor Darrell, Yutong Bai

TL;DR

This work tackles the memory and efficiency bottlenecks of lifting 2D vision-language features into 3D for semantic tasks. It generalizes lifting to semantic lifting by introducing per-view masks computed from a grounding representation, enabling localized edits and queries without processing every pixel or frame. The core contribution is the Vector-Quantized Feature Field (VQ-FF), a two-stage local-global vector quantization scheme that compresses multiscale feature maps into a compact codebook and per-frame index maps, allowing $O(1)$ retrieval of pixel-level relevancy masks for semantic lifting. The approach demonstrates improved precision in object localization and substantial memory and time savings across indoor and outdoor scenes, with applications in object-centric editing and efficient embodied question answering (OpenEQA). Overall, semantic lifting with VQ-FF achieves high fidelity while enabling scalable, on-demand 3D scene understanding and manipulation for practical embodied intelligence tasks, as validated on LERF/LangSplat representations and the OpenEQA benchmark. Formally, semantic lifting is written as $R = L(\{m_i \odot f(x_i), p_i\})$, where $m_i$ is the per-view mask.

Abstract

We generalize lifting to semantic lifting by incorporating per-view masks that indicate relevant pixels for lifting tasks. These masks are determined by querying corresponding multiscale pixel-aligned feature maps, which are derived from scene representations such as distilled feature fields and feature point clouds. However, storing per-view feature maps rendered from distilled feature fields is impractical, and feature point clouds are expensive to store and query. To enable lightweight on-demand retrieval of pixel-aligned relevance masks, we introduce the Vector-Quantized Feature Field. We demonstrate the effectiveness of the Vector-Quantized Feature Field on complex indoor and outdoor scenes. Semantic lifting, when paired with a Vector-Quantized Feature Field, can unlock a myriad of applications in scene representation and embodied intelligence. Specifically, we showcase how our method enables text-driven localized scene editing and significantly improves the efficiency of embodied question answering.
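To make the retrieval step concrete, here is a minimal sketch of how a pixel-aligned relevance mask could be read from a codebook and a per-frame index map. It rests on several assumptions: the array layouts, the function names `relevancy_mask` and `masked_view`, and the `threshold` parameter are all hypothetical, and a plain cosine-similarity threshold stands in for the relevancy score, which this summary does not specify. The point it illustrates is that the query is scored once per code rather than once per pixel, so each pixel's value is a table lookup.

```python
import numpy as np

def relevancy_mask(codebook, index_map, query, threshold=0.5):
    """Return a boolean per-pixel mask from a quantized feature map.

    codebook:  (K, D) array of code vectors (assumed layout).
    index_map: (H, W) integer array; each pixel stores a codebook row.
    query:     (D,) embedding of the text query (hypothetical input).
    """
    # Normalize so that the dot product is a cosine similarity.
    codes = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    # Score each of the K codes once, instead of each of the H*W pixels.
    code_scores = codes @ q                # shape (K,)
    # Per-pixel retrieval is then a constant-time table lookup.
    pixel_scores = code_scores[index_map]  # shape (H, W)
    return pixel_scores > threshold

def masked_view(image, mask):
    """Apply a per-view mask to an (H, W, C) image: zero out masked-off pixels."""
    return image * mask[..., None]
```

In this sketch, `masked_view` plays the role of the masking product in the semantic lifting formulation: only pixels selected by the mask would be passed on to whatever lifting procedure is used.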

Paper Structure

This paper contains 18 sections, 10 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: We compare lifting with semantic lifting of per-view InstructPix2Pix edits for object-centric 3D editing. Naive lifting often corrupts other parts of the scene. Semantic lifting eliminates this issue without added overhead by integrating per-view localization masks instantly determined via a Vector Quantized Feature Field of the scene. Italicized text is the InstructPix2Pix prompt, while (bold) text denotes the localization mask prompt. The displayed images are novel views of the edited scene lifted via Gaussian Splatting.
  • Figure 2: We first split the image sequence into $k$ batches. Within each batch, we perform local quantization by rendering feature maps and quantizing based on superpixels, followed by global quantization that employs clustering to quantize within a batch. Batching takes advantage of the structure and feature similarity between consecutive images in the sequence and greatly reduces the runtime of global quantization. We do not quantize over scales due to how relevancy computation is formulated.
  • Figure 3: Visual foundation features such as DINO are well aligned with superpixel boundaries.
  • Figure 4: Visual comparison of superpixel-based vs patch-based local quantization feature map reconstruction quality. The white box highlights our method's denoising capability while preserving important structural details.
  • Figure 5: Superpixel-based and patch-based local quantization yield the same cosine distance to the original feature map, but superpixel-based quantization is superior in reconstructing image features. This is attributed to patch-based quantization overfitting to noise present in the original feature maps.
  • ...and 13 more figures
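The two-stage scheme described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: superpixel labels are assumed to be precomputed by some superpixel algorithm, local quantization is taken to be a per-superpixel mean, plain k-means stands in for the unspecified clustering step, and a single feature scale is handled. The function names `local_quantize`, `global_quantize`, and `quantize_batch` are hypothetical.

```python
import numpy as np

def local_quantize(features, superpixels):
    """Average features inside each superpixel.

    features:    (H, W, D) feature map.
    superpixels: (H, W) integer labels in [0, S), assumed precomputed.
    Returns an (S, D) array with one mean feature per superpixel.
    """
    labels = superpixels.ravel()
    flat = features.reshape(-1, features.shape[-1])
    n_segments = int(labels.max()) + 1
    sums = np.zeros((n_segments, flat.shape[1]))
    np.add.at(sums, labels, flat)  # accumulate features per label
    counts = np.bincount(labels, minlength=n_segments)
    return sums / np.maximum(counts, 1)[:, None]

def nearest(vectors, codebook):
    """Index of the closest codebook row for every vector."""
    dists = ((vectors[:, None, :] - codebook[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def global_quantize(vectors, n_codes, n_iters=10, seed=0):
    """Plain k-means over a batch's local codes (a stand-in clustering)."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), n_codes, replace=False)
    codebook = vectors[start].copy()
    for _ in range(n_iters):
        assign = nearest(vectors, codebook)
        for c in range(n_codes):
            members = vectors[assign == c]
            if len(members):  # leave empty clusters unchanged
                codebook[c] = members.mean(axis=0)
    return codebook, nearest(vectors, codebook)

def quantize_batch(feature_maps, superpixel_maps, n_codes):
    """Quantize one batch of frames into a shared codebook and index maps."""
    local_codes = [local_quantize(f, s)
                   for f, s in zip(feature_maps, superpixel_maps)]
    offsets = np.cumsum([0] + [len(c) for c in local_codes])
    codebook, assign = global_quantize(np.concatenate(local_codes), n_codes)
    # Map each pixel to its superpixel's global code.
    index_maps = [assign[offsets[i]:offsets[i + 1]][s]
                  for i, s in enumerate(superpixel_maps)]
    return codebook, index_maps
```

To mirror the batching in the caption, one would split the image sequence into $k$ consecutive chunks and call `quantize_batch` on each. A quantized feature map for a frame is then recovered as `codebook[index_map]`.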