Vector Quantized Feature Fields for Fast 3D Semantic Lifting
George Tang, Aditya Agarwal, Weiqiao Han, Trevor Darrell, Yutong Bai
TL;DR
This work tackles the memory and efficiency bottlenecks of lifting 2D vision-language features into 3D for semantic tasks. It generalizes lifting to semantic lifting by introducing per-view masks computed from a grounding representation, enabling localized edits and queries without processing every pixel or frame. The framework is formalized as $R = L(\{m_i \odot f(x_i), p_i\})$. The core contribution is the Vector-Quantized Feature Field (VQ-FF), a two-stage local-global vector quantization scheme that compresses multiscale feature maps into a compact codebook and per-frame index maps, allowing $O(1)$ retrieval of pixel-level relevancy masks for semantic lifting. The approach demonstrates improved precision in object localization and substantial memory and time savings across indoor and outdoor scenes, with applications in object-centric editing and efficient embodied QA. Overall, semantic lifting with VQ-FF achieves high fidelity while enabling scalable, on-demand 3D scene understanding and manipulation for embodied intelligence tasks, as validated on LERF/LangSplat representations and the OpenEQA benchmark.
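To make the codebook-plus-index-map idea concrete, here is a minimal sketch of how a relevancy mask could be read from a quantized feature map. It assumes a single flat codebook, cosine similarity against a query embedding, and a fixed threshold; the function name `relevancy_mask`, the threshold value, and the toy arrays are hypothetical and do not reproduce the paper's two-stage local-global scheme or its relevancy scoring.

```python
import numpy as np

def relevancy_mask(codebook, index_map, query, threshold=0.5):
    """Return a boolean (H, W) mask from a quantized feature map.

    codebook:  (K, D) array of code vectors (assumed layout).
    index_map: (H, W) integer array of codebook indices for one frame.
    query:     (D,) query embedding, e.g. from a text encoder.
    """
    # Normalize so the dot product below is a cosine similarity.
    codes = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    # Score each codebook entry once: K dot products instead of H * W.
    code_scores = codes @ q
    # Each pixel then reads its score with a single table lookup.
    pixel_scores = code_scores[index_map]
    return pixel_scores > threshold

# Toy example: three codes, a 2x2 frame, and a query aligned with code 0.
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
index_map = np.array([[0, 1], [2, 0]])
query = np.array([1.0, 0.0])
mask = relevancy_mask(codebook, index_map, query, threshold=0.6)
```

Because the similarity is computed per code rather than per pixel, the per-pixel cost reduces to one array lookup, which is the sense in which retrieval is constant-time per pixel in this sketch.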
Abstract
We generalize lifting to semantic lifting by incorporating per-view masks that indicate relevant pixels for lifting tasks. These masks are determined by querying corresponding multiscale pixel-aligned feature maps, which are derived from scene representations such as distilled feature fields and feature point clouds. However, storing per-view feature maps rendered from distilled feature fields is impractical, and feature point clouds are expensive to store and query. To enable lightweight on-demand retrieval of pixel-aligned relevance masks, we introduce the Vector-Quantized Feature Field. We demonstrate the effectiveness of the Vector-Quantized Feature Field on complex indoor and outdoor scenes. Semantic lifting, when paired with a Vector-Quantized Feature Field, can unlock a myriad of applications in scene representation and embodied intelligence. Specifically, we showcase how our method enables text-driven localized scene editing and significantly improves the efficiency of embodied question answering.
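The masked formulation from the TL;DR, R = L({m_i ⊙ f(x_i), p_i}), can be illustrated with a short sketch. The reading of the symbols here is an assumption: f(x_i) is taken to be the per-view map for view i, m_i its binary relevance mask, p_i its camera pose, and L a lifting operator supplied by the caller. The helper names `masked_inputs` and `semantic_lift` and the toy `lift` function are hypothetical.

```python
import numpy as np

def masked_inputs(features, masks, poses):
    """Build the (m_i * f(x_i), p_i) pairs, dropping views with empty masks."""
    pairs = []
    for feat, mask, pose in zip(features, masks, poses):
        if not mask.any():
            continue  # no relevant pixels, so this view is skipped entirely
        # Broadcast the (H, W) mask over the channel axis of the (H, W, C) map.
        pairs.append((feat * mask[..., None], pose))
    return pairs

def semantic_lift(features, masks, poses, lift):
    """Apply a caller-supplied lifting operator (standing in for L) to masked views."""
    return lift(masked_inputs(features, masks, poses))

# Toy example: two views, the first with an empty mask.
features = [np.ones((2, 2, 3)), np.ones((2, 2, 3))]
masks = [np.zeros((2, 2), dtype=bool),
         np.array([[True, False], [False, False]])]
poses = [np.eye(4), np.eye(4)]
# Toy stand-in for L: count surviving views and sum the remaining feature values.
result = semantic_lift(
    features, masks, poses,
    lift=lambda pairs: (len(pairs), sum(f.sum() for f, _ in pairs)),
)
```

In this sketch the efficiency gain comes from two places: views with empty masks never reach the lifting operator, and within the remaining views only the masked pixels carry nonzero values.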
