LERF: Language Embedded Radiance Fields

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, Matthew Tancik

TL;DR

LERF fuses CLIP language embeddings into a NeRF-based 3D field, enabling open-ended, pixel-aligned natural-language queries over real-world scenes. It learns a dense multi-scale language field via a CLIP feature pyramid supervised across views, with DINO regularization to improve coherence, and renders 3D relevancy maps through volumetric integration. The approach yields real-time, 3D-consistent relevancy maps for arbitrary prompts without region proposals or fine-tuning, and demonstrates strong localization and existence-detection performance against 2D open-vocabulary baselines. This work enables natural, scalable interaction with 3D environments for robotics and vision-language understanding, and provides a foundation for extending 3D language grounding with improved encoders.

Abstract

Humans describe the physical world using natural language to refer to specific 3D locations based on a vast range of properties: visual appearance, semantics, abstract associations, or actionable affordances. In this work we propose Language Embedded Radiance Fields (LERFs), a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, which enables these types of open-ended language queries in 3D. LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays, supervising these embeddings across training views to provide multi-view consistency and smooth the underlying language field. After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real-time, which has potential use cases in robotics, understanding vision-language models, and interacting with 3D scenes. LERF enables pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings without relying on region proposals or masks, supporting long-tail open-vocabulary queries hierarchically across the volume. The project website can be found at https://lerf.io.
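
The abstract's core step, volume rendering language embeddings along a ray and supervising them against CLIP targets, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`render_weights`, `render_embedding`, `cosine_loss`) are hypothetical, the weights follow the standard NeRF formulation, and normalizing the rendered embedding to unit length is an assumption made here so that the dot product is a cosine similarity.

```python
import math

def render_weights(densities, deltas):
    """Standard NeRF weights: w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    weights, transmittance = [], 1.0
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha  # light remaining after this sample
    return weights

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]

def render_embedding(weights, embeddings):
    """Weighted average of per-sample embeddings along one ray.

    Assumption: the result is normalized to unit length afterwards.
    """
    dim = len(embeddings[0])
    acc = [0.0] * dim
    for w, e in zip(weights, embeddings):
        for k in range(dim):
            acc[k] += w * e[k]
    return normalize(acc)

def cosine_loss(rendered, target):
    """Negative cosine similarity; minimizing it maximizes the similarity."""
    return -sum(a * b for a, b in zip(rendered, normalize(target)))

# Toy ray with three samples and 2-D stand-ins for CLIP embeddings.
w = render_weights([0.0, 50.0, 1.0], [0.1, 0.1, 0.1])
phi = render_embedding(w, [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
loss = cosine_loss(phi, [0.0, 1.0])
```

Because the second sample has by far the highest density, it dominates the weights, so the rendered embedding points almost entirely along that sample's embedding.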

Paper Structure (26 sections, 16 figures, 4 tables)

Figures (16)

  • Figure 1: Language Embedded Radiance Fields (LERF). LERF grounds CLIP representations in a dense, multi-scale 3D field. A LERF can be reconstructed from a hand-held phone capture within 45 minutes, then can render dense relevancy maps given textual queries interactively in real-time. LERF enables a broad range of concepts to be queried via natural language, from abstract queries like "Electricity", visual properties like "Yellow", long-tail objects such as "Waldo", and even reading text like "Boops" on the mug. For each prompt, an RGB image and relevancy map are rendered focusing on the location with maximum relevancy activation.
  • Figure 2: LERF Optimization: Left: LERF represents a field of 3D volumes, parameterized by position $x,y,z$ and scale $s$ (orange cube). To render a CLIP embedding along a ray, the field is sampled and averaged according to NeRF's volume rendering weights. Physical scale corresponds to an image scale $s_\text{img}$ via projective geometry. Right: We pre-compute a multi-scale feature pyramid of CLIP embeddings over training views, and during training interpolate this pyramid with $s_\text{img}$ and the ray's pixel location to obtain CLIP supervision. The CLIP loss maximizes cosine similarity, and other outputs are supervised with mean squared-error using standard per-pixel rendering.
  • Figure 3: Results with LERF for 5 in-the-wild scenes. Each image shows a visual rendering of the LERF (Sec. \ref{sec:methods}), along with relevancy renderings (Sec. \ref{sec:querying}) for each text query and a cropped view of the activated region. For the bookstore scene, the original book cover images are shown in blue with a globe icon. See Sec. \ref{sec:qualitative} for discussion and details on relevancy visualization.
  • Figure 4: 2D CLIP vs LERF: The left visualizes similarity interpolated over patchwise CLIP embeddings, and the right rendered from LERF. Because volumetric language rendering incorporates information from multiple views, 3D relevancy activation maps have better alignment with the underlying scene geometry.
  • Figure 5: Ablations: We ablate DINO regularization and multi-scale training (Sec. \ref{sec:ablations}), and highlight qualitative degradation in relevancy maps here.
  • ...and 11 more figures
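
The Figure 2 caption mentions interpolating the pre-computed CLIP pyramid with $s_\text{img}$. A minimal sketch of the scale part of that lookup follows, assuming plain linear blending between the two pyramid levels that bracket the requested scale and clamping outside the pyramid's range. The name `interpolate_scale` and the toy scales are hypothetical, and interpolation over the ray's pixel location is omitted.

```python
from bisect import bisect_right

def interpolate_scale(pyramid_scales, pyramid_embeddings, s_img):
    """Blend the embeddings of the two pyramid levels that bracket s_img.

    pyramid_scales must be sorted ascending; pyramid_embeddings[i] is the
    embedding stored for pyramid_scales[i] at one fixed pixel location.
    Assumption: linear blending, with clamping outside the scale range.
    """
    if s_img <= pyramid_scales[0]:
        return list(pyramid_embeddings[0])
    if s_img >= pyramid_scales[-1]:
        return list(pyramid_embeddings[-1])
    hi = bisect_right(pyramid_scales, s_img)  # first level strictly above s_img
    lo = hi - 1
    t = (s_img - pyramid_scales[lo]) / (pyramid_scales[hi] - pyramid_scales[lo])
    return [(1.0 - t) * a + t * b
            for a, b in zip(pyramid_embeddings[lo], pyramid_embeddings[hi])]

# Toy pyramid: three scales with 2-D stand-ins for CLIP embeddings.
scales = [1.0, 2.0, 4.0]
levels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = interpolate_scale(scales, levels, 3.0)  # halfway between levels 2.0 and 4.0
```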