SplatR : Experience Goal Visual Rearrangement with 3D Gaussian Splatting and Dense Feature Matching

Arjun P S, Andrew Melnik, Gora Chand Nandi

TL;DR

SplatR addresses the experience goal rearrangement problem by building a differentiable 3D scene model using Gaussian Splatting from the goal configuration, enabling rendering of consistent goal views for comparison against the shuffled current state. It integrates dense patchwise feature matching (via DINOv2) and category-agnostic object matching (via CLIP embeddings and the Hungarian algorithm) to identify and rearrange objects with minimal disruption. The approach demonstrates improved performance on the AI2-THOR RoomR rearrangement benchmark, highlighting the benefits of a continuous, high-fidelity 3D representation for embodied reasoning and manipulation tasks. This work opens avenues for scalable world models in embodied AI by combining fast volumetric rendering with robust, cross-domain feature matching, while noting limitations related to patch size, memory demands, and open/close object interactions.
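The category-agnostic matching step can be illustrated with a small sketch. It assumes each detected object is already summarized by one embedding vector, and it uses cosine distance as the assignment cost; the function name `match_objects`, the toy 4-D vectors, and the choice of cost are illustrative assumptions, not details confirmed by the paper. The assignment itself uses SciPy's `linear_sum_assignment`, which solves the same linear assignment problem the Hungarian algorithm addresses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_objects(current_emb: np.ndarray, goal_emb: np.ndarray):
    """Match current-state objects to goal-state objects by embedding similarity.

    Returns a list of (current_index, goal_index, cosine_similarity) tuples.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    cur = current_emb / np.linalg.norm(current_emb, axis=1, keepdims=True)
    goal = goal_emb / np.linalg.norm(goal_emb, axis=1, keepdims=True)
    similarity = cur @ goal.T
    # linear_sum_assignment minimises total cost, so use (1 - similarity).
    rows, cols = linear_sum_assignment(1.0 - similarity)
    return [(int(r), int(c), float(similarity[r, c])) for r, c in zip(rows, cols)]


# Toy stand-ins for CLIP embeddings (hypothetical 4-D vectors, not real CLIP output).
goal = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
current = np.array([[0.1, 0.9, 0.0, 0.0],   # closest to goal object 1
                    [0.0, 0.1, 0.9, 0.0],   # closest to goal object 2
                    [0.9, 0.0, 0.1, 0.0]])  # closest to goal object 0
matches = match_objects(current, goal)
```

Because the assignment is one-to-one and globally optimal for the chosen cost, two similar-looking objects cannot both claim the same goal object, which is the property that makes this kind of matching useful when no category labels are available.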

Abstract

The Experience Goal Visual Rearrangement task stands as a foundational challenge within Embodied AI, requiring an agent to construct a robust world model that accurately captures the goal state. The agent uses this world model to restore a shuffled scene to its original configuration, making an accurate representation of the world essential for successfully completing the task. In this work, we present a novel framework that leverages 3D Gaussian Splatting as a 3D scene representation for the experience goal visual rearrangement task. Recent advances in volumetric scene representation, such as 3D Gaussian Splatting, offer fast rendering of high-quality, photo-realistic novel views. Our approach gives the agent consistent views of the current and goal settings of the rearrangement task, so that it can directly compare the goal state and the shuffled state of the world in image space. To compare these views, we propose a dense feature matching method with visual features extracted from a foundation model, leveraging its more universal feature representation, which facilitates robustness and generalization. We validate our approach on the AI2-THOR rearrangement challenge benchmark and demonstrate improvements over current state-of-the-art methods.

Paper Structure

This paper contains 21 sections, 11 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: SplatR is an Embodied AI agent that solves the experience goal rearrangement task by building a 3D Gaussian splat as a 3D scene representation. The agent, initialized in the goal setting, collects observations and builds the Gaussian Splat to save the goal configuration. Reintroduced into the same world with a shuffled object configuration, SplatR explores the scene and renders consistent views from the Gaussian Splat. Changes in the scene are detected from the similarity between corresponding patchwise features extracted with DINOv2.
  • Figure 2: Overview of the scene change detection and storage framework. Images observed by the agent and images rendered from the Gaussian Splat are compared with a patchwise feature matching method. The resulting detections are stored as object nodes. The patchwise feature visualization above is generated by taking the PCA (principal component analysis) of the combined features of the images of the current and goal settings.
  • Figure 3: Left: Mask for an object generated by accumulating similar patches that are dissimilar across the current and goal settings. Right: Accurate mask obtained from SAM for the same object observed during rearrangement.
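
The patchwise comparison described in the captions can be sketched as follows. This is a minimal illustration, assuming the observed and rendered images have already been encoded into aligned grids of patch features (for example, by DINOv2, which is not called here). The function name `patch_change_mask`, the random stand-in features, and the 0.5 similarity threshold are hypothetical choices for the example, not values taken from the paper.

```python
import numpy as np


def patch_change_mask(current_feats: np.ndarray, goal_feats: np.ndarray,
                      threshold: float = 0.5):
    """Compare two (H, W, D) patch-feature grids patch by patch.

    Returns the per-patch cosine similarity map and a boolean mask that is
    True where the similarity falls below `threshold` (an assumed value).
    """
    eps = 1e-8  # guards against division by zero for all-zero features
    cur = current_feats / (np.linalg.norm(current_feats, axis=-1, keepdims=True) + eps)
    goal = goal_feats / (np.linalg.norm(goal_feats, axis=-1, keepdims=True) + eps)
    similarity = np.sum(cur * goal, axis=-1)  # cosine similarity per patch
    return similarity, similarity < threshold


# Random stand-in features: a 4x4 grid of 8-D patch descriptors.
rng = np.random.default_rng(0)
goal_feats = rng.normal(size=(4, 4, 8))
current_feats = goal_feats.copy()
# Simulate a changed patch by flipping its feature vector (similarity = -1).
current_feats[1, 2] = -goal_feats[1, 2]

similarity, changed = patch_change_mask(current_feats, goal_feats)
```

Here only the patch at grid position (1, 2) is flagged. In a setting like Figure 3, neighbouring flagged patches would then be grouped into a coarse object mask, which a segmentation model could later refine.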