RenderMem: Rendering as Spatial Memory Retrieval

JooHyun Park, HyeongYeop Kang

Abstract

Embodied reasoning is inherently viewpoint-dependent: what is visible, occluded, or reachable depends critically on where the agent stands. However, existing spatial memory systems for embodied agents typically store either multi-view observations or object-centric abstractions, making it difficult to perform reasoning with explicit geometric grounding. We introduce RenderMem, a spatial memory framework that treats rendering as the interface between 3D world representations and spatial reasoning. Instead of storing fixed observations, RenderMem maintains a 3D scene representation and generates query-conditioned visual evidence by rendering the scene from viewpoints implied by the query. This enables embodied agents to reason directly about line-of-sight, visibility, and occlusion from arbitrary perspectives. RenderMem is fully compatible with existing vision-language models and requires no modification to standard architectures. Experiments in the AI2-THOR environment show consistent improvements on viewpoint-dependent visibility and occlusion queries over prior memory baselines.

Paper Structure

This paper contains 28 sections, 18 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: RenderMem retrieves spatial evidence from a stored 3D scene by rendering query-conditioned views, which serve as visual memory readouts for vision-language reasoning about visibility and object state.
  • Figure 2: Overview of the RenderMem pipeline. (a) A renderable 3D scene with an object list serves as spatial memory. (b) Given a question, RenderMem decides whether rendering is needed and selects a rendering mode and object anchors. Surround rendering captures multiple views around an object, while directional rendering generates a source-to-target viewpoint for visibility reasoning. The rendered images are used to answer the user question.
  • Figure 3: RenderMem's robustness to imperfect scene representations. (a) Performance under decreasing reconstruction fidelity, simulated with blur-only (bo) and blur+ghosting (bg). (b) Performance under increasing localization perturbation applied to object bounding spheres.
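
The two rendering modes named in the Figure 2 caption can be illustrated with a small camera-placement sketch. This is not the authors' implementation: the function names, the y-up coordinate convention, the evenly spaced circular layout for surround views, and the choice to place the directional camera exactly at the source anchor are all assumptions made for illustration. A real pipeline would hand these poses to a renderer, which is omitted here.

```python
import math


def look_at_direction(source, target):
    """Return the unit vector pointing from source to target (3D tuples)."""
    d = [t - s for s, t in zip(source, target)]
    norm = math.sqrt(sum(c * c for c in d))
    if norm == 0.0:
        raise ValueError("source and target coincide; direction is undefined")
    return tuple(c / norm for c in d)


def directional_view(source, target):
    """Hypothetical directional mode: camera at the source anchor,
    facing the target anchor (a source-to-target viewpoint)."""
    return {
        "position": tuple(float(c) for c in source),
        "forward": look_at_direction(source, target),
    }


def surround_views(center, radius, num_views=4, height=0.0):
    """Hypothetical surround mode: cameras evenly spaced on a horizontal
    circle around an object center, each facing the center.
    Assumes a y-up coordinate system."""
    cx, cy, cz = center
    views = []
    for k in range(num_views):
        angle = 2.0 * math.pi * k / num_views
        position = (
            cx + radius * math.cos(angle),
            cy + height,
            cz + radius * math.sin(angle),
        )
        views.append({
            "position": position,
            "forward": look_at_direction(position, center),
        })
    return views
```

For example, `directional_view((0, 0, 0), (0, 0, 2))` yields a camera at the origin facing along +z, and `surround_views((0, 0, 0), 1.0)` yields four cameras on the unit circle, each facing the origin. Whether an object is actually visible from such a pose is then left to the rendered image and the vision-language model, as the caption describes.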