VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

Alan Dao, Norapat Buppodom

TL;DR

VoxRep introduces a slice-based strategy to extract 3D voxel semantics from a 3D voxel grid by converting it into a single large 2D image, which is processed by a pre-trained Gemma 3 Vision-Language Model. The approach avoids 3D-specific networks and demonstrates meaningful localization, color recognition, and volume estimation, with object-category decoding remaining more challenging. By leveraging 2D foundation models, VoxRep offers a scalable, data-efficient pathway for 3D scene understanding from voxel representations, with potential applicability to real-world data and larger voxel grids. The work highlights both the viability of 2D VLMs for 3D tasks and the need for further refinement to handle complex scenes and real-world noise.

Abstract

Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
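
To make the slicing idea concrete, the following is a minimal sketch of how a voxel grid could be cut along the Z-axis and tiled into one large 2D image. It is an illustration only, not the authors' implementation: the function name `tile_slices`, the `(Z, H, W, C)` array layout, the row-major tile order, and the zero padding of unused tiles are all assumptions made here.

```python
import math
from typing import Optional

import numpy as np


def tile_slices(voxels: np.ndarray, cols: Optional[int] = None) -> np.ndarray:
    """Tile the Z-slices of a (Z, H, W, C) voxel grid into a single 2D image.

    Hypothetical helper: the layout and padding choices are assumptions,
    not details taken from the paper.
    """
    depth, height, width, channels = voxels.shape
    if cols is None:
        # Default to a roughly square arrangement of tiles.
        cols = math.ceil(math.sqrt(depth))
    rows = math.ceil(depth / cols)

    # Pad with empty (all-zero) slices so the slices fill a rows x cols grid.
    padded = np.zeros((rows * cols, height, width, channels), dtype=voxels.dtype)
    padded[:depth] = voxels

    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    grid = padded.reshape(rows, cols, height, width, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * height, cols * width, channels)


# Example: an 8 x 4 x 4 RGB grid with one red voxel at z=5, y=1, x=2.
voxels = np.zeros((8, 4, 4, 3), dtype=np.uint8)
voxels[5, 1, 2] = (255, 0, 0)
image = tile_slices(voxels, cols=4)
```

Under this layout, slice `k` lands in tile row `k // cols` and tile column `k % cols`, so a voxel's Z coordinate can be read back from which tile it falls in and its X/Y coordinates from its position inside that tile. The resulting array could then be saved as an ordinary image and passed to a 2D image encoder.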

Paper Structure

This paper contains 22 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Put black cube onto green cube
  • Figure 2: A sample of sliced 3D voxels
  • Figure 3: Performance Metrics Evolution during Training. Line charts showing Avg Center Distance, Color Accuracy, Description Accuracy, Avg Voxel Count Difference, and Avg Mismatch Per Example plotted against training steps (200 to 1100