SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

Guanting Ye, Qiyan Zhao, Wenhao Yu, Liangyu Yuan, Mingkai Li, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Qing Jiang, Ka-Veng Yuen

TL;DR

This work introduces Spherical Coordinate-based Positional Embedding (SoPE), a method that maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles in 3D multimodal understanding.

Abstract

3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
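
The abstract does not spell out the exact mapping, so the following is only a minimal sketch of the standard Cartesian-to-spherical conversion that a spherical positional scheme would build on. It assumes each point-cloud token carries a Cartesian position (x, y, z); the function name `cartesian_to_spherical` and the angle convention (polar angle measured from the +z axis, azimuth in the x-y plane) are illustrative choices, not taken from the paper.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Convert a Cartesian point to (r, theta, phi).

    r     : distance from the origin
    theta : polar angle from the +z axis, in [0, pi]
    phi   : azimuth in the x-y plane, in [-pi, pi]

    The angle convention is an assumption made for this sketch.
    """
    r = math.sqrt(x * x + y * y + z * z)
    # atan2 stays well defined on the z axis and at the origin,
    # where it returns 0 by convention.
    theta = math.atan2(math.hypot(x, y), z)
    phi = math.atan2(y, x)
    return r, theta, phi


if __name__ == "__main__":
    # A point on the +y axis: unit radius, polar angle pi/2, azimuth pi/2.
    print(cartesian_to_spherical(0.0, 1.0, 0.0))
```

The radius carries the distance information, while the two angles carry the direction, which is the separation the abstract refers to as unified modeling of spatial locations and directional angles.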

Paper Structure (21 sections, 8 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: SoPE enhances 3D spatial representation via geometry-aware positional encoding. Conventional positional encoding methods such as (a) RoPE and (b) RoPE-3D exhibit spatial perception bias: cross-modal attention concentrates on a few hotspots and fails to capture global context, resulting in less discriminative spatial features. In contrast, the proposed (c) SoPE produces more balanced and global attention patterns while amplifying spatial feature saliency, indicating a deeper understanding of the 3D environment.
  • Figure 2: Comparison of cross-modal attention patterns with RoPE, RoPE-3D, and SoPE. (a) With RoPE, attention is predominantly intra-modal, failing to establish effective connections between the text and the point cloud. (b) RoPE-3D yields marginal improvements in cross-modal attention but remains limited. (c) The proposed SoPE successfully establishes stronger cross-modal attention links and substantially enhances information flow between the two modalities.
  • Figure 3: SoPE: Spherical Coordinate-Based Positional Embedding for Multimodal 3D Scene Understanding. The figure illustrates the overall SpatialSoPE framework, where SoPE is integrated into SpatialLM. By transforming coordinates from Cartesian to spherical, SpatialSoPE mitigates spatial directional bias and enables more robust 3D object understanding and localization.
  • Figure 4: Effectiveness of the SoPE module demonstrated through component analysis. The figure compares various components added to the SpatialLM, highlighting the superior contribution of SoPE to model precision. Results are shown for (a) the ARKitScenes and (b) the SpatialLM datasets.
  • Figure 5: Visualization of improvements in 3D object detection with SpatialSoPE. Compared to SpatialLM, our method demonstrates significantly improved localization and sizing of 3D bounding boxes across various scenes.
  • ...and 2 more figures