CrossOver: 3D Scene Cross-Modal Alignment

Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni

TL;DR

CrossOver tackles flexible, scene-level cross-modal alignment for 3D environments by learning a unified, modality-agnostic embedding space across RGB images, point clouds, CAD models, floorplans, and text. It uses dimensionality-specific encoders (1D, 2D, 3D) and a three-stage training pipeline (instance-level interaction, scene-level interaction, and unified dimensionality encoders), coupled with a contrastive loss that tolerates missing modalities during training and inference. The method demonstrates strong cross-modal and same-modal retrieval, robust performance under missing data, and emergent modality relationships on ScanNet and 3RScan, indicating practical potential for robotics, AR/VR, and construction monitoring. Overall, CrossOver advances real-world multi-modal 3D scene understanding by removing the dependence of modality alignment on semantic annotations and enabling robust cross-modal reasoning in unpaired, imperfect data settings.

Abstract

Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities -- RGB images, point clouds, CAD models, floorplans, and text descriptions -- with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting the adaptability for real-world applications in 3D scene understanding.
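
The summary above mentions a contrastive loss that tolerates missing modalities. As a rough illustration only, the sketch below shows one way a pairwise contrastive loss could skip scenes where a modality is absent. It is not the paper's formulation: the symmetric InfoNCE form, the temperature value, the use of `None` to mark a missing modality, and the function names `info_nce` and `masked_pair_loss` are all assumptions made for this example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match positives[i]."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)

def masked_pair_loss(emb_a, emb_b):
    """Contrast two modalities using only scenes where both are present.

    A missing modality is marked with None (an assumption of this sketch).
    Returns None when fewer than two scenes have both modalities.
    """
    idx = [i for i in range(len(emb_a))
           if emb_a[i] is not None and emb_b[i] is not None]
    if len(idx) < 2:
        return None  # no negatives available to contrast against
    a = [emb_a[i] for i in idx]
    b = [emb_b[i] for i in idx]
    return 0.5 * (info_nce(a, b) + info_nce(b, a))

# Toy, already-normalized scene embeddings (made up); scene 2 has no point cloud.
points = [[1.0, 0.0], [0.0, 1.0], None]
floorplans = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
aligned_loss = masked_pair_loss(points, floorplans)
```

Under these assumptions, the third scene contributes nothing to the point cloud and floorplan term, but it could still contribute to terms involving other modalities it does have.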

Paper Structure

This paper contains 23 sections, 5 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: CrossOver is a cross-modal alignment method for 3D scenes that learns a unified, modality-agnostic embedding space, enabling a range of tasks. For example, given the 3D CAD model of a query scene and a database of reconstructed point clouds, CrossOver can retrieve the closest matching point cloud and, if object instances are known, it can identify the individual locations of furniture CAD models with matched instances in the retrieved point cloud, using brute-force alignment. This capability has direct applications in virtual and augmented reality.
  • Figure 2: Overview of CrossOver. Given a scene $\mathcal{S}$ and its instances $\mathcal{O}_i$ represented across different modalities $\mathcal{I}, \mathcal{P}, \mathcal{M}, \mathcal{R}, \mathcal{F}$, the goal is to align all modalities within a shared embedding space. The Instance-Level Multimodal Interaction module captures modality interactions at the instance level within the context of a scene. This is further enhanced by the Scene-Level Multimodal Interaction module, which jointly processes all instances to represent the scene with a single feature vector $\mathcal{F_S}$. The Unified Dimensionality Encoders eliminate dependency on precise semantic instance information by learning to process each scene modality independently while interacting with $\mathcal{F_S}$.
  • Figure 3: Cross-modal Scene Retrieval Inference Pipeline. Given a query modality ($\mathcal{P}$) that represents a scene, we use the corresponding dimensionality encoder to obtain its feature vector ($\mathcal{F}_{3D}$) in the shared cross-modal embedding space. We then identify the closest feature vector ($\mathcal{F}_{2D}$) in the target modality ($\mathcal{F}$) and retrieve the corresponding scene from a database of scenes in $\mathcal{F}$.
  • Figure 4: Cross-Modal Scene Retrieval Qualitative Results on ScanNet. Given a scene in query modality $\mathcal{F}$, we aim to retrieve the same scene in target modality $\mathcal{P}$. While PointBind and the Instance Baseline do not retrieve the correct scene within the top-4 matches, CrossOver identifies it as the top-1 match. Notably, temporal scenes appear close together in CrossOver’s embedding space (e.g., $k=2$, $k=3$), with retrieved scenes featuring similar object layouts to the query scene, such as the red couch in $k=4$.
  • Figure 5: Cross-Modal Scene Retrieval on ScanNet (Scene Matching Recall). Plots show the top 1, 5, 10, 20 scene matching recall of different methods on three modality pairs: $\mathcal{I} \rightarrow \mathcal{P}$, $\mathcal{I} \rightarrow \mathcal{R}$, $\mathcal{P} \rightarrow \mathcal{R}$. Ours and Instance Baseline have not been explicitly trained on $\mathcal{P} \rightarrow \mathcal{R}$. Results are computed on 306 scenes and showcase the superior performance of our approach. Once again, the difference between Ours and our self-baseline is attributed to the enhanced cross-modal scene-level interactions achieved with the unified encoders.
  • ...and 7 more figures
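
The retrieval step described in Figure 3 and the top-k scene matching recall reported in Figure 5 can be sketched in a few lines. This is a minimal illustration, assuming scene embeddings are already computed and compared by cosine similarity; the similarity measure, the function names (`cosine`, `retrieve`, `recall_at_k`), and the toy embeddings are assumptions for this example, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, database, k):
    """Return ids of the k database scenes most similar to the query.

    database maps scene id -> embedding in the target modality.
    """
    ranked = sorted(database,
                    key=lambda sid: cosine(query, database[sid]),
                    reverse=True)
    return ranked[:k]

def recall_at_k(queries, database, k):
    """Fraction of queries whose own scene id appears in the top-k results."""
    hits = sum(1 for sid, emb in queries.items()
               if sid in retrieve(emb, database, k))
    return hits / len(queries)

# Toy embeddings (made up): query modality vs. target modality.
queries = {"scene_a": [1.0, 0.1], "scene_b": [0.1, 1.0], "scene_c": [0.7, 0.6]}
database = {"scene_a": [0.9, 0.2], "scene_b": [0.0, 1.0], "scene_c": [1.0, 0.0]}

recall_1 = recall_at_k(queries, database, k=1)
recall_2 = recall_at_k(queries, database, k=2)
```

In this toy setup only `scene_b` is retrieved correctly at rank 1, while all three scenes are found within the top 2, which is why recall curves such as those in Figure 5 rise with k.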