Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

Gaia Di Lorenzo, Federico Tombari, Marc Pollefeys, Daniel Barath

TL;DR

Object-X tackles the challenge of compact, multi-modal 3D object representations that can be decoded into explicit geometry. It grounds object data in a 3D voxel grid to learn a structured latent, then compresses it to a fixed-size unstructured embedding (U-3DGS) that decodes into 3D Gaussian splats, while supporting auxiliary tasks such as localization and scene alignment. The method achieves high-fidelity novel-view synthesis and superior geometric accuracy compared to baselines, with storage reductions of 3–4 orders of magnitude. It enables fast, object-centric reasoning and scalable scene reconstruction, offering practical benefits for robotics and augmented reality.

Abstract

Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g. images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.

Paper Structure

This paper contains 19 sections, 4 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Object-X learns object-centric embeddings from an input object segmentation of a 3D scene reconstruction. The embeddings learned from multi-modal data (e.g., mesh, images, text descriptions) enable fast 3D Gaussian Splat reconstruction via a specifically trained decoder, and other downstream tasks operating directly in the latent space, such as localization and scene alignment. Object-X allows for representing the scene as a set of object descriptors without having to store storage-heavy representations like point clouds and image databases, while providing similar functionalities.
  • Figure 2: Overview of Object-X, learning object embeddings to reconstruct 3D Gaussians and support other tasks such as visual localization miao2024scenegraphloccrossmodalcoarsevisual. (a) The method takes a mesh or point cloud of an object along with posed images observing it. The canonical object space is voxelized based on object geometry, and DINOv2 features extracted from the images are assigned to each voxel. This produces a $64^3 \times 8$ structured latent (SLat) representation xiang2024structured3dlatentsscalable. (b) The SLat is further compressed into a $16^3 \times 8$ U-3DGS embedding using a 3D U-Net. The embedding is trained with a masked mean squared error loss to ensure accurate reconstruction of the SLat, which in turn enables decoding into 3D Gaussians using standard photometric losses. (c) Additional task-specific losses, such as those for visual localization miao2024scenegraphloccrossmodalcoarsevisual, can be incorporated to optimize the embedding for multiple objectives.
  • Figure 3: The proposed Object-X learns per-object embeddings that, beyond object-wise 3DGS reconstruction, benefit several downstream tasks: cross-modal visual localization miao2024scenegraphloccrossmodalcoarsevisual (via image-to-object matching), 3D scene alignment sarkar2023sgaligner3dscene (via object-to-object matching), and full-scene reconstruction by integrating per-object Gaussian primitives.
  • Figure 4: Object reconstructions. Each row shows an input object (left) and its reconstruction obtained by, from left to right: (i) 3DGS 3DSSG2020 optimized on all images, (ii) 3DGS or (iii) 2DGS zhu20232dgs using only 12 multi-view images, and (iv) Object-X. For each method, we present a rendered image from the reconstructed 3D Gaussians and the corresponding mesh.
  • Figure 6: Object reconstructions. Each row shows an input object (left) and its reconstruction obtained by, from left to right: (i) 3DGS 3DSSG2020 optimized on all images, (ii) 3DGS or (iii) 2DGS zhu20232dgs using only 12 multi-view images, and (iv) Object-X. For each method, we present a rendered image from the reconstructed 3D Gaussians and the corresponding mesh.
  • ...and 6 more figures
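
The Figure 2 caption gives two concrete grid sizes ($64^3 \times 8$ for the SLat, $16^3 \times 8$ for the U-3DGS embedding) and names a masked mean squared error as the reconstruction loss. The sketch below is only an illustration of those two points, not the authors' implementation. The function names `num_values` and `masked_mse` are hypothetical, the grids are treated as dense (an upper bound if only occupied voxels are stored), and the meaning of the mask (True for voxels that carry object geometry) is an assumption, since the caption does not define it.

```python
from typing import Sequence

# Grid sizes taken from the Figure 2 caption.
SLAT_RES, EMBED_RES, CHANNELS = 64, 16, 8


def num_values(resolution: int, channels: int) -> int:
    """Number of scalars in a dense resolution^3 x channels grid."""
    return resolution ** 3 * channels


def masked_mse(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    mask: Sequence[bool],
) -> float:
    """Mean squared error restricted to voxels where mask is True.

    pred, target: one feature vector per voxel (flattened grid).
    mask: one flag per voxel; True is ASSUMED to mean "occupied voxel".
    Returns 0.0 when no voxel is selected.
    """
    total = 0.0
    count = 0
    for p, t, keep in zip(pred, target, mask):
        if not keep:
            continue  # masked-out voxels contribute nothing to the loss
        total += sum((a - b) ** 2 for a, b in zip(p, t))
        count += len(p)
    return total / count if count else 0.0


if __name__ == "__main__":
    slat = num_values(SLAT_RES, CHANNELS)
    embed = num_values(EMBED_RES, CHANNELS)
    print(f"SLat values: {slat}, embedding values: {embed}, ratio: {slat // embed}")

    # Toy example with 3 voxels and 2 channels; the middle voxel is masked out.
    pred = [[1.0, 2.0], [0.0, 0.0], [3.0, 3.0]]
    target = [[1.0, 0.0], [5.0, 5.0], [3.0, 1.0]]
    print(masked_mse(pred, target, [True, False, True]))
```

Under the dense-grid assumption, the $16^3 \times 8$ embedding holds 32,768 values against 2,097,152 for the $64^3 \times 8$ SLat, a factor of 64. In the toy example the masked-out voxel has a large error that is ignored, which is the point of masking: the loss only measures reconstruction quality where the mask selects voxels.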