E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

Adeela Islam, Stefano Fiorini, Manuel Lecha, Theodore Tsesmelis, Stuart James, Pietro Morerio, Alessio Del Bue

TL;DR

E-M3RF tackles the 3D fracture reassembly problem by fusing rotation-aware geometric features with per-point color cues through an $SO(3)$-equivariant backbone and a color transformer. It predicts fragment poses via SE(3) flow matching on a conditional Riemannian manifold and enforces physical plausibility with a differentiable no-overlap loss. The approach generalizes across synthetic and real-world datasets, outperforming geometry-only baselines on rotation, translation, and Chamfer Distance metrics, and it benefits from both color cues and symmetry-aware encoding. These results highlight the potential of multimodal, symmetry-respecting frameworks for accurate and physically plausible 3D reassembly in archaeology and cultural heritage.

Abstract

3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been tackled with deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, these methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, existing solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input point clouds of fractured fragments, containing both point positions and colors, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, and ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious. On the RePAIR dataset, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4% compared to competing methods.
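
The abstract reports rotation error, translation error, and Chamfer Distance without defining them here. As a rough illustration only, the sketch below uses common textbook forms: the geodesic angle between rotation matrices, the Euclidean distance between translation vectors, and a symmetric mean-squared Chamfer distance. These definitions and all function names are assumptions for illustration; the paper's evaluation protocol may differ (for example in normalization or units).

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point lists (brute force).

    Assumed form: mean squared nearest-neighbour distance, summed over
    both directions. Not necessarily the paper's exact definition.
    """
    def one_sided(src, dst):
        # For each source point, squared distance to its nearest destination point.
        return sum(
            min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
            for s in src
        ) / len(src)
    return one_sided(a, b) + one_sided(b, a)

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotations (nested lists)."""
    # trace(R_pred^T R_gt) equals the element-wise product summed over all entries.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(t_pred, t_gt)))
```

For instance, comparing the identity with a quarter turn about the z-axis through `rotation_error_deg` gives an angle of 90 degrees, and two identical point sets have a Chamfer distance of zero.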

Paper Structure

This paper contains 27 sections, 52 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: E-M3RF overview. Fragments are represented as colored point clouds with per-point features $(x,y,z,\ \mathrm{rgb},\ \mathbf{n})$. A multimodal $\mathrm{SE}(3)$-equivariant backbone fuses geometry and color and predicts per-fragment rigid transforms $(R,t)$ via flow-based estimation, while a differentiable non-overlap loss enforces physical plausibility. The predicted poses reassemble the object, aligning fracture boundaries and color patterns into a coherent result.
  • Figure 2: E-M3RF's pipeline. A set of textured fragments (left) is loaded as colored point clouds (with per-point RGB). Two Transformer encoders extract features: an $\mathrm{SO}(3)$-equivariant Transformer extracts fracture-aware geometric features, so that representations transform consistently under motions (rotation/translation), and a Transformer over per-point colors extracts dense color descriptors. The two streams are concatenated into local point-cloud features (xyz/rgb/normals tokens); the geometry stream is guided by a fracture-segmentation head during pretraining. Using these fused features, we perform flow matching on $\mathrm{SE}(3)$: a time-dependent vector field $\psi_t$ transports fragment poses from a noisy initialization $x_1$ through intermediate states to the assembled configuration $x_0$, yielding per-fragment transforms $(R,t)$. During training, a non-overlap loss penalizes interpenetration to enforce physically plausible assemblies (omitted from the illustration for clarity).
  • Figure 3: Geometry/color feature extraction and fusion. The input point cloud $P$ of fragment $\mathcal{F}$ (per-point $xyz$, $rgb$, and normals $F_{\text{n}}$) is split into $F_{\text{geo}}$ and $F_{\text{rgb}}$. A VN-Transformer encoder ($SO(3)$-equivariant via vector neurons) produces geometric features $\Phi_{\text{geo}}$, while a standard Transformer encoder produces color features $\Phi_{\text{rgb}}$. The features are concatenated per point to yield $\Phi_{\text{enc}} = [\Phi_{\text{geo}} \,\Vert\, \Phi_{\text{rgb}}\,]$, which serves as the local point-cloud representation for downstream pose estimation.
  • Figure 4: Qualitative comparisons on RePAIR, Presious, and Breaking Bad (BB). E-M3RF consistently produces more accurate re-assemblies, especially on the Presious scenes, demonstrating strong generalization to unseen objects. Green circles denote fine, ambiguous contact regions correctly recovered by our method. Additional results are available in the supplementary material.
  • Figure 5: GPU memory consumption on the RePAIR dataset as a function of the varying number of sampled points (5K–20K).
  • ...and 4 more figures
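
The transport described in the Figure 2 caption, from a noisy initialization $x_1$ to the assembled configuration $x_0$, can be illustrated with a heavily simplified sketch. It covers only the translation component in flat Euclidean space, uses a straight-line conditional path, and replaces the learned vector field $\psi_t$ with an oracle that already knows $x_0$. The paper's method operates on $\mathrm{SE}(3)$ and handles rotations on the manifold, which this sketch does not attempt; all function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Time derivative of the straight path: d x_t / dt = x1 - x0.

    In flow matching, a network would be regressed onto this target.
    """
    return [b - a for a, b in zip(x0, x1)]

def integrate_to_assembly(x1, velocity_fn, steps=10):
    """Euler integration from t = 1 (noise) down to t = 0 (assembled)."""
    x = list(x1)
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = velocity_fn(x, t)
        # Step backwards in time, hence the minus sign.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy usage: an oracle field stands in for the learned psi_t (assumption).
x0 = [0.0, 0.0, 0.0]       # assembled translation of one fragment
x1 = [1.0, -2.0, 0.5]      # noisy initialization
oracle = lambda x, t: target_velocity(x0, x1)
recovered = integrate_to_assembly(x1, oracle)
```

With the oracle field, `recovered` matches `x0` up to floating-point round-off. With a learned field the integration would only approximate the assembled pose, and the number of Euler steps becomes an accuracy/cost trade-off.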

Theorems & Definitions (12)

  • Definition A.1: Group
  • Definition A.2: Symmetric group $S_N$
  • Definition A.3: General Linear Group GL(V)
  • Definition A.4: Special Orthogonal Group $SO(V)$
  • Definition A.6: Special Orthogonal Group SO(3)
  • Definition A.7: Special Euclidean Group SE(V)
  • Definition A.8: Special Euclidean Group SE(3)
  • Definition A.10: Linear Group Action
  • Definition A.11: Linear Group Action
  • Definition A.12: Standard Representation of $S_N$
  • ...and 2 more
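
The appendix definitions of SO(3) and SE(3) are not reproduced on this page. As a small sketch based on the standard textbook definitions (a rotation $R$ satisfies $R^\top R = I$ and $\det R = 1$; a rigid motion $(R, t)$ acts on a point as $p \mapsto Rp + t$), the following helpers check SO(3) membership and compose rigid transforms. The function names are illustrative and are not taken from the paper.

```python
import math

def matmul3(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose3(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def det3(A):
    """Determinant via cofactor expansion along the first row."""
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

def is_so3(R, tol=1e-9):
    """True if R^T R = I and det(R) = 1, within a numerical tolerance."""
    RtR = matmul3(transpose3(R), R)
    orthogonal = all(abs(RtR[i][j] - (1.0 if i == j else 0.0)) < tol
                     for i in range(3) for j in range(3))
    return orthogonal and abs(det3(R) - 1.0) < tol

def apply_se3(R, t, p):
    """Apply the rigid motion (R, t) to point p: R p + t."""
    return [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]

def compose_se3(R2, t2, R1, t1):
    """Compose two rigid motions: apply (R1, t1) first, then (R2, t2)."""
    # R2 (R1 p + t1) + t2 = (R2 R1) p + (R2 t1 + t2)
    return matmul3(R2, R1), apply_se3(R2, t2, t1)

def rot_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

A reflection such as `[[1, 0, 0], [0, 1, 0], [0, 0, -1]]` is orthogonal but has determinant -1, so `is_so3` rejects it; this is the distinction between the orthogonal group and the special orthogonal group.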