VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning

Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li

TL;DR

VibraVerse addresses the lack of physically grounded multimodal data by building a large-scale dataset that rigorously links 3D geometry, material properties, modal spectra, and synthesized sounds through finite-element modal analysis. It introduces CLASP, a physics-aligned contrastive framework that unifies geometry, image, and audio representations in a shared physics-grounded space, and defines benchmark tasks for geometry-to-sound synthesis, sound-guided reconstruction, and cross-modal retrieval. The dataset comprises over 40K objects with watertight volumetric meshes, material labels, eigenpairs, and impulse-generated audio, enabling physically interpretable learning and evaluation of causal geometry–acoustics relationships. Experimental results demonstrate improved accuracy, interpretability, and cross-modal generalization on tasks such as data-driven synthesis, audio-guided geometry reconstruction, and cross-modal retrieval, while highlighting potential for physics-informed neural networks and sim-to-real research. Limitations include reliance on synthetic data under idealized conditions and the need for real-world validation of modal properties and acoustics for robust generalization.

Abstract

Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
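
The abstract does not spell out the CLASP objective, so the following is only a minimal sketch of a generic CLIP-style symmetric InfoNCE loss, the usual starting point for cross-modal contrastive alignment. The function names, the temperature value of 0.07, and the use of plain Python lists for embeddings are illustrative assumptions and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)

def symmetric_info_nce(shape_emb, audio_emb, temperature=0.07):
    """Hypothetical shape<->audio contrastive loss (assumed form, not CLASP itself).

    shape_emb[i] and audio_emb[i] are embeddings of the same object;
    every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in shape_emb]
    b = [l2_normalize(v) for v in audio_emb]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the shape-to-audio and audio-to-shape directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each shape embedding matches its own audio embedding and large when the pairs are swapped, which is the behavior a cross-modal alignment objective of this kind relies on. A third modality such as images would add further pairwise terms of the same form.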

Paper Structure

This paper contains 25 sections, 10 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of our framework for physically-consistent geometry–acoustics learning. (a) We build a large-scale physically-consistent dataset comprising over 40K objects, each annotated with images, 3D geometry, materials, eigenvalues, and physically synthesized audio. All data are generated under unified physical parameters to ensure geometry–material–acoustics consistency. (b) Using finite-element modal analysis, we derive eigenfrequencies and modal sounds, aligning each object’s geometry and material with its intrinsic acoustic response in a shared physics-grounded latent space. (c) This physically-consistent dataset serves as the foundation for multimodal learning and reasoning, enabling cross-modal alignment and retrieval, geometry-to-audio synthesis, and audio-guided 3D reconstruction. Together, the dataset and inference tasks establish a benchmark for physically-grounded multimodal understanding and sound-driven 3D reasoning, serving as a bridge toward physically interpretable understanding of the physical world.
  • Figure 2: The VibraVerse dataset comprises a diverse collection of objects spanning a wide range of physical materials (bottom). Each object is defined by its physical parameters, which are utilized to synthesize corresponding eigenfrequencies, eigenmodes, and modal sounds (top). This process establishes a physically grounded correspondence linking object geometry, material properties, and acoustic signatures.
  • Figure 3: Pipeline for generating our VibraVerse dataset. Meshes from Objaverse and text-to-3D generation are filtered and then tetrahedralized, assigned material parameters, and analyzed via finite-element modal analysis to obtain eigenvalues and damping factors. An additive synthesizer then produces corresponding modal sounds, forming physically consistent geometry–acoustics pairs.
  • Figure 4: Sound-Guided Shape Reconstruction. Given an initial voxel shape, the audio eigenvalues, and material properties, we reconstruct the 3D geometry in a single forward pass.
  • Figure 5: Results of audio-guided reconstruction. From left to right: initial shapes, DiffSound results, our results, and the ground truth. The IoU metric is shown below each shape.
  • ...and 3 more figures
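
The last stage of the pipeline in Figure 3 (eigenvalues and damping factors fed to an additive synthesizer) can be sketched as a sum of exponentially decaying sinusoids. This is a rough illustration under stated assumptions, not the authors' implementation: the Rayleigh damping model with coefficients `alpha` and `beta`, the per-mode `gains`, and the function names are all hypothetical, and the eigenvalues are assumed to be squared angular frequencies from the generalized eigenproblem K u = ω² M u.

```python
import math

def modal_frequencies_hz(eigenvalues):
    """Convert eigenvalues (omega^2, in rad^2/s^2) to frequencies in Hz."""
    return [math.sqrt(lam) / (2.0 * math.pi) for lam in eigenvalues]

def synthesize_impact(eigenvalues, gains, alpha=5.0, beta=1e-7,
                      sample_rate=8000, duration=0.5):
    """Additive modal synthesis of an impact sound (assumed Rayleigh damping).

    Each mode contributes gain * exp(-decay * t) * sin(omega_d * t).
    """
    modes = []
    for lam, gain in zip(eigenvalues, gains):
        omega = math.sqrt(lam)                     # undamped angular frequency
        xi = 0.5 * (alpha / omega + beta * omega)  # Rayleigh modal damping ratio
        if xi >= 1.0:
            continue                               # overdamped modes do not ring
        decay = xi * omega                         # exponential decay rate (1/s)
        omega_d = omega * math.sqrt(1.0 - xi * xi)  # damped angular frequency
        modes.append((gain, decay, omega_d))

    n_samples = int(sample_rate * duration)
    samples = []
    for k in range(n_samples):
        t = k / sample_rate
        samples.append(sum(g * math.exp(-d * t) * math.sin(w * t)
                           for g, d, w in modes))
    return samples

# A single mode tuned to 440 Hz, struck with unit gain.
lam_440 = (2.0 * math.pi * 440.0) ** 2
freqs = modal_frequencies_hz([lam_440])
sound = synthesize_impact([lam_440], [1.0])
```

In a full pipeline the gains would depend on the mode shapes at the strike location and on the excitation force, which is why eigenvectors are stored alongside eigenvalues; here they are supplied directly to keep the sketch short.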