VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning

Bo Pang; Chenxi Xu; Jierui Ren; Guoping Wang; Sheng Li

TL;DR

VibraVerse is a large-scale dataset that addresses the lack of physically grounded multimodal data by linking 3D geometry, material properties, modal spectra, and synthesized sounds through finite-element modal analysis. The work also introduces CLASP, a physics-aligned contrastive framework that unifies geometry, image, and audio representations in a shared physics-grounded space, and defines benchmark tasks for geometry-to-sound synthesis, sound-guided reconstruction, and cross-modal retrieval. The dataset comprises over 40K objects with watertight volumetric meshes, material labels, eigenpairs, and impulse-generated audio, enabling physically interpretable learning and evaluation of causal geometry-acoustics relationships. Experimental results demonstrate improved accuracy, interpretability, and cross-modal generalization on these benchmark tasks, and point to potential uses in physics-informed neural networks and sim-to-real research. Limitations include the reliance on synthetic data generated under idealized conditions and the need to validate modal properties and acoustics against real-world measurements before robust generalization can be claimed.
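
To make the idea of a shared embedding space for geometry, image, and audio more concrete, the sketch below shows a generic CLIP-style symmetric InfoNCE loss averaged over the three modality pairs. This is an assumption for illustration only: the summary above does not specify CLASP's actual objective, and the function names, the temperature value, and the equal weighting of the three pairs are all hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # a -> b direction
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def tri_modal_loss(shape_emb, image_emb, audio_emb, temperature=0.07):
    """Average the pairwise losses over shape-image, shape-audio, image-audio.

    Equal weighting of the three pairs is an assumption, not taken from the source.
    """
    return (
        info_nce(shape_emb, image_emb, temperature)
        + info_nce(shape_emb, audio_emb, temperature)
        + info_nce(image_emb, audio_emb, temperature)
    ) / 3.0
```

Each input is a `(batch, dim)` array in which row `i` of every modality describes the same object. The loss is low when matching rows are close and all other rows in the batch are far apart, which is the property cross-modal retrieval relies on.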

Abstract

Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, which focus on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry → physical attributes → modal parameters → acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To maintain this physical coherence across modalities, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validation on these tasks demonstrates that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
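
As a rough illustration of the chain from geometry and material to modal parameters to an impact sound, the sketch below runs a toy version in NumPy. It is not the dataset's pipeline: a lumped mass-spring system stands in for the finite-element mesh, and the function names, the Rayleigh damping coefficients `alpha` and `beta`, and the sample rate are assumptions chosen for illustration. It solves the generalized eigenproblem K u = ω² M u and then sums damped sinusoids, one per mode.

```python
import numpy as np


def modal_analysis(stiffness, masses):
    """Solve K u = w^2 M u for a lumped (diagonal) mass matrix.

    stiffness: (n, n) symmetric stiffness matrix K
    masses:    (n,) diagonal entries of M
    Returns angular frequencies (ascending) and mass-normalized mode shapes.
    """
    inv_sqrt_m = 1.0 / np.sqrt(masses)
    # Symmetric standard form: (M^-1/2 K M^-1/2) v = w^2 v, with u = M^-1/2 v.
    a = stiffness * np.outer(inv_sqrt_m, inv_sqrt_m)
    eigvals, v = np.linalg.eigh(a)
    modes = v * inv_sqrt_m[:, None]  # columns satisfy U^T M U = I
    omegas = np.sqrt(np.clip(eigvals, 0.0, None))
    return omegas, modes


def synthesize_impact(omegas, modes, hit_node, listen_node,
                      alpha=5.0, beta=1e-7, sr=44100, duration=1.0):
    """Response at `listen_node` to a unit impulse applied at `hit_node`.

    Rayleigh damping (C = alpha * M + beta * K) gives each mode the decay
    rate 0.5 * (alpha + beta * w^2); alpha and beta are illustrative values.
    """
    t = np.arange(int(sr * duration)) / sr
    decay = 0.5 * (alpha + beta * omegas**2)
    damped_sq = omegas**2 - decay**2
    ringing = damped_sq > 0.0  # skip overdamped modes, which do not oscillate
    omega_d = np.sqrt(damped_sq[ringing])
    # Gain of each mode: excitation at the hit point times pickup at the listener.
    gains = modes[hit_node, ringing] * modes[listen_node, ringing] / omega_d
    envelope = np.exp(-decay[ringing][:, None] * t)
    return (gains[:, None] * envelope * np.sin(omega_d[:, None] * t)).sum(axis=0)
```

For example, a chain of eight 10 g masses joined by springs of stiffness 4e6 N/m (a tridiagonal `stiffness` matrix with `2k` on the diagonal and `-k` beside it) can be passed to `modal_analysis`, and the resulting eigenpairs to `synthesize_impact`, to obtain one second of decaying audio. In this toy, changing the masses or the stiffness shifts the eigenfrequencies and therefore the sound, a small-scale analogue of the geometry-to-acoustics dependence the dataset is built around.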
