EBind: a practical approach to space binding
Jim Broadbent, Felix Cohen, Frederik Hvilshøj, Eric Landau, Eren Sasoglu
TL;DR
EBind tackles the problem of binding embedding spaces across text, image, video, audio, and point clouds (PCs) with a resource-efficient, data-centric approach. It uses a simple architecture with one frozen encoder per modality and small MLP projections, trained with a three-tier data strategy that combines auto-paired quintuples, human-verified triples, and pre-existing captioned data. With a compact 1.8B-parameter model, EBind achieves state-of-the-art performance on 13 benchmarks and introduces EShot, a high-quality zero-shot classification benchmark between audio and point clouds. Training runs on a single GPU in a few hours, and the code, model weights, and data are released openly. The work emphasizes data quality and training efficiency as a practical, scalable path to accessible multimodal binding for retrieval and cross-modal understanding.
Abstract
We simplify space binding by focusing on two core components: a single encoder per modality and high-quality data. This enables training state-of-the-art models on a single GPU in a few hours rather than multiple days. We present EBind, an Easy, data-centric, and parameter-efficient method to Bind the embedding spaces of multiple contrastive models. We demonstrate that a simple 1.8B-parameter image-text-video-audio-3D model can outperform models 4 to 17x its size. The key is a carefully curated dataset built from three complementary data sources: i) 6.7M fully automated multimodal quintuples sourced via state-of-the-art (SOTA) retrieval models, ii) 1M diverse, semi-automated triples annotated by humans as negative, partial, or positive matches, and iii) 3.4M pre-existing captioned data items. We use 13 different evaluations to demonstrate the value of each data source. Because of limitations in existing benchmarks, we further introduce the first high-quality, consensus-annotated zero-shot classification benchmark between audio and point clouds (PCs). In contrast to related work, we will open-source our code, model weights, and datasets.
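As a rough illustration of the recipe summarized above (a frozen encoder per modality, a small trainable projection, and contrastive alignment), the following minimal Python sketch passes a precomputed embedding through a two-layer MLP and scores a batch of paired embeddings with a symmetric InfoNCE loss. The function names, the two-layer ReLU shape, the choice of InfoNCE, and the temperature of 0.07 are illustrative assumptions rather than details stated in this text; the frozen encoders are represented only by their output vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mlp_project(x, w1, b1, w2, b2):
    """Hypothetical projection head: Linear -> ReLU -> Linear -> L2-normalize.

    x is the output of a frozen encoder; only w1, b1, w2, b2 would be trained.
    Each row of w1 / w2 holds the input weights of one output unit.
    """
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    out = [sum(hi * wi for hi, wi in zip(hidden, row)) + b
           for row, b in zip(w2, b2)]
    return l2_normalize(out)


def info_nce(anchors, others, temperature=0.07):
    """Symmetric InfoNCE over a batch; anchors[i] is paired with others[i].

    Assumed loss for illustration: inputs are unit-length vectors, so the
    dot product is the cosine similarity.
    """
    n = len(anchors)
    logits = [[sum(a * b for a, b in zip(anchors[i], others[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the target on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))
```

In this sketch, a batch whose projected embeddings line up with their partners yields a loss near zero, while a batch with mismatched pairs yields a large loss, which is the signal that would drive updates to the projection weights.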
