UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, Lin Wang
TL;DR
UniBind tackles the challenge of unbalanced, image-centered multi-modal representations by learning a modality-agnostic binding space guided by an LLM-generated knowledge base of category and multi-modal descriptions. It constructs embedding centers $EC_i = \{ z_i^1, \dots, z_i^{50} \}$ from text embeddings and aligns all modalities to these centers via an LLM-augmented contrastive loss, yielding a unified representation across seven modalities. The approach achieves zero-shot gains of $6.36\%$ on average (and $6.27\%$ in reported experiments) and state-of-the-art fine-tuning gains such as $6.75\%$ on ImageNet while reducing learnable parameters by $90\%$. It also yields large cross-modal retrieval improvements (e.g., $+17.96\%$ recall at top-20). UniBind is compatible with CLIP-style backbones and extends to the novel event modality, offering robust cross-modal understanding with practical efficiency.
Abstract
We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities: images, text, audio, point cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat the image as the central modality and build an image-centered representation space; however, this space may be sub-optimal because it is unbalanced across modalities. Moreover, category names are used directly to extract text embeddings for downstream tasks, which can hardly represent the semantics of multi-modal data. The 'out-of-the-box' insight of UniBind is to make the alignment center modality-agnostic and to learn a unified and balanced representation space, empowered by large language models (LLMs). UniBind can be applied flexibly to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding centers on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding centers via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance, outperforming prior art by an average of 6.36%. Finally, we achieve new state-of-the-art performance, e.g., a 6.75% gain on ImageNet, in the multi-modal fine-tuning setting while reducing the learnable parameters by 90%.
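To make the idea of aligning an embedding to class-wise embedding centers concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: each center is modeled as a small list of text embeddings, the similarity to a center is taken as the mean cosine similarity over its members, and the loss is a standard softmax cross-entropy over centers with a temperature of 0.07. The function names, the mean aggregation, the temperature value, and the toy 2-D vectors are all hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center_similarity(x, center):
    """Mean cosine similarity between x and every text embedding in one center.

    Assumption: mean aggregation; the paper may aggregate differently.
    """
    return sum(cosine(x, z) for z in center) / len(center)

def center_contrastive_loss(x, centers, label, temperature=0.07):
    """Softmax cross-entropy that pulls x toward centers[label] and away from the rest.

    Assumption: a generic InfoNCE-style loss, used here only for illustration.
    """
    logits = [center_similarity(x, c) / temperature for c in centers]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[label]

def predict(x, centers):
    """Zero-shot style prediction: index of the most similar center."""
    sims = [center_similarity(x, c) for c in centers]
    return max(range(len(centers)), key=sims.__getitem__)

# Toy data: two classes, each center holding two text embeddings.
centers = [
    [[1.0, 0.0], [0.9, 0.1]],  # class 0
    [[0.0, 1.0], [0.1, 0.9]],  # class 1
]
sample_embedding = [0.8, 0.2]  # stands in for an encoded sample from any modality

predicted_class = predict(sample_embedding, centers)
loss_correct = center_contrastive_loss(sample_embedding, centers, label=0)
loss_wrong = center_contrastive_loss(sample_embedding, centers, label=1)
```

Because the centers are built from text embeddings rather than from image embeddings, the same `predict` and loss functions apply unchanged to an embedding from any modality, which is the sense in which the alignment target is modality-agnostic.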
