UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, Lin Wang

TL;DR

UniBind tackles the challenge of unbalanced, image-centered multi-modal representations by learning a modality-agnostic binding space guided by an LLM-generated knowledge base of category and multi-modal descriptions. It constructs embedding centers $EC_i = \{ z_i^1, \dots, z_i^{50} \}$ from text embeddings and aligns all modalities to these centers via an LLM-augmented contrastive loss, enabling a unified representation across seven modalities. The approach achieves zero-shot gains of $6.36\%$ on average (and $6.27\%$ in reported experiments) and state-of-the-art fine-tuning gains such as $6.75\%$ on ImageNet, while reducing learnable parameters by $90\%$, and yields large cross-modal retrieval improvements (e.g., $+17.96\%$ recall at top-20). UniBind is compatible with CLIP-style backbones and extends to the novel event modality, offering robust cross-modal understanding with practical efficiency.
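
To make the idea of an embedding center concrete, the following is a minimal sketch of classifying a query embedding against class-wise centers, each a set of 50 text embeddings as in $EC_i$ above. The function names, the synthetic data, and the choice to score a class by its mean cosine similarity are illustrative assumptions; the summary does not specify how the paper aggregates similarities within a center.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def predict_with_centers(query, centers):
    """Pick the class whose embedding center is most similar to the query.

    query:   (d,) embedding from any modality encoder (hypothetical input).
    centers: (C, K, d) array, K text embeddings per class (K = 50 here).
    Scoring by the mean similarity over the K embeddings is an assumption.
    """
    q = l2_normalize(query)
    c = l2_normalize(centers)
    sims = c @ q                 # (C, K) cosine similarities
    scores = sims.mean(axis=1)   # (C,) one score per class
    return int(np.argmax(scores)), scores

# Synthetic stand-ins for encoder outputs: three well-separated classes.
rng = np.random.default_rng(0)
num_classes, k, dim = 3, 50, 16
prototypes = np.eye(num_classes, dim) * 5.0
centers = prototypes[:, None, :] + 0.1 * rng.standard_normal((num_classes, k, dim))
query = prototypes[2] + 0.1 * rng.standard_normal(dim)

pred, scores = predict_with_centers(query, centers)
print(pred)
```

Because the query is scored against text-derived centers rather than against image embeddings, the same function applies unchanged to any modality whose encoder maps into the shared space.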

Abstract

We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities: images, text, audio, point cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat the image as the central modality and build an image-centered representation space; however, this space may be sub-optimal because it is unbalanced across modalities. Moreover, category names are used directly to extract text embeddings for the downstream tasks, which makes it hard to represent the semantics of multi-modal data. The 'out-of-the-box' insight of UniBind is to make the alignment center modality-agnostic and further learn a unified and balanced representation space, empowered by large language models (LLMs). UniBind can be applied flexibly to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding centers on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding centers via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior art by an average of 6.36%. Finally, we achieve new state-of-the-art performance, e.g., a 6.75% gain on ImageNet, in the multi-modal fine-tuning setting while reducing the learnable parameters by 90%.
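
Step 3 of the abstract, aligning embeddings to class-wise centers via contrastive learning, can be sketched as follows. This is a generic InfoNCE-style loss, not necessarily the exact objective used in the paper. It assumes each class center has already been reduced to a single vector (for example, an average of its text embeddings), and the temperature value and all names are hypothetical.

```python
import numpy as np

def center_contrastive_loss(embeddings, labels, centers, temperature=0.07):
    """InfoNCE-style loss pulling each embedding toward its class center.

    embeddings: (N, d) outputs of any modality encoder (hypothetical input).
    labels:     (N,) integer class index for each embedding.
    centers:    (C, d) one vector per class; reducing a center to a single
                vector is an assumption made for this sketch.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    logits = z @ c.T / temperature                    # (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the own-class center as the positive.
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Synthetic check: embeddings near their own center give a lower loss
# than embeddings placed near the wrong center.
rng = np.random.default_rng(0)
centers = np.eye(3, 8)
labels = np.array([0, 1, 2, 0, 1, 2])
noise = 0.05 * rng.standard_normal((6, 8))
aligned = centers[labels] + noise
misaligned = centers[(labels + 1) % 3] + noise

loss_aligned = center_contrastive_loss(aligned, labels, centers)
loss_misaligned = center_contrastive_loss(misaligned, labels, centers)
print(loss_aligned < loss_misaligned)
```

Since the positives are centers rather than paired images, no single modality acts as the anchor, which is the sense in which the alignment target is modality-agnostic.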

Paper Structure

This paper contains 28 sections, 8 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: (a) By making the alignment center modality-agnostic, our UniBind can learn a unified and balanced representation space. (b) The embedding centers for each semantic category: these centers exhibit more complementary semantics compared to embeddings solely encoded by category names.
  • Figure 2: An overview of our UniBind. First, we construct the knowledge base and then learn a unified representation space via LLM-augmented contrastive learning. Lastly, we use the embedding centers localized by the knowledge base to obtain the predictions.
  • Figure 3: Knowledge Base. Generation pipeline for category descriptions (left) and multi-modal data descriptions (right).
  • Figure 4: (a) The details for our embedding center localization. (b) The impact of our embedding center localization is demonstrated.
  • Figure 5: Top-5 results for text-to-event and text-to-image retrieval. We use ["A photo of a [category]"] as the query to retrieve events and images in the same embedding space.
  • ...and 9 more figures