
The Indra Representation Hypothesis for Multimodal Alignment

Jianglin Lu, Hailing Wang, Kuo Yang, Yitian Zhang, Simon Jenni, Yun Fu

Abstract

Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.
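The abstract instantiates the Indra representation with angular distance: each sample is described not by its raw embedding but by its relational profile of distances to other samples. A minimal sketch of that idea, assuming L2-normalizable embeddings and a shared set of anchor samples (the function name and anchor choice are illustrative, not the authors' exact implementation):

```python
import numpy as np

def indra_representation(embeddings, anchors):
    """Relational profile: angular distance from each sample to each anchor.

    Hypothetical sketch of the paper's angular-distance instantiation,
    not the reference implementation from the linked repository.
    """
    # L2-normalize so inner products are cosine similarities.
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    cos = np.clip(E @ A.T, -1.0, 1.0)   # cosine similarity in [-1, 1]
    return np.arccos(cos) / np.pi       # angular distance in [0, 1]

# Two unimodal models embed the same samples in different spaces; their
# Indra representations of those samples live in one relational space.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 32))    # e.g. vision embeddings, 8 samples
anchors = X[:4]                 # profile each sample against shared anchors
R = indra_representation(X, anchors)
print(R.shape)  # (8, 4)
```

Because the profile depends only on pairwise angles, it is invariant to the rotation and scale of a particular model's embedding space, which is what makes training-free cross-model comparison plausible.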

Paper Structure

This paper contains 15 sections, 6 theorems, 6 equations, 4 tables.

Key Result

Lemma 1

Let $\mathcal{C}$ be a locally small category, $A$ be an object in $\mathcal{C}$, and $F: \mathcal{C} \to \textbf{Set}$ be a functor from $\mathcal{C}$ to the category of sets. Then, there exists a bijection, natural in both $A$ and $F$, between the set of natural transformations from the hom-functor $\mathrm{Hom}_{\mathcal{C}}(A, -)$ to $F$ and the set $F(A)$, i.e., $\mathrm{Nat}\big(\mathrm{Hom}_{\mathcal{C}}(A, -),\, F\big) \cong F(A)$.

Theorems & Definitions (9)

  • Lemma 1: Yoneda Lemma (Kelly, 1982; Riehl, 2017)
  • Corollary 1: Yoneda Embedding (Kelly, 1982; Riehl, 2017)
  • Definition 1: Sample Category
  • Definition 2: $\mathcal{V}$-enriched Yoneda embedding
  • Theorem 1
  • Definition 3: Indra Representation
  • Proposition 1
  • Theorem 2
  • Corollary 2
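Definition 2's $\mathcal{V}$-enriched Yoneda embedding is only named in the list above; for context, the textbook construction it builds on can be sketched as follows (standard notation from enriched category theory, not necessarily the paper's exact symbols):

```latex
% Ordinary Yoneda embedding: a fully faithful functor into presheaves,
% sending each object to the functor of maps into it.
y : \mathcal{C} \longrightarrow \mathbf{Set}^{\mathcal{C}^{\mathrm{op}}},
\qquad y(A) = \mathrm{Hom}_{\mathcal{C}}(-, A).

% V-enriched version: hom-sets are replaced by hom-objects in a
% symmetric monoidal closed category V (e.g. ([0,\infty], \geq) for
% cost/distance values), so each sample A is sent to its relational
% profile, with C(B, A) \in V for every other sample B.
y : \mathcal{C} \longrightarrow \mathcal{V}^{\mathcal{C}^{\mathrm{op}}},
\qquad y(A) = \mathcal{C}(-, A).
```

Under this reading, the "Indra representation" of Definition 3 is the object $y(A)$: a sample characterized entirely by its $\mathcal{V}$-valued relations to all other samples, which is why the abstract calls it a relational profile.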